UI Class and Code Writing Code
Realizing that I needed more modular control over how the user gets access to back-end variables, I split the interface code into its own PHP class and opened up some new options in the UI for the user to tweak. The new UI class, though basic by template standards, makes it easy to remember user preferences between runs and to add or remove interface elements as needed.
Currently the user has control over:
- using rank-ordered or absolute occurrence weighting for the input document
- what percent of terms activated by the input document are passed along to calculate SEP entry activations
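The two options above can be sketched in a few lines. This is a minimal illustration in Python rather than the project's PHP, and the function name, parameter names, and the exact meaning given to "rank-ordered" (the most frequent term gets the largest weight, ties broken alphabetically) are assumptions made for the example, not the actual implementation.

```python
import math


def weight_terms(counts, mode="absolute", keep_percent=100):
    """Weight input-document terms and keep only the strongest share.

    counts: dict mapping term -> occurrences in the input document.
    mode: "absolute" keeps the raw counts; "rank" replaces each count with
          a rank-based weight (assumed scheme: top term gets weight n).
    keep_percent: percent (0-100) of activated terms passed along.
    """
    # Strongest terms first; alphabetical tie-break keeps the output stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

    if mode == "rank":
        n = len(ordered)
        weighted = [(term, n - i) for i, (term, _) in enumerate(ordered)]
    elif mode == "absolute":
        weighted = ordered
    else:
        raise ValueError(f"unknown weighting mode: {mode!r}")

    # Round up so a small percentage still keeps at least one term.
    keep = math.ceil(len(weighted) * keep_percent / 100)
    return dict(weighted[:keep])
```

With counts of 7, 3, 3 and 1 for "virtue", "ethics", "duty" and "reason", rank mode at 50 percent keeps "virtue" (weight 4) and "duty" (weight 3), while absolute mode at 100 percent returns the counts unchanged.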
Inspired by the newfound ease of UI modifications, I also wrote some PHP that dynamically writes JavaScript to display the resulting SEP entry activations as a bar graph using Google’s Charts and Graphs API, which gives a better picture of the entries' relative activations.
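The "code writing code" idea looks roughly like the sketch below: server-side code serializes the activations as JSON and drops them into a JavaScript template. It is written in Python rather than PHP, and it targets the present-day Google Charts loader (`google.charts.load`, `arrayToDataTable`, `BarChart`), which may not be the API version the project used. The function name, the `chart` element id, and the sample values in the test are hypothetical.

```python
import json


def bar_chart_script(activations, element_id="chart"):
    """Return JavaScript that draws entry activations as a bar chart.

    activations: list of (entry_title, activation) pairs.
    element_id: id of the page element that will hold the chart (assumed).
    """
    # Header row first, then one row per SEP entry.
    rows = [["SEP entry", "Activation"]] + [[t, a] for t, a in activations]
    data = json.dumps(rows)  # JSON arrays are valid JavaScript literals
    target = json.dumps(element_id)
    return (
        "google.charts.load('current', {packages: ['corechart']});\n"
        "google.charts.setOnLoadCallback(function () {\n"
        f"  var data = google.visualization.arrayToDataTable({data});\n"
        f"  var el = document.getElementById({target});\n"
        "  var chart = new google.visualization.BarChart(el);\n"
        "  chart.draw(data, {legend: 'none'});\n"
        "});\n"
    )
```

Serializing with a JSON encoder rather than concatenating strings by hand keeps quotes in entry titles from breaking the generated script.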
Re-Scanning for Accuracy
Recently, I re-scanned the SEP entries in order to use a substantially more accurate regex for finding the InPhO terms within the entries. In addition, I recorded each entry's file size and word count for future normalization. Since it took so long to get scans of this scale going, it made sense to make the SEP class more robust and have the iterative scan report its progress through the set of all SEP entries. I also needed to add the ability to auto-scan non-consecutive entries, since the first attempt missed about 180 entries when the script timed out at 60 seconds (the new regex is more accurate but much slower).
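A rough sketch of such a scan, in Python rather than PHP. The actual regex is not shown in this post, so the pattern here (whole-word, case-insensitive, any whitespace allowed between the words of a term) is only one plausible way to be "more accurate". The function names, the time budget, and the in-memory `entries` dict are likewise assumptions.

```python
import re
import time


def build_term_pattern(term):
    """Whole-word, case-insensitive pattern; words may be split by any whitespace."""
    words = [re.escape(w) for w in term.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)


def scan_entry(text, terms):
    """Count term occurrences and record size stats for later normalizing."""
    occurrences = {}
    for term in terms:
        n = len(build_term_pattern(term).findall(text))
        if n:  # store only terms that actually occur
            occurrences[term] = n
    return {
        "filesize": len(text.encode("utf-8")),
        "wordcount": len(text.split()),
        "occurrences": occurrences,
    }


def scan_pending(entries, terms, already_scanned, budget_seconds=50):
    """Scan entries not yet done, in id order, stopping before a timeout.

    entries: dict mapping entry id -> entry text (ids need not be consecutive).
    already_scanned: set of ids to skip, so a rerun picks up missed entries.
    """
    start = time.monotonic()
    results = {}
    for entry_id in sorted(entries):
        if entry_id in already_scanned:
            continue
        if time.monotonic() - start > budget_seconds:
            break  # leave the rest for the next run
        results[entry_id] = scan_entry(entries[entry_id], terms)
    return results
```

Because the loop skips ids that are already scanned rather than walking a consecutive range, a second run picks up only the entries a timed-out first run missed.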
I am liking the new non-grid layout of the activation table: the sparse matrix is now represented efficiently, since only the SEP-term combos that have activations appear in the table.
Obtaining Full Titles
Sadly, there is no API to get a definitive list of SEP entries with their URLs and full titles, so I had to screen-scrape all 1170 of them. I used the SEP’s table of contents as an index, then gleaned the full title from each entry's archival/citation page, since that page is much smaller to transfer and parse than the entry itself. Now I have a DB table with titles like “18th Century British Aesthetics” rather than “British, in the 18th century.” Also, each entry now has a unique ID that I can use in future tables to allow for faster calculations, rather than matching on the entry title.
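The indexing half of that scrape can be sketched as follows, using only Python's standard library rather than PHP. The markup the parser expects (plain `<a href>` links) and the sample link in the usage note are made up for illustration and are not the SEP's real page structure. The second step, reading the full title off each citation page, is left out because it depends on that page's layout.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            title = " ".join("".join(self._text).split())
            self.links.append((self._href, title))
            self._href = None


def index_entries(toc_html):
    """Turn a table-of-contents page into rows with a unique integer id."""
    parser = LinkCollector()
    parser.feed(toc_html)
    parser.close()
    return [
        {"id": i, "url": href, "toc_title": title}
        for i, (href, title) in enumerate(parser.links, start=1)
    ]
```

Fed a list item such as `<li><a href="entries/example-one/">British, in the 18th century</a></li>`, this yields one row with id 1 and the table-of-contents title, ready to be updated once the full title has been fetched.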
Doc-occurrence-weighted Activation
Doc-occurrence-weighted activation is working, after some pretty massive restructuring of the database. The table I’m using now holds only four-tuples, [ id | idea | sep | activation ], so each cell of the previous setup (the 2D array) becomes one row in the new structure. The most efficient way I’ve found to add in by-row activation is to create a temporary table and join it with the idea/entry rows, multiplying across by the number of occurrences (the weight), then summing down within each group (the group being an SEP entry).
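The join-multiply-sum step can be sketched with an in-memory SQLite database driven from Python. The `activation` table follows the [ id | idea | sep | activation ] layout described above. The `doc_terms` temporary table, its columns, the use of SQLite, and the sample numbers in the usage note are assumptions for the example rather than the project's actual schema or database engine.

```python
import sqlite3


def entry_activations(rows, doc_weights):
    """Sum weight * activation per SEP entry.

    rows: iterable of (idea, sep, activation) tuples, one per non-empty cell.
    doc_weights: dict mapping idea -> occurrences in the input document.
    Returns (sep, total) pairs, strongest entry first.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE activation ("
        "id INTEGER PRIMARY KEY, idea TEXT, sep INTEGER, activation REAL)"
    )
    conn.executemany(
        "INSERT INTO activation (idea, sep, activation) VALUES (?, ?, ?)", rows
    )

    # Temporary table holding the input document's term weights.
    conn.execute("CREATE TEMP TABLE doc_terms (idea TEXT PRIMARY KEY, weight REAL)")
    conn.executemany("INSERT INTO doc_terms VALUES (?, ?)", doc_weights.items())

    # Join on idea, multiply across, then sum down within each SEP entry.
    result = conn.execute(
        "SELECT a.sep, SUM(a.activation * d.weight) AS total "
        "FROM activation AS a JOIN doc_terms AS d ON d.idea = a.idea "
        "GROUP BY a.sep ORDER BY total DESC, a.sep"
    ).fetchall()
    conn.close()
    return result
```

For example, with rows ("virtue", 1, 2.0), ("duty", 1, 1.0), ("virtue", 2, 5.0) and ("reason", 3, 4.0), and document weights of 3 for "virtue" and 2 for "duty", entry 2 scores 15 and entry 1 scores 8. Entry 3 drops out because none of its ideas occur in the document, which is the inner join doing the sparse filtering for free.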