CATABLOG is a web application that clusters and recommends blog posts based on writing style. It uses Python, the Readability API, Scrapy, BeautifulSoup, MurmurHash3, SciPy, NumPy, PyEnchant, SQLAlchemy, PostgreSQL, Flask, Memcached, JavaScript/jQuery, HTML, and Rickshaw.
##Scraping blogs with Blogscraper
######(blogspider.py in blogscraper/)
The Scrapy-built blogspider crawls the links for trending tags on wordpress.com/tags. It then scrapes the links to recently published posts for each tag and saves them as a CSV file.
##Building a corpus and clustering
######(build_corpus.py, calculate_feature_vector.py, utilities.py)
-
Calculating the Feature Vector
For each scraped link, the program calls the Readability API, which returns salient data (title, HTML, URL, domain, excerpt) from the page. It then extracts the text from the HTML using BeautifulSoup and calculates a feature vector based on stylistic characteristics.
-
Stylistic Characteristics
Lexical analysis is performed using regular expressions. The program calculates the average sentence length, the number of self-references (by counting first-person singular pronouns), the number of exclamation points, and the number of ellipses. Misspellings are counted by checking each token against the PyEnchant English spellchecker.
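
The counts described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the regular expressions, the pronoun list, and the function name `stylistic_features` are all assumptions. The spellchecker is passed in as an optional callable so the sketch runs without PyEnchant; with PyEnchant installed, something like `enchant.Dict("en_US").check` could be supplied.

```python
import re

# Hypothetical patterns; the project's actual regular expressions may differ.
SENTENCE_END = re.compile(r"[.!?]+")
WORD = re.compile(r"[A-Za-z']+")
ELLIPSIS = re.compile(r"\.{3,}|\u2026")
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}


def stylistic_features(text, is_word=None):
    """Return a dict of simple style counts for a block of plain text.

    is_word is an optional callable (token -> bool) standing in for a
    spellchecker; misspellings are only counted when it is provided.
    """
    words = WORD.findall(text)
    # Split on sentence-ending punctuation and ignore empty fragments.
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    features = {
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "self_references": sum(w.lower() in FIRST_PERSON for w in words),
        "exclamations": text.count("!"),
        "ellipses": len(ELLIPSIS.findall(text)),
    }
    if is_word is not None:
        features["misspellings"] = sum(not is_word(w) for w in words)
    return features
```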
-
Exploiting Zipf's law to count words
To calculate the word frequencies, the script exploits Zipf's law, which states that the frequency of any word is inversely proportional to its rank in the frequency table. Before counting, the script applies the "hashing trick": it hashes each word with MurmurHash3 and uses the hash to pick a slot (a "bucket") in a fixed-length vector. A given word always lands in the same bucket, so the few very common words dominate their buckets, while the many uncommon words are spread roughly evenly throughout the vector. It seems like hash collisions would throw this off, but since each rare word adds so little to any one bucket, they don't. I had a hard time believing this, too.
-
K-means clustering
The script uses SciPy's k-means clustering algorithm to cluster the feature vectors for all posts. Prior to clustering, it whitens the feature vectors by dividing each feature by its standard deviation across all posts (calculated with NumPy's standard deviation function).
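
As a rough sketch of this step, the snippet below whitens a tiny, made-up set of feature vectors and clusters them with `scipy.cluster.vq`. The data and the choice of two clusters are assumptions for illustration only; the project's real vectors and cluster count are not shown in this README.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Made-up feature vectors: two loose groups of three posts each.
vectors = np.array([
    [12.0, 1.0, 0.0],
    [13.0, 2.0, 0.0],
    [11.0, 1.0, 1.0],
    [25.0, 9.0, 6.0],
    [27.0, 8.0, 7.0],
    [26.0, 10.0, 5.0],
])

# Whiten: divide each feature (column) by its standard deviation.
feature_std = np.std(vectors, axis=0)
whitened = vectors / feature_std

# kmeans returns the centroids (the "codebook") and the mean distortion.
centroids, distortion = kmeans(whitened, 2)

# vq assigns each whitened vector to its nearest centroid.
labels, distances = vq(whitened, centroids)
```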
-
Write that to CSV
All of the post data (including the newly-calculated feature vectors), parent blog information, and cluster centroid data are saved to three separate CSV files in seed_data/.
##Seeding the database
######(seed.py, seed_data/, & model.py)
Every time seed.py is called, it uses SQLAlchemy to recreate a PostgreSQL database and insert the data into three tables: posts, blogs, and clusters. It is worth noting that k-means starts from randomly chosen centroids, so the clusters change each time it is run. The database must therefore be recreated after each k-means run to reflect the most current cluster data.
##Web Framework
######(app.py, model.py, calculate_feature_vector.py, utilities.py)
The web framework is built using Flask. The program uses AJAX to send the user-input URL to the server and call the route that works the backend magic. After checking the database for the URL, the server calls the Readability API and calculates a feature vector from the returned data. Each feature is normalized by dividing by that feature's standard deviation across all vectors in the database (cached in Memcached). Then the program finds the nearest centroid, which identifies the cluster of posts most similar to the user-input post. Posts from that cluster are then selected at random from the database and returned as JSON.
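
The normalize-then-match step might look something like the sketch below. It is standard-library only and makes a few assumptions: the function names are hypothetical, the distance is taken to be Euclidean, and the standard deviations and centroids are passed in directly rather than read from Memcached or the database.

```python
import math


def normalize(vector, feature_std):
    """Divide each feature by its corpus-wide standard deviation."""
    # A zero deviation means the feature never varies; leave it unchanged.
    return [x / s if s else x for x, s in zip(vector, feature_std)]


def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to vector (Euclidean)."""
    distances = [math.dist(vector, c) for c in centroids]
    return distances.index(min(distances))


# Made-up numbers for illustration only.
feature_std = [2.0, 4.0]
centroids = [[1.0, 1.0], [5.0, 2.0]]
cluster_id = nearest_centroid(normalize([9.0, 8.0], feature_std), centroids)
```

The centroids have to live in the same whitened space as the normalized vector, which is why the user's post is divided by the stored standard deviations before the comparison.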
##Displaying Data
######(main.html, myjs.js, stats.js)
After receiving the JSON data from the server, the script displays the relevant title, excerpt, and link for the post. The stylistic features for both the sample post (from the user) and the recommended post (pulled from the database) are displayed side-by-side in both a stats table and a graph (created with Rickshaw).
(Mic drop.)