
hackbrightproject's Introduction

CATABLOG (Because you judge blogs, too.)

CATABLOG is a web application that clusters and recommends blog posts based on writing style. It uses Python, the Readability API, Scrapy, BeautifulSoup, MurmurHash3, SciPy, NumPy, PyEnchant, SQLAlchemy, PostgreSQL, Flask, Memcached, JavaScript/jQuery, HTML, and Rickshaw.

##Scraping blogs with Blogscraper

######(blogspider.py in blogscraper/)

The Scrapy-built blogspider crawls the links for trending tags on wordpress.com/tags. It then scrapes the links to recently published blogs under each tag and saves them to a CSV file.

##Building a corpus and clustering

######(build_corpus.py, calculate_feature_vector.py, utilities.py)

  • Calculating the Feature Vector

    For each scraped link, the program calls the Readability API, which returns salient data (title, HTML, url, domain, excerpt) from the page. It then extracts the text from the HTML using BeautifulSoup and calculates a feature vector based on stylistic characteristics.

  • Stylistic Characteristics

    Lexical analysis is performed using regular expressions. The program calculates the average sentence length, the number of self-references (by counting first-person singular pronouns), the number of exclamation points, and the number of ellipses. Misspellings are counted by checking each token against the PyEnchant English spellchecker.

  • Exploiting Zipf's law to count words

    To calculate the word frequencies, the script exploits Zipf's law, which states that the frequency of any word is inversely proportional to its rank in the frequency table. The script uses MurmurHash3 to apply the "hashing trick" before performing a word count: each word is hashed to a bucket index in a fixed-length vector, and that bucket's count is incremented. A given word always lands in the same bucket, while the many uncommon words are spread roughly evenly throughout the vector. It seems like hash collisions would throw this off, but because a handful of common words account for most of the counts, collisions mostly involve rare words and add little noise. I had a hard time believing this, too.

  • K-means clustering

    The script uses SciPy's k-means clustering algorithm to cluster the feature vectors for all posts. Prior to clustering, it whitens the feature vectors by dividing each feature by its standard deviation across all posts (calculated with NumPy's standard deviation function).

  • Write that to CSV

    All of the post data (including the newly calculated feature vectors), the parent blog information, and the cluster centroid data are saved to three separate CSV files in seed_data/.
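
The stylistic counts and the hashed word counts described above can be sketched roughly as follows. This is an illustration rather than the project's code: the function names, the pronoun list, the regular expressions, and the 16-bucket vector size are all assumptions, and `hashlib.md5` stands in for MurmurHash3 so that the example needs only the standard library. The PyEnchant misspelling count is left out.

```python
import hashlib
import re

N_BUCKETS = 16  # assumed vector size, chosen only for this example
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}  # assumed pronoun list
WORD_RE = re.compile(r"[A-Za-z']+")


def stylistic_features(text):
    """Return [avg sentence length, self-references, exclamations, ellipses]."""
    # Treat any run of ., ! or ? as a sentence boundary.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = WORD_RE.findall(text)
    avg_sentence_length = len(words) / len(sentences) if sentences else 0.0
    self_refs = sum(1 for w in words if w.lower() in FIRST_PERSON)
    exclamations = text.count("!")
    ellipses = len(re.findall(r"\.{3,}", text))
    return [avg_sentence_length, self_refs, exclamations, ellipses]


def bucket_index(word, n_buckets=N_BUCKETS):
    """Map a word to a fixed bucket (MD5 here; the project uses MurmurHash3)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_buckets


def hashed_word_counts(text, n_buckets=N_BUCKETS):
    """Count words into a fixed-length vector; colliding words share a bucket."""
    counts = [0] * n_buckets
    for word in WORD_RE.findall(text.lower()):
        counts[bucket_index(word, n_buckets)] += 1
    return counts


def feature_vector(text):
    """Concatenate the stylistic features and the hashed word counts."""
    return stylistic_features(text) + hashed_word_counts(text)
```

For example, `stylistic_features("I love my cat! She sleeps... I nap too.")` returns `[3.0, 3, 1, 1]`: nine words over three sentences, three first-person pronouns, one exclamation point, and one ellipsis. The vector length stays fixed no matter how large the vocabulary grows, which is the point of the hashing trick.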

##Seeding the database

######(seed.py, seed_data/, & model.py)

Every time seed.py is called, it uses SQLAlchemy to recreate a PostgreSQL database and insert the data into three tables: posts, blogs, and clusters. It is worth noting that k-means starts from randomly generated centroid points, so the clusters change each time it is run. The database must therefore be recreated after each k-means run to reflect the most current cluster data.
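
Because the result depends on the starting centroids, it may help to see the clustering step in miniature. The following is a pure-Python sketch of whitening followed by k-means with randomly chosen initial centroids. It is a stand-in for illustration only (the project itself relies on SciPy and NumPy), and the function names, the `seed` and `iterations` parameters, and the sample data are assumptions.

```python
import random
import statistics


def whiten(vectors):
    """Divide each feature by its population standard deviation."""
    # Fall back to 1.0 when a feature never varies, to avoid dividing by zero.
    stds = [statistics.pstdev(column) or 1.0 for column in zip(*vectors)]
    whitened = [[x / s for x, s in zip(vector, stds)] for vector in vectors]
    return whitened, stds


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, centroids):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))


def kmeans(vectors, k, iterations=20, seed=None):
    """Plain Lloyd's algorithm starting from k randomly sampled data points."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # random start: results vary by seed
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for vector in vectors:
            groups[nearest(vector, centroids)].append(vector)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            [sum(column) / len(group) for column in zip(*group)] if group
            else centroids[i]
            for i, group in enumerate(groups)
        ]
    labels = [nearest(vector, centroids) for vector in vectors]
    return centroids, labels


# Hypothetical two-feature vectors forming two obvious groups.
vectors = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5],
           [8.0, 50.0], [8.3, 52.0], [7.9, 49.0]]
whitened, stds = whiten(vectors)
centroids, labels = kmeans(whitened, k=2, seed=0)
```

On data this cleanly separated, the two groups come out the same for any seed, but which group is labeled 0 and which is labeled 1 depends on the random start. On real post data the groupings themselves can shift between runs, which is why the tables are rebuilt after every clustering run.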

##Web Framework

######(app.py, model.py, calculate_feature_vector.py, utilities.py)

The web framework is built using Flask. The client uses AJAX to send the user-input URL to a server route that does the backend work. After checking the database for the URL, the server calls the Readability API and calculates a feature vector from the returned data. Each feature is normalized by dividing by that feature's standard deviation across all vectors in the database (cached in Memcached). The program then finds the nearest centroid, which identifies the cluster of posts most similar to the user-input post. Posts from that cluster are selected at random from the database and returned as JSON.
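
The request-time lookup can be sketched in a few lines. This is a rough illustration under stated assumptions, not the code in app.py: plain Python lists and dictionaries stand in for the Memcached entry and the database tables, and every name and value below is hypothetical.

```python
import random


def normalize(vector, stds):
    """Divide each feature by the cached standard deviation for that feature."""
    # A zero deviation means the feature never varied; leave it unscaled.
    return [x / s if s else x for x, s in zip(vector, stds)]


def nearest_centroid(vector, centroids):
    """Return the id of the centroid with the smallest squared distance."""
    def squared_distance(centroid):
        return sum((x - c) ** 2 for x, c in zip(vector, centroid))
    return min(centroids, key=lambda cid: squared_distance(centroids[cid]))


def pick_posts(cluster_id, posts_by_cluster, n=3, rng=random):
    """Choose up to n posts at random from the matching cluster."""
    posts = posts_by_cluster.get(cluster_id, [])
    return rng.sample(posts, min(n, len(posts)))


# Hypothetical stand-ins for the cached deviations and the database contents.
cached_stds = [2.0, 10.0]
centroids = {0: [0.5, 1.0], 1: [4.0, 5.0]}
posts_by_cluster = {0: ["post-a", "post-b"],
                    1: ["post-c", "post-d", "post-e", "post-f"]}

user_vector = normalize([8.2, 51.0], cached_stds)
cluster_id = nearest_centroid(user_vector, centroids)
recommendations = pick_posts(cluster_id, posts_by_cluster)
```

Here the user's vector is scaled with the same deviations used when the corpus was whitened, so it is compared against the centroids in the same units. Caching those deviations avoids recomputing them over the whole table on every request.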

##Displaying Data

######(main.html, myjs.js, stats.js)

After receiving the JSON data from the server, the script displays the relevant title, excerpt, and link for the post. The stylistic features for both the sample post (from the user) and the recommended post (pulled from the database) are displayed side-by-side in both a stats table and a graph (created with Rickshaw).


(Mic drop.)
