Code Monkey home page Code Monkey logo

thematcproject's Introduction

#Welcome to The Manually Annotated Twitter Corpus Project ###The Manually Annotated Twitter Corpus Project (a.k.a. The MATC Project) is a crowdsourced part of speech tagging game that results in a corpus of words sourced from Twitter.

The whole idea behind The MATC Project is to allow linguistic data to be annotated by people with little or no training in linguistics.

There's two parts to The MATC Project: Tres Tigres, and The TwitUrban Dictionary.

##General Information ####Tres Tigres is the crowdsourcing game that manually annotates the linguistic data.

Tres Tigres collects all tweets in English from the sample Public stream via Twitter's streaming API. Tres Tigres then checks each word in a tweet against the Wordnik corpus (which pulls from The American Heritage Dictionary, WordNet, Wiktionary, and a variety of other sources).

If a word isn't in the Wordnik corpus (and isn't a user name, emoticon, et cetera), it gets tagged with a part of speech by users in Tres Tigres.

Once a word has been tagged with a part of speech through Tres Tigres, it goes into The TwitUrban Dictionary.

####The TwitUrban Dictionary is the manually annotated corpus of the words pulled from Twitter.

It is constantly updating, and it is available for download free of charge. You can also browse it online.

##Technical Information ####The MATC Project is a web application that uses Redis, Python, Twitter's Streaming API, Wordnik's API, Celery, Flask, Jinja, JavaScript, and the D3 JavaScript library.

###Pull Tweets #####celery_client.py, twython_streaming.py

First, Celery (a queueing system for Python) pulls English tweets from the Public Twitter stream via Twitter's Streaming API. To access the Twitter Streaming API, I used the Twython module for Python.

###Clean Tweets #####celery_tasks.py, tweet_cleaner.py

Each tweet is added the Redis ToGo database. These tweets are then sent via Celery to tweet_cleaner.py, which checks each word in a tweet against a variety of criteria.

If a word is a user name, emoticon, number, meta word (such as 'RT'), non-word (such as 'haha'), url, or contains non-English letters, it is ignored.

###Check Words Against A Dictionary #####celery_tasks.py, wordnik_api.py

The words that make it through tweet_cleaner.py are then sent to wordnik_api.py. There, each word in a tweet is checked against the Wordnik corpus (which pulls from The American Heritage Dictionary, WordNet, Wiktionary, and a variety of other sources).

###Put Words & Tweets In The Database #####celery_tasks.py, model.py

Using model.py, any words that are not found in Wordnik are added to the Redis Cloud database, along with the tweet they were found in.

###Tagging Words #####app.py, model.py

The front end of The MATC Project was created using Flask (a web framework for Python) and Jinja (to communicate between Python and HTML/JavaScript).

Once a user has registered with The MATC Project through Flask, they can play Tres Tigres (the part of speech tagging game). Users are shown a random word and a tweet associated with the word.

They are also shown an assortment of sentences with different parts of speech removed. They are asked to click on the sentence that the word fits into. Based on which sentence they click, it is tagged with a different part of speech. (JavaScript is used in order to retrieve which sentence they have clicked.)

Each word must be tagged at least 5 times with a specific part of speech before it is given a final part of speech tag in the Redis Cloud database. In addition to saving the word and the part of speech tag, the tweet the word came from is saved and corresponds to the part of speech it was tagged with.

###Adding Words To The Corpus #####app.py, model.py

Once a word is given a final part of speech tag, it is added to The TwitUrban Dictionary (the corpus). The TwitUrban Dictionary is accessed online (or via downloading it). All of the tags and words are stored in the Redis Cloud database.

###Visualizing The Tagged Words #####app.py, model.py

You can also access visualizations of The TwitUrban Dictionary online. Using the D3 JavaScript library, words tagged as a certain part of speech are shown in word bubbles. The size of each word bubble is dependent upon how many tweets the word was tagged as that part of speech in.

##Click here to go to The MATC Project

thematcproject's People

Contributors

rebeccastandig avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.