dracula's Introduction

You're looking at the 2016-04-16 release of Dracula, a part-of-speech tagger optimized for Twitter. This tagger offers very competitive performance whilst only learning character embeddings and neural network weights, meaning it requires considerably less pre-processing than other techniques. This branch represents the release: its contents may change as more documentation is added, but there will be no functional changes.

Background

Part-of-speech tagging is a fundamental task in natural language processing, and it's part of figuring out the meaning of a particular word, for example whether the word heated represents an adjective ("he was involved in a heated conversation") or a past-tense verb ("the room was heated for several hours"). It's the first step towards a more complete understanding of a phrase through parsing. Tweets are particularly hard to deal with because they contain links, emojis, at-mentions, hashtags, slang, poor capitalisation, typos and bad spelling.

How the model works

Unlike most other part of speech taggers, Dracula doesn't look at words directly. Instead, it reads the characters that make up a word and then uses deep neural network techniques to figure out the right tag. Read more »
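
As a rough illustration of what "reading characters" means, the sketch below turns each word of a tweet into a fixed-length list of character ids, which is the kind of input a character-embedding layer consumes. This is not Dracula's actual pre-processing: the helper names build_char_vocab and encode_word, the padding id 0 and the max_len value are all assumptions made for the example, and the embedding lookup and the network itself are not shown.

```python
def build_char_vocab(words):
    """Map every distinct character to a positive integer id.

    Id 0 is reserved for padding (and, in this sketch, unknown characters).
    """
    chars = sorted({ch for word in words for ch in word})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode_word(word, vocab, max_len):
    """Turn a word into exactly max_len character ids, truncating or padding with 0."""
    ids = [vocab.get(ch, 0) for ch in word[:max_len]]
    return ids + [0] * (max_len - len(ids))


# Hypothetical tweet, split on whitespace only.
tweet = "@user that was a heated convo lol".split()
vocab = build_char_vocab(tweet)
encoded = [encode_word(word, vocab, max_len=8) for word in tweet]

for word, ids in zip(tweet, encoded):
    print(word, ids)
```

Because the model only ever sees character ids, misspellings and unseen words still produce a usable input, which is one reason a character-level approach suits tweets.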

Installing the model

You'll need Theano 0.7 or later. See Theano's installation page for additional details »

Training the model

Run the train.sh script to train with the default settings. You may need to modify the THEANO_FLAGS variable at the top of that script to suit your hardware configuration (by default, it assumes a single-GPU system).
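
THEANO_FLAGS is a comma-separated list of key=value settings, so adapting it usually means changing one entry and leaving the rest alone. The sketch below shows that edit on a string. The starting value device=gpu,floatX=float32 is an assumption about what a single-GPU default might look like, not a copy of what train.sh contains, and set_theano_flag is a hypothetical helper written for this example.

```python
def set_theano_flag(flags, key, value):
    """Return a THEANO_FLAGS string with key set to value, keeping the other settings."""
    pairs = [item.split("=", 1) for item in flags.split(",") if item]
    updated = [[k, value if k == key else v] for k, v in pairs]
    if key not in [k for k, _ in pairs]:
        # The key was not present, so append it as a new setting.
        updated.append([key, value])
    return ",".join("=".join(pair) for pair in updated)


# Assumed single-GPU default; check the top of train.sh for the real value.
default_flags = "device=gpu,floatX=float32"

# On a machine without a GPU, switch the device to the CPU.
cpu_flags = set_theano_flag(default_flags, "device", "cpu")
print(cpu_flags)  # device=cpu,floatX=float32
```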

Assessing the model

  1. Start the HTTP server, using THEANO_FLAGS="floatX=float32" python server.py.
  2. In another terminal, type python eval.py path/to/assessment/file.conll.

How well does the model perform?

Here's the model's performance for various character embedding sizes. This is assessed using GATE's TweetIE Evaluation Set (Data/Gate-Eval.conll).

Tag             Size  Accuracy (% tokens correct)  Accuracy (% entire sentences correct)
2016-04-16-128  128   88.69%                       20.33%
2016-04-16-64   64    87.29%                       16.10%
2016-04-16-32   32    84.98%                       11.86%
2016-04-16-16   16    74.24%                       3.39%
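
The two accuracy figures measure different things, and a small sketch makes the gap between them easier to read. It assumes that token accuracy is the share of tokens whose predicted tag matches the reference tag, and that sentence accuracy is the share of sentences in which every tag matches. That is the natural reading of the column headings, but it is not taken from eval.py; tagging_accuracy and the sample tags are hypothetical.

```python
def tagging_accuracy(gold, predicted):
    """Return (token accuracy %, sentence accuracy %).

    gold and predicted are lists of sentences; each sentence is a list of tags.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("need the same non-zero number of sentences")
    total_tokens = correct_tokens = correct_sentences = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("sentence lengths differ")
        matches = sum(g == p for g, p in zip(gold_tags, pred_tags))
        total_tokens += len(gold_tags)
        correct_tokens += matches
        # A sentence only counts if every one of its tags is right.
        correct_sentences += int(matches == len(gold_tags))
    if total_tokens == 0:
        raise ValueError("no tokens to score")
    return (100.0 * correct_tokens / total_tokens,
            100.0 * correct_sentences / len(gold))


# Made-up reference and predicted tags for two short sentences.
gold = [["PRP", "VBD", "JJ"], ["DT", "NN"]]
predicted = [["PRP", "VBD", "VBN"], ["DT", "NN"]]

token_acc, sentence_acc = tagging_accuracy(gold, predicted)
print(token_acc, sentence_acc)  # 80.0 50.0
```

Under this definition a single wrong tag makes the whole sentence count as incorrect, which is why the sentence-level figures sit so far below the token-level ones.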

Changing the embedding size

Make the following modifications:

  • server.py, in the prepare_data call on line 122, change 32 (the last argument) to the correct size.
  • lstm.py, in the train_lstm arguments on line 104, change dim_proj_chars default value to the correct size.

Licensing

All the code in this repository is distributed under the terms of LICENSE.md.

Acknowledgements, references

The code in lstm.py is a heavily modified version of Pierre Luc Carrier and Kyunghyun Cho's LSTM Networks for Sentiment Analysis tutorial.

The inspiration for using character embeddings to do this job is from C. Santos' series of papers linked below.

Finally, GATE gathered the most important corpora used for training, and provides a reference benchmark:
