entity-recognition

Intro

Framework for doing NER and other types of entity recognition in Python.

Baseline feature extraction relies on Brown clusters and typical NER features, similar to Ratinov & Roth (2009). We use CRFsuite and try to keep things modular and simple, so you're not limited to named entities: parts of speech, MWEs, temporal expressions and the whole smørrebrød of entity classes are up for grabs with this framework. Simply adjust training data and labels to taste.

Features

  • Pluggable feature generation
  • Support for Reddit/Twitter JSON formats
  • State-of-the-art Twitter NER performance out-of-the-box
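
As a rough illustration of what pluggable feature generation can look like, the sketch below composes small feature functions into per-token feature lists of the kind a CRFsuite-style tagger consumes. All names here (word_shape, shape_features, affix_features, context_features, GENERATORS, extract) are hypothetical and are not this repository's API; the feature set is only a guess at "typical NER features".

```python
import re


def word_shape(token):
    """Collapse a token into a coarse shape, e.g. 'Portman' -> 'Xx'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # Squeeze runs of the same symbol so long and short words share a shape.
    return re.sub(r"(.)\1+", r"\1", shape)


def shape_features(tokens, i):
    """Lowercased form and word shape of the current token."""
    tok = tokens[i]
    return ["lower=" + tok.lower(), "shape=" + word_shape(tok)]


def affix_features(tokens, i, n=3):
    """Character prefix and suffix of length n (hypothetical default)."""
    tok = tokens[i].lower()
    return ["prefix=" + tok[:n], "suffix=" + tok[-n:]]


def context_features(tokens, i):
    """Neighbouring tokens, with sentinels at the sentence boundaries."""
    prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
    next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    return ["prev=" + prev_tok, "next=" + next_tok]


# Plugging a generator in or out is just editing this list.
GENERATORS = [shape_features, affix_features, context_features]


def extract(tokens, generators=GENERATORS):
    """Return one feature list per token by running every generator."""
    return [
        [feat for gen in generators for feat in gen(tokens, i)]
        for i in range(len(tokens))
    ]


features = extract(["Quick", ",", "someone", "photoshop", "Natalie", "Portman", "!"])
```

Each inner list describes one token, so the result has the same length as the input sentence. Adding a new feature type then amounts to writing one more function with the same (tokens, i) signature.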

Running

To get started, run ./train_tagger.py --help

Toy data is in the top-level datasets/ directory.

Should you like to tag data with your trained model, ./run_tagger.py --help is your friend. Remember to keep the Brown clusters around!

For example, to learn a model from the Ritter NER CoNLL data, and then apply it to some Reddit JSON, try this:

$ ./train_tagger.py -f datasets/ritter.ner.conll \
  --clusters brown_paths/gha.250M-c2000.paths \
  --output ritter.socmed.crfsuite.model
$ ./run_tagger.py -f datasets/RC_2013-04.1000.json \
  -c brown_paths/gha.250M-c2000.paths \
  --model ritter.socmed.crfsuite.model \
  --json --json-text body --stdout

A top-level "entity_texts" field is added to each JSON record, containing the extracted entities. For example:

{
	"archived": true, 
	"author": "walrusboy", 
	"author_flair_css_class": null, 
	"author_flair_text": null, 
	"body": "Quick, someone photoshop Natalie Portman!",
	"controversiality": 0, 
	"created_utc": "1364774484", 
	"distinguished": null, 
	"downs": 0,
	"edited": false, 
	"entity_texts": ["Natalie Portman"],
	"gilded": 0, 
	"id": "c95zmil", 
	"link_id": "t3_1bddiw",
	"name": "t1_c95zmil",
	"parent_id": "t3_1bddiw",
	"removal_reason": null,
	"retrieved_on": 1431716826,
	"score": 1,
	"score_hidden": false,
	"subreddit": "pics",
	"subreddit_id": "t5_2qh0u",
	"ups": 1
}
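
Once the tagger has written its output, the "entity_texts" field can be consumed with a few lines of Python. The sketch below tallies entity mentions; it assumes the output holds one JSON object per line (as Reddit dumps usually do), which the README does not state, and count_entities is a hypothetical helper name.

```python
import io
import json
from collections import Counter


def count_entities(lines):
    """Tally "entity_texts" values across newline-delimited JSON records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Records without the field are treated as having no entities.
        counts.update(record.get("entity_texts", []))
    return counts


# Stand-in for the tagger's output; in practice this might be sys.stdin
# or an open file handle.
sample = io.StringIO(
    '{"body": "Quick, someone photoshop Natalie Portman!", '
    '"entity_texts": ["Natalie Portman"]}\n'
    '{"body": "no entities here"}\n'
)
counts = count_entities(sample)
print(counts.most_common(10))
```

Because the function only needs an iterable of lines, the same code works when the output of ./run_tagger.py ... --stdout is piped into a script that passes sys.stdin.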

Dependencies

At least:

  • Python 3
  • NLTK
  • pycrfsuite (PyPI package: python-crfsuite)
  • sklearn (PyPI package: scikit-learn)
  • scipy
  • numpy

Check you're using Python 3 with python -V (that's a capital V). Next, try something like:

$ sudo easy_install3 -U pip
$ sudo pip3 install numpy scipy scikit-learn python-crfsuite nltk

Then go for two cups of tea / one brief fika, after troubleshooting errors. If you get super stuck, sometimes it helps to try your distribution's Python 3 packages for numpy and scipy, and then upgrade them with something like:

$ sudo pip3 install -U numpy
$ sudo pip3 install -U scipy

Hints and tips

If you use Brown clusters (and we recommend them!), this system expects cluster paths in binary branch format - à la wcluster - as opposed to base-10 paths, such as those produced by JCLUSTER. If you're not sure how many Brown clusters to use, check out our 3D interactive guide to tuning Brown clustering.
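
To make the binary branch format concrete, here is a minimal sketch of loading a paths file and turning a word's bit string into prefix features. It assumes the tab-separated layout that wcluster writes (bit-string path, word, count per line). The helper names are hypothetical, and the prefix lengths 4, 6, 10 and 20 are borrowed from Ratinov & Roth (2009) rather than taken from this code base.

```python
import io


def load_brown_paths(lines):
    """Map each word to its binary cluster path, e.g. 'portman' -> '1101001'."""
    paths = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        path, word = fields[0], fields[1]
        if set(path) - {"0", "1"}:
            # Base-10 cluster IDs would end up here.
            raise ValueError("not a binary branch path: %r" % path)
        paths[word] = path
    return paths


def brown_prefix_features(word, paths, lengths=(4, 6, 10, 20)):
    """Emit one feature per path prefix; unknown words get no features."""
    path = paths.get(word)
    if path is None:
        return []
    # Slicing past the end of a short path simply yields the whole path.
    return ["brown%d=%s" % (n, path[:n]) for n in lengths]


# Tiny made-up stand-in for a real paths file.
sample = io.StringIO("0010\tthe\t120\n0011\ta\t80\n1101001\tportman\t3\n")
paths = load_brown_paths(sample)
print(brown_prefix_features("portman", paths))
```

Prefixes of different lengths act as clusters of different granularity, which is why the binary branch format matters: a base-10 cluster ID has no meaningful prefixes to slice.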

Reference

If you use this work, please cite our paper:

Leon Derczynski, Isabelle Augenstein, Kalina Bontcheva (2015)
USFD: Twitter NER with Drift Compensation and Linked Data
Proceedings of the ACL Workshop on Noisy User-generated Text (W-NUT)

Tools under active development until at least 2017 as part of the PHEME project: www.pheme.eu
