Expander

A small module to expand common contractions in the English language.

This is the expander module. Its main feature is the function expand_contractions in expand.py, which uses an object of the StanfordPOSTagger class from nltk to POS-tag input sentences and chooses the expansion based on the resulting tags.
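To make the idea of tag-driven expansion concrete, here is a minimal sketch. It is not the logic in expand.py: the function names and the single rule used are assumptions for illustration, and the input is hand-tagged with Penn Treebank tags so that no tagger is needed. It also assumes the clitic "'s" arrives as its own token.

```python
# Hypothetical illustration only: the names and the rule below are
# assumptions, not the logic implemented in expand.py.

def expand_apostrophe_s(tagged, index):
    """Pick an expansion for the "'s" token at tagged[index].

    tagged is a list of (token, Penn Treebank tag) pairs.
    """
    token, tag = tagged[index]
    if tag == "POS":
        # Possessive marker ("John's book"): not a contraction, keep it.
        return token
    # Use the tag of the following token to choose between "is" and "has".
    next_tag = tagged[index + 1][1] if index + 1 < len(tagged) else None
    if next_tag == "VBN":
        # A past participle follows: "he's gone" -> "he has gone".
        return "has"
    return "is"


def expand_sentence(tagged):
    """Return the sentence with every "'s" token expanded where possible."""
    words = []
    for i, (token, _tag) in enumerate(tagged):
        if token != "'s":
            words.append(token)
            continue
        expansion = expand_apostrophe_s(tagged, i)
        if expansion == token and words:
            # Possessive: glue the marker back onto the previous word.
            words[-1] += token
        else:
            words.append(expansion)
    return " ".join(words)


print(expand_sentence([("He", "PRP"), ("'s", "VBZ"), ("gone", "VBN"), ("home", "NN")]))
# He has gone home
print(expand_sentence([("John", "NNP"), ("'s", "POS"), ("book", "NN")]))
# John's book
```

The rule is deliberately naive: a passive such as "he's loved" also has a past participle after the clitic but should expand to "is", which is the kind of ambiguity a real implementation has to handle.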

To run the code you need to download a Stanford POS-tagger model. The basic English tagger is available on the official homepage. To enhance the POS-tagging, the named-entity recognition model from the Stanford CoreNLP set is used as well; it can be downloaded from its own official homepage. Extract the zip file(s) into the subdirectory stanford_models of this module. Alternatively, you can supply the path to the model in the call to load_stanford, as documented in the program.
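After extracting the archives, a quick check that the expected kinds of files are present can save a confusing failure later. The helper below is a hypothetical sketch and not part of this module. It assumes that the Stanford distributions ship .jar archives, POS models ending in .tagger, and NER classifiers ending in .ser.gz, and it only lists what it finds.

```python
from pathlib import Path


def find_model_files(model_dir="stanford_models"):
    """Collect candidate Stanford files below model_dir (hypothetical helper).

    Searches recursively and returns a dict of sorted path lists. Raises
    FileNotFoundError if the directory does not exist, which usually means
    the archives have not been extracted yet.
    """
    root = Path(model_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"model directory not found: {root}")
    return {
        "jars": sorted(root.rglob("*.jar")),
        "taggers": sorted(root.rglob("*.tagger")),
        "classifiers": sorted(root.rglob("*.ser.gz")),
    }
```

Calling find_model_files() from the module directory and seeing empty lists is a hint that the archives were extracted somewhere else. In that case the path has to be passed to load_stanford instead.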

To see example output, run expand.py directly with python expand.py. That script also shows how to use this module, and you can supply your own model directory to its call of load_stanford().

Assumptions being made

  • Apostrophes in the middle of a lexical item (i.e. usually a sequence of characters surrounded by spaces and/or delimited by punctuation) are signs of a contraction and are treated as such.
  • The input sentence is grammatically correct.
  • The only replacements that need to be made are those defined in contractions.yaml.
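The first and third assumptions can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: the CONTRACTIONS dict stands in for whatever contractions.yaml contains (its real structure is not shown here), the function names are hypothetical, and only the ASCII apostrophe is handled.

```python
import re

# Stand-in for the mapping loaded from contractions.yaml; the real file's
# structure and entries are an unknown here, so this dict is an assumption.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "we're": "we are",
}

# An apostrophe with letters on both sides, i.e. inside a lexical item.
INNER_APOSTROPHE = re.compile(r"[A-Za-z]+'[A-Za-z]+")


def find_contraction_candidates(sentence):
    """Return every lexical item that has an apostrophe in its middle."""
    return INNER_APOSTROPHE.findall(sentence)


def expand_known(sentence):
    """Expand candidates found in CONTRACTIONS and leave the rest alone."""
    def replace(match):
        word = match.group(0)
        expansion = CONTRACTIONS.get(word.lower())
        if expansion is None:
            return word
        if word[0].isupper():
            # Preserve sentence-initial capitalisation.
            expansion = expansion[0].upper() + expansion[1:]
        return expansion
    return INNER_APOSTROPHE.sub(replace, sentence)


print(find_contraction_candidates("We're sure they don't mind 'quotes'."))
# ["We're", "don't"]
print(expand_known("We're sure they don't mind."))
# We are sure they do not mind.
```

Quotation marks at the edge of a word are not matched because the pattern requires a letter on both sides of the apostrophe. Items such as "o'clock" are matched as candidates but stay unchanged because they have no entry in the mapping.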

Notable drawbacks

  • POS-taggers are not perfect. The module does its best to choose the correct expansion, but errors will happen, especially since a contraction can have more than one valid expansion.

TODOs

  • Include a test case when expand.py is run directly, asserting that the right results come out.
  • Write a function that divides a list at certain characters (the apostrophe in our case), and refactor the code with it.
  • Combine the he-she-it cases into one central <HSI> case in order to get more test cases and thus improve accuracy (may not be sensible, as the cases for he/she and it are different).
  • Adapt load_stanford in utils.py to use the new CoreNLPPOSTagger and CoreNLPNERTagger instead of the deprecated ones.
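As a starting point for the list-splitting item, here is one possible reading of it: split a token list into chunks, opening a new chunk at every token that contains an apostrophe. The name split_at and the choice to keep the separator token at the start of its chunk are assumptions, not something the existing code prescribes.

```python
def split_at(items, is_separator):
    """Split items into chunks, starting a new chunk at each separator.

    The separator item is kept as the first element of the chunk it opens.
    Empty input yields an empty list.
    """
    chunks = [[]]
    for item in items:
        if is_separator(item) and chunks[-1]:
            # Close the current chunk and open a new one.
            chunks.append([])
        chunks[-1].append(item)
    return [chunk for chunk in chunks if chunk]


tokens = ["I", "'ve", "seen", "that", "she", "'s", "here"]
print(split_at(tokens, lambda t: "'" in t))
# [['I'], ["'ve", 'seen', 'that', 'she'], ["'s", 'here']]
```

Taking a predicate rather than a fixed character keeps the helper usable for other delimiters if the refactoring needs them.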

Notes about Licensing

This software is distributed under the Apache 2.0 license, mainly because NLTK is as well and it seems to allow enough freedom. Note, though, that the Stanford models are not distributed under that license. They are full GPL and restrict any kind of proprietary use. If you intend to use this software in your own proprietary software, either get in contact with the people at Stanford or rewrite the program to use models included in NLTK (if you do that, I would also be grateful for a pull request with the changes). I have just generally found the Stanford models to be more reliable.

Contributors

yannick-couzinie
