
truecaser's Introduction

Language-Independent Truecaser for Python

This is an implementation of a trainable Truecaser for Python.

A truecaser restores the most probable casing to a sentence whose casing has been lost. Typical use cases are sentences in all-upper case, all-lower case, or title case.

A model for English is provided, achieving an accuracy of 98.39% on a small test set of random sentences from Wikipedia.

Model

The model is inspired by the paper tRuEcasIng by Lucian Vlad Lita et al., with some simplifications.

The model applies a greedy strategy. For each token, from left to right, it computes the following score for every casing variant:

score(w_0) = P(w_0) * P(w_0 | w_{-1}) * P(w_0 | w_1) * P(w_0 | w_{-1}, w_1)

where w_0 is the word at the current position, w_{-1} the previous word, and w_1 the next word in the sentence.

All observed casings for w_0 are tested and the casing with the highest score is selected.

The probabilities P(...) are computed based on a large training corpus.
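To make the scoring concrete, the greedy step boils down to something like the sketch below. The function names, the "_"-joined n-gram keys, and the additive smoothing are illustrative assumptions, not the repository's actual API:

    def score_candidate(candidate, prev_word, next_word, uni, backward_bi, forward_bi, tri):
        """Score one casing variant of the current token.

        uni, backward_bi, forward_bi and tri are assumed to be frequency counts
        (plain dicts or nltk.FreqDist objects) collected from the training corpus.
        """
        eps = 1.0  # additive smoothing so unseen n-grams do not zero the product
        total = float(sum(uni.values())) or 1.0

        p_uni = (uni.get(candidate, 0) + eps) / (total + eps)
        p_prev = ((backward_bi.get(prev_word + "_" + candidate, 0) + eps) /
                  (uni.get(prev_word, 0) + eps)) if prev_word else 1.0
        p_next = ((forward_bi.get(candidate + "_" + next_word, 0) + eps) /
                  (uni.get(next_word, 0) + eps)) if next_word else 1.0
        p_tri = ((tri.get(prev_word + "_" + candidate + "_" + next_word, 0) + eps) /
                 (backward_bi.get(prev_word + "_" + candidate, 0) + eps)) if prev_word and next_word else 1.0

        return p_uni * p_prev * p_next * p_tri

    def best_casing(casing_variants, prev_word, next_word, uni, backward_bi, forward_bi, tri):
        """Return the observed casing variant with the highest score."""
        return max(casing_variants,
                   key=lambda c: score_candidate(c, prev_word, next_word,
                                                 uni, backward_bi, forward_bi, tri))

Tokens never seen in training have no casing variants to test; handling that fallback (for example, keeping the token as-is) is outside this sketch.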

Requirements

The code was written for Python 2.7 and requires NLTK 3.0.

From NLTK, it uses the tokenization functions to split sentences into tokens, plus FreqDist(). These parts of the code can easily be replaced, so that the code can be used without installing NLTK.
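For reference, the NLTK pieces in question amount to roughly the following (a hedged sketch; it assumes the punkt tokenizer models have been downloaded, e.g. via nltk.download('punkt')):

    import nltk
    from nltk import FreqDist

    # Split raw text into sentences and tokens, then count token frequencies.
    text = "This is a sentence. This is another one."
    sentences = nltk.sent_tokenize(text)
    tokens = [tok for sent in sentences for tok in nltk.word_tokenize(sent)]
    uni_dist = FreqDist(tokens)
    print(uni_dist.most_common(3))

Swapping in str.split() and collections.Counter would remove the NLTK dependency, at the cost of cruder tokenization.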

Run the Code

You need a distributions.obj that contains information on the frequencies of unigrams, bigrams, and trigrams.

A pre-trained distributions.obj for English is provided in the release section (name: english_distribitions.obj.zip; unzip it before use).

A large distributions.obj for English is also provided in the download section on GitHub.

You can train your own distributions.obj using the TrainTruecaser.py script.

To run the model on one or more text files, provide distributions.obj to the PredictTruecase.py script. If no text files are passed as arguments, input is read from STDIN.
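If you prefer to use the distributions directly from your own code rather than through the script, distributions.obj is a pickled file. The sketch below is an assumption about its layout (the number, order, and names of the pickled objects); check TrainTruecaser.py for the actual structure:

    import pickle  # cPickle under Python 2.7

    with open('distributions.obj', 'rb') as f:
        # Assumed layout: one pickle.dump() call per distribution, read back in order.
        uni_dist = pickle.load(f)
        backward_bi_dist = pickle.load(f)
        forward_bi_dist = pickle.load(f)
        trigram_dist = pickle.load(f)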

To evaluate a model, have a look at EvaluateTruecaser.py.
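EvaluateTruecaser.py is the reference; purely as an illustration, token-level accuracy can be computed by lowercasing correctly cased sentences, truecasing them again, and counting agreement (truecase_fn below is a hypothetical callable, not a function from this repository):

    def evaluate(truecase_fn, gold_sentences):
        """Token-level accuracy of a truecaser.

        gold_sentences: iterable of lists of correctly cased tokens.
        truecase_fn: assumed to map a list of lowercased tokens to recased tokens.
        """
        correct = total = 0
        for gold_tokens in gold_sentences:
            predicted = truecase_fn([t.lower() for t in gold_tokens])
            correct += sum(1 for p, g in zip(predicted, gold_tokens) if p == g)
            total += len(gold_tokens)
        return correct / float(total) if total else 0.0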

Train your own Truecaser

You can retrain the Truecaser easily. Replace the train.txt file with a large sample of sentences, change TrainTruecaser.py so that it uses this train.txt, and run the script. You can also use it for languages other than English, such as German, Spanish, or French.
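As a rough sketch of what such a training run collects (illustrative only; the real logic lives in TrainTruecaser.py and TrainFunctions.py and may differ in details such as the n-gram key format or the set of pickled objects):

    import pickle

    import nltk
    from nltk import FreqDist

    uni, backward_bi, forward_bi, tri = FreqDist(), FreqDist(), FreqDist(), FreqDist()

    with open('train.txt') as f:            # one correctly cased sentence per line
        for line in f:
            tokens = nltk.word_tokenize(line.strip())
            for i, tok in enumerate(tokens):
                uni[tok] += 1
                if i > 0:
                    backward_bi[tokens[i - 1] + '_' + tok] += 1
                if i < len(tokens) - 1:
                    forward_bi[tok + '_' + tokens[i + 1]] += 1
                if 0 < i < len(tokens) - 1:
                    tri[tokens[i - 1] + '_' + tok + '_' + tokens[i + 1]] += 1

    with open('distributions.obj', 'wb') as out:
        for dist in (uni, backward_bi, forward_bi, tri):
            pickle.dump(dist, out)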

Disclaimer

Sorry that this is kind of hacky code without documentation. For my research I was looking for a truecaser, but I couldn't find any working implementation, so I implemented this script in a quick-and-dirty manner. It works quite well (at least for me).

I think the code is simple enough that anyone can use and adapt it, and maybe it is handy for someone. The principle behind it is really simple, but as mentioned above, it achieves good results.

Hint: The casing of company and product names is the hardest. Train the system on a large and recent dataset to achieve the best results (e.g. on a recent dump of Wikipedia).

truecaser's People

Contributors

hoonkai, nreimers


truecaser's Issues

What data was used to train the included model?

Thanks for this! We have found it useful, and would (potentially) like to cite it. When discussing it in our paper, it would be helpful to know what data the included model was trained on. Could you say what data you used?

Academic Citing

Hi,

First of all, thank you for sharing your work with the community!

Can you provide details (maybe a paper) that can be used to cite your work in another scientific publication? I am working on my master's thesis and would like to give proper credit to your work. :)

Cheers,
Rosko

It's very slow

It's very slow. How can I improve the performance of this application?

Bug in trigrams loop

There is probably a bug in TrainFunctions.py.
In the loop that creates trigrams, the assignment to the word variable (word = sentence[tokenIdx]) is missing, so this loop uses the last value from the previous loop.
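A hedged reconstruction of what the report describes (variable names follow the issue text, not necessarily the exact code in TrainFunctions.py):

    from collections import Counter

    sentence = ["The", "White", "House", "said"]   # example token list
    trigramDist = Counter()

    for tokenIdx in range(len(sentence)):
        word = sentence[tokenIdx]  # this assignment was reportedly missing in the
                                   # trigram loop, so 'word' kept the value from the
                                   # previous loop instead of the current token
        if 0 < tokenIdx < len(sentence) - 1:
            prevWord = sentence[tokenIdx - 1]
            nextWord = sentence[tokenIdx + 1]
            trigramDist[prevWord + '_' + word + '_' + nextWord] += 1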

on CRF truecasing

Hi, thanks for making this repo. I want to ask whether a CRF model is better than this n-gram model, because I saw that Stanford NLP implemented a CRF truecaser.
