
babzel's Introduction

Babzel - OpenNLP models generator

Babzel is a library for computing Apache OpenNLP models from Universal Dependencies annotated language files. OpenNLP supports natural language processing with tools such as a sentence detector, tokenizer, part-of-speech tagger, lemmatizer, etc. However, models for many languages are not easily available. This project addresses that shortcoming and makes it possible to train and evaluate models for any language supported by the Universal Dependencies treebank. Models can also be verified interactively.

Pre-trained models

Pre-trained models for various languages are automatically computed and available here

Training and evaluation process

All files (inputs and generated models) are processed in the root directory (or its subdirectories), which is $HOME/.cache/babzel

The Universal Dependencies treebank consists of CoNLL-U (.conllu) files for many languages. A CoNLL-U file contains annotated sentences in a particular language; the annotations describe the tokens, and the part of speech and lemma of every token. Possible POS tags are listed here
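
For illustration, an annotated sentence in a CoNLL-U file looks like the fragment below (an invented example, not taken from any actual treebank). Each token line carries ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); model training mainly relies on FORM, UPOS and LEMMA.

# text = The dogs barked.
1   The     the    DET    DT   Definite=Def|PronType=Art          2   det     _   _
2   dogs    dog    NOUN   NNS  Number=Plur                        3   nsubj   _   _
3   barked  bark   VERB   VBD  Mood=Ind|Tense=Past|VerbForm=Fin   0   root    _   SpaceAfter=No
4   .       .      PUNCT  .    _                                  3   punct   _   _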

Text used for training is normalized. Two normalizers are supported so far:

  • simple (text is only lowercased, using Locale.ENGLISH)
  • lucene (text is normalized using Apache Lucene analyzers: lowercased, Unicode-normalized, and folded to ASCII equivalents where possible; such models may be used in Apache Solr or Elasticsearch, which support OpenNLP analyzers)
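
Roughly, the simple normalizer amounts to the following (a sketch only; Babzel's actual class and method names may differ):

import java.util.Locale;

final class SimpleNormalizerSketch {
    // Sketch of the "simple" normalizer: lowercasing only, using Locale.ENGLISH.
    // The "lucene" normalizer additionally Unicode-normalizes the text and folds
    // it to ASCII via the Lucene/ICU analysis chain shown further below.
    static String normalizeText(String text) {
        return text.toLowerCase(Locale.ENGLISH);
    }
}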

The process of training and evaluation of models roughly consists of the following steps:

  • Download the Universal Dependencies treebank (only if it does not exist locally or a newer version is available).
  • Unpack the CoNLL-U files for a particular language.
  • For every supported trainer (sentence-detector, tokenizer, pos-tagger, lemmatizer) perform the further steps. Training is performed only if a model does not exist or a newer CoNLL-U file is available.
    • Read the sentences from the CoNLL-U file.
    • Optional: try to fix the data (for example for the 'de' language).
    • Convert the sentences to a sample stream for the particular trainer (token sample stream, lemma sample stream, etc.).
    • Train and evaluate the model. Several available algorithms are tried and evaluated; only the best one is chosen (see the sketch after this list).
    • Save the model and the evaluation report.
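
For orientation, training and evaluating a single model type (here the tokenizer) with the plain OpenNLP API looks roughly like the sketch below. The sample streams are assumed to be already converted from CoNLL-U sentences, and Babzel's actual implementation differs in details (it loops over several algorithms and keeps the best-scoring model).

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenizerEvaluator;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.TrainingParameters;

class TokenizerTrainingSketch {

    // trainSamples/evalSamples: TokenSample streams converted from CoNLL-U sentences
    // (Babzel performs that conversion internally; it is omitted here).
    static TokenizerModel trainAndEvaluate(String lang,
                                           ObjectStream<TokenSample> trainSamples,
                                           ObjectStream<TokenSample> evalSamples,
                                           String algorithm) throws Exception {
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.ALGORITHM_PARAM, algorithm); // e.g. "MAXENT_QN"

        // Train a tokenizer model for the given language code.
        TokenizerModel model = TokenizerME.train(
                trainSamples, new TokenizerFactory(lang, null, true, null), params);

        // Evaluate the model (F-measure) on the held-out samples.
        TokenizerEvaluator evaluator = new TokenizerEvaluator(new TokenizerME(model));
        evaluator.evaluate(evalSamples);
        System.out.println(algorithm + " F-measure: " + evaluator.getFMeasure().getFMeasure());
        return model;
    }
}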

Usage

Download the appropriate fat jar from Packages (babzel-tools-simple or babzel-tools-lucene)

Training and evaluation:

  java -jar <jar-file-name> train <two-letter-language-code> <optional-working-directory>

If the working directory is not specified, models are trained and evaluated in the directory $HOME/.cache/babzel

Interactive verification:

  java -jar <jar-file-name> verify <two-letter-language-code> <optional-working-directory>

In this mode the user is prompted to enter some text. The text is divided into sentences, tokenized, lemmatized, etc., and the results are printed on screen.
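
For reference, the trained models can also be used directly from Java with the plain OpenNLP API (this is essentially what the issue "How to use the model from Java" below asks about). The following is only a sketch: the model file names and their location under $HOME/.cache/babzel are assumptions, and the input text should be normalized the same way as the training data was.

import java.nio.file.Files;
import java.nio.file.Path;

import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class VerifySketch {
    public static void main(String[] args) throws Exception {
        // Directory holding the downloaded models (adjust to the actual location).
        Path dir = Path.of(System.getProperty("user.home"), ".cache", "babzel");

        // Load the four models produced by the trainer (file names are assumptions).
        var sentenceDetector = new SentenceDetectorME(
                new SentenceModel(Files.newInputStream(dir.resolve("en-sentence-detector.onlpm"))));
        var tokenizer = new TokenizerME(
                new TokenizerModel(Files.newInputStream(dir.resolve("en-tokenizer.onlpm"))));
        var posTagger = new POSTaggerME(
                new POSModel(Files.newInputStream(dir.resolve("en-pos-tagger.onlpm"))));
        var lemmatizer = new LemmatizerME(
                new LemmatizerModel(Files.newInputStream(dir.resolve("en-lemmatizer.onlpm"))));

        // The text must be normalized the same way as the training data (lowercased etc.).
        String text = "the quick brown foxes jumped over the lazy dogs. they were not amused.";

        for (String sentence : sentenceDetector.sentDetect(text)) {
            String[] tokens = tokenizer.tokenize(sentence);
            String[] tags = posTagger.tag(tokens);
            String[] lemmas = lemmatizer.lemmatize(tokens, tags);
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "\t" + tags[i] + "\t" + lemmas[i]);
            }
        }
    }
}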

Lucene analysis chain

Lucene models are trained assuming a specific chain of analyzer filters. This chain must be preserved for the models to work properly.

<analyzer>
  <!-- lowercase -->
  <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc_cf" mode="compose"/>
  <!-- fold to ascii, drop accents, expand ligatures etc -->
  <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfc" mode="decompose"/>
  <charFilter class="solr.MappingCharFilterFactory" mapping="fold-to-ascii.txt"/>
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\p{InCombiningDiacriticalMarks}" replacement=""/>
  <!-- tokenizer -->
  <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="xy-sentence-detector.onlpm" tokenizerModel="xy-tokenizer.onlpm"/>
  <!-- part of speech tagging -->
  <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="xy-pos-tagger.onlpm"/>
  <!-- lemmatizer -->
  <filter class="solr.OpenNLPLemmatizerFilterFactory" lemmatizerModel="xy-lemmatizer.onlpm"/>
  <!-- other necessary filters TypeTokenFilterFactory, TypeAsPayloadFilterFactory etc -->
</analyzer>

fold-to-ascii.txt is a mapping file which normalizes additional characters that are not handled by the ICU normalizer.
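
The mapping file uses the standard MappingCharFilterFactory syntax. A hypothetical fragment (illustrative entries only, not the actual file shipped with the models) might look like this:

# extra character mappings not covered by the ICU normalizer
"œ" => "oe"
"Œ" => "OE"
"ß" => "ss"
"đ" => "d"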

Evaluation results

Several models were trained for different language types (using the lucene text normalizer). The results of their evaluation are presented below. The available sentences are divided into training and evaluation sets: every 10th sentence goes to the evaluation set, so 90% of the sentences are used for training (the split is sketched after the column legend below).

  • language: language code + language name
  • training sentences: approximate number of training sentences
  • models: training algorithm (algorithm with the best evaluation score) + score (ranging from 0.0 to 1.0)
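
The train/evaluation split mentioned above amounts to something like this sketch (Babzel's actual splitting code may differ):

import java.util.List;

final class SplitSketch {
    // Every 10th sentence goes to the evaluation set, the remaining 90% to the training set.
    static <T> void split(List<T> sentences, List<T> training, List<T> evaluation) {
        for (int i = 0; i < sentences.size(); i++) {
            if (i % 10 == 9) {
                evaluation.add(sentences.get(i));
            } else {
                training.add(sentences.get(i));
            }
        }
    }
}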

Alphabetic latin languages

These languages use the alphabetic Latin script with native diacritical characters. Words are separated by whitespace.

language      training sentences   sentence-detector   tokenizer        pos-tagger    lemmatizer
de (german)   65k                  MAXENT_QN 0.72      MAXENT_QN 0.99   MAXENT 0.94   MAXENT 0.96
en (english)  35k                  MAXENT_QN 0.74      MAXENT_QN 0.99   MAXENT 0.94   MAXENT 0.98
es (spanish)  30k                  MAXENT_QN 0.96      MAXENT_QN 0.99   MAXENT 0.94   MAXENT 0.98
fr (french)   25k                  MAXENT_QN 0.92      MAXENT_QN 0.99   MAXENT 0.95   MAXENT 0.98
pl (polish)   36k                  MAXENT 0.95         MAXENT_QN 0.99   MAXENT 0.96   MAXENT 0.96

Models generated for these types of languages are of good quality; such languages are supported very well. The sentence-detection score is relatively low because many sentences in the sample were not properly terminated.

Alphabetic non-latin languages

These languages use alphabetic non-Latin scripts (Greek, Cyrillic). Words are separated by whitespace.

language        training sentences   sentence-detector   tokenizer         pos-tagger        lemmatizer
el (greek)      2k                   MAXENT_QN 0.90      MAXENT_QN 0.99    PERCEPTRON 0.95   MAXENT 0.95
ru (russian)    99k                  MAXENT_QN 0.93      MAXENT_QN 0.99    MAXENT 0.96       MAXENT 0.97
uk (ukrainian)  6k                   MAXENT 0.91         PERCEPTRON 0.99   MAXENT 0.94       MAXENT 0.94

These types of languages are also well supported.

Abjad languages

These languages are commonly written from right to left and vowels are often omitted. Words are separated by whitespace.

language     training sentences   sentence-detector   tokenizer        pos-tagger    lemmatizer
ar (arabic)  7k                   MAXENT_QN 0.71      MAXENT_QN 0.97   MAXENT 0.93   serialization exception
he (hebrew)  8k                   PERCEPTRON 0.94     MAXENT_QN 0.92   MAXENT 0.94   MAXENT 0.96

Evaluation scores are a bit lower for these languages. Lemmatizer model training for Arabic fails: the computed model cannot be serialized, and the reason is not yet known.

East Asian languages

These languages use logographic/syllabic scripts. Words are usually not separated by whitespace, which causes problems with tokenization.

language       training sentences   sentence-detector   tokenizer         pos-tagger        lemmatizer
ja (japanese)  16k                  MAXENT_QN 0.96      NAIVEBAYES 0.79   PERCEPTRON 0.96   MAXENT 0.97
ko (korean)    30k                  MAXENT_QN 0.94      MAXENT_QN 0.99    MAXENT 0.89       MAXENT 0.90
zh (chinese)   9k                   MAXENT 0.98         MAXENT_QN 0.91    PERCEPTRON 0.94   MAXENT 0.99

The results are less impressive. Tokenization quality for Japanese is quite low; the tokenizer does not seem to handle such languages well. If the tokenizer had a dictionary of "known words", the trained model might be better. POS tagging and lemmatization for Korean are also not good. Chinese tokenization quality is higher than for Japanese: Chinese words are shorter than Japanese words, so the surrounding context is shorter, which may explain why the tokenizer segments Chinese words better than Japanese words.


babzel's Issues

Implement common (single) text preprocessor instead of two different preprocessors

Two sets of models are produced for each language (simple and lucene). They differ only in how the raw text is pre-processed. This is unnecessarily complicated and requires twice as much time to train the models.

Implement a common text pre-processor which can be used from Java code/Solr/Elasticsearch. It may consist of:

  • lowercasing
  • normalization
  • accent drop / ligature expansion
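
A minimal sketch of such a common pre-processor, using only the JDK (the class name is hypothetical; a real implementation might reuse Lucene/ICU components instead):

import java.text.Normalizer;
import java.util.Locale;

// Hypothetical common pre-processor: lowercase, Unicode-normalize, drop combining marks.
final class CommonTextPreprocessor {

    static String preprocess(String text) {
        String lowered = text.toLowerCase(Locale.ROOT);
        // NFKD decomposition separates base characters from their accents.
        String decomposed = Normalizer.normalize(lowered, Normalizer.Form.NFKD);
        // Drop the combining diacritical marks, keeping the base characters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(preprocess("Portes-Fenêtres")); // prints "portes-fenetres"
    }
}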

Tokenization and lemmatization behavior of French hyphenated compound words

I found that "tee-shirts" in French does not get lemmatized to "tee-shirt". I thought this might be happening just because "tee-shirt" is a loan word. To verify this hypothesis, I tried a few French compounds listed in https://www.colanguage.com/french-compound-nouns

But none of these native French compound words get lemmatized properly either.

compound word in plural form   expected lemma   actual lemma
portes-fenêtres                porte-fenêtre    portes-fenêtre
grands-mères                   grand-mère       grands-mère
chefs-d'oeuvre                 chef-d'œuvre     chefs-d'œuvre

As seen, the first noun element stays in the plural form and only the second noun gets lemmatized.

I also noticed that "chefs-d'oeuvre" gets tokenized strangely. This compound word is tokenized into two tokens "chefs-d'" and "oeuvre".

Inconsistent tokenization of "T-shirt".

I am seeing inconsistent tokenization for "T-shirt", and probably for any hyphen-separated words.
Below, each input line is followed by the tokenizer's output.

$ bin/opennlp TokenizerME en-tokenizer.onlpm
Loading Tokenizer model ... done (0.123s)
yellow t-shirt
yellow t - shirt
yellow t-shirt!
yellow t- shirt !

"t-shirt" is sometimes tokenized as three tokens and sometimes two tokens.

I tested with OpenNLP 1.9.1 and 2.1.0 and they show the same results.

The model distributed in opennlp.apache.org, opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin, doesn't have this problem. "t-shirt" is always tokenized as "t", "-", "shirt". I tested other words like "ice-cream", "truck-driver", "sign-in", "warm-up", and they are consistent in that "-" is a separate token.

Is there any cure for this?

How to use the model from Java (not from Solr) ?

I'd like to use the pre-trained models for lemmatization of text from a Java program.
I see that I'd need a complex analysis chain if I were using Solr, and I'd like to skip that unnecessary complexity.
I took a look at the tool's command implementation and this is what I am guessing:

  1. Make a TextNormalizer like: var normalizer = new LuceneTextNormalizer();
  2. Run the input text through the normalizer: String normalized = normalizer.normalizeText(inputText);
  3. Then use the sentenceDetector, posTagger and lemmatizer as used in ToolsCmd#verifyModels.

Am I understanding this correctly?

Another question is, do you have these artifacts (JARs) published in a public repo? Or do I have to compile and install them locally?
