abzif / babzel

OpenNLP models generator for various languages

License: Apache License 2.0

Java 99.33% HTML 0.67%

lemmatizer natural-language-processing nlp opennlp part-of-speech-tagger pos-tagger tokenizer sentence-detector language-model

babzel's Issues

How to use the model from Java (not from Solr) ?

I'd like to use the pre-trained models to lemmatize text from a Java program.
I see that I'd need a complex analysis chain if I were using Solr, and I'd like to skip that unnecessary complexity.
I took a look at the tool's command implementation, and this is my guess:

Make a TextNormalizer, e.g.: var normalizer = new LuceneTextNormalizer();
Run the input text through the normalizer: String normalized = normalizer.normalizeText(inputText);
Then use sentenceDetector, posTagger and lemmatizer as they are used in ToolsCmd#verifyModels.

Am I understanding this correctly?

Another question is, do you have these artifacts (JARs) published in a public repo? Or do I have to compile and install them locally?

Is the Chinese model for simplified han script or traditional?

This is just a question.
Is the model listed as Chinese meant for Simplified Chinese script, Traditional Chinese script, or does it support both?

Add "end of sentence" characters specific for a language

End-of-sentence (EOS) characters are not correct for Far East languages, and the sentence detector is of low quality there.
Add a class which computes the appropriate EOS characters for a language (or falls back to the default set !.?)
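
A minimal sketch of what such a class could look like. The class name EosCharacters, the method forLanguage and the override table are hypothetical and not taken from the project. The full-width punctuation listed for "zh" and "ja" is only an illustrative starting point, not a vetted set.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: returns end-of-sentence characters for a language code. */
final class EosCharacters {
    // Default set named in the issue.
    private static final char[] DEFAULT = {'.', '!', '?'};

    // Illustrative overrides only (assumption, not the project's data):
    // U+3002 ideographic full stop, U+FF01 full-width '!', U+FF1F full-width '?'.
    private static final Map<String, char[]> OVERRIDES = Map.of(
        "zh", new char[] {'\u3002', '\uFF01', '\uFF1F'},
        "ja", new char[] {'\u3002', '\uFF01', '\uFF1F'}
    );

    /** Looks up the override for the language, falling back to the default set. */
    static char[] forLanguage(String languageCode) {
        String key = languageCode == null ? "" : languageCode.toLowerCase(Locale.ROOT);
        // clone() so callers cannot modify the shared arrays
        return OVERRIDES.getOrDefault(key, DEFAULT).clone();
    }
}
```

For example, EosCharacters.forLanguage("en") falls back to the default set, while EosCharacters.forLanguage("zh") returns the full-width characters above.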

Tokenization and lemmatization behavior of French hyphened compound words

I found that "tee-shirts" in French doesn't get lemmatized to "tee-shirt". I thought this might be happening just because "tee-shirt" is a loanword. To test this hypothesis, I tried a few French compounds listed at https://www.colanguage.com/french-compound-nouns

But none of these native French compound words gets lemmatized properly either.

compound word in plural form	expected lemma	actual lemma
portes-fenêtres	porte-fenêtre	portes-fenêtre
grands-mères	grand-mère	grands-mère
chefs-d'oeuvre	chef-d'œuvre	chefs-d'œuvre

As seen, the first noun element stays in the plural form and only the second noun gets lemmatized.

I also noticed that "chefs-d'oeuvre" gets tokenized strangely. This compound word is tokenized into two tokens "chefs-d'" and "oeuvre".

Implement common (single) text preprocessor instead of two different preprocessors

Two sets of models are produced for each language (simple and lucene). They differ only in how raw text is pre-processed. This is unnecessarily complicated and requires twice as much time to train the models.

Implement a common text pre-processor which can be used from Java code/Solr/Elastic. It may consist of:

lowercasing
normalization
accent drop / ligature expansion
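
The three steps above could be sketched with the JDK alone, roughly as follows. The class name SimpleTextPreprocessor and the small ligature table are assumptions made for illustration. This is not the project's implementation, and the exact order and set of rules would need to be decided by the maintainers.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical common pre-processor: lowercase, normalize, drop accents, expand ligatures. */
final class SimpleTextPreprocessor {
    static String preprocess(String text) {
        // 1. Lowercasing, independent of the JVM's default locale.
        String s = text.toLowerCase(Locale.ROOT);
        // 2. Expand letters that Unicode normalization does not decompose
        //    (illustrative subset: oe ligature, ae ligature, sharp s).
        s = s.replace("\u0153", "oe").replace("\u00E6", "ae").replace("\u00DF", "ss");
        // 3. Compatibility decomposition: splits accented letters into
        //    base letter + combining mark and expands ligatures such as U+FB01 (fi).
        s = Normalizer.normalize(s, Normalizer.Form.NFKD);
        // 4. Accent drop: remove the combining marks left by step 3.
        return s.replaceAll("\\p{M}+", "");
    }
}
```

With these rules, "Chefs-d'Œuvre" becomes "chefs-d'oeuvre" and "Fenêtre" becomes "fenetre".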

Inconsistent tokenization of "T-shirt".

I am seeing inconsistent tokenization for "T-shirt", and probably for any hyphen-separated word.
Below, each input line is followed by the tokenizer's output on the next line.

$ bin/opennlp TokenizerME en-tokenizer.onlpm
Loading Tokenizer model ... done (0.123s)
yellow t-shirt
yellow t - shirt
yellow t-shirt!
yellow t- shirt !

"t-shirt" is sometimes tokenized as three tokens and sometimes two tokens.

I tested with OpenNLP 1.9.1 and 2.1.0, and they show the same results.

The model distributed on opennlp.apache.org, opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin, doesn't have this problem. "t-shirt" is always tokenized as "t", "-", "shirt". I tested other words like "ice-cream", "truck-driver", "sign-in", "warm-up", and they are consistent in that "-" is always a separate token.

Is there a fix for this?
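
One possible workaround, sketched here as an assumption rather than a fix to the model itself, is to post-process the tokenizer output so that every hyphen becomes its own token. The class HyphenSplitter below is hypothetical. Whether this splitting matches what the downstream POS tagger and lemmatizer models were trained on is not known, so it would need to be checked.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical post-processor: makes "-" a separate token in every case. */
final class HyphenSplitter {
    static List<String> splitHyphens(String[] tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            int start = 0;
            for (int i = 0; i < token.length(); i++) {
                if (token.charAt(i) == '-') {
                    // Emit the text before the hyphen, if any, then the hyphen itself.
                    if (i > start) {
                        out.add(token.substring(start, i));
                    }
                    out.add("-");
                    start = i + 1;
                }
            }
            // Emit whatever follows the last hyphen (or the whole token if none).
            if (start < token.length()) {
                out.add(token.substring(start));
            }
        }
        return out;
    }
}
```

Applied to the two outputs shown above, both {"yellow", "t", "-", "shirt"} and {"yellow", "t-", "shirt", "!"} end up with "t", "-", "shirt" as three separate tokens.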
