
babzel's Issues

How to use the model from Java (not from Solr) ?

I'd like to use the pre-trained models to lemmatize text from a Java program.
As far as I can see, using Solr would require a complex analysis chain, and I'd like to avoid that unnecessary complexity.
I took a look at the tool's command implementation and this is what I am guessing:

  1. Make a TextNormalizer like: var normalizer = new LuceneTextNormalizer();
  2. Run the input text through the normalizer: String normalized = normalizer.normalizeText(inputText);
  3. Then use sentenceDetector, posTagger and lemmatizer as used in ToolsCmd#verifyModels.

Am I understanding this correctly?

Another question: are these artifacts (JARs) published in a public repo, or do I have to compile and install them locally?

Tokenization and lemmatization behavior of French hyphenated compound words

I found that "tee-shirts" in French doesn't get lemmatized to "tee-shirt". I thought this might be happening just because "tee-shirt" is a loan word. To test this hypothesis, I tried a few French compounds listed in https://www.colanguage.com/french-compound-nouns

But none of these native French compound words get lemmatized properly either.

compound word in plural form   expected lemma   actual lemma
portes-fenêtres                porte-fenêtre    portes-fenêtre
grands-mères                   grand-mère       grands-mère
chefs-d'oeuvre                 chef-d'œuvre     chefs-d'œuvre

As seen, the first noun element stays in the plural form and only the second noun gets lemmatized.
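
To make the expectation concrete, here is a minimal sketch of the behavior I had in mind: each hyphen-separated element is lemmatized on its own and the results are joined back with hyphens. The class name, the method name and the toy lookup (built only from the words in the table above) are hypothetical stand-ins; a real lemmatizer model would take the place of the lookup, and this is not how babzel or OpenNLP is implemented.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

final class CompoundLemmaSketch {

    // Lemmatizes every hyphen-separated element independently, then rejoins them.
    // The lemmatizer argument is a placeholder for whatever maps one word to its lemma.
    static String lemmatizeCompound(String compound, UnaryOperator<String> lemmatizer) {
        return Arrays.stream(compound.split("-"))
                .map(lemmatizer)
                .collect(Collectors.joining("-"));
    }

    public static void main(String[] args) {
        // Toy lookup (assumption): only the plural forms from the table above.
        // Accented letters are written as Unicode escapes to avoid source-encoding issues.
        Map<String, String> toyLemmas = Map.of(
                "portes", "porte",
                "fen\u00eatres", "fen\u00eatre",
                "grands", "grand",
                "m\u00e8res", "m\u00e8re",
                "chefs", "chef");
        // Unknown words are returned unchanged.
        UnaryOperator<String> lookup = word -> toyLemmas.getOrDefault(word, word);

        System.out.println(lemmatizeCompound("portes-fen\u00eatres", lookup));
        System.out.println(lemmatizeCompound("grands-m\u00e8res", lookup));
    }
}
```

With this toy lookup, both elements of "portes-fenêtres" and "grands-mères" come back in the singular, which is the result listed in the "expected lemma" column.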

I also noticed that "chefs-d'oeuvre" gets tokenized strangely. This compound word is tokenized into two tokens "chefs-d'" and "oeuvre".

Implement common (single) text preprocessor instead of two different preprocessors

Two sets of models are produced for each language (simple and lucene). They differ only in how raw text is pre-processed. This is unnecessarily complicated and doubles the time needed to train the models.

Implement a common text pre-processor that can be used from Java code, Solr and Elasticsearch. It may consist of:

  • lowercasing
  • normalization
  • accent drop / ligature expansion
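
A minimal sketch of what such a common pre-processor could look like, using only the JDK. The class name, the order of the steps and the choice of NFKD are assumptions made for illustration, not the existing simple or lucene behavior. The ligatures "œ" and "æ" are replaced explicitly because Unicode normalization does not decompose them.

```java
import java.text.Normalizer;
import java.util.Locale;

final class CommonTextPreprocessor {

    // Applies the three steps listed above; the ordering is an assumption.
    static String preprocess(String text) {
        // 1. Lowercasing, locale-independent so results do not vary by machine.
        String s = text.toLowerCase(Locale.ROOT);

        // 2. Ligature expansion for letters that normalization leaves untouched
        //    (oe ligature U+0153 and ae ligature U+00E6).
        s = s.replace("\u0153", "oe").replace("\u00e6", "ae");

        // 3. Normalization: NFKD splits accented letters into a base letter plus
        //    combining marks and expands compatibility ligatures.
        s = Normalizer.normalize(s, Normalizer.Form.NFKD);

        // 4. Accent drop: remove the combining marks produced by step 3.
        return s.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Inputs use Unicode escapes to avoid source-encoding issues.
        System.out.println(preprocess("Chefs-d'\u0152uvre"));   // chefs-d'oeuvre
        System.out.println(preprocess("Portes-Fen\u00eatres")); // portes-fenetres
    }
}
```

Because the method is a plain static function with no Lucene dependency, the same logic could in principle be wrapped in a Solr or Elasticsearch filter and also called directly from Java code, so only one set of models would need training.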

Inconsistent tokenization of "T-shirt".

I am seeing inconsistent tokenization for "T-shirt" and probably for any hyphen-separated word.
Below, each input line is followed by the tokenizer's output on the next line:

$ bin/opennlp TokenizerME en-tokenizer.onlpm
Loading Tokenizer model ... done (0.123s)
yellow t-shirt
yellow t - shirt
yellow t-shirt!
yellow t- shirt !

"t-shirt" is sometimes tokenized as three tokens ("t", "-", "shirt") and sometimes as two ("t-", "shirt").

I tested with OpenNLP 1.9.1 and 2.1.0 and they show the same results.

The model distributed on opennlp.apache.org, opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin, doesn't have this problem. "t-shirt" is always tokenized as "t", "-", "shirt". I tested other words like "ice-cream", "truck-driver", "sign-in", "warm-up", and they are consistent in that "-" is a separate token.

Is there a fix for this?
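
One possible stopgap, independent of the model, would be a post-processing pass over the tokenizer output that always makes "-" its own token, matching the behavior described for opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin. The sketch below is only an illustration; the class and method names are hypothetical and are not part of OpenNLP or babzel.

```java
import java.util.ArrayList;
import java.util.List;

final class HyphenTokenSplitter {

    // Re-splits every token around '-' so the hyphen is always a separate token.
    static List<String> splitHyphens(String[] tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            StringBuilder current = new StringBuilder();
            for (char c : token.toCharArray()) {
                if (c == '-') {
                    // Flush the text collected before the hyphen, then emit the hyphen.
                    if (current.length() > 0) {
                        out.add(current.toString());
                        current.setLength(0);
                    }
                    out.add("-");
                } else {
                    current.append(c);
                }
            }
            // Flush whatever follows the last hyphen (or the whole token if none).
            if (current.length() > 0) {
                out.add(current.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The two tokenizer outputs reported above.
        System.out.println(splitHyphens(new String[] {"yellow", "t", "-", "shirt"}));
        System.out.println(splitHyphens(new String[] {"yellow", "t-", "shirt", "!"}));
    }
}
```

After this pass both reported outputs contain the same "t", "-", "shirt" sequence. It only hides the inconsistency downstream and does not explain why the model splits the two inputs differently.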
