Comments (8)
If you want to use the models from Java, then the algorithm is exactly as in the function ToolsCmd::verifyModels. The simplest way would be to copy and adapt this function into your program.
If you plan to analyze "proper" Portuguese text, then SimpleNormalizer is enough, because it simply lowercases the text:
text = text.toLowerCase(Locale.ENGLISH);
By "proper" text I mean text with native diacritic characters.
The Solr models are intended for text search (Solr supports OpenNLP analyzers), where people often type text without native letters, just plain ASCII. So my recommendation is to start with the "simple" ones.
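As a sketch of what the "simple" path amounts to from Java (the class and method names here are illustrative, not the actual SimpleNormalizer API; only the lowercasing line is quoted from the project):

```java
import java.util.Locale;

public class SimpleLower {
    // Same operation the comment above quotes from SimpleNormalizer:
    // plain lowercasing, no diacritic removal, no Unicode-form handling.
    public static String normalize(String text) {
        return text.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // Accents survive, only the case changes.
        System.out.println(normalize("Coração"));  // coração
    }
}
```

Note that diacritics pass through untouched, which is why this path only suits "proper" text with native characters.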
This project is designed to automatically build models (every month), not to serve as a library, so no artifacts are published or planned to be published.
from babzel.
Oh, I see, I realized there are two sets of models made with different normalizers.
I'm guessing the model made with the Lucene normalizer disregards diacritics.
My only worry is that the CharFilter chain for Lucene does Unicode normalization while the simple one doesn't. Don't I need to run Unicode normalization over the input text even when I use the simple-normalizer model?
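The worry above comes down to canonical equivalence: two strings can render identically yet compare unequal unless both are brought to the same normalization form first. A minimal demonstration with the JDK's java.text.Normalizer (the helper method name is mine, not anything from babzel):

```java
import java.text.Normalizer;

public class FormsDemo {
    // True when two strings are canonically equivalent, i.e. equal after
    // normalizing both to the same form (NFC here).
    public static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00F4";  // "ô" as one code point (NFC)
        String decomposed  = "o\u0302"; // "o" + combining circumflex (NFD)

        // Both render as "ô", but String.equals compares code points:
        System.out.println(precomposed.equals(decomposed));            // false
        System.out.println(canonicallyEqual(precomposed, decomposed)); // true
    }
}
```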
The simple models are trained without normalization. That may be a good idea for a future improvement, but for now try the analysis without normalization.
I think the process is clear for now. Closing the issue.
Oh, sorry, I didn't respond quickly enough. Although no Unicode normalization is done when building the simple models, is it possible that the files used in training are mostly in one Unicode form? My guess is that most European-language documents were originally written in the ISO-8859 family and would therefore use precomposed characters.
All files used for training are encoded in UTF-8. So if your text is in UTF-8 encoding, then no other normalization should be necessary. Otherwise it must somehow be re-encoded from ISO (or whatever) to UTF-8.
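Re-encoding from an ISO-8859 charset to UTF-8 can be sketched as below (the class and method names are illustrative; for files you would typically wrap a stream in an InputStreamReader with the source charset instead):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Recode {
    // Decode bytes with their real source charset, then re-encode as UTF-8.
    public static byte[] toUtf8(byte[] source, Charset sourceCharset) {
        return new String(source, sourceCharset).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "ação" as it would be stored in an ISO-8859-1 file: a ç ã o
        byte[] latin1 = {0x61, (byte) 0xE7, (byte) 0xE3, 0x6F};
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // ação
        System.out.println(utf8.length); // 6: ç and ã take two bytes each
    }
}
```

Note that this changes only the encoding, not the normalization form: ISO-8859-1 bytes decode to precomposed characters, so the result stays in NFC.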
I am afraid there is confusion between the encoding (UTF-8, UTF-16, UTF-32) and the Unicode normalization forms. The normalization forms are a concept independent of the encoding.
"ô", for example, can be expressed in two different ways: as the single code point U+00F4 (precomposed), or as the two code points U+006F (for "o") followed by U+0302 (the combining circumflex). The first two CharFilters in your recommended Solr analyzer normalize the input text to one normalization form, NFD, I believe. Because that is a decomposed form, the last CharFilter, PatternReplaceCharFilter, can remove the diacritics. (Otherwise, it could not make "o" from "ô".) If your training texts were originally encoded in ISO-8859-x, then they are likely in NFC or NFKC, since ISO-8859-x does not have combining characters (I think). If you don't use the decomposed form, removing diacritics would be difficult.
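The decompose-then-strip idea described above can be sketched with the JDK alone: normalize to NFD so accents become separate combining marks, then delete everything in Unicode category M. This mirrors the effect of the Solr CharFilter chain, but it is not the babzel or Lucene code itself; the class and method names are mine.

```java
import java.text.Normalizer;

public class StripDiacritics {
    // Decompose to NFD, then drop combining marks (Unicode category M).
    // Works for both precomposed and already-decomposed input.
    public static String strip(String text) {
        String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);
        return nfd.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Precomposed U+00F4 and decomposed o + U+0302 both reduce to "o".
        System.out.println(strip("\u00F4"));   // o
        System.out.println(strip("o\u0302"));  // o
        System.out.println(strip("coração"));  // coracao
    }
}
```

Without the NFD step, the regex would find nothing to remove in precomposed text, which is exactly the point made above.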
Here is a lengthy official description of Unicode Normalization forms:
ICUNormalizer2CharFilterFactory
Indeed, it is complicated. That's why I plan to implement a common normalizer usable from both Java and Solr. I don't know how the training files are composed. If it is crucial for your text, then maybe use the Lucene normalizer; it can be used from Java code with some tweaking. Alternatively, you can download the code, implement a normalizer which suits your needs, and train the Portuguese models yourself.
Related Issues (6)
- Add "end of sentence" characters specific for a language
- Implement common (single) text preprocessor instead of two different preprocessors
- Inconsistent tokenization of "T-shirt"
- Is the Chinese model for simplified han script or traditional?
- Tokenization and lemmatization behavior of French hyphened compound words