
Comments (8)

abzif avatar abzif commented on August 10, 2024

If you want to use the models from Java, then the algorithm is exactly like in the function ToolsCmd::verifyModels. The simplest way would be to copy and modify this function in your program.

If you plan to analyze "proper" Portuguese text, then SimpleNormalizer is enough, because it simply lowercases the text:
text = text.toLowerCase(Locale.ENGLISH);
By "proper" text I mean text with native diacritic characters.
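To illustrate (a standalone sketch, not babzel code), locale-insensitive lowercasing only changes case and leaves diacritics intact:

```java
import java.util.Locale;

public class LowercaseDemo {
    public static void main(String[] args) {
        // Lowercasing with Locale.ENGLISH preserves the diacritics;
        // "Ã" becomes "ã", nothing is stripped or decomposed
        String text = "SÃO PAULO";
        System.out.println(text.toLowerCase(Locale.ENGLISH)); // são paulo
    }
}
```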

The Solr models are intended for use during text search (Solr supports OpenNLP analyzers), where people often type text without native letters, just plain ASCII. So my recommendation is to start with the "simple" one.

This project is designed to automatically compute models (every month), not to serve as a library, so no artifacts are published or planned to be published.

from babzel.

tkurosaka avatar tkurosaka commented on August 10, 2024

Oh, I see. I realized there are two sets of models, made with different normalizers.
I'm guessing the model made with the Lucene normalizer disregards diacritics.
My only worry is that the CharFilter chain for Lucene does Unicode normalization while the simple one doesn't. Don't I need to run Unicode normalization over the input text even when I use the simple-normalization model?

from babzel.

abzif avatar abzif commented on August 10, 2024

The simple models are trained without normalization. Adding it might be a good improvement, but for now try analysis without normalization.

from babzel.

abzif avatar abzif commented on August 10, 2024

I think the process is clear for now. Closing the issue.

from babzel.

tkurosaka avatar tkurosaka commented on August 10, 2024

Oh, sorry, I didn't respond quickly enough. Although no Unicode normalization is done when building the simple models, is it possible that the files used in training are mostly in one Unicode form? My guess is that most European-language documents were originally written in the ISO-8859 family, so they would use precomposed characters.

from babzel.

abzif avatar abzif commented on August 10, 2024

All files used for training are encoded in UTF-8. So if your text is in UTF-8 encoding, then no other normalization should be necessary. Otherwise it must somehow be re-encoded from ISO-8859 (or whatever) to UTF-8.
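For illustration only (not babzel code), re-encoding legacy ISO-8859-1 bytes to UTF-8 in Java is a matter of decoding with the right charset and re-encoding:

```java
import java.nio.charset.StandardCharsets;

public class RecodeDemo {
    public static void main(String[] args) {
        // "ó" encoded in ISO-8859-1 is the single byte 0xF3
        byte[] latin1 = {(byte) 0xF3};
        // Decode with the source charset, then re-encode as UTF-8
        String s = new String(latin1, StandardCharsets.ISO_8859_1);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 2 (UTF-8 uses two bytes: 0xC3 0xB3)
    }
}
```

Note that this changes only the byte encoding, not the normalization form, which is the distinction raised in the next comment.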

from babzel.

tkurosaka avatar tkurosaka commented on August 10, 2024

I am afraid there is confusion between the encoding (UTF-8, UTF-16, UTF-32) and the Unicode normalization forms. The normalization forms are a concept independent of the encoding.
"ô", for example, can be expressed in two different ways: with the single code point U+00F4 (precomposed), or with the two code points U+006F (for "o") followed by U+0302 (the combining circumflex). The first two CharFilters in your recommended Solr analyzer normalize the input text to one normalization form, NFD, I believe. Because the text is then in decomposed form, the last CharFilter, PatternReplaceCharFilter, can remove the diacritics. (Otherwise, it could not make "o" from "ô".) If your training text was originally encoded in ISO-8859-x, then it is likely to be in NFC or NFKC, since ISO-8859-x does not have combining characters (I think). If you don't use the decomposed form, removing diacritics would be difficult.
Here is a lengthy official description of the Unicode normalization forms:
ICUNormalizer2CharFilterFactory
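The decompose-then-strip idea above can be sketched with the JDK's own java.text.Normalizer (a minimal illustration, not the Solr CharFilter chain itself):

```java
import java.text.Normalizer;

public class NfdDemo {
    public static void main(String[] args) {
        String precomposed = "\u00F4"; // "ô" as a single code point (NFC)
        // NFD splits it into 'o' (U+006F) + combining circumflex (U+0302)
        String decomposed = Normalizer.normalize(precomposed, Normalizer.Form.NFD);
        System.out.println(decomposed.length()); // 2
        // Once decomposed, combining marks can be stripped with a regex,
        // which is roughly what PatternReplaceCharFilter does in the analyzer
        String stripped = decomposed.replaceAll("\\p{M}", "");
        System.out.println(stripped); // o
    }
}
```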

from babzel.

abzif avatar abzif commented on August 10, 2024

Indeed, it is complicated. That's why I plan to implement a common normalizer usable from both Java and Solr. I don't know how the training files are composed. If it is crucial for your text, then maybe use the Lucene normalizer; it can be used from Java code with some tweaking. Alternatively, you can download the code, implement a normalizer that suits your needs, and train the Portuguese models yourself.

from babzel.
