Comments (8)
If you want to use the models from Java, then the algorithm is exactly as in the function ToolsCmd::verifyModels. The simplest way would be to copy and adapt this function into your program.
If you plan to analyze "proper" Portuguese text, then SimpleNormalizer is enough, because it simply lowercases the text:
text = text.toLowerCase(Locale.ENGLISH);
By "proper" text I mean text with native diacritic characters.
The Solr models are intended for text search (Solr supports OpenNLP analyzers), where people often type text without native letters, just plain ASCII. So my recommendation is to start with the "simple" ones.
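As a sketch of what the "simple" path amounts to from Java (the class and method names here are illustrative, not the actual SimpleNormalizer API; only the lowercasing line is quoted from the project):

```java
import java.util.Locale;

public class SimpleLower {
    // Same operation the comment above quotes from SimpleNormalizer:
    // plain lowercasing, no diacritic removal, no Unicode-form handling.
    public static String normalize(String text) {
        return text.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // Accents survive, only the case changes.
        System.out.println(normalize("Coração"));  // coração
    }
}
```

Note that diacritics pass through untouched, which is why this path only suits "proper" text with native characters.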
This project is designed to automatically build models (every month), not to serve as a library, so no artifacts are published or planned to be published.
from babzel.
Oh, I see, I realized there are two sets of models made with different normalizers.
I'm guessing the model made with the Lucene normalizer disregards diacritics.
My only worry is that the CharFilter chain for Lucene does Unicode normalization while the simple one doesn't. Don't I need to run Unicode normalization over the input text even when I use the simple-normalizer model?
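The worry above comes down to canonical equivalence: two strings can render identically yet compare unequal unless both are brought to the same normalization form first. A minimal demonstration with the JDK's java.text.Normalizer (the helper method name is mine, not anything from babzel):

```java
import java.text.Normalizer;

public class FormsDemo {
    // True when two strings are canonically equivalent, i.e. equal after
    // normalizing both to the same form (NFC here).
    public static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00F4";  // "ô" as one code point (NFC)
        String decomposed  = "o\u0302"; // "o" + combining circumflex (NFD)

        // Both render as "ô", but String.equals compares code points:
        System.out.println(precomposed.equals(decomposed));            // false
        System.out.println(canonicallyEqual(precomposed, decomposed)); // true
    }
}
```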
The simple models are trained without normalization. That may be a good idea for a future improvement, but for now try the analysis without normalization.
I think the process is clear for now. Closing the issue.
Oh, sorry, I didn't respond quickly enough. Although no Unicode normalization is done when building the simple models, is it possible that the files used in training are mostly in one Unicode form? My guess is that most European-language documents were originally written in the ISO-8859 family and would therefore use precomposed characters.
All files used for training are encoded in UTF-8. So if your text is in UTF-8 encoding, then no other normalization should be necessary. Otherwise it must somehow be re-encoded from ISO (or whatever) to UTF-8.
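Re-encoding from an ISO-8859 charset to UTF-8 can be sketched as below (the class and method names are illustrative; for files you would typically wrap a stream in an InputStreamReader with the source charset instead):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Recode {
    // Decode bytes with their real source charset, then re-encode as UTF-8.
    public static byte[] toUtf8(byte[] source, Charset sourceCharset) {
        return new String(source, sourceCharset).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "ação" as it would be stored in an ISO-8859-1 file: a ç ã o
        byte[] latin1 = {0x61, (byte) 0xE7, (byte) 0xE3, 0x6F};
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // ação
        System.out.println(utf8.length); // 6: ç and ã take two bytes each
    }
}
```

Note that this changes only the encoding, not the normalization form: ISO-8859-1 bytes decode to precomposed characters, so the result stays in NFC.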
I am afraid there is confusion between the encoding (UTF-8, UTF-16, UTF-32) and the Unicode normalization forms. The normalization forms are a concept independent of the encoding.
"ô", for example, can be expressed in two different ways: as the single code point U+00F4 (precomposed), or as the two code points U+006F (for "o") followed by U+0302 (the combining circumflex). The first two CharFilters in your recommended Solr analyzer normalize the input text to one normalization form, NFD, I believe. Because that is a decomposed form, the last CharFilter, PatternReplaceCharFilter, can remove the diacritics. (Otherwise, it could not make "o" from "ô".) If your training texts were originally encoded in ISO-8859-x, then they are likely in NFC or NFKC, since ISO-8859-x does not have combining characters (I think). If you don't use the decomposed form, removing diacritics would be difficult.
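The decompose-then-strip idea described above can be sketched with the JDK alone: normalize to NFD so accents become separate combining marks, then delete everything in Unicode category M. This mirrors the effect of the Solr CharFilter chain, but it is not the babzel or Lucene code itself; the class and method names are mine.

```java
import java.text.Normalizer;

public class StripDiacritics {
    // Decompose to NFD, then drop combining marks (Unicode category M).
    // Works for both precomposed and already-decomposed input.
    public static String strip(String text) {
        String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);
        return nfd.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Precomposed U+00F4 and decomposed o + U+0302 both reduce to "o".
        System.out.println(strip("\u00F4"));   // o
        System.out.println(strip("o\u0302"));  // o
        System.out.println(strip("coração"));  // coracao
    }
}
```

Without the NFD step, the regex would find nothing to remove in precomposed text, which is exactly the point made above.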
Here is a lengthy official description of Unicode Normalization forms:
ICUNormalizer2CharFilterFactory
Indeed, it is complicated. That's why I plan to implement a common normalizer usable from both Java and Solr. I don't know how the training files are composed. If it is crucial for your text, then maybe use the Lucene normalizer; it can be used from Java code with some tweaking. Alternatively, you can download the code, implement a normalizer which suits your needs, and train the Portuguese models yourself.
Related Issues (6)
- Add "end of sentence" characters specific for a language
- Implement common (single) text preprocessor instead of two different preprocessors
- Inconsistent tokenization of "T-shirt"
- Is the Chinese model for simplified han script or traditional?
- Tokenization and lemmatization behavior of French hyphened compound words