Comments (3)
Yes, this is not good. Thanks for the submission.
You can solve it by changing the defaults.
Visually, the text looks as if it has far more Han than Latin characters: the Han blocks are dominant next to the small Latin letters with their few strokes. However, the counts are less clear-cut:
Han = 105
Latin = 45 (!)
(plus 31 spaces, which are script agnostic)
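Counts like these can be reproduced with the JDK alone. The following is only a minimal sketch, not the library's own code: the class name `ScriptCounts` is made up for illustration, and it assumes that "script agnostic" corresponds to the Unicode scripts COMMON, INHERITED and UNKNOWN.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper, not part of language-detector.
final class ScriptCounts {

    /** Counts code points per Unicode script, skipping script-agnostic ones. */
    static Map<Character.UnicodeScript, Integer> count(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            // Assumption: spaces, digits and punctuation (COMMON) and
            // combining marks (INHERITED) do not count towards any script.
            if (script != Character.UnicodeScript.COMMON
                    && script != Character.UnicodeScript.INHERITED
                    && script != Character.UnicodeScript.UNKNOWN) {
                counts.merge(script, 1, Integer::sum);
            }
        });
        return counts;
    }

    public static void main(String[] args) {
        // \u4e16\u754c are two Han characters ("world").
        System.out.println(count("Hello \u4e16\u754c, this is a test"));
    }
}
```

Running it on the text from the report should show whether your own counts agree with the figures above.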
The project's front page explains in the section "How to Use" that text should be run through the text object factory:
TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();
TextObject textObject = textObjectFactory.forText("my text");
However, because the example text has such a high proportion of secondary-script content, that content is not removed. The removal is performed by the RemoveMinorityScriptsTextFilter, and the example above needs a threshold >= 0.43 (45 Latin / 105 Han ≈ 0.43); the default of 0.3 is not enough.
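To make the arithmetic concrete, here is a small sketch of the threshold check. It is an assumption about how the filter decides, not a copy of it: it assumes a script is dropped when its count divided by the dominant script's count falls below the threshold, which matches the 0.43 figure quoted here. The class and method names are hypothetical.

```java
// Hypothetical illustration, not the library's implementation.
final class MinorityScriptCheck {

    /**
     * Returns true if a script with scriptCount characters would be treated
     * as a minority script relative to the dominant script.
     * Assumption: the comparison is a strict "ratio < threshold".
     */
    static boolean isMinority(int scriptCount, int dominantCount, double threshold) {
        double ratio = (double) scriptCount / dominantCount;
        return ratio < threshold;
    }

    public static void main(String[] args) {
        int latin = 45;
        int han = 105;
        // The ratio is 45 / 105, roughly 0.4286.
        System.out.println("default 0.3 removes Latin: " + isMinority(latin, han, 0.3));
        System.out.println("0.5 removes Latin: " + isMinority(latin, han, 0.5));
    }
}
```

Under that assumption, the default of 0.3 leaves the Latin text in place, while any threshold from 0.43 upwards removes it.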
Something like this will do for you:
TextObjectFactory textObjectFactory = new TextObjectFactoryBuilder()
        .withTextFilter(UrlTextFilter.getInstance())
        .withTextFilter(RemoveMinorityScriptsTextFilter.forThreshold(0.5))
        .build();
While this may solve your problem for the moment, I don't yet see why detection does not work out of the box, or why it considers English more dominant. It needs debugging, so thanks for reporting it.
from language-detector.
For what it's worth, with current git master the example returns the list:
[DetectedLanguage[vi:0.8571398047407237], DetectedLanguage[zh:0.14285646388985637]]
The Vietnamese profile contains a mix of Latin and Han, which seems to fit the provided text better than the Chinese profile does. I believe the profiles need a cleanup.
Interesting to learn about the secondary script filtering, @fabiankessler.
In #63 I found that forcing short-text language detection also helped a lot for similar Chinese/English mixes of more than 50 characters. However, I'm not sure what the exact implications of forcing short-text detection are.
At the moment I'm not even sure whether the short-text and long-text algorithms use the same language models. The algorithms themselves, at least, differ quite a lot.