Example text:

Mixed language strange results (one is clearly more dominant). about language-detector HOT 3 OPEN

optimaize commented on July 17, 2024

Mixed language strange results (one is clearly more dominant).

from language-detector.

Comments (3)

fabiankessler commented on July 17, 2024

Yes, this is not good; thanks for the submission.
You can work around it by changing the defaults.

Visually, the text looks as if it has far more Han than Latin characters. The Han blocks are so dominant compared to the small Latin letters with their few strokes. However, the counts are less clear-cut:

Han = 105
Latin = 45 (!)
(plus 31 spaces, which are script agnostic)
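Counts like these can be reproduced with the JDK alone. The following is a minimal sketch, assuming that "script" means the `Character.UnicodeScript` of each code point and that spaces, digits and punctuation (script COMMON) are left out. The class name `ScriptCounts` and the sample string are made up for illustration; the sample is not the text from this issue, and this is not how language-detector itself counts.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of language-detector.
final class ScriptCounts {

    // Scripts that carry no language signal: spaces, digits and punctuation
    // are COMMON, combining marks are INHERITED.
    static boolean isCounted(Character.UnicodeScript script) {
        return script != Character.UnicodeScript.COMMON
                && script != Character.UnicodeScript.INHERITED
                && script != Character.UnicodeScript.UNKNOWN;
    }

    // Counts code points per Unicode script, skipping script-agnostic ones.
    static Map<Character.UnicodeScript, Integer> count(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (isCounted(script)) {
                counts.merge(script, 1, Integer::sum);
            }
        });
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative sample only: four Han characters, a space, five Latin letters.
        String sample = "\u4f60\u597d\u4e16\u754c hello";
        System.out.println(count(sample));
    }
}
```

Counting by code point rather than by `char` keeps characters outside the Basic Multilingual Plane from being counted twice.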

The project's front page explains in the section "How to Use" that text should be run through the text object factory:

TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();
TextObject textObject = textObjectFactory.forText("my text");

However, because the example text has such a high share of secondary-script content, that content is not removed. The removal is performed by the RemoveMinorityScriptsTextFilter, and the example above needs a threshold of at least 0.43; the default of 0.3 is not enough.
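To see where a value around 0.43 could come from: 45 Latin / 105 Han ≈ 0.4286. The sketch below assumes the filter compares each script's count to the count of the most frequent script and drops scripts whose ratio falls below the threshold. That assumption is consistent with the numbers in this comment, but the code is a hypothetical re-implementation using only the JDK (`MinorityScriptSketch` and its methods are invented names), not the library's actual code.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of minority-script removal; not the library's implementation.
final class MinorityScriptSketch {

    static boolean isCounted(Character.UnicodeScript s) {
        return s != Character.UnicodeScript.COMMON
                && s != Character.UnicodeScript.INHERITED
                && s != Character.UnicodeScript.UNKNOWN;
    }

    // Removes every code point whose script occurs less often than
    // threshold * (count of the most frequent script).
    static String removeMinorityScripts(String text, double threshold) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript s = Character.UnicodeScript.of(cp);
            if (isCounted(s)) {
                counts.merge(s, 1, Integer::sum);
            }
        });
        int max = counts.values().stream().mapToInt(Integer::intValue).max().orElse(0);
        if (max == 0) {
            return text; // nothing but script-agnostic characters
        }
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript s = Character.UnicodeScript.of(cp);
            boolean minority = isCounted(s)
                    && counts.get(s).doubleValue() / max < threshold;
            if (!minority) {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    // Small helper so the sketch also compiles on Java 8 (no String.repeat).
    static String repeat(String s, int n) {
        StringBuilder b = new StringBuilder();
        for (int i = 0; i < n; i++) {
            b.append(s);
        }
        return b.toString();
    }

    public static void main(String[] args) {
        // Synthetic text with the counts from the comment: 105 Han, 45 Latin.
        String text = repeat("\u4e2d", 105) + repeat("a", 45);
        System.out.println(removeMinorityScripts(text, 0.3).length()); // Latin kept
        System.out.println(removeMinorityScripts(text, 0.5).length()); // Latin removed
    }
}
```

Under this assumption, a threshold of 0.3 leaves the Latin characters in place because 0.4286 is not below 0.3, while any threshold of 0.43 or more removes them.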

Something like this will do for you:

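// Raise the minority-script threshold from the default 0.3 to 0.5
TextObjectFactory textObjectFactory = new TextObjectFactoryBuilder()
        .withTextFilter(UrlTextFilter.getInstance())
        .withTextFilter(RemoveMinorityScriptsTextFilter.forThreshold(0.5))
        .build();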

While this may solve your problem for the moment, I don't yet see why this does not work out of the box, or why it considers English more dominant. It needs debugging. So thanks for reporting it.

djelinski commented on July 17, 2024

For what it's worth, with current git master the example returns the list:
[DetectedLanguage[vi:0.8571398047407237], DetectedLanguage[zh:0.14285646388985637]]

The Vietnamese profile contains a mix of Latin and Han, which seems to fit the provided text better than the Chinese profile does. I believe the profiles need a cleanup.

IdiosApps commented on July 17, 2024

Interesting to read about the secondary script filtering, @fabiankessler.
In #63 I found that forcing short-text language detection also helped a lot for similar Chinese/English mixes of more than 50 characters. However, I'm not sure exactly what the implications of forcing short-text detection are.
At the moment I'm not even sure whether the short-text and long-text algorithms use the same language models. The algorithms themselves, at least, differ quite a lot.
