
Comments (3)

fabiankessler commented on July 17, 2024

Yes, this is not good. Thanks for the submission.
You can work around it by changing the defaults.

Visually, the text looks as if it has far more Han than Latin characters, because the dense Han glyphs dominate the small Latin letters with their few strokes. The actual counts are less one-sided:

Han = 105
Latin = 45 (!)
(plus 31 spaces, which are script-agnostic)
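
For illustration, per-script counts like the ones above can be reproduced with the JDK alone. This is a minimal sketch, not the library's own code: the class name ScriptCountSketch and its method are hypothetical, and it assumes that code points in the COMMON, INHERITED and UNKNOWN scripts (spaces, digits, punctuation and the like) are left out of the tally.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of language-detector.
class ScriptCountSketch {

    /** Counts code points per Unicode script, skipping script-agnostic ones. */
    static Map<UnicodeScript, Integer> countScripts(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            // Assumption: spaces, digits and punctuation (COMMON) do not count
            // towards any script, and neither do INHERITED or UNKNOWN code points.
            if (script != UnicodeScript.COMMON
                    && script != UnicodeScript.INHERITED
                    && script != UnicodeScript.UNKNOWN) {
                counts.merge(script, 1, Integer::sum);
            }
        });
        return counts;
    }

    public static void main(String[] args) {
        // Tiny mixed-script sample; substitute the text from the report to compare.
        System.out.println(countScripts("Hello 世界, this is a test 你好"));
    }
}
```

Counting by code point rather than by char keeps Han characters outside the Basic Multilingual Plane from being counted twice.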

The project's front page explains in the section "How to Use" that text should be run through the text object factory:

TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();
TextObject textObject = textObjectFactory.forText("my text");

However, because the example text has such a high share of secondary-script content, that content is not removed. The removal is performed by the RemoveMinorityScriptsTextFilter, and this example needs a threshold >= 0.43; the default of 0.3 is not enough.
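
The 0.43 figure matches the Latin-to-Han ratio from the counts above, 45 / 105 ≈ 0.43. The sketch below walks through that arithmetic. It assumes the filter compares each minority script's count to the dominant script's count and drops the script when the ratio is at or below the threshold. That rule is inferred from the numbers in this comment, not copied from RemoveMinorityScriptsTextFilter, and the class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical model of the threshold check, for illustration only.
class MinorityThresholdSketch {

    /** Ratio of a minority script's character count to the dominant script's count. */
    static double ratioToDominant(long minorityCount, long dominantCount) {
        return (double) minorityCount / dominantCount;
    }

    /** Assumed rule: the minority script is removed when its ratio is at or below the threshold. */
    static boolean isRemoved(long minorityCount, long dominantCount, double threshold) {
        return ratioToDominant(minorityCount, dominantCount) <= threshold;
    }

    public static void main(String[] args) {
        long latin = 45;  // counts quoted in the comment
        long han = 105;
        System.out.printf(Locale.ROOT, "Latin/Han ratio: %.4f%n", ratioToDominant(latin, han));
        for (double threshold : new double[] {0.3, 0.43, 0.5}) {
            System.out.printf(Locale.ROOT, "threshold %.2f -> Latin removed: %b%n",
                    threshold, isRemoved(latin, han, threshold));
        }
    }
}
```

Under this assumed rule, the default of 0.3 leaves the Latin text in place, while 0.43 or 0.5 removes it.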

Something like this will do for you:

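// Custom factory: strip URLs, then remove minority scripts below a raised threshold (default is 0.3).
TextObjectFactory textObjectFactory = new TextObjectFactoryBuilder()
        .withTextFilter(UrlTextFilter.getInstance())
        .withTextFilter(RemoveMinorityScriptsTextFilter.forThreshold(0.5))
        .build();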

While this may solve your problem for the moment, I don't yet see why it fails out of the box, that is, why the detector considers English more dominant. It needs debugging, so thanks for reporting it.

from language-detector.

djelinski commented on July 17, 2024

For what it's worth, with the current git master the example returns this list:
[DetectedLanguage[vi:0.8571398047407237], DetectedLanguage[zh:0.14285646388985637]]

The Vietnamese profile contains a mix of Latin and Han, which seems to fit the provided text better than the Chinese profile does. I believe the profiles need a cleanup.

IdiosApps commented on July 17, 2024

Interesting to learn about the secondary-script filtering, @fabiankessler.
In #63 I found that forcing short-text language detection also helped a lot for similar Chinese/English mixes longer than 50 characters. However, I'm not sure exactly what the implications of forcing short-text detection are.
At the moment I'm not even sure whether the short-text and long-text algorithms use the same language models. The algorithms themselves, at least, differ quite a lot.
