
Comments (3)

fabiospampinato commented on May 28, 2024

You can download the dataset from https://tatoeba.org/en/downloads (it's the "sentences" one).

If I had to guess, I would say that the Naive Bayes classifier approach for language detection is basically hopeless. You can easily get into the millions of words for a single language, and even collecting those words for all the languages that you want to support is nontrivial. That probably blows the memory and startup budget for client-side applications (though you can use this approach for spell checking a single language fairly efficiently with "succinct tries").

So basically the accuracy at similar (or even 100x bigger) size and for a similar number of languages would probably not be great. Maybe it could be improved significantly once you get into the gigabytes of data, but that's probably too expensive.
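To make the size concern concrete, here is a minimal sketch of a word-level Naive Bayes language classifier. It is only an illustration under assumptions: the two-sentence corpus, the `train`/`classify` names, the uniform prior and the add-one smoothing are all made up for this example and are not taken from Lande or from any particular library.

```typescript
// Toy word-level Naive Bayes. All names and data here are hypothetical.
interface WordModel {
  wordCounts: Map<string, Map<string, number>>; // language -> word -> count
  totals: Map<string, number>;                  // language -> total words seen
  vocabulary: Set<string>;                      // every distinct word, all languages
}

// Lowercase the text and keep runs of Unicode letters as words.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/\p{L}+/gu) ?? [];
}

function train(samples: Array<[string, string]>): WordModel {
  const model: WordModel = { wordCounts: new Map(), totals: new Map(), vocabulary: new Set() };
  for (const [lang, text] of samples) {
    const counts = model.wordCounts.get(lang) ?? new Map<string, number>();
    model.wordCounts.set(lang, counts);
    for (const word of tokenize(text)) {
      counts.set(word, (counts.get(word) ?? 0) + 1);
      model.totals.set(lang, (model.totals.get(lang) ?? 0) + 1);
      model.vocabulary.add(word);
    }
  }
  return model;
}

function classify(model: WordModel, text: string): string | undefined {
  const words = tokenize(text);
  let best: string | undefined;
  let bestScore = -Infinity;
  for (const [lang, counts] of model.wordCounts) {
    const total = model.totals.get(lang) ?? 0;
    // Uniform prior over languages; add-one smoothing so unseen words
    // do not drive the probability to zero.
    let score = 0;
    for (const word of words) {
      score += Math.log(((counts.get(word) ?? 0) + 1) / (total + model.vocabulary.size));
    }
    if (score > bestScore) {
      bestScore = score;
      best = lang;
    }
  }
  return best;
}

// Hypothetical two-sentence corpus, far too small for real use.
const wordModel = train([
  ["en", "the cat is on the table and the dog is in the garden"],
  ["it", "il gatto è sul tavolo e il cane è nel giardino"],
]);
```

Calling `classify(wordModel, "the dog is on the table")` picks `"en"` here, but note that the model stores one entry per distinct word per language, so `vocabulary.size` (and the shipped model) grows with every word you want to recognize.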

Other neural networks, like fastText or some LSTM models, seem to do significantly better than Lande at reasonable sizes (maybe 10x bigger), so whatever they are doing is probably the best approach. I don't know what datasets they are trained on, though. Some of them support around 200 languages, but I haven't seen a good dataset covering that many.

from lande.

leeoniya commented on May 28, 2024

i wonder if it's sufficient to simply sample the 1000 most common words from each lang and/or do char frequency stats (you can probably get really close with just detected charset ranges). 🤔

EDIT: prolly not!

another idea was to run this whole list of unicode regexes [1] against a training set and feed those into a Bayes classifier. but i'm sure this will only get you part of the way and won't work well for short stuff like single words.

[1] https://en.wikipedia.org/wiki/Script_(Unicode)
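A rough sketch of the script-regex idea, assuming a JavaScript engine with Unicode property escapes (`\p{Script=...}` with the `u` flag). The `SCRIPTS` subset and the `scriptShares` helper are hypothetical names chosen for this example; the resulting shares could be fed to a Bayes classifier or any other model as features.

```typescript
// Hypothetical subset of scripts; the Wikipedia page in [1] lists many more.
const SCRIPTS = [
  "Latin", "Cyrillic", "Greek", "Arabic", "Hebrew",
  "Han", "Hiragana", "Katakana", "Hangul",
];

// One global, Unicode-aware regex per script.
const SCRIPT_REGEXES = SCRIPTS.map((name) => ({
  name,
  regex: new RegExp(`\\p{Script=${name}}`, "gu"),
}));

// Returns, for each script, the share of matched characters that belong to it.
// Characters outside the listed scripts (digits, spaces, punctuation) are ignored.
function scriptShares(text: string): Record<string, number> {
  const shares: Record<string, number> = {};
  let total = 0;
  for (const { name, regex } of SCRIPT_REGEXES) {
    const count = (text.match(regex) ?? []).length;
    shares[name] = count;
    total += count;
  }
  for (const name of SCRIPTS) {
    shares[name] = total > 0 ? shares[name] / total : 0;
  }
  return shares;
}
```

These features separate, say, Russian from Japanese easily, but they give the same answer for every Latin-script language, which matches the caveat above that this only gets you part of the way.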


fabiospampinato commented on May 28, 2024

> or do char frequency stats

That's basically how this program works, and it seems to work OK. There are probably better, more end-to-end approaches where you feed the model the entire text.
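For readers unfamiliar with the idea, character frequency stats can be sketched roughly like this. This is not Lande's actual code: the `ngramFrequencies` helper, the normalization steps and the padding with spaces are assumptions made for illustration only.

```typescript
// Relative frequencies of character n-grams in a text (hypothetical helper).
function ngramFrequencies(text: string, n: number): Map<string, number> {
  // Lowercase, turn every run of non-letters into one space,
  // and pad with spaces so word boundaries show up in the n-grams.
  const normalized = ` ${text.toLowerCase().replace(/[^\p{L}]+/gu, " ").trim()} `;
  const chars = Array.from(normalized); // split by code point, not UTF-16 unit
  const total = Math.max(chars.length - n + 1, 0);

  const counts = new Map<string, number>();
  for (let i = 0; i < total; i++) {
    const gram = chars.slice(i, i + n).join("");
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }

  // Convert raw counts into relative frequencies.
  const frequencies = new Map<string, number>();
  counts.forEach((count, gram) => frequencies.set(gram, count / total));
  return frequencies;
}
```

A model would then take the frequencies of a fixed list of n-grams as its input vector, so the input size stays constant no matter how long the text is.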

> another idea was to run this whole list of unicode regexes [1] against a training set and feed those into a Bayes classifier. but i'm sure this will only get you part of the way and won't work well for short stuff like single words.

I'd guess that should work well for some languages with unique characters, and not that well for others. Though potentially one could also have multiple languages in one piece of text, which might confuse the program.

I should probably try adding some input neurons for this stuff, it might free up other neurons to learn something else 🤔

