Comments (3)
You can download the dataset from here: https://tatoeba.org/en/downloads It's the "sentences" one.
If I had to guess, I would say that the Naive Bayes classifier approach to language detection is basically hopeless. You can easily get into the millions of words for a single language, and collecting those words for every language you want to support is non-trivial. That probably blows the memory and startup budget for client-side applications (though you can use this approach fairly efficiently for spell checking a single language with "succinct tries").
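To make the size concern concrete, here's a minimal sketch (toy data, not a real model) of a word-level Naive Bayes language classifier. The per-language word-count table is the part that grows into millions of entries at scale:

```python
import math
from collections import Counter

# Toy training data; a real model would need huge corpora per language,
# which is the memory/startup cost discussed above.
TRAIN = {
    "en": "the cat sat on the mat and the dog ran",
    "de": "die katze sass auf der matte und der hund lief",
}

# Per-language word counts -- this table is what blows up at scale.
counts = {lang: Counter(text.split()) for lang, text in TRAIN.items()}
totals = {lang: sum(c.values()) for lang, c in counts.items()}
vocab = {w for c in counts.values() for w in c}

def classify(text):
    """Multinomial Naive Bayes with add-one smoothing and a uniform prior."""
    best_lang, best_score = None, -math.inf
    for lang in counts:
        score = 0.0
        for word in text.split():
            p = (counts[lang][word] + 1) / (totals[lang] + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

print(classify("the cat ran"))  # -> "en"
print(classify("der hund"))     # -> "de"
```

Even in this toy form you can see the scaling problem: accuracy depends entirely on how many words per language the `counts` table has seen.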
So at a similar size (or even 100x bigger) and for a similar number of languages, the accuracy would probably not be great. It might improve significantly once you get into gigabytes of data, but that's probably too expensive.
Other neural networks, like fastText or some LSTM models, seem to do significantly better than lande at reasonable sizes (maybe 10x bigger); whatever they are doing is probably the best approach. I don't know what datasets those models are trained on, though. Some of them support around 200 languages, but I haven't seen a good dataset covering that many.
from lande.
i wonder if it's sufficient to simply sample the 1000 most common words from each lang and/or do char frequency stats (you can probably get really close with just detected charset ranges). 🤔
EDIT: prolly not!
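The char-frequency idea can at least be prototyped in a few lines. A hedged sketch (toy frequency profiles, made-up data; real profiles would come from a corpus such as the Tatoeba dump linked above) comparing letter-frequency vectors by cosine similarity:

```python
import math
from collections import Counter

# Toy character-frequency profiles; real ones would be built from a corpus.
PROFILES = {
    "en": Counter("the quick brown fox jumps over the lazy dog"),
    "fi": Counter("viekas kettu hyppii laiskan koiran yli taas"),
}

def cosine(a, b):
    """Cosine similarity between two character-count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def detect(text):
    """Pick the language whose profile is closest to the text's char counts."""
    freq = Counter(text.lower())
    return max(PROFILES, key=lambda lang: cosine(freq, PROFILES[lang]))
```

As the EDIT suggests, this tends to fall apart on short inputs and on languages with near-identical letter distributions.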
another idea was to run this whole list of unicode regexes [1] against a training set and feed those into a Bayes classifier. but i'm sure this will only get you part of the way and won't work well for short stuff like single words.
[1] https://en.wikipedia.org/wiki/Script_(Unicode)
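For the Unicode-script idea, you don't even need regexes to try it out. A sketch with a few hand-picked codepoint ranges (an illustrative subset, nowhere near the full script data at [1]) that buckets characters by script:

```python
# Rough codepoint ranges for a few scripts -- an illustrative subset only;
# the full picture is in the Unicode script data referenced at [1].
SCRIPT_RANGES = {
    "Latin":    [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
    "Cyrillic": [(0x0400, 0x04FF)],
    "Greek":    [(0x0370, 0x03FF)],
    "Han":      [(0x4E00, 0x9FFF)],
}

def script_counts(text):
    """Count how many characters of `text` fall into each script range."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        for name, ranges in SCRIPT_RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[name] += 1
    return counts

print(script_counts("Привет world"))
# Cyrillic letters land in the Cyrillic bucket, ASCII letters in Latin.
```

These counts would be natural features for a Bayes classifier, but as noted they can only separate scripts, not the many languages sharing one (e.g. most of Europe on Latin).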
> or do char frequency stats
That's basically how this program works, and it seems to work OK. There are probably better, more end-to-end approaches where you feed the model the entire text.
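If character statistics are the input, the feature-extraction step might look roughly like this — a guess at the general shape, not lande's actual code; `top_trigrams` is a hypothetical list that would be chosen from training data:

```python
from collections import Counter

def trigram_features(text, top_trigrams):
    """Normalized counts of a fixed list of character trigrams,
    suitable as the input vector for a small classifier or network."""
    text = f"  {text.lower()} "  # pad so word boundaries show up as trigrams
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(grams.values()) or 1
    return [grams[g] / total for g in top_trigrams]

# `top_trigrams` is hypothetical here; a real list would be learned from data.
vec = trigram_features("The cat", ["the", " th", "cat"])
```

Feeding the raw text end-to-end would instead push this normalization and selection into the model itself.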
> another idea was to run this whole list of unicode regexes [1] against a training set and feed those into a Bayes classifier. but i'm sure this will only get you part of the way and won't work well for short stuff like single words.
I'd guess that would work well for languages with unique characters, and not so well for others. A single piece of text could also mix multiple languages, which might confuse the program.
I should probably try adding some input neurons for this stuff; it might free up other neurons to learn something else 🤔