Comments (3)
You can download the dataset from here: https://tatoeba.org/en/downloads It's the "sentences" one.
If I had to guess, I would say that the Naive Bayes classifier approach to language detection is basically hopeless. You can easily get into the millions of words for a single language, and collecting those words for every language you want to support is non-trivial. That probably blows the memory and startup budget for client-side applications (though you can use this approach fairly efficiently for spell checking a single language with "succinct tries").
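To make the size concern concrete, here's a minimal sketch (toy data, not a real model) of a word-level Naive Bayes language classifier. The per-language word-count table is the part that grows into millions of entries at scale:

```python
import math
from collections import Counter

# Toy training data; a real model would need huge corpora per language,
# which is the memory/startup cost discussed above.
TRAIN = {
    "en": "the cat sat on the mat and the dog ran",
    "de": "die katze sass auf der matte und der hund lief",
}

# Per-language word counts -- this table is what blows up at scale.
counts = {lang: Counter(text.split()) for lang, text in TRAIN.items()}
totals = {lang: sum(c.values()) for lang, c in counts.items()}
vocab = {w for c in counts.values() for w in c}

def classify(text):
    """Multinomial Naive Bayes with add-one smoothing and a uniform prior."""
    best_lang, best_score = None, -math.inf
    for lang in counts:
        score = 0.0
        for word in text.split():
            p = (counts[lang][word] + 1) / (totals[lang] + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

print(classify("the cat ran"))  # -> "en"
print(classify("der hund"))     # -> "de"
```

Even in this toy form you can see the scaling problem: accuracy depends entirely on how many words per language the `counts` table has seen.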
So at a similar size (or even 100x bigger) and for a similar number of languages, the accuracy would probably not be great. It might improve significantly once you get into gigabytes of data, but that's probably too expensive.
Other neural networks, like fastText or some LSTM models, seem to do significantly better than lande at reasonable sizes (maybe 10x bigger); whatever they are doing is probably the best approach. I don't know what datasets those models are trained on, though. Some of them support around 200 languages, but I haven't seen a good dataset covering that many.
from lande.
i wonder if it's sufficient to simply sample the 1000 most common words from each lang and/or do char frequency stats (you can probably get really close with just detected charset ranges). 🤔
EDIT: prolly not!
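The char-frequency idea can at least be prototyped in a few lines. A hedged sketch (toy frequency profiles, made-up data; real profiles would come from a corpus such as the Tatoeba dump linked above) comparing letter-frequency vectors by cosine similarity:

```python
import math
from collections import Counter

# Toy character-frequency profiles; real ones would be built from a corpus.
PROFILES = {
    "en": Counter("the quick brown fox jumps over the lazy dog"),
    "fi": Counter("viekas kettu hyppii laiskan koiran yli taas"),
}

def cosine(a, b):
    """Cosine similarity between two character-count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def detect(text):
    """Pick the language whose profile is closest to the text's char counts."""
    freq = Counter(text.lower())
    return max(PROFILES, key=lambda lang: cosine(freq, PROFILES[lang]))
```

As the EDIT suggests, this tends to fall apart on short inputs and on languages with near-identical letter distributions.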
another idea was to run this whole list of unicode regexes [1] against a training set and feed those into a Bayes classifier. but i'm sure this will only get you part of the way and won't work well for short stuff like single words.
[1] https://en.wikipedia.org/wiki/Script_(Unicode)
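For the Unicode-script idea, you don't even need regexes to try it out. A sketch with a few hand-picked codepoint ranges (an illustrative subset, nowhere near the full script data at [1]) that buckets characters by script:

```python
# Rough codepoint ranges for a few scripts -- an illustrative subset only;
# the full picture is in the Unicode script data referenced at [1].
SCRIPT_RANGES = {
    "Latin":    [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
    "Cyrillic": [(0x0400, 0x04FF)],
    "Greek":    [(0x0370, 0x03FF)],
    "Han":      [(0x4E00, 0x9FFF)],
}

def script_counts(text):
    """Count how many characters of `text` fall into each script range."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        for name, ranges in SCRIPT_RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[name] += 1
    return counts

print(script_counts("Привет world"))
# Cyrillic letters land in the Cyrillic bucket, ASCII letters in Latin.
```

These counts would be natural features for a Bayes classifier, but as noted they can only separate scripts, not the many languages sharing one (e.g. most of Europe on Latin).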
> or do char frequency stats
That's basically how this program works, and it seems to work OK. There are probably better, more end-to-end approaches where you feed the model the entire text.
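If character statistics are the input, the feature-extraction step might look roughly like this — a guess at the general shape, not lande's actual code; `top_trigrams` is a hypothetical list that would be chosen from training data:

```python
from collections import Counter

def trigram_features(text, top_trigrams):
    """Normalized counts of a fixed list of character trigrams,
    suitable as the input vector for a small classifier or network."""
    text = f"  {text.lower()} "  # pad so word boundaries show up as trigrams
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(grams.values()) or 1
    return [grams[g] / total for g in top_trigrams]

# `top_trigrams` is hypothetical here; a real list would be learned from data.
vec = trigram_features("The cat", ["the", " th", "cat"])
```

Feeding the raw text end-to-end would instead push this normalization and selection into the model itself.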
> another idea was to run this whole list of unicode regexes [1] against a training set and feed those into a Bayes classifier. but i'm sure this will only get you part of the way and won't work well for short stuff like single words.
I'd guess that would work well for languages with unique characters, and not so well for others. A single piece of text could also mix multiple languages, which might confuse the program.
I should probably try adding some input neurons for this stuff; it might free up other neurons to learn something else 🤔