Comments (2)
I think one project that contains the texts for all languages would be okay, but it depends on how large the data is. I like the idea of having the original data and a script that cleans it up, but I'm not sure how practical that is. For example, one might want to remove sentences with a lot of names, proper nouns, etc., and detecting those isn't trivial.
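To make the cleanup idea concrete, here is a rough sketch of one possible filter. It is only a heuristic and not anything the project ships: it assumes that a capitalized word that does not start the sentence is probably a name or proper noun. The function names and the `max_ratio` threshold are hypothetical. The heuristic fails for languages such as German (all nouns are capitalized) and for scripts without letter case, which is part of why detection isn't trivial.

```python
import re

# Matches runs of Unicode letters (no digits, no underscore).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def proper_noun_ratio(sentence: str) -> float:
    """Share of non-initial tokens that start with an uppercase letter.

    Assumption: a capitalized token that is not the first word of the
    sentence is likely a name or proper noun.
    """
    tokens = TOKEN_RE.findall(sentence)
    rest = tokens[1:]  # the first token is capitalized anyway, so skip it
    if not rest:
        return 0.0
    capitalized = sum(1 for token in rest if token[0].isupper())
    return capitalized / len(rest)


def filter_sentences(sentences, max_ratio=0.3):
    """Keep sentences whose ratio is at most max_ratio (hypothetical cutoff)."""
    return [s for s in sentences if proper_noun_ratio(s) <= max_ratio]


if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat.",
        "Yesterday John Smith met Mary Jones in New York.",
    ]
    for kept in filter_sentences(corpus):
        print(kept)
```

The threshold would need tuning per language, and a real cleanup script would probably want a proper named-entity recognizer instead.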
from language-detector.
Well...
From the original language detection project wiki
- Generate language profiles from Wikipedia abstract xml
From Nakatani Shuyo's tools wiki
This tool generates language profiles from Wikipedia abstract database files or plain text.
Wikipedia abstract database files can be retrieved from "Wikipedia Downloads" ( http://download.wikimedia.org/ ). They form '(language code)wiki-(version)-abstract.xml' (e.g. 'enwiki-20101004-abstract.xml' ). See also LanguageList about language code.
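As a small illustration of the naming scheme quoted above, the sketch below builds the file name from a language code and a version, and pulls the abstract text out of such a file. This is not the original tool's code. It assumes the dump wraps each entry in a `<doc>` element containing an `<abstract>` element; the helper names and the inline sample are made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def abstract_file_name(language_code: str, version: str) -> str:
    """Build '(language code)wiki-(version)-abstract.xml'."""
    return f"{language_code}wiki-{version}-abstract.xml"


def iter_abstracts(source):
    """Yield non-empty <abstract> texts from a file path or binary file object.

    Assumption: entries look like <doc>...<abstract>text</abstract>...</doc>.
    iterparse streams the file, so large dumps need not fit in memory.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "abstract":
            text = (elem.text or "").strip()
            if text:
                yield text
        elif elem.tag == "doc":
            elem.clear()  # free the finished entry's children


# Tiny made-up sample standing in for a real dump file.
SAMPLE = b"""<feed>
<doc><title>Wikipedia: Example</title><abstract>An example abstract.</abstract></doc>
<doc><title>Wikipedia: Empty</title><abstract></abstract></doc>
</feed>"""

if __name__ == "__main__":
    print(abstract_file_name("en", "20101004"))
    for abstract in iter_abstracts(io.BytesIO(SAMPLE)):
        print(abstract)
```

The extracted abstracts could then be fed to whatever profile generator is used, as plain text.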
We can just download the database from Wikipedia (albeit the date and version will be different) if the profiles here are just pulled as is from the original.
It also seems that danielnaber used some tatoeba.org data for Esperanto 👍 I've used that site too, haha.