language-detector

Language Detection Library for Java

<dependency>
    <groupId>com.optimaize.languagedetector</groupId>
    <artifactId>language-detector</artifactId>
    <version>0.6</version>
</dependency>

Language Support

71 Built-in Language Profiles

  1. af Afrikaans
  2. an Aragonese
  3. ar Arabic
  4. ast Asturian
  5. be Belarusian
  6. br Breton
  7. ca Catalan
  8. bg Bulgarian
  9. bn Bengali
  10. cs Czech
  11. cy Welsh
  12. da Danish
  13. de German
  14. el Greek
  15. en English
  16. es Spanish
  17. et Estonian
  18. eu Basque
  19. fa Persian
  20. fi Finnish
  21. fr French
  22. ga Irish
  23. gl Galician
  24. gu Gujarati
  25. he Hebrew
  26. hi Hindi
  27. hr Croatian
  28. ht Haitian
  29. hu Hungarian
  30. id Indonesian
  31. is Icelandic
  32. it Italian
  33. ja Japanese
  34. km Khmer
  35. kn Kannada
  36. ko Korean
  37. lt Lithuanian
  38. lv Latvian
  39. mk Macedonian
  40. ml Malayalam
  41. mr Marathi
  42. ms Malay
  43. mt Maltese
  44. ne Nepali
  45. nl Dutch
  46. no Norwegian
  47. oc Occitan
  48. pa Punjabi
  49. pl Polish
  50. pt Portuguese
  51. ro Romanian
  52. ru Russian
  53. sk Slovak
  54. sl Slovene
  55. so Somali
  56. sq Albanian
  57. sr Serbian
  58. sv Swedish
  59. sw Swahili
  60. ta Tamil
  61. te Telugu
  62. th Thai
  63. tl Tagalog
  64. tr Turkish
  65. uk Ukrainian
  66. ur Urdu
  67. vi Vietnamese
  68. wa Walloon
  69. yi Yiddish
  70. zh-cn Simplified Chinese
  71. zh-tw Traditional Chinese

User danielnaber has made a profile for Esperanto available on his website; see the open tasks.

There are two kinds of profiles: the standard ones, created from Wikipedia articles and similar text, and the "short text" profiles, created from Twitter messages. Short-text profiles exist for fewer languages; more could be added, see #57

Other Languages

You can easily create a language profile for your own language. See https://github.com/optimaize/language-detector/blob/master/src/main/resources/README.md

How it Works

The software uses language profiles that were created from common text for each language. N-grams (http://en.wikipedia.org/wiki/N-gram) were extracted from that text, and those n-grams, together with their frequencies, are what is stored in the profiles.

When asked to determine the language of a text, the program goes through the same process: it builds the same kind of n-grams from the input text, compares their relative frequencies against each profile, and picks the language that matches best.
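The extraction step described above can be sketched in plain Java. This is a hypothetical helper for illustration only, not the library's own NgramExtractors code: it collects all 1- and 2-grams of a string and counts how often each occurs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class NgramSketch {
    // Collect all n-grams of length 1..maxN and count their occurrences.
    static Map<String, Integer> extract(String text, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // TreeMap just gives a deterministic (sorted) printing order.
        System.out.println(new TreeMap<>(extract("aba", 2))); // {a=2, ab=1, b=1, ba=1}
    }
}
```

A profile then stores such counts for large training text; detection compares the input's relative frequencies against each stored profile.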

Challenges

This software does not work as well when the input text is short or unclean, for example tweets.

When a text is written in multiple languages, the default algorithm of this software is not appropriate. You can try splitting the text (by sentence or paragraph) and detecting the individual parts. Running the language guesser on the whole text will, at best, tell you the most dominant language.
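The split-and-detect idea can be sketched like this. The sentence splitting uses the JDK's java.text.BreakIterator; the per-sentence detector call is shown as a comment because it needs the language-detector classes on the classpath:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SplitDetectSketch {
    // Split a text into sentences using the JDK's boundary analysis.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        for (int start = it.first(), end = it.next();
             end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) out.add(s);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String sentence : sentences("C'est bon. This part is English.")) {
            System.out.println(sentence);
            // With the library on the classpath, detect each part separately:
            // Optional<LdLocale> lang =
            //     languageDetector.detect(textObjectFactory.forText(sentence));
        }
    }
}
```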

This software does not handle it well when the input text is in none of the expected (and supported) languages. For example, if you only load the language profiles for English and German but the text is written in French, the program may pick the more likely of the two, or say that it doesn't know. (An improvement would be to clearly detect that the text is unlikely to be in any of the supported languages.)

If you are looking for a language detector / language guesser library in Java, this seems to be the best open-source library available at this time. If it doesn't need to be Java, you may want to take a look at https://code.google.com/p/cld2/

How to Use

Language Detection for your Text

//load all languages:
List<LanguageProfile> languageProfiles = new LanguageProfileReader().readAllBuiltIn();

//build language detector:
LanguageDetector languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard())
        .withProfiles(languageProfiles)
        .build();

//create a text object factory
TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();

//query:
TextObject textObject = textObjectFactory.forText("my text");
Optional<LdLocale> lang = languageDetector.detect(textObject);
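Note that detect() returns an absent Optional when the detector is not confident enough to give a single answer (a frequent source of confusion, see the issues below). A hedged sketch of reading the result, continuing the snippet above; that getProbabilities() returns a ranked List&lt;DetectedLanguage&gt; is assumed from the examples elsewhere in this repository:

```java
// continuation of the snippet above; needs the language-detector jar
if (lang.isPresent()) {
    System.out.println(lang.get().getLanguage());   // e.g. "en"
} else {
    // not confident enough for a single answer: inspect the ranked candidates
    List<DetectedLanguage> candidates = languageDetector.getProbabilities(textObject);
}
```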

Creating Language Profiles for your Training Text

See https://github.com/optimaize/language-detector/wiki/Creating-Language-Profiles

How You Can Help

If your language is not supported yet, then you can provide clean "training text", that is, common text written in your language. The text should be fairly long (a couple of pages at the very least). If you can provide that, please open a ticket.

If your language is supported already, but not identified clearly all the time, you can still provide such training text. We might then be able to improve detection for your language.

If you're a programmer, dig into the source and see what you can improve. Check the open tasks.

Memory Consumption

Loading all 71 language profiles uses 74 MB of RAM to store the data in memory. For memory considerations see https://github.com/optimaize/language-detector/wiki/Memory-Consumption

History and Changes

This project is a fork of a fork; the original author is Nakatani Shuyo. For details see https://github.com/optimaize/language-detector/wiki/History-and-Changes

Where it's used

An adapted version of this library is used by the http://www.NameAPI.org server.

https://www.languagetool.org/ is proof-reading software for LibreOffice/OpenOffice, for the desktop, and for Firefox.

License

Apache 2 (business-friendly)

Authors

Nakatani Shuyo, Fabian Kessler, Francois ROLAND, Robert Theis

For detail see https://github.com/optimaize/language-detector/wiki/Authors

For Maven Users

The project is in Maven Central: http://search.maven.org/#artifactdetails%7Ccom.optimaize.languagedetector%7Clanguage-detector%7C0.4%7Cjar. This is the latest version:

<dependency>
    <groupId>com.optimaize.languagedetector</groupId>
    <artifactId>language-detector</artifactId>
    <version>0.6</version>
</dependency>


Issues

Adding recognition of Walloon (wa) language

Hello,
I'm working on adding the Walloon language to LanguageTool, which itself requires proper language detection from language-detector.
I don't see any clear instructions on how to generate a profile; so, as suggested, I'll attach some text files: http://chanae.walon.org/walon/wa.zip
It's a small zip file with some random pages from Wikipedia and rifondou.walon.org (for the latter, I only took texts more than 70 years old); it's about 2MB of text.
The zip includes plain-text dumps, as well as the HTML pages (which most often include lang=... tags, in case they may be useful for you).

Another thing to know about Walloon is that there are actually two ways of writing it.
A "unified orthography", called "rifondou" (which is the one used in those texts).
And a traditional "Feller" one, which puts a lot of emphasis on local accent and phonetics, with the consequence that it is actually not one orthography but a group of orthographies (at the very least there are four main groups: western, central, eastern and southern).

What would be the best thing to do?

  • only focus on "rifondou"
  • dump together all ways of writing the language
  • create several profiles (wa@rif, wa@ch, wa@na, wa@lg, wa@ba) ?

Thanks
wa.zip

Japanese text is detected as Chinese

This is my text:
"印刷用のトナーにおいては近赤外線を照射すると消色(無色化)する消色トナーが知られており、この消色トナーを用いて印刷を行う各種の画像処理装置が提案されている."
This text is detected as Chinese. How can I solve this problem?

Feature to remove minority script content

When a text is largely written in 1 script (eg Cyrillic), but still contains some of another script (eg Latin), then remove the minority script content as noise.

Make configurable what the limit is in percent.

(Previously, the library only allowed removing ASCII, a subset of Latin.)
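A minimal plain-Java sketch of what such a filter could do (hypothetical, not part of the library): count letters per Unicode script, then drop characters whose script falls below a configurable share of all letters.

```java
import java.util.EnumMap;
import java.util.Map;

public class MinorityScriptFilter {
    // Remove letters whose Unicode script covers less than `threshold`
    // (a fraction, e.g. 0.3) of all letters in the text.
    static String stripMinorityScripts(String text, double threshold) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        int letters = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                counts.merge(Character.UnicodeScript.of(c), 1, Integer::sum);
                letters++;
            }
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                double share = counts.get(Character.UnicodeScript.of(c)) / (double) letters;
                if (share < threshold) continue; // minority script: treat as noise
            }
            sb.append(c); // non-letters (spaces, digits, punctuation) are kept
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Drops the Latin "ok" from a mostly-Cyrillic text:
        System.out.println(stripMinorityScripts("Это кириллица ok", 0.3));
    }
}
```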

java.io.IOException in exported jar file

IOException occurs when I try to load the built in profiles
java.io.IOException: No language file available named af at languages/af!
at com.optimaize.langdetect.profiles.LanguageProfileReader.readBuiltIn(LanguageProfileReader.java:91)
at com.optimaize.langdetect.profiles.LanguageProfileReader.readAllBuiltIn(LanguageProfileReader.java:127)

The jar file was exported by Eclipse. The profiles are located under /resources/languages in the jar file.

This didn't happen when running it directly from Eclipse.

The readAllBuiltIn method seems to be trying to load them from /languages/.

Using this library to detect Arabic dialects

I am very happy to find this tool.
I have a question:
Can this library help me detect Arabic dialects (Syrian, Iraqi, Gulf)?
I will try to build corpora for each dialect and add them as language profiles.
Is that right?

Profile for Khmer

I've created a profile for Khmer: http://danielnaber.de/tmp/km.zip. Here's what I did:

  • exported 72,000 sentences (14MB) from a Wikipedia dump
  • removed sentences with ASCII chars: egrep -v "[a-zA-Z]+" => 37,000 sentences, 9MB
  • created the profile with a minimum frequency of 20

Unclear license

src/main/java/overview.html still refers to the Apache License, while the README now refers to LGPLv3.

Also, did the original authors agree to the license change?

Wrong language detected for nonsense text

Hi,
I'm having a "de" response with > 0.99 score for a text like the following:

6LSHOJDV 5LYR 8LERXSLQ 8UPDV 5DXGVHSS 0DULQH 6\VWHPV ,QVWLWXWH DW 7DOOLQQ 8QLYHUVLW\ RI 7HFKQRORJ\ $%675$&amp;7 7KH RWKHU LPSRUWDQW HQYLURQPHQWDO DVSHFW WKDW QHHGV (QYLURQPHQWDO FRQGLWLRQV ZHUH PRQLWRUHG XVLQJ LQ VLWX FRQWLQXRXV PRQLWRULQJ DUH RLO VSLOOV ,Q WKH *XOI RI )LQODQG PHDVXUHG LQKHUHQW RSWLFDO SURSHUWLHV DQG ZDWHU VDPSOLQJ WKH SUREDELOLW\ RI RLO VSLOOV LV KLJK GXH WR WKH LQFUHDVLQJ RLO WRJHWKHU ZLWK UHPRWH VHQVLQJ LPDJHU\ 0(5,6 DQG $6$5 WUDQVSRUWDWLRQ 6HYHUDO RLO SROOXWLRQ LQFLGHQWV KDSSHQHG LQ LQ 0XXJD %D\ %DOWLF 6HD 6LPXOWDQHRXV PRQLWRULQJ XVLQJ WKH JXOI RYHU WKH ODVW GHFDGH 'LUHFW HQYLURQPHQWDO LPSDFWV GLIIHUHQW PHWKRGRORJLHV JDYH GHWDLOHG RYHUYLHZ RI RI RLO VSLOOV DIIHFW VHDELUGV DQG FRDVWDO HFRORJ\ HVSHFLDOO\ VXVSHQGHG PDWWHU 630 ORDG LQWR WKH ZDWHU FROXPQ GXULQJ ZKHQ WKH VSLOO KLWV WKH VKRUH 7R PLQLPL]H WKH QHJDWLYH HIIHFW WKH GUHGJLQJ RSHUDWLRQV 0(5,6 )56 GDWD HQDEOHG WR RI RLO SROOXWLRQ DQG WR IDFLOLWDWH IDVW DSSOLFDWLRQ RI RLO UHFHLYH WKH GLVWULEXWLRQ RI 630 RQ ZDWHU VXUIDFH 7KH FRPEDWLQJ PHWKRGV DQ HDUO\ GHWHFWLRQ RI RLO VSLOOV DW VHD LV PHDVXUHPHQWV RI LQKHUHQW RSWLFDO SURSHUWLHV UHYLOHG WKH RI D JUHDW LPSRUWDQFH 0DQ\ VWXGLHV KDYH SURYHG WKDW UDGDU SDUWLFOH FRQFHQWUDWLRQ RQ YHUWLFDO VFDOH %DFNVFDWWHULQJ IURP LPDJHV FDQ SURYLGH LQIRUPDWLRQ RQ SRVVLEOH ORFDWLRQ DQG WKH $6$5 GDWD ZDV LQ FRUUHODWLRQ ZLWK RLO SURGXFWV H[WHQW RI RLO VSLOOV > @ GHWHUPLQHG IURP ZDWHU VDPSOHV ZKHQ EDOODVW ZDWHU GLVFKDUJH &RQWLQXRXV DQG ILQH VFDOH UHPRWH VHQVLQJ LV RQH RI WKH NH\ ZDV GHWHFWHG GXULQJ ILHOG VDPSOLQJ DVSHFWV LQ PRQLWRULQJ RI 630 DQG SRVVLEOH RLO VSLOOV QHDU WKH KDUERUV (QYLVDW 0(5,6 IXOO UHVROXWLRQ GDWD 0(5,6 )56 ,QGH[ 7HUPV 0(5,6 VXVSHQGHG PDWWHU LQKHUHQW DQG (QYLVDW $6$5 LPDJHU\ LV SURYLGHG E\ (6$ GDLO\ EDVHV RSWLFDO SURSHUWLHV DQG JLYHV JRRG EDVHV IRU FRQWLQXRXV PRQLWRULQJ 7KH VFRSH RI WKH FXUUHQW VWXG\ ZDV WR HYDOXDWH WKH XVH RI ,1752'8&7,21 0(5,6 )56 GDWD IRU PRQLWRULQJ RI VXVSHQGHG PDWWHU ORDG WR WKH FRDVWDO VHD GXULQJ WKH KDUERU GUHGJLQJ (QYLVDW $6$5 
2QH RI WKH PDLQ FKDOOHQJHV LGHQWLILHG E\ WKH (XURSHDQ 6HD GDWD ZDV XVHG WR HYDOXDWH WKH SRVVLELOLW\ WR GHWHFW WKH RLO 3RUWV 2UJDQLVDWLRQ (632 LQ LWV HQYLURQPHQWDO FRGH VSLOOV (632 ZDV WKH VXVWDLQDEOH GHYHORSPHQW RI VHD SRUWV 0(7+2'6 $FFRUGLQJ WR WKH GRFXPHQW WKH HQYLURQPHQWDO LPSDFWV FDXVHG E\ SRUW UHODWHG DFWLYLWLHV VKRXOG EH UHGXFHG > @ 7KH 7KH ILHOG PHDVXUHPHQWV RI LQKHUHQW RSWLFDO SURSHUWLHV ILUVW VWHS LV WR SURSHUO\ PDQDJH HQYLURQPHQWDO LVVXHV ZKLFK WRJHWKHU ZLWK WDNLQJ ZDWHU VDPSOHV ZHUH SHUIRUPHG LQ UHTXLUHV FRQWLQXRXV HQYLURQPHQWDO PRQLWRULQJ 5HPRWH DQG 0XXJD %D\ RQ DQG

Norwegian profile: Bokmål or Nynorsk?

Norwegian is a macro language. The written standards are

  • Bokmål iso 639-1 "nb"
  • Nynorsk iso 639-1 "nn"

90% of all publications are in Bokmål, the remaining 10% in Nynorsk.

Technically, the code "no" may not be used in this context. It is not clear which it is.

I am pretty sure that the training text was from one standard only, and that it was Bokmål. But without having the original training text info I would not be able to tell (based on n-grams). A Norwegian might know the differences...

So the only way to fix this is to either get the original training text, or to create a new profile. Or 2 profiles to distinguish them.

Greek is identified as Catalan when no Greek model is loaded

I don't load all models, to speed up the detection process for the languages we use. Testing out mis-detections revealed that text like "Η γλώσσα είναι η ικανότητα να αποκτήσουν και να χρησιμοποιήσουν περίπλοκα συστήματα επικοινωνίας , ιδιαίτερα την ανθρώπινη ικανότητα να το πράξουν , και μια γλώσσα είναι κάθε συγκεκριμένο παράδειγμα ενός τέτοιου συστήματος . Η επιστημονική μελέτη της γλώσσας ονομάζεται γλωσσολογία ." is identified as 99% Catalan, even though Catalan is written in the Latin (A-Z) alphabet.

Strange results for mixed-language text (one language is clearly dominant).

Example text:

设为首页收藏本站 开启辅助访问 为首页收藏本站 开启辅助访为首页收藏本站 开启辅助访切换到窄版 请 登录 后使用快捷导航 没有帐号 注册 用户名 Email 自动登录  找回密码 密码 登录  注册 快捷导航 论坛BBS 导读Guide 排行榜Ranklist 淘帖Collection 日志Blog 相册Album 分享Share 搜索 搜索 帖子 用户 公告

Chinese is clearly more dominant than English, but it can't detect Chinese at all, and getProbabilities() returns this list:

[DetectedLanguage[en:0.8571376011773154], DetectedLanguage[fr:0.14286031717254952]]

French? I have no idea where it sees French.

If I remove the end (with those English words), it detects the Chinese language fine.

I don't think that a few English words in such a dominantly Chinese text should give such a false result.

ShortText algorithm sometimes yields zero probabilities for all languages

detectBlockShortText does not break once CONV_THRESHOLD has been reached. Depending on the text size, this leads to zero probabilities for all languages.

Example:

The Bulgarian sentence
Европа не трябва да стартира нов конкурентен маратон и изход с приватизация
yields zero probability for all languages and, therefore, no result.

How to reproduce:

add the following line to runTests in the DataLanguageDetectorImplTest unit test:

assertEquals(detector.getProbabilities(text("Европа не трябва да стартира нов конкурентен маратон и изход с приватизация")).get(0).getLocale().getLanguage(), "bg");

Feature to filter language profiles by criteria

Some users selectively choose which built-in language profiles to load.
Others want all.
And yet others want all except some.

For the last group it must be super-simple to filter.
Currently, only the language code is reliably available.
I'll also need the script,
and whether the language is artificial (Esperanto) or extinct.
Since we probably don't want to include ICU just for that, we'll have to provide this data for the built-in languages.

As it is now, one has to load a profile to figure out its primary script. It could be mapped so that it is known early, but I don't want to enforce too much, because there can also be user-defined profiles; we'd have to require it in the file name, and I would not do that.
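Until such filtering exists, one workaround is to load all built-in profiles and then drop unwanted ones by language code. A hedged sketch, assuming LanguageProfile exposes its locale via getLocale() (an assumption; needs the library on the classpath):

```java
// load everything, then filter by language code (e.g. drop Norwegian)
Set<String> unwanted = new HashSet<>(Arrays.asList("no"));
List<LanguageProfile> kept = new ArrayList<>();
for (LanguageProfile p : new LanguageProfileReader().readAllBuiltIn()) {
    if (!unwanted.contains(p.getLocale().getLanguage())) {
        kept.add(p);
    }
}
```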

Look at "Compact Language Detector 2" (cld2)

See https://code.google.com/p/cld2/
It explains how it performs the language detection.

Of interest is:

  • It mainly uses 4-grams (whereas this project uses 1-, 2- and 3-grams). I too expect 4-grams to perform better. We'd need (some of) the original language text to play with this.
  • For some languages it decides based on script (incl. Greek and Thai).
  • It can cut the text and analyze sections.
  • Naïve Bayesian classifier, using one of three different token algorithms.
  • Does everything in lower case only (our indexed n-grams are currently partly case-sensitive)
  • "For each letter sequence, the scoring uses the 3-6 most likely languages and their quantized log probabilities." seems like a good optimization.

remove stax-api dependency?

Can the dependency on stax-api be removed? If I remove it from the pom.xml and run mvn clean test, everything still works.

New n-gram generator

Faster, and flexible (filter, space-padding). Previously it was hardcoded to 1, 2 and 3-grams, and it was hardcoded which n-grams were ignored.

Re-add "seed" parameter

I had removed the seed parameter recently because I did not understand its use.
Re-add it, named "consistent-results" or so.

This needs to go into the LanguageDetector, and the CommandLineInterface is affected too. See git history.

Bring code to current standards

Code quality improvements:

  • Returning interfaces instead of implementations (List instead of ArrayList etc)
  • String .equals instead of ==
  • Replaced StringBuffer with StringBuilder
  • Renamed classes for clarity
  • Made classes immutable, and thus thread safe
  • Made fields private, using accessors
  • Clear null-reference concept
  • Added JavaDoc, fixed typos
  • Added interfaces
  • More tests. Thanks to the refactorings, code is now testable that was too much embedded before.

Chinese detection is unusable

I noticed that even the slightest inclusion of English terms (like brand names) results in absolute nonsense as output:

Language: [it] 
Score: [0.15132266] 
Text: [小猫终于换发型! Ariana Grande为新砖染白金长发 爱莉安娜为新专辑变金发
Language: [nl] 
Score: [0.2858114] 
Text: [东京风尚 EP36 电子音乐节Audio2015 高颜值军团潮搭 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。本期来到日本最大型的音乐节!来看看日本潮人参加音乐节都是穿什么的哦!潮人搭配:上衣Gamber;裤子Galson;鞋子Reebok;背包Nike。
Language: [nl] 
Score: [0.52030957] 
Text: [东京风尚下北泽篇 EP28 优雅时尚的精品男适用街 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。今天主持人洋子来到了有名的东京代官山,男生的潮人们可以参考一下!这里必逛:TSUTAVA BOOKS。采访潮人搭配:T恤Bobson;衬衫Rage Blue;裤子Diesel;帽子CA4LA;鞋子Nike Jordan。
Language: [zh-CN] 
Score: [0.5714257] 
Text: [光泽魅惑色盘 打造中性秋季妆容 用PONY新改良光泽魅惑色盘打造中性秋季妆容]
Language: [es] 
Score: [0.285714] 
Text: [东京风尚银座篇 EP42 优雅而时尚的精品店街 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。本期美女主持阳子带我们来到银座的一条街道,这里聚集了潮牌的精品店,所以各位喜欢潮牌的男士一定不能错过!本期采访潮男搭配:Lad Musician上衣;WEGO裤子;Dr.Martens鞋子。

README outdated

  • readAll() is deprecated -> change to readAllBuiltIn()
  • Optional<String> lang = languageDetector.detect(textObject); gives an error "Type mismatch: cannot convert from Optional<LdLocale> to Optional<String>"

Probability must be <= 1 but was Infinity

Exception in thread "Driver" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
Caused by: java.lang.IllegalArgumentException: Probability must be <= 1 but was Infinity
at com.optimaize.langdetect.DetectedLanguage.<init>(DetectedLanguage.java:47)
at com.optimaize.langdetect.LanguageDetectorImpl.sortProbability(LanguageDetectorImpl.java:240)
at com.optimaize.langdetect.LanguageDetectorImpl.getProbabilities(LanguageDetectorImpl.java:121)
at com.optimaize.langdetect.LanguageDetectorImpl.detect(LanguageDetectorImpl.java:102)
at

how to retrieve a particular language from a set of possible languages?

[DetectedLanguage[ar:0.8556187887595297], DetectedLanguage[ur:0.12626434999662134]]

languageDetector.detect(textObject) prints the above line as the output, but the function returns Optional.absent(). So can anyone tell me how to pick the most probable language (here: Arabic with 85.5%)?

Replace "Language" with "Locale"

There are several places where the term "language" is used, but it's not well defined.

  • the profile file name
  • LanguageProfile
  • LanguageDetector
  • DetectedLanguage

As can be seen in the profile file names, the language itself is not always enough. There are currently "zh-cn" and "zh-tw". These are not languages; they are combinations of a language with a country.

What is needed is a locale: a language, optionally with a script and/or region.
Examples:

  • zh-Hans-CN
  • zh-Hant-TW
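These locale shapes can be illustrated with the JDK's own java.util.Locale; the library's LdLocale.fromString would presumably accept analogous strings:

```java
import java.util.Locale;

public class LocaleSketch {
    public static void main(String[] args) {
        // language + script + region, built explicitly...
        Locale simplified = new Locale.Builder()
                .setLanguage("zh").setScript("Hans").setRegion("CN").build();
        // ...or parsed from a BCP 47 language tag
        Locale traditional = Locale.forLanguageTag("zh-Hant-TW");

        System.out.println(simplified.toLanguageTag()); // zh-Hans-CN
        System.out.println(traditional.getScript());    // Hant
    }
}
```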

Compatibility with cybozu

Hi guys,

We loved what you did with optimaize and would love to use it in our library. Unfortunately, the current version does not play well with Cybozu. This is a problem for us since many products in our company use Cybozu directly or transitively, so depending on our library would break things, and migrating all of them at once is not feasible.

The problem is mainly with classes that share the same name/package (see [1] for an example stack trace). So the question is: Would you guys accept a PR that fixes this? It would be a backwards incompatible change as there would be name changes of public classes.

Thanks

[1] Optimaize tries to use a method that does not exist in the class with the same name in Cybozu.

Exception in thread "main" java.lang.NoSuchMethodError: com.cybozu.labs.langdetect.util.LangProfile.getFreq()Ljava/util/Map;
    at be.frma.langguess.LangProfileReader.read(LangProfileReader.java:76)
    at be.frma.langguess.LangProfileReader.read(LangProfileReader.java:48)
    at com.optimaize.langdetect.profiles.LanguageProfileReader.read(LanguageProfileReader.java:27)
    at com.optimaize.langdetect.profiles.LanguageProfileReader.readAll(LanguageProfileReader.java:154)
    ...

Not able to detect English and Chinese?

I happened to find this powerful tool, then did a simple try.
Unfortunately it's not working out well.
Below are related code snippet, can anyone tell why?

    <dependency>
        <groupId>com.optimaize.languagedetector</groupId>
        <artifactId>language-detector</artifactId>
        <version>0.5</version>
    </dependency>

public class LanuageDetector {

    //load all languages:
    static List<LanguageProfile> languageProfiles;
    static {
        try {
            languageProfiles = new LanguageProfileReader().readAllBuiltIn();
        } catch (IOException e) {
            log.error("Exception when loading language profile", e);
        }
    }

    //build language detector:
    static LanguageDetector languageDetector = LanguageDetectorBuilder
            .create(NgramExtractors.standard())
            .withProfiles(languageProfiles)
            .build();

    //create a text object factory
    static TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();

    public static String detectLang(String text) {
        TextObject textObject = textObjectFactory.forText(text);
        com.google.common.base.Optional<LdLocale> lang = languageDetector.detect(textObject);
        LdLocale locale = lang.orNull();
        return locale == null ? null : locale.getLanguage();
    }

    public static void main(String[] args) {
        String english = "I am English";
        String chinese = "我是简体中文";
        String hindi = "मैं हिन्दी हूं";
        System.out.println(detectLang(english));
        System.out.println(detectLang(chinese));
        System.out.println(detectLang(hindi));
    }
}

readBuiltIn() not working

I specified which languages to include in readBuiltIn:

List<LdLocale> names = new ArrayList<>();
names.add(LdLocale.fromString("en"));
names.add(LdLocale.fromString("tl"));
List<LdLocale> languages = ImmutableList.copyOf(names);
List<LanguageProfile> languageProfiles = new LanguageProfileReader().readBuiltIn(languages);

LanguageDetector languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard())
                .withProfiles(languageProfiles)
                .build();

TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingShortCleanText();

TextObject textObject = textObjectFactory.forText("basta");
List<DetectedLanguage> lang = languageDetector.getProbabilities(textObject);

However, when I run this, I am still getting results for languages that I did not include.

Bug

Hello

There is an issue when I use the jar file here :
http://search.maven.org/remotecontent?filepath=com/optimaize/languagedetector/language-detector/0.5/language-detector-0.5.jar

The code used is exactly the same on the readme part "How to Use"

The exception is caused by this line :
List<LanguageProfile> languageProfiles = new LanguageProfileReader().readAllBuiltIn();
And the exception is :
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;
at com.optimaize.langdetect.i18n.LdLocale.fromString(LdLocale.java:77)
at com.optimaize.langdetect.profiles.BuiltInLanguages.<clinit>(BuiltInLanguages.java:21)
at com.optimaize.langdetect.profiles.LanguageProfileReader.readAllBuiltIn(LanguageProfileReader.java:118)

Please note that with the v0.4 jar I don't have the issue and everything is working fine.

Provide training data for all language profiles

danielnaber mentioned that language profiles should come with the training text they are based on.

I totally agree with that. This allows anyone to play with customizations to improve the profiles, such as:

  • playing with profile size by setting the word count cutoff higher or lower
  • cleaning the text from foreign words, English phrases, ...
  • changing n-gram types (3-gram, 4-gram, ...)
  • debugging and understanding certain results
  • use cases we can't even think of now

The readme text for contributions should be updated to kindly ask for the training text. The best would be to get the original training text, plus the program that applies modifications to it to build the index.

Since not everyone who would like to check out the language detector needs these training texts, I'd vote for keeping them separate. Otherwise a simple checkout becomes very large.
On the other hand, if there are 150 GitHub projects for 150 languages, and one would like to try out 4-grams on all of them, that's quite some work too... opinions on that?

I don't recall the state of the current language profiles - I just took what was there. I'll have to see if the original texts are (easily) accessible.

Not able to choose certain languages to include

I am working on a project that needs to detect the language of short texts, but they will only ever be Tagalog or English. I wanted to remove all the other language profiles to increase the accuracy of the detection, but I am having trouble doing so. When I just removed the files for the other profiles, it says that they're missing - is there a hard-coded list somewhere?

What is the best way to go about doing this?

(I used to use the shuyo library, and was able to very easily just delete the other language profiles, but it doesn't seem to work the same way with this library?)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.