language-detector

Language Detection Library for Java

<dependency>
    <groupId>com.optimaize.languagedetector</groupId>
    <artifactId>language-detector</artifactId>
    <version>0.6</version>
</dependency>

Language Support

71 Built-in Language Profiles

  1. af Afrikaans
  2. an Aragonese
  3. ar Arabic
  4. ast Asturian
  5. be Belarusian
  6. br Breton
  7. ca Catalan
  8. bg Bulgarian
  9. bn Bengali
  10. cs Czech
  11. cy Welsh
  12. da Danish
  13. de German
  14. el Greek
  15. en English
  16. es Spanish
  17. et Estonian
  18. eu Basque
  19. fa Persian
  20. fi Finnish
  21. fr French
  22. ga Irish
  23. gl Galician
  24. gu Gujarati
  25. he Hebrew
  26. hi Hindi
  27. hr Croatian
  28. ht Haitian
  29. hu Hungarian
  30. id Indonesian
  31. is Icelandic
  32. it Italian
  33. ja Japanese
  34. km Khmer
  35. kn Kannada
  36. ko Korean
  37. lt Lithuanian
  38. lv Latvian
  39. mk Macedonian
  40. ml Malayalam
  41. mr Marathi
  42. ms Malay
  43. mt Maltese
  44. ne Nepali
  45. nl Dutch
  46. no Norwegian
  47. oc Occitan
  48. pa Punjabi
  49. pl Polish
  50. pt Portuguese
  51. ro Romanian
  52. ru Russian
  53. sk Slovak
  54. sl Slovene
  55. so Somali
  56. sq Albanian
  57. sr Serbian
  58. sv Swedish
  59. sw Swahili
  60. ta Tamil
  61. te Telugu
  62. th Thai
  63. tl Tagalog
  64. tr Turkish
  65. uk Ukrainian
  66. ur Urdu
  67. vi Vietnamese
  68. wa Walloon
  69. yi Yiddish
  70. zh-cn Simplified Chinese
  71. zh-tw Traditional Chinese

User danielnaber has made a profile for Esperanto available on his website; see the open tasks.

There are two kinds of profiles: the standard ones, created from Wikipedia articles and similar text, and the "short text" profiles, created from Twitter messages. Short-text profiles exist for fewer languages; more could be added, see #57

Other Languages

You can easily create a language profile for your own language. See https://github.com/optimaize/language-detector/blob/master/src/main/resources/README.md

How it Works

The software uses language profiles that were created from common text for each language. N-grams (http://en.wikipedia.org/wiki/N-gram) were extracted from that text, and those n-grams, together with their frequencies, are what is stored in the profiles.

When asked to determine the language of a text, the program goes through the same process: it builds the same kind of n-grams from the input text, compares their relative frequencies against each profile, and picks the language that matches best.
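The extraction step described above can be sketched in plain Java. This is a hypothetical helper for illustration only, not the library's own NgramExtractors code: it collects all 1- and 2-grams of a string and counts how often each occurs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class NgramSketch {
    // Collect all n-grams of length 1..maxN and count their occurrences.
    static Map<String, Integer> extract(String text, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // TreeMap just gives a deterministic (sorted) printing order.
        System.out.println(new TreeMap<>(extract("aba", 2))); // {a=2, ab=1, b=1, ba=1}
    }
}
```

A profile then stores such counts for large training text; detection compares the input's relative frequencies against each stored profile.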

Challenges

This software does not work as well when the input text is short or unclean, for example tweets.

When a text is written in multiple languages, the default algorithm of this software is not appropriate. You can try splitting the text (by sentence or paragraph) and detecting the individual parts. Running the language guesser on the whole text will, at best, tell you the most dominant language.
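The split-and-detect idea can be sketched like this. The sentence splitting uses the JDK's java.text.BreakIterator; the per-sentence detector call is shown as a comment because it needs the language-detector classes on the classpath:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SplitDetectSketch {
    // Split a text into sentences using the JDK's boundary analysis.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        for (int start = it.first(), end = it.next();
             end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) out.add(s);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String sentence : sentences("C'est bon. This part is English.")) {
            System.out.println(sentence);
            // With the library on the classpath, detect each part separately:
            // Optional<LdLocale> lang =
            //     languageDetector.detect(textObjectFactory.forText(sentence));
        }
    }
}
```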

This software does not handle it well when the input text is in none of the expected (and supported) languages. For example, if you only load the language profiles for English and German but the text is written in French, the program may pick the more likely of the two, or say that it doesn't know. (An improvement would be to clearly detect that the text is unlikely to be in any of the supported languages.)

If you are looking for a language detector / language guesser library in Java, this seems to be the best open-source library available at this time. If it doesn't need to be Java, you may want to take a look at https://code.google.com/p/cld2/

How to Use

Language Detection for your Text

//load all languages:
List<LanguageProfile> languageProfiles = new LanguageProfileReader().readAllBuiltIn();

//build language detector:
LanguageDetector languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard())
        .withProfiles(languageProfiles)
        .build();

//create a text object factory
TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();

//query:
TextObject textObject = textObjectFactory.forText("my text");
Optional<LdLocale> lang = languageDetector.detect(textObject);
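Note that detect() returns an absent Optional when the detector is not confident enough to give a single answer (a frequent source of confusion, see the issues below). A hedged sketch of reading the result, continuing the snippet above; that getProbabilities() returns a ranked List&lt;DetectedLanguage&gt; is assumed from the examples elsewhere in this repository:

```java
// continuation of the snippet above; needs the language-detector jar
if (lang.isPresent()) {
    System.out.println(lang.get().getLanguage());   // e.g. "en"
} else {
    // not confident enough for a single answer: inspect the ranked candidates
    List<DetectedLanguage> candidates = languageDetector.getProbabilities(textObject);
}
```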

Creating Language Profiles for your Training Text

See https://github.com/optimaize/language-detector/wiki/Creating-Language-Profiles

How You Can Help

If your language is not supported yet, then you can provide clean "training text", that is, common text written in your language. The text should be fairly long (a couple of pages at the very least). If you can provide that, please open a ticket.

If your language is supported already, but not identified clearly all the time, you can still provide such training text. We might then be able to improve detection for your language.

If you're a programmer, dig into the source and see what you can improve. Check the open tasks.

Memory Consumption

Loading all 71 language profiles uses 74 MB of RAM to store the data in memory. For memory considerations see https://github.com/optimaize/language-detector/wiki/Memory-Consumption

History and Changes

This project is a fork of a fork; the original author is Nakatani Shuyo. For details see https://github.com/optimaize/language-detector/wiki/History-and-Changes

Where it's used

An adapted version of this library is used by the http://www.NameAPI.org server.

https://www.languagetool.org/ is proof-reading software for LibreOffice/OpenOffice, for the desktop, and for Firefox.

License

Apache 2 (business-friendly)

Authors

Nakatani Shuyo, Fabian Kessler, Francois ROLAND, Robert Theis

For detail see https://github.com/optimaize/language-detector/wiki/Authors

For Maven Users

The project is in Maven Central: http://search.maven.org/#artifactdetails%7Ccom.optimaize.languagedetector%7Clanguage-detector%7C0.4%7Cjar. This is the latest version:

<dependency>
    <groupId>com.optimaize.languagedetector</groupId>
    <artifactId>language-detector</artifactId>
    <version>0.6</version>
</dependency>


Issues

Adding recognition of Walloon (wa) language

Hello,
I'm working on adding the Walloon language to LanguageTool, which itself requires proper language detection from language-detector.
I don't see any clear instructions on how to generate a profile; so, as suggested, I'll attach some text files: http://chanae.walon.org/walon/wa.zip
It's a small zip file with some random pages from Wikipedia and rifondou.walon.org (for the latter, I only took texts more than 70 years old); it's about 2MB of text.
The zip includes plain-text dumps, as well as the HTML pages (which most often include lang=... tags, in case they may be useful for you).

Another thing to know about Walloon is that there are actually two ways of writing it.
A "unified orthography", called "rifondou" (which is the one used in those texts).
And a traditional "Feller" one, which puts a lot of emphasis on local accent and phonetics, with the consequence that it is actually not one orthography but a group of orthographies (at the very least there are four main groups: western, central, eastern and southern).

What would be the best thing to do?

  • only focus on "rifondou"
  • dump together all ways of writing the language
  • create several profiles (wa@rif, wa@ch, wa@na, wa@lg, wa@ba) ?

Thanks
wa.zip

Japanese text is detected as Chinese

This is my text:
"印刷用のトナーにおいては近赤外線を照射すると消色(無色化)する消色トナーが知られており、この消色トナーを用いて印刷を行う各種の画像処理装置が提案されている."
This text is detected as Chinese. How can I solve this problem?

Feature to remove minority script content

When a text is largely written in 1 script (eg Cyrillic), but still contains some of another script (eg Latin), then remove the minority script content as noise.

Make configurable what the limit is in percent.

(Previously, the library only allowed removing ASCII, a subset of Latin.)
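A minimal plain-Java sketch of what such a filter could do (hypothetical, not part of the library): count letters per Unicode script, then drop characters whose script falls below a configurable share of all letters.

```java
import java.util.EnumMap;
import java.util.Map;

public class MinorityScriptFilter {
    // Remove letters whose Unicode script covers less than `threshold`
    // (a fraction, e.g. 0.3) of all letters in the text.
    static String stripMinorityScripts(String text, double threshold) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        int letters = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                counts.merge(Character.UnicodeScript.of(c), 1, Integer::sum);
                letters++;
            }
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                double share = counts.get(Character.UnicodeScript.of(c)) / (double) letters;
                if (share < threshold) continue; // minority script: treat as noise
            }
            sb.append(c); // non-letters (spaces, digits, punctuation) are kept
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Drops the Latin "ok" from a mostly-Cyrillic text:
        System.out.println(stripMinorityScripts("Это кириллица ok", 0.3));
    }
}
```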

java.io.IOException in exported jar file

IOException occurs when I try to load the built in profiles
java.io.IOException: No language file available named af at languages/af!
at com.optimaize.langdetect.profiles.LanguageProfileReader.readBuiltIn(LanguageProfileReader.java:91)
at com.optimaize.langdetect.profiles.LanguageProfileReader.readAllBuiltIn(LanguageProfileReader.java:127)

The jar file was exported by Eclipse. The profiles are located under /resources/languages in the jar file.

This didn't happen when running it directly from Eclipse.

The readAllBuiltIn method seems to be trying to load them from /languages/.

Using this library to detect Arabic dialects

I am very happy to find this tool.
I have a question:
Can this library help me detect Arabic dialects (Syrian, Iraqi, Gulf)?
I will try to build corpora for each dialect and add them as language profiles.
Is that right?

Profile for Khmer

I've created a profile for Khmer: http://danielnaber.de/tmp/km.zip. Here's what I did:

  • exported 72,000 sentences (14MB) from a Wikipedia dump
  • removed sentences with ASCII chars: egrep -v "[a-zA-Z]+" => 37,000 sentences, 9MB
  • created the profile with a minimum frequency of 20

Unclear license

src/main/java/overview.html still refers to the Apache License, while the README now refers to LGPLv3.

Also, did the original authors agree to the license change?

Wrong language detected for nonsense text

Hi,
I'm having a "de" response with > 0.99 score for a text like the following:

6LSHOJDV 5LYR 8LERXSLQ 8UPDV 5DXGVHSS 0DULQH 6\VWHPV ,QVWLWXWH DW 7DOOLQQ 8QLYHUVLW\ RI 7HFKQRORJ\ $%675$&amp;7 7KH RWKHU LPSRUWDQW HQYLURQPHQWDO DVSHFW WKDW QHHGV (QYLURQPHQWDO FRQGLWLRQV ZHUH PRQLWRUHG XVLQJ LQ VLWX FRQWLQXRXV PRQLWRULQJ DUH RLO VSLOOV ,Q WKH *XOI RI )LQODQG PHDVXUHG LQKHUHQW RSWLFDO SURSHUWLHV DQG ZDWHU VDPSOLQJ WKH SUREDELOLW\ RI RLO VSLOOV LV KLJK GXH WR WKH LQFUHDVLQJ RLO WRJHWKHU ZLWK UHPRWH VHQVLQJ LPDJHU\ 0(5,6 DQG $6$5 WUDQVSRUWDWLRQ 6HYHUDO RLO SROOXWLRQ LQFLGHQWV KDSSHQHG LQ LQ 0XXJD %D\ %DOWLF 6HD 6LPXOWDQHRXV PRQLWRULQJ XVLQJ WKH JXOI RYHU WKH ODVW GHFDGH 'LUHFW HQYLURQPHQWDO LPSDFWV GLIIHUHQW PHWKRGRORJLHV JDYH GHWDLOHG RYHUYLHZ RI RI RLO VSLOOV DIIHFW VHDELUGV DQG FRDVWDO HFRORJ\ HVSHFLDOO\ VXVSHQGHG PDWWHU 630 ORDG LQWR WKH ZDWHU FROXPQ GXULQJ ZKHQ WKH VSLOO KLWV WKH VKRUH 7R PLQLPL]H WKH QHJDWLYH HIIHFW WKH GUHGJLQJ RSHUDWLRQV 0(5,6 )56 GDWD HQDEOHG WR RI RLO SROOXWLRQ DQG WR IDFLOLWDWH IDVW DSSOLFDWLRQ RI RLO UHFHLYH WKH GLVWULEXWLRQ RI 630 RQ ZDWHU VXUIDFH 7KH FRPEDWLQJ PHWKRGV DQ HDUO\ GHWHFWLRQ RI RLO VSLOOV DW VHD LV PHDVXUHPHQWV RI LQKHUHQW RSWLFDO SURSHUWLHV UHYLOHG WKH RI D JUHDW LPSRUWDQFH 0DQ\ VWXGLHV KDYH SURYHG WKDW UDGDU SDUWLFOH FRQFHQWUDWLRQ RQ YHUWLFDO VFDOH %DFNVFDWWHULQJ IURP LPDJHV FDQ SURYLGH LQIRUPDWLRQ RQ SRVVLEOH ORFDWLRQ DQG WKH $6$5 GDWD ZDV LQ FRUUHODWLRQ ZLWK RLO SURGXFWV H[WHQW RI RLO VSLOOV > @ GHWHUPLQHG IURP ZDWHU VDPSOHV ZKHQ EDOODVW ZDWHU GLVFKDUJH &RQWLQXRXV DQG ILQH VFDOH UHPRWH VHQVLQJ LV RQH RI WKH NH\ ZDV GHWHFWHG GXULQJ ILHOG VDPSOLQJ DVSHFWV LQ PRQLWRULQJ RI 630 DQG SRVVLEOH RLO VSLOOV QHDU WKH KDUERUV (QYLVDW 0(5,6 IXOO UHVROXWLRQ GDWD 0(5,6 )56 ,QGH[ 7HUPV 0(5,6 VXVSHQGHG PDWWHU LQKHUHQW DQG (QYLVDW $6$5 LPDJHU\ LV SURYLGHG E\ (6$ GDLO\ EDVHV RSWLFDO SURSHUWLHV DQG JLYHV JRRG EDVHV IRU FRQWLQXRXV PRQLWRULQJ 7KH VFRSH RI WKH FXUUHQW VWXG\ ZDV WR HYDOXDWH WKH XVH RI ,1752'8&7,21 0(5,6 )56 GDWD IRU PRQLWRULQJ RI VXVSHQGHG PDWWHU ORDG WR WKH FRDVWDO VHD GXULQJ WKH KDUERU GUHGJLQJ (QYLVDW $6$5 
2QH RI WKH PDLQ FKDOOHQJHV LGHQWLILHG E\ WKH (XURSHDQ 6HD GDWD ZDV XVHG WR HYDOXDWH WKH SRVVLELOLW\ WR GHWHFW WKH RLO 3RUWV 2UJDQLVDWLRQ (632 LQ LWV HQYLURQPHQWDO FRGH VSLOOV (632 ZDV WKH VXVWDLQDEOH GHYHORSPHQW RI VHD SRUWV 0(7+2'6 $FFRUGLQJ WR WKH GRFXPHQW WKH HQYLURQPHQWDO LPSDFWV FDXVHG E\ SRUW UHODWHG DFWLYLWLHV VKRXOG EH UHGXFHG > @ 7KH 7KH ILHOG PHDVXUHPHQWV RI LQKHUHQW RSWLFDO SURSHUWLHV ILUVW VWHS LV WR SURSHUO\ PDQDJH HQYLURQPHQWDO LVVXHV ZKLFK WRJHWKHU ZLWK WDNLQJ ZDWHU VDPSOHV ZHUH SHUIRUPHG LQ UHTXLUHV FRQWLQXRXV HQYLURQPHQWDO PRQLWRULQJ 5HPRWH DQG 0XXJD %D\ RQ DQG

Norwegian profile: Bokmål or Nynorsk?

Norwegian is a macro language. The written standards are

  • Bokmål iso 639-1 "nb"
  • Nynorsk iso 639-1 "nn"

90% of all publications are in Bokmål, the remaining 10% in Nynorsk.

Technically, the code "no" may not be used in this context. It is not clear which it is.

I am pretty sure that the training text was from one standard only, and that it was Bokmål. But without having the original training text info I would not be able to tell (based on n-grams). A Norwegian might know the differences...

So the only way to fix this is to either get the original training text, or to create a new profile. Or 2 profiles to distinguish them.

Greek is identified as Catalan when no Greek model is loaded

I don't load all models, to speed up the detection process for the languages we use. Testing out mis-detections revealed that text like "Η γλώσσα είναι η ικανότητα να αποκτήσουν και να χρησιμοποιήσουν περίπλοκα συστήματα επικοινωνίας , ιδιαίτερα την ανθρώπινη ικανότητα να το πράξουν , και μια γλώσσα είναι κάθε συγκεκριμένο παράδειγμα ενός τέτοιου συστήματος . Η επιστημονική μελέτη της γλώσσας ονομάζεται γλωσσολογία ." is identified as 99% Catalan, even though Catalan is written in the Latin (A-Z) alphabet.

Strange results for mixed-language text (one language is clearly dominant).

Example text:

设为首页收藏本站 开启辅助访问 为首页收藏本站 开启辅助访为首页收藏本站 开启辅助访切换到窄版 请 登录 后使用快捷导航 没有帐号 注册 用户名 Email 自动登录  找回密码 密码 登录  注册 快捷导航 论坛BBS 导读Guide 排行榜Ranklist 淘帖Collection 日志Blog 相册Album 分享Share 搜索 搜索 帖子 用户 公告

Chinese is clearly more dominant than English, but it can't detect Chinese at all, and getProbabilities() returns this list:

[DetectedLanguage[en:0.8571376011773154], DetectedLanguage[fr:0.14286031717254952]]

French? I have no idea where it sees French.

If I remove the end (with those English words), it detects the Chinese language fine.

I don't think that a few English words in such a dominantly Chinese text should give such a false result.

ShortText algorithm sometimes yields zero probabilities for all languages

detectBlockShortText does not break once CONV_THRESHOLD has been reached. Depending on the text size, this leads to zero probabilities for all languages.

Example:

The Bulgarian sentence
Европа не трябва да стартира нов конкурентен маратон и изход с приватизация
yields zero probability for all languages and, therefore, no result.

How to reproduce:

add the following line to runTests in the DataLanguageDetectorImplTest unit test:

assertEquals(detector.getProbabilities(text("Европа не трябва да стартира нов конкурентен маратон и изход с приватизация")).get(0).getLocale().getLanguage(), "bg");

Feature to filter language profiles by criteria

Some users selectively choose which built-in language profiles to load.
Others want all.
And yet others want all except some.

For the last group it must be super-simple to filter.
Currently, only the language code is reliably available.
I'll also need the script,
and whether the language is artificial (Esperanto) or extinct.
Since we probably don't want to include ICU just for that, we'll have to provide this data for the built-in languages.

As it is now, one has to load a profile to figure out its primary script. It could be mapped so that it is known early, but I don't want to enforce too much, because there can also be user-defined profiles; we'd have to require it in the file name, and I would not do that.
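Until such filtering exists, one workaround is to load all built-in profiles and then drop unwanted ones by language code. A hedged sketch, assuming LanguageProfile exposes its locale via getLocale() (an assumption; needs the library on the classpath):

```java
// load everything, then filter by language code (e.g. drop Norwegian)
Set<String> unwanted = new HashSet<>(Arrays.asList("no"));
List<LanguageProfile> kept = new ArrayList<>();
for (LanguageProfile p : new LanguageProfileReader().readAllBuiltIn()) {
    if (!unwanted.contains(p.getLocale().getLanguage())) {
        kept.add(p);
    }
}
```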

Look at "Compact Language Detector 2" (cld2)

See https://code.google.com/p/cld2/
It explains how it performs the language detection.

Of interest is:

  • It mainly uses 4-grams (whereas this project uses 1-, 2- and 3-grams). I too expect 4-grams to perform better. We'd need (some of) the original language text to play with this.
  • For some languages it decides based on script (incl. Greek and Thai).
  • It can cut the text and analyze sections.
  • Naïve Bayesian classifier, using one of three different token algorithms.
  • Does everything in lower case only (our indexed n-grams are currently partly case-sensitive)
  • "For each letter sequence, the scoring uses the 3-6 most likely languages and their quantized log probabilities." seems like a good optimization.

remove stax-api dependency?

Can the dependency on stax-api be removed? If I remove it from the pom.xml and run mvn clean test, everything still works.

New n-gram generator

Faster, and flexible (filter, space-padding). Previously it was hardcoded to 1, 2 and 3-grams, and it was hardcoded which n-grams were ignored.

Re-add "seed" parameter

I had removed the seed parameter recently because I did not understand its use.
Re-add it, named "consistent-results" or so.

This needs to go into the LanguageDetector, and the CommandLineInterface is affected too. See git history.

Bring code to current standards

Code quality improvements:

  • Returning interfaces instead of implementations (List instead of ArrayList etc)
  • String .equals instead of ==
  • Replaced StringBuffer with StringBuilder
  • Renamed classes for clarity
  • Made classes immutable, and thus thread safe
  • Made fields private, using accessors
  • Clear null-reference concept
  • Added JavaDoc, fixed typos
  • Added interfaces
  • More tests. Thanks to the refactorings, code is now testable that was too much embedded before.

Chinese detection is unusable

I noticed that even the slightest inclusion of English terms (like brand names) results in absolute nonsense as output:

Language: [it] 
Score: [0.15132266] 
Text: [小猫终于换发型! Ariana Grande为新砖染白金长发 爱莉安娜为新专辑变金发
Language: [nl] 
Score: [0.2858114] 
Text: [东京风尚 EP36 电子音乐节Audio2015 高颜值军团潮搭 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。本期来到日本最大型的音乐节!来看看日本潮人参加音乐节都是穿什么的哦!潮人搭配:上衣Gamber;裤子Galson;鞋子Reebok;背包Nike。
Language: [nl] 
Score: [0.52030957] 
Text: [东京风尚下北泽篇 EP28 优雅时尚的精品男适用街 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。今天主持人洋子来到了有名的东京代官山,男生的潮人们可以参考一下!这里必逛:TSUTAVA BOOKS。采访潮人搭配:T恤Bobson;衬衫Rage Blue;裤子Diesel;帽子CA4LA;鞋子Nike Jordan。
Language: [zh-CN] 
Score: [0.5714257] 
Text: [光泽魅惑色盘 打造中性秋季妆容 用PONY新改良光泽魅惑色盘打造中性秋季妆容]
Language: [es] 
Score: [0.285714] 
Text: [东京风尚银座篇 EP42 优雅而时尚的精品店街 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。本期美女主持阳子带我们来到银座的一条街道,这里聚集了潮牌的精品店,所以各位喜欢潮牌的男士一定不能错过!本期采访潮男搭配:Lad Musician上衣;WEGO裤子;Dr.Martens鞋子。

README outdated

  • readAll() is deprecated -> change to readAllBuiltIn()
  • Optional<String> lang = languageDetector.detect(textObject); gives an error "Type mismatch: cannot convert from Optional<LdLocale> to Optional<String>"

Probability must be <= 1 but was Infinity

Exception in thread "Driver" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
Caused by: java.lang.IllegalArgumentException: Probability must be <= 1 but was Infinity
at com.optimaize.langdetect.DetectedLanguage.<init>(DetectedLanguage.java:47)
at com.optimaize.langdetect.LanguageDetectorImpl.sortProbability(LanguageDetectorImpl.java:240)
at com.optimaize.langdetect.LanguageDetectorImpl.getProbabilities(LanguageDetectorImpl.java:121)
at com.optimaize.langdetect.LanguageDetectorImpl.detect(LanguageDetectorImpl.java:102)
at

how to retrieve a particular language from a set of possible languages?

[DetectedLanguage[ar:0.8556187887595297], DetectedLanguage[ur:0.12626434999662134]]

languageDetector.detect(textObject) prints the above line as the output, but the function returns Optional.absent(). So can anyone tell me how to pick the most probable language (here: Arabic with 85.5%)?

Replace "Language" with "Locale"

There are several places where the term "language" is used, but it's not well defined.

  • the profile file name
  • LanguageProfile
  • LanguageDetector
  • DetectedLanguage

As can be seen in the profile file names, the language itself is not always enough. There are currently "zh-cn" and "zh-tw". These are not languages; they are combinations of a language with a country.

What is needed is a locale: a language, optionally with a script and/or region.
Examples:

  • zh-Hans-CN
  • zh-Hant-TW
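These locale shapes can be illustrated with the JDK's own java.util.Locale; the library's LdLocale.fromString would presumably accept analogous strings:

```java
import java.util.Locale;

public class LocaleSketch {
    public static void main(String[] args) {
        // language + script + region, built explicitly...
        Locale simplified = new Locale.Builder()
                .setLanguage("zh").setScript("Hans").setRegion("CN").build();
        // ...or parsed from a BCP 47 language tag
        Locale traditional = Locale.forLanguageTag("zh-Hant-TW");

        System.out.println(simplified.toLanguageTag()); // zh-Hans-CN
        System.out.println(traditional.getScript());    // Hant
    }
}
```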

Compatibility with cybozu

Hi guys,

We loved what you did with optimaize and would love to use it in our library. Unfortunately, the current version does not play well with Cybozu. This is a problem for us since many products in our company use Cybozu directly or transitively, so depending on our library would break things, and migrating all of them at once is not feasible.

The problem is mainly with classes that share the same name/package (see [1] for an example stack trace). So the question is: Would you guys accept a PR that fixes this? It would be a backwards incompatible change as there would be name changes of public classes.

Thanks

[1] Optimaize tries to use a method that does not exist in the class with the same name in Cybozu.

Exception in thread "main" java.lang.NoSuchMethodError: com.cybozu.labs.langdetect.util.LangProfile.getFreq()Ljava/util/Map;
    at be.frma.langguess.LangProfileReader.read(LangProfileReader.java:76)
    at be.frma.langguess.LangProfileReader.read(LangProfileReader.java:48)
    at com.optimaize.langdetect.profiles.LanguageProfileReader.read(LanguageProfileReader.java:27)
    at com.optimaize.langdetect.profiles.LanguageProfileReader.readAll(LanguageProfileReader.java:154)
    ...

Not able to detect English and Chinese?

I happened to find this powerful tool, then did a simple try.
Unfortunately it's not working out well.
Below are related code snippet, can anyone tell why?

    <dependency>
        <groupId>com.optimaize.languagedetector</groupId>
        <artifactId>language-detector</artifactId>
        <version>0.5</version>
    </dependency>

public class LanuageDetector {

    //load all languages:
    static List<LanguageProfile> languageProfiles;
    static {
        try {
            languageProfiles = new LanguageProfileReader().readAllBuiltIn();
        } catch (IOException e) {
            log.error("Exception when loading language profile", e);
        }
    }

    //build language detector:
    static LanguageDetector languageDetector = LanguageDetectorBuilder
            .create(NgramExtractors.standard())
            .withProfiles(languageProfiles)
            .build();

    //create a text object factory
    static TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();

    public static String detectLang(String text) {
        TextObject textObject = textObjectFactory.forText(text);
        com.google.common.base.Optional<LdLocale> lang = languageDetector.detect(textObject);
        LdLocale locale = lang.orNull();
        return locale == null ? null : locale.getLanguage();
    }

    public static void main(String[] args) {
        String english = "I am English";
        String chinese = "我是简体中文";
        String hindi = "मैं हिन्दी हूं";
        System.out.println(detectLang(english));
        System.out.println(detectLang(chinese));
        System.out.println(detectLang(hindi));
    }
}

readBuiltIn() not working

I specified which languages to include in readBuiltIn:

List<LdLocale> names = new ArrayList<>();
names.add(LdLocale.fromString("en"));
names.add(LdLocale.fromString("tl"));
List<LdLocale> languages = ImmutableList.copyOf(names);
List<LanguageProfile> languageProfiles = new LanguageProfileReader().readBuiltIn(languages);

LanguageDetector languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard())
                .withProfiles(languageProfiles)
                .build();

TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingShortCleanText();

TextObject textObject = textObjectFactory.forText("basta");
List<DetectedLanguage> lang = languageDetector.getProbabilities(textObject);

However, when I run this, I am still getting results for languages that I did not include.

Bug

Hello

There is an issue when I use the jar file here :
http://search.maven.org/remotecontent?filepath=com/optimaize/languagedetector/language-detector/0.5/language-detector-0.5.jar

The code used is exactly the same on the readme part "How to Use"

The exception is caused by this line :
List<LanguageProfile> languageProfiles = new LanguageProfileReader().readAllBuiltIn();
And the exception is :
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;
at com.optimaize.langdetect.i18n.LdLocale.fromString(LdLocale.java:77)
at com.optimaize.langdetect.profiles.BuiltInLanguages.<clinit>(BuiltInLanguages.java:21)
at com.optimaize.langdetect.profiles.LanguageProfileReader.readAllBuiltIn(LanguageProfileReader.java:118)

Please note that with the v0.4 jar I don't have the issue and everything is working fine.

Provide training data for all language profiles

danielnaber mentioned that language profiles should come with the training text they are based on.

I totally agree with that. This allows anyone to play with customizations to improve the profiles, such as:

  • playing with profile size by setting the word count cutoff higher or lower
  • cleaning the text from foreign words, English phrases, ...
  • changing n-gram types (3-gram, 4-gram, ...)
  • debugging and understanding certain results
  • use cases we can't even think of now

The readme text for contributions should be updated to kindly ask for the training text. The best would be to get the original training text, plus the program that applies modifications to it to build the index.

Since not everyone who would like to check out the language detector needs these training texts, I'd vote for keeping them separate. Otherwise a simple checkout becomes very large.
On the other hand, if there are 150 GitHub projects for 150 languages, and one would like to try out 4-grams on all of them, that's quite some work too... opinions on that?

I don't recall the state of the current language profiles - I just took what was there. I'll have to see if the original texts are (easily) accessible.

Not able to choose certain languages to include

I am working on a project that needs to detect the language of short texts, but they will only ever be Tagalog or English. I wanted to remove all the other language profiles to increase the accuracy of the detection, but I am having trouble doing so. When I just removed the files for the other profiles, it says that they're missing - is there a hard-coded list somewhere?

What is the best way to go about doing this?

(I used to use the shuyo library, and was able to very easily just delete the other language profiles, but it doesn't seem to work the same way with this library?)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.