optimaize / language-detector
Language Detection Library for Java
License: Apache License 2.0
See https://code.google.com/p/cld2/
It explains how CLD2 performs its language detection.
Of interest: the probability threshold, which currently defaults to 0.9999; see probabilityThreshold() in the builder.
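A minimal sketch of setting that threshold when building a detector. The method name probabilityThreshold() comes from the note above; its exact signature (a double parameter) and the helper class here are assumptions:

```java
import java.io.IOException;
import java.util.List;

import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;

public class ThresholdDemo {
    // Builds a detector with an explicit probability threshold. 0.9999 is the
    // default mentioned above; lowering it makes detect() return an answer more
    // often, at the cost of accepting less certain results.
    public static LanguageDetector buildDetector(double threshold) throws IOException {
        List<LanguageProfile> profiles = new LanguageProfileReader().readAllBuiltIn();
        return LanguageDetectorBuilder
                .create(NgramExtractors.standard())
                .probabilityThreshold(threshold) // parameter type assumed to be double
                .withProfiles(profiles)
                .build();
    }
}
```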
I happened to find this powerful tool and gave it a quick try. Unfortunately, it is not working out well. Below are the related code snippets; can anyone tell me why?
<dependency>
<groupId>com.optimaize.languagedetector</groupId>
<artifactId>language-detector</artifactId>
<version>0.5</version>
</dependency>
public class LanguageDetection { // renamed from the misspelled "LanuageDetector", which would also shadow the library's LanguageDetector class
    private static final org.slf4j.Logger log = org.slf4j.LoggerFactory.getLogger(LanguageDetection.class); // assuming an SLF4J logger

    //load all languages:
    static List<LanguageProfile> languageProfiles;
    static {
        try {
            languageProfiles = new LanguageProfileReader().readAllBuiltIn();
        } catch (IOException e) {
            log.error("Exception when loading language profiles", e);
        }
    }
    //build language detector:
    static LanguageDetector languageDetector = LanguageDetectorBuilder
        .create(NgramExtractors.standard())
        .withProfiles(languageProfiles)
        .build();
    //create a text object factory
    //(note: forDetectingOnLargeText() targets long text; very short inputs like those below may come back absent)
    static TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();
    public static String detectLang(String text) {
        TextObject textObject = textObjectFactory.forText(text);
        com.google.common.base.Optional<LdLocale> lang = languageDetector.detect(textObject);
        LdLocale locale = lang.orNull();
        return locale == null ? null : locale.getLanguage();
    }
    public static void main(String[] args) {
        String english = "I am English";
        String chinese = "我是简体中文";
        String hindi = "मैं हिन्दी हूं";
        System.out.println(detectLang(english));
        System.out.println(detectLang(chinese));
        System.out.println(detectLang(hindi));
    }
}
Hi guys,
We loved what you did with optimaize and would love to use it in our library. Unfortunately, the current version does not play well with Cybozu. This is a problem for us: many products in our company use Cybozu directly or transitively, so depending on our library would break them, and migrating all of them at once is not feasible.
The problem is mainly with classes that share the same name/package (see [1] for an example stack trace). So the question is: would you accept a PR that fixes this? It would be a backwards-incompatible change, as public classes would be renamed.
Thanks
[1] Optimaize tries to use a method that does not exist in the class with the same name in Cybozu.
Exception in thread "main" java.lang.NoSuchMethodError: com.cybozu.labs.langdetect.util.LangProfile.getFreq()Ljava/util/Map;
at be.frma.langguess.LangProfileReader.read(LangProfileReader.java:76)
at be.frma.langguess.LangProfileReader.read(LangProfileReader.java:48)
at com.optimaize.langdetect.profiles.LanguageProfileReader.read(LanguageProfileReader.java:27)
at com.optimaize.langdetect.profiles.LanguageProfileReader.readAll(LanguageProfileReader.java:154)
...
It is very unclear how to use this library. Is there a jar file? Should I import the entire GitHub project?
So that one instance can be created once (using a builder), then used by multiple threads simultaneously, like a service.
I don't load all models, in order to speed up detection for the languages we actually use. While testing for mis-detections, I found that text like "Η γλώσσα είναι η ικανότητα να αποκτήσουν και να χρησιμοποιήσουν περίπλοκα συστήματα επικοινωνίας , ιδιαίτερα την ανθρώπινη ικανότητα να το πράξουν , και μια γλώσσα είναι κάθε συγκεκριμένο παράδειγμα ενός τέτοιου συστήματος . Η επιστημονική μελέτη της γλώσσας ονομάζεται γλωσσολογία ." is identified as 99% Catalan, even though the text is Greek and Catalan is a Romance language written in the Latin (A-Z) alphabet.
When a text is largely written in one script (e.g. Cyrillic) but still contains some of another script (e.g. Latin), remove the minority-script content as noise.
Make the limit (in percent) configurable.
(Previously, this only allowed removing ASCII, a subset of Latin.)
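A minimal sketch of that idea in plain Java, counting letters per Unicode script and dropping scripts below a configurable percentage. The class and method names are illustrative, not part of the library:

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

public class MinorityScriptFilter {

    /**
     * Removes letters whose Unicode script accounts for less than
     * {@code limitPercent} of all letters in the text. Non-letters
     * (spaces, digits, punctuation) are always kept.
     */
    public static String stripMinorityScripts(String text, double limitPercent) {
        // First pass: count letters per script.
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        int letters = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                letters++;
                counts.merge(UnicodeScript.of(cp), 1, Integer::sum);
            }
        }
        if (letters == 0) return text;
        // Second pass: copy everything except minority-script letters.
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                double pct = 100.0 * counts.get(UnicodeScript.of(cp)) / letters;
                if (pct < limitPercent) continue; // drop minority-script noise
            }
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }
}
```

For "Привет мир abc" the Latin letters make up 25% of all letters, so a 30% limit strips them while a 10% limit keeps the text unchanged.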
Exception in thread "Driver" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
Caused by: java.lang.IllegalArgumentException: Probability must be <= 1 but was Infinity
at com.optimaize.langdetect.DetectedLanguage.<init>(DetectedLanguage.java:47)
at com.optimaize.langdetect.LanguageDetectorImpl.sortProbability(LanguageDetectorImpl.java:240)
at com.optimaize.langdetect.LanguageDetectorImpl.getProbabilities(LanguageDetectorImpl.java:121)
at com.optimaize.langdetect.LanguageDetectorImpl.detect(LanguageDetectorImpl.java:102)
at
As done here https://github.com/rmtheis/language-detection
Hello,
I'm working on adding the Walloon language to LanguageTool, which itself requires proper language detection from language-detector.
I don't see any clear instructions on how to generate a profile, so, as suggested, I'll attach some text files: http://chanae.walon.org/walon/wa.zip
It's a small zip file with some random pages from Wikipedia and rifondou.walon.org (for that last one, I only took texts more than 70 years old); it's about 2 MB of text.
The zip includes plain text dumps as well as the HTML pages (which most often include lang=... tags, in case that is useful to you).
Another thing to know about Walloon is that there are actually two ways of writing it:
A "unified orthography", called "rifondou" (which is the one used in these texts).
And a traditional "Feller" one, which places a lot of emphasis on local accent and phonetics, with the consequence that it is actually not one orthography but a group of orthographies (at the very least there are four main groups: western, central, eastern and southern).
What would be the best thing to do?
Thanks
wa.zip
Faster, and flexible (filter, space-padding). Previously it was hardcoded to 1-, 2- and 3-grams, and it was hardcoded which n-grams were ignored.
There are several places where the term "language" is used, but it is not well defined.
As can be seen in the profile file names, the language itself is not always enough. There are currently "zh-cn" and "zh-tw"; these are not languages, they are combinations of a language with a country.
What is needed is a locale: a language, optionally with a script and/or region.
Examples:
Giving predictable results. Using all text (all n-grams), no random picking.
Hi everybody,
Does anyone know how to use language detection? I am new to Java.
Thanks in advance.
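For questions like this one, a minimal end-to-end sketch following the builder pattern used elsewhere on this page (the class name and sample sentence are mine):

```java
import java.io.IOException;
import java.util.List;

import com.google.common.base.Optional;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObject;
import com.optimaize.langdetect.text.TextObjectFactory;

public class QuickStart {
    public static void main(String[] args) throws IOException {
        // 1. Load all built-in language profiles.
        List<LanguageProfile> profiles = new LanguageProfileReader().readAllBuiltIn();

        // 2. Build the detector once; it can then be reused for many texts.
        LanguageDetector detector = LanguageDetectorBuilder
                .create(NgramExtractors.standard())
                .withProfiles(profiles)
                .build();

        // 3. Wrap the input text and detect.
        TextObjectFactory factory = CommonTextObjectFactories.forDetectingOnLargeText();
        TextObject textObject = factory.forText("This is clearly an English sentence about language detection.");
        Optional<LdLocale> lang = detector.detect(textObject);
        System.out.println(lang.isPresent() ? lang.get().getLanguage() : "unknown");
    }
}
```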
Replace last lib dependency with Maven (jsonic)
I am very happy to find this tool.
I have a question: can this library help me detect Arabic dialects (Syrian, Iraqi, Gulf)?
I will try to build a corpus for each dialect and add it as a language profile; is that the right approach?
I am working on a project that needs to detect the language of short texts, but they will only ever be Tagalog or English. I wanted to remove all the other language profiles to increase detection accuracy, but I am having trouble doing so. When I just removed the files for the other profiles, it says they're missing. Is there a hard-coded list somewhere?
What is the best way to go about doing this?
(I used to use the shuyo library and could very easily just delete the other language profiles, but it doesn't seem to work the same way with this library.)
This is my text:
"印刷用のトナーにおいては近赤外線を照射すると消色(無色化)する消色トナーが知られており、この消色トナーを用いて印刷を行う各種の画像処理装置が提案されている."
This text is detected as Chinese, although it is Japanese. How can I solve this problem?
Example text:
设为首页收藏本站 开启辅助访问 为首页收藏本站 开启辅助访为首页收藏本站 开启辅助访切换到窄版 请 登录 后使用快捷导航 没有帐号 注册 用户名 Email 自动登录 找回密码 密码 登录 注册 快捷导航 论坛BBS 导读Guide 排行榜Ranklist 淘帖Collection 日志Blog 相册Album 分享Share 搜索 搜索 帖子 用户 公告
Chinese is clearly more dominant than English, but the detector can't detect Chinese at all, and getProbabilities() returns this list:
[DetectedLanguage[en:0.8571376011773154], DetectedLanguage[fr:0.14286031717254952]]
French? I have no idea where it sees French.
If I remove the end (with those English words), it detects the Chinese language fine.
I don't think a few English words in such a dominantly Chinese text should produce such a wrong result.
Affixes are important in detecting languages written in scripts such as Latin; they matter more than what is in the middle of words.
I see that the original project has many more short-text profiles than this project. Is it possible for you to upload these short-text profiles from the original project into this one?
Original:
https://github.com/shuyo/language-detection/tree/master/profiles.sm
This project:
https://github.com/optimaize/language-detector/tree/master/src/main/resources/languages.shorttext
src/main/java/overview.html still refers to the Apache License, while the README now refers to LGPLv3.
Also, did the original authors agree to the license change?
Hi,
I'm having a "de" response with > 0.99 score for a text like the following:
6LSHOJDV 5LYR 8LERXSLQ 8UPDV 5DXGVHSS 0DULQH 6\VWHPV ,QVWLWXWH DW 7DOOLQQ 8QLYHUVLW\ RI 7HFKQRORJ\
It's only used in 'verbose' mode, so it's not urgent. Not used by default.
[DetectedLanguage[ar:0.8556187887595297], DetectedLanguage[ur:0.12626434999662134]]
languageDetector.detect(textObject) prints the above line as output, but the function returns Optional.absent(). Can anyone tell me how to pick the most probable language (here: Arabic with 85.5%)?
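When detect() returns absent() because no language reaches the high probability threshold, one can fall back to the ranked list from getProbabilities(), as shown in the output above, and take its first entry. A minimal helper sketch (the helper name is mine; it assumes the list is sorted by descending probability, as the sortProbability frame in the earlier stack trace suggests):

```java
import java.util.List;

import com.optimaize.langdetect.DetectedLanguage;

public class BestGuess {
    /**
     * Pass in the result of languageDetector.getProbabilities(textObject).
     * Returns the language code of the top-ranked candidate (e.g. "ar" in the
     * example above), or null when the list is empty.
     */
    public static String bestGuess(List<DetectedLanguage> ranked) {
        return ranked.isEmpty() ? null : ranked.get(0).getLocale().getLanguage();
    }
}
```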
The detector cannot even detect English text (large enough); it just returns Optional.absent(). It also works slowly. Wasted time.
Take them from the original project, 53 languages.
An IOException occurs when I try to load the built-in profiles:
java.io.IOException: No language file available named af at languages/af!
at com.optimaize.langdetect.profiles.LanguageProfileReader.readBuiltIn(LanguageProfileReader.java:91)
at com.optimaize.langdetect.profiles.LanguageProfileReader.readAllBuiltIn(LanguageProfileReader.java:127)
The jar file was exported by Eclipse. The profiles are located under /resources/languages in the jar file.
This didn't happen when running directly from Eclipse.
The readAllBuiltIn method seems to be trying to load them from /languages/.
I noticed that even the slightest inclusion of English terms (like brand names) results in absolute nonsense as output:
Language: [it]
Score: [0.15132266]
Text: [小猫终于换发型! Ariana Grande为新砖染白金长发 爱莉安娜为新专辑变金发
Language: [nl]
Score: [0.2858114]
Text: [东京风尚 EP36 电子音乐节Audio2015 高颜值军团潮搭 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。本期来到日本最大型的音乐节!来看看日本潮人参加音乐节都是穿什么的哦!潮人搭配:上衣Gamber;裤子Galson;鞋子Reebok;背包Nike。
Language: [nl]
Score: [0.52030957]
Text: [东京风尚下北泽篇 EP28 优雅时尚的精品男适用街 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。今天主持人洋子来到了有名的东京代官山,男生的潮人们可以参考一下!这里必逛:TSUTAVA BOOKS。采访潮人搭配:T恤Bobson;衬衫Rage Blue;裤子Diesel;帽子CA4LA;鞋子Nike Jordan。
Language: [zh-CN]
Score: [0.5714257]
Text: [光泽魅惑色盘 打造中性秋季妆容 用PONY新改良光泽魅惑色盘打造中性秋季妆容]
Language: [es]
Score: [0.285714]
Text: [东京风尚银座篇 EP42 优雅而时尚的精品店街 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。本期美女主持阳子带我们来到银座的一条街道,这里聚集了潮牌的精品店,所以各位喜欢潮牌的男士一定不能错过!本期采访潮男搭配:Lad Musician上衣;WEGO裤子;Dr.Martens鞋子。
See https://code.google.com/p/language-detection/wiki/Tools
The ported classes should be moved to a proper package namespace. Remove deprecated classes. Perhaps also include the GenProfile-from-free-text feature from lang-guess in the command line tool.
Discuss whether to add another standalone jar with all dependencies, to support java -jar invocation of the cmdline tool.
Code quality improvements:
Optional<String> lang = languageDetector.detect(textObject);
gives an error "Type mismatch: cannot convert from Optional<LdLocale> to Optional<String>".
It's 2014, and 7 is the only Java version officially supported by Oracle.
detectBlockShortText does not break once CONV_THRESHOLD has been reached. Depending on the text size, this leads to zero probabilities for all languages.
The Bulgarian sentence
Европа не трябва да стартира нов конкурентен маратон и изход с приватизация
yields a zero probability for all languages and, therefore, no result.
To reproduce, add the following line to runTests in the DataLanguageDetectorImplTest unit test:
assertEquals(detector.getProbabilities(text("Европа не трябва да стартира нов конкурентен маратон и изход с приватизация")).get(0).getLocale().getLanguage(), "bg");
I removed the seed parameter recently because I did not understand its use.
Re-add it, named "consistent-results" or similar.
This needs to go into the LanguageDetector, and the CommandLineInterface is affected too. See the git history.
Some users selectively choose which built-in language profiles to load.
Others want all.
And yet others want all except some.
For the last group it must be super-simple to filter.
Currently, only the language code is reliably available.
I'll also need the script, and whether the language is artificial (Esperanto) or extinct.
Since we probably don't want to include ICU just for that, we'll have to provide this data for the built-in languages.
As it is now, one has to load a profile to figure out its primary script. It could be mapped so that it is known early, but I don't want to enforce too much, because there can also be user-defined profiles; we'd have to require it in the file name, and I would not do that.
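For the "all except some" group mentioned above, a sketch of filtering with the current API, loading every built-in profile first and filtering afterwards (the helper class and the exclusion set are illustrative):

```java
import java.io.IOException;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;

public class ProfileFilter {
    /** Loads all built-in profiles except those whose language code is excluded. */
    public static List<LanguageProfile> allExcept(Set<String> excludedLanguages) throws IOException {
        return new LanguageProfileReader().readAllBuiltIn().stream()
                .filter(p -> !excludedLanguages.contains(p.getLocale().getLanguage()))
                .collect(Collectors.toList());
    }
}
```

The resulting list can be passed to LanguageDetectorBuilder.withProfiles() as usual.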
Norwegian is a macrolanguage. The written standards are Bokmål and Nynorsk.
About 90% of all publications are in Bokmål, the remaining 10% in Nynorsk.
Technically, the code "no" may not be used in this context, because it is not clear which standard it refers to.
I am pretty sure that the training text was from one standard only, and that it was Bokmål. But without the original training text info I would not be able to tell (based on n-grams). A Norwegian might know the differences...
So the only way to fix this is to either get the original training text or to create a new profile. Or two profiles, to distinguish them.
I specified which languages to include in readBuiltIn:
List<LdLocale> languages = ImmutableList.of(
        LdLocale.fromString("en"),
        LdLocale.fromString("tl"));
List<LanguageProfile> languageProfiles = new LanguageProfileReader().readBuiltIn(languages);
LanguageDetector languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard())
        .withProfiles(languageProfiles)
        .build();
TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingShortCleanText();
TextObject textObject = textObjectFactory.forText("basta");
List<DetectedLanguage> lang = languageDetector.getProbabilities(textObject);
However, when I run this, I am still getting results for languages that I did not include.
I've created a profile for Khmer: http://danielnaber.de/tmp/km.zip. Here's what I did:
egrep -v "[a-zA-Z]+"
=> 37,000 sentences, 9 MB

Hello,
There is an issue when I use the jar file here:
http://search.maven.org/remotecontent?filepath=com/optimaize/languagedetector/language-detector/0.5/language-detector-0.5.jar
The code used is exactly the same as in the readme section "How to Use".
The exception is caused by this line:
List<LanguageProfile> languageProfiles = new LanguageProfileReader().readAllBuiltIn();
And the exception is:
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;
at com.optimaize.langdetect.i18n.LdLocale.fromString(LdLocale.java:77)
at com.optimaize.langdetect.profiles.BuiltInLanguages.<clinit>(BuiltInLanguages.java:21)
at com.optimaize.langdetect.profiles.LanguageProfileReader.readAllBuiltIn(LanguageProfileReader.java:118)
Please note that with the v0.4 jar I don't have this issue and everything works fine.
Hey,
Is there a published Maven artifact anywhere?
Can the dependency on stax-api be removed? If I remove it from the pom.xml and run mvn clean test, everything still works.
Hi!
This file contains two methods, getLanguages() and getShortTextLanguages():
https://github.com/optimaize/language-detector/blob/master/src/main/java/com/optimaize/langdetect/profiles/BuiltInLanguages.java
What's the difference between a language and a short-text language?
Updated Javadocs would be nice, plus an answer right here of course :).
Regards /Johan
"幫助" ("help" in Traditional Chinese) is incorrectly detected as Korean.
danielnaber mentioned that language profiles should come with the training text they are based on.
I totally agree with that. This allows anyone to play with customizations to improve the profiles, such as:
The readme text for contributions should be updated to kindly ask for the training text. Best would be the original training text, plus the program that applies modifications to it to build the profile.
Since not everyone who checks out the language detector needs these training texts, I'd vote for keeping them separate; otherwise a simple checkout becomes very large.
On the other hand, if there are 150 GitHub projects for 150 languages and one would like to try out 4-grams on all of them, that's also quite some work... opinions on that?
I don't recall the state of the current language profiles - I just took what was there. I'll have to see if the original texts are (easily) accessible.
I've created a profile for Esperanto: http://danielnaber.de/tmp/eo.zip. It's based on the data from http://tatoeba.org, which is about 15 MB of plain Esperanto text.
According to document "Requirements for support for Sami languages in data processing" (http://www.sami.lwp.se/01-850-51.pdf), "Basic Level" support for Sami is achieved by detecting North Sami, Lule Sami and South Sami.
Resources:
http://crubadan.org/languages/se
http://crubadan.org/languages/sma
http://crubadan.org/languages/smj