optimaize / language-detector
Language Detection Library for Java
License: Apache License 2.0
See https://code.google.com/p/cld2/
It explains how CLD2 performs its language detection.
Of interest: the probability threshold, which currently defaults to 0.9999; see probabilityThreshold() in the builder.
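A minimal sketch of setting that threshold when building a detector. The method name probabilityThreshold() comes from the note above; its exact signature (a double parameter) and the helper class here are assumptions:

```java
import java.io.IOException;
import java.util.List;

import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;

public class ThresholdDemo {
    // Builds a detector with an explicit probability threshold. 0.9999 is the
    // default mentioned above; lowering it makes detect() return an answer more
    // often, at the cost of accepting less certain results.
    public static LanguageDetector buildDetector(double threshold) throws IOException {
        List<LanguageProfile> profiles = new LanguageProfileReader().readAllBuiltIn();
        return LanguageDetectorBuilder
                .create(NgramExtractors.standard())
                .probabilityThreshold(threshold) // parameter type assumed to be double
                .withProfiles(profiles)
                .build();
    }
}
```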
I happened to find this powerful tool and gave it a quick try. Unfortunately, it is not working out well. Below are the related code snippets; can anyone tell me why?
<dependency>
<groupId>com.optimaize.languagedetector</groupId>
<artifactId>language-detector</artifactId>
<version>0.5</version>
</dependency>
public class LanguageDetection { // renamed from the misspelled "LanuageDetector", which would also shadow the library's LanguageDetector class
    private static final org.slf4j.Logger log = org.slf4j.LoggerFactory.getLogger(LanguageDetection.class); // assuming an SLF4J logger

    //load all languages:
    static List<LanguageProfile> languageProfiles;
    static {
        try {
            languageProfiles = new LanguageProfileReader().readAllBuiltIn();
        } catch (IOException e) {
            log.error("Exception when loading language profiles", e);
        }
    }
    //build language detector:
    static LanguageDetector languageDetector = LanguageDetectorBuilder
        .create(NgramExtractors.standard())
        .withProfiles(languageProfiles)
        .build();
    //create a text object factory
    //(note: forDetectingOnLargeText() targets long text; very short inputs like those below may come back absent)
    static TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();
    public static String detectLang(String text) {
        TextObject textObject = textObjectFactory.forText(text);
        com.google.common.base.Optional<LdLocale> lang = languageDetector.detect(textObject);
        LdLocale locale = lang.orNull();
        return locale == null ? null : locale.getLanguage();
    }
    public static void main(String[] args) {
        String english = "I am English";
        String chinese = "我是简体中文";
        String hindi = "मैं हिन्दी हूं";
        System.out.println(detectLang(english));
        System.out.println(detectLang(chinese));
        System.out.println(detectLang(hindi));
    }
}
Hi guys,
We loved what you did with optimaize and would love to use it in our library. Unfortunately, the current version does not play well with Cybozu. This is a problem for us: many products in our company use Cybozu directly or transitively, so depending on our library would break them, and migrating all of them at once is not feasible.
The problem is mainly with classes that share the same name/package (see [1] for an example stack trace). So the question is: would you accept a PR that fixes this? It would be a backwards-incompatible change, as public classes would be renamed.
Thanks
[1] Optimaize tries to use a method that does not exist in the class with the same name in Cybozu.
Exception in thread "main" java.lang.NoSuchMethodError: com.cybozu.labs.langdetect.util.LangProfile.getFreq()Ljava/util/Map;
at be.frma.langguess.LangProfileReader.read(LangProfileReader.java:76)
at be.frma.langguess.LangProfileReader.read(LangProfileReader.java:48)
at com.optimaize.langdetect.profiles.LanguageProfileReader.read(LanguageProfileReader.java:27)
at com.optimaize.langdetect.profiles.LanguageProfileReader.readAll(LanguageProfileReader.java:154)
...
It is very unclear how to use this library. Is there a jar file? Should I import the entire GitHub project?
So that one instance can be created once (using a builder), then used by multiple threads simultaneously, like a service.
I don't load all models, in order to speed up detection for the languages we actually use. While testing for mis-detections, I found that text like "Η γλώσσα είναι η ικανότητα να αποκτήσουν και να χρησιμοποιήσουν περίπλοκα συστήματα επικοινωνίας , ιδιαίτερα την ανθρώπινη ικανότητα να το πράξουν , και μια γλώσσα είναι κάθε συγκεκριμένο παράδειγμα ενός τέτοιου συστήματος . Η επιστημονική μελέτη της γλώσσας ονομάζεται γλωσσολογία ." is identified as 99% Catalan, even though the text is Greek and Catalan is a Romance language written in the Latin (A-Z) alphabet.
When a text is largely written in one script (e.g. Cyrillic) but still contains some of another script (e.g. Latin), remove the minority-script content as noise.
Make the limit (in percent) configurable.
(Previously, this only allowed removing ASCII, a subset of Latin.)
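A minimal sketch of that idea in plain Java, counting letters per Unicode script and dropping scripts below a configurable percentage. The class and method names are illustrative, not part of the library:

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

public class MinorityScriptFilter {

    /**
     * Removes letters whose Unicode script accounts for less than
     * {@code limitPercent} of all letters in the text. Non-letters
     * (spaces, digits, punctuation) are always kept.
     */
    public static String stripMinorityScripts(String text, double limitPercent) {
        // First pass: count letters per script.
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        int letters = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                letters++;
                counts.merge(UnicodeScript.of(cp), 1, Integer::sum);
            }
        }
        if (letters == 0) return text;
        // Second pass: copy everything except minority-script letters.
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                double pct = 100.0 * counts.get(UnicodeScript.of(cp)) / letters;
                if (pct < limitPercent) continue; // drop minority-script noise
            }
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }
}
```

For "Привет мир abc" the Latin letters make up 25% of all letters, so a 30% limit strips them while a 10% limit keeps the text unchanged.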
Exception in thread "Driver" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
Caused by: java.lang.IllegalArgumentException: Probability must be <= 1 but was Infinity
at com.optimaize.langdetect.DetectedLanguage.<init>(DetectedLanguage.java:47)
at com.optimaize.langdetect.LanguageDetectorImpl.sortProbability(LanguageDetectorImpl.java:240)
at com.optimaize.langdetect.LanguageDetectorImpl.getProbabilities(LanguageDetectorImpl.java:121)
at com.optimaize.langdetect.LanguageDetectorImpl.detect(LanguageDetectorImpl.java:102)
at
As done here https://github.com/rmtheis/language-detection
Hello,
I'm working on adding the Walloon language to LanguageTool, which itself requires proper language detection from language-detector.
I don't see any clear instructions on how to generate a profile, so, as suggested, I'll attach some text files: http://chanae.walon.org/walon/wa.zip
It's a small zip file with some random pages from Wikipedia and rifondou.walon.org (for that last one, I only took texts more than 70 years old); it's about 2 MB of text.
The zip includes plain text dumps as well as the HTML pages (which most often include lang=... tags, in case that is useful to you).
Another thing to know about Walloon is that there are actually two ways of writing it:
A "unified orthography", called "rifondou" (which is the one used in these texts).
And a traditional "Feller" one, which places a lot of emphasis on local accent and phonetics, with the consequence that it is actually not one orthography but a group of orthographies (at the very least there are four main groups: western, central, eastern and southern).
What would be the best thing to do?
Thanks
wa.zip
Faster, and flexible (filter, space-padding). Previously it was hardcoded to 1-, 2- and 3-grams, and it was hardcoded which n-grams were ignored.
There are several places where the term "language" is used, but it is not well defined.
As can be seen in the profile file names, the language itself is not always enough. There are currently "zh-cn" and "zh-tw"; these are not languages, they are combinations of a language with a country.
What is needed is a locale: a language, optionally with a script and/or region.
Examples:
Giving predictable results. Using all text (all n-grams), no random picking.
Hi everybody,
Does anyone know how to use language detection? I am new to Java.
Thanks in advance.
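For questions like this one, a minimal end-to-end sketch following the builder pattern used elsewhere on this page (the class name and sample sentence are mine):

```java
import java.io.IOException;
import java.util.List;

import com.google.common.base.Optional;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObject;
import com.optimaize.langdetect.text.TextObjectFactory;

public class QuickStart {
    public static void main(String[] args) throws IOException {
        // 1. Load all built-in language profiles.
        List<LanguageProfile> profiles = new LanguageProfileReader().readAllBuiltIn();

        // 2. Build the detector once; it can then be reused for many texts.
        LanguageDetector detector = LanguageDetectorBuilder
                .create(NgramExtractors.standard())
                .withProfiles(profiles)
                .build();

        // 3. Wrap the input text and detect.
        TextObjectFactory factory = CommonTextObjectFactories.forDetectingOnLargeText();
        TextObject textObject = factory.forText("This is clearly an English sentence about language detection.");
        Optional<LdLocale> lang = detector.detect(textObject);
        System.out.println(lang.isPresent() ? lang.get().getLanguage() : "unknown");
    }
}
```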
Replace last lib dependency with Maven (jsonic)
I am very happy to find this tool.
I have a question: can this library help me detect Arabic dialects (Syrian, Iraqi, Gulf)?
I will try to build a corpus for each dialect and add it as a language profile; is that the right approach?
I am working on a project that needs to detect the language of short texts, but they will only ever be Tagalog or English. I wanted to remove all the other language profiles to increase detection accuracy, but I am having trouble doing so. When I just removed the files for the other profiles, it says they're missing. Is there a hard-coded list somewhere?
What is the best way to go about doing this?
(I used to use the shuyo library and could very easily just delete the other language profiles, but it doesn't seem to work the same way with this library.)
This is my text:
"印刷用のトナーにおいては近赤外線を照射すると消色(無色化)する消色トナーが知られており、この消色トナーを用いて印刷を行う各種の画像処理装置が提案されている."
This text is detected as Chinese, although it is Japanese. How can I solve this problem?
Example text:
设为首页收藏本站 开启辅助访问 为首页收藏本站 开启辅助访为首页收藏本站 开启辅助访切换到窄版 请 登录 后使用快捷导航 没有帐号 注册 用户名 Email 自动登录 找回密码 密码 登录 注册 快捷导航 论坛BBS 导读Guide 排行榜Ranklist 淘帖Collection 日志Blog 相册Album 分享Share 搜索 搜索 帖子 用户 公告
Chinese is clearly more dominant than English, but the detector can't detect Chinese at all, and getProbabilities() returns this list:
[DetectedLanguage[en:0.8571376011773154], DetectedLanguage[fr:0.14286031717254952]]
French? I have no idea where it sees French.
If I remove the end (with those English words), it detects the Chinese language fine.
I don't think a few English words in such a dominantly Chinese text should produce such a wrong result.
Affixes are important in detecting languages written in scripts such as Latin; they matter more than what is in the middle of words.
I see that the original project has many more short-text profiles than this project. Is it possible for you to upload these short-text profiles from the original project into this one?
Original:
https://github.com/shuyo/language-detection/tree/master/profiles.sm
This project:
https://github.com/optimaize/language-detector/tree/master/src/main/resources/languages.shorttext
src/main/java/overview.html still refers to the Apache License, while the README now refers to LGPLv3.
Also, did the original authors agree to the license change?
Hi,
I'm having a "de" response with > 0.99 score for a text like the following:
6LSHOJDV 5LYR 8LERXSLQ 8UPDV 5DXGVHSS 0DULQH 6\VWHPV ,QVWLWXWH DW 7DOOLQQ 8QLYHUVLW\ RI 7HFKQRORJ\
It's only used in 'verbose' mode, so it's not urgent. Not used by default.
[DetectedLanguage[ar:0.8556187887595297], DetectedLanguage[ur:0.12626434999662134]]
languageDetector.detect(textObject) prints the above line as output, but the function returns Optional.absent(). Can anyone tell me how to pick the most probable language (here: Arabic with 85.5%)?
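When detect() returns absent() because no language reaches the high probability threshold, one can fall back to the ranked list from getProbabilities(), as shown in the output above, and take its first entry. A minimal helper sketch (the helper name is mine; it assumes the list is sorted by descending probability, as the sortProbability frame in the earlier stack trace suggests):

```java
import java.util.List;

import com.optimaize.langdetect.DetectedLanguage;

public class BestGuess {
    /**
     * Pass in the result of languageDetector.getProbabilities(textObject).
     * Returns the language code of the top-ranked candidate (e.g. "ar" in the
     * example above), or null when the list is empty.
     */
    public static String bestGuess(List<DetectedLanguage> ranked) {
        return ranked.isEmpty() ? null : ranked.get(0).getLocale().getLanguage();
    }
}
```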
The detector cannot even detect English text (large enough); it just returns Optional.absent(). It also works slowly. Wasted time.
Take them from the original project, 53 languages.
An IOException occurs when I try to load the built-in profiles:
java.io.IOException: No language file available named af at languages/af!
at com.optimaize.langdetect.profiles.LanguageProfileReader.readBuiltIn(LanguageProfileReader.java:91)
at com.optimaize.langdetect.profiles.LanguageProfileReader.readAllBuiltIn(LanguageProfileReader.java:127)
The jar file was exported by Eclipse. The profiles are located under /resources/languages in the jar file.
This didn't happen when running directly from Eclipse.
The readAllBuiltIn method seems to be trying to load them from /languages/.
I noticed that even the slightest inclusion of English terms (like brand names) results in absolute nonsense as output:
Language: [it]
Score: [0.15132266]
Text: [小猫终于换发型! Ariana Grande为新砖染白金长发 爱莉安娜为新专辑变金发
Language: [nl]
Score: [0.2858114]
Text: [东京风尚 EP36 电子音乐节Audio2015 高颜值军团潮搭 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。本期来到日本最大型的音乐节!来看看日本潮人参加音乐节都是穿什么的哦!潮人搭配:上衣Gamber;裤子Galson;鞋子Reebok;背包Nike。
Language: [nl]
Score: [0.52030957]
Text: [东京风尚下北泽篇 EP28 优雅时尚的精品男适用街 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。今天主持人洋子来到了有名的东京代官山,男生的潮人们可以参考一下!这里必逛:TSUTAVA BOOKS。采访潮人搭配:T恤Bobson;衬衫Rage Blue;裤子Diesel;帽子CA4LA;鞋子Nike Jordan。
Language: [zh-CN]
Score: [0.5714257]
Text: [光泽魅惑色盘 打造中性秋季妆容 用PONY新改良光泽魅惑色盘打造中性秋季妆容]
Language: [es]
Score: [0.285714]
Text: [东京风尚银座篇 EP42 优雅而时尚的精品店街 TOKYO TRENDS东京风尚,穿梭东京最热点,直击日本潮人搭配。本期美女主持阳子带我们来到银座的一条街道,这里聚集了潮牌的精品店,所以各位喜欢潮牌的男士一定不能错过!本期采访潮男搭配:Lad Musician上衣;WEGO裤子;Dr.Martens鞋子。
See https://code.google.com/p/language-detection/wiki/Tools
The ported classes should be moved to a proper package namespace. Remove deprecated classes. Perhaps also include the GenProfile-from-free-text feature from lang-guess in the command line tool.
Discuss whether to add another standalone jar with all dependencies, to support java -jar invocation of the cmdline tool.
Code quality improvements:
Optional<String> lang = languageDetector.detect(textObject);
gives an error "Type mismatch: cannot convert from Optional<LdLocale> to Optional<String>".
It's 2014, and 7 is the only Java version officially supported by Oracle.
detectBlockShortText does not break once CONV_THRESHOLD has been reached. Depending on the text size, this leads to zero probabilities for all languages.
The Bulgarian sentence
Европа не трябва да стартира нов конкурентен маратон и изход с приватизация
yields a zero probability for all languages and, therefore, no result.
To reproduce, add the following line to runTests in the DataLanguageDetectorImplTest unit test:
assertEquals(detector.getProbabilities(text("Европа не трябва да стартира нов конкурентен маратон и изход с приватизация")).get(0).getLocale().getLanguage(), "bg");
I removed the seed parameter recently because I did not understand its use.
Re-add it, named "consistent-results" or similar.
This needs to go into the LanguageDetector, and the CommandLineInterface is affected too. See the git history.
Some users selectively choose which built-in language profiles to load.
Others want all.
And yet others want all except some.
For the last group it must be super-simple to filter.
Currently, only the language code is reliably available.
I'll also need the script, and whether the language is artificial (Esperanto) or extinct.
Since we probably don't want to include ICU just for that, we'll have to provide this data for the built-in languages.
As it is now, one has to load a profile to figure out its primary script. It could be mapped so that it is known early, but I don't want to enforce too much, because there can also be user-defined profiles; we'd have to require it in the file name, and I would not do that.
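For the "all except some" group mentioned above, a sketch of filtering with the current API, loading every built-in profile first and filtering afterwards (the helper class and the exclusion set are illustrative):

```java
import java.io.IOException;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;

public class ProfileFilter {
    /** Loads all built-in profiles except those whose language code is excluded. */
    public static List<LanguageProfile> allExcept(Set<String> excludedLanguages) throws IOException {
        return new LanguageProfileReader().readAllBuiltIn().stream()
                .filter(p -> !excludedLanguages.contains(p.getLocale().getLanguage()))
                .collect(Collectors.toList());
    }
}
```

The resulting list can be passed to LanguageDetectorBuilder.withProfiles() as usual.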
Norwegian is a macrolanguage. The written standards are Bokmål and Nynorsk.
About 90% of all publications are in Bokmål, the remaining 10% in Nynorsk.
Technically, the code "no" may not be used in this context, because it is not clear which standard it refers to.
I am pretty sure that the training text was from one standard only, and that it was Bokmål. But without the original training text info I would not be able to tell (based on n-grams). A Norwegian might know the differences...
So the only way to fix this is to either get the original training text or to create a new profile. Or two profiles, to distinguish them.
I specified which languages to include in readBuiltIn:
List<LdLocale> languages = ImmutableList.of(
        LdLocale.fromString("en"),
        LdLocale.fromString("tl"));
List<LanguageProfile> languageProfiles = new LanguageProfileReader().readBuiltIn(languages);
LanguageDetector languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard())
        .withProfiles(languageProfiles)
        .build();
TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingShortCleanText();
TextObject textObject = textObjectFactory.forText("basta");
List<DetectedLanguage> lang = languageDetector.getProbabilities(textObject);
However, when I run this, I am still getting results for languages that I did not include.
I've created a profile for Khmer: http://danielnaber.de/tmp/km.zip. Here's what I did:
egrep -v "[a-zA-Z]+"
=> 37,000 sentences, 9 MB

Hello,
There is an issue when I use the jar file here:
http://search.maven.org/remotecontent?filepath=com/optimaize/languagedetector/language-detector/0.5/language-detector-0.5.jar
The code used is exactly the same as in the readme section "How to Use".
The exception is caused by this line:
List<LanguageProfile> languageProfiles = new LanguageProfileReader().readAllBuiltIn();
And the exception is:
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;
at com.optimaize.langdetect.i18n.LdLocale.fromString(LdLocale.java:77)
at com.optimaize.langdetect.profiles.BuiltInLanguages.<clinit>(BuiltInLanguages.java:21)
at com.optimaize.langdetect.profiles.LanguageProfileReader.readAllBuiltIn(LanguageProfileReader.java:118)
Please note that with the v0.4 jar I don't have this issue and everything works fine.
Hey,
Is there a published Maven artifact anywhere?
Can the dependency on stax-api be removed? If I remove it from the pom.xml and run mvn clean test, everything still works.
Hi!
This file contains two methods, getLanguages() and getShortTextLanguages():
https://github.com/optimaize/language-detector/blob/master/src/main/java/com/optimaize/langdetect/profiles/BuiltInLanguages.java
What's the difference between a language and a short-text language?
Updated Javadocs would be nice, plus an answer right here of course :).
Regards /Johan
"幫助" ("help" in Traditional Chinese) is incorrectly detected as Korean.
danielnaber mentioned that language profiles should come with the training text they are based on.
I totally agree with that. This allows anyone to play with customizations to improve the profiles, such as:
The readme text for contributions should be updated to kindly ask for the training text. Best would be the original training text, plus the program that applies modifications to it to build the profile.
Since not everyone who checks out the language detector needs these training texts, I'd vote for keeping them separate; otherwise a simple checkout becomes very large.
On the other hand, if there are 150 GitHub projects for 150 languages and one would like to try out 4-grams on all of them, that's also quite some work... opinions on that?
I don't recall the state of the current language profiles - I just took what was there. I'll have to see if the original texts are (easily) accessible.
I've created a profile for Esperanto: http://danielnaber.de/tmp/eo.zip. It's based on the data from http://tatoeba.org, which is about 15 MB of plain Esperanto text.
According to document "Requirements for support for Sami languages in data processing" (http://www.sami.lwp.se/01-850-51.pdf), "Basic Level" support for Sami is achieved by detecting North Sami, Lule Sami and South Sami.
Resources:
http://crubadan.org/languages/se
http://crubadan.org/languages/sma
http://crubadan.org/languages/smj