Comments (2)
You are right. But most of the applications I have used detection for only care about whether the text is Norwegian, English, Chinese, etc., not the "sub" language, so the ISO 639 macro language code "no" is good enough for many. Besides, Arabic is also a macro language, but do people care whether it is Tunisian or Egyptian Arabic or any of the other 30 forms (according to https://en.wikipedia.org/wiki/ISO_639_macrolanguage)? I don't know.
So if someone creates (good enough) detectors for individual languages like nb and nn, maybe the library should also be able to return both the language code and the macro language code?
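To make the suggestion concrete, here is a minimal sketch of how a caller could derive the macro language code from a detected code. The `MacroLanguages` class and its `macroOf` method are hypothetical names used for illustration and are not part of language-detector; the table only covers the two Norwegian codes discussed here.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of language-detector.
final class MacroLanguages {
    // Individual language code -> macro language code.
    // Only the two Norwegian written standards from this thread are listed.
    private static final Map<String, String> MACRO_OF = Map.of(
            "nb", "no",
            "nn", "no");

    // Returns the macro language code, or the input unchanged
    // when no mapping is known for it.
    static String macroOf(String code) {
        return MACRO_OF.getOrDefault(code, code);
    }

    public static void main(String[] args) {
        for (String code : new String[] {"nb", "nn", "de"}) {
            System.out.println(code + " -> " + macroOf(code));
        }
    }
}
```

With a lookup like this, an application that only cares about "Norwegian" could keep working with "no" even if separate nb and nn profiles were added later.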
from language-detector.
Thanks Jan for your comments. You're from Oslo, and in the search business, so your opinion is very relevant.
I see a small difference between Norwegian and Arabic.
Starting from the English Wikipedia article about the Eiffel Tower https://en.wikipedia.org/wiki/Eiffel_Tower you can get to the Arabic one https://ar.wikipedia.org/wiki/%D8%A8%D8%B1%D8%AC_%D8%A5%D9%8A%D9%81%D9%84 and yes there is just one for all Arabic variations.
For Norwegian we have two:
https://no.wikipedia.org/wiki/Eiffelt%C3%A5rnet
https://nn.wikipedia.org/wiki/Eiffelt%C3%A5rnet
Wikipedia is cheating: it labels the first one "Norsk bokmål" but uses the (common) macro language code "no" instead of "nb".
This is similar to what we have with German:
https://de.wikipedia.org/wiki/Eiffelturm
https://als.wikipedia.org/wiki/Eiffelturm
The only difference is that for German the code "de" has a unique meaning.
I guess that this answers the question:
The language profile we have for "Norwegian" must have been created from plain Bokmål content. Just like the language profile for German used "de" pages and no other writing forms or dialects.
Then this is good enough. It's documented now. It matches Wikipedia's use of the language code. And if anyone ever needs to detect Nynorsk separately, they can create a profile for it based on Wikipedia "nn" content.
from language-detector.