Comments (10)
What's the expected encoding, by the way? Is it one that uchardet is currently supposed to be able to detect?
If I apply a change to discard charsets when invalid bytes are detected, your file just ends up as "unknown".
from uchardet.
Well, it looks like it could be one of the ISO-8859 encodings (such as ISO-8859-1 or ISO-8859-15), but the result is meaningless: "àðôû".
In this case, it is completely normal that uchardet cannot detect the encoding. There is no way any algorithm can detect a proper charset for random bytes when many charsets are compatible with these codepoints.
This is the example I gave on linuxfr. It is a Russian word, encoded in WINDOWS-1251.
Oh, and I just realised I swapped the first two characters. It's E0 F0 FB F4: арфы.
Ok.
Well, I see now that it could also be MAC-CYRILLIC with the same characters.
In any case, the current language models return too low a confidence (not even 0.1) for any of these encodings to be recognized with certainty.
I will keep this open for now and see if the Russian models can be improved, but I don't have much hope that uchardet can recognize such short text.
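To make the ambiguity concrete, here is a minimal Python sketch (not part of uchardet) that takes the WINDOWS-1251 bytes of арфы and tries to decode them under a few candidate encodings. The candidate list and the helper name `compatible_decodings` are assumptions made for illustration; the codec names are those of Python's standard library.

```python
# The four bytes of the Russian word, as encoded in WINDOWS-1251.
data = "арфы".encode("cp1251")

# Assumed candidate list, chosen only for this illustration.
CANDIDATES = ["utf-8", "latin-1", "iso8859_15", "cp1251", "mac_cyrillic"]


def compatible_decodings(raw, encodings):
    """Return {encoding: text} for every encoding that decodes raw without error."""
    results = {}
    for name in encodings:
        try:
            # Strict decoding raises UnicodeDecodeError on invalid byte sequences.
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            continue
    return results


if __name__ == "__main__":
    for name, text in compatible_decodings(data, CANDIDATES).items():
        print(f"{name:13} {text}")
```

UTF-8 rejects the bytes outright, but the ISO-8859 candidates, WINDOWS-1251 and MAC-CYRILLIC all accept them, and the last two even yield the same text, so byte validity alone cannot settle the question.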
Maybe the confidence should depend on the percentage of recognised characters, and not on their number.
I don't understand what you are saying. The percentage of recognized characters is always 100%. If we don't recognize some characters, it means they are invalid, in which case it is definitely not the right encoding.
I mean the percentage of frequent characters. I don't know the formula used to determine the confidence, but doesn't the fact that it doesn't work with short character sequences mean that it relies at some point on the number of frequent byte sequences that were found, and not on their percentage?
If not, I still don't understand why it fails to recognize E0 F0 FB F4. It's only very frequent Russian characters encoded in WINDOWS-1251.
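The difference between the two measures can be sketched in a few lines of Python. This is not uchardet's confidence formula, which is not given in this thread; `FREQUENT_LETTERS` is a hand-picked set and the longer sample sentence is made up, both purely for illustration.

```python
# Assumption: a hand-picked set of common Russian letters, for illustration only.
# A real language model would be derived from corpus statistics.
FREQUENT_LETTERS = set("оеаинтсрвл")


def frequent_count(text):
    """Number of characters in text that belong to FREQUENT_LETTERS."""
    return sum(1 for ch in text.lower() if ch in FREQUENT_LETTERS)


def frequent_ratio(text):
    """Share of alphabetic characters in text that belong to FREQUENT_LETTERS."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if ch in FREQUENT_LETTERS) / len(letters)


word = "арфы"
sentence = "арфы стояли в светлом зале"  # hypothetical longer sample

if __name__ == "__main__":
    for sample in (word, sentence):
        print(sample, frequent_count(sample), round(frequent_ratio(sample), 2))
```

A count grows with the length of the input while a ratio does not. On a four-letter word, though, the ratio rests on only four observations, so it is still weak evidence on its own.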
Well, patches are accepted. :-)
Just remember that uchardet still has to be generic, work with all possible languages and encodings (once frequency data has been gathered), and stay fast.
I am moving bug reports to the new hosting.
I think I will close this one though, rather than move it. Uchardet is not meant for detection on such short strings. It is actually pretty good even with short sentences, but with a single word of 4 characters, I think we are getting too close to the limits here.
For such single words, the approach you proposed on linuxfr (using dictionaries) is probably the only viable one, though as you noted yourself, it is quite slow (and uchardet is meant for quick processing, or at least processing quick enough for a comfortable desktop workflow).
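As a rough illustration of the dictionary idea (not the linuxfr proposal itself, whose details are not given here), the following Python sketch decodes the bytes under each candidate encoding and keeps the encodings that produce a known word. The word list, the candidate list and the function name `guess_by_dictionary` are all hypothetical.

```python
# Hypothetical mini word list; a real dictionary would hold many thousands
# of words per language, which is where the cost comes from.
WORDS = {"арфы", "дом", "мир"}

# Assumed candidate encodings, chosen only for this illustration.
CANDIDATES = ["cp1251", "mac_cyrillic", "koi8_r", "latin-1"]


def guess_by_dictionary(raw, encodings=CANDIDATES, words=WORDS):
    """Return the encodings under which raw decodes to a word in the dictionary."""
    matches = []
    for name in encodings:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # invalid bytes for this encoding: rule it out
        if text.lower() in words:
            matches.append(name)
    return matches


if __name__ == "__main__":
    print(guess_by_dictionary("арфы".encode("cp1251")))
```

With these assumptions, both WINDOWS-1251 and MAC-CYRILLIC match, since they map these bytes to the same letters. A dictionary can tell Russian from a meaningless ISO-8859 reading, but it cannot separate encodings that agree on every byte of the input.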
Now I may be wrong, and I happily welcome patches if you can implement an efficient improvement to the algorithm that works with your example (while staying fast and not breaking what currently works): https://bugs.freedesktop.org/enter_bug.cgi?product=uchardet
Also if you have longer texts which are not correctly detected, do not hesitate to report them as well! :-)
Thanks for reporting your issue!