Comments (2)
I'm seeing an issue where English UTF-8 encoded text containing a Unicode Right Single Quotation Mark (U+2019) is identified by uchardet as WINDOWS-1250.
$ hexdump -C test/en/utf-8.txt
00000000  45 6e 67 6c 69 73 68 20  74 65 78 74 20 77 69 74  |English text wit|
00000010  68 20 61 20 72 69 67 68  74 20 73 69 6e 67 6c 65  |h a right single|
00000020  20 71 75 6f 74 65 20 28  55 2b 32 30 31 39 29 20  | quote (U+2019) |
00000030  69 6e 73 74 65 61 64 20  6f 66 20 61 6e 20 61 70  |instead of an ap|
00000040  6f 73 74 72 6f 70 68 65  20 73 68 6f 75 6c 64 6e  |ostrophe shouldn|
00000050  e2 80 99 74 0d 0a 62 65  20 6d 69 73 74 61 6b 65  |...t..be mistake|
00000060  6e 20 66 6f 72 20 73 6f  6d 65 74 68 69 6e 67 20  |n for something |
00000070  65 6c 73 65 2e                                    |else.|
00000075
$ uchardet test/en/utf-8.txt
WINDOWS-1250
(I've also seen this misidentified elsewhere as WINDOWS-1258, but those files have confidential content and I couldn't share them. I was able to reproduce misidentification with this smaller sample, though with a slightly different outcome.)
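For anyone who wants to reproduce this without the file, here is a minimal Python sketch. It rebuilds the sample bytes from the hexdump above, checks that they decode as strict UTF-8, and shows how the three bytes e2 80 99 read under WINDOWS-1250. The helper name guess_encoding and its fallback parameter are hypothetical and are not part of uchardet; this is just one possible workaround (try strict UTF-8 first), not a description of how uchardet works internally.

```python
# Sample bytes reconstructed from the hexdump above (0x75 = 117 bytes).
SAMPLE = (
    b"English text with a right single quote (U+2019) "
    b"instead of an apostrophe shouldn\xe2\x80\x99t\r\n"
    b"be mistaken for something else."
)


def guess_encoding(data: bytes, fallback: str = "cp1250") -> str:
    """Hypothetical helper: prefer UTF-8 whenever the bytes decode strictly.

    `fallback` stands in for whatever a detector would otherwise report.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"


print(len(SAMPLE))                         # 117, matching offset 0x75
print(guess_encoding(SAMPLE))              # utf-8
print(SAMPLE[0x50:0x54].decode("utf-8"))   # ’t
print(SAMPLE[0x50:0x54].decode("cp1250"))  # â€™t
```

The last line shows why the WINDOWS-1250 reading is technically decodable but implausible: the single quotation mark turns into three unrelated characters in the middle of an English word.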
I have the same problem. Have you solved it?
from uchardet.
It looks like development is now occurring at https://gitlab.freedesktop.org/uchardet/uchardet/. You might like to try the latest code there.
I note that other chardets often have the opposite problem: see PyYoshi/cChardet#26 and thombashi/mbstrdecoder#2.