
Comments (2)

lemonl2 commented on July 21, 2024

I'm seeing an issue where English UTF-8 encoded text containing a Unicode Right Single Quotation Mark (U+2019) is identified by uchardet as WINDOWS-1250.

$ hexdump -C test/en/utf-8.txt 
00000000  45 6e 67 6c 69 73 68 20  74 65 78 74 20 77 69 74  |English text wit|
00000010  68 20 61 20 72 69 67 68  74 20 73 69 6e 67 6c 65  |h a right single|
00000020  20 71 75 6f 74 65 20 28  55 2b 32 30 31 39 29 20  | quote (U+2019) |
00000030  69 6e 73 74 65 61 64 20  6f 66 20 61 6e 20 61 70  |instead of an ap|
00000040  6f 73 74 72 6f 70 68 65  20 73 68 6f 75 6c 64 6e  |ostrophe shouldn|
00000050  e2 80 99 74 0d 0a 62 65  20 6d 69 73 74 61 6b 65  |...t..be mistake|
00000060  6e 20 66 6f 72 20 73 6f  6d 65 74 68 69 6e 67 20  |n for something |
00000070  65 6c 73 65 2e                                    |else.|
00000075

$ uchardet test/en/utf-8.txt
WINDOWS-1250

(I've also seen this misidentified elsewhere as WINDOWS-1258, but those files have confidential content and I couldn't share them. I was able to reproduce misidentification with this smaller sample, though with a slightly different outcome.)
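To make the report easier to check without the file, here is a minimal Python sketch that rebuilds the 117 bytes from the hexdump above and decodes them both ways. It only shows that the sample is valid UTF-8 and also happens to decode without error as Windows-1250; it does not reproduce or explain uchardet's internal scoring. The names `sample`, `is_valid_utf8`, and `decode_utf8_first` are hypothetical helpers introduced for illustration, not part of uchardet.

```python
# Bytes reconstructed from the hexdump of test/en/utf-8.txt (0x75 = 117 bytes).
sample = (
    b"English text with a right single quote (U+2019) instead of an "
    b"apostrophe shouldn\xe2\x80\x99t\r\nbe mistaken for something else."
)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly under strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_utf8_first(data: bytes, fallback: str) -> str:
    """Possible workaround: trust strict UTF-8 before a detector's guess.

    `fallback` stands in for whatever encoding a detector reported.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)


if __name__ == "__main__":
    print(len(sample))                 # byte count, matches the hexdump
    print(is_valid_utf8(sample))       # the sample is well-formed UTF-8
    # The same three bytes E2 80 99 are also legal Windows-1250 characters,
    # so decoding with the reported encoding yields mojibake, not an error.
    print(sample.decode("cp1250").splitlines()[0][-10:])
    print(decode_utf8_first(sample, "cp1250").splitlines()[0][-9:])
```

Checking strict UTF-8 validity before accepting a detector's answer is one possible workaround on the caller's side, since long non-ASCII byte runs that happen to be well-formed UTF-8 are rare in legacy single-byte text. With only one multibyte character, as in this sample, that check is weaker evidence, so treat it as a heuristic.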

I have the same problem. Have you solved it?


jayvdb commented on July 21, 2024

It looks like development is now occurring at https://gitlab.freedesktop.org/uchardet/uchardet/. You might like to try the latest code there.

I note that other chardets often have the opposite problem: see PyYoshi/cChardet#26 and thombashi/mbstrdecoder#2.
