I'm seeing an issue where English UTF-8 encoded text containing a Unicode Right Single Quotation Mark (U+2019) is identified by uchardet as WINDOWS-1250.
$ hexdump -C test/en/utf-8.txt
00000000 45 6e 67 6c 69 73 68 20 74 65 78 74 20 77 69 74 |English text wit|
00000010 68 20 61 20 72 69 67 68 74 20 73 69 6e 67 6c 65 |h a right single|
00000020 20 71 75 6f 74 65 20 28 55 2b 32 30 31 39 29 20 | quote (U+2019) |
00000030 69 6e 73 74 65 61 64 20 6f 66 20 61 6e 20 61 70 |instead of an ap|
00000040 6f 73 74 72 6f 70 68 65 20 73 68 6f 75 6c 64 6e |ostrophe shouldn|
00000050 e2 80 99 74 0d 0a 62 65 20 6d 69 73 74 61 6b 65 |...t..be mistake|
00000060 6e 20 66 6f 72 20 73 6f 6d 65 74 68 69 6e 67 20 |n for something |
00000070 65 6c 73 65 2e |else.|
00000075
$ uchardet test/en/utf-8.txt
WINDOWS-1250
(I've also seen similar files misidentified as WINDOWS-1258, but those have confidential content that I can't share. This smaller sample reproduces the misidentification, though with a slightly different result.)
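
As a sanity check on the sample itself, here is a minimal Python sketch (not part of uchardet; the variable names are mine) that decodes the bytes around offset 0x50 of the hexdump above. It shows that `e2 80 99` is a well-formed UTF-8 sequence for U+2019, and that the same bytes also happen to decode under windows-1250, only as mojibake. That second point is presumably why a statistical detector can be tempted by a single-byte code page, though I haven't confirmed that in uchardet's code.

```python
# Bytes taken from the hexdump: "shouldn" followed by E2 80 99 and "t".
sample = b"shouldn\xe2\x80\x99t"

# Strict UTF-8 decoding succeeds; the three bytes form one code point, U+2019.
as_utf8 = sample.decode("utf-8", errors="strict")

# The same bytes are also decodable as windows-1250, but each byte becomes
# its own character, giving the familiar mojibake for a curly apostrophe.
as_cp1250 = sample.decode("cp1250")

print(repr(as_utf8))    # 'shouldn’t'
print(repr(as_cp1250))  # 'shouldnâ€™t'
```

Since the whole file decodes cleanly under strict UTF-8, I'd expect UTF-8 to be preferred here over a single-byte code page.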