byvoid / uchardet

An encoding detector library ported from Mozilla

License: Other

C++ 63.36% Shell 0.06% C 21.44% CMake 0.85% Python 14.29%

uchardet's Introduction

uchardet

uchardet moved!

uchardet is now a freedesktop project.

The page: https://www.freedesktop.org/wiki/Software/uchardet/
Bug reports: https://gitlab.freedesktop.org/uchardet/uchardet/-/issues
Releases: https://www.freedesktop.org/software/uchardet/releases/
Code: https://cgit.freedesktop.org/uchardet/uchardet/

Please update your links to the project. New releases, code updates and announcements will happen on the Freedesktop hosting.

uchardet's Issues

Make building static build optional

Please add a CMake option to disable building the static library.

UTF-8 with right single quote (U+2019) mistaken as Windows-1250

I'm seeing an issue where English UTF-8 encoded text containing a Unicode Right Single Quotation Mark (U+2019) is identified by uchardet as WINDOWS-1250.

$ hexdump -C test/en/utf-8.txt 
00000000  45 6e 67 6c 69 73 68 20  74 65 78 74 20 77 69 74  |English text wit|
00000010  68 20 61 20 72 69 67 68  74 20 73 69 6e 67 6c 65  |h a right single|
00000020  20 71 75 6f 74 65 20 28  55 2b 32 30 31 39 29 20  | quote (U+2019) |
00000030  69 6e 73 74 65 61 64 20  6f 66 20 61 6e 20 61 70  |instead of an ap|
00000040  6f 73 74 72 6f 70 68 65  20 73 68 6f 75 6c 64 6e  |ostrophe shouldn|
00000050  e2 80 99 74 0d 0a 62 65  20 6d 69 73 74 61 6b 65  |...t..be mistake|
00000060  6e 20 66 6f 72 20 73 6f  6d 65 74 68 69 6e 67 20  |n for something |
00000070  65 6c 73 65 2e                                    |else.|
00000075

$ uchardet test/en/utf-8.txt
WINDOWS-1250

(I've also seen this misidentified elsewhere as WINDOWS-1258, but those files have confidential content and I couldn't share them. I was able to reproduce misidentification with this smaller sample, though with a slightly different outcome.)
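
The sample is pure ASCII apart from one well-formed UTF-8 sequence, which is why the result is surprising. A minimal sketch of that observation, using only Python's standard codecs: this is not how uchardet works internally, and `looks_like_utf8` is a hypothetical helper named here for illustration.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data contains non-ASCII bytes and is well-formed UTF-8."""
    if data.isascii():
        return False  # pure ASCII gives no evidence for or against UTF-8
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Bytes taken from the hexdump above: "shouldn", e2 80 99, "t"
sample = b"shouldn\xe2\x80\x99t"

print(looks_like_utf8(sample))
print(repr(sample.decode("utf-8")))   # read as UTF-8
print(repr(sample.decode("cp1250")))  # the same bytes read as WINDOWS-1250
```

The same three bytes are also a legal WINDOWS-1250 string, so a detector that weighs single-byte models heavily can pick the wrong answer even though the UTF-8 reading is far more plausible.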

ISO-8859-2 should be detected

Your README does not list ISO-8859-2 as supported, yet I can find a model for it in src/LangHungarianModel.cpp. I tried it with an ISO-8859-2 file I built myself:
https://cloud.libreart.info/public.php?service=files&t=40140bd3fd105b2c03d7716dfe4b498a
Detection fails: the file is reported as "windows-1252".

On the other hand python-chardet was able to properly detect the ISO-8859-2 encoding:

$ chardetect iso-8859-2.smi
iso-8859-2.smi: ISO-8859-2 with confidence 0.850807928898

Considering that both are supposed to be based on the same algorithm from Mozilla, and that your code already mentions this encoding, I think it would be cool if it were supported.

Detect files whose encoding has been corrupted by a text editor?

Sometimes a text editor fails to detect the right encoding when opening a file, and then saves the file in another encoding, which corrupts it.

Such files are hard to recover, because no tool exists to detect their correct encoding.

This file was first saved as WINDOWS-1251, then opened as WINDOWS-1252, saved as UTF-8, and ended up on a popular subtitles download site.
Princessa.I.Ljagushka.2009.RUS.BDRip.XviD.AC3.-HQCLUB.Rus.srt.txt

Would it be conceivable that uchardet could one day detect such "composed" charsets?
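
For reference, the chain described in this report can be reproduced and undone with Python's standard codecs. This is only a sketch: the sample string is made up (it is not taken from the attached file), `corrupt` and `repair` are hypothetical names, and the repair works only if every byte survived the round trip.

```python
def corrupt(text: str) -> bytes:
    """Reproduce the reported chain: WINDOWS-1251 bytes read as WINDOWS-1252, saved as UTF-8."""
    return text.encode("cp1251").decode("cp1252").encode("utf-8")


def repair(data: bytes) -> str:
    """Undo the chain step by step, assuming no byte was lost along the way."""
    return data.decode("utf-8").encode("cp1252").decode("cp1251")


original = "Принцесса и лягушка"  # illustrative text only
damaged = corrupt(original)

print(damaged.decode("utf-8"))  # what the damaged file looks like in an editor
print(repair(damaged))          # the recovered Russian text
```

Detecting such a file automatically would mean trying candidate chains like this one and scoring the result, which is a different problem from single-step charset detection.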

Add a dbus service

Cf. title. :-)

Can this code be used to make a Windows DLL? How?

Hi, sorry for contacting you through this channel...
THIS IS NOT AN ISSUE AT ALL
...but I don't know how else to ask you this:

I am a Windows programmer (Delphi and Pascal) with zero experience in C (or C++).

My question is:
Can this code be used to make a Windows DLL?
If yes, could you please guide me on how?
I have some hope because the CMakeLists.txt file has an option for win32 (although it is commented out).

Basic questions:

Which compiler name and version do you know can compile this (on Windows, of course)?
What is the basic procedure (or at least some hints)?
With the right compiler and procedure, can it be done as-is, or does the code need some modification?

Thanks in advance!

PS: I know that a Windows version of the Mozilla code written in Delphi exists here, but yours seems more complete and detects many more encodings.

WINDOWS-1253 file detected as ISO-8859-7

These two are the most common Greek encodings and they are mostly identical. One major difference between them is the mapping of GREEK CAPITAL LETTER ALPHA WITH TONOS (Ά), which is very common in Greek texts/subtitles.

code | ISO 8859-7                                     | windows-1253
------------------------------------------------------------------------------------------------------
0xA1 | [U+2018] LEFT SINGLE QUOTATION MARK            | [U+0385] GREEK DIALYTIKA TONOS
0xA2 | [U+2019] RIGHT SINGLE QUOTATION MARK           | [U+0386] GREEK CAPITAL LETTER ALPHA WITH TONOS
0xA4 | _unassigned_                                   | [U+00A4] [CURRENCY SIGN]
0xA5 | _unassigned_                                   | [U+00A5] [YEN SIGN]
0xAE | _unassigned_                                   | [U+00AE] [REGISTERED SIGN]
0xB5 | [U+0385] GREEK DIALYTIKA TONOS                 | [U+00B5] [MICRO SIGN]
0xB6 | [U+0386] GREEK CAPITAL LETTER ALPHA WITH TONOS | [U+00B6] [PILCROW SIGN]

Source: ISO 8859-7 vs. windows-1253

I don't know how the detection works but more 0xA2 than 0xB6 would be a strong indication of WINDOWS-1253 (and vice versa).
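
The heuristic suggested above could be sketched as follows. This is a hypothetical illustration of the reporter's idea, not uchardet's detection logic; `guess_greek_encoding` is an invented name, and the function only helps once the text is already known to be Greek in one of these two encodings.

```python
from typing import Optional


def guess_greek_encoding(data: bytes) -> Optional[str]:
    """Compare the two byte values that encode U+0386 in each charset.

    0xA2 is U+0386 in WINDOWS-1253; 0xB6 is U+0386 in ISO-8859-7.
    Returns None when the counts are equal and give no evidence.
    """
    count_a2 = data.count(b"\xa2")
    count_b6 = data.count(b"\xb6")
    if count_a2 > count_b6:
        return "WINDOWS-1253"
    if count_b6 > count_a2:
        return "ISO-8859-7"
    return None


print(guess_greek_encoding("Άλφα".encode("cp1253")))
print(guess_greek_encoding("Άλφα".encode("iso8859_7")))
```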

PS. I use uchardet through mpv for the subtitle language detection and I would say about 10-20% of the subtitles have this problem

Attached sample file

document the difference between this and libchardet

While reviewing bomi, I noticed that it uses libchardet, which, like this library, is based on Mozilla's code. The public APIs of the two projects differ; however, having two copies of the same code is not great for the FOSS community in general.

Could anyone give a more detailed account of the differences, and maybe merge the two libraries? For example, which version of Mozilla's code this library contains, the history of both codebases, how easy it would be to merge the two, etc.

New release?

There are no tags in git, and there seem to be no releases. The version is currently 0.0.1, so at first glance the project looks very young, as if it had just started, but the first commit was created in 2011. Maybe it's time to create a real release.

Next release

Hello.

Are there any plans regarding the next uchardet release?

Future roadmap?

Hi Carbo,

It's been a long time, so I'd like to ask about the future roadmap here. mpv and other music players would like to support special charsets via this lib, but judging from the commits it is not being actively developed.

Opinions?

Make a portable executable

Hi,

I like your project and I would like to use your tool on a workstation without manually installing the library.

Can you explain how to create a portable executable of your project?

Thanks in advance!

Invalid WINDOWS-1255 file detected as WINDOWS-1255

Uchardet detects this file as WINDOWS-1255 whereas it contains the octet 0xFB, which is invalid in this charset.

How to reproduce:

$ echo -ne "\xf0\xe0\xfb\xf4" | uchardet
> WINDOWS-1255
$ echo -ne "\xf0\xe0\xfb\xf4" | iconv -f WINDOWS-1255
> נiconv: invalid escape sequence at position 2
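
The same check can be made without iconv. The sketch below assumes Python's built-in `cp1255` codec as the reference mapping and only makes sense for single-byte encodings; `undefined_byte_offsets` is a hypothetical helper, not part of uchardet.

```python
from typing import List


def undefined_byte_offsets(data: bytes, encoding: str) -> List[int]:
    """Return offsets of bytes that have no mapping in a single-byte encoding."""
    offsets = []
    for offset, value in enumerate(data):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            offsets.append(offset)
    return offsets


print(undefined_byte_offsets(b"\xf0\xe0\xfb\xf4", "cp1255"))
```

A detector could use a non-empty result to rule a candidate charset out before comparing language-model scores.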

Can't detect GBK.

gbk.txt
The detected result is UTF-8, but it's GBK actually.

Detection of short ID3 text from music files

For short ID3 text from music files, detection often seems to return an empty charset ("").

Minor header file issues

There are some minor issues in the public header file.

#ifndef ___UCHARDET_H___

Identifiers starting with two underscores are reserved for the implementation. Normal libraries cannot use such identifiers.

typedef void * uchardet_t;

This isn't really ideal. Using void* reduces type safety, and the typedef obscures the real type of the handle. Also, _t suffixes are reserved for POSIX standard symbols. I suggest doing typedef struct uchardet uchardet; and using it as uchardet *ud in function parameters.

uchardet_t uchardet_new();

The parameter list is empty - in C, this means the parameter list is not defined, and anything is allowed. clang e.g. warns with: uchardet/uchardet.h:52:1: warning: function declaration isn't a prototype. It should be uchardet_t uchardet_new(void);. This compiles in C++ too.

Also, this should probably define that it returns NULL on memory allocation failure.

GB18030 file detected as WINDOWS-1252

I encountered an issue when using uchardet on a file that is almost entirely English (with very little Chinese).
The file is GB18030 encoded but is detected as WINDOWS-1252.

PACKAGE_NAME opencc???

CMakeLists.txt ->

set (PACKAGE_NAME opencc)

This likely should be "uchardet"

uchardet wrongly determines the text as UTF-8

The file has these bytes:

00000000  78 78 78 e2 80 99 78 78  78 0a 63 68 61 72 20 27  |xxx...xxx.char '|
00000010  e2 27 20 28 69 6e 0a 4d  69 6c 6f c5 a1 5f 46 6f  |.' (in.Milo.._Fo|
00000020  72 6d 61 6e 0a                                    |rman.|

Please note that it has 3 non-ascii areas:

1. e2 80 99: is U+2019 RIGHT SINGLE QUOTATION MARK
2. e2: could start a 3-byte UTF-8 sequence, but the following bytes 27 20 are not valid continuation bytes
3. c5 a1: U+0161 LATIN SMALL LETTER S WITH CARON

However, uchardet determines that it is UTF-8:

$ uchardet < xxx 
UTF-8

FreeBSD file(1) determines this file as:

$ file xxx 
xxx: C source, Non-ISO extended-ASCII text

I am not sure how it should determine this text, but this isn't UTF-8 for sure.
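
The claim that this is not UTF-8 can be checked with a strict decoder. The sketch below rebuilds the file from the hexdump above and uses Python's built-in UTF-8 codec; `first_invalid_utf8_offset` is a hypothetical helper name.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first malformed UTF-8 sequence, or None if data is valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# The 0x25 bytes shown in the hexdump above
data = bytes.fromhex(
    "78 78 78 e2 80 99 78 78 78 0a 63 68 61 72 20 27 "
    "e2 27 20 28 69 6e 0a 4d 69 6c 6f c5 a1 5f 46 6f "
    "72 6d 61 6e 0a"
)

offset = first_invalid_utf8_offset(data)
print(None if offset is None else hex(offset))
```

The reported offset points at the lone e2 byte in the second non-ASCII area, while the first and third areas decode cleanly.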

Lowercase German umlauts in UTF-8 are detected incorrectly

Test in shell:

echo -n ä | uchardet
-> TIS-620

echo -n ö | uchardet
-> TIS-620

echo -n ü | uchardet
-> ISO-8859-7

Uppercase works fine: Ä, Ö, Ü, and also ß.

System: Ubuntu 16.04
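
Part of the difficulty is that a two-byte input is a legal string in several encodings at once. The sketch below lists how the same bytes read under a few candidates, using Python's standard codecs; the candidate list is an assumption chosen to match the results above, and `strict_decodings` is a hypothetical helper, not a uchardet function.

```python
from typing import Dict, Iterable


def strict_decodings(data: bytes, encodings: Iterable[str]) -> Dict[str, str]:
    """Map each encoding that can strictly decode data to the resulting text."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # not decodable, or codec not available in this Python build
    return results


for char in "äöü":
    raw = char.encode("utf-8")
    print(char, raw.hex(), strict_decodings(raw, ["utf-8", "tis_620", "iso8859_7"]))
```

With so little input, byte validity alone cannot separate the candidates, so the outcome depends entirely on how the detector ranks them.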

Possibly incomplete project license

Hello.

The README says that the license is just MPL-1.1.
License headers inside C files permit any of MPL-1.1, GPL-2+, LGPL-2.1+.

Shouldn't README offer multiple licenses as well?

Transferring to uchardet organization?

Hi @BYVoid,

Would you mind if we create a "uchardet" organization on GitHub and transfer the uchardet project there? It would make the project more "official" and help it stand out among the various forks.

As an alternative, we could be hosted by a friendly organization, for example in the GNOME repository (i.e. the main repository would be outside GitHub, though GNOME also mirrors all its repositories here: https://github.com/GNOME). I have not asked the GNOME Foundation yet, but I think they would accept. This would not make it a GNOME project, simply a friend project that keeps its independence.
This second alternative would likely be my favorite. :-) But I'm fine if you are absolutely attached to keeping the main repository on GitHub.

Thanks!

Jehan

LangModels refs error

LangModels refs error in build-mac/uchardet.xcodeproj

uchardet wrongly determines the text as WINDOWS-1252

The file is named 123.txt and its content is "hour时.txt" or "hour间.txt". uchardet reports the file's charset as "WINDOWS-1252", but it is actually "UTF-8". Could you help with this?

Could libuchardet-ios.a support the iOS Simulator?

Windows-1251 detection failed on a file in Russian.

I've added some test files in test/.
Among them, there is windows-1251-bulgarian.txt and windows-1251-russian.txt.
The Bulgarian text is well detected as Windows 1251, but the Russian one is detected as Mac Cyrillic.
Note that I have checked: neither encoding is a subset of the other, and the wrong detection actually breaks the text (easily verified by converting the encoding with iconv).
It would be worth improving our Russian models for Windows-1251.

I'm opening this report as a reminder.

Cast unsigned int

In:

PRBool nsCharSetProber::FilterWithoutEnglishLetters(const char* aBuf, PRUint32 aLen, char** newBuf, PRUint32& newLen)

PRBool nsCharSetProber::FilterWithEnglishLetters(const char* aBuf, PRUint32 aLen, char** newBuf, PRUint32& newLen)

Add a cast:

newLen = (unsigned int)(newptr - *newBuf);

Needs updating to the current Mozilla code

Mozilla people say that uchardet code is 3 years old (https://bugzilla.mozilla.org/show_bug.cgi?id=1105839).