uchardet's People

Contributors

byvoid, cicku, coacher, dinhvh, jehan, llloic11, lovasoa, nu774, wang-bin, wiiaboo

uchardet's Issues

Possibly incomplete project license

Hello.

The README says that the license is just MPL-1.1.
The license headers inside the C files permit any of MPL-1.1, GPL-2+, or LGPL-2.1+.

Shouldn't the README offer multiple licenses as well?

Next release

Hello.

Are there any plans regarding the next uchardet release?

Detect files whose encoding has been corrupted by a text editor?

When a text editor fails to detect the right encoding of a file while opening it, and then saves the file in another encoding, it corrupts the file.

Such files are hard to recover, because no tool exists to detect their correct encoding.

This file was first saved as WINDOWS-1251, then opened as WINDOWS-1252, saved as UTF-8, and ended up on a popular subtitles download site.
Princessa.I.Ljagushka.2009.RUS.BDRip.XviD.AC3.-HQCLUB.Rus.srt.txt

Would it be conceivable that uchardet could one day detect such "composed" charsets?
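
To make the corruption chain concrete, here is a minimal Python sketch (not part of uchardet; the helper name `undo_double_encoding` is hypothetical) that reproduces the WINDOWS-1251, then WINDOWS-1252, then UTF-8 sequence described above and reverses it. It assumes every byte of the original text is defined in Python's `cp1252` codec, which holds for the basic Russian letters but not for every WINDOWS-1251 byte.

```python
def undo_double_encoding(text: str, misread_as: str = "cp1252",
                         actual: str = "cp1251") -> str:
    """Reverse the case where bytes in `actual` were decoded as `misread_as`."""
    # Re-encoding with the wrong codec recovers the original bytes,
    # which can then be decoded with the right codec.
    return text.encode(misread_as).decode(actual)


original = "Привет"

# Simulate the corruption: WINDOWS-1251 bytes opened as WINDOWS-1252,
# then saved as UTF-8.
corrupted_file = original.encode("cp1251").decode("cp1252").encode("utf-8")

mojibake = corrupted_file.decode("utf-8")
print(mojibake)                        # Ïðèâåò
print(undo_double_encoding(mojibake))  # Привет
```

A detector for such "composed" charsets would have to try candidate pairs like this one and score the result, which is a different problem from detecting a single encoding.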

ISO-8859-2 should be detected

In your README, ISO-8859-2 is not listed as supported. Yet I can find a model for it in src/LangHungarianModel.cpp. I tried it with an ISO-8859-2 file I built myself:
https://cloud.libreart.info/public.php?service=files&t=40140bd3fd105b2c03d7716dfe4b498a
Detection fails: the file is reported as "windows-1252".

On the other hand python-chardet was able to properly detect the ISO-8859-2 encoding:

$ chardetect iso-8859-2.smi
iso-8859-2.smi: ISO-8859-2 with confidence 0.850807928898

Considering that both are supposed to be based on the same algorithm from Mozilla, and that your code already mentions this encoding, I think it would be good if it were supported.

New release?

There are no tags in git, and it seems that there are no releases. The version is currently 0.0.1, so at first glance it looks like the project is very young and has just started, but the first commit was created in 2011. Maybe it's time to create a real release.

Cast unsigned int

In:

PRBool nsCharSetProber::FilterWithoutEnglishLetters(const char* aBuf, PRUint32 aLen, char** newBuf, PRUint32& newLen)

and

PRBool nsCharSetProber::FilterWithEnglishLetters(const char* aBuf, PRUint32 aLen, char** newBuf, PRUint32& newLen)

add a cast:

newLen = (unsigned int)(newptr - *newBuf);

Minor header file issues

There are some minor issues in the public header file.

#ifndef ___UCHARDET_H___

Identifiers starting with two underscores are reserved for the implementation. Normal libraries cannot use such identifiers.

typedef void * uchardet_t;

This isn't really ideal. Using void* reduces type-safety, and the typedef obscures the real type of the handle. Also, _t suffixes are reserved for POSIX standard symbols. I suggest doing typedef struct uchardet uchardet; and using it as uchardet *ud in function parameters.

uchardet_t uchardet_new();

The parameter list is empty - in C, this means the parameter list is not defined, and anything is allowed. clang e.g. warns with: uchardet/uchardet.h:52:1: warning: function declaration isn't a prototype. It should be uchardet_t uchardet_new(void);. This compiles in C++ too.

Also, this should probably define that it returns NULL on memory allocation failure.

UTF-8 with right single quote (U+2019) mistaken as Windows-1250

I'm seeing an issue where English UTF-8 encoded text containing a Unicode Right Single Quotation Mark (U+2019) is identified by uchardet as WINDOWS-1250.

$ hexdump -C test/en/utf-8.txt 
00000000  45 6e 67 6c 69 73 68 20  74 65 78 74 20 77 69 74  |English text wit|
00000010  68 20 61 20 72 69 67 68  74 20 73 69 6e 67 6c 65  |h a right single|
00000020  20 71 75 6f 74 65 20 28  55 2b 32 30 31 39 29 20  | quote (U+2019) |
00000030  69 6e 73 74 65 61 64 20  6f 66 20 61 6e 20 61 70  |instead of an ap|
00000040  6f 73 74 72 6f 70 68 65  20 73 68 6f 75 6c 64 6e  |ostrophe shouldn|
00000050  e2 80 99 74 0d 0a 62 65  20 6d 69 73 74 61 6b 65  |...t..be mistake|
00000060  6e 20 66 6f 72 20 73 6f  6d 65 74 68 69 6e 67 20  |n for something |
00000070  65 6c 73 65 2e                                    |else.|
00000075

$ uchardet test/en/utf-8.txt
WINDOWS-1250

(I've also seen this misidentified elsewhere as WINDOWS-1258, but those files have confidential content and I couldn't share them. I was able to reproduce misidentification with this smaller sample, though with a slightly different outcome.)
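
As an illustration of why this sample is ambiguous at the byte level, the following Python sketch (independent of uchardet's code) rebuilds the bytes from the hexdump above and decodes them both ways. Both decodings succeed; only the UTF-8 one yields sensible text, with U+2019 as the single non-ASCII character.

```python
# Bytes reconstructed from the hexdump of test/en/utf-8.txt above.
sample = (
    b"English text with a right single quote (U+2019) instead of an "
    b"apostrophe shouldn\xe2\x80\x99t\r\nbe mistaken for something else."
)

as_utf8 = sample.decode("utf-8")     # strict decoding succeeds
as_cp1250 = sample.decode("cp1250")  # also succeeds, but yields mojibake

# Collect the non-ASCII code points seen under the UTF-8 reading.
non_ascii = [f"U+{ord(ch):04X}" for ch in as_utf8 if ord(ch) > 0x7F]

print(len(sample))  # 117 (0x75), matching the hexdump
print(non_ascii)    # ['U+2019']
# Under WINDOWS-1250 the three bytes e2 80 99 become three separate characters.
print("shouldn\u00e2\u20ac\u2122t" in as_cp1250)  # True
```

Since the whole file is well-formed UTF-8 and contains a multi-byte sequence, one would expect UTF-8 to outrank any single-byte guess here.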

Make a portable executable

Hi,

I like your project and I would like to use your tool on a workstation without manually installing the library.

Can you explain how to create a portable executable of your project?

Thanks in advance!

WINDOWS-1253 file detected as ISO-8859-7

These two are the most common Greek encodings and they are mostly identical. One major difference between them is the mapping of GREEK CAPITAL LETTER ALPHA WITH TONOS (Ά), which is very common in Greek texts/subtitles.

code | ISO 8859-7                                     | windows-1253
------------------------------------------------------------------------------------------------------
0xA1 | [U+2018] LEFT SINGLE QUOTATION MARK            | [U+0385] GREEK DIALYTIKA TONOS
0xA2 | [U+2019] RIGHT SINGLE QUOTATION MARK           | [U+0386] GREEK CAPITAL LETTER ALPHA WITH TONOS
0xA4 | _unassigned_                                   | [U+00A4] CURRENCY SIGN
0xA5 | _unassigned_                                   | [U+00A5] YEN SIGN
0xAE | _unassigned_                                   | [U+00AE] REGISTERED SIGN
0xB5 | [U+0385] GREEK DIALYTIKA TONOS                 | [U+00B5] MICRO SIGN
0xB6 | [U+0386] GREEK CAPITAL LETTER ALPHA WITH TONOS | [U+00B6] PILCROW SIGN

Source: ISO 8859-7 vs. windows-1253

I don't know how the detection works, but a file containing more 0xA2 bytes than 0xB6 bytes would be a strong indication of WINDOWS-1253 (and vice versa).

PS. I use uchardet through mpv for subtitle language detection, and I would say about 10-20% of the subtitles have this problem.

Attached sample file
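
The byte-counting idea from this report can be sketched in a few lines of Python. This is only the reporter's suggested heuristic, not how uchardet actually decides; the function name `guess_greek_codepage` and its return values are hypothetical.

```python
from typing import Optional

ALPHA_TONOS_CP1253 = bytes([0xA2])     # U+0386 in windows-1253
ALPHA_TONOS_ISO8859_7 = bytes([0xB6])  # U+0386 in ISO 8859-7


def guess_greek_codepage(data: bytes) -> Optional[str]:
    """Compare how often each candidate byte for capital alpha with tonos occurs."""
    n_cp1253 = data.count(ALPHA_TONOS_CP1253)
    n_iso = data.count(ALPHA_TONOS_ISO8859_7)
    if n_cp1253 > n_iso:
        return "WINDOWS-1253"
    if n_iso > n_cp1253:
        return "ISO-8859-7"
    return None  # no evidence either way


print(guess_greek_codepage("Άλφα".encode("cp1253")))     # WINDOWS-1253
print(guess_greek_codepage("Άλφα".encode("iso8859_7")))  # ISO-8859-7
```

One caveat: 0xA2 is a right single quotation mark in ISO 8859-7, so text that uses that character as an apostrophe could skew the count. A real implementation would probably weigh this together with other evidence.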

Windows-1251 detection failed on a file in Russian.

I've added some test files in test/.
Among them, there is windows-1251-bulgarian.txt and windows-1251-russian.txt.
The Bulgarian text is correctly detected as Windows-1251, but the Russian one is detected as Mac Cyrillic.
Note that I have checked: neither encoding is a subset of the other, and the wrong detection actually breaks the text (easily checked by converting the encoding with iconv).
It would be worth improving our Russian models for Windows-1251.

I am opening this report so that we don't forget.
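
For anyone who wants to check the overlap without iconv, here is a small Python sketch. It relies on Python's built-in `cp1251` and `mac_cyrillic` codecs; the helper `differing_letters` is hypothetical. It lists which basic Cyrillic letters are encoded differently in the two charsets.

```python
def differing_letters(alphabet: str, a: str = "cp1251",
                      b: str = "mac_cyrillic") -> list:
    """Return the letters whose single-byte encoding differs between a and b."""
    return [ch for ch in alphabet if ch.encode(a) != ch.encode(b)]


lowercase = "".join(chr(c) for c in range(0x0430, 0x0450))  # а..я
uppercase = "".join(chr(c) for c in range(0x0410, 0x0430))  # А..Я

print(differing_letters(lowercase))       # ['я']
print(len(differing_letters(uppercase)))  # 32

# A Windows-1251 text misread as Mac Cyrillic is therefore damaged
# wherever it contains capitals or the letter я.
text = "Привет, Россия"
print(text.encode("cp1251").decode("mac_cyrillic") == text)  # False
```

Since almost all lowercase letters share the same bytes, mostly-lowercase Russian text gives a detector little to separate the two charsets, which may be part of why the model picks the wrong one.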

Can this code be used to make a Windows DLL? How?

Hi, sorry for contacting you through this channel...
THIS IS NOT AN ISSUE AT ALL
...but I don't know how else to ask you this:

I am a Windows programmer (Delphi and Pascal) with zero experience in C (or C++).

My question is:
Can this code be used to make a Windows DLL?
If yes please could you guide me how?
I have some hope because, looking at the CMakeLists.txt file, there is an option for win32 (although it is commented out).

Basic questions:

  • Name and version of a compiler known to build this (on Windows, of course)
  • Basic procedure (or at least some hints)
  • With the right compiler and procedure, can it be done as-is, or does the code need some modification?

Thanks in advance!

PS: I know that a Delphi version of the Mozilla code for Windows exists here, but yours seems more complete and detects many more encodings.

document the difference between this and libchardet

On reviewing bomi, it looks like it uses libchardet, which, like this library, is also based on Mozilla's code. I see that the public APIs of the two projects differ as well; however, having two copies of the same code is not great for the FOSS community in general.

Could anyone give a more detailed account of the differences, and maybe merge the two libraries? For example, which version of Mozilla's code this library contains, the history of both codebases, how easy it would be to merge the two, etc.

Future roadmap?

Hi Carbo,

It's been a long time, so I'd like to ask about the future roadmap here. mpv and other music players would like to support special charsets via this lib, but as far as I can see from the commits, it is not being developed.

Opinions?

uchardet wrongly determines the text as UTF-8

The file has these bytes:

00000000  78 78 78 e2 80 99 78 78  78 0a 63 68 61 72 20 27  |xxx...xxx.char '|
00000010  e2 27 20 28 69 6e 0a 4d  69 6c 6f c5 a1 5f 46 6f  |.' (in.Milo.._Fo|
00000020  72 6d 61 6e 0a                                    |rman.|

Please note that it has 3 non-ascii areas:

1. e2 80 99: is U+2019 RIGHT SINGLE QUOTATION MARK
2. e2: could start a 3-byte UTF-8 sequence, but the following bytes 27 20 are not continuation bytes, so they don't form any UTF-8 symbol
3. c5 a1: U+0161 LATIN SMALL LETTER S WITH CARON

However, uchardet determines that it is UTF-8:

$ uchardet < xxx 
UTF-8

FreeBSD file(1) determines this file as:

$ file xxx 
xxx: C source, Non-ISO extended-ASCII text

I am not sure how it should determine this text, but this isn't UTF-8 for sure.
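
To confirm the claim independently, here is a short Python sketch (the helper `first_invalid_utf8_offset` is hypothetical) that feeds the bytes from the hexdump above to a strict UTF-8 decoder and reports where decoding first fails.

```python
from typing import Optional

# Bytes reconstructed from the hexdump above.
data = b"xxx\xe2\x80\x99xxx\nchar '\xe2' (in\nMilo\xc5\xa1_Forman\n"


def first_invalid_utf8_offset(buf: bytes) -> Optional[int]:
    """Return the offset of the first ill-formed UTF-8 sequence, or None."""
    try:
        buf.decode("utf-8")  # strict mode raises on any ill-formed sequence
    except UnicodeDecodeError as err:
        return err.start
    return None


print(len(data))                        # 37
print(first_invalid_utf8_offset(data))  # 16, i.e. the lone e2 at 0x10
```

The first and third non-ASCII areas decode cleanly, so the lone e2 at offset 0x10 is the only thing that makes the file ill-formed UTF-8.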

GB18030 file detected as WINDOWS-1252

I encountered an issue when using uchardet on a file that is almost entirely English (with very little Chinese).
The file is GB18030 encoded but detected as WINDOWS-1252.

Transferring to uchardet organization?

Hi @BYVoid,

Would you mind if we created a "uchardet" organization on GitHub and transferred the uchardet project there? It would make it more "official" and help it stand out among the various forks.

As an alternative, we could be hosted by a friendly organization, for example in the GNOME repositories (i.e. the main repository would be outside GitHub, though GNOME also has a mirror of all its repositories here: https://github.com/GNOME). I have not asked the GNOME Foundation yet, but I think they would accept. This would not make it a GNOME project, simply a friend project that still keeps its independence.
This second alternative would likely be my favorite. :-) But I'm fine if you are absolutely attached to keeping the main repository on GitHub.

Thanks!

Jehan
