These two are the most common Greek encodings and they are mostly identical. One major

WINDOWS-1253 file detected as ISO-8859-7 about uchardet HOT 10 CLOSED

byvoid commented on July 21, 2024

WINDOWS-1253 file detected as ISO-8859-7

from uchardet.

Comments (10)

Jehan commented on July 21, 2024

Thanks for the bug report.
Indeed, these kinds of slight differences between two encodings are the most difficult part. (What matters is the characters of a given language; how much the encodings differ on characters of other languages doesn't matter much.)

I don't know how the detection works but more 0xA2 than 0xB6 would be a strong indication of WINDOWS-1253 (and vice versa).

uchardet is mostly statistical (not exclusively: it is a mix of techniques, but statistics are the bigger part, in particular for single-byte encodings like here). The issue with the kind of examples you give (I encountered similar examples for other languages/encodings) is that the alternative is not strictly an error. Indeed I would assume that Greek could have right single quotation mark as well. No? Yet I can foresee some improvements. I'll look into it.
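To make the ambiguity concrete, here is a small sketch using Python's built-in `cp1253` and `iso8859_7` codecs (not uchardet itself). The helper name `decode_in_both` is only illustrative. It shows that the same byte is valid in both encodings but maps to a different character, which is why neither reading is strictly an error.

```python
def decode_in_both(value: int) -> dict[str, str]:
    """Decode a single byte under both Greek single-byte encodings."""
    raw = bytes([value])
    return {
        "WINDOWS-1253": raw.decode("cp1253"),
        "ISO-8859-7": raw.decode("iso8859_7"),
    }

# 0xA2 is 'Ά' (U+0386) in WINDOWS-1253 but a right single quotation
# mark (U+2019) in ISO-8859-7.
a2 = decode_in_both(0xA2)

# 0xB6 is 'Ά' (U+0386) in ISO-8859-7 but a pilcrow sign (U+00B6)
# in WINDOWS-1253.
b6 = decode_in_both(0xB6)
```

A text with many 0xA2 bytes at the start of words therefore points to WINDOWS-1253, while the same pattern with 0xB6 points to ISO-8859-7.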

What is this character used for exactly: Ά?
In my logs, when I built the Greek language model, this character was not at all among the most used characters (less than 0.1%). Now, my data was Wikipedia articles, and you seem to say that this character is mostly used in subtitles. Is it not used in "usual" Greek texts other than subtitles? My main problem here would be that I don't have access to thousands of Greek subtitles (whereas I have access to a huge quantity of Wikipedia articles), and if I were to use free-of-charge subtitles from the web, I am not even sure of the legality of training the engine on these (since simply downloading them usually breaks copyright! There are very, very few subtitles legally available for download).

PS. I use uchardet through mpv for the subtitle language detection and I would say about 10-20% of the subtitles have this problem

Happy that you still get good detections 80% of the time, since the detection system that mpv used to have would never have detected your encoding (not even remotely, since neither encoding is supported by enca). :-)

Jehan commented on July 21, 2024

Note: in any case, I'll retrain (tomorrow) the Greek engine by forcing some importance on this character and still using Wikipedia data. Hopefully it could be enough to improve the detection.

larvanitis commented on July 21, 2024

Thanks for the quick reply.

Indeed I would assume that Greek could have right single quotation mark as well.

Yes, indeed.

What is this character used for exactly: Ά?

The Greek alpha Αα is equivalent to the Latin Aa. It gets accented Άά (like all vowels) where the word is pronounced louder, when the word has at least two syllables (e.g. if English had the same accenting rule, the leading a in animal would be accented).

Also, the capitalization rules are the same as in English (first word of a sentence, names, etc.).

And you seem to say that this character is mostly used in subtitles. Is it not used in "usual" Greek texts other than subtitles?

It is used everywhere, but subtitles tend to have a lot of short sentences and names, so the capitalization rules apply more frequently than in longer texts such as Wikipedia.

Out of curiosity 1:
How do you train from Wikipedia? Do you get the UTF-8 content and convert it to the various encodings, which you then feed to the algorithm?

Out of curiosity 2:
Do you know if mpv analyses the whole subtitle file contents including the format data (timing, markup etc) or just the stripped text which is to be actually displayed?

Jehan commented on July 21, 2024

Also the capitalization rules are the same as english (first word of a
sentence, names etc)

My algorithm tends to lowercase everything anyway (so 'a' and 'A' are the same),
which makes things simpler. uchardet does not have any grammatical logic
embedded (like what is a "sentence"?). It is purely statistical. Up to now, this
does not seem to affect the quality of the detection much (which is very
efficient, even though it could obviously be better).

By the way, I was wrong yesterday when I said that 'Ά' was rarely used. I just forgot to search for it as lowercase. It actually accounts for nearly 2% of characters, which makes it the 16th or 17th (depending on the data I used) most used character in Greek texts.
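As a rough illustration of this kind of count (a sketch only, not uchardet's actual training script; `char_frequencies` is a hypothetical helper), lowercasing before counting folds 'Ά' and 'ά' into a single entry, which is why searching for the uppercase form alone understates its frequency:

```python
from collections import Counter

def char_frequencies(text: str) -> list[tuple[str, float]]:
    """Return (character, relative frequency) pairs, most frequent first.

    The text is lowercased first, so 'Ά' and 'ά' count as the same character.
    Non-letters (spaces, digits, punctuation) are ignored.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    counts = Counter(letters)
    return [(ch, n / total) for ch, n in counts.most_common()]

# Tiny made-up sample; a real model would use a large corpus.
freqs = dict(char_frequencies("Άλλα άρα"))
```

In this sample the accented alpha appears once in each case, and both occurrences end up under the lowercase key.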

How do you train from Wikipedia? Do you get the UTF-8 content and convert it
to the various encodings, which you then feed to the algorithm?

Exactly what you say. Obviously I set a maximum number of pages (otherwise it
could just go on and on indefinitely).

Do you know if mpv analyses the whole subtitle file contents including the
format data (timing, markup etc) or just the stripped text which is to be
actually displayed?

Not sure. Obviously stripping the text would lead to more accuracy, but I don't know if they bother (and uchardet stays efficient even with text mixed with some English markup). Moreover, it may not be easy to actually strip the markup if you don't even know which encoding the file uses (though on the other hand, I imagine that most markup characters would be ASCII, and I don't know if there exists any encoding which is not a superset of ASCII. Yet that could still lead to much more complicated parsing).
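The idea that ASCII markup can be stripped without knowing the encoding can be sketched as follows. This assumes an SRT-style subtitle file in an ASCII-compatible single-byte encoding; `strip_srt_markup` and its patterns are hypothetical helpers for illustration, and this is not what mpv or uchardet actually do.

```python
import re

# Cue numbers ("12") and timing lines ("00:00:01,000 --> 00:00:02,500"),
# matched on raw bytes so no decoding is needed.
_SRT_NOISE = re.compile(
    rb"^\s*(\d+|\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}.*)\s*$"
)
# Simple inline tags such as <i> or </font>.
_TAGS = re.compile(rb"<[^<>]*>")

def strip_srt_markup(raw: bytes) -> bytes:
    """Keep only the displayed text lines of an SRT file, as raw bytes."""
    kept = []
    for line in raw.splitlines():
        if _SRT_NOISE.match(line):
            continue  # cue index or timing line
        line = _TAGS.sub(b"", line).strip()
        if line:
            kept.append(line)
    return b"\n".join(kept)

sample = "1\r\n00:00:01,000 --> 00:00:02,500\r\n<i>Άννα!</i>\r\n\r\n".encode("cp1253")
text_only = strip_srt_markup(sample)
```

One limitation of this sketch: a dialogue line made only of digits would be mistaken for a cue number and dropped.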

So yes, my guess is that they don't strip anything, but I have not actually checked! Feel free to check and tell me. :-)

Jehan commented on July 21, 2024

Hi @larvanitis,

I have pushed some changes. I want to test this more thoroughly on various files in other languages, so it is not certain that it won't change or even be reverted.

Yet could you test it in the current version and tell me if it improves detection of your various files and subtitles?
Thanks!

larvanitis commented on July 21, 2024

My algorithm tends to lowercase everything anyway (so 'a' and 'A' are the same),
which makes things simpler. uchardet does not have any grammatical logic
embedded (like what is a "sentence"?). It is purely statistical.

I think that's the culprit in this case. From what you said in your post I conclude that the process:

1. takes UTF-8 Ά
2. converts it to UTF-8 ά
3. converts it, for training each encoding, to:
   1. WINDOWS-1253 ά (0xDC instead of 0xA2)
   2. ISO-8859-7 ά (0xDC instead of 0xB6)

If you notice, the significant difference from this character gets lost during the lowercase conversion.
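The byte values in this hypothesis can be checked with Python's built-in codecs. This only verifies the encoding facts listed above; it says nothing about how uchardet's training actually works, and `encoded_forms` is a hypothetical helper.

```python
def encoded_forms(char: str) -> dict[str, bytes]:
    """Encode one character in both Greek single-byte encodings."""
    return {
        "WINDOWS-1253": char.encode("cp1253"),
        "ISO-8859-7": char.encode("iso8859_7"),
    }

# Uppercase accented alpha: the two encodings disagree (0xA2 vs 0xB6).
upper = encoded_forms("Ά")

# After lowercasing, both encodings produce the same byte (0xDC).
lower = encoded_forms("Ά".lower())
```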

I am not sure what a good solution would be, but this might affect other languages with similar differences among their encodings, especially ISO/Microsoft pairs.

mpv... So yes, my guess is that they don't strip anything, but I have not actually checked! Feel free to check and tell me. :-)

I went on and asked on mpv-player/mpv#3180

I have pushed some changes. I want to test this more thoroughly on various files in other languages, so it is not certain that it won't change or even be reverted.

Yet could you test it in the current version and tell me if it improves detection of your various files and subtitles?

I'd be happy to. Where can I get your modified code or binary (even better:)? I have access to Linux and Windows and can compile under the first.

Jehan commented on July 21, 2024

I think that's the culprit in this case. From what you said in your post I conclude that the process: [...]

No, that's not how this works. Lowercasing is not a problem here. You should reason in terms of characters, not encodings. Statistics are language-based, not encoding-based. In any case, there are no conversion errors here.
Considering lower and upper case as different characters is not a solution.

I went on and asked on mpv-player/mpv#3180

Answer is as I thought.

Where can I get your modified code or binary (even better:)?

No binary, but you can get the updated code here on github:

git clone https://github.com/BYVoid/uchardet.git

Then build with cmake.

larvanitis commented on July 21, 2024

Do you mean I should build the master? Its last commit was March 27.

Jehan commented on July 21, 2024

Oops, sorry! I am slowly moving away from GitHub and did push, but to another remote! I have now updated the GitHub remote with the latest commits as well.

larvanitis commented on July 21, 2024

I tested 5-6 files using the uchardet CLI command and now they are detected correctly.
I am also closing the issue.

Thanks for your time and support!
