I get this error with most of my books: 'utf8' codec can't decode by

Error in parsing some books about calibre-kobo-driver HOT 33 CLOSED

jgoguen commented on May 24, 2024

Error in parsing some books

from calibre-kobo-driver.

Comments (33)

ComicSans commented on May 24, 2024

Same here. About 10% of my books fail while converting. If I reconvert them first to epub again, most of them work, some don't.

jgoguen commented on May 24, 2024

This is actually failing in calibre itself and indicates that the book (or more specifically, whatever file within the book it just tried to parse) has mixed file encodings or identifies itself as a UTF-8 file but has invalid byte sequences within the file. In the specific stack trace here, the message says that if the file were encoded properly, it would expect the byte 0xc3 to be followed by something else. Another common byte to see is 0xe2.
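The failure described above can be reproduced in a few lines. This is a minimal sketch in Python 3 (the calibre release in this thread ran on Python 2, where the codec name prints as 'utf8'); the sample bytes and the helper name `utf8_failure` are made up for illustration and are not part of calibre or the driver.

```python
def utf8_failure(data: bytes):
    """Return (byte, position, reason) for the first invalid sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return hex(data[exc.start]), exc.start, exc.reason
    return None


# 0xc3 opens a two-byte sequence, so a continuation byte must follow it.
print(utf8_failure(b"caf\xc3\xa9"))  # valid UTF-8 for "café"
print(utf8_failure(b"caf\xc3"))      # data ends right after the lead byte
print(utf8_failure(b"caf\xc3 bar"))  # lead byte followed by plain ASCII
```

The position reported is the offset of the offending lead byte, which matches the "position" number in the error messages quoted in this thread.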

Try converting your book and choose the option to unsmarten punctuation. Smart quotes have been the cause of this problem every time I've seen it. The Modify EPub plugin can remove smart quotes without requiring a full conversion process. If that works, or if that doesn't work for some or all of your books, let me know so I can investigate some possibilities.
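As a rough illustration of what removing smart punctuation means, here is a minimal Python 3 sketch. The mapping and the `unsmarten` helper are hypothetical and much smaller than whatever calibre or the Modify EPub plugin actually do. It also shows why 0xe2 turns up so often: the curly quotes, dashes and ellipsis all encode to three-byte UTF-8 sequences that start with 0xe2.

```python
# Hypothetical mapping for illustration only; real tools cover more characters.
SMART_TO_PLAIN = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}


def unsmarten(text: str) -> str:
    """Replace typographic punctuation with plain ASCII equivalents."""
    return text.translate(str.maketrans(SMART_TO_PLAIN))


# Every character in the mapping starts with the byte 0xe2 when encoded as UTF-8.
lead_bytes = {ch.encode("utf-8")[0] for ch in SMART_TO_PLAIN}
print([hex(b) for b in lead_bytes])
print(unsmarten("\u201cIt\u2019s fine\u2026\u201d"))
```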

giorgio130 commented on May 24, 2024

Reconverting actually worked, I'd rather avoid this step if it is possible though.

jgoguen commented on May 24, 2024

Yes, I agree. I'll take a look and see how I can best strip smart quotes auto-magically if there's an issue with parsing.

jgoguen commented on May 24, 2024

Would you be willing to send me a book or two that are currently still failing without converting them? I seem to be having a hard time finding any books like that in my collection and I need to test what I think is a fix for this.

giorgio130 commented on May 24, 2024

I'd like to but I don't know how ;)

jgoguen commented on May 24, 2024

Commit 5e51dcf: try to strip "smart" punctuation when a UnicodeDecodeError is encountered when parsing (X)HTML files.

Testing required, looking for books that are broken in this manner to verify this works.

giorgio130 commented on May 24, 2024

Since you've committed it, I've decided to test this myself; I don't have books that are causing this issue anymore. So to my knowledge the issue is fixed. Thanks for your work!

giorgio130 commented on May 24, 2024

As reported here:
http://forum.simplicissimus.it/kobo/calibre-driver-per-kobo-con-funzioni-aggiuntive-%28kepub%29/15/

calibre, version 0.9.16
ERRORE: Errore: Errore di comunicazione col dispositivo (ERROR: Error: Error communicating with the device)

'utf8' codec can't decode byte 0xc2 in position 51161: unexpected end of data

Traceback (most recent call last):
  File "site-packages\calibre\gui2\device.py", line 85, in run
  File "site-packages\calibre\gui2\device.py", line 551, in _upload_books
  File "calibre_plugins.kobotouch_extended.driver", line 183, in upload_books
  File "calibre_plugins.kobotouch_extended.driver", line 144, in _modify_epub
  File "calibre_plugins.kobotouch_extended.container", line 93, in get_parsed
  File "site-packages\calibre\ebooks\chardet.py", line 30, in strip_encoding_declarations
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 51161: unexpected end of data

So "0xc2" is also a possible problem. I'm reopening this, even though I think it will be trivial to fix.

jgoguen commented on May 24, 2024

This is actually not going to be trivial. I'm actually going to take a hard line here and say that a new version of the book has to be obtained. If this is coming from a bookstore, go back to them and complain that they're giving you a badly-encoded book. If it's not coming from a bookstore, I'll assume it's custom content that was put together by hand and say go back and re-encode the files properly. Perhaps Sigil (http://code.google.com/p/sigil/) can be of assistance.

This points to a file which has been declared (or detected) as using UTF-8 encoding, but has mixed character encodings. 0xc2 is not valid as the last byte in a UTF-8 character but that's exactly where it's being found, which is why it chokes and complains about "unexpected end of data". It is possible (although I don't know all the rules) to determine how many bytes a UTF-8 character should have. Looking at a Unicode table, it looks like 0xc2 is supposed to be the first byte in a two-byte sequence.
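The rule for how many bytes a UTF-8 character should have depends only on the first byte. A minimal Python 3 sketch follows; the function name `utf8_sequence_length` is made up for illustration, and the ranges come from the UTF-8 definition (RFC 3629), not from calibre.

```python
def utf8_sequence_length(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with `lead`.

    Returns 0 if `lead` cannot start a sequence at all.
    """
    if lead < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx (0xc0 and 0xc1 would be overlong, so they are invalid)
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx, capped so the result stays at or below U+10FFFF
    return 0      # 0x80-0xbf are continuation bytes; 0xc0, 0xc1, 0xf5-0xff never occur


for byte in (0xC2, 0xC3, 0xE2):
    print(hex(byte), utf8_sequence_length(byte))
```

So 0xc2 and 0xc3 each need exactly one continuation byte after them, and 0xe2 needs two. If the data ends first, the decoder reports "unexpected end of data".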

My only real option here is to try to force the string to a UTF-8 string, but that has its own issue. The three possible methods for dealing with errors when converting a string to UTF-8 are:

strict - Fail with an exception when an unknown byte sequence is detected
ignore - Ignore the byte sequence and continue on as if nothing happened
replace - Replace all invalid byte sequences with the three-byte sequence "0xef 0xbf 0xbd" (unicode replacement character: �)

I can't use strict since that's what's causing this issue in the first place. I can't use ignore since calibre expects valid UTF-8 strings. That leaves replace, which will cause � to be shown all over the book. Depending on how the Kobo renders this, it might also appear simply as a question mark. That makes it look like the driver is corrupting the book.
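For reference, the three handlers can be compared on a made-up byte string. This is a minimal sketch in Python 3 using only `bytes.decode`; the sample bytes and the helper `decode_all_ways` are invented for illustration and nothing here is taken from the driver. One detail worth checking against it: in Python, `ignore` drops the offending bytes from the decoded result.

```python
broken = b"caf\xc3 latte"  # made-up sample: 0xc3 lead byte with no continuation byte


def decode_all_ways(data: bytes) -> dict:
    """Decode `data` as UTF-8 with each error handler and collect the outcomes."""
    results = {}
    for handler in ("strict", "ignore", "replace"):
        try:
            results[handler] = data.decode("utf-8", errors=handler)
        except UnicodeDecodeError as exc:
            results[handler] = f"error: {exc.reason}"
    return results


results = decode_all_ways(broken)
for handler, outcome in results.items():
    print(f"{handler:8} {outcome!r}")

# U+FFFD, the replacement character, is the three bytes ef bf bd in UTF-8.
print("\ufffd".encode("utf-8").hex(" "))
```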

giorgio130 commented on May 24, 2024

I don't see the issue with "ignore", since the book was being visualized properly in the beginning?
I'm sorry if I assumed too much with my "trivial", I thought you put those bytes in a table or something like that.
Since calibre properly converts those books, is it possible to see how those strings are treated there?

jgoguen commented on May 24, 2024

No, no table on my end. The rules for Unicode parsing are well-defined and quite strict. There's one thing I realized I can try, which I've done in my local copy, but I need a book that fails to test it out on. However, as I'll explain below, neither ignore nor replace are acceptable to me as decode options. Can you find a book that has this error (the byte is not important, but 0xc2 would be nice) and send it to me?

The issue with the ignore option is that it leaves the invalid byte sequences in place. My plugin does not operate in isolation, it interacts with calibre and so needs to give calibre valid data. If I give calibre data with invalid byte sequences, we'll see the same problem but (maybe) in a different place. I could use replace, but that replaces all invalid byte sequences with �, which makes it look like my plugin is corrupting the book when it really isn't. The only option left is strict, which is the option that calibre uses in the few places where it explicitly decodes a byte sequence. Most places, calibre uses the data as-is and assumes it's valid. I'm going to try explicitly decoding data as UTF-8, but with the strict option.

ComicSans commented on May 24, 2024

I stated earlier that reconverting to epub and sending to the device sometimes failed. I just found out that converting to mobi and back to epub works fine for every single book I had issues with. I had about 50 books in my library that I could not send to my device using this driver (all of these have been converted to epub several times, using both smarten and unsmarten punctuation and the Modify EPub plugin). All gave me the utf-8-cannot-convert message (as this is copyrighted material, I cannot send copies of the books).

Can anybody reproduce this behavior? Is there any encoding/conversion difference between converting epub->epub and epub->mobi->epub?

jgoguen commented on May 24, 2024

Have you tried again with the updated code without converting? There have been
some changes since you last posted. You should try with the no-database
branch.

ComicSans commented on May 24, 2024

Ah, I forgot to mention that... sorry. I used the no-database-branch of yesterday evening.

giorgio130 commented on May 24, 2024

Found one with 0xc3, sending it right now.

giorgio130 commented on May 24, 2024

It seems that the fix for this issue is creating many more problems than it fixed.
There are several reports of malformed characters in the parsed epub. If I'm right, all special characters in UTF-8 format get interpreted as if they were encoded with the system's encoding, and most of the time (at least on Windows) this is wrong.
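The kind of mangling described here can be sketched in Python 3. This assumes the "system's encoding" is cp1252, a common Windows default; that choice, the sample word and the helper names `misdecode` and `repair` are illustrative guesses, not something confirmed in this thread.

```python
def misdecode(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode as UTF-8, then decode the bytes with the wrong single-byte encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled: str, wrong_encoding: str = "cp1252") -> str:
    """Undo misdecode(), provided no bytes were lost along the way."""
    return garbled.encode(wrong_encoding).decode("utf-8")


garbled = misdecode("perch\u00e9")  # "perché"
print(garbled)                      # each accented letter becomes two characters
print(repair(garbled))
```

Every accented letter turns into two characters because its two UTF-8 bytes are each read as a separate cp1252 character.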

jgoguen commented on May 24, 2024

That's what I was afraid of. This means that calibre's way of handling these files is, for whatever strange reason, not sufficient.

I have a couple of ideas, but none of them are particularly pretty or elegant. I'll need to draw them out and see which one would work best. I still have one book that failed before I put this code in place so I'll use that book to check and see where it falls through to. That book works with this though, so it would be appreciated if I could get a file that is now failing but worked before.

giorgio130 commented on May 24, 2024

Are you still against the ignore option? Since calibre doesn't complain in the first place when handling the original file, I can't understand why it should have problems with it after adding the KoboSpan ids and the cover-image property.
Did you try it and it caused issues?
Ideally I think this driver should change the file as little as possible.

giorgio130 commented on May 24, 2024

And with "That book works with this though", what do you mean? No book spits out errors now, but characters are mangled if you try to read them. I'd expect it to happen also on the test book you have.

jgoguen commented on May 24, 2024

I'm against "ignore", because when I do that and then I give it to calibre,
it fails anyway because the invalid byte sequences are still there. However,
I'm going to try it again anyway just because I need to exhaust all my
options.

I'm not sure what calibre is doing differently, the only place in the
conversion pipeline that I can see calibre specifically handling the file
encoding is where it decodes the content using the file system's default
encoding if the content is a byte string. Otherwise, it just uses the
content "as-is". I have ideas, it's just a matter of trying them out and
making sure there's no major performance slow-down.

jgoguen commented on May 24, 2024

I haven't seen any corrupted characters with this update using the books I have so far. Could you try with some books that have been showing corrupted characters?

giorgio130 commented on May 24, 2024

Some users on the forum reported mixed results in the same epub, on some files accented letters (àòèùì) are rendered correctly, on some others they get corrupted.

jgoguen commented on May 24, 2024

That's what I expect to happen using the "ignore" option. Using "replace" should replace all those corrupted characters with "�", but I don't really want to change it and ask people to check if it's broken better or worse. Could you get me one of the affected books, and tell me which chapter and page number I could find corrupted characters on, and I'll verify that this is actually the result of using "ignore" or see if there's some other issue I can work around?

giorgio130 commented on May 24, 2024

That's not what I meant by corrupted. I meant that, as before, characters are being interpreted as if encoded in the system's encoding and then translated back to UTF-8, only now this happens on a per-file basis. It seems that chardet runs on each text file in the epub, and especially on short texts like introductions or book titles the results could be wrong. I'll try to get hold of the affected file.

giorgio130 commented on May 24, 2024

I'm sending you a copy of that file. BTW, have you thought about asking Kovid Goyal for assistance on this topic?

jgoguen commented on May 24, 2024

I have, but now that I have a book that I don't know what to do with I want to check a few things before I do.

I suspect the reason it doesn't break normally is that the existing KoboTouch driver doesn't process the book files; it just sends them along to the right place. The same goes for sideloading. What I want to check, though, is what encoding chardet reports for each file, then take that file alone and see whether there's some post-processing I can do to accommodate the issues. Failing that, I'll be asking the mobileread forums for help, since I'm already out of ideas for how to get around chardet failing to detect the proper encoding.

jgoguen commented on May 24, 2024

I think this is actually me misunderstanding how chardet works. I had thought that it takes a file, but it takes a buffer. I took just the file for the section you said had problems, used chardet, and it failed horribly. When I used the UniversalDetector class directly, I got the expected result of UTF-8. It'll be a few hours before I can try it, but if that works I'll be happy. And uncertain as to how I got it properly detecting at all passing in only file names...

jgoguen commented on May 24, 2024

I pushed an update, I'm assuming that the files are encoded correctly and using chardet as a fallback if an exception is raised. This works surprisingly well, I'm not able to read the language of the book you sent but I don't see anything corrupted. I've asked on the mobileread forums just in case there's something I can do to improve character detection, but detection on short input files with a large percentage of characters in the standard ASCII range is notoriously error-prone.
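The "assume the file is encoded correctly, fall back to detection only on failure" approach could look roughly like the sketch below. This is not the driver's actual code. The function `decode_markup`, its `detect` parameter and the cp1252 last resort are all assumptions made for illustration, and the detector is passed in as a plain callable so the sketch runs without chardet installed.

```python
from typing import Callable, Optional


def decode_markup(raw: bytes,
                  detect: Optional[Callable[[bytes], Optional[str]]] = None,
                  last_resort: str = "cp1252") -> str:
    """Decode as strict UTF-8 first; consult a detector only if that fails."""
    try:
        return raw.decode("utf-8")  # strict: trust the file when it is valid UTF-8
    except UnicodeDecodeError:
        pass
    guess = detect(raw) if detect is not None else None
    for encoding in (guess, last_resort):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next candidate
    # Nothing fit: keep going, but mark the damage visibly with U+FFFD.
    return raw.decode("utf-8", errors="replace")


print(decode_markup("perch\u00e9".encode("utf-8")))
print(decode_markup(b"perch\xe9", detect=lambda data: "latin-1"))
```

With chardet available, the detector could be something like `lambda data: chardet.detect(data)["encoding"]`. Note that it receives the file's bytes, not a file name. Trying strict UTF-8 first means the guesswork only runs on files that are already known to be broken, which sidesteps the unreliable detection on short, mostly-ASCII files.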

giorgio130 commented on May 24, 2024

Looks like the way to go. I'll do some testing in the coming days, thank you!

kovidgoyal commented on May 24, 2024

@jgoguen I happened to come across this bug report while browsing something unrelated. If you wish to duplicate the technique for decoding html that the calibre conversion pipeline uses, you need to replicate the decode() method from ebooks.oeb.base or ebooks.oeb.polish.container. That way you will get the same results as you would get doing a conversion in calibre.

jgoguen commented on May 24, 2024

Is there any reason I shouldn't extend the ebooks.oeb.polish.container.EpubContainer class instead of duplicating the decode() method?

kovidgoyal commented on May 24, 2024

If you wish to use the functionality from that class, feel free. The only downside is that the container class is likely to evolve in the near future, as polish is under active development, so you would have to keep an eye on the changes. I generally try to avoid breaking interfaces wherever possible, but I only test with calibre's code base, not third-party plugins.
