Comments (33)
Same here. About 10% of my books fail while converting. If I reconvert them first to epub again, most of them work, some don't.
from calibre-kobo-driver.
This is actually failing in calibre itself. It indicates that the book (or more specifically, whichever file within the book it just tried to parse) either has mixed file encodings or identifies itself as a UTF-8 file but contains invalid byte sequences. In the specific stack trace here, the message says that if the file were encoded properly, the byte 0xc3 would have to be followed by at least one more byte belonging to the same character. Another common byte to see is 0xe2.
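To find out which byte in a file trips the decoder, the exception itself carries the offset. The following is a minimal Python 3 sketch, not the plugin's code; the helper name `find_invalid_utf8` and the sample bytes are made up for illustration.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, offending_byte, reason) for the first invalid
    UTF-8 sequence in data, or None if the data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte the codec rejected.
        return exc.start, data[exc.start], exc.reason
    return None


# Hypothetical samples: a well-formed string, and one where 0xc3 is
# followed by a plain space instead of the rest of the character.
sample_ok = "caf\u00e9".encode("utf-8")
sample_bad = b"caf\xc3 au lait"

print(find_invalid_utf8(sample_ok))
print(find_invalid_utf8(sample_bad))
```

Running this over each (X)HTML file extracted from an epub would show which file is mis-encoded and where.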
Try converting your book and choose the option to unsmarten punctuation. Smart quotes have been the cause of this problem every time I've seen it. The Modify EPub plugin can remove smart quotes without requiring a full conversion process. If that works, or if that doesn't work for some or all of your books, let me know so I can investigate some possibilities.
from calibre-kobo-driver.
Reconverting actually worked, though I'd rather avoid this step if possible.
from calibre-kobo-driver.
Yes, I agree. I'll take a look and see how I can best strip smart quotes auto-magically if there's an issue with parsing.
from calibre-kobo-driver.
Would you be willing to send me a book or two that are currently still failing without converting them? I seem to be having a hard time finding any books like that in my collection and I need to test what I think is a fix for this.
from calibre-kobo-driver.
I'd like to but I don't know how ;)
from calibre-kobo-driver.
Commit 5e51dcf: try to strip "smart" punctuation when a UnicodeDecodeError is encountered when parsing (X)HTML files.
Testing required, looking for books that are broken in this manner to verify this works.
from calibre-kobo-driver.
Since you've committed it, I've decided to test this myself; I don't have books that are causing this issue anymore. So to my knowledge the issue is fixed. Thanks for your work!
from calibre-kobo-driver.
As reported here:
http://forum.simplicissimus.it/kobo/calibre-driver-per-kobo-con-funzioni-aggiuntive-%28kepub%29/15/
calibre, version 0.9.16
ERRORE: Errore: Errore di comunicazione col dispositivo (ERROR: Error: Error communicating with device)
'utf8' codec can't decode byte 0xc2 in position 51161: unexpected end of data
Traceback (most recent call last):
  File "site-packages\calibre\gui2\device.py", line 85, in run
  File "site-packages\calibre\gui2\device.py", line 551, in _upload_books
  File "calibre_plugins.kobotouch_extended.driver", line 183, in upload_books
  File "calibre_plugins.kobotouch_extended.driver", line 144, in _modify_epub
  File "calibre_plugins.kobotouch_extended.container", line 93, in get_parsed
  File "site-packages\calibre\ebooks\chardet.py", line 30, in strip_encoding_declarations
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 51161: unexpected end of data
So 0xc2 is another possible problem byte. I'm reopening this, even though I think it will be trivial to fix.
from calibre-kobo-driver.
This is actually not going to be trivial. I'm actually going to take a hard line here and say that a new version of the book has to be obtained. If this is coming from a bookstore, go back to them and complain that they're giving you a badly-encoded book. If it's not coming from a bookstore, I'll assume it's custom content that was put together by hand and say go back and re-encode the files properly. Perhaps Sigil (http://code.google.com/p/sigil/) can be of assistance.
This points to a file which has been declared (or detected) as using UTF-8 encoding, but has mixed character encodings. 0xc2 is not valid as the last byte in a UTF-8 character but that's exactly where it's being found, which is why it chokes and complains about "unexpected end of data". It is possible (although I don't know all the rules) to determine how many bytes a UTF-8 character should have. Looking at a Unicode table, it looks like 0xc2 is supposed to be the first byte in a two-byte sequence.
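The rule for how many bytes a UTF-8 character should have can be read off the first byte alone. A minimal sketch of that rule as defined for UTF-8 (the function name `utf8_sequence_length` is made up for illustration):

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes a UTF-8 character starting with `lead` occupies.

    Raises ValueError for bytes that can never start a character:
    continuation bytes (0x80-0xbf) and the unused values 0xc0, 0xc1
    and 0xf5-0xff.
    """
    if lead < 0x80:
        return 1  # plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError("0x%02x cannot start a UTF-8 sequence" % lead)


# The bytes mentioned in this thread:
for byte in (0xC2, 0xC3, 0xE2):
    print(hex(byte), "starts a sequence of", utf8_sequence_length(byte), "bytes")
```

So 0xc2 and 0xc3 each need exactly one more byte after them, and 0xe2 (the lead byte of the curly quotes in UTF-8) needs two. A file that ends, or switches encoding, right after one of these bytes cannot be decoded strictly.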
My only real option here is to try to force the data to decode as UTF-8, but that has its own issue. The three possible methods for dealing with errors when decoding a byte string as UTF-8 are:
- strict - Fail with an exception when an invalid byte sequence is detected
- ignore - Ignore the byte sequence and continue on as if nothing happened
- replace - Replace each invalid byte sequence with the Unicode replacement character (�), which is the three-byte sequence "0xef 0xbf 0xbd" in UTF-8
I can't use strict since that's what's causing this issue in the first place. I can't use ignore since calibre expects valid UTF-8 strings. That leaves replace, which will cause � to be shown all over the book. Depending on how the Kobo renders this, it might also appear simply as a question mark. That makes it look like the driver is corrupting the book.
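For comparison, this is how the three handlers behave in Python 3's built-in UTF-8 codec. It is a sketch only: the thread itself concerns calibre on Python 2, and the sample bytes and the helper `try_strict` are made up. One thing worth checking against the reasoning above is that, in this codec, ignore drops the malformed bytes from the result rather than keeping them, so the output is valid text but silently shorter.

```python
# "naïve " in UTF-8, followed by a lone 0xc2 lead byte (a truncated sequence).
data = "na\u00efve ".encode("utf-8") + b"\xc2"


def try_strict(raw: bytes):
    """Decode strictly; return the decoded text, or the error reason on failure."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return exc.reason


strict_result = try_strict(data)                    # the failure reason
ignored = data.decode("utf-8", errors="ignore")     # malformed byte dropped
replaced = data.decode("utf-8", errors="replace")   # malformed byte -> U+FFFD

print(repr(strict_result))
print(repr(ignored))
print(repr(replaced))
print(replaced.encode("utf-8")[-3:])                # the bytes of U+FFFD in UTF-8
```

Re-encoding either `ignored` or `replaced` yields well-formed UTF-8; the difference is whether the reader sees nothing or a visible marker where the bad byte was.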
from calibre-kobo-driver.
I don't see the issue with "ignore", since the book was being displayed properly in the first place.
I'm sorry if I assumed too much with my "trivial"; I thought you had put those bytes in a table or something like that.
Since calibre properly converts those books, is it possible to see how those strings are treated there?
from calibre-kobo-driver.
No, no table on my end. The rules for Unicode parsing are well-defined and quite strict. There's one thing I realized I can try, which I've done in my local copy, but I need a book that fails to test it out on. However, as I'll explain below, neither ignore nor replace is acceptable to me as a decode option. Can you find a book that has this error (the byte is not important, but 0xc2 would be nice) and send it to me?
The issue with the ignore option is that it leaves the invalid byte sequences in place. My plugin does not operate in isolation; it interacts with calibre and so needs to give calibre valid data. If I give calibre data with invalid byte sequences, we'll see the same problem but (maybe) in a different place. I could use replace, but that replaces all invalid byte sequences with �, which makes it look like my plugin is corrupting the book when it really isn't. The only option left is strict, which is the option that calibre uses in the few places where it explicitly decodes a byte sequence. In most places, calibre uses the data as-is and assumes it's valid. I'm going to try explicitly decoding data as UTF-8, but with the strict option.
from calibre-kobo-driver.
As I stated earlier, reconverting to epub and sending to the device sometimes failed. I just found out that converting to mobi and back to epub works fine for every single book I had issues with. I had about 50 books in my library that I could not send to my device using this driver (all of these have been converted to epub several times, using both smarten and unsmarten punctuation and the Modify EPub plugin). All gave me the "utf-8 cannot convert" message (as this is copyrighted material, I cannot send copies of the books).
Can anybody reproduce this behavior? Is there any encoding/conversion difference between converting epub->epub and converting epub->mobi->epub?
from calibre-kobo-driver.
Have you tried again with the updated code without converting? There have been some changes since you last posted. You should try with the no-database branch.
from calibre-kobo-driver.
Ah, I forgot to mention that... sorry. I used the no-database branch as of yesterday evening.
from calibre-kobo-driver.
Found one with 0xc3, sending it right now.
from calibre-kobo-driver.
It seems that the fix for this issue is creating many more problems than it fixed.
There are several reports of malformed characters in the parsed epub. If I'm right, all special (non-ASCII) characters stored as UTF-8 get interpreted as if they were encoded with the system's encoding, and most of the time (at least on Windows) this is wrong.
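The symptom described here can be reproduced in a few lines of Python 3. This is a sketch under one assumption: that the "system's encoding" on the affected Windows machines is cp1252, which the thread does not state.

```python
original = "\u00e0\u00f2\u00e8\u00f9\u00ec"   # the accented letters àòèùì
utf8_bytes = original.encode("utf-8")         # what a correctly encoded file contains

# Misreading those UTF-8 bytes as cp1252 (assumed system encoding) turns
# every accented letter into two unrelated characters.
mangled = utf8_bytes.decode("cp1252")

# As long as the mangled text is still intact, the mistake is reversible.
repaired = mangled.encode("cp1252").decode("utf-8")

print(repr(original), len(original))
print(repr(mangled), len(mangled))
print(repaired == original)
```

No exception is raised at any point, which is why this kind of corruption only shows up when someone actually reads the book.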
from calibre-kobo-driver.
That's what I was afraid of. This means that calibre's way of handling these files is, for whatever strange reason, not sufficient.
I have a couple of ideas, but none of them are particularly pretty or elegant. I'll need to draw them out and see which one would work best. I still have one book that failed before I put this code in place so I'll use that book to check and see where it falls through to. That book works with this though, so it would be appreciated if I could get a file that is now failing but worked before.
from calibre-kobo-driver.
Are you still against the ignore option? Since calibre doesn't complain when handling the original file in the first place, I can't understand why it should have problems with it after the KoboSpan ids and the cover-image property are added.
Did you try it and it caused issues?
Ideally I think this driver should change the file as little as possible.
from calibre-kobo-driver.
And what do you mean by "That book works with this though"? No book spits out errors now, but characters are mangled if you try to read them. I'd expect that to happen on the test book you have as well.
from calibre-kobo-driver.
I'm against "ignore" because when I do that and then give the result to calibre, it fails anyway because the invalid byte sequences are still there. However, I'm going to try it again anyway, just because I need to exhaust all my options.
I'm not sure what calibre is doing differently. The only place in the conversion pipeline where I can see calibre specifically handling the file encoding is where it decodes the content using the file system's default encoding if the content is a byte string. Otherwise, it just uses the content "as-is". I have ideas; it's just a matter of trying them out and making sure there's no major performance slow-down.
from calibre-kobo-driver.
I haven't seen any corrupted characters with this update using the books I have so far. Could you try with some books that have been showing corrupted characters?
from calibre-kobo-driver.
Some users on the forum reported mixed results within the same epub: in some files accented letters (àòèùì) are rendered correctly, in others they get corrupted.
from calibre-kobo-driver.
That's what I expect to happen using the "ignore" option. Using "replace" should replace all those corrupted characters with "�", but I don't really want to change it and ask people to check if it's broken better or worse. Could you get me one of the affected books, and tell me which chapter and page number I could find corrupted characters on, and I'll verify that this is actually the result of using "ignore" or see if there's some other issue I can work around?
from calibre-kobo-driver.
That's not what I meant by corrupted. I meant that, like before, when all the characters were interpreted as encoded in the system's encoding and then translated back to utf-8, this is still happening, but now on a per-file basis. It seems that chardet runs on each text file in the epub, and especially on short texts like introductions or book titles the results can be wrong. I'll try to get hold of the affected file.
from calibre-kobo-driver.
I'm sending you a copy of that file. BTW, have you thought about asking Kovid Goyal for assistance on this topic?
from calibre-kobo-driver.
I have, but now that I have a book that I don't know what to do with I want to check a few things before I do.
I suspect the reason it doesn't break normally is that the existing KoboTouch driver doesn't process the book files; it just sends them along to the right place. The same goes for side loading. What I want to check, though, is which encoding chardet reports for each file, and then take that file alone and see if there's some post-processing I can do to accommodate the issues. Failing that, I'll be asking the mobileread forums for help, since I'm already out of ideas for getting around chardet failing to detect the proper encoding.
from calibre-kobo-driver.
I think this is actually me misunderstanding how chardet works. I had thought that it takes a file, but it takes a buffer. I took just the file for the section you said had problems, used chardet, and it failed horribly. When I used the UniversalDetector class directly, I got the expected result of UTF-8. It'll be a few hours before I can try it, but if that works I'll be happy. And uncertain as to how I got it properly detecting at all passing in only file names...
from calibre-kobo-driver.
I pushed an update. I'm now assuming that the files are encoded correctly and using chardet as a fallback if an exception is raised. This works surprisingly well; I'm not able to read the language of the book you sent, but I don't see anything corrupted. I've asked on the mobileread forums just in case there's something I can do to improve character detection, but detection on short input files with a large percentage of characters in the standard ASCII range is notoriously error-prone.
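The "trust the declared encoding first, detect only on failure" approach can be sketched roughly as follows. This is not the code from the pushed update; the function name `decode_with_fallback` and the `detect` callback are hypothetical, and the callback stands in for whatever detector is used (with chardet it could wrap `chardet.detect(data)["encoding"]`).

```python
def decode_with_fallback(data: bytes, detect=None) -> str:
    """Decode data as strict UTF-8; only if that fails, ask `detect` for an
    encoding name and decode with that instead.

    `detect` is a hypothetical callback taking the raw bytes and returning
    an encoding name, or None when it has no guess.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        if detect is None:
            raise
        guessed = detect(data)
        if not guessed:
            raise
        return data.decode(guessed)


# Well-formed UTF-8 never reaches the detector, so a wrong guess on a
# short, mostly-ASCII file cannot mangle it.
print(decode_with_fallback("perch\u00e9".encode("utf-8")))

# Bytes that are not valid UTF-8 fall through to the (here hard-coded) guess.
print(decode_with_fallback(b"perch\xe9", detect=lambda raw: "cp1252"))
```

The design choice is that detection errors can now only affect files that were already undecodable as UTF-8, instead of every file in the book.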
from calibre-kobo-driver.
Looks like the way to go. I'll do some testing in the coming days, thank you!
from calibre-kobo-driver.
@jgoguen I happened to come across this bug report while browsing something unrelated. If you wish to duplicate the technique for decoding html that the calibre conversion pipeline uses, you need to replicate the decode() method from ebooks.oeb.base or ebooks.oeb.polish.container. That way you will get the same results as you would get doing a conversion in calibre.
from calibre-kobo-driver.
Is there any reason I shouldn't extend the ebooks.oeb.polish.container.EpubContainer class instead of duplicating the decode() method?
from calibre-kobo-driver.
If you wish to use the functionality from that class, feel free. The only downside is that the container class is likely to evolve in the near future, as polish is under active development, so you would have to keep an eye on the changes. I generally try to avoid breaking interfaces wherever possible, but I only test with calibre's code base, not third-party plugins.
from calibre-kobo-driver.