
Comments (12)

coolwanglu commented on June 28, 2024

The problem is that the ToUnicode map provided in the PDF is bad: both 'e' and 'E' are mapped to 'e'.
It's not the fault of poppler or pdf2htmlEX.

I'll add an option that ignores the ToUnicode map for a specified font, as I've seen many bad ToUnicode maps. With the option enabled, pdf2htmlEX will work as if there were no such map, in which case the font should render correctly, but text selection may not work.
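
To illustrate the kind of collision described above, here is a minimal sketch in Python. It is not pdf2htmlEX or poppler code: the function name `find_tounicode_collisions` and the dictionary representation of a ToUnicode map (character code to Unicode string) are assumptions made for illustration only.

```python
from collections import defaultdict


def find_tounicode_collisions(cmap):
    """Return {unicode_text: [codes]} for every target shared by 2+ codes.

    `cmap` is a hypothetical in-memory form of a ToUnicode map:
    a dict from character code (int) to the Unicode string it maps to.
    """
    by_target = defaultdict(list)
    for code, text in cmap.items():
        by_target[text].append(code)
    # Keep only targets claimed by more than one character code.
    return {text: sorted(codes) for text, codes in by_target.items() if len(codes) > 1}


# Made-up example map: the codes for the 'E' and 'e' glyphs both claim "e".
bad_map = {0x45: "e", 0x65: "e", 0x66: "f"}
collisions = find_tounicode_collisions(bad_map)
print(collisions)  # {'e': [69, 101]}
```

When two distinct glyphs resolve to the same Unicode value like this, a converter that rebuilds the font encoding from the map cannot tell them apart, which is consistent with the wrong characters reported here.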

from pdf2htmlex.

iapain commented on June 28, 2024

This option would be great :) Thx.


coolwanglu commented on June 28, 2024

I've just hacked the PDF by removing the ToUnicode map of the font, and then I got the correct characters.
Now I need a Unicode mapping based only on the font info, without the ToUnicode map provided in the PDF.
There's a function in poppler I could use, but it's not public.
I'm contacting the poppler guys before deciding whether I have to rewrite the function.


coolwanglu commented on June 28, 2024

I used another method for it. Now the ToUnicode map is disabled for non-TTF fonts by default.
Please test it as much as you can :)


iapain commented on June 28, 2024

Tested intensively on a variety of PDFs. Works great. Thanks!
BTW, I noticed that glyphs that are unused on a page are also encoded in the font. That is quite weird, because it should encode only the used glyphs.


coolwanglu commented on June 28, 2024

No, I embed the 'entire font' in the PDF. What I meant previously is that
PDF generators usually embed only the necessary glyphs.
I'll probably consider this in the future, as currently I don't have an
easy way to manipulate the glyphs or count the used ones.


iapain commented on June 28, 2024

It'd be a nice future add-on. Font size would be dramatically reduced for graphical PDFs. This would also benefit mobile viewing.


coolwanglu commented on June 28, 2024

Hi @iapain
In the latest dev branch, I've changed the default behaviour, which applies when --tounicode=0.
Basically, when --tounicode=1, the ToUnicode map is always applied; when --tounicode=-1, the map is completely ignored.

When --tounicode=0, pdf2htmlEX attempts to apply the map. If anything wrong is found, the map is dropped.
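
The three modes can be sketched roughly as follows. This is an illustrative Python sketch, not the actual pdf2htmlEX implementation: the function name `resolve_tounicode` and the dictionary form of the map are hypothetical, and the thread does not say what counts as "anything wrong", so the check used here (two character codes mapped to the same Unicode value) is only an assumption.

```python
def resolve_tounicode(mode, cmap):
    """Return the ToUnicode map to use, or None if it should be ignored.

    mode: 1 = always apply the map, -1 = always ignore it,
          0 = try the map and drop it if it looks wrong.
    cmap: hypothetical dict from character code (int) to Unicode string.
    """
    if mode == 1:
        return cmap
    if mode == -1:
        return None
    if mode == 0:
        # Assumed sanity check: a duplicate target means the map is "wrong".
        targets = list(cmap.values())
        if len(set(targets)) != len(targets):
            return None  # drop the map, fall back to font-based info
        return cmap
    raise ValueError(f"unsupported tounicode mode: {mode}")


good_map = {0x45: "E", 0x65: "e"}
bad_map = {0x45: "e", 0x65: "e"}  # 'E' and 'e' collide, as in this ticket

auto_good = resolve_tounicode(0, good_map)   # map kept
auto_bad = resolve_tounicode(0, bad_map)     # map dropped
forced_bad = resolve_tounicode(1, bad_map)   # map applied regardless
ignored = resolve_tounicode(-1, good_map)    # map ignored regardless
```

Under this assumed check, the auto mode keeps well-formed maps and falls back to font information only when the map is visibly inconsistent.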

I changed the behaviour because I have received many PDF files in which Type 1 fonts are embedded without proper font names in the font, while proper ToUnicode CMaps are provided. Such files are actually more consistent with the PDF standard.

What do you say about it?


iapain commented on June 28, 2024

I think it's a very wise decision. In the end, what matters is how similar the output is to the PDF. I think --tounicode=0 will make the HTML look similar to the PDF.

By the way, I found that this option is not working for me. Try it on the PDF referred to in this ticket. I think we need to speed up testing to avoid regressions; I will start contributing.


coolwanglu commented on June 28, 2024

I mean that the default behaviour is not good for that PDF.
There was also a typo: please specify "-1" to force-disable the ToUnicode map,
i.e. for that PDF.


iapain commented on June 28, 2024

Now it's clearer to me. I am okay with this behaviour. I agree with you that the default auto mode should be smart enough. I tried some PDFs with a Type2 font that has a collision in its font map, and I didn't see any "drop" messages in debug mode. I guess it needs to be tested well.


coolwanglu commented on June 28, 2024

OK, I'll test it with more files.

