Comments (12)
The problem is that the ToUnicode map provided in the PDF is bad: both 'e' and 'E' are mapped to 'e'.
It's not the fault of poppler nor pdf2htmlEX.
I'll add an option that ignores the ToUnicode map for a specified font, as I've seen many bad ToUnicode maps. With the option enabled, pdf2htmlEX will work as if there's no such map, in which case the font should render correctly, but text selection may not work.
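To make the failure mode concrete, here is a minimal sketch of how such a bad map could be spotted. It assumes the ToUnicode map has already been extracted into a plain dict from character code to Unicode string; the `find_collisions` helper and the sample map are hypothetical and are not part of poppler or pdf2htmlEX.

```python
from collections import defaultdict


def find_collisions(to_unicode):
    """Return {unicode_string: [codes]} for every target that more than
    one character code maps to (hypothetical helper, not a poppler API)."""
    by_target = defaultdict(list)
    for code, char in sorted(to_unicode.items()):
        by_target[char].append(code)
    return {char: codes for char, codes in by_target.items() if len(codes) > 1}


# Hypothetical map reproducing the problem in this ticket:
# the codes for 'E' (0x45) and 'e' (0x65) both map to 'e'.
bad_map = {0x45: "e", 0x65: "e", 0x66: "f"}
print(find_collisions(bad_map))  # {'e': [69, 101]}
```

A map with any collision like this cannot render both glyphs correctly if the Unicode value is used to pick the glyph, which is presumably why dropping the map fixes the rendering.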
from pdf2htmlex.
This option would be great :) Thx.
I've just hacked the PDF by removing the ToUnicode map of the font, and then I got the correct characters.
Now I need a Unicode mapping based only on the font info, without the ToUnicode map provided in the PDF.
There's a function in poppler that I could use, but it's not public.
I'm contacting the poppler guys before I resort to rewriting the function.
I used another method for it. Now the ToUnicode map is disabled for non-TTF fonts by default.
Please test it as much as you can :)
Tested intensively on a variety of PDFs. Works great. Thanks!
BTW, I noticed that glyphs that were unused on the page were also encoded in the font. That is quite weird, because it should encode only the used glyphs.
No, I embed the 'entire font' in the PDF. What I meant previously was that PDF generators usually embed only the necessary glyphs.
I'll probably consider this in the future, as currently I don't have an easy way to manipulate the glyphs or to count the used ones.
(Replying to Deepak's comment of Tue, Aug 28, 2012, 9:08 PM.)
It'd be a nice future add-on. Font size would be dramatically reduced in graphical PDFs, which would also benefit mobile viewing.
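As a rough illustration of the "count the used glyphs, then keep only those" idea, here is a minimal sketch. It assumes the text runs are available as (font name, character codes) pairs and that a font is a plain dict from code to glyph data; both structures and both function names are hypothetical, and this is not how pdf2htmlEX handles fonts.

```python
from collections import defaultdict


def used_codes_per_font(text_runs):
    """Collect the set of character codes actually drawn with each font.

    text_runs: iterable of (font_name, iterable_of_codes) pairs
    (a hypothetical representation of the page content)."""
    used = defaultdict(set)
    for font_name, codes in text_runs:
        used[font_name].update(codes)
    return dict(used)


def subset_glyphs(glyphs, used_codes):
    """Keep only the glyphs whose code was used at least once."""
    return {code: glyph for code, glyph in glyphs.items() if code in used_codes}


# Hypothetical data: font F1 draws "Hi" and then "i!".
text_runs = [("F1", [0x48, 0x69]), ("F2", [0x41]), ("F1", [0x69, 0x21])]
used = used_codes_per_font(text_runs)

# A made-up four-glyph font; 0x49 ('I') is never drawn, so it is dropped.
f1_glyphs = {0x21: "exclam", 0x48: "H", 0x49: "I", 0x69: "i"}
f1_subset = subset_glyphs(f1_glyphs, used["F1"])
```

The real work is in rewriting the font file itself, which this sketch leaves out entirely.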
Hi @iapain
In the latest dev branch, I've changed the default behaviour (--tounicode=0).
Basically, when --tounicode=1, the ToUnicode map is always applied; when --tounicode=-1, the map is completely ignored.
When --tounicode=0, pdf2htmlEX attempts to apply the map and drops it if anything wrong is found.
I changed the behaviour because I have received many PDF files in which Type 1 fonts are embedded without proper font names in the font, whereas proper ToUnicode CMaps are provided. These are actually more consistent with the PDF standard.
What do you say about it?
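The three modes can be sketched roughly as follows. This is only an illustration of the policy described above, not the pdf2htmlEX code: the map is assumed to be a plain dict from character code to Unicode string, and "anything wrong" is assumed to mean only that two codes share one Unicode value, as in the PDF from this ticket. The `effective_map` function and the constant names are hypothetical.

```python
FORCE, AUTO, IGNORE = 1, 0, -1  # mirror the --tounicode values 1, 0, -1


def effective_map(to_unicode, mode=AUTO):
    """Return the map to apply, or None if the font should be handled
    as though the PDF provided no ToUnicode map at all."""
    if mode not in (FORCE, AUTO, IGNORE):
        raise ValueError(f"unknown tounicode mode: {mode!r}")
    if mode == IGNORE or to_unicode is None:
        return None
    if mode == FORCE:
        return dict(to_unicode)
    # AUTO: try to use the map, but drop it when it looks broken.
    # Assumption: the only check here is for two codes sharing one target.
    targets = list(to_unicode.values())
    if len(set(targets)) != len(targets):
        return None
    return dict(to_unicode)


good = {0x45: "E", 0x65: "e"}
bad = {0x45: "e", 0x65: "e"}  # the collision reported in this ticket
```

With these inputs, AUTO keeps `good` and drops `bad`, FORCE keeps both unchanged, and IGNORE drops both.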
I think it's a very wise decision. In the end, what matters is how similar the output is to the PDF. I think --tounicode=0 will make the HTML look similar to the PDF.
By the way, I found that this option is not working for me. Try it on the PDF referred to in this ticket. I think we need to speed up testing to avoid regressions; I will start contributing.
I mean the default behaviour is not good for that PDF.
And there's a typo: please specify "-1" to force-disable the ToUnicode map, i.e. for that PDF.
(Replying to Deepak's comment of Mon, Sep 24, 2012, 7:00 PM.)
Now I understand it more clearly. I am okay with this behaviour. I agree with you that the default auto mode should be smart enough. I tried some PDFs with a Type2 font with a collision in the font map. I didn't see any error messages in debug mode about "drop". I guess it needs to be tested well.
OK, I'll test it with more files.
(Replying to Deepak's comment of Mon, Sep 24, 2012, 8:57 PM.)