Comments (8)
Hello Philipp,
To tell the truth, the initial version of my class did suppress hyphens; I
noticed the issue when running it on the Microsoft RTF Specifications,
converted to a PDF file.
I eventually removed that feature because, in the following weeks, I did not
come across any new sample with hyphenated words, and I was afraid of
side effects.
However, it now seems to make sense to put it back. I think I will add
a PDFOPT_UNHYPHENATE option to the constructor, so that the output text is
post-processed to remove hyphens.
I will get back to you when the new version is available.
Christian.
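A minimal sketch of the kind of post-processing described above, written in Python rather than in the class's own PHP. It is not the class's code: the function name `unhyphenate` is hypothetical, and the sketch assumes that a hyphen directly followed by a line break, with word characters on both sides, marks a split word.

```python
import re

# A hyphen with a word character before it, followed by a line break and a
# word character at the start of the next line. Lookarounds keep the letters
# themselves out of the match, so only the hyphen and the break are removed.
_LINE_END_HYPHEN = re.compile(r"(?<=\w)-[ \t]*\r?\n[ \t]*(?=\w)")


def unhyphenate(text: str) -> str:
    """Join words that were split by a hyphen at the end of a line."""
    return _LINE_END_HYPHEN.sub("", text)


sample = "the quality of the ex-\ntracted text"
print(unhyphenate(sample))
```

Hyphens inside a line, as in "well-known", are left alone because they are not followed by a line break.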
From: phisu [mailto:[email protected]]
Sent: Monday, 8 August 2016 11:37
To: christian-vigh-phpclasses/PdfToText
Subject: [christian-vigh-phpclasses/PdfToText] problem with hyphen (#10)
hello Christian,
in almost every pdf we find hyphens. when the hyphens are at the end of a
line, i guess we are mostly not interested in them. the quality of the
extracted text might be better if they were eliminated. this could be done
by an extra cleanup of the output of your class, or by your class itself.
what do you think about that?
philipp
Oops, I completely forgot: do you have a sample to give me? Or can you
point me to a sample you have already sent me?
Hello Philipp,
I'm glad to tell you that the PdfToText V1.2.36 class is now able to
"un-hyphenate" words. Simply specify the PDFOPT_NO_HYPHENATED_WORDS flag for
the $options parameter of the constructor or of the Load() method.
I've noticed one unwanted side effect in your sample
"150701-DSE-Katalog-verlinkt.pdf": the output text
à-la-carte-
Speisen
is displayed as:
à-la-carteSpeisen
Maybe it will get better once I have implemented more robust management
of x/y coordinates, but don't expect miracles!
However, the rest of the text contents, which contain many hyphenated
words, seem to look fine.
Christian.
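The side effect above can be reproduced with a small Python sketch, which also shows one possible mitigation. This is only an illustration, not what the class does: the function name `join_split_words` and the rule "keep the hyphen when the left fragment already contains one" are assumptions made for this example.

```python
import re

# Left fragment (possibly already hyphenated), the line-end hyphen, the line
# break, then the word fragment that starts the next line.
_SPLIT_WORD = re.compile(r"([^\s-]+(?:-[^\s-]+)*)-[ \t]*\r?\n[ \t]*(\w+)")


def join_split_words(text: str) -> str:
    """Join words split across lines, keeping the hyphen of likely compounds."""

    def _join(match):
        left, right = match.group(1), match.group(2)
        # Assumed heuristic: a left fragment that already contains a hyphen
        # ("à-la-carte") is treated as a compound, so the hyphen is kept.
        if "-" in left:
            return f"{left}-{right}"
        # Otherwise the hyphen is treated as a plain line-break hyphen.
        return left + right

    return _SPLIT_WORD.sub(_join, text)


print(join_split_words("à-la-carte-\nSpeisen"))
print(join_split_words("Kata-\nlog"))
```

The heuristic is crude: it cannot tell a compound that is broken after its first part from an ordinary split word, so some hyphens will still be dropped or kept wrongly.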
hello christian,
i will take a closer look at the hyphens in my pdf files. but i tried your new version [Version : 1.2.36] [Date : 2016/08/07] with the following file:
the output starts with:
61111113111399111111111111391121111911111311137111111146111911113 43
6111111311139911111111111139112111191111131113711111119111911113
7111111111361111111111119113113111381911911131213211111119111111911111111114336111111911111911387745311137444434545454443
61311113111111111111311131111911399191311 111111137111111111111921911114311131113911211119111113111371111111911191111312139
11139111111111131911111311131111111131119111111436111111119113191213111153721381111111311138119111111111111
3■3
1111311111111113111911111381111111 11111139114361111111131113811111111111111347112437119111
911119143911911911119114433■3
with version [Version : 1.2.35] [Date : 2016/08/06] the output for the same file was fine.
philipp
hello christian,
i think the elimination of hyphens is not as important as accurate output of white spaces and line breaks.
philipp
Hello Philipp,
It's too late! I implemented this feature in the early versions of my class,
then removed it because I feared side effects.
I have added it again: it was nothing and took me an hour to complete.
Sometimes I need to work on easy things…
Christian.
From: phisu [mailto:[email protected]]
Sent: Tuesday, 9 August 2016 10:03
To: christian-vigh-phpclasses/PdfToText
Cc: christian-vigh-phpclasses; Comment
Subject: Re: [christian-vigh-phpclasses/PdfToText] problem with hyphen (#10)
hello christian,
i think the elimination of hyphens is not as important as accurate
output of white spaces and line breaks.
philipp
Hello Philipp,
I solved this problem late last night, before you ran your tests.
It was due to my complete rework of how I handle Unicode to UTF-8 translations. One internal function, which used to accept a character as a parameter, now accepts an integer value. I had missed 2 calls in my code that were still supplying a character value as a parameter.
The latest version, 1.2.38, fixes that (I tried it on the sample you sent me).
Christian.
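To illustrate the kind of function involved, here is a minimal Python sketch of a converter that takes an integer code point and returns its UTF-8 bytes. It is not the class's internal function: the name `codepoint_to_utf8` is hypothetical, and the explicit type check is an assumption added to show how a caller that still passes a character could be caught early.

```python
def codepoint_to_utf8(cp: int) -> bytes:
    """Encode one Unicode code point, given as an integer, into UTF-8 bytes."""
    # Reject callers that still pass a character (or anything non-integer).
    if not isinstance(cp, int) or isinstance(cp, bool):
        raise TypeError(f"expected an integer code point, got {type(cp).__name__}")
    # Surrogates and values above U+10FFFF are not valid scalar values.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"invalid code point: {cp!r}")
    if cp < 0x80:       # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(codepoint_to_utf8(0x20AC).decode("utf-8"))  # code point of the euro sign
```

With this check, a leftover call such as `codepoint_to_utf8("A")` raises a `TypeError` instead of producing wrong output.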
From: phisu [mailto:[email protected]]
Sent: Tuesday, 9 August 2016 08:31
To: christian-vigh-phpclasses/PdfToText
Cc: christian-vigh-phpclasses; Comment
Subject: Re: [christian-vigh-phpclasses/PdfToText] problem with hyphen (#10)
hello christian,
i will take a closer look at the hyphens in my pdf files. but i tried your new version [Version : 1.2.36] [Date : 2016/08/07] with the following file:
the output starts with:
61111113111399111111111111391121111911111311137111111146111911113 43
6111111311139911111111111139112111191111131113711111119111911113
7111111111361111111111119113113111381911911131213211111119111111911111111114336111111911111911387745311137444434545454443
61311113111111111111311131111911399191311 111111137111111111111921911114311131113911211119111113111371111111911191111312139
11139111111111131911111311131111111131119111111436111111119113191213111153721381111111311138119111111111111
3■3
1111311111111113111911111381111111 11111139114361111111131113811111111111111347112437119111
911119143911911911119114433■3
with version [Version : 1.2.35] [Date : 2016/08/06] the output for the same file was fine.
philipp
I want to keep the hyphens in my PDF. Is there an option not to remove them with layout? As of now, it removes all the hyphens from the table in my PDF.