
Comments (8)

christian-vigh-phpclasses avatar christian-vigh-phpclasses commented on August 13, 2024

Hello Philipp,

To tell the truth, the initial version of my class did suppress hyphens; I
noticed the problem when running it on the Microsoft RTF Specifications,
converted to a PDF file.

I eventually removed that feature because, during the following weeks, I did
not come across any new sample showing such hyphens, and I was afraid of
side effects.

However, it now seems to make sense to put it back. I think I will add a
PDFOPT_UNHYPHENATE option to the constructor, so that the output text is
post-processed to remove hyphens.

I will get back to you when the new version is available.

Christian.


From: phisu [mailto:[email protected]]
Sent: Monday, 8 August 2016 11:37
To: christian-vigh-phpclasses/PdfToText
Subject: [christian-vigh-phpclasses/PdfToText] problem with hyphen (#10)

Hello Christian,

We find hyphens in almost every PDF. When the hyphens are at the end of a
line, I guess we are mostly not interested in them. The quality of the
extracted text might be better if they were eliminated. This could be done
by an extra cleanup of the output of your class, or by your class itself.
What do you think about that?

philipp



from pdftotext.

christian-vigh-phpclasses avatar christian-vigh-phpclasses commented on August 13, 2024

Oops, I completely forgot: do you have a sample to give me? Or can you
point me to a sample you have already sent me?



christian-vigh-phpclasses avatar christian-vigh-phpclasses commented on August 13, 2024

Hello Philipp,

I'm glad to tell you that the PdfToText V1.2.36 class is now able to
"un-hyphenate" words. Simply specify PDFOPT_NO_HYPHENATED_WORDS in the
$options parameter of the constructor or of the Load() method.

I've noticed one unwanted side effect in your sample
"150701-DSE-Katalog-verlinkt.pdf": the output text

        à-la-carte-

        Speisen

is displayed as:

        à-la-carteSpeisen

It may improve once I have implemented more robust management of x/y
coordinates, but don't expect miracles!

However, the rest of the text, which contains many hyphenated words, looks
fine.
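
For readers who want to see the behaviour discussed above in isolation, here is a minimal Python sketch of end-of-line un-hyphenation done as a post-processing step on extracted text. It is not the PdfToText implementation (that class is written in PHP, and its code is not shown in this thread); the function name `unhyphenate` and the regular expression are assumptions made for illustration only. Because the rule looks only at the text and knows nothing about x/y coordinates, it reproduces the side effect reported above, where a hyphen that belongs to a compound word is dropped as well.

```python
import re

# Hypothetical rule (not taken from PdfToText): a word character, a hyphen,
# optional trailing spaces, one or more line breaks, then a word character.
_HYPHEN_BREAK = re.compile(r"(\w)-[ \t]*(?:\r?\n[ \t]*)+(\w)")


def unhyphenate(text: str) -> str:
    """Join words that were split by a hyphen at the end of a line."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


# An ordinary hyphenated line break is rejoined.
print(unhyphenate("extrac-\nted text"))

# The compound from the sample loses its real hyphen too,
# because text alone cannot tell the two cases apart.
print(unhyphenate("à-la-carte-\n\nSpeisen"))
```

Hyphens inside a line (as in "à-la") are left untouched; only a hyphen directly followed by a line break is removed.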

Christian.



phisu avatar phisu commented on August 13, 2024

Hello Christian,
I will take a closer look at the hyphens in my PDF files. However, I tried your new version [Version : 1.2.36] [Date : 2016/08/07] with the following file:

https://www.digitales.oesterreich.gv.at/documents/22124/30428/BarrierefreiesInternet_WCAG_Aspekte_SdOeB_20100818.pdf/9dc7ffb9-6420-406d-be6e-a0624e91547b

the output starts with:

61111113111399111111111111391121111911111311137111111146111911113 43
6111111311139911111111111139112111191111131113711111119111911113
7111111111361111111111119113113111381911911131213211111119111111911111111114336111111911111911387745311137444434545454443
61311113111111111111311131111911399191311 111111137111111111111921911114311131113911211119111113111371111111911191111312139
11139111111111131911111311131111111131119111111436111111119113191213111153721381111111311138119111111111111
3■3
1111311111111113111911111381111111 11111139114361111111131113811111111111111347112437119111
911119143911911911119114433■3

With version [Version : 1.2.35] [Date : 2016/08/06], the output for the same file was fine.

philipp


phisu avatar phisu commented on August 13, 2024

Hello Christian,
I think eliminating hyphens is less important than accurate output of white space and line breaks.

philipp


christian-vigh-phpclasses avatar christian-vigh-phpclasses commented on August 13, 2024

Hello Philipp,

It's too late! I implemented this feature in the early versions of my class,
then removed it because I feared side effects.

I have added it again: it was no trouble and took me an hour to complete.
Sometimes I need to work on easy things…

Christian.



christian-vigh-phpclasses avatar christian-vigh-phpclasses commented on August 13, 2024

Hello Philipp,

I solved this problem late last night, before you ran your tests.

It was due to my complete reworking of how I handle Unicode-to-UTF-8 translation. One internal function, which used to accept a character as a parameter, now accepts an integer value. I had missed two calls in my code that were still supplying a character value as a parameter.

The latest version, 1.2.38, fixes that (I tried it on the sample you sent me).
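
As an illustration of the kind of change described above, here is a minimal Python sketch of a function that takes an integer code point and returns its UTF-8 bytes. It is only a sketch under assumptions: the name `codepoint_to_utf8` is hypothetical, the real function in PdfToText is PHP and is not shown in this thread, and the type check is added here just to make a stale call site fail loudly.

```python
def codepoint_to_utf8(cp: int) -> bytes:
    """Encode one Unicode code point, given as an integer, as UTF-8 bytes."""
    if not isinstance(cp, int):
        # A caller still passing a character ends up here.
        raise TypeError("expected an integer code point, got %r" % (cp,))
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range: %r" % (cp,))
    if cp < 0x80:          # 1 byte: plain ASCII
        return bytes([cp])
    if cp < 0x800:         # 2 bytes
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 3 bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),          # 4 bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# A caller that holds a character must convert it to an integer first.
print(codepoint_to_utf8(ord("à")))
```

In this sketch a call such as `codepoint_to_utf8("à")` raises a `TypeError` instead of producing wrong output. How the missed calls behaved in the PHP code is not stated in the thread beyond the garbled output reported earlier.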

Christian.



shravspy avatar shravspy commented on August 13, 2024

I want to keep the hyphens in my PDF. Is there an option not to remove them with layout? As of now, all the hyphens are removed from the table in my PDF.
