
Comments (1)

GitHubRulesOK avatar GitHubRulesOK commented on August 13, 2024

You need to use a coordinate viewer to see why that value may be wrong. See here:
https://github.com/christian-vigh-phpclasses/PdfToText/blob/master/examples/text-capture/sample-report.pdf
which supposedly produces this result:

[Page : 1, width = 596, height = 843]
[x:248.76, y:760.4, w: 79.895, h:12]REPORT HEADER 
[x:70.695, y:746.6, w: 2.381, h:12] 
[x:84.495, y:722.6, w: 2.381, h:12] 
[x:84.495, y:734.6, w: 2.381, h:12] 
[x:0, y:708.08, w: 124.619, h:12]Column1  Column2  Column3 
[x:70.8, y:690.32, w: 76.99, h:12]L1C1  L1C2  L1C3

Note that in this example the x:0 on the second-to-last row appears to be incorrect, especially if w:124.619 is supposedly right too.

The PDF file reports
MediaBox[0 0 595.28 841.89]/Rotate 0
so there appears to be some odd rounding up to integers in the reported page size; rounded to the nearest point it should be width 595, height 842.
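
A quick check of that arithmetic, as a minimal sketch. The variable names are only illustrative, and this says nothing about how PdfToText actually computes the page size; it just compares the MediaBox with the reported values.

```python
import math

media_box = (0, 0, 595.28, 841.89)  # from the PDF: MediaBox[0 0 595.28 841.89]
reported = (596, 843)               # from the output: width = 596, height = 843

width = media_box[2] - media_box[0]
height = media_box[3] - media_box[1]

nearest = (round(width), round(height))          # round to the nearest point
ceiling = (math.ceil(width), math.ceil(height))  # always round up

print("nearest :", nearest)
print("ceiling :", ceiling)
print("reported:", reported)
```

Rounding to nearest gives the 595 x 842 suggested above, and plain rounding up does not reproduce the reported height either, so whatever the tool does is something else again.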
The first text block is supposedly at this point, so we can agree it is 12 points high (h:12):
/R8 12 Tf
1.00055 0 0 1 248.76 760.24 Tm
[<01>-2.64015<02>1.17611<03>-3.54053<04>2.56451<01>-2.64015<05>1.17611<06>.136644<07>2.56451<02>1.17611(\b)2.56451(\t)2.56451<02>1.17611<01>-2.64014<06>] TJ

We can also see there is an odd scale factor (1.00055) that is going to upset scaled calculations, and we can see the text is not plain text. It is mapped thus: <01>=R <02>=E <03>=P <04>=O <01>=R <05>=T <06>=" " and so on, which spells out REPORT" "HEADER" ". Each letter also carries a tweaking factor that jiggles its position (kerning), but those values are too erratic to be used for the length of the string. The best we can accept is the start values and the height; the width is unlikely to be of much value, especially with the white space included after each sub-part of the string.
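
To make that mapping concrete, here is a rough sketch that splits the TJ array above into glyph codes and position adjustments and applies the hand-read table. The names (`GLYPHS`, `parse_tj`, `decode_literal`) are made up for this example, the codes for H, A and D are inferred from the decoded result rather than listed explicitly, and a real extractor would read the font's encoding or ToUnicode map instead of a hard-coded table. Octal escapes and odd-length hex strings are not handled.

```python
import re

# Glyph code -> character, read off by hand from the stream above.
# Codes 0x07-0x09 are inferred from the decoded result "REPORT HEADER".
GLYPHS = {0x01: "R", 0x02: "E", 0x03: "P", 0x04: "O", 0x05: "T",
          0x06: " ", 0x07: "H", 0x08: "A", 0x09: "D"}

# Single-character escapes allowed inside a PDF literal string.
ESCAPES = {"n": 0x0A, "r": 0x0D, "t": 0x09, "b": 0x08, "f": 0x0C}

TOKEN = re.compile(
    r"<(?P<hex>[0-9A-Fa-f]*)>"              # hex string, e.g. <01>
    r"|\((?P<lit>(?:\\.|[^\\)])*)\)"        # literal string, e.g. (\b)
    r"|(?P<num>[-+]?(?:\d+\.?\d*|\.\d+))"   # position adjustment
)

def decode_literal(body):
    """Turn the inside of a (...) string into a list of byte codes."""
    codes, i = [], 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            i += 1
            ch = body[i]
            codes.append(ESCAPES.get(ch, ord(ch)))
        else:
            codes.append(ord(ch))
        i += 1
    return codes

def parse_tj(array_text):
    """Return (glyph codes, numeric adjustments) found in a TJ array."""
    codes, adjustments = [], []
    for m in TOKEN.finditer(array_text):
        if m.lastgroup == "hex":
            codes.extend(bytes.fromhex(m.group("hex")))
        elif m.lastgroup == "lit":
            codes.extend(decode_literal(m.group("lit")))
        else:
            adjustments.append(float(m.group("num")))
    return codes, adjustments

# Raw string so that \b and \t stay as the two characters PDF wrote.
TJ = (r"[<01>-2.64015<02>1.17611<03>-3.54053<04>2.56451<01>-2.64015"
      r"<05>1.17611<06>.136644<07>2.56451<02>1.17611(\b)2.56451"
      r"(\t)2.56451<02>1.17611<01>-2.64014<06>]")

codes, adjustments = parse_tj(TJ)
text = "".join(GLYPHS.get(c, "?") for c in codes)
print(repr(text), adjustments)
```

Note that (\b) and (\t) are just bytes 0x08 and 0x09 written as string escapes, so they sit in the same code sequence as the hex entries.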

So why is the odd row in that block of text at x:0, when it should be as well defined as the others? The answer is that it is positioned relative to all the previous widths, so it has no absolute value for x. But why does the next row show a reasonable x:70.8? That is because the string starts at a fresh absolute (not relative) location: x:70.8 y:690.16 is where we find L1C1 L1C2 L1C3, where <06> is still " " and (\n)=L, <0B>=1, (\f)=C, (\r)=2 and <0E>=3.
That's right: there is very little logic in reversing each PDF using a logical methodology, since it is a stack-based language.

BT
/R8 12 Tf
1.00055 0 0 1 70.8 690.16 Tm
[(\n)11.1694<0B>.274507(\f)-2.63954<0B>.274792<06>-10264.2(\n)11.1694<0B>.274507(\f)-2.64015(\r).274792<06>-10274.2(\n)11.1706<0B>.275727(\f)-2.64015<0E>.274792<06>] TJ
ET
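
As a rough illustration of why the numbers inside that array matter for width, here is a sketch that converts the two large adjustments into points. It assumes the usual PDF rule that a TJ number is in thousandths of a text-space unit and is subtracted from the pen position, scaled by the font size (12 from `/R8 12 Tf`) and by the horizontal factor of the Tm matrix (1.00055). It ignores glyph widths entirely, so it is not a full width calculation, and the 1000 threshold is an arbitrary cut-off chosen for this example.

```python
import re

FONT_SIZE = 12.0   # from "/R8 12 Tf"
X_SCALE = 1.00055  # first entry of the Tm matrix

# Raw string so the \n, \f and \r escapes stay as PDF wrote them.
TJ = (r"[(\n)11.1694<0B>.274507(\f)-2.63954<0B>.274792<06>-10264.2"
      r"(\n)11.1694<0B>.274507(\f)-2.64015(\r).274792<06>-10274.2"
      r"(\n)11.1706<0B>.275727(\f)-2.64015<0E>.274792<06>]")

# Drop the string operands so only the numeric adjustments remain.
numbers_only = re.sub(r"<[0-9A-Fa-f]*>|\((?:\\.|[^\\)])*\)", " ", TJ)
adjustments = [float(n)
               for n in re.findall(r"[-+]?(?:\d+\.?\d*|\.\d+)", numbers_only)]

def shift_in_points(adjustment):
    """Horizontal pen movement caused by one TJ number (negative = to the right)."""
    return -adjustment / 1000.0 * FONT_SIZE * X_SCALE

# Anything beyond a whole text-space unit is treated as a column gap here.
gaps = [round(shift_in_points(a), 2) for a in adjustments if abs(a) > 1000]
print(gaps)
```

Under those assumptions the two big negative values push the pen roughly 123 points to the right each time, which is what separates the three columns; the small values are the per-letter jiggle.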

So in summary, PDF is not the easiest way to define blobs of ink (glyphs), and trying to measure relative offsets in poorly defined strings of text content is prone to errors.

from pdftotext.
