Comments (1)
You need to use a coordinate viewer to see why that value may be wrong. See this sample file:
https://github.com/christian-vigh-phpclasses/PdfToText/blob/master/examples/text-capture/sample-report.pdf
It supposedly produces this result:
[Page : 1, width = 596, height = 843]
[x:248.76, y:760.4, w: 79.895, h:12]REPORT HEADER
[x:70.695, y:746.6, w: 2.381, h:12]
[x:84.495, y:722.6, w: 2.381, h:12]
[x:84.495, y:734.6, w: 2.381, h:12]
[x:0, y:708.08, w: 124.619, h:12]Column1 Column2 Column3
[x:70.8, y:690.32, w: 76.99, h:12]L1C1 L1C2 L1C3
Note that in this example the second-to-last row reports x:0, which does not appear to be correct, especially if w:124.619 is supposedly right too.
The PDF file reports
MediaBox[0 0 595.28 841.89]/Rotate 0
so it appears there is some odd rounding up to integers going on. Rounded normally, the page should be width 595, height 842.
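As a quick check of that arithmetic, the sketch below compares ordinary rounding with rounding up for the MediaBox values quoted above. It uses only Python built-ins. How PdfToText actually arrives at 596 x 843 is not shown in this thread, so this is not a model of its code.

```python
import math

# MediaBox from the PDF: [llx lly urx ury]
media_box = (0, 0, 595.28, 841.89)
width = media_box[2] - media_box[0]
height = media_box[3] - media_box[1]

nearest = (round(width), round(height))               # round to nearest integer
rounded_up = (math.ceil(width), math.ceil(height))    # always round up

print("nearest:   ", nearest)
print("rounded up:", rounded_up)
print("reported:  ", (596, 843))
```

Rounding up accounts for the reported width of 596 but still gives 842 for the height, so the reported 843 needs some further explanation.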
The first text block is supposedly at the point shown below, and we can agree it is 12 points high (h:12):
/R8 12 Tf
1.00055 0 0 1 248.76 760.24 Tm
[<01>-2.64015<02>1.17611<03>-3.54053<04>2.56451<01>-2.64015<05>1.17611<06>.136644<07>2.56451<02>1.17611(\b)2.56451(\t)2.56451<02>1.17611<01>-2.64014<06>] TJ
We can also see there is an odd horizontal scale factor (1.00055) that is going to upset scaled calculations, and we can see the text is not stored as normal characters. It is mapped thus: <01>=R <02>=E <03>=P <04>=O <01>=R <05>=T <06>=" ", and likewise <07>=H (\b)=A (\t)=D, so the string spells out REPORT" "HEADER" ".
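The mapping can be applied mechanically. The sketch below decodes the TJ array quoted above using a hand-built table. The table itself is an assumption: R, E, P, O, T and the space come from the mapping in this comment, H, A and D are read off the spelled-out result, and a real extractor would normally take the mapping from the font's ToUnicode CMap. The names `GLYPHS`, `ESCAPES` and `decode_tj` are made up for the example.

```python
import re

# Glyph code -> character, read off this one TJ array (hand-built assumption).
GLYPHS = {1: "R", 2: "E", 3: "P", 4: "O", 5: "T", 6: " ", 7: "H", 8: "A", 9: "D"}

# PDF literal strings such as (\b) hold a single escaped byte.
ESCAPES = {"b": 8, "t": 9, "n": 10, "f": 12, "r": 13}

# Matches either a one-byte hex string <..> or a one-character escaped literal.
STRING = re.compile(r"<([0-9A-Fa-f]{2})>|\(\\([btnfr])\)")

def decode_tj(array):
    """Decode the strings of a TJ array, ignoring the numeric adjustments."""
    codes = []
    for hex_digits, escape in STRING.findall(array):
        codes.append(int(hex_digits, 16) if hex_digits else ESCAPES[escape])
    return "".join(GLYPHS[code] for code in codes)

tj = (r"[<01>-2.64015<02>1.17611<03>-3.54053<04>2.56451<01>-2.64015"
      r"<05>1.17611<06>.136644<07>2.56451<02>1.17611(\b)2.56451(\t)2.56451"
      r"<02>1.17611<01>-2.64014<06>]")

print(repr(decode_tj(tj)))
```

Note that (\b) and (\t) are not special markers: they are just glyph codes 8 and 9 written with the escape syntax PDF uses for those byte values.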
We also see that each letter uses a tweaking factor to jiggle its position (kerning), but those values are too erratic to be used for the length of the string. The best we can accept is the start values and the height. The width is unlikely to be of much value, especially with the white space included after each sub-part of the string.
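To get a feel for how much those kerning values matter, the sketch below converts the numbers in the REPORT HEADER array into points. It relies on the PDF rule that TJ adjustments are expressed in thousandths of a text-space unit and subtracted from the pen position. It ignores the 1.00055 scale and the glyph widths themselves, which live in the font and are not shown here.

```python
import re

FONT_SIZE = 12.0  # from "/R8 12 Tf"

tj = (r"[<01>-2.64015<02>1.17611<03>-3.54053<04>2.56451<01>-2.64015"
      r"<05>1.17611<06>.136644<07>2.56451<02>1.17611(\b)2.56451(\t)2.56451"
      r"<02>1.17611<01>-2.64014<06>]")

# Drop the strings, leaving only the numeric adjustments between them.
numbers_only = re.sub(r"<[0-9A-Fa-f]+>|\([^)]*\)", " ", tj)
adjustments = [float(n) for n in re.findall(r"-?\d*\.?\d+", numbers_only)]

# A positive adjustment moves the next glyph left, a negative one moves it right.
shifts_pt = [-a / 1000.0 * FONT_SIZE for a in adjustments]

print(len(adjustments), "adjustments")
print("largest single shift: %.4f pt" % max(abs(s) for s in shifts_pt))
print("net shift over the string: %.4f pt" % sum(shifts_pt))
```

For this string every shift is well under a tenth of a point, so whatever width is reported has to come almost entirely from the glyph widths rather than from the kerning.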
So why is the odd row in that block of text at x:0, when it should be as well defined as the others? The answer is that it is positioned relative to all the previous widths, so it has no absolute value for x. But why does the next row show a reasonable x:70.8? That is because the string has a fresh absolute (not relative) location: x:70.8 y:690.16 is where we find L1C1 L1C2 L1C3,
where <06> is still " " and (\n)=L, <0B>=1, (\f)=C, (\r)=2 and <0E>=3.
That's right: there is very little logic to lean on when reversing a PDF with a logical methodology, since it is a stack-based language.
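To make the absolute versus relative distinction concrete, here is a minimal sketch of tracking the text position through the Tm (absolute) and Td (relative) operators as the PDF specification defines them. The `TextLine` class and the Td operands are invented for illustration. The operators that actually position the Column1 row are not shown in this thread, so this only demonstrates the general mechanism.

```python
class TextLine:
    """Tracks the text line matrix [a b c d e f] through Tm and Td."""

    def __init__(self):
        self.m = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)  # identity at BT

    def tm(self, a, b, c, d, e, f):
        # Tm replaces the matrix, so e and f are an absolute position.
        self.m = (a, b, c, d, e, f)

    def td(self, tx, ty):
        # Td moves relative to the start of the current line,
        # scaled by whatever matrix is already in force.
        a, b, c, d, e, f = self.m
        self.m = (a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f)

    @property
    def origin(self):
        return self.m[4], self.m[5]

line = TextLine()
line.tm(1.00055, 0, 0, 1, 70.8, 690.16)  # operands from the sample
print("after Tm:", line.origin)
line.td(0, -17.76)                        # hypothetical relative move
print("after Td:", line.origin)
```

An extractor that only records Tm operands would have nothing absolute to report for a row placed with a relative operator, which is one way a value like x:0 could arise.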
BT
/R8 12 Tf
1.00055 0 0 1 70.8 690.16 Tm
[(\n)11.1694<0B>.274507(\f)-2.63954<0B>.274792<06>-10264.2(\n)11.1694<0B>.274507(\f)-2.64015(\r).274792<06>-10274.2(\n)11.1706<0B>.275727(\f)-2.64015<0E>.274792<06>] TJ
ET
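The two very large numbers in that array (-10264.2 and -10274.2) are worth converting to points. The sketch below does so using the PDF rule that TJ adjustments are thousandths of a text-space unit, scaled by the font size and by the horizontal scale in the Tm line. The 1000-unit threshold used to separate gaps from ordinary kerning is an arbitrary choice for this example.

```python
import re

FONT_SIZE = 12.0   # from "/R8 12 Tf"
H_SCALE = 1.00055  # first operand of the Tm line

tj = (r"[(\n)11.1694<0B>.274507(\f)-2.63954<0B>.274792<06>-10264.2"
      r"(\n)11.1694<0B>.274507(\f)-2.64015(\r).274792<06>-10274.2"
      r"(\n)11.1706<0B>.275727(\f)-2.64015<0E>.274792<06>]")

# Drop the strings, leaving only the numeric adjustments between them.
numbers_only = re.sub(r"<[0-9A-Fa-f]+>|\([^)]*\)", " ", tj)
adjustments = [float(n) for n in re.findall(r"-?\d*\.?\d+", numbers_only)]

# Negative adjustments push the following glyph to the right. Anything
# beyond one full em (1000 units) is treated here as a column gap.
gaps_pt = [-a / 1000.0 * FONT_SIZE * H_SCALE for a in adjustments if a < -1000]

print(["%.2f" % g for g in gaps_pt])
```

So the three cells are pushed apart by shifts of roughly 123 points each, not by the <06> space glyph alone, which is another reason a width derived from the string contents is hard to trust.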
So in summary, PDF is not the easiest way to define blobs of ink (glyphs), and trying to measure relative offsets in poorly defined strings of text content is prone to errors.
from pdftotext.