Comments (12)
@lfoppiano you need to give me details about the command options you are using. Is the output from the actual GROBID configuration?
from pdfalto.
@Aazhar this output came from a module using the latest Grobid version. I did not run the document through pdfalto directly.
from pdfalto.
After re-checking, this is the output with pdf2xml: 1903.07791.pdfxml.txt
and this is the output with pdfalto: 1903.07791.txt
If I'm not wrong, some text is missing in the pdfalto version, but it may be better if you double-check.
To reproduce the issue you can use grobid-quantities, generate the training and pick up the txt version of the training data. If you use grobid 0.5.4 (the default for grobid-quantities) you will have the pdf2xml output, while if you update to 0.5.5-SNAPSHOT you will have the pdfalto output.
from pdfalto.
This document is an example of problematic glyphs (not correctly mapped to the corresponding Unicode code point). Since the OCR feature will be implemented soon (we are still fixing model parameters and training data for better generalisation), this output is correct as things stand.
Regarding the difference in results (pdfalto vs. pdf2xml): unless you are using the OCR or reading order features, the output should not differ, except for notation and Greek glyphs, for which some common font rules are used.
from pdfalto.
I am very interested in this (OCR of unknown glyphs), let me know if I can help with anything (deep learning, dataset preparation etc.).
You rock ;)
from pdfalto.
Another test document here: hal-00720564.pdf
This snippet has m s -1, which is extracted as (2.57–4.63 mÆs)1).
pdf2xml and Preview give the same output, so I would say it is not urgent to fix. I'm just giving an additional test document 😄
from pdfalto.
I found a similar problem with ligatures, example file https://arxiv.org/pdf/1906.08479.pdf
The character in question is an "fi" ligature, used in "mean-field", for example. I compiled pdfalto from the master branch and ran ./pdfalto -f 48 -l 48 1906.08479.pdf 1906.08479.alto.xml, and the fi ligature is dropped (I removed some zero attributes for easier reading):
<String sid="p48_s151" ID="p48_w151" CONTENT="mean-" HPOS="492.065" VPOS="327.587" WIDTH="27.5184" HEIGHT="9.7091" STYLEREFS="font1"/>
<String sid="p48_s152" ID="p48_w152" CONTENT="eld" HPOS="525.57" VPOS="327.587" WIDTH="13.0108" HEIGHT="9.7091" STYLEREFS="font1"/>
This is from the [CDL03] reference line, and the word "field" is common in this PDF, so examples are abundant.
Is this the same problem?
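One way to spot a dropped glyph in output like the two String elements above is to look at the horizontal gap between them. The following is only a sketch: the class and method names are hypothetical, the sample keeps just the attributes it needs, and reading a gap as "a glyph was dropped here" is an assumption, since a gap can also be ordinary spacing.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class AltoGapSketch {
    // The two String elements quoted above, reduced to the attributes used here
    // and wrapped in a single root element so the fragment parses on its own.
    static final String SAMPLE =
        "<TextLine>"
        + "<String CONTENT='mean-' HPOS='492.065' WIDTH='27.5184'/>"
        + "<String CONTENT='eld' HPOS='525.57' WIDTH='13.0108'/>"
        + "</TextLine>";

    // Parse a fragment and return its String elements in document order.
    static NodeList strings(String altoFragment) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(altoFragment.getBytes(StandardCharsets.UTF_8)))
                .getElementsByTagName("String");
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    // Horizontal distance between the right edge of `left` and the left edge of `right`.
    static double gapBetween(Element left, Element right) {
        double leftEnd = Double.parseDouble(left.getAttribute("HPOS"))
                       + Double.parseDouble(left.getAttribute("WIDTH"));
        return Double.parseDouble(right.getAttribute("HPOS")) - leftEnd;
    }

    public static void main(String[] args) {
        NodeList words = strings(SAMPLE);
        double gap = gapBetween((Element) words.item(0), (Element) words.item(1));
        System.out.println("gap between the two strings: " + gap);
    }
}
```

For the sample, "mean-" ends at 492.065 + 27.5184 = 519.5834 and "eld" starts at 525.57, so the gap is about 5.99 units. That is wider than the average character width of "eld" (13.0108 / 3, about 4.34), which is consistent with a missing glyph between the two strings.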
from pdfalto.
Just curious if any work has been done on the ligature issue. I'm also seeing it occur with "ff" and "ffi", as in the words "effect" and "efficacy".
from pdfalto.
@bmorton1, so far the ligatures are left as such: if we have a \uFB00, we leave it as is and we don't rewrite it as the two characters ff. We considered it out of the scope of pdfalto (some users could be interested in keeping the ligatures). Of course we could also include it in pdfalto, it's open to debate :)
On the other hand, pdfalto does handle character composition. Composed sequences are introduced simply to save glyphs, so there is no reason to keep the sequence 'e for é (it's something very standardised too).
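To illustrate what recomposition means, here is a minimal Java sketch using the standard java.text.Normalizer. It is not pdfalto's implementation (pdfalto is not written in Java), and it assumes the text already holds a Unicode combining sequence (base letter followed by a combining mark). A PDF may instead place a spacing accent glyph before the letter, which NFC alone does not handle. The class name is hypothetical.

```java
import java.text.Normalizer;

final class CompositionSketch {
    // Recompose "base letter + combining mark" into the precomposed character (NFC).
    static String compose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code units for one letter.
        String decomposed = "caf" + "e\u0301";
        String composed = compose(decomposed);
        // The composed form is one code unit shorter.
        System.out.println(decomposed.length() + " -> " + composed.length()); // prints "5 -> 4"
    }
}
```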
Tools using pdfalto can then handle ligatures on their own; it's not sophisticated. For instance, GROBID, which uses pdfalto, applies several character normalization steps (ligatures, Unicode character family normalization, etc.). For ligatures, we use this mapping:
// ligature
case '\uFB00': {
res += "ff";
break;
}
case '\uFB01': {
res += "fi";
break;
}
case '\uFB02': {
res += "fl";
break;
}
case '\uFB03': {
res += "ffi";
break;
}
case '\uFB04': {
res += "ffl";
break;
}
case '\uFB06': {
res += "st";
break;
}
case '\uFB05': {
res += "ft";
break;
}
case '\u00E6': {
res += "ae";
break;
}
case '\u00C6': {
res += "AE";
break;
}
case '\u0153': {
res += "oe";
break;
}
case '\u0152': {
res += "OE";
break;
}
Note that this is a different question from the one raised in this issue, which is that some glyphs cannot be correctly mapped to a Unicode code point. Such an incorrectly mapped glyph could happen to be a ligature. For that issue, we plan to use a bit of "local" OCR, because there is no other solution.
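For tools that consume pdfalto output and want a similar expansion, Unicode compatibility normalization is an alternative to a hand-written switch. This is only a sketch, not GROBID's code, and the class and method names are hypothetical. It applies NFKC to the U+FB00 to U+FB06 block and keeps an explicit table for æ/Æ/œ/Œ, which have no Unicode decomposition. One difference from the mapping quoted above: NFKC turns U+FB05 (long s + t) into "st", whereas the quoted mapping yields "ft".

```java
import java.text.Normalizer;
import java.util.Map;

final class LigatureSketch {
    // Letters that NFKC leaves untouched; the values mirror the mapping quoted above.
    private static final Map<Character, String> EXTRA = Map.of(
        '\u00E6', "ae",
        '\u00C6', "AE",
        '\u0153', "oe",
        '\u0152', "OE");

    static String expandLigatures(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '\uFB00' && c <= '\uFB06') {
                // Latin presentation-form ligatures decompose under NFKC.
                out.append(Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKC));
            } else {
                out.append(EXTRA.getOrDefault(c, String.valueOf(c)));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expandLigatures("mean-\uFB01eld")); // prints "mean-field"
    }
}
```

Applying NFKC to the whole string would also rewrite many unrelated characters (superscripts, full-width forms, and so on), which is why the sketch restricts it to the ligature block.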
from pdfalto.
@kermitt2 is this ligature code in a released Grobid version, or is it only in the master branch?
Also, in the example I gave, it looks like these ligatures are not passed to the consumer by pdfalto. They are simply dropped, so there is no chance for Grobid to handle them. Maybe this PDF is writing ligatures in some weird way?
When I open my example PDF in Firefox and copy a span, I get
Role of the interaction matrix in mean-�eldspin glass models
This � is the character \u001B, which doesn't look right.
MacOS Preview.app correctly copies it as:
Role of the interaction matrix in mean-field spin glass models
Chrome copies almost the same string as Firefox, except that it inserts \n between "\u001Beld" and "spin". I think they both use pdf.js, but I'm not 100% sure. Safari gives the same correct output as Preview.
I could investigate what happens in pdfalto, but I don't know where to begin. It would be helpful if you can point me in the general direction of where to look at it in pdfalto source.
from pdfalto.
@vsolovyov the mapping is in both GROBID master and released versions.
However, if the character is "dropped", it means that the Unicode code point is not resolved for this glyph. This relates in particular to the way PDF embeds fonts. When a glyph is in a locally embedded font, its code is often not the expected Unicode code point (for your example it should be \u001B) but a code in the free Unicode range (thus the placeholder � in your case), so there is no way to know which character this glyph actually represents.
This is mentioned here and in some issues for grobid in the last years.
This usually happens with special characters, in particular mathematical symbols. The fact that the glyph is a ligature is not specific to the problem.
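A consumer of extracted text can at least flag such unresolved glyphs before processing further. The sketch below is an assumption about what "unresolved" looks like in practice, based on the examples in this thread (a control character such as \u001B, a private-use code point, or the replacement character �). It is not something pdfalto or GROBID is known to do, and the names are hypothetical.

```java
final class GlyphCheckSketch {
    // True for code points that are unlikely to be legitimate text content.
    static boolean isSuspicious(int codePoint) {
        // Ordinary whitespace controls are expected in extracted text.
        if (codePoint == '\n' || codePoint == '\r' || codePoint == '\t') {
            return false;
        }
        // U+FFFD REPLACEMENT CHARACTER is what viewers show for unmappable glyphs.
        if (codePoint == 0xFFFD) {
            return true;
        }
        int type = Character.getType(codePoint);
        return type == Character.CONTROL
            || type == Character.PRIVATE_USE
            || type == Character.UNASSIGNED;
    }

    // Count suspicious code points in a string.
    static long countSuspicious(String text) {
        return text.codePoints().filter(GlyphCheckSketch::isSuspicious).count();
    }

    public static void main(String[] args) {
        // The string copied from Firefox in the example above, with ESC in place of "fi".
        System.out.println(countSuspicious("mean-\u001Beld")); // prints 1
    }
}
```

Such a check only detects the problem. Recovering the intended character still needs the font information or OCR, as discussed above.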
It was already mentioned in other issues, I think. MacOS in general includes some advanced PDF processing, in particular more proprietary fonts (from Adobe in particular), and afaik it does on-the-fly OCR for unresolved glyphs. So when testing with Preview, all these locally unresolved glyphs are resolved. It's actually great stuff, and I hope to add something similar to pdfalto so that the tool can reach the level of PDF support of MacOS. Other Linux libraries (derived from xpdf) and open source PDF parsing libraries are "closer" to the actual PDF encoding, so you will see the problem.
It also means in practice that it is not possible to test the encoding of a PDF with MacOS Preview, because it does some excellent reconstruction behind the scenes (also for columns).
I hope it clarifies this problem!
from pdfalto.
@kermitt2 thank you very much for the write-up, it does clarify the problem. I hadn't expected that MacOS does on-the-fly OCR for unresolved glyphs, and I thought the problem with ligatures was much easier than it turns out to be.
from pdfalto.