I'm uploading this file <a href="https://github.com/kermitt2/pdfalto/files/3083042

characters not recognised about pdfalto HOT 12 OPEN

lfoppiano commented on September 26, 2024

characters not recognised

from pdfalto.

Comments (12)

Aazhar commented on September 26, 2024

@lfoppiano you need to give me details about the command options you are using. Is this output from the current GROBID configuration?


lfoppiano commented on September 26, 2024

@Aazhar this output was produced by a module using the latest GROBID version. I did not run pdfalto on it directly.


lfoppiano commented on September 26, 2024

After re-checking, this is the output with pdf2xml:
1903.07791.pdfxml.txt

and this is the output with pdfalto: 1903.07791.txt

If I'm not wrong, some text is missing in the pdfalto version, but it may be better if you double-check.

To reproduce the issue, you can use grobid-quantities, generate the training data and pick up the txt version of it. If you use grobid 0.5.4 (the default for grobid-quantities) you will get the pdf2xml output, while if you update to 0.5.5-SNAPSHOT you will get the pdfalto output.


Aazhar commented on September 26, 2024

This document is an example of problematic glyphs (glyphs not correctly mapped to their corresponding Unicode code points). The OCR feature will be implemented soon (we are still fixing model parameters and training data for better generalisation), so until then this output is correct.

Regarding the difference in results (pdfalto/pdf2xml): unless you are using the OCR or reading-order features, the output should not differ, except for notations/Greek glyphs, for which some common font rules are used.


krzynio commented on September 26, 2024

@Aazhar

I am very interested in this (OCR of unknown glyphs), let me know if I can help with anything (deep learning, dataset preparation etc.).

You rock ;)


lfoppiano commented on September 26, 2024

Another test document here: hal-00720564.pdf

This snippet contains "m s -1", which is extracted as "(2.57–4.63 mÆs)1)".

pdf2xml and Preview give the same output, so I would say it is not urgent to fix. I'm just giving an additional test document 😄


vsolovyov commented on September 26, 2024

I found a similar problem with ligatures, example file https://arxiv.org/pdf/1906.08479.pdf

The character in question is the "fi" ligature, used for example in "mean-field". I compiled pdfalto from the master branch and ran ./pdfalto -f 48 -l 48 1906.08479.pdf 1906.08479.alto.xml, and the fi ligature is getting dropped (I removed some zero attributes for easier reading):

<String sid="p48_s151" ID="p48_w151" CONTENT="mean-" HPOS="492.065" VPOS="327.587" WIDTH="27.5184" HEIGHT="9.7091" STYLEREFS="font1"/>
<String sid="p48_s152" ID="p48_w152" CONTENT="eld" HPOS="525.57" VPOS="327.587" WIDTH="13.0108" HEIGHT="9.7091" STYLEREFS="font1"/>

This is from the [CDL03] reference line, and the word "field" is pretty common in this PDF, so examples are abundant.

Is this the same problem?


bmorton1 commented on September 26, 2024

Just curious if any work has been done on the ligature issue. I'm also seeing it occur with "ff" and "ffi", as in the words "effect" and "efficacy".


kermitt2 commented on September 26, 2024

@bmorton1, so far ligatures are left as is: if we have a \uFB00, we keep it and don't rewrite it as the 2 characters ff. We considered this out of the scope of pdfalto (some users could be interested in keeping the ligatures). Of course we could also include it in pdfalto, it's open to debate :)

On the other hand, pdfalto does handle character composition: composed sequences are introduced simply to save glyphs, so there is no reason to keep the sequence 'e for é (it's also something very standardised).
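
As an illustration of what "character composition" means at the Unicode level, here is a minimal Java sketch using the standard `java.text.Normalizer` API. It is not pdfalto's code (pdfalto's own logic is not shown in this thread), and it assumes the accent is already encoded as a combining mark following its base letter; the class and method names are hypothetical.

```java
import java.text.Normalizer;

final class CompositionDemo {
    // Compose base letter + combining mark sequences into precomposed characters (NFC).
    static String compose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' followed by COMBINING ACUTE ACCENT: two chars before composition.
        String decomposed = "e\u0301";
        String composed = compose(decomposed);
        System.out.println(decomposed.length() + " -> " + composed.length());
    }
}
```

A spacing accent such as a plain apostrophe in front of the letter is not a combining mark, so NFC alone would leave that case untouched; it would need an extra rewriting rule first.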

The tools using pdfalto can then handle ligatures on their own; it's not sophisticated. For instance in GROBID, which uses pdfalto, we apply several steps of character normalization (ligatures, Unicode character family normalization, etc.). For the ligatures, we use this mapping:

                // ligature
                case '\uFB00': {
                    res += "ff";
                    break;
                }
                case '\uFB01': {
                    res += "fi";
                    break;
                }
                case '\uFB02': {
                    res += "fl";
                    break;
                }
                case '\uFB03': {
                    res += "ffi";
                    break;
                }
                case '\uFB04': {
                    res += "ffl";
                    break;
                }
                case '\uFB06': {
                    res += "st";
                    break;
                }
                case '\uFB05': {
                    res += "ft";
                    break;
                }
                case '\u00E6': {
                    res += "ae";
                    break;
                }
                case '\u00C6': {
                    res += "AE";
                    break;
                }
                case '\u0153': {
                    res += "oe";
                    break;
                }
                case '\u0152': {
                    res += "OE";
                    break;
                }
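
For readers who want to apply the same idea outside GROBID, the mapping can be sketched as a self-contained lookup table. This is only an illustration covering the five f-ligatures from the excerpt above; the class and method names are hypothetical and are not GROBID's API.

```java
import java.util.Map;

final class LigatureExpander {
    // Subset of the mapping quoted above (f-ligatures only).
    private static final Map<Character, String> LIGATURES = Map.ofEntries(
        Map.entry('\uFB00', "ff"),
        Map.entry('\uFB01', "fi"),
        Map.entry('\uFB02', "fl"),
        Map.entry('\uFB03', "ffi"),
        Map.entry('\uFB04', "ffl")
    );

    // Replace each known ligature character by its letter sequence; keep everything else.
    static String expand(String text) {
        StringBuilder res = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String replacement = LIGATURES.get(c);
            if (replacement != null) {
                res.append(replacement);
            } else {
                res.append(c);
            }
        }
        return res.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("mean-\uFB01eld e\uFB00ect e\uFB03cacy"));
    }
}
```

This only works when the extractor actually emits the ligature code point; it cannot help when the glyph was never resolved to a ligature character in the first place.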

Note that this is a different question from the one raised in this issue, which is that some glyphs cannot be correctly mapped to a Unicode code point, and such an incorrectly mapped glyph could happen to be a ligature (for that issue, we plan to use a bit of "local" OCR, because there is no other solution).


vsolovyov commented on September 26, 2024

@kermitt2 is this code for ligatures in a released GROBID version, or is it only in the master branch?

Also, in the example that I gave, it looks like these ligatures do not get passed to the consumer by pdfalto. They are simply dropped, so there is no chance for GROBID to handle them. Maybe this PDF is writing ligatures in some weird way?

When I open my example PDF in Firefox and copy a span, I get
Role of the interaction matrix in mean-�eldspin glass models
This � is the character \u001B, which doesn't look right.

MacOS Preview.app correctly copies it as:
Role of the interaction matrix in mean-field spin glass models

Chrome copies almost the same string as Firefox, except that it inserts a \n between "\u001Beld" and "spin". I think they both use pdf.js, but I'm not 100% sure. Safari gives the same correct output as Preview.

I could investigate what happens in pdfalto, but I don't know where to begin. It would be helpful if you can point me in the general direction of where to look at it in pdfalto source.


kermitt2 commented on September 26, 2024

@vsolovyov the mapping is in both GROBID master and released versions.

However, if the character is "dropped", it means that the Unicode value is not resolved for this glyph. This relates in particular to the way PDF embeds fonts. When a glyph belongs to a locally embedded font, the code we get for it is often not the expected Unicode value (for your example it should be \uFB01, not \u001B) but an arbitrary code, often in the free (private use) Unicode range (thus the placeholder � in your case), so there is no way to know which character this glyph actually represents.
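
A consumer of the extracted text can at least flag such unresolved glyphs before further processing. The following Java sketch is one possible heuristic, not something pdfalto or GROBID is stated to do in this thread: it treats control characters (other than common whitespace), private-use and unassigned code points, and the replacement character U+FFFD as suspicious. The class and method names are hypothetical.

```java
final class SuspiciousGlyphs {
    // Heuristic: true for code points that are unlikely to be real text.
    static boolean isSuspicious(int codePoint) {
        if (codePoint == '\n' || codePoint == '\r' || codePoint == '\t') {
            return false; // ordinary whitespace control characters
        }
        int type = Character.getType(codePoint);
        return Character.isISOControl(codePoint)
            || type == Character.PRIVATE_USE
            || type == Character.UNASSIGNED
            || codePoint == 0xFFFD;
    }

    // Count suspicious code points in a string.
    static long countSuspicious(String text) {
        return text.codePoints().filter(SuspiciousGlyphs::isSuspicious).count();
    }

    public static void main(String[] args) {
        // The ESC control character stands in for the unresolved glyph from the example.
        System.out.println(countSuspicious("mean-\u001Beld"));
        System.out.println(countSuspicious("mean-field"));
    }
}
```

Flagging does not recover the character; it only tells you where the extraction is unreliable. It also cannot detect a glyph that was dropped entirely, as in the "mean-" / "eld" ALTO output earlier in the thread.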

This is mentioned here and in several GROBID issues over the last few years.

This usually happens with special characters, in particular mathematical symbols; the fact that the glyph is a ligature is not specific to the problem.

It was already mentioned in other issues, I think. MacOS in general includes some advanced PDF processing, in particular more proprietary fonts (from Adobe in particular), and afaik it does on-the-fly OCR for unresolved glyphs. So when testing with Preview, all these locally unresolved glyphs are resolved. It's actually great stuff, and I hope to add something similar to pdfalto so that the tool can reach the level of PDF support offered by MacOS. Linux libraries (derived from xpdf) and other open source PDF parsing libraries are "closer" to the actual PDF encoding, so you will see the problem.

It also means in practice that it is not possible to test the encoding of a PDF with MacOS Preview, because it does some excellent reconstruction behind the scenes (also for columns).

I hope it clarifies this problem!


vsolovyov commented on September 26, 2024

@kermitt2 thank you very much for the write-up, it does clarify the problem. I didn't expect that MacOS does on-the-fly OCR for unresolved glyphs, and I thought the problem with ligatures was much easier than it turns out to be.
