Comments (12)

Aazhar commented on September 26, 2024

@lfoppiano you need to give me details about the command options you are using. Is this output from the actual GROBID configuration?

from pdfalto.

lfoppiano commented on September 26, 2024

@Aazhar this output was produced by a module using the latest GROBID version. I did not run the document through pdfalto directly.

lfoppiano commented on September 26, 2024

After re-checking, this is the output with pdf2xml:
1903.07791.pdfxml.txt

and this is the output with pdfalto: 1903.07791.txt

If I'm not wrong, some text is missing in the pdfalto version, but it may be better if you double-check.

To reproduce the issue, you can use grobid-quantities: generate the training data and pick up the txt version of it. If you use GROBID 0.5.4 (the default for grobid-quantities) you will get the pdf2xml output, while if you update to 0.5.5-SNAPSHOT you will get the pdfalto output.

Aazhar commented on September 26, 2024

This document is an example of problematic glyphs (not correctly mapped to their corresponding Unicode code points). Since the OCR feature will be implemented soon (we are still tuning model parameters and training data for better generalisation), this output is correct.

Regarding the difference between the pdfalto and pdf2xml results: unless you are using the OCR or reading-order features, the output should not differ, except for notation and Greek glyphs, for which some common font rules are used.

krzynio commented on September 26, 2024

@Aazhar

I am very interested in this (OCR of unknown glyphs), let me know if I can help with anything (deep learning, dataset preparation etc.).

You rock ;)

lfoppiano commented on September 26, 2024

Another test document here: hal-00720564.pdf

This snippet contains m s -1, which is extracted as follows: (2.57–4.63 mÆs)1)

pdf2xml and Preview produce the same output, so I would say it is not urgent to fix. I'm just providing an additional test document 😄

vsolovyov commented on September 26, 2024

I found a similar problem with ligatures, example file https://arxiv.org/pdf/1906.08479.pdf

The character in question is an "fi" ligature, used for example in "mean-field". I compiled pdfalto from the master branch and ran ./pdfalto -f 48 -l 48 1906.08479.pdf 1906.08479.alto.xml, and the fi ligature gets dropped (I removed some zero-valued attributes for easier reading):

<String sid="p48_s151" ID="p48_w151" CONTENT="mean-" HPOS="492.065" VPOS="327.587" WIDTH="27.5184" HEIGHT="9.7091" STYLEREFS="font1"/>
<String sid="p48_s152" ID="p48_w152" CONTENT="eld" HPOS="525.57" VPOS="327.587" WIDTH="13.0108" HEIGHT="9.7091" STYLEREFS="font1"/>

This is from the [CDL03] reference line, and the word "field" is common in this PDF, so examples are abundant.

Is this the same problem?

bmorton1 commented on September 26, 2024

Just curious whether any work has been done on the ligature issue. I'm also seeing it occur with "ff" and "ffi", as in the words "effect" and "efficacy".

kermitt2 commented on September 26, 2024

@bmorton1, so far ligatures are left as they are: if we have a \uFB00, we keep it and do not rewrite it as the two characters ff. We considered this out of scope for pdfalto (some users could be interested in keeping the ligatures). Of course we could also include it in pdfalto, it's open to debate :)

On the other hand, we do handle character composition in pdfalto. Composed sequences are introduced simply to save glyphs, so there is no reason to keep the sequence 'e for é (it's something very standardised too).

Tools using pdfalto can then handle ligatures on their own; it's not sophisticated. For instance in GROBID, which uses pdfalto, we apply several character normalization steps (ligatures, Unicode character family normalization, etc.). For ligatures, we use this mapping:

                // ligature
                case '\uFB00': {
                    res += "ff";
                    break;
                }
                case '\uFB01': {
                    res += "fi";
                    break;
                }
                case '\uFB02': {
                    res += "fl";
                    break;
                }
                case '\uFB03': {
                    res += "ffi";
                    break;
                }
                case '\uFB04': {
                    res += "ffl";
                    break;
                }
                case '\uFB06': {
                    res += "st";
                    break;
                }
                case '\uFB05': {
                    res += "ft";
                    break;
                }
                case '\u00E6': {
                    res += "ae";
                    break;
                }
                case '\u00C6': {
                    res += "AE";
                    break;
                }
                case '\u0153': {
                    res += "oe";
                    break;
                }
                case '\u0152': {
                    res += "OE";
                    break;
                }

Note that this is a different question from the one raised in this issue, which is that some glyphs cannot be correctly mapped to a Unicode code point. Such an incorrectly mapped glyph may happen to be a ligature. For that issue, we plan to use a bit of "local" OCR, because there is no other solution.
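
For readers who want to apply the same ligature expansion in their own tool, here is a minimal, self-contained sketch. Only the mapping itself is taken from the snippet above; the class name LigatureNormalizer and the method normalize are hypothetical and are not GROBID's actual API. It assumes Java 9 or later (for Map.ofEntries).

```java
import java.util.Map;

// Hypothetical helper, not part of GROBID or pdfalto.
public class LigatureNormalizer {

    // Same mapping as the switch statement quoted above.
    private static final Map<Character, String> LIGATURES = Map.ofEntries(
        Map.entry('\uFB00', "ff"),
        Map.entry('\uFB01', "fi"),
        Map.entry('\uFB02', "fl"),
        Map.entry('\uFB03', "ffi"),
        Map.entry('\uFB04', "ffl"),
        Map.entry('\uFB06', "st"),
        Map.entry('\uFB05', "ft"),
        Map.entry('\u00E6', "ae"),
        Map.entry('\u00C6', "AE"),
        Map.entry('\u0153', "oe"),
        Map.entry('\u0152', "OE")
    );

    // Replaces each mapped ligature character with its expansion
    // and copies every other character unchanged.
    public static String normalize(String text) {
        StringBuilder res = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String expanded = LIGATURES.get(c);
            if (expanded != null) {
                res.append(expanded);
            } else {
                res.append(c);
            }
        }
        return res.toString();
    }

    public static void main(String[] args) {
        // prints "mean-field effect"
        System.out.println(normalize("mean-\uFB01eld e\uFB00ect"));
    }
}
```

This only helps when the ligature code point actually reaches the consumer. It does nothing for glyphs whose Unicode value could not be resolved in the first place.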

vsolovyov commented on September 26, 2024

@kermitt2 is this ligature code in a released GROBID version, or is it only in the master branch?

Also, in the example that I gave, it looks like these ligatures are not passed to the consumer by pdfalto. They are simply dropped, so there is no chance for GROBID to handle them. Maybe this PDF encodes ligatures in some unusual way?

When I open my example PDF in Firefox and copy a span, I get
Role of the interaction matrix in mean-�eldspin glass models
This � is the character \u001B, which doesn't look right.

MacOS Preview.app correctly copies it as:
Role of the interaction matrix in mean-field spin glass models

Chrome copies almost the same string as Firefox, except that it inserts \n between "\u001Beld" and "spin". I think they both use pdf.js, but I'm not 100% sure. Safari gives the same correct output as Preview.

I could investigate what happens in pdfalto, but I don't know where to begin. It would be helpful if you can point me in the general direction of where to look at it in pdfalto source.

kermitt2 commented on September 26, 2024

@vsolovyov the mapping is in both GROBID master and released versions.

However, if the character is "dropped", it means that the Unicode value is not resolved for this glyph. This relates in particular to the way PDF embeds fonts. When a glyph belongs to a locally embedded font, the code for this glyph is often not the expected Unicode value (for your example it should be \uFB01) but a code in the free Unicode range (hence the placeholder � in your case), so there is no way to tell which character this glyph actually represents.
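
As a rough illustration of how a consumer of extracted text might flag such unresolved glyphs, the sketch below scans a string for code points that are unlikely to be legitimate text: control characters (such as the U+001B seen in the Firefox copy), Private Use Area code points, and the replacement character U+FFFD. This is a heuristic under those assumptions, not something pdfalto or GROBID is known to do, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of pdfalto or GROBID.
public class UnresolvedGlyphDetector {

    // Heuristic: true if the code point is unlikely to be real text.
    public static boolean isSuspicious(int cp) {
        // Ordinary whitespace controls are expected in extracted text.
        if (cp == '\n' || cp == '\r' || cp == '\t') {
            return false;
        }
        if (Character.isISOControl(cp)) {
            return true;
        }
        if (Character.getType(cp) == Character.PRIVATE_USE) {
            return true;
        }
        return cp == 0xFFFD; // Unicode replacement character
    }

    // Returns the char offsets of all suspicious code points in the text.
    public static List<Integer> suspiciousOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isSuspicious(cp)) {
                offsets.add(i);
            }
            i += Character.charCount(cp);
        }
        return offsets;
    }

    public static void main(String[] args) {
        String copied = "mean-\u001Beld spin glass";
        System.out.println(suspiciousOffsets(copied)); // prints [5]
    }
}
```

Flagging a position only tells you that something went wrong there. Recovering the intended character still requires extra information, such as the glyph shape.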

This is mentioned here and in several GROBID issues over the last few years.

It usually happens with special characters, in particular mathematical symbols; the fact that the glyph is a ligature is not specific to the problem.

I think it was already mentioned in other issues. MacOS in general includes some advanced PDF processing, in particular more proprietary fonts (notably from Adobe), and AFAIK it does on-the-fly OCR for unresolved glyphs. So when testing with Preview, all these locally unresolved glyphs are resolved. It's actually great stuff, and I hope to add something similar to pdfalto so that the tool can reach the level of PDF support offered by MacOS. Other Linux libraries (derived from xpdf) and open source PDF parsing libraries are "closer" to the actual PDF encoding, so you will see the problem there.

In practice, it also means that it is not possible to test the encoding of a PDF with MacOS Preview, because it does some excellent reconstruction behind the scenes (also for columns).

I hope it clarifies this problem!

vsolovyov commented on September 26, 2024

@kermitt2 thank you very much for the write-up, it does clarify the problem. I didn't expect MacOS to do on-the-fly OCR for unresolved glyphs, and I thought the problem with ligatures was much easier than it turns out to be.
