pdf-glyph-mapping

Scripts to help with text extraction from (some) PDF files.

(Specifically: fix incorrect/incomplete ToUnicode CMap, i.e. the mapping between individual glyphs and Unicode text.)

What's this?

Text that requires complex text layout (e.g. because it is in an Indic script) cannot be copied correctly from PDFs unless it is annotated with ActualText. This is a collection of tools that may help in some cases.

Roughly, the idea is to

  1. from the PDF file, extract
    • the current glyph<->Unicode mapping (if present), and
    • the runs of text present (as character codes), per font
  2. use this information (and the shapes of the glyphs) to assist with manually associating each glyph with its equivalent Unicode sequence,
  3. use this correct mapping to obtain Unicode text: either
    • convert the text runs extracted earlier, or
    • post-process the PDF file to wrap each text run inside /ActualText.

Background

Some PDF files are just collections of images (scans of pages); we ignore those (use OCR for them). In any other PDF file that contains text streams (e.g. one where you can select a run of text), the text is displayed by laying out glyphs from a font. For example, in a certain PDF that uses the font Noto Sans Devanagari, the word प्राप्त may be formed by laying out four glyphs:

0112 0042 00CB 0028

In this font, these glyphs happen to have numerical IDs (like 0112, 0042, 00CB, 0028) that are font-specific. If we'd like to get text out of this, and the PDF does not provide it with /ActualText, we need to map the four glyphs to the corresponding Unicode scalar values:

  • 0112 maps to
    • 092A DEVANAGARI LETTER PA
    • 094D DEVANAGARI SIGN VIRAMA
    • 0930 DEVANAGARI LETTER RA
  • 0042 maps to
    • 093E DEVANAGARI VOWEL SIGN AA
  • 00CB maps to
    • 092A DEVANAGARI LETTER PA
    • 094D DEVANAGARI SIGN VIRAMA
  • 0028 maps to
    • 0924 DEVANAGARI LETTER TA

The PDF file itself may already contain such a mapping (CMap), but it is often incomplete, missing nontrivial cases like the first glyph above.
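For reference, a complete ToUnicode CMap entry covering the four glyphs above would look something like this (standard PDF bfchar syntax; destination strings are UTF-16BE codepoints). This fragment is a sketch, not taken from an actual PDF:

```
4 beginbfchar
<0112> <092A094D0930>
<0042> <093E>
<00CB> <092A094D>
<0028> <0924>
endbfchar
```

An incomplete CMap would typically have the easy one-to-one entries but omit multi-codepoint ones like the first line.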

Even after the mapping is fixed, a second problem is that, roughly speaking, the glyph IDs are laid out in visual order while Unicode text is in phonetic order, so the correspondence may be nontrivial. See the example on page 36 here; a couple more examples are below:

  1. The word विकर्ण may be laid out as:

    0231 0039 0019 0027 00B5

    and we want this to correspond to the following sequence of Unicode codepoints:

    1. 0935 DEVANAGARI LETTER VA
    2. 093F DEVANAGARI VOWEL SIGN I
    3. 0915 DEVANAGARI LETTER KA
    4. 0930 DEVANAGARI LETTER RA
    5. 094D DEVANAGARI SIGN VIRAMA
    6. 0923 DEVANAGARI LETTER NNA

    (The first glyph corresponds to the second codepoint, and the last glyph corresponds to the fourth and fifth codepoints.)

  2. The word धर्मो may be laid out as:

    002B 0032 01C3

    and the word सर्वांग as:

    003C 0039 0042 01CB 001B
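To make the ordering problem concrete, here is a Python sketch for the विकर्ण example. The glyph IDs are those listed above, but the per-glyph codepoint assignments are assumptions (only the first and last correspondences are stated above); they illustrate why naive per-glyph concatenation fails even with a correct per-glyph mapping:

```python
# Glyphs in the visual order they appear in the PDF (IDs from the example above).
visual_glyphs = [0x0231, 0x0039, 0x0019, 0x0027, 0x00B5]

# ASSUMED per-glyph codepoints (hypothetical, for illustration only).
assumed_mapping = {
    0x0231: "\u093F",        # VOWEL SIGN I: drawn before the consonant it follows
    0x0039: "\u0935",        # VA
    0x0019: "\u0915",        # KA
    0x0027: "\u0923",        # NNA
    0x00B5: "\u0930\u094D",  # RA + VIRAMA: the reph, drawn last, pronounced earlier
}

naive = "".join(assumed_mapping[g] for g in visual_glyphs)
correct = "\u0935\u093F\u0915\u0930\u094D\u0923"  # विकर्ण in phonetic (Unicode) order

print(naive == correct)                  # False: same codepoints, different order
print(sorted(naive) == sorted(correct))  # True: only the ordering differs
```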

Example of usage

(TODO)

(But see: this comment and these files 1 2 3 4.)

Usage

(Short version: Run make and follow instructions.)

  1. (Not part of this repository.) Prerequisites:
    1. Make sure mutool is installed (and also Python and Rust).
    2. If you know of fonts that may be related to the fonts embedded in the PDF, run ttx (from fonttools) on them, and put the resulting files inside the work/helper_fonts/ directory.
  2. Run make, from within the work/ directory. This will do the following:
    1. Extracts the font data from the PDF file, using mutool extract.
    2. Dumps each glyph from each font as a bitmap image, using the dump-glyphs binary from this repository.
    3. Extracts each "text operation" (Tj, TJ, ', "; see 9.4.3 Text-Showing Operators in the PDF 1.7 spec) in the PDF (which glyphs from which font were used), using the dump-tjs binary from this repository.
    4. Runs the sample-runs.py script from this repository, which
      1. generates the glyph_id to Unicode mapping known so far (see this comment),
      2. generates HTML pages with some visual information about each glyph used in the PDF (showing it in context with neighbouring glyphs, etc.) (example).
  3. Create a new directory called maps/manual/ and
    1. copy the toml files under maps/look/ into it,
    2. (The main manual grunt work needed) Edit each of those TOML files, and (using the HTML files that have been generated), for each glyph that is not already mapped in the PDF itself, add the Unicode mapping for that glyph. (Any one format will do; the existing TOML entries are highly redundant but you can be concise: see the comment.)
  4. Run make again. This will do the following:
    1. Validates that the TOML files you generated are ok (it won't catch mistakes in the Unicode mapping though!), and
    2. (This is slow, may take ~150 ms per page.) Generates a copy of your original PDF, with data in it about the actual text corresponding to each text operation.
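For step 3, a manually added mapping entry might look roughly like the following. This is a hypothetical sketch of one possible shape; consult the generated files under maps/look/ and the linked comment for the actual schema and the accepted formats:

```toml
# Hypothetical entry: glyph 0112 of this font is the "pra" conjunct.
# The key and field names here are illustrative, not the repository's schema.
[glyphs.0112]
unicode = "092A 094D 0930"   # PA + VIRAMA + RA
```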

All this has been tested with only one large PDF. These scripts are rather hacky, and some decisions about PDF structure etc. are hard-coded; for other PDFs they will likely need to be changed.

TODO: Read this answer and try qpdf/mutool-clean, to simplify parsing work: https://stackoverflow.com/questions/3446651/how-to-convert-pdf-binary-parts-into-ascii-ansi-so-i-can-look-at-it-in-a-text-ed/3483710#3483710

Contributors

shreevatsa, ujjvlh


Issues

cargo invocation problem

[vvasuki:~/shreevatsa/pdf-glyph-mapping/work:main]─[08:54:13]─$make
RUST_BACKTRACE=1 cargo +nightly run --release --bin dump-tjs -- /home/vvasuki/Documents/books/granthasangrahaH/purANam/unabridged_full.pdf font-usage/ --phase phase1
error: no such subcommand: `+nightly`
make: *** [Makefile:41: font-usage/] Error 101

Tjs-*.map files not generated

Running without any changes to the code (other than replacing .png with .jpg in doit.sh), no Tjs-*.map files were generated, even though dump-tjs ran successfully.

Segmentation fault

I get a segfault at the following line while dumping the Tjs. The page number at which it happens is not deterministic (sometimes 1000+, sometimes 10000+), but it has happened on every one of the roughly ten runs I have tried.

lopdf::content::Content::decode(&content_u8).unwrap()

Get this working again

I had abandoned this project in favour of https://github.com/shreevatsa/pdf-explorer, but that itself seems abandoned, so it may be good to resurrect one or both of them.

At minimum, it seems the Cargo.toml here needs to be changed to:

clap = "=3.0.0-beta.2"
clap_derive = "=3.0.0-beta.2"

per here. On top of that, may be good to clarify the API a bit, e.g. remove dependency on mupdf and have clear boundaries / separate binaries: one for just dumping the text content stream, etc?

mutool extract output format not uniform

With mupdf-tools 1.18.0-2 on Arch Linux, this command produces .jpg output instead of .png. So we need to explicitly specify the output format in the command (if that is possible) for doit.sh to work.
