pdfalto

pdfalto is a command line executable for parsing PDF files and producing structured XML representations of the PDF content in ALTO format, capturing in particular all the layout and style information of the PDF.

pdfalto was initially a fork of pdf2xml, developed at XRCE, with modifications for robustness, additional features, improved detection of layout elements, and enhanced output in ALTO format (including in particular space information, useful for instance for further machine learning processing). It is based on the Xpdf library.

The latest stable version is 0.4. Working version (master) is 0.5.

An Archlinux package for pdfalto is available here, thanks to @andreasbaumann. The build process described below will create a portable standalone pdfalto executable that can be packaged with other tools without further installation requirements for the end-user.

Requirements

  • compilers : clang 3.6 or gcc 4.9
  • makefile generator : cmake 3.12.0
  • fetching dependencies : wget

Usage

General usage is as follows:

Usage: pdfalto [options] <PDF-file> [<xml-file>]
  -f <int>                      : first page to convert
  -l <int>                      : last page to convert
  -verbose                      : display pdf attributes
  -noImage                      : do not extract Images (Bitmap and Vectorial)
  -outline                      : create an outline file xml
  -annotation                   : create an annotations file xml
  -noLineNumbers                : do not output line numbers added in manuscript-style textual documents
  -readingOrder                 : blocks follow the reading order
  -noText                       : do not extract textual objects (might be useful, but non-valid ALTO)
  -charReadingOrderAttr         : include TYPE attribute to String elements to indicate right-to-left reading order (might be useful, but non-valid ALTO)
  -fullFontName                 : fonts names are not normalized
  -nsURI <string>               : add the specified namespace URI
  -opw <string>                 : owner password (for encrypted files)
  -upw <string>                 : user password (for encrypted files)
  -filesLimit <int>             : limit of asset files be extracted
  -q                            : don't print any messages or errors
  -v                            : print version info
  -h                            : print usage information
  -help                         : print usage information
  --help                        : print usage information
  -?                            : print usage information
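
As an illustration of how these options combine, here is a minimal Python sketch that assembles a pdfalto command line. The helper names `build_pdfalto_command` and `run_pdfalto` and the executable path `./pdfalto` are assumptions made for this example; only flags listed above are used.

```python
import subprocess
from typing import List, Optional


def build_pdfalto_command(pdf_path: str,
                          xml_path: Optional[str] = None,
                          first_page: Optional[int] = None,
                          last_page: Optional[int] = None,
                          no_image: bool = False,
                          reading_order: bool = False,
                          executable: str = "./pdfalto") -> List[str]:
    """Build an argument list: pdfalto [options] <PDF-file> [<xml-file>]."""
    cmd = [executable]  # assumed location of the built executable
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    if no_image:
        cmd.append("-noImage")
    if reading_order:
        cmd.append("-readingOrder")
    cmd.append(pdf_path)
    if xml_path is not None:
        cmd.append(xml_path)
    return cmd


def run_pdfalto(cmd: List[str]) -> int:
    """Run the command and return its exit code (needs a built pdfalto)."""
    return subprocess.run(cmd).returncode
```

For instance, `build_pdfalto_command("in.pdf", "out.xml", first_page=1, last_page=2, no_image=True)` yields the arguments for converting the first two pages without image extraction.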

In addition to the ALTO file describing the PDF content, the following files are generated:

  • _metadata.xml file containing the metadata of the PDF file (the metadata is written to a separate XML file because the ALTO schema does not support it).

  • _annot.xml file containing a description of the annotations in the PDF (e.g. GOTO, external http links, ...) obtained with -annotation option

  • _outline.xml file containing a possible PDF-embedded table of content (aka outline) obtained with -outline option

  • .xml_data/ subdirectory containing the vectorial (.vec) and bitmap (.png) images embedded in the PDF. It is generated by default, that is, when the option -noImage is not present. This extraction slows down the process very significantly, so if no image is required, use the option -noImage. When the images are not extracted, image elements with layout properties still appear in the ALTO file, but they reference no extracted image files.
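
To make the list above concrete, the following Python sketch guesses where the companion files might be found for a given ALTO output path. The naming rule (suffixes appended to the output name without its .xml extension, and _data appended to the full output file name) is an assumption inferred from the suffixes listed above, and the helper name `expected_side_files` is hypothetical; check the files actually produced on your system.

```python
from pathlib import Path
from typing import Dict


def expected_side_files(alto_xml: str) -> Dict[str, Path]:
    """Guess companion paths from the ALTO output path (naming is assumed)."""
    alto = Path(alto_xml)
    stem = alto.with_suffix("")  # e.g. out/paper.xml -> out/paper
    return {
        "metadata": Path(f"{stem}_metadata.xml"),
        "annotations": Path(f"{stem}_annot.xml"),  # only with -annotation
        "outline": Path(f"{stem}_outline.xml"),    # only with -outline
        "assets": Path(f"{alto}_data"),            # not created with -noImage
    }
```

A caller could then test each path with `Path.exists()` before reading it, since the annotation and outline files depend on the options used.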

Extra script to get only text content

The goal of pdfalto is to extract all the content of a PDF, not just text, but also layout, style, font, vector graphics, embedded bitmap, annotation, metadata, and outline information. For convenience and debugging, we provide a simple XSLT to extract only the text content from the produced ALTO XML file. For instance, using xsltproc command line, the following outputs the text content only:

xsltproc schema/alto2txt.xsl alto_file.xml
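
Where xsltproc is not available, a similar text-only view can be sketched in Python. This is not the logic of alto2txt.xsl, only a minimal illustration that assumes the usual ALTO layout in which each TextLine holds String elements (text in the CONTENT attribute) separated by SP elements. The function names and the sample document are made up for the example.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix so any ALTO namespace version is accepted."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_source: str) -> str:
    """Return one output line per TextLine, built from String and SP children."""
    root = ET.fromstring(xml_source)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        parts = []
        for child in elem:
            name = local_name(child.tag)
            if name == "String":
                parts.append(child.get("CONTENT", ""))
            elif name == "SP":  # explicit space between two String elements
                parts.append(" ")
        lines.append("".join(parts))
    return "\n".join(lines)


# Tiny hand-written sample, not actual pdfalto output.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Hello"/><SP/><String CONTENT="ALTO"/>
    </TextLine>
    <TextLine>
      <String CONTENT="second"/><SP/><String CONTENT="line"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_to_text(SAMPLE))
```

Matching on local element names keeps the sketch independent of the ALTO namespace version declared in the file.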

Dependencies

Dependencies can be recompiled by running this script:

./install_deps.sh

The script will download and build the dependencies under libs/ and the additional language support packages for xpdf under languages/.

If necessary, see compiling dependencies procedures for further details.

Known issues

(issue 41) might occur while building. In this case, you will need to compile the dependencies before building pdfalto.

Build

  • NOTE for Windows: it is recommended to use Cygwin and install the standard libraries (for either clang or gcc)

git clone https://github.com/kermitt2/pdfalto.git && cd pdfalto

  • Xpdf-4.03 is shipped as a git submodule. To download it:

git submodule update --init --recursive

  • Build pdfalto:

cmake .

make

The executable pdfalto is generated in the root directory. Additionally, this creates a static library for xpdf-4.03 at the path xpdf-4.03/build/xpdf/lib/libxpdf.a, along with the other libraries, each in its respective subdirectory.

To use the additional xpdf language support packages, the executable pdfalto comes with a config file xpdfrc and language resources installed under languages/. Both xpdfrc and languages/ must be alongside the executable pdfalto to be used. To add pdfalto with these additional resources to a third-party application (e.g. GROBID), move the executable together with these files:

lopez@work:~$ ls my_pdfalto/
languages  pdfalto  xpdfrc
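
Before shipping such a directory with a third-party application, a small check along these lines can confirm that nothing was left behind. This is only a sketch: the helper name `missing_resources` is hypothetical, and the executable is assumed to be named pdfalto (on Windows it would likely carry an .exe suffix).

```python
import tempfile
from pathlib import Path
from typing import List

# Items that must sit side by side, as in the directory listing above.
REQUIRED = ("pdfalto", "xpdfrc", "languages")


def missing_resources(bundle_dir: str) -> List[str]:
    """Return the names from REQUIRED that are absent from bundle_dir."""
    base = Path(bundle_dir)
    return [name for name in REQUIRED if not (base / name).exists()]


# Demo on a throwaway directory where languages/ was forgotten.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pdfalto").touch()
    (Path(tmp) / "xpdfrc").touch()
    demo_missing = missing_resources(tmp)

print(demo_missing)
```

An empty list means the executable, its config file, and the language resources are all present in the same directory.
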

Known issues

(issue #135) On macOS, a "fontconfig.h file not found" error might occur while building; see the workaround described in that issue.

Future work

  • Some text contains block element characters (https://unicode.org/charts/PDF/U2B00.pdf), which are used as placeholders for unresolved character Unicode values, instead of the characters expected when visually inspecting the text. These Unicode values are unresolved because the actual characters are glyphs embedded in the PDF document that use the free Unicode range for embedded fonts rather than the correct Unicode values. The only way to extract the valid text for those special characters is to use OCR at glyph level. This is our main targeted future enhancement, relying on a custom Deep Learning approach.

  • map special characters in secondary fonts to their expected unicode

  • try to optimize speed and memory

  • see the issue tracker for further tasks

Changes

New in version 0.4 (apart from various bug fixes):

  • support for the xpdf language support packages for language-specific fonts such as Arabic, Chinese-simplified, Japanese, etc.; they are pre-installed locally and portable

  • refine line number detection and fix a bug that could result in random missing numbers in the ALTO output

  • update to xpdf-4.03

  • fix issue with character spacing due to invalid rotation condition

  • update dependencies and dependency install script

New in version 0.3 (apart from various bug fixes):

  • line number detection: line numbers (typically added for review in manuscripts/preprints) are specifically identified and no longer mixed with the rest of the text content. They are grouped in a separate block or, optionally, not output in the ALTO file (-noLineNumbers option)

  • removal of the -blocks option: the block information is always returned to ensure ALTO validation (<TextBlock> element)

  • bug fixing on reading order

  • fix possible incorrect XMax and YMax values at 0 on block coordinates having only one line

New in version 0.2 (apart from various bug fixes):

  • support Unicode composition of characters

  • generalize reading order to all blocks (it was limited to the blocks of the first page)

  • detect subscript/superscript text font style attribute

  • use SVG as a format for vectorial images

  • propagate unsolved character Unicode value (free Unicode range for embedded fonts) as encoded special character in ALTO (so-called "placeholder" approach)

  • generate metadata information in a separate XML file (as ALTO schema does not support that)

  • use the latest version of xpdf, version 4.00

  • add cmake

  • ALTO output is replacing custom Xerox XML format

  • Note: this released version was used for Grobid release 0.5.6

New in version 0.1 (apart from various bug fixes):

  • encode URI (using xmlURIEscape from libxml2) for the @href attribute content to avoid blocking XML well-formedness issues. From our experiments, this problem happens on average for 2-3 scholarly PDFs out of one thousand.

  • output coordinate attributes for the BLOCK elements when the -blocks option is selected

  • add a parameter -readingOrder which re-orders the blocks following the reading order when the -blocks option is selected. By default in pdf2xml, the elements followed the PDF content stream (the so-called raw order). In xpdf, several text flow orders are available, including the raw order and the reading order. Note that, with this modification and this new option, only the blocks are re-ordered.

    From our experiments, the raw order can diverge quite significantly from the order of elements according to the visual/reading layout in 2-4% of scholarly PDFs (e.g. the title element is introduced at the end of the page element, while visually present at the top of the page), and minor changes can be present in up to 100% of PDFs for some scientific publishers (e.g. a headnote introduced at the end of the page content). This additional mode can thus be quite useful for information/structure extraction applications exploiting pdfalto output.

  • use the latest version of xpdf, version 3.04.

Contributors

Contact: Patrice Lopez

pdfalto is developed by Patrice Lopez and Achraf Azhar.

pdf2xml was originally written by Hervé Déjean, Sophie Andrieu, Jean-Yves Vion-Dury and Emmanuel Giguet (XRCE) under GPL2 license.

Xpdf is developed by Glyph & Cog, LLC (1996-2017) and distributed under GPL2 or GPL3 license.

The Windows version was originally built by @pboumenot and ported to Windows 7 for 64 bit, then to Windows (native and Cygwin) by @lfoppiano and @flydutch.

License

As the original pdf2xml and main dependency Xpdf, pdfalto is distributed under GPL2 license.

Useful links

Some tools for converting ALTO into other formats:

pdfalto's Issues

-readingOrder option seems not working?

Hi,

In the README, you said that the reordering has been extended to the whole text flow rather than only the first page. However, it does not seem to work in my case.

[Screenshot 2019-03-23 18 34 43]

You can see from the example above that the block ordering is not changed.

Null pointer dereference in function AnnotsXrce::AnnotsXrce()

Description - we observed a null pointer dereference in the function AnnotsXrce::AnnotsXrce() located in AnnotsXrce.cc. It can be triggered by sending a crafted PDF file to the pdfalto binary. It allows an attacker to cause a Denial of Service (segmentation fault) or possibly have unspecified other impact.

Command - ./pdfalto -f 1 -l 2 -noText -noImage -outline -annotation -cutPages -blocks -readingOrder -ocr -fullFontName $POC

POC - REPRODUCER

Debug -

gdb: 
[ Legend: Modified register | Code | Heap | Stack | String ]
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── registers ────
$rax   : 0x0               
$rbx   : 0x00007fffffffda40  →  0x000061700000f580  →  0x000061300000de80  →  0x00000000009c1828  →  0x000000000062bd46  →  <FileStream::~FileStream()+0> push rbp
$rcx   : 0x300             
$rdx   : 0x0               
$rsp   : 0x00007fffffffd440  →  0x0000000041b58ab3
$rbp   : 0x00007fffffffda70  →  0x00007fffffffdbf0  →  0x00007fffffffdd10  →  0x000000000090c360  →  <__libc_csu_init+0> push r15
$rsi   : 0x1               
$rdi   : 0x000060400000c850  →  0xbebebebebebebebe
$rip   : 0x0000000000406adc  →  <AnnotsXrce::AnnotsXrce(Object&,+0> mov rax, QWORD PTR [rax]
$r8    : 0x0               
$r9    : 0x35ef            
$r10   : 0x50              
$r11   : 0x00007ffff7efb310  →  0x0000000000000000
$r12   : 0x00000ffffffffabc  →  0x0000000000000000
$r13   : 0x00007fffffffd5e0  →  0x0000000041b58ab3
$r14   : 0x000060400000c850  →  0xbebebebebebebebe
$r15   : 0x00007fffffffd5e0  →  0x0000000041b58ab3
$eflags: [carry PARITY adjust ZERO sign trap INTERRUPT direction overflow RESUME virtualx86 identification]
$cs: 0x0033 $ss: 0x002b $ds: 0x0000 $es: 0x0000 $fs: 0x0000 $gs: 0x0000 
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── stack ────
0x00007fffffffd440│+0x0000: 0x0000000041b58ab3     ← $rsp
0x00007fffffffd448│+0x0008: 0x0000602000010330  →  0xbebebebe0000003a (":"?)
0x00007fffffffd450│+0x0010: 0x000000010000000d  →  0x0000000000000000
0x00007fffffffd458│+0x0018: 0x00007fffffffdb60  →  0x3ff0000000000000
0x00007fffffffd460│+0x0020: 0x0000611000009c80  →  0x000060800000bfa8  →  0x0000602000010ad0  →  0xbebebebe00000031 ("1"?)
0x00007fffffffd468│+0x0028: 0x000060c000007c00  →  0x0000000000000000
0x00007fffffffd470│+0x0030: 0x00007fffffffdb20  →  0xbebebebe00000006
0x00007fffffffd478│+0x0038: 0x0000602000106f50  →  0x0000000000000002
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── code:x86:64 ────
     0x406acd <AnnotsXrce::AnnotsXrce(Object&,+0> mov    rdi, rax
     0x406ad0 <AnnotsXrce::AnnotsXrce(Object&,+0> call   0x404a40 <__asan_report_load8@plt>
     0x406ad5 <AnnotsXrce::AnnotsXrce(Object&,+0> mov    rax, QWORD PTR [rbp-0x548]
→   0x406adc <AnnotsXrce::AnnotsXrce(Object&,+0> mov    rax, QWORD PTR [rax]
     0x406adf <AnnotsXrce::AnnotsXrce(Object&,+0> add    rax, 0x10
     0x406ae3 <AnnotsXrce::AnnotsXrce(Object&,+0> mov    rdx, rax
     0x406ae6 <AnnotsXrce::AnnotsXrce(Object&,+0> mov    rcx, rdx
     0x406ae9 <AnnotsXrce::AnnotsXrce(Object&,+0> shr    rcx, 0x3
     0x406aed <AnnotsXrce::AnnotsXrce(Object&,+0> add    rcx, 0x7fff8000
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── source:/home/aceteam/Downloads/sources/pdfalto/src/AnnotsXrce.cc+85 ────
     80                     Link *link = new Link(dict, catalog->getBaseURI());
     81                     //printf("%d \n",link->isOk());
     82                     LinkAction *ac = link->getAction();
     83                     //printf("ac %d \n",ac->isOk());
     84                     // Get the Action information
        // ac=0x00007fffffffd528  →  0x0000000000000000
→   85                     if (ac->isOk()) {
     86                         xmlNodePtr nodeActionAction;
     87                         xmlNodePtr nodeActionDEST;
     88                         if (nodeAnnot) {
     89                             nodeActionAction = xmlNewNode(NULL, (const xmlChar *) TAG_ACTION);
     90                             nodeActionAction->type = XML_ELEMENT_NODE;
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── threads ────
[#0] Id 1, Name: "pdfalto", stopped, reason: SIGSEGV
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── trace ────
[#0] 0x406adc → AnnotsXrce::AnnotsXrce(this=0x602000106f50, objA=@0x7fffffffdb20, docrootA=0x60c000007c00, catalog=0x611000009c80, ctmA=0x7fffffffdb60, pageNumA=0x1)
[#1] 0x40a94a → PDFDocXrce::displayPages(this=0x60800000bfa0, out=0x61500000c100, docrootA=0x60c000007c00, firstPage=0x1, lastPage=0x1, hDPI=72, vDPI=72, rotate=0x0, useMediaBox=0x0, crop=0x1, doLinks=0x0, abortCheckCbk=0x0, abortCheckCbkData=0x0)
[#2] 0x40bdf6 → main(argc=0x2, argv=0x7fffffffddf8)

gef➤  p ac
$9 = (LinkAction *) 0x0
gef➤  p ac->isOk()
Cannot access memory at address 0x0


Build from a fresh clone fails in Ubuntu 16.04 LTS

cmake.stderr.txt
cmake.stdout.txt

CMake Error at xpdf-4.00/xpdf-qt/CMakeLists.txt:65 (add_executable):
add_executable cannot create target "xpdf" because another target with the
same name already exists. The existing target is a static library created
in source directory "/usr/local/src/pdfalto/xpdf-4.00/xpdf". See
documentation for policy CMP0002 for more details.

SEGV in function TextPage::restoreState

I used Clang 6.0 and AddressSanitizer to build pdfalto, this file can cause SEGV in function TextPage::restoreState in XmlAltoOutputDev.cc when executing this command:

./pdfalto SEGV_restoreState 1.xml

This is the ASAN information:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==13300==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x0000005addf7 bp 0x0c2c000001c7 sp 0x7fff8c9133e0 T0)
==13300==The signal is caused by a READ memory access.
==13300==Hint: address points to the zero page.
    #0 0x5addf6 in TextPage::restoreState(GfxState*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:5763:21
    #1 0x5addf6 in XmlAltoOutputDev::restoreState(GfxState*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7414
    #2 0x9a6668 in Gfx::execOp(Object*, Object*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:826:3
    #3 0x9a42b1 in Gfx::go(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:719:12
    #4 0x9a1d1b in Gfx::display(Object*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:641:3
    #5 0x77c466 in Page::displaySlice(OutputDev*, double, double, int, int, int, int, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:373:10
    #6 0x77babc in Page::display(OutputDev*, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:321:3
    #7 0x78268e in PDFDoc::displayPage(OutputDev*, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/PDFDoc.cc:386:27
    #8 0x78268e in PDFDoc::displayPages(OutputDev*, int, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/PDFDoc.cc:399
    #9 0x526f9d in PDFDocXrce::displayPages(OutputDev*, _xmlNode*, int, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/src/PDFDocXrce.cc:22:10
    #10 0x529565 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:415:18
    #11 0x7f5e9a57182f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
    #12 0x41c678 in _start (/home/fouzhe/my_fuzz/pdfalto/pdfalto+0x41c678)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:5763:21 in TextPage::restoreState(GfxState*)
==13300==ABORTING

Ubuntu, CentOS7, compilation problem

Hi,
I tried to build pdfalto on two different operating systems and I have the same problem on both.
It happens at the last step:

  • Build pdfalto:
    cmake -D'ICU_PATH=Path to ICU source folder'
    make <---- this step

I have a link issue related to the splash library.
I think it is because splash has not generated its ".o" files, but I am not comfortable with CMake files.
Has anyone reported something similar? Any ideas?
...
[ 66%] Linking CXX executable pdfalto
CMakeFiles/pdfalto.dir/src/XmlAltoOutputDev.cc.o: In function `XmlAltoOutputDev::~XmlAltoOutputDev()':
XmlAltoOutputDev.cc:(.text+0x185f0): undefined reference to `SplashFontEngine::~SplashFontEngine()'
XmlAltoOutputDev.cc:(.text+0x18620): undefined reference to `Splash::~Splash()'
CMakeFiles/pdfalto.dir/src/XmlAltoOutputDev.cc.o: In function `XmlAltoOutputDev::startDoc(XRef*)':
XmlAltoOutputDev.cc:(.text+0x19c42): undefined reference to `SplashFontEngine::~SplashFontEngine()'
...
CMakeFiles/pdfalto.dir/src/XmlAltoOutputDev.cc.o:(.rodata._ZTI19SplashOutFontFileID[_ZTI19SplashOutFontFileID]+0x10): undefined reference to `typeinfo for SplashFontFileID'
collect2: error: ld returned 1 exit status
make[2]: *** [pdfalto] Error 1
make[1]: *** [CMakeFiles/pdfalto.dir/all] Error 2
make: *** [all] Error 2

characters not recognised

I'm uploading this file
1903.07791.pdf

where some characters are not recognised:

image

the result is this: 

or, this, in the text (please ignore the tags...):

Room temperature electrical resistivity was decreased down from 300 mcm for x = 0 to 8 mcm for x = 0.4. However, the temperature dependence of electrical resistivity was still insulating for x  0.4. In the present study, we show that Bi-rich composition up to ca. x = 0.8 can be obtained by optimizing synthesis temperature.

invalid memory access in GfxIndexedColorSpace::mapColorToBase( )

Description - we observed an invalid memory access in the function GfxIndexedColorSpace::mapColorToBase() located in GfxState.cc. It can be triggered by sending a crafted PDF file to the pdfalto binary. It allows an attacker to cause a Denial of Service (segmentation fault) or possibly have unspecified other impact.

Command - : ./pdfalto -f 1 -l 2 -noText -noImage -outline -annotation -cutPages -blocks -readingOrder -ocr -fullFontName $POC

POC - REPRODUCER

Debug -

Gdb: 

[ Legend: Modified register | Code | Heap | Stack | String ]
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── registers ────
$rax   : 0xfffffffffffffffd
$rbx   : 0x00000ffffffff9a6  →  0x0000000000000000
$rcx   : 0xfffffffffffffffd
$rdx   : 0x200000007fff7fff
$rsp   : 0x00007fffffffccf0  →  0x00007fffffffcd30  →  0x0000000041b58ab3
$rbp   : 0x00007fffffffcfd0  →  0x00007fffffffd100  →  0x00007fffffffd120  →  0x00007fffffffd2a0  →  0x00007fffffffd2d0  →  0x00007fffffffd330  →  0x00007fffffffd640  →  0x00007fffffffd750
$rsi   : 0x3               
$rdi   : 0x0               
$rip   : 0x00000000005cd542  →  <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> movzx edx, BYTE PTR [rdx]
$r8    : 0x00000000005cc2ea  →  <GfxICCBasedColorSpace::getDefaultRanges(double*,+0> push rbp
$r9    : 0x7a1a            
$r10   : 0x0000602000073650  →  0xbebebebebebebe00
$r11   : 0x00007ffff7eec448  →  0x0000000000000000
$r12   : 0x00007fffffffcd30  →  0x0000000041b58ab3
$r13   : 0x00007fffffffcfb0  →  0x00000ffffffffa00  →  0x0000000000000000
$r14   : 0x00007fffffffcd30  →  0x0000000041b58ab3
$r15   : 0x00007fffffffd170  →  0x0000000041b58ab3
$eflags: [carry PARITY adjust zero sign trap INTERRUPT direction overflow RESUME virtualx86 identification]
$cs: 0x0033 $ss: 0x002b $ds: 0x0000 $es: 0x0000 $fs: 0x0000 $gs: 0x0000 
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── stack ────
0x00007fffffffccf0│+0x0000: 0x00007fffffffcd30  →  0x0000000041b58ab3     ← $rsp
0x00007fffffffccf8│+0x0008: 0x00007fffffffd020  →  0x0000003000000020  →  0x0000000000000000
0x00007fffffffcd00│+0x0010: 0x000061700000e108  →  0x00007fff00000000
0x00007fffffffcd08│+0x0018: 0x000060400000d050  →  0x00000000009688d0  →  0x00000000005cc590  →  <GfxIndexedColorSpace::~GfxIndexedColorSpace()+0> push rbp
0x00007fffffffcd10│+0x0020: 0x00000ffffffff9c4  →  0x0000000000000000
0x00007fffffffcd18│+0x0028: 0x0000000000000020
0x00007fffffffcd20│+0x0030: 0x00000003ffffffff  →  0x0000000000000000
0x00007fffffffcd28│+0x0038: 0xfffffffffffffffd
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── code:x86:64 ────
     0x5cd533 <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> enter  0x8948, 0xc2
     0x5cd537 <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> shr    rdx, 0x3
     0x5cd53b <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> add    rdx, 0x7fff8000
→   0x5cd542 <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> movzx  edx, BYTE PTR [rdx]
     0x5cd545 <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> test   dl, dl
     0x5cd547 <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> setne  sil
     0x5cd54b <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> mov    rdi, rax
     0x5cd54e <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> and    edi, 0x7
     0x5cd551 <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> cmp    dil, dl
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── source:/home/aceteam/Downloads/sources/pdfalto/xpdf-4.00/xpdf/GfxState.cc+1149 ────
   1144       } else if (k > indexHigh) {
   1145         k = indexHigh;
   1146       }
   1147       p = &lookup[k * n];
   1148       for (i = 0; i < n; ++i) {
        // baseColor=0x00007fffffffccf8  →  [...]  →  0x0000000000000000, p=0x00007fffffffcd28  →  0xfffffffffffffffd, low=0x00007fffffffcd50  →  0x0000000000000000, range=0x00007fffffffce70  →  0x3ff0000000000000, i=0x0
→ 1149         baseColor->c[i] = dblToCol(low[i] + (p[i] / 255.0) * range[i]);
   1150       }
   1151       return baseColor;
   1152     }
   1153     
   1154     void GfxIndexedColorSpace::getGray(GfxColor *color, GfxGray *gray,
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── threads ────
[#0] Id 1, Name: "pdfalto", stopped, reason: SIGSEGV
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── trace ────
[#0] 0x5cd542 → GfxIndexedColorSpace::mapColorToBase(this=0x60400000d050, color=0x61700000e108, baseColor=0x7fffffffd020)
[#1] 0x5cdaa4 → GfxIndexedColorSpace::getRGB(this=0x60400000d050, color=0x61700000e108, rgb=0x7fffffffd190, ri=gfxRenderingIntentRelativeColorimetric)
[#2] 0x5f6b4f → GfxState::getFillRGB(this=0x61700000e080, rgb=0x7fffffffd190)
[#3] 0x445f21 → XmlAltoOutputDev::fill(this=0x61500000f300, state=0x61700000e080)
[#4] 0x6c4f54 → Gfx::opFill(this=0x60f00000e140, args=0x7fffffffd3d0, numArgs=0x0)
[#5] 0x6bc95f → Gfx::execOp(this=0x60f00000e140, cmd=0x7fffffffd390, args=0x7fffffffd3d0, numArgs=0x0)
[#6] 0x6bbf7a → Gfx::go(this=0x60f00000e140, topLevel=0x1)
[#7] 0x6bb562 → Gfx::display(this=0x60f00000e140, objRef=0x60800000bed0, topLevel=0x1)
[#8] 0x61cf67 → Page::displaySlice(this=0x60800000bea0, out=0x61500000f300, hDPI=72, vDPI=72, rotate=0x0, useMediaBox=0x0, crop=0x0, sliceX=0xffffffff, sliceY=0xffffffff, sliceW=0xffffffff, sliceH=0xffffffff, printing=0x0, abortCheckCbk=0x0, abortCheckCbkData=0x0)
[#9] 0x61c7af → Page::display(this=0x60800000bea0, out=0x61500000f300, hDPI=72, vDPI=72, rotate=0x0, useMediaBox=0x0, crop=0x1, printing=0x0, abortCheckCbk=0x0, abortCheckCbkData=0x0)


gef➤  p/d  k * n
$24 = -3
gef➤  p &lookup[k * n]
$25 = (Guchar *) 0xfffffffffffffffd <error: Cannot access memory at address 0xfffffffffffffffd>
gef➤  p (p[i] / 255.0)
Cannot access memory at address 0xfffffffffffffffd


SEGV in function GfxImageColorMap::getRGB

I used Clang 6.0 and AddressSanitizer to build pdfalto, this file can cause SEGV in function GfxImageColorMap::getRGB in GfxState.cc when executing this command:

./pdfalto SEGV_getRGB 1.xml

This is the ASAN information:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==20699==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000710e0c bp 0x7ffd9f911d30 sp 0x7ffd9f911be0 T0)
==20699==The signal is caused by a READ memory access.
==20699==Hint: address points to the zero page.
    #0 0x710e0b in GfxImageColorMap::getRGB(unsigned char*, GfxRGB*, GfxRenderingIntent) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/GfxState.cc:3659:30
    #1 0x596afe in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6477:35
    #2 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28
    #3 0x5ae4a4 in XmlAltoOutputDev::drawSoftMaskedImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, Stream*, int, int, GfxImageColorMap*, double*, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7458:5
    #4 0x9d94cd in Gfx::doImage(Object*, Stream*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:4447:7
    #5 0x9709a5 in Gfx::opXObject(Object*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:3980:2
    #6 0x9a6668 in Gfx::execOp(Object*, Object*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:826:3
    #7 0x9a42b1 in Gfx::go(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:719:12
    #8 0x9a1d1b in Gfx::display(Object*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:641:3
    #9 0x77c466 in Page::displaySlice(OutputDev*, double, double, int, int, int, int, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:373:10
    #10 0x77babc in Page::display(OutputDev*, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:321:3
    #11 0x78268e in PDFDoc::displayPage(OutputDev*, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/PDFDoc.cc:386:27
    #12 0x78268e in PDFDoc::displayPages(OutputDev*, int, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/PDFDoc.cc:399
    #13 0x526f9d in PDFDocXrce::displayPages(OutputDev*, _xmlNode*, int, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/src/PDFDocXrce.cc:22:10
    #14 0x529565 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:415:18
    #15 0x7fc4172c282f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
    #16 0x41c678 in _start (/home/fouzhe/my_fuzz/pdfalto/pdfalto+0x41c678)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/GfxState.cc:3659:30 in GfxImageColorMap::getRGB(unsigned char*, GfxRGB*, GfxRenderingIntent)
==20699==ABORTING

Issue with alto file from a PDF

The attached PDF generates an XML file that cannot be parsed by GROBID's SAX parser:

ERROR [2019-02-25 20:10:13,990] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs. 
! org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
! at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
! at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
! at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
! at org.apache.xerces.impl.XMLEntityScanner.scanLiteral(Unknown Source)
! at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanAttribute(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
! ... 80 common frames omitted
! Causing: org.xml.sax.SAXParseException: Invalid byte 2 of 3-byte UTF-8 sequence.
! at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
! at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
! at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
! at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
! at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
! at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
! at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
! at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
! at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
! at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
! at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
! at org.grobid.core.document.Document.addTokenizedDocument(Document.java:381)
! ... 70 common frames omitted
! Causing: org.grobid.core.exceptions.GrobidException: [PARSING_ERROR] Cannot parse file: /home/lopez/grobid/grobid-home/tmp/xsW7YuKt23.lxml
! at org.grobid.core.document.Document.addTokenizedDocument(Document.java:393)
! at org.grobid.core.engines.Segmentation.processing(Segmentation.java:94)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:130)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:109)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:474)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:465)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:179)
...

01 Ramadan Indexing techniques for 2016.pdf
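
To narrow down where the invalid byte sequence sits in the generated XML, a strict UTF-8 decode can report the offset of the first failure. This is only a minimal sketch: `find_invalid_utf8` and the sample bytes are hypothetical and not part of pdfalto or GROBID.

```python
def find_invalid_utf8(data: bytes, context: int = 20):
    """Return (offset, surrounding bytes) of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset at which strict decoding failed
        lo = max(0, err.start - context)
        hi = min(len(data), err.end + context)
        return err.start, data[lo:hi]
    return None


# Hypothetical sample: 0xE2 opens a 3-byte sequence, but the second byte (0x28)
# is not a continuation byte, which is the kind of error Xerces reports above.
sample = b'<String CONTENT="ok \xe2\x28\xa1"/>'
print(find_invalid_utf8(sample))
```

On a real file, reading it with `open(path, "rb").read()` and passing the bytes to the function gives the offset to inspect in a hex viewer.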

sorry, unimplemented: non-trivial designated initializers not supported

In file included from /opt/src/pdfalto/xpdf-4.00/xpdf/GlobalParams.cc:64:
/opt/src/pdfalto/xpdf-4.00/xpdf/UnicodeToUnicodeFontRules.h:25:1: sorry, unimplemented: non-trivial designated initializers not supported
};

$ cc -v
Using built-in specs.
COLLECT_GCC=cc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/8.2.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --enable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp --enable-cet=auto
Thread model: posix
gcc version 8.2.1 20181127 (GCC)

Any ideas?
Thanks,
Rytis

ALTO version with latest release

Previously, we used pdfalto to generate ALTO XML from a PDF and then https://github.com/filak/hOCR-to-ALTO to convert the ALTO XML to an hOCR file. With the newest release of pdfalto this no longer works, since the ALTO version seems to have changed. Can you share which version of ALTO pdfalto currently produces?
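
One way to check this on an existing output file is to read the namespace declared on the root element. The sketch below assumes the file declares an ALTO namespace on its root, as ALTO files conventionally do; the sample document and its namespace value are illustrative only and are not taken from pdfalto output.

```python
import re
import xml.etree.ElementTree as ET


def alto_namespace(xml_text: str):
    """Return the namespace URI of the root element, or None if it has none."""
    root = ET.fromstring(xml_text)
    # ElementTree stores namespaced tags as "{uri}localname"
    match = re.match(r"\{(.*)\}", root.tag)
    return match.group(1) if match else None


# Hypothetical minimal document; the namespace is an example value only.
sample = '<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#"><Layout/></alto>'
print(alto_namespace(sample))
```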

FPE in function ImageStream::ImageStream

I built pdfalto with Clang 6.0 and AddressSanitizer. This file causes an FPE in function ImageStream::ImageStream in Stream.cc when executing this command:

./pdfalto FPE_ImageStream 1.xml

This is the ASAN information:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==4985==ERROR: AddressSanitizer: FPE on unknown address 0x00000079252d (pc 0x00000079252d bp 0x0c0c000006ae sp 0x7ffde533a9d0 T0)
    #0 0x79252c in ImageStream::ImageStream(Stream*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Stream.cc:359:23
    #1 0x5969bc in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6427:43
    #2 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28
    #3 0x5ae52f in XmlAltoOutputDev::drawSoftMaskedImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, Stream*, int, int, GfxImageColorMap*, double*, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7460:5
    #4 0x9d94cd in Gfx::doImage(Object*, Stream*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:4447:7
    #5 0x9709a5 in Gfx::opXObject(Object*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:3980:2
    #6 0x9a6668 in Gfx::execOp(Object*, Object*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:826:3
    #7 0x9a42b1 in Gfx::go(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:719:12
    #8 0x9a1d1b in Gfx::display(Object*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:641:3
    #9 0x77c466 in Page::displaySlice(OutputDev*, double, double, int, int, int, int, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:373:10
    #10 0x77babc in Page::display(OutputDev*, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:321:3
    #11 0x78268e in PDFDoc::displayPage(OutputDev*, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/PDFDoc.cc:386:27
    #12 0x78268e in PDFDoc::displayPages(OutputDev*, int, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/PDFDoc.cc:399
    #13 0x526f9d in PDFDocXrce::displayPages(OutputDev*, _xmlNode*, int, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/src/PDFDocXrce.cc:22:10
    #14 0x529565 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:415:18
    #15 0x7f7dc0f1382f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
    #16 0x41c678 in _start (/home/fouzhe/my_fuzz/pdfalto/pdfalto+0x41c678)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: FPE /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Stream.cc:359:23 in ImageStream::ImageStream(Stream*, int, int, int)
==4985==ABORTING

Reading order: another issue (this is more problematic)

Image Pasted at 2019-3-26 11-34

The extracted text reads: "appears to be maintained until approximately 35 40 years -of age, followed by modest decreases until 50 years of age,". You can see that the hyphen is out of place.

This happens with pdf2xml too, by the way.

Here is the output from pdfalto:

                <TextLine WIDTH="502.269" HEIGHT="8.208" ID="p1_t62" HPOS="51" VPOS="638.632">
                    <String ID="p1_w620" CONTENT="Endurance" HPOS="51" VPOS="638.632" WIDTH="38.556" HEIGHT="8.208"
                            STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="89.556"/>
                    <String ID="p1_w621" CONTENT="and" HPOS="92.04" VPOS="638.632" WIDTH="13.014" HEIGHT="8.208"
                            STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="105.054"/>
                    <String ID="p1_w622" CONTENT="ultra-endurance" HPOS="107.538" VPOS="638.632" WIDTH="56.601"
                            HEIGHT="8.208" STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="164.139"/>
                    <String ID="p1_w623" CONTENT="performance," HPOS="166.623" VPOS="638.632" WIDTH="47.826"
                            HEIGHT="8.208" STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="214.449"/>
                    <String ID="p1_w624" CONTENT="in" HPOS="216.933" VPOS="638.632" WIDTH="7.011" HEIGHT="8.208"
                            STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="223.944"/>
                    <String ID="p1_w625" CONTENT="terms" HPOS="226.428" VPOS="638.632" WIDTH="20.034" HEIGHT="8.208"
                            STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="246.462"/>
                    <String ID="p1_w626" CONTENT="of" HPOS="248.946" VPOS="638.632" WIDTH="7.506" HEIGHT="8.208"
                            STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="256.452"/>
                    <String ID="p1_w627" CONTENT="the" HPOS="258.936" VPOS="638.632" WIDTH="11.016" HEIGHT="8.208"
                            STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="269.952"/>
                    <String ID="p1_w628" CONTENT="overall" HPOS="272.436" VPOS="638.632" WIDTH="25.047"
                            HEIGHT="8.208" STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="297.483"/>
                    <String ID="p1_w629" CONTENT="time" HPOS="299.967" VPOS="638.632" WIDTH="16.029" HEIGHT="8.208"
                            STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="315.996"/>
                    <String ID="p1_w630" CONTENT="taken," HPOS="318.48" VPOS="638.632" WIDTH="21.789" HEIGHT="8.208"
                            STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="340.269"/>
                    <String ID="p1_w631" CONTENT="appears" HPOS="342.753" VPOS="638.632" WIDTH="27.54"
                            HEIGHT="8.208" STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="370.293"/>
                    <String ID="p1_w632" CONTENT="to" HPOS="372.777" VPOS="638.632" WIDTH="7.011" HEIGHT="8.208"
                            STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="379.788"/>
                    <String ID="p1_w633" CONTENT="be" HPOS="382.272" VPOS="638.632" WIDTH="8.505" HEIGHT="8.208"
                            STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="390.777"/>
                    <String ID="p1_w634" CONTENT="maintained" HPOS="393.261" VPOS="638.632" WIDTH="40.077"
                            HEIGHT="8.208" STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="433.338"/>
                    <String ID="p1_w635" CONTENT="until" HPOS="435.822" VPOS="638.632" WIDTH="16.542" HEIGHT="8.208"
                            STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="452.364"/>
                    <String ID="p1_w636" CONTENT="approximately" HPOS="454.848" VPOS="638.632" WIDTH="52.101"
                            HEIGHT="8.208" STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="506.949"/>
                    <String ID="p1_w637" CONTENT="35" HPOS="509.433" VPOS="638.632" WIDTH="9.009" HEIGHT="8.208"
                            STYLEREFS="font10"/>
                    <SP WIDTH="4.308" VPOS="638.632" HPOS="518.442"/>
                    <String ID="p1_w638" CONTENT="40" HPOS="522.75" VPOS="638.632" WIDTH="9.009" HEIGHT="8.208"
                            STYLEREFS="font10"/>
                    <SP WIDTH="2.484" VPOS="638.632" HPOS="531.759"/>
                    <String ID="p1_w639" CONTENT="years" HPOS="534.243" VPOS="638.632" WIDTH="19.026" HEIGHT="8.208"
                            STYLEREFS="font10"/>
                </TextLine>
            </TextBlock>
            <TextBlock ID="p1_b49" HPOS="518.7" VPOS="637" HEIGHT="9.75579" WIDTH="3.995">
                <TextLine WIDTH="3.995" HEIGHT="9.75579" ID="p1_t63" HPOS="518.7" VPOS="637">
                    <String ID="p1_w640" CONTENT="–" HPOS="518.7" VPOS="637" WIDTH="3.995" HEIGHT="9.75579"
                            STYLEREFS="font5"/>
                </TextLine>
            </TextBlock>
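
The coordinates in this excerpt show where the stray dash belongs: p1_w637 ("35") ends at 509.433 + 9.009 = 518.442 and p1_w638 ("40") starts at 522.75, while the dash in block p1_b49 spans 518.7 to 522.695. The following sketch only illustrates that horizontal check on the values copied from the XML; `insertion_index` and its tolerance are hypothetical and do not describe how pdfalto orders text.

```python
# (id, HPOS, WIDTH) copied from the XML excerpt
line = [
    ("p1_w637", 509.433, 9.009),
    ("p1_w638", 522.75, 9.009),
    ("p1_w639", 534.243, 19.026),
]
orphan = ("p1_w640", 518.7, 3.995)


def insertion_index(line, orphan, tol=0.5):
    """Index in `line` before which `orphan` fits horizontally, or None."""
    _, o_left, o_width = orphan
    o_right = o_left + o_width
    for i in range(len(line) - 1):
        gap_left = line[i][1] + line[i][2]  # right edge of the left word
        gap_right = line[i + 1][1]          # left edge of the right word
        if gap_left - tol <= o_left and o_right <= gap_right + tol:
            return i + 1
    return None


idx = insertion_index(line, orphan)
merged = line[:idx] + [orphan] + line[idx:]
print([word[0] for word in merged])
```

A fuller check would also compare VPOS and HEIGHT to confirm that the two blocks overlap vertically before merging them.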

1942 PDF test set

As a reference for the current version, 1071 PDFs out of 1942 fail when testing pdfalto with GROBID on the 1942-document PubMed Central PDF set.
The errors are mostly pdfalto failures, plus XML that is not well-formed, usually because of invalid XML characters in attributes.
I will open separate issues, with a test PDF attached, for the different cases.
Note that the latest version of our pdf2xml fork modified for GROBID was 100% successful on this set, so pdfalto should be too :)
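
For the well-formedness failures, a quick way to spot offending characters in an attribute value is to test it against the `Char` production of XML 1.0. This is a sketch of such a check, not code from pdfalto or GROBID; the function name and the sample string are hypothetical.

```python
import re

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def invalid_xml_chars(text: str):
    """Return (offset, code point) for each character XML 1.0 does not allow."""
    return [(m.start(), ord(m.group())) for m in INVALID_XML_CHARS.finditer(text)]


# Hypothetical attribute text containing a form feed and a control character.
sample = 'CONTENT="fi\x0cgure\x01"'
print(invalid_xml_chars(sample))
```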

Build from a fresh clone fails on Ubuntu

When trying to build pdfalto in Ubuntu, it fails on XmlAltoOutputDev.h:

[ 57%] Building CXX object CMakeFiles/pdfalto.dir/src/pdfalto.cc.o
In file included from /root/pdfalto/src/pdfalto.cc:18:0:
/root/pdfalto/src/XmlAltoOutputDev.h:25:25: warning: extra tokens at end of #include directive
 #include <unordered_map>;
                         ^
In file included from /usr/include/c++/5/unordered_map:35:0,
                 from /root/pdfalto/src/XmlAltoOutputDev.h:25,
                 from /root/pdfalto/src/pdfalto.cc:18:
/usr/include/c++/5/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
 #error This file requires compiler and library support \
  ^
/root/pdfalto/src/pdfalto.cc:43:24: warning: extra tokens at end of #include directive
 #include "TextString.h";
                        ^
In file included from /root/pdfalto/xpdf-4.00/goo/parseargs.h:16:0,
                 from /root/pdfalto/src/pdfalto.cc:6:
/root/pdfalto/xpdf-4.00/goo/gtypes.h:18:16: warning: non-static data member initializers only available with -std=c++11 or -std=gnu++11
 #define gFalse 0
                ^
/root/pdfalto/src/XmlAltoOutputDev.h:154:22: note: in expansion of macro 'gFalse'
     GBool fontType = gFalse; //Enumeration : serif (gTrue) or sans-serif(gFalse)
                      ^
/root/pdfalto/xpdf-4.00/goo/gtypes.h:18:16: warning: non-static data member initializers only available with -std=c++11 or -std=gnu++11
 #define gFalse 0
                ^
/root/pdfalto/src/XmlAltoOutputDev.h:155:23: note: in expansion of macro 'gFalse'
     GBool fontWidth = gFalse; //Enumeration : proportional(gFalse) or fixed(gTrue)
                       ^
/root/pdfalto/xpdf-4.00/goo/gtypes.h:18:16: warning: non-static data member initializers only available with -std=c++11 or -std=gnu++11
 #define gFalse 0
                ^
/root/pdfalto/src/XmlAltoOutputDev.h:158:20: note: in expansion of macro 'gFalse'
     GBool isbold = gFalse;
                    ^
/root/pdfalto/xpdf-4.00/goo/gtypes.h:18:16: warning: non-static data member initializers only available with -std=c++11 or -std=gnu++11
 #define gFalse 0
                ^
/root/pdfalto/src/XmlAltoOutputDev.h:159:22: note: in expansion of macro 'gFalse'
     GBool isitalic = gFalse;
                      ^
/root/pdfalto/xpdf-4.00/goo/gtypes.h:18:16: warning: non-static data member initializers only available with -std=c++11 or -std=gnu++11
 #define gFalse 0
                ^
/root/pdfalto/src/XmlAltoOutputDev.h:160:25: note: in expansion of macro 'gFalse'
     GBool issubscript = gFalse;
                         ^
/root/pdfalto/xpdf-4.00/goo/gtypes.h:18:16: warning: non-static data member initializers only available with -std=c++11 or -std=gnu++11
 #define gFalse 0
                ^
/root/pdfalto/src/XmlAltoOutputDev.h:161:27: note: in expansion of macro 'gFalse'
     GBool issuperscript = gFalse;
                           ^
In file included from /root/pdfalto/src/pdfalto.cc:18:0:
/root/pdfalto/src/XmlAltoOutputDev.h:1388:13: error: 'unordered_map' does not name a type
     typedef unordered_map<char*, unsigned int, Hash_Func, my_equal_to<char*> > my_unordered_map;
             ^
/root/pdfalto/src/XmlAltoOutputDev.h:1390:5: error: 'my_unordered_map' does not name a type
     my_unordered_map unicode_map;
     ^
/root/pdfalto/src/pdfalto.cc: In function 'int main(int, char**)':
/root/pdfalto/src/pdfalto.cc:180:41: warning: format not a string literal and no format arguments [-Wformat-security]
         fprintf(stderr, PDFALTO_NAME);
                                     ^
/root/pdfalto/src/pdfalto.cc:182:44: warning: format not a string literal and no format arguments [-Wformat-security]
         fprintf(stderr, PDFALTO_VERSION);
                                        ^
make[2]: *** [CMakeFiles/pdfalto.dir/src/pdfalto.cc.o] Error 1
make[1]: *** [CMakeFiles/pdfalto.dir/all] Error 2
CMakeFiles/pdfalto.dir/build.make:182: recipe for target 'CMakeFiles/pdfalto.dir/src/pdfalto.cc.o' failed
CMakeFiles/Makefile2:71: recipe for target 'CMakeFiles/pdfalto.dir/all' failed
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2
The command '/bin/sh -c make' returned a non-zero code: 2

It seems that the latest updates cause this...

Some failing ISTEX PDF

Here are some failing PDF files from ISTEX (found after processing 1000 random documents) that did not fail with the latest pdf2xml:

D3B2DA15EBD9A692BF1EF4D32606F95A72D5D381 
5A09169C31467704EBB453123479708334DDAF35 
45BCDD6CD0ECF1D7C6B9169E999C63BFF30DB501 
7528880E3DCB09F09E214AFEC57C3A4FCEA15905 
774EE3CD645A861B5F5184F96F80A837412887FA 
F8E939EACBC26F4F39B309AAAC7ABB8FC8A86C59 
864EFF775D7F56E7223EAD95801A6A07ACD8CF71 
0ACFDDBB83BF9A5ABAD34686AC4C8CE9317BDB2E 
122C63850FE715C35B2B7A5FF376E484E5627C75 
7F6B6DE03BEA6EAE36896867F88670B0E62F6EAA 
86EDACFC946D09D3F2F7703448EA7E2544CC9AE4 
7EE3BDA171BB275B860191E918289EC4F1289566 
ED9964BA3659F48C2E9227DE095836E47845B509 
83FAB54C06DCFB813657DCCAFF3C66AD67CC95FE 
38BD0E2B812BB737321F7D9E76AF0E95A9593E3D 
DD522EB94B2A865F00B4DDFC345780B119579DD7 
5EA4AAF2C4674DC7C8AFC750AF5320C0F7489FCE 
DD338AD05CEA42CF737A187344B1337385CC6FFB 
052DFBD14E0015CA914E28A0A561675D36FFA2CC 
C3D11DEE82F3403336BE55E9F94DB6E9A6343E1B

The following 3 fail with both pdfalto and pdf2xml, if I am not wrong:

2DE3AF6CC5E90F16E64866D3784DC06B33705360
5E00837DC8C8EF0C9B4D16603261C993135EBCD5
8AD9F55CF0BC915BB6B448B1B536D3A7DC08239D

To get them: https://api.istex.fr/document/*ISTEXID*/fulltext/pdf
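
As a convenience, the URL pattern above can be filled in for each identifier. This is a minimal sketch: the helper name is hypothetical, and it only builds the URL string without performing any download.

```python
ISTEX_PDF_URL = "https://api.istex.fr/document/{istex_id}/fulltext/pdf"


def istex_pdf_url(istex_id: str) -> str:
    """Build the full-text PDF URL for one ISTEX identifier."""
    # The identifiers listed above carry trailing spaces, so strip them first.
    return ISTEX_PDF_URL.format(istex_id=istex_id.strip())


ids = [
    "D3B2DA15EBD9A692BF1EF4D32606F95A72D5D381 ",
    "2DE3AF6CC5E90F16E64866D3784DC06B33705360",
]
for istex_id in ids:
    print(istex_pdf_url(istex_id))
```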

Issues with paragraphs/text blocks detection in documents

Hi,
I'm parsing documents after they have been processed by pdfalto, and I especially need the text block tags to split the text into paragraphs. However, the documents I'm working on have the same distance between all lines, as you can see in the attached dummy example.
Example document.docx

The only parameter for identifying these text blocks is this one, in the XmlAltoOutputDev.cc file:
// Max distance between baselines of two lines within a block, as a
// fraction of the font size.
#define maxLineSpacingDelta 1.5

I've tried to tweak it, but that only leads to either one-line paragraphs or the entire text of a page gathered into one block.

So I was wondering whether there is any other way to get these paragraphs/blocks, for example by looking at the x position of the final word in a line (a paragraph end can quite often be identified by a line break before the end of the line).

Thank you in advance
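
The idea of using the end position of each line can be tried as a post-processing step on the ALTO output. The sketch below is only an illustration of that heuristic, not a pdfalto feature: `split_paragraphs`, the `slack` threshold, and the sample block are all hypothetical. A line whose right edge (HPOS + WIDTH) stops well short of the block's right margin is treated as the last line of a paragraph.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip an optional "{namespace}" prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def split_paragraphs(block, slack=0.15):
    """Group the TextLine elements of one TextBlock into paragraph strings."""
    lines = [el for el in block.iter() if local(el.tag) == "TextLine"]
    if not lines:
        return []
    left = min(float(l.get("HPOS")) for l in lines)
    right = max(float(l.get("HPOS")) + float(l.get("WIDTH")) for l in lines)
    # Lines ending left of this x position count as "short"
    threshold = right - slack * (right - left)
    paragraphs, current = [], []
    for l in lines:
        words = [s.get("CONTENT") for s in l.iter() if local(s.tag) == "String"]
        current.append(" ".join(words))
        if float(l.get("HPOS")) + float(l.get("WIDTH")) < threshold:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Hypothetical block: lines 2 and 4 end well before the right margin.
sample = """<TextBlock>
  <TextLine HPOS="50" WIDTH="500"><String CONTENT="Alpha"/><String CONTENT="beta"/></TextLine>
  <TextLine HPOS="50" WIDTH="300"><String CONTENT="gamma."/></TextLine>
  <TextLine HPOS="50" WIDTH="500"><String CONTENT="Delta"/></TextLine>
  <TextLine HPOS="50" WIDTH="200"><String CONTENT="epsilon."/></TextLine>
</TextBlock>"""
print(split_paragraphs(ET.fromstring(sample)))
```

The heuristic misses paragraphs whose last line happens to fill the full width, so it would likely need to be combined with other cues such as first-line indentation.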

Seg. fault on some PDF when generating outline file

recompile with -fPIC

$ make
...
[ 15%] Linking CXX executable pdfalto
/usr/bin/ld: libs/image/png/linux/libpng.a(png.c.o): relocation R_X86_64_32 against `.rodata' can not be used when making a PIE object; recompile with -fPIC
....

$ cc -v
Using built-in specs.
COLLECT_GCC=cc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/8.2.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --enable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp --enable-cet=auto
Thread model: posix
gcc version 8.2.1 20181127 (GCC)

Any ideas?
Thanks,
Rytis

Memory Leaks

I built pdfalto with Clang 6.0 and AddressSanitizer. This file causes memory leaks when executing this command:

./pdfalto detected_memory_leaks 1.xml

This is the ASAN information:

=================================================================
==12842==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 104340 byte(s) in 1 object(s) allocated from:
    #0 0x5184e8 in operator new[](unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:95
    #1 0x596941 in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6423:35
    #2 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28

Direct leak of 104340 byte(s) in 1 object(s) allocated from:
    #0 0x5184e8 in operator new[](unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:95
    #1 0x596941 in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6423:35
    #2 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28

Direct leak of 14000 byte(s) in 14 object(s) allocated from:
    #0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x7fdeff25d7f6 in xmlEncodeEntitiesInternal /home/fouzhe/my_fuzz/libxml2/entities.c:576

Direct leak of 120 byte(s) in 1 object(s) allocated from:
    #0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x7fdeff282dc0 in xmlNewNode__internal_alias /home/fouzhe/my_fuzz/libxml2/tree.c:2239

Direct leak of 84 byte(s) in 1 object(s) allocated from:
    #0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0xb378a8 in gmalloc /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/gmem.cc:140:13
    #2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
    #3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 48 byte(s) in 3 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x52a5e1 in removeAlreadyExistingData(GString*) /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:464:20
    #2 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 48 byte(s) in 1 object(s) allocated from:
    #0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x7fdeff281d6b in xmlNewNs__internal_alias /home/fouzhe/my_fuzz/libxml2/tree.c:757

Direct leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x59e8d1 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6721:26
    #2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
    #3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x54021f in TextPage::TextPage(int, Catalog*, _xmlNode*, GString*, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:1508:17
    #2 0x59fbc1 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6852:16
    #3 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x544254 in TextPage::startPage(int, GfxState*, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:1567:13
    #2 0x5a992b in XmlAltoOutputDev::startPage(int, GfxState*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7200:15

Direct leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x59b645 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6663:19
    #2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
    #3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x59e7e4 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6708:24
    #2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
    #3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x77b765 in Page::getLinks() /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:311:11
    #2 0x544f81 in TextPage::startPage(int, GfxState*, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:1732:30
    #3 0x5a992b in XmlAltoOutputDev::startPage(int, GfxState*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7200:15

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x5a5597 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7079:18
    #2 0x5a49a8 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6938:19
    #3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x59e681 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6692:18
    #2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
    #3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x5a55b1 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7080:19
    #2 0x5a4c24 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6954:19
    #3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x540b34 in TextPage::TextPage(int, Catalog*, _xmlNode*, GString*, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:1552:23
    #2 0x59fbc1 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6852:16
    #3 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x5a55b1 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7080:19
    #2 0x5a490c in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6934:19
    #3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x5a55b1 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7080:19
    #2 0x5a4b85 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6950:19
    #3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x540abe in TextPage::TextPage(int, Catalog*, _xmlNode*, GString*, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:1551:23
    #2 0x59fbc1 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6852:16
    #3 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x5a5597 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7079:18
    #2 0x5a4b85 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6950:19
    #3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x5a5597 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7079:18
    #2 0x5a4c24 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6954:19
    #3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x595d0d in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6259:24
    #2 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x5a63db in XmlAltoOutputDev::getInfoDate(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7122:19
    #2 0x5a4c8a in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6958:19
    #3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x5a63db in XmlAltoOutputDev::getInfoDate(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7122:19
    #2 0x5a4cf0 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6962:19
    #3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x59e5cd in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6687:18
    #2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
    #3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x529331 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:395:23
    #2 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x5a55b1 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7080:19
    #2 0x5a4ae5 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6946:19
    #3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x52902e in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:328:29
    #2 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x595cac in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6255:24
    #2 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x5a5597 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7079:18
    #2 0x5a4ae5 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6946:19
    #3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x597f8f in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6497:48
    #2 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x5a55b1 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7080:19
    #2 0x5a4a44 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6942:19
    #3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x5a31e6 in XmlAltoOutputDev::toUnicode(GString*, UnicodeMap*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7188:12

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x597f75 in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6497:23
    #2 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x597faa in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6497:78
    #2 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x595cac in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6255:24
    #2 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0xb23b8c in GString::fromInt(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/GString.cc:186:10
    #2 0x595cf3 in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6257:21
    #3 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0xb23b8c in GString::fromInt(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/GString.cc:186:10
    #2 0x595d4b in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6261:21
    #3 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0xb23b8c in GString::fromInt(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/GString.cc:186:10
    #2 0x595d4b in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6261:21
    #3 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x597f75 in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6497:23
    #2 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x597f8f in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6497:48
    #2 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x597faa in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6497:78
    #2 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0xb23b8c in GString::fromInt(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/GString.cc:186:10
    #2 0x547aff in TextPage::endPage(GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:1785:25

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0xb23b8c in GString::fromInt(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/GString.cc:186:10
    #2 0x547bff in TextPage::endPage(GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:1790:25

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x5a5597 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7079:18
    #2 0x5a4a44 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6942:19
    #3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x595d0d in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6259:24
    #2 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0xb23b8c in GString::fromInt(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/GString.cc:186:10
    #2 0x595cf3 in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6257:21
    #3 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x5a55b1 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7080:19
    #2 0x5a49a8 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6938:19
    #3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x59f1bd in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6783:13
    #2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
    #3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x59e604 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6689:15
    #2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
    #3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x5a5597 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7079:18
    #2 0x5a490c in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6934:19
    #3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Direct leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
    #1 0x5a750f in XmlAltoOutputDev::closeMetadataInfoDoc(GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6970:33
    #2 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Indirect leak of 424 byte(s) in 33 object(s) allocated from:
    #0 0x5184e8 in operator new[](unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:95
    #1 0xb34fb9 in GString::resize(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/GString.cc:119:9

Indirect leak of 336 byte(s) in 11 object(s) allocated from:
    #0 0x5184e8 in operator new[](unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:95
    #1 0xb34eba in GString::resize(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/GString.cc:121:10

Indirect leak of 176 byte(s) in 1 object(s) allocated from:
    #0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x7fdeff282754 in xmlNewDoc__internal_alias /home/fouzhe/my_fuzz/libxml2/tree.c:1171

Indirect leak of 120 byte(s) in 1 object(s) allocated from:
    #0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x7fdeff283006 in xmlNewText__internal_alias /home/fouzhe/my_fuzz/libxml2/tree.c:2445

Indirect leak of 120 byte(s) in 1 object(s) allocated from:
    #0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x7fdeff282dc0 in xmlNewNode__internal_alias /home/fouzhe/my_fuzz/libxml2/tree.c:2239

Indirect leak of 120 byte(s) in 2 object(s) allocated from:
    #0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0xb378a8 in gmalloc /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/gmem.cc:140:13
    #2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
    #3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Indirect leak of 96 byte(s) in 1 object(s) allocated from:
    #0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x7fdeff283146 in xmlNewPropInternal /home/fouzhe/my_fuzz/libxml2/tree.c:1855

Indirect leak of 87 byte(s) in 8 object(s) allocated from:
    #0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x7fdeff2d8518 in xmlStrndup__internal_alias /home/fouzhe/my_fuzz/libxml2/xmlstring.c:45

Indirect leak of 64 byte(s) in 1 object(s) allocated from:
    #0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0xb378a8 in gmalloc /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/gmem.cc:140:13
    #2 0x5a992b in XmlAltoOutputDev::startPage(int, GfxState*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7200:15

Indirect leak of 64 byte(s) in 1 object(s) allocated from:
    #0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0xb378a8 in gmalloc /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/gmem.cc:140:13
    #2 0x59fbc1 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6852:16
    #3 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
    #4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

SUMMARY: AddressSanitizer: 225355 byte(s) leaked in 128 allocation(s).

Infinite loop

I built pdfalto with Clang 6.0 and AddressSanitizer. This file sends it into an infinite loop when running this command:

./pdfalto infinite_loop 1.xml
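
When reproducing a hang like this, it can help to run the command under a time limit so the test does not block forever. The sketch below is only an illustration: the helper name `hangs` and the choice of timeout are assumptions, and a timeout alone cannot distinguish a true infinite loop from a very slow conversion.

```python
import subprocess
from typing import Sequence


def hangs(command: Sequence[str], timeout_s: float) -> bool:
    """Return True if `command` is still running after `timeout_s` seconds.

    Hypothetical helper: a long-running but finite conversion would also
    be reported as a hang, so the timeout must be chosen generously.
    """
    try:
        subprocess.run(
            list(command),
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return True
    return False
```

For the command reported here, the call would be `hangs(["./pdfalto", "infinite_loop", "1.xml"], timeout_s=60)`, where the 60-second limit is an arbitrary choice.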

Build from a fresh clone fails on ubuntu:xenial

The newest build fails.

Output:
Sending build context to Docker daemon 11.19MB
Step 1/31 : FROM ubuntu:xenial
---> a51debf7e1eb
Step 2/31 : ADD https://github.com/openfaas/faas/releases/download/0.7.0/fwatchdog /usr/bin
Downloading [==================================================>] 4.111MB/4.111MB
---> Using cache
---> 583dc3bc25f3
Step 3/31 : RUN chmod +x /usr/bin/fwatchdog
---> Using cache
---> a008a7bd6eba
Step 4/31 : RUN apt-get update -y
---> Using cache
---> 4f2effbf5ad3
Step 5/31 : RUN apt-get install -y python3.5
---> Using cache
---> 708ec07d2bac
Step 6/31 : RUN apt-get -y install python3-pip
---> Using cache
---> 743ff0ec2418
Step 7/31 : RUN apt-get install -y --no-install-recommends wget build-essential automake g++
---> Using cache
---> 99e8d7b1c5fd
Step 8/31 : RUN apt-get install -y libxml2-dev
---> Using cache
---> c3f85c426b14
Step 9/31 : RUN apt-get install -y libmotif-dev
---> Using cache
---> 799c31317488
Step 10/31 : RUN apt-get install -y git
---> Using cache
---> b63d24e5472f
Step 11/31 : RUN mkdir icu && wget -q https://github.com/unicode-org/icu/releases/download/release-63-1/icu4c-63_1-src.tgz && gunzip -d < icu4c-63_1-src.tgz | tar xvf - && cd icu/source && chmod +x runConfigureICU configure install-sh && ./runConfigureICU Linux/gcc --enable-static --disable-shared && make
---> Using cache
---> e38664784bba
Step 12/31 : RUN git clone https://github.com/kermitt2/pdfalto.git ~/pdfalto
---> Using cache
---> f7f4bb177a3f
Step 13/31 : WORKDIR /root/pdfalto
---> Using cache
---> 301297d7ce11
Step 14/31 : RUN git checkout tags/0.2
---> Using cache
---> 47b23c80e9b2
Step 15/31 : RUN git submodule update --init --recursive
---> Using cache
---> 0f75f1910452
Step 16/31 : RUN apt-get install -y cmake
---> Using cache
---> 64f007a8cc61
Step 17/31 : WORKDIR /root/pdfalto
---> Running in 1a6524069e26
Removing intermediate container 1a6524069e26
---> 41214ff6f6a5
Step 18/31 : RUN cmake -D'ICU_PATH=/root/icu'
---> Running in 064de13d7ade
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for mkstemp
-- Looking for mkstemp - found
-- Looking for mkstemps
-- Looking for mkstemps - found
-- Looking for popen
-- Looking for popen - found
-- Performing Test HAVE_STD_SORT
-- Performing Test HAVE_STD_SORT - Success
-- Looking for fseeko
-- Looking for fseeko - found
-- Looking for fseek64
-- Looking for fseek64 - not found
-- Looking for _fseeki64
-- Looking for _fseeki64 - not found
-- Found FreeType (old-style includes): /usr/lib/x86_64-linux-gnu/libfreetype.so
-- Could NOT find TIFF (missing: TIFF_LIBRARY TIFF_INCLUDE_DIR)
-- lcms2 not found
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Configuring done
-- Generating done
-- Build files have been written to: /root/pdfalto
Removing intermediate container 064de13d7ade
---> 0db206805ae8
Step 19/31 : RUN make
---> Running in ad116718332b
Scanning dependencies of target xpdf
[ 1%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/AcroForm.cc.o
[ 1%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Annot.cc.o
[ 2%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Array.cc.o
[ 3%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/BuiltinFont.cc.o
[ 3%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/BuiltinFontTables.cc.o
[ 4%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Catalog.cc.o
[ 5%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/CharCodeToUnicode.cc.o
[ 5%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/CMap.cc.o
[ 6%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Decrypt.cc.o
[ 7%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Dict.cc.o
[ 8%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/DisplayState.cc.o
[ 8%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Error.cc.o
[ 9%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/FontEncodingTables.cc.o
[ 10%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Form.cc.o
[ 10%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Function.cc.o
[ 11%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Gfx.cc.o
[ 12%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/GfxFont.cc.o
[ 12%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/GfxState.cc.o
[ 13%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/GlobalParams.cc.o
[ 14%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/HTMLGen.cc.o
[ 15%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/JArithmeticDecoder.cc.o
[ 15%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/JBIG2Stream.cc.o
[ 16%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/JPXStream.cc.o
[ 17%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Lexer.cc.o
[ 17%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Link.cc.o
[ 18%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/NameToCharCode.cc.o
[ 19%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Object.cc.o
[ 19%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/OptionalContent.cc.o
[ 20%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Outline.cc.o
[ 21%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/OutputDev.cc.o
[ 22%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Page.cc.o
[ 22%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Parser.cc.o
[ 23%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/PDFCore.cc.o
[ 24%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/PDFDoc.cc.o
[ 24%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/PDFDocEncoding.cc.o
[ 25%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/PSTokenizer.cc.o
[ 26%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/SecurityHandler.cc.o
[ 26%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Stream.cc.o
[ 27%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/TextString.cc.o
[ 28%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/UnicodeMap.cc.o
[ 29%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/UnicodeTypeTable.cc.o
[ 29%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/UTF8.cc.o
[ 30%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/XFAForm.cc.o
[ 31%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/XRef.cc.o
[ 31%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Zoox.cc.o
[ 32%] Linking CXX static library ../build/xpdf/lib/libxpdf.a
[ 32%] Built target xpdf
Scanning dependencies of target zlib
[ 33%] Building C object image/zlib/CMakeFiles/zlib.dir/adler32.c.o
[ 34%] Building C object image/zlib/CMakeFiles/zlib.dir/compress.c.o
[ 35%] Building C object image/zlib/CMakeFiles/zlib.dir/crc32.c.o
[ 35%] Building C object image/zlib/CMakeFiles/zlib.dir/deflate.c.o
[ 36%] Building C object image/zlib/CMakeFiles/zlib.dir/gzio.c.o
[ 37%] Building C object image/zlib/CMakeFiles/zlib.dir/infback.c.o
[ 37%] Building C object image/zlib/CMakeFiles/zlib.dir/inffast.c.o
[ 38%] Building C object image/zlib/CMakeFiles/zlib.dir/inflate.c.o
[ 39%] Building C object image/zlib/CMakeFiles/zlib.dir/inftrees.c.o
[ 39%] Building C object image/zlib/CMakeFiles/zlib.dir/trees.c.o
[ 40%] Building C object image/zlib/CMakeFiles/zlib.dir/uncompr.c.o
[ 41%] Building C object image/zlib/CMakeFiles/zlib.dir/zutil.c.o
[ 42%] Linking C static library libzlib.a
[ 42%] Built target zlib
Scanning dependencies of target png
[ 43%] Building C object image/png/CMakeFiles/png.dir/png.c.o
[ 44%] Building C object image/png/CMakeFiles/png.dir/pngerror.c.o
[ 44%] Building C object image/png/CMakeFiles/png.dir/pnggccrd.c.o
[ 45%] Building C object image/png/CMakeFiles/png.dir/pngget.c.o
[ 46%] Building C object image/png/CMakeFiles/png.dir/pngmem.c.o
[ 46%] Building C object image/png/CMakeFiles/png.dir/pngpread.c.o
[ 47%] Building C object image/png/CMakeFiles/png.dir/pngread.c.o
[ 48%] Building C object image/png/CMakeFiles/png.dir/pngrio.c.o
[ 49%] Building C object image/png/CMakeFiles/png.dir/pngrtran.c.o
[ 49%] Building C object image/png/CMakeFiles/png.dir/pngrutil.c.o
[ 50%] Building C object image/png/CMakeFiles/png.dir/pngset.c.o
[ 51%] Building C object image/png/CMakeFiles/png.dir/pngtrans.c.o
[ 51%] Building C object image/png/CMakeFiles/png.dir/pngvcrd.c.o
[ 52%] Building C object image/png/CMakeFiles/png.dir/pngwio.c.o
[ 53%] Building C object image/png/CMakeFiles/png.dir/pngwrite.c.o
[ 53%] Building C object image/png/CMakeFiles/png.dir/pngwtran.c.o
[ 54%] Building C object image/png/CMakeFiles/png.dir/pngwutil.c.o
[ 55%] Linking C static library libpng.a
[ 55%] Built target png
Scanning dependencies of target goo_objs
[ 56%] Building CXX object xpdf-4.00/goo/CMakeFiles/goo_objs.dir/FixedPoint.cc.o
[ 56%] Building CXX object xpdf-4.00/goo/CMakeFiles/goo_objs.dir/GHash.cc.o
[ 57%] Building CXX object xpdf-4.00/goo/CMakeFiles/goo_objs.dir/GList.cc.o
[ 58%] Building CXX object xpdf-4.00/goo/CMakeFiles/goo_objs.dir/GString.cc.o
[ 59%] Building CXX object xpdf-4.00/goo/CMakeFiles/goo_objs.dir/gfile.cc.o
[ 59%] Building CXX object xpdf-4.00/goo/CMakeFiles/goo_objs.dir/gmem.cc.o
[ 60%] Building CXX object xpdf-4.00/goo/CMakeFiles/goo_objs.dir/gmempp.cc.o
[ 61%] Building C object xpdf-4.00/goo/CMakeFiles/goo_objs.dir/parseargs.c.o
[ 61%] Built target goo_objs
Scanning dependencies of target goo
[ 62%] Linking CXX static library libgoo.a
[ 62%] Built target goo
Scanning dependencies of target fofi_objs
[ 63%] Building CXX object xpdf-4.00/fofi/CMakeFiles/fofi_objs.dir/FoFiBase.cc.o
[ 64%] Building CXX object xpdf-4.00/fofi/CMakeFiles/fofi_objs.dir/FoFiEncodings.cc.o
[ 64%] Building CXX object xpdf-4.00/fofi/CMakeFiles/fofi_objs.dir/FoFiIdentifier.cc.o
[ 65%] Building CXX object xpdf-4.00/fofi/CMakeFiles/fofi_objs.dir/FoFiTrueType.cc.o
[ 66%] Building CXX object xpdf-4.00/fofi/CMakeFiles/fofi_objs.dir/FoFiType1.cc.o
[ 66%] Building CXX object xpdf-4.00/fofi/CMakeFiles/fofi_objs.dir/FoFiType1C.cc.o
[ 66%] Built target fofi_objs
Scanning dependencies of target fofi
[ 66%] Linking CXX static library libfofi.a
[ 66%] Built target fofi
Scanning dependencies of target pdfalto
[ 66%] Building CXX object CMakeFiles/pdfalto.dir/src/AnnotsXrce.cc.o
[ 67%] Building CXX object CMakeFiles/pdfalto.dir/src/ConstantsUtils.cc.o
[ 68%] Building CXX object CMakeFiles/pdfalto.dir/src/ConstantsXMLALTO.cc.o
[ 68%] Building CXX object CMakeFiles/pdfalto.dir/src/Parameters.cc.o
[ 69%] Building CXX object CMakeFiles/pdfalto.dir/src/PDFDocXrce.cc.o
[ 70%] Building CXX object CMakeFiles/pdfalto.dir/src/pdfalto.cc.o
[ 71%] Building CXX object CMakeFiles/pdfalto.dir/src/XmlAltoOutputDev.cc.o
make[2]: *** No rule to make target '/root/icu/lib/libicuuc.a', needed by 'pdfalto'. Stop.
CMakeFiles/Makefile2:71: recipe for target 'CMakeFiles/pdfalto.dir/all' failed
make[1]: *** [CMakeFiles/pdfalto.dir/all] Error 2
make: *** [all] Error 2
Makefile:127: recipe for target 'all' failed
The command '/bin/sh -c make' returned a non-zero code: 2

Problem with coordinates in some PDFs

I am trying to trace the cause of incorrect coordinates for String elements in some PDFs. One example is the attached PubMed Central PDF. Using it with GROBID and the PDF.js document display plus annotations, we see that the bounding boxes for the annotations are not correct (while usually they are!).

The problem apparently comes from pdfalto, but I am not sure whether it is caused by incorrect page dimensions or by an incorrect origin point on the page for the string coordinates.

In the attached PDF, all page dimensions are x:662, y:860. On the first page, the first token "Association" is positioned at x:85, y:126, w:115, h:17.8. The x/y proportion is visually incorrect; x and y should be x:57, y:90 (from PDF.js).

On the second page, the first token "Xia" is positioned at x:71, y:64, w:10, h:7.3; once again the x/y proportion is clearly visually incorrect. It should be x:42, y:21 (from PDF.js).

Looking at XmlAltoOutputDev.cc and TextPage::startPage, page coordinates come from GfxState and the page box, but I saw nothing there that looks related to this :/
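
To make the discrepancy concrete, the figures quoted in this report can be compared directly. The sketch below only restates the reported numbers; the Point struct and the function names are illustrative and are not part of pdfalto.

```cpp
#include <cassert>

// Illustrative only: a token position as quoted in this issue.
struct Point {
    double x;
    double y;
};

// Offset between the position pdfalto reports and the one PDF.js reports.
Point offset(const Point& pdfalto, const Point& pdfjs) {
    return Point{pdfalto.x - pdfjs.x, pdfalto.y - pdfjs.y};
}

// Checks the arithmetic for the two tokens quoted in the report.
void checkReportedOffsets() {
    const Point p1 = offset(Point{85.0, 126.0}, Point{57.0, 90.0});  // "Association", page 1
    const Point p2 = offset(Point{71.0, 64.0}, Point{42.0, 21.0});   // "Xia", page 2
    assert(p1.x == 28.0 && p1.y == 36.0);
    assert(p2.x == 29.0 && p2.y == 43.0);
}
```

The horizontal offset is nearly the same on both pages (28 and 29), while the vertical offset differs (36 and 43). Two data points are not enough to decide between the two hypotheses stated above.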

PMC5348138.pdf

pdfalto ignores first line/string on every page

The first line of every page is missing. I've attached a demo file.

normal.pdf

I compiled pdfalto today from the git repo on a Mac.

As you can see, the first line is missing but the last one is parsed twice.

Regards,

K.

    <Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="595" HEIGHT="842">
      <PrintSpace>
        <TextLine WIDTH="22.569" HEIGHT="10.812" ID="p1_t1" HPOS="71.38" VPOS="104.596">
          <String ID="p1_w1" CONTENT="Bold" HPOS="71.38" VPOS="104.596" WIDTH="22.569" HEIGHT="10.812" STYLEREFS="font0"/>
        </TextLine>
        <TextLine WIDTH="58.0176" HEIGHT="10.812" ID="p1_t2" HPOS="71.38" VPOS="133.876">
          <String ID="p1_w2" CONTENT="Bold" HPOS="71.38" VPOS="133.876" WIDTH="22.3338" HEIGHT="10.812" STYLEREFS="font1"/>
          <SP WIDTH="2.71507" VPOS="133.876" HPOS="93.7138"/>
          <String ID="p1_w3" CONTENT="+" HPOS="96.4288" VPOS="133.876" WIDTH="5.976" HEIGHT="10.812" STYLEREFS="font1"/>
          <SP WIDTH="2.7132" VPOS="133.876" HPOS="102.405"/>
          <String ID="p1_w4" CONTENT="italic" HPOS="105.118" VPOS="133.876" WIDTH="24.2796" HEIGHT="10.812" STYLEREFS="font1"/>
        </TextLine>
        <TextLine WIDTH="58.0176" HEIGHT="10.812" ID="p1_t3" HPOS="71.38" VPOS="133.876">
          <String ID="p1_w5" CONTENT="Bold" HPOS="71.38" VPOS="133.876" WIDTH="22.3338" HEIGHT="10.812" STYLEREFS="font1"/>
          <SP WIDTH="2.71507" VPOS="133.876" HPOS="93.7138"/>
          <String ID="p1_w6" CONTENT="+" HPOS="96.4288" VPOS="133.876" WIDTH="5.976" HEIGHT="10.812" STYLEREFS="font1"/>
          <SP WIDTH="2.7132" VPOS="133.876" HPOS="102.405"/>
          <String ID="p1_w7" CONTENT="italic" HPOS="105.118" VPOS="133.876" WIDTH="24.2796" HEIGHT="10.812" STYLEREFS="font1"/>
        </TextLine>
      </PrintSpace>
    </Page>

Heap buffer overflow in function TextPage::dump

I built pdfalto with Clang 6.0 and AddressSanitizer. This file causes a heap buffer overflow in the function TextPage::dump when executing this command:

./pdfalto heap-buffer-overflow_dump 1.xml

This is the ASAN information:

=================================================================
==1865==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60200047ff5a at pc 0x00000049397b bp 0x7ffd280d7300 sp 0x7ffd280d6ab0
WRITE of size 13 at 0x60200047ff5a thread T0
    #0 0x49397a in vsprintf /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:1572
    #1 0x493ad2 in __interceptor_sprintf /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:1615
    #2 0x587186 in TextPage::dump(int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:5307:13
    #3 0x5a9de7 in XmlAltoOutputDev::endPage() /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7216:19
    #4 0x9a06eb in Gfx::~Gfx() /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:590:10
    #5 0x77cd47 in Page::displaySlice(OutputDev*, double, double, int, int, int, int, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:406:3
    #6 0x77babc in Page::display(OutputDev*, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:321:3
    #7 0x78268e in PDFDoc::displayPage(OutputDev*, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/PDFDoc.cc:386:27
    #8 0x78268e in PDFDoc::displayPages(OutputDev*, int, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/PDFDoc.cc:399
    #9 0x526f9d in PDFDocXrce::displayPages(OutputDev*, _xmlNode*, int, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/src/PDFDocXrce.cc:22:10
    #10 0x529565 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:415:18
    #11 0x7f4ca2c9982f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
    #12 0x41c678 in _start (/home/fouzhe/my_fuzz/pdfalto/pdfalto+0x41c678)

0x60200047ff5a is located 0 bytes to the right of 10-byte region [0x60200047ff50,0x60200047ff5a)
allocated by thread T0 here:
    #0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x5821d5 in TextPage::dump(int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:4706:20
    #2 0x5a9de7 in XmlAltoOutputDev::endPage() /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7216:19

SUMMARY: AddressSanitizer: heap-buffer-overflow /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:1572 in vsprintf
Shadow bytes around the buggy address:
  0x0c0480087f90: fa fa fd fa fa fa fd fd fa fa fd fa fa fa 05 fa
  0x0c0480087fa0: fa fa 06 fa fa fa 05 fa fa fa 07 fa fa fa 05 fa
  0x0c0480087fb0: fa fa 06 fa fa fa 06 fa fa fa 07 fa fa fa 07 fa
  0x0c0480087fc0: fa fa 07 fa fa fa 00 02 fa fa 06 fa fa fa 03 fa
  0x0c0480087fd0: fa fa 06 fa fa fa 00 fa fa fa 05 fa fa fa 07 fa
=>0x0c0480087fe0: fa fa 05 fa fa fa 00 fa fa fa 00[02]fa fa 07 fa
  0x0c0480087ff0: fa fa fd fd fa fa fd fd fa fa fd fa fa fa fd fd
  0x0c0480088000: fa fa fd fd fa fa 04 fa fa fa 00 01 fa fa fd fd
  0x0c0480088010: fa fa fd fa fa fa fd fd fa fa 03 fa fa fa 00 fa
  0x0c0480088020: fa fa fd fd fa fa fd fa fa fa 00 00 fa fa 00 fa
  0x0c0480088030: fa fa fd fa fa fa 00 fa fa fa 02 fa fa fa 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==1865==ABORTING
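
The trace above reports a 13-byte write into a 10-byte heap region through sprintf. One common way to avoid this class of bug is to measure the formatted length first and size the buffer from it. The sketch below assumes a "%g" numeric format (the report does not say which format is used at XmlAltoOutputDev.cc:5307), and formatNumber is a hypothetical helper, not a pdfalto function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a double with "%g" into a std::string whose
// size comes from the measured length, rather than from a fixed guess.
std::string formatNumber(double value) {
    // With a null buffer and size 0, snprintf writes nothing and returns the
    // number of characters the full output needs (excluding the '\0').
    const int needed = std::snprintf(nullptr, 0, "%g", value);
    assert(needed >= 0);
    std::string out(static_cast<std::size_t>(needed), '\0');
    // The string holds needed characters plus its terminator, so the
    // second call cannot write past the end of the storage.
    std::snprintf(&out[0], out.size() + 1, "%g", value);
    return out;
}
```

For instance, formatNumber(-1.23457e-05) yields the 12-character string "-1.23457e-05", which together with its terminator would not fit in a 10-byte buffer.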

CI build

I am opening this so as not to forget.

A CI build would help avoid potential issues due to static library building and portability.

tmp strings too small

In XmlAltoOutputDev.cc, many small strings are allocated with sizes of 10, 20, or 50 characters. In some circumstances a buffer overflow seems to occur (typically when outputting the WIDTH attribute), and the strings then include garbage. I suggest using snprintf, using larger sizes, or limiting the %g/%d format.
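
A minimal sketch of the snprintf option, assuming the fixed-size buffers are kept. formatBounded is a hypothetical name, and "%g" is used only because it is the format mentioned in this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: writes value into buf using "%g" and never writes
// more than size bytes. Returns true if the whole number fit, false if the
// output was truncated (buf is still null-terminated in that case).
bool formatBounded(char* buf, std::size_t size, double value) {
    assert(buf != nullptr && size > 0);
    const int written = std::snprintf(buf, size, "%g", value);
    return written >= 0 && static_cast<std::size_t>(written) < size;
}
```

A caller can then treat a false return as a signal to retry with a larger buffer instead of emitting a corrupted attribute value.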

Memory corruption

For some PDF files an error is thrown concerning memory corruption:
*** Error in `pdfalto/pdfalto': malloc(): memory corruption (fast): 0x000000000193db80 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fc3bebeb7e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x82651)[0x7fc3bebf6651]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7fc3bebf8184]
/usr/lib/x86_64-linux-gnu/libxml2.so.2(xmlStrdup+0x42)[0x7fc3bf8b36a2]
/usr/lib/x86_64-linux-gnu/libxml2.so.2(+0x5f8e8)[0x7fc3bf83e8e8]
pdfalto/pdfalto[0x415add]
pdfalto/pdfalto[0x40722c]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fc3beb94830]
pdfalto/pdfalto[0x403829]
======= Memory map: ========
00400000-00549000 r-xp 00000000 fd:01 7357305 /root/pdfalto/pdfalto
00749000-0074a000 r--p 00149000 fd:01 7357305 /root/pdfalto/pdfalto
0074a000-0078a000 rw-p 0014a000 fd:01 7357305 /root/pdfalto/pdfalto
018e6000-02867000 rw-p 00000000 00:00 0 [heap]
7fc3b8000000-7fc3b8021000 rw-p 00000000 00:00 0
7fc3b8021000-7fc3bc000000 ---p 00000000 00:00 0
7fc3bc4ce000-7fc3bdd84000 r-xp 00000000 fd:01 7358735 /usr/lib/x86_64-linux-gnu/libicudata.so.55.1
7fc3bdd84000-7fc3bdf83000 ---p 018b6000 fd:01 7358735 /usr/lib/x86_64-linux-gnu/libicudata.so.55.1
7fc3bdf83000-7fc3bdf84000 r--p 018b5000 fd:01 7358735 /usr/lib/x86_64-linux-gnu/libicudata.so.55.1
7fc3bdf84000-7fc3bdf85000 rw-p 018b6000 fd:01 7358735 /usr/lib/x86_64-linux-gnu/libicudata.so.55.1
7fc3bdf85000-7fc3bdfa6000 r-xp 00000000 fd:01 8374952 /lib/x86_64-linux-gnu/liblzma.so.5.0.0
7fc3bdfa6000-7fc3be1a5000 ---p 00021000 fd:01 8374952 /lib/x86_64-linux-gnu/liblzma.so.5.0.0
7fc3be1a5000-7fc3be1a6000 r--p 00020000 fd:01 8374952 /lib/x86_64-linux-gnu/liblzma.so.5.0.0
7fc3be1a6000-7fc3be1a7000 rw-p 00021000 fd:01 8374952 /lib/x86_64-linux-gnu/liblzma.so.5.0.0
7fc3be1a7000-7fc3be1c0000 r-xp 00000000 fd:01 8375020 /lib/x86_64-linux-gnu/libz.so.1.2.8
7fc3be1c0000-7fc3be3bf000 ---p 00019000 fd:01 8375020 /lib/x86_64-linux-gnu/libz.so.1.2.8
7fc3be3bf000-7fc3be3c0000 r--p 00018000 fd:01 8375020 /lib/x86_64-linux-gnu/libz.so.1.2.8
7fc3be3c0000-7fc3be3c1000 rw-p 00019000 fd:01 8375020 /lib/x86_64-linux-gnu/libz.so.1.2.8
7fc3be3c1000-7fc3be540000 r-xp 00000000 fd:01 7358763 /usr/lib/x86_64-linux-gnu/libicuuc.so.55.1
7fc3be540000-7fc3be740000 ---p 0017f000 fd:01 7358763 /usr/lib/x86_64-linux-gnu/libicuuc.so.55.1
7fc3be740000-7fc3be750000 r--p 0017f000 fd:01 7358763 /usr/lib/x86_64-linux-gnu/libicuuc.so.55.1
7fc3be750000-7fc3be751000 rw-p 0018f000 fd:01 7358763 /usr/lib/x86_64-linux-gnu/libicuuc.so.55.1
7fc3be751000-7fc3be755000 rw-p 00000000 00:00 0
7fc3be755000-7fc3be758000 r-xp 00000000 fd:01 8374934 /lib/x86_64-linux-gnu/libdl-2.23.so
7fc3be758000-7fc3be957000 ---p 00003000 fd:01 8374934 /lib/x86_64-linux-gnu/libdl-2.23.so
7fc3be957000-7fc3be958000 r--p 00002000 fd:01 8374934 /lib/x86_64-linux-gnu/libdl-2.23.so
7fc3be958000-7fc3be959000 rw-p 00003000 fd:01 8374934 /lib/x86_64-linux-gnu/libdl-2.23.so
7fc3be959000-7fc3be972000 r-xp 00000000 fd:01 7357217 /root/pdfalto/image/zlib/libzlib.so
7fc3be972000-7fc3beb72000 ---p 00019000 fd:01 7357217 /root/pdfalto/image/zlib/libzlib.so
7fc3beb72000-7fc3beb73000 r--p 00019000 fd:01 7357217 /root/pdfalto/image/zlib/libzlib.so
7fc3beb73000-7fc3beb74000 rw-p 0001a000 fd:01 7357217 /root/pdfalto/image/zlib/libzlib.so
7fc3beb74000-7fc3bed34000 r-xp 00000000 fd:01 8374921 /lib/x86_64-linux-gnu/libc-2.23.so
7fc3bed34000-7fc3bef34000 ---p 001c0000 fd:01 8374921 /lib/x86_64-linux-gnu/libc-2.23.so
7fc3bef34000-7fc3bef38000 r--p 001c0000 fd:01 8374921 /lib/x86_64-linux-gnu/libc-2.23.so
7fc3bef38000-7fc3bef3a000 rw-p 001c4000 fd:01 8374921 /lib/x86_64-linux-gnu/libc-2.23.so
7fc3bef3a000-7fc3bef3e000 rw-p 00000000 00:00 0
7fc3bef3e000-7fc3bef54000 r-xp 00000000 fd:01 8374942 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fc3bef54000-7fc3bf153000 ---p 00016000 fd:01 8374942 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fc3bf153000-7fc3bf154000 rw-p 00015000 fd:01 8374942 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fc3bf154000-7fc3bf25c000 r-xp 00000000 fd:01 8374953 /lib/x86_64-linux-gnu/libm-2.23.so
7fc3bf25c000-7fc3bf45b000 ---p 00108000 fd:01 8374953 /lib/x86_64-linux-gnu/libm-2.23.so
7fc3bf45b000-7fc3bf45c000 r--p 00107000 fd:01 8374953 /lib/x86_64-linux-gnu/libm-2.23.so
7fc3bf45c000-7fc3bf45d000 rw-p 00108000 fd:01 8374953 /lib/x86_64-linux-gnu/libm-2.23.so
7fc3bf45d000-7fc3bf5cf000 r-xp 00000000 fd:01 8375809 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21
7fc3bf5cf000-7fc3bf7cf000 ---p 00172000 fd:01 8375809 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21
7fc3bf7cf000-7fc3bf7d9000 r--p 00172000 fd:01 8375809 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21
7fc3bf7d9000-7fc3bf7db000 rw-p 0017c000 fd:01 8375809 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21
7fc3bf7db000-7fc3bf7df000 rw-p 00000000 00:00 0
7fc3bf7df000-7fc3bf990000 r-xp 00000000 fd:01 7358767 /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.3
7fc3bf990000-7fc3bfb8f000 ---p 001b1000 fd:01 7358767 /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.3
7fc3bfb8f000-7fc3bfb97000 r--p 001b0000 fd:01 7358767 /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.3
7fc3bfb97000-7fc3bfb99000 rw-p 001b8000 fd:01 7358767 /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.3
7fc3bfb99000-7fc3bfb9a000 rw-p 00000000 00:00 0
7fc3bfb9a000-7fc3bfbd0000 r-xp 00000000 fd:01 7346757 /root/pdfalto/image/png/libpng.so
7fc3bfbd0000-7fc3bfdd0000 ---p 00036000 fd:01 7346757 /root/pdfalto/image/png/libpng.so
7fc3bfdd0000-7fc3bfdd1000 r--p 00036000 fd:01 7346757 /root/pdfalto/image/png/libpng.so
7fc3bfdd1000-7fc3bfdd2000 rw-p 00037000 fd:01 7346757 /root/pdfalto/image/png/libpng.so
7fc3bfdd2000-7fc3bfdf8000 r-xp 00000000 fd:01 8374901 /lib/x86_64-linux-gnu/ld-2.23.so
7fc3bffa6000-7fc3bfff0000 rw-p 00000000 00:00 0
7fc3bfff5000-7fc3bfff7000 rw-p 00000000 00:00 0
7fc3bfff7000-7fc3bfff8000 r--p 00025000 fd:01 8374901 /lib/x86_64-linux-gnu/ld-2.23.so
7fc3bfff8000-7fc3bfff9000 rw-p 00026000 fd:01 8374901 /lib/x86_64-linux-gnu/ld-2.23.so
7fc3bfff9000-7fc3bfffa000 rw-p 00000000 00:00 0
7fffb26bb000-7fffb26dd000 rw-p 00000000 00:00 0 [stack]
7fffb27ed000-7fffb27ef000 r--p 00000000 00:00 0 [vvar]
7fffb27ef000-7fffb27f1000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

C++11 required by libxml2 + icu 61?

Hi!

So I opened kermitt2/xpdf-4.00#1 because I was trying to build pdfalto.

After updating the xpdf-4.00 submodule in this repo to have the xpdf-qt target name change (kermitt2/xpdf-4.00@61eb5e4), and the patch for kermitt2/xpdf-4.00#1, on my system this build fails with:

/usr/include/unicode/umachine.h:351:13: error: ‘char16_t’ does not name a type; did you mean ‘wchar_t’?
     typedef char16_t UChar;
             ^~~~~~~~
             wchar_t
In file included from /usr/include/unicode/utypes.h:39:0,
                 from /usr/include/unicode/ucnv_err.h:88,
                 from /usr/include/unicode/ucnv.h:52,
                 from /usr/include/libxml2/libxml/encoding.h:31,
                 from /usr/include/libxml2/libxml/parser.h:810,
                 from /usr/include/libxml2/libxml/globals.h:18,
                 from /usr/include/libxml2/libxml/threads.h:35,
                 from /usr/include/libxml2/libxml/xmlmemory.h:218,
                 from /home/sl/Code/Infuse/pdfalto/src/AnnotsXrce.h:14,
                 from /home/sl/Code/Infuse/pdfalto/src/AnnotsXrce.cc:1:

(and then more errors such as ‘UChar’ does not name a type which come from the above.)

Requiring C++11 fixes the error for me:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index d9dc061..460a72e 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -1,7 +1,7 @@
 cmake_minimum_required(VERSION 3.5+)
 project(pdfalto)
 
-set(CMAKE_CXX_STANDARD 98)
+set(CMAKE_CXX_STANDARD 11)
 
 #build xpdf
 set ( XPDF_SUBDIR ${CMAKE_CURRENT_SOURCE_DIR}/xpdf-4.00)

I have icu 61.1-1 installed (the package that owns /usr/include/unicode/utypes.h); could that version be the reason for this? Is this an issue you have seen?
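
For context, char16_t is a built-in type only from C++11 onward, which is why the typedef in umachine.h is rejected when the compiler runs in C++98 mode. The sketch below mirrors that typedef in isolation; firstCodeUnit is a hypothetical function added only to exercise the type.

```cpp
#include <cassert>

// Same shape as the line quoted from umachine.h. Compiling this with
// -std=c++98 fails with "'char16_t' does not name a type"; with
// -std=c++11 or later it compiles.
typedef char16_t UChar;

// Hypothetical helper: returns the first UTF-16 code unit of a string.
UChar firstCodeUnit(const UChar* text) {
    assert(text != nullptr);
    return text[0];
}
```

Called as firstCodeUnit(u"pdf"), it returns u'p'; the u"" literal prefix is itself a C++11 feature.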


The next step for me is then the same as for kermitt2/xpdf-4.00#1, because I have libpaper present. I have to patch CMakeLists.txt with:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index d9dc061..460a72e 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -30,7 +30,7 @@ set(SOURCE_FILES
 
 add_executable(pdfalto ${SOURCE_FILES})
 
-target_link_libraries(pdfalto png zlib xml2 xpdf goo fofi ${XPDF_BUILD_DIR}/xpdf/lib/libxpdf.a)
+target_link_libraries(pdfalto png zlib xml2 xpdf goo fofi ${XPDF_BUILD_DIR}/xpdf/lib/libxpdf.a ${HAVE_PAPER_H})
 target_include_directories(pdfalto
         PUBLIC /usr/include/libxml2
         PUBLIC image/png

so that CMake knows about libpaper, and the build succeeds. I can open a PR with these changes too if you'd like, just let me know!

Segmentation Fault (Core Dumped)

When I try to start pdfalto, I get a segmentation fault (Core dumped)

OS: Ubuntu 18 LTS
gcc 7.3.0
libxml2-dev: 2.9.4+dfsg1-6.1ubuntu1

Compiling works, with warnings.
It generates a *_metadata.xml file, but then stops with a segmentation fault.

I followed the compilation steps from GitHub.
Has anybody been able to compile it under Ubuntu 18?

Produce human-friendly output

This is a suggestion from the user @dlaurie:
line breaks except where they would be significant (pdftohtml -xml did that), and elision of unnecessary attributes, e.g. rotation=0, angle=0.
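
One way the attribute elision could look, as a sketch only: appendUnlessDefault is a hypothetical helper, not existing pdfalto code, and it does no XML escaping.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: appends name="value" to an XML tag under
// construction unless the value equals the default for that attribute.
void appendUnlessDefault(std::string& tag, const std::string& name,
                         const std::string& value,
                         const std::string& defaultValue) {
    assert(!name.empty());
    if (value == defaultValue) {
        return;  // skip attributes that carry no information
    }
    tag += " " + name + "=\"" + value + "\"";
}
```

With a default of "0", a rotation of "0" is dropped from the tag while an angle of "90" is kept.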
