kermitt2 / pdfalto
PDF to XML ALTO file converter
License: GNU General Public License v2.0
Content mine gathers a list of known problematic fonts and maps their characters to the correct Unicode code points (e.g. l -> λ)
$ make
...
[ 15%] Linking CXX executable pdfalto
/usr/bin/ld: libs/image/png/linux/libpng.a(png.c.o): relocation R_X86_64_32 against `.rodata' can not be used when making a PIE object; recompile with -fPIC
....
$ cc -v
Using built-in specs.
COLLECT_GCC=cc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/8.2.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --enable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp --enable-cet=auto
Thread model: posix
gcc version 8.2.1 20181127 (GCC)
Any ideas?
Thanks,
Rytis
I built pdfalto with Clang 6.0 and AddressSanitizer. This file causes an FPE in ImageStream::ImageStream (Stream.cc) when executing this command:
./pdfalto FPE_ImageStream 1.xml
This is the ASAN information:
AddressSanitizer:DEADLYSIGNAL
=================================================================
==4985==ERROR: AddressSanitizer: FPE on unknown address 0x00000079252d (pc 0x00000079252d bp 0x0c0c000006ae sp 0x7ffde533a9d0 T0)
#0 0x79252c in ImageStream::ImageStream(Stream*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Stream.cc:359:23
#1 0x5969bc in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6427:43
#2 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28
#3 0x5ae52f in XmlAltoOutputDev::drawSoftMaskedImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, Stream*, int, int, GfxImageColorMap*, double*, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7460:5
#4 0x9d94cd in Gfx::doImage(Object*, Stream*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:4447:7
#5 0x9709a5 in Gfx::opXObject(Object*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:3980:2
#6 0x9a6668 in Gfx::execOp(Object*, Object*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:826:3
#7 0x9a42b1 in Gfx::go(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:719:12
#8 0x9a1d1b in Gfx::display(Object*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:641:3
#9 0x77c466 in Page::displaySlice(OutputDev*, double, double, int, int, int, int, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:373:10
#10 0x77babc in Page::display(OutputDev*, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:321:3
#11 0x78268e in PDFDoc::displayPage(OutputDev*, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/PDFDoc.cc:386:27
#12 0x78268e in PDFDoc::displayPages(OutputDev*, int, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/PDFDoc.cc:399
#13 0x526f9d in PDFDocXrce::displayPages(OutputDev*, _xmlNode*, int, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/src/PDFDocXrce.cc:22:10
#14 0x529565 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:415:18
#15 0x7f7dc0f1382f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
#16 0x41c678 in _start (/home/fouzhe/my_fuzz/pdfalto/pdfalto+0x41c678)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: FPE /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Stream.cc:359:23 in ImageStream::ImageStream(Stream*, int, int, int)
==4985==ABORTING
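For context, an FPE reported by AddressSanitizer on x86-64 usually points to an integer division or modulo by zero (or INT_MIN / -1), not to floating-point arithmetic. The sketch below is not the xpdf code. It only illustrates one way to validate image parameters read from an untrusted PDF before using them in size arithmetic. The helper name `checkedLineSize` and its parameter names are hypothetical.

```cpp
#include <climits>
#include <optional>

// Hypothetical helper, not part of xpdf or pdfalto: computes the number of
// bytes in one image row, or std::nullopt when the parameters are unusable.
std::optional<int> checkedLineSize(int width, int nComps, int nBits) {
    // Reject zero or negative values up front so that no later division
    // or modulo can see a zero divisor.
    if (width <= 0 || nComps <= 0 || nBits <= 0) {
        return std::nullopt;
    }
    // Guard width * nComps against int overflow.
    if (width > INT_MAX / nComps) {
        return std::nullopt;
    }
    const int nVals = width * nComps;
    // Guard nVals * nBits + 7 against int overflow.
    if (nVals > (INT_MAX - 7) / nBits) {
        return std::nullopt;
    }
    // Round the bit count up to whole bytes.
    return (nVals * nBits + 7) / 8;
}
```

A caller would skip or report the image when the function returns no value instead of constructing a stream from it. This requires C++17 for `std::optional`.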
In file included from /opt/src/pdfalto/xpdf-4.00/xpdf/GlobalParams.cc:64:
/opt/src/pdfalto/xpdf-4.00/xpdf/UnicodeToUnicodeFontRules.h:25:1: sorry, unimplemented: non-trivial designated initializers not supported
};
$ cc -v
Using built-in specs.
COLLECT_GCC=cc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/8.2.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --enable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp --enable-cet=auto
Thread model: posix
gcc version 8.2.1 20181127 (GCC)
Any ideas?
Thanks,
Rytis
I built pdfalto with Clang 6.0 and AddressSanitizer. This file causes a SEGV in TextPage::restoreState (XmlAltoOutputDev.cc) when executing this command:
./pdfalto SEGV_restoreState 1.xml
This is the ASAN information:
AddressSanitizer:DEADLYSIGNAL
=================================================================
==13300==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x0000005addf7 bp 0x0c2c000001c7 sp 0x7fff8c9133e0 T0)
==13300==The signal is caused by a READ memory access.
==13300==Hint: address points to the zero page.
#0 0x5addf6 in TextPage::restoreState(GfxState*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:5763:21
#1 0x5addf6 in XmlAltoOutputDev::restoreState(GfxState*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7414
#2 0x9a6668 in Gfx::execOp(Object*, Object*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:826:3
#3 0x9a42b1 in Gfx::go(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:719:12
#4 0x9a1d1b in Gfx::display(Object*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:641:3
#5 0x77c466 in Page::displaySlice(OutputDev*, double, double, int, int, int, int, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:373:10
#6 0x77babc in Page::display(OutputDev*, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:321:3
#7 0x78268e in PDFDoc::displayPage(OutputDev*, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/PDFDoc.cc:386:27
#8 0x78268e in PDFDoc::displayPages(OutputDev*, int, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/PDFDoc.cc:399
#9 0x526f9d in PDFDocXrce::displayPages(OutputDev*, _xmlNode*, int, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/src/PDFDocXrce.cc:22:10
#10 0x529565 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:415:18
#11 0x7f5e9a57182f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
#12 0x41c678 in _start (/home/fouzhe/my_fuzz/pdfalto/pdfalto+0x41c678)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:5763:21 in TextPage::restoreState(GfxState*)
==13300==ABORTING
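The report shows a read from address zero inside TextPage::restoreState. One plausible cause, which is an assumption and not something the trace confirms, is a content stream that issues a restore without a matching save, so there is nothing to pop. The sketch below shows a defensive save/restore stack that treats an unbalanced restore as a no-op. `State` and `StateStack` are hypothetical types, not pdfalto classes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for whatever per-state data an output device tracks.
struct State {
    int clipDepth = 0;
};

// Hypothetical save/restore stack that tolerates unbalanced restores.
class StateStack {
public:
    void save(const State& s) { saved_.push_back(s); }

    // Pops the most recently saved state into `out`. Returns false and
    // leaves `out` untouched when there is nothing to restore.
    bool restore(State& out) {
        if (saved_.empty()) {
            return false;
        }
        out = saved_.back();
        saved_.pop_back();
        return true;
    }

    std::size_t depth() const { return saved_.size(); }

private:
    std::vector<State> saved_;
};
```

Checking for the empty case before dereferencing keeps a malformed input from crashing the converter. Whether to ignore the extra restore or to report it is a design choice left open here.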
The first line is always missing. I've attached the demo file.
I compiled pdfalto today from the git repo on a Mac.
As you can see, the first line is absent and the last line is output twice.
Regards,
<Page ID="Page1" PHYSICAL_IMG_NR="1" WIDTH="595" HEIGHT="842">
<PrintSpace>
<TextLine WIDTH="22.569" HEIGHT="10.812" ID="p1_t1" HPOS="71.38" VPOS="104.596">
<String ID="p1_w1" CONTENT="Bold" HPOS="71.38" VPOS="104.596" WIDTH="22.569" HEIGHT="10.812" STYLEREFS="font0"/>
</TextLine>
<TextLine WIDTH="58.0176" HEIGHT="10.812" ID="p1_t2" HPOS="71.38" VPOS="133.876">
<String ID="p1_w2" CONTENT="Bold" HPOS="71.38" VPOS="133.876" WIDTH="22.3338" HEIGHT="10.812" STYLEREFS="font1"/>
<SP WIDTH="2.71507" VPOS="133.876" HPOS="93.7138"/>
<String ID="p1_w3" CONTENT="+" HPOS="96.4288" VPOS="133.876" WIDTH="5.976" HEIGHT="10.812" STYLEREFS="font1"/>
<SP WIDTH="2.7132" VPOS="133.876" HPOS="102.405"/>
<String ID="p1_w4" CONTENT="italic" HPOS="105.118" VPOS="133.876" WIDTH="24.2796" HEIGHT="10.812" STYLEREFS="font1"/>
</TextLine>
<TextLine WIDTH="58.0176" HEIGHT="10.812" ID="p1_t3" HPOS="71.38" VPOS="133.876">
<String ID="p1_w5" CONTENT="Bold" HPOS="71.38" VPOS="133.876" WIDTH="22.3338" HEIGHT="10.812" STYLEREFS="font1"/>
<SP WIDTH="2.71507" VPOS="133.876" HPOS="93.7138"/>
<String ID="p1_w6" CONTENT="+" HPOS="96.4288" VPOS="133.876" WIDTH="5.976" HEIGHT="10.812" STYLEREFS="font1"/>
<SP WIDTH="2.7132" VPOS="133.876" HPOS="102.405"/>
<String ID="p1_w7" CONTENT="italic" HPOS="105.118" VPOS="133.876" WIDTH="24.2796" HEIGHT="10.812" STYLEREFS="font1"/>
</TextLine>
</PrintSpace>
</Page>
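In the output above, the TextLine elements p1_t2 and p1_t3 have identical HPOS, VPOS, WIDTH and HEIGHT values. A quick way to flag this symptom in generated files is to compare the geometry of consecutive lines. The sketch below assumes the attributes have already been parsed into a hypothetical `LineBox` struct. It is a diagnostic aid only, not a fix for the underlying bug.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for the geometry attributes of one TextLine.
struct LineBox {
    std::string id;
    double hpos;
    double vpos;
    double width;
    double height;
};

// True when two boxes match on all four geometry values within `eps`.
bool sameBox(const LineBox& a, const LineBox& b, double eps = 1e-6) {
    return std::fabs(a.hpos - b.hpos) < eps &&
           std::fabs(a.vpos - b.vpos) < eps &&
           std::fabs(a.width - b.width) < eps &&
           std::fabs(a.height - b.height) < eps;
}

// Returns the IDs of lines whose box repeats that of the preceding line.
std::vector<std::string> findRepeatedLines(const std::vector<LineBox>& lines) {
    std::vector<std::string> repeated;
    for (std::size_t i = 1; i < lines.size(); ++i) {
        if (sameBox(lines[i - 1], lines[i])) {
            repeated.push_back(lines[i].id);
        }
    }
    return repeated;
}
```

Run on the three lines of the page above, this reports p1_t3 as a repeat of p1_t2.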
Here are some examples where the generated XML is not well-formed: invalid characters in attribute values.
Uploading cphc0012-0609.pdf…
Uploading hel0015-0279.pdf…
Uploading ljii31-131.pdf…
If I use this contract to test pdfalto, I get a list of invalid character codes. When I open the PDF and copy-paste the text, the text does not include these (only quotes).
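For reference, XML 1.0 allows only the code points U+0009, U+000A, U+000D, U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF. Anything else, including most control characters, makes the document not well-formed even when written as a character reference. The sketch below shows one possible filter applied to decoded text before it is written to an attribute. The function names are hypothetical, and replacing with U+FFFD is only one option (dropping the character is another). It does not reflect how pdfalto handles this.

```cpp
#include <string>

// True if the code point is allowed by the XML 1.0 "Char" production.
bool isValidXmlChar(char32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Returns a copy of `text` with disallowed code points replaced by U+FFFD.
std::u32string sanitizeForXml(const std::u32string& text) {
    std::u32string out;
    out.reserve(text.size());
    for (char32_t cp : text) {
        out.push_back(isValidXmlChar(cp) ? cp : U'\uFFFD');
    }
    return out;
}
```

The usual escaping of characters such as quotes and ampersands in attribute values is a separate step and is still needed after this filter.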
At present pdfalto exports images as <VECTORIALIMAGES>, which can't be rendered by a browser. <svg> can be so rendered.
Here are some failing PDF files from ISTEX (after processing 1000 random files), which were not failing with the latest pdf2xml:
D3B2DA15EBD9A692BF1EF4D32606F95A72D5D381
5A09169C31467704EBB453123479708334DDAF35
45BCDD6CD0ECF1D7C6B9169E999C63BFF30DB501
7528880E3DCB09F09E214AFEC57C3A4FCEA15905
774EE3CD645A861B5F5184F96F80A837412887FA
F8E939EACBC26F4F39B309AAAC7ABB8FC8A86C59
864EFF775D7F56E7223EAD95801A6A07ACD8CF71
0ACFDDBB83BF9A5ABAD34686AC4C8CE9317BDB2E
122C63850FE715C35B2B7A5FF376E484E5627C75
7F6B6DE03BEA6EAE36896867F88670B0E62F6EAA
86EDACFC946D09D3F2F7703448EA7E2544CC9AE4
7EE3BDA171BB275B860191E918289EC4F1289566
ED9964BA3659F48C2E9227DE095836E47845B509
83FAB54C06DCFB813657DCCAFF3C66AD67CC95FE
38BD0E2B812BB737321F7D9E76AF0E95A9593E3D
DD522EB94B2A865F00B4DDFC345780B119579DD7
5EA4AAF2C4674DC7C8AFC750AF5320C0F7489FCE
DD338AD05CEA42CF737A187344B1337385CC6FFB
052DFBD14E0015CA914E28A0A561675D36FFA2CC
C3D11DEE82F3403336BE55E9F94DB6E9A6343E1B
The following 3 fail with both pdfalto and pdf2xml, if I am not wrong:
2DE3AF6CC5E90F16E64866D3784DC06B33705360
5E00837DC8C8EF0C9B4D16603261C993135EBCD5
8AD9F55CF0BC915BB6B448B1B536D3A7DC08239D
To get them: https://api.istex.fr/document/*ISTEXID*/fulltext/pdf
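As a small convenience, the URL pattern above can be filled in programmatically when fetching many of these files. The helper name `istexPdfUrl` is hypothetical. It only substitutes the identifier into the quoted pattern and does not validate it or perform the download.

```cpp
#include <string>

// Builds the full-text PDF URL for one ISTEX identifier, following the
// pattern quoted above. The identifier is inserted as given.
std::string istexPdfUrl(const std::string& istexId) {
    return "https://api.istex.fr/document/" + istexId + "/fulltext/pdf";
}
```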
a very slight difference
the ; at the beginning of the sentence
I built pdfalto with Clang 6.0 and AddressSanitizer. This file causes memory leaks when executing this command:
./pdfalto detected_memory_leaks 1.xml
This is the ASAN information:
=================================================================
==12842==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 104340 byte(s) in 1 object(s) allocated from:
#0 0x5184e8 in operator new[](unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:95
#1 0x596941 in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6423:35
#2 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28
Direct leak of 104340 byte(s) in 1 object(s) allocated from:
#0 0x5184e8 in operator new[](unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:95
#1 0x596941 in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6423:35
#2 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28
Direct leak of 14000 byte(s) in 14 object(s) allocated from:
#0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
#1 0x7fdeff25d7f6 in xmlEncodeEntitiesInternal /home/fouzhe/my_fuzz/libxml2/entities.c:576
Direct leak of 120 byte(s) in 1 object(s) allocated from:
#0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
#1 0x7fdeff282dc0 in xmlNewNode__internal_alias /home/fouzhe/my_fuzz/libxml2/tree.c:2239
Direct leak of 84 byte(s) in 1 object(s) allocated from:
#0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
#1 0xb378a8 in gmalloc /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/gmem.cc:140:13
#2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
#3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 48 byte(s) in 3 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x52a5e1 in removeAlreadyExistingData(GString*) /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:464:20
#2 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 48 byte(s) in 1 object(s) allocated from:
#0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
#1 0x7fdeff281d6b in xmlNewNs__internal_alias /home/fouzhe/my_fuzz/libxml2/tree.c:757
Direct leak of 24 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x59e8d1 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6721:26
#2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
#3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 24 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x54021f in TextPage::TextPage(int, Catalog*, _xmlNode*, GString*, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:1508:17
#2 0x59fbc1 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6852:16
#3 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 24 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x544254 in TextPage::startPage(int, GfxState*, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:1567:13
#2 0x5a992b in XmlAltoOutputDev::startPage(int, GfxState*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7200:15
Direct leak of 24 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x59b645 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6663:19
#2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
#3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x59e7e4 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6708:24
#2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
#3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x77b765 in Page::getLinks() /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:311:11
#2 0x544f81 in TextPage::startPage(int, GfxState*, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:1732:30
#3 0x5a992b in XmlAltoOutputDev::startPage(int, GfxState*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7200:15
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x5a5597 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7079:18
#2 0x5a49a8 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6938:19
#3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x59e681 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6692:18
#2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
#3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x5a55b1 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7080:19
#2 0x5a4c24 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6954:19
#3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x540b34 in TextPage::TextPage(int, Catalog*, _xmlNode*, GString*, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:1552:23
#2 0x59fbc1 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6852:16
#3 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x5a55b1 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7080:19
#2 0x5a490c in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6934:19
#3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x5a55b1 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7080:19
#2 0x5a4b85 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6950:19
#3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x540abe in TextPage::TextPage(int, Catalog*, _xmlNode*, GString*, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:1551:23
#2 0x59fbc1 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6852:16
#3 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x5a5597 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7079:18
#2 0x5a4b85 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6950:19
#3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x5a5597 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7079:18
#2 0x5a4c24 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6954:19
#3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x595d0d in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6259:24
#2 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x5a63db in XmlAltoOutputDev::getInfoDate(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7122:19
#2 0x5a4c8a in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6958:19
#3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x5a63db in XmlAltoOutputDev::getInfoDate(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7122:19
#2 0x5a4cf0 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6962:19
#3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x59e5cd in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6687:18
#2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
#3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x529331 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:395:23
#2 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x5a55b1 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7080:19
#2 0x5a4ae5 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6946:19
#3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x52902e in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:328:29
#2 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x595cac in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6255:24
#2 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x5a5597 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7079:18
#2 0x5a4ae5 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6946:19
#3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x597f8f in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6497:48
#2 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x5a55b1 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7080:19
#2 0x5a4a44 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6942:19
#3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x5a31e6 in XmlAltoOutputDev::toUnicode(GString*, UnicodeMap*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7188:12
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x597f75 in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6497:23
#2 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x597faa in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6497:78
#2 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x595cac in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6255:24
#2 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0xb23b8c in GString::fromInt(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/GString.cc:186:10
#2 0x595cf3 in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6257:21
#3 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0xb23b8c in GString::fromInt(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/GString.cc:186:10
#2 0x595d4b in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6261:21
#3 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0xb23b8c in GString::fromInt(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/GString.cc:186:10
#2 0x595d4b in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6261:21
#3 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x597f75 in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6497:23
#2 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x597f8f in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6497:48
#2 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x597faa in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6497:78
#2 0x5af0b2 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7547:28
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0xb23b8c in GString::fromInt(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/GString.cc:186:10
#2 0x547aff in TextPage::endPage(GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:1785:25
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0xb23b8c in GString::fromInt(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/GString.cc:186:10
#2 0x547bff in TextPage::endPage(GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:1790:25
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x5a5597 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7079:18
#2 0x5a4a44 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6942:19
#3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x595d0d in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6259:24
#2 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0xb23b8c in GString::fromInt(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/GString.cc:186:10
#2 0x595cf3 in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6257:21
#3 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x5a55b1 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7080:19
#2 0x5a49a8 in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6938:19
#3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x59f1bd in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6783:13
#2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
#3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x59e604 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6689:15
#2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
#3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x5a5597 in XmlAltoOutputDev::getInfoString(Dict*, char const*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7079:18
#2 0x5a490c in XmlAltoOutputDev::addMetadataInfo(PDFDocXrce*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6934:19
#3 0x5292dc in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:388:17
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x518338 in operator new(unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:92
#1 0x5a750f in XmlAltoOutputDev::closeMetadataInfoDoc(GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6970:33
#2 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Indirect leak of 424 byte(s) in 33 object(s) allocated from:
#0 0x5184e8 in operator new[](unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:95
#1 0xb34fb9 in GString::resize(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/GString.cc:119:9
Indirect leak of 336 byte(s) in 11 object(s) allocated from:
#0 0x5184e8 in operator new[](unsigned long) /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:95
#1 0xb34eba in GString::resize(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/GString.cc:121:10
Indirect leak of 176 byte(s) in 1 object(s) allocated from:
#0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
#1 0x7fdeff282754 in xmlNewDoc__internal_alias /home/fouzhe/my_fuzz/libxml2/tree.c:1171
Indirect leak of 120 byte(s) in 1 object(s) allocated from:
#0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
#1 0x7fdeff283006 in xmlNewText__internal_alias /home/fouzhe/my_fuzz/libxml2/tree.c:2445
Indirect leak of 120 byte(s) in 1 object(s) allocated from:
#0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
#1 0x7fdeff282dc0 in xmlNewNode__internal_alias /home/fouzhe/my_fuzz/libxml2/tree.c:2239
Indirect leak of 120 byte(s) in 2 object(s) allocated from:
#0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
#1 0xb378a8 in gmalloc /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/gmem.cc:140:13
#2 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
#3 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Indirect leak of 96 byte(s) in 1 object(s) allocated from:
#0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
#1 0x7fdeff283146 in xmlNewPropInternal /home/fouzhe/my_fuzz/libxml2/tree.c:1855
Indirect leak of 87 byte(s) in 8 object(s) allocated from:
#0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
#1 0x7fdeff2d8518 in xmlStrndup__internal_alias /home/fouzhe/my_fuzz/libxml2/xmlstring.c:45
Indirect leak of 64 byte(s) in 1 object(s) allocated from:
#0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
#1 0xb378a8 in gmalloc /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/gmem.cc:140:13
#2 0x5a992b in XmlAltoOutputDev::startPage(int, GfxState*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7200:15
Indirect leak of 64 byte(s) in 1 object(s) allocated from:
#0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
#1 0xb378a8 in gmalloc /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/goo/gmem.cc:140:13
#2 0x59fbc1 in XmlAltoOutputDev::XmlAltoOutputDev(GString*, GString*, Catalog*, int, int, GString*, GString*) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6852:16
#3 0x5292c5 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:385:22
#4 0x7fdefdfba82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
SUMMARY: AddressSanitizer: 225355 byte(s) leaked in 128 allocation(s).
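Many of the direct leaks in the report above are 16-byte objects returned by `GString::fromInt` and apparently never deleted by the caller. One possible way to avoid that pattern, sketched under the assumption that the caller owns the returned pointer, is to hold it in a `std::unique_ptr`. The `Str` class, its `live` counter and `formatSize` are hypothetical stand-ins for illustration, not xpdf or pdfalto code.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a heap-allocated string class such as GString.
struct Str {
    std::string value;
    static int live;  // number of live instances, only used for illustration
    explicit Str(std::string v) : value(std::move(v)) { ++live; }
    ~Str() { --live; }
    // Mirrors a factory that hands back a raw owning pointer.
    static Str* fromInt(int x) { return new Str(std::to_string(x)); }
};
int Str::live = 0;

// Hypothetical caller: each raw pointer is wrapped in std::unique_ptr,
// so it is deleted on every exit path, including early returns.
std::string formatSize(int width, int height) {
    std::unique_ptr<Str> w(Str::fromInt(width));
    std::unique_ptr<Str> h(Str::fromInt(height));
    return w->value + "x" + h->value;
}
```

After a call such as `formatSize(640, 480)`, `Str::live` is back to zero, which is the property LeakSanitizer checks for at exit.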
The ICU library provides a normalisation API for contiguous composition.
For reference, with the current version 1071 out of 1942 PDFs fail when testing pdfalto with GROBID on the PubMed Central set of 1942 PDFs.
The errors are mostly pdfalto failures, plus XML that is not well-formed, usually because of invalid XML characters in attributes.
I am opening separate issues, with test PDFs attached, for the different cases.
Note that the latest version of our pdf2xml fork modified for GROBID was 100% successful on this set, so pdfalto should be too :)
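As an illustration of the "invalid XML character" case, here is a minimal sketch of filtering code points against the XML 1.0 `Char` production before they are written into an attribute. The function names `isXmlChar` and `stripInvalidXmlChars` are hypothetical, and this is not how pdfalto handles the problem; it only shows the check itself.

```cpp
#include <string>

// True if the code point matches the XML 1.0 "Char" production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
bool isXmlChar(char32_t c) {
    return c == 0x9 || c == 0xA || c == 0xD ||
           (c >= 0x20 && c <= 0xD7FF) ||
           (c >= 0xE000 && c <= 0xFFFD) ||
           (c >= 0x10000 && c <= 0x10FFFF);
}

// Returns a copy of the input with disallowed code points dropped.
// (Hypothetical helper; replacing them with U+FFFD is another option.)
std::u32string stripInvalidXmlChars(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) {
        if (isXmlChar(c)) out.push_back(c);
    }
    return out;
}
```

Escaping of `&`, `<` and quotes inside attribute values is a separate step and is not covered by this sketch.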
Hi!
So I opened kermitt2/xpdf-4.00#1 because I was trying to build pdfalto.
After updating the xpdf-4.00 submodule in this repo to have the xpdf-qt target name change (kermitt2/xpdf-4.00@61eb5e4), and the patch for kermitt2/xpdf-4.00#1, on my system this build fails with:
/usr/include/unicode/umachine.h:351:13: error: ‘char16_t’ does not name a type; did you mean ‘wchar_t’?
typedef char16_t UChar;
^~~~~~~~
wchar_t
In file included from /usr/include/unicode/utypes.h:39:0,
from /usr/include/unicode/ucnv_err.h:88,
from /usr/include/unicode/ucnv.h:52,
from /usr/include/libxml2/libxml/encoding.h:31,
from /usr/include/libxml2/libxml/parser.h:810,
from /usr/include/libxml2/libxml/globals.h:18,
from /usr/include/libxml2/libxml/threads.h:35,
from /usr/include/libxml2/libxml/xmlmemory.h:218,
from /home/sl/Code/Infuse/pdfalto/src/AnnotsXrce.h:14,
from /home/sl/Code/Infuse/pdfalto/src/AnnotsXrce.cc:1:
(and then more errors, such as ‘UChar’ does not name a type, which follow from the one above.)
Requiring C++11 fixes the error for me:
diff --git a/CMakeLists.txt b/CMakeLists.txt
index d9dc061..460a72e 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -1,7 +1,7 @@
cmake_minimum_required(VERSION 3.5+)
project(pdfalto)
-set(CMAKE_CXX_STANDARD 98)
+set(CMAKE_CXX_STANDARD 11)
#build xpdf
set ( XPDF_SUBDIR ${CMAKE_CURRENT_SOURCE_DIR}/xpdf-4.00)
I have icu 61.1-1 installed (the package that owns /usr/include/unicode/utypes.h); could that version be the reason for this? Is this an issue you have seen?
The next step for me is then the same as for kermitt2/xpdf-4.00#1, because I have libpaper present. I have to patch CMakeLists.txt with:
diff --git a/CMakeLists.txt b/CMakeLists.txt
index d9dc061..460a72e 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -30,7 +30,7 @@ set(SOURCE_FILES
add_executable(pdfalto ${SOURCE_FILES})
-target_link_libraries(pdfalto png zlib xml2 xpdf goo fofi ${XPDF_BUILD_DIR}/xpdf/lib/libxpdf.a)
+target_link_libraries(pdfalto png zlib xml2 xpdf goo fofi ${XPDF_BUILD_DIR}/xpdf/lib/libxpdf.a ${HAVE_PAPER_H})
target_include_directories(pdfalto
PUBLIC /usr/include/libxml2
PUBLIC image/png
so that CMake knows about libpaper, and the build succeeds. I can open a PR with these changes too if you'd like, just let me know!
Previously, we used pdfalto to generate an ALTO XML file from the PDF and then https://github.com/filak/hOCR-to-ALTO to convert the ALTO XML to an hOCR file. With the newest release of pdfalto this no longer works, since the ALTO version seems to have changed. Can you share which version of ALTO pdfalto currently produces?
Hi,
I'm parsing documents after they have been processed by pdfalto, and I especially need the text block tags to split the text into paragraphs. However, the documents I'm working on have the same distance between every line, as you can see in the attached dummy example.
Example document.docx
The only parameter that identifies these text blocks is this one in the XmlAltoOutputDev.cc file:
// Max distance between baselines of two lines within a block, as a
// fraction of the font size.
#define maxLineSpacingDelta 1.5
I've tried tweaking it, but that only leads to either one-line paragraphs or the entire text of a page gathered into a single block.
So I was wondering whether there is any other way to get these paragraphs/blocks, for example by looking at the x position of the final word in a line (a paragraph can quite often be identified by a line break before the end of the line).
Thank you in advance
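The heuristic suggested in this question (a line that ends well before the right edge of the text probably closes a paragraph) can be sketched as a post-processing step over the line coordinates read from the ALTO output. This is only an illustration: the `Line` struct, the `paragraphStarts` function and the 0.85 threshold are assumptions, not part of pdfalto.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical line record: horizontal extent of one text line on the page.
struct Line {
    double xStart;
    double xEnd;
};

// Returns the indices of lines that start a new paragraph. A line starts a
// paragraph when the previous line ends short of `shortFraction` of the
// overall text width (an arbitrary, tunable assumption).
std::vector<std::size_t> paragraphStarts(const std::vector<Line>& lines,
                                         double shortFraction = 0.85) {
    std::vector<std::size_t> starts;
    if (lines.empty()) return starts;

    // Estimate the text column from the leftmost start and rightmost end.
    double left = lines[0].xStart, right = lines[0].xEnd;
    for (const Line& l : lines) {
        if (l.xStart < left) left = l.xStart;
        if (l.xEnd > right) right = l.xEnd;
    }
    const double threshold = left + shortFraction * (right - left);

    starts.push_back(0);  // the first line always opens a paragraph
    for (std::size_t i = 1; i < lines.size(); ++i) {
        if (lines[i - 1].xEnd < threshold) starts.push_back(i);
    }
    return starts;
}
```

For lines ending at x = 500, 498, 210, 500 and 300 (all starting at 50), the third line is short, so the fourth line is reported as a paragraph start. Justified text where the last line of a paragraph happens to be full width would still be missed, so this would likely need to be combined with other cues such as indentation.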
I built pdfalto with Clang 6.0 and AddressSanitizer. This file can cause a SEGV in function GfxImageColorMap::getRGB in GfxState.cc when executing this command:
./pdfalto SEGV_getRGB 1.xml
This is the ASAN information:
AddressSanitizer:DEADLYSIGNAL
=================================================================
==20699==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000710e0c bp 0x7ffd9f911d30 sp 0x7ffd9f911be0 T0)
==20699==The signal is caused by a READ memory access.
==20699==Hint: address points to the zero page.
#0 0x710e0b in GfxImageColorMap::getRGB(unsigned char*, GfxRGB*, GfxRenderingIntent) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/GfxState.cc:3659:30
#1 0x596afe in TextPage::drawImageOrMask(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:6477:35
#2 0x5aedb4 in XmlAltoOutputDev::drawImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, int*, int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7535:28
#3 0x5ae4a4 in XmlAltoOutputDev::drawSoftMaskedImage(GfxState*, Object*, Stream*, int, int, GfxImageColorMap*, Stream*, int, int, GfxImageColorMap*, double*, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7458:5
#4 0x9d94cd in Gfx::doImage(Object*, Stream*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:4447:7
#5 0x9709a5 in Gfx::opXObject(Object*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:3980:2
#6 0x9a6668 in Gfx::execOp(Object*, Object*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:826:3
#7 0x9a42b1 in Gfx::go(int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:719:12
#8 0x9a1d1b in Gfx::display(Object*, int) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:641:3
#9 0x77c466 in Page::displaySlice(OutputDev*, double, double, int, int, int, int, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:373:10
#10 0x77babc in Page::display(OutputDev*, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:321:3
#11 0x78268e in PDFDoc::displayPage(OutputDev*, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/PDFDoc.cc:386:27
#12 0x78268e in PDFDoc::displayPages(OutputDev*, int, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/PDFDoc.cc:399
#13 0x526f9d in PDFDocXrce::displayPages(OutputDev*, _xmlNode*, int, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/src/PDFDocXrce.cc:22:10
#14 0x529565 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:415:18
#15 0x7fc4172c282f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
#16 0x41c678 in _start (/home/fouzhe/my_fuzz/pdfalto/pdfalto+0x41c678)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/GfxState.cc:3659:30 in GfxImageColorMap::getRGB(unsigned char*, GfxRGB*, GfxRenderingIntent)
==20699==ABORTING
pdfalto hangs forever with this PMC PDF:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4153526/pdf/mjiri-28-18.pdf
lopez@work:~/tools/softcite-dataset$ ~/pdfalto/pdfalto -outline pdf/problems/PMC5181807.pdf
Segmentation fault (core dumped)
lopez@work:~/tools/softcite-dataset$ ~/pdfalto/pdfalto pdf/problems/PMC5181807.pdf
-> ok
Here are a few PMC PDFs with this error:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5181807/pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5265988/pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5266189/pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3566810/pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5348138/pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5137495/pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5084642/pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5402591/pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5386927/pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5207597/pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5300027/pdf
I built pdfalto with Clang 6.0 and AddressSanitizer. This file can cause a heap buffer overflow in function TextPage::dump when executing this command:
./pdfalto heap-buffer-overflow_dump 1.xml
This is the ASAN information:
=================================================================
==1865==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60200047ff5a at pc 0x00000049397b bp 0x7ffd280d7300 sp 0x7ffd280d6ab0
WRITE of size 13 at 0x60200047ff5a thread T0
#0 0x49397a in vsprintf /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:1572
#1 0x493ad2 in __interceptor_sprintf /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:1615
#2 0x587186 in TextPage::dump(int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:5307:13
#3 0x5a9de7 in XmlAltoOutputDev::endPage() /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7216:19
#4 0x9a06eb in Gfx::~Gfx() /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Gfx.cc:590:10
#5 0x77cd47 in Page::displaySlice(OutputDev*, double, double, int, int, int, int, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:406:3
#6 0x77babc in Page::display(OutputDev*, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/Page.cc:321:3
#7 0x78268e in PDFDoc::displayPage(OutputDev*, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/PDFDoc.cc:386:27
#8 0x78268e in PDFDoc::displayPages(OutputDev*, int, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/xpdf-4.00/xpdf/PDFDoc.cc:399
#9 0x526f9d in PDFDocXrce::displayPages(OutputDev*, _xmlNode*, int, int, double, double, int, int, int, int, int (*)(void*), void*) /home/fouzhe/my_fuzz/pdfalto/src/PDFDocXrce.cc:22:10
#10 0x529565 in main /home/fouzhe/my_fuzz/pdfalto/src/pdfalto.cc:415:18
#11 0x7f4ca2c9982f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
#12 0x41c678 in _start (/home/fouzhe/my_fuzz/pdfalto/pdfalto+0x41c678)
0x60200047ff5a is located 0 bytes to the right of 10-byte region [0x60200047ff50,0x60200047ff5a)
allocated by thread T0 here:
#0 0x4e08a8 in __interceptor_malloc /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
#1 0x5821d5 in TextPage::dump(int, int) /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:4706:20
#2 0x5a9de7 in XmlAltoOutputDev::endPage() /home/fouzhe/my_fuzz/pdfalto/src/XmlAltoOutputDev.cc:7216:19
SUMMARY: AddressSanitizer: heap-buffer-overflow /home/fouzhe/llvm/llvm/projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:1572 in vsprintf
Shadow bytes around the buggy address:
0x0c0480087f90: fa fa fd fa fa fa fd fd fa fa fd fa fa fa 05 fa
0x0c0480087fa0: fa fa 06 fa fa fa 05 fa fa fa 07 fa fa fa 05 fa
0x0c0480087fb0: fa fa 06 fa fa fa 06 fa fa fa 07 fa fa fa 07 fa
0x0c0480087fc0: fa fa 07 fa fa fa 00 02 fa fa 06 fa fa fa 03 fa
0x0c0480087fd0: fa fa 06 fa fa fa 00 fa fa fa 05 fa fa fa 07 fa
=>0x0c0480087fe0: fa fa 05 fa fa fa 00 fa fa fa 00[02]fa fa 07 fa
0x0c0480087ff0: fa fa fd fd fa fa fd fd fa fa fd fa fa fa fd fd
0x0c0480088000: fa fa fd fd fa fa 04 fa fa fa 00 01 fa fa fd fd
0x0c0480088010: fa fa fd fa fa fa fd fd fa fa 03 fa fa fa 00 fa
0x0c0480088020: fa fa fd fd fa fa fd fa fa fa 00 00 fa fa 00 fa
0x0c0480088030: fa fa fd fa fa fa 00 fa fa fa 02 fa fa fa 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==1865==ABORTING
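The report above shows `sprintf` writing 13 bytes into a 10-byte heap region. The code at XmlAltoOutputDev.cc:5307 is not shown here, so the following is only a generic sketch of one way to avoid this class of bug: size the buffer from the formatted length with `snprintf`. The function name `formatCoord` and the `"%.2f"` format are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Formats a double into a string whose size is derived from the formatted
// length, instead of writing into a fixed-size buffer with sprintf.
// The "%.2f" format is an assumption, not the format pdfalto uses.
std::string formatCoord(double v) {
    // First pass: ask snprintf how many characters are needed (without NUL).
    int n = std::snprintf(nullptr, 0, "%.2f", v);
    if (n < 0) return std::string();  // encoding error

    std::string out(static_cast<std::size_t>(n) + 1, '\0');
    // Second pass: snprintf writes at most out.size() bytes, NUL included.
    std::snprintf(&out[0], out.size(), "%.2f", v);
    out.resize(static_cast<std::size_t>(n));  // drop the terminator
    return out;
}
```

A large value such as `123456789.125` needs 12 characters plus the terminator, which is exactly the kind of input that overruns a 10-byte buffer when `sprintf` is used.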
Newest build fails
Output:
Sending build context to Docker daemon 11.19MB
Step 1/31 : FROM ubuntu:xenial
---> a51debf7e1eb
Step 2/31 : ADD https://github.com/openfaas/faas/releases/download/0.7.0/fwatchdog /usr/bin
Downloading [==================================================>] 4.111MB/4.111MB
---> Using cache
---> 583dc3bc25f3
Step 3/31 : RUN chmod +x /usr/bin/fwatchdog
---> Using cache
---> a008a7bd6eba
Step 4/31 : RUN apt-get update -y
---> Using cache
---> 4f2effbf5ad3
Step 5/31 : RUN apt-get install -y python3.5
---> Using cache
---> 708ec07d2bac
Step 6/31 : RUN apt-get -y install python3-pip
---> Using cache
---> 743ff0ec2418
Step 7/31 : RUN apt-get install -y --no-install-recommends wget build-essential automake g++
---> Using cache
---> 99e8d7b1c5fd
Step 8/31 : RUN apt-get install -y libxml2-dev
---> Using cache
---> c3f85c426b14
Step 9/31 : RUN apt-get install -y libmotif-dev
---> Using cache
---> 799c31317488
Step 10/31 : RUN apt-get install -y git
---> Using cache
---> b63d24e5472f
Step 11/31 : RUN mkdir icu && wget -q https://github.com/unicode-org/icu/releases/download/release-63-1/icu4c-63_1-src.tgz && gunzip -d < icu4c-63_1-src.tgz | tar xvf - && cd icu/source && chmod +x runConfigureICU configure install-sh && ./runConfigureICU Linux/gcc --enable-static --disable-shared && make
---> Using cache
---> e38664784bba
Step 12/31 : RUN git clone https://github.com/kermitt2/pdfalto.git ~/pdfalto
---> Using cache
---> f7f4bb177a3f
Step 13/31 : WORKDIR /root/pdfalto
---> Using cache
---> 301297d7ce11
Step 14/31 : RUN git checkout tags/0.2
---> Using cache
---> 47b23c80e9b2
Step 15/31 : RUN git submodule update --init --recursive
---> Using cache
---> 0f75f1910452
Step 16/31 : RUN apt-get install -y cmake
---> Using cache
---> 64f007a8cc61
Step 17/31 : WORKDIR /root/pdfalto
---> Running in 1a6524069e26
Removing intermediate container 1a6524069e26
---> 41214ff6f6a5
Step 18/31 : RUN cmake -D'ICU_PATH=/root/icu'
---> Running in 064de13d7ade
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for mkstemp
-- Looking for mkstemp - found
-- Looking for mkstemps
-- Looking for mkstemps - found
-- Looking for popen
-- Looking for popen - found
-- Performing Test HAVE_STD_SORT
-- Performing Test HAVE_STD_SORT - Success
-- Looking for fseeko
-- Looking for fseeko - found
-- Looking for fseek64
-- Looking for fseek64 - not found
-- Looking for _fseeki64
-- Looking for _fseeki64 - not found
-- Found FreeType (old-style includes): /usr/lib/x86_64-linux-gnu/libfreetype.so
-- Could NOT find TIFF (missing: TIFF_LIBRARY TIFF_INCLUDE_DIR)
-- lcms2 not found
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Configuring done
-- Generating done
-- Build files have been written to: /root/pdfalto
Removing intermediate container 064de13d7ade
---> 0db206805ae8
Step 19/31 : RUN make
---> Running in ad116718332b
Scanning dependencies of target xpdf
[ 1%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/AcroForm.cc.o
[ 1%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Annot.cc.o
[ 2%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Array.cc.o
[ 3%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/BuiltinFont.cc.o
[ 3%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/BuiltinFontTables.cc.o
[ 4%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Catalog.cc.o
[ 5%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/CharCodeToUnicode.cc.o
[ 5%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/CMap.cc.o
[ 6%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Decrypt.cc.o
[ 7%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Dict.cc.o
[ 8%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/DisplayState.cc.o
[ 8%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Error.cc.o
[ 9%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/FontEncodingTables.cc.o
[ 10%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Form.cc.o
[ 10%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Function.cc.o
[ 11%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Gfx.cc.o
[ 12%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/GfxFont.cc.o
[ 12%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/GfxState.cc.o
[ 13%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/GlobalParams.cc.o
[ 14%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/HTMLGen.cc.o
[ 15%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/JArithmeticDecoder.cc.o
[ 15%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/JBIG2Stream.cc.o
[ 16%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/JPXStream.cc.o
[ 17%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Lexer.cc.o
[ 17%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Link.cc.o
[ 18%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/NameToCharCode.cc.o
[ 19%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Object.cc.o
[ 19%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/OptionalContent.cc.o
[ 20%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Outline.cc.o
[ 21%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/OutputDev.cc.o
[ 22%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Page.cc.o
[ 22%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Parser.cc.o
[ 23%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/PDFCore.cc.o
[ 24%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/PDFDoc.cc.o
[ 24%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/PDFDocEncoding.cc.o
[ 25%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/PSTokenizer.cc.o
[ 26%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/SecurityHandler.cc.o
[ 26%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Stream.cc.o
[ 27%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/TextString.cc.o
[ 28%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/UnicodeMap.cc.o
[ 29%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/UnicodeTypeTable.cc.o
[ 29%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/UTF8.cc.o
[ 30%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/XFAForm.cc.o
[ 31%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/XRef.cc.o
[ 31%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/Zoox.cc.o
[ 32%] Linking CXX static library ../build/xpdf/lib/libxpdf.a
[ 32%] Built target xpdf
Scanning dependencies of target zlib
[ 33%] Building C object image/zlib/CMakeFiles/zlib.dir/adler32.c.o
[ 34%] Building C object image/zlib/CMakeFiles/zlib.dir/compress.c.o
[ 35%] Building C object image/zlib/CMakeFiles/zlib.dir/crc32.c.o
[ 35%] Building C object image/zlib/CMakeFiles/zlib.dir/deflate.c.o
[ 36%] Building C object image/zlib/CMakeFiles/zlib.dir/gzio.c.o
[ 37%] Building C object image/zlib/CMakeFiles/zlib.dir/infback.c.o
[ 37%] Building C object image/zlib/CMakeFiles/zlib.dir/inffast.c.o
[ 38%] Building C object image/zlib/CMakeFiles/zlib.dir/inflate.c.o
[ 39%] Building C object image/zlib/CMakeFiles/zlib.dir/inftrees.c.o
[ 39%] Building C object image/zlib/CMakeFiles/zlib.dir/trees.c.o
[ 40%] Building C object image/zlib/CMakeFiles/zlib.dir/uncompr.c.o
[ 41%] Building C object image/zlib/CMakeFiles/zlib.dir/zutil.c.o
[ 42%] Linking C static library libzlib.a
[ 42%] Built target zlib
Scanning dependencies of target png
[ 43%] Building C object image/png/CMakeFiles/png.dir/png.c.o
[ 44%] Building C object image/png/CMakeFiles/png.dir/pngerror.c.o
[ 44%] Building C object image/png/CMakeFiles/png.dir/pnggccrd.c.o
[ 45%] Building C object image/png/CMakeFiles/png.dir/pngget.c.o
[ 46%] Building C object image/png/CMakeFiles/png.dir/pngmem.c.o
[ 46%] Building C object image/png/CMakeFiles/png.dir/pngpread.c.o
[ 47%] Building C object image/png/CMakeFiles/png.dir/pngread.c.o
[ 48%] Building C object image/png/CMakeFiles/png.dir/pngrio.c.o
[ 49%] Building C object image/png/CMakeFiles/png.dir/pngrtran.c.o
[ 49%] Building C object image/png/CMakeFiles/png.dir/pngrutil.c.o
[ 50%] Building C object image/png/CMakeFiles/png.dir/pngset.c.o
[ 51%] Building C object image/png/CMakeFiles/png.dir/pngtrans.c.o
[ 51%] Building C object image/png/CMakeFiles/png.dir/pngvcrd.c.o
[ 52%] Building C object image/png/CMakeFiles/png.dir/pngwio.c.o
[ 53%] Building C object image/png/CMakeFiles/png.dir/pngwrite.c.o
[ 53%] Building C object image/png/CMakeFiles/png.dir/pngwtran.c.o
[ 54%] Building C object image/png/CMakeFiles/png.dir/pngwutil.c.o
[ 55%] Linking C static library libpng.a
[ 55%] Built target png
Scanning dependencies of target goo_objs
[ 56%] Building CXX object xpdf-4.00/goo/CMakeFiles/goo_objs.dir/FixedPoint.cc.o
[ 56%] Building CXX object xpdf-4.00/goo/CMakeFiles/goo_objs.dir/GHash.cc.o
[ 57%] Building CXX object xpdf-4.00/goo/CMakeFiles/goo_objs.dir/GList.cc.o
[ 58%] Building CXX object xpdf-4.00/goo/CMakeFiles/goo_objs.dir/GString.cc.o
[ 59%] Building CXX object xpdf-4.00/goo/CMakeFiles/goo_objs.dir/gfile.cc.o
[ 59%] Building CXX object xpdf-4.00/goo/CMakeFiles/goo_objs.dir/gmem.cc.o
[ 60%] Building CXX object xpdf-4.00/goo/CMakeFiles/goo_objs.dir/gmempp.cc.o
[ 61%] Building C object xpdf-4.00/goo/CMakeFiles/goo_objs.dir/parseargs.c.o
[ 61%] Built target goo_objs
Scanning dependencies of target goo
[ 62%] Linking CXX static library libgoo.a
[ 62%] Built target goo
Scanning dependencies of target fofi_objs
[ 63%] Building CXX object xpdf-4.00/fofi/CMakeFiles/fofi_objs.dir/FoFiBase.cc.o
[ 64%] Building CXX object xpdf-4.00/fofi/CMakeFiles/fofi_objs.dir/FoFiEncodings.cc.o
[ 64%] Building CXX object xpdf-4.00/fofi/CMakeFiles/fofi_objs.dir/FoFiIdentifier.cc.o
[ 65%] Building CXX object xpdf-4.00/fofi/CMakeFiles/fofi_objs.dir/FoFiTrueType.cc.o
[ 66%] Building CXX object xpdf-4.00/fofi/CMakeFiles/fofi_objs.dir/FoFiType1.cc.o
[ 66%] Building CXX object xpdf-4.00/fofi/CMakeFiles/fofi_objs.dir/FoFiType1C.cc.o
[ 66%] Built target fofi_objs
Scanning dependencies of target fofi
[ 66%] Linking CXX static library libfofi.a
[ 66%] Built target fofi
Scanning dependencies of target pdfalto
[ 66%] Building CXX object CMakeFiles/pdfalto.dir/src/AnnotsXrce.cc.o
[ 67%] Building CXX object CMakeFiles/pdfalto.dir/src/ConstantsUtils.cc.o
[ 68%] Building CXX object CMakeFiles/pdfalto.dir/src/ConstantsXMLALTO.cc.o
[ 68%] Building CXX object CMakeFiles/pdfalto.dir/src/Parameters.cc.o
[ 69%] Building CXX object CMakeFiles/pdfalto.dir/src/PDFDocXrce.cc.o
[ 70%] Building CXX object CMakeFiles/pdfalto.dir/src/pdfalto.cc.o
[ 71%] Building CXX object CMakeFiles/pdfalto.dir/src/XmlAltoOutputDev.cc.o
make[2]: *** No rule to make target '/root/icu/lib/libicuuc.a', needed by 'pdfalto'. Stop.
CMakeFiles/Makefile2:71: recipe for target 'CMakeFiles/pdfalto.dir/all' failed
make[1]: *** [CMakeFiles/pdfalto.dir/all] Error 2
make: *** [all] Error 2
Makefile:127: recipe for target 'all' failed
The command '/bin/sh -c make' returned a non-zero code: 2
The attached PDF generates an XML file that cannot be parsed by GROBID's SAX parser:
ERROR [2019-02-25 20:10:13,990] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
! org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
! at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
! at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
! at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
! at org.apache.xerces.impl.XMLEntityScanner.scanLiteral(Unknown Source)
! at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanAttribute(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
! ... 80 common frames omitted
! Causing: org.xml.sax.SAXParseException: Invalid byte 2 of 3-byte UTF-8 sequence.
! at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
! at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
! at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
! at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
! at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
! at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
! at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
! at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
! at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
! at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
! at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
! at org.grobid.core.document.Document.addTokenizedDocument(Document.java:381)
! ... 70 common frames omitted
! Causing: org.grobid.core.exceptions.GrobidException: [PARSING_ERROR] Cannot parse file: /home/lopez/grobid/grobid-home/tmp/xsW7YuKt23.lxml
! at org.grobid.core.document.Document.addTokenizedDocument(Document.java:393)
! at org.grobid.core.engines.Segmentation.processing(Segmentation.java:94)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:130)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:109)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:474)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:465)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:179)
...
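For reference, the Xerces message points at a 3-byte UTF-8 sequence whose second byte is not acceptable, so the generated XML contains bytes that are not well-formed UTF-8. A minimal sketch of a check that could be run on a string before it is written as an attribute value; `isValidUtf8` is a hypothetical helper introduced here for illustration, not an existing pdfalto or xpdf function:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of pdfalto): returns true if `s` is
// well-formed UTF-8. Rejects truncated sequences, bad continuation bytes,
// overlong encodings, UTF-16 surrogates and code points above U+10FFFF.
bool isValidUtf8(const std::string &s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (c < 0x80) {
            len = 1; cp = c;
        } else if ((c & 0xE0) == 0xC0) {
            len = 2; cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
            len = 3; cp = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
            len = 4; cp = c & 0x07;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (i + len > n) return false;  // sequence cut off at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (len == 2 && cp < 0x80) return false;     // overlong
        if (len == 3 && cp < 0x800) return false;    // overlong
        if (len == 4 && cp < 0x10000) return false;  // overlong
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;
        i += len;
    }
    return true;
}
```

A string that fails the check could be logged together with its page and token ID, which would help locate the offending font or character mapping in the PDF.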
I am trying to trace the problem of incorrect coordinates for string elements in some PDFs. One example is the attached PubMed Central PDF. Using it with GROBID and the PDF.js document display + annotations, we see that the bounding boxes for the annotations are not correct (while usually they are!).
The problem is apparently coming from pdfalto, but I am not sure whether it comes from an incorrect page dimension or an incorrect origin point on the page for the string coordinates.
In the attached PDF, all page dimensions are x:662, y:860. On the first page, the first token "Association" is positioned at x:85, y:126, w:115, h:17.8. The x/y proportion is visibly incorrect; x and y should be x:57, y:90 (from PDF.js).
On the second page, the first token "Xia" is positioned at x:71, y:64, w:10, h:7.3; once again x/y is visibly incorrect. It should be x:42, y:21 (from PDF.js).
Looking at XmlAltoOutputDev.cc and TextPage::startPage, page coordinates come from GfxState and the page box, but I saw nothing that looks really related to this :/
Description - We observed a null pointer dereference in the function AnnotsXrce::AnnotsXrce() located in AnnotsXrce.cc. It can be triggered by sending a crafted PDF file to the pdfalto binary. It allows an attacker to cause a denial of service (segmentation fault) or possibly have unspecified other impact.
Command - ./pdfalto -f 1 -l 2 -noText -noImage -outline -annotation -cutPages -blocks -readingOrder -ocr -fullFontName $POC
POC - REPRODUCER
Debug -
gdb:
[ Legend: Modified register | Code | Heap | Stack | String ]
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── registers ────
$rax : 0x0
$rbx : 0x00007fffffffda40 → 0x000061700000f580 → 0x000061300000de80 → 0x00000000009c1828 → 0x000000000062bd46 → <FileStream::~FileStream()+0> push rbp
$rcx : 0x300
$rdx : 0x0
$rsp : 0x00007fffffffd440 → 0x0000000041b58ab3
$rbp : 0x00007fffffffda70 → 0x00007fffffffdbf0 → 0x00007fffffffdd10 → 0x000000000090c360 → <__libc_csu_init+0> push r15
$rsi : 0x1
$rdi : 0x000060400000c850 → 0xbebebebebebebebe
$rip : 0x0000000000406adc → <AnnotsXrce::AnnotsXrce(Object&,+0> mov rax, QWORD PTR [rax]
$r8 : 0x0
$r9 : 0x35ef
$r10 : 0x50
$r11 : 0x00007ffff7efb310 → 0x0000000000000000
$r12 : 0x00000ffffffffabc → 0x0000000000000000
$r13 : 0x00007fffffffd5e0 → 0x0000000041b58ab3
$r14 : 0x000060400000c850 → 0xbebebebebebebebe
$r15 : 0x00007fffffffd5e0 → 0x0000000041b58ab3
$eflags: [carry PARITY adjust ZERO sign trap INTERRUPT direction overflow RESUME virtualx86 identification]
$cs: 0x0033 $ss: 0x002b $ds: 0x0000 $es: 0x0000 $fs: 0x0000 $gs: 0x0000
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── stack ────
0x00007fffffffd440│+0x0000: 0x0000000041b58ab3 ← $rsp
0x00007fffffffd448│+0x0008: 0x0000602000010330 → 0xbebebebe0000003a (":"?)
0x00007fffffffd450│+0x0010: 0x000000010000000d → 0x0000000000000000
0x00007fffffffd458│+0x0018: 0x00007fffffffdb60 → 0x3ff0000000000000
0x00007fffffffd460│+0x0020: 0x0000611000009c80 → 0x000060800000bfa8 → 0x0000602000010ad0 → 0xbebebebe00000031 ("1"?)
0x00007fffffffd468│+0x0028: 0x000060c000007c00 → 0x0000000000000000
0x00007fffffffd470│+0x0030: 0x00007fffffffdb20 → 0xbebebebe00000006
0x00007fffffffd478│+0x0038: 0x0000602000106f50 → 0x0000000000000002
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── code:x86:64 ────
0x406acd <AnnotsXrce::AnnotsXrce(Object&,+0> mov rdi, rax
0x406ad0 <AnnotsXrce::AnnotsXrce(Object&,+0> call 0x404a40 <__asan_report_load8@plt>
0x406ad5 <AnnotsXrce::AnnotsXrce(Object&,+0> mov rax, QWORD PTR [rbp-0x548]
→ 0x406adc <AnnotsXrce::AnnotsXrce(Object&,+0> mov rax, QWORD PTR [rax]
0x406adf <AnnotsXrce::AnnotsXrce(Object&,+0> add rax, 0x10
0x406ae3 <AnnotsXrce::AnnotsXrce(Object&,+0> mov rdx, rax
0x406ae6 <AnnotsXrce::AnnotsXrce(Object&,+0> mov rcx, rdx
0x406ae9 <AnnotsXrce::AnnotsXrce(Object&,+0> shr rcx, 0x3
0x406aed <AnnotsXrce::AnnotsXrce(Object&,+0> add rcx, 0x7fff8000
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── source:/home/aceteam/Downloads/sources/pdfalto/src/AnnotsXrce.cc+85 ────
80 Link *link = new Link(dict, catalog->getBaseURI());
81 //printf("%d \n",link->isOk());
82 LinkAction *ac = link->getAction();
83 //printf("ac %d \n",ac->isOk());
84 // Get the Action information
// ac=0x00007fffffffd528 → 0x0000000000000000
→ 85 if (ac->isOk()) {
86 xmlNodePtr nodeActionAction;
87 xmlNodePtr nodeActionDEST;
88 if (nodeAnnot) {
89 nodeActionAction = xmlNewNode(NULL, (const xmlChar *) TAG_ACTION);
90 nodeActionAction->type = XML_ELEMENT_NODE;
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── threads ────
[#0] Id 1, Name: "pdfalto", stopped, reason: SIGSEGV
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── trace ────
[#0] 0x406adc → AnnotsXrce::AnnotsXrce(this=0x602000106f50, objA=@0x7fffffffdb20, docrootA=0x60c000007c00, catalog=0x611000009c80, ctmA=0x7fffffffdb60, pageNumA=0x1)
[#1] 0x40a94a → PDFDocXrce::displayPages(this=0x60800000bfa0, out=0x61500000c100, docrootA=0x60c000007c00, firstPage=0x1, lastPage=0x1, hDPI=72, vDPI=72, rotate=0x0, useMediaBox=0x0, crop=0x1, doLinks=0x0, abortCheckCbk=0x0, abortCheckCbkData=0x0)
[#2] 0x40bdf6 → main(argc=0x2, argv=0x7fffffffddf8)
gef➤ p ac
$9 = (LinkAction *) 0x0
gef➤ p ac->isOk()
Cannot access memory at address 0x0
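The trace shows `ac` is 0x0 when `ac->isOk()` is evaluated at AnnotsXrce.cc line 85. A minimal sketch of the kind of guard that would avoid the crash, written against a stand-in type so it compiles on its own; `LinkActionStub` and `hasUsableAction` are hypothetical names, and the real fix would test the `LinkAction *` returned by `link->getAction()`:

```cpp
#include <cassert>

// Stand-in for xpdf's LinkAction, reduced to the one method the check
// needs. This type is hypothetical and only here to keep the sketch
// self-contained.
struct LinkActionStub {
    bool ok;
    bool isOk() const { return ok; }
};

// Returns true only when an action exists and reports itself valid.
// The pointer is tested first, so a null action is never dereferenced.
bool hasUsableAction(const LinkActionStub *ac) {
    return ac != nullptr && ac->isOk();
}
```

With a guard of this shape, an annotation whose link has no action would simply be skipped instead of ending in SIGSEGV.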
Some requests are coming in for the possibility to output individual characters along with their respective attributes (width, height, fonts, ...).
When I try to start pdfalto, I get a segmentation fault (Core dumped)
OS: Ubuntu 18 LTS
gcc 7.3.0
libxml2-dev: 2.9.4+dfsg1-6.1ubuntu1
Compiling works fine with warnings.
It generates a *_metadata.xml, but then it stops with a segmentation fault
I followed the compiling steps from Github.
Was anybody able to compile it under Ubuntu 18?
The error is:
undefined reference to `FT_Library_Version'
// ...
undefined reference to `FT_Done_Glyph'
Probably due to an older version? Can be upgraded.
I tried in this pull request, but then noticed the path is hard coded now.
And the text appears to be maintained until approximately 35 40 years -of age, followed by modest decreases until 50 years of age
As you can see, the 'hyphen' is out of place.
This happens with pdf2xml too, by the way.
Here is the output from pdfalto:
<TextLine WIDTH="502.269" HEIGHT="8.208" ID="p1_t62" HPOS="51" VPOS="638.632">
<String ID="p1_w620" CONTENT="Endurance" HPOS="51" VPOS="638.632" WIDTH="38.556" HEIGHT="8.208"
STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="89.556"/>
<String ID="p1_w621" CONTENT="and" HPOS="92.04" VPOS="638.632" WIDTH="13.014" HEIGHT="8.208"
STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="105.054"/>
<String ID="p1_w622" CONTENT="ultra-endurance" HPOS="107.538" VPOS="638.632" WIDTH="56.601"
HEIGHT="8.208" STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="164.139"/>
<String ID="p1_w623" CONTENT="performance," HPOS="166.623" VPOS="638.632" WIDTH="47.826"
HEIGHT="8.208" STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="214.449"/>
<String ID="p1_w624" CONTENT="in" HPOS="216.933" VPOS="638.632" WIDTH="7.011" HEIGHT="8.208"
STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="223.944"/>
<String ID="p1_w625" CONTENT="terms" HPOS="226.428" VPOS="638.632" WIDTH="20.034" HEIGHT="8.208"
STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="246.462"/>
<String ID="p1_w626" CONTENT="of" HPOS="248.946" VPOS="638.632" WIDTH="7.506" HEIGHT="8.208"
STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="256.452"/>
<String ID="p1_w627" CONTENT="the" HPOS="258.936" VPOS="638.632" WIDTH="11.016" HEIGHT="8.208"
STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="269.952"/>
<String ID="p1_w628" CONTENT="overall" HPOS="272.436" VPOS="638.632" WIDTH="25.047"
HEIGHT="8.208" STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="297.483"/>
<String ID="p1_w629" CONTENT="time" HPOS="299.967" VPOS="638.632" WIDTH="16.029" HEIGHT="8.208"
STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="315.996"/>
<String ID="p1_w630" CONTENT="taken," HPOS="318.48" VPOS="638.632" WIDTH="21.789" HEIGHT="8.208"
STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="340.269"/>
<String ID="p1_w631" CONTENT="appears" HPOS="342.753" VPOS="638.632" WIDTH="27.54"
HEIGHT="8.208" STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="370.293"/>
<String ID="p1_w632" CONTENT="to" HPOS="372.777" VPOS="638.632" WIDTH="7.011" HEIGHT="8.208"
STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="379.788"/>
<String ID="p1_w633" CONTENT="be" HPOS="382.272" VPOS="638.632" WIDTH="8.505" HEIGHT="8.208"
STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="390.777"/>
<String ID="p1_w634" CONTENT="maintained" HPOS="393.261" VPOS="638.632" WIDTH="40.077"
HEIGHT="8.208" STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="433.338"/>
<String ID="p1_w635" CONTENT="until" HPOS="435.822" VPOS="638.632" WIDTH="16.542" HEIGHT="8.208"
STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="452.364"/>
<String ID="p1_w636" CONTENT="approximately" HPOS="454.848" VPOS="638.632" WIDTH="52.101"
HEIGHT="8.208" STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="506.949"/>
<String ID="p1_w637" CONTENT="35" HPOS="509.433" VPOS="638.632" WIDTH="9.009" HEIGHT="8.208"
STYLEREFS="font10"/>
<SP WIDTH="4.308" VPOS="638.632" HPOS="518.442"/>
<String ID="p1_w638" CONTENT="40" HPOS="522.75" VPOS="638.632" WIDTH="9.009" HEIGHT="8.208"
STYLEREFS="font10"/>
<SP WIDTH="2.484" VPOS="638.632" HPOS="531.759"/>
<String ID="p1_w639" CONTENT="years" HPOS="534.243" VPOS="638.632" WIDTH="19.026" HEIGHT="8.208"
STYLEREFS="font10"/>
</TextLine>
</TextBlock>
<TextBlock ID="p1_b49" HPOS="518.7" VPOS="637" HEIGHT="9.75579" WIDTH="3.995">
<TextLine WIDTH="3.995" HEIGHT="9.75579" ID="p1_t63" HPOS="518.7" VPOS="637">
<String ID="p1_w640" CONTENT="–" HPOS="518.7" VPOS="637" WIDTH="3.995" HEIGHT="9.75579"
STYLEREFS="font5"/>
</TextLine>
</TextBlock>
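In the output above, the dash is emitted as its own TextBlock at HPOS 518.7 with WIDTH 3.995, which falls inside the 4.308-wide space between "35" (ending at 509.433 + 9.009 = 518.442) and "40" (starting at 522.75). A minimal sketch of how such a stray token could be put back in place by horizontal position; `Token`, `fitsInGap` and `insertByHpos` are hypothetical names for illustration and do not come from pdfalto:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Minimal stand-in for an ALTO <String> element (hypothetical type).
struct Token {
    std::string content;
    double hpos;
    double width;
};

// True if `stray` lies entirely inside the horizontal gap between
// `left` and `right`.
bool fitsInGap(const Token &left, const Token &right, const Token &stray) {
    return stray.hpos >= left.hpos + left.width &&
           stray.hpos + stray.width <= right.hpos;
}

// Inserts `stray` into `line` so that tokens stay ordered by HPOS.
// Assumes `line` is already sorted by hpos.
void insertByHpos(std::vector<Token> &line, const Token &stray) {
    auto pos = std::upper_bound(
        line.begin(), line.end(), stray,
        [](const Token &a, const Token &b) { return a.hpos < b.hpos; });
    line.insert(pos, stray);
}
```

A real fix would also have to compare VPOS and HEIGHT (the dash sits at VPOS 637 with a different font, against 638.632 for the line), so this only covers the horizontal half of the problem.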
I'm uploading this file
1903.07791.pdf
where some characters are not recognised:
the result is this:
or, this, in the text (please ignore the tags...):
Room temperature electrical resistivity was decreased down from 300 mcm for x = 0 to 8 mcm for x = 0.4. However, the temperature dependence of electrical resistivity was still insulating for x 0.4. In the present study, we show that Bi-rich composition up to ca. x = 0.8 can be obtained by optimizing synthesis temperature.
In XmlAltoOutputDev.cc, many small strings are allocated with a size of 10, 20 or 50 characters. In some circumstances a buffer overflow seems to occur (typically when outputting the WIDTH attribute), and the strings then include garbage. I suggest using snprintf, larger buffer sizes, or limiting the width of the %g/%d formats.
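A minimal sketch of the snprintf variant, assuming the values are doubles formatted with %g; `formatDouble` is a hypothetical helper and not a function from XmlAltoOutputDev.cc:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a coordinate such as WIDTH with "%g".
// snprintf never writes more than the given size (terminator included)
// and returns the length the complete result requires, so truncation
// can be detected and handled instead of overflowing the buffer.
std::string formatDouble(double value) {
    char buf[32];
    const int needed = std::snprintf(buf, sizeof(buf), "%g", value);
    if (needed < 0) {
        return std::string();  // formatting error
    }
    if (static_cast<std::size_t>(needed) < sizeof(buf)) {
        return std::string(buf, static_cast<std::size_t>(needed));
    }
    // Result did not fit: allocate exactly what is needed and retry.
    std::string out(static_cast<std::size_t>(needed) + 1, '\0');
    std::snprintf(&out[0], out.size(), "%g", value);
    out.resize(static_cast<std::size_t>(needed));
    return out;
}
```

For example, `formatDouble(502.269)` yields the string `502.269`, which can then be passed on as the attribute value without any fixed-size intermediate buffer being exposed to the caller.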
cmake.stderr.txt
cmake.stdout.txt
CMake Error at xpdf-4.00/xpdf-qt/CMakeLists.txt:65 (add_executable):
add_executable cannot create target "xpdf" because another target with the
same name already exists. The existing target is a static library created
in source directory "/usr/local/src/pdfalto/xpdf-4.00/xpdf". See
documentation for policy CMP0002 for more details.
For some PDF files an error is thrown concerning memory corruption:
*** Error in `pdfalto/pdfalto': malloc(): memory corruption (fast): 0x000000000193db80 *** ======= Backtrace: ========= /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fc3bebeb7e5] /lib/x86_64-linux-gnu/libc.so.6(+0x82651)[0x7fc3bebf6651] /lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7fc3bebf8184] /usr/lib/x86_64-linux-gnu/libxml2.so.2(xmlStrdup+0x42)[0x7fc3bf8b36a2] /usr/lib/x86_64-linux-gnu/libxml2.so.2(+0x5f8e8)[0x7fc3bf83e8e8] pdfalto/pdfalto[0x415add] pdfalto/pdfalto[0x40722c] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fc3beb94830] pdfalto/pdfalto[0x403829] ======= Memory map: ======== 00400000-00549000 r-xp 00000000 fd:01 7357305 /root/pdfalto/pdfalto 00749000-0074a000 r--p 00149000 fd:01 7357305 /root/pdfalto/pdfalto 0074a000-0078a000 rw-p 0014a000 fd:01 7357305 /root/pdfalto/pdfalto 018e6000-02867000 rw-p 00000000 00:00 0 [heap] 7fc3b8000000-7fc3b8021000 rw-p 00000000 00:00 0 7fc3b8021000-7fc3bc000000 ---p 00000000 00:00 0 7fc3bc4ce000-7fc3bdd84000 r-xp 00000000 fd:01 7358735 /usr/lib/x86_64-linux-gnu/libicudata.so.55.1 7fc3bdd84000-7fc3bdf83000 ---p 018b6000 fd:01 7358735 /usr/lib/x86_64-linux-gnu/libicudata.so.55.1 7fc3bdf83000-7fc3bdf84000 r--p 018b5000 fd:01 7358735 /usr/lib/x86_64-linux-gnu/libicudata.so.55.1 7fc3bdf84000-7fc3bdf85000 rw-p 018b6000 fd:01 7358735 /usr/lib/x86_64-linux-gnu/libicudata.so.55.1 7fc3bdf85000-7fc3bdfa6000 r-xp 00000000 fd:01 8374952 /lib/x86_64-linux-gnu/liblzma.so.5.0.0 7fc3bdfa6000-7fc3be1a5000 ---p 00021000 fd:01 8374952 /lib/x86_64-linux-gnu/liblzma.so.5.0.0 7fc3be1a5000-7fc3be1a6000 r--p 00020000 fd:01 8374952 /lib/x86_64-linux-gnu/liblzma.so.5.0.0 7fc3be1a6000-7fc3be1a7000 rw-p 00021000 fd:01 8374952 /lib/x86_64-linux-gnu/liblzma.so.5.0.0 7fc3be1a7000-7fc3be1c0000 r-xp 00000000 fd:01 8375020 /lib/x86_64-linux-gnu/libz.so.1.2.8 7fc3be1c0000-7fc3be3bf000 ---p 00019000 fd:01 8375020 /lib/x86_64-linux-gnu/libz.so.1.2.8 7fc3be3bf000-7fc3be3c0000 r--p 00018000 fd:01 8375020 
/lib/x86_64-linux-gnu/libz.so.1.2.8 7fc3be3c0000-7fc3be3c1000 rw-p 00019000 fd:01 8375020 /lib/x86_64-linux-gnu/libz.so.1.2.8 7fc3be3c1000-7fc3be540000 r-xp 00000000 fd:01 7358763 /usr/lib/x86_64-linux-gnu/libicuuc.so.55.1 7fc3be540000-7fc3be740000 ---p 0017f000 fd:01 7358763 /usr/lib/x86_64-linux-gnu/libicuuc.so.55.1 7fc3be740000-7fc3be750000 r--p 0017f000 fd:01 7358763 /usr/lib/x86_64-linux-gnu/libicuuc.so.55.1 7fc3be750000-7fc3be751000 rw-p 0018f000 fd:01 7358763 /usr/lib/x86_64-linux-gnu/libicuuc.so.55.1 7fc3be751000-7fc3be755000 rw-p 00000000 00:00 0 7fc3be755000-7fc3be758000 r-xp 00000000 fd:01 8374934 /lib/x86_64-linux-gnu/libdl-2.23.so 7fc3be758000-7fc3be957000 ---p 00003000 fd:01 8374934 /lib/x86_64-linux-gnu/libdl-2.23.so 7fc3be957000-7fc3be958000 r--p 00002000 fd:01 8374934 /lib/x86_64-linux-gnu/libdl-2.23.so 7fc3be958000-7fc3be959000 rw-p 00003000 fd:01 8374934 /lib/x86_64-linux-gnu/libdl-2.23.so 7fc3be959000-7fc3be972000 r-xp 00000000 fd:01 7357217 /root/pdfalto/image/zlib/libzlib.so 7fc3be972000-7fc3beb72000 ---p 00019000 fd:01 7357217 /root/pdfalto/image/zlib/libzlib.so 7fc3beb72000-7fc3beb73000 r--p 00019000 fd:01 7357217 /root/pdfalto/image/zlib/libzlib.so 7fc3beb73000-7fc3beb74000 rw-p 0001a000 fd:01 7357217 /root/pdfalto/image/zlib/libzlib.so 7fc3beb74000-7fc3bed34000 r-xp 00000000 fd:01 8374921 /lib/x86_64-linux-gnu/libc-2.23.so 7fc3bed34000-7fc3bef34000 ---p 001c0000 fd:01 8374921 /lib/x86_64-linux-gnu/libc-2.23.so 7fc3bef34000-7fc3bef38000 r--p 001c0000 fd:01 8374921 /lib/x86_64-linux-gnu/libc-2.23.so 7fc3bef38000-7fc3bef3a000 rw-p 001c4000 fd:01 8374921 /lib/x86_64-linux-gnu/libc-2.23.so 7fc3bef3a000-7fc3bef3e000 rw-p 00000000 00:00 0 7fc3bef3e000-7fc3bef54000 r-xp 00000000 fd:01 8374942 /lib/x86_64-linux-gnu/libgcc_s.so.1 7fc3bef54000-7fc3bf153000 ---p 00016000 fd:01 8374942 /lib/x86_64-linux-gnu/libgcc_s.so.1 7fc3bf153000-7fc3bf154000 rw-p 00015000 fd:01 8374942 /lib/x86_64-linux-gnu/libgcc_s.so.1 7fc3bf154000-7fc3bf25c000 r-xp 00000000 
fd:01 8374953 /lib/x86_64-linux-gnu/libm-2.23.so 7fc3bf25c000-7fc3bf45b000 ---p 00108000 fd:01 8374953 /lib/x86_64-linux-gnu/libm-2.23.so 7fc3bf45b000-7fc3bf45c000 r--p 00107000 fd:01 8374953 /lib/x86_64-linux-gnu/libm-2.23.so 7fc3bf45c000-7fc3bf45d000 rw-p 00108000 fd:01 8374953 /lib/x86_64-linux-gnu/libm-2.23.so 7fc3bf45d000-7fc3bf5cf000 r-xp 00000000 fd:01 8375809 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21 7fc3bf5cf000-7fc3bf7cf000 ---p 00172000 fd:01 8375809 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21 7fc3bf7cf000-7fc3bf7d9000 r--p 00172000 fd:01 8375809 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21 7fc3bf7d9000-7fc3bf7db000 rw-p 0017c000 fd:01 8375809 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21 7fc3bf7db000-7fc3bf7df000 rw-p 00000000 00:00 0 7fc3bf7df000-7fc3bf990000 r-xp 00000000 fd:01 7358767 /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.3 7fc3bf990000-7fc3bfb8f000 ---p 001b1000 fd:01 7358767 /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.3 7fc3bfb8f000-7fc3bfb97000 r--p 001b0000 fd:01 7358767 /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.3 7fc3bfb97000-7fc3bfb99000 rw-p 001b8000 fd:01 7358767 /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.3 7fc3bfb99000-7fc3bfb9a000 rw-p 00000000 00:00 0 7fc3bfb9a000-7fc3bfbd0000 r-xp 00000000 fd:01 7346757 /root/pdfalto/image/png/libpng.so 7fc3bfbd0000-7fc3bfdd0000 ---p 00036000 fd:01 7346757 /root/pdfalto/image/png/libpng.so 7fc3bfdd0000-7fc3bfdd1000 r--p 00036000 fd:01 7346757 /root/pdfalto/image/png/libpng.so 7fc3bfdd1000-7fc3bfdd2000 rw-p 00037000 fd:01 7346757 /root/pdfalto/image/png/libpng.so 7fc3bfdd2000-7fc3bfdf8000 r-xp 00000000 fd:01 8374901 /lib/x86_64-linux-gnu/ld-2.23.so 7fc3bffa6000-7fc3bfff0000 rw-p 00000000 00:00 0 7fc3bfff5000-7fc3bfff7000 rw-p 00000000 00:00 0 7fc3bfff7000-7fc3bfff8000 r--p 00025000 fd:01 8374901 /lib/x86_64-linux-gnu/ld-2.23.so 7fc3bfff8000-7fc3bfff9000 rw-p 00026000 fd:01 8374901 /lib/x86_64-linux-gnu/ld-2.23.so 7fc3bfff9000-7fc3bfffa000 rw-p 00000000 00:00 0 7fffb26bb000-7fffb26dd000 rw-p 00000000 
00:00 0 [stack] 7fffb27ed000-7fffb27ef000 r--p 00000000 00:00 0 [vvar] 7fffb27ef000-7fffb27f1000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
When trying to build pdfalto in Ubuntu, it fails on XmlAltoOutputDev.h:
[ 57%] Building CXX object CMakeFiles/pdfalto.dir/src/pdfalto.cc.o
In file included from /root/pdfalto/src/pdfalto.cc:18:0:
/root/pdfalto/src/XmlAltoOutputDev.h:25:25: warning: extra tokens at end of #include directive
 #include <unordered_map>;
 ^
In file included from /usr/include/c++/5/unordered_map:35:0,
 from /root/pdfalto/src/XmlAltoOutputDev.h:25,
 from /root/pdfalto/src/pdfalto.cc:18:
/usr/include/c++/5/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
 #error This file requires compiler and library support \
 ^
/root/pdfalto/src/pdfalto.cc:43:24: warning: extra tokens at end of #include directive
 #include "TextString.h";
 ^
In file included from /root/pdfalto/xpdf-4.00/goo/parseargs.h:16:0,
 from /root/pdfalto/src/pdfalto.cc:6:
/root/pdfalto/xpdf-4.00/goo/gtypes.h:18:16: warning: non-static data member initializers only available with -std=c++11 or -std=gnu++11
 #define gFalse 0
 ^
/root/pdfalto/src/XmlAltoOutputDev.h:154:22: note: in expansion of macro 'gFalse'
 GBool fontType = gFalse; //Enumeration : serif (gTrue) or sans-serif(gFalse)
 ^
/root/pdfalto/xpdf-4.00/goo/gtypes.h:18:16: warning: non-static data member initializers only available with -std=c++11 or -std=gnu++11
 #define gFalse 0
 ^
/root/pdfalto/src/XmlAltoOutputDev.h:155:23: note: in expansion of macro 'gFalse'
 GBool fontWidth = gFalse; //Enumeration : proportional(gFalse) or fixed(gTrue)
 ^
/root/pdfalto/xpdf-4.00/goo/gtypes.h:18:16: warning: non-static data member initializers only available with -std=c++11 or -std=gnu++11
 #define gFalse 0
 ^
/root/pdfalto/src/XmlAltoOutputDev.h:158:20: note: in expansion of macro 'gFalse'
 GBool isbold = gFalse;
 ^
/root/pdfalto/xpdf-4.00/goo/gtypes.h:18:16: warning: non-static data member initializers only available with -std=c++11 or -std=gnu++11
 #define gFalse 0
 ^
/root/pdfalto/src/XmlAltoOutputDev.h:159:22: note: in expansion of macro 'gFalse'
 GBool isitalic = gFalse;
 ^
/root/pdfalto/xpdf-4.00/goo/gtypes.h:18:16: warning: non-static data member initializers only available with -std=c++11 or -std=gnu++11
 #define gFalse 0
 ^
/root/pdfalto/src/XmlAltoOutputDev.h:160:25: note: in expansion of macro 'gFalse'
 GBool issubscript = gFalse;
 ^
/root/pdfalto/xpdf-4.00/goo/gtypes.h:18:16: warning: non-static data member initializers only available with -std=c++11 or -std=gnu++11
 #define gFalse 0
 ^
/root/pdfalto/src/XmlAltoOutputDev.h:161:27: note: in expansion of macro 'gFalse'
 GBool issuperscript = gFalse;
 ^
In file included from /root/pdfalto/src/pdfalto.cc:18:0:
/root/pdfalto/src/XmlAltoOutputDev.h:1388:13: error: 'unordered_map' does not name a type
 typedef unordered_map<char*, unsigned int, Hash_Func, my_equal_to<char*> > my_unordered_map;
 ^
/root/pdfalto/src/XmlAltoOutputDev.h:1390:5: error: 'my_unordered_map' does not name a type
 my_unordered_map unicode_map;
 ^
/root/pdfalto/src/pdfalto.cc: In function 'int main(int, char**)':
/root/pdfalto/src/pdfalto.cc:180:41: warning: format not a string literal and no format arguments [-Wformat-security]
 fprintf(stderr, PDFALTO_NAME);
 ^
/root/pdfalto/src/pdfalto.cc:182:44: warning: format not a string literal and no format arguments [-Wformat-security]
 fprintf(stderr, PDFALTO_VERSION);
 ^
make[2]: *** [CMakeFiles/pdfalto.dir/src/pdfalto.cc.o] Error 1
make[1]: *** [CMakeFiles/pdfalto.dir/all] Error 2
CMakeFiles/pdfalto.dir/build.make:182: recipe for target 'CMakeFiles/pdfalto.dir/src/pdfalto.cc.o' failed
CMakeFiles/Makefile2:71: recipe for target 'CMakeFiles/pdfalto.dir/all' failed
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2
The command '/bin/sh -c make' returned a non-zero code: 2
It seems as if the latest updates cause this...
Description - We observed an invalid memory access in the function GfxIndexedColorSpace::mapColorToBase() located in GfxState.cc. It can be triggered by sending a crafted PDF file to the pdfalto binary. It allows an attacker to cause a denial of service (segmentation fault) or possibly have unspecified other impact.
Command - ./pdfalto -f 1 -l 2 -noText -noImage -outline -annotation -cutPages -blocks -readingOrder -ocr -fullFontName $POC
POC - REPRODUCER
Debug -
Gdb:
[ Legend: Modified register | Code | Heap | Stack | String ]
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── registers ────
$rax : 0xfffffffffffffffd
$rbx : 0x00000ffffffff9a6 → 0x0000000000000000
$rcx : 0xfffffffffffffffd
$rdx : 0x200000007fff7fff
$rsp : 0x00007fffffffccf0 → 0x00007fffffffcd30 → 0x0000000041b58ab3
$rbp : 0x00007fffffffcfd0 → 0x00007fffffffd100 → 0x00007fffffffd120 → 0x00007fffffffd2a0 → 0x00007fffffffd2d0 → 0x00007fffffffd330 → 0x00007fffffffd640 → 0x00007fffffffd750
$rsi : 0x3
$rdi : 0x0
$rip : 0x00000000005cd542 → <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> movzx edx, BYTE PTR [rdx]
$r8 : 0x00000000005cc2ea → <GfxICCBasedColorSpace::getDefaultRanges(double*,+0> push rbp
$r9 : 0x7a1a
$r10 : 0x0000602000073650 → 0xbebebebebebebe00
$r11 : 0x00007ffff7eec448 → 0x0000000000000000
$r12 : 0x00007fffffffcd30 → 0x0000000041b58ab3
$r13 : 0x00007fffffffcfb0 → 0x00000ffffffffa00 → 0x0000000000000000
$r14 : 0x00007fffffffcd30 → 0x0000000041b58ab3
$r15 : 0x00007fffffffd170 → 0x0000000041b58ab3
$eflags: [carry PARITY adjust zero sign trap INTERRUPT direction overflow RESUME virtualx86 identification]
$cs: 0x0033 $ss: 0x002b $ds: 0x0000 $es: 0x0000 $fs: 0x0000 $gs: 0x0000
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── stack ────
0x00007fffffffccf0│+0x0000: 0x00007fffffffcd30 → 0x0000000041b58ab3 ← $rsp
0x00007fffffffccf8│+0x0008: 0x00007fffffffd020 → 0x0000003000000020 → 0x0000000000000000
0x00007fffffffcd00│+0x0010: 0x000061700000e108 → 0x00007fff00000000
0x00007fffffffcd08│+0x0018: 0x000060400000d050 → 0x00000000009688d0 → 0x00000000005cc590 → <GfxIndexedColorSpace::~GfxIndexedColorSpace()+0> push rbp
0x00007fffffffcd10│+0x0020: 0x00000ffffffff9c4 → 0x0000000000000000
0x00007fffffffcd18│+0x0028: 0x0000000000000020
0x00007fffffffcd20│+0x0030: 0x00000003ffffffff → 0x0000000000000000
0x00007fffffffcd28│+0x0038: 0xfffffffffffffffd
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── code:x86:64 ────
0x5cd533 <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> enter 0x8948, 0xc2
0x5cd537 <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> shr rdx, 0x3
0x5cd53b <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> add rdx, 0x7fff8000
→ 0x5cd542 <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> movzx edx, BYTE PTR [rdx]
0x5cd545 <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> test dl, dl
0x5cd547 <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> setne sil
0x5cd54b <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> mov rdi, rax
0x5cd54e <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> and edi, 0x7
0x5cd551 <GfxIndexedColorSpace::mapColorToBase(GfxColor*,+0> cmp dil, dl
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── source:/home/aceteam/Downloads/sources/pdfalto/xpdf-4.00/xpdf/GfxState.cc+1149 ────
1144 } else if (k > indexHigh) {
1145 k = indexHigh;
1146 }
1147 p = &lookup[k * n];
1148 for (i = 0; i < n; ++i) {
// baseColor=0x00007fffffffccf8 → [...] → 0x0000000000000000, p=0x00007fffffffcd28 → 0xfffffffffffffffd, low=0x00007fffffffcd50 → 0x0000000000000000, range=0x00007fffffffce70 → 0x3ff0000000000000, i=0x0
→ 1149 baseColor->c[i] = dblToCol(low[i] + (p[i] / 255.0) * range[i]);
1150 }
1151 return baseColor;
1152 }
1153
1154 void GfxIndexedColorSpace::getGray(GfxColor *color, GfxGray *gray,
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── threads ────
[#0] Id 1, Name: "pdfalto", stopped, reason: SIGSEGV
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── trace ────
[#0] 0x5cd542 → GfxIndexedColorSpace::mapColorToBase(this=0x60400000d050, color=0x61700000e108, baseColor=0x7fffffffd020)
[#1] 0x5cdaa4 → GfxIndexedColorSpace::getRGB(this=0x60400000d050, color=0x61700000e108, rgb=0x7fffffffd190, ri=gfxRenderingIntentRelativeColorimetric)
[#2] 0x5f6b4f → GfxState::getFillRGB(this=0x61700000e080, rgb=0x7fffffffd190)
[#3] 0x445f21 → XmlAltoOutputDev::fill(this=0x61500000f300, state=0x61700000e080)
[#4] 0x6c4f54 → Gfx::opFill(this=0x60f00000e140, args=0x7fffffffd3d0, numArgs=0x0)
[#5] 0x6bc95f → Gfx::execOp(this=0x60f00000e140, cmd=0x7fffffffd390, args=0x7fffffffd3d0, numArgs=0x0)
[#6] 0x6bbf7a → Gfx::go(this=0x60f00000e140, topLevel=0x1)
[#7] 0x6bb562 → Gfx::display(this=0x60f00000e140, objRef=0x60800000bed0, topLevel=0x1)
[#8] 0x61cf67 → Page::displaySlice(this=0x60800000bea0, out=0x61500000f300, hDPI=72, vDPI=72, rotate=0x0, useMediaBox=0x0, crop=0x0, sliceX=0xffffffff, sliceY=0xffffffff, sliceW=0xffffffff, sliceH=0xffffffff, printing=0x0, abortCheckCbk=0x0, abortCheckCbkData=0x0)
[#9] 0x61c7af → Page::display(this=0x60800000bea0, out=0x61500000f300, hDPI=72, vDPI=72, rotate=0x0, useMediaBox=0x0, crop=0x1, printing=0x0, abortCheckCbk=0x0, abortCheckCbkData=0x0)
gef➤ p/d k * n
$24 = -3
gef➤ p &lookup[k * n]
$25 = (Guchar *) 0xfffffffffffffffd <error: Cannot access memory at address 0xfffffffffffffffd>
gef➤ p (p[i] / 255.0)
Cannot access memory at address 0xfffffffffffffffd
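The session above shows `k * n` evaluating to -3, so `&lookup[k * n]` points before the start of the table. The listing already clamps `k` against `indexHigh`, so one possible cause (not confirmed here) is a negative `indexHigh` in a malformed file. Below is a minimal sketch of a bounds-checked lookup. `safeLookupEntry` is a hypothetical helper, not part of xpdf, and the `std::vector` table is an assumption made to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not xpdf API): returns a pointer to the n-component
// palette entry for index k, or nullptr when no valid entry exists.
const unsigned char *safeLookupEntry(const std::vector<unsigned char> &lookup,
                                     int k, int n, int indexHigh) {
  // A malformed color space may declare no components or a negative
  // highest index; in that case there is nothing safe to return.
  if (n <= 0 || indexHigh < 0) {
    return nullptr;
  }
  // Clamp the index into [0, indexHigh], as the original listing intends.
  if (k < 0) {
    k = 0;
  } else if (k > indexHigh) {
    k = indexHigh;
  }
  // Verify that the whole n-byte entry lies inside the table.
  const std::size_t offset =
      static_cast<std::size_t>(k) * static_cast<std::size_t>(n);
  if (offset + static_cast<std::size_t>(n) > lookup.size()) {
    return nullptr;
  }
  return lookup.data() + offset;
}
```

A caller would need to handle the `nullptr` case, for example by falling back to a default color instead of dereferencing the pointer.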
This is a suggestion from the user @dlaurie:
linebreaks except when they would be significant (pdftohtml -xml did that), elision of unnecessary attributes, e.g. rotation=0, angle=0.
I found two bugs in pdfalto; the details can be found here
This is the text from the PDF
and this is the resulting text:
the 2010Hawaii Ironman Triathlon consisting of 3.8 km swimming, 180 km cycling and 42 km running, in less than 16 h ( ). That same year, and at the same event, a 75 year old female triathlete finished the race in http://ironmanworldchampionship.com/results/ 16h 20min.
The URL http://ironmanworldchampionship.com/results/ should be inside the parentheses, after "16 h".
I am opening this so that I do not forget.
A CI build would avoid potential issues due to static library building and portability.
Hi,
I tried to build pdfalto on two different operating systems and hit the same problem both times.
At the last step, I get a link error related to the splash library.
I think it happens because no ".o" files were generated for splash, but I am not comfortable with CMake files.
Has anyone reported something similar? Any ideas?
...
[ 66%] Linking CXX executable pdfalto
CMakeFiles/pdfalto.dir/src/XmlAltoOutputDev.cc.o: In function `XmlAltoOutputDev::~XmlAltoOutputDev()':
XmlAltoOutputDev.cc:(.text+0x185f0): undefined reference to `SplashFontEngine::~SplashFontEngine()'
XmlAltoOutputDev.cc:(.text+0x18620): undefined reference to `Splash::~Splash()'
CMakeFiles/pdfalto.dir/src/XmlAltoOutputDev.cc.o: In function `XmlAltoOutputDev::startDoc(XRef*)':
XmlAltoOutputDev.cc:(.text+0x19c42): undefined reference to `SplashFontEngine::~SplashFontEngine()'
...
CMakeFiles/pdfalto.dir/src/XmlAltoOutputDev.cc.o:(.rodata._ZTI19SplashOutFontFileID[_ZTI19SplashOutFontFileID]+0x10): undefined reference to `typeinfo for SplashFontFileID'
collect2: error: ld returned 1 exit status
make[2]: *** [pdfalto] Error 1
make[1]: *** [CMakeFiles/pdfalto.dir/all] Error 2
make: *** [all] Error 2
Here are some examples of PDF parsing failures from PubMed Central reusable set 1942.
Uploading trf0051-0558.pdf…
Uploading 1746-1340-18-26.pdf…
Uploading 1617-9625-9-4.pdf…