Comments (4)
Does this mean that the 1071 failing PDFs cannot be processed by pdfalto at all, or only not in combination with GROBID? I did see some errors concerning OCR-ised characters when trying to process a specific PDF; is this what you are talking about?
from pdfalto.
Yes, in combination with GROBID, which also makes it possible to test the ALTO XML output. This is a regression test against pdf2xml.
I am not talking about OCR of unresolved character codes, which was not present in pdf2xml anyway.
from pdfalto.
We're improving :)
We now have 356 PDFs out of 1942 with errors. Most of them are caused by invalid XML characters in attribute content.
There are still some PDF parsing failures; I am updating the corresponding issue with some examples.
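To illustrate the "invalid XML character in attribute content" class of error, here is a minimal Python sketch of how such characters can be filtered before a value is written into an attribute. It is not pdfalto's actual fix (pdfalto is written in C++), and the function name `strip_invalid_xml_chars` is hypothetical. It only assumes the XML 1.0 rule that a document may contain tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Everything outside the XML 1.0 "Char" production is rejected by XML parsers,
# even inside attribute values.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(value: str, replacement: str = "") -> str:
    """Hypothetical helper: remove (or replace) characters not allowed in XML 1.0."""
    return _INVALID_XML_CHARS.sub(replacement, value)


if __name__ == "__main__":
    # A text run as it might come out of a PDF content stream, with a stray
    # control character (form feed) in it.
    raw = "Fig.\x0c 1 & 2"
    cleaned = strip_invalid_xml_chars(raw)

    # quoteattr() escapes markup characters and adds the surrounding quotes.
    element = "<String CONTENT=%s/>" % quoteattr(cleaned)

    # The element now parses, and the attribute round-trips without the control character.
    parsed = ET.fromstring(element)
    print(parsed.get("CONTENT"))
```

Dropping the character is the simplest policy; passing a `replacement` such as U+FFFD keeps character offsets stable if downstream tools rely on them.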
from pdfalto.
I now have 100% success, great!
Regarding the metrics, there is apparently some loss: 1-2% on field accuracy, a bit everywhere. I will investigate whether it comes from the modification of the content stream or from problems with character composition.
from pdfalto.