Comments (5)
Attached a verbose (-V 2) logfile:
log.txt
from ocrmypdf.
Can't reproduce here. Possibly, this is a tesseract bug.
What is the output of `tesseract --version` on the machine that produced the issue?
$ tesseract --version
tesseract 4.1.1
leptonica-1.82.0
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
Found SSE
Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
Can you try upgrading to tesseract 5.x?
For Ubuntu, here is the PPA:
https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr5
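If it helps to check the installed version from a script before and after upgrading, here is a rough Python sketch. It only assumes that the first line of `tesseract --version` looks like `tesseract 4.1.1` (optionally with a `v` prefix); the function names `tesseract_major_version` and `installed_tesseract_major` are made up for this example and are not part of ocrmypdf or tesseract.

```python
import re
import shutil
import subprocess
from typing import Optional


def tesseract_major_version(version_output: str) -> int:
    """Extract the major version from the text printed by `tesseract --version`.

    Hypothetical helper: expects a line such as "tesseract 4.1.1" or
    "tesseract v5.3.0" somewhere in the output.
    """
    match = re.search(r"^tesseract\s+v?(\d+)\.", version_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("could not find a tesseract version line")
    return int(match.group(1))


def installed_tesseract_major() -> Optional[int]:
    """Return the major version of the tesseract on PATH, or None if not found."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds write the version banner to stderr, so check both streams.
    return tesseract_major_version(result.stdout + result.stderr)


if __name__ == "__main__":
    major = installed_tesseract_major()
    if major is None:
        print("tesseract not found on PATH")
    elif major < 5:
        print(f"tesseract {major}.x found; consider upgrading to 5.x")
    else:
        print(f"tesseract {major}.x found")
```

Fed the two `tesseract --version` outputs posted in this thread, `tesseract_major_version` gives 4 for the machine that showed the problem and 5 after the upgrade.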
Yep, that solves it, thanks!
tesseract 5.3.4
leptonica-1.82.0
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.17