Comments (8)
This is a Tesseract issue; we will need to wait for them to resolve it:
tesseract-ocr/tesseract#4257
from ocrmypdf.
Since the issue is with Tesseract itself, downgrading is the only option at the moment
from ocrmypdf.
The bug is in the legacy engine.
> Since the issue is with Tesseract itself, downgrading is the only option at the moment
It's not the only option, unless you want Tesseract to use the legacy engine.
You can bypass this bug by using a model from the tessdata_fast repo or by using `--oem 1`.
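A minimal sketch of that workaround when calling Tesseract directly from Python. The helper name `build_tesseract_cmd`, the file names, and the optional `tessdata_dir` path are hypothetical; only the `--oem`, `-l`, and `--tessdata-dir` flags come from the Tesseract command line. This is not how ocrmypdf itself invokes Tesseract.

```python
def build_tesseract_cmd(image, output_base, lang="eng", tessdata_dir=None):
    """Build a tesseract command line that forces the LSTM engine.

    --oem 1 selects the LSTM-only engine, so the legacy engine is not used.
    """
    cmd = ["tesseract", str(image), str(output_base), "--oem", "1", "-l", lang]
    if tessdata_dir is not None:
        # Hypothetical path to a local copy of the tessdata_fast models.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


# Example: the resulting list can be passed to subprocess.run(cmd, check=True).
cmd = build_tesseract_cmd("page.png", "page")
print(" ".join(cmd))
```

The command is returned as a list rather than a string so it can be handed to `subprocess.run` without shell quoting.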
from ocrmypdf.
@amitdo ocrmypdf uses orientation and script detection (osd.traineddata) which currently only has the legacy option even in tessdata_fast. Your workaround will help people looking to get tesseract 5.4.0 working on OCR (without using any feature that requires page orientation detection) but it's not a full solution.
For maintainers looking for a full solution that passes the test suite, unfortunately ocrmypdf with tesseract 5.4.0 is not workable and will have to wait for 5.4.1.
from ocrmypdf.
Yeah, I forgot about OSD.
from ocrmypdf.
I just updated tesseract to version 5.4.1-1 on Arch Linux and the problem is gone.
from ocrmypdf.
Works for me as well, on Arch Linux with ocrmypdf 16.3.1-1 and tesseract 5.4.1-1.
from ocrmypdf.
As of ocrmypdf 16.4.0, we refuse to use Tesseract 5.4.0. Tesseract 5.4.1 works with any ocrmypdf version.
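For anyone who wants a similar guard in their own scripts, here is a rough sketch of a version check. It is an illustration only, not ocrmypdf's actual implementation; the names `BLOCKED_VERSIONS`, `parse_tesseract_version`, and `is_usable` are made up, and it assumes the first line of `tesseract --version` looks like `tesseract 5.4.1`.

```python
import re

# Versions known from this thread to be broken (assumption: exact match only).
BLOCKED_VERSIONS = {(5, 4, 0)}


def parse_tesseract_version(version_output):
    """Extract (major, minor, patch) from the text printed by `tesseract --version`."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("could not find a tesseract version in the output")
    return tuple(int(part) for part in match.groups())


def is_usable(version_output):
    """Return False when the reported version is on the block list."""
    return parse_tesseract_version(version_output) not in BLOCKED_VERSIONS
```

The version text could come from `subprocess.run(["tesseract", "--version"], capture_output=True, text=True)`; some builds print it to stderr rather than stdout, so it may be worth checking both.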
from ocrmypdf.