OCRmyPDF EasyOCR plugin

License: MIT License

ocrmypdf-easyocr's Introduction

OCRmyPDF EasyOCR

This is a plugin to run OCRmyPDF with the EasyOCR engine instead of Tesseract OCR, the default OCR engine for OCRmyPDF. Since EasyOCR is based on PyTorch, it can make use of Nvidia GPUs. Hopefully it will be more performant and accurate than Tesseract OCR.

It is currently experimental and does not implement all of the features of OCRmyPDF with Tesseract, and still relies on Tesseract for certain operations.

Installation

To use this plugin, first install PyTorch according to the official instructions, which may differ for your platform.

Then install OCRmyPDF-EasyOCR to the same virtual environment or conda environment as you installed PyTorch:

pip install git+https://github.com/ocrmypdf/OCRmyPDF-EasyOCR.git

The OCRmyPDF-EasyOCR plugin will override Tesseract for OCR; however, OCRmyPDF still depends on Tesseract for some tasks.

If Celery's multiprocessing is installed in the virtual environment, it will be used instead of the standard Python multiprocessing. This allows paperless-ngx, which uses Celery, to function correctly.
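
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: it assumes that "Celery's multiprocessing" refers to the billiard package (Celery's fork of the standard multiprocessing module), and the helper name pick_multiprocessing is hypothetical.

```python
import importlib


def pick_multiprocessing():
    """Return billiard if it is importable, else the standard multiprocessing module."""
    try:
        # Assumption: billiard is the Celery-maintained multiprocessing fork.
        return importlib.import_module("billiard")
    except ImportError:
        # billiard is not installed, so fall back to the standard library.
        return importlib.import_module("multiprocessing")


mp = pick_multiprocessing()
print(f"Using multiprocessing implementation: {mp.__name__}")
```

Both modules expose a similar interface (Process, Pool and so on), which is why swapping one for the other at import time is a plausible design.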

Troubleshooting

If you see the log message "Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.", then PyTorch is not installed, or is installed without GPU support.
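
To check what PyTorch can see in the environment where OCRmyPDF runs, a small diagnostic along these lines may help. It is a sketch only: the function name describe_torch_device and its return strings are made up for illustration, and it relies on torch.cuda.is_available() and torch.backends.mps.is_available(), the latter being present only in reasonably recent PyTorch releases.

```python
import importlib.util


def describe_torch_device() -> str:
    """Report which compute device PyTorch would be able to use."""
    # Avoid an ImportError if PyTorch is missing from this environment.
    if importlib.util.find_spec("torch") is None:
        return "torch-missing"

    import torch

    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no torch.backends.mps attribute at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(describe_torch_device())
```

Run it with the same Python interpreter that runs ocrmypdf; a result of "cpu" or "torch-missing" would be consistent with the log message above.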

To do

Contributions, especially pull requests, are quite welcome!

At the moment this plugin is in alpha status and is missing some essential features:

Tesseract is still required to determine page skew and for orientation correction
EasyOCR is effectively single-threaded, to eliminate race conditions
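
One common way to make a shared resource effectively single-threaded is to guard it with a semaphore that admits one caller at a time. The sketch below illustrates that general technique only; it is not taken from the plugin. The name GPU_SEMAPHORE echoes a profiling snippet quoted in the issues, while run_serialized and ocr_fn are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# A semaphore with a single permit: only one OCR call may run at any moment.
GPU_SEMAPHORE = threading.Semaphore(1)


def run_serialized(pages, ocr_fn, max_workers=4):
    """Apply ocr_fn to every page from a thread pool, one call at a time."""

    def guarded(page):
        # Workers queue here, so calls to ocr_fn never overlap.
        with GPU_SEMAPHORE:
            return ocr_fn(page)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of the pages.
        return list(pool.map(guarded, pages))


# Stand-in for a real OCR call, used here only for demonstration.
print(run_serialized([1, 2, 3], lambda page: f"text of page {page}"))
```

The rest of the pipeline can still work on several pages concurrently; only the guarded call is serialized.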

ocrmypdf-easyocr's Issues

Fast switch engines

How can I switch back to Tesseract Engine without having to remove this plugin from env?

Error: NotImplementedError: EasyOCR does not support hOCR output

I got this error when I ran ocrmypdf input.pdf output.pdf
I'm pretty sure this is the intended result, but is there any way to try this plugin?

Need to know : how to use this script

I am trying to use this script but couldn't find a way to do so.

Acceleration of the launch

I noticed that creating easyocr.Reader(languages, use_gpu) in def _ocr_process takes more than a second each time recognition starts. It would be good to do this in advance if possible, so that no extra second is wasted when a recognition request arrives.

italic text in Bulgarian is not recognised

It appears the EasyOCR model for the Bulgarian language is not good. It doesn't recognise italic fonts and produces garbage. Bold and normal fonts are OK. Underlined text produces underscores in the spaces between words.

How do you switch back to tesseract for a particular language?

Newlines missing in sidecar

I installed this easyocr version via pipx and I went to compare a bunch of files between the original ocrmypdf and this one, and found that while easyocr is WAY more accurate at getting the letters right, the sidecar is all one line. Less than ideal and sounds like a bug to me.

If I pdftotext the pdf, it comes out on multiple lines. But the sidecar is jacked.

To reproduce, use --sidecar. I can provide a JPG if you want.

Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.

While running ocrmypdf with EasyOCR as the engine I get the following output. When using EasyOCR outside of ocrmypdf I have not had any problems using the GPU.

Start processing 11 pages concurrently    _sync.py:259
1 redoing OCR    _pipeline.py:309
2 redoing OCR    _pipeline.py:309
3 redoing OCR    _pipeline.py:309
4 redoing OCR    _pipeline.py:309
5 redoing OCR    _pipeline.py:309
6 redoing OCR    _pipeline.py:309
7 redoing OCR    _pipeline.py:309
8 redoing OCR    _pipeline.py:309
9 redoing OCR    _pipeline.py:309
10 redoing OCR    _pipeline.py:309
11 redoing OCR    _pipeline.py:309
11 Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.    easyocr.py:80
4 Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.    easyocr.py:80

Error: Plugin already registered

After installing the plugin, I tried running it using this command:

ocrmypdf -l eng --plugin ocrmypdf_easyocr input.pdf output.pdf

But it throws an error: ValueError: Plugin already registered under a different name: ocrmypdf_easyocr=<module 'ocrmypdf_easyocr'

PyPI release

Hello,

I'm currently tweaking paperless-ngx, which already uses OCRmyPDF, to use EasyOCR.
Would you mind releasing a PyPI version of this package, so it can easily be integrated into a pip file?

Thanks.

ERROR: File "setup.py" not found. when pip installing

Hi,

Firstly, thanks a lot for the development. EasyOCR has been performing better for my use case, so it's very nice to have it integrated into OCRmyPDF, another great project.

I am having trouble making it run though, and I am not sure whether this is a bug or simply me not knowing how to run a plugin with OCRmyPDF.

I have cloned https://github.com/ocrmypdf/OCRmyPDF-EasyOCR, then I cd into the directory and run

python -m venv .venv
.venv\Scripts\activate
pip install -e .

At this point I get the error

ERROR: File "setup.py" not found. Directory cannot be installed in editable mode: C:\Users\istaka\OneDrive - Schüßler-Plan GmbH\Dev\OCRmyPDF-EasyOCR
(A "pyproject.toml" file was found, but editable mode currently requires a setup.py based build.)

Am I doing something wrong?

Share `easyocr.Reader` instance

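With simple profiling code:

        # Fragment from inside the plugin's OCR function; assumes `import time`
        # and `import easyocr` at module level.
        with GPU_SEMAPHORE:
            s0 = time.time()
            # Constructing the reader loads the models; this is the step being timed.
            reader = easyocr.Reader(languages, gpu=options.gpu)
            s1 = time.time()
            raw_results = reader.readtext(gray)
            s2 = time.time()
            print('reader init: %.1fs, readtext: %.1fs' % (s1 - s0, s2 - s1))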

The time it takes to construct easyocr.Reader is quite significant.
If the code could reuse the reader object across pages, OCR time could be cut considerably.
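
A reuse scheme of the kind suggested here could look roughly like the following. This is a sketch under assumptions, not the plugin's implementation: ReaderCache and its factory argument are hypothetical names, and the cache only helps within a single process, so it would not share readers between separate worker processes.

```python
import threading


class ReaderCache:
    """Build one reader per (languages, gpu) combination and reuse it afterwards."""

    def __init__(self, factory):
        # factory(languages, gpu) must return a new reader object.
        self._factory = factory
        self._readers = {}
        self._lock = threading.Lock()

    def get(self, languages, gpu):
        key = (tuple(languages), bool(gpu))
        with self._lock:
            # Pay the construction cost only on the first request for this key.
            if key not in self._readers:
                self._readers[key] = self._factory(list(languages), gpu)
            return self._readers[key]


# With EasyOCR installed, the factory could wrap the real constructor:
#   import easyocr
#   cache = ReaderCache(lambda langs, gpu: easyocr.Reader(langs, gpu=gpu))
#   reader = cache.get(["en"], gpu=True)
```

Passing the constructor in as a factory keeps the cache independent of EasyOCR itself, which also makes it easy to exercise with a stand-in object.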

Non multiprocessing mode

Hello,

While trying to make OCRmyPDF-EasyOCR work in paperless-ngx, I realized that I cannot open a multiprocessing pool inside another multiprocessing pool.

I've tried running your EasyOCR plugin from git master and from commit e4a010d in the hope of getting rid of the multiprocessing, but I'm always stuck with the error message AssertionError: daemonic processes are not allowed to have children, since I run OCRmyPDF in a multiprocessing pool.

Is there any chance you could look into that issue, so that the EasyOCR plugin becomes the de facto standard for paperless-ngx?

Best regards.
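
For context, the error above arises because worker processes in a standard multiprocessing pool are daemonic, and Python does not allow a daemonic process to start children. One possible workaround, sketched here purely as an assumption and not as what the plugin does, is to detect that situation and use threads instead of a nested process pool. The function name pick_executor_class is hypothetical.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def pick_executor_class(is_daemon=None):
    """Choose an executor type that is safe to create in the current process."""
    if is_daemon is None:
        # Pool workers report daemon=True; a normal main process reports False.
        is_daemon = multiprocessing.current_process().daemon
    # A daemonic process may not spawn children, so fall back to threads there.
    return ThreadPoolExecutor if is_daemon else ProcessPoolExecutor


print(pick_executor_class().__name__)
```

The optional is_daemon argument exists only so the choice can be exercised without actually running inside a pool worker.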