
ocrmypdf-easyocr's Introduction

OCRmyPDF EasyOCR

This is a plugin to run OCRmyPDF with the EasyOCR engine instead of Tesseract OCR, the default OCR engine for OCRmyPDF. Since EasyOCR is based on PyTorch, it can make use of Nvidia GPUs. The hope is that it will be faster and more accurate than Tesseract OCR.

It is currently experimental and does not implement all of the features of OCRmyPDF with Tesseract, and still relies on Tesseract for certain operations.

Installation

To use this plugin, first install PyTorch according to the official instructions, which may differ for your platform.

Then install OCRmyPDF-EasyOCR to the same virtual environment or conda environment as you installed PyTorch:

pip install git+https://github.com/ocrmypdf/OCRmyPDF-EasyOCR.git

Once installed, OCRmyPDF-EasyOCR overrides Tesseract as the OCR engine; however, OCRmyPDF still depends on Tesseract for some tasks.

If Celery's multiprocessing is installed in the virtual environment, it will be used instead of the standard Python multiprocessing. This allows paperless-ngx, which uses Celery, to function correctly.
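The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: it assumes that "Celery's multiprocessing" refers to the `billiard` package, and the helper name `pick_multiprocessing` is hypothetical.

```python
import importlib


def pick_multiprocessing():
    """Return billiard if it is importable, else the standard multiprocessing module."""
    try:
        # Assumption: billiard is the multiprocessing fork that Celery depends on.
        return importlib.import_module("billiard")
    except ImportError:
        return importlib.import_module("multiprocessing")


mp = pick_multiprocessing()
print(f"Using {mp.__name__} for worker processes")
```

Both modules expose a compatible `Process` class, which is why one can stand in for the other.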

Troubleshooting

If you see the log message "Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.", then PyTorch is not installed, or is installed without GPU support, in the environment that OCRmyPDF runs in.
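One way to check what PyTorch can see is a short script like the one below, run with the same interpreter that runs OCRmyPDF. It is a diagnostic sketch only; the function name `detect_device` is hypothetical and is not part of the plugin.

```python
def detect_device():
    """Report which accelerator PyTorch can see: 'cuda', 'mps', 'cpu', or 'no-torch'."""
    try:
        import torch
    except ImportError:
        return "no-torch"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in newer PyTorch builds, so probe defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(detect_device())
```

A result of `cpu` or `no-torch` here is consistent with the log message above; in that case, reinstall PyTorch for your platform following the official instructions.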

To do

Contributions, especially pull requests, are quite welcome!

At the moment this plugin is alpha status and missing some essential features:

  • Tesseract is still required to determine page skew and for orientation correction
  • EasyOCR is effectively single-threaded, to eliminate race conditions

ocrmypdf-easyocr's People

Contributors

deajan, jbarlow83, phu54321, rakurtz


ocrmypdf-easyocr's Issues

Fast switch engines

How can I switch back to Tesseract Engine without having to remove this plugin from env?

Acceleration of the launch

I noticed that creating easyocr.Reader(languages, use_gpu) in def _ocr_process takes more than a second each time recognition starts. It would be good to create it in advance if possible, so that a recognition request does not waste an extra second.

italic text in Bulgarian is not recognised

It appears the EasyOCR model for Bulgarian is not good. It doesn't recognise italic text and produces garbage. Bold and normal fonts are OK. Underlined text produces underscores in the spaces between words.

How do you switch back to tesseract for a particular language?

Newlines missing in sidecar

I installed this easyocr version via pipx and I went to compare a bunch of files between the original ocrmypdf and this one, and found that while easyocr is WAY more accurate at getting the letters right, the sidecar is all one line. Less than ideal and sounds like a bug to me.

If I pdftotext the pdf, it comes out on multiple lines. But the sidecar is jacked.

To reproduce, use --sidecar. I can provide a jpg if you want.

Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.

While running ocrmypdf with EasyOCR as an engine I get the following output. When using EasyOCR not as part of ocrmypdf I have actually not had any problems with using the GPU.

Start processing 11 pages concurrently    _sync.py:259
1 redoing OCR    _pipeline.py:309
2 redoing OCR    _pipeline.py:309
3 redoing OCR    _pipeline.py:309
4 redoing OCR    _pipeline.py:309
5 redoing OCR    _pipeline.py:309
6 redoing OCR    _pipeline.py:309
7 redoing OCR    _pipeline.py:309
8 redoing OCR    _pipeline.py:309
9 redoing OCR    _pipeline.py:309
10 redoing OCR    _pipeline.py:309
11 redoing OCR    _pipeline.py:309
11 Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.    easyocr.py:80
4 Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.    easyocr.py:80

Error: Plugin already registered

After installing the plugin, I tried running it using this command:

ocrmypdf -l eng --plugin ocrmypdf_easyocr input.pdf output.pdf

But it throws an error: ValueError: Plugin already registered under a different name: ocrmypdf_easyocr=<module 'ocrmypdf_easyocr'

PyPI release

Hello,

I'm currently tweaking paperless-ngx, which already uses OCRmyPDF, to use EasyOCR.
Would you mind releasing a PyPI version of this package, so it can easily be integrated into a pip file?

Thanks.

ERROR: File "setup.py" not found. when pip installing

Hi,

Firstly, thanks a lot for the development. EasyOCR has been performing better for my use case, so it's very nice to have it integrated into OCRmyPDF, another great project.

I am having trouble making it run though, and I am not sure whether this is a bug or simply me not knowing how to run a plugin with OCRmyPDF.

I have cloned https://github.com/ocrmypdf/OCRmyPDF-EasyOCR, then I cd into the directory and then run

python -m venv .venv
.venv\Scripts\activate
pip install -e .

At this point I get the error

ERROR: File "setup.py" not found. Directory cannot be installed in editable mode: C:\Users\istaka\OneDrive - Schüßler-Plan GmbH\Dev\OCRmyPDF-EasyOCR
(A "pyproject.toml" file was found, but editable mode currently requires a setup.py based build.)

Am I doing something wrong?

Share `easyocr.Reader` instance

With simple profiling code:

        with GPU_SEMAPHORE:
            s0 = time.time()
            reader = easyocr.Reader(languages, gpu=options.gpu)
            s1 = time.time()
            raw_results = reader.readtext(gray)
            s2 = time.time()
            print('reader init: %.1fs, readtext: %.1fs' % (s1 - s0, s2 - s1))

Time it takes to construct easyocr.Reader is quite significant.
If the code can re-use reader object across pages, OCRing time could be cut considerably.
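One possible way to reuse the reader is to cache it per (languages, gpu) pair, as in the sketch below. This is an illustration under stated assumptions, not the plugin's implementation: `cached_reader_factory` and `fake_reader` are hypothetical names, and a stand-in constructor is used so the example runs without EasyOCR installed. With EasyOCR available, `easyocr.Reader` would be passed in instead.

```python
import functools


def cached_reader_factory(make_reader):
    """Wrap a reader constructor so each (languages, gpu) pair is built once per process."""

    @functools.lru_cache(maxsize=None)
    def _cached(languages, gpu):
        return make_reader(list(languages), gpu=gpu)

    def get_reader(languages, gpu=True):
        # lru_cache needs hashable arguments, so normalise the list to a tuple.
        return _cached(tuple(languages), bool(gpu))

    return get_reader


# Stand-in constructor (hypothetical) that records how often it is called.
calls = []


def fake_reader(languages, gpu):
    calls.append((tuple(languages), gpu))
    return object()


get_reader = cached_reader_factory(fake_reader)
first = get_reader(["en"], gpu=False)
second = get_reader(["en"], gpu=False)
print(first is second, len(calls))
```

Note that such a cache lives inside a single process. If pages are handled by separate worker processes, each worker would still construct its own reader once.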


Non multiprocessing mode

Hello,

While trying to make OCRmyPDF-EasyOCR work in paperless-ngx, I realized that I cannot open a multiprocess pool inside another multiprocess pool.

I've tried running your EasyOCR plugin at git master and at commit e4a010d in the hope of getting rid of the multiprocessing, but I'm always stuck with the error message AssertionError: daemonic processes are not allowed to have children, since I run OCRmyPDF in a multiprocess pool.

Any chance you could dive into that issue, so that the EasyOCR plugin becomes the de facto standard for paperless-ngx?

Best regards.
