paperless-ocrmypdf

A Docker Compose recipe for The Paperless Project + OCRmyPDF that uses inotify to detect new files and process them.

I wanted to archive processed documents and retry OCR via img2pdf when OCRmyPDF fails with DpiError, so I rolled my own script instead.

Limitations

This image relies on inotify events from the host system propagating into the container. This works when the host is Linux, but not when the host is Windows (see, for example, http://blog.subjectify.us/miscellaneous/2017/04/24/docker-for-windows-watch-bindings.html).

Note that since late 2019 the OCRmyPDF Docker image has included watcher.py (based on the Python watchdog module), so you might consider using it instead, even though it depends on filesystem polling.
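To make the difference between inotify events and filesystem polling concrete, here is a minimal, standard-library-only sketch of polling. It is not watcher.py and does not use the watchdog API; the function names snapshot and poll_new_files are hypothetical and exist only for this illustration.

```python
import os


def snapshot(directory):
    """Return the set of PDF file names currently present in `directory`."""
    return {
        name for name in os.listdir(directory)
        if name.lower().endswith(".pdf")
    }


def poll_new_files(directory, seen):
    """Return PDF names that were not present on any earlier call.

    `seen` is a set that is updated in place, so each file is reported once.
    """
    current = snapshot(directory)
    new_files = sorted(current - seen)
    seen |= current
    return new_files
```

A watcher built this way would call poll_new_files in a loop with a sleep between iterations. Unlike inotify, it does not depend on events crossing the container boundary, at the cost of repeated directory listings and some delay before a new file is noticed.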

How does it work

This is a file-based workflow, organized as a set of folders inside "scans":

  • PDFs to be OCRed are put into "in"

  • An inotify-based script picks them up and passes them to OCRmyPDF

  • OCRmyPDF does its job, temporarily creating files in "ocr"

  • Once a file is processed, the original is moved from "in" to "archive", and the OCRed document is put into "ocr-ed"

  • Paperless picks it up from "ocr-ed" and moves it into "documents"

If you have PDFs that do not need OCR, inject them into the middle of this pipeline by putting them in "ocr-ed".
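The folder hand-offs described above can be sketched in Python as follows. This is an illustration of the workflow, not the script shipped in this repository: the function names process_pdf and run_ocrmypdf are hypothetical, and the only external call assumed is the basic OCRmyPDF command line form, ocrmypdf <input> <output>.

```python
import shutil
import subprocess
from pathlib import Path


def run_ocrmypdf(src, dst):
    """Invoke the OCRmyPDF command line as `ocrmypdf <input> <output>`."""
    subprocess.run(["ocrmypdf", str(src), str(dst)], check=True)


def process_pdf(pdf, scans, ocr=run_ocrmypdf):
    """Take one PDF from "in" through "ocr" into "ocr-ed" and "archive"."""
    scans = Path(scans)
    pdf = Path(pdf)
    work = scans / "ocr" / pdf.name          # temporary output location
    done = scans / "ocr-ed" / pdf.name       # where Paperless picks files up
    archived = scans / "archive" / pdf.name  # where the original is kept

    for folder in (work.parent, done.parent, archived.parent):
        folder.mkdir(parents=True, exist_ok=True)

    ocr(pdf, work)                           # an exception here stops before any move
    shutil.move(str(work), str(done))        # publish the OCRed document
    shutil.move(str(pdf), str(archived))     # archive the original
    return done
```

Writing to "ocr" first and moving the result afterwards means Paperless never sees a half-written file in "ocr-ed". The ocr parameter is there so the OCR step can be replaced, for example by a stub in tests.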

Configuration

Move the "config" and "scans" folders somewhere on your filesystem.

Change the paths in .env to point to the locations of "config" and "scans".

If you need extra languages, configure them in docker-compose.yml and modify the Dockerfile to install them into the ocrmypdf container. The Dockerfile currently installs English, Russian and Ukrainian.
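As a rough illustration of what the language setting ends up controlling, OCRmyPDF accepts several Tesseract language codes joined with "+" through its -l option. The helper below is a hypothetical sketch that builds such a command line; how this repository actually passes languages from docker-compose.yml to the script is not shown here and may differ.

```python
def ocrmypdf_command(src, dst, languages=("eng", "rus", "ukr")):
    """Build an OCRmyPDF argument list for the given Tesseract language codes."""
    if not languages:
        raise ValueError("at least one language code is required")
    # OCRmyPDF takes multiple languages as a single "+"-separated value.
    return ["ocrmypdf", "-l", "+".join(languages), str(src), str(dst)]
```

The defaults are the Tesseract codes for English, Russian and Ukrainian. Each code only works if the matching Tesseract language data is installed in the container, which is why the Dockerfile has to change along with the configuration.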

Run "docker-compose up -d" and navigate to http://localhost:8000 to configure Paperless.

paperless-ocrmypdf's People

Contributors

adept, blackerking, ontje

paperless-ocrmypdf's Issues

not scanning input folder

I am trying to run the whole thing.
Everything works, except that the input folder is not scanned.
It prints:
Setting up watches.
Watches established.

but it doesn't recognize any files.

ocrmypdf not able to install extra language

Hi,
I want to install Dutch for Tesseract in ocrmypdf. However, apt-get install tesseract-ocr-nld gives me an error:

Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)                          
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?

I tried sudo, but that command is not installed.
I tried 'su', but don't know the password.
The ocrmypdf help does not mention a password.
How do I get past this?

Thank you!
