
Comments (8)

jonaswinkler commented on May 21, 2024

I've just merged this into dev.

And there's an option to skip storing converted documents if the original already has text.

from paperless-ng.

jonaswinkler commented on May 21, 2024

See https://github.com/jbarlow83/OCRmyPDF

and https://github.com/mimimi1968/paperless


jonaswinkler commented on May 21, 2024

The solution should also take already existing documents into account and transform these as well.


jonaswinkler commented on May 21, 2024

The integration works but apparently needs more tests and more configuration. As of now, Paperless will offer:

  • --skip, --redo, and --force as three different modes for OCR, defaults to skip
  • output_type, defaults to pdfa
  • pages, to only OCR the first n pages.
  • clean is turned on by default.
  • language, with no default.

I suppose we need an additional interface so that users can specify whatever they want.
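As a rough illustration of how the options listed above could be collected in one place, here is a minimal Python sketch. The function name `build_ocr_kwargs` is hypothetical and not taken from Paperless. The key names (`skip_text`, `redo_ocr`, `force_ocr`, `output_type`, `pages`, `clean`, `language`) are meant to mirror ocrmypdf's options, but that mapping is an assumption and should be checked against the installed ocrmypdf version.

```python
from typing import Any, Dict, Optional

# Hypothetical mapping from the three OCR modes to option names.
# The option names are assumed to match ocrmypdf; verify before use.
MODE_TO_OPTION = {
    "skip": "skip_text",
    "redo": "redo_ocr",
    "force": "force_ocr",
}


def build_ocr_kwargs(
    mode: str = "skip",
    output_type: str = "pdfa",
    pages: Optional[int] = None,
    clean: bool = True,
    language: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect the OCR settings described above into one dictionary."""
    if mode not in MODE_TO_OPTION:
        raise ValueError(f"unknown OCR mode: {mode!r}")
    kwargs: Dict[str, Any] = {
        MODE_TO_OPTION[mode]: True,
        "output_type": output_type,
        "clean": clean,
    }
    if pages is not None:
        if pages < 1:
            raise ValueError("pages must be at least 1")
        # Only OCR the first n pages, expressed as a page range.
        kwargs["pages"] = f"1-{pages}"
    if language:
        # No default language: only pass it on when the user set one.
        kwargs["language"] = language
    return kwargs


defaults = build_ocr_kwargs()
custom = build_ocr_kwargs(mode="redo", pages=3, language="deu")
```

With no arguments the sketch yields the defaults from the list (skip mode, `pdfa` output, clean enabled, no language).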

Paperless will also store these documents in addition to the untouched originals, both as an exit strategy and in case someone decides to recreate the archived versions with different settings.


totti4ever commented on May 21, 2024

moved from #50:

I also have an ocrmypdf job that feeds the sandwiched PDFs straight into the paperless application I currently use. I'm using the following options:

		--output-type pdfa-2 \
		--pdfa-image-compression jpeg \
		--rotate-pages \
		--clean \
		--remove-background \
		--deskew \
		--optimize 3 \
		--skip-text \
		-l "deu" \

In case the ocrmypdf integration into paperless-ng is supposed to be the primary PDF creator for most users, I suggest making the arguments overridable through a config file!
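To make the config-file idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the JSON format, the `DEFAULT_ARGS` values, and the helper names `merge_args` and `to_cli` are hypothetical and not part of paperless-ng. It only shows defaults being overridden by user settings and rendered as command-line flags.

```python
import json
from typing import Any, Dict, List

# Hypothetical built-in defaults; a real integration would define its own.
DEFAULT_ARGS: Dict[str, Any] = {
    "output-type": "pdfa",
    "clean": True,
    "skip-text": True,
}


def merge_args(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return the defaults with user overrides applied on top."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


def to_cli(args: Dict[str, Any]) -> List[str]:
    """Render an options dictionary as a list of command-line flags."""
    cli: List[str] = []
    for name, value in args.items():
        if value is True:
            cli.append(f"--{name}")  # boolean switch that is turned on
        elif value is False or value is None:
            continue  # switch turned off: emit nothing
        else:
            cli.extend([f"--{name}", str(value)])
    return cli


# Assumed content of a user config file, inlined here as a string.
user_config = '{"output-type": "pdfa-2", "deskew": true, "optimize": 3, "skip-text": false}'
flags = to_cli(merge_args(DEFAULT_ARGS, json.loads(user_config)))
print(flags)
```

Setting a switch to `false` in the user config removes a default flag, so users can both add and drop arguments.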

What you should also keep in mind is to store the original checksum and the one after ocrmypdf ran; otherwise it wouldn't be possible to recognize duplicates.

Is there any chance to detect whether the PDF contains text at all, with an option to skip ocrmypdf in that case so I don't end up with every file twice?
Of course, I would still like to be able to throw in files without a text layer. I could imagine that they would come from another consume service, so I could disable ocrmypdf for one and enable it for the other?

A lot of thoughts, sorry :-)


jonaswinkler commented on May 21, 2024

> What you should also keep in mind is to store the original checksum and the one after ocrmypdf ran; otherwise it wouldn't be possible to recognize duplicates.

Good point, adding checksums for converted documents.
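A minimal sketch of why both checksums matter, assuming an in-memory index and SHA-256 as the hash. The class `DocumentIndex` and its methods are hypothetical names for illustration, not Paperless code, and the actual hash algorithm used by Paperless may differ.

```python
import hashlib
from typing import Dict, Optional


def checksum(data: bytes) -> str:
    """Hex digest of the file content (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()


class DocumentIndex:
    """Toy index that remembers the checksum of every stored file version."""

    def __init__(self) -> None:
        self._by_checksum: Dict[str, int] = {}

    def add(self, doc_id: int, original: bytes, archived: bytes) -> None:
        # Register both the untouched original and the OCR'ed version.
        for digest in (checksum(original), checksum(archived)):
            self._by_checksum[digest] = doc_id

    def find_duplicate(self, incoming: bytes) -> Optional[int]:
        """Return the id of a known document with identical content, if any."""
        return self._by_checksum.get(checksum(incoming))


index = DocumentIndex()
index.add(1, original=b"scanned bytes", archived=b"ocr output bytes")
```

With only the original checksum stored, re-consuming the already processed file would not be recognized as a duplicate; storing both covers either version.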

> Any chance to recognize if there's text in the PDF at all and an option that if so ocrmypdf is skipped so I don't have all files twice?

I understand what you mean. With skip_text, we could skip documents that already have text in them entirely and not call ocrmypdf at all. However, that does not account for cases where only some of the pages have text. Also, ocrmypdf does a pretty good job at image optimization and at making sure that all archived documents are in the same format, so having ocrmypdf process text-only documents is desirable.
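The partial-text case can be illustrated with a small Python sketch. It assumes the text of each page has already been extracted by some PDF library (not shown), and the function name `text_coverage` is hypothetical.

```python
from typing import List


def text_coverage(page_texts: List[str]) -> str:
    """Classify a document as 'none', 'partial' or 'all' by pages with text.

    page_texts holds the extracted text of each page; how that text is
    obtained is outside this sketch.
    """
    if not page_texts:
        return "none"
    pages_with_text = sum(1 for text in page_texts if text.strip())
    if pages_with_text == 0:
        return "none"
    if pages_with_text == len(page_texts):
        return "all"
    return "partial"


# A scan with one typed cover page followed by two image-only pages.
mixed = text_coverage(["Invoice 2020", "", "   "])
```

A check that only asks "is there any text?" would treat the mixed document like a fully digital one and leave its image-only pages without OCR.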


totti4ever commented on May 21, 2024

That is actually a good point - might throw everything into ocrmypdf then to have a common format!


jonaswinkler commented on May 21, 2024

It's in the latest release.

