Add the ocr'ed text as a text layer to the scanned documents so that text can be copie


Integration with OCRmyPDF about paperless-ng HOT 8 CLOSED

jonaswinkler commented on May 21, 2024

Integration with OCRmyPDF

from paperless-ng.

Comments (8)

jonaswinkler commented on May 21, 2024

I've just merged this into dev.

And there's an option to skip storing converted documents if the original already has text.


jonaswinkler commented on May 21, 2024

See https://github.com/jbarlow83/OCRmyPDF

and https://github.com/mimimi1968/paperless


jonaswinkler commented on May 21, 2024

The solution should also take already existing documents into account and transform these as well.


jonaswinkler commented on May 21, 2024

Integration works but apparently needs more tests and more configuration. As of now, Paperless will offer:

--skip, --redo, and --force as three different modes for OCR; defaults to skip.
output_type; defaults to pdfa.
pages, to OCR only the first n pages.
clean; turned on by default.
language; no default.
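
As a rough illustration, the options listed above could be translated into an ocrmypdf command line along the following lines. This is only a sketch: the `OcrSettings` class, its field names, and `build_ocrmypdf_args` are hypothetical and not taken from the paperless-ng code base. The flags themselves (`--skip-text`, `--redo-ocr`, `--force-ocr`, `--output-type`, `--pages`, `--clean`, `--language`) are ocrmypdf command-line options.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OcrSettings:
    """Hypothetical settings container; names are illustrative only."""
    mode: str = "skip"            # "skip", "redo", or "force"
    output_type: str = "pdfa"
    pages: Optional[int] = None   # OCR only the first n pages when set
    clean: bool = True
    language: Optional[str] = None


# Map the three modes onto the corresponding ocrmypdf flags.
_MODE_FLAGS = {
    "skip": "--skip-text",
    "redo": "--redo-ocr",
    "force": "--force-ocr",
}


def build_ocrmypdf_args(settings: OcrSettings, input_path: str, output_path: str) -> List[str]:
    """Build an argument list suitable for subprocess.run()."""
    if settings.mode not in _MODE_FLAGS:
        raise ValueError(f"unknown OCR mode: {settings.mode!r}")
    args = ["ocrmypdf", _MODE_FLAGS[settings.mode], "--output-type", settings.output_type]
    if settings.pages is not None:
        # ocrmypdf expects a page range, so "first n pages" becomes "1-n".
        args += ["--pages", f"1-{settings.pages}"]
    if settings.clean:
        args.append("--clean")
    if settings.language:
        args += ["--language", settings.language]
    args += [input_path, output_path]
    return args
```

For example, `build_ocrmypdf_args(OcrSettings(language="deu", pages=3), "in.pdf", "out.pdf")` yields a list that could be passed to `subprocess.run`.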

I suppose we need an additional interface so that users can specify whatever they want.

Paperless will also store these documents in addition to the untouched originals, both as an exit strategy and in case someone decides to recreate the archived versions with different settings.


totti4ever commented on May 21, 2024

moved from #50:

I also have an ocrmypdf job throwing the sandwiched items straight into the current paperless application in use. I'm using the following commands:

		--output-type pdfa-2 \
		--pdfa-image-compression jpeg \
		--rotate-pages \
		--clean \
		--remove-background \
		--deskew \
		--optimize 3 \
		--skip-text \
		-l "deu" \

In case the ocrmypdf integration into paperless-ng is supposed to be the primary PDF creator for most users, I suggest making the arguments overridable through a config file!
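
One way such an override could work, purely as a sketch: built-in defaults are merged with whatever the user supplies in a config file. The JSON format, the key names, and the `merge_ocr_args` helper are all assumptions made for illustration; nothing here reflects how paperless-ng actually reads its configuration.

```python
import json

# Hypothetical built-in defaults; key names are illustrative only.
DEFAULT_OCR_ARGS = {
    "mode": "skip",
    "output_type": "pdfa",
    "clean": True,
    "language": None,
}


def merge_ocr_args(config_text: str) -> dict:
    """Return the defaults with user overrides from a JSON document applied."""
    overrides = json.loads(config_text) if config_text.strip() else {}
    if not isinstance(overrides, dict):
        raise ValueError("config must be a JSON object")
    # Reject unknown keys so that typos do not pass silently.
    unknown = set(overrides) - set(DEFAULT_OCR_ARGS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    return {**DEFAULT_OCR_ARGS, **overrides}
```

With this approach, a user who only wants a different output type and language supplies just those two keys, and every other option keeps its default.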

What you should also keep in mind is to store the original checksum and the one after ocrmypdf ran; otherwise it wouldn't be possible to recognize duplicates.

Any chance to recognize if there's text in the PDF at all and an option that if so ocrmypdf is skipped so I don't have all files twice?
Of course I still would like to be able to throw in files without text-layer. I could imagine that they would come from another consume service, so I can disable ocrmypdf for one and enable for the other?

A lot of thoughts, sorry :-)


jonaswinkler commented on May 21, 2024

What you should also keep in mind is to store the original checksum and the one after ocrmypdf ran; otherwise it wouldn't be possible to recognize duplicates.

Good point, adding checksums for converted documents.
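
A minimal sketch of the idea, assuming each stored document keeps one checksum for the untouched original and one for the converted file. The `StoredDocument` class, its field names, and the choice of SHA-256 are assumptions for illustration, not the actual paperless-ng schema.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable


def checksum(data: bytes) -> str:
    """Hex digest of the file contents (SHA-256 chosen for illustration)."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class StoredDocument:
    """Hypothetical record holding both checksums for one document."""
    original_checksum: str
    archive_checksum: str


def is_duplicate(incoming: bytes, stored: Iterable[StoredDocument]) -> bool:
    """True if the incoming file matches an original or a converted file."""
    digest = checksum(incoming)
    return any(
        digest in (doc.original_checksum, doc.archive_checksum)
        for doc in stored
    )
```

Checking against both values means a duplicate is recognized whether the user re-submits the original scan or the already converted file.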

Any chance to recognize if there's text in the PDF at all and an option that if so ocrmypdf is skipped so I don't have all files twice?

I understand what you mean. With skip_text, we could skip documents that already have text in them entirely and not make any calls to ocrmypdf. However, that does not account for cases where only some of the pages have text. Also, ocrmypdf still does a pretty good job at image optimization and at making sure that all archived documents are in the same format, so having ocrmypdf process text-only documents is desirable.
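
To illustrate the partial-text case, here is a small sketch that classifies a document by how many of its pages already carry text. It assumes the per-page text has been extracted beforehand by some other tool; `classify_text_coverage` is a hypothetical helper, not part of paperless-ng or ocrmypdf.

```python
from typing import Sequence


def classify_text_coverage(page_texts: Sequence[str]) -> str:
    """Return "none", "partial", or "all" for the given per-page text."""
    # A page counts as having text if anything other than whitespace was extracted.
    pages_with_text = sum(1 for text in page_texts if text.strip())
    if pages_with_text == 0:
        return "none"
    if pages_with_text == len(page_texts):
        return "all"
    return "partial"
```

A simple "has any text" check would treat the "partial" case like "all" and skip the document, leaving the image-only pages without a text layer.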


totti4ever commented on May 21, 2024

That is actually a good point - might throw everything into ocrmypdf then to have a common format!


jonaswinkler commented on May 21, 2024

It's in the latest release.
