Comments (8)
I've just merged this into dev.
And there's an option to skip storing converted documents if the original already has text.
from paperless-ng.
See https://github.com/jbarlow83/OCRmyPDF
and https://github.com/mimimi1968/paperless
The solution should also take already existing documents into account and transform these as well.
Integration works, but apparently needs more tests and more configuration. As of now, Paperless will offer:
- --skip, --redo, and --force as three different OCR modes; defaults to skip.
- output_type; defaults to pdfa.
- pages, to OCR only the first n pages.
- clean, turned on by default.
- language, with no default.
I suppose we need an additional interface so that users can specify whatever they want.
Paperless will also store these documents in addition to the untouched originals, both as an exit strategy and in case someone decides to recreate the archived versions with different settings.
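To make the option list above concrete, here is a minimal sketch of how those settings could be turned into an ocrmypdf command line. The function name `build_ocrmypdf_args` and its parameters are hypothetical and are not taken from the Paperless code base; the flags themselves (`--skip-text`, `--redo-ocr`, `--force-ocr`, `--output-type`, `--pages`, `--clean`, `-l`) are the ocrmypdf options that the three modes and the other settings most plausibly correspond to.

```python
from typing import List, Optional

# Assumption: the three mode names from the comment map onto these ocrmypdf flags.
MODE_FLAGS = {"skip": "--skip-text", "redo": "--redo-ocr", "force": "--force-ocr"}


def build_ocrmypdf_args(
    input_path: str,
    output_path: str,
    mode: str = "skip",
    output_type: str = "pdfa",
    pages: Optional[int] = None,
    clean: bool = True,
    language: Optional[str] = None,
) -> List[str]:
    """Assemble an ocrmypdf command line from the options listed above.

    Hypothetical helper for illustration; not the actual Paperless code.
    """
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown OCR mode: {mode!r}")
    args = ["ocrmypdf", MODE_FLAGS[mode], "--output-type", output_type]
    if pages is not None:
        # OCR only the first n pages, expressed as the page range 1-n.
        args += ["--pages", f"1-{pages}"]
    if clean:
        args.append("--clean")
    if language:
        # No default language: the flag is only added when one is configured.
        args += ["-l", language]
    return args + [input_path, output_path]
```

The resulting list could be handed to `subprocess.run`; keeping the argument assembly in a pure function makes the defaults easy to test without invoking ocrmypdf.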
moved from #50:
I also have an ocrmypdf job that feeds the sandwiched PDFs straight into the Paperless instance I currently use. I'm using the following arguments:
--output-type pdfa-2 \
--pdfa-image-compression jpeg \
--rotate-pages \
--clean \
--remove-background \
--deskew \
--optimize 3 \
--skip-text \
-l "deu" \
In case the ocrmypdf integration into paperless-ng is supposed to be the primary PDF creator for most users, I suggest making the arguments overridable through a config file!
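One way the config-file suggestion above could work, sketched under assumptions: built-in defaults are merged with user overrides read from a JSON document, and the merged options are rendered as command-line flags. The names `DEFAULTS`, `merge_overrides`, and `to_cli_flags`, as well as the JSON format itself, are hypothetical and only illustrate the idea.

```python
import json
from typing import Any, Dict, List

# Hypothetical built-in defaults; keys are ocrmypdf long option names without "--".
DEFAULTS: Dict[str, Any] = {"output-type": "pdfa", "clean": True, "skip-text": True}


def merge_overrides(defaults: Dict[str, Any], config_text: str) -> Dict[str, Any]:
    """Return the defaults updated with overrides parsed from a JSON config."""
    overrides = json.loads(config_text) if config_text.strip() else {}
    if not isinstance(overrides, dict):
        raise ValueError("config must be a JSON object")
    return {**defaults, **overrides}


def to_cli_flags(options: Dict[str, Any]) -> List[str]:
    """Render options as flags: True -> bare flag, False/None -> omitted."""
    flags: List[str] = []
    for name, value in options.items():
        if value is True:
            flags.append(f"--{name}")
        elif value is False or value is None:
            continue
        else:
            flags += [f"--{name}", str(value)]
    return flags
```

With this shape, a user who wants the arguments from the comment above would only need to list the options that differ from the defaults, for example `{"output-type": "pdfa-2", "deskew": true, "optimize": 3}`.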
What you should also keep in mind is to store both the original checksum and the one after ocrmypdf ran; otherwise it wouldn't be possible to recognize duplicates.
Any chance to recognize if there's text in the PDF at all, and an option to skip ocrmypdf if so, so I don't have all files twice?
Of course I would still like to be able to throw in files without a text layer. I could imagine that they would come from another consume service, so I can disable ocrmypdf for one and enable it for the other?
A lot of thoughts, sorry :-)
What you should also keep in mind is to store both the original checksum and the one after ocrmypdf ran; otherwise it wouldn't be possible to recognize duplicates.
Good point, adding checksums for converted documents.
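A minimal sketch of the duplicate check discussed here, assuming each stored document keeps one checksum for the untouched original and one for the archived (OCRed) version. The `StoredDocument` class, the `find_duplicate` helper, and the choice of SHA-256 are assumptions made for illustration; the thread does not say which hash or schema Paperless uses.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StoredDocument:
    """Hypothetical record holding both checksums of one document."""
    title: str
    original_checksum: str
    archive_checksum: Optional[str] = None


def checksum(data: bytes) -> str:
    """Hex digest of the file contents (SHA-256 chosen here as an assumption)."""
    return hashlib.sha256(data).hexdigest()


def find_duplicate(
    data: bytes, documents: Iterable[StoredDocument]
) -> Optional[StoredDocument]:
    """Return the stored document whose original or archived file matches."""
    digest = checksum(data)
    for doc in documents:
        if digest in (doc.original_checksum, doc.archive_checksum):
            return doc
    return None
```

Comparing against both checksums means a file is recognized as a duplicate whether the user re-submits the raw scan or the version that ocrmypdf already produced.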
Any chance to recognize if there's text in the PDF at all, and an option to skip ocrmypdf if so, so I don't have all files twice?
I understand what you mean. With skip_text, we could skip documents that already have text in them entirely and not make any calls to ocrmypdf. However, that does not account for cases where only some of the pages have text. Also, ocrmypdf still does a pretty good job at image optimization and at making sure that all archived documents are in the same format, so having ocrmypdf process text-only documents is desirable.
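The partial-text case mentioned above can be illustrated with a small sketch. It assumes the text of each page has already been extracted by some PDF library (not shown) and only classifies the result; `pages_with_text` and `text_coverage` are hypothetical names.

```python
from typing import List


def pages_with_text(page_texts: List[str]) -> List[bool]:
    """Flag each page as having text if its extracted text is not just whitespace."""
    return [bool(text.strip()) for text in page_texts]


def text_coverage(page_texts: List[str]) -> str:
    """Classify a document as 'all', 'partial', or 'none' by per-page text."""
    flags = pages_with_text(page_texts)
    if flags and all(flags):
        return "all"
    if any(flags):
        return "partial"
    return "none"
```

Only a document classified as "all" could safely bypass ocrmypdf altogether; a "partial" document still needs OCR on the pages without text, which is the case a whole-document check would miss.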
That is actually a good point - might throw everything into ocrmypdf then to have a common format!
It's in the latest release.