deajan / pmocr

A wrapper for tesseract / abbyyOCR11 ocr4linux finereader cli that can perform batch operations or monitor a directory and launch an OCR conversion on file activity

License: BSD 3-Clause "New" or "Revised" License

Shell 100.00%

ocr ocr-service ocr-conversion abbyy tesseract smb cifs nfs inotify pdf

pmocr's Introduction

pmOCR (poor man's OCR tool)

This project has been archived!

It has been fun improving my bash skills since I began coding this around 2015.
I initially planned to produce a better, Python-based version of this, but then I found the OCRmyPDF project, which already does a great job ;)
See https://github.com/ocrmypdf/OCRmyPDF for more info.

If you're interested in Document Management Systems, also check out paperless-ngx, which is fully open source and uses OCRmyPDF.

Farewell, my old bash project.

pmOCR

A multicore batch & service wrapper script for Tesseract v3/v4/v5 (https://github.com/tesseract-ocr/) or ABBYY CLI OCR 11 FOR LINUX based on Finereader Engine 11 optical character recognition (www.ocr4linux.com).

Conversions support tiff/jpg/png/pdf/bmp to PDF, TXT and CSV (also DOCX and XLSX for Abbyy OCR). It can actually support any other format that your OCR engine can handle.

This wrapper can work both in batch and service mode.

In batch mode, it's used as a command-line tool for processing multiple files at once, and it can output one or more formats.

In service mode, it monitors directories and launches OCR conversions as soon as new files arrive in them. Since v1.8.0, it can also monitor NFS / SMB mountpoints with a new integrated inotifywait emulation poller.
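The idea behind a polling replacement for inotifywait can be sketched as follows. This is a minimal illustration, not pmOCR's actual poller: the function name `poll_new_files` and the snapshot file are hypothetical, and the sketch assumes file names without newlines.

```shell
# Hypothetical helper: print files that appeared under "$1" since the
# previous call. The previous listing is kept in the snapshot file "$2",
# which must live outside the monitored directory.
poll_new_files() {
    local dir="$1" snapshot="$2" current
    current="$(mktemp)"
    # Take a sorted listing of all regular files currently present.
    find "$dir" -type f | LC_ALL=C sort > "$current"
    # First run: start from an empty snapshot so every file counts as new.
    [ -f "$snapshot" ] || : > "$snapshot"
    # comm -13 keeps only the lines unique to the second (current) listing.
    LC_ALL=C comm -13 "$snapshot" "$current"
    mv "$current" "$snapshot"
}
```

Calling the function in a loop with a `sleep` between iterations gives a crude creation monitor that also works on NFS / SMB mounts, where kernel inotify events from remote writers are not delivered.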

pmOCR has the following options:

Include current date into the output filename
Ignore already OCRed PDF files based on font detection and / or file suffix
Delete or move input file after successful conversion
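Font detection can be approximated with pdffonts alone. The sketch below is an assumption about how such a check might look, not the code pmOCR runs; `pdf_has_fonts` is a hypothetical name.

```shell
# Succeeds (exit 0) when pdffonts lists at least one font, which usually
# means the PDF already carries a text layer. Hypothetical helper.
pdf_has_fonts() {
    # pdffonts prints two header lines, then one line per embedded font.
    pdffonts "$1" 2>/dev/null | tail -n +3 | grep -q .
}

# Example use (commented out so the snippet has no side effects):
#   pdf_has_fonts scan.pdf && echo "scan.pdf already contains text, skipping"
```

A purely bitmap PDF produced by a scanner normally lists no fonts, so the function fails for it and the file would be handed to the OCR engine.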

Install it

$ git clone https://github.com/deajan/pmOCR
$ cd pmOCR
$ ./install.sh

You will need pdffonts util (from poppler-utils package). Optionally, you can install inotifywait (from inotify-tools package).

If you are using tesseract OCR, please install tesseract-osd and tesseract-[your language] (sometimes called tesseract-ocr-osd). You will also need ImageMagick in order to be able to transform bitmap PDF documents to indexed PDFs.
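A quick way to see which of these tools are missing is sketched below. The tool list is taken from this README; `missing_tools` is a hypothetical helper, and package names differ between distributions.

```shell
# Print every command from the argument list that is not found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# pdffonts comes from poppler-utils, convert from ImageMagick,
# gs from ghostscript; inotifywait is optional.
missing_tools pdffonts tesseract convert gs inotifywait
```

An empty output means everything was found.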

Batch mode

Use pmocr to batch process all files in a given directory and its subdirectories.

Use --help for command line usage.

Example:

$ pmocr.sh --batch --target=pdf --skip-txt-pdf --delete-input /some/path
$ pmocr.sh --batch --target=pdf --target=csv --suffix=processed /some/path

If pmOCR wasn't installed, you may run it directly with a configuration file like:

$ ./pmocr.sh --config=./default.conf --batch -p /some/path

OCR Configuration

pmOCR uses a default config stored in /etc/pmocr/default.conf. You may change its contents, or clone it and have pmOCR use an alternative configuration with:

$ pmocr.sh --config=/etc/pmocr/myConfig.conf --batch --target=csv /some/path

Service mode

Service mode monitors directories and their subdirectories and launches an OCR conversion whenever a new file is written. Keep in mind that only file creations are monitored. File moves aren't.

pmocr is written to monitor up to 5 directories, each producing a different target format (PDF, DOCX, XLSX, TXT & CSV). Comment out a folder to disable its monitoring.

There's also an option to avoid passing PDFs to the OCR engine that already contain text.

After installation, please configure /etc/pmocr/default.conf in order to monitor the directories you need, and adjust your specific options.

Launch service (initV style):

$ service pmocr-srv start

Launch service (systemd style):

$ systemctl start pmocr-srv@default.conf.service

Check service state (initV style):

$ service pmocr-srv status

Check service state (systemd style):

$ systemctl status pmocr-srv@default.conf.service

Multiple service instances

In order to monitor multiple directories with different OCR settings, you need to duplicate the /etc/pmocr/default.conf configuration file. When launching the pmOCR service with initV, each config file will create an instance. With systemd, you have to launch a service for each config file. Example for configs /etc/pmocr/default.conf and /etc/pmocr/other.conf:

$ systemctl start pmocr-srv@default.conf.service
$ systemctl start pmocr-srv@other.conf.service
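With many config files, the per-config start commands can be generated instead of typed. The sketch below assumes that instances are named `pmocr-srv@<config file name>.service`; `pmocr_units` is a hypothetical helper, not something the installer provides.

```shell
# Print one systemd unit name per *.conf file found in the given directory.
# Assumption: instances are named pmocr-srv@<config file name>.service.
pmocr_units() {
    for conf in "$1"/*.conf; do
        [ -e "$conf" ] || continue   # directory without any *.conf file
        printf 'pmocr-srv@%s.service\n' "$(basename "$conf")"
    done
}

# Review the list first, then start each instance, for example:
#   pmocr_units /etc/pmocr | xargs -r -n 1 systemctl start
```

Printing the names before starting anything makes it easy to spot a stray config file that should not become a service.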

Support for OCR engines

Has been tested so far with:

ABBYY FineReader OCR Engine 11 CLI for Linux releases R2 (v 11.1.6.562411), R3 (v 11.1.9.622165) and R6 (v 11.1.14.707470)
Tesseract-ocr 3.0.4
Tesseract-ocr 4.0.0 and 4.0.12
Tesseract-ocr 5.0.0 and 5.0.1

Tesseract mode also uses ghostscript to convert PDF files to an intermediary TIFF format in order to process them.
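For illustration, such a PDF to TIFF step could look like the sketch below. The output device (`tiff24nc`) and the 300 dpi default are assumptions chosen for the example; the arguments pmOCR actually passes to ghostscript may differ.

```shell
# Render all pages of a PDF into one multi-page TIFF with ghostscript.
# Usage: pdf_to_tiff input.pdf output.tif [dpi]
pdf_to_tiff() {
    src="$1"; dst="$2"; dpi="${3:-300}"
    # -dNOPAUSE/-dBATCH make gs run unattended and exit when done.
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiff24nc -r"$dpi" \
       -sOutputFile="$dst" "$src"
}
```

The resulting TIFF can then be handed to tesseract, which reads images but not PDF files.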

It should work with virtually any engine as long as you adjust the parameters.

Parameters include any arguments to pass to the OCR program depending on the target format.

Support for OCR Preprocessors

ABBYY has an integrated preprocessor to enhance recognition quality, whereas Tesseract relies on external tools. pmOCR can use a preprocessor like ImageMagick to deskew, clear noise, render a white background and remove black borders. The ImageMagick preprocessor is configured and enabled by default for use with Tesseract.
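A preprocessing call covering those four steps might look like the sketch below. The option values are illustrative assumptions, not the arguments shipped in pmOCR's default configuration; `preprocess_image` is a hypothetical name.

```shell
# Clean up a scanned image before OCR using ImageMagick's convert.
# Usage: preprocess_image input.tif output.tif
preprocess_image() {
    src="$1"; dst="$2"
    # -deskew straightens the page, -despeckle reduces noise,
    # -background white -alpha remove flattens transparency onto white,
    # -fuzz/-trim crop a uniform (for example black) border, and
    # +repage resets the canvas geometry after the crop.
    convert "$src" \
        -deskew 40% \
        -despeckle \
        -background white -alpha remove \
        -fuzz 10% -trim +repage \
        "$dst"
}
```

On ImageMagick 7 the same options are passed to `magick` rather than `convert`.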

Tesseract caveats

When no OSD / language data is installed, tesseract will still process documents, but the quality may suffer. While pmocr will warn you about this, the conversion still happens. Please make sure to install all necessary addons for tesseract.
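Whether the required data is installed can be checked up front with `tesseract --list-langs`. The helper below is a sketch with a hypothetical name, and the two languages in the loop are only examples.

```shell
# Succeeds when the given traineddata name (e.g. osd, eng, deu) is installed.
tesseract_has_lang() {
    # --list-langs prints a header line followed by one name per line;
    # grep -x matches whole lines only, -F treats the name literally.
    tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}

for lang in osd eng; do
    tesseract_has_lang "$lang" || echo "warning: tesseract data '$lang' not found" >&2
done
```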

Troubleshooting

Please check /var/log/pmocr.log or ./pmocr.log file for errors.

Filenames containing special characters should work. Nevertheless, if your file doesn't get converted, try renaming it, then copy it to the monitored directory again or run the batch process again.

By default, files that fail to process are renamed with '_OCR_ERR' + date added to the filename. In order to reprocess those files, that marker has to be removed with the following command:

$ find /monitor/path -iname "*_OCR_ERR.*" -print0 | xargs -0 -I {} bash -c 'file="$1"; mv "$file" "${file//_OCR_ERR/}"' _ {}

If using tesseract to create searchable PDF files, please make sure to have version 3.03 or better installed.


pmocr's Issues

Setting up ServiceMode

I now have batch mode working quite well.

But the documentation around the service mode does not make much sense to me.

Can someone document the steps in a little more detail?

pmocr-srv only resides in my git folder, not in the system bin folder. If launched from this location does it still reference /etc/pmocr/default.conf ?

where can i get FineReader OCR Engine 11 CLI for Linux

Where can I download FineReader OCR Engine 11 CLI for Linux?

install.sh: check destination for SERVICE_DIR_SYSTEMD_SYSTEM

The install.sh script points to /usr/lib/systemd/system to copy the pmocr-srv.service file; however, it would appear that not all distros have that folder by default. I'm running Debian on Raspberry Pi (aka Raspbian).

I'm not familiar enough with other distros to know what the optimal location for the *.service files would be, but install.sh stores SERVICE_DIR_SYSTEMD_USER under /etc/systemd/user, which resides alongside /etc/systemd/system.

So I'm not sure what advantage there would be to keeping the service files under /usr/lib/systemd/system instead of /etc/systemd/system but checking if the /usr/lib/systemd/system folder exists first would ensure that line 133 in install.sh does not copy the file to the wrong location.
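The existence check proposed here could be sketched like this. The function name and the candidate order are assumptions for illustration and are not taken from install.sh.

```shell
# Print the first argument that is an existing directory; fail if none is.
first_existing_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Candidate order is an assumption; adjust it to the distros you target.
service_dir="$(first_existing_dir /usr/lib/systemd/system /lib/systemd/system /etc/systemd/system)" \
    || echo "no systemd unit directory found" >&2
```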

convert: Option '-resample' requires an argument or argument is malformed

Hello,

Please excuse my English; I am not a native speaker.

I installed pmOCR on Ubuntu 20.04.4 LTS. But when I try to start it in batch mode, I get the error message "convert: Option '-resample' requires an argument or argument is malformed.".

These were my steps:

apt-get install -f poppler-utils
Reading package lists... Done
Building dependency tree... 50%
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  libcairo2 libjbig0 libjpeg-turbo8 libjpeg8 liblcms2-2 libnspr4 libnss3 libopenjp2-7 libpixman-1-0 libpoppler97 libtiff5 libwebp6 libxcb-render0 poppler-data
Suggested packages:
  liblcms2-utils ghostscript fonts-japanese-mincho | fonts-ipafont-mincho fonts-japanese-gothic | fonts-ipafont-gothic fonts-arphic-ukai fonts-arphic-uming fonts-nanum
The following NEW packages will be installed:
  libcairo2 libjbig0 libjpeg-turbo8 libjpeg8 liblcms2-2 libnspr4 libnss3 libopenjp2-7 libpixman-1-0 libpoppler97 libtiff5 libwebp6 libxcb-render0 poppler-data poppler-utils
0 upgraded, 15 newly installed, 0 to remove and 0 not upgraded.

apt-get install tesseract-ocr-deu
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  fontconfig libarchive13 libdatrie1 libgif7 libgomp1 libgraphite2-3 libharfbuzz0b liblept5 libpango-1.0-0 libpangocairo-1.0-0 libpangoft2-1.0-0 libtesseract4 libthai-data libthai0 libwebpmux3 tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
Suggested packages:
  lrzip
The following NEW packages will be installed:
  fontconfig libarchive13 libdatrie1 libgif7 libgomp1 libgraphite2-3 libharfbuzz0b liblept5 libpango-1.0-0 libpangocairo-1.0-0 libpangoft2-1.0-0 libtesseract4 libthai-data libthai0 libwebpmux3 tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 19 newly installed, 0 to remove and 0 not upgraded.
Need to get 9,340 kB of archives.
After this operation, 28.3 MB of additional disk space will be used.

apt-get install -f git
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  git-man libbrotli1 libcurl3-gnutls liberror-perl libnghttp2-14 librtmp1 libssh-4 patch
Suggested packages:
  git-daemon-run | git-daemon-sysvinit git-doc git-el git-email git-gui gitk gitweb git-cvs git-mediawiki git-svn diffutils-doc
The following NEW packages will be installed:
  git git-man libbrotli1 libcurl3-gnutls liberror-perl libnghttp2-14 librtmp1 libssh-4 patch
0 upgraded, 9 newly installed, 0 to remove and 0 not upgraded.
Need to get 6,379 kB of archives.
After this operation, 41.0 MB of additional disk space will be used.

git clone https://github.com/deajan/pmOCR
Cloning into 'pmOCR'...
remote: Enumerating objects: 2030, done.
remote: Counting objects: 100% (74/74), done.
remote: Compressing objects: 100% (45/45), done.
remote: Total 2030 (delta 47), reused 46 (delta 29), pack-reused 1956
Receiving objects: 100% (2030/2030), 1021.31 KiB | 5.58 MiB/s, done.
Resolving deltas: 100% (1385/1385), done.

cd pmOCR

./install.sh
2022-05-07 21:02:07 - Detected systemd.
2022-05-07 21:02:07 - Copying [default.conf] to [/etc/pmocr/default.conf.new].
2022-05-07 21:02:07 - Copied [/pmOCR/default.conf] to [/etc/pmocr/default.conf.new].
2022-05-07 21:02:07 - Copied [/pmOCR/pmocr.sh] to [/usr/local/bin/pmocr.sh].
2022-05-07 21:02:07 - Set file permissions to [755] on [/usr/local/bin/pmocr.sh].
2022-05-07 21:02:07 - Set file ownership on [/usr/local/bin/pmocr.sh] to [root:root].
2022-05-07 21:02:07 - Copied [/pmOCR/pmocr-srv@.service] to [/lib/systemd/system/pmocr-srv@.service].
2022-05-07 21:02:07 - Created [pmocr-srv] service in [/lib/systemd/system] and [/etc/systemd/user].
2022-05-07 21:02:07 - Can be activated with [systemctl start pmocr-srv@instance.conf.service] where instance.conf is the name of the config file in /etc/pmocr.
2022-05-07 21:02:07 - Can be enabled on boot with [systemctl enable pmocr-srv@instance.conf.service].
2022-05-07 21:02:07 - In userland, active with [systemctl --user start pmocr-srv@instance.conf.service].
2022-05-07 21:02:07 - pmocr installed. Use with /usr/local/bin/pmocr.sh
2022-05-07 21:02:07 - In order to make usage statistics, the script would like to connect to http://instcount.netpower.fr?program=pmocr&version=1.8.1&os=Linux%205.13.19-6-pve%20x86_64%20x86_64%20GNU%2FLinux%20%28%22Ubuntu%22%20%2220.04.4%20LTS%20%28Focal%20Fossa%29%22%29%2064-bit%20Unix&action=install
No data except those in the url will be send.
Allow [Y/n]

ln -s /usr/local/bin/pmocr.sh /usr/bin/pmocr.sh

apt-file search /usr/bin/convert
bedops: /usr/bin/convert2bed
bitseq: /usr/bin/convertSamples
caffe-tools-cpu: /usr/bin/convert_cifar_data
caffe-tools-cpu: /usr/bin/convert_imageset
caffe-tools-cpu: /usr/bin/convert_mnist_data
caffe-tools-cpu: /usr/bin/convert_mnist_siamese_data
cbflib-bin: /usr/bin/convert_image
cct: /usr/bin/convert_vcf_to_features
cgns-convert: /usr/bin/convert_dataclass
cgns-convert: /usr/bin/convert_location
cgns-convert: /usr/bin/convert_variables
convertall: /usr/bin/convertall
device-tree-compiler: /usr/bin/convert-dtsv0
dvbstreamer: /usr/bin/convertdvbdb
eigensoft: /usr/bin/convertf
findbugs: /usr/bin/convertXmlToText
foxtrotgps: /usr/bin/convert2gpx
foxtrotgps: /usr/bin/convert2osm
graphicsmagick-imagemagick-compat: /usr/bin/convert
imagemagick-6.q16: /usr/bin/convert-im6.q16
imagemagick-6.q16hdri: /usr/bin/convert-im6.q16hdri
ir.lv2: /usr/bin/convert4chan
leptonica-progs: /usr/bin/convertfilestopdf
leptonica-progs: /usr/bin/convertfilestops
leptonica-progs: /usr/bin/convertformat
leptonica-progs: /usr/bin/convertsegfilestopdf
leptonica-progs: /usr/bin/convertsegfilestops
leptonica-progs: /usr/bin/converttopdf
leptonica-progs: /usr/bin/converttops
lilypond: /usr/bin/convert-ly
ncbi-blast+: /usr/bin/convert2blastmask
octomap-tools: /usr/bin/convert_octree
omniorb: /usr/bin/convertior
opendkim-tools: /usr/bin/convert_keylist
phast: /usr/bin/convert_coords
profphd-utils: /usr/bin/convert_seq
python3-oslo.log: /usr/bin/convert-json
python3-potr: /usr/bin/convertkey
rsem: /usr/bin/convert-sam-for-rsem
ruby-shoulda-context: /usr/bin/convert_to_should_syntax
staden-io-lib-utils: /usr/bin/convert_trace
syrthes-tools: /usr/bin/convert2syrthes4
texlive-bibtex-extra: /usr/bin/convertgls2bib
xoreos-tools: /usr/bin/convert2da

apt-get install -f graphicsmagick-imagemagick-compat
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  fonts-droid-fallback fonts-noto-mono fonts-urw-base35 ghostscript graphicsmagick gsfonts libavahi-client3 libavahi-common-data libavahi-common3 libcups2 libgraphicsmagick-q16-3 libgs9 libgs9-common libijs-0.35 libjbig2dec0 libpaper-utils libpaper1 libwmf0.2-7
Suggested packages:
  fonts-noto fonts-freefont-otf | fonts-freefont-ttf fonts-texgyre ghostscript-x graphicsmagick-dbg cups-common libwmf0.2-7-gtk
The following NEW packages will be installed:
  fonts-droid-fallback fonts-noto-mono fonts-urw-base35 ghostscript graphicsmagick graphicsmagick-imagemagick-compat gsfonts libavahi-client3 libavahi-common-data libavahi-common3 libcups2 libgraphicsmagick-q16-3 libgs9 libgs9-common libijs-0.35 libjbig2dec0 libpaper-utils libpaper1 libwmf0.2-7
0 upgraded, 19 newly installed, 0 to remove and 0 not upgraded.
Need to get 16.6 MB of archives.
After this operation, 54.0 MB of additional disk space will be used.
Do you want to continue? [Y/n]

Here is the log of the error:
2022-05-07 21:25:13 - Beginning PDF OCR recognition of files in [/test] using tesseract.
2022-05-07 21:25:14 - Preparing to process [/test/testpdf.pdf].
2022-05-07 21:25:14 - Preparing to process [/test/testpdf2 - Kopie.pdf].
2022-05-07 21:25:14 - Preparing to process [/test/testpdf3.pdf].
2022-05-07 21:25:14 - Preparing to process [/test/testpdf4.pdf].
2022-05-07 21:25:14 - _ExecTasksPidsCheck called by [OCR_Dispatch] finished monitoring pid [5087] with exitcode [1].
2022-05-07 21:25:14 - Command was [OCR "/test/testpdf.pdf" ".pdf" "pdf" "false"].
2022-05-07 21:25:14 - Truncated output:
2022-05-07 21:25:14 - Processing file [/test/testpdf.pdf].
convert convert: Option '-resample' requires an argument or argument is malformed.
2022-05-07 21:25:14 - /usr/bin/convert intermediary transformation failed.
2022-05-07 21:25:14 - Could not process file [/test/testpdf.pdf] (OCR error code 1). See logs.
2022-05-07 21:25:14 - Truncated OCR Engine Output:

2022-05-07 21:25:14 - Renaming file [/test/testpdf.pdf] to [/test/testpdf_OCR_ERR.pdf] in order to exclude it from next run.
2022-05-07 21:25:14 - Sent mail using mail command.

I added the /etc/pmocr/default.conf

What is wrong?

Thank you

Tony

default.txt

init file starts even if CheckEnvironment exits with exitcode 1

https://github.com/deajan/osync/issues/156

Add support for GNU Parallel

The batch and service modes of pmOCR seem like good candidates to integrate Parallel

https://www.gnu.org/software/parallel/

I'm not sure which would be the best way to do it, but I'd be willing to collaborate on it.
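As a starting point for discussion, running tesseract over a directory with GNU Parallel could look like the sketch below. It is independent of pmOCR's own dispatcher; the function name and the restriction to `*.tif` inputs are assumptions made for the example.

```shell
# OCR every TIFF below "$1" into a PDF next to it, several files at a time.
ocr_tiffs_in_parallel() {
    # {} is the input file and {.} the same path without its extension,
    # so scan.tif becomes scan.pdf. By default parallel runs one job per
    # CPU core; -0 reads the NUL-separated names that find -print0 emits.
    find "$1" -type f -name '*.tif' -print0 |
        parallel -0 tesseract {} {.} pdf
}
```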

(CRITICAL): not present appearing in log first run issue

Need a bit of help getting this started please. Seeing the below in the log.

ubuntu@host:/etc/pmocr$ sudo cat /var/log/pmocr.log
Mon Jul 24 06:49:09 UTC 2017 - (CRITICAL): not present.
Mon Jul 24 06:51:30 UTC 2017 - (CRITICAL): not present.
Mon Jul 24 06:51:38 UTC 2017 - (CRITICAL): not present.
Mon Jul 24 06:51:50 UTC 2017 - (CRITICAL): not present.
Mon Jul 24 06:51:55 UTC 2017 - (CRITICAL): not present.
Mon Jul 24 06:52:02 UTC 2017 - (CRITICAL): not present.

Merge ofunctions

Replace WaitForIt with WaitForTaskCompletion
Check new KillChilds function
Test log output for systemd

Tesseract 4 support

Would it be much effort to add support for tesseract 4 (https://github.com/tesseract-ocr/tesseract)?

installer thinks there must be a pmocr-batch.sh

Enhance multilingual processing/options

Both Tesseract and ABBYY support multilingual OCR however including multiple dictionaries for a single scan increases processing time and (at least for Tesseract) it decreases the language optimization strategy since pattern matching breaks down as you increase the variety of patterns to match against.

I would argue that unless you work in the translation field, it's unlikely that a single document will contain more than one language, and it's more likely that you will have a variety of documents to scan that will be one of a handful of languages.

This happens frequently in countries that have more than one official language. Canada is one example, where documentation can arrive in either English or French but rarely both together on the same printed page. In the United States, Spanish is widely used although it is only unofficially a second language, and Switzerland has four official languages: German, French, Italian and Romansh.

The script does have language settings; however, unless you edit those parameters each time you scan a given document, all specified languages will be included in all scans, and the end result may not be as accurate or effective as if a particular language were specified for each document.

What I propose is that the script be modified slightly to monitor subfolders within the current set of monitored folders and then adjust the language parameter accordingly. Example:

/storage/service_ocr/PDF/eng --> document contains only English, only English dictionary used
/storage/service_ocr/PDF/fra --> document contains only French, only French dictionary used
/storage/service_ocr/PDF/eng+fra --> English and French both appear within the same document
/storage/service_ocr/PDF/eng+fra+spa --> English, French and Spanish all appear in the same document

This can be especially useful if you use something like FTP from a scanner so you can change the target location of a scan and that will change the language.
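A minimal sketch of the lookup this proposal implies: the language string is simply the name of the folder the file was dropped into. `lang_for_file` is a hypothetical helper, not an existing pmOCR function.

```shell
# Derive the OCR language string from the folder that contains the file.
lang_for_file() {
    basename "$(dirname "$1")"
}

# The result can be passed straight to tesseract's -l option, which accepts
# '+'-joined language codes, for example:
#   tesseract "$file" "${file%.*}" -l "$(lang_for_file "$file")" pdf
```

A real implementation would also need a fallback for files placed directly in the monitored folder, where the parent directory name is not a language code.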