Both Tesseract and ABBYY support multilingual OCR however including multiple dictionar


Enhance multilingual processing/options (deajan/pmocr)

Comments (32)

popnt commented on August 23, 2024 1

I manually cleared off previous pmocr tests, and updated to master, and I am no longer able to repro. I'm not sure what might've changed but with each test I performed there was no duplication of processes.

I have not yet attempted to roll back to previous pmOCR versions to see if the behaviour can be found in a previous version or if it was simply my local environment. However I did complete tests using QEMU running Wheezy 3.1.9 using latest master and again I could not find any duplicate process. If I notice the behaviour occurring again I'll update this bug but I'm fine with moving on and closing the issue.

The only behaviour I noticed that I am not able to reconcile is that when I get the pmocr service status, there are two pmocr.sh processes running, as shown below. Notice there are two pmocr.sh running but like I said above there is no duplication of process, each jpg is scanned only once.

   Loaded: loaded (/lib/systemd/system/[email protected]; disabled)
   Active: active (running) since Wed 2016-09-14 22:09:10 EDT; 43min ago
 Main PID: 7734 (bash)
   CGroup: /system.slice/system-pmocr\x2dsrv.slice/[email protected]
           ├─ 7734 bash /usr/local/bin/pmocr.sh --service --config=/etc/pmocr/fr.conf
           ├─ 7749 bash /usr/local/bin/pmocr.sh --service --config=/etc/pmocr/fr.conf
           ├─16735 inotifywait --exclude (.*)_OCR.pdf --exclude (.*)_OCR_ERR.pdf -qq -r -e create /storage/service_ocr/PDF/fra
           └─19268 sleep 1

Regarding your question about deskew and reduce noise, my scanner does have built-in deskew and noise reduction so I have not yet had the need to perform these as part of the OCR process, but I can certainly see the benefit for others. If a new issue is opened regarding ImageMagick integration I'll add in feedback if any come to mind :)

from pmocr.

deajan commented on August 23, 2024

I think it would add big complexity to include an undefined number of languages.
The script already supports multiple instances (I run two instances on the same server for different purposes).
Thus, you could have one instance per language(s) you want with folders like:

/storage/service_ocr/eng/PDF
/storage/service_ocr/fra/PDF
/storage/service_ocr/eng+fra/PDF
/storage/service_ocr/eng+fra+spa/PDF

I'll add some instructions on how to multiply service instances.


popnt commented on August 23, 2024

I can appreciate that adding specialised language triaging would make the script more complex. I was hoping to keep it manageable based on a single abstraction layer tied into the pathnames and mapping those directly to the language parameter.. Maybe it's more trouble than it's worth :)
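A minimal sketch of the pathname-to-language mapping described above, assuming the language code is the directory that sits directly above `PDF` (as in the folder layout proposed earlier). The helper name `lang_from_path` is hypothetical; `-l eng+fra` is Tesseract's usual way of selecting several languages.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the OCR language string from a monitored path.
# Assumes a layout like /storage/service_ocr/<languages>/PDF.
lang_from_path() {
  basename "$(dirname "$1")"
}

for dir in /storage/service_ocr/eng/PDF /storage/service_ocr/eng+fra+spa/PDF; do
  # Only prints the command that would be used; nothing is executed.
  echo "$dir -> tesseract -l $(lang_from_path "$dir")"
done
```

This keeps the mapping to a single layer, but it only works while the folder names are valid language codes.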

Your workaround of running multiple scripts sounds very reasonable however wouldn't doing it that way then make the multicore routines in v1.5 independent from each script? Like there would not be any pooling of available cores, right?


deajan commented on August 23, 2024

You are right, each script could use the given number of cores. So having 6 scripts each allocating 4 cores could lead to 24 cores in use when all folders get fed at the same time.

Let me think about a config array in order to launch one monitor per dir with different OCR options.
I think of something like

serviceConfig=(
"/storage/service_ocr/eng/PDF","$PDF_EXTENSION","$OCR_ENGINE_ARGS_ENG"
"/storage/service_ocr/fra/PDF","$PDF_EXTENSION","$OCR_ENGINE_ARGS_FRA"
"/storage/service_ocr/eng+fra/PDF","$PDF_EXTENSION","$OCR_ENGINE_ARGS_ENG_FRA"
"/storage/service_ocr/eng+fra+spa/PDF","$PDF_EXTENSION","$OCR_ENGINE_ARGS_ENG_FRA_SPA"
)

This way, I could loop over the array and create a monitor per dir with different OCR parameters.
There would still be shared parameters for all monitored dirs, like CHECK_PDF or DELETE_ORIGINAL.
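A rough sketch of how such a loop could look, assuming each array entry is a single comma-separated string of directory, extension and engine arguments. The function names `start_monitor` and `launch_monitors` and the sample values are placeholders for illustration, not actual pmocr code.

```shell
#!/usr/bin/env bash
# Sample values, assumed for illustration only.
PDF_EXTENSION=".pdf"
OCR_ENGINE_ARGS_ENG="-l eng"
OCR_ENGINE_ARGS_FRA="-l fra"

serviceConfig=(
  "/storage/service_ocr/eng/PDF,$PDF_EXTENSION,$OCR_ENGINE_ARGS_ENG"
  "/storage/service_ocr/fra/PDF,$PDF_EXTENSION,$OCR_ENGINE_ARGS_FRA"
)

# Placeholder for the real monitor: only reports what it would watch.
start_monitor() {
  local dir="$1" ext="$2" args="$3"
  echo "monitor dir=$dir ext=$ext args=$args"
}

# Split each entry on commas and start one monitor per directory.
launch_monitors() {
  local entry dir ext args
  for entry in "${serviceConfig[@]}"; do
    IFS=',' read -r dir ext args <<< "$entry"
    start_monitor "$dir" "$ext" "$args"
  done
}

launch_monitors
```

Splitting on commas means a directory name containing a comma would break the parsing, which is one reason a config file per instance may be simpler.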


deajan commented on August 23, 2024

Actually, I've been thinking about this for some time now.
I should split pmocr.sh into a conf file and the executable, so you can run different configs depending on what you need.
In batch mode, you'd have to pass a config file. In service mode, it'll be one instance per config file.


popnt commented on August 23, 2024

Perhaps keep the default options in the main pmocr.sh but allow individual configs to override them so you don't end up with the same settings repeated in 10 different configs?
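One way this layering could work, sketched under assumptions: shared defaults are sourced first and an instance file is sourced on top, so it only needs to list what differs. `load_config` and `OCR_LANG` are made-up names for the example; `DELETE_ORIGINAL` is one of the shared settings mentioned earlier.

```shell
#!/usr/bin/env bash
# Hypothetical layering: load shared defaults, then let an instance file override them.
load_config() {
  local defaults="$1" override="$2"
  . "$defaults"
  if [ -n "$override" ] && [ -f "$override" ]; then
    . "$override"
  fi
}

# Demo with throwaway files instead of real pmocr configs.
tmpdir="$(mktemp -d)"
printf 'DELETE_ORIGINAL=no\nOCR_LANG=eng\n' > "$tmpdir/default.conf"
printf 'OCR_LANG=fra\n' > "$tmpdir/fr.conf"

load_config "$tmpdir/default.conf" "$tmpdir/fr.conf"
echo "DELETE_ORIGINAL=$DELETE_ORIGINAL OCR_LANG=$OCR_LANG"
rm -rf "$tmpdir"
```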


popnt commented on August 23, 2024

Actually, in batch mode wouldn't it make more sense to pass the options as parameters from the CLI? That way, if you have a special batch to process, you don't need to have a config already built.

Passing a config as parameter would be useful too but adjusting individual settings sans config is much quicker


deajan commented on August 23, 2024

There's no way you can pass OCR options from the CLI, because there are way too many. All other options except the new multicore variable can already be passed as CLI arguments in batch mode.

The point of removing default options from main pmocr.sh is to be able to upgrade without losing config.
I'll go for a default /etc/pmocr/default.conf file which is called unless another conf file is given as argument or via a service file.
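The fallback described here could be sketched roughly as follows. The `--config=` flag and the default path come from this thread; the function name `resolve_config` is an assumption for the example and not necessarily how pmocr.sh does it.

```shell
#!/usr/bin/env bash
DEFAULT_CONFIG="/etc/pmocr/default.conf"

# Print the config path to use: the last --config=... argument, else the default.
resolve_config() {
  config="$DEFAULT_CONFIG"
  for arg in "$@"; do
    case "$arg" in
      --config=*) config="${arg#--config=}" ;;
    esac
  done
  printf '%s\n' "$config"
}

resolve_config --service
resolve_config --service --config=/etc/pmocr/fr.conf
```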


popnt commented on August 23, 2024

sounds good..! Eager to see what you come up with :)


deajan commented on August 23, 2024

Finished moving config to default.conf and adapted the service files.
Also improved idle cpu usage and made minor fixes.
Care to review ?


popnt commented on August 23, 2024

I updated via git pull and after running install.sh and using systemctl start [email protected] then systemctl status [email protected] I receive the following

   Loaded: loaded (/lib/systemd/system/[email protected]; disabled)
   Active: failed (Result: exit-code) since Wed 2016-09-07 22:29:11 EDT; 6s ago
  Process: 354 ExecStart=/usr/local/bin/pmocr.sh --service --config=/etc/pmocr/%i (code=exited, status=1/FAILURE)
 Main PID: 354 (code=exited, status=1/FAILURE)

Important to note is that I had made manual changes from v1.4 to monitor multiple paths. I have removed pmocr-instance.sh and also /storage/storage_ocr and re-ran install.sh but the folders are not re-created... so without looking too deeply into the matter it seems there may be some inconsistencies in 1.5

Also, the README only lists instructions for how to run multiple configs with systemd.. will the initV style no longer support multiple configs?


deajan commented on August 23, 2024

I've worked a bit too fast and made an error in the default.conf file.
Please update to latest commit and then manually delete the file and use install.sh again (new install without prior deletion will not overwrite the default config).

Install.sh won't create any folders other than /etc/pmocr. You're supposed to have folders to monitor, which you set up in default.conf.

The README states that running InitV style automatically creates an instance per config file, so yes, initV supports multiple configs :)

If you have other failures with the new version, please give me the output of

systemctl status [email protected]

and /var/log/pmocr.log


popnt commented on August 23, 2024

I updated to the last commit, re-added the /storage/service_ocr/* paths and ran systemctl start [email protected] however I'm still getting a similar error. Here is the full output:

   Loaded: loaded (/lib/systemd/system/[email protected]; disabled)
   Active: failed (Result: exit-code) since Thu 2016-09-08 21:44:18 EDT; 13min ago
  Process: 3735 ExecStart=/usr/local/bin/pmocr.sh --service --config=/etc/pmocr/%i (code=exited, status=1/FAILURE)
 Main PID: 3735 (code=exited, status=1/FAILURE)

Sep 08 21:44:17 host systemd[1]: Started pmocr - monitors a local directory and gives any new file to an OCR ... file.
Sep 08 21:44:18 host pmocr.sh[3735]: CRITICAL:/usr/local/bin/abbyyocr11 not present.
Sep 08 21:44:18 host systemd[1]: [email protected]: main process exited, code=exited, status=1/FAILURE
Sep 08 21:44:18 host systemd[1]: Unit [email protected] entered failed state.

the /var/log/pmocr.log only contains one line at the end which is the following:

CRITICAL:/usr/local/bin/abbyyocr11 not present.

Given that this version is still not quite ready and I do not have a test environment setup, I'll revert back to the previous working version and just run two instances and only specify 2 cores for each instance on my 4 core machine.


deajan commented on August 23, 2024

Do you use tesseract or abbyyocr ?
I've tested with both ocr tools and pmocr works and ran my tests successfully, in batch and in service mode.

But you have to comment out the lines in default.conf that don't correspond to your ocr tool, and uncomment those that correspond to your tool. Have you done this step ?
Maybe this has to be documented more clearly in the default.conf file, but this is not a development issue anymore, only a config issue I think.


popnt commented on August 23, 2024

I had reviewed default.conf and saw

OCR_ENGINE=tesseract3

which is the OCR engine I'm using, so I'm not sure why the other lines corresponding to ABBYY should actually be commented out. I do not recall having to comment out those lines in v1.4.

I had a look at the readme again and there is no indication to perform the step you describe. I can go through the remainder of default.conf but I have no idea what other lines should be edited or commented out.


deajan commented on August 23, 2024

OCR_ENGINE=xxxxxxx helps the program to decide what type of special code it has to run.

But OCR_ENGINE_EXEC is defined twice if one of the sections is not commented out.
Just comment all the abbyy11 lines out of your default.conf and you should be running.

I've commented both sections out by default, and added more clear instructions in default.conf.
Sometimes things seem clear to the developer who designed something, even if they're not clear to anyone else :)

[Edit] In v1.4 I had logic to remove lines depending on OCR_ENGINE=xxxxx, but I cannot add logic code to a conf file :)[/Edit]
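To illustrate why the duplicate definition matters: a conf file that is sourced by the shell is just a list of assignments, so the last one wins regardless of OCR_ENGINE. The sketch below uses a throwaway file; the tesseract path is an assumption, while the ABBYY path is the one from the error log earlier in the thread.

```shell
#!/usr/bin/env bash
# Throwaway conf file with both engine sections left uncommented.
conf="$(mktemp)"
cat > "$conf" <<'EOF'
OCR_ENGINE=tesseract3
# tesseract section (path assumed for this example)
OCR_ENGINE_EXEC=/usr/bin/tesseract
# abbyy section, not commented out
OCR_ENGINE_EXEC=/usr/local/bin/abbyyocr11
EOF

. "$conf"
rm -f "$conf"

# The later assignment silently replaced the tesseract one.
echo "OCR_ENGINE=$OCR_ENGINE OCR_ENGINE_EXEC=$OCR_ENGINE_EXEC"
[ -x "$OCR_ENGINE_EXEC" ] || echo "CRITICAL:$OCR_ENGINE_EXEC not present."
```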


popnt commented on August 23, 2024

I see that pmocr.sh originally had an IF wrapper which prevented the declarations for tesseract3 and abbyyocr11 from affecting each other (line-59).

Personally I think that was a more elegant solution than commenting out entire sections, as this can lead to human error, and it's much easier to control which OCR to use with one global variable, OCR_ENGINE, instead.

Nevertheless I commented out the abbyyocr11 settings from default.conf and the service appears to start properly now. I have not tested with multiple configs yet.

When I moved a single jpg into a monitored path, the OCR process began as expected. However, when I then moved a different jpg into the same directory while the first process was still running, the OCR process picked up the new jpg but also started a new process to OCR the previous jpg. So there were 3 processes running concurrently, 2 of them for the same jpg, started at different times.


deajan commented on August 23, 2024

As I said, I cannot add IF logic directly into a config file.
Having commented out settings is a small price to pay to have an upgrade path that doesn't overwrite the configuration.
Also, it's way more elegant to only duplicate a config file than having to duplicate the main executable and the services files for each instance.

I'm not sure about your problem with the double OCR process, as no other OCR session should launch until the first is finished, from a code point of view. I'll run some tests.


deajan commented on August 23, 2024

Having done multiple tests, there shouldn't be a way to get the same file processed multiple times.
I think that you already had another pmocr service running while trying the new one.
You should check with

ps aux | grep inotifywait

There shouldn't be more than one inotifywait instance with the same path.
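If eyeballing the `ps` output gets tedious, a small filter along these lines could list paths watched by more than one inotifywait. It assumes the watched path is the last argument on the command line (as in the service status output in this thread) and contains no spaces; `duplicate_watch_paths` is a made-up name.

```shell
#!/usr/bin/env bash
# Reads command lines on stdin and prints every path watched by more than
# one inotifywait process. Assumes the path is the last field.
duplicate_watch_paths() {
  awk '$1 ~ /(^|\/)inotifywait$/ { seen[$NF]++ }
       END { for (p in seen) if (seen[p] > 1) print p }'
}

# Empty output means no path is being monitored twice.
ps -eo args | duplicate_watch_paths
```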


popnt commented on August 23, 2024

I re-read the previous notes and I cannot find where you indicated the config cannot contain an IF, but your explanation makes sense and I agree that having multiple configs is a better tradeoff than duplicating the executable.

As for the multiple instances, I ran ps aux|grep inotify and there is only one process running. I repeated my earlier test of first copying a single file to the monitored path, wait a second for the ocr to start, then copy a second different file, and unfortunately three processes were running again, same pattern as before: 2 OCR running for the same file. I am not sure what might be causing this but even after rebooting I am able to repro.

I am monitoring by running screen, then top in one window and bash in another. Then from the bash window I copy a single jpg, wait for tesseract to appear, then copy another jpg. It does successfully output three PDFs and there are no other jpgs in the monitored path prior to the test procedure.

Can you think of anything else other than multiple inotify running that might cause this?


deajan commented on August 23, 2024

I stated that "I cannot add logic to a config file" (the comment with the EDIT tag), which includes any instructions other than variable assignments, including IF.

Anyway, I've tried to reproduce your problem, without success.
While doing so, I identified two other potential bugs, one of which made me modify the code that handles file monitoring to make it asynchronous in order to catch all files (files added while an OCR process was already running weren't caught, leaving them unprocessed until the next file was added).
Please update to the latest code and check again.

If you still experience your problem, I'll need you to follow the instructions below:

Stop the service, double check there is no inotifywait and no pmocr running with

ps aux | egrep "pmocr|inotify"

Then launch the service manually with the following command (using the debug version instead of the normal version of the program):

_PARANOIA_DEBUG=yes bash -x ./debug_pmocr.sh --service --verbose > /var/log/pmocr_debug.log 2>&1

Then I'll need the output of /var/log/pmocr.log and /var/log/pmocr_debug.log (pasted as gist if possible).


popnt commented on August 23, 2024

In the interest of keeping the testing as clean as possible I'll setup a test server via QEMU.. my platform is a Raspberry Pi which is ARM and apparently QEMU is the easiest virtualisation method. I'll let you know soon when I have that setup.

I had also noticed that bug which skipped newly added files if there was a current OCR process running, however since the next batch of files would then include the one skipped I didn't think it was too critical, but nice to see you caught that one too :)

Please allow me a few days to setup the virtual machine. I'll reply back with the results after following the debug instructions above.


deajan commented on August 23, 2024

Corrected two other bugs this morning and improved tesseract support.
Please only test with latest master. Eager to have your test results.

Btw, I'm thinking of including an OCR preprocessor for tesseract. Do you use any tools like OpenCV or ImageMagick to deskew / clear background / remove noise from your images before handing them to tesseract ?


deajan commented on August 23, 2024

Glad to hear everything worked for you.
I still don't know what could have been triggering the double conversion, and never saw any in my tests, but since I rewrote the code that triggers conversion in order to get async monitoring, I had to redo all tests anyway.
There are now two pmocr processes running in order to keep monitoring even while converting, so that's pretty normal.

About the preprocessing, I already integrated ImageMagick as an optional preprocessor for Tesseract in the latest commits.
If you are happy with the new functionality regarding multiple monitoring with different OCR options, feel free to close this issue.


popnt commented on August 23, 2024

Yes I saw the new settings, I haven't experimented with them but I'll give them a shot when I get a chance.

One question, though I'm not sure whether it's related to this release or was already the behaviour before: the source files are added to the monitored path by a regular non-root user, but the output pdf is owned by root. This doesn't seem normal. Since I am the only user on my system it's no big deal, but shouldn't the output files be owned by the same user as the source files?


deajan commented on August 23, 2024

Good point. Actually the files are created as the user who runs the service.
There's no easy way to preserve file ownership as there are no real transformations, but rather creations.
Three solutions:

Get the file permissions before processing the files, and apply them afterwards to the output file (bulky and not an elegant solution)
Launch the service as another user (makes everything more complex)
Add ACL inheritance on the monitored folders (simple as long as the FS has ACL support)

What do you think about the ACL solution ?


popnt commented on August 23, 2024

For ACL inheritance, just so I understand the suggestion correctly, you're saying that the current folder's owner is what would be inherited by the files created? I suspect I have misunderstood something, but if that's what you're saying then yeah maybe that would be best.

I agree the first two points are probably not ideal, although I'm not sure retaining all the file permissions would be necessary but rather just the owner -- how much more or less complex that would be versus retaining all the permissions, is this just semantics? Launching the service as another user could end up being a mess and probably not useful for an admin.


deajan commented on August 23, 2024

The owner of new files would still be root, but an ACL on the parent folder would allow other users or groups than root to have the same privileges as root on the file.

Getting and setting ownership and / or permissions on files isn't really elegant, even if it doesn't take much code.

Launching the service as another user is what other daemons often do, but it would just shift the problem you describe from root to another user.


popnt commented on August 23, 2024

Setting the ACL on the parent folder would require the person installing the service to set the permissions correctly and then manage the user/group rights, which I imagine would be very reasonable from an admin perspective. However, for a less experienced admin it might be an annoyance.

Conceptually speaking if I were running a service for a group of users and I wanted them to benefit from the OCR service but I didn't want to allow each one to read or edit any of the other users content then maybe managing groups in the parent folder would be less convenient than making the output accessible only by the originating user. I really don't have enough experience managing file rights for groups of users like this so I'm not sure what sort of headache this might become.

Perhaps I am naive but if assigning the same ownership/permissions to the output file is technically trivial albeit not super elegant, that may be a worthwhile sacrifice. I only asked this question under this issue to confirm if this is the intended behaviour for v1.5, but we can open up a new Issue and discuss it more in detail there? The multilingual/multiconfig aspects of v1.5 seem to be pretty solid AFAIC.


deajan commented on August 23, 2024

Well there aren't like 1000 issues open on pmocr, so we might just continue to talk here.

I added an optional parameter to keep ownership of the files, as long as the service is executed as root so it can chown.
I also added a chmod mask parameter so new files will have the permissions set in the config file instead of the default ones.
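For readers wondering what this amounts to, here is a hedged sketch rather than the actual pmocr code. The setting names `PRESERVE_OWNERSHIP` and `FILE_PERMISSIONS` and the function `apply_ownership` are placeholders, since the real parameter names are not given in this thread. `chown --reference` is a GNU coreutils option.

```shell
#!/usr/bin/env bash
# Placeholder settings; the real config parameter names may differ.
PRESERVE_OWNERSHIP=yes
FILE_PERMISSIONS=640

# Copy owner/group from the source file to the OCR output, then apply the mask.
apply_ownership() {
  local src="$1" out="$2"
  if [ "$PRESERVE_OWNERSHIP" = "yes" ]; then
    # Changing the owner to someone else only succeeds when running as root.
    chown --reference="$src" "$out"
  fi
  if [ -n "$FILE_PERMISSIONS" ]; then
    chmod "$FILE_PERMISSIONS" "$out"
  fi
}

# Demo on throwaway files.
tmpdir="$(mktemp -d)"
touch "$tmpdir/scan.jpg" "$tmpdir/scan_OCR.pdf"
apply_ownership "$tmpdir/scan.jpg" "$tmpdir/scan_OCR.pdf"
ls -l "$tmpdir"
rm -rf "$tmpdir"
```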

Let me know if this works out for you.


popnt commented on August 23, 2024

I've been running the most recent master with new ownership/permissions settings and everything seems to be running smoothly! I haven't noticed anything out of the ordinary, all around job well done I would say :)


deajan commented on August 23, 2024

Thanks for the feedback.
I'll have to review the code before going for a release.
Feel free to open another issue if you think of other improvements.

