
ocrd-galley

A Dockerized test environment for OCR-D processors 🚒

WIP. Given an OCR-D workspace with document images in the OCR-D-IMG file group, the example workflow produces:

  • Binarized images
  • Line segmentation
  • OCR text (using Calamari and Tesseract, both with GT4HistOCR models)
  • (Given ground truth in OCR-D-GT-PAGE, also an OCR text evaluation report)

If you're interested in the exact processors, versions and parameters, please take a look at the script and possibly the individual Dockerfiles.

Goal

Provide a test environment for producing OCR output for historical prints using OCR-D, in particular ocrd_calamari and sbb_textline_detection, with all dependencies packaged in Docker.

How to use

ocrd-galley uses Docker to run the OCR-D images. We provide pre-built container images that get downloaded automatically when you run the provided wrappers for the OCR-D processors.

You can then install the wrappers into a Python venv:

cd ~/devel/ocrd-galley/wrapper
pip install .

To download models, you need to use the -a flag of ocrd resmgr:

ocrd resmgr download -a ocrd-calamari-recognize default

You may then use the my_ocrd_workflow script to run the workflow on an example workspace:

# Download an example workspace
cd /tmp
wget https://qurator-data.de/examples/actevedef_718448162.first-page.zip
unzip actevedef_718448162.first-page.zip

# Run the workflow on it
cd actevedef_718448162.first-page
~/devel/ocrd-galley/my_ocrd_workflow

Viewing results

You may then examine the results using PRImA's PAGE Viewer:

java -jar /path/to/JPageViewer.jar \
  --resolve-dir . \
  OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_00000024.xml

The example workflow also produces OCR evaluation reports using dinglehopper, if ground truth was available:

firefox OCR-D-OCR-CALAMARI-EVAL/OCR-D-OCR-CALAMARI-EVAL_00000024.html

ppn2ocr

The ppn2ocr script produces a workspace and METS file with the best images for a given document in the digitized collections of the Berlin State Library (SBB).

Install it with an up-to-date pip (otherwise the installation will fail due to an opencv-python-headless build failure):

pip install -r ~/devel/ocrd-galley/requirements-ppn2ocr.txt

The document must be specified by its PPN, for example:

~/devel/ocrd-galley/ppn2ocr PPN77164308X
cd PPN77164308X
~/devel/ocrd-galley/my_ocrd_workflow -I MAX --skip-validation

This produces a workspace directory PPN77164308X with the OCR results in it; the results are viewable as explained above.

ppn2ocr requires properly set up environment variables for the proxy configuration. At SBB, please read howto/docker-proxy.md and howto/proxy-settings-for-shell+python.md (in qurator's mono-repo).

ocrd-workspace-from-images

The ocrd-workspace-from-images script produces an OCR-D workspace (including a METS file) for the given images.

~/devel/ocrd-galley/ocrd-workspace-from-images 0005.png
cd workspace-xxxxx  # output by the last command
~/devel/ocrd-galley/my_ocrd_workflow

This produces a workspace from the files and then runs the OCR workflow on it.

Build the containers yourself

To build the containers yourself using Docker:

cd ~/devel/ocrd-galley/
./build

ocrd-galley's People

Contributors

cneud, mikegerber, robinschaefer


ocrd-galley's Issues

Move each processor into its own container

Now that we need TF1 for sbb_textline_detector and TF2 for ocrd_calamari, it's time.

  • setup containers
  • ENTRYPOINT? CMD?
  • Fix run
  • getpip
  • s/base/core
  • Check build cache
    • We used to use --cache-from my_ocrd_workflow in build
  • ocrd_logging.py → RUN echo
  • Review LOG_LEVEL setting
  • Test README
  • Update README
  • Push to Docker Hub
    • also check a versioned/tagged build
  • Fix run-docker-hub
  • Delete (old single container) docker image my_ocrd_workflow (from Docker Hub)

Review README.md

  • General review
  • The README should show that you can just use OCR-D as normal, i.e. that the wrappers make it transparent
  • GPU support

ppn2ocr: Rename BEST fileGrp to MAX

As we are going to have a MAX fileGrp in the SBB's METS files, we are moving to MAX here too (instead of BEST). I am currently assuming that the SBB's MAX fileGrp is going to be made from full IIIF URLs, so identical to our current BEST.

@j23d @cneud

  • Rename BEST to MAX
  • Handle an existing MAX

Travis: Should not push Docker image when build fails

https://travis-ci.org/github/mikegerber/my_ocrd_workflow/jobs/699616134

The command "FORCE_DOWNLOAD=y ./build" exited with 1.

0.10s$ docker tag my_ocrd_workflow $DOCKER_USERNAME/my_ocrd_workflow:$TRAVIS_COMMIT
The command "docker tag my_ocrd_workflow $DOCKER_USERNAME/my_ocrd_workflow:$TRAVIS_COMMIT" exited with 0.

0.09s$ docker images
REPOSITORY                    TAG                                        IMAGE ID            CREATED             SIZE
<none>                        <none>                                     a5967a65a668        4 minutes ago       4.51GB
ubuntu                        18.04                                      8e4ce0a6ce69        32 hours ago        64.2MB
my_ocrd_workflow              latest                                     a84109dc0496        4 weeks ago         6.21GB
mikegerber/my_ocrd_workflow   746fb768da6bf5b92c1012a18236291ae07b805a   a84109dc0496        4 weeks ago         6.21GB
mikegerber/my_ocrd_workflow   latest                                     a84109dc0496        4 weeks ago         6.21GB
The command "docker images" exited with 0.

1.74s$ docker push $DOCKER_USERNAME/my_ocrd_workflow:$TRAVIS_COMMIT

Only use PyPI versions where possible

  1. Only use PyPI versions where possible.
  2. If that is not possible (= available) use a versioned release from GitHub
  3. Otherwise, use a GitHub commit

Besides relying on proper releases, this also has a second purpose: reviewing the releases of the qurator-spk projects.
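
Expressed as a requirements file, this policy might look like the following sketch (package names, versions, and commit hashes are illustrative, not actual pins):

```
# 1. PyPI release, pinned
ocrd_calamari==1.0.1
# 2. No PyPI release: versioned GitHub release
sbb_binarization @ git+https://github.com/qurator-spk/sbb_binarization@v0.0.1
# 3. Last resort: pinned GitHub commit
some_processor @ git+https://github.com/qurator-spk/some_processor@0123abc
```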

Don't duplicate command names

Currently, command names need to be maintained in both wrapper/qurator/ocrd_galley/cli.py and wrapper/setup.py. Maybe setup.py could read them from cli.py (if that is safe), or both could read from some kind of configuration file?
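
One possible approach, sketched below (this is not the project's actual layout; the dispatcher path `qurator.ocrd_galley.cli:main` and the command list are assumed for illustration): keep a single list of command names in cli.py and derive the setuptools entry points from it.

```python
# Sketch: cli.py as the single source of truth for the wrapper command
# names; setup.py derives its console_scripts entries from it.
COMMANDS = [
    "ocrd-calamari-recognize",
    "ocrd-tesserocr-recognize",
]

def entry_points():
    """Build a setuptools entry_points dict from COMMANDS.

    Assumes every command is handled by one generic dispatcher,
    hypothetically `qurator.ocrd_galley.cli:main`.
    """
    return {
        "console_scripts": [
            f"{name} = qurator.ocrd_galley.cli:main" for name in COMMANDS
        ]
    }
```

setup.py could then do `from qurator.ocrd_galley.cli import entry_points` and pass `entry_points=entry_points()` to `setup()`; this is safe as long as cli.py does not import anything heavy at module level.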

Test

Use e.g. Travis to automatically test:

  • Docker build, including model download
  • Basic functionality with a test workspace
  • Add Travis badge

This is especially important because I almost never use the model downloads.

pip install -r ~/devel/my_ocrd_workflow/requirements-ppn2ocr.txt seems to need scikit-build

@cneud reports (translated from German):

Just noting this here as a reminder... to get opencv-headless built for https://github.com/mikegerber/my_ocrd_workflow#ppn2ocr, I additionally had to install scikit-build via pip


Collecting opencv-python-headless (from ocrd->-r /home/cnd/tmp/dev/qurator/my_ocrd_workflow/requirements-ppn2ocr.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/2f/b4/2ddaaecc332e6ddafb7726abb6139955a99282afe5f370930890bb572707/opencv-python-headless-4.4.0.42.tar.gz (88.9MB)
    100% |████████████████████████████████| 88.9MB 18kB/s
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-wx36s1mb/opencv-python-headless/setup.py", line 9, in <module>
        import skbuild
    ModuleNotFoundError: No module named 'skbuild'   
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-wx36s1mb/opencv-python-headless/

Check Travis problems

15:08:16.312 INFO processor.OcrdSbbTextlineDetectorRecognize - INPUT FILE 0 / <OcrdFile fileGrp=OCR-D-IMG-BIN, ID=OCR-D-IMG-BIN_00000024, mimetype=application/vnd.prima.page+xml, url=OCR-D-IMG-BIN/OCR-D-IMG-BIN_00000024.xml, local_filename=OCR-D-IMG-BIN/OCR-D-IMG-BIN_00000024.xml]/> 

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

https://travis-ci.org/github/mikegerber/my_ocrd_workflow/jobs/730303389#L386

  • Check if the build is slower now
  • Also check error handling, i.e. set -e

Processors should work with the default models/resources

Calling ocrd-calamari-recognize without checkpoint currently yields:

15:06:34.301 ERROR ocrd.ocrd-calamari-recognize.resolve_resource - Could not find resource 'qurator-gt4histocr-1.0' for executable 'ocrd-calamari-recognize'. Try 'ocrd resmgr download ocrd-calamari-recognize qurator-gt4histocr-1.0' to download this resource.

  • ocrd_calamari
  • ocrd_tesserocr?
  • sbb_binarization
  • sbb_textline_detection?
  • eynollah?
  • Consider how OCR-D resources currently reflect our model versions

ppn2ocr: Check DEFAULT / Remove superfluous file groups

  • The DEFAULT file group does not seem to have a consistent IIIF equivalent. Check this and decide what to do with it. (I think it currently uses the non-IIIF URLs.)
  • Consider removing superfluous file groups like DEFAULT and THUMBS altogether, as they seem to cause more confusion than anything else (i.e. users accidentally generating OCR from DEFAULT and then having trouble finding the equivalent IIIF for use in neat; with MAX we use IIIF full, so this should not happen there).

👀 @kba @labusch – As discussed in Gitter

ppn2ocr currently requires NFS access to PRESENTATION file group

ppn2ocr should

  • Use SBB's IIIF API
  • Maybe support loading even better quality from NFS

Side note: When fumbling with the image URLs from the METS by replacing 800 by full, we get the best resolution. IIIF seems to give smaller images.
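
The replacement mentioned in the side note could be sketched like this (the URL layout is an assumption based on the IIIF Image API size segment, not verified against actual SBB METS files):

```python
# Sketch of the "replacing 800 by full" URL fix.  The example URL is
# made up; real METS image URLs may structure the size segment
# differently.
def upgrade_image_url(url: str) -> str:
    """Request the full-resolution variant instead of the 800px-wide one."""
    return url.replace("/800,/", "/full/")

print(upgrade_image_url(
    "https://example.org/iiif/image-0001/full/800,/0/default.jpg"))
# → https://example.org/iiif/image-0001/full/full/0/default.jpg
```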

Move to qurator-spk group

  • Move to qurator-spk
  • Finish rename to ocrd-galley
  • Delete mikegerber/my_ocrd_workflow-* images on Docker Hub

ppn2ocr: Fix PRESENTATION file group

We currently remove the file group as it was causing trouble. To fix this:

  • Fix #13 - allow Docker volumes
  • Investigate why OCR-D is choking on the local file names/file URLs

ppn2ocr: Check BEST dpi

JPEGs are still without DPI:

  <notice>Image FILE_0001_BEST: xResolution (1 pixels per inches) is suspiciously low</notice>

TIFFs are wrong:

  <notice>Image FILE_0001_BEST: xResolution (25 pixels per inches) is suspiciously low</notice>

ppn2ocr: FULLTEXT should reference images (+ sanity check)

We assume that FULLTEXT ALTO was created using the BEST images. It should also reference these images.

  • Sanity check image:
    • image dimensions against <Page HEIGHT= WIDTH= />
      • There may be documents where this is not the case. Which ones?
    • also check that <MeasurementUnit>pixel</MeasurementUnit>
  • Insert or check <imageFilename>
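
The dimension check above could be sketched with stdlib XML parsing as follows (the ALTO snippet is simplified and the ALTO v2 namespace is assumed; real files may use a different ALTO version):

```python
import xml.etree.ElementTree as ET

# Minimal illustrative ALTO fragment; real files carry much more structure.
ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Description>
    <MeasurementUnit>pixel</MeasurementUnit>
  </Description>
  <Layout>
    <Page WIDTH="1000" HEIGHT="1500"/>
  </Layout>
</alto>"""

NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

def sanity_check(alto_xml: str, image_width: int, image_height: int) -> bool:
    """Check that the ALTO Page dimensions match the actual image and
    that the measurement unit is 'pixel' (otherwise WIDTH/HEIGHT are
    not directly comparable to pixel dimensions)."""
    root = ET.fromstring(alto_xml)
    unit = root.findtext(".//alto:MeasurementUnit", namespaces=NS)
    page = root.find(".//alto:Page", NS)
    return (
        unit == "pixel"
        and int(page.get("WIDTH")) == image_width
        and int(page.get("HEIGHT")) == image_height
    )

print(sanity_check(ALTO, 1000, 1500))  # → True
```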

GPU support

We currently run the workflow in a GPU-less container. From what I have tested, it could get GPU support by:

  • basing on a CUDA image, e.g. FROM nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04, and
  • running docker run --gpus all

It should still work on CPU-only systems.

TensorFlow CUDA dependency hell

Regardless of a possible future version pinning:

On build,

  • Retrieve installed TensorFlow version
  • Retrieve installed CUDA Toolkit version
  • Run a script that checks compatibility (generated by test-nvidia; we can't just call is_gpu_available() as we don't always build on a GPU-enabled system)
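
Such a build-time check could compare the installed versions against a small compatibility table instead of needing a GPU. A sketch (the table rows follow TensorFlow's published tested build configurations but should be verified; how the versions are obtained, e.g. from tensorflow.__version__ and nvcc --version, is left out):

```python
# Hypothetical build-time compatibility check: no GPU required, unlike
# is_gpu_available().  Example rows based on TensorFlow's published
# "tested build configurations" -- verify before relying on them.
COMPATIBLE_CUDA = {
    "1.15": "10.0",
    "2.3": "10.1",
    "2.4": "11.0",
}

def check_compatibility(tf_version: str, cuda_version: str) -> bool:
    """Return True if the installed CUDA toolkit series matches what
    this TensorFlow series was built against."""
    tf_series = ".".join(tf_version.split(".")[:2])
    cuda_series = ".".join(cuda_version.split(".")[:2])
    return COMPATIBLE_CUDA.get(tf_series) == cuda_series

print(check_compatibility("1.15.4", "10.0.130"))  # → True
print(check_compatibility("2.4.1", "10.1.243"))   # → False
```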

Improve README

README should include

  • a download of the example workspace
  • instructions on how to check the results using the XML files and JPageViewer

Build stable Docker images again

As building these Docker images on the free Travis and GitHub Actions (#36) plans is not possible due to resource constraints, I'll do it with some private CI.
