
ocrd-galley

A Dockerized test environment for OCR-D processors 🚒

WIP. Given an OCR-D workspace with document images in the OCR-D-IMG file group, the example workflow produces:

  • Binarized images
  • Line segmentation
  • OCR text (using Calamari and Tesseract, both with GT4HistOCR models)
  • (Given ground truth in OCR-D-GT-PAGE, also an OCR text evaluation report)

If you're interested in the exact processors, versions and parameters, please take a look at the script and possibly the individual Dockerfiles.

Goal

Provide a test environment for producing OCR output for historical prints using OCR-D, in particular ocrd_calamari and sbb_textline_detection, with all dependencies packaged in Docker.

How to use

ocrd-galley uses Docker to run the OCR-D images. We provide pre-built container images that get downloaded automatically when you run the provided wrappers for the OCR-D processors.

You can then install the wrappers into a Python venv:

cd ~/devel/ocrd-galley/wrapper
pip install .

To download models, you need to use the -a flag of ocrd resmgr:

ocrd resmgr download -a ocrd-calamari-recognize default

You may then use the my_ocrd_workflow script to run the workflow on an example workspace:

# Download an example workspace
cd /tmp
wget https://qurator-data.de/examples/actevedef_718448162.first-page.zip
unzip actevedef_718448162.first-page.zip

# Run the workflow on it
cd actevedef_718448162.first-page
~/devel/ocrd-galley/my_ocrd_workflow

Viewing results

You may then examine the results using PRImA's PAGE Viewer:

java -jar /path/to/JPageViewer.jar \
  --resolve-dir . \
  OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_00000024.xml

The example workflow also produces OCR evaluation reports using dinglehopper, if ground truth was available:

firefox OCR-D-OCR-CALAMARI-EVAL/OCR-D-OCR-CALAMARI-EVAL_00000024.html

ppn2ocr

The ppn2ocr script produces a workspace and METS file with the best images for a given document in the digitized collections of the Berlin State Library (SBB).

Install it with an up-to-date pip (otherwise the installation will fail due to an opencv-python-headless build failure):

pip install -r ~/devel/ocrd-galley/requirements-ppn2ocr.txt

The document must be specified by its PPN, for example:

~/devel/ocrd-galley/ppn2ocr PPN77164308X
cd PPN77164308X
~/devel/ocrd-galley/my_ocrd_workflow -I MAX --skip-validation

This produces a workspace directory PPN77164308X with the OCR results in it; the results are viewable as explained above.

ppn2ocr requires properly set up environment variables for the proxy configuration. At SBB, please read howto/docker-proxy.md and howto/proxy-settings-for-shell+python.md (in qurator's mono-repo).

ocrd-workspace-from-images

The ocrd-workspace-from-images script produces an OCR-D workspace (including a METS file) for the given images.

~/devel/ocrd-galley/ocrd-workspace-from-images 0005.png
cd workspace-xxxxx  # output by the last command
~/devel/ocrd-galley/my_ocrd_workflow

This produces a workspace from the files and then runs the OCR workflow on it.

Build the containers yourself

To build the containers yourself using Docker:

cd ~/devel/ocrd-galley/
./build

ocrd-galley's People

Contributors

cneud, mikegerber, robinschaefer


ocrd-galley's Issues

Move each processor into its own container

Now that we need TF1 for sbb_textline_detector and TF2 for ocrd_calamari, it's time.

  • setup containers
  • ENTRYPOINT? CMD?
  • Fix run
  • getpip
  • s/base/core
  • Check build cache
    • We used to use --cache-from my_ocrd_workflow in build
  • ocrd_logging.py → RUN echo
  • Review LOG_LEVEL setting
  • Test README
  • Update README
  • Push to Docker Hub
    • also check a versioned/tagged build
  • Fix run-docker-hub
  • Delete (old single container) docker image my_ocrd_workflow (from Docker Hub)

Review README.md

  • General review
  • The README should show that you can just use OCR-D as normal, i.e. that the wrappers make it transparent
  • GPU support

ppn2ocr: Rename BEST fileGrp to MAX

As we are going to have a MAX fileGrp in the SBB's METS files, we are moving to MAX here too (instead of BEST). I am currently assuming that the SBB's MAX fileGrp is going to be made from full IIIF URLs, so identical to our current BEST.

@j23d @cneud

  • Rename BEST to MAX
  • Handle an existing MAX

Travis: Should not push Docker image when build fails

https://travis-ci.org/github/mikegerber/my_ocrd_workflow/jobs/699616134

The command "FORCE_DOWNLOAD=y ./build" exited with 1.

0.10s$ docker tag my_ocrd_workflow $DOCKER_USERNAME/my_ocrd_workflow:$TRAVIS_COMMIT
The command "docker tag my_ocrd_workflow $DOCKER_USERNAME/my_ocrd_workflow:$TRAVIS_COMMIT" exited with 0.

0.09s$ docker images
REPOSITORY                    TAG                                        IMAGE ID            CREATED             SIZE
<none>                        <none>                                     a5967a65a668        4 minutes ago       4.51GB
ubuntu                        18.04                                      8e4ce0a6ce69        32 hours ago        64.2MB
my_ocrd_workflow              latest                                     a84109dc0496        4 weeks ago         6.21GB
mikegerber/my_ocrd_workflow   746fb768da6bf5b92c1012a18236291ae07b805a   a84109dc0496        4 weeks ago         6.21GB
mikegerber/my_ocrd_workflow   latest                                     a84109dc0496        4 weeks ago         6.21GB
The command "docker images" exited with 0.

1.74s$ docker push $DOCKER_USERNAME/my_ocrd_workflow:$TRAVIS_COMMIT

Only use PyPI versions where possible

  1. Only use PyPI versions where possible.
  2. If that is not possible (= available) use a versioned release from GitHub
  3. Otherwise, use a GitHub commit

Besides relying on proper releases, this also has a second purpose: reviewing the releases of the qurator-spk projects.
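
Expressed as a requirements file, this policy might look like the following sketch (package names, versions, and commit hashes are illustrative, not actual pins):

```
# 1. PyPI release, pinned
ocrd_calamari==1.0.1
# 2. No PyPI release: versioned GitHub release
sbb_binarization @ git+https://github.com/qurator-spk/sbb_binarization@v0.0.1
# 3. Last resort: pinned GitHub commit
some_processor @ git+https://github.com/qurator-spk/some_processor@0123abc
```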

Don't duplicate command names

Currently, command names need to be maintained in both wrapper/qurator/ocrd_galley/cli.py and wrapper/setup.py. Maybe setup.py could read them from cli.py (if that is safe), or both could read from some kind of configuration file?
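
One possible approach, sketched below (this is not the project's actual layout; the dispatcher path `qurator.ocrd_galley.cli:main` and the command list are assumed for illustration): keep a single list of command names in cli.py and derive the setuptools entry points from it.

```python
# Sketch: cli.py as the single source of truth for the wrapper command
# names; setup.py derives its console_scripts entries from it.
COMMANDS = [
    "ocrd-calamari-recognize",
    "ocrd-tesserocr-recognize",
]

def entry_points():
    """Build a setuptools entry_points dict from COMMANDS.

    Assumes every command is handled by one generic dispatcher,
    hypothetically `qurator.ocrd_galley.cli:main`.
    """
    return {
        "console_scripts": [
            f"{name} = qurator.ocrd_galley.cli:main" for name in COMMANDS
        ]
    }
```

setup.py could then do `from qurator.ocrd_galley.cli import entry_points` and pass `entry_points=entry_points()` to `setup()`; this is safe as long as cli.py does not import anything heavy at module level.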

Test

Use e.g. Travis to automatically test:

  • Docker build, including model download
  • Basic functionality with a test workspace
  • Add Travis badge

This is especially important because I almost never use the model downloads.

pip install -r ~/devel/my_ocrd_workflow/requirements-ppn2ocr.txt seems to need scikit-build

@cneud reports (translated from German):

Just noting this here as a reminder... to get opencv-headless built for https://github.com/mikegerber/my_ocrd_workflow#ppn2ocr, I additionally had to install scikit-build via pip


Collecting opencv-python-headless (from ocrd->-r /home/cnd/tmp/dev/qurator/my_ocrd_workflow/requirements-ppn2ocr.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/2f/b4/2ddaaecc332e6ddafb7726abb6139955a99282afe5f370930890bb572707/opencv-python-headless-4.4.0.42.tar.gz (88.9MB)
    100% |████████████████████████████████| 88.9MB 18kB/s
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-wx36s1mb/opencv-python-headless/setup.py", line 9, in <module>
        import skbuild
    ModuleNotFoundError: No module named 'skbuild'   
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-wx36s1mb/opencv-python-headless/

Check Travis problems

15:08:16.312 INFO processor.OcrdSbbTextlineDetectorRecognize - INPUT FILE 0 / <OcrdFile fileGrp=OCR-D-IMG-BIN, ID=OCR-D-IMG-BIN_00000024, mimetype=application/vnd.prima.page+xml, url=OCR-D-IMG-BIN/OCR-D-IMG-BIN_00000024.xml, local_filename=OCR-D-IMG-BIN/OCR-D-IMG-BIN_00000024.xml]/> 

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

https://travis-ci.org/github/mikegerber/my_ocrd_workflow/jobs/730303389#L386

  • Check if the build is slower now
  • Also check error handling, i.e. set -e

Processors should work with the default models/resources

Calling ocrd-calamari-recognize without checkpoint currently yields:

15:06:34.301 ERROR ocrd.ocrd-calamari-recognize.resolve_resource - Could not find resource 'qurator-gt4histocr-1.0' for executable 'ocrd-calamari-recognize'. Try 'ocrd resmgr download ocrd-calamari-recognize qurator-gt4histocr-1.0' to download this resource.

  • ocrd_calamari
  • ocrd_tesserocr?
  • sbb_binarization
  • sbb_textline_detection?
  • eynollah?
  • Consider how OCR-D resources currently reflect our model versions

ppn2ocr: Check DEFAULT / Remove superfluous file groups

  • The DEFAULT file group does not seem to have a consistent IIIF equivalent. Check this and decide what to do with it. (I think it currently uses the non-IIIF URLs.)
  • Consider removing superfluous file groups like DEFAULT and THUMBS altogether, as they seem to cause more confusion than anything else (i.e. users accidentally generating OCR from DEFAULT and then having trouble finding the equivalent IIIF for use in neat; with MAX we use IIIF full, so this should not happen there).

👀 @kba @labusch – As discussed in Gitter

ppn2ocr currently requires NFS access to PRESENTATION file group

ppn2ocr should

  • Use SBB's IIIF API
  • Maybe support loading even better quality from NFS

Side note: When fumbling with the image URLs from the METS by replacing 800 by full, we get the best resolution. IIIF seems to give smaller images.
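
The replacement mentioned in the side note could be sketched like this (the URL layout is an assumption based on the IIIF Image API size segment, not verified against actual SBB METS files):

```python
# Sketch of the "replacing 800 by full" URL fix.  The example URL is
# made up; real METS image URLs may structure the size segment
# differently.
def upgrade_image_url(url: str) -> str:
    """Request the full-resolution variant instead of the 800px-wide one."""
    return url.replace("/800,/", "/full/")

print(upgrade_image_url(
    "https://example.org/iiif/image-0001/full/800,/0/default.jpg"))
# → https://example.org/iiif/image-0001/full/full/0/default.jpg
```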

Move to qurator-spk group

  • Move to qurator-spk
  • Finish rename to ocrd-galley
  • Delete mikegerber/my_ocrd_workflow-* images on Docker Hub

ppn2ocr: Fix PRESENTATION file group

We currently remove the file group as it was causing trouble. To fix this:

  • Fix #13 - allow Docker volumes
  • Investigate why OCR-D is choking on the local file names/file URLs

ppn2ocr: Check BEST dpi

JPEGs are still without DPI:

  <notice>Image FILE_0001_BEST: xResolution (1 pixels per inches) is suspiciously low</notice>

TIFFs are wrong:

  <notice>Image FILE_0001_BEST: xResolution (25 pixels per inches) is suspiciously low</notice>

ppn2ocr: FULLTEXT should reference images (+ sanity check)

We assume that FULLTEXT ALTO was created using the BEST images. It should also reference these images.

  • Sanity check image:
    • image dimensions against <Page HEIGHT= WIDTH= />
      • There may be documents where this is not the case. Which ones?
    • also check that <MeasurementUnit>pixel</MeasurementUnit>
  • Insert or check <imageFilename>
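
The dimension check above could be sketched with stdlib XML parsing as follows (the ALTO snippet is simplified and the ALTO v2 namespace is assumed; real files may use a different ALTO version):

```python
import xml.etree.ElementTree as ET

# Minimal illustrative ALTO fragment; real files carry much more structure.
ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Description>
    <MeasurementUnit>pixel</MeasurementUnit>
  </Description>
  <Layout>
    <Page WIDTH="1000" HEIGHT="1500"/>
  </Layout>
</alto>"""

NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

def sanity_check(alto_xml: str, image_width: int, image_height: int) -> bool:
    """Check that the ALTO Page dimensions match the actual image and
    that the measurement unit is 'pixel' (otherwise WIDTH/HEIGHT are
    not directly comparable to pixel dimensions)."""
    root = ET.fromstring(alto_xml)
    unit = root.findtext(".//alto:MeasurementUnit", namespaces=NS)
    page = root.find(".//alto:Page", NS)
    return (
        unit == "pixel"
        and int(page.get("WIDTH")) == image_width
        and int(page.get("HEIGHT")) == image_height
    )

print(sanity_check(ALTO, 1000, 1500))  # → True
```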

GPU support

We currently run the workflow in a GPU-less container. From what I have tested, it could get GPU support by:

  • basing on a CUDA image, e.g. FROM nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04, and
  • running docker run --gpus all

It should still work on CPU-only systems.

TensorFlow CUDA dependency hell

Regardless of a possible future version pinning:

On build,

  • Retrieve installed TensorFlow version
  • Retrieve installed CUDA Toolkit version
  • Run a script that checks compatibility (generated by test-nvidia; we can't just call is_gpu_available() as we don't always build on a GPU-enabled system)
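
Such a build-time check could compare the installed versions against a small compatibility table instead of needing a GPU. A sketch (the table rows follow TensorFlow's published tested build configurations but should be verified; how the versions are obtained, e.g. from tensorflow.__version__ and nvcc --version, is left out):

```python
# Hypothetical build-time compatibility check: no GPU required, unlike
# is_gpu_available().  Example rows based on TensorFlow's published
# "tested build configurations" -- verify before relying on them.
COMPATIBLE_CUDA = {
    "1.15": "10.0",
    "2.3": "10.1",
    "2.4": "11.0",
}

def check_compatibility(tf_version: str, cuda_version: str) -> bool:
    """Return True if the installed CUDA toolkit series matches what
    this TensorFlow series was built against."""
    tf_series = ".".join(tf_version.split(".")[:2])
    cuda_series = ".".join(cuda_version.split(".")[:2])
    return COMPATIBLE_CUDA.get(tf_series) == cuda_series

print(check_compatibility("1.15.4", "10.0.130"))  # → True
print(check_compatibility("2.4.1", "10.1.243"))   # → False
```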

Improve README

README should include

  • a download of the example workspace
  • instructions on how to check the results using the XML files and JPageViewer

Build stable Docker images again

As building these Docker images on the free Travis and GitHub Actions (#36) plans is not possible due to resource constraints, I'll do it with some private CI.
