License: MIT License

ocrd_wrap

OCR-D wrapper for arbitrary coords-preserving image operations

Introduction
Installation
Usage
Testing

Introduction

This package offers OCR-D compliant workspace processors for any image processing tool that has a usable CLI and does not modify or invalidate image coordinates.

It thus wraps them for OCR-D without the need to write and maintain code for each tool individually (exposing, passing and documenting its parameters and usage, managing releases, etc.). Instead, it shifts the burden to workflow configuration: defining a suitable parameter set (which program to call on which data, and how) and installing all the required tools.

It is itself written in Python and relies heavily on the OCR-D core API, which handles METS/PAGE and provides the OCR-D CLI.

In addition, this aims to wrap existing Python packages for preprocessing as OCR-D processors (one at a time).

Installation

Create and activate a virtual environment as usual.

To install Python dependencies:

make deps

This is equivalent to:

pip install -r requirements.txt

Then, to install this module:

make install

This is equivalent to:

pip install .

Usage

OCR-D processor interface `ocrd-preprocess-image`

To be used with PAGE-XML documents in an OCR-D annotation workflow.

Usage: ocrd-preprocess-image [OPTIONS]

  Convert or enhance images

  > Performs coords-preserving image operations via runtime shell calls
  > anywhere.

  > Open and deserialize PAGE input files and their respective images,
  > then iterate over the element hierarchy down to the requested
  > ``level-of-operation`` in the element hierarchy.

  > For each segment element, retrieve a segment image according to the
  > layout annotation (from an existing AlternativeImage, or by cropping
  > via coordinates into the higher-level image, and - when applicable -
  > deskewing).

  > If ``input_feature_selector`` and/or ``input_feature_filter`` is
  > non-empty, then select/filter among the @imageFilename image and the
  > available AlternativeImages the last one which contains all of the
  > selected, but none of the filtered features (i.e. @comments
  > classes), or raise an error.

  > Then write that image into a temporary PNG file, create a new METS
  > file ID for the result image (based on the segment ID and the
  > operation to be run), along with a local path for it, and pass
  > ``command`` to the shell after replacing:
  > - the string ``@INFILE`` with that input image path, and
  > - the string ``@OUTFILE`` with that output image path.

  > If the shell returns with a failure, skip that segment with an
  > appropriate error message. Otherwise, add the new image to the
  > workspace along with the output fileGrp, and using a file ID with
  > suffix ``.IMG-``, and further identification of the input element.

  > Reference it as AlternativeImage in the element, adding
  > ``output_feature_added`` to its @comments.

  > Produce a new PAGE output file by serialising the resulting
  > hierarchy.

Options:
  -I, --input-file-grp USE        File group(s) used as input
  -O, --output-file-grp USE       File group(s) used as output
  -g, --page-id ID                Physical page ID(s) to process
  --overwrite                     Remove existing output pages/images
                                  (with --page-id, remove only those)
  -p, --parameter JSON-PATH       Parameters, either verbatim JSON string
                                  or JSON file path
  -P, --param-override KEY VAL    Override a single JSON object key-value pair,
                                  taking precedence over --parameter
  -m, --mets URL-PATH             URL or file path of METS to process
  -w, --working-dir PATH          Working directory of local workspace
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  -J, --dump-json                 Dump tool description as JSON and exit
  -h, --help                      This help message
  -V, --version                   Show version

Parameters:
   "level-of-operation" [string - "page"]
    PAGE XML hierarchy level to operate on
    Possible values: ["page", "region", "line", "word", "glyph"]
   "input_feature_selector" [string - ""]
    comma-separated list of required image features (e.g.
    binarized,despeckled)
   "input_feature_filter" [string - ""]
    comma-separated list of forbidden image features (e.g.
    binarized,despeckled)
   "output_feature_added" [string - REQUIRED]
    image feature(s) to be added after this operation (if multiple,
    separate by comma)
   "input_mimetype" [string - "image/png"]
    File format to save input images to (tool's expected input)
    Possible values: ["image/bmp", "application/postscript", "image/gif",
    "image/jpeg", "image/jp2", "image/png", "image/x-portable-pixmap",
    "image/tiff"]
   "output_mimetype" [string - "image/png"]
    File format to load output images from (tool's expected output)
    Possible values: ["image/bmp", "application/postscript", "image/gif",
    "image/jpeg", "image/jp2", "image/png", "image/x-portable-pixmap",
    "image/tiff"]
   "command" [string - REQUIRED]
    shell command to operate on image files, with @INFILE as place-holder
    for the input file path, and @OUTFILE as place-holder for the output
    file path
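
The selection and substitution steps described in the help text above can be sketched in a few lines of Python. This is a minimal illustration, not the code in ocrd_wrap: the names `select_image` and `fill_command` are hypothetical, candidate images are modelled as plain (path, comments) pairs, and quoting the paths with `shlex.quote` is a choice of this sketch.

```python
import shlex


def select_image(candidates, selector="", filter_=""):
    """Return the path of the last candidate image that has all selected
    features and none of the filtered ones.

    candidates: list of (path, comments) pairs in document order, where
    comments is the comma-separated feature string (empty for the
    original @imageFilename image). Hypothetical data model.
    """
    required = {f for f in selector.split(",") if f}
    forbidden = {f for f in filter_.split(",") if f}
    for path, comments in reversed(candidates):
        features = {f for f in comments.split(",") if f}
        if required <= features and not (forbidden & features):
            return path
    raise ValueError("no image matches the requested features")


def fill_command(command, infile, outfile):
    """Replace the @INFILE and @OUTFILE placeholders with shell-quoted paths."""
    return (command
            .replace("@INFILE", shlex.quote(infile))
            .replace("@OUTFILE", shlex.quote(outfile)))


if __name__ == "__main__":
    images = [("orig.png", ""),
              ("bin.png", "binarized"),
              ("den.png", "binarized,despeckled")]
    chosen = select_image(images, selector="binarized", filter_="despeckled")
    # The command string is only an illustrative value for the "command"
    # parameter; it assumes ImageMagick's convert is installed.
    print(fill_command("convert @INFILE -despeckle @OUTFILE", chosen, "out.png"))
```

The resulting string is what would then be handed to the shell; a non-zero exit status would cause the segment to be skipped, as described above.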

presets

The following example recipes are included in the distribution:

- enhancement/conversion/denoising using
  - ImageMagick: param_im6convert-denoise-raw
  - GIMP script-fu
  - ...
- binarization using
  - Olena/Scribo: param_scribo-cli-binarize-sauvola-ms-split
  - https://github.com/ajgallego/document-image-binarization ...
  - https://github.com/qurator-spk/sbb_binarization ...
  - https://github.com/masyagin1998/robin ...
  - ...
- text/non-text segmentation using
  - Olena/Scribo ...
  - ...
- ...

OCR-D processor interface `ocrd-skimage-normalize`

To be used with PAGE-XML documents in an OCR-D annotation workflow.

Usage: ocrd-skimage-normalize [OPTIONS]

  Equalize contrast/exposure of images with Scikit-image; stretches the color value/tone to the full dynamic range

  > Performs contrast-enhancing equalization of segment or page images
  > with scikit-image on the workspace.

  > Open and deserialize PAGE input files and their respective images,
  > then iterate over the element hierarchy down to the requested
  > ``level-of-operation`` in the element hierarchy.

  > For each segment element, retrieve a segment image according to the
  > layout annotation (from an existing AlternativeImage, or by cropping
  > via coordinates into the higher-level image, and - when applicable -
  > deskewing), in raw (non-binarized) form.

  > Next, normalize the image according to ``method`` in skimage.

  > Then write the new image to the workspace along with the output
  > fileGrp, and using a file ID with suffix ``.IMG-NRM`` with further
  > identification of the input element.

  > Produce a new PAGE output file by serialising the resulting
  > hierarchy.

Options:
  -I, --input-file-grp USE        File group(s) used as input
  -O, --output-file-grp USE       File group(s) used as output
  -g, --page-id ID                Physical page ID(s) to process
  --overwrite                     Remove existing output pages/images
                                  (with --page-id, remove only those)
  -p, --parameter JSON-PATH       Parameters, either verbatim JSON string
                                  or JSON file path
  -P, --param-override KEY VAL    Override a single JSON object key-value pair,
                                  taking precedence over --parameter
  -m, --mets URL-PATH             URL or file path of METS to process
  -w, --working-dir PATH          Working directory of local workspace
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  -J, --dump-json                 Dump tool description as JSON and exit
  -h, --help                      This help message
  -V, --version                   Show version

Parameters:
   "level-of-operation" [string - "page"]
    PAGE XML hierarchy level to operate on
    Possible values: ["page", "region", "line", "word", "glyph"]
   "dpi" [number - 0]
    pixel density in dots per inch (overrides any meta-data in the
    images); disabled when zero
   "black-point" [number - 1.0]
    black point in percent of luminance/value/tone histogram; up to
    ``black-point`` darkest pixels will be clipped to black when
    stretching
   "white-point" [number - 7.0]
    white point in percent of luminance/value/tone histogram; up to
    ``white-point`` brightest pixels will be clipped to white when
    stretching
   "method" [string - "stretch"]
    contrast-enhancing transformation to use after clipping; ``stretch``
    uses ``skimage.exposure.rescale_intensity`` (globally linearly
    stretching to full dynamic range) and ``adapthist`` uses
    ``skimage.exposure.equalize_adapthist`` (applying over tiles with
    context from 1/8th of the image's width)
    Possible values: ["stretch", "adapthist"]
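
To make the interplay of ``black-point``, ``white-point`` and the ``stretch`` method more concrete, here is a minimal NumPy-only sketch. It assumes an 8-bit grayscale array and a hypothetical helper named `stretch`; it approximates what a call to `skimage.exposure.rescale_intensity` with an explicit `in_range` would do, and is not taken from the processor's source.

```python
import numpy as np


def stretch(gray, black_point=1.0, white_point=7.0):
    """Clip the darkest/brightest percentiles, then stretch linearly to 0..255.

    gray: 2-D uint8 array (assumption of this sketch).
    black_point / white_point: percentages of the histogram to clip.
    """
    lo = np.percentile(gray, black_point)
    hi = np.percentile(gray, 100.0 - white_point)
    if hi <= lo:
        # Flat image: nothing to stretch.
        return gray.copy()
    scaled = (gray.astype(np.float64) - lo) / (hi - lo)
    return (np.clip(scaled, 0.0, 1.0) * 255).round().astype(np.uint8)


if __name__ == "__main__":
    # A low-contrast ramp covering only the values 50..150.
    ramp = np.arange(50, 151, dtype=np.uint8).reshape(1, -1)
    out = stretch(ramp)
    print(out.min(), out.max())
```

Pixels below the black point map to 0 and pixels above the white point map to 255, so the remaining tones occupy the full dynamic range.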

OCR-D processor interface `ocrd-skimage-denoise-raw`

To be used with PAGE-XML documents in an OCR-D annotation workflow.

Usage: ocrd-skimage-denoise-raw [OPTIONS]

  Denoise raw images with Scikit-image

  > Performs raw denoising of segment or page images with scikit-image
  > on the workspace.

  > Open and deserialize PAGE input files and their respective images,
  > then iterate over the element hierarchy down to the requested
  > ``level-of-operation`` in the element hierarchy.

  > For each segment element, retrieve a segment image according to the
  > layout annotation (from an existing AlternativeImage, or by cropping
  > via coordinates into the higher-level image, and - when applicable -
  > deskewing), in raw (non-binarized) form.

  > Next, denoise the image with a Wavelet transform scheme according to
  > ``method`` in skimage.

  > Then write the new image to the workspace along with the output
  > fileGrp, and using a file ID with suffix ``.IMG-DEN`` with further
  > identification of the input element.

  > Produce a new PAGE output file by serialising the resulting
  > hierarchy.

Options:
  -I, --input-file-grp USE        File group(s) used as input
  -O, --output-file-grp USE       File group(s) used as output
  -g, --page-id ID                Physical page ID(s) to process
  --overwrite                     Remove existing output pages/images
                                  (with --page-id, remove only those)
  -p, --parameter JSON-PATH       Parameters, either verbatim JSON string
                                  or JSON file path
  -P, --param-override KEY VAL    Override a single JSON object key-value pair,
                                  taking precedence over --parameter
  -m, --mets URL-PATH             URL or file path of METS to process
  -w, --working-dir PATH          Working directory of local workspace
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  -J, --dump-json                 Dump tool description as JSON and exit
  -h, --help                      This help message
  -V, --version                   Show version

Parameters:
   "level-of-operation" [string - "page"]
    PAGE XML hierarchy level to operate on
    Possible values: ["page", "region", "line", "word", "glyph"]
   "dpi" [number - 0]
    pixel density in dots per inch (overrides any meta-data in the
    images); disabled when zero
   "method" [string - "VisuShrink"]
    Wavelet filtering scheme to use
    Possible values: ["BayesShrink", "VisuShrink"]

OCR-D processor interface `ocrd-skimage-binarize`

To be used with PAGE-XML documents in an OCR-D annotation workflow.

Usage: ocrd-skimage-binarize [OPTIONS]

  Binarize images with Scikit-image

  > Performs binarization of segment or page images with scikit-image on
  > the workspace.

  > Open and deserialize PAGE input files and their respective images,
  > then iterate over the element hierarchy down to the requested
  > ``level-of-operation`` in the element hierarchy.

  > For each segment element, retrieve a segment image according to the
  > layout annotation (from an existing AlternativeImage, or by cropping
  > via coordinates into the higher-level image, and - when applicable -
  > deskewing).

  > Next, binarize the image according to ``method`` with skimage.

  > Then write the new image to the workspace along with the output
  > fileGrp, and using a file ID with suffix ``.IMG-BIN`` with further
  > identification of the input element.

  > Produce a new PAGE output file by serialising the resulting
  > hierarchy.

Options:
  -I, --input-file-grp USE        File group(s) used as input
  -O, --output-file-grp USE       File group(s) used as output
  -g, --page-id ID                Physical page ID(s) to process
  --overwrite                     Remove existing output pages/images
                                  (with --page-id, remove only those)
  -p, --parameter JSON-PATH       Parameters, either verbatim JSON string
                                  or JSON file path
  -P, --param-override KEY VAL    Override a single JSON object key-value pair,
                                  taking precedence over --parameter
  -m, --mets URL-PATH             URL or file path of METS to process
  -w, --working-dir PATH          Working directory of local workspace
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  -J, --dump-json                 Dump tool description as JSON and exit
  -h, --help                      This help message
  -V, --version                   Show version

Parameters:
   "level-of-operation" [string - "page"]
    PAGE XML hierarchy level to operate on
    Possible values: ["page", "region", "line", "word", "glyph"]
   "dpi" [number - 0]
    pixel density in dots per inch (overrides any meta-data in the
    images); disabled when zero
   "method" [string - "sauvola"]
    Thresholding algorithm to use
    Possible values: ["sauvola", "niblack", "otsu", "gauss", "yen", "li"]
   "window_size" [number - 0]
    For Sauvola/Niblack/Gauss, the (odd) window size in pixels; when zero
    (default), set to DPI
   "k" [number - 0.34]
    For Sauvola/Niblack, formula parameter influencing the threshold
    bias; larger is lighter foreground
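
The role of ``k`` and ``window_size`` can be illustrated with the standard Sauvola formula, T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation inside the local window. The sketch below is an assumption-laden illustration: the function names are hypothetical, R = 128 is the value commonly used for 8-bit images, and bumping an even DPI value to the next odd number is only a guess at how the "(odd) window size" requirement might be met.

```python
from statistics import mean, pstdev


def sauvola_threshold(window, k=0.34, r=128.0):
    """Sauvola threshold for the pixel values of one local window."""
    m = mean(window)
    s = pstdev(window)
    return m * (1.0 + k * (s / r - 1.0))


def default_window_size(dpi):
    """Derive an odd window size from the DPI (assumed behaviour)."""
    size = int(dpi)
    return size if size % 2 else size + 1


if __name__ == "__main__":
    window = [100, 120, 140, 160]
    # As long as the local deviation stays below r, a larger k lowers the
    # threshold, so fewer pixels fall below it and count as foreground.
    for k in (0.2, 0.34, 0.5):
        print(k, round(sauvola_threshold(window, k=k), 2))
    print(default_window_size(300))
```

This matches the parameter description: a larger ``k`` yields a lighter (thinner) foreground.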

OCR-D processor interface `ocrd-skimage-denoise`

To be used with PAGE-XML documents in an OCR-D annotation workflow.

Usage: ocrd-skimage-denoise [OPTIONS]

  Denoise binarized images with Scikit-image

  > Performs binary denoising of segment or page images with scikit-
  > image on the workspace.

  > Open and deserialize PAGE input files and their respective images,
  > then iterate over the element hierarchy down to the requested
  > ``level-of-operation`` in the element hierarchy.

  > For each segment element, retrieve a segment image according to the
  > layout annotation (from an existing AlternativeImage, or by cropping
  > via coordinates into the higher-level image, and - when applicable -
  > deskewing), in binarized form.

  > Next, denoise the image by removing too small connected components
  > with skimage.

  > Then write the new image to the workspace along with the output
  > fileGrp, and using a file ID with suffix ``.IMG-DEN`` with further
  > identification of the input element.

  > Produce a new PAGE output file by serialising the resulting
  > hierarchy.

Options:
  -I, --input-file-grp USE        File group(s) used as input
  -O, --output-file-grp USE       File group(s) used as output
  -g, --page-id ID                Physical page ID(s) to process
  --overwrite                     Remove existing output pages/images
                                  (with --page-id, remove only those)
  -p, --parameter JSON-PATH       Parameters, either verbatim JSON string
                                  or JSON file path
  -P, --param-override KEY VAL    Override a single JSON object key-value pair,
                                  taking precedence over --parameter
  -m, --mets URL-PATH             URL or file path of METS to process
  -w, --working-dir PATH          Working directory of local workspace
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  -J, --dump-json                 Dump tool description as JSON and exit
  -h, --help                      This help message
  -V, --version                   Show version

Parameters:
   "level-of-operation" [string - "page"]
    PAGE XML hierarchy level to operate on
    Possible values: ["page", "region", "line", "word", "glyph"]
   "dpi" [number - 0]
    pixel density in dots per inch (overrides any meta-data in the
    images); disabled when zero
   "maxsize" [number - 3]
    maximum component size of (bg holes or fg specks) noise in pt

Testing

none yet

ocrd_wrap's Issues

Release 0.1.4 is missing here on GitHub

The release 0.1.4 is missing here on GitHub!

Commas are not allowed in xsd:ID

ocrd_wrap/ocrd_wrap/shell.py

Line 180 in 84db8bf

out_id = file_id + '.IMG-' + feature_added.upper()

e.g.

mets.xml: Line 86: Element '{http://www.loc.gov/METS/}file', attribute 'ID': 'DESKEW_1586.IMG-BINARIZED,DESKEWED' is not a valid value of the atomic type 'xs:ID'.
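
One conceivable way to avoid the invalid ID is to replace every character that is not allowed in an XML name before appending the feature string. The sketch below is only an illustration of that idea and not necessarily the fix adopted in the repository; `make_image_id` and `is_valid_id` are hypothetical names, and the check covers only the ASCII subset of what `xs:ID` permits.

```python
import re


def make_image_id(file_id, feature_added):
    """Build an image file ID, replacing characters not valid in xs:ID."""
    suffix = re.sub(r"[^A-Za-z0-9_.-]+", "-", feature_added.upper()).strip("-")
    return file_id + ".IMG-" + suffix


def is_valid_id(value):
    """Check the ASCII subset of the xs:ID (NCName) syntax."""
    return re.fullmatch(r"[A-Za-z_][A-Za-z0-9_.-]*", value) is not None


if __name__ == "__main__":
    out_id = make_image_id("DESKEW_1586", "binarized,deskewed")
    print(out_id, is_valid_id(out_id))
```

The full feature list can still be written to @comments unchanged, since only the file ID is subject to the `xs:ID` restriction.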

Wrong data type in PIL/Image.py

I am using the latest ocrd_all maximum image. Workspace used: https://gdz.sub.uni-goettingen.de/mets/PPN1023134829.mets.xml

N E X T F L O W  ~  version 21.04.3
Launching `/scratch1/users/mmustaf/operandi/slurm_workspaces/a7752ccc-3908-4d6a-917c-036cf9ffef6c/user_workflow.nf` [hopeful_plateau] - revision: 4d3b00d56e
O P E R A N D I - H P C - D E F A U L T  P I P E L I N E
===========================================
input_file_group    : MAX
mets                : /scratch1/users/mmustaf/operandi/slurm_workspaces/a7752ccc-3908-4d6a-917c-036cf9ffef6c/7ed688de-482d-439a-816f-75b2226c60db/mets.xml
volume_map_dir      : /scratch1/users/mmustaf/operandi/slurm_workspaces/a7752ccc-3908-4d6a-917c-036cf9ffef6c
models_mapping      : /scratch1/users/mmustaf/ocrd_models:/usr/local/share
sif_path            : /scratch1/users/mmustaf/ocrd_all_maximum_image.sif
singularity_wrapper : singularity exec --bind /scratch1/users/mmustaf/operandi/slurm_workspaces/a7752ccc-3908-4d6a-917c-036cf9ffef6c --bind /scratch1/users/mmustaf/ocrd_models:/usr/local/share --env OCRD_METS_CACHING=true /scratch1/users/mmustaf/ocrd_all_maximum_image.sif

[2e/fe535b] Submitted process > ocrd_cis_ocropy_binarize
[66/3a142e] Submitted process > ocrd_anybaseocr_crop
[fe/6339fd] Submitted process > ocrd_skimage_denoise
Error executing process > 'ocrd_skimage_denoise'

Caused by:
  Process `ocrd_skimage_denoise` terminated with an error exit status (1)

Command executed:

  singularity exec --bind /scratch1/users/mmustaf/operandi/slurm_workspaces/a7752ccc-3908-4d6a-917c-036cf9ffef6c --bind /scratch1/users/mmustaf/ocrd_models:/usr/local/share --env OCRD_METS_CACHING=true /scratch1/users/mmustaf/ocrd_all_maximum_image.sif ocrd-skimage-denoise -m mets.xml -I OCR-D-CROP -O OCR-D-BIN-DENOISE -p '{"level-of-operation": "page"}'

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File "/usr/local/lib/python3.8/dist-packages/PIL/Image.py", line 3089, in fromarray
      mode, rawmode = _fromarray_typemap[typekey]
  KeyError: ((1, 1, 2), '|b1')
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "/build/core/ocrd/ocrd/processor/helpers.py", line 128, in run_processor
      processor.process()
    File "/usr/local/lib/python3.8/site-packages/ocrd_wrap/skimage_denoise.py", line 90, in process
      self._process_segment(page, page_image, page_coords, dpi,
    File "/usr/local/lib/python3.8/site-packages/ocrd_wrap/skimage_denoise.py", line 166, in _process_segment
      image = Image.fromarray(~array2)
    File "/usr/local/lib/python3.8/dist-packages/PIL/Image.py", line 3092, in fromarray
      raise TypeError(msg) from e
  TypeError: Cannot handle this data type: (1, 1, 2), |b1
  Traceback (most recent call last):
    File "/usr/local/lib/python3.8/dist-packages/PIL/Image.py", line 3089, in fromarray
      mode, rawmode = _fromarray_typemap[typekey]
  KeyError: ((1, 1, 2), '|b1')
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "/usr/local/bin/ocrd-skimage-denoise", line 8, in <module>
      sys.exit(ocrd_skimage_denoise())
    File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
      return self.main(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
      rv = self.invoke(ctx)
    File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
      return ctx.invoke(self.callback, **ctx.params)
    File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
      return __callback(*args, **kwargs)
    File "/usr/local/lib/python3.8/site-packages/ocrd_wrap/cli.py", line 33, in ocrd_skimage_denoise
      return ocrd_cli_wrap_processor(SkimageDenoise, *args, **kwargs)
    File "/build/core/ocrd/ocrd/decorators/__init__.py", line 116, in ocrd_cli_wrap_processor
      run_processor(processorClass, mets_url=mets, workspace=workspace, **kwargs)
    File "/build/core/ocrd/ocrd/processor/helpers.py", line 131, in run_processor
      raise err
    File "/build/core/ocrd/ocrd/processor/helpers.py", line 128, in run_processor
      processor.process()
    File "/usr/local/lib/python3.8/site-packages/ocrd_wrap/skimage_denoise.py", line 90, in process
      self._process_segment(page, page_image, page_coords, dpi,
    File "/usr/local/lib/python3.8/site-packages/ocrd_wrap/skimage_denoise.py", line 166, in _process_segment
      image = Image.fromarray(~array2)
    File "/usr/local/lib/python3.8/dist-packages/PIL/Image.py", line 3092, in fromarray
      raise TypeError(msg) from e
  TypeError: Cannot handle this data type: (1, 1, 2), |b1

check that output has same dimensions

add example parameter files for various tools and tasks

wrap more Python image processing libraries

Candidates for low-effort wrappers:

- pgmagick, instead of using convert with ocrd-preprocess-image
- pyleptonica, perhaps mimicking some of Tesseract's better usage (like h/v-line detection or flip detection)
- more from OpenCV (but ocrd_cis already wraps its binarization and morphology functions)
