core,ocr-d

Feature request for ocrd workspace clone

Maybe it's a good idea to allow users to clone a workspace into a given target directory.
-o --output-dir Clone workspace to given directory (optional)

Move cross-class helper functions to appropriate file

E.g. _points_from_box, which is used in multiple classes, should be some kind of static helper function. Check Python best-practice guides on that.
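
One common way to share such a helper is a plain module-level function in a utility module that every class imports. The sketch below is only an illustration: the name `points_from_box`, its arguments, and the "x,y x,y ..." output format are assumptions, and the real `_points_from_box` may look different.

```python
# Hypothetical shared helper, e.g. in a utils module imported by all classes.
# The signature and the output format are assumptions for illustration only.
def points_from_box(x, y, width, height):
    """Return the four corners of a box as a space-separated points string."""
    corners = [
        (x, y),                   # top left
        (x + width, y),           # top right
        (x + width, y + height),  # bottom right
        (x, y + height),          # bottom left
    ]
    return " ".join("%d,%d" % corner for corner in corners)
```

A module-level function avoids tying the helper to any one class and can be unit-tested on its own.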

Approximating DPI as part of image characterization

If dpi/dpix/dpiy is not in the EXIF metadata or cannot be trusted to be accurate, the original page extent must be an input to the characterization.
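
If the physical page extent is known, the resolution follows from dividing the pixel dimensions by the extent in inches. A minimal sketch, assuming the extent is given in inches; the function name and parameters are hypothetical:

```python
# Hypothetical helper: approximate DPI from pixel size and physical page extent.
# Assumes the page extent is supplied in inches (an assumption, not from the issue).
def approximate_dpi(width_px, height_px, width_in, height_in):
    """Return (dpi_x, dpi_y) estimated from image size and page extent."""
    if width_in <= 0 or height_in <= 0:
        raise ValueError("page extent must be positive")
    return (width_px / width_in, height_px / height_in)
```

For example, `approximate_dpi(3000, 4500, 10, 15)` gives 300 DPI in both directions.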

Region segmentation: Webservice

Characterization: Handle non-existing-xml case

Makefile should test for the right version of python.

The Makefile also works with Python 2.x.
The version should be checked, and an error printed if it is not sufficient.
Alternative:
Call python3 explicitly
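
The version check could live in a small Python helper that a Makefile target runs before anything else. This is a sketch only: the minimum version (3, 5) and the function name are assumptions.

```python
import sys

# Assumed minimum version; adjust to whatever the project actually requires.
MINIMUM_VERSION = (3, 5)

def python_is_sufficient(version_info=None, minimum=MINIMUM_VERSION):
    """Return True if the (major, minor) version meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)

def check_or_report(stream=sys.stderr):
    """Write an error to the stream and return 1 if the version is too old, else 0."""
    if python_is_sufficient():
        return 0
    stream.write("Error: Python %d.%d or newer is required\n" % MINIMUM_VERSION)
    return 1
```

A Makefile rule could then exit with the return value of `check_or_report()` so that the build stops early on Python 2.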

Move from argparse to Click for command line scripts

It's superior!

Page segmentation: Basic functionality

Add basic segmentation functionalities using tesserocr (https://github.com/sirfz/tesserocr/).

Usage of 'ocrd ocrd-tool tool' is wrong

The JSON file is missing from the usage message:

> ocrd ocrd-tool tool --help
Usage: ocrd ocrd-tool tool [OPTIONS] TOOL_NAME COMMAND [ARGS]...

Feature request for ocrd workspace

It would be nice to have:

list-group: list the USE attribute of all file groups.
list-id: list the group ID of all files in a file group
``: get mimetype of a file referenced by USE and GROUPID or by ID

Text recognition: Add to CLI

Support different (older) PAGE namespaces

Naming conflict in model spec

FileNotFoundError: [Errno 2] No such file or directory: '/home/kmw/projects/dwds/ocr/src/ocrd_kraken/env/lib/python3.5/site-packages/ocrd/model/yaml/ocrd_oas3.spec.yml'

but we have
pyocrd/ocrd/model/yaml/ocrd_oas3.yml

README is outdated

referring to pyocrd, not linking to docs etc.

ocrd workspace add: Show full help when calling without arguments

The usage message prints no options:

ocrd workspace add
Usage: ocrd workspace add [OPTIONS] LOCAL_FILENAME

Error: Missing argument "local_filename".

cached file names should retain extension

Currently, files are cached under the URL with all non-alphanumeric characters removed. This confuses tools that rely on the file extension to detect the file type.

An easy fix would be to replace each run of one or more non-alphanumeric characters with a single '.'
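
The proposed fix can be sketched with a single regular expression substitution. The function name is hypothetical, and stripping leading/trailing dots is an extra assumption not stated in the issue:

```python
import re

# Hypothetical helper: derive a cache file name from a URL while keeping
# the extension recognizable (runs of non-alphanumerics become one '.').
def cache_name_from_url(url):
    """Replace each run of non-alphanumeric characters with a single dot."""
    name = re.sub(r"[^A-Za-z0-9]+", ".", url)
    return name.strip(".")  # assumption: avoid leading/trailing dots
```

With this, `http://localhost:5001/00000005.tif` is cached as `http.localhost.5001.00000005.tif`, so the `.tif` extension survives.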

Characterization: Separate functionality and XML output

Right now, it's a multi-purpose function, which is not good for an API.

CLI: processor wrapper should call save_mets after running process()
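
A minimal sketch of what such a wrapper might look like. Only `process()` and `save_mets` come from the issue title; the wrapper name and the assumption that `save_mets` is a method on a workspace object are illustrative.

```python
# Hypothetical wrapper: run a processor, then persist the METS file.
# Assumes the processor exposes process() and the workspace exposes save_mets().
def run_processor(processor, workspace):
    """Run the processor and save the METS only if processing succeeded."""
    processor.process()
    workspace.save_mets()
```

Saving only after `process()` returns means a failing processor does not leave a half-updated METS file behind.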

Encapsulate merging functionality in an API

merge_ocr_txt.py is a command line script containing actual functionality which should be available via API.

(Smoke) testing

To ensure the code remains functional, some basic testing of functionality is required. This helps while refactoring the code (e.g. #28 #9 #20).

First, fix the output (currently the XML declaration is duplicated for each page tree).

Then have an example (e.g. the one in @OCR-D/spec) with the expected output and create a test script to ensure it's still produced.

Make sure the tool is deployable and functional by adding CI to test continuously. Extend test script / examples to reflect extended CLI/specs.

Cache: Changes in METS do not trigger reload.

If the METS file changes (cf. https://github.com/kba/ocrd-assets/pull/1), it should be “reloaded” into the cache.
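
One way to detect a changed METS file is to store a content digest next to the cached result and reload when the digest differs. The sketch below assumes the METS is a local file and uses a hypothetical `MetsCache` class; it is not the project's actual cache.

```python
import hashlib

# Hypothetical cache keyed by path; reloads when the file content changes.
class MetsCache:
    def __init__(self):
        self._entries = {}  # path -> (sha256 digest, loaded object)

    def get(self, path, loader):
        """Return the loaded METS, calling loader(data) again only on change."""
        with open(path, "rb") as handle:
            data = handle.read()
        digest = hashlib.sha256(data).hexdigest()
        cached = self._entries.get(path)
        if cached is None or cached[0] != digest:
            cached = (digest, loader(data))
            self._entries[path] = cached
        return cached[1]
```

Hashing the content is more robust than comparing modification times, at the cost of reading the file on every access.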

Page segmentation: Add tesseract via tesserocr

https://github.com/sirfz/tesserocr

Testing: More varied and feasible test data

#40 (comment)

Test data here was taken from ocr-d/spec. It's a good example but not ideal for testing. We should make use of the ground truth http://ocr-d.de/daten, also with the evaluation module in mind.

Samples for the unit tests should be minimal, since the tests are supposed to run fast.

Characterization: add simple Exiftool wrapper

Replace "pyocrd"

Still used in some constants and filenames such as tempfile names.

Documentation: Add METS template

...

Page/region segmentation: Evaluate alternatives to opencv

something lighter?
Cf. #29

ocrd process -m ocrd-assets/dist/mets.xml characterize/exif segment-region/tesserocr

works great (i.e. metadata and regions show up in OUTPUT PAGE XML).
However, adding line detection

ocrd process -m ocrd-assets/dist/mets.xml characterize/exif segment-region/tesserocr segment-line/tesserocr

results in “empty” XML:

<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15/pagecontent.xsd">
        <Page imageFileName="http://localhost:5001/00000005.tif">
        </Page>
</PcGts>
