ocr-d / core
Collection of OCR-related Python tools and wrappers from @OCR-D
Home Page: https://ocr-d.de/core/
License: Apache License 2.0
Maybe it's a good idea to allow users to clone a workspace to a given target directory.
-o --output-dir Clone workspace to given directory (optional)
E.g. _points_from_box, which is used in multiple classes, should be some kind of static helper function. Check Python best-practice guides on that.
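A minimal sketch of what such a shared helper could look like as a plain module-level function. The real signature of _points_from_box is not shown here, so the parameters (x, y, width, height) and the "x,y x,y ..." output format are assumptions made for illustration only.

```python
def points_from_box(x, y, width, height):
    """Return the four corners of a box as an 'x,y x,y x,y x,y' string.

    Hypothetical helper: the argument names and the output format are
    assumptions, not the existing _points_from_box implementation.
    """
    x2, y2 = x + width, y + height
    # Corners in clockwise order, starting at the top-left
    return "%i,%i %i,%i %i,%i %i,%i" % (x, y, x2, y, x2, y2, x, y2)
```

A module-level function needs no instance state, so every class that currently carries its own copy could import this single definition instead.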
If dpi/dpix/dpiy is not in the EXIF metadata or cannot be trusted to be accurate, the original page extent must be an input to the characterization.
The Makefile also runs with Python 2.x.
The version should be checked and an error printed if it is not sufficient.
Alternative:
Call python3 explicitly
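The version check could be sketched roughly as follows. The minimum version (3, 0) and the function name check_python_version are assumptions for illustration; the issue does not state which minimum is required.

```python
import sys

MIN_VERSION = (3, 0)  # assumed minimum; not specified in the issue


def check_python_version(version_info=None, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if not check_python_version():
        sys.stderr.write("Error: Python %d.%d or newer is required\n" % MIN_VERSION)
        sys.exit(1)
```

A Makefile target could run such a script before doing anything else, so an unsuitable interpreter fails early with a readable message.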
It's superior!
Add basic segmentation functionalities using tesserocr (https://github.com/sirfz/tesserocr/).
The JSON-File is missing for the Usage:
> ocrd ocrd-tool tool --help
Usage: ocrd ocrd-tool tool [OPTIONS] TOOL_NAME COMMAND [ARGS]...
It would be nice to have
list-group: list the USE attribute of all file groups.
list-id: list the group ID of all files in a file group
FileNotFoundError: [Errno 2] No such file or directory: '/home/kmw/projects/dwds/ocr/src/ocrd_kraken/env/lib/python3.5/site-packages/ocrd/model/yaml/ocrd_oas3.spec.yml'
but we have
pyocrd/ocrd/model/yaml/ocrd_oas3.yml
referring to pyocrd, not linking to docs etc.
Usage prints no Options:
ocrd workspace add
Usage: ocrd workspace add [OPTIONS] LOCAL_FILENAME
Error: Missing argument "local_filename".
Currently files are cached under their URL with all non-alphanumeric characters removed. This confuses tools that rely on the file extension to detect the file type.
An easy fix would be to replace each run of one or more non-alphanumeric characters with a dot (.)
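The proposed fix can be sketched with a single regular expression. The function name cache_name_from_url is hypothetical, and the sketch treats only ASCII letters and digits as alphanumeric, which is an assumption.

```python
import re


def cache_name_from_url(url):
    """Derive a cache file name from a URL (hypothetical helper).

    Every run of one or more non-alphanumeric characters collapses into a
    single dot, so the dot before the file extension survives.
    """
    return re.sub(r"[^A-Za-z0-9]+", ".", url)
```

For the image URL http://localhost:5001/00000005.tif this yields http.localhost.5001.00000005.tif, which still ends in .tif.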
Currently, it's a multi-purpose function. Not good for an API.
merge_ocr_txt.py is a command line script containing actual functionality which should be available via API.
To ensure code remains functional, some basic test of functionality is required. Helps while refactoring the code (e.g. #28 #9 #20)
First, fix output (currently XML declaration is duplicated for each page tree).
Then have an example (e.g. the one in @OCR-D/spec) with the expected output and create a test script to ensure it's still produced.
Make sure the tool is deployable and functional by adding CI to test continuously. Extend test script / examples to reflect extended CLI/specs.
If the METS file changes (cf. https://github.com/kba/ocrd-assets/pull/1), it should be “reloaded” into the cache.
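One possible way to reload a changed file, sketched under the assumption that the cached METS is a local file whose modification time can be compared. The class CachedFile is hypothetical and not part of the existing code; a METS fetched from a remote URL would need a different check.

```python
import os


class CachedFile:
    """Hypothetical cache entry that re-reads a local file when it changes."""

    def __init__(self, path):
        self.path = path
        self._mtime_ns = None
        self._content = None

    def read(self):
        # Compare the current modification time with the one seen last time
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            with open(self.path, "rb") as handle:
                self._content = handle.read()
            self._mtime_ns = mtime_ns
        return self._content
```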
Test data here was taken from ocr-d/spec. It's a good example but not ideal for testing. We should make use of the ground truth http://ocr-d.de/daten, also with the evaluation module in mind.
Samples for the unit tests should be minimal, since this is supposed to run fast.
Still used in some constants and filenames such as tempfile names.
...
Implement text recognition using ocropus.
After cloning, mets.xml looks the same as before, although the contained image was copied to the new directory.
ocrd --version does not work, although --version is more or less a standard switch.
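A minimal sketch of a --version switch using the standard library's argparse. This is an illustration only: it does not reflect how the ocrd command line is actually built, and the version string is a placeholder.

```python
import argparse

__version__ = "0.0.0"  # placeholder, not the real package version


def build_parser():
    """Build a parser that understands --version (illustrative only)."""
    parser = argparse.ArgumentParser(prog="ocrd")
    # The built-in "version" action prints the string and exits with status 0
    parser.add_argument(
        "--version", action="version", version="%(prog)s " + __version__
    )
    return parser


if __name__ == "__main__":
    build_parser().parse_args()
```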
...
Add files and classes for text recognition.
It's superior (and offers full XPATH functionality).
For the use cases we have, a solution like https://github.com/ianare/exif-py or https://github.com/hMatoba/Piexif would be preferable, so users (and tools depending on OCR-D/core as a lib) do not have to install libimage-exiftool-perl.
Discussion was in OCR-D/spec#44, ocrd_swagger needs adaptation.
Add a super class for all process steps (processor).
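Such a super class might be sketched as an abstract base class. The constructor arguments (workspace, parameters) and the method name process are assumptions for illustration, not an agreed interface.

```python
from abc import ABC, abstractmethod


class Processor(ABC):
    """Hypothetical base class shared by all process steps."""

    def __init__(self, workspace, parameters=None):
        self.workspace = workspace          # assumed: handle to the data being processed
        self.parameters = parameters or {}  # assumed: step-specific settings

    @abstractmethod
    def process(self):
        """Run this step; every concrete processor must implement it."""
```

A concrete step then subclasses Processor and implements process(); instantiating a subclass that lacks process() raises TypeError, which catches incomplete processors early.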
And/or have a sane cache invalidation strategy in place.
Add a RESTful web service for the page segmentation module.
Add a RESTful web service for the text recognition module.
xmllint from libxml2 arguably produces the most predictable pretty-printed XML, but it should be an optional dependency: fall back gracefully to lxml tools when it is not available.
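The optional-dependency pattern could look roughly like this. To keep the sketch free of third-party imports, the fallback uses the standard library's xml.dom.minidom instead of lxml, so it is a stand-in for the lxml fallback proposed above; the function name pretty_print_xml is hypothetical.

```python
import shutil
import subprocess
from xml.dom import minidom


def pretty_print_xml(xml_bytes):
    """Pretty-print XML with xmllint if installed, else with minidom."""
    xmllint = shutil.which("xmllint")
    if xmllint is not None:
        # "--format -" makes xmllint read the document from standard input
        result = subprocess.run(
            [xmllint, "--format", "-"],
            input=xml_bytes,
            stdout=subprocess.PIPE,
            check=True,
        )
        return result.stdout
    # Fallback when xmllint is missing (stdlib stand-in for lxml)
    return minidom.parseString(xml_bytes).toprettyxml(indent="  ", encoding="utf-8")
```

The two branches do not produce byte-identical output, so tests that compare serialized XML should not rely on the exact formatting.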
Namespace issue?
Using
ocrd process -m ocrd-assets/dist/mets.xml characterize/exif segment-region/tesserocr
works great (i.e. metadata and regions show up in OUTPUT PAGE XML).
However, adding line detection
ocrd process -m ocrd-assets/dist/mets.xml characterize/exif segment-region/tesserocr segment-line/tesserocr
results in “empty” XML:
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15/pagecontent.xsd">
<Page imageFileName="http://localhost:5001/00000005.tif">
</Page>
</PcGts>
Implement text recognition using tesserocr.
So far, modules “parse” XML. Access to relevant structures should be provided in a pythonic way by the handle.