ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

License: Other

Python 93.48% HTML 4.18% Shell 1.28% Dockerfile 1.06%

hocr-tools's Issues

Switch to argparse module

In Python 2.7 and later there is a module, argparse, that helps with argument parsing for command line tools. This module is also used in ocropus, and it looks better structured than what we currently have.
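
A minimal sketch of what an argparse-based entry point could look like for a tool that reads one hOCR file or standard input. The function name `build_parser`, the `--verbose` option and the help texts are illustrative assumptions, not the project's actual interface.

```python
import argparse
import sys


def build_parser():
    """Build a parser for a hypothetical single-file hOCR tool."""
    parser = argparse.ArgumentParser(
        prog="hocr-lines",
        description="Print the text of every ocr_line element (illustrative).",
    )
    # Optional positional argument; standard input is used when it is omitted.
    parser.add_argument(
        "file",
        nargs="?",
        type=argparse.FileType("r"),
        default=sys.stdin,
        help="hOCR file to read (default: standard input)",
    )
    # Hypothetical flag, only here to show how options are declared.
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print progress information to stderr")
    return parser
```

Calling `build_parser().parse_args()` in each script would give every tool the same `-h`/`--help` output and the same behaviour when no file is given.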

Release v1.0.1

We fixed some bugs, did a lot of cleanup and started some other work. Thus I think we should create a new release soon. Here is my draft:

Fixed bugs

hocr-split: Duplicate content in <html> #58
hocr-pdf: ocr_line does not have to be a span (e.g. also a div is possible) #57
~~hocr-pdf: empty rawtext caused index error #57~~
hocr-check: Fix containment checks and metadata checks, add tests #52 #61 #62

Ongoing work

Check handling of non ASCII characters in hOCR files #53
Make hocr-tools fit for Python 3 #37

See details: v1.0.0...master

Any comments or approvals are appreciated.

Fix and improve hocr-check

See #45 (comment) . Also the tests should then be improved.

More options for hocr-wordfreq

We discussed more options for hocr-wordfreq:

An option for splitting on spaces only, so that words keep any attached punctuation. This is what is used for tesseract, so there is a use case for it as well.
An option to undo hyphenation at line ends. This also requires deleting the newline characters before counting the frequencies. Possible blank lines should be deleted as well.
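
The two options could be sketched roughly as follows. The function names and the keyword arguments `spaces_only` and `undo_hyphens` are hypothetical, and the sketch works on plain text rather than on hOCR. Note that `str.split()` splits on any whitespace, not only on the space character.

```python
import re
from collections import Counter


def dehyphenate(text):
    """Join words split by a hyphen at a line end, then drop blank lines."""
    # "conver-\nsation" becomes "conversation"; the newline is removed too.
    joined = re.sub(r"-\n\s*", "", text)
    return "\n".join(line for line in joined.splitlines() if line.strip())


def word_frequencies(text, spaces_only=False, undo_hyphens=False):
    """Count words, splitting on whitespace only or on non-word characters."""
    if undo_hyphens:
        text = dehyphenate(text)
    if spaces_only:
        words = text.split()              # punctuation stays attached: "bank,"
    else:
        words = re.findall(r"\w+", text)  # punctuation is dropped: "bank"
    return Counter(words)
```

One caveat of this simple rule: a genuine hyphen that happens to fall at a line end (as in "Rabbit-Hole") is removed as well.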

Add a test for hocr-lines

Release management

For one, having users use specific versions makes debugging easier.

The tools could be uploaded to PyPI, so users can install it with pip install hocr-tools, or included in distros like Debian.

Possible course of action:

Release a 0.2 version soon (i.e. tag a git commit v0.2) to have a starting point
Consider reorganizing the module (issue #42)
Make the tools compatible with PyPI
Try to adhere to semantic versioning

The CLI of the tools has not changed or at least not much over the last years. However, this could (and should) change in the future, possibly breaking backwards compatibility if it cannot be avoided.

Python PEP 8: Imports should usually be on separate lines

https://www.python.org/dev/peps/pep-0008/#imports

Trouble merging with hocr-pdf

I'm having issues merging hocr & jpeg files into a searchable PDF. I generated the hocr with ocropy, hopefully that is correct. I used the run-test as an example workflow. Neither the run-test output nor my source generates a working PDF.

I attempt to merge with hocr-pdf . > test.pdf but only get an image in the PDF. Preview & the PDF.js tool in Firefox have no searchable content. Are there better tools to see what is happening?

I am using hocr-tools 1.1.1 & Python 2.7.12 on OS X 10.9.5.
I also tried hocr-tools on a debian system & got the same result.

Are there any sample files to try merging to see if it can work?

My test directory looks like…
ls temp-output
temp.hocr
temp.jpg
test.pdf

Here is a zip of that directory in case testing helps.
https://www.dropbox.com/s/j6lpx2kgfao6143/temp-output.zip

Make hocr-tools a proper module

The README currently states:

Each command line program is self contained; if you have Python 2.7 with the required packages installed, it should just work. (Unfortunately, that means some code duplication; we may revisit this issue in later revisions.)

I would like to revisit this issue 😄

The advantage of striving to make the programs self-contained is that there is no need to install the whole project to run an individual script, provided the requirements were installed by some other means (e.g. apt-get). For simple scripts like hocr-check this is really neat.

The disadvantages of self-contained commands are IMHO:

Code redundancy (assoc, get_text etc.). These are small functions but it's considerable boilerplate and keeping them consistent is a hassle. This also makes it hard to spot that e.g. get_text has not been needed for a while.
Embedding resources in the source code, such as the invisible font in hocr-pdf, makes it hard to add changes.
It makes it harder to keep interfaces consistent. Some commands use optparse, others parse CLI arguments themselves, some read from STDIN on no args, some show the help page on no args, some exit with an error etc. A shared hocrlib module could help reduce boilerplate, though consistent use of argparse could also remedy this situation.

In summary, I would argue for an approach with a shared library, resources in the file system and require users to properly (setup.py) install the tools.

What do you think?

In particular, is anyone relying on the scripts being self-contained?

Convert to use lxml (more scripts)

Convert to use lxml in the following scripts:

This replaces the issues UB-Mannheim#11, UB-Mannheim#12, UB-Mannheim#13, UB-Mannheim#14

UnboundLocalError: local variable 'rawtext' referenced before assignment

With some hocr files created by tesseract I get the following error when using hocr-pdf. Here's an example hocr file that causes this error:
https://drive.google.com/file/d/0ByUq6R632zOwU2ZrOTVMT0ZtdG8/view?usp=sharing

Traceback (most recent call last):
  File "/usr/bin/hocr-pdf", line 5, in <module>
    pkg_resources.run_script('hocr-tools==0.1', 'hocr-pdf')
  File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 540, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 1462, in run_script
    exec_(script_code, namespace, namespace)
  File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 41, in exec_
    exec("""exec code in globs, locs""")
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/hocr_tools-0.1-py2.7.egg/EGG-INFO/scripts/hocr-pdf", line 137, in <module>

  File "/usr/lib/python2.7/site-packages/hocr_tools-0.1-py2.7.egg/EGG-INFO/scripts/hocr-pdf", line 53, in export_pdf

  File "/usr/lib/python2.7/site-packages/hocr_tools-0.1-py2.7.egg/EGG-INFO/scripts/hocr-pdf", line 86, in add_text_layer

UnboundLocalError: local variable 'rawtext' referenced before assignment

If rawtext is set to an empty string at some point before line 86, then the script completes creating a searchable PDF--yes, a rather poorly OCR'd one, but a PDF nevertheless.
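
The failure mode can be illustrated in isolation. The sketch below is not the code of add_text_layer (which is not shown here); `first_nonempty_text` is a hypothetical helper that only demonstrates binding the variable before the loop that may never assign it.

```python
def first_nonempty_text(candidates):
    """Return the first non-blank string, or '' when there is none."""
    rawtext = ""  # bound up front, so it exists even if nothing matches
    for text in candidates:
        if text and text.strip():
            rawtext = text.strip()
            break
    return rawtext
```

Without the initial assignment, an input with no usable text would raise the UnboundLocalError shown in the traceback.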

Check hocr-pdf for possible update

https://github.com/HKWhyIP/hocr-tools/commits/master

Error while using hocr-pdf file

While using the command below I am getting a character-related error.
Please help.

hocr-pdf . > out.pdf
Traceback (most recent call last):
  File "C:\Python36\Scripts\hocr-pdf.py", line 143, in <module>
    export_pdf(args.imgdir, 300)
  File "C:\Python36\Scripts\hocr-pdf.py", line 70, in export_pdf
    pdf.save()
  File "c:\python36\lib\site-packages\reportlab\pdfgen\canvas.py", line 1237, in save
    self._doc.SaveToFile(self._filename, self)
  File "c:\python36\lib\site-packages\reportlab\pdfbase\pdfdoc.py", line 224, in SaveToFile
    f.write(data)
  File "C:\Python36\Scripts\hocr-pdf.py", line 47, in write
    sys.stdout.write(data)
  File "c:\python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 11-14: character maps to <undefined>
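
The traceback shows PDF data being passed through the text-mode `sys.stdout`, which on this Windows setup encodes with cp1252. A possible direction, sketched under assumptions: write to the binary stream instead. The class name `StdoutWrapper` is hypothetical, and the latin-1 fallback assumes that any text handed to `write` holds one character per PDF byte.

```python
import sys


class StdoutWrapper(object):
    """Minimal file-like object that sends PDF data to a binary stream."""

    def __init__(self, stream=None):
        if stream is None:
            # Python 3 exposes the raw byte stream as sys.stdout.buffer;
            # Python 2's sys.stdout already accepts byte strings.
            stream = getattr(sys.stdout, "buffer", sys.stdout)
        self._stream = stream

    def write(self, data):
        if not isinstance(data, bytes):
            # Assumption: text holds one character per PDF byte, so latin-1
            # maps it back to the intended bytes.
            data = data.encode("latin-1")
        self._stream.write(data)
```

Because the bytes bypass the console encoding, the cp1252 codec is never consulted.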

hocr-pdf could recalculate word positions for resized image

If I want to make a PDF from an image that is exactly the same dimensions as were used during OCR, then hocr-pdf can do that. But if I want my PDFs to be smaller in file size then one way is to use images that are resized smaller than were used for OCR. Currently using an image that is a different size from that used during OCR puts the words in the wrong place. As long as the aspect ratio of the image is maintained even if it is a different size it ought to be possible to recalculate where to place words in the PDF so that they show up in the correct location.

Is this feature of interest? Is this an issue anyone else has?
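
The recalculation amounts to scaling each bounding box by the ratio of the two image sizes. A minimal sketch, assuming the OCR image size is known (for example from the bbox of the ocr_page element) and using a hypothetical function name:

```python
def scale_bbox(bbox, ocr_size, new_size):
    """Map a bbox from the OCR image's pixel grid onto a resized image."""
    ocr_w, ocr_h = ocr_size
    new_w, new_h = new_size
    # Separate factors per axis; they are equal when the aspect ratio is kept.
    sx = new_w / float(ocr_w)
    sy = new_h / float(ocr_h)
    x0, y0, x1, y1 = bbox
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)
```

For a page OCR'd at 2300 x 3200 pixels and embedded at half size, `scale_bbox((470, 528, 1383, 585), (2300, 3200), (1150, 1600))` gives `(235.0, 264.0, 691.5, 292.5)`.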

Do we really need matplotlib/pylab?

The matplotlib python library is around 54 MB and currently a requirement for hocr-tools. As far as I can see we only need this package in hocr-eval and hocr-eval-lines for some (easy) matrix functions: https://github.com/tmbdev/hocr-tools/blob/7f49c5cab8489473332e81f747fdfc1cbdbb0aeb/hocr-eval-lines#L35-L38 . Could we try to eliminate this dependency?

Extract HOCR from searchable PDF

Thank you so much for your great work!

But I wonder whether it is possible to extract hOCR from a searchable PDF, that is, a PDF that has already been combined with hOCR. I haven't found any tool that does this for me...

Check hocr-de-noising

https://github.com/Early-Modern-OCR/hOCR-De-Noising CC @mchristy-tamu @tfmorris

Create PyPi release 1.1.1

Ping @kba

Coordinates in bbox must be from top to bottom

From the hocr specs:

the bounding box of the page; for pages, the top left corner must be at (0,0), so a typical page bounding box will look like bbox 0 0 2300 3200
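
A check for this rule could look like the sketch below. hOCR bounding boxes are written as `x0 y0 x1 y1` with the origin at the top left and y growing downwards; the two function names are hypothetical.

```python
def bbox_is_ordered(bbox):
    """True when x0 <= x1 and y0 <= y1, i.e. left to right and top to bottom."""
    x0, y0, x1, y1 = bbox
    return x0 <= x1 and y0 <= y1


def page_bbox_is_valid(bbox):
    """True when the page box is ordered and starts at the origin (0, 0)."""
    return bbox_is_ordered(bbox) and bbox[0] == 0 and bbox[1] == 0
```

A box written bottom to top, such as `(0, 3200, 2300, 0)`, fails the first check.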

Clean up coding style

See #14 (comment) and https://travis-ci.org/UB-Mannheim/hocr-tools/builds/131389791 .

TODOs from hocr-check

These are commented as FIXME at the end of hocr-check, I'll put them here for discussion.

containment of paragraphs, columns, etc.
ocr-recognized vs. actual tags
warn about text outside ocr_ elements
check title= attribute format
check that only the right attributes are present on the right elements
check for unrecognized ocr_ elements
check for significant overlaps
check that image files are not repeated

Keep this in check with hocr-spec (cross-reference maybe) and consider creating an XSD schema for use in ocr-fileformats (though these tend to be inflexible).

hocr-lines outputs byte strings in python3

E.g. ./hocr-lines test/testdata/sample.html

b'1 Down the Rabbit-Hole'
b'Alice was beginning to get very tired of sitting by her sister on the bank,'
b'and of having nothing to do: once or twice she had peeped into the book her'
b'sister was reading, but it had no pictures or conversations in it, `and what is'
b"the use of a book,' thought Alice `without pictures or conversation?"
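
The `b'...'` prefix is what Python 3 prints for a bytes object, so the text needs decoding before it is written out. A minimal sketch, assuming the data is UTF-8 encoded; `normalize_line` is a hypothetical helper name.

```python
import re


def normalize_line(raw):
    """Return one line of text as str with whitespace collapsed."""
    if isinstance(raw, bytes):
        # Assumption: the hOCR file is UTF-8 encoded.
        raw = raw.decode("utf-8")
    return re.sub(r"\s+", " ", raw).strip()
```

Printing the result of `normalize_line` instead of the raw bytes gives plain text under both Python 2 and Python 3.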

Specify utf-8 encoding in test/testdata/sample.html

I guess this should always be given, check also ocropy-hocr...

hocr-check complains assert doc.xpath("//meta[@name='ocr-id']")!=[]

Can be reproduced with both tesseract and gImageReader hOCR files.
manisandro/gImageReader#101

Does the script end with this error or is it still checking the other issues?

Check PyPI deployment over Travis

See https://docs.travis-ci.com/user/deployment/pypi/

However, we first need Travis activated (properly) https://travis-ci.org/tmbdev/hocr-tools

hocr-clean

Go through all ocr-elements and delete empty elements and possibly also elements with spaces only. Either do this recursively or start with the top elements and look at the textContent.
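
A rough sketch of the top-down variant, using the standard library's ElementTree so that it runs without extra packages. It assumes the hOCR file is well-formed XML. The function names are hypothetical, and "ocr element" is taken to mean any element with a class starting with `ocr`.

```python
import xml.etree.ElementTree as ET


def is_ocr_element(elem):
    """True when any class token starts with 'ocr' (ocr_line, ocrx_word, ...)."""
    return any(c.startswith("ocr") for c in elem.get("class", "").split())


def _remove_keeping_tail(parent, child):
    """Remove child but keep the text that follows it in the document."""
    tail = child.tail or ""
    index = list(parent).index(child)
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def remove_empty_ocr_elements(parent):
    """Delete ocr elements whose text content is empty or whitespace only."""
    removed = 0
    for child in list(parent):
        if is_ocr_element(child) and not "".join(child.itertext()).strip():
            # An empty element is dropped together with all of its children.
            _remove_keeping_tail(parent, child)
            removed += 1
        else:
            removed += remove_empty_ocr_elements(child)
    return removed
```

Starting at the top means an empty ocr_line is removed in one step, without first visiting the empty words inside it.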

All tools should have a -h/--help option

Not all the tools support a help flag. We should add this as a baseline so users can get at least minimal usage info on the command line.

Tools without -h/--help (c.f. smoke.tsht):

Plus possibly those that do not run at the moment because of PyXML/lxml #9

hocr-pdf output failed

Thank you for the update of hocr-pdf.

I am trying to convert images to searchable PDFs using hocr-pdf and my gcv2hocr.
hocr-pdf sometimes fails to produce the PDF.

The attached scan0002.jpg and scan0002.hocr convert to PDF without problems.

But the attached scan0003.jpg and scan0003.hocr fail to convert.
The error messages are below.

Traceback (most recent call last):
  File "hocr-pdf", line 139, in <module>
    export_pdf(sys.argv[1], 300)
  File "hocr-pdf", line 64, in export_pdf
    add_text_layer(pdf, image, height, dpi)
  File "hocr-pdf", line 73, in add_text_layer
    hocr = etree.parse(hocrfile, html.XHTMLParser())
  File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:85131)
  File "src/lxml/parser.pxi", line 1782, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:124005)
  File "src/lxml/parser.pxi", line 1808, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:124374)
  File "src/lxml/parser.pxi", line 1712, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:123169)
  File "src/lxml/parser.pxi", line 1115, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:117533)
  File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:110510)
  File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:112276)
  File "src/lxml/parser.pxi", line 613, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:111124)
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 469, column 234

I cannot tell whether the problem comes from hocr-pdf or from my gcv2hocr.
If you find problems in my gcv2hocr, please let me know.
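
The message "xmlParseEntityRef: no name" is what the XML parser reports when it meets an `&` that is not followed by an entity name, which usually means a bare ampersand in the recognized text. This has not been verified against scan0003.hocr. If that is the cause, escaping the text when the hOCR is generated would avoid it; `word_span` below is a hypothetical helper, not part of gcv2hocr or hocr-tools.

```python
from xml.sax.saxutils import escape


def word_span(text, bbox):
    """Build an ocrx_word span with the recognized text XML-escaped."""
    x0, y0, x1, y1 = bbox
    # escape() turns &, < and > into &amp;, &lt; and &gt;.
    return "<span class='ocrx_word' title='bbox %d %d %d %d'>%s</span>" % (
        x0, y0, x1, y1, escape(text))
```

With this, recognized text such as "R&D" is written as `R&amp;D` and parses cleanly.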

scan0002.hocr.txt
scan0002.jpg
scan0003.jpg
scan0003.hocr.txt

Check handling of non ASCII characters in hOCR files

As PR #29 shows, there is a problem when hocr-lines gets lines which contain umlauts or other non ASCII characters (UTF-8 encoded). Maybe more tools are affected.

Provide conda package

I just learned today that it should be easy to provide a conda package via conda-forge by referring to the PyPI package: https://conda-forge.github.io/

README: Link to hOCR format spec

Maybe you don't know this project:
https://github.com/kba/hocr-spec
😆

Make hocr-tools fit for Python 3

We should consider making the tools ready for Python 3. Not so much because 2.7 is going away soon (it will be officially developed until 2020) but because all the problems 2to3 reports are easily fixable, e.g. throwing strings is bad practice in all versions.

The changes to the standard libraries (e.g. cStringIO is now io.StringIO) make it tedious to use the same code for both versions (and it's generally not recommended). Instead, I propose that we fix the code problems 2to3 reports and ensure (by testing in CI) that the code can be automatically ported to Python 3 and run the tests. At some point in the future, we can then just run 2to3 for good.

hocr-pdf: issue with search and copy/paste in macOS Preview.app

The Preview.app is the default PDF reader for macOS.
When hocr-pdf generates a PDF from an image plus hocr file, search and copy/paste work well in Acrobat, PDF.js and others, but not in Preview. Searching in Preview does not work; text can be selected and copied, but pasting it into another document yields only blank characters.

Does anyone know of a specific reason for this?

Requiring authentication for accessing the rest api

Hi,

I may be missing something obvious, but how do we "lock down" access to the rest api?
Thanks
Rob

3 unittests failed

(venv) ubuntu@tesseract-ocr:~/hocr-tools$ ./test/tsht
# Testing ./hocr-check/test-hocr-check.tsht
1..11
ok 1 - Check from filename
ok 2 - Check from stdin
# ./ancestor: valid examples
ok 3 - 'hocr-check ./ancestor/ok-par.html' (failed: 0)
ok 4 - 'hocr-check ./ancestor/ok-line.html' (failed: 0)
ok 5 - 'hocr-check ./ancestor/ok-carea.html' (failed: 0)
# ./ancestor: invalid examples
ok 3 - 'hocr-check ./ancestor/notok-line.html' (failed: 1)
ok 4 - 'hocr-check ./ancestor/notok-carea.html' (failed: 1)
ok 5 - 'hocr-check ./ancestor/notok-par.html' (failed: 1)
# ./meta: valid examples
ok 3 - 'hocr-check ./meta/ok-system.html' (failed: 0)
# ./meta: invalid examples
ok 3 - 'hocr-check ./meta/notok-typo.html' (failed: 1)
ok 4 - 'hocr-check ./meta/notok-system.html' (failed: 1)
# Testing ./hocr-combine/test-hocr-combine.tsht
1..2
ok 1 - Executed: hocr-combine ../testdata/sample.html ../testdata/sample.html
ok 2 - check whether number ocr_lines in self-combined result is doubled
# Testing ./hocr-eval-geom/hocr-eval-geom.tsht
1..3
ok 1 - Executed: hocr-eval-geom ../testdata/sample.html ../testdata/sample.html
ok 2 - Executed: hocr-eval-geom -e ocr_line -o 0.05 -c 0.88 ../testdata/tess.hocr ../testdata/sample.html
ok 3 - Matches '\(0, 0, 0.0, ': '(0, 0, 0.0, 37) (0, 0, 0.0, 37)'
# Testing ./hocr-eval/hocr-eval.tsht
1..2
ok 1 - Executed: hocr-eval ../testdata/sample.html ../testdata/sample.html
not ok 2 - Failed: hocr-eval -d -v ../testdata/tess.hocr ../testdata/sample.html
---
diag: |
overlap 52041 true_bbox (470, 528, 1383, 585)
1 Down the Rabbit-Hole
1 Down the Rabbit-Hole
overlap 85330 true_bbox (464, 651, 2074, 704)
Alice was beginning to get very tired of sitting by her sister on the bank,
Alice was beginning to get very tired of sitting by her sister on the bank,
overlap 83824 true_bbox (464, 711, 2076, 763)
and of having nothing to do: once or twice she had peeped into the book her
and of having nothing to do: once or twice she had peeped into the book her
overlap 80600 true_bbox (463, 773, 2075, 823)
Traceback (most recent call last):
File "/usr/local/bin/hocr-eval", line 4, in <module>
__import__('pkg_resources').run_script('hocr-tools==1.2.0', 'hocr-eval')
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 719, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1511, in run_script
exec(script_code, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/hocr_tools-1.2.0-py2.7.egg/EGG-INFO/scripts/hocr-eval", line 227, in <module>

UnicodeEncodeError: 'ascii' codec can't encode character u'‘' in position 67: ordinal not in range(128)
...

# Testing ./hocr-eval-lines/hocr-eval-lines.tsht
1..4
ok 1 - Executed: hocr-eval-lines -v ../testdata/sample.txt ../testdata/sample.html
ok 2 - Matches 'ocr_errors 7': 'segmentation_errors 0<LF>ocr_errors 7'
ok 3 - Matches 'segmentation_errors 0': 'segmentation_errors 0<LF>ocr_errors 7'
ok 4 - Not like '\('segmentation_errors'': 'string'
# Testing ./hocr-extract-images/test-hocr-extract-images.tsht
1..10
# ocr_page argument
ok 1 - Executed: hocr-extract-images -p page-%03d.png -b ../testdata -e ocr_page ../testdata/tess.hocr
ok 2 - ocr_page: number of images == 1
ok 3 - ocr_page: number of texts == 1
# ocr_page stdin
ok 4 - ocr_page: number of images == 1
ok 5 - ocr_page: number of texts == 1
# ocr_line argument
ok 6 - Executed: hocr-extract-images -p line-%03d.png -b ../testdata -e ocr_line ../testdata/tess.hocr
ok 7 - ocr_line: number of images == 37
ok 8 - ocr_line: number of texts == 37
# ocr_line stdin
ok 9 - ocr_line: number of images == 37
ok 10 - ocr_line: number of texts == 37
# ocrx_word argument
ok 11 - Executed: hocr-extract-images -p word-%03d.png -b ../testdata -e ocrx_word ../testdata/tess.hocr
ok 12 - ocrx_word: number of images == 503
ok 13 - ocrx_word: number of texts == 503
# ocrx_word stdin
ok 14 - ocrx_word: number of images == 503
ok 15 - ocrx_word: number of texts == 503
ok 16 - Indeed 503 words in sample
# Testing ./hocr-lines/hocr-lines.tsht
1..3
Traceback (most recent call last):
  File "/usr/local/bin/hocr-lines", line 4, in <module>
    __import__('pkg_resources').run_script('hocr-tools==1.2.0', 'hocr-lines')
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 719, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1511, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/hocr_tools-1.2.0-py2.7.egg/EGG-INFO/scripts/hocr-lines", line 22, in <module>

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position 67: ordinal not in range(128)
not ok 1 - hocr-lines
not ok 2 - ./tess.lines ('37' != '3')
ok 3 - check first line
after
# Running function after for hocr-lines.tsht
# Testing ./hocr-merge-dc/hocr-merge-dc.tsht
1..4
ok 1 - Command succeeded
ok 2 - Matches 'name='DC.title' content='Alice im Wonderland'': '<?xml version="1.0" encoding="UTF-8"?><LF><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" <LF>    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <LF><html xmlns="http://www.'
ok 3 - Not like 'name='DC.title' content='UKOLN'': 'string'
ok 4 - Matches 'name="DC.title" content="UKOLN"': '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><LF><?xml version="1.0" encoding="UTF-8"??><LF><html xmlns="http://www.w3.org/1'
# Testing ./hocr-pdf/test-hocr-pdf.tsht
ok 1 - Executed: wget --quiet http://digi.bib.uni-mannheim.de/fileadmin/digi/445442158/tess/445442158_0126.hocr
ok 2 - Executed: wget --quiet http://digi.bib.uni-mannheim.de/fileadmin/digi/445442158/max/445442158_0126.jpg
ok 3 - Not empty file: 445442158_0126.pdf
ok 4 - Executed: pdfgrep tribunali 445442158_0126.pdf
1..4
# Testing ./hocr-split/test-hocr-split.tsht
1..4
ok 1 - Executed: hocr-split test.hocr test-%003d.hocr
ok 2 - two files were produced
ok 3 - one page in test-001.hocr
ok 4 - one page in test-002.hocr
ok 5 - one xml:lang= only, #58
ok 6 - one xmlns= only, #58
# Testing ./hocr-wordfreq/hocr-wordfreq.tsht
1..4
ok 1 - Executed: hocr-wordfreq ../testdata/sample.html
ok 2 - Executed: hocr-wordfreq -i -n 30 ../testdata/sample.html
ok 3 - Matches '23\s*the': '23          the'
ok 4 - Matches '24\s*the': '24          the'
# Testing ./smoke.tsht
ok 1 - Executed: hocr-check --help
ok 2 - Executed: hocr-check -h
ok 3 - Executed: hocr-combine --help
ok 4 - Executed: hocr-combine -h
ok 5 - Executed: hocr-eval --help
ok 6 - Executed: hocr-eval -h
ok 7 - Executed: hocr-eval-geom --help
ok 8 - Executed: hocr-eval-geom -h
ok 9 - Executed: hocr-eval-lines --help
ok 10 - Executed: hocr-eval-lines -h
ok 11 - Executed: hocr-extract-g1000 --help
ok 12 - Executed: hocr-extract-g1000 -h
ok 13 - Executed: hocr-extract-images --help
ok 14 - Executed: hocr-extract-images -h
ok 15 - Executed: hocr-lines --help
ok 16 - Executed: hocr-lines -h
ok 17 - Executed: hocr-merge-dc --help
ok 18 - Executed: hocr-merge-dc -h
ok 19 - Executed: hocr-pdf --help
ok 20 - Executed: hocr-pdf -h
ok 21 - Executed: hocr-split --help
ok 22 - Executed: hocr-split -h
1..22
# Failed 3 tests

Switch from PyXML to BeautifulSoup

PyXML hasn't been updated in 10+ years. Would you consider moving these tools over to something like BeautifulSoup which is on PyPI and easier to install?

Here's what hocr-lines looks like using BeautifulSoup:

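#!/usr/bin/env python
# Print the text of every ocr_line element, one line per element.

from __future__ import print_function

import re
import sys
from bs4 import BeautifulSoup

# Read from the file given as first argument, else from standard input.
if len(sys.argv) > 1:
    stream = open(sys.argv[1])
else:
    stream = sys.stdin

# Name the parser explicitly so results do not depend on what is installed.
soup = BeautifulSoup(stream, 'html.parser')
for line in soup.select('.ocr_line'):
    # Collapse runs of whitespace (including newlines) into single spaces.
    print(re.sub(r'\s+', ' ', line.text))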

Improve error handling in hocr-pdf

I would like to be able to do following calls (with expected output):

hocr-pdf -h --> Help text
hocr-pdf --help --> Help text
hocr-pdf filename.hocr --> Wrong argument ...
hocr-pdf filename.jpg --> Wrong argument ...
hocr-pdf filename1.hocr filename2.jpg --> Wrong argument ...
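
With argparse most of this behaviour comes for free: `-h` and `--help` are generated automatically, and a type callback can reject arguments that are not directories. A minimal sketch; the argument name `imgdir` follows the `args.imgdir` seen in a traceback above, while the helper names and message texts are assumptions.

```python
import argparse
import os


def existing_directory(path):
    """argparse type callback: accept only paths that are directories."""
    if not os.path.isdir(path):
        raise argparse.ArgumentTypeError(
            "%r is not a directory; expected a directory "
            "containing image and hOCR files" % path)
    return path


def parse_args(argv=None):
    """Parse the command line; exits with status 2 on wrong arguments."""
    parser = argparse.ArgumentParser(
        prog="hocr-pdf",
        description="Create a searchable PDF from images and hOCR files.")
    parser.add_argument("imgdir", type=existing_directory,
                        help="directory with the image and hOCR files")
    return parser.parse_args(argv)
```

Passing `filename.hocr`, `filename.jpg` or two file names then ends with a usage message and exit status 2 instead of a traceback.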

Check HocrConverter

http://xplus3.net/2009/04/02/convert-hocr-to-pdf/

https://github.com/jbrinley/HocrConverter

hocr-pdf: Warn when there is no JPG file in directory and stop script

The script hocr-pdf relies on jpg files in a directory together with hocr files. However, if the script is run in a folder containing only a hocr file and a png file, an empty PDF is created. It would be better to warn in this case and not output any PDF.
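
Such a guard could be sketched like this. The function name, the message text and the exit status 1 are assumptions, and the `*.jpg` pattern is case-sensitive on most Unix systems.

```python
import glob
import os
import sys


def find_page_images(imgdir):
    """Return the sorted .jpg files in imgdir; exit with a warning if none."""
    images = sorted(glob.glob(os.path.join(imgdir, "*.jpg")))
    if not images:
        sys.stderr.write(
            "hocr-pdf: no .jpg files found in %s; no PDF written\n" % imgdir)
        sys.exit(1)
    return images
```

Running the check before any PDF data is written ensures that no empty PDF reaches standard output.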

Does hocr-pdf support Japanese ?

I am trying to make a searchable PDF with hocr-pdf.
I wrote gcv2hocr to create hocr from Google Cloud Vision OCR output.
https://github.com/dinosauria123/gcv2hocr
Conversion of English images seems to work well, so I tried converting a Japanese image.

I found that in the PDF that hocr-pdf produces from hocr containing Japanese, the text position is badly displaced.
I cannot tell whether the problem comes from my gcv2hocr or from hocr-pdf, so I would like to ask your opinion.

I know the baseline values are important for the text position in the PDF file.
Could you tell me which other parameters are important for the text position in the PDF file?

<span class='ocr_line' id='line_1_1' title="bbox 96 79 127 144 ; baseline 0 -10; x_size 89; x_descenders 20; x_ascenders 21"><span class='ocrx_word' id='word_1_1' title='bbox 96 79 127 144 ; x_wconf 85' lang='jpn' dir='ltr'> 光学 </span>

jptest.hocr.txt
jptest2.pdf
jptest2
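
For inspecting which positional properties a line carries, the title attribute in the snippet above can be split into its parts. In the hOCR format, the two baseline numbers are the slope and the offset of a line relative to the bottom-left corner of the bounding box. The sketch below uses hypothetical function names and simply splits on semicolons, so it would break on quoted values that contain one.

```python
def parse_title(title):
    """Split an hOCR title attribute into {property name: [values]}."""
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


def bbox_and_baseline(title):
    """Return ((x0, y0, x1, y1), (slope, offset)) from a title attribute."""
    props = parse_title(title)
    bbox = tuple(int(v) for v in props["bbox"])
    # A missing baseline is treated as a flat line at the bottom of the box.
    slope, offset = (float(v) for v in props.get("baseline", ["0", "0"]))
    return bbox, (slope, offset)
```

For the ocr_line above this yields the box `(96, 79, 127, 144)` and the baseline `(0.0, -10.0)`.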

How to support vertical text ?

Thank you for answering me every time.

I am trying to convert a Japanese text image to PDF with hocr-pdf.
Japanese uses both vertical and horizontal writing styles.

hocr-pdf does not seem to support vertical text.
It shows only the position of the last letter of each word.
How can vertical text be supported?

jp_vert.jpg
jp_vert.jpg.json.txt
jp_vert.hocr.txt
jp_vert.pdf

hocr-pdf : Possible calculation issue

I could be wrong, but in reading this calculation which you use for adjusting the height of text it seems like box[0] is left and box[2] is right from the bbox coordinates. Additionally, the linebox[0] would also be left.

I changed it to this based on my reading of the hOCR spec for bbox

But in case I misunderstood your intention, I thought I'd open this issue.

Check exact-image for hocr2pdf implementation

See http://exactcode.com/opensource/exactimage/ but I haven't found the SVN repo...

HTML exporter

The hocr files are already html files and can be displayed in any browser. However, they will just display the text without any layout or format information. What do you think about writing an HTML exporter that also displays some of the layout or format information? With the bbox we can show the text at the correct position, see also ocropus-archive/DUP-ocropy#80 (comment)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" xml:lang="en">

xmlns and xml:lang attributes are duplicated.
The W3C validator then reports errors; see https://validator.w3.org/

Change meta/@value to meta/@content

In html the meta has an attribute content and not value. This is also used in the hocr specs.

Fix this in

sample.html
hocr-merge-dc
...?
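
For existing files, the rename could be automated along these lines. This is a sketch with a hypothetical function name, using ElementTree and assuming the file is well-formed XHTML.

```python
import xml.etree.ElementTree as ET


def rename_meta_value(root):
    """Rename the value attribute of meta elements to content."""
    changed = 0
    for elem in root.iter():
        # Strip a namespace prefix such as {http://www.w3.org/1999/xhtml}.
        if not isinstance(elem.tag, str):
            continue
        if elem.tag.rsplit("}", 1)[-1] != "meta":
            continue
        if "value" in elem.attrib and "content" not in elem.attrib:
            elem.set("content", elem.attrib.pop("value"))
            changed += 1
    return changed
```

Elements that already carry a content attribute are left untouched.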

Convert Google Cloud Vision OCR output to hocr.

I have a question.

I am trying to use the Google Cloud Vision API for OCR.

https://cloud.google.com/vision/

The OCR output includes the positions of the text.

I want to convert the Google OCR output to hocr format. Do you have any ideas?
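
One possible starting point, as a sketch: each text annotation in the Vision API's JSON response carries a `description` and a `boundingPoly` with four `vertices`, which can be reduced to an hOCR bbox. The function names and the `word_1_N` id scheme are assumptions. Coordinates missing from a vertex are treated as 0 here, since the JSON output can omit zero values.

```python
from xml.sax.saxutils import escape


def vertices_to_bbox(vertices):
    """Reduce polygon vertices to an axis-aligned (x0, y0, x1, y1) box."""
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    return min(xs), min(ys), max(xs), max(ys)


def annotation_to_word(annotation, index):
    """Render one text annotation as an hOCR ocrx_word span."""
    x0, y0, x1, y1 = vertices_to_bbox(annotation["boundingPoly"]["vertices"])
    return ("<span class='ocrx_word' id='word_1_%d' "
            "title='bbox %d %d %d %d'>%s</span>"
            % (index, x0, y0, x1, y1, escape(annotation["description"])))
```

Grouping the words into ocr_line and ocr_page elements is left out of this sketch.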