ocrmypdf / ocrmypdf
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Home Page: http://ocrmypdf.readthedocs.io/
License: Mozilla Public License 2.0
Issue by andreasotto
Tue Nov 4 10:44:25 2014
Originally opened as fritz-hh/OCRmyPDF#99
# ./OCRmyPDF.sh /home/ao/Leerungstermine189973.PDF /home/ao/test.pdf
Please install the python library reportlab. Exiting...
# apt-get install python-reportlab
python-reportlab is already the newest version.
.. already installed.
Debian 6 squeeze
When I try to run:
sudo ocrmypdf --verbose 3 eiffel.jpg eiffel.pdf
I get:
Original exception:
Exception #1
'builtins.TypeError(Can't convert 'list' object to str implicitly)' raised in ...
Task = def ocrmypdf.main.split_pages(...):
Job = [[] -> .../com.github.ocrmypdf.45n_qza7/*.page.pdf, <ocrmypdf.main.WrappedLogger>, [], <_thread.lock>]
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
register_cleanup, touch_files_only)
File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
ret_val = user_defined_work_func(*params)
File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/main.py", line 415, in split_pages
npages = qpdf.get_npages(input_file)
File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/qpdf.py", line 68, in get_npages
universal_newlines=True, close_fds=True)
File "/usr/lib/python3.4/subprocess.py", line 607, in check_output
with Popen(*popenargs, stdout=PIPE, **kwargs) as process:
File "/usr/lib/python3.4/subprocess.py", line 859, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.4/subprocess.py", line 1395, in _execute_child
restore_signals, start_new_session, preexec_fn)
TypeError: Can't convert 'list' object to str implicitly
If I try the same thing on a PDF file it works fine. This is for version 3.1.1, thanks!
I can reproduce the bug on both Mac OS X El Capitan and Debian 8, and also in versions 3.1 and 3.0.
The file in question is here (yes, I know there isn't any text; I was just using it for testing):
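The traceback ends inside `subprocess`, and the `Job` line shows `[]` as the input, which suggests that a list reached the command line where a file name string was expected. A minimal sketch of that failure mode follows. It makes no claim about ocrmypdf's actual code, the empty list is a hypothetical stand-in for the input file, and the exact message text differs between Python versions.

```python
import subprocess
import sys


def run_with_nested_arg():
    """Pass a list where subprocess expects a string argument."""
    input_file = []  # hypothetical: an empty list standing in for the file name
    try:
        subprocess.check_output([sys.executable, "--version", input_file])
    except TypeError as exc:
        # The error is raised while the arguments are converted, before anything runs.
        return type(exc).__name__, str(exc)
    return None


print(run_with_nested_arg())
```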
unpaper may not be a viable deskewer and ImageMagick is awful. It seems that the presence of italic fonts may be part of the issue.
Tesseract does not calculate the skew angle (logically, since there is no global skew angle on a page).
Best option is to go back to Leptonica.
In its current form the pipeline is not re-entrant -- it is assembled based on command line arguments prior to main() and cannot be changed after that. As such, there is no value to "import ocrmypdf".
Also, all test cases need to run in a subprocess which is not ideal for inspecting test failures.
A re-entrant pipeline would make it possible to customize the pipeline if ocrmypdf were used as a library.
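One way to picture a re-entrant design is to assemble the steps inside a function that takes the options as an argument, so every call builds a fresh pipeline. The sketch below only illustrates that idea: `Options`, `build_pipeline` and `run` are hypothetical names, not part of ocrmypdf, and the steps are string-tagging placeholders rather than real processing.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical names throughout; none of these exist in ocrmypdf.
Step = Callable[[str], str]


@dataclass
class Options:
    deskew: bool = False
    clean: bool = False


def build_pipeline(options: Options) -> List[Step]:
    """Assemble the steps at call time, so each call gets a fresh pipeline."""
    steps: List[Step] = [lambda page: page + ":split"]
    if options.deskew:
        steps.append(lambda page: page + ":deskew")
    if options.clean:
        steps.append(lambda page: page + ":clean")
    steps.append(lambda page: page + ":ocr")
    return steps


def run(options: Options, page: str) -> str:
    """Run one page through a pipeline built from the given options."""
    for step in build_pipeline(options):
        page = step(page)
    return page


print(run(Options(deskew=True), "p1"))
print(run(Options(), "p1"))
```

Because nothing is configured at import time, a test could call `run` directly with different options instead of starting a subprocess.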
Sorry, I am new to Docker. I just pulled the latest image and want to use the language chi_sim in tesseract, but it seems this language support is not installed by default, as it complains:
~/work/tmp$ docker run -v "$(pwd):/home/docker" ocrmypdf 31.pdf 31-ocr.pdf -l chi_sim
The installed version of tesseract does not have language data for the following requested languages:
chi_sim
It seems the tesseract used by the Docker image is different from the system's tesseract-ocr package, for which I installed the language pack with "apt-get install tesseract-ocr-chi-sim".
How can I update the Docker image to include the desired language support? And how can I check which languages are supported (like "tesseract --list-langs" on the system)?
Thanks a lot.
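The language check mentioned in the question can be scripted. The sketch below wraps `tesseract --list-langs` and parses its output. It assumes the first line is a header starting with "List of", and it would have to run in the same environment as the tesseract binary that ocrmypdf uses (for example inside the container). The sample string is illustrative, not captured output.

```python
import subprocess


def parse_langs(output: str) -> list:
    """Extract language codes from `tesseract --list-langs` style output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the header line looks like "List of available languages (2):"
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_langs(tesseract: str = "tesseract") -> list:
    """Run tesseract and parse its language list; some versions print it to stderr."""
    proc = subprocess.run(
        [tesseract, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_langs(proc.stdout)


sample = "List of available languages (2):\neng\nosd\n"  # illustrative sample only
print(parse_langs(sample))
```

A caller could then test `"chi_sim" in installed_langs()` before starting a long OCR job.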
I have a scanner that produces PDF files with a landscape layout but a 90-degree rotation. Such files are displayed correctly as portrait pages, for example in the Nautilus file manager on Linux.
I have files from another scanner that produces portrait pages directly. They are handled correctly.
As an example, take the attached test2.pdf, which is a scan of a standard test print page.
With ocrmypdf, however, I get an incorrect file (see test2b.pdf):
test2.pdf
test2b.pdf
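A PDF page can carry a /Rotate entry, a multiple of 90 degrees that a viewer applies on display, which would explain why a landscape page shows up as portrait. The sketch below computes the displayed size under that assumption. Whether test2.pdf actually uses /Rotate is not confirmed here, and the dimensions are hypothetical.

```python
def displayed_size(width: float, height: float, rotate: int = 0):
    """Return (width, height) as a viewer shows the page after applying /Rotate."""
    if rotate % 90 != 0:
        raise ValueError("/Rotate must be a multiple of 90")
    if rotate % 180 == 90:
        return height, width  # a quarter turn swaps the two dimensions
    return width, height


# Hypothetical landscape media box (in points) combined with a 90 degree rotation.
print(displayed_size(842, 595, 90))
```

A tool that rasterizes the page without honouring the rotation would see the 842 x 595 landscape box instead of the portrait page the viewer shows.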
Issue by geaplanet
Sun Mar 8 10:46:08 2015
Originally opened as fritz-hh/OCRmyPDF#104
Is there any possibility of using OCRmyPDF with raw TIFF images passed as a parameter?
OCRmyPDF converts PDFs to images in order to work with them, but if you already have raw images from a scanner or camera, how can you use them?
-j 1 will get mapped to available_cpu_count() and use all CPUs. Did I add this to work around a ruffus issue?
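The behaviour one would expect instead is that an explicit job count is honoured and the CPU count is used only when no value was given. A minimal sketch of that rule, with `resolve_jobs` as a hypothetical helper and `os.cpu_count()` standing in for `available_cpu_count()`:

```python
import os
from typing import Optional


def resolve_jobs(requested: Optional[int]) -> int:
    """Honour an explicit job count; fall back to the CPU count only when unset."""
    if requested is None:
        return os.cpu_count() or 1  # cpu_count() may return None
    if requested < 1:
        raise ValueError("job count must be at least 1")
    return requested


print(resolve_jobs(1))
```

Checking for `None` rather than relying on truthiness or a comparison such as `<= 1` keeps a request for a single job from being mistaken for "not set".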
Hi there, and thank you for any assistance,
OCRmyPDF fails to create a new file.
Here's the install process:
pip3 install ocrmypdf
Downloading/unpacking ocrmypdf
Downloading ocrmypdf-3.0.tar.gz
Running setup.py (path:/tmp/pip-build-wqh0224e/ocrmypdf/setup.py) egg_info for package ocrmypdf
Checking for tesseract >= 3.02.02...
Found tesseract 3.03
Checking for gs >= 9.14...
Found gs 9.14
Checking for unpaper >= 6.1...
Found unpaper 6.1
Checking for qpdf >= 5.0.0...
Found qpdf 5.1.2
warning: no previously-included files matching '*' found under directory 'tests/output'
Requirement already satisfied (use --upgrade to upgrade): ruffus>=2.6.3 in /usr/local/lib/python3.4/dist-packages (from ocrmypdf)
Requirement already satisfied (use --upgrade to upgrade): Pillow>=2.4.0 in /usr/lib/python3/dist-packages (from ocrmypdf)
Requirement already satisfied (use --upgrade to upgrade): reportlab>=3.1.44 in /usr/local/lib/python3.4/dist-packages (from ocrmypdf)
Requirement already satisfied (use --upgrade to upgrade): PyPDF2>=1.25.1 in /usr/local/lib/python3.4/dist-packages (from ocrmypdf)
Requirement already satisfied (use --upgrade to upgrade): pip>=1.4.1 in /usr/lib/python3/dist-packages (from reportlab>=3.1.44->ocrmypdf)
Requirement already satisfied (use --upgrade to upgrade): setuptools>=2.2 in /usr/lib/python3/dist-packages (from reportlab>=3.1.44->ocrmypdf)
Installing collected packages: ocrmypdf
Running setup.py install for ocrmypdf
Checking for tesseract >= 3.02.02...
Found tesseract 3.03
Checking for gs >= 9.14...
Found gs 9.14
Checking for unpaper >= 6.1...
Found unpaper 6.1
Checking for qpdf >= 5.0.0...
Found qpdf 5.1.2
warning: no previously-included files matching '*' found under directory 'tests/output'
Installing ocrmypdf script to /usr/local/bin
Successfully installed ocrmypdf
Cleaning up...
$ ocrmypdf A.pdf B.pdf --verbose
Tasks which will be run:
Task enters queue = 'ocrmypdf.main.repair_pdf'
[{'height_inches': Decimal('24.3611'), 'pageno': 0, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 1, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 2, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 3, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 4, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 5, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 6, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 7, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 8, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 9, 'images': [], 'has_text': False, 'width_inches': 
Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 10, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 11, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 12, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 13, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 14, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 15, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 16, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 
17, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 18, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 19, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 20, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 21, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 22, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 23, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 24, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 25, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': 
Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 26, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 27, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 28, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 29, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 30, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 31, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 32, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 33, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 34, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 35, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 36, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 
'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 37, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 38, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 39, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 40, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 41, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 42, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 43, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 
44, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 45, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 46, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 47, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 48, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 49, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 50, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 51, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 52, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 53, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': 
Decimal('24.3611'), 'pageno': 54, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 55, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}]
Completed Task = 'ocrmypdf.main.repair_pdf'
Task enters queue = 'ocrmypdf.main.split_pages'
Task enters queue = 'ocrmypdf.main.generate_postscript_stub'
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000048.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000048.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000003.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000003.ocr.page.pdf)
Page 33 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000033.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000033.skip.page.pdf)
Page 50 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000050.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000050.skip.page.pdf)
Page 2 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000002.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000002.skip.page.pdf)
Page 52 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000052.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000052.skip.page.pdf)
Page 8 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000008.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000008.skip.page.pdf)
Page 12 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000012.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000012.skip.page.pdf)
Page 41 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000041.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000041.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000039.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000039.ocr.page.pdf)
Page 1 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000001.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000001.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000026.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000026.ocr.page.pdf)
Page 5 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000005.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000005.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000016.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000016.ocr.page.pdf)
Page 11 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000011.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000011.skip.page.pdf)
Page 21 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000021.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000021.skip.page.pdf)
Page 28 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000028.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000028.skip.page.pdf)
Page 38 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000038.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000038.skip.page.pdf)
Page 47 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000047.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000047.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000017.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000017.ocr.page.pdf)
Page 49 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000049.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000049.skip.page.pdf)
Page 29 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000029.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000029.skip.page.pdf)
Page 31 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000031.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000031.skip.page.pdf)
Page 9 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000009.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000009.skip.page.pdf)
Page 43 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000043.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000043.skip.page.pdf)
Page 20 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000020.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000020.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000013.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000013.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000014.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000014.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000037.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000037.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000056.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000056.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000025.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000025.ocr.page.pdf)
Page 45 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000045.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000045.skip.page.pdf)
Page 55 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000055.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000055.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000032.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000032.ocr.page.pdf)
Page 51 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000051.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000051.skip.page.pdf)
Page 27 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000027.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000027.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000040.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000040.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000019.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000019.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000053.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000053.ocr.page.pdf)
Page 36 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000036.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000036.skip.page.pdf)
Page 46 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000046.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000046.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000024.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000024.ocr.page.pdf)
Page 10 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000010.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000010.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000007.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000007.ocr.page.pdf)
Page 23 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000023.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000023.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000044.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000044.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000035.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000035.ocr.page.pdf)
Page 6 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000006.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000006.skip.page.pdf)
Page 18 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000018.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000018.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000054.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000054.ocr.page.pdf)
Page 15 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000015.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000015.skip.page.pdf)
Page 22 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000022.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000022.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000004.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000004.ocr.page.pdf)
Page 42 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000042.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000042.skip.page.pdf)
Page 30 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000030.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000030.skip.page.pdf)
Page 34 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000034.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000034.skip.page.pdf)
Original exception:
Exception #1
'builtins.FileNotFoundError(Could not find Ghostscript's iccprofiles)' raised in ...
Task = def ocrmypdf.main.generate_postscript_stub(...):
Job = [.../com.github.ocrmypdf.fwij8o72/YummyS3ptember2015.repaired.pdf -> .../com.github.ocrmypdf.fwij8o72/YummyS3ptember2015.pdfa_def.ps, <ocrmypdf.main.WrappedLogger>]
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
register_cleanup, touch_files_only)
File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
ret_val = user_defined_work_func(*params)
File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/main.py", line 761, in generate_postscript_stub
generate_pdfa_def(output_file, pdfmark)
File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pdfa.py", line 123, in generate_pdfa_def
icc_profile = os.path.join(_get_postscript_icc_path(), 'srgb.icc')
File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pdfa.py", line 118, in _get_postscript_icc_path
raise FileNotFoundError("Could not find Ghostscript's iccprofiles")
FileNotFoundError: Could not find Ghostscript's iccprofiles
I tried removing ocrmypdf and re-installing it and had the same behaviour. Any ideas on what I need to do to fix this?
Thanks in advance.
Adam
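The exception is raised when no directory containing Ghostscript's ICC profiles can be located, so a first diagnostic step is to check where `srgb.icc` lives on the affected machine. The sketch below is not ocrmypdf's lookup logic. `find_icc_profile` is a hypothetical helper and the candidate directories are guesses that depend on how Ghostscript was packaged.

```python
import os
from typing import Iterable, Optional


def find_icc_profile(candidate_dirs: Iterable[str], name: str = "srgb.icc") -> Optional[str]:
    """Return the first existing path to `name` among the candidate directories."""
    for directory in candidate_dirs:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    return None


# Hypothetical locations; the real directory depends on the Ghostscript package.
candidates = [
    "/usr/share/ghostscript/iccprofiles",
    "/usr/local/share/ghostscript/iccprofiles",
]
print(find_icc_profile(candidates))
```

If this prints `None` for every plausible directory, the Ghostscript installation itself may be missing its profile files.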
I'm using version 3.2.1, but PDFs with jbig2 compression are still converted to ccitt, which leads to considerably larger file sizes. Am I doing something wrong, or is this a bug?
This is the output (see attachment for test.pdf):
$ pdfimages -list test.pdf
page num type width height color comp bpc enc interp object ID
---------------------------------------------------------------------
1 0 image 2062 3190 gray 1 1 jbig2 no 5 0
$ ocrmypdf -v 1 test.pdf test-ocr.pdf
________________________________________
Tasks which will be run:
Task enters queue = 'ocrmypdf.main.repair_pdf'
[{'xres': Decimal('599.999'), 'height_inches': Decimal('5.31667'), 'width_pixels': 2062, 'width_inches': Decimal('3.43667'), 'pageno': 0, 'images': [{'color': 'gray', 'bpc': 1, 'enc': 'jbig2', 'dpi_w': Decimal('599.999'), 'width': 2062, 'height': 3190, 'comp': 1, 'dpi_h': Decimal('600.000'), 'dpi': Decimal('599.999')}], 'yres': Decimal('600.000'), 'height_pixels': 3190, 'has_text': False}]
Completed Task = 'ocrmypdf.main.repair_pdf'
Task enters queue = 'ocrmypdf.main.split_pages'
Task enters queue = 'ocrmypdf.main.generate_postscript_stub'
os.symlink(/tmp/com.github.ocrmypdf.hjkqg9uk/000001.page.pdf, /tmp/com.github.ocrmypdf.hjkqg9uk/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.main.split_pages'
Task enters queue = 'ocrmypdf.main.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.main.skip_page'
Uptodate Task = 'ocrmypdf.main.skip_page'
WARNING:
In Task 'ocrmypdf.main.skip_page':
No jobs were run because no file names matched.
Please make sure that the regular expression is correctly specified.
Rendering 000001.ocr.page.pdf with pngmono
Completed Task = 'ocrmypdf.main.generate_postscript_stub'
Completed Task = 'ocrmypdf.main.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.main.preprocess_deskew'
os.symlink(/tmp/com.github.ocrmypdf.hjkqg9uk/000001.page.png, /tmp/com.github.ocrmypdf.hjkqg9uk/000001.pp-deskew.png)
Completed Task = 'ocrmypdf.main.preprocess_deskew'
Task enters queue = 'ocrmypdf.main.preprocess_clean'
os.symlink(/tmp/com.github.ocrmypdf.hjkqg9uk/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.hjkqg9uk/000001.pp-clean.png)
Completed Task = 'ocrmypdf.main.preprocess_clean'
Task enters queue = 'ocrmypdf.main.ocr_tesseract_hocr'
Task enters queue = 'ocrmypdf.main.select_image_for_pdf'
os.symlink(/tmp/com.github.ocrmypdf.hjkqg9uk/000001.page.png, /tmp/com.github.ocrmypdf.hjkqg9uk/000001.image)
Completed Task = 'ocrmypdf.main.select_image_for_pdf'
Task enters queue = 'ocrmypdf.main.select_image_layer'
os.symlink(/tmp/com.github.ocrmypdf.hjkqg9uk/000001.ocr.page.pdf, /tmp/com.github.ocrmypdf.hjkqg9uk/000001.image-layer.pdf)
Completed Task = 'ocrmypdf.main.select_image_layer'
Tesseract Open Source OCR Engine v3.03 with Leptonica
Completed Task = 'ocrmypdf.main.ocr_tesseract_hocr'
Task enters queue = 'ocrmypdf.main.render_hocr_page'
Completed Task = 'ocrmypdf.main.render_hocr_page'
Task enters queue = 'ocrmypdf.main.add_text_layer'
Completed Task = 'ocrmypdf.main.add_text_layer'
Task enters queue = 'ocrmypdf.main.merge_pages'
['/tmp/com.github.ocrmypdf.hjkqg9uk/000001.rendered.pdf', '/tmp/com.github.ocrmypdf.hjkqg9uk/pdfa_def.ps']
Completed Task = 'ocrmypdf.main.merge_pages'
Task enters queue = 'ocrmypdf.main.copy_final'
Completed Task = 'ocrmypdf.main.copy_final'
$ pdfimages -list test-ocr.pdf
page num type width height color comp bpc enc interp object ID
---------------------------------------------------------------------
1 0 image 2062 3190 gray 1 1 ccitt no 10 0
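The change in encoding can be checked automatically by comparing the `enc` column of the two `pdfimages -list` listings. The sketch below parses rows in the column layout shown above. `image_encodings` is a hypothetical helper, and the two sample rows are copied from this report.

```python
def image_encodings(listing: str) -> list:
    """Return the `enc` column for every image row of `pdfimages -list` output."""
    encodings = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a page number; header and separator rows do not.
        if len(fields) >= 9 and fields[0].isdigit():
            # Columns: page num type width height color comp bpc enc ...
            encodings.append(fields[8])
    return encodings


header = "page num type width height color comp bpc enc interp object ID\n"
before = header + "1 0 image 2062 3190 gray 1 1 jbig2 no 5 0\n"
after = header + "1 0 image 2062 3190 gray 1 1 ccitt no 10 0\n"
print(image_encodings(before), image_encodings(after))
```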
Issue by fritz-hh
Thu Apr 25 17:18:06 2013
Originally opened as fritz-hh/OCRmyPDF#15
Vertical text is not oriented correctly because the hocr file produced by tesseract does not contain the "textangle" attribute.
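In hOCR, line rotation is carried as a `textangle` property inside an element's `title` attribute. A renderer can only fall back to 0 degrees when the property is missing, which is the situation described here. The sketch below shows that fallback; `textangle_from_title` is a hypothetical helper and the sample title strings are made up for illustration.

```python
import re

# Matches e.g. "textangle 90" or "textangle -1.5" inside an hOCR title string.
_TEXTANGLE = re.compile(r"textangle\s+(-?\d+(?:\.\d+)?)")


def textangle_from_title(title, default=0.0):
    """Return the textangle in degrees, or `default` if the property is absent."""
    match = _TEXTANGLE.search(title)
    return float(match.group(1)) if match else default


# Illustrative title strings, not taken from real Tesseract output.
print(textangle_from_title("bbox 10 20 30 400; textangle 90"))
print(textangle_from_title("bbox 10 20 400 30"))
```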
Issue by mlissner
Fri Aug 21 00:10:05 2015
Originally opened as fritz-hh/OCRmyPDF#114
I could be wrong, but I haven't been able to find documentation for the command itself, neither for the command line API nor for the Python API that looks like it might be coming in 3.0.
Am I blind? If not, this would be great to get. If so, my apologies!
Looks like a great project.
Issue by femifrak
Wed May 7 05:34:43 2014
Originally opened as fritz-hh/OCRmyPDF#75
When using OCRmyPDF-2.x with -dci, black borders remain in the generated PDF. Shouldn't unpaper remove them? The input is a black and white PDF.
Here is the output:
root@xu:/home/tho/test# /opt/OCRmyPDF/OCRmyPDF-2.x/OCRmyPDF.sh -g -d -c -i test.pdf testOCR.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g -d -c -i test.pdf testOCR.pdf
ImageMagick version:
Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP
GNU Parallel version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.
Web site: http://www.gnu.org/software/parallel
When using GNU Parallel for a publication please cite:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
Poppler-utils version:
pdfimages version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
unpaper version:
tesseract version:
tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
python2 version:
Ghostscript version:
Java version:
java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)
Created temporary folder: "/tmp/tmp.cL2lCvVStC"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 578x342 (h*w in pt)
Page 0001: Size 3424x2208 (in pixel)
Page 0001: Extracting image as pbm file (445 dpi)
Page 0001: Deskewing image
Page 0001: Cleaning image with unpaper
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.cL2lCvVStC/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 31 seconds
PDF/A-2b seems nicer because of its support for transparency and a higher PDF format version, provided Ghostscript can produce correct PDF/A-2b files.
Issue by sjoswig
Wed Jul 22 09:25:37 2015
Originally opened as fritz-hh/OCRmyPDF#110
I'm using ocrmypdf 2.1.0-1 on Arch Linux. Over the last few weeks I had no problems getting OCR output from PDFs with ocrmypdf, but now only temporary files are created and no output PDF.
Here is the log file:
`OCRmyPDF version: v2.1-stable
Arguments: -f -vvv -l deu 2015-03-13 Kraftfahrtversicherung_ohne.pdf /home/js/Share/2015-03-13 Kraftfahrtversicherung.pdf
ImageMagick version:
Version: ImageMagick 6.9.1-8 Q16 x86_64 2015-07-14 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2015 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Features: Cipher DPC HDRI Modules OpenCL OpenMP
Delegates (built-in): bzlib cairo fontconfig freetype gslib jng jp2 jpeg lcms lqr ltdl lzma pangocairo png ps rsvg tiff webp wmf x xml zlib
GNU Parallel version:
GNU parallel 20150622
Copyright (C) 2007,2008,2009,2010,2011,2012,2013,2014,2015 Ole Tange
and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.
Web site: http://www.gnu.org/software/parallel
When using programs that use GNU Parallel to process data for publication
Poppler-utils version:
pdfimages version 0.33.0
Copyright 2005-2015 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.33.0
Copyright 2005-2015 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.33.0
Copyright 2005-2015 The Poppler Developers - http://poppler.freedesktop.org
unpaper version:
tesseract version:
tesseract 3.04.00
leptonica-1.71
libgif 5.1.0 : libjpeg 8d : libpng 1.6.16 : libtiff 4.0.4 : zlib 1.2.8 : libwebp 0.4.3
python2 version:
Ghostscript version:
Java version:
java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b16)
Created temporary folder: "/tmp/tmp.XZtlIvt11N"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 842x596 (h*w in pt)
Page 0001: Size 2482x3510 (in pixel)
Page 0001: Extracting image as pbm file (300 dpi)
Page 0001: Performing OCR
Page 0001: Embedding text in PDF`
Issue by fritz-hh
Wed Jan 8 22:05:16 2014
Originally opened as fritz-hh/OCRmyPDF#46
Issue by Wikinaut
Mon Sep 7 18:56:47 2015
Originally opened as fritz-hh/OCRmyPDF#120
A few PDF input files (which were already processed by tesseract-ocr pdf mode) throw an error in OCRmyPDF even in the --force-ocr mode. At the moment, I have no idea what exactly happens, but the problem appears to be in PyPDF2 (I use PyPDF2 1.25.1).
The error message is only shown when OCRmyPDF is used with the -v option.
Do you have an idea what went wrong in these cases, or what else can be done to let OCRmyPDF apply another OCR run to such a PDF?
Full output:
Tasks which will be run:
Task enters queue = 'ocrmypdf.main.repair_pdf'
Original exception:
Exception #1
'PyPDF2.utils.PdfReadError(Unexpected escaped string: b'{')' raised in ...
Task = def ocrmypdf.main.repair_pdf(...):
Job = [ARCHIV.pdf -> .../com.github.ocrmypdf.49q2h1fj/ARCHIV.repaired.pdf, <ocrmypdf.main.WrappedLogger>, [], <_thread.lock>]
Traceback (most recent call last):
File "/usr/lib/python3.4/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
register_cleanup, touch_files_only)
File "/usr/lib/python3.4/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
ret_val = user_defined_work_func(*params)
File "/usr/local/src/OCRmyPDF/ocrmypdf/main.py", line 372, in repair_pdf
pdfinfo.extend(pdf_get_all_pageinfo(output_file))
File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 145, in pdf_get_all_pageinfo
return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 145, in <listcomp>
return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 115, in _pdf_get_pageinfo
text = page.extractText()
File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2566, in extractText
content = ContentStream(content, self.pdf)
File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2645, in __init__
self.__parseContentStream(stream)
File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2677, in __parseContentStream
operands.append(readObject(stream, None))
File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 71, in readObject
return ArrayObject.readFromStream(stream, pdf)
File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 166, in readFromStream
arr.append(readObject(stream, pdf))
File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 77, in readObject
return readStringFromStream(stream)
File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 386, in readStringFromStream
raise utils.PdfReadError(r"Unexpected escaped string: %s" % tok)
PyPDF2.utils.PdfReadError: Unexpected escaped string: b'{'
Hi,
I followed the docs for installing the docker container.
Running "docker run ocrmypdf --help" works fine.
But if I try to execute ocrmypdf on a local file, I get an error:
[root@CentOS7 test]# docker run -v "/srv/test/:/home/docker/" ocrmypdf ocrmypdf -v 1 x.pdf 1.pdf
usage: ocrmypdf [-h] [--verbose [VERBOSE]] [--version] [-L FILE] [-j N]
[--use_threads] [-n] [--flowchart FILE] [-l LANGUAGE]
[--title TITLE] [--author AUTHOR] [--subject SUBJECT]
[--keywords KEYWORDS] [-d] [-c] [-i] [--oversample DPI] [-f]
[-s] [--skip-big MPixels]
[--tesseract-config TESSERACT_CONFIG]
[--pdf-renderer {tesseract,hocr}]
[--tesseract-timeout TESSERACT_TIMEOUT] [-k] [-g]
input_file output_file
ocrmypdf: error: unrecognized arguments: 1.pdf
Any help would be nice.
Thank you!
Kind regards,
Nicole
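One possible explanation, offered as an assumption rather than a diagnosis: if the image already sets `ocrmypdf` as its entry point, the second `ocrmypdf` on the command line is parsed as `input_file`, `x.pdf` becomes `output_file`, and `1.pdf` is left over. The sketch below reproduces that effect with a cut-down `argparse` parser. It is not OCRmyPDF's real parser, and the `--verbose`/`-v` definition is guessed from the usage text.

```python
import argparse


def build_parser():
    """Cut-down stand-in for the real parser: one option, two positionals."""
    parser = argparse.ArgumentParser(prog="ocrmypdf")
    parser.add_argument("--verbose", "-v", nargs="?", type=int, const=1, default=0)
    parser.add_argument("input_file")
    parser.add_argument("output_file")
    return parser


# Arguments as they would arrive if the program name were passed twice.
doubled = ["ocrmypdf", "-v", "1", "x.pdf", "1.pdf"]
args, extras = build_parser().parse_known_args(doubled)
print(args.input_file, args.output_file, extras)

# The same call without the repeated program name leaves nothing over.
clean_args, clean_extras = build_parser().parse_known_args(doubled[1:])
print(clean_args.input_file, clean_args.output_file, clean_extras)
```

If that assumption holds, dropping the second `ocrmypdf` from the `docker run` command would be the thing to try.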
Hi,
I'm getting this error
[Anaconda3] C:\Users\Carlos\Anaconda3>ocrmypdf --help
Traceback (most recent call last):
File "C:\Users\Carlos\Anaconda3\Scripts\ocrmypdf-script.py", line 9, in <module>
load_entry_point('ocrmypdf==3.1.1', 'console_scripts', 'ocrmypdf')()
File "C:\Users\Carlos\Anaconda3\lib\site-packages\setuptools-19.4-py3.5.egg\pkg_resources\__init__.py", line 549, in load_entry_point
File "C:\Users\Carlos\Anaconda3\lib\site-packages\setuptools-19.4-py3.5.egg\pkg_resources\__init__.py", line 2709, in load_entry_point
File "C:\Users\Carlos\Anaconda3\lib\site-packages\setuptools-19.4-py3.5.egg\pkg_resources\__init__.py", line 2369, in load
File "C:\Users\Carlos\Anaconda3\lib\site-packages\setuptools-19.4-py3.5.egg\pkg_resources\__init__.py", line 2375, in resolve
File "C:\Users\Carlos\Anaconda3\lib\site-packages\ocrmypdf-3.1.1-py3.5.egg\ocrmypdf\main.py", line 51, in <module>
if tesseract.version() < MINIMUM_TESS_VERSION:
File "C:\Users\Carlos\Anaconda3\lib\site-packages\ocrmypdf-3.1.1-py3.5.egg\ocrmypdf\tesseract.py", line 51, in version
stderr=STDOUT)
File "C:\Users\Carlos\Anaconda3\lib\subprocess.py", line 629, in check_output
**kwargs).stdout
File "C:\Users\Carlos\Anaconda3\lib\subprocess.py", line 696, in run
with Popen(*popenargs, **kwargs) as process:
File "C:\Users\Carlos\Anaconda3\lib\subprocess.py", line 873, in __init__
"close_fds is not supported on Windows platforms"
ValueError: close_fds is not supported on Windows platforms if you redirect stdin/stdout/stderr
Thank you for your help.
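The traceback points at `close_fds=True` being combined with redirected output, which the Python version shown here rejects on Windows. A hedged sketch of a platform-aware wrapper follows; `popen_kwargs` and `program_version` are hypothetical helpers, and the actual fix in OCRmyPDF may look different.

```python
import os
import subprocess
import sys


def popen_kwargs(platform_name=os.name):
    """Build subprocess keyword arguments, adding close_fds only off Windows."""
    kwargs = {"stderr": subprocess.STDOUT, "universal_newlines": True}
    if platform_name != "nt":
        # Older Python releases on Windows raise ValueError when close_fds
        # is combined with redirected stdin/stdout/stderr.
        kwargs["close_fds"] = True
    return kwargs


def program_version(argv):
    """Run a `--version` style command and return its combined output."""
    return subprocess.check_output(argv, **popen_kwargs())


# Demonstrated with the running interpreter so the sketch needs no other tool.
print(program_version([sys.executable, "--version"]).strip())
```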
Issue by b21e
Fri Sep 19 16:14:39 2014
Originally opened as fritz-hh/OCRmyPDF#88
Hi, integration with jbig2enc for better compression of the text image layer, especially for scans, would make this software perfect.
Preview image is grayscale 200 DPI JPEG. This is generated twice if the actual image is near or at those specs, so it could be reused.
Hey!
I have a problem. I played around a lot with the options of ocrmypdf, but my output file is still heavily distorted.
The input and the output file as well as the file generated in /tmp can be found here:
http://www.file-upload.net/download-11132774/sample_in.pdf.html
http://www.file-upload.net/download-11132775/sample_out.pdf.html
http://www.file-upload.net/download-11132776/com.github.ocrmypdf.wfze06uzsample_in.repaired.pdf.html
Thanks for your help!
Sammy
Here is the output with -v 1:
$ ocrmypdf -v 1 sample_in.pdf sample_out.pdf
Tasks which will be run:
Task enters queue = 'ocrmypdf.main.repair_pdf'
[{'images': [{'width': 4299, 'height': 3035, 'bpc': 8, 'dpi': Decimal('482.983'), 'comp': 1, 'enc': 'jpeg', 'color': 'gray', 'dpi_h': Decimal('333.110'), 'dpi_w': Decimal('700.289')}], 'width_pixels': 4299, 'xres': Decimal('700.289'), 'has_text': False, 'pageno': 0, 'height_inches': Decimal('9.11111'), 'yres': Decimal('333.110'), 'height_pixels': 3035, 'width_inches': Decimal('6.13889')}]
Completed Task = 'ocrmypdf.main.repair_pdf'
Task enters queue = 'ocrmypdf.main.split_pages'
Task enters queue = 'ocrmypdf.main.generate_postscript_stub'
os.symlink(/tmp/com.github.ocrmypdf.e6lkdkqy/000001.page.pdf, /tmp/com.github.ocrmypdf.e6lkdkqy/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.main.split_pages'
Task enters queue = 'ocrmypdf.main.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.main.skip_page'
Uptodate Task = 'ocrmypdf.main.skip_page'
WARNING:
In Task 'ocrmypdf.main.skip_page':
No jobs were run because no file names matched.
Please make sure that the regular expression is correctly specified.
Rendering 000001.ocr.page.pdf with pnggray
Completed Task = 'ocrmypdf.main.generate_postscript_stub'
Completed Task = 'ocrmypdf.main.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.main.preprocess_deskew'
os.symlink(/tmp/com.github.ocrmypdf.e6lkdkqy/000001.page.png, /tmp/com.github.ocrmypdf.e6lkdkqy/000001.pp-deskew.png)
Completed Task = 'ocrmypdf.main.preprocess_deskew'
Task enters queue = 'ocrmypdf.main.preprocess_clean'
os.symlink(/tmp/com.github.ocrmypdf.e6lkdkqy/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.e6lkdkqy/000001.pp-clean.png)
Completed Task = 'ocrmypdf.main.preprocess_clean'
Task enters queue = 'ocrmypdf.main.select_image_for_pdf'
Task enters queue = 'ocrmypdf.main.ocr_tesseract_hocr'
Completed Task = 'ocrmypdf.main.select_image_for_pdf'
Tesseract Open Source OCR Engine v3.03 with Leptonica
Completed Task = 'ocrmypdf.main.ocr_tesseract_hocr'
Task enters queue = 'ocrmypdf.main.render_hocr_page'
Completed Task = 'ocrmypdf.main.render_hocr_page'
Task enters queue = 'ocrmypdf.main.merge_pages'
['/tmp/com.github.ocrmypdf.e6lkdkqy/000001.rendered.pdf', '/tmp/com.github.ocrmypdf.e6lkdkqy/com.github.ocrmypdf.e6lkdkqysample_in.pdfa_def.ps']
Completed Task = 'ocrmypdf.main.merge_pages'
Task enters queue = 'ocrmypdf.main.copy_final'
Completed Task = 'ocrmypdf.main.copy_final'
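The `repair_pdf` record in this log reports very different horizontal and vertical resolutions, which may be related to the distortion. The sketch below only redoes the arithmetic on the logged numbers. It is not OCRmyPDF code, and reading the single `dpi` value as the geometric mean of the two is an observation about these numbers, not documented behaviour.

```python
from math import sqrt

# Values copied from the repair_pdf record in the log above.
width_px, height_px = 4299, 3035
xres, yres = 700.289, 333.110      # 'xres' / 'yres' in dots per inch
reported_dpi = 482.983             # the single 'dpi' value for the image

# Page size when each axis uses its own resolution.
width_in = width_px / xres
height_in = height_px / yres

# The single 'dpi' value compared with the geometric mean of both axes.
geometric_mean = sqrt(xres * yres)

# Page size if the single value were applied to both axes.
naive_width_in = width_px / reported_dpi
naive_height_in = height_px / reported_dpi

print("per-axis page size: %.3f x %.3f in" % (width_in, height_in))
print("geometric mean of xres and yres: %.3f" % geometric_mean)
print("single-dpi page size: %.3f x %.3f in" % (naive_width_in, naive_height_in))
```

With per-axis resolutions the page comes out portrait, in line with `width_inches` and `height_inches` in the log. Applying one resolution to both axes would give a landscape page instead, which is the kind of mismatch that could produce a stretched result.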
When installing on Debian Wheezy I am getting:
$ sudo pip-3.2 install ocrmypdf
Downloading/unpacking ocrmypdf
Downloading ocrmypdf-3.1.tar.gz
Running setup.py egg_info for package ocrmypdf
Traceback (most recent call last):
File "<string>", line 14, in <module>
File "/home/shaun/build/ocrmypdf/setup.py", line 7, in <module>
from collections.abc import Mapping
ImportError: No module named abc
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 14, in <module>
File "/home/shaun/build/ocrmypdf/setup.py", line 7, in <module>
from collections.abc import Mapping
Command python setup.py egg_info failed with error code 1 in /home/shaun/build/ocrmypdf
Storing complete log in /root/.pip/pip.log
Any idea how to fix this? Thanks!
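`collections.abc` was added in Python 3.3, so the import fails on the Python 3.2 that `pip-3.2` runs here. Installing under a newer Python is the straightforward route. For code that has to import on both, a fallback like the following sketch works; whether OCRmyPDF itself can run on 3.2 is a separate question that this does not answer.

```python
try:
    # Python 3.3 and later keep the abstract base classes here.
    from collections.abc import Mapping
except ImportError:
    # Older interpreters expose them directly from `collections`.
    from collections import Mapping

print(isinstance({}, Mapping))
```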
Issue by femifrak
Mon Nov 17 01:56:35 2014
Originally opened as fritz-hh/OCRmyPDF#100
In German Fraktur there is a sign for "et cetera" which is OCR'ed as "&c." (see http://de.wikipedia.org/wiki/Et_cetera). In the hOCR file this somehow conflicts with the HTML code and causes the script to exit.
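A likely cause, stated as an assumption: a bare `&` is not valid in XML or XHTML, so a strict parser rejects `&c.` unless the ampersand is escaped as `&amp;`. The sketch below shows the difference with the standard library; the `<span>` markup is a simplified stand-in for a real hOCR word element.

```python
import html
import xml.etree.ElementTree as ET

word = "&c."
raw = "<span class='ocrx_word'>%s</span>" % word
escaped = "<span class='ocrx_word'>%s</span>" % html.escape(word)

# The unescaped ampersand is not well-formed XML, so parsing fails.
try:
    ET.fromstring(raw)
    raw_ok = True
except ET.ParseError:
    raw_ok = False

# After escaping, the element parses and yields the original text again.
text = ET.fromstring(escaped).text

print(raw_ok, escaped, text)
```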
Sometimes the graphic layer is misaligned while the text layer seems to be placed correctly. I uploaded a sample pdf (test07.pdf) at:
http://www.loaditup.de/838186-ns8hr3kcbg.html
ocrmypdf --oversample 600 test07.pdf test07ocr.pdf
shows what I mean. test07ocr.pdf can be seen here:
http://www.loaditup.de/838187-4hkqhkbvnm.html
Additionally ocrmypdf gives a warning:
**** File did not complete the page properly and may be damaged.
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> PyPDF2 <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
I don't know whether this warning is justified. At least I have no problems in viewing the pdf in common pdf viewers. Have you got any idea about this?
ocrmypdf increases file size by about a factor of 4 (even more when using oversampling).
I assume this is because the graphic layer is created instead of using the original graphic. Correct?
Is it possible to force ocrmypdf to use the original graphics? (I do not understand issue #8, but the comment from kebekus sounds promising to me.)
If the graphics have to be regenerated because some information is missing: would it be possible to supply that information to ocrmypdf? For example, I know the scanning resolution, orientation, and page size, and I could pass them to ocrmypdf when calling it.
Thanks, Femi
I just wanted to install 4.0.1 but had unfortunately no success.
Have you got a clue how to align the ducks?
>$ sudo pip3 install git+https://github.com/jbarlow83/OCRmyPDF.git
Downloading/unpacking git+https://github.com/jbarlow83/OCRmyPDF.git
Cloning https://github.com/jbarlow83/OCRmyPDF.git to /tmp/pip-jyrz2gnr-build
Running setup.py (path:/tmp/pip-jyrz2gnr-build/setup.py) egg_info for package from git+https://github.com/jbarlow83/OCRmyPDF.git
Installed /tmp/easy_install-y2a_jcj5/pytest-runner-2.7/.eggs/setuptools_scm-1.10.1-py3.4.egg
zip_safe flag not set; analyzing archive contents...
Installed /tmp/pip-jyrz2gnr-build/.eggs/pytest_runner-2.7-py3.4.egg
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip-jyrz2gnr-build/setup.py", line 235, in <module>
zip_safe=False)
File "/usr/lib/python3.4/distutils/core.py", line 108, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 268, in __init__
self.fetch_build_eggs(attrs['setup_requires'])
File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 313, in fetch_build_eggs
replace_conflicting=True,
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 836, in resolve
dist = best[req.key] = env.best_match(req, ws, installer)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1074, in best_match
dist = working_set.find(req)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 711, in find
raise VersionConflict(dist, req)
pkg_resources.VersionConflict: (cffi 1.1.2 (/usr/lib/python3/dist-packages), Requirement.parse('cffi>=1.5.0'))
Complete output from command python setup.py egg_info:
Installed /tmp/easy_install-y2a_jcj5/pytest-runner-2.7/.eggs/setuptools_scm-1.10.1-py3.4.egg
zip_safe flag not set; analyzing archive contents...
Installed /tmp/pip-jyrz2gnr-build/.eggs/pytest_runner-2.7-py3.4.egg
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip-jyrz2gnr-build/setup.py", line 235, in <module>
zip_safe=False)
File "/usr/lib/python3.4/distutils/core.py", line 108, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 268, in __init__
self.fetch_build_eggs(attrs['setup_requires'])
File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 313, in fetch_build_eggs
replace_conflicting=True,
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 836, in resolve
dist = best[req.key] = env.best_match(req, ws, installer)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1074, in best_match
dist = working_set.find(req)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 711, in find
raise VersionConflict(dist, req)
pkg_resources.VersionConflict: (cffi 1.1.2 (/usr/lib/python3/dist-packages), Requirement.parse('cffi>=1.5.0'))
----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip-jyrz2gnr-build
Storing debug log for failure in /home/xxx/.pip/pip.log
The mentioned pip.log file says:
------------------------------------------------------------
/usr/bin/pip3 run on Thu Feb 18 12:14:54 2016
Downloading/unpacking git+https://github.com/jbarlow83/OCRmyPDF.git
Cloning https://github.com/jbarlow83/OCRmyPDF.git to /tmp/pip-jyrz2gnr-build
Found command 'git' at '/usr/bin/git'
Running command /usr/bin/git clone -q https://github.com/jbarlow83/OCRmyPDF.git /tmp/pip-jyrz2gnr-build
Running setup.py (path:/tmp/pip-jyrz2gnr-build/setup.py) egg_info for package from git+https://github.com/jbarlow83/OCRmyPDF.git
Installed /tmp/easy_install-y2a_jcj5/pytest-runner-2.7/.eggs/setuptools_scm-1.10.1-py3.4.egg
zip_safe flag not set; analyzing archive contents...
Installed /tmp/pip-jyrz2gnr-build/.eggs/pytest_runner-2.7-py3.4.egg
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip-jyrz2gnr-build/setup.py", line 235, in <module>
zip_safe=False)
File "/usr/lib/python3.4/distutils/core.py", line 108, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 268, in __init__
self.fetch_build_eggs(attrs['setup_requires'])
File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 313, in fetch_build_eggs
replace_conflicting=True,
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 836, in resolve
dist = best[req.key] = env.best_match(req, ws, installer)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1074, in best_match
dist = working_set.find(req)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 711, in find
raise VersionConflict(dist, req)
pkg_resources.VersionConflict: (cffi 1.1.2 (/usr/lib/python3/dist-packages), Requirement.parse('cffi>=1.5.0'))
Complete output from command python setup.py egg_info:
Installed /tmp/easy_install-y2a_jcj5/pytest-runner-2.7/.eggs/setuptools_scm-1.10.1-py3.4.egg
zip_safe flag not set; analyzing archive contents...
Installed /tmp/pip-jyrz2gnr-build/.eggs/pytest_runner-2.7-py3.4.egg
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip-jyrz2gnr-build/setup.py", line 235, in <module>
zip_safe=False)
File "/usr/lib/python3.4/distutils/core.py", line 108, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 268, in __init__
self.fetch_build_eggs(attrs['setup_requires'])
File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 313, in fetch_build_eggs
replace_conflicting=True,
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 836, in resolve
dist = best[req.key] = env.best_match(req, ws, installer)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1074, in best_match
dist = working_set.find(req)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 711, in find
raise VersionConflict(dist, req)
pkg_resources.VersionConflict: (cffi 1.1.2 (/usr/lib/python3/dist-packages), Requirement.parse('cffi>=1.5.0'))
----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip-jyrz2gnr-build
Exception information:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/pip/basecommand.py", line 122, in main
status = self.run(options, args)
File "/usr/lib/python3/dist-packages/pip/commands/install.py", line 304, in run
requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
File "/usr/lib/python3/dist-packages/pip/req.py", line 1230, in prepare_files
req_to_install.run_egg_info()
File "/usr/lib/python3/dist-packages/pip/req.py", line 326, in run_egg_info
command_desc='python setup.py egg_info')
File "/usr/lib/python3/dist-packages/pip/util.py", line 716, in call_subprocess
% (command_desc, proc.returncode, cwd))
pip.exceptions.InstallationError: Command python setup.py egg_info failed with error code 1 in /tmp/pip-jyrz2gnr-build
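The log boils down to a version conflict: setup requires `cffi>=1.5.0`, while the system package in /usr/lib/python3/dist-packages provides cffi 1.1.2. Upgrading cffi before installing is the usual remedy. The comparison that fails can be sketched as below; `satisfies_minimum` is a hypothetical helper that handles only plain dotted numeric versions, not pre-release tags.

```python
def version_tuple(text):
    """Turn a plain dotted version such as '1.5.0' into (1, 5, 0)."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(installed, minimum):
    """True if `installed` is at least `minimum`, compared numerically."""
    return version_tuple(installed) >= version_tuple(minimum)


# Versions taken from the VersionConflict message above.
print(satisfies_minimum("1.1.2", "1.5.0"))
```

Comparing tuples of integers rather than strings matters here, because a string comparison would rank "1.10.0" below "1.5.0".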
Install on Ubuntu 15.10
Software versions
$ ocrmypdf --version
3.2
$ python3 --version
Python 3.4.3+
$ unpaper -version
6.1
Exception (on every attempt)
$ ocrmypdf --verbose 1 --force-ocr scansmpl.pdf test.pdf
Original exception:
Exception #1
'builtins.TypeError(convert() got an unexpected keyword argument 'dpi')' raised in ...
Task = def ocrmypdf.main.select_image_layer(...):
Job = [[.../com.github.ocrmypdf.aziws_b9/000001.image, .../com.github.ocrmypdf.aziws_b9/000001.ocr.page.pdf] -> .../com.github.ocrmypdf.aziws_b9/000001.image-layer.pdf, <ocrmypdf.main.WrappedLogger>, [{'width_inches': Decimal('8.48611'), 'width_pixels': 1696, 'pageno': 0, 'images': [{'dpi_h': Decimal('2E+2'), 'color': 'gray', 'width': 1696, 'comp': 1, 'dpi': Decimal('199.928'), 'height': 2175, 'bpc': 1, 'dpi_w': Decimal('199.856'), 'enc': 'ccitt'}], 'has_text': False, 'xres': Decimal('199.856'), 'height_inches': Decimal('10.875'), 'height_pixels': 2175, 'yres': Decimal('2E+2')}], <_thread.lock>]
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
register_cleanup, touch_files_only)
File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
ret_val = user_defined_work_func(*params)
File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/main.py", line 597, in select_image_layer
img2pdf.convert([image], dpi=dpi, outputstream=pdf)
TypeError: convert() got an unexpected keyword argument 'dpi'
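The `TypeError` suggests that the installed img2pdf has a `convert()` that does not accept a `dpi` keyword, i.e. a mismatch between what OCRmyPDF 3.2 calls and what that img2pdf release offers. A general way to detect such a mismatch before calling is to inspect the signature. The sketch uses two stand-in functions rather than img2pdf itself, so nothing here describes img2pdf's actual signature.

```python
import inspect


def accepts_keyword(func, name):
    """True if `func` can be called with `name` as a keyword argument."""
    params = inspect.signature(func).parameters
    named_kinds = (inspect.Parameter.POSITIONAL_OR_KEYWORD,
                   inspect.Parameter.KEYWORD_ONLY)
    if name in params and params[name].kind in named_kinds:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for two releases of a conversion function.
def convert_with_dpi(images, dpi=None, outputstream=None):
    return None


def convert_without_dpi(images, outputstream=None):
    return None


print(accepts_keyword(convert_with_dpi, "dpi"))
print(accepts_keyword(convert_without_dpi, "dpi"))
```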
Hi,
thank you very much for this really decent software!
When I run ocrmypdf 3.1 on a PDF which contains pages in A5 size (148 x 210 mm), the output page size is not the same (125 x 210 mm).
Regards,
Femi
For example, OCRMYPDF_GHOSTSCRIPT could point at an alternate Ghostscript binary.
This is mainly for test cases, to allow replacing the real binary with one that always fails, or to stub out/cache Tesseract when the OCR output doesn't matter.
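A minimal sketch of how such an override could be looked up. Only `OCRMYPDF_GHOSTSCRIPT` comes from the proposal above; `resolve_program`, the `OCRMYPDF_TESSERACT` name formed by analogy, and the example path are assumptions for illustration.

```python
import os

# Default executable names; the environment may override each one.
DEFAULT_PROGRAMS = {"ghostscript": "gs", "tesseract": "tesseract"}


def resolve_program(name, environ=os.environ):
    """Return the executable for `name`, honouring OCRMYPDF_<NAME> if set."""
    return environ.get("OCRMYPDF_" + name.upper(), DEFAULT_PROGRAMS[name])


# A test could point Ghostscript at a stub that always fails.
fake_env = {"OCRMYPDF_GHOSTSCRIPT": "/path/to/failing-gs-stub"}
print(resolve_program("ghostscript", fake_env))
print(resolve_program("tesseract", fake_env))
```

Passing the environment as a parameter keeps the lookup easy to test without modifying the real process environment.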
Issue by femifrak
Wed May 28 16:01:43 2014
Originally opened as fritz-hh/OCRmyPDF#78
When using the 2.x version available as a zip file on the right side of https://github.com/fritz-hh/OCRmyPDF with Xubuntu 14.04, the original PDF is altered although I did not use -i.
The first page of
http://www.loaditup.de/files/817245_gcstsh3wuy.pdf
shows the original black and white pdf, the second page the altered pdf which unfortunately looks frazzled. (I merged both pages for convenience.)
Is there a way to avoid this quality loss?
I tested the suggestion of #61 but without success, which is clear as no "-i" was used.
I also tested a pdf with integer number of pixels but without success.
Maybe it has to do with the problem described here? http://lists.freedesktop.org/archives/poppler-bugs/2013-August/010469.html
Thanks for the help.
Here is the output with -g:
># /opt/OCRmyPDF/OCRmyPDF-2.x/OCRmyPDF.sh -g sw_original.pdf sw_original_OCR.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g sw_original.pdf sw_original_OCR.pdf
Checking if all dependencies are installed
--------------------------------
ImageMagick version:
Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP
--------------------------------
GNU Parallel version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.
Web site: http://www.gnu.org/software/parallel
When using GNU Parallel for a publication please cite:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.
--------------------------------
Poppler-utils version:
pdfimages version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
--------------------------------
unpaper version:
0.4.2
--------------------------------
tesseract version:
tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
--------------------------------
python2 version:
Python 2.7.6
--------------------------------
Ghostscript version:
9.10
--------------------------------
Java version:
java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
--------------------------------
Created temporary folder: "/tmp/tmp.ZIHGjUFKJS"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 842x594 (h*w in pt)
Page 0001: Size 3508x2477 (in pixel)
Page 0001: Extracting image as pbm file (300 dpi)
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.ZIHGjUFKJS/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 25 seconds
Issue by witchi
Mon Mar 23 10:50:36 2015
Originally opened as fritz-hh/OCRmyPDF#106
Hi,
Nice script, I use it with another script from http://www.konradvoelkel.com/2013/03/scan-to-pdfa/
Can you enhance your script with a call to aspell?
I have tried it within src/ocrPage.sh
on line 198:
# perform spell check
[ $VERBOSITY -ge $LOG_DEBUG ] && echo "Page $page: Performing spell check"
! aspell --dont-backup --lang=de_DE --mode=sgml -c "${curHocr}" < /dev/tty \
&& echo "Could not spell check file \"${curHocr}\". Exiting..." && exit $EXIT_OTHER_ERROR
but it doesn't work with GNU Parallel.
Thank you
Andre
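One plausible reason the snippet above fails under GNU Parallel is that `aspell -c` is interactive and expects a terminal, which parallel jobs may not have. A minimal non-interactive sketch, assuming `aspell` is installed with a German dictionary, uses aspell's `list` mode to print the misspelled words instead of prompting. The function names are hypothetical and not part of OCRmyPDF.

```python
import subprocess


def aspell_list_command(lang="de_DE", mode="sgml"):
    """Build a non-interactive aspell command line (hypothetical helper)."""
    # "list" makes aspell read text on stdin and print misspelled words,
    # so no terminal (/dev/tty) is needed.
    return ["aspell", "--lang=" + lang, "--mode=" + mode, "list"]


def misspelled_words(hocr_path, lang="de_DE"):
    """Return the words aspell flags in an hOCR file; requires aspell."""
    with open(hocr_path, "rb") as f:
        result = subprocess.run(
            aspell_list_command(lang),
            stdin=f,
            stdout=subprocess.PIPE,
            check=True,
        )
    return result.stdout.decode("utf-8", errors="replace").split()
```

This only reports suspect words; it does not correct the hOCR file, so it is a different trade-off from the interactive check.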
Ruffus's console logging seems to be far too quiet, suppressing error messages in some cases.
Find out how to create our own error logging and tell ruffus about it.
Issue by fritz-hh
Sun Sep 28 20:27:47 2014
Originally opened as fritz-hh/OCRmyPDF#93
If you run OCRmyPDF on a file, the output file cannot be processed again as an input file. OCRmyPDF fails.
In my view, this is a consequence of an unknown problem inside Tesseract, already filed as #19
Issue by alphablue52
Tue Feb 18 20:11:19 2014
Originally opened as fritz-hh/OCRmyPDF#70
Hello,
first thanks for the great work with this script. It made me work with OCR again at all after 10 years of frustrated absence :-)
Only one negative thing: Most of my PDFs come from a Canon ImageRunner scan, and are very good in quality vs. size. OCR gives great results, but the output PDFs are 7-8x bigger than input. As far as I can see, the embedded images get recompressed to JPEG, while the original is /CCITTFaxDecode.
If this is because of PDF/A compatibility, I suggest adding an option for non-PDF/A output.
You can download input.pdf and output.pdf here:
https://www.dropbox.com/l/KYlpYRiSs6IjWVOmF1fX39
Here is the output of the script with -g option.
~/bin/OCRmyPDF-2.0-stable$ sh OCRmyPDF.sh -g -l deu input.pdf output.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g -l deu input.pdf output.pdf
ImageMagick version:
Version: ImageMagick 6.7.7-10 2013-09-10 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP
GNU Parallel version:
GNU parallel 20130622
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.
Web site: http://www.gnu.org/software/parallel
When using GNU Parallel for a publication please cite:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
Poppler-utils version:
pdfimages version 0.24.1
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.1
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.1
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
unpaper version:
tesseract version:
tesseract 3.02.01
leptonica-1.69
libgif 4.1.6 : libjpeg 8d : libpng 1.2.49 : libtiff 4.0.2 : zlib 1.2.8
python2 version:
Ghostscript version:
Java version:
java version "1.7.0_51"
OpenJDK Runtime Environment (IcedTea 2.4.4) (7u51-2.4.4-0ubuntu0.13.10.1)
Created temporary folder: "/tmp/tmp.X82OQourlI"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0014
Page 0001: Size 842x595 (h_w in pt)
Page 0001: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0001: Continuing anyway, assuming a default resolution of 300 dpi
Page 0001: Extracting image as ppm file (300 dpi)
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Processing page 0002 / 0014
Page 0002: Size 842x595 (h_w in pt)
Page 0002: Expecting exactly 1 image covering the whole page (found 2). Cannot compute dpi value.
Page 0002: Continuing anyway, assuming a default resolution of 300 dpi
Page 0002: Extracting image as ppm file (300 dpi)
Page 0002: Performing OCR
Page 0002: Embedding text in PDF
Page 0002: Embedding text in PDF (debug page)
Processing page 0003 / 0014
Page 0003: Size 842x595 (h_w in pt)
Page 0003: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0003: Continuing anyway, assuming a default resolution of 300 dpi
Page 0003: Extracting image as ppm file (300 dpi)
Page 0003: Performing OCR
Page 0003: Embedding text in PDF
Page 0003: Embedding text in PDF (debug page)
Processing page 0004 / 0014
Page 0004: Size 842x595 (h_w in pt)
Page 0004: Expecting exactly 1 image covering the whole page (found 2). Cannot compute dpi value.
Page 0004: Continuing anyway, assuming a default resolution of 300 dpi
Page 0004: Extracting image as ppm file (300 dpi)
Page 0004: Performing OCR
Page 0004: Embedding text in PDF
Page 0004: Embedding text in PDF (debug page)
Processing page 0005 / 0014
Page 0005: Size 842x595 (h_w in pt)
Page 0005: Expecting exactly 1 image covering the whole page (found 2). Cannot compute dpi value.
Page 0005: Continuing anyway, assuming a default resolution of 300 dpi
Page 0005: Extracting image as ppm file (300 dpi)
Page 0005: Performing OCR
Page 0005: Embedding text in PDF
Page 0005: Embedding text in PDF (debug page)
Processing page 0006 / 0014
Page 0006: Size 842x595 (h_w in pt)
Page 0006: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0006: Continuing anyway, assuming a default resolution of 300 dpi
Page 0006: Extracting image as ppm file (300 dpi)
Page 0006: Performing OCR
Page 0006: Embedding text in PDF
Page 0006: Embedding text in PDF (debug page)
Processing page 0007 / 0014
Page 0007: Size 842x595 (h_w in pt)
Page 0007: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0007: Continuing anyway, assuming a default resolution of 300 dpi
Page 0007: Extracting image as ppm file (300 dpi)
Page 0007: Performing OCR
Page 0007: Embedding text in PDF
Page 0007: Embedding text in PDF (debug page)
Processing page 0008 / 0014
Page 0008: Size 842x595 (h_w in pt)
Page 0008: Expecting exactly 1 image covering the whole page (found 8). Cannot compute dpi value.
Page 0008: Continuing anyway, assuming a default resolution of 300 dpi
Page 0008: Extracting image as ppm file (300 dpi)
Page 0008: Performing OCR
Page 0008: Embedding text in PDF
Page 0008: Embedding text in PDF (debug page)
Processing page 0009 / 0014
Page 0009: Size 842x595 (h_w in pt)
Page 0009: Expecting exactly 1 image covering the whole page (found 5). Cannot compute dpi value.
Page 0009: Continuing anyway, assuming a default resolution of 300 dpi
Page 0009: Extracting image as ppm file (300 dpi)
Page 0009: Performing OCR
Page 0009: Embedding text in PDF
Page 0009: Embedding text in PDF (debug page)
Processing page 0010 / 0014
Page 0010: Size 842x595 (h_w in pt)
Page 0010: Expecting exactly 1 image covering the whole page (found 4). Cannot compute dpi value.
Page 0010: Continuing anyway, assuming a default resolution of 300 dpi
Page 0010: Extracting image as ppm file (300 dpi)
Page 0010: Performing OCR
Page 0010: Embedding text in PDF
Page 0010: Embedding text in PDF (debug page)
Processing page 0011 / 0014
Page 0011: Size 842x595 (h_w in pt)
Page 0011: Expecting exactly 1 image covering the whole page (found 4). Cannot compute dpi value.
Page 0011: Continuing anyway, assuming a default resolution of 300 dpi
Page 0011: Extracting image as ppm file (300 dpi)
Page 0011: Performing OCR
Page 0011: Embedding text in PDF
Page 0011: Embedding text in PDF (debug page)
Processing page 0012 / 0014
Page 0012: Size 842x595 (h_w in pt)
Page 0012: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0012: Continuing anyway, assuming a default resolution of 300 dpi
Page 0012: Extracting image as ppm file (300 dpi)
Page 0012: Performing OCR
Page 0012: Embedding text in PDF
Page 0012: Embedding text in PDF (debug page)
Processing page 0013 / 0014
Page 0013: Size 842x595 (h_w in pt)
Page 0013: Expecting exactly 1 image covering the whole page (found 4). Cannot compute dpi value.
Page 0013: Continuing anyway, assuming a default resolution of 300 dpi
Page 0013: Extracting image as ppm file (300 dpi)
Page 0013: Performing OCR
Page 0013: Embedding text in PDF
Page 0013: Embedding text in PDF (debug page)
Processing page 0014 / 0014
Page 0014: Size 842x595 (h_w in pt)
Page 0014: Size 1240x1753 (in pixel)
Page 0014: Low image resolution detected (150 dpi). If needed, please use the "-o" to try to get better OCR results.
Page 0014: Extracting image as pgm file (150 dpi)
Page 0014: Performing OCR
Page 0014: Embedding text in PDF
Page 0014: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.X82OQourlI/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 20 seconds
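The 150 dpi reported for page 0014 can be checked by hand: a PDF point is 1/72 inch, so the resolution is the pixel count divided by the page dimension in inches. The sketch below redoes that arithmetic with the figures from the log, assuming 842x595 is height by width in points and 1240x1753 is width by height in pixels. The function name is hypothetical, and this is not necessarily how the script computes the value.

```python
def dpi(pixels, points):
    """Resolution in dots per inch for an image spanning `points` PDF points."""
    # 1 PDF point = 1/72 inch, so the length in inches is points / 72.
    return pixels / (points / 72.0)


# Page 0014: 595 pt wide, 842 pt high; image 1240 px wide, 1753 px high.
horizontal = dpi(1240, 595)
vertical = dpi(1753, 842)
print(round(horizontal), round(vertical))
```

The same calculation cannot be done for pages 0001 to 0013, because they contain several images and no single one is known to cover the whole page.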
Issue by drdownload
Thu Oct 30 08:25:16 2014
Originally opened as fritz-hh/OCRmyPDF#98
It would be great to have an option to remove blank pages. I scan a lot of images with my duplex scanner, and not all scanned documents are printed on the back.
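One common heuristic for this, offered only as a sketch and not as anything OCRmyPDF implements, is to call a page blank when almost none of its pixels are dark. The function name and both thresholds below are assumptions that would need tuning for a real scanner.

```python
def is_blank(gray_pixels, dark_below=128, max_dark_fraction=0.005):
    """Guess whether a page is blank from its 8-bit grayscale pixel values."""
    pixels = list(gray_pixels)
    if not pixels:
        return True
    # Count pixels darker than the cut-off and compare their share
    # against the tolerated amount of noise (specks, scanner shadows).
    dark = sum(1 for p in pixels if p < dark_below)
    return dark / float(len(pixels)) <= max_dark_fraction
```

A page with show-through from the printed side may exceed the threshold, so any real option would probably need a user-adjustable sensitivity.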
Issue by segro21
Thu Sep 25 13:09:49 2014
Originally opened as fritz-hh/OCRmyPDF#90
Hi,
this is not really an issue with ocrmypdf, but I'm trying to get it to work on a Samba share with incrontab/inotify.
I've created a folder and watch activity in it with incrontab. That works fine for things like pdftk, but nothing happens with ocrmypdf. Syslog shows the correct command, but then nothing further happens.
my incrontab -e
/home/pdfin IN_CLOSE_WRITE /opt/ocrmypdf/ocrmypdf.sh
/home/pdfin/out IN_CLOSE_WRITE /bin/rm
->this works fine for stamping pdfs with logo
/home/stamp IN_CLOSE_WRITE /usr/bin/pdftk
/home/stamp/out IN_CLOSE_WRITE /bin/rm
Any ideas?
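One thing worth checking, as an assumption rather than a confirmed diagnosis: the incrontab lines above pass no arguments, and incron only hands the changed file to the command when the `$@` (watched path) and `$#` (event file name) wildcards appear in the entry. A hypothetical wrapper that an entry such as `/home/pdfin IN_CLOSE_WRITE /usr/bin/python3 /opt/ocr_wrapper.py $@/$#` could call might look like the sketch below. The wrapper path, the output directory, and the function names are illustrative only.

```python
import os
import subprocess


def build_command(input_path, out_dir, script="/opt/ocrmypdf/ocrmypdf.sh"):
    """Return the command line for one input PDF (hypothetical helper)."""
    name = os.path.basename(input_path)
    return [script, input_path, os.path.join(out_dir, name)]


def run(input_path, out_dir="/home/pdfin/out"):
    """Run the OCR script on one file and return its exit status."""
    # Absolute paths are used throughout, because a command started by
    # incron may not see the PATH or working directory of a login shell.
    cmd = build_command(os.path.abspath(input_path), out_dir)
    return subprocess.call(cmd)
```

Calling `run(sys.argv[1])` from the wrapper script would then process whichever file triggered the event.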
Issue by femifrak
Fri Jan 31 18:50:11 2014
Originally opened as fritz-hh/OCRmyPDF#64
OCRmyPDF is brilliant, but sometimes I have a problem with the order of the underlaid text. When I select the text starting from the top left, go to the right end of the line, and then move down line by line, there are sometimes gaps of text that are not selected. After a few more lines these gaps are suddenly selected. Copying the selected text and pasting it into another application reveals the order, which is unfortunately wrong. I use the latest stable version and get no error or warning messages.
http://www.loaditup.de/files/803343_acxm67dsue.pdf
(problem occurs in second paragraph.)
Although setup attempts to check the python version and throw an error message, in fact with python2.7 you never get that far: it barfs on the copyright symbol on line 2.
$ python setup.py build
File "setup.py", line 2
SyntaxError: Non-ASCII character '\xc2' in file setup.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
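The error arises because Python 2 rejects non-ASCII bytes while parsing the file, before any version check can execute, unless an encoding is declared as described in PEP 263. A minimal sketch of a setup.py header that would let the check run under Python 2.7 follows. The minimum version of 3.4 and the function name are illustrative assumptions.

```python
# -*- coding: utf-8 -*-
# The line above is the PEP 263 declaration; it must be on line 1 or 2.
# With it, non-ASCII characters such as © no longer stop Python 2 parsing.
import sys


def python_is_supported(version_info, minimum=(3, 4)):
    """Return True if the interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported(sys.version_info):
    sys.stderr.write("Python 3.4 or newer is required\n")
    sys.exit(1)
```

The check itself must also avoid Python 3 only syntax, otherwise Python 2 would again fail at parse time.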
Google has just released alpha access to the Google Vision API, and I am hopeful that its OCR will be better than Tesseract's. If that turns out to be true, would it be a good option to add as another OCR engine for this project? Maybe you could add a switch to choose the OCR source. The sign-up page for alpha access is here: https://services.google.com/fb/forms/visionapialpha/. It would be great to get your opinion on this. Thanks!
Hey,
is there a way to define the pagesegmode for the tesseract OCR?
(https://tesseract-ocr.googlecode.com/git/doc/tesseract.1.html)
Thank you very much
tuxasus
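For context, on the Tesseract 3.x command line the page segmentation mode is selected with `-psm N`. Whether OCRmyPDF exposes this setting is not stated here, so the sketch below only shows how such a value could be passed through when building the Tesseract call. The helper name is hypothetical.

```python
def tesseract_command(image, output_base, lang="eng", psm=None):
    """Build a Tesseract 3.x command line (hypothetical helper)."""
    cmd = ["tesseract", image, output_base, "-l", lang]
    if psm is not None:
        # Tesseract 3.x spells the flag "-psm"; 4.x and later use "--psm".
        cmd += ["-psm", str(psm)]
    return cmd
```

For example, `tesseract_command("page.png", "page", lang="deu", psm=6)` yields a command that treats the page as a single uniform block of text.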
It looks as if a basic check that the input file exists is missing or not working.
Assume a file x.pdf exists.
Then
ocrmypdf x.pdf x.ocr.pdf
works, however
ocrmypdf x-no-such-file.pdf x.ocr.pdf
silently fails.
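A check of the kind requested could be as small as the following sketch, run before any pipeline work starts. The function name and the message wording are assumptions, not OCRmyPDF code.

```python
import os


def check_input_file(path):
    """Return an error message if `path` is not a readable file, else None."""
    if not os.path.isfile(path):
        return "input file not found: " + path
    if not os.access(path, os.R_OK):
        return "input file is not readable: " + path
    return None
```

A caller would print the returned message to stderr and exit with a non-zero status, so that a missing file no longer fails silently.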
It would be great if ocrmypdf automatically recognized the language used in a PDF and used it.
Issue by Wikinaut
Mon Sep 7 18:56:47 2015
Originally opened as fritz-hh/OCRmyPDF#120
A few PDF input files (which were already processed by tesseract-ocr pdf mode) throw an error in OCRmyPDF even in the --force-ocr mode. At the moment, I have no idea what exactly happens, but the problem appears to be in PyPDF2 (I use PyPDF2 1.25.1).
The error message is only shown when OCRmyPDF is used with the -v option.
Do you have an idea what went wrong in these cases, or what else can be done to let OCRmyPDF apply another OCR run to such a PDF?
Full output:
Tasks which will be run:
Task enters queue = 'ocrmypdf.main.repair_pdf'
Original exception:
Exception #1
'PyPDF2.utils.PdfReadError(Unexpected escaped string: b'{')' raised in ...
Task = def ocrmypdf.main.repair_pdf(...):
Job = [ARCHIV.pdf -> .../com.github.ocrmypdf.49q2h1fj/ARCHIV.repaired.pdf, <ocrmypdf.main.WrappedLogger>, [], <_thread.lock>]
Traceback (most recent call last):
File "/usr/lib/python3.4/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
register_cleanup, touch_files_only)
File "/usr/lib/python3.4/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
ret_val = user_defined_work_func(*params)
File "/usr/local/src/OCRmyPDF/ocrmypdf/main.py", line 372, in repair_pdf
pdfinfo.extend(pdf_get_all_pageinfo(output_file))
File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 145, in pdf_get_all_pageinfo
return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 145, in <listcomp>
return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 115, in _pdf_get_pageinfo
text = page.extractText()
File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2566, in extractText
content = ContentStream(content, self.pdf)
File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2645, in __init__
self.__parseContentStream(stream)
File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2677, in __parseContentStream
operands.append(readObject(stream, None))
File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 71, in readObject
return ArrayObject.readFromStream(stream, pdf)
File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 166, in readFromStream
arr.append(readObject(stream, pdf))
File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 77, in readObject
return readStringFromStream(stream)
File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 386, in readStringFromStream
raise utils.PdfReadError(r"Unexpected escaped string: %s" % tok)
PyPDF2.utils.PdfReadError: Unexpected escaped string: b'{'
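Since the traceback ends in `page.extractText()`, one conceivable mitigation is to treat text extraction as best effort, so that a page the parser cannot read is reported as having no detectable text instead of aborting the whole job. The sketch below works under that assumption. `safe_extract_text` is a hypothetical helper, not OCRmyPDF's actual fix, and it accepts any object with an `extractText()` method.

```python
import logging

log = logging.getLogger(__name__)


def safe_extract_text(page, pageno):
    """Return the page text, or '' if the parser raises on this page."""
    try:
        return page.extractText()
    except Exception as e:  # for example PyPDF2.utils.PdfReadError
        log.warning("page %d: text extraction failed: %s", pageno, e)
        return ""
```

The trade-off is that a page which really does contain text may be treated as image-only, which matters less when --force-ocr is requested anyway.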
eval is asking for trouble.
I might be reading this wrong, but when I run skip-text on this PDF:
https://www.dropbox.com/s/dt0d3wpwb6ovngi/OTII_PressRelease-200110301.pdf?dl=0
The PDF already has text in it, yet the output file looks the same except that the searchable text is now gone. What am I doing wrong? Thanks!
My new duplex scanner is a BROTHER ADS-2600we. It generates PDFs (which are not compatible and also make convert fail). It can, however, generate PDF/A. The standard filenames have the form
[0-9]{8}.PDF
Example: 06091500.PDF, 06091501.PDF etc. for files scanned on 06. September 2015. These filenames (I don't like the format) cannot be changed in the scanner.
When you start
ocrmypdf --verbose -L log.txt -l deu 06091500.PDF 06091500.ocr.pdf
this silently fails! ("...No jobs were run because no file names matched.")
Workaround:
Rename 06091500.PDF to x.pdf and then process it.
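Until the cause is found, the rename workaround can be automated. The sketch below copies the input to a temporary file with a neutral name and a lowercase extension. Whether the uppercase `.PDF` extension or the all-digit name triggers the failure is not established here, and the helper name is hypothetical.

```python
import os
import shutil
import tempfile


def copy_with_lowercase_ext(path, workdir=None):
    """Copy `path` into a work directory as 'input' plus its lowercased extension."""
    ext = os.path.splitext(path)[1].lower()
    # A fresh temporary directory avoids clobbering any existing file.
    workdir = workdir or tempfile.mkdtemp(prefix="ocrmypdf-in-")
    target = os.path.join(workdir, "input" + ext)
    shutil.copyfile(path, target)
    return target
```

The returned path can then be given to ocrmypdf in place of the scanner's original file name.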
Due to bug(s) in PyPDF's extractText, which does not find text OCR'ed by Tesseract 3.04.
There are probably other cases.