Comments (9)
Comment by jbarlow83
Sat Sep 20 06:33:15 2014
pdfbeads (a Ruby project) attempts to do that, although it has issues aligning the hidden OCR text layer with the image, has some crash bugs, and its documentation is mainly in Russian.
I've looked into making the changes for OCRmyPDF. It would be a major overhaul/rewrite and would call for a new PDF generation backend.
from ocrmypdf.
Comment by b21e
Sat Sep 20 07:57:02 2014
jbig2enc itself is quite stable, and it now also detects the resolution of the images quite well. There is also support for basic foreground/background separation. There is a one-page Python script for generating multilayer PDFs, written for an earlier version of jbig2enc; when that script was written, detection of the PDF's resolution still did not work reliably. In short, JBIG2 is a must for scans, but it is still not available on Linux.
Comment by b21e
Sat Sep 20 08:38:25 2014
If one is willing to use more than one graphics library (Leptonica, written in C, for jbig2enc, and Gamera, written in Python, which didjvu uses for text foreground/background separation), all the ingredients are already there and well tested.
Blocked as discussed in #48
Quoting the previous comment: "Blocked as discussed in #48"
As #48 is now solved, does it make sense to reopen?
Is anyone interested in implementing this or helping to implement it? @jbarlow83: Do you think this is feasible, and would you accept patches for such functionality?
jbig2enc is currently supported along with optimization (although it can be inconvenient to install, since many distributions don't package it).
However, we don't do color segmentation the way jbig2's pdf.py does. That's not a good solution for an application that accepts PDFs rather than images as input.
By color segmentation, I mean examining every image to see if it can be separated into one dominant foreground color (usually black) and a grayscale or color image, and if that is a more efficient compression option than retaining the original.
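To make that definition concrete, here is a minimal sketch of the idea. It is not what OCRmyPDF or jbig2enc actually does: it assumes a grayscale page given as rows of 0-255 integers, a fixed global threshold, and zlib as a stand-in for the real codecs (JBIG2 for the mask, JPEG or similar for the background). The names `segment`, `pack_bits` and `is_worthwhile` are hypothetical.

```python
import zlib


def segment(gray, threshold=128):
    """Split a page into a 1-bit foreground mask and a flattened background."""
    # 1 marks a foreground (dark) pixel, 0 a background pixel.
    mask = [[1 if p < threshold else 0 for p in row] for row in gray]
    bg_pixels = [p for row, mrow in zip(gray, mask)
                 for p, m in zip(row, mrow) if not m]
    # Fill the holes left by the foreground with the mean background value,
    # so the background layer is smoother and compresses better.
    fill = round(sum(bg_pixels) / len(bg_pixels)) if bg_pixels else 255
    background = [[fill if m else p for p, m in zip(row, mrow)]
                  for row, mrow in zip(gray, mask)]
    return mask, background


def pack_bits(mask):
    """Pack each mask row into bytes, most significant bit first."""
    out = bytearray()
    for row in mask:
        for i in range(0, len(row), 8):
            chunk = row[i:i + 8]
            byte = 0
            for bit in chunk:
                byte = (byte << 1) | bit
            # Pad a short final chunk with zero bits on the right.
            out.append(byte << (8 - len(chunk)))
    return bytes(out)


def is_worthwhile(gray, threshold=128):
    """Return True if mask + background compress smaller than the original.

    zlib is only a placeholder for the real codecs (an assumption).
    """
    original = len(zlib.compress(bytes(p for row in gray for p in row)))
    mask, background = segment(gray, threshold)
    split = (len(zlib.compress(pack_bits(mask)))
             + len(zlib.compress(bytes(p for row in background for p in row))))
    return split < original
```

The point of `is_worthwhile` is the last clause of the definition: segmentation only pays off if the two layers together are smaller than the image that was already there.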
As far as I can see, jbig2's pdf.py does not do color segmentation itself. The only open source implementation I know of that does this is in fact didjvu using gamera, as mentioned in an earlier comment on #9.
So I guess the quickest or most feasible way to do this would be to adapt code from didjvu and include gamera. I'm not sure this is a good option, as gamera would be another dependency.
An alternative could be to make a more or less independent "helper" program that just takes an image, does the separation, and returns 3 (unoptimized) images, leaving the PDF work to ocrmypdf?
Quoting the earlier comment: "Although we don't do color segmentation like jbig2's pdf.py. That's not a good solution for an application that accepts PDFs rather than images as input."
I would indeed see it as a good feature that a large, badly compressed or even uncompressed PDF (e.g. directly created by img2pdf, or by commercial scanners) is taken in, and converted to a highly optimized MRC PDF.
The advantage compared to a completely standalone MRC PDF creator would be that all the nice processing options of ocrmypdf, including deskew and OCR, could be applied before converting to a lower-quality MRC.
Or is there a better way to do this without bloating ocrmypdf or making the process too cumbersome for the user?
jbig2 uses Leptonica for color segmentation: some sort of API call that returns a foreground "black" image and a background color or grayscale image. OCRmyPDF has soft ABI-level bindings to Leptonica, which I currently want to replace. That might mean spinning off a new Leptonica for Python package with API bindings (although that means I'd have to maintain another package that builds a binary wheel on every platform-architecture combination and depends on C libraries). If you want to see what I want to eliminate, there's some nasty business in leptonica.py that involves redirecting stderr on the fly. In short, it's very tempting to move to scikit-image or OpenCV, since they are well-maintained libraries with good packaging, even though they are not necessarily focused on document imaging.
didjvu uses GPL2 so we cannot use it. Gamera is not available packaged as Python wheels and needs to be manually built so it is not suitable. Thank you for the suggestions, though, as I had not heard of either....
(We actually do color segmentation in a special case: if pngquant is installed, high optimization is used, and pngquant is able to reduce the image to monochrome. In that case, we notice a suboptimal 1-bit PNG image and convert it to JBIG2. But the stars have to align perfectly. A case where this does not help is, say, black text on a yellowed background.)
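The special case above boils down to noticing that a quantized image has only two colors left. A minimal sketch of that check, assuming pixels arrive as a flat list of RGB tuples; `try_one_bit` is a hypothetical helper for illustration, not OCRmyPDF's actual code:

```python
def luminance(rgb):
    """Approximate perceived brightness of an (R, G, B) tuple."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def try_one_bit(pixels):
    """Return a list of bits (1 = darker color) if the image has exactly
    two colors, otherwise None to signal that it cannot be stored as 1-bit."""
    colors = set(pixels)
    if len(colors) != 2:
        return None
    dark = min(colors, key=luminance)
    return [1 if p == dark else 0 for p in pixels]
```

Black text on a clean white page passes this test, while a yellowed scan has many slightly different background colors and is rejected, which is why that case needs real foreground/background separation.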
The segmentation didjvu does seems to be a bit more complicated:
- Some kind of local thresholding, with a choice of several algorithms, as can be seen at http://manpages.ubuntu.com/manpages/bionic/man1/didjvu.1.html (argument "-m"). These are all implemented purely in gamera; didjvu only does the organizational work. This results in a binary mask.
- Then didjvu itself does some kind of morphological operations on that mask and uses the result to cut a background and a foreground image out of the one input image, both reduced in resolution.
- Then didjvu compresses the mask with JB2 (DjVu's bitonal codec, comparable to JBIG2), and the fg/bg by calling the iw44 wavelet codec. Note that the latter seems to support a mask, and is optimized such that the masked-out regions decode to whatever makes the file size minimal. For PDF, one could use JPEG 2000 for that, but I have not found an obvious option to use the mask in JPEG 2000 encoders. It might be "ROI" encoding, but I didn't find a free software encoder that supports this. This would mean that the mask data is somehow encoded in the JPEG 2000 file, unnecessarily increasing the file size somewhat :( But this might turn out to be only a minimal problem.
- The final step is assembly, which should be, in the case of PDF, more or less trivial with the 3 images fg/mask/bg: use an SMask / ImageMask for the foreground and overlay that onto the background. I did some experiments; this resulted in correct but slow rendering, unlike djvu, which is very fast.
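The steps above can be sketched in a few lines of plain Python. This is an illustration under loud assumptions, not didjvu's or gamera's code: a fixed global threshold stands in for the local thresholding algorithms, a 3x3 dilation stands in for whatever morphological operations didjvu really applies, and block averaging stands in for the resolution reduction. All function names are hypothetical. `None` marks "don't care" regions that a mask-aware codec would be free to fill cheaply.

```python
def threshold(gray, t=128):
    """Step 1 (stand-in): global threshold, 1 = foreground (dark) pixel."""
    return [[1 if p < t else 0 for p in row] for row in gray]


def dilate(mask):
    """Step 2 (stand-in): 3x3 binary dilation of the mask."""
    h, w = len(mask), len(mask[0])
    return [[int(any(mask[ny][nx]
                     for ny in range(max(0, y - 1), min(h, y + 2))
                     for nx in range(max(0, x - 1), min(w, x + 2))))
             for x in range(w)] for y in range(h)]


def invert(mask):
    return [[1 - m for m in row] for row in mask]


def downsample(gray, keep, factor):
    """Average factor x factor blocks over pixels where keep is 1.
    Blocks without any kept pixel become None (don't care)."""
    h, w = len(gray), len(gray[0])
    out = []
    for by in range(0, h, factor):
        row = []
        for bx in range(0, w, factor):
            vals = [gray[y][x]
                    for y in range(by, min(by + factor, h))
                    for x in range(bx, min(bx + factor, w))
                    if keep[y][x]]
            row.append(sum(vals) // len(vals) if vals else None)
        out.append(row)
    return out


def separate(gray, factor=2):
    """Return (mask, fg, bg); fg and bg are reduced in resolution."""
    mask = threshold(gray)
    fg = downsample(gray, mask, factor)
    # Exclude the fringe around the text from the background average.
    bg = downsample(gray, invert(dilate(mask)), factor)
    return mask, fg, bg


def compose(mask, fg, bg, factor, default=255):
    """Final step: pick fg where the mask is set, bg elsewhere."""
    page = []
    for y, mrow in enumerate(mask):
        row = []
        for x, m in enumerate(mrow):
            value = (fg if m else bg)[y // factor][x // factor]
            row.append(default if value is None else value)
        page.append(row)
    return page
```

`compose` is what a PDF viewer effectively does when the foreground is painted through the mask on top of the background; the full-resolution mask is what keeps text edges sharp even though both color layers are downsampled.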
I could imagine having ocrmypdf call an external binary for steps 1-2 or steps 1-3 to avoid GPL problems, as is done with unpaper. didjvu conveniently supports the "separate" option, which would output a mask, covering steps 1-2.
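A rough sketch of what such an external call might look like from Python. The `-o` (output file) and `-m` (thresholding method) flags are assumptions taken from the man page linked above and should be checked against `didjvu separate --help`; the helper names and file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_separate_command(image, mask_out, method=None):
    """Build the argument list for `didjvu separate` (flags are assumed)."""
    cmd = ["didjvu", "separate", "-o", str(mask_out)]
    if method:
        # Thresholding method, passed through to gamera by didjvu.
        cmd += ["-m", method]
    cmd.append(str(image))
    return cmd


def run_separate(image, mask_out, method=None):
    """Run didjvu as a separate process and return the mask path."""
    if shutil.which("didjvu") is None:
        raise FileNotFoundError("didjvu was not found on PATH")
    subprocess.run(build_separate_command(image, mask_out, method), check=True)
    return Path(mask_out)
```

Keeping the tool behind a subprocess boundary, the same way unpaper is used, means it stays an optional runtime dependency rather than linked code.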
I have also analyzed an MRC PDF coming out of a cheap commercial multifunction printer (see below). For some reason, it seems to just keep the bg as a JPEG image and then layer a fixed number of 31 single-color b/w images on top. Each b/w image encodes a single color and has its own offset.
It also has ugly problems: some letters or even whole text regions are encoded only in the bg image, and text on a colored background often ends up entirely in the background.
The didjvu results I have seen are much better, allowing for colored text and line graphics, except for problems with non-binary images, where the separation messed things up a bit, although the result never became really unreadable.
page num type    width height color comp bpc enc   interp object ID x-ppi y-ppi  size ratio
-------------------------------------------------------------------------------------------
   1   0 image    1208   1716 rgb      3   8 jpeg  no         11  0   151   150 11.1K  0.2%
   1   1 stencil   496     30 -        1   1 ccitt no         12  0   300   300  317B   17%
   1   2 stencil    48     28 -        1   1 ccitt no         13  0   300   300   16B  9.5%
   1   3 stencil   928    844 -        1   1 ccitt no         24  0   300   300  192B  0.2%
   1   4 stencil     8      6 -        1   1 ccitt no         35  0   300   300    7B  117%
   1   5 stencil  1344   1992 -        1   1 ccitt no         37  0   300   300 2179B  0.7%
   1   6 stencil  1408   1272 -        1   1 ccitt no         38  0   301   300  721B  0.3%
   1   7 stencil  1536    810 -        1   1 ccitt no         39  0   300   300  128B  0.1%
   1   8 stencil    24      8 -        1   1 ccitt no         40  0   300   300   11B   46%
   1   9 stencil  1216    918 -        1   1 ccitt no         41  0   300   300 1942B  1.4%
   1  10 stencil  1784   1276 -        1   1 ccitt no         42  0   300   300 4385B  1.5%
   1  11 stencil     8      6 -        1   1 ccitt no         14  0   300   300    7B  117%
   1  12 stencil   440    570 -        1   1 ccitt no         15  0   300   300   85B  0.3%
   1  13 stencil     8      6 -        1   1 ccitt no         16  0   300   300    8B  133%
   1  14 stencil  1024    494 -        1   1 ccitt no         17  0   300   300   80B  0.1%
   1  15 stencil     8      6 -        1   1 ccitt no         18  0   300   300    8B  133%
   1  16 stencil  1776    556 -        1   1 ccitt no         19  0   300   300 2215B  1.8%
   1  17 stencil     8      6 -        1   1 ccitt no         20  0   300   300    7B  117%
   1  18 stencil     8      6 -        1   1 ccitt no         21  0   300   300    8B  133%
   1  19 stencil     8      8 -        1   1 ccitt no         22  0   300   300    9B  112%
   1  20 stencil  1320    198 -        1   1 ccitt no         23  0   300   300 2356B  7.2%
   1  21 stencil  1344     12 -        1   1 ccitt no         25  0   300   300   14B  0.7%
   1  22 stencil   176      6 -        1   1 ccitt no         26  0   300   300   34B   26%
   1  23 stencil     8      6 -        1   1 ccitt no         27  0   300   300    8B  133%
   1  24 stencil     8      8 -        1   1 ccitt no         28  0   300   300    9B  112%
   1  25 stencil    24     30 -        1   1 ccitt no         29  0   300   300   33B   37%
   1  26 stencil  1752     48 -        1   1 ccitt no         30  0   300   300 1852B   18%
   1  27 stencil     8      6 -        1   1 ccitt no         31  0   300   300    8B  133%
   1  28 stencil     8      6 -        1   1 ccitt no         32  0   300   300    8B  133%
   1  29 stencil    24     38 -        1   1 ccitt no         33  0   300   300   37B   32%
   1  30 stencil   128     38 -        1   1 ccitt no         34  0   300   300   32B  5.3%
   1  31 stencil     8      8 -        1   1 ccitt no         36  0   300   300    9B  112%