
Comments (9)

OCRmyPDF-issuebot commented on May 18, 2024

Comment by jbarlow83
Sat Sep 20 06:33:15 2014


pdfbeads (a Ruby project) attempts to do that, although it has trouble aligning the hidden OCR text layer with the image, has some crash bugs, and its documentation is mainly in Russian.

I've looked into making the changes for OCRmyPDF. It would be a major overhaul/rewrite and would call for a new PDF generation backend.

from ocrmypdf.

OCRmyPDF-issuebot commented on May 18, 2024

Comment by b21e
Sat Sep 20 07:57:02 2014


jbig2enc itself is quite stable, and it now also detects the resolution of the images quite well. There is also support for basic foreground/background separation. There's a one-page Python script for generating multilayer PDFs, written for an earlier version of jbig2enc. When this script was written, detection of the PDF's resolution still did not work reliably. In short, for scans JBIG2 is a must, but on Linux this is still not available.


OCRmyPDF-issuebot commented on May 18, 2024

Comment by b21e
Sat Sep 20 08:38:25 2014


If one is willing to use more than one graphics library (Leptonica, written in C, for jbig2enc, and Gamera, written in Python, for didjvu's text foreground/background separation), all the ingredients are already there and well tested.


jbarlow83 commented on May 18, 2024

Blocked as discussed in #48


blaueente commented on May 18, 2024

"Blocked as discussed in #48"

As #48 is now solved, does it make sense to reopen?
Is anyone interested in implementing this, or in helping to implement it? @jbarlow83: Do you think this is feasible, and would you accept patches for such functionality?


jbarlow83 commented on May 18, 2024

Jbig2enc is currently supported along with optimization (although it can be inconvenient to install since many distributions don't distribute it).

However, we don't do color segmentation like jbig2's pdf.py. That's not a good solution for an application that accepts PDFs rather than images as input.

By color segmentation, I mean examining every image to see if it can be separated into one dominant foreground color (usually black) and a grayscale or color image, and if that is a more efficient compression option than retaining the original.
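To make that definition concrete, here is a minimal sketch of such a check. It is not what OCRmyPDF or jbig2enc does: the function names, the fixed threshold, and the spread limit are all hypothetical, the input is assumed to be a flat list of 8-bit grayscale values, and zlib only stands in for the real codecs (JBIG2 for the mask, JPEG or similar for the background).

```python
import zlib


def pack_bits(mask):
    """Pack a list of 0/1 values into bytes, 8 pixels per byte, MSB first."""
    out = bytearray()
    for i in range(0, len(mask), 8):
        chunk = mask[i:i + 8]
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | bit
        byte <<= 8 - len(chunk)  # pad the final partial byte with zeros
        out.append(byte)
    return bytes(out)


def segmentation_pays_off(pixels, threshold=128, max_spread=32):
    """Guess whether a 1-bit mask plus a background layer beats the original.

    Hypothetical heuristic: pixels darker than `threshold` are foreground,
    and they count as "one dominant color" if their values lie within
    `max_spread` of each other.
    """
    mask = [1 if p < threshold else 0 for p in pixels]
    fg = [p for p, m in zip(pixels, mask) if m]
    bg = [p for p, m in zip(pixels, mask) if not m]
    if not fg or not bg:
        return False  # nothing to separate
    if max(fg) - min(fg) > max_spread:
        return False  # foreground is not a single dominant color
    # Fill the holes left by the foreground with the mean background value.
    fill = sum(bg) // len(bg)
    background = bytes(fill if m else p for p, m in zip(pixels, mask))
    # zlib is only a stand-in for the real image codecs.
    original_size = len(zlib.compress(bytes(pixels), 9))
    split_size = (len(zlib.compress(pack_bits(mask), 9))
                  + len(zlib.compress(background, 9)))
    return split_size < original_size
```

The last comparison is the "more efficient than retaining the original" part: if the split is not smaller, the image would be kept as it is.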


blaueente commented on May 18, 2024

As far as I can see, jbig2's pdf.py does not do color segmentation itself. The only Open Source implementation that I know that does this is in fact didjvu using gamera, as mentioned in #9 (comment)
So, I guess the quickest or most feasible way to do this would be to adapt code from didjvu and include gamera. Not sure if this is a good option, as gamera would be another dependency.
An alternative could be to make a more or less independent "helper" program that just takes an image, does the separation, and returns 3 (unoptimized) images, leaving the PDF stuff to ocrmypdf?

"However, we don't do color segmentation like jbig2's pdf.py. That's not a good solution for an application that accepts PDFs rather than images as input."

I would indeed see it as a good feature that a large, badly compressed or even uncompressed PDF (e.g. directly created by img2pdf, or by commercial scanners) is taken in, and converted to a highly optimized MRC PDF.
The advantage compared to a completely standalone MRC pdf creator would be that all the nice processing options of ocrmypdf including deskew and OCR, could be done before converting to a lower quality MRC.
Or is there a better way to do this without bloating ocrmypdf or making the process too cumbersome for the user?


jbarlow83 commented on May 18, 2024

jbig2 uses leptonica for color segmentation - some sort of API call that returns a foreground "black" image and a background color or grayscale image. Ocrmypdf has soft ABI-level bindings to Leptonica, which I currently want to replace... that might mean spinning off a new Leptonica for Python package with API bindings (although, that means I'd have to maintain another package that builds a binary wheel on every platform-architecture combination and depends on C libraries). There's some nasty business in leptonica.py that involves redirecting stderr on the fly if you want to see what I want to eliminate.... In short it's very tempting to move to scikit-image or opencv since they are well maintained libraries with good packaging, even though not necessarily focused on document imaging.

didjvu uses GPL2 so we cannot use it. Gamera is not available packaged as Python wheels and needs to be manually built so it is not suitable. Thank you for the suggestions, though, as I had not heard of either....

(We actually do color segmentation in a special case: if pngquant is installed, high optimization is used, and pngquant is able to reduce the image to monochrome. In that case, we notice a suboptimal 1-bit PNG image and convert it to jbig2. But the stars have to align perfectly. A case where this does not help is, say, black text on a yellowed background.)


blaueente commented on May 18, 2024

The segmentation didjvu does seems to be a bit more complicated:

  1. Some kind of local thresholding, with a choice of several algorithms, as can be seen at http://manpages.ubuntu.com/manpages/bionic/man1/didjvu.1.html (argument "-m"). These are all implemented purely in gamera; didjvu only does the organizational work.
    This results in a binary mask.
  2. Then didjvu itself applies some kind of morphological operations to that mask and uses the result to cut the background and foreground images out of the one input image, both at reduced resolution.
  3. Then didjvu compresses the mask with JBIG2, and the fg/bg by calling the iw44 wavelet codec. Note that the latter seems to support a mask, and is optimized such that the masked-out regions decode to whatever makes the file size minimal.
    For PDF, one could use jpeg2000 for that, but I have not found an obvious option to use the mask in jpeg2000 encoders. It might be "ROI" encoding, but I didn't find a free software encoder that supports this. This would mean that the mask data is somehow encoded in the jpeg2000 file, unnecessarily increasing the file size somewhat :( But this might turn out to be only a minimal problem.
  4. The final step is assembly, which should be, in the case of PDF, more or less trivial with the 3 images fg/mask/bg:
    use an SMask / ImageMask for the foreground and overlay that onto the background. I did some experiments; this resulted in a correct but slow rendering, unlike djvu, which is very fast.
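
Steps 1 and 2 can be sketched roughly as follows. This is not didjvu's or gamera's code: a plain local-mean threshold stands in for gamera's algorithms, the morphological cleanup of the mask is left out, all function names and parameters are made up for illustration, and images are assumed to be 2D lists of 8-bit grayscale values. Pixels that do not belong to a layer are marked None ("don't care"), which is the freedom an encoder could exploit in step 3.

```python
def local_mean_mask(img, radius=1, bias=10):
    """Step 1: mark a pixel as foreground (1) if it is clearly darker
    than the mean of its (2*radius+1)^2 neighborhood."""
    h, w = len(img), len(img[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [img[j][i] for j in ys for i in xs]
            if img[y][x] < sum(window) / len(window) - bias:
                mask[y][x] = 1
    return mask


def split_layers(img, mask):
    """Step 2: cut foreground and background out of the one input image."""
    fg = [[p if m else None for p, m in zip(row, mrow)]
          for row, mrow in zip(img, mask)]
    bg = [[None if m else p for p, m in zip(row, mrow)]
          for row, mrow in zip(img, mask)]
    return fg, bg


def downsample(layer, factor=2):
    """Reduce resolution by averaging blocks, ignoring don't-care pixels."""
    h, w = len(layer), len(layer[0])
    out = []
    for y in range(0, h, factor):
        row = []
        for x in range(0, w, factor):
            block = [layer[j][i]
                     for j in range(y, min(h, y + factor))
                     for i in range(x, min(w, x + factor))
                     if layer[j][i] is not None]
            row.append(sum(block) // len(block) if block else None)
        out.append(row)
    return out
```

The mask stays at full resolution while fg and bg are downsampled, which is where most of the size reduction in an MRC layout would come from.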

I could imagine having ocrmypdf call an external binary for steps 1-2 or steps 1-3 to avoid GPL problems, as is done with unpaper. didjvu conveniently supports the "separate" option, which would output a mask, covering steps 1-2.
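
Calling it as an external process might look like the sketch below. The helper names are hypothetical, and the command line is an assumption based on my reading of the didjvu man page ("separate" subcommand, "-o" for the output mask, "-m" for the thresholding method); it should be checked against the installed version before relying on it.

```python
import subprocess


def didjvu_separate_cmd(image_path, mask_path, method=None):
    """Build the argument list for `didjvu separate` (flags assumed, see above)."""
    cmd = ["didjvu", "separate", "-o", mask_path]
    if method is not None:
        cmd += ["-m", method]  # thresholding algorithm, argument "-m"
    cmd.append(image_path)
    return cmd


def run_separate(image_path, mask_path, method=None):
    """Run didjvu as an external process and raise if it exits with an error."""
    subprocess.run(didjvu_separate_cmd(image_path, mask_path, method), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names, and keeping didjvu in its own process is the same arrangement as with unpaper.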

I have also analyzed an MRC PDF coming out of a cheap commercial multifunction printer (see below). For some reason they seem to just keep the bg as a jpeg image, and then have a fixed number of 31 single-color b/w images layered on top. Each b/w image encodes one color and has its own offset.
It also has ugly problems: some letters or even whole text regions are encoded only in the bg image, and text on a colored background often ends up entirely in the background.
The didjvu results I have seen are much better, allowing for colored text and line graphics, except for problems with non-binary images where the separation messed it up a bit, although it never got really unreadable.

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1208  1716  rgb     3   8  jpeg   no        11  0   151   150 11.1K 0.2%
   1     1 stencil   496    30  -       1   1  ccitt  no        12  0   300   300  317B  17%
   1     2 stencil    48    28  -       1   1  ccitt  no        13  0   300   300   16B 9.5%
   1     3 stencil   928   844  -       1   1  ccitt  no        24  0   300   300  192B 0.2%
   1     4 stencil     8     6  -       1   1  ccitt  no        35  0   300   300    7B 117%
   1     5 stencil  1344  1992  -       1   1  ccitt  no        37  0   300   300 2179B 0.7%
   1     6 stencil  1408  1272  -       1   1  ccitt  no        38  0   301   300  721B 0.3%
   1     7 stencil  1536   810  -       1   1  ccitt  no        39  0   300   300  128B 0.1%
   1     8 stencil    24     8  -       1   1  ccitt  no        40  0   300   300   11B  46%
   1     9 stencil  1216   918  -       1   1  ccitt  no        41  0   300   300 1942B 1.4%
   1    10 stencil  1784  1276  -       1   1  ccitt  no        42  0   300   300 4385B 1.5%
   1    11 stencil     8     6  -       1   1  ccitt  no        14  0   300   300    7B 117%
   1    12 stencil   440   570  -       1   1  ccitt  no        15  0   300   300   85B 0.3%
   1    13 stencil     8     6  -       1   1  ccitt  no        16  0   300   300    8B 133%
   1    14 stencil  1024   494  -       1   1  ccitt  no        17  0   300   300   80B 0.1%
   1    15 stencil     8     6  -       1   1  ccitt  no        18  0   300   300    8B 133%
   1    16 stencil  1776   556  -       1   1  ccitt  no        19  0   300   300 2215B 1.8%
   1    17 stencil     8     6  -       1   1  ccitt  no        20  0   300   300    7B 117%
   1    18 stencil     8     6  -       1   1  ccitt  no        21  0   300   300    8B 133%
   1    19 stencil     8     8  -       1   1  ccitt  no        22  0   300   300    9B 112%
   1    20 stencil  1320   198  -       1   1  ccitt  no        23  0   300   300 2356B 7.2%
   1    21 stencil  1344    12  -       1   1  ccitt  no        25  0   300   300   14B 0.7%
   1    22 stencil   176     6  -       1   1  ccitt  no        26  0   300   300   34B  26%
   1    23 stencil     8     6  -       1   1  ccitt  no        27  0   300   300    8B 133%
   1    24 stencil     8     8  -       1   1  ccitt  no        28  0   300   300    9B 112%
   1    25 stencil    24    30  -       1   1  ccitt  no        29  0   300   300   33B  37%
   1    26 stencil  1752    48  -       1   1  ccitt  no        30  0   300   300 1852B  18%
   1    27 stencil     8     6  -       1   1  ccitt  no        31  0   300   300    8B 133%
   1    28 stencil     8     6  -       1   1  ccitt  no        32  0   300   300    8B 133%
   1    29 stencil    24    38  -       1   1  ccitt  no        33  0   300   300   37B  32%
   1    30 stencil   128    38  -       1   1  ccitt  no        34  0   300   300   32B 5.3%
   1    31 stencil     8     8  -       1   1  ccitt  no        36  0   300   300    9B 112%
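
A listing like the one above can be summarized per image type with a few lines of Python. This is only a sketch: it assumes the 16-column layout shown here (which looks like the output of poppler's `pdfimages -list`), and it assumes the size suffixes B/K/M/G are multiples of 1024. The function names are made up.

```python
def parse_size(text):
    """Convert a size such as '11.1K' or '317B' to bytes (1024-based, assumed)."""
    units = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    return float(text[:-1]) * units[text[-1]]


def summarize(listing):
    """Return {type: (count, total_bytes)} for the rows of the listing."""
    totals = {}
    for line in listing.splitlines():
        fields = line.split()
        # Data rows have 16 columns and start with the page number;
        # this skips the header, the dashed separator and blank lines.
        if len(fields) != 16 or not fields[0].isdigit():
            continue
        kind, size = fields[2], parse_size(fields[14])
        count, total = totals.get(kind, (0, 0.0))
        totals[kind] = (count + 1, total + size)
    return totals
```

Comparing the totals for "image" and "stencil" shows how the file size is divided between the jpeg background and the b/w layers.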

