
pymupdf-optional-material's Introduction

PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Community

Join us on Discord here: #pymupdf

Installation

PyMuPDF requires Python 3.8 or later. Install it using pip:

pip install PyMuPDF

There are no mandatory external dependencies. However, some optional features become available only if additional packages are installed.

You can also try it without installing anything by visiting PyMuPDF.io.

Usage

Basic usage is as follows:

import pymupdf # imports the pymupdf library
doc = pymupdf.open("example.pdf") # open a document
for page in doc: # iterate the document pages
  text = page.get_text() # get plain text encoded as UTF-8

Documentation

Full documentation can be found on pymupdf.readthedocs.io.

Optional Features

  • fontTools for creating font subsets.
  • pymupdf-fonts contains some nice fonts for your text output.
  • Tesseract-OCR for optical character recognition in images and document pages.

About

PyMuPDF adds Python bindings and abstractions to MuPDF, a lightweight PDF, XPS, and eBook viewer, renderer, and toolkit. Both PyMuPDF and MuPDF are maintained and developed by Artifex Software, Inc.

PyMuPDF was originally written by Jorj X. McKie.

License and Copyright

PyMuPDF is available under open-source AGPL and commercial license agreements. If you determine you cannot meet the requirements of the AGPL, please contact Artifex for more information regarding a commercial license.

pymupdf-optional-material's People

Contributors

jlerouge, jorjmckie

pymupdf-optional-material's Issues

fitz.open does not pass back the exception

My open call is simply:
doc = fitz.open(input_path + '\\' + file)

I'm experiencing crashes when corrupt PDF files are encountered. I would be happy to pre-screen them if I knew what to look for. I assumed that fitz.open would raise an exception that is passed back to me, but instead it crashes with this output:
error: cannot find startxref
warning: trying to repair broken xref
warning: repairing PDF document
warning: object missing 'endobj' token
error: non-page object in page tree
uncaught exception: non-page object in page tree

I'm attaching my PDF file that is causing the trouble.
I'm using this code to extract images from a large number of PDF files that I've generated using WKHTMLTOPDF. I'm unsure why a few of them are corrupt. I'm working on that end of things.

Is there a different way I can call the open that will cause the exception to be passed back to me so that I can skip the file and move on to the next?

Thank you for your time.

1960s GRUEN Airflight Vintage Pilot Aviators Military Time Jump Hour Watch.pdf
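
One possible way to pre-screen files, sketched under assumptions: recent PyMuPDF releases generally report a damaged file as a Python exception, either from open() or when pages are loaded, so a try/except around both steps lets the loop skip the file and move on. The helper name screen_files and its opener parameter are hypothetical; the parameter exists only so the logic can be tried without real PDF files. If the installed version aborts inside the C library, as the "uncaught exception" line in the log suggests, no except clause can catch it, and upgrading or screening in a separate process would be needed instead.

```python
import os

try:
    import pymupdf  # older releases: import fitz
except ImportError:  # lets the sketch load even without the library
    pymupdf = None


def screen_files(folder, names, opener=None):
    """Split file names into (readable, skipped).

    readable: list of (name, page_count)
    skipped:  list of (name, error message)
    """
    if opener is None:
        opener = pymupdf.open  # assumption: PyMuPDF is installed
    readable, skipped = [], []
    for name in names:
        path = os.path.join(folder, name)  # avoids a hand-written '\\'
        try:
            doc = opener(path)
            try:
                # Touch every page so page-tree damage surfaces here too.
                page_count = sum(1 for _ in doc)
            finally:
                doc.close()
        except Exception as exc:  # broad on purpose: error types vary by version
            skipped.append((name, str(exc)))
            continue
        readable.append((name, page_count))
    return readable, skipped
```

Calling screen_files(input_path, file_list) returns the files that are safe to process plus the error text of each skipped file, which can then be logged for later inspection.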

get_pixmap may cause large image size

doc = fitz.open(pdf_file)
for page in doc:
    pix = page.get_pixmap()
    img_file = f'{img_file_prefix}-{page.number}.jpg'
    pix.save(img_file)

Will get_pixmap cause the generated JPG image to be too large in the above code? Is there a better way to convert every page in the PDF into a JPG image?
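
A rough sketch of how the image size could be controlled, under a few assumptions. Without arguments, get_pixmap renders at 72 dpi, so one PDF point becomes one pixel and the pixel count grows with the square of the resolution. The sketch assumes a recent PyMuPDF release in which get_pixmap accepts a dpi argument and Pixmap.save accepts jpg_quality. The helper names pixel_size and pages_to_jpg, and the default values 100 and 80, are illustrative choices and not library defaults.

```python
try:
    import pymupdf  # older releases: import fitz
except ImportError:  # lets the sketch load even without the library
    pymupdf = None


def pixel_size(width_pt, height_pt, dpi):
    """Approximate pixmap size for a page given in points (1 pt = 1/72 inch)."""
    scale = dpi / 72
    return round(width_pt * scale), round(height_pt * scale)


def pages_to_jpg(pdf_file, img_file_prefix, dpi=100, jpg_quality=80):
    """Save every page as a JPEG at the chosen resolution and quality."""
    doc = pymupdf.open(pdf_file)
    for page in doc:
        pix = page.get_pixmap(dpi=dpi)  # higher dpi means more pixels
        img_file = f"{img_file_prefix}-{page.number}.jpg"
        pix.save(img_file, jpg_quality=jpg_quality)  # lower quality means smaller file
    doc.close()
```

For an A4 page (595 x 842 points), pixel_size(595, 842, 72) gives (595, 842) and pixel_size(595, 842, 150) gives (1240, 1754), so lowering dpi or jpg_quality is the main lever for keeping files small.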

How to scale all pages to A4

I combined pages from several PDF files, so the pages have various sizes. As a result, the printout looks ugly if I forget to select "fit to paper size". Is there a way to

  1. either set every page so that it is printed at A4 size? That way, text and vector drawings on the PDF page remain text and vector drawings, so they still look sharp.
  2. or zoom the page?
    2.1 But how do I keep the page looking as sharp as before?
    2.2 I have tried the following code. It works, but I don't like these lines:
    pix.writePNG('r:/tmp.png')
    img = fitz.open('r:/tmp.png')  # open pic as document

Is it possible to make this simpler?

import fitz
docIn = fitz.open('out - combined.pdf')

docOut = fitz.open()
for idxPage in range(len(docIn)):
    page = docIn[idxPage]

    pix = page.getPixmap()

    pix.writePNG('r:/tmp.png')
    img = fitz.open('r:/tmp.png')  # open pic as document

    rect = img[0].rect  # pic dimension
    pdfbytes = img.convertToPDF()  # make a PDF stream
    img.close()  # no longer needed
    imgPDF = fitz.open("pdf", pdfbytes)  # open stream as PDF
    sizeA4 = fitz.PaperSize("A4")
    page = docOut.newPage(width = sizeA4[0], height = sizeA4[1])
    page.showPDFpage(rect, imgPDF, 0)  # image fills the page

docOut.save("r:/all-my-pics.pdf")
docOut.close()

docIn.close()
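
A possible simplification, offered as a sketch and not as the definitive answer: show_pdf_page can place a source page directly onto a new A4 page, which skips the PNG round trip and keeps text and vector drawings as they are. The sketch assumes a recent PyMuPDF release with the snake_case method names (new_page, show_pdf_page). The helper names target_size and scale_to_a4 are hypothetical, and choosing landscape A4 for wide pages is a design choice added here that the question did not ask for.

```python
try:
    import pymupdf  # older releases: import fitz
except ImportError:  # lets the sketch load even without the library
    pymupdf = None

A4_PORTRAIT = (595, 842)  # A4 in points, the values of pymupdf.paper_size("a4")


def target_size(width, height):
    """Pick portrait or landscape A4 to match the source page orientation."""
    w, h = A4_PORTRAIT
    return (w, h) if height >= width else (h, w)


def scale_to_a4(src_path, dst_path):
    """Copy every page of src_path onto an A4 page, scaled to fit."""
    src = pymupdf.open(src_path)
    out = pymupdf.open()  # new, empty PDF
    for page in src:
        w, h = target_size(page.rect.width, page.rect.height)
        new_page = out.new_page(width=w, height=h)
        # The source page is embedded as vector content and scaled
        # into the target rectangle, keeping its proportions.
        new_page.show_pdf_page(new_page.rect, src, page.number)
    out.save(dst_path)
    out.close()
    src.close()
```

A call such as scale_to_a4('out - combined.pdf', 'r:/all-my-pics.pdf') would then replace the loop above. Because nothing is rasterized, the output stays sharp at any zoom level.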
