pymupdf / pymupdf-optional-material

Help file downloads, early ZIP binaries, wheels for retired Python 2.7, 3.5.

License: GNU Affero General Public License v3.0

fitz mupdf pdf pymupdf python windows

pymupdf-optional-material's Introduction

PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Community

Join us on Discord here: #pymupdf

Installation

PyMuPDF requires Python 3.8 or later. Install it using pip:

pip install PyMuPDF

There are no mandatory external dependencies. However, some optional features become available only if additional packages are installed.

You can also try without installing by visiting PyMuPDF.io.

Usage

Basic usage is as follows:

import pymupdf # imports the pymupdf library
doc = pymupdf.open("example.pdf") # open a document
for page in doc: # iterate the document pages
  text = page.get_text() # get plain text encoded as UTF-8

Documentation

Full documentation can be found on pymupdf.readthedocs.io.

Optional Features

fontTools for creating font subsets.
pymupdf-fonts contains some nice fonts for your text output.
Tesseract-OCR for optical character recognition in images and document pages.

About

PyMuPDF adds Python bindings and abstractions to MuPDF, a lightweight PDF, XPS, and eBook viewer, renderer, and toolkit. Both PyMuPDF and MuPDF are maintained and developed by Artifex Software, Inc.

PyMuPDF was originally written by Jorj X. McKie.

License and Copyright

PyMuPDF is available under open-source AGPL and commercial license agreements. If you determine you cannot meet the requirements of the AGPL, please contact Artifex for more information regarding a commercial license.

pymupdf-optional-material's Issues

Fitz.open is not passing back exception

My open call is quite simply:
doc = fitz.open(input_path + '\\' + file)

I'm experiencing crashes when corrupt PDF files are encountered. I would be happy to pre-screen them if I knew what to look for. I assumed that fitz.open would raise an exception that is passed back to me, but instead it crashes with this output:
error: cannot find startxref
warning: trying to repair broken xref
warning: repairing PDF document
warning: object missing 'endobj' token
error: non-page object in page tree
uncaught exception: non-page object in page tree

I'm attaching my PDF file that is causing the trouble.
I'm using this code to extract images from a large number of PDF files that I've generated using WKHTMLTOPDF. I'm unsure why a few of them are corrupt. I'm working on that end of things.

Is there a different way I can call the open that will cause the exception to be passed back to me so that I can skip the file and move on to the next?

Thank you for your time.

1960s GRUEN Airflight Vintage Pilot Aviators Military Time Jump Hour Watch.pdf
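
On the pre-screening question, here is a minimal sketch, not an official PyMuPDF recipe. The helper names `looks_like_complete_pdf` and `try_open` and the 2048-byte window are hypothetical choices. The first function uses only the standard library to check for the `%PDF-` header and the `startxref` / `%%EOF` trailer keywords, which targets the "cannot find startxref" message in the log above. It cannot detect every kind of damage (for example the "non-page object in page tree" error). The second function assumes that your PyMuPDF release raises a catchable Python exception instead of aborting; verify that against your installed version. On older releases the module is imported as `fitz` rather than `pymupdf`.

```python
import os


def looks_like_complete_pdf(path, window=2048):
    """Cheap structural pre-screen; it does not prove the file is valid.

    Checks for the '%PDF-' header near the start of the file and for the
    'startxref' and '%%EOF' trailer keywords near the end.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        head = fh.read(window)
        fh.seek(max(0, size - window))
        tail = fh.read()
    return b"%PDF-" in head and b"startxref" in tail and b"%%EOF" in tail


def try_open(path):
    """Return an open document, or None if the file should be skipped."""
    import pymupdf  # imported lazily so the pre-screen works without it

    if not looks_like_complete_pdf(path):
        print(f"skipping {path}: header or trailer keywords missing")
        return None
    try:
        return pymupdf.open(path)
    except Exception as exc:  # assumption: this release raises instead of aborting
        print(f"skipping {path}: {exc}")
        return None
```

In the extraction loop you would call `try_open(...)` and `continue` when it returns `None`. If a file still takes the whole process down, running each file in a separate subprocess is the more defensive option.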

get_pixmap may cause large image size

doc = fitz.open(pdf_file)
for page in doc:
    pix = page.get_pixmap()
    img_file = f'{img_file_prefix}-{page.number}.jpg'
    pix.save(img_file)

Will get_pixmap cause the generated JPG image to be too large in the above code? Is there a better way to convert every page in the PDF into a JPG image?
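
For a sense of scale, here is a small arithmetic sketch. The function name `rendered_size` is hypothetical and the rounding is approximate. Called without arguments, `get_pixmap` renders at 72 dpi, so one PDF point becomes one pixel. Pixel dimensions grow linearly with the resolution and the raw pixel data grows with its square. The size of the JPG on disk also depends on compression, which this sketch does not model.

```python
def rendered_size(width_pt, height_pt, dpi=72, channels=3):
    """Estimate pixel dimensions and raw (uncompressed) bytes of a page render.

    PDF page sizes are given in points (1/72 inch), so
    pixels = points * dpi / 72. The renderer may differ by a pixel.
    """
    zoom = dpi / 72
    width_px = round(width_pt * zoom)
    height_px = round(height_pt * zoom)
    return width_px, height_px, width_px * height_px * channels


for dpi in (72, 150, 300):
    w, h, raw = rendered_size(595, 842, dpi)  # A4 page size in points
    print(f"{dpi:>3} dpi: {w} x {h} px, about {raw / 1e6:.1f} MB uncompressed")
```

To render at a different resolution, pass `matrix=fitz.Matrix(zoom, zoom)` to `get_pixmap`. As far as I know, recent releases also accept a `dpi=` keyword, but check the documentation for your version.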

How to scale all pages to A4

I combined some pages from different PDF files, so the pages have various page sizes. As a result, the printed paper looks ugly if I forget to use "fit to paper size". Is there a way to

1. either set every page to be printed at A4 size, so that text and vector drawings on the PDF page stay text and vector drawings and still look clear,
2. or zoom the page?
2.1 But how do I make the page look as clear as before?
2.2 I have tried the following code. It works, but I don't like these lines:

    pix.writePNG('r:/tmp.png')
    img = fitz.open('r:/tmp.png')  # open pic as document

So is it possible to make it simpler?

import fitz
docIn = fitz.open('out - combined.pdf')

docOut = fitz.open()
for idxPage in range(len(docIn)):
    page = docIn[idxPage]

    pix = page.getPixmap()

    pix.writePNG('r:/tmp.png')
    img = fitz.open('r:/tmp.png')  # open pic as document

    rect = img[0].rect  # pic dimension
    pdfbytes = img.convertToPDF()  # make a PDF stream
    img.close()  # no longer needed
    imgPDF = fitz.open("pdf", pdfbytes)  # open stream as PDF
    sizeA4 = fitz.PaperSize("A4")
    page = docOut.newPage(width = sizeA4[0], height = sizeA4[1])
    page.showPDFpage(rect, imgPDF, 0)  # image fills the page

docOut.save("r:/all-my-pics.pdf")
docOut.close()

docIn.close()
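
One way to avoid the PNG round trip, sketched under assumptions: `show_pdf_page` (the current name of `showPDFpage`) can take the source document directly, so each page is embedded as vector content instead of a raster image. The helper names `fit_within` and `rescale_to_a4` are hypothetical. `fit_within` makes the aspect-preserving, centered placement explicit; `show_pdf_page` keeps proportions by default, so the helper may be redundant. Rotated or cropped source pages have not been considered here. On older releases the module is imported as `fitz`.

```python
def fit_within(src_w, src_h, dst_w, dst_h):
    """Return (x0, y0, x1, y1): src scaled to fit dst, aspect kept, centered."""
    scale = min(dst_w / src_w, dst_h / src_h)
    w, h = src_w * scale, src_h * scale
    x0, y0 = (dst_w - w) / 2, (dst_h - h) / 2
    return x0, y0, x0 + w, y0 + h


def rescale_to_a4(src_path, dst_path):
    """Copy every page of src_path onto a new A4 page, keeping vector content."""
    import pymupdf  # imported lazily so fit_within works without it

    a4_w, a4_h = pymupdf.paper_size("a4")  # page size in points
    src = pymupdf.open(src_path)
    dst = pymupdf.open()  # new, empty PDF
    for pno in range(len(src)):
        src_rect = src[pno].rect
        page = dst.new_page(width=a4_w, height=a4_h)
        target = pymupdf.Rect(
            *fit_within(src_rect.width, src_rect.height, a4_w, a4_h)
        )
        page.show_pdf_page(target, src, pno)  # embed the source page, no raster
    dst.save(dst_path)
    dst.close()
    src.close()
```

With the file names from the question, the call would be `rescale_to_a4('out - combined.pdf', 'r:/all-my-pics.pdf')`. Because nothing is rasterized, text stays selectable and sharp at any zoom level.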
