pypdfium2-team / pypdfium2

Python bindings to PDFium

Home Page: https://pypdfium2.readthedocs.io/

pdf rasterisation pdfium pdf-to-image python pdf-documents

pypdfium2

pypdfium2 is an ABI-level Python 3 binding to PDFium, a powerful and liberal-licensed library for PDF rendering, inspection, manipulation and creation.

It is built with ctypesgen and external PDFium binaries. The custom setup infrastructure provides a seamless packaging and installation process. A wide range of platforms is supported with pre-built packages.

pypdfium2 includes helpers to simplify common use cases, while the raw PDFium/ctypes API remains accessible as well.

Installation

From PyPI (recommended)
```
python -m pip install -U pypdfium2
```
This will use a pre-built wheel package, the easiest way of installing pypdfium2.

From source

Dependencies:
- System: git, C pre-processor (gcc/clang - alternatively, specify the command to invoke via $CPP)
- Python: ctypesgen (pypdfium2-team fork), wheel, setuptools. Usually installed automatically.

Get the code

git clone "https://github.com/pypdfium2-team/pypdfium2.git"
cd pypdfium2/

With pre-built binary
```
# In the pypdfium2/ directory
python -m pip install -v .
```
A binary is downloaded implicitly from pdfium-binaries and bundled into pypdfium2.
With self-built binary
```
# call build script with --help to list options
python setupsrc/pypdfium2_setup/build_pdfium.py
PDFIUM_PLATFORM="sourcebuild" python -m pip install -v .
```
Building PDFium may take a long time, as it comes with its bundled toolchain and deps, rather than taking them from the system.¹ However, we can at least provide the --use-syslibs option to build against system-provided runtime libraries.
With system-provided binary
```
# Substitute $PDFIUM_VER with the system pdfium's build version.
# For ABI safety reasons, you'll want to make sure `$PDFIUM_VER` is correct and the bindings are rebuilt whenever system pdfium is updated.
PDFIUM_PLATFORM="system:$PDFIUM_VER" python -m pip install -v .
```
Link against external pdfium instead of bundling it. Note that this is basically a high-level convenience entry point to internal bindings generation, intended for end users. It is therefore less flexible, supporting only the "simple case" for now. For more sophisticated use cases that need to pass custom parameters to ctypesgen (e.g. runtime libdirs / headers / feature flags), consider caller-provided data files.

With caller-provided data files (this is expected to work offline)

# Call ctypesgen (see --help or packaging_base.py::run_ctypesgen() for further options)
# Reminder: you'll want to use the pypdfium2-team fork of ctypesgen
ctypesgen --library pdfium --runtime-libdirs $MY_LIBDIRS --headers $MY_INCLUDE_DIR/fpdf*.h -o src/pypdfium2_raw/bindings.py [-D $MY_FLAGS]

# Write the version file (fill the placeholders).
# Note, this is not a mature interface yet and might change!
# major/minor/build/patch: integers forming the pdfium version being packaged
# n_commits/hash: git describe like post-tag info (0/null for release commit)
# origin: a string to identify the build, consisting of binary source and package provider (e.g. "system/debian", "pdfium-binaries/debian")
# flags: a comma-delimited list of pdfium feature flag strings (e.g. "V8", "XFA") - may be empty for default build
cat >"src/pypdfium2_raw/version.json" <<END
{
  "major": $PDFIUM_MAJOR,
  "minor": $PDFIUM_MINOR,
  "build": $PDFIUM_BUILD,
  "patch": $PDFIUM_PATCH,
  "n_commits": $POST_TAG_COMMIT_COUNT,
  "hash": $POST_TAG_HASH,
  "origin": "$ORIGIN",
  "flags": [$MY_FLAGS]
}
END

# optional: copy in a binary if bundling
cp "$BINARY_PATH" src/pypdfium2_raw/libpdfium.so

# Finally, install
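# set $MY_PLATFORM to "system" if building against system pdfium, "auto" or the platform name otherwise
# (only `prepared!` is single-quoted, so that `!` is not history-expanded in interactive shells while the variables are still substituted)
PDFIUM_PLATFORM='prepared!'"$MY_PLATFORM:$PDFIUM_BUILD" python -m pip install --no-build-isolation -v .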
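
The version file from the steps above can also be produced with Python's json module, which takes care of quoting and of writing null for a missing hash. This is only a sketch: the helper names `make_version_info` and `write_version_file` are hypothetical, the field values are placeholders, and the interface itself may change as noted above.

```python
import json

def make_version_info(major, minor, build, patch, origin,
                      n_commits=0, commit_hash=None, flags=()):
    """Collect the fields described above into a JSON-serialisable dict."""
    return {
        "major": int(major),
        "minor": int(minor),
        "build": int(build),
        "patch": int(patch),
        "n_commits": int(n_commits),
        "hash": commit_hash,   # None is written as JSON null
        "origin": str(origin),
        "flags": list(flags),  # e.g. ["V8", "XFA"], empty for a default build
    }

def write_version_file(path, info):
    """Write the dict to the given path (e.g. src/pypdfium2_raw/version.json)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(info, fh, indent=2)
        fh.write("\n")

# Placeholder values, not a real pdfium version
info = make_version_info(0, 0, 5975, 0, origin="system/debian")
version_json = json.dumps(info, indent=2)
```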

See Setup Magic for details.

Support for source installs (esp. with self-built/system pdfium) is limited, as their integrity depends somewhat on a correctly acting caller.

Installing an sdist does not implicitly trigger a sourcebuild if no pre-built binary is available. It is preferred to let callers decide consciously what to do, and run the build script without pip encapsulation.

Relevant pip options:

-v: Verbose logging output. Useful for debugging.
-e: Install in editable mode, so the installation points to the source tree. This way, changes directly take effect without needing to re-install. Recommended for development.
--no-build-isolation: Do not isolate setup in a virtual env; use the main env instead. This renders pyproject.toml [build-system] inactive, setup deps must be prepared by caller. Useful to install custom versions of setup deps, or as speedup when installing repeatedly.

From Conda

Beware: Any conda packages/recipes of pypdfium2 or pdfium-binaries that might be provided by other distributors, including anaconda/main or conda-forge default channels, are unofficial.
- To install
  
  With permanent channel config (encouraged):
```
conda config --add channels bblanchon
conda config --add channels pypdfium2-team
conda config --set channel_priority strict
conda install pypdfium2-team::pypdfium2_helpers
```
  Alternatively, with temporary channel config:
```
conda install pypdfium2-team::pypdfium2_helpers --override-channels -c pypdfium2-team -c bblanchon
```
  Adding the channels permanently and tightening priority is encouraged to include pypdfium2 in conda update by default, and to avoid accidentally replacing the install with a different channel. (If desired, you may limit the channel config to the current environment by adding --env.) Otherwise, you should be cautious when making changes to the environment.
- To depend on pypdfium2 in a conda-build recipe
```
requirements:
  run:
    - pypdfium2-team::pypdfium2_helpers
```
  You'll want to have downstream callers handle the custom channels as shown above, otherwise conda will not be able to satisfy requirements.
- To set up channels in a GH workflow
```
- name: ...
  uses: conda-incubator/setup-miniconda@v3
  with:
    # ... your options
    channels: pypdfium2-team,bblanchon
    channel-priority: strict
```
  This is just a suggestion; you can also call conda config manually, or pass channels on a per-command basis using -c, as discussed above.
- To verify the sources
```
conda list --show-channel-urls "pypdfium2|pdfium-binaries"
conda config --show-sources
```
  The table should show pypdfium2-team and bblanchon in the channels column. If added permanently, the config should also include these channels, ideally with top priority. Please check this before reporting any issue with a conda install of pypdfium2.
Note: Conda packages are normally managed using recipe feedstocks driven by third parties, in a Linux-repository-like fashion. However, with some quirks it is also possible to do conda packaging within the original project and publish to a custom channel, which is what pypdfium2-team does and what the above instructions refer to.
Unofficial packages

The authors of this project have no control over and are not responsible for possible third-party builds of pypdfium2, and we do not support them. Please use the official packages where possible. If you have an issue with a third-party build, either contact your distributor, or try to reproduce with an official build.

Do not expect us to help with the creation of unofficial builds or add/change code for downstream setup tasks. Related issues or PRs may be closed without further notice if we don't consider them fit for upstream.

If you are a third-party distributor, please point out clearly and visibly in the description that your package is unofficial, i.e. not affiliated with or endorsed by pypdfium2 team.

Runtime Dependencies

As of this writing, pypdfium2 does not need any mandatory runtime dependencies apart from Python itself.

However, some optional support model features require additional packages:

Pillow (module name PIL) is a popular imaging library for Python. pypdfium2 provides convenience methods to translate between raw bitmap buffers and PIL images.
NumPy is a library for scientific computing. Similar to Pillow, pypdfium2 provides helpers to get a numpy array view of a raw bitmap.

Setup Magic

As pypdfium2 requires a C extension and has custom setup code, there are some special features to consider. Note that the APIs below may change at any time and are mostly of internal interest.

Binaries are stored in platform-specific sub-directories of data/, along with bindings and version information.
$PDFIUM_PLATFORM defines which binary to include on setup.
- Format spec: [$PLATFORM][-v8][:$VERSION] ([] = segments, $CAPS = variables).
- Examples: auto, auto:5975, auto-v8:5975 (auto may be substituted by an explicit platform name, e.g. linux_x64).
- Platform:
  - If unset or auto, the host platform is detected and a corresponding binary will be selected.
  - If an explicit platform identifier (e.g. linux_x64, darwin_arm64, ...), binaries for the requested platform will be used.²
  - If system, bind against system-provided pdfium instead of embedding a binary. Version must be given explicitly so matching bindings can be generated.
  - If sourcebuild, binaries will be taken from data/sourcebuild/, assuming a prior run of build_pdfium.py.
  - If sdist, no platform-dependent files will be included, so as to create a source distribution. sourcebuild and sdist are standalone, they cannot be followed by additional specifiers.
- V8: If given, use the V8 (JavaScript) and XFA enabled pdfium binaries. Otherwise, use the regular (non-V8) binaries.
- Version: If given, use the specified pdfium-binaries release. Otherwise, use the latest one.
- It is possible to prepend prepared! to install with existing platform files instead of generating on the fly; the value will be used for metadata / file inclusion. This can be helpful when installing in an isolated env where ctypesgen is not available, but it is not desirable to use the reference bindings (e.g. conda).
$PYPDFIUM_MODULES=[raw,helpers] defines the modules to include. Metadata adapts dynamically.
- May be used by packagers to decouple raw bindings and helpers, which may be relevant if packaging against system pdfium.
- It would also allow installing only the raw module without helpers, or only helpers with a custom raw module.
$PDFIUM_BINDINGS=reference allows overriding ctypesgen to use the reference bindings file autorelease/bindings.py instead.
- This is a convenience option to get pypdfium2 installed from source even if a working ctypesgen is not available in the install env.
- Warning: This may not be ABI-safe. Please make sure binary/bindings build headers match to avoid ABI issues.
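
To make the $PDFIUM_PLATFORM format spec above more tangible, here is a rough sketch of how such a value could be split into its segments. It is not the project's actual setup code: `PlatformSpec` and `parse_platform_spec` are hypothetical names, and the validation only covers the rules stated in the list above.

```python
from typing import NamedTuple, Optional

class PlatformSpec(NamedTuple):
    prepared: bool          # True if the value started with "prepared!"
    platform: str           # "auto", "system", "sourcebuild", "sdist" or a platform name
    use_v8: bool            # True if the "-v8" segment was given
    version: Optional[str]  # text after ":", or None if absent

def parse_platform_spec(spec):
    """Split a value of the form [prepared!][$PLATFORM][-v8][:$VERSION]."""
    prepared = spec.startswith("prepared!")
    if prepared:
        spec = spec[len("prepared!"):]
    head, sep, version = spec.partition(":")
    use_v8 = head.endswith("-v8")
    if use_v8:
        head = head[:-len("-v8")]
    platform = head or "auto"  # unset behaves like auto
    if platform in ("sourcebuild", "sdist") and (use_v8 or sep):
        raise ValueError(f"{platform} is standalone and takes no further specifiers")
    if platform == "system" and not version:
        raise ValueError("system requires an explicit version")
    return PlatformSpec(prepared, platform, use_v8, version or None)

example = parse_platform_spec("auto-v8:5975")
```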

Usage

Support model

Here are some examples of using the support model API.

Import the library
```
import pypdfium2 as pdfium
```

Open a PDF using the helper class PdfDocument (supports file path strings, bytes, and byte buffers)

pdf = pdfium.PdfDocument("./path/to/document.pdf")
version = pdf.get_version()  # get the PDF standard version
n_pages = len(pdf)  # get the number of pages in the document
page = pdf[0]  # load a page

Render the page

bitmap = page.render(
    scale = 1,    # 72dpi resolution
    rotation = 0, # no additional rotation
    # ... further rendering options
)
pil_image = bitmap.to_pil()
pil_image.show()
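
Since a scale of 1 corresponds to 72 dpi, a scale factor for another resolution can be derived by dividing by 72. The sketch below assumes that pixel dimensions are rounded up; the helper names `scale_for_dpi` and `pixel_size` are hypothetical and not part of pypdfium2.

```python
import math

PDF_UNITS_PER_INCH = 72  # default canvas unit: 1 pt = 1/72 in

def scale_for_dpi(dpi):
    """Scale factor that maps PDF canvas units to pixels at the given resolution."""
    return dpi / PDF_UNITS_PER_INCH

def pixel_size(width_pt, height_pt, scale):
    """Pixel dimensions of a page at the given scale, rounded up (an assumption)."""
    return math.ceil(width_pt * scale), math.ceil(height_pt * scale)

# An A4 page (595 x 842 pt) at 300 dpi
a4_at_300dpi = pixel_size(595, 842, scale_for_dpi(300))
```

The resulting scale could then be passed as the `scale` argument of `page.render()` shown above.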

Try some page methods

# Get page dimensions in PDF canvas units (1pt->1/72in by default)
width, height = page.get_size()
# Set the absolute page rotation to 90° clockwise
page.set_rotation(90)

# Locate objects on the page
for obj in page.get_objects():
    print(obj.level, obj.type, obj.get_pos())

Extract and search text

# Load a text page helper
textpage = page.get_textpage()

# Extract text from the whole page
text_all = textpage.get_text_range()
# Extract text from a specific rectangular area
text_part = textpage.get_text_bounded(left=50, bottom=100, right=width-50, top=height-100)

# Locate text on the page
searcher = textpage.search("something", match_case=False, match_whole_word=False)
# This returns the next occurrence as (char_index, char_count), or None if not found
first_occurrence = searcher.get_next()

Read the table of contents

for item in pdf.get_toc():
    state = "*" if item.n_kids == 0 else "-" if item.is_closed else "+"
    target = "?" if item.page_index is None else item.page_index+1
    print(
        "    " * item.level +
        "[%s] %s -> %s  # %s %s" % (
            state, item.title, target, item.view_mode, item.view_pos,
        )
    )

Create a new PDF with an empty A4 sized page

pdf = pdfium.PdfDocument.new()
width, height = (595, 842)
page_a = pdf.new_page(width, height)

Include a JPEG image in a PDF

pdf = pdfium.PdfDocument.new()

image = pdfium.PdfImage.new(pdf)
image.load_jpeg("./tests/resources/mona_lisa.jpg")
width, height = image.get_size()

matrix = pdfium.PdfMatrix().scale(width, height)
image.set_matrix(matrix)

page = pdf.new_page(width, height)
page.insert_obj(image)
page.gen_content()

Save the document

# PDF 1.7 standard
pdf.save("output.pdf", version=17)

Raw PDFium API

While helper classes conveniently wrap the raw PDFium API, it may still be accessed directly and is available in the namespace pypdfium2.raw. Lower-level helpers that may aid with using the raw API are provided in pypdfium2.internal.

import pypdfium2.raw as pdfium_c
import pypdfium2.internal as pdfium_i

Since PDFium is a large library, many components are not covered by helpers yet. You may seamlessly interact with the raw API while still using helpers where available. When used as ctypes function parameter, helper objects automatically resolve to the underlying raw object (but you may still access it explicitly if desired):

permission_flags = pdfium_c.FPDF_GetDocPermission(pdf.raw)  # explicit
permission_flags = pdfium_c.FPDF_GetDocPermission(pdf)      # implicit

For PDFium docs, please look at the comments in its public header files.³ A large variety of examples on how to interface with the raw API using ctypes is already provided with support model source code. Nonetheless, the following guide may be helpful to get started with the raw API, especially for developers who are not familiar with ctypes yet.

In general, PDFium functions can be called just like normal Python functions. However, parameters may only be passed positionally, i. e. it is not possible to use keyword arguments. There are no defaults, so you always need to provide a value for each argument.
```
# arguments: filepath (bytes), password (bytes|None)
# null-terminate filepath and encode as UTF-8
pdf = pdfium_c.FPDF_LoadDocument((filepath+"\x00").encode("utf-8"), None)
```
This is the underlying bindings declaration,⁴ which loads the function from the binary and contains the information required to convert Python types to their C equivalents.
```
if _libs["pdfium"].has("FPDF_LoadDocument", "cdecl"):
    FPDF_LoadDocument = _libs["pdfium"].get("FPDF_LoadDocument", "cdecl")
    FPDF_LoadDocument.argtypes = [FPDF_STRING, FPDF_BYTESTRING]
    FPDF_LoadDocument.restype = FPDF_DOCUMENT
```
Python bytes are converted to FPDF_STRING by ctypes autoconversion. When passing a string to a C function, it must always be null-terminated, as the function merely receives a pointer to the first item and then continues to read memory until it finds a null terminator.
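
The effect of the null terminator can be observed with plain ctypes, without PDFium. In this small sketch, `to_c_string` is a hypothetical helper that mirrors the encoding step used above.

```python
import ctypes

def to_c_string(text):
    """Append a null terminator and encode as UTF-8, as in the call above."""
    return (text + "\x00").encode("utf-8")

encoded = to_c_string("déjà.pdf")
# c_char_p merely stores a pointer; reading `value` scans up to the first null byte
c_str = ctypes.c_char_p(encoded)
read_back = c_str.value

# create_string_buffer() appends a terminator itself when initialised from bytes
buffer = ctypes.create_string_buffer(b"abc")
buffer_size = ctypes.sizeof(buffer)
```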

While some functions are quite easy to use, things soon get more complex. First of all, function parameters are not only used for input, but also for output:

# Initialise an integer object (defaults to 0)
c_version = ctypes.c_int()
# Let the function assign a value to the c_int object, and capture its return code (True for success, False for failure)
ok = pdfium_c.FPDF_GetFileVersion(pdf, c_version)
# If successful, get the Python int by accessing the `value` attribute of the c_int object
# Otherwise, set the variable to None (in other cases, it may be desired to raise an exception instead)
version = c_version.value if ok else None
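
The output parameter pattern can be tried without PDFium by letting a Python callback play the role of the C function. In the sketch below, `GetVersionProto` and `make_fake_getter` are hypothetical stand-ins, not part of any real API; only the calling convention mirrors the snippet above.

```python
import ctypes

# Stand-in for a C function of the form:  int get_version(int *out)
GetVersionProto = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.POINTER(ctypes.c_int))

def make_fake_getter(value):
    """Create a fake getter that reports `value`, or fails if `value` is None."""
    @GetVersionProto
    def getter(out_ptr):
        if value is None:
            return 0        # failure: leave the output untouched
        out_ptr[0] = value  # write through the pointer
        return 1            # success
    return getter

def read_version(getter):
    c_version = ctypes.c_int()
    # The c_int is passed by reference automatically, because argtypes declares a pointer
    ok = getter(c_version)
    return c_version.value if ok else None

version = read_version(make_fake_getter(17))
missing = read_version(make_fake_getter(None))
```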

If an array is required as output parameter, you can initialise one like this (in general terms):

# long form
array_type = (c_type * array_length)
array_object = array_type()
# short form
array_object = (c_type * array_length)()

Example: Getting view mode and target position from a destination object returned by some other function.

# (Assuming `dest` is an FPDF_DEST)
n_params = ctypes.c_ulong()
# Create a C array to store up to four coordinates
view_pos = (pdfium_c.FS_FLOAT * 4)()
view_mode = pdfium_c.FPDFDest_GetView(dest, n_params, view_pos)
# Convert the C array to a Python list and cut it down to the actual number of coordinates
view_pos = list(view_pos)[:n_params.value]
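
The same idea works for array output parameters. The following sketch again uses a Python callback as a stand-in for the C function (`GetViewProto` and `fake_get_view` are hypothetical), with c_float assumed as the element type.

```python
import ctypes

# Stand-in for:  unsigned long get_view(unsigned long *n_params, float *params)
GetViewProto = ctypes.CFUNCTYPE(
    ctypes.c_ulong,
    ctypes.POINTER(ctypes.c_ulong),
    ctypes.POINTER(ctypes.c_float),
)

@GetViewProto
def fake_get_view(n_params_ptr, coords_ptr):
    # Fill only the first two of the four available slots
    coords_ptr[0] = 10.0
    coords_ptr[1] = 20.0
    n_params_ptr[0] = 2
    return 1  # made-up view mode constant

n_params = ctypes.c_ulong()
view_pos = (ctypes.c_float * 4)()  # zero-initialised array of four floats
view_mode = fake_get_view(n_params, view_pos)
# Cut the list down to the number of values actually written
coords = list(view_pos)[:n_params.value]
```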

For string output parameters, callers need to provide a sufficiently long, pre-allocated buffer. This may work differently depending on what type the function requires, which encoding is used, whether the number of bytes or characters is returned, and whether space for a null terminator is included or not. Carefully review the documentation for the function in question to fulfill its requirements.

Example A: Getting the title string of a bookmark.

# (Assuming `bookmark` is an FPDF_BOOKMARK)
# First call to get the required number of bytes (not characters!), including space for a null terminator
n_bytes = pdfium_c.FPDFBookmark_GetTitle(bookmark, None, 0)
# Initialise the output buffer
buffer = ctypes.create_string_buffer(n_bytes)
# Second call with the actual buffer
pdfium_c.FPDFBookmark_GetTitle(bookmark, buffer, n_bytes)
# Decode to string, cutting off the null terminator
# Encoding: UTF-16LE (2 bytes per character)
title = buffer.raw[:n_bytes-2].decode("utf-16-le")
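
The two-call pattern of Example A can be exercised without PDFium as follows. The stand-in `fake_get_title` is hypothetical; it is assumed to behave like the function above, i.e. it returns the required number of bytes including the terminator and only writes if the buffer is large enough.

```python
import ctypes

_ENCODED_TITLE = ("Chapter 1" + "\x00").encode("utf-16-le")

# Stand-in for:  unsigned long get_title(void *buffer, unsigned long buflen)
GetTitleProto = ctypes.CFUNCTYPE(ctypes.c_ulong, ctypes.c_void_p, ctypes.c_ulong)

@GetTitleProto
def fake_get_title(buffer, buflen):
    # `buffer` arrives as an integer address, or None for a null pointer
    if buffer and buflen >= len(_ENCODED_TITLE):
        ctypes.memmove(buffer, _ENCODED_TITLE, len(_ENCODED_TITLE))
    return len(_ENCODED_TITLE)

def read_title(get_title):
    # First call: query the required size in bytes
    n_bytes = get_title(None, 0)
    buffer = ctypes.create_string_buffer(n_bytes)
    # Second call: let the function fill the buffer
    get_title(buffer, n_bytes)
    # Strip the 2-byte null terminator before decoding
    return buffer.raw[:n_bytes - 2].decode("utf-16-le")

title = read_title(fake_get_title)
```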

Example B: Extracting text in given boundaries.

# (Assuming this code runs inside a function, `textpage` is an FPDF_TEXTPAGE, and the boundary variables are set)
# Store common arguments for the two calls
args = (textpage, left, top, right, bottom)
# First call to get the required number of characters (not bytes!) - a possible null terminator is not included
n_chars = pdfium_c.FPDFText_GetBoundedText(*args, None, 0)
# If no characters were found, return an empty string
if n_chars <= 0:
    return ""
# Calculate the required number of bytes (UTF-16LE encoding again)
n_bytes = 2 * n_chars
# Initialise the output buffer - this function can work without null terminator, so skip it
buffer = ctypes.create_string_buffer(n_bytes)
# Re-interpret the type from char to unsigned short as required by the function
buffer_ptr = ctypes.cast(buffer, ctypes.POINTER(ctypes.c_ushort))
# Second call with the actual buffer
pdfium_c.FPDFText_GetBoundedText(*args, buffer_ptr, n_chars)
# Decode to string (You may want to pass `errors="ignore"` to skip possible errors in the PDF's encoding)
text = buffer.raw.decode("utf-16-le")

String output is not the only thing that must be handled according to the requirements of the function in question. String input, too, can work differently depending on encoding and type. We have already discussed FPDF_LoadDocument(), which takes a UTF-8 encoded string as char *. A different example is FPDFText_FindStart(), which needs a UTF-16LE encoded string, given as unsigned short *:
```
# (Assuming `text` is a str and `textpage` an FPDF_TEXTPAGE)
# Add the null terminator and encode as UTF-16LE
enc_text = (text + "\x00").encode("utf-16-le")
# cast `enc_text` to a c_ushort pointer
text_ptr = ctypes.cast(enc_text, ctypes.POINTER(ctypes.c_ushort))
search = pdfium_c.FPDFText_FindStart(textpage, text_ptr, 0, 0)
```
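
To check what the callee would see through such a pointer, the code units can be read back with plain ctypes. This is only an illustration; `to_utf16le_ptr` and `read_code_units` are hypothetical helpers.

```python
import ctypes
import sys

def to_utf16le_ptr(text):
    """Encode with a null terminator and cast to an unsigned short pointer."""
    enc_text = (text + "\x00").encode("utf-16-le")
    ptr = ctypes.cast(enc_text, ctypes.POINTER(ctypes.c_ushort))
    # Return the bytes as well, so the caller can keep them referenced while the pointer is used
    return enc_text, ptr

def read_code_units(ptr):
    """Read 16-bit units until the null terminator, like a C function would."""
    units = []
    index = 0
    while ptr[index] != 0:
        units.append(ptr[index])
        index += 1
    return units

enc_text, text_ptr = to_utf16le_ptr("héllo")
units = read_code_units(text_ptr)
# Re-assemble the raw bytes in native byte order and decode them again
round_trip = b"".join(u.to_bytes(2, sys.byteorder) for u in units).decode("utf-16-le")
```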

Leaving strings, let's suppose you have a C memory buffer allocated by PDFium and wish to read its data. PDFium will provide you with a pointer to the first item of the byte array. To access the data, you'll want to re-interpret the pointer using ctypes.cast() to encompass the whole array:

# (Assuming `bitmap` is an FPDF_BITMAP and `size` is the expected number of bytes in the buffer)
buffer_ptr = pdfium_c.FPDFBitmap_GetBuffer(bitmap)
buffer_ptr = ctypes.cast(buffer_ptr, ctypes.POINTER(ctypes.c_ubyte * size))
# Buffer as ctypes array (referencing the original buffer, will be unavailable as soon as the bitmap is destroyed)
c_array = buffer_ptr.contents
# Buffer as Python bytes (independent copy)
data = bytes(c_array)
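
The difference between the array view and the bytes copy can be demonstrated with a buffer owned by Python. In this sketch, `owner` merely stands in for memory that PDFium would normally allocate.

```python
import ctypes

size = 8
# Stand-in for a C-allocated buffer, filled with the values 0..7
owner = (ctypes.c_ubyte * size)(*range(size))
# What a C function would typically hand out: a pointer to the first item
first_item_ptr = ctypes.cast(owner, ctypes.POINTER(ctypes.c_ubyte))

# Re-interpret the pointer to encompass the whole array
array_ptr = ctypes.cast(first_item_ptr, ctypes.POINTER(ctypes.c_ubyte * size))
c_array = array_ptr.contents  # view on the same memory
snapshot = bytes(c_array)     # independent copy

# A later change to the buffer is visible through the view, but not in the copy
owner[0] = 255
```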

Writing data from Python into a C buffer works in a similar fashion:

# (Assuming `buffer_ptr` is a pointer to the first item of a C buffer to write into,
#  `size` the number of bytes it can store, and `py_buffer` a Python byte buffer)
buffer_ptr = ctypes.cast(buffer_ptr, ctypes.POINTER(ctypes.c_char * size))
# Read from the Python buffer, starting at its current position, directly into the C buffer
# (until the target is full or the end of the source is reached)
n_bytes = py_buffer.readinto(buffer_ptr.contents)  # returns the number of bytes read
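
Here is a self-contained variant of the above, using io.BytesIO as the source and a ctypes array as the target. The helper name `fill_c_buffer` is hypothetical.

```python
import ctypes
import io

def fill_c_buffer(py_buffer, first_item_ptr, size):
    """Read up to `size` bytes from `py_buffer` directly into the C buffer."""
    target = ctypes.cast(first_item_ptr, ctypes.POINTER(ctypes.c_char * size)).contents
    return py_buffer.readinto(target)

c_buffer = (ctypes.c_char * 4)()
source = io.BytesIO(b"abcdef")

n_first = fill_c_buffer(source, c_buffer, 4)   # fills the whole target
first = c_buffer.raw
n_second = fill_c_buffer(source, c_buffer, 4)  # source runs out after two bytes
second = c_buffer.raw
```

Note that a short read leaves the remaining bytes of the target untouched, so callers may want to check the returned count.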

If you wish to check whether two objects returned by PDFium are the same, the is operator won't help because ctypes does not have original object return (OOR), i. e. new, equivalent Python objects are created each time, although they might represent one and the same C object.⁵ That's why you'll want to use ctypes.addressof() to get the memory address of the underlying C object. For instance, this is used to avoid infinite loops on circular bookmark references when iterating through the document outline:

# (Assuming `pdf` is an FPDF_DOCUMENT)
seen = set()
bookmark = pdfium_c.FPDFBookmark_GetFirstChild(pdf, None)
while bookmark:
    # bookmark is a pointer, so we need to use its `contents` attribute to get the object the pointer refers to
    # (otherwise we'd only get the memory address of the pointer itself, which would result in random behaviour)
    address = ctypes.addressof(bookmark.contents)
    if address in seen:
        break  # circular reference detected
    else:
        seen.add(address)
    bookmark = pdfium_c.FPDFBookmark_GetNextSibling(pdf, bookmark)
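
The same technique can be tried on a small linked list built with ctypes alone. In this sketch, the `Node` structure and the `walk` function are hypothetical stand-ins for bookmarks and their traversal.

```python
import ctypes

class Node(ctypes.Structure):
    pass

Node._fields_ = [("value", ctypes.c_int), ("next", ctypes.POINTER(Node))]

def walk(first_ptr):
    """Collect values along the chain, stopping at a null pointer or a repeated node."""
    seen = set()
    values = []
    node = first_ptr
    while node:  # a null pointer evaluates to False
        address = ctypes.addressof(node.contents)
        if address in seen:
            break  # circular reference detected
        seen.add(address)
        values.append(node.contents.value)
        node = node.contents.next
    return values

# Build a circular chain: a -> b -> c -> a
a, b, c = Node(1), Node(2), Node(3)
a.next = ctypes.pointer(b)
b.next = ctypes.pointer(c)
c.next = ctypes.pointer(a)
values = walk(ctypes.pointer(a))

# Two pointers to the same object are distinct Python objects with equal target address
p1, p2 = ctypes.pointer(a), ctypes.pointer(a)
same_target = ctypes.addressof(p1.contents) == ctypes.addressof(p2.contents)
```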

In many situations, callback functions come in handy.⁶ Thanks to ctypes, it is seamlessly possible to use callbacks across Python/C language boundaries.

Example: Loading a document from a Python buffer. This way, file access can be controlled in Python while the whole data does not need to be in memory at once.

import os

# Factory class to create callable objects holding a reference to a Python buffer
class _reader_class:
  
  def __init__(self, py_buffer):
      self.py_buffer = py_buffer
  
  def __call__(self, _, position, p_buf, size):
      # Write data from Python buffer into C buffer, as explained before
      buffer_ptr = ctypes.cast(p_buf, ctypes.POINTER(ctypes.c_char * size))
      self.py_buffer.seek(position)
      self.py_buffer.readinto(buffer_ptr.contents)
      return 1  # non-zero return code for success

# (Assuming py_buffer is a Python file buffer, e. g. io.BufferedReader)
# Get the length of the buffer
py_buffer.seek(0, os.SEEK_END)
file_len = py_buffer.tell()
py_buffer.seek(0)

# Set up an interface structure for custom file access
fileaccess = pdfium_c.FPDF_FILEACCESS()
fileaccess.m_FileLen = file_len

# Assign the callback, wrapped in its CFUNCTYPE
fileaccess.m_GetBlock = type(fileaccess.m_GetBlock)( _reader_class(py_buffer) )

# Finally, load the document
pdf = pdfium_c.FPDF_LoadCustomDocument(fileaccess, None)
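
To see the callback mechanism in isolation, the reader can be wrapped in a manually declared CFUNCTYPE and invoked directly. In the sketch below, `GetBlockProto` is an assumption modelled on the parameters of the callback above, and `BufferReader` is a hypothetical variant that reports failure on a short read rather than always returning 1.

```python
import ctypes
import io

# Assumed signature:  int get_block(void *param, unsigned long position,
#                                   unsigned char *buf, unsigned long size)
GetBlockProto = ctypes.CFUNCTYPE(
    ctypes.c_int,
    ctypes.c_void_p,
    ctypes.c_ulong,
    ctypes.POINTER(ctypes.c_ubyte),
    ctypes.c_ulong,
)

class BufferReader:
    """Callable object holding a reference to a Python buffer."""

    def __init__(self, py_buffer):
        self.py_buffer = py_buffer

    def __call__(self, _, position, p_buf, size):
        target = ctypes.cast(p_buf, ctypes.POINTER(ctypes.c_char * size)).contents
        self.py_buffer.seek(position)
        n_read = self.py_buffer.readinto(target)
        return 1 if n_read == size else 0  # zero signals failure

source = io.BytesIO(b"%PDF-1.7 example data")
get_block = GetBlockProto(BufferReader(source))

# Call the wrapped function the way C code would: request 4 bytes at offset 1
chunk = (ctypes.c_ubyte * 4)()
ok = get_block(None, 1, chunk, 4)
data = bytes(chunk)
```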

When using the raw API, special care needs to be taken regarding object lifetime, considering that Python may garbage collect objects as soon as their reference count reaches zero. However, the interpreter has no way of magically knowing how long the underlying resources of a Python object might still be needed on the C side, so measures need to be taken to keep such objects referenced until PDFium does not depend on them anymore.

If resources need to remain valid after the time of a function call, PDFium docs usually indicate this clearly. Ignoring requirements on object lifetime will lead to memory corruption (commonly resulting in a segfault).

For instance, the docs on FPDF_LoadCustomDocument() state that

The application must keep the file resources |pFileAccess| points to valid until the returned FPDF_DOCUMENT is closed. |pFileAccess| itself does not need to outlive the FPDF_DOCUMENT.

This means that the callback function and the Python buffer need to be kept alive as long as the FPDF_DOCUMENT is used. This can be achieved by referencing these objects in an accompanying class, e. g.
```
class PdfDataHolder:
    
    def __init__(self, buffer, function):
        self.buffer = buffer
        self.function = function
    
    def close(self):
        # Make sure both objects remain available until this function is called
        # No-op id() call to denote that the object needs to stay in memory up to this point
        id(self.function)
        self.buffer.close()

# ... set up an FPDF_FILEACCESS structure

# (Assuming `py_buffer` is the buffer and `fileaccess` the FPDF_FILEACCESS interface)
data_holder = PdfDataHolder(py_buffer, fileaccess.m_GetBlock)
pdf = pdfium_c.FPDF_LoadCustomDocument(fileaccess, None)

# ... work with the pdf

# Close the PDF to free resources
pdfium_c.FPDF_CloseDocument(pdf)
# Close the data holder, to keep the object itself and thereby the objects it
# references alive up to this point, as well as to release the buffer
data_holder.close()
```
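
The effect of such a holder can be observed with a weak reference. This sketch is independent of PDFium; `Resource` and `KeepAlive` are hypothetical names, and the prompt release after close() relies on CPython's reference counting.

```python
import gc
import weakref

class Resource:
    """Stand-in for a Python object that C code would still depend on."""

class KeepAlive:
    """Hold references to objects until close() is called."""

    def __init__(self, *objects):
        self._objects = list(objects)

    def close(self):
        self._objects.clear()

resource = Resource()
probe = weakref.ref(resource)  # lets us observe the object without keeping it alive
holder = KeepAlive(resource)

del resource  # the holder still references the object
alive_while_held = probe() is not None

holder.close()  # the last reference is dropped here
gc.collect()
alive_after_close = probe() is not None
```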

Finally, let's finish with an example of how to render the first page of a document to a PIL image in RGBA color format.

import math
import ctypes
import os.path
import PIL.Image
import pypdfium2.raw as pdfium_c

# Load the document
filepath = os.path.abspath("tests/resources/render.pdf")
pdf = pdfium_c.FPDF_LoadDocument((filepath+"\x00").encode("utf-8"), None)

# Check page count to make sure it was loaded correctly
page_count = pdfium_c.FPDF_GetPageCount(pdf)
assert page_count >= 1

# Load the first page and get its dimensions
page = pdfium_c.FPDF_LoadPage(pdf, 0)
width  = math.ceil(pdfium_c.FPDF_GetPageWidthF(page))
height = math.ceil(pdfium_c.FPDF_GetPageHeightF(page))

# Create a bitmap
# (Note, pdfium is faster at rendering transparency if we use BGRA rather than BGRx)
use_alpha = pdfium_c.FPDFPage_HasTransparency(page)
bitmap = pdfium_c.FPDFBitmap_Create(width, height, int(use_alpha))
# Fill the whole bitmap with a white background
# The color is given as a 32-bit integer in ARGB format (8 bits per channel)
pdfium_c.FPDFBitmap_FillRect(bitmap, 0, 0, width, height, 0xFFFFFFFF)

# Store common rendering arguments
render_args = (
    bitmap,  # the bitmap
    page,    # the page
    # positions and sizes are to be given in pixels and may exceed the bitmap
    0,       # left start position
    0,       # top start position
    width,   # horizontal size
    height,  # vertical size
    0,       # rotation (as constant, not in degrees!)
    pdfium_c.FPDF_LCD_TEXT | pdfium_c.FPDF_ANNOT,  # rendering flags, combined with binary or
)

# Render the page
pdfium_c.FPDF_RenderPageBitmap(*render_args)

# Get a pointer to the first item of the buffer
buffer_ptr = pdfium_c.FPDFBitmap_GetBuffer(bitmap)
# Re-interpret the pointer to encompass the whole buffer
buffer_ptr = ctypes.cast(buffer_ptr, ctypes.POINTER(ctypes.c_ubyte * (width * height * 4)))

# Create a PIL image from the buffer contents
img = PIL.Image.frombuffer("RGBA", (width, height), buffer_ptr.contents, "raw", "BGRA", 0, 1)
# Save it as file
img.save("out.png")

# Free resources
pdfium_c.FPDFBitmap_Destroy(bitmap)
pdfium_c.FPDF_ClosePage(page)
pdfium_c.FPDF_CloseDocument(pdf)

Command-line Interface

pypdfium2 also ships with a simple command-line interface, providing access to key features of the support model in a shell environment (e. g. rendering, content extraction, document inspection, page rearranging, ...).

The primary motivation behind this is to have a nice testing interface, but it may be helpful in a variety of other situations as well. Usage should be largely self-explanatory, assuming a minimum of familiarity with the command-line.

Licensing

PDFium and pypdfium2 are available under the terms and conditions of either Apache-2.0 or BSD-3-Clause, at your choice. Various other open-source licenses apply to dependencies bundled with PDFium. Verbatim copies of their respective licenses are contained in the file LicenseRef-PdfiumThirdParty.txt, which also has to be shipped with binary redistributions. Documentation and examples of pypdfium2 are licensed under CC-BY-4.0.

pypdfium2 complies with the reuse standard by including SPDX headers in source files, and license information for data files in .reuse/dep5.

To the author's knowledge, pypdfium2 is one of the rare Python libraries that are capable of PDF rendering while not being covered by copyleft licenses (such as the GPL).⁷

As of early 2023, a single developer is author and rightsholder of the code base (apart from a few minor code contributions).

Issues

While using pypdfium2, you might encounter bugs or missing features. In this case, feel free to open an issue or discussion thread. If applicable, include details such as tracebacks, OS and CPU type, as well as the versions of pypdfium2 and used dependencies. However, please note our response policy.

Roadmap:

pypdfium2
- Issues panel: Initial bug reports and feature requests. May need to be transferred to dependencies.
- Discussions page: General questions and suggestions.
PDFium
- Bug tracker: Issues in PDFium. Beware: The bridge between Python and C increases the probability of integration issues or API misuse. The symptoms can often make it look like a PDFium bug while it is not.
- Mailing list: Questions regarding PDFium usage.
pdfium-binaries: Binary builder.
ctypesgen: Bindings generator.

Known limitations

Incompatibility with CPython 3.7.6 and 3.8.1

pypdfium2 built with mainstream ctypesgen cannot be used with releases 3.7.6 and 3.8.1 of the CPython interpreter due to a regression that broke ctypesgen-created string handling code.

Since version 4, pypdfium2 is built with a patched fork of ctypesgen that removes ctypesgen's problematic string code.

Risk of unknown object lifetime violations

As outlined in the raw API section, it is essential that Python-managed resources remain available as long as they are needed by PDFium.

The problem is that the Python interpreter may garbage collect objects with reference count zero at any time, so an unreferenced but still required object may either by chance stay around long enough or disappear too soon, resulting in non-deterministic memory issues that are hard to debug. If the timeframe between reaching reference count zero and removal is sufficiently large and roughly consistent across different runs, it is even possible that mistakes regarding object lifetime remain unnoticed for a long time.

Although we intend to develop helpers carefully, it cannot be fully excluded that unknown object lifetime violations are still lurking around somewhere, especially if unexpected requirements were not documented by the time the code was written.

Missing raw PDF access

As of this writing, PDFium's public interface does not provide access to the raw PDF data structure (see issue 1694). It does not expose APIs to read/write PDF dictionaries, streams, name/number trees, etc. Instead, it merely offers a predefined set of abstracted functions. This considerably limits the library's potential, compared to other products such as pikepdf.

Limitations of ABI bindings

PDFium's non-public backend would provide extended capabilities, including raw access, but it is not exported into the ABI and written in C++ (not pure C), so we cannot use it with ctypes. This means it's out of scope for this project.

Also, while ABI bindings tend to be more convenient, they have some technical drawbacks compared to API bindings (see e.g. 1, 2)

Development

Contributions

We may accept contributions, but only if our code quality expectations are met.

Policy:

We may not respond to your issue or PR.
We may close an issue or PR without much feedback.
We may lock discussions or contributions if our attention is getting DDOSed.
We may not provide much usage support.

Long lines

The pypdfium2 codebase does not hard wrap long lines. It is recommended to set up automatic word wrap in your text editor, e.g. VS Code:

editor.wordWrap = bounded
editor.wordWrapColumn = 100

Docs

pypdfium2 provides API documentation using Sphinx, which can be rendered to various formats, including HTML:

sphinx-build -b html ./docs/source ./docs/build/html/
./run build  # short alias

Built docs are primarily hosted on readthedocs.org. The hosting may be configured using a .readthedocs.yaml file (see instructions) and the administration page on the web interface. RTD supports hosting multiple versions, so we currently have one linked to the main branch and another to stable. New builds are automatically triggered by a webhook whenever you push to a linked branch.

Additionally, one doc build can also be hosted on GitHub Pages. It is implemented with a CI workflow, which is supposed to be triggered automatically on release. This provides us with full control over the build env and the used commands, whereas RTD may be less liberal in this regard.

Testing

pypdfium2 contains a small test suite to verify the library's functionality. It is written with pytest:

./run test

Note that ...

you can pass -sv to get more detailed output.
$DEBUG_AUTOCLOSE=1 may be set to get debugging information on automatic object finalization.

To get code coverage statistics, you may call

./run coverage

Sometimes, it can also be helpful to test code on many PDFs.⁸ In this case, the command-line interface and find come in handy:

# Example A: Analyse PDF images (in the current working directory)
find . -name '*.pdf' -exec bash -c "echo \"{}\" && pypdfium2 pageobjects \"{}\" --filter image" \;
# Example B: Parse PDF table of contents
find . -name '*.pdf' -exec bash -c "echo \"{}\" && pypdfium2 toc \"{}\"" \;

Release workflow

The release process is fully automated using Python scripts and scheduled release workflows. You may also trigger the workflow manually using the GitHub Actions panel or the gh command-line tool.

Python release scripts are located in the folder setupsrc/pypdfium2_setup, along with custom setup code:

update_pdfium.py downloads binaries.
craft_packages.py pypi builds platform-specific wheel packages and a source distribution suitable for PyPI upload.
autorelease.py takes care of versioning, changelog, release note generation and VCS checkin.

The autorelease script has some peculiarities maintainers should know about:

The changelog for the next release shall be written into docs/devel/changelog_staging.md. On release, it will be moved into the main changelog under docs/source/changelog.md, annotated with the PDFium version update. It will also be shown on the GitHub release page.
pypdfium2 versioning uses the pattern major.minor.patch, optionally with an appended beta mark (e. g. 2.7.1, 2.11.0, 3.0.0b1, ...). Version changes are based on the following logic:
- If PDFium was updated, the minor version is incremented.
- If only pypdfium2 code was updated, the patch version is incremented instead.
- Major updates and beta marks are controlled via autorelease/config.json. If major is true, the major version is incremented. If beta is true, a new beta tag is set, or an existing one is incremented. The control file is automatically reset when the versioning is finished.
- If switching from a beta release to a non-beta release, only the beta mark is removed while minor and patch versions remain unchanged.
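
The rules above can be condensed into a small function. This is only a sketch, not the code of autorelease.py: `Version`, `next_version` and the `want_major`/`want_beta` flags are hypothetical names, and resetting lower components to zero on a minor or major bump is an assumption that the text does not state.

```python
from typing import NamedTuple, Optional

class Version(NamedTuple):
    major: int
    minor: int
    patch: int
    beta: Optional[int] = None

    def __str__(self):
        base = f"{self.major}.{self.minor}.{self.patch}"
        return base if self.beta is None else f"{base}b{self.beta}"

def next_version(cur, *, pdfium_updated, want_major=False, want_beta=False):
    if cur.beta is not None:
        if want_beta:
            # An existing beta mark is incremented.
            return cur._replace(beta=cur.beta + 1)
        # Beta -> non-beta: only the beta mark is removed.
        return cur._replace(beta=None)
    if want_major:
        new = Version(cur.major + 1, 0, 0)  # assumption: minor/patch reset
    elif pdfium_updated:
        new = Version(cur.major, cur.minor + 1, 0)  # assumption: patch reset
    else:
        new = Version(cur.major, cur.minor, cur.patch + 1)
    # A new beta mark starts at 1.
    return new._replace(beta=1) if want_beta else new

print(next_version(Version(2, 7, 1), pdfium_updated=True))
print(next_version(Version(2, 11, 0), pdfium_updated=False, want_major=True, want_beta=True))
```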

In case of necessity, you may also forego autorelease/CI and do the release manually, which will roughly work like this (though ideally it should never be needed):

Commit changes to the version file

git add src/pypdfium2/version.py
git commit -m "increment version"
git push

Create a new tag that matches the version file

# substitute $VERSION accordingly
git tag -a $VERSION
git push --tags

Build the packages

python setupsrc/pypdfium2_setup/update_pdfium.py
python setupsrc/pypdfium2_setup/craft_packages.py pypi

Upload to PyPI

# make sure the packages are valid
twine check dist/*
# upload to PyPI (this will interactively ask for your username/password)
twine upload dist/*

Update the stable branch to trigger a documentation rebuild

git checkout stable
git rebase origin/main  # alternatively: git reset --hard main
git checkout main

If something went wrong with commit or tag, you can still revert the changes:

# perform an interactive rebase to change history (substitute $N_COMMITS with the number of commits to drop or modify)
git rebase -i HEAD~$N_COMMITS
git push --force
# delete local tag (substitute $TAGNAME accordingly)
git tag -d $TAGNAME
# delete remote tag
git push --delete origin $TAGNAME

Faulty PyPI releases may be yanked using the web interface.

Prominent Embedders

pypdfium2 is used by prominent embedders such as langchain, nougat, pdfplumber, and doctr.

This results in pypdfium2 being part of a large dependency tree.

Thanks to⁹

Benoît Blanchon: Author of PDFium binaries and patches.
Anderson Bravalheri: Help with PEP 517/518 compliance. Hint to use an environment variable rather than separate setup files.
Bastian Germann: Help with inclusion of licenses for third-party components of PDFium.
Tim Head: Original idea for Python bindings to PDFium with ctypesgen in wowpng.
Yinlin Hu: pypdfium prototype and kuafu PDF viewer.
Adam Huganir: Help with maintenance and development decisions since the beginning of the project.
kobaltcore: Bug fix for PdfDocument.save().
Mike Kroutikov: Examples on how to use PDFium with ctypes in redstork and pdfbrain.

... and further code contributors (GitHub stats).

If you have somehow contributed to this project but we forgot to mention you here, please let us know.

History

PDFium

The PDFium code base was originally developed as part of the commercial Foxit SDK, before being acquired and open-sourced by Google, who have maintained PDFium independently ever since, while Foxit continue to develop their SDK closed-source.

pypdfium2

pypdfium2 is the successor of pypdfium and pypdfium-reboot.

Inspired by wowpng, the first known proof of concept Python binding to PDFium using ctypesgen, the initial pypdfium package was created. It had to be updated manually, which did not happen frequently. There were no platform-specific wheels, but only a single wheel that contained binaries for 64-bit Linux, Windows and macOS.

pypdfium-reboot then added a script to automate binary deployment and bindings generation to simplify regular updates. However, it was still not platform specific.

pypdfium2 is a full rewrite of pypdfium-reboot to build platform-specific wheels and consolidate the setup scripts. Further additions include ...

A CI workflow to automatically release new wheels every Tuesday
Support models that conveniently wrap the raw PDFium/ctypes API
Test code
A script to build PDFium from source

This means pdfium may not compile on arbitrary hosts. The script is limited to build hosts supported by Google's toolchain. Ideally, we'd need an alternative build system that runs with system packages instead. ↩
Intended for packaging, so that wheels can be crafted for any platform without access to a native host. ↩
Unfortunately, no recent HTML-rendered docs are available for PDFium at the moment. ↩
From the auto-generated bindings file. We maintain a reference copy at autorelease/bindings.py. Or if you have an editable install, there will also be src/pypdfium2_raw/bindings.py. ↩
Confer the ctypes documentation on Pointers. ↩
e. g. incremental read/write, management of progressive tasks, ... ↩
The only other liberal-licensed PDF rendering libraries known to the author are pdf.js (JavaScript) and Apache PDFBox (Java), but python bindings packages don't exist yet or are unsatisfactory. However, we wrote some gists that show it'd be possible in principle: pdfbox (+ setup), pdfjs. ↩
For instance, one could use the testing corpora of open-source PDF libraries (pdfium, pikepdf/ocrmypdf, mupdf/ghostscript, tika/pdfbox, pdfjs, ...) ↩
People listed in this section may not necessarily have contributed any copyrightable code to the repository. Some have rather helped with ideas, or contributions to dependencies of pypdfium2. ↩

pypdfium2's People

Contributors

declark1 mlove4u trellixvulnteam lewistrick jmpfar mara004 russellwmy zhangshuangjun milahu nh2 dgollings ganymede-bio

pypdfium2's Issues

docs: add information concerning the `render_pdf()` method

I have already shown usage of render_pdf() in a StackOverflow comment, but it should yet be added to our docs.

tests: add test for `render_pdf()` in `test_helpers.py`

Circular imports (fixed)

not sure if you had resolved this locally or not, but the main branch was having some circular import issues. instead of pulling the pdfium contents from the __init__, pulling them from _pypdfium will avoid this. I have it working in the unit tests branch I am working with if you need to use it @mara004

How to increase DPI for rendering?

how can we increase the dpi of images?

Rework patches to be operating system specific

Divide patches into generic, linux, darwin and windows groups, to only apply those patches that are relevant for the host system.

Add Pillow requirement in `setup.cfg`

That was an oversight, see #19 (comment)

Project layout and coding standards

I have a way of organizing my company's projects that works very well for coordinating development across multiple devs, and it is pretty PEP compliant. I can create a branch to show the layout and see if you are interested in adopting it. The basic tooling is:

poetry for virtualenvs, dependency management, and building/publishing
black, isort, and flake8 for consistent formatting
pytest for testing
sphinx for documentation
pre-commit to ensure the formatting, testing, and documentation are enforced for contributors

Any or all of these can really benefit projects I think.

Thoughts?

Investigate whether the `if bitmap is not None` is necessary

sourcebuild: allow for syncing existing repositories

If we are building from source and the repositories of DepotTools and PDFium are already present from previous runs, the current behaviour is to do nothing and re-use the existing repository as-is.
It would make sense if our build script could automate syncing/updating the repository. However, I think this should be opt-in to allow for building a custom version of PDFium, and to prevent inadvertent loss of unstaged changes.

The update strategy for the PDFium repository could either be done manually:

# clean up uncommitted changes (i. e. patches from the previous build)
run_cmd("git reset --hard HEAD", cwd=PDFiumDir)

# the `build/` directory contains an independent git repository, so reset that, too
run_cmd("git reset --hard HEAD", cwd=join(PDFiumDir,'build'))

# remove the `resources.rc` file that is not affected by `git reset`
os.remove(join(PDFiumDir,'resources.rc'))

# sync
run_cmd("gclient sync --no-history --shallow", cwd=WorkDir)

Or probably a lot easier with gclient:

run_cmd("gclient revert", cwd=WorkDir)
run_cmd("gclient sync", cwd=WorkDir)  # not sure whether revert already includes a sync

Should we add a support model to the raw PDFium bindings?

I was thinking about whether we should perhaps add a support model (i. e. custom objects and functions) on top of PDFium to make usage a bit more comfortable and more 'pythonic' than just the raw C-style interface.

First of all, there is the strange, but required FPDF_InitLibrary() call. Should we automatically do this in our __init__.py file?

Secondly, rendering a page is a bit laborious in terms of code, due to the necessity of ctypes casting and so on. I was thinking about whether we could add a simplified function render_page() that takes a PDFium document, a page index, rotation, scaling, and background colour parameters that returns the rasterised page as PIL image.

Another thing that annoys me is that one must always pass all arguments, and that it's not possible to use keyword arguments. A way to change this would be to create more custom objects and methods on top of PDFium, but this would be kind of a maintenance burden, and it might be less powerful than accessing PDFium directly.

Replace deprecated `distutils.util.get_platform()`

I'm not aware of an equivalent function in setuptools, but the desired result could probably be achieved with the platform module. Maybe something like this:

f"{platform.system()}_{platform.machine()}".replace('-','_').replace('.','_').lower()

Feedback wanted: incoming changes in pdfium-binaries

Hi,

I'm the creator of pdfium-binaries, which hosts pre-built binaries of PDFium that this project uses.

You probably noticed that the Windows packages are slightly different from the other packages because the import library is not in lib/, but in lib/x86/, lib/x64/ or lib/arm64/. Similarly, the DLL is not in the bin/ but in a subfolder.
I'm currently considering removing this separation so that every package has the same layout, and I'd like to get your feedback before making the change.
I initially made this separation because it allowed me to put x86 and x64 in the same package, but I don't do that anymore, and the subfolders are causing me troubles with CMake where there is no reliable way to detect the target CPU architecture (bblanchon/pdfium-binaries#17).

Please let me know if changing the Windows packages layout is a problem for you and if you think this is going in the right direction or not.

Best regards,
Benoit

Support model for PDFium attachment API

PDFium provides a fancy set of functions to work with PDF attachments (fpdf_attachment.h), even including generic dictionary access. Would be nice if we could add a support model for it.

helpers: Move PDFium error handling into a method

Currently, our error handling code using FPDF_GetLastError() is inside PdfContext.__enter__(). As we might add more support model code, it would make sense to move said error handling into a method and adapt it accordingly, to share code.
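
A shared error handling method could look roughly like the following sketch. The numeric codes are the FPDF_ERR_* constants from PDFium's public header fpdfview.h; the names `PDFIUM_ERRORS`, `PdfiumError` and `raise_for_error` are hypothetical and only serve the illustration. The caller would pass in the value returned by FPDF_GetLastError().

```python
# Error codes as defined in PDFium's fpdfview.h (FPDF_ERR_SUCCESS is 0)
PDFIUM_ERRORS = {
    1: "Unknown error",                               # FPDF_ERR_UNKNOWN
    2: "File not found or could not be opened",       # FPDF_ERR_FILE
    3: "File not in PDF format or corrupted",         # FPDF_ERR_FORMAT
    4: "Password required or incorrect password",     # FPDF_ERR_PASSWORD
    5: "Unsupported security scheme",                 # FPDF_ERR_SECURITY
    6: "Page not found or content error",             # FPDF_ERR_PAGE
}

class PdfiumError(RuntimeError):
    """Hypothetical exception type for failed PDFium calls."""

def raise_for_error(err_code, context="PDFium call"):
    """Raise PdfiumError if err_code (from FPDF_GetLastError()) is non-zero."""
    if err_code == 0:
        return
    message = PDFIUM_ERRORS.get(err_code, f"Unrecognised error code {err_code}")
    raise PdfiumError(f"{context} failed: {message}")

try:
    raise_for_error(4, context="Loading document")
except PdfiumError as exc:
    print(exc)
```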

Add musllinux to setup

Upstream pdfium-binaries added build support for musllinux quite recently. We should thus integrate it into our setup code, as per add_platform.md. It might take a while until I have some time to work on this.

Test suite randomly segfaults

Probably caused by recent changes. Need to investigate.

Compatibility with legacy setuptools versions

On Raspbian 9 with Python 3.5, the older setuptools of the system was unable to install the wheel (claimed it would be incompatible). Upgrading setuptools fixed the issue, though.
It's a shot in the dark, but I think this might hint at legacy setuptools not supporting the newer manylinux version scheme defined in PEP 600, so perhaps we should use the older manylinux2014 tag instead of manylinux_2_17.

Investigate possibility of calling pyminify on generated bindings file

pyminify could be used to make the bindings file smaller:

Pros:

Reduces file size of _pypdfium.py from approximately 240 KiB to 140 KiB
Makes the workaround to prevent leaking absolute paths unnecessary

Cons:

Reduces readability
Additional risk that something could go wrong (especially considering that we currently have no tests)

pdfium binary not found in executable created by pyinstaller

When I ran the code, everything was fine.

But when I packaged the code as an EXE, it ran wrong.

D:\Develop\Support\Push.AI.OCR\dist>app.exe
[38472] WARNING: file already exists but should not: C:\Users\CHENYO~1\AppData\Local\Temp\_MEI384722\torch\_C.cp36-win_amd64.pyd
Traceback (most recent call last):
  File "app.py", line 6, in <module>
  File "PyInstaller\loader\pyimod03_importers.py", line 495, in exec_module
  File "apphelper\image.py", line 17, in <module>
  File "PyInstaller\loader\pyimod03_importers.py", line 495, in exec_module
  File "pypdfium2\__init__.py", line 6, in <module>
  File "PyInstaller\loader\pyimod03_importers.py", line 495, in exec_module
  File "pypdfium2\_namespace.py", line 6, in <module>
  File "PyInstaller\loader\pyimod03_importers.py", line 495, in exec_module
  File "pypdfium2\_pypdfium.py", line 863, in <module>
  File "pypdfium2\_pypdfium.py", line 551, in __call__
ImportError: Could not load pdfium.
[4836] Failed to execute script 'app' due to unhandled exception!

Support building PDFium on non-standard architectures

While testing sourcebuild on Linux armv7l, I noticed that the PDFium build process relies on several pre-built binaries from depot-tools that are only available for the most common architectures.
Therefore, we will need to adapt our build script to use the gn and ninja executables provided by the system, if available. On Debian, the relevant packages are named generate-ninja and ninja-build.

Incompatibility with python 3.8.1

Hello there 👋

Thank you for your wonderful work here 🙏 It's great to find good alternatives to PyMuPDF with a proper open-source license!

I figured I should share a problem that I had: the library (installed with pypi) doesn't work on Python 3.8.1. I have reproduced this in a clean environment using docker. Using the following Dockerfile

FROM python:3.8.1-slim

ENV PYTHONUNBUFFERED 1
ENV PYTHONDONTWRITEBYTECODE 1

RUN pip install --upgrade pip setuptools wheel \
    && pip install pypdfium2 \
    && pip cache purge \
    && rm -rf /root/.cache/pip

This command:

docker build . -t pypdfium2-py3.8.1-slim
docker run pypdfium2-py3.8.1-slim python -c "import pypdfium2"

yields:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/pypdfium2/__init__.py", line 7, in <module>
    from pypdfium2._namespace import *
  File "/usr/local/lib/python3.8/site-packages/pypdfium2/_namespace.py", line 6, in <module>
    from pypdfium2._pypdfium import *
  File "/usr/local/lib/python3.8/site-packages/pypdfium2/_pypdfium.py", line 1245, in <module>
    FPDF_LoadDocument.argtypes = [FPDF_STRING, FPDF_BYTESTRING]
TypeError: item 1 in _argtypes_ passes a union by value, which is unsupported.

I tried in 3.8.10 and 3.8.12, the problem doesn't arise with those later versions 👍

Improve wonky setup code

Make copying of platform-specific files more robust.
Replace .presetup-done.txt with version status files per platform folder.
In update_pdfium and build_pdfium, write status files and only change version.py in setup.py.

Why pdf.close must be called explicitly and not via a destructor ?

It is really annoying to keep opening and closing the PDF. Can't we put the close method in the destructor of the class named PdfPage?
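
For context, a common way to implement automatic closing in Python without relying on `__del__` is `weakref.finalize`. The following is a minimal sketch of that pattern, not pypdfium2's actual implementation; `AutoCloseHandle` and `close_func` are hypothetical stand-ins for a helper class and a PDFium close function.

```python
import weakref

class AutoCloseHandle:
    """Hypothetical wrapper that closes its raw handle at most once."""

    def __init__(self, raw, close_func):
        self.raw = raw
        # The finalizer must not reference `self`, otherwise the object
        # could never become unreachable. It only captures `raw`.
        self._finalizer = weakref.finalize(self, close_func, raw)

    def close(self):
        # Calling the finalizer explicitly runs close_func(raw) once;
        # later calls (and garbage collection) become no-ops.
        self._finalizer()

closed = []
handle = AutoCloseHandle("page-1", closed.append)
handle.close()
handle.close()  # no effect, already closed
print(closed)
```

Note that with PDFium, child objects such as pages would presumably have to be closed before their parent document, an ordering constraint this sketch does not handle.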

Create a Makefile to facilitate various shell scripts into one interface

We could use this to run build/release/update all from the make command line, e.g. make build, make clean, make update-bins, etc., which would clean up the root repository folder a bit.

Thoughts? @mara004

`get_toc()`: circular references are not prohibited, causing the risk of infinite loops

PDFium does not check for circular references, and there appeared to be no obvious way to do this in Python, because ctypes does not have original object return.
See https://bugs.chromium.org/p/pdfium/issues/detail?id=1759 for details and discussion with upstream developers.
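
One conceivable mitigation is to track the memory addresses of the handles already visited and abort when an address repeats. The sketch below demonstrates the idea on a plain ctypes linked list; `Node` and `walk` are hypothetical, and whether bookmark handles returned by PDFium can reliably be compared by address in this way is an assumption that would need to be verified.

```python
import ctypes

class Node(ctypes.Structure):
    pass

# Self-referential struct: each node points to the next one (or NULL).
Node._fields_ = [("next", ctypes.POINTER(Node))]

def walk(start_ptr):
    """Yield node addresses, raising if the same address is seen twice."""
    seen = set()
    ptr = start_ptr
    while ptr:  # NULL pointers are falsy
        # ctypes returns a new Python object on each access, so `is`
        # comparisons are useless; the underlying address is stable.
        addr = ctypes.addressof(ptr.contents)
        if addr in seen:
            raise RuntimeError("circular reference detected")
        seen.add(addr)
        yield addr
        ptr = ptr.contents.next

a, b = Node(), Node()
a.next = ctypes.pointer(b)
print(len(list(walk(ctypes.pointer(a)))))  # terminates: b.next is NULL

b.next = ctypes.pointer(a)  # introduce a cycle
try:
    list(walk(ctypes.pointer(a)))
except RuntimeError as exc:
    print(exc)
```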

Page deletion has no effect when saving the PDF

I've been trying to move to pdfium from mupdf -largely due to the licensing issues with the latter- and in the process of porting over some of my code I've run into a strange issue in regards to deleting pages from PDF documents.
As far as I can tell the utility classes and methods do not offer a helping hand with this, so I used the low-level API instead.
It could very well be that I'm using this wrong, but to me it appears that while something is being deleted, it has no bearing on the output document.
Is there a specific magic incantation I have to produce to get this to work in addition to the example below?

Minimal Example

import pypdfium2 as pdfium

# load in the test document (any PDF with multiple pages will do)
doc = pdfium.PdfDocument("in.pdf")
# page count is correct
print("   before page deletion", pdfium.FPDF_GetPageCount(doc.raw))

# delete the first page
pdfium.FPDFPage_Delete(doc.raw, 0)
# page count is correct, one less than before
print("    after page deletion", pdfium.FPDF_GetPageCount(doc.raw))

# save the document
with open("out.pdf", "wb") as f:
    pdfium.save_pdf(doc.raw, f)

# load the saved document
doc = pdfium.PdfDocument("out.pdf")

# page count is incorrect, same as the original document
print("loading edited document", pdfium.FPDF_GetPageCount(doc.raw))

In general I would be interested in a couple more examples of working with this API, especially in the realm of moving about and modifying pages, i.e. deleting them, adding new pages from other PDFs (which is somewhat covered by the PDF merge example) as well as creating new pages from an image file.

Make use of `FPDF_GetLastError()` for improved exception messages

https://developers.foxit.com/resources/pdf-sdk/c_api_reference_pdfium/group___f_p_d_f_i_u_m.html#gae825d36f23e023757bb127fa82b01454

https://developers.foxit.com/resources/pdf-sdk/c_api_reference_pdfium/group___f_p_d_f_i_u_m.html#gad30f914714111f9a50f812ebc1fd4bca

Source installation trouble when using older version of pip

From testing a source installation of pypdfium2 on a relatively fresh Ubuntu 20.04, it turns out that there are rather serious compatibility issues with older versions of pip. Apparently, these partly do the packaging work sandboxed in some kind of temporary directory, which breaks assumptions regarding build file paths. For instance, ctypesgen is cloned into the temporary directory, but later our code looks for it in the actual source tree. This can lead to fairly obscure errors.

A slightly newer (but not the latest) version of pip showed this warning that helped trace the problem:

  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.

Commit 325145c attempts to address the problem, but it does not quite work like this yet: If a source installation was invoked with pip3 install ., then even if we update pip, this run will still be done with the old version, which will fail.

I currently do not see a good way to automatically get around this. I'll update the docs to inform users about the need for a recent version of pip, though.

Ctypesgen bindings generation fails on Windows

At the moment, ctypesgen appears to be unable to create the bindings file on Windows (see ctypesgen/ctypesgen#138).
This means that Windows users cannot easily install pypdfium2 from git main, unfortunately. (Using the release wheels on PyPI will work, of course. This issue only applies if you want to install the latest state of the repository from source.)

In the meantime, I have attached the current bindings file so that users may inject it manually: _pypdfium.py.txt
Then it should be possible to create and install a wheel somewhat like this:

@echo off

REM Prerequisite: You are in the `pypdfium/` directory

REM Define the target platform (in this case, Windows 64-bit amd/intel)
set PYP_TARGET_PLATFORM="windows_x64"

REM Download binary
python3 platform_setup\update_pdfium.py -p %PYP_TARGET_PLATFORM%

REM Now you need to manually copy the attached bindings file into `data/%PYP_TARGET_PLATFORM%`
REM (Remember to remove the `.txt` extension which was only for the upload to GitHub)

REM Build the wheel, according to the target platform environment variable
python3 -m build -n -x --wheel

REM Finally, install it (replace `WHEELNAME_HERE.whl` with the corresponding file name)
python3 -m pip install dist/WHEELNAME_HERE.whl

REM Optionally, run the test suite
python3 -m pytest tests/

Restructure automatic dependency installation

Call getdeps from build_pdfium rather than setup so we know whether it's a native build or not. If it is, we will need more system dependencies.
In addition, if we are on Linux, we could consider attempting to install missing system packages and prompt the user for the password (using sudo or pkexec). The issue is that I don't know a cross-distribution way to install packages, though -- each major distribution has its own package manager and naming conventions...

Do we need `ctypes.byref()`?

Check whether we need to use ctypes.byref() sometimes or if ctypes autoconversion handles this for us implicitly.
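
According to the ctypes documentation, if a parameter is declared as a pointer type in argtypes, an instance of the pointed-to type may be passed directly and ctypes applies byref() automatically. The sketch below illustrates this with a Python callback standing in for a C function that takes an `int*` output parameter; `PROTO` and `write_answer` are hypothetical names. Whether all generated PDFium bindings declare argtypes in a way that allows this would still need checking.

```python
import ctypes

# Prototype of a C-style function: void f(int *out)
PROTO = ctypes.CFUNCTYPE(None, ctypes.POINTER(ctypes.c_int))

@PROTO
def write_answer(out_ptr):
    # Write through the pointer, like a C function with an output parameter.
    out_ptr[0] = 42

a, b, c = ctypes.c_int(0), ctypes.c_int(0), ctypes.c_int(0)
write_answer(a)                  # implicit byref() conversion via argtypes
write_answer(ctypes.byref(b))    # explicit lightweight reference
write_answer(ctypes.pointer(c))  # full pointer object
print(a.value, b.value, c.value)
```

Without declared argtypes, the implicit conversion is not available, so an explicit byref() or pointer() would be required.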

Expanded examples of working with the library beyond rendering to images?

I deal with a lot of PDFs of varying quality, and from what I can see PDFium/pypdfium2 seems to handle PDFs that other open-source renderers (ghostscript, poppler, etc.) cannot. I've been looking at the old Foxit SDK docs and it appears that (assuming it wasn't removed from the project when Google purchased it) it's possible to use PDFium to do things like import pages from one or more PDFs into a new PDF (PDF merging) and import images into a new PDF. Are you able to confirm whether that is possible with pypdfium2, and would you be interested in receiving a bounty to provide examples in pypdfium2? I'd much rather contribute funds to improve your project than seek out a closed-source/commercial solution.

Allow avoiding use of ProcessPoolExecutor in render_topil

First off, thank you so much for building out this library - love that we're able to use pdfium from within our Python codebase.

In our setup, we primarily use Celery to run asynchronous tasks. We use Celery prefork workers which use the multiprocessing library underneath the hood. This conflicts with ProcessPoolExecutor raising this error:

Traceback (most recent call last):
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/app-RnXbbLtS-py3.8/lib/python3.8/site-packages/celery/app/trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/app-RnXbbLtS-py3.8/lib/python3.8/site-packages/celery/app/trace.py", line 734, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/ubuntu/app/core/tasks.py", line 453, in cache_page_images_pypdf
    for page_num, image_bytes in pypdf.get_page_images(
  File "/home/ubuntu/app/core/pypdf.py", line 50, in get_page_images
    for page_number, image in zip(page_numbers, renderer):
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/viaduct-RnXbbLtS-py3.8/lib/python3.8/site-packages/pypdfium2/_helpers/document.py", line 453, in render_topil
    yield from self._render_base("pil", **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/viaduct-RnXbbLtS-py3.8/lib/python3.8/site-packages/pypdfium2/_helpers/document.py", line 425, in _render_base
    for result, index in pool.map(invoke_renderer, page_indices):
  File "/home/ubuntu/.pyenv/versions/3.8.12/lib/python3.8/concurrent/futures/process.py", line 674, in map
    results = super().map(partial(_process_chunk, fn),
  File "/home/ubuntu/.pyenv/versions/3.8.12/lib/python3.8/concurrent/futures/_base.py", line 608, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "/home/ubuntu/.pyenv/versions/3.8.12/lib/python3.8/concurrent/futures/_base.py", line 608, in <listcomp>
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "/home/ubuntu/.pyenv/versions/3.8.12/lib/python3.8/concurrent/futures/process.py", line 645, in submit
    self._start_queue_management_thread()
  File "/home/ubuntu/.pyenv/versions/3.8.12/lib/python3.8/concurrent/futures/process.py", line 584, in _start_queue_management_thread
    self._adjust_process_count()
  File "/home/ubuntu/.pyenv/versions/3.8.12/lib/python3.8/concurrent/futures/process.py", line 608, in _adjust_process_count
    p.start()
  File "/home/ubuntu/.pyenv/versions/3.8.12/lib/python3.8/multiprocessing/process.py", line 118, in start
    assert not _current_process._config.get('daemon'), \
AssertionError: daemonic processes are not allowed to have children

We've worked around the issue by monkey patching the ProcessPoolExecutor with this:

import pypdfium2._helpers.document as pdfium_helpers_document

class SerialProcessPoolExecutor:
    def __init__(*args, **kwargs):
        pass

    def map(self, func, args):
        for arg in args:
            result = func(arg)
            yield result

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        pass


# HACK: monkey-patch ProcessPoolExecutor
pdfium_helpers_document.ProcessPoolExecutor = SerialProcessPoolExecutor

It would be great if the library supported this execution mode where multiprocessing / ProcessPoolExecutor is optional.
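
A minimal sketch of what such an optional execution mode might look like, without monkey-patching: the function falls back to a plain serial loop unless more than one process is requested. `map_pages` and `n_processes` are hypothetical names, not part of pypdfium2's API.

```python
from concurrent.futures import ProcessPoolExecutor

def map_pages(func, page_indices, n_processes=1):
    """Apply func to each page index, serially unless n_processes > 1."""
    if n_processes <= 1:
        # Serial mode: no child processes are spawned, so this also works
        # inside daemonic workers such as Celery prefork processes.
        yield from map(func, page_indices)
        return
    # Parallel mode: func and its arguments must be picklable.
    with ProcessPoolExecutor(n_processes) as pool:
        yield from pool.map(func, page_indices)

# Serial usage with a stand-in for the page renderer
print(list(map_pages(lambda i: f"page {i}", range(3))))
```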

Platform support and testing status

Platforms we build wheels for and their current testing status:

macOS x64:
Tested, works (10.11.6 El Capitan / 10.15.7 Catalina / 11.6.1 Big Sur)
macOS arm64/M1:
Tested, works (11.6 Big Sur)
Linux x64 glibc:
Tested, works (Ubuntu 20.04, Python 3.8)
Linux x86/i686 glibc:
Untested
Linux aarch64/arm64 glibc:
Tested, works (via emulation in GitHub workflow)
Linux armv7l/armhf glibc:
Tested, works (Debian 9, Raspberry Pi 2, Python 3.9)
Linux x64 musl:
Untested
Linux x86/i686 musl:
Untested
Windows x64:
Tested, works (Windows 8.1)
Windows arm64:
Untested
Windows x86 (i. e. 32-bit):
Untested

Summary: At the moment, 6 of 11 wheels are confirmed to work.
pypdfium2 runs not only with CPython, but also with PyPy.

`platform_setup` imports not working when scripts are invoked directly?

Whilst testing on a different device, I encountered a behaviour that I had not experienced on my main workplace yet:
When the scripts are invoked directly, the imports from platform_setup fail. Actually this is quite logical, as the top-level directory is not in PYTHONPATH or PATH. I'll have to investigate why it works on the other device.
(We are already using importlib in setup.py to load platform_setup manually, so this part should be unaffected.)

`FPDF_LoadDocument()` fails with non-ascii filenames on Windows

pypdfium2.FPDF_LoadDocument(file_path, password) fails to open documents from filenames that contain non-ascii characters.
Confirmed on Windows 10 and 8.1; not reproducible on Linux Ubuntu 20.04.

It would be nice if a PDFium C user could check whether this is a bug in PDFium itself, otherwise it will be an issue with the bindings generator ctypesgen.

As a workaround, FPDF_LoadMemDocument() can be used instead:

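# read the whole file into memory; the context manager closes it afterwards
with open(file_name, 'rb') as file_handler:
    file_bytes = file_handler.read()
# file_bytes must stay referenced for as long as the document is in use
pdf = FPDF_LoadMemDocument(file_bytes, len(file_bytes), password)
# ...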

`FPDF_LoadMemDocument()` not working correctly?

While working on #57 / #58, I first tried to use FPDF_LoadMemDocument() for non-ascii filepaths on Windows.
For testing, I temporarily edited open_pdf() to read all files into memory and pass the data to FPDF_LoadMemDocument().
Then I ran the test suite, which showed issues with rendering: Several failed pixel assertions. From looking at the files in tests/output, I noticed that a lot of content was missing, especially texts.
This is weird, as one would expect all loading strategies to yield exactly the same output, and I think it did work correctly with previous PDFium versions, so upstream might have recently introduced a bug.

Avoid uninitialised struct fields

I've been informed by PDFium team that passing structs with uninitialised fields should be avoided in general.

Handling high resolution images in pdf when converted to PIL.Image (DPI=200)

Hi Team,

I am running into memory issues when dealing with high-resolution scanned images in a PDF. How should we handle this at 200 DPI?
Requirement: we need a list of PIL images from the converted PDF, and the DPI must be set to 200.

Code:

import pypdfium2 as pdfium
from memory_profiler import profile

# pdf_PATH (path to the input PDF) is defined elsewhere

@profile
def test_pdfium():
    pdf = pdfium.PdfDocument(pdf_PATH)
    version = pdf.get_version()  # get the PDF standard version
    n_pages = len(pdf)  # get the number of pages in the document
    img_list = []
    for page_number in range(n_pages):
        page = pdf.get_page(page_number)
        pil_image = page.render_topil(
            optimise_mode=pdfium.OptimiseMode.NONE,
            scale=200/72
        )
        img_list.append(pil_image)
    print(img_list)

test_pdfium()

Output:

[<PIL.Image.Image image mode=RGB size=6622x8673 at 0x10BD10AC0>, <PIL.Image.Image image mode=RGB size=6447x8356 at 0x10BD10A90>, <PIL.Image.Image image mode=RGB size=6556x8628 at 0x10BD10310>]
Filename: /Users/rushabhwadkar/Desktop/Training/pdfissue/server.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7     23.6 MiB     23.6 MiB           1   @profile
     8                                         def test_pdfium():
     9     23.8 MiB      0.2 MiB           1       pdf = pdfium.PdfDocument(pdf_PATH)
    10     23.8 MiB      0.0 MiB           1       version = pdf.get_version()  # get the PDF standard version
    11     23.8 MiB      0.0 MiB           1       n_pages = len(pdf)  # get the number of pages in the document
    12     23.8 MiB      0.0 MiB           1       img_list = []
    13    707.8 MiB      0.0 MiB           4       for page_number in range(n_pages):
    14    476.6 MiB      0.2 MiB           3           page = pdf.get_page(page_number)
    15    707.8 MiB    683.8 MiB           6           pil_image = page.render_topil(
    16    476.6 MiB      0.0 MiB           3               optimise_mode=pdfium.OptimiseMode.NONE,
    17    476.6 MiB      0.0 MiB           3               scale=200/72
    18                                                 )
    19    707.8 MiB      0.0 MiB           3           img_list.append(pil_image)
    20                                                 # pil_image.save(f"image_{page_number+1}.png")
    21    707.8 MiB      0.0 MiB           1       print(img_list)

According to the memory profiler, usage reaches around 700 MiB during the page.render_topil() call. Is there a way to bring that down?
The CPU and memory load in this scenario causes our k8s pods to be killed or restarted again and again!
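
The pixel sizes in the output above go a long way towards explaining the figure. A rough sketch of the arithmetic, assuming uncompressed storage at either 3 or 4 bytes per pixel; both widths are assumptions about the in-memory layout rather than measured values, and the helper name is made up for illustration:

```python
def bitmap_bytes(width, height, bytes_per_pixel):
    """Size of an uncompressed bitmap in bytes."""
    return width * height * bytes_per_pixel


# Pixel sizes copied from the printed image list above.
sizes = [(6622, 8673), (6447, 8356), (6556, 8628)]

for bpp in (3, 4):  # assumed storage widths, not measured values
    total = sum(bitmap_bytes(w, h, bpp) for w, h in sizes)
    print(f"{bpp} bytes/pixel: {total / 2**20:.1f} MiB for all pages")
```

Holding all three bitmaps in img_list therefore costs somewhere between roughly 480 and 640 MiB on its own, the same order of magnitude as the peak in the profile. Saving or processing each image before rendering the next page would avoid keeping them in memory at once. At scale=200/72 these sizes also correspond to pages of roughly 33 x 43 inches, so the scans appear to be embedded at a very large page size.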

PdfContext is no longer supported

import pypdfium2 as pdfium

...

with pdfium.PdfContext(outpath) as pdf:
    image = pdfium.render_page_topil(pdf, 0)
    image.save(outname)

Produces the following error:
AttributeError: module 'pypdfium2' has no attribute 'PdfContext'

But if I roll back to pypdfium2 version 1.0.0 then this code works successfully.

I figure this is something I need to change/fix in how I use pypdfium2, but I think it would be worth having some notes in the ChangeLog that say that the PdfContext is no longer supported. It would be helpful for me at least if some tips were provided for migration.

I've tried this

with pdfium.PdfDocument(outpath) as pdf:
    image = list(pdf.render_to(pdfium.BitmapConv.pil_image))[0]
    image.save(outname)

This produces a RuntimeError in render_to.

Extracting images from a pdf file

First of all, thanks for open-sourcing this Python library. It is one of the rare libraries that provides all the PDF reading functionality I need in a single package. The performance is great, too!

I am trying to extract an image from a single-page PDF, but the page object is of type Form (FPDF_PAGEOBJ_FORM) instead of Image (FPDF_PAGEOBJ_IMAGE), even though PDF viewers display an image.

PDF: https://www.cabsec.nic.in/writereaddata/changeinportfolio/english/1_Upload_864.pdf

Python 3.7.13 (default, Aug 12 2022, 05:20:12) 
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pypdfium2 as pdfium
>>> pdf = pdfium.PdfDocument('pdfs/1_Upload_864.pdf')
>>> len(pdf)
1
>>> [obj.get_type() for obj in pdf[0].get_objects()]
[5]
>>> pdfium.FPDF_PAGEOBJ_IMAGE
3
>>>

$ pdfimages -list pdfs/1_Upload_864.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     783  1263  gray    1   8  image  yes        9  0   100   100  308K  32%

Why is it showing a form when it is actually an image? Should I be using a different API? Is there any workaround for this?
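
One possible explanation is that the image is nested inside a form object, which PDF permits, so a flat scan of the page's top-level objects would only see the form. The sketch below shows the recursive descent on a hypothetical stand-in class; PageObject and its children attribute are invented for illustration and are not pypdfium2 API. With the raw bindings, PDFium's FPDFFormObj_CountObjects() and FPDFFormObj_GetObject() would be the likely candidates for listing a form's children:

```python
from dataclasses import dataclass, field

FPDF_PAGEOBJ_IMAGE = 3  # value shown in the session above
FPDF_PAGEOBJ_FORM = 5   # type reported for the page's only object


@dataclass
class PageObject:
    """Hypothetical stand-in for a page object; not a pypdfium2 class."""
    type: int
    children: list = field(default_factory=list)


def find_images(objects):
    """Yield image objects, descending into form objects recursively."""
    for obj in objects:
        if obj.type == FPDF_PAGEOBJ_IMAGE:
            yield obj
        elif obj.type == FPDF_PAGEOBJ_FORM:
            yield from find_images(obj.children)


# A page whose only top-level object is a form wrapping one image.
page_objects = [PageObject(FPDF_PAGEOBJ_FORM, [PageObject(FPDF_PAGEOBJ_IMAGE)])]
images = list(find_images(page_objects))
print(len(images))
```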

CLI: Support custom output formats

Currently, the command-line tool can only create PNG files. However, it would be nice to support custom output formats, as PIL provides a lot more encoders.

Support loading PDF from in-memory data using `PdfContext` (bytes or `io.BytesIO`)

Create `.readthedocs.yaml` to install external requirements differently

Then we could use the extra requirements from the setup file and get rid of docs/requirements.txt, see
https://docs.readthedocs.io/en/stable/config-file/v2.html?#packages

Add python pathlib support for any file path operations

Lowest priority since it's not going to prevent anyone from using the lib, but with how pathlib.Path has become usable everywhere in the standard library now, and is super handy, it would be nice to support.

e.g.

doc = pdfium.FPDF_LoadDocument(test_pdf, None)

ArgumentError                             Traceback (most recent call last)
Input In [32], in <cell line: 1>()
----> 1 doc = pdfium.FPDF_LoadDocument(test_pdf, None)

ArgumentError: argument 1: <class 'TypeError'>: object of type 'PosixPath' has no len()

I can take care of this, I have been meaning to do some more open source work but haven't had the time, so I might create some more small tickets to knock out.
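
A minimal sketch of how path-like input could be normalised before it reaches the raw function, using only the standard library. The helper name is hypothetical; os.fspath() accepts str, bytes and any os.PathLike object such as pathlib.Path:

```python
import os
from pathlib import Path


def as_path_str(path):
    """Return the plain str (or bytes) form of a str / os.PathLike path."""
    return os.fspath(path)


test_pdf = Path("docs") / "test.pdf"
path_str = as_path_str(test_pdf)
# path_str could then be handed on, e.g. pdfium.FPDF_LoadDocument(path_str, None)
print(type(path_str).__name__)
```

Doing the conversion in one place would let every helper accept both strings and Path objects without changing the raw bindings.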

Provide a source build fallback

Add a generic source build strategy for platforms where we don't have pre-built binaries:

- Download DepotTools and PDFium
- Set configuration, optionally apply patches
- Perform the build
- Move header files and binary to data/sourcebuild
- Package a wheel

Allow for rendering a certain area of a page

There is the FPDF_RenderPageBitmapWithMatrix() function that may be used to render a custom area of the page.

Moreover, we should add some method to get the PDF boxes via PDFium (CropBox, MediaBox, ... ; using FPDFPage_GetCropBox(), FPDFPage_GetMediaBox(), ...)
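
Whatever rendering call ends up being used, the bitmap for a partial render has to be sized from the chosen area rather than from the full page. A small sketch of that arithmetic, assuming a box given as (left, bottom, right, top) in PDF points (1/72 inch); the helper name and the example box are made up and nothing here calls PDFium:

```python
import math


def crop_to_pixels(box, dpi):
    """Pixel size of a page area given as (left, bottom, right, top) in points."""
    left, bottom, right, top = box
    if right <= left or top <= bottom:
        raise ValueError("empty or inverted box")
    # 72 points per inch; round up so the area is never cut short
    width = math.ceil((right - left) * dpi / 72)
    height = math.ceil((top - bottom) * dpi / 72)
    return width, height


# Example: a US Letter page (612 x 792 points) with a one-inch margin removed.
crop_box = (72, 72, 540, 720)
width_px, height_px = crop_to_pixels(crop_box, dpi=200)
print(width_px, height_px)
```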

acroforms: add call to `FORM_OnAfterLoadPage()`

https://developers.foxit.com/resources/pdf-sdk/c_api_reference_pdfium/group___f_p_d_f_i_u_m.html#ga6bfb44ecd56c58cb6a3317626d07d47c

Support for PIL.PPMImage Plugin

pil_image = page.render_topil(scale=200/72)

The above render_topil call returns a PIL.Image.Image, which in turn consumes a lot of memory. Is it possible to render to PIL.PPMImage or custom PIL plugins? https://pillow.readthedocs.io/en/stable/_modules/PIL/PpmImagePlugin.html

Interruptable rendering

Change the current Page.render_base() to Page.render_base_async() and re-implement render_base() using render_base_async(). Create a helper class to control asynchronous rendering (implementing start(), pause(), resume(), finish() and cancel()).

This will provide callers with a higher level of control over the rendering process.
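
A rough sketch of how such a helper class could be shaped, written in pure Python with a plain iterable standing in for the progressive rendering calls. RenderJob, step() and render_blocking() are hypothetical names; only start(), pause(), resume(), finish() and cancel() come from the proposal above:

```python
class RenderJob:
    """Hypothetical controller for a render that progresses in steps."""

    def __init__(self, steps):
        self._steps = iter(steps)  # stand-in for progressive render calls
        self.state = "new"
        self.done_steps = 0

    def start(self):
        self.state = "running"

    def pause(self):
        if self.state == "running":
            self.state = "paused"

    def resume(self):
        if self.state == "paused":
            self.state = "running"

    def cancel(self):
        self.state = "cancelled"

    def step(self):
        """Do one unit of work; return False if no work was done."""
        if self.state != "running":
            return False
        try:
            next(self._steps)
        except StopIteration:
            self.state = "finished"
            return False
        self.done_steps += 1
        return True

    def finish(self):
        """Run to completion (resuming if paused); True if the job finished."""
        if self.state == "new":
            self.start()
        self.resume()
        while self.step():
            pass
        return self.state == "finished"


def render_blocking(steps):
    """Blocking variant built on the step-wise job, mirroring the idea of
    re-implementing render_base() on top of render_base_async()."""
    job = RenderJob(steps)
    job.start()
    job.finish()
    return job
```

A caller that wants control would drive step() itself and call pause() or cancel() in between, while callers that do not care keep using the blocking wrapper.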
