artifexsoftware / mupdf.js

JavaScript bindings for MuPDF

Home Page: https://mupdfjs.readthedocs.io

License: GNU Affero General Public License v3.0

Makefile 0.06% Shell 0.71% C 34.17% TypeScript 63.30% Vim Script 0.05% JavaScript 1.72%

javascript mupdf pdf wasm pdf-extraction pdf-viewer typescript

mupdf.js's Introduction

MuPDF.js

This is a build of MuPDF for JavaScript and TypeScript, using the speed and performance of WebAssembly.

The MuPDF.js library can be used both in browsers and in Node.js.

Features

Render PDF pages to images
Search PDF file text contents
Create and edit PDF annotations
Access and fill out PDF forms
Edit PDF documents
Supports basic CJK (Chinese, Japanese, Korean) fonts

Installing

From the command line, go to the folder you want to work from and run:

npm install mupdf

The mupdf package is only available as an ES module (ESM). Either use the .mjs file extension or set the project type to module:

npm pkg set type=module

Running

The following example script demonstrates how to load a document and then print out the page count.

Create a file count-pages.mjs:

import * as process from "node:process"
import * as fs from "node:fs"
import * as mupdf from "mupdf"

if (process.argv.length < 3) {
    console.error("usage: node count-pages.mjs file.pdf");
    process.exit(1);
}

const filename = process.argv[2];
const doc = mupdf.Document.openDocument(fs.readFileSync(filename), "application/pdf");
const count = doc.countPages();

console.log(`${filename} has ${count} pages.`);

Run the script:

node count-pages.mjs file.pdf

Using TypeScript

To use TypeScript, create a tsconfig.json project file that tells the compiler and Visual Studio Code to use the "nodenext" module resolution:

{
    "compilerOptions": {
        "module": "nodenext"
    }
}

License and Copyright

MuPDF.js is available under Open Source AGPL and commercial license agreements. If you determine you cannot meet the requirements of the AGPL, please contact Artifex for more information regarding a commercial license.

Documentation

For documentation please refer to mupdfjs.readthedocs.io.

Code Examples

Check out the example projects to help you get started. The examples include a simple PDF Viewer that runs mupdf in the browser, several command line scripts, and more!

Getting Started with Local Development

You can build the MuPDF.js library from source by referring to BUILDING.md.

Contributing

To contribute, please open (or help answer!) an Issue on our GitHub board and create a Pull Request (PR) for review. Find us on Discord at #mupdf-js to chat with us directly.

mupdf.js's People

sebras charris123 xybei ccxvii penxla rythenglyth qqq-tech firejox vilhelm-ian

mupdf.js's Issues

API for remove embedded file

We can add embedded files to an annotation. Can we also add an API to remove an embedded file from an annotation?

Annotation: "Intended Type" API

Supply an API for Intended Type (IT).

These are subtypes, for example for Line and Polygon.

Annotation: Supply API to get RD value from an annotation

Required for rendering FreeTextCallout properly. The textbox's location can only be determined by calculating it from the bbox and RD. (Please refer to the attached picture for clarification.)

Annotation: Leader Line API

LL (Leader Line) should be supported.

Annotation: Free Text Callout API

CL (Free Text Callout) should be supported.

DocumentWriter.close WASM RuntimeError

The DocumentWriter.close function seems to malfunction and throws a WASM RuntimeError.

This is the code reduced to the minimum:

import * as mupdf from 'mupdf'

const outBuffer = new mupdf.Buffer()
const out = new mupdf.DocumentWriter(outBuffer, ".pdf", "")
out.close()

It throws the following error:

wasm://wasm/0239b566:1


RuntimeError: null function or function signature mismatch
    at wasm://wasm/0239b566:wasm-function[1894]:0x13ff94
    at wasm://wasm/0239b566:wasm-function[1893]:0x13ff49
    at wasm://wasm/0239b566:wasm-function[1985]:0x147d8e
    at wasm://wasm/0239b566:wasm-function[1895]:0x13fff1
    at wasm://wasm/0239b566:wasm-function[3558]:0x2669f9
    at invoke_viiii (file:///.../node_modules/mupdf/dist/mupdf-wasm.js:5784:29)
    at wasm://wasm/0239b566:wasm-function[3548]:0x25ff90
    at wasm://wasm/0239b566:wasm-function[3546]:0x25c738
    at wasm://wasm/0239b566:wasm-function[3567]:0x26cc49
    at wasm://wasm/0239b566:wasm-function[2299]:0x1797b5

Node.js v21.6.2

If I instead try to add a page:

import * as mupdf from 'mupdf'

const outBuffer = new mupdf.Buffer()
const out = new mupdf.DocumentWriter(outBuffer, ".pdf", "")
const dev = out.beginPage([0,0,200,200])
/* draw something on the Device or not, it doesn't matter */
out.endPage()
out.close()

another error is thrown:

wasm://wasm/0239b566:1


RuntimeError: memory access out of bounds
    at wasm://wasm/0239b566:wasm-function[1894]:0x13ffad
    at wasm://wasm/0239b566:wasm-function[1893]:0x13ff49
    at wasm://wasm/0239b566:wasm-function[1985]:0x147ecf
    at wasm://wasm/0239b566:wasm-function[1895]:0x13fff1
    at wasm://wasm/0239b566:wasm-function[3558]:0x2669f9
    at invoke_viiii (file:///.../node_modules/mupdf/dist/mupdf-wasm.js:5784:29)
    at wasm://wasm/0239b566:wasm-function[3548]:0x25ff90
    at wasm://wasm/0239b566:wasm-function[3546]:0x25c738
    at wasm://wasm/0239b566:wasm-function[3567]:0x26cc49
    at wasm://wasm/0239b566:wasm-function[2299]:0x1797b5
    
Node.js v21.6.2

No matter what, outBuffer.getLength() is always 0 after out.close().

When recreating everything from mupdf's muconvert.c in JS, the second error is thrown as well.

Annotation: "In Reply To" API

IRT (In Reply To) for annotations is not currently supported. Can we expose it?

REST server should cache open documents

Repeatedly doing fetch() on the same document from a third party server without caching is going to be slower than it needs to be. We should cache the most recently used documents and reuse the same array buffer that has already been fetched.

This caching can be handled in loadDocumentFromUrl, which can resolve to the cached document if it is in the cache.

Ideally we should use the fetch HTTP response headers to check for freshness as well, but that may be overkill for an example server.
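
As a rough illustration of the caching idea, here is a minimal sketch of a least-recently-used cache keyed by URL. The names createDocumentCache and fetchBytes are hypothetical and not part of the example server; fetchBytes stands in for whatever function actually downloads the document and resolves to its bytes. Freshness checks based on response headers are deliberately left out.

```javascript
// Hypothetical sketch: a small least-recently-used cache for fetched documents.
// `fetchBytes(url)` is an assumed function that resolves to the document bytes.
function createDocumentCache(fetchBytes, maxEntries = 8) {
    const entries = new Map(); // URL -> Promise of bytes, oldest entry first

    return function loadCached(url) {
        if (entries.has(url)) {
            // Re-insert the entry to mark this URL as most recently used.
            const hit = entries.get(url);
            entries.delete(url);
            entries.set(url, hit);
            return hit;
        }
        // Cache the promise itself so concurrent requests share one download.
        const pending = Promise.resolve().then(() => fetchBytes(url));
        entries.set(url, pending);
        // Do not keep failed downloads in the cache.
        pending.catch(() => {
            if (entries.get(url) === pending) entries.delete(url);
        });
        if (entries.size > maxEntries) {
            // A Map iterates in insertion order, so the first key is the oldest.
            entries.delete(entries.keys().next().value);
        }
        return pending;
    };
}
```

A loadDocumentFromUrl implementation could call the returned function instead of fetching directly, so that repeated requests for the same URL reuse the same array buffer.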

Annotations: Add API for "DS"

We can extract required information for color and alignment from a text item from the "DS" property.

Can we expose a get/set API for this?

Annotation: get/set rotation API

Some unofficial APIs extend PDF and can be used to get and set annotation rotation, for example:

https://pspdfkit.com/guides/web/annotations/annotation-rotation/#user-interface

We will correctly render PDF annotations with rotation; however, we should also support getting and setting rotation programmatically for our annotations.

Add an API to add text to a PDF

Note: not as an annotation, but to add directly as a text object - perhaps like the way PyMuPDF does it with "Stories".

Annotation: Measure API

Let's support these "Measure" dictionary objects as defined on page 746 of the PDF version 1.7 specification

Annotation: Leader Line Extension API

LLE (Leader Line Extension) should be supported

Annotation: provide Subject API

Subj (Subject) for annotations is not currently supported. Can we expose it?

"Error: invalid page number: 2"

I was trying to open one of my go-to PDFs for stress testing, but this seems to occur when trying to open any PDF in Firefox (Nightly).

Stack trace:

Error: invalid page number: 2                                                    viewer.js:750:11
    2114013 https://mupdf.com/wasm/demo/lib/mupdf-wasm.js:1142
    _emscripten_asm_const_int https://mupdf.com/wasm/demo/lib/mupdf-wasm.js:3820
    invoke_vi https://mupdf.com/wasm/demo/lib/mupdf-wasm.js:6179
    createExportWrapper https://mupdf.com/wasm/demo/lib/mupdf-wasm.js:994
    loadPage https://mupdf.com/wasm/demo/lib/mupdf.js:1392
    getPageSize https://mupdf.com/wasm/demo/worker.js:78
    onmessage https://mupdf.com/wasm/demo/worker.js:39
    open_document_from_file https://mupdf.com/wasm/demo/viewer.js:750
    onchange https://mupdf.com/wasm/demo/index.html:1

PDF content that includes "1" is not copied correctly

PDF:
raw.pdf

Open the attached file at this URL:
https://mupdf.com/wasm/demo/view.html?file=../../docs/mupdf_explored.pdf

Update npm build to 0.1.4 to include TypeScript changes

Hi Team,

Just wondering when you will be updating the npm version to 0.1.4 to get the awesome TypeScript changes that were made 4 days ago?

Regards,
Tarek

User friendly image adding

Like PyMuPDF: https://pymupdf.readthedocs.io/en/latest/the-basics.html#adding-an-image-to-a-pdf

Can we add a similar API call for adding an image to a PDF file in mupdf.js?

Sanitise the simple-viewer example

At the moment the solution works, however it is perhaps over-engineered. Let's try to tidy it up and make it more "simple" to follow :)

Add graft methods to the MuPDF.js API

We have these three methods available in mutool, but not for wasm:

newGraftMap
graftObject
graftPage

See: https://mupdf.readthedocs.io/en/latest/mupdf-js.html#copying-objects-across-pdfs

Let's make it available in wasm so we can then use the methods in Node.js

REST server should check HTTP error codes from fetch

loadDocumentFromUrl does not handle HTTP errors.

If we fail to fetch the URL for any reason, we should return an error code to the user.
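
One possible shape for this check is sketched below. The names UpstreamError and fetchDocumentBytes are hypothetical, and the choice of 502 for network-level failures is an assumption rather than anything the example server prescribes. The fetchImpl parameter defaults to the global fetch, which is available in browsers and in Node.js 18 or later.

```javascript
// Hypothetical error type carrying the HTTP status to report to the user.
class UpstreamError extends Error {
    constructor(status, message) {
        super(message);
        this.status = status;
    }
}

// Fetch a document and fail loudly on network errors and HTTP error codes.
async function fetchDocumentBytes(url, fetchImpl = fetch) {
    let response;
    try {
        response = await fetchImpl(url);
    } catch (err) {
        // Network-level failure: DNS error, refused connection, and so on.
        // Assumption: report these as 502 Bad Gateway.
        throw new UpstreamError(502, `could not reach ${url}: ${err.message}`);
    }
    if (!response.ok) {
        // fetch() resolves normally for 4xx and 5xx, so check explicitly.
        throw new UpstreamError(response.status, `fetching ${url} failed with HTTP ${response.status}`);
    }
    return response.arrayBuffer();
}
```

The caller can then map err.status and err.message onto the REST response instead of passing an error page to the document loader as if it were a PDF.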

Add documentation to the repo.

Currently much of our documentation exists within the main mupdf repo. We need to move this into this repo.

Annotation: Support for get/set rich media attributes on text

Adobe supply this kind of widget in their UI to style annotation text:

However, we are unable to read from or write to these rich media objects.

Annotation: Add API to get/set Name

NM (Name) for annotations is not currently supported. Can we expose it?

User friendly Text Adding

Right now when we add text it is like: "BT /Helv 12 Tf 100 100 Td (MuPDF!)Tj ET" in a content stream

Let's make a higher level API in lib/tasks.js to make this easier for the developer.

Suggest we look at: https://pymupdf.readthedocs.io/en/latest/page.html#Page.insert_text and https://pymupdf.readthedocs.io/en/latest/page.html#Page.insert_textbox and do something similar.
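
To make the idea concrete, here is a minimal sketch of a helper that builds the content-stream fragment quoted above. The function names and option names are hypothetical, not an existing mupdf.js API. It only handles a single line of text and assumes the font resource (here /Helv) is already available on the page.

```javascript
// Escape the characters that are special inside a PDF literal string:
// backslash and the two parentheses.
function escapePdfString(text) {
    return text.replace(/[\\()]/g, (ch) => "\\" + ch);
}

// Hypothetical helper: build a one-line text object for a content stream.
// Assumption: the named font resource already exists on the target page.
function textContentStream(text, { font = "Helv", size = 12, x = 100, y = 100 } = {}) {
    return `BT /${font} ${size} Tf ${x} ${y} Td (${escapePdfString(text)}) Tj ET`;
}
```

Calling textContentStream("MuPDF!") produces the same operators as the hand-written string in this issue. Line wrapping, text boxes and font selection, as offered by the PyMuPDF methods linked above, would need to be layered on top.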

Update core JS docs in mupdf as required

New methods are being added to the core - the base docs need to be updated as this happens.

Add new items here as they are committed.

See: 0e0ad52

OCR

I haven't found any reference to OCR in the mupdf.js docs, but I see that Tesseract is an optional dependency of mupdf. Is there an option to do OCR using mupdf.js?

Page: Draw as SVG API

Page.toPixmap() allows exporting the page to a raster image.

I am looking for an API to export as SVG.

I see mutool draw has the ability to export as svg. It also seems that another Wasm port of MuPDF has support for drawing as SVG. So I assume MuPDF has this capability, and it only needs to be exposed as an API in the JS library.

Add "has" accessors to mupdf.js

It looks like PDFAnnotation in lib/mupdf.js is missing all the hasXxx accessors and corresponding functions in lib/mupdf.c - let's add them to improve the API.

page.toStructuredText method needs to be able to return image objects

At present, with the walk() method we can find text objects, but images are not retrieved. We need a flag to request that images be delivered as well.

Trying mupdf library directly with React

With this setup we can run a React-based app via a node server, so basically a REST API provided by the node server to the react client.

However, if we just want to run a React-based app directly upon the MuPDF library, i.e.:

It fails to compile with:

./node_modules/mupdf/lib/mupdf-wasm.js:130:7
Module not found: Can't resolve 'fs'

Could there be some further environmental setup that the wasm library needs to understand?
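
One setting that may be relevant, assuming the React app is bundled with webpack 5 (the issue does not say which bundler is in use), is webpack's resolve.fallback option, which tells the bundler not to look for a browser replacement for a Node.js built-in. The fragment below is only a sketch of that option; the variable name is hypothetical and the object would need to be merged into the project's actual webpack configuration. Whether this alone is enough for mupdf is not confirmed here.

```javascript
// Fragment of a webpack 5 configuration, to be merged into the real config.
// Assumption: the app is built with webpack 5; other bundlers use other options.
const browserFallbacks = {
    resolve: {
        fallback: {
            // Do not try to bundle Node's "fs" module for the browser.
            fs: false,
        },
    },
};
```

Other Node.js built-ins referenced by the generated wasm loader may need the same treatment.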

User friendly page cropping

Can we do something in our mupdf.js file to allow for an easy way to crop a page?

In PyMuPDF we have: https://pymupdf.readthedocs.io/en/latest/the-basics.html#cropping-a-pdf, but I can't see the equivalent in our API.

Please let's add this to the API to match the way PyMuPDF provides it.
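
Part of such a helper is plain geometry. As a sketch, assuming rectangles are arrays of the form [x0, y0, x1, y1] (the same shape passed to beginPage elsewhere on this page), a crop request could first be clamped to the page's media box. The function name clampCropRect is hypothetical, and writing the result back to the page is left out because that is the API this issue asks for.

```javascript
// Hypothetical helper: restrict a requested crop rectangle to the media box.
// Rectangles are [x0, y0, x1, y1].
function clampCropRect(mediaBox, crop) {
    const rect = [
        Math.max(mediaBox[0], crop[0]),
        Math.max(mediaBox[1], crop[1]),
        Math.min(mediaBox[2], crop[2]),
        Math.min(mediaBox[3], crop[3]),
    ];
    // An empty intersection means the request lies entirely off the page.
    if (rect[0] >= rect[2] || rect[1] >= rect[3]) {
        throw new RangeError("crop rectangle does not overlap the page");
    }
    return rect;
}
```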

User friendly page rotation

Can we do something in our mupdf.js file to allow for an easy way to rotate a page?

In PyMuPDF we have: https://pymupdf.readthedocs.io/en/latest/the-basics.html#rotating-a-pdf, but I can't see the equivalent in our API.

Please let's add this to the API to match the way PyMuPDF provides it.
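
Whatever form the API takes, the page rotation stored in a PDF must be a multiple of 90 degrees. A small sketch of the arithmetic such a helper would need is shown below; the function name rotatedBy is hypothetical, and reading or writing the page's rotation value is not shown.

```javascript
// Hypothetical helper: compute a new page rotation from the current one.
// PDF page rotation must be a multiple of 90; the result is kept in 0..270.
function rotatedBy(currentRotation, delta) {
    if (delta % 90 !== 0) {
        throw new RangeError("rotation must be a multiple of 90 degrees");
    }
    // The double modulo keeps the result non-negative for negative deltas.
    return (((currentRotation + delta) % 360) + 360) % 360;
}
```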

REST server should catch exceptions from WASM

Each API call should really be wrapped in a try/catch to return a user friendly HTTP status code and error message if there's an exception in the mupdf library.

For example opening a file that is corrupt, or not a PDF file, or loading a page that is out of range, etc.
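
A minimal sketch of such a wrapper follows. The name runApiCall and the mapping from error message to status code are assumptions made for illustration; in particular, treating "invalid page number" messages as a client error (400) and everything else as 500 is only one possible policy. The handler is assumed to be synchronous.

```javascript
// Hypothetical helper: run one API handler and turn any exception into
// an HTTP status code plus a JSON-serializable body.
function runApiCall(handler) {
    try {
        return { status: 200, body: handler() };
    } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        // Assumption: an out-of-range page is the caller's mistake (400);
        // anything else is reported as a server-side failure (500).
        const status = /invalid page number/i.test(message) ? 400 : 500;
        return { status, body: { error: message } };
    }
}
```

Each route would pass its library calls in as the handler and send the returned status and body, so an exception in the mupdf library never escapes as an unhandled error.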

REST server should rate limit fetches to third party server

We do not want our example server to be usable as a DDoS proxy.
For each API request a user makes, we issue a fetch to the third-party server named in the URL.
This can be abused to launch a DDoS attack using our REST server's bandwidth.

Here are a few possible solutions:

Do not issue a new fetch for a file if the same URL is already being fetched.
Rate limit how many fetches we make to each domain.
Only allow fetching from domains that are on a whitelist.
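
The three ideas above could be combined in one small guard around the download function. The sketch below is illustrative only: createFetchGuard, fetchBytes, allowedHosts and minIntervalMs are hypothetical names, and the rate limit shown (at most one new fetch per host per interval) is a deliberately simple policy, not a recommendation.

```javascript
// Hypothetical guard around an assumed `fetchBytes(url)` download function.
function createFetchGuard(fetchBytes, { allowedHosts, minIntervalMs = 1000, now = Date.now }) {
    const inFlight = new Map();  // URL -> pending download promise
    const lastFetch = new Map(); // host -> time of the last fetch we started

    return function guardedFetch(url) {
        const host = new URL(url).hostname;
        // Whitelist: refuse hosts we have not explicitly allowed.
        if (!allowedHosts.includes(host)) {
            return Promise.reject(new Error(`host not allowed: ${host}`));
        }
        // De-duplication: share a download that is already in progress.
        if (inFlight.has(url)) {
            return inFlight.get(url);
        }
        // Rate limit: at most one new fetch per host per interval.
        const t = now();
        if (lastFetch.has(host) && t - lastFetch.get(host) < minIntervalMs) {
            return Promise.reject(new Error(`rate limit exceeded for ${host}`));
        }
        lastFetch.set(host, t);
        const pending = Promise.resolve()
            .then(() => fetchBytes(url))
            .finally(() => inFlight.delete(url));
        inFlight.set(url, pending);
        return pending;
    };
}
```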

simple-viewer search always skips results from the current page

The search function of simple-viewer has some minor flaws.

When I type in the search panel and click Next for the first time, the search results on the current page are always skipped and the results on the next page are displayed directly.

It is expected that the search results on the current page should be displayed first, and when Next is clicked again, the results on the next page will be displayed.
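
The expected behaviour can be expressed as a small pure function. This is a sketch of the navigation logic only, not the simple-viewer code: nextSearchHit is a hypothetical name, hitsByPage is an assumed array holding the number of hits on each page, and cursor is null before the first click on Next and otherwise the last hit shown.

```javascript
// Hypothetical sketch of "Next" navigation that starts on the current page.
// hitsByPage[i] is the number of search hits on page i (assumed data shape).
// cursor is null before the first click, otherwise { page, index }.
function nextSearchHit(hitsByPage, currentPage, cursor) {
    const pageCount = hitsByPage.length;
    if (pageCount === 0) return null;
    // First click: begin with hit 0 of the current page, not the next page.
    let page = cursor ? cursor.page : currentPage;
    let index = cursor ? cursor.index + 1 : 0;
    // Visit each page at most once more, wrapping around at the end.
    for (let visited = 0; visited <= pageCount; visited++) {
        if (index < hitsByPage[page]) return { page, index };
        page = (page + 1) % pageCount;
        index = 0;
    }
    return null; // no hits anywhere
}
```

With hits on the current page, the first call returns a hit on that page, and only after its last hit does navigation move on to the following page.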