modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.

Home Page: https://github.com/modesty/pdf2json

License: Other

json pdf pdf-converter pdf-form pdf-text pdf2json pdf2text pdf2form

pdf2json's Introduction

pdf2json

pdf2json is a Node.js module that converts binary PDF to JSON and text. Built with pdf.js, it extracts text content and interactive form elements for server-side processing and command-line use.

Features

PDF text extraction: extracts textual content of PDF documents into structured JSON.
Form element handling: parses interactive form fields within PDFs for flexible data capture.
Server-side and command-line versatility: Integrate with web services for remote PDF processing or use as a standalone command-line tool for local file conversion.
Community driven: decade+ long community driven development ensures continuous improvement.

Install

npm i pdf2json

Or, install it globally:

npm i pdf2json -g

To update to the latest version:

npm update pdf2json -g

To Run in RESTful Web Service or as command line Utility

More details can be found at the bottom of this document.

Test

After install, run command line:

npm run test:jest

It'll build bundles and source maps for both ES Module and CommonJS, output them to the ./dist directory, and run the Jest test suite defined in ./test/_test_.cjs.

The default test suite contains the essential tests for all PRs, but it covers only a portion of the test PDFs. For broader coverage, run:

npm run test:forms

It'll scan and parse 260 PDF AcroForm files under ./test/pdf, running with the -s -t -c -m command line options, and generate the primary output JSON, additional text content JSON, form fields JSON and merged text file for each PDF. It usually takes ~20s on my MacBook Pro to complete; check ./test/target/ for outputs.

update on 4/27/2024: parsing 260 PDFs by npm run test:forms on M2 Mac takes 7~8s

To run the full test suites:

npm test

Test Exception Handlings

After install, run command line:

npm run test:misc

It'll scan and parse all PDF files under ./test/pdf/misc, also running with the -s -t -c -m command line options. It generates the primary output JSON, additional text content JSON, form fields JSON and merged text JSON file for 5 PDF files, while catching exceptions with stack traces for:

bad XRef entry for pdf/misc/i200_test.pdf
unsupported encryption algorithm for pdf/misc/i43_encrypted.pdf
Invalid XRef stream header for pdf/misc/i243_problem_file_anon.pdf

Test Streams

After install, run command line:

npm run parse-r

It scans 165 PDF files under ../test/pdf/fd/form, parses them with the Stream API, then writes the output to ./test/target/fd/form.

More test scripts with different command line options can be found at package.json.

Disabling Test logs

For CI/CD, you probably would like to disable unnecessary logs for unit testing.

The code has two types of logs:

The logs that consume the console.log and console.warn APIs;
And the logs that consume our own base/shared/util.js log function.

To disable the first type, you could mock the console.log and console.warn APIs, but to disable the second one, you must set the env variable PDF2JSON_DISABLE_LOGS to "1".
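For a test setup, both kinds of logs could be silenced together. The sketch below is only an illustration and not part of pdf2json: silenceLogs is a hypothetical helper that sets PDF2JSON_DISABLE_LOGS and temporarily replaces console.log and console.warn with no-ops. Whether the variable must be set before pdf2json is loaded is not stated here, so calling the helper before importing the module is the safer assumption.

```javascript
// Hypothetical helper (not part of pdf2json): silence both kinds of logs.
function silenceLogs() {
  const original = { log: console.log, warn: console.warn };
  const previousEnv = process.env.PDF2JSON_DISABLE_LOGS;

  // Second type: pdf2json's own log function is switched off by this variable.
  process.env.PDF2JSON_DISABLE_LOGS = "1";
  // First type: replace the console APIs with no-ops.
  console.log = () => {};
  console.warn = () => {};

  // Return a function that undoes both changes.
  return function restore() {
    console.log = original.log;
    console.warn = original.warn;
    if (previousEnv === undefined) delete process.env.PDF2JSON_DISABLE_LOGS;
    else process.env.PDF2JSON_DISABLE_LOGS = previousEnv;
  };
}
```

In a test runner, silenceLogs() could be called in a setup hook and the returned restore function in the matching teardown hook.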

Code Example

Parse a PDF file then write to a JSON file:

import fs from "fs";
import PDFParser from "pdf2json"; 

const pdfParser = new PDFParser();

pdfParser.on("pdfParser_dataError", (errData) =>
 console.error(errData.parserError)
);
pdfParser.on("pdfParser_dataReady", (pdfData) => {
 fs.writeFile(
  "./pdf2json/test/F1040EZ.json",
  JSON.stringify(pdfData),
  (err) => {
   if (err) console.error(err); // the callback receives an error (or null), not data
  }
 );
});

pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");

Or, call directly with buffer:

fs.readFile(pdfFilePath, (err, pdfBuffer) => {
 if (err) {
  console.error(err); // report read failures instead of silently ignoring them
  return;
 }
 pdfParser.parseBuffer(pdfBuffer);
});
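The two events can also be folded into a Promise for use with async/await. This is a minimal sketch, not something pdf2json ships: parsePdfFile is a hypothetical helper that relies only on the on() and loadPDF() calls and the two event names shown above, and it expects the caller to pass in a parser instance.

```javascript
// Hypothetical helper (not part of pdf2json): wrap the event API in a Promise.
// `parser` is expected to behave like a PDFParser instance: it needs on() and loadPDF().
function parsePdfFile(parser, pdfFilePath) {
  return new Promise((resolve, reject) => {
    // Failure: the error object has the shape { parserError: errObj }.
    parser.on("pdfParser_dataError", (errData) =>
      reject(errData && errData.parserError ? errData.parserError : errData)
    );
    // Success: resolve with the parsed output data.
    parser.on("pdfParser_dataReady", (pdfData) => resolve(pdfData));
    parser.loadPDF(pdfFilePath);
  });
}
```

With a real parser this would be called as `const pdfData = await parsePdfFile(new PDFParser(), path);`. Using a fresh parser per call avoids stacking listeners on one instance.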

Or, use more granular page level parsing events (v2.0.0)

pdfParser.on("readable", (meta) => console.log("PDF Metadata", meta));
pdfParser.on("data", (page) =>
 console.log(page ? "One page parsed" : "All pages parsed", page)
);
pdfParser.on("error", (err) => console.error("Parser Error", err));
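The page level events can be gathered into a single result once the final null arrives. The following is a sketch under the assumption that the three events behave as described here; collectPages is a hypothetical helper name, and it only needs an object with an on() method.

```javascript
// Hypothetical helper (not part of pdf2json): collect page level events.
function collectPages(parser) {
  return new Promise((resolve, reject) => {
    const result = { meta: null, pages: [] };
    // "readable" fires once, after the metadata is parsed.
    parser.on("readable", (meta) => {
      result.meta = meta;
    });
    // "data" fires per page; a null page marks the end of the stream.
    parser.on("data", (page) => {
      if (page) result.pages.push(page);
      else resolve(result);
    });
    parser.on("error", reject);
  });
}
```

The listeners must be registered before parsing starts, so collectPages(pdfParser) would be called before loadPDF or parseBuffer.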

Parse a PDF then write a .txt file (which only contains textual content of the PDF)

import fs from "fs";
import PDFParser from "pdf2json"; 

const pdfParser = new PDFParser(this, 1); // the second argument (1) turns on raw text content

pdfParser.on("pdfParser_dataError", (errData) =>
 console.error(errData.parserError)
);
pdfParser.on("pdfParser_dataReady", (pdfData) => {
 fs.writeFile(
  "./pdf2json/test/F1040EZ.content.txt",
  pdfParser.getRawTextContent(),
  () => {
   console.log("Done.");
  }
 );
});

pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");

Parse a PDF then write a fields.json file that only contains interactive forms' fields information:

import fs from "fs";
import PDFParser from "pdf2json"; 

const pdfParser = new PDFParser();

pdfParser.on("pdfParser_dataError", (errData) =>
 console.error(errData.parserError)
);
pdfParser.on("pdfParser_dataReady", (pdfData) => {
 fs.writeFile(
  "./pdf2json/test/F1040EZ.fields.json",
  JSON.stringify(pdfParser.getAllFieldsTypes()),
  () => {
   console.log("Done.");
  }
 );
});

pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");

Alternatively, you can pipe input and output streams: (requires v1.1.4)

import fs from "fs";
import PDFParser from "pdf2json";

const inputStream = fs.createReadStream(
 "./pdf2json/test/pdf/fd/form/F1040EZ.pdf",
 { highWaterMark: 64 * 1024 } // current Node.js option name for the read buffer size (formerly bufferSize)
);
const outputStream = fs.createWriteStream(
 "./pdf2json/test/target/fd/form/F1040EZ.json"
);

inputStream
 .pipe(new PDFParser())
 .pipe(new StringifyStream())
 .pipe(outputStream);

With v2.0.0, last line above changes to

inputStream
 .pipe(this.pdfParser.createParserStream())
 .pipe(new StringifyStream())
 .pipe(outputStream);

For additional output streams support:

 //private methods
 #generateMergedTextBlocksStream() {
  return new Promise((resolve, reject) => {
   const outputStream = ParserStream.createOutputStream(this.outputPath.replace(".json", ".merged.json"), resolve, reject);
   this.pdfParser.getMergedTextBlocksStream().pipe(new StringifyStream()).pipe(outputStream);
  });
 }

 #generateRawTextContentStream() {
  return new Promise((resolve, reject) => {
   const outputStream = ParserStream.createOutputStream(this.outputPath.replace(".json", ".content.txt"), resolve, reject);
   this.pdfParser.getRawTextContentStream().pipe(outputStream);
  });
 }

 #generateFieldsTypesStream() {
  return new Promise((resolve, reject) => {
   const outputStream = ParserStream.createOutputStream(this.outputPath.replace(".json", ".fields.json"), resolve, reject);
   this.pdfParser.getAllFieldsTypesStream().pipe(new StringifyStream()).pipe(outputStream);
  });
 }

 #processAdditionalStreams() {
  const outputTasks = [];
  if (PROCESS_FIELDS_CONTENT) { //needs to generate fields.json file
   outputTasks.push(this.#generateFieldsTypesStream());
  }
  if (PROCESS_RAW_TEXT_CONTENT) { //needs to generate content.txt file
   outputTasks.push(this.#generateRawTextContentStream());
  }
  if (PROCESS_MERGE_BROKEN_TEXT_BLOCKS) { //needs to generate json file with merged broken text blocks
   outputTasks.push(this.#generateMergedTextBlocksStream());
  }
  return Promise.allSettled(outputTasks);
 }

Note: if the primary JSON parsing throws an exception, none of the additional streams will be processed. See p2jcmd.js for more details.

API Reference

events:
- pdfParser_dataError: raised when parsing fails
- pdfParser_dataReady: raised when parsing succeeds
alternative events: (v2.0.0)
- readable: first event dispatched after the PDF file metadata is parsed and before any page is processed
- data: emitted for each successfully parsed page; null means the last page has been processed and signals the end of the data stream
- error: an exception or error occurred

Start parsing a PDF file from the specified file path asynchronously:

        function loadPDF(pdfFilePath);

On failure, event "pdfParser_dataError" will be raised with the error object: {"parserError": errObj}. On success, event "pdfParser_dataReady" will be raised with the output data object: {"formImage": parseOutput}, which can be saved as a JSON file (on the command line) or serialized to JSON when running in a web service. Note: "formImage" is removed from v2.0.0, see breaking changes for details.

Get all textual content from "pdfParser_dataReady" event handler:

        function getRawTextContent();

Returns the text as a string.

Get all input fields information from "pdfParser_dataReady" event handler:

        function getAllFieldsTypes();

returns an array of field objects.

Output format Reference

Current parsed data has four main sub objects to describe the PDF document.

'Transcoder': pdf2json version number
'Agency': the main text identifier for the PDF document. If Id.AgencyId present, it'll be same, otherwise it'll be set as document title; (deprecated since v2.0.0, see notes below)
'Id': the XML metadata that is embedded in the PDF document (deprecated since v2.0.0, see notes below)
- all forms attributes metadata are defined in "Custom" tab of "Document Properties" dialog in Acrobat Pro;
- v0.1.22 added support for the following custom properties:
  - AgencyId: default "unknown";
  - Name: default "unknown";
  - MC: default false;
  - Max: default -1;
  - Parent: parent name, default "unknown";
- v2.0.0: 'Agency' and 'Id' are replaced with full metadata, example: for ./test/pdf/fd/form/F1040.pdf, full metadata is:

Meta: {
 PDFFormatVersion: '1.7',
 IsAcroFormPresent: true,
 IsXFAPresent: false,
 Author: 'SE:W:CAR:MP',
 Subject: 'U.S. Individual Income Tax Return',
 Creator: 'Adobe Acrobat Pro 10.1.8',
 Producer: 'Adobe Acrobat Pro 10.1.8',
 CreationDate: "D:20131203133943-08'00'",
 ModDate: "D:20140131180702-08'00'",
 Metadata: {
  'xmp:modifydate': '2014-01-31T18:07:02-08:00',
  'xmp:createdate': '2013-12-03T13:39:43-08:00',
  'xmp:metadatadate': '2014-01-31T18:07:02-08:00',
  'xmp:creatortool': 'Adobe Acrobat Pro 10.1.8',
  'dc:format': 'application/pdf',
  'dc:description': 'U.S. Individual Income Tax Return',
  'dc:creator': 'SE:W:CAR:MP',
  'xmpmm:documentid': 'uuid:4d81e082-7ef2-4df7-b07b-8190e5d3eadf',
  'xmpmm:instanceid': 'uuid:7ea96d1c-3d2f-284a-a469-f0f284a093de',
  'pdf:producer': 'Adobe Acrobat Pro 10.1.8',
  'adhocwf:state': '1',
  'adhocwf:version': '1.1'
 }
}

'Pages': array of 'Page' object that describes each page in the PDF, including sizes, lines, fills and texts within the page. More info about 'Page' object can be found at 'Page Object Reference' section
'Width': the PDF page width in page unit

Page object Reference

Each page object within 'Pages' array describes page elements and attributes with 5 main fields:

'Height': height of the page in page unit
'Width': width of the page in page unit, moved from root to page object in v2.0.0
'HLines': horizontal line array, each line has 'x', 'y' in relative coordinates for positioning, and 'w' for width, plus 'l' for length. Both width and length are in page unit
'VLines': vertical line array, each line has 'x', 'y' in relative coordinates for positioning, and 'w' for width, plus 'l' for length. Both width and length are in page unit;
- v0.4.3 added line color support. Default is 'black', otherwise set in 'clr' if found in color dictionary, or in the 'oc' field if not found in dictionary;
- v0.4.4 added dashed line support. Default is 'solid', if line style is dashed line, {dsh:1} is added to line object;
'Fills': an array of rectangular area with solid color fills, same as lines, each 'fill' object has 'x', 'y' in relative coordinates for positioning, 'w' and 'h' for width and height in page unit, plus 'clr' to reference a color with index in color dictionary. More info about 'color dictionary' can be found at 'Dictionary Reference' section.
'Texts': an array of text blocks with position, actual text and styling information:
- 'x' and 'y': relative coordinates for positioning
- 'clr': a color index in color dictionary, same 'clr' field as in 'Fill' object. If a color cannot be found in color dictionary, 'oc' field will be added to the field as 'original color' value.
- 'A': text alignment, including:
  - left
  - center
  - right
- 'R': an array of text run, each text run object has two main fields:
  - 'T': actual text
  - 'S': style index from style dictionary. More info about 'Style Dictionary' can be found at 'Dictionary Reference' section
  - 'TS': [fontFaceId, fontSize, 1/0 for bold, 1/0 for italic]

v0.4.5 added support for field attribute information defined in an external XML file. pdf2json will always try to load the field attributes XML file based on a file name convention (pdfFileName.pdf's field XML file must be named pdfFileName_fieldInfo.xml and be in the same directory). If found, the fields info will be injected.
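The naming convention can be expressed as a small path transformation. This sketch only illustrates the convention described above; fieldInfoXmlPath is a hypothetical name and does not reproduce how pdf2json resolves the file internally.

```javascript
// Hypothetical helper (not part of pdf2json): derive the field info XML path.
function fieldInfoXmlPath(pdfFilePath) {
  // Only paths ending in ".pdf" (any case) have a companion XML file name.
  if (!/\.pdf$/i.test(pdfFilePath)) return null;
  return pdfFilePath.replace(/\.pdf$/i, "_fieldInfo.xml");
}
```

Since only the extension is replaced, the XML path stays in the same directory as the PDF.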

Dictionary Reference

For the same reason the 'Page' object has "HLines" and "VLines" arrays, the color and style dictionaries help reduce the size of the payload when transporting the parsing object over the wire. This dictionary data contract design allows the output to reference just a dictionary key, rather than the full definition of a color or font style. It does require the client of the payload to have the same dictionary definitions to make sense of it when rendering the parser output on screen.

Color Dictionary

const kColors = [
 "#000000", // 0
 "#ffffff", // 1
 "#4c4c4c", // 2
 "#808080", // 3
 "#999999", // 4
 "#c0c0c0", // 5
 "#cccccc", // 6
 "#e5e5e5", // 7
 "#f2f2f2", // 8
 "#008000", // 9
 "#00ff00", // 10
 "#bfffa0", // 11
 "#ffd629", // 12
 "#ff99cc", // 13
 "#004080", // 14
 "#9fc0e1", // 15
 "#5580ff", // 16
 "#a9c9fa", // 17
 "#ff0080", // 18
 "#800080", // 19
 "#ffbfff", // 20
 "#e45b21", // 21
 "#ffbfaa", // 22
 "#008080", // 23
 "#ff0000", // 24
 "#fdc59f", // 25
 "#808000", // 26
 "#bfbf00", // 27
 "#824100", // 28
 "#007256", // 29
 "#008000", // 30
 "#000080", // Last + 1
 "#008080", // Last + 2
 "#800080", // Last + 3
 "#ff0000", // Last + 4
 "#0000ff", // Last + 5
 "#008000", // Last + 6
 "#000000", // Last + 7
];

Style Dictionary:

const kFontFaces = [
 "QuickType,Arial,Helvetica,sans-serif", // 00 - QuickType - sans-serif variable font
 "QuickType Condensed,Arial Narrow,Arial,Helvetica,sans-serif", // 01 - QuickType Condensed - thin sans-serif variable font
 "QuickTypePi", // 02 - QuickType Pi
 "QuickType Mono,Courier New,Courier,monospace", // 03 - QuickType Mono - san-serif fixed font
 "OCR-A,Courier New,Courier,monospace", // 04 - OCR-A - OCR readable san-serif fixed font
 "OCR B MT,Courier New,Courier,monospace", // 05 - OCR-B MT - OCR readable san-serif fixed font
];

const kFontStyles = [
 // Face  Size Bold Italic  StyleID(Comment)
 // ----- ---- ---- -----  -----------------
 [0, 6, 0, 0], //00
 [0, 8, 0, 0], //01
 [0, 10, 0, 0], //02
 [0, 12, 0, 0], //03
 [0, 14, 0, 0], //04
 [0, 18, 0, 0], //05
 [0, 6, 1, 0], //06
 [0, 8, 1, 0], //07
 [0, 10, 1, 0], //08
 [0, 12, 1, 0], //09
 [0, 14, 1, 0], //10
 [0, 18, 1, 0], //11
 [0, 6, 0, 1], //12
 [0, 8, 0, 1], //13
 [0, 10, 0, 1], //14
 [0, 12, 0, 1], //15
 [0, 14, 0, 1], //16
 [0, 18, 0, 1], //17
 [0, 6, 1, 1], //18
 [0, 8, 1, 1], //19
 [0, 10, 1, 1], //20
 [0, 12, 1, 1], //21
 [0, 14, 1, 1], //22
 [0, 18, 1, 1], //23
 [1, 6, 0, 0], //24
 [1, 8, 0, 0], //25
 [1, 10, 0, 0], //26
 [1, 12, 0, 0], //27
 [1, 14, 0, 0], //28
 [1, 18, 0, 0], //29
 [1, 6, 1, 0], //30
 [1, 8, 1, 0], //31
 [1, 10, 1, 0], //32
 [1, 12, 1, 0], //33
 [1, 14, 1, 0], //34
 [1, 18, 1, 0], //35
 [1, 6, 0, 1], //36
 [1, 8, 0, 1], //37
 [1, 10, 0, 1], //38
 [1, 12, 0, 1], //39
 [1, 14, 0, 1], //40
 [1, 18, 0, 1], //41
 [2, 8, 0, 0], //42
 [2, 10, 0, 0], //43
 [2, 12, 0, 0], //44
 [2, 14, 0, 0], //45
 [2, 12, 0, 0], //46
 [3, 8, 0, 0], //47
 [3, 10, 0, 0], //48
 [3, 12, 0, 0], //49
 [4, 12, 0, 0], //50
 [0, 9, 0, 0], //51
 [0, 9, 1, 0], //52
 [0, 9, 0, 1], //53
 [0, 9, 1, 1], //54
 [1, 9, 0, 0], //55
 [1, 9, 1, 0], //56
 [1, 9, 1, 1], //57
 [4, 10, 0, 0], //58
 [5, 10, 0, 0], //59
 [5, 12, 0, 0], //60
];

v2.0.0: to access these dictionaries programmatically, do either

import { kColors, kFontFaces, kFontStyles } from "./lib/pdfconst.js"; // <-- pre 3.1.0
import { kColors, kFontFaces, kFontStyles } from "pdf2json"; // <-- since 3.1.0

or via public static getters of PDFParser:

console.dir(PDFParser.colorDict);
console.dir(PDFParser.fontFaceDict);
console.dir(PDFParser.fontStyleDict);

Interactive Forms Elements

v0.1.5 added interactive forms element parsing, including text input, radio button, check box, link button and drop down list.

Interactive forms can be created and edited in Acrobat Pro for AcroForm, or in LiveCycle Designer ES for XFA forms. Current implementation for buttons only supports "link button": when clicked, it'll launch a URL specified in button properties. Examples can be found at f1040ezt.pdf file under test/data folder.

All interactive form elements' parsing output will be part of the 'Page' object they belong to. Radio buttons and check boxes are in the 'Boxsets' array, while all other element objects are part of the 'Fields' array.

Each object within 'Boxsets' can be either a checkbox or a radio button group. The only difference is that a radio button object has more than one element in its 'boxes' array, which indicates it's a radio button group. The following sample output illustrates one checkbox ( Id: F8888 ) and one radio button group ( Id: ACC ) in the 'Boxsets' array:

Boxsets: [
{//first element, check box
 boxes: [ //only one box object in this array
 {
  x: 47,
  y: 40,
  w: 3,
  h: 1,
  style: 48,
  TI: 39,
  AM: 4,
  id: {
   Id: "F8888",
  },
  T: {
   Name: "box"
  }
 }
 ],
 id: {
  Id: "A446",
 }
},//end of first element
{//second element, radio button group
 boxes: [// has two box elements in boxes array
 {
  x: 54,
  y: 41,
  w: 3,
  h: 1,
  style: 48,
  TI: 43,
  AM: 132,
  id: {
   Id: "ACCC",
  },
  T: {
   Name: "box"
  }
 },
 {
  x: 67,
  y: 41,
  w: 3,
  h: 1,
  style: 48,
  TI: 44,
  AM: 132,
  id: {
   Id: "ACCS",
   EN: 0
  },
  T: {
   Name: "box"
  }
 }
 ],
 id: {
  Id: "ACC",
  EN: 0
 }
}//end of second element
] //end of Boxsets array
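The checkbox versus radio group rule above can be applied mechanically. A minimal sketch, assuming the 'Boxsets' shape shown in the sample; describeBoxset is a hypothetical helper name.

```javascript
// Hypothetical helper (not part of pdf2json): classify one 'Boxsets' entry.
function describeBoxset(boxset) {
  const boxes = boxset.boxes ?? [];
  return {
    id: boxset.id?.Id,
    // More than one box in the 'boxes' array means a radio button group.
    kind: boxes.length > 1 ? "radio" : "checkbox",
    boxIds: boxes.map((box) => box.id?.Id),
  };
}
```

Applied to a page, `page.Boxsets.map(describeBoxset)` would list each group with its kind and the Ids of its boxes.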

The 'Fields' array contains parsed objects for text input (Name: 'alpha'), drop down list (Name: 'alpha', but has a 'PL' object which contains the label array in 'PL.D' and the value array in 'PL.V'), and link button (Name: 'link', the linked URL is in the 'FL.form.Id' field). Some examples:

Text input box example:

{
 style: 48,
 T: {
  Name: "alpha",
  TypeInfo: { }
 },
 id: {
  Id: "p1_t40",
  EN: 0
 },
 TU: "alternative text", //for accessibility, added only when available from PDF stream. (v0.3.6).
 TI: 0,
 x: 6.19,
 y: 5.15,
 w: 30.94,
 h: 0.85,
 V: "field value" //only available when the text input box has value
},

Note: v0.7.0 extends TU (Alternative Text) to all interactive fields to better support accessibility.

Drop down list box example:

{
 x: 60,
 y: 11,
 w: 4,
 h: 1,
 style: 48,
 TI: 13,
 AM: 388,
 mxL: 2,
 id: {
  Id: "ST",
  EN: 0
 },
 T: {
  Name: "alpha",
  TypeInfo: {
  }
 },
 PL: {
  V: [
   "",
   "AL",
   "AK"
  ],
  D: [
  "%28no%20entry%29",
  "Alabama",
  "Alaska"
  ]
 }
}
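A client could pair the two 'PL' arrays into option objects. The sketch below assumes that 'PL.V' and 'PL.D' are parallel arrays in the same order and that labels are URI-encoded, as the sample suggests; dropdownOptions is a hypothetical helper name.

```javascript
// Hypothetical helper (not part of pdf2json): build { value, label } options.
function dropdownOptions(field) {
  const values = field.PL?.V ?? [];
  const labels = field.PL?.D ?? [];
  return values.map((value, i) => ({
    value,
    // Labels such as "%28no%20entry%29" are decoded for display.
    label: decodeURIComponent(labels[i] ?? ""),
  }));
}
```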

Link button example:

{
 style: 48,
 T: {
  Name: "link"
 },
 FL: {form: {Id: "http://www.github.com"}},
 id: {
  Id: "quad8",
  EN: 0
 },
 TI: 0,
 x: 52.35,
 y: 28.35,
 w: 8.88,
 h: 0.85
}

v0.2.2 added support for the "field attribute mask", which is common to all fields. The form author can set it in Acrobat Pro's Form Editing mode: if a field is ReadOnly, its AM field will be set as 0x00000400, otherwise AM will be set as 0.

Another supported field attribute is "required": when the form author marks a field as "required" in Acrobat, the parsing result for 'AM' will be set as 0x00000010.

"Read-Only" field attribute mask example:

{
 style: 48,
 T: {
  Name: "alpha",
  TypeInfo: { }
 },
 id: {
  Id: "p1_t40",
  EN: 0
 },
 TI: 0,
 AM: 1024, //If (AM & 0x00000400) is set, it indicates this is a read-only field
 x: 6.19,
 y: 5.15,
 w: 30.94,
 h: 0.85
}
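Since 'AM' is a bit mask, the two documented attributes can be tested with bitwise AND. This sketch covers only the read-only and required bits named above; other bits that appear in sample outputs (for example AM: 132) are not documented here and are left uninterpreted. The constant and function names are hypothetical.

```javascript
// Bit values taken from the text above; the names are hypothetical.
const AM_READ_ONLY = 0x00000400;
const AM_REQUIRED = 0x00000010;

// Hypothetical helper (not part of pdf2json): decode the documented AM bits.
function decodeAttributeMask(am = 0) {
  return {
    readOnly: (am & AM_READ_ONLY) !== 0,
    required: (am & AM_REQUIRED) !== 0,
  };
}
```

A field without an 'AM' property is treated as neither read-only nor required.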

v2.X.X added support for the signature form element (Name: 'signature'). If the field has been signed, the 'Sig' property will be present, and will contain any of the following signature details if available:

'Name' - Signer's name
'M' - Time of signing in ISO 8601 format
'Location' - Location of signing
'Reason' - Reason for signing
'ContactInfo' - Signer's contact information

Signature example:

{
 style: 48,
 T: {
  Name: "signature",
  TypeInfo: {}
 },
 id: {
  Id: "SignatureFormField_1",
  EN: 0
 },
 TI: 0,
 AM: 16,
 x: 5.506,
 y: 31.394,
 w: 14.367,
 h: 4.241,
 Sig: {
  Name: "Signer Name",
  M: "2022-03-15T19:17:34-04:00"
 }
}

Text Input Field Formatter Types

v0.1.8 added text input field formatter types detection for

number
ssn
date (tested date formatter: mm/dd/yyyy, mm/dd, mm/yyyy and Custom yyyy)
zip
phone
percent (added v0.5.6)

v0.3.9 added "arbitrary mask" (in "special" format category) support: the input field format type is "mask" and the mask string is added as "MV". Its value can be found at Format => Special => Arbitrary Mask in Acrobat. Some examples of the "mask" format include:

9999: 4 digit PIN field
99999: 5 digit PIN field
99-9999999: formatted 9 digit EIN number
999999999: 9 digit routing number
aaa: 3 letters input

Additionally, the "arbitrary mask" length is extended from 1 character to 64 characters. When the mask has only one character, it has the following meanings:

a: alphabet only input, no numeric input allowed
n: numeric only input, no locale based number formatting, no alphabet or special characters allowed
d: numeric only input, with locale based number formatting, one decimal point allowed, no rounding expected and no alphabet or special characters allowed
-: negative number only, with locale based number formatting, no alphabet or special characters allowed
+: positive number only, with locale based number formatting, no alphabet or special characters allowed

v0.4.1 added more date format detection, these formats are set in Acrobat's field's Properties => Format => Date => Custom:

yyyy: 4 digit year

The types above are detected only when the widget field type is "Tx" and the additional-actions dictionary 'AA' is set. As you can see, not all pre-defined formatters and special formatters are supported. If you need more support, you can extend the 'processFieldAttribute' function in the core.js file.

For the supported types, the result data is set to the field item's T object. Example of a 'number' field in final json output:

{
 style: 48,
 T: {
  Name: "number",
  TypeInfo: { }
 },
 id: {
  Id: "FAGI",
  EN: 0
 },
 TI: 0,
 x: 68.35,
 y: 22.43,
 w: 21.77,
 h: 1.08
},

Another example of 'date' field:

{
 style: 48,
 T: {
  Name: "date",
  TypeInfo: { }
 },
 id: {
  Id: "Your Birth Date",
  EN: 0
 },
 TI: 0,
 x: 33.43,
 y: 20.78,
 w: 5.99,
 h: 0.89
},

Text Style data without Style Dictionary

v0.1.11 added text style information in addition to the style dictionary. As discussed earlier, the idea of the style dictionary is to keep the parsing result payload compact, but I found that the limited dictionary entries for font (face, size) and style (bold, italic) cannot cover the majority of text content in PDFs. Because some styles are matched to the closest dictionary entry, the client rendering will have misaligned, gapped or overlapped text. To solve this problem, pdf2json v0.1.11 extends the dictionary approach: all previous dictionary entries stay the same, but the parsing result will not try to match the closest style entry; instead, the exact text style will be returned in a TS field.

When the actual text style doesn't match any pre-defined style dictionary entry, the text style ID (S field) will be set as -1. The actual text style will be set in a new field (TS), with or without a matched style dictionary entry ID. This means that if your client renderer works with pdf2json v0.1.11 and later, the style dictionary ID can be ignored. Otherwise, a previous client renderer can still work with the style dictionary ID.

The new TS field is an Array with the format:

First element in TS Array is Font Face ID (integer)
Second element is Font Size (px)
Third is 1 when font weight is bold, otherwise 0
Fourth is 1 when font style is italic, otherwise 0

For example, the following is a text block data in the parsing result:

{
 x: 7.11,
 y: 2.47,
 w: 1.6,
 clr: 0,
 A: "left",
 R: [
  {
   T: "Modesty%20PDF%20Parser%20NodeJS",
   S: -1,
   TS: [0, 15, 1, 0]
  }
 ]
},

The text is "Modesty PDF Parser NodeJS", text style dictionary entry ID is -1 (S field, meaning no match), and its Font Face ID is 0 (TS[0], "QuickType,Arial,Helvetica,sans-serif"), Font Size is 15px (TS[1]), Font weight is bold (TS[2]) and font style is normal (TS[3]).
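A renderer could translate a TS array into CSS-like font properties. The sketch below copies the font face table from the Style Dictionary section so that it runs on its own; runToFont is a hypothetical helper name, and the property names in its result are illustrative rather than a format defined by pdf2json.

```javascript
// Copied from the Style Dictionary section so the example is self-contained.
const fontFaces = [
  "QuickType,Arial,Helvetica,sans-serif",
  "QuickType Condensed,Arial Narrow,Arial,Helvetica,sans-serif",
  "QuickTypePi",
  "QuickType Mono,Courier New,Courier,monospace",
  "OCR-A,Courier New,Courier,monospace",
  "OCR B MT,Courier New,Courier,monospace",
];

// Hypothetical helper (not part of pdf2json): TS = [faceId, size, bold, italic].
function runToFont(run) {
  const [faceId, size, bold, italic] = run.TS;
  return {
    fontFamily: fontFaces[faceId],
    fontSize: `${size}px`,
    fontWeight: bold ? "bold" : "normal",
    fontStyle: italic ? "italic" : "normal",
  };
}
```

With the packaged module, the table could come from PDFParser.fontFaceDict instead of a local copy.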

Note: (v0.3.7) When a color is not in the color dictionary, "clr" value will be set to -1. The item's (fills and text) original color in hex string format will be added to the "oc" field. In other words, "oc" exists if and only if "clr" is -1.
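The "clr" and "oc" rule can be turned into a lookup. In this sketch resolveColor is a hypothetical helper and the color table is passed in as a parameter, so the example stays short; in practice it could be the color dictionary listed earlier.

```javascript
// Hypothetical helper (not part of pdf2json): hex color for a fill or text item.
function resolveColor(item, colorDict) {
  // clr === -1 means "not in the dictionary": the original color is in 'oc'.
  if (item.clr === -1) return item.oc;
  return colorDict[item.clr];
}
```

For example, `resolveColor(textBlock, PDFParser.colorDict)` would return "#000000" for a block with clr: 0.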

Rotated Text Support

v0.1.13 added the text rotation value (in degrees) to the R array's object, if and only if the text rotation angle is not 0. For example, if text is not rotated, the parsed output would be the same as above. When the rotation angle is 90 degrees, the R array object would be extended with the "RA" field:

{
 x: 7.11,
 y: 2.47,
 w: 1.6,
 clr: 0,
 A: "left",
 R: [
  {
   T: "Modesty%20PDF%20Parser%20NodeJS",
   S: -1,
   TS: [0, 15, 1, 0],
   RA: 90
  }
 ]
},

Notes

pdf.js is designed and implemented to run within browsers that have HTML5 support. It has some dependencies that are only available in a browser's JavaScript runtime, including:

XHR Level 2 (for Ajax)
DOMParser (for parsing embedded XML from PDF)
Web Worker (to enable parsing work run in a separated thread)
Canvas (to draw lines, fills, colors, shapes in browser)
Others (like web fonts, canvas image, DOM manipulations, etc.)

In order to run pdf.js in Node.js, we have to address those dependencies and also extend/modify the fork of pdf.js. Below is some of the work implemented in this pdf2json module to enable pdf.js to run with Node.js:

Global Variables
- pdf.js' global objects (like PDFJS and globalScope) need to be wrapped in a node module's scope
API Dependencies
- XHR Level 2: I don't need XMLHttpRequest to load PDF asynchronously in node.js, so replaced it with node's fs (File System) to load PDF file based on request parameters;
- DOMParser: pdf.js instantiates DOMParser to parse XML based PDF meta data, I used xmldom node module to replace this browser JS library dependency. xmldom can be found at https://github.com/xmldom/xmldom;
- Web Worker: pdf.js has "fake worker" code built in, so not much work needs to be done; just be aware that the parsing occurs in the same thread, not in a background worker thread;
- Canvas: in order to keep the pdf.js code intact as much as possible, I decided to create an HTML5 Canvas API implementation in a node module. It's named 'PDFCanvas' and has the same API as HTML5 Canvas does, so there is no change in pdf.js' canvas.js file; we just need to replace the browser's Canvas API with PDFCanvas. This way, when the 2D context API is invoked, PDFCanvas just writes to a JS object based on the JSON format above, rather than drawing graphics on an HTML5 canvas;
Extend/Modify pdf.js
- Fonts: no need to call ensureFonts to make sure fonts downloaded, only need to parse out font info in CSS font format to be used in json's texts array.
- DOM: all DOM manipulation code in pdf.js are commented out, including creating canvas and div for screen rendering and font downloading purpose.
- Interactive Forms elements: (in process to support them)
- Leave out the support to embedded images

After the changes and extensions listed above, this pdf2json node.js module will work either in a server environment (I have a RESTful web service built with restify and pdf2json, and it's been running on an Amazon EC2 instance) or as a standalone command line tool (something similar to the Vows unit tests).

More porting notes can be found at Porting and Extending PDFJS to NodeJS.

Known Issues

This pdf2json module's output does not map 100% from PDF definitions. Some gaps are due to the limited time I currently have, others result from the 'dictionary' concept for the output. Accepting these known issues or unsupported features in the current implementation allows me to contribute back to the open source community with the most important features implemented, while leaving some room for improvement in the future. All unsupported features listed below can be resolved technically one way or another, if your use case really requires them:

Embedded content:
- All embedded content is ignored; the current implementation focuses on static content and interactive forms. Unsupported PDF embedded content includes 'Images', 'Fonts' and other dynamic content;
Text and Form Styles:
- text and form element styles have partial support. This means when you have a client side renderer (say an HTML5 canvas or SVG renderer), the PDF content may not look exactly the same as how Acrobat renders it. The reason is that we've used the "style dictionary" in order to reduce the payload size over the wire, while the "style dictionary" doesn't have all styles defined. This sort of partial support can be resolved by extending those 'style dictionaries'. Primary text style issues include:
  - Font face: limited to the font families defined in the style dictionary
  - Font size: limited to 6, 8, 10, 12, 14, 18 that are defined in the style dictionary; all other font sizes are mapped to the closest size. For example: when a PDF defines a 7px sized font, the size will be mapped to 8px in the output;
  - Color: both font colors and fill colors are limited to the entries in the color dictionary
  - Style combinations: when a style combination is not supported, say a particular mix of size, face, bold and italic, the closest entry will be selected in the output;
- Note: v0.1.11 started to add support for actual font style (size, bold, italic), but still no full support on font family;
Text positioning and spacing:
- Since embedded fonts and font styles are only honored if they are defined in the style dictionary, the final output may have noticeable word positioning and spacing issues when they are not defined there. I also found that even with specific font style support (added in v0.1.11), because the PDF text object data stream is sometimes broken up into multiple blocks in the middle of a word, and text position is calculated based on the font settings, we still see some word breaking and extra spaces when rendering the parsed JSON data in a browser (HTML5 canvas and IE's SVG).
User input data in form element:
- As for interactive forms elements, their type, positions, sizes, limited styles and control data are all parsed and served in output, but user interactive data are not parsed, including radio button selection, checkbox status, text input box value, etc., these values should be handled in client renderer as part of user data, so that we can treat parsed PDF data as form template.

Run As a Commandline Utility

v0.1.15 added the capability to run pdf2json as a command line tool. It covers the use case where running the parser as a web service is not strictly necessary but transcoding local PDF files to JSON is still desired. In some use cases the PDF files are relatively stable and rarely updated, so parsing them in a web service would produce the same JSON payload every time. In this case, it's better to run pdf2json as a command line tool to pre-process those PDF files and deploy the resulting JSON files onto the web server. The client side form renderer can work in the same way as before, while eliminating the server side processing to achieve higher scalability.

This command line utility is added as an extension; it doesn't break the previous functionality of running in a web service context. In my real project, I have a web service written in restify.js to run pdf2json with a RESTful web service interface, and I also need to pre-process some local static PDFs through the command line tool without changing the actual pdf2json module code.

To use the command line utility to transcode a folder or a file:

node pdf2json.js -f [input directory or pdf file]

When -f is a PDF file, it'll be converted to json file with the same name and saved in the same directory. If -f is a directory, it'll scan all ".pdf" files within the specified directory to transcode them one by one.

Optionally, you can specify the output directory with -o:

node pdf2json.js -f [input directory or pdf file] -o [output directory]

The output directory must exist, otherwise, it'll exit with an error.
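The path rules described above can be sketched in a few lines of Node.js. This is only an illustration under stated assumptions: resolveJsonPath is a hypothetical helper, not a function exported by pdf2json, and the error message text is made up.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper that mirrors the rules described above:
// - without an output directory, the JSON file sits next to the PDF;
// - with an output directory, that directory must already exist.
function resolveJsonPath(pdfPath, outputDir) {
  let dir = path.dirname(pdfPath);
  if (outputDir !== undefined) {
    if (!fs.existsSync(outputDir) || !fs.statSync(outputDir).isDirectory()) {
      throw new Error("Output directory does not exist: " + outputDir);
    }
    dir = outputDir;
  }
  // Same file name as the PDF, with the extension swapped to .json.
  const baseName = path.basename(pdfPath, path.extname(pdfPath));
  return path.join(dir, baseName + ".json");
}

console.log(resolveJsonPath("./forms/f1040.pdf"));
console.log(resolveJsonPath("./forms/f1040.pdf", process.cwd()));
```

Checking the directory up front keeps the failure early and explicit, which matches the "exit with an error" behavior described above.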

Additionally, you can use -v or --version to show the version number, or -h to display more help info.

Note

v0.2.1 added the ability to run pdf2json directly from the command line without specifying "node" and the path of pdf2json. To run this self-executable in command line, first install pdf2json globally:

npm install pdf2json -g

Then run it in command line:

pdf2json -f [input directory or pdf file]

pdf2json -f [input directory or pdf file] -o [output directory]

v0.5.4 added the "-s" or "--silent" command line argument to suppress informative logging output. When pdf2json is used as a command line tool, the default verbosity is 5 (INFOS); when it runs as a web service, the default verbosity is 9 (ERRORS). Examples to suppress logging info from the command line:

pdf2json -f [input directory or pdf file] -o [output directory] -s

pdf2json -f [input directory or pdf file] -o [output directory] --silent

Examples to turn on logging info in web service:

var pdfParser = new PDFParser();
...
pdfParser.loadPDF(pdfFilePath, 5);

v0.5.7 added the capability to skip input PDF files if filename begins with any one of "!@#$%^&*()+=[]\';,/{}|":<>?~`.-_ ", usually these files are created by PDF authoring tools as backup files.
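To make the skip rule concrete, here is a minimal sketch of a first-character check. shouldSkipPdf is a hypothetical helper written for illustration, not pdf2json's own implementation; only the character list is taken from the note above.

```javascript
// Characters listed in the v0.5.7 note above. A file whose name starts with
// any one of them is treated as a backup file and skipped.
const SKIP_PREFIX_CHARS = "!@#$%^&*()+=[]\\';,/{}|\":<>?~`.-_ ";

// Hypothetical helper, not part of pdf2json's API.
function shouldSkipPdf(fileName) {
  return fileName.length > 0 && SKIP_PREFIX_CHARS.includes(fileName[0]);
}

const candidates = ["f1040.pdf", "~f1040.pdf", ".hidden.pdf", "_draft.pdf"];
const toParse = candidates.filter((name) => !shouldSkipPdf(name));
console.log(toParse); // [ 'f1040.pdf' ]
```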

v0.6.2 added "-t" command line argument to generate fields json file in addition to parsed json. The fields json file will contain one Array which contains fieldInfo object for each field, and each fieldInfo object will have 4 fields:

id: field ID
type: string name of field type, like radio, alpha, etc
calc: true if read only, otherwise false
value: initial value of the field

Example of fields.json content:

[
 {"id":"ADDRCH","type":"alpha","calc":false,"value":"user input data"},
 {"id":"FSRB","type":"radio","calc":false,"value":"Single"},
 {"id":"APPROVED","type":"alpha","calc":true,"value":"Approved Form"}
...
]

The fields.json output can be used to validate fields IDs with other data source, and/or to extract data value from user submitted PDFs.
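As a rough illustration of both uses, the sketch below checks field IDs against a list and collects the values of editable fields. The sample array is copied from the example above; expectedIds and the ID "SSN" are made-up stand-ins for whatever other data source you validate against.

```javascript
// Sample content copied from the fields.json example above.
const fields = [
  { id: "ADDRCH", type: "alpha", calc: false, value: "user input data" },
  { id: "FSRB", type: "radio", calc: false, value: "Single" },
  { id: "APPROVED", type: "alpha", calc: true, value: "Approved Form" },
];

// Hypothetical list of IDs expected by another data source.
const expectedIds = ["ADDRCH", "FSRB", "SSN"];

// 1. Validate field IDs: report expected IDs that the PDF does not contain.
const presentIds = new Set(fields.map((field) => field.id));
const missingIds = expectedIds.filter((id) => !presentIds.has(id));

// 2. Extract data values: keep editable fields (calc === false), keyed by ID.
const userValues = {};
for (const field of fields) {
  if (!field.calc) userValues[field.id] = field.value;
}

console.log(missingIds); // [ 'SSN' ]
console.log(userValues);
```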

v0.6.8 added the "-c" or "--content" command line argument to extract raw text content from a PDF. The text is written to a separate output file named (pdf_file_name).content.txt. If all you need is the textual content of the PDF, "-c" essentially converts PDF to text; of course, all formatting and styling will be lost.

Run Unit Test (commandline)

It takes less than 1 minute for pdf2json to parse the 261 PDFs under the test/pdf directory; usually it takes about 40 seconds to parse all of them. Besides the primary JSON for each PDF, it also generates a text content JSON file and a form fields JSON file (via the -c and -t parameters) for further testing.

The 265 PDFs are all fill-able tax forms from government agencies for tax year 2013, including 165 federal forms, 23 efile instructions and 9 other state tax forms.

A shell script is the current driver for unit tests. To parse one agency's PDFs, run this command:

 cd test
 sh p2f.one.sh [2_character_agency_name]

For example, to parse and generate all 165 federal forms together with text content and forms fields:

 cd test
 sh p2f.one.sh fd

To parse and generate all VA forms together with text content and forms fields:

 cd test
 sh p2f.one.sh va

Additionally, to parse all 261 PDFs from commandline:

 cd test
 sh p2f.forms.sh

Or, from npm scripts:

 npm test

Some testing PDFs are provided by bug reporters, like the "unsupported encryption" (#43), "read property num from undefined" (#26), and "excessive line breaks in text content" (#28), their PDFs are all stored in test/pdf/misc directory. To run tests against these community contributed PDFs, run commandline:

 npm run-script test-misc

Upgrade to ~v1.x.x

If you have an early version of pdf2json, please remove your local node_modules directory and re-run npm install to upgrade to pdf2json@1.x.x.

v1.x.x upgraded dependency packages, removed some unnecessary dependencies, and started to assume ES6 / ES2015 with node ~v4.x. More PDFs were added for unit testing.

Note: pdf2json has been in production for over 3 years. It's pretty reliable and solid when parsing hundreds (sometimes tens of thousands) of PDF forms every day, thanks to everybody's help.

Starting with v1.0.3, I'm trying to address a long overdue, annoying problem with broken text blocks. It's the biggest problem hindering the efficiency of PDF content creation in our projects. Although the root cause lies in the original PDF streams, the client doesn't render JSON character by character, so the problem often appears in the final rendered web content. We had to work around it by manually merging those text blocks. With the solution in v1.0.x, the need for manual text block merging is greatly reduced.

The solution is to add a post-parsing stage that identifies and auto-merges those adjacent blocks. It's not ideal, but it works in most of my tests with the 261 PDFs under the test directory.

The auto merge solution still needs some fine tuning, so I'm keeping it as an experimental feature for now. It's off by default and can be turned on with the "-m" switch on the command line.

In order to support this auto merging capability, text block objects have an additional "sw" (space width of the font) property together with x, y, clr and R. If you have a more effective usage of this new property for merging text blocks, please drop me a line.
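To show how "sw" might be used, here is one possible merging heuristic. It is a sketch under loud assumptions, not the algorithm pdf2json runs with "-m": mergeAdjacentTexts is a hypothetical helper, the sample blocks are invented, and it assumes x, w and sw are expressed in the same unit, which may not hold for real output.

```javascript
// Hypothetical helper: merges text blocks that sit on the same line and are
// separated by less than one space width. ASSUMPTION: x, w and sw share a unit.
function mergeAdjacentTexts(texts, yTolerance = 0.01) {
  // Sort top to bottom, then left to right, without touching the input array.
  const sorted = [...texts].sort((a, b) => a.y - b.y || a.x - b.x);
  const merged = [];
  for (const block of sorted) {
    const prev = merged[merged.length - 1];
    const sameLine = prev !== undefined && Math.abs(prev.y - block.y) <= yTolerance;
    const gap = prev !== undefined ? block.x - (prev.x + prev.w) : Infinity;
    if (sameLine && gap < prev.sw) {
      // Gap narrower than a space: treat it as one word split by the stream.
      prev.R[0].T += block.R[0].T;
      prev.w = block.x + block.w - prev.x;
    } else {
      // Copy the block and its runs so the caller's data stays unchanged.
      merged.push({ ...block, R: block.R.map((run) => ({ ...run })) });
    }
  }
  return merged;
}

// Invented sample: "trans" + "coding" nearly touch, "PDF" is far away.
const sampleTexts = [
  { x: 1.0, y: 5, w: 2.0, sw: 0.3, clr: 0, R: [{ T: "trans" }] },
  { x: 3.1, y: 5, w: 1.5, sw: 0.3, clr: 0, R: [{ T: "coding" }] },
  { x: 6.0, y: 5, w: 1.0, sw: 0.3, clr: 0, R: [{ T: "PDF" }] },
];
console.log(mergeAdjacentTexts(sampleTexts).map((t) => t.R[0].T));
```

Comparing the gap against the font's own space width, rather than a fixed constant, lets the threshold scale with font size.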

Breaking Changes:

- v1.1.4 unified the event data structure. This only matters if you handle these top level events; nothing changes if you use the command line:
  - event "pdfParser_dataError": {"parserError": errObj}
  - event "pdfParser_dataReady": {"formImage": parseOutput}. Note: "formImage" is removed from v2.0.0, see breaking changes for details.
- v1.0.8 fixed issue 27: it converts the x coordinate with the same ratio as y, which is 24 (96/4), rather than 8.7 (96/11). Please adjust your client renderer accordingly when positioning all elements' x coordinates.
- v2.0.0 output data fields Agency and Id are replaced with Meta, the JSON of the PDF's full metadata (see above for details). Each page object also gains a Width property besides Height.
- v3.0.0 converted CommonJS to ES Modules, plus dependency updates and other minor bug fixes. Please update your project configuration file to enable ES Modules before upgrading, e.g., in tsconfig.json, set "compilerOptions":{"module":"ESNext"}
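If client code has to cope with payloads from both before and after v2.0.0, a small adapter can hide the "formImage" and Width changes. The sketch below is an assumption-laden illustration: unwrapParseOutput and pageSizes are hypothetical helpers, and the two sample payloads are invented to match the shapes described in this list.

```javascript
// Hypothetical helper: returns the parse output whether or not the
// "pdfParser_dataReady" payload is wrapped in "formImage" (pre v2.0.0).
function unwrapParseOutput(pdfData) {
  return pdfData && pdfData.formImage ? pdfData.formImage : pdfData;
}

// Hypothetical helper: reads page sizes from either payload shape.
function pageSizes(pdfData) {
  const output = unwrapParseOutput(pdfData);
  return output.Pages.map((page) => ({
    // v2.0.0 moved Width from the root onto each page object.
    width: page.Width !== undefined ? page.Width : output.Width,
    height: page.Height,
  }));
}

// Invented sample payloads, for illustration only.
const v1Payload = { formImage: { Width: 37.25, Pages: [{ Height: 52.618 }] } };
const v2Payload = { Meta: {}, Pages: [{ Width: 37.25, Height: 52.618 }] };

console.log(pageSizes(v1Payload)); // [ { width: 37.25, height: 52.618 } ]
console.log(pageSizes(v2Payload)); // [ { width: 37.25, height: 52.618 } ]
```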

Major Refactoring

v2.0.0 is a major refactoring, the first since 2015. Primary updates include:
- Full PDF metadata support (see page format and breaking changes for details)
- Simplified root properties: besides the addition of Meta as a root property, the unnecessary "formImage" is removed from v2.0.0, and Width is moved from the root to each page object under Pages.
- Improved Stream support with the test npm run parse-r, plus new events added to PDF.js, including readable, data, end and error. These Readable Stream like events can optionally replace the custom events (pdfjs_parseDataReady and pdfjs_parseDataError). They offer more granular control over the data chunk flow, like readable with Meta and a data sequence for each PDF page result, instead of pdfjs_parseDataReady combining all pages in one shot. See ./lib/parserstream.js for more details
- Object with {clr:-1} (like HLines, VLines, Fills, etc.) is replaced with {oc: "#xxxxxx"}. If clr index value is valid, then oc (original color) field is removed.
- Greater performance, near ~20% improvements with PDFs under test directory
- Better exception handling, fixes a few uncaught exception errors
- More test coverage, 4 more test scripts added, see package.json for details
- Easier access to dictionaries, including color, font face and font style, see Dictionary reference section for details
- Refactor to ES6 class for major entry modules
- Dependencies removed: lodash, async and yargs
- Upgrade to Node v14.18.0 LTS
v3.0.0 converted commonJS to ES Modules
- v3.1.0 added build step to output both ES Module and CommonJS bundles
  - PDFParser is no longer the default export, it's a named export that requires changes to import statement.
  - tests are written in Jest
  - PRs require a GitHub workflow check, currently npm ci and npm test
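For renderers that consume the v2.0.0 color fields, the {clr} versus {oc} rule above can be handled with a small lookup. This is a sketch only: resolveColor is a hypothetical helper, and colorDict is a placeholder array standing in for the color dictionary described in the Dictionary reference section.

```javascript
// Placeholder dictionary for illustration; use the real color dictionary
// from the Dictionary reference section in actual code.
const colorDict = ["#000000", "#ffffff", "#4c4c4c"];

// Hypothetical helper: prefers the original color ("oc") when present,
// otherwise looks up the "clr" index in the dictionary.
function resolveColor(item, dict) {
  if (typeof item.oc === "string") return item.oc;
  const index = item.clr;
  if (Number.isInteger(index) && index >= 0 && index < dict.length) {
    return dict[index];
  }
  return undefined; // neither a valid index nor an original color
}

console.log(resolveColor({ x: 0, y: 0, w: 0, h: 0, clr: 1 }, colorDict)); // #ffffff
console.log(resolveColor({ x: 0, y: 0, w: 1, l: 2, oc: "#ff0000" }, colorDict)); // #ff0000
```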

Install on Ubuntu

Make sure nodejs is installed. Detailed installation steps can be found at http://stackoverflow.com/a/16303380/433814.

$ nodejs --version
v0.10.22

Create a symbolic link from node to nodejs

sudo rm -f /usr/sbin/node
sudo ln -s /usr/bin/nodejs /usr/sbin/node

Verify the version of node and installation

$ which node
/usr/sbin/node

$ node --version
v4.5.0

Proceed with the install of pdf2json as described above

$ npm install -g pdf2json
npm http GET https://registry.npmjs.org/pdf2json
npm http 304 https://registry.npmjs.org/pdf2json
/usr/bin/pdf2json -> /usr/lib/node_modules/pdf2json/bin/pdf2json
pdf2json@0.6.2 /usr/lib/node_modules/pdf2json

$ which pdf2json
/usr/bin/pdf2json

$ pdf2json --version
0.6.2

Run in a RESTful Web Service

More info can be found at Restful Web Service for pdf2json.

Contribution

By participating in this project, you are expected to honor the open code of conduct.

License

Licensed under the Apache License Version 2.0.

Support

I'm currently running this project in my spare time. Thanks to all for your stars and support.

pdf2json's Issues

StringifyStream is not defined

I used your stream example but am missing the StringifyStream:

request(pdfUrl).pipe(pdfParser).pipe(new StringifyStream())

how can I define / load it?

Content output gives line break on dash -

this is referring to -c, --content option - it's experimental but still needs bug reports too

Everywhere a dash character - appears in the document, it is replaced by a line break before and after itself.

To recreate, I used http://static.e-publishing.af.mil/production/1/af_sg/publication/afi41-210/afi41-210.pdf and command node pdf2json.js -f /home/user/afi...pdf -o /home/user -c on Debian.

Example:

ORIGINAL

If the data is stored on a facility-shared computer drive, the drive or data folder must be locked so unauthorized users are prevented from gaining access to the information.

OUTPUT

If the data is stored on a facility
-
shared computer drive, the drive or ...

Didn't see the issue already listed but if I'm duplicating someone or just using it incorrectly, please feel free to close.

PS - thank you so very much for this code - it's exactly what I've been looking for.

Page unit conversion to PDF points

@RichardLitt and I are also having a little trouble understanding the 'page unit', the coordinate convention and how that relates to PDF points (8.5" x 11" = 612 x 792 points). Can you provide a little clarification?

Boxsets stays empty

Hi,

I tried to use pdf2json with three different pdfs containing links to other websites.

But when I try, the boxsets returns empty.

This is my code :

var pdfParser = new PDFParser();

  pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError) );
  pdfParser.on("pdfParser_dataReady", pdfData => {
      for (var i = 0; i < pdfData.formImage.Pages.length; i++){
        console.log(pdfData.formImage.Pages[i].Boxsets) // why empty? Boxsets??
    }
  });

  pdfParser.loadPDF(pdf_path);

This is my test PDF: http://www74.zippyshare.com/d/MzUNluNF/7310663/test1.pdf

When I try to show pdfData.formImage.Pages[i].Boxsets, it always stays empty.

This is what i get :

{"Height":52.618,"HLines":[{"x":3.543,"y":10.757,"w":0.814,"l":1.529}],"VLines":[],"Fills":[{"x":0,"y":0,"w":0,"h":0,"clr":1},{"x":0,"y":-0.056,"w":37.25,"h":52.687,"clr":1}],"Texts":[{"x":3.313,"y":6.681,"w":17.597,"sw":null,"clr":0,"A":"left","R":[{"T":"TOTOTOTOTOTOOTOTOTOTOTOTOT","S":4,"TS":[0,14,0,0]}]},{"x":3.313,"y":9.931,"w":2.223,"sw":null,"clr":0,"A":"left","R":[{"T":"toto2","S":4,"TS":[0,14,0,0]}]}],"Fields":[],"Boxsets":[]}
any idea why?

Checkbox status is always false

I tried extracting the fields of a PDF which is already filled out with data. After extracting, I can get all available text fields along with the saved data in each field. But for checkboxes and radio buttons, the checked status is always false. Maybe I missed something?

Texts array empty on OSX not empty on CentOS

Any idea why this might be the case?

Store pdf canvas in the output json file

How to store pdf canvas in the output json file ?

thank you.

Cannot Read Property Num

'Warning: Unhandled rejection: TypeError: Cannot read property 'num' of undefined' at Obj.RefSetCache_has [as has] .....

The error occurs in base/core/objs.js. I replaced the line with a hack fix for the time being. Before:

has: function RefSetCache_has(ref) { return ('R' + ref.num + '.' + ref.gen) in this.dict; }

after:

has: function RefSetCache_has(ref) { if(ref !== undefined) return ('R' + ref.num + '.' + ref.gen) in this.dict; else return null; }

Cannot read property '0' of undefined when parsing pdf

PDF it fails on: http://www.novasoftware.se/ImgGen/schedulegenerator.aspx?format=pdf&schoolid=60410/nb-no&type=-1&id=2eda&period=&week=21&mode=0&printer=0&colors=32&head=0&clock=0&foot=0&day=0&width=1880&height=371&maxwidth=1880&maxheight=371

Stack trace:

(while reading XRef): TypeError: Cannot read property '0' of undefined
XRefParseException
    at XRefParseExceptionClosure (eval at <anonymous> (/home/petterroea/Dropbox/div-projects/bot/node_modules/pdf2json/lib/pdf.js:64:6), <anonymous>:379:34)
    at eval (eval at <anonymous> (/home/petterroea/Dropbox/div-projects/bot/node_modules/pdf2json/lib/pdf.js:64:6), <anonymous>:384:3)
    at Object.<anonymous> (/home/petterroea/Dropbox/div-projects/bot/node_modules/pdf2json/lib/pdf.js:64:1)
    at Module._compile (module.js:413:34)
    at Object.Module._extensions..js (module.js:422:10)
    at Module.load (module.js:357:32)
    at Function.Module._load (module.js:314:12)
    at Module.require (module.js:367:17)
    at require (internal/module.js:20:19)
    at Object.<anonymous> (/home/petterroea/Dropbox/div-projects/bot/node_modules/pdf2json/pdfparser.js:8:10)
Error
    at InvalidPDFExceptionClosure (eval at <anonymous> (/home/petterroea/Dropbox/div-projects/bot/node_modules/pdf2json/lib/pdf.js:64:6), <anonymous>:330:35)
    at eval (eval at <anonymous> (/home/petterroea/Dropbox/div-projects/bot/node_modules/pdf2json/lib/pdf.js:64:6), <anonymous>:334:3)
    at Object.<anonymous> (/home/petterroea/Dropbox/div-projects/bot/node_modules/pdf2json/lib/pdf.js:64:1)
    at Module._compile (module.js:413:34)
    at Object.Module._extensions..js (module.js:422:10)
    at Module.load (module.js:357:32)
    at Function.Module._load (module.js:314:12)
    at Module.require (module.js:367:17)
    at require (internal/module.js:20:19)
    at Object.<anonymous> (/home/petterroea/Dropbox/div-projects/bot/node_modules/pdf2json/pdfparser.js:8:10)

Code:

        var pdfParser = new PDFParser();
        console.log("Downloaded timeschedule.");
        pdfParser.on("pdfParser_dataReady", pdfData => {
        console.log("Got pdf data");
        console.log(pdfData);
        });
        pdfParser.loadPDF("temp.pdf");

Node -v:

v5.11.1

It might be a poorly generated PDF (2000s consultant work, apparently), but other readers support it fine.

"An error occurred while rendering the page" when page contains image.

I forked the repo in order to inspect the exact error:

+ nodeUtil._logN.call(self, 'Error: ' + require('util').inspect(error, null, null));

The problem is:
{
message: 'Image is not defined',
stack: 'ReferenceError: Image is not defined\n at loadJpegStream (eval at (/Users/Tim/EG Server/Source/Engine/eg-exam/node_modules/pdf2json/pdf.js:46:6))'
}

I'm looking into this issue and will add a pull request once I have fixed it. :)

Pass in options that tell pdf2json what to output

Building off of the 'add PostScript coordinates' idea in #12, perhaps we should add support for an options object passed as a second parameter to loadPDF? This options object could include key-value pairs like coordinates: 'PostScript' or useDictionary: false or excludeTextsProperties: ['clr', 'oc', 'A', 'R.S'] etc. This could also be an easy way to implement #20.

I understand @modesty 's comment in closing #20 that an output format other than what he has produced is not what he had in mind, but surely we can make pdf2json more flexible to support more varied projects. I think it can produce the current output as well as others as different projects may desire. I'm happy to try to tackle this if others think it's a good idea, and I welcome ideas on the best implementation.

pdf2json Performance over large PDF

Hi All,

I have a PDF file that contains about 500 pages (3.6mb) - I can't post because it contains sensitive data. When I load it up through pdf2json, it takes about 10 minutes to fire the dataReady callback... is this expected?

I am running the node application on a MacBook Pro, i7, 16GB... and seriously expected it to be faster.

The PDF contents are of a timetable nature... and all I want to extract are the text strings and their x/y locations, grouped by page.

Does anyone else have performance issues with pdf2json... or does anyone have suggestions for other node modules to use for this purpose?

Looking forward to some help... and happy to answer any questions.

Ta.

Use with Other JS Servers?

@modesty Cool tool. Just doing a brief walkthrough I didn't really see much that screamed Node.js only.... Do you feel that a lot of this was done such that it would only work in node or do you see it working on other javascript engines such as Rhino with relative ease?

Just asking your opinion based on your indepth knowledge.

Thanks for putting this out as open source.

Does someone know why R is an array

The README says: 'R': an array of text run, each text run object has two main fields...
But in all my PDFs, every R has a maximum length of 1. So what is a text run?

code example

In the README code example, PDFParser = require("./pdf2json/pdfparser"); no longer works.
It should be PDFParser = require("./pdf2json/PDFParser");

pdf2json 0.7.1: parseBuffer() stopping execution instead of gracefully returning via pdfParser_dataError

When parsing certain PDF files that cause errors (perhaps due to ill-formatted content), pdf2json quits program execution rather than gracefully handling the error via pdfParser_dataError.

Unfortunately I can't currently find the PDF that caused this situation for me, but the line that ultimately "crashes" pdf2json is the following line inside display/canvas.js (see also below console error log):

fontObj.spaceWidth = (spaceId >= 0 && isArray(fontObj.widths)) ? fontObj.widths[spaceId] : 250;

Placing this inside a try / catch at least allows pdf2json to return "ok", instead of stopping the program flow entirely.

The error that occurred in my case was:

Error: Required "glyf" or "loca" tables are not found
    at error (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:193:7)
    at Object.Font_checkAndRepair [as checkAndRepair] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:12213:11)
    at Object.Font (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:10756:21)
    at Object.PartialEvaluator_translateFont [as translateFont] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:8161:14)
    at Object.PartialEvaluator_loadFont [as loadFont] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:7311:29)
    at Object.PartialEvaluator_handleSetFont [as handleSetFont] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:7154:23)
    at Object.PartialEvaluator_getOperatorList [as getOperatorList] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:7470:37)
    at Object.eval [as onResolve] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:4345:26)
    at Object.runHandlers (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:864:35)
undefined:40950
h = (spaceId >= 0 && isArray(fontObj.widths)) ? fontObj.widths[spaceId] : 250;
                                                                          ^
TypeError: Cannot assign to read only property 'spaceWidth' of Required "glyf" or "loca" tables are not found
    at Object.CanvasGraphics_setFont [as setFont] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:40950:104)
    at Object.CanvasGraphics_executeOperatorList [as executeOperatorList] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:40560:27)
    at Object.InternalRenderTask__next [as _next] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:43553:39)
    at Object.InternalRenderTask__continue [as _continue] (eval at <anonymous> (/Users/me/projects/p1/node_modules/pdf2json/lib/pdf.js:61:10), <anonymous>:43545:14)
    at Timer.listOnTimeout (timers.js:110:15)

The first "Error" causes some delay, the second "TypeError" then quits the program flow.

Is there an update for pdf2json that can fix this? (If the PDF is ill-formatted and there is no way for pdf2json to parse the file, it should then just exit gracefully.)

Object has no method 'loadPdf'

var PdfParser = require('./pdf2json');
var parser = new PdfParser();
parser.loadPdf('./sample.pdf');

For this simple portion of code I'm getting the mentioned error.

Here is what I get when calling:

console.log(Object.keys(parser));

[ 'domain',
  '_events',
  '_maxListeners',
  'get_id',
  'get_name',
  'context',
  'pdfFilePath',
  'data',
  'PDFJS',
  'parsePropCount',
  'processFieldInfoXML' ]

Missing ) after argument error

Hi guys ,

/usr/local/lib/node_modules/pdf2json/lib/p2jcmd.js:49
fs.writeFile(fieldsTypesPath, JSON.stringify(pJSON), err => {
^^^

SyntaxError: missing ) after argument list

I'm new to node and JavaScript, and I ran into this problem when I tried to execute pdf2json directly from the shell. What is the "=>" function anyway? All the code with this "=>" gives me an error, including the examples.

Thanks

agenda throws error when working with pdf2json

After updating from 0.7.1 to 1.1.5, I get this error when I use pdf2json:

'TypeError: Cannot read property \'update\' of undefined', ' at unlockJobs (/home/jons/***/***/node_modules/sails-hook-jobs/node_modules/agenda/lib/agenda.js:319:11)', ' at Agenda.stop (/home/jons/***/***/node_modules/sails-hook-jobs/node_modules/agenda/lib/agenda.js:247:14)', ' at Sails.stopServer (/home/***/***/node_modules/sails-hook-jobs/index.js:14:12)', ' at emitNone (events.js:72:20)', ' at Sails.emit (events.js:166:7)', ' at Sails.emitter.emit (/home/***/***/node_modules/sails/lib/app/private/after.js:50:11)'

agenda is a dependency of sails-hook-jobs
sails 0.12.3
node 4.4.7
ubuntu 14.04
agenda 0.6.28

accept input stream

I'm downloading a PDF from a third-party site and instead of storing, I would like to pipe the stream into pdf2json to retrieve the text. Is this possible yet?
This use case can be easily wrapped around file-centric approach, see http://stackoverflow.com/a/18658613/353337 for a simple example for the nodejs hash function.

Errors after parsing are getting eaten by library

I have run into a case where I complete a successful PDF parse, but an error later in my app gets caught within the parser library. The problem is that the library catches the error on line 87 and swallows it. I am given no indication of what happened and no way to handle it properly in my app.

Here's a script that will demonstrate parsing a pdf, and then intentionally throwing an error.
https://gist.github.com/IanShoe/e92ee20f4862b187f9ae

It fails to parse PDF on Windows Server

Hi,

I am running a nodejs application on Windows Server, and it could not parse the PDF file there: no data was returned. The parsing works well when I run the code on a Windows PC. Is there any reason that pdf2json can't work on Windows Server? I executed the code from the command prompt.

Text X positions Incorrect

The document I am working with is an 11.5 x 16 PDF document. The height I get back from pdf2json is 51.75. When examining the Texts' locations (x, y), and assuming they are also represented in page units (PU), the y seems to be correct. However, the x seems to be off for elements located on the right half of the document. For instance, I placed text ("BottomRight") in the bottom right and got back the following coordinates: { x: 193.45312500000003, y: 50.918749999999996 }. Seeing that the document is 11.5 x 16, and the PU for the height is 51.75, this would technically make the width 74.25 PU. How is it possible that a text can have a position of 193.45..., with a max PU of 74.25?

```
define(function(require, exports, modules){

    var fs        = require('fs'),
        _         = require('underscore-node'),
        PDFParser = require('pdf2json/pdfparser'),
        pdfParser = new PDFParser(),
        pdfutils  = require('pdfutils').pdfutils;

    var PDF = function(base, file){

        var pdf = this;

        var location = '/Users/dayne/sites/wl/client/products/';

        pdf.base = null;
        pdf.file = null;

        pdf.adors = [];
        pdf.pages = [];

        pdf.init = function(base, file){

            console.log('starting pdf parsing');

            // set base path + file name
            pdf.file = file;
            pdf.base = base;

            // set the bindings
            pdfParser.on("pdfParser_dataReady", _.bind(pdf.initParse, this));
            pdfParser.on("pdfParser_dataError", _.bind(pdf.parseDataError, this));

            // start parsing
            pdfParser.loadPDF(base + file);
        };

        pdf.initParse = function(data){

            // console.log('parsing pdf data');

            pdfutils(pdf.base + pdf.file, function(err, doc){

                // for(var i = 0; i < data.PDFJS.pages.length; i++)
                for(var i = 0; i < 1; i++)
                    pdf.pages.push(pdf.parsePage(data.PDFJS.pages[i], doc[i]));

                // console.log(data.PDFJS.pages[0]);
            });
        };

        pdf.parsePage = function(page, doc){

            var parsedPage = {};

            parsedPage.adors  = [];

            parsedPage.ratio  = doc.height / page.Height;
            parsedPage.width  = doc.width;
            parsedPage.height = doc.height;

            for(var i = 0; i < page.Texts.length; i++)
                pdf.findCamelCase(page.Texts[i].R[0].T, page.Texts[i], page.Texts[i].R[0].TS, parsedPage, parsedPage.ratio);

            // TODO:: find solution for this xml parsing (grabbing pictures)...

            // console.log(parsedPage);
            // var meta   = doc.metadata.split('\n');
            // doc[0].asPNG({maxWidth: doc[0].width, maxHeight: doc[0].height }).toFile( pdf.base + 'test.png' )
            return parsedPage;
        };

        pdf.findCamelCase = function(text, textLocation, textData, parsedPage, ratio){
            // TODO :: fix regex to only accept camelcase without spacing...

            text.replace(/[A-Z]([A-Z0-9]*[a-z][a-z0-9]*[A-Z]|[a-z0-9]*[A-Z][A-Z0-9]*[a-z])[A-Za-z0-9]*/g, function(match){

                var t = {};

                // console.log(textLocation.x);
                // console.log(ratio);

                t.text    = text;
                t.size    = textData[1];
                t.bold    = textData[2] == 1;
                t.italics = textData[3] == 1;
                t.position = {
                    x: textLocation.x,
                    y: textLocation.y
                };

                // console.log(textLocation.x);
                console.log(t.text, t.position);

                parsedPage.adors.push(t);
            });
        };

        pdf.parseDataError = function(err){

            console.log('pdf parse error...', err);
        };

        pdf.init(base, file);
    };

    return new PDF('/Users/dayne/sites/wl/server/utils/', 'test.pdf');

});
```

bounding boxes

Could the documentation explain how to calculate bounding boxes for text items?

Text has x, y and w but no h. I presume that the font size could give you h, but they seem to be in other units. How should I convert?

BTW, what is the "TS" element? Can this help me?

Segfaults

I get seg faults with many (the majority) pdfs I tested, eg:

http://www.schroders.com/staticfiles/Schroders/Sites/global/IRpdf/Annual_Report_2007.pdf
http://www.northnorfolk.org/files/Sports_Assoc_001.pdf
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf

Running as root does not produce seg fault, but produces no output.

I tried stepping through the execution with node.js debugger, in which case it ran OK and produced output.

Offering an ES5 Version

It'd be amazing for projects that aren't using node4 to be able to use this project without needing harmony bindings - would you be able to offer an es5 version of this package?

positions of fields

Hi,

Thanks for your library, helps out a lot with some pdf work I'm doing!

I am having a small problem with field positions, and this could be because I'm not familiar enough with the library, but wanted to bring it up just in case not. I've got an image of the pdf and I'm trying to place fields and markers on the pdf image where the fields are in the real pdf, but when I get the x and y positions, the fields seem to be just a little bit off. Here is the graphic:

date and contactName are slightly out of position: date too far to the right, and contactName too low. I'm using the following to get their coordinates:

cls.toPixelX(pdfField.x), cls.toPixelY(pdfField.y)

Do I need to convert the pixel values I get in some way? I'm a bit confused because some of the fields are right where I would have expected them, while others are offset in different directions...

Thank you for any insight/help!!

Mark

Height and Width confusion

Hello modesty,

I am a bit confused about how the page height and width sizes work. I am using the 1040 test form that comes with the library and am getting a larger Width than Height even though the pages are all portrait. It also seems that the hlines and vlines abide by the page dimensions so I think this is something I am doing wrong.

Any thoughts?

Error: stream must have data

Error: stream must have data at error (eval at <anonymous> (/Users/raineroviir/best-scraper/node_modules/pdf2json/lib/pdf.js:60:6), <anonymous>:193:7)

Not sure why this is happening. How do I resolve this? In my console.log output I see the file loads successfully, but this error comes up right afterwards.

new version

Can you publish a new version to npm? I'm depending on the ability to parse directly from a buffer and installing from github is a real pain. 🙏

Parsing USPTO forms?

The USPTO uses some kind of form, created by Adobe LiveCycle Designer, that can't be read in any PDF viewer except for Acrobat Reader, Acrobat Professional, and maybe other Adobe products. For example, see the ADS form.

I'm not even sure what format those forms are in, but pdf2json (like all other non-Adobe PDF viewers) doesn't see any data except for the standard message, "If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document."

Is there any chance that pdf2json might be able to parse form data from such forms at some point in future?

Thanks for any input you may have. And thanks for a very useful utility!

An error occurred while rendering the page

Hi,

I was looking at this project and tried to render a PDF that renders fine with the pdfjs browser version. It gave me the error "An error occurred while rendering the page" for most of the pages.

I installed it via npm but had to move pdf2json out of node_modules folder.

Is this a known issue, or am I doing something wrong?

I also have a question: can I cache the canvas generation code with this, so that most of my canvas work happens here and only the rendering happens on the client side? (I don't know whether I should post this here; this is my first issue on GitHub.)

Is there a future path for pdf2json?

First of all this is a really great project, and there is none like it.

But I can't help but notice that the files copied from PDF.js are 2 years old and aging.

In last two years a bunch of work has been done @ PDF.js: https://github.com/mozilla/pdf.js/commits/master/src/core

If this project has to keep up, survive and flourish, there has to be a strategy to keep up to date.

I tried to do this myself, but failed horribly.

Is it possible to make use of the deliverable (combined file) in pdfjs-dist project: https://github.com/mozilla/pdfjs-dist/tree/master/build

Let's discuss ideas around this, even if we don't have sure-shot solutions.

pdf2json interprets one word as 2 JSON objects

Hi,

I use pdf2json to parse some pdfs, which contain week-tables. However, after I parse the files, there’s a strange behavior - some of the words are separated in the JSON-file as different objects, while they’re actually one word inside of the pdf.

Example:
Files:
plan.pdf
plan.txt (sorry, can't upload JSON-Files)

The word:
„Champignons“ (column: „Mittwoch“, row: 2, line: 3) is interpreted as "Champig" and "nons" (2 JSON Objects)

JSON:
{"x":26.51,"y":16.388,"w":3.932,"sw":0.35678125,"clr":0,"A":"left","R":[{"T":"Champig","S":3,"TS":[0,12,0,0]}]},{"x":28.745,"y":16.388,"w":7.169,"sw":0.35678125,"clr":0,"A":"left","R":[{"T":"nons%20und%20Lauch%20","S":3,"TS":[0,12,0,0]}]}

This issue also occurs in other rows. I suspect that it's caused by the specific PDF structure.

Any ideas how I can fix that?

Thanks for your support!
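
One possible workaround on the consumer side is to stitch fragments back together after parsing. The sketch below is a post-processing heuristic, not a fix inside pdf2json: it merges consecutive runs that share the same `y`, start close to each other, and have no whitespace at the join. The function name `mergeSplitRuns` and the `maxGap` threshold are assumptions made for illustration, and the gap is measured start-to-start because the unit of `w` is not clear from the output above.

```javascript
// Merge adjacent text runs that look like fragments of one word.
// `maxGap` is a hypothetical threshold in the same units as `x`; tune it per document.
function mergeSplitRuns(texts, maxGap = 3) {
  // Reading order: top to bottom, then left to right.
  const sorted = [...texts].sort((a, b) => a.y - b.y || a.x - b.x);
  const merged = [];
  for (const item of sorted) {
    // The T values are URI-encoded in the JSON output.
    const text = item.R.map((run) => decodeURIComponent(run.T)).join("");
    const prev = merged[merged.length - 1];
    const sameLine = prev !== undefined && prev.y === item.y;
    const closeEnough = sameLine && item.x - prev.lastX <= maxGap;
    const noSpaceAtJoin = sameLine && !/\s$/.test(prev.text) && !/^\s/.test(text);
    if (closeEnough && noSpaceAtJoin) {
      prev.text += text;
      prev.lastX = item.x;
    } else {
      merged.push({ x: item.x, y: item.y, lastX: item.x, text });
    }
  }
  return merged.map(({ x, y, text }) => ({ x, y, text }));
}

// The two objects quoted in this issue.
const sample = [
  { x: 26.51, y: 16.388, w: 3.932, R: [{ T: "Champig", S: 3, TS: [0, 12, 0, 0] }] },
  { x: 28.745, y: 16.388, w: 7.169, R: [{ T: "nons%20und%20Lauch%20", S: 3, TS: [0, 12, 0, 0] }] },
];
console.log(mergeSplitRuns(sample));
```

A threshold that works for one table layout may wrongly join neighbouring cells in another, so the value needs checking against real documents.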

Color dictionary confusion

Is it possible to implement a feature for getting the correct color rather than one from the "dictionary"? It would also be awesome to access CMYK colors. Perhaps the dictionaries could be generated dynamically?

Great work in porting pdf.js over to node!

Italics not working

Dear Mr. Zhang,

The Italic field (TS[3]) is always zero regardless of whether the text field is Italic or not. After digging in pdffont.js for a bit, I figured out that it's because the value is always the initial value (false) set in the constructor and it is never set anywhere else.

In my case, I corrected the issue by making this very simple change to pdffont.js:

    var _setFaceIndex = function() {
        var fontObj = this.fontObj;

        this.bold = fontObj.bold;
        if (!this.bold) {
            this.bold = this.typeName.indexOf("bold") >= 0 || this.typeName.indexOf("black") >= 0;
        }

        this.italic = fontObj.italic;  // <---- Added this line only

Please note that Bold works as advertised. I notice that you also analyze the typeface name to distinguish between bold and normal text in the case of "pseudobold" fonts. I have not done anything like that for italics, so my change probably won't work for typefaces that are oblique by design but not by formatting.
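
For illustration, the same name-based fallback used for bold could be extended to italics. This is a minimal sketch, not the code in pdffont.js: the helper `detectFace` is hypothetical, and the assumption that "italic" or "oblique" appears in the typeface name will not hold for every font.

```javascript
// Hypothetical helper: derive bold/italic flags from the font object,
// falling back to keywords in the typeface name.
function detectFace(fontObj, typeName) {
  const name = (typeName || "").toLowerCase();
  const bold =
    Boolean(fontObj.bold) || name.includes("bold") || name.includes("black");
  // Assumption: italic faces usually carry "italic" or "oblique" in their name.
  const italic =
    Boolean(fontObj.italic) || name.includes("italic") || name.includes("oblique");
  return { bold, italic };
}

console.log(detectFace({ bold: false, italic: false }, "Helvetica-Oblique"));
```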

I have not forked the project so please accept this issue and code snippet in lieu of a pull request. :)

Yours faithfully,
Riaan

PS. Thanks for the package, it's much appreciated!

PDF files on the web

Hi,

I'm attempting to use the pdf2json utility and got this error:
{ [Error: ENOENT: no such file or directory, open 'http://www.patrick.af.mil/shared/media/document/AFD-070716-028.pdf' ] }
Am I doing something wrong here? Does the file need to be local?

pdf2json is referring to xmldom by './../node_modules/xmldom'

You should never refer to a module using a path:

On line 6 of pdf.js:
DOMParser = require('./../node_modules/xmldom').DOMParser

Should be
DOMParser = require('xmldom').DOMParser

'Cause now I'm getting the following error when using pdf2json:

Error: Cannot find module './../node_modules/xmldom'
    at Function.Module._resolveFilename (module.js:338:15)
    at Function.Module._load (module.js:280:25)
    at Module.require (module.js:362:17)
    at require (module.js:378:17)
    at Object.<anonymous> (/Some/great/path/node_modules/pdf2json/pdf.js:6:17)
    at Module._compile (module.js:449:26)
    at Object.Module._extensions..js (module.js:467:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:362:17)

Text escaping and granularity

@rkpatel33 and I are trying to use pdf2json to get a better understanding of each word in a .pdf file. We need this to enable word-level highlighting and to be able to run NLP libraries on the text. Currently, running a .pdf file through the command line outputs this:

{
  "formImage": {
    "Transcoder": "[email protected]",
    "Agency": "Microsoft Word - test.docx",
    "Id": {
      "AgencyId": "",
      "Name": "",
      "MC": false,
      "Max": 1,
      "Parent": ""
    },
    "Pages": [
      {
        "Height": 49.5,
        "HLines": [],
        "VLines": [],
        "Fills": [
          {
            "x": 0,
            "y": 0,
            "w": 0,
            "h": 0,
            "clr": 1
          },
          {
            "x": 29.083,
            "y": 7.936,
            "w": 6.105,
            "h": 0.015,
            "clr": 0
          }
        ],
        "Texts": [
          {
            "x": 15.262,
            "y": 4.471,
            "w": 6.573400000000001,
            "clr": 0,
            "A": "left",
            "R": [
              {
                "T": "This%09%0D%20%C2%A0is%09%0D%20%C2%A0a%09%0D%20%C2%A0test%09%0D%20%C2%A0of%09%0D%20%C2%A0",
                "S": -1,
                "TS": [
                  2,
                  53,
                  0,
                  0
                ]
              }
            ]
          },
          {
            "x": 28.827,
            "y": 4.471,
            "w": 2.6036,
            "clr": 0,
            "A": "left",
            "R": [
              {
                "T": "BOLD",
                "S": -1,
                "TS": [
                  2,
                  54,
                  1,
                  0
                ]
              }
            ]
          },
          {
            "x": 34.648,
            "y": 4.471,
            "w": 1.9335000000000002,
            "clr": 0,
            "A": "left",
            "R": [
              {
                "T": "font.",
                "S": -1,
                "TS": [
                  2,
                  53,
                  0,
                  0
                ]
              }
            ]
          }
        ],
        "Fields": [],
        "Boxsets": []
      }
    ],
    "Width": 105.188
  }
}

The text seems to be all globbed together and escaped: This%09%0D%20%C2%A0is%09%0D%20%C2%A0a%09%0D%20%C2%A0test%09%0D%20%C2%A0of%09%0D%20%C2%A0. Is there any way to get each word in its own object with positional properties? The content file also doesn't show the text in the original format. Before, it was all on one line, but now I get:

This    
  is   
  a    
  test 
  of   
  
BOLD

  
font.

  

  

  

  
----------------Page (0) Break----------------

I'm not quite sure if I've misunderstood something. I was hoping for "This is a test of BOLD font."

Any ideas?
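
As a consumer-side sketch, the escaped `T` values can be decoded with `decodeURIComponent`, and runs that share a `y` coordinate can be joined into one line with whitespace collapsed. The function name `pageToLines` is an assumption for illustration; it works on the page structure shown above and is not part of pdf2json.

```javascript
// Rebuild plain-text lines from one entry of the Pages array.
function pageToLines(page) {
  const byRow = new Map();
  for (const item of page.Texts) {
    // T is URI-encoded; %C2%A0 decodes to a non-breaking space.
    const text = item.R.map((run) => decodeURIComponent(run.T)).join("");
    if (!byRow.has(item.y)) byRow.set(item.y, []);
    byRow.get(item.y).push({ x: item.x, text });
  }
  return [...byRow.entries()]
    .sort(([y1], [y2]) => y1 - y2)
    .map(([, items]) =>
      items
        .sort((a, b) => a.x - b.x)
        .map((entry) => entry.text)
        .join(" ")
        // \s also matches tabs, carriage returns and non-breaking spaces.
        .replace(/\s+/g, " ")
        .trim()
    );
}

// The three text objects from the output above, trimmed to the fields used here.
const samplePage = {
  Texts: [
    { x: 15.262, y: 4.471, R: [{ T: "This%09%0D%20%C2%A0is%09%0D%20%C2%A0a%09%0D%20%C2%A0test%09%0D%20%C2%A0of%09%0D%20%C2%A0" }] },
    { x: 28.827, y: 4.471, R: [{ T: "BOLD" }] },
    { x: 34.648, y: 4.471, R: [{ T: "font." }] },
  ],
};
console.log(pageToLines(samplePage));
```

Grouping by exact `y` is the simplest choice; text whose baselines differ slightly would need a tolerance instead.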

encoding issues

I have a PDF file like this:

***同津巴布韦总统穆加贝举行会谈_国内新闻_环球网.pdf

I can get the Chinese characters from it successfully, but for the text in the following PDF file,
1.pdf

I only get

Is this caused by the PDF file encoding or something else?

Expose pdf.js getTextContent method for a pdf page

Would it be possible to expose the getTextContent method via, say, a Content property, to easily get a page's raw text?

Use Case

The developer needs to generate a PDF via, for example, PhantomJS.
Inside the PDF file, specific text content needs to be extracted.
When accessing data.Pages via the pdfParser_dataReady callback, the developer could grab a page's text Content promise for further processing, instead of dealing with text.R[0].T manipulations (loops, encoding, etc.). pdf2json is invoked from PhantomJS via a node.js sub-process.

Proposed Implementation
Add a Content property in pdf.js.

var page = {
    Height: pageParser.height,
    HLines: pageParser.HLines,
    VLines: pageParser.VLines,
    Fills: pageParser.Fills,
    Content: pdfPage.getTextContent(),
    Texts: pageParser.Texts,
    Fields: pageParser.Fields,
    Boxsets: pageParser.Boxsets
};

If there's another approach that deals with funky characters easily without introducing an API add-on, I'd be glad to hear about it.

Problem with CMYK colors

I am having some trouble with CMYK colors. In my example PDF the color becomes #00FF00, but it should be full magenta, a purple-ish color. I don't think it has to do with the dictionary translation; rather, somewhere my color is not being recognized as CMYK?

I'm not sure. Here is the file anyway:
https://docs.google.com/file/d/0B6YLTkp6bMZPbGlJNE1vV3NpOWc/edit

Also, the latest pdf.js has better support for translating CMYK to RGB using a lookup table (LUT) for comparison; maybe this could be implemented.
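
For reference, the naive device conversion from CMYK to RGB shows why full magenta is expected to come out as #FF00FF. This is only the textbook formula, with `cmykToHex` as a hypothetical helper name; it is neither the lookup-table approach mentioned above nor pdf2json's own color handling.

```javascript
// Naive CMYK -> RGB: each channel is 255 * (1 - ink) * (1 - black).
// Inputs are fractions in the range 0..1.
function cmykToHex(c, m, y, k) {
  const channel = (ink) => Math.round(255 * (1 - ink) * (1 - k));
  const hex = [c, m, y]
    .map((ink) => channel(ink).toString(16).padStart(2, "0"))
    .join("");
  return "#" + hex.toUpperCase();
}

console.log(cmykToHex(0, 1, 0, 0)); // full magenta
```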

Word wrapping

{
"x": 7.9914062500000025,
"y": 3.984375000000001,
"w": 1292.917,
"clr": 0,
"A": "left",
"R": [
{
"T": "SECTION%3A%20CON",
"S": -1,
"TS": [3, 182, 0, 0]
}
]
},
{
"x": 19.10241171875,
"y": 3.984375000000001,
"w": 984.6790000000001,
"clr": 0,
"A": "left",
"R": [
{
"T": "TACT%20LENS",
"S": -1,
"TS": [3, 182, 0, 0]
}
]
},

This was the "SECTION: CONTACT LENS" string in reference PDF.

Lots of logging messages

I'm seeing lots of logging messages like these:

6 Mar 11:18:54 - PDFFont2235 - Default - SymbolicFont - (NJTZCI+Constantia-Bold) : 50::NaN => 2 length = 1
6 Mar 11:18:54 - PDFFont2311 - Default - SymbolicFont - (NJTZCI+Constantia-Bold) : 66::NaN => B length = 1
6 Mar 11:18:54 - PDFPageParser19 - page 19 is rendered successfully.
6 Mar 11:18:54 - PDFJSClass1 - start to parse page:20
6 Mar 11:18:54 - PDFPageParser20 - page 20 is rendered successfully.
6 Mar 11:18:54 - PDFJSClass1 - complete parsing page:20

How can I turn them off?

Thanks

_onPFBdataReady not defined error

var nodeUtil = require("util"),
    fs = require('fs'),
    _ = require('underscore'),
    PDFParser = require("./pdfparser");

var pdfParser = new PDFParser();

pdfParser.on("pdfParser_dataReady", _.bind(_onPFBinDataReady, self));
pdfParser.on("pdfParser_dataError", _.bind(_onPFBinDataError, self));

// var pdfFilePath = _pdfPathBase + folderName + "/" + pdfId + ".pdf";
var pdfFilePath = 'ibpsrrb2012.pdf';
pdfParser.loadPDF(pdfFilePath);

// or call directly with buffer
fs.readFile(pdfFilePath, function (err, pdfBuffer) {
    if (!err) {
        pdfParser.parseBuffer(pdfBuffer);
    }
});

When I run the command npm mypdf.js I get the following error:
pdfParser.on("pdfParser_dataReady",_.bind<_onPFBinDataReady,self>)

Reference error: _onPFBinDataReady is not defined
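
The error message suggests that the two handler functions are referenced without being defined anywhere in the script. A minimal sketch of defining them before registration follows. To stay runnable without pdf2json installed it uses a plain `EventEmitter` as a stand-in for the parser, and `attachHandlers` is a hypothetical helper; only the event names come from the snippet above.

```javascript
const { EventEmitter } = require("events");

// Handlers must exist before they are passed to .on().
function _onPFBinDataReady(data) {
  console.log("PDF data is ready");
  return data;
}

function _onPFBinDataError(err) {
  console.error("PDF parsing failed:", err);
}

// Register both handlers on any emitter-like parser object.
function attachHandlers(parser) {
  parser.on("pdfParser_dataReady", _onPFBinDataReady);
  parser.on("pdfParser_dataError", _onPFBinDataError);
  return parser;
}

// Stand-in emitter (an assumption for this sketch); with pdf2json this
// would be the parser instance created with `new PDFParser()`.
const fakeParser = attachHandlers(new EventEmitter());
fakeParser.emit("pdfParser_dataReady", {});
```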

Running on Electron "Uncaught Error: No PDFJS.workerSrc specified"

For some reason when I run pdf2json on my electron app I get "Uncaught Error: No PDFJS.workerSrc specified".

FYI: I tried setting workerSrc to pdf.worker.js but that won't solve it. It just brings up another error,

Loading pdfs from a remote server via a stream

I'm trying to load a single PDF from a remote server. Here is my approach:
(I can confirm that if I just pipe the request into a write stream it saves the PDF fine)

var request = require('request');
var pdfParser = require('pdf2json');
var pdfUrl = 'somepdf.pdf'

var pdfPipe = request({url: pdfUrl, encoding:null}).pipe(pdfParser);

pdfPipe.on("pdfParser_dataError", err => console.error(err) );
pdfPipe.on("pdfParser_dataReady", pdf => {
    //let pdf = pdfParser.getMergedTextBlocksIfNeeded();
    console.log(pdfParser.getAllFieldsTypes());
});

However, I'm getting an error:

stream.js:45
  dest.on('drain', ondrain);
       ^

TypeError: dest.on is not a function
    at Request.Stream.pipe (stream.js:45:8)
    at Request.pipe (/Users/zaf/development/minerva-bot/node_modules/request/request.js:1395:34)
    at Object.<anonymous> (/Users/zaf/development/minerva-bot/plugins/exam_module/index.js:9:53)
    at Module._compile (module.js:434:26)
    at Object.Module._extensions..js (module.js:452:10)
    at Module.load (module.js:355:32)
    at Function.Module._load (module.js:310:12)
    at Function.Module.runMain (module.js:475:10)
    at startup (node.js:117:18)
    at node.js:951:3

Code constructed from here: http://stackoverflow.com/a/36882510/3779915
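
The `dest.on is not a function` trace is consistent with `pipe()` being handed something that has no `.on` method, which here appears to be the value returned by `require('pdf2json')` rather than a parser instance. One way around piping altogether is to buffer the response first and pass the bytes to `parseBuffer`, which is the method used in another snippet on this page. The sketch below assumes the parser is an instance that emits the two events shown above; `streamToBuffer` and `parseFromStream` are hypothetical helper names.

```javascript
// Collect every chunk of a stream into one Buffer.
function streamToBuffer(readable) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    readable.on("data", (chunk) => chunks.push(Buffer.from(chunk)));
    readable.on("end", () => resolve(Buffer.concat(chunks)));
    readable.on("error", reject);
  });
}

// Buffer the stream, then hand the bytes to a parser *instance*.
function parseFromStream(readable, parser) {
  return new Promise((resolve, reject) => {
    parser.on("pdfParser_dataError", reject);
    parser.on("pdfParser_dataReady", resolve);
    streamToBuffer(readable).then((buffer) => parser.parseBuffer(buffer), reject);
  });
}

// Possible usage, assuming the `request` and `pdf2json` packages are installed:
//   const request = require("request");
//   const PDFParser = require("pdf2json");
//   parseFromStream(request({ url: pdfUrl, encoding: null }), new PDFParser())
//     .then((pdf) => console.log(pdf))
//     .catch((err) => console.error(err));
```

Buffering holds the whole file in memory, which is usually acceptable for a single PDF but worth reconsidering for very large documents.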

Crash in 0.4.5 - xmldom tagName error

In the current version, the xmldom module throws a fatal error when parsing a PDF (grayscale table with text only). Previous versions didn't have this issue.

7 Sep 20:50:25 - PDFParser1 -  is about to load PDF file uploads\035361c35f424d6885574aae35eae88b
7 Sep 20:50:25 - PDFJSClass1 - About to load fieldInfo XML : uploads\035361c35f424d6885574aae35eae88b
element parse error: Error: invalid tagName:<
@#[line:4,col:1]
element parse error: Error: invalid tagName:
@#[line:4,col:2]
element parse error: Error: invalid tagName:<
@#[line:4,col:376]
element parse error: Error: invalid tagName:
@#[line:4,col:377]
element parse error: Error: invalid tagName:<
@#[line:7,col:1]
end tag name: Filter /FlateDecode /Length 1613 is not match the current start tagName:undefined
@#[line:7,col:1]

C:\app\node_modules\pdf2json\node_modules\xmldom\dom-parser.js:185
            throw error;
                  ^
end tag name: Filter /FlateDecode /Length 1613 is not match the current start tagName:undefined

No way for consumer to handle crypto related errors when calling loadPdf()

Currently base/core/crypto.js calls error() on errors. There is no callback for catching these errors, and the errors don't emit to pdfjs_parseDataError.

Here is the PDF causing the error: https://drive.google.com/file/d/0B3yADm5p-GRCRExYVXFjTnhvQ2c/view?usp=sharing

Does not extract Hyperlink on text

It would be great if a text hyperlink could be exported to JSON. Currently only the text value is exported.
