nisaacson / pdf-extract

Node PDF Extract

License: MIT License

JavaScript 98.14% Makefile 0.74% Shell 1.11%

pdf-extract's Introduction

Node PDF

Node PDF is a set of tools that takes in PDF files and converts them to usable formats for data processing. The library supports both extracting text from searchable PDF files and performing OCR on PDFs that are just scanned images of text.

Installation

To begin, install the module.

npm install pdf-extract

After the library is installed you will need the following binaries accessible on your path to process pdfs.

pdftk
- pdftk splits multi-page pdf into single pages.
pdftotext
- pdftotext is used to extract text out of searchable pdf documents
ghostscript
- ghostscript is an ocr preprocessor which converts pdfs to tif files for input into tesseract
tesseract
- tesseract performs the actual ocr on your scanned images
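
Before running an extraction, it can help to confirm that the four binaries listed above are actually reachable. The following is a minimal sketch, not part of pdf-extract itself: `findMissingBinaries` is a hypothetical helper name, and the sketch assumes the ghostscript executable is called `gs` and that `which` (or `where` on Windows) is available.

```javascript
const { spawnSync } = require('child_process')

// Return the names from `binaries` that cannot be located on the PATH.
// Uses `where` on Windows and `which` elsewhere; if the locator itself
// is unavailable, every name is reported as missing.
function findMissingBinaries (binaries) {
  const locator = process.platform === 'win32' ? 'where' : 'which'
  return binaries.filter(name => {
    const result = spawnSync(locator, [name], { stdio: 'ignore' })
    return result.status !== 0
  })
}

// Binary names assumed from the list above (ghostscript is invoked as `gs`)
const required = ['pdftk', 'pdftotext', 'gs', 'tesseract']
const missing = findMissingBinaries(required)
if (missing.length > 0) {
  console.warn('Missing from PATH: ' + missing.join(', '))
} else {
  console.log('All required binaries were found on the PATH')
}
```

This only checks that each name resolves to something on the PATH; it does not verify versions or that tesseract has the trained data files installed.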

OSX

To begin on OSX, first make sure you have the homebrew package manager installed.

pdftk is not available in Homebrew. However a gui install is available here. http://www.pdflabs.com/docs/install-pdftk/

pdftotext is included as part of the poppler utilities library. poppler can be installed via homebrew

brew install poppler

ghostscript can be installed via homebrew

brew install gs

tesseract can be installed via homebrew as well

brew install tesseract

After tesseract is installed you need to install the alphanumeric config and an updated trained data file

cd <root of this module>
cp "./share/eng.traineddata" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/eng.traineddata"
cp "./share/dia.traineddata" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/dia.traineddata"
cp "./share/configs/alphanumeric" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/configs/alphanumeric"

Ubuntu

pdftk can be installed directly via apt-get

apt-get install pdftk

pdftotext is included in the poppler-utils library. To install poppler-utils execute

apt-get install poppler-utils

ghostscript can be installed via apt-get

apt-get install ghostscript

tesseract can be installed via apt-get. Note that unlike the osx install the package is called tesseract-ocr on Ubuntu, not tesseract

apt-get install tesseract-ocr

For the OCR to work, you need to have the tesseract-ocr binaries available on your path. If you only need to handle ASCII characters, the accuracy of the OCR process can be increased by limiting the tesseract output. To do this copy the alphanumeric file included with this pdf-extract module into the tess-data folder on your system. Also the eng.traineddata included with the standard tesseract-ocr package is out of date. This pdf-extract module provides an up-to-date version which you should copy into the appropriate location on your system

cd <root of this module>
cp "./share/eng.traineddata" "/usr/share/tesseract-ocr/tessdata/eng.traineddata"
cp "./share/configs/alphanumeric" "/usr/share/tesseract-ocr/tessdata/configs/alphanumeric"

SmartOS

pdftk can be installed directly via apt-get

apt-get install pdftk

pdftotext is included in the poppler-utils library. To install poppler-utils execute

apt-get install poppler-utils

ghostscript can be installed via pkgin. Note you may need to update the pkgin repo to include the additional sources provided by Joyent. Check http://www.perkin.org.uk/posts/9000-packages-for-smartos-and-illumos.html for details

pkgin install ghostscript

tesseract must be manually downloaded and compiled. You must also install leptonica before installing tesseract. At the time of this writing leptonica is available from http://www.leptonica.com/download.html, with the latest version tarball available from http://www.leptonica.com/source/leptonica-1.69.tar.gz

pkgin install autoconf
wget http://www.leptonica.com/source/leptonica-1.69.tar.gz
tar -xvzf leptonica-1.69.tar.gz
cd leptonica-1.69
./configure
make
[sudo] make install

After installing leptonica, move on to tesseract. Tesseract is available from https://code.google.com/p/tesseract-ocr/downloads/list with the latest version available from https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.02.tar.gz&can=2&q=

wget "https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.02.tar.gz&can=2&q="
tar -xvzf tesseract-ocr-3.02.02.tar.gz
cd tesseract-ocr
./configure
make
[sudo] make install

Windows

Important! You will have to add some folders to the PATH of your machine. You do this by right-clicking your computer in File Explorer, then selecting Properties, Advanced System Settings, and Environment Variables. You can then add the folder that contains the executables to the Path variable.

pdftk can be installed using the PDFtk Server installer found here: https://www.pdflabs.com/tools/pdftk-server/ It should automatically add itself to the PATH. If not, the default install location is "C:\Program Files (x86)\PDFtk Server\bin"

pdftotext can be installed using the recompiled poppler utils for windows, which have been collected and bundled here: http://manifestwebdesign.com/2013/01/09/xpdf-and-poppler-utils-on-windows/ Unpack these in a folder, (example: "C:\poppler-utils") and add this to the PATH.

ghostscript for Windows can be found at: http://www.ghostscript.com/download/gsdnld.html Make sure you download the General Public License release and the correct version (32/64bit). Install it, go to the installation folder (default: "C:\Program Files\gs\gs9.19") and open the bin folder. Rename gswin64c to gs, and add the bin folder to your PATH.

tesseract can be built from source, but you can also download an older version which seems to work fine. Downloads at: https://sourceforge.net/projects/tesseract-ocr-alt/files/ The version tested is tesseract-ocr-setup-3.02.02.exe. The default install location is "C:\Program Files (x86)\Tesseract-OCR", which is also added to the PATH. Note that this only happens when you have chosen to install for everyone on the machine.

Everything should work after all this! If not, try restarting to make sure the PATH variables are correctly used. This setup was tested on a Windows 10 Pro N 64bit machine.

Usage

OCR Extract from scanned image

Extract from a pdf file which contains a scanned image and no searchable text

const path = require("path")
const pdf_extract = require('pdf-extract')

console.log("Usage: node thisfile.js the/path/tothe.pdf")
const absolute_path_to_pdf = path.resolve(process.argv[2])
if (absolute_path_to_pdf.includes(" ")) throw new Error("will fail for paths w spaces like "+absolute_path_to_pdf)

const options = {
  type: 'ocr', // perform ocr to get the text within the scanned image
  ocr_flags: ['--psm 1'], // automatically detect page orientation
}
const processor = pdf_extract(absolute_path_to_pdf, options, ()=>console.log("Starting…"))
processor.on('complete', data => callback(null, data))
processor.on('error', callback)
function callback (error, data) { error ? console.error(error) : console.log(data.text_pages[0]) }

Text extract from searchable pdf

Extract from a pdf file which contains actual searchable text

const path = require("path")
const pdf_extract = require('pdf-extract')

console.log("Usage: node thisfile.js the/path/tothe.pdf")
const absolute_path_to_pdf = path.resolve(process.argv[2])
if (absolute_path_to_pdf.includes(" ")) throw new Error("will fail for paths w spaces like "+absolute_path_to_pdf)

const options = {
  type: 'text', // extract searchable text from PDF
  ocr_flags: ['--psm 1'], // automatically detect page orientation
  enc: 'UTF-8',  // optional, encoding to use for the text output
  mode: 'layout' // optional, mode to use when reading the pdf
}
const processor = pdf_extract(absolute_path_to_pdf, options, ()=>console.log("Starting…"))
processor.on('complete', data => callback(null, data))
processor.on('error', callback)
function callback (error, data) { error ? console.error(error) : console.log(data.text_pages[0]) }

Options

At a minimum you must specify the type of pdf extract you wish to perform

clean When the system extracts text from a multi-page pdf, it first splits the pdf into single pages. These are written to disk before the ocr occurs. For some applications these single page files can be useful. If you need to work with the single page pdf files after the ocr is complete, set the clean option to false as shown below. Note that the single page pdf files are written to the system-appropriate temp directory, so you must copy the files to a more permanent location yourself after the ocr process completes

var options = {
  type: 'ocr', // (required), perform ocr to get the text within the scanned image
  enc: 'UTF-8', // optional, only applies to 'text' type
  mode: 'layout', // optional, only applies to 'text' type. Available modes are 'layout', 'simple', 'table' or 'lineprinter'. Default is 'layout'
  clean: false, // keep the single page pdfs created during the ocr process
  ocr_flags: [
    '-psm 1',       // automatically detect page orientation
    '-l dia',       // use a custom language file
    'alphanumeric'  // only output ascii characters
  ]
}
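
Since the single page files live in a temp directory when clean is false, one way to keep them is to copy them somewhere permanent once extraction finishes. This is a rough sketch only: `keepSinglePagePdfs` and the `./pages` destination are hypothetical names, and it assumes the paths come from the `single_page_pdf_file_paths` array delivered with the complete event.

```javascript
const fs = require('fs')
const path = require('path')

// Copy each single page PDF into `destDir` (created if needed)
// and return the list of new file paths.
function keepSinglePagePdfs (singlePagePaths, destDir) {
  fs.mkdirSync(destDir, { recursive: true })
  return singlePagePaths.map(src => {
    const dest = path.join(destDir, path.basename(src))
    fs.copyFileSync(src, dest)
    return dest
  })
}

// Possible usage inside a 'complete' handler (processor created as in Usage):
// processor.on('complete', data => {
//   const kept = keepSinglePagePdfs(data.single_page_pdf_file_paths, './pages')
//   console.log('kept', kept.length, 'single page files')
// })
```

Copying rather than moving leaves the temp files in place, so any cleanup the operating system performs on its temp directory is unaffected.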

Events

When processing, the module will emit various events as they occur

page Emitted when a page has completed processing. The data passed with this event looks like

var data = {
  hash: <sha1 hash of the input pdf file here>,
  text: <extracted text here>,
  index: 2,
  num_pages: 4,
  pdf_path: "~/Downloads/input_pdf_file.pdf",
  single_page_pdf_path: "/tmp/temp_pdf_file2.pdf"
}

error Emitted when an error occurs during processing. After this event is emitted processing will stop. The data passed with this event looks like

var data = {
  error: 'no file exists at the path you specified',
  pdf_path: "~/Downloads/input_pdf_file.pdf",
}

complete Emitted when all pages have completed processing and the pdf extraction is complete

var data = {
  hash: <sha1 hash of the input pdf file here>,
  text_pages: <Array of Strings, one per page>,
  pdf_path: "~/Downloads/input_pdf_file.pdf",
  single_page_pdf_file_paths: [
    "/tmp/temp_pdf_file1.pdf",
    "/tmp/temp_pdf_file2.pdf",
    "/tmp/temp_pdf_file3.pdf",
    "/tmp/temp_pdf_file4.pdf",
  ]
}

log To avoid spamming process.stdout, log events are emitted instead.
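
The events above can be folded into a single Promise when only the final result matters. The sketch below is an illustration under assumptions: `waitForExtraction` is a hypothetical helper, it relies only on the `.on(event, handler)` method shown in the Usage examples, and a plain Node `EventEmitter` stands in for a real processor so the snippet runs without any PDF tooling installed.

```javascript
const { EventEmitter } = require('events')

// Resolve with the 'complete' payload, reject with the 'error' payload.
// `onPage` is an optional handler for per-page progress.
function waitForExtraction (processor, onPage) {
  return new Promise((resolve, reject) => {
    if (onPage) processor.on('page', onPage)
    processor.on('complete', resolve)
    processor.on('error', reject)
  })
}

// Stand-in emitter that mimics the documented event payloads
const fakeProcessor = new EventEmitter()
waitForExtraction(fakeProcessor, page => console.log('finished page', page.index))
  .then(data => console.log(data.text_pages.length + ' page(s) extracted'))
  .catch(err => console.error('extraction failed', err))

fakeProcessor.emit('page', { index: 0, text: 'hello' })
fakeProcessor.emit('complete', { text_pages: ['hello'] })
```

With the real module, the emitter returned by `pdf_extract(path, options, cb)` would take the place of `fakeProcessor`.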

Tests

To test that your system satisfies the needed dependencies and that the module is functioning correctly, execute the following commands in the pdf-extract module folder

cd <project_root>/node_modules/pdf-extract
npm test

pdf-extract's Issues

Error (XX): Illegal character <XX> in hex string

{
    "error": "Illegal character <3f> in hex string\nError (82): Illegal character <78> in hex string\nError (83): Illegal character <70> in hex string\nError (86): Illegal character <6b> in hex string\nError (88): Illegal character <74> in hex string\nError (92): Illegal character <67> in hex string\nError (93): Illegal character <69> in hex string\nError (94): Illegal character <6e> in hex string\nError (95): Illegal character <3d> in hex string\nError (96): Illegal character <27> in hex string\nError (97): Illegal character <ef> in hex string\nError (98): Illegal character <bb> in hex string\nError (99): Illegal character <bf> in hex string\nError (100): Illegal character <27> in hex string\nError (102): Illegal character <69> in hex string\nError (104): Illegal character <3d> in hex string\nError (105): Illegal character <27> in hex string\nError (106): Illegal character <57> in hex string\nError (108): Illegal character <4d> in hex string\nError (110): Illegal character <4d> in hex string\nError (111): Illegal character <70> in hex string\nError (114): Illegal character <68> in hex string\nError (115): Illegal character <69> in hex string\nError (116): Illegal character <48> in hex string\nError (117): Illegal character <7a> in hex string\nError (118): Illegal character <72> in hex string\nError (120): Illegal character <53> in hex string\nError (121): Illegal character <7a> in hex string\nError (122): Illegal character <4e> in hex string\nError (123): Illegal character <54> in hex string\nError (125): Illegal character <7a> in hex string\nError (126): Illegal character <6b> in hex string\nError (130): Illegal character <27> in hex string\nError (131): Illegal character <3f> in hex string\n",
}

Adding the option to Decrypt a PDF

Hi,

Currently not able to pass through some form of arguments to pdftk to decrypt a document.

I'd like to submit a change to allow for this.

Documentation not reflecting library use for multiple pdf OCR

Preferable if library docs show how to extract text from multiple pdf files.
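
One possible approach, sketched here rather than taken from the library docs, is to process the files one at a time so that only a single set of external processes runs at once. `extractMany` is a hypothetical helper, and `extractOne` is an assumed caller-supplied function that takes a pdf path and returns a Promise of that file's text pages (for example by wrapping the 'complete' and 'error' events).

```javascript
// Run `extractOne` over each path in order and collect one result per file.
// A failure on one file is recorded and does not stop the remaining files.
async function extractMany (pdfPaths, extractOne) {
  const results = []
  for (const pdfPath of pdfPaths) {
    try {
      const pages = await extractOne(pdfPath)
      results.push({ pdfPath, pages })
    } catch (error) {
      results.push({ pdfPath, error })
    }
  }
  return results
}

// Example with a stand-in extractor that needs no PDF tooling
extractMany(['a.pdf', 'b.pdf'], async pdfPath => ['text of ' + pdfPath])
  .then(results => results.forEach(r => console.log(r.pdfPath, r.pages)))
```

Sequential processing is slower than running files in parallel, but it avoids launching many OCR processes at the same time.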

Undefined vars in readme

Here's an example you give:

var inspect = require('eyes').inspector({maxLength:20000});
var pdf_extract = require('pdf-extract');
var absolute_path_to_pdf = '~/Downloads/electronic.pdf'
var options = {
  type: 'text'  // extract the actual text in the pdf file
}
var processor = pdf_extract(absolute_path_to_pdf, options, function(err) {
  if (err) {
    return callback(err);
  }
});
processor.on('complete', function(data) {
  inspect(data.text_pages, 'extracted text pages');
  callback(null, data.text_pages);
});
processor.on('error', function(err) {
  inspect(err, 'error while extracting pages');
  return callback(err);
});

But the npm package eyes was never installed, and callback (called 3 times) is undefined.

Arguments to pdftotext

I did not see if pdf_extract() function allowed for arguments to pdftotext.
I am looking at lib/searchable.js line 30

I altered my local copy to

var child = spawn('pdftotext', (options.layout ? ['-layout'] : []).concat(options.ocr_flags).concat([pdf_path, '-']));

And call it with

var options = {
  type: 'text',  // extract the actual text in the pdf file
  ocr_flags: [
     '-f',1,
     '-l',1
  ]
}

var processor = pdf_extract(absolute_path_to_pdf, options, function(err) {             
  if (err) {
    res.end(util.inspect(err));
  }

Way to remove header and footer in the generated text

Hi,

The tool works great for the extraction of data but in some pages - the footer text gets intermingled with the page text body and it breaks the parsing.

Is there a way to turn off the header and footer extraction somewhere in the code?

Thank YOU

One of the examples at Readme.md is wrong

Text extract from searchable pdf

Extract from a pdf file which contains actual searchable text

var inspect = require('eyes').inspector({maxLength:20000});
var pdf_extract = require('pdf-extract');
var absolute_path_to_pdf = '~/Downloads/electronic.pdf'
var options = {
  type: 'text'  // extract the actual text in the pdf file
}
var processor = pdf_extract(absolute_path_to_pdf, options, function(err) {
  if (err) {
    return callback(err);
  }
});
processor.on('complete', function(data) {
  inspect(data.text_pages, 'extracted text pages');
  callback(null, data.text_pages); //<----- data.text_pages instead of just text_pages.
});
processor.on('error', function(err) {
  inspect(err, 'error while extracting pages');
  return callback(err);
});

I guess this little typo doesn't deserve a fork and a pull request.
Thank you very much for your software.

Windows guide

Thanks for the library. Here are the steps to run on windows.

npm install pdf-extract
install pdftk server
install ghostscript
go to ghostscript install bin directory, copy gswin64c.exe and paste a copy named gs.exe in same directory
add path to gs bin to environment variables path
install teserract-ocr
add path to teserract-ocr root dir to environment variables path
add new environment variable named TESSDATA_PREFIX and path it to the teserract-ocr root install dir

npm test Failing on Ubuntu 12.04

Followed the installation steps and then tried running npm test. I got an error indicating the absence of mocha. Installed node.js and mocha (which I did not see among the prerequisites), but still get an error. Here is the terminal output:

node_modules/.bin/mocha --reporter spec

sh: 1: node_modules/.bin/mocha: not found
npm ERR! Test failed. See above for more details.
npm ERR! not ok code 0

I suspect that the problem is that mocha is not in the expected folder.

an error occurred while splitting pdf into single pages with the pdftk burst command

 



static extractTextFromPdf(pdf, options) {
    console.log('entred to exctract');
    pdf = pathe.resolve(pdf)

    return new Promise((resolve, reject) => {
      const processor = pdfExtract(pdf, { ...options, clean: true }, (err) => {
        if (err) {
          reject(err);
        }
      });
      processor.on('complete', (data) => {
        resolve(data.text_pages);
        console.log('good');
      });
      processor.on('error', (err) => {
        console.log(' error extractTextFromPdf ');
        reject(err);
      });
    });
  }

The tests are failing in the ubuntu 13.04

7 tests are failing when running npm test. Most of them are due to "Error: timeout of 100000ms exceeded".
I have tried increasing the timeout value, but that doesn't help either.

GHSL-2020-116

Hello,

I am a member of the GitHub Security Lab (https://securitylab.github.com).

I've attempted to reach a maintainer for this project to report a potential security issue but have been unable to verify the report was received. Could a project maintainer please contact us at [email protected], using reference GHSL-2020-116?

Thank you,
Kevin Backhouse
GitHub Security Lab

Broken Link

In the README under windows, the link 'pdftotext' http://manifestwebdesign.com/2013/01/09/xpdf-and-poppler-utils-on-windows/ now just redirects to the manifestwebdesign.com homepage.

Searched a word inside text

It gives [-1 , -1] response. I think it gives -1 when it does not find a word inside text. But there is a word inside text.

Error extracting data from PDF document: "No current point in closepath"

Hello,
First of all, thanks a lot for the awesome work on this library. We have been using it for some time and are quite amazed by the work you made here.
Today we run into this error from an apparently totally OK PDF:

  error: 'Syntax Error (30523): No current point in closepath\n' +
    'Syntax Error (30538): No current point in closepath\n' +
    'Syntax Error (30556): No current point in closepath\n' +
    'Syntax Error (30566): No current point in closepath\n',
  pdf_path: '../samples/515317730_121477412.pdf'

This is a searchable/text pdf, so it is using pdfOCR with the following options:

const ocrSearchableOptions = {
    type: 'text', // extract searchable text from PDF
    ocr_flags: ['--psm 1'], 
    enc: 'UTF-8',  
    mode: 'layout'
}

I can provide the PDF if needed to analyze it.
Any help is greatly, greatly appreciated 🙏.
Thanks a lot in advance.

Does this extract images?

Reading from URL instead of File

This is a great library that I've been using extensively to get text from scanned pdf files. I was wondering if there was a way to read from a url instead of a file.

If there is a way, I would have to change the convert.js file correct ? I just want to know if I am on the right path and would appreciate any input!

no access to PATH in AWS Lambda

As the title suggests, you can't always get access to change the PATH. AWS Lambda is one of many environments that come to mind (CIs, AWS ECS, many PaaS, etc.).

We need to be able to tell the library which binary to use, similar to this npm library offers, https://www.npmjs.com/package/pdf-text-extract

Please update "sys" to "util"

When I execute a node script from my terminal, this message appears:

(node:21492) DeprecationWarning: sys is deprecated. Use util instead.

As you can read here, sys will no longer be usable in the future.

Can you update the repo?

Thanks

Timeout in Tests / Complete,Done,Error callback never called OSX

This package is perfect for what I am trying to do, however I can't seem to get it to work on OSX. I don't get any errors when running it in my Node application, however it never fires any of its callbacks after the process function. When running npm test from within the node modules folder, the dependency checks come back fine, but I get 7 errors afterwards, all having to do with timeouts. I've upped the timeouts in the tests and they still do not run. The only difference I have is Tesseract being version 3.04.00, however I'm not even trying to use OCR so I can't imagine this would be causing the issues.

Any help is appreciated!

Poppler util does not exist on the link provided in readme

poppler util does not exist on the link provided in readme. I am trying to use the text extract in a node app on windows

Download link for PDFTK is not up to date

I had a lot of trouble using this package. I was able to run the code but it hung indefinitely on this exec line.

https://github.com/nisaacson/pdf-extract/blob/master/lib/split.js#L47

I was able to fix this by downloading the installer from this stackoverflow link. I think it should be updated in the README as well.

https://stackoverflow.com/questions/60859527/how-to-solve-pdftk-bad-cpu-type-in-executable-on-mac

Extract a single page

This takes a long time for a large pdf file, how to extract and process a single page only?
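
For searchable PDFs, one way to sidestep the full run is to call pdftotext directly with its `-f` (first page) and `-l` (last page) flags set to the same page number. The sketch below bypasses pdf-extract entirely and does not cover the OCR path; `singlePageArgs` and `extractPageText` are hypothetical helper names, and pdftotext is assumed to be on the PATH.

```javascript
const { spawn } = require('child_process')

// Build pdftotext arguments that restrict output to one page.
// The trailing '-' asks pdftotext to write the text to stdout.
function singlePageArgs (pdfPath, pageNumber) {
  if (!Number.isInteger(pageNumber) || pageNumber < 1) {
    throw new RangeError('pageNumber must be a positive integer')
  }
  const page = String(pageNumber)
  return ['-layout', '-f', page, '-l', page, pdfPath, '-']
}

// Extract the text of a single page and pass it to done(error, text).
function extractPageText (pdfPath, pageNumber, done) {
  const child = spawn('pdftotext', singlePageArgs(pdfPath, pageNumber))
  let text = ''
  let errText = ''
  let finished = false
  const finish = (err, result) => {
    if (finished) return // make sure done is only called once
    finished = true
    done(err, result)
  }
  child.stdout.on('data', chunk => { text += chunk })
  child.stderr.on('data', chunk => { errText += chunk })
  child.on('error', finish) // for example, pdftotext is not installed
  child.on('close', code => {
    if (code === 0) finish(null, text)
    else finish(new Error(errText || 'pdftotext exited with code ' + code))
  })
}

// Possible usage:
// extractPageText('/path/to/file.pdf', 3, (err, text) => console.log(err || text))
```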

tests failing on homebrew install

Homebrew has moved up to Tesseract 3.02; the README has specific references to 3.01's directory structure.

I tried copying the training data files into the apparent directories used by the 3.02 brew install

cp ./share/eng.traineddata /usr/local/share/tessdata/
cp ./share/configs/alphanumeric /usr/local/share/tessdata/configs/

but the tests still fail.
I tried downgrading to the 3.01 homebrew version of Tesseract, but changes to the latest XCode seem to have broken the way that it assumes autoconf will work.
Using tesseract directly confirms problems loading relevant image processing libraries. after brew reinstall libtiff and brew reinstall libjpeg, TIFF support is still screwed up. This appears to be a problem with the leptonica recipe. I finally overcame it with brew reinstall --with-libtiff leptonica

This was a complete disaster, and I'm unsure where the fault lies. Clearly not with pdf-extract (other than the README referring to the wrong location for tesseract training data)! But other users will likely come across this, too.

UPDATE:

The README is also lacking a reference to copying around dia.traineddata.

ghostscript vs imagemagick

By default, tesseract produces gibberish for me. I noticed that convert is commented out in favor of gs. I tried convert -depth 8 -background white -flatten -matte -density 300 <input> <output> instead and tesseract produced great results. The whole process was a lot faster too: ~15 minutes vs ~1 minute for 6 pages. I am curious why ghostscript is used rather than imagemagick for conversion?
