nisaacson / pdf-extract

Node PDF Extract

License: MIT License

JavaScript 98.14% Makefile 0.74% Shell 1.11%

pdf-extract's Issues

npm test Failing on Ubuntu 12.04

I followed the installation steps and then tried running npm test. I got an error indicating that mocha was missing. I installed node.js and mocha (which I did not see among the prerequisites), but still get an error. Here is the terminal output:

node_modules/.bin/mocha --reporter spec

sh: 1: node_modules/.bin/mocha: not found
npm ERR! Test failed. See above for more details.
npm ERR! not ok code 0

I suspect that the problem is that mocha is not in the expected folder.

Arguments to pdftotext

I could not tell whether the pdf_extract() function allows passing arguments to pdftotext.
I am looking at lib/searchable.js, line 30.

I altered my local copy to

var child = spawn('pdftotext', (options.layout ? ['-layout'] : []).concat(options.ocr_flags).concat([pdf_path, '-']));

And call it with

var options = {
  type: 'text',  // extract the actual text in the pdf file
  ocr_flags: [
     '-f',1,
     '-l',1
  ]
}

var processor = pdf_extract(absolute_path_to_pdf, options, function(err) {
  if (err) {
    res.end(util.inspect(err));
  }
});
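
A minimal sketch of how the argument list could be assembled so that a missing flag list does not end up as undefined in the spawn call. buildPdftotextArgs is a hypothetical helper, not part of pdf-extract; it reuses the ocr_flags option name from the snippet above, and -layout, -f, and -l are standard pdftotext options.

```javascript
// Hypothetical helper: builds the argument array for spawn('pdftotext', args).
function buildPdftotextArgs(pdfPath, options) {
  const opts = options || {};
  const args = [];
  if (opts.layout) {
    args.push('-layout');
  }
  // Extra flags are optional; stringify them so numeric values are passed as text.
  const extra = Array.isArray(opts.ocr_flags) ? opts.ocr_flags : [];
  for (const flag of extra) {
    args.push(String(flag));
  }
  // '-' tells pdftotext to write the extracted text to stdout.
  args.push(pdfPath, '-');
  return args;
}
```

With this, a call without ocr_flags yields only the path and the trailing '-', rather than an array containing undefined.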

ghostscript vs imagemagick

By default, tesseract produces gibberish for me. I noticed that convert is commented out in favor of gs. I tried convert -depth 8 -background white -flatten -matte -density 300 <input> <output> instead and tesseract produced great results. The whole process was also a lot faster: about 1 minute instead of about 15 minutes for 6 pages. Why is ghostscript used rather than imagemagick for the conversion?

Reading from URL instead of File

This is a great library that I've been using extensively to get text from scanned pdf files. I was wondering if there was a way to read from a url instead of a file.

If there is a way, would I have to change the convert.js file? I just want to know if I am on the right path and would appreciate any input!
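
One possible workaround that does not require changing the library is to download the PDF to a temporary file first and hand that path to pdf_extract. The sketch below uses only Node core modules; downloadToTempFile and the file name download.pdf are hypothetical, and redirects are not followed.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const http = require('http');
const https = require('https');

// Hypothetical helper: saves the response body of `url` to a temp file
// and resolves with that file's path.
function downloadToTempFile(url) {
  return new Promise((resolve, reject) => {
    const client = url.startsWith('https:') ? https : http;
    const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'pdf-'));
    const target = path.join(dir, 'download.pdf');
    client.get(url, (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body
        reject(new Error('Unexpected status ' + res.statusCode));
        return;
      }
      const file = fs.createWriteStream(target);
      res.pipe(file);
      file.on('finish', () => resolve(target));
      file.on('error', reject);
    }).on('error', reject);
  });
}
```

The resolved path can then be passed to pdf_extract in place of absolute_path_to_pdf.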

Please update "sys" to "util"

When I execute a node script from my terminal, this message appears:

(node:21492) DeprecationWarning: sys is deprecated. Use util instead.

As the warning indicates, sys will not remain usable in future versions.

Can you update the repo?

Thanks
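
A sketch of the kind of change being requested, assuming the library only uses sys for helpers that util also provides (such as inspect). Which files in the repo actually require sys is not shown here.

```javascript
// Before (triggers the deprecation warning):
// var sys = require('sys');

// After: util is the supported core module.
const util = require('util');

const summary = util.inspect({ pages: 2, type: 'text' });
console.log(summary);
```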

Poppler util does not exist on the link provided in readme

The Poppler utils are not available at the link provided in the README. I am trying to use the text extraction in a Node app on Windows.

Tests failing on Ubuntu 13.04

7 tests fail when running npm test. Most of them are due to "Error: timeout of 100000ms exceeded".
I have tried increasing the timeout value, but that doesn't help either.

Adding the option to Decrypt a PDF

Hi,

There is currently no way to pass arguments through to pdftk to decrypt a document.

I'd like to submit a change to allow for this.

Searching for a word inside the text

The search returns [-1, -1]. I assume it returns -1 when the word is not found in the text, but the word is present in the text.

an error occurred while splitting pdf into single pages with the pdftk burst command

static extractTextFromPdf(pdf, options) {
  console.log('entered extractTextFromPdf');
  pdf = pathe.resolve(pdf);

  return new Promise((resolve, reject) => {
    const processor = pdfExtract(pdf, { ...options, clean: true }, (err) => {
      if (err) {
        reject(err);
      }
    });
    processor.on('complete', (data) => {
      console.log('good');
      resolve(data.text_pages);
    });
    processor.on('error', (err) => {
      console.log('error in extractTextFromPdf');
      reject(err);
    });
  });
}

Timeout in tests / complete, done, error callbacks never called on OSX

This package is perfect for what I am trying to do, but I can't get it to work on OSX. I don't get any errors when running it in my Node application, but it never fires any of its callbacks after the process function. When I run npm test from within node_modules, the dependencies check out fine, but I then get 7 errors, all of them timeouts. I've raised the timeouts in the tests and they still fail. The only difference I have is that Tesseract is version 3.04.00; however, I'm not even trying to use OCR, so I can't imagine this would be causing the issues.

Any help is appreciated!

Undefined vars in readme

Here's an example you give:

var inspect = require('eyes').inspector({maxLength:20000});
var pdf_extract = require('pdf-extract');
var absolute_path_to_pdf = '~/Downloads/electronic.pdf'
var options = {
  type: 'text'  // extract the actual text in the pdf file
}
var processor = pdf_extract(absolute_path_to_pdf, options, function(err) {
  if (err) {
    return callback(err);
  }
});
processor.on('complete', function(data) {
  inspect(data.text_pages, 'extracted text pages');
  callback(null, data.text_pages);
});
processor.on('error', function(err) {
  inspect(err, 'error while extracting pages');
  return callback(err);
});

But the npm package eyes was never installed, and callback (called 3 times) is undefined.
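
A minimal sketch of how the example could be made self-contained, assuming the same pdf_extract call signature and events shown above. extractTextPages is a hypothetical wrapper; it takes the extractor as a parameter, defines the callback explicitly, and drops the eyes dependency.

```javascript
// Wraps the README flow in a function so that `callback` is defined.
// `pdfExtract` is passed in (normally require('pdf-extract')).
function extractTextPages(pdfExtract, pdfPath, callback) {
  let finished = false;
  // Guard so the caller's callback runs at most once.
  function done(err, pages) {
    if (finished) return;
    finished = true;
    callback(err, pages);
  }
  const processor = pdfExtract(pdfPath, { type: 'text' }, function (err) {
    if (err) done(err);
  });
  processor.on('complete', function (data) {
    done(null, data.text_pages);
  });
  processor.on('error', function (err) {
    done(err);
  });
  return processor;
}

// Usage (requires the pdf-extract package and its binaries):
// extractTextPages(require('pdf-extract'), '/absolute/path/to/file.pdf',
//   function (err, pages) { console.log(err || pages); });
```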

no access to PATH in AWS Lambda

As the title suggests, you can't always change the PATH. AWS Lambda is one of many environments that come to mind (CIs, AWS ECS, many PaaS, etc.).

We need to be able to tell the library which binary to use, similar to what this npm library offers: https://www.npmjs.com/package/pdf-text-extract
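
Until such an option exists, one workaround that may help is to adjust process.env.PATH from inside the Node process before calling the library, since child processes started by Node inherit process.env by default. This is a sketch under the assumption that the library launches its tools by bare name; prependToPath and the bin directory are hypothetical.

```javascript
const path = require('path');

// Returns a PATH value with `binDir` placed first, without duplicating it.
function prependToPath(binDir, currentPath) {
  const parts = (currentPath || '').split(path.delimiter).filter(Boolean);
  const rest = parts.filter((p) => p !== binDir);
  return [binDir].concat(rest).join(path.delimiter);
}

// Hypothetical location of binaries bundled with the deployment package.
const bundledBin = path.join(process.cwd(), 'bin');
process.env.PATH = prependToPath(bundledBin, process.env.PATH);
```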

One of the examples at Readme.md is wrong

Text extract from searchable pdf

Extract from a pdf file which contains actual searchable text

var inspect = require('eyes').inspector({maxLength:20000});
var pdf_extract = require('pdf-extract');
var absolute_path_to_pdf = '~/Downloads/electronic.pdf'
var options = {
  type: 'text'  // extract the actual text in the pdf file
}
var processor = pdf_extract(absolute_path_to_pdf, options, function(err) {
  if (err) {
    return callback(err);
  }
});
processor.on('complete', function(data) {
  inspect(data.text_pages, 'extracted text pages');
  callback(null, data.text_pages); //<----- data.text_pages instead of just text_pages.
});
processor.on('error', function(err) {
  inspect(err, 'error while extracting pages');
  return callback(err);
});

I guess this little typo doesn't deserve a fork and a pull request.
Thank you very much for your software.

Extract a single page

Processing takes a long time for a large PDF file. How can I extract and process only a single page?
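
For searchable PDFs, one option is to call pdftotext directly with its -f (first page) and -l (last page) options set to the same number. The sketch below bypasses pdf-extract entirely; singlePageArgs and extractPage are hypothetical names, and pdftotext must be installed and on the PATH.

```javascript
const { execFile } = require('child_process');

// Builds pdftotext arguments restricted to one page (1-based).
function singlePageArgs(pdfPath, page) {
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError('page must be a positive integer');
  }
  return ['-f', String(page), '-l', String(page), pdfPath, '-'];
}

// Runs pdftotext for a single page and passes the text to the callback.
function extractPage(pdfPath, page, callback) {
  execFile('pdftotext', singlePageArgs(pdfPath, page), function (err, stdout) {
    callback(err, err ? undefined : stdout);
  });
}
```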

Broken Link

In the README under windows, the link 'pdftotext' http://manifestwebdesign.com/2013/01/09/xpdf-and-poppler-utils-on-windows/ now just redirects to the manifestwebdesign.com homepage.

Does this extract images?

Download link for PDFTK is not up to date

I had a lot of trouble using this package. I was able to run the code, but it hung indefinitely on this exec line.

https://github.com/nisaacson/pdf-extract/blob/master/lib/split.js#L47

I was able to fix this by downloading the installer from this stackoverflow link. I think it should be updated in the README as well.

https://stackoverflow.com/questions/60859527/how-to-solve-pdftk-bad-cpu-type-in-executable-on-mac

Windows guide

Thanks for the library. Here are the steps to run it on Windows.

npm install pdf-extract
install pdftk server
install ghostscript
go to the ghostscript install bin directory, copy gswin64c.exe and paste a copy named gs.exe in the same directory
add the path to the gs bin directory to the PATH environment variable
install tesseract-ocr
add the path to the tesseract-ocr root directory to the PATH environment variable
add a new environment variable named TESSDATA_PREFIX and point it to the tesseract-ocr root install directory
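
After following the steps above, a quick check from Node can confirm whether the tools are visible on the PATH. This is a sketch only; findOnPath is a hypothetical helper, and the tool names are the ones mentioned in the steps.

```javascript
const fs = require('fs');
const path = require('path');

// Returns the first full path where `name` exists in a PATH-style string, or null.
function findOnPath(name, pathValue, extensions) {
  const dirs = (pathValue || '').split(path.delimiter).filter(Boolean);
  const exts = extensions || [''];
  for (const dir of dirs) {
    for (const ext of exts) {
      const candidate = path.join(dir, name + ext);
      if (fs.existsSync(candidate)) return candidate;
    }
  }
  return null;
}

// Report which of the tools can be found.
for (const tool of ['pdftk', 'gs', 'tesseract']) {
  const found = findOnPath(tool, process.env.PATH, ['', '.exe']);
  console.log(tool + ': ' + (found || 'not found on PATH'));
}
```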

tests failing on homebrew install

Homebrew has moved up to Tesseract 3.02; the README has specific references to 3.01's directory structure.

I tried copying the training data files into the apparent directories used by the 3.02 brew install

cp ./share/eng.traineddata /usr/local/share/tessdata/
cp ./share/configs/alphanumeric /usr/local/share/tessdata/configs/

but the tests still fail.
I tried downgrading to the 3.01 homebrew version of Tesseract, but changes to the latest XCode seem to have broken the way that it assumes autoconf will work.
Using tesseract directly confirms problems loading the relevant image processing libraries. After brew reinstall libtiff and brew reinstall libjpeg, TIFF support is still broken. This appears to be a problem with the leptonica recipe. I finally overcame it with brew reinstall --with-libtiff leptonica

This was a complete disaster, and I'm unsure where the fault lies. Clearly not with pdf-extract (other than the README referring to the wrong location for the tesseract training data)! But other users will likely come across this, too.

UPDATE:

The README is also lacking a reference to copying around dia.traineddata.

Way to remove header and footer in the generated text

Hi,

The tool works great for the extraction of data but in some pages - the footer text gets intermingled with the page text body and it breaks the parsing.

Is there a way to turn off the header and footer extraction somewhere in the code?

Thank YOU
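
If the library has no such switch, one fallback is to post-process each extracted page. The sketch below assumes each page is a plain string and that headers and footers occupy a fixed number of lines, which will not hold for every document; stripHeaderFooter is a hypothetical helper.

```javascript
// Drops the first `headerLines` and last `footerLines` lines of one page's text.
function stripHeaderFooter(pageText, headerLines, footerLines) {
  const lines = pageText.split(/\r?\n/);
  // Remove trailing blank lines so the footer count refers to visible lines.
  while (lines.length > 0 && lines[lines.length - 1].trim() === '') {
    lines.pop();
  }
  const end = Math.max(headerLines, lines.length - footerLines);
  return lines.slice(headerLines, end).join('\n');
}
```

It could be applied to the result with something like data.text_pages.map((p) => stripHeaderFooter(p, 1, 1)).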

Error (XX): Illegal character <XX> in hex string

{
    "error": "Illegal character <3f> in hex string\nError (82): Illegal character <78> in hex string\nError (83): Illegal character <70> in hex string\nError (86): Illegal character <6b> in hex string\nError (88): Illegal character <74> in hex string\nError (92): Illegal character <67> in hex string\nError (93): Illegal character <69> in hex string\nError (94): Illegal character <6e> in hex string\nError (95): Illegal character <3d> in hex string\nError (96): Illegal character <27> in hex string\nError (97): Illegal character <ef> in hex string\nError (98): Illegal character <bb> in hex string\nError (99): Illegal character <bf> in hex string\nError (100): Illegal character <27> in hex string\nError (102): Illegal character <69> in hex string\nError (104): Illegal character <3d> in hex string\nError (105): Illegal character <27> in hex string\nError (106): Illegal character <57> in hex string\nError (108): Illegal character <4d> in hex string\nError (110): Illegal character <4d> in hex string\nError (111): Illegal character <70> in hex string\nError (114): Illegal character <68> in hex string\nError (115): Illegal character <69> in hex string\nError (116): Illegal character <48> in hex string\nError (117): Illegal character <7a> in hex string\nError (118): Illegal character <72> in hex string\nError (120): Illegal character <53> in hex string\nError (121): Illegal character <7a> in hex string\nError (122): Illegal character <4e> in hex string\nError (123): Illegal character <54> in hex string\nError (125): Illegal character <7a> in hex string\nError (126): Illegal character <6b> in hex string\nError (130): Illegal character <27> in hex string\nError (131): Illegal character <3f> in hex string\n",
}

GHSL-2020-116

Hello,

I am a member of the GitHub Security Lab (https://securitylab.github.com).

I've attempted to reach a maintainer for this project to report a potential security issue but have been unable to verify the report was received. Could a project maintainer please contact us at [email protected], using reference GHSL-2020-116?

Thank you,
Kevin Backhouse
GitHub Security Lab

Error extracting data from PDF document: "No current point in closepath"

Hello,
First of all, thanks a lot for the awesome work on this library. We have been using it for some time and are quite amazed by the work you did here.
Today we ran into this error from an apparently totally OK PDF:

  error: 'Syntax Error (30523): No current point in closepath\n' +
    'Syntax Error (30538): No current point in closepath\n' +
    'Syntax Error (30556): No current point in closepath\n' +
    'Syntax Error (30566): No current point in closepath\n',
  pdf_path: '../samples/515317730_121477412.pdf'

This is a searchable/text pdf, so it is using pdfOCR with the following options:

const ocrSearchableOptions = {
    type: 'text', // extract searchable text from PDF
    ocr_flags: ['--psm 1'], 
    enc: 'UTF-8',  
    mode: 'layout'
}

I can provide the PDF if needed to analyze it.
Any help is greatly, greatly appreciated 🙏.
Thanks a lot in advance.

Documentation not reflecting library use for multiple pdf OCR

It would be preferable if the library docs showed how to extract text from multiple PDF files.
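
A minimal sketch of one way to handle several files, assuming a function that extracts a single PDF and returns a Promise (for example, a wrapper around pdf_extract that resolves on 'complete' and rejects on 'error'). extractAll is a hypothetical helper; it processes files one at a time so that only one set of external processes runs at once.

```javascript
// Runs `extractOne(path)` (which must return a Promise) for each path in turn.
// Failures are recorded per file instead of aborting the whole batch.
async function extractAll(extractOne, pdfPaths) {
  const results = [];
  for (const pdfPath of pdfPaths) {
    try {
      results.push({ path: pdfPath, pages: await extractOne(pdfPath) });
    } catch (err) {
      results.push({ path: pdfPath, error: err });
    }
  }
  return results;
}
```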
