pdf-extract's Issues

npm test Failing on Ubuntu 12.04

I followed the installation steps and then tried running npm test. I got an error indicating that mocha was missing. I installed node.js and mocha (which I did not see among the prerequisites), but I still get an error. Here is the terminal output:

node_modules/.bin/mocha --reporter spec

sh: 1: node_modules/.bin/mocha: not found
npm ERR! Test failed. See above for more details.
npm ERR! not ok code 0

I suspect that the problem is that mocha is not in the expected folder.

Arguments to pdftotext

I could not tell whether the pdf_extract() function allows arguments to be passed to pdftotext.
I am looking at lib/searchable.js, line 30.

I altered my local copy to

var child = spawn('pdftotext', (options.layout ? ['-layout'] : []).concat(options.ocr_flags || []).concat([pdf_path, '-']));

And call it with

var options = {
  type: 'text',  // extract the actual text in the pdf file
  ocr_flags: [
     '-f',1,
     '-l',1
  ]
}

var processor = pdf_extract(absolute_path_to_pdf, options, function(err) {
  if (err) {
    res.end(util.inspect(err));
  }
});

ghostscript vs imagemagick

By default, tesseract produces gibberish for me. I noticed that convert is commented out in favor of gs. I tried convert -depth 8 -background white -flatten -matte -density 300 <input> <output> instead, and tesseract produced great results. The whole process was a lot faster too: about 1 minute instead of about 15 minutes for 6 pages. Why is ghostscript used rather than imagemagick for the conversion?
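
For anyone who wants to try the same substitution from Node, here is a minimal sketch that builds the argument list for the convert command quoted above and runs it with child_process.spawn. The helper names convertArgs and convertPage are hypothetical and not part of pdf-extract; the flags are only the ones from this report.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper: argument list for the `convert` command quoted above.
function convertArgs(inputPath, outputPath) {
  return [
    '-depth', '8',
    '-background', 'white',
    '-flatten',
    '-matte',
    '-density', '300',
    inputPath,
    outputPath,
  ];
}

// Hypothetical helper: runs ImageMagick's convert and calls back exactly once.
function convertPage(inputPath, outputPath, callback) {
  let called = false;
  const finish = (err) => {
    if (!called) {
      called = true;
      callback(err);
    }
  };
  const child = spawn('convert', convertArgs(inputPath, outputPath));
  child.on('error', finish); // for example, convert is not installed
  child.on('close', (code) => {
    finish(code === 0 ? null : new Error(`convert exited with code ${code}`));
  });
}
```

Keeping the argument list in its own function makes it easy to compare against the gs invocation without running either tool.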

Reading from URL instead of File

This is a great library that I've been using extensively to get text from scanned pdf files. I was wondering if there was a way to read from a url instead of a file.

If there is a way, would I have to change the convert.js file? I just want to know whether I am on the right path, and I would appreciate any input!
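
One way to approach this without touching the library is to download the PDF to a temporary file first and then pass that path to pdf_extract as usual. The sketch below assumes a plain HTTP(S) URL that answers with status 200 (redirects are not followed); downloadToTempFile is a hypothetical helper, not something pdf-extract provides.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const http = require('http');
const https = require('https');

// Hypothetical helper: saves the response body to a temp file and
// calls back with the path of that file.
function downloadToTempFile(url, callback) {
  const client = url.startsWith('https:') ? https : http;
  const target = path.join(os.tmpdir(), `pdf-download-${process.pid}-${Date.now()}.pdf`);
  const request = client.get(url, (response) => {
    if (response.statusCode !== 200) {
      response.resume(); // discard the body so the socket is released
      callback(new Error(`Unexpected status code ${response.statusCode}`));
      return;
    }
    const file = fs.createWriteStream(target);
    file.on('error', callback);
    file.on('finish', () => callback(null, target));
    response.pipe(file);
  });
  request.on('error', callback);
}

// Usage (assumes pdf-extract is installed):
// downloadToTempFile('https://example.com/scan.pdf', function (err, pdfPath) {
//   if (err) throw err;
//   var processor = require('pdf-extract')(pdfPath, { type: 'ocr' }, function (err) {});
// });
```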

Please update "sys" to "util"

When I execute a node script from my terminal, this message appears:

(node:21492) DeprecationWarning: sys is deprecated. Use util instead.

As the warning says, sys is deprecated, so it may stop working in a future version of Node.

Can you update the repo?
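
The requested change is usually a one-line swap, since sys was an alias for util. A minimal sketch, assuming the library only needs helpers such as inspect (which files require sys is not stated in this report):

```javascript
// Before (triggers the DeprecationWarning shown above):
// var sys = require('sys');

// After: util provides the same helpers, for example inspect().
var util = require('util');

console.log(util.inspect({ type: 'text', pages: [1, 2] }));
```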

Thanks

The tests are failing on Ubuntu 13.04

7 tests fail when running npm test. Most of them are due to "Error: timeout of 100000ms exceeded".
I have tried increasing the timeout value, but that doesn't help either.

Adding the option to Decrypt a PDF

Hi,

I am currently not able to pass arguments through to pdftk to decrypt a document.

I'd like to submit a change to allow for this.
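
As a starting point for such a change, pdftk accepts an input_pw argument followed by the password when reading an encrypted file. The sketch below only builds the argument list for a decrypt step; decryptArgs and the idea of running it before the extraction are assumptions, not existing pdf-extract behaviour.

```javascript
// Hypothetical helper: arguments for
//   pdftk <input> input_pw <password> output <output>
// which writes a copy of the document that no longer needs the password.
function decryptArgs(inputPath, password, outputPath) {
  return [inputPath, 'input_pw', password, 'output', outputPath];
}

// Usage (assumes pdftk is installed and on the PATH):
// const { spawn } = require('child_process');
// spawn('pdftk', decryptArgs('secured.pdf', 'secret', 'unsecured.pdf'));
```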

Searching for a word inside the text

It returns [-1, -1]. I think it returns -1 when it does not find the word in the text, but the word is present in the text.

an error occurred while splitting pdf into single pages with the pdftk burst command

static extractTextFromPdf(pdf, options) {
    console.log('entered extract');
    pdf = pathe.resolve(pdf);

    return new Promise((resolve, reject) => {
      const processor = pdfExtract(pdf, { ...options, clean: true }, (err) => {
        if (err) {
          reject(err);
        }
      });
      processor.on('complete', (data) => {
        resolve(data.text_pages);
        console.log('good');
      });
      processor.on('error', (err) => {
        console.log(' error extractTextFromPdf ');
        reject(err);
      });
    });
  }

Timeout in Tests / Complete, Done, Error callbacks never called on OSX

This package is perfect for what I am trying to do, but I can't seem to get it to work on OSX. I don't get any errors when running it in my Node application, but it never fires any of its callbacks after the process function. When running npm test from within the node modules folder, the dependency checks come back fine, but I then get 7 errors, all related to timeouts. I've increased the timeouts in the tests and they still don't pass. The only difference I have is that Tesseract is version 3.04.00; however, I'm not even trying to use OCR, so I can't imagine this would be causing the issues.

Any help is appreciated!

Undefined vars in readme

Here's an example you give:

var inspect = require('eyes').inspector({maxLength:20000});
var pdf_extract = require('pdf-extract');
var absolute_path_to_pdf = '~/Downloads/electronic.pdf'
var options = {
  type: 'text'  // extract the actual text in the pdf file
}
var processor = pdf_extract(absolute_path_to_pdf, options, function(err) {
  if (err) {
    return callback(err);
  }
});
processor.on('complete', function(data) {
  inspect(data.text_pages, 'extracted text pages');
  callback(null, data.text_pages);
});
processor.on('error', function(err) {
  inspect(err, 'error while extracting pages');
  return callback(err);
});

But the npm package eyes was never installed, and callback (called 3 times) is undefined.
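
A self-contained variant of the example might look like the sketch below. It drops the eyes dependency and defines the callback explicitly; extractTextPages is a hypothetical wrapper, and pdf_extract is passed in as a parameter so that the snippet makes no assumption about how the module is installed.

```javascript
// Hypothetical wrapper around the README example. `pdfExtract` is the function
// returned by require('pdf-extract'); `callback` is called at most once.
function extractTextPages(pdfExtract, pdfPath, callback) {
  let finished = false;
  const done = (err, pages) => {
    if (finished) return;
    finished = true;
    callback(err, pages);
  };
  const processor = pdfExtract(pdfPath, { type: 'text' }, (err) => {
    if (err) done(err);
  });
  processor.on('complete', (data) => done(null, data.text_pages));
  processor.on('error', (err) => done(err));
  return processor;
}

// Usage (assumes pdf-extract is installed):
// extractTextPages(require('pdf-extract'), '/absolute/path/to/electronic.pdf',
//   function (err, pages) {
//     if (err) return console.error(err);
//     console.log(pages);
//   });
```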

One of the examples in Readme.md is wrong

Text extract from searchable pdf

Extract from a pdf file which contains actual searchable text

var inspect = require('eyes').inspector({maxLength:20000});
var pdf_extract = require('pdf-extract');
var absolute_path_to_pdf = '~/Downloads/electronic.pdf'
var options = {
  type: 'text'  // extract the actual text in the pdf file
}
var processor = pdf_extract(absolute_path_to_pdf, options, function(err) {
  if (err) {
    return callback(err);
  }
});
processor.on('complete', function(data) {
  inspect(data.text_pages, 'extracted text pages');
  callback(null, data.text_pages); //<----- data.text_pages instead of just text_pages.
});
processor.on('error', function(err) {
  inspect(err, 'error while extracting pages');
  return callback(err);
});

I guess this little typo doesn't deserve a fork and a pull request.
Thank you very much for your software.

Extract a single page

Extraction takes a long time for a large pdf file. How can I extract and process a single page only?
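
One possible workaround for searchable PDFs, sketched under assumptions: pdftotext accepts -f and -l to set the first and last page, so a single page can be extracted by calling it directly. The helper names singlePageArgs and extractPage are hypothetical, and this bypasses pdf-extract entirely.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper: pdftotext arguments that limit output to one page
// and write the text to stdout ('-').
function singlePageArgs(pdfPath, page, layout) {
  const pageArg = String(page);
  return (layout ? ['-layout'] : []).concat(['-f', pageArg, '-l', pageArg, pdfPath, '-']);
}

// Hypothetical helper: collects the text of one page and calls back once.
function extractPage(pdfPath, page, callback) {
  let text = '';
  let finished = false;
  const finish = (err, result) => {
    if (!finished) {
      finished = true;
      callback(err, result);
    }
  };
  const child = spawn('pdftotext', singlePageArgs(pdfPath, page, true));
  child.stdout.setEncoding('utf8');
  child.stdout.on('data', (chunk) => { text += chunk; });
  child.on('error', finish); // for example, pdftotext is not installed
  child.on('close', (code) => {
    if (code === 0) finish(null, text);
    else finish(new Error(`pdftotext exited with code ${code}`));
  });
}
```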

Windows guide

Thanks for the library. Here are the steps to run on windows.

  • npm install pdf-extract
  • install pdftk server
  • install ghostscript
  • go to the ghostscript install bin directory, copy gswin64c.exe and paste a copy named gs.exe in the same directory
  • add the path to the gs bin directory to the PATH environment variable
  • install tesseract-ocr
  • add the path to the tesseract-ocr root directory to the PATH environment variable
  • add a new environment variable named TESSDATA_PREFIX and point it to the tesseract-ocr root install directory
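
After following the steps above, a quick way to confirm that the tools are reachable is to look for them on PATH from Node. This is a minimal sketch: findOnPath is a hypothetical helper, and the command names pdftk, gs and tesseract are assumed to be the executable names that end up on PATH.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: returns the full path of `command` if it exists in one
// of the directories listed in `pathValue`, otherwise null.
function findOnPath(command, pathValue, extensions) {
  const dirs = (pathValue || '').split(path.delimiter).filter(Boolean);
  for (const dir of dirs) {
    for (const ext of extensions) {
      const candidate = path.join(dir, command + ext);
      if (fs.existsSync(candidate)) return candidate;
    }
  }
  return null;
}

// On Windows the executables carry an .exe suffix.
const extensions = process.platform === 'win32' ? ['.exe', ''] : [''];
for (const command of ['pdftk', 'gs', 'tesseract']) {
  const found = findOnPath(command, process.env.PATH, extensions);
  console.log(`${command}: ${found || 'not found on PATH'}`);
}
console.log(`TESSDATA_PREFIX: ${process.env.TESSDATA_PREFIX || 'not set'}`);
```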

tests failing on homebrew install

Homebrew has moved up to Tesseract 3.02; the README has specific references to 3.01's directory structure.

  • I tried copying the training data files into the apparent directories used by the 3.02 brew install

    cp ./share/eng.traineddata /usr/local/share/tessdata/
    cp ./share/configs/alphanumeric /usr/local/share/tessdata/configs/

    but the tests still fail.

  • I tried downgrading to the 3.01 homebrew version of Tesseract, but changes to the latest XCode seem to have broken the way that it assumes autoconf will work.

  • Using tesseract directly confirms problems loading the relevant image processing libraries. After brew reinstall libtiff and brew reinstall libjpeg, TIFF support is still screwed up. This appears to be a problem with the leptonica recipe. I finally overcame it with brew reinstall --with-libtiff leptonica

This was a complete disaster, and I'm unsure where the fault lies. Clearly not with pdf-extract (other than the README referring to the wrong location for the tesseract training data)! But other users will likely come across this, too.

UPDATE:

The README is also lacking a reference to copying around dia.traineddata.

Way to remove header and footer in the generated text

Hi,

The tool works great for extracting data, but on some pages the footer text gets intermingled with the body text of the page, and this breaks the parsing.

Is there a way to turn off the header and footer extraction somewhere in the code?

Thank YOU

Error (XX): Illegal character <XX> in hex string

{
    "error": "Illegal character <3f> in hex string\nError (82): Illegal character <78> in hex string\nError (83): Illegal character <70> in hex string\nError (86): Illegal character <6b> in hex string\nError (88): Illegal character <74> in hex string\nError (92): Illegal character <67> in hex string\nError (93): Illegal character <69> in hex string\nError (94): Illegal character <6e> in hex string\nError (95): Illegal character <3d> in hex string\nError (96): Illegal character <27> in hex string\nError (97): Illegal character <ef> in hex string\nError (98): Illegal character <bb> in hex string\nError (99): Illegal character <bf> in hex string\nError (100): Illegal character <27> in hex string\nError (102): Illegal character <69> in hex string\nError (104): Illegal character <3d> in hex string\nError (105): Illegal character <27> in hex string\nError (106): Illegal character <57> in hex string\nError (108): Illegal character <4d> in hex string\nError (110): Illegal character <4d> in hex string\nError (111): Illegal character <70> in hex string\nError (114): Illegal character <68> in hex string\nError (115): Illegal character <69> in hex string\nError (116): Illegal character <48> in hex string\nError (117): Illegal character <7a> in hex string\nError (118): Illegal character <72> in hex string\nError (120): Illegal character <53> in hex string\nError (121): Illegal character <7a> in hex string\nError (122): Illegal character <4e> in hex string\nError (123): Illegal character <54> in hex string\nError (125): Illegal character <7a> in hex string\nError (126): Illegal character <6b> in hex string\nError (130): Illegal character <27> in hex string\nError (131): Illegal character <3f> in hex string\n",
}

GHSL-2020-116

Hello,

I am a member of the GitHub Security Lab (https://securitylab.github.com).

I've attempted to reach a maintainer for this project to report a potential security issue but have been unable to verify that the report was received. Could a project maintainer please contact us at [email protected], using reference GHSL-2020-116?

Thank you,
Kevin Backhouse
GitHub Security Lab

Error extracting data from PDF document: "No current point in closepath"

Hello,
First of all, thanks a lot for the awesome work on this library. We have been using it for some time and are quite amazed by the work you have done here.
Today we ran into this error with an apparently perfectly fine PDF:

  error: 'Syntax Error (30523): No current point in closepath\n' +
    'Syntax Error (30538): No current point in closepath\n' +
    'Syntax Error (30556): No current point in closepath\n' +
    'Syntax Error (30566): No current point in closepath\n',
  pdf_path: '../samples/515317730_121477412.pdf'

This is a searchable/text pdf, so it is using pdfOCR with the following options:

const ocrSearchableOptions = {
    type: 'text', // extract searchable text from PDF
    ocr_flags: ['--psm 1'], 
    enc: 'UTF-8',  
    mode: 'layout'
}

I can provide the PDF if needed to analyze it.
Any help is greatly, greatly appreciated 🙏.
Thanks a lot in advance.
