nisaacson / pdf-extract
Node PDF Extract
License: MIT License
I followed the installation steps and then tried running npm test. I got an error indicating that mocha was missing. I installed node.js and mocha (which I did not see among the prerequisites), but I still get an error. Here is the terminal output:
node_modules/.bin/mocha --reporter spec
sh: 1: node_modules/.bin/mocha: not found
npm ERR! Test failed. See above for more details.
npm ERR! not ok code 0
I suspect that the problem is that mocha is not in the expected folder.
I could not tell whether the pdf_extract() function allows passing arguments through to pdftotext.
I am looking at lib/searchable.js, line 30.
I altered my local copy to
var child = spawn('pdftotext', (options.layout ? ['-layout'] : []).concat(options.ocr_flags).concat([pdf_path, '-']));
And call it with
var options = {
type: 'text', // extract the actual text in the pdf file
ocr_flags: [
'-f',1,
'-l',1
]
}
var processor = pdf_extract(absolute_path_to_pdf, options, function(err) {
if (err) {
res.end(util.inspect(err));
}
});
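One thing to watch in the altered line: if options.ocr_flags is not set, .concat(undefined) appends a literal undefined to the argument list. A minimal sketch of a safer way to build the arguments is below. The names buildPdftotextArgs and extra_flags are made up for illustration and are not part of pdf-extract; only the pdftotext flags -layout, -f and -l are real.

```javascript
// Sketch only: buildPdftotextArgs and options.extra_flags are hypothetical
// names, not part of pdf-extract's published API.
function buildPdftotextArgs(pdfPath, options = {}) {
  const args = [];
  if (options.layout) args.push('-layout');
  // Convert every extra flag to a string, and tolerate a missing list.
  for (const flag of options.extra_flags || []) args.push(String(flag));
  // Input file first, then '-' so pdftotext writes the text to stdout.
  args.push(pdfPath, '-');
  return args;
}

const args = buildPdftotextArgs('/tmp/sample.pdf', {
  layout: true,
  extra_flags: ['-f', 1, '-l', 1], // first page to last page: page 1 only
});
console.log(args.join(' '));
```

The resulting array can be handed to spawn('pdftotext', args) as in the altered line above.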
By default, tesseract produces gibberish for me. I noticed that convert is commented out in favor of gs. I tried convert -depth 8 -background white -flatten -matte -density 300 <input> <output> instead and tesseract produced great results. The whole process was a lot faster too: ~15 minutes vs ~1 minute for 6 pages. I am curious why ghostscript is used rather than imagemagick for conversion?
This is a great library that I've been using extensively to get text from scanned pdf files. I was wondering if there is a way to read from a URL instead of a file.
If there is a way, I would have to change the convert.js file, correct? I just want to know if I am on the right path and would appreciate any input!
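One possible approach that avoids touching convert.js at all is to download the PDF to a temporary file first and pass that path to the library. This is only a sketch: downloadToTemp is a hypothetical helper, and it assumes Node 18 or newer for the global fetch.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Sketch only: downloadToTemp is a hypothetical helper, not part of pdf-extract.
// fetchImpl defaults to the global fetch available in Node 18+.
async function downloadToTemp(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error('Download failed with HTTP status ' + response.status);
  }
  const bytes = Buffer.from(await response.arrayBuffer());
  // A fresh temporary directory avoids name clashes between downloads.
  const dir = await fs.promises.mkdtemp(path.join(os.tmpdir(), 'pdf-'));
  const file = path.join(dir, 'input.pdf');
  await fs.promises.writeFile(file, bytes);
  return file; // absolute path, ready to pass to pdf_extract(file, options, cb)
}
```

The caller is responsible for deleting the temporary file once extraction has finished.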
When I execute a Node script from my terminal, this message appears:
(node:21492) DeprecationWarning: sys is deprecated. Use util instead.
As you can read here, sys will no longer be usable in the future.
Can you update the repo?
Thanks
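For reference, the change is usually a one-line swap, since sys was an alias for util. A small sketch, assuming the library only uses it for inspecting values (I have not checked every call site):

```javascript
// Before (deprecated): var sys = require('sys');
const util = require('util');

// util.inspect replaces the old sys.inspect call.
const summary = util.inspect({ type: 'text', pages: 2 });
console.log(summary);
```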
The poppler-utils download no longer exists at the link provided in the README. I am trying to use the text extraction in a Node app on Windows.
7 tests are failing when running npm test. Most of them fail with "Error: timeout of 100000ms exceeded".
I have tried increasing the timeout value, but that doesn't help either.
Hi,
There is currently no way to pass arguments through to pdftk to decrypt a document.
I'd like to submit a change to allow for this.
It returns [-1, -1]. I think it returns -1 when it does not find a word in the text, but the word is present in the text.
static extractTextFromPdf(pdf, options) {
console.log('entred to exctract');
pdf = pathe.resolve(pdf)
return new Promise((resolve, reject) => {
const processor = pdfExtract(pdf, { ...options, clean: true }, (err) => {
if (err) {
reject(err);
}
});
processor.on('complete', (data) => {
resolve(data.text_pages);
console.log('good');
});
processor.on('error', (err) => {
console.log(' error extractTextFromPdf ');
reject(err);
});
});
}
This package is perfect for what I am trying to do, but I can't seem to get it to work on OS X. I don't get any errors when running it in my Node application, but it never fires any of its callbacks after the process function. When I run npm test from within node_modules, the dependency checks pass, but I get 7 errors, all of them timeouts. I've raised the timeouts in the tests and they still fail. The only difference I have is Tesseract being version 3.04.00, but I'm not even trying to use OCR, so I can't imagine this would be causing the issues.
Any help is appreciated!
Here's an example you give:
var inspect = require('eyes').inspector({maxLength:20000});
var pdf_extract = require('pdf-extract');
var absolute_path_to_pdf = '~/Downloads/electronic.pdf'
var options = {
type: 'text' // extract the actual text in the pdf file
}
var processor = pdf_extract(absolute_path_to_pdf, options, function(err) {
if (err) {
return callback(err);
}
});
processor.on('complete', function(data) {
inspect(data.text_pages, 'extracted text pages');
callback(null, data.text_pages);
});
processor.on('error', function(err) {
inspect(err, 'error while extracting pages');
return callback(err);
});
But the npm package eyes was never installed, and callback (called 3 times) is undefined.
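A self-contained variant might look like the sketch below. It uses Node's built-in util.inspect instead of eyes and takes the callback as a parameter. The wrapper name extractPages is hypothetical, and the library is passed in as an argument so the wrapper can be tried without the native tools installed. Note also that Node does not expand '~', so '~/Downloads/electronic.pdf' is not an absolute path.

```javascript
const os = require('os');
const path = require('path');
const util = require('util');

// Sketch only: extractPages is a hypothetical wrapper. The pdfExtract argument
// stands for require('pdf-extract').
function extractPages(pdfExtract, pdfPath, callback) {
  let finished = false;
  const done = (err, pages) => {
    if (finished) return; // report success or failure only once
    finished = true;
    callback(err, pages);
  };
  const processor = pdfExtract(pdfPath, { type: 'text' }, (err) => {
    if (err) done(err);
  });
  processor.on('complete', (data) => done(null, data.text_pages));
  processor.on('error', (err) => done(err));
  return processor;
}

// Build the absolute path explicitly instead of relying on '~'.
const absolutePathToPdf = path.join(os.homedir(), 'Downloads', 'electronic.pdf');

// A callback that prints whichever value it receives.
const logResult = (err, pages) => console.log(util.inspect(err || pages));

// Usage (requires pdf-extract and its binaries to be installed):
// extractPages(require('pdf-extract'), absolutePathToPdf, logResult);
```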
As the title suggests, you can't always change the PATH. AWS Lambda is one of many environments that come to mind (CIs, AWS ECS, many PaaS, etc.).
We need to be able to tell the library which binary to use, similar to what this npm library offers: https://www.npmjs.com/package/pdf-text-extract
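To make the request concrete, here is a rough sketch of how such a lookup could work. Everything in it is hypothetical: pdf-extract does not document a binaries option or PDF_EXTRACT_* environment variables, and resolveBinary is an invented name.

```javascript
// Sketch only: options.binaries and the PDF_EXTRACT_<NAME> variables are
// hypothetical; pdf-extract does not document either of them.
function resolveBinary(name, options = {}, env = process.env) {
  const fromOptions = options.binaries && options.binaries[name];
  const fromEnv = env['PDF_EXTRACT_' + name.toUpperCase()];
  // An explicit option wins, then the environment, then a plain PATH lookup.
  return fromOptions || fromEnv || name;
}

const bin = resolveBinary('pdftotext', {
  binaries: { pdftotext: '/var/task/bin/pdftotext' },
});
console.log(bin);
```

The returned value would then replace the hard-coded command name wherever the library spawns a process.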
Extract from a pdf file which contains actual searchable text
var inspect = require('eyes').inspector({maxLength:20000});
var pdf_extract = require('pdf-extract');
var absolute_path_to_pdf = '~/Downloads/electronic.pdf'
var options = {
type: 'text' // extract the actual text in the pdf file
}
var processor = pdf_extract(absolute_path_to_pdf, options, function(err) {
if (err) {
return callback(err);
}
});
processor.on('complete', function(data) {
inspect(data.text_pages, 'extracted text pages');
callback(null, data.text_pages); //<----- data.text_pages instead of just text_pages.
});
processor.on('error', function(err) {
inspect(err, 'error while extracting pages');
return callback(err);
});
I guess this little typo doesn't deserve a fork and a pull request.
Thank you very much for your software.
Extraction takes a long time for a large pdf file. How can I extract and process a single page only?
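As far as I can tell the library has no option for this, so one workaround is to cut the page out with pdftk before handing the file over. A minimal sketch, assuming pdftk is on the PATH; singlePageArgs and extractSinglePage are hypothetical helper names.

```javascript
const { execFile } = require('child_process');

// Sketch only: singlePageArgs and extractSinglePage are hypothetical helpers.
function singlePageArgs(inputPdf, pageNumber, outputPdf) {
  if (!Number.isInteger(pageNumber) || pageNumber < 1) {
    throw new RangeError('pageNumber must be a positive integer');
  }
  // pdftk <input> cat <page> output <output> copies one page into a new file.
  return [inputPdf, 'cat', String(pageNumber), 'output', outputPdf];
}

function extractSinglePage(inputPdf, pageNumber, outputPdf, callback) {
  execFile('pdftk', singlePageArgs(inputPdf, pageNumber, outputPdf), (err) => {
    callback(err, outputPdf);
  });
}
```

The single-page file written to outputPdf can then be passed to pdf_extract in place of the full document.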
In the README under windows, the link 'pdftotext' http://manifestwebdesign.com/2013/01/09/xpdf-and-poppler-utils-on-windows/ now just redirects to the manifestwebdesign.com homepage.
Does this extract images?
I had a lot of trouble using this package. I was able to run the code, but it hung indefinitely on this exec line.
https://github.com/nisaacson/pdf-extract/blob/master/lib/split.js#L47
I was able to fix this by downloading the installer from this stackoverflow link. I think it should be updated in the README as well.
https://stackoverflow.com/questions/60859527/how-to-solve-pdftk-bad-cpu-type-in-executable-on-mac
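Independently of the installer fix, a timeout on the child process would turn a silent hang into a visible error. The sketch below is not the code in lib/split.js; runWithTimeout is a hypothetical helper built on the timeout option of Node's execFile.

```javascript
const { execFile } = require('child_process');

// Sketch only: runWithTimeout is a hypothetical helper, not code from lib/split.js.
function runWithTimeout(command, args, timeoutMs) {
  return new Promise((resolve, reject) => {
    // With the timeout option set, Node terminates the child once timeoutMs
    // has elapsed; the error passed to the callback then has killed === true.
    execFile(command, args, { timeout: timeoutMs }, (err, stdout) => {
      if (err) return reject(err);
      resolve(stdout);
    });
  });
}

// Usage: give pdftk 30 seconds to split a file before giving up.
// runWithTimeout('pdftk', ['in.pdf', 'burst'], 30000).catch(console.error);
```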
Thanks for the library. Here are the steps to run on windows.
Homebrew has moved up to Tesseract 3.02; the README has specific references to 3.01's directory structure.
I tried copying the training data files into the apparent directories used by the 3.02 brew install
cp ./share/eng.traineddata /usr/local/share/tessdata/
cp ./share/configs/alphanumeric /usr/local/share/tessdata/configs/
but the tests still fail.
I tried downgrading to the 3.01 homebrew version of Tesseract, but changes to the latest XCode seem to have broken the way that it assumes autoconf will work.
Using tesseract directly confirms problems loading relevant image processing libraries. After brew reinstall libtiff and brew reinstall libjpeg, TIFF support is still screwed up. This appears to be a problem with the leptonica recipe. I finally overcame it with brew reinstall --with-libtiff leptonica
This was a complete disaster, and I'm unsure where the fault lies. Clearly not with pdf-extract (other than the README referring to the wrong location for tesseract training data)! But other users will likely come across this, too.
UPDATE:
The README is also lacking a reference to copying around dia.traineddata.
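A quick check like the following could at least show which files are missing before running the tests. It is a sketch only: missingTrainingFiles is a hypothetical helper, and the directory and file names are simply the ones mentioned in this report, so adjust them to your install.

```javascript
const fs = require('fs');
const path = require('path');

// Sketch only: missingTrainingFiles is a hypothetical helper.
function missingTrainingFiles(tessdataDir, relativeFiles) {
  return relativeFiles.filter(
    (file) => !fs.existsSync(path.join(tessdataDir, file))
  );
}

// Paths taken from the report above; they may differ on other installs.
const missing = missingTrainingFiles('/usr/local/share/tessdata', [
  'eng.traineddata',
  'dia.traineddata',
  path.join('configs', 'alphanumeric'),
]);
console.log(missing.length ? 'Missing: ' + missing.join(', ') : 'All files present');
```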
Hi,
The tool works great for extracting data, but on some pages the footer text gets intermingled with the page body text, which breaks the parsing.
Is there a way to turn off the header and footer extraction somewhere in the code?
Thank YOU
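I am not aware of a switch for this in the library, so one workaround is to post-process data.text_pages. The sketch below is a heuristic and not a library feature: it assumes a header or footer is the first or last non-empty line of a page and that the same line (ignoring digits such as page numbers) appears on most pages. It will not help if the footer is already mixed into the body lines. stripRepeatedEdgeLines is a hypothetical name.

```javascript
// Sketch only: stripRepeatedEdgeLines is a hypothetical post-processing step.
function stripRepeatedEdgeLines(pages, minShare = 0.6) {
  // Digits are masked so "Page 1 of 9" and "Page 2 of 9" count as the same line.
  const normalize = (line) => line.trim().replace(/\d+/g, '#');
  const counts = new Map();
  for (const page of pages) {
    const lines = page.split('\n').map(normalize).filter(Boolean);
    // Only the first and last non-empty lines are header/footer candidates.
    const edges = new Set([lines[0], lines[lines.length - 1]].filter(Boolean));
    for (const edge of edges) counts.set(edge, (counts.get(edge) || 0) + 1);
  }
  const repeated = new Set();
  for (const [line, n] of counts) {
    if (n > 1 && n / pages.length >= minShare) repeated.add(line);
  }
  // Drop every line whose normalized form matches a repeated edge line.
  return pages.map((page) =>
    page.split('\n').filter((line) => !repeated.has(normalize(line))).join('\n')
  );
}
```

Because matching lines are removed wherever they occur on a page, a body line identical to the header would be dropped as well.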
{
"error": "Illegal character <3f> in hex string\nError (82): Illegal character <78> in hex string\nError (83): Illegal character <70> in hex string\nError (86): Illegal character <6b> in hex string\nError (88): Illegal character <74> in hex string\nError (92): Illegal character <67> in hex string\nError (93): Illegal character <69> in hex string\nError (94): Illegal character <6e> in hex string\nError (95): Illegal character <3d> in hex string\nError (96): Illegal character <27> in hex string\nError (97): Illegal character <ef> in hex string\nError (98): Illegal character <bb> in hex string\nError (99): Illegal character <bf> in hex string\nError (100): Illegal character <27> in hex string\nError (102): Illegal character <69> in hex string\nError (104): Illegal character <3d> in hex string\nError (105): Illegal character <27> in hex string\nError (106): Illegal character <57> in hex string\nError (108): Illegal character <4d> in hex string\nError (110): Illegal character <4d> in hex string\nError (111): Illegal character <70> in hex string\nError (114): Illegal character <68> in hex string\nError (115): Illegal character <69> in hex string\nError (116): Illegal character <48> in hex string\nError (117): Illegal character <7a> in hex string\nError (118): Illegal character <72> in hex string\nError (120): Illegal character <53> in hex string\nError (121): Illegal character <7a> in hex string\nError (122): Illegal character <4e> in hex string\nError (123): Illegal character <54> in hex string\nError (125): Illegal character <7a> in hex string\nError (126): Illegal character <6b> in hex string\nError (130): Illegal character <27> in hex string\nError (131): Illegal character <3f> in hex string\n",
}
Hello,
I am a member of the GitHub Security Lab (https://securitylab.github.com).
I've attempted to reach a maintainer for this project to report a potential security issue but have been unable to verify the report was received. Could a project maintainer please contact us at [email protected], using reference GHSL-2020-116?
Thank you,
Kevin Backhouse
GitHub Security Lab
Hello,
First of all, thanks a lot for the awesome work on this library. We have been using it for some time and are quite amazed by the work you have done here.
Today we ran into this error with an apparently fine PDF:
error: 'Syntax Error (30523): No current point in closepath\n' +
'Syntax Error (30538): No current point in closepath\n' +
'Syntax Error (30556): No current point in closepath\n' +
'Syntax Error (30566): No current point in closepath\n',
pdf_path: '../samples/515317730_121477412.pdf'
This is a searchable/text pdf, so it is using pdfOCR with the following options:
const ocrSearchableOptions = {
type: 'text', // extract searchable text from PDF
ocr_flags: ['--psm 1'],
enc: 'UTF-8',
mode: 'layout'
}
I can provide the PDF if needed to analyze it.
Any help is greatly, greatly appreciated.
Thanks a lot in advance.
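These messages look like warnings that pdftotext prints to stderr while still producing text. If the wrapper fails whenever stderr is non-empty (an assumption on my part, I have not confirmed it in the source), one option is to decide on the exit code instead. A minimal sketch with a hypothetical classifyResult helper:

```javascript
// Sketch only: classifyResult is hypothetical, and it assumes the wrapper
// currently fails whenever pdftotext writes anything to stderr.
function classifyResult(exitCode, stderr) {
  const messages = stderr.split('\n').map((line) => line.trim()).filter(Boolean);
  if (exitCode === 0) {
    // The process exited successfully, so keep the text and report the
    // stderr lines as warnings instead of failing.
    return { ok: true, warnings: messages };
  }
  return {
    ok: false,
    error: new Error(messages.join('; ') || 'exit code ' + exitCode),
  };
}

const result = classifyResult(0, 'Syntax Error (30523): No current point in closepath\n');
console.log(result.ok, result.warnings.length);
```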
It would be helpful if the library docs showed how to extract text from multiple pdf files.
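Until the docs cover it, something along these lines may work. It is a sketch only: extractAll is a hypothetical helper, and extractOne stands for any function that takes a path and returns a Promise of the text pages, for example a promisified wrapper around pdf_extract like the extractTextFromPdf function posted in another issue here.

```javascript
// Sketch only: extractAll is a hypothetical helper, not part of pdf-extract.
async function extractAll(pdfPaths, extractOne) {
  const results = [];
  // Files are processed one at a time, so only one extraction runs at once.
  for (const pdfPath of pdfPaths) {
    try {
      results.push({ pdfPath, pages: await extractOne(pdfPath) });
    } catch (err) {
      results.push({ pdfPath, error: err }); // keep going after a failure
    }
  }
  return results;
}
```

Processing sequentially is the conservative choice here, since each extraction starts external processes; a small concurrency limit would be a reasonable refinement.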