Comments (8)
The biggest PDF files in the unit tests are under 8 pages; it has never been tested with a 'large' file. If performance is an issue, I'd recommend splitting the file into smaller ones before parsing, since smaller PDFs are well tested and perform well.
from pdf2json.
Hi there... thanks for the reply.
With so many downloads I am surprised no one else has hit this issue. The PDF files that we need to import are outwith our control, so we cannot lessen their size. They can be anything from one page to 1500 pages.
Are there any input options that cut down the amount of work this plugin does when preparing the data? The only information we require is the textual data along with its x and y coordinates.
Looking forward to your response.
Many thanks, Barney
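The thread does not name an input option that limits the output to text and coordinates. As a minimal sketch of trimming the result after parsing, the function below keeps only text and x/y values. It assumes the parsed result has the shape `{ Pages: [ { Texts: [ { x, y, R: [ { T } ] } ] } ] }` (possibly nested under `formImage` in older releases) and that `T` is URI-encoded. `extractTextItems` and `safeDecode` are hypothetical helper names, not part of pdf2json.

```javascript
// Assumed result shape: { Pages: [ { Texts: [ { x, y, R: [ { T } ] } ] } ] }.
// Older releases may nest this under `formImage`; adjust the accessor for your version.
function extractTextItems(pdfData) {
  const pages =
    (pdfData && (pdfData.Pages || (pdfData.formImage && pdfData.formImage.Pages))) || [];
  const items = [];
  pages.forEach((page, pageIndex) => {
    for (const text of page.Texts || []) {
      // A text block can hold several runs; join them into one string.
      const raw = (text.R || []).map((run) => run.T).join("");
      items.push({ page: pageIndex + 1, x: text.x, y: text.y, text: safeDecode(raw) });
    }
  });
  return items;
}

// Run text is assumed to be URI-encoded; fall back to the raw string if decoding fails.
function safeDecode(value) {
  try {
    return decodeURIComponent(value);
  } catch (err) {
    return value;
  }
}
```

This only shrinks what is kept afterwards. It does not reduce the parsing work itself, so it would not shorten the load time described above.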
One option is to change the stream implementation from per-file to per-page, so that data starts to flow as soon as a single page is ready. That would improve responsiveness, but it won't reduce the total processing time for large PDFs.
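The per-page idea can be illustrated with an async generator. This is a sketch of the general pattern only, not pdf2json's stream implementation: `parsePages` is a hypothetical stand-in for a parser that finishes one page at a time, and `consumePages` is a hypothetical consumer.

```javascript
// Hypothetical stand-in for a parser that finishes one page at a time.
async function* parsePages(rawPages) {
  for (const raw of rawPages) {
    // Yield to the event loop to mimic per-page parsing work.
    await new Promise((resolve) => setImmediate(resolve));
    yield { Texts: raw };
  }
}

// Handle each page as it arrives instead of waiting for the whole document.
async function consumePages(pageSource, onPage) {
  let count = 0;
  for await (const page of pageSource) {
    count += 1;
    onPage(count, page);
  }
  return count;
}
```

The first page reaches `onPage` before the last page has been parsed, which is the responsiveness gain. The total work is unchanged, so the overall time stays about the same.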
Yep that's a shame. I take it there is no way of speeding up the process by limiting what it ends up outputting? So for example, asking it to only do specific types of work when loading the PDF document.
What would be the cause of the slowness... is it string manipulation or something similar in the inner workings of the module?
We could use child processes to process pages in parallel. This would not only improve responsiveness but also reduce the total time for such large files. I would love to contribute and create a PR for it if you agree.
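One piece of the child-process proposal is deciding which pages each worker handles. Here is a minimal sketch of that partitioning step, assuming the parser (or a prior PDF-splitting step) can work on a page range, which the thread does not confirm. `partitionPages` is a hypothetical helper name.

```javascript
// Split pages 1..totalPages into at most workerCount contiguous, near-equal ranges.
function partitionPages(totalPages, workerCount) {
  if (!Number.isInteger(totalPages) || totalPages < 0) {
    throw new RangeError("totalPages must be a non-negative integer");
  }
  if (!Number.isInteger(workerCount) || workerCount < 1) {
    throw new RangeError("workerCount must be a positive integer");
  }
  const ranges = [];
  const base = Math.floor(totalPages / workerCount);
  let extra = totalPages % workerCount; // the first `extra` ranges get one more page
  let start = 1;
  for (let i = 0; i < workerCount && start <= totalPages; i += 1) {
    const size = base + (extra > 0 ? 1 : 0);
    if (extra > 0) extra -= 1;
    ranges.push({ first: start, last: start + size - 1 });
    start += size;
  }
  return ranges;
}
```

Each range could then be handed to its own process, for example via Node's `child_process.fork`, and the per-range results merged in page order. Whether this shortens the total time depends on how much of the work can actually be done independently per page.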
I don't seem to have this issue.
I have tried parsing an 11 MB PDF, and the dataReady callback fires in under a minute.
I am running the Node application on my MacBook Pro (i5, 8 GB).
Here's the PDF that I tested - https://drive.google.com/file/d/0BzR-ZOIycHumX3hsbTVWbFMyQlU/view?usp=sharing
Sorry for the delay... damn holidays huh?! Well I am back now, so here goes...
Although the PDFs I am using are only ~4 MB, each of the ~1,300 pages has a grid of tabular data (about 8x8), and some "cells" can contain up to 6 text items, stacked vertically. So it might not be about the size of the PDF, but rather the contents and their structure.
kishorsharma - if you could look into speeding this up using child processes, then I would be happy to test your code. Any advance on 10 minutes would be a big bonus!
Please let me know your thoughts.
Any update?