
Comments (7)

galkahana commented on June 1, 2024

Not sure that optimizing the number of PDFWriters would be relevant. However, I'm fairly certain that having to create a PDF copying context is something that can be optimized, since some of the initial parsing can be avoided. The C++ library has a method to create a copying context from a PDFParser instead of a file name or file stream, so the same parser can be reused; however, this is not exposed to the JavaScript code. If you feel like experimenting, you could extend the node module to allow this as well. Then change the code so that it creates an initial PDFParser for main.pdf, and in the call to createPDFCopyingContext pass the parser instead of the file name. This may optimize things a little. Other than that, no suggestions.
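For reference, the splitting loop being optimized looks roughly like this (a sketch, assuming the hummusjs calls seen later in this thread; the `splitWithRepeatedParsing` wrapper and its injected `hummus` parameter are my own, so the loop can be exercised against a stub):

```javascript
// Baseline: every iteration passes the file *name* to createPDFCopyingContext,
// so main.pdf is re-parsed from scratch for every single output page.
function splitWithRepeatedParsing(hummus, srcPath, outPrefix, pageCount) {
  for (var i = 0; i < pageCount; ++i) {
    var pdfWriter = hummus.createWriter(outPrefix + i + '.pdf');
    // re-parses srcPath on every call -- the cost the suggestion above avoids:
    pdfWriter.createPDFCopyingContext(srcPath).appendPDFPageFromPDF(i);
    pdfWriter.end();
  }
}
```

The suggestion is to build one parser for main.pdf up front and hand that to createPDFCopyingContext instead of the file name, so the parse happens once.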

from hummusjs.

cazgp commented on June 1, 2024

Thank you for the speedy reply! That sounds like a reasonable thing to do, but my C++ fu is non-existent. Are you able to give any pointers on how this might work, which files to edit, etc.? Is there a documented build process specific to this for compiling the C++?


cazgp commented on June 1, 2024

Update: after looking through the source of PDFWriterDriver.cpp, it seems that CreatePDFCopyingContext already tries to coerce args[0] to an ObjectByteReaderWithPosition. This made me assume something like:

inputStream = fs.createReadStream('main.pdf')
...
pdfWriter.createPDFCopyingContext(inputStream)

would be possible. Tracing through the C++, it seems that the error comes from PDFParserTokenizer::GetNextByteForToken L341, where mStream->Read(&outByte,1) is not returning 1. I'm not really sure what that means nor how to debug it further.

from hummusjs.

cazgp commented on June 1, 2024

Update 2 (sorry for the spamming). This works, but not multiple times:

inputStream = new hummus.PDFRStreamForFile('main.pdf')

On the next pass of the loop: Error: unable to create copying context. verify that the target is an existing PDF file. Re-instantiating inputStream on each iteration works.
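In case it is useful, the working workaround looks roughly like this (a sketch; PDFRStreamForFile and createPDFCopyingContext are from this thread, while the wrapper function and injected `hummus` parameter are my own, so the loop can be run against a stub):

```javascript
// Workaround: a PDFRStreamForFile appears to back exactly one copying context,
// so build a fresh stream on every pass instead of reusing a consumed one.
function splitWithFreshStreams(hummus, srcPath, outPrefix, pageCount) {
  for (var i = 0; i < pageCount; ++i) {
    var inputStream = new hummus.PDFRStreamForFile(srcPath); // fresh each pass
    var pdfWriter = hummus.createWriter(outPrefix + i + '.pdf');
    pdfWriter.createPDFCopyingContext(inputStream).appendPDFPageFromPDF(i);
    pdfWriter.end();
  }
}
```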

I'm wondering whether this is barking up the wrong tree, and whether there's a way to do something like:

pdfReader = hummus.createReader('main.pdf')
pdfWriter = hummus.createWriter('output0.pdf')
page = pdfReader.parsePage(0)
p = pdfWriter.createPage(page)
pdfWriter.write(p)
pdfWriter.end()

Only that doesn't work, because page is of type PDFPageInput. Is there a way of coercing a PDFPageInput into a page object that pdfWriter can then write?


galkahana commented on June 1, 2024

Hi cazgp!
In case you are still looking into this, I got some info that might be of help.
I looked at your various attempts at improving the performance of the process; however, I really cannot see any way to improve this other than sharing a parser.

  1. Creating a new instance of PDFWriter shouldn't take long. It just writes the absolutely necessary data into the PDF file and holds some state.
  2. Same for ending: it writes the absolutely necessary data to the PDF and finishes.

The 'ending' content cannot be shared, as it differs from one PDF to another. It depends highly on what content is copied, and is affected even by the byte size of the PDF and the number of internal objects.

So unless we're looking at writing something really specific for the PDF that you are trying to split, this seems to me like a dead end.

PDFWriter objects are dedicated to the file they are writing, so sharing them to write different PDFs is meaningless and will cause exceptions. And again, I don't expect any significant improvement from going that route, so no need to go there, I think.

However

One thing that I do see can be shared is the parser object.
I quickly added to the module the ability to create a copying context based on a parser, and it seems to have great performance merit.

If I borrow the sample provided in your last comment, it looks a bit like this:

pdfReader = hummus.createReader('main.pdf')
for(var i=0;i<pdfReader.getPagesCount();++i){
    pdfWriter = hummus.createWriter('output' + i + '.pdf')
    pdfWriter.createPDFCopyingContext(pdfReader).appendPDFPageFromPDF(i);
    pdfWriter.end();
}

The critical part is creating a context from a parser: pdfWriter.createPDFCopyingContext(pdfReader).
I tested this code with the PDF reference manual, which has 1309 pages, and got this:

  1. With the regular method, it takes 10 minutes and 12 seconds to split the manual into its pages.
  2. With the new method, it takes 17 seconds.

I also tried only 44 pages, to get close to what you saw:

  1. The regular method took 20 seconds.
  2. The new method took 1 second.

So it looks like we may have something here.

I checked the code into git, but haven't published it to npm yet. If you're interested, pull it, build, and see if you get a similar improvement when using the new method in your code.
If this satisfies your requirements, I will do some more testing to make sure the code change causes no harm, and publish.

Good luck.
Gal.


cazgp commented on June 1, 2024

Hey Gal,

Thank you for getting back to me so usefully! I tried your code and it works, and it's definitely much quicker. Unfortunately it's still taking about 4 seconds, which is 3x as slow as my current pipeline: spawning out to lib-poppler's pdfseparate to create the PDFs, spawning out to node-glob to read in the file names, then spawning out to Ruby to hash the PDFs by reading them all individually, and finally creating read/write streams to move them to where they need to be.

I would really love to use hummus to avoid so many spawns and bring everything "in house", but it does need to be quicker. Could it be the actual parsing itself that's taking the time now?

Thank you so much again, it's really awesome that you worked on this!


galkahana commented on June 1, 2024

cazgp,
First, if there's an already-working solution, then why not use it? I mean the one with spawning and pdfseparate etc.

If you'd still like to look into the problem at hand, then you'll need to time the various elements of the code.
Check how much time the initial parsing takes (when creating the reader object), then how much time creating the separate PDFs takes. This should give us some material as to where we can improve. Time the initial reader creation, the writer creation, the end statement, and the appending statement, accumulating across the loop over all pages, so that we have four numbers telling us how much time each phase takes.
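One way to collect those four numbers is a small accumulator like the following (pure JS; the `makePhaseTimer` helper and the phase names are my own, and the commented usage assumes the hummus calls shown earlier in the thread):

```javascript
// Accumulates wall-clock time (in ms) per named phase across loop iterations.
function makePhaseTimer() {
  var totals = {};
  return {
    time: function (phase, fn) {
      var start = process.hrtime();
      var result = fn();
      var d = process.hrtime(start);
      totals[phase] = (totals[phase] || 0) + d[0] * 1e3 + d[1] / 1e6;
      return result;
    },
    totals: function () { return totals; }
  };
}

// Sketch of wiring it into the splitting loop from earlier in the thread:
// var t = makePhaseTimer();
// var pdfReader = t.time('reader', function () { return hummus.createReader('main.pdf'); });
// for (var i = 0; i < pdfReader.getPagesCount(); ++i) {
//   var pdfWriter = t.time('writer', function () { return hummus.createWriter('output' + i + '.pdf'); });
//   t.time('append', function () { pdfWriter.createPDFCopyingContext(pdfReader).appendPDFPageFromPDF(i); });
//   t.time('end', function () { pdfWriter.end(); });
// }
// console.log(t.totals()); // the four numbers: reader, writer, append, end
```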

There are several points to consider that may be optimized:

  1. first, there's the initial parsing time
  2. then there's the copying, which can be split into:
    2.1 reading elements from the source file
    2.2 writing them to the target file
  3. then there's the writing of the PDF header and the ending code, for each of the PDFs.

I'd think that using node streams for writing, as you do, should make #3 and #2.2 non-critical.
#1 can be measured, so you can figure out whether it takes a significant amount of time. If it does... we'll think of something.
#2.1: same.

Parsing may take time, as it is synchronous by nature; hummus requires that.
What's more, hummus parsing and writing is "interested" in the objects: it looks to read them, not just blindly copy them. To get something better, you would need to write your own copying algorithm that tries to improve on that.
I'm fairly sure that if you get to that point, you are better off with spawning, even something like splitting the work into 4 spawns handling different page ranges of the file, depending on how much you want to optimize and on the smallest amount of time such a spawn can take.

Anyway, to make an informed decision we need times. So, if you care to look into this, please time it and let me know what comes out of it.

Regards,
Gal.

