
Comments (7)

galkahana commented on June 1, 2024

Not sure that optimizing the number of PDFWriters would be relevant. However, I'm fairly certain that having to create a PDF copying context is something that can be optimized, since some of the initial parsing can be avoided. The C++ library has a method to create a copying context from a PDFParser instead of a file name or file stream, so the same parser can be reused; however, this is not exposed to the JavaScript code. If you feel like experimenting, you could extend the node module to allow this as well. Then change the code so that it creates an initial PDFParser for main.pdf, and in the call to createPDFCopyingContext pass the parser instead of the file name. This may optimize things a little. Other than that, no suggestions.
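For reference, the splitting loop being optimized looks roughly like this (a sketch, assuming the hummusjs calls seen later in this thread; the `splitWithRepeatedParsing` wrapper and its injected `hummus` parameter are my own, so the loop can be exercised against a stub):

```javascript
// Baseline: every iteration passes the file *name* to createPDFCopyingContext,
// so main.pdf is re-parsed from scratch for every single output page.
function splitWithRepeatedParsing(hummus, srcPath, outPrefix, pageCount) {
  for (var i = 0; i < pageCount; ++i) {
    var pdfWriter = hummus.createWriter(outPrefix + i + '.pdf');
    // re-parses srcPath on every call -- the cost the suggestion above avoids:
    pdfWriter.createPDFCopyingContext(srcPath).appendPDFPageFromPDF(i);
    pdfWriter.end();
  }
}
```

The suggestion is to build one parser for main.pdf up front and hand that to createPDFCopyingContext instead of the file name, so the parse happens once.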

from hummusjs.

cazgp commented on June 1, 2024

Thank you for the speedy reply! That sounds like a reasonable thing to do, but my C++ fu is non-existent. Are you able to give any pointers on how this might work, which files to edit, etc.? Is there a documented build process specific to this for compiling the C++?


cazgp commented on June 1, 2024

Update: after looking through the source of PDFWriterDriver.cpp, it seems that CreatePDFCopyingContext already tries to coerce args[0] to an ObjectByteReaderWithPosition. This made me assume something like:

inputStream = fs.createReadStream('main.pdf')
...
pdfWriter.createPDFCopyingContext(inputStream)

would be possible. Tracing through the C++, it seems that the error comes from PDFParserTokenizer::GetNextByteForToken L341, where mStream->Read(&outByte,1) is not returning 1. I'm not really sure what that means nor how to debug it further.

from hummusjs.

cazgp commented on June 1, 2024

Update 2 (sorry for the spamming). This works, but not multiple times:

inputStream = new hummus.PDFRStreamForFile('main.pdf')

On the next pass of the loop: Error: unable to create copying context. verify that the target is an existing PDF file. Re-instantiating inputStream on each iteration works.
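In case it is useful, the working workaround looks roughly like this (a sketch; PDFRStreamForFile and createPDFCopyingContext are from this thread, while the wrapper function and injected `hummus` parameter are my own, so the loop can be run against a stub):

```javascript
// Workaround: a PDFRStreamForFile appears to back exactly one copying context,
// so build a fresh stream on every pass instead of reusing a consumed one.
function splitWithFreshStreams(hummus, srcPath, outPrefix, pageCount) {
  for (var i = 0; i < pageCount; ++i) {
    var inputStream = new hummus.PDFRStreamForFile(srcPath); // fresh each pass
    var pdfWriter = hummus.createWriter(outPrefix + i + '.pdf');
    pdfWriter.createPDFCopyingContext(inputStream).appendPDFPageFromPDF(i);
    pdfWriter.end();
  }
}
```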

I'm wondering whether this is barking up the wrong tree, and whether there's a way to do something like:

pdfReader = hummus.createReader('main.pdf')
pdfWriter = hummus.createWriter('output0.pdf')
page = pdfReader.parsePage(0)
p = pdfWriter.createPage(page)
pdfWriter.write(p)
pdfWriter.end()

Only that doesn't work, because page is of type PDFPageInput. Is there a way of coercing a PDFPageInput into a page object that pdfWriter can then write?


galkahana commented on June 1, 2024

Hi cazgp!
In case you are still looking into this, I got some info that might be of help.
I looked at your various attempts at improving the performance of the process; however, I really cannot see any way to improve this other than sharing a parser.

  1. Creating a new instance of PDFWriter shouldn't take long. It just writes the absolutely necessary data into the PDF file and holds some state.
  2. Same for ending: it writes the absolutely necessary data to the PDF and finishes.

The 'ending' content cannot be shared, as it differs from one PDF to another. It depends highly on what content is copied, and is affected even by the byte size of the PDF and the number of internal objects.

So unless we're looking at writing something really specific for the PDF that you are trying to split, this seems to me like a dead end.

PDFWriter objects are dedicated to the file they are writing, so sharing them to write different PDFs is meaningless and will cause exceptions. And again, I don't expect any significant improvement from going that route, so no need to go there, I think.

However

One thing that I do see can be shared is the parser object.
I quickly added to the module the ability to create a copying context based on a parser, and it seems to have great performance merit.

If I borrow the sample provided in your last comment, it looks a bit like this:

pdfReader = hummus.createReader('main.pdf')
for(var i=0;i<pdfReader.getPagesCount();++i){
    pdfWriter = hummus.createWriter('output' + i + '.pdf')
    pdfWriter.createPDFCopyingContext(pdfReader).appendPDFPageFromPDF(i);
    pdfWriter.end();
}

The critical part is creating a context from a parser: pdfWriter.createPDFCopyingContext(pdfReader).
I tested this code with the PDF reference manual, which has 1309 pages, and got this:

  1. With the regular method, it takes 10 minutes and 12 seconds to split the manual into its pages.
  2. With the new method, it takes 17 seconds.

I also tried only 44 pages, to get close to what you saw:

  1. The regular method took 20 seconds.
  2. The new method took 1 second.

So it looks like we may have something here.

I checked the code into git, but haven't published it to npm yet. If you're interested, pull it, build, and see if you get a similar improvement when using the new method in your code.
If this satisfies your requirements, I will do some more testing to make sure the code change causes no harm, and publish.

Good luck.
Gal.


cazgp commented on June 1, 2024

Hey Gal,

Thank you for getting back to me so usefully! I tried your code and it works, and it's definitely much quicker. Unfortunately it's still taking about 4 seconds, which is 3x as slow as my current pipeline: spawning out to lib-poppler's pdfseparate to create the PDFs, spawning out to node-glob to read in the file names, then spawning out to Ruby to hash the PDFs by reading them all individually, and finally creating read/write streams to move them to where they need to be.

I would really love to use hummus to avoid so many spawns and bring everything "in house", but it does need to be quicker. Could it be the actual parsing itself that's taking the time now?

Thank you so much again, it's really awesome that you worked on this!


galkahana commented on June 1, 2024

cazgp,
First, if there's an already-working solution, then why not use it? I mean the one with spawning and pdfseparate etc.

If you'd still like to look into the problem at hand, then you'll need to time the various elements of the code.
Check how much time the initial parsing takes (when creating the reader object), then how much time creating the separate PDFs takes. This should give us some material as to where we can improve. Time the initial reader creation, the writer creation, the end statement, and the appending statement, accumulating across the loop over all pages, so that we have four numbers telling us how much time each phase takes.
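One way to collect those four numbers is a small accumulator like the following (pure JS; the `makePhaseTimer` helper and the phase names are my own, and the commented usage assumes the hummus calls shown earlier in the thread):

```javascript
// Accumulates wall-clock time (in ms) per named phase across loop iterations.
function makePhaseTimer() {
  var totals = {};
  return {
    time: function (phase, fn) {
      var start = process.hrtime();
      var result = fn();
      var d = process.hrtime(start);
      totals[phase] = (totals[phase] || 0) + d[0] * 1e3 + d[1] / 1e6;
      return result;
    },
    totals: function () { return totals; }
  };
}

// Sketch of wiring it into the splitting loop from earlier in the thread:
// var t = makePhaseTimer();
// var pdfReader = t.time('reader', function () { return hummus.createReader('main.pdf'); });
// for (var i = 0; i < pdfReader.getPagesCount(); ++i) {
//   var pdfWriter = t.time('writer', function () { return hummus.createWriter('output' + i + '.pdf'); });
//   t.time('append', function () { pdfWriter.createPDFCopyingContext(pdfReader).appendPDFPageFromPDF(i); });
//   t.time('end', function () { pdfWriter.end(); });
// }
// console.log(t.totals()); // the four numbers: reader, writer, append, end
```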

There are several points to consider that may be optimized:

  1. first, there's the initial parsing time
  2. then there's the copying, which can be split into:
    2.1 reading elements from the source file
    2.2 writing them to the target file
  3. then there's the writing of the PDF header and the ending code, for each of the PDFs.

I'd think that using node streams for writing, as you do, should make #3 and #2.2 non-critical.
#1 can be measured, so you can figure out whether it takes a significant amount of time. If it does... we'll think of something.
#2.1: same.

Parsing may take time, as it is synchronous by nature; hummus requires that.
What's more, hummus parsing and writing is "interested" in the objects: it looks to read them, not just blindly copy them. To get something better, you would need to write your own copying algorithm that tries to improve on that.
I'm fairly sure that if you get to that point, you are better off with spawning, even something like splitting the work into 4 spawns handling different page ranges of the file, depending on how much you want to optimize and on the smallest amount of time such a spawn can take.

Anyway, to make an informed decision we need times. So, if you care to look into this, please time it and let me know what comes out of it.

Regards,
Gal.

