Comments (7)

jakopako commented on August 19, 2024

Thanks for bringing this up. You're right: with renderJs: true, the scraper is pretty slow. This part of the code is not well optimized yet, and there are a number of hardcoded timers (e.g. here: https://github.com/jakopako/goskyr/blob/main/fetch/fetcher.go#L73) to make sure the site has actually loaded. I'll look into this in more detail as soon as I have time. Feel free to make some improvements yourself and open a pull request if you have ideas :)

from goskyr.

alucab commented on August 19, 2024

I will certainly take a look, but I have very limited experience with Golang, so I see it more as a learning opportunity than a chance to contribute concretely in the short term.

I looked at the function and where it is invoked, and I don't see any huge timers (1 or 5 seconds at most).
As I said, I'm no expert, but could the slowness come from instantiating a chromedp headless browser for every call?

Thanks for being so responsive!

jakopako commented on August 19, 2024

Sure, no worries!

Yeah, you may well be right: re-instantiating the browser for every call could be what takes so much time.

jakopako commented on August 19, 2024

I released a new version, 0.5.8, that should be a little better speed-wise. It reuses the same browser instance for multiple requests, and you can change the page load wait time (the default is now 2 seconds) with the key page_load_wait_sec. So your config could look something like:

scrapers:
  - name: discontinued
    renderJs: true
    page_load_wait_sec: 1
    url: https://www.hikvision.com/en/products/discontinued-products
...

It's still not very fast but it should already be better than before. Hope this helps!

jakopako commented on August 19, 2024

Actually, I changed the parameter's name to page_load_wait and the unit to milliseconds (instead of seconds) in version 0.5.9.
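
As a rough sketch of what the rename means for a config, and assuming nothing else about the scraper definition changed between versions, the earlier example with a one-second wait would presumably look like this from 0.5.9 onwards. The value 1000 is only an illustration, chosen to match the earlier page_load_wait_sec: 1.

```yaml
scrapers:
  - name: discontinued
    renderJs: true
    # Assumption: 1000 ms mirrors the earlier page_load_wait_sec: 1;
    # the right value depends on how quickly the page renders.
    page_load_wait: 1000
    url: https://www.hikvision.com/en/products/discontinued-products
    # remaining fields elided, as in the earlier example
```

A config that still uses the old key with a value in seconds would likely need both the key and the number updated.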

alucab commented on August 19, 2024

Beautiful!

I'll study your commit to learn more about the tool.

jakopako commented on August 19, 2024

Closing this issue now. There are still ways to improve the scraping speed for dynamic pages, but it has already improved considerably since this issue was opened. Issue #253 describes one potential further improvement.

@alucab feel free to open another issue if there's anything else that can be improved.
