
Comments (4)

mnmkng commented on May 13, 2024

Hi and thanks for the opportunity to work on this with you. Based on some quick research of the codebase, I'd like to clarify certain points before I start coding the first draft. Please share any thoughts on the following topics:

Web Server

I see that there's no web server technology currently used in apify-js. I'm familiar with Express, but I feel it would introduce an unnecessary dependency to the project. Do you plan to use web servers more extensively in the future? Otherwise, I'll just set up a pure Node http server.

Client-Server communication

Regularly updating the view can be achieved either by client-side javascript that regularly polls the server for new data or through WebSockets. I saw that you already use the ws package so WebSockets seem like a logical choice. I have no experience using them, but I'll learn, if you feel this is the way to go too.

API

  • Apify.launchPuppeteer accepts options. Should the live-view be an optional behavior? It seems like a good idea. Perhaps turned on by default.
  • Should the user be able to choose screenshot / html, or should it be shown side by side?
  • Should the polling interval be configurable?

Puppeteer

I have no relevant experience using Puppeteer, so I might be off here, but I suppose that the ways to get the HTML are document.documentElement.outerHTML or new XMLSerializer().serializeToString(document). What I'm wondering is, how does it handle CSS and JS? Does it rely on being able to download the statics via the relevant <link> and <script> tags?

Thanks for your time.

from crawlee.

mtrunkat commented on May 13, 2024

All sounds good.

  • I'd use plain Node HTTP server.
  • WS might be a bit tricky because the WS server must run on the same port as the HTTP server, as we will have only one port open. So please check whether there are any issues with this. Otherwise, if an update is scheduled to happen, let's say, every 3s, then we can add a meta tag that reloads the page: <meta http-equiv="refresh" content="3">.
  • Let's add a liveView: true parameter to Apify.launchPuppeteer().
  • For the first screenshot we could use the page.on('domcontentloaded') event and then some constant interval (1s?). Let's see how it works and then decide if we need to make it configurable.
  • Side by side or HTML below the screenshot? Let's decide this later when we see it in the application UI.
  • You can use page.content() to get html and page.screenshot([options]) to get screenshot as buffer. Screenshot can be inserted to live view HTML page as base64 encoded image directly to html code. HTML can be displayed in <code> tags.
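
The points above (a plain Node HTTP server, a meta refresh tag, a base64 screenshot and the HTML shown in <code> tags) could be sketched roughly as follows. This is only an illustration, not the code that was merged: renderLiveView, createLiveViewServer and the latest object are hypothetical names, the 3 second refresh is taken from the example above, and PNG is assumed as the screenshot format.

```javascript
const http = require('http');

// Escape text so the page source is displayed verbatim inside <code>.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Build the live view page from a screenshot (Buffer or Uint8Array)
// and the page HTML. The browser reloads it every `refreshSeconds`.
function renderLiveView(screenshot, html, refreshSeconds = 3) {
  const image = Buffer.from(screenshot).toString('base64');
  return [
    '<!doctype html>',
    '<html><head>',
    `<meta http-equiv="refresh" content="${refreshSeconds}">`,
    '</head><body>',
    `<img src="data:image/png;base64,${image}" alt="screenshot">`,
    `<pre><code>${escapeHtml(html)}</code></pre>`,
    '</body></html>',
  ].join('\n');
}

// Hypothetical holder for the most recent snapshot. The crawler side
// would update it with the results of page.screenshot() and page.content().
const latest = { screenshot: Buffer.alloc(0), html: '' };

// Plain Node HTTP server that always serves the latest snapshot.
function createLiveViewServer() {
  return http.createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' });
    res.end(renderLiveView(latest.screenshot, latest.html));
  });
}
```

To try it, call createLiveViewServer().listen(port) with a port of your choice and assign new values to latest.screenshot and latest.html on each capture interval.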

Let me know if you have any other questions!


mnmkng commented on May 13, 2024

Hello Marek,

I spent the afternoon and evening with the LiveView today and managed to implement a minimal working solution for the screenshot capturing. See: master...mnmkng:feature/live-view. It's definitely not pull-request ready, but I've learned a lot, so the next steps should be smoother.

The primary pain point was the fact that screenshot capturing would break due to pages getting destroyed by the Acts or SDK without me having any control over it, and it took me some time to figure out how to circumvent this. Essentially, I'm replacing the page.close() function with a noop while the screenshot is being made and reverting back after I have it.

Is that OK with you or would you prefer a different solution? It definitely is kind of a hack.
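
For readers following along, the workaround described above might look roughly like this. It is a sketch under assumptions, not the code from the branch: guardClose and safeScreenshot are hypothetical names, and replaying a close that was requested during the capture is an addition of this sketch that the comment above does not mention.

```javascript
// Temporarily replace page.close() with a no-op that only records the
// request. Returns a release() function that restores the original method.
function guardClose(page) {
  const originalClose = page.close;
  let closeRequested = false;
  let closeArgs = [];

  page.close = (...args) => {
    closeRequested = true;
    closeArgs = args;
    return Promise.resolve();
  };

  return function release() {
    page.close = originalClose;
    // Assumption of this sketch: if someone asked to close the page
    // while it was guarded, perform that close now.
    return closeRequested ? page.close(...closeArgs) : Promise.resolve();
  };
}

// Hypothetical helper: take a screenshot while the page cannot be closed.
async function safeScreenshot(page, options) {
  const release = guardClose(page);
  try {
    return await page.screenshot(options);
  } finally {
    await release();
  }
}
```

The guard only intercepts calls to page.close() on this page object. A browser that is closed as a whole, or a page that crashes, would still interrupt the capture, so callers should expect page.screenshot() to reject occasionally.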

I've also tried the HTML, but figured that it's not entirely straightforward either. Getting the full HTML using page.content() works, but stylesheet links seemed to break for various reasons, resulting in zero CSS, which looks terrible. E.g.: 'https://en.wikipedia.org/wiki/Amazon_Web_Services'. I figured I probably could inline the styles in Puppeteer, but didn't want to build a full SSR package before consulting with you.

What is your take on the matter? Perhaps I'm missing something.
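
One common cause of broken stylesheet links in a re-served snapshot is relative URLs, which the browser resolves against the live view server instead of the original site. A lighter alternative to inlining styles is to inject a <base> tag that points back at the original page. The sketch below assumes that cause. addBaseHref is a hypothetical helper, and whether this fixes the Wikipedia example above has not been verified here.

```javascript
// Escape a value for use inside a double-quoted HTML attribute.
function escapeAttribute(value) {
  return value.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
}

// Insert <base href="..."> right after the opening <head> tag so that
// relative stylesheet, script and image URLs resolve against pageUrl.
function addBaseHref(html, pageUrl) {
  const baseTag = `<base href="${escapeAttribute(pageUrl)}">`;
  const headOpen = /<head(\s[^>]*)?>/i;
  if (headOpen.test(html)) {
    // A replacer function avoids special handling of "$" in the URL.
    return html.replace(headOpen, (match) => `${match}${baseTag}`);
  }
  // No <head> found: prepend the tag and let the browser's parser place it.
  return `${baseTag}${html}`;
}
```

With Puppeteer this could be called as addBaseHref(await page.content(), page.url()). It does not help with stylesheets that fail for other reasons, such as a content security policy or requests that need cookies.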

Finally, how would you like to handle multiple pages spawned in parallel? I will make a root route where the user will be able to choose from a list of browsers, as requested in the Issue. I thought of doing the same for the pages, but they come and go very fast, so it doesn't seem like a viable option after all.

Also, any comments towards the code already written are much appreciated!
Thanks.


mtrunkat commented on May 13, 2024

Version 1 merged.
