Comments (4)
Hi, and thanks for the opportunity to work on this with you. Based on some quick research of the codebase, I'd like to clarify a few points before I start coding the first draft. Please share any thoughts on the following topics:
Web Server
I see that there's no web server technology currently used in `apify-js`. I'm familiar with Express, but I feel it would introduce an unnecessary dependency to the project. Do you plan to use web servers more extensively in the future? Otherwise, I'll just set up a pure Node `http` server.
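A minimal sketch of what a dependency-free server could look like, using only Node's built-in `http` module. The helper name `createLiveViewServer`, the `getHtml` callback, and the port in the usage comment are hypothetical and only serve the illustration:

```javascript
const http = require('http');

// Hypothetical helper: `getHtml` is any function that returns the
// current live view page as an HTML string.
function createLiveViewServer(getHtml) {
    return http.createServer((req, res) => {
        // Only the root route is served in this sketch.
        if (req.method !== 'GET' || req.url !== '/') {
            res.writeHead(404, { 'Content-Type': 'text/plain; charset=utf-8' });
            res.end('Not found');
            return;
        }
        res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' });
        res.end(getHtml());
    });
}

// Usage (port 4321 is an arbitrary example):
// createLiveViewServer(() => '<h1>Live view</h1>').listen(4321);
```

The server is returned without calling `listen()`, so the caller decides which port to bind.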
Client-Server communication
Regularly updating the view can be achieved either by client-side JavaScript that regularly polls the server for new data or through WebSockets. I saw that you already use the `ws` package, so WebSockets seem like a logical choice. I have no experience using them, but I'll learn if you feel this is the way to go too.
API
`Apify.launchPuppeteer` accepts options.
- Should the live view be an optional behavior? It seems like a good idea, perhaps turned on by default.
- Should the user be able to choose between the screenshot and the HTML, or should they be shown side by side?
- Should the polling interval be configurable?
Puppeteer
I have no relevant experience using Puppeteer, so I might be off here, but I suppose that the ways to get the HTML are `document.documentElement.outerHTML` or `new XMLSerializer().serializeToString(document)`. What I'm wondering is, how does it handle CSS and JS? Does it rely on being able to download the static files via the relevant `<link>` and `<script>` tags?
Thanks for your time.
from crawlee.
All sounds good.
- I'd use a plain Node HTTP server.
- WS might be a bit tricky because the WS server must run on the same port as the HTTP server, as we will have only one port open. So please check whether there are any issues with this. Otherwise, if the update is scheduled to happen, let's say, every 3s, then we can add a meta tag that reloads the page: `<meta http-equiv="refresh" content="3">`.
- Let's add a `liveView: true` parameter to `Apify.launchPuppeteer()`.
- For the first screenshot we could use the `page.on('domcontentloaded', ...)` event and then some constant interval (1s?). Let's see how it works and then decide if we need to make it configurable.
- Side by side or HTML below the screenshot? Let's decide this later when we see it in the application UI.
- You can use `page.content()` to get the HTML and `page.screenshot([options])` to get the screenshot as a buffer. The screenshot can be inserted into the live view HTML page as a base64-encoded image directly in the HTML code. The HTML can be displayed in `<code>` tags.
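The last points above could be sketched roughly as follows. This is only an illustration under assumptions: the function names `captureSnapshot`, `renderLiveViewPage`, and `escapeHtml` are hypothetical, the screenshot is assumed to be a PNG (Puppeteer's default type), and `page` is assumed to be a Puppeteer page exposing `content()` and `screenshot()`:

```javascript
// Escape text so the captured HTML is displayed as source, not rendered.
function escapeHtml(text) {
    return text
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;');
}

// Assumption: `page` is a Puppeteer Page; only content() and screenshot() are used.
async function captureSnapshot(page) {
    const html = await page.content();
    const screenshot = await page.screenshot(); // PNG by default
    return { html, screenshot };
}

// Build a self-refreshing page with the screenshot inlined as a base64
// data URI and the page HTML shown inside <code> tags.
function renderLiveViewPage({ html, screenshot }, refreshSeconds = 3) {
    // Buffer.from() accepts both a Buffer and a Uint8Array.
    const imageBase64 = Buffer.from(screenshot).toString('base64');
    return [
        '<!doctype html>',
        '<html><head>',
        `<meta http-equiv="refresh" content="${refreshSeconds}">`,
        '<title>Live view</title>',
        '</head><body>',
        `<img src="data:image/png;base64,${imageBase64}" alt="Page screenshot">`,
        `<pre><code>${escapeHtml(html)}</code></pre>`,
        '</body></html>',
    ].join('\n');
}
```

Because the image is embedded in the page itself, the meta refresh is enough to pick up a new screenshot and no separate image route is needed.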
Let me know if you have any other questions!
from crawlee.
Hello Marek,
I spent the afternoon and evening with the LiveView today and managed to implement a minimal working solution for the screenshot capturing. See: master...mnmkng:feature/live-view. It's definitely not pull-request ready, but I've learned a lot, so the next steps should be smoother.
The primary pain point was the fact that screenshot capturing would break due to pages getting destroyed by the Acts or the SDK without me having any control over it, and it took me some time to figure out how to circumvent this. Essentially, I'm replacing the `page.close()` function with a noop while the screenshot is being made and reverting it back after I have it.
Is that OK with you or would you prefer a different solution? It definitely is kind of a hack.
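For illustration, the swap described above could look something like this sketch. It is not the code from the branch: the name `screenshotWithCloseGuard` is hypothetical, and replaying a suppressed `close()` call after the capture is an added assumption on top of the plain noop:

```javascript
// Assumption: `page` is a Puppeteer Page (any object with close() and
// screenshot() methods behaves the same way here).
async function screenshotWithCloseGuard(page) {
    const originalClose = page.close;
    let closeRequested = false;
    let closeArgs = [];

    // Temporarily replace close() so the page cannot be destroyed mid-capture.
    page.close = async (...args) => {
        closeRequested = true;
        closeArgs = args;
    };

    try {
        return await page.screenshot();
    } finally {
        // Revert to the real close() and replay a call that was suppressed.
        page.close = originalClose;
        if (closeRequested) await page.close(...closeArgs);
    }
}
```

One caveat of this approach is that a caller awaiting `page.close()` during the capture gets its promise resolved before the page is actually closed.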
I've also tried the HTML, but found that it's not entirely straightforward either. Getting the full HTML using `page.content()` works, but stylesheet links seemed to break for various reasons, resulting in zero CSS, which looks terrible. E.g., 'https://en.wikipedia.org/wiki/Amazon_Web_Services'. I figured I could probably inline the styles in Puppeteer, but didn't want to build a full SSR package before consulting with you.
What is your take on the matter? Perhaps I'm missing something.
Finally, how would you like to handle multiple pages spawned in parallel? I will make a root route where the user will be able to choose from a list of browsers, as requested in the Issue. I thought of doing the same for the pages, but they come and go very fast, so it doesn't seem like a viable option after all.
Also, any comments on the code already written are much appreciated!
Thanks.
from crawlee.
Version 1 merged.
from crawlee.