Comments (2)
Your proposal is interesting, but there are a couple of things to consider.
- Cold-boot time for the Lambda function would be prohibitively slow, given that a browser has to launch, load the page, and retrieve the results from the API.
- Assuming it would act as an HTTP endpoint, a containerised application would be just as simple to use, more accessible across platforms, and could be run locally just as easily.
- This feature might need to be its own project rather than part of the instamancer package. Instamancer would remain the core module, and then you can build whatever server system you want around it. Having an authoritative server model doesn't really justify adding weight and complexity to the current module (except perhaps if it were just an extremely simple Express server).
Interested to hear your thoughts. We could create a new repo in https://github.com/instamancer
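To make the "core module plus a separate server" split concrete, here is a minimal sketch of a transport-agnostic request handler. Everything in it is an assumption for illustration: the `Post` shape, the `/hashtag/<tag>?count=N` route, and the names `handleRequest` and `fakeScraper` are hypothetical. The scraper is injected, so a real server would pass in something that wraps Instamancer, while this sketch runs without a browser.

```typescript
// Hypothetical shape of a scraped post; real scraped objects carry many more fields.
interface Post { id: string; }

// Any async source of posts. A real server would wrap the scraping module here;
// injecting it keeps the HTTP layer independent of the core package.
type Scraper = (hashtag: string, total: number) => AsyncIterable<Post>;

interface Reply { status: number; body: string; }

// Route a request path such as "/hashtag/beach?count=2" to the injected scraper.
async function handleRequest(path: string, scrape: Scraper): Promise<Reply> {
    const url = new URL(path, "http://localhost");
    const match = /^\/hashtag\/([^/]+)$/.exec(url.pathname);
    if (match === null) {
        return { status: 404, body: JSON.stringify({ error: "not found" }) };
    }
    // Default to 10 posts when no count is given (an arbitrary choice).
    const count = Number(url.searchParams.get("count") ?? "10");
    if (!Number.isInteger(count) || count < 1) {
        return { status: 400, body: JSON.stringify({ error: "invalid count" }) };
    }
    const posts: Post[] = [];
    for await (const post of scrape(decodeURIComponent(match[1]), count)) {
        posts.push(post);
        if (posts.length >= count) break;
    }
    return { status: 200, body: JSON.stringify(posts) };
}

// Stand-in scraper used only for demonstration; it never touches the network.
async function* fakeScraper(hashtag: string, total: number): AsyncGenerator<Post> {
    for (let i = 0; i < total; i++) yield { id: `${hashtag}-${i}` };
}
```

An Express route or a plain Node `http` listener could call `handleRequest` and copy `status` and `body` into the response, which keeps the server small enough to live in its own repo.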
from instamancer.
True.
I think a secondary repo would for sure make sense.
On the cold boot & browser spin up time
In my experience the bigger issue is the spin-up time for the browser. Most of the reasoning behind working on a more sophisticated deployment is to handle larger use cases (possibly with concurrent instances executing at once) and/or use cases that run on a consistent schedule.
Since the cold boot only applies to containers that haven't been run in a while, it USUALLY doesn't add a huge amount of overhead on its own, since really it only affects your first execution. This assumes running the container 50 or 100 times after the first execution (which warms it up).
In cases where thousands of Lambda / serverless executions occur before the container cools down, the warm-up overhead doesn't end up impacting things in a meaningful way (in my experience!).
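To put rough numbers on that, here is a small sketch of the amortization argument. The function name `averageOverheadMs` and all timings are hypothetical inputs chosen for illustration, not measurements of Lambda or Instamancer.

```typescript
// Average per-invocation overhead when one cold start is followed by warm runs.
// All timings are hypothetical inputs, not measurements.
function averageOverheadMs(
    coldStartMs: number,
    browserLaunchMs: number,
    invocations: number,
): number {
    if (invocations < 1) throw new RangeError("invocations must be at least 1");
    // The cold start is paid once; the browser launch is paid on every invocation.
    return coldStartMs / invocations + browserLaunchMs;
}

const single = averageOverheadMs(5000, 2000, 1);    // 7000
const hundred = averageOverheadMs(5000, 2000, 100); // 2050
```

With these assumed figures, the cold start shrinks to 50 ms per run after 100 invocations, while the browser launch stays at 2000 ms every time.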
The bigger issue is the browser spinning up each time. To me this is just par for the course when using Chrome / Puppeteer to get around the scraping defenses out there (in this case Instagram's), and I think it's worth it.
In my experience almost all of the user behaviors that can be used to detect a scraper can be replicated in Puppeteer, so relying on Chrome adds a huge amount of value / resilience to the project, even though you eat the above-mentioned overhead.
If the reasoning behind moving toward serverless is to abstract away the management of Instamancer queries that run on a schedule (every day, every hour, etc.), then this separate project could also support proxies, which would allow concurrent executions for larger projects.
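As a rough sketch of how proxies could enable concurrent executions, the snippet below assigns queries to proxies round-robin and groups them into batches so that no two jobs in the same batch share a proxy. The names `assignProxies` and `toBatches` and the proxy URLs are hypothetical, and nothing here reflects an existing Instamancer option.

```typescript
interface Job { query: string; proxy: string; }

// Assign proxies round-robin so consecutive jobs go out through different proxies.
function assignProxies(queries: string[], proxies: string[]): Job[] {
    if (proxies.length === 0) throw new Error("at least one proxy is required");
    return queries.map((query, i) => ({ query, proxy: proxies[i % proxies.length] }));
}

// Split items into consecutive batches of at most `size` elements.
function toBatches<T>(items: T[], size: number): T[][] {
    if (size < 1) throw new RangeError("size must be at least 1");
    const batches: T[][] = [];
    for (let i = 0; i < items.length; i += size) {
        batches.push(items.slice(i, i + size));
    }
    return batches;
}

// Hypothetical proxy endpoints and hashtag queries.
const proxies = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"];
const jobs = assignProxies(["beach", "sunset", "food"], proxies);

// With a batch size equal to the number of proxies, each batch uses every proxy at most once.
const batches = toBatches(jobs, proxies.length);
```

Each batch could then be run concurrently (one execution per job), with the next batch starting once the previous one finishes.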
I would like to play around with Instamancer a little more in a containerized environment, but I don't see any reason why it would be super hard to configure.
Have you done any work on containerization / dockerization locally?
I can probably contribute there at least!