Comments (6)
In my initial attempts to reproduce this, I was able to gather 1000 posts from a hashtag.
The restarting process you describe is what I call grafting. It allows instamancer to perform long scraping jobs by restarting the browser to limit resource usage. You can read about it on the website:
Because using a browser consumes lots of memory in large scraping jobs, Instamancer employs a new scraping technique called grafting. It intercepts and saves the URL and headers of each request, and then after a certain number of interactions with the page it will restart the browser and navigate back to the same page. Once the page initiates the first request to the API, its URL and headers are swapped on-the-fly with the most recently saved ones. The scraping continues without incident because the response from the API is in the correct form despite being for the incorrect data.
and in the FAQ:
What happens if I disable grafting?
Chrome / Chromium will eventually decide that it doesn't want the page to consume any more resources and future requests to the API will be aborted. This usually happens between 5k-10k posts regardless of the memory available on the system. There doesn't seem to be any combination of Chrome flags to avoid this.
This bug could be caused by something going wrong when instamancer attempts a graft, that is, when it swaps the request parameters on the fly after the browser has been restarted.
So, a few questions:
- Can you include a copy of the graphql error here so I can see what it says?
- Does the same issue occur when using -g=false?
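The grafting behaviour quoted above can be sketched roughly as follows. This is a minimal illustration of the idea only, not instamancer's actual code: the names GraftTracker, ApiRequest, interactionsPerGraft, recordInteraction, onRestart and intercept are all hypothetical, and the browser is left out so the swap logic can be read on its own.

```typescript
// Hypothetical sketch of grafting; none of these names come from instamancer.
interface ApiRequest {
    url: string;
    headers: Record<string, string>;
}

class GraftTracker {
    private lastRequest: ApiRequest | null = null;
    private interactions = 0;
    private pendingGraft = false;
    private readonly interactionsPerGraft: number;

    // interactionsPerGraft: assumed number of page interactions between restarts.
    constructor(interactionsPerGraft: number) {
        this.interactionsPerGraft = interactionsPerGraft;
    }

    // Call once per page interaction (for example a scroll).
    // Returns true when the browser should be restarted.
    recordInteraction(): boolean {
        this.interactions += 1;
        return this.interactions % this.interactionsPerGraft === 0;
    }

    // Call after the browser has restarted and navigated back to the same page.
    onRestart(): void {
        this.pendingGraft = this.lastRequest !== null;
    }

    // Call for every intercepted API request; returns the request to send.
    intercept(request: ApiRequest): ApiRequest {
        if (this.pendingGraft && this.lastRequest !== null) {
            // First request after a restart: swap in the saved URL and headers
            // so scraping resumes from where it left off.
            this.pendingGraft = false;
            return this.lastRequest;
        }
        // Normal case: remember a copy of this request and send it unchanged.
        this.lastRequest = { url: request.url, headers: { ...request.headers } };
        return request;
    }
}
```

If the swap after a restart did not happen in a setup like this, the page's own first request would go out unchanged and scraping would start again from the first page of results.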
from instamancer.
Sorry @ScriptSmith I have not had a chance to try it. I will close this issue for now and reopen if I can get any further info.
from instamancer.
Hi @ScriptSmith
Thanks for the detailed response! I didn't even notice the FAQ document; that will be super handy.
From what I recall, when grafting was triggered and the browser restarted, it started scraping the hashtag from the very beginning, which put it into an infinite loop.
I will confirm this after I finish work and try to get a reproducible example. I will also answer the two questions you posted.
from instamancer.
Example failed requests with grafting disabled:
One of those failures happens roughly every 800 requests. This is with grafting disabled.
from instamancer.
I think that error is caused by Chrome cancelling requests due to resource limitations. Try cloning this repo and changing the value of jumpMod in src/api/instagram.ts to 50. That should cause grafting to be initiated more quickly.
from instamancer.
Did you get a chance to try out the fix?
from instamancer.