Comments (6)

ScriptSmith commented on June 12, 2024

In my initial attempts to reproduce this, I am able to gather 1000 posts from a hashtag.

The restarting process you describe is what I call grafting, which allows instamancer to perform long scraping jobs by restarting the browser in order to limit resource usage. You can read about it on the website:

Because using a browser consumes lots of memory in large scraping jobs, Instamancer employs a new scraping technique called grafting. It intercepts and saves the URL and headers of each request, and then after a certain number of interactions with the page it will restart the browser and navigate back to the same page. Once the page initiates the first request to the API, its URL and headers are swapped on-the-fly with the most recently saved ones. The scraping continues without incident because the response from the API is in the correct form despite being for the incorrect data.

and in the FAQ:

What happens if I disable grafting?

Chrome / Chromium will eventually decide that it doesn't want the page to consume any more resources and future requests to the API will be aborted. This usually happens between 5k-10k posts regardless of the memory available on the system. There doesn't seem to be any combination of Chrome flags to avoid this.

This bug could occur because something goes wrong when instamancer attempts a graft, that is, when it swaps request parameters on the fly after the browser has been restarted.

So, a few questions:

  • Can you include a copy of the graphql error here so I can see what it says?
  • Does the same issue occur when using -g=false?
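The grafting behaviour quoted above can be sketched roughly as follows. This is a minimal illustration, not instamancer's actual code: the `SavedRequest` type, the `GraftState` class, and the example URLs are all hypothetical, and the real tool would hook into the browser's request interception rather than being called by hand.

```typescript
// Hypothetical shape of an intercepted API request (assumption, not instamancer's type).
interface SavedRequest {
  url: string;
  headers: Record<string, string>;
}

// Hypothetical helper that tracks the newest request and swaps it in after a restart.
class GraftState {
  private latest: SavedRequest | null = null;
  private pendingGraft = false;

  // Remember the URL and headers of the most recent API request.
  private record(request: SavedRequest): void {
    this.latest = { url: request.url, headers: { ...request.headers } };
  }

  // Call once the browser has been restarted and has navigated back to the page.
  markRestarted(): void {
    this.pendingGraft = this.latest !== null;
  }

  // Call for each outgoing API request. The first request after a restart is
  // replaced by the saved one; every other request passes through and is saved.
  intercept(request: SavedRequest): SavedRequest {
    if (this.pendingGraft && this.latest !== null) {
      this.pendingGraft = false;
      return this.latest;
    }
    this.record(request);
    return request;
  }
}

// Example run with made-up URLs and headers.
const graft = new GraftState();
graft.intercept({ url: "/api?after=page1", headers: { cookie: "a" } });
graft.intercept({ url: "/api?after=page2", headers: { cookie: "a" } });
graft.markRestarted();
// The restarted page asks for the first page again, but the saved request is sent instead.
const afterRestart = graft.intercept({ url: "/api?after=", headers: { cookie: "b" } });
// Later requests are no longer swapped.
const followUp = graft.intercept({ url: "/api?after=page3", headers: { cookie: "b" } });
```

In this sketch, if the swap did not take effect after a restart, the page's own first request would go out unchanged and scraping would start again from the first page.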

from instamancer.

Daniel-Griffiths commented on June 12, 2024

Sorry @ScriptSmith, I have not had a chance to try it. I will close this issue for now and reopen it if I can get any further info.

Daniel-Griffiths commented on June 12, 2024

Hi @ScriptSmith

Thanks for the detailed response! I didn't even notice the FAQ document; that will be super handy.

From what I recall, when grafting was triggered and the browser restarted, it started scraping the hashtags from the very beginning, which would put it into an infinite loop.

I will confirm this after I finish work and try to get a reproducible example. I will also answer the two questions you posted.

Daniel-Griffiths commented on June 12, 2024

Example failed requests with grafting disabled:

endpoint: https://www.instagram.com/graphql/query/?query_hash=174a5243287c5f3a7de741089750ab3b&variables=%7B%22tag_name%22%3A%22rebelgal%22%2C%22first%22%3A12%2C%22after%22%3A%22QVFCZndxMUV2QXlQalMyTVJ5ZUFqUDVraGRhc20wTmJfNkthMlZYa3kwSGZUODJid3JRWHp6VmQ2VUIxRTRNRWRzU0kzVlVCT0o2VER3SWVmWWl2Z3RHdg%3D%3D%22%7D

image

Each one of those failures happens roughly every 800 requests; this is with grafting disabled.
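To see what a failing request like the one above was asking for, the URL-encoded `variables` parameter can be decoded. The following is a minimal sketch using only `decodeURIComponent` and `JSON.parse`; the `decodeVariables` helper is hypothetical, and the example URL uses a shortened, made-up cursor in place of the real one.

```typescript
// Hypothetical helper: extract and decode the JSON in the `variables` query parameter.
function decodeVariables(endpoint: string): Record<string, unknown> {
  const queryStart = endpoint.indexOf("?");
  const query = queryStart >= 0 ? endpoint.slice(queryStart + 1) : "";
  for (const pair of query.split("&")) {
    const eq = pair.indexOf("=");
    if (eq < 0) continue; // skip fragments without a key=value form
    if (pair.slice(0, eq) === "variables") {
      return JSON.parse(decodeURIComponent(pair.slice(eq + 1)));
    }
  }
  throw new Error("no variables parameter in URL");
}

// Same structure as the endpoint above, with a shortened placeholder cursor ("abc==").
const decoded = decodeVariables(
  "https://www.instagram.com/graphql/query/?query_hash=174a5243287c5f3a7de741089750ab3b" +
    "&variables=%7B%22tag_name%22%3A%22rebelgal%22%2C%22first%22%3A12%2C%22after%22%3A%22abc%3D%3D%22%7D"
);
// decoded is { tag_name: "rebelgal", first: 12, after: "abc==" }
```

Applied to the endpoint in this comment, the same decoding gives a `tag_name` of `rebelgal`, a `first` of 12, and an `after` pagination cursor.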

ScriptSmith commented on June 12, 2024

I think that error is caused by Chrome cancelling requests due to resource limitations. Try cloning this repo and changing the value of jumpMod in src/api/instagram.ts to 50. That should cause grafting to be initiated more quickly.
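As a rough illustration of why a smaller jumpMod would trigger grafting sooner, here is a sketch that assumes the value is used as a modulus on a count of page interactions. That usage is an assumption and has not been checked against src/api/instagram.ts; `shouldGraft` and `countGrafts` are hypothetical names.

```typescript
// Assumption: a graft is triggered every `jumpMod` interactions with the page.
function shouldGraft(interactions: number, jumpMod: number): boolean {
  return interactions > 0 && interactions % jumpMod === 0;
}

// Count how many grafts would occur over a run of `totalInteractions`.
function countGrafts(totalInteractions: number, jumpMod: number): number {
  let grafts = 0;
  for (let i = 1; i <= totalInteractions; i++) {
    if (shouldGraft(i, jumpMod)) grafts++;
  }
  return grafts;
}
```

Under this assumption, halving the value doubles the number of browser restarts over the same run, so each browser session handles fewer requests before it is replaced.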

ScriptSmith commented on June 12, 2024

Did you get a chance to try out the fix?
