Comments (9)

IORoot commented on June 12, 2024

Just my little contribution here. I've managed to jerry-rig a poor implementation of logging in with credentials if puppeteer is redirected to the login page.

Within the constructPage method in the instagram.js file, I added a check for the login page that attempts to enter the username and password if the page is found. It seems to be working for me now.
I'm not opening a pull request because this is very hacked together and I don't know JS very well. It also has a hard-coded username and password in it, which is bad practice.

However, I'm sure someone else can make a much better implementation of this.

Just remember to replace YOUR_ACCOUNT_USERNAME_GOES_HERE and YOUR_ACCOUNT_PASSWORD_GOES_HERE with your real account details.

    /**
     * Create the browser and page, then visit the url
     */
    async constructPage() {
        // Browser args
        const args = [];
        /* istanbul ignore if */
        if (process.env.NO_SANDBOX) {
            args.push("--no-sandbox");
            args.push("--disable-setuid-sandbox");
        }
        if (this.proxyURL !== undefined) {
            args.push("--proxy-server=" + this.proxyURL);
        }
        // Browser launch options
        const options = {
            args,
            headless: this.headless,
        };
        if (this.executablePath !== undefined) {
            options.executablePath = this.executablePath;
        }
        // Launch browser
        if (this.browserInstance) {
            await this.progress(Progress.LAUNCHING);
            this.browser = this.browserInstance;
            this.browserDisconnected = !this.browser.isConnected();
            this.browser.on("disconnected", () => (this.browserDisconnected = true));
        }
        else if (!this.sameBrowser || (this.sameBrowser && !this.started)) {
            await this.progress(Progress.LAUNCHING);
            this.browser = await puppeteer_1.launch(options);
            this.browserDisconnected = false;
            this.browser.on("disconnected", () => (this.browserDisconnected = true));
        }
        // New page
        this.page = await this.browser.newPage();
        await this.progress(Progress.OPENING);

        // Attempt to visit URL
        try {

            await this.page.goto(this.url);

            // ┌─────────────────────────────────────────────────────────────────────────┐ 
            // │                                                                         │░
            // │                                                                         │░
            // │                      CHECK FOR LOGIN PAGE HERE                          │░
            // │                                                                         │░
            // │                                                                         │░
            // └─────────────────────────────────────────────────────────────────────────┘░
            // ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
            try {

                this.logger.info("Checking whether we were redirected to the login page.");
                await this.page.waitForSelector('input[name="username"]', { timeout: 2000 });
                this.logger.info("Login page found, attempting to use credentials.");
                await this.page.type('input[name="username"]', 'YOUR_ACCOUNT_USERNAME_GOES_HERE');
                await this.page.type('input[name="password"]', 'YOUR_ACCOUNT_PASSWORD_GOES_HERE');
                await this.page.click('button[type="submit"]');
                // Give the post-login page a moment to render
                await new Promise((resolve) => setTimeout(resolve, 2000));

                // "Save login details" prompt
                await this.page.waitForSelector('button[type="button"]');
                await this.page.click('button[type="button"]');

                // Notifications prompt
                await this.page.waitForSelector('button[tabindex="0"]');
                await this.page.click('button[tabindex="0"]');

                // Go back to the originally requested URL, not the login page.
                await this.page.goto(this.url);

            } catch (error) {
                // Reached when the username field never appears, or when any login step fails
                this.logger.info("No login screen found, or login did not complete.", { error });
            }

            
            // Check page loads
            /* istanbul ignore next */
            const pageLoaded = await this.page.evaluate(() => {
                const headings = document.querySelectorAll("h2");
                for (const heading of Array.from(headings)) {
                    if (heading.innerHTML ===
                        "Sorry, this page isn't available.") {
                        return false;
                    }
                }
                return true;
            });


            if (!pageLoaded) {
                await this.handleConstructionError("Page loaded with no content", 10);
                return false;
            }
            // Run defaultPagePlugins
            for (const f of this.defaultPageFunctions) {
                await this.page.evaluate(f);
            }
            // Fix issue with disabled scrolling
            /* istanbul ignore next */
            await this.page.evaluate(() => {
                setInterval(() => {
                    try {
                        document.body.style.overflow = "";
                    }
                    catch (error) {
                        // Runs inside the browser page, where this.logger does not exist
                        console.error("Failed to update style", error);
                    }
                }, 10000);
            });

        }
        catch (e) {
            await this.handleConstructionError(e, 60);
            return false;
        }
        return true;
    }
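The login check above can also be pulled out into a small helper that takes the credentials as an argument, so nothing is hard-coded. This is only a sketch: `loginIfRedirected` is a hypothetical name, the selectors are copied from the snippet above and may stop matching whenever Instagram changes its markup, and the "save details" and notifications prompts are not handled here.

```javascript
// Hypothetical helper, not part of instamancer.
// `page` is expected to be a puppeteer Page (or anything with the same methods).
// Returns true if the login form was found and submitted, false otherwise.
async function loginIfRedirected(page, credentials, timeout = 2000) {
    try {
        // The username field only exists on the login page
        await page.waitForSelector('input[name="username"]', { timeout });
    } catch (error) {
        return false; // no username field appeared within the timeout
    }
    await page.type('input[name="username"]', credentials.username);
    await page.type('input[name="password"]', credentials.password);
    await page.click('button[type="submit"]');
    return true;
}
```

Inside constructPage this could be called as `await loginIfRedirected(this.page, { username, password })`, with the values read from environment variables or a file rather than typed into the source.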

from instamancer.

IORoot commented on June 12, 2024

So, I've been playing around with trying to get a proof of concept working to batch process 50 user requests in a row on my DigitalOcean server, and I think I've just managed to crack it. I took a bunch of steps, and once I've put it all together I'll submit a pull request. In the meantime, here are the things that I think you need to solve:

  1. Puppeteer will open a new browser for every request. I think Instagram was seeing many new requests along with a lot of opening and closing of the browser, and this was triggering its spam detection. On my server, I could get through nine requests before hitting the rate limit.

     To mitigate this, I got the code to loop across all of the requests while keeping the first instance of the browser open. That way Instagram sees a single session in which the user is just visiting multiple accounts.

  2. Login detection. If a login screen is detected, it needs to be handled with some credentials. So I supplied a creds.json file that is read and entered if the login page is detected.

  3. The new login location (the location of your server) will be detected by Instagram. You need to confirm manually, through email, that this is a new location and that it's you. (2FA)

  4. I've swapped puppeteer out for puppeteer-extra and am using puppeteer-extra-plugin-stealth to help evade any bot detection.

That's it. At the moment, the code is in a mess, but I think that this might point folks in the right direction. I've successfully just scraped 50 individual accounts from the server.
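For the creds.json step, a loader might look something like the sketch below. The thread never shows the file's layout, so the shape used here (`{"username": "...", "password": "..."}`) and the name `loadCredentials` are assumptions made purely for illustration.

```javascript
const fs = require("fs");

// Assumed file shape (not confirmed anywhere in this thread):
//   {"username": "...", "password": "..."}
function loadCredentials(path = "creds.json") {
    const parsed = JSON.parse(fs.readFileSync(path, "utf8"));
    // Fail early with a readable message instead of typing "undefined" into the form
    for (const key of ["username", "password"]) {
        if (typeof parsed[key] !== "string" || parsed[key].length === 0) {
            throw new Error(`${path} is missing a non-empty "${key}" field`);
        }
    }
    return { username: parsed.username, password: parsed.password };
}
```

Keeping the file out of version control (for example via .gitignore) avoids the hard-coded credential problem mentioned earlier.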

ScriptSmith commented on June 12, 2024

Hi all, thanks for your contributions. Unfortunately I haven't had much time to spend on instamancer recently, but I'm a little more free now. Hopefully I can provide some more insight.

I think Instagram could be flagging multiple unauthenticated requests from the same address with different session cookies and other headers. However, I am more confident that the explanation is simply that they now block unauthenticated requests from popular cloud platforms: instamancer was working very reliably on these platforms until recently, and now it doesn't work even with brand new connections.

Ultimately there is a tradeoff between the pattern of multiple short sessions from the same source, and a single long session from the same source. In the past, having multiple sessions proved to be advantageous, but perhaps this is no longer the case.

The instamancer module (not cli) has an optional argument called browserInstance which you can use to persist a single puppeteer browser between scraping jobs. The sameBrowser argument can also be used to stop instamancer initiating grafting with a separate browser.

I'm not sure if you have been using those two features in your private fork @IORoot, but if so I think they can be used to test whether it is more advantageous to persist a single session. If so, we can add more options to keep instagram cookies, persist profile data with userDataDir etc.
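As a rough sketch of that experiment, the loop below passes one browser to every job through the browserInstance and sameBrowser options. The `createApi(endpoint, id, options)` and `generator()` shapes follow the instamancer README as I understand it, so treat them as assumptions and check them against your installed version; `scrapeUsers` itself is a hypothetical helper, and `createApi` is passed in as a parameter so the loop can be tried without launching a browser.

```javascript
// Hypothetical helper: scrape several accounts while reusing one browser.
// `createApi` is expected to behave like require("instamancer").createApi,
// and `browser` like the result of puppeteer.launch().
async function scrapeUsers(createApi, browser, usernames, total = 10) {
    const results = {};
    for (const username of usernames) {
        const api = createApi("user", username, {
            browserInstance: browser, // reuse the already-open browser
            sameBrowser: true,        // do not start a second browser for grafting
            total,
        });
        results[username] = [];
        for await (const item of api.generator()) {
            results[username].push(item);
        }
    }
    return results;
}
```

With the real libraries the call would be along the lines of `scrapeUsers(require("instamancer").createApi, await puppeteer.launch(), ["account_a", "account_b"])`, closing the browser yourself once every job has finished.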

I don't know how useful puppeteer-extra-plugin-stealth is, as I don't see any evidence of Instagram looking for puppeteer. I attribute this mostly to the fact that puppeteer is not the most popular Instagram scraping method.

One other thing to note is that I likely won't be including any instagram 'login' or other sophisticated user interaction mechanisms in instamancer. You can write plugins to interact with the instagram webpages yourselves, or use plugins written by others.

If people are interested in using plugins to have more intricate interactions with the webpage, then I can also look at making improvements to their usability. They're pretty easy to use if you're using instamancer as a node module, but it's quite hard to use them with the CLI.

navxio commented on June 12, 2024

Tried this myself, same issue. Generated a log file, though:

instamancer.log
Looks like the preliminary oauth request fails for some reason coming from a linux host...

Edit: Works just fine on my mac, but fails on an ubuntu vps

IORoot commented on June 12, 2024

Hey @ScriptSmith, thanks for the comments and the heads-up on the optional arguments. I didn't see them, actually, and they would have made life much easier! Oh well.
I ended up creating a new command called users, which behaves very similarly to the posts command but lets you submit a CSV of multiple accounts.
It then loops through each one, keeping the browser open for all of them.

I completely understand the motivation not to do the login part, and really, that was probably the easiest part to do within the constructPage() method. It's wise to make that a plugin anyway, since I imagine it will become more complex in the future.

Once everything was running, I disabled the stealth-plugin and it made no difference, so I don't think that's needed right now.

My code isn't perfect by any means since I think I broke some of your functionality, which I need to fix (I'm new to TS - it's taking me time), but it seems to mostly work. The changes I made are all on my fork and it's happily running on the server.

lum1nat0r commented on June 12, 2024

@IORoot So you are telling me that your code is currently running on your server?
How come I can't get it running :(
I have it running inside an Ubuntu container, cloned your repo and installed all the dependencies, but I still get the same error message that @navxio showed in his logs. Maybe you have a suggestion for what I could try to get it running? :)

IORoot commented on June 12, 2024

Yep, it's still running and working well. There are a LOT of gotchas with Instagram that you need to work your way through. Off the top of my head the main ones are:

  1. If it's running on a server, let's say with an IP address of 1.1.1.1, then Instagram will see that as a new IP address connecting to its service. With the login functionality I added, that account will get an email notification saying "Hey, we just saw a new connection from this browser/machine/IP 1.1.1.1 - is that you?", and you'll need to confirm that it is.

  2. If your server IP is 1.1.1.1, sometimes Instagram will flag this as "suspicious behaviour" and send a 6-digit code to your email account to then add into the browser, right there and then. This is a problem because Instamancer can't deal with this. So, the way I fixed it was to install a proxy server on the machine (TinyProxy) and then use my laptop 2.2.2.2 to tunnel through the server 1.1.1.1, so I can have the same IP address as the server and then manually deal with the 6-digit code confirmation on my laptop. Once I've confirmed the "suspicious behaviour" as me, Instagram then sees the IP 1.1.1.1 as an OK IP address and won't flag it up again.

  3. I've added a "screenshot" function to instamancer that takes an image and places it into /tmp/instamancer/ at each step of the process so I can see where it's getting stuck. This definitely helps to debug what Instagram's current problem is.

  4. I've allowed the --proxyURL flag on the command line so I can proxy through any other servers I need to when debugging.

  5. I've added --user and --pass flags so the login steps work without supplying a creds.json file. It makes life easier.

I have noticed that Instagram can see that "Headless Chrome" and "Linux" are being used. If it starts objecting to that, it may become an issue, in which case I may return to the stealth puppeteer project.
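A step-by-step screenshot helper along the lines of point 3 could be sketched as follows. This is not the code from the fork: `createSnapshotter` is a hypothetical name and the file naming scheme is made up, with only the /tmp/instamancer/ directory taken from the description above. `page.screenshot({ path })` is the standard puppeteer call.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: returns a function that saves a numbered screenshot
// per step, so the files sort in the order the steps were reached.
function createSnapshotter(page, dir = "/tmp/instamancer") {
    let step = 0;
    return async function snapshot(label) {
        fs.mkdirSync(dir, { recursive: true }); // no-op if the directory exists
        step += 1;
        // Keep file names safe by replacing anything unusual with "_"
        const safeLabel = label.replace(/[^a-z0-9_-]/gi, "_");
        const file = path.join(dir, `${String(step).padStart(2, "0")}-${safeLabel}.png`);
        await page.screenshot({ path: file });
        return file;
    };
}
```

Calling `const snapshot = createSnapshotter(this.page)` once and then `await snapshot("after goto")` or `await snapshot("after login")` between steps leaves a trail showing where a run stalled, such as on the 6-digit code prompt.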

diegofullstackjs commented on June 12, 2024

Where do I find the creds.json file?

ScriptSmith commented on June 12, 2024

Instagram is now much more aggressively enforcing login.

See the notice in the README and #58
