nacimgoura / instagram-profilecrawl

:computer: Quickly crawl the information (e.g. followers, tags, etc...) of an instagram profile. No login required!

License: MIT License

JavaScript 100.00%

automation browser chromedriver crawler instagram nodejs script selenium

instagram-profilecrawl's Introduction

💫 About Me:

🔭 I’m currently working on my Skills.
🌱 I’m mainly a Golang/NodeJS/Deno developer, but I'm in the process of learning more about Rust and Blockchain development.
💬 I’m looking for help with exploring New Technologies like Blockchain and AI.
📫 How to reach me: Email - [email protected]
💚 I love photography and hiking
⚡ Fun fact: My favourite dish is ramen

instagram-profilecrawl's Issues

Action required: Greenkeeper could not be activated 🚨

🚨 You need to enable Continuous Integration on all branches of this repository. 🚨

To enable Greenkeeper, you need to make sure that a commit status is reported on all branches. This is required by Greenkeeper because we are using your CI build statuses to figure out when to notify you about breaking changes.

Since we did not receive a CI status on the greenkeeper/initial branch, we assume that you still need to configure it.

If you have already set up a CI for this repository, you might need to check your configuration. Make sure it will run on all new branches. If you don’t want it to run on every branch, you can whitelist branches starting with greenkeeper/.

We recommend using Travis CI, but Greenkeeper will work with every other CI service as well.

Once you have installed CI on this repository, you’ll need to re-trigger Greenkeeper’s initial Pull Request. To do this, please delete the greenkeeper/initial branch in this repository, and then remove and re-add this repository to the Greenkeeper integration’s white list on GitHub. You'll find this list on your repo or organization’s settings page, under Installed GitHub Apps.

crawl only shared_data

Well, this is more of a suggestion than a "bug" report.

I think you shouldn't load the full Instagram page.
First, the element classes (inside elements.json) change fairly often (I have no idea how often).

So my suggestion is to load only _sharedData from the profile page.
Don't load JavaScript, images, or styles. It's way faster.

Something like this:

        this.page = await this.browser.newPage();
        // Abort heavy resources; the _sharedData script is inline in the HTML, so it still arrives.
        await this.page.setRequestInterception(true);
        this.page.on('request', (request) => {
            if (['image', 'stylesheet', 'font', 'script'].includes(request.resourceType())) {
                request.abort();
            } else {
                request.continue();
            }
        });
        await this.page.setExtraHTTPHeaders({
            'Accept-Language': 'pt-BR'
        });
        await this.page.goto('https://instagram.com/' + username, {
            waitUntil: 'networkidle0'
        });
        // `document` does not exist in the Node context, so read the page HTML through puppeteer.
        const html = await this.page.content();
        // Assumes the page still embeds "window._sharedData = {...};" in an inline script tag.
        const match = /window\._sharedData = (.*?);<\/script>/.exec(html);
        const profileData = match ? JSON.parse(match[1]) : null;

/* Maybe you could reuse your version 1.0 "parseData" function from here? */

Tests and Issues

Gave it a try today. The Selenium method dies with an error every time.
The API method works, but all posts are dated 1970-01-17.
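
A 1970-01-17 date is what JavaScript produces when a Unix timestamp in seconds is passed to `Date` as if it were milliseconds. Whether that is the actual cause here is an assumption; the snippet below only sketches the symptom and the usual fix, using a hypothetical timestamp value.

```javascript
// Hypothetical timestamp in seconds (the kind of value Unix-time APIs return).
const takenAtSeconds = 1450000000;

// Bug: Date expects milliseconds, so a seconds value lands in January 1970.
const wrong = new Date(takenAtSeconds).toISOString();

// Fix: convert seconds to milliseconds first.
const right = new Date(takenAtSeconds * 1000).toISOString();

console.log(wrong); // 1970-01-17T18:46:40.000Z
console.log(right); // 2015-12-13T09:46:40.000Z
```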

Code cleanup

Hello @nacimgoura, thanks for creating this tool!

How do you feel about adding prettier + editor config to the project? If that sounds good to you, I can create a PR.

Thanks!

Some fields changed

Hi,

thanks for providing this open source code. It's a pleasure to work with :-)

I guess Instagram changed their UI, so some fields are no longer scraped correctly.

Here is my element.json which works in my case:

{
	"notExist": ".dialog-404",
	"dismissInvitationLogin": ".coreSpriteDismissLarge",
	"alias": " div.-vDIg > h1", //changed
	"username": "h1",
	"isOfficial": "span.coreSpriteVerifiedBadge",
	"descriptionProfile": "div.-vDIg > span", //changed
	"urlImgProfile": "header img",
	"website": "div.-vDIg > a",
	"isPrivate": "h2",
	"numberPosts": "ul li:first-child a span", //changed
	"numberFollowers": "ul li:nth-child(2) span",
	"numberFollowing": "ul li:nth-child(3) span",
	"listPost": "article div > a",
	"numberLike": "article section div span > span",
	"numberComments": "article li",
	"numberView": "views",
	"urlImage": "article div > img",
	"video": "video",
	"description": "article ul > li:first-child",
	"tags": "span",
	"mentions": "mentions",
	"date": "time",
	"multipleImage": ".coreSpriteRightChevron"
}

Additionally I had an issue with puppeteer but the fix mentioned in puppeteer/puppeteer#2746 solved my problem.

No processing when posts > 1000

Hi @nacimgoura !
When a profile has more than 1000 posts, your script stops without processing them.
instagram-profilecrawl jamieoliver
✔ Profile successfully loaded for jamieoliver!
✔ File created with success!

got null or '' image url if multipleImage is true

{
      "urlImage": [
        null,
        "",
        "",
        ""
      ],
      "isVideo": true,
      "video": "https://scontent-hkg3-2.cdninstagram.com/vp/ea15e7de012f048c65133d0ba04161a4/5B849A74/t50.2886-16/37996294_326760741402685_7469339062971549339_n.mp4",
      "numberLike": null,
      "numberView": null,
      "numberComments": 5,
      "description": "Click video for sound\nVideo\n",
      "tags": null,
      "mentions": [],
      "date": "2018-08-15T07:09:39.000Z",
      "multipleImage": true
    }

and this is a post with 2 videos.

Add tests.

Hello @nacimgoura,

I'm thinking about adding Jest here, what do you think?

Thanks!

Tool is broken

I updated to the latest version and cannot successfully crawl a profile anymore.

Can anyone try running the crawler with --method=api on profile marsquest and see if it works?

Improve readme

What unit is --limit specified in? An integer representing a number of posts? A timestamp representing a date?
What is headless mode?

Add ability to set a "since" parameter

For people who want to use this utility more than once on the same instagram account, it would be awesome to be able to set a parameter that would tell the script to only crawl posts that are newer than a given date or post (basically using the date that a particular post was uploaded).
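
A "since" option could be approximated by filtering the crawled posts after the fact. This is only a sketch: `filterSince` is a hypothetical helper, not part of the tool, and it assumes each post carries an ISO 8601 `date` field like the one in the tool's JSON output.

```javascript
// Keep only posts strictly newer than `since` (an ISO date string or a Date).
// Assumes every post has an ISO 8601 `date` field.
function filterSince(posts, since) {
  const cutoff = new Date(since).getTime();
  if (Number.isNaN(cutoff)) {
    throw new Error(`Invalid "since" value: ${since}`);
  }
  return posts.filter((post) => new Date(post.date).getTime() > cutoff);
}

const posts = [
  { date: '2018-08-15T07:09:39.000Z' },
  { date: '2018-01-02T10:00:00.000Z' },
];
console.log(filterSince(posts, '2018-06-01')); // keeps only the 2018-08-15 post
```

Filtering after the crawl does not save any requests; stopping the crawl at the first post older than the cutoff would, but that depends on how the tool pages through posts.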

Script hangs on private profiles

If you try to crawl a private profile, the script hangs on Get list post. It should return an error when it comes across something unexpected instead of becoming nonresponsive!

Cheers 😄
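
One general way to turn a hang into an error is to race the slow step against a timer. The sketch below is an assumption about how that could look, not the tool's code; `withTimeout` and the step label are hypothetical names.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer so the process can exit.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage: a step that never finishes now fails with a readable error.
const neverSettles = new Promise(() => {});
withTimeout(neverSettles, 50, 'Get list post')
  .catch((err) => console.error(`✖ ${err.message}`));
```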

Reference missing ?

Dear Author, I appreciate that you took the time to write this script in JavaScript, and I believe it is your original work. However, I think it was inspired by https://github.com/timgrossmann/instagram-profilecrawl. If so, it is good practice to credit the repository that inspired you.

small thing

In file <api.js>, lines 95~97:

post.edge_media_to_caption.edges[0].node.text <-- this

Because <post.edge_media_to_caption.edges> may be empty, this needs a check, or the script stops here.
I added something like
mentions: (post.edge_media_to_caption.edges.length)?utils.getMentions(post.edge_media_to_caption.edges[0].node.text) : "",

Then it works.

Thank you
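
The same guard can be written once and reused for every caption-derived field. This is a minimal sketch; `getCaptionText` is a hypothetical helper, and only the `edge_media_to_caption.edges[0].node.text` path comes from the report above.

```javascript
// Return the caption text of a post, or '' when the post has no caption.
function getCaptionText(post) {
  const edges = (post.edge_media_to_caption && post.edge_media_to_caption.edges) || [];
  return edges.length > 0 ? edges[0].node.text : '';
}

const withCaption = { edge_media_to_caption: { edges: [{ node: { text: 'hi @friend' } }] } };
const withoutCaption = { edge_media_to_caption: { edges: [] } };

console.log(getCaptionText(withCaption)); // hi @friend
console.log(getCaptionText(withoutCaption) === ''); // true
```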

Do not target randomly generated classes - script error

Currently, to target elements, you use randomly generated class names that Instagram rerolls from time to time, like h1._rf3jb. This class no longer exists, which causes puppeteer to error and the whole script to die. Please target specific tags or sections; do not rely on classes.

Selenium install fails when pushing to Heroku

Selenium can't be installed when pushing to Heroku without additional configuration. Since I have only been using the API mode so far, I think simply disabling that option should work for me.

I'll try to do that and submit a PR for consideration 👍

✖ Cannot read property 'length' of null when no description

Hi @nacimgoura !
Your new commits seem to cause a side effect when I try your app on my own account, which has posts with no description:
$ instagram-profilecrawl _laurent_la
✔ Profile successfully loaded for _laurent_la!
✖ Cannot read property 'length' of null
Have a good day !

Tool is broken

Unable to crawl with Videos

Hi @nacimgoura ,

Thanks for your script. It works great except when there is a video in the feed.

Cheers

Laurent

Include comments in output

Something that would be nice to have. I'll do this myself when I get the chance, in case no one wants to go ahead until then 😄

does this tool still work?

Seeing the several issues and bugs, I was wondering whether this tool still works.

Kind regards

Is it possible to use from React Native?

Hello,

Is it possible to use this module from React Native? I want to crawl a user's profile without requiring them to log in.

It would be great if you could provide any documentation for this.

Unable to start selenium server!

Hi, I am trying to run the Selenium method (the API method runs without problems). I installed it beforehand, but when executing the example instagram-profilecrawl nacimgoura --method=selenium, I get the error ✖ Unable to start selenium server!. Any clues as to what could be wrong?

Is it possible to get location?

Thanks for the great tool! I am interested in getting location info with the API crawl. Are you aware of any tricks on the URL to pull location info for each media item as well?

numberLike should be numeric | without "" in json output

I want to use the data in RStudio and always have to convert the values to numeric after importing the JSON, so it would be good if numberLike were written without quotes ("...") in the output file.
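
A conversion step before writing the file would give numeric JSON values. The sketch below assumes the scraped counts are plain digits with optional thousands separators (not abbreviated forms like "10.5k"); `toCount` is a hypothetical helper, not part of the tool.

```javascript
// Convert a scraped count such as "1,234" to a number.
// Returns null when the value is missing or contains no digits.
function toCount(value) {
  if (value === null || value === undefined) return null;
  const digits = String(value).replace(/[^\d]/g, '');
  return digits === '' ? null : Number(digits);
}

console.log(JSON.stringify({ numberLike: toCount('1,234') })); // {"numberLike":1234}
```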

Usage Typo

Hi again @nacimgoura

It seems there is a typo in your usage example in the readme:
instagram-profile-crawl instead of instagram-profilecrawl

Cheers

No session ID for selenium

I'm getting an error when trying to use Selenium scraping.

I've tried this in ubuntu subsystem and in windows powershell. It worked before but for some reason is not working now.

I've even tried starting the selenium standalone server manually and it still gives me the same error.

× Timed out waiting for driver server to start.
Build info: version: '3.7.1', revision: '8a0099a', time: '2017-11-06T21:07:36.161Z'
System info: host: 'Bens-Desktop.localdomain', ip: '127.0.1.1', os.name: 'Linux', os.arch: 'amd64', os.version: '4.4.0-43-Microsoft', java.version: '1.8.0_151'
Driver info: driver.version: unknown
C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\fibers\future.js:313
                                                throw(ex);
                                                ^

Error: A session id is required for this command but wasn't found in the response payload
    at new RuntimeError (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\build\lib\utils\ErrorHandler.js:144:12)
    at RequestHandler.createOptions (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\build\lib\utils\RequestHandler.js:121:23)
    at RequestHandler.create (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\build\lib\utils\RequestHandler.js:209:43)
    at Object.url (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\build\lib\protocol\url.js:24:32)
    at Object.exec (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\build\lib\helpers\safeExecute.js:28:24)
    at Object.resolve (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\build\lib\webdriverio.js:191:29)
    at lastPromise.then.resolve.call.depth (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\build\lib\webdriverio.js:486:32)
    at _fulfilled (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\node_modules\q\q.js:854:54)
    at self.promiseDispatch.done (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\node_modules\q\q.js:883:30)
    at Promise.promise.promiseDispatch (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\node_modules\q\q.js:816:13)

Is there a specific node version that should be used? Not sure what could have changed to cause this?

numberComments is (often) not scraped correctly

numberComments is scraped correctly up to 26 comments. If a post has more than 26 comments, the number in the output JSON is still 26.

it can not run

/usr/local/node-v7.6.0/lib/node_modules/instagram-profilecrawl/index.js:85
...(await this.getInfoProfile()),
^^^
SyntaxError: Unexpected token ...
at Object.exports.runInThisContext (vm.js:73:16)
at Module._compile (module.js:543:28)
at Object.Module._extensions..js (module.js:580:10)
at Module.load (module.js:488:32)
at tryModuleLoad (module.js:447:12)
at Function.Module._load (module.js:439:3)
at Module.runMain (module.js:605:10)
at run (bootstrap_node.js:422:7)
at startup (bootstrap_node.js:143:9)
at bootstrap_node.js:537:3
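
The trace shows the parser rejecting `...` inside an object literal on Node v7.6.0. Object spread is enabled by default only from Node 8.3.0 onward, so upgrading Node is the likely fix. A small guard like the following sketch could fail early with a clearer message; `supportsObjectSpread` is a hypothetical name, not part of the tool, and the guard would have to run in a file that does not itself use object spread.

```javascript
// Return true when the given Node version string is at least 8.3.0,
// the first release with object spread enabled by default.
function supportsObjectSpread(version) {
  const [major, minor] = version.replace(/^v/, '').split('.').map(Number);
  return major > 8 || (major === 8 && minor >= 3);
}

if (!supportsObjectSpread(process.versions.node)) {
  console.error(`Node ${process.versions.node} is too old; please use Node 8.3.0 or newer.`);
  process.exit(1);
}
```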

Is it possible to crawl a private account?

I'm just curious about this. Is it possible to crawl a private account? I've tried some private accounts, and the program only returns basic information, not posts. Is this possible if I have an Instagram account that follows the private account?

Check if selenium is installed

Check whether Selenium is installed before launching the crawl; if it is not, install it.

support phantomjs

Hi, I am using your project, but I am in China, so I can't install chromedriver. Could you tell me how to make it support PhantomJS?

Crawl all comments of a specific post

Does the package support crawling all the comments of a post?

stays in Init API

Make it easier to call programmatically

It would be great if we could call the script using an attribute with the method choice (api or selenium) instead of selecting with arrows and pressing enter. Something like:

instagram-profilecrawl --method="api" or instagram-profilecrawl --method="selenium"

Another good thing to have would be a way to choose the filename for the result, perhaps with another parameter. It would also be nice to have the default name be "profile-<name>" instead of "profile <name>", because spaces in filenames can be problematic.

These changes would make it easier to use this inside scripts instead of just manually through the console!

Cheers 😄
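
The request could be met with a small argument parser along these lines. This is a sketch under assumptions: `parseArgs`, the `--output` flag, the default method, and the `.json` extension are all hypothetical; only `--method` with the values api and selenium and the "profile-<name>" default come from the request above.

```javascript
// Parse "--key=value" options; everything else is treated as a profile name.
function parseArgs(argv) {
  const options = { method: 'api', output: null }; // default method is an assumption
  const names = [];
  for (const arg of argv) {
    const match = /^--([^=]+)=(.*)$/.exec(arg);
    if (match) {
      options[match[1]] = match[2];
    } else {
      names.push(arg);
    }
  }
  if (!['api', 'selenium'].includes(options.method)) {
    throw new Error(`Unknown method: ${options.method}`);
  }
  // Default file name uses a hyphen rather than a space.
  const output = options.output || `profile-${names[0]}.json`;
  return { names, method: options.method, output };
}

// In a real CLI this would be parseArgs(process.argv.slice(2)).
console.log(parseArgs(['nacimgoura', '--method=selenium']));
```

The shell removes the quotes in --method="api" before Node sees the argument, so the parser only has to handle the bare value.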

Problem paging (or so it seems)

I'm getting an issue and it looks like this

[vernacchia:~/Coding/testing]$ node node_modules/instagram-profilecrawl/index.js
✔ Profile successfully loaded for vernak2539!
✖ An element could not be located on the page using the given search parameters.

It gets to around the 20th step: ⠇ Advancement of the first step : 20/203

Any help would be great.

I had a little trouble installing it globally (a problem starting Selenium), so I had to install it locally and modify the index file, hence my command-line approach.