instagram-profilecrawl's Introduction

💫 About Me:

🔭 I’m currently working on my skills.
🌱 I’m mainly a Golang/NodeJS/Deno developer, but I'm in the process of learning more about Rust and blockchain development.
💬 I’m looking for help with exploring new technologies like blockchain and AI.
📫 How to reach me: Email - [email protected]
💚 I love photography and hiking
⚡ Fun fact: My favourite dish is ramen

💻 Tech Stack:

Go TypeScript Next JS React JavaScript Solidity PHP Rust AWS Netlify NodeJS TailwindCSS Nginx MongoDB Postgres LINUX ElasticSearch Notion Raspberry Pi HTML5 CSS3

instagram-profilecrawl's People

Contributors

alejandronanez, blackbyte-pl, greenkeeper[bot], marcospgp, nacimgoura, thosuperman

instagram-profilecrawl's Issues

Action required: Greenkeeper could not be activated 🚨

🚨 You need to enable Continuous Integration on all branches of this repository. 🚨

To enable Greenkeeper, you need to make sure that a commit status is reported on all branches. This is required by Greenkeeper because we are using your CI build statuses to figure out when to notify you about breaking changes.

Since we did not receive a CI status on the greenkeeper/initial branch, we assume that you still need to configure it.

If you have already set up a CI for this repository, you might need to check your configuration. Make sure it will run on all new branches. If you don’t want it to run on every branch, you can whitelist branches starting with greenkeeper/.

We recommend using Travis CI, but Greenkeeper will work with every other CI service as well.

Once you have installed CI on this repository, you’ll need to re-trigger Greenkeeper’s initial Pull Request. To do this, please delete the greenkeeper/initial branch in this repository, and then remove and re-add this repository to the Greenkeeper integration’s white list on GitHub. You’ll find this list on your repo or organization’s settings page, under Installed GitHub Apps.

crawl only shared_data

Well, this is more of a suggestion than a "bug" report.

I think you shouldn't load the full Instagram page.
First, the element classes (inside elements.json) change fairly often (I have no idea how often).

So my suggestion is to load just _sharedData from the profile page.
Don't load scripts, images, or styles. It's way faster.

Something like this:

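        this.page = await this.browser.newPage();
        // Skip images, styles, fonts and external scripts to speed up loading
        await this.page.setRequestInterception(true);
        this.page.on('request', (request) => {
            if (['image', 'stylesheet', 'font', 'script'].includes(request.resourceType())) {
                request.abort();
            } else {
                request.continue();
            }
        });
        await this.page.setExtraHTTPHeaders({
            'Accept-Language': 'pt-BR'
        });
        await this.page.goto('https://instagram.com/' + username, {
            waitUntil: 'networkidle0'
        });
        // `document` only exists in the page context, so read the inline script via evaluate()
        const sharedData = await this.page.evaluate(() => {
            const script = Array.from(document.querySelectorAll('script'))
                .find((s) => s.textContent.includes('window._sharedData'));
            return script ? script.textContent : '';
        });
        // Capture the JSON assigned to window._sharedData; null if the page layout changed
        const match = /window\._sharedData = (.*);/.exec(sharedData);
        const profileData = match ? JSON.parse(match[1]) : null;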

/* Maybe here you could use your version 1.0 "parseData" function from here ? */
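The extraction step in the suggestion above can also be tried without a browser, on an HTML string. The following is only a minimal sketch: `extractSharedData` is a hypothetical helper name, and the sample markup is made up for illustration, so the real payload shape may differ.

```javascript
// Hypothetical helper: pulls the JSON assigned to window._sharedData out of raw HTML.
function extractSharedData(html) {
    // Non-greedy match up to the closing </script> so later scripts are ignored.
    const match = /window\._sharedData\s*=\s*(\{[\s\S]*?\});\s*<\/script>/.exec(html);
    if (!match) {
        return null; // marker not found: the layout changed or another page was served
    }
    try {
        return JSON.parse(match[1]);
    } catch (err) {
        return null; // payload was present but is not valid JSON
    }
}

// Made-up sample markup, for illustration only.
const sampleHtml = '<html><body><script>window._sharedData = ' +
    '{"entry_data":{"ProfilePage":[{"user":"demo"}]}};</script></body></html>';

const data = extractSharedData(sampleHtml);
console.log(data.entry_data.ProfilePage[0].user); // demo
```

Returning null instead of throwing lets the caller report a clear "profile data not found" error when the page no longer contains the marker.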

Tests and Issues

Gave it a try today, the Selenium method dies with an error, every time.
The API Method works, but all posts are marked with the 1970-01-17 date.
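One plausible cause of dates landing in January 1970, not confirmed by this report, is a Unix timestamp in seconds being passed to `Date` as if it were milliseconds. The sketch below shows the effect and the usual fix; the function name and the sample value are assumptions.

```javascript
// Assumption: the API reports the post time as Unix seconds.
function postDateFromSeconds(takenAtSeconds) {
    // Date expects milliseconds, so scale seconds by 1000.
    return new Date(takenAtSeconds * 1000).toISOString();
}

const takenAt = 1450000000; // example value, not taken from the report

// Seconds misread as milliseconds land in January 1970.
console.log(new Date(takenAt).toISOString());  // 1970-01-17T18:46:40.000Z

// Scaled correctly, the same value is a date in 2015.
console.log(postDateFromSeconds(takenAt));     // 2015-12-13T09:46:40.000Z
```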

Code cleanup

Hello @nacimgoura, thanks for creating this tool!

How do you feel about adding prettier + editor config to the project? If that sounds good to you, I can create a PR.

Thanks!

Some fields changed

Hi,

thanks for providing this open source code. It's a pleasure to work with :-)

I guess Instagram changed their UI, so some fields are not scraped correctly.

Here is my element.json which works in my case:

{
	"notExist": ".dialog-404",
	"dismissInvitationLogin": ".coreSpriteDismissLarge",
	"alias": " div.-vDIg > h1", //changed
	"username": "h1",
	"isOfficial": "span.coreSpriteVerifiedBadge",
	"descriptionProfile": "div.-vDIg > span", //changed
	"urlImgProfile": "header img",
	"website": "div.-vDIg > a",
	"isPrivate": "h2",
	"numberPosts": "ul li:first-child a span", //changed
	"numberFollowers": "ul li:nth-child(2) span",
	"numberFollowing": "ul li:nth-child(3) span",
	"listPost": "article div > a",
	"numberLike": "article section div span > span",
	"numberComments": "article li",
	"numberView": "views",
	"urlImage": "article div > img",
	"video": "video",
	"description": "article ul > li:first-child",
	"tags": "span",
	"mentions": "mentions",
	"date": "time",
	"multipleImage": ".coreSpriteRightChevron"
}

Additionally I had an issue with puppeteer but the fix mentioned in puppeteer/puppeteer#2746 solved my problem.

No treatment when posts > 1000

Hi @nacimgoura !
When the number of posts is greater than 1000, the script stops without processing the posts:
instagram-profilecrawl jamieoliver
✔ Profile successfully loaded for jamieoliver!
✔ File created with success!

got null or '' image url if multipleImage is true

{
      "urlImage": [
        null,
        "",
        "",
        ""
      ],
      "isVideo": true,
      "video": "https://scontent-hkg3-2.cdninstagram.com/vp/ea15e7de012f048c65133d0ba04161a4/5B849A74/t50.2886-16/37996294_326760741402685_7469339062971549339_n.mp4",
      "numberLike": null,
      "numberView": null,
      "numberComments": 5,
      "description": "Click video for sound\nVideo\n",
      "tags": null,
      "mentions": [],
      "date": "2018-08-15T07:09:39.000Z",
      "multipleImage": true
    }

and this is a post with 2 videos.

Tool is broken

I updated to the latest version and cannot successfully crawl a profile anymore.

Can anyone try running the crawler with --method=api on profile marsquest and see if it works?

Improve readme

  • What unit is --limit specified in? An integer representing a number of posts? A timestamp representing a date?
  • What is headless mode?

Add ability to set a "since" parameter

For people who want to use this utility more than once on the same instagram account, it would be awesome to be able to set a parameter that would tell the script to only crawl posts that are newer than a given date or post (basically using the date that a particular post was uploaded).
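A date-based version of this request could be a simple filter over the crawled posts. This is a minimal sketch, assuming each post carries an ISO 8601 `date` string like the one in the tool's output; `filterSince` and the sample posts are hypothetical.

```javascript
// Hypothetical helper: keep only posts strictly newer than `since`.
function filterSince(posts, since) {
    const cutoff = new Date(since).getTime();
    if (Number.isNaN(cutoff)) {
        throw new Error('Invalid "since" date: ' + since);
    }
    return posts.filter((post) => new Date(post.date).getTime() > cutoff);
}

// Made-up posts, for illustration only.
const posts = [
    { url: 'a', date: '2018-08-15T07:09:39.000Z' },
    { url: 'b', date: '2018-07-01T12:00:00.000Z' },
];

console.log(filterSince(posts, '2018-08-01').map((p) => p.url)); // [ 'a' ]
```

Filtering after the crawl does not save any requests. Stopping the crawl at the first post older than the cutoff would, but that depends on posts arriving newest first.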

Script hangs on private profiles

If you try to crawl a private profile, the script hangs on Get list post. It should return an error when it comes across something unexpected instead of becoming nonresponsive!

Cheers 😄

small thing

In api.js, lines 95~97:

post.edge_media_to_caption.edges[0].node.text <-- this

Because post.edge_media_to_caption.edges may be empty, this needs a check, or the script stops here.
I added something like:
mentions: (post.edge_media_to_caption.edges.length) ? utils.getMentions(post.edge_media_to_caption.edges[0].node.text) : "",

Then it works.

thank you
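The same guard could be pulled into one small function so every field that reads the caption shares it. This is only a sketch: `getCaptionText` is a hypothetical name, and the sample posts are made up to mirror the structure quoted above.

```javascript
// Hypothetical helper: returns the caption text, or '' when the post has no caption.
function getCaptionText(post) {
    const caption = post.edge_media_to_caption;
    const edges = (caption && caption.edges) || [];
    if (edges.length === 0 || !edges[0].node) {
        return '';
    }
    return edges[0].node.text || '';
}

// Made-up posts, for illustration only.
const withCaption = { edge_media_to_caption: { edges: [{ node: { text: 'hello @friend' } }] } };
const withoutCaption = { edge_media_to_caption: { edges: [] } };

console.log(getCaptionText(withCaption));    // hello @friend
console.log(getCaptionText(withoutCaption)); // (empty string)
```

The result can then be passed to whatever extracts mentions or tags, which always receives a string.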

Do not target random generated classes - script error

Currently, to target elements you use randomly generated class names that Instagram rerolls from time to time, such as h1._rf3jb. This class no longer exists, which causes puppeteer to throw an error and the whole script to die. Please target specific tags or sections instead of relying on classes.

Selenium install fails when pushing to Heroku

Selenium can't be installed when pushing to Heroku without additional configuration. Since I have only been using the API mode so far, I think simply disabling that option should work for me.

I'll try to do that and submit a PR for consideration 👍

Include comments in output

Something that would be nice to have. I'll do this myself when I get the chance, in case no one wants to go ahead until then 😄

Is it possible to use from React Native?

Hello,

Is it possible to use this module from React Native? I want to crawl a user's profile without requiring them to log in.

It would be great if you can provide any document for this.

Unable to start selenium server!

Hi, I am trying to run the selenium method (the API method runs without problems). I installed it beforehand, but when executing the example instagram-profilecrawl nacimgoura --method=selenium, I get the error ✖ Unable to start selenium server!. Any clues as to what could be wrong?

Is it possible to get location?

Thanks for the great tool! I am interested in the location info using the API crawl. Are you aware of any tricks on the URL to pull location info for each media item as well?

Usage Typo

Hi again @nacimgoura

It seems there is a typo on your usage example on the readme :
instagram-profile-crawl instead of instagram-profilecrawl

Cheers

No session ID for selenium

I'm getting an error when trying to use selenium scraping.

I've tried this in ubuntu subsystem and in windows powershell. It worked before but for some reason is not working now.

I've even tried starting the selenium standalone server manually and it still gives me the same error.

× Timed out waiting for driver server to start.
Build info: version: '3.7.1', revision: '8a0099a', time: '2017-11-06T21:07:36.161Z'
System info: host: 'Bens-Desktop.localdomain', ip: '127.0.1.1', os.name: 'Linux', os.arch: 'amd64', os.version: '4.4.0-43-Microsoft', java.version: '1.8.0_151'
Driver info: driver.version: unknown
C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\fibers\future.js:313
                                                throw(ex);
                                                ^

Error: A session id is required for this command but wasn't found in the response payload
    at new RuntimeError (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\build\lib\utils\ErrorHandler.js:144:12)
    at RequestHandler.createOptions (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\build\lib\utils\RequestHandler.js:121:23)
    at RequestHandler.create (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\build\lib\utils\RequestHandler.js:209:43)
    at Object.url (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\build\lib\protocol\url.js:24:32)
    at Object.exec (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\build\lib\helpers\safeExecute.js:28:24)
    at Object.resolve (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\build\lib\webdriverio.js:191:29)
    at lastPromise.then.resolve.call.depth (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\build\lib\webdriverio.js:486:32)
    at _fulfilled (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\node_modules\q\q.js:854:54)
    at self.promiseDispatch.done (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\node_modules\q\q.js:883:30)
    at Promise.promise.promiseDispatch (C:\Users\Benjamin\AppData\Roaming\nvm\v8.5.0\node_modules\instagram-profilecrawl\node_modules\webdriverio\node_modules\q\q.js:816:13)

Is there a specific node version that should be used? Not sure what could have changed to cause this?

it can not run

/usr/local/node-v7.6.0/lib/node_modules/instagram-profilecrawl/index.js:85
...(await this.getInfoProfile()),
^^^
SyntaxError: Unexpected token ...
at Object.exports.runInThisContext (vm.js:73:16)
at Module._compile (module.js:543:28)
at Object.Module._extensions..js (module.js:580:10)
at Module.load (module.js:488:32)
at tryModuleLoad (module.js:447:12)
at Function.Module._load (module.js:439:3)
at Module.runMain (module.js:605:10)
at run (bootstrap_node.js:422:7)
at startup (bootstrap_node.js:143:9)
at bootstrap_node.js:537:3

Is it possible to crawl from private account?

I'm just curious about this. Is it possible to crawl a private account? I've tried some private accounts, and the program returns only basic information, not the posts. Is this possible even when I have an Instagram account that follows the private account?

support phantomjs

Hi, I am using your project, but I am in China, so I can't install chromedriver. Could you tell me how to make it support phantomjs?

Make it easier to call programmatically

It would be great if we could call the script using an attribute with the method choice (api or selenium) instead of selecting with arrows and pressing enter. Something like:

instagram-profilecrawl --method="api" or instagram-profilecrawl --method="selenium"

Another good thing to have would be a way to choose the filename for the result, perhaps with another parameter. It would also be nice to have the default name be "profile-<name>" instead of "profile <name>", because spaces in filenames can be problematic.

These changes would make it easier to use this inside scripts instead of just manually through the console!

Cheers 😄
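The two requests above (a `--method` flag and a space-free default filename) could look roughly like this. It is a sketch under assumptions, not the tool's actual implementation: `parseCliArgs`, `defaultFileName`, and the `--output` flag are hypothetical names introduced for illustration.

```javascript
// Hypothetical argument parser: reads --method=<api|selenium> and --output=<file>.
function parseCliArgs(argv) {
    const options = { method: null, output: null, profiles: [] };
    for (const arg of argv) {
        const match = /^--(method|output)=(.*)$/.exec(arg);
        if (match) {
            // Strip optional surrounding double quotes, e.g. --method="api".
            options[match[1]] = match[2].replace(/^"|"$/g, '');
        } else if (!arg.startsWith('--')) {
            options.profiles.push(arg); // anything else is treated as a profile name
        }
    }
    if (options.method !== null && !['api', 'selenium'].includes(options.method)) {
        throw new Error('Unknown method: ' + options.method);
    }
    return options;
}

// Default result name without spaces, as suggested in the issue.
function defaultFileName(profileName) {
    return 'profile-' + profileName.replace(/\s+/g, '-');
}

const options = parseCliArgs(['nacimgoura', '--method="api"']);
console.log(options.method);                       // api
console.log(defaultFileName(options.profiles[0])); // profile-nacimgoura
```

In a real command-line entry point the input would be `process.argv.slice(2)`, and an unset `method` could fall back to the interactive prompt.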

Problem paging (or so it seems)

I'm getting an issue and it looks like this

[vernacchia:~/Coding/testing]$ node node_modules/instagram-profilecrawl/index.js
✔ Profile successfully loaded for vernak2539!
✖ An element could not be located on the page using the given search parameters.

It gets to around the 20th step: ⠇ Advancement of the first step : 20/203

Any help would be great.

I had a little trouble installing it globally (a problem starting selenium), so I had to install it locally and modify the index file, hence my command-line approach.
