twitter-scraper

A tool for scraping tweet ids from the Twitter website.

Explanation

Tweets collected from Twitter's APIs include machine-readable metadata (in JSON) that may not be available from the website. For archiving or analyzing tweets, this metadata is potentially significant.

Unfortunately, Twitter's APIs don't support getting a comprehensive set of a user's tweets. Twitter's statuses/user_timeline REST API method only allows collecting the last 3,200 tweets. Similarly, Twitter's Search API will only provide tweets from the last 6-9 days.

twitter-scraper attempts to support getting a comprehensive set of a user's tweets (with optional date constraints). It accomplishes this by making requests to Twitter's website search (which is different from the Search API) and extracting tweet ids. These tweet ids can then be passed to twarc, which retrieves the full tweets from Twitter's REST API (aka "hydrating").
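To illustrate the id-extraction step, here is a minimal sketch that pulls tweet ids out of a chunk of page HTML. It is not the code twitter-scraper uses: the function name `extract_tweet_ids`, the sample HTML, and the assumption that tweet permalinks have the form `/<screen_name>/status/<id>` are all illustrative.

```python
import re

# Assumption: tweet permalinks look like /<screen_name>/status/<numeric id>.
STATUS_LINK = re.compile(r"/\w{1,15}/status/(\d+)")


def extract_tweet_ids(html):
    """Return the unique tweet ids found in html, in first-seen order."""
    seen = set()
    ids = []
    for tweet_id in STATUS_LINK.findall(html):
        if tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


# Hypothetical markup, only for demonstration.
sample = """
<a href="/jack/status/20">first</a>
<a href="/jack/status/20">duplicate link to the same tweet</a>
<a href="/biz/status/21">another tweet</a>
"""

for tweet_id in extract_tweet_ids(sample):
    print(tweet_id)
```

The ids are kept as strings so they can be written one per line to a file for hydrating.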

Installation

  1. Install Python, Pip, Git, and Chrome.

  2. Clone this repo: git clone https://github.com/justinlittman/twitter-scraper.git

  3. Install ChromeDriver. On a Mac, this can be done with brew install chromedriver. On Ubuntu:

     wget http://chromedriver.storage.googleapis.com/2.24/chromedriver_linux64.zip
     unzip chromedriver_linux64.zip
     sudo mv chromedriver /usr/bin/
    
  4. Install Selenium: pip install selenium

  5. Install Twarc: pip install twarc
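
Once the steps above are done, a quick check like the following can confirm that the command-line tools and Python packages are visible. This is a minimal sketch and not part of twitter-scraper. The function name `check_prerequisites` is hypothetical, and it only tests that each name can be found, not that the installed versions are compatible.

```python
import importlib.util
import shutil


def check_prerequisites(executables=("git", "chromedriver"),
                        modules=("selenium", "twarc")):
    """Map each prerequisite name to True if it was found, else False."""
    status = {}
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH.
        status[exe] = shutil.which(exe) is not None
    for mod in modules:
        # find_spec returns None when a top-level module is not installed.
        status[mod] = importlib.util.find_spec(mod) is not None
    return status


for name, found in check_prerequisites().items():
    print("{:<14}{}".format(name, "found" if found else "MISSING"))
```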

Usage

    python twitter_scraper.py -h
    usage: twitter_scraper.py [-h] [--since SINCE] [--until UNTIL]
                              [--exclude-retweets] [--delta-days DELTA_DAYS]
                              [--wait-secs WAIT_SECS] [--debug]
                              screen_name
    
    positional arguments:
      screen_name
    
    optional arguments:
      -h, --help            show this help message and exit
      --since SINCE         Tweets since this date. Default is 2011-04-05.
      --until UNTIL         Tweets until this date. Default is today.
      --exclude-retweets
      --delta-days DELTA_DAYS
                            Number of days to include in each search.
      --wait-secs WAIT_SECS
                            Number of seconds to wait between each scroll.
      --debug
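
The `--since`, `--until`, and `--delta-days` options suggest that the date range is searched in smaller windows. The following is a sketch of how such windows and search queries could be built; it is an illustration under assumptions, not the tool's actual code. The function names are hypothetical, the 30-day window is an arbitrary example value (the help text does not state the default), and the end of each window is assumed to be exclusive. `from:`, `since:`, and `until:` are Twitter search operators.

```python
from datetime import date, timedelta


def search_windows(since, until, delta_days):
    """Yield (start, end) date pairs covering since..until in delta_days steps."""
    if delta_days < 1:
        raise ValueError("delta_days must be at least 1")
    start = since
    while start < until:
        # The last window is shortened so it never runs past `until`.
        end = min(start + timedelta(days=delta_days), until)
        yield start, end
        start = end


def build_query(screen_name, start, end):
    """Build a search query for one window (assumed operator usage)."""
    return "from:{} since:{} until:{}".format(
        screen_name.lstrip("@"), start.isoformat(), end.isoformat()
    )


# Example values only; 30 is a hypothetical window size.
for start, end in search_windows(date(2016, 1, 1), date(2016, 4, 1), 30):
    print(build_query("@realDonaldTrump", start, end))
```

Smaller windows mean more searches but fewer results to scroll through per search.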

Running

To collect @realDonaldTrump's tweets between January 1, 2016 and April 1, 2016:

  1. Run twitter_scraper and write the tweet ids to a file.

     python twitter_scraper.py @realDonaldTrump --since=2016-01-01 --until=2016-04-01 > tweet_ids.txt
    

    Leave your system alone while twitter_scraper is running. I received inconsistent results when I did other work on my system during a scrape. Better yet, use a VM.

    Tip: You can get the date that a user joined Twitter from the user's account page.

    Tip: Timestamp the tweet id file by using > tweet_ids_$(date +"%Y%m%d%H%M").txt

    Tip: To account for the inconsistency, you may want to run multiple scrapes. They can be merged with:

     cat tweet_ids_*.txt | sort -u > tweet_ids.txt
    
  2. Hydrate the tweet ids with Twarc and write to a file. You will need to provide Twarc with a set of Twitter API keys. For more information, see Twarc's documentation.

     twarc.py --hydrate tweet_ids.txt > tweets.json
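
The merge of multiple scrapes described in step 1 can also be done in Python, which sorts the ids numerically rather than as text. This is a minimal sketch; the function names `merge_tweet_ids` and `write_tweet_ids` are hypothetical, and the file names in the usage note follow the examples above.

```python
import glob


def merge_tweet_ids(paths):
    """Return the unique tweet ids from all files, sorted numerically."""
    ids = set()
    for path in paths:
        with open(path) as f:
            for line in f:
                line = line.strip()
                # Skip blank lines and anything that is not a plain number.
                if line.isdigit():
                    ids.add(int(line))
    return sorted(ids)


def write_tweet_ids(ids, path):
    """Write one tweet id per line, the format used for hydrating."""
    with open(path, "w") as out:
        for tweet_id in ids:
            out.write("{}\n".format(tweet_id))
```

For example, `write_tweet_ids(merge_tweet_ids(glob.glob("tweet_ids_*.txt")), "tweet_ids.txt")` would merge all timestamped scrape files in the current directory.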
    

Acknowledgements

This work was inspired by the Trump Twitter Archive.

I am also appreciative (once again) of Ed Summers' twarc.
