
twitterscraper's Introduction

TwitterScraper

Description

Twitter's API limits you to querying a user's most recent 3200 tweets. This is a pain in the ass. However, we can circumvent this limit with Selenium and some web scraping.

We can query a user's entire time on Twitter, finding the IDs for each of their tweets. From there, we can use the tweepy API to query the complete metadata associated with each tweet. You can adjust which metadata are collected by changing the variable METADATA_LIST at the top of scrape.py. Personally, I was just collecting text to train a model, so I only cared about the full_text field in addition to whether the tweet was a retweet.

I've included a list of all available tweet attributes at the top of scrape.py so that you can adjust things as you wish.
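To picture the metadata step, here is a sketch under assumptions, not scrape.py's actual code; the credential strings stand in for whatever api_key.py defines:

import tweepy

METADATA_LIST = ["full_text", "retweeted"]  # fields to keep per tweet

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

def hydrate(tweet_ids):
    """Turn scraped tweet IDs into full metadata, 100 IDs per API call."""
    records = []
    for i in range(0, len(tweet_ids), 100):
        # Tweepy 3.x method name; Tweepy 4.x renamed it to lookup_statuses()
        batch = api.statuses_lookup(tweet_ids[i:i + 100], tweet_mode="extended")
        records.extend({f: getattr(s, f, None) for f in METADATA_LIST}
                       for s in batch)
    return records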

NOTE: This scraper will notice if a user has fewer than 3200 tweets. In that case, it will do a "quickscrape" to grab all available tweets at once (significantly faster). It will store them in exactly the same manner as a manual scrape.
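You can see which path an account will take by checking its status count; this reuses the authenticated api object from the sketch above and is again an illustration of the described behavior, not the repo's code:

user = api.get_user(screen_name="phillipcompeau")
if user.statuses_count < 3200:
    print("quickscrape: all tweets are reachable in one pass")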

Requirements (or rather, what I used)

  • python3
  • selenium, plus a matching WebDriver binary (chromedriver by default; see Troubleshooting)
  • tweepy
  • beautifulsoup4

Example:

I'll run the script twice on one of my advisors. By default, the scraper starts from when the user created their Twitter account. I've chosen to look at a 1-year window, scraping at two-week intervals. I then go from the beginning of 2019 until the present day, at 1-week intervals. The scraped tweets are stored in a JSON file named after the Twitter user's handle.

$ ./scrape.py --help
usage: python3 scrape.py [options]

scrape.py - Twitter Scraping Tool

optional arguments:
  -h, --help            show this help message and exit
  -u USERNAME, --username USERNAME
                        Scrape this user's Tweets
  --since SINCE         Get Tweets after this date (Example: 2010-01-01).
  --until UNTIL         Get Tweets before this date (Example: 2018-12-07).
  --by BY               Scrape this many days at a time
  --delay DELAY         Time given to load a page before scraping it (seconds)
  --debug               Debug mode. Shows Selenium at work + additional logging

$ ./scrape.py -u phillipcompeau --by 14 --since 2018-01-01 --until 2019-01-01 
[ scraping user @phillipcompeau... ]
[ 1156 existing tweets in phillipcompeau.json ]
[ searching for tweets... ]
[ found 254 new tweets ]
[ retrieving new tweets (estimated time: 18 seconds)... ]
- batch 1 of 3
- batch 2 of 3
- batch 3 of 3
[ finished scraping ]
[ stored tweets in phillipcompeau.json ]

$ ./scrape.py -u phillipcompeau --since 2019-01-01 --by 7
[ scraping user @phillipcompeau... ]
[ 1410 existing tweets in phillipcompeau.json ]
[ searching for tweets... ]
[ found 541 new tweets ]
[ retrieving new tweets (estimated time: 36 seconds)... ]
- batch 1 of 6
- batch 2 of 6
- batch 3 of 6
- batch 4 of 6
- batch 5 of 6
- batch 6 of 6
[ finished scraping ]
[ stored tweets in phillipcompeau.json ]

$ ./scrape.py -u realwoofy
[ scraping user @realwoofy... ]
[ 149 existing tweets in realwoofy.json ]
[ searching for tweets... ]
[ user has fewer than 3200 tweets, conducting quickscrape... ]
[ found 3 new tweets ]
[ finished scraping ]
[ stored tweets in realwoofy.json ]

Using the Scraper

  • run python3 scrape.py with the arguments you desire. Try ./scrape.py --help for all options.
    • -u followed by the username [required]
    • --since followed by a date string, e.g., (2017-01-01). Defaults to when the user created their Twitter account
    • --until followed by a date string, e.g., (2018-01-01). Defaults to the current day
    • --by followed by the number of days to scrape at once (default: 7)
      • If someone tweets dozens of times a day, it might be better to use a lower number
    • --delay followed by an integer: the number of seconds to wait for each page to load before scraping it
      • if your internet is slow, set this higher (default: 3 seconds)
    • --debug. This disables headless mode on the WebDriver so that you can watch it scrape. Useful for assessing why it's unable to find tweets.
  • a browser window will pop up and begin scraping
  • when the browser window closes, metadata collection begins for all new tweets
  • when collection finishes, it dumps all the data to a .json file named after the Twitter handle
    • don't worry about running two scrapes that overlap in time; only new tweets are retrieved (see the sketch below)
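The de-duplication in the last bullet might look something like this; the JSON layout shown here (a list of objects with an "id" field) is an assumption for illustration, not necessarily the file's actual schema:

import json

# IDs produced by the Selenium pass (placeholder values)
found_ids = ["1090000000000000001", "1090000000000000002"]

# load IDs already stored from previous scrapes
with open("phillipcompeau.json") as f:
    existing_ids = {tweet["id"] for tweet in json.load(f)}

# only hydrate and store tweets we haven't seen before
new_ids = [i for i in found_ids if i not in existing_ids]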

Troubleshooting

  • do you get a driver error when you try to execute the script?
    • make sure your browser is up to date and that your driver version matches your browser version
    • you can also open scrape.py and change the driver to use Chrome() or Firefox() (see the sketch after this list)
  • does the scraper seem to be missing tweets that you know should be there?
    • try increasing the --delay parameter; it likely isn't waiting long enough for everything to load
    • try decreasing the --by parameter; there are likely too many tweets showing up on certain days
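For the driver swap mentioned above, the change is a one-line edit; a minimal sketch, assuming the matching driver binary (chromedriver or geckodriver) is on your PATH:

from selenium import webdriver

# driver = webdriver.Chrome()   # the default used by scrape.py
driver = webdriver.Firefox()    # swap in Firefox if chromedriver keeps failing
driver.get("https://twitter.com")
driver.quit()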

Twitter API credentials
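The scraper authenticates through tweepy, so you need your own credentials: create a Twitter developer account and fill your keys into api_key.py (the issues below mention copying the shipped "(example)" file and filling it in). A hypothetical layout; the variable names here are assumptions, not necessarily the file's real ones:

# api_key.py (hypothetical field names)
CONSUMER_KEY = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"
ACCESS_TOKEN = "your-access-token"
ACCESS_TOKEN_SECRET = "your-access-token-secret"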

twitterscraper's People

Contributors

kylefmohr, matthewwolff


twitterscraper's Issues

AttributeError: 'NoneType' object has no attribute 'find_all'

from twitterscraper import query_tweets
import datetime as dt
import pandas as pd

begin_date = dt.date(2023, 4, 2)
begin_fin = dt.date(2023, 5, 2)
limits = 1000
lang = 'english'

user = 'elonmusk'

tweets = query_tweets("elonmusk", begindate=begin_date, enddate=begin_fin, limit=limits, lang=lang)

After executing this code I get this error:

Traceback (most recent call last):
  File "C:\Users\HP Probook\PycharmProjects\firstproject\TweetsSraper.py", line 23, in <module>
    from twitterscraper import query_tweets
  File "C:\Users\HP Probook\PycharmProjects\firstproject\venv\lib\site-packages\twitterscraper\__init__.py", line 13, in <module>
    from twitterscraper.query import query_tweets
  File "C:\Users\HP Probook\PycharmProjects\firstproject\venv\lib\site-packages\twitterscraper\query.py", line 76, in <module>
    proxies = get_proxies()
  File "C:\Users\HP Probook\PycharmProjects\firstproject\venv\lib\site-packages\twitterscraper\query.py", line 49, in get_proxies
    list_tr = table.find_all('tr')
AttributeError: 'NoneType' object has no attribute 'find_all'

TypeError: 'NoneType' object is not subscriptable

python3 ./scrape.py -u jack

Traceback (most recent call last):
  File "./scrape.py", line 276, in <module>
    begin = datetime.strptime(args.since, DATE_FORMAT) if args.since else get_join_date(args.username)
  File "./scrape.py", line 256, in get_join_date
    date_string = soup.find("span", {"class": "ProfileHeaderCard-joinDateText"})["title"].split(" - ")[1]
TypeError: 'NoneType' object is not subscriptable

No Logging Currently

TODO:

The debug mode could be augmented by adding more logging throughout the code and letting the debug flag set the verbosity of the output; a sketch follows below.
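A minimal sketch of that idea with the standard logging module (an assumption about the approach; the scraper currently just prints):

import logging

def configure_logging(debug: bool) -> None:
    # --debug raises verbosity to DEBUG; normal runs stay at INFO
    logging.basicConfig(
        level=logging.DEBUG if debug else logging.INFO,
        format="[ %(levelname)s ] %(message)s",
    )

configure_logging(debug=True)
logging.debug("searching window 2018-01-01 .. 2018-01-15")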

It works for one profile but it doesn't work for another

Running ./scrape.py -u uaitabao works perfectly, but when I run ./scrape.py -u stakehighroller it only shows the text in the screenshot below and nothing happens.

Can you help me?

Note: I'm running from Visual Studio Code.

(screenshot of the stalled prompt)

Any reason you do it month-by-month?

Are there any advantages to doing it month-by-month? Is this to bypass certain scrolling limits in Twitter search? I heard that there's something like that, but I have not experienced it myself.

What level of access is required?

Hey, thanks for the project!
The free subscription tokens invoke the following error:
You currently have access to Twitter API v2 endpoints and limited v1.1 endpoints only. If you need access to this endpoint, you may need a different access level.
Do you really need a higher access level?

Error when attempting to scrape Tweets

This is the output when I attempt to scrape Tweets:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/common/service.py", line 76, in start
    stdin=PIPE)
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'chromedriver': 'chromedriver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./scrape.py", line 204, in <module>
    user.scrape(begin, end, args.by, args.delay)
  File "./scrape.py", line 83, in scrape
    self.__find_tweets(start, end, by, loading_delay)
  File "./scrape.py", line 101, in __find_tweets
    with webdriver.Chrome() as driver:  # options are Chrome(), Firefox(), Safari()
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/chrome/webdriver.py", line 73, in __init__
    self.service.start()
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/common/service.py", line 83, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

TabError: inconsistent use of tabs and spaces in indentation

Hi. I created a Twitter developer account, filled in the details in api_key.py, and removed the (example) portion. Running the script outputs this error:

  File "./scrape.py", line 105
    window_start, ids = start, set()
                                   ^
TabError: inconsistent use of tabs and spaces in indentation

I'm not sure what the problem is. I installed all the requirements except for the Chrome webdriver, which resulted in a bunch of errors. But I have vanilla Google Chrome and un-Googled Chromium installed, as well as vanilla Firefox. Using Linux Mint 19.3.

Timeout error when attempting to connect

Hi. I'm trying to run this script through a Linux subsystem on Windows 10 (Ubuntu). I am getting the following timeout error when trying to connect.

./scrape.py -u OtogibaraEra
[ scraping user @otogibaraera... ]
[ 0 existing tweets in otogibaraera.json ]
[ searching for tweets... ]
Traceback (most recent call last):
  File "./scrape.py", line 284, in <module>
    user.scrape(begin, end, args.by, args.delay)
  File "./scrape.py", line 103, in scrape
    self.__find_tweets(start, end, by, loading_delay)
  File "./scrape.py", line 174, in __find_tweets
    with init_chromedriver(debug=False) as driver:  # options are Chrome(), Firefox(), Safari()
  File "./scrape.py", line 267, in init_chromedriver
    return webdriver.Chrome(options=options)
  File "/home/hzhu/.local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
    desired_capabilities=desired_capabilities)
  File "/home/hzhu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/hzhu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/hzhu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/hzhu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created
from timeout: Timed out receiving message from renderer: 600.000
  (Session info: headless chrome=89.0.4389.82)

I installed all the modules as directed, except the chromedriver, which I installed with
sudo apt-get install chromium-chromedriver
since Homebrew only supports installing casks on macOS.

Do you have any advice for my problem?

TypeError: get_user() takes 1 positional argument but 2 were given

After a fresh git pull I'm seeing this error when trying to run the scraper, including with the example in the Readme:

./scrape.py -u phillipcompeau --by 14 --since 2018-01-01 --until 2019-01-01
Traceback (most recent call last):
  File "/home/philipjohn/projects/TwitterScraper/./scrape.py", line 81, in __check_if_scrapable
    u = self.api.get_user(self.handle)
  File "/home/linuxbrew/.linuxbrew/Cellar/python@3.9/3.9.8/lib/python3.9/site-packages/tweepy/api.py", line 46, in wrapper
    return method(*args, **kwargs)
TypeError: get_user() takes 1 positional argument but 2 were given

My Python version is 3.9.8, and here's the pip3 freeze output for the dependencies:

beautifulsoup4==4.10.0
requests==2.27.1
requests-oauthlib==1.3.0
selenium==4.1.0
tweepy==4.4.0

This also appears below the error:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/philipjohn/projects/TwitterScraper/./scrape.py", line 284, in <module>
    user.scrape(begin, end, args.by, args.delay)
  File "/home/philipjohn/projects/TwitterScraper/./scrape.py", line 95, in scrape
    self.__check_if_scrapable()
  File "/home/philipjohn/projects/TwitterScraper/./scrape.py", line 84, in __check_if_scrapable
    except tweepy.TweepError as e:
AttributeError: module 'tweepy' has no attribute 'TweepError'

Perhaps there were some breaking changes in Tweepy?
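For reference, Tweepy 4.x did make both breaking changes seen here: API methods now require keyword arguments, and TweepError was replaced by tweepy.errors.TweepyException. An untested sketch of 4.x-compatible calls, assuming an authenticated tweepy.API object named api:

import tweepy

try:
    user = api.get_user(screen_name="phillipcompeau")  # keyword argument required in 4.x
except tweepy.errors.TweepyException as err:           # replaces tweepy.TweepError
    print(f"user lookup failed: {err}")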

Search no longer works

I don't believe Twitter allows you to search a specific user's tweets anymore. All searches yield 0 results.

Twitter Doesn't Allow Search Without Logging In - Causes Scraper To Find 0 Tweets

Context

I noticed when running in --debug mode that the browser redirects to https://twitter.com/i/flow/login until the user logs in. Not sure what to do about this...

Possibilities

  • Implement some sort of login mechanism
    • cookies? (a rough sketch follows below)
      • complicated
      • unclear how to implement; unsure whether the browser can be logged in with just API credentials...
    • make the scraper take username/password information?
      • simplest, but most insecure
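A rough sketch of the cookie idea; the cookie name ("auth_token") and the whole approach are assumptions here, not a tested fix:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://twitter.com")  # must visit the domain before add_cookie
driver.add_cookie({
    "name": "auth_token",          # session cookie copied from a logged-in browser
    "value": "<your token here>",
    "domain": ".twitter.com",
})
driver.get("https://twitter.com/search?q=from%3Aphillipcompeau")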
