
twitter-scraper-selenium's Introduction

Twitter scraper selenium

A Python package to scrape Twitter's front end easily with Selenium.


Table of Contents

  1. Getting Started
  2. Usage
  3. Privacy
  4. License


Prerequisites

  • Internet Connection
  • Python 3.6+
  • Chrome or Firefox browser installed on your machine

Installation

    Installing from the source

    Download the source code or clone it with:

    git clone https://github.com/shaikhsajid1111/twitter-scraper-selenium
    

    Open a terminal inside the downloaded folder and run:


     python3 setup.py install
    

    Installing with PyPI

    pip3 install twitter-scraper-selenium
    

    Usage

    Available Functions in This Package - Summary

    Function Name Function Description Scraping Method Scraping Speed
    scrape_profile() Scrapes tweets from a Twitter user's profile Browser Automation Slow
    get_profile_details() Scrapes a Twitter user's details HTTP Request Fast
    scrape_profile_with_api() Scrapes tweets by Twitter profile username. It expects the username of the profile Browser Automation & HTTP Request Fast

    Note: the HTTP Request method sends requests directly to Twitter's API to collect data, while the Browser Automation method visits the page and scrolls while collecting the data.



    To scrape Twitter profile details:

    from twitter_scraper_selenium import get_profile_details
    
    twitter_username = "TwitterAPI"
    filename = "twitter_api_data"
    browser = "firefox"
    headless = True
    get_profile_details(twitter_username=twitter_username, filename=filename, browser=browser, headless=headless)

    Output:

    {
    	"id": 6253282,
    	"id_str": "6253282",
    	"name": "Twitter API",
    	"screen_name": "TwitterAPI",
    	"location": "San Francisco, CA",
    	"profile_location": null,
    	"description": "The Real Twitter API. Tweets about API changes, service issues and our Developer Platform. Don't get an answer? It's on my website.",
    	"url": "https:\/\/t.co\/8IkCzCDr19",
    	"entities": {
    		"url": {
    			"urls": [{
    				"url": "https:\/\/t.co\/8IkCzCDr19",
    				"expanded_url": "https:\/\/developer.twitter.com",
    				"display_url": "developer.twitter.com",
    				"indices": [
    					0,
    					23
    				]
    			}]
    		},
    		"description": {
    			"urls": []
    		}
    	},
    	"protected": false,
    	"followers_count": 6133636,
    	"friends_count": 12,
    	"listed_count": 12936,
    	"created_at": "Wed May 23 06:01:13 +0000 2007",
    	"favourites_count": 31,
    	"utc_offset": null,
    	"time_zone": null,
    	"geo_enabled": null,
    	"verified": true,
    	"statuses_count": 3656,
    	"lang": null,
    	"contributors_enabled": null,
    	"is_translator": null,
    	"is_translation_enabled": null,
    	"profile_background_color": null,
    	"profile_background_image_url": null,
    	"profile_background_image_url_https": null,
    	"profile_background_tile": null,
    	"profile_image_url": null,
    	"profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/942858479592554497\/BbazLO9L_normal.jpg",
    	"profile_banner_url": null,
    	"profile_link_color": null,
    	"profile_sidebar_border_color": null,
    	"profile_sidebar_fill_color": null,
    	"profile_text_color": null,
    	"profile_use_background_image": null,
    	"has_extended_profile": null,
    	"default_profile": false,
    	"default_profile_image": false,
    	"following": null,
    	"follow_request_sent": null,
    	"notifications": null,
    	"translator_type": null
    }

    get_profile_details() arguments:

    Argument Argument Type Description
    twitter_username String Twitter username
    output_filename String Name of the file where the output is stored
    output_dir String Directory where the output file should be saved
    proxy String Optional. Proxy to use for scraping. If the proxy requires authentication, the format is username:password@host:port.


    Keys of the output:

    Details of each key can be found here.



    To scrape a profile's tweets:

    In JSON format:

    from twitter_scraper_selenium import scrape_profile
    
    microsoft = scrape_profile(twitter_username="microsoft",output_format="json",browser="firefox",tweets_count=10)
    print(microsoft)

    Output:

    {
      "1430938749840629773": {
        "tweet_id": "1430938749840629773",
        "username": "Microsoft",
        "name": "Microsoft",
        "profile_picture": "https://twitter.com/Microsoft/photo",
        "replies": 29,
        "retweets": 58,
        "likes": 453,
        "is_retweet": false,
        "retweet_link": "",
        "posted_time": "2021-08-26T17:02:38+00:00",
        "content": "Easy to use and efficient for all \u2013 Windows 11 is committed to an accessible future.\n\nHere's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW ",
        "hashtags": [],
        "mentions": [],
        "images": [],
        "videos": [],
        "tweet_url": "https://twitter.com/Microsoft/status/1430938749840629773",
        "link": "https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC"
      },...
    }

    In CSV format:

    from twitter_scraper_selenium import scrape_profile
    
    
    scrape_profile(twitter_username="microsoft",output_format="csv",browser="firefox",tweets_count=10,filename="microsoft",directory="/home/user/Downloads")
    

    Output:

    tweet_id username name profile_picture replies retweets likes is_retweet retweet_link posted_time content hashtags mentions images videos tweet_url link
    1430938749840629773 Microsoft Microsoft https://twitter.com/Microsoft/photo 64 75 521 False 2021-08-26T17:02:38+00:00 Easy to use and efficient for all – Windows 11 is committed to an accessible future.

    Here's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW
    [] [] [] [] https://twitter.com/Microsoft/status/1430938749840629773 https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC

    ...
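
    A short sketch for loading the saved CSV back into Python (assuming the pandas package is installed; the path mirrors the directory and filename arguments above):

    import pandas as pd

    # path follows from directory="/home/user/Downloads" and filename="microsoft"
    tweets = pd.read_csv("/home/user/Downloads/microsoft.csv")
    print(tweets[["tweet_id", "posted_time", "likes"]].head())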



    scrape_profile() arguments:

    Argument Argument Type Description
    twitter_username String Twitter username of the account
    browser String Which browser to use for scraping. Only two are supported: Chrome and Firefox. Default is Firefox.
    proxy String Optional. Proxy to use for scraping. If the proxy requires authentication, the format is username:password@host:port.
    tweets_count Integer Number of posts to scrape. Default is 10.
    output_format String The output format, either JSON or CSV. Default is JSON.
    filename String If output_format is set to CSV, the filename parameter should be passed. If not passed, the filename will be the same as the username.
    directory String If output_format is set to CSV, the directory parameter may be passed. If not passed, the CSV file is saved in the current working directory.
    headless Boolean Whether to run the crawler headless. Default is True.


    Keys of the output

    Key Type Description
    tweet_id String Post identifier (an integer cast to string)
    username String Username of the profile
    name String Name of the profile
    profile_picture String Profile picture link
    replies Integer Number of replies to the tweet
    retweets Integer Number of retweets of the tweet
    likes Integer Number of likes of the tweet
    is_retweet Boolean Is the tweet a retweet?
    retweet_link String If it is a retweet, the retweet link; otherwise an empty string
    posted_time String Time when the tweet was posted, in ISO 8601 format
    content String Content of the tweet as text
    hashtags Array Hashtags present in the tweet, if any
    mentions Array Mentions present in the tweet, if any
    images Array Image links, if present in the tweet
    videos Array Video links, if present in the tweet
    tweet_url String URL of the tweet
    link String Link to any external website present inside the tweet
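
    A minimal sketch of consuming these keys, assuming scrape_profile returns the JSON document as a string (as in the example above):

    import json
    from twitter_scraper_selenium import scrape_profile

    data = json.loads(scrape_profile(twitter_username="microsoft",
                                     output_format="json", browser="firefox", tweets_count=10))
    for tweet_id, tweet in data.items():
        # each value carries the keys documented in the table above
        print(tweet["posted_time"], tweet["tweet_url"], tweet["likes"])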


    To scrape a profile's tweets with the API:

    from twitter_scraper_selenium import scrape_profile_with_api
    
    scrape_profile_with_api('elonmusk', output_filename='musk', tweets_count=100)

    scrape_profile_with_api() Arguments:

    Argument Argument Type Description
    username String Twitter profile username
    tweets_count Integer Number of tweets to scrape.
    output_filename String Name of the file where the output is stored.
    output_dir String Directory where the output file should be saved.
    proxy String Optional. Proxy to use for scraping. If the proxy requires authentication, the format is username:password@host:port.
    browser String Which browser to use for extracting the GraphQL key. Default is firefox.
    headless Boolean Whether to run the browser in headless mode.

    Output:

    {
      "1608939190548598784": {
        "tweet_url" : "https://twitter.com/elonmusk/status/1608939190548598784",
        "tweet_details":{
          ...
        },
        "user_details":{
          ...
        }
      }, ...
    }
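
    A small follow-up sketch, assuming scrape_profile_with_api writes its result to <output_filename>.json (inferred from the output_filename argument, not confirmed by this page):

    import json

    # "musk.json" assumes output_filename='musk' from the call above
    with open("musk.json") as f:
        tweets = json.load(f)
    print(len(tweets), "tweets scraped")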


    Using the scraper with a proxy (HTTP proxy)

    Just pass the proxy argument to the function.

    from twitter_scraper_selenium import scrape_profile
    
    scrape_profile("elonmusk", headless=False, proxy="66.115.38.247:5678", output_format="csv",filename="musk") #In IP:PORT format

    Proxy that requires authentication:

    from twitter_scraper_selenium import scrape_profile
    
    microsoft_data = scrape_profile(twitter_username="microsoft", browser="chrome", tweets_count=10, output_format="json",
                          proxy="sajid:[email protected]:5678")  # username:password@IP:PORT
    print(microsoft_data)
    


    Privacy

    This scraper only scrapes public data available to an unauthenticated user and does not have the capability to scrape anything private.



    LICENSE

    MIT

    twitter-scraper-selenium's People

    Contributors

    marcusnk237, rachmadaniharyono, shaikhsajid1111, weltolk


    twitter-scraper-selenium's Issues

    twitter_scraper_selenium.scraping_utilities:Error at find_x_guest_token: 'guest_token'

    I'm trying to make an API, but this error keeps troubling me. The snippet below works on my local machine but breaks when I ship it to Render using Docker.

    import json

    import flask
    from twitter_scraper_selenium import get_profile_details


    def init(app: flask.app.Flask):
        @app.route("/user/<string:username>")
        def user(username):
            filename = "get_profile_details"
            get_profile_details(
                twitter_username=username,
                filename=filename,
            )
            with open(filename + ".json") as f:
                data = json.load(f)
            return data
    

    can't set firefox profile path

    from twitter_scraper_selenium.topic import scrap_topic
    scrap_topic(
        filename='linux',
        url='https://twitter.com/i/topics/848959431836487680',
        headless=False,
        browser_profile='/home/r3r/Documents/selenium_profile'
    )

    output

    > python steamdeck.py
    INFO:root:Loading Profile from /home/r3r/Documents/selenium_profile
    [WDM] - Driver [/home/r3r/.wdm/drivers/geckodriver/linux64/v0.31.0/geckodriver] found in cache
    INFO:WDM:Driver [/home/r3r/.wdm/drivers/geckodriver/linux64/v0.31.0/geckodriver] found in cache
    INFO:seleniumwire.storage:Using default request storage
    INFO:seleniumwire.backend:Created proxy listening on 127.0.0.1:44371

    It freezes after this output.

    Looking at the Selenium source code, setting a profile this way is deprecated: https://github.com/SeleniumHQ/selenium/blob/a4995e2c096239b42c373f26498a6c9bb4f2b3e7/py/selenium/webdriver/firefox/options.py#L101-L105

    I had better success using the profile argument, just like Chrome: https://github.com/rachmadaniHaryono/twitter-scraper-selenium/tree/bugfix/profile

    But I have only tested it on Linux.
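
    A sketch of that workaround under Selenium 4, where the Options.profile property replaces the deprecated profile handling (a minimal standalone example, not the package's actual initialization code):

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    # Selenium 4 accepts a profile directory path on the Options object
    options.profile = '/home/r3r/Documents/selenium_profile'
    driver = webdriver.Firefox(options=options)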

    "Tweets did not appear!"

    Code
    scrap_keyword(keyword="nft", browser="chrome", tweets_count=999999, until="2022-06-30", since="2022-06-29",output_format="csv",filename="nft")

    Error Message
    [WDM] - Current google-chrome version is 103.0.5060
    [WDM] - Get LATEST driver version for 103.0.5060

    [WDM] - Driver [./.wdm/drivers/chromedriver/mac64/103.0.5060.53/chromedriver] found in cache
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!

    Other Information
    The packages in requirement.txt have already been installed.

    directory option not working

    ValueError: not enough values to unpack (expected 2, got 0)
    Traceback (most recent call last):
      File "/home/arch/Code/Commisions/CryptoGuys/./src/util/scrape.py", line 3, in <module>
        scrap_topic(filename="tweets", url='https://twitter.com/i/topics/1468157909318045697',browser="firefox", tweets_count=10, directory='./src/util')
      File "/usr/lib/python3.10/site-packages/twitter_scraper_selenium/topic.py", line 60, in scrap_topic
        output_path = directory / "{}.json".format(filename)
    TypeError: unsupported operand type(s) for /: 'str' and 'str'

    AttributeError: 'str' object has no attribute 'close'

    Running on my VPS: Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-52-generic x86_64)

    [WDM] - Driver [/root/.wdm/drivers/geckodriver/linux64/v0.32.2/geckodriver] found in cache
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/profile.py", line 118, in scrap
        self.__start_driver()
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/profile.py", line 40, in __start_driver
        self.browser, self.headless, self.proxy, self.browser_profile).init()
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/driver_initialization.py", line 104, in init
        driver = self.set_driver_for_browser(self.browser_name)
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/driver_initialization.py", line 97, in set_driver_for_browser
        return webdriver.Firefox(service=FirefoxService(executable_path=GeckoDriverManager().install()), options=self.set_properties(browser_option))
      File "/usr/local/lib/python3.10/dist-packages/seleniumwire/webdriver.py", line 179, in __init__
        super().__init__(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/firefox/webdriver.py", line 197, in __init__
        super().__init__(command_executor=executor, options=options, keep_alive=True)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/webdriver.py", line 288, in __init__
        self.start_session(capabilities, browser_profile)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/webdriver.py", line 381, in start_session
        response = self.execute(Command.NEW_SESSION, parameters)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/webdriver.py", line 444, in execute
        self.error_handler.check_response(response)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/errorhandler.py", line 249, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.WebDriverException: Message: Process unexpectedly closed with status 1
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/root/crypto-ordinals/twitter_feeds.py", line 67, in <module>
        tweets = scrape_profile(twitter_username="ordswapbot",output_format="json",browser="firefox",tweets_count=10,headless=False)
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/profile.py", line 197, in scrape_profile
        data = profile_bot.scrap()
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/profile.py", line 128, in scrap
        self.__close_driver()
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/profile.py", line 43, in __close_driver
        self.__driver.close()
    AttributeError: 'str' object has no attribute 'close'
    
    

    Not scraping every tweet from a user

    Hello, I am trying to scrape every tweet from a user. From the Twitter page, I can see that they have tweeted more than 5000 times. However, even when I set my tweets_count to 5000, I am getting fewer than 1000 tweets from that user.

    My code is below:

    scrape_profile(twitter_username = "elonmusk", output_format ="csv", tweets_count = 6000, browser = "chrome", filename = "elonmusk")

    (Note that @ElonMusk is just a stand-in example)

    Tweets did not appear

    Hello, I keep getting this error. I am new to this package, so I do not know what I am missing.

    "Tweets did not appear!, Try setting headless=False to see what is happening"

    Install help: selenium.common.exceptions.SessionNotCreatedException

    Hi there team,
    Thanks for this amazing lib. I've used it a few times already with no problems, but today, trying to make it work again, I got stuck on a geckodriver error.

    ...
    selenium.common.exceptions.SessionNotCreatedException: Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line
    ...
    

    I created a Docker image to get a fresh install and make sure it's not my setup. I use macOS and get the same error.

    Here are the files and how to reproduce:

    Dockerfile

    # Use the official Python 3.9.16 image as the base image
    FROM python:3.9.16
    
    # Set the working directory to /app
    WORKDIR /app
    
    # Install the necessary dependencies
    RUN pip install twitter-scraper-selenium
    
    # Set up the shared volume
    VOLUME ["/app"]
    
    # Set the default command to run your script
    CMD [ "python", "scrapper.py" ]
    
    

    scrapper.py

    from twitter_scraper_selenium import scrape_profile
    
    microsoft = scrape_profile(twitter_username="microsoft",output_format="json",browser="firefox",tweets_count=10)
    print(microsoft)
    

    To run, after installing and setting up Docker:

    docker build -t twitter-scraper .
    docker run -v $(pwd):/app twitter-scraper
    

    The below error happens on docker run, and I couldn't find anything useful on the internet to help me fix it. Can you help me understand what is happening? It seems to be with geckodriver, not exactly with twitter-scraper-selenium, but I'm not sure where else to look.

    Full logs below:

    [WDM] - There is no [linux64] geckodriver for browser  in cache
    [WDM] - Getting latest mozilla release info for v0.33.0
    [WDM] - Trying to download new driver from https://github.com/mozilla/geckodriver/releases/download/v0.33.0/geckodriver-v0.33.0-linux64.tar.gz
    [WDM] - Driver has been saved in cache [/root/.wdm/drivers/geckodriver/linux64/v0.33.0]
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/twitter_scraper_selenium/profile.py", line 118, in scrap
        self.__start_driver()
      File "/usr/local/lib/python3.9/site-packages/twitter_scraper_selenium/profile.py", line 39, in __start_driver
        self.__driver = Initializer(
      File "/usr/local/lib/python3.9/site-packages/twitter_scraper_selenium/driver_initialization.py", line 104, in init
        driver = self.set_driver_for_browser(self.browser_name)
      File "/usr/local/lib/python3.9/site-packages/twitter_scraper_selenium/driver_initialization.py", line 97, in set_driver_for_browser
        return webdriver.Firefox(service=FirefoxService(executable_path=GeckoDriverManager().install()), options=self.set_properties(browser_option))
      File "/usr/local/lib/python3.9/site-packages/seleniumwire/webdriver.py", line 179, in __init__
        super().__init__(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/firefox/webdriver.py", line 197, in __init__
        super().__init__(command_executor=executor, options=options, keep_alive=True)
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 288, in __init__
        self.start_session(capabilities, browser_profile)
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 381, in start_session
        response = self.execute(Command.NEW_SESSION, parameters)
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 444, in execute
        self.error_handler.check_response(response)
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/errorhandler.py", line 249, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.SessionNotCreatedException: Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/app/scrapper.py", line 21, in <module>
        microsoft = scrape_profile(twitter_username="microsoft",output_format="json",browser="firefox",tweets_count=10)
      File "/usr/local/lib/python3.9/site-packages/twitter_scraper_selenium/profile.py", line 197, in scrape_profile
        data = profile_bot.scrap()
      File "/usr/local/lib/python3.9/site-packages/twitter_scraper_selenium/profile.py", line 128, in scrap
        self.__close_driver()
      File "/usr/local/lib/python3.9/site-packages/twitter_scraper_selenium/profile.py", line 43, in __close_driver
        self.__driver.close()
    AttributeError: 'str' object has no attribute 'close'
    

    Appreciate any help
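
    A hedged guess at the cause, going by the SessionNotCreatedException message: the python:3.9.16 base image ships no Firefox, so geckodriver has no browser binary to launch. Installing one in the Dockerfile may resolve it (firefox-esr is the package name on the Debian-based python images; this is an assumption, not a confirmed fix):

    # Install Firefox so geckodriver has a browser binary to launch
    RUN apt-get update && apt-get install -y firefox-esr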

    Are Musk's new limitations impacting this module?

    These lines of code worked fine just a few weeks ago, and now I'm getting TypeError: object of type 'NoneType' has no len().

    This is the code I'm using:
    from twitter_scraper_selenium import get_profile_details

    twitter_username = "tim_cook"
    filename = "twitter_dummy_ceo"
    get_profile_details(twitter_username=twitter_username, filename=filename)

    Incorrect retweet 'tweet_url'

    When scraping a profile, retweets get an incorrect 'tweet_url'

    In profile.py:
    tweet_url = "https://twitter.com/{}/status/{}".format(username, status)
    Here, status is the same as username when the tweet is a retweet.
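
    A hedged sketch of one possible fix (find_tweet_url is a hypothetical helper, not the package's actual code): read the permalink from the tweet's status anchor instead of formatting username and status together:

    from selenium.webdriver.common.by import By

    def find_tweet_url(tweet):
        # the timestamp anchor carries the canonical /<author>/status/<id>
        # permalink, which stays correct even when the tweet is a retweet
        anchor = tweet.find_element(By.CSS_SELECTOR, "a[href*='/status/']")
        return anchor.get_attribute("href")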

    Incomprehensible Error

    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/keyword.py", line 128, in scrap
        self.start_driver()
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/keyword.py", line 52, in start_driver
        self.browser, self.headless, self.proxy, self.browser_profile).init()
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/driver_initialization.py", line 104, in init
        driver = self.set_driver_for_browser(self.browser_name)
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/driver_initialization.py", line 97, in set_driver_for_browser
        return webdriver.Firefox(service=FirefoxService(executable_path=GeckoDriverManager().install()), options=self.set_properties(browser_option))
      File "/usr/local/lib/python3.10/dist-packages/seleniumwire/webdriver.py", line 178, in __init__
        super().__init__(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/firefox/webdriver.py", line 177, in __init__
        super().__init__(
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/webdriver.py", line 277, in __init__
        self.start_session(capabilities, browser_profile)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/webdriver.py", line 370, in start_session
        response = self.execute(Command.NEW_SESSION, parameters)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/webdriver.py", line 435, in execute
        self.error_handler.check_response(response)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.WebDriverException: Message: Failed to decode response from marionette
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/root/Bots/DiscordBots/TwitterTopics/MTC/src/util/scrape.py", line 6, in <module>
        scrape_topic(filename="tweets", url='https://twitter.com/i/topics/1468157909318045697',browser="firefox", tweets_count=25)
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/topic.py", line 53, in scrape_topic
        data = keyword_bot.scrap()
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/keyword.py", line 140, in scrap
        self.close_driver()
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/keyword.py", line 55, in close_driver
        self.driver.close()
    AttributeError: 'str' object has no attribute 'close'
    
    I do not understand this error.

    Failed to make request!

    from twitter_scraper_selenium import scrape_keyword_with_api

    Username = input("Account: ")
    tweets = int(input("How many tweets: "))
    path = "C:/Users/HP Probook/PycharmProjects/scrap-scripts/"

    data = scrape_keyword_with_api(f"(from:{Username})")
    print(data)

    and got this error:

    2023-07-26 15:03:21,739 - twitter_scraper_selenium.keyword_api - WARNING - Failed to make request!

    not enough values to unpack

    Hi guys, I pulled the code again from GitHub and reinstalled it, but why am I still getting the "not enough values to unpack" error? (The full traceback was attached as a screenshot.)

    timeout exception

    [WDM] - Driver [C:\Users\HP Probook\.wdm\drivers\geckodriver\win64\v0.33.0\geckodriver.exe] found in cache
    2023-07-13 13:12:40,345 - twitter_scraper_selenium.driver_utils - ERROR - Tweets did not appear!, Try setting headless=False to see what is happening
    Traceback (most recent call last):
      File "C:\Users\HP Probook\PycharmProjects\firstproject\venv\lib\site-packages\twitter_scraper_selenium\driver_utils.py", line 35, in wait_until_tweets_appear
        WebDriverWait(driver, 80).until(EC.presence_of_element_located(
      File "C:\Users\HP Probook\PycharmProjects\firstproject\venv\lib\site-packages\selenium\webdriver\support\wait.py", line 95, in until
        raise TimeoutException(message, screen, stacktrace)
    selenium.common.exceptions.TimeoutException: Message:
    Stacktrace:
    RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
    WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:183:5
    NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:395:5
    element.find/</<@chrome://remote/content/marionette/element.sys.mjs:134:16

    More Examples and Documentation Needed

    More Examples and Documentation are needed...

    I am new to using this package. I see a lot of functionality, but not enough examples of how to use it.
    It would be nice if there was a ReadTheDocs.io website for it.

    I am interested in pulling images and metadata associated with tweets. The element_finder.py file appears to pull images, but it is unclear how to call it separately, or whether that is the best way to use it in this package.

    @staticmethod
    def find_images(tweet) -> Union[list, None]:
        """finds all images of the tweet"""
        try:
            image_element = tweet.find_elements(By.CSS_SELECTOR,
                                                'div[data-testid="tweetPhoto"]')
            images = []
            for image_div in image_element:
                href = image_div.find_element(By.TAG_NAME,
                                              "img").get_attribute("src")
                images.append(href)
            return images
        except Exception as ex:
            logger.exception("Error at method find_images : {}".format(ex))
            return []
    

    Many of the functions do not have docstrings. I am going through them now to try to add descriptions for my own benefit.

    I also want to add things to the outputs to automate data capture. For example, I want to time-stamp output files, because during testing I will have to do multiple runs to make sure that I got all the information I was looking for.

    For JSON files, I would like to pretty-print them before saving them, because they come out in a one-liner format in the file.

    I am not sure where to insert these functions. I will keep browsing the package structure to see if I can figure out these things, and add helpful docstrings where they are missing.
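
    A small post-processing sketch for the time-stamping and pretty-printing mentioned above (save_pretty is illustrative, not part of the package):

    import json
    from datetime import datetime

    def save_pretty(data: dict, basename: str) -> str:
        # timestamp the filename so repeated test runs don't overwrite
        # each other, and pretty-print instead of one-line JSON
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        path = f"{basename}_{stamp}.json"
        with open(path, "w") as f:
            json.dump(data, f, indent=4)
        return path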

    Unable to launch the lib

    Hello,

    I would like to use the lib for some development, but I'm facing issues even launching a project with it.
    My project is built with Poetry, so I'm handling the libs through that tool. My config file looks like this:

    [tool.poetry]
    name = "poetrytestproject"
    version = "0.1.0"
    description = "Test project"
    authors = ["Kamigaku <[email protected]>"]
    
    [tool.poetry.dependencies]
    python = "^3.10"
    matplotlib = "^3.5.3"
    Unidecode = "^1.3.4"
    numpy = "^1.23.4"
    scipy = "^1.9.3"
    tweepy = "^4.12.1"
    Pillow = "^9.2.0"
    fonttools = "^4.38.0"
    twitter-scraper-selenium = "^4.1.2"
    
    [tool.poetry.dev-dependencies]
    
    [build-system]
    requires = ["poetry-core>=1.0.0"]
    build-backend = "poetry.core.masonry.api"
    

    When I launch the project with a simple import of your package, I get this error:

    Traceback (most recent call last):
      File "G:\Code\Python\PoetryTestProject\main_selenium.py", line 6, in <module>
        from twitter_scraper_selenium import scrape_profile
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\twitter_scraper_selenium\__init__.py", line 5, in <module>
        from .keyword import scrape_keyword
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\twitter_scraper_selenium\keyword.py", line 4, in <module>
        from .driver_initialization import Initializer
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\twitter_scraper_selenium\driver_initialization.py", line 12, in <module>
        from seleniumwire import webdriver
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\webdriver.py", line 27, in <module>
        from seleniumwire import backend, utils
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\backend.py", line 4, in <module>
        from seleniumwire.server import MitmProxy
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\server.py", line 5, in <module>
        from seleniumwire.handler import InterceptRequestHandler
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\handler.py", line 5, in <module>
        from seleniumwire import har
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\har.py", line 11, in <module>
        from seleniumwire.thirdparty.mitmproxy import connections
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\thirdparty\mitmproxy\connections.py", line 10, in <module>
        from seleniumwire.thirdparty.mitmproxy.net import tls, tcp
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\thirdparty\mitmproxy\net\tls.py", line 43, in <module>
        "SSLv2": (SSL.SSLv2_METHOD, BASIC_OPTIONS),
    AttributeError: module 'OpenSSL.SSL' has no attribute 'SSLv2_METHOD'. Did you mean: 'SSLv23_METHOD'?
    
    Process finished with exit code 1
    

    I've read in multiple spots on the internet (and even on your Facebook scraper project) that the issue might come from the "PyOpenSSL" package. I downgraded its version to "21.0.0" using Poetry, and the error changes to:

    Traceback (most recent call last):
      File "G:\Code\Python\PoetryTestProject\main_selenium.py", line 6, in <module>
        from twitter_scraper_selenium import scrape_profile
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\twitter_scraper_selenium\__init__.py", line 5, in <module>
        from .keyword import scrape_keyword
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\twitter_scraper_selenium\keyword.py", line 4, in <module>
        from .driver_initialization import Initializer
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\twitter_scraper_selenium\driver_initialization.py", line 12, in <module>
        from seleniumwire import webdriver
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\webdriver.py", line 27, in <module>
        from seleniumwire import backend, utils
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\backend.py", line 4, in <module>
        from seleniumwire.server import MitmProxy
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\server.py", line 5, in <module>
        from seleniumwire.handler import InterceptRequestHandler
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\handler.py", line 5, in <module>
        from seleniumwire import har
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\har.py", line 11, in <module>
        from seleniumwire.thirdparty.mitmproxy import connections
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\thirdparty\mitmproxy\connections.py", line 9, in <module>
        from seleniumwire.thirdparty.mitmproxy import certs, exceptions, stateobject
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\thirdparty\mitmproxy\certs.py", line 10, in <module>
        import OpenSSL
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\OpenSSL\__init__.py", line 8, in <module>
        from OpenSSL import crypto, SSL
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\OpenSSL\crypto.py", line 3279, in <module>
        _lib.OpenSSL_add_all_algorithms()
    AttributeError: module 'lib' has no attribute 'OpenSSL_add_all_algorithms'
    
    Process finished with exit code 1
    

    Any idea what might cause this?
    A friend of mine is having the same issue.
    I'm using Python version 3.10.0.

    Thanks!

    Scraping only 4 tweets no matter the tweet count

    As mentioned in the title, I'm getting only 4 tweets using this code:

    from twitter_scraper_selenium import scrape_profile
    import os
    import json

    parent_dir = os.getcwd()  # parent_dir was not defined in the original snippet
    account = input("Account: ")
    tweets = int(input("How many tweets: "))
    path = os.path.join(parent_dir, account)
    if not os.path.exists(path):
        os.mkdir(path)
    data = scrape_profile(twitter_username=account, output_format="json", browser="firefox", tweets_count=tweets)
    print(data)
    parsed = json.loads(data)
    json_data = json.dumps(parsed, indent=4)
    with open(path + "\\" + account + ".json", "w") as outfile:
        outfile.write(json_data)
    

    And this is the print output: https://pastebin.com/p1UuxZFa

    Thank you.

    The address returned by calling scrape_profile is wrong!

    Hi there. The address returned by calling scrape_profile(twitter_username="blvckledge", output_format="json", browser="chrome", tweets_count=10) for videos is blob:https://twitter.com/facd7bdb-73ec-49ff-9492-993a165a3585, but the actual address is https://video.twimg.com/ext_tw_video/1673328923311058944/pu/vid/848x464/KNnl7Rqk_MfiX4X9.mp4?tag=12. Why is an incorrect address being returned? Additionally, I receive errors when calling scrape_topic_with_api() and scrape_profile_with_api(). What could be causing this?

    These functions will be removed in a new release

    • scrape_profile() - Its alternative is scrape_profile_with_api()
    • scrape_keyword() - Its alternative is scrape_keyword_with_api()
    • scrape_topic() - Its alternative is scrape_topic_with_api()

    Returns only "Tweets did not appear!"

    Environment Information

    OS Version:

    Edition	Windows 10 Pro
    Version	21H2
    Installed on	‎2022.‎04.‎10
    OS build	19044.1706
    Experience	Windows Feature Experience Pack 120.2212.4170.0
    

    Python Version:

    PS C:\Users\xxx> python -V
    Python 3.10.5

    Twitter scraper selenium Version:

    0.1.6
    

    The packages in requirement.txt are already installed.

    Code

    from twitter_scraper_selenium import scrap_profile
    
    microsoft = scrap_profile(twitter_username="Microsoft", output_format="json", browser="firefox", tweets_count=10)
    print(microsoft)

    Error Information

    C:\Users\xxx\PycharmProjects\venv\Scripts\python.exe
    C:/Users/xxx/PycharmProjects/Twitter_Selenium.py
    [WDM] - Driver [C:\Users\xxx\.wdm\drivers\geckodriver\win64\v0.31.0\geckodriver.exe] found in cache
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    {}
    
    Process finished with exit code 0
    

    log

    geckodriver.log:

    1655093631147	geckodriver	INFO	Listening on 127.0.0.1:23023
    1655093631171	mozrunner::runner	INFO	Running command: "C:\\Program Files\\Mozilla Firefox\\firefox.exe" "--marionette" "--headless" "--no-sandbox" "--disable-dev-shm-usage" "--ignore-certificate-errors" "--disable-gpu" "--log-level=3" "--disable-notifications" "--disable-popup-blocking" "-no-remote" "-profile" "C:\\Users\\xxx\\AppData\\Local\\Temp\\rust_mozprofileMChkmM"
    *** You are running in headless mode.
    1655093631395	Marionette	INFO	Marionette enabled
    [GFX1-]: RenderCompositorSWGL failed mapping default framebuffer, no dt
    console.warn: SearchSettings: "get: No settings file exists, new profile?" (new NotFoundError("Could not open the file at C:\\Users\\xxx\\AppData\\Local\\Temp\\rust_mozprofileMChkmM\\search.json.mozlz4", (void 0)))
    1655093632149	Marionette	INFO	Listening on port 23060
    Read port: 23060
    1655093632179	RemoteAgent	WARN	TLS certificate errors will be ignored for this session
    1655093632180	RemoteAgent	INFO	Proxy settings initialised: {"proxyType":"manual","httpProxy":"127.0.0.1:23022","noProxy":[],"sslProxy":"127.0.0.1:23022"}
    1655093781898	Marionette	INFO	Stopped listening on port 23060
    
    ###!!! [Parent][PImageBridgeParent] Error: RunMessage(msgname=PImageBridge::Msg_WillClose) Channel closing: too late to send/recv, messages will be lost
    

    scrape hashtag

    Does scraping by hashtag work?
    I want data on people who use a specific hashtag, to find out who uses it the most.
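
    A hedged sketch, assuming the keyword scraper accepts a hashtag string as its keyword and returns a JSON string like scrape_profile (suggested by the keyword examples elsewhere on this page, not confirmed):

    import json
    from collections import Counter
    from twitter_scraper_selenium import scrape_keyword

    data = json.loads(scrape_keyword(keyword="#linux", browser="firefox",
                                     tweets_count=100, output_format="json"))
    # count which usernames post the hashtag most often
    top_users = Counter(tweet["username"] for tweet in data.values())
    print(top_users.most_common(10))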

    pip installed and running example, throws error.

    from twitter_scraper_selenium import scrap_keyword

    # scrape 10 posts by searching keyword "movie" from 29th August till 31st August
    india = scrap_keyword(keyword="movie", browser="chrome",
                          tweets_count=10, output_format="json", until="2021-08-31", since="2021-08-29")
    print(india)

    is the code that I am trying to run.
    At first, the readout begins with:

    Current google-chrome version is 103.0.5060
    Get LATEST driver version for 103.0.5060
    [WDM] - Driver found in cache

    However, I believe the following is where the error is occurring:

    create_client_context
    param = SSL._lib.SSL_CTX_get0_param(context._context)
    AttributeError: module 'lib' has no attribute 'SSL_CTX_get0_param'

    Output exceeds the size limit. Open the full output data in a text editor
    Message: unknown error: net::ERR_CONNECTION_CLOSED
    (Session info: headless chrome=103.0.5060.114)

    README.MD error with installation via pip

    Hello,

    I got that installation error with

    pip install twitter-scraper-selenium

    Error:
    Collecting twitter-scraper-selenium
    Using cached twitter_scraper_selenium-0.1.3.tar.gz (14 kB)
    Preparing metadata (setup.py) ... error
    error: subprocess-exited-with-error

    × python setup.py egg_info did not run successfully.
    │ exit code: 1
    ╰─> [6 lines of output]
    Traceback (most recent call last):
      File "<string>", line 2, in <module>
      File "<pip-setuptools-caller>", line 34, in <module>
      File "/tmp/pip-install-sw0o3qs2/twitter-scraper-selenium_e1bf4012aa9340ebaba22e04e0ba72df/setup.py", line 3, in <module>
        with open("README.MD", "r") as file:
    FileNotFoundError: [Errno 2] No such file or directory: 'README.MD'
    [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
    error: metadata-generation-failed

    × Encountered error while generating package metadata.
    ╰─> See above for output.

    note: This is an issue with the package mentioned above, not pip.
    hint: See above for details.

    It would be good if you checked the encoding and handling of README.MD. Also, the standard is README.md (lowercase extension *.md).

    login support before executing APIs

    Add support for logging into Twitter using email/username/password.
    Once logged in, can more data be accessed?
    Is the session going to be more stable?

    Install fails

    Please check, something is wrong (the error was shown in an attached screenshot). Thanks for the support.

    Failed to make request!

    import os
    from twitter_scraper_selenium import scrape_topic_with_api
    
    scrape_topic_with_api(URL='https://twitter.com/i/topics/1468157909318045697', output_filename='tweets', tweets_count=10, headless=False)
    
    if os.path.isfile("./tweets.json") == False:
        scrape_topic_with_api(URL='https://twitter.com/i/topics/1468157909318045697', output_filename='tweets', tweets_count=10, headless=False)
    
    if len(open("./tweets.json").read()) < 100:
        scrape_topic_with_api(URL='https://twitter.com/i/topics/1468157909318045697', output_filename='tweets', tweets_count=10, headless=False)

    I'm repeatedly getting the warning "Failed to make request!" even though I can see Firefox opening and I can see the tweets.

    Get newest tweets with keyword

    I'm using this awesome lib to get the newest tweets containing certain keywords. If you don't supply values for "since" and "until", you will get the same tweets every time - the first tweets of the current day. I managed to get around that by deleting the since and until parts from the search URL, but that is certainly not intended 😄

    Could you maybe add a live feature so your code doesn't get corrupted by me? 😆

    example for twitter topic

    I just tried this program to get a Twitter topic.

    Here is the final result:

    import json
    import textwrap

    from twitter_scraper_selenium.keyword import Keyword

    URL = 'https://twitter.com/i/topics/1415728297065861123'
    headless = False
    keyword = 'steamdeck'
    browser = 'firefox'
    keyword_bot = Keyword(keyword, browser=browser, url=URL, headless=headless, proxy=None, tweets_count=1000)
    data = keyword_bot.scrap()
    with open('steamdeck.json', 'w') as f:
        json.dump(json.loads(data), f, indent=2)

    # print the result, sorted by posting time
    width = 120
    for item in sorted(list(json.loads(data).values()), key=lambda x: x['posted_time']):
        wrap_text = '\n'.join(textwrap.wrap(item['content'], width=width))
        print(f"{item['posted_time']} {item['tweet_url']}\n{wrap_text}")
        print('-' * width)

    Some notes on this:

    • I got an error when initializing the webdriver, similar to scrapy/scrapy#5635
      • pip install 'pyOpenSSL==22.0.0' should fix it, per the linked issue
      • this is a little confusing because all import errors are caught with a general exception; see also example 1 below
        • if possible, just let the error happen and end the program
    • saving JSON will replace the old data, so be careful
      • it is possible to update the JSON data by loading it first if the file exists
      • the same thing happens with CSV
    • Selenium can use a custom profile folder; currently I have to edit either set_properties or set_driver_for_browser on driver_initialization.Initializer
    • any reason why Keyword.scrap has to return a JSON string? Why not just return a dict? When saving the data as CSV, it has to be decoded back into a dict

    example 1

    try:
        # assume an error on this line because importing webdriver failed
        from inspect import currentframe
    except Exception as ex:
        print(ex)

    # the error happens again because currentframe was never imported
    frameinfo = currentframe()

    AttributeError: 'Keyword' object has no attribute '_Keyword__driver' while trying to search

    When I execute the code example in the README file, it gives the attribute error mentioned in the title.

    Code:

    from twitter_scraper_selenium import scrap_keyword

    # scrape 10 posts by searching keyword "india" from 30th August till 31st August
    india = scrap_keyword(keyword="india", browser="firefox",
                          tweets_count=10, output_format="json", until="2021-08-31", since="2021-08-30")
    print(india)

    Proxy

    An authenticated proxy doesn't load correctly; if I check "whatismyipaddress" in the driver, I get my real IP.

    How to show progress for the scraping process? Right now there is no indication that it works until it is all done.

    I am still learning to use the package. It is VERY nice in terms of functionality.

    The delays before seeing anything can be 2-3 minutes or longer, depending on how many tweets are being fetched.

    Is there an easy way to monitor progress without slowing down the scraping process?

    I see that logging is built in. Where are the log files stored?
    Can the log be streamed to another terminal in my IDE? I am using PyCharm Pro.
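
    A minimal sketch, assuming the package logs through the standard logging module under names like twitter_scraper_selenium.driver_utils (as the messages quoted in other issues suggest): raising the log level streams progress to the console, and a FileHandler mirrors it to a file another terminal can tail:

    import logging

    # stream the package's INFO-level progress messages to the console
    logging.basicConfig(level=logging.INFO)

    # optionally mirror them to a file for `tail -f` in another terminal
    handler = logging.FileHandler("scrape.log")
    logging.getLogger("twitter_scraper_selenium").addHandler(handler)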

    why the very long wait in wait_until_completion()?

    I found that we spend 90% of the time in wait_until_completion(), because the delay time.sleep(randint(3, 5)) is 3 to 5 seconds, which seems very high - why is that?

    time.sleep(random.uniform(0.1, 0.2)) seems more than enough for my simple tests, but maybe I'm missing something?
