
twitter-scraper-selenium's Introduction

Twitter scraper selenium

A Python package to scrape Twitter's front end easily with Selenium.


Table of Contents

  1. Getting Started
  2. Usage
  3. Privacy
  4. License


Prerequisites

  • Internet Connection
  • Python 3.6+
  • Chrome or Firefox browser installed on your machine

Installation

    Installing from the source

    Download the source code or clone it with:

    git clone https://github.com/shaikhsajid1111/twitter-scraper-selenium
    

    Open a terminal inside the downloaded folder and run:


     python3 setup.py install
    

    Installing with PyPI

    pip3 install twitter-scraper-selenium
    

    Usage

    Available Functions in This Package - Summary

    Function Name Function Description Scraping Method Scraping Speed
    scrape_profile() Scrapes tweets from a Twitter user's profile Browser Automation Slow
    get_profile_details() Scrapes a Twitter user's details HTTP Request Fast
    scrape_profile_with_api() Scrapes tweets by Twitter profile username. It expects the username of the profile Browser Automation & HTTP Request Fast

    Note: the HTTP Request method sends requests directly to Twitter's API to collect data, while the Browser Automation method visits the page and scrolls while collecting the data.



    To scrape Twitter profile details:

    from twitter_scraper_selenium import get_profile_details
    
    twitter_username = "TwitterAPI"
    filename = "twitter_api_data"
    browser = "firefox"
    headless = True
    get_profile_details(twitter_username=twitter_username, filename=filename, browser=browser, headless=headless)

    Output:

    {
    	"id": 6253282,
    	"id_str": "6253282",
    	"name": "Twitter API",
    	"screen_name": "TwitterAPI",
    	"location": "San Francisco, CA",
    	"profile_location": null,
    	"description": "The Real Twitter API. Tweets about API changes, service issues and our Developer Platform. Don't get an answer? It's on my website.",
    	"url": "https:\/\/t.co\/8IkCzCDr19",
    	"entities": {
    		"url": {
    			"urls": [{
    				"url": "https:\/\/t.co\/8IkCzCDr19",
    				"expanded_url": "https:\/\/developer.twitter.com",
    				"display_url": "developer.twitter.com",
    				"indices": [
    					0,
    					23
    				]
    			}]
    		},
    		"description": {
    			"urls": []
    		}
    	},
    	"protected": false,
    	"followers_count": 6133636,
    	"friends_count": 12,
    	"listed_count": 12936,
    	"created_at": "Wed May 23 06:01:13 +0000 2007",
    	"favourites_count": 31,
    	"utc_offset": null,
    	"time_zone": null,
    	"geo_enabled": null,
    	"verified": true,
    	"statuses_count": 3656,
    	"lang": null,
    	"contributors_enabled": null,
    	"is_translator": null,
    	"is_translation_enabled": null,
    	"profile_background_color": null,
    	"profile_background_image_url": null,
    	"profile_background_image_url_https": null,
    	"profile_background_tile": null,
    	"profile_image_url": null,
    	"profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/942858479592554497\/BbazLO9L_normal.jpg",
    	"profile_banner_url": null,
    	"profile_link_color": null,
    	"profile_sidebar_border_color": null,
    	"profile_sidebar_fill_color": null,
    	"profile_text_color": null,
    	"profile_use_background_image": null,
    	"has_extended_profile": null,
    	"default_profile": false,
    	"default_profile_image": false,
    	"following": null,
    	"follow_request_sent": null,
    	"notifications": null,
    	"translator_type": null
    }

    get_profile_details() arguments:

    Argument Argument Type Description
    twitter_username String Twitter username
    output_filename String Name of the file where the output is stored
    output_dir String Directory where the output file should be saved
    proxy String Optional. Proxy to use for scraping. If the proxy requires authentication, the format is username:password@host:port.


    Keys of the output:

    Details of each key can be found here.



    To scrape a profile's tweets:

    In JSON format:

    from twitter_scraper_selenium import scrape_profile
    
    microsoft = scrape_profile(twitter_username="microsoft",output_format="json",browser="firefox",tweets_count=10)
    print(microsoft)

    Output:

    {
      "1430938749840629773": {
        "tweet_id": "1430938749840629773",
        "username": "Microsoft",
        "name": "Microsoft",
        "profile_picture": "https://twitter.com/Microsoft/photo",
        "replies": 29,
        "retweets": 58,
        "likes": 453,
        "is_retweet": false,
        "retweet_link": "",
        "posted_time": "2021-08-26T17:02:38+00:00",
        "content": "Easy to use and efficient for all \u2013 Windows 11 is committed to an accessible future.\n\nHere's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW ",
        "hashtags": [],
        "mentions": [],
        "images": [],
        "videos": [],
        "tweet_url": "https://twitter.com/Microsoft/status/1430938749840629773",
        "link": "https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC"
      },...
    }

    In CSV format:

    from twitter_scraper_selenium import scrape_profile
    
    
    scrape_profile(twitter_username="microsoft",output_format="csv",browser="firefox",tweets_count=10,filename="microsoft",directory="/home/user/Downloads")
    

    Output:

    tweet_id username name profile_picture replies retweets likes is_retweet retweet_link posted_time content hashtags mentions images videos tweet_url link
    1430938749840629773 Microsoft Microsoft https://twitter.com/Microsoft/photo 64 75 521 False 2021-08-26T17:02:38+00:00 Easy to use and efficient for all – Windows 11 is committed to an accessible future.

    Here's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW
    [] [] [] [] https://twitter.com/Microsoft/status/1430938749840629773 https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC

    ...
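
    A short sketch for loading the saved CSV back into Python (assuming the pandas package is installed; the path mirrors the directory and filename arguments above):

    import pandas as pd

    # path follows from directory="/home/user/Downloads" and filename="microsoft"
    tweets = pd.read_csv("/home/user/Downloads/microsoft.csv")
    print(tweets[["tweet_id", "posted_time", "likes"]].head())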



    scrape_profile() arguments:

    Argument Argument Type Description
    twitter_username String Twitter username of the account
    browser String Which browser to use for scraping. Only two are supported: Chrome and Firefox. Default is Firefox.
    proxy String Optional. Proxy to use for scraping. If the proxy requires authentication, the format is username:password@host:port.
    tweets_count Integer Number of posts to scrape. Default is 10.
    output_format String The output format, either JSON or CSV. Default is JSON.
    filename String If output_format is set to CSV, the filename parameter should be passed. If not passed, the filename will be the same as the username.
    directory String If output_format is set to CSV, the directory parameter may be passed. If not passed, the CSV file is saved in the current working directory.
    headless Boolean Whether to run the crawler headless. Default is True.


    Keys of the output

    Key Type Description
    tweet_id String Post identifier (an integer cast to string)
    username String Username of the profile
    name String Name of the profile
    profile_picture String Profile picture link
    replies Integer Number of replies to the tweet
    retweets Integer Number of retweets of the tweet
    likes Integer Number of likes of the tweet
    is_retweet Boolean Is the tweet a retweet?
    retweet_link String If it is a retweet, the retweet link; otherwise an empty string
    posted_time String Time when the tweet was posted, in ISO 8601 format
    content String Content of the tweet as text
    hashtags Array Hashtags present in the tweet, if any
    mentions Array Mentions present in the tweet, if any
    images Array Image links, if present in the tweet
    videos Array Video links, if present in the tweet
    tweet_url String URL of the tweet
    link String Link to any external website present inside the tweet
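
    A minimal sketch of consuming these keys, assuming scrape_profile returns the JSON document as a string (as in the example above):

    import json
    from twitter_scraper_selenium import scrape_profile

    data = json.loads(scrape_profile(twitter_username="microsoft",
                                     output_format="json", browser="firefox", tweets_count=10))
    for tweet_id, tweet in data.items():
        # each value carries the keys documented in the table above
        print(tweet["posted_time"], tweet["tweet_url"], tweet["likes"])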


    To scrape a profile's tweets with the API:

    from twitter_scraper_selenium import scrape_profile_with_api
    
    scrape_profile_with_api('elonmusk', output_filename='musk', tweets_count=100)

    scrape_profile_with_api() Arguments:

    Argument Argument Type Description
    username String Twitter profile username
    tweets_count Integer Number of tweets to scrape.
    output_filename String Name of the file where the output is stored.
    output_dir String Directory where the output file should be saved.
    proxy String Optional. Proxy to use for scraping. If the proxy requires authentication, the format is username:password@host:port.
    browser String Which browser to use for extracting the GraphQL key. Default is firefox.
    headless Boolean Whether to run the browser in headless mode.

    Output:

    {
      "1608939190548598784": {
        "tweet_url" : "https://twitter.com/elonmusk/status/1608939190548598784",
        "tweet_details":{
          ...
        },
        "user_details":{
          ...
        }
      }, ...
    }
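
    A small follow-up sketch, assuming scrape_profile_with_api writes its result to <output_filename>.json (inferred from the output_filename argument, not confirmed by this page):

    import json

    # "musk.json" assumes output_filename='musk' from the call above
    with open("musk.json") as f:
        tweets = json.load(f)
    print(len(tweets), "tweets scraped")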


    Using the scraper with a proxy (HTTP proxy)

    Just pass the proxy argument to the function.

    from twitter_scraper_selenium import scrape_profile
    
    scrape_profile("elonmusk", headless=False, proxy="66.115.38.247:5678", output_format="csv",filename="musk") #In IP:PORT format

    Proxy that requires authentication:

    from twitter_scraper_selenium import scrape_profile
    
    microsoft_data = scrape_profile(twitter_username="microsoft", browser="chrome", tweets_count=10, output_format="json",
                          proxy="sajid:[email protected]:5678")  # username:password@IP:PORT
    print(microsoft_data)
    


    Privacy

    This scraper only scrapes public data available to an unauthenticated user and does not have the capability to scrape anything private.



    LICENSE

    MIT

    twitter-scraper-selenium's People

    Contributors

    marcusnk237, rachmadaniharyono, shaikhsajid1111, weltolk


    twitter-scraper-selenium's Issues

    twitter_scraper_selenium.scraping_utilities:Error at find_x_guest_token: 'guest_token'

    I'm trying to make an API, but this error keeps troubling me. The snippet below works on my local machine but breaks when I ship it to Render using Docker.

    import json

    import flask
    from twitter_scraper_selenium import get_profile_details


    def init(app: flask.app.Flask):
        @app.route("/user/<string:username>")
        def user(username):
            filename = "get_profile_details"
            get_profile_details(
                twitter_username=username,
                filename=filename,
            )
            with open(filename + ".json") as f:
                data = json.load(f)
            return data
    

    can't set firefox profile path

    from twitter_scraper_selenium.topic import scrap_topic
    scrap_topic(
        filename='linux',
        url='https://twitter.com/i/topics/848959431836487680',
        headless=False,
        browser_profile='/home/r3r/Documents/selenium_profile'
    )

    output

    > python steamdeck.py
    INFO:root:Loading Profile from /home/r3r/Documents/selenium_profile
    [WDM] - Driver [/home/r3r/.wdm/drivers/geckodriver/linux64/v0.31.0/geckodriver] found in cache
    INFO:WDM:Driver [/home/r3r/.wdm/drivers/geckodriver/linux64/v0.31.0/geckodriver] found in cache
    INFO:seleniumwire.storage:Using default request storage
    INFO:seleniumwire.backend:Created proxy listening on 127.0.0.1:44371

    It freezes after this output.

    Looking at the Selenium source code, setting a profile this way is deprecated: https://github.com/SeleniumHQ/selenium/blob/a4995e2c096239b42c373f26498a6c9bb4f2b3e7/py/selenium/webdriver/firefox/options.py#L101-L105

    I had better success using the profile argument, just like Chrome: https://github.com/rachmadaniHaryono/twitter-scraper-selenium/tree/bugfix/profile

    But I have only tested it on Linux.
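
    A sketch of that workaround under Selenium 4, where the Options.profile property replaces the deprecated profile handling (a minimal standalone example, not the package's actual initialization code):

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    # Selenium 4 accepts a profile directory path on the Options object
    options.profile = '/home/r3r/Documents/selenium_profile'
    driver = webdriver.Firefox(options=options)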

    "Tweets did not appear!"

    Code
    scrap_keyword(keyword="nft", browser="chrome", tweets_count=999999, until="2022-06-30", since="2022-06-29",output_format="csv",filename="nft")

    Error Message
    [WDM] - Current google-chrome version is 103.0.5060
    [WDM] - Get LATEST driver version for 103.0.5060

    [WDM] - Driver [./.wdm/drivers/chromedriver/mac64/103.0.5060.53/chromedriver] found in cache
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!

    Other Information
    The packages in requirement.txt have already been installed.

    directory option not working

    ValueError: not enough values to unpack (expected 2, got 0)
    Traceback (most recent call last):
      File "/home/arch/Code/Commisions/CryptoGuys/./src/util/scrape.py", line 3, in <module>
        scrap_topic(filename="tweets", url='https://twitter.com/i/topics/1468157909318045697',browser="firefox", tweets_count=10, directory='./src/util')
      File "/usr/lib/python3.10/site-packages/twitter_scraper_selenium/topic.py", line 60, in scrap_topic
        output_path = directory / "{}.json".format(filename)
    TypeError: unsupported operand type(s) for /: 'str' and 'str'

    AttributeError: 'str' object has no attribute 'close'

    Running on my VPS: Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-52-generic x86_64)

    [WDM] - Driver [/root/.wdm/drivers/geckodriver/linux64/v0.32.2/geckodriver] found in cache
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/profile.py", line 118, in scrap
        self.__start_driver()
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/profile.py", line 40, in __start_driver
        self.browser, self.headless, self.proxy, self.browser_profile).init()
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/driver_initialization.py", line 104, in init
        driver = self.set_driver_for_browser(self.browser_name)
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/driver_initialization.py", line 97, in set_driver_for_browser
        return webdriver.Firefox(service=FirefoxService(executable_path=GeckoDriverManager().install()), options=self.set_properties(browser_option))
      File "/usr/local/lib/python3.10/dist-packages/seleniumwire/webdriver.py", line 179, in __init__
        super().__init__(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/firefox/webdriver.py", line 197, in __init__
        super().__init__(command_executor=executor, options=options, keep_alive=True)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/webdriver.py", line 288, in __init__
        self.start_session(capabilities, browser_profile)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/webdriver.py", line 381, in start_session
        response = self.execute(Command.NEW_SESSION, parameters)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/webdriver.py", line 444, in execute
        self.error_handler.check_response(response)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/errorhandler.py", line 249, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.WebDriverException: Message: Process unexpectedly closed with status 1
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/root/crypto-ordinals/twitter_feeds.py", line 67, in <module>
        tweets = scrape_profile(twitter_username="ordswapbot",output_format="json",browser="firefox",tweets_count=10,headless=False)
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/profile.py", line 197, in scrape_profile
        data = profile_bot.scrap()
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/profile.py", line 128, in scrap
        self.__close_driver()
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/profile.py", line 43, in __close_driver
        self.__driver.close()
    AttributeError: 'str' object has no attribute 'close'
    
    

    Not scraping every tweet from a user

    Hello, I am trying to scrape every tweet from a user. From the Twitter page, I can see that they have tweeted more than 5000 times. However, even when I set my tweets_count to 5000, I am getting fewer than 1000 tweets from that user.

    My code is below:

    scrape_profile(twitter_username = "elonmusk", output_format ="csv", tweets_count = 6000, browser = "chrome", filename = "elonmusk")

    (Note that @ElonMusk is just a stand-in example)

    Tweets did not appear

    Hello, I keep getting this error. I am new to this package, so I do not know what I am missing.

    "Tweets did not appear!, Try setting headless=False to see what is happening"

    Install help: selenium.common.exceptions.SessionNotCreatedException

    Hi there team,
    Thanks for this amazing lib. I've used it a few times already with no problems, but today, trying to make it work again, I got stuck on a geckodriver error.

    ...
    selenium.common.exceptions.SessionNotCreatedException: Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line
    ...
    

    I created a Docker image to get a fresh install and make sure it's not my setup. I use macOS and get the same error.

    Here are the files and how to reproduce:

    Dockerfile

    # Use the official Python 3.9.16 image as the base image
    FROM python:3.9.16
    
    # Set the working directory to /app
    WORKDIR /app
    
    # Install the necessary dependencies
    RUN pip install twitter-scraper-selenium
    
    # Set up the shared volume
    VOLUME ["/app"]
    
    # Set the default command to run your script
    CMD [ "python", "scrapper.py" ]
    
    

    scrapper.py

    from twitter_scraper_selenium import scrape_profile
    
    microsoft = scrape_profile(twitter_username="microsoft",output_format="json",browser="firefox",tweets_count=10)
    print(microsoft)
    

    To run, after installing and setting up Docker:

    docker build -t twitter-scraper .
    docker run -v $(pwd):/app twitter-scraper
    

    The below error happens on docker run, and I couldn't find anything useful on the internet to help me fix it. Can you help me understand what is happening? It seems to be with geckodriver, not exactly with twitter-scraper-selenium, but I'm not sure where else to look.

    Full logs below:

    [WDM] - There is no [linux64] geckodriver for browser  in cache
    [WDM] - Getting latest mozilla release info for v0.33.0
    [WDM] - Trying to download new driver from https://github.com/mozilla/geckodriver/releases/download/v0.33.0/geckodriver-v0.33.0-linux64.tar.gz
    [WDM] - Driver has been saved in cache [/root/.wdm/drivers/geckodriver/linux64/v0.33.0]
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/twitter_scraper_selenium/profile.py", line 118, in scrap
        self.__start_driver()
      File "/usr/local/lib/python3.9/site-packages/twitter_scraper_selenium/profile.py", line 39, in __start_driver
        self.__driver = Initializer(
      File "/usr/local/lib/python3.9/site-packages/twitter_scraper_selenium/driver_initialization.py", line 104, in init
        driver = self.set_driver_for_browser(self.browser_name)
      File "/usr/local/lib/python3.9/site-packages/twitter_scraper_selenium/driver_initialization.py", line 97, in set_driver_for_browser
        return webdriver.Firefox(service=FirefoxService(executable_path=GeckoDriverManager().install()), options=self.set_properties(browser_option))
      File "/usr/local/lib/python3.9/site-packages/seleniumwire/webdriver.py", line 179, in __init__
        super().__init__(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/firefox/webdriver.py", line 197, in __init__
        super().__init__(command_executor=executor, options=options, keep_alive=True)
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 288, in __init__
        self.start_session(capabilities, browser_profile)
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 381, in start_session
        response = self.execute(Command.NEW_SESSION, parameters)
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 444, in execute
        self.error_handler.check_response(response)
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/errorhandler.py", line 249, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.SessionNotCreatedException: Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/app/scrapper.py", line 21, in <module>
        microsoft = scrape_profile(twitter_username="microsoft",output_format="json",browser="firefox",tweets_count=10)
      File "/usr/local/lib/python3.9/site-packages/twitter_scraper_selenium/profile.py", line 197, in scrape_profile
        data = profile_bot.scrap()
      File "/usr/local/lib/python3.9/site-packages/twitter_scraper_selenium/profile.py", line 128, in scrap
        self.__close_driver()
      File "/usr/local/lib/python3.9/site-packages/twitter_scraper_selenium/profile.py", line 43, in __close_driver
        self.__driver.close()
    AttributeError: 'str' object has no attribute 'close'
    

    Appreciate any help
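
    A hedged guess at the cause, going by the SessionNotCreatedException message: the python:3.9.16 base image ships no Firefox, so geckodriver has no browser binary to launch. Installing one in the Dockerfile may resolve it (firefox-esr is the package name on the Debian-based python images; this is an assumption, not a confirmed fix):

    # Install Firefox so geckodriver has a browser binary to launch
    RUN apt-get update && apt-get install -y firefox-esr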

    Are Musk's new limitations impacting this module?

    These lines of code worked fine just a few weeks ago, and now I'm getting TypeError: object of type 'NoneType' has no len().

    This is the code I'm using:
    from twitter_scraper_selenium import get_profile_details

    twitter_username = "tim_cook"
    filename = "twitter_dummy_ceo"
    get_profile_details(twitter_username=twitter_username, filename=filename)

    Incorrect retweet 'tweet_url'

    When scraping a profile, retweets get an incorrect 'tweet_url'

    In profile.py:
    tweet_url = "https://twitter.com/{}/status/{}".format(username, status)
    Here, status is the same as username when the tweet is a retweet.
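
    A hedged sketch of one possible fix (find_tweet_url is a hypothetical helper, not the package's actual code): read the permalink from the tweet's status anchor instead of formatting username and status together:

    from selenium.webdriver.common.by import By

    def find_tweet_url(tweet):
        # the timestamp anchor carries the canonical /<author>/status/<id>
        # permalink, which stays correct even when the tweet is a retweet
        anchor = tweet.find_element(By.CSS_SELECTOR, "a[href*='/status/']")
        return anchor.get_attribute("href")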

    Incomprehensible Error

    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/keyword.py", line 128, in scrap
        self.start_driver()
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/keyword.py", line 52, in start_driver
        self.browser, self.headless, self.proxy, self.browser_profile).init()
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/driver_initialization.py", line 104, in init
        driver = self.set_driver_for_browser(self.browser_name)
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/driver_initialization.py", line 97, in set_driver_for_browser
        return webdriver.Firefox(service=FirefoxService(executable_path=GeckoDriverManager().install()), options=self.set_properties(browser_option))
      File "/usr/local/lib/python3.10/dist-packages/seleniumwire/webdriver.py", line 178, in __init__
        super().__init__(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/firefox/webdriver.py", line 177, in __init__
        super().__init__(
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/webdriver.py", line 277, in __init__
        self.start_session(capabilities, browser_profile)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/webdriver.py", line 370, in start_session
        response = self.execute(Command.NEW_SESSION, parameters)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/webdriver.py", line 435, in execute
        self.error_handler.check_response(response)
      File "/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.WebDriverException: Message: Failed to decode response from marionette
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/root/Bots/DiscordBots/TwitterTopics/MTC/src/util/scrape.py", line 6, in <module>
        scrape_topic(filename="tweets", url='https://twitter.com/i/topics/1468157909318045697',browser="firefox", tweets_count=25)
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/topic.py", line 53, in scrape_topic
        data = keyword_bot.scrap()
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/keyword.py", line 140, in scrap
        self.close_driver()
      File "/usr/local/lib/python3.10/dist-packages/twitter_scraper_selenium/keyword.py", line 55, in close_driver
        self.driver.close()
    AttributeError: 'str' object has no attribute 'close'
    
    I do not understand this error.

    Failed to make request!

    from twitter_scraper_selenium import scrape_keyword_with_api

    Username = input("Account: ")
    tweets = int(input("How many tweets: "))
    path = "C:/Users/HP Probook/PycharmProjects/scrap-scripts/"

    data = scrape_keyword_with_api(f"(from:{Username})")
    print(data)

    and got this error:

    2023-07-26 15:03:21,739 - twitter_scraper_selenium.keyword_api - WARNING - Failed to make request!

    not enough values to unpack

    Hi guys, I pulled the code again from GitHub and reinstalled it, but why am I still getting the "not enough values to unpack" error? (The full traceback was attached as a screenshot.)

    timeout exception

    [WDM] - Driver [C:\Users\HP Probook\.wdm\drivers\geckodriver\win64\v0.33.0\geckodriver.exe] found in cache
    2023-07-13 13:12:40,345 - twitter_scraper_selenium.driver_utils - ERROR - Tweets did not appear!, Try setting headless=False to see what is happening
    Traceback (most recent call last):
      File "C:\Users\HP Probook\PycharmProjects\firstproject\venv\lib\site-packages\twitter_scraper_selenium\driver_utils.py", line 35, in wait_until_tweets_appear
        WebDriverWait(driver, 80).until(EC.presence_of_element_located(
      File "C:\Users\HP Probook\PycharmProjects\firstproject\venv\lib\site-packages\selenium\webdriver\support\wait.py", line 95, in until
        raise TimeoutException(message, screen, stacktrace)
    selenium.common.exceptions.TimeoutException: Message:
    Stacktrace:
    RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
    WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:183:5
    NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:395:5
    element.find/</<@chrome://remote/content/marionette/element.sys.mjs:134:16

    More Examples and Documentation Needed

    More Examples and Documentation are needed...

    I am new to using this package. I see a lot of functionality, but not enough examples of how to use it.
    It would be nice if there was a ReadTheDocs.io website for it.

    I am interested in pulling images and metadata associated with tweets. The element_finder.py file appears to pull images, but it is unclear how to call it separately, or whether that is the best way to use it in this package.

    @staticmethod
    def find_images(tweet) -> Union[list, None]:
        """finds all images of the tweet"""
        try:
            image_element = tweet.find_elements(By.CSS_SELECTOR,
                                                'div[data-testid="tweetPhoto"]')
            images = []
            for image_div in image_element:
                href = image_div.find_element(By.TAG_NAME,
                                              "img").get_attribute("src")
                images.append(href)
            return images
        except Exception as ex:
            logger.exception("Error at method find_images : {}".format(ex))
            return []
    

    Many of the functions do not have docstrings. I am going through them now to try to add descriptions for my own benefit.

    I also want to add things to the outputs to automate data capture. For example, I want to time-stamp output files, because during testing I will have to do multiple runs to make sure that I got all the information I was looking for.

    For JSON files, I would like to pretty-print them before saving them, because they come out in a one-liner format in the file.

    I am not sure where to insert these functions. I will keep browsing the package structure to see if I can figure out these things, and add helpful docstrings where they are missing.
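
    A small post-processing sketch for the time-stamping and pretty-printing mentioned above (save_pretty is illustrative, not part of the package):

    import json
    from datetime import datetime

    def save_pretty(data: dict, basename: str) -> str:
        # timestamp the filename so repeated test runs don't overwrite
        # each other, and pretty-print instead of one-line JSON
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        path = f"{basename}_{stamp}.json"
        with open(path, "w") as f:
            json.dump(data, f, indent=4)
        return path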

    Unable to launch the lib

    Hello,

    I would like to use the lib for some development, but I'm facing issues even launching a project with it.
    My project is built with Poetry, so I'm handling the libs through that tool. My config file looks like this:

    [tool.poetry]
    name = "poetrytestproject"
    version = "0.1.0"
    description = "Test project"
    authors = ["Kamigaku <[email protected]>"]
    
    [tool.poetry.dependencies]
    python = "^3.10"
    matplotlib = "^3.5.3"
    Unidecode = "^1.3.4"
    numpy = "^1.23.4"
    scipy = "^1.9.3"
    tweepy = "^4.12.1"
    Pillow = "^9.2.0"
    fonttools = "^4.38.0"
    twitter-scraper-selenium = "^4.1.2"
    
    [tool.poetry.dev-dependencies]
    
    [build-system]
    requires = ["poetry-core>=1.0.0"]
    build-backend = "poetry.core.masonry.api"
    

    When I launch the project with a simple import of your package, I get this error:

    Traceback (most recent call last):
      File "G:\Code\Python\PoetryTestProject\main_selenium.py", line 6, in <module>
        from twitter_scraper_selenium import scrape_profile
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\twitter_scraper_selenium\__init__.py", line 5, in <module>
        from .keyword import scrape_keyword
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\twitter_scraper_selenium\keyword.py", line 4, in <module>
        from .driver_initialization import Initializer
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\twitter_scraper_selenium\driver_initialization.py", line 12, in <module>
        from seleniumwire import webdriver
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\webdriver.py", line 27, in <module>
        from seleniumwire import backend, utils
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\backend.py", line 4, in <module>
        from seleniumwire.server import MitmProxy
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\server.py", line 5, in <module>
        from seleniumwire.handler import InterceptRequestHandler
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\handler.py", line 5, in <module>
        from seleniumwire import har
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\har.py", line 11, in <module>
        from seleniumwire.thirdparty.mitmproxy import connections
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\thirdparty\mitmproxy\connections.py", line 10, in <module>
        from seleniumwire.thirdparty.mitmproxy.net import tls, tcp
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\thirdparty\mitmproxy\net\tls.py", line 43, in <module>
        "SSLv2": (SSL.SSLv2_METHOD, BASIC_OPTIONS),
    AttributeError: module 'OpenSSL.SSL' has no attribute 'SSLv2_METHOD'. Did you mean: 'SSLv23_METHOD'?
    
    Process finished with exit code 1
    

    I've read in multiple spots on the internet (and even on your Facebook scraper project) that the issue might come from the "PyOpenSSL" package. I downgraded its version to "21.0.0" using Poetry, and the error changes to:

    Traceback (most recent call last):
      File "G:\Code\Python\PoetryTestProject\main_selenium.py", line 6, in <module>
        from twitter_scraper_selenium import scrape_profile
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\twitter_scraper_selenium\__init__.py", line 5, in <module>
        from .keyword import scrape_keyword
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\twitter_scraper_selenium\keyword.py", line 4, in <module>
        from .driver_initialization import Initializer
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\twitter_scraper_selenium\driver_initialization.py", line 12, in <module>
        from seleniumwire import webdriver
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\webdriver.py", line 27, in <module>
        from seleniumwire import backend, utils
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\backend.py", line 4, in <module>
        from seleniumwire.server import MitmProxy
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\server.py", line 5, in <module>
        from seleniumwire.handler import InterceptRequestHandler
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\handler.py", line 5, in <module>
        from seleniumwire import har
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\har.py", line 11, in <module>
        from seleniumwire.thirdparty.mitmproxy import connections
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\thirdparty\mitmproxy\connections.py", line 9, in <module>
        from seleniumwire.thirdparty.mitmproxy import certs, exceptions, stateobject
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\seleniumwire\thirdparty\mitmproxy\certs.py", line 10, in <module>
        import OpenSSL
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\OpenSSL\__init__.py", line 8, in <module>
        from OpenSSL import crypto, SSL
      File "C:\Users\Aurelien\AppData\Local\pypoetry\Cache\virtualenvs\poetrytestproject-3wjWjU16-py3.10\lib\site-packages\OpenSSL\crypto.py", line 3279, in <module>
        _lib.OpenSSL_add_all_algorithms()
    AttributeError: module 'lib' has no attribute 'OpenSSL_add_all_algorithms'
    
    Process finished with exit code 1
    

    Any idea what might cause this?
    A friend of mine is having the same issue.
    I'm using Python version 3.10.0.

    Thanks!

    Scraping only 4 tweets no matter the tweet count

    As mentioned in the title, I'm getting only 4 tweets using this code:

    from twitter_scraper_selenium import scrape_profile
    import os
    import json

    parent_dir = os.getcwd()  # parent_dir was not defined in the original snippet
    account = input("Account: ")
    tweets = int(input("How many tweets: "))
    path = os.path.join(parent_dir, account)
    if not os.path.exists(path):
        os.mkdir(path)
    data = scrape_profile(twitter_username=account, output_format="json", browser="firefox", tweets_count=tweets)
    print(data)
    parsed = json.loads(data)
    json_data = json.dumps(parsed, indent=4)
    with open(path + "\\" + account + ".json", "w") as outfile:
        outfile.write(json_data)
    

    And this is the print output: https://pastebin.com/p1UuxZFa

    Thank you.

    The address returned by calling scrape_profile is wrong!

    Hi there. The address returned by calling scrape_profile(twitter_username="blvckledge", output_format="json", browser="chrome", tweets_count=10) for videos is blob:https://twitter.com/facd7bdb-73ec-49ff-9492-993a165a3585, but the actual address is https://video.twimg.com/ext_tw_video/1673328923311058944/pu/vid/848x464/KNnl7Rqk_MfiX4X9.mp4?tag=12. Why is an incorrect address being returned? Additionally, I receive errors when calling scrape_topic_with_api() and scrape_profile_with_api(). What could be causing this?

    These functions will be removed in a new release

    • scrape_profile() - Its alternative is scrape_profile_with_api()
    • scrape_keyword() - Its alternative is scrape_keyword_with_api()
    • scrape_topic() - Its alternative is scrape_topic_with_api()

    Returns only "Tweets did not appear!"

    Environment Information

    OS Version:

    Edition	Windows 10 Pro
    Version	21H2
    Installed on	‎2022.‎04.‎10
    OS build	19044.1706
    Experience	Windows Feature Experience Pack 120.2212.4170.0
    

    Python Version:

    PS C:\Users\xxx> python -V
    Python 3.10.5

    Twitter scraper selenium Version:

    0.1.6
    

    The packages in requirement.txt are already installed.

    Code

    from twitter_scraper_selenium import scrap_profile
    
    microsoft = scrap_profile(twitter_username="Microsoft", output_format="json", browser="firefox", tweets_count=10)
    print(microsoft)

    Error Information

    C:\Users\xxx\PycharmProjects\venv\Scripts\python.exe
    C:/Users/xxx/PycharmProjects/Twitter_Selenium.py
    [WDM] - Driver [C:\Users\xxx\.wdm\drivers\geckodriver\win64\v0.31.0\geckodriver.exe] found in cache
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    Tweets did not appear!
    {}
    
    Process finished with exit code 0
    

    log

    geckodriver.log:

    1655093631147	geckodriver	INFO	Listening on 127.0.0.1:23023
    1655093631171	mozrunner::runner	INFO	Running command: "C:\\Program Files\\Mozilla Firefox\\firefox.exe" "--marionette" "--headless" "--no-sandbox" "--disable-dev-shm-usage" "--ignore-certificate-errors" "--disable-gpu" "--log-level=3" "--disable-notifications" "--disable-popup-blocking" "-no-remote" "-profile" "C:\\Users\\xxx\\AppData\\Local\\Temp\\rust_mozprofileMChkmM"
    *** You are running in headless mode.
    1655093631395	Marionette	INFO	Marionette enabled
    [GFX1-]: RenderCompositorSWGL failed mapping default framebuffer, no dt
    console.warn: SearchSettings: "get: No settings file exists, new profile?" (new NotFoundError("Could not open the file at C:\\Users\\xxx\\AppData\\Local\\Temp\\rust_mozprofileMChkmM\\search.json.mozlz4", (void 0)))
    1655093632149	Marionette	INFO	Listening on port 23060
    Read port: 23060
    1655093632179	RemoteAgent	WARN	TLS certificate errors will be ignored for this session
    1655093632180	RemoteAgent	INFO	Proxy settings initialised: {"proxyType":"manual","httpProxy":"127.0.0.1:23022","noProxy":[],"sslProxy":"127.0.0.1:23022"}
    1655093781898	Marionette	INFO	Stopped listening on port 23060
    
    ###!!! [Parent][PImageBridgeParent] Error: RunMessage(msgname=PImageBridge::Msg_WillClose) Channel closing: too late to send/recv, messages will be lost
    

    scrape hashtag

    Does scraping by hashtag work?
    I want data on people who use a specific hashtag, to find out who uses it the most.
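
    A hedged sketch, assuming the keyword scraper accepts a hashtag string as its keyword and returns a JSON string like scrape_profile (suggested by the keyword examples elsewhere on this page, not confirmed):

    import json
    from collections import Counter
    from twitter_scraper_selenium import scrape_keyword

    data = json.loads(scrape_keyword(keyword="#linux", browser="firefox",
                                     tweets_count=100, output_format="json"))
    # count which usernames post the hashtag most often
    top_users = Counter(tweet["username"] for tweet in data.values())
    print(top_users.most_common(10))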

    pip installed and running example, throws error.

    from twitter_scraper_selenium import scrap_keyword

    # scrape 10 posts by searching keyword "movie" from 29th August till 31st August
    india = scrap_keyword(keyword="movie", browser="chrome",
                          tweets_count=10, output_format="json", until="2021-08-31", since="2021-08-29")
    print(india)

    is the code that I am trying to run.
    At first, the readout begins with:

    Current google-chrome version is 103.0.5060
    Get LATEST driver version for 103.0.5060
    [WDM] - Driver found in cache

    However, I believe the following is where the error is occurring:

    create_client_context
    param = SSL._lib.SSL_CTX_get0_param(context._context)
    AttributeError: module 'lib' has no attribute 'SSL_CTX_get0_param'

    Output exceeds the size limit. Open the full output data in a text editor
    Message: unknown error: net::ERR_CONNECTION_CLOSED
    (Session info: headless chrome=103.0.5060.114)

    README.MD error with installation via pip

    Hello,

    I got that installation error with

    pip install twitter-scraper-selenium

    Error:
    Collecting twitter-scraper-selenium
    Using cached twitter_scraper_selenium-0.1.3.tar.gz (14 kB)
    Preparing metadata (setup.py) ... error
    error: subprocess-exited-with-error

    × python setup.py egg_info did not run successfully.
    │ exit code: 1
    ╰─> [6 lines of output]
    Traceback (most recent call last):
      File "<string>", line 2, in <module>
      File "<pip-setuptools-caller>", line 34, in <module>
      File "/tmp/pip-install-sw0o3qs2/twitter-scraper-selenium_e1bf4012aa9340ebaba22e04e0ba72df/setup.py", line 3, in <module>
        with open("README.MD", "r") as file:
    FileNotFoundError: [Errno 2] No such file or directory: 'README.MD'
    [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
    error: metadata-generation-failed

    × Encountered error while generating package metadata.
    ╰─> See above for output.

    note: This is an issue with the package mentioned above, not pip.
    hint: See above for details.

    It would be good if you checked the encoding and handling of README.MD. Also, the standard is README.md (lowercase extension *.md).

    login support before executing APIs

    Add support for logging into Twitter using email/username/password.
    Once logged in, can more data be accessed?
    Is the session going to be more stable?

    Install fails

    Please check, something is wrong (the error was shown in an attached screenshot). Thanks for the support.

    Failed to make request!

    import os
    from twitter_scraper_selenium import scrape_topic_with_api
    
    scrape_topic_with_api(URL='https://twitter.com/i/topics/1468157909318045697', output_filename='tweets', tweets_count=10, headless=False)
    
    if os.path.isfile("./tweets.json") == False:
        scrape_topic_with_api(URL='https://twitter.com/i/topics/1468157909318045697', output_filename='tweets', tweets_count=10, headless=False)
    
    if len(open("./tweets.json").read()) < 100:
        scrape_topic_with_api(URL='https://twitter.com/i/topics/1468157909318045697', output_filename='tweets', tweets_count=10, headless=False)

    I'm repeatedly getting the warning "Failed to make request!" even though I can see Firefox opening and I can see the tweets.

    Get newest tweets with keyword

    I'm using this awesome lib to get the newest tweets containing certain keywords. If you don't supply values for "since" and "until", you will get the same tweets every time - the first tweets of the current day. I managed to get around that by deleting the since and until parts from the search URL, but that is certainly not intended 😄

    Could you maybe add a live feature so your code doesn't get corrupted by me? 😆

    example for twitter topic

    I just tried this program to get a Twitter topic.

    Here is the final result:

    import json
    import textwrap

    from twitter_scraper_selenium.keyword import Keyword

    URL = 'https://twitter.com/i/topics/1415728297065861123'
    headless = False
    keyword = 'steamdeck'
    browser = 'firefox'
    keyword_bot = Keyword(keyword, browser=browser, url=URL, headless=headless, proxy=None, tweets_count=1000)
    data = keyword_bot.scrap()
    with open('steamdeck.json', 'w') as f:
        json.dump(json.loads(data), f, indent=2)

    # print the result, sorted by posting time
    width = 120
    for item in sorted(list(json.loads(data).values()), key=lambda x: x['posted_time']):
        wrap_text = '\n'.join(textwrap.wrap(item['content'], width=width))
        print(f"{item['posted_time']} {item['tweet_url']}\n{wrap_text}")
        print('-' * width)

    Some notes on this:

    • I got an error when initializing the webdriver, similar to scrapy/scrapy#5635
      • pip install 'pyOpenSSL==22.0.0' should fix it, per the linked issue
      • this is a little confusing because all import errors are caught with a general exception; see also example 1 below
        • if possible, just let the error happen and end the program
    • saving JSON will replace the old data, so be careful
      • it is possible to update the JSON data by loading it first if the file exists
      • the same thing happens with CSV
    • Selenium can use a custom profile folder; currently I have to edit either set_properties or set_driver_for_browser on driver_initialization.Initializer
    • any reason why Keyword.scrap has to return a JSON string? Why not just return a dict? When saving the data as CSV, it has to be decoded back into a dict

    example 1

    try:
        # assume an error on this line because importing webdriver failed
        from inspect import currentframe
    except Exception as ex:
        print(ex)

    # the error happens again because currentframe was never imported
    frameinfo = currentframe()

    AttributeError: 'Keyword' object has no attribute '_Keyword__driver' while trying to search

    When I execute the code example in the README file, it gives the attribute error mentioned in the title.

    Code:

    from twitter_scraper_selenium import scrap_keyword

    # scrape 10 posts by searching keyword "india" from 30th August till 31st August
    india = scrap_keyword(keyword="india", browser="firefox",
                          tweets_count=10, output_format="json", until="2021-08-31", since="2021-08-30")
    print(india)

    Proxy

    An authenticated proxy doesn't load correctly; if I check "whatismyipaddress" in the driver, I get my real IP.

    How to show progress for the scraping process? Right now there is no indication that it works until it is all done.

    I am still learning to use the package. It is VERY nice in terms of functionality.

    The delays before seeing anything can be 2-3 minutes or longer, depending on how many tweets are being fetched.

    Is there an easy way to monitor progress without slowing down the scraping process?

    I see that logging is built in. Where are the log files stored?
    Can the log be streamed to another terminal in my IDE? I am using PyCharm Pro.
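
    A minimal sketch, assuming the package logs through the standard logging module under names like twitter_scraper_selenium.driver_utils (as the messages quoted in other issues suggest): raising the log level streams progress to the console, and a FileHandler mirrors it to a file another terminal can tail:

    import logging

    # stream the package's INFO-level progress messages to the console
    logging.basicConfig(level=logging.INFO)

    # optionally mirror them to a file for `tail -f` in another terminal
    handler = logging.FileHandler("scrape.log")
    logging.getLogger("twitter_scraper_selenium").addHandler(handler)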

    why the very long wait in wait_until_completion()?

    I found that we spend 90% of the time in wait_until_completion(), because the delay time.sleep(randint(3, 5)) is 3 to 5 seconds, which seems very high - why is that?

    time.sleep(random.uniform(0.1, 0.2)) seems more than enough for my simple tests, but maybe I'm missing something?
