events-scrapper's Introduction

Event-scrapper

A Python web-scraping script that generates an iCal calendar file from a list of Facebook event pages. The goal is to retrieve events without having to spend hours on FB. Unfortunately, Facebook has won: the majority of events are announced and shared on FB, even when the bands or venues have their own website.

The Facebook Graph API no longer supports events, so this script uses Selenium. Using Selenium, and web scraping in general, is contrary to Facebook's terms of use. If you stay reasonable about the number of pages and the frequency of use, and you don't use Selenium in headless mode, Tor or another proxy, it should be okay. If you scrape thousands of pages every 30 minutes in headless mode from a Tor gateway, at best the response time will increase until it becomes unusable; otherwise the IP address will be blacklisted.

For the kinkiest among you, the script supports authentication to access pages reserved for adults. Multi-factor authentication is supported (if you are not in headless mode). If you use the authentication function and play dumb, you risk having your account slowed down, seeing the number of captchas increase, or even getting banned. You've been warned, there's no need to come crying to me afterwards.

Facebook uses many browser identification techniques to detect that Firefox is running headless, including SVG and WebGL rendering. So they know that something odd is happening when you use headless mode.

I use this script once a day for about 40 pages and I haven't had any problems so far.

Use

Scrape a specific events page: python3 events-scraper.py -e https://www.facebook.com/somepage/events/

Scrape all the pages listed in a file: python3 events-scraper.py -f event_velo.txt

Scrape all the pages listed in a file and log in with the credentials stored in credentials.txt: python3 events-scraper.py -f event_velo.txt -c

usage: events-scraper.py [-h] [-e EVENTS [EVENTS ...]] [-f FILE] [-o OUTPUT_FILE] [-c] [-hl] [-q]

Non API public FB event miner

optional arguments:
  -h, --help            show this help message and exit
  -e EVENTS [EVENTS ...], --events EVENTS [EVENTS ...]
                        List the pages you want to scrape for events
  -f FILE, --file FILE  file with the list of pages to scrape for events
  -o OUTPUT_FILE, --output OUTPUT_FILE
                        output ical file name
  -c, --credentials     use credentials from credentials.txt
  -hl, --headless       run FireFox in headless mode
  -q, --quiet           silence output
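
The help text above corresponds to an argparse setup roughly like the one below. This is a reconstruction from the usage message, not the script's actual source: build_parser is a hypothetical name, and the dest names are inferred from the metavars shown in the help.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="events-scraper.py",
        description="Non API public FB event miner",
    )
    # One or more page URLs given directly on the command line.
    parser.add_argument("-e", "--events", nargs="+",
                        help="List the pages you want to scrape for events")
    # Path to a file listing the pages to scrape.
    parser.add_argument("-f", "--file",
                        help="file with the list of pages to scrape for events")
    parser.add_argument("-o", "--output", dest="output_file",
                        help="output ical file name")
    # Boolean switches default to False when omitted.
    parser.add_argument("-c", "--credentials", action="store_true",
                        help="use credentials from credentials.txt")
    parser.add_argument("-hl", "--headless", action="store_true",
                        help="run FireFox in headless mode")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="silence output")
    return parser
```

For example, build_parser().parse_args(["-f", "event_velo.txt", "-c"]) yields a namespace with file set to "event_velo.txt" and credentials set to True.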

Requirements

  • Selenium
  • icalendar

ToDo / Known issues

  • Ticket URLs aren't scraped
  • Recurring events are not supported (e.g. a pub quiz every Sunday)
  • Daylight saving time has not been tested
  • URLs inside event descriptions are removed for no apparent reason (Selenium getAttribute("text") doesn't like links)
  • Needs a crash or layout-change alert system
  • The credentials option works only with the classic layout and doesn't detect the new FB layout

events-scrapper's People

Contributors

berettavexee

events-scrapper's Issues

Facebook login button

I have found this to be a more reliable way of finding the login button:

findElement(By.xpath(".//input[@data-testid='royal_login_button']"))
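
For reference, a rough Python equivalent of the locator above might look like this. It is only a sketch: find_login_button is a hypothetical helper name, and it assumes the script's browser object is a Selenium WebDriver. The string "xpath" is the value of Selenium's By.XPATH constant, written literally here so the snippet needs no imports.

```python
# XPath taken from the suggestion above; it targets the login input by its
# data-testid attribute rather than by a generated class name.
LOGIN_BUTTON_XPATH = ".//input[@data-testid='royal_login_button']"


def find_login_button(browser):
    """Return the login button element, located via its data-testid attribute."""
    # "xpath" is the same locator strategy string as selenium's By.XPATH.
    return browser.find_element("xpath", LOGIN_BUTTON_XPATH)
```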

Script fails if FB event location is empty

Get this error on some events:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: ._4dpf._phw

After some investigation, it appears this happens when the Facebook event does not include complete location information.
For some events, the location is just a link instead of a name and address, and the lookup fails.

Could you put try/except code around this line so it adds a blank location to the calendar for that event, rather than exiting the script with a failure?

Maybe something like this:
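from selenium.common.exceptions import NoSuchElementException

try:
    event_info["location"] = self.browser.find_element_by_class_name(
        "_4dpf._phw").text
except NoSuchElementException:
    # Fall back to a placeholder when the event has no location block.
    event_info["location"] = "TBD"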

Example events that trigger this failure:
https://www.facebook.com/events/1140886672933580/
https://www.facebook.com/events/594600484704371/

FB changed code related to Organizer

Facebook must have changed something recently: the script now fails when trying to find the organizer.
(This is not a multiple organizer issue, it fails even on events hosted by a single organizer.)

The error is:
  File "events-scraper.py", line 219, in find_organizer
    event_info["organizer"] = organizer[0].text
IndexError: list index out of range
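
Until the selector is updated, one way to keep the script from crashing here is to guard the empty list. The sketch below is not the script's code: first_text is a hypothetical helper, and it assumes the organizer list comes from a Selenium find_elements call, which returns an empty list when nothing matches. The string "css selector" is the value of Selenium's By.CSS_SELECTOR constant.

```python
def first_text(browser, css_selector, default=""):
    """Return the text of the first element matching css_selector, or default."""
    # find_elements (plural) returns [] instead of raising when nothing matches.
    matches = browser.find_elements("css selector", css_selector)
    return matches[0].text if matches else default
```

In find_organizer this could be used as event_info["organizer"] = first_text(self.browser, selector, "Unknown"), where selector stands for whatever CSS selector the script currently uses for the organizer.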

JSON debugging file

Would it be possible to add a command line switch to force the script to generate the .JSON debugging file, perhaps with a different file name? I've found this file very useful, and in fact, I'm using it to interface with other systems. (Namely, generating events automatically for a musician's website, example here: https://thomashindsmedia.com/#shows )
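
A possible shape for such a switch is sketched below. Everything in it is an assumption: the -j/--json flag, the add_json_option and dump_events names, and the idea that the scraped events are available as a list of dictionaries. It is not how the script currently writes its debugging file.

```python
import argparse
import json


def add_json_option(parser):
    """Add a hypothetical -j/--json option naming the JSON output file."""
    parser.add_argument("-j", "--json", dest="json_file", metavar="JSON_FILE",
                        help="also write the scraped events to this JSON file")
    return parser


def dump_events(events, path):
    """Write a list of event dictionaries to path as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        # default=str converts values json cannot encode (e.g. datetimes).
        json.dump(events, handle, ensure_ascii=False, indent=2, default=str)
```

The main routine would then call dump_events(events, args.json_file) only when args.json_file is set, leaving the default behaviour unchanged.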

Unable to find a matching set of capabilities

Traceback (most recent call last):
  File "C:\xampp\htdocs\fb\events-scrapper-master\events-scraper.py", line 448, in <module>
    headless=args.headless)
  File "C:\xampp\htdocs\fb\events-scrapper-master\events-scraper.py", line 80, in __init__
    options=FireFoxOptions,)
  File "C:\Users\mariuszm\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\firefox\webdriver.py", line 174, in __init__
    keep_alive=True)
  File "C:\Users\mariuszm\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "C:\Users\mariuszm\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "C:\Users\mariuszm\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\mariuszm\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities
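
This message often appears when geckodriver cannot start a compatible Firefox, for example because Firefox is not installed or not found. As a first diagnostic, and only as a sketch, the check below reports which of the two executables are not on PATH. missing_from_path is a hypothetical helper, and a missing entry is not conclusive: on Windows, Firefox is frequently installed outside PATH and still found by geckodriver.

```python
import shutil


def missing_from_path(names=("firefox", "geckodriver")):
    """Return the executables from names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_from_path()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("firefox and geckodriver are both on PATH")
```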
