berettavexee / events-scrapper


Web scraping Python script to convert a list of Facebook event pages into an iCal calendar.

License: MIT License

Python 100.00%

facebook scraper selenium-python selenium ical facebook-crawler agenda icalendar facebook-scraper ics-ical

events-scrapper's Introduction

Event-scrapper

A web scraping Python script that generates an iCal calendar file from a list of Facebook event pages. The goal is to retrieve events without having to spend hours on FB. Unfortunately, Facebook has won: the majority of events are announced and shared on FB, even when the bands or venues have their own website.

The Facebook Graph API no longer supports events, so this script uses Selenium. Using Selenium, and web scraping in general, is contrary to Facebook's terms of use. If you stay reasonable about the number of pages and the frequency of use, and you don't use Selenium in headless mode, Tor, or another proxy, it should be okay. If you scrape thousands of pages every 30 minutes in headless mode from a Tor gateway, at best the response time will increase until it becomes unusable, or the IP address will be blacklisted.

For the kinkiest among you, the script supports authentication to access pages reserved for adults. Multi-factor authentication is supported (if you are not in headless mode). If you use the authentication function and play dumb, you risk having your account slowed down, seeing the number of captchas increase, or even getting banned. You've been warned; there's no need to come crying to me afterwards.

Facebook uses many browser identification techniques to detect that Firefox is running headless, including SVG and WebGL rendering. So they know that something odd happens when you use the headless mode.

I use this script once a day for about 40 pages and I haven't had any problems for the moment.

Use

Scrape one specific event page: python3 events-scraper.py -e https://www.facebook.com/somepage/events/

Scrape all the pages listed in a file: python3 events-scraper.py -f event_velo.txt

Scrape all the pages listed in a file and log in with the credentials stored in credentials.txt: python3 events-scraper.py -f event_velo.txt -c

usage: events-scraper.py [-h] [-e EVENTS [EVENTS ...]] [-f FILE] [-o OUTPUT_FILE] [-c] [-hl] [-q]

Non API public FB event miner

optional arguments:
  -h, --help            show this help message and exit
  -e EVENTS [EVENTS ...], --events EVENTS [EVENTS ...]
                        List the pages you want to scrape for events
  -f FILE, --file FILE  file with the list of pages to scrape for events
  -o OUTPUT_FILE, --output OUTPUT_FILE
                        output ical file name
  -c, --credentials     use credentials from credentials.txt
  -hl, --headless       run FireFox in headless mode
  -q, --quiet           silence output
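
The help text above can be reproduced with a small argparse setup. This is only a sketch of how such an interface might be declared, not the script's actual code: the function name build_parser and the default output name events.ics are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Declare the options listed in the help text above (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        prog="events-scraper.py",
        description="Non API public FB event miner",
    )
    parser.add_argument("-e", "--events", nargs="+", default=[],
                        help="List the pages you want to scrape for events")
    parser.add_argument("-f", "--file",
                        help="file with the list of pages to scrape for events")
    # The default file name below is an assumption; the README does not state it.
    parser.add_argument("-o", "--output", dest="output_file", default="events.ics",
                        help="output ical file name")
    parser.add_argument("-c", "--credentials", action="store_true",
                        help="use credentials from credentials.txt")
    parser.add_argument("-hl", "--headless", action="store_true",
                        help="run FireFox in headless mode")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="silence output")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Boolean switches such as -c and -hl use action="store_true", so they default to False and need no value on the command line.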

Requirements

Selenium
icalendar

ToDo / Known issues

Ticket URLs are not scraped
Recurring events are not supported (e.g. a pub quiz every Sunday)
Daylight saving time has not been tested
URLs inside event descriptions are removed for no apparent reason (Selenium's getAttribute("text") doesn't like links)
A crash or layout-change alert system is needed
The credentials option works only with the classic layout and does not detect the new FB layout
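
For the recurring-events item, the iCalendar format (RFC 5545) expresses repetition with an RRULE property. The sketch below shows what a weekly "every Sunday" entry could look like when built by hand with the standard library. The function name weekly_event_ics and its parameters are hypothetical, it does not use the icalendar package, and it does not escape special characters in the summary.

```python
from datetime import datetime, timezone


def weekly_event_ics(uid, summary, first_start, byday="SU"):
    """Return a minimal VEVENT block that repeats every week on `byday`."""
    lines = [
        "BEGIN:VEVENT",
        f"UID:{uid}",
        # DTSTAMP records when the entry was generated, in UTC.
        f"DTSTAMP:{datetime.now(timezone.utc):%Y%m%dT%H%M%SZ}",
        # DTSTART without a trailing Z is a "floating" local time.
        f"DTSTART:{first_start:%Y%m%dT%H%M%S}",
        f"SUMMARY:{summary}",
        f"RRULE:FREQ=WEEKLY;BYDAY={byday}",
        "END:VEVENT",
    ]
    # iCalendar content lines are terminated by CRLF.
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(weekly_event_ics("pub-quiz@example.invalid", "Pub quiz",
                           datetime(2024, 1, 7, 20, 0)))
```

Detecting that a Facebook event is recurring in the first place is a separate problem that this sketch does not address.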


events-scrapper's Issues

Only gets 6 events, then stops.

Can the script be modified to pull in all events for the next 2 or 3 months?
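
Once more than six events can be loaded, the "next 2 or 3 months" part of this request could be handled by a simple date filter. A minimal sketch, assuming each scraped event is a dict with a "start" key holding a datetime (that key name is an assumption, not taken from the script); it does not explain or fix the six-event limit itself.

```python
from datetime import datetime, timedelta


def events_within(events, now, days=90):
    """Keep only the events that start between `now` and `now + days`."""
    cutoff = now + timedelta(days=days)
    return [event for event in events if now <= event["start"] <= cutoff]


if __name__ == "__main__":
    today = datetime(2024, 1, 1)
    sample = [
        {"name": "Soon", "start": datetime(2024, 2, 1)},
        {"name": "Far away", "start": datetime(2024, 9, 1)},
    ]
    print([event["name"] for event in events_within(sample, today)])
```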

Facebook login button

I have found this to be a more reliable way of finding the login button:

findElement(By.xpath(".//input[@data-testid='royal_login_button']"))

Script fails if FB event location is empty

Get this error on some events:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: ._4dpf._phw

After some investigation, it appears this happens when the Facebook event does not include complete location information.
For some events, the location is just a link instead of a name and address. The lookup fails in that case.

Could you put try/except code around this line so it adds a blank location to the calendar for that event, rather than exiting the script with a failure?

Maybe something like this:
    try:
        event_info["location"] = self.browser.find_element_by_class_name(
            "_4dpf._phw").text
    except NoSuchElementException:
        event_info["location"] = "TBD"

Example events that trigger this failure:
https://www.facebook.com/events/1140886672933580/
https://www.facebook.com/events/594600484704371/

Broken by New Facebook layout

Script no longer runs now that the "new" Facebook layout has been forced out to everyone.

FB changed code related to Organizer

Facebook must have changed something recently, the script is failing trying to find the organizer.
(This is not a multiple organizer issue, it fails even on events hosted by a single organizer.)

error is:
  File "events-scraper.py", line 219, in find_organizer
    event_info["organizer"] = organizer[0].text
IndexError: list index out of range
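
The IndexError means the organizer lookup returned an empty list. Until the selector is updated, a guard like the one below would keep the script from crashing. This is a sketch only: the helper name first_text is hypothetical, and it relies solely on the elements having a .text attribute, as in the traceback line above.

```python
def first_text(elements, default=""):
    """Return the .text of the first element, or `default` if the list is empty."""
    return elements[0].text if elements else default


if __name__ == "__main__":
    class FakeElement:
        """Stand-in for a Selenium element, used only for this demonstration."""
        def __init__(self, text):
            self.text = text

    print(first_text([FakeElement("Some Venue")]))
    print(first_text([], default="Unknown"))
```

In the script this could be used as event_info["organizer"] = first_text(organizer, default="Unknown"). It hides the symptom; the organizer would still be missing from the calendar until the lookup matches the current page layout.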

JSON debugging file

Would it be possible to add a command line switch to force the script to generate the .JSON debugging file, perhaps with a different file name? I've found this file very useful, and in fact, I'm using it to interface with other systems. (Namely, generating events automatically for a musician's website, example here: https://thomashindsmedia.com/#shows )
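
A possible shape for such a switch, as a sketch: the --json flag, the json_file destination and both helper names are hypothetical and do not exist in the script today. It also assumes the events are available as a list of dicts.

```python
import argparse
import json


def add_json_option(parser):
    """Add a hypothetical --json option that names the JSON dump file."""
    parser.add_argument("--json", dest="json_file", metavar="JSON_FILE",
                        default=None,
                        help="also write the scraped events to this JSON file")
    return parser


def dump_events(events, path):
    """Write the list of event dicts to `path` as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        # default=str turns values json cannot encode (e.g. datetimes) into strings.
        json.dump(events, handle, ensure_ascii=False, indent=2, default=str)


if __name__ == "__main__":
    args = add_json_option(argparse.ArgumentParser()).parse_args()
    if args.json_file:
        dump_events([{"name": "Example event", "location": "TBD"}], args.json_file)
```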

Unable to find a matching set of capabilities

Traceback (most recent call last):
  File "C:\xampp\htdocs\fb\events-scrapper-master\events-scraper.py", line 448, in <module>
    headless=args.headless)
  File "C:\xampp\htdocs\fb\events-scrapper-master\events-scraper.py", line 80, in __init__
    options=FireFoxOptions,)
  File "C:\Users\mariuszm\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\firefox\webdriver.py", line 174, in __init__
    keep_alive=True)
  File "C:\Users\mariuszm\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "C:\Users\mariuszm\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "C:\Users\mariuszm\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\mariuszm\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities
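
This message is often reported when Firefox itself cannot be found, or when the installed Firefox, geckodriver and Selenium versions do not match. That is an assumption about this particular report, not a confirmed diagnosis. A quick first check is whether both executables are reachable on the PATH; the helper name check_firefox_setup is hypothetical.

```python
import shutil


def check_firefox_setup(names=("firefox", "geckodriver")):
    """Map each executable name to its full path, or None if not on the PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in check_firefox_setup().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If Firefox is installed outside the PATH, which is common on Windows, a None result here does not prove it is missing, only that it cannot be found by name.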