brutalsavage / facebook-post-scraper
Facebook Post Scraper 🕵️🖥️
License: GNU General Public License v3.0
53 postDict['Comments'][commenter] = dict()
54
---> 55 comment_text = comment.find("span", class_="_3l3x").text
56 postDict['Comments'][commenter]["text"] = comment_text
57
AttributeError: 'NoneType' object has no attribute 'text'
C:\Users\Diya Guha Roy\Desktop\Articles\Netnography 1\data>scraper.py -h
Traceback (most recent call last):
File "C:\Users\Diya Guha Roy\Desktop\Articles\Netnography 1\data\scraper.py", line 6, in
from selenium import webdriver
ImportError: No module named selenium
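The usual fix is to install Selenium for the same interpreter that runs the script (a standard pip invocation, not specific to this repo):

python -m pip install selenium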
I changed browser.find_element_by_id('loginbutton').click() to browser.find_element_by_name('login').click().
Any idea what is wrong?
My Command:
python scraper.py -p "https://www.facebook.com/icc/" -l 10
Error:
Traceback (most recent call last):
File "scraper.py", line 322, in <module>
postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
File "scraper.py", line 256, in extract
browser = webdriver.Chrome(executable_path="./chromedriver", options=option)
File "/home/visionph/virtualenv/py.sayfcodes.com/3.7/lib/python3.7/site-packages/selenium/webdriver/chrome/webdriver.py", line 73, in __init__
self.service.start()
File "/home/visionph/virtualenv/py.sayfcodes.com/3.7/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 98, in start
self.assert_process_still_running()
File "/home/visionph/virtualenv/py.sayfcodes.com/3.7/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 111, in assert_process_still_running
% (self.path, return_code)
selenium.common.exceptions.WebDriverException: Message: Service ./chromedriver unexpectedly exited. Status code was: 1
I also tried on Windows, but it isn't working there either.
My command:
python scraper.py -p "https://www.facebook.com/icc/" -l 10
Error:
Traceback (most recent call last):
File "scraper.py", line 322, in <module>
postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
File "scraper.py", line 256, in extract
browser = webdriver.Chrome(executable_path="./chromedriver", options=option)
File "C:\Python38\lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 76, in __init__
RemoteWebDriver.__init__(
File "C:\Python38\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "C:\Python38\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "C:\Python38\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "C:\Python38\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 84
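Two likely causes here (assumptions based on the messages, not confirmed against this repo): on Linux, status code 1 often means the bundled ./chromedriver isn't executable (chmod +x ./chromedriver) or doesn't match the platform, and the Windows error states outright that the bundled driver only supports Chrome 84. One way to sidestep both is the third-party webdriver-manager package, which downloads a ChromeDriver matching the installed Chrome:

# Sketch: let webdriver-manager fetch a ChromeDriver that matches the
# locally installed Chrome, instead of using the bundled ./chromedriver.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

browser = webdriver.Chrome(executable_path=ChromeDriverManager().install())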
Hi, thank you for this great work. I managed to pull posts from Facebook, but without reactions. When I uncomment Dict['Reaction'] in the _extract_html function, I get this error: KeyError: 'data-testid', concerning the _extract_reaction function. Can you help me solve this problem? Thank you.
Hey everybody and thank you for this great piece of software!
However, I have a feature request, because scraping a lot of posts with many comments may quickly lead to blocks or deactivation of whole scraping accounts, and creating new Facebook accounts (each of which might need a fresh mobile number for verification) can be a costly thing. Would it be possible to scrape public comments on public posts without being logged in? (Getting a new IP address is very easy on my DSL or cable connection.)
See for example here, without being logged in: https://www.facebook.com/cnn
So, since you know the underlying mechanisms: would it be possible to enhance the scraper to scrape public posts without being logged in?
Best,
Scrapehouse
Facebook has updated its layout, so all the parsing has to be reviewed :(
I am having this error:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="loginbutton"]"}
(Session info: chrome=85.0.4183.83)
Please advise how I can fix it.
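A workaround consistent with the change quoted earlier in this thread (a sketch; Facebook's markup changes often, so the name attribute is an assumption too): locate the login button by its name attribute instead of the removed id.

# Sketch: fall back to the button's name attribute, assuming Facebook
# dropped the "loginbutton" id in its redesigned login form.
browser.find_element_by_name('login').click()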
posts = extract(page='https://www.facebook.com/bringmethehorizon', numOfPost=15, infinite_scroll=True, scrape_comment=True)
print("DONE!")
for i in posts:
    print(i)
This returns nothing.
This is an amazing scraper; I really appreciate you for this.
Can this scraper be used to download the comments of a particular post? I mean, if I have to download comments from this URL: https://www.facebook.com/IamSRK/posts/5046429322049957, then what should I do to accomplish this task?
I really look forward to your response.
Thank you.
When I used this scraper, the chromedriver led me to a URL with the format facebook.com/facebook.user/post. The page says it is an invalid page. Did I do something wrong in the argument lines?
I'm trying to run this code on an ec2 instance.
However if I run:
python scraper.py -page 'Crónica del Norte' -len 2
I get:
Traceback (most recent call last):
File "scraper.py", line 359, in
postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
File "scraper.py", line 259, in extract
browser = webdriver.Chrome(executable_path="./chromedriver", options=option)
File "/home/ec2-user/anaconda3/lib/python3.8/site-packages/selenium/webdriver/chrome/webdriver.py", line 76, in init
RemoteWebDriver.init(
File "/home/ec2-user/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in init
self.start_session(capabilities, browser_profile)
File "/home/ec2-user/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/home/ec2-user/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/ec2-user/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
While if I run sudo python scraper.py -page 'Crónica del Norte' -len 2, I get:
File "scraper.py", line 43
post_id = f"https://www.facebook.com{postId.get('href')}"
^
SyntaxError: invalid syntax
What can it be?
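Two separate things seem to be going on. The SyntaxError under sudo most likely means sudo picks up a different (older) system interpreter: f-strings like the one at line 43 require Python 3.6+, so an older default Python rejects them. The DevToolsActivePort failure, in turn, is a classic headless-server symptom; a commonly suggested set of Chrome flags (a sketch using standard Chrome switches, not something this repo ships) is:

# Sketch: flags commonly recommended for Chrome on headless servers
# such as EC2 instances.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

option = Options()
option.add_argument("--headless")                # no display available
option.add_argument("--no-sandbox")              # needed when running as root
option.add_argument("--disable-dev-shm-usage")   # avoid the small /dev/shm
browser = webdriver.Chrome(executable_path="./chromedriver", options=option)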
Hi @brutalsavage ,
I tried to download the comments from posts but it returns empty:
from scraper import extract
page = 'https://www.facebook.com/vodafonePT/'
comments = 'y'
list = extract(page, 30, comments)
and returns 'Comments': {} in every post. Did I miss something?
### UPDATE:
The Comments field is returning data; however, it only scrapes the most relevant comments. Any chance it could scrape all comments instead, @brutalsavage?
Thanks!
Looking at the following code segment:
cmmBtn = browser.find_elements_by_xpath('//a[@class="_3hg- _42ft"]')
for btn in cmmBtn:
    try:
        btn.click()
    except:
        pass
time.sleep(1)
moreCmm = browser.find_elements_by_xpath('//a[@class="_4sxc _42ft"]')
for moreCmmBtn in moreCmm:
    try:
        moreCmmBtn.click()
    except:
        pass
moreComments = browser.find_elements_by_xpath('//a[@class="_6w8_"]')
When you grab all the "X comments" buttons (with the line cmmBtn = browser.find_elements_by_xpath('//a[@class="_3hg- _42ft"]')) and later click them all in the for loop, note that if a post already lists a few comments on page load (before the "X comments" button is clicked; see the attached picture), then clicking the button with class _3hg- _42ft actually hides all the comments on that post.
An additional check needs to be added: if a _4sxc _42ft element already exists within the post div (meaning the "view more comments" button is shown), the _3hg- _42ft button doesn't need to be clicked. A sketch of that check follows.
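A minimal sketch of the check (the post-container XPath is an assumption; adjust it to whatever wrapper the page actually uses):

# Sketch: per post, click the "X comments" button (_3hg- _42ft) only when
# no "view more comments" link (_4sxc _42ft) is already present.
# The "userContentWrapper" container class is an assumption.
posts = browser.find_elements_by_xpath('//div[contains(@class, "userContentWrapper")]')
for post in posts:
    try:
        if post.find_elements_by_xpath('.//a[@class="_4sxc _42ft"]'):
            continue  # comments already expanded; clicking would hide them
        post.find_element_by_xpath('.//a[@class="_3hg- _42ft"]').click()
    except Exception:
        pass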
How do I get the source code of a Facebook group page? I tried Ctrl+U, but Chrome can't find any of the class names I entered with Ctrl+F.
Hello, hope you are doing well.
When I try to run the script on Windows, it keeps giving me this error:
Traceback (most recent call last):
File "scraper.py", line 317, in
postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
TypeError: extract() missing 1 required positional argument: 'chromedriver_path'
and the chrome driver is in the same directory
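A guess based purely on the signature in the traceback: this version of extract() wants the driver location as an explicit argument, so passing it may resolve the error (the exact path below is an assumption):

postBigDict = extract(page=args.page, numOfPost=args.len,
                      infinite_scroll=infinite, scrape_comment=scrape_comment,
                      chromedriver_path="./chromedriver.exe")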
I'm trying to scrape as much data as possible.
I'm sure there are more posts than what I'm getting: I got 1400 posts (without comments), and checking the last one's date, I found that it is relatively recent (25 days old). Any idea why this occurs? Please note there is adequate RAM in my computer and CPU usage never goes above 30%.
Hello, I don't know why I'm getting this error:
Traceback (most recent call last):
File "scraper.py", line 210, in
postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
File "scraper.py", line 178, in extract
postBigDict = _extract_html(bs_data)
File "scraper.py", line 48, in extract_html
commenter = comment.find(class="_6qw4").text
AttributeError: 'NoneType' object has no attribute 'text'
It seems fairly difficult to create a dummy Facebook account.
Will this program cause blocking or banning of the Facebook account? Is there any good approach to getting a dummy account?
Opening this issue to mention two points, which will be committed as solutions afterwards:
Great scraper for me, really useful, thank you.
Any chance to scrape all public posts starting from a specific keyword search, instead of a specific page?
For example: python scraper.py -keyword "Play Station 5" -len 20
I hope something like that already exists.
Thank you very much.
Is it possible to add this feature: scrape posts, and the comments on those posts, starting from a specific date?
I'm having this error:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="loginbutton"]"}
(Session info: chrome=85.0.4183.83)
Any advice?
It seems not to maintain the login state after logging in...
Hi! If you want to add the date the post was written and the link to a video uploaded to the Facebook page, you can add the following lines to the scraper.py file.
# Post Date
postDate = item.find(class_="_5ptz")
if postDate is not None:  # guard: the date element may be absent on some posts
    postDict['Date'] = postDate.get('data-utime')

# Post Videos
postVideos = item.find(class_="async_saving _400z _2-40 _5pcq")
postDict['Video'] = ""
if postVideos is not None:
    postDict['Video'] = "https://facebook.com" + postVideos.get('href')
chromedriver types in my username and password but doesn't finish the login (by pressing Enter, I suppose) and suddenly stops. Do you have any idea why? Thanks a lot!
When running the script, I get:
Traceback (most recent call last):
File "scraper.py", line 357, in <module>
postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
File "scraper.py", line 258, in extract
_login(browser, EMAIL, PASSWORD)
File "scraper.py", line 201, in _login
browser.find_element_by_id('loginbutton').click()
File "~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 360, in find_element_by_id
return self.find_element(by=By.ID, value=id_)
File "~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 978, in find_element
'value': value})['value']
File "~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="loginbutton"]"}
(Session info: chrome=87.0.4280.88)
The browser shows the cookie consent window. Is there any solution?
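One possible workaround (a sketch; the data-testid value is an assumption based on Facebook's cookie dialog at the time and may have changed) is to dismiss the cookie banner before the login step:

# Sketch: accept the cookie dialog, if shown, before _login() looks for
# the login button. The data-testid below is an assumed selector.
from selenium.common.exceptions import NoSuchElementException

try:
    browser.find_element_by_xpath(
        '//button[@data-testid="cookie-policy-dialog-accept-button"]'
    ).click()
except NoSuchElementException:
    pass  # no cookie dialog was shown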
Problem with posts that have multiple images: I'm getting only the link of the first one.
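A possible direction for a fix (a sketch, assuming the scraper currently grabs the image with a single find() call; "_46-i" is a stand-in class name, not a verified selector): collect every matching image with find_all() and store a list of links.

# Sketch: gather all image sources in the post instead of only the first.
postImages = item.find_all('img', class_="_46-i")  # assumed image class
postDict['Images'] = [img.get('src') for img in postImages if img.get('src')]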
import argparse
import time
import json
import csv
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as bs
from scraper import extract
list = extract("ministere.sante.ma", 1 )
KeyError Traceback (most recent call last)
in
10 from scraper import extract
11
---> 12 list = extract("ministere.sante.ma", 1 )
~\Desktop\PFE\PFE-GIT\scraper.py in extract(page, numOfPost, infinite_scroll, scrape_comment)
207 bs_data = bs(source_data, 'html.parser')
208
--> 209 postBigDict = _extract_html(bs_data)
210 browser.close()
211
~\Desktop\PFE\PFE-GIT\scraper.py in _extract_html(bs_data)
98 for toolBar_child in toolBar[0].children:
99
--> 100 str = toolBar_child['data-testid']
101 reaction = str.split("UFI2TopReactions/tooltip")[1]
102
C:\ProgramData\Anaconda3\lib\site-packages\bs4\element.py in __getitem__(self, key)
969 """tag[key] returns the value of the 'key' attribute for the tag,
970 and throws an exception if it's not there."""
--> 971 return self.attrs[key]
972
973 def __iter__(self):
KeyError: 'data-testid'
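A defensive rewrite of the failing loop (a sketch; it assumes that skipping toolbar children without the attribute is acceptable) avoids the KeyError by filtering before indexing:

# Sketch: skip toolbar children that aren't tags or that lack the
# data-testid attribute, instead of raising KeyError.
from bs4 import Tag

for toolBar_child in toolBar[0].children:
    if not isinstance(toolBar_child, Tag):
        continue
    data_testid = toolBar_child.get('data-testid', '')
    if "UFI2TopReactions/tooltip" not in data_testid:
        continue
    reaction = data_testid.split("UFI2TopReactions/tooltip")[1]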
_extract_html is producing an empty postBigDict list that yields no output. Might a bug have been introduced when editing the reactions feature?
Complete noob here, so excuse my naiveté.
I was getting the error "'charmap' codec can't encode characters", so I googled the error and changed the code from this:
if args.usage == "WT":
with open('output.txt', 'w') as file:
for post in postBigDict:
file.write(json.dumps(post)) # use json load to recover
elif args.usage == "CSV":
with open('data.csv', 'w',) as csvfile:
writer = csv.writer(csvfile)
#writer.writerow(['Post', 'Link', 'Image', 'Comments', 'Reaction'])
writer.writerow(['Post', 'Link', 'Image', 'Comments', 'Shares'])
for post in postBigDict:
writer.writerow([post['Post'], post['Link'],post['Image'], post['Comments'], post['Shares']])
#writer.writerow([post['Post'], post['Link'],post['Image'], post['Comments'], post['Reaction']])
else:
for post in postBigDict:
print("\n")
to this
import io  # required for io.open

if args.usage == "WT":
    with io.open('output.txt', 'w', encoding='utf-8') as file:
        for post in postBigDict:
            file.write(json.dumps(post))  # use json load to recover
elif args.usage == "CSV":
    with io.open('data.csv', 'w', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        #writer.writerow(['Post', 'Link', 'Image', 'Comments', 'Reaction'])
        writer.writerow(['Post', 'Link', 'Image', 'Comments', 'Shares'])
        for post in postBigDict:
            writer.writerow([post['Post'], post['Link'], post['Image'], post['Comments'], post['Shares']])
            #writer.writerow([post['Post'], post['Link'], post['Image'], post['Comments'], post['Reaction']])
else:
    for post in postBigDict:
        print("\n")
It then worked.
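(On Python 3 the built-in open() accepts encoding='utf-8' directly, so plain open('output.txt', 'w', encoding='utf-8') works as well; io.open is an alias for it.)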
First of all, thanks for this scraper!
My problem is that when I download a large number of posts (> 4000) with 5-10 comments each, Chrome just crashes.
Initially, I got an error when opening collapsed comments (invalid session ID).
Then I changed the code to open comments during the scroll function, and the error began to appear there (invalid session ID again).
I read a lot of threads on Stack Overflow; they recommend adding some options to Chrome, and I tried them all. Many places also suggest giving Chrome more memory (if using Docker), but I just run the script directly.
It also seems to me that this problem is somehow related to memory: Chrome closes due to too many images, media, etc.
Can you help me somehow? Have you run into this, and have you tested the script on large amounts of data?
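One mitigation worth trying (a sketch, not verified against this script; note it would also break the scraper's image-link extraction, so it only suits text/comment runs): tell Chrome not to load images, which cuts memory use sharply on long infinite-scroll sessions.

# Sketch: block image loading to reduce Chrome's memory footprint during
# long scroll sessions. Pref value 2 means "block".
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

option = Options()
option.add_experimental_option(
    "prefs", {"profile.managed_default_content_settings.images": 2}
)
browser = webdriver.Chrome(executable_path="./chromedriver", options=option)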
Hi,
Is this crawler still working? I managed to get it up and running, but once the script starts scrolling down the page, it abruptly closes and nothing gets written to the CSV file.