brutalsavage / facebook-post-scraper
Facebook Post Scraper 🕵️🖥️
License: GNU General Public License v3.0
53 postDict['Comments'][commenter] = dict()
54
---> 55 comment_text = comment.find("span", class_="_3l3x").text
56 postDict['Comments'][commenter]["text"] = comment_text
57
AttributeError: 'NoneType' object has no attribute 'text'
C:\Users\Diya Guha Roy\Desktop\Articles\Netnography 1\data>scraper.py -h
Traceback (most recent call last):
File "C:\Users\Diya Guha Roy\Desktop\Articles\Netnography 1\data\scraper.py", line 6, in
from selenium import webdriver
ImportError: No module named selenium
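The usual fix is to install Selenium for the same interpreter that runs the script (a standard pip invocation, not specific to this repo):

python -m pip install selenium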
I changed browser.find_element_by_id('loginbutton').click() to browser.find_element_by_name('login').click().
Any idea what is wrong?
My Command:
python scraper.py -p "https://www.facebook.com/icc/" -l 10
Error:
Traceback (most recent call last):
File "scraper.py", line 322, in <module>
postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
File "scraper.py", line 256, in extract
browser = webdriver.Chrome(executable_path="./chromedriver", options=option)
File "/home/visionph/virtualenv/py.sayfcodes.com/3.7/lib/python3.7/site-packages/selenium/webdriver/chrome/webdriver.py", line 73, in __init__
self.service.start()
File "/home/visionph/virtualenv/py.sayfcodes.com/3.7/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 98, in start
self.assert_process_still_running()
File "/home/visionph/virtualenv/py.sayfcodes.com/3.7/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 111, in assert_process_still_running
% (self.path, return_code)
selenium.common.exceptions.WebDriverException: Message: Service ./chromedriver unexpectedly exited. Status code was: 1
I also tried on Windows, but it isn't working there either.
My command:
python scraper.py -p "https://www.facebook.com/icc/" -l 10
Error:
Traceback (most recent call last):
File "scraper.py", line 322, in <module>
postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
File "scraper.py", line 256, in extract
browser = webdriver.Chrome(executable_path="./chromedriver", options=option)
File "C:\Python38\lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 76, in __init__
RemoteWebDriver.__init__(
File "C:\Python38\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "C:\Python38\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "C:\Python38\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "C:\Python38\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 84
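Two likely causes here (assumptions based on the messages, not confirmed against this repo): on Linux, status code 1 often means the bundled ./chromedriver isn't executable (chmod +x ./chromedriver) or doesn't match the platform, and the Windows error states outright that the bundled driver only supports Chrome 84. One way to sidestep both is the third-party webdriver-manager package, which downloads a ChromeDriver matching the installed Chrome:

# Sketch: let webdriver-manager fetch a ChromeDriver that matches the
# locally installed Chrome, instead of using the bundled ./chromedriver.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

browser = webdriver.Chrome(executable_path=ChromeDriverManager().install())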
Hi, thank you for this great work. I managed to pull posts from Facebook, but without reactions. When I uncomment Dict['Reaction'] in the _extract_html function, I get this error: KeyError: 'data-testid', concerning the _extract_reaction function. Can you help me solve this problem? Thank you.
Hey everybody and thank you for this great piece of software!
However, I have a feature request, because scraping a lot of posts with many comments may quickly lead to blocks or deactivation of whole scraping accounts, and creating new Facebook accounts (each of which might need a fresh mobile number for verification) can be a costly thing. Would it be possible to scrape public comments on public posts without being logged in? (Getting a new IP address is very easy on my DSL or cable connection.)
See for example here, without being logged in: https://www.facebook.com/cnn
So, since you know the underlying mechanisms: would it be possible to enhance the scraper to scrape public posts without being logged in?
Best,
Scrapehouse
Facebook has updated its layout, so all the parsing has to be reviewed :(
I am having this error:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="loginbutton"]"}
(Session info: chrome=85.0.4183.83)
Please advise how I can fix it.
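A workaround consistent with the change quoted earlier in this thread (a sketch; Facebook's markup changes often, so the name attribute is an assumption too): locate the login button by its name attribute instead of the removed id.

# Sketch: fall back to the button's name attribute, assuming Facebook
# dropped the "loginbutton" id in its redesigned login form.
browser.find_element_by_name('login').click()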
posts = extract(page='https://www.facebook.com/bringmethehorizon', numOfPost=15, infinite_scroll=True, scrape_comment=True)
print("DONE!")
for i in posts:
    print(i)
This returns nothing.
This is an amazing scraper; I really appreciate you for this.
Can this scraper be used to download the comments of a particular post? I mean, if I have to download comments from this URL: https://www.facebook.com/IamSRK/posts/5046429322049957, then what should I do to accomplish this task?
I really look forward to your response.
Thank you.
When I used this scraper, the chromedriver led me to a URL with the format facebook.com/facebook.user/post. The page says it is an invalid page. Did I do something wrong in the argument lines?
I'm trying to run this code on an ec2 instance.
However if I run:
python scraper.py -page 'Crónica del Norte' -len 2
I get:
Traceback (most recent call last):
File "scraper.py", line 359, in
postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
File "scraper.py", line 259, in extract
browser = webdriver.Chrome(executable_path="./chromedriver", options=option)
File "/home/ec2-user/anaconda3/lib/python3.8/site-packages/selenium/webdriver/chrome/webdriver.py", line 76, in init
RemoteWebDriver.init(
File "/home/ec2-user/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in init
self.start_session(capabilities, browser_profile)
File "/home/ec2-user/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/home/ec2-user/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/ec2-user/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
While if I run sudo python scraper.py -page 'Crónica del Norte' -len 2, I get:
File "scraper.py", line 43
post_id = f"https://www.facebook.com{postId.get('href')}"
^
SyntaxError: invalid syntax
What can it be?
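Two separate things seem to be going on. The SyntaxError under sudo most likely means sudo picks up a different (older) system interpreter: f-strings like the one at line 43 require Python 3.6+, so an older default Python rejects them. The DevToolsActivePort failure, in turn, is a classic headless-server symptom; a commonly suggested set of Chrome flags (a sketch using standard Chrome switches, not something this repo ships) is:

# Sketch: flags commonly recommended for Chrome on headless servers
# such as EC2 instances.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

option = Options()
option.add_argument("--headless")                # no display available
option.add_argument("--no-sandbox")              # needed when running as root
option.add_argument("--disable-dev-shm-usage")   # avoid the small /dev/shm
browser = webdriver.Chrome(executable_path="./chromedriver", options=option)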
Hi @brutalsavage ,
I tried to download the comments from posts but it returns empty:
from scraper import extract
page = 'https://www.facebook.com/vodafonePT/'
comments = 'y'
list = extract(page, 30, comments)
and returns 'Comments': {} in every post. Did I miss something?
### UPDATE:
The Comments field is returning data; however, it only scrapes the most relevant comments. Any chance it could scrape all comments instead, @brutalsavage?
Thanks!
Looking at the following code segment:
cmmBtn = browser.find_elements_by_xpath('//a[@class="_3hg- _42ft"]')
for btn in cmmBtn:
    try:
        btn.click()
    except:
        pass
time.sleep(1)
moreCmm = browser.find_elements_by_xpath('//a[@class="_4sxc _42ft"]')
for moreCmmBtn in moreCmm:
    try:
        moreCmmBtn.click()
    except:
        pass
moreComments = browser.find_elements_by_xpath('//a[@class="_6w8_"]')
When you grab all the "X comments" buttons (with the line cmmBtn = browser.find_elements_by_xpath('//a[@class="_3hg- _42ft"]')) and later click them all in the for loop, note that if a post already lists a few comments on page load (before the "X comments" button is clicked; see the attached picture), then clicking the button with class _3hg- _42ft actually hides all the comments on that post.
An additional check needs to be added: if a _4sxc _42ft element already exists within the post div (meaning the "view more comments" button is shown), the _3hg- _42ft button doesn't need to be clicked. A sketch of that check follows.
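A minimal sketch of the check (the post-container XPath is an assumption; adjust it to whatever wrapper the page actually uses):

# Sketch: per post, click the "X comments" button (_3hg- _42ft) only when
# no "view more comments" link (_4sxc _42ft) is already present.
# The "userContentWrapper" container class is an assumption.
posts = browser.find_elements_by_xpath('//div[contains(@class, "userContentWrapper")]')
for post in posts:
    try:
        if post.find_elements_by_xpath('.//a[@class="_4sxc _42ft"]'):
            continue  # comments already expanded; clicking would hide them
        post.find_element_by_xpath('.//a[@class="_3hg- _42ft"]').click()
    except Exception:
        pass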
How do I get the source code of a Facebook group page? I tried Ctrl+U, but Chrome can't find any of the class names I entered with Ctrl+F.
Hello, hope you are doing well.
When I try to run the script on Windows, it keeps giving me this error:
Traceback (most recent call last):
File "scraper.py", line 317, in
postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
TypeError: extract() missing 1 required positional argument: 'chromedriver_path'
and the chrome driver is in the same directory
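A guess based purely on the signature in the traceback: this version of extract() wants the driver location as an explicit argument, so passing it may resolve the error (the exact path below is an assumption):

postBigDict = extract(page=args.page, numOfPost=args.len,
                      infinite_scroll=infinite, scrape_comment=scrape_comment,
                      chromedriver_path="./chromedriver.exe")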
I'm trying to scrape as much data as possible.
I'm sure there are more posts than what I'm getting: I got 1400 posts (without comments), and checking the last one's date, I found that it is relatively recent (25 days old). Any idea why this occurs? Please note there is adequate RAM in my computer and CPU usage never goes above 30%.
Hello, I don't know why I'm getting this error:
Traceback (most recent call last):
File "scraper.py", line 210, in
postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
File "scraper.py", line 178, in extract
postBigDict = _extract_html(bs_data)
File "scraper.py", line 48, in extract_html
commenter = comment.find(class="_6qw4").text
AttributeError: 'NoneType' object has no attribute 'text'
It seems fairly difficult to create a dummy Facebook account.
Will this program cause blocking or banning of the Facebook account? Is there any good approach to getting a dummy account?
Opening this issue to mention two points, which will be committed as solutions afterwards:
Great scraper for me, really useful, thank you.
Any chance to scrape all public posts starting from a specific keyword search, instead of a specific page?
For example: python scraper.py -keyword "Play Station 5" -len 20
I hope something like that already exists.
Thank you very much.
Is it possible to add this feature: scrape posts, and the comments on those posts, starting from a specific date?
I'm having this error:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="loginbutton"]"}
(Session info: chrome=85.0.4183.83)
Any advice?
It seems not to maintain the login state after logging in...
Hi! If you want to add the date the post was written and the link to a video uploaded to the Facebook page, you can add the following lines to the scraper.py file.
# Post Date
postDate = item.find(class_="_5ptz")
if postDate is not None:  # guard: the date element may be absent on some posts
    postDict['Date'] = postDate.get('data-utime')

# Post Videos
postVideos = item.find(class_="async_saving _400z _2-40 _5pcq")
postDict['Video'] = ""
if postVideos is not None:
    postDict['Video'] = "https://facebook.com" + postVideos.get('href')
chromedriver types in my username and password but doesn't finish the login (by pressing Enter, I suppose) and suddenly stops. Do you have any idea why? Thanks a lot!
When running the script, I get:
Traceback (most recent call last):
File "scraper.py", line 357, in <module>
postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
File "scraper.py", line 258, in extract
_login(browser, EMAIL, PASSWORD)
File "scraper.py", line 201, in _login
browser.find_element_by_id('loginbutton').click()
File "~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 360, in find_element_by_id
return self.find_element(by=By.ID, value=id_)
File "~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 978, in find_element
'value': value})['value']
File "~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="loginbutton"]"}
(Session info: chrome=87.0.4280.88)
The browser shows the cookie consent window. Is there any solution?
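One possible workaround (a sketch; the data-testid value is an assumption based on Facebook's cookie dialog at the time and may have changed) is to dismiss the cookie banner before the login step:

# Sketch: accept the cookie dialog, if shown, before _login() looks for
# the login button. The data-testid below is an assumed selector.
from selenium.common.exceptions import NoSuchElementException

try:
    browser.find_element_by_xpath(
        '//button[@data-testid="cookie-policy-dialog-accept-button"]'
    ).click()
except NoSuchElementException:
    pass  # no cookie dialog was shown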
Problem with posts that have multiple images: I'm getting only the link of the first one.
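A possible direction for a fix (a sketch, assuming the scraper currently grabs the image with a single find() call; "_46-i" is a stand-in class name, not a verified selector): collect every matching image with find_all() and store a list of links.

# Sketch: gather all image sources in the post instead of only the first.
postImages = item.find_all('img', class_="_46-i")  # assumed image class
postDict['Images'] = [img.get('src') for img in postImages if img.get('src')]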
import argparse
import time
import json
import csv
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as bs
from scraper import extract
list = extract("ministere.sante.ma", 1 )
KeyError Traceback (most recent call last)
in
10 from scraper import extract
11
---> 12 list = extract("ministere.sante.ma", 1 )
~\Desktop\PFE\PFE-GIT\scraper.py in extract(page, numOfPost, infinite_scroll, scrape_comment)
207 bs_data = bs(source_data, 'html.parser')
208
--> 209 postBigDict = _extract_html(bs_data)
210 browser.close()
211
~\Desktop\PFE\PFE-GIT\scraper.py in _extract_html(bs_data)
98 for toolBar_child in toolBar[0].children:
99
--> 100 str = toolBar_child['data-testid']
101 reaction = str.split("UFI2TopReactions/tooltip")[1]
102
C:\ProgramData\Anaconda3\lib\site-packages\bs4\element.py in __getitem__(self, key)
969 """tag[key] returns the value of the 'key' attribute for the tag,
970 and throws an exception if it's not there."""
--> 971 return self.attrs[key]
972
973 def __iter__(self):
KeyError: 'data-testid'
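A defensive rewrite of the failing loop (a sketch; it assumes that skipping toolbar children without the attribute is acceptable) avoids the KeyError by filtering before indexing:

# Sketch: skip toolbar children that aren't tags or that lack the
# data-testid attribute, instead of raising KeyError.
from bs4 import Tag

for toolBar_child in toolBar[0].children:
    if not isinstance(toolBar_child, Tag):
        continue
    data_testid = toolBar_child.get('data-testid', '')
    if "UFI2TopReactions/tooltip" not in data_testid:
        continue
    reaction = data_testid.split("UFI2TopReactions/tooltip")[1]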
_extract_html is producing an empty postBigDict list that yields no output. Might a bug have been introduced when editing the reactions feature?
Complete noob here, so excuse my naiveté.
I was getting the error "'charmap' codec can't encode characters", so I googled the error and changed the code from this:
if args.usage == "WT":
with open('output.txt', 'w') as file:
for post in postBigDict:
file.write(json.dumps(post)) # use json load to recover
elif args.usage == "CSV":
with open('data.csv', 'w',) as csvfile:
writer = csv.writer(csvfile)
#writer.writerow(['Post', 'Link', 'Image', 'Comments', 'Reaction'])
writer.writerow(['Post', 'Link', 'Image', 'Comments', 'Shares'])
for post in postBigDict:
writer.writerow([post['Post'], post['Link'],post['Image'], post['Comments'], post['Shares']])
#writer.writerow([post['Post'], post['Link'],post['Image'], post['Comments'], post['Reaction']])
else:
for post in postBigDict:
print("\n")
to this
import io  # required for io.open

if args.usage == "WT":
    with io.open('output.txt', 'w', encoding='utf-8') as file:
        for post in postBigDict:
            file.write(json.dumps(post))  # use json load to recover
elif args.usage == "CSV":
    with io.open('data.csv', 'w', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        #writer.writerow(['Post', 'Link', 'Image', 'Comments', 'Reaction'])
        writer.writerow(['Post', 'Link', 'Image', 'Comments', 'Shares'])
        for post in postBigDict:
            writer.writerow([post['Post'], post['Link'], post['Image'], post['Comments'], post['Shares']])
            #writer.writerow([post['Post'], post['Link'], post['Image'], post['Comments'], post['Reaction']])
else:
    for post in postBigDict:
        print("\n")
It then worked.
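(On Python 3 the built-in open() accepts encoding='utf-8' directly, so plain open('output.txt', 'w', encoding='utf-8') works as well; io.open is an alias for it.)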
First of all, thanks for this scraper!
My problem is that when I download a large number of posts (> 4000) with 5-10 comments each, Chrome just crashes.
Initially, I got an error when opening collapsed comments (invalid session ID).
Then I changed the code to open comments during the scroll function, and the error began to appear there (invalid session ID again).
I read a lot of threads on Stack Overflow; they recommend adding some options to Chrome, and I tried them all. Many places also suggest giving Chrome more memory (if using Docker), but I just run the script directly.
It also seems to me that this problem is somehow related to memory: Chrome closes due to too many images, media, etc.
Can you help me somehow? Have you run into this, and have you tested the script on large amounts of data?
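One mitigation worth trying (a sketch, not verified against this script; note it would also break the scraper's image-link extraction, so it only suits text/comment runs): tell Chrome not to load images, which cuts memory use sharply on long infinite-scroll sessions.

# Sketch: block image loading to reduce Chrome's memory footprint during
# long scroll sessions. Pref value 2 means "block".
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

option = Options()
option.add_experimental_option(
    "prefs", {"profile.managed_default_content_settings.images": 2}
)
browser = webdriver.Chrome(executable_path="./chromedriver", options=option)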
Hi,
Is this crawler still working? I managed to get it up and running, but once the script starts scrolling down the page, it abruptly closes and nothing gets written to the CSV file.