
facebook-post-scraper's People

Contributors

afsara-ben, brutalsavage, busattoale, ciro-gallo, octavian-negru, rexshijaku


facebook-post-scraper's Issues

Invalid Argument

Hello,

I'm having the following issue when I run ".../python scraper.py -page nexmm -len 1"

Could you kindly explain why I'm having it? Thank you.

[Screenshot: Annotation 2020-06-09 131503]

Comments = 0

     53     postDict['Comments'][commenter] = dict()
     54
---> 55     comment_text = comment.find("span", class_="_3l3x").text
     56     postDict['Comments'][commenter]["text"] = comment_text
     57

AttributeError: 'NoneType' object has no attribute 'text'
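
This failure means the comment node has no span with class _3l3x (that class is Facebook-internal and changes over time). A minimal defensive sketch, assuming the same comment/postDict/commenter context as in the traceback:

    # Guard against comments whose text span is missing instead of
    # calling .text on a possible None.
    comment_span = comment.find("span", class_="_3l3x")
    if comment_span is not None:
        postDict['Comments'][commenter]["text"] = comment_span.text
    else:
        postDict['Comments'][commenter]["text"] = ""  # fall back to empty text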

ImportError: No module named selenium

C:\Users\Diya Guha Roy\Desktop\Articles\Netnography 1\data>scraper.py -h
Traceback (most recent call last):
  File "C:\Users\Diya Guha Roy\Desktop\Articles\Netnography 1\data\scraper.py", line 6, in <module>
    from selenium import webdriver
ImportError: No module named selenium
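
This usually just means Selenium is not installed for the interpreter that runs the script; installing it with pip install selenium (or python -m pip install selenium, to target the exact interpreter on the PATH) typically resolves it.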

update login button

Change browser.find_element_by_id('loginbutton').click() to browser.find_element_by_name('login').click().
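
A sketch that tolerates both variants, on the assumption that Facebook may serve either markup (Selenium 3 locator style, as used elsewhere in this tracker):

    from selenium.common.exceptions import NoSuchElementException

    # Try the old id-based locator first, then fall back to the name-based
    # one, since Facebook renames the login button periodically.
    try:
        browser.find_element_by_id('loginbutton').click()
    except NoSuchElementException:
        browser.find_element_by_name('login').click()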

chromedriver unexpectedly exited. Status code was: 1

Any idea what is wrong?
My Command:

python scraper.py -p "https://www.facebook.com/icc/" -l 10

Error:

Traceback (most recent call last):
  File "scraper.py", line 322, in <module>
    postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite,     scrape_comment=scrape_comment)
  File "scraper.py", line 256, in extract
    browser = webdriver.Chrome(executable_path="./chromedriver", options=option)
  File "/home/visionph/virtualenv/py.sayfcodes.com/3.7/lib/python3.7/site-packages/selenium/webdriver/chrome/webdriver.py", line 73, in __init__
    self.service.start()
  File "/home/visionph/virtualenv/py.sayfcodes.com/3.7/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 98, in start
    self.assert_process_still_running()
  File "/home/visionph/virtualenv/py.sayfcodes.com/3.7/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 111, in assert_process_still_running
    % (self.path, return_code)
selenium.common.exceptions.WebDriverException: Message: Service ./chromedriver unexpectedly exited. Status code was: 1
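
Status code 1 generally means the chromedriver binary could not run at all (wrong platform build, missing execute permission, or missing shared libraries on a bare server). A quick diagnostic sketch, assuming the ./chromedriver path from the traceback:

    import os
    import subprocess

    # Confirm the binary exists, is executable, and can print its version.
    driver_path = "./chromedriver"
    print("exists:", os.path.isfile(driver_path))
    print("executable:", os.access(driver_path, os.X_OK))
    subprocess.run([driver_path, "--version"], check=True)  # raises if it cannot start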

selenium.common.exceptions.SessionNotCreatedException error

I also tried on Windows, but it isn't working there either.

My command:

 python scraper.py -p "https://www.facebook.com/icc/" -l 10

Error:

Traceback (most recent call last):
  File "scraper.py", line 322, in <module>
    postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
  File "scraper.py", line 256, in extract
    browser = webdriver.Chrome(executable_path="./chromedriver", options=option)
  File "C:\Python38\lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 76, in __init__
    RemoteWebDriver.__init__(
  File "C:\Python38\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "C:\Python38\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "C:\Python38\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Python38\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 84
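
This is a Chrome/ChromeDriver version mismatch: the bundled driver supports Chrome 84 while a different Chrome version is installed. One way to sidestep it is the third-party webdriver-manager package, which downloads a driver matching the installed browser; a sketch in the Selenium 3 style the script uses:

    # pip install webdriver-manager
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from webdriver_manager.chrome import ChromeDriverManager

    option = Options()
    # Fetch (and cache) a ChromeDriver matching the locally installed Chrome
    # instead of relying on a pinned ./chromedriver binary.
    browser = webdriver.Chrome(executable_path=ChromeDriverManager().install(),
                               options=option)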

How can I get post reactions?

Hi, thank you for this great work. I managed to pull posts from Facebook, but without reactions. When I uncomment Dict['Reaction'] in the _extract_html function, I get the error KeyError: 'data-testid' in the _extract_reaction function. Can you help me solve this problem? Thank you.
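
The KeyError comes from dict-style attribute access on nodes that carry no data-testid attribute. A hedged sketch of the loop from _extract_html, using .get() so attribute-less children are skipped (variable names taken from the tracebacks in this tracker):

    # Skip toolbar children without a data-testid attribute instead of
    # indexing it unconditionally (tag['data-testid'] raises KeyError).
    for toolBar_child in toolBar[0].children:
        testid = toolBar_child.get('data-testid') if hasattr(toolBar_child, 'get') else None
        if not testid or 'UFI2TopReactions/tooltip' not in testid:
            continue  # NavigableStrings and unrelated tags carry no reaction info
        reaction = testid.split('UFI2TopReactions/tooltip')[1]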

[REQUEST]: Scrape comments without being logged in

Hey everybody, and thank you for this great piece of software!

However, I have a feature request, because scraping a lot of posts with many comments can quickly lead to blocks or deactivation of entire scraping accounts, and creating new Facebook accounts (each of which might need a fresh mobile number for verification) can be a costly thing. Would it be possible to scrape public comments on public posts without being logged in? (Getting a new IP address is very easy on my DSL or cable connection.)

See for example here, without being logged in: https://www.facebook.com/cnn

So, since you know the underlying mechanisms: would it be possible to enhance the scraper to scrape public posts without being logged in?

Best,

Scrapehouse

Unable to locate element

I am having this error:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="loginbutton"]"}
(Session info: chrome=85.0.4183.83)

Please advise on how I can fix it.
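
As noted in the "update login button" issue above, the id-based locator no longer matches the current markup. A sketch that waits for the name-based control instead of failing immediately (standard Selenium explicit wait; the 'login' name is taken from that issue):

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Wait up to 10 s for the login control instead of failing at once
    # on a hard-coded element id.
    login = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.NAME, 'login'))
    )
    login.click()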

Went to an invalid post page

When I used this scraper, the chromedriver led me to a URL of the format facebook.com/facebook.user/post. The page says it is an invalid page. Did I do something wrong in the argument lines?

Problem on ec2 instance

I'm trying to run this code on an ec2 instance.

However if I run:
python scraper.py -page 'Crónica del Norte' -len 2

I get:

Traceback (most recent call last):
  File "scraper.py", line 359, in <module>
    postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
  File "scraper.py", line 259, in extract
    browser = webdriver.Chrome(executable_path="./chromedriver", options=option)
  File "/home/ec2-user/anaconda3/lib/python3.8/site-packages/selenium/webdriver/chrome/webdriver.py", line 76, in __init__
    RemoteWebDriver.__init__(
  File "/home/ec2-user/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/ec2-user/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/ec2-user/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/ec2-user/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally.
  (unknown error: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

While if I run sudo python scraper.py -page 'Crónica del Norte' -len 2, I get:

File "scraper.py", line 43
post_id = f"https://www.facebook.com{postId.get('href')}"
^
SyntaxError: invalid syntax

What can it be?
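
These look like two separate problems. The SyntaxError under sudo suggests that sudo python resolves to Python 2, which does not support f-strings (they require Python 3.6+), so sudo python3 would be the variant to try. The DevToolsActivePort error is a common symptom of Chrome failing to start on a headless server; the standard mitigation is a set of Chrome flags, sketched here:

    from selenium.webdriver.chrome.options import Options

    # Flags commonly needed to start Chrome on a headless EC2 instance.
    option = Options()
    option.add_argument('--headless')
    option.add_argument('--no-sandbox')             # needed when running as root
    option.add_argument('--disable-dev-shm-usage')  # /dev/shm is small on many instances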

Comments return empty

Hi @brutalsavage ,
I tried to download the comments from posts, but they return empty:

from scraper import extract
page = 'https://www.facebook.com/vodafonePT/'
comments = 'y'
list = extract(page, 30, comments)

and returns 'Comments': {} in every post. Did I miss something?

UPDATE:

The Comments field is returning data; however, it only scrapes the most relevant comments. Any chance it could scrape all comments instead, @brutalsavage?

Thanks!
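
A side note on the snippet above: judging by the tracebacks elsewhere in this tracker, extract's signature appears to be extract(page, numOfPost, infinite_scroll, scrape_comment), so a positional 'y' in third place would bind to infinite_scroll rather than scrape_comment. A hedged sketch of the safer call:

    from scraper import extract

    # Pass scrape_comment by keyword so it does not bind to infinite_scroll,
    # which appears to be the third positional parameter.
    posts = extract(page='https://www.facebook.com/vodafonePT/',
                    numOfPost=30,
                    scrape_comment='y')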

Scraping comments doesn't work correctly

Looking at the following code segment:

        cmmBtn = browser.find_elements_by_xpath('//a[@class="_3hg- _42ft"]')
        for btn in cmmBtn:
            try:
                btn.click()
            except:
                pass
        time.sleep(1)
        moreCmm= browser.find_elements_by_xpath('//a[@class="_4sxc _42ft"]')
        for moreCmmBtn in moreCmm:
            try:
                moreCmmBtn.click()
            except:
                pass
        moreComments = browser.find_elements_by_xpath('//a[@class="_6w8_"]')

When you collect all the "X comments" buttons (with the line cmmBtn = browser.find_elements_by_xpath('//a[@class="_3hg- _42ft"]')) and later click them all in the for loop, note that if a post already lists a few comments on page load, before the "X comments" button is clicked (see the image below), then clicking the button with class _3hg- _42ft actually hides all the comments of that post.

An additional check needs to be added for whether a _4sxc _42ft element already exists within the post div (meaning the "view more comments" button is shown, so the _3hg- _42ft button does not need to be clicked); a sketch follows the image.

[Image]
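
A sketch of that check, using the Selenium 3 API from the snippet above; the userContentWrapper ancestor class is an assumption about the post container and must be verified against the live markup:

    # Only click an "X comments" button if its post does not already show
    # a "view more comments" control (class _4sxc _42ft).
    cmmBtn = browser.find_elements_by_xpath('//a[@class="_3hg- _42ft"]')
    for btn in cmmBtn:
        # ASSUMPTION: the enclosing post div carries the userContentWrapper class.
        post = btn.find_element_by_xpath(
            './ancestor::div[contains(@class, "userContentWrapper")]')
        if post.find_elements_by_xpath('.//a[@class="_4sxc _42ft"]'):
            continue  # comments already listed; clicking would collapse them
        try:
            btn.click()
        except Exception:
            pass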

Usage: the following arguments are required

usage: ipykernel_launcher.py [-h] -page PAGE -len LEN [-infinite INFINITE] [-usage USAGE] [-comments COMMENTS]
ipykernel_launcher.py: error: the following arguments are required: -page/-p, -len/-l


Can anyone help me solve this problem?

Facebook Source Code

How do I get the source code of a Facebook group page? I tried Ctrl+U, but Chrome can't find any of the class names I searched for with Ctrl+F.

chromedriver path

Hello, hope you are well.
When I try to run the script on Windows, it keeps giving me this error:

Traceback (most recent call last):
  File "scraper.py", line 317, in <module>
    postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
TypeError: extract() missing 1 required positional argument: 'chromedriver_path'

and the chromedriver is in the same directory.
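
The TypeError says the current extract() takes an explicit chromedriver_path parameter, so the call site has to supply it. A hedged sketch based only on the parameter name in the error message:

    # Pass the local chromedriver binary explicitly, since extract() now
    # requires a chromedriver_path argument.
    postBigDict = extract(page=args.page, numOfPost=args.len,
                          infinite_scroll=infinite, scrape_comment=scrape_comment,
                          chromedriver_path='./chromedriver')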

Scraping is stopped even though I passed infinite as an argument

I'm trying to scrape as much data as possible.
I'm sure there are more posts than what I'm getting: I got 1400 posts without comments, and checking the last one's date, I found that it is relatively recent (25 days old). Any idea why this occurs? Please note that my computer has adequate RAM and CPU usage never goes above 30%.

comment

Hello, I don't know why I'm getting this error:

Traceback (most recent call last):
  File "scraper.py", line 210, in <module>
    postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
  File "scraper.py", line 178, in extract
    postBigDict = _extract_html(bs_data)
  File "scraper.py", line 48, in _extract_html
    commenter = comment.find(class_="_6qw4").text
AttributeError: 'NoneType' object has no attribute 'text'

can this lead to a block or ban?

It seems fairly difficult to create a dummy Facebook account.

Will this program cause blocking or banning of the Facebook account? Is there any good approach to getting a dummy account?

uncollapsed comments and unfiltered comments

Opening this issue to mention two points, for which solutions will be committed afterwards:

  1. Facebook collapses comments, and they should be expanded (uncollapsed) before the "view more comments" links can be generated (or before ranking them).
  2. Comments are usually filtered (ranked, e.g., as "Most relevant"), and this means that clicking "view more" does not reach all comments! You should first filter them as "All comments" and then try to view them; see the sketch after this list.
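
A hypothetical sketch of point 2. Both selectors below are placeholders, not verified Facebook markup; the real data-testid and menu-item text must be read from the current page in devtools:

    import time

    def select_all_comments(browser):
        # PLACEHOLDER selector for the comment-ranking drop-down on each post.
        for menu in browser.find_elements_by_xpath(
                '//a[@data-testid="UFI2ViewOptionsSelector/link"]'):
            try:
                menu.click()
                time.sleep(1)
                # PLACEHOLDER menu-item text; may be localized.
                browser.find_element_by_xpath(
                    '//a[contains(., "All comments")]').click()
            except Exception:
                pass  # no ranking menu on this post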

scraping public posts by keywords?

Great scraper, really useful for me, thank you.

Any chance of scraping all public posts starting from a specific "keyword" search, instead of a specific page?

For example: python scraper.py -keyword "Play Station 5" -len 20

I hope something like that already exists.

Thank you very much.

Unable to locate element:

I'm having this error:

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="loginbutton"]"}
(Session info: chrome=85.0.4183.83)

Any advice?

Page crashed while scraping

[Image]

Is there any solution to this? It's not possible to scrape anything, as it keeps crashing after 1-3 scrolls.

Timestamp and Video Link

Hi! If you want to add the date the post was written and the link to a video uploaded to the Facebook page, you can add the following lines to the scraper.py file.

    # Post Date
    postDate = item.find(class_="_5ptz")
    postDict['Date'] = postDate.get('data-utime')

    # Post Videos
    postVideos = item.find(class_="async_saving _400z _2-40 _5pcq")
    postDict['Video'] = ""
    if postVideos is not None:
        postDict['Video'] = "https://facebook.com" + postVideos.get('href')
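
One hedged caveat: item.find(class_="_5ptz") can return None for posts that lack that timestamp element, in which case .get would raise AttributeError. A defensive variant of the date lines:

    # Guard against posts that lack the "_5ptz" timestamp element.
    postDate = item.find(class_="_5ptz")
    postDict['Date'] = postDate.get('data-utime') if postDate is not None else ""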

Can't log in because of cookies

When running the script, I get:

Traceback (most recent call last):
  File "scraper.py", line 357, in <module>
    postBigDict = extract(page=args.page, numOfPost=args.len, infinite_scroll=infinite, scrape_comment=scrape_comment)
  File "scraper.py", line 258, in extract
    _login(browser, EMAIL, PASSWORD)
  File "scraper.py", line 201, in _login
    browser.find_element_by_id('loginbutton').click()
  File "~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 360, in find_element_by_id
    return self.find_element(by=By.ID, value=id_)
  File "~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 978, in find_element
    'value': value})['value']
  File "~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="loginbutton"]"}
  (Session info: chrome=87.0.4280.88)

The browser shows the cookie-consent window. Is there any solution?
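
One approach is to dismiss the consent dialog before _login looks for the login button. The data-cookiebanner attribute below is an assumption about the consent markup Facebook served around that time; verify it in devtools before relying on it:

    from selenium.common.exceptions import NoSuchElementException

    # ASSUMED selector for the cookie-consent accept button.
    try:
        browser.find_element_by_css_selector(
            'button[data-cookiebanner="accept_button"]').click()
    except NoSuchElementException:
        pass  # no consent dialog in this session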

The scraper doesn't work

command: python scraper.py -p [page name] -l 18 -c y

Error 1: in the #reaction block, row 100:

[Screenshot 2020-03-07 at 16:24:18]

Error 2: if I comment out the #reaction block and remove the Reaction column from the output file, this is the result:

[Screenshot 2020-03-07 at 16:37:18]

[Screenshot 2020-03-07 at 16:38:54]

As you can see, the Comments column contains only {}.

getting image links

In posts with multiple images, only the link of the first image is retrieved.

KeyError: 'data-testid'

Hello, thank you for facebook-post-scraper.
I have installed all the requirements, but when I run this code:

import argparse
import time
import json
import csv

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as bs

from scraper import extract

list = extract("ministere.sante.ma", 1 )

===============================
and I got this error:

Number Of Scrolls Needed 0
in shares

KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>
     10 from scraper import extract
     11
---> 12 list = extract("ministere.sante.ma", 1 )

~\Desktop\PFE\PFE-GIT\scraper.py in extract(page, numOfPost, infinite_scroll, scrape_comment)
    207     bs_data = bs(source_data, 'html.parser')
    208
--> 209     postBigDict = _extract_html(bs_data)
    210     browser.close()
    211

~\Desktop\PFE\PFE-GIT\scraper.py in _extract_html(bs_data)
     98     for toolBar_child in toolBar[0].children:
     99
--> 100         str = toolBar_child['data-testid']
    101         reaction = str.split("UFI2TopReactions/tooltip")[1]
    102

C:\ProgramData\Anaconda3\lib\site-packages\bs4\element.py in __getitem__(self, key)
    969         """tag[key] returns the value of the 'key' attribute for the tag,
    970         and throws an exception if it's not there."""
--> 971         return self.attrs[key]
    972
    973     def __iter__(self):

KeyError: 'data-testid'

Not saving to output.txt

_extract_html is producing an empty postBigDict list that yields no output. A bug may have been introduced when editing the reactions feature?

Error: 'charmap' codec can't encode characters

Complete noob here, so excuse my naivete.

I was getting the error "'charmap' codec can't encode characters", so I googled it and changed this code:

    if args.usage == "WT":
        with open('output.txt', 'w') as file:
            for post in postBigDict:
                file.write(json.dumps(post))  # use json load to recover

    elif args.usage == "CSV":
        with open('data.csv', 'w') as csvfile:
            writer = csv.writer(csvfile)
            # writer.writerow(['Post', 'Link', 'Image', 'Comments', 'Reaction'])
            writer.writerow(['Post', 'Link', 'Image', 'Comments', 'Shares'])

            for post in postBigDict:
                writer.writerow([post['Post'], post['Link'], post['Image'], post['Comments'], post['Shares']])
                # writer.writerow([post['Post'], post['Link'], post['Image'], post['Comments'], post['Reaction']])

    else:
        for post in postBigDict:
            print("\n")

to this:
    if args.usage == "WT":
        with io.open('output.txt', 'w', encoding='utf-8') as file:
            for post in postBigDict:
                file.write(json.dumps(post))  # use json load to recover

    elif args.usage == "CSV":
        with io.open('data.csv', 'w', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile)
            # writer.writerow(['Post', 'Link', 'Image', 'Comments', 'Reaction'])
            writer.writerow(['Post', 'Link', 'Image', 'Comments', 'Shares'])

            for post in postBigDict:
                writer.writerow([post['Post'], post['Link'], post['Image'], post['Comments'], post['Shares']])
                # writer.writerow([post['Post'], post['Link'], post['Image'], post['Comments'], post['Reaction']])

    else:
        for post in postBigDict:
            print("\n")


It then worked.
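
Two small footnotes on this fix: io.open requires an import io at the top of the file (on Python 3, the built-in open accepts encoding= directly, so io isn't strictly needed), and CSV files opened for writing on Windows generally also want newline='' to avoid blank rows between records:

    import csv

    # encoding avoids the charmap error; newline='' avoids blank rows on Windows.
    with open('data.csv', 'w', encoding='utf-8', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Post', 'Link', 'Image', 'Comments', 'Shares'])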

Invalid Session ID (large number of posts)

First of all, thanks for this scraper!
My problem is that when I download a large number of posts (> 4000) with 5-10 comments each, Chrome just crashes.

Initially, I got an "invalid session id" error when expanding collapsed comments.
Then I changed the code to open comments during the scroll function, and the error began to appear there ("invalid session id" again).

I read a lot of threads on Stack Overflow; they recommend adding some options to Chrome, and I tried them all. Many places also suggest giving Chrome more memory (when using Docker), but I just run the script directly.
It also seems to me that this problem is related to memory: Chrome closes due to too many images, media, etc.
Can you help me somehow? Have you seen this, and have you tested the script on large amounts of data?

[Screenshot 2020-08-20 at 01:09:02]
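
A hedged memory-reduction sketch for long runs: skipping image loading cuts Chrome's footprint considerably, and both switches below are standard Chromium flags (whether they fully fix the crash here is untested):

    from selenium.webdriver.chrome.options import Options

    option = Options()
    option.add_argument('--blink-settings=imagesEnabled=false')  # do not load images
    option.add_argument('--disable-dev-shm-usage')               # avoid small /dev/shm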

Page closes during scroll down

Hi,

Is this crawler still working? I managed to get it up and running, but once the script starts scrolling down the page, it abruptly closes and nothing gets written to the CSV file.
