tzuhsial / instagramcrawler
A non-API Python program to crawl public photos, posts, or followers
Home Page: https://github.com/iammrhelo/InstagramCrawler
License: MIT License
Please add a note that users should install geckodriver and add its location to PATH. Thanks!
Query account 'instagram', download 20 photos and their captions
Traceback (most recent call last):
  File "instagramcrawler.py", line 355, in <module>
    main()
  File "instagramcrawler.py", line 345, in main
    crawler = InstagramCrawler(headless=args.headless)
  File "instagramcrawler.py", line 67, in __init__
    self._driver = webdriver.Firefox()
  File "C:\Python27\lib\site-packages\selenium\webdriver\firefox\webdriver.py", line 142, in __init__
    self.service.start()
  File "C:\Python27\lib\site-packages\selenium\webdriver\common\service.py", line 81, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.
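For the geckodriver errors reported in this thread: Selenium's Firefox driver launches geckodriver as a separate executable, so it must be discoverable on PATH. A small sketch to check for it (and, with Selenium 3.x, point at it explicitly; the path handling here is an assumption about your setup, not part of the crawler):

```python
import shutil

def geckodriver_path():
    """Return the geckodriver executable's path if it is on PATH, else None."""
    return shutil.which("geckodriver")

if __name__ == "__main__":
    path = geckodriver_path()
    if path is None:
        print("geckodriver not found -- download it from the mozilla/geckodriver "
              "releases page and add its folder to your PATH")
    else:
        print("geckodriver found at", path)
        # Selenium 3.x also accepts an explicit path (hypothetical usage):
        # from selenium import webdriver
        # driver = webdriver.Firefox(executable_path=path)
```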
Hi guys, I was able to crawl followers on Instagram after modifying some lines in instagramcrawler.py. Now I'm facing a problem when the number of followers is over 1000: the crawler scrolls down the followers page, but the page freezes while loading, so after some time the crawler simply quits the query.
Have you faced this issue before? If so, how did you fix it?
Hi! I typed:
python instagramcrawler.py -q #breakfast -n 50
Firefox opened, but the crawler program showed:
posts: 61650050, number: 50
Saving...
Saving to directory: E:/ins\breakfast.hashtag
Scraping photo links...
Number of photo_links: 33
Downloading 1 images to
Traceback (most recent call last):
  File "instagramcrawler.py", line 342, in <module>
    main()
  File "instagramcrawler.py", line 338, in main
    authentication=args.authentication)
  File "instagramcrawler.py", line 133, in crawl
    self.download_and_save(dir_prefix, query, crawl_type)
  File "instagramcrawler.py", line 294, in download_and_save
    urlretrieve(photo_link, filepath)
  File "C:\Python27\lib\urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "C:\Python27\lib\urllib.py", line 245, in retrieve
    fp = self.open(url, data)
  File "C:\Python27\lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 443, in open_https
    h.endheaders(data)
  File "C:\Python27\lib\httplib.py", line 1038, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 882, in _send_output
    self.send(msg)
  File "C:\Python27\lib\httplib.py", line 844, in send
    self.connect()
  File "C:\Python27\lib\httplib.py", line 1263, in connect
    server_hostname=server_hostname)
  File "C:\Python27\lib\ssl.py", line 363, in wrap_socket
    _context=self)
  File "C:\Python27\lib\ssl.py", line 611, in __init__
    self.do_handshake()
  File "C:\Python27\lib\ssl.py", line 840, in do_handshake
    self._sslobj.do_handshake()
IOError: [Errno socket error] EOF occurred in violation of protocol (_ssl.c:661)
So why can't I download them?
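The "EOF occurred in violation of protocol" error is usually the old Python 2.7 SSL stack failing the TLS handshake with Instagram's CDN, which requires newer TLS versions. Moving to Python 3 (or a 2.7 build linked against a modern OpenSSL) typically fixes it. A minimal Python 3 sketch of the download step, offered as a hypothetical replacement for the crawler's `urlretrieve()` call:

```python
import ssl
import urllib.request

def make_download_context():
    """Default context: certificate verification on, modern TLS negotiated."""
    return ssl.create_default_context()

def download(url, filepath):
    # Hypothetical replacement for the crawler's urlretrieve(photo_link, filepath).
    with urllib.request.urlopen(url, context=make_download_context()) as resp:
        with open(filepath, "wb") as f:
            f.write(resp.read())
```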
I tried the program after changing the path to "C:\Program Files\Mozilla Firefox" in instagramcrawler.py, but when I run it I get the message below. I'm on Windows 10 with Firefox 59. Any idea how to fix this issue?
Traceback (most recent call last):
  File "instagramcrawler.py", line 360, in <module>
    main()
  File "instagramcrawler.py", line 350, in main
    crawler = InstagramCrawler(headless=args.headless, firefox_path="C:\Program Files\Mozilla Firefox")
  File "instagramcrawler.py", line 70, in __init__
    self._driver = webdriver.Firefox(firefox_binary=binary)
  File "C:\Users\Santo Wijaya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\firefox\webdriver.py", line 142, in __init__
    self.service.start()
  File "C:\Users\Santo Wijaya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\common\service.py", line 81, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.
Windows 7 64-bit, running the following command:
python instagramcrawler.py -q 'nude_yogagirl' -n 20
output is:
d:\python\InstagramCrawler-master>python instagramcrawler.py -q 'nude_yogagirl' -n 20
Traceback (most recent call last):
  File "instagramcrawler.py", line 297, in <module>
    main()
  File "instagramcrawler.py", line 291, in main
    crawler = InstagramCrawler()
  File "instagramcrawler.py", line 58, in __init__
    self._driver = webdriver.Firefox()
  File "D:\Program Files\Python361\lib\site-packages\selenium\webdriver\firefox\webdriver.py", line 152, in __init__
    keep_alive=True)
  File "D:\Program Files\Python361\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 98, in __init__
    self.start_session(desired_capabilities, browser_profile)
  File "D:\Program Files\Python361\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 185, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "D:\Program Files\Python361\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 249, in execute
    self.error_handler.check_response(response)
  File "D:\Program Files\Python361\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Unable to find a matching set of capabilities
I'm getting this error when I run the script (it also happens with the python command):
python3 instagramcrawler.py -q '#breakfast' -n 20 -a auth.json
Traceback (most recent call last):
  File "instagramcrawler.py", line 360, in <module>
    main()
  File "instagramcrawler.py", line 350, in main
    crawler = InstagramCrawler(headless=args.headless, firefox_path=args.firefox_path)
  File "instagramcrawler.py", line 72, in __init__
    self._driver.implicitly_wait(10)
  File "/home/mestre/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 719, in implicitly_wait
    'implicit': int(float(time_to_wait) * 1000)})
  File "/home/mestre/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 249, in execute
    self.error_handler.check_response(response)
  File "/home/mestre/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: timeouts
Works fine but scraping followers gives me:
Scraping followers...
Traceback (most recent call last):
  File "instagramcrawler.py", line 302, in
    caption='None')
  File "instagramcrawler.py", line 96, in crawl
    self.scrape_followers_or_following(crawl_type, query, number)
  File "instagramcrawler.py", line 214, in scrape_followers_or_following
    title = self._driver.find_element_by_xpath(FOLLOW_PATH)
  File "...\selenium\webdriver\remote\webdriver.py", line 313, in find_element_by_xpath
    return self.find_element(by=By.XPATH, value=xpath)
  File "...\selenium\webdriver\remote\webdriver.py", line 791, in find_element
    'value': value})['value']
  File "...\selenium\webdriver\remote\webdriver.py", line 256, in execute
    self.error_handler.check_response(response)
  File "...\selenium\webdriver\remote\errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: //div[contains(text(), 'Followers')]
Any idea? Thanks for your help.
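Instagram changes its page markup frequently, so hard-coded XPaths like the one in this error go stale. One defensive pattern is to try several candidate selectors and take the first that matches; the candidate XPaths below are illustrative guesses, not a guaranteed current selector:

```python
def find_first(driver, xpaths):
    """Return the first element matched by any candidate XPath, or None.

    `driver` is anything exposing find_element_by_xpath (e.g. a Selenium
    WebDriver); any lookup failure is treated as "no match".
    """
    for xp in xpaths:
        try:
            return driver.find_element_by_xpath(xp)
        except Exception:
            continue
    return None

# Hypothetical candidates for the Followers link; refresh as the markup changes.
FOLLOW_XPATHS = [
    "//div[contains(text(), 'Followers')]",
    "//a[contains(@href, '/followers/')]",
]
```

If `find_first` returns None, it usually means the markup changed again and the candidate list needs updating.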
Hi, can anyone tell me how to use this? I'm new to this and can't seem to get it working. I'd appreciate it if anyone could help. Thanks!
I am getting the following error. The CSS selector needs to be updated from what is in the code (it is now "a._1cr2e _epyes"), but that still does not solve it for me.
Any insights much appreciated!
Traceback (most recent call last):
  File "C:/Users/QS-2 SARAH/Desktop/IG/IGcrawler.py", line 360, in <module>
    main()
  File "C:/Users/QS-2 SARAH/Desktop/IG/IGcrawler.py", line 356, in main
    authentication=args.authentication)
  File "C:/Users/QS-2 SARAH/Desktop/IG/IGcrawler.py", line 119, in crawl
    self.scroll_to_num_of_posts(number)
  File "C:/Users/QS-2 SARAH/Desktop/IG/IGcrawler.py", line 171, in scroll_to_num_of_posts
    (By.CSS_SELECTOR, CSS_LOAD_MORE))
  File "C:\Users\QS-2 SARAH\Anaconda2\lib\site-packages\selenium\webdriver\support\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
TimeoutException
It worked perfectly a few days ago, but suddenly photo_links returns only 51 elements at most (or fewer).
How can I fix it?
Thanks for your help! :)
Hey, so far I've crawled followers smoothly, but I have two issues:
D:\Development\python\InstagramCrawler-master>python instagramcrawler.py -q daniel.hoi23s -c -n 100
dir_prefix: ./data/, query: daniel.hoi23s, crawl_type: photos, number: 100, caption: True
posts: 394, number: 100
Scraping photo links...
Number of photo_links: 27
Scraping captions...
Traceback (most recent call last):
  File "instagramcrawler.py", line 297, in <module>
    main()
  File "instagramcrawler.py", line 293, in main
    caption=args.caption)
  File "instagramcrawler.py", line 85, in crawl
    self.click_and_scrape_captions(number)
  File "instagramcrawler.py", line 161, in click_and_scrape_captions
    FIREFOX_FIRST_POST_PATH).click()
  File "C:\Users\aaaaa\AppData\Local\Programs\Python\Python35\lib\site-packages\selenium\webdriver\remote\webelement.py", line 77, in click
    self._execute(Command.CLICK_ELEMENT)
  File "C:\Users\aaaaa\AppData\Local\Programs\Python\Python35\lib\site-packages\selenium\webdriver\remote\webelement.py", line 493, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\aaaaa\AppData\Local\Programs\Python\Python35\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 249, in execute
    self.error_handler.check_response(response)
  File "C:\Users\aaaaa\AppData\Local\Programs\Python\Python35\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementNotInteractableException: Message:
============================================
geckodriver.log:
1496170001211 geckodriver INFO Listening on 127.0.0.1:64773
1496170003300 geckodriver::marionette INFO Starting browser \?\C:\Program Files (x86)\Mozilla Firefox\firefox.exe with args ["-marionette"]
1496170004239 addons.manager ERROR startup failed: [Exception... "Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIFile.create]" nsresult: "0x80070057 (NS_ERROR_ILLEGAL_VALUE)" location: "JS frame :: resource://gre/modules/FileUtils.jsm :: FileUtils_getDir :: line 70" data: no] Stack trace: FileUtils_getDir()@resource://gre/modules/FileUtils.jsm:70 < FileUtils_getFile()@resource://gre/modules/FileUtils.jsm:42 < validateBlocklist()@resource://gre/modules/AddonManager.jsm:671 < startup()@resource://gre/modules/AddonManager.jsm:834 < startup()@resource://gre/modules/AddonManager.jsm:3129 < observe()@resource://gre/components/addonManager.js:65
JavaScript error: resource://gre/modules/AddonManager.jsm, line 1657: NS_ERROR_NOT_INITIALIZED: AddonManager is not initialized
1496170009047 Marionette INFO Listening on port 64780
JavaScript error: resource://gre/modules/AddonManager.jsm, line 2570: NS_ERROR_NOT_INITIALIZED: AddonManager is not initialized
1496170009378 Marionette WARN TLS certificate errors will be ignored for this session
JavaScript error: resource://gre/modules/FileUtils.jsm, line 70: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIFile.create]
Hi Guys,
Thanks for the great project which I use to get followers of a user. This works in about 50% of the cases, but sometimes I get the following error:
Traceback (most recent call last):
  File "crawler_tuintjedelen.py", line 364, in main
    crawler.browse(args.query,args.type).crawl(args.number,args.caption).save()
  File "crawler_tuintjedelen.py", line 160, in crawl
    self.followlist = self._crawl_follow()
  File "crawler_tuintjedelen.py", line 319, in _crawl_follow
    self.driver.execute_script(SCROLL_DOWN)
  File "/home/makusu/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 465, in execute_script
    'args': converted_args})['value']
  File "/home/makusu/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 234, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/home/makusu/.local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 408, in execute
    return self._request(command_info[0], url, body=data)
  File "/home/makusu/.local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 478, in _request
    resp = opener.open(request, timeout=self._timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1201, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1136, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 453, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 417, in _read_status
    raise BadStatusLine(line)
BadStatusLine: ''
Exception urllib2.URLError: URLError(error(111, 'Connection refused'),) in <bound method InstagramCrawler.__del__ of <__main__.InstagramCrawler object at 0x7f81e3cf7bd0>> ignored
Any idea what could be going on? Running it on Ubuntu with PhantomJS.
Thanks!
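BadStatusLine('') generally means the driver process (PhantomJS here) dropped the HTTP connection mid-command. A small retry helper can smooth over transient drops; this is a sketch, and which exception types to catch depends on your Python/Selenium versions:

```python
import time

def with_retries(fn, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call fn(), retrying up to `attempts` times on the given exceptions."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except exceptions as err:
            last_error = err
            time.sleep(delay)
    raise last_error

# Hypothetical usage around the failing scroll call:
# with_retries(lambda: driver.execute_script(SCROLL_DOWN))
```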
I frequently get this error when storing captions. It doesn't seem to be an issue when just grabbing images.
Traceback (most recent call last):
  File "instagramcrawler.py", line 341, in <module>
    main()
  File "instagramcrawler.py", line 337, in main
    authentication=args.authentication)
  File "instagramcrawler.py", line 114, in crawl
    self.click_and_scrape_captions(number)
  File "instagramcrawler.py", line 221, in click_and_scrape_captions
    EC.presence_of_element_located((By.TAG_NAME, "time"))
  File "/home/jake/.virtualenvs/insta/lib/python3.5/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Do you know how I can fix this error?
[jalal@goku InstagramCrawler]$ python instagramcrawler.py -q '#breakfast' -n 50
Traceback (most recent call last):
  File "instagramcrawler.py", line 360, in <module>
    main()
  File "instagramcrawler.py", line 350, in main
    crawler = InstagramCrawler(headless=args.headless, firefox_path=args.firefox_path)
  File "instagramcrawler.py", line 70, in __init__
    self._driver = webdriver.Firefox(firefox_binary=binary)
  File "/home/grad3/jalal/.local/lib/python3.6/site-packages/selenium/webdriver/firefox/webdriver.py", line 152, in __init__
    keep_alive=True)
  File "/home/grad3/jalal/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 98, in __init__
    self.start_session(desired_capabilities, browser_profile)
  File "/home/grad3/jalal/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 185, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/grad3/jalal/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 249, in execute
    self.error_handler.check_response(response)
  File "/home/grad3/jalal/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Unable to find a matching set of capabilities
Also, how should I enter user/pass?
I only get 1 follower/following even when -n is set to 20.
I analysed Instagram's HTML and found something wrong with line 273: num_of_shown_follow = len(List.find_elements_by_xpath('*'))
Instagram's markup is structured as follows:
<ul>
  <div>
    <li>follow info</li>
    ...
    <li>another follow info</li>
  </div>
</ul>
Here the List element is the <ul> node, and what we want are the <li> nodes, but line 273 returns the <div> node instead.
I changed line 273 to List.find_elements_by_tag_name('li') and that solved the problem.
Hope this helps others.
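The fix above can be made even more robust by selecting li descendants directly via XPath, so any extra wrapper elements Instagram inserts no longer matter. A sketch, where `follow_list` stands for the element the script calls List:

```python
def count_shown_follows(follow_list):
    """Count follower rows via the .//li XPath, which matches all <li>
    descendants regardless of intermediate wrapper <div>s."""
    return len(follow_list.find_elements_by_xpath(".//li"))
```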
When I crawl different users I often get only 13 photos, even though '-n' is set to 200 and these users actually have more than 13 photos.
So I think loadmore.click() at line 168 may not be working correctly. Does anyone have the same problem?
Some info:
posts: 148, number: 200
Scraping photo links...
Number of photo_links: 13
Saving...
Downloading 12 images to ta/linyaudavy
Quitting driver...
headless mode on
dir_prefix: ./data/, query: hecs_510, crawl_type: photos, number: 200, caption: False, authentication: auth.json
posts: 273, number: 200
Scraping photo links...
Number of photo_links: 13
Saving...
Downloading 12 images to ta/hecs_510
Quitting driver...
headless mode on
dir_prefix: ./data/, query: da1sun, crawl_type: photos, number: 200, caption: False, authentication: auth.json
posts: 2140, number: 200
Scraping photo links...
Number of photo_links: 13
Saving...
Downloading 12 images to ta/da1sun
Quitting driver...
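A likely cause of the 13-photo cap is that Instagram only keeps the currently rendered posts in the DOM, so scraping links once after scrolling misses the rest. Accumulating links on every scroll step avoids that. This is a sketch where `get_links` and `scroll` stand for whatever functions your copy of the script uses to read anchors and to scroll (both assumed, not part of the original code):

```python
import time

def collect_links_while_scrolling(driver, target, get_links, scroll,
                                  pause=1.0, max_rounds=50):
    """Accumulate unique photo links across scroll steps until `target`
    links are collected or `max_rounds` scrolls have been tried."""
    seen = []
    for _ in range(max_rounds):
        for link in get_links(driver):
            if link not in seen:
                seen.append(link)
        if len(seen) >= target:
            break
        scroll(driver)
        time.sleep(pause)
    return seen[:target]
```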
Hi,
When we run the script in PyCharm on Windows we get this exception:
C:\Python27\python.exe C:/Users/Vahid/PycharmProjects/instagram/instagramcrawler.py -q instagram -t photos -c -n 100
Number to crawl 100
Traceback (most recent call last):
  File "C:/Users/Vahid/PycharmProjects/instagram/instagramcrawler.py", line 314, in main
    crawler.browse(args.query,args.type).crawl(args.number,args.caption).save()
  File "C:/Users/Vahid/PycharmProjects/instagram/instagramcrawler.py", line 127, in crawl
    self.captions = self._crawl_captions()
  File "C:/Users/Vahid/PycharmProjects/instagram/instagramcrawler.py", line 234, in _crawl_captions
    EC.presence_of_element_located((By.CSS_SELECTOR,CSS_RIGHT_ARROW))
  File "C:\Python27\lib\site-packages\selenium\webdriver\support\wait.py", line 81, in until
    raise TimeoutException(message, screen, stacktrace)
TimeoutException: Message:
How can we solve the problem?
Hello!
If I want to crawl 1000 users' photos and save each user's photos to a separate directory, how can I do that?
Thank you!
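The simplest route is to invoke the crawler once per username, since each query already gets its own subfolder under dir_prefix. A sketch of a driver loop; the -q/-n flags match the commands quoted elsewhere in this thread, but verify them against your copy of the script:

```python
import os
import subprocess

def output_dir(base, username):
    """Each user's photos land in their own folder under `base`."""
    return os.path.join(base, username)

def crawl_users(usernames, number=50, base="./data/"):
    for user in usernames:
        os.makedirs(output_dir(base, user), exist_ok=True)
        # One process per user; hypothetical invocation -- adjust flags
        # to whatever your script version accepts.
        subprocess.run(["python", "instagramcrawler.py",
                        "-q", user, "-n", str(number)])
```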
When running just a basic search on the instagram profile, with python instagramcrawler.py -q instagram -t photos -n 100 -l
I get the following error:
Traceback (most recent call last):
  File "instagramcrawler.py", line 360, in <module>
    main()
  File "instagramcrawler.py", line 356, in main
    authentication=args.authentication)
  File "instagramcrawler.py", line 117, in crawl
    self.scroll_to_num_of_posts(number)
  File "instagramcrawler.py", line 161, in scroll_to_num_of_posts
    self._driver.page_source).group()
AttributeError: 'NoneType' object has no attribute 'group'
Any idea on what I'm doing wrong?
phantomjs version 1.9.8
ubuntu 16.04
After calling this:
python instagramcrawler.py -q 'instagram' -t 'followers' -n 30 -a auth.json
the process works fine until it opens the follower list; after that it gets stuck and nothing happens. A few weeks ago it worked well, auto-scrolling the follower list and collecting all the followers.
Did Instagram change something in their code? Is there a solution for this?
Is there any way to run it headless?