tzuhsial / instagramcrawler
A non-API Python program to crawl public photos, posts, or followers
Home Page: https://github.com/iammrhelo/InstagramCrawler
License: MIT License
Please add a note that users should install geckodriver and add its location to PATH. Thanks!
Query account 'instagram', download 20 photos and their captions
Traceback (most recent call last):
  File "instagramcrawler.py", line 355, in <module>
    main()
  File "instagramcrawler.py", line 345, in main
    crawler = InstagramCrawler(headless=args.headless)
  File "instagramcrawler.py", line 67, in __init__
    self._driver = webdriver.Firefox()
  File "C:\Python27\lib\site-packages\selenium\webdriver\firefox\webdriver.py", line 142, in __init__
    self.service.start()
  File "C:\Python27\lib\site-packages\selenium\webdriver\common\service.py", line 81, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.
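For the geckodriver errors reported in this thread: Selenium's Firefox driver launches geckodriver as a separate executable, so it must be discoverable on PATH. A small sketch to check for it (and, with Selenium 3.x, point at it explicitly; the path handling here is an assumption about your setup, not part of the crawler):

```python
import shutil

def geckodriver_path():
    """Return the geckodriver executable's path if it is on PATH, else None."""
    return shutil.which("geckodriver")

if __name__ == "__main__":
    path = geckodriver_path()
    if path is None:
        print("geckodriver not found -- download it from the mozilla/geckodriver "
              "releases page and add its folder to your PATH")
    else:
        print("geckodriver found at", path)
        # Selenium 3.x also accepts an explicit path (hypothetical usage):
        # from selenium import webdriver
        # driver = webdriver.Firefox(executable_path=path)
```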
Hi guys, I was able to crawl followers on Instagram after modifying some lines in instagramcrawler.py. Now I'm facing a problem when the number of followers is over 1000: the crawler scrolls down the followers page, but the page freezes while loading, so after some time the crawler simply quits the query.
Have you faced this issue before? If so, how did you fix it?
Hi! I typed:
python instagramcrawler.py -q #breakfast -n 50
Firefox opened, but the crawler program showed:
posts: 61650050, number: 50
Saving...
Saving to directory: E:/ins\breakfast.hashtag
Scraping photo links...
Number of photo_links: 33
Downloading 1 images to
Traceback (most recent call last):
  File "instagramcrawler.py", line 342, in <module>
    main()
  File "instagramcrawler.py", line 338, in main
    authentication=args.authentication)
  File "instagramcrawler.py", line 133, in crawl
    self.download_and_save(dir_prefix, query, crawl_type)
  File "instagramcrawler.py", line 294, in download_and_save
    urlretrieve(photo_link, filepath)
  File "C:\Python27\lib\urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "C:\Python27\lib\urllib.py", line 245, in retrieve
    fp = self.open(url, data)
  File "C:\Python27\lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 443, in open_https
    h.endheaders(data)
  File "C:\Python27\lib\httplib.py", line 1038, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 882, in _send_output
    self.send(msg)
  File "C:\Python27\lib\httplib.py", line 844, in send
    self.connect()
  File "C:\Python27\lib\httplib.py", line 1263, in connect
    server_hostname=server_hostname)
  File "C:\Python27\lib\ssl.py", line 363, in wrap_socket
    _context=self)
  File "C:\Python27\lib\ssl.py", line 611, in __init__
    self.do_handshake()
  File "C:\Python27\lib\ssl.py", line 840, in do_handshake
    self._sslobj.do_handshake()
IOError: [Errno socket error] EOF occurred in violation of protocol (_ssl.c:661)
So why can't I download them?
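The "EOF occurred in violation of protocol" error is usually the old Python 2.7 SSL stack failing the TLS handshake with Instagram's CDN, which requires newer TLS versions. Moving to Python 3 (or a 2.7 build linked against a modern OpenSSL) typically fixes it. A minimal Python 3 sketch of the download step, offered as a hypothetical replacement for the crawler's `urlretrieve()` call:

```python
import ssl
import urllib.request

def make_download_context():
    """Default context: certificate verification on, modern TLS negotiated."""
    return ssl.create_default_context()

def download(url, filepath):
    # Hypothetical replacement for the crawler's urlretrieve(photo_link, filepath).
    with urllib.request.urlopen(url, context=make_download_context()) as resp:
        with open(filepath, "wb") as f:
            f.write(resp.read())
```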
I tried the program after changing the path to "C:\Program Files\Mozilla Firefox" in instagramcrawler.py, but when I run it I get the message below. I'm on Windows 10 with Firefox 59. Any idea how to fix this issue?
Traceback (most recent call last):
  File "instagramcrawler.py", line 360, in <module>
    main()
  File "instagramcrawler.py", line 350, in main
    crawler = InstagramCrawler(headless=args.headless, firefox_path="C:\Program Files\Mozilla Firefox")
  File "instagramcrawler.py", line 70, in __init__
    self._driver = webdriver.Firefox(firefox_binary=binary)
  File "C:\Users\Santo Wijaya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\firefox\webdriver.py", line 142, in __init__
    self.service.start()
  File "C:\Users\Santo Wijaya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\common\service.py", line 81, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.
Windows 7 64-bit, running the following command:
python instagramcrawler.py -q 'nude_yogagirl' -n 20
output is:
d:\python\InstagramCrawler-master>python instagramcrawler.py -q 'nude_yogagirl' -n 20
Traceback (most recent call last):
  File "instagramcrawler.py", line 297, in <module>
    main()
  File "instagramcrawler.py", line 291, in main
    crawler = InstagramCrawler()
  File "instagramcrawler.py", line 58, in __init__
    self._driver = webdriver.Firefox()
  File "D:\Program Files\Python361\lib\site-packages\selenium\webdriver\firefox\webdriver.py", line 152, in __init__
    keep_alive=True)
  File "D:\Program Files\Python361\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 98, in __init__
    self.start_session(desired_capabilities, browser_profile)
  File "D:\Program Files\Python361\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 185, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "D:\Program Files\Python361\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 249, in execute
    self.error_handler.check_response(response)
  File "D:\Program Files\Python361\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Unable to find a matching set of capabilities
I'm getting this error when I run the script (it also happens with the python command):
python3 instagramcrawler.py -q '#breakfast' -n 20 -a auth.json
Traceback (most recent call last):
  File "instagramcrawler.py", line 360, in <module>
    main()
  File "instagramcrawler.py", line 350, in main
    crawler = InstagramCrawler(headless=args.headless, firefox_path=args.firefox_path)
  File "instagramcrawler.py", line 72, in __init__
    self._driver.implicitly_wait(10)
  File "/home/mestre/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 719, in implicitly_wait
    'implicit': int(float(time_to_wait) * 1000)})
  File "/home/mestre/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 249, in execute
    self.error_handler.check_response(response)
  File "/home/mestre/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: timeouts
Works fine but scraping followers gives me:
Scraping followers...
Traceback (most recent call last):
  File "instagramcrawler.py", line 302, in
    caption='None')
  File "instagramcrawler.py", line 96, in crawl
    self.scrape_followers_or_following(crawl_type, query, number)
  File "instagramcrawler.py", line 214, in scrape_followers_or_following
    title = self._driver.find_element_by_xpath(FOLLOW_PATH)
  File "...\selenium\webdriver\remote\webdriver.py", line 313, in find_element_by_xpath
    return self.find_element(by=By.XPATH, value=xpath)
  File "...\selenium\webdriver\remote\webdriver.py", line 791, in find_element
    'value': value})['value']
  File "...\selenium\webdriver\remote\webdriver.py", line 256, in execute
    self.error_handler.check_response(response)
  File "...\selenium\webdriver\remote\errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: //div[contains(text(), 'Followers')]
Any idea? Thanks for your help.
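Instagram changes its page markup frequently, so hard-coded XPaths like the one in this error go stale. One defensive pattern is to try several candidate selectors and take the first that matches; the candidate XPaths below are illustrative guesses, not a guaranteed current selector:

```python
def find_first(driver, xpaths):
    """Return the first element matched by any candidate XPath, or None.

    `driver` is anything exposing find_element_by_xpath (e.g. a Selenium
    WebDriver); any lookup failure is treated as "no match".
    """
    for xp in xpaths:
        try:
            return driver.find_element_by_xpath(xp)
        except Exception:
            continue
    return None

# Hypothetical candidates for the Followers link; refresh as the markup changes.
FOLLOW_XPATHS = [
    "//div[contains(text(), 'Followers')]",
    "//a[contains(@href, '/followers/')]",
]
```

If `find_first` returns None, it usually means the markup changed again and the candidate list needs updating.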
Hi, can anyone tell me how to use this? I'm new to this and can't seem to get it working. I'd appreciate it if anyone could help. Thanks!
I am getting the following error. The CSS selector needs to be updated from what is in the code (it is now "a._1cr2e _epyes"), but that still does not solve it for me.
Any insights much appreciated!
Traceback (most recent call last):
  File "C:/Users/QS-2 SARAH/Desktop/IG/IGcrawler.py", line 360, in <module>
    main()
  File "C:/Users/QS-2 SARAH/Desktop/IG/IGcrawler.py", line 356, in main
    authentication=args.authentication)
  File "C:/Users/QS-2 SARAH/Desktop/IG/IGcrawler.py", line 119, in crawl
    self.scroll_to_num_of_posts(number)
  File "C:/Users/QS-2 SARAH/Desktop/IG/IGcrawler.py", line 171, in scroll_to_num_of_posts
    (By.CSS_SELECTOR, CSS_LOAD_MORE))
  File "C:\Users\QS-2 SARAH\Anaconda2\lib\site-packages\selenium\webdriver\support\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
TimeoutException
It worked perfectly a few days ago, but suddenly photo_links returns only 51 elements at most (or fewer).
How can I fix it?
Thanks for your help! :)
Hey, so far I've crawled followers smoothly, but I have two issues:
D:\Development\python\InstagramCrawler-master>python instagramcrawler.py -q daniel.hoi23s -c -n 100
dir_prefix: ./data/, query: daniel.hoi23s, crawl_type: photos, number: 100, caption: True
posts: 394, number: 100
Scraping photo links...
Number of photo_links: 27
Scraping captions...
Traceback (most recent call last):
  File "instagramcrawler.py", line 297, in <module>
    main()
  File "instagramcrawler.py", line 293, in main
    caption=args.caption)
  File "instagramcrawler.py", line 85, in crawl
    self.click_and_scrape_captions(number)
  File "instagramcrawler.py", line 161, in click_and_scrape_captions
    FIREFOX_FIRST_POST_PATH).click()
  File "C:\Users\aaaaa\AppData\Local\Programs\Python\Python35\lib\site-packages\selenium\webdriver\remote\webelement.py", line 77, in click
    self._execute(Command.CLICK_ELEMENT)
  File "C:\Users\aaaaa\AppData\Local\Programs\Python\Python35\lib\site-packages\selenium\webdriver\remote\webelement.py", line 493, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\aaaaa\AppData\Local\Programs\Python\Python35\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 249, in execute
    self.error_handler.check_response(response)
  File "C:\Users\aaaaa\AppData\Local\Programs\Python\Python35\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementNotInteractableException: Message:
============================================
geckodriver.log:
1496170001211 geckodriver INFO Listening on 127.0.0.1:64773
1496170003300 geckodriver::marionette INFO Starting browser \?\C:\Program Files (x86)\Mozilla Firefox\firefox.exe with args ["-marionette"]
1496170004239 addons.manager ERROR startup failed: [Exception... "Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIFile.create]" nsresult: "0x80070057 (NS_ERROR_ILLEGAL_VALUE)" location: "JS frame :: resource://gre/modules/FileUtils.jsm :: FileUtils_getDir :: line 70" data: no] Stack trace: FileUtils_getDir()@resource://gre/modules/FileUtils.jsm:70 < FileUtils_getFile()@resource://gre/modules/FileUtils.jsm:42 < validateBlocklist()@resource://gre/modules/AddonManager.jsm:671 < startup()@resource://gre/modules/AddonManager.jsm:834 < startup()@resource://gre/modules/AddonManager.jsm:3129 < observe()@resource://gre/components/addonManager.js:65
JavaScript error: resource://gre/modules/AddonManager.jsm, line 1657: NS_ERROR_NOT_INITIALIZED: AddonManager is not initialized
1496170009047 Marionette INFO Listening on port 64780
JavaScript error: resource://gre/modules/AddonManager.jsm, line 2570: NS_ERROR_NOT_INITIALIZED: AddonManager is not initialized
1496170009378 Marionette WARN TLS certificate errors will be ignored for this session
JavaScript error: resource://gre/modules/FileUtils.jsm, line 70: NS_ERROR_ILLEGAL_VALUE: Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIFile.create]
Hi Guys,
Thanks for the great project which I use to get followers of a user. This works in about 50% of the cases, but sometimes I get the following error:
Traceback (most recent call last):
  File "crawler_tuintjedelen.py", line 364, in main
    crawler.browse(args.query,args.type).crawl(args.number,args.caption).save()
  File "crawler_tuintjedelen.py", line 160, in crawl
    self.followlist = self._crawl_follow()
  File "crawler_tuintjedelen.py", line 319, in _crawl_follow
    self.driver.execute_script(SCROLL_DOWN)
  File "/home/makusu/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 465, in execute_script
    'args': converted_args})['value']
  File "/home/makusu/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 234, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/home/makusu/.local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 408, in execute
    return self._request(command_info[0], url, body=data)
  File "/home/makusu/.local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 478, in _request
    resp = opener.open(request, timeout=self._timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1201, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1136, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 453, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 417, in _read_status
    raise BadStatusLine(line)
BadStatusLine: ''
Exception urllib2.URLError: URLError(error(111, 'Connection refused'),) in <bound method InstagramCrawler.__del__ of <__main__.InstagramCrawler object at 0x7f81e3cf7bd0>> ignored
Any idea what could be going on? Running it on Ubuntu with PhantomJS.
Thanks!
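BadStatusLine('') generally means the driver process (PhantomJS here) dropped the HTTP connection mid-command. A small retry helper can smooth over transient drops; this is a sketch, and which exception types to catch depends on your Python/Selenium versions:

```python
import time

def with_retries(fn, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call fn(), retrying up to `attempts` times on the given exceptions."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except exceptions as err:
            last_error = err
            time.sleep(delay)
    raise last_error

# Hypothetical usage around the failing scroll call:
# with_retries(lambda: driver.execute_script(SCROLL_DOWN))
```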
I frequently get this error when storing captions. It doesn't seem to be an issue when just grabbing images.
Traceback (most recent call last):
  File "instagramcrawler.py", line 341, in <module>
    main()
  File "instagramcrawler.py", line 337, in main
    authentication=args.authentication)
  File "instagramcrawler.py", line 114, in crawl
    self.click_and_scrape_captions(number)
  File "instagramcrawler.py", line 221, in click_and_scrape_captions
    EC.presence_of_element_located((By.TAG_NAME, "time"))
  File "/home/jake/.virtualenvs/insta/lib/python3.5/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Do you know how I can fix this error?
[jalal@goku InstagramCrawler]$ python instagramcrawler.py -q '#breakfast' -n 50
Traceback (most recent call last):
  File "instagramcrawler.py", line 360, in <module>
    main()
  File "instagramcrawler.py", line 350, in main
    crawler = InstagramCrawler(headless=args.headless, firefox_path=args.firefox_path)
  File "instagramcrawler.py", line 70, in __init__
    self._driver = webdriver.Firefox(firefox_binary=binary)
  File "/home/grad3/jalal/.local/lib/python3.6/site-packages/selenium/webdriver/firefox/webdriver.py", line 152, in __init__
    keep_alive=True)
  File "/home/grad3/jalal/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 98, in __init__
    self.start_session(desired_capabilities, browser_profile)
  File "/home/grad3/jalal/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 185, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/grad3/jalal/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 249, in execute
    self.error_handler.check_response(response)
  File "/home/grad3/jalal/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Unable to find a matching set of capabilities
Also, how should I enter user/pass?
I only get 1 follower/following even when -n is set to 20.
I analysed Instagram's HTML and found something wrong with line 273: num_of_shown_follow = len(List.find_elements_by_xpath('*'))
Instagram's markup is structured as follows:
<ul>
  <div>
    <li>follow info</li>
    ...
    <li>another follow info</li>
  </div>
</ul>
Here the List element is the <ul> node, and what we want are the <li> nodes, but line 273 returns the <div> node instead.
I changed line 273 to List.find_elements_by_tag_name('li') and that solved the problem.
Hope this helps others.
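The fix above can be made even more robust by selecting li descendants directly via XPath, so any extra wrapper elements Instagram inserts no longer matter. A sketch, where `follow_list` stands for the element the script calls List:

```python
def count_shown_follows(follow_list):
    """Count follower rows via the .//li XPath, which matches all <li>
    descendants regardless of intermediate wrapper <div>s."""
    return len(follow_list.find_elements_by_xpath(".//li"))
```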
When I crawl different users I often get only 13 photos, even though '-n' is set to 200 and these users actually have more than 13 photos.
So I think loadmore.click() at line 168 may not be working correctly. Does anyone have the same problem?
Some info:
posts: 148, number: 200
Scraping photo links...
Number of photo_links: 13
Saving...
Downloading 12 images to ta/linyaudavy
Quitting driver...
headless mode on
dir_prefix: ./data/, query: hecs_510, crawl_type: photos, number: 200, caption: False, authentication: auth.json
posts: 273, number: 200
Scraping photo links...
Number of photo_links: 13
Saving...
Downloading 12 images to ta/hecs_510
Quitting driver...
headless mode on
dir_prefix: ./data/, query: da1sun, crawl_type: photos, number: 200, caption: False, authentication: auth.json
posts: 2140, number: 200
Scraping photo links...
Number of photo_links: 13
Saving...
Downloading 12 images to ta/da1sun
Quitting driver...
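A likely cause of the 13-photo cap is that Instagram only keeps the currently rendered posts in the DOM, so scraping links once after scrolling misses the rest. Accumulating links on every scroll step avoids that. This is a sketch where `get_links` and `scroll` stand for whatever functions your copy of the script uses to read anchors and to scroll (both assumed, not part of the original code):

```python
import time

def collect_links_while_scrolling(driver, target, get_links, scroll,
                                  pause=1.0, max_rounds=50):
    """Accumulate unique photo links across scroll steps until `target`
    links are collected or `max_rounds` scrolls have been tried."""
    seen = []
    for _ in range(max_rounds):
        for link in get_links(driver):
            if link not in seen:
                seen.append(link)
        if len(seen) >= target:
            break
        scroll(driver)
        time.sleep(pause)
    return seen[:target]
```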
Hi,
When we run the script in PyCharm on Windows we get this exception:
C:\Python27\python.exe C:/Users/Vahid/PycharmProjects/instagram/instagramcrawler.py -q instagram -t photos -c -n 100
Number to crawl 100
Traceback (most recent call last):
  File "C:/Users/Vahid/PycharmProjects/instagram/instagramcrawler.py", line 314, in main
    crawler.browse(args.query,args.type).crawl(args.number,args.caption).save()
  File "C:/Users/Vahid/PycharmProjects/instagram/instagramcrawler.py", line 127, in crawl
    self.captions = self._crawl_captions()
  File "C:/Users/Vahid/PycharmProjects/instagram/instagramcrawler.py", line 234, in _crawl_captions
    EC.presence_of_element_located((By.CSS_SELECTOR,CSS_RIGHT_ARROW))
  File "C:\Python27\lib\site-packages\selenium\webdriver\support\wait.py", line 81, in until
    raise TimeoutException(message, screen, stacktrace)
TimeoutException: Message:
How can we solve the problem?
Hello!
If I want to crawl 1000 users' photos and save each user's photos to a separate directory, how can I do that?
Thank you!
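The simplest route is to invoke the crawler once per username, since each query already gets its own subfolder under dir_prefix. A sketch of a driver loop; the -q/-n flags match the commands quoted elsewhere in this thread, but verify them against your copy of the script:

```python
import os
import subprocess

def output_dir(base, username):
    """Each user's photos land in their own folder under `base`."""
    return os.path.join(base, username)

def crawl_users(usernames, number=50, base="./data/"):
    for user in usernames:
        os.makedirs(output_dir(base, user), exist_ok=True)
        # One process per user; hypothetical invocation -- adjust flags
        # to whatever your script version accepts.
        subprocess.run(["python", "instagramcrawler.py",
                        "-q", user, "-n", str(number)])
```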
When running just a basic search on the instagram profile, with python instagramcrawler.py -q instagram -t photos -n 100 -l
I get the following error:
Traceback (most recent call last):
  File "instagramcrawler.py", line 360, in <module>
    main()
  File "instagramcrawler.py", line 356, in main
    authentication=args.authentication)
  File "instagramcrawler.py", line 117, in crawl
    self.scroll_to_num_of_posts(number)
  File "instagramcrawler.py", line 161, in scroll_to_num_of_posts
    self._driver.page_source).group()
AttributeError: 'NoneType' object has no attribute 'group'
Any idea on what I'm doing wrong?
phantomjs version 1.9.8
ubuntu 16.04
After calling this:
python instagramcrawler.py -q 'instagram' -t 'followers' -n 30 -a auth.json
the process works fine until it opens the follower list; after that it gets stuck and nothing happens. A few weeks ago it worked well, auto-scrolling the follower list and collecting all the followers.
Did Instagram change something in their code? Is there a solution for this?
Is there any way to run it headless?