
python3spiders / weibosuperspider


A Weibo spider and companion toolbox: collect Weibo users, topics, and comments in one place, with image download, sentiment analysis, geolocation, relationship networks, spammer/bot detection, and more. Docs: https://buyixiao.github.io/blog/weibo-super-spider.html Companion visualization site: https://buyixiao.github.io/blog/one-stop-weibo-visualization.html

License: Apache License 2.0

Python 100.00%
weibo-spider weibocrawler weibo-comment-crawl emotion-analysis weibo-image location-tracker weibospider

weibosuperspider's Issues

Topic spider returns no data

Hi! I ran into a problem with the topic spider. I am using the non-GUI version: in WeiboTopicScrapy.py I modified keyword, starttime, endtime, and cookie, and double-checked that the cookie was copied correctly. But the number of posts crawled is 0, and I get the message "Ending automatically; most likely all content has been crawled, but also check whether the cookie has expired." Is there a known fix?

Login failure

Traceback:

Traceback (most recent call last):
  File "E:/Documents/Personal/WorkSpace/PyCharm/WeiboSuperSpider_GitHub/无 GUI 功能独立版/WeiboSuperCommentScrapy.py", line 223, in login
    sever_data = self.pre_login()
  File "E:/Documents/Personal/WorkSpace/PyCharm/WeiboSuperSpider_GitHub/无 GUI 功能独立版/WeiboSuperCommentScrapy.py", line 189, in pre_login
    servertime = sever_data["servertime"]
KeyError: 'servertime'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:/Documents/Personal/WorkSpace/PyCharm/WeiboSuperSpider_GitHub/无 GUI 功能独立版/WeiboSuperCommentScrapy.py", line 407, in <module>
    WeiboLogin(username, password, cookie_path).login()
  File "E:/Documents/Personal/WorkSpace/PyCharm/WeiboSuperSpider_GitHub/无 GUI 功能独立版/WeiboSuperCommentScrapy.py", line 230, in login
    sever_data = self.pre_login()
  File "E:/Documents/Personal/WorkSpace/PyCharm/WeiboSuperSpider_GitHub/无 GUI 功能独立版/WeiboSuperCommentScrapy.py", line 189, in pre_login
    servertime = sever_data["servertime"]
KeyError: 'servertime'

(Line numbers may not be exact; locate the line servertime = sever_data["servertime"] under def pre_login(self):.)
Server returned:
{'retcode': 0, 'msg': 'system error', 'exectime': 1}

Crawling a random sample of Weibo posts

Hello, and many thanks for sharing this spider! I would like to ask: based on your project, would it be possible to extract a random sample of all Weibo posts from 2012?

Crawling long comment threads keeps getting interrupted

Hi, thanks for this very capable script! One small problem: when I crawl a post with many comments, the run often breaks partway through, probably due to network issues. It usually dies after a few hundred comments, with the traceback stuck at
KeyError: 'data'
But I am certain there is more data after that point. Re-running sometimes gets a few hundred comments, sometimes a few thousand, but after a break it always restarts from zero.
So: which lines would I need to change so that, when this happens, the script retries instead of breaking?
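One way to get retry-instead-of-break behavior is to wrap the page fetch in a retry loop. A minimal sketch, where `fetch_page` is a hypothetical stand-in for whatever function in the script returns the parsed JSON for one page (the real loop will look different):

```python
import time

def fetch_with_retry(fetch_page, page, max_retries=5, wait=5):
    # Retry a page whose JSON response lacks the 'data' key instead of
    # aborting the whole crawl; back off a little longer each attempt.
    for attempt in range(1, max_retries + 1):
        payload = fetch_page(page)
        if "data" in payload:
            return payload["data"]
        time.sleep(wait * attempt)
    raise RuntimeError("page %d still missing 'data' after %d retries"
                       % (page, max_retries))
```

Persisting the last successfully fetched page number to disk would additionally let an interrupted run resume instead of restarting from zero.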

Why are comment crawls incomplete?

Running the non-GUI WeiboCommentScrapy.py on a single Weibo ID with 50-odd comments only gets 30-odd of them, and when batch-crawling from an imported CSV file it starts erroring out about halfway through every time. Beginner here, thanks!
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='weibo.cn', port=443): Max retries exceeded with url: /comment/Ca3BigTRn (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x00000000100828B0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or the connected host has failed to respond.'))
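Connection errors like this are usually transient. One common mitigation (a sketch, not part of the project, assuming the `requests` and `urllib3` packages the scripts already depend on) is to mount an `HTTPAdapter` with automatic retries and backoff:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(total=5, backoff=1.0):
    # Build a Session that transparently retries failed connections
    # with exponential backoff instead of dying on the first timeout.
    retry = Retry(total=total, backoff_factor=backoff,
                  status_forcelist=[429, 500, 502, 503, 504])
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))
    return session
```

Replacing bare `requests.get(...)` calls with `session.get(...)` on such a session would keep the rest of the code unchanged.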

Business promotion inquiry

Hello! We are a professional IP-proxy provider, 极速HTTP. We give 10,000 IPs on registration and verification (your academic users could take advantage of the free trial :). We would like to discuss a possible business promotion partnership. If you are interested, you can reach me on WeChat: 13982004324. Thanks! (If not, apologies for the interruption.)

Super-topic issue

Crawling super-topic (超话) data via the topic spider is incomplete. The topic spider works through search, but super-topics have their own separate page. Would crawling a super-topic work roughly the same way as crawling comments?

Errors

Hi, I'm a beginner, sorry to bother you. When using WeiboSuperCommentScrapy.py, do I only need to change the username/password and the target id in the last few lines and leave everything else alone? Running it gives the output below. Is this a login problem? Even logging in manually in a browser now prompts for a QR-code scan; it seems Weibo requires scan-to-login everywhere now. Could that be related?
C:\Users\49166\Downloads>WeiboSuperCommentScrapy.py
Traceback (most recent call last):
  File "C:\Users\49166\Downloads\WeiboSuperCommentScrapy.py", line 220, in login
    sever_data = self.pre_login()
  File "C:\Users\49166\Downloads\WeiboSuperCommentScrapy.py", line 186, in pre_login
    servertime = sever_data["servertime"]
KeyError: 'servertime'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\49166\Downloads\WeiboSuperCommentScrapy.py", line 399, in <module>
    WeiboLogin(username, password, cookie_path).login()
  File "C:\Users\49166\Downloads\WeiboSuperCommentScrapy.py", line 227, in login
    sever_data = self.pre_login()
  File "C:\Users\49166\Downloads\WeiboSuperCommentScrapy.py", line 186, in pre_login
    servertime = sever_data["servertime"]
KeyError: 'servertime'

Hello author

How would I crawl a topic and also download its images?

new issue

encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
I/O error : encoder error
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
I/O error : encoder error

After pasting in the cookie as instructed, running produces this error

encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
I/O error : encoder error

All I did was paste my own Weibo cookie into the "enter cookie" spot.
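The `encoding error : input conversion failed` lines come from libxml2 (underneath lxml, which these scripts use to parse HTML). They typically mean the page's declared encoding does not match its actual bytes, which is often itself a symptom of Weibo serving an error or login page instead of the expected content. A hedged workaround sketch is to parse the raw bytes with a recovering parser that forces UTF-8:

```python
from lxml import etree

def parse_html(raw_bytes):
    # Force UTF-8 and recover from bad byte sequences instead of
    # letting libxml2 print "input conversion failed" errors.
    parser = etree.HTMLParser(encoding="utf-8", recover=True)
    return etree.HTML(raw_bytes, parser=parser)
```

Passing `response.content` (bytes) rather than `response.text` into the parser avoids a second, lossy decode step.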

Hello author, I ran the default Weibo-user crawl and got this error

Exception in thread Thread-6:
Traceback (most recent call last):
  File "F:\python\anaconda\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "E:/微博数据/爬虫工具/WeiboSuperSpider-master/WeiboSuperSpider-master/无 GUI 功能独立版/WeiboCommentScrapy.py", line 161, in run
    self.write_to_csv(result,isHeader=False)
  File "E:/微博数据/爬虫工具/WeiboSuperSpider-master/WeiboSuperSpider-master/无 GUI 功能独立版/WeiboCommentScrapy.py", line 129, in write_to_csv
    with open('comment/' + self.wid + '.csv', 'a', encoding='utf-8-sig', newline='') as f:
PermissionError: [Errno 13] Permission denied: 'comment/IaYZIu0Ko.csv'

What is going on here?
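On Windows, a `PermissionError` when opening a CSV for append usually means the file is currently open in another program (commonly Excel), or that `comment/` is missing or not writable. A small sketch (a hypothetical helper, not from the project) that at least rules out the missing-directory case:

```python
import os

def open_comment_csv(wid):
    # Make sure the comment/ directory exists before appending; if the
    # PermissionError persists, close the CSV in Excel and re-run.
    os.makedirs("comment", exist_ok=True)
    return open(os.path.join("comment", wid + ".csv"), "a",
                encoding="utf-8-sig", newline="")
```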

WeiboCommentScrapy.py errors out

PS C:\Users\Administrator\Desktop\WeiboSuperSpider-master\无 GUI 功能独立版> python .\WeiboCommentScrapy.py
1000000
100000
第1/100000页
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
I/O error : encoder error
Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File ".\WeiboCommentScrapy.py", line 156, in run
    result.append(self.get_one_comment_struct(comments[i]))
  File ".\WeiboCommentScrapy.py", line 124, in get_one_comment_struct
    nickName,sex,location,weiboNum,followingNum,followsNum = self.getPublisherInfo(url=userURL)
  File ".\WeiboCommentScrapy.py", line 84, in getPublisherInfo
    head = html.xpath("//div[@Class='ut']/span[1]")[0]
IndexError: list index out of range
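Two things are worth checking in this traceback. First, it shows `html.xpath("//div[@Class='ut']/span[1]")`: XPath attribute names are case-sensitive, so `@Class` never matches anything and the result list is always empty; `@class` is almost certainly intended. Second, even with the correct attribute, weibo.cn serves a login page when the cookie expires, so blindly indexing `[0]` will still raise. A defensive sketch (a hypothetical wrapper, assuming lxml as the script uses):

```python
from lxml import etree

def get_head_span(html):
    # Note the lower-case @class: XPath attribute names are
    # case-sensitive, so @Class matches nothing at all.
    heads = html.xpath("//div[@class='ut']/span[1]")
    if not heads:
        return None  # likely a login page: re-check the cookie, skip user
    return heads[0].xpath("string(.)")
```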

Error when searching for a user

ERROR Log details
Traceback (most recent call last):
  File ".\GUI.py", line 69, in run
    search_response = requests.post(url='https://weibo.cn/search/?pos=search', headers=self.headers, data=query_data,verify=False)
  File "C:\Aconada\lib\site-packages\requests\api.py", line 119, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "C:\Aconada\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Aconada\lib\site-packages\requests\sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Aconada\lib\site-packages\requests\sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "C:\Aconada\lib\site-packages\requests\adapters.py", line 449, in send
    timeout=timeout
  File "C:\Aconada\lib\site-packages\urllib3\connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "C:\Aconada\lib\site-packages\urllib3\connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "C:\Aconada\lib\http\client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Aconada\lib\http\client.py", line 1280, in _send_request
    self.putheader(hdr, value)
  File "C:\Aconada\lib\http\client.py", line 1212, in putheader
    values[i] = one_value.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-6: ordinal not in range(256)
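The final frame shows `http.client` (via `requests`) encoding header values as latin-1, which fails as soon as a header value contains Chinese characters, here most likely a search keyword or a mangled cookie pasted into `self.headers`. The proper fix is to keep non-ASCII text out of headers (the keyword belongs in the request body or params, where it gets encoded correctly), but as a last-resort sketch, offending values can be percent-encoded (a hypothetical helper, not from the project):

```python
import urllib.parse

def sanitize_headers(headers):
    # requests/http.client encode header values as latin-1; percent-
    # encode any value that cannot be represented in that charset.
    clean = {}
    for key, value in headers.items():
        try:
            value.encode("latin-1")
        except UnicodeEncodeError:
            value = urllib.parse.quote(value, safe="=;,/ ")
        clean[key] = value
    return clean
```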

Error after entering the CAPTCHA at login.

res = requests.get(url=next_url.format(id, id, max_id,id_type), headers=headers,cookies=cookie_dict)
UnboundLocalError: local variable 'max_id' referenced before assignment
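The `UnboundLocalError` means `next_url.format(...)` runs before `max_id` was ever assigned, typically because the first page's request failed (expired cookie, rate limit) and the code that sets `max_id` was skipped. A sketch of the likely shape of the fix, with hypothetical names (`fetch_json`, `first_url`) standing in for the script's own calls: initialize the pagination cursor before the loop and exit cleanly when a page has no data:

```python
def crawl_comments(first_url, next_url, fetch_json):
    # Initialize the cursor up front so a failed first request can
    # never leave max_id undefined when next_url is formatted.
    max_id, max_id_type = 0, 0
    page = fetch_json(first_url)
    while page:
        data = page.get("data")
        if not data:
            break  # expired cookie, rate limit, or no more comments
        yield from data.get("data", [])
        max_id = data.get("max_id", 0)
        max_id_type = data.get("max_id_type", 0)
        if max_id == 0:
            break  # Weibo signals the last page with max_id == 0
        page = fetch_json(next_url.format(max_id, max_id_type))
```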

Could you add pauses to the crawler?

I've found that Weibo throttles quite aggressively; on posts with many comments, the server quickly reports as unreachable.
Would it help if the script ran for a few seconds and then paused for a random few seconds?
WeiboSuperCommentScrapy.py
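A randomized pause between requests is easy to bolt on. A minimal sketch (the interval values are made up and would need tuning against Weibo's actual rate limits):

```python
import random
import time

def polite_sleep(base=2.0, jitter=3.0):
    # Sleep base + U(0, jitter) seconds between requests so the
    # traffic pattern is less regular and less bot-like.
    time.sleep(base + random.uniform(0, jitter))
```

Calling `polite_sleep()` once per page fetch inside the crawl loop would be the natural place.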

Topic crawling

Hello! How do I crawl the content of a topic? WeiboTopicScrapy.py appears to crawl by keyword: entering #topicname# directly still returns posts that merely contain the related keyword, not the posts actually belonging to that topic.

CAPTCHA

On first login, after entering the CAPTCHA in the terminal, I get the following error:
Traceback (most recent call last):
  File "d:/VSCodeWD for Python/NLPExercise/WeiboSuperSpider/无 GUI 功能独立版/WeiboSuperCommentScrapy.py", line 145, in login
    ticket = ticket_js["ticket"]
KeyError: 'ticket'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "d:/VSCodeWD for Python/NLPExercise/WeiboSuperSpider/无 GUI 功能独立版/WeiboSuperCommentScrapy.py", line 297, in <module>
    WeiboLogin(username, password, cookie_path).login()
  File "d:/VSCodeWD for Python/NLPExercise/WeiboSuperSpider/无 GUI 功能独立版/WeiboSuperCommentScrapy.py", line 155, in login
    ticket = ticket_js["ticket"]
KeyError: 'ticket'

Both comment-crawling scripts throw errors

commentNum = re.findall("评论\[.*?\]",res.text)[0]

IndexError: list index out of range

That one is from WeiboCommentScrapy; WeiboSuperCommentScrapy reports that the server_data key does not exist. Could you update the scripts? Thanks!

topic spider

When running TopicScrapy I get "Ending automatically; most likely all content has been crawled, but also check whether the cookie has expired." What does that indicate? The same cookie works fine in CommentScrapy.

Broken code

You should update this, or document the logic more clearly; the code as it stands is broken.

Topic spider

Hi, when I run the topic spider it always stops by itself at around 150 posts, about 23 pages (the total page count is much higher), and I don't know why. I also tried crawling posts from a few months back; it stopped after a few dozen.
Also, could we obtain user_id during crawling? With user_id we could crawl each user's timeline to collect more related posts.
I'm working on depression analysis, so I'm crawling depression-related topics. I'm also trying to modify the code myself to extract user_id (it's in the URL), so I can then crawl the posts published by that user_id. The data crawled by https://github.com/dataabc/weibo-search already includes user_id.

Many thanks to the author for the ongoing updates and maintenance; already starred.

Time-range search has no effect

I ran WeiboTopicScrapy.py with a keyword and the time range start_time='20190601', end_time='20190602', but the data returned is still the latest, current-day posts.

ModuleNotFoundError

What does this error mean? Thanks.
ModuleNotFoundError: No module named 'NewSuperWeiboChildCommentsForMac'

Two questions about comment crawling

Two questions, please, and thanks.
First, I set limit to 50 in the JSON, but far more than 50 comments are crawled; how can I control the count?
Second, some posts clearly have many comments, yet crawling them immediately yields "data crawl finish" or "system is busy". What is happening there?
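On the first question: if the script simply paginates until the server stops returning data, a `limit` field in the request may be ignored server-side, so the cap has to be enforced client-side. A sketch (hypothetical helper, assuming the crawler yields comments one at a time):

```python
def take(items_iter, limit=50):
    # Stop consuming the comment stream once `limit` items have
    # been collected, regardless of how many pages actually exist.
    out = []
    for item in items_iter:
        out.append(item)
        if len(out) >= limit:
            break
    return out
```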
