python3spiders / weibosuperspider

A Weibo spider with a companion toolbox: scrape Weibo users, topics, and comments in one place. Image download, sentiment analysis, geolocation, relationship networks, spammer/bot detection, and more. Docs: https://buyixiao.github.io/blog/weibo-super-spider.html Companion visualization site: https://buyixiao.github.io/blog/one-stop-weibo-visualization.html

License: Apache License 2.0

Python 100.00%
weibo-spider weibocrawler weibo-comment-crawl emotion-analysis weibo-image location-tracker weibospider

weibosuperspider's Introduction

Project Overview

A Weibo spider and companion toolbox: a one-stop tool for collecting, analyzing, and visualizing Weibo data. Spiders for Weibo users, topics, and comments are all covered; image download, sentiment analysis, geolocation, relationship networks, bot detection, and more are included.

The project follows two design principles:

  • Scraped data is saved to CSV files that Excel can open; no database of any kind is required.
  • The spider file for each feature is fully self-contained, with no dependencies between files. This hurts maintenance and refactoring, but it is friendly to users.
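The storage convention described above can be sketched in a few lines. This is a minimal illustration, not the project's code; `append_rows` and the column names are hypothetical:

```python
import csv

def append_rows(path, rows, header=("user", "content", "created_at")):
    # utf-8-sig writes the BOM that Excel needs to detect UTF-8;
    # plain append mode keeps the pipeline database-free, per the
    # project's design principle.
    with open(path, "a", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        if f.tell() == 0:                # brand-new file: emit header once
            writer.writerow(header)
        writer.writerows(rows)

append_rows("weibo_demo.csv", [("alice", "hello", "2022-01-01")])
```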

About the Author

Author: inspurer
QQ group: 751114777
Blog: https://buyixiao.github.io/

Project Resources

Docs: the latest (2022) guide
Companion self-service scraping site: build Weibo user relationship (following/follower) networks of arbitrary depth and breadth, repost-path networks of arbitrary depth and breadth, and datasets of posts, comments, check-ins, and more, all online; includes Weibo spammer detection. Execute data crawling without any environment setup.
Companion Weibo visualization site: https://buyixiao.github.io/blog/one-stop-weibo-visualization.html
Online data-visualization configuration site with world and province/city maps, animated bar-chart races, Sankey, relationship, chord, sunburst, tree, and treemap charts: https://tools.buyixiao.xyz/
Bilibili tutorial on visualizing Weibo check-in data: https://www.bilibili.com/video/BV1S14y1x73y

Project Statement

If you use this project in your research, please cite this project.

@misc{WeiboSuperSpider,
    author = {Tao Xiao},
    title = {微博超级爬虫,最强微博爬虫,用户、话题、评论一网打尽。图片下载、情感分析,地理位置、关系网络等功能应有尽有。},
    year = {2019},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/Python3Spiders/WeiboSuperSpider}},
}

weibosuperspider's People

Contributors

inspurer, yarkable

weibosuperspider's Issues

ModuleNotFoundError

What is causing this? Thanks.
ModuleNotFoundError: No module named 'NewSuperWeiboChildCommentsForMac'

WeiboCommentScrapy.py throws an error

PS C:\Users\Administrator\Desktop\WeiboSuperSpider-master\无 GUI 功能独立版> python .\WeiboCommentScrapy.py
1000000
100000
第1/100000页
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
I/O error : encoder error
Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File ".\WeiboCommentScrapy.py", line 156, in run
    result.append(self.get_one_comment_struct(comments[i]))
  File ".\WeiboCommentScrapy.py", line 124, in get_one_comment_struct
    nickName,sex,location,weiboNum,followingNum,followsNum = self.getPublisherInfo(url=userURL)
  File ".\WeiboCommentScrapy.py", line 84, in getPublisherInfo
    head = html.xpath("//div[@Class='ut']/span[1]")[0]
IndexError: list index out of range
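Two things stand out in the failing line: XPath attribute names are case-sensitive (`class`, not `Class`), and the profile block can legitimately be absent (deleted or restricted accounts), so the lookup should be guarded rather than indexed blindly. A minimal stdlib sketch of the defensive version (the project itself uses lxml; `publisher_head` and the sample markup are hypothetical):

```python
import xml.etree.ElementTree as ET

def publisher_head(page_markup):
    # Lowercase 'class' matches real HTML attributes; guard the result
    # instead of indexing [0] so a missing profile block yields None,
    # not an IndexError that kills the worker thread.
    root = ET.fromstring(page_markup)
    heads = root.findall(".//div[@class='ut']/span[1]")
    return heads[0].text if heads else None

print(publisher_head("<html><div class='ut'><span>nick</span></div></html>"))  # nick
print(publisher_head("<html><div>no profile block</div></html>"))              # None
```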

Both comment-scraping scripts throw errors

commentNum = re.findall("评论\[.*?\]",res.text)[0]

IndexError: list index out of range

This one is from the comment script; SuperComment reports that the server_data key does not exist. Could you update them?
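A likely cause of both errors is that the page returned a login wall instead of the post, so the "评论[n]" marker never appears in the response text. A hedged sketch of a guarded version of the failing line (`comment_num` and its default are illustrative, not the script's API):

```python
import re

def comment_num(page_text, default=0):
    # re.search returns None when the marker is absent (expired cookie,
    # login wall), so we can fall back instead of indexing findall()[0].
    m = re.search(r"评论\[(\d+)\]", page_text)
    return int(m.group(1)) if m else default

print(comment_num("转发[3] 评论[42] 赞[7]"))   # 42
print(comment_num("<html>please log in</html>"))  # 0
```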

Two questions about comment scraping

Two questions, please, and thanks.
First, I set limit to 50 in the JSON, but far more comments are scraped. How can I control the count?
Second, some posts clearly have many comments, but scraping them immediately ends with "data crawl finish" or "system is busy". What is going on?

Hello author

How would I scrape a topic and download its images?

Topic scraping

Hello, how do I scrape the content of a topic? WeiboTopicScrapy.py seems to scrape by keyword: entering #topicname# still returns posts that merely contain the keyword, not the content belonging to the topic itself.

Scraping a random sample of posts

Hello, and thank you for sharing this spider! Based on your project, would it be possible to draw a random sample of all Weibo posts from 2012?

Could you add pauses to the spider?

Weibo's rate limiting is quite aggressive; with many comments it quickly reports that the server is unreachable.
Would running the script for a few seconds and then pausing for a random few seconds help?
WeiboSuperCommentScrapy.py
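A throttle like the one requested can be bolted on without touching the parsing logic, for example by wrapping each fetch in a helper that sleeps a random interval first. The bounds below are illustrative and `polite_get` is not part of the project:

```python
import random
import time

def polite_get(fetch, *args, min_pause=2.0, max_pause=5.0, **kwargs):
    # Sleep a random interval before every request so the traffic
    # pattern looks less mechanical; tune the bounds to how quickly
    # Weibo starts refusing requests.
    time.sleep(random.uniform(min_pause, max_pause))
    return fetch(*args, **kwargs)

# e.g. replace `requests.get(url, headers=headers)` with
#      `polite_get(requests.get, url, headers=headers)`
```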

Topic spider returns no data

Hello! I hit a problem with the topic spider. I am using the no-GUI version and modified keyword, starttime, endtime, and cookie in WeiboTopicScrapy.py; I double-checked that the cookie was copied correctly. But zero posts are scraped, and I get the message "Ended automatically; most likely the content has all been scraped, but also check whether the cookie has expired." Is there a fix?

new issue

encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
I/O error : encoder error
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
I/O error : encoder error

Date-range search not working

I ran WeiboTopicScrapy.py with a keyword and the date range start_time='20190601', end_time='20190602', but the returned data is still the latest posts from the current day.

Super-topic question

Super-topic data scraped via the topic spider is incomplete. The topic spider uses search, but super-topics have their own separate page. Would scraping super-topics work more like scraping comments?

Captcha

On first login, after entering the captcha in the terminal, I get the following error:
Traceback (most recent call last):
  File "d:/VSCodeWD for Python/NLPExercise/WeiboSuperSpider/无 GUI 功能独立版/WeiboSuperCommentScrapy.py", line 145, in login
    ticket = ticket_js["ticket"]
KeyError: 'ticket'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "d:/VSCodeWD for Python/NLPExercise/WeiboSuperSpider/无 GUI 功能独立版/WeiboSuperCommentScrapy.py", line 297, in <module>
    WeiboLogin(username, password, cookie_path).login()
  File "d:/VSCodeWD for Python/NLPExercise/WeiboSuperSpider/无 GUI 功能独立版/WeiboSuperCommentScrapy.py", line 155, in login
    ticket = ticket_js["ticket"]
KeyError: 'ticket'

Topic spider

Hello, when I run the topic spider, it always stops automatically at around 150 posts, about 23 pages (the total page count is much higher), and I don't know why. I also tried scraping posts from several months back, and it stopped after a few dozen.
Also, could we obtain user_id during scraping? With user_id we could scrape each user's own timeline and collect more related posts.
I am working on depression analysis, so I am scraping the depression topic. I am also trying to modify the code to extract user_id, which appears in the URL, so I can scrape the posts published by that user_id. The data scraped by https://github.com/dataabc/weibo-search includes user_id.

Many thanks to the author for maintaining this project; already starred.

Login failure

Traceback:

Traceback (most recent call last):
  File "E:/Documents/Personal/WorkSpace/PyCharm/WeiboSuperSpider_GitHub/无 GUI 功能独立版/WeiboSuperCommentScrapy.py", line 223, in login
    sever_data = self.pre_login()
  File "E:/Documents/Personal/WorkSpace/PyCharm/WeiboSuperSpider_GitHub/无 GUI 功能独立版/WeiboSuperCommentScrapy.py", line 189, in pre_login
    servertime = sever_data["servertime"]
KeyError: 'servertime'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:/Documents/Personal/WorkSpace/PyCharm/WeiboSuperSpider_GitHub/无 GUI 功能独立版/WeiboSuperCommentScrapy.py", line 407, in <module>
    WeiboLogin(username, password, cookie_path).login()
  File "E:/Documents/Personal/WorkSpace/PyCharm/WeiboSuperSpider_GitHub/无 GUI 功能独立版/WeiboSuperCommentScrapy.py", line 230, in login
    sever_data = self.pre_login()
  File "E:/Documents/Personal/WorkSpace/PyCharm/WeiboSuperSpider_GitHub/无 GUI 功能独立版/WeiboSuperCommentScrapy.py", line 189, in pre_login
    servertime = sever_data["servertime"]
KeyError: 'servertime'

(The line numbers may be off; look for servertime = sever_data["servertime"] under def pre_login(self):)
Server returned:
{'retcode': 0, 'msg': 'system error', 'exectime': 1}

Scraping large comment threads keeps aborting

Hello repo owner, this is a very powerful script, thanks! One small issue: when I scrape a post with many comments, the run often aborts partway through, probably due to the network. It frequently dies after a few hundred comments, with the traceback stuck at
KeyError: 'data'
But I am sure there is more data after that point. Rerunning sometimes gets a few hundred comments, sometimes a few thousand, but after a failure it always restarts from zero.
So: to retry in this situation instead of breaking, which lines would I need to change?
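One way to retry instead of breaking, sketched under the assumption that the failing code indexes a 'data' key in each page's JSON; `fetch_page` is a stand-in for whatever function performs the actual request:

```python
import time

def get_page_with_retry(fetch_page, page, retries=3, base_pause=5.0):
    # A missing 'data' key usually means throttling, not end of data,
    # so retry a few times with a growing pause before giving up.
    for attempt in range(retries):
        payload = fetch_page(page)
        if "data" in payload:
            return payload["data"]
        if attempt < retries - 1:
            time.sleep(base_pause * (attempt + 1))   # back off, then retry
    raise RuntimeError(f"page {page}: no 'data' after {retries} attempts")
```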

Error when searching for a user

ERROR Log details
Traceback (most recent call last):
  File ".\GUI.py", line 69, in run
    search_response = requests.post(url='https://weibo.cn/search/?pos=search', headers=self.headers, data=query_data, verify=False)
  File "C:\Aconada\lib\site-packages\requests\api.py", line 119, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "C:\Aconada\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Aconada\lib\site-packages\requests\sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Aconada\lib\site-packages\requests\sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "C:\Aconada\lib\site-packages\requests\adapters.py", line 449, in send
    timeout=timeout
  File "C:\Aconada\lib\site-packages\urllib3\connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "C:\Aconada\lib\site-packages\urllib3\connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "C:\Aconada\lib\http\client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Aconada\lib\http\client.py", line 1280, in _send_request
    self.putheader(hdr, value)
  File "C:\Aconada\lib\http\client.py", line 1212, in putheader
    values[i] = one_value.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-6: ordinal not in range(256)
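The UnicodeEncodeError at the bottom means one of the request headers contains characters outside Latin-1 (HTTP headers cannot carry raw Chinese text), typically stray characters pasted into the cookie string. A small helper to locate the offending header (`bad_header_values` is illustrative, not part of the project):

```python
def bad_header_values(headers):
    # Return only the headers whose values cannot be encoded as
    # Latin-1, i.e. the ones http.client will choke on.
    bad = {}
    for name, value in headers.items():
        try:
            value.encode("latin-1")
        except UnicodeEncodeError:
            bad[name] = value
    return bad

print(bad_header_values({"Cookie": "SUB=abc123", "X-Note": "微博"}))
```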

topic spider

When I run TopicScrapy, I get "Ended automatically; most likely the content has all been scraped, but also check whether the cookie has expired." What does that mean? The same cookie works fine in CommentScrapy.

Error

Hello, I am a beginner, sorry to bother you. When using WeiboSuperCommentScrapy.py, do I only need to change the account, password, and target id in the last few lines, and leave everything else alone? Running it produces the output below. Is this a login problem? Even logging in through the web interface myself now prompts for a QR-code scan; Weibo seems to require scan-to-login for everyone now. Is that related?
C:\Users\49166\Downloads>WeiboSuperCommentScrapy.py
Traceback (most recent call last):
  File "C:\Users\49166\Downloads\WeiboSuperCommentScrapy.py", line 220, in login
    sever_data = self.pre_login()
  File "C:\Users\49166\Downloads\WeiboSuperCommentScrapy.py", line 186, in pre_login
    servertime = sever_data["servertime"]
KeyError: 'servertime'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\49166\Downloads\WeiboSuperCommentScrapy.py", line 399, in <module>
    WeiboLogin(username, password, cookie_path).login()
  File "C:\Users\49166\Downloads\WeiboSuperCommentScrapy.py", line 227, in login
    sever_data = self.pre_login()
  File "C:\Users\49166\Downloads\WeiboSuperCommentScrapy.py", line 186, in pre_login
    servertime = sever_data["servertime"]
KeyError: 'servertime'

Error after entering the captcha at login.

res = requests.get(url=next_url.format(id, id, max_id,id_type), headers=headers,cookies=cookie_dict)
UnboundLocalError: local variable 'max_id' referenced before assignment
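The UnboundLocalError means max_id is read before any response has set it: the first request must not reference max_id at all. A hedged sketch of the usual cursor-pagination shape (`fetch_first` and `fetch_next` are hypothetical stand-ins for the script's request functions):

```python
def crawl_comments(fetch_first, fetch_next):
    comments = []
    data = fetch_first()                  # first request: no max_id yet
    comments += data.get("comments", [])
    max_id = data.get("max_id", 0)        # 0 / missing means no more pages
    while max_id:
        data = fetch_next(max_id)         # max_id is now always defined
        comments += data.get("comments", [])
        max_id = data.get("max_id", 0)
    return comments
```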

Why are comments scraped incompletely?

With the no-GUI standalone WeiboCommentScrapy.py, scraping a single post ID gets only 30-some of its 50+ comments, and batch scraping from an imported CSV keeps erroring out about halfway through. Newbie here asking for advice, thanks!
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='weibo.cn', port=443): Max retries exceeded with url: /comment/Ca3BigTRn (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x00000000100828B0>: Failed to establish a new connection: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。'))
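WinError 10060 is a plain network timeout, and one failed connection currently aborts the whole batch. A stdlib-only retry wrapper is one way to ride out transient failures (in the real script you would pass requests.exceptions.ConnectionError as the transient type; the defaults here are just Python's built-ins):

```python
import time

def with_retries(func, *args, retries=3, pause=2.0,
                 transient=(ConnectionError, TimeoutError), **kwargs):
    # Retry only the exception types listed as transient; anything
    # else (e.g. a parsing bug) should still fail loudly.
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except transient:
            if attempt == retries - 1:
                raise                     # out of attempts: re-raise
            time.sleep(pause * (attempt + 1))
```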

Business promotion inquiry

Hello, we are a professional IP-proxy provider, JiSu HTTP. Verified registration comes with 10,000 free IPs (your users are welcome to try them out :). We would like to discuss a business promotion partnership. If you are interested, you can reach me on WeChat: 13982004324. Thanks (and if not, apologies for the interruption).

Code error

You should update the code or make the logic clearer; as it stands, the code is broken.

After inserting the cookie as instructed, I get this error

encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
I/O error : encoder error

All I did was paste my own Weibo cookie in at the "enter cookie" spot.

Hello author, I scraped the default Weibo user and got an error

Exception in thread Thread-6:
Traceback (most recent call last):
  File "F:\python\anaconda\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "E:/微博数据/爬虫工具/WeiboSuperSpider-master/WeiboSuperSpider-master/无 GUI 功能独立版/WeiboCommentScrapy.py", line 161, in run
    self.write_to_csv(result,isHeader=False)
  File "E:/微博数据/爬虫工具/WeiboSuperSpider-master/WeiboSuperSpider-master/无 GUI 功能独立版/WeiboCommentScrapy.py", line 129, in write_to_csv
    with open('comment/' + self.wid + '.csv', 'a', encoding='utf-8-sig', newline='') as f:
PermissionError: [Errno 13] Permission denied: 'comment/IaYZIu0Ko.csv'

What is going on?
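On Windows, a PermissionError on a CSV the spider itself just created almost always means the file is still open in Excel; it can also mean the comment/ output directory does not exist yet. A defensive open covering both cases (`open_result_csv` is illustrative, not the project's code):

```python
import os

def open_result_csv(path):
    # Create the output directory if the spider never made it.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    try:
        return open(path, "a", encoding="utf-8-sig", newline="")
    except PermissionError as e:
        # Windows locks files opened by Excel; fail with a hint.
        raise PermissionError(
            f"{path} is locked; close it (e.g. in Excel) and rerun") from e
```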
