Code Monkey home page Code Monkey logo

weibospidersimple's Introduction

WeiboSpider

This is a sina weibo spider built by scrapy

这是一个持续维护的微博爬虫开源项目,有任何问题请开issue

该项目爬取的数据字段说明,请移步:数据字段说明与示例

已经在senior分支的基础上新增了search分支,用于微博关键词搜索

update

如何使用

下面是simple分支,也就是单账号爬取,每日十万级的抓取量

克隆本项目 && 安装依赖

本项目Python版本为Python3.6

https://github.com/MarvelousDick/WeiboSpiderSimple.git
cd WeiboSpiderSimple
pip install -r requirements.txt

除此之外,还需要安装mongodb,这个自行Google把

替换Cookie

访问https://weibo.cn/

并登陆,打开浏览器的开发者模式,再次刷新

复制weibo.cn这个数据包,network中的cookie值

sina/settings.py中:

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:61.0) Gecko/20100101 Firefox/61.0',
    'Cookie':'OUTFOX_SEARCH_USER_ID_NCOO=1780588551.4011402; browser=d2VpYm9mYXhpYW4%3D; SCF=AsJyCasIxgS59OhHHUWjr9OAw83N3BrFKTpCLz2myUf2_vdK1UFy6Hucn5KaD7mXIoq8G25IMnTUPRRfr3U8ryQ.; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WFGJINkqaLbAcTzz2isXDTA5JpX5KMhUgL.Foq0e0571hBp1hn2dJLoIp7LxKML1KBLBKnLxKqL1hnLBoMpe0ec1h5feKMR; SUB=_2A252a4N_DeRhGeBI61EV9CzPyD-IHXVVly03rDV6PUJbkdAKLRakkW1NRqYKs18Yrsf_SKnpgehmxRFUVgzXtwQO; SUHB=0U15b0sZ4CX6O4; _T_WM=0653fb2596917b052152f773a5976ff4; _WEIBO_UID=6603442333; SSOLoginState=1536482073; ALF=1539074073'
}

Cookie字段替换成你自己的Cookie

如果爬虫运行出现403/302,说明账号被封/cookie失效,请重新替换cookie

运行爬虫

scrapy crawl weibo_spider 

运行截图:

导入pycharm后,也可以直接执行sina/spider/weibo_spider.py

该爬虫是示例爬虫,将爬取 人民日报 和 新华视点 的 用户信息,全部微博,每条微博的评论,还有用户关系。

可以根据你的实际需求改写示例爬虫。

速度说明

一个页面可以抓取10则微博数据

下表是我的配置情况和速度测试结果

爬虫配置 配置值
CONCURRENT_REQUESTS 16
DOWNLOAD_DELAY 3s
每分钟抓取网页量 15+
每分钟抓取数据量 150+
总体一天抓取数据量 20万+

实际速度和你自己电脑的网速/CPU/内存有很大关系。

weibospidersimple's People

Contributors

klash-yang avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Forkers

ge-limin

weibospidersimple's Issues

代码运行不了

用pycharm运行时,会提示代码缺少2个东西,如id,希望博主能更新一下

Double requirements given

Requirements.txt 里面出现了两个pymongo导致:

Double requirement given: pymongo (from -r requirements.txt (line 7)) (already in pymongo==3.5.1 (from -r requirements.txt (line 3)), name='pymongo')

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.