
scrapyuniversal's Introduction

Python3 网络爬虫开发实战 (Python 3 Web Crawler Development in Practice)

This book shows how to develop web crawlers with Python 3. It begins with a detailed walkthrough of environment setup and crawler fundamentals, then covers request libraries such as urllib and requests; parsing libraries such as Beautiful Soup, XPath, and pyquery; and storage to text files and various databases. A series of case studies follows on scraping Ajax-loaded data and crawling dynamic sites with Selenium and Splash. The book then turns to practical techniques: crawling through proxies and maintaining a dynamic proxy pool, using ADSL dial-up proxies, solving graphical, GeeTest, tap, and grid CAPTCHAs, crawling behind simulated logins, and maintaining a cookies pool. It also covers App scraping for mobile with tools such as Charles, mitmdump, and Appium; the pyspider and Scrapy frameworks; distributed crawling; and finally Bloom filter efficiency optimization, deploying crawlers with Docker and Scrapyd, and managing them with Gerapy.

The book is published by Turing Education (图灵教育) and Posts & Telecom Press (人民邮电出版社). All rights reserved; reproduction prohibited.

Author: 崔庆才 (Cui Qingcai)

Where to buy:

Reader group:

Video resources:

Python3 爬虫三大案例实战分享 (Three Hands-On Python 3 Crawler Case Studies)

自己动手,丰衣足食!Python3 网络爬虫实战案例 (Do It Yourself: Practical Python 3 Web Crawler Cases)

scrapyuniversal's People

Contributors

germey


scrapyuniversal's Issues

Run multiple universal spiders with JSON configs

I have multiple JSON config files but only one universal spider. If I do the following:

```python
settings = get_project_settings()
process = CrawlerProcess(settings)
process.crawl(jsonname1)
process.crawl(jsonname2)
process.start()
```

it throws:

```
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed
```

I think the universal spider being created multiple times is the cause. How can I solve it?
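One way to make this work, sketched under the assumption that the spider is registered under the name `universal` and takes the config name as a constructor argument (as the book's run.py does): build a single CrawlerProcess, schedule every config on it, and call start() exactly once. ReactorAlreadyInstalledError usually means the Twisted reactor is being set up a second time, for example when start() runs more than once in the same interpreter.

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical config names; each maps to one JSON file the
# universal spider knows how to load.
config_names = ['jsonname1', 'jsonname2']

process = CrawlerProcess(get_project_settings())
for name in config_names:
    # Schedule one crawl per config. 'universal' is the spider name;
    # `name` is forwarded to the spider's __init__ (assumed interface).
    process.crawl('universal', name=name)
process.start()  # install and run the reactor once, after all crawls are queued
```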

How do I save the scraped data?

Hi Cui, I followed Chapter 13 of your book on the Scrapy universal spider and got the crawler working: it runs without errors and the output is correct, but I can't figure out how to store the data. I tried a Pipeline without success. I suspect the custom per-job config is overriding the project settings, but I still can't get the data saved. Looking forward to your reply, thanks, and all the best!
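A common cause worth checking, offered as an assumption since the issue doesn't show the config: if the per-job JSON config supplies its own settings dict, it may replace ITEM_PIPELINES from settings.py, silently disabling every pipeline. Declaring the pipeline inside the job's settings (or merging rather than replacing) restores storage. For illustration, a minimal JSON-lines pipeline; the class name and output path are placeholders:

```python
# pipelines.py -- minimal JSON-lines storage pipeline (illustrative)
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # one output file per run; the path is a placeholder
        self.file = open('items.jl', 'a', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
```

The pipeline then has to survive the settings merge, e.g. by putting `"ITEM_PIPELINES": {"yourproject.pipelines.JsonWriterPipeline": 300}` into the JSON config's settings section (the module path is a placeholder).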

'NewsLoader' object does not support item assignment

```python
def parse_item(self, response):
    item = self.configs.get('item')
    if item:
        cls = eval(item.get('class'))()
        loader = eval(item.get('loader'))(cls, response=response)
        print('loader:' + str(type(loader)))
        # dynamically apply the attribute config
        for key, value in item.get('attrs').items():
            for extractor in value:
                if extractor.get('method') == 'xpath':
                    loader.add_xpath(key, *extractor.get('args'), **{'re': extractor.get('re')})
                if extractor.get('method') == 'css':
                    loader.add_css(key, *extractor.get('args'), **{'re': extractor.get('re')})
                if extractor.get('method') == 'value':
                    loader.add_value(key, *extractor.get('args'), **{'re': extractor.get('re')})
                if extractor.get('method') == 'attr':
                    loader.add_value(key, getattr(response, *extractor.get('args')))
        yield loader.load_item()
```
```
2018-12-10 14:26:33 [scrapy.core.scraper] ERROR: Spider error processing <GET http://global.eastmoney.com/a/201812101002405673.html> (referer: http://stock.eastmoney.com/news/chstpyl.html)
Traceback (most recent call last):
  File "d:\ProgramData\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "d:\ProgramData\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
    for x in result:
  File "d:\ProgramData\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "d:\ProgramData\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "d:\ProgramData\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "d:\ProgramData\Anaconda3\lib\site-packages\scrapy\spiders\crawl.py", line 78, in _parse_response
    for requests_or_item in iterate_spider_output(cb_res):
  File "E:\Anaconda3\project\Spiders\EastmoneyCrawl\EastmoneyCrawl\spiders\univer.py", line 48, in parse_item
    yield loader.load_item()
  File "d:\ProgramData\Anaconda3\lib\site-packages\scrapy\loader\__init__.py", line 117, in load_item
    item[field_name] = value
TypeError: 'NewsLoader' object does not support item assignment
```
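Reading the traceback: load_item() assigns extracted values onto the object the loader wraps, so that object must be an Item, not another loader. The error suggests the JSON config's `class` key points at NewsLoader instead of the Item subclass, so `eval(item.get('class'))()` builds a second loader and the field assignment fails. A sketch of the expected shapes, using the issue's own class names (the field names are illustrative):

```python
from scrapy import Field, Item
from scrapy.loader import ItemLoader

class NewsItem(Item):
    # field names must match the keys under "attrs" in the JSON config
    title = Field()
    content = Field()

class NewsLoader(ItemLoader):
    default_item_class = NewsItem

def parse_item(response):  # stands in for the spider callback
    # The loader wraps an Item instance -- never another loader.
    loader = NewsLoader(item=NewsItem(), response=response)
    loader.add_xpath('title', '//h1/text()')
    yield loader.load_item()
```

In the JSON config that means `class` should name NewsItem and `loader` should name NewsLoader.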

Deploying with scrapyd

After switching to this config-file-driven style, how do you deploy these spiders with scrapyd?
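One approach, stated as an assumption because it depends on the project layout: deploy the project once with scrapyd-deploy, then schedule the single universal spider per config, since any extra parameter posted to scrapyd's schedule.json is forwarded to the spider as a constructor argument:

```python
import requests

# Assumes `scrapyd-deploy` has already uploaded the project and that the
# universal spider accepts the config name via a `name` argument.
resp = requests.post('http://localhost:6800/schedule.json', data={
    'project': 'scrapyuniversal',  # deployed project name (placeholder)
    'spider': 'universal',         # the one config-driven spider
    'name': 'china',               # forwarded to the spider's __init__
})
print(resp.json())
```

Note that the JSON config files must end up inside the deployed egg (for example via package_data in setup.py), or the spider won't find them on the scrapyd host.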

rules.py: comment out the next-page rule

You need to comment out this line, otherwise the crawl never stops :)

```python
# Rule(LinkExtractor(restrict_xpaths='//div[@id="pageStyle"]//a[contains(., "下一页")]'))
```
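If the goal is a bounded crawl rather than no pagination at all, an alternative (my suggestion, not from the repo) is to keep the rule and cap the run with Scrapy's built-in settings:

```python
# settings.py -- stop the crawl instead of deleting the next-page rule
CLOSESPIDER_PAGECOUNT = 100  # CloseSpider extension: stop after ~100 responses
DEPTH_LIMIT = 5              # or limit how many "next page" hops are followed
```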

I was going to open a pull request,
but found there is no branch to target.

Why is the extracted article body broken into pieces?

```
'content': '<div id="ContentBody" class="Body">\r\n'
           ' <div class="abstract">摘要</div>\r\n'
           ' <div class="b-review">【证监会副主席阎庆民:“有进有出 '
           '优胜劣汰”的市场生态正在形成】证监会副主席阎庆民今天上午在2018央视财经论坛暨**上市公司峰会主体活动中发表主旨演讲,他表示,证监会启动了新一轮上市公司退市制度改革,新增了“五大安全”重大违法强制退市情形。今年以来,长生生物等5家公司被强制退市,金亚科技等3家公司启动了强制退市程序,“有进有出优胜劣汰”的市场生态正在逐渐形成。(上海证券报)</div>\r\n'
           ' <!--浪客直播-->\r\n'
           '\r\n'
           ' <!--文章主体-->\r\n'
           ' <center><img '
           'src="https://z1.dfcfw.com/2018/12/12/201812121453371504016687.jpg" '
           'width="580" emheight="265" style="border:#d1d1d1 1px '
           'solid;padding:3px;margin:5px 0;"></center><p>\u3000\u3000'
           '据央视财经微博12月12日报道,证监会副主席阎庆民今天上午在2018央视财经论坛暨**上市公司峰会主体活动中发表主旨演讲,他表示,证监会启动了新一轮上市公司退市制度改革,新增了“五大安全”重大违法强制退市情形。今年以来,长生生物等5家公司被强制退市,金亚科技等3家公司启动了强制退市程序,“有进有出优胜劣汰”的市场生态正在逐渐形成。此外,被称为“史上最严<span '
           'id="Info.334"><a target="_blank" '
           'href="http://data.eastmoney.com/tfpxx/" class="infokey '
           '">停牌</a></span>新规”——《关于完善上市公司股票停<span id="Info.335"><a '
           'target="_blank" href="http://data.eastmoney.com/tfpxx/" '
           'class="infokey '
           '">复牌</a></span>制度的指导意见》实施以后,A股市场停盘“顽疾”明显改观,沪深两市停牌公司已降至20家左右,停牌率在国际主要市场处于领先水平。</p><p>\u3000\u3000 '
           '<strong>证监会副主席阎庆民:建议上市公司专注主业,这是金融危机后最大教训</strong></p><div ',
```

As shown above, the body is really one continuous piece of text, but it comes out split into separate quoted pieces line by line, so a later regex replacement can't match across the whole thing; ''.join didn't help either.
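Most likely the string was never actually split: that output is pprint wrapping one long value across lines, shown as adjacent Python string literals (which concatenate back into a single string). What usually breaks the regex is the embedded \r\n, because `.` does not match newlines unless re.S is passed. A small sketch:

```python
import re

# `content` is one long string containing \r\n; the log merely displays
# it as adjacent quoted pieces (implicit string-literal concatenation).
content = ('<div id="ContentBody" class="Body">\r\n'
           '  <div class="abstract">摘要</div>\r\n'
           '  <p>...</p>\r\n'
           '</div>')

# Without flags=re.S, '.' stops at \r\n and this pattern never matches.
cleaned = re.sub(r'<div class="abstract">.*?</div>', '', content, flags=re.S)
print(cleaned)
```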

Crawl rules after the 中华网 (china.com) redesign

Here are the rules I'm currently using to extract article links and the next-page link from 中华网; the author's original rules no longer work:
```python
rules = (
    Rule(LinkExtractor(allow='article/.*\.html',
                       restrict_xpaths='//div[@id="rank-defList"]//div[@class="item-con-inner"]'),
         callback='parse_item'),
    Rule(LinkExtractor(restrict_xpaths='//div[@class="pages"]//a[contains(., "下一页")]')),
)
```
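When a redesign breaks rules like these, the extractors can be checked interactively before re-running the spider; a sketch meant for scrapy shell (the target URL is whatever listing page you open the shell on):

```python
# Run inside `scrapy shell <listing-page-url>`; the shell provides `response`.
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(restrict_xpaths='//div[@class="pages"]//a[contains(., "下一页")]')
for link in le.extract_links(response):
    print(link.url)
```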

A code problem on Windows

Hi, the code runs fine on Linux but fails on Windows, as shown in the screenshots below:

(two screenshots, not preserved)

These two lines of code raise an error on Windows. What is the reason?
