scrapy's Introduction

Scrapy

This README gives a brief introduction to Scrapy by scraping a simple website.

1. Crawl workflow

a. Fetch the first page
b. Extract the content
c. Follow the pagination links
d. Save the scraped content

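2. Basic Scrapy usage

which scrapy  # show the path of the scrapy executable
scrapy startproject quotetutorial  # create a project named quotetutorial
cd quotetutorial
ls
scrapy genspider quotes quotes.toscrape.com  # generate a spider named quotes for the target site
ls
cd quotetutorial/spiders  # the main crawling code lives here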

3.1 Fetching the first page

  • quotetutorial/spiders/quotes.py
# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuoteItem

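class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Print the raw HTML of the first page to confirm the request works.
        # (A callback must yield items or requests, or return None.)
        print(response.text)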

  • quotetutorial/items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

scrapy crawl quotes

3.2 Extracting content

  • quotetutorial/spiders/quotes.py
# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote on the page sits in an element with class "quote".
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tag = quote.css('.tags .tag::text').extract()  # a quote can have several tags
            item['text'] = text
            item['author'] = author
            item['tag'] = tag
            yield item

  • quotetutorial/items.py
import scrapy


class QuoteItem(scrapy.Item):
    # Fields filled in by QuotesSpider.parse
    text = scrapy.Field()
    author = scrapy.Field()
    tag = scrapy.Field()

A brief look at the interactive Scrapy shell

scrapy shell quotes.toscrape.com  # start the interactive shell
response
quotes = response.css('.quote')  # select elements with a CSS selector
quotes  # a list of selectors, one per matching element
quotes[0]  # the first element
quotes[0].css('.text')
quotes[0].css('.text::text')
quotes[0].css('.text::text').extract()  # returns a list
text = quotes[0].css('.text::text').extract_first()  # returns the first match

3.3 Following pagination

  • quotetutorial/spiders/quotes.py
import scrapy
from quotetutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tag = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tag'] = tag
            yield item

        # Follow the "next" link; the last page has none, so stop there.
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page is not None:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
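The href extracted from the pager is relative, which is why it is passed through response.urljoin before a new request is built. The standard library's urljoin shows roughly the same resolution. This is a sketch only: the href value '/page/2/' is an assumption about the site's markup, and Scrapy's own method resolves against the URL of the response being parsed.

```python
from urllib.parse import urljoin

base_url = 'http://quotes.toscrape.com/'  # URL of the page being parsed
next_href = '/page/2/'                    # assumed relative href of the "next" link

# Resolve the relative href against the page URL to get an absolute URL.
absolute_url = urljoin(base_url, next_href)
print(absolute_url)
```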

3.4 Saving the content

Saving to a file

scrapy crawl quotes -o quotes.json  # supported formats: .csv .jl .json .marshal .pickle .xml
scrapy crawl quotes -o quotes.
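Of the formats listed, .jl is JSON Lines: one JSON object per line, which is convenient to read back incrementally. Below is a minimal sketch of parsing such output with the standard library. The sample lines are made up (only the field names follow QuoteItem), and read_json_lines is a hypothetical helper, not part of Scrapy.

```python
import io
import json

# Made-up sample standing in for the contents of a .jl export (illustrative only).
sample = io.StringIO(
    '{"text": "Sample quote one", "author": "Author A", "tag": ["x", "y"]}\n'
    '{"text": "Sample quote two", "author": "Author B", "tag": []}\n'
)


def read_json_lines(handle):
    """Parse one JSON object per non-empty line of a file-like object."""
    return [json.loads(line) for line in handle if line.strip()]


items = read_json_lines(sample)
print(len(items), items[0]['author'])
```

To read a real export, the same helper can be given an open file handle instead of the in-memory sample.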

  • pipelines.py
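Saving to a database
import pymongo


class MongoPipeline(object):

    def __init__(self):
        # Connect to a MongoDB server on localhost (default port).
        self.client = pymongo.MongoClient('localhost')
        self.db = self.client['quotestutorial']

    def process_item(self, item, spider):
        # Store each item as a plain dict in the "quotes" collection.
        self.db['quotes'].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # Scrapy passes the spider instance when the crawl ends.
        self.client.close()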
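Data processing
from scrapy.exceptions import DropItem


class TextPipeline(object):
    def __init__(self):
        self.limit = 50  # maximum number of characters to keep

    def process_item(self, item, spider):
        if item['text']:
            # Truncate long quotes and mark the cut with an ellipsis.
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + "..."
            return item
        else:
            # Raise (not return) DropItem so that Scrapy discards the item.
            raise DropItem("Missing Text")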
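A pipeline class only runs once it is registered in the project's settings.py under ITEM_PIPELINES. The fragment below is a sketch: the module path assumes the project layout created above, and the priority value 300 is an arbitrary choice.

```python
# settings.py (fragment)
# Keys are import paths of pipeline classes; values set the order.
# Pipelines with lower numbers run first (conventionally 0-1000).
ITEM_PIPELINES = {
    'quotetutorial.pipelines.TextPipeline': 300,
}
```

Any further pipeline, such as the MongoDB one, would be added to the same dictionary with a higher number so that it receives the already truncated text.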

Contributors

dcwjh
