
requests_html_spider's Introduction

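Writing spiders with requests-html, the upgraded successor to requests, and building a set of general-purpose spider modules


Installation: pip install requests-html

Building the commonly used, general-purpose spider components

Overview: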

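  • 1. Spider modules that support several parsing modes, including pyquery, XPath, JavaScript, BeautifulSoup, and regular expressions; for usage, see the Chinese documentation above;
  • 2. Log files for each kind of crawl event, such as crawl logs and error logs;
  • 3. Start URLs can be read from Redis; only the Redis key has to be provided, with no extra code;
  • 4. Scraped data can be persisted to the common storage back ends: CSV, JSON, MySQL, Redis, Kafka, and MongoDB;
  • 5. The framework is mainly a combination of these modules; the crawl logic itself is left to you to implement as your needs require.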

File tree:

|-Requests_Html_Spider                 |--Project directory
   |--BaseFile                         |--Basic configuration
       |---GetLocalFile.py             |--Reads local files, such as URL lists
       |---GetProxyIp.py               |--Fetches proxy IPs
       |---Logger.py                   |--Configures logging
       |---ReadConfig.py               |--Reads configuration files
       |---UserAgent.py                |--Rotates request headers
   |--Common                           |--Shared helper classes
       |---CsvHelper.py                |--CSV operations
       |---JsonHelper.py               |--JSON operations
       |---KafkaHelper.py              |--Kafka operations
       |---MongoHelper.py              |--MongoDB operations
       |---MysqlHelper.py              |--MySQL operations
       |---RedisHelper.py              |--Redis operations
   |--Config                           |--Configuration
       |---HEADERS.py                  |--Request header settings
       |---KAFKA                       |--Kafka settings
       |---MONGODB                     |--MongoDB settings
       |---MYSQL                       |--MySQL settings
       |---PROXYIP                     |--Proxy IP settings
       |---REDIS                       |--Redis settings
   |--Data                             |--Output file directory
   |--Logs                             |--Log file directory
   |--Spider                           |--Spider classes
       |---request_html_demo_1.py      |--Scrapes Python spider tutorials from Jianshu
       |---request_html_demo_2.py      |--Scrapes news from Cnblogs
       |---request_html_demo_3.py      |--Scrapes an HD desktop wallpaper gallery
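
The helpers under Common are not reproduced here. As a minimal sketch of what CSV and JSON persistence of scraped items can look like, the following uses only the standard library. The function names save_csv and save_json_lines and the item fields are hypothetical; they are not the actual API of CsvHelper.py or JsonHelper.py.

```python
import csv
import json


def save_csv(items, path):
    """Write a list of dicts to a CSV file, using the first item's keys as the header."""
    if not items:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(items[0].keys()))
        writer.writeheader()
        writer.writerows(items)


def save_json_lines(items, path):
    """Append items to a file as one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for item in items:
            # ensure_ascii=False keeps non-ASCII text (e.g. Chinese titles) readable
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
```

A spider could call save_csv(items, "Data/posts.csv") once per run, or save_json_lines(batch, "Data/posts.jsonl") after each page so that partial results survive a crash; both paths are illustrative.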

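Note: This framework is mainly a combination of the basic modules that spiders commonly need. It avoids rewriting those components for every new spider, and building on requests-html makes spiders simpler to write. requests-html is a newer module written by the original author of requests specifically for scraping, and it is still being updated; see the official GitHub repository.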
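
To make the "combination of modules" idea concrete, here is a minimal sketch of a crawl loop in which the fetcher, the parser, and the storage step are passed in as plain callables. All names (run_spider, fetch, parse, store) are hypothetical and do not come from this repository; the in-memory deque stands in for a Redis list of start URLs, and the fake fetcher stands in for a requests-html session.

```python
import logging
from collections import deque

logger = logging.getLogger("spider")


def run_spider(start_urls, fetch, parse, store):
    """Process URLs until the queue is empty; log and skip URLs that fail.

    Returns a (succeeded, failed) tuple of counts.
    """
    queue = deque(start_urls)
    ok, failed = 0, 0
    while queue:
        url = queue.popleft()
        try:
            page = fetch(url)
            store(parse(url, page))
            ok += 1
            logger.info("crawled %s", url)
        except Exception:
            # A single bad page should not stop the whole crawl
            failed += 1
            logger.exception("failed to crawl %s", url)
    return ok, failed


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    pages = {"https://example.com/a": "<title>A</title>"}  # fake site
    results = []
    counts = run_spider(
        ["https://example.com/a", "https://example.com/missing"],
        fetch=lambda url: pages[url],  # raises KeyError for unknown URLs
        parse=lambda url, page: {"url": url, "length": len(page)},
        store=results.append,
    )
    print(counts, results)
```

Because each step is an argument, swapping CSV storage for MongoDB, or a local URL file for Redis, changes only the call site and not the loop.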

Only Python 3.6 is supported.

Contributors: liangchengdeye
