kingking888 / requests_html_spider

This project forked from liangchengdeye/requests_html_spider

Writing crawlers with requests-html (the upgraded version of requests) and building general-purpose crawler modules

License: Apache License 2.0

Python 100.00%

requests_html_spider's Introduction

Installation: pip install requests-html

Chinese documentation: https://cncert.github.io/requests-html-doc-cn/#/

Building the common components of a general-purpose crawler

Overview:

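1. Spider modules can parse pages with pyquery, XPath, JavaScript, BeautifulSoup, regular expressions and other modes; see the Chinese documentation linked above for usage.
2. Crawl logs, error logs and other log output can be saved to disk.
3. Start URLs can come from Redis: you only supply the Redis key, with no extra code to write.
4. Scraped data can be persisted to CSV, JSON, MySQL, Redis, Kafka, MongoDB and other commonly used storage backends.
5. The framework is mainly a combination of these modules; the crawling logic itself is left to you to implement as your needs require.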
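
Point 3 can be sketched roughly as follows. This is an illustration only, not the repository's code: `iter_start_urls`, `FakeRedis` and the key name `spider:start_urls` are hypothetical, and the sketch assumes the start URLs sit in a Redis list that is drained with `LPOP`. A real `redis.Redis` client from redis-py offers the same `lpop(name)` call, which returns `bytes` by default and `None` once the list is empty.

```python
from collections import deque


class FakeRedis:
    """In-memory stand-in for a Redis client (hypothetical, for illustration).

    It implements only lpop(), the single call this sketch needs.
    """

    def __init__(self, data):
        self._lists = {key: deque(values) for key, values in data.items()}

    def lpop(self, key):
        items = self._lists.get(key)
        # Missing key and empty list both behave like Redis: return None.
        return items.popleft() if items else None


def iter_start_urls(client, redis_key):
    """Yield start URLs from the list stored at redis_key until it is empty."""
    while True:
        raw = client.lpop(redis_key)
        if raw is None:
            break
        # redis-py returns bytes unless decode_responses=True was set.
        yield raw.decode("utf-8") if isinstance(raw, bytes) else raw


client = FakeRedis(
    {"spider:start_urls": [b"https://example.com/a", b"https://example.com/b"]}
)
start_urls = list(iter_start_urls(client, "spider:start_urls"))
print(start_urls)
```

Because the function only relies on `lpop`, swapping `FakeRedis` for a real client should not require changes to the spider code, which is the convenience point 3 describes.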

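File tree:

|-Requests_Html_Spider              |--project root directory
   |--BaseFile                      |--basic configuration
       |---GetLocalFile.py          |--reads local files, e.g. URL lists
       |---GetProxyIp.py            |--fetches proxy IPs
       |---Logger.py                |--configures logging
       |---ReadConfig.py            |--reads configuration files
       |---UserAgent.py             |--rotates request headers
   |--Common                        |--shared helper classes
       |---CsvHelper.py             |--CSV file operations
       |---JsonHelper.py            |--JSON file operations
       |---KafkaHelper.py           |--Kafka operations
       |---MongoHelper.py           |--MongoDB operations
       |---MysqlHelper.py           |--MySQL operations
       |---RedisHelper.py           |--Redis operations
   |--Config                        |--configuration
       |---HEADERS.py               |--request header configuration
       |---KAFKA                    |--Kafka configuration
       |---MONGODB                  |--MongoDB configuration
       |---MYSQL                    |--MySQL configuration
       |---PROXYIP                  |--proxy IP configuration
       |---REDIS                    |--Redis configuration
   |--Data                          |--file storage directory
   |--Logs                          |--log storage directory
   |--Spider                        |--spider classes
       |---request_html_demo_1.py   |--scrapes Python crawler tutorials from Jianshu
       |---request_html_demo_2.py   |--scrapes news from cnblogs
       |---request_html_demo_3.py   |--scrapes an HD desktop wallpaper gallery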
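
As a rough idea of what file-based persistence helpers such as the CSV and JSON ones might look like, here is a minimal sketch using only the standard library. The function names `save_csv` and `save_json`, the sample items and the temporary output directory are all assumptions made for illustration; the repository's own helper classes may be organized quite differently.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_csv(path, rows, fieldnames):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def save_json(path, rows):
    """Write a list of dicts to a JSON file, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


# Stand-in for scraped items; a real spider would build these while parsing.
items = [
    {"title": "First post", "url": "https://example.com/1"},
    {"title": "Second post", "url": "https://example.com/2"},
]

# A temporary directory is used here; the project itself stores files under Data/.
data_dir = Path(tempfile.mkdtemp())
csv_path = data_dir / "items.csv"
json_path = data_dir / "items.json"

save_csv(csv_path, items, fieldnames=["title", "url"])
save_json(json_path, items)
```

Passing `ensure_ascii=False` matters for a crawler that scrapes Chinese-language sites, since it keeps the stored text human-readable instead of escaping it to `\uXXXX` sequences.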

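Note: this framework is mainly a combination of the basic modules that crawlers commonly need, so these components do not have to be rewritten for every new crawler. Combining them with requests-html makes spiders simpler to write. requests-html is a newer module written specifically for scraping by the original author of requests, and it is still being updated; see its official GitHub repository.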

Only Python 3.6 is supported.
