
proxypool's Introduction

ProxyPool


A simple and efficient proxy pool that provides the following features:

  • Periodically crawls free proxy sites; simple and easy to extend.
  • Stores proxies in Redis and ranks them by availability.
  • Periodically tests and filters the proxies, discarding unusable ones and keeping the usable ones.
  • Provides a proxy API for randomly fetching proxies that have passed the tests.

For an explanation of how the proxy pool works, see「如何搭建一个高效的代理池」("How to Build an Efficient Proxy Pool"); reading it before use is recommended.

Getting Started

First, clone the repository and enter the ProxyPool directory:

git clone https://github.com/Python3WebSpider/ProxyPool.git
cd ProxyPool

Then choose either the Docker setup or the conventional setup described below.

Requirements

The proxy pool can be run in two ways: with Docker (recommended) or conventionally. The requirements are as follows:

Docker

To use Docker, the following need to be installed:

  • Docker
  • Docker-Compose

Installation instructions can easily be found online.

Official Docker Hub image: germey/proxypool

Conventional Setup

The conventional setup requires a Python environment and a Redis environment:

  • Python>=3.6
  • Redis

Running with Docker

If Docker and Docker-Compose are installed, a single command is all it takes:

docker-compose up

The output looks something like this:

redis        | 1:M 19 Feb 2020 17:09:43.940 * DB loaded from disk: 0.000 seconds
redis        | 1:M 19 Feb 2020 17:09:43.940 * Ready to accept connections
proxypool    | 2020-02-19 17:09:44,200 CRIT Supervisor is running as root.  Privileges were not dropped because no user is specified in the config file.  If you intend to run as root, you can set user=root in the config file to avoid this message.
proxypool    | 2020-02-19 17:09:44,203 INFO supervisord started with pid 1
proxypool    | 2020-02-19 17:09:45,209 INFO spawned: 'getter' with pid 10
proxypool    | 2020-02-19 17:09:45,212 INFO spawned: 'server' with pid 11
proxypool    | 2020-02-19 17:09:45,216 INFO spawned: 'tester' with pid 12
proxypool    | 2020-02-19 17:09:46,596 INFO success: getter entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
proxypool    | 2020-02-19 17:09:46,596 INFO success: server entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
proxypool    | 2020-02-19 17:09:46,596 INFO success: tester entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

As you can see, Redis, the Getter, the Server, and the Tester have all started successfully.

You can now visit http://localhost:5555/random to get a random usable proxy.
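For example, you can fetch one from the command line; the API returns the proxy as a plain host:port string:

curl http://localhost:5555/random
# e.g. 116.196.115.209:8080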

You can also build the image yourself; just run:

docker-compose -f build.yaml up

If downloads are very slow, you can edit the Dockerfile yourself and change:

- RUN pip install -r requirements.txt
+ RUN pip install -r requirements.txt -i https://pypi.douban.com/simple

Running Conventionally

If you prefer not to use Docker, you can also run the proxy pool after setting up the Python and Redis environments. The steps are as follows.

Installing and Configuring Redis

A locally installed Redis, a Redis started via Docker, or a remote Redis all work, as long as the pool can connect to it.

First, set a few environment variables; the proxy pool reads these values from the environment.

There are two ways to configure Redis through environment variables: either set host, port, and password individually, or set a single connection string. They work as follows.

To set host, port, and password (use an empty string if there is no password), for example:

export PROXYPOOL_REDIS_HOST='localhost'
export PROXYPOOL_REDIS_PORT=6379
export PROXYPOOL_REDIS_PASSWORD=''
export PROXYPOOL_REDIS_DB=0

Or set only the connection string:

export PROXYPOOL_REDIS_CONNECTION_STRING='redis://localhost'

The connection string must follow the format redis://[:password@]host[:port][/database]; the bracketed parts can be omitted, port defaults to 6379, database defaults to 0, and the password defaults to empty.

Either of the two approaches above works.
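For example, a pool backed by a password-protected remote Redis on database 1 could be configured with a single connection string (the host, password, and database index below are placeholders):

export PROXYPOOL_REDIS_CONNECTION_STRING='redis://:mypassword@redis.example.com:6379/1'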

Installing Dependencies

It is strongly recommended to create a virtual environment with Conda or virtualenv; the Python version must be at least 3.6.

Then install the dependencies with pip:

pip3 install -r requirements.txt

Running the Proxy Pool

There are two ways to run the proxy pool: run the Tester, Getter, and Server all together, or run them separately as needed.

In general you can simply run everything:

python3 run.py

This starts the Tester, Getter, and Server; you can then visit http://localhost:5555/random to get a random usable proxy.

Alternatively, once you understand the proxy pool's architecture, you can run the components individually as needed:

python3 run.py --processor getter
python3 run.py --processor tester
python3 run.py --processor server

The processor argument selects whether to run the Tester, the Getter, or the Server.

Usage

Once it is running, you can get a random usable proxy from http://localhost:5555/random.

This is easy to integrate into your own code; the following example fetches a proxy and uses it to crawl a page:

import requests

proxypool_url = 'http://127.0.0.1:5555/random'
target_url = 'http://httpbin.org/get'

def get_random_proxy():
    """
    get random proxy from proxypool
    :return: proxy
    """
    return requests.get(proxypool_url).text.strip()

def crawl(url, proxy):
    """
    use proxy to crawl page
    :param url: page url
    :param proxy: proxy, such as 8.8.8.8:8888
    :return: html
    """
    proxies = {'http': 'http://' + proxy}
    return requests.get(url, proxies=proxies).text


def main():
    """
    main method, entry point
    :return: none
    """
    proxy = get_random_proxy()
    print('get random proxy', proxy)
    html = crawl(target_url, proxy)
    print(html)

if __name__ == '__main__':
    main()

The output looks like this:

get random proxy 116.196.115.209:8080
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-5e4d7140-662d9053c0a2e513c7278364"
  },
  "origin": "116.196.115.209",
  "url": "https://httpbin.org/get"
}

As you can see, a proxy was successfully obtained and used to request httpbin.org, which confirms that the proxy works.
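Because free proxies fail frequently, a real crawler usually wraps the pool in a small retry loop and sets a timeout. The sketch below is one way to do that, not part of the project itself; the retry count and timeout are arbitrary choices:

import requests

PROXYPOOL_URL = 'http://127.0.0.1:5555/random'

def fetch_with_proxy(url, max_retries=3, timeout=10):
    """Try up to max_retries random proxies from the pool before giving up."""
    for _ in range(max_retries):
        # ask the pool for a fresh proxy on every attempt
        proxy = requests.get(PROXYPOOL_URL).text.strip()
        proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
        try:
            return requests.get(url, proxies=proxies, timeout=timeout)
        except requests.RequestException:
            continue  # this proxy failed, try another one
    raise RuntimeError('no working proxy found after %d attempts' % max_retries)

print(fetch_with_proxy('https://httpbin.org/get').text)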

Configuration Options

The proxy pool can be configured through environment variables.

Switches

  • ENABLE_TESTER: whether the Tester is allowed to start, default true
  • ENABLE_GETTER: whether the Getter is allowed to start, default true
  • ENABLE_SERVER: whether the Server is allowed to start, default true

Environment

  • APP_ENV: runtime environment; one of dev, test, or prod (development, testing, production), default dev
  • APP_DEBUG: debug mode, true or false, default true
  • APP_PROD_METHOD: how the application is served in production, default gevent; alternatives are tornado and meinheld (which require installing the tornado or meinheld module respectively)

Redis 连接

  • PROXYPOOL_REDIS_HOST / REDIS_HOST:Redis 的 Host,其中 PROXYPOOL_REDIS_HOST 会覆盖 REDIS_HOST 的值。
  • PROXYPOOL_REDIS_PORT / REDIS_PORT:Redis 的端口,其中 PROXYPOOL_REDIS_PORT 会覆盖 REDIS_PORT 的值。
  • PROXYPOOL_REDIS_PASSWORD / REDIS_PASSWORD:Redis 的密码,其中 PROXYPOOL_REDIS_PASSWORD 会覆盖 REDIS_PASSWORD 的值。
  • PROXYPOOL_REDIS_DB / REDIS_DB:Redis 的数据库索引,如 0、1,其中 PROXYPOOL_REDIS_DB 会覆盖 REDIS_DB 的值。
  • PROXYPOOL_REDIS_CONNECTION_STRING / REDIS_CONNECTION_STRING:Redis 连接字符串,其中 PROXYPOOL_REDIS_CONNECTION_STRING 会覆盖 REDIS_CONNECTION_STRING 的值。
  • PROXYPOOL_REDIS_KEY / REDIS_KEY:Redis 储存代理使用字典的名称,其中 PROXYPOOL_REDIS_KEY 会覆盖 REDIS_KEY 的值。

Processors

  • CYCLE_TESTER: Tester run cycle, i.e. how often tests are run, default 20 seconds
  • CYCLE_GETTER: Getter run cycle, i.e. how often proxies are fetched, default 100 seconds
  • TEST_URL: URL used for testing, default Baidu
  • TEST_TIMEOUT: test timeout, default 10 seconds
  • TEST_BATCH: number of proxies tested per batch, default 20
  • TEST_VALID_STATUS: HTTP status codes treated as valid during testing
  • API_HOST: host the proxy Server listens on, default 0.0.0.0
  • API_PORT: port the proxy Server listens on, default 5555
  • API_THREADED: whether the proxy Server runs multi-threaded, default true

Logging

  • LOG_DIR: relative path of the log directory
  • LOG_RUNTIME_FILE: runtime log file name
  • LOG_ERROR_FILE: error log file name
  • LOG_ROTATION: log rotation period or size, default 500MB, see loguru - rotation
  • LOG_RETENTION: log retention period, default 7 days, see loguru - retention
  • ENABLE_LOG_FILE: whether to write log files, default true; if set to false, ENABLE_LOG_RUNTIME_FILE and ENABLE_LOG_ERROR_FILE have no effect
  • ENABLE_LOG_RUNTIME_FILE: whether to write the runtime log file, default true
  • ENABLE_LOG_ERROR_FILE: whether to write the error log file, default true

All of the options above can be configured with environment variables: just set the corresponding variables before running. For example, to change the test URL and the Redis key:

export TEST_URL=http://weibo.cn
export REDIS_KEY=proxies:weibo

This builds a proxy pool dedicated to Weibo: every proxy kept as valid can be used to crawl Weibo.

If you start the proxy pool with Docker-Compose, specify the environment variables in the docker-compose.yml file instead, for example:

version: "3"
services:
  redis:
    image: redis:alpine
    container_name: redis
    command: redis-server
    ports:
      - "6379:6379"
    restart: always
  proxypool:
    build: .
    image: "germey/proxypool"
    container_name: proxypool
    ports:
      - "5555:5555"
    restart: always
    environment:
      REDIS_HOST: redis
      TEST_URL: http://weibo.cn
      REDIS_KEY: proxies:weibo

Extending the Proxy Crawlers

The proxy crawlers all live in the proxypool/crawlers folder; currently only a limited number of proxy sources are integrated.

To add a crawler, simply create a new Python file under the crawlers folder and declare a class in it.

The expected structure is as follows:

from pyquery import PyQuery as pq
from proxypool.schemas.proxy import Proxy
from proxypool.crawlers.base import BaseCrawler

BASE_URL = 'http://www.664ip.cn/{page}.html'
MAX_PAGE = 5

class Daili66Crawler(BaseCrawler):
    """
    daili66 crawler, http://www.66ip.cn/1.html
    """
    urls = [BASE_URL.format(page=page) for page in range(1, MAX_PAGE + 1)]

    def parse(self, html):
        """
        parse html file to get proxies
        :return:
        """
        doc = pq(html)
        trs = doc('.containerbox table tr:gt(0)').items()
        for tr in trs:
            host = tr.find('td:nth-child(1)').text()
            port = int(tr.find('td:nth-child(2)').text())
            yield Proxy(host=host, port=port)

Here you only need to define a crawler class that inherits from BaseCrawler, then define the urls variable and the parse method.

  • The urls variable is the list of proxy-site URLs to crawl; it can be generated programmatically or written out as fixed values.
  • The parse method receives a single argument, html, the HTML of a proxy page. Inside it, parse the HTML, extract the host and port, build a Proxy object, and yield it.

Fetching the pages themselves does not need to be implemented, since BaseCrawler already provides a default implementation; to change how pages are fetched, override the crawl method, as sketched below.
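The exact contract of crawl is defined in BaseCrawler; the sketch below assumes it iterates over urls, fetches each page, and yields the Proxy objects produced by parse, which matches how the default behaves conceptually. The crawler class, source URL, and headers here are purely illustrative:

import requests
from pyquery import PyQuery as pq

from proxypool.schemas.proxy import Proxy
from proxypool.crawlers.base import BaseCrawler


class CustomHeaderCrawler(BaseCrawler):
    """Hypothetical crawler that fetches pages with its own headers."""
    urls = ['http://www.example.com/free-proxies.html']  # placeholder source

    def crawl(self):
        # fetch each listed URL ourselves instead of using the default fetcher
        for url in self.urls:
            response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
            if response.status_code == 200:
                yield from self.parse(response.text)

    def parse(self, html):
        doc = pq(html)
        for tr in doc('table tr:gt(0)').items():
            host = tr.find('td:nth-child(1)').text()
            port = int(tr.find('td:nth-child(2)').text())
            yield Proxy(host=host, port=port)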

Pull Requests contributing new crawlers are very welcome, so that the proxy sources become richer and more robust.

Deployment

This project provides Kubernetes deployment scripts; to deploy to Kubernetes, please refer to kubernetes.

To Do

  • Front-end management page
  • Usage statistics and analysis

If you are interested in developing this together, feel free to leave a comment in an Issue. Many thanks!

LICENSE

MIT


proxypool's Issues

Deploying the project to Linux throws errors right away

Proxy pool started running

  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: This is a development server. Do not use it in a production deployment.
    Use a production WSGI server instead.
  • Debug mode: off
  • Running on http://0.0.0.0:5555/ (Press CTRL+C to quit)
    Proxy pool started running
  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: This is a development server. Do not use it in a production deployment.
    Use a production WSGI server instead.
  • Debug mode: off
    Process Process-3:
    Traceback (most recent call last):
    File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
    File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
    File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 35, in schedule_api
    app.run(API_HOST, API_PORT)
    File "/usr/local/python3/lib/python3.6/site-packages/flask/app.py", line 990, in run
    run_simple(host, port, self, **options)
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 1009, in run_simple
    inner()
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 962, in inner
    fd=fd,
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 805, in make_server
    host, port, app, request_handler, passthrough_errors, ssl_context, fd=fd
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 698, in init
    HTTPServer.init(self, server_address, handler)
    File "/usr/local/python3/lib/python3.6/socketserver.py", line 453, in init
    self.server_bind()
    File "/usr/local/python3/lib/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
    File "/usr/local/python3/lib/python3.6/socketserver.py", line 467, in server_bind
    self.socket.bind(self.server_address)
    OSError: [Errno 98] Address already in use
    Proxy pool started running
  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: This is a development server. Do not use it in a production deployment.
    Use a production WSGI server instead.
  • Debug mode: off
    Process Process-3:
    Traceback (most recent call last):
    File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
    File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
    File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 35, in schedule_api
    app.run(API_HOST, API_PORT)
    File "/usr/local/python3/lib/python3.6/site-packages/flask/app.py", line 990, in run
    run_simple(host, port, self, **options)
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 1009, in run_simple
    inner()
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 962, in inner
    fd=fd,
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 805, in make_server
    host, port, app, request_handler, passthrough_errors, ssl_context, fd=fd
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 698, in init
    HTTPServer.init(self, server_address, handler)
    File "/usr/local/python3/lib/python3.6/socketserver.py", line 453, in init
    self.server_bind()
    File "/usr/local/python3/lib/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
    File "/usr/local/python3/lib/python3.6/socketserver.py", line 467, in server_bind
    self.socket.bind(self.server_address)
    OSError: [Errno 98] Address already in use
    Starting to crawl proxies
    Getter started running
    Process Process-2:
    Traceback (most recent call last):
    File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 526, in connect
    sock = self._connect()
    File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 583, in _connect
    raise err
    File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 571, in _connect
    sock.connect(socket_address)
    TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 28, in schedule_getter
getter.run()
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 23, in run
if not self.is_over_threshold():
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 16, in is_over_threshold
if self.redis.count() >= POOL_UPPER_THRESHOLD:
File "/usr/local2/app/ProxyPool-master/proxypool/db.py", line 84, in count
return self.db.zcard(REDIS_KEY)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 2395, in zcard
return self.execute_command('ZCARD', name)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 836, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 1059, in get_connection
connection.connect()
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 531, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 110 connecting to 120.79.34.216:6379. Connection timed out.
Starting to crawl proxies
Getter started running
Process Process-2:
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 526, in connect
sock = self._connect()
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 583, in _connect
raise err
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 571, in _connect
sock.connect(socket_address)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 28, in schedule_getter
getter.run()
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 23, in run
if not self.is_over_threshold():
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 16, in is_over_threshold
if self.redis.count() >= POOL_UPPER_THRESHOLD:
File "/usr/local2/app/ProxyPool-master/proxypool/db.py", line 84, in count
return self.db.zcard(REDIS_KEY)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 2395, in zcard
return self.execute_command('ZCARD', name)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 836, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 1059, in get_connection
connection.connect()
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 531, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 110 connecting to 120.79.34.216:6379. Connection timed out.

It runs fine on Windows but errors out on both macOS and Linux. I've searched online for ages and I'm still completely lost; any help would be appreciated.

Proxy pool started running

How to debug this project in PyCharm

I'm using a remote environment and want to debug this project in PyCharm, but every time I try to debug run.py it says the file cannot be found. How can I debug this project with PyCharm?

AttributeError: 'OutStream' object has no attribute 'buffer'

# Start the proxy pool

from proxypool.scheduler import Scheduler
import sys
import io

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

def main():
    try:
        s = Scheduler()
        s.run()
    except:
        main()

if __name__ == '__main__':
    main()

AttributeError: 'OutStream' object has no attribute 'buffer'

About the redis-py version issue

zadd and zincrby changed between redis-py 3.x and 2.x.
In 3.x, zadd takes a dict (element name -> score).
In zincrby, the amount and value parameters are swapped.
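For reference, the redis-py 3.x-style calls look roughly like this (a sketch based on the description above, using a placeholder key and proxy):

import redis

db = redis.StrictRedis(host='localhost', port=6379, db=0)

# redis-py 2.x: db.zadd('proxies', 10, '1.2.3.4:8080')
# redis-py 3.x: member -> score pairs are passed as one dict
db.zadd('proxies', {'1.2.3.4:8080': 10})

# redis-py 2.x: db.zincrby('proxies', '1.2.3.4:8080', -1)
# redis-py 3.x: the amount comes before the member
db.zincrby('proxies', -1, '1.2.3.4:8080')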

About the db.py issue in the traceback

Previous issues all say to change the zadd calls in db.py's add and max functions to the dict form (zadd(REDIS_KEY, {proxy: score})), and to swap the zincrby arguments in the decrease function (zincrby(REDIS_KEY, -1, proxy)), but after making those changes I got errors instead.
I checked the redis-py documentation and, unless there has been another revision, what the author originally wrote is exactly what the docs require.
The documentation says the following:
[screenshots of the redis-py documentation]

So those two places don't actually need to be changed.
One place does look wrong, though:
[screenshot]
The first highlighted value should be MIN_SCORE, right?

what/

D:\Pycharm工作资料\代码流\venv\Scripts\python.exe C:/Users/ThinkPad/Downloads/ProxyPool-master/run.py
Proxy pool started running

  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: Do not use the development server in a production environment.
    Use a production WSGI server instead.
  • Debug mode: off
  • Running on http://0.0.0.0:5555/ (Press CTRL+C to quit)
    Process Process-2:
    Starting to crawl proxies
    Getter started running
    Crawling http://www.66ip.cn/1.html
    Fetching http://www.66ip.cn/1.html
    Fetched http://www.66ip.cn/1.html 200
    Got proxy 177.185.148.46:58623
    Got proxy 131.196.143.11:7
    Got proxy 131.196.143.117:33729
    Got proxy 43.243.141.126:53281
    Got proxy 111.181.35.219:9999
    Crawling http://www.66ip.cn/2.html
    Fetching http://www.66ip.cn/2.html
    Fetched http://www.66ip.cn/2.html 200
    Got proxy 170.0.112.226:50359
    Got proxy 54.39.144.247:8080
    Got proxy 171.41.82.36:9999
    Got proxy 37.32.126.0:8080
    Got proxy 213.33.224.82:8080
    Got proxy 144.123.71.133:9999
    Got proxy 111.177.166.59:9999
    Got proxy 117.196.237.40:59250
    Got proxy 121.61.3.110:9999
    Got proxy 212.200.126.14:8080
    Got proxy 47.107.245.9:4
    Got proxy 47.107.245.94:3128
    Crawling http://www.66ip.cn/3.html
    Fetching http://www.66ip.cn/3.html
    Fetched http://www.66ip.cn/3.html 200
    Got proxy 111.177.162.175:9999
    Got proxy 110.52.235.60:9999
    Got proxy 37.224.19.1:0
    Got proxy 175.100.185.151:53281
    Got proxy 37.224.19.10:6
    Got proxy 179.127.249.5:3
    Got proxy 37.224.19.106:58553
    Got proxy 179.127.249.53:46257
    Got proxy 111.177.183.4:5
    Got proxy 1.20.101.221:55707
    Got proxy 111.177.183.45:9999
    Got proxy 91.219.171.8:4
    Crawling http://www.66ip.cn/4.html
    Fetching http://www.66ip.cn/4.html
    Fetched http://www.66ip.cn/4.html 200
    Got proxy 91.219.171.84:43726
    Got proxy 212.26.247.178:38418
    Got proxy 203.42.227.1:1
    Got proxy 203.42.227.11:3
    Got proxy 203.42.227.113:8080
    Got proxy 110.52.235.126:9999
    Got proxy 170.239.224.58:8080
    Got proxy 190.119.199.18:57333
    Got proxy 5.0.0.815:0
    Got proxy 190.152.182.150:53281
    Got proxy 119.40.98.84:46119
    Got proxy 111.177.170.220:9999
    Traceback (most recent call last):
    File "D:\python\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
    File "D:\python\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
    File "C:\Users\ThinkPad\Downloads\ProxyPool-master\proxypool\scheduler.py", line 28, in schedule_getter
    getter.run()
    File "C:\Users\ThinkPad\Downloads\ProxyPool-master\proxypool\getter.py", line 30, in run
    self.redis.add(proxy)
    File "C:\Users\ThinkPad\Downloads\ProxyPool-master\proxypool\db.py", line 30, in add
    return self.db.zadd(REDIS_KEY, score, proxy)
    File "D:\Pycharm工作资料\代码流\venv\lib\site-packages\redis\client.py", line 2320, in zadd
    for pair in iteritems(mapping):
    File "D:\Pycharm工作资料\代码流\venv\lib\site-packages\redis_compat.py", line 122, in iteritems
    return iter(x.items())
    AttributeError: 'int' object has no attribute 'items'

I took the author's code and ran it as-is; it worked at first, but errored out partway through crawling. This is hard, it feels really complicated.

Proxy pool started running

  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: Do not use the development server in a production environment.
    Use a production WSGI server instead.
  • Debug mode: off
  • Running on http://0.0.0.0:5555/ (Press CTRL+C to quit)
    Starting to crawl proxies
    Getter started running
    Crawling http://www.66ip.cn/1.html
    Fetching http://www.66ip.cn/1.html
    Fetched http://www.66ip.cn/1.html 521
    Crawling http://www.66ip.cn/2.html
    Fetching http://www.66ip.cn/2.html
    Fetched http://www.66ip.cn/2.html 521
    Crawling http://www.66ip.cn/3.html
    Fetching http://www.66ip.cn/3.html
    Fetched http://www.66ip.cn/3.html 521
    Crawling http://www.66ip.cn/4.html
    Fetching http://www.66ip.cn/4.html
    Fetched http://www.66ip.cn/4.html 521
    Fetching http://www.ip3366.net/?stype=1&page=1
    Fetched http://www.ip3366.net/?stype=1&page=1 200
    Got proxy 222.135.25.243:8060
    Got proxy 180.175.8.5:8060
    Got proxy 119.180.131.25:8060
    Got proxy 180.175.160.130:8060
    Got proxy 119.180.177.138:8060
    Got proxy 119.180.1.42:8060
    Got proxy 171.112.165.22:9999
    Got proxy 222.182.121.71:8118
    Got proxy 118.81.68.2:80
    Got proxy 117.166.3.51:8118
    Fetching http://www.ip3366.net/?stype=1&page=2
    Fetched http://www.ip3366.net/?stype=1&page=2 200
    Got proxy 171.83.164.51:9999
    Got proxy 47.101.189.13:80
    Got proxy 171.112.164.149:9999
    Got proxy 171.112.164.109:9999
    Got proxy 119.97.237.74:80
    Got proxy 197.234.42.73:8083
    Got proxy 103.120.152.182:59068
    Got proxy 117.168.86.102:8118
    Got proxy 115.215.212.116:8118
    Got proxy 103.244.91.61:8080
    Fetching http://www.ip3366.net/?stype=1&page=3
    Fetched http://www.ip3366.net/?stype=1&page=3 200
    Got proxy 117.80.137.238:9999
    Got proxy 103.233.145.133:8080
    Got proxy 117.80.17.81:8118
    Got proxy 171.83.165.10:9999
    Got proxy 43.248.123.237:8080
    Got proxy 113.227.182.15:8118
    Got proxy 138.97.219.51:65301
    Got proxy 117.41.142.159:8118
    Got proxy 197.234.42.209:8083
    Got proxy 197.234.44.125:8083
    Process Process-2:
    Traceback (most recent call last):
    File "D:\Python\lib\multiprocessing\process.py", line 297, in _bootstrap
    self.run()
    File "D:\Python\lib\multiprocessing\process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
    File "D:\spider-test\ProxyPool-master\proxypool\scheduler.py", line 28, in schedule_getter
    getter.run()
    File "D:\spider-test\ProxyPool-master\proxypool\getter.py", line 30, in run
    self.redis.add(proxy)
    File "D:\spider-test\ProxyPool-master\proxypool\db.py", line 30, in add
    return self.db.zadd(REDIS_KEY, score, proxy)
    File "D:\Python\lib\site-packages\redis\client.py", line 2320, in zadd
    for pair in iteritems(mapping):
    File "D:\Python\lib\site-packages\redis_compat.py", line 109, in iteritems
    return iter(x.items())
    AttributeError: 'int' object has no attribute 'items'

How do I fix AttributeError: type object 'URL' has no attribute 'build'?

File "run.py", line 1, in
from proxypool.scheduler import Scheduler
File "C:\ProxyPool-master\proxypool\scheduler.py", line 4, in
from proxypool.getter import Getter
File "C:\ProxyPool-master\proxypool\getter.py", line 1, in
from proxypool.tester import Tester
File "C:\ProxyPool-master\proxypool\tester.py", line 2, in
import aiohttp
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp_init_.py", line 6, in
from .client import (
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp\client.py", line 32, in
from . import hdrs, http, payload
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp\http.py", line 7, in
from .http_parser import (
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp\http_parser.py", line 755, in
from ._http_parser import (HttpRequestParser, # type: ignore # noqa

File "aiohttp_http_parser.pyx", line 44, in init aiohttp._http_parser
AttributeError: type object 'URL' has no attribute 'build'

How can this problem be solved?

Ip processing running

  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: Do not use the development server in a production environment.
    Use a production WSGI server instead.
  • Debug mode: off
  • Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
    Refreshing ip
    PoolAdder is working
    Waiting for adding
    Callback crawl_ip181
    Error occurred during loading data. Trying to use cache server http://d2g6u4gh6d9rq0.cloudfront.net/browsers/fake_useragent_0.1.10.json
    Traceback (most recent call last):
    File "C:\Python\Python36\lib\urllib\request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
    File "C:\Python\Python36\lib\http\client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
    File "C:\Python\Python36\lib\http\client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
    File "C:\Python\Python36\lib\http\client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
    File "C:\Python\Python36\lib\http\client.py", line 1026, in _send_output
    self.send(msg)
    File "C:\Python\Python36\lib\http\client.py", line 964, in send
    self.connect()
    File "C:\Python\Python36\lib\http\client.py", line 1392, in connect
    super().connect()
    File "C:\Python\Python36\lib\http\client.py", line 936, in connect
    (self.host,self.port), self.timeout, self.source_address)
    File "C:\Python\Python36\lib\socket.py", line 722, in create_connection
    raise err
    File "C:\Python\Python36\lib\socket.py", line 713, in create_connection
    sock.connect(sa)
    socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 67, in get
context=context,
File "C:\Python\Python36\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Python\Python36\lib\urllib\request.py", line 526, in open
response = self._open(req, data)
File "C:\Python\Python36\lib\urllib\request.py", line 544, in _open
'_open', req)
File "C:\Python\Python36\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Python\Python36\lib\urllib\request.py", line 1361, in https_open
context=self._context, check_hostname=self._check_hostname)
File "C:\Python\Python36\lib\urllib\request.py", line 1320, in do_open
raise URLError(err)
urllib.error.URLError:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 154, in load
for item in get_browsers(verify_ssl=verify_ssl):
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 97, in get_browsers
html = get(settings.BROWSERS_STATS_PAGE, verify_ssl=verify_ssl)
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 84, in get
raise FakeUserAgentError('Maximum amount of retries reached')
fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached
Process Process-2:
Traceback (most recent call last):
File "C:\Python\Python36\lib\multiprocessing\process.py", line 249, in _bootstrap
self.run()
File "C:\Python\Python36\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "C:\迅雷下载\ProxyPool-master\proxypool\schedule.py", line 130, in check_pool
adder.add_to_queue()
File "C:\迅雷下载\ProxyPool-master\proxypool\schedule.py", line 87, in add_to_queue
raw_proxies = self._crawler.get_raw_proxies(callback)
File "C:\迅雷下载\ProxyPool-master\proxypool\getter.py", line 28, in get_raw_proxies
for proxy in eval("self.{}()".format(callback)):
File "C:\迅雷下载\ProxyPool-master\proxypool\getter.py", line 35, in crawl_ip181
html = get_page(start_url)
File "C:\迅雷下载\ProxyPool-master\proxypool\utils.py", line 14, in get_page
'User-Agent': ua.random,
UnboundLocalError: local variable 'ua' referenced before assignment
Refreshing ip
Waiting for adding
Refreshing ip
Waiting for adding
Refreshing ip

The zadd method for Redis sorted sets has changed

def add(self, proxy, score=INITIAL_SCORE):
    """
    Add a proxy and set its score to the maximum.
    :param proxy: proxy
    :param score: score
    :return: result of the add
    """
    if not re.match('\d+\.\d+\.\d+\.\d+\:\d+', proxy):
        print('Proxy does not match the expected format', proxy, 'discarded')
        return
    if not self.db.zscore(REDIS_KEY, proxy):
        dic = {}
        dic[proxy] = score
        return self.db.zadd(REDIS_KEY, dic)

When I visit localhost:5555/random, the proxy doesn't change; refreshing repeatedly only ever returns the same initial proxy address. What could the problem be?

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Environments (please complete the following information):

  • OS: [e.g. macOS 10.15.2]
  • Python [e.g. Python 3.6]
  • Browser [e.g. Chrome 67 ]

Additional context
Add any other context about the problem here.

The program does not run successfully

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.
[screenshot attached in the original issue]

Environments (please complete the following information):

  • OS: [win10]
  • Python [Python 3.7]

Additional context
Add any other context about the problem here.

Error when writing

Runtime error

/proxypool/db.py", line 30, in add

return iter(x.items())
AttributeError: 'int' object has no attribute 'items'

Configuration related to the setting.py file in the proxy pool project

Not a bug, just two suggestions:
1. The project's setting.py declares a LOG_DIR parameter for the log path, but it is never used.
A ...\project\ProxyPool\logs folder should be created, and the configuration changed from:
logger.add(env.str('LOG_RUNTIME_FILE', 'runtime.log'), level='DEBUG', rotation='1 week', retention='20 days')
logger.add(env.str('LOG_ERROR_FILE', 'error.log'), level='ERROR', rotation='1 week')

to:
logger.add(env.str('LOG_RUNTIME_FILE', f'{LOG_DIR}/runtime.log'), level='DEBUG', rotation='1 week', retention='20 days')
logger.add(env.str('LOG_ERROR_FILE', f'{LOG_DIR}/error.log'), level='ERROR', rotation='1 week')

2. If the ENABLE_TESTER, ENABLE_GETTER, and ENABLE_SERVER switches in setting.py are all False, running run.py raises an error (and the finally clause of the try block raises again); scheduler.py could be adjusted to handle this. (This one is nitpicky and can be ignored.)
[screenshot: switch parameter configuration]

zadd change in the newer redis version

zadd changed in the new version; the call needs to become zadd(REDIS_KEY, {proxy: score}).
There are two places to change: RedisClient.add() and RedisClient.max().

Error raised, asking for help: attributes() got an unexpected keyword argument 'frozen'

Traceback (most recent call last):
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\run.py", line 1, in
from proxypool.scheduler import Scheduler
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\proxypool\scheduler.py", line 4, in
from proxypool.getter import Getter
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\proxypool\getter.py", line 1, in
from proxypool.tester import Tester
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\proxypool\tester.py", line 2, in
import aiohttp
File "D:\Anaconda3\lib\site-packages\aiohttp_init_.py", line 6, in
from .client import * # noqa
File "D:\Anaconda3\lib\site-packages\aiohttp\client.py", line 16, in
from . import client_exceptions, client_reqrep
File "D:\Anaconda3\lib\site-packages\aiohttp\client_reqrep.py", line 18, in
from . import hdrs, helpers, http, multipart, payload
File "D:\Anaconda3\lib\site-packages\aiohttp\helpers.py", line 161, in
@attr.s(frozen=True, slots=True)
TypeError: attributes() got an unexpected keyword argument 'frozen'

Consider adding a crawler function that reads proxies directly from a file

I wrote a rough one; put it inside the Crawler class, and use one "address:port" entry per line.
def crawl_file(self):
    filename = 'proxy.txt'  # the txt file sits next to the script, so no full path is needed
    with open(filename, 'r') as file_to_read:
        while True:
            lines = file_to_read.readline()  # read one full line
            if not lines:
                break
            yield lines.strip()  # strip the newline so each proxy stays in host:port form

The proxy-fetching process seems to have died; what is going on?

While running, the proxy-crawling process seems to have died. I don't know what the problem is.
The tester process and the API process keep running, but the proxy-crawling process does nothing and the number of proxies in the Redis queue keeps dropping. Does anyone know what the problem is?

HTTPS error on a MacBook

➜  ProxyPool git:(master) pip3 install -r requirements.txt
pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
Collecting aiohttp>=1.3.3 (from -r requirements.txt (line 1))
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Could not fetch URL https://pypi.org/simple/aiohttp/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/aiohttp/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) - skipping
  Could not find a version that satisfies the requirement aiohttp>=1.3.3 (from -r requirements.txt (line 1)) (from versions: )
No matching distribution found for aiohttp>=1.3.3 (from -r requirements.txt (line 1))
pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
Could not fetch URL https://pypi.org/simple/pip/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/pip/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) - skipping
➜  ProxyPool git:(master)

Optimized the IP-crawling code: replaced all the regexes with pyquery for extraction

import json
import re
from .utils import get_page
from pyquery import PyQuery as pq


class ProxyMetaclass(type):
    def __new__(cls, name, bases, attrs):
        count = 0
        attrs['CrawlFunc'] = []
        for k, v in attrs.items():
            if 'crawl_' in k:
                attrs['CrawlFunc'].append(k)
                count += 1
        attrs['CrawlFuncCount'] = count
        return type.__new__(cls, name, bases, attrs)


class Crawler(object, metaclass=ProxyMetaclass):
    def get_proxies(self, callback):
        proxies = []
        for proxy in eval(f"self.{callback}()"):
            print('Got proxy', proxy)
            proxies.append(proxy)
        return proxies

    def crawl_daili66(self, page_count=4):
        """
        Crawl daili66 (66ip.cn)
        :param page_count: number of pages
        :return: proxies
        """
        start_url = 'http://www.66ip.cn/{}.html'
        urls = [start_url.format(page) for page in range(1, page_count + 1)]
        for url in urls:
            print('Crawling', url)
            html = get_page(url)
            if html:
                doc = pq(html)
                trs = doc('.containerbox table tr:gt(0)').items()  # index > 0: the first tr row has no ip or port
                for tr in trs:
                    ip = tr.find('td:nth-child(1)').text()
                    port = tr.find('td:nth-child(2)').text()
                    yield ':'.join([ip.strip(), port.strip()])

    def crawl_ip3366(self):
        for i in range(1, 4):
            start_url = 'http://www.ip3366.net/?stype=1&page={}'.format(i)
            html = get_page(start_url)
            if html:
                doc = pq(html)
                trs = doc('#container #list table tbody tr').items()
                for tr in trs:
                    ip = tr.find('td:nth-child(1)').text()
                    port = tr.find('td:nth-child(2)').text()
                    yield ':'.join([ip.strip(), port.strip()])

    def crawl_kuaidaili(self):
        for i in range(1, 4):
            start_url = 'http://www.kuaidaili.com/free/inha/{}/'.format(i)
            html = get_page(start_url)
            if html:
                doc = pq(html)
                trs = doc('#content .con-body #list table tbody tr').items()
                for tr in trs:
                    ip = tr.find('td:nth-child(1)').text()
                    port = tr.find('td:nth-child(2)').text()
                    yield ':'.join([ip.strip(), port.strip()])

    def crawl_iphai(self):
        start_url = 'http://www.iphai.com/'
        html = get_page(start_url)
        if html:
            doc = pq(html)
            trs = doc('.container .table tr:gt(0)').items()
            for tr in trs:
                ip = tr.find('td:nth-child(1)').text()
                port = tr.find('td:nth-child(2)').text()
                yield ':'.join([ip.strip(), port.strip()])

    def crawl_xicidaili(self):
        for i in range(1, 3):
            start_url = 'http://www.xicidaili.com/nn/{}'.format(i)
            html = get_page(start_url)
            if html:
                doc = pq(html)
                trs = doc('#wrapper #body table tr:gt(0)').items()
                for tr in trs:
                    ip = tr.find('td:nth-child(2)').text()
                    port = tr.find('td:nth-child(3)').text()
                    yield ':'.join([ip.strip(), port.strip()])

    def crawl_data5u(self):
        start_url = 'http://www.data5u.com/'
        html = get_page(start_url)
        if html:
            doc = pq(html)
            uls = doc('.wlist>ul ul:gt(0)').items()
            for ul in uls:
                ip = ul.find('span:nth-child(1)').text()
                port = ul.find('span:nth-child(2)').text()
                yield ':'.join([ip.strip(), port.strip()])

Just swap this in yourself; I tested it and it works. As of 2019-10-10.

Is there a problem with how the Redis connection pool is managed?

https://stackoverflow.com/questions/31663288/how-do-i-properly-use-connection-pools-in-redis
I'm thinking that creating a new connection to Redis on every request is wasteful; it could instead be written as:

redis_pool = None

class RedisClient(object):
    def __init__(self, host=HOST, port=PORT):
        global redis_pool
        if not redis_pool:
            if PASSWORD:
                redis_pool = redis.Redis(host=host, port=port, password=PASSWORD)
            else:
                redis_pool = redis.Redis(host=host, port=port)
            self._db = redis_pool
        else:
            self._db = redis_pool

The console_script in setup.py

I found that with run:cli in console_scripts, the installed command cannot be used at all. Why is it written that way? After renaming run to pool_run and changing the entry to pool_run:main, it works fine.

The API has no get route

Not sure whether the author simply forgot to write it; it should be changed to random, otherwise no proxy can be fetched.

Optimize on-demand use of paid proxies

Free proxies have a low availability rate, so ideally the pool would combine paid IPs with free IPs.
That raises a problem: the paid IPs get tested endlessly, which costs money even while no actual business is using the proxies.

It would be good to add an on-demand mechanism for paid proxies:
only start pulling paid proxies when a crawler actually needs them, and make the number of IPs pulled from the provider configurable.
