
proxypool's Introduction

ProxyPool


A simple and efficient proxy pool that provides the following features:

  • Periodically crawls free proxy sites; simple and easy to extend.
  • Stores proxies in Redis and ranks them by availability.
  • Periodically tests and filters the proxies, discarding unusable ones and keeping the usable ones.
  • Provides a proxy API for randomly fetching proxies that have passed the tests.

For an explanation of how the proxy pool works, see「如何搭建一个高效的代理池」("How to Build an Efficient Proxy Pool"); reading it before use is recommended.

Getting Started

First, clone the repository and enter the ProxyPool directory:

git clone https://github.com/Python3WebSpider/ProxyPool.git
cd ProxyPool

Then choose either the Docker setup or the conventional setup described below.

Requirements

The proxy pool can be run in two ways: with Docker (recommended) or conventionally. The requirements are as follows:

Docker

To use Docker, the following need to be installed:

  • Docker
  • Docker-Compose

Installation instructions can easily be found online.

Official Docker Hub image: germey/proxypool

Conventional Setup

The conventional setup requires a Python environment and a Redis environment:

  • Python>=3.6
  • Redis

Running with Docker

If Docker and Docker-Compose are installed, a single command is all it takes:

docker-compose up

The output looks something like this:

redis        | 1:M 19 Feb 2020 17:09:43.940 * DB loaded from disk: 0.000 seconds
redis        | 1:M 19 Feb 2020 17:09:43.940 * Ready to accept connections
proxypool    | 2020-02-19 17:09:44,200 CRIT Supervisor is running as root.  Privileges were not dropped because no user is specified in the config file.  If you intend to run as root, you can set user=root in the config file to avoid this message.
proxypool    | 2020-02-19 17:09:44,203 INFO supervisord started with pid 1
proxypool    | 2020-02-19 17:09:45,209 INFO spawned: 'getter' with pid 10
proxypool    | 2020-02-19 17:09:45,212 INFO spawned: 'server' with pid 11
proxypool    | 2020-02-19 17:09:45,216 INFO spawned: 'tester' with pid 12
proxypool    | 2020-02-19 17:09:46,596 INFO success: getter entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
proxypool    | 2020-02-19 17:09:46,596 INFO success: server entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
proxypool    | 2020-02-19 17:09:46,596 INFO success: tester entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

As you can see, Redis, the Getter, the Server, and the Tester have all started successfully.

You can now visit http://localhost:5555/random to get a random usable proxy.
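For example, you can fetch one from the command line; the API returns the proxy as a plain host:port string:

curl http://localhost:5555/random
# e.g. 116.196.115.209:8080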

You can also build the image yourself; just run:

docker-compose -f build.yaml up

If downloads are very slow, you can edit the Dockerfile yourself and change:

- RUN pip install -r requirements.txt
+ RUN pip install -r requirements.txt -i https://pypi.douban.com/simple

Running Conventionally

If you prefer not to use Docker, you can also run the proxy pool after setting up the Python and Redis environments. The steps are as follows.

Installing and Configuring Redis

A locally installed Redis, a Redis started via Docker, or a remote Redis all work, as long as the pool can connect to it.

First, set a few environment variables; the proxy pool reads these values from the environment.

There are two ways to configure Redis through environment variables: either set host, port, and password individually, or set a single connection string. They work as follows.

To set host, port, and password (use an empty string if there is no password), for example:

export PROXYPOOL_REDIS_HOST='localhost'
export PROXYPOOL_REDIS_PORT=6379
export PROXYPOOL_REDIS_PASSWORD=''
export PROXYPOOL_REDIS_DB=0

Or set only the connection string:

export PROXYPOOL_REDIS_CONNECTION_STRING='redis://localhost'

The connection string must follow the format redis://[:password@]host[:port][/database]; the bracketed parts can be omitted, port defaults to 6379, database defaults to 0, and the password defaults to empty.

Either of the two approaches above works.
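For example, a pool backed by a password-protected remote Redis on database 1 could be configured with a single connection string (the host, password, and database index below are placeholders):

export PROXYPOOL_REDIS_CONNECTION_STRING='redis://:mypassword@redis.example.com:6379/1'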

Installing Dependencies

It is strongly recommended to create a virtual environment with Conda or virtualenv; the Python version must be at least 3.6.

Then install the dependencies with pip:

pip3 install -r requirements.txt

Running the Proxy Pool

There are two ways to run the proxy pool: run the Tester, Getter, and Server all together, or run them separately as needed.

In general you can simply run everything:

python3 run.py

This starts the Tester, Getter, and Server; you can then visit http://localhost:5555/random to get a random usable proxy.

Alternatively, once you understand the proxy pool's architecture, you can run the components individually as needed:

python3 run.py --processor getter
python3 run.py --processor tester
python3 run.py --processor server

The processor argument selects whether to run the Tester, the Getter, or the Server.

Usage

Once it is running, you can get a random usable proxy from http://localhost:5555/random.

This is easy to integrate into your own code; the following example fetches a proxy and uses it to crawl a page:

import requests

proxypool_url = 'http://127.0.0.1:5555/random'
target_url = 'http://httpbin.org/get'

def get_random_proxy():
    """
    get random proxy from proxypool
    :return: proxy
    """
    return requests.get(proxypool_url).text.strip()

def crawl(url, proxy):
    """
    use proxy to crawl page
    :param url: page url
    :param proxy: proxy, such as 8.8.8.8:8888
    :return: html
    """
    proxies = {'http': 'http://' + proxy}
    return requests.get(url, proxies=proxies).text


def main():
    """
    main method, entry point
    :return: none
    """
    proxy = get_random_proxy()
    print('get random proxy', proxy)
    html = crawl(target_url, proxy)
    print(html)

if __name__ == '__main__':
    main()

The output looks like this:

get random proxy 116.196.115.209:8080
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-5e4d7140-662d9053c0a2e513c7278364"
  },
  "origin": "116.196.115.209",
  "url": "https://httpbin.org/get"
}

As you can see, a proxy was successfully obtained and used to request httpbin.org, which confirms that the proxy works.
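Because free proxies fail frequently, a real crawler usually wraps the pool in a small retry loop and sets a timeout. The sketch below is one way to do that, not part of the project itself; the retry count and timeout are arbitrary choices:

import requests

PROXYPOOL_URL = 'http://127.0.0.1:5555/random'

def fetch_with_proxy(url, max_retries=3, timeout=10):
    """Try up to max_retries random proxies from the pool before giving up."""
    for _ in range(max_retries):
        # ask the pool for a fresh proxy on every attempt
        proxy = requests.get(PROXYPOOL_URL).text.strip()
        proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
        try:
            return requests.get(url, proxies=proxies, timeout=timeout)
        except requests.RequestException:
            continue  # this proxy failed, try another one
    raise RuntimeError('no working proxy found after %d attempts' % max_retries)

print(fetch_with_proxy('https://httpbin.org/get').text)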

Configuration Options

The proxy pool can be configured through environment variables.

Switches

  • ENABLE_TESTER: whether the Tester is allowed to start, default true
  • ENABLE_GETTER: whether the Getter is allowed to start, default true
  • ENABLE_SERVER: whether the Server is allowed to start, default true

Environment

  • APP_ENV: runtime environment; one of dev, test, or prod (development, testing, production), default dev
  • APP_DEBUG: debug mode, true or false, default true
  • APP_PROD_METHOD: how the application is served in production, default gevent; alternatives are tornado and meinheld (which require installing the tornado or meinheld module respectively)

Redis 连接

  • PROXYPOOL_REDIS_HOST / REDIS_HOST:Redis 的 Host,其中 PROXYPOOL_REDIS_HOST 会覆盖 REDIS_HOST 的值。
  • PROXYPOOL_REDIS_PORT / REDIS_PORT:Redis 的端口,其中 PROXYPOOL_REDIS_PORT 会覆盖 REDIS_PORT 的值。
  • PROXYPOOL_REDIS_PASSWORD / REDIS_PASSWORD:Redis 的密码,其中 PROXYPOOL_REDIS_PASSWORD 会覆盖 REDIS_PASSWORD 的值。
  • PROXYPOOL_REDIS_DB / REDIS_DB:Redis 的数据库索引,如 0、1,其中 PROXYPOOL_REDIS_DB 会覆盖 REDIS_DB 的值。
  • PROXYPOOL_REDIS_CONNECTION_STRING / REDIS_CONNECTION_STRING:Redis 连接字符串,其中 PROXYPOOL_REDIS_CONNECTION_STRING 会覆盖 REDIS_CONNECTION_STRING 的值。
  • PROXYPOOL_REDIS_KEY / REDIS_KEY:Redis 储存代理使用字典的名称,其中 PROXYPOOL_REDIS_KEY 会覆盖 REDIS_KEY 的值。

Processors

  • CYCLE_TESTER: Tester run cycle, i.e. how often tests are run, default 20 seconds
  • CYCLE_GETTER: Getter run cycle, i.e. how often proxies are fetched, default 100 seconds
  • TEST_URL: URL used for testing, default Baidu
  • TEST_TIMEOUT: test timeout, default 10 seconds
  • TEST_BATCH: number of proxies tested per batch, default 20
  • TEST_VALID_STATUS: HTTP status codes treated as valid during testing
  • API_HOST: host the proxy Server listens on, default 0.0.0.0
  • API_PORT: port the proxy Server listens on, default 5555
  • API_THREADED: whether the proxy Server runs multi-threaded, default true

Logging

  • LOG_DIR: relative path of the log directory
  • LOG_RUNTIME_FILE: runtime log file name
  • LOG_ERROR_FILE: error log file name
  • LOG_ROTATION: log rotation period or size, default 500MB, see loguru - rotation
  • LOG_RETENTION: log retention period, default 7 days, see loguru - retention
  • ENABLE_LOG_FILE: whether to write log files, default true; if set to false, ENABLE_LOG_RUNTIME_FILE and ENABLE_LOG_ERROR_FILE have no effect
  • ENABLE_LOG_RUNTIME_FILE: whether to write the runtime log file, default true
  • ENABLE_LOG_ERROR_FILE: whether to write the error log file, default true

All of the options above can be configured with environment variables: just set the corresponding variables before running. For example, to change the test URL and the Redis key:

export TEST_URL=http://weibo.cn
export REDIS_KEY=proxies:weibo

This builds a proxy pool dedicated to Weibo: every proxy kept as valid can be used to crawl Weibo.

If you start the proxy pool with Docker-Compose, specify the environment variables in the docker-compose.yml file instead, for example:

version: "3"
services:
  redis:
    image: redis:alpine
    container_name: redis
    command: redis-server
    ports:
      - "6379:6379"
    restart: always
  proxypool:
    build: .
    image: "germey/proxypool"
    container_name: proxypool
    ports:
      - "5555:5555"
    restart: always
    environment:
      REDIS_HOST: redis
      TEST_URL: http://weibo.cn
      REDIS_KEY: proxies:weibo

Extending the Proxy Crawlers

The proxy crawlers all live in the proxypool/crawlers folder; currently only a limited number of proxy sources are integrated.

To add a crawler, simply create a new Python file under the crawlers folder and declare a class in it.

The expected structure is as follows:

from pyquery import PyQuery as pq
from proxypool.schemas.proxy import Proxy
from proxypool.crawlers.base import BaseCrawler

BASE_URL = 'http://www.664ip.cn/{page}.html'
MAX_PAGE = 5

class Daili66Crawler(BaseCrawler):
    """
    daili66 crawler, http://www.66ip.cn/1.html
    """
    urls = [BASE_URL.format(page=page) for page in range(1, MAX_PAGE + 1)]

    def parse(self, html):
        """
        parse html file to get proxies
        :return:
        """
        doc = pq(html)
        trs = doc('.containerbox table tr:gt(0)').items()
        for tr in trs:
            host = tr.find('td:nth-child(1)').text()
            port = int(tr.find('td:nth-child(2)').text())
            yield Proxy(host=host, port=port)

Here you only need to define a crawler class that inherits from BaseCrawler, then define the urls variable and the parse method.

  • The urls variable is the list of proxy-site URLs to crawl; it can be generated programmatically or written out as fixed values.
  • The parse method receives a single argument, html, the HTML of a proxy page. Inside it, parse the HTML, extract the host and port, build a Proxy object, and yield it.

Fetching the pages themselves does not need to be implemented, since BaseCrawler already provides a default implementation; to change how pages are fetched, override the crawl method, as sketched below.
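The exact contract of crawl is defined in BaseCrawler; the sketch below assumes it iterates over urls, fetches each page, and yields the Proxy objects produced by parse, which matches how the default behaves conceptually. The crawler class, source URL, and headers here are purely illustrative:

import requests
from pyquery import PyQuery as pq

from proxypool.schemas.proxy import Proxy
from proxypool.crawlers.base import BaseCrawler


class CustomHeaderCrawler(BaseCrawler):
    """Hypothetical crawler that fetches pages with its own headers."""
    urls = ['http://www.example.com/free-proxies.html']  # placeholder source

    def crawl(self):
        # fetch each listed URL ourselves instead of using the default fetcher
        for url in self.urls:
            response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
            if response.status_code == 200:
                yield from self.parse(response.text)

    def parse(self, html):
        doc = pq(html)
        for tr in doc('table tr:gt(0)').items():
            host = tr.find('td:nth-child(1)').text()
            port = int(tr.find('td:nth-child(2)').text())
            yield Proxy(host=host, port=port)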

Pull Requests contributing new crawlers are very welcome, so that the proxy sources become richer and more robust.

Deployment

This project provides Kubernetes deployment scripts; to deploy to Kubernetes, please refer to kubernetes.

To Do

  • Front-end management page
  • Usage statistics and analysis

If you are interested in developing this together, feel free to leave a comment in an Issue. Many thanks!

LICENSE

MIT


proxypool's Issues

Deploying the project to Linux throws errors right away

Proxy pool started running

  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: This is a development server. Do not use it in a production deployment.
    Use a production WSGI server instead.
  • Debug mode: off
  • Running on http://0.0.0.0:5555/ (Press CTRL+C to quit)
    Proxy pool started running
  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: This is a development server. Do not use it in a production deployment.
    Use a production WSGI server instead.
  • Debug mode: off
    Process Process-3:
    Traceback (most recent call last):
    File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
    File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
    File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 35, in schedule_api
    app.run(API_HOST, API_PORT)
    File "/usr/local/python3/lib/python3.6/site-packages/flask/app.py", line 990, in run
    run_simple(host, port, self, **options)
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 1009, in run_simple
    inner()
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 962, in inner
    fd=fd,
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 805, in make_server
    host, port, app, request_handler, passthrough_errors, ssl_context, fd=fd
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 698, in init
    HTTPServer.init(self, server_address, handler)
    File "/usr/local/python3/lib/python3.6/socketserver.py", line 453, in init
    self.server_bind()
    File "/usr/local/python3/lib/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
    File "/usr/local/python3/lib/python3.6/socketserver.py", line 467, in server_bind
    self.socket.bind(self.server_address)
    OSError: [Errno 98] Address already in use
    Proxy pool started running
  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: This is a development server. Do not use it in a production deployment.
    Use a production WSGI server instead.
  • Debug mode: off
    Process Process-3:
    Traceback (most recent call last):
    File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
    File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
    File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 35, in schedule_api
    app.run(API_HOST, API_PORT)
    File "/usr/local/python3/lib/python3.6/site-packages/flask/app.py", line 990, in run
    run_simple(host, port, self, **options)
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 1009, in run_simple
    inner()
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 962, in inner
    fd=fd,
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 805, in make_server
    host, port, app, request_handler, passthrough_errors, ssl_context, fd=fd
    File "/usr/local/python3/lib/python3.6/site-packages/werkzeug/serving.py", line 698, in init
    HTTPServer.init(self, server_address, handler)
    File "/usr/local/python3/lib/python3.6/socketserver.py", line 453, in init
    self.server_bind()
    File "/usr/local/python3/lib/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
    File "/usr/local/python3/lib/python3.6/socketserver.py", line 467, in server_bind
    self.socket.bind(self.server_address)
    OSError: [Errno 98] Address already in use
    Starting to crawl proxies
    Getter started running
    Process Process-2:
    Traceback (most recent call last):
    File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 526, in connect
    sock = self._connect()
    File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 583, in _connect
    raise err
    File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 571, in _connect
    sock.connect(socket_address)
    TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 28, in schedule_getter
getter.run()
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 23, in run
if not self.is_over_threshold():
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 16, in is_over_threshold
if self.redis.count() >= POOL_UPPER_THRESHOLD:
File "/usr/local2/app/ProxyPool-master/proxypool/db.py", line 84, in count
return self.db.zcard(REDIS_KEY)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 2395, in zcard
return self.execute_command('ZCARD', name)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 836, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 1059, in get_connection
connection.connect()
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 531, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 110 connecting to 120.79.34.216:6379. Connection timed out.
Starting to crawl proxies
Getter started running
Process Process-2:
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 526, in connect
sock = self._connect()
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 583, in _connect
raise err
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 571, in _connect
sock.connect(socket_address)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local2/app/ProxyPool-master/proxypool/scheduler.py", line 28, in schedule_getter
getter.run()
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 23, in run
if not self.is_over_threshold():
File "/usr/local2/app/ProxyPool-master/proxypool/getter.py", line 16, in is_over_threshold
if self.redis.count() >= POOL_UPPER_THRESHOLD:
File "/usr/local2/app/ProxyPool-master/proxypool/db.py", line 84, in count
return self.db.zcard(REDIS_KEY)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 2395, in zcard
return self.execute_command('ZCARD', name)
File "/usr/local/python3/lib/python3.6/site-packages/redis/client.py", line 836, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 1059, in get_connection
connection.connect()
File "/usr/local/python3/lib/python3.6/site-packages/redis/connection.py", line 531, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 110 connecting to 120.79.34.216:6379. Connection timed out.

It runs fine on Windows but errors out on both macOS and Linux. I've searched online for ages and I'm still completely lost; any help would be appreciated.

Proxy pool started running

How to debug this project in PyCharm

I'm using a remote environment and want to debug this project in PyCharm, but every time I try to debug run.py it says the file cannot be found. How can I debug this project with PyCharm?

AttributeError: 'OutStream' object has no attribute 'buffer'

# Start the proxy pool

from proxypool.scheduler import Scheduler
import sys
import io

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

def main():
    try:
        s = Scheduler()
        s.run()
    except:
        main()

if __name__ == '__main__':
    main()

AttributeError: 'OutStream' object has no attribute 'buffer'

About the redis-py version issue

zadd and zincrby changed between redis-py 3.x and 2.x.
In 3.x, zadd takes a dict (element name -> score).
In zincrby, the amount and value parameters are swapped.
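For reference, the redis-py 3.x-style calls look roughly like this (a sketch based on the description above, using a placeholder key and proxy):

import redis

db = redis.StrictRedis(host='localhost', port=6379, db=0)

# redis-py 2.x: db.zadd('proxies', 10, '1.2.3.4:8080')
# redis-py 3.x: member -> score pairs are passed as one dict
db.zadd('proxies', {'1.2.3.4:8080': 10})

# redis-py 2.x: db.zincrby('proxies', '1.2.3.4:8080', -1)
# redis-py 3.x: the amount comes before the member
db.zincrby('proxies', -1, '1.2.3.4:8080')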

About the db.py issue in the traceback

Previous issues all say to change the zadd calls in db.py's add and max functions to the dict form (zadd(REDIS_KEY, {proxy: score})), and to swap the zincrby arguments in the decrease function (zincrby(REDIS_KEY, -1, proxy)), but after making those changes I got errors instead.
I checked the redis-py documentation and, unless there has been another revision, what the author originally wrote is exactly what the docs require.
The documentation says the following:
[screenshots of the redis-py documentation]

So those two places don't actually need to be changed.
One place does look wrong, though:
[screenshot]
The first highlighted value should be MIN_SCORE, right?

what/

D:\Pycharm工作资料\代码流\venv\Scripts\python.exe C:/Users/ThinkPad/Downloads/ProxyPool-master/run.py
Proxy pool started running

  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: Do not use the development server in a production environment.
    Use a production WSGI server instead.
  • Debug mode: off
  • Running on http://0.0.0.0:5555/ (Press CTRL+C to quit)
    Process Process-2:
    Starting to crawl proxies
    Getter started running
    Crawling http://www.66ip.cn/1.html
    Fetching http://www.66ip.cn/1.html
    Fetched http://www.66ip.cn/1.html 200
    Got proxy 177.185.148.46:58623
    Got proxy 131.196.143.11:7
    Got proxy 131.196.143.117:33729
    Got proxy 43.243.141.126:53281
    Got proxy 111.181.35.219:9999
    Crawling http://www.66ip.cn/2.html
    Fetching http://www.66ip.cn/2.html
    Fetched http://www.66ip.cn/2.html 200
    Got proxy 170.0.112.226:50359
    Got proxy 54.39.144.247:8080
    Got proxy 171.41.82.36:9999
    Got proxy 37.32.126.0:8080
    Got proxy 213.33.224.82:8080
    Got proxy 144.123.71.133:9999
    Got proxy 111.177.166.59:9999
    Got proxy 117.196.237.40:59250
    Got proxy 121.61.3.110:9999
    Got proxy 212.200.126.14:8080
    Got proxy 47.107.245.9:4
    Got proxy 47.107.245.94:3128
    Crawling http://www.66ip.cn/3.html
    Fetching http://www.66ip.cn/3.html
    Fetched http://www.66ip.cn/3.html 200
    Got proxy 111.177.162.175:9999
    Got proxy 110.52.235.60:9999
    Got proxy 37.224.19.1:0
    Got proxy 175.100.185.151:53281
    Got proxy 37.224.19.10:6
    Got proxy 179.127.249.5:3
    Got proxy 37.224.19.106:58553
    Got proxy 179.127.249.53:46257
    Got proxy 111.177.183.4:5
    Got proxy 1.20.101.221:55707
    Got proxy 111.177.183.45:9999
    Got proxy 91.219.171.8:4
    Crawling http://www.66ip.cn/4.html
    Fetching http://www.66ip.cn/4.html
    Fetched http://www.66ip.cn/4.html 200
    Got proxy 91.219.171.84:43726
    Got proxy 212.26.247.178:38418
    Got proxy 203.42.227.1:1
    Got proxy 203.42.227.11:3
    Got proxy 203.42.227.113:8080
    Got proxy 110.52.235.126:9999
    Got proxy 170.239.224.58:8080
    Got proxy 190.119.199.18:57333
    Got proxy 5.0.0.815:0
    Got proxy 190.152.182.150:53281
    Got proxy 119.40.98.84:46119
    Got proxy 111.177.170.220:9999
    Traceback (most recent call last):
    File "D:\python\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
    File "D:\python\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
    File "C:\Users\ThinkPad\Downloads\ProxyPool-master\proxypool\scheduler.py", line 28, in schedule_getter
    getter.run()
    File "C:\Users\ThinkPad\Downloads\ProxyPool-master\proxypool\getter.py", line 30, in run
    self.redis.add(proxy)
    File "C:\Users\ThinkPad\Downloads\ProxyPool-master\proxypool\db.py", line 30, in add
    return self.db.zadd(REDIS_KEY, score, proxy)
    File "D:\Pycharm工作资料\代码流\venv\lib\site-packages\redis\client.py", line 2320, in zadd
    for pair in iteritems(mapping):
    File "D:\Pycharm工作资料\代码流\venv\lib\site-packages\redis_compat.py", line 122, in iteritems
    return iter(x.items())
    AttributeError: 'int' object has no attribute 'items'

I took the author's code and ran it as-is; it worked at first, but errored out partway through crawling. This is hard, it feels really complicated.

Proxy pool started running

  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: Do not use the development server in a production environment.
    Use a production WSGI server instead.
  • Debug mode: off
  • Running on http://0.0.0.0:5555/ (Press CTRL+C to quit)
    Starting to crawl proxies
    Getter started running
    Crawling http://www.66ip.cn/1.html
    Fetching http://www.66ip.cn/1.html
    Fetched http://www.66ip.cn/1.html 521
    Crawling http://www.66ip.cn/2.html
    Fetching http://www.66ip.cn/2.html
    Fetched http://www.66ip.cn/2.html 521
    Crawling http://www.66ip.cn/3.html
    Fetching http://www.66ip.cn/3.html
    Fetched http://www.66ip.cn/3.html 521
    Crawling http://www.66ip.cn/4.html
    Fetching http://www.66ip.cn/4.html
    Fetched http://www.66ip.cn/4.html 521
    Fetching http://www.ip3366.net/?stype=1&page=1
    Fetched http://www.ip3366.net/?stype=1&page=1 200
    Got proxy 222.135.25.243:8060
    Got proxy 180.175.8.5:8060
    Got proxy 119.180.131.25:8060
    Got proxy 180.175.160.130:8060
    Got proxy 119.180.177.138:8060
    Got proxy 119.180.1.42:8060
    Got proxy 171.112.165.22:9999
    Got proxy 222.182.121.71:8118
    Got proxy 118.81.68.2:80
    Got proxy 117.166.3.51:8118
    Fetching http://www.ip3366.net/?stype=1&page=2
    Fetched http://www.ip3366.net/?stype=1&page=2 200
    Got proxy 171.83.164.51:9999
    Got proxy 47.101.189.13:80
    Got proxy 171.112.164.149:9999
    Got proxy 171.112.164.109:9999
    Got proxy 119.97.237.74:80
    Got proxy 197.234.42.73:8083
    Got proxy 103.120.152.182:59068
    Got proxy 117.168.86.102:8118
    Got proxy 115.215.212.116:8118
    Got proxy 103.244.91.61:8080
    Fetching http://www.ip3366.net/?stype=1&page=3
    Fetched http://www.ip3366.net/?stype=1&page=3 200
    Got proxy 117.80.137.238:9999
    Got proxy 103.233.145.133:8080
    Got proxy 117.80.17.81:8118
    Got proxy 171.83.165.10:9999
    Got proxy 43.248.123.237:8080
    Got proxy 113.227.182.15:8118
    Got proxy 138.97.219.51:65301
    Got proxy 117.41.142.159:8118
    Got proxy 197.234.42.209:8083
    Got proxy 197.234.44.125:8083
    Process Process-2:
    Traceback (most recent call last):
    File "D:\Python\lib\multiprocessing\process.py", line 297, in _bootstrap
    self.run()
    File "D:\Python\lib\multiprocessing\process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
    File "D:\spider-test\ProxyPool-master\proxypool\scheduler.py", line 28, in schedule_getter
    getter.run()
    File "D:\spider-test\ProxyPool-master\proxypool\getter.py", line 30, in run
    self.redis.add(proxy)
    File "D:\spider-test\ProxyPool-master\proxypool\db.py", line 30, in add
    return self.db.zadd(REDIS_KEY, score, proxy)
    File "D:\Python\lib\site-packages\redis\client.py", line 2320, in zadd
    for pair in iteritems(mapping):
    File "D:\Python\lib\site-packages\redis_compat.py", line 109, in iteritems
    return iter(x.items())
    AttributeError: 'int' object has no attribute 'items'

How do I fix AttributeError: type object 'URL' has no attribute 'build'?

File "run.py", line 1, in
from proxypool.scheduler import Scheduler
File "C:\ProxyPool-master\proxypool\scheduler.py", line 4, in
from proxypool.getter import Getter
File "C:\ProxyPool-master\proxypool\getter.py", line 1, in
from proxypool.tester import Tester
File "C:\ProxyPool-master\proxypool\tester.py", line 2, in
import aiohttp
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp_init_.py", line 6, in
from .client import (
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp\client.py", line 32, in
from . import hdrs, http, payload
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp\http.py", line 7, in
from .http_parser import (
File "C:\Users\MZY\AppData\Roaming\Python\Python36\site-packages\aiohttp\http_parser.py", line 755, in
from ._http_parser import (HttpRequestParser, # type: ignore # noqa

File "aiohttp_http_parser.pyx", line 44, in init aiohttp._http_parser
AttributeError: type object 'URL' has no attribute 'build'

How can this problem be solved?

Ip processing running

  • Serving Flask app "proxypool.api" (lazy loading)
  • Environment: production
    WARNING: Do not use the development server in a production environment.
    Use a production WSGI server instead.
  • Debug mode: off
  • Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
    Refreshing ip
    PoolAdder is working
    Waiting for adding
    Callback crawl_ip181
    Error occurred during loading data. Trying to use cache server http://d2g6u4gh6d9rq0.cloudfront.net/browsers/fake_useragent_0.1.10.json
    Traceback (most recent call last):
    File "C:\Python\Python36\lib\urllib\request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
    File "C:\Python\Python36\lib\http\client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
    File "C:\Python\Python36\lib\http\client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
    File "C:\Python\Python36\lib\http\client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
    File "C:\Python\Python36\lib\http\client.py", line 1026, in _send_output
    self.send(msg)
    File "C:\Python\Python36\lib\http\client.py", line 964, in send
    self.connect()
    File "C:\Python\Python36\lib\http\client.py", line 1392, in connect
    super().connect()
    File "C:\Python\Python36\lib\http\client.py", line 936, in connect
    (self.host,self.port), self.timeout, self.source_address)
    File "C:\Python\Python36\lib\socket.py", line 722, in create_connection
    raise err
    File "C:\Python\Python36\lib\socket.py", line 713, in create_connection
    sock.connect(sa)
    socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 67, in get
context=context,
File "C:\Python\Python36\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Python\Python36\lib\urllib\request.py", line 526, in open
response = self._open(req, data)
File "C:\Python\Python36\lib\urllib\request.py", line 544, in _open
'_open', req)
File "C:\Python\Python36\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Python\Python36\lib\urllib\request.py", line 1361, in https_open
context=self._context, check_hostname=self._check_hostname)
File "C:\Python\Python36\lib\urllib\request.py", line 1320, in do_open
raise URLError(err)
urllib.error.URLError:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 154, in load
for item in get_browsers(verify_ssl=verify_ssl):
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 97, in get_browsers
html = get(settings.BROWSERS_STATS_PAGE, verify_ssl=verify_ssl)
File "C:\Python\Python36\lib\site-packages\fake_useragent\utils.py", line 84, in get
raise FakeUserAgentError('Maximum amount of retries reached')
fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached
Process Process-2:
Traceback (most recent call last):
File "C:\Python\Python36\lib\multiprocessing\process.py", line 249, in _bootstrap
self.run()
File "C:\Python\Python36\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "C:\迅雷下载\ProxyPool-master\proxypool\schedule.py", line 130, in check_pool
adder.add_to_queue()
File "C:\迅雷下载\ProxyPool-master\proxypool\schedule.py", line 87, in add_to_queue
raw_proxies = self._crawler.get_raw_proxies(callback)
File "C:\迅雷下载\ProxyPool-master\proxypool\getter.py", line 28, in get_raw_proxies
for proxy in eval("self.{}()".format(callback)):
File "C:\迅雷下载\ProxyPool-master\proxypool\getter.py", line 35, in crawl_ip181
html = get_page(start_url)
File "C:\迅雷下载\ProxyPool-master\proxypool\utils.py", line 14, in get_page
'User-Agent': ua.random,
UnboundLocalError: local variable 'ua' referenced before assignment
Refreshing ip
Waiting for adding
Refreshing ip
Waiting for adding
Refreshing ip

The zadd method for Redis sorted sets has changed

def add(self, proxy, score=INITIAL_SCORE):
    """
    Add a proxy and set its score to the maximum.
    :param proxy: proxy
    :param score: score
    :return: result of the add
    """
    if not re.match('\d+\.\d+\.\d+\.\d+\:\d+', proxy):
        print('Proxy does not match the expected format', proxy, 'discarded')
        return
    if not self.db.zscore(REDIS_KEY, proxy):
        dic = {}
        dic[proxy] = score
        return self.db.zadd(REDIS_KEY, dic)

When I visit localhost:5555/random, the proxy doesn't change; refreshing repeatedly only ever returns the same initial proxy address. What could the problem be?

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Environments (please complete the following information):

  • OS: [e.g. macOS 10.15.2]
  • Python [e.g. Python 3.6]
  • Browser [e.g. Chrome 67 ]

Additional context
Add any other context about the problem here.

The program does not run successfully

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.
[screenshot attached in the original issue]

Environments (please complete the following information):

  • OS: [win10]
  • Python [Python 3.7]

Additional context
Add any other context about the problem here.

Error when writing

Runtime error

/proxypool/db.py", line 30, in add

return iter(x.items())
AttributeError: 'int' object has no attribute 'items'

Configuration related to the setting.py file in the proxy pool project

Not a bug, just two suggestions:
1. The project's setting.py declares a LOG_DIR parameter for the log path, but it is never used.
A ...\project\ProxyPool\logs folder should be created, and the configuration changed from:
logger.add(env.str('LOG_RUNTIME_FILE', 'runtime.log'), level='DEBUG', rotation='1 week', retention='20 days')
logger.add(env.str('LOG_ERROR_FILE', 'error.log'), level='ERROR', rotation='1 week')

to:
logger.add(env.str('LOG_RUNTIME_FILE', f'{LOG_DIR}/runtime.log'), level='DEBUG', rotation='1 week', retention='20 days')
logger.add(env.str('LOG_ERROR_FILE', f'{LOG_DIR}/error.log'), level='ERROR', rotation='1 week')

2. If the ENABLE_TESTER, ENABLE_GETTER, and ENABLE_SERVER switches in setting.py are all False, running run.py raises an error (and the finally clause of the try block raises again); scheduler.py could be adjusted to handle this. (This one is nitpicky and can be ignored.)
[screenshot: switch parameter configuration]

zadd change in the newer redis version

zadd changed in the new version; the call needs to become zadd(REDIS_KEY, {proxy: score}).
There are two places to change: RedisClient.add() and RedisClient.max().

Error raised, asking for help: attributes() got an unexpected keyword argument 'frozen'

Traceback (most recent call last):
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\run.py", line 1, in
from proxypool.scheduler import Scheduler
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\proxypool\scheduler.py", line 4, in
from proxypool.getter import Getter
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\proxypool\getter.py", line 1, in
from proxypool.tester import Tester
File "D:\Anaconda3\envs\py3\project\ProxyPool-master\proxypool\tester.py", line 2, in
import aiohttp
File "D:\Anaconda3\lib\site-packages\aiohttp_init_.py", line 6, in
from .client import * # noqa
File "D:\Anaconda3\lib\site-packages\aiohttp\client.py", line 16, in
from . import client_exceptions, client_reqrep
File "D:\Anaconda3\lib\site-packages\aiohttp\client_reqrep.py", line 18, in
from . import hdrs, helpers, http, multipart, payload
File "D:\Anaconda3\lib\site-packages\aiohttp\helpers.py", line 161, in
@attr.s(frozen=True, slots=True)
TypeError: attributes() got an unexpected keyword argument 'frozen'

Consider adding a crawler function that reads proxies directly from a file

I wrote a rough one; put it inside the Crawler class, and use one "address:port" entry per line.
def crawl_file(self):
    filename = 'proxy.txt'  # the txt file sits next to the script, so no full path is needed
    with open(filename, 'r') as file_to_read:
        while True:
            lines = file_to_read.readline()  # read one full line
            if not lines:
                break
            yield lines.strip()  # strip the newline so each proxy stays in host:port form

The proxy-fetching process seems to have died; what is going on?

While running, the proxy-crawling process seems to have died. I don't know what the problem is.
The tester process and the API process keep running, but the proxy-crawling process does nothing and the number of proxies in the Redis queue keeps dropping. Does anyone know what the problem is?

HTTPS error on a MacBook

➜  ProxyPool git:(master) pip3 install -r requirements.txt
pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
Collecting aiohttp>=1.3.3 (from -r requirements.txt (line 1))
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/aiohttp/
  Could not fetch URL https://pypi.org/simple/aiohttp/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/aiohttp/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) - skipping
  Could not find a version that satisfies the requirement aiohttp>=1.3.3 (from -r requirements.txt (line 1)) (from versions: )
No matching distribution found for aiohttp>=1.3.3 (from -r requirements.txt (line 1))
pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
Could not fetch URL https://pypi.org/simple/pip/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/pip/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) - skipping
➜  ProxyPool git:(master)

Optimized the IP-crawling code: replaced all the regexes with pyquery for extraction

import json
import re
from .utils import get_page
from pyquery import PyQuery as pq


class ProxyMetaclass(type):
    def __new__(cls, name, bases, attrs):
        count = 0
        attrs['CrawlFunc'] = []
        for k, v in attrs.items():
            if 'crawl_' in k:
                attrs['CrawlFunc'].append(k)
                count += 1
        attrs['CrawlFuncCount'] = count
        return type.__new__(cls, name, bases, attrs)


class Crawler(object, metaclass=ProxyMetaclass):
    def get_proxies(self, callback):
        proxies = []
        for proxy in eval(f"self.{callback}()"):
            print('Got proxy', proxy)
            proxies.append(proxy)
        return proxies

    def crawl_daili66(self, page_count=4):
        """
        Crawl daili66 (66ip.cn)
        :param page_count: number of pages
        :return: proxies
        """
        start_url = 'http://www.66ip.cn/{}.html'
        urls = [start_url.format(page) for page in range(1, page_count + 1)]
        for url in urls:
            print('Crawling', url)
            html = get_page(url)
            if html:
                doc = pq(html)
                trs = doc('.containerbox table tr:gt(0)').items()  # index > 0: the first tr row has no ip or port
                for tr in trs:
                    ip = tr.find('td:nth-child(1)').text()
                    port = tr.find('td:nth-child(2)').text()
                    yield ':'.join([ip.strip(), port.strip()])

    def crawl_ip3366(self):
        for i in range(1, 4):
            start_url = 'http://www.ip3366.net/?stype=1&page={}'.format(i)
            html = get_page(start_url)
            if html:
                doc = pq(html)
                trs = doc('#container #list table tbody tr').items()
                for tr in trs:
                    ip = tr.find('td:nth-child(1)').text()
                    port = tr.find('td:nth-child(2)').text()
                    yield ':'.join([ip.strip(), port.strip()])

    def crawl_kuaidaili(self):
        for i in range(1, 4):
            start_url = 'http://www.kuaidaili.com/free/inha/{}/'.format(i)
            html = get_page(start_url)
            if html:
                doc = pq(html)
                trs = doc('#content .con-body #list table tbody tr').items()
                for tr in trs:
                    ip = tr.find('td:nth-child(1)').text()
                    port = tr.find('td:nth-child(2)').text()
                    yield ':'.join([ip.strip(), port.strip()])

    def crawl_iphai(self):
        start_url = 'http://www.iphai.com/'
        html = get_page(start_url)
        if html:
            doc = pq(html)
            trs = doc('.container .table tr:gt(0)').items()
            for tr in trs:
                ip = tr.find('td:nth-child(1)').text()
                port = tr.find('td:nth-child(2)').text()
                yield ':'.join([ip.strip(), port.strip()])

    def crawl_xicidaili(self):
        for i in range(1, 3):
            start_url = 'http://www.xicidaili.com/nn/{}'.format(i)
            html = get_page(start_url)
            if html:
                doc = pq(html)
                trs = doc('#wrapper #body table tr:gt(0)').items()
                for tr in trs:
                    ip = tr.find('td:nth-child(2)').text()
                    port = tr.find('td:nth-child(3)').text()
                    yield ':'.join([ip.strip(), port.strip()])

    def crawl_data5u(self):
        start_url = 'http://www.data5u.com/'
        html = get_page(start_url)
        if html:
            doc = pq(html)
            uls = doc('.wlist>ul ul:gt(0)').items()
            for ul in uls:
                ip = ul.find('span:nth-child(1)').text()
                port = ul.find('span:nth-child(2)').text()
                yield ':'.join([ip.strip(), port.strip()])

Just swap this in yourself; I tested it and it works. As of 2019-10-10.

Is there a problem with how the Redis connection pool is managed?

https://stackoverflow.com/questions/31663288/how-do-i-properly-use-connection-pools-in-redis
I'm thinking that creating a new connection to Redis on every request is wasteful; it could instead be written as:

redis_pool = None

class RedisClient(object):
    def __init__(self, host=HOST, port=PORT):
        global redis_pool
        if not redis_pool:
            if PASSWORD:
                redis_pool = redis.Redis(host=host, port=port, password=PASSWORD)
            else:
                redis_pool = redis.Redis(host=host, port=port)
            self._db = redis_pool
        else:
            self._db = redis_pool

The console_script in setup.py

I found that with run:cli in console_scripts, the installed command cannot be used at all. Why is it written that way? After renaming run to pool_run and changing the entry to pool_run:main, it works fine.

The API has no get route

Not sure whether the author simply forgot to write it; it should be changed to random, otherwise no proxy can be fetched.

Optimize on-demand use of paid proxies

Free proxies have a low availability rate, so ideally the pool would combine paid IPs with free IPs.
That raises a problem: the paid IPs get tested endlessly, which costs money even while no actual business is using the proxies.

It would be good to add an on-demand mechanism for paid proxies:
only start pulling paid proxies when a crawler actually needs them, and make the number of IPs pulled from the provider configurable.
