proxy_pool's People

Contributors

bernieyangmh, bhzhangsun, bobobo80, chncaption, dependabot[bot], dustinpt, feng409, gladmo, halleywj, highroom, houbaron, jhao104, jiannanya, kagxin, kangnwh, netair, newlyedward, ozhiwei, plokid, roronoa-dong, scil, sunjngje, tinker-li, vc5, vissssa, wang-ye, windhw, xuan25, yeclimeric, yrjyrj123

proxy_pool's Issues

Calling get returns an Internal Server Error

get_all returns an empty result, and get returns an Internal Server Error. What is causing this?
Internal Server Error

The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.

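Database update logic

This tool is really useful, thanks.
A few questions I would like to confirm:

  1. What is the update logic for the proxy list returned by get_all? The list seems to keep growing: a proxy (say ipA) that worked yesterday but no longer works today is still returned.
  2. Would you consider adding an API that returns the list of proxies that were tested and confirmed working within the past X minutes?

Thank you.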
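
The second suggestion could be sketched as follows, assuming each proxy is stored together with the time of its last successful check. The `last_ok` mapping and the `recently_validated` function are hypothetical names used for illustration; they are not part of proxy_pool.

```python
import time


def recently_validated(last_ok, minutes, now=None):
    """Return proxies whose last successful check is within `minutes`."""
    now = time.time() if now is None else now
    cutoff = now - minutes * 60
    return sorted(p for p, ts in last_ok.items() if ts >= cutoff)


# Hypothetical data: proxy -> Unix time of its last successful validation.
last_ok = {"10.0.0.1:8080": 1000.0, "10.0.0.2:3128": 400.0}
fresh = recently_validated(last_ok, minutes=5, now=1000.0)
```

Storing a timestamp instead of `None` as the value would also let a cleanup job drop entries that have not passed a check for a long time.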

Error when running under Python 3, please help take a look

[root@iz8vbawf20vjywci9aweg8z Run]# python3 main.py
Process ValidRun:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/local/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "../Schedule/ProxyValidSchedule.py", line 61, in run
p.main()
File "../Schedule/ProxyValidSchedule.py", line 56, in main
self.__validProxy()
File "../Schedule/ProxyValidSchedule.py", line 36, in __validProxy
for each_proxy in self.db.getAll():
File "../DB/DbClient.py", line 94, in getAll
return self.client.getAll()
File "/home/software/proxy_pool/DB/SsdbClient.py", line 94, in getAll
return self.__conn.hgetall(self.name).keys()
File "/usr/local/lib/python3.6/site-packages/ssdb-0.0.3-py3.6.egg/ssdb/client.py", line 1050, in hgetall
return self.execute_command('hgetall', name)
File "/usr/local/lib/python3.6/site-packages/ssdb-0.0.3-py3.6.egg/ssdb/client.py", line 225, in execute_command
connection.send_command(*args)
File "/usr/local/lib/python3.6/site-packages/ssdb-0.0.3-py3.6.egg/ssdb/connection.py", line 404, in send_command
self.send_packed_command(self.pack_command(*args))
File "/usr/local/lib/python3.6/site-packages/ssdb-0.0.3-py3.6.egg/ssdb/connection.py", line 383, in send_packed_command
self._sock.sendall(item)
TypeError: a bytes-like object is required, not 'str'
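
The final frame shows `sendall` receiving a `str`; on Python 3 a socket only accepts bytes-like objects. Below is a minimal sketch of the difference, using a local socket pair rather than a real SSDB connection. The `to_bytes` helper is hypothetical and is not part of the `ssdb` package.

```python
import socket


def to_bytes(item, encoding="utf-8"):
    """Encode str to bytes; pass bytes through unchanged."""
    return item.encode(encoding) if isinstance(item, str) else item


a, b = socket.socketpair()
try:
    try:
        a.sendall("hgetall")          # str: rejected on Python 3
        str_accepted = True
    except TypeError:
        str_accepted = False
    a.sendall(to_bytes("hgetall"))    # bytes: accepted
    received = b.recv(16)
finally:
    a.close()
    b.close()
```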

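Would like to get in touch!

Hello, your projects are all very good and I would like to get in touch. I have been organizing people to work on data collection and mining; together we can accomplish a great deal. My WeChat: toyaowu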

python2.7 -m Schedule.ProxyRefreshSchedule raises OverflowError

Running python2.7 -m Schedule.ProxyRefreshSchedule in the project directory
produces this error:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/sam/app/venv/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 71, in <module>
    main()
  File "/home/sam/app/venv/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 60, in main
    p.refresh()
  File "Manager/ProxyManager.py", line 46, in refresh
    self.db.put(proxy)
  File "DB/DbClient.py", line 73, in put
    return self.client.put(value, **kwargs)
  File "/home/sam/app/venv/proxy_pool/DB/SsdbClient.py", line 62, in put
    return self.__conn.hset(self.name, value, None)
  File "/usr/local/lib/python2.7/site-packages/ssdb/client.py", line 797, in hset
    return self.execute_command('hset', name, key, value)
  File "/usr/local/lib/python2.7/site-packages/ssdb/client.py", line 218, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python2.7/site-packages/ssdb/connection.py", line 404, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python2.7/site-packages/ssdb/connection.py", line 378, in send_packed_command
    self.connect()
  File "/usr/local/lib/python2.7/site-packages/ssdb/connection.py", line 281, in connect
    sock = self._connect()
  File "/usr/local/lib/python2.7/site-packages/ssdb/connection.py", line 308, in _connect
    socket.SOCK_STREAM):
OverflowError: Python int too large to convert to C long

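System: CentOS 6.4, 64-bit.
From what I found online, this error occurs because an underlying C function is used.
http://bugs.python.org/issue21816
Strangely, I did not see anyone else reporting the same error in the issues.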

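Port numbers

The maximum port number is 65535, so a port can have up to 5 digits.
The crawler's regular expression only keeps the first 4 digits.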
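
A small sketch of the truncation described above. The exact pattern used by the crawler is not shown in this issue, so the four-digit pattern here is an assumption made for illustration.

```python
import re

text = "proxy 10.0.0.1:65535 listed"
ip = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

four = re.findall(ip + r":\d{1,4}", text)   # truncates a 5-digit port
five = re.findall(ip + r":\d{1,5}", text)   # keeps all five digits
```

With `\d{1,4}` the match silently stops after four digits, so the stored proxy points at the wrong port rather than being rejected.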

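Periodic IP refresh problem

After the proxy program has been running for a while, a large number of zombie processes appear, as shown below:
[screenshot: 2017-01-11 1 04 05]

My guess is that the ProxyRefreshSchedule class, which performs the periodic refresh, has a bug.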

python -m Schedule.ProxyRefreshSchedule fails with an error

Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Library/Python/2.7/site-packages/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 71, in <module>
    main()
  File "/Library/Python/2.7/site-packages/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 60, in main
    p.refresh()
  File "Manager/ProxyManager.py", line 40, in refresh
    for proxy in getattr(GetFreeProxy, proxyGetter.strip())():
  File "ProxyGetter/getFreeProxy.py", line 102, in freeProxyFifth
    d = tree.xpath('.//table[@class="table"]/tbody/tr[{}]/td'.format(i + 1))[0]
IndexError: list index out of range

I suspected the XPath for that proxy source was wrong, so I tried commenting it out, which led to a different error.
ProxyGetter/getFreeProxy.py

  91     @staticmethod
  92     @robustCrawl
  93     def freeProxyFifth():
  94         """
  95         Scrape goubanjia http://www.goubanjia.com/free/gngn/index.shtml
  96         :return:
  97         """
  98         url = "http://www.goubanjia.com/free/gngn/index.shtml"
  99         tree = getHtmlTree(url)
 100         # Currently at most 15 are posted per day (one page)
 101         #for i in xrange(15):
 102             #d = tree.xpath('.//table[@class="table"]/tbody/tr[{}]/td'.format(i + 1))[0]
 103             #o = d.xpath('.//span/text() | .//div/text()')
 104             #yield ''.join(o[:-1]) + ':' + o[-1]
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Library/Python/2.7/site-packages/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 71, in <module>
    main()
  File "/Library/Python/2.7/site-packages/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 60, in main
    p.refresh()
  File "Manager/ProxyManager.py", line 40, in refresh
    for proxy in getattr(GetFreeProxy, proxyGetter.strip())():
TypeError: 'NoneType' object is not iterable
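
A likely explanation for the second error, offered as an assumption because the decorator and the rest of the module are not shown here: with every `yield` commented out, `freeProxyFifth` is no longer a generator function, so calling it returns `None` and the `for` loop in `refresh` fails. A minimal sketch with hypothetical function names:

```python
def with_yield():
    """A generator function: calling it returns an iterable."""
    for item in ["1.2.3.4:80"]:
        yield item


def without_yield():
    """Body reduced to a docstring, as when the loop is commented out."""


def empty_generator():
    """Yields nothing, but is still safe to iterate over."""
    return
    yield  # unreachable, but keeps this a generator function


results = [list(with_yield()), without_yield(), list(empty_generator())]
```

To disable a source without breaking the loop, removing its name from the list of getters, or leaving an empty generator as above, avoids the `None` return.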

Error installing the ssdb Python driver on Python 2.6

Collecting ssdb
/usr/lib/python2.6/site-packages/pip/vendor/requests/packages/urllib3/util/ssl.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
Using cached ssdb-0.0.3.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 20, in
File "/tmp/pip-build-sfGK5r/ssdb/setup.py", line 5, in
from ssdb import version
File "ssdb/init.py", line 2, in
from ssdb.client import StrictSSDB, SSDB
File "ssdb/client.py", line 74
return {k:int(v) for k,v in list_to_dict(lst).items()}
^
SyntaxError: invalid syntax
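
The syntax error points at a dict comprehension, which Python 2.6 does not support (the syntax was added in 2.7). Below is a sketch of an equivalent form that also parses on 2.6. The `list_to_dict` function is a stand-in written for this example, not the `ssdb` package's own implementation.

```python
def list_to_dict(lst):
    """Pair up a flat [k1, v1, k2, v2, ...] list (stand-in helper)."""
    return dict(zip(lst[0::2], lst[1::2]))


lst = ["a", "1", "b", "2"]
# dict() over a generator expression works on Python 2.6 and later.
converted = dict((k, int(v)) for k, v in list_to_dict(lst).items())
```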

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-sfGK5r/ssdb

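Counting issue in ProxyValidSchedule

In ProxyValidSchedule, if a proxy has existed for a long time, its count will be very large, so even after it stops working it takes a long time for the count to drop below zero and for the proxy to be cleaned up. If the count had an upper limit, say 10, and stopped increasing once it reached that limit, wouldn't expired proxies be cleaned up more efficiently?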
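
The proposed cap can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the count goes up by one on a successful check, down by one on a failure, and a proxy is removed once the count is negative. The names `update_count` and `should_remove` are hypothetical.

```python
MAX_COUNT = 10  # upper limit suggested in the issue


def update_count(count, ok, max_count=MAX_COUNT):
    """Increase on success (capped at max_count), decrease on failure."""
    if ok:
        return min(count + 1, max_count)
    return count - 1


def should_remove(count):
    """A proxy is cleaned up once its count drops below zero."""
    return count < 0


# A proxy that passed 1000 checks and then died.
count = 0
for _ in range(1000):
    count = update_count(count, ok=True)

failures = 0
while not should_remove(count):
    count = update_count(count, ok=False)
    failures += 1
```

With the cap, the dead proxy is removed after 11 failed checks instead of roughly 1000.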

Typo

Line 18 of Manager.ProxyManager has a typo:

from ProxyGetter.GetFreeProxy import GetFreeProxy
should be
from ProxyGetter.getFreeProxy import GetFreeProxy

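Banned by Xici

Hello, I am writing a similar project as practice, but when scraping the Xici proxy site I am repeatedly banned and the page I get back is a block page. Do you handle this by constantly rotating headers, or in some other way?

I am a beginner asking for advice. Thank you!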

Error 10061 connecting localhost:8888

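Hello, after setting up the dependencies I ran main.py and it returned ConnectionError: Error 10061 connecting localhost:8888. I am a beginner and have no experience with SSDB, so I am not sure whether my SSDB configuration is wrong. Running getFreeProxy on its own does return a list of IPs, so the problem is probably the database settings. I moved it to a cloud server and used SSDBAdmin to change the setting to the server's public IP, but it still returns error 10061. Is there anything else I need to configure? Thank you.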

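Why must redis be started first?

Hello jhao104, the documentation says SSDB is used in place of redis, but in practice, if I run main.py without starting redis first, it raises an error. I am a beginner and would appreciate guidance, thank you. (Windows 8, 64-bit.) Also, sometimes it runs successfully but no proxies are shown; the browser just displays [ ].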

Does proxy_pool check for duplicate IPs?

I mean the case where the same ip:port is collected from several sources.

Example:
192.168.56.1:123 from X
and
192.168.56.1:123 from Y

Can jhao's proxy_pool handle this?
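
A sketch of why duplicates collapse when the `ip:port` string itself is used as the key of a hash, which is how the `hset(self.name, value, None)` call in the SsdbClient traceback earlier on this page appears to store proxies. A plain dict stands in for the database hash here, and the `put` function is hypothetical.

```python
def put(pool, proxy):
    """Store the proxy as a key; writing the same key twice is a no-op."""
    pool[proxy.strip()] = None


pool = {}
for source, proxy in [("X", "192.168.56.1:123"), ("Y", "192.168.56.1:123")]:
    put(pool, proxy)

stored = sorted(pool)
```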

ssdb compatibility issue

Installing ssdb fails under Python 3; it cannot be installed on either 3.4.3 or 3.6.2. I suggest using pyssdb or ssdb.py instead.

ubuntu@hp:~/workspace/proxy_pool/Run$ python -V
Python 3.6.2
ubuntu@hp:~/workspace/proxy_pool/Run$ pip install ssdb
Collecting ssdb
  Using cached ssdb-0.0.3.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-srsinocq/ssdb/setup.py", line 5, in <module>
        from ssdb import __version__
      File "/tmp/pip-build-srsinocq/ssdb/ssdb/__init__.py", line 2, in <module>
        from ssdb.client import StrictSSDB, SSDB
      File "/tmp/pip-build-srsinocq/ssdb/ssdb/client.py", line 3, in <module>
        from itertools import chain, starmap, izip_longest
    ImportError: cannot import name 'izip_longest'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-srsinocq/ssdb/
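
The import fails because `itertools.izip_longest` was renamed to `zip_longest` in Python 3. A common compatibility pattern, shown here as a sketch rather than as a patch to the `ssdb` package:

```python
try:
    from itertools import izip_longest as zip_longest  # Python 2
except ImportError:
    from itertools import zip_longest  # Python 3

pairs = list(zip_longest("ab", "x", fillvalue="-"))
```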

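I would like to add a new feature

Hello. When using your code, I read the results stored in redis directly. However, a proxy in useful_proxy_queue is only removed when the caller removes it, and removing it after a single failure uses up the proxies very quickly. So in my own code I added a check when fetching from redis: a proxy is only discarded after 20 consecutive failures, and this works quite well. At the moment this logic lives in my own code; I would like to add it to the proxy pool's API layer. Would this feature be accepted?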
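
The idea described above could look roughly like this. It is a sketch under assumptions, not the poster's actual code: failures are counted per proxy in a dict, and a success resets the count. The names `record_result` and `fail_counts` are hypothetical.

```python
MAX_FAILS = 20  # threshold mentioned in the issue


def record_result(fail_counts, proxy, ok, max_fails=MAX_FAILS):
    """Update the consecutive-failure count; return True if the proxy is dropped."""
    if ok:
        fail_counts[proxy] = 0
        return False
    fail_counts[proxy] = fail_counts.get(proxy, 0) + 1
    if fail_counts[proxy] >= max_fails:
        del fail_counts[proxy]
        return True
    return False


fail_counts = {}
proxy = "10.0.0.1:8080"
dropped = [record_result(fail_counts, proxy, ok=False) for _ in range(20)]
```

Only the twentieth consecutive failure drops the proxy; a single success in between starts the count again from zero.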

A small change

In proxy_pool/DB/RedisClient.py,
pop should be changed to:
def pop(self):
    return self.__conn.spop(self.name)

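Usage question

Dear author:
Hello, I downloaded your project and got it running on Linux, but it only supplied IP addresses to my program for a few minutes. After that, proxyApi started returning 500 errors and no longer provided IP addresses. Why is this? Below is the log from proxyApi.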

...................................
ValueError: View function did not return a response
127.0.0.1 - - [22/Jan/2017 17:07:58] "GET /get/ HTTP/1.1" 500 -
[2017-01-22 17:07:59,348] ERROR in app: Exception on /get/ [GET]
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1988, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1642, in full_dispatch_request
response = self.make_response(rv)
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1731, in make_response
raise ValueError('View function did not return a response')
ValueError: View function did not return a response
127.0.0.1 - - [22/Jan/2017 17:07:59] "GET /get/ HTTP/1.1" 500 -
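
Flask raises this `ValueError` when a view function returns `None`. One plausible cause, stated as an assumption because the view code is not shown here, is that the pool has run empty and the underlying pop returns `None`. Below is a framework-free sketch of guarding against that; `get_proxy_response` and `pop_proxy` are hypothetical names.

```python
def get_proxy_response(pop_proxy):
    """Always return a string, even when the pool is empty."""
    proxy = pop_proxy()
    if proxy is None:
        return "no proxy available"
    if isinstance(proxy, bytes):
        proxy = proxy.decode("utf-8")
    return proxy


empty = get_proxy_response(lambda: None)
found = get_proxy_response(lambda: b"10.0.0.1:8080")
```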

Access error on CentOS 7

ERROR in app: Exception on /get/ [GET]
Traceback (most recent call last):
  File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1988, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1642, in full_dispatch_request
    response = self.make_response(rv)
  File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1731, in make_response
    raise ValueError('View function did not return a response')
ValueError: View function did not return a response
ERROR:Api.ProxyApi:Exception on /get/ [GET]
Traceback (most recent call last):
  File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1988, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1642, in full_dispatch_request
    response = self.make_response(rv)
  File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1731, in make_response
    raise ValueError('View function did not return a response')
ValueError: View function did not return a response

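CentOS 7, Python 2.7.13. Testing locally with curl http://localhost:5000/get/ works fine, but accessing it on an overseas cloud server raises the error.

The scheduled job may also have a problem. I am not very familiar with Python; could you help me see what the problems are and how to fix them?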

ERROR:apscheduler.executors.default:Job "main (trigger: interval[0:05:00], next run at: 2017-07-07 17:29:56 CST)" raised an exception
Traceback (most recent call last):
  File "/usr/local/python/lib/python2.7/site-packages/apscheduler/executors/base.py", line 125, in run_job
    retval = job.func(*job.args, **job.kwargs)
  File "../Schedule/ProxyRefreshSchedule.py", line 73, in main
    p.refresh()
  File "../Manager/ProxyManager.py", line 42, in refresh
    for proxy in getattr(GetFreeProxy, proxyGetter.strip())():
  File "../ProxyGetter/getFreeProxy.py", line 80, in freeProxySecond
    for proxy in re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}', html):
  File "/usr/local/python/Lib/re.py", line 181, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

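About validating IP addresses

Looking at the code, I only see IPs from raw_proxy_queue being validated, with the currently usable ones placed in useful_proxy_queue. I do not see any code that validates the IPs already in useful_proxy_queue. These free IPs can stop working at any time, so they also need to be re-validated periodically.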
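
A sketch of the re-validation pass requested above. It assumes the useful queue can be read as a collection of `ip:port` strings and that some checker function reports whether a proxy still works; `revalidate` and `is_alive` are hypothetical names, and the checker below is a stub rather than a real network test.

```python
def revalidate(useful, is_alive):
    """Return (kept, removed) after re-checking every proxy in `useful`."""
    kept, removed = set(), set()
    for proxy in useful:
        (kept if is_alive(proxy) else removed).add(proxy)
    return kept, removed


# Stub checker for illustration: pretend only port 8080 still answers.
useful = {"10.0.0.1:8080", "10.0.0.2:3128"}
kept, removed = revalidate(useful, lambda p: p.endswith(":8080"))
```

Running such a pass on a schedule, and deleting the `removed` entries from the queue, would keep stale proxies from being served.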

Error running on CentOS 6 with Python 2.7.1

[root@VPS Run]# python main.py
Traceback (most recent call last):
  File "main.py", line 22, in <module>
    from Schedule.ProxyRefreshSchedule import run as RefreshRun
  File "../Schedule/ProxyRefreshSchedule.py", line 21, in <module>
    from apscheduler.schedulers.blocking import BlockingScheduler
  File "/usr/local/python27/lib/python2.7/site-packages/apscheduler/__init__.py", line 2, in <module>
    release = __import__('pkg_resources').get_distribution('APScheduler').version.split('-')[0]
  File "/usr/local/python27/lib/python2.7/site-packages/distribute-0.6.10-py2.7.egg/pkg_resources.py", line 292, in get_distribution
    if isinstance(dist,Requirement): dist = get_provider(dist)
  File "/usr/local/python27/lib/python2.7/site-packages/distribute-0.6.10-py2.7.egg/pkg_resources.py", line 176, in get_provider
    return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
  File "/usr/local/python27/lib/python2.7/site-packages/distribute-0.6.10-py2.7.egg/pkg_resources.py", line 648, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/local/python27/lib/python2.7/site-packages/distribute-0.6.10-py2.7.egg/pkg_resources.py", line 546, in resolve
    raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: APScheduler

Unable to start collection

Starting collection fails with: TypeError: __init__() got an unexpected keyword argument 'minute'

This error appears

In the browser:

Internal Server Error

The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.

In the terminal:

ERROR:apscheduler.executors.default:Job "main (trigger: interval[0:05:00], next run at: 2017-06-14 23:20:31 CEST)" raised an exception
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/apscheduler/executors/base.py", line 125, in run_job
    retval = job.func(*job.args, **job.kwargs)
  File "../Schedule/ProxyRefreshSchedule.py", line 73, in main
    p.refresh()
  File "../Manager/ProxyManager.py", line 42, in refresh
    for proxy in getattr(GetFreeProxy, proxyGetter.strip())():
  File "../ProxyGetter/getFreeProxy.py", line 79, in freeProxySecond
    html = getHTMLText(url, headers=HEADER)
  File "../Util/utilFunction.py", line 31, in getHTMLText
    return response.status_code
UnboundLocalError: local variable 'response' referenced before assignment

How can I solve this?
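
The traceback suggests that the request itself failed, so `response` was never assigned before the function tried to use it. Below is a sketch of a fetch helper that always returns a string, so that a later `re.findall` never receives `None` or an integer. It uses the standard library's `urllib` rather than whatever HTTP client the project uses, and `get_html_text` and `failing_opener` are hypothetical names.

```python
import re
import urllib.error
import urllib.request


def get_html_text(url, opener=urllib.request.urlopen, timeout=10):
    """Return the page text, or "" if the request fails."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError, ValueError):
        return ""


def failing_opener(url, timeout=None):
    """Stand-in for a network call that cannot reach the site."""
    raise urllib.error.URLError("simulated network failure")


html = get_html_text("http://example.invalid/", opener=failing_opener)
proxies = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}", html)
```

Returning an empty string on failure means an unreachable source simply contributes no proxies instead of aborting the whole refresh job.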

What are the required runtime environment settings?

I am using CentOS 6 with Python 3.5 and getting all kinds of errors.

[root@localhost proxy_pool-master]# python -m Schedule.ProxyRefreshSchedule
Traceback (most recent call last):
File "/usr/local/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"main", mod_spec)
File "/usr/local/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/proxy_pool-master/Schedule/ProxyRefreshSchedule.py", line 19, in
from apscheduler.schedulers.blocking import BlockingScheduler
ImportError: No module named 'apscheduler'

Did I fail to install something?

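Filtering requests by type

Would it be possible to specify an https or http proxy when calling get? After all, some proxies only support http or https.

Thank you!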

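[FIXED] SSDB visualization configuration question

A question: after running ProxyApi.py under Api, and ProxyRefreshSchedule.py and ProxyValidSchedule.py under Schedule, I would like to view the results with the SSDBAdmin visualization tool mentioned in the README. How should 'host' and 'port' be set in SSDBAdmin's setting.py?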

#!/usr/bin/env python

servers = [
    {
        "host": "172.16.1.69",
        "port": 8889
    },
    {
        "host": "127.0.0.1",
        "port": 8889
    }
]

DEBUG = False

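Addendum:

While ProxyApi.py, ProxyRefreshSchedule.py and ProxyValidSchedule.py were running, running runserver.py under SSDBAdmin produced a socket-in-use error: socket.error: [Errno 98] Address already in use. I have no prior experience with redis or other databases, so any guidance is appreciated. Thank you!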

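How do I add a scheme attribute?

  1. This is my first time using a NoSQL database, and I find it really is better suited to this job (maintaining a proxy pool) than a traditional database.
  2. I now need to add a scheme attribute to each IP. I noticed that in your database factory class DbClient the value is None, and I think that is where the scheme could be stored.
  3. I am familiar with how to scrape the data.
  4. I would like to ask how to modify the code with the least amount of work. My idea is below.

Determine the scheme when scraping proxy IPs from web pages and store it as the value in SSDB. In that case the factory class may need to change, and so would ProxyManager.

What is the best practice in this situation?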
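
The idea of storing the scheme as the hash value can be sketched as follows. A plain dict stands in for the SSDB hash, and the `put` and `get_by_scheme` functions are hypothetical; this is one possible shape for the change, not the project's API.

```python
def put(pool, proxy, scheme):
    """Store the scheme as the value instead of None."""
    pool[proxy] = scheme


def get_by_scheme(pool, scheme):
    """Return all proxies recorded with the given scheme."""
    return sorted(p for p, s in pool.items() if s == scheme)


pool = {}
put(pool, "10.0.0.1:8080", "http")
put(pool, "10.0.0.2:443", "https")
put(pool, "10.0.0.3:3128", "http")

http_only = get_by_scheme(pool, "http")
```

Keeping the proxy string as the key means existing lookups and deduplication keep working; only the code that writes and reads the value needs to change.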

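No proxies at all for the past two days?

Following the documentation, I installed the dependencies and ssdb, but after running python there are only two main.py processes, and the log file contains no errors. What went wrong?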
