jhao104 / proxy_pool
Python ProxyPool for web spider
Home Page: https://jhao104.github.io/proxy_pool/
License: MIT License
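For example, suppose a batch of validated proxies was put into the useful queue 10 minutes ago (or even earlier), and a spider takes one of them out 10 minutes later. At that point the proxy is not necessarily still usable, is it?
Calling get_all returns an empty result, and calling get returns an internal server error. What is going on?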
Internal Server Error
The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
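I read your code and learned a lot from it. However, I cannot reproduce freeProxyFirst(): following your code, what comes back is a 521 status and some JS code. With a packet capture tool I can also see that the site first sends a completely wrong response when visited. How did you solve this?
The self.name in these two files
is hard to understand. I would like to write a pymongo driver and deploy it to BAE. Forgive me, I have never used the SSDB database.
The tool is very useful, thumbs up.
A few questions I would like to confirm:
Thanks.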
[root@iz8vbawf20vjywci9aweg8z Run]# python3 main.py
Process ValidRun:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/local/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "../Schedule/ProxyValidSchedule.py", line 61, in run
p.main()
File "../Schedule/ProxyValidSchedule.py", line 56, in main
self.__validProxy()
File "../Schedule/ProxyValidSchedule.py", line 36, in __validProxy
for each_proxy in self.db.getAll():
File "../DB/DbClient.py", line 94, in getAll
return self.client.getAll()
File "/home/software/proxy_pool/DB/SsdbClient.py", line 94, in getAll
return self.__conn.hgetall(self.name).keys()
File "/usr/local/lib/python3.6/site-packages/ssdb-0.0.3-py3.6.egg/ssdb/client.py", line 1050, in hgetall
return self.execute_command('hgetall', name)
File "/usr/local/lib/python3.6/site-packages/ssdb-0.0.3-py3.6.egg/ssdb/client.py", line 225, in execute_command
connection.send_command(*args)
File "/usr/local/lib/python3.6/site-packages/ssdb-0.0.3-py3.6.egg/ssdb/connection.py", line 404, in send_command
self.send_packed_command(self.pack_command(*args))
File "/usr/local/lib/python3.6/site-packages/ssdb-0.0.3-py3.6.egg/ssdb/connection.py", line 383, in send_packed_command
self._sock.sendall(item)
TypeError: a bytes-like object is required, not 'str'
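The last frame fails because, on Python 3, `socket.sendall` accepts only bytes-like objects, and the traceback suggests that ssdb 0.0.3 hands it a `str`. A minimal sketch of the difference, using a local socket pair instead of a real SSDB connection (the command text is illustrative only and is not the real SSDB wire format):

```python
import socket

# A connected pair of local sockets stands in for the SSDB connection.
left, right = socket.socketpair()

item = "hgetall"  # a str, as a Python 2 era client would pass it

try:
    left.sendall(item)  # raises TypeError on Python 3
    sent_as_str = True
except TypeError:
    sent_as_str = False

# Encoding to bytes first is what Python 3 requires.
left.sendall(item.encode("utf-8"))
received = right.recv(1024)

left.close()
right.close()
```

This only illustrates the cause; it does not patch the ssdb package itself.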
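There are not many proxy sites nowadays, so very few usable proxy IPs are available. I also tried scanning for proxies, but that is rather inefficient.
Hello, your projects are all quite good and I would like to get to know you. I have been organizing people to work on data collection and mining, and together there is no limit to what we can do. My WeChat: toyaowu
Running python2.7 -m Schedule.ProxyRefreshSchedule in the project directory
produces this error: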
Traceback (most recent call last):
File "/usr/local/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/sam/app/venv/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 71, in <module>
main()
File "/home/sam/app/venv/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 60, in main
p.refresh()
File "Manager/ProxyManager.py", line 46, in refresh
self.db.put(proxy)
File "DB/DbClient.py", line 73, in put
return self.client.put(value, **kwargs)
File "/home/sam/app/venv/proxy_pool/DB/SsdbClient.py", line 62, in put
return self.__conn.hset(self.name, value, None)
File "/usr/local/lib/python2.7/site-packages/ssdb/client.py", line 797, in hset
return self.execute_command('hset', name, key, value)
File "/usr/local/lib/python2.7/site-packages/ssdb/client.py", line 218, in execute_command
connection.send_command(*args)
File "/usr/local/lib/python2.7/site-packages/ssdb/connection.py", line 404, in send_command
self.send_packed_command(self.pack_command(*args))
File "/usr/local/lib/python2.7/site-packages/ssdb/connection.py", line 378, in send_packed_command
self.connect()
File "/usr/local/lib/python2.7/site-packages/ssdb/connection.py", line 281, in connect
sock = self._connect()
File "/usr/local/lib/python2.7/site-packages/ssdb/connection.py", line 308, in _connect
socket.SOCK_STREAM):
OverflowError: Python int too large to convert to C long
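System: CentOS 6.4, 64-bit.
From what I found online, this error occurs because an underlying C function is used.
http://bugs.python.org/issue21816
Strangely, I did not see anyone else report the same error in the issues.
The maximum port number is 65535, so a port can have up to 5 digits.
The crawler's regular expression only keeps the first 4 digits.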
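A quick check of the difference, assuming the crawler pattern limits the port to `\d{1,4}` (the exact pattern in the crawler is not quoted here, so both patterns below are illustrative):

```python
import re

text = "1.2.3.4:65535 10.0.0.1:8080"

# Port limited to 4 digits: a 5-digit port is silently truncated.
four = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,4}', text)

# Port allowed up to 5 digits: the full port is kept.
five = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}', text)
```

With the 4-digit pattern the first proxy comes out as `1.2.3.4:6553`, which is a different and almost certainly wrong port.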
Could this operate on the db resource concurrently?
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/Library/Python/2.7/site-packages/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 71, in <module>
main()
File "/Library/Python/2.7/site-packages/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 60, in main
p.refresh()
File "Manager/ProxyManager.py", line 40, in refresh
for proxy in getattr(GetFreeProxy, proxyGetter.strip())():
File "ProxyGetter/getFreeProxy.py", line 102, in freeProxyFifth
d = tree.xpath('.//table[@class="table"]/tbody/tr[{}]/td'.format(i + 1))[0]
IndexError: list index out of range
I suspected the xpath for that proxy source was broken. After commenting it out, a different error appeared:
ProxyGetter/getFreeProxy.py
91 @staticmethod
92 @robustCrawl
93 def freeProxyFifth():
94 """
95 Scrape goubanjia http://www.goubanjia.com/free/gngn/index.shtml
96 :return:
97 """
98 url = "http://www.goubanjia.com/free/gngn/index.shtml"
99 tree = getHtmlTree(url)
100 # currently at most 15 per day (one page)
101 #for i in xrange(15):
102 #d = tree.xpath('.//table[@class="table"]/tbody/tr[{}]/td'.format(i + 1))[0]
103 #o = d.xpath('.//span/text() | .//div/text()')
104 #yield ''.join(o[:-1]) + ':' + o[-1]
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/Library/Python/2.7/site-packages/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 71, in <module>
main()
File "/Library/Python/2.7/site-packages/proxy_pool/Schedule/ProxyRefreshSchedule.py", line 60, in main
p.refresh()
File "Manager/ProxyManager.py", line 40, in refresh
for proxy in getattr(GetFreeProxy, proxyGetter.strip())():
TypeError: 'NoneType' object is not iterable
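The second error is probably a side effect of the commenting-out: once every `yield` is removed, `freeProxyFifth` is no longer a generator and returns `None`, which the `for` loop in `ProxyManager.refresh` cannot iterate. A minimal sketch of the effect and of one way to guard against missing rows. The function names and the plain-list input are hypothetical and stand in for the real xpath results:

```python
def disabled_getter():
    """All yields commented out: this is a plain function returning None."""
    rows = []
    # for row in rows:
    #     yield row


def guarded_getter(rows):
    """Still a generator: skip rows with no cells instead of indexing blindly."""
    for cells in rows:
        if not cells:  # an empty xpath result would raise IndexError on [0]
            continue
        yield cells[0]


result = disabled_getter()
proxies = list(guarded_getter([["1.2.3.4:80"], [], ["5.6.7.8:8080"]]))
empty = list(guarded_getter([]))
```

Keeping the function a generator means that a source with no rows simply yields nothing, so the caller's loop still works.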
Collecting ssdb
/usr/lib/python2.6/site-packages/pip/vendor/requests/packages/urllib3/util/ssl.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
Using cached ssdb-0.0.3.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 20, in <module>
File "/tmp/pip-build-sfGK5r/ssdb/setup.py", line 5, in <module>
from ssdb import __version__
File "ssdb/__init__.py", line 2, in <module>
from ssdb.client import StrictSSDB, SSDB
File "ssdb/client.py", line 74
return {k:int(v) for k,v in list_to_dict(lst).items()}
^
SyntaxError: invalid syntax
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-sfGK5r/ssdb
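The `SyntaxError` points at a dict comprehension, a syntax that was added in Python 2.7, while the pip path in the log shows Python 2.6. A sketch of an equivalent form that also parses on 2.6. `list_to_dict` here is a hypothetical stand-in, since the library's own helper is not shown:

```python
def list_to_dict(lst):
    """Hypothetical stand-in: pair up a flat [k1, v1, k2, v2, ...] list."""
    return dict(zip(lst[0::2], lst[1::2]))


def to_int_values(lst):
    # dict() over a generator expression avoids dict-comprehension syntax.
    return dict((k, int(v)) for k, v in list_to_dict(lst).items())


counts = to_int_values(["a", "1", "b", "2"])
```

In practice, upgrading to Python 2.7 or later is likely the simpler route than patching the package.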
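In ProxyValidSchedule, if a proxy has existed for a long time, its count will be very large, so even after it stops working it takes a long time for the count to drop below zero and for the proxy to be cleaned up. What if the count had an upper limit, say 10, and stopped increasing beyond it? Wouldn't that clean up expired proxies more efficiently?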
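A minimal sketch of the capped counter proposed above. The cap of 10 comes from the suggestion; the function name, and the rule that a proxy is dropped once its count goes below zero, are assumptions rather than the project's actual code:

```python
MAX_COUNT = 10  # upper bound suggested above


def update_count(count, is_valid):
    """Return the new count, or None when the proxy should be removed."""
    if is_valid:
        return min(count + 1, MAX_COUNT)  # never grow past the cap
    count -= 1
    return None if count < 0 else count


# A long-lived proxy stays at the cap however many checks it passes.
count = 0
for _ in range(1000):
    count = update_count(count, True)
capped = count

# Once it stops working, count how many failed checks it takes to remove it.
failures = 0
while count is not None:
    count = update_count(count, False)
    failures += 1
```

With the cap, a dead proxy is removed after a bounded number of failed checks, no matter how long it had been alive.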
There is a typo on line 18 of Manager.ProxyManager:
from ProxyGetter.GetFreeProxy import GetFreeProxy
==>
from ProxyGetter.getFreeProxy import GetFreeProxy
It should be tree.xpath('//*[@id="freelist"]/table/tbody/tr')
Hello:
When I use it:
$:/usr/bin/python2: No module named Run
How can I fix this?
http://proxydb.net/
http://multiproxy.org/cgi-bin/search-proxy.pl
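I have written several, but I cannot get these two to work. Beginner asking for advice.
Hello, I am writing a similar project for practice, but when scraping the Xici proxy site I am repeatedly banned and the page I get back is a block page. Do you handle this by constantly rotating headers, or in some other way?
Beginner asking for advice. Thank you!
Just fork the project and submit your changes.
When running python -m Schedule.ProxyRefreshSchedule, the following problem appeared: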
/Users/kanetsu/anaconda2/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py:838: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/security.html
InsecureRequestWarning)
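How can I solve this?
For example: I want to crawl data from the Wanfang Data knowledge service platform. Proxy IP collection and validation have finished, but with some IPs (for example 115.28.169.160:8118) the data collected during use is always from YinYueTai (http://www.yinyuetai.com/) or the home page of Zhubajie (http://www.zbj.com/). How do you deal with this problem?
Hello, after setting up the dependencies I ran main.py and it returned ConnectionError: Error 10061 connecting localhost:8888. I am a beginner and have no experience with SSDB, so I am not sure whether my SSDB configuration is wrong. Running getFreeProxy on its own does return a list of IPs, so it should be a database configuration problem. I put it on a cloud server and used SSDBAdmin to change it to the server's public IP, but it still returns error 10061. Is there anything else I need to configure? Thanks.
Hello jhao104, the docs say SSDB is used in place of Redis, but in practice, if I run main.py without starting Redis first, it reports an error. I am a beginner, please advise, thanks. (Windows 8, 64-bit.) Also, sometimes it runs successfully but no proxies are shown; the browser just displays a sad [ ].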
What I mean is:
if the same ip:port comes from several sources,
for example:
192.168.56.1:123 from X
and
192.168.56.1:123 from Y
can jhao proxy_pool deduplicate this?
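One way to see why duplicates from different sources need not pile up: if the pool is keyed by the `ip:port` string, as the `hset(self.name, value, None)` call in the SSDB client traceback on this page suggests, then storing the same proxy twice overwrites a single entry. A sketch with a plain dict standing in for the database hash (an assumption; the real storage is SSDB or Redis):

```python
pool = {}  # stands in for the database hash, keyed by "ip:port"


def put(proxy):
    """Store a proxy; a repeated key replaces the existing entry."""
    pool[proxy.strip()] = None


for source, proxy in [("X", "192.168.56.1:123"), ("Y", "192.168.56.1:123")]:
    put(proxy)
```

After both sources are processed, the dict holds one entry for `192.168.56.1:123`.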
Installing ssdb fails under Python 3; it cannot be installed under either 3.4.3 or 3.6.2. I suggest using pyssdb or ssdb.py instead.
ubuntu@hp:~/workspace/proxy_pool/Run$ python -V
Python 3.6.2
ubuntu@hp:~/workspace/proxy_pool/Run$ pip install ssdb
Collecting ssdb
Using cached ssdb-0.0.3.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-build-srsinocq/ssdb/setup.py", line 5, in <module>
from ssdb import __version__
File "/tmp/pip-build-srsinocq/ssdb/ssdb/__init__.py", line 2, in <module>
from ssdb.client import StrictSSDB, SSDB
File "/tmp/pip-build-srsinocq/ssdb/ssdb/client.py", line 3, in <module>
from itertools import chain, starmap, izip_longest
ImportError: cannot import name 'izip_longest'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-srsinocq/ssdb/
In the "wrap it in a function" usage example in the docs:
def spider():
    # ....
    requests.get('https://www.example.com', proxies={"http": "http://{}".format(get_proxy)})
    # ....
get_proxy should be get_proxy(); as written, it is never actually called.
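The difference is easy to see without sending any request. In this sketch `get_proxy` is a stub that returns a fixed address (an assumption; in the docs it fetches a proxy from the pool's API):

```python
def get_proxy():
    """Stub: the real function would ask the proxy pool for an address."""
    return "127.0.0.1:8080"


# Bug: formats the function object itself, giving "http://<function get_proxy at 0x...>"
wrong = {"http": "http://{}".format(get_proxy)}

# Fix: call the function so that the returned address is formatted.
right = {"http": "http://{}".format(get_proxy())}
```

Only `right` is a usable value for the `proxies` argument of `requests.get`.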
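Hello. When using your code, I read the results directly from Redis. However, proxies in useful_proxy_queue are only dropped when someone drops them explicitly, and dropping a proxy after a single failure makes the pool run out very quickly. So when taking items from Redis I added a check: only after 20 consecutive failures do I actually drop the proxy, and this works quite well. For now this lives in my own code, and I would like to add it to the proxy pool's interface layer. Would this feature be accepted?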
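A sketch of the rule described above: a proxy is dropped only after 20 consecutive failures, and any success resets its count. The class and method names are hypothetical, and an in-memory dict stands in for Redis:

```python
MAX_FAILURES = 20  # threshold mentioned above


class FailureTracker:
    def __init__(self, proxies):
        self.failures = {proxy: 0 for proxy in proxies}

    def report(self, proxy, ok):
        """Record one use of a proxy; return True if it is still in the pool."""
        if proxy not in self.failures:
            return False
        if ok:
            self.failures[proxy] = 0  # a success resets the streak
            return True
        self.failures[proxy] += 1
        if self.failures[proxy] >= MAX_FAILURES:
            del self.failures[proxy]  # drop after 20 failures in a row
            return False
        return True


tracker = FailureTracker(["1.2.3.4:80"])
for _ in range(19):
    tracker.report("1.2.3.4:80", ok=False)
still_there = "1.2.3.4:80" in tracker.failures

tracker.report("1.2.3.4:80", ok=True)  # streak resets to 0
for _ in range(20):
    kept = tracker.report("1.2.3.4:80", ok=False)
```

Resetting on success is a design choice made here; counting total failures instead would drop busy proxies much sooner.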
In proxy_pool/DB/RedisClient.py,
pop should be changed to:
def pop(self):
    return self.__conn.spop(self.name)
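Dear author:
Hello, I downloaded your project and got it running on Linux, but it only served IP addresses to my program for a few minutes. After that, proxyApi started returning 500 errors and stopped serving IP addresses. Why is that? Below is the log from proxyApi.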
...................................
ValueError: View function did not return a response
127.0.0.1 - - [22/Jan/2017 17:07:58] "GET /get/ HTTP/1.1" 500 -
[2017-01-22 17:07:59,348] ERROR in app: Exception on /get/ [GET]
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1988, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1642, in full_dispatch_request
response = self.make_response(rv)
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1731, in make_response
raise ValueError('View function did not return a response')
ValueError: View function did not return a response
127.0.0.1 - - [22/Jan/2017 17:07:59] "GET /get/ HTTP/1.1" 500 -
hi,
At https://github.com/jhao104/proxy_pool/blob/master/Schedule/ProxyRefreshSchedule.py#L51
I do not understand exist_proxy here:
self.db.changeTable(self.raw_proxy_queue)
raw_proxy = self.db.pop()
self.log.info('%s start validProxy_a' % time.ctime())
exist_proxy = self.db.getAll()
Shouldn't exist_proxy be read from useful_proxy_queue here?
ERROR in app: Exception on /get/ [GET]
Traceback (most recent call last):
File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1988, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1642, in full_dispatch_request
response = self.make_response(rv)
File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1731, in make_response
raise ValueError('View function did not return a response')
ValueError: View function did not return a response
ERROR:Api.ProxyApi:Exception on /get/ [GET]
Traceback (most recent call last):
File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1988, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1642, in full_dispatch_request
response = self.make_response(rv)
File "/usr/local/python/lib/python2.7/site-packages/flask/app.py", line 1731, in make_response
raise ValueError('View function did not return a response')
ValueError: View function did not return a response
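CentOS 7, Python 2.7.13, using curl http://localhost:5000/get/
Local testing works fine, but accessing it on an overseas cloud server produces the error.
The scheduled task may also have a problem. I am not very familiar with Python; could you help me see what the problems are and how to fix them?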
ERROR:apscheduler.executors.default:Job "main (trigger: interval[0:05:00], next run at: 2017-07-07 17:29:56 CST)" raised an exception
Traceback (most recent call last):
File "/usr/local/python/lib/python2.7/site-packages/apscheduler/executors/base.py", line 125, in run_job
retval = job.func(*job.args, **job.kwargs)
File "../Schedule/ProxyRefreshSchedule.py", line 73, in main
p.refresh()
File "../Manager/ProxyManager.py", line 42, in refresh
for proxy in getattr(GetFreeProxy, proxyGetter.strip())():
File "../ProxyGetter/getFreeProxy.py", line 80, in freeProxySecond
for proxy in re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}', html):
File "/usr/local/python/Lib/re.py", line 181, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
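Looking at the code, I only see IPs from raw_proxy_queue being validated, with the currently usable ones put into useful_proxy_queue. I do not see any code that validates the IPs already in useful_proxy_queue. These free IPs can stop working at any time, so they also need to be re-validated.
What did you use to draw the flowchart?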
[root@VPS Run]# python main.py
Traceback (most recent call last):
File "main.py", line 22, in <module>
from Schedule.ProxyRefreshSchedule import run as RefreshRun
File "../Schedule/ProxyRefreshSchedule.py", line 21, in <module>
from apscheduler.schedulers.blocking import BlockingScheduler
File "/usr/local/python27/lib/python2.7/site-packages/apscheduler/__init__.py", line 2, in <module>
release = __import__('pkg_resources').get_distribution('APScheduler').version.split('-')[0]
File "/usr/local/python27/lib/python2.7/site-packages/distribute-0.6.10-py2.7.egg/pkg_resources.py", line 292, in get_distribution
if isinstance(dist,Requirement): dist = get_provider(dist)
File "/usr/local/python27/lib/python2.7/site-packages/distribute-0.6.10-py2.7.egg/pkg_resources.py", line 176, in get_provider
return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
File "/usr/local/python27/lib/python2.7/site-packages/distribute-0.6.10-py2.7.egg/pkg_resources.py", line 648, in require
needed = self.resolve(parse_requirements(requirements))
File "/usr/local/python27/lib/python2.7/site-packages/distribute-0.6.10-py2.7.egg/pkg_resources.py", line 546, in resolve
raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: APScheduler
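Starting the collector fails with: TypeError: __init__() got an unexpected keyword argument 'minute'
Could the database be swapped for a NoSQL store implemented in pure Python, ideally one that supports Python 3 directly?
Almost everyone uses Python 3 now.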
Internal Server Error
The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application. ---># in browser
ERROR:apscheduler.executors.default:Job "main (trigger: interval[0:05:00], next run at: 2017-06-14 23:20:31 CEST)" raised an exception
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/apscheduler/executors/base.py", line 125, in run_job
retval = job.func(*job.args, **job.kwargs)
File "../Schedule/ProxyRefreshSchedule.py", line 73, in main
p.refresh()
File "../Manager/ProxyManager.py", line 42, in refresh
for proxy in getattr(GetFreeProxy, proxyGetter.strip())():
File "../ProxyGetter/getFreeProxy.py", line 79, in freeProxySecond
html = getHTMLText(url, headers=HEADER)
File "../Util/utilFunction.py", line 31, in getHTMLText
return response.status_code
UnboundLocalError: local variable 'response' referenced before assignment
---># in terminal
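Both this `UnboundLocalError` and the `TypeError: expected string or buffer` reported on this page fit a fetch helper that does not return text when the request fails. A hedged sketch of a helper that always returns a string, written with the standard library's `urllib` rather than the project's own `getHTMLText` (whose source is not shown here). The function name and the timeout are assumptions:

```python
import re
import urllib.error
import urllib.request


def get_html_text(url, timeout=10):
    """Return the page body as text, or an empty string on any fetch error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, ValueError, OSError):
        return ""


# A malformed URL fails before any network access, so this runs offline.
html = get_html_text("not-a-url")
proxies = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}', html)
```

Because the helper never returns `None` or a status code, `re.findall` always receives a string and a failed source simply contributes no proxies.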
I am using CentOS 6 with Python 3.5 and get all kinds of errors.
[root@localhost proxy_pool-master]# python -m Schedule.ProxyRefreshSchedule
Traceback (most recent call last):
File "/usr/local/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/proxy_pool-master/Schedule/ProxyRefreshSchedule.py", line 19, in <module>
from apscheduler.schedulers.blocking import BlockingScheduler
ImportError: No module named 'apscheduler'
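Is something not installed correctly?
Would it be possible to specify an https or http proxy when calling get? After all, some proxies only support http or only https.
Thanks!
A question: after running ProxyApi.py under Api, and ProxyRefreshSchedule.py and ProxyValidSchedule.py under Schedule, I would like to view the results with the SSDBAdmin visualization tool mentioned in the README. How should 'host' and 'port' be set in SSDBAdmin's setting.py?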
#!/usr/bin/env python
servers = [
{
"host": "172.16.1.69",
"port": 8889
},
{
"host": "127.0.0.1",
"port": 8889
}
]
DEBUG = False
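While ProxyApi.py, ProxyRefreshSchedule.py and ProxyValidSchedule.py were running, starting runserver.py under SSDBAdmin failed with a socket-in-use error: socket.error: [Errno 98] Address already in use. I have no prior experience with Redis or other databases, so any guidance is appreciated. Thanks!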
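When scraping proxy IPs from a web page, determine the scheme and store it in the SSDB value. In that case the factory class would probably need to change, and ProxyManager as well.
What is the best practice in this situation?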
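One possible shape for this, sketched with a dict standing in for the SSDB hash: store the scheme as the value next to each proxy and filter on it when serving. The function names and the `"http"`/`"https"` labels are assumptions, not the project's API:

```python
pool = {}  # proxy "ip:port" -> scheme, standing in for the SSDB hash


def put(proxy, scheme):
    """Store a proxy together with the scheme it was found to support."""
    pool[proxy] = scheme


def get_all(scheme=None):
    """Return all proxies, or only those stored with the given scheme."""
    return sorted(p for p, s in pool.items() if scheme is None or s == scheme)


put("1.2.3.4:80", "http")
put("5.6.7.8:443", "https")
https_only = get_all("https")
```

Keeping the scheme in the value leaves the key unchanged, so existing lookups by `ip:port` would keep working.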
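I wonder whether anyone else has seen this: fetching proxies from goubanjia is broken, and the proxies come out looking like "222222.9.944.144.9999:8616".
1. Can it detect SOCKS5 proxy IPs?
2. Can it run on Python 3.x?
Following the docs, I installed the dependencies and ssdb, but after running python there are only two main.py processes, and the log file shows no errors. Has something gone wrong?
Checking against the honeypot http://httpbin.org/ip would be a bit faster.
Also, Kuaidaili no longer seems to need JS rendering.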