urllib3 is a user-friendly HTTP client library for Python.
Home Page: https://urllib3.readthedocs.io
License: MIT License
This is an example of a URL that urllib3 fails to parse correctly. It appears that it un-encodes the URL and then gets confused by the colon in the mailto: target that is part of the query string.
Strangely enough, requests generates a different error for the same URL. urllib3 makes the pool think that the URL is from a different host; requests thinks the other host is a new port number... and tries to int() the string...
Is this a known issue?
>>> http_pool = urllib3.connection_from_url('http://stats.e-go.gr/rx.asp?nWebSrvID=100230&nCatID=23425&nLevelId=-20&target=mailto%3Aadvertising%40pegasusinteractive%2Egr')
>>> r = http_pool.get_url('http://stats.e-go.gr/rx.asp?nWebSrvID=100230&nCatID=23425&nLevelId=-20&target=mailto%3Aadvertising%40pegasusinteractive%2Egr')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "urllib3/request.py", line 136, in get_url
**urlopen_kw)
File "urllib3/request.py", line 78, in request_encode_url
return self.urlopen(method, url, **urlopen_kw)
File "urllib3/connectionpool.py", line 410, in urlopen
retries - 1, redirect, assert_same_host)
File "urllib3/connectionpool.py", line 341, in urlopen
raise HostChangedError(host, url, retries - 1)
urllib3.exceptions.HostChangedError: Connection pool with host 'http://stats.e-go.gr' tried to open a foreign host: mailto:[email protected]
>>>
>>>
>>> import requests
>>> r = requests.get('http://stats.e-go.gr/rx.asp?nWebSrvID=100230&nCatID=23425&nLevelId=-20&target=mailto%3Aadvertising%40pegasusinteractive%2Egr')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 50, in get
return request('get', url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 38, in request
return s.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 200, in request
r.send(prefetch=prefetch)
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 514, in send
self._build_response(r)
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 253, in _build_response
request.send()
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 430, in send
conn = self._poolmanager.connection_from_url(url)
File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/poolmanager.py", line 94, in connection_from_url
scheme, host, port = get_host(url)
File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py", line 524, in get_host
port = int(port)
ValueError: invalid literal for int() with base 10: 'advertising%40pegasusinteractive.gr'
>>>
Just as a check, this is parsed correctly by urlparse:
>>> urlparse.urlparse('http://stats.e-go.gr/rx.asp?nWebSrvID=100230&nCatID=23425&nLevelId=-20&target=mailto%3Aadvertising%40pegasusinteractive%2Egr')
ParseResult(scheme='http', netloc='stats.e-go.gr', path='/rx.asp', params='', query='nWebSrvID=100230&nCatID=23425&nLevelId=-20&target=mailto%3Aadvertising%40pegasusinteractive%2Egr', fragment='')
>>>
>>> urlparse.urlparse('http://stats.e-go.gr/rx.asp?nWebSrvID=100230&nCatID=23425&nLevelId=-20&target=mailto%3Aadvertising%40pegasusinteractive%2Egr').port
>>>
But, doh! This is the same as issue #39, because it is a bogus redirect... nice work stats.e-go.gr....
>>> r = requests.get('http://stats.e-go.gr/rx.asp?nWebSrvID=100230&nCatID=23425&nLevelId=-20&target=mailto%3Aadvertising%40pegasusinteractive%2Egr', allow_redirects=False)
>>> r
<Response [302]>
>>> r.headers
{'content-length': '161', 'x-powered-by': 'ASP.NET', 'set-cookie': 'nUserID=4862144; expires=Mon, 21-Jan-2013 22:00:00 GMT; path=/, ASPSESSIONIDCSQABRTR=NOJOKALCFLCHFCALIBBMFJIJ; path=/', 'expires': 'Mon, 23 Jan 2012 13:40:53 GMT', 'server': 'Microsoft-IIS/6.0', 'location': 'mailto:[email protected]', 'cache-control': 'False', 'date': 'Mon, 23 Jan 2012 13:41:53 GMT', 'p3p': 'CP="CURa ADMa DEVa PSAo PSDo OUR BUS UNI PUR INT DEM STA PRE COM NAV OTC NOI DSP COR"', 'content-type': 'text/html'}
>>>
>>>
( Moved over from requests: https://github.com/kennethreitz/requests/issues/306 )
On attempting to import urllib3 in a python environment without SSL support, the following error is received:
File "urllib3/connectionpool.py", line 11, in <module>
from httplib import HTTPConnection, HTTPSConnection, HTTPException
ImportError: cannot import name HTTPSConnection
For comparison / reference, urllib2 does import on the same system. Attempting to retrieve an HTTPS URL via urllib2.urlopen(...) results in this error:
urllib2.URLError: <urlopen error unknown url type: https>
Standard HTTP requests, however, using urllib2 work fine in this python environment.
Perhaps urllib3 should error out in a similar fashion?
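A hedged sketch of that idea (the import guard and helper names below are illustrative, not urllib3's actual structure):
from httplib import HTTPException
try:
    from httplib import HTTPSConnection
except ImportError:  # Python built without SSL support
    HTTPSConnection = None

def new_https_connection(host, port=443):
    if HTTPSConnection is None:
        raise HTTPException("Python is compiled without SSL support; "
                            "https URLs are unavailable")
    return HTTPSConnection(host, port)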
It would be really convenient if one could configure proxies with environment variables (case in point, urllib).
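A hedged sketch of what that could look like, reusing the existing proxy_from_url helper and the http_proxy convention from urllib:
import os
import urllib3

# Honor the conventional environment variable, fall back to a direct pool.
proxy_url = os.environ.get('http_proxy')
if proxy_url:
    http = urllib3.proxy_from_url(proxy_url)
else:
    http = urllib3.PoolManager()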
After
pool = HTTPConnectionPool(host, port)
pool.urlopen('GET', '/', release_conn=False)
the connection is released anyway.
Google App Engine restricts which modules can be used in their environment, and sadly, select is not on their whitelist. Is there a way to make urllib3 work on App Engine without using this module? requests depends on urllib3, and the select module prevents users from using requests on App Engine.
Currently httplib coerces headers into a dict, which breaks some things. Some monkeypatching might be required.
See also: Issue 15 @ GoogleCode for more discussion.
I'm unable to re-open issue #61 for some reason, so reporting this separately.
It doesn't appear to work with POST requests:
import webapp2
import urllib3

class MainPage(webapp2.RequestHandler):
    def get(self):
        http = urllib3.PoolManager()
        r = http.request('GET', 'http://google.com/')
        r2 = http.request('GET', 'http://yahoo.com/')
        r3 = http.request('POST', 'http://www.example.com/')
        r4 = http.request('POST', 'http://www.example.com/', fields={"foo": "bar"})
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.out.write(r.status)
        self.response.out.write(r2.status)
        self.response.out.write(r3.status)

app = webapp2.WSGIApplication([('/', MainPage)],
                              debug=True)
Traceback:
The API package 'remote_socket' or call 'Resolve()' was not found.
Traceback (most recent call last):
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1511, in __call__
rv = self.handle_exception(request, response, e)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
rv = self.router.dispatch(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
return handler.dispatch()
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 547, in dispatch
return self.handle_exception(e, self.app.debug)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~megachunt/1.358837729334117339/urllib3_test.py", line 9, in get
r3 = http.request('POST', 'http://www.example.com/')
File "/base/data/home/apps/s~megachunt/1.358837729334117339/urllib3/request.py", line 71, in request
**urlopen_kw)
File "/base/data/home/apps/s~megachunt/1.358837729334117339/urllib3/request.py", line 119, in request_encode_body
boundary=multipart_boundary)
File "/base/data/home/apps/s~megachunt/1.358837729334117339/urllib3/filepost.py", line 57, in encode_multipart_formdata
boundary = choose_boundary()
File "/base/python27_runtime/python27_dist/lib/python2.7/mimetools.py", line 140, in choose_boundary
hostid = socket.gethostbyname(socket.gethostname())
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/remote_socket/_remote_socket.py", line 299, in gethostbyname
return _Resolve(host, [AF_INET])[2][0]
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/remote_socket/_remote_socket.py", line 251, in _Resolve
canon, aliases, addresses = _ResolveName(name, families)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/remote_socket/_remote_socket.py", line 269, in _ResolveName
apiproxy_stub_map.MakeSyncCall('remote_socket', 'Resolve', request, reply)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 94, in MakeSyncCall
return stubmap.MakeSyncCall(service, call, request, response)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 308, in MakeSyncCall
rpc.CheckSuccess()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 133, in CheckSuccess
raise self.exception
CallNotFoundError: The API package 'remote_socket' or call 'Resolve()' was not found.
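The failure comes from mimetools.choose_boundary(), which resolves the local hostname, an operation App Engine forbids. Since a multipart boundary only needs to be unique, a hedged workaround sketch:
import uuid

def choose_boundary():
    # No network or DNS access needed, unlike mimetools.choose_boundary().
    return uuid.uuid4().hex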
urllib3 cannot access some web pages.
import urllib3
http = urllib3.PoolManager()
url = 'http://waptt.com/'
r = http.request('GET', url, retries = 5)
print r.status
404
But using curl I get a 200 status:
curl -I http://waptt.com
HTTP/1.1 200 OK
I think there is a problem in the use of httplib.HTTPConnection's request method when it is called at line 213 of urllib3/connectionpool.py: you pass it the full URL, containing the scheme and host, instead of just the path (and query part), as shown in the httplib usage examples.
This results in a malformed HTTP request being sent to the server. To see it, you can for instance run
python -m SimpleHTTPServer
in a shell and then, in another one, run
python -c 'from urllib3 import PoolManager; http = PoolManager(); http.request( "GET", "http://localhost:8000/this/is/an/example" )'
and compare what the access log in the first shell reports as compared to what happens if you do
curl "http://localhost:8000/this/is/an/example"
I can submit a patch, but I'm not an urllib3 expert so I will probably miss some other place where the same error occurs.
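For reference, a hedged sketch of the fix being proposed: hand httplib only the path and query, since the pool is already bound to the scheme and host (request_uri is an illustrative name):
from urlparse import urlsplit  # Python 2

def request_uri(url):
    parts = urlsplit(url)
    uri = parts.path or '/'
    if parts.query:
        uri += '?' + parts.query
    return uri  # e.g. '/this/is/an/example' for the URL above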
There is currently one test that checks SSL cert verification, in tests/with_dummyserver/test_https.py. More tests would be great. :)
Specifically we want to reach good coverage for HTTPS-related code.
In our production environment, we've noticed the following errors in our logs, pool size set to 500:
FetchError: FetchError: msg = "No retries left, giving up", original exception = "EmptyPoolError("HTTPConnectionPool(host='SUPER_SECRET', port=80): Pool reached maximum size and no more connections are allowed.",)", url = "http://SUPER_SECRET_URL/", retries = 2
ERROR - base.py:fetch:166 - FetchError: HTTPConnectionPool(host='SUPER_SECRET', port=80): Pool reached maximum size and no more connections are allowed., retries left = 2
After a bit of investigation, it looks like the number of connections in the pool is slowly decreasing over time as we encounter other errors, forced timeouts, etc. On urllib3/connectionpool.py:424, if httplib/socket raises an error, the connection will be dropped from the pool, because it was acquired on line 382 and never put back on line 431 since conn == None.
This issue can be replicated with the following gist (uses gevent.Timeout, so quasi-related to #69): https://gist.github.com/2932793
It looks like the if conn check on line 431 was originally added to ensure that a SocketError raised from _get_conn() didn't add None to the pool (since a connection would never be acquired), but in the current incarnation it has the effect of causing the pool size to shrink over time whenever a connection is acquired and any error is raised from httplib/socket.
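A hedged sketch of one possible fix inside urlopen(), assuming the SocketError and HTTPException names already imported in connectionpool.py; _put_conn(None) works because _get_conn() creates a fresh connection when it pops a None sentinel:
conn = self._get_conn()
try:
    httplib_response = self._make_request(conn, method, url, **kw)
except (SocketError, HTTPException):
    conn = None  # Broken connection; discard it.
    raise
finally:
    # Always give the slot back so the pool's capacity never shrinks.
    self._put_conn(conn)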
Can't figure this one out. Any ideas?
from urllib3.connectionpool import connection_from_url
url = 'https://twitter.com'
http_pool = connection_from_url(url, strict=False)
r = http_pool.urlopen('GET', url)
urllib3.exceptions.MaxRetryError: Max retries exceeded for url: https://twitter.com
We are seeing "error(35, 'Resource temporarily unavailable')" thrown from urllib3 when running on Mac. It appears this is a known issue in Python: the caller of send() must handle EAGAIN errors on BSD platforms.
This was observed with urllib3-1.1, OSX 10.6, Python 2.7. Not surprisingly, it's particularly common over slow network connections.
Turns out tornado is really eager to use IPv6. Unless you expressly hand the server the address, it doesn't even check for socket IPv6 support. I'll submit a pull request for the one-line fix in dummyserver/server.py momentarily.
Source: https://groups.google.com/group/python-tornado/browse_thread/thread/3ec04536e57a2833?pli=1
The gevent monkey patching on socket for some reason (I have yet to find out why) does not properly raise a SocketTimeout the way it should, and as a consequence urllib3 will not raise a Timeout properly if you're using gevent.
How to reproduce:
Create an unresponsive socket (to simulate a timeout):
nc -l 8080
from gevent.monkey import patch_all
patch_all()
import urllib3
http = urllib3.PoolManager()
r = http.request('GET', 'http://localhost:8080/', timeout=1)
The Timeout is never raised.
One solution is to use gevent.socket when making the connection; that should properly raise a SocketTimeout.
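A hedged sketch of that workaround: build the raw connection through gevent's socket module so its timeout machinery actually fires:
from gevent import socket as gevent_socket

def new_connection(host, port, timeout):
    # Raises a timeout error when the peer does not respond in time,
    # even under monkey-patching.
    return gevent_socket.create_connection((host, port), timeout=timeout)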
Kicking off a tracking thread for some of the work started at the PyCon 2012 Requests/urllib3 sprints.
I'm not sure as to the full history of the issue, but the general driving factor is more transparency and control over behaviors that are overly opaque in the standard library's httplib, which seems to have been written without a number of modern concerns in mind.
The validity of these concerns was at least generally confirmed by Guido himself, in person, when @kennethreitz and I spoke to him and he agreed that certain batteries included with Python have started to lose their charge; libraries like urllib/urllib2/httplib are out of touch with the new momenta of web technologies.
The refactorings involved in fixing the above issues seem to point to a new httplib-like library that is more extensible and configurable. The emergence of libraries like @benoitc's http-parser also support this direction and urllib3 is certainly one of the best-positioned libraries for attempting this sort of enhancement.
Design discussions thus far have involved @shazow, @wolever, @atdt, @kennethreitz, @easel, @doublereedkurt, and @brandon-rhodes. More notes to follow.
@kennethreitz is starting some informal work on this somewhere.
Issues that are affected by this: https://github.com/shazow/urllib3/issues?milestone=1&state=open
Let's keep this thread updated as we make progress.
This URL parses correctly; however, the redirect leads to a bogus URL (nice work, Patch.com :-)
Can we cause this to generate a more useful exception? Something specifically about the redirect being bogus?
Here is the URL parsing correctly:
>>> import urlparse
>>> urlparse.urlparse('http://stclairshores.patch.com/articles/shores-veteran-to-receive-complimentary-wedding-on-veterans-day')
ParseResult(scheme='http', netloc='stclairshores.patch.com', path='/articles/shores-veteran-to-receive-complimentary-wedding-on-veterans-day', params='', query='', fragment='')
>>> urlparse.urlparse('http://stclairshores.patch.com/articles/shores-veteran-to-receive-complimentary-wedding-on-veterans-day').port
>>>
Here is the redirect that it generates... notice the http://http:// at the beginning of the new location!
>>> r = requests.get('http://stclairshores.patch.com/articles/shores-veteran-to-receive-complimentary-wedding-on-veterans-day', allow_redirects=False)
>>> r.headers
{'status': '302', 'content-length': '160', 'content-encoding': 'gzip', 'set-cookie': 'p13n=%5B%5D; path=/, _patch_session=BAh7BzoPc2Vzc2lvbl9pZCIlNmEzYzM4YzZjMWIxMTdiNDkxMmEwNmEwM2JmZDQzYTU6FnByb21wdF9mb3Jfc3VydmV5aQA%3D--547af227538bb1d039809a3d01eaadb320a9a42b; domain=patch.com; path=/', 'x-powered-by': 'Phusion Passenger (mod_rails/mod_rack) 3.0.11', 'vary': 'Accept-Encoding', 'server': 'Apache/2.2.15 (Unix) mod_ssl/2.2.15 OpenSSL/0.9.8l Phusion_Passenger/3.0.11', 'x-runtime': '15', 'location': 'http://http://www.dailytribune.com/articles/2011/11/09/news/doc4ebb336cad1c7378471368.txt?viewmode=fullstory', 'cache-control': 'no-cache', 'date': 'Mon, 23 Jan 2012 13:32:36 GMT', 'content-type': 'text/html; charset=utf-8', 'x-rack-cache': 'miss'}
>>>
So, of course urllib cannot open it:
>>> import urllib
>>> g = urllib.urlopen('http://stclairshores.patch.com/articles/shores-veteran-to-receive-complimentary-wedding-on-veterans-day').read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/urllib.py", line 84, in urlopen
return opener.open(url)
File "/usr/lib/python2.7/urllib.py", line 205, in open
return getattr(self, name)(url)
File "/usr/lib/python2.7/urllib.py", line 356, in open_http
return self.http_error(url, fp, errcode, errmsg, headers)
File "/usr/lib/python2.7/urllib.py", line 369, in http_error
result = method(url, fp, errcode, errmsg, headers)
File "/usr/lib/python2.7/urllib.py", line 632, in http_error_302
data)
File "/usr/lib/python2.7/urllib.py", line 659, in redirect_internal
return self.open(newurl)
File "/usr/lib/python2.7/urllib.py", line 205, in open
return getattr(self, name)(url)
File "/usr/lib/python2.7/urllib.py", line 331, in open_http
h = httplib.HTTP(host)
File "/usr/lib/python2.7/httplib.py", line 1061, in __init__
self._setup(self._connection_class(host, port, strict))
File "/usr/lib/python2.7/httplib.py", line 693, in __init__
self._set_hostport(host, port)
File "/usr/lib/python2.7/httplib.py", line 718, in _set_hostport
raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port: ''
urllib3 has the same error:
>>> http_pool = urllib3.connection_from_url('http://stclairshores.patch.com/articles/shores-veteran-to-receive-complimentary-wedding-on-veterans-day')
>>> r = http_pool.get_url('http://stclairshores.patch.com/articles/shores-veteran-to-receive-complimentary-wedding-on-veterans-day')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "urllib3/request.py", line 136, in get_url
**urlopen_kw)
File "urllib3/request.py", line 78, in request_encode_url
return self.urlopen(method, url, **urlopen_kw)
File "urllib3/connectionpool.py", line 410, in urlopen
retries - 1, redirect, assert_same_host)
File "urllib3/connectionpool.py", line 336, in urlopen
if assert_same_host and not self.is_same_host(url):
File "urllib3/connectionpool.py", line 246, in is_same_host
scheme, host, port = get_host(url)
File "urllib3/connectionpool.py", line 538, in get_host
port = int(port)
ValueError: invalid literal for int() with base 10: ''
>>>
and it propagates through to requests too:
>>>
>>> r = requests.get('http://stclairshores.patch.com/articles/shores-veteran-to-receive-complimentary-wedding-on-veterans-day')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 50, in get
return request('get', url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 38, in request
return s.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 200, in request
r.send(prefetch=prefetch)
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 514, in send
self._build_response(r)
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 253, in _build_response
request.send()
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 430, in send
conn = self._poolmanager.connection_from_url(url)
File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/poolmanager.py", line 94, in connection_from_url
scheme, host, port = get_host(url)
File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py", line 524, in get_host
port = int(port)
ValueError: invalid literal for int() with base 10: ''
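A hedged sketch of what a friendlier failure could look like: validate the redirect target before following it, so a bogus Location header can be reported as such instead of surfacing as a confusing int() failure deep in get_host():
from urlparse import urlsplit  # Python 2

def check_redirect_location(location):
    parts = urlsplit(location)
    if parts.scheme not in ('http', 'https') or not parts.hostname:
        raise ValueError("Bogus redirect location: %r" % location)
    parts.port  # Raises ValueError if the port is non-numeric.
    return location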
I use the code:
headers = {
    'User-Agent': 'Baiduspider+(+http://www.baidu.com/search/spider.htm)',
    'Accept-Encoding': 'gzip,deflate',
}
r = http.request('GET', 'http://www.heroone.com', headers=headers)
Error log:
HTTPError: Received response with content-encoding: gzip, but failed to decode it.
Some web servers append extra "tail" bytes after the gzip-compressed data.
Decompression modules such as Python's GzipFile raise an exception in this case, whereas browsers automatically discard the extra "tail" and extract and process the page data normally.
Python's GzipFile has an undocumented attribute, extrabuf, which holds the data that has already been successfully decompressed. The following code therefore gives better compatibility:
files :urllib3/response.py
Old code:
def decode_gzip(data):
    gzipper = gzip.GzipFile(fileobj=BytesIO(data))
    return gzipper.read()
Fixed code:
def decode_gzip(data):
    gzipper = gzip.GzipFile(fileobj=BytesIO(data))
    try:
        return gzipper.read()
    except IOError:
        # Trailing garbage after the gzip stream; fall back to the
        # data that was already decompressed successfully.
        return gzipper.extrabuf
>>> m = PoolManager()
>>> m.request('GET', 'https://google.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/request.py", line 67, in request
**urlopen_kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/request.py", line 80, in request_encode_url
return self.urlopen(method, url, **urlopen_kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/poolmanager.py", line 108, in urlopen
return self.urlopen(method, e.url, **kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/poolmanager.py", line 108, in urlopen
return self.urlopen(method, e.url, **kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/poolmanager.py", line 104, in urlopen
return conn.urlopen(method, url, **kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/connectionpool.py", line 361, in urlopen
raise MaxRetryError(self, url)
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.google.de', port=443): Max retries exceeded with url: https://www.google.de/
It is not an HTTPS error:
>>> m.request('GET', 'http://google.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/request.py", line 67, in request
**urlopen_kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/request.py", line 80, in request_encode_url
return self.urlopen(method, url, **urlopen_kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/poolmanager.py", line 108, in urlopen
return self.urlopen(method, e.url, **kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/poolmanager.py", line 108, in urlopen
return self.urlopen(method, e.url, **kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/poolmanager.py", line 104, in urlopen
return conn.urlopen(method, url, **kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/connectionpool.py", line 361, in urlopen
raise MaxRetryError(self, url)
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.google.de', port=80): Max retries exceeded with url: http://www.google.de/
The server is getting confused with the request "GET http://blag.xkcd.com/ HTTP/1.1":
>>> import urllib3
>>> url = "http://blag.xkcd.com/"
>>> conn = urllib3.connection_from_url(url)
>>> r = conn.request("GET", url, redirect=False)
>>> r.status
301
>>> r.get_redirect_location()
'http://blag.xkcd.comhttp/blag.xkcd.com/'
>>> r.headers
{'content-length': '0', 'x-powered-by': 'PHP/5.2.6-1+lenny13', 'expires': 'Wed, 11 Jan 1984 05:00:00 GMT', 'vary': 'Accept-Encoding', 'server': 'Apache', 'last-modified': 'Sun, 05 Feb 2012 08:15:10 GMT', 'connection': 'close', 'location': 'http://blag.xkcd.comhttp/blag.xkcd.com/', 'pragma': 'no-cache', 'cache-control': 'no-cache, must-revalidate, max-age=0', 'date': 'Sun, 05 Feb 2012 08:15:10 GMT', 'content-type': 'text/html; charset=UTF-8', 'x-pingback': 'http://blog.xkcd.com/xmlrpc.php'}
It works fine if the request is "GET / HTTP/1.1":
>>> r = conn.request("GET", "/", redirect=False)
>>> r.status
200
>>> r.headers
{'x-powered-by': 'PHP/5.2.6-1+lenny13', 'transfer-encoding': 'chunked', 'vary': 'Accept-Encoding', 'server': 'Apache', 'connection': 'close', 'date': 'Sun, 05 Feb 2012 08:16:55 GMT', 'content-type': 'text/html; charset=UTF-8', 'x-pingback': 'http://blog.xkcd.com/xmlrpc.php'}
This behavior might be considered the user's fault, but the same behavior is seen with PoolManager, where the library is expected to figure out the correct connection from the full URL.
Hi, I maintain a fork of socksipy-branch, called socksipy-x, which is at https://github.com/brendoncrawford/socksipy-x. Socksipy and Socksipy-Branch have not been updated for a while, so I intend to fix some bugs and try to maintain the codebase when needed.
Before I embark on the task of adding SOCKS support to urllib3, I wanted to see if this is even something you would be interested in merging in? If you were, it could either be referenced as a dependency/submodule, or I could just copy the entire socks.py file directly into urllib3, so no external dependency would be required.
Any thoughts?
I know this is super lame to file an issue instead of emailing a mailing list or something for a feature question, but I can't find a urllib3 mailing list anywhere.
Anyway, I am trying to find any kind of support for HTTP pipelining in an existing library in Python before attempting something more drastic. Specifically, I am trying to pipeline a series of PUTs (yeah, they're idempotent), like so:
PUT
PUT
PUT
get response
get response
get response
Ideally, I'd love for this to be handled for me in some kind of threadsafe way, but I'm willing to do it myself. urllib3 seems really close! Threadsafe, connection pooling, the works! I even saw some other website that offhandedly speculated that urllib3 might even do pipelining. But I hardly think that's possible, as you have to call release_connection manually after reading the body of a response.
Anyway, do you know anything about this? It's sort of surprising to me how undersupported this HTTP feature seems to be. If urllib3 doesn't support it, any wise thoughts on what I might have to do instead?
-JT
import requests
requests.get('http://online.wsj.com?CALL_URL=http://online.wsj.com/article/SB10001424052702303640104577436251166644714.html%3fmod=googlenews_wsj')
Traceback (most recent call last):
File "", line 1, in
File "/Users/lsemel/www/virtualenvs/muckrack/lib/python2.7/site-packages/requests/api.py", line 51, in get
return request('get', url, *_kwargs)
File "/Users/lsemel/www/virtualenvs/muckrack/lib/python2.7/site-packages/requests/api.py", line 39, in request
return s.request(method=method, url=url, *_kwargs)
File "/Users/lsemel/www/virtualenvs/muckrack/lib/python2.7/site-packages/requests/sessions.py", line 200, in request
r.send(prefetch=prefetch)
File "/Users/lsemel/www/virtualenvs/muckrack/lib/python2.7/site-packages/requests/models.py", line 463, in send
conn = self._poolmanager.connection_from_url(url)
File "/Users/lsemel/www/virtualenvs/muckrack/lib/python2.7/site-packages/requests/packages/urllib3/poolmanager.py", line 89, in connection_from_url
scheme, host, port = get_host(url)
File "/Users/lsemel/www/virtualenvs/muckrack/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 557, in get_host
port = int(port)
ValueError: invalid literal for int() with base 10: ''
https://github.com/kennethreitz/requests/blob/develop/requests/packages/urllib3/util.py#L75 should probably chop off the query string before checking for a ':'.
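A hedged sketch of that approach, leaning on the standard library instead of a hand-rolled colon search:
from urlparse import urlsplit  # Python 2

def get_host(url):
    parts = urlsplit(url)
    # parts.port is derived from the netloc alone, so a ':' inside the
    # query string can no longer be mistaken for a port separator.
    return parts.scheme or 'http', parts.hostname, parts.port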
Hi, I realized that custom 'Accept' headers are getting overwritten by the ProxyManager._set_proxy_headers function in poolmanager.py (line 124).
I did a bit of googling but couldn't come up with a reason why this should be done; if this modification of the 'Accept' header is not needed, would it be possible to get it removed?
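A hedged sketch of the idea (a simplified standalone function, not the real _set_proxy_headers signature): supply the default only when the caller has not set one, so custom Accept headers survive:
def merge_proxy_headers(user_headers):
    headers = {'Accept': '*/*'}         # defaults first
    headers.update(user_headers or {})  # the caller's headers win
    return headers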
Why do sites that respond with a 301 go wrong?
http = urllib3.PoolManager()
conn = http.connection_from_url("http://www.opda.com.cn")
r = conn.request("GET", "/", retries=5)
Error log:
HostChangedError: HTTPConnectionPool(host='www.opda.com.cn', port=80): Tried to open a foreign host with url: forum.php
I have been using urllib3 via Requests, and I noticed that I was only getting one cookie back. After some investigation, it looks like the problem is caused by the fact that all of the HTTP headers are converted from a list of 2-tuples (from httplib) to a dictionary. This is normally okay, but when the server sends back multiple Set-Cookie headers, all but the last are overwritten.
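For reference, a hedged sketch of one way the conversion could keep duplicates, based on the 2-tuples that httplib's getheaders() already provides:
def collect_headers(header_items):
    collected = {}
    for name, value in header_items:  # e.g. httplib_response.getheaders()
        collected.setdefault(name.lower(), []).append(value)
    return collected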
Minimal code to reproduce. I will leave that URL up until this issue is resolved or I am smacked upside the header for something stupid I am missing.
import urllib3
http = urllib3.PoolManager()
r = http.request("GET", "http://www.joelverhagen.com/sandbox/tests/cookies.php")
print(r.headers)
I am using Python 3.2.2 and urllib3 v1.2.2.
Edit: It may be a good idea to open an issue on the Requests GitHub repo if this is indeed a problem. I was going to do it, but I wanted to first check on this end.
See issue 269 in Requests, which fixes the request method used on subsequent redirects.
Hello,
test-requirements.txt is not shipped with the source, so installing urllib3 fails. I also think that listing requirements inside setup.py is a better choice.
$ pip install urllib3
Downloading/unpacking urllib3
Downloading urllib3-1.2.1.tar.gz
Running setup.py egg_info for package urllib3
Traceback (most recent call last):
File "<string>", line 14, in <module>
File "/home/eriol/.virtualenvs/testurllib3/build/urllib3/setup.py", line 25, in <module>
tests_requirements = requirements + open('test-requirements.txt').readlines()
IOError: [Errno 2] No such file or directory: 'test-requirements.txt'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 14, in <module>
File "/home/eriol/.virtualenvs/testurllib3/build/urllib3/setup.py", line 25, in <module>
tests_requirements = requirements + open('test-requirements.txt').readlines()
IOError: [Errno 2] No such file or directory: 'test-requirements.txt'
Kind regards,
Daniele Tricoli
If someone today installs urllib3 with pip or easy_install into a novel environment — like the 64-bit version of an operating system that none of us have tested the library against — and receives an exception, then there is no easy way for them to quickly run the test suite to see whether the problem is that their code is faulty, or that urllib3 itself fails its own tests in the new environment.
But if we move the test directory, that currently sits anonymously at the top of the repository, down into the urllib3 directory and start shipping it as a sub-package, then users who want to double-check that urllib3 is working would simply be able to type:
python -m unittest discover urllib3
Or, if they were using Python <2.7, then they could install unittest2 and use its unit2 command line that Michael Foord developed so that older Pythons can also enjoy the new, standard way for tests to be auto-discoverable.
It is not clear to me whether the dummyserver top-level package that also sits in setup.py stands in the way of subordinating the tests beneath the package itself. (Would the dummy server need to move down inside urllib3/tests?)
Also check if it works for HTTPS proxies. It might not.
Whether a request() or urlopen() call is given retries=0 or retries=<positive-int>, the exception returned on most connection and network failures is a MaxRetryError. This prevents client code from seeing the “real error” that killed the connection and, therefore, clients based on urllib3 cannot, say, adjust their behavior based on whether the problem is any of the various errors at the HTTP level, or whether the problem is some specific socket error.
At least three solutions are possible, and I defer to the maintainers to decide which is the most urllib3-ish.
First, it could be judged a design error to have introduced MaxRetryError in the first place, and the final attempt's exception could be allowed to make it through to the client (some would want to throw in a retry_count attribute to add a bit of information about what happened, but I am not sure that that is advisable). The except clause near line 350 of connectionpool.py would then look something like:
except (HTTPException, SocketError), e:
    # Connection broken, discard. It will be replaced next _get_conn().
    conn = None
    # If this was the last try, let the caller see the real error.
    if not retries:
        raise
Other adjustments to the code might accomplish the same thing — a larger refactoring could eliminate even making the final attempt inside of an except:, for example — but this is the simplest approach I could find for this first approach.
Second, the MaxRetryError could be marked up with a list of failures. This would be a bit tricky, since there are finicky problems with keeping stack traces around, but at least the exception object from each failure could be kept around — shorn of its stack trace — and delivered back with the MaxRetryError. This would even make it clear whether the problem was an actual network death of the HTTP protocol, or whether something like a redirect loop had kept the request going through too many attempts — and thus much more information would be available for the client to determine what had happened.
Finally, retries=None could be a special signal that the bare exception should be returned. This would maintain (so far as I can see) 100% compatibility with the current code base. The ConnectionPool.urlopen() function needs two or three lines adjusted so that the None value avoids offending the various if statements, but with minimal intrusion the change lets someone like me — whose application, alas, cares quite deeply what mode of failure an HTTP request encounters — take advantage of urllib3 but without breaking existing code that might always expect the MaxRetryError exception. (Oh, and, beware — in Python 2, None < 0 is True, I just discovered, so the retries < 0 check will catch a None value unless an extra clause is inserted!)
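A hedged sketch of that extra clause inside urlopen():
# Without the "is not None" guard, retries=None would fall into this
# branch, because None < 0 is True on Python 2.
if retries is not None and retries < 0:
    raise MaxRetryError(self, url)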
Of course, other approaches might also be possible that I have not thought of.
I very much like the quality of code that I am seeing in urllib3 and, if I can only get the actual exceptions back that it encounters, these already-written connection pools are going to save me a lot of work. Thanks!
The "Getting Started" section of the documentation suggests this simple test:
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET', 'http://google.com/')
But actually running this code results in an exception:
Traceback (most recent call last):
File "test.py", line 7, in <module>
r = http.request('GET', 'http://www.google.com/')
File "/home/brandon/v/local/lib/python2.7/site-packages/urllib3/request.py", line 65, in request
**urlopen_kw)
File "/home/brandon/v/local/lib/python2.7/site-packages/urllib3/request.py", line 78, in request_encode_url
return self.urlopen(method, url, **urlopen_kw)
File "/home/brandon/v/local/lib/python2.7/site-packages/urllib3/poolmanager.py", line 113, in urlopen
return self.urlopen(method, e.new_url, **kw)
File "/home/brandon/v/local/lib/python2.7/site-packages/urllib3/poolmanager.py", line 113, in urlopen
return self.urlopen(method, e.new_url, **kw)
File "/home/brandon/v/local/lib/python2.7/site-packages/urllib3/poolmanager.py", line 113, in urlopen
return self.urlopen(method, e.new_url, **kw)
File "/home/brandon/v/local/lib/python2.7/site-packages/urllib3/poolmanager.py", line 113, in urlopen
return self.urlopen(method, e.new_url, **kw)
File "/home/brandon/v/local/lib/python2.7/site-packages/urllib3/poolmanager.py", line 109, in urlopen
return conn.urlopen(method, url, **kw)
File "/home/brandon/v/local/lib/python2.7/site-packages/urllib3/connectionpool.py", line 309, in urlopen
raise MaxRetryError(url)
urllib3.exceptions.MaxRetryError: Max retries exceeded for url: http://www.google.com/
The reason is that each attempt to make a connection is dying with HostChangedError, because the is_same_host() method in connectionpool.py is getting back the tuple ('http', 'www.google.com', None) from get_host(url) but the slightly different tuple ('http', 'www.google.com', 80) when it combines self.scheme with self.host and self.port.
Attempting the test with the URL http://google.com:80/ also fails, because a first successful request is made that redirects to http://www.google.com/, which then dies as well with the None != 80 problem.
Only a test with the URL http://www.google.com:80/ succeeds, because only in that case is the explicit port number present to make is_same_host() succeed, and no subsequent redirect occurs to break things.
It looks like either get_host() needs to throw in the port 80 when building its tuple, or — if that would ruin the purity and usefulness of that function — the is_same_host() function needs to detect the None port number coming back from get_host() and upgrade it to 80 or 443 as appropriate.
Or: is the problem that the self.port that is_same_host() is grabbing used to have the value None as well, and it's the port having the value 80 so early in the process that is the problem here?
If more experienced project members could point me in the right direction here, I would be happy to contribute a patch and pull request. I could even add a test for the usage example in the documentation, so that it stays working. :) Thanks for your work on urllib3!
Requests to access a secure website (SSL/TLS) fail through a proxy.
Urllib3 does not properly implement the HTTP CONNECT method.
For example the following code should print 200.
Instead, with a burp proxy, it prints 502.
import urllib3
proxy = urllib3.proxy_from_url('http://localhost:8080/')
response = proxy.urlopen('GET', 'https://www.google.com/index.html')
print response.status
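For reference, a hedged sketch of the missing piece at the httplib level (Python 2.7+): set_tunnel() makes httplib issue the CONNECT request to the proxy before the TLS handshake.
import httplib

conn = httplib.HTTPSConnection('localhost', 8080)  # the proxy
conn.set_tunnel('www.google.com', 443)             # sends CONNECT www.google.com:443
conn.request('GET', '/index.html')
print conn.getresponse().status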
Issue 293.
I guess in the end we should use the standard and tested urlparse.urlsplit :)
I'm testing 1.0.1
As noted in the code, RecentlyUsedContainer is not threadsafe.
I'm getting: RuntimeError: deque mutated during iteration
Stacktrace:
File "/home/chrsjo/src/arkenutils_omkomp/arken/hcap.py", line 153, in _request
return self._pool.request(*args, **kwargs)
File ".../urllib3-1.0.1-py2.7.egg/urllib3/request.py", line 65, in request
**urlopen_kw)
File ".../urllib3-1.0.1-py2.7.egg/urllib3/request.py", line 78, in request_encode_url
return self.urlopen(method, url, **urlopen_kw)
File ".../urllib3-1.0.1-py2.7.egg/urllib3/poolmanager.py", line 107, in urlopen
conn = self.connection_from_url(url)
File ".../urllib3-1.0.1-py2.7.egg/urllib3/poolmanager.py", line 98, in connection_from_url
return self.connection_from_host(host, port=port, scheme=scheme)
File ".../urllib3-1.0.1-py2.7.egg/urllib3/poolmanager.py", line 73, in connection_from_host
pool = self.pools.get(pool_key)
File "/home/chrsjo/.virtualenvs/arkenutils_omkomp/lib/python2.7/_abcoll.py", line 342, in get
return self[key]
File ".../urllib3-1.0.1-py2.7.egg/urllib3/_collections.py", line 96, in __getitem__
self._prune_invalidated_entries()
File ".../urllib3-1.0.1-py2.7.egg/urllib3/_collections.py", line 77, in _prune_invalidated_entries
self.access_log = deque(e for e in self.access_log if e.is_valid)
File ".../urllib3-1.0.1-py2.7.egg/urllib3/_collections.py", line 77, in <genexpr>
self.access_log = deque(e for e in self.access_log if e.is_valid)
RuntimeError: deque mutated during iteration
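A hedged sketch of one way to make it safe: serialize access with a lock so the deque is never mutated while another thread iterates it (names illustrative):
import threading
from collections import deque

class LockedAccessLog(object):
    def __init__(self):
        self._lock = threading.Lock()
        self.access_log = deque()

    def push(self, entry):
        with self._lock:
            self.access_log.appendleft(entry)

    def prune_invalidated(self):
        with self._lock:
            self.access_log = deque(e for e in self.access_log
                                    if e.is_valid)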
Python 2.7 has added support for buffering when reading from HTTP connections, which leads to a significant improvement in HTTP client performance.
See: http://bugs.python.org/issue4879
It looks like the only changes required would be in HTTPConnectionPool._make_request, passing buffering=True to the getresponse() call on Python 2.7 and later.
urllib2.py does the following:
try:
    r = h.getresponse(buffering=True)
except TypeError:  # buffering kw not supported
    r = h.getresponse()
With the recent World IPv6 Launch ( http://www.worldipv6launch.org/ ), it would be really nice if urllib3 could support IPv6 (in such a way that finally requests can support it too).
Currently, it is not possible to post a 4 GB file using urllib3 since that requires reading the entire file content into a buffer before sending. (unless I'm missing something?)
It would be nice if we could pass file-like objects (objects with a read() method or something) as post variables, which will be read from on demand and perform well on huge files.
The only python script capable of this that I'm aware of is http://atlee.ca/software/poster/, which is a urllib2 opener class.
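A hedged sketch of the idea: read the body in fixed-size chunks so a 4 GB upload never has to fit in memory, assuming an httplib-style connection whose send() can be called repeatedly:
def send_body(conn, fileobj, blocksize=8192):
    while True:
        chunk = fileobj.read(blocksize)
        if not chunk:
            break
        conn.send(chunk)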
On SocketTimeout, a TimeoutError is raised with self.timeout as part of the exception message in urlopen(), but when this is not set, you get a TypeError: float argument required, not NoneType.
When the timeout parameter is overridden locally in the function call, the exception should take this into consideration as well.
To reproduce:
>>> from urllib3.connectionpool import connection_from_url
>>> con = connection_from_url('http://doesnotexists.com')
>>> con.urlopen('GET', 'http://doesnotexists.com', retries=0, assert_same_host=False, timeout=1.0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "urllib3/connectionpool.py", line 353, in urlopen
self.timeout)
TypeError: float argument required, not NoneType
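A hedged sketch of the fix, assuming timeout here is the per-call value (possibly None): format the message with whichever timeout was actually in effect, and use %s so a None value cannot blow up the formatting.
effective_timeout = timeout if timeout is not None else self.timeout
raise TimeoutError("Request timed out. (timeout=%s)" % (effective_timeout,))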
Code:
manager = urllib3.PoolManager()
r = manager.request('GET', 'http://ynet.co.il')
r.data returns the wrong page.
The same happens with paypal.com and other domains... The code takes the Location value (http://www.ynet.co.il) from the HTTP response but issues the GET request against the original hostname, http://ynet.co.il.
urllib3 doesn't allow multiple values for a single key if the request includes files.
I found this bug while using Requests: https://github.com/kennethreitz/requests/issues/285
I sent the following pull request to Requests, but it was rejected because the fix needs to modify urllib3:
https://github.com/kennethreitz/requests/pull/422#issuecomment-4058039
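A hedged sketch of a urllib3-side fix: accept either a dict or a sequence of (name, value) pairs when encoding the body, so the same field name can legitimately appear more than once:
def iter_field_items(fields):
    if isinstance(fields, dict):
        return fields.items()
    return fields  # e.g. [('tag', 'a'), ('tag', 'b'), ('file', f)]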
There are basically 3 lines that aren't covered. Should be easy enough to cover them with a couple more tests.
IIRC something to do with timeouts and decoding.
Very contributor friendly if you're looking to dive into the codebase. :-)
When installing with easy_install:
from urllib3 import HTTPSConnectionPool
File "/usr/local/lib/python2.7/dist-packages/urllib3-1.3-py2.7.egg/urllib3/__init__.py", line 22, in <module>
from . import exceptions
ImportError: cannot import name exceptions
Since version 2.7, httplib supports specifying a source address for HTTP(S)Connection: http://docs.python.org/library/httplib.html#httplib.HTTPConnection
Would be nice if urllib3 could let me use this when creating connections.
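For reference, a hedged sketch of the httplib parameter that would need plumbing through (Python 2.7+; 192.0.2.10 is a placeholder address):
import httplib

# Bind the outgoing socket to a specific local address (and any port).
conn = httplib.HTTPConnection('example.com', 80,
                              source_address=('192.0.2.10', 0))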
If we wrap httplib.HTTPConnection in a custom class, we could hold some additional information about each connection. This way we could enumerate all connections and print their ids in debugging mode, like curl does.
I really miss this feature, because I'm developing a simple upload application.
I think this would be an interesting feature.
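A hedged sketch of such a wrapper (names illustrative):
import itertools
import httplib

class TaggedHTTPConnection(httplib.HTTPConnection):
    """HTTPConnection that carries a unique id for debug output."""
    _counter = itertools.count()

    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self.id = next(self._counter)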
Moved here from a Requests bug