urllib3 is a user-friendly HTTP client library for Python.
Home Page: https://urllib3.readthedocs.io
License: MIT License
This is an example of a URL that urllib3 fails to parse correctly. It appears that it un-encodes the URL and then gets confused by the colon in the mailto: target that is part of the query string.
Strangely enough, requests generates a different error for the same URL. urllib3 makes the pool think that the URL is from a different host; requests thinks the other host is a new port number... and tries to int() the string...
Is this a known issue?
>>> http_pool = urllib3.connection_from_url('http://stats.e-go.gr/rx.asp?nWebSrvID=100230&nCatID=23425&nLevelId=-20&target=mailto%3Aadvertising%40pegasusinteractive%2Egr')
>>> r = http_pool.get_url('http://stats.e-go.gr/rx.asp?nWebSrvID=100230&nCatID=23425&nLevelId=-20&target=mailto%3Aadvertising%40pegasusinteractive%2Egr')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "urllib3/request.py", line 136, in get_url
**urlopen_kw)
File "urllib3/request.py", line 78, in request_encode_url
return self.urlopen(method, url, **urlopen_kw)
File "urllib3/connectionpool.py", line 410, in urlopen
retries - 1, redirect, assert_same_host)
File "urllib3/connectionpool.py", line 341, in urlopen
raise HostChangedError(host, url, retries - 1)
urllib3.exceptions.HostChangedError: Connection pool with host 'http://stats.e-go.gr' tried to open a foreign host: mailto:[email protected]
>>>
>>>
>>> import requests
>>> r = requests.get('http://stats.e-go.gr/rx.asp?nWebSrvID=100230&nCatID=23425&nLevelId=-20&target=mailto%3Aadvertising%40pegasusinteractive%2Egr')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 50, in get
return request('get', url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 38, in request
return s.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 200, in request
r.send(prefetch=prefetch)
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 514, in send
self._build_response(r)
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 253, in _build_response
request.send()
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 430, in send
conn = self._poolmanager.connection_from_url(url)
File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/poolmanager.py", line 94, in connection_from_url
scheme, host, port = get_host(url)
File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py", line 524, in get_host
port = int(port)
ValueError: invalid literal for int() with base 10: 'advertising%40pegasusinteractive.gr'
>>>
Just as a check, this is parsed correctly by urlparse:
>>> urlparse.urlparse('http://stats.e-go.gr/rx.asp?nWebSrvID=100230&nCatID=23425&nLevelId=-20&target=mailto%3Aadvertising%40pegasusinteractive%2Egr')
ParseResult(scheme='http', netloc='stats.e-go.gr', path='/rx.asp', params='', query='nWebSrvID=100230&nCatID=23425&nLevelId=-20&target=mailto%3Aadvertising%40pegasusinteractive%2Egr', fragment='')
>>>
>>> urlparse.urlparse('http://stats.e-go.gr/rx.asp?nWebSrvID=100230&nCatID=23425&nLevelId=-20&target=mailto%3Aadvertising%40pegasusinteractive%2Egr').port
>>>
But, doh! This is the same as issue #39, because it is a bogus redirect... nice work stats.e-go.gr....
>>> r = requests.get('http://stats.e-go.gr/rx.asp?nWebSrvID=100230&nCatID=23425&nLevelId=-20&target=mailto%3Aadvertising%40pegasusinteractive%2Egr', allow_redirects=False)
>>> r
<Response [302]>
>>> r.headers
{'content-length': '161', 'x-powered-by': 'ASP.NET', 'set-cookie': 'nUserID=4862144; expires=Mon, 21-Jan-2013 22:00:00 GMT; path=/, ASPSESSIONIDCSQABRTR=NOJOKALCFLCHFCALIBBMFJIJ; path=/', 'expires': 'Mon, 23 Jan 2012 13:40:53 GMT', 'server': 'Microsoft-IIS/6.0', 'location': 'mailto:[email protected]', 'cache-control': 'False', 'date': 'Mon, 23 Jan 2012 13:41:53 GMT', 'p3p': 'CP="CURa ADMa DEVa PSAo PSDo OUR BUS UNI PUR INT DEM STA PRE COM NAV OTC NOI DSP COR"', 'content-type': 'text/html'}
>>>
>>>
( Moved over from requests: https://github.com/kennethreitz/requests/issues/306 )
On attempting to import urllib3 in a python environment without SSL support, the following error is received:
File "urllib3/connectionpool.py", line 11, in <module>
from httplib import HTTPConnection, HTTPSConnection, HTTPException
ImportError: cannot import name HTTPSConnection
For comparison / reference, urllib2 does import on the same system. Attempting to retrieve an HTTPS URL via urllib2.urlopen(...) results in this error:
urllib2.URLError: <urlopen error unknown url type: https>
Standard HTTP requests, however, using urllib2 work fine in this python environment.
Perhaps urllib3 should error out in a similar fashion?
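A hedged sketch of that idea (the import guard and helper names below are illustrative, not urllib3's actual structure):
from httplib import HTTPException
try:
    from httplib import HTTPSConnection
except ImportError:  # Python built without SSL support
    HTTPSConnection = None

def new_https_connection(host, port=443):
    if HTTPSConnection is None:
        raise HTTPException("Python is compiled without SSL support; "
                            "https URLs are unavailable")
    return HTTPSConnection(host, port)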
It would be really convenient if one could configure proxies with environment variables (case in point, urllib).
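A hedged sketch of what that could look like, reusing the existing proxy_from_url helper and the http_proxy convention from urllib:
import os
import urllib3

# Honor the conventional environment variable, fall back to a direct pool.
proxy_url = os.environ.get('http_proxy')
if proxy_url:
    http = urllib3.proxy_from_url(proxy_url)
else:
    http = urllib3.PoolManager()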
After
pool = HTTPConnectionPool(host, port)
pool.urlopen('GET', '/', release_conn=False)
the connection is released anyway.
Google App Engine restricts which modules can be used in their environment, and sadly, select is not on their whitelist. Is there a way to make urllib3 work on App Engine without using this module? requests depends on urllib3, and the select module prevents users from using requests on App Engine.
Currently httplib coerces headers into a dict, which breaks some things. Some monkeypatching might be required.
See also: Issue 15 @ GoogleCode for more discussion.
I'm unable to re-open issue #61 for some reason, so reporting this separately.
It doesn't appear to work with POST requests:
import webapp2
import urllib3

class MainPage(webapp2.RequestHandler):
    def get(self):
        http = urllib3.PoolManager()
        r = http.request('GET', 'http://google.com/')
        r2 = http.request('GET', 'http://yahoo.com/')
        r3 = http.request('POST', 'http://www.example.com/')
        r4 = http.request('POST', 'http://www.example.com/', fields={"foo": "bar"})
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.out.write(r.status)
        self.response.out.write(r2.status)
        self.response.out.write(r3.status)

app = webapp2.WSGIApplication([('/', MainPage)],
                              debug=True)
Traceback:
The API package 'remote_socket' or call 'Resolve()' was not found.
Traceback (most recent call last):
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1511, in __call__
rv = self.handle_exception(request, response, e)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
rv = self.router.dispatch(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
return handler.dispatch()
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 547, in dispatch
return self.handle_exception(e, self.app.debug)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~megachunt/1.358837729334117339/urllib3_test.py", line 9, in get
r3 = http.request('POST', 'http://www.example.com/')
File "/base/data/home/apps/s~megachunt/1.358837729334117339/urllib3/request.py", line 71, in request
**urlopen_kw)
File "/base/data/home/apps/s~megachunt/1.358837729334117339/urllib3/request.py", line 119, in request_encode_body
boundary=multipart_boundary)
File "/base/data/home/apps/s~megachunt/1.358837729334117339/urllib3/filepost.py", line 57, in encode_multipart_formdata
boundary = choose_boundary()
File "/base/python27_runtime/python27_dist/lib/python2.7/mimetools.py", line 140, in choose_boundary
hostid = socket.gethostbyname(socket.gethostname())
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/remote_socket/_remote_socket.py", line 299, in gethostbyname
return _Resolve(host, [AF_INET])[2][0]
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/remote_socket/_remote_socket.py", line 251, in _Resolve
canon, aliases, addresses = _ResolveName(name, families)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/remote_socket/_remote_socket.py", line 269, in _ResolveName
apiproxy_stub_map.MakeSyncCall('remote_socket', 'Resolve', request, reply)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 94, in MakeSyncCall
return stubmap.MakeSyncCall(service, call, request, response)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 308, in MakeSyncCall
rpc.CheckSuccess()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 133, in CheckSuccess
raise self.exception
CallNotFoundError: The API package 'remote_socket' or call 'Resolve()' was not found.
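The failure comes from mimetools.choose_boundary(), which resolves the local hostname, an operation App Engine forbids. Since a multipart boundary only needs to be unique, a hedged workaround sketch:
import uuid

def choose_boundary():
    # No network or DNS access needed, unlike mimetools.choose_boundary().
    return uuid.uuid4().hex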
urllib3 cannot access some web pages.
import urllib3
http = urllib3.PoolManager()
url = 'http://waptt.com/'
r = http.request('GET', url, retries = 5)
print r.status
404
But using curl I get a 200 status:
curl -I http://waptt.com
HTTP/1.1 200 OK
I think there is a problem in the use of httplib.HTTPConnection's request method when it is called at line 213 of urllib3/connectionpool.py: you pass it the full URL, containing the scheme and host, instead of just the path (and query part), as shown in the httplib usage examples.
This results in a malformed HTTP request being sent to the server. To see it, you can for instance run
python -m SimpleHTTPServer
in a shell and then, in another one, run
python -c 'from urllib3 import PoolManager; http = PoolManager(); http.request( "GET", "http://localhost:8000/this/is/an/example" )'
and compare what the access log in the first shell reports as compared to what happens if you do
curl "http://localhost:8000/this/is/an/example"
I can submit a patch, but I'm not an urllib3 expert so I will probably miss some other place where the same error occurs.
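For reference, a hedged sketch of the fix being proposed: hand httplib only the path and query, since the pool is already bound to the scheme and host (request_uri is an illustrative name):
from urlparse import urlsplit  # Python 2

def request_uri(url):
    parts = urlsplit(url)
    uri = parts.path or '/'
    if parts.query:
        uri += '?' + parts.query
    return uri  # e.g. '/this/is/an/example' for the URL above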
There is currently one test that checks SSL cert verification, in tests/with_dummyserver/test_https.py. More tests would be great. :)
Specifically we want to reach good coverage for HTTPS-related code.
In our production environment, we've noticed the following errors in our logs, pool size set to 500:
FetchError: FetchError: msg = "No retries left, giving up", original exception = "EmptyPoolError("HTTPConnectionPool(host='SUPER_SECRET', port=80): Pool reached maximum size and no more connections are allowed.",)", url = "http://SUPER_SECRET_URL/", retries = 2
ERROR - base.py:fetch:166 - FetchError: HTTPConnectionPool(host='SUPER_SECRET', port=80): Pool reached maximum size and no more connections are allowed., retries left = 2
After a bit of investigation, it looks like the number of connections in the pool is slowly decreasing over time as we encounter other errors, forced timeouts, etc. On urllib3/connectionpool.py:424, if httplib/socket raises an error, the connection will be dropped from the pool, because it was acquired on line 382 and never put back on line 431 since conn == None.
This issue can be replicated with the following gist (uses gevent.Timeout, so quasi-related to #69): https://gist.github.com/2932793
It looks like the if conn check on line 431 was originally added to ensure that a SocketError raised from _get_conn() didn't add None to the pool (since a connection would never be acquired), but in the current incarnation it has the effect of causing the pool size to shrink over time whenever a connection is acquired and any error is raised from httplib/socket.
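A hedged sketch of one possible fix inside urlopen(), assuming the SocketError and HTTPException names already imported in connectionpool.py; _put_conn(None) works because _get_conn() creates a fresh connection when it pops a None sentinel:
conn = self._get_conn()
try:
    httplib_response = self._make_request(conn, method, url, **kw)
except (SocketError, HTTPException):
    conn = None  # Broken connection; discard it.
    raise
finally:
    # Always give the slot back so the pool's capacity never shrinks.
    self._put_conn(conn)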
Can't figure this one out. Any ideas?
from urllib3.connectionpool import connection_from_url
url = 'https://twitter.com'
http_pool = connection_from_url(url, strict=False)
r = http_pool.urlopen('GET', url)
urllib3.exceptions.MaxRetryError: Max retries exceeded for url: https://twitter.com
We are seeing "error(35, 'Resource temporarily unavailable')" thrown from urllib3 when running on Mac. It appears this is a known issue in Python: the caller of send() must handle EAGAIN errors on BSD platforms.
This was observed with urllib3-1.1, OSX 10.6, Python 2.7. Not surprisingly, it's particularly common over slow network connections.
Turns out tornado is really eager to use IPv6. Unless you expressly hand the server the address, it doesn't even check for socket IPv6 support. I'll submit a pull request for the one-line fix in dummyserver/server.py momentarily.
Source: https://groups.google.com/group/python-tornado/browse_thread/thread/3ec04536e57a2833?pli=1
The gevent monkey patching on socket for some reason (I have yet to find out why) does not properly raise a SocketTimeout the way it should, and as a consequence urllib3 will not raise a Timeout properly if you're using gevent.
How to reproduce:
Create an unresponsive socket (to simulate a timeout):
nc -l 8080
from gevent.monkey import patch_all
patch_all()
import urllib3
http = urllib3.PoolManager()
r = http.request('GET', 'http://localhost:8080/', timeout=1)
The Timeout is never raised.
One solution is to use gevent.socket when making the connection; that should properly raise a SocketTimeout.
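A hedged sketch of that workaround: build the raw connection through gevent's socket module so its timeout machinery actually fires:
from gevent import socket as gevent_socket

def new_connection(host, port, timeout):
    # Raises a timeout error when the peer does not respond in time,
    # even under monkey-patching.
    return gevent_socket.create_connection((host, port), timeout=timeout)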
Kicking off a tracking thread for some of the work started at the PyCon 2012 Requests/urllib3 sprints.
I'm not sure as to the full history of the issue, but the general driving factor is more transparency and control over behaviors that are overly opaque in the standard library's httplib, which seems to have been written without a number of modern concerns in mind.
The validity of these concerns was at least generally confirmed by Guido himself, in person, when @kennethreitz and I spoke to him and he agreed that certain batteries included with Python have started to lose their charge; libraries like urllib/urllib2/httplib are out of touch with the new momenta of web technologies.
The refactorings involved in fixing the above issues seem to point to a new httplib-like library that is more extensible and configurable. The emergence of libraries like @benoitc's http-parser also support this direction and urllib3 is certainly one of the best-positioned libraries for attempting this sort of enhancement.
Design discussions thus far have involved @shazow, @wolever, @atdt, @kennethreitz, @easel, @doublereedkurt, and @brandon-rhodes. More notes to follow.
@kennethreitz is starting some informal work on this somewhere.
Issues that are affected by this: https://github.com/shazow/urllib3/issues?milestone=1&state=open
Let's keep this thread updated as we make progress.
This URL parses correctly; however, the redirect leads to a bogus URL (nice work, Patch.com :-)
Can we cause this to generate a more useful exception? Something specifically about the redirect being bogus?
Here is the URL parsing correctly:
>>> import urlparse
>>> urlparse.urlparse('http://stclairshores.patch.com/articles/shores-veteran-to-receive-complimentary-wedding-on-veterans-day')
ParseResult(scheme='http', netloc='stclairshores.patch.com', path='/articles/shores-veteran-to-receive-complimentary-wedding-on-veterans-day', params='', query='', fragment='')
>>> urlparse.urlparse('http://stclairshores.patch.com/articles/shores-veteran-to-receive-complimentary-wedding-on-veterans-day').port
>>>
Here is the redirect that it generates... notice the http://http:// at the beginning of the new location!
>>> r = requests.get('http://stclairshores.patch.com/articles/shores-veteran-to-receive-complimentary-wedding-on-veterans-day', allow_redirects=False)
>>> r.headers
{'status': '302', 'content-length': '160', 'content-encoding': 'gzip', 'set-cookie': 'p13n=%5B%5D; path=/, _patch_session=BAh7BzoPc2Vzc2lvbl9pZCIlNmEzYzM4YzZjMWIxMTdiNDkxMmEwNmEwM2JmZDQzYTU6FnByb21wdF9mb3Jfc3VydmV5aQA%3D--547af227538bb1d039809a3d01eaadb320a9a42b; domain=patch.com; path=/', 'x-powered-by': 'Phusion Passenger (mod_rails/mod_rack) 3.0.11', 'vary': 'Accept-Encoding', 'server': 'Apache/2.2.15 (Unix) mod_ssl/2.2.15 OpenSSL/0.9.8l Phusion_Passenger/3.0.11', 'x-runtime': '15', 'location': 'http://http://www.dailytribune.com/articles/2011/11/09/news/doc4ebb336cad1c7378471368.txt?viewmode=fullstory', 'cache-control': 'no-cache', 'date': 'Mon, 23 Jan 2012 13:32:36 GMT', 'content-type': 'text/html; charset=utf-8', 'x-rack-cache': 'miss'}
>>>
So, of course urllib cannot open it:
>>> import urllib
>>> g = urllib.urlopen('http://stclairshores.patch.com/articles/shores-veteran-to-receive-complimentary-wedding-on-veterans-day').read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/urllib.py", line 84, in urlopen
return opener.open(url)
File "/usr/lib/python2.7/urllib.py", line 205, in open
return getattr(self, name)(url)
File "/usr/lib/python2.7/urllib.py", line 356, in open_http
return self.http_error(url, fp, errcode, errmsg, headers)
File "/usr/lib/python2.7/urllib.py", line 369, in http_error
result = method(url, fp, errcode, errmsg, headers)
File "/usr/lib/python2.7/urllib.py", line 632, in http_error_302
data)
File "/usr/lib/python2.7/urllib.py", line 659, in redirect_internal
return self.open(newurl)
File "/usr/lib/python2.7/urllib.py", line 205, in open
return getattr(self, name)(url)
File "/usr/lib/python2.7/urllib.py", line 331, in open_http
h = httplib.HTTP(host)
File "/usr/lib/python2.7/httplib.py", line 1061, in __init__
self._setup(self._connection_class(host, port, strict))
File "/usr/lib/python2.7/httplib.py", line 693, in __init__
self._set_hostport(host, port)
File "/usr/lib/python2.7/httplib.py", line 718, in _set_hostport
raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port: ''
urllib3 has the same error:
>>> http_pool = urllib3.connection_from_url('http://stclairshores.patch.com/articles/shores-veteran-to-receive-complimentary-wedding-on-veterans-day')
>>> r = http_pool.get_url('http://stclairshores.patch.com/articles/shores-veteran-to-receive-complimentary-wedding-on-veterans-day')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "urllib3/request.py", line 136, in get_url
**urlopen_kw)
File "urllib3/request.py", line 78, in request_encode_url
return self.urlopen(method, url, **urlopen_kw)
File "urllib3/connectionpool.py", line 410, in urlopen
retries - 1, redirect, assert_same_host)
File "urllib3/connectionpool.py", line 336, in urlopen
if assert_same_host and not self.is_same_host(url):
File "urllib3/connectionpool.py", line 246, in is_same_host
scheme, host, port = get_host(url)
File "urllib3/connectionpool.py", line 538, in get_host
port = int(port)
ValueError: invalid literal for int() with base 10: ''
>>>
and it propagates through to requests too:
>>>
>>> r = requests.get('http://stclairshores.patch.com/articles/shores-veteran-to-receive-complimentary-wedding-on-veterans-day')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 50, in get
return request('get', url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 38, in request
return s.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 200, in request
r.send(prefetch=prefetch)
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 514, in send
self._build_response(r)
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 253, in _build_response
request.send()
File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 430, in send
conn = self._poolmanager.connection_from_url(url)
File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/poolmanager.py", line 94, in connection_from_url
scheme, host, port = get_host(url)
File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py", line 524, in get_host
port = int(port)
ValueError: invalid literal for int() with base 10: ''
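A hedged sketch of what a friendlier failure could look like: validate the redirect target before following it, so a bogus Location header can be reported as such instead of surfacing as a confusing int() failure deep in get_host():
from urlparse import urlsplit  # Python 2

def check_redirect_location(location):
    parts = urlsplit(location)
    if parts.scheme not in ('http', 'https') or not parts.hostname:
        raise ValueError("Bogus redirect location: %r" % location)
    parts.port  # Raises ValueError if the port is non-numeric.
    return location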
I use the code:
headers = {
    'User-Agent': 'Baiduspider+(+http://www.baidu.com/search/spider.htm)',
    'Accept-Encoding': 'gzip,deflate',
}
r = http.request('GET', 'http://www.heroone.com', headers=headers)
Error log:
HTTPError: Received response with content-encoding: gzip, but failed to decode it.
Some web servers append extra "tail" bytes after the gzip-compressed data.
Decompression modules such as Python's GzipFile raise an exception in this case, whereas browsers automatically discard the extra "tail" and extract and process the page data normally.
Python's GzipFile has an undocumented attribute, extrabuf, which holds the data that has already been successfully decompressed. The following code therefore gives better compatibility:
files :urllib3/response.py
Old code:
def decode_gzip(data):
    gzipper = gzip.GzipFile(fileobj=BytesIO(data))
    return gzipper.read()
Fixed code:
def decode_gzip(data):
    gzipper = gzip.GzipFile(fileobj=BytesIO(data))
    try:
        return gzipper.read()
    except IOError:
        # Trailing garbage after the gzip stream; fall back to the
        # data that was already decompressed successfully.
        return gzipper.extrabuf
>>> m = PoolManager()
>>> m.request('GET', 'https://google.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/request.py", line 67, in request
**urlopen_kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/request.py", line 80, in request_encode_url
return self.urlopen(method, url, **urlopen_kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/poolmanager.py", line 108, in urlopen
return self.urlopen(method, e.url, **kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/poolmanager.py", line 108, in urlopen
return self.urlopen(method, e.url, **kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/poolmanager.py", line 104, in urlopen
return conn.urlopen(method, url, **kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/connectionpool.py", line 361, in urlopen
raise MaxRetryError(self, url)
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.google.de', port=443): Max retries exceeded with url: https://www.google.de/
It is not an HTTPS error:
>>> m.request('GET', 'http://google.com')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/request.py", line 67, in request
**urlopen_kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/request.py", line 80, in request_encode_url
return self.urlopen(method, url, **urlopen_kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/poolmanager.py", line 108, in urlopen
return self.urlopen(method, e.url, **kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/poolmanager.py", line 108, in urlopen
return self.urlopen(method, e.url, **kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/poolmanager.py", line 104, in urlopen
return conn.urlopen(method, url, **kw)
File "/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/urllib3/connectionpool.py", line 361, in urlopen
raise MaxRetryError(self, url)
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.google.de', port=80): Max retries exceeded with url: http://www.google.de/
The server is getting confused with the request "GET http://blag.xkcd.com/ HTTP/1.1":
>>> import urllib3
>>> url = "http://blag.xkcd.com/"
>>> conn = urllib3.connection_from_url(url)
>>> r = conn.request("GET", url, redirect=False)
>>> r.status
301
>>> r.get_redirect_location()
'http://blag.xkcd.comhttp/blag.xkcd.com/'
>>> r.headers
{'content-length': '0', 'x-powered-by': 'PHP/5.2.6-1+lenny13', 'expires': 'Wed, 11 Jan 1984 05:00:00 GMT', 'vary': 'Accept-Encoding', 'server': 'Apache', 'last-modified': 'Sun, 05 Feb 2012 08:15:10 GMT', 'connection': 'close', 'location': 'http://blag.xkcd.comhttp/blag.xkcd.com/', 'pragma': 'no-cache', 'cache-control': 'no-cache, must-revalidate, max-age=0', 'date': 'Sun, 05 Feb 2012 08:15:10 GMT', 'content-type': 'text/html; charset=UTF-8', 'x-pingback': 'http://blog.xkcd.com/xmlrpc.php'}
It works fine if the request is "GET / HTTP/1.1":
>>> r = conn.request("GET", "/", redirect=False)
>>> r.status
200
>>> r.headers
{'x-powered-by': 'PHP/5.2.6-1+lenny13', 'transfer-encoding': 'chunked', 'vary': 'Accept-Encoding', 'server': 'Apache', 'connection': 'close', 'date': 'Sun, 05 Feb 2012 08:16:55 GMT', 'content-type': 'text/html; charset=UTF-8', 'x-pingback': 'http://blog.xkcd.com/xmlrpc.php'}
This behavior might be considered the user's fault, but the same behavior is seen with PoolManager, where the library is expected to figure out the correct connection from the full URL.
Hi, I maintain a fork of socksipy-branch, called socksipy-x, which is at https://github.com/brendoncrawford/socksipy-x. Socksipy and Socksipy-Branch have not been updated for a while, so I intend to fix some bugs and try to maintain the codebase when needed.
Before I embark on the task of adding SOCKS support to urllib3, I wanted to see if this is even something you would be interested in merging in? If you were, it could either be referenced as a dependency/submodule, or I could just copy the entire socks.py file directly into urllib3, so no external dependency would be required.
Any thoughts?
I know this is super lame to file an issue instead of emailing a mailing list or something for a feature question, but I can't find a urllib3 mailing list anywhere.
Anyway, I am trying to find any kind of support for HTTP pipelining in an existing library in Python before attempting something more drastic. Specifically, I am trying to pipeline a series of PUTs (yeah, they're idempotent), like so:
PUT
PUT
PUT
get response
get response
get response
Ideally, I'd love for this to be handled for me in some kind of threadsafe way, but I'm willing to do it myself. urllib3 seems really close! Threadsafe, connection pooling, the works! I even saw some other website that offhandedly speculated that urllib3 might even do pipelining. But I hardly think that's possible, as you have to call release_connection manually after reading the body of a response.
Anyway, do you know anything about this? It's sort of surprising to me how undersupported this HTTP feature seems to be. If urllib3 doesn't support it, any wise thoughts on what I might have to do instead?
-JT
import requests
requests.get('http://online.wsj.com?CALL_URL=http://online.wsj.com/article/SB10001424052702303640104577436251166644714.html%3fmod=googlenews_wsj')
Traceback (most recent call last):
File "", line 1, in
File "/Users/lsemel/www/virtualenvs/muckrack/lib/python2.7/site-packages/requests/api.py", line 51, in get
return request('get', url, *_kwargs)
File "/Users/lsemel/www/virtualenvs/muckrack/lib/python2.7/site-packages/requests/api.py", line 39, in request
return s.request(method=method, url=url, *_kwargs)
File "/Users/lsemel/www/virtualenvs/muckrack/lib/python2.7/site-packages/requests/sessions.py", line 200, in request
r.send(prefetch=prefetch)
File "/Users/lsemel/www/virtualenvs/muckrack/lib/python2.7/site-packages/requests/models.py", line 463, in send
conn = self._poolmanager.connection_from_url(url)
File "/Users/lsemel/www/virtualenvs/muckrack/lib/python2.7/site-packages/requests/packages/urllib3/poolmanager.py", line 89, in connection_from_url
scheme, host, port = get_host(url)
File "/Users/lsemel/www/virtualenvs/muckrack/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 557, in get_host
port = int(port)
ValueError: invalid literal for int() with base 10: ''
https://github.com/kennethreitz/requests/blob/develop/requests/packages/urllib3/util.py#L75 should probably chop off the query string before checking for a ':'.
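A hedged sketch of that approach, leaning on the standard library instead of a hand-rolled colon search:
from urlparse import urlsplit  # Python 2

def get_host(url):
    parts = urlsplit(url)
    # parts.port is derived from the netloc alone, so a ':' inside the
    # query string can no longer be mistaken for a port separator.
    return parts.scheme or 'http', parts.hostname, parts.port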
Hi, I realized that custom 'Accept' headers are getting overwritten by the ProxyManager._set_proxy_headers function in poolmanager.py (line 124).
I did a bit of googling but couldn't come up with a reason why this should be done; if this modification of the 'Accept' header is not needed, would it be possible to get it removed?
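A hedged sketch of the idea (a simplified standalone function, not the real _set_proxy_headers signature): supply the default only when the caller has not set one, so custom Accept headers survive:
def merge_proxy_headers(user_headers):
    headers = {'Accept': '*/*'}         # defaults first
    headers.update(user_headers or {})  # the caller's headers win
    return headers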
Why do sites that respond with a 301 go wrong?
http = urllib3.PoolManager()
conn = http.connection_from_url("http://www.opda.com.cn")
r = conn.request("GET", "/", retries=5)
Error log:
HostChangedError: HTTPConnectionPool(host='www.opda.com.cn', port=80): Tried to open a foreign host with url: forum.php
I have been using urllib3 via Requests, and I noticed that I was only getting one cookie back. After some investigation, it looks like the problem is caused by the fact that all of the HTTP headers are converted from a list of 2-tuples (from httplib) to a dictionary. This is normally okay, but when the server sends back multiple Set-Cookie headers, all but the last are overwritten.
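For reference, a hedged sketch of one way the conversion could keep duplicates, based on the 2-tuples that httplib's getheaders() already provides:
def collect_headers(header_items):
    collected = {}
    for name, value in header_items:  # e.g. httplib_response.getheaders()
        collected.setdefault(name.lower(), []).append(value)
    return collected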
Minimal code to reproduce. I will leave that URL up until this issue is resolved or I am smacked upside the header for something stupid I am missing.
import urllib3
http = urllib3.PoolManager()
r = http.request("GET", "http://www.joelverhagen.com/sandbox/tests/cookies.php")
print(r.headers)
I am using Python 3.2.2 and urllib3 v1.2.2.
Edit: It may be a good idea to open an issue on the Requests GitHub repo if this is indeed a problem. I was going to do it, but I wanted to first check on this end.
See issue 269 in Requests, which fixes the request method used on subsequent redirects.
Hello,
test-requirements.txt is not shipped with the source, so installing urllib3 fails. I also think that listing requirements inside setup.py is a better choice.
$ pip install urllib3
Downloading/unpacking urllib3
Downloading urllib3-1.2.1.tar.gz
Running setup.py egg_info for package urllib3
Traceback (most recent call last):
File "<string>", line 14, in <module>
File "/home/eriol/.virtualenvs/testurllib3/build/urllib3/setup.py", line 25, in <module>
tests_requirements = requirements + open('test-requirements.txt').readlines()
IOError: [Errno 2] No such file or directory: 'test-requirements.txt'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 14, in <module>
File "/home/eriol/.virtualenvs/testurllib3/build/urllib3/setup.py", line 25, in <module>
tests_requirements = requirements + open('test-requirements.txt').readlines()
IOError: [Errno 2] No such file or directory: 'test-requirements.txt'
Kind regards,
Daniele Tricoli
If someone today installs urllib3 with pip or easy_install into a novel environment — like the 64-bit version of an operating system that none of us have tested the library against — and receives an exception, then there is no easy way for them to quickly run the test suite to see whether the problem is that their code is faulty, or that urllib3 itself fails its own tests in the new environment.
But if we move the test directory, that currently sits anonymously at the top of the repository, down into the urllib3 directory and start shipping it as a sub-package, then users who want to double-check that urllib3 is working would simply be able to type:
python -m unittest discover urllib3
Or, if they were using Python <2.7, then they could install unittest2 and use its unit2 command line that Michael Foord developed so that older Pythons can also enjoy the new, standard way for tests to be auto-discoverable.
It is not clear to me whether the dummyserver top-level package that also sits in setup.py stands in the way of subordinating the tests beneath the package itself. (Would the dummy server need to move down inside urllib3/tests?)
Also check if it works for HTTPS proxies. It might not.
Whether a request() or urlopen() call is given retries=0 or retries=<positive-int>, the exception returned on most connection and network failures is a MaxRetryError. This prevents client code from seeing the “real error” that killed the connection and, therefore, clients based on urllib3 cannot, say, adjust their behavior based on whether the problem is any of the various errors at the HTTP level, or whether the problem is some specific socket error.
At least three solutions are possible, and I defer to the maintainers to decide which is the most urllib3-ish.
First, it could be judged a design error to have introduced MaxRetryError in the first place, and the final attempt's exception could be allowed to make it through to the client (some would want to throw in a retry_count attribute to add a bit of information about what happened, but I am not sure that that is advisable). The except clause near line 350 of connectionpool.py would then look something like:
except (HTTPException, SocketError), e:
    # Connection broken, discard. It will be replaced next _get_conn().
    conn = None
    # If this was the last try, let the caller see the real error.
    if not retries:
        raise
Other adjustments to the code might accomplish the same thing — a larger refactoring could eliminate even making the final attempt inside of an except:, for example — but this is the simplest approach I could find for this first approach.
Second, the MaxRetryError could be marked up with a list of failures. This would be a bit tricky, since there are finicky problems with keeping stack traces around, but at least the exception object from each failure could be kept around — shorn of its stack trace — and delivered back with the MaxRetryError. This would even make it clear whether the problem was an actual network death of the HTTP protocol, or whether something like a redirect loop had kept the request going through too many attempts — and thus much more information would be available for the client to determine what had happened.
Finally, retries=None could be a special signal that the bare exception should be returned. This would maintain (so far as I can see) 100% compatibility with the current code base. The ConnectionPool.urlopen() function needs two or three lines adjusted so that the None value avoids offending the various if statements, but with minimal intrusion the change lets someone like me — whose application, alas, cares quite deeply what mode of failure an HTTP request encounters — take advantage of urllib3 but without breaking existing code that might always expect the MaxRetryError exception. (Oh, and, beware — in Python 2, None < 0 is True, I just discovered, so the retries < 0 check will catch a None value unless an extra clause is inserted!)
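A hedged sketch of that extra clause inside urlopen():
# Without the "is not None" guard, retries=None would fall into this
# branch, because None < 0 is True on Python 2.
if retries is not None and retries < 0:
    raise MaxRetryError(self, url)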
Of course, other approaches might also be possible that I have not thought of.
I very much like the quality of code that I am seeing in urllib3 and, if I can only get the actual exceptions back that it encounters, these already-written connection pools are going to save me a lot of work. Thanks!
The "Getting Started" section of the documentation suggests this simple test:
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET', 'http://google.com/')
But actually running this code results in an exception:
Traceback (most recent call last):
File "test.py", line 7, in <module>
r = http.request('GET', 'http://www.google.com/')
File "/home/brandon/v/local/lib/python2.7/site-packages/urllib3/request.py", line 65, in request
**urlopen_kw)
File "/home/brandon/v/local/lib/python2.7/site-packages/urllib3/request.py", line 78, in request_encode_url
return self.urlopen(method, url, **urlopen_kw)
File "/home/brandon/v/local/lib/python2.7/site-packages/urllib3/poolmanager.py", line 113, in urlopen
return self.urlopen(method, e.new_url, **kw)
File "/home/brandon/v/local/lib/python2.7/site-packages/urllib3/poolmanager.py", line 113, in urlopen
return self.urlopen(method, e.new_url, **kw)
File "/home/brandon/v/local/lib/python2.7/site-packages/urllib3/poolmanager.py", line 113, in urlopen
return self.urlopen(method, e.new_url, **kw)
File "/home/brandon/v/local/lib/python2.7/site-packages/urllib3/poolmanager.py", line 113, in urlopen
return self.urlopen(method, e.new_url, **kw)
File "/home/brandon/v/local/lib/python2.7/site-packages/urllib3/poolmanager.py", line 109, in urlopen
return conn.urlopen(method, url, **kw)
File "/home/brandon/v/local/lib/python2.7/site-packages/urllib3/connectionpool.py", line 309, in urlopen
raise MaxRetryError(url)
urllib3.exceptions.MaxRetryError: Max retries exceeded for url: http://www.google.com/
The reason is that each attempt to make a connection is dying with HostChangedError, because the is_same_host() method in connectionpool.py is getting back the tuple ('http', 'www.google.com', None) from get_host(url) but the slightly different tuple ('http', 'www.google.com', 80) when it combines self.scheme with self.host and self.port.
Attempting the test with the URL http://google.com:80/ also fails, because a first successful request is made that redirects to http://www.google.com/, which then dies as well with the None != 80 problem.
Only a test with the URL http://www.google.com:80/ succeeds, because only in that case is the explicit port number present to make is_same_host() succeed, and no subsequent redirect occurs to break things.
It looks like either get_host() needs to throw in the port 80 when building its tuple, or — if that would ruin the purity and usefulness of that function — the is_same_host() function needs to detect the None port number coming back from get_host() and upgrade it to 80 or 443 as appropriate.
Or: is the problem that the self.port that is_same_host() is grabbing used to have the value None as well, and it's the port having the value 80 so early in the process that is the problem here?
If more experienced project members could point me in the right direction here, I would be happy to contribute a patch and pull request. I could even add a test for the usage example in the documentation, so that it stays working. :) Thanks for your work on urllib3!
Requests to access a secure website (SSL/TLS) fail through a proxy.
Urllib3 does not properly implement the HTTP CONNECT method.
For example the following code should print 200.
Instead, with a burp proxy, it prints 502.
import urllib3
proxy = urllib3.proxy_from_url('http://localhost:8080/')
response = proxy.urlopen('GET', 'https://www.google.com/index.html')
print response.status
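For reference, a hedged sketch of the missing piece at the httplib level (Python 2.7+): set_tunnel() makes httplib issue the CONNECT request to the proxy before the TLS handshake.
import httplib

conn = httplib.HTTPSConnection('localhost', 8080)  # the proxy
conn.set_tunnel('www.google.com', 443)             # sends CONNECT www.google.com:443
conn.request('GET', '/index.html')
print conn.getresponse().status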
Issue 293.
I guess in the end we should use the standard and tested urlparse.urlsplit :)
I'm testing 1.0.1
As noted in the code, RecentlyUsedContainer is not threadsafe.
I'm getting: RuntimeError: deque mutated during iteration
Stacktrace:
File "/home/chrsjo/src/arkenutils_omkomp/arken/hcap.py", line 153, in _request
return self._pool.request(*args, **kwargs)
File ".../urllib3-1.0.1-py2.7.egg/urllib3/request.py", line 65, in request
**urlopen_kw)
File ".../urllib3-1.0.1-py2.7.egg/urllib3/request.py", line 78, in request_encode_url
return self.urlopen(method, url, **urlopen_kw)
File ".../urllib3-1.0.1-py2.7.egg/urllib3/poolmanager.py", line 107, in urlopen
conn = self.connection_from_url(url)
File ".../urllib3-1.0.1-py2.7.egg/urllib3/poolmanager.py", line 98, in connection_from_url
return self.connection_from_host(host, port=port, scheme=scheme)
File ".../urllib3-1.0.1-py2.7.egg/urllib3/poolmanager.py", line 73, in connection_from_host
pool = self.pools.get(pool_key)
File "/home/chrsjo/.virtualenvs/arkenutils_omkomp/lib/python2.7/_abcoll.py", line 342, in get
return self[key]
File ".../urllib3-1.0.1-py2.7.egg/urllib3/_collections.py", line 96, in __getitem__
self._prune_invalidated_entries()
File ".../urllib3-1.0.1-py2.7.egg/urllib3/_collections.py", line 77, in _prune_invalidated_entries
self.access_log = deque(e for e in self.access_log if e.is_valid)
File ".../urllib3-1.0.1-py2.7.egg/urllib3/_collections.py", line 77, in <genexpr>
self.access_log = deque(e for e in self.access_log if e.is_valid)
RuntimeError: deque mutated during iteration
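A hedged sketch of one way to make it safe: serialize access with a lock so the deque is never mutated while another thread iterates it (names illustrative):
import threading
from collections import deque

class LockedAccessLog(object):
    def __init__(self):
        self._lock = threading.Lock()
        self.access_log = deque()

    def push(self, entry):
        with self._lock:
            self.access_log.appendleft(entry)

    def prune_invalidated(self):
        with self._lock:
            self.access_log = deque(e for e in self.access_log
                                    if e.is_valid)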
Python 2.7 has added support for buffering when reading from HTTP connections, which leads to a significant improvement in HTTP client performance.
See: http://bugs.python.org/issue4879
It looks like the only changes required would be in HTTPConnectionPool._make_request, passing buffering=True to the getresponse() call on Python 2.7 and later.
urllib2.py does the following:
try:
    r = h.getresponse(buffering=True)
except TypeError:  # buffering kw not supported
    r = h.getresponse()
With the recent World IPv6 Launch ( http://www.worldipv6launch.org/ ), it would be really nice if urllib3 could support IPv6 (in such a way that finally requests can support it too).
Currently, it is not possible to post a 4 GB file using urllib3 since that requires reading the entire file content into a buffer before sending. (unless I'm missing something?)
It would be nice if we could pass file-like objects (objects with a read() method or something) as post variables, which will be read from on demand and perform well on huge files.
The only python script capable of this that I'm aware of is http://atlee.ca/software/poster/, which is a urllib2 opener class.
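A hedged sketch of the idea: read the body in fixed-size chunks so a 4 GB upload never has to fit in memory, assuming an httplib-style connection whose send() can be called repeatedly:
def send_body(conn, fileobj, blocksize=8192):
    while True:
        chunk = fileobj.read(blocksize)
        if not chunk:
            break
        conn.send(chunk)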
On SocketTimeout, a TimeoutError is raised with self.timeout as part of the exception message in urlopen(), but when this is not set, you get a TypeError: float argument required, not NoneType.
When the timeout parameter is overridden locally in the function call, the exception should take this into consideration as well.
To reproduce:
>>> from urllib3.connectionpool import connection_from_url
>>> con = connection_from_url('http://doesnotexists.com')
>>> con.urlopen('GET', 'http://doesnotexists.com', retries=0, assert_same_host=False, timeout=1.0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "urllib3/connectionpool.py", line 353, in urlopen
self.timeout)
TypeError: float argument required, not NoneType
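A hedged sketch of the fix, assuming timeout here is the per-call value (possibly None): format the message with whichever timeout was actually in effect, and use %s so a None value cannot blow up the formatting.
effective_timeout = timeout if timeout is not None else self.timeout
raise TimeoutError("Request timed out. (timeout=%s)" % (effective_timeout,))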
Code:
manager = urllib3.PoolManager()
r = manager.request('GET', 'http://ynet.co.il')
r.data returns the wrong page.
The same happens with paypal.com and other domains... The code takes the Location value (http://www.ynet.co.il) from the HTTP response but issues the GET request against the original hostname, http://ynet.co.il.
urllib3 doesn't allow multiple values for a single key if the request includes files.
I found this bug while using Requests: https://github.com/kennethreitz/requests/issues/285
I sent the following pull request to Requests, but it was rejected because the fix needs to modify urllib3:
https://github.com/kennethreitz/requests/pull/422#issuecomment-4058039
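A hedged sketch of a urllib3-side fix: accept either a dict or a sequence of (name, value) pairs when encoding the body, so the same field name can legitimately appear more than once:
def iter_field_items(fields):
    if isinstance(fields, dict):
        return fields.items()
    return fields  # e.g. [('tag', 'a'), ('tag', 'b'), ('file', f)]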
There are basically 3 lines that aren't covered. Should be easy enough to cover them with a couple more tests.
IIRC something to do with timeouts and decoding.
Very contributor friendly if you're looking to dive into the codebase. :-)
When installing with easy_install:
from urllib3 import HTTPSConnectionPool
File "/usr/local/lib/python2.7/dist-packages/urllib3-1.3-py2.7.egg/urllib3/__init__.py", line 22, in <module>
from . import exceptions
ImportError: cannot import name exceptions
Since version 2.7, httplib supports specifying a source address for HTTP(S)Connection: http://docs.python.org/library/httplib.html#httplib.HTTPConnection
Would be nice if urllib3 could let me use this when creating connections.
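For reference, a hedged sketch of the httplib parameter that would need plumbing through (Python 2.7+; 192.0.2.10 is a placeholder address):
import httplib

# Bind the outgoing socket to a specific local address (and any port).
conn = httplib.HTTPConnection('example.com', 80,
                              source_address=('192.0.2.10', 0))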
If we wrap httplib.HTTPConnection in a custom class, we could hold some additional information about each connection. This way we could enumerate all connections and print their ids in debugging mode, like curl does.
I really miss this feature, because I'm developing a simple upload application.
I think this would be an interesting feature.
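A hedged sketch of such a wrapper (names illustrative):
import itertools
import httplib

class TaggedHTTPConnection(httplib.HTTPConnection):
    """HTTPConnection that carries a unique id for debug output."""
    _counter = itertools.count()

    def __init__(self, *args, **kwargs):
        httplib.HTTPConnection.__init__(self, *args, **kwargs)
        self.id = next(self._counter)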
Moved here from a Requests bug