Comments (10)
What kind of exception did you have in mind?
from urllib3.
> What kind of exception did you have in mind?
Anything that can be caught with something less broad than ValueError would be an improvement.
Would an exception called "InvalidRedirectURL" be consistent with existing style?
This issue also exists in the Python requests library, where it is more general, because the redirect URL might parse properly and only expose itself as bogus when DNS resolution is attempted. See this ticket:
https://github.com/kennethreitz/requests/issues/380
Maybe we can design one exception that can be used in both places?
In the specific example here, the URL proposed by the redirect would not be accepted by a call to get_host, so maybe just a try..except around that?
>>> urllib3.get_host('bogus:')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "urllib3/connectionpool.py", line 538, in get_host
    port = int(port)
ValueError: invalid literal for int() with base 10: ''
Here's the actual example:
>>> urllib3.get_host(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "urllib3/connectionpool.py", line 538, in get_host
    port = int(port)
ValueError: invalid literal for int() with base 10: ''
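The try..except narrowing suggested above could be sketched like this (Python 3 for brevity; `LocationParseError` and `get_host_port` are illustrative names here, not necessarily urllib3's actual API):

```python
class LocationParseError(ValueError):
    """Narrower than a bare ValueError, so callers can catch it
    specifically while existing `except ValueError` handlers keep
    working unchanged."""


def get_host_port(netloc):
    """Split 'host:port'; raise LocationParseError on a bad port.

    'bogus:' partitions into host='bogus' and port_str='', and
    int('') is exactly the ValueError shown in the traceback above.
    """
    host, sep, port_str = netloc.partition(':')
    if not sep:
        return host, None
    try:
        return host, int(port_str)
    except ValueError:
        raise LocationParseError(
            "invalid port %r in %r" % (port_str, netloc))
```

With this, `get_host_port('bogus:')` raises `LocationParseError`, which callers can catch narrowly while broad `except ValueError` code still works.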
> Would an exception called "InvalidRedirectURL" be consistent with existing style?
Well, the same problem occurs when you call get() with such an invalid URL, so it's a more general problem.
I think the answer is something like a HostParseError which inherits from both urllib3.exceptions.HTTPError and ValueError. Thoughts?
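A minimal sketch of that dual inheritance (the `HTTPError` here is a stand-in for urllib3.exceptions.HTTPError, just to keep the example self-contained):

```python
class HTTPError(Exception):
    """Stand-in for urllib3.exceptions.HTTPError in this sketch."""


class HostParseError(HTTPError, ValueError):
    """Catchable both as a library error and as a ValueError, so code
    written against the old ValueError behavior keeps working."""


def demo():
    # Show that the same exception is caught by either base class.
    caught = []
    for catch in (HTTPError, ValueError):
        try:
            raise HostParseError("bad host")
        except catch:
            caught.append(catch.__name__)
    return caught  # → ['HTTPError', 'ValueError']
```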
Yes, an exception that inherits from both is a good idea.
I don't understand the name "HostParseError" -- that seems to imply that there is something wrong with only the hostname in the URL. Don't we mean that the URL is generally invalid? Maybe InvalidURL?
It would also be nice to have a way to know that the InvalidURL was generated by a redirect rather than a URL that I fed into urllib3. Is there another way of detecting that besides having another type of exception called something like InvalidURLFromRedirect?
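One way to make the redirect case distinguishable without an unrelated second exception type is a subclass that carries the redirect context; the names here are hypothetical, taken from the question above:

```python
class InvalidURL(ValueError):
    """The general 'this URL is bogus' error."""


class InvalidURLFromRedirect(InvalidURL):
    """Raised when the bad URL came from a Location header rather than
    directly from the caller; carries the redirecting URL for context."""

    def __init__(self, url, redirected_from):
        super().__init__(
            "invalid redirect to %r from %r" % (url, redirected_from))
        self.url = url
        self.redirected_from = redirected_from
```

Callers that don't care catch `InvalidURL` and get both cases; callers that do care catch the subclass first or inspect `redirected_from`.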
Here's the URL normalization code that I have been using. As you can see, it calls sys.exit() if it fails to catch a new flavor of InvalidURL. It now catches a good variety of invalid URLs in the first few million URLs from this rather unclean source. I'll let you know if I find meaningfully different types of errors as I ramp up to the next couple hundred million URLs.
Parts of this were derived from requests; however, they really don't clean up the URLs enough there either. For example, the requests library currently only requotes the path, and there are plenty of partially quoted parameter strings out there on the open Web. This code fully requotes all relevant parts. The various normalizing steps here are intended to make it easy to detect duplicate URLs that might have started off differently, e.g. with capital letters in the hostname or scheme, or some other oddity that makes the strings different even though the resource that they locate is the same.
import re
import sys
import hashlib
import traceback
import xml.sax.saxutils
from urlparse import urlparse, urlunparse
from urllib import quote, unquote

allowed_schemes = ['http', 'https']


class InvalidURL(Exception):
    pass


def encode_and_requote(part, url):
    """Re-quote this part of a URL.

    This function passes <part> through an unquote/quote cycle to
    ensure that it is fully and consistently quoted.  <url> is the
    full URL being normalized, used only in error messages.
    """
    if isinstance(part, unicode):
        try:
            part = part.encode('utf-8')
        except Exception, exc:
            raise InvalidURL("Invalid URL: %r.encode('utf-8') failed (%s) url=%r"
                             % (part, str(exc), url))
    try:
        part = quote(unquote(part), safe=b'')
    except Exception, exc:
        raise InvalidURL("Invalid URL: quote(unquote(%r), safe=b'') failed (%s) url=%r"
                         % (part, str(exc), url))
    return part


def encode_and_requote_path(path, url):
    """Re-quote each component of a URL <path>.

    This function passes each part of <path> through an unquote/quote
    cycle to ensure that it is fully and consistently quoted.
    """
    parts = path.split(b'/')
    parts = (encode_and_requote(part, url) for part in parts)
    return b'/'.join(parts)


double_quoted_url_re = re.compile(r'''^\"*(?P<url>.*?)\"*$''')
valid_netloc = re.compile(r'^[a-z0-9.-]+(:[0-9]+)?$')


def normalize_url(url):
    """Take a URL string as input and return a dict of parsed fields,
    unless it cannot, in which case raise InvalidURL.
    """
    original_url = url
    try:
        if url.startswith('http%3A%2F%2F'):
            url = unquote(url)
        # remove XML quoting of & > <
        url = xml.sax.saxutils.unescape(url)
        # drop any number of double quote marks before or after the URL
        url = double_quoted_url_re.search(url).group('url')
        # support for unicode domain names and paths
        scheme, netloc, path, params, query, fragment = urlparse(url)
        if not scheme:
            raise InvalidURL("Invalid URL: no scheme supplied in url=%r" % url)
        scheme = scheme.lower().strip()
        if scheme not in allowed_schemes:
            raise InvalidURL("Invalid URL: not http or https url=%r" % url)
        try:
            netloc = netloc.encode('idna').lower().strip()
        except Exception, exc:
            raise InvalidURL("Invalid URL: encode('idna') failed on netloc=%r reason=%r url=%r"
                             % (netloc, exc, url))
        if not netloc == netloc.decode('utf-8'):
            raise InvalidURL("Invalid URL: changed by netloc.decode('utf-8') netloc=%r url=%r"
                             % (netloc, url))
        if not valid_netloc.match(netloc):
            raise InvalidURL("Invalid URL: invalid netloc=%r url=%r" % (netloc, url))
        path = encode_and_requote_path(path.lstrip(), url)
        params = encode_and_requote(params, url)
        query = encode_and_requote(query, url)
        fragment = encode_and_requote(fragment, url)
        # reconstruct the URL without its fragment
        abs_url = urlunparse([scheme, netloc, path, params, query, ''])
    except Exception, exc:
        if isinstance(exc, InvalidURL):
            raise
        else:
            #raise InvalidURL("Invalid URL: unhandled failure url=%r\n%s\nInvalidURL" % (url, traceback.format_exc(exc)))
            sys.exit("Invalid URL: unhandled failure url=%r\n%s\nInvalidURL"
                     % (url, traceback.format_exc(exc)))
    # use scheme and netloc as the unique identifier of the target host
    schost = scheme + '://' + netloc
    url_rec = {
        'fragment': fragment,
        'abs_url': abs_url,
        'doc_id': hashlib.md5(abs_url).hexdigest(),
        'schost': schost,
        'host_id': hashlib.md5(schost).hexdigest(),
    }
    if original_url != abs_url:
        url_rec['original_url'] = original_url
    return url_rec
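The requote idea above, restated for Python 3 (where `urlparse` and `urllib.quote`/`unquote` moved into `urllib.parse`); this is a minimal sketch of the unquote/quote cycle, not a drop-in replacement for the code above:

```python
from urllib.parse import quote, unquote


def requote(part, safe=''):
    """One unquote/quote cycle, so fully-quoted, partially-quoted, and
    unquoted spellings of the same text all normalize to one string."""
    return quote(unquote(part), safe=safe)


def requote_path(path):
    """Requote each path segment, preserving the '/' separators."""
    return '/'.join(requote(seg) for seg in path.split('/'))
```

For example, `'a b'` and `'a%20b'` both normalize to `'a%20b'`, which is what makes duplicate detection by string comparison work.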
I'm closing this issue. We can discuss more effective host parsing in another issue as it comes up. :)
I'm still hitting this issue, and finding this little bit of info was rather hard. I think that revisiting the error message is a valid expenditure of effort.
@mansweet Fab, I'm looking forward to seeing a PR from you then. 😄
(All joking aside, I think it's valuable too, but telling people what is a valid use of their time is not the most polite thing to do in the whole world. In this case, given that this issue has been closed for more than 5 years, you'll probably do best to open a new issue treating your case as a new bug report, providing all of the appropriate detail (e.g. stack trace, urllib3 version).)
@shazow, I have the impression that you only want to create new issues to discuss problems instead of fixing them in the original ticket; I can't understand why. You did the same here in #38 ...
@sofiand-png urllib3 is a community project, you're very welcome to help fix problems that are important to you. That's what I do, too.
If you look a little closer, I did indeed push a fix to the original problem before closing the issue. And since then, we have completely rewritten the parsing, so whatever problems it has are completely new.
If you have a bug to report, please open an issue to discuss and I encourage you to follow up with a pull request that fixes your bug. Thank you!