
Comments (10)

shazow avatar shazow commented on July 3, 2024

What kind of exception did you have in mind?


 avatar commented on July 3, 2024

What kind of exception did you have in mind?

Anything that can be caught with something less broad than ValueError
would be an improvement.

Would an exception called "InvalidRedirectURL" be consistent with existing
style?

This issue also exists in the Python requests library, where it is more
general, because the redirect URL might parse properly and only reveal
itself as bogus when DNS resolution is attempted. See this ticket:

https://github.com/kennethreitz/requests/issues/380

Maybe we can design one exception that can be used in both places?

In the specific example here, the URL proposed by the redirect would not
be accepted by a call to get_host, so maybe just a try..except around
that?

urllib3.get_host('bogus:')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "urllib3/connectionpool.py", line 538, in get_host
    port = int(port)
ValueError: invalid literal for int() with base 10: ''

Here's the actual example:

u = 'http://http://www.dailytribune.com/articles/2011/11/09/news/doc4ebb336cad1c7378471368.txt?viewmode=fullstory'

urllib3.get_host(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "urllib3/connectionpool.py", line 538, in get_host
    port = int(port)
ValueError: invalid literal for int() with base 10: ''
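To make that try..except suggestion concrete, here is a rough sketch of a wrapper around the get_host call shown above. The exception name and the wrapper are made up for illustration, not existing urllib3 code:

import urllib3

class InvalidRedirectURL(ValueError):
    """Hypothetical exception for URLs whose host/port cannot be parsed."""
    pass

def checked_get_host(url):
    # Narrow the broad ValueError raised by get_host into something
    # callers can catch specifically, keeping the original message.
    try:
        return urllib3.get_host(url)
    except ValueError as exc:
        raise InvalidRedirectURL("could not parse host from %r (%s)" % (url, exc))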


piotr-dobrogost avatar piotr-dobrogost commented on July 3, 2024

Would an exception called "InvalidRedirectURL" be consistent with existing style?

Well, the same problem occurs when you call get() with such an invalid URL, so it's a more general problem.


shazow avatar shazow commented on July 3, 2024

I think the answer is something like a HostParseError which inherits from both urllib3.exceptions.HTTPError and ValueError. Thoughts?
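A minimal sketch of what that dual inheritance could look like, purely illustrative and not the actual urllib3 code:

from urllib3.exceptions import HTTPError

class HostParseError(HTTPError, ValueError):
    """Raised when a URL's host/port cannot be parsed.

    Deriving from ValueError keeps existing except-ValueError handlers
    working, while deriving from HTTPError lets callers catch all
    urllib3 errors in one place.
    """
    def __init__(self, location):
        super(HostParseError, self).__init__("Failed to parse: %r" % (location,))
        self.location = location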


 avatar commented on July 3, 2024

Yes, an exception that inherits from both is a good idea.

I don't understand the name "HostParseError" -- that seems to imply that there is something wrong with only the hostname in the URL. Don't we mean that the URL is generally invalid? Maybe InvalidURL?

It would also be nice to have a way to know that the InvalidURL was generated by a redirect rather than a URL that I fed into urllib3. Is there another way of detecting that besides having another type of exception called something like InvalidURLFromRedirect?
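To make that question concrete, one purely illustrative shape for it (the names are made up, and this InvalidURL is unrelated to the one defined in the normalization code below):

class InvalidURL(ValueError):
    """The URL string could not be parsed."""
    pass

class InvalidURLFromRedirect(InvalidURL):
    """Same failure, but the bad URL came from a redirect's Location header."""
    pass

# A caller could then tell the two cases apart:
#
#     try:
#         follow(url)                      # hypothetical redirect-following call
#     except InvalidURLFromRedirect:
#         ...  # the remote server sent a bogus Location header
#     except InvalidURL:
#         ...  # the URL we supplied ourselves was bad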

Here's the URL normalization code that I have been using. As you can see, it calls sys.exit() if it fails to catch a new flavor of InvalidURL. It now catches a good variety of invalid URLs in the first few million URLs from this rather unclean source. I'll let you know if I find meaningfully different types of errors as I ramp up to the next couple hundred million URLs.

Parts of this were derived from requests; however, the URLs really aren't cleaned up enough there either. For example, the requests library currently only requotes the path, and there are plenty of partially quoted parameter strings out there on the open Web. This code fully requotes all relevant parts. The various normalizing steps here are intended to make it easy to detect duplicate URLs that might have started off differently, e.g. with capital letters in the hostname or scheme, or some other oddity that makes the strings differ even though the resource they locate is the same.

import re
import hashlib
import traceback
import xml.sax.saxutils
from urlparse import urlparse, urlunparse
from urllib import quote, unquote
allowed_schemes = ['http', 'https']

class InvalidURL(BaseException):
    pass

def encode_and_requote(part):
    """Re-quote this part of a URL

    This function passes <part> through an unquote/quote cycle to
    ensure that it is fully and consistently quoted.
    """
    if isinstance(part, unicode):
        try:
            part = part.encode('utf-8')
        except Exception, exc:
            raise InvalidURL("Invalid URL: %r.encode('utf-8') failed (%s) url=%r" % (part, str(exc), url))
    try:
        part = quote(unquote(part), safe=b'')
    except Exception, exc:
        raise InvalidURL("Invalid URL: quote(unquote(%r, safe=b'') failed (%s) url=%r" % (part, str(exc), url))
    return part

def encode_and_requote_path(path):
    """Re-quote each component of a URL <path>.

    This function passes each part of <path> through an unquote/quote
    cycle to ensure that it is fully and consistently quoted.
    """
    parts = path.split(b'/')
    parts = (encode_and_requote(part) for part in parts)
    return b'/'.join(parts)

double_quoted_url_re = re.compile('''^\"*(?P<url>.*?)\"*$''')
valid_netloc = re.compile("^[a-z0-9-.]+(:[0-9]+)?$")

def normalize_url(url):
    """
    Takes a URL string as input and returns a dict of parsed fields,
    unless it cannot, in which case it raises InvalidURL
    """
    original_url = url
    try:
        if url.startswith('http%3A%2F%2F'):
            url = unquote(url)
        # remove XML quoting of &amp; &gt; &lt;
        url =  xml.sax.saxutils.unescape(url)

        # Drop any number of double quote marks before or after it
        url = double_quoted_url_re.search(url).group('url')

        # Support for unicode domain names and paths.
        scheme, netloc, path, params, query, fragment = urlparse(url)

        if not scheme:
            raise InvalidURL("Invalid URL: no scheme supplied in url=%r" % url)

        scheme = scheme.lower().strip()
        if scheme not in allowed_schemes:
            raise InvalidURL("Invalid URL: not http or https url=%r" % url)

        try:
            netloc = netloc.encode('idna').lower().strip()
        except Exception, exc:
            raise InvalidURL("Invalid URL: encode('idna') failed on netloc=%r reason=%r url=%r" % (netloc, exc, url))

        if not netloc == netloc.decode('utf-8'):
            raise InvalidURL("Invalid URL: changed by netloc.decode('utf-8') netloc=%r url=%r" % (netloc, url))

        if not valid_netloc.match(netloc):
            raise InvalidURL("InvalidURL: invalid netloc=%r url=%r" % (netloc, url))

        path     = encode_and_requote_path(path.lstrip())
        params   = encode_and_requote(params)
        query    = encode_and_requote(query)

        fragment = encode_and_requote(fragment)

        # reconstruct URL without fragment
        abs_url = (urlunparse([ scheme, netloc, path, params, query, '' ]))

    except Exception, exc:
        if type(exc) is InvalidURL:
            raise exc
        else:
            #raise InvalidURL("Invalid URL: unhandled failure url=%r\n%s\nInvalidURL" % (url, traceback.format_exc(exc)))
            import sys
            sys.exit("Invalid URL: unhandled failure url=%r\n%s\nInvalidURL" % (url, traceback.format_exc(exc)))

    # use scheme and netloc as unique identifier of target
    schost = scheme + '://' + netloc
    url_rec = {
        'fragment': fragment,
        'abs_url': abs_url,
        'doc_id': hashlib.md5(abs_url).hexdigest(),
        'schost': schost,
        'host_id': hashlib.md5(schost).hexdigest()
        }
    if original_url != abs_url:
        url_rec['original_url'] = original_url
    return url_rec
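For what it's worth, here is a quick (hypothetical, untested) Python 2 session showing the shape of the output and the failure mode of the code above:

rec = normalize_url('http://Example.COM/some%20page?q=1')
print rec['schost']     # 'http://example.com'
print rec['abs_url']    # the canonicalized form of the input
print rec['doc_id']     # md5 hex digest of abs_url

try:
    normalize_url('ftp://example.com/file')
except InvalidURL, exc:
    print exc           # Invalid URL: not http or https url='ftp://example.com/file'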


shazow avatar shazow commented on July 3, 2024

I'm closing this issue. We can discuss more effective host parsing in another issue as it comes up. :)


mansweet avatar mansweet commented on July 3, 2024

I'm still hitting this issue, and finding this little bit of info was rather hard. I think that revisiting the error message is a valid expenditure of effort.


Lukasa avatar Lukasa commented on July 3, 2024

@mansweet Fab, I'm looking forward to seeing a PR from you then. 😄

(All joking aside, I think it's valuable too, but telling people what is a valid use of their time is not the most polite thing to do in the whole world. In this case, given that this issue has been closed for more than 5 years, you'll probably do best to open a new issue treating your case as a new bug report, providing all of the appropriate detail (e.g. stack trace, urllib3 version).)


sofiand-png avatar sofiand-png commented on July 3, 2024

@shazow, I have the impression that you only want to create new issues to discuss problems instead of fixing them in the original ticket; I can't understand why. You did the same here #38 ...


shazow avatar shazow commented on July 3, 2024

@sofiand-png urllib3 is a community project; you're very welcome to help fix problems that are important to you. That's what I do, too.

If you look a little closer, I did indeed push a fix to the original problem before closing the issue. And since then, we have completely rewritten the parsing, so whatever problems it has are completely new.

If you have a bug to report, please open an issue to discuss and I encourage you to follow up with a pull request that fixes your bug. Thank you!

