
Comments (4)

dangra commented on May 21, 2024

I just tried this. It now logs a warning, which is caused by the restrict_xpaths argument; this looks like the same bug as #199.

$ scrapy shell 'http://www.last.fm/music/AC%252FDC/+images'
2013-01-29 12:51:24-0200 [scrapy] INFO: Scrapy 0.17.0 started (bot: scrapybot)
...
>>> from scrapy.contrib.linkextractors import sgml
>>> sgml.SgmlLinkExtractor(restrict_xpaths=('//a[@class="nextlink"]')).extract_links(response)
/home/daniel/src/scrapy/scrapy/link.py:16: UserWarning: Do not instantiate Link objects with unicode urls. Assuming utf-8 encoding (which could be wrong)
  warnings.warn("Do not instantiate Link objects with unicode urls. " \
[Link(url='http://www.last.fm/music/AC/DC/+images?page=2', text=u'Next', fragment='', nofollow=False)]


dangra commented on May 21, 2024

This is actually a different bug: canonicalize_url unquotes the URL path, converting %2F into /, and the result is a wrong URL for last.fm.
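To illustrate why unquoting the path changes the URL, here is a minimal sketch using only the Python 3 standard library. It does not reproduce Scrapy's canonicalize_url; it only assumes that the page links to a URL whose path contains an encoded slash (AC%2FDC) and shows what a blanket unquote does to the path segments.

```python
from urllib.parse import unquote, urlsplit

# Assumed example URL: the artist name "AC/DC" is a single path segment,
# so its slash is percent-encoded as %2F.
url = "http://www.last.fm/music/AC%2FDC/+images?page=2"

path = urlsplit(url).path   # '/music/AC%2FDC/+images'
naive = unquote(path)       # '/music/AC/DC/+images'

# The encoded path has three segments after the leading slash;
# the unquoted one has four, so it names a different resource.
print(path.split("/"))
print(naive.split("/"))
```

Once %2F has been decoded there is no way to tell the data slash apart from a real path separator, which is why the extracted link no longer points at the same page.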


dangra commented on May 21, 2024

This is the same problem described at https://github.com/kennethreitz/requests/pull/273

Wikipedia explains more clearly which characters are reserved and must stay quoted in a path: http://en.wikipedia.org/wiki/Percent-encoding
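One way to respect the reserved characters is to decode only percent-escapes that stand for unreserved characters (letters, digits, and "-", ".", "_", "~") and leave every other escape in place. The sketch below is an assumption about how that could look in Python 3; the helper name unquote_unreserved is hypothetical here, and this is neither the code from the linked pull request nor Scrapy's eventual fix.

```python
import re
import string

# Unreserved characters per the percent-encoding rules: these may be
# decoded without changing the meaning of the URL.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_PCT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def unquote_unreserved(path):
    """Decode only escapes of unreserved characters (hypothetical helper).

    Escapes of reserved or non-ASCII bytes are kept, with their hex
    digits upper-cased so equivalent URLs compare equal.
    """
    def replace(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()

    return _PCT_ESCAPE.sub(replace, path)


result = unquote_unreserved("/music/AC%2fDC/%7Euser")
print(result)
```

With this approach the encoded slash in AC%2fDC stays encoded (normalized to %2F), while %7E, which stands for the unreserved "~", is decoded.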


dangra commented on May 21, 2024

The tests pass:

scrapy.tests.test_contrib_linkextractors
  LinkExtractorTestCase
    test_base_url ...                                                      [OK]
    test_basic ...                                                         [OK]
    test_extraction_encoding ...                                           [OK]
    test_link_nofollow ...                                                 [OK]
    test_link_text_wrong_encoding ...                                      [OK]
    test_matches ...                                                       [OK]
  SgmlLinkExtractorTestCase
    test_base_url_with_restrict_xpaths ...                                 [OK]
    test_deny_extensions ...                                               [OK]
    test_encoded_url ...                                                   [OK]
    test_encoded_url_in_restricted_xpath ...                               [OK]
    test_extraction ...                                                    [OK]
    test_extraction_using_single_values ...                                [OK]
    test_matches ...                                                       [OK]
    test_process_value ...                                                 [OK]
    test_restrict_xpaths ...                                               [OK]
    test_restrict_xpaths_concat_in_handle_data ...                         [OK]
    test_restrict_xpaths_encoding ...                                      [OK]
    test_urls_type ...                                                     [OK]
scrapy.tests.test_utils_url
  UrlUtilsTest
    test_canonicalize_url ...                                              [OK]
    test_url_is_from_any_domain ...                                        [OK]
    test_url_is_from_spider ...                                            [OK]
    test_url_is_from_spider_class_attributes ...                           [OK]
    test_url_is_from_spider_with_allowed_domains ...                       [OK]
    test_url_is_from_spider_with_allowed_domains_class_attributes ...      [OK]
Doctest: scrapy.utils
  url
    escape_ajax ...                                                        [OK]

-------------------------------------------------------------------------------
Ran 25 tests in 0.031s

PASSED (successes=25)
