Comments (10)

matthewleon commented on July 29, 2024

FYI I am unable to duplicate the "list index out of range" errors on a bunch of these scrapers. ca_on_guelph, ca_on_markham, ca_on_richmond_hill, ca_qc_saint_jerome all complete without errors on my machine.

from scrapers-ca.

matthewleon commented on July 29, 2024

ca_qc_mercier page is 403'ing the scraper but loading fine in browser. Maybe some kind of user-agent sniffing going on? Will try to check later.

matthewleon commented on July 29, 2024

same with ca_qc_montreal_est

jpmckinney commented on July 29, 2024

Yeah, there are four scrapers that only fail on Heroku, as in the issue description. When you call lxmlize, you can pass a user_agent string. ca_pe_stratford uses a string for IE10.
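
The idea of sending a browser-like user agent can be sketched with the standard library alone. This is not the project's lxmlize helper, whose internals are not shown in this thread; build_request, the example URL, and the exact IE10-style string below are assumptions made for illustration.

```python
import urllib.request

# Illustrative IE10-style user agent string (an assumption, not necessarily
# the one ca_pe_stratford uses).
BROWSER_UA = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that identifies itself with the given user agent.

    Hypothetical helper: it only prepares the request and does no network I/O.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Placeholder URL; a scraper would pass the council page it needs.
req = build_request("https://example.com/council")

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Passing the prepared request to urllib.request.urlopen would then fetch the page with that header, which is the usual first thing to try when a site returns 403 to the default Python user agent.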

jpmckinney commented on July 29, 2024

I fixed all the Heroku-only failures. They were mostly caused by positional predicates like [2] in XPath. For example, ca_nb was picking the wrong image: on Heroku, //img[2] is interpreted as "the second img within the same parent," while locally it is interpreted as "the second img anywhere in the document."
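
The two readings can be told apart on a tiny document. The sketch below uses the standard library's ElementTree rather than lxml to stay dependency-free, and the markup and file names are made up. For reference, the XPath 1.0 recommendation defines // as an abbreviation for /descendant-or-self::node()/, so //img[2] expands to /descendant-or-self::node()/child::img[2], and the recommendation notes that //para[1] is not the same as /descendant::para[1].

```python
import xml.etree.ElementTree as ET

# Made-up markup: the first div holds one image, the second div holds two.
DOC = """
<body>
  <div><img src="logo.png"/></div>
  <div><img src="photo.png"/><img src="banner.png"/></div>
</body>
"""
root = ET.fromstring(DOC)

# Positional predicate: each img that is the second img child of its parent.
per_parent = [img.get("src") for img in root.findall(".//img[2]")]

# Document-wide reading: collect every img in document order, then index.
# (In full XPath 1.0, as supported by lxml, this is written (//img)[2].)
all_imgs = root.findall(".//img")
document_wide = all_imgs[1].get("src")

print(per_parent)     # ['banner.png']
print(document_wide)  # photo.png
```

Writing the selector so that it means the same thing under both readings, for example by indexing the result list in Python or by wrapping the path in parentheses, avoids depending on how a given build evaluates the predicate.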

matthewleon commented on July 29, 2024

Why is there this difference in interpretation? Is there a different version of lxml running on Heroku?

jpmckinney commented on July 29, 2024

The Python package is the same version; maybe the C code is different? I assume only one of the two interpretations is correct, though.

matthewleon commented on July 29, 2024

Indeed. This is very strange.

Menerve commented on July 29, 2024

The local interpretation is the correct one according to http://www.w3.org/TR/1999/REC-xpath-19991116/

jpmckinney commented on July 29, 2024

@matthewleon I added user agent strings for ca_qc_mercier and ca_qc_montreal_est. The scrapers now fail for a different reason (pupa.scrape.base.ScrapeError: no objects returned from people scrape) likely because the selectors don't work anymore.
