Work through the broken scrapers at http://scrapers.herokuapp.com/


Fix broken scrapers-ca scrapers about scrapers-ca HOT 10 CLOSED

jpmckinney commented on July 29, 2024

Fix broken scrapers-ca scrapers

from scrapers-ca.

Comments (10)

matthewleon commented on July 29, 2024

FYI, I am unable to reproduce the "list index out of range" errors on several of these scrapers: ca_on_guelph, ca_on_markham, ca_on_richmond_hill, and ca_qc_saint_jerome all complete without errors on my machine.

matthewleon commented on July 29, 2024

The ca_qc_mercier page returns a 403 to the scraper but loads fine in a browser. Maybe some kind of user-agent sniffing is going on? I will try to check later.

matthewleon commented on July 29, 2024

Same with ca_qc_montreal_est.

jpmckinney commented on July 29, 2024

Yeah, there are four scrapers that only fail on Heroku, as in the issue description. When you call lxmlize, you can pass a user_agent string. ca_pe_stratford uses a string for IE10.
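
A minimal sketch of the user-agent idea, using only the standard library. The signature of lxmlize is not shown in this thread, so build_request below is a hypothetical stand-in, and the IE10 string is an illustrative value, not necessarily the one ca_pe_stratford uses.

```python
from urllib.request import Request

# Illustrative IE10-style user-agent string (assumption, not taken from the repo).
IE10_USER_AGENT = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)"


def build_request(url, user_agent=None):
    """Hypothetical helper: build a request, optionally overriding User-Agent."""
    headers = {}
    if user_agent:
        headers["User-Agent"] = user_agent
    return Request(url, headers=headers)


req = build_request("http://example.com/", IE10_USER_AGENT)
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Passing the resulting request to urllib.request.urlopen would send the overridden header, which is often enough to get past simple user-agent checks.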

jpmckinney commented on July 29, 2024

I fixed all the Heroku-only failures. Most came from positional predicates like [2] in XPath. In ca_nb, for example, the scraper was picking the wrong image: on Heroku, //img[2] means "the second img within the same parent," while locally it's interpreted as "the second img anywhere in the document."
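
A small sketch of the two readings, assuming lxml is installed and using made-up markup. In XPath 1.0, // abbreviates /descendant-or-self::node()/, so the [2] in //img[2] is evaluated per parent; wrapping the path in parentheses, as in (//img)[2], applies the predicate to the whole node-set in document order. Writing whichever form is intended explicitly avoids depending on how a given build behaves.

```python
from lxml import html

# Hypothetical page: the first div has one image, the second has two.
PAGE = """
<html><body>
  <div id="a"><img src="a1.png"></div>
  <div id="b"><img src="b1.png"><img src="b2.png"></div>
</body></html>
"""

tree = html.fromstring(PAGE)

# //img[2]: the second img child of each parent element.
per_parent = [img.get("src") for img in tree.xpath("//img[2]")]

# (//img)[2]: the second img in the whole document, in document order.
in_document = [img.get("src") for img in tree.xpath("(//img)[2]")]

print(per_parent)
print(in_document)
```

With a spec-conforming XPath engine, the first list holds only b2.png and the second holds only b1.png.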

matthewleon commented on July 29, 2024

Why is there this difference in interpretation? Is there a different version of lxml running on Heroku?

jpmckinney commented on July 29, 2024

The Python package is the same version; maybe the C code is different? I assume only one of the two interpretations is correct, though.

matthewleon commented on July 29, 2024

Indeed. This is very strange.

Menerve commented on July 29, 2024

The local interpretation is the good one according to http://www.w3.org/TR/1999/REC-xpath-19991116/

jpmckinney commented on July 29, 2024

@matthewleon I added user agent strings for ca_qc_mercier and ca_qc_montreal_est. The scrapers now fail for a different reason (pupa.scrape.base.ScrapeError: no objects returned from people scrape), likely because the selectors no longer match.
