Comments (10)
FYI I am unable to duplicate the "list index out of range" errors on a bunch of these scrapers. ca_on_guelph, ca_on_markham, ca_on_richmond_hill, ca_qc_saint_jerome all complete without errors on my machine.
from scrapers-ca.
ca_qc_mercier page is 403'ing the scraper but loading fine in browser. Maybe some kind of user-agent sniffing going on? Will try to check later.
Same with ca_qc_montreal_est.
Yeah, there are four scrapers that only fail on Heroku, as in the issue description. When you call lxmlize, you can pass a user_agent string. ca_pe_stratford uses a string for IE10.
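The idea of overriding the user agent can be sketched with the standard library alone. This is not the project's lxmlize helper: the function name build_request and the exact IE10 string below are illustrative assumptions, and the request is only constructed here, not sent.

```python
from urllib.request import Request

# Example IE10-style user-agent string (an assumption for illustration;
# the string actually used by ca_pe_stratford may differ).
IE10_USER_AGENT = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)"


def build_request(url, user_agent=IE10_USER_AGENT):
    """Return a Request that identifies itself with the given user agent.

    Hypothetical helper: it only builds the request object. Passing the
    result to urllib.request.urlopen() would actually fetch the page.
    """
    return Request(url, headers={"User-Agent": user_agent})
```

A site that returns 403 for the default Python user agent may accept a request built this way, although that depends entirely on how the server filters clients.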
I fixed all the Heroku-only failures. They were mostly around the use of things like [2] in XPath. For ca_nb, for example, where it was picking the wrong image: on Heroku, //img[2] means "the second img within the same parent." Locally, it's interpreted as "the second img anywhere in the document."
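The two readings can be told apart on a small document. The sketch below uses the standard library's ElementTree, which supports only a subset of XPath, instead of lxml. The markup, file names, and function names are made up for illustration. In full XPath 1.0, as lxml exposes it, the document-wide form would be written (//img)[2].

```python
import xml.etree.ElementTree as ET

# Made-up markup: the first div holds one image, the second holds two.
PAGE = """
<html><body>
  <div id="header"><img src="logo.png"/></div>
  <div id="profile"><img src="banner.png"/><img src="photo.png"/></div>
</body></html>
"""


def second_img_per_parent(root):
    """Reading 1: every img that is the second img child of its own parent."""
    return [img.get("src") for img in root.findall(".//img[2]")]


def second_img_in_document(root):
    """Reading 2: the second img anywhere in the document, in document order."""
    return root.findall(".//img")[1].get("src")


root = ET.fromstring(PAGE)
print(second_img_per_parent(root))   # the per-parent reading
print(second_img_in_document(root))  # the whole-document reading
```

On this markup the two readings pick different images, which is the kind of mismatch that would make a scraper grab the wrong photo in one environment only.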
Why is there this difference in interpretation? Is there a different version of lxml running on Heroku?
The Python package is the same version; maybe the C code is different? I assume only one of the two interpretations is correct, though.
Indeed. This is very strange.
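According to the XPath 1.0 specification (http://www.w3.org/TR/1999/REC-xpath-19991116/), the Heroku interpretation is the correct one: //img[2] is short for /descendant-or-self::node()/child::img[2], so it selects every img that is the second img child of its parent. The second img in the whole document is (//img)[2] or /descendant::img[2].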
@matthewleon I added user agent strings for ca_qc_mercier and ca_qc_montreal_est. The scrapers now fail for a different reason (pupa.scrape.base.ScrapeError: no objects returned from people scrape), likely because the selectors don't work anymore.