Canadian Legislative Scrapers

Usage

Follow the instructions in the Python Quick Start Guide to install Homebrew, Git, PostGIS, Python 3.3+ and virtualenv.

mkvirtualenv scrapers-ca --python=`which python3`
git clone https://github.com/opencivicdata/scrapers-ca.git
cd scrapers-ca
pip install -r requirements.txt

Initialize the database:

createdb pupa
psql pupa -c "CREATE EXTENSION postgis;"
pupa dbinit ca

If you get an error like "no password supplied", then you need to configure the default DATABASE_URL in pupa_settings.py, e.g. postgis://USERNAME:PASSWORD@localhost/pupa.
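If the password contains characters such as @ or /, they need to be percent-encoded before they go into the URL. The following is a minimal sketch, assuming DATABASE_URL in pupa_settings.py is a plain string; the helper name build_database_url and the sample credentials are hypothetical and are not part of this repository.

```python
from urllib.parse import quote


def build_database_url(username, password, host="localhost", database="pupa"):
    """Return a postgis:// URL with the credentials percent-encoded."""
    # safe="" also encodes "/" so it cannot be mistaken for a path separator.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"postgis://{user}:{secret}@{host}/{database}"


# Hypothetical credentials, for illustration only.
DATABASE_URL = build_database_url("alice", "p@ss/word")
print(DATABASE_URL)
```

The printed value can be pasted into pupa_settings.py as the DATABASE_URL string.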

Run a scraper

pupa update ca_ab_edmonton

To run only the scraping step and skip the import step, add the --scrape switch:

pupa update --scrape ca_ab_edmonton
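To run several scrapers in one go, the same command can be driven from a short script. This is a sketch only: it assumes the pupa command is on the PATH (for example, inside the virtualenv created above), and the function names build_update_command and run_scrapers are hypothetical helpers, not part of Pupa or this repository.

```python
import subprocess


def build_update_command(module, scrape_only=False):
    """Build the argument list for `pupa update [--scrape] MODULE`."""
    cmd = ["pupa", "update"]
    if scrape_only:
        cmd.append("--scrape")
    cmd.append(module)
    return cmd


def run_scrapers(modules, scrape_only=False):
    """Run each scraper in turn and return the modules that exited non-zero."""
    failed = []
    for module in modules:
        result = subprocess.run(build_update_command(module, scrape_only))
        if result.returncode != 0:
            failed.append(module)
    return failed
```

For example, run_scrapers(["ca_ab_edmonton"], scrape_only=True) runs the scrape step for Edmonton and returns an empty list if the command exits with status 0.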

For documentation on the pupa command:

pupa -h

For documentation on the update subcommand:

pupa update -h

Create a scraper

See the first few steps of this wiki page to create a scraper.

Develop a scraper

Read the Pupa documentation or an existing scraper's code.

Avoid the XPath string() function unless the expression is known to have no matches on some pages. Otherwise, a scraper may keep running without error even though it failed to find a match. Accompany each use of string() with a comment like # can be empty or # allow string().
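The reason is that string() applied to an empty node-set returns an empty string rather than raising an error. The sketch below illustrates the difference between a lenient and a strict lookup. It uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies, which is an assumption made for illustration and not the scrapers' actual parsing code; lenient_text and strict_text are hypothetical names, and lenient_text only mimics the behaviour of string().

```python
import xml.etree.ElementTree as ET

# A made-up page fragment: it has a name but no email element.
PAGE = "<div><p class='name'>Jane Doe</p></div>"


def lenient_text(root, path):
    """Mimic string(): return '' when nothing matches."""
    node = root.find(path)
    return "" if node is None else "".join(node.itertext()).strip()


def strict_text(root, path):
    """Raise when nothing matches, so a layout change stops the scraper."""
    node = root.find(path)
    if node is None:
        raise ValueError(f"no match for {path!r}")
    return "".join(node.itertext()).strip()


root = ET.fromstring(PAGE)
name = strict_text(root, ".//p[@class='name']")
email = lenient_text(root, ".//p[@class='email']")  # can be empty
```

Here name is "Jane Doe" and email is an empty string. Calling strict_text with the email path raises ValueError, which is the failure a maintainer wants to see when a field that should always be present goes missing.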

Use the get_email and get_phone helpers as much as possible.

In late 2014/early 2015, we disabled some single-jurisdiction scrapers to lower maintenance costs (some have since been re-enabled) and disabled all multi-jurisdiction scrapers because Pupa didn't support them. The disabled scrapers are in disabled/.

We heavily modify Pupa's validations in patch.py to be as strict as possible in order to keep data quality high. We subclass Pupa's Scraper, Jurisdiction and Person classes in utils.py to reduce code duplication and to correct common data quality issues.

Maintenance

List the available maintenance tasks:

invoke -l

Make the code style consistent:

flake8

Check module names, class names, classification, division_name, name and url in __init__.py files:

invoke tidy

Check that sources are credited and assertions are made:

invoke sources_and_assertions

Check jurisdiction URLs (look for Delete COUNCIL_PAGE or Missing COUNCIL_PAGE instructions):

invoke council_pages

Update the OCD-IDs:

curl -O https://raw.githubusercontent.com/opencivicdata/ocd-division-ids/master/identifiers/country-ca.csv

Check whether any non-authoritative CSVs are likely to be stale:

invoke csv_stale

Check whether any CSV errors can be reported to data publishers:

invoke csv_error

Scraper code rarely undergoes code review. The focus is on the quality of the data.

Bugs? Questions?

This repository is on GitHub: https://github.com/opencivicdata/scrapers-ca, where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.

Copyright (c) 2013 Open North Inc., released under the MIT license
