opencivicdata / pupa
framework for scraping legislative/government data
License: BSD 3-Clause "New" or "Revised" License
I noticed that my Pupa.rb scrapers (once HTTP responses were cached) were spending half their time on file I/O, reading from the cache and writing the JSON documents. I added an option to use in-memory stores (Memcached for the cache and Redis for the JSON documents) and it sped things up by around 4x (100 seconds down to 25). https://github.com/opennorth/pupa-ruby#reducing-file-io
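For illustration, here is a minimal sketch of the idea; an in-memory dict stands in for Memcached/Redis, and the `MemoryCache` class and its get/set interface are hypothetical, not pupa's actual API:

```python
import hashlib

class MemoryCache:
    # Hypothetical stand-in for a Memcached/Redis-backed store: it exposes
    # the same get/set-by-URL interface a filesystem cache would, but keeps
    # everything in memory, avoiding per-request disk I/O.
    def __init__(self):
        self._store = {}

    def _key(self, url):
        # Hash the URL so keys stay short and safe for any backend.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def get(self, url):
        return self._store.get(self._key(url))

    def set(self, url, body):
        self._store[self._key(url)] = body

cache = MemoryCache()
cache.set("http://example.com/people", "<html>...</html>")
```

Swapping the backend behind a common interface like this is what lets the cache and the JSON document store move off disk without touching scraper code.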
What relies on terms and sessions being set? What would happen if we simply ignored the existence of terms and sessions? We have 68 jurisdictions so far, and I don't see any advantage to tracking terms and sessions. Is there a way to make them optional?
Can't go back to Canada's Confederation if you stop at 1900.
The documentation at http://docs.opencivicdata.org/en/latest/scrape/index.html says to run this command:
pip install -e https://github.com/opencivicdata/pupa.git
But running that command returns an error:
jeff$ pip install -e https://github.com/opencivicdata/pupa.git
https://github.com/opencivicdata/pupa.git should either be a path to a local project or a VCS url beginning with svn+, git+, hg+, or bzr+
This command is successful:
pip install -U git+https://github.com/opencivicdata/pupa.git#egg=pupa
I would recommend changing the documentation to include the revised command.
this was experimental I think, maybe it goes away, or maybe it sticks around but in a cleaner/documented form
Should they be copied to roles as well?
right now issues will arise if there's more than one level, will have to import one level at a time or do a post-import step (less desirable)
Any reason to use uuid1? Some languages do not have UUID v1 in their core libraries, but only UUID v4 (e.g. Ruby). Note that in the Python docs:
Note that uuid1() may compromise privacy since it creates a UUID containing the computer’s network address.
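A quick illustration of the difference with Python's standard uuid module:

```python
import uuid

# uuid1() mixes a timestamp with the host's MAC address, so the UUID can
# reveal which machine generated it; uuid4() is random and leaks nothing.
u1 = uuid.uuid1()
u4 = uuid.uuid4()

# Both are valid UUIDs, but only the v1 UUID carries a node
# (network address) field derived from the host.
print(u1.version, u4.version)
```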
Personally, I don't have a use for links on the model, but I think the schema and model should agree. Either we remove links from the schema, or add it to the model. @jamesturk Which do you think?
By the way, I'm not sure if I should report issues to JIRA for this repository or not. The README didn't have a link.
See comments in 54c3d0a
http://docs.opencivicdata.org/en/latest/scrape/new.html is also out-of-date.
Something crazy came back - I remember something like this from pre-alpha versions of pupa that didn't have matching code
> db.people.distinct("name", {"sources.url": "https://chicago.legistar.com/People.aspx"}).length
52
> db.people.find({"sources.url": "https://chicago.legistar.com/People.aspx"}).count()
158
I'm filing a ticket, since this turned out to be a bigger issue than I thought it was.
report['import'] = self.do_import(juris, args)
File "/home/tag/dev/sunlight/pupa/pupa/cli/commands/update.py", line 197, in do_import
report.update(org_importer.import_from_json(args.datadir))
File "/home/tag/dev/sunlight/pupa/pupa/importers/base.py", line 137, in import_from_json
inverse[_hash(obj)].append(json_id)
File "/home/tag/dev/sunlight/pupa/pupa/importers/base.py", line 19, in _hash
return hash(obj)
TypeError: unhashable type: 'Organization'
It imports in Python 2. Something odd is going on. Filing this to look into it after I settle another issue.
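For context, this looks like a Python 3 behavior change at work. The class below is a minimal stand-in, not pupa's actual model: any class that defines __eq__ without __hash__ is hashable in Python 2 but unhashable in Python 3.

```python
class Organization:
    # Minimal stand-in for pupa's model class, assuming it defines
    # equality for duplicate detection.
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return self.name == other.name
    # In Python 3, defining __eq__ without __hash__ sets __hash__ to None,
    # so hash(obj) raises TypeError; Python 2 kept the default id()-based hash.

org = Organization("Alaska State House")
try:
    hash(org)
    hashable = True
except TypeError:
    hashable = False
```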
import dynamics should be much simpler than billy
Some provinces have directories of elected officials for all municipalities in the province. It would be very high maintenance to have one scraper per municipality (even with some code automation). How can I write one scraper that collects information for multiple jurisdictions? Any internals I can hack around?
While performance testing Pupa.rb, I realized that the duplicate detection was running in O(n²) when it can be done in O(n). This made import go from many minutes (I didn't wait for it to finish) to one minute for a particular scraper importing 10,000 docs. Here's the Ruby code.
The Ruby code takes advantage of the fact that hashes (dicts) are hashable; they aren't in Python, but I guess you can repr the dict and use that as the key.
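A Python sketch of the same O(n) approach (the object structure here is hypothetical): serialize each object deterministically and use the string as a dict key, so each object is checked against all previous ones in a single pass instead of pairwise.

```python
import json

def find_duplicates(objects):
    # One O(n) pass: map each object's canonical JSON form to the first
    # ID seen with it; later matches are recorded as duplicates of it.
    first_seen = {}
    duplicates = {}
    for obj in objects:
        key = json.dumps(obj["data"], sort_keys=True)
        if key in first_seen:
            duplicates[obj["id"]] = first_seen[key]
        else:
            first_seen[key] = obj["id"]
    return duplicates

dupes = find_duplicates([
    {"id": "A", "data": {"name": "Jane Doe"}},
    {"id": "B", "data": {"name": "Jane Doe"}},
    {"id": "C", "data": {"name": "John Roe"}},
])
```

json.dumps with sort_keys=True plays the role that repr of a sorted hash plays in the Ruby version: a stable string key for the dict.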
Simple creation of people objects is going to be handy for people contributing spreadsheets of manually collected information.
For a proof of concept: rework pupa.importers.base to import the stream from memory (import_directory reflowed to work off an iterator, defaulting to a filesystem JSON stream, and use that for loading the CSV stream or something).
…like the Legislator object, so that we can .add_membership on an org when we have to scrape Committees outside of the People scraper.
In our scrapers, we just put the jurisdiction URL (usually a plain domain without a path), rather than the specific page on the jurisdiction's website that has information about its legislature. With 80+ jurisdictions who regularly change their URL scheme, it seems an unnecessary maintenance cost to put anything more specific than a domain there.
What is the actual use case for legislature_url? Is it implementation-specific?
One should be able to specify a log level at the CLI, overriding the one in settings.py.
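A sketch of how that could look; the flag name and the settings variable are assumptions, not pupa's actual interface:

```python
import argparse
import logging

SETTINGS_LOG_LEVEL = "INFO"  # hypothetical stand-in for the value in settings.py

parser = argparse.ArgumentParser()
parser.add_argument("--loglevel", default=None,
                    choices=["DEBUG", "INFO", "WARNING", "ERROR"])
args = parser.parse_args(["--loglevel", "DEBUG"])  # simulated CLI input

# The CLI flag, when given, overrides the settings.py default.
level_name = args.loglevel or SETTINGS_LOG_LEVEL
logging.basicConfig(level=getattr(logging, level_name))
```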
@paultag's recent pull request reminded me of another issue I discovered while implementing the Ruby version. If you compare Pupa.py's code to Pupa.rb's:
https://github.com/opencivicdata/pupa/blob/master/pupa/importers/base.py#L123
https://github.com/opennorth/pupa-ruby/blob/master/lib/pupa/processor.rb#L278
You'll notice two differences:
The second difference is more important. In a simple example, A, B and C are all the same object with different IDs. The py code runs like:
Later, dedupe_json_id will return the wrong ID for C (it will return B's ID, which won't be imported, instead of A's).
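The failure mode can be sketched like this (hypothetical mapping, not pupa's actual data structures): if each duplicate is mapped to the previously seen duplicate rather than to the first, canonical object, a lookup has to follow the chain or it returns an ID that is never imported.

```python
# B was recorded as a duplicate of A, then C as a duplicate of B.
dedupe_map = {"B": "A", "C": "B"}

def naive_lookup(json_id):
    # Returns B for C: wrong, since B itself is a duplicate and is
    # never imported into the database.
    return dedupe_map.get(json_id, json_id)

def resolve(json_id):
    # Follow the chain until a canonical (imported) ID is reached.
    while json_id in dedupe_map:
        json_id = dedupe_map[json_id]
    return json_id
```

Either the map should always point at the canonical ID when entries are added, or lookups should chase the chain as resolve does.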
In Chicago, some bills are handled by omnibus bills like https://chicago.legistar.com/LegislationDetail.aspx?ID=1526672&GUID=4B542665-F6DD-4C4C-BA4B-6107DBDA8BD7
We need to model this type of relation.
work with @twneale on this, look to granny for a foundation
For our scrapers, we have different templates we'd like to use for new scrapers when running the pupa init command. If it were possible to configure the path to the examples directory (through pupa_settings.py maybe) that would be great!
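One possible shape for this (INIT_TEMPLATE_DIR is an invented setting name, not a real pupa setting): fall back to the bundled examples directory when no custom path is configured.

```python
import os

def template_dir(settings=None):
    # Hypothetical: a custom path from pupa_settings.py wins; otherwise
    # use the examples directory bundled with the package.
    custom = getattr(settings, "INIT_TEMPLATE_DIR", None) if settings else None
    return custom or os.path.join(os.path.dirname(os.path.abspath(__file__)), "examples")
```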
I thought I probably shouldn't go messing around in pupa too much without one of Paul/James here to check with, but I'm guessing we just need to initialize self.chamber = None on the pupa Person model.
$ pupa update boise
...
[Boise scraper creates a chamberless person]
...
Traceback (most recent call last):
File "/home/thom/.virtualenvs/pupa3/bin/pupa", line 9, in <module>
load_entry_point('pupa==0.1.0', 'console_scripts', 'pupa')()
File "/home/thom/sunlight/pupa/pupa/cli/__main__.py", line 30, in main
subcommands[args.subcommand].handle(args)
File "/home/thom/sunlight/pupa/pupa/cli/commands/update.py", line 251, in handle
report['import'] = self.do_import(juris, args)
File "/home/thom/sunlight/pupa/pupa/cli/commands/update.py", line 177, in do_import
report.update(person_importer.import_from_json(args.datadir))
File "/home/thom/sunlight/pupa/pupa/importers/base.py", line 145, in import_from_json
self.json_to_db_id[json_id] = self.import_object(obj)
File "/home/thom/sunlight/pupa/pupa/importers/base.py", line 87, in import_object
spec = self.get_db_spec(obj)
File "/home/thom/sunlight/pupa/pupa/importers/people.py", line 25, in get_db_spec
if person.chamber:
AttributeError: chamber
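The suggested fix, sketched with a minimal stand-in for the Person model (not pupa's real class): initializing the attribute to None means the importer's if person.chamber: check is always safe.

```python
class Person:
    # Minimal stand-in: defaulting chamber to None means the attribute
    # always exists, so chamberless people no longer raise AttributeError.
    def __init__(self, name, chamber=None):
        self.name = name
        self.chamber = chamber

person = Person("Boise Councilmember")  # a chamberless person
spec = {"name": person.name}
if person.chamber:  # safe now; previously raised AttributeError
    spec["chamber"] = person.chamber
```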
should be ongoing after we start to stabilize the API
crazy idea?
code in billy should be adaptable to orgs
For example, if a provincial website has information for all its municipalities.
mostly working:
the slightly more complex nature lends itself to taking a page from Django and having a start-project script that'll set up a new importable module that runs (& fails with reasonably helpful errors)
(right now it's sitting in the scraper)
so within the root Jurisdiction we need to create an organization (or multiple in the case of chambers)
This is the current syntax (somewhat of a mess), Proposal 0.
class Alaska(Jurisdiction):
    division_id = 'ocd-division/country:us/state:ak'
    name = 'Alaska State Legislature'
    url = 'http://legis.state.ak.us'

    def organizations(self):
        yield Organization('Alaska State House', classification='legislature', chamber='lower')
        yield Organization('Alaska State Senate', classification='legislature', chamber='upper')

    def posts(self):
        for n in range(1, 41):
            yield Post(label=str(n), role='Representative', organization_id='~legislature:lower')
        for n in range(65, 85):
            yield Post(label=chr(n), role='Senator', organization_id='~legislature:upper')
        # note: ~legislature:lower is a special id syntax that dispatches to the
        # id resolver to find the related chamber
I'd like to use this thread to figure out better syntax that works for all cases.
Proposal 1
Declarative approach, a lot of typing but no need to directly invoke subclasses (they'll get created from the dicts with reasonable defaults for any missing keys)
class Alaska(Jurisdiction):
    organizations = [
        {'name': 'Alaska State House', 'chamber': 'lower'},
        {'name': 'Alaska State Senate', 'chamber': 'upper'},
    ]
    posts = [
        {'chamber': 'upper', 'name': 'A', 'role': 'Senator'},
        ...
        {'chamber': 'lower', 'name': '1', 'role': 'Representative'},
        ...
    ]
    # and for non-bicameral, chamber could be omitted
But I think we might be able to do better, especially with posts.
ccing @jpmckinney @paultag @twneale
all of the schema enums should be configurable
This functionality can be "opt-in". Curious to hear from others re: implementation.
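One opt-in shape this could take (EXTRA_CLASSIFICATIONS is an invented setting name, and the default list here is illustrative): extend the default enum values with whatever the deployment's settings add.

```python
DEFAULT_CLASSIFICATIONS = ("legislature", "party", "committee")

def allowed_classifications(settings=None):
    # Opt-in: extra values from settings are appended to the defaults;
    # with no settings supplied, behavior is unchanged.
    extra = getattr(settings, "EXTRA_CLASSIFICATIONS", ()) if settings else ()
    return list(DEFAULT_CLASSIFICATIONS) + list(extra)
```

Schema validation would then check values against allowed_classifications(settings) rather than a hard-coded enum.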
tests are covering, but are they handling edge cases?
For example, the Boise, Temecula and Cleveland city council organizations do not have a division id attribute.