
pupa's Introduction

Pupa: A legislative data scraping framework


pupa's People

Contributors

antidipyramid, azban, boblannon, crdunwel, csnardi, divergentdave, fgregg, hancush, jamesturk, jmcarp, mileswwatkins, patcon, paultag, rshorey, twneale, vanuan


pupa's Issues

Issue with pupa installation

The documentation here: http://docs.opencivicdata.org/en/latest/scrape/index.html
Says to run this command: pip install -e https://github.com/opencivicdata/pupa.git

But running that command returns an error:
jeff$ pip install -e https://github.com/opencivicdata/pupa.git
https://github.com/opencivicdata/pupa.git should either be a path to a local project or a VCS url beginning with svn+, git+, hg+, or bzr+

This command is successful:
pip install -U git+https://github.com/opencivicdata/pupa.git#egg=pupa

I would recommend changing the documentation to include the revised command.

Use uuid4 not uuid1

Any reason to use uuid1? Some languages do not have uuid v1 in their core libraries, but only uuid v4 (e.g. Ruby). Note that in the Python docs:

Note that uuid1() may compromise privacy since it creates a UUID containing the computer’s network address.
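The privacy difference is easy to see in Python: every `uuid1()` generated on the same machine ends with the same 12-hex-digit node identifier (typically the MAC address), while `uuid4()` is purely random.

```python
import uuid

# uuid1 embeds a timestamp plus the host's node ID (usually the MAC address)
# as the last 12 hex digits, so IDs from one machine share a suffix and leak
# the network address.
a = uuid.uuid1()
b = uuid.uuid1()
assert str(a)[-12:] == str(b)[-12:]  # same node identifier

# uuid4 contains no host-identifying information.
c = uuid.uuid4()
d = uuid.uuid4()
assert c.version == 4 and d.version == 4
```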

run logging redux

  • scrape record
  • handle exceptions
  • import record
  • report record?
  • save record

Membership schema has links, but not the model

Personally, I don't have a use for links on the model, but I think the schema and model should agree. Either we remove links from the schema, or add it to the model. @jamesturk Which do you think?

By the way, I'm not sure if I should report issues to JIRA for this repository or not. The README didn't have a link.

import duplication of legislators

Something strange came back; I remember behavior like this from pre-alpha versions of pupa that didn't have matching code:

> db.people.distinct("name", {"sources.url": "https://chicago.legistar.com/People.aspx"}).length
52
> db.people.find({"sources.url": "https://chicago.legistar.com/People.aspx"}).count()
158

I'm filing a ticket, since this turned out to be a bigger issue than I thought it was.

Import on Python 3 appears to break

    report['import'] = self.do_import(juris, args)
  File "/home/tag/dev/sunlight/pupa/pupa/cli/commands/update.py", line 197, in do_import
    report.update(org_importer.import_from_json(args.datadir))
  File "/home/tag/dev/sunlight/pupa/pupa/importers/base.py", line 137, in import_from_json
    inverse[_hash(obj)].append(json_id)
  File "/home/tag/dev/sunlight/pupa/pupa/importers/base.py", line 19, in _hash
    return hash(obj)
TypeError: unhashable type: 'Organization'

It imports in Python 2. Something odd is going on. Filing this to look into it after I settle another issue.
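This is likely the Python 3 rule that a class defining `__eq__` without `__hash__` has `__hash__` set to `None` and becomes unhashable, whereas Python 2 kept the default identity-based hash. A minimal illustration, using an illustrative `Organization` class rather than pupa's actual model:

```python
# In Python 3, defining __eq__ without __hash__ makes instances unhashable;
# Python 2 silently kept the default id()-based hash, which is why the same
# code imported fine there. (This class is a sketch, not pupa's model.)

class Organization:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return self.name == other.name

    # The fix: define __hash__ explicitly so dict/set membership works again.
    def __hash__(self):
        return hash(self.name)

org = Organization("Alaska State House")
assert hash(org) == hash("Alaska State House")
```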

add people & organization importer

import dynamics should be much simpler than billy

  • load orgs
  • load people
  • link parents
  • load memberships
  • fix post ids changing on each import

One scraper. Multiple jurisdictions?

Some provinces have directories of elected officials for all municipalities in the province. It would be very high maintenance to have one scraper per municipality (even with some code automation). How can I write one scraper that collects information for multiple jurisdictions? Any internals I can hack around?

Optimize duplicate detection

While performance testing Pupa.rb, I realized that the duplicate detection was running in O(n²) when it can be done in O(n). This made import go from many minutes (I didn't wait for it to finish) to one minute for a particular scraper importing 10,000 docs. Here's the Ruby code.

The Ruby code takes advantage of the fact that hashes (dicts) are hashable; they aren't in Python, but I guess you can repr the dict and use that as the key.
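A sketch of that idea in Python, using a canonical JSON dump of each dict as the hash key so deduplication is a single O(n) pass (the `json_id` naming is illustrative):

```python
import json

def dedupe(objects):
    """Group duplicate dicts in O(n) using a canonical serialization as key.

    Dicts aren't hashable in Python, but a sorted-key JSON dump of each one
    is, so one pass over (json_id, obj) pairs suffices.
    """
    seen = {}        # canonical serialization -> first json_id seen
    duplicates = {}  # json_id -> json_id of the first copy it duplicates
    for json_id, obj in objects:
        key = json.dumps(obj, sort_keys=True)
        if key in seen:
            duplicates[json_id] = seen[key]
        else:
            seen[key] = json_id
    return duplicates

# dedupe([("a", {"name": "X"}), ("b", {"name": "X"}), ("c", {"name": "Y"})])
# → {"b": "a"}
```

`sort_keys=True` matters: two dicts with the same contents in different insertion orders must serialize identically to be recognized as duplicates.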

simple spreadsheet importer

Simple creation of people objects is going to be handy for people contributing spreadsheets of manually collected information.

For a proof of concept:

  • Convert CSV entries to Popolo-formatted people
  • Tweak pupa.importers.base to import the stream from memory (reflow import_directory to work off an iterator, defaulting to a filesystem JSON stream, and use that to load the CSV stream or similar)
  • Validate Jurisdictions (or otherwise become strict on loading new jurisdictions)
  • Add "moderated" flag to the data uploaded (or different DB)
  • Way to create council Orgs / Jurisdictions
  • Committee data import
  • Post data
  • division IDs? (long shot here, likely not without a UI to pick)
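The first bullet could be prototyped roughly like this; the column names and output shape are assumptions, and a real importer would validate against the Popolo person schema:

```python
import csv
import io

def csv_to_people(csv_text):
    """Convert CSV rows to minimal Popolo-style person dicts.

    Expects 'name' and optional 'email' columns (illustrative choices);
    rows without an email get an empty contact_details list.
    """
    people = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        people.append({
            "name": row["name"],
            "contact_details": [
                {"type": "email", "value": row["email"]},
            ] if row.get("email") else [],
        })
    return people

people = csv_to_people("name,email\nJane Doe,jane@example.com\nJohn Roe,\n")
```

Feeding the resulting list into an iterator-based `import_directory` replacement would cover the second bullet without touching the filesystem.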

Docs for legislature_url?

In our scrapers, we just put the jurisdiction URL (usually a plain domain without a path), rather than the specific page on the jurisdiction's website that has information about its legislature. With 80+ jurisdictions that regularly change their URL schemes, it seems an unnecessary maintenance cost to put anything more specific than a domain there.

What is the actual use case for legislature_url? Is it implementation-specific?

Issues with duplicates map

@paultag's recent pull request reminded me of another issue I discovered while implementing the Ruby version. If you compare Pupa.py's code to Pupa.rb's:

https://github.com/opencivicdata/pupa/blob/master/pupa/importers/base.py#L123
https://github.com/opennorth/pupa-ruby/blob/master/lib/pupa/processor.rb#L278

You'll notice two differences:

  1. .rb doesn't compare objects to themselves when finding duplicates :)
  2. .rb skips an object if it's already been labeled a duplicate.

The second difference is more important. In a simple example, A, B and C are all the same object with different IDs. The py code runs like:

  1. Looking for duplicates of A
  2. B and C marked as duplicates of A
  3. Looking for duplicates of B
  4. C marked as duplicate of B
  5. Looking for duplicates of C (I just noticed that we both do an unnecessary iteration with the last item in the list)

Later, dedupe_json_id will return the wrong ID for C (it will return B's ID, which won't be imported, instead of A's).
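The corrected loop can be sketched as follows (function and variable names are illustrative, not pupa's actual internals). Skipping objects already marked as duplicates ensures every member of a chain maps to the first copy, the only one that actually gets imported:

```python
def find_duplicates(items):
    """Pairwise duplicate detection that always maps to the *first* copy.

    `items` is a list of (json_id, obj) pairs. Without the skip, C would be
    labeled a duplicate of B, whose ID is never imported, instead of A.
    """
    duplicates = {}  # json_id -> json_id of the first equal object
    for i, (id_i, obj_i) in enumerate(items):
        if id_i in duplicates:
            continue  # already a duplicate of an earlier object; skip it
        for id_j, obj_j in items[i + 1:]:  # never compare an object to itself
            if id_j not in duplicates and obj_j == obj_i:
                duplicates[id_j] = id_i
    return duplicates

# With A, B, C all equal, both B and C map to A:
# find_duplicates([("A", x), ("B", x), ("C", x)]) → {"B": "A", "C": "A"}
```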

Make path to template files configurable for init command

For our scrapers, we have different templates we'd like to use for new scrapers when running the pupa init command. If it were possible to configure the path to the examples directory (through pupa_settings.py maybe) that would be great!

Pupa import balks at chamberless person

I thought I probably shouldn't go messing around in pupa too much without one of Paul/James here to check with, but I'm guessing we just need to initialize

self.chamber = None 

on the pupa Person model.

$ pupa update boise
...
[Boise scraper creates a chamberless person]
...

Traceback (most recent call last):
File "/home/thom/.virtualenvs/pupa3/bin/pupa", line 9, in <module>
  load_entry_point('pupa==0.1.0', 'console_scripts', 'pupa')()
File "/home/thom/sunlight/pupa/pupa/cli/__main__.py", line 30, in main
  subcommands[args.subcommand].handle(args)
File "/home/thom/sunlight/pupa/pupa/cli/commands/update.py", line 251, in handle
  report['import'] = self.do_import(juris, args)
File "/home/thom/sunlight/pupa/pupa/cli/commands/update.py", line 177, in do_import
  report.update(person_importer.import_from_json(args.datadir))
File "/home/thom/sunlight/pupa/pupa/importers/base.py", line 145, in import_from_json
  self.json_to_db_id[json_id] = self.import_object(obj)
File "/home/thom/sunlight/pupa/pupa/importers/base.py", line 87, in import_object
  spec = self.get_db_spec(obj)
File "/home/thom/sunlight/pupa/pupa/importers/people.py", line 25, in get_db_spec
  if person.chamber:
AttributeError: chamber
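Either fix can be sketched like this; the classes below mirror, rather than reproduce, pupa's `Person` model and people importer:

```python
# Option 1: default the attribute on the model so it always exists.
class Person:
    def __init__(self, name, chamber=None):
        self.name = name
        self.chamber = chamber  # None for chamberless people

# Option 2: make the importer defensive instead (sketch of get_db_spec).
def get_db_spec(person):
    spec = {"name": person.name}
    if getattr(person, "chamber", None):  # tolerate a missing attribute
        spec["chamber"] = person.chamber
    return spec

assert get_db_spec(Person("Jane Doe")) == {"name": "Jane Doe"}
```

Defaulting on the model is probably cleaner, since every importer that touches `chamber` would otherwise need the same `getattr` guard.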

improve metadata importer

mostly working:

  • validate metadata
  • PRESERVED_FIELDS
  • datetime.combine dates? or leave to filters maybe

add a start-project script

Pupa's slightly more complex setup lends itself to taking a page from Django: a start-project script that sets up a new importable module which runs (and fails with reasonably helpful errors).

proposed syntax for setting Posts/Organizations

so within the root Jurisdiction we need to create an organization (or multiple in the case of chambers)

This is the current syntax (somewhat of a mess), Proposal 0.

class Alaska(Jurisdiction):
    division_id = 'ocd-division/country:us/state:ak'
    name = 'Alaska State Legislature'
    url = 'http://legis.state.ak.us'

    def organizations(self):
        yield Organization('Alaska State House', classification='legislature', chamber='lower')
        yield Organization('Alaska State Senate', classification='legislature', chamber='upper')

    def posts(self):
        # House districts are numbered 1-40; Senate districts are lettered A-T
        for n in range(1, 41):
            yield Post(label=str(n), role='Representative', organization_id='~legislature:lower')
        for n in range(65, 85):
            yield Post(label=chr(n), role='Senator', organization_id='~legislature:upper')

# note: ~legislature:lower is a special id syntax that dispatches to the id resolver to find the related chamber

I'd like to use this thread to figure out better syntax that works for all cases.

Proposal 1

Declarative approach, a lot of typing but no need to directly invoke subclasses (they'll get created from the dicts with reasonable defaults for any missing keys)

class Alaska(Jurisdiction):
    organizations = [
        {'name': 'Alaska State House', 'chamber': 'lower'},
        {'name': 'Alaska State Senate', 'chamber': 'upper'},
    ]

    posts = [
        {'chamber': 'upper', 'name': 'A', 'role': 'Senator'},
        ...
        {'chamber': 'lower', 'name': '1', 'role': 'Representative'},
        ...
    ]

# and for non-bicameral chamber could be omitted

But I think we might be able to do better, especially with posts.

ccing @jpmckinney @paultag @twneale
