opencivicdata / pupa
framework for scraping legislative/government data
License: BSD 3-Clause "New" or "Revised" License
I noticed that my Pupa.rb scrapers (once HTTP responses were cached) were spending half their time on file I/O, reading from the cache and writing the JSON documents. I added an option to use in-memory stores (Memcached for the cache and Redis for the JSON documents) and it sped things up by around 4x (100 seconds down to 25). https://github.com/opennorth/pupa-ruby#reducing-file-io
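For illustration, here is a minimal sketch of the idea; an in-memory dict stands in for Memcached/Redis, and the `MemoryCache` class and its get/set interface are hypothetical, not pupa's actual API:

```python
import hashlib

class MemoryCache:
    # Hypothetical stand-in for a Memcached/Redis-backed store: it exposes
    # the same get/set-by-URL interface a filesystem cache would, but keeps
    # everything in memory, avoiding per-request disk I/O.
    def __init__(self):
        self._store = {}

    def _key(self, url):
        # Hash the URL so keys stay short and safe for any backend.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def get(self, url):
        return self._store.get(self._key(url))

    def set(self, url, body):
        self._store[self._key(url)] = body

cache = MemoryCache()
cache.set("http://example.com/people", "<html>...</html>")
```

Swapping the backend behind a common interface like this is what lets the cache and the JSON document store move off disk without touching scraper code.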
What relies on terms and sessions being set? What would happen if we simply ignored the existence of terms and sessions? We have 68 jurisdictions so far, and I don't see any advantage to tracking terms and sessions. Is there a way to make them optional?
Can't go back to Canada's Confederation if you stop at 1900.
The documentation at http://docs.opencivicdata.org/en/latest/scrape/index.html says to run this command:
pip install -e https://github.com/opencivicdata/pupa.git
But running that command returns an error:
jeff$ pip install -e https://github.com/opencivicdata/pupa.git
https://github.com/opencivicdata/pupa.git should either be a path to a local project or a VCS url beginning with svn+, git+, hg+, or bzr+
This command is successful:
pip install -U git+https://github.com/opencivicdata/pupa.git#egg=pupa
I would recommend changing the documentation to include the revised command.
this was experimental I think, maybe it goes away, or maybe it sticks around but in a cleaner/documented form
Should they be copied to roles as well?
right now issues will arise if there's more than one level, will have to import one level at a time or do a post-import step (less desirable)
Any reason to use uuid1? Some languages do not have UUID v1 in their core libraries, but only UUID v4 (e.g. Ruby). Note that in the Python docs:
Note that uuid1() may compromise privacy since it creates a UUID containing the computer’s network address.
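A quick illustration of the difference with Python's standard uuid module:

```python
import uuid

# uuid1() mixes a timestamp with the host's MAC address, so the UUID can
# reveal which machine generated it; uuid4() is random and leaks nothing.
u1 = uuid.uuid1()
u4 = uuid.uuid4()

# Both are valid UUIDs, but only the v1 UUID carries a node
# (network address) field derived from the host.
print(u1.version, u4.version)
```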
Personally, I don't have a use for links on the model, but I think the schema and model should agree. Either we remove links from the schema, or add it to the model. @jamesturk Which do you think?
By the way, I'm not sure if I should report issues to JIRA for this repository or not. The README didn't have a link.
See comments in 54c3d0a
http://docs.opencivicdata.org/en/latest/scrape/new.html is also out-of-date.
Something crazy came back - I remember something like this from pre-alpha versions of pupa that didn't have matching code
> db.people.distinct("name", {"sources.url": "https://chicago.legistar.com/People.aspx"}).length
52
> db.people.find({"sources.url": "https://chicago.legistar.com/People.aspx"}).count()
158
I'm filing a ticket, since this turned out to be a bigger issue than I thought it was.
report['import'] = self.do_import(juris, args)
File "/home/tag/dev/sunlight/pupa/pupa/cli/commands/update.py", line 197, in do_import
report.update(org_importer.import_from_json(args.datadir))
File "/home/tag/dev/sunlight/pupa/pupa/importers/base.py", line 137, in import_from_json
inverse[_hash(obj)].append(json_id)
File "/home/tag/dev/sunlight/pupa/pupa/importers/base.py", line 19, in _hash
return hash(obj)
TypeError: unhashable type: 'Organization'
It imports in Python 2. Something odd is going on. Filing this to look into it after I settle another issue.
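For context, this looks like a Python 3 behavior change at work. The class below is a minimal stand-in, not pupa's actual model: any class that defines __eq__ without __hash__ is hashable in Python 2 but unhashable in Python 3.

```python
class Organization:
    # Minimal stand-in for pupa's model class, assuming it defines
    # equality for duplicate detection.
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return self.name == other.name
    # In Python 3, defining __eq__ without __hash__ sets __hash__ to None,
    # so hash(obj) raises TypeError; Python 2 kept the default id()-based hash.

org = Organization("Alaska State House")
try:
    hash(org)
    hashable = True
except TypeError:
    hashable = False
```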
import dynamics should be much simpler than billy
Some provinces have directories of elected officials for all municipalities in the province. It would be very high maintenance to have one scraper per municipality (even with some code automation). How can I write one scraper that collects information for multiple jurisdictions? Any internals I can hack around?
While performance testing Pupa.rb, I realized that the duplicate detection was running in O(n²) when it can be done in O(n). This made import go from many minutes (I didn't wait for it to finish) to one minute for a particular scraper importing 10,000 docs. Here's the Ruby code.
The Ruby code takes advantage of the fact that hashes (dicts) are hashable; they aren't in Python, but I guess you can repr the dict and use that as the key.
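A Python sketch of the same O(n) approach (the object structure here is hypothetical): serialize each object deterministically and use the string as a dict key, so each object is checked against all previous ones in a single pass instead of pairwise.

```python
import json

def find_duplicates(objects):
    # One O(n) pass: map each object's canonical JSON form to the first
    # ID seen with it; later matches are recorded as duplicates of it.
    first_seen = {}
    duplicates = {}
    for obj in objects:
        key = json.dumps(obj["data"], sort_keys=True)
        if key in first_seen:
            duplicates[obj["id"]] = first_seen[key]
        else:
            first_seen[key] = obj["id"]
    return duplicates

dupes = find_duplicates([
    {"id": "A", "data": {"name": "Jane Doe"}},
    {"id": "B", "data": {"name": "Jane Doe"}},
    {"id": "C", "data": {"name": "John Roe"}},
])
```

json.dumps with sort_keys=True plays the role that repr of a sorted hash plays in the Ruby version: a stable string key for the dict.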
Simple creation of people objects is going to be handy for people contributing spreadsheets of manually collected information.
For a proof of concept: rework pupa.importers.base to import the stream from memory (import_directory reflowed to work off an iterator, defaulting to a filesystem JSON stream, and use that for loading the CSV stream or something).
…like the Legislator object, so that we can .add_membership on an org when we have to scrape Committees outside of the People scraper.
In our scrapers, we just put the jurisdiction URL (usually a plain domain without a path), rather than the specific page on the jurisdiction's website that has information about its legislature. With 80+ jurisdictions who regularly change their URL scheme, it seems an unnecessary maintenance cost to put anything more specific than a domain there.
What is the actual use case for legislature_url? Is it implementation-specific?
One should be able to specify a log level at the CLI, overriding the one in settings.py.
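A sketch of how that could look; the flag name and the settings variable are assumptions, not pupa's actual interface:

```python
import argparse
import logging

SETTINGS_LOG_LEVEL = "INFO"  # hypothetical stand-in for the value in settings.py

parser = argparse.ArgumentParser()
parser.add_argument("--loglevel", default=None,
                    choices=["DEBUG", "INFO", "WARNING", "ERROR"])
args = parser.parse_args(["--loglevel", "DEBUG"])  # simulated CLI input

# The CLI flag, when given, overrides the settings.py default.
level_name = args.loglevel or SETTINGS_LOG_LEVEL
logging.basicConfig(level=getattr(logging, level_name))
```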
@paultag's recent pull request reminded me of another issue I discovered while implementing the Ruby version. If you compare Pupa.py's code to Pupa.rb's:
https://github.com/opencivicdata/pupa/blob/master/pupa/importers/base.py#L123
https://github.com/opennorth/pupa-ruby/blob/master/lib/pupa/processor.rb#L278
You'll notice two differences:
The second difference is more important. In a simple example, A, B and C are all the same object with different IDs. The py code runs like:
Later, dedupe_json_id will return the wrong ID for C (it will return B's ID, which won't be imported, instead of A's).
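The failure mode can be sketched like this (hypothetical mapping, not pupa's actual data structures): if each duplicate is mapped to the previously seen duplicate rather than to the first, canonical object, a lookup has to follow the chain or it returns an ID that is never imported.

```python
# B was recorded as a duplicate of A, then C as a duplicate of B.
dedupe_map = {"B": "A", "C": "B"}

def naive_lookup(json_id):
    # Returns B for C: wrong, since B itself is a duplicate and is
    # never imported into the database.
    return dedupe_map.get(json_id, json_id)

def resolve(json_id):
    # Follow the chain until a canonical (imported) ID is reached.
    while json_id in dedupe_map:
        json_id = dedupe_map[json_id]
    return json_id
```

Either the map should always point at the canonical ID when entries are added, or lookups should chase the chain as resolve does.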
In Chicago, some bills are handled by omnibus bills like https://chicago.legistar.com/LegislationDetail.aspx?ID=1526672&GUID=4B542665-F6DD-4C4C-BA4B-6107DBDA8BD7
We need to model this type of relation.
work with @twneale on this, look to granny for a foundation
For our scrapers, we have different templates we'd like to use for new scrapers when running the pupa init command. If it were possible to configure the path to the examples directory (through pupa_settings.py maybe) that would be great!
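One possible shape for this (INIT_TEMPLATE_DIR is an invented setting name, not a real pupa setting): fall back to the bundled examples directory when no custom path is configured.

```python
import os

def template_dir(settings=None):
    # Hypothetical: a custom path from pupa_settings.py wins; otherwise
    # use the examples directory bundled with the package.
    custom = getattr(settings, "INIT_TEMPLATE_DIR", None) if settings else None
    return custom or os.path.join(os.path.dirname(os.path.abspath(__file__)), "examples")
```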
I thought I probably shouldn't go messing around in pupa too much without one of Paul/James here to check with, but I'm guessing we just need to initialize self.chamber = None on the pupa Person model.
$ pupa update boise
...
[Boise scraper creates a chamberless person]
...
Traceback (most recent call last):
File "/home/thom/.virtualenvs/pupa3/bin/pupa", line 9, in <module>
load_entry_point('pupa==0.1.0', 'console_scripts', 'pupa')()
File "/home/thom/sunlight/pupa/pupa/cli/__main__.py", line 30, in main
subcommands[args.subcommand].handle(args)
File "/home/thom/sunlight/pupa/pupa/cli/commands/update.py", line 251, in handle
report['import'] = self.do_import(juris, args)
File "/home/thom/sunlight/pupa/pupa/cli/commands/update.py", line 177, in do_import
report.update(person_importer.import_from_json(args.datadir))
File "/home/thom/sunlight/pupa/pupa/importers/base.py", line 145, in import_from_json
self.json_to_db_id[json_id] = self.import_object(obj)
File "/home/thom/sunlight/pupa/pupa/importers/base.py", line 87, in import_object
spec = self.get_db_spec(obj)
File "/home/thom/sunlight/pupa/pupa/importers/people.py", line 25, in get_db_spec
if person.chamber:
AttributeError: chamber
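The suggested fix, sketched with a minimal stand-in for the Person model (not pupa's real class): initializing the attribute to None means the importer's if person.chamber: check is always safe.

```python
class Person:
    # Minimal stand-in: defaulting chamber to None means the attribute
    # always exists, so chamberless people no longer raise AttributeError.
    def __init__(self, name, chamber=None):
        self.name = name
        self.chamber = chamber

person = Person("Boise Councilmember")  # a chamberless person
spec = {"name": person.name}
if person.chamber:  # safe now; previously raised AttributeError
    spec["chamber"] = person.chamber
```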
should be ongoing after we start to stabilize the API
crazy idea?
code in billy should be adaptable to orgs
For example, if a provincial website has information for all its municipalities.
mostly working:
the slightly more complex nature lends itself to taking a page from Django and having a start-project script that'll set up a new importable module that runs (& fails with reasonably helpful errors)
(right now it's sitting in the scraper)
so within the root Jurisdiction we need to create an organization (or multiple in the case of chambers)
This is the current syntax (somewhat of a mess), Proposal 0.
class Alaska(Jurisdiction):
    division_id = 'ocd-division/country:us/state:ak'
    name = 'Alaska State Legislature'
    url = 'http://legis.state.ak.us'

    def organizations(self):
        yield Organization('Alaska State House', classification='legislature', chamber='lower')
        yield Organization('Alaska State Senate', classification='legislature', chamber='upper')

    def posts(self):
        for n in range(1, 41):
            yield Post(label=str(n), role='Representative', organization_id='~legislature:lower')
        for n in range(65, 85):
            yield Post(label=chr(n), role='Senator', organization_id='~legislature:upper')
        # note: ~legislature:lower is a special id syntax that dispatches to the
        # id resolver to find the related chamber
I'd like to use this thread to figure out better syntax that works for all cases.
Proposal 1
Declarative approach, a lot of typing but no need to directly invoke subclasses (they'll get created from the dicts with reasonable defaults for any missing keys)
class Alaska(Jurisdiction):
    organizations = [
        {'name': 'Alaska State House', 'chamber': 'lower'},
        {'name': 'Alaska State Senate', 'chamber': 'upper'},
    ]
    posts = [
        {'chamber': 'upper', 'name': 'A', 'role': 'Senator'},
        ...
        {'chamber': 'lower', 'name': '1', 'role': 'Representative'},
        ...
    ]
    # and for non-bicameral, chamber could be omitted
But I think we might be able to do better, especially with posts.
ccing @jpmckinney @paultag @twneale
all of the schema enums should be configurable
This functionality can be "opt-in". Curious to hear from others re: implementation.
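One opt-in shape this could take (EXTRA_CLASSIFICATIONS is an invented setting name, and the default list here is illustrative): extend the default enum values with whatever the deployment's settings add.

```python
DEFAULT_CLASSIFICATIONS = ("legislature", "party", "committee")

def allowed_classifications(settings=None):
    # Opt-in: extra values from settings are appended to the defaults;
    # with no settings supplied, behavior is unchanged.
    extra = getattr(settings, "EXTRA_CLASSIFICATIONS", ()) if settings else ()
    return list(DEFAULT_CLASSIFICATIONS) + list(extra)
```

Schema validation would then check values against allowed_classifications(settings) rather than a hard-coded enum.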
tests are covering, but are they handling edge cases?
For example, the Boise, Temecula and Cleveland city council organizations do not have a division id attribute.