This repository contains documentation for developers including:
- Writing Scrapers using Pupa
- Open Civic Data's Data Type Specifications
- Open Civic Data Proposals
Read these docs at https://open-civic-data.readthedocs.io/en/latest/
Scrapers for US municipal governments.
License: MIT License
It seems our logging configuration isn't capturing all of the messages we want it to capture. Let's make sure it gets all of them.
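One way to attack this, sketched under the assumption that the relevant loggers are 'pupa' and 'scrapelib' (the names that appear in the scrape logs below):

```python
import logging

# Configure the root logger so every propagated message has a handler, and
# drop the child loggers' thresholds to DEBUG so nothing is filtered early.
logging.basicConfig(level=logging.DEBUG)
for name in ("pupa", "scrapelib"):
    logging.getLogger(name).setLevel(logging.DEBUG)
```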
We'll have to add these as post types.
Right now we are only setting the hour.
This bill mentions a sponsor, but has no ID for the sponsor
http://api.opencivicdata.org/ocd-bill/44afd1d1-1141-47f8-aa60-acf33c4bda8a/
API only has about 1500 bills, and they are all from 2004
location name is sometimes a building, sometimes building + street address
In Chicago, sometimes 'Notice' and 'Agenda' are the same file, and sometimes not. Pupa doesn't allow the same file to appear more than once in a document list (it throws a ValueError). How do we want to handle this?
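One possible approach, sketched here: skip a document whose URL is already attached, so 'Notice' and 'Agenda' pointing at the same file get stored once. The `event.documents` list-of-dicts shape is an assumption about the model, not pupa's confirmed API.

```python
def add_document_once(event, note, url):
    """Attach a document link only if the URL isn't already present."""
    seen = {doc["url"] for doc in event.documents}
    if url in seen:
        return False  # same file already attached under another note
    event.documents.append({"note": note, "url": url})
    return True
```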
On Legistar, events often have additional notes in italicized text under the event location: http://legistar.council.nyc.gov/Calendar.aspx/
The italicized text seems to fall into two categories: (1) 'jointly with' plus a list of committees (example), or (2) a note on status, indicating a continuation of a previous meeting (example) or that the meeting has been recessed (example)
to-do:
Can you provide detailed tickets for web scrapers that folks could write for you?
I would like to have them for challenges as part of a tutorial that I am teaching: https://us.pycon.org/2015/schedule/presentation/318/
(Also, please let me know if you have other repos with this need as well.)
As it does its business, the city council deals with documents other than bills.
These are not bills, but they should be tracked. This will depend on a DocumentType or similar being added to pupa.
add website & notes from legistar people pages
These are now available from the City Clerk: http://www.chicityclerk.com/city-council-news-central/council-members
Not finding any chicago events in the API
Attn @jesseilev
HTTPError: 404 while retrieving https://www.stlouis-mo.gov/government/departments/aldermen/city-laws/boardbill.cfm?bbDetail=true&BBId=9819
File "bin/pupa", line 9, in <module>
load_entry_point('pupa==0.5.0', 'console_scripts', 'pupa')()
File "pupa/cli/__main__.py", line 71, in main
subcommands[args.subcommand].handle(args, other)
File "pupa/cli/commands/update.py", line 242, in handle
report['scrape'] = self.do_scrape(juris, args, scrapers)
File "pupa/cli/commands/update.py", line 141, in do_scrape
report[scraper_name] = scraper.do_scrape(**scrape_args)
File "pupa/scrape/base.py", line 101, in do_scrape
for obj in self.scrape(**kwargs) or []:
File "st_louis/bills.py", line 25, in scrape
yield self.scrape_bill(bill_url, bill_id, session_id)
File "st_louis/bills.py", line 29, in scrape_bill
page = self.lxmlize(bill_url)
File "st_louis/utils.py", line 12, in lxmlize
entry = self.get(url).text
File "requests/sessions.py", line 480, in get
return self.request('GET', url, **kwargs)
File "scrapelib/__init__.py", line 272, in request
raise HTTPError(resp)
Right now, all votes are classified as 'bill-passage'. Fixing this depends on us better understanding what 'vote classification' means.
entry = self.urlopen(url)
AttributeError: 'ChicagoBillScraper' object has no attribute 'urlopen'
I think we need self.get. Will patch in a sec.
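For reference, a minimal sketch of the fix: scrapelib's Scraper is a requests.Session subclass, so the helper should call .get(url).text instead of the removed .urlopen(). The stand-in session below is only for illustration.

```python
class FakeResponse:
    def __init__(self, text):
        self.text = text

class FakeSession:
    # stand-in for scrapelib.Scraper, which is a requests.Session subclass
    def get(self, url):
        return FakeResponse("<html></html>")

def page_text(session, url):
    # was: session.urlopen(url) -> AttributeError on current scrapelib
    return session.get(url).text
```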
Under the hood, requests (which I'm pretty sure uses urllib3 for the SSL work) is not letting us load pages from the Chicago Legistar site, because the site forces SSL while not doing a great job of supporting a secure connection. There are a couple of ways to approach this: pass the verify=False flag to requests, or ask the Legistar people to fix their SSL setup. I'm thinking the first option will be a lot easier in the short term.
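The short-term workaround, sketched with a plain requests session (the scraper's own session would be configured the same way). Note this trades away transport security, so it should be scoped as narrowly as possible.

```python
import requests

# Disable certificate verification for the session that talks to Legistar.
session = requests.Session()
session.verify = False

# Equivalently, per request:
# session.get("https://chicago.legistar.com/Legislation.aspx", verify=False)
```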
currently the link is the nyc calendar; source link should be the detail page for the event
Currently stubbed out, erroring with ValueError: cannot resolve id: {UUID}. Will loop back shortly.
events only go up to 5/29/2015, everything after is missing
Locally, after running pupa update chicago and looking at the API.
The same is showing up on sunlight's API.
http://api.opencivicdata.org/ocd-bill/0066e2f1-b4ed-428d-9188-38444ff0fbc5/
We have the City Council recorded as the responsible organization for each of these actions. The middle action should have "Committee on Transportation and Public Way" recorded instead.
We capture referral actions from the city council to a committee, but I don't see any way to record which committee the legislation was referred to (we know it from Legistar).
We'll eventually know once the committee takes action on a piece of legislation, but right now there doesn't seem to be a way of finding pending legislation in a committee.
@rshorey @paultag is that right? Any suggestions? Does this require an OCDEP?
By clicking on the 'Search button' on https://chicago.legistar.com/People.aspx you can get a listing of Aldermen that show up for the previous couple sessions. We should scrape this so we can assign historic legislation to the right entity_id.
after opencivicdata/pupa#7 lands
The OCD-API has divisions for each of the wards:
http://api.opencivicdata.org/ocd-division/country%3Aus/state%3Ail/place%3Achicago/
Should I associate these to posts when the scraper creates posts? If so, how do I do that?
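A hypothetical sketch of building each ward's division id from the pattern implied by the division URL above, to pass along when creating the post. The exact Post/division_id wiring depends on pupa's model, so treat this as a guess rather than the confirmed API.

```python
def ward_division_id(ward):
    # Follows the ocd-division pattern for Chicago wards; illustrative only.
    return "ocd-division/country:us/state:il/place:chicago/ward:{}".format(ward)
```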
Right now we are not capturing the start time of cancelled events. This makes it impossible to update the status of events when they get cancelled. We need to pull this info from the internet calendar links.
This is for NYC.
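Pulling the start time out of the internet calendar link could look something like this sketch, which reads the DTSTART line from the .ics text. It assumes a UTC "Z" timestamp; real feeds may carry TZID parameters that need extra handling.

```python
from datetime import datetime

def start_from_ics(ics_text):
    """Return the event start as a datetime, or None if no DTSTART found."""
    for line in ics_text.splitlines():
        if line.startswith("DTSTART"):
            value = line.split(":", 1)[1].strip()
            return datetime.strptime(value, "%Y%m%dT%H%M%SZ")
    return None

sample = "BEGIN:VEVENT\nDTSTART:20150529T140000Z\nEND:VEVENT"
```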
Moxie is reporting a DuplicateItemError for an action on this bill: http://api.opencivicdata.org/ocd-bill/79d131fc-dedf-4f43-ab8f-3c406f6ccc39/
import votes...
Traceback (most recent call last):
File "/usr/local/bin/pupa", line 9, in <module>
load_entry_point('pupa==0.4.1', 'console_scripts', 'pupa')()
File "/pupa/src/pupa/pupa/cli/__main__.py", line 71, in main
subcommands[args.subcommand].handle(args, other)
File "/pupa/src/pupa/pupa/cli/commands/update.py", line 244, in handle
report['import'] = self.do_import(juris, args)
File "/pupa/src/pupa/pupa/cli/commands/update.py", line 181, in do_import
report.update(vote_importer.import_directory(datadir))
File "/pupa/src/pupa/pupa/importers/base.py", line 169, in import_directory
return self.import_data(json_stream())
File "/pupa/src/pupa/pupa/importers/base.py", line 206, in import_data
obj_id, what = self.import_item(data)
File "/pupa/src/pupa/pupa/importers/base.py", line 241, in import_item
raise DuplicateItemError(data, obj)
pupa.exceptions.DuplicateItemError: attempt to import data that would conflict with data already in the import: {'result': 'pass', 'extras': {}, 'legislative_session_id': UUID('1ca51a67-4d27-4649-bf34-c5849dfd30d6'), 'bill_id': 'ocd-bill/79d131fc-dedf-4f43-ab8f-3c406f6ccc39', 'identifier': '', 'motion_classification': ['passage'], 'motion_text': 'Approved, by Council', 'organization_id': 'ocd-organization/389257d3-aefe-42df-b3a2-a0d56d0ea731', 'start_date': '2014-05-14'} (already imported as Approved, by Council on M 63-2014 in New York City 2014 Regular Session Session)
the current source urls are legistar department detail pages; they should be legistar event detail pages
We are not resolving these ids right now.
We don't seem to be capturing the subjects of legislation.
Right now the actions of bills are being entered in reverse chronological order for NYC and Chicago. It probably makes more sense to enter them in chronological order. Thoughts @cathydeng?
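Since Legistar pages list actions newest-first, the fix is essentially to reverse the scraped list before adding. A minimal sketch, with illustrative names:

```python
def chronological(actions_newest_first):
    """Reverse a newest-first action list into chronological order."""
    return list(reversed(actions_newest_first))
```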
Let me know what's possible to get started...
pupa update --scrape --fastmode chicago
no pupa_settings on path, using defaults
chicago (scrape)
bills: {}
Not checking sessions...
16:00:59 INFO pupa: save jurisdiction Chicago City Council as jurisdiction_ocd-jurisdiction-country:us-state:il-place:chicago-council.json
16:00:59 INFO pupa: save organization Chicago City Council as organization_81654f1c-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Clerk as post_8165f606-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Mayor as post_816691b0-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 1 as post_816747cc-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 2 as post_816801b2-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 3 as post_8168162a-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 4 as post_81681c10-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 5 as post_8168219c-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 6 as post_81682746-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 7 as post_81682c6e-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 8 as post_816831a0-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 9 as post_816836be-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 10 as post_81683bbe-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 11 as post_816840d2-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 12 as post_8168460e-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 13 as post_81684b18-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 14 as post_8168502c-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 15 as post_81685586-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 16 as post_81685ac2-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 17 as post_81685fd6-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 18 as post_81686558-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 19 as post_81686a62-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 20 as post_81686f6c-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 21 as post_81687480-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 22 as post_81687ade-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 23 as post_8168807e-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 24 as post_81688588-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 25 as post_81688b00-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 26 as post_81689014-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 27 as post_816896e0-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 28 as post_81689c44-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 29 as post_8168a162-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 30 as post_8168a9aa-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 31 as post_8168ae0a-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 32 as post_8168b242-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 33 as post_8168b68e-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 34 as post_8168bd78-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 35 as post_8168c200-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 36 as post_8168c642-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 37 as post_8168ca70-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 38 as post_8168ce9e-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 39 as post_8168d2b8-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 40 as post_8168d722-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 41 as post_8168db5a-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 42 as post_8168df74-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 43 as post_8168e46a-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 44 as post_8168e8ac-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 45 as post_8168ecda-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 46 as post_8168f112-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 47 as post_8168f536-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 48 as post_8168f964-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 49 as post_8168fd92-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 50 as post_816901c0-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save organization Democrats as organization_816f148e-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO scrapelib: GET - https://chicago.legistar.com/Legislation.aspx
16:00:59 INFO scrapelib: POST - https://chicago.legistar.com/Legislation.aspx
Traceback (most recent call last):
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 372, in _make_request
httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 544, in urlopen
body=body, headers=headers)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 374, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.4/http/client.py", line 1139, in getresponse
raise ResponseNotReady(self.__state)
http.client.ResponseNotReady: Request-sent
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/adapters.py", line 370, in send
timeout=timeout
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 597, in urlopen
_stacktrace=sys.exc_info()[2])
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/util/retry.py", line 245, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/packages/six.py", line 309, in reraise
raise value.with_traceback(tb)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 544, in urlopen
body=body, headers=headers)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 374, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.4/http/client.py", line 1139, in getresponse
raise ResponseNotReady(self.__state)
requests.packages.urllib3.exceptions.ProtocolError: ('Connection aborted.', ResponseNotReady('Request-sent',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/fgregg/public/municipal-scrapers-us/.env/bin/pupa", line 9, in <module>
load_entry_point('pupa==0.4.1', 'console_scripts', 'pupa')()
File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/cli/__main__.py", line 71, in main
subcommands[args.subcommand].handle(args, other)
File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/cli/commands/update.py", line 224, in handle
report['scrape'] = self.do_scrape(juris, args, scrapers)
File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/cli/commands/update.py", line 123, in do_scrape
report[scraper_name] = scraper.do_scrape(**scrape_args)
File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/scrape/base.py", line 102, in do_scrape
for obj in self.scrape(**kwargs) or []:
File "/home/fgregg/public/municipal-scrapers-us/chicago/bills.py", line 89, in scrape
for i, page in enumerate(self.searchLegislation()) :
File "/home/fgregg/public/municipal-scrapers-us/chicago/legistar.py", line 24, in pages
page = self.lxmlize(url, payload)
File "/home/fgregg/public/municipal-scrapers-us/chicago/legistar.py", line 16, in lxmlize
entry = self.post(url, payload).text
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 508, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 270, in request
**kwargs)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/cache.py", line 66, in request
resp = super(CachingSession, self).request(method, url, **kwargs)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 92, in request
return super(ThrottledSession, self).request(method, url, **kwargs)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 177, in request
raise exception_raised
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 157, in request
resp = super(RetrySession, self).request(method, url, **kwargs)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 594, in send
history = [resp for resp in gen] if allow_redirects else []
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 594, in <listcomp>
history = [resp for resp in gen] if allow_redirects else []
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 196, in resolve_redirects
**adapter_kwargs
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/adapters.py", line 415, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ResponseNotReady('Request-sent',))
Did the people scraper run?
This is not listing any committees: http://api.opencivicdata.org/organizations/?parent_id=ocd-organization/bf1f00e4-f13e-45a7-961a-0816974ede88
So the finance committee has their normal meeting: https://chicago.legistar.com/MeetingDetail.aspx?ID=191000&GUID=F46D8B14-13E1-4500-A47D-428604931403&Options=info%7C
During which they will have two public hearings:
https://chicago.legistar.com/MeetingDetail.aspx?ID=184426&GUID=0FB72168-5E0C-4F42-B782-E79DCCC1B9B6&Options=info%7C
https://chicago.legistar.com/MeetingDetail.aspx?ID=190991&GUID=14BD8DD5-4C57-4A11-AA09-D124C9551DB0&Options=info%7C
I'm not sure the right way to model this.
Right, we can't just insert all these events, because pupa thinks they are duplicates.
Thoughts, @jpmckinney @paultag?
after opencivicdata/pupa#7 lands (as well)