This repository contains documentation for developers including:
- Writing Scrapers using Pupa
- Open Civic Data's Data Type Specifications
- Open Civic Data Proposals
Read these docs at https://open-civic-data.readthedocs.io/en/latest/
Scrapes municipal data from Legistar websites
License: BSD 3-Clause "New" or "Revised" License
There's a toTime, but not a toDate:
https://github.com/opencivicdata/python-legistar-scraper/blob/master/legistar/base.py#L189
The event web scraper should scrape "All" events, not just events for a single month: https://github.com/opencivicdata/python-legistar-scraper/blob/master/legistar/events.py#L13
Per @sarindipity. Groan.
Intermittently, about 20% of the time, we see an AssertionError when sending a post request to metro.legistar.com/calendar.aspx and chicago.legistar.com/calendar.aspx.
We cannot change "This Month" to "All Years," and we see an error at this point in the scrape: https://github.com/opencivicdata/python-legistar-scraper/blob/master/legistar/events.py#L30
Ann Arbor runs a Legistar instance.
Recommendations are welcome on where to start adding support for it in this code.
For Metro, we've had a very low success rate for full bill scrapes (roughly one success per week) since we increased the volume of requests by scraping both public and private bills. The 104 (connection reset) errors from Legistar seem to be self-resolving, i.e., we may see a lower rate of scraper failure if we incorporate retries when a 104 occurs.
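A minimal sketch of such a retry, assuming a requests-style session (the helper name and retry count are illustrative, not part of the library):

import requests

def get_with_retries(session, url, max_attempts=3, **kwargs):
    # Retry a few times when Legistar resets the connection (errno 104),
    # since those failures appear to be transient.
    for attempt in range(max_attempts):
        try:
            return session.get(url, **kwargs)
        except requests.exceptions.ConnectionError:
            if attempt == max_attempts - 1:
                raise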
Spent a decent chunk of time confused because I was on master in my local repo. Yes, my fault, but it might be nice to merge it in for clarity, if it's all the same to you guys.
We could probably reduce this complexity & line count drastically (4k lines, more than pupa?!)
example of a target for this:
@make_item('meeting_location')
def get_meeting_location(self):
    return self.get_field_text('meeting_location')
Basically, the word meeting_location appears three times, and make_item breaks control flow in weird ways.
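One possible direction (a method sketch only; the names below are invented for illustration) is to declare simple fields once and resolve them generically, reserving dedicated methods for fields that need real logic:

SIMPLE_FIELDS = ('meeting_location', 'meeting_time', 'meeting_date')

def build_simple_items(self):
    # Fields whose value is just the text of the matching table cell;
    # only fields needing custom logic would keep their own method.
    return {name: self.get_field_text(name) for name in SIMPLE_FIELDS}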
Seems that at one point, the NYC scraper's website column was called "Web site", but it now seems to be "Web Site". SF also uses "Web Site". Perhaps we could title-case or lower-case the headers for more consistency?
EDIT: https://github.com/opencivicdata/scrapers-us-municipal/blob/master/nyc/people.py#L86
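For example, normalizing headers before comparing would make 'Web site' and 'Web Site' equivalent (a sketch; where exactly this belongs in the table-parsing code is an open question):

def normalize_header(header):
    # Compare headers case-insensitively and ignore stray whitespace.
    return header.strip().lower()

normalize_header('Web Site') == normalize_header('Web site')  # True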
The bill history code deduplicates actions on a bill if the previous action on the bill was of the same type.
Sometimes consecutive actions of the same type are legitimate, like adding co-sponsors to a bill, so we should have a whitelist.
See this matter for an example: http://webapi.legistar.com/v1/chicago/matters/162936/histories
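Something along these lines could work (a sketch only; the whitelist contents and the action dict shape are made up for illustration):

# Action types that may legitimately repeat back-to-back on a matter.
DUPLICATE_OK = {'Added Co-Sponsor(s)'}

def keep_action(action, previous_action):
    if previous_action is None or action['type'] != previous_action['type']:
        return True
    # Same type twice in a row: keep it only if it's on the whitelist.
    return action['type'] in DUPLICATE_OK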
Travis CI is throttling open-source build hours. Builds for this repo are therefore sluggish. Let's migrate to our pattern for deploying Python packages with GitHub Actions: https://github.com/datamade/how-to/blob/master/ci/github-actions.md#deploying-a-python-package
Recently, a huge number of public bills were marked as private because their gateway links could not be accessed: datamade/scrapers-us-municipal#55 (comment).
Per @fgregg:
it would be good to do a stricter check, if possible, that we are getting the kind of message we expect for private pages. One way to do that would be to update legislation_detail_url to check the response for a 200 status code and the page text for "This record is currently unavailable."
python-legistar-scraper/legistar/bills.py
Lines 450 to 463 in b9267cb
We need that to solve this
Related to #30
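A sketch of that stricter check, assuming whatever fetches the detail page has the response in hand (the helper name is hypothetical):

PRIVATE_PAGE_TEXT = 'This record is currently unavailable.'

def looks_like_private_page(response):
    # Only treat the matter as private when Legistar returns a real page (200)
    # containing the expected "unavailable" message, rather than a gateway
    # error or timeout.
    return response.status_code == 200 and PRIVATE_PAGE_TEXT in response.text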
http://mwrd.legistar.com is a water reclamation district
https://chicagoparkdistrict.legistar.com is a park district
https://cook-county.legistar.com is a county
It's currently impossible to scrape a single bill without also scraping lots of other bills. Ditto for events. Let's look into a way to update specific bills and events by passing in matter/event IDs, respectively.
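As a starting point, the Legistar web API already exposes per-matter endpoints (the same family of URLs linked above), so a targeted update could look something like this sketch (the helper name is hypothetical):

import requests

def fetch_matter(client, matter_id):
    # Pull a single matter straight from the web API instead of paging
    # through the full bill list.
    url = 'http://webapi.legistar.com/v1/{}/matters/{}'.format(client, matter_id)
    response = requests.get(url)
    response.raise_for_status()
    return response.json()

fetch_matter('chicago', 162936)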
We've started an object-oriented approach to organizing useful attributes of the API and web event dicts over in scrapers-us-municipal. If these attributes are generally useful, it could be cool to have them here.
Hi, I'm trying to test out adding legistar as a dependency to another project, but I got an error pip installing legistar from a requirements file. This is what I did:
$ virtualenv -p python3 testenv1
Running virtualenv with interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in testenv1/bin/python3
Also creating executable in testenv1/bin/python
Installing setuptools, pip, wheel...done.
$ source testenv1/bin/activate
(testenv1)$ pip install git+https://github.com/opencivicdata/python-legistar-scraper
Collecting git+https://github.com/opencivicdata/python-legistar-scraper
Cloning https://github.com/opencivicdata/python-legistar-scraper to /tmp/pip-8lwn6og7-build
Collecting lxml (from legistar==0.0.1)
Collecting pytz (from legistar==0.0.1)
Using cached pytz-2017.2-py2.py3-none-any.whl
Collecting icalendar (from legistar==0.0.1)
Using cached icalendar-3.11.7-py2.py3-none-any.whl
Collecting python-dateutil (from icalendar->legistar==0.0.1)
Using cached python_dateutil-2.6.1-py2.py3-none-any.whl
Collecting six>=1.5 (from python-dateutil->icalendar->legistar==0.0.1)
Using cached six-1.10.0-py2.py3-none-any.whl
Installing collected packages: lxml, pytz, six, python-dateutil, icalendar, legistar
Running setup.py install for legistar
Successfully installed icalendar-3.11.7 legistar-0.0.1 lxml-3.8.0 python-dateutil-2.6.1 pytz-2017.2 six-1.10.0
(testenv1)$ pip freeze > requirements.txt
(testenv1)$ deactivate
$ virtualenv -p python3 testenv2
Running virtualenv with interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in testenv2/bin/python3
Also creating executable in testenv2/bin/python
Installing setuptools, pip, wheel...done.
$ source testenv2/bin/activate
(testenv2)$ pip install -r requirements.txt
Collecting icalendar==3.11.7 (from -r requirements.txt (line 1))
Using cached icalendar-3.11.7-py2.py3-none-any.whl
Collecting legistar==0.0.1 (from -r requirements.txt (line 2))
Could not find a version that satisfies the requirement legistar==0.0.1 (from -r requirements.txt (line 2)) (from versions: )
No matching distribution found for legistar==0.0.1 (from -r requirements.txt (line 2))
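The likely culprit is that legistar isn't published on PyPI, so the legistar==0.0.1 line that pip freeze writes can't be resolved in a fresh environment. Pinning the VCS URL in requirements.txt instead should work, e.g.:

# In requirements.txt, instead of "legistar==0.0.1":
git+https://github.com/opencivicdata/python-legistar-scraper#egg=legistar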
Hello,
My city, Louisville, Ky., has recently moved their public portal from Legistar to PrimeGov, and I was wondering if anyone had dealt with this transition before.
I was hoping that since both products are from Granicus, the underlying data would stay the same and even be available via the unpublished Legistar URLs, but I'm noticing differences.
As you can see, recent Metro Council meetings are not appearing on the PrimeGov portal.
Is there any way to use the Legistar scrapers to scrape PrimeGov data?
If not, is a PrimeGov scraper something Open Civic Data is thinking about building or facilitating?
Thanks for your help!
Currently, we manually build endpoint URLs, like so:
python-legistar-scraper/legistar/events.py
Line 194 in 87e11ce
It would be cool to be able to reference endpoints like self.EVENTS_URL, etc.
If we go in this direction, we ought to have a conversation about whether to include all of the endpoints or an oft-used subset, and how to organize that knowledge across the base scraper and its children.
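For the sake of discussion, that might look like class-level URL templates on the base scraper that children reference by name (a sketch; none of these attributes exist today):

class ExampleScraper:
    # Endpoint templates defined once, referenced by name elsewhere.
    BASE_URL = 'http://webapi.legistar.com/v1/{client}'
    EVENTS_URL = BASE_URL + '/events'
    MATTERS_URL = BASE_URL + '/matters'

    def __init__(self, client):
        self.client = client

    def endpoint(self, template):
        return template.format(client=self.client)

# e.g. ExampleScraper('chicago').endpoint(ExampleScraper.EVENTS_URL)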
We introduced logic in #112 to retrieve the most recent version of every matter relation. This relies on the assumption that the most recent version is the active version, but per Metro-Records/la-metro-councilmatic#669 (comment), this is not always the case. Omar thought Regina might have implemented some logic to determine which version is active, but I can't find it.
The revised implementation exposes a method to apply a different filter to matter relations, so maybe this is a question we allow users of this library to answer themselves:
python-legistar-scraper/legistar/bills.py
Lines 424 to 448 in 83cb634
The events scraper should hit all related endpoints, e.g., event agenda items - that is, if an event agenda item changes, but the event does not change, then the scraper should catch the change.
Some agendas are very long; we need to paginate through them to get all the items. https://chicago.legistar.com/MeetingDetail.aspx?ID=445253&GUID=DA07B999-1303-4B0F-9FD0-FDC65AA04009&Options=&Search=
The Legistar web API has some new features that may obviate the need for so much scraping of the web.
So far, I've noticed that event endpoints have a comment field and a link to the event web URL.
Currently, we use a defaultdict to store our data collections. However, we want something more fragile: if we look up a key that is not in the dictionary, Python should throw a straightforward KeyError. For example, we'd prefer this behavior if the events dict does not have 'Audio': https://github.com/opencivicdata/scrapers-us-municipal/blob/master/lametro/events.py#L72
Let's cast the defaultdict to a plain dict:
https://github.com/opencivicdata/python-legistar-scraper/blob/cleanup/legistar/base.py#L103
N.B. Related to this issue: https://sentry.io/datamade/scrapers-us-municipal/issues/336591713/
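Casting is a one-liner, and a missing key then raises KeyError instead of silently returning the default (a sketch with a made-up row):

from collections import defaultdict

row = defaultdict(lambda: None, {'Name': 'Board of Commissioners'})
row = dict(row)   # plain dict: the default factory is gone
row['Audio']      # now raises KeyError instead of returning None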
Otherwise it won't be possible to determine which config applies when there are multiple Jurisdictions (each with a different classification) for a division_id.
There are several known actions that change a bill or event but do not update its last modified timestamp; these include making an agenda or media link public. This means those entities will not be captured in so-called "windowed" scrapes (that is, scrapes that only check for entities updated since a certain timestamp). One option for capturing those changes is to run full scrapes more frequently, but that is time- and resource-intensive. Are there other ways to determine when an entity has been updated?
We'd like to use the LegistarScraper and LegistarEventsScraper in this project, but would rather not have the pupa dependency if it's not needed.
I made a single change to legistar/base.py, lines 7 and 8:
from scrapelib import Scraper
#from pupa.scrape import Scraper
Then, the following worked for me (python 3 running in the terminal):
>>> from legistar.base import LegistarScraper
>>> ls = LegistarScraper()
>>> from legistar.events import LegistarEventsScraper
no pupa_settings on path, using defaults
>>> les = LegistarEventsScraper()
>>> les.EVENTSPAGE = 'https://cook-county.legistar.com/Calendar.aspx'
>>> g = les.events()
>>> next(g)
/.../python-legistar-scraper/python-legistar-scraper/lib/python3.4/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
/.../python-legistar-scraper/python-legistar-scraper/lib/python3.4/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
(defaultdict(<function LegistarScraper.parseDataTable.<locals>.<lambda> at 0x7f377429f400>, {'Meeting Details': 'Meeting\xa0details', 'Meeting Location': 'Cook County Building, Board Room, 118 North Clark Street, Chicago, Illinois', 'Name': {'url': 'https://cook-county.legistar.com/DepartmentDetail.aspx?ID=20924&GUID=B78A790A-5913-4FBF-8FBF-ECEE445B7796', 'label': 'Board of Commissioners'}, 'Meeting Date': '12/13/2017', 'Agenda': 'Not\xa0available', 'Meeting Time': '11:00 AM', 'Video': 'Not\xa0available', 'Minutes': 'Not\xa0available', 'iCalendar': {'url': 'https://cook-county.legistar.com/View.ashx?M=IC&ID=521586&GUID=59F74CF4-4FF8-4BEA-9723-4337DDCE22FB'}}), None)
As far as I'm concerned, this is working code and good for our project. :) I will dig into the docs to make sure there aren't any licensing issues and that the contributors to this project are given credit once our project is up and running using the LegistarEventsScraper.
I didn't see any tests to run, so I'm not sure if this small change has broken something else (and that's also why I've opened an issue not a pull request).
URLs pulled from onclick attributes come back invalid if the path doesn't start with a /. One example is "https://cook-county.legistar.comVideo.aspx?Mode=Granicus&ID1=1996&Mode2=Video" being returned instead of "https://cook-county.legistar.com/Video...". Seen in both https://cook-county.legistar.com and https://chicagoparkdistrict.legistar.com. I can open a PR with a fix for this.
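The fix is presumably just ensuring the scraped path has a leading slash before joining it to the host (a sketch; variable names are illustrative):

from urllib.parse import urljoin

def absolute_url(base, path):
    # Paths scraped from onclick handlers sometimes lack a leading slash,
    # which makes naive concatenation produce ".comVideo.aspx?..." URLs.
    if not path.startswith('/'):
        path = '/' + path
    return urljoin(base, path)

absolute_url('https://cook-county.legistar.com',
             'Video.aspx?Mode=Granicus&ID1=1996&Mode2=Video')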
Some Legistar sites, such as Phoenix, AZ, name their meeting information column simply 'Details' (as opposed to 'Meeting Details') in the scraped tables.
We should update the scraper code to account for this difference.
This is the code that triggered an error for the Phoenix, AZ Legistar instance scrape:
python-legistar-scraper/legistar/events.py
Lines 114 to 117 in fe43fd4
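One way to handle it (a sketch; the real fix belongs wherever the events table row is read) is to fall back from 'Meeting Details' to 'Details':

def meeting_details(row):
    # Phoenix labels the column 'Details'; most other sites use 'Meeting Details'.
    for key in ('Meeting Details', 'Details'):
        if key in row:
            return row[key]
    raise KeyError('no meeting details column found')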