
python-legistar-scraper's Introduction

Open Civic Data Technical Documentation

This repository contains documentation for developers including:

  • Writing Scrapers using Pupa
  • Open Civic Data's Data Type Specifications
  • Open Civic Data Proposals

Read these docs at https://open-civic-data.readthedocs.io/en/latest/

python-legistar-scraper's People

Contributors: antidipyramid, fgregg, hancush, jamesturk, jmithani, paultag, pjsier, reginafcompton, rshorey, twneale

python-legistar-scraper's Issues

retry requests that raise 104 errors

For Metro, we've had a very low success rate for full bill scrapes (roughly one success a week) since we increased the volume of requests by scraping both public and private bills. The 104 errors from Legistar seem to be self-resolving, i.e., we may see a lower rate of scraper failure if we incorporate retries when a 104 occurs.
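A minimal retry sketch, assuming the scraper issues requests through a `requests`-style session (the wrapper name and backoff values here are illustrative, not part of this library):

```python
import time

import requests


def get_with_retries(session, url, max_attempts=3, backoff=2, **kwargs):
    """Retry a GET when the connection is reset (errno 104).

    The 104 errors from Legistar appear to be transient, so a short
    exponential backoff between attempts is usually enough.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return session.get(url, **kwargs)
        except requests.exceptions.ConnectionError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff ** attempt)
```

Note that scrapelib, which these scrapers build on, exposes `retry_attempts` and `retry_wait_seconds` settings that may already cover this case without a custom wrapper.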

merge cleanup branch into master

Spent a decent chunk of time confused because I was on master in my local repo. Yes, my fault, but it might be nice to merge it in for clarity, if it's all the same to you all.

remove as much redundancy as possible

could probably reduce this complexity & line count drastically (4k lines, more than pupa?!)

example of a target for this:


    @make_item('meeting_location')
    def get_meeting_location(self):
        return self.get_field_text('meeting_location')

Basically, the word meeting_location appears three times, and make_item breaks control flow in weird ways.
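One hypothetical way to collapse the pattern: a declarative field map plus a single collector, so each field is named once (`FIELD_MAP` and `collect_fields` are sketches, not existing API in this repo):

```python
# Hypothetical sketch: one declarative mapping from output keys to
# Legistar field names, instead of a `make_item` method per field.
FIELD_MAP = {
    'meeting_location': 'meeting_location',
    'meeting_time': 'meeting_time',
}


def collect_fields(page, field_map=FIELD_MAP):
    """Build the output dict in one pass, naming each field once."""
    return {key: page.get_field_text(source)
            for key, source in field_map.items()}
```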

iCal icon breaks data table parsing

Legistar updated the data table format to show a calendar icon instead of an input in the header row, which currently causes the parseDataTable method to throw an IndexError. I can put in a PR for this.


add a README

  • what is this project for?
  • how to contribute?
  • installing new
  • contributing to an existing project
  • where is it running?
  • who runs each instance?
  • how often does it run?

Perform more specific check for inaccessible gateway links

Recently, a huge number of public bills were marked as private because their gateway links could not be accessed: datamade/scrapers-us-municipal#55 (comment).

Per @fgregg:

it would be good to do a stricter check, if possible, that we are getting the kind of message we expect for private pages

One way to do that would be to update legislation_detail_url to check the response for a 200 status code and the page text for This record is currently unavailable.

    def legislation_detail_url(self, matter_id):
        gateway_url = self.BASE_WEB_URL + '/gateway.aspx?m=l&id={0}'

        # We want to suppress any session-level params for this HEAD
        # request, since they could lead to an additional level of redirect.
        #
        # Per http://docs.python-requests.org/en/master/user/advanced/,
        # we have to do this by setting session-level params to None.
        legislation_detail_route = self.head(
            gateway_url.format(matter_id),
            params={k: None for k in self.params}).headers['Location']

        return urljoin(self.BASE_WEB_URL, legislation_detail_route)
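A sketch of the stricter check described above (the helper name is illustrative; the marker text is the "record unavailable" message quoted in this issue):

```python
def looks_private(response):
    """Treat a page as private only when it matches the expected
    'record unavailable' pattern, not on any failed fetch."""
    return (response.status_code == 200
            and 'This record is currently unavailable' in response.text)
```

With a check like this, a 5xx from the gateway would surface as an error rather than silently flipping a public bill to private.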

Add a method for scraping specific bills and events

It's currently impossible to scrape a single bill without also scraping lots of other bills. Ditto for events. Let's look into a way to update specific bills and events by passing in matter/event IDs, respectively.
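A rough sketch of what such a method could look like against the Legistar web API (`fetch_matters` is a hypothetical name, and the `/matters/{id}` layout is assumed from the API's usual conventions):

```python
def fetch_matters(scraper, matter_ids):
    """Yield only the requested matters instead of paging through all bills."""
    for matter_id in matter_ids:
        response = scraper.get('{}/matters/{}'.format(scraper.BASE_URL, matter_id))
        yield response.json()
```

An equivalent events variant would hit the events endpoint with the given event IDs.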

Pip installing legistar from a requirements file gives an error

Hi, I'm trying to test out adding legistar as a dependency to another project, but I got an error pip installing legistar from a requirements file. This is what I did:

$ virtualenv -p python3 testenv1
Running virtualenv with interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in testenv1/bin/python3
Also creating executable in testenv1/bin/python
Installing setuptools, pip, wheel...done.
$ source testenv1/bin/activate
(testenv1)$ pip install git+https://github.com/opencivicdata/python-legistar-scraper
Collecting git+https://github.com/opencivicdata/python-legistar-scraper
  Cloning https://github.com/opencivicdata/python-legistar-scraper to /tmp/pip-8lwn6og7-build
Collecting lxml (from legistar==0.0.1)
Collecting pytz (from legistar==0.0.1)
  Using cached pytz-2017.2-py2.py3-none-any.whl
Collecting icalendar (from legistar==0.0.1)
  Using cached icalendar-3.11.7-py2.py3-none-any.whl
Collecting python-dateutil (from icalendar->legistar==0.0.1)
  Using cached python_dateutil-2.6.1-py2.py3-none-any.whl
Collecting six>=1.5 (from python-dateutil->icalendar->legistar==0.0.1)
  Using cached six-1.10.0-py2.py3-none-any.whl
Installing collected packages: lxml, pytz, six, python-dateutil, icalendar, legistar
  Running setup.py install for legistar
Successfully installed icalendar-3.11.7 legistar-0.0.1 lxml-3.8.0 python-dateutil-2.6.1 pytz-2017.2 six-1.10.0
(testenv1)$ pip freeze > requirements.txt
(testenv1)$ deactivate
$ virtualenv -p python3 testenv2
Running virtualenv with interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in testenv2/bin/python3
Also creating executable in testenv2/bin/python
Installing setuptools, pip, wheel...done.
$ source testenv2/bin/activate
(testenv2)$ pip install -r requirements.txt 
Collecting icalendar==3.11.7 (from -r requirements.txt (line 1))
  Using cached icalendar-3.11.7-py2.py3-none-any.whl
Collecting legistar==0.0.1 (from -r requirements.txt (line 2))
  Could not find a version that satisfies the requirement legistar==0.0.1 (from -r requirements.txt (line 2)) (from versions: )
No matching distribution found for legistar==0.0.1 (from -r requirements.txt (line 2))
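This happens because `pip freeze` records the package as a plain `legistar==0.0.1` pin, but the package is not published on PyPI, so the second environment has nothing to resolve it against. A workaround is to reference the repository directly in the requirements file, e.g.:

```
git+https://github.com/opencivicdata/python-legistar-scraper#egg=legistar
```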

Question about PrimeGov

Hello,

My city, Louisville, Ky., has recently moved their public portal from Legistar to PrimeGov, and I was wondering if anyone had dealt with this transition before.

I was hoping that since both products are from Granicus, the underlying data would stay the same and even be available via the unpublished Legistar URLs, but I'm noticing differences.

(Screenshots: Legistar Metro Council meetings list vs. PrimeGov Metro Council meetings list.)

As you can see, recent Metro Council meetings are not appearing on the PrimeGov portal.

Is there any way to use the Legistar scrapers to scrape PrimeGov data?
If not, is a PrimeGov scraper something Open Civic Data is thinking about building or facilitating?

Thanks for your help!

Should we endow the API scraper/s with knowledge of the available endpoints?

Currently, we manually build endpoint URLs, like so:

events_url = self.BASE_URL + '/events/'

It would be cool to be able to reference endpoints like self.EVENTS_URL, etc.

If we go in this direction, we ought to have a conversation about whether to include all of the endpoints or an oft-used subset, and how to organize that knowledge between the base scraper and its children.
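A minimal sketch of one possible shape, where child scrapers override `BASE_URL` alone and every endpoint attribute follows (the class and property names here are illustrative, and the base URL shows an example client):

```python
class LegistarAPIScraper:
    BASE_URL = 'https://webapi.legistar.com/v1/metro'  # example client

    @property
    def EVENTS_URL(self):
        return self.BASE_URL + '/events/'

    @property
    def MATTERS_URL(self):
        return self.BASE_URL + '/matters/'
```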

How do we determine which version of a related matter is "active"?

We introduced logic in #112 to retrieve the most recent version of every matter relation. This relies on the assumption that the most recent version is the active version, but per Metro-Records/la-metro-councilmatic#669 (comment), this is not always the case. Omar thought Regina might have implemented some logic to determine which version is active, but I can't find it.

The revised implementation exposes a method to apply a different filter to matter relations, so maybe this is a question we allow users of this library to answer themselves:

    def _filter_relations(self, relations):
        '''
        Sometimes, many versions of a bill are related. This method returns the
        most recent version of each relation. Override this method to apply a
        different filter or return the full array of relations.
        '''
        # Sort relations such that the latest version of each matter
        # ID is returned first.
        sorted_relations = sorted(
            relations,
            key=lambda x: (
                x['MatterRelationMatterId'],
                x['MatterRelationFlag']
            ),
            reverse=True
        )

        seen_relations = set()

        for relation in sorted_relations:
            relation_id = relation['MatterRelationMatterId']

            if relation_id not in seen_relations:
                yield relation
                seen_relations.add(relation_id)

Events: Look at all endpoints

The events scraper should hit all related endpoints, e.g., event agenda items; that is, if an event agenda item changes but the event does not, the scraper should still catch the change.

Base: Cast defaultdict as a dict

Currently, we use a defaultdict to store our data collections. However, we want something more fragile: if we try to look up a key that does not reside in the dictionary, then Python should throw a straightforward KeyError. For example, we'd prefer this behavior if the events dict does not have 'Audio': https://github.com/opencivicdata/scrapers-us-municipal/blob/master/lametro/events.py#L72

Let's cast the defaultdict to a dict:
https://github.com/opencivicdata/python-legistar-scraper/blob/cleanup/legistar/base.py#L103

N.B. Related to this issue: https://sentry.io/datamade/scrapers-us-municipal/issues/336591713/
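The change itself is small; a sketch of the intended behavior (the `finalize` helper name is illustrative):

```python
from collections import defaultdict


def finalize(collection):
    """Cast the working defaultdict to a plain dict, so looking up a
    missing key raises KeyError instead of inserting a default."""
    return dict(collection)
```

Code that builds rows with a defaultdict keeps its convenience during parsing, while consumers get a loud failure on a missing key like 'Audio'.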

Consider alternative methods of determining whether an entity has been updated

There are several known actions that change a bill or event, but do not update the last modified timestamp. These include making an agenda or media link public. This means that those entities will not be captured in so-called "windowed" scrapes (that is, scrapes that only check for entities updated since a certain timestamp). One option for capturing those changes is to run full scrapes more frequently, but that can be a time and resource intensive step. Are there other ways to determine when an entity has been updated?

Remove pupa dependency from legistar.base.LegistarScraper

We'd like to use the LegistarScraper and LegistarEventsScraper in this project, but would rather not have the pupa dependency if it's not needed.

I made a single change to legistar/base.py, lines 7 and 8:

from scrapelib import Scraper
#from pupa.scrape import Scraper

Then, the following worked for me (python 3 running in the terminal):

>>> from legistar.base import LegistarScraper
>>> ls = LegistarScraper()
>>> from legistar.events import LegistarEventsScraper
no pupa_settings on path, using defaults
>>> les = LegistarEventsScraper()
>>> les.EVENTSPAGE = 'https://cook-county.legistar.com/Calendar.aspx'
>>> g = les.events()
>>> next(g)
/.../python-legistar-scraper/python-legistar-scraper/lib/python3.4/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
/.../python-legistar-scraper/python-legistar-scraper/lib/python3.4/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
(defaultdict(<function LegistarScraper.parseDataTable.<locals>.<lambda> at 0x7f377429f400>, {'Meeting Details': 'Meeting\xa0details', 'Meeting Location': 'Cook County Building, Board Room, 118 North Clark Street, Chicago, Illinois', 'Name': {'url': 'https://cook-county.legistar.com/DepartmentDetail.aspx?ID=20924&GUID=B78A790A-5913-4FBF-8FBF-ECEE445B7796', 'label': 'Board of Commissioners'}, 'Meeting Date': '12/13/2017', 'Agenda': 'Not\xa0available', 'Meeting Time': '11:00 AM', 'Video': 'Not\xa0available', 'Minutes': 'Not\xa0available', 'iCalendar': {'url': 'https://cook-county.legistar.com/View.ashx?M=IC&ID=521586&GUID=59F74CF4-4FF8-4BEA-9723-4337DDCE22FB'}}), None)

As far as I'm concerned, this is working code and good for our project. :) I will dig into the docs to make sure there aren't any licensing issues and that the contributors to this project are given credit once our project is up and running using the LegistarEventsScraper.

I didn't see any tests to run, so I'm not sure if this small change has broken something else (and that's also why I've opened an issue not a pull request).

`event['Meeting Details']` not always accurate

Some Legistar sites, such as Phoenix, AZ, name their meeting information columns simply 'Details' (as opposed to 'Meeting Details') in the scraped tables.

We should update scraper code to account for this difference.

This is the code that triggered an error for the Phoenix, AZ Legistar instance scrape:

    if follow_links and type(event["Meeting Details"]) == dict:
        agenda = self.agenda(event["Meeting Details"]['url'])
    else:
        agenda = None
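One possible fix is a small lookup helper that tolerates both header variants (the helper name is illustrative):

```python
def meeting_details(event):
    """Return the details cell under either Legistar header variant."""
    for key in ('Meeting Details', 'Details'):
        if key in event:
            return event[key]
    return None
```

The caller would then check `isinstance(meeting_details(event), dict)` instead of indexing `event["Meeting Details"]` directly.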
