
python-legistar-scraper's Introduction

Open Civic Data Technical Documentation

This repository contains documentation for developers including:

  • Writing Scrapers using Pupa
  • Open Civic Data's Data Type Specifications
  • Open Civic Data Proposals

Read these docs at https://open-civic-data.readthedocs.io/en/latest/

python-legistar-scraper's People

Contributors: antidipyramid, fgregg, hancush, jamesturk, jmithani, paultag, pjsier, reginafcompton, rshorey, twneale

python-legistar-scraper's Issues

retry requests that raise 104 errors

For Metro, we've had a very low success rate for full bill scrapes (roughly one success a week) since we increased the volume of requests by scraping both public and private bills. The 104 errors from Legistar seem to be self-resolving, i.e., we may see a lower rate of scraper failure if we incorporate retries when a 104 occurs.
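A minimal retry sketch, assuming the scraper issues requests through a `requests`-style session (the wrapper name and backoff values here are illustrative, not part of this library):

```python
import time

import requests


def get_with_retries(session, url, max_attempts=3, backoff=2, **kwargs):
    """Retry a GET when the connection is reset (errno 104).

    The 104 errors from Legistar appear to be transient, so a short
    exponential backoff between attempts is usually enough.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return session.get(url, **kwargs)
        except requests.exceptions.ConnectionError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff ** attempt)
```

Note that scrapelib, which these scrapers build on, exposes `retry_attempts` and `retry_wait_seconds` settings that may already cover this case without a custom wrapper.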

merge cleanup branch into master

Spent a decent chunk of time confused because I was on master in my local repo. Yes, my fault, but it might be nice to merge it in for clarity, if it's all the same to you all.

remove as much redundancy as possible

could probably reduce this complexity & line count drastically (4k lines, more than pupa?!)

example of a target for this:


    @make_item('meeting_location')
    def get_meeting_location(self):
        return self.get_field_text('meeting_location')

Basically, the word meeting_location appears three times, and make_item breaks control flow in weird ways.
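One hypothetical way to collapse the pattern: a declarative field map plus a single collector, so each field is named once (`FIELD_MAP` and `collect_fields` are sketches, not existing API in this repo):

```python
# Hypothetical sketch: one declarative mapping from output keys to
# Legistar field names, instead of a `make_item` method per field.
FIELD_MAP = {
    'meeting_location': 'meeting_location',
    'meeting_time': 'meeting_time',
}


def collect_fields(page, field_map=FIELD_MAP):
    """Build the output dict in one pass, naming each field once."""
    return {key: page.get_field_text(source)
            for key, source in field_map.items()}
```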

iCal icon breaks data table parsing

Legistar updated the data table format to show a calendar icon instead of an input in the header row, which currently causes the parseDataTable method to throw an IndexError. I can put in a PR for this.


add a README

  • what is this project for?
  • how to contribute?
  • installing new
  • contributing to an existing project
  • where is it running?
  • who runs each instance?
  • how often does it run?

Perform more specific check for inaccessible gateway links

Recently, a huge number of public bills were marked as private because their gateway links could not be accessed: datamade/scrapers-us-municipal#55 (comment).

Per @fgregg:

it would be good to do a stricter check, if possible, that we are getting the kind of message we expect for private pages

One way to do that would be to update legislation_detail_url to check the response for a 200 status code and the page text for This record is currently unavailable.

    def legislation_detail_url(self, matter_id):
        gateway_url = self.BASE_WEB_URL + '/gateway.aspx?m=l&id={0}'

        # We want to suppress any session-level params for this HEAD
        # request, since they could lead to an additional level of redirect.
        #
        # Per http://docs.python-requests.org/en/master/user/advanced/,
        # we have to do this by setting session-level params to None.
        legislation_detail_route = self.head(
            gateway_url.format(matter_id),
            params={k: None for k in self.params}).headers['Location']

        return urljoin(self.BASE_WEB_URL, legislation_detail_route)
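A sketch of the stricter check described above (the helper name is illustrative; the marker text is the "record unavailable" message quoted in this issue):

```python
def looks_private(response):
    """Treat a page as private only when it matches the expected
    'record unavailable' pattern, not on any failed fetch."""
    return (response.status_code == 200
            and 'This record is currently unavailable' in response.text)
```

With a check like this, a 5xx from the gateway would surface as an error rather than silently flipping a public bill to private.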

Add a method for scraping specific bills and events

It's currently impossible to scrape a single bill without also scraping lots of other bills. Ditto for events. Let's look into a way to update specific bills and events by passing in matter/event IDs, respectively.
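A rough sketch of what such a method could look like against the Legistar web API (`fetch_matters` is a hypothetical name, and the `/matters/{id}` layout is assumed from the API's usual conventions):

```python
def fetch_matters(scraper, matter_ids):
    """Yield only the requested matters instead of paging through all bills."""
    for matter_id in matter_ids:
        response = scraper.get('{}/matters/{}'.format(scraper.BASE_URL, matter_id))
        yield response.json()
```

An equivalent events variant would hit the events endpoint with the given event IDs.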

Pip installing legistar from a requirements file gives an error

Hi, I'm trying to test out adding legistar as a dependency to another project, but I got an error pip installing legistar from a requirements file. This is what I did:

$ virtualenv -p python3 testenv1
Running virtualenv with interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in testenv1/bin/python3
Also creating executable in testenv1/bin/python
Installing setuptools, pip, wheel...done.
$ source testenv1/bin/activate
(testenv1)$ pip install git+https://github.com/opencivicdata/python-legistar-scraper
Collecting git+https://github.com/opencivicdata/python-legistar-scraper
  Cloning https://github.com/opencivicdata/python-legistar-scraper to /tmp/pip-8lwn6og7-build
Collecting lxml (from legistar==0.0.1)
Collecting pytz (from legistar==0.0.1)
  Using cached pytz-2017.2-py2.py3-none-any.whl
Collecting icalendar (from legistar==0.0.1)
  Using cached icalendar-3.11.7-py2.py3-none-any.whl
Collecting python-dateutil (from icalendar->legistar==0.0.1)
  Using cached python_dateutil-2.6.1-py2.py3-none-any.whl
Collecting six>=1.5 (from python-dateutil->icalendar->legistar==0.0.1)
  Using cached six-1.10.0-py2.py3-none-any.whl
Installing collected packages: lxml, pytz, six, python-dateutil, icalendar, legistar
  Running setup.py install for legistar
Successfully installed icalendar-3.11.7 legistar-0.0.1 lxml-3.8.0 python-dateutil-2.6.1 pytz-2017.2 six-1.10.0
(testenv1)$ pip freeze > requirements.txt
(testenv1)$ deactivate
$ virtualenv -p python3 testenv2
Running virtualenv with interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in testenv2/bin/python3
Also creating executable in testenv2/bin/python
Installing setuptools, pip, wheel...done.
$ source testenv2/bin/activate
(testenv2)$ pip install -r requirements.txt 
Collecting icalendar==3.11.7 (from -r requirements.txt (line 1))
  Using cached icalendar-3.11.7-py2.py3-none-any.whl
Collecting legistar==0.0.1 (from -r requirements.txt (line 2))
  Could not find a version that satisfies the requirement legistar==0.0.1 (from -r requirements.txt (line 2)) (from versions: )
No matching distribution found for legistar==0.0.1 (from -r requirements.txt (line 2))
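This happens because `pip freeze` records the package as a plain `legistar==0.0.1` pin, but the package is not published on PyPI, so the second environment has nothing to resolve it against. A workaround is to reference the repository directly in the requirements file, e.g.:

```
git+https://github.com/opencivicdata/python-legistar-scraper#egg=legistar
```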

Question about PrimeGov

Hello,

My city, Louisville, Ky., has recently moved their public portal from Legistar to PrimeGov, and I was wondering if anyone had dealt with this transition before.

I was hoping that since both products are from Granicus, the underlying data would stay the same and even be available via the unpublished Legistar URLs, but I'm noticing differences.

(Screenshots: Legistar Metro Council meetings list vs. PrimeGov Metro Council meetings list.)

As you can see, recent Metro Council meetings are not appearing on the PrimeGov portal.

Is there any way to use the Legistar scrapers to scrape PrimeGov data?
If not, is a PrimeGov scraper something Open Civic Data is thinking about building or facilitating?

Thanks for your help!

Should we endow the API scraper/s with knowledge of the available endpoints?

Currently, we manually build endpoint URLs, like so:

events_url = self.BASE_URL + '/events/'

It would be cool to be able to reference endpoints like self.EVENTS_URL, etc.

If we go in this direction, we ought to have a conversation about whether to include all of the endpoints or an oft-used subset, and how to organize that knowledge between the base scraper and its children.
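A minimal sketch of one possible shape, where child scrapers override `BASE_URL` alone and every endpoint attribute follows (the class and property names here are illustrative, and the base URL shows an example client):

```python
class LegistarAPIScraper:
    BASE_URL = 'https://webapi.legistar.com/v1/metro'  # example client

    @property
    def EVENTS_URL(self):
        return self.BASE_URL + '/events/'

    @property
    def MATTERS_URL(self):
        return self.BASE_URL + '/matters/'
```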

How do we determine which version of a related matter is "active"?

We introduced logic in #112 to retrieve the most recent version of every matter relation. This relies on the assumption that the most recent version is the active version, but per Metro-Records/la-metro-councilmatic#669 (comment), this is not always the case. Omar thought Regina might have implemented some logic to determine which version is active, but I can't find it.

The revised implementation exposes a method to apply a different filter to matter relations, so maybe this is a question we allow users of this library to answer themselves:

    def _filter_relations(self, relations):
        '''
        Sometimes, many versions of a bill are related. This method returns the
        most recent version of each relation. Override this method to apply a
        different filter or return the full array of relations.
        '''
        # Sort relations such that the latest version of each matter
        # ID is returned first.
        sorted_relations = sorted(
            relations,
            key=lambda x: (
                x['MatterRelationMatterId'],
                x['MatterRelationFlag']
            ),
            reverse=True
        )

        seen_relations = set()

        for relation in sorted_relations:
            relation_id = relation['MatterRelationMatterId']

            if relation_id not in seen_relations:
                yield relation
                seen_relations.add(relation_id)

Events: Look at all endpoints

The events scraper should hit all related endpoints, e.g., event agenda items; that is, if an event agenda item changes but the event does not, the scraper should still catch the change.

Base: Cast defaultdict as a dict

Currently, we use a defaultdict to store our data collections. However, we want something more fragile: if we try to look up a key that does not reside in the dictionary, then Python should throw a straightforward KeyError. For example, we'd prefer this behavior if the events dict does not have 'Audio': https://github.com/opencivicdata/scrapers-us-municipal/blob/master/lametro/events.py#L72

Let's cast the defaultdict to a dict:
https://github.com/opencivicdata/python-legistar-scraper/blob/cleanup/legistar/base.py#L103

N.B. Related to this issue: https://sentry.io/datamade/scrapers-us-municipal/issues/336591713/
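The change itself is small; a sketch of the intended behavior (the `finalize` helper name is illustrative):

```python
from collections import defaultdict


def finalize(collection):
    """Cast the working defaultdict to a plain dict, so looking up a
    missing key raises KeyError instead of inserting a default."""
    return dict(collection)
```

Code that builds rows with a defaultdict keeps its convenience during parsing, while consumers get a loud failure on a missing key like 'Audio'.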

Consider alternative methods of determining whether an entity has been updated

There are several known actions that change a bill or event, but do not update the last modified timestamp. These include making an agenda or media link public. This means that those entities will not be captured in so-called "windowed" scrapes (that is, scrapes that only check for entities updated since a certain timestamp). One option for capturing those changes is to run full scrapes more frequently, but that can be a time and resource intensive step. Are there other ways to determine when an entity has been updated?

Remove pupa dependency from legistar.base.LegistarScraper

We'd like to use the LegistarScraper and LegistarEventsScraper in this project, but would rather not have the pupa dependency if it's not needed.

I made a single change to legistar/base.py, lines 7 and 8:

from scrapelib import Scraper
#from pupa.scrape import Scraper

Then, the following worked for me (python 3 running in the terminal):

>>> from legistar.base import LegistarScraper
>>> ls = LegistarScraper()
>>> from legistar.events import LegistarEventsScraper
no pupa_settings on path, using defaults
>>> les = LegistarEventsScraper()
>>> les.EVENTSPAGE = 'https://cook-county.legistar.com/Calendar.aspx'
>>> g = les.events()
>>> next(g)
/.../python-legistar-scraper/python-legistar-scraper/lib/python3.4/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
/.../python-legistar-scraper/python-legistar-scraper/lib/python3.4/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
(defaultdict(<function LegistarScraper.parseDataTable.<locals>.<lambda> at 0x7f377429f400>, {'Meeting Details': 'Meeting\xa0details', 'Meeting Location': 'Cook County Building, Board Room, 118 North Clark Street, Chicago, Illinois', 'Name': {'url': 'https://cook-county.legistar.com/DepartmentDetail.aspx?ID=20924&GUID=B78A790A-5913-4FBF-8FBF-ECEE445B7796', 'label': 'Board of Commissioners'}, 'Meeting Date': '12/13/2017', 'Agenda': 'Not\xa0available', 'Meeting Time': '11:00 AM', 'Video': 'Not\xa0available', 'Minutes': 'Not\xa0available', 'iCalendar': {'url': 'https://cook-county.legistar.com/View.ashx?M=IC&ID=521586&GUID=59F74CF4-4FF8-4BEA-9723-4337DDCE22FB'}}), None)

As far as I'm concerned, this is working code and good for our project. :) I will dig into the docs to make sure there aren't any licensing issues and that the contributors to this project are given credit once our project is up and running using the LegistarEventsScraper.

I didn't see any tests to run, so I'm not sure if this small change has broken something else (and that's also why I've opened an issue not a pull request).

`event['Meeting Details']` not always accurate

Some Legistar sites, such as Phoenix, AZ, name their meeting information columns simply 'Details' (as opposed to 'Meeting Details') in the scraped tables.

We should update scraper code to account for this difference.

This is the code that triggered an error for the Phoenix, AZ Legistar instance scrape:

    if follow_links and type(event["Meeting Details"]) == dict:
        agenda = self.agenda(event["Meeting Details"]['url'])
    else:
        agenda = None
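One possible fix is a small lookup helper that tolerates both header variants (the helper name is illustrative):

```python
def meeting_details(event):
    """Return the details cell under either Legistar header variant."""
    for key in ('Meeting Details', 'Details'):
        if key in event:
            return event[key]
    return None
```

The caller would then check `isinstance(meeting_details(event), dict)` instead of indexing `event["Meeting Details"]` directly.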
