scrapers-us-municipal's Introduction

Open Civic Data Technical Documentation

This repository contains documentation for developers, including:

  • Writing Scrapers using Pupa
  • Open Civic Data's Data Type Specifications
  • Open Civic Data Proposals

Read these docs at https://open-civic-data.readthedocs.io/en/latest/

scrapers-us-municipal's People

Contributors

antidipyramid, coreyar, derekeder, feydan, fgomez828, fgregg, guelo, hancush, jamesturk, jesseilev, jmithani, jtotoole, mileswwatkins, mjumbewu, paultag, reginafcompton, rshorey, sbma44, twneale, walter


scrapers-us-municipal's Issues

Check on logging

Our logging configuration does not seem to be capturing all of the messages we want. Let's make sure it gets all of them.
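
A minimal sketch of a configuration that should catch everything. The logger names ("pupa", "scrapelib") are assumptions about what the scrapers actually use; adjust to whatever names show up missing.

```python
import logging

# Attach a handler to the root logger so messages from every module
# propagate up and get recorded, regardless of which logger emitted them.
logging.basicConfig(
    level=logging.DEBUG,  # capture everything; raise to INFO in production
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

# Make sure third-party loggers aren't silenced by their own levels.
for name in ("pupa", "scrapelib"):
    logging.getLogger(name).setLevel(logging.DEBUG)
```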

Duplicative documents

In Chicago, sometimes 'Notice' and 'Agenda' are the same file. Sometimes not. Pupa doesn't allow the same file to appear more than once in a document list (it throws a ValueError). How do we want to handle this?
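
One option, sketched below: collapse scraped entries that point at the same URL into a single document with a merged note, so Pupa never sees the same file twice. The `(note, url)` pair shape is an assumption about what the Chicago scraper pulls from Legistar.

```python
def dedupe_documents(documents):
    """Collapse entries that point at the same file, merging their notes.

    `documents` is a list of (note, url) pairs. Returns one entry per URL,
    preserving first-seen order, with notes joined as 'Notice / Agenda'.
    """
    merged = {}   # url -> list of distinct notes
    order = []    # urls in first-seen order
    for note, url in documents:
        if url not in merged:
            merged[url] = []
            order.append(url)
        if note not in merged[url]:
            merged[url].append(note)
    return [(" / ".join(merged[url]), url) for url in order]
```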

NYC: parsing additional event info

On Legistar, events often have additional notes in italicized text under the event location: http://legistar.council.nyc.gov/Calendar.aspx/

The italicized text seems to fall into two categories: (1) 'jointly with' plus a list of committees (example), or (2) a note on status, indicating either a continuation of a previous meeting (example) or that the meeting has been recessed (example).

to-do:

  • parse this info into participants & status for OCD
  • remove this info from the OCD event name
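
A rough sketch of the parsing step, assuming the italicized note has already been extracted as plain text. The regexes are guesses based on the two observed forms and would need tightening against real pages.

```python
import re

def parse_event_note(note):
    """Classify a Legistar event note into participants and/or a status."""
    note = note.strip()
    # Form (1): 'jointly with' plus a list of committees
    m = re.match(r"(?i)jointly with\s+(.*)", note)
    if m:
        parts = re.split(r",\s*|\s+and\s+", m.group(1))
        return {"participants": [p.strip() for p in parts if p.strip()],
                "status": None}
    # Form (2): a status note -- recessed, or continued from a prior meeting
    if re.search(r"(?i)recessed", note):
        return {"participants": [], "status": "recessed"}
    if re.search(r"(?i)continu", note):
        return {"participants": [], "status": "continued"}
    return {"participants": [], "status": None}
```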

Scrape document types for Chicago

As they do their business, the city council deals with documents other than bills:

  • orders
  • claims
  • communications
  • reports
  • oaths of office

These are not bills, but they should be tracked. This will depend on a DocumentType or similar being added to pupa.

404 Error in St Louis Scraper

Attn @jesseilev

HTTPError: 404 while retrieving https://www.stlouis-mo.gov/government/departments/aldermen/city-laws/boardbill.cfm?bbDetail=true&BBId=9819
  File "bin/pupa", line 9, in <module>
    load_entry_point('pupa==0.5.0', 'console_scripts', 'pupa')()
  File "pupa/cli/__main__.py", line 71, in main
    subcommands[args.subcommand].handle(args, other)
  File "pupa/cli/commands/update.py", line 242, in handle
    report['scrape'] = self.do_scrape(juris, args, scrapers)
  File "pupa/cli/commands/update.py", line 141, in do_scrape
    report[scraper_name] = scraper.do_scrape(**scrape_args)
  File "pupa/scrape/base.py", line 101, in do_scrape
    for obj in self.scrape(**kwargs) or []:
  File "st_louis/bills.py", line 25, in scrape
    yield self.scrape_bill(bill_url, bill_id, session_id)
  File "st_louis/bills.py", line 29, in scrape_bill
    page = self.lxmlize(bill_url)
  File "st_louis/utils.py", line 12, in lxmlize
    entry = self.get(url).text
  File "requests/sessions.py", line 480, in get
    return self.request('GET', url, **kwargs)
  File "scrapelib/__init__.py", line 272, in request
    raise HTTPError(resp)

chicago fails to run

     entry = self.urlopen(url)
AttributeError: 'ChicagoBillScraper' object has no attribute 'urlopen'

I think we need self.get instead. Will patch in a sec.

Chicago Legistar SSL cert verification failing

Under the hood, requests (which uses urllib3 for SSL, I'm pretty sure) is not letting us load pages from the Chicago Legistar site: the site forces SSL while not doing a great job of supporting a secure connection. There are a couple of ways to approach this: pass verify=False to requests, or ask the Legistar people to fix their SSL setup. I'm thinking the first option is going to be a lot easier in the short term.
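
The short-term workaround would look something like this with a plain requests session; scrapelib sessions subclass requests, so the same attribute should carry through, though that's an assumption worth verifying against the scraper's setup.

```python
import requests

# Disable certificate verification on the session so every request to the
# misconfigured Legistar host skips the failing SSL check. requests will
# emit an InsecureRequestWarning on each call, which can be suppressed.
session = requests.Session()
session.verify = False
```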

Chicago: Capturing current controlling body of legislation

We capture referral actions from the city council to a committee, but I don't see any way to record which committee the legislation was referred to (we know it from Legistar).

We'll eventually know once the committee takes action on a piece of legislation, but right now there doesn't seem to be a way of finding pending legislation in a committee.

@rshorey @paultag is that right? Any suggestions? Does this require an OCDEP?

Capture time of meeting for cancelled events

Right now we are not capturing the start time of cancelled events, which makes it impossible to update their status when they get cancelled. We need to pull this info from the internet calendar links.

This is for NYC.
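
A sketch of pulling the start time out of the .ics file behind the internet calendar link, using only the stdlib. The naive `DTSTART` format (no timezone conversion) is an assumption based on typical Legistar feeds; a real fix might want a proper iCalendar parser.

```python
from datetime import datetime

def ics_start_time(ics_text):
    """Extract the DTSTART value from an iCalendar event as a datetime.

    Handles both 'DTSTART:...' and 'DTSTART;TZID=...:...' lines by
    splitting on the first colon. Returns None if no DTSTART is found.
    """
    for line in ics_text.splitlines():
        if line.startswith("DTSTART"):
            value = line.split(":", 1)[1].strip()
            return datetime.strptime(value, "%Y%m%dT%H%M%S")
    return None
```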

Conflict error for NYC action on bill

Moxie is reporting a DuplicateItemError for an action on this bill: http://api.opencivicdata.org/ocd-bill/79d131fc-dedf-4f43-ab8f-3c406f6ccc39/

http://legistar.council.nyc.gov/LegislationDetail.aspx?ID=1796877&GUID=9F4422ED-4F21-409A-97A6-623A950F71BF

import votes...
Traceback (most recent call last):
  File "/usr/local/bin/pupa", line 9, in <module>
    load_entry_point('pupa==0.4.1', 'console_scripts', 'pupa')()
  File "/pupa/src/pupa/pupa/cli/__main__.py", line 71, in main
    subcommands[args.subcommand].handle(args, other)
  File "/pupa/src/pupa/pupa/cli/commands/update.py", line 244, in handle
    report['import'] = self.do_import(juris, args)
  File "/pupa/src/pupa/pupa/cli/commands/update.py", line 181, in do_import
    report.update(vote_importer.import_directory(datadir))
  File "/pupa/src/pupa/pupa/importers/base.py", line 169, in import_directory
    return self.import_data(json_stream())
  File "/pupa/src/pupa/pupa/importers/base.py", line 206, in import_data
    obj_id, what = self.import_item(data)
  File "/pupa/src/pupa/pupa/importers/base.py", line 241, in import_item
    raise DuplicateItemError(data, obj)
pupa.exceptions.DuplicateItemError: attempt to import data that would conflict with data already in the import: {'result': 'pass', 'extras': {}, 'legislative_session_id': UUID('1ca51a67-4d27-4649-bf34-c5849dfd30d6'), 'bill_id': 'ocd-bill/79d131fc-dedf-4f43-ab8f-3c406f6ccc39', 'identifier': '', 'motion_classification': ['passage'], 'motion_text': 'Approved, by Council', 'organization_id': 'ocd-organization/389257d3-aefe-42df-b3a2-a0d56d0ea731', 'start_date': '2014-05-14'} (already imported as Approved, by Council on M 63-2014 in New York City 2014 Regular Session Session)

Getting connection error when scraping bills.

pupa update --scrape --fastmode chicago
no pupa_settings on path, using defaults
chicago (scrape)
  bills: {}
Not checking sessions...
16:00:59 INFO pupa: save jurisdiction Chicago City Council as jurisdiction_ocd-jurisdiction-country:us-state:il-place:chicago-council.json
16:00:59 INFO pupa: save organization Chicago City Council as organization_81654f1c-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Clerk as post_8165f606-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Mayor as post_816691b0-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 1 as post_816747cc-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 2 as post_816801b2-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 3 as post_8168162a-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 4 as post_81681c10-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 5 as post_8168219c-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 6 as post_81682746-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 7 as post_81682c6e-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 8 as post_816831a0-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 9 as post_816836be-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 10 as post_81683bbe-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 11 as post_816840d2-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 12 as post_8168460e-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 13 as post_81684b18-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 14 as post_8168502c-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 15 as post_81685586-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 16 as post_81685ac2-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 17 as post_81685fd6-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 18 as post_81686558-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 19 as post_81686a62-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 20 as post_81686f6c-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 21 as post_81687480-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 22 as post_81687ade-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 23 as post_8168807e-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 24 as post_81688588-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 25 as post_81688b00-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 26 as post_81689014-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 27 as post_816896e0-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 28 as post_81689c44-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 29 as post_8168a162-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 30 as post_8168a9aa-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 31 as post_8168ae0a-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 32 as post_8168b242-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 33 as post_8168b68e-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 34 as post_8168bd78-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 35 as post_8168c200-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 36 as post_8168c642-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 37 as post_8168ca70-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 38 as post_8168ce9e-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 39 as post_8168d2b8-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 40 as post_8168d722-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 41 as post_8168db5a-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 42 as post_8168df74-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 43 as post_8168e46a-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 44 as post_8168e8ac-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 45 as post_8168ecda-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 46 as post_8168f112-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 47 as post_8168f536-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 48 as post_8168f964-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 49 as post_8168fd92-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save post Ward 50 as post_816901c0-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO pupa: save organization Democrats as organization_816f148e-ed20-11e4-b79e-0024d703f52c.json
16:00:59 INFO scrapelib: GET - https://chicago.legistar.com/Legislation.aspx
16:00:59 INFO scrapelib: POST - https://chicago.legistar.com/Legislation.aspx
Traceback (most recent call last):
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 372, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 544, in urlopen
    body=body, headers=headers)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 374, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.4/http/client.py", line 1139, in getresponse
    raise ResponseNotReady(self.__state)
http.client.ResponseNotReady: Request-sent

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/adapters.py", line 370, in send
    timeout=timeout
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 597, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/util/retry.py", line 245, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/packages/six.py", line 309, in reraise
    raise value.with_traceback(tb)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 544, in urlopen
    body=body, headers=headers)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 374, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.4/http/client.py", line 1139, in getresponse
    raise ResponseNotReady(self.__state)
requests.packages.urllib3.exceptions.ProtocolError: ('Connection aborted.', ResponseNotReady('Request-sent',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/fgregg/public/municipal-scrapers-us/.env/bin/pupa", line 9, in <module>
    load_entry_point('pupa==0.4.1', 'console_scripts', 'pupa')()
  File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/cli/__main__.py", line 71, in main
    subcommands[args.subcommand].handle(args, other)
  File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/cli/commands/update.py", line 224, in handle
    report['scrape'] = self.do_scrape(juris, args, scrapers)
  File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/cli/commands/update.py", line 123, in do_scrape
    report[scraper_name] = scraper.do_scrape(**scrape_args)
  File "/home/fgregg/public/municipal-scrapers-us/.env/src/pupa/pupa/scrape/base.py", line 102, in do_scrape
    for obj in self.scrape(**kwargs) or []:
  File "/home/fgregg/public/municipal-scrapers-us/chicago/bills.py", line 89, in scrape
    for i, page in enumerate(self.searchLegislation()) :
  File "/home/fgregg/public/municipal-scrapers-us/chicago/legistar.py", line 24, in pages
    page = self.lxmlize(url, payload)
  File "/home/fgregg/public/municipal-scrapers-us/chicago/legistar.py", line 16, in lxmlize
    entry = self.post(url, payload).text
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 508, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 270, in request
    **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/cache.py", line 66, in request
    resp = super(CachingSession, self).request(method, url, **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 92, in request
    return super(ThrottledSession, self).request(method, url, **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 177, in request
    raise exception_raised
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/scrapelib/__init__.py", line 157, in request
    resp = super(RetrySession, self).request(method, url, **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 594, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 594, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 196, in resolve_redirects
    **adapter_kwargs
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/home/fgregg/public/municipal-scrapers-us/.env/lib/python3.4/site-packages/requests/adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ResponseNotReady('Request-sent',))

Public hearings at same time as committee meeting
