Sunlight Congress API

This is the code that powers the Sunlight Foundation's Congress API.

Overview

The Congress API has two parts:

  • A light front end, written in Ruby using Sinatra.
  • A back end of data scraping and loading tasks. Most are written in Ruby, but Python tasks are also supported.

The front end is essentially read-only. Its job is to translate an API call (the query string) into a single database query (usually to MongoDB), wrap the resulting JSON in a bit of pagination metadata, and return it to the user.

Endpoints and behavior are determined by introspecting on the classes defined in models/. These classes are also expected to define database indexes where applicable.

The front end tries to maintain as little model-specific logic as possible. There are a couple of exceptions (like allowing pagination to be disabled for /legislators), but generally, adding a new endpoint is as simple as adding a model class.
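
As a rough illustration (the class and field names below are hypothetical; the real conventions live in models/), a new endpoint amounts to little more than a new model class:

# models/hearing.rb -- hypothetical sketch, not an actual model from this repo
class Hearing
  include Mongoid::Document   # documents are stored in a "hearings" collection

  field :committee_id, type: String
  field :occurs_at, type: Time

  # model classes are also expected to declare their own database indexes
  index({committee_id: 1})
  index({occurs_at: 1})
end

With a class like this defined, the front end can introspect it and serve /hearings without any endpoint-specific code.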

The back end is a set of tasks (scripts) whose job is to write data to the collections those models refer to. Most data is stored in MongoDB, but some tasks will store additional data in Elasticsearch, and some tasks may extract citations via a citation server.

We currently manage these tasks via cron. A small task runner wraps each script in order to ensure any "reports" created along the way get emailed to admins, to catch errors, and to parse command line options.

While the front end and back end are mostly decoupled, many back-end tasks do use the definitions in models/ to save data (via Mongoid) and to manage the duplication of "basic" fields about objects onto other objects.

The API never performs joins -- if data from one collection is expected to appear as a sub-field on another collection, it should be copied there during data loading.
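
For example, a loading task that wants basic committee details to appear on each hearing would copy them in at write time. A minimal sketch, assuming hypothetical Hearing and Committee models and field names:

# hypothetical sketch: denormalize "basic" committee fields onto a hearing
committee = Committee.where(committee_id: committee_id).first

hearing = Hearing.find_or_initialize_by(hearing_id: hearing_id)
hearing.attributes = {
  occurs_at: occurs_at,
  committee_id: committee_id,
  # copy a subset of committee fields onto the hearing document itself,
  # so the API never needs to join the two collections at query time
  committee: committee.attributes.slice("committee_id", "name", "chamber")
}
hearing.save!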

Setup - Dependencies

If you don't have Bundler, install it:

gem install bundler

Then use Bundler to install the Ruby dependencies:

bundle install --local

If you're going to use any of the Python-based tasks, install virtualenv and virtualenvwrapper, make a new virtual environment, and install the Python dependencies:

mkvirtualenv congress-api
pip install -r tasks/requirements.txt

Some tasks use PDF text extraction, which is performed through the docsplit gem. If you use a task that does this, you will need to install a system dependency, pdftotext.

On Linux:

sudo apt-get install poppler-utils poppler-data

Or on OS X:

brew install poppler

Setup - Configuration

Copy the example config files:

cp config/config.yml.example config/config.yml
cp config/mongoid.yml.example config/mongoid.yml
cp config.ru.example config.ru

You don't need to edit these to get started in development; the defaults should work fine.

In production, you may wish to turn on the API key requirement, and add SMTP server details so that mail can be sent to admins and task owners.

If you work for the Sunlight Foundation and want the API to sync analytics and API keys with HQ, you'll need to update the services section with a shared_secret.

Read the documentation in config.yml.example for a description of each element.

Setup - Services

You can get started by just installing MongoDB.

The Congress API depends on MongoDB, a JSONic document store, for just about everything. MongoDB can be installed via apt, homebrew, or manually.

Optional. Some tasks that index full text will require Elasticsearch, a JSONic full-text search engine based on Lucene. Elasticsearch can be installed via apt, or manually.

Optional. If you want citation parsing, you'll need to install citation, a Node-based citation extractor. After installing Node, you can install it with [sudo] npm -g install citation, then run it via cite-server on port 3000.

Optional. To perform location lookups, you'll need to point the API at an instance of pentagon, a boundary service. Sunlight uses an instance loaded with congressional districts and ZCTAs, so that we can look up legislators and districts by either latitude/longitude or zip.

Starting the API

After installing dependencies and MongoDB, and copying the config files, boot the app with:

bundle exec unicorn

The API should return some enthusiastic JSON at http://localhost:8080.

Specify --port to use a port other than 8080.
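
For example, to listen on port 3000 instead:

bundle exec unicorn --port 3000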

Running tasks

The API uses rake to run data loading tasks and various other API maintenance tasks.

Every directory in tasks/ generates an automatic rake task, like:

rake task:hearings_house

This will look in tasks/hearings_house/ for either a hearings_house.rb or hearings_house.py.

Ruby tasks should define a class named after the file, e.g. HearingsHouse, with a class-level run method that accepts a hash of options.

Python tasks should just define a run method that accepts a dict of options.

Options will be read from the command line using env syntax, for example:

rake task:hearings_house month=2014-01

The options hash will also include an additional config key that contains the parsed contents of config/config.yml, so that tasks have access to API configuration details.

So rake task:hearings_house month=2014-01 will execute:

HearingsHouse.run({
  month: "2014-01",
  config: {
    # ...parsed config.yml details...
  }
})

Task files should define the options they accept at the top of the file, in comments, as in the sketch below.
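
A bare-bones Ruby task might look like the following (the file name, option, and comment format here are illustrative, not copied from an actual task):

# tasks/hearings_house/hearings_house.rb -- hypothetical skeleton
#
# options:
#   month: YYYY-MM month to fetch hearings for (defaults to the current month)

class HearingsHouse

  def self.run(options = {})
    month = options[:month] || Time.now.strftime("%Y-%m")
    config = options[:config] # parsed contents of config/config.yml

    # ...fetch and save hearings for the given month...

    # tasks should file a success report when done (see "Task Reporting" below)
  end
end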

Task Reporting

Tasks can file "reports" as they operate. Reports will be stored in the database, and reports with certain status will be emailed to the admin and any task-specific owners (as configured in config.yml).

Since this is MongoDB, any other useful data can simply be dumped onto the report document.

For example, a task might log warnings during its operation, and send a single warning email at the end:

if failures.any?
  Report.failure self, "Failed to process #{failures.size} reports", {failures: failures}
end

(In this case, self is the class of the task, e.g. GaoReports.)

Emails will be sent when filing failure or warning reports. You can also store note reports, and all tasks should file a success report at the end if they were successful.
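
The Report.failure example above suggests the calling convention; a success report filed at the end of a task might look like this (the exact signature here is an assumption based on that example):

Report.success self, "Synced #{count} hearings", {count: count}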

The system will automatically file a complete report, with a record of how long a task took - tasks do not need to do this themselves.

Similarly, if an exception is raised during a task, the system will catch it and file (and email) a failure report.

Any task that encounters an error or something worth warning about should file a warning or failure report during operation. After a task completes, the system will examine the reports collection for any "unread" warning or failure reports, send emails for each one, and mark them as "read".

Undocumented features

This API has some endpoints and features that are not included in the public documentation, but are used in Sunlight tools.

Endpoints

  • /regulations - Material published in the Federal Register since 2009. Currently used in Scout.
  • /documents - Reports from the Government Accountability Office and various inspectors general since 2009. Currently used in Scout.
  • /videos - Information on videos from the House and Senate floors, synced through the Granicus API. Currently used in Sunlight's Roku apps.

Citation detection

As bills, regulations, and documents are indexed into the system, they are first run through a citation extractor over HTTP.

Extracted citation data is stored locally, in Mongo, in a citations collection, using the Citation model. Excerpts of the surrounding context are also stored at index time.

The API accepts a citing parameter, of one or more (pipe-delimited) citation IDs, in the format produced by unitedstates/citation. Passing citing adds a filter (to either Mongo or Elasticsearch-based endpoints) of citation_ids__all, which limits results to only documents for which all given citation IDs were detected at index-time.
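
For example, citing=usc/5/552 would translate to roughly this Mongo filter (a sketch of the equivalent query, not code lifted from the API):

{citation_ids: {"$all" => ["usc/5/552"]}}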

If a citing.details parameter is passed with a value of true, then every returned result triggers a quick database lookup for that document's associated citations, and citation details (including the surrounding match context) are added to the document as a citations field.

For example, a search for:

/bills?citing=usc/5/552&citing.details=true&per_page=1&fields=bill_id

might return something like:

{
  "results": [
    {
      "bill_id": "s2141-113",
      "citations": [
        {
          "type": "usc",
          "match": "section 552(b) of title 5",
          "index": 8624,
          "excerpt": "disclosure pursuant to section 1905 of title 18, United States Code, section 552(b) of title 5, United States Code, or section 301(j) of this Act.",
          "usc": {
            "title": "5",
            "section": "552",
            "subsections": [],
            "id": "usc/5/552",
            "section_id": "usc/5/552"
          }
        }
      ]
    }
  ]
}

License

This project is licensed under the GPL v3.

sunlight-congress's People

Contributors

annetheagile, ben-zen, crdunwel, dwillis, jcarbaugh, kaitlin, konklone, lindsayyoung, luigi, mtigas, philosoralphter, plantfansam, rshorey, sbai

sunlight-congress's Issues

Forbidden fields on models

Document and try to enforce them somewhere. Anything used in params, basically: captures, callback, sections, apikey, per_page, page, order, sort. I think that's it.

Fetch some roll call vote data in real time

The idea here is to have a separate task that can run every X minutes. It can create new roll call votes that are missing fields (for example, "required" will be missing, and a related bill might not even exist yet). These fields will be filled in later by the twice-daily roll call vote task (that goes over all THOMAS-provided roll call votes and passage voice votes).

Example of House XML (view source, it's actually XML, and they use Bioguide IDs):
http://clerk.house.gov/evs/2010/roll518.xml

Example of Senate XML (uses some internal ID, will have to parse names out):
http://www.senate.gov/legislative/LIS/roll_call_votes/vote1112/vote_111_2_00229.xml

Sadly, in both cases we'll have to monitor an HTML table to see whether there's new stuff:
http://clerk.house.gov/evs/2010/index.asp
http://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_111_2.htm

And the URLs for both tables depend on the year, Congress number, and session number. Not trivial, but: possible.

Support "not" for fields

For example, to find any vote that was not a roll call:

votes.json?vote_type!=roll

Keys are absolutely unlikely to use exclamation points, though this makes parsing out the conditions a little trickier, of course.

Filter keys with dots don't work

Example:
/votes.json?apikey=sunlight9&per_page=1&vote_breakdown.ayes%3E=200

It breaks upon storing a hit in the analytics db.

Email-time for failure and warning reports should occur post-task

Instead of occurring as a report is filed, in the middle of a task, have tasks file reports marked as unread (as the default value). After the task is done running, go through all unread reports, mark them as read and send emails for any warnings or failures. Surround this in exception handling as well, and file a local report with a special flag set if it fails.

This is good not just so that reports can be filed from other languages and still reported on, but also so that tasks do not potentially hang in the middle of their job while trying to send an email. It's just sensible.

Support greater/less than or equal to

For example, show me all the bills with at least 5 cosponsors:

bills.json?cosponsors_count>=5

<= for less than or equal to.

It's not possible that keys will have > or < in them. Not allowing "less than" or "greater than" without "or equal to" will only be problematic in the case of floating point numbers, of which we don't have any now. If we end up having them in the future, we can invent some special syntax for them (>>= and <<=, perhaps).
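
A rough sketch of the intended mapping onto Mongo operators (illustrative only):

# "cosponsors_count>=5" would become:
{cosponsors_count: {"$gte" => 5}}
# and "<=" would map to "$lte"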

Link votes and amendments

  1. votes_archive should look for references to amendments (do they exist?) and add an amendment_id field and amendment subobject if it's there.
  2. Same for rolls_live_house and rolls_live_senate.

Report analytics nightly

Report nightly, as Drumbone does.

File local reports, and make sure broad exception handling is covered.

Add a Vote model and populate it

Port over the roll call fetching code from Drumbone, into a model named Vote. Add a vote_type field that's either "roll" or whatever it is.

As part of the get_votes task, following roll call loading, iterate through each bill and go through each one's votes array. For any voice votes, create them (and include a "bill" object on them). For any roll call votes, update them with anything worth doing (perhaps nothing, refer to notes).

If the Vote table is empty, fill it from scratch. Otherwise, you can just worry about the roll call votes in the Senate and House whose numbers are higher than the last recorded, since old roll call votes never change. Then, go over bills and add voice votes and link roll call votes as normal.

Once we're pulling in partial roll call vote data in real time, this logic can be updated to be: if the Vote table is empty, fill it from scratch. Otherwise, just fill in the un-filled in roll call votes, then go over bills and add voice votes and link roll call votes as normal.

Publish data in bulk

Have a nightly task dump the tables to compressed JSON at a publicly available address.

Expand videos to include White House videos

Real Time "Congress" be damned:

  • update house_live script to add a "chamber" field with a value of "house"
  • update house_live script to rename "timestamp_id" to "video_id" and prepend "house", e.g. "house-123456789"
  • make two whitehouse_live scripts that pull archival and live videos. Use a "chamber" value of "whitehouse", and a "video_id" value of "whitehouse-" followed by the date and slug, e.g. "whitehouse-2010-11-23-new-start-treaty".

House Whip Notices

Democratic and Republican whip notices for the House, using the code or algorithms in the old RTC API.

Support "in" and "nin" operators

For example, to support queries such as "give me all bills that are actually bills and not resolutions":

bills.json?bill_type__in=hr|hjres|s|sjres

Pipes seem unlikely to occur in filterable fields, and if we found some source data that uses pipes, we could always swap those pipes out for something else before syndicating it.

Use an "in" query though, not an actual "or" query:
http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-%24in

Finally, when "not" is supported, support the idea of "not in" searches, like this one for "anything but simple resolutions":

bills.json?bill_type!=hres|sres

This would map to the "nin" operator in Mongo.
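
A sketch of how the pipe-delimited values would map onto Mongo operators (illustrative only):

# bill_type__in=hr|hjres|s|sjres
{bill_type: {"$in" => ["hr", "hjres", "s", "sjres"]}}

# bill_type!=hres|sres
{bill_type: {"$nin" => ["hres", "sres"]}}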

Record hits in the database for analytics

Be wary, as this caused issues in Drumbone when the volume got too high, but keep something.

Perhaps a task that runs monthly that offloads the month's hits into a dump and removes them from the database.

Support $all operator

So if someone wants a bill cosponsored by a pair of people:
/bills.json?cosponsor_ids__all=1|2

that'd match any bills where the cosponsor_ids array contains both "1" and "2".
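
In Mongo, that would be the $all operator, roughly:

{cosponsor_ids: {"$all" => ["1", "2"]}}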

Add Amendments

Add an "amendments" endpoint, using GovTrack's amendment XML:

Example:
http://www.govtrack.us/data/us/111/bills.amdt/h234.xml

Have a task, amendments_archive, that loads in all amendments to the table, and then goes over each bill (perhaps by Amendment.distinct(:bill_id) or the like) and adds an array of amendments to the bill. Each amendment on a bill should have only the basic fields (everything but the actions).

Include count and page keys for plural endpoints

As a peer to the array key (e.g. "bills"), include: "count", "page", and "per_page". "page" and "per_page" are the (possibly adjusted) pagination params, and "count" is the total number of items for that search.
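
A hypothetical response shape (the values here are made up):

{
  "bills": [ ... ],
  "count": 1234,
  "page": 2,
  "per_page": 20
}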

Link committees and bills

For any bill with an associated committee, see if we can add the committee and its relationship to the bill, as committee_ids and committee subobjects.

If we need to keep the relationship, then it should work like voter_ids and voters do on the vote object - {committee_id: [id], relationship: "..."} and {committee: {obj}, relationship: "..."}.

Investigate Hudson for monitoring

The Open State project uses this, and it would be worth checking out how applicable it is to our own tasks, especially since the most awkward part of supporting multi-language data loaders is the reporting.

Pull in House videos and floor events

Work with Kaitlin to pull in floor events.

I'm not sure yet how to reconcile the floor events from this feed with the floor_events that Josh already picked up in the old RTC.

Directory structure for tasks

Give each task a folder that supports running unit tests (i.e. link to environment.rb correctly), or any other files the task needs.

Have the loader that governs making the rake tasks use the folder names. Have each task load in the [task_name].rb file in the root of the task's folder, and assume that a camelized class name is in there.

Dates should be dates, not timestamps

For most bill dates, the stored value is a full timestamp at midnight UTC, which is incorrect. It should be limited to the date only, with no time component.

Since America is west of UTC, these dates represented in any American timezone would be the day before they actually are, which is a serious inaccuracy.

Pluck out legislator_names and bioguide_ids from clip description

Add a legislator_names array with raw extracted names ("Mr. Price (GA)", "Mr. Stevens", etc.) for each clip, and one aggregated one for the top-level object that has all names mentioned in the clips.

Add a bioguide_ids array with matched bioguide IDs ("L000551", etc.) for each clip, that are determined by the extracted names. Err on the side of including too many bioguide IDs - so if the clip mentions "Mr. Smith" and that matches 3 people, add all 3 of their bioguide IDs to the array, to be safe. As you said, false positives are better than not matching at all. Add an array to the top-level object as well, that has the unique bioguide_ids for all clips.

I'll make sure there's an index on all 4 array fields - "bioguide_ids", "legislator_names", "clips.bioguide_ids", and "clips.legislator_names". Mongo takes care of indexing arrays and fields inside of arrays.

You can scope matching for particular names by chamber, so you only need to look for "Mr. Price" among legislators whose chamber field is "house".

But bear in mind that we can't just match on legislators whose in_office field is true, as legislators may go in and out of office mid-session, and as we transition to the 112th session our database will have multiple sessions.

(It's my hope that eventually our Congress API will evolve to maintain a range of when people were in office, which would help us make more precise choices in our other projects, too.)
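
A rough sketch of the kind of extraction being asked for (the pattern, model, and field names here are illustrative, not from the codebase):

# hypothetical extraction of "Mr. Price (GA)"-style names from a clip description
names = description.scan(/\b(?:Mr|Mrs|Ms|Dr)\. [A-Z][A-Za-z'-]+(?: \([A-Z]{2}\))?/)

# match each raw name against legislators in the relevant chamber, keeping
# every bioguide ID that could plausibly match (false positives are acceptable)
bioguide_ids = names.flat_map { |name|
  last_name = name[/\. (\S+)/, 1]
  Legislator.where(chamber: "house", last_name: last_name).map(&:bioguide_id)
}.uniq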

Put party breakdown inside vote_breakdown

vote_breakdown: {
    total: {ayes: ..., nays: ..., ...},
    party: {R: {ayes: ..., nays: ..., ...}, D: {...}, ...},
}

Leaves room for us to easily expand on any other ways it could be broken down.

Set up cronjobs on staging and backend

For get_legislators, once a day.
For get_bills, twice a day.
For get_rolls, twice a day.
For house_live, every 5 minutes (consult Kaitlin).

Imminent:
For get_amendments, twice a day.
For rolls_live, every 10 minutes.

Down the line:
For floor_updates, every minute.
For various docserver scrapers, consult Josh.
