Sunlight Congress API

This is the code that powers the Sunlight Foundation's Congress API.

Overview

The Congress API has two parts:

  • A light front end, written in Ruby using Sinatra.
  • A back end of data scraping and loading tasks. Most are written in Ruby, but Python tasks are also supported.

The front end is essentially read-only. Its job is to translate an API call (the query string) into a single database query (usually to MongoDB), wrap the resulting JSON in a bit of pagination metadata, and return it to the user.

Endpoints and behavior are determined by introspecting on the classes defined in models/. These classes are also expected to define database indexes where applicable.

The front end tries to maintain as little model-specific logic as possible. There are a couple of exceptions (like allowing pagination to be disabled for /legislators), but generally, adding a new endpoint is as simple as adding a model class.
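
As a rough illustration (the class and field names below are hypothetical; the real conventions live in models/), a new endpoint amounts to little more than a new model class:

# models/hearing.rb -- hypothetical sketch, not an actual model from this repo
class Hearing
  include Mongoid::Document   # documents are stored in a "hearings" collection

  field :committee_id, type: String
  field :occurs_at, type: Time

  # model classes are also expected to declare their own database indexes
  index({committee_id: 1})
  index({occurs_at: 1})
end

With a class like this defined, the front end can introspect it and serve /hearings without any endpoint-specific code.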

The back end is a set of tasks (scripts) whose job is to write data to the collections those models refer to. Most data is stored in MongoDB, but some tasks will store additional data in Elasticsearch, and some tasks may extract citations via a citation server.

We currently manage these tasks via cron. A small task runner wraps each script in order to ensure any "reports" created along the way get emailed to admins, to catch errors, and to parse command line options.

While the front end and back end are mostly decoupled, many back-end tasks do use the definitions in models/ to save data (via Mongoid) and to manage the duplication of "basic" fields about objects onto other objects.

The API never performs joins -- if data from one collection is expected to appear as a sub-field on another collection, it should be copied there during data loading.
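
For example, a loading task that wants basic committee details to appear on each hearing would copy them in at write time. A minimal sketch, assuming hypothetical Hearing and Committee models and field names:

# hypothetical sketch: denormalize "basic" committee fields onto a hearing
committee = Committee.where(committee_id: committee_id).first

hearing = Hearing.find_or_initialize_by(hearing_id: hearing_id)
hearing.attributes = {
  occurs_at: occurs_at,
  committee_id: committee_id,
  # copy a subset of committee fields onto the hearing document itself,
  # so the API never needs to join the two collections at query time
  committee: committee.attributes.slice("committee_id", "name", "chamber")
}
hearing.save!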

Setup - Dependencies

If you don't have Bundler, install it:

gem install bundler

Then use Bundler to install the Ruby dependencies:

bundle install --local

If you're going to use any of the Python-based tasks, install virtualenv and virtualenvwrapper, make a new virtual environment, and install the Python dependencies:

mkvirtualenv congress-api
pip install -r tasks/requirements.txt

Some tasks use PDF text extraction, which is performed through the docsplit gem. If you use a task that does this, you will need to install a system dependency, pdftotext.

On Linux:

sudo apt-get install poppler-utils poppler-data

Or on OS X:

brew install poppler

Setup - Configuration

Copy the example config files:

cp config/config.yml.example config/config.yml
cp config/mongoid.yml.example config/mongoid.yml
cp config.ru.example config.ru

You don't need to edit these to get started in development; the defaults should work fine.

In production, you may wish to turn on the API key requirement, and add SMTP server details so that mail can be sent to admins and task owners.

If you work for the Sunlight Foundation and want the API to sync analytics and API keys with HQ, you'll need to update the services section with a shared_secret.

Read the documentation in config.yml.example for a description of each element.

Setup - Services

You can get started by just installing MongoDB.

The Congress API depends on MongoDB, a JSONic document store, for just about everything. MongoDB can be installed via apt, homebrew, or manually.

Optional. Some tasks that index full text will require Elasticsearch, a JSONic full-text search engine based on Lucene. Elasticsearch can be installed via apt, or manually.

Optional. If you want citation parsing, you'll need to install citation, a Node-based citation extractor. After installing Node, you can install it with [sudo] npm -g install citation, then run it via cite-server on port 3000.

Optional. To perform location lookups, you'll need to point the API at an instance of pentagon, a boundary service. Sunlight uses an instance loaded with congressional districts and ZCTAs, so that we can look up legislators and districts by either latitude/longitude or zip.

Starting the API

After installing dependencies and MongoDB, and copying the config files, boot the app with:

bundle exec unicorn

The API should return some enthusiastic JSON at http://localhost:8080.

Specify --port to use a port other than 8080.
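
For example, to listen on port 3000 instead:

bundle exec unicorn --port 3000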

Running tasks

The API uses rake to run data loading tasks and various other API maintenance tasks.

Every directory in tasks/ generates an automatic rake task, like:

rake task:hearings_house

This will look in tasks/hearings_house/ for either a hearings_house.rb or hearings_house.py.

Ruby tasks should define a class named after the file, e.g. HearingsHouse, with a class-level run method that accepts a hash of options.

Python tasks should just define a run method that accepts a dict of options.

Options will be read from the command line using env syntax, for example:

rake task:hearings_house month=2014-01

The options hash will also include an additional config key that contains the parsed contents of config/config.yml, so that tasks have access to API configuration details.

So rake task:hearings_house month=2014-01 will execute:

HearingsHouse.run({
  month: "2014-01",
  config: {
    # ...parsed config.yml details...
  }
})

Task files should define the options they accept at the top of the file, in comments, as in the sketch below.
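
A bare-bones Ruby task might look like the following (the file name, option, and comment format here are illustrative, not copied from an actual task):

# tasks/hearings_house/hearings_house.rb -- hypothetical skeleton
#
# options:
#   month: YYYY-MM month to fetch hearings for (defaults to the current month)

class HearingsHouse

  def self.run(options = {})
    month = options[:month] || Time.now.strftime("%Y-%m")
    config = options[:config] # parsed contents of config/config.yml

    # ...fetch and save hearings for the given month...

    # tasks should file a success report when done (see "Task Reporting" below)
  end
end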

Task Reporting

Tasks can file "reports" as they operate. Reports will be stored in the database, and reports with certain status will be emailed to the admin and any task-specific owners (as configured in config.yml).

Since this is MongoDB, any other useful data can simply be dumped onto the report document.

For example, a task might log warnings during its operation, and send a single warning email at the end:

if failures.any?
  Report.failure self, "Failed to process #{failures.size} reports", {failures: failures}
end

(In this case, self is the class of the task, e.g. GaoReports.)

Emails will be sent when filing failure or warning reports. You can also store note reports, and all tasks should file a success report at the end if they were successful.
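
The Report.failure example above suggests the calling convention; a success report filed at the end of a task might look like this (the exact signature here is an assumption based on that example):

Report.success self, "Synced #{count} hearings", {count: count}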

The system will automatically file a complete report, with a record of how long a task took - tasks do not need to do this themselves.

Similarly, if an exception is raised during a task, the system will catch it and file (and email) a failure report.

Any task that encounters an error or something worth warning about should file a warning or failure report during operation. After a task completes, the system will examine the reports collection for any "unread" warning or failure reports, send emails for each one, and mark them as "read".

Undocumented features

This API has some endpoints and features that are not included in the public documentation, but are used in Sunlight tools.

Endpoints

  • /regulations - Material published in the Federal Register since 2009. Currently used in Scout.
  • /documents - Reports from the Government Accountability Office and various inspectors general since 2009. Currently used in Scout.
  • /videos - Information on videos from the House and Senate floors, synced through the Granicus API. Currently used in Sunlight's Roku apps.

Citation detection

As bills, regulations, and documents are indexed into the system, they are first run through a citation extractor over HTTP.

Extracted citation data is stored locally, in Mongo, in a citations collection, using the Citation model. Excerpts of the surrounding context are also stored at index time.

The API accepts a citing parameter, of one or more (pipe-delimited) citation IDs, in the format produced by unitedstates/citation. Passing citing adds a filter (to either Mongo or Elasticsearch-based endpoints) of citation_ids__all, which limits results to only documents for which all given citation IDs were detected at index-time.
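
For example, citing=usc/5/552 would translate to roughly this Mongo filter (a sketch of the equivalent query, not code lifted from the API):

{citation_ids: {"$all" => ["usc/5/552"]}}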

If a citing.details parameter is passed with a value of true, then every returned result triggers a quick database lookup for that document's associated citations, and citation details (including the surrounding match context) are added to the document as a citations field.

For example, a search for:

/bills?citing=usc/5/552&citing.details=true&per_page=1&fields=bill_id

might return something like:

{
  "results": [
    {
      "bill_id": "s2141-113",
      "citations": [
        {
          "type": "usc",
          "match": "section 552(b) of title 5",
          "index": 8624,
          "excerpt": "disclosure pursuant to section 1905 of title 18, United States Code, section 552(b) of title 5, United States Code, or section 301(j) of this Act.",
          "usc": {
            "title": "5",
            "section": "552",
            "subsections": [],
            "id": "usc/5/552",
            "section_id": "usc/5/552"
          }
        }
      ]
    }
  ]
}

License

This project is licensed under the GPL v3.

sunlight-congress's People

Contributors

annetheagile, ben-zen, crdunwel, dwillis, jcarbaugh, kaitlin, konklone, lindsayyoung, luigi, mtigas, philosoralphter, plantfansam, rshorey, sbai

sunlight-congress's Issues

Forbidden fields on models

Document and try to enforce them somewhere. Anything used in params, basically: captures, callback, sections, apikey, per_page, page, order, sort. I think that's it.

Fetch some roll call vote data in real time

The idea here is to have a separate task that can run every X minutes. It can create new roll call votes that are missing fields (for example, "required" will be missing, and a related bill might not even exist yet). These fields will be filled in later by the twice-daily roll call vote task (that goes over all THOMAS-provided roll call votes and passage voice votes).

Example of House XML (view source, it's actually XML, and they use Bioguide IDs):
http://clerk.house.gov/evs/2010/roll518.xml

Example of Senate XML (uses some internal ID, will have to parse names out):
http://www.senate.gov/legislative/LIS/roll_call_votes/vote1112/vote_111_2_00229.xml

Sadly, in both cases we'll have to monitor an HTML table to see whether there's new stuff:
http://clerk.house.gov/evs/2010/index.asp
http://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_111_2.htm

And the URLs for both tables depend on the year, Congress number, and session number. Not trivial, but: possible.

Support "not" for fields

For example, to find any vote that was not a roll call:

votes.json?vote_type!=roll

Keys are absolutely unlikely to use exclamation points, though this makes parsing out the conditions a little trickier, of course.

Filter keys with dots don't work

Example:
/votes.json?apikey=sunlight9&per_page=1&vote_breakdown.ayes%3E=200

It breaks upon storing a hit in the analytics db.

Email-time for failure and warning reports should occur post-task

Instead of occurring as a report is filed, in the middle of a task, have tasks file reports marked as unread (as the default value). After the task is done running, go through all unread reports, mark them as read and send emails for any warnings or failures. Surround this in exception handling as well, and file a local report with a special flag set if it fails.

This is good not just so that reports can be filed from other languages and still reported on, but also so that tasks do not potentially hang in the middle of their job while trying to send an email. It's just sensible.

Support greater/less than or equal to

For example, show me all the bills with at least 5 cosponsors:

bills.json?cosponsors_count>=5

<= for less than or equal to.

It's not possible that keys will have > or < in them. Not allowing "less than" or "greater than" without "or equal to" will only be problematic in the case of floating point numbers, of which we don't have any now. If we end up having them in the future, we can invent some special syntax for them (>>= and <<=, perhaps).
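
A rough sketch of the intended mapping onto Mongo operators (illustrative only):

# "cosponsors_count>=5" would become:
{cosponsors_count: {"$gte" => 5}}
# and "<=" would map to "$lte"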

Link votes and amendments

  1. votes_archive should look for references to amendments (do they exist?) and add an amendment_id field and amendment subobject if it's there.
  2. Same for rolls_live_house and rolls_live_senate.

Report analytics nightly

Report nightly, as Drumbone does.

File local reports, and make sure broad exception handling is covered.

Add a Vote model and populate it

Port over the roll call fetching code from Drumbone, into a model named Vote. Add a vote_type field that's either "roll" or whatever it is.

As part of the get_votes task, following roll call loading, iterate through each bill and go through each one's votes array. For any voice votes, create them (and include a "bill" object on them). For any roll call votes, update them with anything worth doing (perhaps nothing, refer to notes).

If the Vote table is empty, fill it from scratch. Otherwise, you can just worry about the roll call votes in the Senate and House whose numbers are higher than the last recorded, since old roll call votes never change. Then, go over bills and add voice votes and link roll call votes as normal.

Once we're pulling in partial roll call vote data in real time, this logic can be updated to be: if the Vote table is empty, fill it from scratch. Otherwise, just fill in the un-filled in roll call votes, then go over bills and add voice votes and link roll call votes as normal.

Publish data in bulk

Have a nightly task dump the tables to compressed JSON at a publicly available address.

Expand videos to include White House videos

Real Time "Congress" be damned:

  • update house_live script to add a "chamber" field with a value of "house"
  • update house_live script to rename "timestamp_id" to "video_id" and prepend "house", e.g. "house-123456789"
  • make two whitehouse_live scripts that pull archival and live videos. Use a "chamber" value of "whitehouse", and a "video_id" value of "whitehouse-" followed by the date and slug, e.g. "whitehouse-2010-11-23-new-start-treaty".

House Whip Notices

Democratic and Republican whip notices for the House, using the code or algorithms in the old RTC API.

Support "in" and "nin" operators

For example, to support queries such as "give me all bills that are actually bills and not resolutions":

bills.json?bill_type__in=hr|hjres|s|sjres

Pipes seem unlikely to occur in filterable fields, and if we found some source data that uses pipes, we could always swap those pipes out for something else before syndicating it.

Use an "in" query though, not an actual "or" query:
http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-%24in

Finally, when "not" is supported, support the idea of "not in" searches, like this one for "anything but simple resolutions":

bills.json?bill_type!=hres|sres

This would map to the "nin" operator in Mongo.
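
A sketch of how the pipe-delimited values would map onto Mongo operators (illustrative only):

# bill_type__in=hr|hjres|s|sjres
{bill_type: {"$in" => ["hr", "hjres", "s", "sjres"]}}

# bill_type!=hres|sres
{bill_type: {"$nin" => ["hres", "sres"]}}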

Record hits in the database for analytics

Be wary, as this caused issues in Drumbone when the volume got too high, but keep something.

Perhaps a task that runs monthly that offloads the month's hits into a dump and removes them from the database.

Support $all operator

So if someone wants a bill cosponsored by a pair of people:
/bills.json?cosponsor_ids__all=1|2

that'd match any bills where the cosponsor_ids array contains both "1" and "2".
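
In Mongo, that would be the $all operator, roughly:

{cosponsor_ids: {"$all" => ["1", "2"]}}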

Add Amendments

Add an "amendments" endpoint, using GovTrack's amendment XML:

Example:
http://www.govtrack.us/data/us/111/bills.amdt/h234.xml

Have a task, amendments_archive, that loads in all amendments to the table, and then goes over each bill (perhaps by Amendment.distinct(:bill_id) or the like) and adds an array of amendments to the bill. Each amendment on a bill should have only the basic fields (everything but the actions).

Include count and page keys for plural endpoints

As a peer to the array key (e.g. "bills"), include: "count", "page", and "per_page". "page" and "per_page" are the (possibly adjusted) pagination params, and "count" is the total number of items for that search.
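
A hypothetical response shape (the values here are made up):

{
  "bills": [ ... ],
  "count": 1234,
  "page": 2,
  "per_page": 20
}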

Link committees and bills

For any bill with an associated committee, see if we can add the committee and its relationship to the bill, as committee_ids and committee subobjects.

If we need to keep the relationship, then it should work like voter_ids and voters do on the vote object - {committee_id: [id], relationship: "..."} and {committee: {obj}, relationship: "..."}.

Investigate Hudson for monitoring

The Open State project uses this, and it would be worth checking out how applicable it is to our own tasks, especially since the most awkward part of supporting multi-language data loaders is the reporting.

Pull in House videos and floor events

Work with Kaitlin to pull in floor events.

I'm not sure yet how to reconcile the floor events from this feed with the floor_events that Josh already picked up in the old RTC.

Directory structure for tasks

Give each task a folder that supports running unit tests (i.e. link to environment.rb correctly), or any other files the task needs.

Have the loader that governs making the rake tasks use the folder names. Have each task load in the [task_name].rb file in the root of the task's folder, and assume that a camelized class name is in there.

Dates should be dates, not timestamps

For most bill dates, the stored value is a full timestamp at midnight UTC, which is incorrect. It should be limited to the date only, with no time component.

Since America is west of UTC, these dates represented in any American timezone would be the day before they actually are, which is a serious inaccuracy.

Pluck out legislator_names and bioguide_ids from clip description

Add a legislator_names array with raw extracted names ("Mr. Price (GA)", "Mr. Stevens", etc.) for each clip, and one aggregated one for the top-level object that has all names mentioned in the clips.

Add a bioguide_ids array with matched bioguide IDs ("L000551", etc.) for each clip, that are determined by the extracted names. Err on the side of including too many bioguide IDs - so if the clip mentions "Mr. Smith" and that matches 3 people, add all 3 of their bioguide IDs to the array, to be safe. As you said, false positives are better than not matching at all. Add an array to the top-level object as well, that has the unique bioguide_ids for all clips.

I'll make sure there's an index on all 4 array fields - "bioguide_ids", "legislator_names", "clips.bioguide_ids", and "clips.legislator_names". Mongo takes care of indexing arrays and fields inside of arrays.

You can scope matching for particular names by chamber, so you only need to look for "Mr. Price" among legislators whose chamber field is "house".

But bear in mind that we can't just match on legislators whose in_office field is true, as legislators may go in and out of office mid-session, and as we transition to the 112th session our database will have multiple sessions.

(It's my hope that eventually our Congress API will evolve to maintain a range of when people were in office, which would help us make more precise choices in our other projects, too.)
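
A rough sketch of the kind of extraction being asked for (the pattern, model, and field names here are illustrative, not from the codebase):

# hypothetical extraction of "Mr. Price (GA)"-style names from a clip description
names = description.scan(/\b(?:Mr|Mrs|Ms|Dr)\. [A-Z][A-Za-z'-]+(?: \([A-Z]{2}\))?/)

# match each raw name against legislators in the relevant chamber, keeping
# every bioguide ID that could plausibly match (false positives are acceptable)
bioguide_ids = names.flat_map { |name|
  last_name = name[/\. (\S+)/, 1]
  Legislator.where(chamber: "house", last_name: last_name).map(&:bioguide_id)
}.uniq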

Put party breakdown inside vote_breakdown

vote_breakdown: {
    total: {ayes: ..., nays: ..., ...},
    party: {R: {ayes: ..., nays: ..., ...}, D: {...}, ...},
}

Leaves room for us to easily expand on any other ways it could be broken down.

Set up cronjobs on staging and backend

For get_legislators, once a day.
For get_bills, twice a day.
For get_rolls, twice a day.
For house_live, every 5 minutes (consult Kaitlin).

Imminent:
For get_amendments, twice a day.
For rolls_live, every 10 minutes.

Down the line:
For floor_updates, every minute.
For various docserver scrapers, consult Josh.
