saberva's Introduction

Saberva

A free, open source tool to parse Virginia State Board of Elections campaign filings.

The name, "Saberva," combines "State Board of Elections" and "Virginia" in a single word, and is rooted in the Spanish "saber," meaning "to know."

Usage

Run at the command line: php saberva.php. It will retrieve a list of every campaign committee that has filed a report since January 1, 2012, and then retrieve a list of every report filed by each committee. The result is a large JSON file (several megabytes). It then iterates through that list of committees, creates a JSON file for each committee, retrieves every cited report, converts that report to JSON, and stores each report as a JSON file. (This produces thousands of JSON files.) Finally, it creates CSV versions of each JSON file, as well as contributions.csv and expenses.csv files that contain all contributions to and expenditures by all committees. Optionally, it will atomize expense and contribution reports down to individual records, storing each contribution and expense as its own JSON file. (This produces hundreds of thousands of JSON files.)

Because the master committee list is demanding to assemble, committees.json will not be refreshed unless (a) 18 hours have elapsed since it was last built, (b) --reload is passed as a command-line argument (e.g., php saberva.php --reload), or (c) committees.json does not exist.
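The three refresh conditions can be sketched as a single check. This is only an illustration, not Saberva's actual code: the function name, its parameters, and the way the 18-hour limit is passed in are all assumptions.

```php
<?php
// Hypothetical helper (not from Saberva): decide whether committees.json
// should be rebuilt. $maxAgeHours defaults to the 18 hours described above.
function committees_need_rebuild(string $path, bool $forceReload, int $now, int $maxAgeHours = 18): bool
{
    if ($forceReload) {
        return true;                    // (b) --reload was passed
    }
    clearstatcache(true, $path);        // avoid stale file metadata
    if (!file_exists($path)) {
        return true;                    // (c) there is no cached file yet
    }
    $ageSeconds = $now - filemtime($path);
    return $ageSeconds > $maxAgeHours * 3600;   // (a) the cache is too old
}
```

A caller would pass time() as $now; taking it as a parameter just makes the check easy to exercise with fixed timestamps.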

Options

Customizations can be made in config.inc.php.

--reload / -r: Force committees.json—the master committees list—to be rebuilt from the SBE's website, even if it is less than 18 hours old.

--from-cache / -c: Use the cached version of committees.json, no matter how old it is.

--atomize / -a: Create individual JSON files for every contribution and expense.

--verbose / -v: Display additional progress information.

--progress-meter / -p: Display a progress meter as committees.json is built.

--help / -h: Displays a list of parameters and usage examples.

Each switch must be provided individually (e.g., php saberva.php -c -p), rather than grouped (e.g., php saberva.php -cp).
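One way to get this behavior is to compare each argument against a fixed table of switches, so that a grouped argument such as -cp simply matches nothing. The sketch below assumes that approach; the function name and the option keys it returns are illustrative and may not match how saberva.php reads its arguments.

```php
<?php
// Hypothetical parser: each argument must match one documented switch exactly.
function parse_switches(array $argv): array
{
    // Long and short forms from the list above, mapped to illustrative keys.
    $map = [
        '--reload' => 'reload',                 '-r' => 'reload',
        '--from-cache' => 'from_cache',         '-c' => 'from_cache',
        '--atomize' => 'atomize',               '-a' => 'atomize',
        '--verbose' => 'verbose',               '-v' => 'verbose',
        '--progress-meter' => 'progress_meter', '-p' => 'progress_meter',
        '--help' => 'help',                     '-h' => 'help',
    ];
    $options = array_fill_keys(array_unique(array_values($map)), false);
    foreach (array_slice($argv, 1) as $arg) {   // skip the script name
        if (isset($map[$arg])) {
            $options[$map[$arg]] = true;
        }
        // Anything else, including a grouped "-cp", is ignored.
    }
    return $options;
}
```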

Resulting files

  • committees.json
  • committees.csv
  • committees/*.json
  • committees/*.csv
  • report/*.json
  • expenses.csv
  • expenses/*.json
  • expenses/*.csv
  • contributions/*.csv
  • contributions/*.json
  • contributions.csv

Data source update schedule

All of the information is pulled from the Virginia State Board of Elections’ Campaign Finance Reports site. Their site's data is updated once daily, at 5 PM EST. Although amendments can be filed on any day of the month, major changes occur as per the filing schedule (e.g., the 2013 candidate committees). There is no benefit to running this more than once per day, and for most purposes, it will only need to be run a few times a year (e.g., July 15 and January 15, for elected officials not on the ballot that November).

LICENSE

Released under the MIT License.

saberva's People

Contributors

waldoj

saberva's Issues

Normalize addresses

Address data does not appear to be normalized. For instance, Virginia Engineers PAC's 09/07/2012 contribution says that their primary place of business is "Richmon, VA." (To be fair, that's not address data.) It appears that the software used by many committees is providing normalization, but they normalize differently. For instance, some software normalizes on long street suffixes ("Court," "Boulevard," "Road," etc.), while some software normalizes on short street suffixes ("Ct.," "Blvd.," "Rd.," etc.). So the good news is that reports are often internally consistent, which should make it easy to bring all of the reports into collective consistency.

Implement an address normalization system (presumably the USPS's API) to deal with this problem.

The only question is at what point this should be done. Is it appropriate to do this prior to saving the data and generating the JSON? Or is it wrong to alter the SBE's data? Wouldn't this mean making tens of thousands of API calls every time that the parser is run?

This might be an argument for standardizing addresses via a cruder, local function at the time of input, and saving the USPS API calls to be used beyond the Saberva pipeline.
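A cruder, local function of that kind might look like the following sketch. Everything in it is an assumption made for illustration: the function name, the small suffix table (a sample, not the USPS's list), and the choice to normalize toward the short, uppercase forms.

```php
<?php
// Hypothetical local normalizer: uppercases an address line, collapses
// whitespace, and rewrites a trailing street suffix to a short form.
function normalize_street_suffix(string $line): string
{
    // Illustrative sample only; a real table would be much longer.
    $suffixes = [
        'COURT' => 'CT', 'BOULEVARD' => 'BLVD', 'ROAD' => 'RD',
        'STREET' => 'ST', 'AVENUE' => 'AVE', 'DRIVE' => 'DR',
    ];
    $clean = strtoupper(trim(preg_replace('/\s+/', ' ', $line)));
    $words = explode(' ', $clean);
    // Only the last word is treated as a suffix; drop a trailing period.
    $last = rtrim(array_pop($words), '.');
    $words[] = $suffixes[$last] ?? $last;
    return implode(' ', $words);
}
```

Because only the final word is examined, lines that end in a unit number (for example, an apartment or suite) pass through with just the case and whitespace cleanup.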

Create a cron job

When this is used on opeva.com, create a cron job to run it nightly, after the 5 PM SBE file publication time.

Verify terminology

Hardcoding terms like "expenses" and "contributions" demands that the terminology's accuracy be compared to SBE practice. Do so, and correct anything that's out of line.

Recess single-element LiA and LiD elements

We've almost got this problem fixed. The remaining problem is that this code:

$tmp = $report->ScheduleA->LiA;
$report->ScheduleA->LiA = array();
$report->ScheduleA->LiA[] = $tmp;

generates this error:

It is not yet possible to assign complex types to properties

I'm pretty confident that this is a SimpleXML error, since $report is a SimpleXML object. The solution is to stop trying to just append to that object, and instead use SimpleXML's own functionality (presumably addChild) to make this change to the data.
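As an alternative to addChild, the SimpleXML object can be left untouched and the entries copied into a plain PHP array, since foreach over a child name works the same whether there is one element or many. The sketch below takes that different route; the function name is hypothetical, and the JSON round trip is just one way to turn each element into an array.

```php
<?php
// Hypothetical helper: return all children named $tag (e.g. "LiA") as a
// list of plain arrays, so one entry and many entries have the same shape.
function schedule_items(SimpleXMLElement $schedule, string $tag): array
{
    $items = [];
    foreach ($schedule->{$tag} as $item) {
        // Round-trip through JSON to convert the element to a plain array.
        $items[] = json_decode(json_encode($item), true);
    }
    return $items;
}
```

The result could then be placed under the LiA key of whatever array is eventually passed to json_encode, instead of being assigned back onto $report.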

Add an XML parser

Saberva does not yet do anything with the campaign finance XML files. It should:

[x] Download all of these XML files
[x] Turn them into JSON
[ ] Load them into a database

Stop the blank Schedule A entries

e.g.:

ScheduleA: {
    LiA: [
        {
            Contributor: {
                @attributes: {
                    IsIndividual: "true"
                },
                FirstName: "John",
                LastName: "Whitbeck",
                Address: {
                    Line1: "116E EDWARDS FERRY RD",
                    City: "Leesburg",
                    State: "VA",
                    ZipCode: "20176"
                },
                NameOfEmployer: "Whitbeck Cisneros",
                OccupationOrTypeOfBusiness: "Attorney",
                PrimaryCityAndStateOfEmploymentOrBusiness: "LEESBURG VA"
            },
            TransactionDate: "2014-03-06",
            Amount: "1000.00",
            TotalToDate: "1000.00"
        },
        {
            Contributor: {
                Address: {
                    Line1: { }
                }
            }
        }
    ]
},

Figure out why this is happening and make it stop—it's creating a series of messes.

Handle timeouts more gracefully

This just happened:

Honest Reform PAC reports retrieved.
House Democratic Caucus reports retrieved.
House Republican Campaign Committee reports retrieved.
Operation timed out after 5001 milliseconds with 0 bytes received
830 committees
10th District Republican Congressional Committee: PP-12-00458 saved to committees/PP-12-00458.json
11th Congressional District Democratic Committee: PP-12-00366 saved to committees/PP-12-00366.json

It looks like the timeout caused the parser to abandon the entire process of fetching the list of reports and move on to fetching individual reports. As a result, it stopped at "House Republican Campaign Committee," when there were hundreds more reports remaining (they're retrieved alphabetically).

I suspect that we need to have cURL retry after a failure or, if not that, just move on to the next committee.
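A retry-then-skip approach could be sketched as below. The function names, the attempt count, and the delay are assumptions; the 5-second timeout mirrors the one in the log above. The fetcher is passed in as a callable so the retry logic does not depend on cURL directly.

```php
<?php
// Hypothetical wrapper: call $fetch up to $maxAttempts times.
// $fetch must return the response body as a string, or false on failure.
function fetch_with_retry(callable $fetch, string $url, int $maxAttempts = 3, int $delaySeconds = 0)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $body = $fetch($url);
        if ($body !== false) {
            return $body;
        }
        if ($attempt < $maxAttempts && $delaySeconds > 0) {
            sleep($delaySeconds);   // brief pause before the next attempt
        }
    }
    return false;   // give up on this URL; the caller can move on
}

// One possible cURL-backed fetcher to pass in as $fetch.
function curl_fetch(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body as a string
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);            // seconds, as in the log above
    return curl_exec($ch);                           // string on success, false on failure
}
```

In the committee loop, a false result from fetch_with_retry('curl_fetch', $url) would be logged and followed by continue, so one timeout costs a single committee rather than the rest of the list.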

Deal with empty strings being saved as objects

Sometimes we're having empty strings saved as objects, which wreaks havoc downstream. For instance, this:

{
    "Contributor": {
        "@attributes": {
            "IsIndividual": "false"
        },
        "LastName": "McGuire Woods PAC",
        "Address": {
            "Line1": "901 E CARY ST",
            "Line2": {},
            "City": "Richmond",
            "State": "VA",
            "ZipCode": "23219"
        },
        "OccupationOrTypeOfBusiness": "Political Action Committee",
        "PrimaryCityAndStateOfEmploymentOrBusiness": "RICHMOND VA"
    },
    "TransactionDate": "2013-11-03",
    "Amount": "250.00",
    "TotalToDate": "448.80"
}

That "Line2": {},? It's big trouble. Line2 should be empty, but instead it's an object with nothing in it. Downstream, we expect that to be a string. Casting it as a string yields a fatal error:

PHP Catchable fatal error:  Object of class stdClass could not be converted to string

We need to prevent these bad data from being encoded in the first place. That means modifying class.Saberva.inc.php to check the output and make sure that we don't have any objects where we shouldn't. If we do, then we should replace them with empty strings.
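The check described above could be a recursive pass over the decoded report before it is written out. This is a minimal sketch under that assumption; the function name is hypothetical and it is not taken from class.Saberva.inc.php. It deliberately leaves empty arrays alone and only replaces empty objects.

```php
<?php
// Hypothetical cleanup: walk decoded JSON (objects and arrays) and replace
// every empty object, such as "Line2": {}, with an empty string.
function blank_empty_objects($value)
{
    if (is_object($value)) {
        $props = get_object_vars($value);
        if (count($props) === 0) {
            return '';                           // {} becomes ""
        }
        foreach ($props as $key => $child) {
            $value->{$key} = blank_empty_objects($child);
        }
        return $value;
    }
    if (is_array($value)) {
        foreach ($value as $i => $child) {
            $value[$i] = blank_empty_objects($child);
        }
    }
    return $value;                               // scalars pass through unchanged
}
```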

Embed Open VA API links in JSON

Right now the JSON is a straight translation of the SBE XML. But it would be useful to provide the URL for the detailed reports etc. within that JSON. Figure out where to best add that data, and start including it.

Produce a contributions JSON file

Iterate through every contribution and build up a huge JSON file that lists every contributor, the amount contributed, and the campaign to which it was contributed.
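That iteration might be sketched as follows. The input shape (reports decoded to arrays, keyed by committee name, with entries under ScheduleA and LiA) follows the sample JSON elsewhere in these issues; the function name and the output field names are assumptions.

```php
<?php
// Hypothetical aggregator: flatten Schedule A entries from many reports
// into one list of contributor, amount, and receiving committee.
function collect_contributions(array $reports): array
{
    $all = [];
    foreach ($reports as $committee => $report) {
        foreach ($report['ScheduleA']['LiA'] ?? [] as $entry) {
            $c = $entry['Contributor'] ?? [];
            $all[] = [
                // Organizations have only a LastName, so trim the join.
                'contributor' => trim(($c['FirstName'] ?? '') . ' ' . ($c['LastName'] ?? '')),
                'amount'      => $entry['Amount'] ?? '',
                'committee'   => $committee,
            ];
        }
    }
    return $all;
}
```

Passing the result to json_encode would produce the single large file; the sketch assumes names are already plain strings rather than empty objects.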

(This is the first step towards producing individual donor records.)

Add another progress graph for further iteration

The only progress graph available right now is for parsing through the master committees XML file. From there on out, a verbose progress log is displayed. Duplicate the graphing functionality for additional loops.

Add a help parameter

Appending -h or --help when running the program should display a list of options.

Overhaul and document the reload option

It's non-obvious that the reload option is required. Do five things:

  • infer reload if the cache is more than X hours old
  • make the max cache age a configuration option
  • change parameter format from reload to --reload
  • add a --from-cache option that will ignore the cache age
  • document this feature

Header rows appearing repeatedly in contributions CSV

We've got header rows appearing over and over in contributions.csv, like so:

address_1,address_2,address_city,address_state,address_zip
address_1,address_2,address_city,address_state,address_zip
address_1,address_2,address_city,address_state,address_zip
address_1,address_2,address_city,address_state,address_zip

Figure out why this is happening and stop it.
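The cause isn't identified here. One common way to end up with repeated headers is writing the header each time a batch of rows (say, one committee's contributions) is appended to a shared file. Assuming that is what is happening, a guard like the one below would fix it; the function name and the by-reference flag are illustrative.

```php
<?php
// Hypothetical writer: append rows to an open CSV handle, emitting the
// header row only the first time. $headerWritten is passed by reference
// so the flag survives across calls (e.g. one call per committee).
function append_csv_rows($handle, array $rows, bool &$headerWritten): void
{
    foreach ($rows as $row) {
        if (!$headerWritten) {
            // The header comes from the keys of the first row seen.
            fputcsv($handle, array_keys($row), ',', '"', '\\');
            $headerWritten = true;
        }
        fputcsv($handle, array_values($row), ',', '"', '\\');
    }
}
```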

Create a record for every donor

On the one hand, I want to do this as a series of JSON files. On the other hand, I'm not sure that's the right approach.

I suspect that this is the line at which it starts to make sense to use a database. We don't have unique identifiers for each donor, but we'll need them. That means that each time that the parser is run, we'll be generating new random identifiers. (Or we make them something weird, like MD5s of names & addresses, which is bad for a bunch of reasons.)
