
house_expenditures

---
title: Read Me
author: Eric and Ryan
date: 1/17/2017
output: html_document
---

Slack: #propublica

Project Description: This ProPublica repository is part of Data for Democracy. Our purpose is to collaboratively work through analytic processes that support the journalism at ProPublica. Currently, contributors have been focused on cleaning the house expenditures dataset. We are always open to ideas for how to work with this dataset to make it more useful to ProPublica. Please contact @ryanes or @eric_bickel on Slack with any suggestions or questions.

Analysis Workflow

Reading, cleaning, and analyzing data should be done in a reproducible notebook format when possible. When submitting pull requests, please submit them from a fork of the repository and on a separate branch. Data for Democracy has an awesome set of instructions for how to do this if you need it.

Organizing Work

If contributors are working on projects other than updating the files in the main directory, they are encouraged to keep their work in a folder that is named in a way that describes the folder's contents. Some examples might be ml_model_R or alternate_cleaning_python. This should make it easier for new contributors to follow what is happening and make judgements about how to organize their contributions.

Loading and Cleaning Datasets

For each analysis, data needs to be loaded and cleaned into a format that is usable both for the current analysis and for future analyses.

After the data has been cleaned, the resulting dataset should be written as a CSV and made available on data.world.
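As a sketch of that load, clean, and write round trip, here is a minimal standard-library Python version; the column names and values are hypothetical, and the real data lives on data.world:

```python
import csv
import io

def clean_rows(rows):
    """Trim whitespace and upper-case free-text fields so later
    joins and group-bys see consistent values."""
    for row in rows:
        yield {k: v.strip().upper() if isinstance(v, str) else v
               for k, v in row.items()}

# Hypothetical in-memory input standing in for the raw CSV file.
raw = io.StringIO("office,amount\n  net expenses of equip ,12.50\n")
reader = csv.DictReader(raw)
cleaned = list(clean_rows(reader))

# Write the cleaned rows back out as CSV (here to a buffer;
# in practice, to a file destined for data.world).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["office", "amount"])
writer.writeheader()
writer.writerows(cleaned)
```

The same shape works with `pandas` or R's `readr`; the point is that cleaning is a repeatable function applied between read and write, not an ad-hoc edit.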

Exploratory Analysis

Team members working in exploratory analysis work up general statistics, distributions of important variables, and hypotheses based on initial exploration of covariation. If this analysis is in a notebook that is different from the cleaning script, there should be documentation of which scripts need to be run in order to reproduce the analysis results.
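A minimal sketch of the kind of general statistics this phase produces, in standard-library Python; the amounts and purposes below are made up for illustration:

```python
from statistics import mean, median
from collections import Counter

# Hypothetical expenditure amounts and purpose labels.
amounts = [120.0, 75.5, 980.0, 75.5, 43.2]
purposes = ["TRAVEL", "SUPPLIES", "TRAVEL", "TRAVEL", "RENT"]

summary = {
    "n": len(amounts),
    "mean": mean(amounts),          # central tendency
    "median": median(amounts),      # robust to the 980.0 outlier
    "top_purpose": Counter(purposes).most_common(1)[0][0],
}
```

In a real notebook these numbers would be computed per office and per quarter, with distributions plotted to surface hypotheses about covariation.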

When an analysis job is complete, a pull request should be opened on the GitHub repo so that it can be reviewed by project collaborators or a committee of assigned editors.

Modeling

Team members use modeling techniques to test the hypotheses generated in the exploratory analysis phase and to quantify relationships between variables in the data. Team members may also be working to test specific hypotheses generated by ProPublica.

Algorithms used in the modeling should be vetted through open discussions with the team and through pull requests, and final model specification should be a collaborative effort using any individual findings from the discussion. The project readme should outline these specifications, and the final modeling code should be pushed to the GitHub repo.

Reporting

Team members detail the findings in a reproducible report that ProPublica can use immediately. All sources and data used should be linked in the report, and the project readme should contain all background on methodology along with links to data and code.

Contributors

dwillis, ehbick01, josiahparry, restrellado, supermdat, vickitran


Issues

Standardize Payee Names (particularly for individuals)

A key goal for us is to identify congressional staffers over time as accurately as we can. Since they often move from office to office, we need a standardized version of their names to help with that. The problem is that there is no real unique ID for them, so we've only got a little bit of context to help us. But it's rare for staffers who work in lawmakers' offices to work for a member of a different party, so in most cases we can assume that the combination of name-office-party would be unique(ish). The date context (each report covers a quarter) also could help with that.

My ideal result from this is a set of canonical (or close to it) names along with offices and dates. You can add party information for records that have a bioguide_id value via the ProPublica Congress API or the United States organization on GitHub.
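One way to sketch that name-office-party key in Python; the normalization rule and field names here are assumptions, not the project's settled scheme:

```python
import re

def canonical_key(name, office, party, quarter):
    """Build a best-effort unique key for a staffer: normalized
    name + office + party, plus the reporting quarter for context.
    Relies on the observation above that staffers rarely work for
    a member of a different party."""
    def norm(s):
        # Upper-case and drop punctuation, keeping letters and spaces.
        return re.sub(r"[^A-Z ]", "", s.upper()).strip()
    return (norm(name), norm(office), party.upper(), quarter)

# Hypothetical record for illustration.
key = canonical_key("Smith, Jane A.", "Office of Rep. Doe", "D", "2016Q4")
```

Records sharing a key across quarters can then be treated as the same staffer, with mismatches flagged for manual review.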

Clean the `purpose` variable

The values in the data appear to be manually entered, and are therefore not standardized. This means that unique entities for the purpose variable are spelled differently and should be collapsed into one value. For example, "CONG AIDE/OUTREACH SERVICES", and "CONGRESS AIDE/OUTREACH SER" are presumably the same. Similarly, "EXECUTIVE ASSISTANT/LEGISLATIV", and "EXECUTIVE ASSISTANT/LEGISLATIV (OTHER COMPENSATION)" may not be exactly the same, but could/should be aggregated.

Issue #29 took steps to clean this variable using the Jaro-Winkler distance from stringdist::stringdist, but some duplicates remain, and additional cleaning would be useful.

Any method (topic modeling, word2vec, etc.) would be acceptable so long as it is accurate and scalable. Because of the large number of unique entries for this variable, condensing the entries into similar categories (e.g., with topic modeling) may be particularly beneficial.
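As an illustrative sketch of collapsing near-duplicate strings in Python, the snippet below does greedy single-pass clustering, with difflib.SequenceMatcher standing in for the Jaro-Winkler distance used in issue #29; the threshold is an assumption to tune against the real data:

```python
from difflib import SequenceMatcher

def collapse(values, threshold=0.8):
    """Map each value to the first previously-seen canonical string
    it resembles closely enough; unmatched values become new
    canonical strings."""
    canon, mapping = [], {}
    for v in values:
        match = next((c for c in canon
                      if SequenceMatcher(None, v, c).ratio() >= threshold),
                     None)
        if match is None:
            canon.append(v)
            match = v
        mapping[v] = match
    return mapping

# The two variants from the example above collapse to one value.
m = collapse(["CONG AIDE/OUTREACH SERVICES", "CONGRESS AIDE/OUTREACH SER"])
```

A single pass like this is O(n × k) in the number of canonical strings, so blocking (e.g., by first letter or by office) would be needed to keep it scalable on the full dataset.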

Inspection and cleaning of "dates" variables

This specific issue is to clean up the "date" variables that are used to describe when an expenditure occurred. This will be used later to ease additional analyses (e.g., even just basic summary statistics).
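A minimal sketch of tolerant date parsing in Python; the format list is a guess at what raw exports contain and should be extended as new variants appear:

```python
from datetime import datetime

# Candidate formats (assumed, not confirmed against the raw files).
FORMATS = ["%m/%d/%Y", "%m/%d/%y", "%Y-%m-%d"]

def parse_date(raw):
    """Try each known format in turn; return None for unparseable
    values so they can be flagged rather than silently dropped."""
    raw = raw.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            pass
    return None

d = parse_date(" 1/17/2017 ")
```

Normalizing everything to ISO dates up front makes the later summary statistics (per quarter, per year) a simple group-by.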

Clean the `office` variable

The values in the data appear to be manually entered, and are therefore not standardized. This means that unique entities for the office variable are spelled differently and should be collapsed into one value. For example, "NET EXPENSES OF EQUIP", and "NET EXPENSES OF EQUIPMENT" are presumably the same. Similarly, "HOUSE CHILD CARE GENERAL FUND", "HOUSE CHILD CARE CENTER", and "CHILD CARE CTR" may not be exactly the same, but could/should be aggregated.

Issue #29 took steps to clean this variable using the Jaro-Winkler distance from stringdist::stringdist, but some duplicates remain, and additional cleaning would be useful.

Any method (topic modeling, word2vec, etc.) would be acceptable so long as it is accurate and scalable.
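For variants that string distance misses, a hand-curated lookup is a simple complement; the mappings below are illustrative, taken from the examples above:

```python
# Hand-curated canonical names for office variants that fuzzy
# matching cannot safely merge on its own (illustrative entries).
CANONICAL_OFFICE = {
    "NET EXPENSES OF EQUIP": "NET EXPENSES OF EQUIPMENT",
    "HOUSE CHILD CARE GENERAL FUND": "HOUSE CHILD CARE CENTER",
    "CHILD CARE CTR": "HOUSE CHILD CARE CENTER",
}

def canonical_office(value):
    """Normalize, then fall back to the raw (normalized) value when
    no mapping exists, so the cleaning step never loses data."""
    v = value.strip().upper()
    return CANONICAL_OFFICE.get(v, v)

office = canonical_office("net expenses of equip")
```

Keeping the lookup table in version control lets reviewers audit every merge decision through the normal pull-request workflow described above.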
