openelections / openelections-core

Core repo for election results data acquisition, transformation and output.

License: MIT License

Python 100.00%

openelections elections

openelections-core's People

Contributors

datamade ghing imclab myersjustinc suymilk jeremyjbowers patsweet redgemstate jtbates poochmonster rabidaudio jonnyd55 bycoffe dmil samgassel evz jacobmiller cathydeng mcdisc bsmithgall joestemarie chrislkeller aratkiewicz paleomedia nbdavies acouch gvsurenderreddy spulec lexieheinle pohalloran cjwinchester divergentdave redoakmark lisalarson wanderingstar jslap nealmcb smarthi mattyhead mih4ajlo jamesclarence colcarroll jmcarp stephenfeagin ghostaseem jsasplun sb003a jatkins23 ohioopenvoterdata wismer daverosoff izenmania mgerring wtadler jnelson16 jamesdunham kjb290 pixelactions kstohr dtschirmer jdiglesias zacman85 clarkecampbellevans milandv vmnet04 americavotesdata beastneedsmoretorque lennybronner sanmakaren maggiexwang kvanallen mbailey256 tgroth58 francoishuet duncan-spencer nealzonwheelz psa mohit-iitb capesepias jchernof17 lmorinishi omarcoming warwickmm ersawin mwaiton cyberflamego kyle-gehring micheas-dy epsian dwashing primetarget kblack2990 ktds78

openelections-core's Issues

When should PDF conversion/extraction occur?

For results files that are electronic PDFs (or even for those which are not), when do we want to convert/extract the results? Before the loader is run, or during that process? Seems to me that we might want the data portion done before the loader is run, meaning that generated_filenames would not be PDFs, even if the source files were. Thoughts?

Results file scraper for Ohio

Initially grab all the links and filter them.

Add data loader invoke task

Create invoke task to drive state LoadResults classes.

How to present party field in baked output?

In the baker (see #39) output:

where should the party field described in the Result spec come from? I'm assuming this is Candidate.parties rather than Contest.party.
Candidate.parties is a list. What's the motivation for making this a list rather than a single value? The list serializes fine in CSV, at least for MD, but should we compress this to a string for output? If so, how should we handle multiple values in this string?

@zstumgoren, @dwillis do you have an answer to this?

Set mongo connection in settings

Add MONGO settings to settings template
Instantiate connection in settings.py and import in models.py

Create datasource for MD

Encapsulate process of building source data URLs and standardizing results filenames (now in fetcher.py) in a new datasource.py module. This Datasource class should provide a simple public interface for dynamic querying by downstream fetcher and loader.

Add bakery to invoke framework

Write tests using FactoryBoy
Create base/bake.py module with Baker class
- Executes Mongo query constructed from bake invoke task's CLI params
- Preloads Election and Candidate records, by key, into memory
- Writes batched data (default batch of 5000) as stream to a target file
Add to_csv and to_json methods to Election, Candidate, Result models. Do not serialize references by default (include_refs=False).
Create bake.state_file invoke task
- Accepts filters for customizing result output
  - state (required) - postal code
  - format (default: CSV) - format for output
  - outputdir (default: openelections/us/bakery)
  - date [YYYY|YYYY-MM-DD] (required to avoid huge file sizes?)
  - type [general|primary|special|etc] - use choices list from models.py
  - office - most useful after data standardized
  - district - most useful after data standardized
  - party - most useful after data standardized
  - reporting level [state|county|precinct| etc.] - use choices list from models.py
- Sets sensible defaults. For example, output all state/contest-wide results for all races when no filters are applied.
- Writes two files to openelections/us/bakery folder:
  - ._ - the baked results
  - manifest.txt - query parameters used to produce the result
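
The batched streaming write described above could be sketched roughly as follows. This is only an illustration and not the project's Baker class: `batched` and `write_csv_stream` are hypothetical helpers, and the rows are plain dicts standing in for Mongo query results.

```python
import csv
import io
from itertools import islice


def batched(rows, size=5000):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def write_csv_stream(rows, fields, stream, batch_size=5000):
    """Write dict rows to `stream` one batch at a time; return the row count."""
    writer = csv.DictWriter(stream, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    total = 0
    for batch in batched(rows, batch_size):
        writer.writerows(batch)
        total += len(batch)
    return total


# Hypothetical rows standing in for a Mongo cursor.
rows = ({"state": "md", "votes": n} for n in range(12))
buffer = io.StringIO()
count = write_csv_stream(rows, ["state", "votes"], buffer, batch_size=5)
```

Because only one batch is held in memory at a time, the same loop should work for a generator that wraps a database cursor.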

Add validations for MD

Write validation tests for MD results to ensure various result file types were loaded correctly.

Examples:

count of result records by candidate and file type (e.g. all target candidates in 2012 general state leg file should have 5504 results - 64 leg districts x 86 candidate types, including other write-ins)
Tally results for candidates at sub-racewide reporting levels and compare to known totals
compare number of candidates and contests to expected numbers
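
The record-count check in the first example might look something like the sketch below. The figures 64 and 86 come from the example above; the function names and the `candidate` key are assumptions made for illustration, not part of the existing validation code.

```python
from collections import Counter

# Figures from the example above: 64 legislative districts x 86 candidate types.
DISTRICTS = 64
CANDIDATE_TYPES = 86
EXPECTED_RESULTS = DISTRICTS * CANDIDATE_TYPES  # 5504


def check_result_count(results, expected):
    """Return (ok, actual) comparing the number of result records to `expected`."""
    actual = len(results)
    return actual == expected, actual


def results_per_candidate(results):
    """Count result records per candidate; `candidate` is a hypothetical key."""
    return Counter(result["candidate"] for result in results)
```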

Document file name standardization conventions

See previous discussions here:

#29
#4

Add cache.diff task

Create cache.diff invoke task that shows diff between locally cached files and expected files, based on comparison of files in state cache dir with expected files generated from datasource.mappings.
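
One way to picture the comparison is as a pair of set differences. The sketch below is an assumption about how such a task could work, not the actual cache.diff implementation; `cached_file_names` and `diff_names` are hypothetical helpers, and the expected names would come from datasource.mappings.

```python
from pathlib import Path


def cached_file_names(cache_dir):
    """Return the names of regular files found directly in `cache_dir`."""
    return {path.name for path in Path(cache_dir).iterdir() if path.is_file()}


def diff_names(cached, expected):
    """Compare cached file names with the names the datasource expects."""
    cached, expected = set(cached), set(expected)
    return {
        "missing": sorted(expected - cached),      # expected but not cached
        "unexpected": sorted(cached - expected),   # cached but not expected
    }
```

An empty result for both keys would mean the cache matches the mappings.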

Add cache.diff alert to load.run invoke task

Update load.run invoke task to:

exit with alert if there's a difference between cached files and those expected (based on cache.diff; see #17).
force loading despite cache.diff using -f/--force flag

Two elections on same day or one election?

We are looking at an election board that seems to think of a Republican Primary as a different election than a Democratic Primary even though they are held on the same day.

What's the thinking in this project?

Accessing undeclared variables or not using variables in us.md.load

load_county_2002()

mapping variable is referenced but not declared
candidate variable is referenced but not declared
write_in variable is referenced but not declared

load_2002_file()

A result object is instantiated at the end, but it looks like it's never appended to the results list that gets passed to Result.objects.insert

load_2000_primary_file()

winner, cand_kwargs variables never declared

OCD Division mappings

Should the dashboard have an API response containing OCD mappings for a state? Should we be able to add our own via the admin?

Should Candidate have a state field?

I know they'll have uuids, but wondering if they should also include the state to differentiate them that way.

Add developer bootstrap instructions to README

Is _races_by_type being redundant?

This line seems to be adding a slug attribute for an election by constructing it. Why do this when the API response has the slug already built as the id attribute?

Generated file names for multiple offices at the same level

Ohio has some instances in which multiple files cover the same "office" label we use - for example, they have separate HTML pages for state senators and state reps, as well as for state officer posts (see primary results here: http://www.sos.state.oh.us/sos/elections/Research/electResultsMain/2008ElectionResults.aspx, for example). We haven't really decided on the naming conventions for those. Do we want to have something like:

20080304__oh__democratic__primary__state_leg_1.html

20080304__oh__democratic__primary__state_senate.html
20080304__oh__democratic__primary__state_house.html

Migrate name parsing to transform module

Migrate name parsing bits in base.load.BaseLoader to MD-specific md/transform/names.py and/or base/transforms/names.py

Add tasks to generate and archive manifest file

git rm filenames.json from md/mappings/filenames.json
Add invoke datasource.create_manifest to generate manifest.csv in state/mappings dir
Add invoke cache.save_manifest to save manifest.csv (formerly filenames.json) to S3

Any way to account for Ballot initiatives?

I'm noticing that your result model assumes a candidate. Is there any vision to include ballot initiatives in this schema? If so, where/how does that fit?

Implement name conversion strategy for raw results files

Create a module to standardize names of raw result files. Raw results will be stored on S3 using the standardized name.

Standardized names should:

be resolvable back to raw file names
encapsulate enough information about the contained results to link up to metadata via API

Naming Convention

See #4 for details on naming convention

Standardization should generate a composite file name that reflects metadata captured in our data admin.

File name components should include:

election date - YYYYMMDD
state - postal code
race type - general, primary-dem, primary runoff-dem, etc.
jurisdiction - OCD id of the jurisdiction, or geographic area, for which results are provided. For example, a file for MD that contains precinct-level results for Anne Arundel County could use a slugified version of the plain OCD name
race_code that denotes types of races covered in the data file. Optional element that should only be used when state provides data for single race in distinct file. For example, Louisiana provides precinct-level results, by parish, for each race. This field could also be expanded, on a state-by-state basis, to handle arbitrary groupings of results (e.g. separate files for state leg., federal, local).
reporting level - precinct, city, county, state, etc.
file type extension - db, csv, html, json, xml, etc.

Format

File name components separated by double underscores; component sub-parts separated by single underscores.

<YYYYMMDD>__<state>__<race_type>__<jurisdiction>__[<race_code>__]<level>.<ext>

Examples

Louisiana Congressional District 1 precinct level results, Jefferson Davis Parish
20121106__la__general__jefferson_davis_parish__cd_1__precinct.html

Allegany County precinct results for general election (contains multiple race types)
20121106__md__general__allegany_county__precinct.csv
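
The format above could be built and parsed with a pair of small helpers. This is a sketch under the stated convention only; `slugify`, `standardized_filename`, and `parse_filename` are hypothetical names, not functions from the codebase.

```python
import re


def slugify(text):
    """Lowercase `text` and join its alphanumeric runs with single underscores."""
    return "_".join(re.findall(r"[a-z0-9]+", text.lower()))


def standardized_filename(date, state, race_type, jurisdiction, level, ext,
                          race_code=None):
    """Join components with double underscores; race_code is optional."""
    parts = [date, state.lower(), slugify(race_type), slugify(jurisdiction)]
    if race_code:
        parts.append(slugify(race_code))
    parts.append(slugify(level))
    return "__".join(parts) + "." + ext


def parse_filename(name):
    """Split a standardized name back into its components."""
    stem, ext = name.rsplit(".", 1)
    parts = stem.split("__")
    if len(parts) not in (5, 6):
        raise ValueError("unexpected number of components in %r" % name)
    return {
        "date": parts[0],
        "state": parts[1],
        "race_type": parts[2],
        "jurisdiction": parts[3],
        "race_code": parts[4] if len(parts) == 6 else None,
        "level": parts[-1],
        "ext": ext,
    }
```

Keeping single underscores inside components and double underscores between them is what makes the name splittable again without a lookup table.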

Implementation

Standardized name should be generated during file download process (in state-specific fetch.py modules).

Each state directory should have a 2-column mappings.txt file that contains standardized name and link to raw result file. The raw link should point to result file located at source agency or to copy of raw file archived on S3. The latter would be used in cases where result files are not scrapable (e.g. if agency provided a database dump).

## mappings.txt ##
standard_name, raw_source_name
20121106__md__general__anne_arundel_county__precinct.csv, http://www.elections.state.md.us/elections/2012/election_data/Anne_Arundel_By_Precinct_2012_General.csv
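
Reading such a file back into a lookup could be done with the standard csv module. The sketch below assumes the layout shown above (an optional `##` banner line, a header row, then comma-separated pairs); `read_mappings` is a hypothetical helper.

```python
import csv
import io


def read_mappings(stream):
    """Return {standard_name: raw_source} from a two-column mappings file."""
    mappings = {}
    for row in csv.reader(stream, skipinitialspace=True):
        # Skip blank lines, banner/comment lines, and the header row.
        if not row or row[0].startswith("#") or row[0] == "standard_name":
            continue
        if len(row) != 2:
            raise ValueError("expected 2 columns, got %d: %r" % (len(row), row))
        mappings[row[0].strip()] = row[1].strip()
    return mappings


sample = io.StringIO(
    "## mappings.txt ##\n"
    "standard_name, raw_source_name\n"
    "20121106__md__general__anne_arundel_county__precinct.csv, "
    "http://www.elections.state.md.us/elections/2012/election_data/"
    "Anne_Arundel_By_Precinct_2012_General.csv\n"
)
mappings = read_mappings(sample)
```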

How will scrapers pull in the dashboard election slug?

Normalized Candidates

@fgregg and I have been exploring implementing the openelections data structure for our local elections in Chicago and we ran across an issue today which I'm wondering if you might consider implementing in a slightly different way.

Since a Candidate is stored as an EmbeddedDocument within each Result (which is itself an EmbeddedDocument within a Contest), the process of updating an individual Candidate can be somewhat of a bear, especially for a candidate who has been running in elections for as long as we have data (and since our data is at the precinct level).

The main reason this comes up is because we're storing information about local aldermen in a pupa instance which is giving us ocd_person ids for them. We'd like to be able to cross reference that info with the info about the elections that they've run in that we're storing in this app, and the only way we have to do that is to manually add the ocd_person id into this app. The manual part of this we were expecting and can handle, but I'm wondering if you might consider storing the candidates as a separate Document the way that you're storing the Office for a given result. This would certainly make the process of getting at the information about candidates a whole heck of a lot easier.

Create RawResult documents in load step, create Result, Contest and Candidate documents in transform step

Tasks

Update models to reflect changes (see code and questions in models.py on rawresults branch)
Update MD loader to use RawResult model
Migrate logic for creating unique Contest and Candidate entries from MD loader to transforms
Update Contributor docs to reflect the new workflow

Background

This comes from a discussion in #46 where @zstumgoren said:

But I'm starting to wonder if the creation of unique Candidate and Contest instances
should be treated as a transform step. Our initial goal with the data load step should
simply be getting the data loaded into Mongo in its raw form. @dwillis and I agreed to
this approach a while back, and have gradually migrated transforms and various
cleanups from the load step to the transformation step.

Enforcing uniqueness of contests and candidates in the load step adds a great deal
of complexity to this phase of the pipeline, and it feels like we're blending concerns a
bit. Unless @dwillis has strong feelings against, I'd be favorable to shifting our
approach. I don't think it would take a great deal of reworking of the models or
loader/bakery. In fact, it would greatly simplify the loader and possibly v1 of bakery.

Here's one possible strategy:

Create a RawResult model that lets us load a flat model of all raw data (this would be
our current Result model, plus contest and candidate fields currently normalized to
their own models)
Generate unique Contest and Candidate instances and "clean" Result documents as
subsequent transform steps
In this new model, Result documents would store cleaned or processed Result data
migrated from RawResult, or generated subsequently from lower-level results (e.g.
race-wide results rolled up from precinct-level results). In general, these collections
would store transformed, normalized versions of our raw data.
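
To make the proposed split concrete, the transform step could derive unique contests and candidates from flat raw records along these lines. This is a sketch with plain dicts rather than the actual models, and every field name in it (`election_id`, `office`, `district`, `full_name`, `jurisdiction`, `votes`) is an assumption made for illustration.

```python
def split_raw_results(raw_results):
    """Derive unique contests, unique candidates, and cleaned results.

    `raw_results` is an iterable of flat dicts, one per raw result row.
    """
    contests, candidates, results = {}, {}, []
    for raw in raw_results:
        contest_key = (raw["election_id"], raw["office"], raw.get("district"))
        candidate_key = contest_key + (raw["full_name"],)
        # setdefault keeps the first record seen for each key, so duplicates
        # across precinct-level rows collapse into one contest/candidate.
        contests.setdefault(contest_key, {
            "election_id": raw["election_id"],
            "office": raw["office"],
            "district": raw.get("district"),
        })
        candidates.setdefault(candidate_key, {
            "full_name": raw["full_name"],
            "contest": contest_key,
        })
        results.append({
            "contest": contest_key,
            "candidate": candidate_key,
            "jurisdiction": raw["jurisdiction"],
            "votes": int(raw["votes"]),
        })
    return contests, candidates, results
```

With this shape the load step never needs to enforce uniqueness; it only writes raw rows, and deduplication happens once, in the transform.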

Fix MD 2000-04 file names

Update loaders to reflect party field on Result

Update loaders to reflect party field added to the Result model and removed from the Candidate model for #46.

Add include/exclude filter to validate task

Add include/exclude filters to validate task. API should match transform task.

Coding style guide

WIP draft of this can be found in Code Style Guide in our shared GDocs directory.

Add tests for core framework

tests for openelex.base modules/classes
tests for invoke tasks

Devise naming convention for raw results files

We need a consistent file naming convention for raw result files. This file name would be applied during the initial download of the file (in the fetch class), and would be the name of the file archived on S3. It should provide enough information about the source file to link up to our metadata API.

Strategies

Metadata ID

Resolve a canonical ID using metadata API.

Pros

Minimizes how gnarly the file name gets.
Provides an early tie-in to our metadata that could be used in downstream parsing and loading processes.

Cons

May not be as intuitive and reversible as a plain-language ID.
Tightly couples our scraping process to an external API.
Does not account for local data and referendums, since these source types are not reflected in metadata. Could devise a secondary convention for non-target file types.

Composite name

Generate composite file names that reflect metadata captured in our data admin.

File name components could include:

election date
state
race type (general, primary-dem, primary runoff-dem, etc.)
OCD id for jurisdictional boundary of the data. This could be the OCD id of the geographic area for which results are provided. For example, a file for MD that contains precinct-level results for Anne Arundel County could use a slugified version of the plain OCD name
race_code that denotes types of races covered in the data file. Could be an optional element only used when state slices up data for a single election date into multiple files. For example, Louisiana provides precinct-level results, by parish, for each race.
reporting level - precinct, city, county, state, etc.
file type extension - db, csv, html, json, xml, etc.

Examples:

FORMAT

<YYYYMMDD>_<state>_<race_type>_<jurisdiction>_[<race_code>_]<level>.<ext>

EXAMPLES

Louisiana Congressional District 1 precinct level results, Jefferson Davis Parish
20121106_la_general_jefferson-davis-parish_cd-1_precinct.html

Allegany County precinct results for general election (contains multiple race types)
20121106_md_general_allegany-county_precinct.csv

Pros

file names are plain language and reversible
not coupled to an external API but could be parsed and used to query the metadata API
naming convention should work across target and non-target data (e.g. local races and referendums)

Cons

Some rather gnarly file names
the race_code handling is a bit fuzzy and ad hoc. Would need to be careful about enforcing a convention here.

Add us states choices to models

Add choices attribute to state field on Contest, Candidate and Result documents. Use https://github.com/unitedstates/python-us

Finish BaseLoader and port MD load.py

Finish port of openelex.base.load.BaseLoader
Port md.load.py to:
- use updated BaseLoader
- use md.datasource.py methods (mappings, elections, etc.).
- move name parsing from BaseLoader to md loader

Update models and optimize MD results loading

Change documents in models.py to use separate collections for Contests, Candidates and Results. Make each a DynamicDocument that allows arbitrary attributes, in addition to core/expected fields. Use loading for all collections.

Should Candidate have a raw_suffix field?

Add logging to core framework and state subclasses

Lots of print statements sprinkled around that need to be replaced by proper logging.
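
A minimal sketch of what the replacement could look like with the standard logging module. The logger name and the `load_file` function are hypothetical; the point is only the pattern of a module-level logger plus configuration at the entry point.

```python
import logging

# Module-level logger; the dotted name is an assumption for illustration.
logger = logging.getLogger("openelex.us.md.load")


def load_file(path, rows):
    """Count non-empty rows, logging progress instead of printing."""
    logger.info("Loading %s", path)
    loaded = 0
    for row in rows:
        if not row:
            logger.warning("Skipping empty row in %s", path)
            continue
        loaded += 1
    logger.info("Loaded %d rows from %s", loaded, path)
    return loaded


if __name__ == "__main__":
    # Configure once, at the entry point (for example inside an invoke task).
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(name)s %(levelname)s %(message)s")
    load_file("example.csv", [["a"], [], ["b"]])
```

Configuring handlers only at the entry point leaves library modules free of output side effects, so tests and other callers can decide where messages go.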

String representations for core models

Fix MD 2000 loader

Primary election results are in a single csv with multiple "tables" on top of each other. Maybe scrape?

Auto-generate S3 index of raw result files

Add archive task that builds index.html file with list of files on S3, so they're browseable by public.

Add archive invoke task to generate manifest on S3

Add archive task that generates a manifest for a given state based on list of files saved to S3. Should link them up by standard filename to original/raw urls, plus any other metadata that's appropriate from datasource.mappings.

Add optional flag to archive previous version of a file on S3. This is important in cases where we hand-keyed results data or used some combination of automation (e.g. Tabula) and manual processes. The flag would allow us to version the data over time. (see #55)

Add archive invoke task

Create openelex.base.archive.py
Delete md.archiver.py
openelex.base.archive.py methods to implement
- ~~save_file(datefilter) - saves cached file to S3~~
- ~~delete_file(datefilter) - delete files from S3 not found in datasource.mappings~~

Add indexes to model collections

At the very least, add an index on election_id.

Include comment with sample data rows

This is just a style suggestion that I thought of while looking through us.md.load.

To reduce contributor friction and to make it easier to return to code, putting an example of a row in the comment of the code block that parses or transforms it would be a big help. This is particularly true when there's a lot of variance from year to year.

Candidates in more than one Contest?

Hi there, me again. This sorta relates to #30 but I thought I'd open another issue just to keep things neat.

@zstumgoren pointed out that you guys are working on new models for the elections data in your tasks branch which move the Candidate and Result objects out from under the Contest objects. Which is great and I've gone ahead and implemented that approach in our fork of this project.

However, I'm wondering what the thinking is behind making a Candidate only able to be related to one Contest. It seems to me that it is more often the case that a candidate will run in more than one election for the same office but, the way I interpret this relation, you'd end up with new Candidate objects for every Contest that a given person runs in, even though they are the same person. Would it make sense to make that relation a ListField full of ReferenceFields pointing to the various contests the person has run in? That way you only end up with a single record for that person.

Or am I totally interpreting this the wrong way?

Office standardization lookup CSV file

Office standardization - list all offices and office holder names; identify upper and lower chamber for state legislatures; give generic titles to offices that don’t have common names.

Models missing fields in spec

The data models are missing some of the fields described in the spec.

Missing from Result model

pct
precincts

Missing from Contest model

absentee_provisional
source_url
notes

@zstumgoren suggested that some of these fields might be dynamic properties on the model class, but it's important to remember that we bypass the model layer when baking in the interest of performance.

Originally opened as part of the discussion for #39.

Fix generated file names for FL runoffs

Create ID datasource

Use multiple loader classes instead of methods for state data formats that vary over time

For example in md.

Instead of:

class LoadResults(BaseLoader):

    def run(self, mapping):
        ...
        # Load results based on file type
        if '2002' in self.election_id:
            self.load_2002_file(mapping)
        ...

have multiple classes:

class Load2002Results(BaseLoader):
    def run(self, mapping):
        ...

    # Any other year related supporting methods happen here

class LoadResults(BaseLoader):
    def run(self, mapping):
        self.election_id = mapping['election']

        if '2002' in self.election_id:
            Load2002Results().run(mapping)

The big advantage to this is to make it easier to see which helper methods are related to a particular vintage without having to think about or stick to a naming convention for the methods.

I could also imagine being able to put common functionality in a state loader class and then reuse it in year-based subclasses.
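
That last idea might look roughly like the sketch below. `BaseLoader` here is a stand-in for the real openelex.base.load.BaseLoader, and `MDLoader`, `MD2002Loader`, `MDDefaultLoader`, the `names` key, and the election id string are all hypothetical.

```python
class BaseLoader:
    """Stand-in for openelex.base.load.BaseLoader; the real class does more."""

    def run(self, mapping):
        raise NotImplementedError


class MDLoader(BaseLoader):
    """Hypothetical state-level class for helpers shared by every vintage."""

    def clean(self, value):
        # Collapse runs of whitespace and trim the ends.
        return " ".join(value.split())


class MD2002Loader(MDLoader):
    def run(self, mapping):
        # 2002-specific parsing would go here; this stub only cleans names.
        return "2002", [self.clean(name) for name in mapping["names"]]


class MDDefaultLoader(MDLoader):
    def run(self, mapping):
        return "default", [self.clean(name) for name in mapping["names"]]


class LoadResults(BaseLoader):
    """Entry point that picks a year-specific loader and delegates to it."""

    def run(self, mapping):
        election_id = mapping["election"]
        loader_cls = MD2002Loader if "2002" in election_id else MDDefaultLoader
        return loader_cls().run(mapping)
```

For example, `LoadResults().run({"election": "md-2002-11-05-general", "names": ["  Jane   Doe "]})` would be routed to `MD2002Loader`, which reuses `clean` from the shared state class.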

2002 MD Results are Pipe-delimited

Need a special-case loader
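
For what it's worth, the standard csv module accepts a custom delimiter, so the special case may be small. A sketch, with a made-up sample row whose column layout is an assumption and not real MD data:

```python
import csv
import io


def read_pipe_delimited(stream):
    """Return all rows from a pipe-delimited text stream as lists of strings."""
    return list(csv.reader(stream, delimiter="|"))


# Sample row (invented for illustration; the real 2002 columns may differ):
# Anne Arundel|001|Governor|Jane Doe|123
sample = io.StringIO("Anne Arundel|001|Governor|Jane Doe|123\n")
rows = read_pipe_delimited(sample)
```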