openelections / openelections-core
Core repo for election results data acquisition, transformation and output.
License: MIT License
For results files that are electronic PDFs (or even for those which are not), when do we want to convert/extract the results? Before the loader is run, or during that process? Seems to me that we might want the data portion done before the loader is run, meaning that generated_filenames would not be PDFs, even if the source files were. Thoughts?
Initially grab all the links and filter them.
Create invoke task to drive state LoadResults classes.
In the baker (see #39) output, where does the party field described in the Result spec come from? I'm assuming this is Candidate.parties rather than Contest.party. @zstumgoren, @dwillis do you have an answer to this?
Encapsulate process of building source data URLs and standardizing results filenames (now in fetcher.py) in a new datasource.py module. This Datasource class should provide a simple public interface for dynamic querying by downstream fetcher and loader.
Write validation tests for MD results to ensure various result file types were loaded correctly.
Examples:
Create cache.diff invoke task that shows diff between locally cached files and expected files, based on comparison of files in state cache dir with expected files generated from datasource.mappings.
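A minimal sketch of the comparison such a task could perform, using only the standard library. The function name cache_diff and the shape of expected_names (a plain list of standardized file names) are assumptions for illustration; the real task would derive the expected names from datasource.mappings, whose structure is not shown here.

```python
import os


def cache_diff(cache_dir, expected_names):
    """Compare file names in cache_dir against the expected standardized names.

    Hypothetical helper: expected_names stands in for whatever
    datasource.mappings would produce for a state.
    """
    cached = set(os.listdir(cache_dir)) if os.path.isdir(cache_dir) else set()
    expected = set(expected_names)
    return {
        "missing": sorted(expected - cached),     # expected but not yet cached
        "unexpected": sorted(cached - expected),  # cached but absent from the mappings
    }
```

An invoke task could wrap this function and print the two lists for a given state.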
Update load.run invoke task to:
We are looking at an election board that seems to think of a Republican Primary as a different election than a Democratic Primary, even though they are held on the same day.
What's the thinking in this project?
load_county_2002()
load_2002_file()
load_2000_primary_file()
Should the dashboard have an API response containing OCD mappings for a state? Should we be able to add our own via the admin?
I know they'll have uuids, but wondering if they should also include the state to differentiate them that way.
This line seems to be adding a slug attribute for an election by constructing it. Why do this when the API response has the slug already built as the id attribute?
Ohio has some instances in which multiple files cover the same "office" label we use - for example, they have separate HTML pages for state senators and state reps, as well as for state officer posts (see primary results here: http://www.sos.state.oh.us/sos/elections/Research/electResultsMain/2008ElectionResults.aspx, for example). We haven't really decided on the naming conventions for those. Do we want to have something like:
20080304__oh__democratic__primary__state_leg_1.html
or
20080304__oh__democratic__primary__state_senate.html
20080304__oh__democratic__primary__state_house.html
?
Migrate name parsing bits in base.load.BaseLoader to MD-specific md/transform/names.py and/or base/transforms/names.py
I'm noticing that your result model assumes a candidate. Is there any vision to include ballot initiatives in this schema? If so, where/how does that fit?
Create a module to standardize names of raw result files. Raw results will be stored on S3 using the standardized name.
Standardized names should:
See #4 for details on naming convention
Standardization should generate a composite file name that reflects metadata captured in our data admin.
File name components should include:
election date - YYYYMMDD
state - postal code
race type - general, primary-dem, primary runoff-dem, etc.
jurisdiction - OCD id of the jurisdiction, or geographic area, for which results are provided. For example, a file for MD that contains precinct-level results for Anne Arundel County could use a slugified version of the plain OCD name
race_code - denotes types of races covered in the data file. Optional element that should only be used when state provides data for single race in distinct file. For example, Louisiana provides precinct-level results, by parish, for each race. This field could also be expanded, on a state-by-state basis, to handle arbitrary groupings of results (e.g. separate files for state leg., federal, local).
reporting level - precinct, city, county, state, etc.
file type extension - db, csv, html, json, xml, etc.
File name components separated by double underscores; component sub-parts separated by single underscores.
<YYYYMMDD>__<state>__<race_type>__<jurisdiction>__[<race_code>__]<level>.<ext>
Louisiana Congressional District 1 precinct level results, Jefferson Davis Parish
20121106__la__general__jefferson_davis_parish__cd_1__precinct.html
Allegany County precinct results for general election (contains multiple race types)
20121106__md__general__allegany_county__precinct.csv
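The naming format above could be generated along these lines. This is only a sketch: the helper names slugify and standardized_filename are made up for illustration, and the real module may slugify differently (for example when deriving the jurisdiction from an OCD id).

```python
import re


def slugify(text):
    """Lowercase text and collapse runs of non-alphanumerics into single underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def standardized_filename(date, state, race_type, jurisdiction, level, ext, race_code=None):
    """Join components with double underscores; sub-parts use single underscores.

    Hypothetical helper following the
    <YYYYMMDD>__<state>__<race_type>__<jurisdiction>__[<race_code>__]<level>.<ext> format.
    """
    parts = [date, state.lower(), slugify(race_type), slugify(jurisdiction)]
    if race_code:  # optional: only when a file covers a single race or grouping
        parts.append(slugify(race_code))
    parts.append(slugify(level))
    return "__".join(parts) + "." + ext.lstrip(".")


# Usage, mirroring the Maryland example above
name = standardized_filename("20121106", "MD", "general", "Allegany County", "precinct", "csv")
```

Keeping the double underscore as the only component separator means a file name can be split back into its metadata with a single split("__") call.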
Standardized name should be generated during file download process (in state-specific fetch.py modules).
Each state directory should have a 2-column mappings.txt file that contains standardized name and link to raw result file. The raw link should point to result file located at source agency or to copy of raw file archived on S3. The latter would be used in cases where result files are not scrapable (e.g. if agency provided a database dump).
## mappings.txt ##
standard_name, raw_source_name
20121106__md__general__anne_arundel_county__precinct.csv, http://www.elections.state.md.us/elections/2012/election_data/Anne_Arundel_By_Precinct_2012_General.csv
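Reading such a file could look roughly like the sketch below. The function name read_mappings is hypothetical, and the handling of the banner line and header row is an assumption based on the sample above rather than a settled file format.

```python
import csv
import io


def read_mappings(text):
    """Return {standard_name: raw_source} from the text of a 2-column mappings file.

    Skips blank lines, lines starting with '#', and the header row
    (assumed layout, taken from the sample mappings.txt).
    """
    mappings = {}
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) < 2 or row[0].startswith("#"):
            continue
        if row[0].strip() == "standard_name":
            continue
        mappings[row[0].strip()] = row[1].strip()
    return mappings
```

A fetcher could then look up the raw source for any standardized name, whether it points at the source agency or at a copy archived on S3.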
@fgregg and I have been exploring implementing the openelections data structure for our local elections in Chicago and we ran across an issue today which I'm wondering if you might consider implementing in a slightly different way.
Since a Candidate is stored as an EmbeddedDocument within each Result (which is itself an EmbeddedDocument within a Contest), the process of updating an individual Candidate can be somewhat of a bear, especially for a candidate who has been running in elections for as long as we have data for (and since our data is at the precinct level).
The main reason this comes up is because we're storing information about local aldermen in a pupa instance which is giving us ocd_person ids for them. We'd like to be able to cross reference that info with the info about the elections that they've run in that we're storing in this app, and the only way we have to do that is to add the ocd_person id into this app manually. The manual part of this we were expecting and can handle, but I'm wondering if you might consider storing the candidates as a separate Document, the way that you're storing the Office for a given result. This would certainly make the process of getting at the information about candidates a whole heck of a lot easier.
Move creation of Contest and Candidate entries from MD loader to transforms
This comes from a discussion in #46 where @zstumgoren said:
But I'm starting to wonder if the creation of unique Candidate and Contest instances should be treated as a transform step. Our initial goal with the data load step should simply be getting the data loaded into Mongo in its raw form. @dwillis and I agreed to this approach a while back, and have gradually migrated transforms and various cleanups from the load step to the transformation step.
Enforcing uniqueness of contests and candidates in the load step adds a great deal of complexity to this phase of the pipeline, and it feels like we're blending concerns a bit. Unless @dwillis has strong feelings against, I'd be favorable to shifting our approach. I don't think it would take a great deal of reworking of the models or loader/bakery. In fact, it would greatly simplify the loader and possibly v1 of bakery.
Here's one possible strategy:
Create a RawResult model that lets us load a flat model of all raw data (this would be our current Result model, plus contest and candidate fields currently normalized to their own models)
Generate unique Contest and Candidate instances and "clean" Result documents as subsequent transform steps
In this new model, Result documents would store cleaned or processed Result data migrated from RawResult, or generated subsequently from lower-level results (e.g. race-wide results rolled up from precinct-level results). In general, these collections would store transformed, normalized versions of our raw data.
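The load-then-transform split described in the quote could be sketched as follows. This uses plain dataclasses instead of the project's Mongo models, and every field name on RawResult is an illustrative assumption, not the actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawResult:
    """Flat record as loaded from a source file (field names are illustrative)."""
    election_id: str
    office: str
    district: str
    candidate_name: str
    party: str
    jurisdiction: str
    votes: int


def contest_key(r):
    """Fields assumed here to identify a unique contest."""
    return (r.election_id, r.office, r.district)


def unique_contests(raw_results):
    """Distinct contest keys, in first-seen order."""
    return list(dict.fromkeys(contest_key(r) for r in raw_results))


def unique_candidates(raw_results):
    """Distinct (contest key, candidate name) pairs, in first-seen order."""
    return list(dict.fromkeys((contest_key(r), r.candidate_name) for r in raw_results))


def rollup_votes(raw_results):
    """Race-wide totals per (contest key, candidate), summed from lower-level rows."""
    totals = {}
    for r in raw_results:
        key = (contest_key(r), r.candidate_name)
        totals[key] = totals.get(key, 0) + r.votes
    return totals
```

The loader only has to produce RawResult rows; uniqueness and roll-ups become separate transform functions that can be rerun without touching the source files.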
Update loaders to reflect party field added to the Result model and removed from the Candidate model for #46.
Add include/exclude filters to validate task. API should match transform task.
WIP draft of this can be found in Code Style Guide in our shared GDocs directory.
We need a consistent file naming convention for raw result files. This file name would be applied during the initial download of the file (in the fetch class), and would be the name of the file archived on S3. It should provide enough information about the source file to link up to our metadata API.
Resolve a canonical ID using metadata API.
Pros
Cons
Generate composite file names that reflect metadata captured in our data admin.
File name components could include:
Examples:
FORMAT
<YYYYMMDD>_<state>_<race_type>_<jurisdiction>_[<race_code>_]<level>.<ext>
EXAMPLES
Louisiana Congressional District 1 precinct level results, Jefferson Davis Parish
20121106_la_general_jefferson-davis-parish_cd-1_precinct.html
Allegany County precinct results for general election (contains multiple race types)
20121106_md_general_allegany-county_precinct.csv
Pros
Cons
Add choices attribute to state field on Contest, Candidate and Result documents. Use https://github.com/unitedstates/python-us
Change documents in models.py to use separate collections for Contests, Candidates and Results. Make each a DynamicDocument that allows arbitrary attributes, in addition to core/expected fields. Use loading for all collections.
Lots of print statements sprinkled around that need to be replaced by proper logging.
Primary election results are in a single csv with multiple "tables" on top of each other. Maybe scrape?
Add archive task that generates a manifest for a given state based on list of files saved to S3. Should link them up by standard filename to original/raw urls, plus any other metadata that's appropriate from datasource.mappings.
Add optional flag to archive previous version of a file on S3. This is important in cases where we hand-keyed results data or used some combination of automation (e.g. Tabula) and manual processes. The flag would allow us to version the data over time. (see #55)
At the very least, add an index on election_id.
This is just a style suggestion that I thought of while looking through us.md.load.
To reduce contributor friction and to make it easier to return to code, putting an example of a row in the comment of the code block that parses or transforms it would be a big help. This is particularly true when there's a lot of variance from year to year.
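As an illustration of the suggestion, a parsing helper might carry a sample row in its docstring. The column layout, values, and function name below are invented for the example and do not come from any actual Maryland results file.

```python
def parse_precinct_row(row):
    """Parse one precinct-level row into a dict.

    Example row (hypothetical layout, for illustration only):
        ['Allegany', '01-001', 'President', 'Jane Doe', 'DEM', '1,234']
    """
    county, precinct, office, candidate, party, votes = row
    return {
        "county": county,
        "precinct": precinct,
        "office": office,
        "candidate": candidate,
        "party": party,
        "votes": int(votes.replace(",", "")),  # strip thousands separators
    }
```

When the layout changes from year to year, each year's parser can show its own sample row, so a reader can see the differences without opening the raw files.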
Hi there, me again. This sorta relates to #30 but I thought I'd open another issue just to keep things neat.
@zstumgoren pointed out that you guys are working on new models for the elections data in your tasks branch which move the Candidate and Result objects out from under the Contest objects. Which is great, and I've gone ahead and implemented that approach in our fork of this project.
However, I'm wondering what the thinking is behind making a Candidate only able to be related to one Contest. It seems to me that it is more often the case that a candidate will run in more than one election for the same office, but the way I interpret this relation is that you'd end up with new Candidate objects for every Contest that a given person runs in, even though they are the same person. Would it make sense to make that relation a ListField full of ReferenceFields pointing to the various contests the person has run in? That way you only end up with a single record for that person.
Or am I totally interpreting this the wrong way?
Office standardization - list all offices and office holder names; identify upper and lower chamber for state legislatures; give generic titles to offices that don’t have common names.
The data models are missing some of the fields described in the spec.
Missing from Result model
Missing from Contest model
@zstumgoren suggested that some of these fields might be dynamic properties on the model class, but it's important to remember that we bypass the model layer when baking in the interest of performance.
Originally opened as part of the discussion for #39.
For example in md.
Instead of:
class LoadResults(BaseLoader):
    def run(self, mapping):
        ...
        # Load results based on file type
        if '2002' in self.election_id:
            self.load_2002_file(mapping)
        ...
have multiple classes:
class Load2002Results(BaseLoader):
    def run(self, mapping):
        ...
    # Any other year related supporting methods happen here

class LoadResults(BaseLoader):
    def run(self, mapping):
        self.election_id = mapping['election']
        if '2002' in self.election_id:
            Load2002Results().run(mapping)
The big advantage to this is to make it easier to see which helper methods are related to a particular vintage without having to think about or stick to a naming convention for the methods.
I could also imagine being able to put common functionality in a state loader class and then reuse it in year-based subclasses.
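One way the shared-base idea could look, as a self-contained sketch. BaseLoader here is a stand-in stub, and MDBaseLoader, LoadDefaultResults, clean_votes and loaders_by_year are hypothetical names; the returned strings only exist to make the dispatch visible.

```python
class BaseLoader:
    """Stand-in for the project's BaseLoader; the real class is not shown here."""

    def run(self, mapping):
        raise NotImplementedError


class MDBaseLoader(BaseLoader):
    """Hypothetical state-level class holding logic shared across years."""

    def clean_votes(self, raw):
        # Shared helper: tolerate thousands separators and blank cells.
        return int(str(raw).replace(",", "").strip() or 0)


class Load2002Results(MDBaseLoader):
    def run(self, mapping):
        # 2002-specific parsing would go here; shared helpers come from MDBaseLoader.
        return "2002 loader handled " + mapping["election"]


class LoadDefaultResults(MDBaseLoader):
    def run(self, mapping):
        return "default loader handled " + mapping["election"]


class LoadResults(BaseLoader):
    """Dispatch to a year-specific loader based on the election id."""

    loaders_by_year = {"2002": Load2002Results}

    def run(self, mapping):
        election_id = mapping["election"]
        for year, loader_cls in self.loaders_by_year.items():
            if year in election_id:
                return loader_cls().run(mapping)
        return LoadDefaultResults().run(mapping)
```

Adding another vintage then means writing one subclass and registering it in loaders_by_year, without touching the dispatch logic.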
Need a special-case loader