camd-eia-crosswalk's Issues

The code breaks at a line where camd_eia_gen_crosswalk is executed

generator_step_string <- "3_1_Generator (generators) match on plant and gen IDs Step 1"

camd_eia_gen_crosswalk <-
  get_manual_matches(
    unit_manual_matches,
    unit_manual_excluded,
    camd_unit,
    eia_generator,
    eia_by = c("EIA_PLANT_ID", "EIA_GENERATOR_ID")
  )

camd_eia_gen_crosswalk <- camd_eia_gen_crosswalk %>%
  bind_rows(
    match_camd_eia_units(
      get_camd_unmatched(camd_unit, camd_eia_gen_crosswalk),
      get_unmatched(eia_generator, camd_eia_gen_crosswalk, by = c("EIA_PLANT_ID", "EIA_GENERATOR_ID")),
      by = plant_generator_match,
      str_glue("{generator_step_string}a: Exact match")
    )
  )

Add Python version of script

I suggest developing a Python version of the R script (or transitioning this code base to Python) to make contributing easier for the part of the user community that relies on the crosswalk but works primarily in Python. Switching to Python might enable more robust contributions from the user community; most of the users I am aware of seem to publish code exclusively in Python. Examples:

At the very least, it might be helpful to specify an Anaconda environment and a Python script or notebook that could run the R script through a Python package such as rpy2.
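A lighter-weight alternative to an R-to-Python bridge is to call the R interpreter from Python's standard library. The following is a minimal sketch, assuming `Rscript` is installed and on `PATH`; the script name `crosswalk.R` is a hypothetical placeholder, not the repository's actual file name.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical script name; substitute the actual path to the crosswalk R script.
R_SCRIPT = Path("crosswalk.R")


def build_rscript_command(script: Path) -> list[str]:
    """Return the argv list that runs an R script non-interactively."""
    # --vanilla skips saved workspaces and user profiles for a reproducible run.
    return ["Rscript", "--vanilla", str(script)]


def run_r_script(script: Path) -> int:
    """Run the script with Rscript and return its exit code."""
    if shutil.which("Rscript") is None:
        raise RuntimeError("Rscript was not found on PATH; install R first.")
    completed = subprocess.run(build_rscript_command(script), check=False)
    return completed.returncode


if __name__ == "__main__":
    # Show the command that would be executed without actually running R.
    print(" ".join(build_rscript_command(R_SCRIPT)))
```

Calling `run_r_script(R_SCRIPT)` from a notebook would execute the R workflow unchanged, so the Python side only has to read the output files it produces.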

Update crosswalk using year 2020 data

It appears that the R script uses year 2018 EIA-860 data to construct the crosswalk. I suggest updating the next release to use year 2020 data (or the most recent year available). This may fix some issues with missing associations.

Missing CAMD_PLANT_ID values

I've been using the CAMD-EIA crosswalk to connect data from the CAMD CEMS dataset and the EIA Form 860.

I noticed that there are some ORISPL_CODE values in the CEMS dataset that are missing from the crosswalk under CAMD_PLANT_ID, the field I believe is the crosswalk equivalent.

Here are the 140 ORISPL_CODE values that are in the CEMS data but not in the crosswalk:

[5, 247, 312, 334, 375, 569, 596, 604, 646, 647,
 658, 668, 699, 700, 734, 964, 1294, 1360, 1372, 1392,
 1458, 1470, 1496, 1555, 1557, 1585, 1589, 1918, 2397, 2473,
 2497, 2502, 2529, 2531, 2629, 2640, 2642, 2858, 2867, 2877,
 2947, 3099, 3109, 3110, 3112, 3114, 3120, 3134, 3139, 3142,
 3143, 3144, 3145, 3146, 3147, 3154, 3155, 3182, 3334, 3419, 
 3436, 3438, 3440, 3442, 3451, 3454, 3455, 3461, 3471, 3480,
 3493, 3503, 3523, 3524, 3526, 3527, 3549, 3610, 4036, 4233,
 4938, 6025, 6598, 7185, 7945, 7996, 8058, 10114, 10252, 10321,
 10430, 10522, 10616, 10618, 10628, 10883, 13213, 14013, 50459, 50468,
 50855, 50954, 54088, 54089, 54138, 54656, 54807, 55082, 55209, 55303,
 55373, 55486, 55683, 55858, 56186, 57185, 59882, 60589, 60698, 60925,
 60926, 60927, 61028, 61035, 61241, 61242, 880009, 880013, 880020, 880021,
 880022, 880026, 880066, 880068, 880070, 880077, 880081, 880091, 880094, 880109]

Many of these appear to correspond directly to EIA_PLANT_ID values from EIA-860.

What is the best way to integrate these into the crosswalk? Should I use the manual mapping form?
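The comparison described in this issue can be sketched with plain set operations. This is an illustrative sketch only: the function names are hypothetical, and the toy inputs are made up apart from reusing three IDs from the list above.

```python
def find_missing_plant_ids(cems_ids, crosswalk_ids):
    """Return sorted ORISPL_CODE values absent from the crosswalk's CAMD_PLANT_ID."""
    return sorted(set(cems_ids) - set(crosswalk_ids))


def split_by_eia_presence(missing_ids, eia_plant_ids):
    """Split missing IDs into those that also appear as EIA_PLANT_ID and the rest."""
    eia = set(eia_plant_ids)
    direct = [pid for pid in missing_ids if pid in eia]
    unresolved = [pid for pid in missing_ids if pid not in eia]
    return direct, unresolved


# Toy inputs for illustration only; 5, 247, and 880009 come from the list above,
# while the other values and every membership shown here are made up.
cems_orispl = [3, 5, 247, 880009]
crosswalk_camd_plant = [3]
eia_860_plant = [3, 5, 247]

missing = find_missing_plant_ids(cems_orispl, crosswalk_camd_plant)
direct, unresolved = split_by_eia_presence(missing, eia_860_plant)
print(missing)      # [5, 247, 880009]
print(direct)       # [5, 247]
print(unresolved)   # [880009]
```

IDs in `direct` are candidates for a plant-level match on the shared ID, while IDs in `unresolved` would still need manual review.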

Add flag for combined cycles

Add a column to flag combined cycle units in the output.
Possibly include details about the relationship (e.g., many-to-many, one-to-many).
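One possible way to derive the relationship detail is to count how many partners each side of a matched pair has. The sketch below is an assumption about how such a column could be computed, not the project's method; the unit and generator IDs are invented, and the relationship type alone does not prove a unit is a combined cycle.

```python
from collections import defaultdict


def classify_relationships(pairs):
    """Label each (camd_unit, eia_generator) pair as '<units>-to-<generators>'."""
    gens_per_unit = defaultdict(set)
    units_per_gen = defaultdict(set)
    for unit, gen in pairs:
        gens_per_unit[unit].add(gen)
        units_per_gen[gen].add(unit)

    labels = {}
    for unit, gen in pairs:
        # "many" units when the generator is shared by several CAMD units.
        unit_side = "many" if len(units_per_gen[gen]) > 1 else "one"
        # "many" generators when the CAMD unit feeds several generators.
        gen_side = "many" if len(gens_per_unit[unit]) > 1 else "one"
        labels[(unit, gen)] = f"{unit_side}-to-{gen_side}"
    return labels


# Invented IDs: two turbines with their own generators plus a shared steam generator.
pairs = [("CT1", "G1"), ("CT1", "ST1"), ("CT2", "G2"), ("CT2", "ST1"), ("B1", "G9")]
labels = classify_relationships(pairs)
print(labels[("CT1", "ST1")])  # many-to-many
print(labels[("CT1", "G1")])   # one-to-many
print(labels[("B1", "G9")])    # one-to-one
```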

Condense MATCH_TYPE_ columns

Determine a better way to indicate match types from the independent matches to EIA data (generator and boiler matches).
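As one option, the separate columns could be folded into a single labelled string. This is a rough sketch under assumptions: the column names `MATCH_TYPE_GEN` and `MATCH_TYPE_BOILER` and the match descriptions are illustrative and may not match the crosswalk's actual output.

```python
def condense_match_types(row, prefix="MATCH_TYPE_"):
    """Merge every MATCH_TYPE_* field of a row into one labelled string."""
    parts = []
    for column in sorted(row):
        value = row[column]
        if column.startswith(prefix) and value:
            # Keep the column suffix so the source of each match stays visible.
            parts.append(f"{column[len(prefix):].lower()}: {value}")
    return "; ".join(parts)


# Illustrative row; column names and values are assumptions.
row = {
    "CAMD_UNIT_ID": "1",
    "MATCH_TYPE_GEN": "Exact match",
    "MATCH_TYPE_BOILER": "",
}
print(condense_match_types(row))  # gen: Exact match
```

Empty match types are skipped, so a row matched only on generators yields a single entry rather than a trailing blank.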

Explain duplicates in README

When either EPA or EIA uses more identifiers than the other for the same units, the crosswalk contains "duplicate" rows.

For example, the following two plants (IDs 52151 and 7903), both included in the manual match file, have duplicates for CAMD units and EIA units, respectively.

CAMD_PLANT_ID  CAMD_UNIT_ID  CAMD_GENERATOR_ID  EIA_PLANT_ID  EIA_BOILER_ID  EIA_GENERATOR_ID
52151          001           GEN1               52151         PB1            GEN1
52151          001           GEN1               52151         RF1            GEN1
52151          001           GEN2               52151         PB2            GEN2
52151          001           GEN2               52151         RF2            GEN2
7903           MGS1A         MGS1A              7903                         MGS1
7903           MGS1B         MGS1B              7903                         MGS1
7903           MGS2A         MGS2A              7903                         MSG2
7903           MGS2B         MGS2B              7903                         MSG2
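To make the two kinds of duplication concrete, the sketch below counts how often each CAMD unit key and each EIA unit key repeats in the rows above. The helper name `repeated_keys` and the choice of key columns are assumptions made for illustration.

```python
from collections import Counter

# Rows copied from the example above, in column order:
# (CAMD_PLANT_ID, CAMD_UNIT_ID, CAMD_GENERATOR_ID, EIA_PLANT_ID, EIA_BOILER_ID, EIA_GENERATOR_ID)
rows = [
    (52151, "001", "GEN1", 52151, "PB1", "GEN1"),
    (52151, "001", "GEN1", 52151, "RF1", "GEN1"),
    (52151, "001", "GEN2", 52151, "PB2", "GEN2"),
    (52151, "001", "GEN2", 52151, "RF2", "GEN2"),
    (7903, "MGS1A", "MGS1A", 7903, "", "MGS1"),
    (7903, "MGS1B", "MGS1B", 7903, "", "MGS1"),
    (7903, "MGS2A", "MGS2A", 7903, "", "MSG2"),
    (7903, "MGS2B", "MGS2B", 7903, "", "MSG2"),
]


def repeated_keys(rows, key_columns):
    """Return keys (tuples of the selected columns) that occur on more than one row."""
    counts = Counter(tuple(row[i] for i in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# CAMD side: plant ID + unit ID. EIA side: plant ID + boiler ID + generator ID.
camd_dupes = repeated_keys(rows, key_columns=(0, 1))
eia_dupes = repeated_keys(rows, key_columns=(3, 4, 5))
print(camd_dupes)  # {(52151, '001'): 4}
print(eia_dupes)   # {(7903, '', 'MGS1'): 2, (7903, '', 'MSG2'): 2}
```

Plant 52151 repeats one CAMD unit across several EIA boiler and generator IDs, while plant 7903 repeats each EIA generator across two CAMD units.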

Missing boiler ID from EIA data

The EIA data includes 6_1_EnviroAssoc_Y2018.xlsx, which connects EIA generator IDs to EIA boiler IDs. The boiler ID should be included in the spreadsheet and tested for possible matches.

QA fuzzy matches

Investigate fuzzy matches and include a manual match or exclusion where necessary.

Missing boiler ID associations

Based on the crosswalk documentation, the crosswalk uses the EIA-860 boiler generator association (BGA) table as one of its inputs, but it does not seem to include the complete set of associations in the BGA table. I'm not sure whether this is intentional (i.e., your goal is only to crosswalk CAMD unit IDs with EIA generator IDs) or unintentional.

For example, for plant 1391, the crosswalk associates CAMD unit ID 1A only with EIA generator ID 1A and boiler ID 1A. However, the EIA-860 BGA table shows that generator 1A is associated not only with boiler 1A, but also with boilers 2A, 3A, and 9.

It might be helpful in the next release to include the full set of boiler-generator associations for each CAMD unit ID. Otherwise, I would suggest documenting that users need to merge the crosswalk with the BGA table to get the full set of associations.
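The merge suggested here can be sketched as a join on plant ID and generator ID. This is a minimal illustration using only the plant 1391 values quoted above; the function name `expand_with_bga` and the tuple layout of the BGA records are assumptions, not the EIA file's actual schema.

```python
from collections import defaultdict

# Crosswalk row for plant 1391 as described above.
crosswalk = [
    {"CAMD_PLANT_ID": 1391, "CAMD_UNIT_ID": "1A",
     "EIA_PLANT_ID": 1391, "EIA_GENERATOR_ID": "1A", "EIA_BOILER_ID": "1A"},
]

# BGA associations for generator 1A as described above: (plant, generator, boiler).
bga = [(1391, "1A", "1A"), (1391, "1A", "2A"), (1391, "1A", "3A"), (1391, "1A", "9")]


def expand_with_bga(crosswalk, bga):
    """Emit one crosswalk row per boiler associated with the row's EIA generator."""
    boilers = defaultdict(list)
    for plant, generator, boiler in bga:
        boilers[(plant, generator)].append(boiler)

    expanded = []
    for row in crosswalk:
        key = (row["EIA_PLANT_ID"], row["EIA_GENERATOR_ID"])
        # Fall back to the crosswalk's own boiler ID when the BGA table has no entry.
        for boiler in boilers.get(key, [row["EIA_BOILER_ID"]]):
            expanded.append({**row, "EIA_BOILER_ID": boiler})
    return expanded


expanded = expand_with_bga(crosswalk, bga)
print([row["EIA_BOILER_ID"] for row in expanded])  # ['1A', '2A', '3A', '9']
```

Each expanded row keeps the original CAMD unit ID, so the output lists every boiler tied to that unit through its generator.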
