usepa / camd-eia-crosswalk

A data crosswalk to integrate U.S. power sector emission and operation data from EPA to EIA

License: MIT License

oar

camd-eia-crosswalk's Issues

The code breaks at the line where camd_eia_gen_crosswalk is executed

generator_step_string <- "3_1_Generator (generators) match on plant and gen IDs Step 1"

camd_eia_gen_crosswalk <-
  get_manual_matches(
    unit_manual_matches,
    unit_manual_excluded,
    camd_unit,
    eia_generator,
    eia_by = c("EIA_PLANT_ID", "EIA_GENERATOR_ID")
  )

camd_eia_gen_crosswalk <- camd_eia_gen_crosswalk %>%
  bind_rows(
    match_camd_eia_units(
      get_camd_unmatched(camd_unit, camd_eia_gen_crosswalk),
      get_unmatched(eia_generator, camd_eia_gen_crosswalk, by = c("EIA_PLANT_ID", "EIA_GENERATOR_ID")),
      by = plant_generator_match,
      str_glue("{generator_step_string}a: Exact match")
    )
  )

Utilize unit code field from EIA-860 for combined cycles

Add Python version of script

I suggest developing a Python version of the R script (or transitioning this code base to Python) to make contributing easier for users of the crosswalk who work primarily in Python. Switching to Python might enable more robust contributions from the user community; most of the users I am aware of seem to publish code exclusively in Python. Examples:

At the very least, it might be helpful to specify an Anaconda environment and a Python script or notebook that could run the R script through a Python package such as rpy2.

Create confidence column for matches

Update crosswalk using year 2020 data

The R script appears to use year 2018 EIA-860 data to construct the crosswalk. I suggest updating the next release to use year 2020 data (or the most recent year available). This may fix some issues with missing associations.

Create independent association tables

EPA plant to EIA plant
EPA unit to EIA gen (combination of plant/gen)

Missing CAMD_PLANT_ID values

I've been using the CAMD-EIA crosswalk to connect data from the CAMD CEMS dataset and the EIA Form 860.

I noticed that there are some ORISPL_CODE values in the CEMS dataset that are missing from the crosswalk under CAMD_PLANT_ID, the field I believe is the crosswalk equivalent.

Here are the 140 ORISPL_CODE values that are in the CEMS data but not in the crosswalk:

[5, 247, 312, 334, 375, 569, 596, 604, 646, 647,
 658, 668, 699, 700, 734, 964, 1294, 1360, 1372, 1392,
 1458, 1470, 1496, 1555, 1557, 1585, 1589, 1918, 2397, 2473,
 2497, 2502, 2529, 2531, 2629, 2640, 2642, 2858, 2867, 2877,
 2947, 3099, 3109, 3110, 3112, 3114, 3120, 3134, 3139, 3142,
 3143, 3144, 3145, 3146, 3147, 3154, 3155, 3182, 3334, 3419, 
 3436, 3438, 3440, 3442, 3451, 3454, 3455, 3461, 3471, 3480,
 3493, 3503, 3523, 3524, 3526, 3527, 3549, 3610, 4036, 4233,
 4938, 6025, 6598, 7185, 7945, 7996, 8058, 10114, 10252, 10321,
 10430, 10522, 10616, 10618, 10628, 10883, 13213, 14013, 50459, 50468,
 50855, 50954, 54088, 54089, 54138, 54656, 54807, 55082, 55209, 55303,
 55373, 55486, 55683, 55858, 56186, 57185, 59882, 60589, 60698, 60925,
 60926, 60927, 61028, 61035, 61241, 61242, 880009, 880013, 880020, 880021,
 880022, 880026, 880066, 880068, 880070, 880077, 880081, 880091, 880094, 880109]

Many of these appear to match EIA_PLANT_ID values in EIA-860 directly.

What is the best way to integrate these into the crosswalk? Should I use the manual mapping form?
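The comparison described in this issue can be sketched as a set difference. This is a minimal illustration, not the crosswalk's own method: the function name `find_missing_plants` and the toy inputs are hypothetical, and the EIA membership shown is assumed for the example rather than taken from EIA-860. Note that an equal numeric ID does not by itself prove the two records describe the same plant.

```python
def find_missing_plants(cems_codes, crosswalk_plant_ids, eia_plant_ids):
    """Return CEMS plant codes absent from the crosswalk, split by
    whether the same numeric ID appears among the EIA plant IDs."""
    missing = set(cems_codes) - set(crosswalk_plant_ids)
    eia = set(eia_plant_ids)
    in_eia = sorted(missing & eia)       # candidates for a direct ID match
    not_in_eia = sorted(missing - eia)   # need a manual look-up
    return in_eia, not_in_eia


# Toy inputs. 5, 247 and 880009 come from the list above; plant 3 and the
# EIA membership below are placeholders assumed for illustration only.
cems_codes = [3, 5, 247, 880009]
crosswalk_plant_ids = [3]
eia_plant_ids = [3, 5, 247]

in_eia, not_in_eia = find_missing_plants(cems_codes, crosswalk_plant_ids, eia_plant_ids)
print(in_eia)      # [5, 247]
print(not_in_eia)  # [880009]
```

Codes in the first list could be checked against EIA-860 plant names before being submitted as manual matches; codes in the second list have no same-numbered EIA plant in the toy data.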

Get data from Monitoring Plan to enhance matches

Fuel type data
Unit type data

Add flag for combined cycles

Add a column to flag combined cycle units in the output.
Possibly include details about the relationship (e.g. many-to-many, one-to-many, etc.).
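One possible way to derive such a relationship label is to count, for each matched row, how many EIA generators the CAMD unit maps to and how many CAMD units the EIA generator maps to. The sketch below is an assumption about how this could work, not the repository's implementation; `classify_relationships` and the unit and generator IDs are hypothetical.

```python
from collections import defaultdict


def classify_relationships(pairs):
    """Label each (camd_unit, eia_generator) pair by its cardinality,
    read in the direction CAMD unit -> EIA generator."""
    gens_by_unit = defaultdict(set)
    units_by_gen = defaultdict(set)
    for unit, gen in pairs:
        gens_by_unit[unit].add(gen)
        units_by_gen[gen].add(unit)

    labels = {}
    for unit, gen in pairs:
        many_gens = len(gens_by_unit[unit]) > 1    # unit maps to several generators
        many_units = len(units_by_gen[gen]) > 1    # generator maps to several units
        if many_gens and many_units:
            labels[(unit, gen)] = "many-to-many"
        elif many_gens:
            labels[(unit, gen)] = "one-to-many"
        elif many_units:
            labels[(unit, gen)] = "many-to-one"
        else:
            labels[(unit, gen)] = "one-to-one"
    return labels


# Hypothetical plant: two combustion turbines sharing one steam turbine,
# plus an unrelated boiler/generator pair.
pairs = [("CT1", "CT1"), ("CT1", "ST1"), ("CT2", "CT2"), ("CT2", "ST1"), ("B1", "G1")]
labels = classify_relationships(pairs)
print(labels[("CT1", "ST1")])  # many-to-many
print(labels[("B1", "G1")])    # one-to-one
```

The labels here are per row. A combined cycle flag would more likely be set once per connected group of units and generators, which would require grouping the rows first.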

Use CAMPD API instead of FACT

FACT API will be decommissioned

Use CAM APIs instead of FACT API

Condense MATCH_TYPE_ columns

Determine better way to indicate match types from the independent matches to EIA data (generator/boiler matches)

Explain duplicates in README

When either EPA or EIA uses more identifiers than the other for the same units, the crosswalk contains "duplicate" rows.

For example, the following two plants, included in the manual match file, with IDs 52151 and 7903 have duplicates for CAMD units and EIA units respectively.

CAMD_PLANT_ID	CAMD_UNIT_ID	CAMD_GENERATOR_ID	EIA_PLANT_ID	EIA_BOILER_ID	EIA_GENERATOR_ID
52151	001	GEN1	52151	PB1	GEN1
52151	001	GEN1	52151	RF1	GEN1
52151	001	GEN2	52151	PB2	GEN2
52151	001	GEN2	52151	RF2	GEN2
7903	MGS1A	MGS1A	7903		MGS1
7903	MGS1B	MGS1B	7903		MGS1
7903	MGS2A	MGS2A	7903		MSG2
7903	MGS2B	MGS2B	7903		MSG2
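A reader can check which identifiers repeat by counting key combinations. The sketch below uses the eight rows shown above; the helper name `repeated_keys` is hypothetical, and reading the rows from an in-memory list rather than from the crosswalk file is a simplification.

```python
from collections import Counter

COLUMNS = ("CAMD_PLANT_ID", "CAMD_UNIT_ID", "CAMD_GENERATOR_ID",
           "EIA_PLANT_ID", "EIA_BOILER_ID", "EIA_GENERATOR_ID")

# Rows copied from the example table; IDs are kept as strings so that
# leading zeros such as "001" survive.
ROWS = [
    ("52151", "001", "GEN1", "52151", "PB1", "GEN1"),
    ("52151", "001", "GEN1", "52151", "RF1", "GEN1"),
    ("52151", "001", "GEN2", "52151", "PB2", "GEN2"),
    ("52151", "001", "GEN2", "52151", "RF2", "GEN2"),
    ("7903", "MGS1A", "MGS1A", "7903", "", "MGS1"),
    ("7903", "MGS1B", "MGS1B", "7903", "", "MGS1"),
    ("7903", "MGS2A", "MGS2A", "7903", "", "MSG2"),
    ("7903", "MGS2B", "MGS2B", "7903", "", "MSG2"),
]


def repeated_keys(rows, key_columns):
    """Return {key: count} for every key combination seen more than once."""
    idx = [COLUMNS.index(c) for c in key_columns]
    counts = Counter(tuple(row[i] for i in idx) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


camd_repeats = repeated_keys(ROWS, ("CAMD_PLANT_ID", "CAMD_UNIT_ID"))
eia_repeats = repeated_keys(ROWS, ("EIA_PLANT_ID", "EIA_GENERATOR_ID"))
print(camd_repeats)  # {('52151', '001'): 4}
print(eia_repeats)
```

In this sample, CAMD unit 001 at plant 52151 appears four times, and each EIA generator appears twice, either because several boilers feed it (52151) or because several CAMD units map to it (7903).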

Incorporate "proposed" units from EIA into matches when they show up in CAMD before EIA?

Remove the filter on EIA-860 for Prime Mover "CE"

Plant Code 7063 Generator 1 is of type Natural Gas with Compressed Air Storage but is still in the Acid Rain Program. It is also the only unit in the U.S. with this prime mover.

Missing boiler ID from EIA data

The EIA data includes 6_1_EnviroAssoc_Y2018.xlsx which connects EIA generator ID and EIA boiler ID. The boiler ID should be included in the spreadsheet and tested for possible matches.

Investigate other data sets to include

e-GGRT
NEEDS
NEI
USGS

Organize repo with input/output directories

Create more complex example with combined cycle units

Expand on the coal unit example and include combined cycle units.

Include stack associations

Create stripped-down output with only key columns

Create an output with only columns that can be used to perform the crosswalk

QA fuzzy matches

Investigate fuzzy matches and include a manual match or exclusion where necessary

Add EPA web link to README

Once the EPA web page is published, add the link to the README

Convert crosswalk script from notebook to executable script with parameters

Restructure the crosswalk script to be a plain R script
Create general functions that can be packaged
Wrap in command-line interface to accept parameters such as data year
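The issue targets the R script, but the shape of such a command-line interface is language-neutral. Purely as an illustration, and in Python given the request for a Python version elsewhere in this list, a parameterized entry point might look like the sketch below. The option names `--data-year` and `--output-dir` and their defaults are assumptions, not existing flags of this project.

```python
import argparse


def build_parser():
    """Build a command-line parser for a hypothetical crosswalk runner."""
    parser = argparse.ArgumentParser(
        description="Build the CAMD-EIA crosswalk (illustrative CLI sketch)."
    )
    # Default of 2018 mirrors the data year another issue reports the script using.
    parser.add_argument("--data-year", type=int, default=2018,
                        help="EIA-860 data year to build the crosswalk from")
    # Hypothetical option; ties in with the input/output directory issue.
    parser.add_argument("--output-dir", default="output",
                        help="directory to write crosswalk files to")
    return parser


# Parse an explicit argument list so the example runs without a shell.
args = build_parser().parse_args(["--data-year", "2020"])
print(args.data_year, args.output_dir)  # 2020 output
```

Keeping the parser in its own function means the packaged general functions can be called directly with the same parameters, without going through the command line.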

Investigate fuel type as a way to filter out incorrect fuzzy matches

Depends on monitoring plan data

Updates for recent data years

Missing boiler ID associations

According to the crosswalk documentation, the crosswalk uses the EIA-860 boiler generator association (BGA) table as one of its inputs, but it does not seem to include the complete set of associations in the BGA table. I'm not sure whether this is intentional (i.e., your goal is only to crosswalk CAMD unit IDs with EIA generator IDs) or unintentional.

For example, for plant 1391, the crosswalk associates CAMD unit ID 1A only with EIA generator ID 1A and boiler ID 1A. However, the EIA-860 BGA table shows that generator 1A is associated not only with boiler 1A, but also with boilers 2A, 3A, and 9.

It might be helpful in the next release to include the full set of boiler-generator associations for each CAMD unit ID. Otherwise, I would suggest documenting that users need to merge the crosswalk with the BGA table to get the full set of associations.
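The merge suggested here can be sketched as a join on EIA plant ID and generator ID. This is a simplified illustration using the plant 1391 example from this issue; the function name `expand_with_bga` and the tuple layouts are assumptions, not the structure of the actual crosswalk or BGA files.

```python
from collections import defaultdict


def expand_with_bga(crosswalk_rows, bga_rows):
    """Attach every BGA boiler to each crosswalk row for the same generator.

    crosswalk_rows: (eia_plant_id, camd_unit_id, eia_generator_id, eia_boiler_id)
    bga_rows:       (eia_plant_id, eia_generator_id, eia_boiler_id)
    """
    boilers = defaultdict(list)
    for plant, gen, boiler in bga_rows:
        if boiler not in boilers[(plant, gen)]:
            boilers[(plant, gen)].append(boiler)

    expanded = []
    for plant, unit, gen, boiler in crosswalk_rows:
        # Fall back to the crosswalk's own boiler if the BGA has no entry.
        for b in boilers.get((plant, gen), [boiler]):
            row = (plant, unit, gen, b)
            if row not in expanded:
                expanded.append(row)
    return expanded


crosswalk_rows = [("1391", "1A", "1A", "1A")]
bga_rows = [("1391", "1A", "1A"), ("1391", "1A", "2A"),
            ("1391", "1A", "3A"), ("1391", "1A", "9")]

expanded = expand_with_bga(crosswalk_rows, bga_rows)
for row in expanded:
    print(row)
```

With these inputs, CAMD unit 1A ends up associated with boilers 1A, 2A, 3A, and 9 through generator 1A. Whether all four boilers truly correspond to that one CAMD unit is a separate question that the join alone cannot answer.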

Unmatched units are missing SEQUENCE_NUMBER

Should unmatched units have a sequence number and if so, should they be at the bottom or mixed with the rest sorted by PLANT_ID, UNIT_ID, BOILER_ID?
