
Standardized Emission and Waste Inventories (StEWI)

DOI: 10.3390/app12073447 | DOI: 10.23719/1526441

StEWI is a collection of Python modules that provide processed USEPA facility-based emission and waste generation inventory data in standard tabular formats. The standard outputs may be further aggregated or filtered based on given criteria, and can be combined based on common facility and flows across the inventories.

StEWI consists of a core module, stewi, that digests and provides the USEPA inventory data in standard formats. Two matcher modules, facilitymatcher and chemicalmatcher, provide common IDs for facilities and flows across inventories, which are used by the stewicombo module to combine the data and, optionally, remove overlaps and avoid double counting of groups of chemicals based on user preferences.

StEWI v1 was peer-reviewed internally at USEPA and externally through Applied Sciences. An article describing StEWI was published in a special issue of Applied Sciences: Advanced Data Engineering for Life Cycle Applications.

USEPA Inventories Covered By Data Reporting Year (current version)

| Source | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Discharge Monitoring Reports* | x | x | x | x | x | x | x | x | x | x | x |
| Greenhouse Gas Reporting Program | x | x | x | x | x | x | x | x | x | x | x |
| Emissions & Generation Resource Integrated Database |  |  |  | x |  | x |  | x | x | x | x |
| National Emissions Inventory** | x | i | i | x | i | i | x | i | i | x |  |
| RCRA Biennial Report* | x |  | x |  | x |  | x |  | x |  |  |
| Toxic Release Inventory* | x | x | x | x | x | x | x | x | x | x | x |

*Earlier data exist and are accessible but have not been validated

**Only point sources are included at this time from NEI. i = interim years between triennial releases, accessed through the Emissions Inventory System; these are not validated.

Standard output formats

The core stewi module produces the following output formats:

Flow-By-Facility: Each row represents the total amount of release or waste of a single type in a given year from the given facility.

Flow-By-Process: Each row represents the total amount of release or waste of a single type in a given year from a specific process within the given facility. Applicable only to NEI and GHGRP.

Facility: Each row represents a unique facility in a given inventory and given year.

Flow: Each row represents a unique flow (substance or waste) in a given inventory and given year.
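As a rough illustration (the column names follow those that appear in issue examples later in this document, not an official schema), the relationship between these formats can be pictured with a small table:

```python
import pandas as pd

# Minimal illustration of the Flow-By-Facility shape: one row per
# (facility, flow) total for a given year. Column names mirror
# StEWI output examples (FacilityID, FlowName, FlowAmount); data invented.
fbf = pd.DataFrame({
    "FacilityID": ["10123", "10123", "20456"],
    "FlowName": ["Carbon dioxide", "Methane", "Carbon dioxide"],
    "FlowAmount": [1_006_345.5, 12.3, 88_000.0],  # kg
    "Compartment": ["air", "air", "air"],
})

# A Facility table has one row per unique FacilityID, and a Flow
# table one row per unique FlowName.
facilities = fbf[["FacilityID"]].drop_duplicates()
flows = fbf[["FlowName"]].drop_duplicates()
print(len(facilities), len(flows))  # 2 2
```
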

The chemicalmatcher module produces:

Chemical Matches: Each row provides a common identifier for an inventory flow chemical

The facilitymatcher module produces:

Facility Matches: Each row provides a common identifier for an inventory facility

The stewicombo module produces:

Flow-By-Facility Combined: Analogous to the Flow-By-Facility output, with chemical and facility matches added

Data Processing

The following describes details related to dataset access, processing, and validation.

DMR

Processing of the DMR uses the custom search option of the Water Pollutant Loading Tool with the following parameters:

  • Parameter grouping: On - applies a parameter grouping function to avoid double-counting loads for pollutant parameters that represent the same pollutant
  • Detection limit: Half - set all non-detects to ½ the detection limit
  • Estimation: On - estimates loads when monitoring data are not reported for one or more monitoring periods in a reporting year
  • Nutrient Aggregation: On - Nitrogen and Phosphorus flows are converted to N and P equivalents

For validation, the sum of facility releases (excluding N & P) is compared against reported state totals. Some validation issues are expected due to differences in the default parameters used by the Water Pollutant Loading Tool when calculating state totals.

eGRID

eGRID data are sourced from EPA's eGRID site. For validation, the sum of facility releases is compared against reported U.S. totals by flow.

GHGRP

GHGRP data are sourced from EPA's Envirofacts API. For validation, the sum of facility releases by subpart is compared against reported U.S. totals by subpart and flow. The validation of some flows (HFCs, HFEs, and PFCs) is reported in carbon dioxide equivalents. Mixed reporting of these flows in the source data, in units of mass or carbon dioxide equivalents, results in validation issues.

NEI

NEI data are downloaded from the EPA Emissions Inventory System (EIS) Gateway and hosted on EPA Data Commons for access by StEWI. For validation, the sum of facility releases is compared against reported totals by flow. Validation is only available for triennial datasets.

RCRAInfo

RCRAInfo data are sourced from the Public Data Files. For validation, the sum of facility waste generation is compared against reported state totals as calculated for the National Biennial Report.

TRI

TRI data are sourced from the Basic Plus Data files. For validation, the sum of facility releases is compared to national totals by flow from the TRI Explorer.

Combined Inventories

The stewicombo module combines inventory data from within and across selected inventories by matching facilities in the Facility Registry Service (FRS) and chemical flows using the Substance Registry Service (SRS). If the remove_overlap parameter is set to True (the default), stewicombo combines records using the following default logic:

  • Records that share a common compartment, SRS ID and FRS ID within an inventory are summed.
  • Records that share a common compartment, SRS ID and FRS ID across inventories are assessed by compartment preference (see INVENTORY_PREFERENCE_BY_COMPARTMENT).
  • Additional steps are taken to avoid overlap of:
    • nutrient flow releases to water between the TRI and DMR
    • particulate matter releases to air reflecting PM10 and PM2.5 in the NEI
    • Volatile Organic Compound (VOC) releases to air for individually reported VOCs and grouped VOCs
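The default logic above can be sketched on toy data. This is a simplified sketch, not stewicombo's actual implementation; the column names mirror StEWI's combined output and the preference dictionary echoes INVENTORY_PREFERENCE_BY_COMPARTMENT:

```python
import pandas as pd

# Invented records: a flow reported twice within NEI and once in eGRID.
pref = {"air": ["eGRID", "GHGRP", "NEI", "TRI"], "water": ["DMR", "TRI"]}

records = pd.DataFrame([
    ("F1", "S1", "air", "NEI", 40.0),
    ("F1", "S1", "air", "NEI", 60.0),
    ("F1", "S1", "air", "eGRID", 90.0),
], columns=["FRS_ID", "SRS_ID", "Compartment", "Source", "FlowAmount"])

# 1) Sum records sharing compartment, SRS ID and FRS ID within an inventory.
summed = (records.groupby(["FRS_ID", "SRS_ID", "Compartment", "Source"],
                          as_index=False)["FlowAmount"].sum())

# 2) Across inventories, keep the most-preferred source per compartment.
def pick(group):
    for src in pref[group["Compartment"].iloc[0]]:
        hit = group[group["Source"] == src]
        if not hit.empty:
            return hit
    return group

combined = (summed.groupby(["FRS_ID", "SRS_ID", "Compartment"],
                           group_keys=False).apply(pick))
print(combined)  # keeps the single eGRID row (FlowAmount 90.0)
```
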

Installation Instructions

Install a release directly from GitHub using pip. From a command line interface, run:

pip install git+https://github.com/USEPA/standardizedinventories.git@v1.1.0#egg=StEWI

where 'v1.1.0' can be replaced with the version you wish to use from Releases.

Alternatively, to install from the most current point on the repository:

git clone https://github.com/USEPA/standardizedinventories.git
cd standardizedinventories
pip install . # or pip install -e . for devs

Secondary Context Installation Steps

To enable calculation and assignment of urban/rural secondary contexts, refer to esupy's README.md for installation instructions; these may require a copy of the env_sec_ctxt.yaml file included here.

Data Products

Output of StEWI can be accessed for selected releases without having to run StEWI. See the Data Product Links page for direct links to StEWI output files in Apache Parquet format.

Wiki

See the Wiki for instructions on installation and use and for citation and contact information.

Disclaimer

The United States Environmental Protection Agency (EPA) GitHub project code is provided on an "as is" basis and the user assumes responsibility for its use. EPA has relinquished control of the information and no longer has responsibility to protect the integrity, confidentiality, or availability of the information. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by EPA. The EPA seal and logo shall not be used in any manner to imply endorsement of any commercial product or activity by EPA or the United States Government.


Contributors

a-w-beck, astarr8181, bergmamp, bl-young, dyoung11, ericmbell1, greatest125, jodhernandezbe, liadoverg, matthewlchambers, moli7, rwashing523, tjlca, vlahm, wesingwersen


Issues

stewicombo removing records where FRS_ID is null

e.g.
#! beware long run time
inventories_of_interest = {'eGRID': '2016', 'TRI': '2016', 'NEI': '2016', 'RCRAInfo': '2015'}
emissions_and_wastes_by_facility = stewicombo.combineInventoriesforFacilitiesinOneInventory("eGRID",inventories_of_interest,filter_for_LCI=True)
len(emissions_and_wastes_by_facility[emissions_and_wastes_by_facility['FRS_ID'].isnull()])
#0
#Null for FRS_ID removed


This should not be the case for eGRID, where not all facilities have FRS_IDs.

Refactor NEI.py to function like TRI/RCRAInfo

NEI.py would ideally function like the recently refactored TRI.py and RCRAInfo.py, with a main function that accepts parameters via argparse for year and other options. The code currently requires a code change to run a different year, which is not ideal.
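A minimal sketch of what such a main function could look like, modeled on the `python -m stewi.TRI A -Y 2016` style invocation shown elsewhere in this document. The flag names are assumptions, not the actual stewi CLI:

```python
import argparse

# Hypothetical CLI skeleton modeled on `python -m stewi.TRI A -Y 2016`;
# option and flag names here are assumptions, not the real stewi interface.
def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Process NEI data")
    parser.add_argument("Option",
                        help="A = download and process data")
    parser.add_argument("-Y", "--Year", nargs="+",
                        help="inventory year(s), e.g. 2017")
    return parser.parse_args(argv)

args = parse_args(["A", "-Y", "2017"])
print(args.Option, args.Year)  # A ['2017']
```
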

Comparison with the GHGRP FLIGHT tool: 2 facilities missing

These two facilities are missing from the STEWI generated facility file for GHGRP 2017. I have highlighted the GHGRP ID and Facility ID.

| REPORTING YEAR | FACILITY NAME | GHGRP ID | REPORTED ADDRESS | LATITUDE | LONGITUDE | CITY NAME | COUNTY NAME | STATE | ZIP CODE | PARENT COMPANIES | GHG QUANTITY (METRIC TONS CO2e) | SUBPARTS | FRS ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017 | Alamo San Antonio Cement Plant | 1007208 | 6055 West Green Mountain Road | 29.610148 | -98.367792 | San Antonio | BEXAR COUNTY | TX | 78266 | ALAMO CEMENT CO (100%) | 807287 | H |  |
| 2017 | Argos Puerto Rico, Corp. | 1006164 | Road PR-2, Km 26.7 | 18.3944 | -66.2976 | Dorado | DORADO MUNICIPIO | PR | 646 | ESSROC CEMENT CO (100%) | 116460 | C,H |  |

Fix DMR urls

ECHO notification:

ECHO's data service URLs are changing. Please update your code and bookmarks as soon as possible to maintain access to ECHO data. Users will need to adjust their URLs to start with https://echodata.epa.gov instead of https://ofmpub.epa.gov. The service documentation listed below has been changed to use the new URL.

FRS ID to multiple eGRID IDs

I have at least one instance (FRS ID 110021350946) where there are clearly two eGRID IDs (55077 and 56944) associated with the same FRS ID, which is understandable, but all of the NEI emissions are tagged as 56944 (solar PV plant) while eGRID emissions are tagged as 55077 (NGCC plant). When this data gets pulled into the eLCI, we end up with a large amount of NOx emissions (presumably from the NGCC plant) getting divided by relatively smaller electricity generation, and a really large NOx emission rate. I imagine the correct answer in this case would be to assign all reported emissions to the NGCC plant. Alternatively, both eGRID IDs could be passed along so that when generation data is matched later, the generation from both eGRID IDs can be pulled and summed.

[TRI] TRI -A download breaking

Tried downloading some TRI data, but ran into the same issue for both 2016 and 2017

➜ python -m stewi.TRI A -Y 2016
INFO downloading TRI files from source for 2016
Traceback (most recent call last):
  File "/Users/michaellong/miniconda/envs/epa/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/michaellong/miniconda/envs/epa/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/michaellong/miniconda/envs/epa/lib/python3.8/site-packages/stewi/TRI.py", line 413, in <module>
    extract_TRI_data_files(link_zip_TRI, TRIFiles, TRIyear)
  File "/Users/michaellong/miniconda/envs/epa/lib/python3.8/site-packages/stewi/TRI.py", line 76, in extract_TRI_data_files
    for line in txtfile:
  File "/Users/michaellong/miniconda/envs/epa/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 153: invalid start byte
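Byte 0x92 is the right single quotation mark in Windows-1252, a common encoding for EPA text exports. A possible workaround (an assumption, not the project's adopted fix) is to decode with cp1252 or to tolerate bad bytes:

```python
# 0x92 is not valid UTF-8 but decodes as the right single quote (’)
# in Windows-1252, which older EPA text exports often use.
raw = b"O\x92Hare Facility"
fixed = raw.decode("cp1252")
print(fixed)  # O’Hare Facility

# A tolerant file-reading pattern (hypothetical filename):
# with open("US_1a_2016.txt", encoding="cp1252") as txtfile:
#     for line in txtfile:
#         ...
```
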

GHGRP column format change

downloaded csv tables from GHGRP have new string appended to the columns causing issues with the column parsing here:

# for all columns in the temporary dataframe, remove subpart-specific prefixes
for col in table_df:
    table_df.rename(columns={col: col[len(table) + 1:]}, inplace=True)

`pyarrow` dependency causing problems

Line 67 in NEI.py specifies the pyarrow engine for pd.read_parquet, but the current Windows Anaconda distribution of pyarrow does not include support for the snappy codec, so I have to use the fastparquet engine instead. On the other hand, I see that @bl-young had some issues with fastparquet here. Could we leave the engine unspecified in the call to pd.read_parquet, so the user can use whichever engine works better for them?

Allow maximum reported results by inventory

But what if a facility reports 100 kg of formaldehyde to TRI and 10 kg via DMR, and DMR takes precedence in INVENTORY_PREFERENCE_BY_COMPARTMENT? Then we would want stewicombo to return 10 kg for DMR and 90 kg for TRI.
Is there a way to achieve this result?

Originally posted by @vlahm in #129 (comment)
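One way to sketch the requested behavior (illustrative only; not an existing stewicombo option) is to keep the preferred inventory's amount and assign only the remainder to the less-preferred inventories:

```python
# Toy numbers from the example above: 100 kg reported to TRI, 10 kg to
# DMR, with DMR preferred for water. Keep the preferred amount and
# allocate only the remainder, so totals reflect the maximum reported.
reports = {"TRI": 100.0, "DMR": 10.0}
preference = ["DMR", "TRI"]  # DMR takes precedence

preferred = preference[0]
allocated = {preferred: reports[preferred]}
remainder = max(reports.values()) - reports[preferred]
for src in preference[1:]:
    allocated[src] = max(min(reports[src], remainder), 0.0)
    remainder -= allocated[src]

print(allocated)  # {'DMR': 10.0, 'TRI': 90.0}
```
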

case change for TRI flows

Data downloaded from the Basic Plus Data Files are no longer in ALL CAPS. This is causing issues with validation, likely with flow mapping, and possibly with chemicalmatcher.
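A simple normalization sketch (hypothetical names; not the project's actual matching code) shows how case-insensitive comparison sidesteps the change:

```python
# Illustrative flow names: normalize case before comparing against
# reference or validation lists so mixed-case downloads still match.
downloaded = ["Benzene", "toluene", "LEAD"]
reference = {"BENZENE", "TOLUENE", "LEAD"}

matched = [name for name in downloaded if name.upper() in reference]
print(matched)  # ['Benzene', 'toluene', 'LEAD']
```
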

Error in DMR query

Query returns:

{'ErrorMessage': 'We could not process your query. Please consult our Web service documentation [https://echo.epa.gov/tools/web-services] and try again, or contact ECHO Support [https://echo.epa.gov/resources/general-info/contact-us] if you need assistance.'}

Changing INVENTORY_PREFERENCE_BY_COMPARTMENT yields different totals

Hello,

Am I right in thinking the INVENTORY_PREFERENCE_BY_COMPARTMENT parameter controls which inventory takes precedence when there is overlap? If so, changing this parameter shouldn't result in different flow totals--only different allocations of flow between inventories. The code and comments below demonstrate that changing the INVENTORY_PREFERENCE_BY_COMPARTMENT parameter currently causes different total (sum) flows to be returned.

# in stewicombo/globals.py:

INVENTORY_PREFERENCE_BY_COMPARTMENT = {"air": ["eGRID", "GHGRP", "NEI", "TRI"],
                                       "water": ["DMR", "TRI"],
                                       "soil": ["TRI"],
                                       "waste": ["RCRAInfo", "TRI"],
                                       "output": ["eGRID"]}

---

# separate script:

from stewicombo import combineFullInventories

cmb = combineFullInventories({'TRI':2015, 'NEI':2015, 'DMR':2015}, filter_for_LCI = False)
cmb['FlowAmount'].sum()

# result: 66912900555.14437

---

# back in stewicombo/globals.py:

INVENTORY_PREFERENCE_BY_COMPARTMENT = {"air": ["TRI", "NEI", "eGRID", "GHGRP"],
                                       "water": ["TRI", "DMR"],
                                       "soil": ["TRI"],
                                       "waste": ["TRI", "RCRAInfo"],
                                       "output": ["eGRID"]}

# reinstall StEWI-1.0.5

cmb = combineFullInventories({'TRI':2015, 'NEI':2015, 'DMR':2015}, filter_for_LCI = False)
cmb['FlowAmount'].sum()

# result: 67048867763.67558
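The expected invariant can be checked on toy data: if overlap removal merely selects one source per overlapping group and the candidate amounts agree, the preference order cannot change the grand total. The observed difference above therefore suggests either differing reported amounts between inventories or records being kept or dropped inconsistently. A sketch (invented numbers):

```python
# Each dict is one overlapping group: candidate amounts per source.
overlap_groups = [
    {"TRI": 5.0, "DMR": 5.0},   # same flow reported identically to both
    {"NEI": 2.0},               # reported to only one inventory
]

def total(preference):
    # Select the first preferred source present in each group and sum.
    out = 0.0
    for group in overlap_groups:
        for src in preference:
            if src in group:
                out += group[src]
                break
    return out

# Different preference orders, same total when candidate amounts agree.
print(total(["DMR", "TRI", "NEI"]), total(["NEI", "TRI", "DMR"]))  # 7.0 7.0
```
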

[GHGRP] Subpart Z missing (2016+)

Subpart Z data after 2015 are not posted within Envirofacts. The data exist and are included in PUB_DIM_FACILITY (used for validation). This will result in validation errors for subpart Z.

Envirofacts staff have been notified of the issue.

pip install errors

While running the command pip install git+https://github.com/USEPA/[email protected]#egg=standardizedinventories (also tried 0.9.7), I'm receiving the following error:

WARNING: Generating metadata for package standardizedinventories produced metadata for project name stewi. Fix your #egg=standardizedinventories fragments.
WARNING: Discarding git+https://github.com/USEPA/[email protected]#egg=standardizedinventories. Requested stewi from git+https://github.com/USEPA/[email protected]#egg=standardizedinventories has inconsistent name: filename has 'standardizedinventories', but metadata has 'StEWI'
ERROR: Could not find a version that satisfies the requirement standardizedinventories (unavailable) (from versions: none)
ERROR: No matching distribution found for standardizedinventories (unavailable)

Using python 3.8.10

config.yaml is not being included when installing

After installing electricitylci via pip, I receive the error:
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/jamiesom/AppData/Roaming/Python/Python37/site-packages/chemicalmatcher/config.yaml'
I suspect the package data in setup.py here needs to be modified to include the file.
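A sketch of the kind of setup.py change suggested above (the keys and values are illustrative; the project's actual setup.py may differ):

```python
# Hypothetical setup.py fragment ensuring non-Python files such as
# config.yaml ship with the package; exact arguments are an assumption.
from setuptools import setup, find_packages

setup(
    name="StEWI",
    packages=find_packages(),
    include_package_data=True,
    package_data={"chemicalmatcher": ["config.yaml"]},
)
```
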

Create way to save output as CSV instead of Parquet

Firstly, this is some amazing software. Thank you for creating and actively developing it.

Working with the Parquet file format can be a hurdle for some users, can there be an option that might save the Parquet to a CSV?

Thank you!

Server not compatible with RFC 5746 secure renegotiation

I realize this isn't the ideal place to raise such an issue, but maybe you can pass it on to the relevant parties. Modern SSL clients expect servers to adhere to this proposed TLS standard which prevents a specific type of MitM attack. Attempting to use facilitymatcher with OpenSSL3 results in the following error:

SSLError: HTTPSConnectionPool(host='ofmext.epa.gov', port=443): Max retries exceeded with url: /FLA/www3/state_files/national_combined.zip (Caused by SSLError(SSLError(1, '[SSL: UNSAFE_LEGACY_RENEGOTIATION_DISABLED] unsafe legacy renegotiation disabled (_ssl.c:997)

I received this error by running facilitymatches = facilitymatcher.get_matches_for_inventories(["TRI"]) on Ubuntu 20.04, under Python 3.10.5. More information can be found here.

Facility missing from GHGRP (2017) flow by facility/flow by process file

1007033 CEMEX de Puerto Rico, Inc. State Road 123, kilometer 8.0 Ponce PR 733 18.02305 -66.63888 PONCE MUNICIPIO 327310

1006164 Argos Puerto Rico, Corp. Road PR-2, Km 26.7 Dorado PR 646 18.3944 -66.2976 DORADO MUNICIPIO 212312

The facilities are not missing from the GHGRP Facility file.

However, emissions from these facilities are missing from the Flow-By-Facility and Flow-By-Process files.

I checked the raw GHGRP database (2017) and the facility has reported emissions.

These are both in Puerto Rico.

TRI inventory no longer accessible through stewi after webpage change

Hello. It looks like the download interface on https://www.epa.gov/toxics-release-inventory-tri-program/tri-basic-data-files-calendar-years-1987-present changed in the last few months, which now prevents link_zip from identifying the correct download URLs.

Here is an example of a query that fails:

>>> stewi.getInventory('TRI', 2017)

MissingSchema: Invalid URL '2017': No scheme supplied. Perhaps you meant http://2017?

And here is what the scraped webpage linked above used to look like.

GHGRP - C_CONFIG... download breaking

It now seems to be breaking while downloading the data. It might even be breaking when getting the table URL, since I'm not seeing the log statement that precedes the generate_url function call. Since it's breaking somewhere in there, the try block fails, which means that table_df isn't actually assigned, resulting in the following error:

➜ python -m stewi.GHGRP A -Y 2019
INFO downloading and processing GHGRP data to /Users/michaellong/Library/Application Support/stewi/GHGRP Data Files/tables/2019/
INFO Downloading C_CONFIGURATION_LEVEL_INFO (rows: 15873)
Traceback (most recent call last):
  File "/Users/michaellong/miniconda/envs/epa/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/michaellong/miniconda/envs/epa/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/michaellong/miniconda/envs/epa/lib/python3.8/site-packages/stewi/GHGRP.py", line 813, in <module>
    main()
  File "/Users/michaellong/miniconda/envs/epa/lib/python3.8/site-packages/stewi/GHGRP.py", line 665, in main
    ghgrp1 = download_and_parse_subpart_tables(year)
  File "/Users/michaellong/miniconda/envs/epa/lib/python3.8/site-packages/stewi/GHGRP.py", line 304, in download_and_parse_subpart_tables
    table_df = import_or_download_table(filepath, subpart_emissions_table,
  File "/Users/michaellong/miniconda/envs/epa/lib/python3.8/site-packages/stewi/GHGRP.py", line 259, in import_or_download_table
    for col in table_df:
UnboundLocalError: local variable 'table_df' referenced before assignment

Some error statements in those blocks of code would also be useful for tracking down errors for those of us running from a pip install rather than from a local copy of the repo.

Originally posted by @michael-long88 in #76 (comment)

Speed up stewicombo

Reported by @gschivley
I did some line profiling (%lprun magic) and most of the time is in just a few places. The function aggregate_and_remove_overlap(inventories) takes 99.7% of the time. Within aggregate_and_remove_overlap(), 62% of the time is spent on line 111 (df_new = grouped_by_src.agg(func_cols_map)) and 31% on line 115 (df_new = grouped.apply(get_by_preference)). I don't have a clear picture in my head of what's happening through the whole process, but I'm attaching the lprun results, broken out by a few of the functions. Most of the time is spent on line 111 of aggregate_and_remove_overlap (original line numbers; I'm attaching a second set of lprun results with 2 new lines at 108/9).

  • func_cols_map is a dictionary with functions to apply to each column. One of the functions is get_first_item, which uses .iloc[0] to select the first row of the column. This is relatively slow and can be replaced with the built-in pandas method .first(). To implement this in the aggregate method, you only need to assign the string "first" as the value for each key in func_cols_map. The .first() method is around 50x faster than .iloc[0], and shaves a couple of minutes off the total time. You can see my change in lines 108/9.
How to interpret the lprun results: each profiled function has a "Total time" at the top, followed by the number of times each line is run, the total time per line, the time per hit, and the % of time spent on each line within that function.

The results below are for the function get_by_preference, used in a groupby().apply() on line 115 of overlaphandler.py. 55.5 seconds are spent within this function, and 91% of that time is spent iterating through the rows of each group. I might be missing something here, but slicing the dataframes is probably much faster than iterating over rows.

Total time: 55.5582 s
File: /Users/standardizedinventories/stewicombo/overlaphandler.py
Function: get_by_preference at line 36

Line  Hits    Time        Per Hit  % Time  Line Contents
37    58570   130121.0    2.2      0.2     preferences = INVENTORY_PREFERENCE_BY_COMPARTMENT[group.name]
39    119678  98851.0     0.8      0.2     for pref in preferences:
40    180786  50576395.0  279.8    91.0    for index, row in group.iterrows():
41    119678  4543449.0   38.0     8.2     if pref == row[SOURCE_COL]:
42    58570   209420.0    3.6      0.4     return row

What's important to note here is that while 55 seconds are spent within the function, 261 seconds are spent on line 115 (again, these times are just for eGRID and TRI). So ~80% of the time seems to be overhead from applying the function. Figuring out a way to achieve the same goal without the apply, maybe by slicing rather than using groupby, could save even more time.

The comment at line 113 reads: "If we have 2 or more duplicates with same compartment use INVENTORY_PREFERENCE_BY_COMPARTMENT".

Line  Hits   Time         Per Hit  % Time  Line Contents
114   29285  10662992.0   364.1    1.3     grouped = df_new.groupby(COMPARTMENT_COL)
115   29285  261084505.0  8915.3   30.8    df_new = grouped.apply(get_by_preference)

And finally, since there are so many loops over grouped items, it might be worth trying
to parallelize the code. It's pretty easy with joblib.
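The slicing idea above can be sketched on toy data (not the project's code): rank each record's Source by its position in the preference list and keep the best-ranked row per group, avoiding both iterrows and a per-group apply:

```python
import pandas as pd

# Vectorized alternative to the row-iterating get_by_preference: rank
# each record's Source by its preference-list position, then keep the
# best-ranked row per (facility, compartment) group. Toy data; column
# names mirror the combined output.
pref = {"air": ["eGRID", "GHGRP", "NEI", "TRI"], "water": ["DMR", "TRI"]}

df = pd.DataFrame({
    "FRS_ID": ["F1", "F1", "F2", "F2"],
    "Compartment": ["air", "air", "water", "water"],
    "Source": ["TRI", "eGRID", "TRI", "DMR"],
    "FlowAmount": [1.0, 2.0, 3.0, 4.0],
})

df["rank"] = df.apply(lambda r: pref[r["Compartment"]].index(r["Source"]),
                      axis=1)
best = (df.sort_values("rank")
          .drop_duplicates(subset=["FRS_ID", "Compartment"], keep="first")
          .drop(columns="rank"))
print(best.sort_values("FRS_ID")["Source"].tolist())  # ['eGRID', 'DMR']
```
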

stewicombo - duplicates when facility has more than one FRS ID

found by @TJTapajyoti
inventories_of_interest ={'eGRID': 2016, 'TRI': 2016, 'NEI': 2016, 'RCRAInfo': 2015}
emissions_and_wastes_by_facility = stewicombo.combineInventoriesforFacilitiesinOneInventory("eGRID",inventories_of_interest,filter_for_LCI=True)

returns duplicates like this

FRS_ID FacilityID FlowAmount FlowName Source
110001536197 10123 1006345.47519044 Carbon dioxide eGRID
110017313423 10123 1006345.47519044 Carbon dioxide eGRID

Use git lfs for any binary formats

The NEI parquet was added before initiating git lfs. This is OK, but please initiate it for this repo and add the .parquet extension to the list of files for git-lfs to track.

GHGRP 2017 is not working with stewi or stewicombo

Detailed error is below:

runfile('/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/fecm_data_exploration.py', wdir='/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory', current_namespace=True)
/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!
warnings.warn(
INFO GHGRP_2017 not found in /Users/tghosh/Library/Application Support/stewi/flowbyfacility
INFO requested inventory does not exist in local directory, it will be generated...
INFO downloading and processing GHGRP data to /Users/tghosh/Library/Application Support/stewi/GHGRP Data Files/tables/2017
ERROR error in url request
Traceback (most recent call last):
File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/stewi/GHGRP.py", line 130, in get_row_count
table_count = int(table_count[0].firstChild.nodeValue)
IndexError: list index out of range
--- Logging error ---
Traceback (most recent call last):
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/logging/__init__.py", line 1083, in emit
msg = self.format(record)
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/logging/__init__.py", line 927, in format
return fmt.format(record)
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/logging/__init__.py", line 663, in format
record.message = record.getMessage()
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/logging/__init__.py", line 367, in getMessage
msg = msg % self.args
TypeError: %i format: a number is required, not NodeList
Call stack:
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/spyder_kernels/console/main.py", line 24, in
start.main()
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/spyder_kernels/console/start.py", line 332, in main
kernel.start()
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/ipykernel/kernelapp.py", line 712, in start
self.io_loop.start()
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/tornado/platform/asyncio.py", line 199, in start
self.asyncio_loop.run_forever()
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 510, in dispatch_queue
await self.process_one()
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 499, in process_one
await dispatch(*args)
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 406, in dispatch_shell
await result
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 730, in execute_request
reply_content = await reply_content
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/ipykernel/ipkernel.py", line 383, in do_execute
res = shell.run_cell(
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/ipykernel/zmqshell.py", line 528, in run_cell
return super().run_cell(*args, **kwargs)
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 2975, in run_cell
result = self._run_cell(
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3030, in _run_cell
return runner(coro)
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/IPython/core/async_helpers.py", line 78, in pseudo_sync_runner
coro.send(None)
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3257, in run_cell_async
has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3473, in run_ast_nodes
if (await self.run_code(code, result, async_=asy)):
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3553, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "/var/folders/sm/spdh5zkx26v6vh_fk8w7p8456l6sp0/T/ipykernel_18885/414800848.py", line 1, in <cell line: 1>
runfile('/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/fecm_data_exploration.py', wdir='/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory', current_namespace=True)
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/spyder_kernels/customize/spydercustomize.py", line 585, in runfile
exec_code(file_code, filename, ns_globals, ns_locals,
File "/Users/tghosh/miniconda3/envs/stewi/lib/python3.9/site-packages/spyder_kernels/customize/spydercustomize.py", line 465, in exec_code
exec(compiled, ns_globals, ns_locals)
File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/fecm_data_exploration.py", line 38, in
save_data(inventory,year)
File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/fecm_data_exploration.py", line 14, in save_data
flow_by_facility = stewi.getInventory(inventory, year, 'flowbyfacility',filters=['filter_for_LCI'])
File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/stewi/__init__.py", line 80, in getInventory
inventory = read_inventory(inventory_acronym, year, f,
File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/stewi/globals.py", line 322, in read_inventory
generate_inventory(inventory_acronym, year)
File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/stewi/globals.py", line 356, in generate_inventory
GHGRP.main(Option = 'A', Year = [year])
File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/stewi/GHGRP.py", line 696, in main
ghgrp1 = download_and_parse_subpart_tables(year, m)
File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/stewi/GHGRP.py", line 286, in download_and_parse_subpart_tables
table_df = import_or_download_table(filepath, subpart_emissions_table,
File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/stewi/GHGRP.py", line 240, in import_or_download_table
log.info('Downloading %s (rows: %i)', table, row_count)
Message: 'Downloading %s (rows: %i)'
Arguments: ('C_CONFIGURATION_LEVEL_INFO', [])
Traceback (most recent call last):

File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/fecm_data_exploration.py", line 38, in
save_data(inventory,year)

File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/fecm_data_exploration.py", line 14, in save_data
flow_by_facility = stewi.getInventory(inventory, year, 'flowbyfacility',filters=['filter_for_LCI'])

File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/stewi/init.py", line 80, in getInventory
inventory = read_inventory(inventory_acronym, year, f,

File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/stewi/globals.py", line 322, in read_inventory
generate_inventory(inventory_acronym, year)

File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/stewi/globals.py", line 356, in generate_inventory
GHGRP.main(Option = 'A', Year = [year])

File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/stewi/GHGRP.py", line 696, in main
ghgrp1 = download_and_parse_subpart_tables(year, m)

File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/stewi/GHGRP.py", line 286, in download_and_parse_subpart_tables
table_df = import_or_download_table(filepath, subpart_emissions_table,

File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/stewi/GHGRP.py", line 242, in import_or_download_table
table_df = download_chunks(table=table, table_count=row_count, m=m,

File "/Users/tghosh/OneDrive - NREL/work_NREL/FECM/Industrial-Emissions-Inventory/stewi/GHGRP.py", line 141, in download_chunks
while row_start <= table_count:

TypeError: '<=' not supported between instances of 'int' and 'NodeList'
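
The `Arguments: ('C_CONFIGURATION_LEVEL_INFO', [])` line shows that the row count reached the logger, and later the download loop, as an empty NodeList rather than an int. A defensive sketch of a fix (the `coerce_row_count` helper below is hypothetical, not StEWI's actual code) is to coerce the parsed count to an int before the comparison in the loop:

```python
def coerce_row_count(raw):
    """Return an int row count from an int, a numeric string, or a
    node-list-like object returned by an XML parser."""
    if isinstance(raw, int):
        return raw
    try:
        return int(raw)  # numeric string, e.g. "1234"
    except (TypeError, ValueError):
        pass
    # NodeList-like sequence: an empty response means nothing to download;
    # otherwise take the first node's text content as the count.
    if hasattr(raw, '__len__'):
        if len(raw) == 0:
            return 0
        first = raw[0]
        text = getattr(first, 'text', None) or str(first)
        return int(text)
    raise TypeError(f"Cannot interpret row count: {raw!r}")

row_count = coerce_row_count([])  # empty NodeList, as in the traceback above
row_start = 0
while row_start <= row_count:     # no longer raises TypeError
    row_start += 1
```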

[GHGRP] Validation issues in Subparts I, L, and T

Minor validation issues persist in the GHGRP data for Subparts I, L, and T for the less common GHGs. This is because the source data mix reporting in mass and in kg CO2e, which makes validation challenging.

Issue saving mismatched types to parquet

For example, the following code, where the eGRID year is passed as an int instead of a string, will raise an error when saving the inventory:

```python
import stewicombo
df = stewicombo.combineInventoriesforFacilitiesinBaseInventory(
    "GHGRP", {"NEI": "2018", "GHGRP": "2018", "eGRID": 2018}, remove_overlap=True)
stewicombo.saveInventory('my_file', df, {"NEI": "2018", "GHGRP": "2018", "eGRID": 2018})
```

```
pyarrow.lib.ArrowTypeError: ("Expected bytes, got a 'int' object", 'Conversion failed for column Year with type object')
```
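
A simple workaround (a sketch; `normalize_years` is a hypothetical helper, not part of stewicombo) is to coerce every year value to a string before passing the dictionary to the combine and save functions, so the resulting `Year` column has a uniform string dtype when written to parquet:

```python
def normalize_years(inventory_dict):
    """Coerce all inventory year values to strings (hypothetical helper)."""
    return {inv: str(year) for inv, year in inventory_dict.items()}

# The int 2018 for eGRID becomes the string "2018", avoiding the ArrowTypeError.
inventories = normalize_years({"NEI": "2018", "GHGRP": "2018", "eGRID": 2018})
```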

[GHGRP] Subpart AA data duplicating Subpart C

There appears to be an error in the GHGRP Subpart AA raw data on Envirofacts: all of the Subpart AA records in the `AA_SUBPART_LEVEL_INFORMATION` table actually report Subpart C data. This double-counts some Subpart C emissions and omits Subpart AA emissions entirely.
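
Until the source table is corrected, one defensive workaround (a sketch only, not StEWI's actual handling; the column names below are illustrative) is to drop the mislabeled Subpart AA rows before combining, so the Subpart C totals are not counted twice:

```python
import pandas as pd

# Toy frame standing in for the parsed GHGRP data; column names are illustrative.
df = pd.DataFrame({
    "SUBPART_NAME": ["C", "AA", "C"],
    "GHG_QUANTITY": [10.0, 5.0, 7.0],
})

# The row tagged "AA" is really Subpart C data already counted elsewhere,
# so dropping AA rows prevents the double count.
deduped = df[df["SUBPART_NAME"] != "AA"]
```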
