USEPA/TADA

This R package can be used to compile and evaluate Water Quality Portal (WQP) data for samples collected from surface water monitoring sites on streams and lakes. It can be used to create applications that support water quality programs and help states, tribes, and other stakeholders efficiently analyze the data.

Home Page: https://usepa.github.io/TADA/

License: Creative Commons Zero v1.0 Universal


TADA's Introduction

Welcome to TADA: Tools for Automated Data Analysis!

Tools for Automated Data Analysis, or TADA, is being developed to help States, Tribes (i.e., Tribal Nations, Pueblos, Bands, Rancherias, Communities, Colonies, Towns, Indians, Villages), federal partners, and any other Water Quality Portal (WQP) users (e.g., researchers) efficiently compile and evaluate WQP data collected from water quality monitoring sites. TADA is both a stand-alone R package and a building block to support development of the TADA R Shiny application. We encourage you to read this package's LICENSE and README files (you are here).

Installation

You must first have R and RStudio installed to use the TADA R Package (see instructions below if needed). Our team is actively developing TADA, so we highly recommend updating the TADA R Package and all of its dependency libraries each time you use it. You can install and/or update the TADA R Package and all dependencies by running:

if(!"remotes"%in%installed.packages()){
install.packages("remotes")
}

remotes::install_github("USEPA/TADA", ref = "develop", dependencies = TRUE, force = TRUE)

The TADA R Shiny application can be run on the web (no R or RStudio installation required) or within RStudio. Run the following code within RStudio to install or update, and then run, the most recent version of the TADA R Shiny application:

if(!"remotes"%in%installed.packages()){
install.packages("remotes")
}

remotes::install_github("USEPA/TADAShiny", ref = "develop", dependencies = TRUE, force = TRUE)

TADAShiny::run_app()

Water Quality Portal

In 2012, the WQP was deployed by the U.S. Geological Survey (USGS), the U.S. Environmental Protection Agency (USEPA), and the National Water Quality Monitoring Council to combine and serve water-quality data from numerous sources in a standardized format. The WQP holds over 420 million water quality sample results from over 1,000 federal, state, tribal, and other partners, and is the nation's largest single point of access for water-quality data. Participating organizations submit their data to the WQP using EPA's Water Quality Exchange (WQX), a framework designed to map their data holdings to a common data structure.

Install R and RStudio

  1. To download R, go to https://cran.r-project.org/ and click the link for your computer's operating system in the first box on the page, entitled "Download and Install R".
  2. Clicking your operating system takes you to a new page, which looks slightly different for PCs and Macs.
  3. Download the installer by clicking the appropriate link for your system, then click through the installer prompts on your computer, accepting all defaults.
  4. Next, go to https://posit.co/download/rstudio-desktop/ to download RStudio: scroll down a little and click "Download RStudio".
  5. Again, run the installer, click through the prompts, and accept the defaults.

Note: If you are an EPA employee, please follow the directions at https://work.epa.gov/software/r-software instead of the instructions above.

Open-Source Code Policy

Effective August 8, 2016, the OMB Mandate: M-16-21; Federal Source Code Policy: Achieving Efficiency, Transparency, and Innovation through Reusable and Open Source Software applies to new custom-developed code created or procured by EPA consistent with the scope and applicability requirements of the Office of Management and Budget's (OMB's) Federal Source Code Policy. In general, it states that all new custom-developed code produced by Federal Agencies should be made available and reusable as open-source code.

The EPA specific implementation of OMB Mandate M-16-21 is addressed in the System Life Cycle Management Procedure. EPA has chosen to use GitHub as its version control system as well as its inventory of open-source code projects. EPA uses GitHub to inventory its custom-developed, open-source code and generate the necessary metadata file that is then posted to code.gov for broad reuse in compliance with OMB Mandate M-16-21.

If you have any questions or want to read more, check out the EPA Open Source Project Repo and EPA's Interim Open Source Code Guidance.

License

All contributions to this project will be released under the CC0-1.0 dedication (see the LICENSE file). By submitting a pull request or issue, you are agreeing to comply with this waiver of copyright interest.

Disclaimer

This United States Environmental Protection Agency (EPA) GitHub project code is provided on an "as is" basis and the user assumes responsibility for its use. EPA has relinquished control of the information and no longer has responsibility to protect the integrity, confidentiality, or availability of the information. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by EPA. The EPA seal and logo shall not be used in any manner to imply endorsement of any commercial product or activity by EPA or the United States Government.

Contact

If you have any questions, please reach out to Cristina Mullin ([email protected]).

TADA's People

Contributors

coobr01, cristinamullin, ehinman, elisehinman, hillarymarler, jakegreif, jbousquin, jesseboormanpadgett, kathryn-willi, katiehealy, laurashumway, ldecicco-usgs, mthawley, nx10, renaemyers, zsmith27

TADA's Issues

Research iterating by row efficiently

The row-by-row checking code (using the rowwise and case_when functions) appears in a number of functions (Command+F in the .R files to see which ones use it), but it takes a long time to run, especially for larger datasets. Research other methods of checking data row by row in R and compare their efficiency to the current approach.

Note that the current approach may already be the best option; checking data row by row may be cumbersome regardless of how it's coded.
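
A minimal benchmark sketch of one alternative to compare (the column names are real WQP fields, but the flag logic and test data are only illustrative): case_when() is already vectorized, so the same call without rowwise() should return identical flags far faster.

library(dplyr)

df <- tibble(
  ResultMeasureValue = runif(1e5, 0, 100),
  ActivityMediaName  = sample(c("Water", "Sediment"), 1e5, replace = TRUE)
)

# Current pattern: case_when() is evaluated once per row
slow <- df %>%
  rowwise() %>%
  mutate(Flag = case_when(
    ActivityMediaName != "Water" ~ "MediaNotWater",
    ResultMeasureValue > 90      ~ "HighValue",
    TRUE                         ~ "NoFlag"
  )) %>%
  ungroup()

# Vectorized alternative: drop rowwise() and let case_when() work on whole columns
fast <- df %>%
  mutate(Flag = case_when(
    ActivityMediaName != "Water" ~ "MediaNotWater",
    ResultMeasureValue > 90      ~ "HighValue",
    TRUE                         ~ "NoFlag"
  ))

identical(slow$Flag, fast$Flag)  # TRUE; wrap each pipeline in system.time() to compare run times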

Filtering page functions

Add functions for the filtering page to the Utilities.R file, or create a new filtering.R file to hold these and the other specific filtering functions (continuous data, media not water, etc.). A generic helper is sketched after the field lists below.

Use functions from Jake's filtering vignette: https://usepa.sharepoint.com/:f:/r/sites/WQPDataAssessmentTeam/Shared%20Documents/General/TADA%20Dev/FirstTool_DataDiscoveryandCleaning/Logic-vignettes/Filtering?csf=1&web=1&e=bGWdEh

Fields for full dataset filtering
ActivityTypeCode
ActivityMediaName
ActivityMediaSubdivisionName
ActivityCommentText
MonitoringLocationTypeName
StateName
TribalLandName
OrganizationFormalName
CharacteristicName
HydrologicCondition
HydrologicEvent
BiologicalIntentName
MeasureQualifierCode
ActivityGroup
AssemblageSampledName
ProjectName
CharacteristicNameUserSupplied
DetectionQuantitationLimitTypeName
SampleTissueAnatomyName
LaboratoryName

Fields for characteristic level filtering
ActivityCommentText
ActivityTypeCode
ActivityMediaName
ActivityMediaSubdivisionName
MeasureQualifierCode
MonitoringLocationTypeName
HydrologicCondition
HydrologicEvent
ResultStatusIdentifier
MethodQualifierTypeName
ResultCommentText
ResultLaboratoryCommentText
ResultMeasure/MeasureUnitCode
ResultSampleFractionText
ResultTemperatureBasisText
ResultValueTypeName
ResultWeightBasisText
SampleCollectionEquipmentName
LaboratoryName
MethodDescriptionText
ResultParticleSizeBasisText
SampleCollectionMethod/MethodIdentifier
SampleCollectionMethod/MethodIdentifierContext
SampleCollectionMethod/MethodName
DataQuality/BiasValue
MethodSpeciationName
ResultAnalyticalMethod/MethodName
ResultAnalyticalMethod/MethodIdentifier
ResultAnalyticalMethod/MethodIdentifierContext
AssemblageSampledName
CharacteristicNameUserSupplied
DetectionQuantitationLimitTypeName
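
As a starting point, here is a minimal sketch of a generic field filter (the function name, argument names, and example value are assumptions) that could live in filtering.R and back both of the filtering levels listed above:

library(dplyr)

FilterField <- function(df, field, keep = NULL, remove = NULL) {
  # field:  one of the WQP column names listed above, given as a string
  # keep:   values to retain (NULL = retain all)
  # remove: values to drop after the keep step
  stopifnot(field %in% names(df))
  if (!is.null(keep))   df <- filter(df, .data[[field]] %in% keep)
  if (!is.null(remove)) df <- filter(df, !.data[[field]] %in% remove)
  df
}

# Example: drop field blanks from the full dataset
# profile <- FilterField(profile, "ActivityTypeCode", remove = "Quality Control Sample-Field Blank")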

PotentialDuplicate_RowID

Page Requirements/Standards

All of the flag page functions should be consistent in the following ways:

  • Clean argument indicates whether flag columns should be appended to the data (clean = FALSE), or flagged data is transformed/filtered from the dataset and no columns are appended (clean = TRUE).
  • Default is clean = FALSE

Function Requirements

  • Reference table required?
    No
  • Include warning if flags are not applied?
    No
  • Required columns:
    ActivityIdentifier
    ActivityConductingOrganizationText
    OrganizationFormalName
    OrganizationIdentifier
    ProjectIdentifier
    ResultCommentText
    ActivityCommentText

Development Notes

  • ResultFlagsIndependent.R
  • Required columns are those that are not included when checking for duplicates
  • Use code from "auto clean" function
  • Flag is unique identifier for each unique row (identical IDs = duplicate row)

Testing vignette

Run all functions and test them, including the functions with open issues. Provide feedback.

Research how to run functions upon loading the package into R

We would like TADA to use the most up-to-date reference tables each time it's used. Running the generate-reference-table functions automatically upon loading the package into R would allow us to do this. Do some research to learn how we can do that.

If it's not clear or there is no standard way to do that, consider other approaches to achieving the goal of automatically refreshing tables.
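
One standard mechanism worth evaluating is the .onLoad() hook (conventionally placed in zzz.R), which R calls automatically when the package namespace is loaded. A minimal sketch, assuming the existing UpdateMeasureUnitRef() generator; note that an installed package's sysdata.rda is read-only at run time, so refreshed tables would likely need to be cached in a package environment or a user cache directory instead:

.onLoad <- function(libname, pkgname) {
  # Try to refresh reference tables from their source URLs; if the download
  # fails (e.g. no internet connection), keep the copies shipped with the package.
  tryCatch(UpdateMeasureUnitRef(), error = function(e) invisible(NULL))
}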

License & Contribution files

Discuss CC0 vs. MIT license. Which is a better fit?

MIT License

Copyright (c) 2022 Environmental Protection Agency

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

QAPPDocAvailable

Page Requirements/Standards

All of the flag page functions should be consistent in the following ways:

  • Clean argument indicates whether flag columns should be appended to the data (clean = FALSE), or flagged data is transformed/filtered from the dataset and no columns are appended (clean = TRUE).
  • Default is clean = FALSE

Function Requirements

  • Reference table required?
    No
  • Include warning if flags are not applied?
    No
  • Required columns:
    ProjectFileUrl

Development Notes

  • Consider using this logic:
    ProjectAttachedBinaryObject is populated, QAPPavailable = Y (when clean = TRUE, these columns are retained)
    ProjectAttachedBinaryObject is not populated, QAPPavailable = N (when clean = TRUE, these columns are not retained)
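
A minimal sketch of that logic (using the required ProjectFileUrl column as the populated/not-populated test; the flag values and the clean = TRUE behavior are assumptions):

library(dplyr)

QAPPDocAvailable <- function(df, clean = FALSE) {
  flagged <- mutate(df, QAPPDocAvailable = if_else(
    !is.na(ProjectFileUrl) & ProjectFileUrl != "", "Y", "N"))
  if (!clean) return(flagged)  # clean = FALSE: append the flag column only
  # clean = TRUE: keep rows with a QAPP document and drop the appended column
  select(filter(flagged, QAPPDocAvailable == "Y"), -QAPPDocAvailable)
}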

UpdateMeasureUnitRef

Page Requirements/Standards

All of the generate reference file functions should be consistent in the following ways:

  • No arguments are included, similar to the check() or document() functions in devtools
  • Where possible, read data in via URL (not from a static, downloaded file) to maintain up-to-date records
  • Finish each function with UpdateInternalData(x), which is a function unique to TADA (at the top of GenerateRefTables.R) that updates sysdata.rda without overwriting other data.

Development Notes

  • Raw data only has "inches" as the target unit for units of type (Description column) "Length Distance." This function adds data for "m" and "ft" target units
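
A minimal sketch of this generator (the appended rows are left as a placeholder; UpdateInternalData() is the existing TADA helper at the top of GenerateRefTables.R):

UpdateMeasureUnitRef <- function() {
  # Read the live WQX MeasureUnit domain table rather than a static download
  ref <- read.csv("https://cdx2.epa.gov/wqx/download/DomainValues/MeasureUnit.CSV",
                  stringsAsFactors = FALSE)
  # ... append rows supplying "m" and "ft" target units (with their own
  # conversion factors) for the "Length Distance" unit type here ...
  UpdateInternalData(ref)
}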

InvalidSpeciation

Page Requirements/Standards

All of the flag page functions should be consistent in the following ways:

  • Clean argument indicates whether flag columns should be appended to the data (clean = FALSE), or flagged data is transformed/filtered from the dataset and no columns are appended (clean = TRUE).
  • Default is clean = FALSE

Function Requirements

  • Reference table required?
    Yes, sourced from WQX QAQC Characteristic Validation table
  • Include warning if flags are not applied?
    Yes, Metadata transformations may be affected
  • Required columns:
    CharacteristicName
    MethodSpeciationName

Development Notes

None

Check List

  • Review how to use reference table
  • Create function to add/update reference table
  • Create InvalidSpeciation function

Flag Page Summary Table Generation

ALWAYS: Run all site and result flag functions and generate summary table (drafted in mock ups).

Flag=TRUE: append all flag columns to dataset
Flag=FALSE: do not append all flag columns to dataset

Clean=TRUE: remove all data that has been flagged AND remove the flag columns if they are present
Clean=FALSE: do not remove all data that has been flagged

MeasureValueSpecialCharacters

Page Requirements/Standards

All of the flag page functions should be consistent in the following ways:

  • Clean argument indicates whether flag columns should be appended to the data (clean = FALSE), or flagged data is transformed/filtered from the dataset and no columns are appended (clean = TRUE).
  • Default is clean = FALSE

Function Requirements

  • Reference table required?
    No
  • Include warning if flags are not applied?
    Yes, "Data summaries and calculations may be affected by choosing to retain special characters in the ResultValue field. In order to ensure transformation functions will run properly, set clean = TRUE."
  • Required columns:
    ResultMeasureValue

Development Notes

ResultFlagsDependent.R

ComparableDataIdentifier

Add a column with the unique identifier for each comparable data combination that has a unique identifier included in the harmonization template. For combinations that do not already have a specific identifier in the reference file, generate a unique identifier.
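
A minimal sketch, assuming the harmonization reference file carries a ComparableDataIdentifier column and that the combination is defined by the four fields below:

library(dplyr)

AddComparableDataIdentifier <- function(df, harm_ref) {
  df %>%
    left_join(harm_ref, by = c("CharacteristicName", "ResultSampleFractionText",
                               "MethodSpeciationName", "ResultMeasure.MeasureUnitCode")) %>%
    mutate(ComparableDataIdentifier = coalesce(
      ComparableDataIdentifier,
      # No identifier in the reference file: generate one from the combination itself
      paste(CharacteristicName, ResultSampleFractionText,
            MethodSpeciationName, ResultMeasure.MeasureUnitCode, sep = "_")))
}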

CensoredData

Function title: CensoredDataSubstitutions or TransformCensorData?

Generate DetectionQuantitationLimitTypeName (and associated DetectionQuantitationLimitMeasure/MeasureValue and DetectionQuantitationLimitMeasure/MeasureUnitCode) and ResultDetectionConditionText from Result Value where needed

Depends on ResultsSpecialChars Function:

  • Convert Result Values that start with "<" into an appropriate Detection Condition and Detection Limit Value. That is: a Result Value of "<0.25" would be converted into a ResultDetectionConditionText of "Present Below Quantification Limit", a Nondetect Result Value of "0.25", and a DetectionQuantitationLimitTypeName of "Lower Quantitation Limit"

Two options (best for when <70% of data are censored):

  • Robust ROS [Regression on Order Statistics (ROS)] (for lower limits, use random number between detection limit and 0)
  • x times the detection limit
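
A minimal sketch of the "<" conversion plus the simple "x times the detection limit" substitution (dot-separated WQP column names and the default multiplier are assumptions; for Robust ROS, the NADA package is one commonly used option):

library(dplyr)
library(stringr)

SimpleCensoredSub <- function(df, x = 0.5) {
  mutate(df,
    Censored = str_detect(ResultMeasureValue, "^<"),
    DetectionQuantitationLimitMeasure.MeasureValue = if_else(
      Censored, as.numeric(str_remove(ResultMeasureValue, "^<")), NA_real_),
    ResultDetectionConditionText = if_else(
      Censored, "Present Below Quantification Limit", ResultDetectionConditionText),
    DetectionQuantitationLimitTypeName = if_else(
      Censored, "Lower Quantitation Limit", DetectionQuantitationLimitTypeName),
    # Substitute x times the detection limit for the censored result value
    ResultMeasureValue = if_else(
      Censored, as.character(x * DetectionQuantitationLimitMeasure.MeasureValue),
      ResultMeasureValue)
  )
}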

AutoFilter

Page Requirements/Standards
All of the flag page functions should be consistent in the following ways

  • Clean argument indicates whether flag columns should be appended to the data (clean = FALSE), or flagged data is transformed/filtered from the dataset and no columns are appended (clean = TRUE).
  • Default is clean = FALSE

Function Requirements

  • Reference table required?
    No
  • Include warning if flags are not applied?
    No

Required columns:
ActivityMediaName

Development Notes:
None

InvalidCoordinates

When clean = FALSE, append column titled "InvalidCoordinates" with the following:

  1. If the LAT is outside of the 0 to 90 range and longitude is outside of the -180 to 0 range, flag row as "NotInNorthAmerica".

  2. If the LAT or LONG includes the specific strings, 000 or 999, or if the LAT is outside of the -90 to 90 range and longitude is outside of the -180 to 180 range, flag row as "Invalid".

  3. Precision can be measured by the number of decimal places in the latitude and longitude provided. If the LAT or LONG does not have any numbers to the right of the decimal point, flag row as "Imprecise".

When clean = TRUE, append column titled "InvalidCoordinates" with the following:

  1. If NotInNorthAmerica: LAT has a - sign, autoclean and change it to +; if LONG is +, autoclean and change it to -; include "ChangedLatLongSign" in the "InvalidCoordinates" column

  2. If the LAT or LONG includes the specific strings 000 or 999, or if the LAT is outside of the -90 to 90 range and longitude is outside of the -180 to 180 range, flag row as "Invalid".

  3. If the LAT or LONG does not have any numbers to the right of the decimal point, still only flag row as "Imprecise". Do not remove from dataset.

Include additional Boolean argument for imprecise lat/longs:
When imprecise=TRUE, if the LAT or LONG does not have any numbers to the right of the decimal point, remove from dataset

When imprecise=FALSE, if the LAT or LONG does not have any numbers to the right of the decimal point, do not remove from dataset
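
A minimal sketch of the clean = FALSE flagging step (the LatitudeMeasure / LongitudeMeasure column names and the exact precision test are assumptions):

library(dplyr)

FlagInvalidCoordinates <- function(df) {
  mutate(df, InvalidCoordinates = case_when(
    grepl("999|000", LatitudeMeasure) | grepl("999|000", LongitudeMeasure) |
      abs(as.numeric(LatitudeMeasure)) > 90 |
      abs(as.numeric(LongitudeMeasure)) > 180                                    ~ "Invalid",
    (as.numeric(LatitudeMeasure) < 0 | as.numeric(LatitudeMeasure) > 90) &
      (as.numeric(LongitudeMeasure) > 0 | as.numeric(LongitudeMeasure) < -180)   ~ "NotInNorthAmerica",
    as.numeric(LatitudeMeasure) %% 1 == 0 |
      as.numeric(LongitudeMeasure) %% 1 == 0                                     ~ "Imprecise",
    TRUE                                                                         ~ NA_character_
  ))
}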

InvalidFraction

Page Requirements/Standards

All of the flag page functions should be consistent in the following ways:

  • Clean argument indicates whether flag columns should be appended to the data (clean = FALSE), or flagged data is transformed/filtered from the dataset and no columns are appended (clean = TRUE).
  • Default is clean = FALSE

Function Requirements

  • Reference table required?
    Yes, sourced from WQX QAQC Characteristic Validation table
  • Include warning if flags are not applied?
    Yes, Metadata transformations may be affected
  • Required columns:
    CharacteristicName
    ResultSampleFractionText

Development Notes

None

Check List

  • Create function to add/update reference table
  • Create InvalidFraction function

Review HarmonizeData Functionality

Function Logic
Always: Generate "TADAHarmonizationTable"
Flag=TRUE: Append all yellow columns pulled from the "TADAHarmonizationTemplate" to the master TADA data profile as well.
Flag=FALSE: DO NOT append all yellow columns pulled from the "TADAHarmonizationTemplate" to the master TADA data profile
Clean=FALSE: Do not transform or convert yet.
Clean=TRUE: Perform all transformations and conversions.

Dependent on "TADAHarmonizationReferenceFile"
"TADAHarmonizationTable" is generated for the specific dataset using the "TADAHarmonizationReferenceFile". The "TADAHarmonizationReferenceFile" includes logic for harmonizing synonyms and units

Dependent on other functions
The WQXInvalidResultUnit function (clean=TRUE) is required to run this function
The WQXInvalidFraction function (clean=TRUE) is required to run this function
The WQXInvalidSpeciation function (clean=TRUE) is required to run this function
The WQXTargetUnits function is required to run this function

Notes
Suggest focusing on nutrients to start.

Retrieval auto filter functions

Include the following functions in the Utilities.R file:

  • AutoFilter (media not water, biological data, etc.)
  • TrueDuplicate
  • RemoveColumnsWithNAs (do not do this yet, because some functions require those columns; instead, remove columns at the very end of the process as part of a final cleaning & output function. Alternatively, we could remove all NA-only columns except for the TADA critical columns.)

Page Requirements/Standards
All of the flag page functions should be consistent in the following ways

Clean argument indicates whether flag columns should be appended to the data (clean = FALSE), or flagged data is transformed/filtered from the dataset and no columns are appended (clean = TRUE).
Default is clean = FALSE
Function Requirements

Reference table required?
No
Include warning if flags are not applied?
No
Required columns:
ActivityMediaName

Development Notes:

  • Consider auto filter for assemblage and media subdivision fields as well - to assist with simplifying the dataset (autofiltering) and ensuring results are comparable
  • May be part of a larger function instead of being a standalone function. Could autoclean remove duplicate rows, blank columns, media-not-water data, and any other data our tool cannot support (TBD)?
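
A minimal sketch of the TrueDuplicate and NA-column helpers mentioned above (required_cols is an assumed placeholder for the TADA critical columns):

library(dplyr)

RemoveTrueDuplicates <- function(df) {
  distinct(df)  # keeps the first occurrence of each fully identical row
}

RemoveEmptyColumns <- function(df, required_cols = character()) {
  # Drop columns that are entirely NA, except any TADA critical columns
  keep <- colSums(!is.na(df)) > 0 | names(df) %in% required_cols
  df[, keep, drop = FALSE]
}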

GenerateMap

Generate a map using WQP station metadata. The map can be static, but it should use colors, sizes, and shapes to provide useful information about the WQP monitoring sites. Interactivity is a plus, but not a requirement for the MVP.
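
A minimal interactive sketch using the leaflet package (the station column names and the per-site ResultCount column are assumptions); a ggplot2 version could cover the static MVP:

library(dplyr)
library(leaflet)

GenerateMap <- function(stations) {
  stations %>%
    leaflet() %>%
    addTiles() %>%
    addCircleMarkers(
      lng = ~as.numeric(LongitudeMeasure),
      lat = ~as.numeric(LatitudeMeasure),
      radius = ~pmin(10, sqrt(ResultCount)),   # marker size scaled by result count (assumed column)
      label = ~MonitoringLocationName
    )
}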

AboveNationalWQXUpperThreshold

Page Requirements/Standards

All of the flag page functions should be consistent in the following ways:

  • Clean argument indicates whether flag columns should be appended to the data (clean = FALSE), or flagged data is transformed/filtered from the dataset and no columns are appended (clean = TRUE).
  • Default is clean = FALSE

Function Requirements

  • Reference table required?
    Yes, sourced from WQX QAQC Characteristic Validation table
  • Include warning if flags are not applied?
    No
  • Required columns:
    CharacteristicName
    ResultMeasureValue
    ResultMeasure.MeasureUnitCode

Development Notes

ResultFlagsIndependent.R
Filter Type == "CharacteristicUnit"
Required columns to join the reference table and the TADA profile:

  • Characteristic, CharacteristicName
  • Source, ActivityMediaName
  • Value, ResultMeasure.MeasureUnitCode
    This one is a little tricky because Maximum values pertain to the "Value Unit" (target unit) column, not the "Value" (original unit) column. Therefore, units must be converted before checking if a value is outside the range. Here's some logic to get started:
  • Join Maximum and Conversion Factor columns to the input dataset (by Characteristic, Source, and Value)
  • Create a new ResultMeasureValue column (e.g. ConvertedValue), which is ResultMeasureValue / Conversion Factor
  • Create flag column; add the flag when ConvertedValue exceeds Maximum
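
A minimal sketch of that join-and-flag logic (the reference table column names Characteristic, Source, Value, Maximum, and Conversion.Factor are assumptions based on the description above):

library(dplyr)

FlagAboveUpperThreshold <- function(df, ref) {
  ref <- filter(ref, Type == "CharacteristicUnit")
  df %>%
    left_join(ref, by = c("CharacteristicName" = "Characteristic",
                          "ActivityMediaName"  = "Source",
                          "ResultMeasure.MeasureUnitCode" = "Value")) %>%
    mutate(
      ConvertedValue = as.numeric(ResultMeasureValue) / Conversion.Factor,
      AboveNationalWQXUpperThreshold = if_else(
        !is.na(Maximum) & ConvertedValue > Maximum, "Y", "N"))
}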

DepthProfileData

Page Requirements/Standards

All of the flag page functions should be consistent in the following ways:

  • Clean argument indicates whether flag columns should be appended to the data (clean = FALSE), or flagged data is transformed/filtered from the dataset and no columns are appended (clean = TRUE).
  • Default is clean = FALSE

Function Requirements

  • Reference table required?
    Yes, Unit conversion reference table
  • Include warning if flags are not applied?
    Yes, Data summaries and calculations may be affected (show only if convert = FALSE)
  • Required columns:
    Activity(Top/Bottom)DepthHeight fields
    ResultDepthHeightMeasure fields

Development Notes

  • HOLD on development- finalize components of this function with discussion
  • This function will do 2 things: 1) Flag rows with depth profile data, 2) optionally convert depth profile data to a uniform unit

WQPWebServiceImport

Import data using the web service for the full physical/chemical profile directly (not via dataRetrieval).

In the vignette, generate two files that are used throughout the process and available to a user to view at any time (.csv?). Generate "TADAProfileClean" and "TADAProfileOriginal" files.

For the function, simply import the data (do not write to the global environment), like what dataRetrieval does.
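
A minimal sketch of a direct web service pull (the query parameters shown are assumptions; check the WQP web services documentation for the full set). The function returns the profile instead of writing to the global environment:

WQPWebServiceImport <- function(statecode, characteristicName) {
  url <- paste0(
    "https://www.waterqualitydata.us/data/Result/search?",
    "statecode=", utils::URLencode(statecode, reserved = TRUE),
    "&characteristicName=", utils::URLencode(characteristicName, reserved = TRUE),
    "&dataProfile=resultPhysChem&mimeType=csv&zip=no"
  )
  read.csv(url, stringsAsFactors = FALSE)
}

# e.g. TADAProfileOriginal <- WQPWebServiceImport("US:55", "Phosphorus")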

CensoredDataSummary

Function title: CensoredDataSummary
Some result values are either below or above the detection limit of the equipment used to collect data. Users may want to substitute values for these results to make them more useful for assessment. This function will provide a summary of the censored data in the dataset.

Always: generate censored data stats table and provide it to function users as a .csv (or in environment?)

Depends on ResultsSpecialChars Function

See summary table here:
https://usepa.sharepoint.com/sites/WQPDataAssessmentTeam/Shared%20Documents/Forms/AllItems.aspx?id=%2Fsites%2FWQPDataAssessmentTeam%2FShared%20Documents%2FGeneral%2FTADA%20Dev%2FFirstTool%5FDataDiscoveryandCleaning%2FTemplates%2FDraft%20Templates&viewid=deb12f0b%2D3d2c%2D4694%2Dbe22%2D7d262f785cce
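
A minimal sketch of the stats table (treating a populated ResultDetectionConditionText as the censoring indicator is an assumption; the real function would depend on ResultsSpecialChars):

library(dplyr)

CensoredDataSummary <- function(df) {
  df %>%
    group_by(CharacteristicName) %>%
    summarise(
      n_results    = n(),
      n_censored   = sum(!is.na(ResultDetectionConditionText)),
      pct_censored = round(100 * n_censored / n_results, 1),
      .groups = "drop"
    )
}

# write.csv(CensoredDataSummary(profile), "CensoredDataSummary.csv", row.names = FALSE)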

UncommonAnalyticalMethodID

Page Requirements/Standards

All of the flag page functions should be consistent in the following ways

  • Clean argument indicates whether flag columns should be appended to the data (clean = FALSE), or flagged data is transformed/filtered from the dataset and no columns are appended (clean = TRUE).
  • Default is clean = FALSE

Function Requirements

  • Reference table required?
    Yes, sourced from WQX QAQC Characteristic Validation table
  • Include warning if flags are not applied?
    No
  • Required columns:
    CharacteristicName
    ResultAnalyticalMethod.MethodIdentifier
    ResultAnalyticalMethod.MethodIdentifierContext

Development Notes

None

Check List

  • Create function to add/update reference table
  • Create UncommonAnalyticalMethodID function

BelowNationalWQXLowerThreshold

Page Requirements/Standards

All of the flag page functions should be consistent in the following ways:

  • Clean argument indicates whether flag columns should be appended to the data (clean = FALSE), or flagged data is transformed/filtered from the dataset and no columns are appended (clean = TRUE).
  • Default is clean = FALSE

Function Requirements

  • Reference table required?
    Yes, sourced from WQX QAQC Characteristic Validation table
  • Include warning if flags are not applied?
    No
  • Required columns:
    CharacteristicName
    ResultMeasureValue
    ResultMeasure.MeasureUnitCode

Development Notes

ResultFlagsIndependent.R
Filter Type == "CharacteristicUnit"
Required columns to join the reference table and the TADA profile:

  • Characteristic, CharacteristicName
  • Source, ActivityMediaName
  • Value, ResultMeasure.MeasureUnitCode
    This one is a little tricky because Minimum values pertain to the "Value Unit" (target unit) column, not the "Value" (original unit) column. Therefore, units must be converted before checking if a value is outside the range. Here's some logic to get started:
  • Join Minimum and Conversion Factor columns to the input dataset (by Characteristic, Source, and Value)
  • Create a new ResultMeasureValue column (e.g. ConvertedValue), which is ResultMeasureValue / Conversion Factor
  • Create flag column; add the flag when ConvertedValue is below Minimum

Add "TADA" to appended columns?

Consider including "TADA" at the beginning of the names of all columns that TADA appends to the dataset.

For example, from the Harmonization Template:

  • TADA Suggested CharacteristicName
  • TADA CharacteristicName assumptions
  • TADA Suggested sample fraction
  • TADA Fraction assumptions
  • TADA Suggested speciation
  • TADA Speciation Conversion Factor
  • TADA Speciation Assumptions
  • TADA Suggested result unit
  • TADA UnitConversionFactor
  • TADA UnitConversionCoefficient
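
A minimal sketch of the renaming step (the "TADA." prefix separator and the appended_cols vector are assumptions):

library(dplyr)

PrefixTADAColumns <- function(df, appended_cols) {
  rename_with(df, ~ paste0("TADA.", .x), all_of(appended_cols))
}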

TADAOutliers

Consider adding outlier information to the TADA stats function.

Append one or two additional columns to the dataset flagging outliers at the individual station/char level and/or at the all stations/char level.

Add a new function input to the stats function to flag outliers across a single station (input ID) or all stations:
Scale = AllStations
Scale = IndividualStations
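
A minimal sketch of an IQR-based flag supporting both scales (the 1.5 * IQR rule and the MonitoringLocationIdentifier grouping column are assumptions):

library(dplyr)

FlagOutliers <- function(df, Scale = "IndividualStations") {
  groups <- if (Scale == "IndividualStations") {
    c("MonitoringLocationIdentifier", "CharacteristicName")
  } else {
    "CharacteristicName"  # Scale = "AllStations"
  }
  df %>%
    mutate(value = as.numeric(ResultMeasureValue)) %>%
    group_by(across(all_of(groups))) %>%
    mutate(Outlier =
      value < quantile(value, 0.25, na.rm = TRUE) - 1.5 * IQR(value, na.rm = TRUE) |
      value > quantile(value, 0.75, na.rm = TRUE) + 1.5 * IQR(value, na.rm = TRUE)) %>%
    ungroup() %>%
    select(-value)
}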

Retrieval enhancement: dataRetrievalTemplate

Option 1: Generate Blank TADARetrieval Template
Option 2: Upload filled in TADARetrieval template and download WQP data

Use the dataRetrieval package for this.

Upon data retrieval, generate two files that are used throughout the process and available to a user to view at any time (.csv?). Generate "TADAProfileClean" and "TADAProfileOriginal" files.

AggregatedContinuousData

Page Requirements/Standards

All of the flag page functions should be consistent in the following ways:

  • Clean argument indicates whether flag columns should be appended to the data (clean = FALSE), or flagged data is transformed/filtered from the dataset and no columns are appended (clean = TRUE).
  • Default is clean = FALSE

Function Requirements

  • Reference table required?
    No
  • Include warning if flags are not applied?
    No
  • Required columns:
    ResultDetectionConditionText

Development Notes

ResultFlagsIndependent.R
Use code from "auto clean" function

InvalidResultUnit

Page Requirements/Standards

All of the flag page functions should be consistent in the following ways

  • Clean argument indicates whether flag columns should be appended to the data (clean = FALSE), or flagged data is transformed/filtered from the dataset and no columns are appended (clean = TRUE).
  • Default is clean = FALSE

Function Requirements

  • Reference table required?
    Yes, sourced from WQX QAQC Characteristic Validation table
  • Include warning if flags are not applied?
    Yes, Unit conversions, data summaries, and data calculations may be affected
  • Required columns:
    CharacteristicName
    ResultMeasure.MeasureUnitCode
    ActivityMediaName

Development Notes

None

Check List

  • Create function to add/update reference table
  • Create InvalidResultUnit function

CalculateTotalNitrogen

Calculate = Yes
Calculate = No

GenerateHarmonizationTable = Yes
GenerateHarmonizationTable = No

LogicColumn = Yes
LogicColumn = No

See Total Nitrogen Summations page for more detailed requirements

The following logic can be hard coded:
Multiple forms of nitrogen can be summed to calculate TN. Here is the logic TADA uses for this calculation:
If available, use the total N result for a given day, even if there are other constituents available.
If total N is not available, use the sum of the multiple constituents available for the same day as TN. If only one constituent is available for a given day, use that as total N (or P).

Use HarmonizationTemplate
https://usepa.sharepoint.com/:x:/r/sites/WQPDataAssessmentTeam/_layouts/15/Doc.aspx?sourcedoc=%7B756FBEA4-399E-4B40-BB15-4E52041151D0%7D&file=HarmonizationTemplate.xlsx&action=default&mobileredirect=true
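
A minimal sketch of the hard-coded daily logic (the grouping columns, the "Nitrogen" characteristic name, and taking max() when several total N results exist on the same day are all assumptions; the HarmonizationTemplate would drive the real constituent list):

library(dplyr)

CalculateTotalNitrogen <- function(df) {
  df %>%
    mutate(value = as.numeric(ResultMeasureValue)) %>%
    group_by(MonitoringLocationIdentifier, ActivityStartDate) %>%
    summarise(
      TotalNitrogen = if (any(CharacteristicName == "Nitrogen", na.rm = TRUE)) {
        # Use the reported total N result for the day if one exists
        max(value[CharacteristicName == "Nitrogen"], na.rm = TRUE)
      } else {
        # Otherwise sum whatever constituents are available that day
        sum(value, na.rm = TRUE)
      },
      .groups = "drop"
    )
}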

Utilities.R

Create function to remove true duplicates
Create filtering functions

Fields for full dataset filtering
ActivityTypeCode
ActivityMediaName
ActivityMediaSubdivisionName
ActivityCommentText
MonitoringLocationTypeName
StateName
TribalLandName
OrganizationFormalName
CharacteristicName
HydrologicCondition
HydrologicEvent
BiologicalIntentName
MeasureQualifierCode
ActivityGroup
AssemblageSampledName
ProjectName
CharacteristicNameUserSupplied
DetectionQuantitationLimitTypeName
SampleTissueAnatomyName
LaboratoryName

Fields for characteristic level filtering
ActivityCommentText
ActivityTypeCode
ActivityMediaName
ActivityMediaSubdivisionName
MeasureQualifierCode
MonitoringLocationTypeName
HydrologicCondition
HydrologicEvent
ResultStatusIdentifier
MethodQualifierTypeName
ResultCommentText
ResultLaboratoryCommentText
ResultMeasure/MeasureUnitCode
ResultSampleFractionText
ResultTemperatureBasisText
ResultValueTypeName
ResultWeightBasisText
SampleCollectionEquipmentName
LaboratoryName
MethodDescriptionText
ResultParticleSizeBasisText
SampleCollectionMethod/MethodIdentifier
SampleCollectionMethod/MethodIdentifierContext
SampleCollectionMethod/MethodName
DataQuality/BiasValue
MethodSpeciationName
ResultAnalyticalMethod/MethodName
ResultAnalyticalMethod/MethodIdentifier
ResultAnalyticalMethod/MethodIdentifierContext
AssemblageSampledName
CharacteristicNameUserSupplied
DetectionQuantitationLimitTypeName

WQXTargetUnits

Convert all units using MeasureUnit (CSV): https://cdx2.epa.gov/wqx/download/DomainValues/MeasureUnit.CSV

If clean=FALSE, append target unit and conversion columns only

  • If unit is not recognizable or able to be converted, flag as "manual conversion required" or for some specific ones, include flag "UnitIncludesMetadata".

If clean=TRUE, append target unit and conversion columns AND convert units in dataset

  • If unit is not recognizable or able to be converted, flag as "manual conversion required" or for some specific ones, include flag "UnitIncludesMetadata".
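
A minimal sketch of the join-and-convert step (the domain table's Code, Target.Unit, and Conversion.Factor column names, and multiplying by the factor rather than dividing, are assumptions):

library(dplyr)

unit_ref <- read.csv("https://cdx2.epa.gov/wqx/download/DomainValues/MeasureUnit.CSV",
                     stringsAsFactors = FALSE)

ConvertToTargetUnits <- function(df, unit_ref, clean = FALSE) {
  # Append the target unit and conversion factor by the reported unit code
  df <- left_join(df, unit_ref, by = c("ResultMeasure.MeasureUnitCode" = "Code"))
  if (clean) {
    # Convert the values and overwrite the unit code with the target unit
    df <- mutate(df,
      ResultMeasureValue            = as.numeric(ResultMeasureValue) * Conversion.Factor,
      ResultMeasure.MeasureUnitCode = Target.Unit)
  }
  df
}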

RecordSummary

Generate a record summary. This may fit well in the Utilities.R file because it can be used throughout the assessment process.

This function relies on the initial data retrieval generating a "clean" file and a "raw" file.

Summary
Total Records in Raw File: 544
Total Records Removed: 48
Total Records in Clean File: 496

Total Sites in Raw File: 20
Total Sites Removed: 4
Total Sites in Clean File: 16
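
A minimal sketch of the summary (MonitoringLocationIdentifier as the site key is an assumption):

RecordSummary <- function(raw, clean) {
  sites <- function(x) length(unique(x$MonitoringLocationIdentifier))
  data.frame(
    Metric = c("Total Records in Raw File", "Total Records Removed",
               "Total Records in Clean File", "Total Sites in Raw File",
               "Total Sites Removed", "Total Sites in Clean File"),
    Value  = c(nrow(raw), nrow(raw) - nrow(clean), nrow(clean),
               sites(raw), sites(raw) - sites(clean), sites(clean))
  )
}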

IndividualCensoredDataSubstitutions

Apply more advanced methods and/or different methods depending on characteristics being assessed and proportion of censored data for each

Additional options
KaplanMeier
Other methods (TBD)

Must download and upload template

QAPPApproved

Page Requirements/Standards

All of the flag page functions should be consistent in the following ways:

  • Clean argument indicates whether flag columns should be appended to the data (clean = FALSE), or flagged data is transformed/filtered from the dataset and no columns are appended (clean = TRUE).
  • Default is clean = FALSE

Function Requirements

  • Reference table required?
    No
  • Include warning if flags are not applied?
    No
  • Required columns:
    QAPPApprovedIndicator

Development Notes

  • Consider using this logic:
    QAPPApprovedIndicator is populated, QAPPApproved = Y (when clean = TRUE, these columns are retained)
    QAPPApprovedIndicator is not populated, QAPPApproved = N (when clean = TRUE, these columns are not retained)
