
envirotools's Introduction

Welcome to my personal GitHub profile.

Hello! I'm Kyle P. Messier, PhD, a Stadtman Tenure-Track Investigator at the National Institute of Environmental Health Sciences.

{SET} group

The Spatiotemporal Exposures and Toxicology ({SET}) group at NIEHS has a broad interest in geospatial exposomics and risk mapping.

Please check out our gh-pages website for details on the people, papers, and software.

Software

Our code and software are hosted at the NIEHS GitHub Enterprise.

Here is a current list of our software in development:

| No. | Package Name | Description | Status |
| --- | --- | --- | --- |
| 1 | amadeus | A Machine for Data, Environments, and User Setup for common environmental and climate health datasets: an R package developed to improve and expedite users' access to large, publicly available geospatial datasets. | WIP |
| 2 | beethoven | Building an Extensible, Reproducible, Test-driven, Harmonized, Open-source, Versioned, Ensemble model for air quality: an R package developed to facilitate the development of ensemble models for air quality. | WIP |
| 3 | chopin | Computation for Climate and Health research On Parallelized Infrastructure: automates parallelization of spatial operations with chopin functions as well as sf/terra functions. | WIP |
| 4 | GeoTox | A source-to-outcome modeling framework with an S3 object-oriented approach; facilitates the calculation and visualization of single- and multiple-chemical risk at individual and group levels. | WIP |
| 5 | RGCA | Implements Reflected Generalized Concentration Addition: a geometric, piecewise inverse function for 3+ parameter sigmoidal models used in chemical mixture concentration-response modeling. | WIP |
| 6 | PrestoGP | Scalable penalized regression on spatiotemporal outcomes using Gaussian processes; designed for big data, large-scale geospatial exposure assessment, and geophysical modeling. | WIP |

WIP: initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

Software Development Practices

We are focused on developing and promoting software and computational best practices, such as test-driven development (TDD) and open-source code, for the environmental health sciences. To this end, we have protocols in place to ensure that our code is well documented, tested, and reproducible. Below are some of the key practices we follow:

Unit and Integration Testing

We use various testing approaches to ensure the functionality and quality of our code.

Git + GitHub

Version control of software is essential for reproducibility and collaboration. We use Git and the NIEHS Enterprise GitHub for version control and collaboration.

CI/CD Workflows

Within GitHub, we use continuous integration and continuous deployment (CI/CD) workflows to ensure that our code is always functional and up to date. Multiple **branch protection rules** are set up and enforced for our GitHub repositories:

  1. Require a pull request and at least one review before merging to main.
  2. Linting checks must pass: code adheres to the style/linting rules defined in the repository.
  3. Test coverage checks must pass: a given push does not decrease the overall test coverage of the repository.
  4. Build checks must pass: the code builds without errors or warnings.

The CI/CD workflows in GitHub are set up to run on every push to the main branch and on every pull request. The workflows are defined in YAML files in the .github/workflows directory of the repository.
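
For R packages, one way to scaffold such workflows is with the usethis package. The sketch below is a minimal example, assuming usethis is installed; it uses the standard r-lib workflow templates, each of which writes a YAML file into .github/workflows/.

# A minimal sketch, assuming the usethis package and the standard r-lib
# workflow templates; each call writes a YAML file to .github/workflows/.
usethis::use_github_action("check-standard") # R CMD check across platforms
usethis::use_github_action("lint")           # style/linting checks with lintr
usethis::use_github_action("test-coverage")  # test coverage report with covr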

Processes to test or check

  1. data type
  2. data name
  3. data size
  4. relative paths
  5. the output of one module matches the expected input of the next module (see the sketch below)
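
As a minimal sketch of these checks with the testthat package; the file paths and expected column names below are hypothetical placeholders, not fixtures from our repositories:

library(testthat)

test_that("module output meets the next module's input expectations", {
  out <- readRDS("output/module_a.rds")           # hypothetical artifact
  expect_s3_class(out, "data.frame")              # data type
  expect_named(out, c("site_id", "date", "pm25")) # data (column) names
  expect_gt(nrow(out), 0)                         # data size
  expect_true(file.exists("data/raw/sites.csv"))  # relative path resolves
})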

Test-Driven Development

Starting from the end product, we work backwards while articulating the tests needed at each stage.

Key Points of Unit and Integration Testing

File Type
  1. NetCDF
  2. Numeric, double precision
  3. NA checks
  4. Variable names exist
  5. Naming convention
Stats
  1. Non-negative variance ($\sigma^2$)
  2. Mean is reasonable ($\mu$)
  3. SI units
Domain
  1. In the geographic domain (e.g. US + buffer)
  2. In the time range (e.g. 2018-2022)
Geographic
  1. Projections
  2. Coordinate names (e.g. lat/lon)
  3. Time in an acceptable format

A sketch of these checks follows.
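
This is a hedged sketch using terra and testthat: the file name echoes the NCEP-NARR example later in this document, the layer name "air_1" is an assumption, and the longitude bounds assume lat/lon coordinates.

library(terra)
library(testthat)

test_that("NetCDF input passes file, stats, domain, and geographic checks", {
  r <- terra::rast("air.2m.2023.nc")
  expect_true("air_1" %in% names(r))      # variable names exist
  v <- terra::values(r[[1]])
  v <- v[!is.na(v)]                       # drop NA before summaries
  expect_type(v, "double")                # numeric, double precision
  expect_gte(stats::var(v), 0)            # non-negative variance (sigma^2)
  expect_true(nzchar(terra::crs(r)))      # projection is defined
  expect_gte(terra::xmin(r), -180)        # plausible geographic domain,
  expect_lte(terra::xmax(r), 180)         # assuming lat/lon coordinates
})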

Test-Driven Development (TDD): Key Steps

  1. Write a Test: Before you start writing any code, you write a test case for the functionality you want to implement. This test should fail initially because you haven't written the code to make it pass yet. The test defines the expected behavior of your code.

  2. Run the Test: Run the test to ensure it fails. This step confirms that your test is correctly assessing the functionality you want to implement.

  3. Write the Minimum Code: Write the minimum amount of code required to make the test pass. Don't worry about writing perfect or complete code at this stage; the goal is just to make the test pass.

  4. Run the Test Again: After writing the code, run the test again. If it passes, it means your code now meets the specified requirements.

  5. Refactor (if necessary): If your code is working and the test passes, you can refactor your code to improve its quality, readability, or performance. The key here is that you should have test coverage to ensure you don't introduce new bugs while refactoring.

  6. Repeat: Continue this cycle of writing a test, making it fail, writing the code to make it pass, and refactoring as needed. Each cycle should be very short and focused on a small piece of functionality.

  7. Complete the Feature: Keep repeating the process until your code meets all the requirements for the feature you're working on.

TDD helps ensure that your code is reliable and that it remains functional as you make changes and updates. It also encourages a clear understanding of the requirements and promotes better code design.
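
As a concrete illustration of steps 1 through 4, here is a minimal red-green sketch with testthat; celsius_to_kelvin() is a hypothetical function used only for this example, not part of any SET group package.

# Step 1: write the failing test first (it fails because the function
# does not exist yet).
testthat::test_that("celsius_to_kelvin() converts and validates input", {
  testthat::expect_equal(celsius_to_kelvin(0), 273.15)
  testthat::expect_error(celsius_to_kelvin(-300)) # below absolute zero
})

# Step 3: write the minimum code that makes the test pass.
celsius_to_kelvin <- function(celsius) {
  if (celsius < -273.15) stop("temperature below absolute zero")
  celsius + 273.15
}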

_targets and/or snakemake pipelines

We utilize the targets package in R and/or the snakemake package in Python to create reproducible workflows for our data analysis. These packages allow us to define the dependencies between the steps of an analysis and ensure that the analysis is reproducible. Additionally, they keep track of pipeline objects and skip steps that have already been run, saving time and resources.
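
Here is a minimal _targets.R sketch; the CSV path and the summarize_sites() helper are hypothetical placeholders.

library(targets)
tar_option_set(packages = "dplyr")

list(
  tar_target(raw_file, "data/raw/sites.csv", format = "file"), # re-runs if the file changes
  tar_target(sites, read.csv(raw_file)),                       # skipped when up to date
  tar_target(site_summary, summarize_sites(sites))             # hypothetical helper
)

Running targets::tar_make() then builds only the outdated targets, and targets::tar_read(site_summary) retrieves the cached result.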

Some Benefits of _targets and/or snakemake pipelines

  1. Reproducibility: By defining the dependencies between the steps in our analysis, we ensure that our analysis is reproducible. This is essential for scientific research and data analysis.

  2. High-Level Abstraction: _targets and snakemake allow us to define our analysis at a high level of abstraction, making it easier to understand and maintain.

  3. Testing: Creating pipelines and unit/integration testing go hand in hand. As we write the pipeline, the tests to write become obvious.

envirotools's People

Contributors

kyle-messier, mitchellmanware


envirotools's Issues

Vignette integration with "Geocomputation with R"

From our CHORDS TEP meeting today, it was brought up that we could link out to the bookdown "Geocomputation with R", so as to not replicate too much of the basics. It would allow us to focus on environmental and climate data analysis.

2.1 Polygon Data (sf): Outline Functions

Section 2 of the 'Introduction to Spatial Analysis with Environmental Data' will focus on the polygon data type using the sf package.

Data use case: HMS Fire and Smoke Product

1.2.1 Polygon Data with sf package

  • Setup

    • Download data with utils::download.file()
    • Unzip data with utils::unzip()
    • Import data with sf::st_read()
    • Inspect data (class, summary, coordinate reference system)
    • "Factorize" the smoke density column to ensure analyses run properly (these setup steps are sketched after this outline)
  • Plot polygons

    • Simple visualization with base R plot()
  • Plot two sets of polygons

    • Download, unzip, import, and inspect United States boundary polygons
    • Simple visualization with base R plot()
    • Subset to the contiguous United States boundary
    • Plot the two polygon data sets together with ggplot2
  • Exploratory analyses

    • Merge individual polygons into one multi-part polygon for each density classification
      • Utilizes dplyr syntax and functions
      • Plot multi-part polygons and the CONUS boundary with ggplot2
    • Crop smoke polygons to the bounding box surrounding the contiguous United States
      • Plot with ggplot2
  • The ability to perform zonal statistics is limited in the sf package, which sets up a natural transition to the terra package, where zonal statistics are easier to perform
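
A hedged sketch of the setup steps above; the HMS download URL and the Density factor levels are illustrative assumptions and should be verified against the current NOAA HMS product.

url <- paste0(
  "https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/",
  "Smoke_Polygons/Shapefile/2023/08/hms_smoke20230801.zip" # illustrative date
)
utils::download.file(url, destfile = "hms_smoke.zip", mode = "wb")
utils::unzip("hms_smoke.zip", exdir = "hms_smoke")
smoke <- sf::st_read("hms_smoke")
class(smoke); summary(smoke); sf::st_crs(smoke)  # inspect
# "factorize" the smoke density column (assumed level names)
smoke$Density <- factor(smoke$Density, levels = c("Light", "Medium", "Heavy"))
plot(sf::st_geometry(smoke))                     # simple base R visualization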

visualize S-T summaries

(Transfer task from CHORDS Data analysis project board)

  • data standards for input
  • marginalize time (e.g. maps)
  • marginalize space (e.g. geom_smooth)
  • a few combinations (e.g. micromap)
  • Incorporation of census data statistics
  • trelliscopejs / interactive visualizations

@mitchellmanware @sigmafelix the SI material in https://www.pnas.org/doi/epdf/10.1073/pnas.1818859116 has a section explaining population-weighted averages. I think that will address Mitchell's question on how to deal with that. Also, @sigmafelix may be working on this task for Scalable_GIS.
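
For reference, a hedged dplyr sketch of a population-weighted average; exposure_df and its columns (county, pm25, population) are hypothetical placeholders.

library(dplyr)

# population-weighted mean exposure within each county
county_pwa <- exposure_df %>%
  group_by(county) %>%
  summarize(pm25_pw = weighted.mean(pm25, w = population, na.rm = TRUE))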

0. Introduction

Create a "0. Introduction" section to introduce the motivations, data types, data sources, and packages.

dplyr script for summarizing data in census units

Transfer from CHORDS data analysis:

Example using US state boundaries and NCEP-NARR 2m Air Temperature data

library(terra)
library(tidyterra)
library(dplyr)
library(ggplot2)

# import state boundaries from tigris
us <- tigris::states(cb = FALSE, year = 2022)
# convert to SpatVector
us <- terra::vect(us)
# import NCEP-NARR 2m air temperature
air <- terra::rast(
  "/Volumes/manwareme/EnviroTools_shiny_app/input/air.2m.2023.nc"
)
# select only 2 layers for example
air_subset <- c(air$air_1, air$air_2)
# uniform crs
us <- terra::project(us, terra::crs(air))

#### zonal statistics using dplyr syntax
#### example flow for function building
us_air <-
  air_subset %>%
  zonal(us, fun = "mean") %>%
  mutate(NAME = paste(us$NAME)) %>%
  terra::merge(x = us, by.x = "NAME", by.y = "NAME")

# plot state average 2m temperature for January 1, 2023
ggplot() +
  geom_spatvector(data = us_air,
                  aes(fill = air_1)) +
  scale_fill_continuous(type = "viridis") +
  theme_bw()

[Figure: map of state-average 2m air temperature for January 1, 2023]

Script flow:

  1. Calculates the mean within each vector zone (a state in this example)
  2. Creates a new column with the vector zone identifier (the state name in this example)
  3. Merges the calculated values back onto the original vector dataset
    • Re-merging is required because terra::zonal() returns a geometry-less data frame

This flow can easily be applied to other census boundaries using the tigris package, which provides data-download functions for census-defined boundaries (see the sketch below).
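
A hedged sketch of the same flow with county boundaries; it assumes the air and air_subset objects from the script above.

# county boundaries from tigris, converted to SpatVector
counties <- terra::vect(tigris::counties(cb = TRUE, year = 2022))
# match the raster's coordinate reference system
counties <- terra::project(counties, terra::crs(air))
# mean 2m air temperature within each county
county_air <- terra::zonal(air_subset, counties, fun = "mean")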
