
airsensor's Introduction

CRAN_Status_Badge Downloads

A dedicated Slack channel has been created for announcements, support and to help build a community of practice around this open source package. You may request an invitation to join from [email protected].

AirSensor R Package

Process and display PM2.5 data from PurpleAir

Background

The AirSensor R package is being developed to help air quality analysts, scientists and interested members of the public more easily work with air quality data from consumer-grade air quality sensors. Initial focus is on PM2.5 measurements from sensors produced by PurpleAir.

The package makes it easier to obtain data, perform analyses and create visualizations. It includes functionality to:

  • download and easily work with PM2.5 data from PurpleAir
  • visualize raw "engineering-level" data from a PurpleAir sensor
  • visualize data quality using built-in analytics and plots
  • aggregate raw data onto an hourly axis
  • create interactive maps and time series plots
  • convert aggregated PurpleAir data into ws_monitor objects appropriate for use with the PWFSLSmoke package

Institutional Support

The initial development of this package was funded by the South Coast Air Quality Management District with funds from an EPA STAR grant. The following disclaimer applies:

This package was prepared as part of a project funded through a Science to Achieve Results (STAR) grant award (RD83618401) from the U.S. Environmental Protection Agency to the South Coast Air Quality Management District (South Coast AQMD). The opinions, findings, conclusions, and recommendations are those of the author and do not necessarily represent the views of the U.S. EPA or the South Coast AQMD, nor does mention of trade names or commercial products constitute endorsement or recommendation for use. The U.S. EPA, the South Coast AQMD, their officers, employees, contractors, and subcontractors make no warranty, expressed or implied, and assume no legal liability for the information in this package. The U.S. EPA and South Coast AQMD have not approved or disapproved this package, and neither have passed upon the accuracy or adequacy of the information contained herein.

Additional funding was provided by the US Forest Service in support of the Interagency Wildland Fire Air Quality Response Program.

Mazama Science develops and maintains the package as part of its ongoing relationships with federal, state and local air quality agencies.

Installation

This package is designed to be used with R (>= 3.5) and RStudio so make sure you have those installed first.

The package is available on CRAN, or you can get the latest development version from GitHub. To install the development version, first install the devtools package and then type the following at the RStudio console:

# Note that vignettes require knitr and rmarkdown
install.packages('devtools')
install.packages('knitr')
install.packages('rmarkdown')
devtools::install_github("MazamaScience/AirSensor")

Any work with spatial data, e.g. assigning countries, states and timezones, will require installation of required spatial datasets. To get these datasets you should type the following at the RStudio console:

library(MazamaSpatialUtils)
dir.create('~/Data/Spatial', recursive=TRUE)
setSpatialDataDir('~/Data/Spatial')
installSpatialData("NaturalEarthAdm1")
installSpatialData("USCensusStates")
installSpatialData("CA_AirBasins_01")

airsensor's People

Contributors

astridsanna, dependabot[bot], hmrtn, jonathancallahan, rubyfore


airsensor's Issues

Error from pas_leaflet

I typed the following code at the command line, from the MazamaPurpleAir directory:
devtools::load_all()
data(pas_Jan25)
initializeMazamaSpatialUtils()
pas_leaflet(pas)

The error I receive is:
Error in validObject(.Object) :
invalid class “SpatialPointsDataFrame” object: 1: invalid object for slot "data" in class "SpatialPointsDataFrame": got class "pa_synoptic", should be or extend class "data.frame"
invalid class “SpatialPointsDataFrame” object: 2: invalid object for slot "data" in class "SpatialPointsDataFrame": got class "tbl_df", should be or extend class "data.frame"
invalid class “SpatialPointsDataFrame” object: 3: invalid object for slot "data" in class "SpatialPointsDataFrame": got class "tbl", should be or extend class "data.frame"
invalid class “SpatialPointsDataFrame” object: 4: invalid object for slot "data" in class "SpatialPointsDataFrame": got class "data.frame", should be or extend class "data.frame"
Called from: validObject(.Object)

sessionInfo gives the following information:
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] MazamaPurpleAir_0.1.1 MazamaCoreUtils_0.1.3 futile.logger_1.4.3
[4] dplyr_0.7.8

loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 RColorBrewer_1.1-2 pillar_1.3.1
[4] compiler_3.5.0 formatR_1.5 later_0.7.5
[7] bindr_0.1.1 futile.options_1.0.1 tools_3.5.0
[10] testthat_2.0.1 digest_0.6.17 PWFSLSmoke_1.1.18
[13] lattice_0.20-35 lubridate_1.7.4 jsonlite_1.6
[16] memoise_1.1.0 tibble_2.0.1 pkgconfig_2.0.2
[19] rlang_0.3.1 shiny_1.1.0 rstudioapi_0.7
[22] mapproj_1.2.6 commonmark_1.5 crosstalk_1.0.0
[25] rgdal_1.3-6 yaml_2.2.0 bindrcpp_0.2.2
[28] withr_2.1.2 stringr_1.3.1 httr_1.4.0
[31] roxygen2_6.1.0 xml2_1.2.0 maps_3.3.0
[34] htmlwidgets_1.2 devtools_1.13.6 grid_3.5.0
[37] leaflet_2.0.2 tidyselect_0.2.5 glue_1.3.0
[40] R6_2.3.0 sp_1.3-1 purrr_0.2.5
[43] lambda.r_1.2.3 magrittr_1.5 scales_1.0.0
[46] MazamaSpatialUtils_0.5.4 promises_1.0.1 htmltools_0.3.6
[49] assertthat_0.2.0 colorspace_1.3-2 xtable_1.8-3
[52] mime_0.6 httpuv_1.4.5 stringi_1.2.4
[55] munsell_0.5.0 crayon_1.3.4

naming: pwfsl_load vs pwfsl_loadLatest?

In the documentation, the function that downloads PWFSL data is called pwfsl_loadLatest; however, the file name is pwfsl_load. Shouldn't file names reflect function names exactly? There is also inconsistency within the Examples and Usage sections of the documentation for that file regarding pwfsl_load vs. pwfsl_loadLatest.

deal with encoding issues in downloadParseSynopticData()

The "Label" column in the returned data frame has mixed encoding: "unknown" and "UTF-8".

We need to carefully deal with encoding so as to preserve hispanic place names.

And we may need to use stringi::stri_escape_unicode() if we save this data as package internal data.

pat_sample() logic

The code used in pat_sample(forGraphic = TRUE) is not quite right in version 0.2.5. It doesn't detect and save the same outliers that one sees when running pat_outliers().

One thing to note is that sampling for graphics doesn't need to preserve the every-minute time axis. So we can start off by simply removing any records where both pm25_A and pm25_B are missing.

Some of the logic from pat_outliers() needs to be copied over when forGraphic = TRUE so that the overall creation of the returned data frame has the following steps:

  • filter out is.na() separately on the A and B channel records before finding outliers
  • reduce sampleSize by the number of outliers detected so that we return the precise sampleSize requested and so that we avoid errors when sampleSize is greater than the number of rows in the non-outliers
  • create the combined outliers part of the returned data frame by using dplyr::full_join() as in pat_outliers()
  • merge that with the sampled non-outliers portion using dplyr::bind_rows()
  • be sure to run dplyr::distinct() before dplyr::arrange()
  • create local_examples/05_graphics_sampling.R to use example_pat for the week of 2018-08-01 to 2018-08-07. Create a plot using pat_outliers(thresholdMin = 4) and then several runs of pat_multiplot(sampleSize = N) to show that the graphics retain all of the outliers originally detected regardless of N.

pat_filter() function

Because we have stuffed the actual tibbles associated with a pat object inside of pat$meta and pat$data we cannot use dplyr filtering on our pat object.

So we need a pat_filter() function which applies the incoming filtering expression to pat$data and returns the modified pat object.

Using pat_filter() should feel just like working with dplyr::filter().

See PWFSLSmoke::monitor_subsetBy() for example code for how to do this.

We already have a pat_filterDate() function that simplifies things for the most common usage and we should provide pat_filterData() that is more general just to be complete.

For extra-completeness, we should have a pat_filter() function that just wraps pat_filterData().

That seems like a good API for end users:

  • ~_filter() -- wrapper for ~_filterData() for those expecting dplyr::filter()
  • ~_filterDate() -- convenience function; just specify user friendly startdate and enddate
  • ~_filterData() -- explicit; does just what it says
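A minimal sketch of the ~_filter() wrapper, assuming a pat object is a list with $meta and $data elements as described above (the function body here is illustrative, not the package implementation):

```r
library(dplyr)

# Sketch: apply the caller's filtering expression to pat$data only,
# leaving pat$meta untouched, so usage feels like dplyr::filter()
pat_filter <- function(pat, ...) {
  pat$data <- dplyr::filter(pat$data, ...)
  pat
}

# Example usage with a toy pat object
pat <- list(
  meta = data.frame(label = "example-sensor"),
  data = data.frame(datetime = 1:4, pm25_A = c(5, 250, 12, 8))
)
cleaned <- pat_filter(pat, pm25_A < 100)
```

Forwarding `...` straight into dplyr::filter() preserves the familiar non-standard-evaluation semantics, which is what makes the wrapper "feel just like" dplyr.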

improve/rename pat_internalData()

This function does simple subsetting of the columns of data in pat$data and then does one of two things:

  • returns the dataframe
  • plots the dataframe

It's probably less confusing to have separate functions for extraction and plotting.

Also, the plotting function needs to spit out a warning if attempting to plot more than # records X # different parameters.

Check loaded packages and versions on "startup"

Mazama PA core functionality throws exceptions if packages are not properly loaded. Explore possibility of automatically checking and loading necessary package dependencies. This could possibly be resolved utilizing sessionInfo().

create MazamaSpatialUtils.R for package internal stuff

Two .R files with roxygen2 comments exist and describe the example data in the data/ directory:

  • pas.R
  • pas_raw.R

These should be removed.

You should read up on data in packages: http://r-pkgs.had.co.nz/data.html

You should probably recreate these datasets as example_pas and example_pas_raw and save them as similarly named files in the data/ directory. This will prevent any confusion when example code uses pas as a variable name.

You should then create a new MazamaPurpleAir.R file and put all the package internal documentation describing internal data in there. You can use the following source code as a model:

https://github.com/MazamaScience/MazamaSpatialUtils/blob/master/R/MazamaSpatialUtils.R

pat_dygraph()

This function will work just like PWFSLSmoke::monitor_dygraph() except for pat objects.

We probably need to get pat_sample() working first as dygraph may become unresponsive with too many data points.

Use-case vignettes for new functions

Show how these new functions can be used in conjunction to create a powerful exploratory analysis environment.
This should include:

  • filtering
  • subsampling
  • plotting
  • dygraphs
  • etc.

pat_findOutliers returns NULL data

This should return a new pa_timeseries object with either replaced or NA outliers. Instead, it returns a NULL data list.
This must be fixed in order to work with filtered data.

test parameter validation using testthat

All the input validation below should have an associated test in the tests/testthat directory:

  • downloadParseSynopticData()- check that the URL is valid, or that URL begins with 'https://www.purpleair.com', as appropriate.
  • enhanceSynopticData() - check for valid country code, appropriate class of pas_raw, valid boolean
  • pas_leaflet() - check pas is the appropriate class and nonempty, param is one of the choices, radius a number, opacity a # less than 1, maptype is one of the choices, and outsideOnly is boolean
  • pas_load() - check for valid URL, country code, includePWFSL is boolean and lookback days >= 1.
  • the other functions have no parameters as input
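As an illustration of the validation-plus-test pairing described above (checkOpacity() is a hypothetical helper, not a package function):

```r
# Hypothetical validation helper: fail fast with an informative
# message when 'opacity' is not a number in [0, 1]
checkOpacity <- function(opacity) {
  if ( !is.numeric(opacity) || opacity < 0 || opacity > 1 ) {
    stop("'opacity' must be a number between 0 and 1")
  }
  invisible(opacity)
}

# The matching testthat expectation would be something like:
#   testthat::expect_error(checkOpacity(2))
```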

learn the pkgdown package

We use the pkgdown package to create websites for each package. You should familiarize yourself with this package and use the build_site() function to create a complete website. (It will be saved in a docs/ directory that you will then upload. I have to do some work on Github to make it visible.)

Check the main public Mazama package repositories for usage and extra files like _pkgdown.yml or svg files, etc.

add custom color scales to pas_staticMap()

The color scales for temperature, humidity and pwfsl_closestDistance in the pas_leaflet() are carefully constructed color palettes with breaks specific to each variable.

These color scales should be ported over to pas_staticMap() as named palettes:

  • temperature
  • humidity
  • distance

Initialize PAS missing databases automatically

library(MazamaPurpleAir) should be the only dependency the end-user should worry about.
Currently, pas_load() throws:
Error: Missing database. Please loadSpatialData("SimpleCountriesEEZ").

This should be done automatically without raising unnecessary exceptions.

extra metadata columns

The following columns need to be added in enhanceSynopticData():

  • Community Region -- Names will be derived from the SC**_## label
  • Air District -- For air basins we will have to use CARB spatial data to identify which basin each sensor is in.
  • Sensor Manufacturer -- "Purple Air"
  • Target Pollutant -- PM (2.5?)
  • Technology Type -- "consumer-grade"
  • Model -- The "sensorType" value for all the SCAQMD sensors is "PMS5003+PMS5003+BME280".
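The constant-valued columns could be added with a simple dplyr::mutate() call. A sketch, where the column names are illustrative and 'pas' is a toy stand-in for the enhanced synoptic data frame:

```r
library(dplyr)

# Toy stand-in for the enhanced synoptic data frame
pas <- data.frame(label = c("sensor_1", "sensor_2"))

# Constant-valued metadata columns are straightforward; Community Region
# and Air District require label parsing and spatial lookups not shown here
pas <- pas %>%
  mutate(
    sensorManufacturer = "Purple Air",
    targetPollutant = "PM",
    technologyType = "consumer-grade",
    model = "PMS5003+PMS5003+BME280"
  )
```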

downloadParseTimeseriesData() should return empty tibble rather than error

The downloadParseTimeseriesData() function currently errors out when it gets an error response from ThingSpeak. This is undesirable because that stop() bubbles up to pat_load() where it is not handled.

This forces end users to know ahead of time the time range of available data.

The proper solution is to have downloadParseTimeseriesData() handle the error message from ThingSpeak by returning a tibble with all appropriate columns but no rows.

This can then be used by bind_rows() in pat_load() without any special handling.
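The error-handling pattern might look like the following sketch, where requestThingSpeak() is a hypothetical stand-in for the actual download call and the column set is abbreviated:

```r
# Return a zero-row data frame with the expected columns so that
# downstream bind_rows() calls work without any special handling
emptyTimeseries <- function() {
  data.frame(
    datetime = as.POSIXct(character(0)),
    pm25_A = numeric(0),
    pm25_B = numeric(0)
  )
}

# requestThingSpeak() is hypothetical; any error it raises is
# converted into an empty, correctly-typed result
result <- tryCatch(
  requestThingSpeak(),
  error = function(e) emptyTimeseries()
)
```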

pat_sample() function

Longer pat time series take forever to plot!

We can provide visually identical plots using many fewer points if we use dplyr::sample_n() to restrict the number of points to some reasonable number like 10,000 or so.

This will also be important when we implement pat_dygraph() for interactive time series.

The problem with naively sampling is that you will miss out on many of the potentially important outliers. So our pat_sample() function will need to be smart by doing the following:

  1. Find a bunch of outliers with seismicRoll::findOutliers(n = 11, thresholdMin = 4) or some such. Save this "outliers" tibble.
  2. Use dplyr::sample_n() to naively sample and create a "subsampled" tibble.
  3. row bind "outliers" and "subsampled"
  4. remove duplicate records
  5. reorder based on datetime

It might have taken less typing if I had just written out the lines of dplyr code. ;-)
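Those lines, sketched on a toy data frame with a crude threshold standing in for seismicRoll::findOutliers():

```r
library(dplyr)

# Toy data: mostly normal values plus one obvious outlier
data <- data.frame(datetime = 1:1000,
                   pm25 = c(rnorm(999, mean = 10), 500))

# Step 1: find outliers (a simple threshold stands in for
# seismicRoll::findOutliers(n = 11, thresholdMin = 4))
outliers <- data[data$pm25 > 100, ]

# Steps 2-5: naive sample, row bind, de-duplicate, reorder
sampled <- data %>%
  sample_n(100) %>%
  bind_rows(outliers) %>%
  distinct() %>%
  arrange(datetime)
```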

get pat_multiplot(prototype = "pm25_over") working

One of the most telling plots for Purple Air sensors is to have data from the A and B channels plotted on top of each other with different colors and partial opacity. I like red and blue because, when they lie on top of each other you get purple.

As a budding ggplot expert, I would hope this is a straightforward task.

OLD pat_timeAverage() description

This ticket previously talked about doing everything related to time averaging in a single function. This turns out to be too messy so this issue is being deprecated in favor of smaller lego bricks of functionality.

-- >8 --

Previous description

Create a pat_timeAverage() function based on local_examples/TESTING_patTimeAverage().

This function should generate a ggplot version of the three-level plot in the example.

The configurable parameters in that file should all be turned into function parameters.

There should also be a showPlot = TRUE parameter.

As I'm not sure yet what type of plot will be most compelling, there should be a prototype parameter allowing one to choose among:

  • separate plots for max, mean, median, min and sd, stacked vertically
  • an upper plot with max, mean, min lines and a lower plot with sd
  • box plots in the upper plot instead and sd in a lower plot (or on the same plot with a separate y-axis?) (perhaps with a little barplot of counts along the bottom?)

There should also be a sdThreshold = 1e6 parameter which will be used to replace some of the mean values in the returned data frame with NA

The function should invisibly return a data frame of aggregated data with the following columns of data generated by openair::timeAverage():

  • datetime
  • mean
  • max
  • min
  • median
  • sd
  • count of measurements (named "frequency" in openair::timeAverage())

The combination of the pat_outliers() and pat_timeAverage() functions should provide good tools for both visual inspection of the data and for programmatic removal of suspect data before we convert it into our as_ data model which will have an averaged datetime axis and only a single pm2.5 measurement per sensor. (Just like a ws_monitor object.)

The conversion process for a single sensor will go something like this:

  1. raw_pat <- pat_load()
  2. clean_pat <- pat_outliers(raw_pat)
  3. avg_data <- openair::timeAverage(clean_pat, sdThreshold = 10)
  4. as <- list(meta = NULL, data = NULL)
  5. as$meta <- clean_pat$meta
  6. as$data <- avg_data %>% select(datetime, mean) %>% rename(pm25 = mean)

I am optimistic this will generate a pretty reliable dataset for us to build systems on top of.

create issues for time series data

The MazamaPurpleAir_Private package has the following functions that need to be copied over to MazamaPurpleAir:

  • pat_load.R
  • downloadParseTimeseriesData.R
  • createPATimeseriesObject.R
  • pat_internalData.R

These functions allow you to take a monitor ID or name, perhaps obtained with pas_leaflet(), and download a week's worth of time series data for that particular sensor.

There is a lot to learn about the structure of the data and how these functions work so this task is to copy over the code and get it running so that you can explore time series data a little before creating your own issues to work on. A suggested starter list of issues might include:

  • rewrite documentation for the functions above
  • refactor to use better variable names and more comments in the code
  • add parameter validation and associated unit tests
  • add an example pat object to the internal data so that we can use it in examples
  • other?

pat_internalFit() improvements

This function currently uses base plot graphics. It should be rebuilt using ggplot graphics and showPlot = TRUE should generate a plot with two panes:

  1. ggplot version of A vs B as in the current pat_internalFit() plot
  2. pat_multiplot(prototype = "pm25_over")

As in the current version, it should return a linear model.

pas_filter() function

This is just a simple wrapper for dplyr::filter().

A person familiar with dplyr could just run filter(pas, ...) but we want to provide an easy way for people who are less familiar with R to begin using the package. So we should create a pas_filter() function which behaves for a pas in a way similar to how pat_filter() behaves for a pat object. And we will probably end up with an aqs object (for Air Quality Sensor) which is basically a high resolution version of the PWFSLSmoke ws_monitor object. These will eventually have their own aqs_filter~() functions.

play with version 0.1.0 of the package

Ruby --

You need to get comfortable with both RStudio's tools for building packages and with the initial set of functionality of the package. Before closing this task you should:

  • learn about building packages at http://r-pkgs.had.co.nz
  • build and explore all package functions to learn what they do
  • print out the RStudio package cheat sheet
  • practice and understand all the items in RStudio main menu under "Build"
  • learn about roxygen2 style comments
  • improve some documentation and then "Document/Test/Check"
  • "Install and Restart" and then review package documentation in the Help pane

pat_join() function to combine pat objects

My plan for generating an archive of pat data for SCAQMD is to create per-sensor monthly data files and then join them together so that we also have up-to-the-current-month annual files.

My first foray into this is local_examples/PROTOTYPE_pat_communityArchive.R

The monthly data files are ~0.5 MB so annual files will be a very reasonable 5 MB or so -- easy to download and load into memory.

To make all of this easy, we will want to have a pat_join() function that accepts two pat objects and does the following:

  • parameter validation
  • ensure that both pat objects share the same metadata
  • dplyr::bind_rows() on the $data
  • return the "joined" pat object

Even if it's only a few lines of code, I think it's worth having this as a separate utility function because we may end up adding more checks for edge cases.
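A minimal sketch of that utility, assuming the pat list structure described elsewhere in these issues:

```r
library(dplyr)

pat_join <- function(pat1, pat2) {
  # Sketch: refuse to join objects whose metadata differ
  if ( !identical(pat1$meta, pat2$meta) ) {
    stop("pat objects do not share the same metadata")
  }
  pat1$data <- bind_rows(pat1$data, pat2$data)
  pat1
}

# Example usage with toy monthly files
jan <- list(meta = data.frame(label = "s1"), data = data.frame(pm25 = 1:3))
feb <- list(meta = data.frame(label = "s1"), data = data.frame(pm25 = 4:5))
joined <- pat_join(jan, feb)
```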

add parameter validation to all functions

Every parameter in every function should be validated and should produce an appropriate error message if it is invalid.

  • downloadParseSynopticData()- check that the URL is valid, or that URL begins with 'https://www.purpleair.com', as appropriate.
  • enhanceSynopticData() - check for valid country code, appropriate class of pas_raw, valid boolean
  • pas_leaflet() - check pas is the appropriate class and nonempty, param is one of the choices, radius a number, opacity a # less than 1, maptype is one of the choices, and outsideOnly is boolean
  • pas_load() - check for valid URL, country code, outsideOnly is boolean and lookback days >= 1.
  • the other functions have no parameters as input

Purple Air Timeseries vignette

Create a vignettes/purple-air-timeseries.Rmd vignette.

It should educate someone new to the package about what a pat object is, how to get or save one and how to work with it. There might be sections on:

  • TBD

new pat_timeseriesDiagnosticPlot() function

(I'm not wedded to the name. Perhaps it should have "aggregation" as part of it? We'll probably need to use it for a while to figure out what it should be called.)

We will want a new function to go along with pat_timeseriesDiagnostics().

This function should accept a pat object (or possibly a diagnostic data frame?) and generate a set of stacked? plots where one can visually assess the values and statistics that come out of pat_timeseriesDiagnostics().

The use case I imagine is that our eventual automated QC process will flag a bunch of hours as bad and someone will ask "Why?". This function should provide a quick answer.

MazamaSpatialUtils not loaded via namespace or in attached packages

In using the MazamaPurpleAir package in the synoptic vignette, I have come across two issues having to do with using the MazamaSpatialUtils package.

  1. PWFSLSmoke imports MazamaSpatialUtils >= 0.5.4; however, enhanceSynopticData() throws an error unless MazamaSpatialUtils >= 0.6.1. I did not have the new version installed, got an error, and solved the problem by updating. Q: In the DESCRIPTION for MazamaPurpleAir under 'Imports', should MazamaSpatialUtils be >= 0.6.1?

  2. After loading MazamaPurpleAir, MazamaSpatialUtils does not show up in the session information returned by sessionInfo(). This is easily solved by including library(MazamaSpatialUtils), but I assume you do not want this to be a necessary command to using MazamaPurpleAir.
    I have tried restarting R and restarting RStudio. Here are the results of sessionInfo() from a newly started RStudio session, after loading MazamaPurpleAir 0.1.1.

R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] MazamaPurpleAir_0.1.1 MazamaCoreUtils_0.1.3 futile.logger_1.4.3   dplyr_0.7.8          

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0           crayon_1.3.4         assertthat_0.2.0     R6_2.3.0             futile.options_1.0.1
 [6] formatR_1.5          magrittr_1.5         pillar_1.3.1         rlang_0.3.1          rstudioapi_0.7      
[11] bindrcpp_0.2.2       lambda.r_1.2.3       tools_3.5.0          glue_1.3.0           purrr_0.2.5         
[16] yaml_2.2.0           compiler_3.5.0       pkgconfig_2.0.2      bindr_0.1.1          tidyselect_0.2.5    
[21] tibble_2.0.1 

pas_staticMap()

ESRI cut off free access to basemap images yesterday!

So our ability to generate static maps for pas objects has gone away.

We need a new pas_staticMap() function that accesses free tiles from one of the many available tile servers and assembles them into the desired basemap.

This function should probably accept:

  • zoom
  • centerLon
  • centerLat
  • height (pixels)
  • width (pixels)

Because we will be generating standard sized images with these base maps, being able to define the height and width is important.

pat_multiplot() enhancements

We need to design all of our output plots for one of two cases:

  1. use in a public facing web service where clean and simple is the priority
  2. publication-ready, fully annotated plots

The pat_multiplot() function is in need of some basic improvements:

  • axis labels
  • title
  • symbol should be semi-transparent squares

Ideally, we would have a set of defaults that a savvy user could override. If it's easy to implement, using a ggplot theme would be ideal.

And what can be done with the ... parameter?

Purple Air Synoptic vignette

Create a vignettes/purple-air-synoptic.Rmd vignette.

It should educate someone new to the package about what a pas object is, how to get or save one and how to work with it. There might be sections on:

  • "spatially enhanced metadata"
  • using dplyr to filter based on different criteria
  • creating interactive maps with pas_leaflet()
  • creating static maps with TBD

allow differentiation of AirNow FRMs vs. E-BAMs and Nephelometers

According to Lee Tarnay: The BAM FRM is the official "ground truth" while the E-BAMs and nephelometers produce biased results.

In allowing for comparison of PurpleAir data with "FRM"s, we should allow for the exclusion of non E-BAM monitors.

We probably need some argument called FRM_only or some such.

One way to identify these monitors is to take only the AirNow, non-mobile monitors.

For AIRSIS and WRCC monitors, there may be metadata fields that can help with this identification.

pas_esriMap() function

We need to produce static maps as well as interactive ones.

We should harness the functionality in the PWFSLSmoke package and create a function that mimics some of the functionality in monitor_esriMap().

This new function will need to determine center location and zoom from the incoming pas object.

new pat_timeAverageDiagnostics() function

I cloned the source code for openair and took a look at the implementation of the timeAverage() function.

That function is waaaaay too tied to the openair data model. We can create our own function that does what we want, runs faster and has much more readable code. I've gotten a start in local_examples/PROTOTYPE_pat_timeAverage.R.

The feature set for this function is:

  • accept a pat object and return a tibble with new data columns on a new time axis
  • accept a unit parameter specifying the new time axis period
  • the returned tibble will have columns with mean, sd and count statistics for each of pm25_A, pm25_B, temperature and humidity. Columns will be named <parameter>_<statistic>
  • the returned tibble will have additional t-test parameters: pm25_t, pm25_df and pm25_p
  • be sure to convert any NaN, Inf or NULL values generated by mean or sd into NA
  • if it takes less than 4 hours, add support for a data.thresh parameter
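The NaN/Inf cleanup mentioned above can be handled with a small helper (finiteOrNA() is a hypothetical name):

```r
# Hypothetical helper: replace non-finite values (NaN, Inf, -Inf) with NA,
# as produced by mean() or sd() on empty or degenerate groups
finiteOrNA <- function(x) {
  x[!is.finite(x)] <- NA
  x
}

cleaned <- finiteOrNA(c(1.5, NaN, Inf, -Inf, 2.0))
```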

improve pat_scatterplot()

The pat_scatterplot() is a good working example but can be improved.

If it's not too hard, we should convert it to ggplot graphics.

Desirable features:

  • parameters argument -- We should allow people to pass in a vector of the parameters they want to generate scatterplots for. Any combination of pat internal variables should be OK. But we also want to default to datetime, A, B, temp, humidity.
  • graphical parameters optimized for speed -- shape = 18? or '.'
  • sampleFraction argument -- For large data frames we will optimize things by randomly sampling the data. The sampleFraction argument should default to NULL which means we subsample to get a data frame with a reasonable number of rows, say 1e5. But users will be allowed to set this parameter to 1 if they want to force it to plot everything or a low number if they want to speed things up
  • ... argument -- We should allow people to supply extra graphical parameters to customize the plot like color, size, shape, etc.

add sensor "label" to every pat_~ plot

The various pat_~ plots are well annotated except for the name of the sensor being plotted.

The label field in the pat$meta should be used to annotate each plot. We should also think about a plan for displaying the sensorType.

The label may need to be truncated to N characters to fit nicely. My first thought is to just include the label at the beginning of each plot title. That's probably good enough for now and we can worry about moving it or adding sensorType only if requested.

learn to use package building packages

R has a number of packages to help you build packages. You should become passingly familiar with the following:

  • goodpractice -- We don't slavishly follow line length or test-for-every-function suggestions but it's good to run the gp() function and see what it says.
  • usethis -- You should know what is available here. I already used use_pipe() to include piping functionality in the package.
  • testthat -- You will be using this to create tests.
  • Any others?

createASTimeseriesObject()

I am now convinced that we need another data model, "Air Sensor Timeseries", that has the following characteristics:

meta dataframe

  • location metadata from pat object

data dataframe

  • uniform datetime time axis on one of 5-min, 10-min, 15-min or hourly
  • columns for pm25, humidity, temperature (means)
  • columns for pm25_sd, humidity_sd, temperature_sd
  • columns for pm25_count, humidity_count, temperature_count
  • columns for pm25_qc, humidity_qc, temperature_qc
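That 13-column layout (datetime plus mean/sd/count/qc for each of the three parameters) might be sketched as an empty skeleton. This is an assumption about types; in particular, integer qc columns are a guess:

```r
# Skeleton of the proposed 'ast' data model; column names follow the
# lists above, column types are assumptions
emptyAstData <- data.frame(
  datetime          = as.POSIXct(character(0)),
  pm25              = numeric(0),
  humidity          = numeric(0),
  temperature       = numeric(0),
  pm25_sd           = numeric(0),
  humidity_sd       = numeric(0),
  temperature_sd    = numeric(0),
  pm25_count        = integer(0),
  humidity_count    = integer(0),
  temperature_count = integer(0),
  pm25_qc           = integer(0),
  humidity_qc       = integer(0),
  temperature_qc    = integer(0)
)

ast <- list(meta = NULL, data = emptyAstData)
```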

This function will start with a pat object and requested time period and do the following:

  1. guarantee that we are working with a pat object
  2. perform to-be-determined QC on the A/B channel data (leave this as a stub for now)
  3. use openair::timeAverage() on the merged A/B channel data to produce pm25 metrics
  4. perform to-be-determined QC on the humidity data (leave this as a stub for now)
  5. use openair::timeAverage() to produce humidity metrics
  6. perform to-be-determined QC on the temperature data (leave this as a stub for now)
  7. use openair::timeAverage() to produce temperature metrics

The returned object will look a lot like a pat object with the following differences:

  • it may have a few to-be-determined additional metadata fields
  • it will have a time axis defined by period
  • it will always have the same 13 columns in the data dataframe

The reason for this additional data model is that we will almost certainly need to generate per-sensor plots of QC'ed pm25, temperature and humidity data on a uniform time axis. The pas and pat data models and associated functions are intended for use by serious analysts wishing to work with raw data.

This new ast object is for community members who want to look at relatively high resolution data and may have some questions about sensor performance but don't want to get lost in the details. Various ast_~ plot functions will provide graphics aimed more at interested members of the general public.

Finally, the ast object is generic enough that it will accommodate data from many types of sensors. We would need to have separate data merging/QC pathways for each type of sensor but the ast object will allow for a harmonized approach.

pat_timeAverage()

This function should limit its functionality to wrapping potentially multiple calls to the openair::timeAverage() function and returning a data frame. We will keep the A and B channels separate for now.

We still need to do some investigating of the "pat" data so we can come up with some sort of reasonable QC before we start mixing A and B together to create our "official" number.

Input

  • pat = NULL -- pa_timeseries object
  • parameter = "pm25_A" -- "pat" data column to process
  • period = "10 min" -- used as avg.time ("period" is the terminology preferred by lubridate)
  • dataThreshold = 0 -- used as data.thresh
  • stats = "all" -- vector of statistics to return as dataframe columns
  • fill = FALSE -- used as fill

Omit all of the following openair::timeAverage() arguments: "type", "percentile", "start.date", "end.date", "interval" and "vector.ws".

Output
The returned dataframe should have a first column named datetime followed by the names of the requested statistics, with "count" replacing openair's "frequency".
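One possible sketch of the wrapper, making one openair::timeAverage() call per requested statistic and renaming "frequency" to "count"; the internal structure assumed for the pat object (pat$data$datetime, pat$data[[parameter]]) and all other details are assumptions, not a final implementation:

```r
# Hypothetical sketch of pat_timeAverage(). Requires the openair package
# for the actual averaging; only the input validation is final here.
pat_timeAverage <- function(pat = NULL,
                            parameter = "pm25_A",
                            period = "10 min",
                            dataThreshold = 0,
                            stats = "all",
                            fill = FALSE) {

  if ( !inherits(pat, "pa_timeseries") )
    stop("argument 'pat' is not a pa_timeseries object")

  if ( identical(stats, "all") )
    stats <- c("mean", "median", "sd", "min", "max", "frequency")
  # accept "count" as an alias for openair's "frequency"
  stats[stats == "count"] <- "frequency"

  # openair::timeAverage() expects a POSIXct column named 'date'
  df <- data.frame(date = pat$data$datetime,
                   value = pat$data[[parameter]])

  # one timeAverage() call per requested statistic
  result <- NULL
  for ( stat in stats ) {
    tbl <- openair::timeAverage(df,
                                avg.time = period,
                                data.thresh = dataThreshold,
                                statistic = stat,
                                fill = fill)
    # name the averaged column after its statistic,
    # replacing openair's "frequency" with "count"
    names(tbl)[names(tbl) == "value"] <-
      ifelse(stat == "frequency", "count", stat)
    result <- if ( is.null(result) ) tbl else merge(result, tbl, by = "date")
  }

  names(result)[names(result) == "date"] <- "datetime"
  return(result)
}
```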

When it all works, a user should be able to do the following:

# Generate a plot of 10 minute averages for pm25_A
pat %>%
  pat_timeAverage("pm25_A", "10 min", stats = "mean") %>%
  plot()
# Generate a QC plot of all statistics for channel B
pat %>%
  pat_timeAverage("pm25_B", "1 hour") %>%
  timeAveragePlot(plottype = "QC")
# Test my own ideas for a QC metric
my_qc <- pat %>%
  pat_timeAverage("pm25_A", "1 hour", stats = c("median", "sd", "count")) %>%
  mutate(qc_1 = median - sd, qc_2 = median / sd, qc_3 = sd * count)

somePlotFunction(my_qc)
