
airsensor's Introduction

CRAN_Status_Badge Downloads

A dedicated Slack channel has been created for announcements, support and to help build a community of practice around this open source package. You may request an invitation to join from [email protected].

AirSensor R Package

Process and display PM2.5 data from PurpleAir

Background

The AirSensor R package is being developed to help air quality analysts, scientists and interested members of the public more easily work with air quality data from consumer-grade air quality sensors. Initial focus is on PM2.5 measurements from sensors produced by PurpleAir.

The package makes it easier to obtain data, perform analyses and create visualizations. It includes functionality to:

  • download and easily work with PM2.5 data from PurpleAir
  • visualize raw "engineering-level" data from a PurpleAir sensor
  • visualize data quality using built-in analytics and plots
  • aggregate raw data onto an hourly axis
  • create interactive maps and time series plots
  • convert aggregated PurpleAir data into ws_monitor objects appropriate for use with the PWFSLSmoke package

Institutional Support

The initial development of this package was funded by the South Coast Air Quality Management District with funds from an EPA STAR grant. The following disclaimer applies:

This package was prepared as part of a project funded through a Science to Achieve Results (STAR) grant award (RD83618401) from the U.S. Environmental Protection Agency to the South Coast Air Quality Management District (South Coast AQMD). The opinions, findings, conclusions, and recommendations are those of the author and do not necessarily represent the views of the U.S. EPA or the South Coast AQMD, nor does mention of trade names or commercial products constitute endorsement or recommendation for use. The U.S. EPA, the South Coast AQMD, their officers, employees, contractors, and subcontractors make no warranty, expressed or implied, and assume no legal liability for the information in this package. The U.S. EPA and South Coast AQMD have not approved or disapproved this package, and neither have passed upon the accuracy or adequacy of the information contained herein.

Additional funding was provided by the US Forest Service in support of the Interagency Wildland Fire Air Quality Response Program.

Mazama Science develops and maintains the package as part of its ongoing relationships with federal, state and local air quality agencies.

Installation

This package is designed to be used with R (>= 3.5) and RStudio so make sure you have those installed first.

The package is available on CRAN, or you can get the latest development version from GitHub. To install the development version, first install the devtools package and then type the following at the RStudio console:

# Note that vignettes require knitr and rmarkdown
install.packages('devtools')
install.packages('knitr')
install.packages('rmarkdown')
devtools::install_github("MazamaScience/AirSensor")

Any work with spatial data, e.g. assigning countries, states and timezones, will require installation of required spatial datasets. To get these datasets you should type the following at the RStudio console:

library(MazamaSpatialUtils)
dir.create('~/Data/Spatial', recursive=TRUE)
setSpatialDataDir('~/Data/Spatial')
installSpatialData("NaturalEarthAdm1")
installSpatialData("USCensusStates")
installSpatialData("CA_AirBasins_01")

airsensor's People

Contributors

astridsanna, dependabot[bot], hmrtn, jonathancallahan, rubyfore


airsensor's Issues

Error from pas_leaflet

I typed the following code at the command line, from the MazamaPurpleAir directory:
devtools::load_all()
data(pas_Jan25)
initializeMazamaSpatialUtils()
pas_leaflet(pas)

The error I receive is:
Error in validObject(.Object) :
invalid class “SpatialPointsDataFrame” object: 1: invalid object for slot "data" in class "SpatialPointsDataFrame": got class "pa_synoptic", should be or extend class "data.frame"
invalid class “SpatialPointsDataFrame” object: 2: invalid object for slot "data" in class "SpatialPointsDataFrame": got class "tbl_df", should be or extend class "data.frame"
invalid class “SpatialPointsDataFrame” object: 3: invalid object for slot "data" in class "SpatialPointsDataFrame": got class "tbl", should be or extend class "data.frame"
invalid class “SpatialPointsDataFrame” object: 4: invalid object for slot "data" in class "SpatialPointsDataFrame": got class "data.frame", should be or extend class "data.frame"
Called from: validObject(.Object)

sessionInfo gives the following information:
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] MazamaPurpleAir_0.1.1 MazamaCoreUtils_0.1.3 futile.logger_1.4.3
[4] dplyr_0.7.8

loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 RColorBrewer_1.1-2 pillar_1.3.1
[4] compiler_3.5.0 formatR_1.5 later_0.7.5
[7] bindr_0.1.1 futile.options_1.0.1 tools_3.5.0
[10] testthat_2.0.1 digest_0.6.17 PWFSLSmoke_1.1.18
[13] lattice_0.20-35 lubridate_1.7.4 jsonlite_1.6
[16] memoise_1.1.0 tibble_2.0.1 pkgconfig_2.0.2
[19] rlang_0.3.1 shiny_1.1.0 rstudioapi_0.7
[22] mapproj_1.2.6 commonmark_1.5 crosstalk_1.0.0
[25] rgdal_1.3-6 yaml_2.2.0 bindrcpp_0.2.2
[28] withr_2.1.2 stringr_1.3.1 httr_1.4.0
[31] roxygen2_6.1.0 xml2_1.2.0 maps_3.3.0
[34] htmlwidgets_1.2 devtools_1.13.6 grid_3.5.0
[37] leaflet_2.0.2 tidyselect_0.2.5 glue_1.3.0
[40] R6_2.3.0 sp_1.3-1 purrr_0.2.5
[43] lambda.r_1.2.3 magrittr_1.5 scales_1.0.0
[46] MazamaSpatialUtils_0.5.4 promises_1.0.1 htmltools_0.3.6
[49] assertthat_0.2.0 colorspace_1.3-2 xtable_1.8-3
[52] mime_0.6 httpuv_1.4.5 stringi_1.2.4
[55] munsell_0.5.0 crayon_1.3.4

naming: pwfsl_load vs pwfsl_loadLatest?

In the documentation, the function that downloads PWFSL data is called pwfsl_loadLatest; however, the file name is pwfsl_load. Shouldn't file names reflect function names exactly? There is also inconsistency within the Examples and Usage sections of the documentation for that file regarding pwfsl_load vs. pwfsl_loadLatest.

deal with encoding issues in downloadParseSynopticData()

The "Label" column in the returned data frame has mixed encoding: "unknown" and "UTF-8".

We need to carefully deal with encoding so as to preserve hispanic place names.

And we may need to use stringi::stri_escape_unicode() if we save this data as package internal data.

pat_sample() logic

The code used in pat_sample(forGraphic = TRUE) is not quite right in version 0.2.5. It doesn't detect and save the same outliers that one sees when running pat_outliers().

One thing to note is that sampling for graphics doesn't need to preserve the every-minute time axis. So we can start off by simply removing any records where both pm25_A and pm25_B are missing.

Some of the logic from pat_outliers() needs to be copied over when forGraphic = TRUE so that the overall creation of the returned data frame has the following steps:

  • filter out is.na() separately on the A and B channel records before finding outliers
  • reduce sampleSize by the number of outliers detected so that we return the precise sampleSize requested and so that we avoid errors when sampleSize is greater than the number of rows in the non-outliers
  • create the combined outliers part of the returned data frame by using dplyr::full_join() as in pat_outliers()
  • merge that with the sampled non-outliers portion using dplyr::bind_rows()
  • be sure to run dplyr::distinct() before dplyr::arrange()
  • create local_examples/05_graphics_sampling.R to use example_pat for the week of 2018-08-01 to 2018-08-07. Create a plot using pat_outliers(thresholdMin = 4) and then several runs of pat_multiplot(sampleSize = N) to show that the graphics retain all of the outliers originally detected regardless of N.

pat_filter() function

Because we have stuffed the actual tibbles associated with a pat object inside of pat$meta and pat$data we cannot use dplyr filtering on our pat object.

So we need a pat_filter() function which applies the incoming filtering expression to pat$data and returns the modified pat object.

Using pat_filter() should feel just like working with dplyr::filter().

See PWFSLSmoke::monitor_subsetBy() for example code for how to do this.

We already have a pat_filterDate() function that simplifies things for the most common usage and we should provide pat_filterData() that is more general just to be complete.

For extra-completeness, we should have a pat_filter() function that just wraps pat_filterData().

That seems like a good API for end users:

  • ~_filter() -- wrapper for ~_filterData() for those expecting dplyr::filter()
  • ~_filterDate() -- convenience function; just specify user friendly startdate and enddate
  • ~_filterData() -- explicit; does just what it says
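A minimal sketch of the ~_filter() wrapper, assuming a pat object is a list with $meta and $data elements as described above (the function body here is illustrative, not the package implementation):

```r
library(dplyr)

# Sketch: apply the caller's filtering expression to pat$data only,
# leaving pat$meta untouched, so usage feels like dplyr::filter()
pat_filter <- function(pat, ...) {
  pat$data <- dplyr::filter(pat$data, ...)
  pat
}

# Example usage with a toy pat object
pat <- list(
  meta = data.frame(label = "example-sensor"),
  data = data.frame(datetime = 1:4, pm25_A = c(5, 250, 12, 8))
)
cleaned <- pat_filter(pat, pm25_A < 100)
```

Forwarding `...` straight into dplyr::filter() preserves the familiar non-standard-evaluation semantics, which is what makes the wrapper "feel just like" dplyr.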

improve/rename pat_internalData()

This function does simple subsetting of the columns of data in pat$data and then does one of two things:

  • returns the dataframe
  • plots the dataframe

It's probably less confusing to have separate functions for extraction and plotting.

Also, the plotting function needs to spit out a warning if attempting to plot more than # records X # different parameters.

Check loaded packages and versions on "startup"

Mazama PA core functionality throws exceptions if packages are not properly loaded. Explore possibility of automatically checking and loading necessary package dependencies. This could possibly be resolved utilizing sessionInfo().

create MazamaSpatialUtils.R for package internal stuff

Two .R files with roxygen2 comments exist and describe the example data in the data/ directory:

  • pas.R
  • pas_raw.R

These should be removed.

You should read up on data in packages: http://r-pkgs.had.co.nz/data.html

You should probably recreate these datasets as example_pas and example_pas_raw and save them as similarly named files in the data/ directory. This will prevent any confusion when example code uses pas as a variable name.

You should then create a new MazamaPurpleAir.R file and put all the package internal documentation describing internal data in there. You can use the following source code as a model:

https://github.com/MazamaScience/MazamaSpatialUtils/blob/master/R/MazamaSpatialUtils.R

pat_dygraph()

This function will work just like PWFSLSmoke::monitor_dygraph() except for pat objects.

We probably need to get pat_sample() working first as dygraph may become unresponsive with too many data points.

Use-case vignettes for new functions

Show how these new functions can be used in conjunction to create a powerful exploratory analysis environment.
This should include:

  • filtering
  • subsampling
  • plotting
  • dygraphs
  • etc.

pat_findOutliers returns NULL data

This should return a new pa_timeseries object with either replaced or NA outliers. Instead, it returns a NULL data list.
This must be fixed in order to work with filtered data.

test parameter validation using testthat

All the input validation below should have an associated test in the tests/testthat directory:

  • downloadParseSynopticData()- check that the URL is valid, or that URL begins with 'https://www.purpleair.com', as appropriate.
  • enhanceSynopticData() - check for valid country code, appropriate class of pas_raw, valid boolean
  • pas_leaflet() - check pas is the appropriate class and nonempty, param is one of the choices, radius a number, opacity a # less than 1, maptype is one of the choices, and outsideOnly is boolean
  • pas_load() - check for valid URL, country code, includePWFSL is boolean and lookback days >= 1.
  • the other functions have no parameters as input
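As an illustration of the validation-plus-test pairing described above (checkOpacity() is a hypothetical helper, not a package function):

```r
# Hypothetical validation helper: fail fast with an informative
# message when 'opacity' is not a number in [0, 1]
checkOpacity <- function(opacity) {
  if ( !is.numeric(opacity) || opacity < 0 || opacity > 1 ) {
    stop("'opacity' must be a number between 0 and 1")
  }
  invisible(opacity)
}

# The matching testthat expectation would be something like:
#   testthat::expect_error(checkOpacity(2))
```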

learn the pkgdown package

We use the pkgdown package to create websites for each package. You should familiarize yourself with this package and use the build_site() function to create a complete website. (It will be saved in a docs/ directory that you will then upload. I have to do some work on Github to make it visible.)

Check the main public Mazama package repositories for usage and extra files like _pkgdown.yml or svg files, etc.

add custom color scales to pas_staticMap()

The color scales for temperature, humidity and pwfsl_closestDistance in the pas_leaflet() are carefully constructed color palettes with breaks specific to each variable.

These color scales should be ported over to pas_staticMap() as named palettes:

  • temperature
  • humidity
  • distance

Initialize PAS missing databases automatically

library(MazamaPurpleAir) should be the only dependency the end-user should worry about.
Currently, pas_load() throws:
Error: Missing database. Please loadSpatialData("SimpleCountriesEEZ").

This should be done automatically without raising unnecessary exceptions.

extra metadata columns

The following columns need to be added in enhanceSynopticData():

  • Community Region -- Names will be derived from the SC**_## label
  • Air District -- For air basins we will have to use CARB spatial data to identify which basin each sensor is in.
  • Sensor Manufacturer -- "Purple Air"
  • Target Pollutant -- PM (2.5?)
  • Technology Type -- "consumer-grade"
  • Model -- The "sensorType" value for all the SCAQMD sensors is "PMS5003+PMS5003+BME280".
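The constant-valued columns could be added with a simple dplyr::mutate() call. A sketch, where the column names are illustrative and 'pas' is a toy stand-in for the enhanced synoptic data frame:

```r
library(dplyr)

# Toy stand-in for the enhanced synoptic data frame
pas <- data.frame(label = c("sensor_1", "sensor_2"))

# Constant-valued metadata columns are straightforward; Community Region
# and Air District require label parsing and spatial lookups not shown here
pas <- pas %>%
  mutate(
    sensorManufacturer = "Purple Air",
    targetPollutant = "PM",
    technologyType = "consumer-grade",
    model = "PMS5003+PMS5003+BME280"
  )
```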

downloadParseTimeseriesData() should return empty tibble rather than error

The downloadParseTimeseriesData() function currently errors out when it gets an error response from ThingSpeak. This is undesirable because that stop() bubbles up to pat_load() where it is not handled.

This forces end users to know ahead of time the time range of available data.

The proper solution is to have downloadParseTimeseriesData() handle the error message from ThingSpeak by returning a tibble with all appropriate columns but no rows.

This can then be used by bind_rows() in pat_load() without any special handling.
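The error-handling pattern might look like the following sketch, where requestThingSpeak() is a hypothetical stand-in for the actual download call and the column set is abbreviated:

```r
# Return a zero-row data frame with the expected columns so that
# downstream bind_rows() calls work without any special handling
emptyTimeseries <- function() {
  data.frame(
    datetime = as.POSIXct(character(0)),
    pm25_A = numeric(0),
    pm25_B = numeric(0)
  )
}

# requestThingSpeak() is hypothetical; any error it raises is
# converted into an empty, correctly-typed result
result <- tryCatch(
  requestThingSpeak(),
  error = function(e) emptyTimeseries()
)
```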

pat_sample() function

Longer pat time series take forever to plot!

We can provide visually identical plots using many fewer points if we use dplyr::sample_n() to restrict the number of points to some reasonable number like 10,000 or so.

This will also be important when we implement pat_dygraph() for interactive time series.

The problem with naively sampling is that you will miss out on many of the potentially important outliers. So our pat_sample() function will need to be smart by doing the following:

  1. Find a bunch of outliers with seismicRoll::findOutliers(n = 11, thresholdMin = 4) or some such. Save this "outliers" tibble.
  2. Use dplyr::sample_n() to naively sample and create a "subsampled" tibble.
  3. row bind "outliers" and "subsampled"
  4. remove duplicate records
  5. reorder based on datetime

It might have taken less typing if I had just written out the lines of dplyr code. ;-)
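Those lines, sketched on a toy data frame with a crude threshold standing in for seismicRoll::findOutliers():

```r
library(dplyr)

# Toy data: mostly normal values plus one obvious outlier
data <- data.frame(datetime = 1:1000,
                   pm25 = c(rnorm(999, mean = 10), 500))

# Step 1: find outliers (a simple threshold stands in for
# seismicRoll::findOutliers(n = 11, thresholdMin = 4))
outliers <- data[data$pm25 > 100, ]

# Steps 2-5: naive sample, row bind, de-duplicate, reorder
sampled <- data %>%
  sample_n(100) %>%
  bind_rows(outliers) %>%
  distinct() %>%
  arrange(datetime)
```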

get pat_multiplot(prototype = "pm25_over") working

One of the most telling plots for Purple Air sensors is to have data from the A and B channels plotted on top of each other with different colors and partial opacity. I like red and blue because, when they lie on top of each other you get purple.

As a budding ggplot expert, I would hope this is a straightforward task.

OLD pat_timeAverage() description

This ticket previously talked about doing everything related to time averaging in a single function. This turns out to be too messy so this issue is being deprecated in favor of smaller lego bricks of functionality.

-- >8 --

Previous description

Create a pat_timeAverage() function based on local_examples/TESTING_patTimeAverage().

This function should generate a ggplot version of the three-level plot in the example.

The configurable parameters in that file should all be turned into function parameters.

There should also be a showPlot = TRUE parameter.

As I'm not sure yet what type of plot will be most compelling, there should be a prototype parameter allowing one to choose among:

  • separate plots for max, mean, median, min and sd, stacked vertically
  • an upper plot with max, mean, min lines and a lower plot with sd
  • box plots in the upper plot instead and sd in a lower plot (or on the same plot with a separate y-axis?) (perhaps with a little barplot of counts along the bottom?)

There should also be a sdThreshold = 1e6 parameter which will be used to replace some of the mean values in the returned data frame with NA

The function should invisibly return a data frame of aggregated data with the following columns of data generated by openair::timeAverage():

  • datetime
  • mean
  • max
  • min
  • median
  • sd
  • count of measurements (named "frequency" in openair::timeAverage())

The combination of the pat_outliers() and pat_timeAverage() functions should provide good tools for both visual inspection of the data and for programmatic removal of suspect data before we convert it into our as_ data model which will have an averaged datetime axis and only a single pm2.5 measurement per sensor. (Just like a ws_monitor object.)

The conversion process for a single sensor will go something like this:

  1. raw_pat <- pat_load()
  2. clean_pat <- pat_outliers(raw_pat)
  3. avg_data <- openair::timeAverage(clean_pat, sdThreshold = 10)
  4. as <- list(meta = NULL, data = NULL)
  5. as$meta <- clean_pat$meta
  6. as$data <- avg_data %>% select(datetime, mean) %>% rename(pm25 = mean)

I am optimistic this will generate a pretty reliable dataset for us to build systems on top of.

create issues for time series data

The MazamaPurpleAir_Private package has the following functions that need to be copied over to MazamaPurpleAir:

  • pat_load.R
  • downloadParseTimeseriesData.R
  • createPATimeseriesObject.R
  • pat_internalData.R

These functions allow you to take a monitor ID or name, perhaps obtained with pas_leaflet(), and download a week's worth of time series data for that particular sensor.

There is a lot to learn about the structure of the data and how these functions work so this task is to copy over the code and get it running so that you can explore time series data a little before creating your own issues to work on. A suggested starter list of issues might include:

  • rewrite documentation for the functions above
  • refactor to use better variable names and more comments in the code
  • add parameter validation and associated unit tests
  • add an example pat object to the internal data so that we can use it in examples
  • other?

pat_internalFit() improvements

This function currently uses base plot graphics. It should be rebuilt using ggplot graphics and showPlot = TRUE should generate a plot with two panes:

  1. ggplot version of A vs B as in the current pat_internalFit() plot
  2. pat_multiplot(prototype = "pm25_over")

As in the current version, it should return a linear model.

pas_filter() function

This is just a simple wrapper for dplyr::filter().

A person familiar with dplyr could just run filter(pas, ...) but we want to provide an easy way for people who are less familiar with R to begin using the package. So we should create a pas_filter() function which behaves for a pas in a way similar to how pat_filter() behaves for a pat object. And we will probably end up with an aqs object (for Air Quality Sensor) which is basically a high resolution version of the PWFSLSmoke ws_monitor object. These will eventually have their own aqs_filter~() functions.

play with version 0.1.0 of the package

Ruby --

You need to get comfortable with both RStudio's tools for building packages and with the initial set of functionality of the package. Before closing this task you should:

  • learn about building packages at http://r-pkgs.had.co.nz
  • build and explore all package functions to learn what they do
  • print out the RStudio package cheat sheet
  • practice and understand all the items in RStudio main menu under "Build"
  • learn about roxygen2 style comments
  • improve some documentation and then "Document/Test/Check"
  • "Install and Restart" and then review package documentation in the Help pane

pat_join() function to combine pat objects

My plan for generating an archive of pat data for SCAQMD is to create per-sensor monthly data files and then join them together so that we also have up-to-the-current-month annual files.

My first foray into this is local_examples/PROTOTYPE_pat_communityArchive.R

The monthly data files are ~0.5 MB so annual files will be a very reasonable 5 MB or so -- easy to download and load into memory.

To make all of this easy, we will want to have a pat_join() function that accepts two pat objects and does the following:

  • parameter validation
  • ensure that both pat objects share the same metadata
  • dplyr::bind_rows() on the $data
  • return the "joined" pat object

Even if it's only a few lines of code, I think it's worth having this as a separate utility function because we may end up adding more checks for edge cases.
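A minimal sketch of that utility, assuming the pat list structure described elsewhere in these issues:

```r
library(dplyr)

pat_join <- function(pat1, pat2) {
  # Sketch: refuse to join objects whose metadata differ
  if ( !identical(pat1$meta, pat2$meta) ) {
    stop("pat objects do not share the same metadata")
  }
  pat1$data <- bind_rows(pat1$data, pat2$data)
  pat1
}

# Example usage with toy monthly files
jan <- list(meta = data.frame(label = "s1"), data = data.frame(pm25 = 1:3))
feb <- list(meta = data.frame(label = "s1"), data = data.frame(pm25 = 4:5))
joined <- pat_join(jan, feb)
```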

add parameter validation to all functions

Every parameter in every function should be validated and should produce an appropriate error message if it is invalid.

  • downloadParseSynopticData()- check that the URL is valid, or that URL begins with 'https://www.purpleair.com', as appropriate.
  • enhanceSynopticData() - check for valid country code, appropriate class of pas_raw, valid boolean
  • pas_leaflet() - check pas is the appropriate class and nonempty, param is one of the choices, radius a number, opacity a # less than 1, maptype is one of the choices, and outsideOnly is boolean
  • pas_load() - check for valid URL, country code, outsideOnly is boolean and lookback days >= 1.
  • the other functions have no parameters as input

Purple Air Timeseries vignette

Create a vignettes/purple-air-timeseries.Rmd vignette.

It should educate someone new to the package about what a pat object is, how to get or save one and how to work with it. There might be sections on:

  • TBD

new pat_timeseriesDiagnosticPlot() function

(I'm not wedded to the name. Perhaps it should have "aggregation" as part of it? We'll probably need to use it for a while to figure out what it should be called.)

We will want a new function to go along with pat_timeseriesDiagnostics().

This function should accept a pat object (or possibly a diagnostic data frame?) and generate a set of stacked? plots where one can visually assess the values and statistics that come out of pat_timeseriesDiagnostics().

The use case I imagine is that our eventual automated QC process will flag a bunch of hours as bad and someone will ask "Why?". This function should provide a quick answer.

MazamaSpatialUtils not loaded via namespace or in attached packages

In using the MazamaPurpleAir package in the synoptic vignette, I have come across two issues having to do with using the MazamaSpatialUtils package.

  1. PWFSLSmoke imports MazamaSpatialUtils >= 0.5.4; however, enhanceSynopticData() throws an error unless MazamaSpatialUtils >= 0.6.1. I did not have the new version installed, got an error, and solved the problem by updating. Q: In the DESCRIPTION for MazamaPurpleAir under 'Imports', should MazamaSpatialUtils be >= 0.6.1?

  2. After loading MazamaPurpleAir, MazamaSpatialUtils does not show up in the session information returned by sessionInfo(). This is easily solved by including library(MazamaSpatialUtils), but I assume you do not want this to be a necessary command to using MazamaPurpleAir.
    I have tried restarting R and restarting RStudio. Here are the results of sessionInfo() from a newly started RStudio session, after loading MazamaPurpleAir 0.1.1.

R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] MazamaPurpleAir_0.1.1 MazamaCoreUtils_0.1.3 futile.logger_1.4.3   dplyr_0.7.8          

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0           crayon_1.3.4         assertthat_0.2.0     R6_2.3.0             futile.options_1.0.1
 [6] formatR_1.5          magrittr_1.5         pillar_1.3.1         rlang_0.3.1          rstudioapi_0.7      
[11] bindrcpp_0.2.2       lambda.r_1.2.3       tools_3.5.0          glue_1.3.0           purrr_0.2.5         
[16] yaml_2.2.0           compiler_3.5.0       pkgconfig_2.0.2      bindr_0.1.1          tidyselect_0.2.5    
[21] tibble_2.0.1 

pas_staticMap()

ESRI cut off free access to basemap images yesterday!

So our ability to generate static maps for pas objects has gone away.

We need a new pas_staticMap() function that accesses free tiles from one of the many available tile servers and assembles them into the desired basemap.

This function should probably accept:

  • zoom
  • centerLon
  • centerLat
  • height (pixels)
  • width (pixels)

Because we will be generating standard sized images with these base maps, being able to define the height and width is important.

pat_multiplot() enhancements

We need to design all of our output plots for one of two cases:

  1. use in a public facing web service where clean and simple is the priority
  2. publication-ready, fully annotated plots

The pat_multiplot() function is in need of some basic improvements:

  • axis labels
  • title
  • symbol should be semi-transparent squares

Ideally, we would have a set of defaults that a savvy user could override. If it's easy to implement, using a ggplot theme would be ideal.

And what can be done with the ... parameter?

Purple Air Synoptic vignette

Create a vignettes/purple-air-synoptic.Rmd vignette.

It should educate someone new to the package about what a pas object is, how to get or save one and how to work with it. There might be sections on:

  • "spatially enhanced metadata"
  • using dplyr to filter based on different criteria
  • creating interactive maps with pas_leaflet()
  • creating static maps with TBD

allow differentiation of AirNow FRMs vs. E-BAMs and Nephelometers

According to Lee Tarnay: The BAM FRM is the official "ground truth" while the E-BAMs and nephelometers produce biased results.

In allowing for comparison of PurpleAir data with "FRM"s, we should allow for the exclusion of non E-BAM monitors.

We probably need some argument called FRM_only or some such.

One way to identify these monitors is to take only the AirNow, non-mobile monitors.

For AIRSIS and WRCC monitors, there may be metadata fields that can help with this identification.

pas_esriMap() function

We need to produce static maps as well as interactive ones.

We should harness the functionality in the PWFSLSmoke package and create a function that mimics some of the functionality in monitor_esriMap().

This new function will need to determine center location and zoom from the incoming pas object.

new pat_timeAverageDiagnostics() function

I cloned the source code for openair and took a look at the implementation of the timeAverage() function.

That function is waaaaay too tied to the openair data model. We can create our own function that does what we want, runs faster and has much more readable code. I've gotten a start in local_examples/PROTOTYPE_pat_timeAverage.R.

The feature set for this function is:

  • accept a pat object and return a tibble with new data columns on a new time axis
  • accept a unit parameter specifying the new time axis period
  • the returned tibble will have columns with mean, sd and count statistics for each of pm25_A, pm25_B, temperature and humidity. Columns will be named <parameter>_<statistic>
  • the returned tibble will have additional t-test parameters: pm25_t, pm25_df and pm25_p
  • be sure to convert any NaN, Inf or NULL values generated by mean or sd into NA
  • if it takes less than 4 hours, add support for a data.thresh parameter
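The NaN/Inf cleanup mentioned above can be handled with a small helper (finiteOrNA() is a hypothetical name):

```r
# Hypothetical helper: replace non-finite values (NaN, Inf, -Inf) with NA,
# as produced by mean() or sd() on empty or degenerate groups
finiteOrNA <- function(x) {
  x[!is.finite(x)] <- NA
  x
}

cleaned <- finiteOrNA(c(1.5, NaN, Inf, -Inf, 2.0))
```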

improve pat_scatterplot()

The pat_scatterplot() is a good working example but can be improved.

If it's not too hard, we should convert it to ggplot graphics.

Desirable features:

  • parameters argument -- We should allow people to pass in a vector of the parameters they want to generate scatterplots for. Any combination of pat internal variables should be OK. But we also want to default to datetime, A, B, temp, humidity.
  • graphical parameters optimized for speed -- shape = 18? or '.'
  • sampleFraction argument -- For large data frames we will optimize things by randomly sampling the data. The sampleFraction argument should default to NULL which means we subsample to get a data frame with a reasonable number of rows, say 1e5. But users will be allowed to set this parameter to 1 if they want to force it to plot everything or a low number if they want to speed things up
  • ... argument -- We should allow people to supply extra graphical parameters to customize the plot like color, size, shape, etc.

add sensor "label" to every pat_~ plot

The various pat_~ plots are well annotated except for the name of the sensor being plotted.

The label field in the pat$meta should be used to annotate each plot. We should also think about a plan for displaying the sensorType.

The label may need to be truncated to N characters to fit nicely. My first thought is to just include the label at the beginning of each plot title. That's probably good enough for now and we can worry about moving it or adding sensorType only if requested.

learn to use package building packages

R has a number of packages to help you build packages. You should become passingly familiar with the following:

  • goodpractice -- We don't slavishly follow line length or test-for-every-function suggestions but it's good to run the gp() function and see what it says.
  • usethis -- You should know what is available here. I already used use_pipe() to include piping functionality in the package.
  • testthat -- You will be using this to create tests.
  • Any others?

createASTimeseriesObject()

I am now convinced that we need another data model, "Air Sensor Timeseries", that has the following characteristics:

meta dataframe

  • location metadata from pat object

data dataframe

  • uniform datetime time axis on one of 5-min, 10-min, 15-min or hourly
  • columns for pm25, humidity, temperature (means)
  • columns for pm25_sd, humidity_sd, temperature_sd
  • columns for pm25_count, humidity_count, temperature_count
  • columns for pm25_qc, humidity_qc, temperature_qc
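That 13-column layout (datetime plus mean/sd/count/qc for each of the three parameters) might be sketched as an empty skeleton. This is an assumption about types; in particular, integer qc columns are a guess:

```r
# Skeleton of the proposed 'ast' data model; column names follow the
# lists above, column types are assumptions
emptyAstData <- data.frame(
  datetime          = as.POSIXct(character(0)),
  pm25              = numeric(0),
  humidity          = numeric(0),
  temperature       = numeric(0),
  pm25_sd           = numeric(0),
  humidity_sd       = numeric(0),
  temperature_sd    = numeric(0),
  pm25_count        = integer(0),
  humidity_count    = integer(0),
  temperature_count = integer(0),
  pm25_qc           = integer(0),
  humidity_qc       = integer(0),
  temperature_qc    = integer(0)
)

ast <- list(meta = NULL, data = emptyAstData)
```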

This function will start with a pat object and requested time period and do the following:

  1. guarantee that we are working with a pat object
  2. perform to-be-determined QC on the A/B channel data (leave this as a stub for now)
  3. use openair::timeAverage() on the merged A/B channel data to produce pm25 metrics
  4. perform to-be-determined QC on the humidity data (leave this as a stub for now)
  5. use openair::timeAverage() to produce humidity metrics
  6. perform to-be-determined QC on the temperature data (leave this as a stub for now)
  7. use openair::timeAverage() to produce temperature metrics

The returned object will look a lot like a pat object with the following differences:

  • it may have a few to-be-determined additional metadata fields
  • it will have a time axis defined by period
  • it will always have the same 13 columns in the data dataframe

The reason for this additional data model is that we will almost certainly need to generate per-sensor plots of QC'ed pm25, temperature and humidity data on a uniform time axis. The pas and pat data models and associated functions are intended for use by serious analysts wishing to work with raw data.

This new ast object is for community members who want to look at relatively high resolution data and may have some questions about sensor performance but don't want to get lost in the details. Various ast_~ plot functions will provide graphics aimed more at interested members of the general public.

Finally, the ast object is generic enough that it will accommodate data from many types of sensors. We would need to have separate data merging/QC pathways for each type of sensor but the ast object will allow for a harmonized approach.

pat_timeAverage()

This function should limit its functionality to wrapping potentially multiple calls to the openair::timeAverage() function and returning a data frame. We will keep the A and B channels separate for now.

We still need to do some investigating of the "pat" data so we can come up with some sort of reasonable QC before we start mixing A and B together to create our "official" number.

Input

  • pat = NULL -- pa_timeseries object
  • parameter = "pm25_A" -- "pat" data column to process
  • period = "10 min" -- used as avg.time ("period" is the terminology preferred by lubridate)
  • dataThreshold = 0 -- used as data.thresh
  • stats = "all" -- vector of statistics to return as dataframe columns
  • fill = FALSE -- used as fill

Omit all of the following openair::timeAverage() arguments: "type", "percentile", "start.date", "end.date", "interval" and "vector.ws".

Output
The returned dataframe should have a first column named datetime followed by the names of the requested statistics, with "count" replacing openair's "frequency".
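One possible sketch of the wrapper, making one openair::timeAverage() call per requested statistic and renaming "frequency" to "count"; the internal structure assumed for the pat object (pat$data$datetime, pat$data[[parameter]]) and all other details are assumptions, not a final implementation:

```r
# Hypothetical sketch of pat_timeAverage(). Requires the openair package
# for the actual averaging; only the input validation is final here.
pat_timeAverage <- function(pat = NULL,
                            parameter = "pm25_A",
                            period = "10 min",
                            dataThreshold = 0,
                            stats = "all",
                            fill = FALSE) {

  if ( !inherits(pat, "pa_timeseries") )
    stop("argument 'pat' is not a pa_timeseries object")

  if ( identical(stats, "all") )
    stats <- c("mean", "median", "sd", "min", "max", "frequency")
  # accept "count" as an alias for openair's "frequency"
  stats[stats == "count"] <- "frequency"

  # openair::timeAverage() expects a POSIXct column named 'date'
  df <- data.frame(date = pat$data$datetime,
                   value = pat$data[[parameter]])

  # one timeAverage() call per requested statistic
  result <- NULL
  for ( stat in stats ) {
    tbl <- openair::timeAverage(df,
                                avg.time = period,
                                data.thresh = dataThreshold,
                                statistic = stat,
                                fill = fill)
    # name the averaged column after its statistic,
    # replacing openair's "frequency" with "count"
    names(tbl)[names(tbl) == "value"] <-
      ifelse(stat == "frequency", "count", stat)
    result <- if ( is.null(result) ) tbl else merge(result, tbl, by = "date")
  }

  names(result)[names(result) == "date"] <- "datetime"
  return(result)
}
```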

When it all works, a user should be able to do the following:

# Generate a plot of 10 minute averages for pm25_A
pat %>%
  pat_timeAverage("pm25_A", "10 min", stats = "mean") %>%
  plot()
# Generate a QC plot of all statistics for channel B
pat %>%
  pat_timeAverage("pm25_B", "1 hour") %>%
  timeAveragePlot(plottype = "QC")
# Test my own ideas for a QC metric
my_qc <- pat %>%
  pat_timeAverage("pm25_A", "1 hour", stats = c("median", "sd", "count")) %>%
  mutate(qc_1 = median - sd, qc_2 = median / sd, qc_3 = sd * count)

somePlotFunction(my_qc)
