mazamascience / airsensor
Utilities for working with data from PurpleAir sensors
Home Page: https://mazamascience.github.io/AirSensor/
License: GNU General Public License v3.0
We need to design all of our output plots for one of two cases:
The pat_multiplot() function needs some basic improvements:
Ideally, we would have a set of defaults that a savvy user could override. If it's easy to implement, using a ggplot theme would be ideal.
And what can be done with the ... parameter?
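One possibility, sketched below, is to collect the defaults in a ggplot theme and pass ... straight through to ggplot2::theme() so that user overrides win. This is a hypothetical signature, not the current pat_multiplot() implementation:

```r
# Hypothetical sketch -- not the current pat_multiplot() signature.
# Defaults live in a ggplot2 theme; anything passed via `...` is handed
# to ggplot2::theme() and, coming last, overrides the defaults.
pat_multiplot <- function(pat, plottype = "all", ...) {
  defaultTheme <- ggplot2::theme_minimal() +
    ggplot2::theme(legend.position = "none")
  ggplot2::ggplot(pat$data, ggplot2::aes(datetime, pm25_A)) +
    ggplot2::geom_point(alpha = 0.3) +
    defaultTheme +
    ggplot2::theme(...)
}
```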
In the documentation, the function that downloads PWFSL data is called pwfsl_loadLatest; however, the file name is pwfsl_load. Shouldn't file names reflect function names exactly? There is also inconsistency within the Examples and Usage in the documentation for that file regarding pwfsl_load/pwfsl_loadLatest.
Replace (or parameterize) the AQI colors with a continuous color scale in pas_leaflet().
too many typos and unclear documentation
Something short that provides enough background for Brandon to understand.
(I'm not wedded to the name. Perhaps it should have "aggregation" as part of it? We'll probably need to use it for a while to figure out what it should be called.)
We will want a new function to go along with pat_timeseriesDiagnostics().
This function should accept a pat object (or possibly a diagnostic data frame?) and generate a set of stacked? plots where one can visually assess the values and statistics that come out of pat_timeseriesDiagnostics().
The use case I imagine is that our eventual automated QC process will flag a bunch of hours as bad and someone will ask "Why?". This function should provide a quick answer.
Use the testthat package to create tests. For example code, you can look at the packages where we have written the most tests:
Two .R files with roxygen2 comments exist and describe the example data in the data/ directory:
pas.R
pas_raw.R
These should be removed.
You should read up on data in packages: http://r-pkgs.had.co.nz/data.html
You should probably recreate these datasets as example_pas and example_pas_raw and save them as similarly named files in the data/ directory. This will prevent any confusion when example code uses pas as a variable name.
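Recreating the renamed datasets is a one-time chore. A sketch, assuming the pas and pas_raw objects are already loaded in the session:

```r
# Rename the example datasets and save them as data/example_pas.rda
# and data/example_pas_raw.rda (assumes `pas` and `pas_raw` exist).
example_pas <- pas
example_pas_raw <- pas_raw
usethis::use_data(example_pas, overwrite = TRUE)
usethis::use_data(example_pas_raw, overwrite = TRUE)
```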
You should then create a new MazamaPurpleAir.R file and put all the package internal documentation describing internal data in there. You can use the following source code as a model:
https://github.com/MazamaScience/MazamaSpatialUtils/blob/master/R/MazamaSpatialUtils.R
All the input validation below should have an associated test in the tests/testthat directory.
downloadParseSynopticData() -- check that the URL is valid, or that the URL begins with 'https://www.purpleair.com', as appropriate.
enhanceSynopticData() -- check for a valid country code, the appropriate class of pas_raw, and valid booleans.
pas_leaflet() -- check that pas is the appropriate class and non-empty, that param is one of the choices, that radius is a number.
pas_load() -- This function appears to be wicked fast. We should test it and then use it whenever we need to calculate multiple distances.
This function currently uses base plot graphics. It should be rebuilt using ggplot graphics, and showPlot = TRUE should generate a plot with two panes:
pat_internalFit() plot
pat_multiplot(prototype = "pm25_over")
As in the current version, it should return a linear model.
R has a number of packages to help you build packages. You should become passingly familiar with the following:
goodpractice -- We don't slavishly follow line length or test-for-every-function suggestions but it's good to run the gp() function and see what it says.
usethis -- You should know what is available here. I already used use_pipe() to include piping functionality in the package.
testthat -- You will be using this to create tests.
This is just a simple wrapper for dplyr::filter().
A person familiar with dplyr could just run filter(pas, ...) but we want to provide an easy way for people who are less familiar with R to begin using the package. So we should create a pas_filter() function which behaves for a pas object in a way similar to how pat_filter() behaves for a pat object. And we will probably end up with an aqs object (for Air Quality Sensor) which is basically a high resolution version of the PWFSLSmoke ws_monitor object. These will eventually have their own aqs_filter~() functions.
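A minimal sketch of such a wrapper, assuming a pas object is a tibble subclass (the class-restoration step is a guess at what the real implementation needs):

```r
# Sketch of pas_filter(): a thin wrapper around dplyr::filter()
# that preserves the "pa_synoptic" class on the returned object.
pas_filter <- function(pas, ...) {
  result <- dplyr::filter(pas, ...)
  class(result) <- class(pas)  # retain the original class attribute
  result
}

# Usage -- should feel just like dplyr::filter():
# ca <- pas_filter(pas, stateCode == "CA", pwfsl_closestDistance < 1000)
```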
The various pat_~ plots are well annotated except for the name of the sensor being plotted.
The label field in pat$meta should be used to annotate each plot. We should also think about a plan for displaying the sensorType.
The label may need to be truncated to N characters to fit nicely. My first thought is to just include the label at the beginning of each plot title. That's probably good enough for now and we can worry about moving it or adding sensorType only if requested.
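Truncating and prepending the label could be a small helper, along these lines (the function and parameter names are hypothetical):

```r
# Sketch: prepend a (possibly truncated) sensor label to a plot title.
# `maxChars` is a hypothetical parameter name.
titleWithLabel <- function(pat, title, maxChars = 20) {
  label <- pat$meta$label
  if (nchar(label) > maxChars) {
    label <- paste0(substr(label, 1, maxChars - 1), "\u2026")  # add ellipsis
  }
  paste0(label, " -- ", title)
}
```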
I cloned the source code for openair and took a look at the implementation of the timeAverage() function.
That function is waaaaay too tied to the openair data model. We can create our own function that does what we want, runs faster and has much more readable code. I've gotten a start in local_examples/PROTOTYPE_pat_timeAverage.R.
The feature set for this function is:
accept a pat object and return a tibble with new data columns on a new time axis
a unit parameter specifying the new time axis period
mean, sd and count statistics for each of pm25_A, pm25_B, temperature and humidity; columns will be named <parameter>_<statistic>
pm25_t, pm25_df and pm25_p columns
convert any NaN, Inf or NULL values generated by mean or sd into NA
a data.thresh parameter
We need to produce static maps as well as interactive ones.
We should harness the functionality in the PWFSLSmoke package and create a function that mimics some of the functionality in monitor_esriMap().
This new function will need to determine center location and zoom from the incoming pas object.
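Determining a center and zoom from the sensor locations might look something like this sketch (the longitude/latitude column names and the zoom heuristic are assumptions, not the actual implementation):

```r
# Sketch: derive a map center and zoom level from a pas object's
# sensor coordinates. Assumes `longitude`/`latitude` columns exist.
mapRange <- function(pas) {
  lonRange <- range(pas$longitude, na.rm = TRUE)
  latRange <- range(pas$latitude, na.rm = TRUE)
  center <- c(lon = mean(lonRange), lat = mean(latRange))
  # crude heuristic: each zoom level halves the visible extent
  extent <- max(diff(lonRange), diff(latRange))
  zoom <- max(1, min(15, floor(9 - log2(extent))))
  list(center = center, zoom = zoom)
}
```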
Every parameter in every function should be validated and have an appropriate error message if it is invalid.
downloadParseSynopticData() -- check that the URL is valid, or that the URL begins with 'https://www.purpleair.com', as appropriate.
enhanceSynopticData() -- check for a valid country code, the appropriate class of pas_raw, and valid booleans.
pas_leaflet() -- check that pas is the appropriate class and non-empty, that param is one of the choices, that radius is a number.
pas_load()
- Show how these new functions can be used in conjunction to create a powerful exploratory analysis environment.
This should include:
Longer pat time series take forever to plot!
We can provide visually identical plots using many fewer points if we use dplyr::sample_n() to restrict the number of points to some reasonable number like 10,000 or so.
This will also be important when we implement pat_dygraph() for interactive time series.
The problem with naively sampling is that you will miss out on many of the potentially important outliers. So our pat_sample() function will need to be smart by doing the following:
run seismicRoll::findOutliers(n = 11, thresholdMin = 4) or some such and save this "outliers" tibble
use dplyr::sample_n() to naively sample and create a "subsampled" tibble
combine the two tibbles and arrange by datetime
It might have taken less typing if I had just written out the lines of dplyr code. ;-)
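Written out, the dplyr version might look like this sketch (column names and the exact combining strategy are guesses, not the final implementation):

```r
# Sketch of outlier-preserving subsampling, assuming pat$data holds
# the raw tibble with a pm25_A column and a datetime column.
pat_sample <- function(pat, sampleSize = 10000) {
  data <- pat$data
  # flag outliers so they survive the subsampling
  outlierIndices <- seismicRoll::findOutliers(
    data$pm25_A, n = 11, thresholdMin = 4
  )
  outliers <- data[outlierIndices, ]
  # naive sample of the full dataset
  subsampled <- dplyr::sample_n(data, min(sampleSize, nrow(data)))
  # combine, drop duplicates and restore time order
  pat$data <- dplyr::bind_rows(outliers, subsampled) %>%
    dplyr::distinct() %>%
    dplyr::arrange(datetime)
  pat
}
```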
Replace (or parameterize) the AQI colors with a continuous color scale in pas_esriMap().
This should return a new pa_timeseries object with either replaced or NA outliers. Instead, it returns a NULL data list.
Must be fixed in order to work with filtered data.
I am now convinced that we need another data model, "Air Sensor Timeseries" (ast), that has the following characteristics:
a meta dataframe
a data dataframe with:
a datetime time axis on one of 5-min, 10-min, 15-min or hourly
pm25, humidity, temperature (means)
pm25_sd, humidity_sd, temperature_sd
pm25_count, humidity_count, temperature_count
pm25_qc, humidity_qc, temperature_qc
This function will start with a pat object and requested time period and do the following:
run openair::timeAverage() on the merged A/B channel data to produce pm25 metrics
run openair::timeAverage() to produce humidity metrics
QC the temperature data (leave this as a stub for now)
run openair::timeAverage() to produce temperature metrics
The returned object will look a lot like a pat object with the following differences:
a period attribute
a different set of columns in the data data frame
The reason for this additional data model is that we will almost certainly need to generate per-sensor plots of pm25, temperature and humidity on QC'ed data on a uniform time axis. The pas and pat data models and associated functions are intended for use by serious analysts wishing to work with raw data.
This new ast object is for community members who want to look at relatively high resolution data and may have some questions about sensor performance but don't want to get lost in the details. Various ast_~ plot functions will provide graphics aimed more at interested members of the general public.
Finally, the ast object is generic enough that it will accommodate data from many types of sensors. We would need to have separate data merging/QC pathways for each type of sensor but the ast object will allow for a harmonized approach.
I typed the following code at the command line, from the MazamaPurpleAir directory:
devtools::load_all()
data(pas_Jan25)
initializeMazamaSpatialUtils()
pas_leaflet(pas)
The error I receive is:
Error in validObject(.Object) :
invalid class “SpatialPointsDataFrame” object: 1: invalid object for slot "data" in class "SpatialPointsDataFrame": got class "pa_synoptic", should be or extend class "data.frame"
invalid class “SpatialPointsDataFrame” object: 2: invalid object for slot "data" in class "SpatialPointsDataFrame": got class "tbl_df", should be or extend class "data.frame"
invalid class “SpatialPointsDataFrame” object: 3: invalid object for slot "data" in class "SpatialPointsDataFrame": got class "tbl", should be or extend class "data.frame"
invalid class “SpatialPointsDataFrame” object: 4: invalid object for slot "data" in class "SpatialPointsDataFrame": got class "data.frame", should be or extend class "data.frame"
Called from: validObject(.Object)
sessionInfo() gives the following information:
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MazamaPurpleAir_0.1.1 MazamaCoreUtils_0.1.3 futile.logger_1.4.3
[4] dplyr_0.7.8
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 RColorBrewer_1.1-2 pillar_1.3.1
[4] compiler_3.5.0 formatR_1.5 later_0.7.5
[7] bindr_0.1.1 futile.options_1.0.1 tools_3.5.0
[10] testthat_2.0.1 digest_0.6.17 PWFSLSmoke_1.1.18
[13] lattice_0.20-35 lubridate_1.7.4 jsonlite_1.6
[16] memoise_1.1.0 tibble_2.0.1 pkgconfig_2.0.2
[19] rlang_0.3.1 shiny_1.1.0 rstudioapi_0.7
[22] mapproj_1.2.6 commonmark_1.5 crosstalk_1.0.0
[25] rgdal_1.3-6 yaml_2.2.0 bindrcpp_0.2.2
[28] withr_2.1.2 stringr_1.3.1 httr_1.4.0
[31] roxygen2_6.1.0 xml2_1.2.0 maps_3.3.0
[34] htmlwidgets_1.2 devtools_1.13.6 grid_3.5.0
[37] leaflet_2.0.2 tidyselect_0.2.5 glue_1.3.0
[40] R6_2.3.0 sp_1.3-1 purrr_0.2.5
[43] lambda.r_1.2.3 magrittr_1.5 scales_1.0.0
[46] MazamaSpatialUtils_0.5.4 promises_1.0.1 htmltools_0.3.6
[49] assertthat_0.2.0 colorspace_1.3-2 xtable_1.8-3
[52] mime_0.6 httpuv_1.4.5 stringi_1.2.4
[55] munsell_0.5.0 crayon_1.3.4
Because we have stuffed the actual tibbles associated with a pat object inside of pat$meta and pat$data, we cannot use dplyr filtering on our pat object.
So we need a pat_filter() function which applies the incoming filtering expression to pat$data and returns the modified pat object.
Using pat_filter() should feel just like working with dplyr::filter().
See PWFSLSmoke::monitor_subsetBy() for example code for how to do this.
We already have a pat_filterDate() function that simplifies things for the most common usage and we should provide pat_filterData() that is more general, just to be complete.
For extra completeness, we should have a pat_filter() function that just wraps pat_filterData().
That seems like a good API for end users:
~_filter() -- wrapper for ~_filterData() for those expecting dplyr::filter()
~_filterDate() -- convenience function; just specify user friendly startdate and enddate
~_filterData() -- explicit; does just what it says
The "Label" column in the returned data frame has mixed encoding: "unknown" and "UTF-8".
We need to carefully deal with encoding so as to preserve Hispanic place names.
And we may need to use stringi::stri_escape_unicode() if we save this data as package internal data.
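A sketch of the normalization step (the "Label" column name comes from the issue text; verify against the actual data before relying on this):

```r
# Sketch: force the "Label" column to UTF-8 and escape non-ASCII
# characters before saving as package internal data.
labels <- pas$Label
labels <- stringi::stri_enc_toutf8(labels, validate = TRUE)  # fix mixed encodings
pas$Label <- stringi::stri_escape_unicode(labels)            # e.g. "niño" -> "ni\\u00f1o"
```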
This function does simple subsetting of the columns of data in pat$data and then does one of two things:
It's probably less confusing to have separate functions for extraction and plotting.
Also, the plotting function needs to spit out a warning if attempting to plot more than # records X # different parameters.
The code used in pat_sample(forGraphic = TRUE) is not quite right in version 0.2.5. It doesn't detect and save the same outliers that one sees when running pat_outliers().
One thing to note is that sampling for graphics doesn't need to preserve the every-minute time axis. So we can start off by simply removing any records where both pm25_A and pm25_B are missing.
Some of the logic from pat_outliers() needs to be copied over when forGraphic = TRUE so that the overall creation of the returned data frame has the following steps:
apply is.na() separately on the A and B channel records before finding outliers
reduce sampleSize by the number of outliers detected so that we return the precise sampleSize requested and so that we avoid errors when sampleSize is greater than the number of rows in the non-outliers
use dplyr::full_join() as in pat_outliers()
use dplyr::bind_rows()
use dplyr::distinct() before dplyr::arrange()
Modify local_examples/05_graphics_sampling.R to use example_pat for the week of 2018-08-01 to 2018-08-07. Create a plot using pat_outliers(thresholdMin = 4) and then several runs of pat_multiplot(sampleSize = N) to show that the graphics retain all of the outliers originally detected regardless of N.
My plan for generating an archive of pat data for SCAQMD is to create per-sensor monthly data files and then join them together so that we also have up-to-the-current-month annual files.
My first foray into this is local_examples/PROTOTYPE_pat_communityArchive.R
The monthly data files are ~0.5 MB so annual files will be a very reasonable 5 MB or so -- easy to download and load into memory.
To make all of this easy, we will want to have a pat_join() function that accepts two pat objects and does the following:
use dplyr::bind_rows() on the $data
return a single pat object
Even if it's only a few lines of code, I think it's worth having this as a separate utility function because we may end up adding more checks for edge cases.
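A minimal sketch of pat_join(); the duplicate-handling steps are assumptions about what edge cases will need, not the final implementation:

```r
# Sketch: combine the $data from two pat objects that share metadata.
pat_join <- function(pat1, pat2) {
  data <- dplyr::bind_rows(pat1$data, pat2$data) %>%
    dplyr::distinct() %>%        # monthly files may overlap at the seams
    dplyr::arrange(datetime)
  list(meta = pat1$meta, data = data)
}

# Usage: stitch monthly archives into an annual pat object
# annual_pat <- pat_join(pat_jan, pat_feb) %>% pat_join(pat_mar)
```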
Ruby --
You need to get comfortable with both RStudio's tools for building packages and with the initial set of functionality of the package. Before closing this task you should:
for internal and example use.
ESRI cut off free access to basemap images yesterday!
So our ability to generate static maps for pas objects has gone away.
We need a new pas_staticMap()
function that accesses free tiles from one of the many available tile servers and assembles them into the desired basemap.
This function should probably accept:
Because we will be generating standard sized images with these base maps, being able to define the height and width is important.
add unit testing to the time series functionality
The color scales for temperature, humidity and pwfsl_closestDistance in pas_leaflet() are carefully constructed color palettes with breaks specific to each variable.
These color scales should be ported over to pas_staticMap() as named palettes:
temperature
humidity
distance
The pat_scatterplot() is a good working example but can be improved.
If it's not too hard, we should convert it to ggplot graphics.
Desirable features:
parameters argument -- We should allow people to pass in a vector of the parameters they want to generate scatterplots for. Any combination of pat internal variables should be OK. But we also want to default to datetime, A, B, temp, humidity.
sampleFraction argument -- For large data frames we will optimize things by randomly sampling the data. The sampleFraction argument should default to NULL, which means we subsample to get a data frame with a reasonable number of rows, say 1e5. But users will be allowed to set this parameter to 1 if they want to force it to plot everything, or a low number if they want to speed things up.
... argument -- We should allow people to supply extra graphical parameters to customize the plot like color, size, shape, etc.
Create a vignettes/purple-air-timeseries.Rmd vignette.
It should educate someone new to the package about what a pat object is, how to get or save one and how to work with it. There might be sections on:
Needs to show the complete legend to provide context to the user.
Reproduce using the "temperature" parameter.
One of the most telling plots for Purple Air sensors is to have data from the A and B channels plotted on top of each other with different colors and partial opacity. I like red and blue because, when they lie on top of each other you get purple.
As a budding ggplot expert, I would hope this is a straightforward task.
The MazamaPurpleAir_Private package has the following functions that need to be copied over to MazamaPurpleAir:
pat_load.R
downloadParseTimeseriesData.R
createPATimeseriesObject.R
pat_internalData.R
These functions allow you to take a monitor ID or name, perhaps obtained with pas_leaflet(), and download a week's worth of time series data for that particular sensor.
There is a lot to learn about the structure of the data and how these functions work, so this task is to copy over the code and get it running so that you can explore time series data a little before creating your own issues to work on. A suggested starter list of issues might include:
adding a pat object to the internal data so that we can use it in examples
This function will work just like PWFSLSmoke::monitor_dygraph() except for pat objects.
We probably need to get pat_sample() working first as dygraph may become unresponsive with too many data points.
Create a vignettes/purple-air-synoptic.Rmd vignette.
It should educate someone new to the package about what a pas object is, how to get or save one and how to work with it. There might be sections on:
pas_leaflet()
According to Lee Tarnay: The BAM FRM is the official "ground truth" while the E-BAMs and nephelometers produce biased results.
In allowing for comparison of PurpleAir data with "FRM"s, we should allow for the exclusion of non E-BAM monitors.
We probably need some argument called FRM_only or some such.
One way to identify these monitors is to take only the AirNow, non-mobile monitors.
For AIRSIS and WRCC monitors, there may be metadata fields that can help with this identification.
library(MazamaPurpleAir) should be the only dependency the end-user should worry about.
Currently, pas_load() throws:
Error: Missing database. Please loadSpatialData("SimpleCountriesEEZ")
This should be done automatically without raising unnecessary exceptions.
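One way to do this automatically is a small guard function called before any spatial lookup. This is a sketch; it assumes MazamaSpatialUtils has already been configured with a spatial data directory:

```r
# Sketch: load a required spatial dataset on demand instead of throwing.
# Assumes setSpatialDataDir() has already been called and the dataset
# has been installed locally.
ensureSpatialData <- function(dataset = "SimpleCountriesEEZ") {
  if (!exists(dataset, envir = .GlobalEnv)) {
    MazamaSpatialUtils::loadSpatialData(dataset)
  }
}
```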
This function should limit its functionality to wrapping potentially multiple calls to the openair::timeAverage() function and returning a data frame. We will keep the A and B channels separate for now.
We still need to do some investigating of the "pat" data so we can come up with some sort of reasonable QC before we start mixing A and B together to create our "official" number.
Input
pat = NULL -- pa_timeseries object
parameter = "pm25_A" -- "pat" data column to process
period = "10 min" -- used as avg.time ("period" is the terminology preferred by lubridate)
dataThreshold = 0 -- used as data.thresh
stats = "all" -- vector of statistics to return as dataframe columns
fill = FALSE -- used as fill
Omit all of the following from openair::timeAverage(): "type", "percentile", "start.date", "end.date", "interval" and "vector.ws".
Output
The returned dataframe should have a first column named datetime followed by the names of the requested statistics, with "count" replacing openair's "frequency".
When it all works, a user should be able to do the following:
# Generate a plot of 10 minute averages for pm25_A
pat %>%
pat_timeAverage("pm25_A", "10 min", stats = "mean") %>%
plot()
# Generate a QC plot of all statistics for channel B
pat %>%
pat_timeAverage("pm25_B", "1 hour") %>%
timeAveragePlot(plottype = "QC")
# Test my own ideas for a QC metric
my_qc <- pat %>%
  pat_timeAverage("pm25_A", "1 hour", stats = c("median", "sd", "count")) %>%
  mutate(qc_1 = median - sd, qc_2 = median / sd, qc_3 = sd * count)
somePlotFunction(my_qc)
Mazama PA core functionality throws exceptions if packages are not properly loaded. Explore possibility of automatically checking and loading necessary package dependencies. This could possibly be resolved utilizing sessionInfo().
The following columns need to be added in enhanceSynopticData()
:
This ticket previously talked about doing everything related to time averaging in a single function. This turns out to be too messy so this issue is being deprecated in favor of smaller lego bricks of functionality.
-- >8 --
Create a pat_timeAverage() function based on local_examples/TESTING_patTimeAverage().
This function should generate a ggplot version of the three-level plot in the example.
The configurable parameters in that file should all be turned into function parameters.
There should also be a showPlot = TRUE parameter.
As I'm not sure yet what type of plot will be most compelling, there should be a prototype parameter allowing one to choose among:
There should also be a sdThreshold = 1e6 parameter which will be used to replace some of the mean values in the returned data frame with NA.
The function should invisibly return a data frame of aggregated data with the following columns of data generated by openair::timeAverage():
The combination of the pat_outliers() and pat_timeAverage() functions should provide good tools for both visual inspection of the data and for programmatic removal of suspect data before we convert it into our as_ data model, which will have an averaged datetime axis and only a single pm2.5 measurement per sensor. (Just like a ws_monitor object.)
The conversion process for a single sensor will go something like this:
raw_pat <- pat_load()
clean_pat <- pat_outliers(raw_pat)
avg_data <- openair::timeAverage(clean_pat, sdThreshold = 10)
as <- list(meta = NULL, data = NULL)
as$meta <- clean_pat$meta
as$data <- avg_data %>% select(datetime, mean) %>% rename(pm25 = mean)
I am optimistic this will generate a pretty reliable dataset for us to build systems on top of.
We use the pkgdown package to create websites for each package. You should familiarize yourself with this package and use the build_site() function to create a complete website. (It will be saved in a docs/ directory that you will then upload. I have to do some work on GitHub to make it visible.)
Check the main public Mazama package repositories for usage and extra files like _pkgdown.yml or svg files, etc.
The downloadParseTimeseriesData() function currently errors out when it gets an error response from ThingSpeak. This is undesirable because that stop() bubbles up to pat_load() where it is not handled.
This forces end users to know ahead of time the time range of available data.
The proper solution is to have downloadParseTimeseriesData() handle the error message from ThingSpeak by returning a tibble with all appropriate columns but no rows.
This can then be used by bind_rows() in pat_load() without any special handling.
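The zero-row tibble might look like this sketch (the column names are illustrative and must match the real pat$data columns):

```r
# Sketch: a zero-row tibble with the expected pat data columns,
# returned when ThingSpeak reports an error for the requested range.
emptyPatTibble <- function() {
  tibble::tibble(
    datetime    = as.POSIXct(character(0), tz = "UTC"),
    pm25_A      = numeric(0),
    pm25_B      = numeric(0),
    temperature = numeric(0),
    humidity    = numeric(0)
  )
}
# dplyr::bind_rows(emptyPatTibble(), someData) leaves someData unchanged
```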
In using the MazamaPurpleAir package in the synoptic vignette, I have come across two issues having to do with using the MazamaSpatialUtils package.
PWFSLSmoke imports MazamaSpatialUtils >= 0.5.4; however, enhanceSynopticData() throws an error unless MazamaSpatialUtils >= 0.6.1. I did not have the new version downloaded, got an error, and solved the problem by updating. Q: In the DESCRIPTION for MazamaPurpleAir under 'Imports', should MazamaSpatialUtils be >= 0.6.1?
After loading MazamaPurpleAir, MazamaSpatialUtils does not show up in the session information returned by sessionInfo(). This is easily solved by including library(MazamaSpatialUtils), but I assume you do not want this to be a necessary command to use MazamaPurpleAir.
I have tried: restarting R, restarting RStudio. Here are the results of sessionInfo() from a newly started RStudio session, after loading MazamaPurpleAir 0.1.1.
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MazamaPurpleAir_0.1.1 MazamaCoreUtils_0.1.3 futile.logger_1.4.3 dplyr_0.7.8
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 crayon_1.3.4 assertthat_0.2.0 R6_2.3.0 futile.options_1.0.1
[6] formatR_1.5 magrittr_1.5 pillar_1.3.1 rlang_0.3.1 rstudioapi_0.7
[11] bindrcpp_0.2.2 lambda.r_1.2.3 tools_3.5.0 glue_1.3.0 purrr_0.2.5
[16] yaml_2.2.0 compiler_3.5.0 pkgconfig_2.0.2 bindr_0.1.1 tidyselect_0.2.5
[21] tibble_2.0.1