mfasiolo / mgcviz Goto Github PK
View Code? Open in Web Editor NEWAn R package for interactive visualization of GAM models
Home Page: https://mfasiolo.github.io/mgcViz/
An R package for interactive visualization of GAM models
Home Page: https://mfasiolo.github.io/mgcViz/
We could follow the approach used here:
https://statnmap.com/2017-11-01-spatial-interpolation-on-earth-as-a-3d-sphere/
and we could add contours as well (not sure how hard it is).
Arguably, dFun is on a divergent scale with a natural midpoint at 0, so it should use a divergent color scale with fixed midpoint = 0
so that a divergence of 0 is visually identical across models, the border beween under- and over-represented residual values is more vivid and the colors corresponding to "no relevant divergence between observed and expected density" cannot change depending on the extremes of the observed distances.
Thx for the great package. Just playing around with settings and options and adding my impressions :)
When check1D
is called with a list as argument (e.g. check1D(o, list("x1", "x2", "x3"))
) it is then internally called recursively (only one level of recursion) and a plotGam
object is returned containing a list 3 ggplots.
However, the residuals are recomputed 3 times (in this case), hence it would be be better to compute them once, and then call the rest of the code. Same for check2D, where this could be particularly important if we are checking all pairs of variables.
If we let the user define a function
gridFun = function(.x, .v1){
....
}
then they could, for instance, regress the residuals .x on the variable .v1, to get useful diagnostics.
I removed the dTrans
argument as it was not called.
Do we want to reintroduce it as an argument for plot.mgcv.smooth1D or should we keep only the default (cubic root) ?
Hi
library(mgcViz)
n <- 1e3
x1 <- rnorm(n)
x2 <- rnorm(n)
dat <- data.frame("x1" = x1, "x2" = x2,
"y" = sin(x1) + 0.5 * x2^2 + pmax(x2, 0.2) * rnorm(n))
b <- gam(y ~ s(x1, x2), data = dat, method = "REML")
b <- getViz(b)
plot(sm(b, 1)) + l_fitRaster() + l_fitContour() + l_points()
The plot returns an error :
Error in UseMethod("rescale") :
pas de méthode pour 'rescale' applicable pour un objet de classe "AsIs"
Here is my session Info (I suspect a package version problem):
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mgcViz_0.0.1 rgl_0.98.1 ggplot2_2.2.1 mgcv_1.8-22 nlme_3.1-131
loaded via a namespace (and not attached):
[1] Rcpp_0.12.13 compiler_3.4.1 RColorBrewer_1.1-2 plyr_1.8.4 bindr_0.1
[6] viridis_0.4.0 tools_3.4.1 mvnfast_0.2.2 digest_0.6.12 viridisLite_0.2.0
[11] jsonlite_1.5 tibble_1.3.4 gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.1
[16] rlang_0.1.2 Matrix_1.2-10 rstudioapi_0.7 shiny_1.0.5 GGally_1.3.2
[21] yaml_2.1.14 bindrcpp_0.2 gridExtra_2.3 httr_1.3.1 dplyr_0.7.4
[26] knitr_1.16 htmlwidgets_0.9 grid_3.4.1 glue_1.1.1 reshape_0.8.7
[31] data.table_1.10.4 R6_2.2.2 plotly_4.7.1 tidyr_0.7.1 purrr_0.2.2.2
[36] magrittr_1.5 matrixStats_0.52.2 scales_0.5.0 htmltools_0.3.6 assertthat_0.2.0
[41] mime_0.5 xtable_1.8-2 colorspace_1.3-2 httpuv_1.3.5 KernSmooth_2.23-15
[46] miniUI_0.1.1 lazyeval_0.2.0 munsell_0.4.3
``
The idea seems to be that the check.something methods are just wrappers for more specific plots. See also this issue. Hence here it would make sense to extract a gridCheck2D function from check.mgcv.smooth.2D.
Currently the problem with check.mgcv.smooth.1D and 2D is that they really check what's happening along a 1 or 2 covariates, the do not really care about the specific smooth considered.
Outstanding issues:
When method=="simul2", do we really need rank()? Can we just use 1:n? Especially given that
rank() will miss unobserved levels of discrete variables.
When method=="simul1" we use .quBySort() to get the quantiles by sorting. Check how the
quantiles are being estimated from the order statistics. Same when we calculate the quantiles
using rowOrderStats().
Provide gamlss and extended families with CDF function, otherwise we won't be able to use
method=="tnorm" or "tunif" with them! Do this by extending the fix.family.cdf() function.
See .initializeXXX, in particular: stop("Not supported at the moment")
Not sure how often this comes up if you don't set the scale manually but it seems dangerous to me:
library(mgcv)
library(mgcViz)
n <- 1e3
x1 <- rnorm(n)
x2 <- rnorm(n)
dat <- data.frame("x1" = x1,
"x2" = x2,
"y" = rnorm(n, sin(x1) + x2^2, sd = 1))
# fit with much too small scale:
b_scaled <- gam(y ~ s(x1) + s(x2), data=dat, scale = .1)
qq.gam(b_scaled)
check1D(getViz(b_scaled), x = "x1") + l_densCheck() + scale_fill_gradient2()
#> Scale for 'fill' is already present. Adding another scale for 'fill',
#> which will replace the existing scale.
# looks ok (unstructured wiggles), but we *know* the model variance is much too small:
# Comparing with Gaussian density with model's scale, *not* empirical sd of residuals:
fasiolo_dist <- function(.ed, .gr, .y) {
td <- dnorm(.gr, 0, sd = .1)
sign(.ed - td) * (abs(sqrt(.ed) - sqrt(td)))^1/3
}
check1D(getViz(b_scaled), x = "x1") + l_densCheck(dFun = fasiolo_dist) +
scale_fill_gradient2()
#> Scale for 'fill' is already present. Adding another scale for 'fill',
#> which will replace the existing scale.
# looks as it should: shows underdispersion of residuals
Created on 2018-09-11 by the reprex package (v0.2.0).
Look at the left of the plot. Strangely this does not seem to happen for different seeds.
set.seed(414)
library(mgcViz)
n <- 1e6
x1 <- rnorm(n)
x2 <- rnorm(n)
dat <- data.frame("x1" = x1, "x2" = x2,
"y" = sin(x1) + 0.5 * x2^2 + pmax(x2, 0.2) * rnorm(n))
b <- bam(y ~ s(x1)+s(x2), data = dat, method = "fREML", discrete = TRUE)
v <- getViz(b)
plot(v(2), resDen = "cond", alpha.rug = 0.05, maxpo = 1e4)
The double loop would definitely go much faster in Rcpp, and this should not be too hard to do.
This scheme
option is just a pain at the moment, I would propose to get rid of it entirely.
In particular, I think we need to remove it from
and maybe more.
In plot.mgcv.smooth.2D the persp
or scheme == 2 option could become a separate function,
while the scheme = 0 with se = TRUE can be achieved by not calling geom_raster and by adding
the contour lines for the confidence intervals. We can add options to do that. scheme==3 is like
what we already have (scheme=2) but in black and white. We can modify the colour scale afterwards,
so do not need for this scheme.
Currently the two scatter plots produced by check.gam do not handle large data sets. It would be useful to implement them via hexagonal binning, at least when the number of observations is, say, > 5e4.
Add a postCheck function, that simulated responses from the posterior and summarizes them using a scalar valued function. The full data can be summarized as well, and possibly bootstrapped. Then we can add some layers such as histograms or density plots.
Section 4 of the paper "Graphics for uncertainty" from Bowman explain who to visualise the uncertainty of the fitted effects by simulating curves from the posterior, interpolating and then plotting them using animations.
This could clearly be implemented in mgcViz, probably using gganimate.
See here for a possible solution:
https://stackoverflow.com/questions/22408820/telling-ggplot-not-to-scale-alpha
Instead of plotting a panel of slices, we could create an animation. Probably using gganimate.
At the moment we need to use the trans
function to plot on response scale. This can be quite laborious. Maybe it is possible to do something automatically, possibly using the derivatives of the inverse link to transform the standard errors as well. Actually it might be possible to do this for general transformations, with transformation to response scale being a special case.
Example:
#######
# Simulate some data
library(mgcViz)
n <- 2e3
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
dat <- data.frame("x1" = x1, "x2" = x2, "x3" = x3,
"y" = abs( 0.1 + sin(x1) + 0.5 * x2^2 + 0.2*x3 + 0.2*rnorm(n) ))
b <- gamV(y ~ s(x1, x2) + s(x3), data = dat, family = Gamma(link=log))
# Plot s(x1, x2) on link scale
plot(sm(b, 1))
# Plot s(x1, x2) on response scale
plot(sm(b, 1),
trans = function(x){
b$family$linkinv(coef(b)[1] + x)
})
# We added the intercept and inverted the link function (if you use the identity link this is not necessary).
# Also, above s(x3) is left to zero, so we are conditioning on x3 being approximately 0 as can be seen from plot(sm(b, 2))
# To condition on a different value of x3 we must use predict.
# Here s3 is the value of s(x3) at x3 = 2 (the values of x1 and x2 do not matter here)
s3 <- predict(b, newdata = data.frame(x1 = 0, x2 = 0, x3 = 2), type = "terms")[2]
# Plotting on response scale
plot(sm(b, 1),
trans = function(x){
b$family$linkinv(coef(b)[1] + s3 + x)
})
Same can be done for 1D smooths (in some way).
To get a wormplot, it should be sufficient to subtract Dq (i.e. the x axis) from D, conf and df in .plot.qq.gam().
Thx for this great package!
I think it would be even more awesome if I didn't have to "quote" all my "variables".
check1D(v, x = x3)
👍 , check1D(v, x = "x3")
👎
cdCheck allows to compare the residuals distribution only with uniform or Gaussian densities. We can cover model residuals types (e.g. Pearson) by simulating residuals from the model and use them to estimate the model-based residual density.
Notice that this might result in fairly noisy density estimates in the tails of the x distributions. It would be nice to be able to pass weights to simulate.gam, so that we could simulate more responses where we have little data. This could be achieved by re-sampling the elements of mu, wt and scale (the arguments of family$rd).
We need to work out how to do this without generating too many warnings.
It would be quite easy to create an app where by drawing a rectangle on the plot (basic shiny feature) we get the corresponding xlim and ylim, which then get passed to zoo.qqGam(), which does the binning and the plotting.
Residuals seems to be in a corner of the covariate space, maybe the same problem occurs if using l_points with standard 2d smooth plots.
It seems that using the eval() trick is unnecessary, in that using a proper function would not cause any
performance handicap (no large object will be copied).
Hi, is there a way to add a horizontal line to each graph in print(plot(b, allTerms = T), pages = 1) ?
I would be interested to display a horizontal line at 0 in each of the graphs.
Thanks!
We want methods for plotting multiple effects (one per quantile) on the same plot.
Easy to do for 1D smooths, much harder to see how to do it for multivariate effects.
Needed in plot.mgcv.smooth.2D, plot.sos.smooth, check.mgcv.smooth.2D and maybe others.
1D and 2D by factor smooths are currently plotted on an individual basis each with their own scale. This is particularly strange for 2D smooth. Maybe there is room for modifying plot.gamViz so that it plots them on a grid, or at least using the same y-scale (for 1D plots) or colour-scale (for 2D rasters). A short-term solution for the 2D case is manual intervention:
# Simulate some data
library(mgcViz)
library(viridis)
set.seed(3)
dat <- gamSim(1,n=2000,dist="normal",scale=20)
dat$fac <- as.factor( sample(c("A1", "A2", "A3"), nrow(dat), replace = TRUE) )
dat$logi <- as.logical( sample(c(TRUE, FALSE), nrow(dat), replace = TRUE) )
bs <- "cr"; k <- 12
b <- gamV(y~fac + s(x0, x1, by = fac), data=dat)
# Different scales
print(plot(b), pages = 1)
# Same scale: need to add raster and contour manually to avoid having them added
# automatically by print.plotSmooth()
print(plot(b) + l_fitRaster() + l_fitContour() +
scale_fill_gradientn(colours = viridis(50, begin = 0.2),
limits = c(-6,7), name = "s(x)"), pages = 1)
These would be the equivalent of cqCheck and cqCheck2d in qgam.
At the moment layers such as l_gridCheck1D
do no work with factor variables. These functions need to be modified to cover factor variables.
This might be a quite useful layer, see
http://www.fromthebottomoftheheap.net/2017/10/10/difference-splines-i/
To calculate the standard error for the differences in smooth we need:
cov(X1b1, X2b2) = X1 %*% Sig11 %*% t(X1) + X2 %*% Sig22 %*% t(X2) - X1 %*% Sig12 %*% t(X2) - X2 %*% Sig12 %*% t(X1)
where Sig11, Sig22 and Sig12 = Sig21 are blocks of cov(b1, b2).
When plotting a factor effect on multiple models, the xlabel should indicate the model ID but it currently reports the name of the factor variable.
We need to create something similar to check.gam. That is a function that produces an set of plots, probably based on gridCheck, cdCheck, ACF or qq.gam and probably including also a plot from plot.mgcv.smooth.1D.
Hence check.mgcv.smooth.1D would essentially be just a wrapper. Given that gridCheck and qq.gam simulate data from the model, there is potential here for simulating the data only once, and passing it to both methods.
The xlim argument is still needed for 1D smooths, even if we can zoom in and out using ggplot2.
For 2D smooths we'll also need ylim.
Functions such as check.gam() output objects of class "check.gam". I might be better renaming these
to something like "checkGam" to avoid confusion/conflicts.
Candidates to go are shiny, plotly, data.table, dplyr, mvnfast, rstudioapi, miniUI, shiny
In order to speed up qq.gam, we should keep on the lookout for faster sorting functions.
A promising one is data.table::fsort() which unfortunately works only for positive vectors
at the moment.
At the moment this method uses only residuals.
Says "abs(sqrt(em)-sqrt(th))^(1/3)", should be "(sqrt(em)-sqrt(th))^(1/3)".
This shouldn't be too have to do. The layer function should be similar to l_fitLine(), but with some
clustering algorithm running within it.
We could add an option to l_ciLines, l_ciContour etc that allows to get simultaneous credible intervals using the approach described in section 6.10.2 of Wood's book (2nd ed).
This is the only wrapper missing, I think.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.