ehrlinger / ggRandomForests

Graphical analysis of random forests with the randomForestSRC, randomForest and ggplot2 packages.
On Win8 64-bit RStudio, installation fails for R 3.2.0, R 3.1.3, and below:
install.packages("ggRandomForest")
Installing package into ‘C:/Users/Jim/Documents/R/win-library/3.1’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘ggRandomForest’ is not available (for R version 3.1.2)
This happens with both the RStudio global CRAN mirror and the IA CRAN mirror.
Tnx, Jim
I ran devtools::install_github(repo = "ggRandomForests", username = "ehrlinger") in my 3.0.1 version of R. I got:
* installing source package 'ggRandomForests' ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
Error in namespaceExport(ns, exports) :
undefined exports: ggCompetingRisk, ggCompetingRisk.ggRandomForests, ggCoplot.ggRandomForests, ggInteraction, ggInteraction.ggRandomForests, ggMinDepth, ggMinDepth.ggRandomForests, ggSurvival, ggSurvival.ggRandomForests, plot.ggRandomForests, show.ggRandomForests
Error: loading failed
Execution halted
ERROR: loading failed
Hello.
In the "ggRandomForests: Exploring Random Forest Survival" paper, there is a partial dependence coplot of 1-year survival against bilirubin, conditional on albumin interval group membership (figure 24).
I am trying to create a similar coplot, but instead of conditioning on albumin intervals, I would like to condition on a variable that is categorical in the original data (for instance edema or ascites). I tried to do so but could not get the script right.
Would deeply appreciate your help in this matter.
Thanks
Roni
How do we get a hazard estimate from the rfsrc objects?
Having the classes "gg_rfsrc" "data.frame" "class" messes up internal code used by e.g. tidyr and dplyr, and is one of the reasons this package fails against dplyr 1.0.0:
library(ggRandomForests)
#> Loading required package: randomForestSRC
#>
#> randomForestSRC 2.9.3
#>
#> Type rfsrc.news() to see new features, changes, and bug fixes.
#>
data(rfsrc_iris, package="ggRandomForests")
gg_dta <- gg_rfsrc(rfsrc_iris)
str(gg_dta)
#> Classes 'gg_rfsrc', 'class' and 'data.frame': 150 obs. of 4 variables:
#> $ setosa : num 1 1 1 1 1 ...
#> $ versicolor: num 0 0 0 0 0 ...
#> $ virginica : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ y : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
class(gg_dta)
#> [1] "gg_rfsrc" "data.frame" "class"
gg_plt <- ggRandomForests:::plot.gg_rfsrc(gg_dta)
#> Error: Input must be a vector, not a `gg_rfsrc/data.frame/class` object.
Created on 2020-04-03 by the reprex package (v0.3.0)
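Until the class vector is reordered in the package itself, a workaround sketch based on the reprex above is to move "data.frame" to the end of the class attribute, since the vctrs backend used by dplyr 1.0.0 expects "data.frame" to be the last class of a data-frame subclass (this is an assumption about the cause, not a confirmed package fix):

```r
# Workaround sketch: vctrs (used by dplyr >= 1.0.0) expects "data.frame"
# to be the *last* entry in the class vector of data-frame subclasses.
library(ggRandomForests)

data(rfsrc_iris, package = "ggRandomForests")
gg_dta <- gg_rfsrc(rfsrc_iris)

# Reorder the classes so "data.frame" comes last
class(gg_dta) <- c("gg_rfsrc", "class", "data.frame")

# plot() should now dispatch without the vctrs error
plot(gg_dta)
```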
We want to verify that n_var keeps the highest n_var variables. It looks like it's cutting observations from a melted table.
Thank you for the great package. I'm currently running it on my data, and the plots are rather cumbersome since there are more than 400 variables. Would you please advise me on how to make them more readable? Perhaps there is a way to scale down the y-label font?
Sorry if I'm asking in the wrong place; I'm quite new to data analysis, being an MD by background.
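Two approaches that may help, sketched below under assumptions: `rfsrc_obj` is a hypothetical name for the fitted forest, and the `n_var` argument (mentioned elsewhere in these issues as keeping the highest-ranked variables) is assumed to apply here. Since ggRandomForests plots are ggplot objects, standard ggplot2 theming also works:

```r
library(ggRandomForests)
library(ggplot2)

# Option 1: restrict the plot to the top-ranked variables only,
# assuming n_var limits the plot to the highest-ranked variables.
plot(gg_vimp(rfsrc_obj), n_var = 30)

# Option 2: shrink the y-axis label font with ordinary ggplot2 theming.
plot(gg_vimp(rfsrc_obj)) +
  theme(axis.text.y = element_text(size = 5))
```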
I've gone a bit crazy on tests and examples, which require cached rfsrc objects. The cache requirement is due to computational expense and rfsrc version issues (I can use pre-release rfsrc).
So we'll remove the airq, mtcars and veteran cached objects. The objects can still be built from the rfsrc_cache_dataset function; we'll modify rfsrc_cache_dataset to create only set=c("iris", "Boston", "pbc") by default.

Hi,
Just a quick question.
What does pct stand for in calc_roc.rfsrc?
Right now it only does a single class, but overlaying multiple classes on each ROC curve is possible, as an alternative to multiple panels.
rmarkdown is cool, but...
We want to update the arXiv submission, and rmarkdown's LaTeX output is still pretty ugly.
Hi John,
When trying to produce variable interaction plots with plot.gg_interaction() I cannot get the order of the faceting to match the minimal depth rank order, with the set I am interested in being produced in alphabetical order instead.
I looked at several other plot methods for the package and found that 'variable' is often set to a factor after a gather step. I tested if this could fix the issue by adding
gg_dta$variable <- factor(gg_dta$variable, levels=unique(gg_dta$variable))
after line 143, which worked.
This doesn't exclude the (high) possibility that I am simply doing something wrong, but I thought I would send through the code. Thanks for the great package!
Hi @ehrlinger,
Thanks for making it easier to work with randomForestSRC; you saved me a lot of time! After fitting an rfsrc object with the following code...
> rfs <- rfsrc(Surv(time, event) ~ ., data = mydata, nsplit = NULL, ntree = 100, importance = T)
> rfs
Sample size: 99
Number of deaths: 35
Number of trees: 100
Forest terminal node size: 15
Average no. of terminal nodes: 9.44
No. of variables tried at each split: 4
Total no. of variables: 12
Resampling used to grow trees: swor
Resample size used to grow trees: 63
Analysis: RSF
Family: surv
Splitting rule: logrank
Error rate: 28.85%
... I'm trying to obtain the OOB error plot for each tree, but I get this (uninformative) warning and an empty plot appears:
> plot(gg_error(rfs))
Warning message:
Removed 9 rows containing missing values (geom_path).
Any idea about what I'm doing wrong, please? Thanks in advance.
The se is calculated for classification and regression with "normal" response.
I don't have a good minimal working example or anything, but I'm going to try my best to describe what's going on.
After calling gg_survival with either type "kaplan" or type "nelson" and supplying a factor for 'by', survfit with strata on 'by' is called. The default is na.group = FALSE, so it stratifies only on the other levels of the factor.
A little further down in the code, we have this bit:
if (!is.null(by)) {
  tm_splits <- which(c(FALSE, sapply(2:nrow(tbl), function(ind) {
    tbl$time[ind] < tbl$time[ind - 1]
  })))
  lbls <- unique(data[, by])
  tbl$groups <- lbls[1]
  for (ind in 2:(length(tm_splits) + 1)) {
    tbl$groups[tm_splits[ind - 1]:nrow(tbl)] <- lbls[ind]
  }
}
unique() also returns NA as a value, but NA was not included as a stratum level, so if an NA occurs before at least one of your levels, it will take that level's place and you'll drop a factor level you potentially cared about.
I solved it myself by editing na.group = TRUE into the call to strata in the kaplan and nelson functions, because I wanted that information anyway, but I guess this might be something others encounter as well!
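The behavior described above can be sketched in isolation with the survival package's strata() helper, which does take an na.group argument (the data here is illustrative):

```r
library(survival)

# With na.group = FALSE (the default), NA is not given its own stratum;
# with na.group = TRUE, NA becomes an explicit stratum level, so labels
# recovered later line up with unique(data[, by]) even when NAs occur.
by_var <- factor(c("a", NA, "b", "a", "b"))

strata(by_var, na.group = FALSE)  # NA entries get no stratum level
strata(by_var, na.group = TRUE)   # NA gets its own stratum level
```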
We often want the alternative to the probability of survival: mortality = 1 - survival.
rfsrc returns something else when surv.type = "mort", not what we expect. This should be a simple conversion.
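Until that conversion is in place, the relationship is simple enough to apply by hand; a minimal sketch with illustrative values:

```r
# Sketch only: given survival probabilities S(t) at a set of time points,
# mortality is the complement 1 - S(t).
surv_prob <- c(0.95, 0.90, 0.80, 0.60)  # illustrative S(t) values
mortality <- 1 - surv_prob              # 0.05, 0.10, 0.20, 0.40
```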
Hi,
The Survival vignette is really looking good, with lots of great plots, but I'm having problems reproducing many of the examples.
For example, at the beginning you mention that you prefer "years" to "days" in the pbc data set, yet there is no code showing how you convert it.
After a naive
pbc$years <- pbc$days/365
I fail in the next part, using the gg_survival function example.
Next, there is no code for the very nice EDA plots in the vignette.
I also could not get the 3D example in Appendix 1 to work.
This line
partial_time <- do.call(rbind,lapply(partial_pbc_time, gg_partial))
always produces errors.
I always get errors whenever the plot includes a theme() call.
There were some updates to randomForestSRC and ggplot2 recently that may cause a lot of these problems.
Then we can use combine.gg_partial to mash an arbitrary number of gg_partial objects together.
combine.gg_partial does not really extend combine.default.
The gg_survival function uses a finite difference to calculate the hazard from the cumulative hazard. The results are not correct yet.
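For reference, the finite-difference relationship is h(t_i) ≈ (H(t_i) − H(t_{i−1})) / (t_i − t_{i−1}); a minimal sketch with illustrative values:

```r
# Sketch: hazard from the cumulative hazard by finite differences.
# H is the cumulative hazard evaluated at event times t (illustrative).
t <- c(1, 2, 4, 7)
H <- c(0.10, 0.25, 0.45, 0.90)

# h(t_i) ~= (H(t_i) - H(t_{i-1})) / (t_i - t_{i-1})
h <- diff(H) / diff(t)
```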
Tests only check whether we've called plot.variable with multiple variables.
When using gg_survival with the by= parameter and then plotting with plot.gg_survival, the colors and the legends are not correctly assigned to the groups.
To be precise, the error seems to be in kaplan.R, line 72ff.
Convert S3method functions to use NextMethod instead of UseMethod for method dispatch. This should fix issues with arguments, and prepare the way for extending to random forest packages beyond randomForestSRC (#3).
This should also tighten up the whole OO design so that functions that look like S3 methods really are S3 methods.
I am attempting to work through the "Random Forests for Regression" vignette, however I have run into an issue near the end when generating partial coplot data for a contour plot. Rather than returning 50 unique coplots for each specific value of rm, the coplots are all identical. The result is that the contour plot only has contours for predicted y values from one x variable, making it just a 3D representation of a single variable co-plot.
In a previous version of the "Random Forests for Regression" vignette this did not appear to be an issue, but in the updated version the issue has appeared. Perhaps an update to the plot.variable() function has altered how this code performs?
This is a side effect of the previous gg_vimp #8 bug fix.
The examples need to be refactored to show how cached data objects were created.
The package and the vignette code work perfectly except for the code involving st.labs:
plot(gg_md, lbls=st.labs)
Error in plot.gg_minimal_depth(gg_md, lbls = st.labs) :
object 'st.labs' not found
gg_brier function for classification and survival forests?
It would require S3 gg_ methods for formatting the data output, and would mean no support for survival or minimal depth.
This is an rfsrc issue. For survival and classification forests, variable dependence is returned on [0, 1], while partial dependence for survival is returned on [0, 100]. We should be able to detect the difference and convert in gg_variable.
plot.gg_minimal_vimp should be an argument-based alternative to plot.gg_minimal_depth, thereby avoiding S3 method naming when it is not applicable.
By default, gg_vimp returns a VIMP panel for each class from a classification forest. If we provide which.outcome, we want a VIMP figure for that class only, sorted by the VIMP for the class of interest.
Need to do a lot here
This will probably warrant a v2.0 release increment?