ehrlinger / ggRandomForests

Graphical analysis of random forests with the randomForestSRC, randomForest and ggplot2 packages.
On Win8 64-bit RStudio, installation fails for R 3.2.0, R 3.1.3, and below:
install.packages("ggRandomForest")
Installing package into ‘C:/Users/Jim/Documents/R/win-library/3.1’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘ggRandomForest’ is not available (for R version 3.1.2)
This happens with both the RStudio global CRAN mirror and the IA CRAN mirror.
Tnx, Jim
I ran devtools::install_github(repo = "ggRandomForests", username = "ehrlinger") in my 3.0.1 version of R. I got:
* installing source package 'ggRandomForests' ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
Error in namespaceExport(ns, exports) :
undefined exports: ggCompetingRisk, ggCompetingRisk.ggRandomForests, ggCoplot.ggRandomForests, ggInteraction, ggInteraction.ggRandomForests, ggMinDepth, ggMinDepth.ggRandomForests, ggSurvival, ggSurvival.ggRandomForests, plot.ggRandomForests, show.ggRandomForests
Error: loading failed
Execution halted
ERROR: loading failed
Hello.
In the "ggRandomForests: Exploring Random Forest Survival" paper, there is a partial dependence coplot of 1-year survival against bilirubin, conditional on albumin interval group membership (figure 24).
I am trying to create a similar coplot, but instead of conditioning on albumin intervals, I would like to condition on a variable that is categorical in the original data (for instance edema or ascites). I tried to do so but could not get the script right.
Would deeply appreciate your help in this matter.
Thanks
Roni
How do we get a hazard estimate from the rfsrc objects?
Having the classes "gg_rfsrc" "data.frame" "class" messes up internal code used by e.g. tidyr and dplyr, and is one of the reasons this package fails against dplyr 1.0.0:
library(ggRandomForests)
#> Loading required package: randomForestSRC
#>
#> randomForestSRC 2.9.3
#>
#> Type rfsrc.news() to see new features, changes, and bug fixes.
#>
data(rfsrc_iris, package="ggRandomForests")
gg_dta <- gg_rfsrc(rfsrc_iris)
str(gg_dta)
#> Classes 'gg_rfsrc', 'class' and 'data.frame': 150 obs. of 4 variables:
#> $ setosa : num 1 1 1 1 1 ...
#> $ versicolor: num 0 0 0 0 0 ...
#> $ virginica : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ y : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
class(gg_dta)
#> [1] "gg_rfsrc" "data.frame" "class"
gg_plt <- ggRandomForests:::plot.gg_rfsrc(gg_dta)
#> Error: Input must be a vector, not a `gg_rfsrc/data.frame/class` object.
Created on 2020-04-03 by the reprex package (v0.3.0)
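Until the class vector is reordered in the package itself, a workaround sketch based on the reprex above is to move "data.frame" to the end of the class attribute, since the vctrs backend used by dplyr 1.0.0 expects "data.frame" to be the last class of a data-frame subclass (this is an assumption about the cause, not a confirmed package fix):

```r
# Workaround sketch: vctrs (used by dplyr >= 1.0.0) expects "data.frame"
# to be the *last* entry in the class vector of data-frame subclasses.
library(ggRandomForests)

data(rfsrc_iris, package = "ggRandomForests")
gg_dta <- gg_rfsrc(rfsrc_iris)

# Reorder the classes so "data.frame" comes last
class(gg_dta) <- c("gg_rfsrc", "class", "data.frame")

# plot() should now dispatch without the vctrs error
plot(gg_dta)
```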
We want to verify that n_var keeps the highest n_var variables. It looks like it's cutting observations from a melted table.
Thank you for the great package. I'm currently running it on my data, and the plots are rather cumbersome since there are more than 400 variables. Would you please advise me on how to make them more readable? Perhaps there is a way to scale down the y-label font?
Sorry if I'm asking in the wrong place; I'm quite new to data analysis, being an MD by background.
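Two approaches that may help, sketched below under assumptions: `rfsrc_obj` is a hypothetical name for the fitted forest, and the `n_var` argument (mentioned elsewhere in these issues as keeping the highest-ranked variables) is assumed to apply here. Since ggRandomForests plots are ggplot objects, standard ggplot2 theming also works:

```r
library(ggRandomForests)
library(ggplot2)

# Option 1: restrict the plot to the top-ranked variables only,
# assuming n_var limits the plot to the highest-ranked variables.
plot(gg_vimp(rfsrc_obj), n_var = 30)

# Option 2: shrink the y-axis label font with ordinary ggplot2 theming.
plot(gg_vimp(rfsrc_obj)) +
  theme(axis.text.y = element_text(size = 5))
```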
I've gone a bit crazy on tests and examples, which require cached rfsrc objects. The cache requirement is due to computational expense and rfsrc version issues (I can use pre-release rfsrc).
So we'll remove the airq, mtcars and veteran cached objects. The objects can still be built from the rfsrc_cache_dataset function; we'll modify rfsrc_cache_dataset to create only set=c("iris", "Boston", "pbc") by default.

Hi,
Just a quick question.
What does pct stand for in calc_roc.rfsrc?
Right now it only does a single class, but overlaying multiple classes on each ROC curve is possible, as an alternative to multiple panels.
rmarkdown is cool, but...
We want to update the arXiv submission, and rmarkdown's LaTeX output is still pretty ugly.
Hi John,
When trying to produce variable interaction plots with plot.gg_interaction() I cannot get the order of the faceting to match the minimal depth rank order, with the set I am interested in being produced in alphabetical order instead.
I looked at several other plot methods for the package and found that 'variable' is often set to a factor after a gather step. I tested if this could fix the issue by adding
gg_dta$variable <- factor(gg_dta$variable, levels=unique(gg_dta$variable))
after line 143, which worked.
This doesn't exclude the (high) possibility that I am simply doing something wrong, but I thought I would send through the code. Thanks for the great package!
Hi @ehrlinger,
Thanks for making it easier to work with randomForestSRC; you saved me a lot of time! After fitting an rfsrc object with the following code...
> rfs <- rfsrc(Surv(time, event) ~ ., data = mydata, nsplit = NULL, ntree = 100, importance = T)
> rfs
Sample size: 99
Number of deaths: 35
Number of trees: 100
Forest terminal node size: 15
Average no. of terminal nodes: 9.44
No. of variables tried at each split: 4
Total no. of variables: 12
Resampling used to grow trees: swor
Resample size used to grow trees: 63
Analysis: RSF
Family: surv
Splitting rule: logrank
Error rate: 28.85%
... I'm trying to obtain the OOB error plot for each tree, but I get this (uninformative) warning and an empty plot appears:
> plot(gg_error(rfs))
Warning message:
Removed 9 rows containing missing values (geom_path).
Any idea about what I'm doing wrong, please? Thanks in advance.
The se is calculated for classification and regression with "normal" response.
I don't have a good minimal working example or anything, but I'm going to try my best to describe what's going on.
After calling gg_survival with either type "kaplan" or type "nelson" and supplying a factor for 'by', survfit with strata on 'by' is called. The default is na.group = FALSE, so it stratifies only on the other levels of the factor.
A little further down in the code, we have this bit:
if (!is.null(by)) {
  tm_splits <- which(c(FALSE, sapply(2:nrow(tbl), function(ind) {
    tbl$time[ind] < tbl$time[ind - 1]
  })))
  lbls <- unique(data[, by])
  tbl$groups <- lbls[1]
  for (ind in 2:(length(tm_splits) + 1)) {
    tbl$groups[tm_splits[ind - 1]:nrow(tbl)] <- lbls[ind]
  }
}
unique() also returns NA as a value, but NA was not included as a stratum level, so if an NA occurs before at least one of your levels, it will take that level's place and you'll drop a factor level you potentially cared about.
I solved it myself by editing na.group = TRUE into the call to strata in the kaplan and nelson functions, because I wanted that information anyway, but I guess this might be something others encounter as well!
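The behavior described above can be sketched in isolation with the survival package's strata() helper, which does take an na.group argument (the data here is illustrative):

```r
library(survival)

# With na.group = FALSE (the default), NA is not given its own stratum;
# with na.group = TRUE, NA becomes an explicit stratum level, so labels
# recovered later line up with unique(data[, by]) even when NAs occur.
by_var <- factor(c("a", NA, "b", "a", "b"))

strata(by_var, na.group = FALSE)  # NA entries get no stratum level
strata(by_var, na.group = TRUE)   # NA gets its own stratum level
```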
We often want the alternative to the probability of survival: mortality = 1 - survival.
rfsrc returns something else when surv.type = "mort", not what we expect. This should be a simple conversion.
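Until that conversion is in place, the relationship is simple enough to apply by hand; a minimal sketch with illustrative values:

```r
# Sketch only: given survival probabilities S(t) at a set of time points,
# mortality is the complement 1 - S(t).
surv_prob <- c(0.95, 0.90, 0.80, 0.60)  # illustrative S(t) values
mortality <- 1 - surv_prob              # 0.05, 0.10, 0.20, 0.40
```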
Hi,
The Survival vignette is really looking good, with lots of great plots, but I'm having problems reproducing many of the examples.
For example, at the beginning you mention that you prefer "years" to "days" in the pbc data set, yet there is no code showing how you convert it.
After a naive
pbc$years <- pbc$days/365
I fail in the next part, using the gg_survival function example.
Next, there is no code for the very nice EDA plots in the vignette.
I also could not get the 3D example in Appendix 1 to work.
This line
partial_time <- do.call(rbind,lapply(partial_pbc_time, gg_partial))
always produces errors.
I always get errors whenever the plot includes a theme() call.
There were some updates to randomForestSRC and ggplot2 recently that may cause a lot of these problems.
Then we can use combine.gg_partial to mash an arbitrary number of gg_partial objects together.
combine.gg_partial does not really extend combine.default.
The gg_survival function uses a finite difference to calculate the hazard from the cumulative hazard. The results are not correct yet.
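For reference, the finite-difference relationship is h(t_i) ≈ (H(t_i) − H(t_{i−1})) / (t_i − t_{i−1}); a minimal sketch with illustrative values:

```r
# Sketch: hazard from the cumulative hazard by finite differences.
# H is the cumulative hazard evaluated at event times t (illustrative).
t <- c(1, 2, 4, 7)
H <- c(0.10, 0.25, 0.45, 0.90)

# h(t_i) ~= (H(t_i) - H(t_{i-1})) / (t_i - t_{i-1})
h <- diff(H) / diff(t)
```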
Tests only check whether we've called plot.variable with multiple variables.
When using gg_survival with the by= parameter and then plotting with plot.gg_survival, the colors and the legends are not correctly assigned to the groups.
To be precise, the error seems to be in kaplan.R, line 72ff.
Convert S3method functions to use NextMethod instead of UseMethod for method dispatch. This should fix issues with arguments, and prepare the way for extending to random forest packages beyond randomForestSRC (#3).
This should also tighten up the whole OO design so that functions that look like S3 methods really are S3 methods.
I am attempting to work through the "Random Forests for Regression" vignette, however I have run into an issue near the end when generating partial coplot data for a contour plot. Rather than returning 50 unique coplots for each specific value of rm, the coplots are all identical. The result is that the contour plot only has contours for predicted y values from one x variable, making it just a 3D representation of a single variable co-plot.
In a previous version of the "Random Forests for Regression" vignette this did not appear to be an issue, but in the updated version the issue has appeared. Perhaps an update to the plot.variable() function has altered how this code performs?
This is a side effect of the previous gg_vimp #8 bug fix.
The examples need to be refactored to show how cached data objects were created.
The package and the vignette code work perfectly except for the code involving st.labs:
plot(gg_md, lbls=st.labs)
Error in plot.gg_minimal_depth(gg_md, lbls = st.labs) :
object 'st.labs' not found
gg_brier function for classification and survival forests?
It would require S3 gg_ methods for formatting the data output, and would mean no support for survival or minimal depth.
This is an rfsrc issue. For survival and classification forests, variable dependence is returned on [0, 1], while partial dependence for survival is returned on [0, 100]. We should be able to detect the difference and convert in gg_variable.
plot.gg_minimal_vimp should be an argument-based alternative to plot.gg_minimal_depth, thereby avoiding S3 method naming when it is not applicable.
By default, gg_vimp returns a VIMP panel for each class from a classification forest. If we provide which.outcome, we want a VIMP figure for that class only, sorted by the VIMP for the class of interest.
Need to do a lot here
This will probably warrant a v2.0 release increment?