ropensci / aorsf

Accelerated Oblique Random Survival Forests
Home Page: https://docs.ropensci.org/aorsf
License: Other
If mtry is 1, the regression coefficient for the given covariate should just be set to 1 rather than whatever happens to come out of running the specified model to get coefficients for oblique splits.
just need to check if the oob stat values are empty and stop if they are
It's annoying when you want to select the variables that orsf_vs returns but then you need to clean the orsf_vs output.
Reduction in variance is a standard technique for assessing regression tree split purity. It would be great to implement a function in utility.cpp with the following inputs:
- y_node (type: arma::vec): the outcome values in the current tree node
- w_node (type: arma::vec): a vector of non-zero weights (integer valued) the same length as y_node
- g_node (type: arma::uvec): a vector of 0s and 1s the same length as y_node, with 0 indicating going to the left child node and 1 indicating the right.

The excerpt below from Ishwaran et al 2014 summarizes the reduction in variance computation very well. We will need to code this, incorporating weights through w_node. Should be able to check that the function gives the exact right answer using matrixStats::weightedVar. @ciaran-evans, would you like to look into this? You could actually write the function as a stand-alone function in orsf_oop.cpp with the usual //[[Rcpp::export]] tag rather than put it into utility.cpp, and I could move it over when it's ready. Basically we would just want the function to be named compute_var_reduction, and we would want to create the file tests/testthat/test-compute_var_reduction.R that tests to make sure our variance reduction function gives the same answer as a function written in R.
Something I noticed while updating aorsf-bench in response to bcjaeger/aorsf-bench#7 (@darentsai): in the paper associated with aorsf-bench, all benchmarks of variable importance were based on predicted survival probability. Since then, aorsf was updated to automatically use mortality predictions for VI computation because it was much simpler and didn't require specification of a prediction horizon. However, I think it would be ideal to retain the methods that were used in the paper, so I think it would be helpful to add an option to make variable importance for survival forests use predicted survival probability.
should revert to null for calling predict later
Follow the template of TreeSurvival.
Classification should allow for binary or categorical outcomes, but initially just build it for binary outcomes
Regression should allow for continuous outcomes
I'd like to use a "pre-made" Surv object as the response in the formula interface. Would it be possible to allow for that? For some context, this is to make it easier to use the aorsf engine in tidymodels' workflow objects.
library(aorsf)
library(survival)
lung_orsf <- na.omit(lung)
lung_orsf$surv <- Surv(lung_orsf$time, lung_orsf$status)
lung_orsf <- lung_orsf[, -c(2,3)]
aorsf::orsf(
  data = lung_orsf,
  formula = surv ~ age + ph.ecog
)
#> Error: formula must have two variables (time & status) as the response
Created on 2022-11-01 with reprex v2.0.2
need to add post-predict cleaning step
It is computed dynamically, but it does not change, so it can be set right after the trees are grown.
@bcjaeger Congratulations on getting the first "gold standard" package through rOpenSci's peer review process! Could you please update the website link in the repo settings here to https://docs.ropensci.org/aorsf? Thanks! Mark
orsf_summarize_uni gives a summary for the first class by default. Would it be helpful to include an input allowing users to select the class to summarize? In the reproducible example below, the printed summary is limited to the Adelie penguin. The summaries for all classes are included in the orsf_summarize_uni object, but having a way to print them directly would be useful.
library(aorsf)
fit = orsf(species ~ ., data = penguins_orsf)
smry = orsf_summarize_uni(fit)
smry
#>
#> -- bill_length_mm (VI Rank: 1) -------------------------
#>
#> |--------------- Probability ---------------|
#> Value Mean Median 25th % 75th %
#> 36.6 0.7044331 0.85955664 0.344965208 0.9625072
#> 39.5 0.6535697 0.80727711 0.266058440 0.9468242
#> 44.5 0.3697256 0.34953660 0.035810945 0.6409438
#> 48.6 0.2234635 0.14498102 0.008626836 0.4262582
#> 50.8 0.1896335 0.09561501 0.009713793 0.3866315
#>
#> -- island (VI Rank: 2) ---------------------------------
#>
#> |--------------- Probability ---------------|
#> Value Mean Median 25th % 75th %
#> Biscoe 0.5007671 0.4513207 0.01346324 0.9491816
#> Dream 0.4360160 0.1990975 0.07250154 0.8740631
#> Torgersen 0.6220022 0.6459860 0.22850515 0.9810957
#>
#> -- flipper_length_mm (VI Rank: 3) ----------------------
#>
#> |--------------- Probability ---------------|
#> Value Mean Median 25th % 75th %
#> 185 0.5910941 0.3937772 0.30187167 0.9552354
#> 190 0.5645121 0.3637454 0.26707447 0.9433666
#> 197 0.4952256 0.2889080 0.18181057 0.8831356
#> 213 0.3292072 0.1116319 0.02213975 0.6602175
#> 221 0.3025870 0.1087965 0.01011283 0.6184659
#>
#> -- bill_depth_mm (VI Rank: 4) --------------------------
#>
#> |--------------- Probability ---------------|
#> Value Mean Median 25th % 75th %
#> 14.3 0.3560189 0.1540067 0.008714118 0.7293553
#> 15.6 0.3943511 0.1673597 0.045879915 0.7901435
#> 17.3 0.4651208 0.2253786 0.106483255 0.9155495
#> 18.7 0.5094000 0.2998837 0.133393326 0.9473630
#> 19.5 0.5276217 0.3456728 0.162150077 0.9524491
#>
#> -- sex (VI Rank: 5) ------------------------------------
#>
#> |--------------- Probability ---------------|
#> Value Mean Median 25th % 75th %
#> female 0.4217927 0.1468877 0.02271961 0.8977797
#> male 0.4742529 0.3384249 0.05131323 0.9453019
#>
#> -- body_mass_g (VI Rank: 6) ----------------------------
#>
#> |--------------- Probability ---------------|
#> Value Mean Median 25th % 75th %
#> 3300 0.4825696 0.2233228 0.121143276 0.9339286
#> 3550 0.4750676 0.2110828 0.101099039 0.9372102
#> 4050 0.4638683 0.2262442 0.076359495 0.9343941
#> 4775 0.4356364 0.2836723 0.032402907 0.8618333
#> 5440 0.4154148 0.2934098 0.009086565 0.8180788
#>
#> -- year (VI Rank: 7) -----------------------------------
#>
#> |--------------- Probability ---------------|
#> Value Mean Median 25th % 75th %
#> 2007 0.4324757 0.1512571 0.01277177 0.9280795
#> 2008 0.4436049 0.1834657 0.01505514 0.9434973
#> 2009 0.4511092 0.1993108 0.01914280 0.9450677
#>
#> Predicted probability for top 7 predictors
Created on 2024-03-22 with reprex v2.1.0
git hash: cf10d04a
Package License: MIT + file LICENSE
git hash: f3262096
Package License: MIT + file LICENSE
right now we have things like orsf_control_fast and orsf_control_cph that are entirely for survival analyses. It would be more appropriate with this update to introduce:
orsf_control_survival( method, scale_x, ties, elastic_mix, elastic_df, max_iter, epsilon )
orsf_control_classification( method, scale_x, elastic_mix, elastic_df, max_iter, epsilon )
orsf_control_regression( method, scale_x, elastic_mix, elastic_df, max_iter, epsilon )
this would
obliqueRF is no longer on CRAN since May. Can aorsf be used for classification and regression or is it only for survival trees? Are there any examples?
It would be nice to do something like:
fit_spec <- orsf(time+status~., na_action = 'na_impute_meanmode', n_tree = 10, no_fit = TRUE)
orsf_train(fit_spec, data = pbc)
orsf_train(fit_spec, data = flchain)
pd_values hits an error here, likely because of the multiple categories.
git hash: b3212ad8
Important: All failing checks above must be addressed prior to proceeding
Package License: MIT + file LICENSE
This issue showed someone trying to predict survival time with aorsf via tidymodels. We currently only have predictions of the survival probability implemented in censored. Looking around aorsf I didn't see any prediction type that we could wrap for "survival time". Is that correct? Would you consider implementing that?
git hash: 1fed941c
Package License: MIT + file LICENSE
An error is thrown due to type coercion during prep_y()
I found this example caused R to crash. Likely has something to do with sample_fraction = 1 causing oobag_denom to backfire in orsf_cpp:
library(aorsf)
oblique_1 <- orsf(species ~ flipper_length_mm + bill_length_mm,
                  data = penguins_orsf,
                  sample_with_replacement = FALSE,
                  sample_fraction = 1,
                  split_min_obs = nrow(penguins_orsf) - 1,
                  tree_seeds = 649725,
                  oobag_pred_type = 'none',
                  n_tree = 1)
grid <- tidyr::expand_grid(
  flipper_length_mm = seq(170, 235, len = 100),
  bill_length_mm = seq(30, 70, len = 100)
)
predict(oblique_1, newdata = grid, pred_type = 'prob')
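For context, with sample_fraction = 1 and sampling without replacement, every row is in-bag for every tree, so an out-of-bag denominator of zero is plausible. A hedged sketch of the kind of guard that would prevent a divide-by-zero when aggregating out-of-bag predictions (names here are illustrative, not aorsf's actual internals):

```cpp
#include <cstddef>
#include <vector>

// pred_sum[i]: sum of predictions for row i over trees where it was
// out-of-bag; denom[i]: how many trees had row i out-of-bag. When a row
// is never out-of-bag (denom == 0), return 0 instead of computing 0/0.
std::vector<double> finalize_oobag(const std::vector<double>& pred_sum,
                                   const std::vector<double>& denom) {
  std::vector<double> out(pred_sum.size(), 0.0);
  for (std::size_t i = 0; i < pred_sum.size(); ++i)
    out[i] = denom[i] > 0 ? pred_sum[i] / denom[i] : 0.0;
  return out;
}
```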
need a function that can inspect the formula and data inputs and reliably determine the outcome type
this would be a useful feature
Taking sub-matrix views will likely be slower than iterating over the relevant columns/rows. Consider implementing this function for the data class:
arma::vec submat_mult_lincomb(arma::mat& x,
                              arma::uvec& x_rows,
                              arma::uvec& x_cols,
                              arma::vec& beta){
  // fill with zeros; arma::vec(n) alone leaves elements uninitialized,
  // so accumulating with += below would read garbage
  arma::vec out(x_rows.size(), arma::fill::zeros);
  arma::uword i = 0;
  arma::uword j = 0;
  // out[i] = sum over the selected columns of x(row, col) * beta[j]
  for(auto row : x_rows){
    j = 0;
    for(auto col : x_cols){
      out[i] += x.at(row, col) * beta[j];
      j++;
    }
    i++;
  }
  return(out);
}
Instead of using R functions to paste references, we should make BibTeX entries similar to https://github.com/mlr-org/mlr3extralearners/blob/main/R/bibentries.R
git hash: 4b86e904
Package License: MIT + file LICENSE
Hey @bcjaeger,
I am just playing around with this package. It's great. But I am wondering whether it makes sense for predict.orsf_fit() to actually return predictions instead of throwing an error when pred_horizon = 0. I think it makes sense to return 0, 1, and 0 when pred_type is 'risk', 'surv', and 'chf', respectively. Do you agree?
It may seem senseless to request predictions at time zero, but, as an example, I was trying to make risk or survival curves by making predictions from time zero to time t at equally-spaced intervals, and noted the error. I think the values suggested above are both statistically valid and would make the predict function a little friendlier.
For comparison, I contributed the predict.flexsurvreg() function to the {flexsurv} package, and predictions at time = 0 are valid and return the values suggested above.
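The values suggested above follow directly from the usual transformations between the cumulative hazard and the survival probability; a minimal sketch:

```cpp
#include <cmath>

// At prediction horizon t = 0, the cumulative hazard is 0 by definition,
// so the three prediction types follow immediately:
//   chf:  H(0) = 0
//   surv: S(0) = exp(-H(0)) = 1
//   risk: 1 - S(0) = 0
double surv_from_chf(double chf) { return std::exp(-chf); }
double risk_from_chf(double chf) { return 1.0 - surv_from_chf(chf); }
```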
Hi @bcjaeger, I was checking your package via mlr3extralearners (really good work!!!) and had some questions/notes for you when you have some time:

- I didn't see something like num.threads in ranger, so I guess parallelization is not supported? Is it because it's difficult to implement or somehow it is not possible in the case of ORSFs?
- About control_type (fast vs cph vs net): do you think that it would make sense to group some of these as a different random forest learner, as it might be the case that e.g. fast and cph produce similar forests/results/predictions (using the same dataset and using control_cph_iter_max = to_tune(1,20) for example) compared to net (where tuning the alpha might result in completely different solutions)? In your recent arXiv paper you had each as a separate learner (e.g. aorsf-net, aorsf-fast and aorsf-cph). Or would you advise using fast in all cases (i.e. not tuning at all) based on the results from your recent arXiv paper (though alpha wasn't tuned there if I understood correctly)?
- With orsf_control_net, do you leave the tuning of the lambda to glmnet? e.g. is cv.glmnet being used "under the hood" :) ? (that would explain the reason it takes so much more time)
- Are split_min_obs / split_min_events the equivalent of min.node.size in ranger? i.e. a node should have at least split_min_obs observations (out of which split_min_events are events) to consider splitting it?
- How does aorsf do when n < p? I have a dataset of 145 observations x 10000 features and there seems to be some overhead before the training phase (before "growing tree No." starts appearing) and during prediction (maybe it has to do with not having parallelized trees?). Especially, permutation importance (which I want to use in my implementation of RFE + RSFs) takes so much more time to compute (only used 100 trees, stopped it after some minutes...)! With ranger, even with no parallelization, I can train and predict with the same dataset in less than 3 secs. Let me know if you want me to share the dataset for more investigation; it would be great if somehow things could be sped up a bit!
- Reprex (via mlr3extralearners):

library(mlr3proba)
#> Loading required package: mlr3
library(mlr3extralearners)
task_mRNA = readRDS(file = gzcon(url('https://github.com/bblodfon/paad-survival-bench/blob/main/data/task_mRNA_flt.rds?raw=True'))) # 1000 features
dsplit = mlr3::partition(task_mRNA, ratio = 0.8)
train_indxs = dsplit$train
test_indxs = dsplit$test
orsf = lrn('surv.aorsf',
  importance = 'none',
  oobag_pred_type = 'surv',
  attach_data = FALSE,
  verbose_progress = TRUE,
  n_tree = 10
)
orsf$train(task = task_mRNA, row_ids = train_indxs)
#> growing tree no. 9 of 10
p = orsf$predict(task = task_mRNA, row_ids = test_indxs)
#> Error: Assertion on 'length(times)' failed: FALSE.
Created on 2023-02-27 with reprex v2.0.2
I don't want to carry the training data to calculate importance or other things, so I used attach_data = FALSE, but it seems that in the mlr3extralearners wrapper code you use the $model$data slot for prediction either way, which results in the above error. So I guess since the time and status of the training data are always needed for prediction, and we would like to keep the attach_data option but allow prediction to work when attach_data = FALSE, maybe consider attaching these two into the $model slot and accessing them in the wrapper code (rather than model$data)?
In the case where we have missing data on the testing set, is the imputation done based on the mean&mode of the testing set or the training set?
git hash: c73bb98c
Package License: MIT + file LICENSE
default methods to find linear combos of predictors in classification and regression trees
library(aorsf)
fit <- orsf(mpg ~ ., data = mtcars)
orsf_vs(fit, n_predictor_min = 1)
#> Error in eval(expr, envir, enclos) : Not a matrix.
#> Error: Error in eval(expr, envir, enclos) : Not a matrix.
Created on 2024-03-06 with reprex v2.1.0
Currently, oblique forests just try to make one linear combination of predictors, and they will try again (with different predictors) if that split isn't good enough.
There should be more flexibility with this. What if I want to try 5 different linear combos and pick the best one?
This would change some of the core C++ routines for splitting data.
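One way to sketch the idea: try k candidate coefficient vectors, split each projection at its median, and keep the combo with the best split statistic. A minimal standalone illustration using variance reduction as the score (this is not aorsf's actual split statistic, which for survival outcomes is a log-rank test, and all function names are hypothetical):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Project x (n rows, p columns) onto a coefficient vector beta.
std::vector<double> project(const std::vector<std::vector<double>>& x,
                            const std::vector<double>& beta) {
  std::vector<double> z(x.size(), 0.0);
  for (std::size_t i = 0; i < x.size(); ++i)
    for (std::size_t j = 0; j < beta.size(); ++j)
      z[i] += x[i][j] * beta[j];
  return z;
}

// Variance reduction of y when splitting at z <= cut (equal weights).
double split_score(const std::vector<double>& z,
                   const std::vector<double>& y, double cut) {
  double n_l = 0, n_r = 0, s_l = 0, s_r = 0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    if (z[i] <= cut) { n_l += 1; s_l += y[i]; }
    else             { n_r += 1; s_r += y[i]; }
  }
  if (n_l == 0 || n_r == 0) return 0.0;  // degenerate split
  double n = n_l + n_r;
  double mu = (s_l + s_r) / n, mu_l = s_l / n_l, mu_r = s_r / n_r;
  double ss = 0, ss_l = 0, ss_r = 0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    double d = y[i] - mu;
    ss += d * d;
    if (z[i] <= cut) { double e = y[i] - mu_l; ss_l += e * e; }
    else             { double e = y[i] - mu_r; ss_r += e * e; }
  }
  return (ss - ss_l - ss_r) / n;
}

// Try each candidate linear combination, split at the (lower) median of
// its projection, and return the index of the best-scoring combo.
std::size_t best_combo(const std::vector<std::vector<double>>& x,
                       const std::vector<double>& y,
                       const std::vector<std::vector<double>>& betas) {
  std::size_t best = 0;
  double best_score = -1.0;
  for (std::size_t k = 0; k < betas.size(); ++k) {
    std::vector<double> z = project(x, betas[k]);
    std::vector<double> zs = z;
    std::sort(zs.begin(), zs.end());
    double cut = zs[zs.size() / 2 - 1];  // lower median as the cutpoint
    double s = split_score(z, y, cut);
    if (s > best_score) { best_score = s; best = k; }
  }
  return best;
}
```

The real change would live in the C++ splitting routines, where the forest already retries failed splits; best-of-k would just evaluate the candidates up front instead of sequentially.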
this is an idea to make oblique RFs more explainable for a single prediction:
We would assess the validity of this by measuring correlation between the forest's predictions and the model's predictions. It would necessitate fitting 1 model per prediction, so not computationally great.
The mean survival time is estimated as the area under the survival curve in the interval 0 to max time observed (Klein & Moeschberger, 2003).
git hash: b9f49833
Package License: MIT + file LICENSE
Hi Byron,
Just tried to use aorsf with survival data via mlr3extralearners and got this error:
library(mlr3extralearners)
library(mlr3pipelines)
library(mlr3proba)
#> Loading required package: mlr3
task = tsk('lung')
pre = po('encode', method = 'treatment') %>>%
po('imputelearner', lrn('regr.rpart'))
task = pre$train(task)[[1]]
task
#> <TaskSurv:lung> (228 x 10): Lung Cancer
#> * Target: time, status
#> * Properties: -
#> * Features (8):
#> - int (7): age, inst, meal.cal, pat.karno, ph.ecog, ph.karno, wt.loss
#> - dbl (1): sex
aorsf = lrn('surv.aorsf', control_type = 'fast',
oobag_pred_type = 'surv', importance = 'anova',
attach_data = TRUE)
aorsf$train(task)
#> Error: some variables have unsupported type:
#> <status> has type <logical>
#> supported types are numeric, integer, units, factor, and Surv
aorsf$errors
#> character(0)
Created on 2024-04-11 with reprex v2.0.2
Maybe you haven't updated it with recent changes? I see for example that the oobag_pred_type default is risk now, not surv.