ropensci / aorsf

Accelerated Oblique Random Survival Forests
Home Page: https://docs.ropensci.org/aorsf
License: Other
If mtry is 1, the regression coefficient for the given covariate should just be set to 1 rather than whatever happens to come out of running the specified model to get coefficients for oblique splits.
just need to check if the oob stat values are empty and stop if they are
It's annoying when you want to select the variables that orsf_vs returns but then you need to clean the orsf_vs output.
Reduction in variance is a standard technique for assessing regression tree split purity. It would be great to implement a function in utility.cpp with the following inputs:
- y_node (type: arma::vec): the outcome values in the current tree node
- w_node (type: arma::vec): a vector of non-zero weights (integer valued) the same length as y_node
- g_node (type: arma::uvec): a vector of 0s and 1s the same length as y_node, with 0 indicating going to the left child node and 1 indicating the right.

The excerpt below from Ishwaran et al 2014 summarizes the reduction in variance computation very well. We will need to code this, incorporating weights through w_node. Should be able to check that the function gives the exact right answer using matrixStats::weightedVar. @ciaran-evans, would you like to look into this? You could actually write the function as a stand-alone function in orsf_oop.cpp with the usual //[[Rcpp::export]] tag rather than put it into utility.cpp, and I could move it over when it's ready. Basically we would just want the function to be named compute_var_reduction, and we would want to create the file tests/testthat/test-compute_var_reduction.R that tests to make sure our variance reduction function gives the same answer as a function written in R.
Something I noticed while updating aorsf-bench in response to bcjaeger/aorsf-bench#7 (@darentsai): in the paper associated with aorsf-bench, all benchmarks of variable importance were based on predicted survival probability. Since then, aorsf was updated to automatically use mortality predictions for VI computation because it was much simpler and didn't require specification of a prediction horizon. However, I think it would be ideal to retain the methods that were used in the paper, so I think it would be helpful to add an option to make variable importance for survival forests use predicted survival probability.
should revert to null for calling predict later
Follow the template of TreeSurvival.
Classification should allow for binary or categorical outcomes, but initially just build it for binary outcomes
Regression should allow for continuous outcomes
I'd like to use a "pre-made" Surv object as the response in the formula interface. Would it be possible to allow for that? For some context, this is to make it easier to use the aorsf engine in tidymodels' workflow objects.
library(aorsf)
library(survival)
lung_orsf <- na.omit(lung)
lung_orsf$surv <- Surv(lung_orsf$time, lung_orsf$status)
lung_orsf <- lung_orsf[, -c(2,3)]
aorsf::orsf(
  data = lung_orsf,
  formula = surv ~ age + ph.ecog
)
#> Error: formula must have two variables (time & status) as the response
Created on 2022-11-01 with reprex v2.0.2
need to add post-predict cleaning step
It is computed dynamically, but it does not change, so it can be set right after the trees are grown.
@bcjaeger Congratulations on getting the first "gold standard" package through rOpenSci's peer review process! Could you please update the website link in the repo settings here to https://docs.ropensci.org/aorsf? Thanks! Mark
orsf_summarize_uni gives a summary for the first class by default. Would it be helpful to include an input allowing users to select the class to summarize? In the reproducible example below, the printed summary is limited to the Adelie penguin. The summaries for all classes are included in the orsf_summarize_uni object, but having a way to print them directly would be useful.
library(aorsf)
fit = orsf(species ~ ., data = penguins_orsf)
smry = orsf_summarize_uni(fit)
smry
#>
#> -- bill_length_mm (VI Rank: 1) -------------------------
#>
#> |--------------- Probability ---------------|
#> Value Mean Median 25th % 75th %
#> 36.6 0.7044331 0.85955664 0.344965208 0.9625072
#> 39.5 0.6535697 0.80727711 0.266058440 0.9468242
#> 44.5 0.3697256 0.34953660 0.035810945 0.6409438
#> 48.6 0.2234635 0.14498102 0.008626836 0.4262582
#> 50.8 0.1896335 0.09561501 0.009713793 0.3866315
#>
#> -- island (VI Rank: 2) ---------------------------------
#>
#> |--------------- Probability ---------------|
#> Value Mean Median 25th % 75th %
#> Biscoe 0.5007671 0.4513207 0.01346324 0.9491816
#> Dream 0.4360160 0.1990975 0.07250154 0.8740631
#> Torgersen 0.6220022 0.6459860 0.22850515 0.9810957
#>
#> -- flipper_length_mm (VI Rank: 3) ----------------------
#>
#> |--------------- Probability ---------------|
#> Value Mean Median 25th % 75th %
#> 185 0.5910941 0.3937772 0.30187167 0.9552354
#> 190 0.5645121 0.3637454 0.26707447 0.9433666
#> 197 0.4952256 0.2889080 0.18181057 0.8831356
#> 213 0.3292072 0.1116319 0.02213975 0.6602175
#> 221 0.3025870 0.1087965 0.01011283 0.6184659
#>
#> -- bill_depth_mm (VI Rank: 4) --------------------------
#>
#> |--------------- Probability ---------------|
#> Value Mean Median 25th % 75th %
#> 14.3 0.3560189 0.1540067 0.008714118 0.7293553
#> 15.6 0.3943511 0.1673597 0.045879915 0.7901435
#> 17.3 0.4651208 0.2253786 0.106483255 0.9155495
#> 18.7 0.5094000 0.2998837 0.133393326 0.9473630
#> 19.5 0.5276217 0.3456728 0.162150077 0.9524491
#>
#> -- sex (VI Rank: 5) ------------------------------------
#>
#> |--------------- Probability ---------------|
#> Value Mean Median 25th % 75th %
#> female 0.4217927 0.1468877 0.02271961 0.8977797
#> male 0.4742529 0.3384249 0.05131323 0.9453019
#>
#> -- body_mass_g (VI Rank: 6) ----------------------------
#>
#> |--------------- Probability ---------------|
#> Value Mean Median 25th % 75th %
#> 3300 0.4825696 0.2233228 0.121143276 0.9339286
#> 3550 0.4750676 0.2110828 0.101099039 0.9372102
#> 4050 0.4638683 0.2262442 0.076359495 0.9343941
#> 4775 0.4356364 0.2836723 0.032402907 0.8618333
#> 5440 0.4154148 0.2934098 0.009086565 0.8180788
#>
#> -- year (VI Rank: 7) -----------------------------------
#>
#> |--------------- Probability ---------------|
#> Value Mean Median 25th % 75th %
#> 2007 0.4324757 0.1512571 0.01277177 0.9280795
#> 2008 0.4436049 0.1834657 0.01505514 0.9434973
#> 2009 0.4511092 0.1993108 0.01914280 0.9450677
#>
#> Predicted probability for top 7 predictors
Created on 2024-03-22 with reprex v2.1.0
git hash: cf10d04a
Package License: MIT + file LICENSE
git hash: f3262096
Package License: MIT + file LICENSE
right now we have things like orsf_control_fast and orsf_control_cph that are entirely for survival analyses. It would be more appropriate with this update to introduce:
orsf_control_survival( method, scale_x, ties, elastic_mix, elastic_df, max_iter, epsilon )
orsf_control_classification( method, scale_x, elastic_mix, elastic_df, max_iter, epsilon )
orsf_control_regression( method, scale_x, elastic_mix, elastic_df, max_iter, epsilon )
this would
obliqueRF is no longer on CRAN since May. Can aorsf be used for classification and regression or is it only for survival trees? Are there any examples?
It would be nice to do something like:
fit_spec <- orsf(time+status~., na_action = 'na_impute_meanmode', n_tree = 10, no_fit = TRUE)
orsf_train(fit_spec, data = pbc)
orsf_train(fit_spec, data = flchain)
pd_values hits an error here, likely because of the multiple categories.
git hash: b3212ad8
Important: All failing checks above must be addressed prior to proceeding
Package License: MIT + file LICENSE
This issue showed someone trying to predict survival time with aorsf via tidymodels. We currently only have predictions of the survival probability implemented in censored. Looking around aorsf I didn't see any prediction type that we could wrap for "survival time". Is that correct? Would you consider implementing that?
git hash: 1fed941c
Package License: MIT + file LICENSE
An error is thrown due to type coercion during prep_y()
I found this example caused R to crash. Likely has something to do with sample_fraction = 1 causing oobag_denom to backfire in orsf_cpp:
library(aorsf)
oblique_1 <- orsf(species ~ flipper_length_mm + bill_length_mm,
                  data = penguins_orsf,
                  sample_with_replacement = FALSE,
                  sample_fraction = 1,
                  split_min_obs = nrow(penguins_orsf) - 1,
                  tree_seeds = 649725,
                  oobag_pred_type = 'none',
                  n_tree = 1)
grid <- tidyr::expand_grid(
  flipper_length_mm = seq(170, 235, len = 100),
  bill_length_mm = seq(30, 70, len = 100)
)
predict(oblique_1, newdata = grid, pred_type = 'prob')
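For context, with sample_fraction = 1 and sampling without replacement, every row is in-bag for every tree, so an out-of-bag denominator of zero is plausible. A hedged sketch of the kind of guard that would prevent a divide-by-zero when aggregating out-of-bag predictions (names here are illustrative, not aorsf's actual internals):

```cpp
#include <cstddef>
#include <vector>

// pred_sum[i]: sum of predictions for row i over trees where it was
// out-of-bag; denom[i]: how many trees had row i out-of-bag. When a row
// is never out-of-bag (denom == 0), return 0 instead of computing 0/0.
std::vector<double> finalize_oobag(const std::vector<double>& pred_sum,
                                   const std::vector<double>& denom) {
  std::vector<double> out(pred_sum.size(), 0.0);
  for (std::size_t i = 0; i < pred_sum.size(); ++i)
    out[i] = denom[i] > 0 ? pred_sum[i] / denom[i] : 0.0;
  return out;
}
```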
need a function that can inspect the formula and data inputs and reliably determine the outcome type
this would be a useful feature
Taking sub-matrix views will likely be slower than iterating over the relevant columns/rows. Consider implementing this function for the data class:
arma::vec submat_mult_lincomb(arma::mat& x,
                              arma::uvec& x_rows,
                              arma::uvec& x_cols,
                              arma::vec& beta){
  // fill with zeros; arma::vec(n) alone leaves elements uninitialized,
  // so accumulating with += below would read garbage
  arma::vec out(x_rows.size(), arma::fill::zeros);
  arma::uword i = 0;
  arma::uword j = 0;
  // out[i] = sum over the selected columns of x(row, col) * beta[j]
  for(auto row : x_rows){
    j = 0;
    for(auto col : x_cols){
      out[i] += x.at(row, col) * beta[j];
      j++;
    }
    i++;
  }
  return(out);
}
Instead of using R functions to paste references, we should make BibTeX entries similar to https://github.com/mlr-org/mlr3extralearners/blob/main/R/bibentries.R
git hash: 4b86e904
Package License: MIT + file LICENSE
Hey @bcjaeger,
I am just playing around with this package. It's great. But I am wondering whether it makes sense for predict.orsf_fit() to actually return predictions instead of throwing an error when pred_horizon = 0. I think it makes sense to return 0, 1, and 0 when pred_type is 'risk', 'surv', and 'chf', respectively. Do you agree?
It may seem senseless to request predictions at time zero, but, as an example, I was trying to make risk or survival curves by making predictions from time zero to time t at equally-spaced intervals, and noted the error. I think the values suggested above are both statistically valid and would make the predict function a little friendlier.
For comparison, I contributed the predict.flexsurvreg() function to the {flexsurv} package, and predictions at time = 0 are valid and return the values suggested above.
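The values suggested above follow directly from the usual transformations between the cumulative hazard and the survival probability; a minimal sketch:

```cpp
#include <cmath>

// At prediction horizon t = 0, the cumulative hazard is 0 by definition,
// so the three prediction types follow immediately:
//   chf:  H(0) = 0
//   surv: S(0) = exp(-H(0)) = 1
//   risk: 1 - S(0) = 0
double surv_from_chf(double chf) { return std::exp(-chf); }
double risk_from_chf(double chf) { return 1.0 - surv_from_chf(chf); }
```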
Hi @bcjaeger, I was checking your package via mlr3extralearners (really good work!!!) and had some questions/notes for you when you have some time:

- I didn't see something like num.threads in ranger, so I guess parallelization is not supported? Is it because it's difficult to implement or somehow it is not possible in the case of ORSFs?
- About control_type (fast vs cph vs net): do you think that it would make sense to group some of these as a different random forest learner, as it might be the case that e.g. fast and cph produce similar forests/results/predictions (using the same dataset and using control_cph_iter_max = to_tune(1,20) for example) compared to net (where tuning the alpha might result in completely different solutions)? In your recent arXiv paper you had each as a separate learner (e.g. aorsf-net, aorsf-fast and aorsf-cph). Or would you advise using fast in all cases (i.e. not tuning at all) based on the results from your recent arXiv paper (though alpha wasn't tuned there if I understood correctly)?
- With orsf_control_net, do you leave the tuning of the lambda to glmnet? e.g. is cv.glmnet being used "under the hood" :) ? (that would explain the reason it takes so much more time)
- Are split_min_obs / split_min_events the equivalent of min.node.size in ranger? i.e. a node should have at least split_min_obs observations (out of which split_min_events are events) to consider splitting it?
- How does aorsf do when n < p? I have a dataset of 145 observations x 10000 features and there seems to be some overhead before the training phase (before "growing tree No." starts appearing) and during prediction (maybe it has to do with not having parallelized trees?). Especially, permutation importance (which I want to use in my implementation of RFE + RSFs) takes so much more time to compute (only used 100 trees, stopped it after some minutes...)! With ranger, even with no parallelization, I can train and predict with the same dataset in less than 3 secs. Let me know if you want me to share the dataset for more investigation; it would be great if somehow things could be sped up a bit!
- Reprex (via mlr3extralearners):

library(mlr3proba)
#> Loading required package: mlr3
library(mlr3extralearners)
task_mRNA = readRDS(file = gzcon(url('https://github.com/bblodfon/paad-survival-bench/blob/main/data/task_mRNA_flt.rds?raw=True'))) # 1000 features
dsplit = mlr3::partition(task_mRNA, ratio = 0.8)
train_indxs = dsplit$train
test_indxs = dsplit$test
orsf = lrn('surv.aorsf',
  importance = 'none',
  oobag_pred_type = 'surv',
  attach_data = FALSE,
  verbose_progress = TRUE,
  n_tree = 10
)
orsf$train(task = task_mRNA, row_ids = train_indxs)
#> growing tree no. 9 of 10
p = orsf$predict(task = task_mRNA, row_ids = test_indxs)
#> Error: Assertion on 'length(times)' failed: FALSE.
Created on 2023-02-27 with reprex v2.0.2
I don't want to carry the training data to calculate importance or other things, so I used attach_data = FALSE, but it seems that in the mlr3extralearners wrapper code you use the $model$data slot for prediction either way, which results in the above error. So I guess since the time and status of the training data are always needed for prediction, and we would like to keep the attach_data option but allow prediction to work when attach_data = FALSE, maybe consider attaching these two into the $model slot and accessing them in the wrapper code (rather than model$data)?
In the case where we have missing data on the testing set, is the imputation done based on the mean&mode of the testing set or the training set?
git hash: c73bb98c
Package License: MIT + file LICENSE
default methods to find linear combos of predictors in classification and regression trees
library(aorsf)
fit <- orsf(mpg ~ ., data = mtcars)
orsf_vs(fit, n_predictor_min = 1)
#> Error in eval(expr, envir, enclos) : Not a matrix.
#> Error: Error in eval(expr, envir, enclos) : Not a matrix.
Created on 2024-03-06 with reprex v2.1.0
Currently, oblique forests just try to make one linear combination of predictors, and they will try again (with different predictors) if that split isn't good enough.
There should be more flexibility with this. What if I want to try 5 different linear combos and pick the best one?
This would change some of the core C++ routines for splitting data.
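One way to sketch the idea: try k candidate coefficient vectors, split each projection at its median, and keep the combo with the best split statistic. A minimal standalone illustration using variance reduction as the score (this is not aorsf's actual split statistic, which for survival outcomes is a log-rank test, and all function names are hypothetical):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Project x (n rows, p columns) onto a coefficient vector beta.
std::vector<double> project(const std::vector<std::vector<double>>& x,
                            const std::vector<double>& beta) {
  std::vector<double> z(x.size(), 0.0);
  for (std::size_t i = 0; i < x.size(); ++i)
    for (std::size_t j = 0; j < beta.size(); ++j)
      z[i] += x[i][j] * beta[j];
  return z;
}

// Variance reduction of y when splitting at z <= cut (equal weights).
double split_score(const std::vector<double>& z,
                   const std::vector<double>& y, double cut) {
  double n_l = 0, n_r = 0, s_l = 0, s_r = 0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    if (z[i] <= cut) { n_l += 1; s_l += y[i]; }
    else             { n_r += 1; s_r += y[i]; }
  }
  if (n_l == 0 || n_r == 0) return 0.0;  // degenerate split
  double n = n_l + n_r;
  double mu = (s_l + s_r) / n, mu_l = s_l / n_l, mu_r = s_r / n_r;
  double ss = 0, ss_l = 0, ss_r = 0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    double d = y[i] - mu;
    ss += d * d;
    if (z[i] <= cut) { double e = y[i] - mu_l; ss_l += e * e; }
    else             { double e = y[i] - mu_r; ss_r += e * e; }
  }
  return (ss - ss_l - ss_r) / n;
}

// Try each candidate linear combination, split at the (lower) median of
// its projection, and return the index of the best-scoring combo.
std::size_t best_combo(const std::vector<std::vector<double>>& x,
                       const std::vector<double>& y,
                       const std::vector<std::vector<double>>& betas) {
  std::size_t best = 0;
  double best_score = -1.0;
  for (std::size_t k = 0; k < betas.size(); ++k) {
    std::vector<double> z = project(x, betas[k]);
    std::vector<double> zs = z;
    std::sort(zs.begin(), zs.end());
    double cut = zs[zs.size() / 2 - 1];  // lower median as the cutpoint
    double s = split_score(z, y, cut);
    if (s > best_score) { best_score = s; best = k; }
  }
  return best;
}
```

The real change would live in the C++ splitting routines, where the forest already retries failed splits; best-of-k would just evaluate the candidates up front instead of sequentially.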
this is an idea to make oblique RFs more explainable for a single prediction:
We would assess the validity of this by measuring correlation between the forest's predictions and the model's predictions. It would necessitate fitting 1 model per prediction, so not computationally great.
The mean survival time is estimated as the area under the survival curve in the interval 0 to max time observed (Klein & Moeschberger, 2003).
git hash: b9f49833
Package License: MIT + file LICENSE
Hi Byron,
Just tried to use aorsf with survival data via mlr3extralearners and got this error:
library(mlr3extralearners)
library(mlr3pipelines)
library(mlr3proba)
#> Loading required package: mlr3
task = tsk('lung')
pre = po('encode', method = 'treatment') %>>%
po('imputelearner', lrn('regr.rpart'))
task = pre$train(task)[[1]]
task
#> <TaskSurv:lung> (228 x 10): Lung Cancer
#> * Target: time, status
#> * Properties: -
#> * Features (8):
#> - int (7): age, inst, meal.cal, pat.karno, ph.ecog, ph.karno, wt.loss
#> - dbl (1): sex
aorsf = lrn('surv.aorsf', control_type = 'fast',
oobag_pred_type = 'surv', importance = 'anova',
attach_data = TRUE)
aorsf$train(task)
#> Error: some variables have unsupported type:
#> <status> has type <logical>
#> supported types are numeric, integer, units, factor, and Surv
aorsf$errors
#> character(0)
Created on 2024-04-11 with reprex v2.0.2
Maybe you haven't updated it with recent changes? I see for example that the oobag_pred_type default is risk now, not surv.