
aorsf's Introduction

rOpenSci

Project Status: Abandoned

This repository has been archived. The former README is now in README-NOT.md.

aorsf's People

Contributors

bcjaeger, ciaran-evans, eltociear, jeroen, sawyerweld


aorsf's Issues

faster matrix multiplication for leaf prediction

Taking sub-matrix views will likely be slower than iterating over the relevant columns and rows directly.

Consider implementing this function for the data class:


arma::vec submat_mult_lincomb(const arma::mat& x,
                              const arma::uvec& x_rows,
                              const arma::uvec& x_cols,
                              const arma::vec& beta){

 // computes x(x_rows, x_cols) * beta without forming a sub-matrix view,
 // iterating over the selected rows and columns directly
 arma::vec out(x_rows.n_elem, arma::fill::zeros);
 arma::uword i = 0;
 arma::uword j = 0;

 for(auto row : x_rows){
  j = 0;
  for(auto col : x_cols){
   out[i] += x.at(row, col) * beta[j];
   j++;
  }
  i++;
 }

 return(out);

}
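
As a sanity check, the helper should reproduce the plain R computation below (illustrative values; note that the C++ indices are 0-based while R's are 1-based):

# R reference that submat_mult_lincomb should reproduce
set.seed(1)
x <- matrix(rnorm(20), nrow = 5)
x_rows <- c(1, 3, 5)
x_cols <- c(2, 4)
beta <- c(0.5, -1)
drop(x[x_rows, x_cols, drop = FALSE] %*% beta)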

pkgcheck results - master

Checks for aorsf (v0.0.0.9000)

git hash: cf10d04a

  • ✔️ Package name is available
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 95.8%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE

Feature request: Predict survival time

This issue showed someone trying to predict survival time with aorsf via tidymodels. We currently only have survival probability predictions implemented in censored. Looking around aorsf, I didn't see any prediction type that we could wrap for "survival time". Is that correct? Would you consider implementing it? 🙌

don't fit an oblique split if mtry is 1

If mtry is 1, the regression coefficient for the given covariate should simply be set to 1, rather than whatever happens to come out of running the specified model to get coefficients for oblique splits.
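
A minimal sketch of the intended behavior at the R level (illustrative only; the actual fix belongs in the C++ splitting code, and coxph stands in here for the cph/fast/net routines):

library(survival)

get_lincomb_coefs <- function(x_node, y_node, mtry) {
  # with a single sampled predictor, the "linear combination" is just
  # that predictor, so a coefficient of 1 is all that is needed
  if (mtry == 1) return(1)
  # otherwise, fit a model to get coefficients for the oblique split
  unname(coef(coxph(y_node ~ x_node)))
}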

pkgcheck results - master

Checks for aorsf (v0.0.0.9000)

git hash: b3212ad8

  • ✔️ Package name is available
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✖️ These functions do not have examples: [print.aorsf].
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 97.1%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE

leaf-adjacent models for explainability

This is an idea to make oblique RFs more explainable for a single prediction:

  1. pull out the regression coefficients from the node directly above the given observation's predicted leaf node in each tree
  2. aggregate them
  3. use the aggregate regression coefficients as a starting value in a regression model fitted to the training set
  4. (optional) weight the training set by nearest neighbors of the observation

We would assess the validity of this by measuring the correlation between the forest's predictions and the model's predictions. It would require fitting one model per prediction, so it would not be computationally cheap.
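
A rough sketch of steps 2-4, assuming step 1 has already produced leaf_adjacent_coefs, a list with one coefficient vector per tree (extracting those coefficients would require walking each tree, which is not shown here):

library(survival)

fit_leaf_adjacent_explainer <- function(leaf_adjacent_coefs,
                                        x_train, y_train,
                                        weights = NULL) {
  # step 2: aggregate the per-tree coefficients (here, a simple mean)
  beta_start <- Reduce(`+`, leaf_adjacent_coefs) / length(leaf_adjacent_coefs)
  # steps 3-4: use the aggregate as starting values for a model fitted
  # to the (optionally nearest-neighbor weighted) training data
  coxph(y_train ~ x_train, weights = weights, init = beta_start)
}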

survival learner issues via mlr3

Hi Byron,

Just tried to use aorsf with survival data via mlr3extralearners and got this error:

library(mlr3extralearners)
library(mlr3pipelines)
library(mlr3proba)
#> Loading required package: mlr3

task = tsk('lung')
pre = po('encode', method = 'treatment') %>>%
  po('imputelearner', lrn('regr.rpart'))
task = pre$train(task)[[1]]
task
#> <TaskSurv:lung> (228 x 10): Lung Cancer
#> * Target: time, status
#> * Properties: -
#> * Features (8):
#>   - int (7): age, inst, meal.cal, pat.karno, ph.ecog, ph.karno, wt.loss
#>   - dbl (1): sex

aorsf = lrn('surv.aorsf', control_type = 'fast',
  oobag_pred_type = 'surv', importance = 'anova',
  attach_data = TRUE)

aorsf$train(task)
#> Error: some variables have unsupported type:
#>  <status> has type <logical>
#> supported types are numeric, integer, units, factor, and Surv
aorsf$errors
#> character(0)

Created on 2024-04-11 with reprex v2.0.2

Maybe the learner hasn't been updated with recent changes? I see, for example, that the oobag_pred_type default is now risk, not surv.

Enable a single `Surv` object as the response in a formula for `orsf()`

I'd like to use a "pre-made" Surv object as the response in the formula interface. Would it be possible to allow for that? For some context, this is to make it easier to use the aorsf engine in tidymodels' workflow objects.

library(aorsf)
library(survival)

lung_orsf <- na.omit(lung)
lung_orsf$surv <- Surv(lung_orsf$time, lung_orsf$status)
lung_orsf <- lung_orsf[, -c(2,3)]

aorsf::orsf(
  data = lung_orsf,
  formula = surv ~ age + ph.ecog
)
#> Error: formula must have two variables (time & status) as the response

Created on 2022-11-01 with reprex v2.0.2
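
Until this is supported, a user-side workaround could expand the pre-made Surv column back into time/status variables before calling orsf(). A sketch, assuming a right-censored Surv object (expand_surv is just an illustrative helper):

expand_surv <- function(data, formula) {
  resp <- all.vars(formula)[1]
  if (inherits(data[[resp]], "Surv")) {
    s <- data[[resp]]
    # assumes no existing columns named time/status
    data$time   <- s[, "time"]
    data$status <- s[, "status"]
    data[[resp]] <- NULL
    formula <- update(formula, time + status ~ .)
  }
  list(data = data, formula = formula)
}

prepped <- expand_surv(lung_orsf, surv ~ age + ph.ecog)
aorsf::orsf(data = prepped$data, formula = prepped$formula)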

unexpected error

I found that this example causes R to crash. It likely has something to do with sample_fraction = 1 causing oobag_denom to misbehave in orsf_cpp: with sample_with_replacement = FALSE and sample_fraction = 1, every observation is in-bag, so there are no out-of-bag rows to fill the denominator.

library(aorsf)

oblique_1 <- orsf(species ~ flipper_length_mm + bill_length_mm,
                  data = penguins_orsf,
                  sample_with_replacement = FALSE,
                  sample_fraction = 1,
                  split_min_obs = nrow(penguins_orsf)-1,
                  tree_seeds = 649725,
                  oobag_pred_type = 'none',
                  n_tree = 1)

grid <- tidyr::expand_grid(
 flipper_length_mm = seq(170, 235, len = 100),
 bill_length_mm = seq(30, 70, len = 100)
)

predict(oblique_1, newdata = grid, pred_type = 'prob')

`orsf_vs` error with `n_predictor_min = 1`

library(aorsf)
fit <- orsf(mpg ~ ., data = mtcars)
orsf_vs(fit, n_predictor_min = 1)
#> Error in eval(expr, envir, enclos) : Not a matrix.
#> Error: Error in eval(expr, envir, enclos) : Not a matrix.

Created on 2024-03-06 with reprex v2.1.0

allow survival predictions for VI

Something I noticed while updating aorsf-bench in response to bcjaeger/aorsf-bench#7 (👋 @darentsai):

In the paper associated with aorsf-bench, all benchmarks of variable importance were based on predicted survival probability. Since then, aorsf was updated to automatically use mortality predictions for VI computation, because that is much simpler and doesn't require specifying a prediction horizon. However, I think it would be ideal to retain the methods that were used in the paper, so it would be helpful to add an option that makes variable importance for survival forests use predicted survival probability.

pkgcheck results - master

Checks for aorsf (v0.0.0.9000)

git hash: c73bb98c

  • ✔️ Package name is available
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 97.1%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE

TreeClassification and TreeRegression classes

Follow the template of TreeSurvival.

Classification should allow for binary or categorical outcomes, but initially it should be built just for binary outcomes.

  • leaf summary can be either a vote or a probability

Regression should allow for continuous outcomes

  • leaf summary is a mean or some other summary

Questions + an issue

Hi @bcjaeger, I was checking out your package via mlr3extralearners (really good work!!!) and had some questions/notes for you when you have some time:

  1. I see no parameter like num.threads in ranger, so I guess parallelization is not supported? Is it because it's difficult to implement, or is it somehow not possible in the case of ORSFs?
  2. I was thinking of tuning the control_type (fast vs cph vs net). Do you think it would make sense to group some of these into a different random forest learner? It might be the case that, e.g., fast and cph produce similar forests/results/predictions (using the same dataset and, for example, control_cph_iter_max = to_tune(1, 20)) compared to net, where tuning alpha might result in completely different solutions. In your recent arXiv paper you had each as a separate learner (e.g. aorsf-net, aorsf-fast and aorsf-cph). Or would you advise using fast in all cases (i.e. not tuning at all), based on the results from your recent arXiv paper (though alpha wasn't tuned there, if I understood correctly)?
  3. In orsf_control_net, do you leave the tuning of lambda to glmnet? E.g., is cv.glmnet being used "under the hood" :) ? (That would explain why it takes so much more time.)
  4. For tuning purposes, is split_min_obs/split_min_events the equivalent of min.node.size in ranger? I.e., a node should have at least split_min_obs observations (of which at least split_min_events are events) to be considered for splitting?
  5. Have you tested the package in more high-dimensional settings (in the paper only one dataset had n < p)? I have a dataset of 145 observations x 10000 features, and there seems to be some overhead before the training phase (before "growing tree no." starts appearing) and during prediction (maybe this has to do with the trees not being parallelized?). In particular, permutation importance (which I want to use in my implementation of RFE + RSFs) takes much more time to compute (I only used 100 trees and stopped it after a few minutes!). With ranger, even with no parallelization, I can train and predict with the same dataset in less than 3 seconds. Let me know if you want me to share the dataset for more investigation; it would be great if things could be sped up a bit!
  6. In the arXiv paper, you mention the '1 point region of practical equivalence' for comparing the different learners using a Bayesian linear mixed model (awesome!). Is that 1 point difference equivalent to a 0.01 difference in the C-index, for example, since everything is scaled by 100? If that is the case, is even a 5 point average difference a practical difference at all?
  7. A small issue I found (maybe for mlr3extralearners):
library(mlr3proba)
#> Loading required package: mlr3
library(mlr3extralearners)

task_mRNA = readRDS(file = gzcon(url('https://github.com/bblodfon/paad-survival-bench/blob/main/data/task_mRNA_flt.rds?raw=True'))) # 1000 features

dsplit = mlr3::partition(task_mRNA, ratio = 0.8)
train_indxs = dsplit$train
test_indxs  = dsplit$test

orsf = lrn('surv.aorsf',
  importance = 'none',
  oobag_pred_type = 'surv',
  attach_data = FALSE,
  verbose_progress = TRUE, 
  n_tree = 10
)

orsf$train(task = task_mRNA, row_ids = train_indxs)
#>  growing tree no. 9 of 10
p = orsf$predict(task = task_mRNA, row_ids = test_indxs)
#> Error: Assertion on 'length(times)' failed: FALSE.

Created on 2023-02-27 with reprex v2.0.2

I don't want to carry the training data around just to calculate importance or other things, so I used attach_data = FALSE, but it seems the mlr3extralearners wrapper code uses the $model$data slot for prediction either way, which results in the above error.

Since the time and status of the training data are always needed for prediction, and we would like to keep the attach_data option while still allowing prediction to work when attach_data = FALSE, maybe consider attaching these two variables to the $model slot and accessing them in the wrapper code (rather than through model$data)?

introduce `ltry`

Currently, oblique forests try to make just one linear combination of predictors, and they will try again (with different predictors) if that split isn't good enough.

There should be more flexibility here. What if I want to try 5 different linear combinations and pick the best one?

This would change some of the core C++ routines for splitting data.
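
A rough sketch of the idea at the R level (the real change would be in the C++ splitting routines; score_fun stands in for whatever split statistic is used, and the random beta is a placeholder for fitted coefficients):

best_of_ltry <- function(x_node, score_fun, ltry = 5, mtry = 2) {
  candidates <- replicate(ltry, {
    cols <- sample(ncol(x_node), mtry)
    beta <- rnorm(mtry)  # placeholder for coefficients from cph/fast/net
    lincomb <- x_node[, cols, drop = FALSE] %*% beta
    list(cols = cols, beta = beta, score = score_fun(lincomb))
  }, simplify = FALSE)
  # keep the candidate with the best (largest) split statistic
  candidates[[which.max(vapply(candidates, `[[`, numeric(1), "score"))]]
}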

`orsf_control` will be awkward with options for classification and regression

Right now we have things like orsf_control_fast and orsf_control_cph that are entirely for survival analysis. With this update, it would be more appropriate to introduce:

  • orsf_control_survival( method, scale_x, ties, elastic_mix, elastic_df, max_iter, epsilon )
  • orsf_control_classification( method, scale_x, elastic_mix, elastic_df, max_iter, epsilon )
  • orsf_control_regression( method, scale_x, elastic_mix, elastic_df, max_iter, epsilon )

This would:

  • eliminate confusion about which control method is compatible with which outcome type
  • add an additional layer of protection to make sure a user's outcome type matches their control (see the sketch after this list)
  • deprecate the existing functions
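
A sketch of what the extra protection layer could look like, assuming each new constructor tags its result with an outcome-specific class (the function bodies here are illustrative, not the planned implementation):

orsf_control_survival <- function(method = "fast", scale_x = TRUE, ...) {
  structure(list(method = method, scale_x = scale_x, ...),
            class = c("orsf_control_survival", "orsf_control"))
}

check_control_vs_outcome <- function(control, outcome_type) {
  expected <- sub("orsf_control_", "", class(control)[1])
  if (!identical(expected, outcome_type)) {
    stop("control was made for ", expected, " outcomes, ",
         "but the outcome type is ", outcome_type, call. = FALSE)
  }
  invisible(TRUE)
}

check_control_vs_outcome(orsf_control_survival(), "survival")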

pass new fit arguments to `orsf_train`

It would be nice to do something like:

fit_spec <- orsf(time+status~., na_action = 'na_impute_meanmode', n_tree = 10, no_fit = TRUE)

orsf_train(fit_spec, data = pbc)
orsf_train(fit_spec, data = flchain)

pkgcheck results - master

Checks for aorsf (v0.0.0.9000)

git hash: 1fed941c

  • ✔️ Package name is available
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 97.1%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE

utility functions for impurity of splits (regression trees)

Reduction in variance is a standard technique for assessing regression tree split purity. It would be great to implement a function in utility.cpp with the following inputs:

  1. y_node (type: arma::vec) the outcome values in the current tree node
  2. w_node (type: arma::vec) a vector of non-zero weights (integer valued) the same length as y_node
  3. g_node (type: arma::uvec) a vector of 0s and 1s the same length as y_node, with 0 indicating going to the left child node and 1 indicating the right.

The excerpt below from Ishwaran et al. (2014) summarizes the reduction-in-variance computation very well. We will need to code this, incorporating weights through w_node. We should be able to check that the function gives exactly the right answer using matrixStats::weightedVar. @ciaran-evans, would you like to look into this? You could write the function as a stand-alone function in orsf_oop.cpp with the usual // [[Rcpp::export]] tag rather than putting it into utility.cpp, and I could move it over when it's ready. Basically, we want the function to be named compute_var_reduction, and we want to create the file tests/testthat/test-compute_var_reduction.R to make sure our variance reduction function gives the same answer as a function written in R.

[Excerpt from Ishwaran et al. (2014) describing the reduction-in-variance split statistic]
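
A possible R reference implementation to test compute_var_reduction against. This assumes the statistic is the parent node's weighted variance minus the weight-proportional average of the child variances, with weights treated as replicate counts; the exact normalization would need to match the C++ code (and matrixStats::weightedVar, if we use it in the tests):

compute_var_reduction_R <- function(y_node, w_node, g_node) {
  # weighted (population-style) variance: sum(w * (y - mean)^2) / sum(w)
  wvar <- function(y, w) {
    m <- weighted.mean(y, w)
    sum(w * (y - m)^2) / sum(w)
  }
  w_total <- sum(w_node)
  w_right <- sum(w_node[g_node == 1])
  w_left  <- w_total - w_right
  wvar(y_node, w_node) -
    (w_left  / w_total) * wvar(y_node[g_node == 0], w_node[g_node == 0]) -
    (w_right / w_total) * wvar(y_node[g_node == 1], w_node[g_node == 1])
}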

pkgcheck results - master

Checks for aorsf (v0.0.0.9000)

git hash: b9f49833

  • ✔️ Package name is available
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 97%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE

pkgcheck results - master

Checks for aorsf (v0.0.1)

git hash: f3262096

  • ✔️ Package name is available
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 95.8%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE

Classification Summary Level Selection

orsf_summarize_uni gives a summary for the first class by default. Would it be helpful to include an input that allows users to select the class to summarize? In the reproducible example below, the printed summary is limited to the Adelie penguin class. The summaries for all classes are included in the orsf_summarize_uni object, but having a way to print them directly would be useful.

library(aorsf)

fit =  orsf(species ~ ., data = penguins_orsf)

smry = orsf_summarize_uni(fit)
smry
#> 
#> -- bill_length_mm (VI Rank: 1) -------------------------
#> 
#>            |--------------- Probability ---------------|
#>      Value      Mean     Median      25th %    75th %
#>       36.6 0.7044331 0.85955664 0.344965208 0.9625072
#>       39.5 0.6535697 0.80727711 0.266058440 0.9468242
#>       44.5 0.3697256 0.34953660 0.035810945 0.6409438
#>       48.6 0.2234635 0.14498102 0.008626836 0.4262582
#>       50.8 0.1896335 0.09561501 0.009713793 0.3866315
#> 
#> -- island (VI Rank: 2) ---------------------------------
#> 
#>            |--------------- Probability ---------------|
#>      Value      Mean    Median     25th %    75th %
#>     Biscoe 0.5007671 0.4513207 0.01346324 0.9491816
#>      Dream 0.4360160 0.1990975 0.07250154 0.8740631
#>  Torgersen 0.6220022 0.6459860 0.22850515 0.9810957
#> 
#> -- flipper_length_mm (VI Rank: 3) ----------------------
#> 
#>            |--------------- Probability ---------------|
#>      Value      Mean    Median     25th %    75th %
#>        185 0.5910941 0.3937772 0.30187167 0.9552354
#>        190 0.5645121 0.3637454 0.26707447 0.9433666
#>        197 0.4952256 0.2889080 0.18181057 0.8831356
#>        213 0.3292072 0.1116319 0.02213975 0.6602175
#>        221 0.3025870 0.1087965 0.01011283 0.6184659
#> 
#> -- bill_depth_mm (VI Rank: 4) --------------------------
#> 
#>            |--------------- Probability ---------------|
#>      Value      Mean    Median      25th %    75th %
#>       14.3 0.3560189 0.1540067 0.008714118 0.7293553
#>       15.6 0.3943511 0.1673597 0.045879915 0.7901435
#>       17.3 0.4651208 0.2253786 0.106483255 0.9155495
#>       18.7 0.5094000 0.2998837 0.133393326 0.9473630
#>       19.5 0.5276217 0.3456728 0.162150077 0.9524491
#> 
#> -- sex (VI Rank: 5) ------------------------------------
#> 
#>            |--------------- Probability ---------------|
#>      Value      Mean    Median     25th %    75th %
#>     female 0.4217927 0.1468877 0.02271961 0.8977797
#>       male 0.4742529 0.3384249 0.05131323 0.9453019
#> 
#> -- body_mass_g (VI Rank: 6) ----------------------------
#> 
#>            |--------------- Probability ---------------|
#>      Value      Mean    Median      25th %    75th %
#>       3300 0.4825696 0.2233228 0.121143276 0.9339286
#>       3550 0.4750676 0.2110828 0.101099039 0.9372102
#>       4050 0.4638683 0.2262442 0.076359495 0.9343941
#>       4775 0.4356364 0.2836723 0.032402907 0.8618333
#>       5440 0.4154148 0.2934098 0.009086565 0.8180788
#> 
#> -- year (VI Rank: 7) -----------------------------------
#> 
#>            |--------------- Probability ---------------|
#>      Value      Mean    Median     25th %    75th %
#>       2007 0.4324757 0.1512571 0.01277177 0.9280795
#>       2008 0.4436049 0.1834657 0.01505514 0.9434973
#>       2009 0.4511092 0.1993108 0.01914280 0.9450677
#> 
#>  Predicted probability for top 7 predictors

Created on 2024-03-22 with reprex v2.1.0

detect outcome type

Need a function that can inspect the formula and data inputs and reliably determine the outcome type.
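
One possible approach, assuming the outcome is whatever appears on the left-hand side of the formula:

detect_outcome_type <- function(formula, data) {
  lhs <- all.vars(stats::update(formula, . ~ 1))
  if (length(lhs) == 2 || inherits(data[[lhs[1]]], "Surv")) {
    return("survival")
  }
  y <- data[[lhs[1]]]
  if (is.factor(y) || is.character(y) || is.logical(y)) return("classification")
  if (is.numeric(y)) return("regression")
  stop("unsupported outcome type for '", lhs[1], "'", call. = FALSE)
}

detect_outcome_type(mpg ~ ., data = mtcars)                    # "regression"
detect_outcome_type(species ~ ., data = aorsf::penguins_orsf)  # "classification"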

Return predictions when `pred_horizon = 0`

Hey @bcjaeger,

I am just playing around with this package. It's great. But I am wondering whether it makes sense for predict.orsf_fit() to actually return predictions instead of throwing an error when pred_horizon = 0. I think it makes sense to return 0, 1, and 0 when pred_type is 'risk', 'surv', and 'chf', respectively. Do you agree?

It may seem senseless to request predictions at time zero, but, as an example, I was trying to make risk or survival curves by making predictions from time zero to time t at equally-spaced intervals and noticed the error. I think the values suggested above are both statistically valid and would make the predict function a little friendlier.

For comparison, I contributed the predict.flexsurvreg() function to the {flexsurv} package and predictions at time = 0 are valid and return the values suggested above.

pkgcheck results - master

Checks for aorsf (v0.0.0.9000)

git hash: 4b86e904

  • ✔️ Package name is available
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 97.1%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE
