The R package case-based-reasoning provides an R interface for case-based reasoning using machine learning methods.
# install the released version from CRAN
install.packages("CaseBasedReasoning")
# or install the development version from GitHub
install.packages("devtools")
devtools::install_github("sipemu/case-based-reasoning")
This R package provides two methods for case-based reasoning that make use of an endpoint:
- Linear, logistic, and Cox proportional hazards (CPH) regression
- Proximity and depth measures extracted from a fitted random forest (ranger package)
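To illustrate how a fitted regression model can induce a distance between cases, here is a minimal sketch using only the survival package. It assumes a weighted L1 distance with the absolute Cox coefficients as weights; this form and the helper `weighted_distance` are assumptions made for illustration and are not taken from the package's implementation.

```r
library(survival)

# Assumption: numeric covariates only, to keep the sketch short
fit <- coxph(Surv(futime, fustat) ~ age + ecog.ps, data = ovarian)
beta <- coef(fit)

# Hypothetical helper: weighted L1 distance between one query case and every
# learning case, using |beta_j| as the weight of covariate j
weighted_distance <- function(query, learning, beta) {
  vars <- names(beta)
  diffs <- sweep(as.matrix(learning[, vars]), 2, unlist(query[vars]))
  as.vector(abs(diffs) %*% abs(beta))
}

# Distances from the first case to all remaining cases
d <- weighted_distance(ovarian[1, ], ovarian[-1, ], beta)

# Row positions (within ovarian[-1, ]) of the three closest cases
head(order(d), 3)
```

Covariates with a larger estimated effect on the endpoint contribute more to the distance, which is the intuition behind using the model coefficients as weights.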
Besides searching for similar cases, the package offers some additional features:
- automatic validation of the critical variables between the query and the similar cases data set
- checking the proportional hazards assumption for the Cox model
- C++ functions for distance calculation
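How the package checks the proportional hazards assumption is not described here. One common way to perform such a check by hand, shown as a sketch, is `cox.zph` from the survival package, which tests the assumption using scaled Schoenfeld residuals. The 0.05 threshold below is an arbitrary choice for illustration.

```r
library(survival)

fit <- coxph(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps, data = ovarian)

# Test of the proportional hazards assumption, per covariate and globally
ph_test <- cox.zph(fit)
print(ph_test)

# Small p-values point to covariates whose effect may change over time
p_values <- ph_test$table[, "p"]
names(p_values)[p_values < 0.05]
```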
In the first example, we use the CPH model and the ovarian data set from the survival package. In the first step, we initialize the R6 data object.
library(tidyverse)
library(survival)
library(CaseBasedReasoning)
ovarian$resid.ds <- factor(ovarian$resid.ds)
ovarian$rx <- factor(ovarian$rx)
ovarian$ecog.ps <- factor(ovarian$ecog.ps)
# initialize R6 object
coxBeta <- CoxBetaModel$new(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps)
After the initialization, we may want to get for each case in the query data the most similar case from the learning data.
n <- nrow(ovarian)
# split into 80% learning data and 20% query data
trainID <- sample(1:n, floor(0.8 * n), replace = FALSE)
testID <- (1:n)[-trainID]
# fit model
ovarian[trainID, ] %>%
  coxBeta$fit()
# get the three most similar learning cases for each query case
ovarian[trainID, ] %>%
  coxBeta$get_similar_cases(queryData = ovarian[testID, ], k = 3) -> matchedData
You may then extract the similar cases and the verum data and combine them.
Note 1: In the initialization step, all cases with missing values in the variables of data and endPoint are dropped. Make sure you analyze missing values beforehand.
Note 2: The data.table returned from coxBeta$get_similar_cases has four additional columns:
- caseId: Maps the similar cases to cases in data. For example, if you chose k = 3, the first three elements of the column caseId will be 1 (the following three 2, and so on). These three cases are the three most similar cases to case 1 in the verum data.
- scDist: The calculated distance
- scCaseId: Grouping number of the query with matched data
- group: Indicates whether a row belongs to the matched or the query data
Alternatively, you may be interested in the distance matrix:
ovarian %>%
coxBeta$calc_distance_matrix() -> distMatrix
coxBeta$calc_distance_matrix() calculates the distance matrix between train and test data. When the test data is omitted, the distances between the observations in the train data are calculated. Rows are observations of the train data and columns are observations of the test data.
The distance matrix is saved internally in the CoxBetaModel object: coxBeta$distMat.
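To make the link between a distance matrix and similar cases concrete, the following sketch picks the k closest learning cases for each query column of a matrix laid out as described above (rows are train observations, columns are test observations). The matrix values and the helper `k_nearest` are made up for illustration and are not part of the package.

```r
# Toy distance matrix: rows = learning cases, columns = query cases
dist_mat <- matrix(
  c(0.9, 0.2, 0.5, 0.1,
    0.3, 0.8, 0.4, 0.7),
  nrow = 4,
  dimnames = list(paste0("train", 1:4), paste0("query", 1:2))
)

# Hypothetical helper: for each query column, return the k closest learning cases
k_nearest <- function(dist_mat, k) {
  do.call(rbind, lapply(seq_len(ncol(dist_mat)), function(j) {
    idx <- order(dist_mat[, j])[seq_len(k)]
    data.frame(
      query = colnames(dist_mat)[j],
      case = rownames(dist_mat)[idx],
      distance = unname(dist_mat[idx, j]),
      row.names = NULL
    )
  }))
}

k_nearest(dist_mat, k = 2)
```

The result is a long table with one row per (query, similar case) pair, ordered by increasing distance within each query.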
In the second example, we apply a random forest model to approximate the distance measure on the ovarian data. Two options for distance/similarity calculation are offered (details can be found in the documentation):
- Proximity: For two observations, the proportion of trees in which both end up in the same end node
- Depth: For two observations, the mean number of edges between their two end nodes over all trees
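Both measures can be sketched in base R on toy inputs. The helpers `proximity`, `path_to_root`, and `edge_distance` below are hypothetical and only illustrate the definitions above; they are not the package's C++ implementation. With ranger, a matrix of end node IDs like `nodes` can be obtained via `predict(fit, data, type = "terminalNodes")$predictions`.

```r
# Proximity: `nodes` is an n x num_trees matrix of end node IDs
proximity <- function(nodes) {
  n <- nrow(nodes)
  prox <- matrix(0, n, n)
  for (tr in seq_len(ncol(nodes))) {
    # add 1 wherever two observations share the end node in this tree
    prox <- prox + outer(nodes[, tr], nodes[, tr], "==")
  }
  prox / ncol(nodes)
}

# Depth within a single tree: number of edges on the path between two nodes.
# parent[i] is the parent of node i; the root has parent NA.
path_to_root <- function(node, parent) {
  path <- node
  while (!is.na(parent[node])) {
    node <- parent[node]
    path <- c(path, node)
  }
  path
}

edge_distance <- function(a, b, parent) {
  pa <- path_to_root(a, parent)
  pb <- path_to_root(b, parent)
  common <- intersect(pa, pb)[1]  # lowest common ancestor
  (match(common, pa) - 1) + (match(common, pb) - 1)
}

# Toy example: 3 observations, 2 trees
nodes <- cbind(tree1 = c(4, 4, 5), tree2 = c(3, 6, 6))
proximity(nodes)

# Toy tree: node 1 is the root, 2 and 3 are its children, 4 and 5 hang below 2
parent <- c(NA, 1, 1, 2, 2)
edge_distance(4, 5, parent)  # siblings
edge_distance(4, 3, parent)  # path runs through the root
```

Proximity is a similarity (1 means the same end node in every tree), so one way to turn it into a distance is `1 - proximity(nodes)`. For the depth measure, the single-tree edge distance would be averaged over all trees.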
Let's initialize the model object:
library(tidyverse)
library(survival)
library(CaseBasedReasoning)
ovarian$resid.ds <- factor(ovarian$resid.ds)
ovarian$rx <- factor(ovarian$rx)
ovarian$ecog.ps <- factor(ovarian$ecog.ps)
# initialize R6 object
rfSC <- RFModel$new(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps)
All observations with missing values in training and endpoint variables are dropped (na.omit) and the reduced data without missing values is stored internally. You get a text output on how many cases were dropped. Furthermore, character variables will be transformed to factor.
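The preprocessing described above can be reproduced by hand to see in advance which cases would be lost. The following base R sketch uses a made-up toy data set; it mirrors the description (na.omit, then character to factor) but is not the package's internal code.

```r
# Toy data (made up) with one missing value and one character column
toy <- data.frame(
  time   = c(5, 8, 12, 3),
  status = c(1, 0, 1, 1),
  age    = c(61, NA, 47, 55),
  stage  = c("I", "II", "I", "III"),
  stringsAsFactors = FALSE
)

# How many values are missing per variable?
colSums(is.na(toy))

# Drop incomplete rows and report how many were removed
complete <- na.omit(toy)
message(nrow(toy) - nrow(complete), " case(s) dropped")

# Convert character columns to factors
is_chr <- vapply(complete, is.character, logical(1))
complete[is_chr] <- lapply(complete[is_chr], factor)
str(complete)
```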
Optionally, you can adjust the random forest parameters when initializing the model. The settable parameters are documented in the ranger R package.
As described above, two distance measures are offered:
- Depth (default)
- Proximity
rfSC$set_dist(distMethod = "Proximity")
All following steps are the same as above:
Similar Cases:
n <- nrow(ovarian)
# split into 80% learning data and 20% query data
trainID <- sample(1:n, floor(0.8 * n), replace = FALSE)
testID <- (1:n)[-trainID]
# train model
ovarian[trainID, ] %>%
  rfSC$fit()
# get the three most similar learning cases for each query case
ovarian[trainID, ] %>%
  rfSC$get_similar_cases(queryData = ovarian[testID, ], k = 3) -> matchedData
Distance Matrix Calculation:
ovarian %>%
rfSC$calc_distance_matrix() -> distMatrix
- PD Dr. Jürgen Dippon, Institut für Stochastik und Anwendungen, Universität Stuttgart
- Dr. Simon Müller, TTI GmbH - MUON-STAT
- Dr. Peter Fritz
- Professor Dr. Friedel
The Robert Bosch Foundation funded this work. Special thanks go to Professor Dr. Friedel (Thoraxchirurgie - Klinik Schillerhöhe).
- Dippon et al. A statistical approach to case based reasoning, with application to breast cancer data (2002).
- Friedel et al. Postoperative Survival of Lung Cancer Patients: Are There Predictors beyond TNM? (2012).
- Englund and Verikas. A novel approach to estimate proximity in a random forest: An exploratory study.
- Stuart, E. et al. Matching methods for causal inference: Designing observational studies.
- Defossez et al. Temporal representation of care trajectories of cancer patients using data from a regional information system: an application in breast cancer.