Case Based Reasoning

The R package case-based-reasoning provides an R interface to case-based reasoning using machine learning methods.

Installation

CRAN

install.packages("CaseBasedReasoning")

GitHub

install.packages("devtools")
devtools::install_github("sipemu/case-based-reasoning")

Features

This R package provides two methods for case-based reasoning, both of which use an endpoint:

  • Linear, logistic, and Cox proportional hazards (CPH) regression

  • Proximity and Depth Measure extracted from a fitted random forest (ranger package)

Besides the functionality of searching for similar cases, we added some additional features:

  • automatic validation of the critical variables between the query and similar cases dataset

  • checking proportional hazard assumption for the Cox Model

  • C++-functions for distance calculation

Example: Cox Beta Model

Initialization

In the first example, we use the CPH model and the ovarian data set from the survival package. In the first step, we initialize the R6 data object.

library(tidyverse)
library(survival)
library(CaseBasedReasoning)
ovarian$resid.ds <- factor(ovarian$resid.ds)
ovarian$rx <- factor(ovarian$rx)
ovarian$ecog.ps <- factor(ovarian$ecog.ps)

# initialize R6 object
coxBeta <- CoxBetaModel$new(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps)

Similar Cases

After the initialization, we may want to retrieve, for each case in the query data, the most similar cases from the learning data.

n <- nrow(ovarian)
trainID <- sample(1:n, floor(0.8 * n), replace = FALSE)
testID <- (1:n)[-trainID]

# fit model on the learning data
ovarian[trainID, ] %>% 
  coxBeta$fit()
# get the k = 3 most similar learning cases for each query case
ovarian[trainID, ] %>%
  coxBeta$get_similar_cases(queryData = ovarian[testID, ], k = 3) -> matchedData

You may then extract the similar cases and the verum data and put them together.

Note 1: In the initialization step, all cases with missing values in the variables of data and endPoint are dropped. Make sure to run a missing value analysis beforehand.
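A missing value check before fitting might look like the following sketch. It uses only base R. The toy data frame `cases` and the helper `report_missing` are hypothetical names chosen for illustration and are not part of the package.

```r
# Hypothetical toy data with a few missing values (not taken from the package).
cases <- data.frame(
  futime = c(59, 115, 156, NA, 431),
  fustat = c(1, 1, 1, 0, 1),
  age    = c(72.3, 74.5, NA, 63.2, 50.3),
  rx     = factor(c(1, 1, 1, 2, 1))
)

# Count missing values per variable and list the incomplete rows.
report_missing <- function(data, vars) {
  subset <- data[, vars, drop = FALSE]
  list(
    perVariable = colSums(is.na(subset)),
    incomplete  = which(!complete.cases(subset))
  )
}

modelVars <- c("futime", "fustat", "age", "rx")
missingReport <- report_missing(cases, modelVars)

# Number of missing values in each model variable.
print(missingReport$perVariable)

# Row indices that a complete-case analysis (na.omit) would drop.
print(missingReport$incomplete)

# Complete-case subset, i.e. the rows that would remain after dropping.
completeCases <- cases[complete.cases(cases[, modelVars]), ]
```

Inspecting the incomplete rows first makes it easier to judge whether dropping them is acceptable or whether imputation should be considered.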

Note 2: The data.table returned from coxBeta$get_similar_cases has four additional columns:

  1. caseId: Use this column to map the similar cases to cases in data. For example, with k = 3 the first three elements of caseId are 1, the following three are 2, and so on. The first three rows are therefore the three cases most similar to case 1 in the verum data.
  2. scDist: The calculated distance
  3. scCaseId: Grouping number that links a query case to its matched data
  4. group: Indicates whether a row belongs to the matched data or to the query data
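The layout described in Note 2 can be illustrated with a small base-R sketch. It assumes a precomputed distance matrix with learning cases in rows and query cases in columns, and it mimics only the caseId and scDist columns. The function `k_nearest_cases`, the column `trainRow`, and the toy matrix are hypothetical and do not reflect the package's internal implementation.

```r
# Toy distance matrix: 5 learning cases (rows) x 2 query cases (columns).
distMat <- matrix(
  c(0.9, 0.2, 0.5, 0.1, 0.7,     # distances to query case 1
    0.3, 0.8, 0.4, 0.6, 0.05),   # distances to query case 2
  nrow = 5,
  dimnames = list(paste0("train", 1:5), paste0("query", 1:2))
)

# For each query case, keep the k learning cases with the smallest distance
# and stack the results so that caseId repeats k times per query case.
k_nearest_cases <- function(distMat, k) {
  perQuery <- lapply(seq_len(ncol(distMat)), function(j) {
    nearest <- order(distMat[, j])[seq_len(k)]
    data.frame(
      caseId   = j,                            # index of the query case
      trainRow = nearest,                      # hypothetical: row in learning data
      scDist   = unname(distMat[nearest, j])   # distance to the query case
    )
  })
  out <- do.call(rbind, perQuery)
  rownames(out) <- NULL
  out
}

matched <- k_nearest_cases(distMat, k = 3)
print(matched)
```

With k = 3, the first three rows carry caseId 1 and the next three carry caseId 2, sorted by increasing distance within each query case.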

Distance Matrix

Alternatively, you may be interested in the distance matrix:

ovarian %>%
  coxBeta$calc_distance_matrix() -> distMatrix

coxBeta$calc_distance_matrix() calculates the distance matrix between train and test data. When test data is omitted, the distances between the observations in the train data are calculated. Rows correspond to observations in train and columns to observations in test. The distance matrix is also saved internally in the CoxBetaModel object: coxBeta$distMat.
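To make the train-by-test orientation concrete, here is a minimal sketch that builds such a matrix by hand. It assumes a distance of the form d(x, y) = sum_j |beta_j| * |x_j - y_j|, where beta_j are fitted Cox coefficients. This weighting is an illustrative assumption, and the package may differ in details such as the handling of factor variables. The helper `weighted_distance` is a hypothetical name, and only numeric covariates are used to keep the example short.

```r
library(survival)

# Fixed split for reproducibility: first 20 rows are train, last 6 are test.
train <- ovarian[1:20, ]
test  <- ovarian[21:26, ]

# Fit a CPH model on numeric covariates only.
fit <- coxph(Surv(futime, fustat) ~ age + resid.ds, data = train)
weights <- abs(coef(fit))

# ASSUMPTION: each covariate difference is weighted by |beta_j|.
# Returns a matrix with one row per train case and one column per test case.
weighted_distance <- function(trainX, testX, weights) {
  trainX <- as.matrix(trainX)
  testX  <- as.matrix(testX)
  distMat <- matrix(0, nrow(trainX), nrow(testX))
  for (j in seq_along(weights)) {
    distMat <- distMat +
      weights[j] * abs(outer(trainX[, j], testX[, j], "-"))
  }
  distMat
}

vars <- names(weights)
distMat <- weighted_distance(train[, vars], test[, vars], weights)
dim(distMat)  # rows: train cases, columns: test cases

# Omitting the test data corresponds to comparing train with itself.
selfDist <- weighted_distance(train[, vars], train[, vars], weights)
```

Covariates with a larger absolute coefficient contribute more to the distance, so cases that differ in prognostically important variables end up further apart.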

Example: RandomForest Model

Initialization

In the second example, we apply a RandomForest model for approximating the distance measure on the ovarian data. Two possibilities for distance/similarity calculation are offered (details can be found in the documentation):

  • Proximity: When comparing two observations, the proportion of trees in which both observations end up in the same end node is calculated

  • Depth: When comparing two observations, the number of edges between their two end nodes is calculated for each tree and averaged over all trees
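Both measures can be sketched in base R on toy inputs. The matrix of end node IDs, the parent-vector encoding of a tree, and the helpers `proximity_matrix`, `path_to_root`, and `edge_distance` are hypothetical and serve only to illustrate the two definitions. They are not the package's C++ implementation.

```r
# Hypothetical end node IDs: 4 observations (rows) x 3 trees (columns).
# With ranger, a matrix like this can be obtained from
# predict(fit, data, type = "terminalNodes")$predictions.
nodeIds <- matrix(
  c(3, 3, 4, 5,    # tree 1
    2, 6, 6, 6,    # tree 2
    7, 7, 7, 8),   # tree 3
  nrow = 4
)

# Proximity: share of trees in which two observations share an end node.
proximity_matrix <- function(nodeIds) {
  n <- nrow(nodeIds)
  prox <- matrix(0, n, n)
  for (tr in seq_len(ncol(nodeIds))) {
    prox <- prox + outer(nodeIds[, tr], nodeIds[, tr], "==")
  }
  prox / ncol(nodeIds)
}

prox <- proximity_matrix(nodeIds)

# Depth, shown for a single tree: number of edges between two nodes.
# parent[i] is the parent of node i; the root has parent NA.
path_to_root <- function(node, parent) {
  path <- node
  while (!is.na(parent[node])) {
    node <- parent[node]
    path <- c(path, node)
  }
  path
}

edge_distance <- function(a, b, parent) {
  pathA <- path_to_root(a, parent)
  pathB <- path_to_root(b, parent)
  common <- intersect(pathA, pathB)[1]   # lowest common ancestor
  (match(common, pathA) - 1) + (match(common, pathB) - 1)
}

# Toy tree: node 1 is the root, 2 and 3 are its children,
# 4 and 5 are the children of node 2.
parent <- c(NA, 1, 1, 2, 2)
siblings <- edge_distance(4, 5, parent)   # path 4 -> 2 -> 5
cousins  <- edge_distance(4, 3, parent)   # path 4 -> 2 -> 1 -> 3
```

For a whole forest, the depth measure would average `edge_distance` over all trees, using each tree's own end nodes for the two observations.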

Let's initialize the model object:

library(tidyverse)
library(survival)
library(CaseBasedReasoning)
ovarian$resid.ds <- factor(ovarian$resid.ds)
ovarian$rx <- factor(ovarian$rx)
ovarian$ecog.ps <- factor(ovarian$ecog.ps)

# initialize R6 object
rfSC <- RFModel$new(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps)

All observations with missing values in training and endpoint variables are dropped (na.omit), and the reduced data without missing values is stored internally. A text message reports how many cases were dropped. Furthermore, character variables are converted to factors.

Optionally, you can adjust RandomForest parameters when initializing the model. The available parameters are documented in the ranger R package.

As described, two distance measures are offered:

  • Depth (Default)
  • Proximity

rfSC$set_dist(distMethod = "Proximity")

All following steps are the same as above:

Similar Cases:

n <- nrow(ovarian)
trainID <- sample(1:n, floor(0.8 * n), replace = FALSE)
testID <- (1:n)[-trainID]

# train model 
ovarian[trainID, ] %>% 
  rfSC$fit()
# get similar cases
ovarian[trainID, ] %>%
  rfSC$get_similar_cases(queryData = ovarian[testID, ], k = 3) -> matchedData

Distance Matrix Calculation:

ovarian %>%
  rfSC$calc_distance_matrix() -> distMatrix

Contribution

Responsible for Mathematical Model Development and Programming

Medical Advisor

  • Dr. Peter Fritz

  • Professor Dr. Friedel

Funding

Robert-Bosch-Stiftung

The Robert Bosch Foundation funded this work. Special thanks go to Professor Dr. Friedel (Thoraxchirurgie - Klinik Schillerhöhe).
