Cross-Validation for Model Selection
Authors: Ludvig R. Olsen (
[email protected] ), Hugh Benjamin Zachariae
License:
MIT
Started: October
2016
R package: Cross-validate one or multiple regression or classification
models and get relevant evaluation metrics in a tidy format. Validate
the best model on a test set and compare it to a baseline evaluation.
Alternatively, evaluate predictions from an external model. Currently
supports regression ('gaussian'
), binary classification
('binomial'
), and (some functions only) multiclass classification
('multinomial'
).
Main functions:
cross_validate()
cross_validate_fn()
validate()
evaluate()
baseline()
combine_predictors()
cv_plot()
select_metrics()
reconstruct_formulas()
- cvms
- Examples
Originally, cvms
only provided the option to cross-validate Gaussian
and binomial regression models, fitting the models internally with the
lm()
, lmer()
, glm()
, and glmer()
functions. The
cross_validate()
function has thus been designed specifically to work
with those functions.
To allow cross-validation of custom model functions like support-vector
machines, neural networks, etc., the cross_validate_fn()
function has
been added. You provide a model function and (if defaults fail) a
predict function, and it does the rest (see examples below).
-
Fixes bug in
evaluate()
, when used on a grouped data frame. The row order in the output was not guaranteed to fit with the grouping keys. If you have usedevaluate()
on a grouped data frame, please rerun to make sure your results are correct! (30th of November 2019) -
cross_validate_fn()
is added. Cross-validate custom model functions. -
In
evaluate()
, whentype
ismultinomial
, the output is now a single tibble. TheClass Level Results
are included as a nested tibble. -
Adds
'multinomial'
family tobaseline()
andevaluate()
. -
evaluate()
is added. Evaluate your model’s predictions with the same metrics as used incross_validate()
. -
AUC calculation has changed. Now explicitly sets the direction in
pROC::roc
. (27th of May 2019) -
Argument
positive
now defaults to2
. If a dependent variable has the values 0 and 1, 1 is now the default positive class, as that’s the second smallest value. If the dependent variable is of typecharacter
, it’s in alphabetical order.
CRAN:
install.packages(“cvms”)
Development version:
install.packages(“devtools”)
devtools::install_github(“LudvigOlsen/groupdata2”)
devtools::install_github(“LudvigOlsen/cvms”)
library(cvms)
library(groupdata2) # fold() partition()
library(knitr) # kable()
library(dplyr) # %>% arrange()
library(ggplot2)
The dataset participant.scores
comes with cvms.
data <- participant.scores
Create a grouping factor for subsetting of folds using
groupdata2::fold()
. Order the dataset by the folds.
# Set seed for reproducibility
set.seed(7)
# Fold data
data <- fold(data, k = 4,
cat_col = 'diagnosis',
id_col = 'participant') %>%
arrange(.folds)
# Show first 15 rows of data
data %>% head(15) %>% kable()
participant | age | diagnosis | score | session | .folds |
---|---|---|---|---|---|
9 | 34 | 0 | 33 | 1 | 1 |
9 | 34 | 0 | 53 | 2 | 1 |
9 | 34 | 0 | 66 | 3 | 1 |
8 | 21 | 1 | 16 | 1 | 1 |
8 | 21 | 1 | 32 | 2 | 1 |
8 | 21 | 1 | 44 | 3 | 1 |
2 | 23 | 0 | 24 | 1 | 2 |
2 | 23 | 0 | 40 | 2 | 2 |
2 | 23 | 0 | 67 | 3 | 2 |
1 | 20 | 1 | 10 | 1 | 2 |
1 | 20 | 1 | 24 | 2 | 2 |
1 | 20 | 1 | 45 | 3 | 2 |
6 | 31 | 1 | 14 | 1 | 2 |
6 | 31 | 1 | 25 | 2 | 2 |
6 | 31 | 1 | 30 | 3 | 2 |
CV1 <- cross_validate(data, "score~diagnosis",
fold_cols = '.folds',
family = 'gaussian',
REML = FALSE)
# Show results
CV1
#> # A tibble: 1 x 20
#> RMSE MAE r2m r2c AIC AICc BIC Predictions Results Coefficients
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <list>
#> 1 16.4 13.8 0.271 0.271 195. 196. 198. <tibble [3… <tibbl… <tibble [8 …
#> # … with 10 more variables: Folds <int>, `Fold Columns` <int>, `Convergence
#> # Warnings` <int>, `Singular Fit Messages` <int>, `Other Warnings` <int>,
#> # `Warnings and Messages` <list>, Family <chr>, Link <chr>, Dependent <chr>,
#> # Fixed <chr>
# Let's take a closer look at the different parts of the output
# Results metrics
CV1 %>% select_metrics() %>% kable()
RMSE | MAE | r2m | r2c | AIC | AICc | BIC | Dependent | Fixed |
---|---|---|---|---|---|---|---|---|
16.35261 | 13.75772 | 0.270991 | 0.270991 | 194.6218 | 195.9276 | 197.9556 | score | diagnosis |
# Nested predictions
# Note that [[1]] picks predictions for the first row
CV1$Predictions[[1]] %>% head() %>% kable()
Fold Column | Fold | Target | Prediction |
---|---|---|---|
.folds | 1 | 33 | 51.00000 |
.folds | 1 | 53 | 51.00000 |
.folds | 1 | 66 | 51.00000 |
.folds | 1 | 16 | 30.66667 |
.folds | 1 | 32 | 30.66667 |
.folds | 1 | 44 | 30.66667 |
# Nested results from the different folds
CV1$Results[[1]] %>% kable()
Fold Column | Fold | RMSE | MAE | r2m | r2c | AIC | AICc | BIC |
---|---|---|---|---|---|---|---|---|
.folds | 1 | 12.56760 | 10.72222 | 0.2439198 | 0.2439198 | 209.9622 | 211.1622 | 213.4963 |
.folds | 2 | 16.60767 | 14.77778 | 0.2525524 | 0.2525524 | 182.8739 | 184.2857 | 186.0075 |
.folds | 3 | 15.97355 | 12.87037 | 0.2306104 | 0.2306104 | 207.9074 | 209.1074 | 211.4416 |
.folds | 4 | 20.26162 | 16.66049 | 0.3568816 | 0.3568816 | 177.7436 | 179.1554 | 180.8772 |
# Nested model coefficients
# Note that you have the full p-values,
# but kable() only shows a certain number of digits
CV1$Coefficients[[1]] %>% kable()
term | estimate | std.error | statistic | p.value | Fold | Fold Column |
---|---|---|---|---|---|---|
(Intercept) | 51.00000 | 5.901264 | 8.642216 | 0.0000000 | 1 | .folds |
diagnosis | -20.33333 | 7.464574 | -2.723978 | 0.0123925 | 1 | .folds |
(Intercept) | 53.33333 | 5.718886 | 9.325826 | 0.0000000 | 2 | .folds |
diagnosis | -19.66667 | 7.565375 | -2.599563 | 0.0176016 | 2 | .folds |
(Intercept) | 49.77778 | 5.653977 | 8.804030 | 0.0000000 | 3 | .folds |
diagnosis | -18.77778 | 7.151778 | -2.625610 | 0.0154426 | 3 | .folds |
(Intercept) | 49.55556 | 5.061304 | 9.791065 | 0.0000000 | 4 | .folds |
diagnosis | -22.30556 | 6.695476 | -3.331437 | 0.0035077 | 4 | .folds |
# Additional information about the model
# and the training process
CV1 %>% select(11:17) %>% kable()
Folds | Fold Columns | Convergence Warnings | Singular Fit Messages | Other Warnings | Warnings and Messages | Family |
---|---|---|---|---|---|---|
4 | 1 | 0 | 0 | 0 | list(Fold Column = character(0), Fold = integer(0), Type = character(0), Message = character(0)) |
gaussian |
CV2 <- cross_validate(data, "diagnosis~score",
fold_cols = '.folds',
family = 'binomial')
#> Registered S3 methods overwritten by 'survival':
#> method from
#> nobs.coxph MuMIn
#> nobs.survreg MuMIn
# Show results
CV2
#> # A tibble: 1 x 28
#> `Balanced Accur… F1 Sensitivity Specificity `Pos Pred Value`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.736 0.821 0.889 0.583 0.762
#> # … with 23 more variables: `Neg Pred Value` <dbl>, AUC <dbl>, `Lower
#> # CI` <dbl>, `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>, `Detection
#> # Rate` <dbl>, `Detection Prevalence` <dbl>, Prevalence <dbl>,
#> # Predictions <list>, ROC <list>, `Confusion Matrix` <list>,
#> # Coefficients <list>, Folds <int>, `Fold Columns` <int>, `Convergence
#> # Warnings` <int>, `Singular Fit Messages` <int>, `Other Warnings` <int>,
#> # `Warnings and Messages` <list>, Family <chr>, Link <chr>, Dependent <chr>,
#> # Fixed <chr>
# Let's take a closer look at the different parts of the output
# We won't repeat the parts too similar to those in Gaussian
# Results metrics
CV2 %>% select(1:9) %>% kable()
Balanced Accuracy | F1 | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | AUC | Lower CI | Upper CI |
---|---|---|---|---|---|---|---|---|
0.7361111 | 0.8205128 | 0.8888889 | 0.5833333 | 0.7619048 | 0.7777778 | 0.7685185 | 0.5962701 | 0.9407669 |
CV2 %>% select(10:14) %>% kable()
Kappa | MCC | Detection Rate | Detection Prevalence | Prevalence |
---|---|---|---|---|
0.4927536 | 0.5048268 | 0.5333333 | 0.7 | 0.6 |
# ROC curve info
CV2$ROC[[1]] %>% head() %>% kable()
Sensitivities | Specificities |
---|---|
1.0000000 | 0.0000000 |
1.0000000 | 0.0833333 |
0.9444444 | 0.0833333 |
0.9444444 | 0.1666667 |
0.9444444 | 0.2500000 |
0.8888889 | 0.2500000 |
# Confusion matrix
CV2$`Confusion Matrix`[[1]] %>% kable()
Fold Column | Prediction | Target | Pos_0 | Pos_1 | N |
---|---|---|---|---|---|
.folds | 0 | 0 | TP | TN | 7 |
.folds | 1 | 0 | FN | FP | 5 |
.folds | 0 | 1 | FP | FN | 2 |
.folds | 1 | 1 | TN | TP | 16 |
models <- c("score~diagnosis", "score~age")
mixed_models <- c("score~diagnosis+(1|session)", "score~age+(1|session)")
CV3 <- cross_validate(data, models,
fold_cols = '.folds',
family = 'gaussian',
REML = FALSE)
# Show results
CV3
#> # A tibble: 2 x 20
#> RMSE MAE r2m r2c AIC AICc BIC Predictions Results Coefficients
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <list>
#> 1 16.4 13.8 0.271 0.271 195. 196. 198. <tibble [3… <tibbl… <tibble [8 …
#> 2 22.4 18.9 0.0338 0.0338 201. 202. 204. <tibble [3… <tibbl… <tibble [8 …
#> # … with 10 more variables: Folds <int>, `Fold Columns` <int>, `Convergence
#> # Warnings` <int>, `Singular Fit Messages` <int>, `Other Warnings` <int>,
#> # `Warnings and Messages` <list>, Family <chr>, Link <chr>, Dependent <chr>,
#> # Fixed <chr>
CV4 <- cross_validate(data, mixed_models,
fold_cols = '.folds',
family = 'gaussian',
REML = FALSE)
# Show results
CV4
#> # A tibble: 2 x 21
#> RMSE MAE r2m r2c AIC AICc BIC Predictions Results Coefficients
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <list>
#> 1 7.95 6.41 0.290 0.811 176. 178. 180. <tibble [3… <tibbl… <tibble [8 …
#> 2 17.5 16.2 0.0366 0.526 194. 196. 198. <tibble [3… <tibbl… <tibble [8 …
#> # … with 11 more variables: Folds <int>, `Fold Columns` <int>, `Convergence
#> # Warnings` <int>, `Singular Fit Messages` <int>, `Other Warnings` <int>,
#> # `Warnings and Messages` <list>, Family <chr>, Link <chr>, Dependent <chr>,
#> # Fixed <chr>, Random <chr>
Let’s first add some extra fold columns. We will use the num_fold_cols
argument to add 3 unique fold columns. We tell fold()
to keep the
existing fold column and simply add three extra columns. We could also
choose to remove the existing fold column, if for instance we were
changing the number of folds (k). Note, that the original fold column
will be renamed to “.folds_1”.
# Set seed for reproducibility
set.seed(2)
# Fold data
data <- fold(data, k = 4,
cat_col = 'diagnosis',
id_col = 'participant',
num_fold_cols = 3,
handle_existing_fold_cols = "keep")
# Show first 15 rows of data
data %>% head(10) %>% kable()
participant | age | diagnosis | score | session | .folds_1 | .folds_2 | .folds_3 | .folds_4 |
---|---|---|---|---|---|---|---|---|
10 | 32 | 0 | 29 | 1 | 4 | 4 | 3 | 1 |
10 | 32 | 0 | 55 | 2 | 4 | 4 | 3 | 1 |
10 | 32 | 0 | 81 | 3 | 4 | 4 | 3 | 1 |
2 | 23 | 0 | 24 | 1 | 2 | 3 | 1 | 2 |
2 | 23 | 0 | 40 | 2 | 2 | 3 | 1 | 2 |
2 | 23 | 0 | 67 | 3 | 2 | 3 | 1 | 2 |
4 | 21 | 0 | 35 | 1 | 3 | 2 | 4 | 4 |
4 | 21 | 0 | 50 | 2 | 3 | 2 | 4 | 4 |
4 | 21 | 0 | 78 | 3 | 3 | 2 | 4 | 4 |
9 | 34 | 0 | 33 | 1 | 1 | 1 | 2 | 3 |
CV5 <- cross_validate(data, "diagnosis ~ score",
fold_cols = paste0(".folds_", 1:4),
family = 'binomial',
REML = FALSE)
# Show results
CV5
#> # A tibble: 1 x 29
#> `Balanced Accur… F1 Sensitivity Specificity `Pos Pred Value`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.729 0.813 0.875 0.583 0.759
#> # … with 24 more variables: `Neg Pred Value` <dbl>, AUC <dbl>, `Lower
#> # CI` <dbl>, `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>, `Detection
#> # Rate` <dbl>, `Detection Prevalence` <dbl>, Prevalence <dbl>,
#> # Predictions <list>, ROC <list>, Results <list>, `Confusion Matrix` <list>,
#> # Coefficients <list>, Folds <int>, `Fold Columns` <int>, `Convergence
#> # Warnings` <int>, `Singular Fit Messages` <int>, `Other Warnings` <int>,
#> # `Warnings and Messages` <list>, Family <chr>, Link <chr>, Dependent <chr>,
#> # Fixed <chr>
# The binomial output now has a nested 'Results' tibble
# Let's see a subset of the columns
CV5$Results[[1]] %>% select(1:8) %>% kable()
Fold Column | Balanced Accuracy | F1 | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | AUC |
---|---|---|---|---|---|---|---|
.folds_1 | 0.7361111 | 0.8205128 | 0.8888889 | 0.5833333 | 0.7619048 | 0.7777778 | 0.7685185 |
.folds_2 | 0.7361111 | 0.8205128 | 0.8888889 | 0.5833333 | 0.7619048 | 0.7777778 | 0.7777778 |
.folds_3 | 0.7083333 | 0.7894737 | 0.8333333 | 0.5833333 | 0.7500000 | 0.7000000 | 0.7476852 |
.folds_4 | 0.7361111 | 0.8205128 | 0.8888889 | 0.5833333 | 0.7619048 | 0.7777778 | 0.7662037 |
cross_validate_fn()
works with regression (gaussian
), binary
classification (binomial
), and multiclass classification
(multinomial
).
Let’s cross-validate a support-vector machine using the svm()
function
from the e1071
package. First, we will create a model function. You
can do anything you want in it, as long as it takes the arguments
train_data
and formula
and returns the fitted model object.
# Create model function
#
# train_data : tibble with the training data
# formula : a formula object
svm_model_fn <- function(train_data, formula){
# Note that `formula` must be specified first
# when calling svm(), otherwise it fails
e1071::svm(formula = formula,
data = train_data,
kernel = "linear",
type = "C-classification")
}
For the svm()
function, the default predict function and settings
within cross_validate_fn()
works, so we don’t have to specify a
predict function. In many cases, it’s probably safer to supply a predict
function anyway, so you’re sure everything is correct. We will see how
in the naive Bayes example below, but first, let’s cross-validate the
model function. Note, that some of the arguments have changed names
(models -> formulas
, family -> type
).
# Cross-validate svm_model_fn
CV6 <- cross_validate_fn(data = data,
model_fn = svm_model_fn,
formulas = c("diagnosis~score", "diagnosis~age"),
fold_cols = '.folds_1',
type = 'binomial')
CV6
#> # A tibble: 2 x 26
#> `Balanced Accur… F1 Sensitivity Specificity `Pos Pred Value`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.694 0.80 0.889 0.5 0.727
#> 2 0.417 0.667 0.833 0 0.556
#> # … with 21 more variables: `Neg Pred Value` <dbl>, AUC <dbl>, `Lower
#> # CI` <dbl>, `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>, `Detection
#> # Rate` <dbl>, `Detection Prevalence` <dbl>, Prevalence <dbl>,
#> # Predictions <list>, ROC <list>, `Confusion Matrix` <list>,
#> # Coefficients <list>, Folds <int>, `Fold Columns` <int>, `Convergence
#> # Warnings` <int>, `Other Warnings` <int>, `Warnings and Messages` <list>,
#> # Family <chr>, Dependent <chr>, Fixed <chr>
The naive Bayes classifier requires us to supply a predict function, so we will go through that next. First, let’s create the model function.
# Create model function
#
# train_data : tibble with the training data
# formula : a formula object
nb_model_fn <- function(train_data, formula){
e1071::naiveBayes(formula = formula,
data = train_data)
}
Now, we will create a predict function. This will usually wrap
stats::predict()
and just make sure, the predictions have the correct
format. When type
is binomial
, the predictions should be a vector,
or a one-column matrix / data frame, with the probabilities of the
second class (alphabetically). That is, if we have the classes 0
and
1
, it should be the probabilities of the observations being in class
1
. The help file, ?cross_validate_fn
, describes the formats for the
other types (gaussian
and multinomial
).
The predict function should take the arguments test_data
, model
, and
formula
. You do not need to use the formula
within your function.
# Create predict function
#
# test_data : tibble with the test data
# model : fitted model object
# formula : a formula object
nb_predict_fn <- function(test_data, model, formula){
stats::predict(object = model, newdata = test_data,
type = "raw", allow.new.levels = TRUE)[,2]
}
With both functions specified, we are ready to cross-validate our naive Bayes classifier.
CV7 <- cross_validate_fn(data,
model_fn = nb_model_fn,
formulas = c("diagnosis~score", "diagnosis~age"),
type = 'binomial',
predict_fn = nb_predict_fn,
fold_cols = '.folds_1')
CV7
#> # A tibble: 2 x 26
#> `Balanced Accur… F1 Sensitivity Specificity `Pos Pred Value`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.736 0.821 0.889 0.583 0.762
#> 2 0.25 0.462 0.5 0 0.429
#> # … with 21 more variables: `Neg Pred Value` <dbl>, AUC <dbl>, `Lower
#> # CI` <dbl>, `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>, `Detection
#> # Rate` <dbl>, `Detection Prevalence` <dbl>, Prevalence <dbl>,
#> # Predictions <list>, ROC <list>, `Confusion Matrix` <list>,
#> # Coefficients <list>, Folds <int>, `Fold Columns` <int>, `Convergence
#> # Warnings` <int>, `Other Warnings` <int>, `Warnings and Messages` <list>,
#> # Family <chr>, Dependent <chr>, Fixed <chr>
Evaluate predictions from a model trained outside cvms. Works with
regression (gaussian
), binary classification (binomial
), and
multiclass classification (multinomial
). The following is an example
of multinomial evaluation.
Create a dataset with 3 predictors and a target column. Partition it
with groupdata2::partition()
to create a training set and a validation
set. multiclass_probability_tibble()
is a simple helper function for
generating random tibbles.
# Set seed
set.seed(1)
# Create class names
class_names <- paste0("class_", 1:4)
# Create random dataset with 100 observations
# Partition into training set (75%) and test set (25%)
multiclass_partitions <- multiclass_probability_tibble(
num_classes = 3, # Here, number of predictors
num_observations = 100,
apply_softmax = FALSE,
FUN = rnorm,
class_name = "predictor_") %>%
dplyr::mutate(class = sample(
class_names,
size = 100,
replace = TRUE)) %>%
partition(p = 0.75,
cat_col = "class")
# Extract partitions
multiclass_train_set <- multiclass_partitions[[1]]
multiclass_test_set <- multiclass_partitions[[2]]
multiclass_test_set
#> # A tibble: 26 x 4
#> predictor_1 predictor_2 predictor_3 class
#> <dbl> <dbl> <dbl> <chr>
#> 1 1.60 0.158 -0.331 class_1
#> 2 -1.99 -0.180 -0.341 class_1
#> 3 0.418 -0.324 0.263 class_1
#> 4 0.398 0.450 0.136 class_1
#> 5 0.0743 1.03 -1.32 class_1
#> 6 0.738 0.910 0.541 class_2
#> 7 0.576 0.384 -0.0134 class_2
#> 8 -0.305 1.68 0.510 class_2
#> 9 -0.0449 -0.393 1.52 class_2
#> 10 0.557 -0.464 -0.879 class_2
#> # … with 16 more rows
Train multinomial model using the nnet
package and get the predicted
probabilities.
# Train multinomial model
multiclass_model <- nnet::multinom(
"class ~ predictor_1 + predictor_2 + predictor_3",
data = multiclass_train_set)
#> # weights: 20 (12 variable)
#> initial value 102.585783
#> iter 10 value 98.124010
#> final value 98.114250
#> converged
# Predict the targets in the test set
predictions <- predict(multiclass_model,
multiclass_test_set,
type = "probs") %>%
dplyr::as_tibble()
# Add the targets
predictions[["target"]] <- multiclass_test_set[["class"]]
head(predictions, 10)
#> # A tibble: 10 x 5
#> class_1 class_2 class_3 class_4 target
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 0.243 0.214 0.304 0.239 class_1
#> 2 0.136 0.371 0.234 0.259 class_1
#> 3 0.230 0.276 0.264 0.230 class_1
#> 4 0.194 0.218 0.262 0.326 class_1
#> 5 0.144 0.215 0.302 0.339 class_1
#> 6 0.186 0.166 0.241 0.407 class_2
#> 7 0.201 0.222 0.272 0.305 class_2
#> 8 0.117 0.131 0.195 0.557 class_2
#> 9 0.237 0.264 0.215 0.284 class_2
#> 10 0.216 0.310 0.303 0.171 class_2
Perform the evaluation. This will create one-vs-all binomial evaluations and summarize the results.
# Evaluate predictions
ev <- evaluate(data = predictions,
target_col = "target",
prediction_cols = class_names,
type = "multinomial")
ev
#> # A tibble: 1 x 18
#> `Overall Accura… `Balanced Accur… F1 Sensitivity Specificity
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.154 0.427 NaN 0.143 0.712
#> # … with 13 more variables: `Pos Pred Value` <dbl>, `Neg Pred Value` <dbl>,
#> # AUC <dbl>, `Lower CI` <dbl>, `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>,
#> # `Detection Rate` <dbl>, `Detection Prevalence` <dbl>, Prevalence <dbl>,
#> # `Class Level Results` <list>, Predictions <list>, `Confusion Matrix` <list>
The class level results (i.e., the one-vs-all evaluations) are also included, and would usually be reported alongside the above results.
ev$`Class Level Results`
#> [[1]]
#> # A tibble: 4 x 18
#> Class `Balanced Accur… F1 Sensitivity Specificity `Pos Pred Value`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 clas… 0.476 NaN 0 0.952 0
#> 2 clas… 0.380 0.211 0.286 0.474 0.167
#> 3 clas… 0.474 NaN 0 0.947 0
#> 4 clas… 0.380 0.211 0.286 0.474 0.167
#> # … with 12 more variables: `Neg Pred Value` <dbl>, AUC <dbl>, `Lower
#> # CI` <dbl>, `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>, `Detection
#> # Rate` <dbl>, `Detection Prevalence` <dbl>, Prevalence <dbl>, Support <int>,
#> # ROC <list>, `Confusion Matrix` <list>
Create baseline evaluations of a test set.
Approach: The baseline model (y ~ 1), where 1 is simply the intercept (i.e. mean of y), is fitted on n random subsets of the training set and evaluated on the test set. We also perform an evaluation of the model fitted on the entire training set.
Start by partitioning the dataset.
# Set seed for reproducibility
set.seed(1)
# Partition the dataset
partitions <- groupdata2::partition(participant.scores,
p = 0.7,
cat_col = 'diagnosis',
id_col = 'participant',
list_out = TRUE)
train_set <- partitions[[1]]
test_set <- partitions[[2]]
Create the baseline evaluations:
baseline(test_data = test_set, train_data = train_set,
n = 100, dependent_col = "score", family = "gaussian")
#> $summarized_metrics
#> # A tibble: 9 x 9
#> Measure RMSE MAE r2m r2c AIC AICc BIC `Training Rows`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mean 19.7 15.8 0 0 87.0 89.5 87.4 9.63
#> 2 Median 19.2 15.5 0 0 83.3 85.3 83.7 9
#> 3 SD 1.05 0.759 0 0 28.9 27.6 29.6 3.22
#> 4 IQR 1.16 0.264 0 0 45.9 44.3 47.0 5
#> 5 Max 24.1 19.4 0 0 137. 138. 138. 15
#> 6 Min 18.9 15.5 0 0 42.0 48.0 41.2 5
#> 7 NAs 0 0 0 0 0 0 0 0
#> 8 INFs 0 0 0 0 0 0 0 0
#> 9 All_rows 19.1 15.5 0 0 161. 162. 163. 18
#>
#> $random_evaluations
#> # A tibble: 100 x 13
#> RMSE MAE r2m r2c AIC AICc BIC Predictions Coefficients
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
#> 1 20.0 16.3 0 0 72.5 74.9 72.7 <tibble [1… <tibble [1 …
#> 2 19.0 15.5 0 0 137. 138. 138. <tibble [1… <tibble [1 …
#> 3 20.2 15.7 0 0 61.3 64.3 61.2 <tibble [1… <tibble [1 …
#> 4 20.0 15.7 0 0 97.7 99.2 98.5 <tibble [1… <tibble [1 …
#> 5 19.3 15.6 0 0 73.3 75.7 73.5 <tibble [1… <tibble [1 …
#> 6 20.4 15.9 0 0 44.4 50.4 43.6 <tibble [1… <tibble [1 …
#> 7 19.0 15.5 0 0 118. 120. 119. <tibble [1… <tibble [1 …
#> 8 19.4 15.5 0 0 93.3 95.1 94.0 <tibble [1… <tibble [1 …
#> 9 20.7 16.2 0 0 71.2 73.6 71.3 <tibble [1… <tibble [1 …
#> 10 20.8 17.1 0 0 43.7 49.7 42.9 <tibble [1… <tibble [1 …
#> # … with 90 more rows, and 4 more variables: `Training Rows` <int>,
#> # Family <chr>, Dependent <chr>, Fixed <chr>
Approach: n random sets of predictions are evaluated against the dependent variable in the test set. We also evaluate a set of all 0s and a set of all 1s.
Create the baseline evaluations:
baseline(test_data = test_set, n = 100,
dependent_col = "diagnosis", family = "binomial")
#> $summarized_metrics
#> # A tibble: 10 x 15
#> Measure `Balanced Accur… F1 Sensitivity Specificity `Pos Pred Value`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mean 0.502 0.495 0.478 0.525 0.498
#> 2 Median 0.5 0.5 0.5 0.5 0.500
#> 3 SD 0.147 0.159 0.215 0.210 0.194
#> 4 IQR 0.167 0.252 0.333 0.333 0.200
#> 5 Max 0.833 0.833 0.833 1 1
#> 6 Min 0.167 0.182 0 0 0
#> 7 NAs 0 4 0 0 0
#> 8 INFs 0 0 0 0 0
#> 9 All_0 0.5 NA 0 1 NaN
#> 10 All_1 0.5 0.667 1 0 0.5
#> # … with 9 more variables: `Neg Pred Value` <dbl>, AUC <dbl>, `Lower CI` <dbl>,
#> # `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>, `Detection Rate` <dbl>,
#> # `Detection Prevalence` <dbl>, Prevalence <dbl>
#>
#> $random_evaluations
#> # A tibble: 100 x 19
#> `Balanced Accur… F1 Sensitivity Specificity `Pos Pred Value`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.417 0.364 0.333 0.5 0.4
#> 2 0.5 0.5 0.5 0.5 0.5
#> 3 0.417 0.364 0.333 0.5 0.4
#> 4 0.667 0.6 0.5 0.833 0.75
#> 5 0.583 0.667 0.833 0.333 0.556
#> 6 0.667 0.6 0.5 0.833 0.75
#> 7 0.25 0.308 0.333 0.167 0.286
#> 8 0.5 0.4 0.333 0.667 0.500
#> 9 0.25 0.182 0.167 0.333 0.20
#> 10 0.417 0.222 0.167 0.667 0.333
#> # … with 90 more rows, and 14 more variables: `Neg Pred Value` <dbl>,
#> # AUC <dbl>, `Lower CI` <dbl>, `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>,
#> # `Detection Rate` <dbl>, `Detection Prevalence` <dbl>, Prevalence <dbl>,
#> # Predictions <list>, ROC <list>, `Confusion Matrix` <list>, Family <chr>,
#> # Dependent <chr>
Approach: Creates one-vs-all (binomial) baseline evaluations for n sets of random predictions against the dependent variable, along with sets of “all class x,y,z,…” predictions.
Create the baseline evaluations:
multiclass_baseline <- baseline(
test_data = multiclass_test_set, n = 100,
dependent_col = "class", family = "multinomial")
# Summarized metrics
multiclass_baseline$summarized_metrics
#> # A tibble: 12 x 16
#> Measure `Overall Accura… `Balanced Accur… F1 Sensitivity Specificity
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mean 0.250 0.501 0.283 0.252 0.750
#> 2 Median 0.231 0.494 0.280 0.243 0.746
#> 3 SD 0.0841 0.0567 0.0737 0.0853 0.0284
#> 4 IQR 0.115 0.0795 0.0920 0.121 0.0385
#> 5 Max 0.538 0.786 0.667 1 1
#> 6 Min 0.0769 0.262 0.111 0 0.474
#> 7 NAs NA 0 61 0 0
#> 8 INFs NA 0 0 0 0
#> 9 All_cl… 0.192 0.5 NA 0.25 0.75
#> 10 All_cl… 0.269 0.5 NA 0.25 0.75
#> 11 All_cl… 0.269 0.5 NA 0.25 0.75
#> 12 All_cl… 0.269 0.5 NA 0.25 0.75
#> # … with 10 more variables: `Pos Pred Value` <dbl>, `Neg Pred Value` <dbl>,
#> # AUC <dbl>, `Lower CI` <dbl>, `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>,
#> # `Detection Rate` <dbl>, `Detection Prevalence` <dbl>, Prevalence <dbl>
# Summarized class level results for class 1
multiclass_baseline$summarized_class_level_results %>%
dplyr::filter(Class == "class_1") %>%
tidyr::unnest(Results)
#> # A tibble: 10 x 16
#> Class Measure `Balanced Accur… F1 Sensitivity Specificity
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 clas… Mean 0.514 0.284 0.28 0.748
#> 2 clas… Median 0.529 0.286 0.2 0.762
#> 3 clas… SD 0.102 0.106 0.191 0.0979
#> 4 clas… IQR 0.124 0.182 0.2 0.0952
#> 5 clas… Max 0.786 0.526 1 0.952
#> 6 clas… Min 0.262 0.125 0 0.524
#> 7 clas… NAs 0 18 0 0
#> 8 clas… INFs 0 0 0 0
#> 9 clas… All_0 0.5 NA 0 1
#> 10 clas… All_1 0.5 0.323 1 0
#> # … with 10 more variables: `Pos Pred Value` <dbl>, `Neg Pred Value` <dbl>,
#> # AUC <dbl>, `Lower CI` <dbl>, `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>,
#> # `Detection Rate` <dbl>, `Detection Prevalence` <dbl>, Prevalence <dbl>
# Random evaluations
# Note, that the class level results for each repetition
# is available as well
multiclass_baseline$random_evaluations
#> # A tibble: 100 x 21
#> Repetition `Overall Accura… `Balanced Accur… F1 Sensitivity Specificity
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.154 0.445 NaN 0.171 0.719
#> 2 2 0.269 0.518 NaN 0.279 0.758
#> 3 3 0.192 0.460 0.195 0.193 0.727
#> 4 4 0.385 0.591 0.380 0.386 0.797
#> 5 5 0.154 0.430 NaN 0.143 0.717
#> 6 6 0.154 0.438 NaN 0.157 0.718
#> 7 7 0.154 0.445 NaN 0.171 0.719
#> 8 8 0.346 0.574 0.341 0.364 0.784
#> 9 9 0.308 0.541 0.315 0.314 0.767
#> 10 10 0.308 0.536 0.322 0.3 0.772
#> # … with 90 more rows, and 15 more variables: `Pos Pred Value` <dbl>, `Neg Pred
#> # Value` <dbl>, AUC <dbl>, `Lower CI` <dbl>, `Upper CI` <dbl>, Kappa <dbl>,
#> # MCC <dbl>, `Detection Rate` <dbl>, `Detection Prevalence` <dbl>,
#> # Prevalence <dbl>, Predictions <list>, `Confusion Matrix` <list>, `Class
#> # Level Results` <list>, Family <chr>, Dependent <chr>
There are currently a small set of plots for quick visualization of the results. It is supposed to be easy to extract the needed information to create your own plots. If you lack access to any information or have other requests or ideas, feel free to open an issue.
cv_plot(CV1, type = "RMSE") +
theme_bw()
cv_plot(CV1, type = "r2") +
theme_bw()
cv_plot(CV1, type = "IC") +
theme_bw()
cv_plot(CV1, type = "coefficients") +
theme_bw()
cv_plot(CV2, type = "ROC") +
theme_bw()
Instead of manually typing all possible model formulas for a set of
fixed effects (including the possible interactions),
combine_predictors()
can do it for you (with some constraints).
When including interactions, >200k formulas have been precomputed for up to 8 fixed effects, with a maximum interaction size of 3, and a maximum of 5 fixed effects per formula. It’s possible to further limit the generated formulas.
We can also append a random effects structure to the generated formulas.
combine_predictors(dependent = "y",
fixed_effects = c("a","b","c"),
random_effects = "(1|d)")
#> [1] "y ~ a + (1|d)" "y ~ b + (1|d)"
#> [3] "y ~ c + (1|d)" "y ~ a * b + (1|d)"
#> [5] "y ~ a * c + (1|d)" "y ~ a + b + (1|d)"
#> [7] "y ~ a + c + (1|d)" "y ~ b * c + (1|d)"
#> [9] "y ~ b + c + (1|d)" "y ~ a * b * c + (1|d)"
#> [11] "y ~ a * b + c + (1|d)" "y ~ a * c + b + (1|d)"
#> [13] "y ~ a + b * c + (1|d)" "y ~ a + b + c + (1|d)"
#> [15] "y ~ a * b + a * c + (1|d)" "y ~ a * b + b * c + (1|d)"
#> [17] "y ~ a * c + b * c + (1|d)" "y ~ a * b + a * c + b * c + (1|d)"
If two or more fixed effects should not be in the same formula, like an effect and its log-transformed version, we can provide them as sublists.
combine_predictors(dependent = "y",
fixed_effects = list("a", list("b","log_b")),
random_effects = "(1|d)")
#> [1] "y ~ a + (1|d)" "y ~ b + (1|d)" "y ~ log_b + (1|d)"
#> [4] "y ~ a * b + (1|d)" "y ~ a * log_b + (1|d)" "y ~ a + b + (1|d)"
#> [7] "y ~ a + log_b + (1|d)"