ModelOriented / modelStudio
Interactive Studio for Explanatory Model Analysis
Home Page: https://doi.org/10.1007/s10618-023-00924-w
License: GNU General Public License v3.0
For example, in modelStudio the description is outdated:
The main goal of this function is to connect two local model explainers: Ceteris Paribus and Break Down. It also shows global explainers for your model, such as Partial Dependence and Feature Importance.
In the DESCRIPTION file, the 'Description' field needs to be updated.
In the header of modelStudio it would be good to show the true y for the selected observation.
It would be great to have a message about the currently calculated explanation.
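A minimal sketch of what such a message could look like; the show_info flag name is an assumption here, not a confirmed argument:
# show_info (assumed name) would print the currently calculated explanation
modelStudio(explainer, new_observation, show_info = TRUE)
# Calculating ...
#   Calculating ingredients::feature_importance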
Thank you in advance @hbaniecki
DALEXverse 0.19.8 release, summer 2019
Assigned: @pbiecek, @maksymiuks, @WojciechKretowicz
I was trying to execute an example for modelStudio:
library("dime")
library("DALEX")
titanic <- na.omit(titanic)
set.seed(1313)
titanic_small <- titanic[sample(1:nrow(titanic), 500), c(1,2,3,6,7,9)]
model_titanic_glm <- glm(survived == "yes" ~ gender + age + fare + class + sibsp,
data = titanic_small, family = "binomial")
explain_titanic_glm <- explain(model_titanic_glm,
data = titanic_small[,-6],
y = titanic_small$survived == "yes",
label = "glm")
new_observation <- titanic_small[1:10,-6]
modelStudio(explain_titanic_glm, new_observation[1,])
but this ends with:
> modelStudio(explain_titanic_glm, new_observation[1,])
  |                                                  | 0%
Error in ceteris_paribus.default(x, data, predict_function = predict_function, :
  promise already under evaluation: recursive default argument reference or earlier problems?
Enter a frame number, or 0 to exit
1: modelStudio(explain_titanic_glm, new_observation[1, ])
2: modelStudio.explainer(explain_titanic_glm, new_observation[1, ])
3: modelStudio.default(x = x$model, new_observation = new_observation, facet_dim
4: ingredients::accumulated_dependency(x, data, predict_function, only_numerical
5: accumulated_dependency.R#51: accumulated_dependency.default(x, data, predict_
6: accumulated_dependency.R#91: ceteris_paribus.default(x, data, predict_functio
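For reference, this class of error can be reproduced in plain R whenever a default argument refers to itself; a minimal sketch unrelated to the ingredients internals:
# A default argument that references itself forces recursive promise evaluation
f <- function(x = x) x
f()
# Error in f() : promise already under evaluation:
#   recursive default argument reference or earlier problems?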
It would be great to have a new plot in the dashboard: a scatterplot for EDA.
In the FIFA example, I would like to see the relation between Player Value and Age.
This would nicely supplement the PDP for the model.
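A rough sketch of the requested plot, assuming the DALEX::fifa data with value_eur and age columns:
# EDA scatterplot: Player Value vs Age; log scale tames the skewed values
library(DALEX)
library(ggplot2)
ggplot(fifa, aes(x = age, y = value_eur)) +
  geom_point(alpha = 0.3) +
  scale_y_log10() +
  labs(x = "Age", y = "Player value (EUR)")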
breakpoint_description <- ifelse(multiple_breakpoints, paste0("Breakpoints are identified at (",
variables, " = ", cut_name, " and ", variables, " = ",
round(df[cutpoint_additional, variables], 3), ")."),
paste0("Breakpoint is identified at (", variables, " = ",
cut_name, ")."))
Browse[2]> prefix <- paste0("The highest prediction occurs for (",
variables, " = ", max_name, "),", " while the lowest for (",
variables, " = ", min_name, ").\n", breakpoint_description)
Browse[2]> cutpoint <- ifelse(multiple_breakpoints, cutpoint_additional,
cutpoint)
Browse[2]> sufix <- describe_numeric_variable(original_x = attr(x,
"observations"), df = df, cutpoint = cutpoint, variables = variables)
Browse[2]> description <- paste(introduction, prefix, sufix, sep = "\n\n")
Browse[2]> description
A PNG with arrows pointing at the interactive elements would be handy.
Hi, I am one of the reviewers for your JOSS submission. I thought I'd put the things I miss in the documentation and the corresponding review checklist items here:
glmnet, for which it requires gcc-fortran (which I had to install using my package manager). First, I am wondering why it knew that it had to install glmnet - it is not mentioned in this library's DESCRIPTION (I assume it is a dependency of one of the other packages?). And I am also not sure whether your README should mention that one might need to install gcc-fortran (because it is not directly used by your package). Just wanted to let you know that this might be an issue :)
Add ms_update_options() and ms_update_observations() to the perks vignette.
rhub::check_for_cran()
rhub::check_with_rdevel()
usethis::use_cran_comments()
devtools::submit_cran()
It would calculate local explanations and add them to the existing modelStudio (without recalculating the global explanations).
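A minimal sketch of that workflow, with the ms_update_observations() signature assumed from its mention above:
# Global explanations are reused; only local ones for the new observation are computed
ms <- modelStudio(explainer, new_observation = data[1, ])
ms <- ms_update_observations(ms, explainer, new_observation = data[2, ])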
parallelMap
I can't get your demonstration example to run. I also tried installing the newest version of modelStudio and ingredients using devtools, but I still get this error:
This is the output of sessionInfo():
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux
Matrix products: default
BLAS: /usr/lib/libblas.so.3.8.0
LAPACK: /usr/lib/liblapack.so.3.8.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_AT.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_AT.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=de_AT.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] modelStudio_0.1.8
loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 pillar_1.4.2 compiler_3.6.1 remotes_2.1.0 prettyunits_1.0.2
[6] ingredients_0.3.10 iterators_1.0.12 tools_3.6.1 testthat_2.2.1 digest_0.6.22
[11] pkgbuild_1.0.6 pkgload_1.0.2 memoise_1.1.0 tibble_2.1.3 gtable_0.3.0
[16] lattice_0.20-38 pkgconfig_2.0.3 rlang_0.4.1 Matrix_1.2-17 foreach_1.4.7
[21] cli_1.1.0 rstudioapi_0.10 curl_4.2 withr_2.1.2 fs_1.3.1
[26] desc_1.2.0 devtools_2.2.1 rprojroot_1.3-2 glmnet_2.0-18 grid_3.6.1
[31] glue_1.3.1 R6_2.4.0 processx_3.4.1 DALEX_0.4.7 sessioninfo_1.1.1
[36] ggplot2_3.2.1 callr_3.3.2 magrittr_1.5 usethis_1.5.1 backports_1.1.5
[41] scales_1.0.0 codetools_0.2-16 ps_1.3.0 ellipsis_0.3.0 assertthat_0.2.1
[46] colorspace_1.4-1 lazyeval_0.2.2 munsell_0.5.0 crayon_1.3.4
Am I doing something wrong?
Hi,
There's a glitch with modelStudio when using mlr3 pipelines on data with missing values.
It looks like modelStudio() doesn't know how to impute missing data before crunching the numbers, even when the user has incorporated a pipe operator for missing values in the mlr3 pipeline. In fact, modelStudio() does not even recognize mlr3 learners if their class is other than [1] "LearnerClassifRanger" "LearnerClassif" "Learner" "R6" (e.g. try class(learner) for a Random Forest learner). If you have a pipeline whose class is [1] "GraphLearner" "Learner" "R6", modelStudio() doesn't know how to handle it.
Package DALEXtra's explain_mlr3() suffers from the same issue, although this can be dealt with by providing custom functions for the arguments predict_function and residual_function.
Below is an example of a pipeline that imputes missing data and then balances classes. Note that it works fine when there are no missing data, but returns an error otherwise.
Example 1: no missing data
library(tidyverse)
library(data.table)
library(tidymodels)
library(paradox)
library(mlr3) # NOTE: install mlr3 packages from GitHub, not CRAN, as they differ in a few things, e.g. with GitHub you tune the pipeline with $optimize() but with CRAN with $tune()
library(mlr3filters)
library(mlr3learners)
library(mlr3misc)
library(mlr3pipelines)
library(mlr3tuning)
library(DALEXtra)
library(modelStudio)
# Load task and make smaller so code runs faster
task <- tsk('sonar')
task$select(paste0('V', 1:10))
# Ratio values for class-balancing pipe operators
class_counts <- table(task$truth())
upsample_ratio <- class_counts[class_counts == max(class_counts)] /
class_counts[class_counts == min(class_counts)]
downsample_ratio <- 1 / upsample_ratio
# Pipe operators for class-balancing
# 1. Enrich minority class by factor 'ratio'
po_over <- po("classbalancing", id = "up", adjust = "minor",
reference = "minor", shuffle = FALSE, ratio = upsample_ratio)
# 2. Reduce majority class by factor '1/ratio'
po_under <- po("classbalancing", id = "down", adjust = "major",
reference = "major", shuffle = FALSE, ratio = downsample_ratio)
# Handle missing values
features_with_nas <- sort(task$missings() / task$nrow, decreasing = TRUE)
features_with_nas <- features_with_nas[features_with_nas != 0]
# Imputes values based on histogram
hist_imp <- po("imputehist", param_vals =
list(affect_columns = selector_name(names(features_with_nas))))
# Add an indicator column for each feature with missing values
# One-hot encode these new categorical columns, and then remove the categorical versions of them
miss_ind <- po("missind") %>>%
po("encode") %>>%
po("select",
selector = selector_invert(selector_type("factor")),
id = 'dummy_encoding')
impute_data <- po("copy", 2) %>>%
gunion(list(hist_imp, miss_ind)) %>>%
po("featureunion")
impute_data$plot() # This is the Graph we'll add to the pipeline
impute_data$plot(html = TRUE)
# Random Forest learner with up- and down-balancing
rf <- lrn("classif.ranger", predict_type = "prob")
rf_up <- GraphLearner$new(
po_over %>>%
po('learner', rf, id = 'rf'),
predict_type = 'prob'
)
rf_down <- GraphLearner$new(
po_under %>>%
po('learner', rf, id = 'rf'),
predict_type = 'prob')
# All learners (Random Forest with up- and down-balancing)
learners <- list(
rf_up,
rf_down
)
names(learners) <- sapply(learners, function(x) x$id)
# Our pipeline
graph <-
impute_data %>>%
po("branch", names(learners)) %>>%
gunion(unname(learners)) %>>%
po("unbranch")
graph$plot() # Plot pipeline
graph$plot(html = TRUE) # Plot pipeline
pipe <- GraphLearner$new(graph) # Convert pipeline to learner
pipe$predict_type <- 'prob' # We want to predict probabilities and not classes.
param_set <- ParamSetCollection$new(list(
ParamSet$new(list(pipe$param_set$params$branch.selection$clone()))
))
# Set up tuning instance
instance <- TuningInstance$new(
task = task,
learner = pipe,
resampling = rsmp('cv', folds = 2),
measures = msr('classif.bbrier'),
param_set,
terminator = term("evals", n_evals = 3),
store_models = TRUE)
tuner <- TunerRandomSearch$new()
# Tune pipe learner to find best-performing branch
tuner$optimize(instance)
# Take a look at the results
instance$result
print(instance$result$tune_x$branch.selection) # Best model
# Train pipeline
pipe$train(task)
################################################################################################
# DALEXextra and modelStudio stuff
################################################################################################
# First create custom functions for predictions and residuals
# We need custom functions because explain_mlr3() doesn't recognize the Graph Learner class of mlr3
predict_function_custom <- function(model, data) {
pr <- model$
predict_newdata(data)$
data$
prob[, 1]
return(pr)
}
residual_function_custom <- function(model, data, y) {
pr <- model$
predict_newdata(data)
y_hat <- pr$
data$
prob[, 1]
return(as.integer(y == 0) - y_hat)
}
# Run explainer - works fine with the above functions
explainer <- explain_mlr3(model = pipe,
data = task$data()[, -1],
y = as.integer(task$data()[, 1] == 'M'),
predict_function = predict_function_custom,
residual_function = residual_function_custom,
label = "mlr3")
# HOWEVER: we have a classification task, but explainer thinks it's regression!
explainer$model_info
# Let's run modelStudio. You'll need to wait for a while
modelStudio(
explainer,
new_observation = task$data()[6, -1]
)
# Ignore the warning about data format. Argument `new_observation` is a `data.table`, so its class is `[1] "data.table" "data.frame"`,
# which is essentially a data frame. The class has two elements, but the condition only looks at the first one.
Working just fine.
Example 2: missing data
The code is identical to Example 1, except that missing values are injected right after loading the task:
# Load task and make smaller so code runs faster
task <- tsk('sonar')
task$select(paste0('V', 1:10))
# Create some missing data
data <- task$data()
data$V1[1:5] <- NA
task <- TaskClassif$new(data, id = 'sonar', target = 'Class')
From here, the class balancing, imputation, tuning, training, explainer, and modelStudio() call proceed exactly as in Example 1.
We get errors and no plot:
Calculating ...
Calculating ingredients::feature_importance
Calculating ingredients::partial_dependence (numerical)
Calculating ingredients::accumulated_dependence (numerical)
Elapsed time: 00:01:01 ETA ...
Error in seq.default(min(x[, name]), max(x[, name]), length.out = nbins) :
  'from' must be a finite number
In addition: Warning messages:
1: In value[[3L]](cond) :
Error occurred in ingredients::partial_dependence (numerical) function: missing values and NaN's not allowed if 'na.rm' is FALSE
2: In value[[3L]](cond) :
Error occurred in ingredients::accumulated_dependence (numerical) function: missing values and NaN's not allowed if 'na.rm' is FALSE
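For reference, both messages can be reproduced in plain R when a numeric column contains NAs; a minimal sketch independent of the pipeline above:
x <- c(1, 2, NA)
quantile(x)
# Error: missing values and NaN's not allowed if 'na.rm' is FALSE
seq(min(x), max(x), length.out = 5)
# Error in seq.default(...) : 'from' must be a finite number   (min(x) is NA)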
Is there a way to pass imputed data from explain_mlr3() to modelStudio(), just like you can pass predictions and residuals with the arguments predict_function and residual_function, respectively? Any chance of implementing this, please?
Thanks
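One possible workaround (a sketch under assumptions, not an official API): train only the imputation part of the graph on the task first, then hand the completed data to the explainer so the profile calculations never see NAs. That Graph$train(task)[[1]] returns the imputed task is assumed from mlr3pipelines semantics:
# Apply the imputation graph alone, then build the explainer on completed data
imputed_task <- impute_data$train(task)[[1]]
imputed <- imputed_task$data()
explainer <- explain_mlr3(model = pipe,
                          data = imputed[, -1],
                          y = as.integer(imputed[, 1] == 'M'),
                          predict_function = predict_function_custom,
                          residual_function = residual_function_custom,
                          label = "mlr3")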
DALEX::apartments
dalex fifa
Explainer.dump() in python examples
ranger instead of randomForest (everywhere)
pip install dalex console chunk
B = 10, N = 300 to support a "fast feedback loop" process
N/n_samples to feature_importance calculation
The updateModelStudio() function is supposed to allow the user to change parameters of how modelStudio is displayed, without re-computation.
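A sketch of what that could look like, combining the fast-feedback parameters above with ms_update_options() from the package (the facet_dim and time arguments are assumed from ms_options()):
# Compute once with few permutations (B) and a small sample (N) for speed,
# then change display parameters without recomputing the explanations
ms <- modelStudio(explainer, B = 10, N = 300)
ms <- ms_update_options(ms, facet_dim = c(2, 3), time = 0)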
Hi,
I would like to know how to display the features of interest on a modelStudio plot. It looks like modelStudio chooses the first feature in the data frame by default, and information on the rest of the features is only made available by hovering over the plots.
Example from the modelStudio website:
library("DALEX")
library("modelStudio")
# fit a model
model <- glm(survived ~., data = titanic_imputed, family = "binomial")
# create an explainer for the model
explainer <- explain(model,
data = titanic_imputed,
y = titanic_imputed$survived,
label = "Titanic GLM")
# make a studio for the model
modelStudio(explainer)
The only feature displayed on the plot is gender, which is the first column in titanic_imputed.
Unless I'm missing something, there is no mention in the manual of how to change this. There's also no option for changing it in the actual plot.
Thanks.
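One workaround that follows from the observation above (a sketch, not a documented option): since the first column of the data is selected by default, reorder the columns so the feature of interest, say age, comes first:
# Put `age` first so the dashboard opens with it instead of `gender`
cols <- c("age", setdiff(names(titanic_imputed), "age"))
explainer <- explain(model,
                     data = titanic_imputed[, cols],
                     y = titanic_imputed$survived,
                     label = "Titanic GLM")
modelStudio(explainer)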
As in the auditor or DALEX packages:
rhub::check_for_cran()
rhub::check_with_rdevel()
usethis::use_cran_comments()
devtools::submit_cran()
Add a NEWS file to track changes in consecutive versions of the package (see an example in DALEX or archivist).
Most of the information is covered in the documentation: https://modelstudio.drwhy.ai/
✨ Please submit a new issue when dealing with potential bugs. Thanks! ✨
modelStudio() computation
foo plot doesn't show up on the dashboard
An error may have occurred in DALEX::explain(). There could be a warning message pointing to the solution of this problem.
An error may have occurred in modelStudio(). There could be an error message (printed as a warning) pointing to the origin and solution of this problem.
Check for outdated versions of DALEX, ingredients, iBreakDown.
modelStudio() output shows up as a white window in the RStudio Viewer
Solve this by updating RStudio. Please check if the output shows up properly in the browser (e.g. use the viewer = "browser" argument in modelStudio()).
Use modelStudio(..., options = ms_options(margin_left = 200)).
Unable to load the pickle file with the Explainer object
See the reticulate vignettes: Python Version Configuration and Installing Python Packages.
NA in data
See #71
Shiny support
See #77
Change the number of panels with plots from 2x2 to 1x2 or 3x3 (grid size of the dashboard)
See #54 (comment)