
modelstudio's People

Contributors

hbaniecki, kyleniemeyer, pbiecek, piotrpiatyszek


modelstudio's Issues

Update documentation and DESCRIPTION

For example, in modelStudio() the description is outdated:
The main goal of this function is to connect two local model explainers: Ceteris Paribus and Break Down. It also shows global explainers for your model such as Partial Dependency and Feature Importance.
In the DESCRIPTION file, the 'Description' field needs to be updated.

DALEXverse 0.19.8 release summer 2019

Integration

  • readability: vignettes
  • readability: NEWS
  • readability: DESCRIPTION
  • consistency: pkgdown website
  • consistency: entry at DrWhy.AI webpage

assigned: @pbiecek

Code review

  • consistency: names of functions
  • consistency: names of files
  • consistency: names of variables in functions (local and global)
  • length: functions
  • readability: code (comments, constructions)

assigned: @maksymiuks

Feature review

  • readability: documentation (title, description, details)
  • readability: examples (relevant, complete, with comments)
  • reproducibility: tests (code coverage)
  • links to functions: \code

assigned: @WojciechKretowicz

error in the example

I was trying to execute an example for modelStudio

library("dime")
library("DALEX")

titanic <- na.omit(titanic)
set.seed(1313)
titanic_small <- titanic[sample(1:nrow(titanic), 500), c(1,2,3,6,7,9)]

model_titanic_glm <- glm(survived == "yes" ~ gender + age + fare + class + sibsp,
                         data = titanic_small, family = "binomial")

explain_titanic_glm <- explain(model_titanic_glm,
                               data = titanic_small[,-6],
                               y = titanic_small$survived == "yes",
                               label = "glm")

new_observation <- titanic_small[1:10,-6]

modelStudio(explain_titanic_glm, new_observation[1,])

but this ends with

> modelStudio(explain_titanic_glm, new_observation[1,])
  |                                                                        |   0%Error in ceteris_paribus.default(x, data, predict_function = predict_function,  : 
  promise already under evaluation: recursive default argument reference or earlier problems?

Enter a frame number, or 0 to exit   

1: modelStudio(explain_titanic_glm, new_observation[1, ])
2: modelStudio.explainer(explain_titanic_glm, new_observation[1, ])
3: modelStudio.default(x = x$model, new_observation = new_observation, facet_dim
4: ingredients::accumulated_dependency(x, data, predict_function, only_numerical
5: accumulated_dependency.R#51: accumulated_dependency.default(x, data, predict_
6: accumulated_dependency.R#91: ceteris_paribus.default(x, data, predict_functio
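
A hedged first troubleshooting step, echoing the FAQ advice further down this page: make sure DALEX, ingredients, and iBreakDown are at their latest versions before re-running the example. The GitHub paths below assume the ModelOriented organization.

# Hedged troubleshooting step: update the explanation packages named in the traceback.
# Repository paths assume the ModelOriented GitHub organization.
# install.packages("remotes")   # if remotes is not installed yet
remotes::install_github("ModelOriented/DALEX")
remotes::install_github("ModelOriented/ingredients")
remotes::install_github("ModelOriented/iBreakDown")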

new plot: scatterplot [EDA]

It would be great to have a new plot in the dashboard: a scatterplot for EDA.
In the FIFA example, I would like to see the relation between Player Value and Age.
This would nicely supplement the PDP for the model.
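
A minimal sketch of the requested plot, drawn outside the dashboard and assuming a data frame `fifa` with columns `age` and `value_eur` (the column names are an assumption, not taken from the package):

library("ggplot2")

# EDA scatterplot: Player Value vs Age. Player values span orders of magnitude,
# so a log scale on the y-axis is used here.
ggplot(fifa, aes(x = age, y = value_eur)) +
  geom_point(alpha = 0.3) +
  scale_y_log10() +
  labs(x = "Age", y = "Player Value (EUR)")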

problem in describe

Browse[2]> breakpoint_description <- ifelse(multiple_breakpoints, paste0("Breakpoints are identified at (",
+   variables, " = ", cut_name, " and ", variables, " = ",
+   round(df[cutpoint_additional, variables], 3), ")."),
+   paste0("Breakpoint is identified at (", variables, " = ",
+   cut_name, ")."))
Browse[2]> prefix <- paste0("The highest prediction occurs for (",
+   variables, " = ", max_name, "),", " while the lowest for (",
+   variables, " = ", min_name, ").\n", breakpoint_description)
Browse[2]> cutpoint <- ifelse(multiple_breakpoints, cutpoint_additional,
+   cutpoint)
Browse[2]> sufix <- describe_numeric_variable(original_x = attr(x,
+   "observations"), df = df, cutpoint = cutpoint, variables = variables)
Browse[2]> description <- paste(introduction, prefix, sufix, sep = "\n\n")
Browse[2]> description

Missing parts in documentation

Hi, I am one of the reviewers for your JOSS submission. I thought I'd put the things I miss in the documentation and the corresponding review checklist items here:

  • A statement of need: It is described what the software should solve, but I somehow miss what the target audience is. Is it researchers, machine learning practitioners, anyone interested in interpretable machine learning...?
  • Installation instructions: (this might be because I haven't used R much in the past year, as stated before I started the review). When I installed your package (on a Manjaro machine), I had issues because it also installed glmnet, which requires gcc-fortran (which I had to install using my package manager). First, I am wondering why it knew that it had to install glmnet: it is not mentioned in this library's DESCRIPTION (I assume it is a dependency of one of the other packages?). I am also not sure whether your README should mention that one might need to install gcc-fortran (because it is not directly used by your package). Just wanted to let you know that this might be an issue :)
  • Automated tests: The reviewing checklist asks "Are there automated tests or manual steps described so that the functionality of the software can be verified?" I can't find such a thing; maybe you can point me to it. (A minimal sketch of what such a test could look like follows this list.)
  • Community guidelines: The reviewing checklist asks "Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support" I can't find such a thing; maybe you can point me to it.
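
A minimal sketch of the kind of automated test the checklist asks about, assuming testthat and the titanic_imputed data from DALEX; the expectation below is illustrative and not taken from the package's actual test suite.

library("testthat")
library("DALEX")
library("modelStudio")

test_that("modelStudio() builds a dashboard without errors", {
  model <- glm(survived ~ ., data = titanic_imputed, family = "binomial")
  explainer <- explain(model,
                       data = titanic_imputed,
                       y = titanic_imputed$survived,
                       verbose = FALSE)
  # small N and B keep the test fast; expect_error(..., NA) asserts that no error occurs
  expect_error(modelStudio(explainer, N = 10, B = 2, show_info = FALSE), NA)
})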

v1.1.0 release checklist

  • add ms_update_options() and ms_update_observations() to the perks vignette
  • test vignettes
  • update dashboards
  • rhub::check_for_cran()
  • rhub::check_with_rdevel()
  • usethis::use_cran_comments()
  • devtools::submit_cran()
  • accept the mail
  • tag release on GitHub

Error in eval(predvars, data, env) : object 'parch' not found

I can't get your demonstration example to run. I also tried installing the newest versions of modelStudio and ingredients using devtools, but I still get this error:

[screenshot of the console error: Error in eval(predvars, data, env) : object 'parch' not found]

This is the output of sessionInfo():

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.8.0
LAPACK: /usr/lib/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_AT.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_AT.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_AT.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] modelStudio_0.1.8

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2         pillar_1.4.2       compiler_3.6.1     remotes_2.1.0      prettyunits_1.0.2 
 [6] ingredients_0.3.10 iterators_1.0.12   tools_3.6.1        testthat_2.2.1     digest_0.6.22     
[11] pkgbuild_1.0.6     pkgload_1.0.2      memoise_1.1.0      tibble_2.1.3       gtable_0.3.0      
[16] lattice_0.20-38    pkgconfig_2.0.3    rlang_0.4.1        Matrix_1.2-17      foreach_1.4.7     
[21] cli_1.1.0          rstudioapi_0.10    curl_4.2           withr_2.1.2        fs_1.3.1          
[26] desc_1.2.0         devtools_2.2.1     rprojroot_1.3-2    glmnet_2.0-18      grid_3.6.1        
[31] glue_1.3.1         R6_2.4.0           processx_3.4.1     DALEX_0.4.7        sessioninfo_1.1.1 
[36] ggplot2_3.2.1      callr_3.3.2        magrittr_1.5       usethis_1.5.1      backports_1.1.5   
[41] scales_1.0.0       codetools_0.2-16   ps_1.3.0           ellipsis_0.3.0     assertthat_0.2.1  
[46] colorspace_1.4-1   lazyeval_0.2.2     munsell_0.5.0      crayon_1.3.4   

Am I doing something wrong?

modelStudio(), explainer_mlr3() and NAs

Hi,

There's a glitch with modelStudio when using mlr3 pipelines with data with missing values.

It looks like modelStudio() doesn't know how to impute missing data before crunching the numbers, even when the user has incorporated a pipe operator for missing values in the mlr3 pipeline. In fact, modelStudio() does not even recognize mlr3 learners if their class is other than [1] "LearnerClassifRanger" "LearnerClassif" "Learner" "R6" (e.g. try class(learner) for a Random Forest learner). If you have a pipeline whose class is [1] "GraphLearner" "Learner" "R6", modelStudio() doesn't know how to handle it.

The DALEXtra package's explainer_mlr3() suffers from the same issue, although this can be dealt with by providing custom functions for the arguments predict_function and residual_function.

Below is an example of a pipeline that imputes missing data and then balances classes. Note that it works fine when there are no missing data, but returns an error otherwise.

Example 1: no missing data

library(tidyverse)
library(data.table)
library(tidymodels)
library(paradox)
library(mlr3) # NOTE: install mlr3 packages from GitHub, not CRAN, as they differ in a few things, e.g. with GitHub you tune the pipeline with $optimize(), while with CRAN you use $tune()
library(mlr3filters)
library(mlr3learners)
library(mlr3misc)
library(mlr3pipelines)
library(mlr3tuning)
library(DALEXtra)
library(modelStudio)

# Load task and make smaller so code runs faster
task <- tsk('sonar')
task$select(paste0('V', 1:10))

# Ratio values for class-balancing pipe operators
class_counts <- table(task$truth())
upsample_ratio <- class_counts[class_counts == max(class_counts)] / 
  class_counts[class_counts == min(class_counts)]
downsample_ratio <- 1 / upsample_ratio

# Pipe operators for class-balancing
# 1. Enrich minority class by factor 'ratio'
po_over <- po("classbalancing", id = "up", adjust = "minor", 
  reference = "minor", shuffle = FALSE, ratio = upsample_ratio)

# 2. Reduce majority class by factor '1/ratio'
po_under <- po("classbalancing", id = "down", adjust = "major", 
  reference = "major", shuffle = FALSE, ratio = downsample_ratio)

# Handle missing values
features_with_nas <- sort(task$missings() / task$nrow, decreasing = TRUE)
features_with_nas <- features_with_nas[features_with_nas != 0]

# Imputes values based on histogram
hist_imp <- po("imputehist", param_vals = 
  list(affect_columns = selector_name(names(features_with_nas))))

# Add an indicator column for each feature with missing values
# One-hot encode these new categorical columns, and then remove the categorical versions of them
miss_ind <- po("missind") %>>% 
  po("encode") %>>%
  po("select", 
     selector = selector_invert(selector_type("factor")), 
     id = 'dummy_encoding')

impute_data <- po("copy", 2) %>>%
  gunion(list(hist_imp, miss_ind)) %>>%
  po("featureunion")

impute_data$plot() # This is the Graph we'll add to the pipeline
impute_data$plot(html = TRUE)

# Random Forest learner with up- and down-balancing
rf <- lrn("classif.ranger", predict_type = "prob")

rf_up <- GraphLearner$new(
  po_over %>>%
    po('learner', rf, id = 'rf'),
  predict_type = 'prob'
)

rf_down <- GraphLearner$new(
  po_under %>>%
    po('learner', rf, id = 'rf'),
  predict_type = 'prob')

# All learners (Random Forest with up- and down-balancing)
learners <- list(
  rf_up,
  rf_down
)
names(learners) <- sapply(learners, function(x) x$id)

# Our pipeline
graph <- 
  impute_data %>>%
  po("branch", names(learners)) %>>% 
  gunion(unname(learners)) %>>%
  po("unbranch")

graph$plot() # Plot pipeline
graph$plot(html = TRUE) # Plot pipeline

pipe <- GraphLearner$new(graph) # Convert pipeline to learner
pipe$predict_type <- 'prob' # We want to predict probabilities and not classes.

param_set <- ParamSetCollection$new(list(
  ParamSet$new(list(pipe$param_set$params$branch.selection$clone()))
))

# Set up tuning instance
instance <- TuningInstance$new(
  task = task,
  learner = pipe,
  resampling = rsmp('cv', folds = 2),
  measures = msr('classif.bbrier'),
  param_set,
  terminator = term("evals", n_evals = 3), 
  store_models = TRUE)
tuner <- TunerRandomSearch$new()

# Tune pipe learner to find best-performing branch
tuner$optimize(instance)

# Take a look at the results
instance$result
print(instance$result$tune_x$branch.selection) # Best model

# Train pipeline
pipe$train(task)

################################################################################################
# DALEXtra and modelStudio stuff
################################################################################################

# First create custom functions for predictions and residuals
# We need custom functions because explain_mlr3() doesn't recognize the Graph Learner class of mlr3
predict_function_custom <- function(model, data) {
  pr <- model$
    predict_newdata(data)$
    data$
    prob[, 1]
  
  return(pr)
}

residual_function_custom <- function(model, data, y) {
  pr <- model$
    predict_newdata(data)
  
  y_hat <- pr$
    data$
    prob[, 1]
  
  return(as.integer(y == 0) - y_hat)
}

# Run the explainer - works fine with the above functions
explainer <- explain_mlr3(model = pipe,
  data = task$data()[, -1],
  y = as.integer(task$data()[, 1] == 'M'),
  predict_function = predict_function_custom,
  residual_function = residual_function_custom,
  label = "mlr3")

# HOWEVER: we have a classification task, but explainer thinks it's regression!
explainer$model_info

# Let's run modelStudio. You'll need to wait for a while
modelStudio(
  explainer, 
  new_observation = task$data()[6, -1]
)

# Ignore the warning about the data format. The argument `new_observation` is a `data.table`, so its class is `[1] "data.table" "data.frame"`,
# which is essentially a data frame. The class has two elements, but the condition only checks the first one.

Working just fine.

Example 2: missing data

library(tidyverse)
library(data.table)
library(tidymodels)
library(paradox)
library(mlr3)
library(mlr3filters)
library(mlr3learners)
library(mlr3misc)
library(mlr3pipelines)
library(mlr3tuning)
library(DALEXtra)
library(modelStudio)

# Load task and make smaller so code runs faster
task <- tsk('sonar')
task$select(paste0('V', 1:10))

# Create some missing data
data <- task$data()
data$V1[1:5] <- NA
task <- TaskClassif$new(data, id = 'sonar', target = 'Class')

# Ratio values for class-balancing pipe operators
class_counts <- table(task$truth())
upsample_ratio <- class_counts[class_counts == max(class_counts)] / 
  class_counts[class_counts == min(class_counts)]
downsample_ratio <- 1 / upsample_ratio

# Pipe operators for class-balancing
# 1. Enrich minority class by factor 'ratio'
po_over <- po("classbalancing", id = "up", adjust = "minor", 
  reference = "minor", shuffle = FALSE, ratio = upsample_ratio)

# 2. Reduce majority class by factor '1/ratio'
po_under <- po("classbalancing", id = "down", adjust = "major", 
  reference = "major", shuffle = FALSE, ratio = downsample_ratio)

# Handle missing values
features_with_nas <- sort(task$missings() / task$nrow, decreasing = TRUE)
features_with_nas <- features_with_nas[features_with_nas != 0]

# Imputes values based on histogram
hist_imp <- po("imputehist", param_vals = 
  list(affect_columns = selector_name(names(features_with_nas))))

# Add an indicator column for each feature with missing values
# One-hot encode these new categorical columns, and then remove the categorical versions of them
miss_ind <- po("missind") %>>% 
  po("encode") %>>%
  po("select", 
     selector = selector_invert(selector_type("factor")), 
     id = 'dummy_encoding')

impute_data <- po("copy", 2) %>>%
  gunion(list(hist_imp, miss_ind)) %>>%
  po("featureunion")

impute_data$plot() # This is the Graph we'll add to the pipeline
impute_data$plot(html = TRUE)

# Random Forest learner with up- and down-balancing
rf <- lrn("classif.ranger", predict_type = "prob")

rf_up <- GraphLearner$new(
  po_over %>>%
    po('learner', rf, id = 'rf'),
  predict_type = 'prob'
)

rf_down <- GraphLearner$new(
  po_under %>>%
    po('learner', rf, id = 'rf'),
  predict_type = 'prob')

# All learners (Random Forest with up- and down-balancing)
learners <- list(
  rf_up,
  rf_down
)
names(learners) <- sapply(learners, function(x) x$id)

# Our pipeline
graph <- 
  impute_data %>>%
  po("branch", names(learners)) %>>% 
  gunion(unname(learners)) %>>%
  po("unbranch")

graph$plot() # Plot pipeline
graph$plot(html = TRUE) # Plot pipeline

pipe <- GraphLearner$new(graph) # Convert pipeline to learner
pipe$predict_type <- 'prob' # We want to predict probabilities and not classes.

param_set <- ParamSetCollection$new(list(
  ParamSet$new(list(pipe$param_set$params$branch.selection$clone()))
))

# Set up tuning instance
instance <- TuningInstance$new(
  task = task,
  learner = pipe,
  resampling = rsmp('cv', folds = 2),
  measures = msr('classif.bbrier'),
  param_set,
  terminator = term("evals", n_evals = 3), 
  store_models = TRUE)
tuner <- TunerRandomSearch$new()

# Tune pipe learner to find best-performing branch
tuner$optimize(instance)

# Take a look at the results
instance$result
print(instance$result$tune_x$branch.selection) # Best model

# Train pipeline
pipe$train(task)

################################################################################################
# DALEXtra and modelStudio stuff
################################################################################################

# First create custom functions for predictions and residuals
# We need custom functions because explain_mlr3() doesn't recognize the Graph Learner class of mlr3
predict_function_custom <- function(model, data) {
  pr <- model$
    predict_newdata(data)$
    data$
    prob[, 1]
  
  return(pr)
}

residual_function_custom <- function(model, data, y) {
  pr <- model$
    predict_newdata(data)
  
  y_hat <- pr$
    data$
    prob[, 1]
  
  return(as.integer(y == 0) - y_hat)
}

# Run the explainer - works fine with the above functions
explainer <- explain_mlr3(model = pipe,
  data = task$data()[, -1],
  y = as.integer(task$data()[, 1] == 'M'),
  predict_function = predict_function_custom,
  residual_function = residual_function_custom,
  label = "mlr3")

# HOWEVER: we have a classification task, but explainer thinks it's regression!
explainer$model_info

# Let's run modelStudio. You'll need to wait for a while
modelStudio(
  explainer, 
  new_observation = task$data()[6, -1]
)

# Ignore the warning about the data format. The argument `new_observation` is a `data.table`, so its class is `[1] "data.table" "data.frame"`,
# which is essentially a data frame. The class has two elements, but the condition only checks the first one.

We get errors and no plot:

Calculating ... 
  Calculating ingredients::feature_importance 
  Calculating ingredients::partial_dependence (numerical) 
  Calculating ingredients::accumulated_dependence (numerical) 
    Elapsed time: 00:01:01 ETA...Error in seq.default(min(x[, name]), max(x[, name]), length.out = nbins) : 
  'from' must be a finite number
In addition: Warning messages:
1: In value[[3L]](cond) : 
Error occurred in ingredients::partial_dependence (numerical) function: missing values and NaN's not allowed if 'na.rm' is FALSE
2: In value[[3L]](cond) : 
Error occurred in ingredients::accumulated_dependence (numerical) function: missing values and NaN's not allowed if 'na.rm' is FALSE

Is there a way to pass imputed data from explainer_mlr3() to modelStudio(), just like you can pass predictions and residuals with the arguments predict_function and residual_function, respectively? Any chance of implementing this, please?

Thanks
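
One hedged workaround while waiting for such an argument: impute the data before building the explainer and pass the complete frame through the data argument, so that ingredients never sees NAs. The sketch below reuses `pipe`, `task`, and the custom functions from the example above; the median imputation of V1 is purely illustrative, not the pipeline's imputehist step.

# Hedged sketch: pass pre-imputed data to the explainer so that
# partial/accumulated dependence never operate on missing values.
# `pipe`, `predict_function_custom`, and `residual_function_custom`
# come from the example above; the imputation below is illustrative.
data_imputed <- task$data()[, -1]
data_imputed$V1[is.na(data_imputed$V1)] <- median(data_imputed$V1, na.rm = TRUE)

explainer_imputed <- explain_mlr3(model = pipe,
  data = data_imputed,
  y = as.integer(task$data()[, 1] == 'M'),
  predict_function = predict_function_custom,
  residual_function = residual_function_custom,
  label = "mlr3")

modelStudio(explainer_imputed, new_observation = data_imputed[6, ])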

TODO

  • change mlr? example to the regression model on DALEX::apartments
  • change sklearn? example to the regression model on dalex fifa
  • use Explainer.dump() in python examples
  • use ranger instead of randomForest (everywhere)
  • pip install dalex console chunk
  • update gifs
  • add parsnip example
  • use macos devel in gh-actions
  • citation
  • change default B = 10, N = 300 to support "fast feedback loop" process
  • add N/n_samples to feature_importance calculation
  • remove d3 from DESC and README
  • remove txtProgressBar import
  • remove covr from suggests
  • fix wrong vignette indexEntry p&r
  • write blog about IEMA

Display feature of interest in plots

Hi,

I would like to know how to display the features of interest on a modelStudio plot. It looks like modelStudio chooses the first feature in the data frame by default, and information on the rest of the features is only available by hovering over the plots.

Example from modelStudio website:

library("DALEX")
library("modelStudio")

# fit a model
model <- glm(survived ~., data = titanic_imputed, family = "binomial")

# create an explainer for the model    
explainer <- explain(model,
                     data = titanic_imputed,
                     y = titanic_imputed$survived,
                     label = "Titanic GLM")

# make a studio for the model
modelStudio(explainer)

The only feature displayed on the plot is gender, which is the first column in titanic_imputed.

Unless I'm missing something, it appears that there is no mention in the manual about how to change this. There's also no option for changing this in the actual plot.

Thanks.
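
A hedged workaround based on the observation above that the default plot follows the first column of the data: reorder the columns so the feature of interest comes first before building the explainer. This is a sketch, not a documented modelStudio option.

library("DALEX")
library("modelStudio")

# Hedged workaround: put the feature of interest (here `age`) first,
# since the default plot appears to follow column order.
cols <- c("age", setdiff(names(titanic_imputed), "age"))
titanic_reordered <- titanic_imputed[, cols]

model <- glm(survived ~ ., data = titanic_reordered, family = "binomial")
explainer <- explain(model,
                     data = titanic_reordered,
                     y = titanic_reordered$survived,
                     label = "Titanic GLM")
modelStudio(explainer)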

v1.0.2 release checklist

  • update fifa20
  • unify python pipelines with dalex notebook
  • update gifs
  • update dashboards
  • test examples
  • test vignettes
  • rhub::check_for_cran()
  • rhub::check_with_rdevel()
  • usethis::use_cran_comments()
  • devtools::submit_cran()
  • accept the mail
  • tag release on GitHub

Add NEWS file

to track changes in consecutive versions of the package
(see an example in DALEX or archivist)

โ“โ” FAQ & Troubleshooting โ”โ“

modelStudio FAQ & Troubleshooting

Most of the information is covered in the documentation: https://modelstudio.drwhy.ai/


✨ Please submit a new issue when dealing with potential bugs. Thanks! ✨


  • Error occurred during the modelStudio() computation
  • foo plot doesn't show up on the dashboard

  1. Read the console output of DALEX::explain(). There could be a warning message pointing to the solution of this problem.
  2. Read the console output of modelStudio(). There could be an error message (printed as a warning) pointing to the origin and solution of this problem.
  3. Make sure to update these R packages to their latest versions: DALEX, ingredients, iBreakDown.

  • modelStudio() output shows up as a white window in the RStudio Viewer

Solve this by updating RStudio. Please check if the output shows up properly in the browser (e.g. use the viewer = "browser" argument in modelStudio()).

  • y-axis labels go outside of the plot

Use modelStudio(..., options = ms_options(margin_left = 200)).
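
A minimal sketch combining the two fixes above; `explainer` is assumed to be an existing DALEX explainer.

library("modelStudio")

modelStudio(explainer,
            viewer = "browser",                        # render in the browser instead of the RStudio Viewer
            options = ms_options(margin_left = 200))   # widen the left margin so y-axis labels fit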
