
modelstudio's People

Contributors

hbaniecki, kyleniemeyer, pbiecek, piotrpiatyszek


modelstudio's Issues

Update documentation and DESCRIPTION

For example, in modelStudio() the description is outdated:
The main goal of this function is to connect two local model explainers: Ceteris Paribus and Break Down. It also shows global explainers for your model such as Partial Dependency and Feature Importance.
In the DESCRIPTION file, the 'Description' field needs to be updated.

DALEXverse 0.19.8 release summer 2019

Integration

  • readability: vignettes
  • readability: NEWS
  • readability: DESCRIPTION
  • consistency: pkgdown website
  • consistency: entry at DrWhy.AI webpage

assigned: @pbiecek

Code review

  • consistency: names of functions
  • consistency: names of files
  • consistency: names of variables in functions (local and global)
  • length: functions
  • readability: code (comments, constructions)

assigned: @maksymiuks

Feature review

  • readability: documentation (title, description, details)
  • readability: examples (relevant, complete, with comments)
  • reproducibility: tests (code coverage)
  • links to functions: \code

assigned: @WojciechKretowicz

error in the example

I was trying to execute an example for modelStudio

library("dime")
library("DALEX")

titanic <- na.omit(titanic)
set.seed(1313)
titanic_small <- titanic[sample(1:nrow(titanic), 500), c(1,2,3,6,7,9)]

model_titanic_glm <- glm(survived == "yes" ~ gender + age + fare + class + sibsp,
                         data = titanic_small, family = "binomial")

explain_titanic_glm <- explain(model_titanic_glm,
                               data = titanic_small[,-6],
                               y = titanic_small$survived == "yes",
                               label = "glm")

new_observation <- titanic_small[1:10,-6]

modelStudio(explain_titanic_glm, new_observation[1,])

but this ends with

> modelStudio(explain_titanic_glm, new_observation[1,])
  |                                                                        |   0%Error in ceteris_paribus.default(x, data, predict_function = predict_function,  : 
  promise already under evaluation: recursive default argument reference or earlier problems?

Enter a frame number, or 0 to exit   

1: modelStudio(explain_titanic_glm, new_observation[1, ])
2: modelStudio.explainer(explain_titanic_glm, new_observation[1, ])
3: modelStudio.default(x = x$model, new_observation = new_observation, facet_dim
4: ingredients::accumulated_dependency(x, data, predict_function, only_numerical
5: accumulated_dependency.R#51: accumulated_dependency.default(x, data, predict_
6: accumulated_dependency.R#91: ceteris_paribus.default(x, data, predict_functio
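
A hedged first troubleshooting step, echoing the FAQ advice further down this page: make sure DALEX, ingredients, and iBreakDown are at their latest versions before re-running the example. The GitHub paths below assume the ModelOriented organization.

# Hedged troubleshooting step: update the explanation packages named in the traceback.
# Repository paths assume the ModelOriented GitHub organization.
# install.packages("remotes")   # if remotes is not installed yet
remotes::install_github("ModelOriented/DALEX")
remotes::install_github("ModelOriented/ingredients")
remotes::install_github("ModelOriented/iBreakDown")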

new plot: scatterplot [EDA]

It would be great to have a new plot in the dashboard: a scatterplot for EDA.
In the FIFA example, I would like to see the relation between Player Value and Age.
This would nicely supplement the PDP for the model.
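
A minimal sketch of the requested plot, drawn outside the dashboard and assuming a data frame `fifa` with columns `age` and `value_eur` (the column names are an assumption, not taken from the package):

library("ggplot2")

# EDA scatterplot: Player Value vs Age. Player values span orders of magnitude,
# so a log scale on the y-axis is used here.
ggplot(fifa, aes(x = age, y = value_eur)) +
  geom_point(alpha = 0.3) +
  scale_y_log10() +
  labs(x = "Age", y = "Player Value (EUR)")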

problem in describe

Browse[2]> breakpoint_description <- ifelse(multiple_breakpoints, paste0("Breakpoints are identified at (",
+   variables, " = ", cut_name, " and ", variables, " = ",
+   round(df[cutpoint_additional, variables], 3), ")."),
+   paste0("Breakpoint is identified at (", variables, " = ",
+   cut_name, ")."))
Browse[2]> prefix <- paste0("The highest prediction occurs for (",
+   variables, " = ", max_name, "),", " while the lowest for (",
+   variables, " = ", min_name, ").\n", breakpoint_description)
Browse[2]> cutpoint <- ifelse(multiple_breakpoints, cutpoint_additional,
+   cutpoint)
Browse[2]> sufix <- describe_numeric_variable(original_x = attr(x,
+   "observations"), df = df, cutpoint = cutpoint, variables = variables)
Browse[2]> description <- paste(introduction, prefix, sufix, sep = "\n\n")
Browse[2]> description

Missing parts in documentation

Hi, I am one of the reviewers for your JOSS submission. I thought I'd put the things I miss in the documentation and the corresponding review checklist items here:

  • A statement of need: It is described what the software should solve, but I somehow miss what the target audience is. Is it researchers, machine learning practitioners, anyone interested in interpretable machine learning...?
  • Installation instructions: (this might be because I haven't used R much in the past year, as stated before I started the review). When I installed your package (on a Manjaro machine), I had issues because it also installed glmnet, which requires gcc-fortran (which I had to install using my package manager). First, I am wondering why it knew that it had to install glmnet: it is not mentioned in this library's DESCRIPTION (I assume it is a dependency of one of the other packages?). I am also not sure whether your README should mention that one might need to install gcc-fortran (because it is not directly used by your package). Just wanted to let you know that this might be an issue :)
  • Automated tests: The reviewing checklist asks "Are there automated tests or manual steps described so that the functionality of the software can be verified?" I can't find such a thing; maybe you can point me to it. (A minimal sketch of what such a test could look like follows this list.)
  • Community guidelines: The reviewing checklist asks "Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support" I can't find such a thing; maybe you can point me to it.
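
A minimal sketch of the kind of automated test the checklist asks about, assuming testthat and the titanic_imputed data from DALEX; the expectation below is illustrative and not taken from the package's actual test suite.

library("testthat")
library("DALEX")
library("modelStudio")

test_that("modelStudio() builds a dashboard without errors", {
  model <- glm(survived ~ ., data = titanic_imputed, family = "binomial")
  explainer <- explain(model,
                       data = titanic_imputed,
                       y = titanic_imputed$survived,
                       verbose = FALSE)
  # small N and B keep the test fast; expect_error(..., NA) asserts that no error occurs
  expect_error(modelStudio(explainer, N = 10, B = 2, show_info = FALSE), NA)
})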

v1.1.0 release checklist

  • add ms_update_options() and ms_update_observations() to the perks vignette
  • test vignettes
  • update dashboards
  • rhub::check_for_cran()
  • rhub::check_with_rdevel()
  • usethis::use_cran_comments()
  • devtools::submit_cran()
  • accept the mail
  • tag release on GitHub

Error in eval(predvars, data, env) : object 'parch' not found

I can't get your demonstration example to run. I also tried installing the newest versions of modelStudio and ingredients using devtools, but I still get this error:

[screenshot of the console error: Error in eval(predvars, data, env) : object 'parch' not found]

This is the output of sessionInfo():

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.8.0
LAPACK: /usr/lib/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_AT.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_AT.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_AT.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] modelStudio_0.1.8

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2         pillar_1.4.2       compiler_3.6.1     remotes_2.1.0      prettyunits_1.0.2 
 [6] ingredients_0.3.10 iterators_1.0.12   tools_3.6.1        testthat_2.2.1     digest_0.6.22     
[11] pkgbuild_1.0.6     pkgload_1.0.2      memoise_1.1.0      tibble_2.1.3       gtable_0.3.0      
[16] lattice_0.20-38    pkgconfig_2.0.3    rlang_0.4.1        Matrix_1.2-17      foreach_1.4.7     
[21] cli_1.1.0          rstudioapi_0.10    curl_4.2           withr_2.1.2        fs_1.3.1          
[26] desc_1.2.0         devtools_2.2.1     rprojroot_1.3-2    glmnet_2.0-18      grid_3.6.1        
[31] glue_1.3.1         R6_2.4.0           processx_3.4.1     DALEX_0.4.7        sessioninfo_1.1.1 
[36] ggplot2_3.2.1      callr_3.3.2        magrittr_1.5       usethis_1.5.1      backports_1.1.5   
[41] scales_1.0.0       codetools_0.2-16   ps_1.3.0           ellipsis_0.3.0     assertthat_0.2.1  
[46] colorspace_1.4-1   lazyeval_0.2.2     munsell_0.5.0      crayon_1.3.4   

Am I doing something wrong?

modelStudio(), explainer_mlr3() and NAs

Hi,

There's a glitch with modelStudio when using mlr3 pipelines with data with missing values.

It looks like modelStudio() doesn't know how to impute missing data before crunching the numbers, even when the user has incorporated a pipe operator for missing values in the mlr3 pipeline. In fact, modelStudio() does not even recognize mlr3 learners if their class is other than [1] "LearnerClassifRanger" "LearnerClassif" "Learner" "R6" (e.g. try class(learner) for a Random Forest learner). If you have a pipeline whose class is [1] "GraphLearner" "Learner" "R6", modelStudio() doesn't know how to handle it.

The DALEXtra package's explainer_mlr3() suffers from the same issue, although this can be dealt with by providing custom functions for the arguments predict_function and residual_function.

Below is an example of a pipeline that imputes missing data and then balances classes. Note that it works fine when there are no missing data, but returns an error otherwise.

Example 1: no missing data

library(tidyverse)
library(data.table)
library(tidymodels)
library(paradox)
library(mlr3) # NOTE: install mlr3 packages from GitHub, not CRAN, as they differ in a few things, e.g. with GitHub you tune the pipeline with $optimize(), while with CRAN you use $tune()
library(mlr3filters)
library(mlr3learners)
library(mlr3misc)
library(mlr3pipelines)
library(mlr3tuning)
library(DALEXtra)
library(modelStudio)

# Load task and make smaller so code runs faster
task <- tsk('sonar')
task$select(paste0('V', 1:10))

# Ratio values for class-balancing pipe operators
class_counts <- table(task$truth())
upsample_ratio <- class_counts[class_counts == max(class_counts)] / 
  class_counts[class_counts == min(class_counts)]
downsample_ratio <- 1 / upsample_ratio

# Pipe operators for class-balancing
# 1. Enrich minority class by factor 'ratio'
po_over <- po("classbalancing", id = "up", adjust = "minor", 
  reference = "minor", shuffle = FALSE, ratio = upsample_ratio)

# 2. Reduce majority class by factor '1/ratio'
po_under <- po("classbalancing", id = "down", adjust = "major", 
  reference = "major", shuffle = FALSE, ratio = downsample_ratio)

# Handle missing values
features_with_nas <- sort(task$missings() / task$nrow, decreasing = TRUE)
features_with_nas <- features_with_nas[features_with_nas != 0]

# Imputes values based on histogram
hist_imp <- po("imputehist", param_vals = 
  list(affect_columns = selector_name(names(features_with_nas))))

# Add an indicator column for each feature with missing values
# One-hot encode these new categorical columns, and then remove the categorical versions of them
miss_ind <- po("missind") %>>% 
  po("encode") %>>%
  po("select", 
     selector = selector_invert(selector_type("factor")), 
     id = 'dummy_encoding')

impute_data <- po("copy", 2) %>>%
  gunion(list(hist_imp, miss_ind)) %>>%
  po("featureunion")

impute_data$plot() # This is the Graph we'll add to the pipeline
impute_data$plot(html = TRUE)

# Random Forest learner with up- and down-balancing
rf <- lrn("classif.ranger", predict_type = "prob")

rf_up <- GraphLearner$new(
  po_over %>>%
    po('learner', rf, id = 'rf'),
  predict_type = 'prob'
)

rf_down <- GraphLearner$new(
  po_under %>>%
    po('learner', rf, id = 'rf'),
  predict_type = 'prob')

# All learners (Random Forest with up- and down-balancing)
learners <- list(
  rf_up,
  rf_down
)
names(learners) <- sapply(learners, function(x) x$id)

# Our pipeline
graph <- 
  impute_data %>>%
  po("branch", names(learners)) %>>% 
  gunion(unname(learners)) %>>%
  po("unbranch")

graph$plot() # Plot pipeline
graph$plot(html = TRUE) # Plot pipeline

pipe <- GraphLearner$new(graph) # Convert pipeline to learner
pipe$predict_type <- 'prob' # We want to predict probabilities and not classes.

param_set <- ParamSetCollection$new(list(
  ParamSet$new(list(pipe$param_set$params$branch.selection$clone()))
))

# Set up tuning instance
instance <- TuningInstance$new(
  task = task,
  learner = pipe,
  resampling = rsmp('cv', folds = 2),
  measures = msr('classif.bbrier'),
  param_set,
  terminator = term("evals", n_evals = 3), 
  store_models = TRUE)
tuner <- TunerRandomSearch$new()

# Tune pipe learner to find best-performing branch
tuner$optimize(instance)

# Take a look at the results
instance$result
print(instance$result$tune_x$branch.selection) # Best model

# Train pipeline
pipe$train(task)

################################################################################################
# DALEXtra and modelStudio stuff
################################################################################################

# First create custom functions for predictions and residuals
# We need custom functions because explain_mlr3() doesn't recognize the Graph Learner class of mlr3
predict_function_custom <- function(model, data) {
  pr <- model$
    predict_newdata(data)$
    data$
    prob[, 1]
  
  return(pr)
}

residual_function_custom <- function(model, data, y) {
  pr <- model$
    predict_newdata(data)
  
  y_hat <- pr$
    data$
    prob[, 1]
  
  return(as.integer(y == 0) - y_hat)
}

# Run the explainer - works fine with the above functions
explainer <- explain_mlr3(model = pipe,
  data = task$data()[, -1],
  y = as.integer(task$data()[, 1] == 'M'),
  predict_function = predict_function_custom,
  residual_function = residual_function_custom,
  label = "mlr3")

# HOWEVER: we have a classification task, but explainer thinks it's regression!
explainer$model_info

# Let's run modelStudio. You'll need to wait for a while
modelStudio(
  explainer, 
  new_observation = task$data()[6, -1]
)

# Ignore the warning about the data format. The argument `new_observation` is a `data.table`, so its class is `[1] "data.table" "data.frame"`,
# which is essentially a data frame. The class has two elements, but the condition only checks the first one.

Working just fine.

Example 2: missing data

library(tidyverse)
library(data.table)
library(tidymodels)
library(paradox)
library(mlr3)
library(mlr3filters)
library(mlr3learners)
library(mlr3misc)
library(mlr3pipelines)
library(mlr3tuning)
library(DALEXtra)
library(modelStudio)

# Load task and make smaller so code runs faster
task <- tsk('sonar')
task$select(paste0('V', 1:10))

# Create some missing data
data <- task$data()
data$V1[1:5] <- NA
task <- TaskClassif$new(data, id = 'sonar', target = 'Class')

# Ratio values for class-balancing pipe operators
class_counts <- table(task$truth())
upsample_ratio <- class_counts[class_counts == max(class_counts)] / 
  class_counts[class_counts == min(class_counts)]
downsample_ratio <- 1 / upsample_ratio

# Pipe operators for class-balancing
# 1. Enrich minority class by factor 'ratio'
po_over <- po("classbalancing", id = "up", adjust = "minor", 
  reference = "minor", shuffle = FALSE, ratio = upsample_ratio)

# 2. Reduce majority class by factor '1/ratio'
po_under <- po("classbalancing", id = "down", adjust = "major", 
  reference = "major", shuffle = FALSE, ratio = downsample_ratio)

# Handle missing values
features_with_nas <- sort(task$missings() / task$nrow, decreasing = TRUE)
features_with_nas <- features_with_nas[features_with_nas != 0]

# Imputes values based on histogram
hist_imp <- po("imputehist", param_vals = 
  list(affect_columns = selector_name(names(features_with_nas))))

# Add an indicator column for each feature with missing values
# One-hot encode these new categorical columns, and then remove the categorical versions of them
miss_ind <- po("missind") %>>% 
  po("encode") %>>%
  po("select", 
     selector = selector_invert(selector_type("factor")), 
     id = 'dummy_encoding')

impute_data <- po("copy", 2) %>>%
  gunion(list(hist_imp, miss_ind)) %>>%
  po("featureunion")

impute_data$plot() # This is the Graph we'll add to the pipeline
impute_data$plot(html = TRUE)

# Random Forest learner with up- and down-balancing
rf <- lrn("classif.ranger", predict_type = "prob")

rf_up <- GraphLearner$new(
  po_over %>>%
    po('learner', rf, id = 'rf'),
  predict_type = 'prob'
)

rf_down <- GraphLearner$new(
  po_under %>>%
    po('learner', rf, id = 'rf'),
  predict_type = 'prob')

# All learners (Random Forest with up- and down-balancing)
learners <- list(
  rf_up,
  rf_down
)
names(learners) <- sapply(learners, function(x) x$id)

# Our pipeline
graph <- 
  impute_data %>>%
  po("branch", names(learners)) %>>% 
  gunion(unname(learners)) %>>%
  po("unbranch")

graph$plot() # Plot pipeline
graph$plot(html = TRUE) # Plot pipeline

pipe <- GraphLearner$new(graph) # Convert pipeline to learner
pipe$predict_type <- 'prob' # We want to predict probabilities and not classes.

param_set <- ParamSetCollection$new(list(
  ParamSet$new(list(pipe$param_set$params$branch.selection$clone()))
))

# Set up tuning instance
instance <- TuningInstance$new(
  task = task,
  learner = pipe,
  resampling = rsmp('cv', folds = 2),
  measures = msr('classif.bbrier'),
  param_set,
  terminator = term("evals", n_evals = 3), 
  store_models = TRUE)
tuner <- TunerRandomSearch$new()

# Tune pipe learner to find best-performing branch
tuner$optimize(instance)

# Take a look at the results
instance$result
print(instance$result$tune_x$branch.selection) # Best model

# Train pipeline
pipe$train(task)

################################################################################################
# DALEXtra and modelStudio stuff
################################################################################################

# First create custom functions for predictions and residuals
# We need custom functions because explain_mlr3() doesn't recognize the Graph Learner class of mlr3
predict_function_custom <- function(model, data) {
  pr <- model$
    predict_newdata(data)$
    data$
    prob[, 1]
  
  return(pr)
}

residual_function_custom <- function(model, data, y) {
  pr <- model$
    predict_newdata(data)
  
  y_hat <- pr$
    data$
    prob[, 1]
  
  return(as.integer(y == 0) - y_hat)
}

# Run the explainer - works fine with the above functions
explainer <- explain_mlr3(model = pipe,
  data = task$data()[, -1],
  y = as.integer(task$data()[, 1] == 'M'),
  predict_function = predict_function_custom,
  residual_function = residual_function_custom,
  label = "mlr3")

# HOWEVER: we have a classification task, but explainer thinks it's regression!
explainer$model_info

# Let's run modelStudio. You'll need to wait for a while
modelStudio(
  explainer, 
  new_observation = task$data()[6, -1]
)

# Ignore the warning about the data format. The argument `new_observation` is a `data.table`, so its class is `[1] "data.table" "data.frame"`,
# which is essentially a data frame. The class has two elements, but the condition only checks the first one.

We get errors and no plot:

Calculating ... 
  Calculating ingredients::feature_importance 
  Calculating ingredients::partial_dependence (numerical) 
  Calculating ingredients::accumulated_dependence (numerical) 
    Elapsed time: 00:01:01 ETA...Error in seq.default(min(x[, name]), max(x[, name]), length.out = nbins) : 
  'from' must be a finite number
In addition: Warning messages:
1: In value[[3L]](cond) : 
Error occurred in ingredients::partial_dependence (numerical) function: missing values and NaN's not allowed if 'na.rm' is FALSE
2: In value[[3L]](cond) : 
Error occurred in ingredients::accumulated_dependence (numerical) function: missing values and NaN's not allowed if 'na.rm' is FALSE

Is there a way to pass imputed data from explainer_mlr3() to modelStudio(), just like you can pass predictions and residuals with the arguments predict_function and residual_function, respectively? Any chance of implementing this, please?

Thanks
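
One hedged workaround while waiting for such an argument: impute the data before building the explainer and pass the complete frame through the data argument, so that ingredients never sees NAs. The sketch below reuses `pipe`, `task`, and the custom functions from the example above; the median imputation of V1 is purely illustrative, not the pipeline's imputehist step.

# Hedged sketch: pass pre-imputed data to the explainer so that
# partial/accumulated dependence never operate on missing values.
# `pipe`, `predict_function_custom`, and `residual_function_custom`
# come from the example above; the imputation below is illustrative.
data_imputed <- task$data()[, -1]
data_imputed$V1[is.na(data_imputed$V1)] <- median(data_imputed$V1, na.rm = TRUE)

explainer_imputed <- explain_mlr3(model = pipe,
  data = data_imputed,
  y = as.integer(task$data()[, 1] == 'M'),
  predict_function = predict_function_custom,
  residual_function = residual_function_custom,
  label = "mlr3")

modelStudio(explainer_imputed, new_observation = data_imputed[6, ])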

TODO

  • change mlr? example to the regression model on DALEX::apartments
  • change sklearn? example to the regression model on dalex fifa
  • use Explainer.dump() in python examples
  • use ranger instead of randomForest (everywhere)
  • pip install dalex console chunk
  • update gifs
  • add parsnip example
  • use macos devel in gh-actions
  • citation
  • change default B = 10, N = 300 to support "fast feedback loop" process
  • add N/n_samples to feature_importance calculation
  • remove d3 from DESC and README
  • remove txtProgressBar import
  • remove covr from suggests
  • fix wrong vignette indexEntry p&r
  • write blog about IEMA

Display feature of interest in plots

Hi,

I would like to know how to display the features of interest on a modelStudio plot. It looks like modelStudio chooses the first feature in the data frame by default, and information on the rest of the features is only available by hovering over the plots.

Example from modelStudio website:

library("DALEX")
library("modelStudio")

# fit a model
model <- glm(survived ~., data = titanic_imputed, family = "binomial")

# create an explainer for the model    
explainer <- explain(model,
                     data = titanic_imputed,
                     y = titanic_imputed$survived,
                     label = "Titanic GLM")

# make a studio for the model
modelStudio(explainer)

The only feature displayed on the plot is gender, which is the first column in titanic_imputed.

Unless I'm missing something, it appears that there is no mention in the manual about how to change this. There's also no option for changing this in the actual plot.

Thanks.
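
A hedged workaround based on the observation above that the default plot follows the first column of the data: reorder the columns so the feature of interest comes first before building the explainer. This is a sketch, not a documented modelStudio option.

library("DALEX")
library("modelStudio")

# Hedged workaround: put the feature of interest (here `age`) first,
# since the default plot appears to follow column order.
cols <- c("age", setdiff(names(titanic_imputed), "age"))
titanic_reordered <- titanic_imputed[, cols]

model <- glm(survived ~ ., data = titanic_reordered, family = "binomial")
explainer <- explain(model,
                     data = titanic_reordered,
                     y = titanic_reordered$survived,
                     label = "Titanic GLM")
modelStudio(explainer)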

v1.0.2 release checklist

  • update fifa20
  • unify python pipelines with dalex notebook
  • update gifs
  • update dashboards
  • test examples
  • test vignettes
  • rhub::check_for_cran()
  • rhub::check_with_rdevel()
  • usethis::use_cran_comments()
  • devtools::submit_cran()
  • accept the mail
  • tag release on GitHub

Add NEWS file

to track changes in consecutive versions of the package
(see an example in DALEX or archivist)

โ“โ” FAQ & Troubleshooting โ”โ“

modelStudio FAQ & Troubleshooting

Most of the information is covered in the documentation: https://modelstudio.drwhy.ai/


✨ Please submit a new issue when dealing with potential bugs. Thanks! ✨


  • Error occurred during the modelStudio() computation
  • foo plot doesn't show up on the dashboard

  1. Read the console output of DALEX::explain(). There could be a warning message pointing to the solution of this problem.
  2. Read the console output of modelStudio(). There could be an error message (printed as a warning) pointing to the origin and solution of this problem.
  3. Make sure to update these R packages to their latest versions: DALEX, ingredients, iBreakDown.

  • modelStudio() output shows up as a white window in the RStudio Viewer

Solve this by updating RStudio. Please check if the output shows up properly in the browser (e.g. use the viewer = "browser" argument in modelStudio()).

  • y-axis labels go outside of the plot

Use modelStudio(..., options = ms_options(margin_left = 200)).
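
A minimal sketch combining the two fixes above; `explainer` is assumed to be an existing DALEX explainer.

library("modelStudio")

modelStudio(explainer,
            viewer = "browser",                        # render in the browser instead of the RStudio Viewer
            options = ms_options(margin_left = 200))   # widen the left margin so y-axis labels fit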
