shapr's People

Contributors

andersloland, aredelmeier, camiling, jenswahl, lhbo, martinju, nikolase90, rawanmahdi

shapr's Issues

[review-expectopatronum] References

Hi @expectopatronum
Just opening an issue to handle the reference comments from openjournals/joss-reviews#2027 (comment)

  • 1. Lundberg, S. (2019): I am not sure this is correct/necessary. On their GitHub page they state which references should be used, and this one doesn't seem to be among them. (In case it actually should be there, SHAP should be capitalized.)
  • 2. Maybe this also needs to be cited (very recent): https://raw.githubusercontent.com/slundberg/shap/master/docs/references/tree_explainer.bib
  • 3. Pedersen & Benesty: Title should be "lime: Local Interpretable Model-Agnostic Explanations"
  • 4. Furthermore, capitalization of the titles and conference names is not consistent (some are capitalized, others are all lower case).

Fix redundant arguments

A list of potentially redundant function arguments that we could remove to simplify code.

  • We may delete the reduce_dim argument of feature_combinations (always set it to TRUE)
  • If we delete the reduce_dim argument, we probably don't need the use_shapley_weights_in_W argument in weight_matrix either, as the no-column is always 1 (check!), in which case we can simplify the code by always using the shapley_weight column as the weights.
  • Remove type in inv_gaussian_transform

Package development

Just leaving a few possible tasks for further improvement of the package:

  • Should perhaps parallelize the distance computations in prepare_kernelShap, but computing 10000x100x1002 takes a minute or two and is 7.5 GB in size, so much more than that cannot be done anyway without spilling to disk (though we could do that?)
  • If we are using the Gaussian approach, the distance computations are not needed, so the computations could be sped up significantly.
  • We should add three arguments for parallelization: one concerning parallelization of the prediction method (which is passed to pred_vector), a second concerning parallelization over test samples in compute_kernelShap (add a check that at least one of these is set to 1), and a third for parallelization of the distance computation in prepare_kernelShap.

Change header at the pkgdown site

I suggest changing the Article header on the pkgdown site to "Vignettes", OR (if we plan to put the JOSS paper here as well once it is accepted) renaming the button that opens the vignette to something like "Vignette: Understanding shapr", making clear that it is a vignette.

Add contribution paragraph in README

Should include the following:

  • How we use testthat and covr
  • Style of code and linting (i.e. using styler & lintr)
  • Documentation of functions using roxygen2
  • When to update NEWS.md

Output of compute_kshap

Currently we output the explanations as a matrix (Kshap). Maybe a data.table with the original column names is better?

Also, adding the actual predictions for the test data in the output list would be good.

Add tests using testthat

If you're working on one of these files, please add the URL to your branch or pull request. Mark the box once the changes are merged into master.

R-files

  • clustering.R
  • explanation.R #128
  • features.R #70
  • observations.R #95
  • plot.R #86
  • predictions.R #109
  • sampling.R #92
  • shapley.R #128
  • transformation.R #77
  • utils.R Currently there are zero functions in R/utils.R.
  • models.R #129

src-files

  • AICc.cpp
  • distance.cpp
  • impute_data.cpp #78
  • weighted_matrix.cpp #73

The following files should not be tested

  • R/shapr-package.R
  • R/zzz.R

Require input data to be of class data.frame

When expanding the package to handle categorical variables, it is beneficial to only allow input data (training and testing) of class data.frame (or data.table). This allows us to stop the procedure if someone tries to use the existing numeric methods when categorical variables are provided (or potentially to one-hot encode them in that case).
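A hedged sketch of such a check (the helper name and return value are made up for illustration; the real check would live wherever inputs are first validated):

```r
# Validate that input data is a data.frame (data.table inherits from data.frame),
# and report whether it contains categorical variables.
check_input_data <- function(x) {
  if (!is.data.frame(x)) {
    stop("Input data must be a data.frame or data.table, not a ", class(x)[1], ".")
  }
  # With a data.frame we can detect categorical columns up front
  any(vapply(x, function(col) is.factor(col) || is.character(col), logical(1)))
}
```

Returning the categorical flag would let the caller decide whether to stop or to one-hot encode.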

Agree @nikolase90 ?

Simplify user modification of the empirical approach

Currently the user must supply the kernelshap function with a list specifying all details of the empirical conditional approach. There are default values for the full list, but a user who just needs to change one of these values must specify all the other default values manually, which is very inconvenient. Thus, if an element of the list is not supplied, the default value should be used.

I think the easiest way out of this is to start with the default list at the top of the function and modify it with the elements of the list supplied in the function call.
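A minimal sketch of that merge step using base R's modifyList() (the element names below are hypothetical, not necessarily the real argument names):

```r
# Hypothetical defaults for the empirical approach, defined at the top of the function
default_args <- list(type = "fixed_sigma", fixed_sigma_vec = 0.1, n_samples_aicc = 1000)

# The user supplies only the values they want to override
user_args <- list(fixed_sigma_vec = 0.5)

# modifyList() keeps the default for any element the user did not supply
args <- modifyList(default_args, user_args)
```

After the merge, args$fixed_sigma_vec holds the user's value while the other elements keep their defaults.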

Require specific format in sample_** functions

Currently, the sample_gaussian and sample_copula functions accept input in a number of different formats. We should restrict this to a single format to reduce the possibility of errors. Both px1 and 1xp are currently OK per the documentation, but px1 will actually fail if m equals 0 or p.

I suggest using data.table all the way, or a 1xp matrix.
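If we settle on the 1xp matrix format, one hedged way to enforce it at the top of the sample_** functions (the helper name is made up):

```r
# Coerce/validate input to a 1 x p matrix; anything else is rejected.
as_row_matrix <- function(x, p) {
  x <- as.matrix(x)
  if (nrow(x) == p && ncol(x) == 1) x <- t(x)  # accept p x 1, but normalize it
  if (nrow(x) != 1 || ncol(x) != p) {
    stop("Expected a single observation with ", p, " features.")
  }
  x
}
```

With every downstream computation seeing exactly one shape, the m = 0 or m = p edge cases can no longer hit a mismatched orientation.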

Ref #120 and #122

Add linting of package

The following files should be adjusted:

  • clustering.R
  • explanation.R
  • features.R
  • models.R
  • observations.R
  • plot.R
  • predictions.R
  • sampling.R
  • shapley.R
  • shapr-package.R
  • transformation.R
  • utils.R

Parallelization

We should add 4 arguments for parallelization:

  1. One concerning parallelization of the predictions method (which is passed to prediction_vector)
  2. One for parallelization of the sampling method when either the Gaussian or copula method is used.
  3. One which concerns parallelization over test samples in compute_kshap
  4. One for parallelization of distance computation in prepare_kshap.

We should also add a check that either argument 3, or both arguments 1 and 2, are set to 1 core, to avoid parallelization within parallelization.
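A sketch of that guard, assuming the four arguments are plain core counts (all names here are hypothetical):

```r
# Parallelizing over test samples (argument 3) nests the prediction (1) and
# sampling (2) steps, so either 3 runs on one core, or both 1 and 2 do.
check_parallel_args <- function(n_cores_pred, n_cores_sampling,
                                n_cores_testobs, n_cores_dist) {
  if (n_cores_testobs > 1 && (n_cores_pred > 1 || n_cores_sampling > 1)) {
    stop("Either set the test-sample parallelization to 1 core, or set both ",
         "the prediction and sampling parallelization to 1 core.")
  }
  invisible(TRUE)
}
```

The distance computation (argument 4) runs before the per-observation loop, so it presumably does not need to be part of the check.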

Tests for functions in R/sampling.R

Steps:

  • Rename file tests/testthat/test-sample_combinations.R to tests/testthat/test-sampling.R
  • Add tests for the following two functions:
    • sample_gaussian() (located in R/sampling.R)
    • sample_copula() (located in R/sampling.R)

Note that it is not necessary to create tests for sample_combinations, since those are already written.

Enhancements

  • Update shapr with feature labels
  • Add check that uses exact = FALSE if m is greater than a given number
  • Find a way to re-use D matrix if you run code again
  • Add n_samples as an argument in explain
  • Fix documentation for type in explain.combined
  • Don't copy code in explain.combined
  • Rename argument noSamp as n_combinations in feature_combinations
  • Rename x_test in prepare_data.copula
  • Add check for single prediction in shapr and explain (want to check that correct features are passed into the functions)

Variable in scope

There seems to be a bug regarding our use of data.table and global variables.

This chunk of code throws an error about the m variable not being defined when used in w_shapley, although it seems to be passed as it should be.

l.1 <- prepare_kernelShap(
    m = 3,
    Xtrain = as.data.frame(matrix(rnorm(30), ncol = 3)),
    Xtest = as.data.frame(matrix(rnorm(30), ncol = 3)),
    exact = FALSE,
    nrows = 10,
    scale = FALSE)

Defining m as a global variable fixes the issue:

m <- 3  # defined in the global environment
l.2 <- prepare_kernelShap(
    m = 3,
    Xtrain = as.data.frame(matrix(rnorm(30), ncol = 3)),
    Xtest = as.data.frame(matrix(rnorm(30), ncol = 3)),
    exact = FALSE,
    nrows = 10,
    scale = FALSE)

Warning: Setting m equal to something else gives an incorrect answer even if it is not passed to the function, so this is actually quite dangerous.

The w_shapley function is called within a data.table call, and apparently it does not look for m as defined within the scope of the function where it is present, but rather in the global scope... I can't see what we are doing wrong here. Am I missing something obvious?
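For what it's worth, this is consistent with R's lexical scoping: a function resolves free variables in the environment where it was defined, not in the frame of its caller. A minimal demonstration of the same mechanism, unrelated to data.table:

```r
g <- function() m * 2   # m is free in g: looked up where g was defined (here, globalenv())
f <- function() {
  m <- 5                # local to f(); invisible to g()
  g()
}
# Without a global m, f() fails with "object 'm' not found"
m <- 3
f()  # returns 6 (the global m = 3), not 10
```

If w_shapley is defined at the top level, it would likewise resolve m in the package/global environment rather than in the function (or data.table call) where it is used, which would explain both the error and the dangerous behaviour noted above; passing m explicitly as an argument to w_shapley would sidestep the lookup entirely.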

Please take a look when you are back from vacation, @nikolase90.

Improve documentation

The following needs to be done:

  • All arguments in functions should be properly documented.
  • All functions should have good examples (OK, not all, but as many as possible).
  • There should be zero warnings when running devtools::check_man()

Vignette fails with uninformative error

When package gbm is missing, the vignette fails to knit, and the error message doesn't make it very clear that gbm is the culprit.

I suggest adding library(gbm) somewhere before line 399, so that a more informative error message is printed.
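An alternative to library(gbm) is a small guard that fails with an explicit message, a common pattern in vignettes (the helper name is made up):

```r
# Stop with a clear message if a suggested package is missing.
vignette_requires <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("The '", pkg, "' package is required to build this vignette. ",
         "Install it with install.packages(\"", pkg, "\").", call. = FALSE)
  }
  invisible(TRUE)
}

# At the top of the vignette:
# vignette_requires("gbm")
```

Either approach makes gbm the obvious culprit instead of an unrelated downstream error.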

Update vignette

  • Include section on "Advanced usage" or so including how to explain a custom model. Use the example script under inst.

  • For the "Advanced usage" section include an example with the combined approach. See the example in the documentation in #134 .

  • We may also add an example with the independence version here (so people can easily see that the results differ).

  • Fix the broken link to the comparison with Lundberg's Python implementation.

First release and JOSS

This is a list of the things we must do before releasing a version to CRAN and submitting a paper to JOSS.

Switch to solely using data.table internally

After fixing #120, I suggest we switch to using data.table throughout the package, only converting to matrix when calling the cpp functions. The current mix of data.frame, data.table and matrix objects of various dimensions complicates maintenance. See also #123.
