shapr's People

Contributors

andersloland, aredelmeier, camiling, jenswahl, lhbo, martinju, nikolase90, rawanmahdi

shapr's Issues

[review-expectopatronum] References

Hi @expectopatronum
Just opening an issue to handle the reference comments from openjournals/joss-reviews#2027 (comment)

  • 1. Lundberg, S. (2019): I am not sure this is correct/necessary. On their GitHub page they state which references should be used, and this one doesn't seem to be among them. (In case it actually should be there, SHAP should be capitalized.)
  • 2. Maybe this also needs to be cited (very recent): https://raw.githubusercontent.com/slundberg/shap/master/docs/references/tree_explainer.bib
  • 3. Pedersen & Benesty: Title should be "lime: Local Interpretable Model-Agnostic Explanations"
  • 4. Furthermore, capitalization of the titles and conference names is not consistent (some are capitalized, others are all lower case).

Fix redundant arguments

A list of potentially redundant function arguments that we could remove to simplify code.

  • We may delete the reduce_dim argument of feature_combinations (always set it to TRUE)
  • If we delete the reduce_dim argument, we probably don't need the use_shapley_weights_in_W argument in weight_matrix either, as the no-column is always 1 (check!), in which case we can simplify the code by always using the shapley_weight column as the weights.
  • Remove type in inv_gaussian_transform

Package development

Just leaving a few possible tasks for further improvement of the package:

  • Should perhaps parallelize the distance computations in prepare_kernelShap, but computing 10000x100x1002 takes a minute or two and is 7.5 GB in size, so much more than that cannot be done anyway without spilling to disk (though we could do that?)
  • If we are using the Gaussian approach, the distance computations are not needed, so the computations could be sped up significantly.
  • We should add three arguments for parallelization: one concerning parallelization of the prediction method (which is passed to pred_vector), a second concerning parallelization over test samples in compute_kernelShap (add a check that at least one of these is set to 1), and a third for parallelization of the distance computation in prepare_kernelShap.

Change header at the pkgdown site

I suggest changing the Article header on the pkgdown site to "Vignettes", OR (if we plan to put the JOSS paper here as well once it is accepted) renaming the button that opens the vignette to something like "Vignette: Understanding shapr", making clear that it is a vignette.

Add contribution paragraph in README

Should include the following:

  • How we use testthat and covr
  • Style of code and linting (i.e. using styler & lintr)
  • Documentation of functions using roxygen2
  • When to update NEWS.md

Output of compute_kshap

Currently we output the explanations as a matrix (Kshap). Maybe a data.table with the original column names is better?

Also, adding the actual predictions for the test data in the output list would be good.

Add tests using testthat

If you're working on one of these files, please add the URL to your branch or pull request. Mark the box once the changes are merged into master.

R-files

  • clustering.R
  • explanation.R #128
  • features.R #70
  • observations.R #95
  • plot.R #86
  • predictions.R #109
  • sampling.R #92
  • shapley.R #128
  • transformation.R #77
  • utils.R Currently there are zero functions in R/utils.R.
  • models.R #129

src-files

  • AICc.cpp
  • distance.cpp
  • impute_data.cpp #78
  • weighted_matrix.cpp #73

The following files should not be tested

  • R/shapr-package.R
  • R/zzz.R

Require input data to be of class data.frame

When expanding the package to handle categorical variables, it is beneficial to only allow input data (training and testing) of class data.frame (or data.table). This allows us to stop the procedure if someone tries to use the existing numeric methods when categorical variables are provided (or potentially to one-hot encode them in that case).
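A hedged sketch of such a check (the helper name and return value are made up for illustration; the real check would live wherever inputs are first validated):

```r
# Validate that input data is a data.frame (data.table inherits from data.frame),
# and report whether it contains categorical variables.
check_input_data <- function(x) {
  if (!is.data.frame(x)) {
    stop("Input data must be a data.frame or data.table, not a ", class(x)[1], ".")
  }
  # With a data.frame we can detect categorical columns up front
  any(vapply(x, function(col) is.factor(col) || is.character(col), logical(1)))
}
```

Returning the categorical flag would let the caller decide whether to stop or to one-hot encode.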

Agree @nikolase90 ?

Simplify user modification of the empirical approach

Currently the user must supply the kernelshap function with a list specifying all details of the empirical conditional approach. There are default values for the full list, but a user who just needs to change one of these values must specify all the other default values manually, which is very inconvenient. Thus, if an element of the list is not supplied, the default value should be used.

I think the easiest way out of this is to start with the default list at the top of the function and modify it with the elements of the list supplied in the function call.
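A minimal sketch of that merge step using base R's modifyList() (the element names below are hypothetical, not necessarily the real argument names):

```r
# Hypothetical defaults for the empirical approach, defined at the top of the function
default_args <- list(type = "fixed_sigma", fixed_sigma_vec = 0.1, n_samples_aicc = 1000)

# The user supplies only the values they want to override
user_args <- list(fixed_sigma_vec = 0.5)

# modifyList() keeps the default for any element the user did not supply
args <- modifyList(default_args, user_args)
```

After the merge, args$fixed_sigma_vec holds the user's value while the other elements keep their defaults.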

Require specific format in sample_** functions

Currently, the sample_gaussian and sample_copula functions accept input in a number of different formats. We should restrict this to a single format to reduce the possibility of errors. Both px1 and 1xp are currently OK per the documentation, but px1 will actually fail if m equals 0 or p.

I suggest using data.table all the way, or a 1xp matrix.
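If we settle on the 1xp matrix format, one hedged way to enforce it at the top of the sample_** functions (the helper name is made up):

```r
# Coerce/validate input to a 1 x p matrix; anything else is rejected.
as_row_matrix <- function(x, p) {
  x <- as.matrix(x)
  if (nrow(x) == p && ncol(x) == 1) x <- t(x)  # accept p x 1, but normalize it
  if (nrow(x) != 1 || ncol(x) != p) {
    stop("Expected a single observation with ", p, " features.")
  }
  x
}
```

With every downstream computation seeing exactly one shape, the m = 0 or m = p edge cases can no longer hit a mismatched orientation.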

Ref #120 and #122

Add linting of package

The following files should be adjusted:

  • clustering.R
  • explanation.R
  • features.R
  • models.R
  • observations.R
  • plot.R
  • predictions.R
  • sampling.R
  • shapley.R
  • shapr-package.R
  • transformation.R
  • utils.R

Parallelization

We should add 4 arguments for parallelization:

  1. One concerning parallelization of the predictions method (which is passed to prediction_vector)
  2. One for parallelization of the sampling method when either the Gaussian or copula method is used.
  3. One which concerns parallelization over test samples in compute_kshap
  4. One for parallelization of distance computation in prepare_kshap.

We should also add a check that either argument 3, or both arguments 1 and 2, are set to 1 core, to avoid parallelization within parallelization.
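A sketch of that guard, assuming the four arguments are plain core counts (all names here are hypothetical):

```r
# Parallelizing over test samples (argument 3) nests the prediction (1) and
# sampling (2) steps, so either 3 runs on one core, or both 1 and 2 do.
check_parallel_args <- function(n_cores_pred, n_cores_sampling,
                                n_cores_testobs, n_cores_dist) {
  if (n_cores_testobs > 1 && (n_cores_pred > 1 || n_cores_sampling > 1)) {
    stop("Either set the test-sample parallelization to 1 core, or set both ",
         "the prediction and sampling parallelization to 1 core.")
  }
  invisible(TRUE)
}
```

The distance computation (argument 4) runs before the per-observation loop, so it presumably does not need to be part of the check.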

Tests for functions in R/sampling.R

Steps:

  • Rename file tests/testthat/test-sample_combinations.R to tests/testthat/test-sampling.R
  • Add tests for the following two functions:
    • sample_gaussian() (located in R/sampling.R)
    • sample_copula() (located in R/sampling.R)

Note that it is not necessary to create tests for sample_combinations, since those are already written.

Enhancements

  • Update shapr with feature labels
  • Add check that uses exact = FALSE if m is greater than a given number
  • Find a way to re-use D matrix if you run code again
  • Add n_samples as an argument in explain
  • Fix documentation for type in explain.combined
  • Don't copy code in explain.combined
  • Rename argument noSamp as n_combinations in feature_combinations
  • Rename x_test in prepare_data.copula
  • Add check for single prediction in shapr and explain (want to check that correct features are passed into the functions)

Variable in scope

There seems to be a bug regarding our use of data.table and global variables.

This chunk of code throws an error about the m variable not being defined when used in w_shapley, although it seems to be passed as it should be.

l.1 <- prepare_kernelShap(
    m = 3,
    Xtrain = as.data.frame(matrix(rnorm(30), ncol = 3)),
    Xtest = as.data.frame(matrix(rnorm(30), ncol = 3)),
    exact = FALSE,
    nrows = 10,
    scale = FALSE)

Defining m as a global variable fixes the issue:

m <- 3  # defined in the global environment
l.2 <- prepare_kernelShap(
    m = 3,
    Xtrain = as.data.frame(matrix(rnorm(30), ncol = 3)),
    Xtest = as.data.frame(matrix(rnorm(30), ncol = 3)),
    exact = FALSE,
    nrows = 10,
    scale = FALSE)

Warning: Setting m equal to something else gives an incorrect answer even if it is not passed to the function, so this is actually quite dangerous.

The w_shapley function is called within a data.table call, and apparently it does not look for m as defined within the scope of the function where it is present, but rather in the global scope... I can't see what we are doing wrong here. Am I missing something obvious?
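For what it's worth, this is consistent with R's lexical scoping: a function resolves free variables in the environment where it was defined, not in the frame of its caller. A minimal demonstration of the same mechanism, unrelated to data.table:

```r
g <- function() m * 2   # m is free in g: looked up where g was defined (here, globalenv())
f <- function() {
  m <- 5                # local to f(); invisible to g()
  g()
}
# Without a global m, f() fails with "object 'm' not found"
m <- 3
f()  # returns 6 (the global m = 3), not 10
```

If w_shapley is defined at the top level, it would likewise resolve m in the package/global environment rather than in the function (or data.table call) where it is used, which would explain both the error and the dangerous behaviour noted above; passing m explicitly as an argument to w_shapley would sidestep the lookup entirely.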

Please take a look when you are back from vacation, @nikolase90.

Improve documentation

The following needs to be done:

  • All arguments in functions should be properly documented.
  • All functions should have good examples (OK, not all, but as many as possible).
  • There should be zero warnings when running devtools::check_man()

Vignette fails with uninformative error

When package gbm is missing, the vignette fails to knit, and the error message doesn't make it very clear that gbm is the culprit.

I suggest adding library(gbm) somewhere before line 399, so that a more informative error message is printed.
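An alternative to library(gbm) is a small guard that fails with an explicit message, a common pattern in vignettes (the helper name is made up):

```r
# Stop with a clear message if a suggested package is missing.
vignette_requires <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("The '", pkg, "' package is required to build this vignette. ",
         "Install it with install.packages(\"", pkg, "\").", call. = FALSE)
  }
  invisible(TRUE)
}

# At the top of the vignette:
# vignette_requires("gbm")
```

Either approach makes gbm the obvious culprit instead of an unrelated downstream error.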

Update vignette

  • Include section on "Advanced usage" or so including how to explain a custom model. Use the example script under inst.

  • For the "Advanced usage" section include an example with the combined approach. See the example in the documentation in #134 .

  • We may also add an example with the independence version here (so people can easily see that the results differ).

  • Fix the broken link to the comparison with Lundberg's Python implementation.

First release and JOSS

This is a list of the things we must do before releasing a version to CRAN and submitting a paper to JOSS.

Switch to solely using data.table internally

After fixing #120, I suggest we switch to using data.table throughout the package, only converting to matrix when calling the cpp functions. The current mix of data.frame, data.table and matrix objects of various dimensions complicates maintenance. See also #123.
