norskregnesentral / shapr
Explaining the output of machine learning models with more accurately estimated Shapley values
Home Page: https://norskregnesentral.github.io/shapr/
License: Other
Hi @expectopatronum
Just opening an issue to handle the reference comments from openjournals/joss-reviews#2027 (comment)
Should be aligned with the argument in shapley_weights().
A list of potentially redundant function arguments that we could remove to simplify the code:
- The reduce_dim argument of feature_combinations (always set it to TRUE).
- If we remove the reduce_dim argument, we probably don't need the use_shapley_weights_in_W argument in weight_matrix, as the no-column is always 1 (check!!!), and then we can simplify this code by always using the shapley_weight column as the weights.
- The type argument in inv_gaussian_transform (Line 89 in 2e20a1b).
Just leaving a few possible tasks for further improvement of the package.
I suggest changing the "Articles" header on the pkgdown site to "Vignettes", OR (if we plan to put the JOSS paper there as well once it is accepted) changing the title of the button you click to open the vignette to something like "Vignette: Understanding shapr", making it clear that this is a vignette.
Should include the following:
- testthat and covr
- styler & lintr
- roxygen2
- NEWS.md
Currently we output the explanations as a matrix in Kshap. Maybe a data.table with the original column names would be better?
Also, adding the actual predictions for the test data to the output list would be good.
If you're working on one of these files, please add the URL to your branch or the pull request. Mark the box if the changes are merged with master.
- clustering.R
- explanation.R #128
- features.R #70
- observations.R #95
- plot.R #86
- predictions.R #109
- sampling.R #92
- shapley.R #128
- transformation.R #77
- utils.R (currently there are zero functions in R/utils.R)
- models.R #129
The following files should not be tested:
- R/shapr-package.R
- R/zzz.R
Create a test that shows that (a special case of) our implementation of independence kernelSHAP gives the same results as Lundberg's implementation.
Do this in an R Markdown document? And put it in a folder under inst?
Link to the comparison from the README.
I made a template at https://github.com/NorskRegnesentral/shapr/tree/nikolai/add_vignette. You could continue pushing to this branch once you have time to work on this.
Some examples of vignettes:
@frycast The reason why the DESCRIPTION file states the following
License: MIT + file LICENSE
is described in http://r-pkgs.had.co.nz/description.html#license. See the section on the MIT license.
You can also see a similar discussion here: openjournals/joss-reviews#1863.
Allow the user to supply the prediction function in an appropriate form, as in LIME.
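A minimal sketch of what such an interface could look like, assuming a hypothetical explain_with_fun() wrapper (not shapr's actual API) that accepts a user-supplied prediction function:

```r
# Hypothetical wrapper: instead of dispatching on the model class,
# let the user pass the prediction function directly, LIME-style.
explain_with_fun <- function(model, x_test, predict_fun) {
  # predict_fun must map (model, newdata) to a numeric vector
  predict_fun(model, x_test)
}

# Usage with a plain lm model:
lm_mod <- lm(mpg ~ wt, data = mtcars)
preds <- explain_with_fun(
  lm_mod, mtcars,
  predict_fun = function(m, d) predict(m, newdata = d)
)
```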
When expanding the package to handle categorical variables, it is beneficial to only allow the input data (training and testing) to be of class data.frame (or data.table). This allows us to stop the procedure if someone tries to use the existing numeric methods when categorical variables are provided (or potentially one-hot encode them in that case).
Agree, @nikolase90?
People are going to install the package from GitHub, and in that case it is nice if the files are already generated.
Currently the user must supply the kernelshap function with a list specifying all details of the empirical conditional approach. There are default values for the full list, but if the user just needs to change one of these values, they need to specify all the other default values manually, which is very inconvenient. Thus, if an element of the list is not supplied, the default value should be used.
I think the easiest way to achieve this is to start with the default list at the top of the function and modify it with the elements of the list supplied in the function call.
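This merge-with-defaults pattern can be sketched with base R's modifyList(); the element names below are illustrative placeholders, not shapr's actual ones:

```r
# Illustrative defaults for the empirical conditional approach;
# the element names are hypothetical placeholders.
default_args <- list(
  type = "fixed_sigma",
  fixed_sigma_vec = 0.1,
  w_threshold = 0.95
)

# Merge the user-supplied list into the defaults: elements not
# supplied by the user keep their default value.
merge_with_defaults <- function(user_args = list()) {
  modifyList(default_args, user_args)
}

# The user only overrides a single element:
args <- merge_with_defaults(list(w_threshold = 0.9))
```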
Currently, the sample_gaussian and sample_copula functions can take input in a number of different formats. We should restrict this to a single format to reduce the possibility of errors. Both px1 and 1xp are currently OK per the documentation, but px1 will actually fail if m = 0 or p.
I suggest using data.table all the way, or a 1xp matrix.
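A sketch of a small input normalizer that could enforce a single canonical 1xp matrix format (the helper name is hypothetical):

```r
# Hypothetical helper: coerce a length-p input (vector, px1 or 1xp
# matrix) into a canonical 1 x p matrix, failing loudly otherwise.
as_row_matrix <- function(x, p) {
  x <- as.numeric(x)
  if (length(x) != p) {
    stop("Input must contain exactly p = ", p, " values.")
  }
  matrix(x, nrow = 1)
}

# Both orientations now yield the same canonical shape:
a <- as_row_matrix(matrix(1:3, ncol = 1), p = 3)  # px1 input
b <- as_row_matrix(matrix(1:3, nrow = 1), p = 3)  # 1xp input
```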
The following files should be adjusted:
Line 20 in 991dfdb
This should be gaussian, not gassuian.
We should add 4 arguments for parallelization:
We should also add a test checking that either argument 3, or both arguments 1 and 2, are set to 1 core, to avoid nested parallelization.
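The guard described above could be sketched like this (the argument names are hypothetical stand-ins for the three core-count arguments):

```r
# Hypothetical validation: nested parallelization is avoided when
# either the third core-count argument, or both the first and the
# second, equal 1.
check_parallel_args <- function(n_cores_1, n_cores_2, n_cores_3) {
  ok <- n_cores_3 == 1 || (n_cores_1 == 1 && n_cores_2 == 1)
  if (!ok) {
    stop("Set n_cores_3, or both n_cores_1 and n_cores_2, to 1 core ",
         "to avoid nested parallelization.")
  }
  invisible(TRUE)
}
```

For example, check_parallel_args(1, 1, 4) passes, while check_parallel_args(2, 2, 2) errors.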
Currently, some of the examples under inst/scripts still follow the style from the old package.
Steps:
- Move tests/testthat/test-sample_combinations.R to tests/testthat/test-sampling.R
- sample_gaussian() (located in R/sampling.R)
- sample_copula() (located in R/sampling.R)
Note that it is not necessary to create tests for sample_combinations() since they are already written. sample_combinations() is located in R/extra.R. After moving the function into R/sampling.R you can delete R/extra.R.
- exact = FALSE if m is greater than a given number
- The D matrix if you run the code again
- n_samples as an argument in explain
- type in explain.combined
- explain.combined
- n_combinations in feature_combinations
- x_test in prepare_data.copula
- shapr and explain (want to check that the correct features are passed into the functions)
There seems to be a bug regarding our use of data.table and global variables.
This chunk of code throws an error about the m variable not being defined when used in w_shapley, although it seems to be passed as it should be:
l.1 <- prepare_kernelShap(
  m = 3,
  Xtrain = as.data.frame(matrix(rnorm(30), ncol = 3)),
  Xtest = as.data.frame(matrix(rnorm(30), ncol = 3)),
  exact = FALSE,
  nrows = 10,
  scale = FALSE
)
Defining m as a global variable fixes the issue:
m <- 3
l.2 <- prepare_kernelShap(
  m = 3,
  Xtrain = as.data.frame(matrix(rnorm(30), ncol = 3)),
  Xtest = as.data.frame(matrix(rnorm(30), ncol = 3)),
  exact = FALSE,
  nrows = 10,
  scale = FALSE
)
Warning: Setting m equal to something else gives an incorrect answer even if it is not passed to the function, so this is actually quite dangerous.
The w_shapley function is called within a data.table call, and apparently it does not look for m as defined within the scope of the function where it is present, but rather in the global scope... I can't see what we are doing wrong here. Am I missing something obvious?
Please take a look when you are back from vacation, @nikolase90.
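A minimal, self-contained illustration of what is likely going on (the function names are illustrative, not shapr's). Note that this looks like standard R lexical scoping rather than anything data.table-specific: a helper defined at top level resolves free variables in its enclosing (global) environment, not in the caller's frame, so passing the variable explicitly is the robust fix:

```r
# weight_fun has `m` as a free variable; R resolves it in the
# function's enclosing environment (here: the global environment),
# NOT in the environment of whoever calls weight_fun.
weight_fun <- function(s) s / m

broken <- function(m) {
  # Fails with "object 'm' not found" unless a global `m` exists,
  # even though `m` is an argument of this function.
  sapply(1:3, weight_fun)
}

# Robust fix: pass m explicitly instead of relying on scoping.
weight_fun2 <- function(s, m) s / m
fixed <- function(m) sapply(1:3, weight_fun2, m = m)
```

This also explains the "dangerous" behavior noted above: if a stale global m exists, broken() silently uses it instead of the argument.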
The following needs to be done:
devtools::check_man()
When package gbm is missing, the vignette fails to knit, and the error message doesn't make it very clear that gbm is the culprit.
I suggest adding library(gbm) somewhere before line 399, so that a more informative error message is printed.
Line 20 in 2e20a1b
Include a section on "Advanced usage" or similar, covering how to explain a custom model. Use the example script under inst.
For the "Advanced usage" section, include an example with the combined approach. See the example in the documentation in #134.
We may also add an example with the independence version here (so people can easily see that the results differ).
Fix the broken link to the comparison with Lundberg's Python implementation.
This is a list of the things we must do before releasing a version to CRAN and submitting a paper to JOSS:
Find and fix all spelling errors using spelling::spell_check_package(".").