
pdp


Overview

pdp is an R package for constructing partial dependence plots (PDPs) and individual conditional expectation (ICE) curves. PDPs and ICE curves are part of a larger framework referred to as interpretable machine learning (IML), which also includes (but is not limited to) variable importance plots (VIPs). While VIPs (available in the R package vip) help visualize feature impact (either locally or globally), PDPs and ICE curves help visualize feature effects. An in-progress, but comprehensive, overview of IML can be found at https://github.com/christophM/interpretable-ml-book.

A detailed introduction to pdp has been published in The R Journal: “pdp: An R Package for Constructing Partial Dependence Plots”, https://journal.r-project.org/archive/2017/RJ-2017-016/index.html. You can track development at https://github.com/bgreenwell/pdp. To report bugs or issues, contact the main author directly or submit them to https://github.com/bgreenwell/pdp/issues. For additional documentation and examples, visit the package website.

pdp currently exports the following functions:

  • partial() - compute partial dependence functions and individual conditional expectations (i.e., objects of class "partial" and "ice", respectively) from various fitted model objects;

  • plotPartial() - construct lattice-based PDPs and ICE curves;

  • autoplot() - construct ggplot2-based PDPs and ICE curves;

  • topPredictors() - extract the most “important” predictors from various types of fitted models (superseded; see vip for a more robust and flexible replacement);

  • exemplar() - construct an exemplar record from a data frame (an experimental feature that may be useful for constructing fast, approximate feature effect plots).

Installation

# The easiest way to get pdp is to install it from CRAN:
install.packages("pdp")

# Alternatively, you can install the development version from GitHub:
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("bgreenwell/pdp")
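A minimal usage sketch of the exported functions (this assumes the randomForest package is installed; the boston housing data ship with pdp):

```r
library(pdp)
library(randomForest)  # any supported model type works; randomForest is just an example

# Fit a random forest to the Boston housing data shipped with pdp
data(boston, package = "pdp")
set.seed(101)  # for reproducibility
rfo <- randomForest(cmedv ~ ., data = boston)

# Partial dependence of cmedv on lstat, plotted with lattice
pd <- partial(rfo, pred.var = "lstat")
plotPartial(pd)

# ggplot2-based version of the same plot
autoplot(pd)
```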


pdp's Issues

pdp: new gbm release

Hi!

We're coming up to a new release of gbm which considerably tidies the code and introduces a much stricter NAMESPACE. We've also removed the multinomial distribution, which was essentially unsupported and rather buggy code, and it looks like it's breaking your package.

Would it be possible for you to take a look?

00check.log.txt

support for gamma and poisson regression

Hi @bgreenwell, thank you so much for your awesome package. Really fantastic.

However, we have noticed that it only supports Gaussian regression and classification. Would it be possible to also implement gamma and Poisson regression? I am using xgboost 0.6.x.
Thanks in advance for your attention,
Giorgio

xgboost using xgb.DMatrix

I'm trying to use pdp with an xgboost model. The training data is mostly text that I am encoding using bag of words. So I start off with a data frame that contains a column of text. That text is tokenized and a sparse term document matrix created. This matrix is what is used to train the xgboost model.

It appears that pdp can't yet handle the situation where the training data for the xgboost model is a matrix rather than a data frame. I tried this:

partial(xgb_fit, train = train_matrix, pred.var = "price")
Error in partial.default(xgb_fit, train = train_matrix, pred.var = "price") : 
  price not found in the training data.

but as you can see it doesn't work. Here, train_matrix is an xgb.DMatrix object created from the bag-of-words term-document matrix (of class dgCMatrix).

Is it possible to use pdp to generate partial dependence plots in this case?
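One possible workaround (untested sketch): pass the underlying feature matrix, rather than the xgb.DMatrix, as `train`, and supply a `pred.fun` that rebuilds the xgb.DMatrix at prediction time. The object name `train_dgc` (the original dgCMatrix) is a placeholder, and `"price"` is carried over from the example above:

```r
# Sketch: keep train as an ordinary data frame so partial() can find "price",
# and convert back to xgb.DMatrix inside pred.fun before predicting
pfun <- function(object, newdata) {
  mean(predict(object, newdata = xgboost::xgb.DMatrix(as.matrix(newdata))))
}
partial(xgb_fit, pred.var = "price",
        train = as.data.frame(as.matrix(train_dgc)),  # train_dgc: placeholder dgCMatrix
        pred.fun = pfun, plot = TRUE)
```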

Deprecate super.type in favor of method

Borrowing from rpart:

if (missing(method)) {
  method <- if (is.factor(Y) || is.character(Y)) "class"
            else if (inherits(Y, "Surv")) "exp"
            else if (is.matrix(Y)) "poisson"
            else "anova"
}

Implement Friedman & Popescu's H-statistic?

This is a statistic for assessing the strength of interaction between predictors and is based on the partial dependence functions. Start with gbm's implementation; see ?gbm::interact.gbm.
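As a starting point, gbm's built-in implementation can be called directly. A sketch (gbm_fit, trn, and the predictor names are placeholders):

```r
library(gbm)

# Friedman & Popescu's H-statistic for the pairwise interaction between
# x1 and x2 in a fitted "gbm" object; values near 0 indicate no interaction
h <- interact.gbm(gbm_fit, data = trn, i.var = c("x1", "x2"), n.trees = 100)
```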

Call gbm::plot.gbm for speed with "gbm" objects?

Currently, partial uses the same algorithm for all objects. However, gbm's plot function is much faster for "gbm" objects. Perhaps it would be more efficient to simply make partial.gbm a wrapper around gbm::plot.gbm.

Add option for centered PDPs

Not sure this is possible right now since it will require knowledge of the response vector. Perhaps add two new options: center and response?

partial(mod, pred.var = "X1", center = TRUE, response = "Y", plot = TRUE, train = trn)
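In the meantime, centering can be done by hand on the output of partial(). A sketch, assuming the usual pdp column layout for ICE output (a predictor column, a yhat column, and a yhat.id curve identifier; the helper name is hypothetical):

```r
# Hypothetical helper: center each ICE curve at the smallest predictor value,
# i.e., c-ICE-style centering (subtract each curve's leftmost prediction)
center_ice <- function(ice, x = "x", yhat = "yhat", id = "yhat.id") {
  x0 <- min(ice[[x]])                       # leftmost grid value
  at_x0 <- ice[[x]] == x0                   # baseline row for each curve
  base <- ice[[yhat]][at_x0][match(ice[[id]], ice[[id]][at_x0])]
  ice[[yhat]] <- ice[[yhat]] - base
  ice
}

# Toy example: two curves over x = 1, 2
ice <- data.frame(x = c(1, 2, 1, 2), yhat = c(3, 5, 10, 14),
                  yhat.id = c(1, 1, 2, 2))
center_ice(ice)$yhat  # 0 2 0 4
```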

type argument prevents use of type in ...

The following will not work for "randomForest" (and similar) objects:

partial(fit, pred.var = "x", type = "prob")

Perhaps revert to super.type = c("regression", "classification")? This can also be accomplished via the pred.fun argument, but a built-in option would be quicker in many cases.
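For reference, the pred.fun workaround mentioned above looks roughly like this (fit, trn, and the class label "yes" are placeholders):

```r
# Work around the type clash by computing class probabilities inside pred.fun,
# so partial() never needs to pass type = "prob" through ... itself
pfun <- function(object, newdata) {
  predict(object, newdata = newdata, type = "prob")[, "yes"]
}
partial(fit, pred.var = "x", pred.fun = pfun, train = trn)
```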

Make a (non exported) generic with methods

# Requires plyr for adply(); note pred.var must be an argument, since the
# .fun closures below reference it
pdRegression <- function(object, pred.var, pred.grid, training.data, ...) {
  UseMethod("pdRegression")
}
pdRegression.default <- function(object, pred.var, pred.grid, training.data, ...) {
  plyr::adply(pred.grid, .margins = 1, .fun = function(x) {
    temp <- training.data
    temp[pred.var] <- x
    mean(predict(object, newdata = temp), na.rm = TRUE)
  }, ...)
}


pdClassification <- function(object, pred.var, pred.grid, which.class, 
                             training.data, ...) {
  UseMethod("pdClassification")
}
pdClassification.default <- function(object, pred.var, pred.grid, which.class, 
                                     training.data, ...) {
  adply(pred.grid, .margins = 1, .fun = function(x) {
    temp <- training.data
    temp[pred.var] <- x
    pr <- predict(object, newdata = temp, type = "prob")
    avgLogit(pr, which.class = which.class)
  }, ...)
}
pdClassification.svm <- function(object, pred.var, pred.grid, which.class, 
                                 training.data, ...) {
  if (is.null(object$call$probability)) {
    stop(paste("Cannot obtain predicted probabilities from", 
               deparse(substitute(object))))
  }
  adply(pred.grid, .margins = 1, .fun = function(x) {
    temp <- training.data
    temp[pred.var] <- x
    pr <- attr(predict(object, newdata = temp, probability = TRUE), "probabilities")
    avgLogit(pr, which.class = which.class)
  }, ...)
}

Add options for ICE and c-ICE curves

# ICE curves
partial(object, pred.var = "X1", ice = TRUE, plot = TRUE)

# c-ICE curves
partial(object, pred.var = "X1", ice = TRUE, center = TRUE, plot = TRUE)

Add approx option for quick and dirty plots

Following recent work in scikit-learn, it might be useful to have three functions:

  • pdBrute - used when approx = FALSE and recursive = FALSE and averages over training data (slow, but accurate);
  • pdApprox - used when approx = TRUE and will fix other predictors at their median/mode (fast, but less accurate);
  • pdGBM - used when recursive = TRUE and uses Friedman's weighted tree traversal method (only for "gbm" objects).

RStudio crash

RStudio keeps crashing whenever partial is used with "gbm" objects and a user-supplied data frame for the pred.grid argument.

Add support for knnreg

knnreg is a caret function for k-nearest neighbors regression. The formula interface seems broken and does not return anything useful.

Add a response.type argument?

response.type = c("link", "response", "centered.logit")

where,

  • response.type = "link" returns predictions on the link scale (original scale for ordinary regression models, log count for Poisson models, two-class logit for binomial models, etc.)
  • response.type = "response" returns predictions on the original response scale (original scale for ordinary regression models, count for Poisson models, class probabilities for binomial models, etc.)
  • response.type = "centered.logit" is only used for classification and returns predictions on a scale similar to a logit, but with the average log probability as the reference class.

Hence, it might be worthwhile to create a transformPrediction function with arguments yhat, dist, and type. While this can all be accomplished using the pred.fun argument, a built-in option would be quicker in many cases.
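A minimal sketch of what the proposed transformPrediction() might look like; the distributions and link functions covered here are assumptions, not pdp's actual API:

```r
# Hypothetical sketch: map predictions from the link scale to the response
# scale for a few common distributions (not part of pdp)
transformPrediction <- function(yhat,
                                dist = c("gaussian", "poisson", "binomial"),
                                type = c("link", "response")) {
  dist <- match.arg(dist)
  type <- match.arg(type)
  if (type == "link") return(yhat)        # already on the link scale
  switch(dist,
    gaussian = yhat,                      # identity link
    poisson  = exp(yhat),                 # log link -> expected count
    binomial = 1 / (1 + exp(-yhat))       # logit link -> probability
  )
}

transformPrediction(0, dist = "binomial", type = "response")  # 0.5
```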

Automatically plot "most important" predictors?

Probably most useful if object is of class "train":

topPredictors <- function(object, n = 1L, ...) {
  UseMethod("topPredictors")
}
topPredictors.train <- function(object, n = 1L, ...) {
  imp <- caret::varImp(object)$importance
  imp <- imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE]
  rownames(imp)[seq_len(n)]
}

library(pdp)
important.preds <- topPredictors(mod, 2)
partial(mod, pred.var = important.preds, plot = TRUE,
        chull = TRUE, individual = TRUE, progress = "text")

Add recursive option for GBM models

When recursive = TRUE, partial will call pdGBM, which relies on gbm's C++ code for Friedman's weighted tree traversal approach (which approximates the brute-force approach).
