
pdp


Overview

pdp is an R package for constructing partial dependence plots (PDPs) and individual conditional expectation (ICE) curves. PDPs and ICE curves are part of a larger framework referred to as interpretable machine learning (IML), which also includes (but is not limited to) variable importance plots (VIPs). While VIPs (available in the R package vip) help visualize feature impact (either locally or globally), PDPs and ICE curves help visualize feature effects. An in-progress, but comprehensive, overview of IML can be found at https://github.com/christophM/interpretable-ml-book.

A detailed introduction to pdp has been published in The R Journal: “pdp: An R Package for Constructing Partial Dependence Plots”, https://journal.r-project.org/archive/2017/RJ-2017-016/index.html. You can track development at https://github.com/bgreenwell/pdp. To report bugs or issues, contact the main author directly or submit them to https://github.com/bgreenwell/pdp/issues. For additional documentation and examples, visit the package website.

pdp currently exports the following functions:

  • partial() - compute partial dependence functions and individual conditional expectations (i.e., objects of class "partial" and "ice", respectively) from various fitted model objects;

  • plotPartial() - construct lattice-based PDPs and ICE curves;

  • autoplot() - construct ggplot2-based PDPs and ICE curves;

  • topPredictors() - extract the most “important” predictors from various types of fitted models (superseded; see vip for a more robust and flexible replacement);

  • exemplar() - construct an exemplar record from a data frame (an experimental feature that may be useful for constructing fast, approximate feature effect plots).

Installation

# The easiest way to get pdp is to install it from CRAN:
install.packages("pdp")

# Alternatively, you can install the development version from GitHub:
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("bgreenwell/pdp")
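A minimal usage sketch of the exported functions (this assumes the randomForest package is installed; the boston housing data ship with pdp):

```r
library(pdp)
library(randomForest)  # any supported model type works; randomForest is just an example

# Fit a random forest to the Boston housing data shipped with pdp
data(boston, package = "pdp")
set.seed(101)  # for reproducibility
rfo <- randomForest(cmedv ~ ., data = boston)

# Partial dependence of cmedv on lstat, plotted with lattice
pd <- partial(rfo, pred.var = "lstat")
plotPartial(pd)

# ggplot2-based version of the same plot
autoplot(pd)
```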


pdp's Issues

pdp: new gbm release

Hi!

We're coming up to a new release of gbm which considerably tidies the code and introduces a much stricter NAMESPACE. We've also removed the multinomial distribution, which was essentially unsupported and rather buggy code, and it looks like it's breaking your package.

Would it be possible for you to take a look?

00check.log.txt

support for gamma and poisson regression

Hi @bgreenwell, thank you so much for your awesome package. Really fantastic.

However, we have noticed that it only supports Gaussian regression and classification. Would it be possible to also implement gamma and Poisson regression? I am using xgboost 0.6.x.
Thanks in advance for your attention,
Giorgio

xgboost using xgb.DMatrix

I'm trying to use pdp with an xgboost model. The training data is mostly text that I am encoding using bag of words. So I start off with a data frame that contains a column of text. That text is tokenized and a sparse term document matrix created. This matrix is what is used to train the xgboost model.

It appears that pdp can't yet handle the situation where the training data for the xgboost model is a matrix rather than a data frame. I tried this:

partial(xgb_fit, train = train_matrix, pred.var = "price")
Error in partial.default(xgb_fit, train = train_matrix, pred.var = "price") : 
  price not found in the training data.

but as you can see it doesn't work. Here, train_matrix is an xgb.DMatrix object created from the bag-of-words term-document matrix (of class dgCMatrix).

Is it possible to use pdp to generate partial dependence plots in this case?
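One possible workaround (untested sketch): pass the underlying feature matrix, rather than the xgb.DMatrix, as `train`, and supply a `pred.fun` that rebuilds the xgb.DMatrix at prediction time. The object name `train_dgc` (the original dgCMatrix) is a placeholder, and `"price"` is carried over from the example above:

```r
# Sketch: keep train as an ordinary data frame so partial() can find "price",
# and convert back to xgb.DMatrix inside pred.fun before predicting
pfun <- function(object, newdata) {
  mean(predict(object, newdata = xgboost::xgb.DMatrix(as.matrix(newdata))))
}
partial(xgb_fit, pred.var = "price",
        train = as.data.frame(as.matrix(train_dgc)),  # train_dgc: placeholder dgCMatrix
        pred.fun = pfun, plot = TRUE)
```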

Deprecate super.type in favor of method

Borrowing from rpart:

if (missing(method)) {
  method <- if (is.factor(Y) || is.character(Y)) "class"
            else if (inherits(Y, "Surv")) "exp"
            else if (is.matrix(Y)) "poisson"
            else "anova"
}

Implement Friedman & Popescu's H-statistic?

This is a statistic for assessing the strength of interaction between predictors and is based on the partial dependence functions. Start with gbm's implementation; see ?gbm::interact.gbm.
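As a starting point, gbm's built-in implementation can be called directly. A sketch (gbm_fit, trn, and the predictor names are placeholders):

```r
library(gbm)

# Friedman & Popescu's H-statistic for the pairwise interaction between
# x1 and x2 in a fitted "gbm" object; values near 0 indicate no interaction
h <- interact.gbm(gbm_fit, data = trn, i.var = c("x1", "x2"), n.trees = 100)
```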

Call gbm::plot.gbm for speed with "gbm" objects?

Currently, partial uses the same algorithm for all objects. However, gbm's plot function is much faster for "gbm" objects. Perhaps it would be more efficient to simply make partial.gbm a wrapper around gbm::plot.gbm.

Add option for centered PDPs

Not sure this is possible right now since it will require knowledge of the response vector. Perhaps add two new options: center and response?

partial(mod, pred.var = "X1", center = TRUE, response = "Y", plot = TRUE, train = trn)
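In the meantime, centering can be done by hand on the output of partial(). A sketch, assuming the usual pdp column layout for ICE output (a predictor column, a yhat column, and a yhat.id curve identifier; the helper name is hypothetical):

```r
# Hypothetical helper: center each ICE curve at the smallest predictor value,
# i.e., c-ICE-style centering (subtract each curve's leftmost prediction)
center_ice <- function(ice, x = "x", yhat = "yhat", id = "yhat.id") {
  x0 <- min(ice[[x]])                       # leftmost grid value
  at_x0 <- ice[[x]] == x0                   # baseline row for each curve
  base <- ice[[yhat]][at_x0][match(ice[[id]], ice[[id]][at_x0])]
  ice[[yhat]] <- ice[[yhat]] - base
  ice
}

# Toy example: two curves over x = 1, 2
ice <- data.frame(x = c(1, 2, 1, 2), yhat = c(3, 5, 10, 14),
                  yhat.id = c(1, 1, 2, 2))
center_ice(ice)$yhat  # 0 2 0 4
```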

type argument prevents use of type in ...

The following will not work for "randomForest" (and similar) objects:

partial(fit, pred.var = "x", type = "prob")

Perhaps revert to super.type = c("regression", "classification")? This can also be accomplished via the pred.fun argument, but a built-in option would be quicker in many cases.
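For reference, the pred.fun workaround mentioned above looks roughly like this (fit, trn, and the class label "yes" are placeholders):

```r
# Work around the type clash by computing class probabilities inside pred.fun,
# so partial() never needs to pass type = "prob" through ... itself
pfun <- function(object, newdata) {
  predict(object, newdata = newdata, type = "prob")[, "yes"]
}
partial(fit, pred.var = "x", pred.fun = pfun, train = trn)
```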

Make a (non exported) generic with methods

# Requires plyr for adply(); note pred.var must be an argument, since the
# .fun closures below reference it
pdRegression <- function(object, pred.var, pred.grid, training.data, ...) {
  UseMethod("pdRegression")
}
pdRegression.default <- function(object, pred.var, pred.grid, training.data, ...) {
  plyr::adply(pred.grid, .margins = 1, .fun = function(x) {
    temp <- training.data
    temp[pred.var] <- x
    mean(predict(object, newdata = temp), na.rm = TRUE)
  }, ...)
}


pdClassification <- function(object, pred.var, pred.grid, which.class, 
                             training.data, ...) {
  UseMethod("pdClassification")
}
pdClassification.default <- function(object, pred.var, pred.grid, which.class, 
                                     training.data, ...) {
  adply(pred.grid, .margins = 1, .fun = function(x) {
    temp <- training.data
    temp[pred.var] <- x
    pr <- predict(object, newdata = temp, type = "prob")
    avgLogit(pr, which.class = which.class)
  }, ...)
}
pdClassification.svm <- function(object, pred.var, pred.grid, which.class, 
                                 training.data, ...) {
  if (is.null(object$call$probability)) {
    stop(paste("Cannot obtain predicted probabilities from", 
               deparse(substitute(object))))
  }
  adply(pred.grid, .margins = 1, .fun = function(x) {
    temp <- training.data
    temp[pred.var] <- x
    pr <- attr(predict(object, newdata = temp, probability = TRUE), "probabilities")
    avgLogit(pr, which.class = which.class)
  }, ...)
}

Add options for ICE and c-ICE curves

# ICE curves
partial(object, pred.var = "X1", ice = TRUE, plot = TRUE)

# c-ICE curves
partial(object, pred.var = "X1", ice = TRUE, center = TRUE, plot = TRUE)

Add approx option for quick and dirty plots

Following recent work in scikit-learn, it might be useful to have three functions:

  • pdBrute - used when approx = FALSE and recursive = FALSE and averages over training data (slow, but accurate);
  • pdApprox - used when approx = TRUE and will fix other predictors at their median/mode (fast, but less accurate);
  • pdGBM - used when recursive = TRUE and uses Friedman's weighted tree traversal method (only for "gbm" objects).

RStudio crash

RStudio keeps crashing whenever partial is used with "gbm" objects and a user-supplied data frame for the pred.grid argument.

Add support for knnreg

knnreg is a caret function for k-nearest neighbors regression. The formula interface seems broken and does not return anything useful.

Add a response.type argument?

response.type = c("link", "response", "centered.logit")

where,

  • response.type = "link" returns predictions on the link scale (original scale for ordinary regression models, log count for Poisson models, two-class logit for binomial models, etc.)
  • response.type = "response" returns predictions on the original response scale (original scale for ordinary regression models, count for Poisson models, class probabilities for binomial models, etc.)
  • response.type = "centered.logit" is only used for classification and returns predictions on a scale similar to a logit, but with the average log probability as the reference class.

Hence, it might be worthwhile to create a transformPrediction function with arguments yhat, dist, and type. While this can all be accomplished using the pred.fun argument, a built-in option would be quicker in many cases.
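A minimal sketch of what the proposed transformPrediction() might look like; the distributions and link functions covered here are assumptions, not pdp's actual API:

```r
# Hypothetical sketch: map predictions from the link scale to the response
# scale for a few common distributions (not part of pdp)
transformPrediction <- function(yhat,
                                dist = c("gaussian", "poisson", "binomial"),
                                type = c("link", "response")) {
  dist <- match.arg(dist)
  type <- match.arg(type)
  if (type == "link") return(yhat)        # already on the link scale
  switch(dist,
    gaussian = yhat,                      # identity link
    poisson  = exp(yhat),                 # log link -> expected count
    binomial = 1 / (1 + exp(-yhat))       # logit link -> probability
  )
}

transformPrediction(0, dist = "binomial", type = "response")  # 0.5
```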

Automatically plot "most important" predictors?

Probably most useful if object is of class "train":

topPredictors <- function(object, n = 1L, ...) {
  UseMethod("topPredictors")
}
topPredictors.train <- function(object, n = 1L, ...) {
  imp <- caret::varImp(object)$importance
  imp <- imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE]
  rownames(imp)[seq_len(n)]
}

library(pdp)
important.preds <- topPredictors(mod, 2)
partial(mod, pred.var = important.preds, plot = TRUE,
        chull = TRUE, individual = TRUE, progress = "text")

Add recursive option for GBM models

When recursive = TRUE, partial will call pdGBM, which relies on gbm's C++ code for Friedman's weighted tree traversal approach (which approximates the brute-force approach).
