
aai-institute / pydvl

Stars: 81 · Watchers: 4 · Forks: 9 · Size: 315.69 MB

pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation

Home Page: https://pydvl.org

License: GNU Lesser General Public License v3.0

Languages: Python 99.09%, Shell 0.84%, HTML 0.07%
data-valuation shapley-value machine-learning transferlab least-core influence-functions game-theory data-quality robust-machine-learning data-centric-ai

pydvl's People

Contributors

anesbenmerzoug, bastienzim, dependabot[bot], github-actions[bot], jakobkruse1, kosmitive, mdbenito, opcode81, schroedk, xuzzo


pydvl's Issues

Fast Hessian-Vectors with stochastic optimization for CG

Implement the stochastic sampling described in Koh and Liang, ‘Understanding Black-Box Predictions via Influence Functions’ (adapted from Agarwal, Bullins, and Hazan, ‘Second-Order Stochastic Optimization for Machine Learning in Linear Time’).
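
The core primitive is a Hessian-vector product estimated on a random minibatch instead of the full training set. A minimal sketch using torch's double backpropagation (all names are ours; loss_fn is assumed to close over the model):

# Sketch: minibatch Hessian-vector product via double backprop.
# All names here are hypothetical; pyDVL's actual interfaces may differ.
import torch

def minibatch_hvp(loss_fn, params, v, data_loader):
    """Estimate the Hessian-vector product H v on one random minibatch."""
    x, y = next(iter(data_loader))                      # a single random batch
    grads = torch.autograd.grad(loss_fn(x, y), params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat @ v, params)          # d(g^T v)/dθ = H v
    return torch.cat([h.reshape(-1) for h in hv]).detach()

Feeding this stochastic operator to CG in place of the exact Hessian-vector product is the idea from the Agarwal et al. paper.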

Truncated Monte Carlo Shapley

Serial:

  • Test serial_truncated_montecarlo_shapley

Parallel:

  • Implement without InterruptibleWorker (e.g. by wrapping ShapleyWorker as a MapReduceJob)
  • Test convergence for convex models using Hoeffding
  • Maybe try with interruption (or create a new issue)
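
For reference, a minimal sketch of the truncated permutation sampling (Ghorbani and Zou's TMC-Shapley); the utility callable and the tolerance handling are assumptions, not the library's actual interface:

# Hypothetical sketch of Truncated Monte Carlo Shapley (Ghorbani & Zou, 2019).
import numpy as np

def tmc_shapley(utility, n_points, n_permutations, tolerance):
    values = np.zeros(n_points)
    total = utility(np.arange(n_points))          # utility of the full set
    for _ in range(n_permutations):
        perm = np.random.permutation(n_points)
        prev = utility(np.array([], dtype=int))   # empty-coalition utility
        for i, idx in enumerate(perm):
            if abs(total - prev) < tolerance:     # truncate: remaining marginals ~ 0
                break
            curr = utility(perm[: i + 1])
            values[idx] += curr - prev
            prev = curr
    return values / n_permutations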

Fix CLI

This should be an interface to run all algorithms on datasets and store the results. It is currently broken.

Make dataset support multiple output dimensions.

Currently the dataset only supports a single dimension for y. In future versions we want y to be multi-dimensional, hence:

  • Add support for multiple dimensions to the dataset, i.e. do not raise an exception for 2-D y
  • Extend influence functions to the more general case, see the TODO in the code.

@Xuzzo What about the Shapley version?
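
A possible shape for the change (hypothetical; the real Dataset class differs): accept y of shape (n,) or (n, d) and stop raising on the latter.

# Hypothetical sketch: let the dataset accept 1-D or 2-D targets.
import numpy as np

class Dataset:
    def __init__(self, x: np.ndarray, y: np.ndarray):
        if y.ndim == 1:
            y = y.reshape(-1, 1)     # normalise to shape (n, 1)
        if y.ndim != 2 or len(x) != len(y):
            raise ValueError("y must have shape (n,) or (n, d)")
        self.x, self.y = x, y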

Fix logging

logging.py implements a quick-and-dirty multi-node logging solution (probably broken). It should be abstracted to allow for multiple backends, e.g. ray when it is being used.

Parallelize the influence function implementation (by samples)

We want:

  • To split the computation of the influence of training points across processors and machines
  • To reuse computations (e.g. in #24 we might want to reuse the samples across processes, if this makes sense)
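
One possible split, sketched with joblib purely for illustration (function names are ours, not the library's):

# Hypothetical sketch: split influence computation over chunks of training points.
import numpy as np
from joblib import Parallel, delayed

def parallel_influences(influence_fn, n_train, n_jobs=4):
    """influence_fn maps an array of training indices to their influences."""
    chunks = np.array_split(np.arange(n_train), n_jobs)
    results = Parallel(n_jobs=n_jobs)(
        delayed(influence_fn)(chunk) for chunk in chunks
    )
    return np.concatenate(results)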

IF for Gradient Boosted Trees

See the paper and the reference repo (abandoned, untyped, not packaged). We could fix it in the repo and import it as a library if we can obtain ownership / admin rights. However, it depends on catboost, which might be a problem. Apparently the implementation requires some manual exporting of model weights to JSON; this might not (or no longer) be necessary.

Test the paper's claims and consider inclusion.

Fix notebooks

They should run without problems (adapt them to recent changes in the code). See #1

Create performance tests / benchmarking

In order to keep track of the performance of the library, we should implement performance tests.

It would also allow us to properly assess whether a change actually improves the performance or not.

A first approach would be to write tests, with proper fixtures to make the test runs independent, using the pytest-benchmark extension. This would make the test runs slower but would allow us to easily benchmark certain parts of the code without writing much extra code. Of course, we should use pytest markers to prevent running such tests by default.
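
A sketch of what such a test could look like with the pytest-benchmark fixture (montecarlo_shapley and the small_utility fixture are hypothetical placeholders):

# Hypothetical benchmark test using the pytest-benchmark fixture.
import pytest

@pytest.mark.benchmark_only          # custom marker, deselected by default
def test_montecarlo_shapley_speed(benchmark, small_utility):
    # benchmark() calls the function repeatedly and records timings
    result = benchmark(montecarlo_shapley, small_utility, max_iterations=100)
    assert result is not None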

A second approach would be to use airspeed-velocity, a tool for benchmarking Python packages over their lifetime. Runtime, memory consumption and even custom-computed values can be tracked. This would require us to write a bit of code and ideally keep it in a separate repository, but it would allow us to track changes over time and to track multiple metrics more easily.

For example, Dask is using it and storing the code and results in this repository.

Fix conjugate gradient for influence function computation with poorly conditioned matrix

Conjugate gradient is tested on well-conditioned problems, but an ML model can produce a poorly conditioned matrix. In this case, influence calculation diverges for $N \ll M$, where $\mathbf{A} \in \mathbb{R}^{N\times N}$. Fully constructing the Hessian matrix yields the desired outputs, so the high-dimensional case via conjugate gradient is postponed to this issue. Further inspection is needed to determine how this can be mitigated, e.g. with an appropriate preconditioner, by scaling the matrix, or by adding regularization when fitting the model. However, it seems that different models require different solutions.

Related: #45
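
One possible mitigation, sketched here with a hypothetical hvp callable (a sketch, not a decided fix): add Tikhonov damping $\lambda I$ to the operator before running scipy's conjugate gradient.

# Sketch: damped conjugate gradient given only Hessian-vector products.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def damped_cg_solve(hvp, b, damping=1e-3):
    """Solve (H + damping * I) x = b, where hvp(v) computes H v."""
    n = b.shape[0]
    op = LinearOperator((n, n), matvec=lambda v: hvp(v) + damping * v)
    x, info = cg(op, b)
    if info != 0:
        raise RuntimeError(f"CG did not converge (info={info})")
    return x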

Notebook for data mislabeling in classification using influence functions

Write a notebook analysing a dataset using IFs, essentially reproducing the application illustrated in section 5.4 (bad labels) of Koh and Liang, ‘Understanding Black-Box Predictions via Influence Functions’.

This should be an end-to-end example, with a "story line" and using a public (and non-standard) dataset or a cooperation with some partner. Avoid image data.

Additionally, one can think of reweighting as described here.

Test Monte Carlo Shapley with random forest and GBM models

The repository is built to support tabular data and hence tree-based methods. However, some parts were built with the implicit assumption of deterministic training (e.g. a linear fit), which does not hold for random forests and gradient boosting machines.
Before moving on (e.g. working on truncated Monte Carlo or adapting our cache to the non-deterministic case) we should analyse the drawbacks of evaluating a non-deterministic model with our (implicitly deterministic) code.

Fix map_reduce calls with utility and add tests

In many examples and tests, map_reduce is passed a utility, but this only works when num_jobs <= num_runs. When num_jobs > num_runs, passing a Utility as data is not supported. Since we probably want to keep data a simple collection, find a way around calling map_reduce with a utility, and test it.

See, e.g. this discussion

Once fixed, unify the interface in the Monte Carlo methods (there is a TODO in permutation_montecarlo_shapley).

Use Ray for InterruptibleWorker

Use ray for InterruptibleWorker and add an abstraction layer around it so as not to introduce a hard dependency on ray.

Maybe we can salvage the current implementation.
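
A possible shape for the abstraction layer (hypothetical interface, not pyDVL's actual API): a minimal backend protocol with a lazily imported ray implementation, so the rest of the code never imports ray directly.

# Hypothetical sketch of a backend-agnostic executor around ray.
from typing import Callable, Protocol

class ParallelBackend(Protocol):
    def run(self, fn: Callable, args: list) -> list: ...

class RayBackend:
    def __init__(self):
        import ray                    # imported lazily: no hard dependency
        ray.init(ignore_reinit_error=True)
        self._ray = ray

    def run(self, fn: Callable, args: list) -> list:
        remote_fn = self._ray.remote(fn)
        return self._ray.get([remote_fn.remote(a) for a in args])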

Shapley values for coalitions

Instead of only computing SVs for single samples, also allow for groups of them. Modify the interfaces to take sets of indices?

One would want to group by properties of the samples, or by group size. The code should allow for generic definitions of groups.
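
A minimal sketch of what the grouped interface could look like (all names hypothetical): treat each group as a single player whose utility is evaluated on the union of its members' indices.

# Hypothetical sketch: evaluate the utility on coalitions of groups.
import numpy as np

def group_utility(utility, groups: dict, coalition: list):
    """Utility of the union of all samples in the given coalition of groups."""
    if not coalition:
        return utility(np.array([], dtype=int))
    return utility(np.concatenate([groups[g] for g in coalition]))

# Example grouping by a sample property: groups = {"outliers": idx_out, "rest": idx_rest}.
# Existing Shapley code could then run over the group keys as players.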

Redo cache to recompute until the function's output stabilises

Caching utility based on indices makes perfect sense if the computation is deterministic, but it often isn't. We might want to average over different retrainings of the model on the same subset. A cache hit would then retrain with the same input and compute a running average until convergence, at which point no further retraining is done.

Open questions are how many such hits we would have and whether the output does stabilise or not for any given model.
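
A sketch of the idea, ignoring the actual client/server cache (names and convergence criterion are ours): on each hit, recompute and update a running mean until the standard error falls below a tolerance.

# Hypothetical sketch: a cache that keeps averaging until the value stabilises.
import numpy as np

class StabilisingCache:
    def __init__(self, fn, rtol=0.01, min_repetitions=3):
        self.fn, self.rtol, self.min_rep = fn, rtol, min_repetitions
        self._stats = {}   # key -> (count, mean, M2) Welford accumulators

    def __call__(self, indices):
        key = frozenset(indices)
        count, mean, m2 = self._stats.get(key, (0, 0.0, 0.0))
        if count >= self.min_rep:
            stderr = np.sqrt(m2 / (count * (count - 1)))
            if stderr <= self.rtol * abs(mean):   # converged: stop retraining
                return mean
        value = self.fn(indices)                  # retrain and re-evaluate
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
        self._stats[key] = (count, mean, m2)
        return mean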

Create Documentation Page

It would be great to have a documentation page for this project, e.g. via GitHub Pages, to make it easier to read the documentation from a user's point of view.

I think this would also give us more motivation to write more and better documentation.

Playbook for data Shapley

In order to showcase the data valuation library, we need an end-to-end example. A few points such an example should touch on:

  • kNN Shapley, because it is fast (a sketch follows this list)
  • Fit a model, show that outliers have low scores, possibly re-train the model without outliers
  • Include kNN for grouped data
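
For the kNN item, a sketch of the exact kNN Shapley recursion from Jia et al. (2019) for a single test point (function and variable names are ours):

# Sketch of exact kNN Shapley for one test point (Jia et al., 2019).
import numpy as np

def knn_shapley_single(x_train, y_train, x_test, y_test, k):
    n = len(x_train)
    order = np.argsort(np.linalg.norm(x_train - x_test, axis=1))  # by distance
    match = (y_train[order] == y_test).astype(float)
    values = np.zeros(n)
    values[order[-1]] = match[-1] / n
    for i in range(n - 2, -1, -1):   # recurse from farthest to nearest
        values[order[i]] = values[order[i + 1]] + (
            (match[i] - match[i + 1]) / k * min(k, i + 1) / (i + 1)
        )
    return values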

Implement a model attribute for the surrogate objective to handle non-convex and non-differentiable losses

If the loss function is non-convex, the Hessian around a chosen parameter value might have negative eigenvalues. To fix this, a small multiple of the identity is added to the Hessian at that point in a second-order Taylor approximation of the loss, which is then used to compute the influence instead of the actual loss.

See section 4 of http://proceedings.mlr.press/v70/koh17a.html.

For non-convex losses one can simply neglect the indefinite part of the Hessian (the terms involving second derivatives of the model). This corresponds to the Gauss-Newton approximation from https://www.cs.toronto.edu/~jmartens/docs/Deep_HessianFree.pdf.

The interface is not fully specified, and we might modify it in a way similar to the following. The same pattern could be achieved in a functional style. Here is an example for non-differentiable models:

from typing import Callable, Protocol

import numpy as np
import torch


class SupervisedModel(Protocol):
    """Only here for the type hints."""

    def fit(self, x: np.ndarray, y: np.ndarray):
        ...

    def predict(self, x: np.ndarray) -> np.ndarray:
        ...

    def score(self, x: np.ndarray, y: np.ndarray) -> float:
        ...

    def params(self) -> np.ndarray:
        ...


class PyTorchSurrogateModel:
    r"""Wrap a non-differentiable model with a surrogate objective L(\theta)."""

    def __init__(
        self,
        base_model: SupervisedModel,
        surrogate_objective: Callable[[torch.Tensor], torch.Tensor],
    ):
        self.__base_model = base_model
        self.__surrogate = surrogate_objective

    # ================================================
    # implement grad and mvp using torch and L(\theta)
    # ================================================

    def params(self) -> np.ndarray:
        return self.__base_model.params()

    def fit(self, x: np.ndarray, y: np.ndarray):
        return self.__base_model.fit(x, y)

    def predict(self, x: np.ndarray) -> np.ndarray:
        return self.__base_model.predict(x)

    def score(self, x: np.ndarray, y: np.ndarray) -> float:
        return self.__base_model.score(x, y)

Fix Hoeffding bound

The standard (ε,δ)-bound for the Monte Carlo SV approximation assumes a deterministic utility, but one never has this. Instead: make an additional (ε,δ)-assumption on the utility and nest the bounds.

However, sample bounds on the utility are only valid for more than $n$ samples, and during SV computation most evaluations of the utility are on small subsets of the training set. This is generally a problem when computing SVs. Solutions?
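
To make the nesting concrete (our notation, a plain union-bound argument, not a worked-out result): assume each estimated marginal contribution deviates from its exact value by more than $\epsilon_u$ with probability at most $\delta_u$. With $m$ Monte Carlo samples, the combined bound becomes $P(|\hat{v}_i - v_i| > \epsilon + \epsilon_u) \leq \delta + m \delta_u$, where $\delta \leq 2 \exp(-2 m \epsilon^2 / r^2)$ is the usual Hoeffding term for a utility with range $r$.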

Allow multidimensional labels

The current dataset class only supports 1-dimensional outputs. For regression it would be good to have multi-dimensional outputs. Due to the encapsulation through the loss function, this should not involve many changes.

Test baseline implementations

The current implementations of data Shapley have been tested on some very simple algorithms and datasets. We need to test them in more realistic scenarios and with more advanced models.

Switch from pyhash to hashlib

I have found many problems with using pyhash, mostly related to the fact that it only supports PyPy.
We could switch from pyhash to the more common hashlib.
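
For example, hashing an array of indices with the standard library could look like the following (a sketch; the actual cache key may include more than the indices):

# Hashing a numpy array with the standard library's hashlib.
import hashlib
import numpy as np

def hash_indices(indices: np.ndarray) -> str:
    return hashlib.blake2b(indices.tobytes(), digest_size=16).hexdigest()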

Improve project setup

The project's structure and setup could be improved for easier collaboration.

I have identified a few points that we could work on to make it better:

Monte Carlo iterative iHvP

See section 3, "stochastic estimation", of http://proceedings.mlr.press/v70/koh17a.html. The algorithm differs from conjugate gradient in that it uses a Taylor expansion and a recursive fixed-point iteration:

$r(x_t) = A x_t - b$
$x_{t+1} = x_t - r(x_t)$

Note that the approximations become more inaccurate as the iteration proceeds, and thus harder to test correctly, given rounding errors.
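
A toy numpy version of the deterministic iteration (our sketch; the stochastic variant replaces $A$ with a minibatch Hessian estimate at each step, and $A$ must be scaled so that $\|I - A\| < 1$):

# Toy sketch of the fixed-point iteration x_{t+1} = x_t - (A x_t - b).
# Converges when the spectral radius of (I - A) is below 1, e.g. after scaling A.
import numpy as np

def fixed_point_solve(A, b, steps=1000):
    x = np.zeros_like(b)
    for _ in range(steps):
        x = x - (A @ x - b)
    return x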

Fix simultaneous caching of multiple utilities

At the moment, if I define different utilities with the same client config, they will use the same cache. This has caused me a lot of problems in subsequent tests. Shall we flush the client when we create the utility? Are there any drawbacks to doing this?
