
aai-institute / pydvl

Stars: 81 · Watchers: 4 · Forks: 9 · Size: 315.69 MB

pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation

Home Page: https://pydvl.org

License: GNU Lesser General Public License v3.0

Languages: Python 99.09%, Shell 0.84%, HTML 0.07%
data-valuation shapley-value machine-learning transferlab least-core influence-functions game-theory data-quality robust-machine-learning data-centric-ai

pydvl's People

Contributors

anesbenmerzoug, bastienzim, dependabot[bot], github-actions[bot], jakobkruse1, kosmitive, mdbenito, opcode81, schroedk, xuzzo


pydvl's Issues

Fast Hessian-Vectors with stochastic optimization for CG

Implement the stochastic sampling described in Koh and Liang, ‘Understanding Black-Box Predictions via Influence Functions’ (adapted from Agarwal, Bullins, and Hazan, ‘Second-Order Stochastic Optimization for Machine Learning in Linear Time’).
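
The core primitive is a Hessian-vector product estimated on a random minibatch instead of the full training set. A minimal sketch using torch's double backpropagation (all names are ours; loss_fn is assumed to close over the model):

# Sketch: minibatch Hessian-vector product via double backprop.
# All names here are hypothetical; pyDVL's actual interfaces may differ.
import torch

def minibatch_hvp(loss_fn, params, v, data_loader):
    """Estimate the Hessian-vector product H v on one random minibatch."""
    x, y = next(iter(data_loader))                      # a single random batch
    grads = torch.autograd.grad(loss_fn(x, y), params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat @ v, params)          # d(g^T v)/dθ = H v
    return torch.cat([h.reshape(-1) for h in hv]).detach()

Feeding this stochastic operator to CG in place of the exact Hessian-vector product is the idea from the Agarwal et al. paper.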

Truncated Monte Carlo Shapley

Serial:

  • Test serial_truncated_montecarlo_shapley

Parallel:

  • Implement without InterruptibleWorker (e.g. by wrapping ShapleyWorker as a MapReduceJob)
  • Test convergence for convex models using Hoeffding
  • Maybe try with interruption (or create a new issue)
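
For reference, a minimal sketch of the truncated permutation sampling (Ghorbani and Zou's TMC-Shapley); the utility callable and the tolerance handling are assumptions, not the library's actual interface:

# Hypothetical sketch of Truncated Monte Carlo Shapley (Ghorbani & Zou, 2019).
import numpy as np

def tmc_shapley(utility, n_points, n_permutations, tolerance):
    values = np.zeros(n_points)
    total = utility(np.arange(n_points))          # utility of the full set
    for _ in range(n_permutations):
        perm = np.random.permutation(n_points)
        prev = utility(np.array([], dtype=int))   # empty-coalition utility
        for i, idx in enumerate(perm):
            if abs(total - prev) < tolerance:     # truncate: remaining marginals ~ 0
                break
            curr = utility(perm[: i + 1])
            values[idx] += curr - prev
            prev = curr
    return values / n_permutations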

Fix CLI

This should be an interface to run all algorithms on datasets and store the results. It is currently broken.

Make dataset support multiple output dimensions.

Currently the dataset only supports a single dimension for y. In future versions we want y to be multi-dimensional, hence:

  • Add support for multiple dimensions to the dataset, i.e. do not raise an exception for 2-D y
  • Extend influence functions to the more general case, see the TODO in the code.

@Xuzzo What about the Shapley version?
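
A possible shape for the change (hypothetical; the real Dataset class differs): accept y of shape (n,) or (n, d) and stop raising on the latter.

# Hypothetical sketch: let the dataset accept 1-D or 2-D targets.
import numpy as np

class Dataset:
    def __init__(self, x: np.ndarray, y: np.ndarray):
        if y.ndim == 1:
            y = y.reshape(-1, 1)     # normalise to shape (n, 1)
        if y.ndim != 2 or len(x) != len(y):
            raise ValueError("y must have shape (n,) or (n, d)")
        self.x, self.y = x, y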

Fix logging

logging.py implements a quick-and-dirty multi-node logging solution (probably broken). It should be abstracted to allow for multiple backends, e.g. ray when it is being used.

Parallelize the influence function implementation (by samples)

We want:

  • To split the computation of the influence of training points across processors and machines
  • To reuse computations (e.g. in #24 we might want to reuse the samples across processes, if this makes sense)
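
One possible split, sketched with joblib purely for illustration (function names are ours, not the library's):

# Hypothetical sketch: split influence computation over chunks of training points.
import numpy as np
from joblib import Parallel, delayed

def parallel_influences(influence_fn, n_train, n_jobs=4):
    """influence_fn maps an array of training indices to their influences."""
    chunks = np.array_split(np.arange(n_train), n_jobs)
    results = Parallel(n_jobs=n_jobs)(
        delayed(influence_fn)(chunk) for chunk in chunks
    )
    return np.concatenate(results)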

IF for Gradient Boosted Trees

See the paper and the reference repo (abandoned, untyped, not packaged). We could fix it in the repo and import it as a library if we can obtain ownership / admin rights. However, it depends on catboost, which might be a problem. Apparently the implementation requires some manual exporting of model weights to JSON; this might not (or no longer) be necessary.

Test the paper's claims and consider inclusion.

Fix notebooks

They should run without problems (adapt them to recent changes in the code). See #1

Create performance tests / benchmarking

In order to keep track of the performance of the library, we should implement performance tests.

It would also allow us to properly assess whether a change actually improves the performance or not.

A first approach would be to write tests, with proper fixtures to make the test runs independent, using the pytest-benchmark extension. This would make the test runs slower but would allow us to easily benchmark certain parts of the code without writing much extra code. Of course, we should use pytest markers to prevent running such tests by default.
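
A sketch of what such a test could look like with the pytest-benchmark fixture (montecarlo_shapley and the small_utility fixture are hypothetical placeholders):

# Hypothetical benchmark test using the pytest-benchmark fixture.
import pytest

@pytest.mark.benchmark_only          # custom marker, deselected by default
def test_montecarlo_shapley_speed(benchmark, small_utility):
    # benchmark() calls the function repeatedly and records timings
    result = benchmark(montecarlo_shapley, small_utility, max_iterations=100)
    assert result is not None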

A second approach would be to use airspeed-velocity, a tool for benchmarking Python packages over their lifetime. Runtime, memory consumption and even custom-computed values can be tracked. This would require us to write a bit of code and ideally keep it in a separate repository, but it would allow us to track changes over time and to track multiple metrics more easily.

For example, Dask is using it and storing the code and results in this repository.

Fix conjugate gradient for influence function computation with poorly conditioned matrix

Conjugate gradient is tested on well-conditioned problems, but an ML model can produce a poorly conditioned matrix. In this case, influence calculation diverges for $N \ll M$, where $\mathbf{A} \in \mathbb{R}^{N\times N}$. Fully constructing the Hessian matrix yields the desired outputs, so the high-dimensional case via conjugate gradient is postponed to this issue. Further inspection is needed to determine how this can be mitigated, e.g. with an appropriate preconditioner, by scaling the matrix, or by adding regularization when fitting the model. However, it seems that different models require different solutions.

Related: #45
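
One possible mitigation, sketched here with a hypothetical hvp callable (a sketch, not a decided fix): add Tikhonov damping $\lambda I$ to the operator before running scipy's conjugate gradient.

# Sketch: damped conjugate gradient given only Hessian-vector products.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def damped_cg_solve(hvp, b, damping=1e-3):
    """Solve (H + damping * I) x = b, where hvp(v) computes H v."""
    n = b.shape[0]
    op = LinearOperator((n, n), matvec=lambda v: hvp(v) + damping * v)
    x, info = cg(op, b)
    if info != 0:
        raise RuntimeError(f"CG did not converge (info={info})")
    return x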

Notebook for data mislabeling in classification using influence functions

Write a notebook analysing a dataset using IFs, essentially reproducing the application illustrated in section 5.4 (bad labels) of Koh and Liang, ‘Understanding Black-Box Predictions via Influence Functions’.

This should be an end-to-end example, with a "story line" and using a public (and non-standard) dataset or a cooperation with some partner. Avoid image data.

Additionally, one can think of reweighting as described here.

Test Monte Carlo Shapley with random forest and GBM models

The repository is built to support tabular data and hence tree-based methods. However, some parts were built with the implicit assumption of deterministic training (e.g. a linear fit), which does not hold for random forests and gradient boosting machines.
Before moving on (e.g. working on truncated Monte Carlo or adapting our cache to the non-deterministic case) we should analyse the drawbacks of evaluating a non-deterministic model with our (implicitly deterministic) code.

Fix map_reduce calls with utility and add tests

In many examples and tests, map_reduce is passed a utility, but this only works when num_jobs <= num_runs. When num_jobs > num_runs, passing a Utility as data is not supported. Since we probably want to keep data a simple collection, find a way around calling map_reduce with a utility, and test it.

See, e.g. this discussion

Once fixed, unify the interface in the Monte Carlo methods (there is a TODO in permutation_montecarlo_shapley).

Use Ray for InterruptibleWorker

Use ray for InterruptibleWorker and add an abstraction layer around it so as not to introduce a hard dependency on ray.

Maybe we can salvage the current implementation.
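
A possible shape for the abstraction layer (hypothetical interface, not pyDVL's actual API): a minimal backend protocol with a lazily imported ray implementation, so the rest of the code never imports ray directly.

# Hypothetical sketch of a backend-agnostic executor around ray.
from typing import Callable, Protocol

class ParallelBackend(Protocol):
    def run(self, fn: Callable, args: list) -> list: ...

class RayBackend:
    def __init__(self):
        import ray                    # imported lazily: no hard dependency
        ray.init(ignore_reinit_error=True)
        self._ray = ray

    def run(self, fn: Callable, args: list) -> list:
        remote_fn = self._ray.remote(fn)
        return self._ray.get([remote_fn.remote(a) for a in args])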

Shapley values for coalitions

Instead of only computing SVs for single samples, also allow for groups of them. Modify the interfaces to take sets of indices?

One would want to group by properties of the samples, or by group size. The code should allow for generic definitions of groups.
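
A minimal sketch of what the grouped interface could look like (all names hypothetical): treat each group as a single player whose utility is evaluated on the union of its members' indices.

# Hypothetical sketch: evaluate the utility on coalitions of groups.
import numpy as np

def group_utility(utility, groups: dict, coalition: list):
    """Utility of the union of all samples in the given coalition of groups."""
    if not coalition:
        return utility(np.array([], dtype=int))
    return utility(np.concatenate([groups[g] for g in coalition]))

# Example grouping by a sample property: groups = {"outliers": idx_out, "rest": idx_rest}.
# Existing Shapley code could then run over the group keys as players.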

Redo cache to recompute until the function's output stabilises

Caching utility based on indices makes perfect sense if the computation is deterministic, but it often isn't. We might want to average over different retrainings of the model on the same subset. A cache hit would then retrain with the same input and compute a running average until convergence, at which point no further retraining is done.

Open questions are how many such hits we would have and whether the output does stabilise or not for any given model.
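
A sketch of the idea, ignoring the actual client/server cache (names and convergence criterion are ours): on each hit, recompute and update a running mean until the standard error falls below a tolerance.

# Hypothetical sketch: a cache that keeps averaging until the value stabilises.
import numpy as np

class StabilisingCache:
    def __init__(self, fn, rtol=0.01, min_repetitions=3):
        self.fn, self.rtol, self.min_rep = fn, rtol, min_repetitions
        self._stats = {}   # key -> (count, mean, M2) Welford accumulators

    def __call__(self, indices):
        key = frozenset(indices)
        count, mean, m2 = self._stats.get(key, (0, 0.0, 0.0))
        if count >= self.min_rep:
            stderr = np.sqrt(m2 / (count * (count - 1)))
            if stderr <= self.rtol * abs(mean):   # converged: stop retraining
                return mean
        value = self.fn(indices)                  # retrain and re-evaluate
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
        self._stats[key] = (count, mean, m2)
        return mean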

Create Documentation Page

It would be great to have a documentation page for this project, e.g. via GitHub Pages, to make it easier to read the documentation from a user's point of view.

I think this would also give us more motivation to write more and better documentation.

Playbook for data Shapley

In order to showcase the data valuation library, we need an end-to-end example. A few points such an example should touch on:

  • kNN Shapley, because it is fast (a sketch follows this list)
  • Fit a model, show that outliers have low scores, possibly re-train the model without outliers
  • Include kNN for grouped data
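
For the kNN item, a sketch of the exact kNN Shapley recursion from Jia et al. (2019) for a single test point (function and variable names are ours):

# Sketch of exact kNN Shapley for one test point (Jia et al., 2019).
import numpy as np

def knn_shapley_single(x_train, y_train, x_test, y_test, k):
    n = len(x_train)
    order = np.argsort(np.linalg.norm(x_train - x_test, axis=1))  # by distance
    match = (y_train[order] == y_test).astype(float)
    values = np.zeros(n)
    values[order[-1]] = match[-1] / n
    for i in range(n - 2, -1, -1):   # recurse from farthest to nearest
        values[order[i]] = values[order[i + 1]] + (
            (match[i] - match[i + 1]) / k * min(k, i + 1) / (i + 1)
        )
    return values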

Implement a model attribute for the surrogate objective to handle non-convex and non-differentiable losses

If the loss function is non-convex, the Hessian around a chosen parameter value might have negative eigenvalues. To fix this, a small multiple of the identity is added to the Hessian at that point in a second-order Taylor approximation of the loss, which is then used to compute the influence instead of the actual loss.

See section 4 of http://proceedings.mlr.press/v70/koh17a.html.

For non-convex losses one can simply neglect the indefinite part of the Hessian (the terms involving second derivatives of the model). This corresponds to the Gauss-Newton approximation from https://www.cs.toronto.edu/~jmartens/docs/Deep_HessianFree.pdf.

The interface is not fully specified, and we might modify it in a way similar to the following. The same pattern could be achieved in a functional style. Here is an example for non-differentiable models:

from typing import Callable, Protocol

import numpy as np
import torch


class SupervisedModel(Protocol):
    """Only here for the type hints."""

    def fit(self, x: np.ndarray, y: np.ndarray):
        ...

    def predict(self, x: np.ndarray) -> np.ndarray:
        ...

    def score(self, x: np.ndarray, y: np.ndarray) -> float:
        ...

    def params(self) -> np.ndarray:
        ...


class PyTorchSurrogateModel:
    r"""Wrap a non-differentiable model with a surrogate objective L(\theta)."""

    def __init__(
        self,
        base_model: SupervisedModel,
        surrogate_objective: Callable[[torch.Tensor], torch.Tensor],
    ):
        self.__base_model = base_model
        self.__surrogate = surrogate_objective

    # ================================================
    # implement grad and mvp using torch and L(\theta)
    # ================================================

    def params(self) -> np.ndarray:
        return self.__base_model.params()

    def fit(self, x: np.ndarray, y: np.ndarray):
        return self.__base_model.fit(x, y)

    def predict(self, x: np.ndarray) -> np.ndarray:
        return self.__base_model.predict(x)

    def score(self, x: np.ndarray, y: np.ndarray) -> float:
        return self.__base_model.score(x, y)

Fix Hoeffding bound

The standard (ε,δ)-bound for the Monte Carlo SV approximation assumes a deterministic utility, but one never has this. Instead: make an additional (ε,δ)-assumption on the utility and nest the bounds.

However, sample bounds on the utility are only valid for more than $n$ samples, and during SV computation most evaluations of the utility are on small subsets of the training set. This is generally a problem when computing SVs. Solutions?
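
To make the nesting concrete (our notation, a plain union-bound argument, not a worked-out result): assume each estimated marginal contribution deviates from its exact value by more than $\epsilon_u$ with probability at most $\delta_u$. With $m$ Monte Carlo samples, the combined bound becomes $P(|\hat{v}_i - v_i| > \epsilon + \epsilon_u) \leq \delta + m \delta_u$, where $\delta \leq 2 \exp(-2 m \epsilon^2 / r^2)$ is the usual Hoeffding term for a utility with range $r$.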

Allow multidimensional labels

The current dataset class only supports 1-dimensional outputs. For regression it would be good to have multi-dimensional outputs. Due to the encapsulation through the loss function, this should not involve many changes.

Test baseline implementations

The current implementations of data Shapley have been tested on some very simple algorithms and datasets. We need to test them in more realistic scenarios and with more advanced models.

Switch from pyhash to hashlib

I have found many problems with using pyhash, mostly related to the fact that it only supports PyPy.
We could switch from pyhash to the more common hashlib.
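
For example, hashing an array of indices with the standard library could look like the following (a sketch; the actual cache key may include more than the indices):

# Hashing a numpy array with the standard library's hashlib.
import hashlib
import numpy as np

def hash_indices(indices: np.ndarray) -> str:
    return hashlib.blake2b(indices.tobytes(), digest_size=16).hexdigest()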

Improve project setup

The project's structure and setup could be improved for easier collaboration.

I have identified a few points that we could work on to make it better:

Monte Carlo iterative iHvP

See section 3, "stochastic estimation", of http://proceedings.mlr.press/v70/koh17a.html. The algorithm differs from conjugate gradient in that it uses a Taylor expansion and a recursive fixed-point iteration:

$r(x_t) = A x_t - b$
$x_{t+1} = x_t - r(x_t)$

Note that the approximations become more inaccurate as the iteration proceeds, and thus harder to test correctly, given rounding errors.
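
A toy numpy version of the deterministic iteration (our sketch; the stochastic variant replaces $A$ with a minibatch Hessian estimate at each step, and $A$ must be scaled so that $\|I - A\| < 1$):

# Toy sketch of the fixed-point iteration x_{t+1} = x_t - (A x_t - b).
# Converges when the spectral radius of (I - A) is below 1, e.g. after scaling A.
import numpy as np

def fixed_point_solve(A, b, steps=1000):
    x = np.zeros_like(b)
    for _ in range(steps):
        x = x - (A @ x - b)
    return x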

Fix simultaneous caching of multiple utilities

At the moment, if I define different utilities with the same client config, they will use the same cache. This has caused me a lot of problems in subsequent tests. Shall we flush the client when we create the utility? Are there any drawbacks to doing this?
