doubtlab's Introduction

doubtlab

A lab for bad labels. Learn more here.

This repository contains general tricks that may help you find bad, or noisy, labels in your dataset. The hope is that it makes it easier for folks to quickly check their own datasets before they invest too much time and compute in a grid search.

Install

You can install the tool via pip or conda.

Install with pip

python -m pip install doubtlab

Install with conda

conda install -c conda-forge doubtlab

Quickstart

Doubtlab allows you to define "reasons" for a row of data to deserve another look. These reasons can form a pipeline which can be used to retrieve a sorted list of examples worth checking again.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from doubtlab.ensemble import DoubtEnsemble
from doubtlab.reason import ProbaReason, WrongPredictionReason

# Let's say we have some dataset/model already
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1_000)
model.fit(X, y)

# Next we can add reasons for doubt. In this case we're saying
# that examples deserve another look if the associated proba values
# are low or if the model output doesn't match the associated label.
reasons = {
    'proba': ProbaReason(model=model),
    'wrong_pred': WrongPredictionReason(model=model)
}

# Pass these reasons to a doubtlab instance.
doubt = DoubtEnsemble(**reasons)

# Get the ordered indices of examples worth checking again
indices = doubt.get_indices(X, y)
# Get dataframe with "reason"-ing behind the sorting
predicates = doubt.get_predicates(X, y)

Features

The library implements many "reasons" for doubt.

General Reasons

  • RandomReason: assign doubt randomly, just to be sure
  • OutlierReason: assign doubt when the model declares a row an outlier

Classification Reasons

  • ProbaReason: assign doubt when a model's confidence values are low for every label
  • WrongPredictionReason: assign doubt when a model cannot predict the listed label
  • ShortConfidenceReason: assign doubt when the correct label gains too little confidence
  • LongConfidenceReason: assign doubt when a wrong label gains too much confidence
  • DisagreeReason: assign doubt when two models disagree on a prediction
  • CleanlabReason: assign doubt according to cleanlab
  • MarginConfidenceReason: assign doubt when there's a small difference between the top two class confidences

Regression Reasons

  • AbsoluteDifferenceReason: assign doubt when the absolute difference is too high
  • RelativeDifferenceReason: assign doubt when the relative difference is too high
  • StandardizedErrorReason: assign doubt when the absolute standardized residual is too high
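For regression, the pattern mirrors the quickstart. The snippet below is a minimal sketch; the threshold keyword is an assumption about AbsoluteDifferenceReason's signature and marks the absolute error above which a row is doubted.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

from doubtlab.ensemble import DoubtEnsemble
from doubtlab.reason import AbsoluteDifferenceReason

# Fit a simple regression model on a toy dataset.
X, y = load_diabetes(return_X_y=True)
model = LinearRegression().fit(X, y)

# Doubt rows where the absolute prediction error exceeds the (assumed) threshold.
doubt = DoubtEnsemble(abs_diff=AbsoluteDifferenceReason(model, threshold=50.0))
indices = doubt.get_indices(X, y)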

Feedback

It is early days for the project. The project should be plenty useful as-is, but we prefer to be honest. Feedback and anecdotes are very welcome!

Related Projects

  • The cleanlab project was an inspiration for this one. They have a great heuristic for bad label detection but I wanted to have a library that implements many. Be sure to check out their work on the labelerrors.com project.
  • My former employer, Rasa, has always had a focus on data quality. Some of that attitude is bound to have seeped in here. Be sure to check the Conversation Driven Development approach and Rasa X if you're working on virtual assistants.
  • My current employer, Explosion, has a neat labelling tool called Prodigy. I'm currently investigating how tools like doubtlab might lead to better labels when combined with this (very like-able) annotation tool.


doubtlab's Issues

Doubt Reason based on Margin

In the active learning literature, there is a heuristic called margin where we check the difference in probabilities between the highest and second-highest predicted classes. If the margin is very low, it could indicate doubt.

Example:

cat: 0.9, dog: 0.05, elephant: 0.05 -> margin = 0.9 - 0.05 = 0.85 (high margin)

cat: 0.4, dog: 0.4, elephant: 0.2 -> margin = 0.4 - 0.4 = 0 (low margin)
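A minimal sketch of this heuristic as a doubtlab-style callable reason; the margin_reason helper below is illustrative and not part of the library.

import numpy as np

def margin_reason(model, threshold=0.2):
    # Illustrative: doubt rows where the gap between the two highest
    # predicted probabilities falls below `threshold`.
    def reason(X, y=None):
        probas = model.predict_proba(X)
        top_two = np.sort(probas, axis=1)[:, -2:]
        margin = top_two[:, 1] - top_two[:, 0]
        return (margin < threshold).astype(np.float16)
    return reason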

Consider a fairlearn demo.

When two models disagree, something interesting might be happening. But that'll only happen if you have two models that are actually different.

What if you have one model that's better at accuracy and another one that's better at fairness?

Maybe these labels deserve more attention too.
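A rough sketch of that setup with the DisagreeReason listed above, assuming it accepts two fitted models; the second model here is a plain stand-in for a fairness-constrained estimator such as one trained with fairlearn.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from doubtlab.ensemble import DoubtEnsemble
from doubtlab.reason import DisagreeReason

# Two deliberately different models; in the fairlearn scenario the second
# one would be the fairness-aware estimator.
X, y = load_iris(return_X_y=True)
model_a = LogisticRegression(max_iter=1_000).fit(X, y)
model_b = RandomForestClassifier(random_state=0).fit(X, y)

doubt = DoubtEnsemble(disagree=DisagreeReason(model_a, model_b))
indices = doubt.get_indices(X, y)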

Doubt Reason Based on Entropy

If a machine learning model is very "confident", then the proba scores will have low entropy. The most uncertain outcome is a uniform distribution, which has high entropy. Therefore, it could be sensible to add entropy as a reason for doubt.
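A minimal sketch of such a reason; the entropy_reason helper is hypothetical and normalises by log(n_classes) so the score lands in [0, 1].

import numpy as np

def entropy_reason(model, threshold=0.5):
    # Illustrative: doubt rows whose normalised prediction entropy is high.
    def reason(X, y=None):
        probas = model.predict_proba(X)
        ent = -(probas * np.log(probas + 1e-12)).sum(axis=1)
        ent = ent / np.log(probas.shape[1])  # scale to [0, 1]
        return (ent > threshold).astype(np.float16)
    return reason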

Add example to docs that shows lambda X, y: y.isna()

Hey! First of all: this is a very cool project ;) I have been thinking about potential new "reasons" to doubt, and I personally often look into predictions generated by a model whenever the data instance had missing values (and part of the model pipeline imputes them)... So I wonder if it would be useful to have a FillNaNReason (or something similar) based, for example, on the MissingIndicator transformer.
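A minimal sketch of the lambda from the issue title, assuming y is a pandas Series and that DoubtEnsemble accepts plain callables as reasons; a FillNaNReason could wrap the same idea, or build on sklearn's MissingIndicator for missing feature values.

import numpy as np
from doubtlab.ensemble import DoubtEnsemble

# Doubt any row whose label is missing.
doubt = DoubtEnsemble(missing_label=lambda X, y: y.isna().values.astype(np.float16))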

Label properties (or meta data) that make it less likely that a label is correct

Real data is usually imbalanced, but the long-tail labels are also more likely to be wrong (whether predicted or human-annotated). I don't know how much weight should be assigned to that, though. But I figure it could make a Reason class on its own.

Some labels have a hierarchical or aggregative aspect that is predictive about label accuracy, e.g. in ImageNet there are 120 classes that are dog breeds which may be better or worse data compared to other groups of labels. This is more of a metadata thing.

And then there's straight-up metadata, e.g. certain annotators may be very unreliable, or if the date of annotation is known there may be a correlation showing that most of the bad labeling happened at a certain stage in the labeling process. But now we are moving into modeling the likelihood of a prediction being true or not.

Tbh it would be nice to have a meld of Prodigy and doubtlab where we do active learning on whether the label was right or not (probably with a tree-based method, sticking to numbers and categories), and then fix the labels while doing this.

And it would be nice if Prodigy had a more realistic demo for classification where there are hundreds or thousands of possible labels. It's very easy to do something with these toy datasets, and when that's all the demo shows, it usually means you have to fight the framework to do anything with a non-toy dataset.

Improve Docs for `from_proba`-esque methods for Keras users

Hey,

I'm trying to use doubtlab with my Keras model that does image classification. I have a model that is already trained and a test dataset. I want to find examples in my test dataset that are misclassified, or correctly classified but with low confidence.
Basically, I'm running something very simple:

model = tf.keras.models.load_model('data/results/SDH16K_GPU_WITHAUG/model.h5')
doubt = DoubtEnsemble(reason = WrongPredictionReason(model=model))
indices = doubt.get_indices(test_images, test_labels)

And get the following error:

105/105 [==============================] - 10s 50ms/step
File ~/code-project/MyoQuant-SDH-Train/.venv/lib/python3.8/site-packages/doubtlab/reason.py:232, in WrongPredictionReason.from_predict(pred, y, method)
    228         raise ValueError(
    229             f"Cannot use method={method} when y_true values aren't binary."
    230         )
    231 if method == "all":
--> 232     return (pred != y).astype(np.float16)
    233 if method == "fp":
    234     return ((y == 0) & (pred == 1)).astype(np.float16)
AttributeError: 'bool' object has no attribute 'astype'

I've tried wrapping my Keras model as a scikit-learn classifier (using: https://www.adriangb.com/scikeras/stable/generated/scikeras.wrappers.KerasClassifier.html), but I get a "not fitted" error:

sciKeras = KerasClassifier(model)
doubt = DoubtEnsemble(reason = WrongPredictionReason(model=sciKeras))
File ~/code-project/MyoQuant-SDH-Train/.venv/lib/python3.8/site-packages/scikeras/wrappers.py:993, in BaseWrapper._predict_raw(self, X, **kwargs)
    991 # check if fitted
    992 if not self.initialized_:
--> 993     raise NotFittedError(
    994         "Estimator needs to be fit before `predict` " "can be called"
    995     )
    996 # basic input checks
    997 X, _ = self._validate_data(X=X, y=None)

NotFittedError: Estimator needs to be fit before `predict` can be called

I guess doubtlab is only for scikit-learn models for now, but I wondered if somebody has tried something similar.
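A possible workaround, sketched under the assumption that reasons can be plain callables: precompute hard class predictions from the Keras probabilities (reusing model, test_images, and test_labels from above) and wrap the from_predict staticmethod in a lambda.

import numpy as np

from doubtlab.ensemble import DoubtEnsemble
from doubtlab.reason import WrongPredictionReason

# Keras returns per-class probabilities; take the argmax so the comparison
# inside from_predict happens element-wise between two NumPy arrays.
probas = model.predict(test_images)
preds = probas.argmax(axis=1)

doubt = DoubtEnsemble(
    wrong_pred=lambda X, y: WrongPredictionReason.from_predict(preds, np.asarray(y), method="all")
)
indices = doubt.get_indices(test_images, test_labels)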

Add a conda installation option using conda-forge channel

I have already started this one. Will push a PR once the conda installation option is available.

See: Adding doubtlab from PyPI to conda-forge channel.

@koaning As the primary maintainer of this repo, would you like to be listed as one of the maintainers of doubtlab on conda-forge channel? Please let me know, I will add your name as another maintainer of conda-forge/doubtlab-feedstock, once it is accepted.

Doubt Reason based on Margin of Confidence

If we assume classification with n > 2 classes, then the difference between the top two most confident predictions could also be used as a proxy for doubt. This could be the absolute difference or relative difference.

ApricotReason

Although the apricot library isn't meant to find "bad" labels, it could help find the rows of data that "matter more". When starting out with only a few labels, this may be of great use.

Before moving on, it would be good to investigate the merits of this approach a bit more though.

"Fair" Sorting

Suppose there are 5 reasons for doubt, 4 of which overlap a lot. Then we may end up in a situation where we ignore a reason. That could be bad ... maybe it's worth exploring voting systems a bit to figure out alternative sorting methods.

Benchmark Snorkel Sorting

There are a few ideas here. Right now we reward overlap and allow the user to use the predicates table to customize the sorting. But mayhaps we can go a step further. Maybe something like snorkel could play a role here.

Not 100% sure if it's worth the effort though, so it'd be good to benchmark first.

Assign Doubt for Dissimilarity from Labelled Set

Suppose that y can contain NaN values for rows that aren't labeled. In that case, we may want to favor a subset of these NaN values; in particular, those that differ substantially from the already labeled datapoints.

The idea here is that we may be able to sample more diverse datapoints.
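A rough sketch of what that could look like; the dissimilarity_reason helper is hypothetical and simply measures the distance of each row to its nearest labelled neighbour.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def dissimilarity_reason(threshold=None):
    # Illustrative: doubt unlabelled rows (y is NaN) that sit far away
    # from the labelled portion of the data.
    def reason(X, y):
        y = np.asarray(y, dtype=float)
        labelled = ~np.isnan(y)
        nn = NearestNeighbors(n_neighbors=1).fit(X[labelled])
        dist = nn.kneighbors(X)[0].ravel()
        cutoff = threshold if threshold is not None else np.quantile(dist, 0.9)
        return ((dist > cutoff) & ~labelled).astype(np.float16)
    return reason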

`QuantileDifferenceReason` and `StandardDeviationReason`

Hey! I was thinking it might make sense to add two more reasons for regression tasks, namely something like HighLeveragePointReason and HighStudentizedResidualReason.

Citing Wikipedia:

  • Leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points, if any, are outliers with respect to the independent variables (link)
  • A studentized residual is the quotient resulting from the division of a residual by an estimate of its standard deviation. [...] This is an important technique in the detection of outliers. (link)
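A simplified sketch of the second idea, using plain standardized residuals as a stand-in (a full studentized residual would also divide out the leverage term); the helper name is illustrative.

import numpy as np

def studentized_residual_reason(model, threshold=3.0):
    # Illustrative: doubt rows with a large (approximately) studentized
    # residual; leverage is ignored here for simplicity.
    def reason(X, y):
        residuals = np.asarray(y) - model.predict(X)
        scaled = residuals / residuals.std(ddof=1)
        return (np.abs(scaled) > threshold).astype(np.float16)
    return reason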

Doubt about MarginConfidenceReason :-)

Hi Vincent,

Nice library! As mentioned a while ago on Twitter I'm doing a review to understand and compare different approaches to find label errors.

I'm playing with the AG News dataset, which we know contains a lot of errors from our own previous experiments with Rubrix (using the training loss and using cleanlab).

While playing with the different reasons, I'm having difficulty understanding the reasoning behind the MarginConfidenceReason. As far as I can tell, if the model is doubting, the margin between the top two predicted labels should be small, and that could point to an ambiguous example and/or a label error. If I read the code and description correctly, MarginConfidenceReason does the opposite, so I'd love to know the reasoning behind this to make sure I'm not missing something.

For context, using the MarginConfidenceReason with the AG News training set yields almost the entire dataset (117788 examples for the default threshold of 0.2, and 112995 for threshold=0.5). I guess this could become useful when there's overlap with other reasons, but I want to make sure about the reasoning :-).

Minor bug in LongConfidenceReason

Hi, as stated in the title,
in reason.py line 260,
proba_dict = {classes[j]: v for j, v in enumerate(proba) if j != y[i]}
the index j is compared against the class label y[i].
Suggested change:
proba_dict = {classes[j]: v for j, v in enumerate(proba) if classes[j] != y[i]}

Add staticmethods to reasons to prevent re-compute.

I really like the current design with reasons just being function calls.

However, when working with large datasets, or in use cases where you already have the predictions of a model, I wonder if you have thought about letting users pass either a sklearn model or the pre-computed probas (for those Reasons where it makes sense). For threshold-based reasons and large datasets this could save some time and compute, allow for faster iteration, and it would open up the possibility of using other models beyond sklearn.

I understand that the design wouldn't be as clean as it is right now, and it might cause misalignments if users don't pass the correct shapes/orderings, but I wonder if you have considered this (or any other way to pass pre-computed predictions).

Just to illustrate what I mean (sorry about the dirty pseudo-code):

import numpy as np

class ProbaReason:

    def __init__(self, model=None, probas=None, max_proba=0.55):
        # Require at least one of `model` or `probas`.
        if model is None and probas is None:
            raise ValueError("You should at least pass a model or probas")
        self.model = model
        self.probas = probas
        self.max_proba = max_proba

    def __call__(self, X, y=None):
        # Prefer the pre-computed probas; fall back to the model otherwise.
        probas = self.probas if self.probas is not None else self.model.predict_proba(X)
        result = probas.max(axis=1) <= self.max_proba
        return result.astype(np.float16)

Does it make sense to add an ensemble for spaCy?

This seems to be a like-able method of dealing with text outside the realm of scikit-learn. But I prefer to delay this feature until I really understand the use-case. For anything related to entities we cannot use sklearn, but tags/classes should work fine as-is.
