
verified_calibration's People

Contributors

ananyakumar, mpitropov, yifanmai


verified_calibration's Issues

Question: Reliability diagrams

I noticed that the Verified Uncertainty Calibration paper does not include reliability diagrams. Would it be possible to return the per-bin accuracies so that users can create their own reliability diagrams? If the bin sizes differ, I think those would also have to be returned.

For example, in my current codebase, to create a reliability diagram for ECE I do the following:

I have these variables

acc = [0, 0, 0.00167434, 0.00271739, 0.007, 0.00495663, 0.00269906, 0.01893491, 0.04973357, 0.65488513]
ece = 0.41769305566809833

Send it to my plotting function

import matplotlib.pyplot as plt
import numpy as np

# Plot Reliability Diagram
def plot_reliability(acc, ece, save_path):
    # Centers of the equal-width confidence bins on [0, 1]
    interval = 1 / len(acc)
    x = np.arange(interval / 2, 1 + interval / 2, interval)

    plt.figure(figsize=(3, 3))
    plt.bar(x, acc, width=0.08, edgecolor='k')
    plt.xlabel('Confidence')
    plt.ylabel('Accuracy')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.text(0, 1.01, 'ECE={}'.format(str(ece)[:5]))

    # Diagonal y = x marks perfect calibration
    plt.plot([0, 1], [0, 1], 'k--')
    plt.tight_layout()
    plt.savefig(save_path)
    plt.show()

To create a diagram like this:
[Example reliability diagram image]
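In case it helps, the per-bin accuracies (the acc array above) can also be computed directly with NumPy. Below is a hedged, standalone sketch (not part of the calibration library) for equal-width bins over the top-label confidences, where correct is a 0/1 array marking whether the top prediction was right:

import numpy as np

def bin_accuracies(top_probs, correct, num_bins=10):
    # Assign each confidence to an equal-width bin on [0, 1] and return
    # each bin's accuracy and sample count.
    top_probs = np.asarray(top_probs, dtype=np.float64)
    correct = np.asarray(correct, dtype=np.float64)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    bin_ids = np.clip(np.digitize(top_probs, edges[1:-1]), 0, num_bins - 1)
    acc = np.zeros(num_bins)
    counts = np.zeros(num_bins, dtype=np.int64)
    for b in range(num_bins):
        mask = bin_ids == b
        counts[b] = mask.sum()
        acc[b] = correct[mask].mean() if counts[b] > 0 else 0.0
    return acc, counts

The counts array covers the case mentioned above where the bin sizes differ.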

Pickle trained Calibrators

Trying to pickle a fitted HistogramMarginalCalibrator gives the error:

AttributeError: Can't pickle local object 'get_histogram_calibrator.<locals>.calibrator'
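A possible workaround (an assumption on my part, using the third-party dill package, which can serialize closures that the standard pickle module cannot):

import dill  # third-party; pip install dill

# calibrator is the fitted HistogramMarginalCalibrator from above
with open('calibrator.dill', 'wb') as f:
    dill.dump(calibrator, f)

with open('calibrator.dill', 'rb') as f:
    calibrator = dill.load(f)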

Error if a bin has only a single class in it

If a bin contains only a single class label, the logistic regression fails to fit (the solver requires more than one class).

I'm not sure this is the 'right' fix, but it worked as a quick-and-dirty workaround:

import numpy as np
from sklearn.linear_model import LogisticRegression

def get_platt_scaler(model_probs, labels):
    clf = LogisticRegression(C=1e10, solver='lbfgs')
    eps = 1e-12
    model_probs = model_probs.astype(dtype=np.float64)
    model_probs = np.expand_dims(model_probs, axis=-1)
    model_probs = np.clip(model_probs, eps, 1 - eps)
    model_probs = np.log(model_probs / (1 - model_probs))  # logit transform
    unique_labels = np.unique(labels) # +
    if unique_labels.shape[0] != 1: # +  skip fitting when only one class is present
        clf.fit(model_probs, labels)
    def calibrator(probs):
        x = np.array(probs, dtype=np.float64)
        x = np.clip(x, eps, 1 - eps)
        x = np.log(x / (1 - x))
        if unique_labels.shape[0] != 1: # +  otherwise fall back to the identity mapping
            x = x * clf.coef_[0] + clf.intercept_
        output = 1 / (1 + np.exp(-x))
        return output
    return calibrator
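A quick demonstration of the degenerate case this guards against (hypothetical data):

import numpy as np

probs = np.array([0.7, 0.8, 0.9])
labels = np.array([1, 1, 1])                 # only one class present in the bin
calibrate = get_platt_scaler(probs, labels)  # no LogisticRegression fit happens
print(calibrate(np.array([0.6, 0.95])))      # falls back to the (clipped) identity mapping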

Question about num_calibration

Hi, thanks for making this library - it's super convenient. One question about the PlattBinnerMarginalCalibrator (and I suppose the other calibrators as well): what is the purpose of the num_calibration argument? It seems to be used only to assert that a minimum number of samples has been provided, but what is the point of that? Does it have any use apart from that single assert?
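For reference, this is the kind of usage I mean (a hedged sketch with dummy data, following the constructor/train/calibrate pattern used elsewhere in these issues; num_calibration is just the number of samples passed for calibration):

import numpy as np
import calibration as cal

# Dummy two-class probabilities and labels
probs = np.random.rand(1000, 2)
probs = probs / probs.sum(axis=1, keepdims=True)
labels = np.random.randint(0, 2, size=1000)

calibrator = cal.PlattBinnerMarginalCalibrator(num_calibration=1000, num_bins=10)
calibrator.train_calibration(probs, labels)
calibrated = calibrator.calibrate(probs)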

Calculate calibration error with softmax distribution and missing class

For example, this works:

>>> l0 = [0.8,0.1,0.1]
>>> l1 = [0.7,0.2,0.1]
>>> l2 = [0.3,0.3,0.3]
>>> cal.get_calibration_error([l0,l1,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2], [0,2,1,1,1,1,1,1,1,1,1,1,1,1,1])
0.4353797831268186

But removing class 0 will give an error:

>>> cal.get_calibration_error([l0,l1,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2], [1,2,1,1,1,1,1,1,1,1,1,1,1,1,1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/matthew/anaconda3/lib/python3.7/site-packages/calibration/utils.py", line 125, in get_calibration_error
    return get_binning_ce(probs, labels, p, debias, mode=mode)
  File "/home/matthew/anaconda3/lib/python3.7/site-packages/calibration/utils.py", line 193, in get_binning_ce
    return _get_ce(probs, labels, p, debias, None, binning_scheme=get_discrete_bins, mode=mode)
  File "/home/matthew/anaconda3/lib/python3.7/site-packages/calibration/utils.py", line 236, in _get_ce
    labels_one_hot = get_labels_one_hot(labels, k=probs.shape[1])
  File "/home/matthew/anaconda3/lib/python3.7/site-packages/calibration/utils.py", line 509, in get_labels_one_hot
    assert np.min(labels) == 0
AssertionError

I have a simple fix I can open in a PR, but I'm not sure if it is valid.
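For what it's worth, here is a hypothetical standalone one-hot helper that tolerates missing classes as long as the number of columns k is passed explicitly (my own sketch, not the library's code or the actual PR):

import numpy as np

def labels_one_hot(labels, k):
    # Build an (n, k) one-hot matrix without requiring every class
    # in 0..k-1 to actually appear in labels.
    labels = np.asarray(labels, dtype=np.int64)
    assert labels.min() >= 0 and labels.max() <= k - 1
    one_hot = np.zeros((labels.size, k), dtype=np.float64)
    one_hot[np.arange(labels.size), labels] = 1.0
    return one_hot

With k taken from probs.shape[1], the example above would then produce a 15 x 3 one-hot matrix even though class 0 never appears.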

Reproducing Figure 3 of the paper

Hi,

I'm trying to run the experiment in section 4.3 of the paper using
python3 experiments/scaling_binning_calibrator/compare_calibrators.py
but it throws the following error

File "experiments/scaling_binning_calibrator/compare_calibrators.py", line 11
    def eval_top_calibration(probs, probs, labels):
    ^
SyntaxError: duplicate argument 'probs' in function definition

Also the functions eval_top_calibration, upper_bound_marginal_calibration_unbiased and upper_bound_marginal_calibration_biased have the same problem.

I think in eval_top_calibration we should pass probs = utils.get_top_probs(probs) to cal.get_discrete_bins in line 13.

But after changing those lines I'm still unable to reproduce the plots in Figure 3 of the paper. Could you please tell me what I should modify to make it work?
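For context, here is a hedged, self-contained stand-in (plain NumPy, not the repo's code) for what a top-label calibration evaluation with discrete bins plausibly computes:

import numpy as np

def top_calibration_error(probs, labels, p=2):
    # Plug-in top-label calibration error with one bin per distinct
    # top-label confidence (i.e. discrete bins).
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)
    top_probs = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(np.float64)
    error = 0.0
    for c in np.unique(top_probs):
        mask = top_probs == c
        error += mask.mean() * abs(correct[mask].mean() - c) ** p
    return error ** (1.0 / p)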

Bootstrap uncertainty details

Hi!

First of all, thanks for the excellent package, and in particular also for still actively maintaining it! :-)

I have some questions regarding the bootstrapping-based uncertainty quantification. When I call get_calibration_error_uncertainties, it calls bootstrap_uncertainty with the functional get_calibration_error(probs, labels, p, debias=False, mode=mode).

bootstrap_uncertainty will then roughly do this:

    plugin = functional(data)
    bootstrap_estimates = []
    for _ in range(num_samples):
        bootstrap_estimates.append(functional(resample(data)))
    return (2*plugin - np.percentile(bootstrap_estimates, 100 - alpha / 2.0),
            2*plugin - np.percentile(bootstrap_estimates, 50),
            2*plugin - np.percentile(bootstrap_estimates, alpha / 2.0))
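For concreteness, here is a self-contained version of that reversed-percentile ("basic") bootstrap interval, written as a hedged sketch rather than the library's actual implementation:

import numpy as np

def bootstrap_interval(data, functional, num_samples=1000, alpha=10.0, seed=0):
    # Reflect the bootstrap percentiles around the plug-in estimate,
    # exactly as the snippet above does.
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    plugin = functional(data)
    estimates = []
    for _ in range(num_samples):
        resampled = data[rng.integers(0, len(data), size=len(data))]
        estimates.append(functional(resampled))
    lower = 2 * plugin - np.percentile(estimates, 100 - alpha / 2.0)
    median = 2 * plugin - np.percentile(estimates, 50)
    upper = 2 * plugin - np.percentile(estimates, alpha / 2.0)
    return lower, median, upper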

Questions:

  1. Why is debias=False in the call to get_calibration_error? I would like uncertainty estimates for the debiased (L2) error estimate.
  2. How/why is "2*plugin - median(bootstrap_estimates)" a good estimate of the median? And similarly for the lower/upper quantiles?
  3. In get_calibration_error_uncertainties, it says "When p is not 2 (e.g. for the ECE where p = 1), [the median]
    can be used as a debiased estimate as well." - why would that be true / what exactly do you mean by it...?

I guess what I am really asking is: what's the reasoning behind the approach you chose, and is it described somewhere? :-)

Unable to obtain calibration error when missing a class

When some classes are not represented in the ground-truth labels, the calibration error cannot be computed.

To reproduce:

>>> cal.get_ece([[0.9,0.1], [0.8,0.2]], [0,0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/calibration/utils.py", line 196, in get_ece
    binning_scheme=get_equal_prob_bins, mode=mode)
  File "/usr/local/lib/python3.6/dist-packages/calibration/utils.py", line 162, in lower_bound_scaling_ce
    return _get_ce(probs, labels, p, debias, num_bins, binning_scheme, mode=mode)
  File "/usr/local/lib/python3.6/dist-packages/calibration/utils.py", line 232, in _get_ce
    raise ValueError('labels should be between 0 and num_classes - 1.')
ValueError: labels should be between 0 and num_classes - 1.

Calibrated probabilities from "top" calibrators

Hi,

First I really appreciate the repository. Awesome work!

I noticed that the "top" calibrators, such as HistogramTop, PlattBinnerTop, etc., produce only the calibrated probability of the top label. I'm not sure how I can adjust the probabilities of the other classes in a multi-class task. Say I originally have a probabilistic prediction [0.1, 0.8, 0.05, 0.05] and the top calibrator adjusts only the 0.8 down to 0.6. Should I distribute the remaining 0.2 uniformly over the other 3 classes? In some cases this might change the predicted class, no? (I would need the complete distribution to calculate, e.g., the ECE score.)
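One option, purely as a hedged sketch of a heuristic rather than anything the library prescribes, is to keep the calibrated top probability and rescale the remaining classes proportionally to their original mass:

import numpy as np

def merge_top_calibration(orig_probs, calibrated_top):
    # Keep the calibrated probability for the originally-predicted class
    # and scale the other classes so each row still sums to 1.
    orig_probs = np.asarray(orig_probs, dtype=np.float64)
    calibrated_top = np.asarray(calibrated_top, dtype=np.float64)
    out = orig_probs.copy()
    top_idx = orig_probs.argmax(axis=1)
    for i, (t, c) in enumerate(zip(top_idx, calibrated_top)):
        rest = [j for j in range(orig_probs.shape[1]) if j != t]
        rest_mass = orig_probs[i, rest].sum()
        out[i, t] = c
        if rest_mass > 0:
            out[i, rest] = (1.0 - c) * orig_probs[i, rest] / rest_mass
        else:
            out[i, rest] = (1.0 - c) / len(rest)
    return out

Proportional rescaling preserves the ordering among the non-top classes, though if the calibrated top probability drops far enough the predicted class can still change.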

Another question: I also saw that the calibrators require a num_calibration argument which doesn't seem to play any role. What's the reason for that?

Thanks and best regards,
T

The numerical ordering is not consistent before/after calibration

Hi,

First I would like to appreciate the work and the repository.

When using the library, I've noticed that the relative ordering of numerical values can change post-calibration. For instance, if a > b before calibration, it is not guaranteed that a > b after calibration. However, based on my understanding, the calibration function should be monotonic.

Below is the example I used:

import numpy as np
import calibration as cal

raw_probs = [0.61051559, 0.00047493709, 0.99639291, 0.00021221573, 0.99599433, 0.0014127002, 0.0028262993]
labels = [1, 0, 1, 0, 1, 0, 0]
raw_probs = np.array(raw_probs)
raw_probs = np.vstack((raw_probs, 1 - raw_probs)).T
# train calibrator
num_bins = 4
num_points = len(raw_probs)
calibrator = cal.PlattBinnerMarginalCalibrator(num_points, num_bins=num_bins)
calibrator.train_calibration(raw_probs, labels)
# test
np.random.seed(0)
test_probs_1 = np.random.rand(7)
test_probs_1 = np.vstack((test_probs_1, 1 - test_probs_1)).T
calibrated_probs_1 = calibrator.calibrate(test_probs_1)
print(np.argsort(test_probs_1[:, 0]) == np.argsort(calibrated_probs_1[:, 0]))  # check whether the orders are the same

I also tested with the example file in the repo. In that file, a calibrator is trained and tested on 1000 synthetic data points. I randomly sampled 100 pairs of values from the probabilities before and after calibration, and found that their relative order is not always preserved.

I would appreciate any clarification you could provide.
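As a small aside on the ordering question, here is a tiny illustration (my own example, not the library's code) of how the binning step alone can break strict ordering: values that land in the same bin map to a single output value, so an argsort of the outputs need not match an argsort of the inputs.

import numpy as np

x = np.array([0.62, 0.58, 0.91, 0.40])
edges = np.array([0.0, 0.5, 0.75, 1.0])
bin_means = np.array([0.30, 0.60, 0.90])                 # one output value per bin
binned = bin_means[np.clip(np.digitize(x, edges) - 1, 0, 2)]
print(np.argsort(x))       # [3 1 0 2]
print(np.argsort(binned))  # [3 0 1 2] -- the tie at 0.60 changes the order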
