p-lambda / verified_calibration

Calibration library and code for the paper: Verified Uncertainty Calibration. Ananya Kumar, Percy Liang, Tengyu Ma. NeurIPS 2019 (Spotlight).

License: MIT License
Within the Verified Uncertainty Calibration paper I noticed there are no reliability diagrams. Would it be possible to return the bin accuracies in order for users to create the reliability diagram? If the bin sizes are different, I think that would also have to be returned.
For example, in my current codebase I create a reliability diagram for ECE as follows. I have these variables:

```python
acc = [0, 0, 0.00167434, 0.00271739, 0.007, 0.00495663, 0.00269906, 0.01893491, 0.04973357, 0.65488513]
ece = 0.41769305566809833
```

which I send to my plotting function:
```python
import numpy as np
import matplotlib.pyplot as plt

# Plot a reliability diagram: per-bin accuracy vs. confidence.
def plot_reliability(acc, ece, save_path):
    interval = 1 / len(acc)
    # Bin centers: 0.05, 0.15, ..., 0.95 for 10 bins.
    x = np.arange(interval / 2, 1 + interval / 2, interval)
    plt.figure(figsize=(3, 3))
    plt.bar(x, acc, width=0.08, edgecolor='k')
    plt.xlabel('Confidence')
    plt.ylabel('Accuracy')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.text(0, 1.01, 'ECE={}'.format(str(ece)[:5]))
    plt.plot([0, 1], [0, 1], 'k--')  # perfect-calibration diagonal
    plt.tight_layout()
    plt.savefig(save_path)
    plt.show()
```
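In the meantime, here is a minimal numpy sketch of how per-bin accuracies and bin sizes could be computed from held-out predictions (my own equal-width binning, not necessarily the library's binning scheme; `confidences` are top-label probabilities and `correct` flags whether each prediction matched its label):

```python
import numpy as np

def bin_stats(confidences, correct, num_bins=10):
    confidences = np.asarray(confidences, dtype=np.float64)
    correct = np.asarray(correct, dtype=np.float64)
    # Map each confidence to an equal-width bin index in [0, num_bins - 1].
    bins = np.minimum((confidences * num_bins).astype(int), num_bins - 1)
    counts = np.bincount(bins, minlength=num_bins)
    # Per-bin accuracy: mean of `correct` within each bin (0 for empty bins).
    acc = np.bincount(bins, weights=correct, minlength=num_bins)
    acc = np.divide(acc, counts, out=np.zeros(num_bins), where=counts > 0)
    return acc, counts
```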
Trying to pickle a fitted HistogramMarginalCalibrator gives the error:

```
AttributeError: Can't pickle local object 'get_histogram_calibrator.<locals>.calibrator'
```
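The calibrator returned by get_histogram_calibrator is a closure (a function defined inside another function), which the standard pickle module cannot serialize. A possible workaround, assuming the third-party dill package is acceptable, is to serialize with dill instead (a sketch; `calibrator` stands for the fitted object):

```python
import dill

# dill can serialize the closures that the standard pickle module rejects.
with open('calibrator.pkl', 'wb') as f:
    dill.dump(calibrator, f)

with open('calibrator.pkl', 'rb') as f:
    restored = dill.load(f)
```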
If a bin only contains a single class label, the logistic regression fails to fit (the solver requires more than one class label).
I'm not sure if this is the 'right' fix, but it worked as a quick-and-dirty workaround:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def get_platt_scaler(model_probs, labels):
    clf = LogisticRegression(C=1e10, solver='lbfgs')
    eps = 1e-12
    model_probs = model_probs.astype(dtype=np.float64)
    model_probs = np.expand_dims(model_probs, axis=-1)
    model_probs = np.clip(model_probs, eps, 1 - eps)
    model_probs = np.log(model_probs / (1 - model_probs))
    unique_labels = np.unique(labels)  # +
    if unique_labels.shape[0] != 1:    # + skip fitting when only one class is present
        clf.fit(model_probs, labels)
    def calibrator(probs):
        x = np.array(probs, dtype=np.float64)
        x = np.clip(x, eps, 1 - eps)
        x = np.log(x / (1 - x))
        if unique_labels.shape[0] != 1:  # + fall back to the identity map otherwise
            x = x * clf.coef_[0] + clf.intercept_
        output = 1 / (1 + np.exp(-x))
        return output
    return calibrator
```

(The lines marked `# +` are the added guards.)
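For illustration, the degenerate input that used to crash (toy data; with a single class present the scaler now reduces to the identity on probabilities):

```python
probs = np.array([0.7, 0.8, 0.9])
labels = np.array([1, 1, 1])  # only one class label present

calibrator = get_platt_scaler(probs, labels)
print(calibrator(np.array([0.6])))  # ~0.6: logit then sigmoid, no scaling applied
```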
The PyPI package name is scikit-learn, not sklearn.

The function parameter num_samples of lower_bound_experiment is not used. It should be num_samples=num_samples instead of num_samples=1000.
I noticed that the variable parent is not used. I thought I would mention it in case it is meant to be used to specify where the figure should be saved.
Hi, thanks for making this library - it's super convenient. One question about the PlattBinnerMarginalCalibrator (and, I suppose, the other calibrators as well): what is the purpose of the num_calibration argument? It seems to be used only to assert that a minimum number of samples has been provided, but what is the point of that? Does it have any use apart from that single assert?
For example, this works:

```
>>> l0 = [0.8,0.1,0.1]
>>> l1 = [0.7,0.2,0.1]
>>> l2 = [0.3,0.3,0.3]
>>> cal.get_calibration_error([l0,l1,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2], [0,2,1,1,1,1,1,1,1,1,1,1,1,1,1])
0.4353797831268186
```
But removing class 0 gives an error:

```
>>> cal.get_calibration_error([l0,l1,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2], [1,2,1,1,1,1,1,1,1,1,1,1,1,1,1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/matthew/anaconda3/lib/python3.7/site-packages/calibration/utils.py", line 125, in get_calibration_error
    return get_binning_ce(probs, labels, p, debias, mode=mode)
  File "/home/matthew/anaconda3/lib/python3.7/site-packages/calibration/utils.py", line 193, in get_binning_ce
    return _get_ce(probs, labels, p, debias, None, binning_scheme=get_discrete_bins, mode=mode)
  File "/home/matthew/anaconda3/lib/python3.7/site-packages/calibration/utils.py", line 236, in _get_ce
    labels_one_hot = get_labels_one_hot(labels, k=probs.shape[1])
  File "/home/matthew/anaconda3/lib/python3.7/site-packages/calibration/utils.py", line 509, in get_labels_one_hot
    assert np.min(labels) == 0
AssertionError
```
I have a simple fix I can open in a PR, but I'm not sure if it is valid.
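For reference, the assertion that fails is np.min(labels) == 0 inside get_labels_one_hot. A minimal sketch of the relaxation I have in mind, assuming the function only needs the labels to be valid indices in [0, k-1] rather than requiring class 0 to be present:

```python
import numpy as np

def get_labels_one_hot(labels, k):
    # Validate that labels are legal class indices, without demanding
    # that every class actually occurs in the sample.
    labels = np.asarray(labels)
    assert np.min(labels) >= 0 and np.max(labels) <= k - 1
    one_hot = np.zeros((labels.shape[0], k))
    one_hot[np.arange(labels.shape[0]), labels] = 1
    return one_hot
```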
Hi,
I'm trying to run the experiment in Section 4.3 of the paper using

```
python3 experiments/scaling_binning_calibrator/compare_calibrators.py
```

but it throws the following error:

```
  File "experiments/scaling_binning_calibrator/compare_calibrators.py", line 11
    def eval_top_calibration(probs, probs, labels):
    ^
SyntaxError: duplicate argument 'probs' in function definition
```
The functions eval_top_calibration, upper_bound_marginal_calibration_unbiased, and upper_bound_marginal_calibration_biased all have the same problem. I also think that in eval_top_calibration we should pass probs = utils.get_top_probs(probs) to cal.get_discrete_bins in line 13, as sketched below.
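For clarity, here is my guess at the intended fix (hypothetical; I don't know what the duplicated parameter was meant to be):

```python
def eval_top_calibration(probs, labels):
    # Bin the top-label probabilities rather than the raw argument.
    top_probs = utils.get_top_probs(probs)
    bins = cal.get_discrete_bins(top_probs)
    # ... rest of the function unchanged
```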
But even after changing those lines I'm still unable to reproduce the plots in Figure 3 of the paper. Can you please tell me what I should modify to make it work?
Hi!
First of all, thanks for the excellent package, and in particular also for still actively maintaining it! :-)
I have some questions regarding the bootstrapping-based uncertainty quantification. When I call get_calibration_error_uncertainties, it calls bootstrap_uncertainty with the functional get_calibration_error(probs, labels, p, debias=False, mode=mode).
bootstrap_uncertainty will then roughly do this:

```python
plugin = functional(data)
bootstrap_estimates = []
for _ in range(num_samples):
    bootstrap_estimates.append(functional(resample(data)))
return (2 * plugin - np.percentile(bootstrap_estimates, 100 - alpha / 2.0),
        2 * plugin - np.percentile(bootstrap_estimates, 50),
        2 * plugin - np.percentile(bootstrap_estimates, alpha / 2.0))
```
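For context, this looks like the "basic" (reverse-percentile) bootstrap, which reflects the percentile interval around the plug-in estimate. A self-contained numpy illustration of that construction as I understand it (a sketch, not the library's exact code):

```python
import numpy as np

def basic_bootstrap_interval(data, functional, num_samples=1000, alpha=10.0, seed=0):
    # Basic bootstrap: lower/median/upper are 2*plugin minus percentiles of
    # the bootstrap distribution, which corrects first-order bias.
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    plugin = functional(data)
    estimates = [functional(rng.choice(data, size=len(data), replace=True))
                 for _ in range(num_samples)]
    return (2 * plugin - np.percentile(estimates, 100 - alpha / 2.0),
            2 * plugin - np.percentile(estimates, 50),
            2 * plugin - np.percentile(estimates, alpha / 2.0))
```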
Questions:

1. Why debias=False in the call to get_calibration_error? I would like uncertainty estimates for the unbiased (L2) error estimate.
2. In the docstring of get_calibration_error_uncertainties, it says "When p is not 2 (e.g. for the ECE where p = 1), [the median] ..."

I guess what I am really asking is: what's the reasoning behind the approach you chose, and is it described somewhere? :-)
When there are classes missing from the ground-truth labels, the calibration error cannot be computed.
To reproduce:
```
>>> cal.get_ece([[0.9,0.1], [0.8,0.2]], [0,0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/calibration/utils.py", line 196, in get_ece
    binning_scheme=get_equal_prob_bins, mode=mode)
  File "/usr/local/lib/python3.6/dist-packages/calibration/utils.py", line 162, in lower_bound_scaling_ce
    return _get_ce(probs, labels, p, debias, num_bins, binning_scheme, mode=mode)
  File "/usr/local/lib/python3.6/dist-packages/calibration/utils.py", line 232, in _get_ce
    raise ValueError('labels should be between 0 and num_classes - 1.')
ValueError: labels should be between 0 and num_classes - 1.
```
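Until this is handled, a plain-numpy fallback for a top-label ECE that does not care which classes appear (my own helper with equal-width bins, not the library's debiased estimator):

```python
import numpy as np

def top_label_ece(probs, labels, num_bins=10):
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(np.float64)
    # Clamp so that confidence 1.0 falls into the last bin.
    bins = np.minimum((confidences * num_bins).astype(int), num_bins - 1)
    ece = 0.0
    for b in range(num_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

print(top_label_ece([[0.9, 0.1], [0.8, 0.2]], [0, 0]))  # works despite the missing class
```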
verified_calibration/calibration/utils.py, lines 415 to 416 in bdd60a7
verified_calibration/calibration/utils.py, lines 430 to 431 in bdd60a7

The return annotation should be -> Tuple[float, float, float]: instead of -> Tuple[float, float]:.
Hi,
First, I really appreciate the repository. Awesome work!
I noticed that the "top" calibrators, such as HistogramTop, PlattBinnerTop, etc., produce only the calibrated probability of the top label. I'm not sure how to adjust the probabilities of the other classes in a multi-class task. Say I originally have a probabilistic prediction [0.1, 0.8, 0.05, 0.05] and the top calibrator adjusts 0.8 to 0.6. Should I distribute the 0.2 uniformly onto the other 3 classes? In some cases this might change the decision, no? (I would need the complete distribution to calculate, e.g., an ECE score.) One alternative I considered is sketched below.
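For concreteness, here is that alternative convention (my own assumption, not something the library prescribes): keep the non-top classes in their original proportions and rescale them to the leftover mass:

```python
import numpy as np

def redistribute_top(probs, calibrated_top, eps=1e-12):
    # Keep the non-top classes in their original proportions and rescale
    # them so each row sums to 1 after replacing the top probability.
    probs = np.asarray(probs, dtype=np.float64)
    out = probs.copy()
    idx = np.arange(len(probs))
    top = probs.argmax(axis=-1)
    rest_mass = 1.0 - probs[idx, top]
    scale = (1.0 - np.asarray(calibrated_top)) / np.maximum(rest_mass, eps)
    out *= scale[:, None]
    out[idx, top] = calibrated_top
    return out

# [0.1, 0.8, 0.05, 0.05] with the top entry recalibrated to 0.6
# becomes [0.2, 0.6, 0.1, 0.1]; the row still sums to 1.
print(redistribute_top(np.array([[0.1, 0.8, 0.05, 0.05]]), np.array([0.6])))
```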
Another question: I also saw that the calibrators require a num_calibration argument which doesn't seem to play any role. What's the reason for that?
Thanks and best regards,
T
verified_calibration/calibration/utils.py, lines 525 to 526 in bdd60a7

The variable predictions is not defined and probs is not used.
Hi,
First, I would like to say that I appreciate the work and the repository.
When using the library, I noticed that the relative ordering of numerical values can change after calibration: if a > b before calibration, it is not guaranteed that a > b afterwards. Based on my understanding, however, the calibration function should be monotonic.
Below is the example I used:

```python
import numpy as np
import calibration as cal

raw_probs = [0.61051559, 0.00047493709, 0.99639291, 0.00021221573,
             0.99599433, 0.0014127002, 0.0028262993]
labels = [1, 0, 1, 0, 1, 0, 0]
raw_probs = np.array(raw_probs)
raw_probs = np.vstack((raw_probs, 1 - raw_probs)).T

# Train the calibrator.
num_bins = 4
num_points = len(raw_probs)
calibrator = cal.PlattBinnerMarginalCalibrator(num_points, num_bins=num_bins)
calibrator.train_calibration(raw_probs, labels)

# Test on random probabilities.
np.random.seed(0)
test_probs_1 = np.random.rand(7)
test_probs_1 = np.vstack((test_probs_1, 1 - test_probs_1)).T
calibrated_probs_1 = calibrator.calibrate(test_probs_1)
# Check whether the orderings agree before and after calibration.
print(np.argsort(test_probs_1[:, 0]) == np.argsort(calibrated_probs_1[:, 0]))
```
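One thing I realize may matter: binning can map distinct inputs into the same bin, turning strict inequalities into ties, and argsort orders tied values arbitrarily. A tie-tolerant check (my own helper, not part of the library) might be fairer than comparing argsort outputs directly:

```python
def is_nondecreasing_map(x, y, tol=1e-12):
    # True if sorting by the inputs never strictly decreases the outputs,
    # i.e. the map x -> y is (non-strictly) monotone up to ties.
    order = np.argsort(x, kind='stable')
    return bool(np.all(np.diff(np.asarray(y)[order]) >= -tol))

print(is_nondecreasing_map(test_probs_1[:, 0], calibrated_probs_1[:, 0]))
```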
I also tested with the example file in the repo. In this file, a calibrator is trained and tested on 1000 synthetic data points. I randomly sampled 100 pairs of numbers from the probabilities before/after calibration and found that their relative order is not always preserved.
I would appreciate it if you could provide any clarification regarding this.