p-lambda / verified_calibration

Calibration library and code for the paper: Verified Uncertainty Calibration. Ananya Kumar, Percy Liang, Tengyu Ma. NeurIPS 2019 (Spotlight).

License: MIT License
Within the Verified Uncertainty Calibration paper I noticed there are no reliability diagrams. Would it be possible to return the bin accuracies in order for users to create the reliability diagram? If the bin sizes are different, I think that would also have to be returned.
For example, in my current codebase I create a reliability diagram for ECE as follows. I have these variables:

```python
acc = [0, 0, 0.00167434, 0.00271739, 0.007, 0.00495663, 0.00269906, 0.01893491, 0.04973357, 0.65488513]
ece = 0.41769305566809833
```

which I send to my plotting function:
```python
import numpy as np
import matplotlib.pyplot as plt

# Plot a reliability diagram: per-bin accuracy vs. confidence.
def plot_reliability(acc, ece, save_path):
    interval = 1 / len(acc)
    # Bin centers: 0.05, 0.15, ..., 0.95 for 10 bins.
    x = np.arange(interval / 2, 1 + interval / 2, interval)
    plt.figure(figsize=(3, 3))
    plt.bar(x, acc, width=0.08, edgecolor='k')
    plt.xlabel('Confidence')
    plt.ylabel('Accuracy')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.text(0, 1.01, 'ECE={}'.format(str(ece)[:5]))
    plt.plot([0, 1], [0, 1], 'k--')  # perfect-calibration diagonal
    plt.tight_layout()
    plt.savefig(save_path)
    plt.show()
```
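In the meantime, here is a minimal numpy sketch of how per-bin accuracies and bin sizes could be computed from held-out predictions (my own equal-width binning, not necessarily the library's binning scheme; `confidences` are top-label probabilities and `correct` flags whether each prediction matched its label):

```python
import numpy as np

def bin_stats(confidences, correct, num_bins=10):
    confidences = np.asarray(confidences, dtype=np.float64)
    correct = np.asarray(correct, dtype=np.float64)
    # Map each confidence to an equal-width bin index in [0, num_bins - 1].
    bins = np.minimum((confidences * num_bins).astype(int), num_bins - 1)
    counts = np.bincount(bins, minlength=num_bins)
    # Per-bin accuracy: mean of `correct` within each bin (0 for empty bins).
    acc = np.bincount(bins, weights=correct, minlength=num_bins)
    acc = np.divide(acc, counts, out=np.zeros(num_bins), where=counts > 0)
    return acc, counts
```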
Trying to pickle a fitted HistogramMarginalCalibrator gives the error:

```
AttributeError: Can't pickle local object 'get_histogram_calibrator.<locals>.calibrator'
```
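The calibrator returned by get_histogram_calibrator is a closure (a function defined inside another function), which the standard pickle module cannot serialize. A possible workaround, assuming the third-party dill package is acceptable, is to serialize with dill instead (a sketch; `calibrator` stands for the fitted object):

```python
import dill

# dill can serialize the closures that the standard pickle module rejects.
with open('calibrator.pkl', 'wb') as f:
    dill.dump(calibrator, f)

with open('calibrator.pkl', 'rb') as f:
    restored = dill.load(f)
```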
If a bin only contains a single class label, the logistic regression fails to fit (the solver requires more than one class label).
I'm not sure if this is the 'right' fix, but it worked as a quick-and-dirty workaround:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def get_platt_scaler(model_probs, labels):
    clf = LogisticRegression(C=1e10, solver='lbfgs')
    eps = 1e-12
    model_probs = model_probs.astype(dtype=np.float64)
    model_probs = np.expand_dims(model_probs, axis=-1)
    model_probs = np.clip(model_probs, eps, 1 - eps)
    model_probs = np.log(model_probs / (1 - model_probs))
    unique_labels = np.unique(labels)  # +
    if unique_labels.shape[0] != 1:    # + skip fitting when only one class is present
        clf.fit(model_probs, labels)
    def calibrator(probs):
        x = np.array(probs, dtype=np.float64)
        x = np.clip(x, eps, 1 - eps)
        x = np.log(x / (1 - x))
        if unique_labels.shape[0] != 1:  # + fall back to the identity map otherwise
            x = x * clf.coef_[0] + clf.intercept_
        output = 1 / (1 + np.exp(-x))
        return output
    return calibrator
```

(The lines marked `# +` are the added guards.)
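For illustration, the degenerate input that used to crash (toy data; with a single class present the scaler now reduces to the identity on probabilities):

```python
probs = np.array([0.7, 0.8, 0.9])
labels = np.array([1, 1, 1])  # only one class label present

calibrator = get_platt_scaler(probs, labels)
print(calibrator(np.array([0.6])))  # ~0.6: logit then sigmoid, no scaling applied
```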
The PyPI package name is scikit-learn, not sklearn.

The function parameter num_samples of lower_bound_experiment is not used. It should be num_samples=num_samples instead of num_samples=1000.
I noticed that the variable parent is not used. I thought I would mention it in case it is meant to be used to specify where the figure should be saved.
Hi, thanks for making this library - it's super convenient. One question about the PlattBinnerMarginalCalibrator (and, I suppose, the other calibrators as well): what is the purpose of the num_calibration argument? It seems to be used only to assert that a minimum number of samples has been provided, but what is the point of that? Does it have any use apart from that single assert?
For example, this works:

```
>>> l0 = [0.8,0.1,0.1]
>>> l1 = [0.7,0.2,0.1]
>>> l2 = [0.3,0.3,0.3]
>>> cal.get_calibration_error([l0,l1,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2], [0,2,1,1,1,1,1,1,1,1,1,1,1,1,1])
0.4353797831268186
```
But removing class 0 gives an error:

```
>>> cal.get_calibration_error([l0,l1,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2,l2], [1,2,1,1,1,1,1,1,1,1,1,1,1,1,1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/matthew/anaconda3/lib/python3.7/site-packages/calibration/utils.py", line 125, in get_calibration_error
    return get_binning_ce(probs, labels, p, debias, mode=mode)
  File "/home/matthew/anaconda3/lib/python3.7/site-packages/calibration/utils.py", line 193, in get_binning_ce
    return _get_ce(probs, labels, p, debias, None, binning_scheme=get_discrete_bins, mode=mode)
  File "/home/matthew/anaconda3/lib/python3.7/site-packages/calibration/utils.py", line 236, in _get_ce
    labels_one_hot = get_labels_one_hot(labels, k=probs.shape[1])
  File "/home/matthew/anaconda3/lib/python3.7/site-packages/calibration/utils.py", line 509, in get_labels_one_hot
    assert np.min(labels) == 0
AssertionError
```
I have a simple fix I can open in a PR, but I'm not sure if it is valid.
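For reference, the assertion that fails is np.min(labels) == 0 inside get_labels_one_hot. A minimal sketch of the relaxation I have in mind, assuming the function only needs the labels to be valid indices in [0, k-1] rather than requiring class 0 to be present:

```python
import numpy as np

def get_labels_one_hot(labels, k):
    # Validate that labels are legal class indices, without demanding
    # that every class actually occurs in the sample.
    labels = np.asarray(labels)
    assert np.min(labels) >= 0 and np.max(labels) <= k - 1
    one_hot = np.zeros((labels.shape[0], k))
    one_hot[np.arange(labels.shape[0]), labels] = 1
    return one_hot
```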
Hi,
I'm trying to run the experiment in Section 4.3 of the paper using

```
python3 experiments/scaling_binning_calibrator/compare_calibrators.py
```

but it throws the following error:

```
  File "experiments/scaling_binning_calibrator/compare_calibrators.py", line 11
    def eval_top_calibration(probs, probs, labels):
    ^
SyntaxError: duplicate argument 'probs' in function definition
```
The functions eval_top_calibration, upper_bound_marginal_calibration_unbiased, and upper_bound_marginal_calibration_biased all have the same problem. I also think that in eval_top_calibration we should pass probs = utils.get_top_probs(probs) to cal.get_discrete_bins in line 13, as sketched below.
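For clarity, here is my guess at the intended fix (hypothetical; I don't know what the duplicated parameter was meant to be):

```python
def eval_top_calibration(probs, labels):
    # Bin the top-label probabilities rather than the raw argument.
    top_probs = utils.get_top_probs(probs)
    bins = cal.get_discrete_bins(top_probs)
    # ... rest of the function unchanged
```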
But even after changing those lines I'm still unable to reproduce the plots in Figure 3 of the paper. Can you please tell me what I should modify to make it work?
Hi!
First of all, thanks for the excellent package, and in particular also for still actively maintaining it! :-)
I have some questions regarding the bootstrapping-based uncertainty quantification. When I call get_calibration_error_uncertainties, it calls bootstrap_uncertainty with the functional get_calibration_error(probs, labels, p, debias=False, mode=mode).
bootstrap_uncertainty will then roughly do this:

```python
plugin = functional(data)
bootstrap_estimates = []
for _ in range(num_samples):
    bootstrap_estimates.append(functional(resample(data)))
return (2 * plugin - np.percentile(bootstrap_estimates, 100 - alpha / 2.0),
        2 * plugin - np.percentile(bootstrap_estimates, 50),
        2 * plugin - np.percentile(bootstrap_estimates, alpha / 2.0))
```
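For context, this looks like the "basic" (reverse-percentile) bootstrap, which reflects the percentile interval around the plug-in estimate. A self-contained numpy illustration of that construction as I understand it (a sketch, not the library's exact code):

```python
import numpy as np

def basic_bootstrap_interval(data, functional, num_samples=1000, alpha=10.0, seed=0):
    # Basic bootstrap: lower/median/upper are 2*plugin minus percentiles of
    # the bootstrap distribution, which corrects first-order bias.
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    plugin = functional(data)
    estimates = [functional(rng.choice(data, size=len(data), replace=True))
                 for _ in range(num_samples)]
    return (2 * plugin - np.percentile(estimates, 100 - alpha / 2.0),
            2 * plugin - np.percentile(estimates, 50),
            2 * plugin - np.percentile(estimates, alpha / 2.0))
```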
Questions:

1. Why debias=False in the call to get_calibration_error? I would like uncertainty estimates for the unbiased (L2) error estimate.
2. In the docstring of get_calibration_error_uncertainties, it says "When p is not 2 (e.g. for the ECE where p = 1), [the median] ..."

I guess what I am really asking is: what's the reasoning behind the approach you chose, and is it described somewhere? :-)
When there are classes missing from the ground-truth labels, the calibration error cannot be computed.
To reproduce:
```
>>> cal.get_ece([[0.9,0.1], [0.8,0.2]], [0,0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/calibration/utils.py", line 196, in get_ece
    binning_scheme=get_equal_prob_bins, mode=mode)
  File "/usr/local/lib/python3.6/dist-packages/calibration/utils.py", line 162, in lower_bound_scaling_ce
    return _get_ce(probs, labels, p, debias, num_bins, binning_scheme, mode=mode)
  File "/usr/local/lib/python3.6/dist-packages/calibration/utils.py", line 232, in _get_ce
    raise ValueError('labels should be between 0 and num_classes - 1.')
ValueError: labels should be between 0 and num_classes - 1.
```
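Until this is handled, a plain-numpy fallback for a top-label ECE that does not care which classes appear (my own helper with equal-width bins, not the library's debiased estimator):

```python
import numpy as np

def top_label_ece(probs, labels, num_bins=10):
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(np.float64)
    # Clamp so that confidence 1.0 falls into the last bin.
    bins = np.minimum((confidences * num_bins).astype(int), num_bins - 1)
    ece = 0.0
    for b in range(num_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

print(top_label_ece([[0.9, 0.1], [0.8, 0.2]], [0, 0]))  # works despite the missing class
```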
verified_calibration/calibration/utils.py, lines 415 to 416 in bdd60a7
verified_calibration/calibration/utils.py, lines 430 to 431 in bdd60a7

The return annotation should be -> Tuple[float, float, float]: instead of -> Tuple[float, float]:.
Hi,
First, I really appreciate the repository. Awesome work!
I noticed that the "top" calibrators, such as HistogramTop, PlattBinnerTop, etc., produce only the calibrated probability of the top label. I'm not sure how to adjust the probabilities of the other classes in a multi-class task. Say I originally have a probabilistic prediction [0.1, 0.8, 0.05, 0.05] and the top calibrator adjusts 0.8 to 0.6. Should I distribute the 0.2 uniformly onto the other 3 classes? In some cases this might change the decision, no? (I would need the complete distribution to calculate, e.g., an ECE score.) One alternative I considered is sketched below.
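For concreteness, here is that alternative convention (my own assumption, not something the library prescribes): keep the non-top classes in their original proportions and rescale them to the leftover mass:

```python
import numpy as np

def redistribute_top(probs, calibrated_top, eps=1e-12):
    # Keep the non-top classes in their original proportions and rescale
    # them so each row sums to 1 after replacing the top probability.
    probs = np.asarray(probs, dtype=np.float64)
    out = probs.copy()
    idx = np.arange(len(probs))
    top = probs.argmax(axis=-1)
    rest_mass = 1.0 - probs[idx, top]
    scale = (1.0 - np.asarray(calibrated_top)) / np.maximum(rest_mass, eps)
    out *= scale[:, None]
    out[idx, top] = calibrated_top
    return out

# [0.1, 0.8, 0.05, 0.05] with the top entry recalibrated to 0.6
# becomes [0.2, 0.6, 0.1, 0.1]; the row still sums to 1.
print(redistribute_top(np.array([[0.1, 0.8, 0.05, 0.05]]), np.array([0.6])))
```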
Another question: I also saw that the calibrators require a num_calibration argument which doesn't seem to play any role. What's the reason for that?
Thanks and best regards,
T
verified_calibration/calibration/utils.py, lines 525 to 526 in bdd60a7

The variable predictions is not defined and probs is not used.
Hi,
First, I would like to say that I appreciate the work and the repository.
When using the library, I noticed that the relative ordering of numerical values can change after calibration: if a > b before calibration, it is not guaranteed that a > b afterwards. Based on my understanding, however, the calibration function should be monotonic.
Below is the example I used:

```python
import numpy as np
import calibration as cal

raw_probs = [0.61051559, 0.00047493709, 0.99639291, 0.00021221573,
             0.99599433, 0.0014127002, 0.0028262993]
labels = [1, 0, 1, 0, 1, 0, 0]
raw_probs = np.array(raw_probs)
raw_probs = np.vstack((raw_probs, 1 - raw_probs)).T

# Train the calibrator.
num_bins = 4
num_points = len(raw_probs)
calibrator = cal.PlattBinnerMarginalCalibrator(num_points, num_bins=num_bins)
calibrator.train_calibration(raw_probs, labels)

# Test on random probabilities.
np.random.seed(0)
test_probs_1 = np.random.rand(7)
test_probs_1 = np.vstack((test_probs_1, 1 - test_probs_1)).T
calibrated_probs_1 = calibrator.calibrate(test_probs_1)
# Check whether the orderings agree before and after calibration.
print(np.argsort(test_probs_1[:, 0]) == np.argsort(calibrated_probs_1[:, 0]))
```
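One thing I realize may matter: binning can map distinct inputs into the same bin, turning strict inequalities into ties, and argsort orders tied values arbitrarily. A tie-tolerant check (my own helper, not part of the library) might be fairer than comparing argsort outputs directly:

```python
def is_nondecreasing_map(x, y, tol=1e-12):
    # True if sorting by the inputs never strictly decreases the outputs,
    # i.e. the map x -> y is (non-strictly) monotone up to ties.
    order = np.argsort(x, kind='stable')
    return bool(np.all(np.diff(np.asarray(y)[order]) >= -tol))

print(is_nondecreasing_map(test_probs_1[:, 0], calibrated_probs_1[:, 0]))
```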
I also tested with the example file in the repo. In this file, a calibrator is trained and tested on 1000 synthetic data points. I randomly sampled 100 pairs of numbers from the probabilities before/after calibration and found that their relative order is not always preserved.
I would appreciate it if you could provide any clarification regarding this.