localvsglobaluncertainty's Issues

Experiment 2a: Heat Map Analysis

I trained a MultiSWAG solution consisting of (up to) 15 models on DenseNet10 x FashionMNIST, incrementally increasing the rank of each individual SWAG solution, to produce this heatmap. It gives a broad picture of the complementary benefits of modelling local and global uncertainty:

heatmap

Observations:

  • It seems clear that increasing the rank of a unimodal SWAG approximation has a much weaker effect on solution quality than increasing the number of ensembled solutions.

  • It seems that once a certain threshold of ensembled solutions has been reached (roughly 5-10), resources are better dedicated to improving the local approximations.

  • Note that for a fixed MultiSWAG model with n_ensembled = n and rank = k, moving upwards in the heatmap (adding a model) incurs an additional storage cost proportional to k (n rank-k modes -> (n + 1) rank-k modes), whereas moving rightwards (increasing the rank) incurs an additional storage cost proportional to n (n rank-k modes -> n rank-(k + 1) modes).

  • i.e. the storage cost of each solution can be read as roughly k·n·|theta| (a rough sketch of this accounting is given below the list).
    I'm currently playing around with a couple of ideas for when to choose one over the other...
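
As a quick sanity check of that accounting, here is a minimal sketch (the helper name and the parameter count are illustrative, not taken from the repo):

```python
# Minimal sketch of the storage accounting above (illustrative numbers only).
# Each rank-k SWAG mode stores a mean, a diagonal second moment, and k deviation
# vectors, i.e. roughly (k + 2) * |theta| floats; the k·n·|theta| figure above
# ignores the "+ 2".

def multiswag_storage(n_models: int, rank: int, n_params: int) -> int:
    """Approximate number of stored floats for an n-model, rank-k MultiSWAG."""
    per_mode = (rank + 2) * n_params  # mean + diagonal + k deviation columns
    return n_models * per_mode

theta = 1_000_000  # |theta|: stand-in parameter count, not the actual DenseNet size
base = multiswag_storage(5, 10, theta)
print(multiswag_storage(6, 10, theta) - base)  # moving "up": ~ (k + 2) * |theta|
print(multiswag_storage(5, 11, theta) - base)  # moving "right": ~ n * |theta|
```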

Experiment 1b (Simple Analysis of Ensembling)

Analysis of performance of Ensemble versus number of models (1b)

(Dataset: FashionMNIST, Model: DenseNet with depth 10)

I wanted to get an idea of how the ensemble improved with the number of models included. Similar to 1a, the actual running of the code is done in notebooks using a Colab GPU (which will hopefully change).

1. I trained 100 models, each with a different seed for the random number generators (the loop structure is sketched below the hyperparameters).

The notebook is: https://colab.research.google.com/drive/180azaR--x66_kqQNjrTX1Ek__Fwi_zfE?usp=sharing
I used the same optimization parameters and learning rate scheduling for all of them, sticking with what we used in the previous SWAG experiment, namely:

batch_size=128
LR_INIT = 0.1
MOMENTUM = 0.85
L2 = 1e-4
TRAINING_EPOCHS = 100
FINAL_LR = 0.005
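
A rough sketch of what that loop looks like (make_densenet and train_one_model are hypothetical placeholders, not the notebook's actual functions):

```python
import torch

# Sketch: train many independently seeded models with identical hyperparameters.
# make_densenet / train_one_model are hypothetical helpers standing in for
# whatever the notebook actually uses.
N_MODELS = 100

for seed in range(N_MODELS):
    torch.manual_seed(seed)           # seed the CPU RNG
    torch.cuda.manual_seed_all(seed)  # and the GPU RNGs
    model = make_densenet(depth=10, num_classes=10)
    train_one_model(
        model,
        lr_init=0.1, momentum=0.85, weight_decay=1e-4,
        epochs=100, batch_size=128, final_lr=0.005,
    )
    torch.save(model.state_dict(), f"model_seed_{seed}.pt")
```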

The models' final performances on the validation set look roughly randomly distributed about a mean performance:
![InitialModelPerformancesForEnsembling](https://user-images.githubusercontent.com/39443562/88405288-f2444300-cdc6-11ea-8ad8-3d3b67104fc8.png)

2. I then measured ensemble performance against the number of members of the ensemble:
Notebook here: https://colab.research.google.com/drive/1IqkEmhTMZVB3G8Wamf8NNDmehFQngoea?usp=sharing

Note that for each n I randomly select n members to belong to the ensemble, rather than progressively adding one each time. I mix the members' predictions with equal weights (a sketch of this mixing is given below the plot). I obtain the following:

![NumberEnsembledVersusPerformance](https://user-images.githubusercontent.com/39443562/88405957-e3aa5b80-cdc7-11ea-964d-1070e5a07075.png)

So the direct benefits of ensembling on performance are initially steep, but tail off quickly as more and more models are added.
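
The mixing itself is just an equal-weight average of the per-model class probabilities. A minimal sketch, assuming the per-model predictive probabilities have already been stacked into one array (the names here are mine, not the notebook's):

```python
import numpy as np

def random_ensemble_metrics(probs, labels, n, rng):
    """Equal-weight mixture of n randomly chosen members.

    probs:  (n_models, n_examples, n_classes) softmax outputs of each model
    labels: (n_examples,) integer class labels
    """
    idx = rng.choice(probs.shape[0], size=n, replace=False)  # random subset of size n
    p_ens = probs[idx].mean(axis=0)                          # equal weighting
    nll = -np.log(p_ens[np.arange(len(labels)), labels] + 1e-12).mean()
    acc = (p_ens.argmax(axis=1) == labels).mean()
    return nll, acc

# e.g. sweep over ensemble sizes:
# rng = np.random.default_rng(0)
# for n in range(1, 101):
#     print(n, random_ensemble_metrics(all_probs, val_labels, n, rng))
```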






Experiment 3a

Diversity

I defined the diversity index:

$\frac{1}{m}\sum_{i=1}^{m} \mathrm{KL}(P_i \,\|\, P_{\mathrm{ens}}) / \mathrm{NLL}(P_i)$

where $P_{\mathrm{ens}} = \frac{1}{m}\sum_{i=1}^{m} P_i$ is the equal-weight ensemble mixture.

The basic idea is that the numerator of each term will grow if P_i deviates significantly from P_ens, and the denominator will grow if P_i is a poor solution to the probabilistic learning problem. Hence, if the mean of the ratios flattens out as m increases, we are no longer gaining diversity relative to loss on average.
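
In code, a minimal sketch of this index looks something like the following (the function name is mine; I'm assuming the per-model predictive probabilities are stacked into one array and that the per-example KL and NLL are averaged over the evaluation set):

```python
import numpy as np

def diversity_index(probs, labels, eps=1e-12):
    """1/m * sum_i KL(P_i || P_ens) / NLL(P_i), P_ens the equal-weight mixture.

    probs:  (m, n_examples, n_classes) per-model predictive probabilities
    labels: (n_examples,) integer class labels
    """
    p_ens = probs.mean(axis=0)                    # equal-weight ensemble predictive
    ratios = []
    for p_i in probs:
        # per-example KL and NLL, averaged over the evaluation set
        kl = (p_i * (np.log(p_i + eps) - np.log(p_ens + eps))).sum(axis=1).mean()
        nll = -np.log(p_i[np.arange(len(labels)), labels] + eps).mean()
        ratios.append(kl / nll)
    return float(np.mean(ratios))
```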

Pure local versus pure global:

The mean ratios are remarkably (read: suspiciously) well correlated with the loss of the ensemble:

Local:

download

Global:

download (4)

If there's nothing strange about this index, then it is a good "explanation" for the limits of the effectiveness of pure local and pure global ensembles: namely, local ensembles run out of accessible diversity faster than global ensembles. The correlations are -0.95 and -0.99 respectively.

(Note that the graphs have incorrectly labelled y-axes.)

In terms of the interaction between local and global, I plotted a subset of the heatmap from last week (10x10) alongside the diversity index evaluated with an increasing number of global solutions at increasing rank:

Heatmap:
Screenshot 2020-08-21 at 17 04 48

Diversity Index:
Screenshot 2020-08-21 at 17 07 14

Corr: -0.85

Weirdly, the correlation as rank goes up is strong... but in the opposite direction.

Experiment 1a (Simple Analysis of SWAG)

Analysis of performance of SWAG versus approximation rank (1a)

(Dataset: FashionMNIST, Model: DenseNet with depth 10)

I wanted to get an idea of how the SWAG posterior approximation improved performance with rank. Since I was writing most of the code alongside running the experiments, the actual running of the code is done in notebooks using a Colab GPU - I do intend to standardise these so that parameterised experiments can be run from the command line. For now I'll just link the notebooks.

1. I used SGD to descend to a suitably strong mode which would act as the pretrained solution for the SWAG sampling process.

LR_INIT = 0.1
MOMENTUM = 0.85
L2 = 1e-4

Note that rather than using a final learning rate (SWA_LR) of 0.05, as in the original paper, I used 0.005, as this appeared to lead to a more stable mode (suggesting that with this learning rate the SGD iterates had reached a suitable stationary distribution). The final training graph looks like this:

pretrained_training_grapah

Final pretrained model performance is:

::: Train :::
 {'loss': 0.034561568461060524, 'accuracy': 99.41}
::: Valid :::
 {'loss': 0.23624498672485353, 'accuracy': 92.41}
::: Test :::
 {'loss': 0.2568485828399658, 'accuracy': 92.28}

Note also that I adopted the same learning rate schedule when training the initial solution as in the original paper, namely:

def schedule(lr_init, epoch, max_epochs):
    # Hold lr_init for the first half of training, decay linearly towards FINAL_LR
    # between 50% and 90% of training, then hold FINAL_LR for the remaining epochs.
    t = epoch / max_epochs
    lr_ratio = FINAL_LR / lr_init
    if t <= 0.5:
        factor = 1.0
    elif t <= 0.9:
        factor = 1.0 - (1.0 - lr_ratio) * (t - 0.5) / 0.4
    else:
        factor = lr_ratio
    return lr_init * factor
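
For context, this is roughly how such a schedule gets applied, setting the learning rate by hand at the start of each epoch (a sketch, not the notebook's exact loop; `model` and the data loader are assumed to exist):

```python
import torch

LR_INIT = 0.1
FINAL_LR = 0.005
TRAINING_EPOCHS = 100

# `model` is whatever nn.Module is being trained.
optimizer = torch.optim.SGD(model.parameters(), lr=LR_INIT,
                            momentum=0.85, weight_decay=1e-4)

for epoch in range(TRAINING_EPOCHS):
    lr = schedule(LR_INIT, epoch, TRAINING_EPOCHS)
    for group in optimizer.param_groups:
        group["lr"] = lr               # set this epoch's learning rate by hand
    # ... run one epoch of SGD over the training loader here ...
```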

2. I then used this pretrained solution to build a SWAG model (more as a test for the Posterior and Sampling classes).

The notebook is: https://colab.research.google.com/drive/1tma5QHPAM8K9dRjBfUV0Qv_C_yiQQtwP?usp=sharing

The model trained in the notebook above was trained with the following parameters:

SWA_LR = 0.005
SWA_MOMENTUM = 0.85
L2 = 1e-4
RANK = 30
SAMPLES_PER_EPOCH = 1
SAMPLE_FREQ = int((1/SAMPLES_PER_EPOCH)*len(train_set)/batch_size)
SAMPLING_CONDITION = lambda: True
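
For readers unfamiliar with SWAG, the collection step amounts to maintaining running first and second moments of the SGD iterates, plus a buffer of the last RANK deviation vectors (the columns of the low-rank covariance factor). A generic sketch of those statistics, not the repo's actual Posterior class:

```python
import torch
from collections import deque
from torch.nn.utils import parameters_to_vector

class SwagMoments:
    """Sketch of SWAG statistics collection (not the repo's Posterior class)."""

    def __init__(self, n_params, rank):
        self.mean = torch.zeros(n_params)      # running mean of the iterates (SWA solution)
        self.sq_mean = torch.zeros(n_params)   # running mean of the squared iterates
        self.deviations = deque(maxlen=rank)   # last `rank` deviation vectors (columns of D)
        self.n = 0

    def collect(self, model):
        w = parameters_to_vector(model.parameters()).detach().cpu()
        self.mean = (self.n * self.mean + w) / (self.n + 1)
        self.sq_mean = (self.n * self.sq_mean + w ** 2) / (self.n + 1)
        self.deviations.append(w - self.mean)
        self.n += 1

# Called every SAMPLE_FREQ batches during the low-LR training run, e.g.:
# if step % SAMPLE_FREQ == 0 and SAMPLING_CONDITION():
#     swag.collect(model)
```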

I plotted the SWA graph (the SWA solution is the expected value of the SWAG posterior) as a proxy for the validation learning curve of the SWAG solution.
SWA_training

I also plotted the number of samples drawn for Bayesian model averaging against the performance of the final SWAG model (the sampling and averaging step is sketched further below):
BMASamplesVsPerformance

Looks weird. May want to investigate the code.
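
For reference, Bayesian model averaging with SWAG means drawing N parameter vectors from the fitted Gaussian and averaging the resulting softmax outputs. A sketch built on the SwagMoments statistics from the earlier snippet (again, not the repo's Sampling class):

```python
import torch
from torch.nn.utils import vector_to_parameters

def swag_sample(swag):
    """Draw one parameter vector from the SWAG Gaussian (sketch).

    theta = mean + sqrt(diag_var) * z1 / sqrt(2) + D z2 / sqrt(2 (K - 1)),
    the low-rank-plus-diagonal sampling rule from the SWAG paper.
    """
    D = torch.stack(list(swag.deviations), dim=1)               # (n_params, K)
    K = D.shape[1]
    diag_var = torch.clamp(swag.sq_mean - swag.mean ** 2, min=1e-30)
    z1 = torch.randn_like(swag.mean)
    z2 = torch.randn(K)
    return (swag.mean
            + diag_var.sqrt() * z1 / (2 ** 0.5)
            + (D @ z2) / ((2 * (K - 1)) ** 0.5))

@torch.no_grad()
def bma_predict(swag, model, loader, n_samples=30):
    """Average softmax predictions over n_samples SWAG draws (sketch)."""
    device = next(model.parameters()).device
    probs = None
    for _ in range(n_samples):
        vector_to_parameters(swag_sample(swag).to(device), model.parameters())
        # NOTE: batch-norm statistics should be re-estimated after each draw
        # (one pass over the training data) before predicting.
        p = torch.cat([torch.softmax(model(x.to(device)), dim=1) for x, _ in loader])
        probs = p if probs is None else probs + p
    return probs / n_samples
```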

3. I then used this pretrained solution to evaluate SWAG models trained with different approximation ranks, from k = 1 to 30 in steps of 2. I trained each for 100 SWAG epochs and performed one sample per epoch (same as above).

The training notebook is here: https://colab.research.google.com/drive/1L2D7aAXxOdrhK-Kk3FxUs9vhsjP_vTf6?usp=sharing
Analysis notebook is here: https://colab.research.google.com/drive/1DHbudUH2BFdlJgmCopv93arsCx9advu6?usp=sharing
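
Structurally, the sweep in step 3 is just a loop over ranks (train_swag and evaluate_bma are hypothetical wrappers around the collection and averaging sketches above):

```python
# Sketch of the rank sweep; train_swag / evaluate_bma are hypothetical helpers
# wrapping the SwagMoments collection and bma_predict averaging shown earlier.
results = {}
for k in range(1, 31, 2):
    swag_k = train_swag(pretrained_model, rank=k, epochs=100, samples_per_epoch=1)
    results[k] = evaluate_bma(swag_k, n_samples=30)   # e.g. (val NLL, val accuracy)
```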

What I've found so far is a little surprising (but could very well be down to an implementation error, or some other error in my experiment, e.g. poor parameter choices):

SWAG rank versus SWAG performance on train and validation data (with N = 30 samples for Bayesian model averaging):
swag_rank_v_performance_v1

I also plotted the SWA performance (we would expect this to be essentially constant as SWA is independent of rank).
SWA_swag_rank_v_performance_v1
