localvsglobaluncertainty's Issues

Experiment 2a: Heat Map Analysis

I trained a MultiSWAG solution consisting of (up to) 15 models on DenseNet10 x FashionMNIST, incrementally increasing the rank of each individual SWAG solution, to produce this heatmap. It gives a broad picture of the complementary benefits of modelling local and global uncertainty:

heatmap

Observations:

  • It seems clear that increasing the rank of a unimodal SWAG approximation has a much weaker effect on solution quality than increasing the number of ensembled solutions.

  • It seems that once a certain threshold of ensembled solutions has been reached (roughly 5-10), resources are better dedicated to improving the local approximations.

  • Note that for a fixed MultiSWAG model with n_ensembled = n and rank = k, moving upwards in the heatmap (adding a model) incurs an additional storage cost proportional to k (n rank-k modes -> (n + 1) rank-k modes), whereas moving rightwards (increasing the rank) incurs an additional storage cost proportional to n (n rank-k modes -> n rank-(k + 1) modes).

  • i.e. the storage cost of each solution can be read as roughly k·n·|theta| (a rough sketch of this accounting is given below the list).
    I'm currently playing around with a couple of ideas for when to choose one over the other...
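
As a quick sanity check of that accounting, here is a minimal sketch (the helper name and the parameter count are illustrative, not taken from the repo):

```python
# Minimal sketch of the storage accounting above (illustrative numbers only).
# Each rank-k SWAG mode stores a mean, a diagonal second moment, and k deviation
# vectors, i.e. roughly (k + 2) * |theta| floats; the k·n·|theta| figure above
# ignores the "+ 2".

def multiswag_storage(n_models: int, rank: int, n_params: int) -> int:
    """Approximate number of stored floats for an n-model, rank-k MultiSWAG."""
    per_mode = (rank + 2) * n_params  # mean + diagonal + k deviation columns
    return n_models * per_mode

theta = 1_000_000  # |theta|: stand-in parameter count, not the actual DenseNet size
base = multiswag_storage(5, 10, theta)
print(multiswag_storage(6, 10, theta) - base)  # moving "up": ~ (k + 2) * |theta|
print(multiswag_storage(5, 11, theta) - base)  # moving "right": ~ n * |theta|
```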

Experiment 1b (Simple Analysis of Ensembling)

Analysis of performance of Ensemble versus number of models (1b)

(Dataset: FashionMNIST, Model: DenseNet with depth 10)

I wanted to get an idea of how the ensemble improved with the number of models included. Similar to 1a, the actual running of the code is done in notebooks using a Colab GPU (which will hopefully change).

1. I trained 100 models, each with a different seed for the random number generators (the loop structure is sketched below the hyperparameters).

The notebook is: https://colab.research.google.com/drive/180azaR--x66_kqQNjrTX1Ek__Fwi_zfE?usp=sharing
I used the same optimization parameters and learning rate scheduling for all of them, sticking with what we used in the previous SWAG experiment, namely:

batch_size=128
LR_INIT = 0.1
MOMENTUM = 0.85
L2 = 1e-4
TRAINING_EPOCHS = 100
FINAL_LR = 0.005
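
A rough sketch of what that loop looks like (make_densenet and train_one_model are hypothetical placeholders, not the notebook's actual functions):

```python
import torch

# Sketch: train many independently seeded models with identical hyperparameters.
# make_densenet / train_one_model are hypothetical helpers standing in for
# whatever the notebook actually uses.
N_MODELS = 100

for seed in range(N_MODELS):
    torch.manual_seed(seed)           # seed the CPU RNG
    torch.cuda.manual_seed_all(seed)  # and the GPU RNGs
    model = make_densenet(depth=10, num_classes=10)
    train_one_model(
        model,
        lr_init=0.1, momentum=0.85, weight_decay=1e-4,
        epochs=100, batch_size=128, final_lr=0.005,
    )
    torch.save(model.state_dict(), f"model_seed_{seed}.pt")
```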

The models' final performances on the validation set look roughly randomly distributed about a mean performance:
![InitialModelPerformancesForEnsembling](https://user-images.githubusercontent.com/39443562/88405288-f2444300-cdc6-11ea-8ad8-3d3b67104fc8.png)

2. I then measured ensemble performance against the number of members of the ensemble:
Notebook here: https://colab.research.google.com/drive/1IqkEmhTMZVB3G8Wamf8NNDmehFQngoea?usp=sharing

Note that for each n I randomly select n members to belong to the ensemble, rather than progressively adding one each time. I mix the members' predictions with equal weights (a sketch of this mixing is given below the plot). I obtain the following:

![NumberEnsembledVersusPerformance](https://user-images.githubusercontent.com/39443562/88405957-e3aa5b80-cdc7-11ea-964d-1070e5a07075.png)

So the direct benefits of ensembling on performance are initially steep, but tail off quickly as more and more models are added.
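
The mixing itself is just an equal-weight average of the per-model class probabilities. A minimal sketch, assuming the per-model predictive probabilities have already been stacked into one array (the names here are mine, not the notebook's):

```python
import numpy as np

def random_ensemble_metrics(probs, labels, n, rng):
    """Equal-weight mixture of n randomly chosen members.

    probs:  (n_models, n_examples, n_classes) softmax outputs of each model
    labels: (n_examples,) integer class labels
    """
    idx = rng.choice(probs.shape[0], size=n, replace=False)  # random subset of size n
    p_ens = probs[idx].mean(axis=0)                          # equal weighting
    nll = -np.log(p_ens[np.arange(len(labels)), labels] + 1e-12).mean()
    acc = (p_ens.argmax(axis=1) == labels).mean()
    return nll, acc

# e.g. sweep over ensemble sizes:
# rng = np.random.default_rng(0)
# for n in range(1, 101):
#     print(n, random_ensemble_metrics(all_probs, val_labels, n, rng))
```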






Experiment 3a

Diversity

I defined the diversity index:

$\frac{1}{m}\sum_{i=1}^{m} \mathrm{KL}(P_i \,\|\, P_{\mathrm{ens}}) / \mathrm{NLL}(P_i)$

where $P_{\mathrm{ens}} = \frac{1}{m}\sum_{i=1}^{m} P_i$ is the equal-weight ensemble mixture.

The basic idea is that the numerator of each term will grow if P_i deviates significantly from P_ens, and the denominator will grow if P_i is a poor solution to the probabilistic learning problem. Hence, if the mean of the ratios flattens out as m increases, we are no longer gaining diversity relative to loss on average.
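
In code, a minimal sketch of this index looks something like the following (the function name is mine; I'm assuming the per-model predictive probabilities are stacked into one array and that the per-example KL and NLL are averaged over the evaluation set):

```python
import numpy as np

def diversity_index(probs, labels, eps=1e-12):
    """1/m * sum_i KL(P_i || P_ens) / NLL(P_i), P_ens the equal-weight mixture.

    probs:  (m, n_examples, n_classes) per-model predictive probabilities
    labels: (n_examples,) integer class labels
    """
    p_ens = probs.mean(axis=0)                    # equal-weight ensemble predictive
    ratios = []
    for p_i in probs:
        # per-example KL and NLL, averaged over the evaluation set
        kl = (p_i * (np.log(p_i + eps) - np.log(p_ens + eps))).sum(axis=1).mean()
        nll = -np.log(p_i[np.arange(len(labels)), labels] + eps).mean()
        ratios.append(kl / nll)
    return float(np.mean(ratios))
```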

Pure local versus pure global:

The mean ratios are remarkably (read: suspiciously) well correlated with the loss of the ensemble:

Local:

download

Global:

download (4)

If there's nothing strange about this index, then it is a good "explanation" for the limits of the effectiveness of pure local and pure global ensembles: namely, local ensembles run out of accessible diversity faster than global ensembles. The correlations are -0.95 and -0.99 respectively.

(Note that the graphs have incorrectly labelled y-axes.)

In terms of the interaction between local and global, I plotted a subset of the heatmap from last week (10x10) alongside the diversity index evaluated with an increasing number of global solutions at increasing rank:

Heatmap:
Screenshot 2020-08-21 at 17 04 48

Diversity Index:
Screenshot 2020-08-21 at 17 07 14

Corr: -0.85

Weirdly, the correlation as rank goes up is strong... but in the opposite direction.

Experiment 1a (Simple Analysis of SWAG)

Analysis of performance of SWAG versus approximation rank (1a)

(Dataset: FashionMNIST, Model: DenseNet with depth 10)

I wanted to get an idea of how the SWAG posterior approximation improved performance with rank. Since I was writing most of the code alongside running the experiments, the actual running of the code is done in notebooks using a Colab GPU - I do intend to standardise these so that parameterised experiments can be run from the command line. For now I'll just link the notebooks.

1. I used SGD to descend to a suitably strong mode which would act as the pretrained solution for the SWAG sampling process.

LR_INIT = 0.1
MOMENTUM = 0.85
L2 = 1e-4

Note that rather than using a final learning rate (SWA_LR) of 0.05, as in the original paper, I used 0.005, as this appeared to lead to a more stable mode (suggesting that with this learning rate the SGD iterates had reached a suitable stationary distribution). The final training graph looks like this:

pretrained_training_grapah

Final pretrained model performance is:

::: Train :::
 {'loss': 0.034561568461060524, 'accuracy': 99.41}
::: Valid :::
 {'loss': 0.23624498672485353, 'accuracy': 92.41}
::: Test :::
 {'loss': 0.2568485828399658, 'accuracy': 92.28}

Note also that I adopted the same learning rate schedule when training the initial solution as in the original paper, namely:

def schedule(lr_init, epoch, max_epochs):
    # Hold lr_init for the first half of training, decay linearly towards FINAL_LR
    # between 50% and 90% of training, then hold FINAL_LR for the remaining epochs.
    t = epoch / max_epochs
    lr_ratio = FINAL_LR / lr_init
    if t <= 0.5:
        factor = 1.0
    elif t <= 0.9:
        factor = 1.0 - (1.0 - lr_ratio) * (t - 0.5) / 0.4
    else:
        factor = lr_ratio
    return lr_init * factor
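
For context, this is roughly how such a schedule gets applied, setting the learning rate by hand at the start of each epoch (a sketch, not the notebook's exact loop; `model` and the data loader are assumed to exist):

```python
import torch

LR_INIT = 0.1
FINAL_LR = 0.005
TRAINING_EPOCHS = 100

# `model` is whatever nn.Module is being trained.
optimizer = torch.optim.SGD(model.parameters(), lr=LR_INIT,
                            momentum=0.85, weight_decay=1e-4)

for epoch in range(TRAINING_EPOCHS):
    lr = schedule(LR_INIT, epoch, TRAINING_EPOCHS)
    for group in optimizer.param_groups:
        group["lr"] = lr               # set this epoch's learning rate by hand
    # ... run one epoch of SGD over the training loader here ...
```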

2. I then used this pretrained solution to build a SWAG model (more as a test for the Posterior and Sampling classes).

The notebook is: https://colab.research.google.com/drive/1tma5QHPAM8K9dRjBfUV0Qv_C_yiQQtwP?usp=sharing

The model trained in the notebook above was trained with the following parameters:

SWA_LR = 0.005
SWA_MOMENTUM = 0.85
L2 = 1e-4
RANK = 30
SAMPLES_PER_EPOCH = 1
SAMPLE_FREQ = int((1/SAMPLES_PER_EPOCH)*len(train_set)/batch_size)
SAMPLING_CONDITION = lambda: True
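
For readers unfamiliar with SWAG, the collection step amounts to maintaining running first and second moments of the SGD iterates, plus a buffer of the last RANK deviation vectors (the columns of the low-rank covariance factor). A generic sketch of those statistics, not the repo's actual Posterior class:

```python
import torch
from collections import deque
from torch.nn.utils import parameters_to_vector

class SwagMoments:
    """Sketch of SWAG statistics collection (not the repo's Posterior class)."""

    def __init__(self, n_params, rank):
        self.mean = torch.zeros(n_params)      # running mean of the iterates (SWA solution)
        self.sq_mean = torch.zeros(n_params)   # running mean of the squared iterates
        self.deviations = deque(maxlen=rank)   # last `rank` deviation vectors (columns of D)
        self.n = 0

    def collect(self, model):
        w = parameters_to_vector(model.parameters()).detach().cpu()
        self.mean = (self.n * self.mean + w) / (self.n + 1)
        self.sq_mean = (self.n * self.sq_mean + w ** 2) / (self.n + 1)
        self.deviations.append(w - self.mean)
        self.n += 1

# Called every SAMPLE_FREQ batches during the low-LR training run, e.g.:
# if step % SAMPLE_FREQ == 0 and SAMPLING_CONDITION():
#     swag.collect(model)
```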

I plotted the SWA graph (the SWA solution is the expected value of the SWAG posterior) as a proxy for the validation learning curve of the SWAG solution.
SWA_training

I also plotted the number of samples drawn for Bayesian model averaging against the performance of the final SWAG model (the sampling and averaging step is sketched further below):
BMASamplesVsPerformance

Looks weird. May want to investigate the code.
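
For reference, Bayesian model averaging with SWAG means drawing N parameter vectors from the fitted Gaussian and averaging the resulting softmax outputs. A sketch built on the SwagMoments statistics from the earlier snippet (again, not the repo's Sampling class):

```python
import torch
from torch.nn.utils import vector_to_parameters

def swag_sample(swag):
    """Draw one parameter vector from the SWAG Gaussian (sketch).

    theta = mean + sqrt(diag_var) * z1 / sqrt(2) + D z2 / sqrt(2 (K - 1)),
    the low-rank-plus-diagonal sampling rule from the SWAG paper.
    """
    D = torch.stack(list(swag.deviations), dim=1)               # (n_params, K)
    K = D.shape[1]
    diag_var = torch.clamp(swag.sq_mean - swag.mean ** 2, min=1e-30)
    z1 = torch.randn_like(swag.mean)
    z2 = torch.randn(K)
    return (swag.mean
            + diag_var.sqrt() * z1 / (2 ** 0.5)
            + (D @ z2) / ((2 * (K - 1)) ** 0.5))

@torch.no_grad()
def bma_predict(swag, model, loader, n_samples=30):
    """Average softmax predictions over n_samples SWAG draws (sketch)."""
    device = next(model.parameters()).device
    probs = None
    for _ in range(n_samples):
        vector_to_parameters(swag_sample(swag).to(device), model.parameters())
        # NOTE: batch-norm statistics should be re-estimated after each draw
        # (one pass over the training data) before predicting.
        p = torch.cat([torch.softmax(model(x.to(device)), dim=1) for x, _ in loader])
        probs = p if probs is None else probs + p
    return probs / n_samples
```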

3. I then used this pretrained solution to evaluate SWAG models trained with different approximation ranks, from k = 1 to 30 in steps of 2. I trained each for 100 SWAG epochs and performed one sample per epoch (same as above).

The training notebook is here: https://colab.research.google.com/drive/1L2D7aAXxOdrhK-Kk3FxUs9vhsjP_vTf6?usp=sharing
Analysis notebook is here: https://colab.research.google.com/drive/1DHbudUH2BFdlJgmCopv93arsCx9advu6?usp=sharing
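
Structurally, the sweep in step 3 is just a loop over ranks (train_swag and evaluate_bma are hypothetical wrappers around the collection and averaging sketches above):

```python
# Sketch of the rank sweep; train_swag / evaluate_bma are hypothetical helpers
# wrapping the SwagMoments collection and bma_predict averaging shown earlier.
results = {}
for k in range(1, 31, 2):
    swag_k = train_swag(pretrained_model, rank=k, epochs=100, samples_per_epoch=1)
    results[k] = evaluate_bma(swag_k, n_samples=30)   # e.g. (val NLL, val accuracy)
```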

What I've found so far is a little surprising (but could very well be down to an implementation error, or some other error in my experiment, e.g. poor parameter choices):

SWAG rank versus SWAG performance on train and validation data (with N = 30 samples for Bayesian model averaging):
swag_rank_v_performance_v1

I also plotted the SWA performance (we would expect this to be essentially constant as SWA is independent of rank).
SWA_swag_rank_v_performance_v1
