
Hypergradient-based Optimization Methods

This work explores improvements to the existing "Hypergradient"-based optimizers proposed in the paper Online Learning Rate Adaptation with Hypergradient Descent. A report summarising the work can be found here.

Introduction

The method proposed in the paper "Online Learning Rate Adaptation with Hypergradient Descent" automatically adjusts the learning rate to minimize an estimate of the expected loss by introducing the "hypergradient": the gradient of the loss function w.r.t. the hyperparameter "eta" (the optimizer's learning rate). At each training iteration the step size is learned via a gradient-descent update on the hypergradient, and this is applied alongside the model optimizers SGD, SGD with Nesterov (SGDN) and Adam, yielding their hypergradient counterparts SGD-HD, SGDN-HD and Adam-HD. These demonstrate faster convergence of the loss and better generalization than the original (plain) optimizers alone.
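As a concrete illustration, below is a minimal sketch of the SGD-HD update in PyTorch-style Python (the function and variable names are illustrative, not the repository's API). For SGD the hypergradient reduces to the negative dot product of the current and previous model gradients, so a gradient-descent step on the learning rate simply adds beta times that dot product.

import torch

def sgd_hd_step(params, grads, prev_grads, alpha, beta):
    """One illustrative SGD-HD step (not the repository's exact API).

    For SGD, dLoss/d(alpha) = -grad_t . grad_{t-1}, so gradient descent
    on alpha adds beta * (grad_t . grad_{t-1}).
    """
    h = sum(torch.sum(g * pg) for g, pg in zip(grads, prev_grads))
    alpha = alpha + beta * h.item()          # hypergradient-descent update of the learning rate
    with torch.no_grad():
        for p, g in zip(params, grads):      # plain SGD step with the adapted learning rate
            p -= alpha * g
    return alpha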

We expect, however, that the hypergradient-based learning rate update can be made more accurate, and we aim to exploit the gains further by boosting the learning rate updates with momentum and adaptive gradients. We experiment with

  1. Hypergradient descent with momentum, and
  2. Adam with Hypergradient,

alongside the model optimizers SGD, SGD with Nesterov (SGDN) and Adam.

The naming convention used is {model optimizer}op-{learning rate optimizer}lop, giving {model optimizer}op-SGDNlop (when the l.r. optimizer is hypergradient descent with momentum) and {model optimizer}op-Adamlop (when the l.r. optimizer is Adam with hypergradient).

The new optimizers and the respective hypergradient-descent baselines against which their performance is compared are listed below (a sketch of the two learning-rate update rules follows the list):

  • SGDop-SGDNlop, with baseline SGD-HD (i.e. SGDop-SGDlop)
  • SGDNop-SGDNlop, with baseline SGDN-HD (i.e. SGDNop-SGDlop)
  • Adamop-Adamlop, with baseline Adam-HD (i.e. Adamop-SGDlop)
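A minimal sketch of the two learning-rate update rules in plain Python, where h is the dot product of the current and previous model gradients as in the sketch above (i.e. the negative hypergradient); the momentum and Adam hyperparameters mu, b1, b2, eps follow standard defaults and are assumptions here, not necessarily the repository's values:

def lr_update_sgdn(alpha, h, v, beta, mu=0.9):
    # {op}-SGDNlop: hypergradient descent with Nesterov-style momentum on alpha.
    v = mu * v + h                         # momentum buffer over the hypergradient signal
    return alpha + beta * (h + mu * v), v

def lr_update_adam(alpha, h, m, v, t, beta, b1=0.9, b2=0.999, eps=1e-8):
    # {op}-Adamlop: Adam-style adaptive step on alpha driven by the hypergradient signal.
    m = b1 * m + (1 - b1) * h              # first moment estimate
    v = b2 * v + (1 - b2) * h * h          # second moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    return alpha + beta * m_hat / (v_hat ** 0.5 + eps), m, v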

When evaluated against their hypergradient-descent baselines, the new optimizers offer better generalization, faster convergence, and better training stability (less sensitivity to the initially chosen learning rate).

Motivation

The alpha_0 (initial learning rate) and beta (hypergradient l.r.) configurations for the new optimizers are kept the same as for the respective baselines from the paper (see run.sh for details). The results show that the new optimizers perform better for all three models (VGGNet, LogReg, MLP). A more detailed description of the optimizers can be found in the project report here.

Figure: Behavior of the optimizers compared with their hypergradient-descent baselines. Columns, left to right: logistic regression on MNIST; multi-layer neural network on MNIST; VGGNet on CIFAR-10.

Project Structure

The project is organised as follows:

.
├── hypergrad/
│   ├── __init__.py
│   ├── sgd_Hd.py          # model op. SGD, l.r. optimizer hypergradient descent (original)
│   └── adam_Hd.py         # model op. Adam, l.r. optimizer hypergradient descent (original)
├── op_sgd_lop_sgdn.py     # model op. SGD, l.r. optimizer hypergradient descent with momentum
├── op_sgd_lop_adam.py     # model op. SGD, l.r. optimizer Adam with hypergradient
├── op_adam_lop_sgdn.py    # model op. Adam, l.r. optimizer hypergradient descent with momentum
├── op_adam_lop_adam.py    # model op. Adam, l.r. optimizer Adam with hypergradient
├── vgg.py
├── train.py
├── test/                  # results of the experiments
├── plot_src/
├── plots/                 # experiment plots
└── run.sh                 # to run the experiments

The following files and folders are generated after running the experiments:

├── {model}_{optimizer}_{beta}_epochs{X}.pth           # model checkpoint
└── test/{model}/{alpha}_{beta}/{optimizer}.csv        # experiment results
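A hedged usage sketch: assuming hypergrad/sgd_Hd.py exposes an SGDHD class with a hypergrad_lr argument (mirroring the upstream hypergradient-descent repository; the new optimizer modules plug into a training loop the same way, and train.py shows the actual wiring), it is used like any other PyTorch optimizer:

import torch
import torch.nn as nn
from hypergrad.sgd_Hd import SGDHD   # assumed class name, mirroring the upstream repo

model = nn.Linear(784, 10)           # stand-in for the LogReg model on MNIST
criterion = nn.CrossEntropyLoss()
# lr is alpha_0 (initial learning rate); hypergrad_lr is beta (hypergradient l.r.).
optimizer = SGDHD(model.parameters(), lr=1e-3, hypergrad_lr=1e-4)

batches = [(torch.randn(32, 784), torch.randint(0, 10, (32,))) for _ in range(10)]
for x, y in batches:                 # toy batches standing in for an MNIST DataLoader
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()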

Experiments

The experiment configurations (hyperparameters alpha_0 and beta) are defined in run.sh for the optimizers and the three model classes. The experiments for the new optimizers are run with the same settings as their hypergradient-descent versions: LogReg (20 epochs on MNIST), MLP (100 epochs on MNIST) and VGGNet (200 epochs on CIFAR-10).

References

  1. Hypergradient Descent (GitHub repository)

Contributors

ankit-dhankhar, dependabot[bot], harshalmittal4, yashkant
