
adatune's Introduction

AdaTune

AdaTune is a library for gradient-based hyperparameter tuning when training deep neural networks. AdaTune currently supports tuning of the learning_rate parameter, but some of the methods implemented here can be extended to other hyperparameters such as momentum or weight_decay. AdaTune provides the following gradient-based hyperparameter tuning algorithms: HD, RTHO and our newly proposed algorithm, MARTHE. The repository also contains other commonly used non-adaptive learning_rate scheduling strategies such as staircase-decay, exponential-decay and cosine-annealing-with-restarts. The library is implemented in PyTorch.

Mathematical Formulation of the Problem

The goal of the methods in this package is to automatically compute, in an online fashion, a learning rate schedule for stochastic optimization methods (such as SGD) based only on the given learning task, aiming to produce models with small validation error.

Theoretically, we have to solve the problem of finding a learning rate (LR) schedule under the framework of gradient-based hyperparameter optimization. In this sense, we consider as an optimal schedule a solution to the following constrained optimization problem:

minimize E(w_T) over (η_0, …, η_{T−1}),   subject to   w_{t+1} = Φ_t(w_t, η_t)

for t = 0, …, T−1, where E is an objective function, Φ_t is a (possibly stochastic) weight update dynamics, w_0 represents the initial model weights (parameters) and, finally, w_t are the weights after t iterations.

We can think of E as either the training or the validation loss of the model, while the dynamics Φ_t describe the update rule (such as SGD, SGD-Momentum, Adam etc.). For example, in the case of SGD, Φ_t(w_t, η_t) = w_t − η_t ∇L_t(w_t), with L_t the (possibly regularized) training loss on the t-th minibatch. The horizon T should be large enough that the training error can be effectively minimized, in order to avoid underfitting. Note that a too-large value of T does not necessarily harm, since setting η_t = 0 for all t beyond some iteration is still a feasible solution, implementing early stopping in this setting.
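
To make the idea concrete, here is a toy sketch of the simplest member of this family, hypergradient descent (HD) for plain SGD, where the hypergradient of the loss with respect to the learning rate reduces to the dot product of the current and previous gradients. This illustrates the principle only; it is not AdaTune's implementation.

import torch

def sgd_hd_step(params, grads, prev_grads, lr, hyper_lr):
    """One SGD step whose learning rate is itself updated from a hypergradient.

    For SGD, dE/d(lr) at step t is -(g_t . g_{t-1}), so gradient descent on the
    learning rate becomes: lr <- lr + hyper_lr * (g_t . g_{t-1}).
    """
    if prev_grads is not None:
        hypergrad = sum((g * pg).sum() for g, pg in zip(grads, prev_grads))
        lr = lr + hyper_lr * hypergrad.item()
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * g
    return lr  # the caller keeps grads around as prev_grads for the next step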

Installation

The library can be installed (from source) like this:

git clone https://github.com/awslabs/adatune.git
cd adatune
python setup.py install

Usage

You can easily replace a non-adaptive learning_rate based training procedure with an adaptive one (RTHO/MARTHE) like this:

Non Adaptive

loss.backward()
optimizer.step()

Adaptive

first_grad = ag.grad(loss, net.parameters(), create_graph=True, retain_graph=True)
hyper_optim.compute_hg(net, first_grad)
for params, gradients in zip(net.parameters(), first_grad):
    params.grad = gradients
optimizer.step()
hyper_optim.hyper_step(vg.val_grad(net))
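
For additional context, here is a minimal sketch of how this snippet might sit inside a training loop. The surrounding names are assumptions inferred from the snippet and the bin/ scripts rather than a documented API: ag is taken to be torch.autograd, hyper_optim one of the hyper-optimizers defined in adatune (mu_sgd.py / mu_adam.py), and vg a helper whose val_grad(net) returns the gradients of the validation loss; see bin/rtho.py for the exact construction.

import torch
import torch.autograd as ag  # the snippet above uses `ag` for torch.autograd

# Hypothetical setup; the real objects are constructed in bin/rtho.py:
#   net         - a torch.nn.Module
#   optimizer   - a torch.optim optimizer over net.parameters()
#   hyper_optim - an AdaTune hyper-optimizer (see adatune/mu_sgd.py, mu_adam.py)
#   vg          - a helper whose val_grad(net) returns the validation-loss gradients
for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = criterion(net(inputs), targets)

    # Keep the graph so the Hessian-vector product can be computed afterwards.
    first_grad = ag.grad(loss, net.parameters(), create_graph=True, retain_graph=True)
    hyper_optim.compute_hg(net, first_grad)

    # Copy the gradients onto .grad so the regular optimizer can apply them.
    for params, gradients in zip(net.parameters(), first_grad):
        params.grad = gradients
    optimizer.step()

    # Hypergradient step: adapt the learning rate using the validation gradient.
    hyper_optim.hyper_step(vg.val_grad(net))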

There are two standalone Python scripts provided in the bin directory which show in detail how to use the library.

  • baselines.py - This file contains all the baselines we compare against while developing MARTHE (apart from RTHO). The parameters defined in the cli_def function are self-explanatory. You can change the learning_rate adaptation strategy with the lr-scheduler parameter defined there.

For example, if you want to run cosine-annealing-with-restarts for VGG-11 on CIFAR-10 with SGD-momentum as the optimizer, you can run it like this after the package is installed:

python bin/baselines.py --network vgg --dataset cifar_10 --optimizer sgd --momentum 0.9 --lr-scheduler cyclic

  • rtho.py - This file contains the implementation of RTHO and MARTHE. MARTHE is a generalization of RTHO and HD. It is implemented together with RTHO because both algorithms share the common component of computing the Hessian-Vector-Product.

If you want to run MARTHE, HD, or RTHO, you can run it like this:

python bin/rtho.py --network vgg --dataset cifar_10 --optimizer sgd --momentum 0.9 --hyper-lr 1e-8

If you pass mu as 1.0, the algorithm behaves as RTHO. If you pass mu as 0, the algorithm is similar to HD (though the outer gradient will be computed on the validation set instead of the training set).

In order to automatically set and adapt mu, set it to any value less than 0.0. If you do not want adaptive behavior for mu, you can instead pass a fixed value of mu, typically in the range [0.99, 0.999].

If you set alpha to 0.0, the hyper-lr value will stay the same for the whole training procedure.

Generally, for all the gradient-based methods, the value of hyper-lr should be set at least 3-4 orders of magnitude lower for Adam than for SGD (without momentum).

In order to automatically set and adapt hyper-lr, set alpha to a small positive value (e.g. 1e-6).

You can use a simple linear search to find a workable value of alpha: start from a higher value and gradually reduce it, checking when the algorithm stops diverging (a sketch is given below). Generally, if the value of alpha is too high for a given task, the algorithm will diverge within the first few epochs.
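
A minimal sketch of that search, assuming a hypothetical run_training(alpha) helper that returns False when training diverges (e.g. the loss becomes NaN) within the first few epochs:

def find_alpha(run_training, start=1e-3, factor=10.0, min_alpha=1e-9):
    # Try progressively smaller values of alpha until training stops diverging.
    alpha = start
    while alpha >= min_alpha:
        if run_training(alpha):   # stable run: this alpha is usable
            return alpha
        alpha /= factor           # diverged: reduce alpha and retry
    return None                   # no stable alpha found in the searched range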

In the future, we plan to implement a find_hyper_lr method to automatically handle the linear search over alpha as well (removing any human intervention from the whole procedure).

For both scripts, there is a parameter called model-loc which determines where the trained model will be saved. Please create this directory before running the code if you are using a directory other than the current working directory.

Networks

network.py implements LeNet-5, VGG, ResNet, MLP, Wide-ResNet and DenseNet-BC. So far, experiments have mostly been done with VGG-11 and ResNet-18.

Datasets & DataLoaders

The list of available Datasets/DataLoaders can be found in data_loader.py. The datasets currently supported are MNIST, CIFAR-10 and CIFAR-100. The DataLoader classes will download these datasets when used for the first time.
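
Downloading on first use is standard torchvision behavior; as a generic illustration (the actual helper names and defaults in data_loader.py may differ), a CIFAR-10 loader typically looks like this:

import torch
import torchvision
import torchvision.transforms as transforms

# Generic torchvision example, not the exact API of data_loader.py:
# download CIFAR-10 on first use and wrap it in a DataLoader.
transform = transforms.ToTensor()
train_set = torchvision.datasets.CIFAR10(root='./data', train=True,
                                         download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)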

Results comparing MARTHE and other methods

[Figure: accuracy on CIFAR-10 with VGG, comparing MARTHE with the other methods.]

For further details, please refer to the original paper.

How to cite

This code is based on ideas from the following paper:

Donini et al. "MARTHE: Scheduling the Learning Rate Via Online Hypergradients" IJCAI-PRICAI 2020.

Bibtex citation:

@inproceedings{donini2020MARTHE,
  title={MARTHE: Scheduling the Learning Rate Via Online Hypergradients},
  author={Donini, Michele and Franceschi, Luca and Majumder, Orchid and Pontil, Massimiliano and Frasconi, Paolo},
  booktitle={Proceedings of the 29th International Joint Conference on Artificial Intelligence and the 17th Pacific Rim International Conference on Artificial Intelligence},
  year={2020},
  organization={AAAI Press}
}

adatune's People

Contributors

jamesiri, jmikko, orchidmajumder

adatune's Issues

View size not compatible error in rtho.py

Team,

While running the sample script ./bin/rtho.py using the command provided in the README, the following error occurs:
python bin/baselines.py --network vgg --dataset cifar_10 --optimizer sgd --momentum 0.9 --lr-scheduler cyclic

Traceback (most recent call last):
  File "bin/rtho.py", line 125, in <module>
    train_rtho(args.network, args.dataset, args.num_epoch, args.batch_size, args.optimizer, args.lr, args.momentum,
  File "bin/rtho.py", line 92, in train_rtho
    hyper_optim.compute_hg(net, first_grad)
  File "/home/tyu/sandbox/original_adatune/adatune/adatune/mu_sgd.py", line 68, in compute_hg
    hvp_flatten = torch.cat([h.view(-1) for h in hvp])
  File "/home/tyu/sandbox/original_adatune/adatune/adatune/mu_sgd.py", line 68, in <listcomp>
    hvp_flatten = torch.cat([h.view(-1) for h in hvp])
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

The same error is reported when using Adam, in mu_adam.py.

Changing line 68 in mu_sgd.py to the following (adding .contiguous()) fixed the problem, and the script runs:
hvp_flatten = torch.cat([h.contiguous().view(-1) for h in hvp])

It would be good if you could advise whether this is a correct fix.
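
For what it's worth, the error message itself points to an equivalent alternative; a minimal variant of the same line, untested against this repo:

hvp_flatten = torch.cat([h.reshape(-1) for h in hvp])  # reshape() only copies when the tensor is non-contiguous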

Failed reproducing MARTHE on CIFAR-100...

Hi, I was trying to reproduce your experiment in CIFAR100 but got into trouble, would you please give a little help?

  1. When running the code like
    python bin/rtho.py --network resnet --dataset cifar_100 --optimizer adam --lr 1e-5 --mu -0.1
    I cannot reproduce the test accuracy presented in your paper, which is over 75% after 200 epochs; I got only 40%. The learning rate descended from 1e-5 to around 1e-7, and some of the learning rates were less than 0. Setting mu=0 did not help, and mu=0.99 results in the training loss becoming NaN. How can I fix this?

  2. Is computing the hypergradient for the learning rate supposed to be slow? I'm using an RTX 2080, and it takes more than 15 minutes to compute the hypergradient (the compute_hg function) in one epoch. Is it because there are too many matrix multiplications, or am I doing something wrong?

I would really appreciate it if you can look into these problems!

Implementing MARTHE for other models - location of the last code line in the training loop?

Hi - quite excited about your work here on hypergradient optimization.

I ran the baselines with VGG to test things out and am now trying to set up MARTHE for my own training loop with an effdet object detector. I'm not able to get any actual learning to take place, so I want to clarify the placement, especially of the last line of your implementation block.

The readme states to implement:

first_grad = ag.grad(loss, net.parameters(), create_graph=True, retain_graph=True)
hyper_optim.compute_hg(net, first_grad)
for params, gradients in zip(net.parameters(), first_grad):
    params.grad = gradients
optimizer.step()

hyper_optim.hyper_step(vg.val_grad(net)) #<-- last line

My question is: where does the last line go? Literally right after optimizer.step() as shown above, or is it moved below, to after the validation step is done?

   if epoch % test_interval == 0:
       model.eval()

If I move it to where I think it should go (after eval), then it complains about vg not being defined... which is another question: where does the vg object come from?

Possibly related to my issue - I get this error on every mini-batch as well:
One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.

I'll re-read the paper and see what I might have missed, but confirming the code placement above would be appreciated (as well as any other guidelines so I can get this up and running).
Thanks!

Reproduce problem

I tried to run your experiments with the provided hyperparameters, but I found the results are worse than reported. Do you know what the issue might be?

How can I reproduce the experiments in the paper?

Hi, I'd like to appreciate your great work!

I am trying to reproduce the experiments in Section 6 of your paper. However, it is a bit hard to figure out the exact configuration for each of the experiments (at least for me).

Mostly I was focusing on CIFAR-100 + ResNet-18 case.

For instance, I tried to get the baseline result of staircase decay by running the following script.

python bin/baselines.py --network resnet --dataset cifar_100 \
--optimizer adam --lr 0.0003 --wd 5e-4 \
--lr-scheduler staircase --step-size 60 --lr-decay 0.1

As a result, I got 74.xx% accuracy over several runs, which deviates from the 76.40% reported in the paper. May I ask for some suggestions on where to look?

Moreover, I could not find the exact hyperparameter configuration for RTHO and MARTHE. I assumed that the configuration would look like this:

# RTHO
python bin/rtho.py --network resnet --dataset cifar_100 \ 
--optimizer adam --lr 0.0003 --wd 5e-4 \
--hyper-lr 1e-8 --alpha 0.0 --mu 1.0

# MARTHE
python bin/rtho.py --network resnet --dataset cifar_100 \ 
--optimizer adam --lr 0.0003 --wd 5e-4 \
--hyper-lr 1e-8 --alpha 1e-6 --mu -0.99999

Especially, it is a bit hard to understand the role of mu when it is negative. The following is from the readme file, and the second sentence confuses me the most:

In order to automatically set and adapt mu, set it to any value less than 0.0. You can also pass a value of mu in the range of [0.99, 0.999] if you don't want an adaptive behavior for mu only.

Can you explain further the behavior of the code when mu is negative? And may I ask for advice (or a script) to reproduce the experiments as reported in the paper?

Thank you so much for reading!

Support for large-scale hyperparameters or not?

Thanks for making this package public.

Does this package support large-scale hyperparameters?

For example, as we know, hyperparameter optimization problems can be mathematically formulated as bilevel programming problems. If the variables in the inner optimization (the parameters) and in the outer optimization (the hyperparameters) are comparable in number, and both are on the scale of neural network weights, does this package still work?

Reproducing MARTHE's learning rate schedules

Hi, I am trying to reproduce the interesting results presented in figure 3 of your paper, especially the learning rate schedules your algorithm found.
However, a few trials with different alpha and seeds on CIFAR10 give me LR schedules that look like exponential decay (i.e. no increase in LR in the early stages of training).
Here is the graph of LR schedules that I found:
[Figure: learning rate schedules obtained in these trials.]

I assume that the example schedule presented in figure 3 is a special case (for some specific seed), or I am missing something. Could you share your thoughts on this?

(The legend string MARTHE_(1)(2)(3) represents (1): 1st or 2nd order, (2): alpha, (3): seed, where all the other hyperparameters are set to the values described in the paper.)

Thanks in advance!

learning_rate becomes nan in rtho.py

Team,

I am trying to run your sample script ./bin/rtho.py, using the sample command in the README.
python bin/rtho.py --network vgg --dataset cifar_10 --optimizer sgd --momentum 0.9 --hyper-lr 1e-8

After val_accuracy increased in the first few epochs, learning_rate became NaN and the training accuracy got stuck at 10.0.
Sample output is attached at the end.

Running baselines.py using the provided command works fine; the best val_accuracy reaches a good value.

Can you advise what is causing the learning rate NaN issue in rtho.py?
Is this behaviour expected for these methods?

[Screenshot: sample output from the run described above.]
