I have seen a performance boost after switching from Adam to AdaBound. After tuning my model, I found that a learning-rate range of 2e-4 to 2e-2 works best. I am now interested in fine-tuning the model on a new dataset, but I have found that switching to tf.train.GradientDescentOptimizer with a 1e-3 learning rate causes a slow divergence.
In the PyTorch implementation, the authors decay AdaBound's learning rate with a scheduler as follows:
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=150, gamma=0.1, last_epoch=start_epoch)
https://github.com/Luolc/AdaBound/blob/master/demos/cifar10/main.py
I'm unclear how to reduce AdaBound's learning rate in TensorFlow without switching to a different optimizer (SGD) with a fixed learning rate, which seems to destabilize training.
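To make the question concrete, here is a minimal sketch of the kind of decay I'm after: reproducing StepLR's schedule (step_size=150, gamma=0.1) in plain Python, then feeding the resulting value to the optimizer each step. The TF 1.x placeholder pattern in the comments is an assumption about how a TF AdaBound port would accept a learning rate; `AdaBoundOptimizer` is a hypothetical name, not a confirmed API.

```python
def step_lr(base_lr, epoch, step_size=150, gamma=0.1):
    """Learning rate after `epoch` epochs, matching
    torch.optim.lr_scheduler.StepLR(step_size=150, gamma=0.1)."""
    return base_lr * gamma ** (epoch // step_size)

# With base_lr=2e-2 this gives 2e-2 for epochs 0-149, 2e-3 for 150-299, etc.
#
# In TF 1.x the value could then be fed per step via a placeholder, e.g.
# (hypothetical AdaBound port, names assumed):
#   lr = tf.placeholder(tf.float32, shape=[])
#   train_op = AdaBoundOptimizer(learning_rate=lr).minimize(loss)
#   sess.run(train_op, feed_dict={lr: step_lr(2e-2, epoch)})
```

This keeps the same optimizer (and its internal state) throughout training, only varying the learning-rate input, rather than swapping to SGD mid-run.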