Comments (14)
@cryu854 Could you start a pull request and add your contact info in the code? Perhaps also change the version to 0.2.1, so I can upload it to pip. Thanks a lot.
I have the same issue, can anyone provide some insight?
Thanks
The PyTorch version does not have this problem, so I think it's due to the implementation; I'm not so familiar with TensorFlow. The pip package 0.1.0 is an old version compared to the source code under pypi_package/adabelief_tf0.1.0/Adabelief_tf.py, which was merged from a pull request by @cryu854 and should be better optimized, but I have not updated it on pip (so `pip install adabelief-tf` installs an old version). Please try the source code. Let me know if there are any updates.
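If it helps while the pip package lags behind, here is a minimal sketch of using the newer source file directly. It assumes the repository has been cloned locally; the exact directory layout may differ.

```python
# Assumes a local clone, e.g.:
#   git clone https://github.com/juntang-zhuang/Adabelief-Optimizer
import sys
sys.path.insert(0, "Adabelief-Optimizer/pypi_package/adabelief_tf0.1.0")

from Adabelief_tf import AdaBeliefOptimizer  # the source file mentioned above
opt = AdaBeliefOptimizer(learning_rate=1e-3)
```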
@ManoharSai2000 @sumanthsadhu could you provide the code to reproduce the result?
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
from tabulate import tabulate
from colorama import Fore, Back, Style

class AdaBeliefOptimizer(tf.keras.optimizers.Optimizer):
    """
    It implements the AdaBeliefOptimizer proposed by
    Juntang Zhuang et al. in [AdaBelief Optimizer: Adapting stepsizes by the belief
    in observed gradients](https://arxiv.org/abs/2010.07468).

    Example of usage:
    ```python
    from adabelief_tf import AdaBeliefOptimizer
    opt = AdaBeliefOptimizer(lr=1e-3)
    ```

    Note: `amsgrad` is not described in the original paper. Use it with caution.

    AdaBeliefOptimizer is not a replacement of the heuristic warmup; the settings should
    be kept if warmup has already been employed and tuned in the baseline method.
    You can enable warmup by setting `total_steps` and `warmup_proportion`:
    ```python
    opt = AdaBeliefOptimizer(
        lr=1e-3,
        total_steps=10000,
        warmup_proportion=0.1,
        min_lr=1e-5,
    )
    ```
    In the above example, the learning rate will increase linearly
    from 0 to `lr` in 1000 steps, then decrease linearly from `lr`
    to `min_lr` in 9000 steps.

    Lookahead, proposed by Michael R. Zhang et al. in the paper
    [Lookahead Optimizer: k steps forward, 1 step back](https://arxiv.org/abs/1907.08610v1),
    can be integrated with AdaBeliefOptimizer. The combination, announced by Less Wright,
    is also called "Ranger". The mechanism can be enabled by using the Lookahead
    wrapper. For example:
    ```python
    adabelief = AdaBeliefOptimizer()
    ranger = tfa.optimizers.Lookahead(adabelief, sync_period=6, slow_step_size=0.5)
    ```

    Example of serialization:
    ```python
    optimizer = AdaBeliefOptimizer(learning_rate=lr_scheduler, weight_decay=wd_scheduler)
    config = tf.keras.optimizers.serialize(optimizer)
    new_optimizer = tf.keras.optimizers.deserialize(
        config, custom_objects={"AdaBeliefOptimizer": AdaBeliefOptimizer}
    )
    ```
    """
    def __init__(
        self,
        learning_rate=0.001,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-14,
        weight_decay=0.0,
        rectify=True,
        amsgrad=False,
        sma_threshold=5.0,
        total_steps=0,
        warmup_proportion=0.1,
        min_lr=0.0,
        name="AdaBeliefOptimizer",
        **kwargs):
        r"""Construct a new AdaBelief optimizer.
        Args:
            learning_rate: A `Tensor` or a floating point value, or a schedule
                that is a `tf.keras.optimizers.schedules.LearningRateSchedule`.
                The learning rate.
            beta_1: A float value or a constant float tensor.
                The exponential decay rate for the 1st moment estimates.
            beta_2: A float value or a constant float tensor.
                The exponential decay rate for the 2nd moment estimates.
            epsilon: A small constant for numerical stability.
            weight_decay: A `Tensor` or a floating point value, or a schedule
                that is a `tf.keras.optimizers.schedules.LearningRateSchedule`.
                Weight decay for each parameter.
            rectify: boolean. Whether to enable rectification as in RectifiedAdam.
            amsgrad: boolean. Whether to apply the AMSGrad variant of this
                algorithm from the paper "On the Convergence of Adam and
                Beyond".
            sma_threshold: A float value.
                The threshold for the simple moving average.
            total_steps: An integer. Total number of training steps.
                Enable warmup by setting a positive value.
            warmup_proportion: A floating point value.
                The proportion of increasing steps.
            min_lr: A floating point value. Minimum learning rate after warmup.
            name: Optional name for the operations created when applying
                gradients. Defaults to "AdaBeliefOptimizer".
            **kwargs: keyword arguments. Allowed to be {`clipnorm`,
                `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients
                by norm; `clipvalue` is clip gradients by value; `decay` is
                included for backward compatibility to allow time inverse
                decay of learning rate. `lr` is included for backward
                compatibility; `learning_rate` is recommended instead.
        """
        super().__init__(name, **kwargs)
        # ------------------------------------------------------------------------------
        # Print modifications to default arguments
        # print(Fore.RED + 'Please check your arguments if you have upgraded adabelief-tf from version 0.0.1.')
        # print(Fore.RED + 'Modifications to default arguments:')
        # default_table = tabulate([
        #     ['adabelief-tf=0.0.1', '1e-8', 'Not supported', 'Not supported'],
        #     ['Current version (0.1.0)', '1e-14', 'supported', 'default: True']],
        #     headers=['eps', 'weight_decouple', 'rectify'])
        # print(Fore.RED + default_table)
        # print(Fore.RED + 'For a complete table of recommended hyperparameters, see')
        # print(Fore.RED + 'https://github.com/juntang-zhuang/Adabelief-Optimizer')
        print(Style.RESET_ALL)
        # ------------------------------------------------------------------------------
        self._set_hyper("learning_rate", kwargs.get("lr", learning_rate))
        self._set_hyper("beta_1", beta_1)
        self._set_hyper("beta_2", beta_2)
        self._set_hyper("decay", self._initial_decay)
        self._set_hyper("weight_decay", weight_decay)
        self._set_hyper("sma_threshold", sma_threshold)
        self._set_hyper("total_steps", int(total_steps))
        self._set_hyper("warmup_proportion", warmup_proportion)
        self._set_hyper("min_lr", min_lr)
        self.epsilon = epsilon or tf.keras.backend.epsilon()
        self.amsgrad = amsgrad
        self.rectify = rectify
        self._has_weight_decay = weight_decay != 0.0
        self._initial_total_steps = total_steps

    def _create_slots(self, var_list):
        for var in var_list:
            self.add_slot(var, "m")
        for var in var_list:
            self.add_slot(var, "v")
        for var in var_list:
            self.add_slot(var, "grad_dif")
        if self.amsgrad:
            for var in var_list:
                self.add_slot(var, "vhat")

    def set_weights(self, weights):
        params = self.weights
        # Older checkpoints may include vhat slots even without amsgrad;
        # drop the extra weights so they remain loadable by this optimizer.
        num_vars = int((len(params) - 1) / 2)
        if len(weights) == 4 * num_vars + 1:
            weights = weights[: len(params)]
        super().set_weights(weights)

    def _decayed_wd(self, var_dtype):
        wd_t = self._get_hyper("weight_decay", var_dtype)
        if isinstance(wd_t, tf.keras.optimizers.schedules.LearningRateSchedule):
            wd_t = tf.cast(wd_t(self.iterations), var_dtype)
        return wd_t

    def _resource_apply_dense(self, grad, var):
        var_dtype = var.dtype.base_dtype
        lr_t = self._decayed_lr(var_dtype)
        wd_t = self._decayed_wd(var_dtype)
        m = self.get_slot(var, "m")
        v = self.get_slot(var, "v")
        beta_1_t = self._get_hyper("beta_1", var_dtype)
        beta_2_t = self._get_hyper("beta_2", var_dtype)
        epsilon_t = tf.convert_to_tensor(self.epsilon, var_dtype)
        local_step = tf.cast(self.iterations + 1, var_dtype)
        beta_1_power = tf.math.pow(beta_1_t, local_step)
        beta_2_power = tf.math.pow(beta_2_t, local_step)

        # Optional warmup: linearly increase lr for the first
        # warmup_proportion * total_steps steps, then decay linearly to min_lr.
        if self._initial_total_steps > 0:
            total_steps = self._get_hyper("total_steps", var_dtype)
            warmup_steps = total_steps * self._get_hyper("warmup_proportion", var_dtype)
            min_lr = self._get_hyper("min_lr", var_dtype)
            decay_steps = tf.maximum(total_steps - warmup_steps, 1)
            decay_rate = (min_lr - lr_t) / decay_steps
            lr_t = tf.where(
                local_step <= warmup_steps,
                lr_t * (local_step / warmup_steps),
                lr_t + decay_rate * tf.minimum(local_step - warmup_steps, decay_steps),
            )

        # Length of the approximated simple moving average, as in RectifiedAdam.
        sma_inf = 2.0 / (1.0 - beta_2_t) - 1.0
        sma_t = sma_inf - 2.0 * local_step * beta_2_power / (1.0 - beta_2_power)

        # First moment: exponential moving average (EMA) of the gradients.
        m_t = m.assign(
            beta_1_t * m + (1.0 - beta_1_t) * grad, use_locking=self._use_locking
        )
        m_corr_t = m_t / (1.0 - beta_1_power)

        # Note: this slot is written but never read in this implementation.
        grad_dif = self.get_slot(var, "grad_dif")
        grad_dif.assign(grad - m_t)

        # Second moment: EMA of the squared deviation of the gradient from its
        # EMA (the "belief"), instead of the squared gradient as in Adam.
        v_t = v.assign(
            beta_2_t * v + (1.0 - beta_2_t) * tf.math.square(grad - m_t) + epsilon_t,
            use_locking=self._use_locking,
        )

        if self.amsgrad:
            vhat = self.get_slot(var, "vhat")
            vhat_t = vhat.assign(tf.maximum(vhat, v_t), use_locking=self._use_locking)
            v_corr_t = tf.math.sqrt(vhat_t / (1.0 - beta_2_power))
        else:
            vhat_t = None
            v_corr_t = tf.math.sqrt(v_t / (1.0 - beta_2_power))

        # Variance rectification term from RectifiedAdam.
        r_t = tf.math.sqrt(
            (sma_t - 4.0)
            / (sma_inf - 4.0)
            * (sma_t - 2.0)
            / (sma_inf - 2.0)
            * sma_inf
            / sma_t
        )

        if self.rectify:
            sma_threshold = self._get_hyper("sma_threshold", var_dtype)
            var_t = tf.where(
                sma_t >= sma_threshold,
                r_t * m_corr_t / (v_corr_t + epsilon_t),
                m_corr_t,
            )
        else:
            var_t = m_corr_t / (v_corr_t + epsilon_t)

        # Decoupled weight decay.
        if self._has_weight_decay:
            var_t += wd_t * var

        var_update = var.assign_sub(lr_t * var_t, use_locking=self._use_locking)

        updates = [var_update, m_t, v_t]
        if self.amsgrad:
            updates.append(vhat_t)
        return tf.group(*updates)

    def _resource_apply_sparse(self, grad, var, indices):
        var_dtype = var.dtype.base_dtype
        lr_t = self._decayed_lr(var_dtype)
        wd_t = self._decayed_wd(var_dtype)
        beta_1_t = self._get_hyper("beta_1", var_dtype)
        beta_2_t = self._get_hyper("beta_2", var_dtype)
        epsilon_t = tf.convert_to_tensor(self.epsilon, var_dtype)
        local_step = tf.cast(self.iterations + 1, var_dtype)
        beta_1_power = tf.math.pow(beta_1_t, local_step)
        beta_2_power = tf.math.pow(beta_2_t, local_step)

        # Optional warmup, same schedule as in the dense case.
        if self._initial_total_steps > 0:
            total_steps = self._get_hyper("total_steps", var_dtype)
            warmup_steps = total_steps * self._get_hyper("warmup_proportion", var_dtype)
            min_lr = self._get_hyper("min_lr", var_dtype)
            decay_steps = tf.maximum(total_steps - warmup_steps, 1)
            decay_rate = (min_lr - lr_t) / decay_steps
            lr_t = tf.where(
                local_step <= warmup_steps,
                lr_t * (local_step / warmup_steps),
                lr_t + decay_rate * tf.minimum(local_step - warmup_steps, decay_steps),
            )

        sma_inf = 2.0 / (1.0 - beta_2_t) - 1.0
        sma_t = sma_inf - 2.0 * local_step * beta_2_power / (1.0 - beta_2_power)

        # First moment: decay the EMA, then scatter-add the scaled gradient.
        m = self.get_slot(var, "m")
        m_scaled_g_values = grad * (1 - beta_1_t)
        m_t = m.assign(m * beta_1_t, use_locking=self._use_locking)
        m_t = self._resource_scatter_add(m, indices, m_scaled_g_values)
        m_corr_t = m_t / (1.0 - beta_1_power)

        # Note: this slot is written but never read in this implementation.
        grad_dif = self.get_slot(var, "grad_dif")
        grad_dif.assign(m_t)
        grad_dif = self._resource_scatter_add(grad_dif, indices, -1.0 * grad)

        # Second moment: EMA of the squared deviation of the sparse gradient
        # from the first moment at the updated indices.
        v = self.get_slot(var, "v")
        m_t_indices = tf.gather(m_t, indices)
        v_scaled_g_values = tf.math.square(grad - m_t_indices) * (1 - beta_2_t)
        v_t = v.assign(v * beta_2_t + epsilon_t, use_locking=self._use_locking)
        v_t = self._resource_scatter_add(v, indices, v_scaled_g_values)

        if self.amsgrad:
            vhat = self.get_slot(var, "vhat")
            vhat_t = vhat.assign(tf.maximum(vhat, v_t), use_locking=self._use_locking)
            v_corr_t = tf.math.sqrt(vhat_t / (1.0 - beta_2_power))
        else:
            vhat_t = None
            v_corr_t = tf.math.sqrt(v_t / (1.0 - beta_2_power))

        # Variance rectification term from RectifiedAdam.
        r_t = tf.math.sqrt(
            (sma_t - 4.0)
            / (sma_inf - 4.0)
            * (sma_t - 2.0)
            / (sma_inf - 2.0)
            * sma_inf
            / sma_t
        )

        if self.rectify:
            sma_threshold = self._get_hyper("sma_threshold", var_dtype)
            var_t = tf.where(
                sma_t >= sma_threshold,
                r_t * m_corr_t / (v_corr_t + epsilon_t),
                m_corr_t,
            )
        else:
            var_t = m_corr_t / (v_corr_t + epsilon_t)

        # Decoupled weight decay.
        if self._has_weight_decay:
            var_t += wd_t * var

        var_update = self._resource_scatter_add(
            var, indices, tf.gather(-lr_t * var_t, indices)
        )

        updates = [var_update, m_t, v_t]
        if self.amsgrad:
            updates.append(vhat_t)
        return tf.group(*updates)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "learning_rate": self._serialize_hyperparameter("learning_rate"),
                "beta_1": self._serialize_hyperparameter("beta_1"),
                "beta_2": self._serialize_hyperparameter("beta_2"),
                "decay": self._serialize_hyperparameter("decay"),
                "weight_decay": self._serialize_hyperparameter("weight_decay"),
                "sma_threshold": self._serialize_hyperparameter("sma_threshold"),
                "epsilon": self.epsilon,
                "amsgrad": self.amsgrad,
                "rectify": self.rectify,
                "total_steps": self._serialize_hyperparameter("total_steps"),
                "warmup_proportion": self._serialize_hyperparameter(
                    "warmup_proportion"
                ),
                "min_lr": self._serialize_hyperparameter("min_lr"),
            }
        )
        return config
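
# Usage (the "second part" of this comment): four optimizers for a
# CycleGAN-style setup, with lr=2e-4 and beta_1=0.5 as in the code below.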
generator_g_optimizer = AdaBeliefOptimizer(2e-4, beta_1=0.5)
generator_f_optimizer = AdaBeliefOptimizer(2e-4, beta_1=0.5)
discriminator_x_optimizer = AdaBeliefOptimizer(2e-4, beta_1=0.5)
discriminator_y_optimizer = AdaBeliefOptimizer(2e-4, beta_1=0.5)
The first part is the AdaBelief source code, and the second part is its usage.
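To make the usage concrete, here is a hedged sketch of how one of these optimizers could be applied in a custom training step; `generator_g` and `gen_g_loss` are hypothetical placeholders in the style of the TensorFlow CycleGAN tutorial, not code from this repository:

```python
import tensorflow as tf

@tf.function
def train_step(real_x, real_y):
    # generator_g and gen_g_loss are hypothetical placeholders; only the
    # optimizer wiring via apply_gradients is the point of this sketch.
    with tf.GradientTape() as tape:
        fake_y = generator_g(real_x, training=True)
        loss = gen_g_loss(fake_y, real_y)
    grads = tape.gradient(loss, generator_g.trainable_variables)
    generator_g_optimizer.apply_gradients(zip(grads, generator_g.trainable_variables))
    return loss
```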
Hi @ManoharSai2000, @sumanthsadhu. I tried the TensorFlow CycleGAN example here and trained with both Adam and AdaBelief; one epoch took 490 sec and 560 sec respectively on a Tesla T4 in Google Colab.
In my opinion, since Adam is further optimized with fused kernels while AdaBelief is implemented with plain TensorFlow ops wrapped as a Keras optimizer, some efficiency gap is inevitable.
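For anyone who wants a rough reproduction of this kind of comparison, a minimal timing sketch (with a stand-in model and random data rather than the CycleGAN example) might look like:

```python
import time
import tensorflow as tf
from adabelief_tf import AdaBeliefOptimizer

def make_model():
    # Stand-in model; any small Keras model works for a rough timing check.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])

x = tf.random.normal((8192, 784))
y = tf.random.uniform((8192,), maxval=10, dtype=tf.int32)

for opt in (tf.keras.optimizers.Adam(1e-3), AdaBeliefOptimizer(learning_rate=1e-3)):
    model = make_model()
    model.compile(
        optimizer=opt,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
    model.fit(x, y, batch_size=256, epochs=1, verbose=0)  # warm-up / tracing
    start = time.time()
    model.fit(x, y, batch_size=256, epochs=3, verbose=0)
    print(type(opt).__name__, f"{time.time() - start:.1f} sec")
```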
@cryu854, OK, thank you. Is this issue the same in PyTorch? I hope the optimizer will be added to TensorFlow soon, as you mentioned.
@ManoharSai2000 To my knowledge, PyTorch seems to need an additional compiler such as its JIT to fuse kernels automatically; otherwise, it launches a separate kernel for each operation. Please correct me if I'm wrong.
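As a generic illustration of this point (not specific to AdaBelief), TorchScript can fuse a chain of elementwise ops that eager mode would dispatch as separate kernels:

```python
import torch

def ema_update(m: torch.Tensor, grad: torch.Tensor, beta: float) -> torch.Tensor:
    # In eager mode, each elementwise op in this chain dispatches its own kernel.
    return beta * m + (1.0 - beta) * grad

# Scripting allows the JIT fuser to combine the elementwise chain.
ema_update_fused = torch.jit.script(ema_update)

device = "cuda" if torch.cuda.is_available() else "cpu"
m = torch.randn(1_000_000, device=device)
g = torch.randn_like(m)
out = ema_update_fused(m, g, 0.9)
```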
@cryu854 Thanks for the update. I just found that the code uses a slot called "grad_dif", but it seems it's not used.
@juntang-zhuang The new code looks good to me, and it passes all the test cases in Adabelief_test.py.
Btw, should we move Adabelief_test.py out of the folder? I am not sure whether the PyPI package will include the test code.
@cryu854 Thanks a lot. I just deleted the test code and uploaded it to pip as 0.2.0. BTW, do you want to add your name and email at the beginning of the file as a contributor? If so, I'll update it in version 0.2.1. Thanks again for your efforts and help.
@ManoharSai2000 @sumanthsadhu Just removed some redundant computation in the code and released adabelief-tf==0.2.0. Please try it from pip with `pip install adabelief-tf==0.2.0`; it should be a little bit faster now, though we did not implement fused kernel operations. The source code is in pypi_packages/adabelief_tf0.2.0.
@juntang-zhuang Yes, if it's no trouble, it would be an honor for me to be a part of the contributors. Thank you in advance.