juntang-zhuang / adabelief-optimizer Goto Github PK

Repository for NeurIPS 2020 Spotlight "AdaBelief Optimizer: Adapting stepsizes by the belief in observed gradients"

License: BSD 2-Clause "Simplified" License

Python 11.42% Shell 0.04% Jupyter Notebook 88.53%

adabelief-optimizer's Introduction

adabelief-optimizer's People

Contributors

Stargazers

Watchers

Forkers

kiminh yyht albertvillanova sbhadade ankitshah009 sun-peach hfxunlp ywz1993 templeblock chisyliu zeta1999 scape1989 stjordanis caorenzhi cryu854 global-localhost global19 mayinjin eyaler qianrenjian suhuynh gavinzjchao taomo shercklo cheriylan yueyericardo mldl theamazingelys pdimitrov-thoughtriver qiuweibin2005 peterouzh darwin-systems trendingtechnology aimetrics aysegulbumin longjohncoder mertcanatan junhyunb sts-sadr valeman nisheethjaiswal ls6468 sugerdonut ronghanghu hzhang57 frozennep gtadiparthi pipipiu maybeee18 hongyunnchen daixiangzi xrosliang arrufat theodorosk ye-hanyu smallwhite97 vpj tubbz-alt lzh990711 blkerby liuledian goncaloperes linghm fagan2888 alighofrani95 drahnreb garrettkatz tranhp98 xyh97 neilliang90 nick-f0813 sandy4321 purbayankar dlrudco saqibmamoon koide-lab itsuki8914 densechen mushroomhunting sinhmd wes-chen qingshanqihai111 shikishima-tasakilab minhnhatvt prixt chuanli11 valv asjir wwoods lelantos42 wangclin emp325 kaltsiti nikitara klae01 denfrost yifan16 world4jason myth-coder uunravel

adabelief-optimizer's Issues

Tensorflow implementation doesn't work

TF 2.3

from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
import numpy as np
from adabelief_tf import AdaBeliefOptimizer

x = np.random.random_sample((5,))
y = np.random.random_sample((5,))

model = Sequential()
model.add(Dense(1))
model.compile(loss='mse',
              optimizer=AdaBeliefOptimizer())

model.fit(x, y)

Traceback (most recent call last):
  File "C:/Users/Ben/PycharmProjects/tradingbot/trainer/test.py", line 14, in <module>
    model.fit(x, y)
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\keras\engine\training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1098, in fit
    tmp_logs = train_function(iterator)
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 823, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 696, in _initialize
    self._stateful_fn._get_concrete_function_internal_garbage_collected(  # pylint: disable=protected-access
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\eager\function.py", line 2855, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\eager\function.py", line 3213, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\eager\function.py", line 3065, in _create_graph_function
    func_graph_module.func_graph_from_py_func(
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\framework\func_graph.py", line 986, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 600, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\framework\func_graph.py", line 973, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:

    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\keras\engine\training.py:806 train_function  *
        return step_function(self, iterator)
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\keras\engine\training.py:796 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:1211 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2585 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2945 _call_for_each_replica
        return fn(*args, **kwargs)
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\keras\engine\training.py:789 run_step  **
        outputs = model.train_step(data)
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\keras\engine\training.py:756 train_step
        _minimize(self.distribute_strategy, tape, self.optimizer, loss,
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\keras\engine\training.py:2747 _minimize
        optimizer.apply_gradients(zip(gradients, trainable_variables))
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\keras\optimizers.py:775 apply_gradients
        self.optimizer.apply_gradients(grads, global_step=self.iterations)
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\training\optimizer.py:616 apply_gradients
        update_ops.append(processor.update_op(self, grad))
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\training\optimizer.py:171 update_op
        update_op = optimizer._resource_apply_dense(g, self._v)
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\adabelief_tf\AdaBelief_tf.py:187 _resource_apply_dense
        beta1_power = math_ops.cast(self._get_non_slot_variable("beta1_power", graph=graph), grad.dtype.base_dtype)
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\util\dispatch.py:201 wrapper
        return target(*args, **kwargs)
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\ops\math_ops.py:920 cast
        x = ops.convert_to_tensor(x, name="x")
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\framework\ops.py:1499 convert_to_tensor
        ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\framework\constant_op.py:338 _constant_tensor_conversion_function
        return constant(v, dtype=dtype, name=name)
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\framework\constant_op.py:263 constant
        return _constant_impl(value, dtype, shape, name, verify_shape=False,
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\framework\constant_op.py:280 _constant_impl
        tensor_util.make_tensor_proto(
    C:\Users\Ben\PycharmProjects\tradingbot\venv\lib\site-packages\tensorflow\python\framework\tensor_util.py:444 make_tensor_proto
        raise ValueError("None values not supported.")

    ValueError: None values not supported.

0.1.0 changes for ranger_adabelief

Hi @juntang-zhuang , super excited to try the new improvements.
I saw that you did not updated the ranger version. Do you plan to add the improvements there too?

Some questions related to import adabelief

Sorry to disturb you, I wonder how to utilize the pytorch version of adabelief.

Here is my error. Thanks a lot.

AttributeError: 'AdaBeliefOptimizer' object has no attribute '_set_hyper'

I'm facing the problem with AdaBeliefOptimizer

AttributeError: 'AdaBeliefOptimizer' object has no attribute '_set_hyper'

optimizer = AdaBeliefOptimizer(learning_rate=1e-3, epsilon=1e-14, rectify=False)

Should this work with Mixed precision training (AMP)

Hi just a question is this optimizer compatible with Mixed precision training or AMP. I tried to use in in combination with lucidrains' lightweight-gan implementation which uses the PyTorch version of this optimizer. But after a few 100 iterations my losses go to NaN and eventually causes a Division by Zero error. Don't see the same problem with using the standard adam optimizer

Inconsistent use of epsilon

Hello, I noticed an inconsistency in the paper with the epsilon parameter. In the main-text,

whereas in the supplementary materials:

The two are not equivalent since in the first case, eps is accumulated in each iteration in s_t, which bias the estimate of the variance by (number of iterations) * epsilon, if I'm not mistaken.

Is this a typo? If so, what is the "correct" version and which one is implemented in the repo?
Thanks!

Tensorflow restoration issue

Hi,

I've installed "adabelief-tf==0.2.0" through conda. And the tensorflow version I'm using is 2.8.0 (tensorflow==2.8.0). I'm facing below restoration issue while trying to try the model on test dataset - the checkpoint was created during training -

ValueError: Unknown optimizer: AdaBeliefOptimizer. Please ensure this object is passed to the custom_objects argument.

I'm using it below way:

lr_decayed_fn = tf.keras.optimizers.schedules.CosineDecayRestarts(initial_learning_rate=1e-3, first_decay_steps=200)

model.compile(loss='binary_crossentropy', 
              optimizer=AdaBeliefOptimizer(learning_rate=lr_decayed_fn, epsilon=1e-14, rectify=False),
              metrics=['accuracy', tf.keras.metrics.AUC(name="AUC")],
              )

I want to explore CosineDecayRestarts and AdaBeliefOptimizer together. Is there something wrong with my usage?

UserWarning: This overload of add_ is deprecated

With torch==1.5.0 the following user warning is displayed when using AdaBelief -

/pytorch/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of add_ is deprecated:
        add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
        add_(Tensor other, *, Number alpha)

Inconsistent computation of weight_decay and grad_residual among pytorch versions

Hi
I was looking at the various versions you have in the pypi_packages folder and noticed that the order of computation of weight decay (which for some options modifies grad) and of grad_residual (which uses grad) differs for the different versions. In adabelief_pytorch0.0.5, adabelief_pytorch0.2.0, and adabelief_pytorch0.2.1 weight decay is done before computing grad_residual but in adabelief_pytorch0.1.0 it is done afterwards. It seems that adabelief_pytorch0.1.0 is more closely following what your paper described as the second-order momentum computation. Shouldn't the others be changes to align with adabelief_pytorch0.1.0?

keyerror exp_avg_var

pytorch 1.6
keyerror exp_avg_var

weight_decouple in adabelief tf

Hi, I am a bit confused, it says that the weight-decouple is supported but not an option. Does it mean it is using it by default? If not how can I turn it on?

RangerAdaBelief setstate

Encountered some problem when I tried to resume previous training with RangerAdaBelief optimizer.
Should this line be RangerAdaBelief instead of Ranger?

Question: How similar or dissimilar is this compared to Hypergradient Descent?

Congratulations on the new optimizer, excited to try it out! In our fastai discussion group it was brought up that your optimizer seems similar to HGD. Would you be able to summarize the key differences between the two for me/us? :) Thank you in advance!

https://github.com/gbaydin/hypergradient-descent

scripts for the toy examples?

Would you please share the scripts to produce the results of the toy examples and to generate the gif files?
Thanks!

Please add a license

The license should probably be BSD, same as PyTorch, since this uses PyTorch code.

raw results

Thank you for your great work, can you provided the raw results of the experiments, thank you very much.

MSVAG

Adabelief-Optimizer/PyTorch_Experiments/classification_cifar10/optimizers/MSVAG.py

Line 103 in b0e4551

v = exp_avg_sq / bias_correction1

Should this be bias_correction2 in all MSVAG?

degenerated_to_sgd hyperparameter -- background and recommendations?

Hello and great work! I was wondering about the "degenerated_to_sgd" hyperparameter. Can you explain the background behind it and maybe provide a paper about it if there is one? Also, would you say the recommendations on when to use it are similar to rectify? If not, when do you think it should be used (beneficial all the time or only some of the time)?

FileNotFoundError for ImageNet

Hi,
I am trying to reproduce the results for ImageNet. But I am getting the following error message when executing the run.sh file:
FileNotFoundError: [Errno 2] No such file or directory: '/media/juntang/Samsung_T5/ImageNet/train'
How can I resolve this issue?

Different usage of eps between "A quick look at the algorithm" and the code

I have a question.

In "A quick look at the algorithm" in README.md, eps is added to shat_t.
But in the code, eps is added to s_t(exp_avg_var) instead of shat_t.

Also only the code for pytorch, if amsgrad is True, eps is added to max_exp_avg_var instead of exp_arg_var(s_t) or shat_t.

Which behavior is correct?

pytorch code
tensorflow code

On imagenet accuracy result 70.08

Hi, congrats on the nice work. But I have a problem in achieving your claimed accuracy result 70.08 of the ImageNet experiment in the paper. My run on my machine using your code with your parameter setting is 69.32.

Could you please provide (link to) your model checkpoint file, or is there any other tricks in training? Thanks .

Compatibility with warmup

I use a LR scheduler to configure a warmup (Lr linearly increasing from very small value to it's real (=from args) value for 500 iter).
Will this confuse adabelief or it's okay ?

Epsilon is important to Adaptive Optimizer

Hi~
#18 (comment)
Since I asked you question last time, I've done a series of experiments. I think both methods of determining the step size of the descent are plausible, whether based on the variance of the gradient or the square of the gradient. I found that if epsilon's position is changed, the result similar to adabelief can be achieved. I did some experiments and analysis and put it in https://github.com/yuanwei2019/EAdam-optimizer

Upgrade with Adas optimizer

What do you think about merge Adabelief with Adas (https://github.com/YanaiEliyahu/AdasOptimizer)?
Or do they conflict?

Is extra epsilon more important than belief?

Hello,

congratulations on being accepted to NeurIPS and thank you for sharing the code.
I'm enjoying playing with this code.

I found that the arXiv paper has been updated from v1 to v2.
In v2, the extra epsilon in the bias correlation has been added.

I removed the extra epsilon from this code to investigate the effects of "belief" only.

https://gist.github.com/yasutoshi/39f1b74af9bc0cf504fa678917383ef8#file-adabelief_noepsilon-py-L161

As a result, Adam and AdaBelief were only about the same accuracy in the experiment on Cifar10 with ResNet.

Does this mean that the performance improvement of AdaBelief is due to the extra epsilon and not belief?

I would be grateful if you could tell me if I was wrong.

train acc:
test acc:

Matlab implementation

How about a Matlab test case?

I tried to implement a Matlab version of AdaBelief and compare it with SGD with momentum at
https://github.com/pcwhy/AdaBelief-Matlab
I found that sometimes AdaBelief is not guaranteed to converge to an optimal solution as SGD with momentum can reach.

Suppressing weight decoupling and rectification messages

Is there a way to suppress these messages by setting some parameters explicitly when they are enabled?

Weight decoupling enabled in AdaBelief
Rectification enabled in AdaBelief

I skimmed through the code and did not notice there is any parameter that we do so. I apologize if I have overlooked any part of the code/documentation. Thank you in advance for your reply.

Environment

adabelief_pytorch 0.2.1
Python 3.8.10

Debug prints in ranger-adabelief

Hey @juntang-zhuang

I'm just trying out the optimisers and noticed that ranger-adabelief has a couple of debug prints in.

Why does g_t substract m_t, instead of m_{t-1} ?

Dear authors,
Thanks for providing such a good implementation, and I benefit a lot from the repo in my experiments.
I have a question for the update of s_t in the algorithm as titled.

In my task, (g_t - m_t)^2 gives a contractive result against (g_t - m_{t-1})^2 on the choice of different betas.
Specifically, the original update (g_t - m_t)^2 suggests a greater beta2 is better (0.999 other than 0.98),
while the revised version (g_t - m_{t-1})^2 shows 0.98 is a better beta2.

Other parameters are kept the same as the default. The code version I use is pytorch-0.2.0.
To name some of them, lr=1e-3, eps=1e-16, weight_decay=0.1, weight_decoupled=True, amsgrad=False, fixed_decay=False, rectify=True.

To compare with Adam and RAdam, rectify set as False is also tested.
The contraction still occurs for the original and revised update of s_t (however, at this time the better beta2 is reversed).

I know the parameter tuning lacks much sufficient evidence to make a convincing conclusion, so I just wonder why (g_t - m_t)^2 is used?
Since (g_t - m_{t-1})^2 will compare the gradient of the current step with previous moving average, I guess it's more intuitive.

Thanks for reading my question. Wish you a good day :)

denom = (exp_avg_var.add_(group['eps']).sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])

作者你好，我发现Adabelief-Optimizer/PyTorch_Experiments/AdaBelief.py里的第157行：
‘ denom = (exp_avg_var.add_(group['eps']).sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])’
exp_avg_var.add_(eps)这样是不是每次修正偏差都会导致exp_avg_var加上一个eps，和文中的St更新公式不一样。是不是应该改成exp_avg_var.add(group['eps'])或者是使用add_实验效果好？

Documentation (at least for TF) and weight_decouple is not an option

Hiya,

In the ReadME you say that Rectify is implemented as an option but the default is True. I would update the ReadME to reflect that.

You also make it sound like weight_decouple is an available option in the TF version. But it isn't:

| AdaBeliefOptimizer(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-14, weight_decay=0.0, rectify=True, amsgrad=False, sma_threshold=5.0, total_steps=0, warmup_proportion=0.1, min_lr=0.0, name='AdaBeliefOptimizer', print_change_log=True, **kwargs)

I just get an error message when I try to set weight_decouple=True.

Great work otherwise!

torch version requirement

can AdaBelief Optimizer be used with torch==1.4.0 or earlier verison?

Results on ImageNet with tuning weight decay

I quickly run some experiments on ImageNet with different weight decay rates.

Using AdamW with wd=1e-2 and setting other hyper parameters the same as reported in AdaBelief paper, the average accuracy over 3 runs is 69.73%, much better than that compared in the paper. I will keep updating results for other optimizers and weight decay rates.

The problem of reproducing the result of ImageNet

Recently I try reproducing the result in the paper. I successfully did this on cifar10 and GAN, but the test accuracy on ImageNet is nearly 69.5%, which on the paper is 70.08 %. I wonder whether I used the wrong edition of AdaBelief or the parameters in run.sh has been changed. Could I ask you for the edition of AdaBelief and the parameters if I want to reproduce the result on ImageNet (pytorch)?
The edition of AdaBelief I used is 0.2.0.

i use adabelief optimizer on fine-tune efficientb4 that acc is worse than Adam?

what is details about the experiments for cifar-100

Hi, Juntang,

The work is outstanding! If convenient, would you please tell me the details about the experiments for cifar-100? Compared with cifar-10, is there only one difference that you change the output dimension of the last linear layer from 10 to 100? Or are there some other differences?

Looking forward to your reply! Thanks for your nice work again!

Model load shows error message. ValueError: Unknown optimizer: AdaBeliefOptimizer

Dear all.
I am excited to use Adabelief. And today I installed the package and tested it in my ML training successfully.
When I load the model.h5 file to in different machine, the application keep showing error message as follows even if I installed the package whose command is 'pip3 install adabelief-tf==0.2.0' in both of machine whose OS is ubuntu 18.04 and Mac OSX.

It would be appreciated if you let me know if I am missing in your installation guide.

Best regards.

---- error message in the model loading side (Mac OS)----

2021-01-11 14:53:45.527256: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-01-11 14:53:45.539315: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fc89ebcc5b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-01-11 14:53:45.539342: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Traceback (most recent call last):
File "drive.py", line 125, in
model = load_model(args.model)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/tensorflow/python/keras/saving/save.py", line 182, in load_model
return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/tensorflow/python/keras/saving/hdf5_format.py", line 193, in load_model_from_hdf5
model.compile(**saving_utils.compile_args_from_training_config(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/tensorflow/python/keras/saving/saving_utils.py", line 211, in compile_args_from_training_config
optimizer = optimizers.deserialize(optimizer_config)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/tensorflow/python/keras/optimizers.py", line 865, in deserialize
return deserialize_keras_object(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/tensorflow/python/keras/utils/generic_utils.py", line 346, in deserialize_keras_object
(cls, cls_config) = class_and_config_for_serialized_keras_object(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/tensorflow/python/keras/utils/generic_utils.py", line 296, in class_and_config_for_serialized_keras_object
raise ValueError('Unknown ' + printable_module_name + ': ' + class_name)
ValueError: Unknown optimizer: AdaBeliefOptimizer

Unfair comparison on ImageNet?

As this post mentioned, Adabelief on ImageNet is trained with weight decay rate 1e-2, while the results reported in previous work are usually with 1e-4. Since the weight decay rate has significant effect on the test accuracy, have you conducted experiments on the same setting for Adam and its variants?

fine-tune with bert models

Have you ever tested adabelief for fine-tuning bert models? And what's the recommended hyper-parameters?

Your method is just equivalent to SGD with a changable global learning rate.

Please print any statistics of the variable 'exp_avg_var' (e.g., print(exp_avg_var.min())) for each parameter group.
You will find the adaptive learning rates for different parameters are almost the same.
Please note this line of the code
"denom = (exp_avg_var.add_(group['eps']).sqrt()/ math.sqrt(bias_correction2)).add_(group['eps'])".
The code 'add_()' operator in 'exp_avg_var.add_(group['eps'])' will change the value of the variable 'exp_avg_var'.
As a result, the value of exp_avg_var will accumulate with eps in each iteration. The values of exp_avg_var for
different parameters will constantly increase and be almost same. So actually, your method is just equivalent to SGD with a changable global learning rate.
I guess you refer to EAdam to introduce eps, which also has this problem.
It is a very severe problem. Because the 'belief' mentioned in your paper does not give any contribution to the final performance in your code.

Changing init learning rate

Does modifying the initial learning rate hurt the algorithm in any way? Wanting to use exponential decay but don't know if it would improve the performance.

Tensorflow Implementation

When I tried the optimizer with tensorflow cycle GAN, it takes lot of time to complete one step. Is it a problem regarding the use of gpu or framework, or with the optimizer itself?

Thanks in Advance

Performance vs AdamW

Hey @juntang-zhuang

My experiments with AdaBelief are going great! The downside is a 20% increase in training time per epoch, is this expected?

Similarity to AdaHessian

Hi, first of all, thank you very much for sharing the code for AdaBelief, it looks like a very promising optimizer! :) Have you considered comparing it to AdaHessian? I feel like AdaHessian is using the same trick as you (but they do it less efficiently).

Unstability in training in RNN

Hello,

Congratulations about this awesome paper and for providing the code to test it.
I´m training a small RNN network ( 2 layers of SRU (https://github.com/asappresearch/sru), 256 hidden size, CRF at end) for the NER task.

As following the Readme, I disabled the gradient clipping, and used an epsilon of 1e-12. This task converges great with Ranger, SGD and Adam. But using Adabelief I get some loss explosion randomly.

Am I doing something wrong?

accuracy: 0.8366, accuracy3: 0.8366, precision-overall: 0.0040, recall-overall: 0.0163, f1-measure-overall: 0.0065, batch_loss: 7236.0938, loss: 57461.7845 ||: : 30it [09:29, 18.99s/it]                        
accuracy: 0.9254, accuracy3: 0.9255, precision-overall: 0.1612, recall-overall: 0.2104, f1-measure-overall: 0.1825, batch_loss: 51126.7266, loss: 18637.9896 ||: : 30it [08:47, 17.60s/it]                       
accuracy: 0.9645, accuracy3: 0.9645, precision-overall: 0.3207, recall-overall: 0.4666, f1-measure-overall: 0.3801, batch_loss: 11046.6484, loss: 13583.7611 ||: : 30it [08:59, 17.99s/it]                      
accuracy: 0.9828, accuracy3: 0.9829, precision-overall: 0.6505, recall-overall: 0.7602, f1-measure-overall: 0.7011, batch_loss: 8434.5000, loss: 3932.2246 ||: : 29it [08:37, 17.86s/it]                       
accuracy: 0.9856, accuracy3: 0.9856, precision-overall: 0.7832, recall-overall: 0.8383, f1-measure-overall: 0.8098, batch_loss: 122.3125, loss: 3008.3288 ||: : 29it [09:13, 19.09s/it]                        
accuracy: 0.9930, accuracy3: 0.9930, precision-overall: 0.8261, recall-overall: 0.8861, f1-measure-overall: 0.8551, batch_loss: 2115.6699, loss: 1362.0373 ||: : 30it [08:55, 17.84s/it]                       
accuracy: 0.9948, accuracy3: 0.9948, precision-overall: 0.8893, recall-overall: 0.9243, f1-measure-overall: 0.9065, batch_loss: 1569.0469, loss: 1011.7590 ||: : 30it [08:33, 17.10s/it]                       
accuracy: 0.9972, accuracy3: 0.9972, precision-overall: 0.9367, recall-overall: 0.9571, f1-measure-overall: 0.9468, batch_loss: 591.5840, loss: 426.5681 ||: : 29it [08:58, 18.56s/it]                       
accuracy: 0.9977, accuracy3: 0.9977, precision-overall: 0.9514, recall-overall: 0.9660, f1-measure-overall: 0.9587, batch_loss: 23.7188, loss: 279.9471 ||: : 29it [08:32, 17.69s/it]                        
accuracy: 0.9977, accuracy3: 0.9977, precision-overall: 0.9501, recall-overall: 0.9627, f1-measure-overall: 0.9564, batch_loss: 93.2188, loss: 243.8314 ||: : 30it [09:16, 18.54s/it]                        
accuracy: 0.9984, accuracy3: 0.9984, precision-overall: 0.9641, recall-overall: 0.9732, f1-measure-overall: 0.9686, batch_loss: 53.5000, loss: 199.5779 ||: : 29it [08:44, 18.10s/it]                        
accuracy: 0.9984, accuracy3: 0.9984, precision-overall: 0.9702, recall-overall: 0.9789, f1-measure-overall: 0.9745, batch_loss: 52.5781, loss: 156.1823 ||: : 30it [09:14, 18.47s/it]                       
accuracy: 0.9994, accuracy3: 0.9994, precision-overall: 0.9816, recall-overall: 0.9871, f1-measure-overall: 0.9843, batch_loss: 61.4688, loss: 69.1954 ||: : 29it [09:01, 18.66s/it]                        
accuracy: 0.9990, accuracy3: 0.9990, precision-overall: 0.9813, recall-overall: 0.9858, f1-measure-overall: 0.9836, batch_loss: 29.5312, loss: 90.0869 ||: : 29it [08:51, 18.33s/it]                        
accuracy: 0.9996, accuracy3: 0.9996, precision-overall: 0.9846, recall-overall: 0.9896, f1-measure-overall: 0.9871, batch_loss: 74.0625, loss: 53.9213 ||: : 29it [08:40, 17.94s/it]                       
accuracy: 0.9995, accuracy3: 0.9995, precision-overall: 0.9822, recall-overall: 0.9868, f1-measure-overall: 0.9845, batch_loss: 33.9844, loss: 49.5508 ||: : 30it [08:35, 17.19s/it]                       
accuracy: 0.9997, accuracy3: 0.9997, precision-overall: 0.9854, recall-overall: 0.9869, f1-measure-overall: 0.9862, batch_loss: 19.3906, loss: 34.1199 ||: : 30it [09:03, 18.11s/it]                       
accuracy: 0.9995, accuracy3: 0.9995, precision-overall: 0.9938, recall-overall: 0.9950, f1-measure-overall: 0.9944, batch_loss: 709.4336, loss: 48.0945 ||: : 29it [08:38, 17.88s/it]                      
accuracy: 0.9997, accuracy3: 0.9997, precision-overall: 0.9914, recall-overall: 0.9937, f1-measure-overall: 0.9925, batch_loss: 14.9688, loss: 38.2326 ||: : 29it [08:36, 17.79s/it]                       
accuracy: 0.9996, accuracy3: 0.9996, precision-overall: 0.9852, recall-overall: 0.9894, f1-measure-overall: 0.9873, batch_loss: 79.4688, loss: 51.3397 ||: : 29it [08:55, 18.46s/it]                       
accuracy: 0.9998, accuracy3: 0.9998, precision-overall: 0.9926, recall-overall: 0.9936, f1-measure-overall: 0.9931, batch_loss: 39.0625, loss: 22.0619 ||: : 30it [09:00, 18.03s/it]                      
accuracy: 0.9997, accuracy3: 0.9997, precision-overall: 0.9915, recall-overall: 0.9937, f1-measure-overall: 0.9926, batch_loss: 16.9062, loss: 33.6324 ||: : 30it [09:32, 19.07s/it]                       
accuracy: 0.9997, accuracy3: 0.9997, precision-overall: 0.9939, recall-overall: 0.9947, f1-measure-overall: 0.9943, batch_loss: 0.7812, loss: 27.4840 ||: : 30it [09:13, 18.44s/it]                        
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9951, recall-overall: 0.9959, f1-measure-overall: 0.9955, batch_loss: 27.0786, loss: 15.0342 ||: : 29it [09:08, 18.92s/it]                      
accuracy: 0.9996, accuracy3: 0.9996, precision-overall: 0.9938, recall-overall: 0.9963, f1-measure-overall: 0.9951, batch_loss: 7.7500, loss: 25.8246 ||: : 29it [09:00, 18.63s/it]                       
accuracy: 0.9998, accuracy3: 0.9998, precision-overall: 0.9957, recall-overall: 0.9966, f1-measure-overall: 0.9961, batch_loss: 27.6875, loss: 17.3096 ||: : 30it [08:47, 17.58s/it]                      
accuracy: 0.9997, accuracy3: 0.9997, precision-overall: 0.9949, recall-overall: 0.9968, f1-measure-overall: 0.9958, batch_loss: 35.4727, loss: 26.2837 ||: : 29it [08:24, 17.40s/it]                      
accuracy: 0.9977, accuracy3: 0.9977, precision-overall: 0.9501, recall-overall: 0.9627, f1-measure-overall: 0.9564, batch_loss: 93.2188, loss: 243.8314 ||: : 30it [09:16, 18.54s/it]
accuracy: 0.9984, accuracy3: 0.9984, precision-overall: 0.9641, recall-overall: 0.9732, f1-measure-overall: 0.9686, batch_loss: 53.5000, loss: 199.5779 ||: : 29it [08:44, 18.10s/it]
accuracy: 0.9984, accuracy3: 0.9984, precision-overall: 0.9702, recall-overall: 0.9789, f1-measure-overall: 0.9745, batch_loss: 52.5781, loss: 156.1823 ||: : 30it [09:14, 18.47s/it]
accuracy: 0.9994, accuracy3: 0.9994, precision-overall: 0.9816, recall-overall: 0.9871, f1-measure-overall: 0.9843, batch_loss: 61.4688, loss: 69.1954 ||: : 29it [09:01, 18.66s/it]
accuracy: 0.9990, accuracy3: 0.9990, precision-overall: 0.9813, recall-overall: 0.9858, f1-measure-overall: 0.9836, batch_loss: 29.5312, loss: 90.0869 ||: : 29it [08:51, 18.33s/it]
accuracy: 0.9996, accuracy3: 0.9996, precision-overall: 0.9846, recall-overall: 0.9896, f1-measure-overall: 0.9871, batch_loss: 74.0625, loss: 53.9213 ||: : 29it [08:40, 17.94s/it]
accuracy: 0.9995, accuracy3: 0.9995, precision-overall: 0.9822, recall-overall: 0.9868, f1-measure-overall: 0.9845, batch_loss: 33.9844, loss: 49.5508 ||: : 30it [08:35, 17.19s/it]
accuracy: 0.9997, accuracy3: 0.9997, precision-overall: 0.9854, recall-overall: 0.9869, f1-measure-overall: 0.9862, batch_loss: 19.3906, loss: 34.1199 ||: : 30it [09:03, 18.11s/it]
accuracy: 0.9995, accuracy3: 0.9995, precision-overall: 0.9938, recall-overall: 0.9950, f1-measure-overall: 0.9944, batch_loss: 709.4336, loss: 48.0945 ||: : 29it [08:38, 17.88s/it]
accuracy: 0.9997, accuracy3: 0.9997, precision-overall: 0.9914, recall-overall: 0.9937, f1-measure-overall: 0.9925, batch_loss: 14.9688, loss: 38.2326 ||: : 29it [08:36, 17.79s/it]
accuracy: 0.9996, accuracy3: 0.9996, precision-overall: 0.9852, recall-overall: 0.9894, f1-measure-overall: 0.9873, batch_loss: 79.4688, loss: 51.3397 ||: : 29it [08:55, 18.46s/it]
accuracy: 0.9998, accuracy3: 0.9998, precision-overall: 0.9926, recall-overall: 0.9936, f1-measure-overall: 0.9931, batch_loss: 39.0625, loss: 22.0619 ||: : 30it [09:00, 18.03s/it]
accuracy: 0.9997, accuracy3: 0.9997, precision-overall: 0.9915, recall-overall: 0.9937, f1-measure-overall: 0.9926, batch_loss: 16.9062, loss: 33.6324 ||: : 30it [09:32, 19.07s/it]
accuracy: 0.9997, accuracy3: 0.9997, precision-overall: 0.9939, recall-overall: 0.9947, f1-measure-overall: 0.9943, batch_loss: 0.7812, loss: 27.4840 ||: : 30it [09:13, 18.44s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9951, recall-overall: 0.9959, f1-measure-overall: 0.9955, batch_loss: 27.0786, loss: 15.0342 ||: : 29it [09:08, 18.92s/it]
accuracy: 0.9996, accuracy3: 0.9996, precision-overall: 0.9938, recall-overall: 0.9963, f1-measure-overall: 0.9951, batch_loss: 7.7500, loss: 25.8246 ||: : 29it [09:00, 18.63s/it]
accuracy: 0.9998, accuracy3: 0.9998, precision-overall: 0.9957, recall-overall: 0.9966, f1-measure-overall: 0.9961, batch_loss: 27.6875, loss: 17.3096 ||: : 30it [08:47, 17.58s/it]
accuracy: 0.9997, accuracy3: 0.9997, precision-overall: 0.9949, recall-overall: 0.9968, f1-measure-overall: 0.9958, batch_loss: 35.4727, loss: 26.2837 ||: : 29it [08:24, 17.40s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9968, recall-overall: 0.9975, f1-measure-overall: 0.9972, batch_loss: 40.9062, loss: 13.3182 ||: : 30it [09:12, 18.42s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9965, recall-overall: 0.9979, f1-measure-overall: 0.9972, batch_loss: 0.5000, loss: 8.9580 ||: : 29it [08:27, 17.51s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9973, recall-overall: 0.9978, f1-measure-overall: 0.9976, batch_loss: 0.6250, loss: 10.6955 ||: : 29it [08:08, 16.84s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9983, recall-overall: 0.9990, f1-measure-overall: 0.9986, batch_loss: 5.4375, loss: 9.3031 ||: : 30it [08:18, 16.63s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9978, recall-overall: 0.9982, f1-measure-overall: 0.9980, batch_loss: 6.3047, loss: 6.1776 ||: : 29it [08:19, 17.22s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9977, recall-overall: 0.9980, f1-measure-overall: 0.9979, batch_loss: 0.8438, loss: 5.7469 ||: : 29it [08:14, 17.04s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9975, recall-overall: 0.9976, f1-measure-overall: 0.9976, batch_loss: 9.0176, loss: 7.7605 ||: : 30it [08:18, 16.60s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9964, recall-overall: 0.9966, f1-measure-overall: 0.9965, batch_loss: 1.8438, loss: 11.5324 ||: : 30it [08:11, 16.37s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9962, recall-overall: 0.9969, f1-measure-overall: 0.9966, batch_loss: 9.9844, loss: 12.8704 ||: : 29it [08:27, 17.51s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9980, recall-overall: 0.9988, f1-measure-overall: 0.9984, batch_loss: 3.5742, loss: 4.8728 ||: : 30it [08:36, 17.23s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9993, recall-overall: 0.9993, f1-measure-overall: 0.9993, batch_loss: 0.7031, loss: 2.8980 ||: : 30it [08:26, 16.88s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9986, recall-overall: 0.9987, f1-measure-overall: 0.9986, batch_loss: 7.0625, loss: 4.2808 ||: : 30it [08:50, 17.69s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9988, recall-overall: 0.9990, f1-measure-overall: 0.9989, batch_loss: 2.1562, loss: 4.5667 ||: : 30it [08:08, 16.28s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9987, recall-overall: 0.9990, f1-measure-overall: 0.9988, batch_loss: 15.0625, loss: 3.0480 ||: : 30it [08:36, 17.22s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9986, recall-overall: 0.9989, f1-measure-overall: 0.9987, batch_loss: 21.6094, loss: 2.7449 ||: : 30it [08:18, 16.60s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9995, recall-overall: 0.9997, f1-measure-overall: 0.9996, batch_loss: 0.7812, loss: 2.5399 ||: : 29it [08:06, 16.78s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9995, recall-overall: 0.9995, f1-measure-overall: 0.9995, batch_loss: -0.0625, loss: 2.2463 ||: : 29it [08:13, 17.03s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9992, recall-overall: 0.9993, f1-measure-overall: 0.9992, batch_loss: 2.7969, loss: 3.0429 ||: : 30it [08:21, 16.71s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9997, recall-overall: 0.9998, f1-measure-overall: 0.9997, batch_loss: 2.4316, loss: 2.3025 ||: : 30it [08:30, 17.02s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9996, recall-overall: 0.9998, f1-measure-overall: 0.9997, batch_loss: 1.3281, loss: 4.6582 ||: : 29it [08:09, 16.89s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9973, recall-overall: 0.9980, f1-measure-overall: 0.9977, batch_loss: -0.0000, loss: 4.8893 ||: : 30it [08:36, 17.23s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9956, recall-overall: 0.9976, f1-measure-overall: 0.9966, batch_loss: 0.6875, loss: 4.2254 ||: : 30it [08:21, 16.71s/it]
accuracy: 0.9999, accuracy3: 0.9999, precision-overall: 0.9980, recall-overall: 0.9981, f1-measure-overall: 0.9981, batch_loss: 0.0312, loss: 5.8634 ||: : 30it [08:10, 16.34s/it]
accuracy: 0.9984, accuracy3: 0.9984, precision-overall: 0.9787, recall-overall: 0.9515, f1-measure-overall: 0.9649, batch_loss: 22304.5000, loss: 749.8296 ||: : 30it [08:32, 17.08s/it]
accuracy: 0.9570, accuracy3: 0.9570, precision-overall: 0.2782, recall-overall: 0.4189, f1-measure-overall: 0.3343, batch_loss: 731722.4375, loss: 65948.9812 ||: : 30it [08:25, 16.85s/it]
accuracy: 0.9383, accuracy3: 0.9383, precision-overall: 0.1668, recall-overall: 0.2775, f1-measure-overall: 0.2083, batch_loss: 778091.5625, loss: 337316.9677 ||: : 29it [08:08, 16.83s/it]
Epoch    53: reducing learning rate of group 0 to 3.0000e-03.
accuracy: 0.9668, accuracy3: 0.9669, precision-overall: 0.3510, recall-overall: 0.5322, f1-measure-overall: 0.4230, batch_loss: 77123.0000, loss: 253831.3728 ||: : 29it [08:23, 17.36s/it]
accuracy: 0.9767, accuracy3: 0.9767, precision-overall: 0.4897, recall-overall: 0.6151, f1-measure-overall: 0.5453, batch_loss: -1.0000, loss: 137048.0448 ||: : 30it [08:35, 17.19s/it]
accuracy: 0.9839, accuracy3: 0.9839, precision-overall: 0.6340, recall-overall: 0.7326, f1-measure-overall: 0.6798, batch_loss: 43615.0000, loss: 103847.1062 ||:  19%|#8        | 5/27 [01:36<07:03, 19.27s/it]

Question about SGD optimizer in LSTM experiments

Hi Juntang,

Nice work indeed! The codes are quite well-written! May I ask two questions regarding SGD optimizer in LSTM experiments please?

(1) In the experiments, is there any specific reason to switch SGD optimizer to ASGD optimizer? I did not catch any related information in your paper about that.

(2) Should you use the validation dataset instead of test dataset when deciding if to switch to ASGD?

Thanks for your precious time.

Best,

support for tensorflow 1.10+

will adabelief support for tensorflow 1.10+

recommended experiments

Hi,

There is an obvious question, i think it would be nice to address it in the final presentation, paper, etc.
In the "A quick look at the algorithm",
The "belief" part of Adabelief comes from the g_t^2 -> (g_t - m_t)^2 modification.
However, m_t could contain quite a large part of g_t, depending on the momentum weight (beta_1).
Wouldn't be more effective to use the m_t-1 value?
In most cases, with large momentum, the difference is probably marginal, but there are
three obvious outcomes, and it would improve the paper to identify which one applies:
The effect of using m_t-1 is
1, marginal
2, makes it more effective
3, makes it less effective

I think it is trivial to run an experiment if you already have the pipeline for the paper.

As another improvement, it would be nice to compare a few different beta_2 values.
The momentum for the s/v term (0.999) is quite a high default. Since Adabelief
scales in a more smart way than Adam, maybe using a smaller beta_2 makes it
reacting/adapting faster, than Adam. E.g. plotting some demos with 0.999, 0.99, 0.95 would be nice.
My theory is that Adabelief would be even more effective with smaller beta_2 (a.k.a. the optimal beta_2
is not the same for Adam and Adabelief).

Imagenette baseline for AdaBelief

As we have discussed earlier, for 5-epoch Imagenette training, I am not achieving better results with AdaBelief/RangerAdaBelief compared to Ranger, Adam, or SGD. I have attached two notebooks with my baselines. Even after playing around with LR schedule, eps, wd, etc., I still wasn't able to reach similar performance to Ranger (80% vs. 83%) for 5-epoch run on Imagenette. Any tips to improve AdaBelief performance?

imagenette_baseline.ipynb
imagenette_adabelief.ipynb

issues on AdaBlief-tensorflow

HI!
I had some trouble using Adambelief in a simple lstm training.
What could be the reason for this？
CODE：
from adabelief_tf import AdaBeliefOptimizer
tf.keras.backend.clear_session()
multivariate_lstmA = tf.keras.models.Sequential([
LSTM(100, input_shape=input_shape,
return_sequences=True),
Flatten(),
Dense(200, activation='relu'),
Dropout(0.1),
Dense(1)
])
model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
'multivariate_lstmA.h5', monitor=('val_loss'), save_best_only=True)
optimizer = AdaBeliefOptimizer(learning_rate=1e-3, epsilon=1e-14, rectify=False)
multivariate_lstmA.compile(loss=loss,
optimizer=optimizer,
metrics=metric)

RESULT：
Please check your arguments if you have upgraded adabelief-tf from version 0.0.1.
Modifications to default arguments:
eps weight_decouple rectify

adabelief-tf=0.0.1 1e-08 Not supported Not supported
Current version (0.1.0) 1e-14 supported default: True
For a complete table of recommended hyperparameters, see
https://github.com/juntang-zhuang/Adabelief-Optimizer