
Comments (10)

juntang-zhuang commented on June 23, 2024

Hi, thanks a lot for the comment. Please refer to the PyTorch code whenever there's a difference, because it's what I used for the experiments.

In the algorithm part, when eps is added to s_t, it eventually passes into \hat{s}_t, which is used in the denominator of the update. I would take the code as correct for now. (Also note that in the algorithm, eps is added directly to s_t before bias correction, which matches the code. It might be worth thinking more carefully about whether eps should be added before or after bias correction.)

Similarly for max_exp_avg_var: eps is added to max_exp_avg_var because it is the denominator, though adding it to exp_avg_var (s_t) would also eventually pass into the denominator (I have not tested that yet). Also note that in our experiments amsgrad never helped AdaBelief, so I never carefully tested amsgrad in the following experiments.
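
To make the distinction concrete, below is a minimal sketch of the two eps placements (my own simplified illustration, not the actual adabelief-pytorch source; the exp_avg update, weight decay, and parameter step are omitted, and s plays the role of exp_avg_var, i.e. s_t):

import torch

def step_eps_in_state(s, m, g, beta2=0.999, eps=1e-16, t=1):
    # Variant matching the PyTorch code: eps is folded into s_t itself,
    # so it is carried into the s_{t-1} used by the next step.
    s = beta2 * s + (1 - beta2) * (g - m) ** 2 + eps
    denom = torch.sqrt(s / (1 - beta2 ** t)) + eps
    return s, denom

def step_eps_in_denominator(s, m, g, beta2=0.999, eps=1e-16, t=1):
    # Variant matching the algorithm box as written: eps only enters the
    # bias-corrected \hat{s}_t used in the denominator; the stored s_t
    # is untouched.
    s = beta2 * s + (1 - beta2) * (g - m) ** 2
    denom = torch.sqrt(s / (1 - beta2 ** t) + eps)
    return s, denom

# e.g. with a constant gradient of 1 and m = 0:
s, denom = step_eps_in_state(torch.zeros([]), torch.zeros([]), torch.ones([]))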

PS: the TensorFlow implementation is different from the PyTorch one, even for Adam. That's why they use different default values of eps.


tatsuhiko-inoue commented on June 23, 2024

Thank you for your reply.

My point is about whether eps is added to the s_{t-1} used in the next update, not about bias correction.

In "A quick look at the algorithm", eps is added to \hat{s}_t instead of s_t, so eps has not been added to the s_{t-1} used in the next update.
In contrast, in the PyTorch code, eps is added to exp_avg_var with the Tensor#add_ method (the in-place version).
Therefore, eps is carried into the exp_avg_var used in the next update.

Here is a script that sets eps to inf and checks whether eps has been added to exp_avg_*:

import sys
import torch
import math
from torch.optim import Adam
from adabelief_pytorch import AdaBelief

def test(opt_class, state_name, amsgrad=False):
    # A single scalar parameter with a constant gradient of 1.
    param = torch.nn.Parameter(torch.ones([]))
    # eps=inf makes any state that eps is added to blow up to inf,
    # so the in-place addition is easy to spot. beta1=0 keeps the
    # test simple (exp_avg equals the gradient).
    optim = opt_class([param], lr=1, betas=(0.0, 0.999), eps=math.inf, amsgrad=amsgrad)

    print(f"with {opt_class.__name__}, amsgrad = {amsgrad}", file=sys.stderr)
    for i in range(3):
        param.grad = torch.ones([])
        optim.step()

        # Inspect the optimizer's internal state after each step.
        state = optim.state[param]
        msg = f"grad = {param.grad}, {state_name} = {state[state_name]}"
        if amsgrad:
            msg += f", max_{state_name} = {state['max_'+state_name]}"
        print(msg, file=sys.stderr)

test(Adam,      'exp_avg_sq',  amsgrad=False)
test(AdaBelief, 'exp_avg_var', amsgrad=False)

test(Adam,      'exp_avg_sq',  amsgrad=True)
test(AdaBelief, 'exp_avg_var', amsgrad=True)

The stderr output is shown below.

with Adam, amsgrad = False
grad = 1.0, exp_avg_sq = 0.0010000000474974513
grad = 1.0, exp_avg_sq = 0.0019990000873804092
grad = 1.0, exp_avg_sq = 0.0029970011673867702
with AdaBelief, amsgrad = False
grad = 1.0, exp_avg_var = inf
grad = 1.0, exp_avg_var = inf
grad = 1.0, exp_avg_var = inf
with Adam, amsgrad = True
grad = 1.0, exp_avg_sq = 0.0010000000474974513, max_exp_avg_sq = 0.0010000000474974513
grad = 1.0, exp_avg_sq = 0.0019990000873804092, max_exp_avg_sq = 0.0019990000873804092
grad = 1.0, exp_avg_sq = 0.0029970011673867702, max_exp_avg_sq = 0.0029970011673867702
with AdaBelief, amsgrad = True
grad = 1.0, exp_avg_var = 0.0, max_exp_avg_var = inf
grad = 1.0, exp_avg_var = 0.0, max_exp_avg_var = inf
grad = 1.0, exp_avg_var = 0.0, max_exp_avg_var = inf

eps has been added to the following states:

  • exp_avg_var in AdaBelief(amsgrad=False)
  • max_exp_avg_var in AdaBelief(amsgrad=True)

No eps has been added to the following states:

  • exp_avg_sq, max_exp_avg_sq in Adam
  • exp_avg_var in AdaBelief(amsgrad=True)

I used the following versions.

% pip freeze | grep torch==
adabelief-pytorch==0.1.0
torch==1.7.0

PS: I didn't know that the Adam implementation in TensorFlow is different from the one in PyTorch. Thank you.


juntang-zhuang commented on June 23, 2024

Thanks for the clarification. Please take the code with the in-place operation as the correct one, since I have not tested the non-in-place version. This does not affect the theoretical analysis, though, since in theory we only assume the denominator is lower bounded; but I'm not sure what the difference is in practice.


tatsuhiko-inoue commented on June 23, 2024

Thank you for your reply.

Got it. I will use the in-place operation version.

However, I think amsgrad is broken in the in-place version: eps is added to max_exp_avg_var at every step, including steps where max_exp_avg_var is not updated by the current gradient, so max_exp_avg_var keeps increasing.
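
A tiny script illustrating the runaway accumulation (my own simplification of the 0.1.0 behaviour, not the actual source, with an exaggerated eps for visibility):

import torch

eps = 0.1                            # exaggerated for visibility
exp_avg_var = torch.tensor(1.0)      # held constant here for simplicity
max_exp_avg_var = torch.tensor(1.0)

for t in range(3):
    torch.max(max_exp_avg_var, exp_avg_var, out=max_exp_avg_var)
    denom = max_exp_avg_var.add_(eps).sqrt()   # in-place add_: eps sticks
    print(f"t={t}: max_exp_avg_var = {max_exp_avg_var.item():.1f}")
# prints 1.1, 1.2, 1.3: max_exp_avg_var grows every step even though
# exp_avg_var never changes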


juntang-zhuang commented on June 23, 2024

Thanks for the suggestion. This could be the reason why amsgrad never helped in my experiments; I'll try to fix it and see if the new version of amsgrad helps.


juntang-zhuang commented on June 23, 2024

@tatsuhiko-inoue We have updated the amsgrad version in adabelief-pytorch==0.2.0 and adabelief-tf==0.2.0. eps is now added to exp_avg_var with an in-place operation, and max_exp_avg_var is then taken as the element-wise maximum with exp_avg_var. It behaves similarly to the AdaBelief version without amsgrad. Hopefully this helps resolve the issue.
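
For reference, a minimal sketch of the new ordering (again my own simplification, not the actual 0.2.0 source): since exp_avg_var is decayed by beta2 every step, the eps folded into it stays bounded, and max_exp_avg_var, now a pure running maximum, no longer grows without bound:

import torch

beta2, eps = 0.9, 0.1                 # exaggerated values for visibility
g, m = torch.tensor(1.0), torch.tensor(0.0)
exp_avg_var = torch.tensor(0.0)
max_exp_avg_var = torch.tensor(0.0)

for t in range(200):
    exp_avg_var.mul_(beta2).add_((1 - beta2) * (g - m) ** 2).add_(eps)
    torch.max(max_exp_avg_var, exp_avg_var, out=max_exp_avg_var)
    denom = max_exp_avg_var.sqrt()    # no in-place add_ on max_exp_avg_var

print(max_exp_avg_var)  # converges to 1 + eps/(1-beta2) = 2.0: bounded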


tatsuhiko-inoue commented on June 23, 2024

I confirmed this fix. Thank you!

Will you change "A quick look at the algorithm" in README.md to match the PyTorch code?


juntang-zhuang commented on June 23, 2024

Yeah, we will fix the README and the paper.


juntang-zhuang commented on June 23, 2024

Fixed.


tatsuhiko-inoue commented on June 23, 2024

I confirmed the updated README.md. Thank you!

