Comments (10)
The training time should rise, because there are an extra N computations to calculate g_t - m_t, where N is the number of parameters. But I don't expect the running time to increase that much; is it because your network is small, so the computation on the data is comparable to the computation on the parameters?
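For reference, a minimal sketch of the difference between the two second-moment updates (bias correction and weight decay omitted; the function and variable names here are illustrative, not the repo's implementation):

import torch

def adam_moments(g, m, v, beta1=0.9, beta2=0.999):
    # Adam: the second moment tracks g_t^2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    return m, v

def adabelief_moments(g, m, s, beta1=0.9, beta2=0.999):
    # AdaBelief: one extra elementwise pass over all N parameters to form (g_t - m_t)
    m = beta1 * m + (1 - beta1) * g
    diff = g - m
    s = beta2 * s + (1 - beta2) * diff * diff
    return m, s

g = torch.randn(10); m = torch.zeros(10); v = torch.zeros(10); s = torch.zeros(10)
m_adam, v_adam = adam_moments(g, m, v)
m_ab, s_ab = adabelief_moments(g, m, s)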
from adabelief-optimizer.
Unfortunately there was a crossover at epoch 14 on a 20-epoch run.
from adabelief-optimizer.
Could you provide more info? What hyperparameters are you using, and did you use decoupled weight decay for AdaBelief? Also, what type of learning rate schedule? If possible, please share some code to reproduce it, so I can look into it more closely.
from adabelief-optimizer.
This run was comparing:
AdamW(model.parameters(), amsgrad=False, lr=1e-3)
AdaBelief(model.parameters(), lr=args.lr, eps=1e-12, betas=(0.9,0.999))
with the following learning rate schedule:
import numpy as np
from torch.optim.lr_scheduler import LambdaLR

# piecewise_schedule and linear_schedule are small helpers defined elsewhere in the project (not shown here)

def cosine_decay_schedule(y0, y1):
    """
    Cosine decay schedule from y0 down to y1 over t in [0, 1].
    """
    return lambda t: y1 + 0.5 * (y0 - y1) * (np.cos(t * np.pi) + 1.0)

def func_scheduler(optimizer, func, total_steps, warmup_steps=None, warmup_ratio=0.1, start_step=0):
    """
    Wrap a schedule function of the normalized step in a LambdaLR, with optional linear warmup.
    """
    if warmup_steps:
        y0 = func(0.0)
        func = piecewise_schedule(
            [warmup_steps / total_steps],
            [linear_schedule(warmup_ratio * y0, y0), func]
        )
    return LambdaLR(optimizer, (lambda step: func((step + start_step) / total_steps)))

# optimizer, args, train_loader and last_epoch come from the surrounding training script
lr_scheduler = func_scheduler(
    optimizer, cosine_decay_schedule(1.0, 0.1), args.epochs * len(train_loader),
    warmup_steps=500, start_step=last_epoch * len(train_loader)
)
I have just set off another training run with weight_decouple=True, weight_decay=0.01:
AdaBelief(model.parameters(), lr=1e-3, eps=1e-12, betas=(0.9,0.999), weight_decouple=True, weight_decay=0.01)
It takes about 2 days for 20 epochs.
You can find the project here if you are interested: https://github.com/nanoporetech/bonito
from adabelief-optimizer.
Just to confirm, no weight decay is applied to AdamW, so AdamW reduces to Adam?
from adabelief-optimizer.
Sorry it wasn't clear, I use the PyTorch default weight decay for AdamW, which is 1e-2.
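So the AdamW run above, constructed without an explicit weight_decay argument, picks up that default (a minimal illustration; model is assumed to be the bonito model already in scope):

import torch

# weight_decay is not passed, so PyTorch's default of 1e-2 applies (decoupled weight decay)
optimizer = torch.optim.AdamW(model.parameters(), amsgrad=False, lr=1e-3)
print(optimizer.defaults["weight_decay"])  # 0.01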
from adabelief-optimizer.
I see, thanks for the information. I'll try your experiment later.
I can guess two possible reasons:
One is the decoupled weight decay; this affects generalization, especially during fine-tuning.
The second is the eps. Note that we actually use m_t / (sqrt(s_t + eps) + eps), as in the detailed algorithm in Appendix A; the eps inside the sqrt dominates the eps outside the sqrt, so setting eps=1e-16 is roughly equivalent to setting eps=1e-8 for AdamW.
Not sure if the second one matters much, but if eps is too large, then s_t is dominated by eps and the update behaves more like SGD; perhaps 1e-12 is still not small enough. 1e-16 is perhaps suitable for scenarios where the update needs to be very "adaptive".
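To make the eps point concrete, here is a toy numerical comparison (the value of s below is made up purely for illustration):

import math

s = 1e-18  # a very small second-moment estimate
for eps in (1e-8, 1e-12, 1e-16):
    adabelief_denom = math.sqrt(s + eps) + eps  # eps appears inside and outside the sqrt
    adamw_denom = math.sqrt(s) + eps            # eps only outside the sqrt
    print(f"eps={eps:g}  adabelief={adabelief_denom:.3g}  adamw={adamw_denom:.3g}")

# With eps=1e-16 the AdaBelief denominator is ~1e-8, i.e. comparable to AdamW with eps=1e-8;
# with eps=1e-8 it jumps to ~1e-4, so the update becomes much less adaptive.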
I'll try your code once I finish the updates on the GAN experiments for the camera-ready version. I will also test on a Transformer and maybe RL later. Thanks for the feedback.
from adabelief-optimizer.
@iiSeymour Just to check, any update on the result comparison?
from adabelief-optimizer.
I now get very similar performance to AdamW with decoupled weight decay and eps=1e-8 or eps=1e-16 when using AdaBelief.
from adabelief-optimizer.
Thank you for the feedback! Perhaps for some specific cases, decoupled weight decay is the key factor and dominates other differences in implementation.
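For anyone comparing the two, a rough sketch of coupled (L2-style) versus decoupled (AdamW-style) weight decay in a single adaptive update step (simplified; names are illustrative):

def coupled_step(p, grad, denom, lr, wd):
    # L2-style: the decay term enters the gradient, so it is rescaled by the adaptive denominator
    grad = grad + wd * p
    return p - lr * grad / denom

def decoupled_step(p, grad, denom, lr, wd):
    # AdamW-style: the decay shrinks the weights directly, independent of the adaptive denominator
    p = p - lr * wd * p
    return p - lr * grad / denom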
from adabelief-optimizer.