Comments (10)
The training time should rise, because there are an extra N computations to calculate g_t - m_t, where N is the number of parameters. But I don't expect the running time to increase that much; is it because your network is small, so the computation on the data is comparable to the computation on the parameters?
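For reference, a minimal sketch of the difference between the two second-moment updates (bias correction and weight decay omitted; the function and variable names here are illustrative, not the repo's implementation):

import torch

def adam_moments(g, m, v, beta1=0.9, beta2=0.999):
    # Adam: the second moment tracks g_t^2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    return m, v

def adabelief_moments(g, m, s, beta1=0.9, beta2=0.999):
    # AdaBelief: one extra elementwise pass over all N parameters to form (g_t - m_t)
    m = beta1 * m + (1 - beta1) * g
    diff = g - m
    s = beta2 * s + (1 - beta2) * diff * diff
    return m, s

g = torch.randn(10); m = torch.zeros(10); v = torch.zeros(10); s = torch.zeros(10)
m_adam, v_adam = adam_moments(g, m, v)
m_ab, s_ab = adabelief_moments(g, m, s)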
from adabelief-optimizer.
Unfortunately there was a crossover at epoch 14 on a 20-epoch run.
from adabelief-optimizer.
Could you provide more info? What hyperparameters are you using, and did you use decoupled weight decay for AdaBelief? Also, what type of learning rate schedule? If possible, please share some code to reproduce it, so I can look into it more closely.
from adabelief-optimizer.
This run was comparing:
AdamW(model.parameters(), amsgrad=False, lr=1e-3)
AdaBelief(model.parameters(), lr=args.lr, eps=1e-12, betas=(0.9,0.999))
with the following learning rate schedule:
import numpy as np
from torch.optim.lr_scheduler import LambdaLR

# piecewise_schedule and linear_schedule are small helpers defined elsewhere in the project (not shown here)

def cosine_decay_schedule(y0, y1):
    """
    Cosine decay schedule from y0 down to y1 over t in [0, 1].
    """
    return lambda t: y1 + 0.5 * (y0 - y1) * (np.cos(t * np.pi) + 1.0)

def func_scheduler(optimizer, func, total_steps, warmup_steps=None, warmup_ratio=0.1, start_step=0):
    """
    Wrap a schedule function of the normalized step in a LambdaLR, with optional linear warmup.
    """
    if warmup_steps:
        y0 = func(0.0)
        func = piecewise_schedule(
            [warmup_steps / total_steps],
            [linear_schedule(warmup_ratio * y0, y0), func]
        )
    return LambdaLR(optimizer, (lambda step: func((step + start_step) / total_steps)))

# optimizer, args, train_loader and last_epoch come from the surrounding training script
lr_scheduler = func_scheduler(
    optimizer, cosine_decay_schedule(1.0, 0.1), args.epochs * len(train_loader),
    warmup_steps=500, start_step=last_epoch * len(train_loader)
)
I have just set off another training run with weight_decouple=True, weight_decay=0.01:
AdaBelief(model.parameters(), lr=1e-3, eps=1e-12, betas=(0.9,0.999), weight_decouple=True, weight_decay=0.01)
It takes about 2 days for 20 epochs.
You can find the project here if you are interested: https://github.com/nanoporetech/bonito
from adabelief-optimizer.
Just to confirm, no weight decay is applied to AdamW, so AdamW reduces to Adam?
from adabelief-optimizer.
Sorry it wasn't clear, I use the PyTorch default weight decay for AdamW, which is 1e-2.
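So the AdamW run above, constructed without an explicit weight_decay argument, picks up that default (a minimal illustration; model is assumed to be the bonito model already in scope):

import torch

# weight_decay is not passed, so PyTorch's default of 1e-2 applies (decoupled weight decay)
optimizer = torch.optim.AdamW(model.parameters(), amsgrad=False, lr=1e-3)
print(optimizer.defaults["weight_decay"])  # 0.01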
from adabelief-optimizer.
I see, thanks for the information. I'll try your experiment later.
I can guess two possible reasons:
One is the decoupled weight decay; this affects generalization, especially during fine-tuning.
The second is the eps. Note that we actually use m_t / (sqrt(s_t + eps) + eps), as in the detailed algorithm in Appendix A; the eps inside the sqrt dominates the eps outside the sqrt, so setting eps=1e-16 is roughly equivalent to setting eps=1e-8 for AdamW.
Not sure if the second one matters much, but if eps is too large, then s_t is dominated by eps and the update behaves more like SGD; perhaps 1e-12 is still not small enough. 1e-16 is perhaps suitable for scenarios where the update needs to be very "adaptive".
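To make the eps point concrete, here is a toy numerical comparison (the value of s below is made up purely for illustration):

import math

s = 1e-18  # a very small second-moment estimate
for eps in (1e-8, 1e-12, 1e-16):
    adabelief_denom = math.sqrt(s + eps) + eps  # eps appears inside and outside the sqrt
    adamw_denom = math.sqrt(s) + eps            # eps only outside the sqrt
    print(f"eps={eps:g}  adabelief={adabelief_denom:.3g}  adamw={adamw_denom:.3g}")

# With eps=1e-16 the AdaBelief denominator is ~1e-8, i.e. comparable to AdamW with eps=1e-8;
# with eps=1e-8 it jumps to ~1e-4, so the update becomes much less adaptive.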
I'll try your code once I finish the updates on the GAN experiments for the camera-ready version. I will also test on a Transformer and maybe RL later. Thanks for the feedback.
from adabelief-optimizer.
@iiSeymour Just to check, any update on the result comparison?
from adabelief-optimizer.
I now get very similar performance to AdamW with decoupled weight decay and eps=1e-8 or eps=1e-16 when using AdaBelief.
from adabelief-optimizer.
Thank you for the feedback! Perhaps for some specific cases, decoupled weight decay is the key factor and dominates other differences in implementation.
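For anyone comparing the two, a rough sketch of coupled (L2-style) versus decoupled (AdamW-style) weight decay in a single adaptive update step (simplified; names are illustrative):

def coupled_step(p, grad, denom, lr, wd):
    # L2-style: the decay term enters the gradient, so it is rescaled by the adaptive denominator
    grad = grad + wd * p
    return p - lr * grad / denom

def decoupled_step(p, grad, denom, lr, wd):
    # AdamW-style: the decay shrinks the weights directly, independent of the adaptive denominator
    p = p - lr * wd * p
    return p - lr * grad / denom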
from adabelief-optimizer.