Comments (11)
I ran experiments with Adam and RAdam on ResNet-18. I decoupled the weight decay for both, so they are effectively AdamW and RAdamW. The lr schedule is the same as in AdaBelief: decay by 0.1 at epochs 70 and 80, with 90 epochs of training in total. The implementation is from this repo.
Here are the results (3 runs for each experiment):
| method | wd=1e-2 | wd=1e-4 |
|---|---|---|
| AdamW | 69.73 | 67.57 |
| RAdamW | 69.80 | 67.68 |
I think these results suggest that the ImageNet baselines need to be updated to use the same weight decay of 1e-2.
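For reference, here is a minimal PyTorch sketch of the setup described above (my own reconstruction, not the script from the linked repo; the base lr of 1e-3 is a placeholder assumption, not a value stated in this thread, and RAdamW would swap in a decoupled RAdam implementation):

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import MultiStepLR
from torchvision.models import resnet18

model = resnet18(num_classes=1000)

# Decoupled weight decay (AdamW); compare wd=1e-2 vs. wd=1e-4 as in the table above.
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Same schedule as the AdaBelief ImageNet setup: decay lr by 0.1 at epochs 70 and 80.
scheduler = MultiStepLR(optimizer, milestones=[70, 80], gamma=0.1)

for epoch in range(90):
    # ... train one epoch on ImageNet, then step the schedule ...
    scheduler.step()
```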
It seems the effect of weight decay dominates the effect of the optimizer in this case. What learning rate schedule did you use? Does that influence the results?
I used the same lr schedule: decay by 0.1 at epochs 70 and 80.
Thanks for your feedback. Just curious, what hardware did you use? I'm quite surprised that you could finish 3 runs within 12 hours (judging from the time of your earliest post on weight decay here). Typically one round of ImageNet training takes me 3 to 4 days with 4 GPUs.
I ran with 8 V100s (on AWS), and it took around 10 hours to complete the 90-epoch training.
One note that might be useful for you: CPU memory is sometimes the bottleneck for ImageNet experiments, since the dataset is very large.
Thanks for the suggestions and experiments. That might be the reason; I feel quite stuck experimenting with my 1080 GPU.
It surprises me that RAdam does not outperform Adam, since RAdam uses decoupled weight decay. Do you have any results for AdamW with a larger weight decay? Based on your results, I somewhat doubt that decoupled weight decay is actually helpful. BTW, is the result reported in the Apollo paper achieved by Apollo or ApolloW?
Oh, sorry for the confusion. In my results here, Adam is actually AdamW. Without decoupling the weight decay, Adam works significantly worse than AdamW.
For the results in the Apollo paper, I did not decouple the weight decay for Apollo. I tried ApolloW, but its performance is similar to Apollo's.
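To make the coupled vs. decoupled distinction concrete, here is a schematic per-parameter sketch of the two update styles (my own simplification with bias correction omitted; not the exact code of Adam/AdamW in any of the repos discussed here):

```python
import torch

def adam_l2_step(p, grad, m, v, lr=1e-3, wd=1e-2, betas=(0.9, 0.999), eps=1e-8):
    """Adam with coupled L2 regularization: the decay term is folded into the
    gradient, so it is also rescaled by the adaptive denominator sqrt(v) + eps."""
    grad = grad + wd * p                              # L2 penalty enters the gradient
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])   # first moment
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])  # second moment
    p.sub_(lr * m / (v.sqrt() + eps))                 # decay is divided by sqrt(v)

def adamw_step(p, grad, m, v, lr=1e-3, wd=1e-2, betas=(0.9, 0.999), eps=1e-8):
    """AdamW (decoupled): weight decay is applied directly to the weights and
    never passes through the adaptive rescaling."""
    p.mul_(1 - lr * wd)                               # decoupled weight decay
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    p.sub_(lr * m / (v.sqrt() + eps))

# toy usage with a single parameter tensor
p = torch.randn(10)
grad = torch.randn(10)
m, v = torch.zeros(10), torch.zeros(10)
adamw_step(p, grad, m, v)
```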
Thanks a lot. I think your results suggest that the weight decay is not properly set for the AdamW family, and the baselines need to be improved.
Looking at the literature, I found something strange: [1] also uses AdamW with weight decay set to 5e-2, which is also a large value, yet they achieve only 67.93. Although the authors claim they performed a grid search, I'm not sure whether their grid included 1e-2 as you used here. I'll take a more careful look later to see whether some training details differ from yours.
BTW, regarding Apollo: is it because it's scale-variant, so that its weight decay behaves like a decoupled weight decay, as in SGD? Any idea why Apollo is not affected much by decoupled weight decay?
[1] Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks
I also tried wd=5e-2 for AdamW, and the results are even slightly better than with wd=1e-2. So I guess the models were not properly trained in [1].
For Apollo and SGD, I think one possible reason decoupled weight decay is not so influential is that they do not use the second moment of the gradient. There is a new ICLR 2021 submission about stable weight decay in Adam; maybe we can get some ideas from it :-)
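One way to see why optimizers without the adaptive second-moment term are less sensitive to this choice (my own summary of the standard decoupled-weight-decay argument, not a claim made in this thread): for plain SGD without momentum, the L2 penalty and decoupled decay produce the same update, while for Adam the coupled penalty is rescaled elementwise by the second-moment denominator, so coordinates with a large gradient history are effectively regularized less:

```latex
% SGD without momentum: coupled L2 and decoupled decay coincide
\theta_{t+1} = \theta_t - \eta\,(g_t + \lambda\theta_t)
             = \theta_t - \eta g_t - \eta\lambda\theta_t

% Adam with coupled L2 (schematic): the penalty term is divided by \sqrt{v_t},
% so coordinates with large v_t receive a weaker effective decay
\theta_{t+1} \approx \theta_t - \eta\,\frac{m_t + \lambda\theta_t}{\sqrt{v_t} + \epsilon}
```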
> Thanks for your feedback. Just curious, what hardware did you use? I'm quite surprised that you could finish 3 runs within 12 hours (judging from the time of your earliest post on weight decay here). Typically one round of ImageNet training takes me 3 to 4 days with 4 GPUs.

This could be reasonable. According to this benchmark, V100s are about 5x faster than 1080 Tis.