Comments (11)
I ran experiments with Adam and RAdam on ResNet-18. I decoupled the weight decay for both, so they are effectively AdamW and RAdamW. The lr schedule is the same as in AdaBelief: decay by 0.1 at epochs 70 and 80, with 90 epochs of training in total. The implementation is from this repo.
Here are the results (3 runs for each experiment):
| method | wd=1e-2 | wd=1e-4 |
|---|---|---|
| AdamW | 69.73 | 67.57 |
| RAdamW | 69.80 | 67.68 |
I think these results suggest that the ImageNet baselines need to be updated to use the same weight decay of 1e-2.
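For reference, here is a minimal PyTorch sketch of the setup described above (my own reconstruction, not the script from the linked repo; the base lr of 1e-3 is a placeholder assumption, not a value stated in this thread, and RAdamW would swap in a decoupled RAdam implementation):

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import MultiStepLR
from torchvision.models import resnet18

model = resnet18(num_classes=1000)

# Decoupled weight decay (AdamW); compare wd=1e-2 vs. wd=1e-4 as in the table above.
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Same schedule as the AdaBelief ImageNet setup: decay lr by 0.1 at epochs 70 and 80.
scheduler = MultiStepLR(optimizer, milestones=[70, 80], gamma=0.1)

for epoch in range(90):
    # ... train one epoch on ImageNet, then step the schedule ...
    scheduler.step()
```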
It seems the effect of weight decay dominates the effect of the optimizer in this case. What learning rate schedule did you use? Does that influence the results?
I used the same lr schedule: decay by 0.1 at epochs 70 and 80.
Thanks for your feedback. Just curious, what hardware did you use? I'm quite surprised that you could finish 3 runs within 12 hours (judging from the time of your earliest post on weight decay here). Typically one round of ImageNet training takes me 3 to 4 days with 4 GPUs.
I ran with 8 V100s (on AWS), and it took around 10 hours to complete the 90-epoch training.
One note that might be useful for you: CPU memory is sometimes the bottleneck for ImageNet experiments, since the dataset is very large.
Thanks for the suggestions and experiments. That might be the reason; I feel quite stuck experimenting with my 1080 GPU.
It surprises me that RAdam does not outperform Adam, since RAdam uses decoupled weight decay. Do you have any results for AdamW with a larger weight decay? Based on your results, I somewhat doubt that decoupled weight decay is actually helpful. BTW, is the result reported in the Apollo paper achieved by Apollo or ApolloW?
Oh, sorry for the confusion. In my results here, Adam is actually AdamW. Without decoupling the weight decay, Adam works significantly worse than AdamW.
For the results in the Apollo paper, I did not decouple the weight decay for Apollo. I tried ApolloW, but its performance is similar to Apollo's.
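To make the coupled vs. decoupled distinction concrete, here is a schematic per-parameter sketch of the two update styles (my own simplification with bias correction omitted; not the exact code of Adam/AdamW in any of the repos discussed here):

```python
import torch

def adam_l2_step(p, grad, m, v, lr=1e-3, wd=1e-2, betas=(0.9, 0.999), eps=1e-8):
    """Adam with coupled L2 regularization: the decay term is folded into the
    gradient, so it is also rescaled by the adaptive denominator sqrt(v) + eps."""
    grad = grad + wd * p                              # L2 penalty enters the gradient
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])   # first moment
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])  # second moment
    p.sub_(lr * m / (v.sqrt() + eps))                 # decay is divided by sqrt(v)

def adamw_step(p, grad, m, v, lr=1e-3, wd=1e-2, betas=(0.9, 0.999), eps=1e-8):
    """AdamW (decoupled): weight decay is applied directly to the weights and
    never passes through the adaptive rescaling."""
    p.mul_(1 - lr * wd)                               # decoupled weight decay
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    p.sub_(lr * m / (v.sqrt() + eps))

# toy usage with a single parameter tensor
p = torch.randn(10)
grad = torch.randn(10)
m, v = torch.zeros(10), torch.zeros(10)
adamw_step(p, grad, m, v)
```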
Thanks a lot. I think your results suggest that the weight decay is not properly set for the AdamW family, and the baselines need to be improved.
Looking at the literature, I found something strange: [1] also uses AdamW with weight decay set to 5e-2, which is also a large value, yet they achieve only 67.93. Although the authors claim they performed a grid search, I'm not sure whether their grid included 1e-2 as you used here. I'll take a more careful look later to see whether some training details differ from yours.
BTW, regarding Apollo: is it because it's scale-variant, so that its weight decay behaves like a decoupled weight decay, as in SGD? Any idea why Apollo is not affected much by decoupled weight decay?
[1] Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks
I also tried wd=5e-2 for AdamW, and the results are even slightly better than with wd=1e-2. So I guess the models were not properly trained in [1].
For Apollo and SGD, I think one possible reason decoupled weight decay is not so influential is that they do not use the second moment of the gradient. There is a new ICLR 2021 submission about stable weight decay in Adam; maybe we can get some ideas from it :-)
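One way to see why optimizers without the adaptive second-moment term are less sensitive to this choice (my own summary of the standard decoupled-weight-decay argument, not a claim made in this thread): for plain SGD without momentum, the L2 penalty and decoupled decay produce the same update, while for Adam the coupled penalty is rescaled elementwise by the second-moment denominator, so coordinates with a large gradient history are effectively regularized less:

```latex
% SGD without momentum: coupled L2 and decoupled decay coincide
\theta_{t+1} = \theta_t - \eta\,(g_t + \lambda\theta_t)
             = \theta_t - \eta g_t - \eta\lambda\theta_t

% Adam with coupled L2 (schematic): the penalty term is divided by \sqrt{v_t},
% so coordinates with large v_t receive a weaker effective decay
\theta_{t+1} \approx \theta_t - \eta\,\frac{m_t + \lambda\theta_t}{\sqrt{v_t} + \epsilon}
```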
> Thanks for your feedback. Just curious, what hardware did you use? I'm quite surprised that you could finish 3 runs within 12 hours (judging from the time of your earliest post on weight decay here). Typically one round of ImageNet training takes me 3 to 4 days with 4 GPUs.

This could be reasonable. According to this benchmark, V100s are about 5x faster than 1080 Tis.