Comments (6)
Hi @bratao, thanks for asking. Currently I don't have a plan for a big change to the ranger version; I will fix some small errors. I don't have much experience with the ranger version. Also, decoupled weight decay and rectification are turned on by default in ranger (these two defaults are modified in adabelief-pytorch), except that eps is set to 1e-5 in ranger. Do you have a feeling for what a good default value for eps in ranger-adabelief would be? Or any other ideas for potential improvements?
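For reference, here is a minimal sketch of constructing the optimizer with these defaults spelled out explicitly. Argument names follow the adabelief-pytorch package, but exact signatures and defaults (e.g. the eps value) may differ between releases, so treat this as illustrative:

    import torch
    from adabelief_pytorch import AdaBelief

    model = torch.nn.Linear(10, 2)  # toy model for illustration
    optimizer = AdaBelief(
        model.parameters(),
        lr=1e-3,
        eps=1e-16,             # adabelief-pytorch default; ranger uses 1e-5
        weight_decay=1e-2,
        weight_decouple=True,  # decoupled (AdamW-style) weight decay
        rectify=True,          # RAdam-style variance rectification
    )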
Hi,
I would recommend setting the default value of 'weight decoupling' to True.
Adam originally had coupled weight decay, but after this paper,
https://arxiv.org/abs/1711.05101, PyTorch (for example) introduced AdamW.
There is no supporting evidence for any case where coupled weight
decay outperforms the decoupled one, so even the PyTorch default
changed to decoupled. Actually, they did not just change the default;
they completely removed the option for the old one, making it backward incompatible.
(So yes, Adam and AdamW are now almost the same in PyTorch,
but in Adam the default decay is 0.)
If you know any counterexamples where coupling helps, please share a link;
otherwise I strongly recommend setting the default to True.
(And obviously, the same weight decay scheme should be used for comparison.)
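To make the coupled/decoupled distinction concrete, here is a minimal illustrative sketch (not PyTorch's actual implementation): with an adaptive method, coupled decay passes through the preconditioner while decoupled decay does not, which is exactly why the two behave differently in Adam-family optimizers.

    import torch

    def coupled_update(p, grad, precond, lr, wd):
        # L2-style (coupled) decay: wd * p is folded into the gradient,
        # so it gets rescaled by the adaptive preconditioner too.
        g = grad + wd * p
        return p - lr * precond(g)

    def decoupled_update(p, grad, precond, lr, wd):
        # AdamW-style (decoupled) decay: only the raw gradient is
        # preconditioned; the decay shrinks the parameters directly.
        return p - lr * precond(grad) - lr * wd * p

    # Toy usage with an Adam-like preconditioner (second-moment scaling):
    p = torch.ones(3)
    grad = torch.tensor([0.1, -0.2, 0.3])
    v = grad ** 2  # stand-in for the running second moment
    precond = lambda g: g / (v.sqrt() + 1e-8)
    print(coupled_update(p, grad, precond, lr=1e-2, wd=1e-2))
    print(decoupled_update(p, grad, precond, lr=1e-2, wd=1e-2))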
@dvolgyes Thanks for the feedback. Actually, for all the latest implementations the default for weight_decouple is True. BTW, since you mentioned that decoupled decay is enabled in Adam in PyTorch, I checked the source code and it seems Adam has not changed much; am I missing something? Could you specify where decoupled weight decay is enabled in PyTorch's Adam? Thanks a lot.
Hi,
It depends on how you see it. Check out the documentation of Adam between versions 1.5.1 and 1.6:
https://pytorch.org/docs/1.5.1/optim.html?highlight=adam#torch.optim.Adam
https://pytorch.org/docs/1.6.0/optim.html?highlight=adam#torch.optim.Adam
The old one was:
"Implements Adam algorithm.
It has been proposed in Adam: A Method for Stochastic Optimization."
The new one:
"Implements Adam algorithm.
It has been proposed in Adam: A Method for Stochastic Optimization. The implementation of the L2 penalty follows changes proposed in Decoupled Weight Decay Regularization."
I did not go into the details of how they implemented it, but I should have.
It turns out the documentation is quite likely wrong: it suggests they updated the implementation,
but they didn't, and they still use the old version without weight decoupling.
pytorch/pytorch#42843
I still hold that I haven't seen any paper where decoupled weight decay performs worse than the non-decoupled
version, but I take back my PyTorch suggestion; I was misled by the documentation.
So my current view:
- Adam uses coupled weight decay
- the Adam documentation is wrong
- weight decoupling seems to be beneficial across the board; I haven't seen counter-evidence
@dvolgyes Thanks a lot. It seems weird, since the "source code" page for Adam in PyTorch 1.6 shows

    if group['weight_decay'] != 0:
        grad = grad.add(p, alpha=group['weight_decay'])

so it's not decoupled weight decay. Quite weird; I guess the documentation page is wrong. But thanks for the suggestion; the default weight_decouple is turned on.
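For contrast, my understanding is that the decoupled variant in torch/optim/adamw.py shrinks the parameters directly before the gradient-based step, roughly like the following (paraphrased from memory, so please check the actual source):

    # Decoupled decay, roughly as in torch/optim/adamw.py (paraphrased):
    p.mul_(1 - group['lr'] * group['weight_decay'])
    # ...the moment updates then use the raw gradient, without
    # adding weight_decay * p to it.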
"Code never lies, comments sometimes do." -- Ron Jeffries
:)
Even worse, the bug was discovered before 1.7, but it may be that even 1.7.1 will be released without the fix.
Anyway, your code was always correct, but I am glad that I could convince you to change the default.