Comments (7)
Just an update @juntang-zhuang. Your modified Ranger got the best accuracy in half the epochs compared to regular Ranger. It is already my favorite optimizer. Thank you!
from adabelief-optimizer.
Thanks for the feedback. I think you are right, and this might be caused by the fact that the update is roughly m_t / sqrt((g_t - m_t)^2), and the denominator is sometimes too small. Even if the denominator is small for only one element, that element of the update will explode. This is an issue I'm trying to fix in the next release, for example by hard-thresholding 1 / sqrt((g_t - m_t)^2) at a rather large value.
Please keep this issue open as a reminder of problems to fix for improvement.
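The small-denominator failure mode described in this comment can be sketched with a toy scalar update. This is an illustrative snippet only (the function name, hyperparameters, and the `denom_floor` threshold are assumptions for the sketch, not the repo's actual code), showing the proposed hard threshold applied as a floor on the denominator:

```python
import math

def adabelief_step(grad, m, v, beta1=0.9, beta2=0.999,
                   eps=1e-16, denom_floor=None):
    """Toy scalar AdaBelief-style update (illustrative only)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * (grad - m) ** 2  # EMA of (g_t - m_t)^2
    denom = math.sqrt(v) + eps
    if denom_floor is not None:
        denom = max(denom, denom_floor)  # hard-threshold the denominator
    return m / denom, m, v

# When the gradient tracks its EMA closely, (g_t - m_t)^2 is near zero
# and the raw per-element step blows up; the floor bounds it.
raw, _, _ = adabelief_step(0.1, 0.1, 0.0)
capped, _, _ = adabelief_step(0.1, 0.1, 0.0, denom_floor=1e-3)
print(raw, capped)  # raw is astronomically large, capped stays moderate
```

Because the threshold acts per element, it caps how far any single coordinate can move in one step without touching well-behaved coordinates.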
I noticed that the learning rate is quite large; even after reduction it's still 3e-3. Perhaps a large lr also contributes to the instability.
Wow, excited to hear that, thanks so much for trying it out.
Could be entirely unrelated to this issue, but at the first step in AdaBelief we have m_t = grad, causing v_t = 0 and step_size = step_size / epsilon_t, which seems like unintended behaviour.
Edit2:
Deleted my earlier further comments as they were not exactly correct.
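The first-step concern above can be made concrete in a couple of lines. The variable names below are illustrative (not the repo's code) and assume, as the comment describes, that m_1 ends up equal to the raw gradient at step one:

```python
import math

g1 = 0.5             # first gradient
m1 = g1              # EMA equals the gradient at step one (assumed)
v1 = (g1 - m1) ** 2  # exactly zero
eps, lr = 1e-16, 1e-3

# The denominator collapses to eps alone, so the effective step is
# lr/eps times the gradient, which is enormous for a typical eps.
step = lr * m1 / (math.sqrt(v1) + eps)
print(step)
```

This is why a tiny epsilon that is harmless in Adam (where v_1 = (1 - beta2) * g_1**2 is nonzero) can dominate the very first AdaBelief step.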
Thanks for the comment @henriyl. The detailed implementation is definitely not perfect now and might suffer from numerical issues, and you raise a very good point. For this paper, it's more of a "proof-of-idea": since the key modification of Adam(W) is only 2 lines of code, many details are not fully worked out (these details might not be a big problem for CV tasks but could be more serious in RNNs with exploding or vanishing gradients). We are working on improvements, both in implementation and in theory (personally I suspect the convergence bound in the paper is too loose). Thanks again for pointing this out.
@bratao Just an update, I might have confused "gradient clip" with "gradient threshold" before; please see the discussion in readme.md. Perhaps "gradient clip" still helps: it shrinks the vector's amplitude but keeps its direction, though it might require different clip ranges than Adam. "Gradient threshold", by contrast, is an element-wise operation that maps each element into a fixed range, with each dimension of the parameter thresholded independently; this might cause a 0 denominator.
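The clip-versus-threshold distinction drawn above can be illustrated with two small helpers (hypothetical names, not the repo's API): norm-based clipping rescales the whole gradient vector and preserves its direction, while element-wise thresholding clamps each coordinate independently and can change the direction:

```python
import math

def clip_by_norm(grad, max_norm):
    """Shrink the whole vector so its norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def threshold_elementwise(grad, limit):
    """Clamp every coordinate into [-limit, limit] independently."""
    return [max(-limit, min(limit, g)) for g in grad]

g = [3.0, 4.0]                        # norm 5, direction (0.6, 0.8)
print(clip_by_norm(g, 1.0))           # rescaled, direction preserved
print(threshold_elementwise(g, 1.0))  # clamped to (1, 1), direction changed
```

In PyTorch, `torch.nn.utils.clip_grad_norm_` performs the first kind of operation and `torch.clamp` the second.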
Related Issues (20)
- fine-tune with bert models
- Please add a license
- Upgrade with Adas optimizer
- MSVAG
- Why does g_t substract m_t, instead of m_{t-1} ?
- On imagenet accuracy result 70.08
- Documentation (at least for TF) and weight_decouple is not an option
- FileNotFoundError for ImageNet
- Changing init learning rate
- Question about SGD optimizer in LSTM experiments
- Compatibility with warmup
- Inconsistent computation of weight_decay and grad_residual among pytorch versions
- Your method is just equivalent to SGD with a changable global learning rate.
- Some questions related to import adabelief
- Tensorflow restoration issue
- weight_decouple in adabelief tf
- Inconsistent use of epsilon
- Suppressing weight decoupling and rectification messages
- The problem of reproducing the result of ImageNet
- AttributeError: 'AdaBeliefOptimizer' object has no attribute '_set_hyper'