Comments (9)
Little update: I'm definitely getting some interesting results with the orthogonal init. At first glance it seems less finicky and more stable than kaiming init. I'm going to turn the experiment above into an optuna optimization problem to find the optimum gain, since doing it manually takes too long and isn't exact.
With the orthogonal gain hyperparameter, too small a gain means vanishing gradients and too large means exploding, so with optuna I can narrow down the best gain, which should allow for pretty deep propagation through a lot of layers. To really see the difference between orthogonal and kaiming, though, I'll have to do an actual training experiment. We'll see! I'll at least post my finding for the best orthogonal gain setting for mish here shortly so others can experiment with that value. I'll also run kaiming through optuna to narrow that hyperparameter down more exactly than 0.0003.
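A minimal sketch of what that optuna gain search could look like, assuming a toy variance-propagation objective; the width, depth, search range, and function names are my own placeholders, not values from this thread:

```python
import torch
import torch.nn.functional as F

def propagation_score(gain, depth=50, width=256, batch=1024):
    """How far the activation std drifts from 1 after `depth` Mish layers."""
    torch.manual_seed(0)
    x = torch.randn(batch, width)
    for _ in range(depth):
        w = torch.empty(width, width)
        torch.nn.init.orthogonal_(w, gain=gain)  # the gain under test
        x = F.mish(x @ w.t())
    return abs(x.std().item() - 1.0)  # 0 = perfect variance preservation

def tune_gain(n_trials=50):
    import optuna  # assumed installed: pip install optuna

    def objective(trial):
        gain = trial.suggest_float("gain", 0.5, 3.0, log=True)
        return propagation_score(gain)

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params["gain"]

# usage: best_gain = tune_gain()
```

A too-small gain collapses the std toward 0 (score near 1), so minimizing the drift pushes the search toward a variance-preserving gain.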
from mish.
@Xynonners Not sure, I ended up using silu for quality as well as speed reasons.
Thanks for the appreciation of my work. I'm glad Mish has been working well in your projects.
This is an interesting observation; I haven't extensively investigated the optimal initialization schemes for Mish. I will look into this more and verify it. On a side note, have you tried orthogonal initialization before?
No I have not! But now I want to, since you've reminded me: orthogonal used to work very well for me in RL experiments with complex networks. Maybe I'll try it out with the standard deviation experiment above and see what happens. With kaiming uniform I was getting some signal up to about 75 layers.
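The kind of standard-deviation experiment described here might look roughly like the sketch below; the width, batch size, and baseline orthogonal gain are illustrative assumptions, while the a=0.0003 slope comes from the discussion in this thread:

```python
import torch
import torch.nn.functional as F

def activation_stds(init_fn, depth=75, width=512, batch=1024):
    """Track the activation std layer by layer through a deep Mish MLP."""
    torch.manual_seed(0)
    x = torch.randn(batch, width)
    stds = []
    for _ in range(depth):
        w = torch.empty(width, width)
        init_fn(w)
        x = F.mish(x @ w.t())
        stds.append(x.std().item())
    return stds

# kaiming uniform with the tiny a=0.0003 slope mentioned in this thread
kaiming = lambda w: torch.nn.init.kaiming_uniform_(
    w, a=0.0003, mode="fan_in", nonlinearity="leaky_relu")
# orthogonal with the default gain, as a baseline for comparison
ortho = lambda w: torch.nn.init.orthogonal_(w)

print("kaiming std at layer 75:   ", activation_stds(kaiming)[-1])
print("orthogonal std at layer 75:", activation_stds(ortho)[-1])
```

"Some signal" here means the std at the last layer has neither collapsed to ~0 nor blown up.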
I've only messed around with kaiming uniform and fan_in, but I'm sure the gain/slope setting would be the same for kaiming normal and fan_out, since the gain only has to do with the shape of the activation function in those. Since 0.0003 is so close to 0, I'd think any of the relu defaults would also work well for mish. I know Less Wright used nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu') in his work with mish that beat some kaggle competitions... he uses nonlinearity='relu', aka a=0. This makes me think orthogonal would at least work somewhat well with mish, since it works well with relu.
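For reference, applying that quoted init line across a model's layers might look like the helper below; the function name and module loop are my own boilerplate around the one call from the thread:

```python
import torch.nn as nn

def init_weights(model):
    """Apply kaiming_normal_(mode='fan_out', nonlinearity='relu'), the init
    quoted above, to every conv/linear layer; zero the biases."""
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
            if m.bias is not None:
                nn.init.zeros_(m.bias)

# usage:
# model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.Mish(), nn.Conv2d(16, 16, 3))
# init_weights(model)
```

With nonlinearity='relu' the `a` slope argument is ignored, which is why it is equivalent to a=0.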
Right, that's exactly why I'm interested in orthogonal initialization. Let me know if you make any progress with it.
@evanatyourservice Thanks for the update. Interesting. With orthogonal init, you need to keep the initialization on the EOC (edge of chaos), otherwise you'll get either vanishing or exploding gradients.
Keep me posted with your progress.
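One direct way to see that edge-of-chaos behavior is to measure the gradient norm at the input of a deep Mish stack for a few gains; the depth, width, and the three gain values below are arbitrary choices for illustration:

```python
import torch
import torch.nn.functional as F

def input_grad_norm(gain, depth=30, width=256, batch=64):
    """Norm of the gradient at the input of a deep orthogonal+Mish stack."""
    torch.manual_seed(0)
    x = torch.randn(batch, width, requires_grad=True)
    h = x
    for _ in range(depth):
        w = torch.empty(width, width)
        torch.nn.init.orthogonal_(w, gain=gain)
        h = F.mish(h @ w.t())
    h.sum().backward()
    return x.grad.norm().item()

for gain in (0.5, 1.5, 3.0):  # too small / closer to EOC / too large
    print(f"gain={gain}: input grad norm = {input_grad_norm(gain):.3e}")
```

A gain below the critical point drives the gradient norm toward 0 exponentially in depth, while a gain above it blows the norm up, so the usable gain sits in a narrow band between the two.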
hey there,
did you ever finish those experiments? pretty interested to know the results (too lazy to run them myself).
@Xynonners yeah, I've messed with orthogonal and mish a bit over the years now. Orthogonal has given me problems with transformers, so I usually use the 0.02 truncated normal init, with silu activation.
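For anyone wanting that recipe, the 0.02 truncated-normal init (a common BERT/GPT-style transformer convention) can be applied like this; the helper name and the per-module dispatch are my own sketch, not code from this thread:

```python
import torch.nn as nn

def init_transformer_module(module, std=0.02):
    """Truncated normal (std=0.02, clipped at +/- 2 std) for weights, zeros for biases."""
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.trunc_normal_(module.weight, mean=0.0, std=std, a=-2 * std, b=2 * std)
    if isinstance(module, nn.Linear) and module.bias is not None:
        nn.init.zeros_(module.bias)

# usage: model.apply(init_transformer_module)
```

Clipping at two standard deviations keeps occasional extreme draws from dominating early attention logits.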
oh, so you didn't find any improvements with mish? (and kaiming init)