
Comments (9)

evanatyourservice commented on September 27, 2024

Little update: I'm definitely getting some interesting results with the orthogonal init. At first glance it seems less finicky and more stable than kaiming init. I'm going to turn the experiment above into an Optuna optimization problem to find the optimal gain, since searching manually takes so long and isn't exact.

With the orthogonal gain hyperparameter, too small a gain means vanishing gradients and too large means exploding ones, so with Optuna I should be able to narrow down the gain that allows propagation through a lot of layers. To really see the difference between orthogonal and kaiming, though, I'll have to run an actual training experiment. We'll see! I'll at least post the best orthogonal gain setting for mish here shortly so others can experiment with the value, and I'll also run kaiming through Optuna to pin that hyperparameter down more exactly than 0.0003.
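A minimal sketch of the kind of gain sweep described above, in plain NumPy so it is self-contained (no Optuna or PyTorch needed). The width, depth, and the choice of output standard deviation after a deep forward pass as the signal-propagation metric are my assumptions, not details from the original experiment; an Optuna study would simply wrap `depth_signal_std` in an objective and search `gain` continuously.

```python
import numpy as np

def mish(x):
    # mish(x) = x * tanh(softplus(x)); logaddexp keeps softplus numerically stable
    return x * np.tanh(np.logaddexp(0.0, x))

def orthogonal(n, rng):
    # Haar-distributed orthogonal matrix via QR of a Gaussian matrix
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def depth_signal_std(gain, depth=50, width=64, batch=128, seed=0):
    # Push a random batch through `depth` orthogonal + mish layers and
    # report the output std: ~0 means vanishing, a huge value means exploding.
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((width, batch))
    for _ in range(depth):
        h = mish(gain * orthogonal(width, rng) @ h)
    return float(h.std())

if __name__ == "__main__":
    for gain in (0.5, 1.0, 1.5, 2.0, 3.0):
        print(f"gain={gain:.1f}  output std={depth_signal_std(gain):.3e}")
```

Scanning the printout for the gain whose output std stays closest to 1 gives a rough version of what the Optuna search would automate.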

from mish.

evanatyourservice commented on September 27, 2024

@Xynonners Not sure; I ended up using silu for quality as well as speed reasons.


digantamisra98 commented on September 27, 2024

Thanks for the appreciation of my work. I'm glad Mish has been working well in your projects.

This is an interesting observation; I haven't extensively investigated optimal initialization schemes for Mish. I will look into this further and verify it. Off topic, have you tried orthogonal initialization before?


evanatyourservice commented on September 27, 2024

No I have not! But now I want to, since you've reminded me -- orthogonal used to work very well for me in RL experiments with complex networks. Maybe I'll try it with the standard deviation experiment above and see what happens. With kaiming uniform I was getting some signal up to about 75 layers.

I've only messed around with kaiming uniform and fan_in, but I'm sure the gain/slope setting would be the same for kaiming normal and fan_out, since the gain depends only on the shape of the activation function. And since 0.0003 is so close to 0, I'd think any of the relu defaults would also work well for mish. I know Less Wright used nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu') in his work with mish that beat some Kaggle competitions -- nonlinearity='relu' is the same as a=0. This makes me think orthogonal would work at least somewhat well with mish, since it works well with relu.
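As a quick check of the "so close to 0" point: the gain PyTorch uses for a leaky-relu slope `a` is sqrt(2 / (1 + a^2)), and the kaiming weight std is gain / sqrt(fan). A small sketch (the fan value of 512 is just an arbitrary example) showing that a=0.0003 is numerically indistinguishable from relu's a=0:

```python
import math

def kaiming_std(fan, a=0.0):
    # Kaiming init as in PyTorch: std = gain / sqrt(fan),
    # with gain = sqrt(2 / (1 + a^2)) for leaky_relu slope a
    return math.sqrt(2.0 / (1.0 + a * a)) / math.sqrt(fan)

fan = 512  # example fan_in / fan_out
print(kaiming_std(fan, a=0.0))     # relu default
print(kaiming_std(fan, a=0.0003))  # nearly identical
```

So at least as far as the init std goes, nonlinearity='relu' and a slope of 0.0003 produce effectively the same weights.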


digantamisra98 commented on September 27, 2024

Right, that is exactly why I'm interested in orthogonal initialization. Let me know if you make any progress with it.


digantamisra98 commented on September 27, 2024

@evanatyourservice Thanks for the update. Interesting. With orthogonal initialization you need to keep the init on the EOC (edge of chaos), otherwise you'll get either vanishing or exploding gradients.
Keep me posted on your progress.


Xynonners commented on September 27, 2024

> Little update, definitely getting some interesting results with the orthogonal init. At first glance, it seems less finicky than kaiming init, more stable. I'm going to turn the experiment above into an optuna optimization problem to find the optimum gain because it takes so long manually and isn't exact.
>
> With the orthogonal gain hyper, too small gain equals vanishing gradient, too large equals exploding, so with optuna I could narrow down the best gain that should allow for pretty deep propagation through a lot of layers. To really see the difference between orthogonal and kaiming, though, I'll have to do an actual training experiment. We'll see! I'll at least update my finding for the best setting for orthogonal gain for mish here shortly so others could experiment with this value. I'll also run kaiming through optuna as well to narrow that hyper down more exactly than 0.0003.

hey there,
did you ever finish those experiments? I'm pretty interested to know the results (too lazy to run them myself).


evanatyourservice commented on September 27, 2024

@Xynonners Yeah, I've messed with orthogonal and mish a bit over the years now. Orthogonal has given me problems with transformers, so I usually use the 0.02 truncated normal init, and the silu activation.
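For reference, "the 0.02 truncated normal init" here presumably means the GPT-2-style scheme: draw weights from N(0, 0.02^2) and redraw anything beyond two standard deviations (in PyTorch, roughly `torch.nn.init.trunc_normal_(w, std=0.02)`). A NumPy sketch under that assumption, with the resampling loop and the example shape being mine:

```python
import numpy as np

def trunc_normal(shape, std=0.02, clip=2.0, seed=0):
    # Draw N(0, std^2) weights and redraw anything beyond clip*std,
    # mimicking the truncated-normal init common in transformer codebases
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(shape) * std
    out_of_range = np.abs(w) > clip * std
    while out_of_range.any():
        w[out_of_range] = rng.standard_normal(out_of_range.sum()) * std
        out_of_range = np.abs(w) > clip * std
    return w

w = trunc_normal((768, 768))  # e.g. one transformer projection matrix
```

The truncation pulls the realized std slightly below the nominal 0.02, which is usually considered acceptable for this init.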


Xynonners commented on September 27, 2024

> @Xynonners yeah I’ve messed with orthogonal and mish a bit over the years now, orthogonal has given problems with transformers so I use the 0.02 truncated normal init usually, and use silu activation

Oh, so you didn't find any improvements with mish (or with kaiming init)?

