
Comments (9)

evanatyourservice commented on September 27, 2024

Little update: I'm definitely getting some interesting results with the orthogonal init. At first glance it seems less finicky and more stable than kaiming init. I'm going to turn the experiment above into an Optuna optimization problem to find the optimal gain, since searching manually takes so long and isn't exact.

With the orthogonal gain hyperparameter, too small a gain means vanishing gradients and too large means exploding ones, so with Optuna I should be able to narrow down the gain that allows propagation through a lot of layers. To really see the difference between orthogonal and kaiming, though, I'll have to run an actual training experiment. We'll see! I'll at least post the best orthogonal gain setting for mish here shortly so others can experiment with the value, and I'll also run kaiming through Optuna to pin that hyperparameter down more exactly than 0.0003.
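A minimal sketch of the kind of gain sweep described above, in plain NumPy so it is self-contained (no Optuna or PyTorch needed). The width, depth, and the choice of output standard deviation after a deep forward pass as the signal-propagation metric are my assumptions, not details from the original experiment; an Optuna study would simply wrap `depth_signal_std` in an objective and search `gain` continuously.

```python
import numpy as np

def mish(x):
    # mish(x) = x * tanh(softplus(x)); logaddexp keeps softplus numerically stable
    return x * np.tanh(np.logaddexp(0.0, x))

def orthogonal(n, rng):
    # Haar-distributed orthogonal matrix via QR of a Gaussian matrix
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def depth_signal_std(gain, depth=50, width=64, batch=128, seed=0):
    # Push a random batch through `depth` orthogonal + mish layers and
    # report the output std: ~0 means vanishing, a huge value means exploding.
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((width, batch))
    for _ in range(depth):
        h = mish(gain * orthogonal(width, rng) @ h)
    return float(h.std())

if __name__ == "__main__":
    for gain in (0.5, 1.0, 1.5, 2.0, 3.0):
        print(f"gain={gain:.1f}  output std={depth_signal_std(gain):.3e}")
```

Scanning the printout for the gain whose output std stays closest to 1 gives a rough version of what the Optuna search would automate.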

from mish.

evanatyourservice commented on September 27, 2024

@Xynonners Not sure; I ended up using silu for quality as well as speed reasons.


digantamisra98 commented on September 27, 2024

Thanks for the appreciation of my work. I'm glad Mish has been working well in your projects.

This is an interesting observation; I haven't extensively investigated optimal initialization schemes for Mish. I will look into this further and verify it. Off topic, have you tried orthogonal initialization before?


evanatyourservice commented on September 27, 2024

No I have not! But now I want to, since you've reminded me -- orthogonal used to work very well for me in RL experiments with complex networks. Maybe I'll try it with the standard deviation experiment above and see what happens. With kaiming uniform I was getting some signal up to about 75 layers.

I've only messed around with kaiming uniform and fan_in, but I'm sure the gain/slope setting would be the same for kaiming normal and fan_out, since the gain depends only on the shape of the activation function. And since 0.0003 is so close to 0, I'd think any of the relu defaults would also work well for mish. I know Less Wright used nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu') in his work with mish that beat some Kaggle competitions -- nonlinearity='relu' is the same as a=0. This makes me think orthogonal would work at least somewhat well with mish, since it works well with relu.
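As a quick check of the "so close to 0" point: the gain PyTorch uses for a leaky-relu slope `a` is sqrt(2 / (1 + a^2)), and the kaiming weight std is gain / sqrt(fan). A small sketch (the fan value of 512 is just an arbitrary example) showing that a=0.0003 is numerically indistinguishable from relu's a=0:

```python
import math

def kaiming_std(fan, a=0.0):
    # Kaiming init as in PyTorch: std = gain / sqrt(fan),
    # with gain = sqrt(2 / (1 + a^2)) for leaky_relu slope a
    return math.sqrt(2.0 / (1.0 + a * a)) / math.sqrt(fan)

fan = 512  # example fan_in / fan_out
print(kaiming_std(fan, a=0.0))     # relu default
print(kaiming_std(fan, a=0.0003))  # nearly identical
```

So at least as far as the init std goes, nonlinearity='relu' and a slope of 0.0003 produce effectively the same weights.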


digantamisra98 commented on September 27, 2024

Right, that is exactly why I'm interested in orthogonal initialization. Let me know if you make any progress with it.


digantamisra98 commented on September 27, 2024

@evanatyourservice Thanks for the update. Interesting. With orthogonal initialization you need to keep the init on the EOC (edge of chaos), otherwise you'll get either vanishing or exploding gradients.
Keep me posted on your progress.


Xynonners commented on September 27, 2024

> Little update, definitely getting some interesting results with the orthogonal init. At first glance, it seems less finicky than kaiming init, more stable. I'm going to turn the experiment above into an optuna optimization problem to find the optimum gain because it takes so long manually and isn't exact.
>
> With the orthogonal gain hyper, too small gain equals vanishing gradient, too large equals exploding, so with optuna I could narrow down the best gain that should allow for pretty deep propagation through a lot of layers. To really see the difference between orthogonal and kaiming, though, I'll have to do an actual training experiment. We'll see! I'll at least update my finding for the best setting for orthogonal gain for mish here shortly so others could experiment with this value. I'll also run kaiming through optuna as well to narrow that hyper down more exactly than 0.0003.

hey there,
did you ever finish those experiments? I'm pretty interested to know the results (too lazy to run them myself).


evanatyourservice commented on September 27, 2024

@Xynonners Yeah, I've messed with orthogonal and mish a bit over the years now. Orthogonal has given me problems with transformers, so I usually use the 0.02 truncated normal init, and the silu activation.
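For reference, "the 0.02 truncated normal init" here presumably means the GPT-2-style scheme: draw weights from N(0, 0.02^2) and redraw anything beyond two standard deviations (in PyTorch, roughly `torch.nn.init.trunc_normal_(w, std=0.02)`). A NumPy sketch under that assumption, with the resampling loop and the example shape being mine:

```python
import numpy as np

def trunc_normal(shape, std=0.02, clip=2.0, seed=0):
    # Draw N(0, std^2) weights and redraw anything beyond clip*std,
    # mimicking the truncated-normal init common in transformer codebases
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(shape) * std
    out_of_range = np.abs(w) > clip * std
    while out_of_range.any():
        w[out_of_range] = rng.standard_normal(out_of_range.sum()) * std
        out_of_range = np.abs(w) > clip * std
    return w

w = trunc_normal((768, 768))  # e.g. one transformer projection matrix
```

The truncation pulls the realized std slightly below the nominal 0.02, which is usually considered acceptable for this init.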


Xynonners commented on September 27, 2024

> @Xynonners yeah I’ve messed with orthogonal and mish a bit over the years now, orthogonal has given problems with transformers so I use the 0.02 truncated normal init usually, and use silu activation

Oh, so you didn't find any improvements with mish (or with kaiming init)?

