Comments (6)

shjwudp commented on May 19, 2024

Hi Edward, thank you very much; your advice saved me. A larger learning rate exposed the problem: the plot showed jitter. I debugged and fixed the problem, and now the plot is smooth and shows a healthy increase in value. I'm going to try μTransfer with the larger learning rate on this.

Although it looks good now, the underlying theory is still hard for me to follow, and muP is really amazing.

from mup.

edwardjhu commented on May 19, 2024

Thanks for your patience, Jianbin.

There are many considerations when training a very large model. In some sense, mup is a necessary but insufficient condition for successful training of large models. Other factors include the use of weight decay and floating point precision. Hope this can help with your investigation!

> Another question: the transformer example and mutransformers use different initialization formulas, (init_std / d_model) ** 0.5 vs. init_std * width_mult ** -0.5. Are these two formulas equivalent in some sense? Are there pros and cons to each?

They are equivalent up to a constant.
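A minimal sketch of what "equivalent up to a constant" can mean here, assuming `width_mult = d_model / base_d_model` for some fixed base width. That definition, the function names, and the numeric values below are illustrative assumptions, not code taken from either repository. Under that assumption, both formulas scale as `d_model ** -0.5`, so their ratio does not depend on width:

```python
import math

def std_transformer_example(init_std: float, d_model: int) -> float:
    # Formula quoted in the thread for the transformer example.
    return (init_std / d_model) ** 0.5

def std_mutransformers(init_std: float, d_model: int, base_d_model: int) -> float:
    # Formula quoted in the thread for mutransformers.
    # Assumption: width_mult is the ratio of the current width to a base width.
    width_mult = d_model / base_d_model
    return init_std * width_mult ** -0.5

init_std = 0.02      # illustrative value
base_d_model = 256   # illustrative base width

# Ratio of the two formulas at several widths.
ratios = [
    std_mutransformers(init_std, d, base_d_model) / std_transformer_example(init_std, d)
    for d in (256, 512, 1024, 4096)
]

# Algebraically the ratio is sqrt(init_std * base_d_model), independent of d_model.
expected_ratio = math.sqrt(init_std * base_d_model)
print(ratios, expected_ratio)
```

Since the constant depends on `init_std` and the base width, the same numeric `init_std` would not produce the same standard deviation under both formulas; the value would need rescaling when moving from one convention to the other.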


edwardjhu commented on May 19, 2024

Hi shjwudp,

Thanks for your interest in our work!

Your coordinate check plots seem identical across time steps, which is a sign that the learning rate is too small for the function to change. Can you try rerunning with a larger learning rate? It's possible that with a moderately larger learning rate, the muP run might blow up after a couple steps, in which case we can look into it further.
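One way to check for this symptom numerically, as a rough sketch: take the per-step coordinate sizes recorded for one layer at one width (for example, the mean absolute activation) and measure how much they move relative to the first step. The helper names and the tolerance below are hypothetical and are not part of the mup package.

```python
def relative_change_across_steps(stats_by_step):
    """Largest deviation from the step-0 value, relative to that value.

    stats_by_step: list of floats, one coordinate-size statistic per
    training step for a single layer at a single width.
    """
    first = stats_by_step[0]
    largest_deviation = max(abs(value - first) for value in stats_by_step)
    return largest_deviation / max(abs(first), 1e-12)

def looks_frozen(stats_by_step, tol=1e-3):
    # True when the statistic barely moves over training, which would be
    # consistent with a learning rate too small to change the function.
    # tol is an arbitrary illustrative threshold.
    return relative_change_across_steps(stats_by_step) < tol

# Nearly identical values across steps versus values that clearly change.
print(looks_frozen([0.5, 0.5001, 0.5002]))
print(looks_frozen([0.5, 0.6, 0.8]))
```

If every layer looks frozen at every width, rerunning the coordinate check with a larger learning rate is the natural next step.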


shjwudp commented on May 19, 2024

Hi @edwardjhu, I've recently done some experiments as an extension of the previous discussion. I found that transferring the same hyperparameters from a 350M model to the 1.3B scale works fine, but transferring them to a larger 2.7B model blows up. Does that mean my hyperparameters are too aggressive? How should I avoid this?

My coord check plots: [image]
The same hyperparameters, 1.3B model and 2.7B model comparison: https://tensorboard.dev/experiment/RirdggEZS8O2rRU9clEy0g/#scalars


shjwudp commented on May 19, 2024

Another question: the transformer example and mutransformers use different initialization formulas, (init_std / d_model) ** 0.5 vs. init_std * width_mult ** -0.5. Are these two formulas equivalent in some sense? Are there pros and cons to each?


leenachennuru commented on May 19, 2024

> Hi Edward, thank you very much; your advice saved me. A larger learning rate exposed the problem: the plot showed jitter. I debugged and fixed the problem, and now the plot is smooth and shows a healthy increase in value. I'm going to try μTransfer with the larger learning rate on this.
>
> Although it looks good now, the underlying theory is still hard for me to follow, and muP is really amazing.

Hi Jianbin, could you share what caused the jitter in your coord check plots? It's possible that I have a similar issue (#58).

Thanks!
