Comments (15)

PiotrNawrot commented on September 14, 2024

@Taytay We're currently investigating it with others in #25. You are welcome to join the discussion if you are interested.

PiotrNawrot commented on September 14, 2024

I think that we're talking about two different things here.

The thing I mean by RMS-LR scaling is this (a rough sketch is included at the end of this comment), which is in all public implementations of Adafactor.

What you mean is this.

When I was trying to make Adam work with T5 pre-training, I tried to identify all the differences and checked them one by one independently, and the one that happened to matter was the LR scaling mentioned above.

Hope this clarifies it. I'm also looking forward to hearing your thoughts about it, as I can't explain it either : )
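
For concreteness, here is a minimal sketch of what that RMS-LR scaling (the "relative step size" from Section 8 of the Adafactor paper) amounts to. The function names and the eps2 = 1e-3 default follow the paper's description, not any particular codebase:

```python
import torch

def rms(x: torch.Tensor) -> float:
    # Root mean square of all entries of a tensor.
    return (x.norm(2) / x.numel() ** 0.5).item()

def relative_step_size(param: torch.Tensor, base_lr: float, eps2: float = 1e-3) -> float:
    # Section 8 of the Adafactor paper: scale the step for each parameter
    # tensor by max(eps2, RMS(param)), so larger-magnitude weights take
    # proportionally larger steps, with eps2 acting as a floor for weights
    # initialised near zero.
    return base_lr * max(eps2, rms(param))
```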

SmerkyG commented on September 14, 2024

Ah, thank you for the clarification. I was indeed missing that part from the Adafactor paper (Section 8), and the learning rate adjustment was buried in a separate function in the adafactor code I looked at, so I missed that as well.

One other note you might find interesting: RWKV uses a different technique to achieve what the Adafactor paper gives as the reason for this Section 8 RMS-LR scaling, namely helping the embeddings find good values. Its author initializes the embedding vectors specially and then adds a separate LayerNorm after them before passing through to the rest of the model, which causes them to move quickly while maintaining a unit-size norm, and thereby converge to useful values early. See https://github.com/BlinkDL/SmallInitEmb
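
A minimal sketch of that idea, assuming the linked repo's recipe of a tiny embedding init plus a LayerNorm directly after the lookup (the 1e-4 init range here is illustrative, not necessarily the value RWKV uses):

```python
import torch
import torch.nn as nn

class SmallInitEmbedding(nn.Module):
    """Tiny-init embedding followed by LayerNorm, in the spirit of SmallInitEmb."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        # Initialise the embedding matrix with very small values so early
        # gradient steps can reorient the vectors quickly.
        nn.init.uniform_(self.emb.weight, a=-1e-4, b=1e-4)
        # LayerNorm right after the embedding keeps the outputs at unit scale
        # regardless of how small the raw embedding weights are.
        self.ln = nn.LayerNorm(d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.ln(self.emb(token_ids))
```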

I've been interested to see if this Adafactor-style RMS-LR scaling trick can help train other models more efficiently, such as GPT-2-style models. If you like, I can let you know when I have some results to share.

PiotrNawrot commented on September 14, 2024

Super interesting, I would love to know more about it. I was planning to dig deeper into this, but I won't have time for a few weeks. Please keep me in the loop whenever you have some results; I can definitely help with some experiments if you'd like : )

SmerkyG commented on September 14, 2024

I've been running some experiments this morning, and so far the loss curve is much steeper (better) in early training for my tweaked GPT-2 model using plain AdamW rather than AdamW plus the RMS-LR scaling, once I figured out how to adjust for the very different base learning rates needed (8e-2 vs 4e-6 for me). For me, AdamWScaled's progress flattens out relatively quickly, but maybe that's made up for later in the training cycle...

Do you recall if you were seeing improvements even early in training with AdamWScaled vs AdamW?

PiotrNawrot commented on September 14, 2024

I think that the pre-training task matters a lot. For regular decoder-only GPT-style models I've also been using AdamW and it works just fine - the loss curve goes down smoothly from the beginning. T5 pre-training is very different. What I do not include in the README is the beginning of the loss curve. For every successful T5 training run there is a double-descent situation (around loss = 5):

[Figure: T5 pre-training loss curve showing the double-descent behaviour around loss ≈ 5]

You can also observe something similar in the blog. Only a small subset of runs (Adafactor) converge to the desired values (<3), while the others stay at a very high loss (~5).

The only way I found to get this double-descent behaviour in T5 pre-training is to include this weird RMS-LR scaling, and I tested it with a variety of optimizers (Adam, Sophia, Lion). None of them converged to a loss lower than 4, but after I added RMS-LR scaling they worked. I think that regular Adam is optimal for a huge variety of tasks, but T5 pre-training is not one of them. The question is why :)
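
One hypothetical way to bolt that scaling onto an off-the-shelf optimizer (this is just a sketch, not the nanoT5 implementation: it puts each tensor in its own param group so its learning rate can be rescaled by max(eps2, RMS(param)) before every update):

```python
import torch
from torch.optim import AdamW

def make_rms_scaled_optimizer(params, base_lr=1e-2, eps2=1e-3, **adamw_kwargs):
    # One param group per tensor so each parameter gets its own lr,
    # rescaled by max(eps2, RMS(param)) right before every update.
    params = [p for p in params if p.requires_grad]
    opt = AdamW([{"params": [p], "lr": base_lr} for p in params], **adamw_kwargs)

    def step():
        for group in opt.param_groups:
            p = group["params"][0]
            rms = (p.detach().norm(2) / p.numel() ** 0.5).item()
            group["lr"] = base_lr * max(eps2, rms)
        opt.step()

    return opt, step
```

In a training loop you would call the returned step() in place of opt.step() (with opt.zero_grad() as usual); since the wrapper only touches the learning rate, the same pattern should apply to any optimizer with the standard param-group interface, e.g. Lion or Sophia.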

PiotrNawrot commented on September 14, 2024

Hmm, but what if the thing they mention in Section 8 of the Adafactor paper is true for this instance of the T5 model? Have a look here. In HF they in fact initialise the weight matrices in a complex way (e.g. they do not scale the attention scores by sqrt(h_dim), but instead initialise the attention weights appropriately so that it's not needed). Maybe the relative step size is necessary in this scenario.
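
Roughly, the HF T5 initialisation referred to looks like the sketch below (the exact std factors are an approximate recollection of the HF scheme and may not match the current implementation; the point is that the query projection gets a smaller std so the forward pass can skip dividing the attention scores by sqrt(d_kv)):

```python
import torch.nn as nn

def init_t5_attention_like(q: nn.Linear, k: nn.Linear, v: nn.Linear, o: nn.Linear,
                           d_model: int, d_kv: int, n_heads: int, factor: float = 1.0):
    # Smaller std for the query projection folds the usual 1/sqrt(d_kv)
    # attention scaling into the weights themselves.
    q.weight.data.normal_(mean=0.0, std=factor * (d_model * d_kv) ** -0.5)
    k.weight.data.normal_(mean=0.0, std=factor * d_model ** -0.5)
    v.weight.data.normal_(mean=0.0, std=factor * d_model ** -0.5)
    o.weight.data.normal_(mean=0.0, std=factor * (n_heads * d_kv) ** -0.5)
```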

SmerkyG commented on September 14, 2024

Thanks, it's great to see the whole graph - such images almost always have the beginning cut off :) And I didn't know that about the two-phase loss curve for T5 models! Very interesting.

Sounds like maybe the relative step size is helping you get past the part where the loss curve flattens temporarily right around 5? Like in the graph from that HF blog you mentioned: https://yhavinga-pre-training-dutch-t5-models.hf.space/media/2c8d9281e22bee471f438c8658cd4faca4e1bb905d4fe41ea2c899a4.png

One idea is that maybe the two-phase aspect is somehow a result of the model finally 'escaping' the 1e-3 small-param regime, which is where the scaling can then start improving the learning rate?
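
To see that numerically, here is a toy illustration of the eps2 = 1e-3 floor, reusing the max(eps2, RMS) rule sketched earlier (the numbers are made up):

```python
# Below eps2 = 1e-3 the effective lr is clamped to the floor; once
# RMS(param) grows past 1e-3 it starts tracking the parameter magnitude.
base_lr, eps2 = 1e-2, 1e-3
for rms_value in (1e-4, 5e-4, 1e-3, 5e-3, 1e-2):
    print(f"RMS={rms_value:.0e} -> effective lr={base_lr * max(eps2, rms_value):.1e}")
```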

PiotrNawrot commented on September 14, 2024

Haha, I changed the initialisation of the model so that all weights are drawn from the same distribution, and regular Adam worked! Thanks for bringing this up, mystery solved. However, this trick from Adafactor is actually super cool, because it makes the optimiser more initialisation-agnostic!
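
The exact distribution used isn't stated in the thread; purely as an illustration of the idea, a re-init that replaces the per-module T5 scheme with one shared Gaussian might look like this (std = 0.02 is a placeholder, not the value actually used):

```python
import torch.nn as nn

def reinit_from_single_distribution(model: nn.Module, std: float = 0.02):
    # Overwrite the per-module init scheme with one shared Gaussian.
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=std)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()
```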

SmerkyG commented on September 14, 2024

Wow, that's great! So glad I accidentally helped!!!

SmerkyG commented on September 14, 2024

Would you mind posting a graph of the new descent? I'd like to see how that hump changed!

PiotrNawrot commented on September 14, 2024

The hump is exactly the same. I'll try to post the updated version of the repo soon : )

SmerkyG commented on September 14, 2024

Neat. That's interesting to find out that the hump is a fundamental part of the model and not simply an artifact of the RMS-LR scaling trick.

Taytay commented on September 14, 2024

@PiotrNawrot : did you have time to incorporate these findings into the report? It sounds really encouraging!

PiotrNawrot commented on September 14, 2024

@SmerkyG also - we're currently investigating it with others in #25. You are welcome to join the discussion if you are interested.
