Comments (15)

PiotrNawrot commented on September 14, 2024

@Taytay We're currently investigating it with others in #25. You are welcome to join the discussion if you are interested.

PiotrNawrot commented on September 14, 2024

I think that we're talking about two different things here.

The thing I mean by RMS-LR scaling is this (a rough sketch is included at the end of this comment), which is in all public implementations of Adafactor.

What you mean is this.

When I was trying to make Adam work with T5 pre-training, I tried to identify all the differences and checked them one by one independently, and the one that happened to matter was the LR scaling mentioned above.

Hope this clarifies it. I'm also looking forward to hearing your thoughts about it, as I can't explain it either : )
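
For concreteness, here is a minimal sketch of what that RMS-LR scaling (the "relative step size" from Section 8 of the Adafactor paper) amounts to. The function names and the eps2 = 1e-3 default follow the paper's description, not any particular codebase:

```python
import torch

def rms(x: torch.Tensor) -> float:
    # Root mean square of all entries of a tensor.
    return (x.norm(2) / x.numel() ** 0.5).item()

def relative_step_size(param: torch.Tensor, base_lr: float, eps2: float = 1e-3) -> float:
    # Section 8 of the Adafactor paper: scale the step for each parameter
    # tensor by max(eps2, RMS(param)), so larger-magnitude weights take
    # proportionally larger steps, with eps2 acting as a floor for weights
    # initialised near zero.
    return base_lr * max(eps2, rms(param))
```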

SmerkyG commented on September 14, 2024

Ah, thank you for the clarification. I was indeed missing that part from the Adafactor paper (Section 8), and the learning rate adjustment was buried in a separate function in the adafactor code I looked at, so I missed that as well.

One other note you might find interesting: RWKV uses a different technique to achieve what the Adafactor paper gives as the reason for this Section 8 RMS-LR scaling, namely helping the embeddings find good values. Its author initializes the embedding vectors specially and then adds a separate LayerNorm after them before passing through to the rest of the model, which causes them to move quickly while maintaining a unit-size norm, and thereby converge to useful values early. See https://github.com/BlinkDL/SmallInitEmb
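
A minimal sketch of that idea, assuming the linked repo's recipe of a tiny embedding init plus a LayerNorm directly after the lookup (the 1e-4 init range here is illustrative, not necessarily the value RWKV uses):

```python
import torch
import torch.nn as nn

class SmallInitEmbedding(nn.Module):
    """Tiny-init embedding followed by LayerNorm, in the spirit of SmallInitEmb."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        # Initialise the embedding matrix with very small values so early
        # gradient steps can reorient the vectors quickly.
        nn.init.uniform_(self.emb.weight, a=-1e-4, b=1e-4)
        # LayerNorm right after the embedding keeps the outputs at unit scale
        # regardless of how small the raw embedding weights are.
        self.ln = nn.LayerNorm(d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.ln(self.emb(token_ids))
```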

I've been interested to see if this Adafactor-style RMS-LR scaling trick can help train other models more efficiently, such as GPT-2-style models. If you like, I can let you know when I have some results to share.

PiotrNawrot commented on September 14, 2024

Super interesting, I would love to know more about it. I was planning to dig deeper into this, but I won't have time for a few weeks. Please keep me in the loop whenever you have some results; I can definitely help with some experiments if you'd like : )

SmerkyG commented on September 14, 2024

I've been running some experiments this morning, and so far the loss curve is much steeper (better) in early training for my tweaked GPT-2 model using plain AdamW rather than AdamW plus the RMS-LR scaling, once I figured out how to adjust for the very different base learning rates needed (8e-2 vs 4e-6 for me). For me, AdamWScaled's progress flattens out relatively quickly, but maybe that's made up for later in the training cycle...

Do you recall if you were seeing improvements even early in training with AdamWScaled vs AdamW?

PiotrNawrot commented on September 14, 2024

I think that the pre-training task matters a lot. For regular decoder-only GPT-style models I've also been using AdamW and it works just fine - the loss curve goes down smoothly from the beginning. T5 pre-training is very different. What I do not include in the README is the beginning of the loss curve. For every successful T5 training run there is a double-descent situation (around loss = 5):

[Figure: T5 pre-training loss curve showing the double-descent behaviour around loss ≈ 5]

You can also observe something similar in the blog. Only a small subset of runs (Adafactor) converge to the desired values (<3), while the others stay at a very high loss (~5).

The only way I found to get this double-descent behaviour in T5 pre-training is to include this weird RMS-LR scaling, and I tested it with a variety of optimizers (Adam, Sophia, Lion). None of them converged to a loss lower than 4, but after I added RMS-LR scaling they worked. I think that regular Adam is optimal for a huge variety of tasks, but T5 pre-training is not one of them. The question is why :)
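
One hypothetical way to bolt that scaling onto an off-the-shelf optimizer (this is just a sketch, not the nanoT5 implementation: it puts each tensor in its own param group so its learning rate can be rescaled by max(eps2, RMS(param)) before every update):

```python
import torch
from torch.optim import AdamW

def make_rms_scaled_optimizer(params, base_lr=1e-2, eps2=1e-3, **adamw_kwargs):
    # One param group per tensor so each parameter gets its own lr,
    # rescaled by max(eps2, RMS(param)) right before every update.
    params = [p for p in params if p.requires_grad]
    opt = AdamW([{"params": [p], "lr": base_lr} for p in params], **adamw_kwargs)

    def step():
        for group in opt.param_groups:
            p = group["params"][0]
            rms = (p.detach().norm(2) / p.numel() ** 0.5).item()
            group["lr"] = base_lr * max(eps2, rms)
        opt.step()

    return opt, step
```

In a training loop you would call the returned step() in place of opt.step() (with opt.zero_grad() as usual); since the wrapper only touches the learning rate, the same pattern should apply to any optimizer with the standard param-group interface, e.g. Lion or Sophia.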

PiotrNawrot commented on September 14, 2024

Hmm, but what if the thing they mention in Section 8 of the Adafactor paper is true for this instance of the T5 model? Have a look here. In HF they in fact initialise the weight matrices in a complex way (e.g. they do not scale the attention scores by sqrt(h_dim), but instead initialise the attention weights appropriately so that it's not needed). Maybe the relative step size is necessary in this scenario.
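
Roughly, the HF T5 initialisation referred to looks like the sketch below (the exact std factors are an approximate recollection of the HF scheme and may not match the current implementation; the point is that the query projection gets a smaller std so the forward pass can skip dividing the attention scores by sqrt(d_kv)):

```python
import torch.nn as nn

def init_t5_attention_like(q: nn.Linear, k: nn.Linear, v: nn.Linear, o: nn.Linear,
                           d_model: int, d_kv: int, n_heads: int, factor: float = 1.0):
    # Smaller std for the query projection folds the usual 1/sqrt(d_kv)
    # attention scaling into the weights themselves.
    q.weight.data.normal_(mean=0.0, std=factor * (d_model * d_kv) ** -0.5)
    k.weight.data.normal_(mean=0.0, std=factor * d_model ** -0.5)
    v.weight.data.normal_(mean=0.0, std=factor * d_model ** -0.5)
    o.weight.data.normal_(mean=0.0, std=factor * (n_heads * d_kv) ** -0.5)
```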

SmerkyG commented on September 14, 2024

Thanks, it's great to see the whole graph - such images almost always have the beginning cut off :) And I didn't know that about the two-phase loss curve for T5 models! Very interesting.

Sounds like maybe the relative step size is helping you get past the part where the loss curve flattens temporarily right around 5? Like in the graph from that HF blog you mentioned: https://yhavinga-pre-training-dutch-t5-models.hf.space/media/2c8d9281e22bee471f438c8658cd4faca4e1bb905d4fe41ea2c899a4.png

One idea is that maybe the two-phase aspect is somehow a result of the model finally 'escaping' the 1e-3 small-param regime, which is where the scaling can then start improving the learning rate?
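
To see that numerically, here is a toy illustration of the eps2 = 1e-3 floor, reusing the max(eps2, RMS) rule sketched earlier (the numbers are made up):

```python
# Below eps2 = 1e-3 the effective lr is clamped to the floor; once
# RMS(param) grows past 1e-3 it starts tracking the parameter magnitude.
base_lr, eps2 = 1e-2, 1e-3
for rms_value in (1e-4, 5e-4, 1e-3, 5e-3, 1e-2):
    print(f"RMS={rms_value:.0e} -> effective lr={base_lr * max(eps2, rms_value):.1e}")
```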

PiotrNawrot commented on September 14, 2024

Haha, I changed the initialisation of the model so that all weights are drawn from the same distribution, and regular Adam worked! Thanks for bringing this up, mystery solved. However, this trick from Adafactor is actually super cool, because it makes the optimiser more initialisation-agnostic!
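
The exact distribution used isn't stated in the thread; purely as an illustration of the idea, a re-init that replaces the per-module T5 scheme with one shared Gaussian might look like this (std = 0.02 is a placeholder, not the value actually used):

```python
import torch.nn as nn

def reinit_from_single_distribution(model: nn.Module, std: float = 0.02):
    # Overwrite the per-module init scheme with one shared Gaussian.
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=std)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()
```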

SmerkyG commented on September 14, 2024

Wow, that's great! So glad I accidentally helped!!!

SmerkyG commented on September 14, 2024

Would you mind posting a graph of the new descent? I'd like to see how that hump changed!

PiotrNawrot commented on September 14, 2024

The hump is exactly the same. I'll try to post the updated version of the repo soon : )

SmerkyG commented on September 14, 2024

Neat. That's interesting to find out that the hump is a fundamental part of the model and not simply an artifact of the RMS-LR scaling trick.

Taytay commented on September 14, 2024

@PiotrNawrot : did you have time to incorporate these findings into the report? It sounds really encouraging!

PiotrNawrot commented on September 14, 2024

@SmerkyG also - we're currently investigating it with others in #25. You are welcome to join the discussion if you are interested.
