Comments (15)
@Taytay We're currently investigating it with others in #25. You are welcome to join the discussion if you are interested.
from nanot5.
I think that we're talking about two different things here.
What I mean by RMS-LR scaling is this, which is in all public implementations of Adafactor.
What you mean is this.
When I was trying to make Adam work with T5 pre-training, I tried to identify all the differences and check them one by one independently; the one that happened to matter was the LR scaling mentioned above.
Hope this clarifies things. I'm also looking forward to hearing your thoughts, since I can't explain it either : )
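For concreteness, the Section 8 trick from the Adafactor paper can be sketched as follows. This is a minimal NumPy sketch of the usual formulation (the base learning rate is multiplied by max(eps2, RMS(param)), with eps2 = 1e-3 as in the paper); it is an illustration of the idea, not nanoT5's actual optimizer code.

```python
import numpy as np

def rms(x):
    # Root-mean-square of a parameter tensor.
    return float(np.sqrt(np.mean(np.square(x))))

def scaled_lr(base_lr, param, eps2=1e-3):
    # Adafactor Sec. 8 "relative step size": scale the base learning rate
    # by max(eps2, RMS(param)), so each tensor's update stays proportional
    # to its own scale. eps2 floors the multiplier for tiny parameters
    # (e.g. embeddings initialised near zero), letting them still move.
    return base_lr * max(eps2, rms(param))

# A weight with RMS = 1 gets the base LR unchanged; a tiny tensor
# (RMS << eps2) gets the floored base_lr * 1e-3 instead.
w_unit = np.full((4, 4), 1.0)
w_tiny = np.full((4, 4), 1e-6)
print(scaled_lr(1.0, w_unit))  # 1.0
print(scaled_lr(1.0, w_tiny))  # 0.001
```

In an AdamW variant ("AdamWScaled" below), this multiplier would be applied per parameter tensor at each step, on top of the usual Adam update.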
Ah, thank you for the clarification. I was indeed missing that part of the Adafactor paper (Section 8), and the learning rate adjustment was buried in a separate function in the Adafactor code I looked at, so I missed it there as well.
One other note you might find interesting: RWKV uses a different technique to achieve what the Adafactor paper gives as the motivation for this Section 8 RMS-LR scaling, namely helping the embeddings find good values. The author initializes the embedding vectors with very small values and then adds a separate LayerNorm after them before passing them to the rest of the model, which lets them move quickly while maintaining a unit-size norm, and thereby converge to useful values early. See https://github.com/BlinkDL/SmallInitEmb
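A rough NumPy sketch of the SmallInitEmb idea: a tiny uniform embedding init followed by a LayerNorm, so the normalized vectors come out at unit scale while the underlying weights can move fast relative to their size. The ±1e-4 scale follows the linked repo's suggestion; the small eps here is chosen for illustration (a real LayerNorm also has a learned gain), and all shapes are made up.

```python
import numpy as np

def layer_norm(x, eps=1e-12):
    # Per-vector LayerNorm without the learned affine, for illustration.
    # eps is deliberately tiny here so it doesn't swamp the ~1e-4-scale
    # inputs; real frameworks default to eps=1e-5 plus a learned gain.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
vocab, d_model = 100, 64

# SmallInitEmb: tiny uniform init instead of the usual ~1/sqrt(d) scale.
emb = rng.uniform(-1e-4, 1e-4, size=(vocab, d_model))

tokens = np.array([3, 17, 42])
h = layer_norm(emb[tokens])  # unit-scale vectors despite tiny weights
print(np.round(np.sqrt((h ** 2).mean(axis=-1)), 2))  # RMS of each row ~ 1.0
```

The point is that a gradient step of any fixed size is huge relative to the ~1e-4 weights, so the embeddings reorganize quickly, while the LayerNorm keeps the signal reaching the rest of the model at a sane magnitude.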
I've been interested to see whether this Adafactor-style RMS-LR scaling trick can help train other models more efficiently, such as GPT-2-style models. If you like, I can let you know when I have some results to share.
Super interesting, I'd love to know more. I was planning to dig deeper into this, but I won't have time for a few weeks. Please keep me in the loop whenever you have some results; I can definitely help with some experiments if you'd like : )
I've been running some experiments this morning, and so far the loss curve is much steeper (better) in early training for my tweaked GPT-2 model with plain AdamW than with AdamW plus RMS-LR scaling, once I figured out how to adjust for the very different base learning rates needed (8e-2 vs 4e-6 for me). For me, AdamWScaled's progress flattens out relatively quickly, but maybe that's made up for later in the training cycle...
Do you recall whether you were seeing improvements even early in training with AdamWScaled vs AdamW?
I think the pre-training task matters a lot. For regular decoder-only GPT-style models I've also been using AdamW and it works just fine - the loss curve goes down smoothly from the beginning. T5 pre-training is very different. What I don't include in the README is the beginning of the loss curve. Every successful T5 training run shows a double-descent situation (around loss = 5):
You can also observe something similar in the blog. Only a small subset of runs (Adafactor) converge to the desired values (<3), while the others stay at a very high loss (~5).
The only way I found to get this double-descent behaviour for T5 pre-training is to include this weird RMS-LR scaling, and I tested it with a variety of optimizers (Adam, Sophia, Lion). None of them converged to a loss lower than 4, but after I added RMS-LR scaling they all worked. I think regular Adam is optimal for a huge variety of tasks, but T5 pre-training is not one of them. The question is why :)
Hmm, but what if the thing they mention in Section 8 of the Adafactor paper is true for this instance of the T5 model? Have a look here. In HF they in fact initialise the weight matrices in a complex way (e.g. they do not scale the attention scores by sqrt(h_dim), but instead initialise the attention weights appropriately so that the scaling isn't needed). Maybe the relative step size is necessary in this scenario.
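A sketch of that compensated-initialisation idea: shrink the query projection's init so that the *unscaled* attention scores already have unit variance, instead of dividing scores by sqrt(h_dim). The std choices below mirror my reading of the HF T5 init (query std (d_model * d_kv)**-0.5, key std d_model**-0.5); the shapes and data are illustrative, not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_kv = 512, 64

# Standard attention: init q, k with std 1/sqrt(d_model), then divide the
# scores by sqrt(d_kv). The compensated scheme instead folds that factor
# into the query init, so no score scaling is applied anywhere:
Wq = rng.normal(0.0, (d_model * d_kv) ** -0.5, size=(d_model, d_kv))
Wk = rng.normal(0.0, d_model ** -0.5, size=(d_model, d_kv))

x = rng.normal(0.0, 1.0, size=(1000, d_model))  # unit-variance activations
scores = (x @ Wq) @ (x @ Wk).T  # note: no / sqrt(d_kv)

# Var(q_j) = 1/d_kv and Var(k_j) = 1, so Var(score) = d_kv * (1/d_kv) = 1
# at init - the same as scaled attention would give.
print(float(scores.std()))  # close to 1
```

The catch is that this equivalence holds only at initialisation; once training moves the weights, the two parameterisations behave differently, which is plausibly where the relative step size starts to matter.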
Thanks, it's great to see the whole graph - such images almost always have the beginning cut off :) And I didn't know about the two-phase loss curve for T5 models! Very interesting.
Sounds like maybe the relative step size is helping you get past the part where the loss curve flattens temporarily right around 5, like in the graph from that HF blog you mentioned: https://yhavinga-pre-training-dutch-t5-models.hf.space/media/2c8d9281e22bee471f438c8658cd4faca4e1bb905d4fe41ea2c899a4.png
One idea: maybe the two-phase behaviour is somehow a result of the parameters finally 'escaping' the 1e-3 small-parameter regime, which is where the effective learning rate can then grow.
Haha, I changed the initialisation of the model so that all weights are drawn from the same distribution, and regular Adam worked! Thanks for bringing this up - mystery solved. This trick from Adafactor is actually super cool, though, because it makes the optimiser more initialisation-agnostic!
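The fix described above might look something like this hypothetical helper. Both the name `reinit_uniform_normal` and the std=0.02 are my own illustrative choices, not necessarily what was actually used; the point is only that every tensor is redrawn from one shared distribution, discarding T5's per-module fan-in-dependent scales, so plain Adam's fixed step size suits all tensors equally.

```python
import numpy as np

def reinit_uniform_normal(params, std=0.02, seed=0):
    # Hypothetical helper: redraw every weight matrix from the same
    # N(0, std^2), ignoring each module's original init scale.
    rng = np.random.default_rng(seed)
    return {name: rng.normal(0.0, std, size=w.shape)
            for name, w in params.items()}

# Toy "model": two tensors that T5 would normally init at very
# different scales (shapes are made up).
params = {"encoder.q": np.zeros((8, 4)), "lm_head": np.zeros((16, 8))}
new = reinit_uniform_normal(params)
print({k: round(float(v.std()), 2) for k, v in new.items()})  # all ~0.02
```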
Wow, that's great! So glad I accidentally helped!!!
Would you mind posting a graph of the new descent? I'd like to see how that hump changed!
The hump is exactly the same, I'll try to post the updated version of the repo soon : )
Neat. It's interesting to find out that the hump is a fundamental part of the model and not simply an artifact of the RMS-LR scaling trick.
@PiotrNawrot : did you have time to incorporate these findings into the report? It sounds really encouraging!
@SmerkyG also - we're currently investigating it with others in #25. You are welcome to join the discussion if you are interested.