Comments (2)
Adding on to Edward: the usual lr/batch-size dependency rule applies when you fix the number of epochs, whereas here we fix the number of steps (since we are shrinking the training problem in order to tune, this makes more sense).
from mup.
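The distinction can be sketched numerically. Below is a toy helper (hypothetical, not part of the `mup` package); the fixed-epochs branch uses the common linear-scaling heuristic, while the fixed-steps branch leaves the lr alone:

```python
def adjust_lr(base_lr, base_bs, new_bs, fixed="epochs"):
    """Toy illustration of the lr/batch-size dependency.

    With a fixed number of *epochs*, doubling the batch size halves the
    number of steps, so a common heuristic scales lr linearly with batch
    size.  With a fixed number of *steps* (the setting here, since the
    tuning problem is shrunk), each step already sees proportionally
    more data, so the tuned lr is kept as-is.
    """
    if fixed == "epochs":
        return base_lr * new_bs / base_bs  # linear-scaling heuristic
    return base_lr  # fixed step budget: no rescaling

# doubling the batch at fixed epochs doubles the lr ...
print(adjust_lr(0.1, 32, 64))                 # -> 0.2
# ... but at a fixed step budget it is left unchanged
print(adjust_lr(0.1, 32, 64, fixed="steps"))  # -> 0.1
```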
Thanks for the question, Timofey.
We note in the second paragraph of the intro that:

> In addition to width, we empirically verify that, with a few caveats, HPs can also be transferred across depth (in Section 6.1) as well as batch size, language model sequence length, and training time (in Appendix G.2.1). This reduces the tuning problem of an (arbitrarily) large model to that of a (fixed-sized) small model. Our overall procedure, which we call µTransfer, is summarized in Algorithm 1 and Fig. 2, and the HPs we cover are summarized in Tables 1 and 2.
This is also mentioned in the caption of Table 1.
You are right that mup doesn't give us any theoretical guarantees for these dimensions, but we need to consider them to make muTransfer useful in practice, which is why we verified them empirically.
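By contrast, width is the dimension µP does cover theoretically. As a toy illustration (a hypothetical helper, not the `mup` package API): under µP with Adam, a base lr tuned at a small width is reused at any width, with matrix-like hidden/output weights rescaled roughly as lr ∝ 1/fan_in, while vector-like parameters keep the base lr:

```python
def mup_adam_lr(base_lr, base_width, width, param="hidden"):
    """Hypothetical sketch of per-layer µP lr scaling for Adam (not the
    mup package API).  The base lr tuned at base_width transfers to any
    width: matrix-like hidden/output weights are rescaled by
    base_width/width (lr proportional to 1/fan_in), while vector-like
    params (biases, norms) keep the base lr unchanged.
    """
    if param in ("hidden", "output"):
        return base_lr * base_width / width
    return base_lr

# an lr tuned at width 256 is rescaled when the model is widened to 4096
print(mup_adam_lr(1e-3, 256, 4096))           # -> 6.25e-05
print(mup_adam_lr(1e-3, 256, 4096, "bias"))   # -> 0.001
```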
Related Issues (20)
- Positional Embeddings should be MuReadout parameters?
- Warmup schedule when changing the number of tokens/steps (GPT-3 experiment detail)
- Reproducing the training loss vs learning rates curve on MLP
- Once the best HPs have been found, does the final model have to be trained with `mup` or can one just use the found HPs and train the model in a standard way?
- Is it possible to also scale the depth of the model?
- _rescale_parameters() inconsistent with the paper for the tied embedding scenario?
- µTransfer across batch size && weight decay setting
- Some questions about the implementation of muP.
- Interpreting jitter in coordcheck
- FSDP support?
- Usage with torch.compile in Pytorch 2?
- dim_feedforward
- Unclear `assert_hidden_size_inf` triggers
- About Learning rate decay
- Questions for training gpt-2 using mup
- Reproducing the validation accuracy vs learning rates curve on ResNet
- coord_check for model that returns loss function directly
- Reproducing Figure 1 using 'examples/Transformer/main.py'
- mu parametrization for gated-mlp and group-query attention
- Increasing coord check for the network output