Comments (5)
Sorry for the possible confusion. I see two reasons to resume the pre-training process:
- You have a time limit for your experiment, can't fit the entire pre-training into it, and need to split it into more than one job. Then resuming the pre-training is as easy as calling accelerator.load_state, which restores the LR scheduler and all other states correctly and resumes the pre-training until it finishes after 2^16 steps.
- You have finished pre-training for the desired number of steps (2^16), but then you want to take the checkpoint after 2^16 steps and continue pre-training it for the next 2^16 steps. To do so you need to update the LR scheduler with the new desired total number of steps (2^17) so that you properly decay the LR during the second half of your training (see the sketch after this list).
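For the second case, here is a minimal sketch of what updating the scheduler could look like, assuming a cosine schedule built with transformers.get_cosine_schedule_with_warmup; nanoT5's actual optimizer and scheduler come from its Hydra config, so the model, warmup steps, and checkpoint path below are purely illustrative:

```python
import torch
from accelerate import Accelerator
from transformers import get_cosine_schedule_with_warmup

accelerator = Accelerator()

# Illustrative stand-ins for the real T5 model and optimizer from the config.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-2)

# Build the scheduler for the *new* horizon: 2**17 total steps instead of 2**16,
# so the LR keeps decaying over the extended run instead of sitting at its floor.
lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10_000,        # illustrative value
    num_training_steps=2**17,
)

model, optimizer, lr_scheduler = accelerator.prepare(model, optimizer, lr_scheduler)

# Restore model/optimizer/scheduler/RNG states from the checkpoint written at
# step 2**16; the step counter comes from the checkpoint, while the longer
# decay horizon comes from the freshly built scheduler above.
accelerator.load_state("checkpoint-pt-65536")   # illustrative path

# ...then run the usual training loop from step 2**16 up to 2**17.
```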
Good luck, and ask if you have any further questions.
from nanot5.
@QizhiPei Hi. Yes, you can resume training. See my notebook for more details: https://github.com/iSevenDays/nanoT5/blob/main/nanoT5/train.ipynb
from nanot5.
Thanks for your kind help!
from nanot5.
Sorry for the late reply!
@QizhiPei I'm using the HF Accelerator to save the state (Code Pointer). It should be quite easy to load the state to resume the pre-training process; you can check out the HF tutorial here. It should basically be a one-liner like accelerator.load_state(path_to_checkpoint) in main.py before the train call (a rough sketch of where that line sits is shown below).
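As a rough illustration, here is a hedged sketch of the save/resume pattern with HF Accelerate. The model, optimizer, step counts, and checkpoint path are placeholders rather than nanoT5's actual config values; only accelerator.save_state and accelerator.load_state are the real Accelerate calls:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Placeholder model/optimizer standing in for the T5 model and its optimizer.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)

# Resume: restore model, optimizer, scheduler, and RNG states written earlier
# by accelerator.save_state, *before* entering the training loop.
accelerator.load_state("checkpoint-pt-10000")  # placeholder checkpoint path

for step in range(10_000, 2**16):  # continue from the restored step
    # ...forward/backward/optimizer.step() as in the normal training loop...
    if step % 10_000 == 0:
        accelerator.save_state(f"checkpoint-pt-{step}")  # periodic checkpoints
```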
Continuing pre-training (beyond the 2**16 steps set by default) is slightly more complex, because you'd need to adjust the LR scheduler appropriately.
Let me know if it works!
from nanot5.
Thanks for your suggestions!
I successfully loaded the saved checkpoints. However, it seems that accelerator.load_state also loads the scheduler state. Could you explain in more detail what "adjust the LR scheduler appropriately" means?
Thanks again!
from nanot5.
Related Issues (20)
- RMS scaling issues HOT 15
- pre-train on long context. HOT 1
- How to run on CPU HOT 1
- Difficulty applying NanoT5 to different model and database HOT 2
- AttributeError: Can't pickle local object 'IterableDataset.map.<locals>.<lambda>' HOT 1
- About pre-training on another dataset HOT 7
- Pre-training fails at step 30155 out of 32768 steps every time HOT 7
- self-defined loss function failed to work (torch._dynamo.exc.InternalTorchDynamoError: ln_encoder) HOT 4
- nanoT5 initializes lm_head weights with 768x too much variance, probably HOT 19
- Transformation to HF model
- Pre-train on different Dataset than C4 HOT 1
- Flash attention HOT 2
- Larger models and training on the Pile HOT 5
- How to create pytorch_model.bin file? HOT 1
- Silly question: Why do you need to re-implement T5 model? HOT 3
- Question about implementing whole word masking in nanoT5 HOT 1
- Beginner Question : Would it be wise to use this as a backbone for custom seq2seq modeling fMRI data and custom encoder? HOT 2
- Learning rate for multi-GPUs training HOT 3
- Just a quick question to pretrain Flan-T5 HOT 5
- Continued pretraining from official models. HOT 1