No worries. I think we're saying the same thing but in different ways haha.
I don't think we need to add any more arguments. The only change to the public API that needs to be made is to load the latest weights whenever the exp version is specified.
Case 1: No exp version given. The version is bumped (current behavior).
Case 2: An exp version is given. Then the exp continues where it left off (this is already the current behavior). The only addition is that it should load the latest weights available for that version.
Case 2 covers the 3 flags you mentioned, so there's no need to add any more arguments; just change the internal loading (i.e., look for the latest checkpoint at the beginning of training). If the version is new, there won't be any; if it's an old version, the user had to have set that on purpose and would thus expect the latest weights to be loaded.
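The "look for the latest checkpoint at the beginning of training" step could be sketched roughly like this. The function name, the filename pattern, and the directory layout are assumptions for illustration, not Lightning's actual implementation:

```python
import os
import re
from typing import Optional

def find_latest_checkpoint(ckpt_dir: str) -> Optional[str]:
    """Return the newest checkpoint path, or None for a brand-new version."""
    if not os.path.isdir(ckpt_dir):
        return None  # new version: nothing to restore, train from scratch
    latest, latest_epoch = None, -1
    for name in os.listdir(ckpt_dir):
        # assumed filename convention: ..._ckpt_epoch_<N>.ckpt
        m = re.search(r"epoch_(\d+)\.ckpt$", name)
        if m and int(m.group(1)) > latest_epoch:
            latest_epoch = int(m.group(1))
            latest = os.path.join(ckpt_dir, name)
    return latest
```

If this returns `None`, the version is new and training starts fresh; otherwise the returned checkpoint is loaded, which is exactly the "Case 2" behavior described above.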
from lightning.
Great suggestion. A PR for this would be great, but let's do the feature in a slightly different way. Let's pick one way of doing it, with these considerations:

- Test-tube logs and weights can't be forced to be in the same path. A user might have a restriction to save weights and logs to different folders, whether for disk space or for ease of SCPing files (especially true on corporate or .edu clusters).
- With that said, continued training can still work. It's actually already supported, but meant for cluster training only. However, I agree it's a good idea to change the signatures and docs a bit to ALSO allow this for non-cluster training.
To do that, here are the things that need to change:

1. Update the save function to behave like this hpc_save function (which saves training state as well).
2. Delete the hpc_save function.
3. Register the new save function here (for cluster training).
4. Change the signature of hpc_load.
5. Register the new hpc_save in the same function as (3).
6. This is where it gets tricky. It might be easier to give the trainer either a new flag or a function like restore_training; not sure which one is the better UX. Let's pick one of these two:

   A)

   ```python
   trainer = Trainer(...)
   trainer.restore_training(weights_path)
   ```

   B) If the version in the exp is the same, it'll just pick right back up:

   ```python
   exp = Experiment(version=same_as_before)
   trainer = Trainer(experiment=exp)
   ```

7. Change the relevant tests.
8. Update docs.
At the end of the day, we want hpc_save/load and non-hpc_save/load to now behave the same.
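As a minimal sketch of that end state: one save/load pair used for both hpc and non-hpc runs, bundling training state with the weights. The key names and the use of pickle are assumptions for illustration only, not Lightning's actual checkpoint schema:

```python
import pickle

def save_checkpoint(path, model_state, optimizer_state, epoch, global_step):
    """Save weights AND training state, as hpc_save already does."""
    with open(path, "wb") as f:
        pickle.dump({
            "state_dict": model_state,
            "optimizer_state": optimizer_state,
            "epoch": epoch,
            "global_step": global_step,
        }, f)

def load_checkpoint(path):
    """Restore everything needed to continue training where it left off."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Because the same bundle is written everywhere, resuming after a cluster preemption and resuming a local run become the same code path.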
Thoughts?
@cinjon any suggestions?
Great. I will start reading the hpc code, as I haven't used DDP before, and try making some changes.
Ok, so let's do it where, if the test-tube version is the same, then it continues training? So nothing in the external API needs to change, just internals + docs.
@lkhphuc how's it going? Did you take a look at the hpc method signatures?
Sorry, I've been busy the last few days. I will spend some time on it in the next week.
Around the API for this change, I think:

- Every time you run an experiment, the version of that experiment is bumped (current behavior).
- If the additional argument `--load-model=best/latest` is given, the best/latest checkpoint from the latest version is loaded back.
- If the additional arguments `--load-model=best/latest --from-version=X` are given, the best/latest checkpoint from version X is loaded back.
- If the additional argument `--from-version=X` is given, the latest checkpoint from version X is loaded.

Together, I will introduce two new arguments in Trainer; the defaults will behave like the current behavior, and combined they can load the best/latest checkpoint from version latest/X. What do you think?

Should `--load-model=best/latest/X`, with X an arbitrary epoch number, also be supported?
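The four cases above could resolve into a single load decision roughly like this. The argument names `--load-model` and `--from-version` come from the proposal; the function name, defaults, and return shape are hypothetical:

```python
def resolve_restore(load_model=None, from_version=None, latest_version=0):
    """Return (version, checkpoint_kind) to load, or None to start fresh."""
    if load_model is None and from_version is None:
        return None  # current behavior: bump the version, train from scratch
    # --from-version=X selects version X; otherwise default to the latest
    version = from_version if from_version is not None else latest_version
    # --load-model picks best/latest; --from-version alone implies latest
    kind = load_model if load_model is not None else "latest"
    return (version, kind)
```

Keeping this resolution in one pure function would make the default-argument behavior (no restore at all) easy to test independently of any training code.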
from lightning.