
Comments (6)

williamFalcon commented on May 3, 2024

No worries. I think we're saying the same thing but in different ways haha.

I don't think we need to add any more arguments. The only change to the public API that needs to be made is to load the latest weights whenever the exp version is specified.

Case 1: No exp version given. Version is bumped (current behavior)

Case 2: Exp version is given. Then exp continues where it left off (this is already the current behavior). The only addition here is that it should load the latest weights available for that version.

Case 2 covers the three flags you mentioned, so there's no need to add any more arguments; we just change the internal loading (i.e., look for the latest checkpoint at the beginning of training). If the version is new, there won't be any. If it's an old version, the user must have set it on purpose and would therefore expect the latest weights to be loaded.
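
A minimal sketch of that internal lookup, assuming all checkpoints for a version sit in one directory and end in an epoch number (for example `model_epoch_3.ckpt`). The helper name `find_latest_checkpoint` and the file naming pattern are hypothetical, chosen for illustration; they are not the library's actual API.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: "<anything>_epoch_<N>.ckpt".
_CKPT_PATTERN = re.compile(r"_epoch_(\d+)\.ckpt$")


def find_latest_checkpoint(ckpt_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest epoch number, or None."""
    directory = Path(ckpt_dir)
    if not directory.is_dir():
        # New version: nothing has been saved yet, so training starts fresh.
        return None

    best_epoch, best_path = -1, None
    for path in directory.iterdir():
        match = _CKPT_PATTERN.search(path.name)
        # Compare epochs numerically so that epoch 10 beats epoch 2.
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path
```

Returning `None` for a new version keeps Case 1 unchanged: with no checkpoint found, training simply starts from scratch.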

from lightning.

williamFalcon commented on May 3, 2024

Great suggestion. PR for this would be great, but let's do the feature in a slightly different way.

Let's pick 1 way of doing it with these considerations.

  1. Test-tube logs and weights can't be forced to be in the same path. A user might be restricted to saving weights and logs to different folders. This could be for disk space or for ease of copying files with SCP (especially true on corporate or .edu clusters).

  2. With that said, continue training can still work. It's actually already supported but meant for cluster training only. However, I agree it's a good idea to change the signatures and docs a bit to ALSO allow this for non-cluster training.

To do that here are the things that need to be changed:

  1. Update the save function to behave like this hpc_save function (which saves training state as well).
  2. Delete the hpc_save function.
  3. Register the new save fx here (for cluster tng).
  4. Change the signature of hpc_load.
  5. Register the new hpc_load in the same fx as (3).
  6. This is where it gets tricky. It might be easier to give the trainer either a new flag or a function like restore_training. Not sure which one might be a better UX.

Let's pick one of these two:

A)

trainer = Trainer(...)
trainer.restore_training(weights_path)

B)
Or if the version is the same in the exp, it'll just pick right back up:

exp = Experiment(version=same_as_before)
trainer = Trainer(experiment=exp)

  7. Change the relevant tests.
  8. Update docs.

At the end of the day, we want hpc_save/load and non-hpc_save/load to now behave the same.
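
As a rough illustration of what "saves training state as well" could mean, here is a sketch that bundles the model weights with the epoch, step, and optimizer state into one checkpoint. The function names and dictionary keys are assumptions made for this example, and `pickle` stands in for serialization so the snippet runs without dependencies; in a PyTorch project `torch.save` and `torch.load` would typically take its place.

```python
import pickle
from typing import Any, Dict, List


def save_training_state(path: str, model_state: Dict[str, Any],
                        optimizer_states: List[Dict[str, Any]],
                        epoch: int, global_step: int) -> None:
    """Write weights plus everything needed to resume training (hypothetical layout)."""
    checkpoint = {
        "epoch": epoch,
        "global_step": global_step,
        "state_dict": model_state,
        "optimizer_states": optimizer_states,
    }
    with open(path, "wb") as f:
        pickle.dump(checkpoint, f)


def load_training_state(path: str) -> Dict[str, Any]:
    """Read back the full checkpoint dictionary written by save_training_state."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

If both the cluster and non-cluster code paths call the same pair of functions, a checkpoint written by one can be restored by the other.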

Thoughts?

@cinjon any suggestions?


lkhphuc commented on May 3, 2024

Great. I will start reading the hpc code, since I haven't used DDP before, and try making some changes.


williamFalcon commented on May 3, 2024

ok, so, let’s do it where if the test tube version is the same then it continues training?

so, nothing of the external api needs to change, just internal + docs


williamFalcon commented on May 3, 2024

@lkhphuc how's it going? Did you take a look at the hpc method signatures?


lkhphuc commented on May 3, 2024

Sorry, I've been busy the last few days. I will spend some time on it next week.
For the API of this change, I think:

  • Every time you run an experiment, the version of that experiment is bumped (current behavior).
  • With the additional argument --load-model=best/latest, the best/latest checkpoint from the latest version is loaded.
  • With the additional arguments --load-model=best/latest --from-version=X, the best/latest checkpoint from version X is loaded.
  • With only the additional argument --from-version=X, the latest checkpoint from version X is loaded.

Altogether, I will introduce two new arguments in Trainer. By default the behavior stays as it is now; combined, they can load the best/latest checkpoint from the latest version or from version X.
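
To make the four cases concrete, here is a small sketch of how the two proposed flags might resolve to a (checkpoint, version) pair. It uses `argparse` only for illustration; the function name `parse_restore_args` and the return convention are assumptions, not an existing Trainer interface.

```python
import argparse
from typing import Optional, Sequence, Tuple


def parse_restore_args(argv: Sequence[str]) -> Tuple[Optional[str], Optional[str]]:
    """Map the proposed flags to (which_checkpoint, which_version)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--load-model", choices=["best", "latest"], default=None)
    parser.add_argument("--from-version", default=None)
    args = parser.parse_args(argv)

    if args.load_model is None and args.from_version is None:
        # Neither flag given: keep the current behavior (bump version, no restore).
        return None, None

    # A missing flag falls back to "latest", matching the bullet list above.
    which = args.load_model or "latest"
    version = args.from_version or "latest"
    return which, version
```

For example, `parse_restore_args(["--from-version=3"])` resolves to the latest checkpoint of version `"3"`, while an empty argument list leaves today's behavior untouched.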

What do you think?
Should --load-model also accept best/latest/X, where X is an arbitrary epoch number?

