No worries. I think we're saying the same thing but in different ways haha.
I don't think we need to add any more arguments. The only change to the public API that needs to be made is to load the latest weights whenever the exp version is specified.
Case 1: No exp version given. The version is bumped (current behavior).
Case 2: An exp version is given. Then the exp continues where it left off (this is already the current behavior). The only addition is that it should load the latest weights available for that version.
Case 2 covers the 3 flags you mentioned, so there's no need to add any more arguments; just change the internal loading (i.e., look for the latest checkpoint at the beginning of training). If the version is new, there won't be any; if it's an old version, the user had to have set that on purpose and would thus expect the latest weights to be loaded.
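The "look for the latest checkpoint at the beginning of training" step could be sketched roughly like this. The function name, the filename pattern, and the directory layout are assumptions for illustration, not Lightning's actual implementation:

```python
import os
import re
from typing import Optional

def find_latest_checkpoint(ckpt_dir: str) -> Optional[str]:
    """Return the newest checkpoint path, or None for a brand-new version."""
    if not os.path.isdir(ckpt_dir):
        return None  # new version: nothing to restore, train from scratch
    latest, latest_epoch = None, -1
    for name in os.listdir(ckpt_dir):
        # assumed filename convention: ..._ckpt_epoch_<N>.ckpt
        m = re.search(r"epoch_(\d+)\.ckpt$", name)
        if m and int(m.group(1)) > latest_epoch:
            latest_epoch = int(m.group(1))
            latest = os.path.join(ckpt_dir, name)
    return latest
```

If this returns `None`, the version is new and training starts fresh; otherwise the returned checkpoint is loaded, which is exactly the "Case 2" behavior described above.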
from lightning.
Great suggestion. A PR for this would be great, but let's do the feature in a slightly different way. Let's pick one way of doing it, with these considerations:

- Test-tube logs and weights can't be forced to be in the same path. A user might have a restriction to save weights and logs to different folders, whether for disk space or for ease of SCPing files (especially true on corporate or .edu clusters).
- With that said, continued training can still work. It's actually already supported, but meant for cluster training only. However, I agree it's a good idea to change the signatures and docs a bit to ALSO allow this for non-cluster training.
To do that, here are the things that need to change:

1. Update the save function to behave like this hpc_save function (which saves training state as well).
2. Delete the hpc_save function.
3. Register the new save function here (for cluster training).
4. Change the signature of hpc_load.
5. Register the new hpc_save in the same function as (3).
6. This is where it gets tricky. It might be easier to give the trainer either a new flag or a function like restore_training; not sure which one is the better UX. Let's pick one of these two:

   A)

   ```python
   trainer = Trainer(...)
   trainer.restore_training(weights_path)
   ```

   B) If the version in the exp is the same, it'll just pick right back up:

   ```python
   exp = Experiment(version=same_as_before)
   trainer = Trainer(experiment=exp)
   ```

7. Change the relevant tests.
8. Update docs.
At the end of the day, we want hpc_save/load and non-hpc_save/load to now behave the same.
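As a minimal sketch of that end state: one save/load pair used for both hpc and non-hpc runs, bundling training state with the weights. The key names and the use of pickle are assumptions for illustration only, not Lightning's actual checkpoint schema:

```python
import pickle

def save_checkpoint(path, model_state, optimizer_state, epoch, global_step):
    """Save weights AND training state, as hpc_save already does."""
    with open(path, "wb") as f:
        pickle.dump({
            "state_dict": model_state,
            "optimizer_state": optimizer_state,
            "epoch": epoch,
            "global_step": global_step,
        }, f)

def load_checkpoint(path):
    """Restore everything needed to continue training where it left off."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Because the same bundle is written everywhere, resuming after a cluster preemption and resuming a local run become the same code path.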
Thoughts?
@cinjon any suggestions?
Great. I will start reading the hpc code, as I haven't used DDP before, and try making some changes.
Ok, so let's do it where, if the test-tube version is the same, then it continues training? So nothing in the external API needs to change, just internals + docs.
@lkhphuc how's it going? Did you take a look at the hpc method signatures?
Sorry, I've been busy the last few days. I will spend some time on it in the next week.
Around the API for this change, I think:

- Every time you run an experiment, the version of that experiment is bumped (current behavior).
- If the additional argument `--load-model=best/latest` is given, the best/latest checkpoint from the latest version is loaded back.
- If the additional arguments `--load-model=best/latest --from-version=X` are given, the best/latest checkpoint from version X is loaded back.
- If the additional argument `--from-version=X` is given, the latest checkpoint from version X is loaded.

Together, I will introduce two new arguments in Trainer; the defaults will behave like the current behavior, and combined they can load the best/latest checkpoint from version latest/X. What do you think?

Should `--load-model=best/latest/X`, with X an arbitrary epoch number, also be supported?
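The four cases above could resolve into a single load decision roughly like this. The argument names `--load-model` and `--from-version` come from the proposal; the function name, defaults, and return shape are hypothetical:

```python
def resolve_restore(load_model=None, from_version=None, latest_version=0):
    """Return (version, checkpoint_kind) to load, or None to start fresh."""
    if load_model is None and from_version is None:
        return None  # current behavior: bump the version, train from scratch
    # --from-version=X selects version X; otherwise default to the latest
    version = from_version if from_version is not None else latest_version
    # --load-model picks best/latest; --from-version alone implies latest
    kind = load_model if load_model is not None else "latest"
    return (version, kind)
```

Keeping this resolution in one pure function would make the default-argument behavior (no restore at all) easy to test independently of any training code.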
from lightning.