
Comments (12)

geoalgo commented on June 16, 2024

Hi Andreas,

I agree this is a great suggestion; we will add an example to the docs.

To answer your question, the tuner is regularly checkpointed (if you keep the Tuner's default options), so you can resume a tuning run by loading its checkpoint:

from syne_tune.experiments import load_experiment
from syne_tune import StoppingCriterion

# Loads a previous experiment, sets `load_tuner` to True to deserialize the Tuner
tuning_experiment = load_experiment("plot-results-demo-2023-10-10-07-27-48-235", load_tuner=True)

# Update the stopping criterion to run the tuning for a few more trials
tuner = tuning_experiment.tuner
tuner.stop_criterion = StoppingCriterion(max_num_trials_started=100)
tuner.run()

See the screenshot below for the output of resuming a previous experiment:
[Screenshot 2023-10-24 at 17 26 53]

Note that the tuner is serialized with dill, so this should only be done if you trust the file.

from syne-tune.

amueller commented on June 16, 2024

Awesome, that's easy and makes sense, but also would be good to call out in the docs :)


amueller commented on June 16, 2024

One more note on restarting the tuner: I would add a recommendation to the docs to back up the tuner before restarting. I pressed Ctrl-C after reloading, and the file got corrupted / was emptied.
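A minimal sketch of such a backup, assuming the tuner state lives in a single serialized file inside the experiment directory (the file name `tuner.dill` is an assumption, based on the tuner being pickled with dill; adjust it to whatever your experiment directory actually contains):

```python
import shutil
from pathlib import Path


def backup_tuner_state(experiment_path: str) -> Path:
    """Copy the serialized tuner file aside, so that a crash or Ctrl-C
    while resuming cannot destroy the only copy of the tuner state.

    The file name ``tuner.dill`` is an assumption; check what your
    experiment directory actually contains.
    """
    src = Path(experiment_path) / "tuner.dill"
    dst = src.parent / (src.name + ".bak")
    shutil.copy2(src, dst)  # copy2 also preserves file metadata
    return dst
```

Restoring is the reverse copy; doing this before every `tuner.run()` on a reloaded tuner is cheap insurance.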


geoalgo commented on June 16, 2024

This is a good point. Right now, the example modifies some internals to update the configuration space, but ideally we would have clear setter methods, with tests, and a list of the properties that can be changed.

For now, I believe it makes sense to allow users to perform those modifications in case they want to experiment with something (for instance, your case of resuming tuning while changing the checkpoint-deletion option), and to add those features as soon as multiple users ask for them.

Regarding your change, I think it does the right thing but @mseeger would be the best person to confirm.


amueller commented on June 16, 2024

Btw, figuring out which aspects of a model can be changed at which stage of training is something scikit-learn hasn't even figured out yet; it's definitely not easy to solve in general.
I think this specific case is quite relevant, since it's on the natural progression path of a new user trying to make things work, and the fact that the hard drive is full probably means they have already invested quite a bit of compute that they'd like to reuse.


amueller commented on June 16, 2024

Oh I got one more follow-up: Let's say I want to re-use experiments with a different tuner. Is that also possible? Say I want to expand a parameter range, or maybe vary one more parameter. It seems silly to start from scratch then.


geoalgo commented on June 16, 2024

> Awesome, that's easy and makes sense, but also would be good to call out in the docs :)

Completely agree, I am planning to add a FAQ example (you are not the first person to ask :-)).


geoalgo commented on June 16, 2024

> Oh I got one more follow-up: Let's say I want to re-use experiments with a different tuner. Is that also possible? Say I want to expand a parameter range, or maybe vary one more parameter. It seems silly to start from scratch then.

Here the problem depends more on the scheduler you are using. Conceptually, it should work for everything that looks like random search (and ASHA), but it is not tested and I am not sure which schedulers would work in this mode.

There was this paper, https://arxiv.org/abs/2010.13117, that proposed some strategies for this problem, but I would say the problem is not well studied.

Edit: added the not :-)


amueller commented on June 16, 2024

Are you saying it's not a well-studied problem?

The general case could get arbitrarily complicated, I think, but I'm mostly interested in the case where the true function stays the same but the parameters of the search or the domain of the function change.
In these settings, it should be possible to re-use old data by filling in missing values in the search space. Say we fixed super_duper_option=True before and now we vary it: we would need to inform the scheduler that the previous points correspond to super_duper_option=True, since that option wasn't included in the old search space.

Re-using previous runs would lead to a different sampling bias but hopefully the acquisition function can compensate for that?

If the true function changes, say you change the dataset or model in some way, that seems much trickier, and I wasn't really asking about that.
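The back-filling described above can be sketched with plain dictionaries (a sketch only: `backfill_configs` and the parameter names are hypothetical, not Syne Tune API):

```python
def backfill_configs(old_configs, new_space_defaults):
    """Augment trial configs from a previous experiment with the values
    that were implicitly fixed back then, so each old trial becomes a
    valid point in an expanded search space.

    ``new_space_defaults`` maps each newly varied parameter to the value
    it implicitly had in the old experiment.
    """
    filled = []
    for config in old_configs:
        merged = dict(new_space_defaults)  # old implicit values first
        merged.update(config)              # explicitly recorded values win
        filled.append(merged)
    return filled


# The previous experiment only varied the learning rate...
old = [{"lr": 0.1}, {"lr": 0.01}]
# ...while super_duper_option was implicitly fixed to True.
new = backfill_configs(old, {"super_duper_option": True})
```

The sampling-bias concern from the comment above remains: the back-filled points all sit on one slice of the expanded space, so a model-based scheduler would see no signal at all for the newly varied parameter until fresh trials are run.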


amueller commented on June 16, 2024

One related thing that would be useful is knowing what can be changed, and how, once a tuner is loaded.
Let's say my tuner crashed because my hard drive was full, so now I want to restart it and enable automatic checkpoint removal. It's not entirely clear to me how to do that. I tried

import sys

from syne_tune.experiments import load_experiment

# Reload the crashed experiment, including the serialized Tuner
tuner = load_experiment(sys.argv[1], load_tuner=True).tuner
# Try to switch on checkpoint removal after the fact
tuner.trial_backend.delete_checkpoints = True
tuner.scheduler.early_checkpoint_removal_kwargs = {"max_num_checkpoints": 80}
tuner.scheduler._initialize_early_checkpoint_removal({"max_num_checkpoints": 80})
tuner.run()

but I don't think that had the desired effect?


mseeger commented on June 16, 2024

You are asking for a lot here. You can try adding tuner._initialize_early_checkpoint_removal() before tuner.run() above and see whether it works. Most likely, it will not. The feature is based on a callback whose state depends on what happened during the experiment. It would be very difficult to recreate this state from checkpoints written during an experiment in which the feature was not enabled.

What may work is to enable the removal feature up front, but with a large max_num_checkpoints. Maybe in that case, the callback can be amended (i.e., max_num_checkpoints can be lowered) when restarting.
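A toy illustration of why lowering the cap on restart is plausible while enabling removal retroactively is not (this is not the Syne Tune callback, just a stateless approximation of its pruning rule):

```python
def checkpoints_to_remove(trial_scores, max_num_checkpoints):
    """Given a mapping trial_id -> best metric so far (higher is better),
    return the trial ids whose checkpoints should be removed so that at
    most ``max_num_checkpoints`` remain.

    The decision here depends only on current scores and the cap, so
    re-running it with a smaller cap simply prunes further. The real
    callback additionally carries history accumulated during the run,
    which cannot be reconstructed if the feature was off.
    """
    ranked = sorted(trial_scores, key=trial_scores.get, reverse=True)
    return set(ranked[max_num_checkpoints:])


scores = {0: 0.91, 1: 0.85, 2: 0.97, 3: 0.40}
# Generous cap while the experiment runs: nothing is removed yet.
assert checkpoints_to_remove(scores, 80) == set()
# Lower the cap on restart: only the two best checkpoints survive.
assert checkpoints_to_remove(scores, 2) == {1, 3}
```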


amueller commented on June 16, 2024

Yeah, it makes sense that that's complex to support; it might not be worth the hassle. Another option would be to make your second suggestion the default. My issue, and maybe that of other new users, is that I didn't realize how quickly this would become a problem. Though your tutorial does point it out, so you could also just say it was user error.

