Comments (10)
Thank you for all your replies! I think it would be great for a user to have two options to choose from: the simple one of storing a checkpoint every epoch, and the slightly more advanced (but probably also very simple to implement) one of catching a SIGTERM to store a checkpoint on demand.
Basically, I got caught up in the docs that explain the helper functions and was a bit confused. Making clear that the currently recommended solution is to store a checkpoint every epoch also seems totally fine.
from syne-tune.
Btw, I forgot to mention we have this example that shows how to activate checkpointing for an xgboost script:
- example: https://github.com/awslabs/syne-tune/blob/main/examples/launch_checkpoint_example.py
- associated training script: https://github.com/awslabs/syne-tune/blob/main/examples/training_scripts/xgboost/xgboost_checkpoint.py
It may be useful as perhaps you are considering scikit-learn pipelines.
I am planning to look into whether using SIGTERM works with checkpointing; I will let you know.
from syne-tune.
Thanks, I made checkpointing work. I'm actually using custom PyTorch transformers 🤯
Crashed when the hard drive was full of checkpoints, now looking at your docs for fixing that :) The journey continues!
from syne-tune.
Hello Andreas,
We want to keep things as simple as possible on the side of the training script. In fact, in all our examples, checkpointing is done at the end of each epoch. Your suggestion would save time by checkpointing less often, but it would be more difficult to implement.
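For readers following along, here is a minimal sketch of the "checkpoint at the end of each epoch" pattern. It uses only the Python standard library rather than the Syne Tune helper functions, and the names `save_checkpoint`, `load_checkpoint`, `train`, and the file name `checkpoint.json` are assumptions made for illustration. The training step is a placeholder.

```python
import json
import os


def save_checkpoint(checkpoint_dir, state):
    """Write the state to a single file, replacing any earlier checkpoint."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, "checkpoint.json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    # Rename over the old file so a crash mid-write cannot corrupt it.
    os.replace(tmp_path, path)


def load_checkpoint(checkpoint_dir):
    """Return the stored state, or None if no checkpoint exists yet."""
    path = os.path.join(checkpoint_dir, "checkpoint.json")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def train(checkpoint_dir, max_epochs):
    """Resume from the last stored epoch and checkpoint after every epoch."""
    state = load_checkpoint(checkpoint_dir) or {"epoch": 0, "loss": None}
    for epoch in range(state["epoch"] + 1, max_epochs + 1):
        loss = 1.0 / epoch  # placeholder for one real epoch of training
        state = {"epoch": epoch, "loss": loss}
        save_checkpoint(checkpoint_dir, state)
    return state
```

Because every epoch writes to the same file, the directory holds one checkpoint no matter how many epochs have run.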
from syne-tune.
Ah ok, the documentation suggested using the helper functions, which seems more complicated than handling the exception, as it requires passing the config objects around. But I saw that in the end the checkpoint is actually always stored. So yes, it's more complicated than always storing, but it seems easier than the method that the code suggests was planned.
from syne-tune.
Hi Andreas,
This makes sense. You also made a good point about using SIGTERM instead of SIGKILL; I will take a look and let you know if we can support it.
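The distinction matters because a process can install a handler for SIGTERM, whereas SIGKILL cannot be caught. Below is a minimal sketch of storing a checkpoint on demand, assuming the backend would send SIGTERM before stopping a trial (which this thread does not confirm). `GracefulStop`, `train`, `run_epoch`, and `save_checkpoint` are hypothetical names, not part of Syne Tune.

```python
import signal


class GracefulStop:
    """Records that SIGTERM arrived so the loop can react at a safe point."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        # Only set a flag here; the actual saving happens in the training loop.
        self.requested = True

    def install(self):
        # Must be called from the main thread of the training script.
        signal.signal(signal.SIGTERM, self.handler)


def train(max_epochs, stop, run_epoch, save_checkpoint):
    """Run epochs; checkpoint only when a stop has been requested."""
    for epoch in range(1, max_epochs + 1):
        run_epoch(epoch)
        if stop.requested:
            save_checkpoint(epoch)
            return epoch
    return max_epochs
```

In a training script, one would create a `GracefulStop`, call `install()` once before the loop, and pass it to `train`. The checkpoint is written at the end of the epoch in which the signal arrived, so the process needs enough grace time to finish that epoch.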
from syne-tune.
https://syne-tune.readthedocs.io/en/latest/faq.html#checkpoints-are-filling-up-my-disk-what-can-i-do
The number of checkpoints stored scales only with the number of trials, not with how often a trial stores a checkpoint. Storing after every epoch simply overwrites what was there before.
There is a speculative checkpoint removal feature, which is documented in the link above. Maybe this helps.
from syne-tune.
Hi @amueller, was that link helpful or do you need further support?
from syne-tune.
I'm good, thank you!
from syne-tune.
OK perfect, closing.
from syne-tune.