Comments (4)
@yixiaoer, are you running a spot or an on-demand TPU? `INTERRUPTED_BY_NO_CAPACITY` means the instance became unavailable. This usually happens with spot instances when the cloud provider interrupts them.
If the task can handle interruptions, you can specify a retry policy to resubmit the run when it gets interrupted.
If the task cannot handle interruptions, consider using on-demand instances: pass `--on-demand` to `dstack run`.
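For reference, a retry policy in a run configuration might look roughly like the sketch below. This is an assumption based on recent dstack releases, where retries are configured with a `retry` block. The key names and accepted values differ between dstack versions, and `python train.py` is only a placeholder command, so check the configuration reference for the version you run.

```yaml
# Hypothetical task configuration; key names assume a recent dstack release.
type: task
commands:
  - python train.py   # placeholder command, not from this thread
retry:
  # Resubmit the run when capacity is unavailable or the instance is interrupted
  on_events: [no-capacity, interruption]
  # Give up retrying after this much time
  duration: 1h
```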
from dstack.
I was running on a spot TPU. After I specified the retry option for interruptions, the run was retried, but it later lost the connection again.
I also tried the `--on-demand` option with `dstack run`, but the problem persists.
Could this be related to the TPU memory capacity? The dataset being downloaded is quite large (approximately 22GB, downloaded to `/dev/shm`). However, the running code reported no errors. I also specified the following in `.dstack.yml`:
    resources:
      memory: 100GB
      shm_size: 50GB
Is this the correct way to specify resources for TPUs? Given the situation, is there anything else I can do to resolve this issue?
@yixiaoer, it's quite strange that the problem persists with `--on-demand`. Could you please double-check it? Also, please share the output of `dstack ps` after you try it, so we can see whether it used a spot instance or not.
To make sure spot instances are not used, you can also set `spot_policy` to `on-demand` in the YAML.
Please let me know if you can check it.
Yes, the `resources` section looks OK to me!
Also, in case it doesn't work again, could you please share the repo with `train.dstack.yml` and the scripts so we can try to reproduce it?
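Putting the settings discussed in this thread together, the configuration could look roughly like the sketch below. The `spot_policy` and `resources` values come from the comments in this thread, while the `type` and `commands` entries are placeholders. The TPU type is left out because it is not given here.

```yaml
# Minimal sketch of a .dstack.yml; type and commands are placeholders.
type: task
commands:
  - python train.py   # placeholder command, not from this thread
# Request on-demand capacity only, so the run is never placed on a spot instance
spot_policy: on-demand
resources:
  memory: 100GB
  shm_size: 50GB   # the ~22GB dataset is downloaded to /dev/shm
```

Running `dstack ps` after submitting the run should then show whether a spot or an on-demand instance was provisioned.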
Thanks! I double-checked and tried running with `--on-demand` again. This time, it ran successfully.