Comments (3)
It looks like your instance does not have enough memory to satisfy SkyPilot's default resource request of 2 CPUs and 8GB memory. Note that some CPU millicores and memory go to k8s components, so an n1-standard-2 with 2 CPUs and 7.5GB memory cannot fit the default resources requested by SkyPilot.
This is surfaced in the debug logs (export SKYPILOT_DEBUG=1):
D 05-02 13:40:51 kubernetes.py:344] Instance type 2CPU--8GB--1T4 does not fit in the Kubernetes cluster. Reason: GPU nodes with T4 do not have enough CPU and/or memory. Maximum resources found on a single node: 2.0 CPUs, 7.3G Memory
Explicitly specifying a lower CPU/memory request (e.g., sky launch --cloud kubernetes --gpus T4 --cpus 1 --memory 2) should work.
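To illustrate why the default request fails while the lower one fits, here is a minimal sketch of a per-node fit check. It is not SkyPilot's actual implementation: the `Resources` class and `fits` function are hypothetical names, and the node figures are taken from the debug log line above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """CPU and memory amounts (hypothetical helper, not a SkyPilot class)."""
    cpus: float
    memory_gb: float


def fits(request: Resources, node_max: Resources) -> bool:
    """Return True only if both CPU and memory of the request fit on the node."""
    return request.cpus <= node_max.cpus and request.memory_gb <= node_max.memory_gb


# "Maximum resources found on a single node: 2.0 CPUs, 7.3G Memory" (from the log).
node_max = Resources(cpus=2.0, memory_gb=7.3)

# Default request mentioned in this thread: 2 CPUs and 8GB memory.
default_request = Resources(cpus=2, memory_gb=8)

# Lower request matching `--cpus 1 --memory 2`.
lower_request = Resources(cpus=1, memory_gb=2)

print(fits(default_request, node_max))  # False: 8GB exceeds the 7.3G available
print(fits(lower_request, node_max))    # True
```

The CPU count alone would fit here; it is the memory request that pushes the default over what the node can offer.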
A TODO for us is to make the log messages better: perhaps resources: Kubernetes({'T4': 1}) should have shown the CPUs and memory requested. Leaving the issue open for us to fix logging. Thanks for the report!
from skypilot.
Thanks for the report @asaiacai - I'm unable to reproduce this on d27e0ff. Can you share a reproduction script, a bit more about how you created the cluster, and the full output of sky launch --cloud kubernetes --gpus T4?
Here's how I created my cluster:
$ PROJECT_ID=$(gcloud config get-value project)
$ CLUSTER_NAME=gkeusc4
$ gcloud beta container --project "${PROJECT_ID}" clusters create "${CLUSTER_NAME}" --zone "us-central1-c" --no-enable-basic-auth --cluster-version "1.27.12-gke.1115000" --release-channel "regular" --machine-type "n1-standard-16" --accelerator "type=nvidia-tesla-t4,count=1" --image-type "COS_CONTAINERD" --disk-type "pd-balanced" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "3" --logging=SYSTEM,WORKLOAD --monitoring=SYSTEM --enable-ip-alias --network "projects/${PROJECT_ID}/global/networks/default" --subnetwork "projects/${PROJECT_ID}/regions/us-central1/subnetworks/default" --no-enable-intra-node-visibility --default-max-pods-per-node "110" --security-posture=standard --workload-vulnerability-scanning=disabled --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-managed-prometheus --enable-shielded-nodes --node-locations "us-central1-c"
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
$ kubectl describe nodes | grep cloud.google.com/gke-accelerator
cloud.google.com/gke-accelerator=nvidia-tesla-t4
$ sky show-gpus --cloud kubernetes
COMMON_GPU AVAILABLE_QUANTITIES
T4 1
$ sky launch --cloud kubernetes --gpus T4
# My cluster ran as expected.
I have an existing GKE cluster, cluster-1, to which I added a new node pool with one T4 instance. I shouldn't need to purge ~/.sky, right?
$ PROJECT_ID=$(gcloud config get-value project)
$ CLUSTER_NAME=cluster-1
$ gcloud beta container node-pools create t4-nodepool --cluster=${CLUSTER_NAME} --zone=us-central1-c --node-locations=us-central1-c --num-nodes=1 --total-min-nodes=1 --total-max-nodes=1 --reservation-affinity=none --no-enable-autorepair --location-policy=ANY --machine-type=n1-standard-2 --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=latest
Note: Machines with GPUs have certain limitations which may affect your workflow. Learn more at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
Note: Starting in GKE 1.30, if you don't specify a driver version, GKE installs the default GPU driver for your node's GKE version.
Creating node pool t4-nodepool...done.
Created [https://container.googleapis.com/v1beta1/projects/trainy-test/zones/us-central1-c/clusters/cluster-1/nodePools/t4-nodepool].
NAME MACHINE_TYPE DISK_SIZE_GB NODE_VERSION
t4-nodepool n1-standard-2 100 1.28.7-gke.1026000
$ kubectl describe nodes | grep cloud.google.com/gke-accelerator
cloud.google.com/gke-accelerator=nvidia-tesla-t4
cloud.google.com/gke-accelerator=nvidia-tesla-t4,cloud.google.com/gke-boot-disk=pd-balanced,cloud.google.com/gke-container-runtime=contain...
$ sky show-gpus --cloud kubernetes
COMMON_GPU AVAILABLE_QUANTITIES
T4 1
Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.
$ sky launch --cloud kubernetes --gpus T4
I 05-02 18:08:39 optimizer.py:1209] No resource satisfying Kubernetes({'T4': 1}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
resources: Kubernetes({'T4': 1}).
To fix: relax or change the resource requirements.
$ sky -c
skypilot, commit d27e0ff83c56983920a655fbeaddc96b2758752e