I'm running a single T4 node on GKE. Nodes are properly labeled as shown below and ...

[k8s] [GKE] Fail to request T4 instance about skypilot HOT 3 OPEN

asaiacai commented on June 13, 2024

[k8s] [GKE] Fail to request T4 instance


Comments (3)

romilbhardwaj commented on June 13, 2024

Ah, it looks like your instance does not have enough memory to satisfy the default resource request of 2 CPUs and 8 GB of memory. Note that some CPU millicores and memory go to k8s components, so an n1-standard-2 with 2 CPUs and 7.5 GB of memory cannot fit the default resources requested by SkyPilot.
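To see how much of a GPU node is actually left after the k8s components take their share, one option is to read the allocatable values that Kubernetes reports for each node. The following is a minimal sketch, not SkyPilot's own logic: the helper names (`parse_cpu`, `parse_memory_gib`, `allocatable_by_node`, `fetch_nodes`) are hypothetical, only the common quantity suffixes are handled, and it assumes `kubectl` is configured for the cluster in question.

```python
import json
import subprocess

# Node label that appears in the kubectl output in this thread.
GPU_LABEL = "cloud.google.com/gke-accelerator"

# Common Kubernetes memory suffixes only (assumption: others are not needed here).
_MEMORY_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ('2' or '1930m') to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory_gib(quantity: str) -> float:
    """Convert a Kubernetes memory quantity ('4Gi', '1048576Ki', plain bytes) to GiB."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor / 2**30
    return float(quantity) / 2**30


def allocatable_by_node(nodes: dict, accelerator: str) -> dict:
    """Map node name -> (cpus, memory in GiB) for nodes labeled with the accelerator."""
    result = {}
    for node in nodes["items"]:
        labels = node["metadata"].get("labels", {})
        if labels.get(GPU_LABEL) != accelerator:
            continue
        alloc = node["status"]["allocatable"]
        result[node["metadata"]["name"]] = (
            parse_cpu(alloc["cpu"]),
            parse_memory_gib(alloc["memory"]),
        )
    return result


def fetch_nodes() -> dict:
    """Read all nodes from the current kubectl context as parsed JSON."""
    completed = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(completed.stdout)
```

Calling `allocatable_by_node(fetch_nodes(), "nvidia-tesla-t4")` against a live cluster would list the CPU and memory that each T4 node can still hand out to pods, which gives an upper bound for any CPU and memory request.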

This is surfaced in debug logs (export SKYPILOT_DEBUG=1):

D 05-02 13:40:51 kubernetes.py:344] Instance type 2CPU--8GB--1T4 does not fit in the Kubernetes cluster. Reason: GPU nodes with T4 do not have enough CPU and/or memory. Maximum resources found on a single node: 2.0 CPUs, 7.3G Memory

Explicitly specifying a lower CPU/mem request (e.g., sky launch --cloud kubernetes --gpus T4 --cpus 1 --memory 2) should work.
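The comparison behind that debug message can be illustrated with a small sketch. This is not SkyPilot's implementation; the `Request` type and `fits` function are hypothetical names, and the numbers are taken from the debug log and the command above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """CPU (cores) and memory a pod asks for. Hypothetical type for illustration."""
    cpus: float
    memory: float


def fits(request: Request, node_cpus: float, node_memory: float) -> bool:
    """Return True if the request fits within a single node's available resources."""
    return request.cpus <= node_cpus and request.memory <= node_memory


# Maximum resources found on a single node, as reported in the debug log.
NODE_CPUS, NODE_MEMORY = 2.0, 7.3

default_request = Request(cpus=2, memory=8)  # default request described above
smaller_request = Request(cpus=1, memory=2)  # --cpus 1 --memory 2

print(fits(default_request, NODE_CPUS, NODE_MEMORY))  # False: 8 requested, 7.3 available
print(fits(smaller_request, NODE_CPUS, NODE_MEMORY))  # True
```

The CPU request alone would fit (2 <= 2.0); it is the memory request that pushes the default over what the node can offer.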

A TODO for us is to improve the log messages: perhaps resources: Kubernetes({'T4': 1}) should have shown the CPUs and memory requested. Leaving the issue open so we can fix the logging. Thanks for the report!


romilbhardwaj commented on June 13, 2024

Thanks for the report @asaiacai - I'm unable to reproduce this on d27e0ff. Can you share a reproduction script, a bit more about how you created the cluster and the full output of sky launch --cloud kubernetes --gpus T4?

Here's how I created my cluster:

$ PROJECT_ID=$(gcloud config get-value project)
$ CLUSTER_NAME=gkeusc4

$ gcloud beta container --project "${PROJECT_ID}" clusters create "${CLUSTER_NAME}" --zone "us-central1-c" --no-enable-basic-auth --cluster-version "1.27.12-gke.1115000" --release-channel "regular" --machine-type "n1-standard-16" --accelerator "type=nvidia-tesla-t4,count=1" --image-type "COS_CONTAINERD" --disk-type "pd-balanced" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "3" --logging=SYSTEM,WORKLOAD --monitoring=SYSTEM --enable-ip-alias --network "projects/${PROJECT_ID}/global/networks/default" --subnetwork "projects/${PROJECT_ID}/regions/us-central1/subnetworks/default" --no-enable-intra-node-visibility --default-max-pods-per-node "110" --security-posture=standard --workload-vulnerability-scanning=disabled --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-managed-prometheus --enable-shielded-nodes --node-locations "us-central1-c"

$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

$ kubectl describe nodes | grep cloud.google.com/gke-accelerator
                    cloud.google.com/gke-accelerator=nvidia-tesla-t4

$ sky show-gpus --cloud kubernetes
COMMON_GPU  AVAILABLE_QUANTITIES
T4          1

$ sky launch --cloud kubernetes --gpus T4
# My cluster ran as expected.


asaiacai commented on June 13, 2024

I have an existing GKE cluster, cluster-1, to which I added a new node pool with one T4 instance. I shouldn't need to purge ~/.sky, right?

$ PROJECT_ID=$(gcloud config get-value project)
$ CLUSTER_NAME=cluster-1

$ gcloud beta container node-pools create t4-nodepool --cluster=${CLUSTER_NAME} --zone=us-central1-c --node-locations=us-central1-c --num-nodes=1 --total-min-nodes=1 --total-max-nodes=1 --reservation-affinity=none --no-enable-autorepair --location-policy=ANY --machine-type=n1-standard-2 --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=latest
Note: Machines with GPUs have certain limitations which may affect your workflow. Learn more at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
Note: Starting in GKE 1.30, if you don't specify a driver version, GKE installs the default GPU driver for your node's GKE version.
Creating node pool t4-nodepool...done.                                                                                             
Created [https://container.googleapis.com/v1beta1/projects/trainy-test/zones/us-central1-c/clusters/cluster-1/nodePools/t4-nodepool].
NAME         MACHINE_TYPE   DISK_SIZE_GB  NODE_VERSION
t4-nodepool  n1-standard-2  100           1.28.7-gke.1026000

$ kubectl describe nodes | grep cloud.google.com/gke-accelerator
                    cloud.google.com/gke-accelerator=nvidia-tesla-t4
                      cloud.google.com/gke-accelerator=nvidia-tesla-t4,cloud.google.com/gke-boot-disk=pd-balanced,cloud.google.com/gke-container-runtime=contain...

$ sky show-gpus --cloud kubernetes
COMMON_GPU  AVAILABLE_QUANTITIES  
T4          1                     

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.
$ sky launch --cloud kubernetes --gpus T4
I 05-02 18:08:39 optimizer.py:1209] No resource satisfying Kubernetes({'T4': 1}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
  resources: Kubernetes({'T4': 1}).

To fix: relax or change the resource requirements.

$ sky -c
skypilot, commit d27e0ff83c56983920a655fbeaddc96b2758752e
