
Comments (3)

romilbhardwaj avatar romilbhardwaj commented on June 13, 2024 1

Ah, looks like your instance does not have enough memory to satisfy the default resource request of 2 CPUs and 8GB memory. Note that some CPU millicores and memory go to k8s components, so an n1-standard-2 with 2 CPUs and 7.5GB memory cannot fit the default resources requested by SkyPilot.

This is surfaced in debug logs (export SKYPILOT_DEBUG=1):

D 05-02 13:40:51 kubernetes.py:344] Instance type 2CPU--8GB--1T4 does not fit in the Kubernetes cluster. Reason: GPU nodes with T4 do not have enough CPU and/or memory. Maximum resources found on a single node: 2.0 CPUs, 7.3G Memory
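
If you want to pull the per-node maximum out of a debug line like the one above, a minimal Python sketch follows. The helper name `parse_max_resources` and the regex are hypothetical: they are based only on the single log line quoted here, and the message format may differ between SkyPilot versions.

```python
import re

# Hypothetical pattern, derived only from the one debug line quoted in this thread.
MAX_RESOURCES = re.compile(
    r"Maximum resources found on a single node: "
    r"(?P<cpus>[\d.]+) CPUs, (?P<mem>[\d.]+)G Memory"
)


def parse_max_resources(line):
    """Return (cpus, memory_gb) found in a debug log line, or None if absent."""
    match = MAX_RESOURCES.search(line)
    if match is None:
        return None
    return float(match.group("cpus")), float(match.group("mem"))


log_line = (
    "D 05-02 13:40:51 kubernetes.py:344] Instance type 2CPU--8GB--1T4 does not "
    "fit in the Kubernetes cluster. Reason: GPU nodes with T4 do not have enough "
    "CPU and/or memory. Maximum resources found on a single node: "
    "2.0 CPUs, 7.3G Memory"
)
print(parse_max_resources(log_line))  # (2.0, 7.3)
```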

Explicitly specifying a lower CPU/mem request (e.g., sky launch --cloud kubernetes --gpus T4 --cpus 1 --memory 2) should work.
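
To see why the default request fails and the lowered one should fit, here is a simplified Python sketch of the comparison. The function `fits` is hypothetical and is not SkyPilot's actual check; it only compares a request against the per-node maximum reported in the debug log (2.0 CPUs, 7.3G memory).

```python
def fits(req_cpus, req_mem_gb, node_cpus, node_mem_gb):
    """Simplified check: the request must fit entirely on a single node."""
    return req_cpus <= node_cpus and req_mem_gb <= node_mem_gb


# Per-node maximum taken from the debug log quoted in this thread.
node_cpus, node_mem_gb = 2.0, 7.3

# Default request (2 CPUs, 8GB): CPUs fit, memory does not.
print(fits(2, 8, node_cpus, node_mem_gb))  # False

# Lowered request from the suggested command (--cpus 1 --memory 2).
print(fits(1, 2, node_cpus, node_mem_gb))  # True
```

The point is that the memory request alone is enough to rule out the node, even though the GPU itself is available.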

A TODO for us is to make the log messages better: perhaps resources: Kubernetes({'T4': 1}) should have shown the CPUs and memory requested. Leaving the issue open for us to fix logging. Thanks for the report!

from skypilot.

romilbhardwaj avatar romilbhardwaj commented on June 13, 2024

Thanks for the report @asaiacai - I'm unable to reproduce this on d27e0ff. Can you share a reproduction script, a bit more about how you created the cluster and the full output of sky launch --cloud kubernetes --gpus T4?

Here's how I created my cluster:

$ PROJECT_ID=$(gcloud config get-value project)
$ CLUSTER_NAME=gkeusc4

$ gcloud beta container --project "${PROJECT_ID}" clusters create "${CLUSTER_NAME}" --zone "us-central1-c" --no-enable-basic-auth --cluster-version "1.27.12-gke.1115000" --release-channel "regular" --machine-type "n1-standard-16" --accelerator "type=nvidia-tesla-t4,count=1" --image-type "COS_CONTAINERD" --disk-type "pd-balanced" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "3" --logging=SYSTEM,WORKLOAD --monitoring=SYSTEM --enable-ip-alias --network "projects/${PROJECT_ID}/global/networks/default" --subnetwork "projects/${PROJECT_ID}/regions/us-central1/subnetworks/default" --no-enable-intra-node-visibility --default-max-pods-per-node "110" --security-posture=standard --workload-vulnerability-scanning=disabled --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-managed-prometheus --enable-shielded-nodes --node-locations "us-central1-c"

$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

$ kubectl describe nodes | grep cloud.google.com/gke-accelerator
                    cloud.google.com/gke-accelerator=nvidia-tesla-t4

$ sky show-gpus --cloud kubernetes
COMMON_GPU  AVAILABLE_QUANTITIES
T4          1

$ sky launch --cloud kubernetes --gpus T4
# My cluster ran as expected.


asaiacai avatar asaiacai commented on June 13, 2024

I have an existing GKE cluster, cluster-1, to which I added a new node pool with one T4 instance. I shouldn't need to purge ~/.sky, right?

$ PROJECT_ID=$(gcloud config get-value project)
$ CLUSTER_NAME=cluster-1

$ gcloud beta container node-pools create t4-nodepool  --cluster=${CLUSTER_NAME}  --zone=us-central1-c  --node-locations=us-central1-c     --num-nodes=1     --total-min-nodes=1     --total-max-nodes=1     --reservation-affinity=none     --no-enable-autorepair     --location-policy=ANY   --machine-type=n1-standard-2     --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=latest
Note: Machines with GPUs have certain limitations which may affect your workflow. Learn more at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
Note: Starting in GKE 1.30, if you don't specify a driver version, GKE installs the default GPU driver for your node's GKE version.
Creating node pool t4-nodepool...done.                                                                                             
Created [https://container.googleapis.com/v1beta1/projects/trainy-test/zones/us-central1-c/clusters/cluster-1/nodePools/t4-nodepool].
NAME         MACHINE_TYPE   DISK_SIZE_GB  NODE_VERSION
t4-nodepool  n1-standard-2  100           1.28.7-gke.1026000

$ kubectl describe nodes | grep cloud.google.com/gke-accelerator
                    cloud.google.com/gke-accelerator=nvidia-tesla-t4
                      cloud.google.com/gke-accelerator=nvidia-tesla-t4,cloud.google.com/gke-boot-disk=pd-balanced,cloud.google.com/gke-container-runtime=contain...

$ sky show-gpus --cloud kubernetes
COMMON_GPU  AVAILABLE_QUANTITIES  
T4          1                     

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.
$ sky launch --cloud kubernetes --gpus T4
I 05-02 18:08:39 optimizer.py:1209] No resource satisfying Kubernetes({'T4': 1}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
  resources: Kubernetes({'T4': 1}).

To fix: relax or change the resource requirements.

$ sky -c
skypilot, commit d27e0ff83c56983920a655fbeaddc96b2758752e

