
Time-Slicing GPUs in Kubernetes

Intro

The NVIDIA GPU Operator allows oversubscription of GPUs through a set of extended options for the NVIDIA Kubernetes Device Plugin. Internally, GPU time-slicing is used to interleave workloads that land on oversubscribed GPUs. This page covers ways to enable this in Managed Service for Kubernetes using the GPU Operator.

This mechanism for "time-sharing" GPUs in Kubernetes lets a system administrator define a set of "replicas" for a GPU, each of which can be handed out independently to a pod. Unlike MIG (Multi-Instance GPU), there is no memory or fault isolation between replicas, but for some workloads this is better than not being able to share at all. Internally, GPU time-slicing multiplexes workloads from replicas of the same underlying GPU.

Official documentation

Quick start

Add a node group with NVIDIA T4 GPUs

Provide a time-slicing configuration for the NVIDIA Kubernetes Device Plugin as a ConfigMap:

kubectl create -f time-slicing-config.yaml
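
The contents of time-slicing-config.yaml are not shown in this repo page. Following the scheme documented for the NVIDIA GPU Operator, a minimal configuration that splits each T4 into five replicas might look like this (the `tesla-t4` key matches the config name used in the ClusterPolicy patch and node label below; the replica count of 5 is an assumption consistent with the five test pods later in this guide):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  # Named config selected via devicePlugin.config.default
  # or the nvidia.com/device-plugin.config node label
  tesla-t4: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          # Advertise each physical GPU as 5 schedulable replicas
          - name: nvidia.com/gpu
            replicas: 5
```

With this config applied, a node with one T4 advertises `nvidia.com/gpu: 5`, so up to five pods each requesting one GPU can land on it.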

Install GPU Operator

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
   && helm repo update \
   && helm install gpu-operator nvidia/gpu-operator \
     -n gpu-operator \
     --set devicePlugin.config.name=time-slicing-config

The time-slicing configuration can be applied either at the cluster level or per node. By default, the GPU Operator does not apply the time-slicing configuration to any GPU node in the cluster. You can set a cluster-wide default with the devicePlugin.config.default parameter in the ClusterPolicy:

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
   -n gpu-operator --type merge \
   -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "tesla-t4"}}}}'

Or apply it per node group with a node label:

yc managed-kubernetes node-group add-labels <NODE-GROUP-NAME>|<NODE-GROUP-ID> --labels nvidia.com/device-plugin.config=tesla-t4

Testing GPU Time-Slicing with the NVIDIA GPU Operator

Create a deployment with multiple replicas:

kubectl apply -f nvidia-plugin-test.yml
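
The referenced nvidia-plugin-test.yml is not included on this page. Judging by the dcgmproftester11 processes visible in the nvidia-smi output below, it is likely similar to the test deployment from the NVIDIA GPU Operator documentation; a sketch of such a manifest (the image tag and benchmark arguments are assumptions, not taken from this repo):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvidia-plugin-test
  labels:
    app: nvidia-plugin-test
spec:
  # Five replicas, one per time-sliced GPU replica
  replicas: 5
  selector:
    matchLabels:
      app: nvidia-plugin-test
  template:
    metadata:
      labels:
        app: nvidia-plugin-test
    spec:
      tolerations:
        # GPU nodes are commonly tainted; tolerate the standard GPU taint
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: dcgmproftester11
          image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
          command: ["/bin/sh", "-c"]
          args:
            # Run a CUDA load generator to keep the GPU busy
            - /usr/bin/dcgmproftester11 --no-dcgm-validation -t 1004 -d 300
          resources:
            limits:
              # Each pod requests one (time-sliced) GPU replica
              nvidia.com/gpu: 1
```

Without time-slicing, only one of these pods could be scheduled per physical GPU; with five replicas configured, all five fit on a single T4.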

Verify that all five replicas are running:

kubectl get pods

Check nvidia-smi in the nvidia-container-toolkit pod:

kubectl exec <nvidia-container-toolkit-name> -n gpu-operator -- nvidia-smi

Your output should look something like this:


Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
Thu Jan 26 09:42:51 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:8B:00.0 Off |                    0 |
| N/A   72C    P0    70W /  70W |   1579MiB / 15360MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     43108      C   /usr/bin/dcgmproftester11         315MiB |
|    0   N/A  N/A     43211      C   /usr/bin/dcgmproftester11         315MiB |
|    0   N/A  N/A     44583      C   /usr/bin/dcgmproftester11         315MiB |
|    0   N/A  N/A     44589      C   /usr/bin/dcgmproftester11         315MiB |
|    0   N/A  N/A     44595      C   /usr/bin/dcgmproftester11         315MiB |
+-----------------------------------------------------------------------------+
