Code Monkey home page Code Monkey logo

aks-windows-gpu-acceleration's Introduction

Overview

This reposity holds instructions for configuring GPU acclerated Windows nodes in AKS clusters

Node Setup

  1. Create an AKS cluster following https://docs.microsoft.com/en-us/azure/aks/learn/quick-windows-container-deploy-cli#create-an-aks-cluster

  2. Add a Windows Server node pool following https://learn.microsoft.com/en-us/azure/aks/learn/quick-windows-container-deploy-cli#add-a-windows-node-pool

    Notes:

    1. kubernetes version must be 1.23 or later

    2. node-vm-size should be a NV* series which support DirectX acceleration. [https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu]

  3. Add the 'NVDIA GPU Driver Extension' to the VirtualMachineScaleSet created for the new node pool added above

    https://docs.microsoft.com/en-us/azure/virtual-machines/extensions/hpccompute-gpu-windows

    https://docs.microsoft.com/en-us/cli/azure/vmss/extension?view=azure-cli-latest

  4. Ensure VirtualMachineScaleSet instances are using the most up-to-date model

    https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-upgrade-scale-set#how-to-bring-vms-up-to-date-with-the-latest-scale-set-model

  5. Wait for VM extensions to run

  6. Deploy `k8s-directx-device-plugin.yaml to your cluster

    kubectl apply -f https://raw.githubusercontent.com/marosset/aks-windows-gpu-acceleration/main/k8s-directx-device-plugin/k8s-directx-device-plugin.yaml

Node Setup Verification

  1. run kubectl describe node {gpu-node} and look for

    Capacity:
        cpu:                    6
        ephemeral-storage:      133703676Ki
        memory:                 58719796Ki
        microsoft.com/directx:  1
        pods:                   30
    Allocatable:
        cpu:                    5840m
        ephemeral-storage:      133703676Ki
        memory:                 51379764Ki
        microsoft.com/directx:  1
        pods:                   30

Requesting GPU acceleration in container instances

  1. In your workload deployment files add requests for 'microsoft.com/directx` resoruces

    ...
    spec:
      containers:
    ...
        resources:
          requests:
            microsoft.com/directx: "1"

Troubleshooting

  1. Ensure daemonSet pods are running

    Check status with

    kubectl get pods -A
  2. Check device plugin pod logs

    kubectl logs -n kube-system {pod-name}

    You should see something like

    kubectl logs -n kube-system directx-device-plugin-5t7dc
    ERROR: logging before flag.Parse: I0505 19:10:39.046785    5076 main.go:98] GPU NVIDIA Tesla M60 id: PCI\VEN_10DE&DEV_13F2&SUBSYS_115E10DE&REV_A1\6&2B78CA89&0&0
    ERROR: logging before flag.Parse: W0505 19:10:39.096589    5076 main.go:90] 'Microsoft Hyper-V Video' doesn't match  'nvidia', ignoring this gpu
    ERROR: logging before flag.Parse: I0505 19:10:39.096589    5076 main.go:101] pluginSocksDir: /var/lib/kubelet/device-plugins/
    ERROR: logging before flag.Parse: I0505 19:10:39.096589    5076 main.go:103] socketPath: /var/lib/kubelet/device-plugins/directx.sock
    2022/05/05 19:10:40 Starting to serve on /var/lib/kubelet/device-plugins/directx.sock
    2022/05/05 19:10:40 Deprecation file not found. Invoke registration
    2022/05/05 19:10:40 ListAndWatch
    

aks-windows-gpu-acceleration's People

Contributors

eliises avatar marosset avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.