NVIDIA GPU Prometheus Exporter

This is a Prometheus Exporter for exporting NVIDIA GPU metrics. It uses the Go bindings for NVIDIA Management Library (NVML) which is a C-based API that can be used for monitoring NVIDIA GPU devices. Unlike some other similar exporters, it does not call the nvidia-smi binary.

Design Doc

Building

The repository includes nvml.h, so the build environment needs no special setup. go get should be able to build the exporter binary:

go get github.com/mindprince/nvidia_gpu_prometheus_exporter

Running on Kubernetes

kubectl create -f https://raw.githubusercontent.com/swiftdiaries/nvidia_gpu_prometheus_exporter/master/nvidia-exporter.yaml

Using ksonnet

kubectl create ns monitoring
ks init ks-app --env default --namespace monitoring --skip-default-registries
cd ks-app
ks registry add gpu-prometheus https://github.com/swiftdiaries/nvidia_gpu_prometheus_exporter/tree/master/gpu-prometheus
ks pkg install gpu-prometheus/nvidia-prometheus-exporter
ks generate nvidia-prometheus-exporter nvidia-prometheus-exporter
ks apply default

Complete setup on a k8s cluster

Note: Ensure nvidia-docker is installed.

Verify nvidia-docker

$ sudo docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

Reference: GitHub - NVIDIA/nvidia-docker: Build and run Docker containers leveraging NVIDIA GPUs

Nvidia driver install - daemonset

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml

Note: It takes a couple of minutes for the drivers to install.

Prometheus and Grafana install (uses a large YAML manifest of roughly 2k lines; TODO: simplify)

$ kubectl apply --filename https://raw.githubusercontent.com/giantswarm/kubernetes-prometheus/master/manifests-all.yaml

Grafana dashboard

wget https://raw.githubusercontent.com/swiftdiaries/nvidia_gpu_prometheus_exporter/master/Prometheus-GPU-stats-1533769198014.json

Import this JSON to Grafana.

Preview

Grafana Preview

Note: Excuse the flat duty cycle.

TODO

  1. Reduce size of image used for exporter.
  2. Simpler / manageable YAML for Prometheus.
  3. ksonnet app for easy deployments / integration with Kubeflow.

Note: priority is not necessarily in that order.

Run locally using Docker

$ make build

$ docker run -p 9445:9445 --rm --runtime=nvidia swiftdiaries/gpu_prom_metrics

Make changes, build, iterate.

Verify:

$ curl -s localhost:9445/metrics | grep -i "gpu"

Sample output:

# HELP nvidia_gpu_duty_cycle Percent of time over the past sample period during which one or more kernels were executing on the GPU device
# TYPE nvidia_gpu_duty_cycle gauge
nvidia_gpu_duty_cycle{minor_number="0",name="GeForce GTX 950",uuid="GPU-6e7a0fa1-0770-c210-1a5c-8710bc09ce00"} 0
# HELP nvidia_gpu_fanspeed_percent Fanspeed of the GPU device as a percent of its maximum
# TYPE nvidia_gpu_fanspeed_percent gauge
nvidia_gpu_fanspeed_percent{minor_number="0",name="GeForce GTX 950",uuid="GPU-6e7a0fa1-0770-c210-1a5c-8710bc09ce00"} 0
# HELP nvidia_gpu_memory_total_bytes Total memory of the GPU device in bytes
# TYPE nvidia_gpu_memory_total_bytes gauge
nvidia_gpu_memory_total_bytes{minor_number="0",name="GeForce GTX 950",uuid="GPU-6e7a0fa1-0770-c210-1a5c-8710bc09ce00"} 2.092171264e+09
# HELP nvidia_gpu_memory_used_bytes Memory used by the GPU device in bytes
# TYPE nvidia_gpu_memory_used_bytes gauge
nvidia_gpu_memory_used_bytes{minor_number="0",name="GeForce GTX 950",uuid="GPU-6e7a0fa1-0770-c210-1a5c-8710bc09ce00"} 1.048576e+06
# HELP nvidia_gpu_num_devices Number of GPU devices
# TYPE nvidia_gpu_num_devices gauge
nvidia_gpu_num_devices 1
# HELP nvidia_gpu_power_usage_milliwatts Power usage of the GPU device in milliwatts
# TYPE nvidia_gpu_power_usage_milliwatts gauge
nvidia_gpu_power_usage_milliwatts{minor_number="0",name="GeForce GTX 950",uuid="GPU-6e7a0fa1-0770-c210-1a5c-8710bc09ce00"} 13240
# HELP nvidia_gpu_temperature_celsius Temperature of the GPU device in celsius
# TYPE nvidia_gpu_temperature_celsius gauge
nvidia_gpu_temperature_celsius{minor_number="0",name="GeForce GTX 950",uuid="GPU-6e7a0fa1-0770-c210-1a5c-8710bc09ce00"} 34
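
The sample output above is in the Prometheus text exposition format. As a minimal sketch of how such output could be consumed outside of Prometheus (this is not part of the exporter), the Go program below parses a few of the lines and derives a memory utilization percentage from nvidia_gpu_memory_used_bytes and nvidia_gpu_memory_total_bytes. The names Sample, parseSamples and value are hypothetical, and the parser is deliberately simplified: it assumes no trailing timestamps and no `}` character inside label values.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// Sample is one parsed metric line (hypothetical type for this sketch).
type Sample struct {
	Name   string
	Labels string // raw text between '{' and '}', empty if the line has no labels
	Value  float64
}

// parseSamples reads text-format metric lines and skips blank and comment lines.
// Simplification: no timestamps, and no '}' inside label values.
func parseSamples(text string) ([]Sample, error) {
	var out []Sample
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The value is everything after the last space; label values may contain spaces.
		i := strings.LastIndex(line, " ")
		if i < 0 {
			return nil, fmt.Errorf("malformed line: %q", line)
		}
		v, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", line, err)
		}
		series := strings.TrimSpace(line[:i])
		s := Sample{Name: series, Value: v}
		if j := strings.Index(series, "{"); j >= 0 {
			s.Name = series[:j]
			s.Labels = strings.TrimSuffix(series[j+1:], "}")
		}
		out = append(out, s)
	}
	return out, sc.Err()
}

// value returns the value of the first sample with the given metric name.
func value(samples []Sample, name string) (float64, bool) {
	for _, s := range samples {
		if s.Name == name {
			return s.Value, true
		}
	}
	return 0, false
}

func main() {
	const text = `# TYPE nvidia_gpu_memory_total_bytes gauge
nvidia_gpu_memory_total_bytes{minor_number="0",name="GeForce GTX 950",uuid="GPU-6e7a0fa1-0770-c210-1a5c-8710bc09ce00"} 2.092171264e+09
# TYPE nvidia_gpu_memory_used_bytes gauge
nvidia_gpu_memory_used_bytes{minor_number="0",name="GeForce GTX 950",uuid="GPU-6e7a0fa1-0770-c210-1a5c-8710bc09ce00"} 1.048576e+06
`
	samples, err := parseSamples(text)
	if err != nil {
		panic(err)
	}
	total, okTotal := value(samples, "nvidia_gpu_memory_total_bytes")
	used, okUsed := value(samples, "nvidia_gpu_memory_used_bytes")
	if !okTotal || !okUsed || total == 0 {
		panic("memory metrics missing")
	}
	fmt.Printf("GPU memory used: %.3f%%\n", 100*used/total)
}
```

With more than one GPU, value would return only the first matching series; a fuller version would group samples by the uuid label.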

Prerequisites for running locally

The exporter requires the following:

  • access to the NVML library (libnvidia-ml.so.1).
  • access to the GPU devices.

To make sure that the exporter can access the NVML libraries, either add them to the search path for shared libraries or set LD_LIBRARY_PATH to point to their location.
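
As a rough pre-flight check for the LD_LIBRARY_PATH option, the Go sketch below looks for libnvidia-ml.so.1 in each directory listed in that variable. The helper name findLib is hypothetical and not part of the exporter. It only inspects LD_LIBRARY_PATH, so a negative result does not rule out the library being present in the system's default search path.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// findLib returns the first path of the form <dir>/<name> that exists as a
// non-directory, where <dir> ranges over the entries of a colon-separated
// path list such as LD_LIBRARY_PATH. (Hypothetical helper for this sketch.)
func findLib(pathList, name string) (string, bool) {
	for _, dir := range filepath.SplitList(pathList) {
		if dir == "" {
			continue // skip empty entries such as a leading or trailing ':'
		}
		candidate := filepath.Join(dir, name)
		// os.Stat follows symlinks, which shared libraries often are.
		if info, err := os.Stat(candidate); err == nil && !info.IsDir() {
			return candidate, true
		}
	}
	return "", false
}

func main() {
	const lib = "libnvidia-ml.so.1"
	if p, ok := findLib(os.Getenv("LD_LIBRARY_PATH"), lib); ok {
		fmt.Println("found", p)
	} else {
		fmt.Println(lib, "not found via LD_LIBRARY_PATH; it may still be in the default search path")
	}
}
```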

By default the metrics are exposed on port 9445. This can be updated using the -web.listen-address flag.
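
To illustrate how a flag of this shape behaves, here is a minimal sketch using Go's standard flag package. It is not the exporter's actual source: the function name parseListenAddr and the help text are assumptions, and only the flag name -web.listen-address and the default port 9445 come from the description above.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// parseListenAddr parses command-line arguments and returns the address to
// listen on, defaulting to ":9445". (Hypothetical helper for this sketch.)
func parseListenAddr(args []string) (string, error) {
	fs := flag.NewFlagSet("exporter-sketch", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text quiet; the caller reports errors
	addr := fs.String("web.listen-address", ":9445", "address to expose metrics on")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *addr, nil
}

func main() {
	addr, err := parseListenAddr(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("would listen on", addr)
}
```

Passing -web.listen-address=:9446 to this sketch makes it report :9446 instead of the default. With the real exporter running in Docker, the published port in the docker run command (-p 9445:9445) would need to change to match.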
