
nvidia_gpu_prometheus_exporter's People

Contributors

adampl, rohitagarwal003

nvidia_gpu_prometheus_exporter's Issues

How to run?

So I ran:

go get github.com/mindprince/nvidia_gpu_prometheus_exporter

and nothing happened. How do I run the exporter?
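
For what it's worth, go get only downloads, builds, and installs the binary into $GOPATH/bin (or ~/go/bin when GOPATH is unset); it does not start anything itself. A minimal sketch of actually running it (assuming pre-modules Go tooling and the :9445 port used by the Docker image further down this page):

    go get github.com/mindprince/nvidia_gpu_prometheus_exporter
    $GOPATH/bin/nvidia_gpu_prometheus_exporter &    # start the exporter in the background
    curl http://localhost:9445/metrics              # scrape it once by hand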

couldn't initialize gonvml

system info: Ubuntu 16.04
gpu: RTX 2080 Ti
cuda: 10
nvidia-driver: 410.48

When I create the nvidia-gpu-prometheus-exporter pod, it errors with the message: "couldn't initialize gonvml: could not load NVML library. Make sure NVML is in the shared library search path."
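
The message means the dynamic loader could not find libnvidia-ml.so.1 when the exporter started. A sketch of how to check and work around that (the driver library path /usr/lib/nvidia-410 is my guess for driver 410.48; substitute wherever your driver actually installed its libraries):

    ldconfig -p | grep libnvidia-ml                               # is NVML in the loader cache at all?
    export LD_LIBRARY_PATH=/usr/lib/nvidia-410:$LD_LIBRARY_PATH   # hypothetical driver path
    ./nvidia_gpu_prometheus_exporter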

Deploying gpu-exporter on a non-GPU node causes an error and crash

Hi @mindprince ,
the gpu exporter works pretty well on a GPU node, but it errors when deployed on a non-GPU node.
Of course that is reasonable, because there is no GPU hardware or NVML on that node. But should we still let the gpu-exporter run and simply not expose the GPU-related metrics? That way it would match cadvisor's behavior.

I actually need this behavior because I combine the gpu-exporter and the common node-exporter in one pod (as a DaemonSet) that runs on every node, including nodes without a GPU; only that way can I join the common node metrics with the GPU node metrics.

Should we fix it? Any suggestions? (A possible workaround is sketched below.)

Best Regards
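
Until the exporter handles this itself, one possible workaround (an untested sketch, not a confirmed feature of the exporter; the exporter path is hypothetical) is to wrap the container entrypoint so the pod idles on nodes without NVML instead of crash-looping:

    #!/bin/sh
    # If no GPU/NVML is present, idle forever so the co-located node-exporter
    # container keeps serving metrics on non-GPU nodes.
    if ! nvidia-smi >/dev/null 2>&1; then
        echo "no GPU/NVML on this node; idling instead of exporting" >&2
        exec sleep infinity
    fi
    exec /usr/bin/nvidia_gpu_prometheus_exporter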

Running in Docker

Hi,

I'm trying to run this in Docker. I've created a Dockerfile here: https://github.com/discordianfish/nvidia_gpu_prometheus_exporter/blob/master/Dockerfile

I've bind-mounted /opt/nvidia/lib64 into the container and set up ld.so.conf to find it, yet the exporter still fails:

2018/05/29 17:39:20 Couldn't initialize gonvml: could not load NVML library. Make sure NVML is in the shared library search path.
# cat /etc/ld.so.conf.d/nvidia.conf 
/usr/local/nvidia/lib64
root@nvidia-exporter-ng2g9:~# ls /usr/local/nvidia/lib64
libEGL.so		 libGLESv1_CM_nvidia.so.1	libGLdispatch.so.0  libnvcuvid.so.390.46		 libnvidia-fbc.so	     libnvidia-ml.so.390.46
libEGL.so.1		 libGLESv1_CM_nvidia.so.390.46	libOpenCL.so	    libnvidia-cfg.so			 libnvidia-fbc.so.1	     libnvidia-opencl.so.1
libEGL.so.1.1.0		 libGLESv2.so			libOpenCL.so.1	    libnvidia-cfg.so.1			 libnvidia-fbc.so.390.46     libnvidia-opencl.so.390.46
libEGL_nvidia.so.0	 libGLESv2.so.2			libOpenCL.so.1.0    libnvidia-cfg.so.390.46		 libnvidia-glcore.so.390.46  libnvidia-ptxjitcompiler.so
libEGL_nvidia.so.390.46  libGLESv2.so.2.1.0		libOpenCL.so.1.0.0  libnvidia-compiler.so.390.46	 libnvidia-glsi.so.390.46    libnvidia-ptxjitcompiler.so.1
libGL.la		 libGLESv2_nvidia.so.2		libOpenGL.so	    libnvidia-egl-wayland.so.1		 libnvidia-gtk2.so.390.46    libnvidia-ptxjitcompiler.so.390.46
libGL.so		 libGLESv2_nvidia.so.390.46	libOpenGL.so.0	    libnvidia-egl-wayland.so.1.0.2	 libnvidia-gtk3.so.390.46    libnvidia-tls.so.390.46
libGL.so.1		 libGLX.so			libcuda.so	    libnvidia-eglcore.so.390.46		 libnvidia-ifr.so	     libvdpau_nvidia.so
libGL.so.1.7.0		 libGLX.so.0			libcuda.so.1	    libnvidia-encode.so			 libnvidia-ifr.so.1	     tls
libGLESv1_CM.so		 libGLX_indirect.so.0		libcuda.so.390.46   libnvidia-encode.so.1		 libnvidia-ifr.so.390.46     vdpau
libGLESv1_CM.so.1	 libGLX_nvidia.so.0		libnvcuvid.so	    libnvidia-encode.so.390.46		 libnvidia-ml.so	     xorg
libGLESv1_CM.so.1.2.0	 libGLX_nvidia.so.390.46	libnvcuvid.so.1     libnvidia-fatbinaryloader.so.390.46  libnvidia-ml.so.1
root@nvidia-exporter-ng2g9:~# ldconfig  -v|grep nvidia-ml     
ldconfig: Path `/lib/x86_64-linux-gnu' given more than once
ldconfig: Path `/usr/lib/x86_64-linux-gnu' given more than once
ldconfig: /lib/x86_64-linux-gnu/ld-2.24.so is the dynamic linker, ignoring

	libnvidia-ml.so.1 -> libnvidia-ml.so.390.46
root@nvidia-exporter-ng2g9:~# ldconfig  -v|grep nvidia   
ldconfig: Path `/lib/x86_64-linux-gnu' given more than once
ldconfig: Path `/usr/lib/x86_64-linux-gnu' given more than once
ldconfig: /lib/x86_64-linux-gnu/ld-2.24.so is the dynamic linker, ignoring

/usr/local/nvidia/lib64:
	libnvidia-glcore.so.390.46 -> libnvidia-glcore.so.390.46
	libnvidia-tls.so.390.46 -> libnvidia-tls.so.390.46
	libEGL_nvidia.so.0 -> libEGL_nvidia.so.390.46
	libnvidia-gtk3.so.390.46 -> libnvidia-gtk3.so.390.46
	libnvidia-gtk2.so.390.46 -> libnvidia-gtk2.so.390.46
	libnvidia-fatbinaryloader.so.390.46 -> libnvidia-fatbinaryloader.so.390.46
	libnvidia-opencl.so.1 -> libnvidia-opencl.so.390.46
	libnvidia-compiler.so.390.46 -> libnvidia-compiler.so.390.46
	libnvidia-ml.so.1 -> libnvidia-ml.so.390.46
	libGLESv2_nvidia.so.2 -> libGLESv2_nvidia.so.390.46
	libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.390.46
	libnvidia-cfg.so.1 -> libnvidia-cfg.so.390.46
	libnvidia-ifr.so.1 -> libnvidia-ifr.so.390.46
	libnvidia-egl-wayland.so.1 -> libnvidia-egl-wayland.so.1.0.2
	libGLX_nvidia.so.0 -> libGLX_nvidia.so.390.46
	libnvidia-fbc.so.1 -> libnvidia-fbc.so.390.46
	libnvidia-eglcore.so.390.46 -> libnvidia-eglcore.so.390.46
	libnvidia-glsi.so.390.46 -> libnvidia-glsi.so.390.46
	libnvidia-encode.so.1 -> libnvidia-encode.so.390.46
	libGLESv1_CM_nvidia.so.1 -> libGLESv1_CM_nvidia.so.390.46
/usr/local/nvidia/lib64/tls: (hwcap: 0x8000000000000000)
	libnvidia-tls.so.390.46 -> libnvidia-tls.so.390.46
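
Two things worth ruling out (assumptions on my part, not a confirmed diagnosis): the ld.so cache has to be rebuilt inside the container before the exporter process starts, and bypassing the cache with LD_LIBRARY_PATH isolates whether the cache is the problem at all (the exporter path below is a guess):

    ldconfig                                   # rebuild /etc/ld.so.cache from /etc/ld.so.conf.d/*.conf
    LD_LIBRARY_PATH=/usr/local/nvidia/lib64 /nvidia_gpu_prometheus_exporter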

Upgraded Driver, FanSpeed error

Hello!

I have been using this for a while to get stats on GPUs.

Now, after some NVIDIA driver upgrades, the container complains with the error: FanSpeed() error: nvml: Not Supported

I see in Grafana I get the other Data, but fan speed is no longer there.

Is that something fixable?
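
A quick check, assuming nvidia-smi works on the host: if it also reports [N/A] for fan speed, the driver/GPU combination genuinely stopped exposing it through NVML, and the fix on the exporter side would be to skip metrics whose NVML call returns Not Supported instead of logging an error:

    nvidia-smi --query-gpu=name,fan.speed --format=csv    # [N/A] here means NVML reports NOT_SUPPORTED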

dutyCycle loses data

dutyCycle is the GPU utilization during the last "sample period" of the driver, according to NVIDIA docs:

Percent of time over the past sample period during which one or more kernels was executing on the GPU.

Utilization information for a device. Each sample period may be between 1 second and 1/6 second, depending on the product being queried.

https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t

So this can be a very short period. Prometheus scrape intervals are usually 10 seconds or longer, so data is almost certainly lost.
Say a workload uses 100% of the GPU for one second, then sleeps for one second: the GPU is 50% busy on average. We don't know exactly when Prometheus will scrape, but there's a good chance it only ever sees 100% or 0%, so the recorded utilization will probably be wrong.

Instead it would be better to expose a ..._seconds_total counter, as is done for CPU utilization: https://www.robustperception.io/understanding-machine-cpu-usage
That way we wouldn't lose data to long Prometheus scrape intervals, but it would probably require more work in the exporter (polling at a higher frequency), as illustrated below.
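
To illustrate the counter idea, a rough sketch (not exporter code; it polls nvidia-smi at 1 Hz, assumes a single GPU, and integrates utilization into a busy-seconds total):

    total=0
    while sleep 1; do
        util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)
        total=$(echo "$total + $util / 100" | bc -l)    # accumulate seconds of GPU busy time
        echo "gpu_busy_seconds_total $total"
    done

Exposed as a counter, Prometheus can then recover the average utilization over any window with rate(gpu_busy_seconds_total[5m]), independent of the scrape interval.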

Unable to access Prometheus type data source in Grafana

Hello!

I'm facing an error accessing the metrics from your exporter.

Steps performed:

  1. Run the exporter with nvidia-docker:
     nvidia-docker run -p 9445:9445 -ti mindprince/nvidia_gpu_prometheus_exporter:0.1
  2. Test by accessing http://localhost:9445
     (screenshot: extractor)
     • seems to work well
  3. Set up Grafana in Docker:
     docker run -d --name grafana -p 3000:3000 grafana/grafana
     • works fine
  4. Add a data source in Grafana of type Prometheus
     (screenshot: grafana_docker)
     • in the URL field I provide the exporter's Docker IP address (this worked with another Prometheus setup in Docker, based on this tutorial)
     • if I provide the host machine's IP instead, I get a Bad Gateway error
  5. Open the Nvidia GPU dashboard
  6. Try to explore the metrics, but they never load
     (screenshot: grafana_explore)
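
One likely cause, though I'm guessing from the description: a Grafana data source of type Prometheus has to point at a Prometheus server, not at the exporter directly; the exporter only serves /metrics and does not implement Prometheus's query API. A sketch of the full chain (container names, network name, and the minimal prometheus.yml are illustrative):

    docker network create monitoring
    nvidia-docker run -d --name nvidia-exporter --network monitoring mindprince/nvidia_gpu_prometheus_exporter:0.1
    # prometheus.yml, mounted into the container below:
    #   scrape_configs:
    #     - job_name: nvidia
    #       static_configs:
    #         - targets: ['nvidia-exporter:9445']
    docker run -d --name prometheus --network monitoring -p 9090:9090 \
        -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
    docker run -d --name grafana --network monitoring -p 3000:3000 grafana/grafana
    # Grafana data source URL: http://prometheus:9090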
