mindprince / nvidia_gpu_prometheus_exporter
License: Apache License 2.0
NVIDIA GPU Prometheus Exporter
So I ran:
go get github.com/mindprince/nvidia_gpu_prometheus_exporter
and nothing happened. How do I run the exporter?
System info: Ubuntu 16.04
GPU: RTX 2080 Ti
CUDA: 10
NVIDIA driver: 410.48
When I create the nvidia-gpu-prometheus-exporter pod, it fails with the message "couldn't initialize gonvml: could not load NVML library Make sure NVML is in the shared library search path"
Hi @mindprince,
The GPU exporter works well on a GPU node, but it fails with an error when deployed on a non-GPU node.
Of course that is reasonable, because that node has no GPU hardware and no NVML. But should we still let the gpu-exporter start and simply not expose the GPU metrics? That would match the behavior of cAdvisor.
I do need this behavior, because I combine the gpu-exporter and the common node-exporter in one pod (as a DaemonSet) that runs on every node, including nodes without a GPU. Only this way can I join the common node metrics with the GPU node metrics.
Should we fix it? Or do you have any suggestions?
Best regards
Any help will be appreciated.
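One possible shape for the behavior requested above is to treat an NVML initialization failure as "no GPUs on this node" instead of a fatal error. The following is only a minimal sketch using the Go standard library: `initNVML`, `gpuMetrics` and `renderMetrics` are hypothetical names, not functions from this repository, and the metric line is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// initNVML is a hypothetical stand-in for the exporter's NVML initialization.
// Here it always fails, simulating a node without the NVIDIA driver.
var initNVML = func() error {
	return errors.New("could not load NVML library")
}

// gpuMetrics is a hypothetical stand-in for the real GPU collector output.
func gpuMetrics() string {
	return "nvidia_gpu_num_devices 1\n"
}

// renderMetrics returns the GPU part of the metrics page. When NVML is not
// available it returns an empty body, so a scrape still succeeds and simply
// carries no GPU series.
func renderMetrics(gpuAvailable bool) string {
	if !gpuAvailable {
		return ""
	}
	return gpuMetrics()
}

func main() {
	err := initNVML()
	if err != nil {
		// Log and carry on instead of exiting.
		log.Printf("NVML unavailable, serving no GPU metrics: %v", err)
	}
	fmt.Printf("GPU metrics body: %q\n", renderMetrics(err == nil))
}
```

In an exporter, the string returned by `renderMetrics` would be written by the `/metrics` handler. The process then keeps running on every node of the DaemonSet, and only GPU nodes contribute GPU series.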
Just for your information, nvidia_gpu_prometheus_exporter has been added to the FreeBSD Ports Collection recently. It is now possible to install it as a package on FreeBSD with its package manager, pkg.
Cheers!
https://www.freshports.org/net-mgmt/nvidia_gpu_prometheus_exporter/
Hi,
I'm trying to run this in Docker. I've created a Dockerfile here: https://github.com/discordianfish/nvidia_gpu_prometheus_exporter/blob/master/Dockerfile
I've bind-mounted /opt/nvidia/lib64 into the container and set up ld.so.conf to find it, yet the exporter still fails:
2018/05/29 17:39:20 Couldn't initialize gonvml: could not load NVML library. Make sure NVML is in the shared library search path.
# cat /etc/ld.so.conf.d/nvidia.conf
/usr/local/nvidia/lib64
root@nvidia-exporter-ng2g9:~# ls /usr/local/nvidia/lib64
libEGL.so libGLESv1_CM_nvidia.so.1 libGLdispatch.so.0 libnvcuvid.so.390.46 libnvidia-fbc.so libnvidia-ml.so.390.46
libEGL.so.1 libGLESv1_CM_nvidia.so.390.46 libOpenCL.so libnvidia-cfg.so libnvidia-fbc.so.1 libnvidia-opencl.so.1
libEGL.so.1.1.0 libGLESv2.so libOpenCL.so.1 libnvidia-cfg.so.1 libnvidia-fbc.so.390.46 libnvidia-opencl.so.390.46
libEGL_nvidia.so.0 libGLESv2.so.2 libOpenCL.so.1.0 libnvidia-cfg.so.390.46 libnvidia-glcore.so.390.46 libnvidia-ptxjitcompiler.so
libEGL_nvidia.so.390.46 libGLESv2.so.2.1.0 libOpenCL.so.1.0.0 libnvidia-compiler.so.390.46 libnvidia-glsi.so.390.46 libnvidia-ptxjitcompiler.so.1
libGL.la libGLESv2_nvidia.so.2 libOpenGL.so libnvidia-egl-wayland.so.1 libnvidia-gtk2.so.390.46 libnvidia-ptxjitcompiler.so.390.46
libGL.so libGLESv2_nvidia.so.390.46 libOpenGL.so.0 libnvidia-egl-wayland.so.1.0.2 libnvidia-gtk3.so.390.46 libnvidia-tls.so.390.46
libGL.so.1 libGLX.so libcuda.so libnvidia-eglcore.so.390.46 libnvidia-ifr.so libvdpau_nvidia.so
libGL.so.1.7.0 libGLX.so.0 libcuda.so.1 libnvidia-encode.so libnvidia-ifr.so.1 tls
libGLESv1_CM.so libGLX_indirect.so.0 libcuda.so.390.46 libnvidia-encode.so.1 libnvidia-ifr.so.390.46 vdpau
libGLESv1_CM.so.1 libGLX_nvidia.so.0 libnvcuvid.so libnvidia-encode.so.390.46 libnvidia-ml.so xorg
libGLESv1_CM.so.1.2.0 libGLX_nvidia.so.390.46 libnvcuvid.so.1 libnvidia-fatbinaryloader.so.390.46 libnvidia-ml.so.1
root@nvidia-exporter-ng2g9:~# ldconfig -v|grep nvidia-ml
ldconfig: Path `/lib/x86_64-linux-gnu' given more than once
ldconfig: Path `/usr/lib/x86_64-linux-gnu' given more than once
ldconfig: /lib/x86_64-linux-gnu/ld-2.24.so is the dynamic linker, ignoring
libnvidia-ml.so.1 -> libnvidia-ml.so.390.46
root@nvidia-exporter-ng2g9:~# ldconfig -v|grep nvidia
ldconfig: Path `/lib/x86_64-linux-gnu' given more than once
ldconfig: Path `/usr/lib/x86_64-linux-gnu' given more than once
ldconfig: /lib/x86_64-linux-gnu/ld-2.24.so is the dynamic linker, ignoring
/usr/local/nvidia/lib64:
libnvidia-glcore.so.390.46 -> libnvidia-glcore.so.390.46
libnvidia-tls.so.390.46 -> libnvidia-tls.so.390.46
libEGL_nvidia.so.0 -> libEGL_nvidia.so.390.46
libnvidia-gtk3.so.390.46 -> libnvidia-gtk3.so.390.46
libnvidia-gtk2.so.390.46 -> libnvidia-gtk2.so.390.46
libnvidia-fatbinaryloader.so.390.46 -> libnvidia-fatbinaryloader.so.390.46
libnvidia-opencl.so.1 -> libnvidia-opencl.so.390.46
libnvidia-compiler.so.390.46 -> libnvidia-compiler.so.390.46
libnvidia-ml.so.1 -> libnvidia-ml.so.390.46
libGLESv2_nvidia.so.2 -> libGLESv2_nvidia.so.390.46
libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.390.46
libnvidia-cfg.so.1 -> libnvidia-cfg.so.390.46
libnvidia-ifr.so.1 -> libnvidia-ifr.so.390.46
libnvidia-egl-wayland.so.1 -> libnvidia-egl-wayland.so.1.0.2
libGLX_nvidia.so.0 -> libGLX_nvidia.so.390.46
libnvidia-fbc.so.1 -> libnvidia-fbc.so.390.46
libnvidia-eglcore.so.390.46 -> libnvidia-eglcore.so.390.46
libnvidia-glsi.so.390.46 -> libnvidia-glsi.so.390.46
libnvidia-encode.so.1 -> libnvidia-encode.so.390.46
libGLESv1_CM_nvidia.so.1 -> libGLESv1_CM_nvidia.so.390.46
/usr/local/nvidia/lib64/tls: (hwcap: 0x8000000000000000)
libnvidia-tls.so.390.46 -> libnvidia-tls.so.390.46
Hi. Would it be possible for you to update the README and repo header to link to https://github.com/NVIDIA/dcgm-exporter and deprecate this repo?
Hello!
I have been using this for a while to collect stats on GPUs.
Now, after some NVIDIA upgrades, the container complains with the error: FanSpeed() error: nvml: Not Supported
In Grafana I still get the other data, but fan speed is no longer there.
Is that something fixable?
Looking forward to your reply.
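Whatever the reason the driver stops reporting fan speed, a collector can treat "Not Supported" as "omit this metric" and keep it separate from real failures. The sketch below illustrates that idea only: `errNotSupported`, `source` and `collect` are hypothetical names, the metric names are illustrative, and it is an assumption that the NVML binding returns an error that can be matched this way.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotSupported is a hypothetical sentinel standing in for the
// "nvml: Not Supported" error quoted above.
var errNotSupported = errors.New("nvml: Not Supported")

// source is one optional per-device reading.
type source struct {
	name string
	read func() (float64, error)
}

// collect reads every source. Values that succeed are kept, readings the
// device does not support are skipped silently, and any other error is
// returned so the caller can log it.
func collect(sources []source) (map[string]float64, []error) {
	values := make(map[string]float64)
	var failures []error
	for _, s := range sources {
		v, err := s.read()
		switch {
		case err == nil:
			values[s.name] = v
		case errors.Is(err, errNotSupported):
			// The device does not expose this reading: omit the metric.
		default:
			failures = append(failures, fmt.Errorf("%s: %w", s.name, err))
		}
	}
	return values, failures
}

func main() {
	sources := []source{
		{"temperature_celsius", func() (float64, error) { return 54, nil }},
		{"fanspeed_percent", func() (float64, error) { return 0, errNotSupported }},
	}
	values, failures := collect(sources)
	fmt.Printf("exported: %v, failures: %d\n", values, len(failures))
}
```

With this split, an unsupported fan sensor no longer produces an error on every scrape, while unexpected errors remain visible.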
Is there a version with a conf or YAML file to use with Raspbian?
Similar to @discordianfish in #1, I am interested in running this exporter via Docker. Would it be possible to provide an "official" image, e.g. on Docker Hub?
dutyCycle is the GPU utilization during the last "sample period" of the driver, according to the NVIDIA docs:
Percent of time over the past sample period during which one or more kernels was executing on the GPU.
Utilization information for a device. Each sample period may be between 1 second and 1/6 second, depending on the product being queried.
https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t
So this can be a very short period. Prometheus scrape intervals are usually 10 seconds or longer, so data is almost certainly lost:
Let's say a workload uses 100% of the GPU for one second, then sleeps one second - the GPU is 50% busy on average. We don't know exactly when Prometheus will scrape, but there's a good chance it would only see 100% or 0% every time it does, so the recorded utilization will probably be incorrect.
Instead it would be better to have a ..._seconds_total counter, as is done for CPU utilization: https://www.robustperception.io/understanding-machine-cpu-usage
This way we wouldn't lose data due to long Prometheus sample periods, but it would probably require some more work in the exporter (poll data at a higher frequency).
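A minimal sketch of that counter idea, assuming the exporter polls utilization on its own short interval: each sample adds `utilization/100 × elapsed seconds` to a value that only ever increases. `busyCounter`, `poll` and `readUtilization` are hypothetical names; in the exporter, `readUtilization` would wrap the NVML utilization query.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// busyCounter accumulates GPU busy time in seconds. It only ever increases,
// so Prometheus can apply rate() to it regardless of the scrape interval.
type busyCounter struct {
	mu      sync.Mutex
	seconds float64
}

// observe adds one sample: a utilization in percent (0-100) that is assumed
// to have held for elapsedSeconds.
func (c *busyCounter) observe(utilPercent, elapsedSeconds float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.seconds += utilPercent / 100 * elapsedSeconds
}

// total returns the accumulated busy seconds.
func (c *busyCounter) total() float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.seconds
}

// poll samples readUtilization every interval until stop is closed.
func poll(c *busyCounter, readUtilization func() float64, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	last := time.Now()
	for {
		select {
		case now := <-ticker.C:
			c.observe(readUtilization(), now.Sub(last).Seconds())
			last = now
		case <-stop:
			return
		}
	}
}

func main() {
	// Replay the workload from the text: 100% busy for one second,
	// idle for one second, repeated for ten seconds.
	c := &busyCounter{}
	for i := 0; i < 10; i++ {
		util := 0.0
		if i%2 == 0 {
			util = 100
		}
		c.observe(util, 1)
	}
	fmt.Printf("busy seconds: %.1f of 10\n", c.total())
}
```

Exposed as a `..._seconds_total` counter, a `rate()` over any window covering this workload yields about 0.5, i.e. the 50% average that a single instantaneous gauge sample would miss.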
Hello!
I'm facing an error accessing the metrics from your exporter.
Steps performed:
Running the exporter with nvidia-docker
nvidia-docker run -p 9445:9445 -ti mindprince/nvidia_gpu_prometheus_exporter:0.1
Testing by accessing http://localhost:9445
docker run -d --name grafana -p 3000:3000 grafana/grafana
Using Alpine may produce this error; when I use Ubuntu there is no error.