
nvidia_gpu_prometheus_exporter's People

Contributors

adampl, rohitagarwal003

nvidia_gpu_prometheus_exporter's Issues

How to run?

So I ran:

go get github.com/mindprince/nvidia_gpu_prometheus_exporter

and nothing happened. How do I run the exporter?
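
For what it's worth, go get only downloads, builds, and installs the binary into $GOPATH/bin (or ~/go/bin when GOPATH is unset); it does not start anything itself. A minimal sketch of actually running it (assuming pre-modules Go tooling and the :9445 port used by the Docker image further down this page):

    go get github.com/mindprince/nvidia_gpu_prometheus_exporter
    $GOPATH/bin/nvidia_gpu_prometheus_exporter &    # start the exporter in the background
    curl http://localhost:9445/metrics              # scrape it once by hand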

couldn't initialize gonvml

system info: Ubuntu 16.04
gpu: RTX 2080 Ti
cuda: 10
nvidia-driver: 410.48

When I create the nvidia-gpu-prometheus-exporter pod, it errors with the message: "couldn't initialize gonvml: could not load NVML library. Make sure NVML is in the shared library search path."
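
The message means the dynamic loader could not find libnvidia-ml.so.1 when the exporter started. A sketch of how to check and work around that (the driver library path /usr/lib/nvidia-410 is my guess for driver 410.48; substitute wherever your driver actually installed its libraries):

    ldconfig -p | grep libnvidia-ml                               # is NVML in the loader cache at all?
    export LD_LIBRARY_PATH=/usr/lib/nvidia-410:$LD_LIBRARY_PATH   # hypothetical driver path
    ./nvidia_gpu_prometheus_exporter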

Deploying gpu-exporter on a non-GPU node causes an error and crash

Hi @mindprince ,
the gpu exporter works pretty well on a GPU node, but it errors when deployed on a non-GPU node.
Of course that is reasonable, because there is no GPU hardware or NVML on that node. But should we still let the gpu-exporter run and simply not expose the GPU-related metrics? That way it would match cadvisor's behavior.

I actually need this behavior because I combine the gpu-exporter and the common node-exporter in one pod (as a DaemonSet) that runs on every node, including nodes without a GPU; only that way can I join the common node metrics with the GPU node metrics.

Should we fix it? Any suggestions? (A possible workaround is sketched below.)

Best Regards
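
Until the exporter handles this itself, one possible workaround (an untested sketch, not a confirmed feature of the exporter; the exporter path is hypothetical) is to wrap the container entrypoint so the pod idles on nodes without NVML instead of crash-looping:

    #!/bin/sh
    # If no GPU/NVML is present, idle forever so the co-located node-exporter
    # container keeps serving metrics on non-GPU nodes.
    if ! nvidia-smi >/dev/null 2>&1; then
        echo "no GPU/NVML on this node; idling instead of exporting" >&2
        exec sleep infinity
    fi
    exec /usr/bin/nvidia_gpu_prometheus_exporter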

Running in Docker

Hi,

I'm trying to run this in Docker. I've created a Dockerfile here: https://github.com/discordianfish/nvidia_gpu_prometheus_exporter/blob/master/Dockerfile

I've bind-mounted /opt/nvidia/lib64 into the container and set up ld.so.conf to find it, yet the exporter still fails:

2018/05/29 17:39:20 Couldn't initialize gonvml: could not load NVML library. Make sure NVML is in the shared library search path.
# cat /etc/ld.so.conf.d/nvidia.conf 
/usr/local/nvidia/lib64
root@nvidia-exporter-ng2g9:~# ls /usr/local/nvidia/lib64
libEGL.so		 libGLESv1_CM_nvidia.so.1	libGLdispatch.so.0  libnvcuvid.so.390.46		 libnvidia-fbc.so	     libnvidia-ml.so.390.46
libEGL.so.1		 libGLESv1_CM_nvidia.so.390.46	libOpenCL.so	    libnvidia-cfg.so			 libnvidia-fbc.so.1	     libnvidia-opencl.so.1
libEGL.so.1.1.0		 libGLESv2.so			libOpenCL.so.1	    libnvidia-cfg.so.1			 libnvidia-fbc.so.390.46     libnvidia-opencl.so.390.46
libEGL_nvidia.so.0	 libGLESv2.so.2			libOpenCL.so.1.0    libnvidia-cfg.so.390.46		 libnvidia-glcore.so.390.46  libnvidia-ptxjitcompiler.so
libEGL_nvidia.so.390.46  libGLESv2.so.2.1.0		libOpenCL.so.1.0.0  libnvidia-compiler.so.390.46	 libnvidia-glsi.so.390.46    libnvidia-ptxjitcompiler.so.1
libGL.la		 libGLESv2_nvidia.so.2		libOpenGL.so	    libnvidia-egl-wayland.so.1		 libnvidia-gtk2.so.390.46    libnvidia-ptxjitcompiler.so.390.46
libGL.so		 libGLESv2_nvidia.so.390.46	libOpenGL.so.0	    libnvidia-egl-wayland.so.1.0.2	 libnvidia-gtk3.so.390.46    libnvidia-tls.so.390.46
libGL.so.1		 libGLX.so			libcuda.so	    libnvidia-eglcore.so.390.46		 libnvidia-ifr.so	     libvdpau_nvidia.so
libGL.so.1.7.0		 libGLX.so.0			libcuda.so.1	    libnvidia-encode.so			 libnvidia-ifr.so.1	     tls
libGLESv1_CM.so		 libGLX_indirect.so.0		libcuda.so.390.46   libnvidia-encode.so.1		 libnvidia-ifr.so.390.46     vdpau
libGLESv1_CM.so.1	 libGLX_nvidia.so.0		libnvcuvid.so	    libnvidia-encode.so.390.46		 libnvidia-ml.so	     xorg
libGLESv1_CM.so.1.2.0	 libGLX_nvidia.so.390.46	libnvcuvid.so.1     libnvidia-fatbinaryloader.so.390.46  libnvidia-ml.so.1
root@nvidia-exporter-ng2g9:~# ldconfig  -v|grep nvidia-ml     
ldconfig: Path `/lib/x86_64-linux-gnu' given more than once
ldconfig: Path `/usr/lib/x86_64-linux-gnu' given more than once
ldconfig: /lib/x86_64-linux-gnu/ld-2.24.so is the dynamic linker, ignoring

	libnvidia-ml.so.1 -> libnvidia-ml.so.390.46
root@nvidia-exporter-ng2g9:~# ldconfig  -v|grep nvidia   
ldconfig: Path `/lib/x86_64-linux-gnu' given more than once
ldconfig: Path `/usr/lib/x86_64-linux-gnu' given more than once
ldconfig: /lib/x86_64-linux-gnu/ld-2.24.so is the dynamic linker, ignoring

/usr/local/nvidia/lib64:
	libnvidia-glcore.so.390.46 -> libnvidia-glcore.so.390.46
	libnvidia-tls.so.390.46 -> libnvidia-tls.so.390.46
	libEGL_nvidia.so.0 -> libEGL_nvidia.so.390.46
	libnvidia-gtk3.so.390.46 -> libnvidia-gtk3.so.390.46
	libnvidia-gtk2.so.390.46 -> libnvidia-gtk2.so.390.46
	libnvidia-fatbinaryloader.so.390.46 -> libnvidia-fatbinaryloader.so.390.46
	libnvidia-opencl.so.1 -> libnvidia-opencl.so.390.46
	libnvidia-compiler.so.390.46 -> libnvidia-compiler.so.390.46
	libnvidia-ml.so.1 -> libnvidia-ml.so.390.46
	libGLESv2_nvidia.so.2 -> libGLESv2_nvidia.so.390.46
	libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.390.46
	libnvidia-cfg.so.1 -> libnvidia-cfg.so.390.46
	libnvidia-ifr.so.1 -> libnvidia-ifr.so.390.46
	libnvidia-egl-wayland.so.1 -> libnvidia-egl-wayland.so.1.0.2
	libGLX_nvidia.so.0 -> libGLX_nvidia.so.390.46
	libnvidia-fbc.so.1 -> libnvidia-fbc.so.390.46
	libnvidia-eglcore.so.390.46 -> libnvidia-eglcore.so.390.46
	libnvidia-glsi.so.390.46 -> libnvidia-glsi.so.390.46
	libnvidia-encode.so.1 -> libnvidia-encode.so.390.46
	libGLESv1_CM_nvidia.so.1 -> libGLESv1_CM_nvidia.so.390.46
/usr/local/nvidia/lib64/tls: (hwcap: 0x8000000000000000)
	libnvidia-tls.so.390.46 -> libnvidia-tls.so.390.46
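
Two things worth ruling out (assumptions on my part, not a confirmed diagnosis): the ld.so cache has to be rebuilt inside the container before the exporter process starts, and bypassing the cache with LD_LIBRARY_PATH isolates whether the cache is the problem at all (the exporter path below is a guess):

    ldconfig                                   # rebuild /etc/ld.so.cache from /etc/ld.so.conf.d/*.conf
    LD_LIBRARY_PATH=/usr/local/nvidia/lib64 /nvidia_gpu_prometheus_exporter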

Upgraded Driver, FanSpeed error

Hello!

I have been using this for a while to get stats on GPUs.

Now, after some NVIDIA driver upgrades, the container complains with the error: FanSpeed() error: nvml: Not Supported

I see in Grafana I get the other Data, but fan speed is no longer there.

Is that something fixable?
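
A quick check, assuming nvidia-smi works on the host: if it also reports [N/A] for fan speed, the driver/GPU combination genuinely stopped exposing it through NVML, and the fix on the exporter side would be to skip metrics whose NVML call returns Not Supported instead of logging an error:

    nvidia-smi --query-gpu=name,fan.speed --format=csv    # [N/A] here means NVML reports NOT_SUPPORTED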

dutyCycle loses data

dutyCycle is the GPU utilization during the last "sample period" of the driver, according to NVIDIA docs:

Percent of time over the past sample period during which one or more kernels was executing on the GPU.

Utilization information for a device. Each sample period may be between 1 second and 1/6 second, depending on the product being queried.

https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t

So this can be a very short period. Prometheus scrape intervals are usually 10 seconds or longer, so data is almost certainly lost.
Say a workload uses 100% of the GPU for one second, then sleeps for one second: the GPU is 50% busy on average. We don't know exactly when Prometheus will scrape, but there's a good chance it only ever sees 100% or 0%, so the recorded utilization will probably be wrong.

Instead it would be better to expose a ..._seconds_total counter, as is done for CPU utilization: https://www.robustperception.io/understanding-machine-cpu-usage
That way we wouldn't lose data to long Prometheus scrape intervals, but it would probably require more work in the exporter (polling at a higher frequency), as illustrated below.
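
To illustrate the counter idea, a rough sketch (not exporter code; it polls nvidia-smi at 1 Hz, assumes a single GPU, and integrates utilization into a busy-seconds total):

    total=0
    while sleep 1; do
        util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)
        total=$(echo "$total + $util / 100" | bc -l)    # accumulate seconds of GPU busy time
        echo "gpu_busy_seconds_total $total"
    done

Exposed as a counter, Prometheus can then recover the average utilization over any window with rate(gpu_busy_seconds_total[5m]), independent of the scrape interval.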

Unable to access Prometheus type data source in Grafana

Hello!

I'm facing an error accessing the metrics from your exporter.

Steps performed:

  1. Run the exporter with nvidia-docker:
     nvidia-docker run -p 9445:9445 -ti mindprince/nvidia_gpu_prometheus_exporter:0.1
  2. Test by accessing http://localhost:9445
     (screenshot: extractor)
     • seems to work well
  3. Set up Grafana in Docker:
     docker run -d --name grafana -p 3000:3000 grafana/grafana
     • works fine
  4. Add a data source in Grafana of type Prometheus
     (screenshot: grafana_docker)
     • in the URL field I provide the exporter's Docker IP address (this worked with another Prometheus setup in Docker, based on this tutorial)
     • if I provide the host machine's IP instead, I get a Bad Gateway error
  5. Open the Nvidia GPU dashboard
  6. Try to explore the metrics, but they never load
     (screenshot: grafana_explore)
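
One likely cause, though I'm guessing from the description: a Grafana data source of type Prometheus has to point at a Prometheus server, not at the exporter directly; the exporter only serves /metrics and does not implement Prometheus's query API. A sketch of the full chain (container names, network name, and the minimal prometheus.yml are illustrative):

    docker network create monitoring
    nvidia-docker run -d --name nvidia-exporter --network monitoring mindprince/nvidia_gpu_prometheus_exporter:0.1
    # prometheus.yml, mounted into the container below:
    #   scrape_configs:
    #     - job_name: nvidia
    #       static_configs:
    #         - targets: ['nvidia-exporter:9445']
    docker run -d --name prometheus --network monitoring -p 9090:9090 \
        -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
    docker run -d --name grafana --network monitoring -p 3000:3000 grafana/grafana
    # Grafana data source URL: http://prometheus:9090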
