
libnvidia-container's Introduction

libnvidia-container


This repository provides a library and a simple CLI utility to automatically configure GNU/Linux containers leveraging NVIDIA hardware.
The implementation relies on kernel primitives and is designed to be agnostic of the container runtime.

Installing the library

From packages

Configure the package repository for your Linux distribution.

Install the packages:

libnvidia-container1
libnvidia-container-tools
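
For example, on a Debian-based distribution this amounts to the following (assuming the package repository from the previous step is already configured; the yum line is the RHEL/CentOS equivalent):

# Debian/Ubuntu
sudo apt-get update && sudo apt-get install libnvidia-container1 libnvidia-container-tools

# RHEL/CentOS
sudo yum install libnvidia-container1 libnvidia-container-tools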

From sources

With Docker:

# Generate docker images for a supported <os><version>
make {ubuntu18.04, ubuntu16.04, debian10, debian9, centos7, amazonlinux2, opensuse-leap15.1}

# Or generate docker images for all supported distributions in the dist/ directory
make docker

The resulting images have the name nvidia/libnvidia-container/<os>:<version>

Without Docker:

make install

# Alternatively in order to customize the installation paths
DESTDIR=/path/to/root make install prefix=/usr

Using the library

Container runtime example

Refer to the nvidia-container-runtime project.

Command line example

# Setup a new set of namespaces
cd $(mktemp -d) && mkdir rootfs
sudo unshare --mount --pid --fork

# Setup a rootfs based on Ubuntu 16.04 inside the new namespaces
curl http://cdimage.ubuntu.com/ubuntu-base/releases/16.04/release/ubuntu-base-16.04.6-base-amd64.tar.gz | tar -C rootfs -xz
useradd -R $(realpath rootfs) -U -u 1000 -s /bin/bash nvidia
mount --bind rootfs rootfs
mount --make-private rootfs
cd rootfs

# Mount standard filesystems
mount -t proc none proc
mount -t sysfs none sys
mount -t tmpfs none tmp
mount -t tmpfs none run

# Isolate the first GPU device along with basic utilities
nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --no-cgroups --utility --device 0 $(pwd)

# Change into the new rootfs
pivot_root . mnt
umount -l mnt
exec chroot --userspec 1000:1000 . env -i bash

# Run nvidia-smi from within the container
nvidia-smi -L

Copyright and License

This project is released under the BSD 3-clause license.

Additionally, this project can be dynamically linked with libelf from the elfutils package (https://sourceware.org/elfutils), in which case additional terms apply.
Refer to NOTICE for more information.

Issues and Contributing

Check out the Contributing document!

libnvidia-container's People

Contributors

3xx0, arangogutierrez, bmwiedemann, dependabot[bot], dev-zero, doctaweeks, elezar, flx42, guptanswati, jjacobelli, julianneswinoga, katiewasnothere, klueska, morphis, nvjmayo, ooraph, psaab, renaudwastaken, rorajani, shankerd04, stgraber, zvonkok


libnvidia-container's Issues

/proc/<pid> prefixed to rootfs path

I'm running this in supervised mode with the following args and get the error where the rootfs is not found because the cli tool is prefixing the rootfs paths with /proc/.

I cannot figure out what is wrong.

# args 
[--load-kmods configure --device=0 --utility --pid=2087 /run/containerds/gpus/rootfs]
#error
ERRO[0002] exit status 1: nvidia-container-cli: container error: open failed: /proc/2087/root/run/containers/gpus/rootfs: no such file or directory

Add support for aarch64?

We use Tegra and Xavier GPUs in self-driving cars. Adding aarch64 support would help us a lot.
Could this be prioritized?

I saw that issue #7 by @Gotrek77 was closed in the past. Requesting again.

configure.c in src/cli

I read some part of the code.
In my understanding, runc creates a container, then through the prestart hook calls this CLI tool to configure some parameters for the container (mainly defined in configure.c), and finally runc runs the container.

Am I right?

Arm64 support?

Hi,
When will arm64 support be included?

I tried to build it for arm64 but I ran into problems!

Thanks
Giuseppe

nvidia-container-cli fails to enable Vulkan in containers

I'm using LXC with the mount hook that calls nvidia-container-cli, and it works for running nvidia-smi and OpenGL applications in the container. However, Vulkan applications, including vulkaninfo, vkcube, and games, all fail. For example:

$ vulkaninfo
Cannot create Vulkan instance.
/build/vulkan-tools-_xrZWD/vulkan-tools-1.1.101.0+dfsg1/vulkaninfo/vulkaninfo.c:884: failed with VK_ERROR_INCOMPATIBLE_DRIVER

Digging a little deeper with strace, I found that /usr/share/vulkan/icd.d/nvidia_icd.json is missing in the container. When I manually exported it to the container's filesystem, I got a new error:

$ vulkaninfo
The NVIDIA driver was unable to open 'libnvidia-glvkspirv.so.430.14'. This library is required at run time.

WARNING: [Loader Message] Code 0 : terminator_CreateInstance: Failed to CreateInstance in ICD 0. Skipping ICD.
Cannot create Vulkan instance.
/build/vulkan-tools-_xrZWD/vulkan-tools-1.1.101.0+dfsg1/vulkaninfo/vulkaninfo.c:884: failed with VK_ERROR_INCOMPATIBLE_DRIVER

Following the trail of errors, I tried manually exporting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.430.14 into the container. At that point, Vulkan applications started working.

So, it looks like nvidia-container-cli and/or its libraries are failing to expose at least these files to the container:

/usr/share/vulkan/icd.d/nvidia_icd.json
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.*
/usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.*

Can you folks please fix this?

Document your build requirements

This thing is impossible to build and you have to scour the internet to track down the necessary packages. Please document which packages are required and where to get them.

#include <rpc/rpc.h> - add Include dir for binaries

Hi,
After downloading the sources and enabling the WITH_TIRPC flag, compilation complained about:

/driver.h:10:10: fatal error: rpc/rpc.h: No such file or directory

It would probably be better to add the include directory

-I /usr/include/tirpc/

to the compiler flags when the WITH_TIRPC flag is enabled.

Or am I missing something?

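In the meantime, a hedged workaround is to pass the include directory in at build time; whether the project's Makefile appends or replaces compiler flags supplied this way is an assumption, so check the Makefile first:

# Build against libtirpc with the header path supplied explicitly (sketch only)
make WITH_TIRPC=yes CFLAGS="-I/usr/include/tirpc"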

Missing dist directory breaks the build when executing make rpm

After a successful build with make, I tried make rpm, which fails
because of the missing $(DESTDIR). After creating $(DESTDIR) it works.


index 19a00a6..6aab914 100644
--- a/Makefile
+++ b/Makefile
@@ -284,6 +284,7 @@ deb: install
 
 rpm: DESTDIR:=$(DIST_DIR)/$(LIB_NAME)_$(VERSION)_$(ARCH)
 rpm: all
+       $(MKDIR) -p $(DIST_DIR)
        $(CP) -T $(PKG_DIR)/rpm $(DESTDIR)
        $(LN) -nsf $(CURDIR) $(DESTDIR)/BUILD
        $(MKDIR) -p $(DESTDIR)/RPMS && $(LN) -nsf $(DIST_DIR) $(DESTDIR)/RPMS/$(ARCH)

Honor DOCKER_RAMDISK environment variable

Docker supports an environment variable called DOCKER_RAMDISK that, if set, tells runc not to use pivot_root but an alternative way of chrooting. This is needed to run Docker from a ramfs, which is exactly my use case.

The top-level use of that variable can be seen here: https://github.com/moby/moby/blob/c3a02077149ea8ee1d53b2b60a3d36c29d1505f8/libcontainerd/client_daemon.go#L305
and the NoPivotRoot option is then propagated through dockerd and containerd down to runc, which selects a way of changing the rootfs according to it (and other params, see https://github.com/opencontainers/runc/blob/63e6708c74c1cc46091ec92ea9df5aca4d82e803/libcontainer/rootfs_linux.go#L102).

Unfortunately, libnvidia-container does not honor this parameter and uses SYS_pivot_root unconditionally to change the rootfs (https://github.com/NVIDIA/libnvidia-container/blob/master/src/nvc_ldcache.c#L117). As a result, it's impossible to use nvidia-docker from an initramfs, typically on a diskless machine.

Is it possible to do something equivalent to what runc does? I will give it a try myself, but this is all quite unknown territory for me, so any help is appreciated: do it for me (that would be lovely :-) ), explain how to do it, or tell me it's not possible.
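For reference, a rough shell-level sketch of the two strategies (runc's real implementation calls mount(2) and pivot_root(2) directly from Go; this is only illustrative and $ROOTFS is a placeholder):

# pivot_root path, as libnvidia-container currently assumes (see the README example above)
cd "$ROOTFS" && pivot_root . mnt && umount -l mnt

# no-pivot path selected when DOCKER_RAMDISK is set: move-mount the new root
# over / and chroot into it, which works from a ramfs where pivot_root cannot
cd "$ROOTFS" && mount --move . / && exec chroot .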

nvidia-container-cli doesn't list symlinks to .so files

I would like nvidia-container-cli to give me a complete list of shared libraries, including symlinks, because it's the symlinks that apps actually link to.

This is for Charliecloud, which does not have a phase where containers have been created but not started, so subcommand configure won't work for us. (We don't want to link to an external library because right now we link to none.)

We can work around this by discovering the symlinks manually, but that means we can't just use nvidia-container-cli as a comprehensive source of truth.

Current behavior (v1.0.0-beta.1):

$ ./nvidia-container-cli list --libraries
/usr/lib64/libnvidia-ml.so.390.30
/usr/lib64/libnvidia-cfg.so.390.30
[...]

Expected behavior: something like:

$ ./nvidia-container-cli list --libraries --symlinks=yes
/usr/lib64/libnvidia-ml.so
/usr/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.390.30
/usr/lib64/libnvidia-cfg.so
/usr/lib64/libnvidia-cfg.so.1
/usr/lib64/libnvidia-cfg.so.390.30
[...]
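In the meantime, a hedged sketch of the manual workaround mentioned above: print each real library reported, plus any direct symlinks sitting next to it (this does not follow symlink chains such as lib.so -> lib.so.1 -> lib.so.390.30):

nvidia-container-cli list --libraries | while read -r lib; do
    echo "$lib"
    # any symlink in the same directory whose target is exactly this file
    find "$(dirname "$lib")" -maxdepth 1 -type l -lname "$(basename "$lib")"
done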

nvidia-container-cli failed to isolate the GPU

I am using nvidia-docker2 to run with --env NVIDIA_VISIBLE_DEVICES=GPU-86bdd91b-8349-37a4-7924-c06dcf9f9993,GPU-8da2b880-db8f-2491-9a71-5c56493a5744,GPU-e148979d-232f-faac-76c6-1cd34b814662,GPU-a968a191-b2ee-e8be-a5e9-4313464cc079

  1. Checking NVIDIA_VISIBLE_DEVICES in the container, it looks like the 4 requested GPUs are set:
# env | grep NVIDI
NVIDIA_REQUIRE_CUDA=cuda>=9.0
NVIDIA_VISIBLE_DEVICES=GPU-86bdd91b-8349-37a4-7924-c06dcf9f9993,GPU-8da2b880-db8f-2491-9a71-5c56493a5744,GPU-e148979d-232f-faac-76c6-1cd34b814662,GPU-a968a191-b2ee-e8be-a5e9-4313464cc079
NVIDIA_DRIVER_CAPABILITIES=compute,utility
# env | grep CUDA
NVIDIA_REQUIRE_CUDA=cuda>=9.0
CUDA_PKG_VERSION=9-0-9.0.176-1
CUDA_VERSION=9.0.176
# mount | grep dev| grep nvidia|grep devtmpfs|grep -v uvm|grep -v ctl
devtmpfs on /dev/nvidia3 type devtmpfs (ro,nosuid,noexec,relatime,size=115462336k,nr_inodes=28865584,mode=755)
devtmpfs on /dev/nvidia4 type devtmpfs (ro,nosuid,noexec,relatime,size=115462336k,nr_inodes=28865584,mode=755)
devtmpfs on /dev/nvidia5 type devtmpfs (ro,nosuid,noexec,relatime,size=115462336k,nr_inodes=28865584,mode=755)
devtmpfs on /dev/nvidia6 type devtmpfs (ro,nosuid,noexec,relatime,size=115462336k,nr_inodes=28865584,mode=755)
  2. But run nvidia-smi -L in the container:
# nvidia-smi -L
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-9bfd7bdf-2199-d783-cf06-b6c0f540b9e6)
GPU 1: Tesla P100-PCIE-16GB (UUID: GPU-3d2d5851-6f8b-b6ab-14a8-67379e1fe17f)
GPU 2: Tesla P100-PCIE-16GB (UUID: GPU-2742704b-f033-9690-f50b-ab23cd7af80b)
GPU 3: Tesla P100-PCIE-16GB (UUID: GPU-a968a191-b2ee-e8be-a5e9-4313464cc079)
GPU 4: Tesla P100-PCIE-16GB (UUID: GPU-86bdd91b-8349-37a4-7924-c06dcf9f9993)
GPU 5: Tesla P100-PCIE-16GB (UUID: GPU-8da2b880-db8f-2491-9a71-5c56493a5744)
GPU 6: Tesla P100-PCIE-16GB (UUID: GPU-e148979d-232f-faac-76c6-1cd34b814662)
GPU 7: Tesla P100-PCIE-16GB (UUID: GPU-e729c751-494c-d78b-8d47-4c1f3368bacc)

Looks like the isolation doesn't work.

nvidia-container-cli: initialization error: load library failed: libcuda.so.1

Hi,

I'm getting a failure on trying to load libcuda.so.1, but my understanding is that CUDA doesn't have to be installed on the host machine, right?

I'm on Fedora 26, with bumblebee installed but optirun shouldn't be needed, right?

[sztamas@nomad ~]$ cat /proc/acpi/bbswitch 
0000:01:00.0 ON
[sztamas@nomad ~]$ nvidia-smi 
Sat Nov 18 22:59:48 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   37C    P0    N/A /  N/A |      0MiB /  4041MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But

[sztamas@nomad ~]$ docker run --runtime=nvidia --rm nvidia/cuda find / -name nvidia-smi
container_linux.go:265: starting container process caused "process_linux.go:368: container init caused \"process_linux.go:351: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=19311 /var/lib/docker/overlay2/248624b66696970549d54634da9cba9a6c6041b9d5d587b2d2cfa6c698a70a7e/merged]\\\\nnvidia-container-cli: initialization error: load library failed: libcuda.so.1: cannot open shared object file: no such file or directory\\\\n\\\"\""
docker: Error response from daemon: oci runtime error: container_linux.go:265: starting container process caused "process_linux.go:368: container init caused \"process_linux.go:351: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=19311 /var/lib/docker/overlay2/248624b66696970549d54634da9cba9a6c6041b9d5d587b2d2cfa6c698a70a7e/merged]\\\\nnvidia-container-cli: initialization error: load library failed: libcuda.so.1: cannot open shared object file: no such file or directory\\\\n\\\"\"".

It looks like it comes down to that nvidia-container-cli: initialization error:

[sztamas@nomad ~]$ nvidia-container-cli --debug=/dev/stdout list --compute

-- WARNING, the following logs are for debugging purposes only --

I1118 21:05:30.464748 19428 nvc.c:250] initializing library context (version=1.0.0, build=ec15c7233bd2de821ad5127cb0de6b52d9d2083c)
I1118 21:05:30.464846 19428 nvc.c:225] using ldcache /etc/ld.so.cache
I1118 21:05:30.464857 19428 nvc.c:226] using unprivileged user 1000:1000
nvidia-container-cli: initialization error: load library failed: libcuda.so.1: cannot open shared object file: no such file or directory
[sztamas@nomad ~]$ ldconfig -p | grep cuda
	libicudata.so.57 (libc6,x86-64) => /lib64/libicudata.so.57
	libcuda.so.1 (libc6) => /lib/libcuda.so.1
	libcuda.so (libc6) => /lib/libcuda.so
[sztamas@nomad ~]$ uname -a
Linux nomad 4.13.12-200.fc26.x86_64 #1 SMP Wed Nov 8 16:47:26 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[sztamas@nomad ~]$ cat /etc/fedora-release 
Fedora release 26 (Twenty Six)
[sztamas@nomad ~]$ docker -v 
Docker version 17.09.0-ce, build afdb6d4
[sztamas@nomad ~]$ dnf list installed | grep nvidia-docker2
nvidia-docker2.noarch                      2.0.1-1.docker17.09.0.ce    @nvidia-docker

Any ideas what could be wrong?

Many Thanks.
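One detail worth noting in the ldconfig output above: the libcuda.so.1 entry is tagged (libc6), i.e. 32-bit, and there is no (libc6,x86-64) counterpart, so the CLI may simply be finding the wrong architecture. A quick hedged check (this is a guess at the cause, not a confirmed diagnosis):

# Is a 64-bit driver library registered, and what did the cache actually find?
ldconfig -p | grep 'libcuda.so.1'
file /lib/libcuda.so.1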

Cannot create container under Gentoo

I created a new rootfs, but instead of ubuntu-core I used a Gentoo stage tarball. When I try to copy the utility I get an error:

venus /tmp/tmp.Z1SsikCMGh/rootfs # nvidia-container-cli --load-kmods configure --no-cgroups --utility --device 0 .
nvidia-container-cli: input error: invalid rootfs directory

Where can I customize the expected paths? I think this is related to Red Hat/Debian system paths that Gentoo might not follow in all cases.

Thank you.

Problems with "no cuda-capable device is detected" after Ubuntu upgrade

I have upgraded from Ubuntu 16.04 LTS to 18.04 LTS and cannot get nvidia-docker2 to work anymore. I tried removing all nvidia packages and reinstalling from scratch. The command I use for testing now is docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi, which gives:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.0 --pid=4489 /var/lib/docker/aufs/mnt/5094d003411b0b81bdd4d54af77ed4099e2c6a65ba91923b5821c75f6dbb9c87]\\\\nnvidia-container-cli: initialization error: cuda error: no cuda-capable device is detected\\\\n\\\"\"": unknown.
zsh: exit 125   docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

nvidia-container-cli info gives me:

NVRM version:   390.87
CUDA version:   9.1

Device Index:   0
Device Minor:   0
Model:          GeForce GTX 1080 Ti
Brand:          GeForce
GPU UUID:       GPU-c585f5ec-e9bf-682d-7d19-12e0a2f0bba4
Bus Location:   00000000:01:00.0
Architecture:   6.1

nvidia-smi also works fine:

Tue Oct  9 16:55:20 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.87                 Driver Version: 390.87                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
| 23%   42C    P8    11W / 250W |     30MiB / 11176MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1518      G   /usr/lib/xorg/Xorg                            27MiB |
+-----------------------------------------------------------------------------+

Of course I tried searching for the error message, but could not find anything relevant. Do you have any suggestions about what could be going wrong here, or how to debug and get out of this situation?

Bind mount of the top binary directory causes hardlinking to fail

NVIDIA/nvidia-docker#562

Easiest repro:

$ docker run --runtime=runc -ti nvidia/cuda:9.0-base ln /usr/bin/find /usr/myfind
$ docker run --runtime=nvidia -ti nvidia/cuda:9.0-base ln /usr/bin/find /usr/myfind
ln: failed to create hard link '/usr/myfind' => '/usr/bin/find': Invalid cross-device link

Comes from this code:

/* Bind mount the top directory and every files under it with read-only permissions. */
if (xmount(err, path, path, NULL, MS_BIND, NULL) < 0)
goto fail;

I feel this call should be better documented. I know there was a good reason for doing it this way, but I don't remember it.
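For context, a minimal sketch of the bind-then-remount-read-only idiom that the quoted xmount() call is the first half of (paths illustrative). The hard link fails because link(2) refuses to cross mount boundaries, and the bind mount puts /usr/bin on a different mount from the rest of /usr:

# Bind mount a directory over itself, then flip it read-only
mount --bind /usr/bin /usr/bin
mount -o remount,bind,ro /usr/bin

# Linking from inside the bind mount to a path outside it now fails with EXDEV
ln /usr/bin/find /usr/myfind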

GPU on Windows containers

Hi,
Since you do not have an implementation for Windows: where can I read up on, or can you briefly explain, what I need to do to get an NVIDIA GPU into a Windows container? I want to run unit tests with DirectX 12 in a Docker environment with hardware acceleration.

Update to a more recent version of runC?

Hi All,

First of all thanks so much for the work you've put into this project.

I noticed that the most recent version of runC supported is from back in March -- I think the commit is this one (opencontainers/runc@69663f0). There have been a lot of significant changes in the project since then, especially in the area of rootless container support. Would it be difficult to rebase this project on a recent release?

Thanks

can't find libnvidia-container.so.1

I'm finally getting around to looking at this. (Things have been quite busy.) I can't seem to get it to run.

I'm running on a GCP GPU instance with Ubuntu Server 16.04. Here are the commands I used to build, install, and run:

git clone https://github.com/NVIDIA/libnvidia-container.git
cd libnvidia-container/
sudo apt-get install bmake libcap-dev libseccomp-dev
make
sudo make install
nvidia-container-cli --help

It seems to build and install, but the last command yields:

nvidia-container-cli: error while loading shared libraries: libnvidia-container.so.1: cannot open shared object file: No such file or directory
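A likely cause with the default install prefix is that the library lands in /usr/local/lib, which the runtime linker may not have cached yet; a hedged first check:

# Confirm where the library was installed, then refresh the linker cache
ls -l /usr/local/lib/libnvidia-container.so.1
sudo ldconfig

# Or, as a one-off test without touching the cache
LD_LIBRARY_PATH=/usr/local/lib nvidia-container-cli --help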

CAP_SYS_MODULE required even if not loading module

Ran into this using nvidia-docker inside an unprivileged LXC container. I have dropped CAP_SYS_MODULE since the kernel module is already loaded on the host.

Any nvidia-container-cli command fails due to an attempt to set CAP_SYS_MODULE:

root@lxc:/# nvidia-container-cli list
nvidia-container-cli: permission error: capability change failed: operation not permitted

To test, I removed CAP_SYS_MODULE here:

CAP_SYS_MODULE,

and then it works fine.

How to use the runtime hook for rootless RunC containers?

I would like to run rootless RunC containers based on nvidia-docker. But using the runtime hook I get:

container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --no-cgroups --device=all --compute --utility --require=cuda>=10.1 brand=tesla,driver>=418,driver<419 --pid=18266 /home/testCuda/build/runc/rootfs]\\\\nnvidia-container-cli: permission error: capability change failed: operation not permitted\\\\n\\\"\""

I tried the solution from moby/moby#38729 of setting no-cgroups = true (as you can see from the command line above), but still no progress. I do not understand whether I need additional capabilities in my runc config, or something else.

My RunC configuration looks like this:


{
	"ociVersion": "1.0.0-rc5-dev",
	"root": {
		"path": "rootfs",
		"readonly": false
	},
	"process": {
		"args": [
			"bash", "./startup.sh", "matrix1_testCuda_gtest"
		],
		"cwd": "/app",
		"env": [
"PATH=/opt/cmake/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"CUDA_VERSION=10.1.105",
"CUDA_PKG_VERSION=10-1=10.1.105-1",
"LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
"NVIDIA_VISIBLE_DEVICES=all",
"NVIDIA_DRIVER_CAPABILITIES=compute,utility",
"NVIDIA_REQUIRE_CUDA=cuda>=10.1 brand=tesla,driver>=418,driver<419",
"NCCL_VERSION=2.4.2",
"LIBRARY_PATH=/usr/local/cuda/lib64/stubs",
"CCACHE_SLOPPINESS=include_file_ctime,include_file_mtime",
"CONAN_SYSREQUIRES_SUDO=0",

			"TERM=xterm"
			],
		"oomScoreAdj": 0,
		"terminal": false,
		"user": {
			"gid": 0,
			"uid": 0
			},
		"noNewPrivileges": true,
		"capabilities": {
			"bounding": [
				"CAP_MKNOD",
				"CAP_NET_RAW",
				"CAP_KILL",
				"CAP_AUDIT_WRITE"
			],
			"effective": [
				"CAP_MKNOD",
				"CAP_NET_RAW",
				"CAP_KILL",
				"CAP_AUDIT_WRITE"

			],
			"inheritable": [
				"CAP_MKNOD",
				"CAP_NET_RAW",
				"CAP_KILL",
				"CAP_AUDIT_WRITE"
			],
			"permitted": [
				"CAP_MKNOD",
				"CAP_NET_RAW",
				"CAP_KILL",
				"CAP_AUDIT_WRITE"
			]
		},
		"rlimits": [
		]
	},

	"linux": {
		"uidMappings": [
			{
				"hostID": 500101175,
				"containerID": 0,
				"size": 1
			}
		],
		"gidMappings": [
			{
				"hostID": 513,
				"containerID": 0,
				"size": 1
			}
		],
		"maskedPaths": [
			"/proc/asound",
			"/proc/acpi",
			"/proc/kcore",
			"/proc/keys",
			"/proc/latency_stats",
			"/proc/timer_list",
			"/proc/timer_stats",
			"/proc/sched_debug",
			"/proc/scsi",
			"/sys/firmware"
		],
		"namespaces": [
			{
				"type": "mount"
			},
			{
				"type": "uts"
			},
			{
				"type": "pid"
			},
			{
				"type": "ipc"
			},
			{
				"type": "user"
			}
		],
	"readonlyPaths": [
		"/proc/bus",
		"/proc/fs",
		"/proc/irq",
		"/proc/sys",
		"/proc/sysrq-trigger"
		]
	},
	"mounts": [
		{
			"destination": "/proc",
			"options": [
			"nosuid",
			"noexec",
			"nodev"
			],
			"source": "proc",
			"type": "proc"
		},
		{
			"destination": "/dev",
			"options": [
				"nosuid",
				"strictatime",
				"mode=755",
				"size=65536k"
			],
			"source": "tmpfs",
			"type": "tmpfs"
		},
		{
			"destination": "/dev/pts",
			"options": [
			"nosuid",
			"noexec",
			"newinstance",
			"ptmxmode=0666",
			"mode=0620"
			],
			"source": "devpts",
			"type": "devpts"
		},
		{
			"destination": "/sys",
			"source": "/sys",
			"options": [
				"rbind",
				"nosuid",
				"noexec",
				"nodev",
				"ro"
			],
			"type": "none"
		},
		{
			"destination": "/sys/fs/cgroup",
			"options": [
				"ro",
				"nosuid",
				"noexec",
				"nodev"
			],
			"source": "cgroup",
			"type": "cgroup"
		},
		{
			"destination": "/dev/mqueue",
			"options": [
				"nosuid",
				"noexec",
				"nodev"
			],
			"source": "mqueue",
			"type": "mqueue"
		}
	]
	,"hooks": {
	    "prestart": [
	    	{
		        "path": "/usr/bin/nvidia-container-runtime-hook",
		        "args": ["nvidia-container-runtime-hook",  "-config", "/home/testCuda/build/runc/nvidia_hook.conf", "prestart"],
		        "env": [
		            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
		        ]
	    	}
	    ]
	}
}

Unable to sync packages to foreman

I'm trying to sync the RHEL-based repository to Foreman, and it fails. Foreman does not support the sha512 checksum type.

Here is the error that I'm getting
"PLP1005: The checksum type 'sha512' is unknown."

Can you create a repo with sha256 checksums? It would allow my team to use these packages as part of our automation.

Is it necessary for cuda to be on the host?

$ nvidia-container-cli list
nvidia-container-cli: initialization error: cuda error: unknown error

I don't have CUDA installed on this host, but the expectation is that CUDA will be installed in the container, right? Is it necessary for CUDA to exist on the host for this to work?
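For what it's worth, libcuda.so.1 is shipped by the NVIDIA driver rather than by the CUDA toolkit, so the toolkit itself does not need to be on the host. A hedged sanity check that the driver side is healthy (an "unknown error" here often points at the driver or its kernel modules rather than at a missing toolkit; treat this as a guess):

# Driver userspace libraries and kernel modules, neither of which comes from the CUDA toolkit
ldconfig -p | grep -E 'libcuda\.so|libnvidia-ml\.so'
lsmod | grep nvidia
nvidia-smi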

[LowPriority] References to the old GPU driver are kept (with zero bytes) after upgrading or downgrading the drivers on the host

It's not an error, but it is something I see from time to time.
I am using LXD/LXC containers. When I first launched the container, my host had driver version nvidia-418.56.
After some time I had to downgrade my host NVIDIA drivers to an earlier version, nvidia-410.104, and after building some other software inside the container, running ldconfig dumped the following messages:

/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libvdpau_nvidia.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libcuda.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvcuvid.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.1 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.418.56 is empty, not checked.
/sbin/ldconfig.real: File /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.418.56 is empty, not checked.

So I checked whether some files were left behind, and I found the following
(references from older drivers with zero bytes):

lrwxrwxrwx  1 root   root           24 Apr 14 21:41 libEGL_nvidia.so.0 -> libEGL_nvidia.so.410.104
-rw-r--r--  1 nobody nogroup   1031584 Feb  6 04:55 libEGL_nvidia.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libEGL_nvidia.so.418.56
lrwxrwxrwx  1 root   root           30 Apr 14 21:41 libGLESv1_CM_nvidia.so.1 -> libGLESv1_CM_nvidia.so.410.104
-rw-r--r--  1 nobody nogroup     60200 Feb  6 04:54 libGLESv1_CM_nvidia.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libGLESv1_CM_nvidia.so.418.56
lrwxrwxrwx  1 root   root           27 Apr 14 21:41 libGLESv2_nvidia.so.2 -> libGLESv2_nvidia.so.410.104
-rw-r--r--  1 nobody nogroup    111400 Feb  6 04:54 libGLESv2_nvidia.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libGLESv2_nvidia.so.418.56
lrwxrwxrwx  1 root   root           23 Apr  1 09:17 libGLX_indirect.so.0 -> libGLX_nvidia.so.418.56
lrwxrwxrwx  1 root   root           24 Apr 14 21:41 libGLX_nvidia.so.0 -> libGLX_nvidia.so.410.104
-rw-r--r--  1 nobody nogroup   1274704 Feb  6 04:56 libGLX_nvidia.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libGLX_nvidia.so.418.56
lrwxrwxrwx  1 root   root           24 Apr 14 21:41 libnvidia-cfg.so.1 -> libnvidia-cfg.so.410.104
-rw-r--r--  1 nobody nogroup    179592 Feb  6 04:54 libnvidia-cfg.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libnvidia-cfg.so.418.56
-rw-r--r--  1 nobody nogroup  47842480 Feb  6 05:14 libnvidia-compiler.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libnvidia-compiler.so.418.56
-rw-r--r--  1 nobody nogroup  25283584 Feb  6 05:12 libnvidia-eglcore.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libnvidia-eglcore.so.418.56
lrwxrwxrwx  1 root   root           27 Apr 14 21:41 libnvidia-encode.so.1 -> libnvidia-encode.so.410.104
-rw-r--r--  1 nobody nogroup    168184 Feb  6 04:54 libnvidia-encode.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libnvidia-encode.so.418.56
-rw-r--r--  1 nobody nogroup    292840 Feb  6 04:55 libnvidia-fatbinaryloader.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libnvidia-fatbinaryloader.so.418.56
lrwxrwxrwx  1 root   root           24 Apr 14 21:41 libnvidia-fbc.so.1 -> libnvidia-fbc.so.410.104
-rw-r--r--  1 nobody nogroup    123112 Feb  6 04:54 libnvidia-fbc.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libnvidia-fbc.so.418.56
-rw-r--r--  1 nobody nogroup  27088008 Feb  6 05:12 libnvidia-glcore.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libnvidia-glcore.so.418.56
-rw-r--r--  1 nobody nogroup    578872 Feb  6 04:55 libnvidia-glsi.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libnvidia-glsi.so.418.56
lrwxrwxrwx  1 root   root           24 Apr 14 21:41 libnvidia-ifr.so.1 -> libnvidia-ifr.so.410.104
-rw-r--r--  1 nobody nogroup    206888 Feb  6 04:54 libnvidia-ifr.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libnvidia-ifr.so.418.56
lrwxrwxrwx  1 root   root           23 Apr 14 21:41 libnvidia-ml.so.1 -> libnvidia-ml.so.410.104
-rw-r--r--  1 nobody nogroup   1528376 Feb  6 04:58 libnvidia-ml.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libnvidia-ml.so.418.56
lrwxrwxrwx  1 root   root           27 Apr 14 21:41 libnvidia-opencl.so.1 -> libnvidia-opencl.so.410.104
-rw-r--r--  1 nobody nogroup  28467576 Feb  6 05:12 libnvidia-opencl.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libnvidia-opencl.so.418.56
lrwxrwxrwx  1 root   root           31 Apr  1 09:17 libnvidia-opticalflow.so.1 -> libnvidia-opticalflow.so.418.56
-rw-r--r--  1 root   root            0 Apr  1 09:17 libnvidia-opticalflow.so.418.56
lrwxrwxrwx  1 root   root           35 Apr 14 21:41 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.410.104
-rw-r--r--  1 nobody nogroup  12129448 Feb  6 05:01 libnvidia-ptxjitcompiler.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libnvidia-ptxjitcompiler.so.418.56
-rw-r--r--  1 nobody nogroup     14480 Feb  6 04:54 libnvidia-tls.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libnvidia-tls.so.418.56
lrwxrwxrwx  1 root   root           26 Apr 14 21:41 libvdpau_nvidia.so.1 -> libvdpau_nvidia.so.410.104
-rw-r--r--  1 nobody nogroup    991552 Feb  6 04:55 libvdpau_nvidia.so.410.104
-rw-r--r--  1 root   root            0 Apr  1 09:17 libvdpau_nvidia.so.418.56

Could there be a way to make those old references go away automatically, without needing to remove each one by hand?
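Until that happens automatically, a hedged cleanup sketch to run inside the container; the driver version is illustrative, so review the matched files before deleting anything:

# List, then remove, the zero-byte leftovers from the previous driver version
find /usr/lib/x86_64-linux-gnu -maxdepth 1 -name '*.so.418.56' -size 0 -print
find /usr/lib/x86_64-linux-gnu -maxdepth 1 -name '*.so.418.56' -size 0 -delete
ldconfig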

Some questions about nvidia-docker v2

Hi all, not sure if this is the right place to ask :)

I have been following nvidia-docker for a long time (since v1). We were working on an accelerator framework in Docker back then, trying to find a common way to use special devices like NVIDIA GPUs in containers easily. Our work focused on the Docker daemon, so we built a simple scheme based on v1 of nvidia-docker.

Now that nvidia-docker has come to v2, it seems to be a totally different thing compared with v1: the nvidia-container-cli prestart hook does all the work. I'm just wondering whether this is the final design. Preparing the GPU in a prestart hook is a good idea, but is it necessary to do this in a new runtime? Would it be better to do this with a new docker command and standard runc (i.e. have Docker add a new prestart hook to the runtime spec)?

ping @flx42

Thanks

CUDA Error encountered during standard ongoing operation

Installed Host os software versions:

  • Ubuntu 18.04.1
  • Nvidia Docker 2.0.3
  • Nvidia Driver 390.48
  • libnvidia-container1 1.0.0 rc.2-1
  • nvidia-container-runtime 2.0.0
  • Docker CE 18.06
  • Secure boot is off, nvidia_uvm kernel module is loaded upon boot, of course.

We are running FFmpeg compiled with nvenc inside a container derived from the cuda:16.04-cuda-9.1 image for prolonged durations, and sometimes the FFmpeg command just hangs. Trying to run any CUDA-related container or container operation beyond that point, such as:

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

or

nvidia-container-cli -k -d /dev/tty list.

results in the following error:

nvidia-container-cli: initialization error: cuda error: unknown error

nvidia-smi works on the host os without an issue.

Rebooting the machine resolves the issue, but that's less than desirable, and yes - we know this has already been discussed in #3 a year ago.

@flx42 @3XX0 any advice or help is welcomed

Build Issues on Amazon Linux AMIs

I'm trying to build on Amazon Linux AMIs because if I install it via RPM it has issues running:

 /usr/lib64/libnvidia-container.so.1: undefined symbol: cap_get_bound

I tried to compile it myself to overcome this issue, and ran into another problem.

In file included from /usr/include/sys/prctl.h:22:0,
                 from /home/ec2-user/libnvidia-container/src/nvc_ldcache.c:10:
/usr/include/linux/prctl.h:134:2: error: unknown type name ‘__u64’
  __u64 start_code;  /* code section bounds */
  ^
/usr/include/linux/prctl.h:135:2: error: unknown type

I've installed the kernel headers, which I thought would provide __u64. Any help with compiling this on Amazon Linux?

Gentoo ebuild - help needed

I have created a basic package for Gentoo (in the form of an ebuild file). I got stuck on this error:

 fatal error: rpc/rpc.h: No such file or directory
 #include <rpc/rpc.h>

I can find such file in multiple packages:

app-crypt/mit-krb5-1.16-r2 (/usr/include/gssrpc/rpc.h)
dev-libs/libevent-2.1.8 (/usr/include/event2/rpc.h)
net-libs/libtirpc-1.0.2-r1 (/usr/include/tirpc/rpc/rpc.h)
sys-kernel/gentoo-sources-4.14.65 (/usr/src/linux-4.14.65-gentoo/drivers/staging/lustre/lnet/selftest/rpc.h)
sys-kernel/gentoo-sources-4.14.78 (/usr/src/linux-4.14.78-gentoo/drivers/staging/lustre/lnet/selftest/rpc.h)
sys-libs/glibc-2.27-r6 (/usr/src/debug/sys-libs/glibc-2.27-r6/glibc-2.27/include/rpc/rpc.h)
sys-libs/glibc-2.27-r6 (/usr/src/debug/sys-libs/glibc-2.27-r6/glibc-2.27/sunrpc/rpc/rpc.h)

Which is the correct package to use (I suspect it's glibc)? I will try to include it in the Makefile.

Thank you.
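For what it's worth, recent glibc builds no longer install the SunRPC headers, so net-libs/libtirpc is most likely the package to build against rather than glibc; a hedged sketch (whether the Makefile honors an include path passed this way is an assumption, check the Makefile first):

emerge --ask net-libs/libtirpc
# then build with WITH_TIRPC=yes, pointing the compiler at the tirpc headers
make WITH_TIRPC=yes CFLAGS="-I/usr/include/tirpc"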

Error: "driver service terminated with signal 15"

System:

  • Archlinux
  • Linux: 4.19.26-1-lts
  • Nvidia: 418.43
  • Cuda: 10.0.130
  • dmesg.txt

Build steps:

make WITH_TIRPC=yes prefix=/usr/local
make install

NOTE: The build failed unless the attached patch was applied to fix the Makefile:
fix_flags.patch.txt

Failing command:

$ nvidia-container-cli --debug=log.txt info
nvidia-container-cli: initialization error: cuda error: unknown error

Content of log.txt:


-- WARNING, the following logs are for debugging purposes only --

I0306 20:39:40.826306 23023 nvc.c:281] initializing library context (version=1.0.1, build=fe20a8e4a17a63df8116f39795173a461325fb3d)
I0306 20:39:40.826369 23023 nvc.c:255] using root /
I0306 20:39:40.826378 23023 nvc.c:256] using ldcache /etc/ld.so.cache
I0306 20:39:40.826385 23023 nvc.c:257] using unprivileged user 1000:1000
I0306 20:39:40.826624 23024 driver.c:133] starting driver service
I0306 20:39:40.874145 23023 driver.c:233] driver service terminated with signal 15

Building from source - issue with libelf

Hey there!

I'm trying to build from source because I'm using Ubuntu 18.04, but I'm facing the following compile error with libelf:

/home/tiago/make_repos/libnvidia-container/deps/usr/lib/@DEB_HOST_MULTIARCH@/libelf.a(elf_data.o): In function `elf_getdata':
/home/tiago/make_repos/libnvidia-container/deps/src/elftoolchain-0.7.1/libelf/elf_data.c:120: undefined reference to `_libelf_msize'
/home/tiago/make_repos/libnvidia-container/deps/src/elftoolchain-0.7.1/libelf/elf_data.c:154: undefined reference to `_libelf_get_translator'
/home/tiago/make_repos/libnvidia-container/deps/usr/lib/@DEB_HOST_MULTIARCH@/libelf.a(elf_scn.o): In function `_libelf_load_section_headers':
/home/tiago/make_repos/libnvidia-container/deps/src/elftoolchain-0.7.1/libelf/elf_scn.c:72: undefined reference to `_libelf_fsize'
/home/tiago/make_repos/libnvidia-container/deps/src/elftoolchain-0.7.1/libelf/elf_scn.c:87: undefined reference to `_libelf_get_translator'
/home/tiago/make_repos/libnvidia-container/deps/usr/lib/@DEB_HOST_MULTIARCH@/libelf.a(gelf_dyn.o): In function `gelf_getdyn':
/home/tiago/make_repos/libnvidia-container/deps/src/elftoolchain-0.7.1/libelf/gelf_dyn.c:70: undefined reference to `_libelf_msize'
/home/tiago/make_repos/libnvidia-container/deps/usr/lib/@DEB_HOST_MULTIARCH@/libelf.a(gelf_fsize.o): In function `elf32_fsize':
/home/tiago/make_repos/libnvidia-container/deps/src/elftoolchain-0.7.1/libelf/gelf_fsize.c:37: undefined reference to `_libelf_fsize'
/home/tiago/make_repos/libnvidia-container/deps/usr/lib/@DEB_HOST_MULTIARCH@/libelf.a(gelf_fsize.o): In function `elf64_fsize':
/home/tiago/make_repos/libnvidia-container/deps/src/elftoolchain-0.7.1/libelf/gelf_fsize.c:43: undefined reference to `_libelf_fsize'
/home/tiago/make_repos/libnvidia-container/deps/usr/lib/@DEB_HOST_MULTIARCH@/libelf.a(libelf_ehdr.o): In function `_libelf_ehdr':
/home/tiago/make_repos/libnvidia-container/deps/src/elftoolchain-0.7.1/libelf/libelf_ehdr.c:138: undefined reference to `_libelf_fsize'
/home/tiago/make_repos/libnvidia-container/deps/src/elftoolchain-0.7.1/libelf/libelf_ehdr.c:146: undefined reference to `_libelf_msize'
/home/tiago/make_repos/libnvidia-container/deps/src/elftoolchain-0.7.1/libelf/libelf_ehdr.c:169: undefined reference to `_libelf_get_translator'
/home/tiago/make_repos/libnvidia-container/deps/usr/lib/@DEB_HOST_MULTIARCH@/libelf.a(libelf_ehdr.o): In function `_libelf_load_extended':
/home/tiago/make_repos/libnvidia-container/deps/src/elftoolchain-0.7.1/libelf/libelf_ehdr.c:52: undefined reference to `_libelf_fsize'
/home/tiago/make_repos/libnvidia-container/deps/src/elftoolchain-0.7.1/libelf/libelf_ehdr.c:63: undefined reference to `_libelf_get_translator'

I'm executing make deb.

Thanks!
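One hedged thing to try: the undefined _libelf_* symbols come from the bundled elftoolchain copy of libelf, and per the NOTICE the project can alternatively be linked against elfutils' libelf, so installing the distribution headers and rebuilding from a clean tree may sidestep the bundled build (this is an assumption, not a confirmed fix):

sudo apt-get install libelf-dev pkg-config
make deb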

xfuncs: x* functions have a non-uniform behavior

Hi,

x* functions like xopen, xcalloc and such are wrappers around basic functions. Wrapping these functions can be useful, but here we have a lot of different, unexplained behaviors we might want to fix.

Should we rewrite the whole thing?

  • In a lot of cases, the error is not very meaningful.
  • In some cases, we lose information (e.g. xopen and xclose return only one error type).
  • In some cases, we might even fail while failing, e.g. xstrdup, which can fail due to a bad allocation. If we fail, we go to error_set, which calls asprintf, which allocates memory. Yes!
  • In other cases, the error check we have to do afterwards is the same (xdlopen?).

My opinion is that these functions SHOULD exit when failing. This lib has no fallback when something goes wrong. Errors are set, but not handled, so we can either crash or hit undefined behavior.

My suggestion is to rewrite the whole file so that each function, if it fails:

  • writes a meaningful message to stderr
  • aborts

Opinions?

Breaks container compatibility

This software, nvidia-container-cli, has an incomplete list of NVIDIA library dependencies.

When I remove it, the containers in Singularity and Docker no longer need to bootstrap from the libglvnd project for full hardware acceleration and GL dependencies.

Setup local mirror for CentOS/RHEL

Hi!

In our company only a few machines have direct access to the internet.

This is the reason why I need to set up a local mirror of the CentOS/RHEL packages.

Can you please give me a starting URL for wget --mirror?

Thanks a lot

Dirk
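A hedged starting point; the exact distribution/architecture path under this URL is an assumption, so browse the repository first to confirm the layout:

wget --mirror --no-parent https://nvidia.github.io/libnvidia-container/centos7/x86_64/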

Alpine Linux resp. non-glibc dependend compilation

I know that you usually refuse to support non-glibc-based distributions on principle, but it would be really useful if we could use nvidia-docker on LinuxKit as the base for our diskless CUDA computing clusters. The core parts of LinuxKit are all built around Alpine Linux because of its extraordinarily small size. There are various good reasons why one may like this choice or wholeheartedly hate it, but we simply have to live with it.

Porting the NVIDIA kernel drivers to Alpine Linux resp. LinuxKit was a trivial job, but I failed miserably when I had to fight the RPC dependencies of libnvidia-container. :(

I have already prepared a Dockerfile.alpine and the most obvious necessary Makefile changes here:

https://gitlab.com/mash-graz/libnvidia-container

But so far I haven't been able to solve all the RPC-related troubles during compilation. :(
It looks like libnvidia-container should be able to use tirpc, which is already available as an apk package in Alpine, but I couldn't figure out whether it would really work as a substitute.

Would it be possible for you to take a look at this particular issue?

I think I could handle the rest myself, and it would indeed be an improvement of significant benefit to others as well.

see:
linuxkit/linuxkit#613
linuxkit/linuxkit#2944
moby/moby#23917

thanks!

LXC hook failed to run out of the box

Background: On a newly installed Ubuntu box, I am trying to start a container in LXC but it fails to start once I set nvidia.runtime="true". I frantically traced the error to the last line of the /usr/share/lxc/hooks/nvidia LXC hook, which is a line calling nvidia-container-cli. The exec call should be something like this (extracted from a very long strace log):

LXC_CONFIG_FILE=/var/log/lxd/abc/lxc.conf LXC_LOG_LEVEL=ERROR LXC_ROOTFS_MOUNT=/usr/lib/x86_64-linux-gnu/lxc LXC_NAME=abc http_proxy= NVIDIA_VISIBLE_DEVICES=none LD_LIBRARY_PATH=/usr/lib/lxd/ LXC_CONSOLE_LOGPATH=/var/log/lxd/abc/console.log LXC_ROOTFS_PATH=/var/lib/lxd/containers/abc/rootfs NVIDIA_DRIVER_CAPABILITIES=compute,utility PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/sbin:/usr/bin:/sbin:/bin LISTEN_FDNAMES=lxd.socket PWD=/ LANG=en_HK.UTF-8 LXC_HOOK_VERSION=0 SHLVL=0 LANGUAGE=en_HK:en LXC_CGNS_AWARE=1 LVM_SUPPRESS_FD_WARNINGS=1 nvidia-container-cli --user configure --no-cgroups --ldconfig=@/sbin/ldconfig.real --compute --utility /usr/lib/x86_64-linux-gnu/lxc

Problem: Inside the above nvidia-container-cli execution something goes wrong (and thus my LXC hook fails and LXC complains it could not start the container). When running it on the command line directly, I get a message like:

nvidia-container-cli: container error: stat failed: /usr/lib/x86_64-linux-gnu/lxc/proc/13970: no such file or directory (of course the PID changes every time, but I don't think it is relevant for the moment).

My current understanding of the situation is that nvc_container_new() somehow fails to continue because it cannot find the named folder with the stat() syscall (it might be fstat(); I used strace to make my guess), but I have no clue about the exact line of code that looks for that particular folder.

Could anyone please help resolve the problem? It is rather sad when things don't work even on a clean box. I am willing to dig deeper into the root cause but failed to make significant progress with strace, so I would like some help from people who are more knowledgeable about how nvc_container_new() works.


Software Versions (I specifically re-installed the machine so that it is "clean" enough for easier reproduction of the issue):
OS: Ubuntu 16.04.5 (kernel: 4.4.0-131-generic)
Packages installed after an "apt-get update && apt-get upgrade" of a blank Ubuntu:
CUDA: 10.0.130-1, Driver 410.72 installed via developer.download.nvidia.com/compute/cuda/repos
LXC: 3.0.2 from ppa:ubuntu-lxc/stable (lxc lxcfs)
LXD: 3.0.2 from xenial-backports (xenial-backports lxd lxd-client)
ZFS (apt-get zfsutils-linux)
nvidia-container-runtime installed via https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/nvidia-container-runtime.list (thus libnvidia-container installed as its dependency)

Add go bindings or go implementation

As many container runtimes and platforms consume NVIDIA support, and many of these are written in Go, it would be good to have Go bindings or a Go implementation of this lib so that initialization checks and validation can be performed at the runtime/platform level for NVIDIA GPUs.

Doing this at the hook level, where this lib is currently used, is still late in the container lifecycle; we could provide quicker feedback for users by consuming this lib earlier in the stack.

cli segfaults on power with rhel & centos

This is kind of a question-issue. I set up the runtime on a RHEL 7.5 system, but when I tried using it via docker commands, I got a segfault in the CLI. Running just the CLI, at one point I thought I at least got some usage instructions; now it's just a segfault. Could it be that a prerequisite is checked for first? It doesn't look like that's the case.

$ /usr/bin/nvidia-container-cli --help
Segmentation fault

I built a debug version of the CLI and attached gdb to it, but putting my first breakpoint at main hit this too, so it's pretty immediate, which is weird.

If this is not expected, I can provide more detail of course.

Docker Swarm on nvidia GPUs

Hello,
I'm trying to create a Docker Swarm cluster on NVIDIA GPUs. The cluster will be used for TensorFlow training on a single GPU for deep-learning purposes. I chose Swarm instead of Kubernetes because of the limited number of PCs at my disposal.
I have three PCs, each with two NVIDIA GeForce GTX 1080 Ti cards. My task is to make it possible for each new training job launched by the users to be assigned to a single free GPU for the computation.

From my research I know that Swarm doesn't support nvidia-docker2, so I don't know how to expose the GPU resources of my PCs in such a way that Swarm can see them as workers and assign the training containers to them.
