Comments (12)

bmartinn avatar bmartinn commented on June 3, 2024

Thanks @AzaelCicero !

I would expect it to run command with NVIDIA_VISIBLE_DEVICES set to correct value,

Yes one would expect that but life is strange ;)

The way the nvidia docker runtime works is that inside the container (when it is executed with the --gpus flag) only the selected GPUs are available.
For example, let's assume we have a dual-GPU machine, GPU_0 and GPU_1.
We run a container on GPU_1 with docker run --gpus device=1 ..., which means we assigned only GPU_1 to the container. Inside the container we see a single GPU, but its index will be "0" (because GPU indices inside the container always start at zero). By setting NVIDIA_VISIBLE_DEVICES=all inside the container, we are basically telling the process it can use all the GPUs the nvidia docker runtime allocated for this container instance.

Make sense ?
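For illustration only, here is a minimal sketch (using a hypothetical run_task_container helper, not the actual clearml-agent code) of launching a task container under these assumptions:

import subprocess

def run_task_container(image, host_gpu_ids):
    # GPU isolation is delegated to the --gpus flag (host indices), while
    # NVIDIA_VISIBLE_DEVICES=all inside the container simply means
    # "everything the runtime already allocated to this container".
    cmd = [
        "docker", "run", "--rm",
        "--gpus", "device=" + ",".join(str(i) for i in host_gpu_ids),
        "-e", "NVIDIA_VISIBLE_DEVICES=all",
        image, "nvidia-smi", "-L",
    ]
    subprocess.run(cmd, check=True)

# Selecting host GPU_1: inside the container nvidia-smi reports a single GPU, indexed 0.
run_task_container("nvidia/cuda:11.2.0-devel", [1])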

AzaelCicero avatar AzaelCicero commented on June 3, 2024

It makes sense and it's even in line with the Docker documentation.
However, it contradicts my observations on the latest Ubuntu 20.04 software stack and images (nvidia/cuda:11.2.0-devel).

Let me walk through those commands and their outputs:
╭─2021-02-25 08:26:01 kpawelczyk@AS-PC007 ~
╰─$ docker pull nvidia/cuda:11.2.0-devel
11.2.0-devel: Pulling from nvidia/cuda
Digest: sha256:68f7fbf7c6fb29340f4351c94b2309c43c98a5ffe46db1d6fa4f7c262fc223cb
Status: Image is up to date for nvidia/cuda:11.2.0-devel
docker.io/nvidia/cuda:11.2.0-devel
╭─2021-02-25 08:26:11 kpawelczyk@AS-PC007 ~
╰─$ nvidia-smi -L
GPU 0: GeForce RTX 2070 (UUID: GPU-862380e5-b56b-9042-9a3a-012482d99792)
GPU 1: GeForce RTX 2060 SUPER (UUID: GPU-6310f70f-d10e-2753-d17c-51a0fd440cf3)
╭─2021-02-25 08:26:15 kpawelczyk@AS-PC007 ~
╰─$ ll /proc/driver/nvidia/gpus
total 0
dr-xr-xr-x 2 root root 0 lut 24 08:05 0000:06:00.0
dr-xr-xr-x 2 root root 0 lut 24 08:05 0000:07:00.0
╭─2021-02-25 08:26:17 kpawelczyk@AS-PC007 ~
╰─$ docker run --privileged --gpus device=1 -it nvidia/cuda:11.2.0-devel bash
root@5d4b12d55688:/# nvidia-smi -L
GPU 0: GeForce RTX 2070 (UUID: GPU-862380e5-b56b-9042-9a3a-012482d99792)
GPU 1: GeForce RTX 2060 SUPER (UUID: GPU-6310f70f-d10e-2753-d17c-51a0fd440cf3)
root@5d4b12d55688:/# ll /proc/driver/nvidia/gpus
total 0
drwxr-xr-x 3 root root 60 Feb 25 07:27 ./
dr-xr-xr-x 3 root root 120 Feb 25 07:27 ../
dr-xr-xr-x 2 root root 0 Feb 24 07:05 0000:07:00.0/
root@5d4b12d55688:/# cat /proc/driver/nvidia/gpus/0000:07:00.0/information
Model: GeForce RTX 2060 SUPER
IRQ: 79
GPU UUID: GPU-6310f70f-d10e-2753-d17c-51a0fd440cf3
Video BIOS: 90.06.44.80.56
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:07:00.0
Device Minor: 1
Blacklisted: No
root@5d4b12d55688:/# echo $NVIDIA_VISIBLE_DEVICES
1
root@5d4b12d55688:/# exit
╭─2021-02-25 08:32:36 kpawelczyk@AS-PC007 ~
╰─$ docker --version
Docker version 19.03.13, build 4484c46d9d

AzaelCicero avatar AzaelCicero commented on June 3, 2024

I didn't change anything in the Docker configuration. My assumption is that NVIDIA changed the behaviour of their docker runtime.

Why does clearml-agent set NVIDIA_VISIBLE_DEVICES=all in the command? Is there an edge case that requires such an explicit operation?

bmartinn avatar bmartinn commented on June 3, 2024

@AzaelCicero I see the issue: --privileged will cause the container to see all the GPUs, see the issue here.
You can quickly verify:
Run docker run --gpus device=1 -it nvidia/cuda:11.2.0-devel bash, then inside the container run nvidia-smi -L (you should see a single GPU). On the contrary, if you run docker run --privileged --gpus device=1 -it nvidia/cuda:11.2.0-devel bash and then run nvidia-smi -L inside the container, you will get both GPUs.

Why does clearml-agent set NVIDIA_VISIBLE_DEVICES=all in the command? Is there an edge case that requires such an explicit operation?

Good question. I think there was: containers with leftover environment variables (I think). If NVIDIA_VISIBLE_DEVICES is not set or is set to all, it is basically the same. It also tells the agent running inside the container that it has GPU support (it needs that information to look for the correct PyTorch package based on the CUDA version, for example).
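As a rough sketch (a hypothetical helper, not the actual clearml-agent logic), the variable can double as a GPU-support hint for the agent running inside the container:

import os

def gpu_support_enabled():
    """Return True if the container appears to have GPUs attached."""
    visible = os.environ.get("NVIDIA_VISIBLE_DEVICES", "")
    # Empty or "none" means CPU-only; "all" or an explicit id list means GPUs are available.
    return visible not in ("", "none")

if gpu_support_enabled():
    print("GPU support detected: resolve a CUDA-matched PyTorch build")
else:
    print("CPU only: resolve the plain PyTorch build")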

Is there a reason you have --privileged on?
Maybe we should detect it and set NVIDIA_VISIBLE_DEVICES accordingly?

AzaelCicero avatar AzaelCicero commented on June 3, 2024

Thanks @bmartinn for pointing out the impact of --privileged; I was not aware of this behaviour.

My environment requires --privileged. I am leveraging Docker-in-Docker as the clearml-agent runner image, as I am running software that is spread across multiple Docker containers. Currently I have a workaround: the first step in the run script detects which devices are available based on the contents of /proc/driver/nvidia/gpus/ and corrects the value of NVIDIA_VISIBLE_DEVICES.

I don't think this is a normal use case for ClearML, but I like the idea of detecting privileged mode. It should be easy, as the flag can only be introduced by an entry in the agent.default_docker.arguments setting. I will try to propose a solution if I find time to spare.

bmartinn avatar bmartinn commented on June 3, 2024

@AzaelCicero I think you are right; even though it is not a "traditional" setup, clearml-agent should handle it properly.
In order for the agent to pass the correct NVIDIA_VISIBLE_DEVICES (i.e. to understand that --gpus is ignored), it needs to know it is also passing the --privileged flag.
Maybe we should add a configuration / flag saying all Tasks' containers should always run with --privileged (then we know we need to change the NVIDIA_VISIBLE_DEVICES behaviour), WDYT?
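Something along these lines, as a minimal sketch of the proposed behaviour (a hypothetical helper, not agent code):

def visible_devices_value(selected_gpus, extra_docker_args):
    # With --privileged, --gpus isolation cannot be relied on, so pass the
    # explicit host GPU indices instead of "all".
    if "--privileged" in extra_docker_args:
        return ",".join(str(g) for g in selected_gpus)
    return "all"

# visible_devices_value([1], ["--privileged"]) -> "1"
# visible_devices_value([1], [])               -> "all"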

dcarrion87 avatar dcarrion87 commented on June 3, 2024

@bmartinn @AzaelCicero I think I'm running into a similar issue here except within a k8s cluster: NVIDIA/nvidia-docker#1686

Have you done anything further with wrapper scripts to make this work smoother / as intended with a combination of a GPU limit and --privileged mode?

I'm also trying to get rootless DinD working in the k8s cluster to get around this, to no avail so far.

AzaelCicero avatar AzaelCicero commented on June 3, 2024

@dcarrion87 I ended up with a "temporary" workaround. At the start of the script run by the ClearML agent, I discover which GPUs are available to the runner.

import os
import re

def discover_gpus():
    """
    Scan the /proc/driver/nvidia/gpus directory and identify the ids of the
    GPUs that are actually visible (their "Device Minor" numbers).
    """
    available = os.popen(
        'cat /proc/driver/nvidia/gpus/*/information | grep "Device Minor"'
    ).read()
    # Capture the full minor number (it may be more than one digit).
    matches = re.findall(r"^Device Minor:\s*([0-9]+)", available, re.MULTILINE)
    return ",".join(matches)

os.environ["NVIDIA_VISIBLE_DEVICES"] = discover_gpus()

It is good enough, and I have not found time to propose a permanent solution.

dcarrion87 avatar dcarrion87 commented on June 3, 2024

Awesome, thanks for sharing. Not sure if this will help, but we ended up building something that avoids privileged mode and bundles a rootless DinD, thanks to hints from other projects.

https://github.com/harrison-ai/cobalt-docker-rootless-nvidia-dind

bmartinn avatar bmartinn commented on June 3, 2024

@dcarrion87 just to be sure I understand: the issue only exists when running the clearml-agent with --privileged (because without --privileged, inside the container the visible devices are only the ones specified with --gpus, hence "all" is the correct value).
Is that accurate?
If you define CLEARML_DOCKER_SKIP_GPUS_FLAG=1 when running the clearml-agent, it will essentially skip setting the --gpus flag when launching the container and set the correct NVIDIA_VISIBLE_DEVICES based on the selected GPUs.
Basically try:

CLEARML_DOCKER_SKIP_GPUS_FLAG=1 clearml-agent daemon ...

dcarrion87 avatar dcarrion87 commented on June 3, 2024

Hi @bmartinn, we're not actually using ClearML; I just came across this from another similar issue. Sorry for hijacking this discussion!
