Comments (12)

bmartinn avatar bmartinn commented on June 3, 2024

Thanks @AzaelCicero !

I would expect it to run command with NVIDIA_VISIBLE_DEVICES set to correct value,

Yes one would expect that but life is strange ;)

The way the nvidia docker runtime works is that inside the container (when it is executed with the --gpus flag) only the selected GPUs are available.
For example, let's assume we have a dual-GPU machine, GPU_0 and GPU_1.
We run a container on GPU_1 with docker run --gpus device=1 ..., which means we assigned only GPU_1 to the container. Inside the container we see a single GPU, but its index will be "0" (because GPU indices inside the container always start at zero). By setting NVIDIA_VISIBLE_DEVICES=all inside the container, we are basically telling the process it can use all the GPUs the nvidia docker runtime allocated for this container instance.

Make sense ?
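For illustration only, here is a minimal sketch (using a hypothetical run_task_container helper, not the actual clearml-agent code) of launching a task container under these assumptions:

import subprocess

def run_task_container(image, host_gpu_ids):
    # GPU isolation is delegated to the --gpus flag (host indices), while
    # NVIDIA_VISIBLE_DEVICES=all inside the container simply means
    # "everything the runtime already allocated to this container".
    cmd = [
        "docker", "run", "--rm",
        "--gpus", "device=" + ",".join(str(i) for i in host_gpu_ids),
        "-e", "NVIDIA_VISIBLE_DEVICES=all",
        image, "nvidia-smi", "-L",
    ]
    subprocess.run(cmd, check=True)

# Selecting host GPU_1: inside the container nvidia-smi reports a single GPU, indexed 0.
run_task_container("nvidia/cuda:11.2.0-devel", [1])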

AzaelCicero avatar AzaelCicero commented on June 3, 2024

It makes sense and it's even in line with the Docker documentation.
However, it contradicts my observations on the latest Ubuntu 20.04 software stack and images (nvidia/cuda:11.2.0-devel).

Let me walk through those commands and their outputs:
╭─2021-02-25 08:26:01 kpawelczyk@AS-PC007 ~
╰─$ docker pull nvidia/cuda:11.2.0-devel
11.2.0-devel: Pulling from nvidia/cuda
Digest: sha256:68f7fbf7c6fb29340f4351c94b2309c43c98a5ffe46db1d6fa4f7c262fc223cb
Status: Image is up to date for nvidia/cuda:11.2.0-devel
docker.io/nvidia/cuda:11.2.0-devel
╭─2021-02-25 08:26:11 kpawelczyk@AS-PC007 ~
╰─$ nvidia-smi -L
GPU 0: GeForce RTX 2070 (UUID: GPU-862380e5-b56b-9042-9a3a-012482d99792)
GPU 1: GeForce RTX 2060 SUPER (UUID: GPU-6310f70f-d10e-2753-d17c-51a0fd440cf3)
╭─2021-02-25 08:26:15 kpawelczyk@AS-PC007 ~
╰─$ ll /proc/driver/nvidia/gpus
total 0
dr-xr-xr-x 2 root root 0 lut 24 08:05 0000:06:00.0
dr-xr-xr-x 2 root root 0 lut 24 08:05 0000:07:00.0
╭─2021-02-25 08:26:17 kpawelczyk@AS-PC007 ~
╰─$ docker run --privileged --gpus device=1 -it nvidia/cuda:11.2.0-devel bash
root@5d4b12d55688:/# nvidia-smi -L
GPU 0: GeForce RTX 2070 (UUID: GPU-862380e5-b56b-9042-9a3a-012482d99792)
GPU 1: GeForce RTX 2060 SUPER (UUID: GPU-6310f70f-d10e-2753-d17c-51a0fd440cf3)
root@5d4b12d55688:/# ll /proc/driver/nvidia/gpus
total 0
drwxr-xr-x 3 root root 60 Feb 25 07:27 ./
dr-xr-xr-x 3 root root 120 Feb 25 07:27 ../
dr-xr-xr-x 2 root root 0 Feb 24 07:05 0000:07:00.0/
root@5d4b12d55688:/# cat /proc/driver/nvidia/gpus/0000:07:00.0/information
Model: GeForce RTX 2060 SUPER
IRQ: 79
GPU UUID: GPU-6310f70f-d10e-2753-d17c-51a0fd440cf3
Video BIOS: 90.06.44.80.56
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:07:00.0
Device Minor: 1
Blacklisted: No
root@5d4b12d55688:/# echo $NVIDIA_VISIBLE_DEVICES
1
root@5d4b12d55688:/# exit
╭─2021-02-25 08:32:36 kpawelczyk@AS-PC007 ~
╰─$ docker --version
Docker version 19.03.13, build 4484c46d9d

AzaelCicero avatar AzaelCicero commented on June 3, 2024

I didn't change anything in the Docker configuration. My assumption is that NVIDIA changed the behaviour of their docker runtime.

Why does clearml-agent set NVIDIA_VISIBLE_DEVICES=all in the command? Is there an edge case that requires such an explicit operation?

bmartinn avatar bmartinn commented on June 3, 2024

@AzaelCicero I see the issue: --privileged will cause the container to see all the GPUs, see the issue here.
You can quickly verify:
Run docker run --gpus device=1 -it nvidia/cuda:11.2.0-devel bash, then inside the container run nvidia-smi -L (you should see a single GPU). On the contrary, if you run docker run --privileged --gpus device=1 -it nvidia/cuda:11.2.0-devel bash and then run nvidia-smi -L inside the container, you will get both GPUs.

Why does clearml-agent set NVIDIA_VISIBLE_DEVICES=all in the command? Is there an edge case that requires such an explicit operation?

Good question. I think there was: containers with leftover environment variables (I think). If NVIDIA_VISIBLE_DEVICES is not set or is set to all, it is basically the same. It also tells the agent running inside the container that it has GPU support (it needs that information to look for the correct PyTorch package based on the CUDA version, for example).
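As a rough sketch (a hypothetical helper, not the actual clearml-agent logic), the variable can double as a GPU-support hint for the agent running inside the container:

import os

def gpu_support_enabled():
    """Return True if the container appears to have GPUs attached."""
    visible = os.environ.get("NVIDIA_VISIBLE_DEVICES", "")
    # Empty or "none" means CPU-only; "all" or an explicit id list means GPUs are available.
    return visible not in ("", "none")

if gpu_support_enabled():
    print("GPU support detected: resolve a CUDA-matched PyTorch build")
else:
    print("CPU only: resolve the plain PyTorch build")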

Is there a reason you have --privileged on?
Maybe we should detect it and set NVIDIA_VISIBLE_DEVICES accordingly?

AzaelCicero avatar AzaelCicero commented on June 3, 2024

Thanks @bmartinn for pointing out the impact of --privileged; I was not aware of this behaviour.

My environment requires --privileged. I am leveraging Docker-in-Docker as the clearml-agent runner image, as I am running software that is spread across multiple Docker containers. Currently I have a workaround: the first step in the run script detects which devices are available based on the contents of /proc/driver/nvidia/gpus/ and corrects the value of NVIDIA_VISIBLE_DEVICES.

I don't think this is a normal use case for ClearML, but I like the idea of detecting privileged mode. It should be easy, as the flag can only be introduced by an entry in the agent.default_docker.arguments setting. I will try to propose a solution if I find time to spare.

bmartinn avatar bmartinn commented on June 3, 2024

@AzaelCicero I think you are right; even though it is not a "traditional" setup, clearml-agent should handle it properly.
In order for the agent to pass the correct NVIDIA_VISIBLE_DEVICES (i.e. to understand that --gpus is ignored), it needs to know it is also passing the --privileged flag.
Maybe we should add a configuration / flag saying all Tasks' containers should always run with --privileged (then we know we need to change the NVIDIA_VISIBLE_DEVICES behaviour), WDYT?
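Something along these lines, as a minimal sketch of the proposed behaviour (a hypothetical helper, not agent code):

def visible_devices_value(selected_gpus, extra_docker_args):
    # With --privileged, --gpus isolation cannot be relied on, so pass the
    # explicit host GPU indices instead of "all".
    if "--privileged" in extra_docker_args:
        return ",".join(str(g) for g in selected_gpus)
    return "all"

# visible_devices_value([1], ["--privileged"]) -> "1"
# visible_devices_value([1], [])               -> "all"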

dcarrion87 avatar dcarrion87 commented on June 3, 2024

@bmartinn @AzaelCicero I think I'm running into a similar issue here except within a k8s cluster: NVIDIA/nvidia-docker#1686

Have you done anything further with wrapper scripts to make this work smoother / as intended with a combination of a GPU limit and --privileged mode?

I'm also trying to get rootless DinD working in the k8s cluster to get around this, to no avail so far.

AzaelCicero avatar AzaelCicero commented on June 3, 2024

@dcarrion87 I ended up with a "temporary" workaround. At the start of the script run by the ClearML agent, I discover which GPUs are available to the runner.

import os
import re

def discover_gpus():
    """
    Scan the /proc/driver/nvidia/gpus directory and identify the ids of the
    GPUs that are actually visible (their "Device Minor" numbers).
    """
    available = os.popen(
        'cat /proc/driver/nvidia/gpus/*/information | grep "Device Minor"'
    ).read()
    # Capture the full minor number (it may be more than one digit).
    matches = re.findall(r"^Device Minor:\s*([0-9]+)", available, re.MULTILINE)
    return ",".join(matches)

os.environ["NVIDIA_VISIBLE_DEVICES"] = discover_gpus()

It is good enough, and I have not found time to propose a permanent solution.

dcarrion87 avatar dcarrion87 commented on June 3, 2024

Awesome, thanks for sharing. Not sure if this will help, but we ended up building something that avoids privileged mode and bundles a rootless DinD, thanks to hints from other projects.

https://github.com/harrison-ai/cobalt-docker-rootless-nvidia-dind

bmartinn avatar bmartinn commented on June 3, 2024

@dcarrion87 just to be sure I understand: the issue only exists when running the clearml-agent with --privileged (because without --privileged, inside the container the visible devices are only the ones specified with --gpus, hence "all" is the correct value).
Is that accurate?
If you define CLEARML_DOCKER_SKIP_GPUS_FLAG=1 when running the clearml-agent, it will essentially skip setting the --gpus flag when launching the container and set the correct NVIDIA_VISIBLE_DEVICES based on the selected GPUs.
Basically try:

CLEARML_DOCKER_SKIP_GPUS_FLAG=1 clearml-agent daemon ...

dcarrion87 avatar dcarrion87 commented on June 3, 2024

Hi @bmartinn, we're not actually using ClearML; I just came across this from another similar issue. Sorry for hijacking this discussion!
