Thanks @AzaelCicero !
I would expect it to run the command with NVIDIA_VISIBLE_DEVICES set to the correct value,
Yes one would expect that but life is strange ;)
The way the NVIDIA docker runtime works is that inside the container (if executed with the --gpus flag) only the selected GPUs are available.
For example, let's assume we have a dual-GPU machine, GPU_0 and GPU_1.
We run a container on GPU_1 with docker run --gpus device=1 ..., which means we assigned only GPU_1 to the container. Inside the container we see a single GPU, but its index will be "0" (because GPU indices always start at zero). By setting NVIDIA_VISIBLE_DEVICES=all
inside the container, we are basically telling the process it can use all the GPUs that the NVIDIA docker runtime allocated for the container instance.
Make sense?
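To illustrate the renumbering described above, here is a small hypothetical sketch (the function name is made up for illustration, not part of clearml-agent):

    def container_gpu_index(assigned):
        """Map the host GPU indices assigned to a container (via --gpus)
        onto the zero-based indices the container actually sees."""
        # Inside the container the allocated GPUs are renumbered from zero,
        # regardless of their host index.
        return {host: idx for idx, host in enumerate(sorted(assigned))}

    container_gpu_index([1])     # host GPU_1 appears as index 0 inside
    container_gpu_index([0, 1])  # both GPUs renumbered in order
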
from clearml-agent.
It makes sense and is even in line with the Docker documentation.
However, it contradicts my observations on the latest Ubuntu 20.04 software stack and images (nvidia/cuda:11.2.0-devel).
Let me walk through those commands and their outputs:
╭─2021-02-25 08:26:01 kpawelczyk@AS-PC007 ~
╰─$ docker pull nvidia/cuda:11.2.0-devel
11.2.0-devel: Pulling from nvidia/cuda
Digest: sha256:68f7fbf7c6fb29340f4351c94b2309c43c98a5ffe46db1d6fa4f7c262fc223cb
Status: Image is up to date for nvidia/cuda:11.2.0-devel
docker.io/nvidia/cuda:11.2.0-devel
╭─2021-02-25 08:26:11 kpawelczyk@AS-PC007 ~
╰─$ nvidia-smi -L
GPU 0: GeForce RTX 2070 (UUID: GPU-862380e5-b56b-9042-9a3a-012482d99792)
GPU 1: GeForce RTX 2060 SUPER (UUID: GPU-6310f70f-d10e-2753-d17c-51a0fd440cf3)
╭─2021-02-25 08:26:15 kpawelczyk@AS-PC007 ~
╰─$ ll /proc/driver/nvidia/gpus
total 0
dr-xr-xr-x 2 root root 0 lut 24 08:05 0000:06:00.0
dr-xr-xr-x 2 root root 0 lut 24 08:05 0000:07:00.0
╭─2021-02-25 08:26:17 kpawelczyk@AS-PC007 ~
╰─$ docker run --privileged --gpus device=1 -it nvidia/cuda:11.2.0-devel bash
root@5d4b12d55688:/# nvidia-smi -L
GPU 0: GeForce RTX 2070 (UUID: GPU-862380e5-b56b-9042-9a3a-012482d99792)
GPU 1: GeForce RTX 2060 SUPER (UUID: GPU-6310f70f-d10e-2753-d17c-51a0fd440cf3)
root@5d4b12d55688:/# ll /proc/driver/nvidia/gpus
total 0
drwxr-xr-x 3 root root 60 Feb 25 07:27 ./
dr-xr-xr-x 3 root root 120 Feb 25 07:27 ../
dr-xr-xr-x 2 root root 0 Feb 24 07:05 0000:07:00.0/
root@5d4b12d55688:/# cat /proc/driver/nvidia/gpus/0000:07:00.0/information
Model: GeForce RTX 2060 SUPER
IRQ: 79
GPU UUID: GPU-6310f70f-d10e-2753-d17c-51a0fd440cf3
Video BIOS: 90.06.44.80.56
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:07:00.0
Device Minor: 1
Blacklisted: No
root@5d4b12d55688:/# echo $NVIDIA_VISIBLE_DEVICES
1
root@5d4b12d55688:/# exit
╭─2021-02-25 08:32:36 kpawelczyk@AS-PC007 ~
╰─$ docker --version
Docker version 19.03.13, build 4484c46d9d
I didn't change anything in the Docker configuration. My assumption is that NVIDIA changed the behaviour of their docker runtime.
Why does clearml-agent set NVIDIA_VISIBLE_DEVICES=all in the command? Is there any edge case which requires such an explicit operation?
@AzaelCicero I see the issue: --privileged
will cause the container to see all the GPUs, see the issue here.
You can quickly verify:
docker run --gpus device=1 -it nvidia/cuda:11.2.0-devel bash
then inside the container run nvidia-smi -L (you should see a single GPU). On the contrary, if you run
docker run --privileged --gpus device=1 -it nvidia/cuda:11.2.0-devel bash
and then inside the container run nvidia-smi -L, you will get both GPUs.
Why does clearml-agent set NVIDIA_VISIBLE_DEVICES=all in the command? Is there any edge case which requires such an explicit operation?
Good question; I think there was one: dockers with leftover environment variables (I think). If NVIDIA_VISIBLE_DEVICES
is not set or is set to all,
it is basically the same. It also tells the agent running inside the container that it has GPU support (it needs that information to look for the correct pytorch package based on the cuda version, for example).
Is there a reason you have the --privileged
flag on?
Maybe we should detect it and set NVIDIA_VISIBLE_DEVICES accordingly?
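The "unset and all are basically the same" logic described above could be sketched like this (a hypothetical helper for illustration, not the agent's actual code):

    def gpu_support_from_env(env):
        """Interpret NVIDIA_VISIBLE_DEVICES as described above:
        unset or "all" both mean every allocated GPU is usable;
        "none" (or empty) disables GPU support; anything else is an
        explicit comma-separated device list."""
        value = env.get("NVIDIA_VISIBLE_DEVICES")
        if value is None or value == "all":
            return True, "all"    # GPU support, all allocated devices
        if value in ("none", ""):
            return False, None    # CPU-only: e.g. pick the cpu pytorch wheel
        return True, value        # explicit device list

    gpu_support_from_env({})                                 # → (True, "all")
    gpu_support_from_env({"NVIDIA_VISIBLE_DEVICES": "0,1"})  # → (True, "0,1")
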
Thanks @bmartinn for pointing out the impact of --privileged,
I was not aware of this behaviour.
My environment requires --privileged.
I am leveraging Docker-in-Docker as a clearml-agent runner image, as I am running software which is spread across multiple Docker containers. Currently I have a workaround: the first step in the run script detects which devices are available based on the content of /proc/driver/nvidia/gpus/ and corrects the value of NVIDIA_VISIBLE_DEVICES.
I don't think this is a normal use case for clearml, but I like the idea of detecting privileged
mode. I think it should be easy, as it can only be introduced by an entry in the agent.default_docker.arguments
setting. I will try to propose a solution if I find time to spare.
@AzaelCicero I think you are right; even though it is not a "traditional" setup, I think clearml-agent
should properly handle it.
In order for the agent to pass the correct NVIDIA_VISIBLE_DEVICES (i.e. understanding that --gpus
is ignored), it should know it is also passing the --privileged
flag.
Maybe we should add a configuration / flag saying all Tasks' dockers should always run with --privileged
(then we know we need to change the NVIDIA_VISIBLE_DEVICES behavior), WDYT?
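The detection suggested here could look roughly like the following (a sketch only, with made-up names; the real agent logic may differ):

    def visible_devices_value(docker_args, selected_gpus):
        """Decide what NVIDIA_VISIBLE_DEVICES the agent should set.

        docker_args:   extra docker arguments, e.g. from agent.default_docker.arguments
        selected_gpus: host GPU indices the task was assigned, e.g. [1]
        """
        if "--privileged" in docker_args:
            # --gpus is effectively ignored with --privileged, so pass the
            # explicit host indices instead of "all".
            return ",".join(str(g) for g in selected_gpus)
        # Without --privileged, only the --gpus devices are visible inside
        # the container, so "all" is the correct value there.
        return "all"

    visible_devices_value(["--privileged"], [1])  # → "1"
    visible_devices_value([], [1])                # → "all"
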
@bmartinn @AzaelCicero I think I'm running into a similar issue here, except within a k8s cluster: NVIDIA/nvidia-docker#1686
Have you done anything further with any wrapper scripts to make this work smoother / as intended with a combination of a GPU limit and --privileged mode?
I'm trying to get rootless dind in the k8s cluster to get around this as well, to no avail so far.
@dcarrion87 I have ended up with a "temporary" workaround. At the start of the script run by the ClearML agent I discover which GPUs are available to the runner.

import os
import re

def discover_gpus():
    """
    Scan the /proc/driver/nvidia/gpus directory and identify active GPU ids.
    """
    # Each GPU visible to the container has an entry whose "Device Minor"
    # field is the host-side GPU index.
    available = os.popen(
        'cat /proc/driver/nvidia/gpus/*/information | grep "Device Minor"'
    ).read()
    matches = re.findall(r"^\s*Device Minor:\s*([0-9]+)", available, re.MULTILINE)
    return ",".join(matches)

os.environ["NVIDIA_VISIBLE_DEVICES"] = discover_gpus()

It is good enough, and I have not found time to propose a permanent solution.
Awesome, thanks for sharing. Not sure if this will help, but we ended up building something that avoids privileged mode and bundles a rootless dind, thanks to hints from other projects.
https://github.com/harrison-ai/cobalt-docker-rootless-nvidia-dind
@dcarrion87 just to be sure I understand: the issue exists only when running the clearml-agent
with --privileged
(because without privileged, inside the container the visible devices are only the ones specified with --gpus, hence "all" is the correct value).
Is that accurate?
If you define CLEARML_DOCKER_SKIP_GPUS_FLAG=1
when running the clearml-agent,
it will essentially skip setting the --gpus
flag when launching the docker and set the correct NVIDIA_VISIBLE_DEVICES
based on the selected GPUs.
Basically try:
CLEARML_DOCKER_SKIP_GPUS_FLAG=1 clearml-agent daemon ...
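The behavior of that flag, as described above, could be sketched like this (a hypothetical helper; the actual agent implementation may differ):

    import os

    def docker_gpu_args(selected_gpus, env=os.environ):
        """Build the GPU-related part of the docker command line.

        With CLEARML_DOCKER_SKIP_GPUS_FLAG=1 the --gpus flag is skipped and
        NVIDIA_VISIBLE_DEVICES carries the explicit device list instead.
        """
        devices = ",".join(str(g) for g in selected_gpus)
        if env.get("CLEARML_DOCKER_SKIP_GPUS_FLAG") == "1":
            return ["-e", "NVIDIA_VISIBLE_DEVICES=%s" % devices]
        return ["--gpus", '"device=%s"' % devices,
                "-e", "NVIDIA_VISIBLE_DEVICES=all"]

    docker_gpu_args([1], {"CLEARML_DOCKER_SKIP_GPUS_FLAG": "1"})
    # → ['-e', 'NVIDIA_VISIBLE_DEVICES=1']
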
Hi @bmartinn, we're not actually using clearml; I just came across this from another similar issue. Sorry for hijacking this discussion!