Conda environment installation skipping packages about clearml-agent HOT 4 CLOSED

Mert-Ergin commented on June 14, 2024

Conda environment installation skipping packages

from clearml-agent.

Comments (4)

Mert-Ergin commented on June 14, 2024

Hi Martin,
I figured out the problem: I had to restart the agent daemon with a single GPU. It now detects tensorflow-gpu for both pip and conda environments. Thank you for your response. I had thought workers were updated for each experiment in the queue.
I have a follow-up issue about Horovod, but it is unrelated to this one.

bmartinn commented on June 14, 2024

Hi @Mert-Ergin ,
Trains will catch and list the packages you specifically import from your code.
The reason Trains does not capture the entire conda/pip environment is to allow the trains-agent more "freedom" when it re-creates the environment. That said, once the trains-agent creates the new environment, it will update back all the installed packages, including derivative packages.

Specifically, tensorflow will always show as tensorflow with a specific version; the trains-agent, depending on its setup, will then install either tensorflow (if it is running with CPU-only settings) or tensorflow-gpu (if it is running in GPU mode).

Conda: Trying to install requirements:
['h5py~=2.10.0', 'Keras-Applications~=1.0.8', 'numpy~=1.18.4', 'tensorflow~=1.13.1', 'graphviz', 'python-graphviz', 'kiwisolver', 'cpuonly']
Executing Conda: /home/mert/anaconda3/condabin/conda env update -p /home/mert/.trains/venvs-builds-alsi/3.6 --file /tmp/conda_envbjjg4czh.yml --quiet --json

It seems the trains-agent thinks it is in CPU-only mode, which is the reason it is trying to install tensorflow and not tensorflow-gpu (actually, with conda this is set by defining the conda environment; see the cpuonly argument at the end of the requirements list).

Once the installation is done, the trains-agent will update back the experiment with all the installed packages (because this is by definition a clean environment that can be fully reproduced).

What does the trains-agent command you are using look like? Did you add --gpus all?

BTW:
If you clone the experiment (right-click and select "Clone"), you can edit the "Installed Packages" list and add any package that Trains missed (hopefully that should not happen :)

Mert-Ergin commented on June 14, 2024

> Trains will catch and list the packages you specifically import from your code.
> The reason Trains does not capture the entire conda/pip environment is to allow the trains-agent more "freedom" when it re-creates the environment. That said, once the trains-agent creates the new environment, it will update back all the installed packages, including derivative packages.
> Specifically, tensorflow will always show as tensorflow with a specific version; the trains-agent, depending on its setup, will then install either tensorflow (if it is running with CPU-only settings) or tensorflow-gpu (if it is running in GPU mode).

I understand this decision and respect it. However, it is not capturing the right packages in conda when both conda and pip packages are installed. I tried this without horovod installed and the result is the same. By the way, I don't think the agent is the problem, because this also happens for the original training: only a couple of the packages are detected. And I am sure the original training is using tensorflow-gpu, as I checked with nvidia-smi and all GPUs were occupied.

> Conda: Trying to install requirements:
> ['h5py~=2.10.0', 'Keras-Applications~=1.0.8', 'numpy~=1.18.4', 'tensorflow~=1.13.1', 'graphviz', 'python-graphviz', 'kiwisolver', 'cpuonly']
> Executing Conda: /home/mert/anaconda3/condabin/conda env update -p /home/mert/.trains/venvs-builds-alsi/3.6 --file /tmp/conda_envbjjg4czh.yml --quiet --json
> It seems the trains-agent thinks it is in CPU-only mode, which is the reason it is trying to install tensorflow and not tensorflow-gpu (actually, with conda this is set by defining the conda environment; see the cpuonly argument at the end of the requirements list).

> Once the installation is done, the trains-agent will update back the experiment with all the installed packages (because this is by definition a clean environment that can be fully reproduced).

> What does the trains-agent command you are using look like? Did you add --gpus all?

The full command is:
trains-agent daemon --detached --gpus all --queue all

> BTW:
> If you clone the experiment (right-click and select "Clone"), you can edit the "Installed Packages" list and add any package that Trains missed (hopefully that should not happen :)

I tried this part too. However, the conda package section is not even listed there, so I cannot add other packages such as cudnn, cuda-toolkit, etc. Additionally, if I change the installed packages section, how would the software keep track of whether to use pip or conda?

Am I missing something here? Do I need to restart the server after updating the config file? Or is conda supported only for a specific version? (I am using conda 4.6.11.)

bmartinn commented on June 14, 2024

Hi @Mert-Ergin ,
There is an important distinction here: there are two types of packages.

Code-dependent packages: such as Keras, numpy, Tensorflow, etc. Trains will automatically list these under "Installed Packages", again without their requirements.
Environment-dependent packages: such as tensorflow-gpu (vs. tensorflow), PyTorch for a specific CUDA version, and conda-specific packages such as cudnn, cuda-toolkit, etc. These packages are installed per machine by the trains-agent automatically, depending on the machine setup (i.e. CPU/GPU, CUDA version, etc.). There is no need to list them. The idea is that if we want to run the same code on a different machine with a different CUDA version, or on a CPU, it is transparent: the trains-agent will take care of it.

Now, on how it works:
If the trains-agent is using conda as the package manager, it will first try to install all the packages using conda. That means you can add any conda-specific package to the "Installed Packages" list and it will get installed (obviously, it will not work on machines running an agent with a different package manager). That said, all the default GPU/CUDA packages should be installed automatically, with CUDA versions matching the system.
BTW: This is the reason tensorflow is listed in the "Installed Packages" and not tensorflow-gpu; the agent will install tensorflow-gpu instead if it is running with --gpus.
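To make the substitution described above concrete, here is a minimal sketch in Python. It is not the trains-agent implementation: the function `resolve_requirements`, the `GPU_VARIANTS` table, and the `gpu_mode` / `use_conda` parameters are hypothetical names chosen for illustration. It only mirrors the behavior stated in this thread: the experiment lists `tensorflow`, a GPU-mode agent installs `tensorflow-gpu`, and a CPU-only conda install carries a trailing `cpuonly` entry.

```python
import re
from typing import List

# Hypothetical table of generic package names and their GPU builds.
# Only the tensorflow entry is taken from the discussion above.
GPU_VARIANTS = {"tensorflow": "tensorflow-gpu"}

# Splits "name~=1.13.1" into the package name and the remaining version spec.
_REQ_PATTERN = re.compile(r"^([A-Za-z0-9_.\-]+)(.*)$")


def resolve_requirements(requirements: List[str], gpu_mode: bool, use_conda: bool) -> List[str]:
    """Return the package list a machine-aware installer might actually use."""
    resolved = []
    for req in requirements:
        match = _REQ_PATTERN.match(req.strip())
        if match is None:
            # Leave anything unparseable untouched.
            resolved.append(req)
            continue
        name, spec = match.group(1), match.group(2)
        if gpu_mode:
            # Swap in the GPU build when one is known; keep the version spec.
            name = GPU_VARIANTS.get(name, name)
        resolved.append(name + spec)
    if use_conda and not gpu_mode:
        # Mirrors the 'cpuonly' entry seen at the end of the conda log above.
        resolved.append("cpuonly")
    return resolved
```

With the requirements from the log above, calling `resolve_requirements(reqs, gpu_mode=False, use_conda=True)` keeps `tensorflow~=1.13.1` and appends `cpuonly`, which matches what the CPU-only agent printed; with `gpu_mode=True` the same list would carry `tensorflow-gpu~=1.13.1` instead.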

Could you send the trains-agent output from when you execute it? It should print all of its configuration, including the CUDA/CUDNN versions it will use to set up the conda environment.
Specifically, the interesting part is:

agent.package_manager.type = conda
agent.cuda_version = 101
agent.cudnn_version = 76

If it failed to detect the CUDA/CUDNN versions automatically, you can specify them in the standard NVIDIA OS environment variables.
For example:

$ CUDA_VERSION=10.1 CUDNN_VERSION=7.6 trains-agent daemon --detached --gpus all --queue all
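As a rough illustration of how the dotted values in that command relate to the compact form in the agent output (10.1 versus 101, 7.6 versus 76), the following Python sketch reads the two environment variables and normalizes them. The helper names `normalize_version` and `detect_versions` are hypothetical, and the actual detection logic inside the trains-agent may differ.

```python
import os
from typing import Dict, Mapping, Optional


def normalize_version(value: Optional[str]) -> Optional[str]:
    """Turn '10.1' (or '10.1.243') into '101'; return None if unset or malformed."""
    if not value:
        return None
    parts = value.strip().split(".")
    # Require at least numeric major and minor components.
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        return None
    return parts[0] + parts[1]


def detect_versions(env: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Read CUDA_VERSION / CUDNN_VERSION from the given environment mapping."""
    return {
        "cuda_version": normalize_version(env.get("CUDA_VERSION")),
        "cudnn_version": normalize_version(env.get("CUDNN_VERSION")),
    }
```

Running `detect_versions()` in the same shell you start the agent from is a quick way to see whether the variables are visible to the process at all; a `None` value means the variable is missing or not in `major.minor` form.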

If I had to guess, I think the missing CUDA version is the root cause of it all. I will make sure that if the agent is running with conda and CUDA is not detected, it prints an error and exits, in order to avoid these situations, which are very challenging to debug.
What do you think?
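The fail-fast check proposed above could look roughly like the sketch below. This is only an illustration of the idea, not code from the trains-agent: `MissingCudaError` and `check_conda_gpu_setup` are hypothetical names, and gating the check on `gpu_mode` is an added assumption (a CPU-only agent legitimately has no CUDA version).

```python
from typing import Optional


class MissingCudaError(RuntimeError):
    """Raised when a conda-based GPU setup has no detected CUDA version."""


def check_conda_gpu_setup(package_manager: str, gpu_mode: bool, cuda_version: Optional[str]) -> None:
    """Stop early instead of silently falling back to a CPU-only environment."""
    # Assumption: only GPU-mode agents need a CUDA version.
    if package_manager == "conda" and gpu_mode and not cuda_version:
        raise MissingCudaError(
            "conda package manager selected but no CUDA version was detected; "
            "set CUDA_VERSION / CUDNN_VERSION or check the driver installation"
        )
```

Failing at startup trades a little convenience for a much clearer signal: the user sees the missing CUDA version immediately rather than discovering a CPU-only environment after the experiment has run.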
