
Comments (4)

Mert-Ergin commented on June 14, 2024

Hi Martin,
I figured out the problem. I had to restart the agent daemon with a single GPU. Now it is detecting tensorflow-gpu for both pip and conda environments. Thank you for your response. I thought workers were updated for each experiment in the queue.
I have a follow-up issue about horovod, but it is unrelated to this one.

from clearml-agent.

bmartinn commented on June 14, 2024

Hi @Mert-Ergin ,
Trains will catch and list the packages you specifically import from your code.
The reason Trains does not capture the entire conda/pip environment is to allow the trains-agent more "freedom" when it re-creates the environment. That said, once the trains-agent creates the new environment, it will update the experiment with all the installed packages, including derivative packages.

Specifically, tensorflow will always show as tensorflow with a specific version; then trains-agent, depending on its setup, will install either tensorflow (if it is running with cpu-only settings) or tensorflow-gpu (if running in gpu mode).

Conda: Trying to install requirements:
['h5py~=2.10.0', 'Keras-Applications~=1.0.8', 'numpy~=1.18.4', 'tensorflow~=1.13.1', 'graphviz', 'python-graphviz', 'kiwisolver', 'cpuonly']
Executing Conda: /home/mert/anaconda3/condabin/conda env update -p /home/mert/.trains/venvs-builds-alsi/3.6 --file /tmp/conda_envbjjg4czh.yml --quiet --json

It seems trains-agent thinks it is in cpu-only mode, which is the reason it is trying to install tensorflow and not tensorflow-gpu (actually, with conda this is set by defining the conda environment; see the cpuonly entry at the end of the requirements list).

Once the installation is done, the trains-agent will update back the experiment with all the installed packages (because this is by definition a clean environment that can be fully reproduced)

What is the trains-agent command you are using? Did you add --gpus all?

BTW:
If you clone the experiment (right-click, select "Clone"), you can edit the "Installed Packages" list and add any package that trains missed (hopefully that should not happen :)

Mert-Ergin commented on June 14, 2024

> Trains will catch and list the packages you specifically import from your code.
> The reason Trains does not capture the entire conda/pip environment is to allow the trains-agent more "freedom" when it re-creates the environment. That said, once the trains-agent creates the new environment, it will update the experiment with all the installed packages, including derivative packages.
> Specifically, tensorflow will always show as tensorflow with a specific version; then trains-agent, depending on its setup, will install either tensorflow (if it is running with cpu-only settings) or tensorflow-gpu (if running in gpu mode).

I understand this decision and respect it. However, it is not capturing the right packages in conda when both conda and pip packages are installed. I tried this without horovod installed and the result is the same. By the way, I don't think the agent is the problem, because this also happens for the original training: only a couple of the packages are detected. And I am sure the original training is using tensorflow-gpu, as I checked with nvidia-smi and all GPUs were occupied.

> Conda: Trying to install requirements:
> ['h5py~=2.10.0', 'Keras-Applications~=1.0.8', 'numpy~=1.18.4', 'tensorflow~=1.13.1', 'graphviz', 'python-graphviz', 'kiwisolver', 'cpuonly']
> Executing Conda: /home/mert/anaconda3/condabin/conda env update -p /home/mert/.trains/venvs-builds-alsi/3.6 --file /tmp/conda_envbjjg4czh.yml --quiet --json
> It seems trains-agent thinks it is in cpu-only mode, which is the reason it is trying to install tensorflow and not tensorflow-gpu (actually, with conda this is set by defining the conda environment; see the cpuonly entry at the end of the requirements list).
>
> Once the installation is done, the trains-agent will update back the experiment with all the installed packages (because this is by definition a clean environment that can be fully reproduced)
>
> What is the trains-agent command you are using? Did you add --gpus all?

The full command is:
trains-agent daemon --detached --gpus all --queue all

> BTW:
> If you clone the experiment (right-click, select "Clone"), you can edit the "Installed Packages" list and add any package that trains missed (hopefully that should not happen :)

I tried this part too. However, the conda packages are not even listed there, so I cannot add other packages such as cudnn, cuda-toolkit, etc. Additionally, if I change the installed packages list, how would the software keep track of whether to use pip or conda?

Am I missing something here? Do I need to restart the server after updating the config file? Or is conda supported only for a specific version (I am using conda 4.6.11)?

bmartinn commented on June 14, 2024

Hi @Mert-Ergin ,
There is an important distinction here: there are two types of packages.

  • Code-dependent packages: such as Keras, numpy, Tensorflow, etc. Trains will automatically list these under "Installed Packages", again without their requirements.
  • Environment-dependent packages: such as tensorflow-gpu (vs. tensorflow), PyTorch for a specific CUDA version, and conda-specific packages such as cudnn, cuda-toolkit, etc. These packages are installed per machine by the trains-agent automatically, depending on the machine setup (i.e. CPU/GPU, CUDA version, etc.). There is no need to list them. The idea is that if we want to run the same code on a different machine with a different CUDA version, or on a CPU, it is transparent: the trains-agent will take care of it.

Now on how it works:
If trains-agent is using conda as the package manager, it will first try to install all the packages using conda. That means you can add any conda-specific package to the "Installed Packages" list and it will get installed (obviously, it will not work on machines running an agent with a different package manager). That said, all the default GPU/CUDA packages should be installed automatically, with CUDA versions matching the system.
BTW: This is the reason tensorflow is listed in the "Installed Packages" and not tensorflow-gpu; the agent will install tensorflow-gpu instead if it is running with --gpus.
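
To make the substitution described above concrete, here is a minimal Python sketch. It is not the trains-agent implementation: the function name `resolve_requirements` and the `gpu_mode` flag are hypothetical, and it only models the tensorflow / tensorflow-gpu swap discussed in this thread.

```python
import re
from typing import List

# Package name followed by an optional version specifier, e.g. "tensorflow~=1.13.1"
_REQ_PATTERN = re.compile(r"^([A-Za-z0-9_.\-]+)(.*)$")


def resolve_requirements(requirements: List[str], gpu_mode: bool) -> List[str]:
    """Hypothetical helper: swap 'tensorflow' for 'tensorflow-gpu' in GPU mode.

    Illustrative only; the real agent logic may differ.
    """
    resolved = []
    for req in requirements:
        match = _REQ_PATTERN.match(req.strip())
        if match is None:
            # Unrecognized line: pass it through untouched
            resolved.append(req)
            continue
        name, spec = match.group(1), match.group(2)
        # Only the generic "tensorflow" entry is environment dependent here
        if gpu_mode and name.lower() == "tensorflow":
            name = "tensorflow-gpu"
        resolved.append(name + spec)
    return resolved


if __name__ == "__main__":
    listed = ["h5py~=2.10.0", "tensorflow~=1.13.1", "graphviz"]
    print(resolve_requirements(listed, gpu_mode=True))
    print(resolve_requirements(listed, gpu_mode=False))
```

The point of the sketch is that the experiment only stores the generic name, and the machine-specific variant is chosen at install time.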

Could you send the trains-agent output from when you execute it? It should print all its configuration, including the CUDA/CUDNN versions it will use to set up the conda environment.
Specifically the interesting part is:

agent.package_manager.type = conda
agent.cuda_version = 101
agent.cudnn_version = 76

If it failed to automatically detect the CUDA/CUDNN versions, you can specify them in the standard NVIDIA OS environment variables.
For example:

$ CUDA_VERSION=10.1 CUDNN_VERSION=7.6 trains-agent daemon --detached --gpus all --queue all
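
Judging only from the example values in this thread (CUDA_VERSION=10.1 next to agent.cuda_version = 101, and CUDNN_VERSION=7.6 next to agent.cudnn_version = 76), the compact form looks like the major and minor digits joined together. The Python sketch below illustrates that mapping under this assumption; `compact_version` and `read_cuda_settings` are hypothetical names, not part of trains-agent.

```python
import os
from typing import Dict, Mapping, Optional


def compact_version(version: str) -> str:
    """Turn a dotted version such as "10.1" into the compact form "101".

    Assumption: only major and minor are kept; a missing minor becomes "0".
    """
    major, _, rest = version.strip().partition(".")
    minor = rest.split(".")[0] if rest else "0"
    return major + minor


def read_cuda_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Hypothetical helper: read CUDA_VERSION / CUDNN_VERSION from the environment."""
    cuda = env.get("CUDA_VERSION")
    cudnn = env.get("CUDNN_VERSION")
    return {
        "agent.cuda_version": compact_version(cuda) if cuda else None,
        "agent.cudnn_version": compact_version(cudnn) if cudnn else None,
    }


if __name__ == "__main__":
    # A value of None means the variable is not set in this shell
    print(read_cuda_settings())
```

Running this in the same shell that launches the agent is a quick way to check whether the variables are actually exported.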

If I had to guess, I think the missing CUDA version is the root cause of it all. I will make sure that if the agent is running with conda and CUDA is not detected, it prints an error and exits, in order to avoid these situations, which are very challenging to debug.
What do you think?
