Comments (4)
Hi Martin,
I figured out the problem: I had to restart the agent daemon with a single GPU. Now it detects tensorflow-gpu for both pip and conda environments. Thank you for your response. I had thought workers were updated for each experiment in the queue.
I have a follow-up issue about horovod, but it is unrelated to this one.
from clearml-agent.
Hi @Mert-Ergin ,
Trains will catch and list the packages you explicitly import in your code.
The reason Trains does not capture the entire conda/pip environment is to allow the trains-agent more "freedom" when it re-creates the environment. That said, once the trains-agent creates the new environment, it will update the experiment with all the installed packages, including derivative packages.
Specifically, tensorflow will always show as tensorflow with a specific version; the trains-agent, depending on its setup, will then install either tensorflow (if it is running with CPU-only settings) or tensorflow-gpu (if it is running in GPU mode).
Conda: Trying to install requirements:
['h5py~=2.10.0', 'Keras-Applications~=1.0.8', 'numpy~=1.18.4', 'tensorflow~=1.13.1', 'graphviz', 'python-graphviz', 'kiwisolver', 'cpuonly']
Executing Conda: /home/mert/anaconda3/condabin/conda env update -p /home/mert/.trains/venvs-builds-alsi/3.6 --file /tmp/conda_envbjjg4czh.yml --quiet --json
It seems trains-agent thinks it is in cpu-only mode, which is the reason it is trying to install tensorflow and not tensorflow-gpu (with conda this is actually set by defining the conda environment; see the cpuonly entry at the end of the requirements list).
Once the installation is done, the trains-agent will update the experiment with all the installed packages (because this is by definition a clean environment that can be fully reproduced).
What trains-agent command are you using? Did you add --gpus all?
BTW:
If you clone the experiment (right-click select "Clone") you can edit the "Installed Packages" list and add any package that trains missed (hopefully that should not happen :)
from clearml-agent.
> Trains will catch and list the packages you explicitly import in your code.
> The reason Trains does not capture the entire conda/pip environment is to allow the trains-agent more "freedom" when it re-creates the environment. That said, once the trains-agent creates the new environment, it will update the experiment with all the installed packages, including derivative packages.
> Specifically, tensorflow will always show as tensorflow with a specific version; the trains-agent, depending on its setup, will then install either tensorflow (if it is running with CPU-only settings) or tensorflow-gpu (if it is running in GPU mode).
I understand this decision and respect it. However, it is not capturing the right packages in conda when both conda and pip packages are installed. I tried this without horovod installed and the result is the same. By the way, I don't think the agent is the problem, because this also happens for the original training: only a couple of the packages are detected. I am sure the original training is using tensorflow-gpu, as I checked with nvidia-smi and all GPUs were occupied.
> Conda: Trying to install requirements:
> ['h5py~=2.10.0', 'Keras-Applications~=1.0.8', 'numpy~=1.18.4', 'tensorflow~=1.13.1', 'graphviz', 'python-graphviz', 'kiwisolver', 'cpuonly']
> Executing Conda: /home/mert/anaconda3/condabin/conda env update -p /home/mert/.trains/venvs-builds-alsi/3.6 --file /tmp/conda_envbjjg4czh.yml --quiet --json
> It seems trains-agent thinks it is in cpu-only mode, which is the reason it is trying to install tensorflow and not tensorflow-gpu (with conda this is actually set by defining the conda environment; see the cpuonly entry at the end of the requirements list).
> Once the installation is done, the trains-agent will update the experiment with all the installed packages (because this is by definition a clean environment that can be fully reproduced).
> What trains-agent command are you using? Did you add --gpus all?
The full command is:
trains-agent daemon --detached --gpus all --queue all
> BTW:
> If you clone the experiment (right-click select "Clone") you can edit the "Installed Packages" list and add any package that trains missed (hopefully that should not happen :)
I tried this part too. However, the conda packages are not even listed there, so I cannot add other packages like cudnn, cuda-toolkit, etc. Additionally, if I change the installed packages section, how would the software keep track of whether to use pip or conda?
Am I missing something here? Do I need to restart the server after updating the config file? Or is conda supported only for a specific version? (I am using conda 4.6.11.)
from clearml-agent.
Hi @Mert-Ergin ,
There is an important distinction here between two types of packages:
- Code-dependent packages: such as Keras, numpy, TensorFlow, etc. Trains will automatically list these under "Installed Packages", again without their requirements.
- Environment-dependent packages: such as tensorflow-gpu (vs. tensorflow), PyTorch for a specific CUDA version, and conda-specific packages such as cudnn, cuda-toolkit, etc. These packages are installed per machine by the trains-agent automatically, depending on the machine setup (i.e. CPU/GPU, CUDA version, etc.). There is no need to list them. The idea is that if we want to run the same code on a different machine with a different CUDA version, or on a CPU, it is transparent: the trains-agent will take care of it.
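To make the first category concrete, here is a minimal sketch of import-based detection. It is not the actual Trains implementation; the function name `imported_top_level_packages` is hypothetical, and the sketch only collects top-level import names from one source string using the standard `ast` module. Real detection would also have to map import names to distribution names and versions, which this does not attempt.

```python
import ast


def imported_top_level_packages(source):
    """Return the set of top-level module names imported by `source`.

    Illustrative only: import names are not always the same as the
    package names a package manager installs.
    """
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) point at local code, so skip them.
            if node.level == 0 and node.module:
                packages.add(node.module.split(".")[0])
    return packages


if __name__ == "__main__":
    example = (
        "import numpy as np\n"
        "import tensorflow as tf\n"
        "from keras.layers import Dense\n"
        "from . import local_helpers\n"
    )
    print(sorted(imported_top_level_packages(example)))
```

Packages that are never imported directly (cudnn, cuda-toolkit, or a GPU build of a framework) cannot show up in a scan like this, which is consistent with them being treated as environment-dependent.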
Now on how it works:
If trains-agent is using conda as the package manager, it will first try to install all the packages using conda. That means you can add any conda-specific package to the "Installed Packages" list and it will get installed (obviously this will not work on machines running an agent with a different package manager). That said, all the default GPU/CUDA packages should be installed automatically, with CUDA versions matching the system.
BTW: This is the reason tensorflow is listed in the "Installed Packages" and not "tensorflow-gpu"; the agent will install "tensorflow-gpu" instead if it is running with --gpus.
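The substitution described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the agent's real code: `resolve_for_machine` and its parameters are hypothetical names, and the only behaviors modeled are the two visible in this thread (swapping tensorflow for tensorflow-gpu in GPU mode, and the `cpuonly` entry appended to the conda requirements in CPU mode).

```python
import re

# Matches the leading package name of a requirement such as "tensorflow~=1.13.1".
_NAME = re.compile(r"^[A-Za-z0-9_.\-]+")


def resolve_for_machine(requirements, gpu_mode, package_manager="pip"):
    """Return a copy of `requirements` adjusted for the target machine (hypothetical helper)."""
    resolved = []
    for req in requirements:
        match = _NAME.match(req)
        name = match.group(0) if match else req
        if gpu_mode and name.lower() == "tensorflow":
            # Keep the version specifier, change only the package name.
            req = "tensorflow-gpu" + req[len(name):]
        resolved.append(req)
    if package_manager == "conda" and not gpu_mode:
        # Mirrors the "cpuonly" entry seen in the conda log in this thread.
        resolved.append("cpuonly")
    return resolved


if __name__ == "__main__":
    listed = ["numpy~=1.18.4", "tensorflow~=1.13.1"]
    print(resolve_for_machine(listed, gpu_mode=True, package_manager="conda"))
    print(resolve_for_machine(listed, gpu_mode=False, package_manager="conda"))
```

The point of the design is that the experiment records only the generic requirement, so the same "Installed Packages" list can be replayed on a CPU or a GPU machine.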
Could you send the trains-agent output from when you execute it? It should print all its configuration, including the CUDA/cuDNN versions it will use to set up the conda environment.
Specifically the interesting part is:
agent.package_manager.type = conda
agent.cuda_version = 101
agent.cudnn_version = 76
If it failed to automatically detect the CUDA/cuDNN versions, you can specify them in the standard NVIDIA OS environment variables.
For example:
$ CUDA_VERSION=10.1 CUDNN_VERSION=7.6 trains-agent daemon --detached --gpus all --queue all
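As a rough illustration of how the dotted values in those environment variables could relate to the compact values in the configuration printout (10.1 vs. 101, 7.6 vs. 76), here is a small sketch. The helper names `compact_version` and `versions_from_env` are hypothetical, and the major-plus-minor concatenation is an assumption inferred from the two forms shown in this thread, not taken from the agent's source.

```python
import os


def compact_version(value):
    """Turn a dotted version such as '10.1' or '10.1.243' into '101' (assumed format)."""
    parts = value.strip().split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError("expected a version like '10.1', got %r" % value)
    return parts[0] + parts[1]


def versions_from_env(environ=os.environ):
    """Read CUDA_VERSION / CUDNN_VERSION and return compact strings, or None if unset."""
    result = {}
    for key, var in (("cuda_version", "CUDA_VERSION"),
                     ("cudnn_version", "CUDNN_VERSION")):
        raw = environ.get(var)
        result[key] = compact_version(raw) if raw else None
    return result


if __name__ == "__main__":
    print(versions_from_env({"CUDA_VERSION": "10.1", "CUDNN_VERSION": "7.6"}))
```

A `None` value here corresponds to the "not detected" case, which is the situation suspected to be the root cause in this thread.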
If I had to guess, I think the missing CUDA version is the root cause of it all. I will make sure that when running with conda, if CUDA is not detected, the agent prints an error and exits, in order to avoid these situations, which are very challenging to debug.
What do you think?
from clearml-agent.