Comments (7)
Hi @H4dr1en I can definitely feel you on this one :)
We used to use venv_update; in theory you can still try it (but I have to be honest, I'm not sure of its status).
Actually, we are working on accelerating pip install; in this issue you can see the full potential, and the initial PR.
I'm hoping that after 21.1 is released we will be able to merge all our improvements.
Feel free to join the discussion there :)
The idea is that the safest way to restore an environment is to recreate it (just imagine something goes wrong: it reuses the venv and from time to time something is a bit different, or you think you are getting the same environment but you are not...).
And since everything is cached, and pip has no real dependencies to solve (think the second time, where all the packages are fixed after a pip freeze of the initial venv), there is no reason why the unzip should not take just a few seconds; after all, these GPU machines are usually fast enough to handle unzipping a few files...
from clearml-agent.
> Actually, we are working on accelerating pip install; in this issue you can see the full potential, and the initial PR. I'm hoping that after 21.1 is released we will be able to merge all our improvements. Feel free to join the discussion there :)
Kudos for the great work 🥇 Looks very promising!
> The idea is that the safest way to restore an environment is to recreate it (just imagine something goes wrong: it reuses the venv and from time to time something is a bit different, or you think you are getting the same environment but you are not...)
This is true in general, but in the specific case where a user wants to rerun an experiment on the same machine, nobody would do that: the user would simply start the experiment again in the same environment. This would be very valuable in trains, because even when all the wheels are cached and reinstalling requires solving no dependencies, installation is still very slow when you deal with big libraries like pytorch, opencv, scipy, etc.
We are talking about 5 to 10 minutes, even on a competitive machine, to rebuild an environment that was already built on that machine. IMO this is an actual need that should be addressed, because reusing a previous environment shouldn't be difficult to achieve.
Why? Because most of the time, researchers have a lot of experiments but only a small number of environments, and it would be very convenient to attach the same environment to multiple experiments, thereby reducing the deployment time to 0. This would be a killer feature.
How I would see it:
Proposal 1
Trains agents take care of everything:
- Do not delete envs after experiments are finished
- Create an internal store of (hash of env, env location)
- If an agent pulls a task whose requirements match a hash in its store, it uses that env.
- Provide this feature as an `agent.cache_envs` parameter to the user. Users who know that they won't change the environment during their experiments (99% of users) can use this parameter.
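The steps above can be sketched in a few lines. This is illustrative only; `requirements_hash` and `venv_store` are hypothetical names, not trains-agent API. The key point is that hashing the sorted, pinned requirements gives a stable cache key:

```python
import hashlib

def requirements_hash(requirements):
    """Return a stable hash for a list of pinned requirements.

    Sorting makes the hash independent of ordering, so two tasks with
    the same package set map to the same cached venv entry.
    """
    normalized = "\n".join(sorted(r.strip().lower() for r in requirements))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Hypothetical in-memory store: requirements hash -> venv location on the agent machine
venv_store = {}

reqs = ["torch==1.7.1", "opencv-python==4.5.1.48", "scipy==1.6.0"]
key = requirements_hash(reqs)
venv_store[key] = "~/.trains/venvs-cache/" + key[:8]

# A task listing the same packages in a different order hits the same cache entry
assert requirements_hash(list(reversed(reqs))) == key
```

In practice the store would live on disk and need locking when multiple agents share a machine, which is exactly the sharing concern raised later in this thread.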
Proposal 2
- Decouple environments from experiments: Users can create environments from the web UI/Python API and manage them (create/clone/delete/update list of requirements/package versions, ...)
- Users can link environments to experiments: When creating/editing a task, users can specify which environment they want to use for one experiment (via the unique ID of the environment).
- Keep the flexibility of the current implementation: Environments can be created on-the-fly when creating a task.
- Have programmatic access to these environments. One could do:

```python
my_task = Task(...)                      # Create task with new env, run vcs detection, update new env
my_task = Task(..., environment_id=...)  # Create task with an already existing environment
```
Hi @H4dr1en
I think that "Proposal 2" is something you can already achieve.
This is basically building a docker image and using it as the base docker image:

```shell
trains-agent build --docker nvidia/cuda --id aa11bb22 --target my_new_env_docker
```

This command will take experiment id "aa11bb22" and build a docker image with everything installed in it, based on the environment defined in the experiment.
Now you can use the newly created base docker ("my_new_env_docker") as the base docker for all your experiments. Basically what happens is that the environment is installed as the "system" environment inside the docker, and every venv created inherits the packages. This means everything is preinstalled, but you still have the possibility to change package versions if needed.
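For illustration, running an agent so that experiments use this image as their base could look something like the following (a sketch only; `default` is a hypothetical queue name, and it assumes the agent runs in docker mode):

```shell
# Run the agent in docker mode; experiments pulled from the "default" queue
# will execute inside containers based on the image built above.
trains-agent daemon --queue default --docker my_new_env_docker
```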
What do you think ?
Regarding "Proposal 1": it makes sense only if we hash the environment requirements, and the question is how many venvs we cache. This is doable but might require some work; it also might be a bit more complicated to share the venvs if you are running multiple agents on the same machine. My fear is actually stability: it would be quite bad if, from time to time, you got the wrong venv, or a venv with leftovers...
Hi,
I am on the same bandwagon and tried Proposal 2 by setting up my own docker environment. I need this solution specifically because I have to use nvidia-dali for fast pre-processing. However, nvidia-dali requires the following command to be installed:

```shell
pip install nvidia-dali==0.21.0 --extra-index-url https://developer.download.nvidia.com/compute/redist/cuda/10.0
```

However, as mentioned in the issues section of trains-agent, `pip freeze` does not capture `--extra-index-url`.
I also need to install horovod, which requires some preliminary steps of its own. I managed to build this docker image and run it using the following command:

```shell
trains-agent build --docker name_of_docker --id 41672b8... --target trains_docker
```
It builds and shows up as a worker in the Workers & Queues section, with the following errors:

```
trains_agent: ERROR: Could not parse task execution info: 'Tasks' object has no attribute 'script'
trains_agent: ERROR: 'NoneType' object has no attribute 'id'
bash: /root/trains.conf: Permission denied
bash: /root/trains.conf: Permission denied
```

And when I try to enqueue a task I get the following error, naturally:

```
trains_agent: ERROR: Could not find task id=05d03ebb905840279336ab57f6b69ac8 (for host: )
Exception: 'Tasks' object has no attribute 'id'
```
I have attached the following log file from the results section.
task_a5df428d97454314b0e56d66f3135fca.log
I am also adding the following log file from the agent-building section.
Lastly, adding the Dockerfile in case someone wants to use that. I learned how to use Docker in a week, so there might also be something going wrong there.
Hi @Mert-Ergin
A few remarks, before answering your question :)
- Did you add the `extra_index_url` to the `~/trains.conf`? As you can see here, we support having multiple indexes for the exact reason you mentioned.
- Horovod is one of the special cases trains-agent takes care of. Basically, it will always get installed last, after all the rest of the requirements are installed; this is because Horovod installs different flavours based on the pytorch/tensorflow installed in the system.
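For reference, a minimal sketch of what that could look like in `~/trains.conf`, assuming the `agent.package_manager.extra_index_url` key and using the NVIDIA index from this thread:

```
agent {
    package_manager {
        # additional package indexes pip should search besides the default PyPI
        extra_index_url: ["https://developer.download.nvidia.com/compute/redist/cuda/10.0"]
    }
}
```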
Regarding the errors:
- What's the trains-agent version you are using (both for building the docker and for running it)?
- This error basically says there is no Task with the requested ID, which is probably because it is missing permissions to your server (and by default it will try the demo-server).
- How did you get the error? Are you running the docker directly, or using it as the "base docker image" for a specific experiment?
- Just making sure, are you running trains-agent in docker mode?
> Lastly, adding the Dockerfile in case someone wants to use that. I learned how to use Docker in a week so there might also be something going wrong there.
👍 nice :)
Hi,
I'm updating here that the latest version of clearml-agent now includes venv caching capabilities 🎉 🎊
Add this section to your ~/clearml.conf file on the agent's machine:
```
agent {
    # cached virtual environment folder
    venvs_cache: {
        # maximum number of cached venvs
        max_entries: 10
        # minimum required free space to allow for cache entry, disable by passing 0 or negative value
        free_space_threshold_gb: 2.0
        # uncomment to enable virtual environment caching
        path: ~/.clearml/venvs-cache
    },
}
```
Reference: clearml-agent/docs/clearml.conf, line 93 at commit 22d5892.
Then upgrade and restart the clearml-agent:

```shell
pip install clearml-agent==0.17.2rc2
```
This is awesome, thanks a lot @bmartinn and the team!!
I am testing that right away 🤩