
Comments (7)

bmartinn commented on June 3, 2024

Hi @H4dr1en, I can definitely feel you on this one :)
We used to use venv_update; in theory you can still try it (but to be honest, I'm not sure of its status).

Actually, we are working on accelerating pip install; in this issue you can see the full potential, and the initial PR.
I'm hoping that after 21.1 is released we will be able to merge all our improvements.
Feel free to join the discussion there :)

The idea is that the safest way to restore an environment is to recreate it (just imagine something goes wrong: the venv gets reused, from time to time something is a bit different, and you think you are getting the same environment, but you are not...)
And since everything is cached, and pip has no real dependencies to solve (think the second time around, when all the packages are pinned after a pip freeze of the initial venv), there is no reason the unzip should take more than a few seconds; after all, these GPU machines are usually fast enough to handle unzipping a few files...

from clearml-agent.

H4dr1en commented on June 3, 2024

Actually, we are working on accelerating pip install; in this issue you can see the full potential, and the initial PR.
I'm hoping that after 21.1 is released we will be able to merge all our improvements.
Feel free to join the discussion there :)

Kudos for the great work 🥇 Looks very promising!

The idea is that the safest way to restore an environment is to recreate it (just imagine something goes wrong: the venv gets reused, from time to time something is a bit different, and you think you are getting the same environment, but you are not...)

This is true in general, but in the specific case where a user wants to rerun an experiment on the same machine, no one would do that: the user would simply start the experiment again in the same environment. This would be very valuable in trains, because even when all the wheels are cached and reinstalled without solving any dependencies, the installation is still very slow with big libraries like pytorch, opencv, scipy, etc.

We are talking about 5 to 10 mins, even on a competitive machine, to rebuild an environment that was already built on that machine. IMO this is a real need that should be addressed, because reusing a previous environment shouldn't be difficult to achieve.

Why? Because most of the time, researchers have a lot of experiments, but only a small number of environments and it would be very convenient to attach the same environment to multiple experiments, therefore reducing the deployment time to 0. This would be a killer feature.

How I would see it:

Proposal 1

Trains agents take care of everything:

  • Do not delete envs after experiments are finished
  • Create an internal store of (hash of env, env location)
  • If an agent pulls a task whose requirements match a hash in its store, it reuses that env.
  • Provide this feature as an agent.cache_envs parameter to the user. Users who know they won't change the environment during their experiments (99% of users) can use this parameter.
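The store in proposal 1 could be sketched roughly as follows. This is only an illustration of the idea, not trains-agent code; the names `requirements_hash`, `venv_store`, and `venv_for` are hypothetical:

```python
import hashlib

def requirements_hash(requirements):
    """Hash a pip-frozen requirements list; identical environments map to one key."""
    # Sort and strip so that line order and whitespace do not change the hash.
    canonical = "\n".join(sorted(line.strip() for line in requirements if line.strip()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical internal store of (hash of env -> env location) kept by the agent.
venv_store = {}

def venv_for(requirements):
    """Return a previously built venv path for identical requirements, else None."""
    return venv_store.get(requirements_hash(requirements))

# Two tasks with the same pinned requirements would resolve to one venv:
reqs = ["torch==1.7.1", "opencv-python==4.5.1.48", "scipy==1.6.0"]
venv_store[requirements_hash(reqs)] = "/opt/venvs/a1b2c3"
```

Because the key is a hash of the pinned requirements, any change to a package version produces a different key and falls back to building a fresh venv.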

Proposal 2

  • Decouple environments from experiments: Users can create environments from the web UI/Python API and manage them (create/clone/delete/update list of requirements/package versions, ...)
  • Users can link environments to experiments: When creating/editing a task, users can specify which environment they want to use for one experiment (via the unique ID of the environment).
  • Keep the flexibility of the current implementation: Environments can be created on-the-fly when creating a task.
  • Have a programmatic access to these environments: One could do:
my_task = Task(...)  # Create task with new env, run vcs detection. Update new env.
my_task = Task(..., environment_id=...)  # Create task with already existing environment


bmartinn commented on June 3, 2024

Hi @H4dr1en
I think "Proposal 2" is something you can already achieve.
This is basically building a docker image and using it as the base docker image.

trains-agent build --docker nvidia/cuda --id aa11bb22 --target my_new_env_docker

This command will take experiment id "aa11bb22" and build a docker image with everything from the environment defined in that experiment preinstalled.

Now you can use the newly created base docker ("my_new_env_docker") as the base docker for all your experiments. Basically, the environment is installed as the "system" environment inside the docker, and every venv created inherits its packages. This means everything is preinstalled, but you still have the possibility to change package versions if needed.
What do you think ?

Regarding "proposal 1": it makes sense only if we hash the environment requirements, and the question is how many venvs we cache. This is doable but might require some work; it also might be a bit more complicated to share the venvs if you are running multiple agents on the same machine. My real fear is stability: it would be quite bad if from time to time you got the wrong venv, or a venv with leftovers...


Mert-Ergin commented on June 3, 2024

Hi,
I am on the same bandwagon and tried proposal 2 by setting up my own docker environment. I need this solution specifically because I have to use nvidia-dali for fast pre-processing. However, nvidia-dali must be installed with the following command:

pip install nvidia-dali==0.21.0 --extra-index-url https://developer.download.nvidia.com/compute/redist/cuda/10.0
However, as mentioned in the issues section, trains-agent's 'pip freeze' does not capture --extra-index-url.

I also need to install horovod, which also requires some previous steps. I managed to build this docker and run it using the following command:

trains-agent build --docker name_of_docker --id 41672b8... --target trains_docker

It builds and shows up as a worker in the workers & queues section, with the following errors:

trains_agent: ERROR: Could not parse task execution info: 'Tasks' object has no attribute 'script'
trains_agent: ERROR: 'NoneType' object has no attribute 'id'

bash: /root/trains.conf: Permission denied
bash: /root/trains.conf: Permission denied

And when I try to enqueue a task I naturally get the following error:

trains_agent: ERROR: Could not find task id=05d03ebb905840279336ab57f6b69ac8 (for host: )
Exception: 'Tasks' object has no attribute 'id'

I have attached the following log file from results section.
task_a5df428d97454314b0e56d66f3135fca.log
I am also adding following log file for agent building section.

agent-build.log

Lastly, I'm adding the Dockerfile in case someone wants to use it. I learned how to use Docker in a week, so there might also be something going wrong there.

Dockerfile.txt


bmartinn commented on June 3, 2024

Hi @Mert-Ergin

A few remarks, before answering your question :)

  1. Did you add the extra_index_url to your ~/trains.conf? As you can see here, we support multiple indexes for exactly the reason you mentioned.
  2. Horovod is one of the special cases trains-agent takes care of. It will always be installed last, after all the other requirements, because Horovod installs different flavours depending on the pytorch/tensorflow installed in the system.
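The install-last behaviour described above can be sketched as follows. This is a hypothetical helper illustrating the ordering, not the agent's actual implementation:

```python
def ordered_install(requirements, install_last=("horovod",)):
    """Reorder requirements so special packages (e.g. horovod) install after the rest.

    Horovod picks its flavour at install time based on the pytorch/tensorflow
    already present, so it must come last.
    """
    def goes_last(req):
        # Package name is the part before any version specifier.
        name = req.split("==")[0].split(">=")[0].strip().lower()
        return name in install_last

    ordinary = [r for r in requirements if not goes_last(r)]
    deferred = [r for r in requirements if goes_last(r)]
    return ordinary + deferred
```

For example, `ordered_install(["horovod==0.21.0", "torch==1.7.1"])` would place torch before horovod regardless of the order in the requirements list.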

Regarding the error:

  • What's the trains-agent version you are using (both for building the docker and for running it)?
  • This error is basically saying there is no Task with the requested ID, which probably means it is missing permissions to your server (and by default it will try the demo-server).
  • How did you get the error: are you running the docker directly, or using it as the "base docker image" for a specific experiment?
  • Just making sure, are you running trains-agent in docker mode?

Lastly, I'm adding the Dockerfile in case someone wants to use it. I learned how to use Docker in a week, so there might also be something going wrong there.

👍 nice :)


bmartinn commented on June 3, 2024

Hi,
I'm updating here that the latest version of clearml-agent now includes venv caching capabilities 🎉 🎊

Add this section to your ~/clearml.conf file on the agent's machine:

agent {
    # cached virtual environment folder
    venvs_cache: {
        # maximum number of cached venvs
        max_entries: 10
        # minimum required free space to allow for cache entry, disable by passing 0 or negative value
        free_space_threshold_gb: 2.0
        # unmark to enable virtual environment caching
        path: ~/.clearml/venvs-cache
    }
}

Then upgrade and restart the clearml-agent:

pip install clearml-agent==0.17.2rc2


H4dr1en commented on June 3, 2024

This is awesome, thanks a lot @bmartinn and the team!!
I am testing that right away 🤩

