Comments (13)

Shaked commented on June 3, 2024

Hey @bmartinn

Thank you for your reply.

Apparently the issue was related to the experiment, because I tried another one and it just worked... I guess I did something wrong before and didn't pay enough attention to how I messed it up.

Thanks again
Shaked

bmartinn commented on June 3, 2024
  1. The Trains package creates a new experiment in trains-server.
    This experiment stores links to the git repository, the commit id, and the git diff at the time of execution. Note that the git repo and commit id are just links, not the actual code; the git diff, by contrast, is stored by trains-server as plain text embedded in the experiment (see the minimal sketch after this list).
  2. Cloning the experiment in the UI clones the environment/arguments of the original experiment. This means a copy of the git repo reference, the commit id, and the git diff (the actual text diff, not a reference).
  3. The trains-agent uses the git credentials to clone the requested repo/commit based on the reference stored on the experiment. It uses the git ssh key ring (stored in ~/.ssh) or the user/pass configured in ~/trains.conf L18.
  4. If you are re-running your code on your machine, trains will either re-use the original experiment (overwriting it) or, if you had artifacts/models stored during the previous execution, create a new experiment. Either way it will have no effect on the cloned experiment (the one trains-agent will be running).
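As a rough illustration of point 1, here is a minimal sketch (not from this thread; the project and task names are placeholders):

from trains import Task

# Registers (or re-uses) an experiment on trains-server; trains records the link
# to the git repository, the commit id, and the uncommitted diff (as plain text).
task = Task.init(project_name='examples', task_name='my experiment')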

Does that make sense?

One more question, what do you mean by:

  1. Do I need to mount different weights volume for each agent?

oak-tree commented on June 3, 2024

Hey @bmartinn,
Thanks for the answers,
So basically, if we need the trains-agent to be exposed to some data/files, we need to start it with volumes during its initialization, right?

bmartinn commented on June 3, 2024

@oak-tree, can I assume you are referring to docker volume mounts you would like the trains-agent to set for the experiment execution (obviously when running the trains-agent in docker mode)?

You can control the default docker arguments either when launching the trains-agent or per experiment in the UI.

  1. $ trains-agent --docker nvidia/cuda -v '/local/folder:/root/inside_folder'
    Will launch all experiments inside the nvidia/cuda container with /root/inside_folder mapped to /local/folder.
  2. In the UI, under the "Execution" tab, "Base Docker Image" controls the same docker image/arguments on a per-experiment basis. Meaning, if you clone an experiment and edit this section to "nvidia/cuda -v /local/folder:/root/inside_folder" you will get the same behavior.

Does that answer the question?

Shaked commented on June 3, 2024

@bmartinn

I can't speak for @oak-tree, but I think the main issue is what happens when working with more than one agent. As we read and write data/weights, it's not possible to have the same volume shared between more than one agent, as it might create a deadlock when files are open and being written to by another agent. How should we handle this? Do we need to have a /data folder per agent?

bmartinn commented on June 3, 2024

Hi @Shaked ,

I would not write data back to a shared mount, for the exact reason you mentioned. I would only use it as a read-only device.

Regarding storing weights, I would opt for auto-magic model copying, by configuring the "output_uri".
Just make sure the default_output_uri (or, per experiment, in the Web UI, see the "Execution" tab, "Output Destination" section) is configured to a second storage device.

For example, if default_output_uri=s3://bucket/, all models will be automatically uploaded to s3://bucket/<project_name>/<experiment name>.<experiment id>/ regardless of where they are stored locally! Yes, if we call torch.save(model, '/tmp/model.pth'), a copy of the file will be automatically uploaded to the output_uri.

If you need a shared folder instead of object storage, set "default_output_uri=/root/inside_folder" and you will end up with the same folder structure on the shared folder.

This ensures experiments do not overwrite one another's model/artifact files.

Notice that if you have other output artifacts, you could use Task.upload_artifact (example), and they will be stored in the same output_uri folder structure.
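To make the flow concrete, here is a minimal sketch, assuming your trains version accepts output_uri directly in Task.init (otherwise configure default_output_uri in ~/trains.conf); the bucket, project, and file names are placeholders:

import torch
from trains import Task

# Anything saved locally with torch.save is auto-magically copied to output_uri.
task = Task.init(project_name='examples', task_name='train', output_uri='s3://bucket/')

model = torch.nn.Linear(10, 2)
torch.save(model, '/tmp/model.pth')  # a copy lands under s3://bucket/examples/train.<id>/

# Other outputs go to the same destination via upload_artifact.
task.upload_artifact('metrics', artifact_object={'accuracy': 0.9})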

Sounds good?

Shaked commented on June 3, 2024

@bmartinn

This sounds like a great solution to me. Is it possible to use Azure Files instead of S3? I might have missed this part in the documentation.

@oak-tree, what do you think about this solution?

oak-tree commented on June 3, 2024

Hey @bmartinn

Agent dockerized settings

Thanks for answering. Yes, we are speaking about the dockerized agent.
It makes sense to start the agent with the desired volumes, but if I understand you correctly, it is also possible to control those parameters from the UI?

Resources stuff

Our in-house tool, which we think trains adds a very nice feature set on top of, saves weights and other artifacts to disk. It loads them on demand by some pattern, so for example in a new run everything is done automatically. We have migrating to a DB in the backlog, but we are not there yet.
Therefore my questions:
If we use some bucket on the cloud, does trains support some kind of name-based searching on this bucket? And, like @Shaked asked, does trains support Azure (Blob/Files/etc.) instead of AWS?

bmartinn commented on June 3, 2024

Hi @oak-tree ,
Regarding the docker volume mounts:
There are now two (and with the release of trains-agent v0.13.1, due to be released soon, three) options for controlling docker image arguments:

  1. Setting the default docker image/arguments when launching the trains-agent, either from the command line or in the configuration file.
  2. Per experiment, in the web UI, under "Execution" you can set "Base Docker Image".
    For example: nvidia/cuda -v /mnt/outside:/mnt/inside
    If this field is empty the trains-agent will use its default settings.
  3. Setting "extra_docker_arguments" (will be available in 0.13.1) will make sure the trains-agent always adds these arguments to the docker launch command (see the sketch after this list).

Storage Artifacts and Models etc

  • Trains supports S3/GS/Azure/http/local folders for any upload/download of artifacts.
    This means you can safely set "default_output_uri" to s3://my_bucket or azure://company.blob.core.windows.net/my_bucket/ (see the sketch after this list).
  • The auto-magic will make sure all saved models (stored locally, regardless of the path) are copied to that specific bucket (folder structure based on project name and experiment name).
  • It will also make sure all artifacts are uploaded to the same bucket, example here.
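A hedged sketch of the matching ~/trains.conf entry; only the default_output_uri key and the azure:// URI form come from this thread, and the sdk/development nesting is my assumption:

sdk {
    development {
        # default upload destination for models/artifacts of new experiments
        default_output_uri: "azure://company.blob.core.windows.net/my_bucket/"
    }
}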

Shaked commented on June 3, 2024

@bmartinn,

Regarding

Setting the default docker image/argument when launching trains-agent either with command line or with the configuration file
Per experiment, in the web UI, under "Execution" you can set "Base Docker Image".
For example nvidia/cuda -v /mnt/outside:/mnt/inside
If this field is empty the trains-agent will use its default settings.

I have tried to set this by using <customImage> -v /path/to:/volume -v /path/to2:volume2, but the agent keeps failing with missing python packages which are available in the <customImage>.

Looking at the logs, I can't see the agent even starting a container that uses the image, except for it specifying that:

...using default docker image: ABC.azurecr.io/vsearch/ABC-base-gpu...

I have tried to set the default docker from both the CLI and the UI, but none of them succeeded.

[trains UI screenshot]

Any idea what I'm doing wrong?

bmartinn commented on June 3, 2024

Hi @Shaked ,

Are you running trains-agent in docker mode?

Could you send the beginning of the log for a failed experiment (from the UI, Results -> Log)?

The first two lines should look something like this:

task AAAABBBCCC pulled from 111ZZZXXX by worker amachine:0
Running Task AAAABBBCCC inside docker: customImage -v /path/to:/volume -v /path/to2:volume2

Shaked commented on June 3, 2024

Hey @bmartinn

Are you running trains-agent in docker mode?

Yes

Could you send the beginning of the log for a failed experiment (from the UI, Results -> Log)?

Of course.

Storing stdout and stderr log to '/tmp/.trains_agent_out.ev5qlwmd.txt', '/tmp/.trains_agent_out.ev5qlwmd.txt'
Running Task 0e2db8f3e6ff4c1aabd7b62015c11812 inside docker: example.azurecr.io/vsearch/example-base-gpu:v-dev.gpu-v0.0.12-branch-docker-dataset --runtime=nvidia -v /home/Shaked/remote_code/green:/opt/green -v /home/Shaked/remote_code/experiments:/opt/experiments -v /data:/data -e CUDA_VISIBLE_DEVICES=1 -e PYTHON_PATH="/root/.trains/venvs-builds/3.5/lib/python3.5/site-packages:/opt"

Ends up with the following error:

Traceback (most recent call last):
File "/root/.trains/venvs-builds/3.5/code/train_resnet50_classification_full_power_128x128_balance_samples.py", line 7, in
from trains import Task
ImportError: No module named 'trains'

Note: I have also tried to remove -e PYTHON_PATH="/root/.trains/venvs-builds/3.5/lib/python3.5/site-packages:/opt" completely, but it doesn't really matter, I see the same error.

Just to make sure it's clear: we are trying to mount local Python modules (from /home/Shaked/remote_code/green to /opt/green, and into /opt/experiments) using our own custom base docker image.

Any ideas?

Thank you!
Shaked

bmartinn commented on June 3, 2024

@Shaked , it seems it is running the selected docker, but from your error I assume it did not install the required python packages inside the docker.

Let's take the trains package as an example, do you have it under "Experiment" -> "Execution" -> "Installed Packages" ?

If you do not, that seems to be the problem. Python packages are installed inside the docker according to this section of the experiment. It should be automatically populated when you execute your code in development mode, i.e. manually, without the trains-agent.

If the trains package is listed under "Installed Packages", then I suggest you send the installation log, because obviously it failed to install it...
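For reference, a minimal sketch of such a development-mode run (the project and task names are placeholders); running it locally once, without the trains-agent, should populate "Installed Packages" with the environment's packages, including trains itself:

from trains import Task

# In development mode trains records the script's Python package requirements;
# the trains-agent later installs that exact list inside the docker container.
task = Task.init(project_name='examples', task_name='my experiment')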

BTW, what do you mean by:

Just to make sure, we are trying to mount local python modules from /home/Shaked/remote_code/green:/opt/green and /opt/experiments using our own custom base docker image.
