
clearml-agent's Introduction

ClearML Agent - MLOps/LLMOps made easy
MLOps/LLMOps scheduler & orchestration solution supporting Linux, macOS and Windows


🌟 ClearML is open-source - Leave a star to support the project! 🌟


ClearML-Agent

Formerly known as Trains Agent

It is a zero configuration fire-and-forget execution agent, providing a full ML/DL cluster solution.

Full Automation in 5 steps

  1. ClearML Server self-hosted or free tier hosting
  2. pip install clearml-agent (install the ClearML Agent on any GPU machine: on-premises / cloud / ...)
  3. Create a job or add ClearML to your code with just 2 lines of code (see the snippet right after this list)
  4. Change the parameters in the UI & schedule for execution (or automate with an AutoML pipeline)
  5. 📉 📈 👀 🍺
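For step 3, adding ClearML to an existing script usually comes down to two lines at the top of the entry point. A minimal sketch (the project and task names here are placeholders):

from clearml import Task
task = Task.init(project_name="examples", task_name="my experiment")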

"All the Deep/Machine-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"

Try ClearML now: Self-Hosted or Free tier Hosting

Simple, Flexible Experiment Orchestration

The ClearML Agent was built to address the DevOps needs of DL/ML R&D:

  • Easily add & remove machines from the cluster
  • Reuse machines without the need for any dedicated containers or images
  • Combine GPU resources across any cloud and on-prem
  • No need for yaml / json / template configuration of any kind
  • User friendly UI
  • Manageable resource allocation that can be used by researchers and engineers
  • Flexible and controllable scheduler with priority support
  • Automatic instance spinning in the cloud

Using the ClearML Agent, you can now set up a dynamic cluster with *epsilon DevOps

*epsilon - Because we are 🤓 and nothing is really zero work

Kubernetes Integration (Optional)

We think Kubernetes is awesome, but it is not a must to get started with remote execution agents and cluster management. We designed clearml-agent so you can run both bare-metal and on top of Kubernetes, in any combination that fits your environment.

You can find the Dockerfiles in the docker folder and the Helm chart at https://github.com/allegroai/clearml-helm-charts

Benefits of integrating an existing Kubernetes cluster with ClearML

  • ClearML-Agent adds the missing scheduling capabilities to your Kubernetes cluster
  • Users do not need to have direct Kubernetes access!
  • Easy learning curve with UI and CLI requiring no DevOps knowledge from end users
  • Unlike other solutions, ClearML-Agents work in tandem with other customers of your Kubernetes cluster
  • Allows for more flexible automation from code, building pipelines and visibility
  • A programmatic interface for easy CI/CD workflows, enabling GitOps to trigger jobs inside your cluster
  • Seamless integration with the ClearML ML/DL/GenAI experiment manager
  • Web UI for customization, scheduling & prioritization of jobs
  • Enterprise Features: RBAC, vault, multi-tenancy, scheduler, quota management, fractional GPU support

Run the agent in Kubernetes Glue mode and map ClearML jobs directly to K8s jobs (a launch sketch follows this list):

  • Use the ClearML Agent Helm Chart to spin an agent pod acting as a controller
  • The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a Kubernetes job (based on a provided YAML template)
  • Inside each pod, the clearml-agent will install the job (experiment) environment and spin and monitor the experiment's process, fully visible in the ClearML UI
  • Benefits: Kubernetes full view of all running jobs in the system
  • Enterprise Features
    • Full scheduler features added on top of Kubernetes, with quota/over-quota management, priorities and order.
    • Fractional GPU support, allowing multiple isolated containers sharing the same GPU with memory/compute limit per container
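As referenced above, a minimal sketch of launching the glue controller from the command line, assuming the k8s_glue_example.py script from this repository (also referenced in the issues further down this page); the queue name is a placeholder:

python k8s_glue_example.py --queue k8s_gpu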

SLURM (Optional)

Yes! SLURM integration is available; check the documentation for further details.

Using the ClearML Agent

Full scale HPC with a click of a button

The ClearML Agent is a job scheduler that listens on job queue(s), pulls jobs, sets up the job environment, executes the job, and monitors its progress.

Any 'Draft' experiment can be scheduled for execution by a ClearML agent.

A previously run experiment can be put into 'Draft' state by either of two methods:

  • Using the 'Reset' action from the experiment right-click context menu in the ClearML UI - This will clear any results and artifacts the previous run had created.
  • Using the 'Clone' action from the experiment right-click context menu in the ClearML UI - This will create a new 'Draft' experiment with the same configuration as the original experiment.

An experiment is scheduled for execution using the 'Enqueue' action from the experiment right-click context menu in the ClearML UI and selecting the execution queue.

See creating an experiment and enqueuing it for execution.

Once an experiment is enqueued, it will be picked up and executed by a ClearML Agent monitoring this queue.

The ClearML UI Workers & Queues page provides ongoing execution information:

  • Workers Tab: Monitor your cluster
    • Review available resources
    • Monitor machine statistics (CPU / GPU / Disk / Network)
  • Queues Tab:
    • Control the scheduling order of jobs
    • Cancel or abort job execution
    • Move jobs between execution queues

What The ClearML Agent Actually Does

The ClearML Agent executes experiments using the following process:

  • Create a new virtual environment (or launch the selected docker image)
  • Clone the code into the virtual-environment (or inside the docker)
  • Install python packages based on the package requirements listed for the experiment
    • Special note for PyTorch: The ClearML Agent will automatically select the torch packages based on the CUDA_VERSION environment variable of the machine
  • Execute the code, while monitoring the process
  • Log all stdout/stderr in the ClearML UI, including the cloning and installation process, for easy debugging
  • Monitor the execution and allow you to manually abort the job using the ClearML UI (or, in the unfortunate case of a code crash, catch the error and signal the experiment has failed)
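The packages installed in the third step above are taken from the experiment's recorded requirements. If the automatic analysis misses a package (see the missing scipy issue further down this page), one way to pin it from code is the clearml SDK's Task.add_requirements call, made before Task.init. A minimal sketch (the package, project and task names are just examples):

from clearml import Task

# ask the agent to install scipy even if it is not detected from the imports
Task.add_requirements("scipy")
task = Task.init(project_name="examples", task_name="my experiment")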

System Design & Flow

[clearml-architecture diagram]

Installing the ClearML Agent

pip install clearml-agent

ClearML Agent Usage Examples

Full Interface and capabilities are available with

clearml-agent --help
clearml-agent daemon --help

Configuring the ClearML Agent

clearml-agent init

Note: The ClearML Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default ClearML Agent cache folder is ~/.clearml.

See full details in your configuration file at ~/clearml.conf.

Note: The ClearML Agent extends the ClearML configuration file ~/clearml.conf. They are designed to share the same configuration file, see example here
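For reference, a minimal sketch of the agent-related part of ~/clearml.conf, using key names that also appear in the configuration dumps quoted in the issues below (values are placeholders for your own setup):

agent {
    # where virtual environments are built
    venvs_dir: ~/.clearml/venvs-builds

    package_manager {
        # pip or conda
        type: pip
    }

    default_docker {
        # default docker image used when running in --docker mode
        image: "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04"
    }
}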

Running the ClearML Agent

For debug and experimentation, start the ClearML agent in foreground mode, where all the output is printed to screen:

clearml-agent daemon --queue default --foreground

For actual service mode, all the stdout will be stored automatically into a temporary file (no need to pipe). Notice: with the --detached flag, the clearml-agent will run in the background

clearml-agent daemon --detached --queue default

GPU allocation is controlled via the standard NVIDIA_VISIBLE_DEVICES environment variable or the --gpus flag (or disabled with --cpu-only).

If no flag is set and the NVIDIA_VISIBLE_DEVICES variable doesn't exist, all GPUs will be allocated to the clearml-agent.
If the --cpu-only flag is set, or NVIDIA_VISIBLE_DEVICES="none", no GPU will be allocated to the clearml-agent.

Example: spin two agents, one per GPU on the same machine:

Notice: with the --detached flag, the clearml-agent will run in the background

clearml-agent daemon --detached --gpus 0 --queue default
clearml-agent daemon --detached --gpus 1 --queue default

Example: spin two agents, pulling from a dedicated dual_gpu queue, two GPUs per agent

clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu

Starting the ClearML Agent in docker mode

For debug and experimentation, start the ClearML agent in foreground mode, where all the output is printed to screen

clearml-agent daemon --queue default --docker --foreground

For actual service mode, all the stdout will be stored automatically into a file (no need to pipe). Notice: with the --detached flag, the clearml-agent will run in the background

clearml-agent daemon --detached --queue default --docker

Example: spin two agents, one per GPU on the same machine, with the default nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04 docker:

clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04

Example: spin two agents, pulling from a dedicated dual_gpu queue, two GPUs per agent, with the default nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04 docker:

clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04

Starting the ClearML Agent - Priority Queues

Priority Queues are also supported, example use case:

High priority queue: important_jobs, low priority queue: default

clearml-agent daemon --queue important_jobs default

The ClearML Agent will first try to pull jobs from the important_jobs queue, and only if it is empty, the agent will try to pull from the default queue.
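If strict priority is not what you want, the agent can also alternate between queues in round-robin order using the --order-fairness flag (the same flag appears in the Kubernetes Glue issue further down this page). A sketch:

clearml-agent daemon --queue important_jobs default --order-fairness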

Adding queues, managing job order within a queue, and moving jobs between queues are all available using the Web UI; see the example on our free server.

Stopping the ClearML Agent

To stop a ClearML Agent running in the background, run the same command line used to start the agent with --stop appended. For example, to stop the first of the single-GPU, same-machine agents shown above:

clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04 --stop

How do I create an experiment on the ClearML Server?

  • Integrate ClearML with your code

  • Execute the code on your machine (Manually / PyCharm / Jupyter Notebook)

  • As your code is running, ClearML creates an experiment logging all the necessary execution information:

    • Git repository link and commit ID (or an entire jupyter notebook)
    • Git diff (we’re not saying you never commit and push, but still...)
    • Python packages used by your code (including specific versions used)
    • Hyperparameters
    • Input artifacts

    You now have a 'template' of your experiment with everything required for automated execution

  • In the ClearML UI, right-click on the experiment and select 'clone'. A copy of your experiment will be created.

  • You now have a new draft experiment cloned from your original experiment, feel free to edit it

    • Change the hyperparameters
    • Switch to the latest code base of the repository
    • Update package versions
    • Select a specific docker image to run in (see docker execution mode section)
    • Or simply change nothing to run the same experiment again...
  • Schedule the newly created experiment for execution: right-click the experiment and select 'enqueue'
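The same clone-and-enqueue flow can also be scripted with the clearml SDK, which is how the AutoML examples below drive the agent. A minimal sketch (project, task, parameter and queue names are placeholders):

from clearml import Task

# fetch the 'template' experiment created by the original run
template = Task.get_task(project_name="examples", task_name="my experiment")

# clone it, optionally tweak a hyperparameter, then schedule the copy for execution
cloned = Task.clone(source_task=template, name="my experiment (cloned)")
cloned.set_parameter("General/learning_rate", 0.001)
Task.enqueue(cloned, queue_name="default")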

ClearML-Agent Services Mode

ClearML-Agent Services is a special mode of ClearML-Agent that provides the ability to launch long-lasting jobs that previously had to be executed on local / dedicated machines. It allows a single agent to launch multiple dockers (Tasks) for different use cases:

  • Auto-scaler service (spinning instances when the need arises and the budget allows)
  • Controllers (Implementing pipelines and more sophisticated DevOps logic)
  • Optimizer (such as Hyperparameter Optimization or sweeping)
  • Application (such as interactive Bokeh apps for increased data transparency)

ClearML-Agent Services mode will spin any task enqueued into the specified queue. Every task launched by ClearML-Agent Services will be registered as a new node in the system, providing tracking and transparency capabilities. Currently, clearml-agent in services mode supports CPU-only configurations. ClearML-Agent services mode can be launched alongside GPU agents.

clearml-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only

Note: It is the user's responsibility to make sure the proper tasks are pushed into the specified queue.
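One common way to push such long-lasting jobs into the services queue is from the job's own code, using the clearml SDK's execute_remotely call. A minimal sketch (project and task names are placeholders):

from clearml import Task

task = Task.init(project_name="DevOps", task_name="my pipeline controller")
# stop the local run and re-launch this task on an agent listening on the 'services' queue
task.execute_remotely(queue_name="services", exit_process=True)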

AutoML and Orchestration Pipelines

The ClearML Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the ClearML package.

Sample AutoML & Orchestration examples can be found in the ClearML example/automation folder.

AutoML examples:

Experiment Pipeline examples:

  • First step experiment
    • This example will "process data", and once done, will launch a copy of the 'second step' experiment-template
  • Second step experiment
    • In order to create an experiment-template in the system, this code must be executed once manually

License

Apache License, Version 2.0 (see the LICENSE for more information)

clearml-agent's People

Contributors

achaiah, ae-ae, alex-burlacu-clear-ml, allegroai-git, eliorc, feu-aklos, h4dr1en, honzys, idantene, ilouzl, incognito124, jday1, jkhenning, lucacerone, mmiller-max, nfzd, nielstenboom, pollfly, pshowbs, sgasse, xadcoh



clearml-agent's Issues

Missing an installed package (scipy)

A trains-agent experiment run is missing one of the installed packages (scipy), hence it aborts.

Please let me know in case you need more information

python3.6/site-packages/keras_preprocessing/image/affine_transformations.py", line 281, in apply_affine_transform
raise ImportError('Image transformations require SciPy. '
ImportError: Image transformations require SciPy. Install SciPy.

Feature request: Allow to modify SSH forwarding for docker daemon

I usually use SSH forwarding from host to container as follows:

-v $SSH_AUTH_SOCK:$SSH_AUTH_SOCK -e SSH_AUTH_SOCK=$SSH_AUTH_SOCK

However, this seems to conflict with the current way of mounting the host's .ssh directory. Could you add an option in the config file to override this part of the docker run arguments with the ones I would like to use?

Support for on-exit handler.

For more context see the slack thread here.

In speaking to the team, I understand there's no way right now to specify an on-exit task to be performed on a thread run from trains-agent. In my script, I have a signal handler that catches SIGINT and SIGTERM and saves the model. This is useful when terminating a long running experiment where I still want the model to be saved.

Unfortunately, aborting the same thread from trains-agent does not produce the same outcome. The trains task not only overwrites the signal handlers, it SIGKILLs (or equivalent) the script if it doesn't exit within 2 seconds, which may not be enough time to run a final validation run and save the model.

What would be great is an explicit handler that is called on exit, and an extended (or maybe configurable) grace period on exit. In many cases, the handler may signal a separate training thread to save and exit, which may take some time. Premature termination of the script will impede the proper cleanup.

api_server is misconfigured: Is this the TRAINS API server http://localhost:8008?

Hi everyone.
I don't know if I have done something wrong. I set up the TRAINS-agent and TRAINS-server on the same machine, and I adjusted the api section in trains.conf as shown below:
api {
# Notice: 'host' is the api server (default port 8008), not the web server.
api_server: http://10.53.9.37:8008
web_server: http://10.53.9.37:8080
files_server: http://10.53.9.37:8081
# Credentials are generated using the webapp, http://localhost:8080/profile
# Override with os environment: TRAINS_API_ACCESS_KEY / TRAINS_API_SECRET_KEY
credentials {"access_key": "***", "secret_key": ""}
}

when I enqueue the experiment no matter in docker mode or not I got error:
trains_agent: ERROR: Connection Error: it seems api_server is misconfigured. Is this the TRAINS API server http://localhost:8008 ?

that is the log of my TRAINS-agent
Current configuration (trains_agent v0.16.0, location: /home/bladesaber/trains.conf):

api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.api_server = http://10.53.9.37:8008
api.web_server = http://10.53.9.37:8080
api.files_server = http://10.53.9.37:8081
api.credentials.access_key = B6PVLPCB74BMVFYIUUSU
api.host = http://10.53.9.37:8008
agent.worker_id =
agent.worker_name = bladesaber-MS-7C02
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = <20.2
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = defaults
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = pytorch
agent.package_manager.torch_nightly = false
agent.venvs_dir = /home/bladesaber/.trains/venvs-builds.1
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /home/bladesaber/.trains/vcs-cache.1
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /home/bladesaber/.trains/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /home/bladesaber/.trains/pip-cache
agent.docker_apt_cache = /home/bladesaber/.trains/apt-cache.1
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:10.1-runtime-ubuntu18.04
agent.git_user =
agent.default_python = 3.8
agent.cuda_version = 100
agent.cudnn_version = 76
sdk.storage.cache.default_base_dir = ~/.trains/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = false
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key =
sdk.aws.s3.region =
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true
sdk.development.default_output_uri =
sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false

Worker "bladesaber-MS-7C02:gpu0" - Listening to queues:
+----------------------------------+---------+-------+
| id | name | tags |
+----------------------------------+---------+-------+
| fdf18427b96b4b3f8e7740bc4fa5b00d | default | |
+----------------------------------+---------+-------+

Running in Docker mode (v19.03 and above) - using default docker image: nvidia/cuda running python3

How can I solve the problem?

best regards & thanks

trains-agent fails to replicate Python environment with pytorch and torchvision

Hi,
I've tried several combinations of pytorch (1.2, 1.4) and cuda (10, 10.1, 10.2), running trains-agent both with and without docker. In all of the cases I get a similar error of the type:
Exception when trying to resolve python wheel: Was not able to find pytorch wheel URL: Could not find wheel for "torchvision==0.4.0", Available versions: ['1.0.0', '1.0.1', '1.1.0', '1.2.0']
be it for torch or torchvision.
I also get errors stating cuda 10.1 and cuda 10.2 are not supported.
for reference, please see two attached logs.
task_a241d53e9593473a858287113ac69cb0.log
task_53b1efb08826447985df4e902c35c6a0.log

thank you!

Adding nvidia apex to the list of unique packages that need to be installed at the end [similar to horovod]

Hello allegro.ai team,

Thank you for your amazing contribution to the community.

I and my friends really like to machine learn, and using your open-source tools makes it even more enjoyable.

We encountered some issues when we wanted to machine learn in parallel on multi-GPU using the NVIDIA apex library. The problem is that although apex requires pytorch, pip insists on installing apex before pytorch, and the installation fails.

This makes us very sad, can you fix it, please?

Thanks!

AutoML/cloning issue

When cloning a base experiment (e.g. automl_base_template_keras_simple) manually or automatically (e.g. with automl_random_search_example), the trains agent catches it correctly and sets up the relevant environment, but then fails to execute it. From the log:

Environment setup completed successfully
Starting Task Execution:
Using TensorFlow backend.

It stops here: no error, but also no artifact.

Using SSH to clone repos does not work

Context

If I don't specify git user and git pass in the config, it should automatically use SSH.

Problem

  • It is not clear which ssh key is used
  • When running the following commands during startup on the trains-agent machine:
eval "$(ssh-agent -s)"
ssh-add /home/h4dr1en/.ssh/id_rsa

I can see in the logs that it was added correctly (at startup the user is root)

May 11 10:01:00 GCEMetadataScripts[711]: 2020/05/11 10:01:00 GCEMetadataScripts: startup-script: Agent pid 1373
May 11 10:01:00 GCEMetadataScripts[711]: 2020/05/11 10:01:00 GCEMetadataScripts: startup-script: Identity added: /home/h4dr1en/.ssh/id_rsa (/home/h4dr1en/.ssh/id_rsa)

If I connect to the trains-agent, I can also test the connection to github:

$ ssh -T git@github.com
Warning: Permanently added the RSA host key for IP address '140.82.118.3' to the list of known hosts.
Hi H4dr1en! You've successfully authenticated, but GitHub does not provide shell access.

But for some reason trains-agent does not succeed in cloning the repo of the experiment (from the logs):

cloning: https://github.com/h4dr1en/training-repo
fatal: could not read Username for 'https://github.com': terminal prompts disabled
Repository cloning failed: Command '['clone', 'https://github.com/h4dr1en/training-repo', '/root/.trains/vcs-cache/training-repo.pytorch.65c96545aef218c67580e8307b6d0267/training-repo', '--quiet', '--recursive']' returned non-zero exit status 128
trains_agent: ERROR: Failed cloning repository.
1) Make sure you pushed the requested commit:
(branch='master', tag='', repository='https://github.com/h4dr1en/training-repo', commit_id='', entry_point='src/cli.py', working_dir='.')
2) Check if remote-worker has valid credentials [see worker configuration file]

It looks like trains-agent still tries to clone using HTTPS and fails because I did not specify creds in the trains.conf file

Note: I could solve the problem by adding (from here):

git config --global --add url."git@github.com:".insteadOf "https://github.com/"

at startup, but I think this should be handled by trains-agent, right?

EDIT:

git config --system --add url."git@github.com:".insteadOf "https://github.com/"

After fresh install of trains-agent ERROR: cannot import name 'config_obj' from 'trains_agent.config'

Hi all,

I recently got a machine to use for the trains-agent, but I have trouble getting it started.
I made a new conda env with trains-agent installed in it (pip), after which I ran "trains-agent init", which told me that I already have a conf file, which I do, and that is fine.

The problem is when I run:
trains-agent daemon --default

Which results in

trains_agent: ERROR: cannot import name 'config_obj' from 'trains_agent.config' (c:\users\name\.conda\envs\trains\lib\site-packages\trains_agent\config.py)

I'm out of ideas what to try now, reinstalling didn't help, the versions are

trains                    0.16.1                   pypi_0    pypi
trains-agent              0.16.0                   pypi_0    pypi

clearml_agent: ERROR: It is required that you pass in a value for the "algorithms" argument when calling decode().

When executing the command "clearml-agent daemon --queue default" to start an agent running tasks from the default queue, I get a weird error: "clearml_agent: ERROR: It is required that you pass in a value for the "algorithms" argument when calling decode()". This makes no sense as it was working fine yesterday, and this seems to be an Azure error msg. I am using the free-tier-hosted ClearML dashboard and running on an MBP M1. Can anyone point me in the direction of a fix? Thanks

Git clone not working in docker

I tried running a job with a pre-built base docker image and wanted to checkout some newer code in a feature branch from a git repo. This fails when running in docker since no git SSH key is visible there. Is there any mechanism to mount the git SSH key into the container or to checkout the source code outside the container and just mount it as a volume?

fatal: Could not read from remote repository.
Please make sure you have the correct access rights and the repository exists.
error: Could not fetch origin

As a workaround I tried to apply the diff in "UNCOMMITTED CHANGES" and remove the repo from the configuration, but the job crashes as well:

clearml_agent: ERROR: Can not run task without repository or literal script in `script.diff`

UPDATE: I found that a copy of the whole ~/.ssh directory and also a cache of the git repo from the host machine are mounted into the docker container:

Executing: ['docker', 'run', ... '-v', '/tmp/clearml_agent.ssh.shht7is1:/root/.ssh', ... '-v', '/home/<user>/.clearml/vcs-cache:/root/.clearml/vcs-cache', ...

How to use argparser?

I'm trying the following for semi-autoML, without any luck. Am I doing it wrong? argparse is one of the supported types for task.connect(), right?

parser = argparse.ArgumentParser()
...
opt = parser.parse_args()  
opt=task.connect(opt)

Error:
Exception: Unsupported mutable type Namespace: no connect function found

Thanks!

--env passed via task.set_base_docker is not accepted in k8s glue.

task.set_base_docker("nvcr.io/nvidia/tensorflow:19.11-tf2-py3 --env TRAINS_AGENT_GIT_USER=git_username_here --env TRAINS_AGENT_GIT_PASS=git_password_here")

The above will give error

skipping docker argument  
TRAINS_AGENT_GIT_USER=git_username_here (only -e --env supported)
TRAINS_AGENT_GIT_PASS=git_username_here (only -e --env supported)

Node metrics not showing

Hi,
I got one trains-agent running and I linked it to my Trains installation.
Everything looks good and the Agent is capable of consuming tasks from the default queue as expected.

A minor issue is the fact that I don't get any usage statistics in the Workers & Queues tab.


I checked docs but I didn't find any help.

trains_agent: ERROR: Failed cloning repository.

Problem

Summary: Trains agent, running in docker mode, fails to clone a private repository.

I have a trains-agent (version 0.16.1) running on Machine X, it is connected to a GPU and was launched using trains-agent daemon --detached --gpus 0 --queue single_gpu --docker.
In order to make sure Machine X can clone the repo I ran git clone <[email protected]:repo_url> and it successfully cloned the repo.
From Machine A, I'm launching a task using Task.execute_remotely(queue_name='single_gpu') and I get the following error:

cloning: <[email protected]:repo_url>
2020-10-17 12:08:01
Host key verification failed.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
Repository cloning failed: Command '['clone', '<[email protected]:repo_url>, '/root/.trains/vcs-cache/<repo_name>.git.95435a3ab9551ab8e978fcc8f950307a/<repo_name>.git', '--quiet', '--recursive']' returned non-zero exit status 128.
trains_agent: ERROR: Failed cloning repository.
1) Make sure you pushed the requested commit:
(repository='<[email protected]:repo_url>', branch='<branch_name>', commit_id='c0c21d178aa810e3f78f6df7bdc41bc9df732af4', tag='', entry_point='<script_name>', working_dir='.')
2) Check if remote-worker has valid credentials [see worker configuration file]

I did verify that <[email protected]:repo_url> is correct, it is correctly pointing at an SSH cloning URL

What have I tried?

  1. Tried using username + password (agent.git_(user|password))
    instead of SSH
  2. Empty username and password (agent.git_(user|password)) with agent.force_git_ssh_protocol: true
  3. Pushed the script to the branch (empty git diff)

Between each try, I stopped the agent trains-agent daemon --stop and relaunched the agent just to make sure the changes in trains.conf are active

git creds are not used when installing packages

Context

  • An experiment lives in a private repo
  • agent.git_user and agent.git_pass are configured in trains.conf
  • This experiment is a python package installed at runtime
  • This python package has dependencies
  • Among the dependencies, one is another private repository

Problem

trains-agent will successfully clone the private repo where the experiment lives using the configured git credentials. But it will fail installing the dependencies, because one of them is a private repo and trains-agent won't use the configured git credentials for the installation of the dependencies.

Solution

trains-agent should use the git credentials for any git operation, regardless of when it happens.

torch version inference logic broken when torchvision is specified

If I start an experiment with the following requirements defined in the UI:

torch==1.3.1

The installation works well, But if I use the following requirements:

torch==1.3.1
torchvision==0.2.1

Then it fails trying to install torch==0.2.1 after installing torch==1.3.1. Perhaps the parsing of the torchvision version has an error?

Here is the full log of the error:

Requirement already up-to-date: pip==20.1 in /home/H4dr1en/.trains/venvs-builds/3.7/lib/python3.7/site-packages (20.1)
Collecting Cython
  Using cached Cython-0.29.17-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
Installing collected packages: Cython
Successfully installed Cython-0.29.17
Collecting torch==1.3.1+cpu
  File was already downloaded /home/H4dr1en/.trains/pip-download-cache/cu0/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl
Successfully downloaded torch
Collecting torch==0.2.1
  ERROR: HTTP error 403 while getting http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl
  ERROR: Could not install requirement torch==0.2.1 from http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl because of error 403 Client Error: Forbidden for url: http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl
ERROR: Could not install requirement torch==0.2.1 from http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl because of HTTP error 403 Client Error: Forbidden for url: http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl for URL http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl
trains_agent: ERROR: Could not download wheel name of "http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl"
ERROR: Double requirement given: torch==0.2.1 from http://download.pytorch.org/whl/cu0/torch-0.2.1-cp37-cp37m-linux_x86_64.whl (from -r /tmp/cached-reqsipcp8nfs.txt (line 2)) (already in torch==1.3.1+cpu from file:///home/H4dr1en/.trains/pip-download-cache/cu0/torch-1.3.1%2Bcpu-cp37-cp37m-linux_x86_64.whl (from -r /tmp/cached-reqsipcp8nfs.txt (line 1)), name='torch')
trains_agent: ERROR: Could not install task requirements!
Command '['/home/H4dr1en/.trains/venvs-builds/3.7/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqsipcp8nfs.txt']' returned non-zero exit status 1.
DONE: Running task 'c63fc150ff5049c4939cd6a37f3d30a8', exit status 1

System: Linux Debian 9
Cuda: not installed (no gpu)

getting all monitor metrics for a task from the api

Hello,
we would like to get the time series of all scalar plots, not the hardware ones, just the ones for loss, val_roc and such, per iteration. The use case is an automatic process that follows experiments to see that they converge towards the same point.
thanks

Init script shows git creds

When running trains-agent init, the console prompts git creds:

Enter git username for repository cloning (leave blank for SSH key authentication): [] {username}
Enter password for user 'h4dr1en': {password}.
Git repository cloning will be using user={username} password={password}

The password should be hidden (it is currently visible)

Cloning local (conda) venv

Hi all, I have been trying out trains-agent for a while now and I was wondering if there is an option for it to clone the venv from a preexisting conda venv?
So for example, I have a few projects I have been working on, each having their own virtual environment made in conda. Could these projects' environments be somehow cloned to a trains-agent so that it doesn't redownload the whole requirements and git folders etc.?
Maybe this option already exists but I might have missed it.

Thanks!

Provide option for conda installation parameters

First, thanks for the great work 👍

Currently, agents with package_manager.type = conda will install silently using --quiet:

Executing Conda: /home/h4dr1en/miniconda3/bin/conda install -p /home/h4dr1en/.trains/venvs-builds/3.6 -c defaults -c conda-forge -c pytorch pip==20.1 --quiet --json

For debugging purposes, it would be great if instead one could specify other parameters accepted by conda install, notably -v

Does trains-agent cache experiment envs?

Context

Hi,

Most of the time (99%), we send tasks to trains-agent with changes in code, but no changes in requirements (the environment does not change). We would expect that the environment (venv) is cached and reused between different experiments, to spare us the installation time (5-10 mins), so that we can iterate faster.

Problem

  • I tried to run the artifact_toy example locally.
  • I then cloned the experiment in the UI, reset it and sent it to queue again
  • Wait for the experiment to finish, and repeat previous step.

Therefore the task is executed twice on the same trains-agent.

Actual behavior

The logs below show the execution trace of the second run in the agent. As you can see:

  • Git repo was successfully cached and reused
  • Packages were successfully cached and reused
  • The environment, although the same, was not cached and reused; it was reinstalled.

Expected behavior

Since the task runs a second time with the same environment, I would expect trains-agent to reuse it, saving me the time to install it. I would expect trains-agent to create a hash of the environment from the list of requirements (on task creation, not after the task finishes, so that it can match another draft task's requirements) and reuse the same env if a new task has the same hash.

Logs

2020-05-19T07:41:06.320Z instance-2:0 INFO task 6a5045e5c3b74afb892f85986b655218 pulled from 672de23dcf4b456590e150a2d3e3d002 by worker instance-2:0

2020-05-19T07:41:11.394Z instance-2:0 DEBUG Current configuration (trains_agent v0.14.1, location: /tmp/.trains_agent.uzshvg_e.cfg):
----------------------
agent.worker_id = instance-2:0
agent.worker_name = instance-2
agent.python_binary = 
agent.package_manager.type = pip
agent.package_manager.pip_version = <21
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = defaults
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = pytorch
agent.venvs_dir = /home/user/.trains/venvs-builds
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /home/user/.trains/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /home/user/.trains/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.default_docker.image = nvidia/cuda
agent.git_user = 
agent.default_python = 3.7
agent.cuda_version = 0
agent.cudnn_version = 0
sdk.storage.cache.default_base_dir = ~/.trains/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff_on_train = true
sdk.development.support_stopping = true
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
Executing task id [6a5045e5c3b74afb892f85986b655218]:
****
entry_point = artifact_toy.py
working_dir = .
Warning: could not locate requested Python version 3.6, reverting to version 3.7
Using base prefix '/usr'
New python executable in /home/user/.trains/venvs-builds/3.7/bin/python3.7
Also creating executable in /home/user/.trains/venvs-builds/3.7/bin/python
Installing setuptools, pip, wheel...
done.
Using cached repository in "/home/user/.trains/vcs-cache/my-repo.git.32940bc4e1fe7ef7cdafd7e48f8cf5db/my-repo.git"

2020-05-19T07:41:16.437Z instance-2:0 DEBUG Note: checking out '7e3592afc8e0d3cd7b7c02a3672b558aed8c675c'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:
  git checkout -b <new-branch-name>
HEAD is now at 7e3592a add toy
type: git
url: https://github.com/H4dr1en/my-repo.git
branch: HEAD
commit: 7e3592afc8e0d3cd7b7c02a3672b558aed8c675c
root: /home/user/.trains/venvs-builds/3.7/task_repository/my-repo.git
Requirement already up-to-date: pip<21 in /home/user/.trains/venvs-builds/3.7/lib/python3.7/site-packages (20.1)
Collecting Cython
  Using cached Cython-0.29.18-cp37-cp37m-manylinux1_x86_64.whl (2.0 MB)
Installing collected packages: Cython
Successfully installed Cython-0.29.18
Requirement already satisfied: Cython==0.29.18 in /home/user/.trains/venvs-builds/3.7/lib/python3.7/site-packages (0.29.18)

2020-05-19T07:41:21.481Z instance-2:0 DEBUG Collecting numpy==1.16.2
  Using cached numpy-1.16.2-cp37-cp37m-manylinux1_x86_64.whl (17.3 MB)
Installing collected packages: numpy
Successfully installed numpy-1.16.2
Collecting attrs==19.3.0
  Using cached attrs-19.3.0-py2.py3-none-any.whl (39 kB)
Collecting boto3==1.12.39
  Using cached boto3-1.12.39-py2.py3-none-any.whl (128 kB)
Collecting botocore==1.15.49
  Using cached botocore-1.15.49-py2.py3-none-any.whl (6.2 MB)

2020-05-19T07:41:26.525Z instance-2:0 DEBUG Collecting certifi==2020.4.5.1
  Using cached certifi-2020.4.5.1-py2.py3-none-any.whl (157 kB)
Collecting chardet==3.0.4
  Using cached chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Requirement already satisfied: Cython==0.29.18 in /home/user/.trains/venvs-builds/3.7/lib/python3.7/site-packages (from -r /tmp/cached-reqsszq7heni.txt (line 6)) (0.29.18)
Collecting docutils==0.15.2
  Using cached docutils-0.15.2-py3-none-any.whl (547 kB)
Collecting funcsigs==1.0.2
  Using cached funcsigs-1.0.2-py2.py3-none-any.whl (17 kB)
Collecting furl==2.1.0
  Using cached furl-2.1.0-py2.py3-none-any.whl (20 kB)
Processing /home/user/.cache/pip/wheels/8b/99/a0/81daf51dcd359a9377b110a8a886b3895921802d2fc1b2397e/future-0.18.2-cp37-none-any.whl
Collecting humanfriendly==8.2
  Using cached humanfriendly-8.2-py2.py3-none-any.whl (86 kB)
Collecting idna==2.9
  Using cached idna-2.9-py2.py3-none-any.whl (58 kB)
Collecting importlib-metadata==1.6.0
  Using cached importlib_metadata-1.6.0-py2.py3-none-any.whl (30 kB)
Collecting jmespath==0.10.0
  Using cached jmespath-0.10.0-py2.py3-none-any.whl (24 kB)
Collecting jsonmodels==2.4
  Using cached jsonmodels-2.4-py2.py3-none-any.whl (20 kB)
Collecting jsonschema==3.2.0
  Using cached jsonschema-3.2.0-py2.py3-none-any.whl (56 kB)
Requirement already satisfied: numpy==1.16.2 in /home/user/.trains/venvs-builds/3.7/lib/python3.7/site-packages (from -r /tmp/cached-reqsszq7heni.txt (line 17)) (1.16.2)
Collecting orderedmultidict==1.0.1
  Using cached orderedmultidict-1.0.1-py2.py3-none-any.whl (11 kB)
Collecting pandas==1.0.3
  Using cached pandas-1.0.3-cp37-cp37m-manylinux1_x86_64.whl (10.0 MB)
Collecting pathlib2==2.3.5
  Using cached pathlib2-2.3.5-py2.py3-none-any.whl (18 kB)
Collecting Pillow==6.2.1
  Using cached Pillow-6.2.1-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
Collecting plotly==4.7.1
  Using cached plotly-4.7.1-py2.py3-none-any.whl (11.5 MB)
Processing /home/user/.cache/pip/wheels/b6/e7/50/aee9cc966163d74430f13f208171dee22f11efa4a4a826661c/psutil-5.7.0-cp37-cp37m-linux_x86_64.whl
Collecting PyJWT==1.7.1
  Using cached PyJWT-1.7.1-py2.py3-none-any.whl (18 kB)
Collecting pyparsing==2.4.7
  Using cached pyparsing-2.4.7-py2.py3-none-any.whl (67 kB)
Processing /home/user/.cache/pip/wheels/22/52/11/f0920f95c23ed7d2d0b05f2b7b2f4509e87a20cfe8ea43d987/pyrsistent-0.16.0-cp37-cp37m-linux_x86_64.whl
Collecting python-dateutil==2.8.1
  Using cached python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting pytz==2020.1
  Using cached pytz-2020.1-py2.py3-none-any.whl (510 kB)
Processing /home/user/.cache/pip/wheels/5e/03/1e/e1e954795d6f35dfc7b637fe2277bff021303bd9570ecea653/PyYAML-5.3.1-cp37-cp37m-linux_x86_64.whl
Collecting requests==2.23.0
  Using cached requests-2.23.0-py2.py3-none-any.whl (58 kB)
Collecting requests-file==1.5.1
  Using cached requests_file-1.5.1-py2.py3-none-any.whl (3.7 kB)
Processing /home/user/.cache/pip/wheels/d7/a9/33/acc7b709e2a35caa7d4cae442f6fe6fbf2c43f80823d46460c/retrying-1.3.3-cp37-none-any.whl
Collecting s3transfer==0.3.3
  Using cached s3transfer-0.3.3-py2.py3-none-any.whl (69 kB)
Collecting six==1.14.0
  Using cached six-1.14.0-py2.py3-none-any.whl (10 kB)
Collecting tqdm==4.46.0
  Using cached tqdm-4.46.0-py2.py3-none-any.whl (63 kB)
Collecting trains==0.14.3
  Using cached trains-0.14.3-py2.py3-none-any.whl (550 kB)
Collecting typing==3.7.4.1
  Using cached typing-3.7.4.1-py3-none-any.whl (25 kB)
Collecting urllib3==1.25.9
  Using cached urllib3-1.25.9-py2.py3-none-any.whl (126 kB)
Collecting zipp==3.1.0
  Using cached zipp-3.1.0-py3-none-any.whl (4.9 kB)
Requirement already satisfied: setuptools in /home/user/.trains/venvs-builds/3.7/lib/python3.7/site-packages (from jsonschema==3.2.0->-r /tmp/cached-reqsszq7heni.txt (line 16)) (46.4.0)

2020-05-19T07:41:31.569Z instance-2:0 DEBUG Installing collected packages: attrs, jmespath, six, python-dateutil, docutils, urllib3, botocore, s3transfer, boto3, certifi, chardet, funcsigs, orderedmultidict, furl, future, humanfriendly, idna, zipp, importlib-metadata, jsonmodels, pyrsistent, jsonschema, pytz, pandas, pathlib2, Pillow, retrying, plotly, psutil, PyJWT, pyparsing, PyYAML, requests, requests-file, tqdm, typing, trains

2020-05-19T07:41:46.654Z instance-2:0 DEBUG Successfully installed Pillow-6.2.1 PyJWT-1.7.1 PyYAML-5.3.1 attrs-19.3.0 boto3-1.12.39 botocore-1.15.49 certifi-2020.4.5.1 chardet-3.0.4 docutils-0.15.2 funcsigs-1.0.2 furl-2.1.0 future-0.18.2 humanfriendly-8.2 idna-2.9 importlib-metadata-1.6.0 jmespath-0.10.0 jsonmodels-2.4 jsonschema-3.2.0 orderedmultidict-1.0.1 pandas-1.0.3 pathlib2-2.3.5 plotly-4.7.1 psutil-5.7.0 pyparsing-2.4.7 pyrsistent-0.16.0 python-dateutil-2.8.1 pytz-2020.1 requests-2.23.0 requests-file-1.5.1 retrying-1.3.3 s3transfer-0.3.3 six-1.14.0 tqdm-4.46.0 trains-0.14.3 typing-3.7.4.1 urllib3-1.25.9 zipp-3.1.0
Running task id [6a5045e5c3b74afb892f85986b655218]:
[.]$ /home/user/.trains/venvs-builds/3.7/bin/python -u artifact_toy.py
Summary - installed python packages:
pip:
- attrs==19.3.0
- boto3==1.12.39
- botocore==1.15.49
- certifi==2020.4.5.1
- chardet==3.0.4
- Cython==0.29.18
- docutils==0.15.2
- funcsigs==1.0.2
- furl==2.1.0
- future==0.18.2
- humanfriendly==8.2
- idna==2.9
- importlib-metadata==1.6.0
- jmespath==0.10.0
- jsonmodels==2.4
- jsonschema==3.2.0
- numpy==1.16.2
- orderedmultidict==1.0.1
- pandas==1.0.3
- pathlib2==2.3.5
- Pillow==6.2.1
- plotly==4.7.1
- psutil==5.7.0
- PyJWT==1.7.1
- pyparsing==2.4.7
- pyrsistent==0.16.0
- python-dateutil==2.8.1
- pytz==2020.1
- PyYAML==5.3.1
- requests==2.23.0
- requests-file==1.5.1
- retrying==1.3.3
- s3transfer==0.3.3
- six==1.14.0
- tqdm==4.46.0
- trains==0.14.3
- typing==3.7.4.1
- urllib3==1.25.9
- zipp==3.1.0
Environment setup completed successfully
Starting Task Execution:
TRAINS results page: http:/a.b.c.d.e:8080/projects/fc7cc6cc167f4763ae35eb27e1bfff2b/experiments/6a5045e5c3b74afb892f85986b655218/output/log

2020-05-19T07:41:51.694Z instance-2:0 DEBUG         num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8
Done
[train]: shape=(4, 3), 4 unique rows, 100.0% uniqueness
TRAINS Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring

Parameters in clearml.conf not used during task run

Hi,

I started my agent using:
clearml-agent daemon --gpus 0 --queue gpu --docker --foreground, with the following parameters in clearml.conf.


    default_docker: {
        # default docker image to use when running in docker mode
        image: "dockerrepo/mydocker:custom"

        # optional arguments to pass to docker image
        # arguments: ["--ipc=host", ]
        arguments: ["--env GIT_SSL_NO_VERIFY=true",]
    }

Then this is shown while waiting for tasks.

Worker "master-node:gpu0" - Listening to queues:
+----------------------------------+------+-------+
| id                               | name | tags  |
+----------------------------------+------+-------+
| 943fce37803044ef89f6d9af0cd5279c | gpu  |       |
+----------------------------------+------+-------+

Running in Docker  mode (v19.03 and above) - using default docker image: dockerrepo/mydocker:custom running python3

So far so good, except that when a task is pulled, I get this as output. If you noticed, first the docker image is reverted to nvidia/cuda:10.1-runtime-ubuntu18.04, and there's no indication that the --env arg is passed on.

task 228caa5d25d94ac5aa10fa7e1d02f03c pulled from 943fce37803044ef89f6d9af0cd5279c by worker master-node:gpu0
Running task '228caa5d25d94ac5aa10fa7e1d02f03c'
Storing stdout and stderr log to '/tmp/.clearml_agent_out.xmqr15w5.txt', '/tmp/.clearml_agent_out.xmqr15w5.txt'
Running Task 228caa5d25d94ac5aa10fa7e1d02f03c inside docker: nvidia/cuda:10.1-runtime-ubuntu18.04
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '-e', 'CLEARML_WORKER_ID=master-node:gpu0', '-e', 'CLEARML_DOCKER_IMAGE=nvidia/cuda:10.1-runtime-ubuntu18.04', '-v', '/home/jax/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.txivbuei.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.04t66_qn:/root/.ssh', '-v', '/home/jax/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/home/jax/.clearml/pip-cache:/root/.cache/pip', '-v', '/home/jax/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/home/jax/.clearml/cache:/clearml_agent_cache', '-v', '/home/jax/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', 'nvidia/cuda:10.1-runtime-ubuntu18.04', 'bash', '-c', 'echo \'Binary::apt::APT::Keep-Downloaded-Packages "true";\' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring  --id 228caa5d25d94ac5aa10fa7e1d02f03c']

Agents disappear after machine restart

Hello,

Thanks for the awesome product and especially the trains-agent.

I have a question/issue regarding the persistency of the agents:

Background: I have an Ubuntu machine, running several agents that I created using the following command:

TRAINS_WORKER_ID=servername:gpu1c_only trains-agent daemon --detached --gpus 1 --create-queue --queue gpu1_only --docker nvcr.io/nvidia/pytorch:20.08-py3

If it make any difference, on a 2-GPU machine, I have several agents running on a single GPU with their own queue and another set of agents running on 2 GPUs, also with their own queue.

Issue: When I restart the machine, the agents disappear and I need to recreate them. The only one that survives the reset is the services agent. If it matters, the UI still shows the deleted queues in the enqueue menu. [ A bonus question: how can I clean the list? ]

Question: How to make these agents persistent?

Thanks in advance.

Model.get_local_copy() returns empty directory

Context

Assuming a first task logs a model, a second task (cloned from the first one) tries to access to it:

parent_task = Task.get_task(Task.current_task().parent)
local_model = parent_task.models['output'][-1].get_local_copy()
logger.report_text(str(local_model))

The two tasks ran on the same agent, one after the other (the first one cloned itself and enqueued the second one). The first task finishes successfully:

020-05-18 14:20:34,261 - trains.Task - INFO - Completed model upload to file:///root/.trains/venvs-builds/3.6/task_repository/my-repo.git/checkpoint/test_conversion_pipeline/test_2.83871052c05c4f0a853a3870dbac302d/models/best_checkpoint_accuracy=0.0000.pt
2020-05-18 14:20:34,560 - trains.Task - INFO - Waiting to finish uploads
2020-05-18 14:20:48,567 - trains.Task - INFO - Finished uploading

Note: The saving of the models is handled by pytorch-ignite using TrainsSaver, every epoch. The files exist locally.

Problem

The code above leads to the following warning when executed in the second task:

2020-05-18 15:04:32,203 - trains - WARNING - Exception [Errno 21] Is a directory: '/root/.trains/venvs-builds/3.6/task_repository/my-repo.git/checkpoint/test_conversion_pipeline/test_2.83871052c05c4f0a853a3870dbac302d/models/best_checkpoint_accuracy=0.0000.pt'
Failed extracting zip file /root/.trains/venvs-builds/3.6/task_repository/my-repo/checkpoint/test_conversion_pipeline/test_2.83871052c05c4f0a853a3870dbac302d/models/best_checkpoint_accuracy=0.0000.pt

Indeed the local_model is a path to an empty directory.

Config

agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = >\=20.1

trains-agent init asks for the web service URI but the prompt shows the API URI

With trains-agent==0.15.0rc0

When executing trains-agent init:

TRAINS-AGENT setup process

Please create new trains credentials through the profile page in your trains web app (e.g. https://demoapp.trains.allegro.ai/profile)
In the profile page, press "Create new credentials", then press "Copy to clipboard".

Paste copied configuration here: 

Could not parse credentials, please try entering them manually.
Enter user access key: *****
Enter user secret: *****

Editing configuration file: /Users/h4dr1en/trains.conf
Enter the url of the trains-server's Web service, for example: http://localhost:8080

API Host configured to: [] 

The user is asked to enter the URL of the trains-server's Web service, but the prompt message shows "API Host configured to", which is not the same port (typically 8008 instead of 8080). This can be confusing for the user. It works well when entering the API URI (I did not test entering the web server URI).

Ability to prioritise tasks on the Kubernetes Glue

Currently, all tasks pushed into the Kubernetes glue are executed in a uniform manner, meaning there is no way to prioritise different tasks at different rates.

python k8s_glue_example.py --queue singleQueue

Contrasting this with running clearml-agent,
High and Low priority Queues

clearml-agent daemon --queue highQueue lowQueue

Round Robin Priority Queues

clearml-agent daemon --queue firstQueue secondQueue --order-fairness

Support for venv

Hello,

First thanks for sharing trains with the world, it looks very promising.
I'm wondering why trains-agent uses virtualenv and not venv? I usually use the latter, and when I tried trains-agent I got the following error

/usr/bin/python3.6: No module named virtualenv

trains_agent: ERROR: Command '['python3.6', '-m', 'virtualenv', '/home/florian/.trains/venvs-builds/3.6']' returned non-zero exit status 1.

I didn't know it could be a python module (I even thought it was kind of deprecated), so I then installed it (apt-get install python3-virtualenv), but now even with the Python base version being 3.6 (agent.default_python = 3.6), it creates a python 2.7 virtualenv:

New python executable in /home/florian/.trains/venvs-builds/3.6/bin/python2
Also creating executable in /home/florian/.trains/venvs-builds/3.6/bin/python
Installing setuptools, pkg_resources, pip, wheel...done.
Running virtualenv with interpreter /usr/bin/python2

Maybe I'm doing something wrong?

OS: Ubuntu18.04 LTS

Multiple config files

Hi,
I have multiple projects, some with pip and some with conda as the package manager.
Is there a way to have multiple config files and define the relevant config for each queue?

Feature request: change ssh user on agent

I have a git repository whose SSH credentials do not begin with the usual ssh://git@.... Instead, it is ssh://root@git.....
It seems like the agent's behavior of cloning repositories using SSH credentials is hardcoded to use git@.....

More specifically, doing git clone ssh://[email protected]/repo/repo.git results in permission denied, whereas doing git clone ssh://[email protected]/repo/repo.git works just fine.

Can you please add a way to modify the SSH user for the agent?

Thanks.

Support --install-globally for daemon mode

Hi,

Would it be possible to use the --install-globally parameter for the trains-agent daemon mode?

Currently, this parameter is only available for the trains-agent build mode.

Splitting GPUs across multiple trains agents on the same machine

Hi,

I am trying to run two trains agents in daemon docker mode on a 4-GPU machine and split the GPU allocation, 2 for each.
I get the following error:
docker: Error response from daemon: cannot set both Count and DeviceIDs on device request.
The command I'm running is:
trains-agent daemon --queue <NAME> --docker "<ECR> -v /root:/root --ipc=host --privileged" --gpus 0,1

(I am using an internal docker image which is based on one of Nvidia's images)

I also tried using:

  • the CUDA_VISIBLE_DEVICES flag
  • -e NVIDIA_VISIBLE_DEVICES=0,1 in the docker cmd
  • not using the privileged flag

But I either get the same error or have all 4 GPUs allocated to the agent.

"it seems *api_server* is misconfigured" error

Hello, thanks for the millionth time for this great project. It literally saves me every day.

However, I am having a problem with trains-agent.

Setup:

  1. A Windows 10 local machine, connecting via SSH (with the relevant trains ports 8008, 8080, 8081 forwarded) to
  2. A remote Ubuntu machine, running trains-server and trains-agent (docker mode).
  3. On the remote machine, a docker container is running (with --net=host, to freely communicate with the server) with my code.

Scenarios:

  1. If I run my code manually via the terminal on the remote machine, it runs nicely; the trains server logs it and I see it on my local machine in the webapp (on localhost:8080).
  2. I run trains-agent (docker mode, with the same base image) on the same remote machine. I see the worker in the webapp, the worker pulls the cloned task (the same as in 1, cloned from the webapp), it starts running the docker base image I provided and installs the dependencies, and I can see all of this in the same webapp. Then it fails and I receive the following error:
    trains_agent: ERROR: Connection Error: it seems *api_server* is misconfigured. Is this the TRAINS API server http://localhost:8008 ?

Any idea what is going on?

Thanks,
Majd

Error when running experiments using NVIDIA pytorch docker image

Hi,
Recently we switched to running trains-agent tasks using a docker image as a base; this allowed us to overcome issue #1 regarding apex [which cannot be installed from PyPI].

We use NVIDIA's PyTorch container and apparently they compiled their own version of PyTorch using a version tag that is not included in PyPI [version torch 1.4.0a0+649135b where the latest released version of PyTorch is 1.3.2]

To avoid conflicts, and because we're using the docker image as a base now, we removed "torch" from our requirements.txt file. But, due to trains-agent's automatic package-requirement logging, the package is still added to the task's required packages. Then, when we try to run a cloned task, trains-agent tries to install the 'unique' versions of PyTorch and the task crashes/fails.

This is also happening in NVIDIA's version of torchvision [torchvision 0.5.0a0]

How do you suggest we tackle this issue?

Problems with Public key and ssh

Hi all,

I think I have a related problem, as it involves SSH and trains.
I have set up trains-agent on the local PC, and it is able to connect to the remote PC running trains.
The code runs well on the local machine (git clone and running the code both work), but when I "throw" the job via the trains-agent daemon to the trains server, I get the following error
(...)

[email protected]: Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.

(...)

I can only use SSH for cloning the repo from the company git repository. However, that requires adding the public key to the ssh-agent. The public key is stored in ~/.ssh/id_rsa.pub, and that same file is used when I run local git commands.

I've read in the instructions and in some of the closed issues that I should set the following in trains.conf:

agent {
    # Set GIT user/pass credentials (if user/pass are set, GIT protocol will be set to https)
    # leave blank for GIT SSH credentials (set force_git_ssh_protocol=true to force SSH protocol)
    # git_user=""
    # git_pass=""

    # Force GIT protocol to use SSH regardless of the git url (Assumes GIT user/pass are blank)
    force_git_ssh_protocol: true
}

I'm a bit confused about why there is an error. I thought the id_rsa.pub file is taken from the local PC where the trains-agent daemon is running, but maybe it's taking the one from the remote PC. I tried having the same id_rsa.pub file in ~/.ssh on both machines, but I get the same error, so I doubt that is the problem.
Maybe you know more?

I'm running the trains-agent like this:
trains-agent daemon --gpus 0,1 --queue default --git-user USERNAME --git-pass PASSWORD

Parameters sometimes are converted to strings, sometimes not

task.set_parameters(abc=123)  # wrote an int
print(type(task.get_parameter("General/abc")))  # str
print(type(task.get_parameter("General/badparam", 456)))  # int

This is completely counterintuitive. Whatever I write, I should read back with the same type; there should be no implicit casting to str.
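
Until the types round-trip, a minimal defensive sketch is to cast explicitly when reading parameters back, using the same Task API shown above (the project and task names below are only illustrative):

from clearml import Task  # `from trains import Task` on older installations

task = Task.init(project_name="examples", task_name="param-types")  # hypothetical names
task.set_parameters(abc=123)

# Values stored on the server come back as strings, so cast on read;
# only the default value keeps its original type.
abc = int(task.get_parameter("General/abc"))            # "123" -> 123
missing = task.get_parameter("General/badparam", 456)   # default returned as int
print(type(abc), type(missing))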

Conda environment installation skipping packages

The copied packages do not reflect the environment in which the copied experiment was run. I am working with tensorflow-gpu installed with conda, but trains does not detect tensorflow-gpu as a package. Below is the conda list output:

#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_tflow_select             2.1.0                       gpu  
absl-py                   0.9.0                    py36_0  
astor                     0.8.0                    py36_0  
attrs                     19.3.0                   pypi_0    pypi
blas                      1.0                         mkl  
c-ares                    1.15.0            h7b6447c_1001  
ca-certificates           2020.1.1                      0  
certifi                   2020.4.5.2               py36_0  
cloudpickle               1.4.1                    pypi_0    pypi
cudatoolkit               10.0.130                      0  
cudnn                     7.6.5                cuda10.0_0  
cupti                     10.0.130                      0  
cycler                    0.10.0                   pypi_0    pypi
decorator                 4.4.2                    pypi_0    pypi
efficientnet              1.0.0                    pypi_0    pypi
funcsigs                  1.0.2                    pypi_0    pypi
furl                      2.1.0                    pypi_0    pypi
future                    0.18.2                   pypi_0    pypi
gast                      0.3.3                      py_0  
grpcio                    1.27.2           py36hf8bcb03_0  
h5py                      2.10.0           py36h7918eee_0  
hdf5                      1.10.4               hb1b8bf9_0  
horovod                   0.19.4                   pypi_0    pypi
humanfriendly             8.2                      pypi_0    pypi
image-classifiers         1.0.0                    pypi_0    pypi
imageio                   2.8.0                    pypi_0    pypi
imgviz                    1.1.0                    pypi_0    pypi
intel-openmp              2020.1                      217  
joblib                    0.15.1                   pypi_0    pypi
jsonmodels                2.4                      pypi_0    pypi
jsonschema                3.2.0                    pypi_0    pypi
keras                     2.2.4                    pypi_0    pypi
keras-applications        1.0.8                      py_0  
keras-preprocessing       1.1.0                      py_1  
kito                      1.0.4                    pypi_0    pypi
kiwisolver                1.2.0                    pypi_0    pypi
labelme                   3.16.3                   pypi_0    pypi
ld_impl_linux-64          2.33.1               h53a641e_7  
libedit                   3.1.20181209         hc058e9b_0  
libffi                    3.3                  he6710b0_1  
libgcc-ng                 9.1.0                hdf63c60_0  
libgfortran-ng            7.3.0                hdf63c60_0  
libprotobuf               3.12.3               hd408876_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
markdown                  3.1.1                    py36_0  
matplotlib                3.2.1                    pypi_0    pypi
mkl                       2020.1                      217  
mkl-service               2.3.0            py36he904b0f_0  
mkl_fft                   1.0.15           py36ha843d7b_0  
mkl_random                1.1.1            py36h0573a6f_0  
mock                      4.0.2                      py_0  
ncurses                   6.2                  he6710b0_1  
networkx                  2.4                      pypi_0    pypi
numpy                     1.18.1           py36h4f9e942_0  
numpy-base                1.18.1           py36hde5b4d6_1  
nvidia-dali               0.21.0                   pypi_0    pypi
opencv-python             4.2.0.32                 pypi_0    pypi
openssl                   1.1.1g               h7b6447c_0  
orderedmultidict          1.0.1                    pypi_0    pypi
pathlib2                  2.3.5                    pypi_0    pypi
pillow                    7.1.2                    pypi_0    pypi
pip                       20.0.2                   py36_3  
plotly                    4.8.1                    pypi_0    pypi
protobuf                  3.12.3           py36he6710b0_0  
pyjwt                     1.7.1                    pypi_0    pypi
pyparsing                 2.4.7                    pypi_0    pypi
pyqt5                     5.15.0                   pypi_0    pypi
pyqt5-sip                 12.8.0                   pypi_0    pypi
pyrsistent                0.16.0                   pypi_0    pypi
python                    3.6.10               h7579374_2  
pywavelets                1.1.1                    pypi_0    pypi
qtpy                      1.9.0                    pypi_0    pypi
readline                  8.0                  h7b6447c_0  
requests-file             1.5.1                    pypi_0    pypi
retrying                  1.3.3                    pypi_0    pypi
scikit-image              0.17.2                   pypi_0    pypi
scikit-learn              0.21.0                   pypi_0    pypi
scipy                     1.4.1            py36h0b6359f_0  
segmentation-models       1.0.1                    pypi_0    pypi
setuptools                47.1.1                   py36_0  
six                       1.15.0                     py_0  
sqlite                    3.31.1               h62c20be_1  
tensorboard               1.13.1           py36hf484d3e_0  
tensorflow                1.13.1          gpu_py36h3991807_0  
tensorflow-base           1.13.1          gpu_py36h8d69cac_0  
tensorflow-estimator      1.13.0                     py_0  
tensorflow-gpu            1.13.1               h0d30ee6_0  
termcolor                 1.1.0                    py36_1  
tifffile                  2020.6.3                 pypi_0    pypi
tk                        8.6.8                hbc83047_0  
tqdm                      4.46.1                   pypi_0    pypi
trains                    0.15.0                   pypi_0    pypi
typing                    3.7.4.1                  pypi_0    pypi
werkzeug                  1.0.1                      py_0  
wheel                     0.34.2                   py36_0  
xz                        5.2.5                h7b6447c_0  
zlib                      1.2.11               h7b6447c_3  

However, in the dashboard, tensorflow-gpu does not show up in the installed packages:


# Python 3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21)  [GCC 7.3.0]

Keras == 2.2.4
Keras_Applications == 1.0.8
Pillow == 7.1.2
efficientnet == 1.0.0
h5py == 2.10.0
horovod == 0.19.4
image_classifiers == 1.0.0
imageio == 2.8.0
imgviz == 1.1.0
kito == 1.0.4
labelme == 3.16.3
matplotlib == 3.2.1
numpy == 1.18.4
nvidia_dali == 0.21.0
opencv_python == 4.2.0.32
scikit_image == 0.17.2
scikit_learn == 0.21.0
tensorflow == 1.13.1
tqdm == 4.46.1
trains == 0.15.0


It only captures tensorflow, which eventually trains on the CPU. This is also true for other conda packages. agent.package_manager.type is set to conda in the trains-agent config.

agent.package_manager.type = conda

Here is the part of the output where the agent tries to install the conda requirements. As you can see, many of the packages are missing.


Conda: Trying to install requirements:
['h5py~=2.10.0', 'Keras-Applications~=1.0.8', 'numpy~=1.18.4', 'tensorflow~=1.13.1', 'graphviz', 'python-graphviz', 'kiwisolver', 'cpuonly']
Executing Conda: /home/mert/anaconda3/condabin/conda env update -p /home/mert/.trains/venvs-builds-alsi/3.6 --file /tmp/conda_envbjjg4czh.yml --quiet --json

Thank you in advance for your help. Attached is a bash script to replicate the environment. I discarded the horovod part, as it requires OpenMPI, NCCL, etc.

condainstallscript.sh.txt

Two questions regarding deployment/setup

Hello,

  1. After setting up trains-agent, how should I make sure the agent has access to the latest code updates?

For example:

  1. I'm working on my dev machine, starting a new experiment

  2. I'm duplicating the experiment on the trains-server

  3. The agent grabs the experiment and starts doing its thing

  4. I'm making a code change on my dev machine, starting a new experiment

  5. I'm duplicating the experiment on the trains-server

  6. ---> what should happen next? <--- the agent has not yet been updated with the latest code change; should I update all of the agents manually first?

  2. Do I need to mount a different weights volume for each agent?

Thank you!
Shaked

Agent with specific GPU is spawning docker container with NVIDIA_VISIBLE_DEVICES=all

The agent was created with the command:
clearml-agent daemon --gpus 1 --queue default --docker

When a job is run by it, it executes the following command:
Executing: ['docker', 'run', '-t', '--gpus', '"device=1"', '--privileged', '-e', 'CLEARML_WORKER_ID=694d7fc1852a:gpu1', ..., 'bash', '-c', 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id a3857511e46b4063aba159f00fde9d4a']

I would expect it to run the command with NVIDIA_VISIBLE_DEVICES set to the correct value, or, even better, leave it as set by the docker runtime.

k8s glue wants to install torch 1.8.0 and fails

Hi, I am running the k8s glue in a disconnected on-prem environment. I have a private PyPI repo which is pointed to via /etc/pip.conf in the container images. When I submit a task with TensorFlow code, on an nvcr.io/cuda10.1-tensorflow image with a requirements.txt that has no reference to torch or torchvision, I encounter the following error:

clearml_agent: Warning: could not resolve python wheel replacement for torch==1.8.0
clearml_agent: ERROR: Exception when trying to resolve python wheel: Could not find pytorch wheel URL for: torch==1.8.0 with cuda 101 support.

The first question is: why is it installing torch in the first place?
The second question is: what is it trying to resolve against? Is it the PyTorch website? I'm in a disconnected environment.

Bug: clearml_agent: ERROR: Expecting value: line 2 column 1 (char 1)

When I set agent.package_manager.type to conda, I get the following error (seems to be a JSON parse error?) when submitting a task to a queue and running it on an agent:

Executing Conda: /home/tim/miniconda3/condabin/conda install -p /home/tim/.clearml/venvs-builds/3.7 -c defaults -c conda-forge -c pytorch 'pip<20.2' --quiet --json

clearml_agent: ERROR: Expecting value: line 2 column 1 (char 1)

DONE: Running task 'c8e18f605cea4610a89b51781481260e', exit status 1

When I set agent.package_manager.type to pip it runs fine!

Installed packages cloning issues

Hello,

I have the following situation where the cloning option installs the wrong version of a package (which eventually causes the experiment to fail, regardless of trains/trains-agent):

  • the code is running from the conda base environment

  • A requirements.txt file including torchvision as one of the packages (note, no version number). torchvision is just an example of a package.

  • A machine with torchvision (0.4.2) and Pillow (5.4.1) already installed. Note that Pillow is not listed in requirements.txt but is a dependency of torchvision.

  • When I run this as a new task, everything runs smoothly. Trains logs torchvision (0.4.2) under installed packages, but not Pillow.

  • However, when I clone it, trains-agent installs torchvision==0.4.2+cu100 from scratch, which depends on Pillow. As this is a fresh installation, it installs the latest Pillow 7.0 and ignores 5.4.1 (which, again, appears in pip list but not in trains' installed packages).

Am I missing something? Isn't that the entire point of trains-agent? And of course, how can I overcome this?

Thank you in advance!
Majd
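
A possible workaround sketch, assuming Task.add_requirements is available in your trains/clearml version: pin the indirect dependency explicitly before Task.init, so the cloned task records it and the agent installs the same version (the project and task names are illustrative only):

from clearml import Task  # `from trains import Task` on older installations

# Must be called before Task.init so the pinned versions end up in the
# task's installed-packages list.
Task.add_requirements("torchvision", "0.4.2")
Task.add_requirements("Pillow", "5.4.1")

task = Task.init(project_name="examples", task_name="pinned-deps")  # hypothetical names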

k8s glue container name merge issue and extra documentation for k8s glue integration

I have integrated clearml agent with our k8s cluster using the k8s glue.

As part of my work, I have created the following pod template:

apiVersion: v1
kind: Pod
metadata:
  name: template-name
spec:
  containers:
    - name: test
      env:
      - name: "GIT_SSH_COMMAND"
        value: "ssh -i /root/.ssh/id_rsa"
      - name: TEST_SHAKED
        value: "1"
      volumeMounts:
      - name: full-ro-rsa-sec
        mountPath: /root/.ssh/id_rsa
        subPath: id_rsa
        readOnly: true
  volumes:
    - name: full-ro-rsa-sec
      secret:
        secretName: full-ro-rsa
        defaultMode: 256
        items:
          - key: id_rsa
            path: id_rsa

After some digging, I figured out that I had to install the agent from the master branch, as stated in #51.

When I tried to run the agent, I encountered the following error:

Running kubectl encountered an error: error: error validating "/tmp/clearml_k8stmpl_8288bk09.yml": error validating data: ValidationError(Pod.spec.containers[0].name): invalid type for io.k8s.api.core.v1.Container.name: got "map", expected "string"; if you choose to ignore these errors, turn validation off with --validate=false

My assumption is that somewhere around https://github.com/allegroai/clearml-agent/blob/master/clearml_agent/glue/k8s.py#L444 something is not being merged correctly and instead of overriding the - name: test, it creates something like - name: { 0: "test", 1: "clearml-....." } or - name: ["test", "clearml-...."]


Regardless, following my conversation with @bmartinn, who has helped me a lot via Slack, I wanted to note some important things about working with the agent and k8s, especially because it wasn't clear to me how to inject my SSH key and git clone a private git repository.

clearml.conf

Make sure to set force_git_ssh_protocol: true

pod-template.yaml

You have to consider two things:

  1. You want to inject your SSH key in a secure way. I am working with Azure KeyVault, which injects a k8s secret using https://github.com/Azure/secrets-store-csi-driver-provider-azure, and then I just mount the secret to the pod with the right permissions, i.e. defaultMode: 256
  2. You have to ensure that the host of your repo won't stop you from cloning, e.g.:

Host key verification failed.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.

In order to make this work, you have to make sure git knows about your SSH key and that it doesn't require strict host key checking. This can be done by overriding the GIT_SSH_COMMAND environment variable:

apiVersion: v1
kind: Pod
metadata:
  name: template-name
spec:
  containers:
    - env:
      - name: TRAINS_CONFIG_FILE
        value: "/secrets/clearml.conf"
      - name: GIT_SSH_COMMAND
        value: "ssh -i /root/.ssh/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"
      volumeMounts:
      - name: full-ro-rsa-sec
        mountPath: /root/.ssh/id_rsa
        subPath: id_rsa
        readOnly: true
      - name: clearml-conf-sec
        mountPath: /secrets/clearml.conf
        subPath: clearml.conf
        readOnly: true
  volumes:
    - name: full-ro-rsa-sec
      secret:
        secretName: full-ro-rsa
        defaultMode: 256
        items:
          - key: id_rsa
            path: id_rsa
    - name: clearml-conf-sec
      secret:
        secretName: clearml-conf
        items:
          - key: clearml.conf
            path: clearml.conf

Dockerfile

You can use the following Dockerfile in order to create a small and simple agent:

FROM python:3.9-slim

RUN apt update && apt install -y \
    git \
    apt-transport-https gnupg2 curl
RUN curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
RUN install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
ENV CLEARML_CONFIG_FILE=/secrets/clearml.conf
ENV GIT_SSH_COMMAND="ssh -i /root/.ssh/id_rsa"
RUN python3 -m pip install git+https://github.com/allegroai/clearml-agent.git
ADD k8s_glue_example.py k8s_glue_example.py

ENTRYPOINT [ "python3", "k8s_glue_example.py" ]

As you can see, I'm also injecting clearml.conf through Azure KeyVault, since I prefer to manage my configuration that way. You can do it however you like, though.

k8s deployment

The only thing left is your deployment.yaml (or helm package). Once you have built the above docker image, you can run it and just pass the relevant arguments, for example:

          args: 
            - --queue
            - "shaked-test"
            - --template-yaml
            - /secrets/pod-template.yaml

As stated above, pod-template.yaml is also configuration from my perspective, and I just inject it from the outside so that I won't have to rebuild the image from scratch every time.

All the best,
Shaked

UPDATE:

I have added clearml.conf to the pod template volumes; without it you might end up with #55.

different indexes on pytorch dataloader with trains

Hi,

I'm launching training with trains, and when I execute different training runs I get different indexes from the PyTorch data loader (when I launch it locally, it produces the same indexes). Do you have a clue why this is happening?
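
One common cause of diverging indexes is that the RNGs driving the DataLoader's shuffling are seeded differently (or not at all) between the local run and the agent run. A minimal sketch of pinning the seeds before constructing the loader (the seed value and toy dataset are illustrative only):

import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)  # also seeds the default generator used for shuffling

dataset = TensorDataset(torch.arange(100).float())  # toy dataset for illustration
loader = DataLoader(dataset, batch_size=10, shuffle=True, num_workers=0)

for (batch,) in loader:
    pass  # index order is now reproducible across runs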

Attached are logs of the packages and related output:

2020-05-28T09:10:58.454Z trains-agent-lv-beast-keras-dl:gpu0,1 INFO task cdb4b8edc4344f078a66c915e0bd6e0f pulled from ca9acd7a32df49e5b4ea41760e252be7 by worker trains-agent-lv-beast-keras-dl:gpu0,1

2020-05-28T09:11:03.497Z trains-agent-lv-beast-keras-dl:gpu0,1 DEBUG Current configuration (trains_agent v0.14.0rc0, location: /tmp/.trains_agent.bpcikde2.cfg):

sdk.storage.cache.default_base_dir = ~/.trains/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key =
sdk.aws.s3.region =
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff_on_train = true
sdk.development.support_stopping = true
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
agent.worker_id = trains-agent-lv-beast-keras-dl:gpu0,1
agent.worker_name = trains-agent-lv-beast-keras-dl
agent.python_binary = /home/lv-beast/miniconda3/envs/keras-dl/bin/python
agent.package_manager.type = pip
agent.package_manager.pip_version = <20
agent.package_manager.system_site_packages = true
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = defaults
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = pytorch
agent.venvs_dir = /home/lv-beast/.trains/venvs-builds.2
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /home/lv-beast/.trains/vcs-cache.2
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /home/lv-beast/.trains/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /home/lv-beast/.trains/pip-cache
agent.docker_apt_cache = /home/lv-beast/.trains/apt-cache.2
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda
agent.git_user = tomer.amit
agent.default_python = 3.7
agent.cuda_version = 102
agent.cudnn_version = 0
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.api_server = http://localhost:8008
api.web_server = http://localhost:8080
api.files_server = http://localhost:8081
api.credentials.access_key = LYH4CB09H4VY49TCRT90
api.host = http://localhost:8008
Executing task id [cdb4b8edc4344f078a66c915e0bd6e0f]:
repository = https://bodyvisionmedical.visualstudio.com/LungVision/_git/MachineLearning
branch = develop
version_num = 6ae5d42da77f9fe01c7a09264137bdd147794ba4
tag =
entry_point = training_script.py
working_dir = LvObjects/Lv3D/TomoGan
Using base prefix '/home/lv-beast/miniconda3/envs/keras-dl'
New python executable in /home/lv-beast/.trains/venvs-builds.2/3.5/bin/python
Installing setuptools, pip, wheel...
done.
Using cached repository in "/home/lv-beast/.trains/vcs-cache.2/MachineLearning.2f27b16d384a01d04c7eedd7180ff087/MachineLearning"
Fetching submodule LungVision

2020-05-28T09:11:08.533Z trains-agent-lv-beast-keras-dl:gpu0,1 DEBUG Note: switching to '6ae5d42da77f9fe01c7a09264137bdd147794ba4'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:
git switch -c
Or undo this operation with:
git switch -
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at 6ae5d42 Merged PR 3169: add print to dataset index

2020-05-28T09:11:23.578Z trains-agent-lv-beast-keras-dl:gpu0,1 DEBUG url: https://bodyvisionmedical.visualstudio.com/LungVision/_git/MachineLearning
branch: HEAD
commit: 6ae5d42da77f9fe01c7a09264137bdd147794ba4
root: /home/lv-beast/.trains/venvs-builds.2/3.5/task_repository/MachineLearning
Applying uncommitted changes
Collecting pip<20
Using cached pip-19.3.1-py2.py3-none-any.whl (1.4 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 20.1.1
Uninstalling pip-20.1.1:
Successfully uninstalled pip-20.1.1

2020-05-28T09:11:28.601Z trains-agent-lv-beast-keras-dl:gpu0,1 DEBUG Collecting Cython
Using cached https://files.pythonhosted.org/packages/57/ac/b095febb27a241a3e09374cc85fb58cecb73439cab59949e0c55b67aac8d/Cython-0.29.19-cp35-cp35m-manylinux1_x86_64.whl
Installing collected packages: Cython
Successfully installed Cython-0.29.19
Running task id [cdb4b8edc4344f078a66c915e0bd6e0f]:
[LvObjects/Lv3D/TomoGan]$ /home/lv-beast/.trains/venvs-builds.2/3.5/bin/python -u training_script.py
Summary - installed python packages:
pip:

  • absl-py==0.9.0
  • albumentations==0.4.3
  • astor==0.8.1
  • astra-toolbox==1.9.0.dev0
  • attrs==19.3.0
  • bvframework==0.1.3
  • cachetools==3.1.1
  • certifi==2018.8.24
  • chardet==3.0.4
  • cloudpickle==1.2.2
  • colorama==0.4.3
  • colorlog==4.1.0
  • cycler==0.10.0
  • Cython==0.29.19
  • dask==2.6.0
  • decorator==4.4.1
  • dill==0.2.8.2
  • elasticdeform==0.4.6
  • enum34==1.1.6
  • funcsigs==1.0.2
  • furl==2.1.0
  • future==0.18.2
  • gast==0.2.2
  • google-api-python-client==1.7.11
  • google-auth==1.10.2
  • google-auth-httplib2==0.0.3
  • google-auth-oauthlib==0.4.1
  • google-pasta==0.1.8
  • grpcio==1.26.0
  • h5py==2.10.0
  • httplib2==0.11.3
  • humanfriendly==4.18
  • idna==2.8
  • imageio==2.3.0
  • imgaug==0.2.9
  • importlib-metadata==1.5.0
  • joblib==0.14.1
  • jsonmodels==2.4
  • jsonpickle==1.2
  • jsonschema==3.2.0
  • Keras-Applications==1.0.8
  • Keras-Preprocessing==1.1.0
  • kiwisolver==1.0.1
  • llvmlite==0.25.0
  • log-utils==0.3.4
  • Markdown==3.1.1
  • matplotlib==2.2.2
  • multiprocess==0.70.5
  • networkx==2.4
  • nose==1.3.7
  • numba==0.40.0
  • numpy==1.18.1
  • oauth2client==4.1.3
  • oauthlib==3.1.0
  • olefile==0.46
  • opt-einsum==3.1.0
  • orderedmultidict==1.0.1
  • pandas==0.23.4
  • pathlib2==2.3.5
  • pathos==0.2.1
  • pigar==0.9.2
  • Pillow==5.2.0
  • pkginfo==1.5.0.1
  • plotly==4.5.0
  • pox==0.2.7
  • ppft==1.6.4.7.1
  • protobuf==3.11.2
  • psutil==5.6.7
  • pyasn1==0.4.8
  • pyasn1-modules==0.2.7
  • pyclipper==1.0.6
  • PyJWT==1.7.1
  • pyparsing==2.4.6
  • pyrsistent==0.15.7
  • pyserialization==0.1.1
  • python-dateutil==2.8.1
  • pytz==2019.3
  • PyWavelets==1.0.1
  • PyYAML==3.12
  • requests==2.22.0
  • requests-file==1.4.3
  • requests-oauthlib==1.3.0
  • retrying==1.3.3
  • rsa==4.0
  • scikit-image==0.15.0
  • scikit-learn==0.20.0
  • scipy==1.1.0
  • semantic-version==2.8.2
  • Shapely==1.5.16
  • SimpleITK==0.10.0
  • simplejson==3.16.1
  • six==1.14.0
  • tabulate==0.8.6
  • tensorboard==2.1.0
  • tensorflow==2.1.0
  • tensorflow-estimator==2.1.0
  • termcolor==1.1.0
  • toolz==0.10.0
  • torch==1.4.0
  • torchvision==0.5.0
  • tornado==5.1.1
  • tqdm==4.42.0
  • trains==0.13.2
  • typing==3.7.4.1
  • uritemplate==3.0.1
  • urllib3==1.25.8
  • virtualenv==16.7.9
  • vtk==8.1.1
  • Werkzeug==0.16.1
  • wrapt==1.11.2
  • zipp==1.1.0

Horovod installation default settings causes environment problems

I am running experiments with horovod with OpenMPI. Horovod in my environment is built with the following command:

PATH=$PATH:$HOME/openmpi/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod

However, when trains-agent tries to create a copy of this environment, I assume it only runs
pip install horovod

I face the following issue in the pip-only environment:

import horovod.keras as hvd
File "/home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/horovod/keras/__init__.py", line 19, in
from horovod.tensorflow import init
File "/home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/horovod/tensorflow/__init__.py", line 28, in
from horovod.tensorflow.mpi_ops import allgather, broadcast, _allreduce
File "/home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/horovod/tensorflow/mpi_ops.py", line 49, in
MPI_LIB = _load_library('mpi_lib' + get_ext_suffix())
File "/home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/horovod/tensorflow/mpi_ops.py", line 45, in _load_library
library = load_library.load_op_library(filename)
File "/home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/mert/.trains/venvs-builds-alsi/3.6/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZNK10tensorflow8OpKernel4nameB5cxx11Ev

You can find this issue numerous times in horovod issues:
horovod/horovod#236
horovod/horovod#431
horovod/horovod#656

After I get this error, I go to the environment and check the horovod build using
horovodrun --check-build
and it is not built properly. If I run the following two lines I get a successful build:

pip uninstall horovod -y
PATH=$PATH:$HOME/openmpi/bin LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod

However, this is automatically deleted when trains-agent runs a new experiment.

Is there any solution for this like you did for --extra-index-url?
