
batch-shipyard's Introduction


Batch Shipyard

PROJECT STATUS

This toolkit is no longer actively maintained. The develop branch has proposed fixes for outstanding issues, but they will not be merged back to master. Please see the main Azure Batch GitHub repository for more information about Azure Batch.


Batch Shipyard is a tool to help provision, execute, and monitor container-based batch processing and HPC workloads on Azure Batch. Batch Shipyard supports both Docker and Singularity containers. No experience with the Azure Batch SDK is needed; run your containers with easy-to-understand configuration files. All Azure regions are supported, including non-public Azure regions.

Additionally, Batch Shipyard provides the ability to provision and manage entire standalone remote file systems (storage clusters) in Azure, independent of any integrated Azure Batch functionality.

Major Features

Container Runtime and Image Management

Data Management and Shared File Systems

Monitoring

Open Source Scheduler Integration

Azure Ecosystem Integration

Azure Batch Integration and Enhancements

  • Federation support: enables unified, constraint-based scheduling to collections of heterogeneous pools, including across multiple Batch accounts and Azure regions
  • Support for simple, scenario-based pool autoscale and autopool to dynamically scale and control computing resources on-demand
  • Support for Task Factories with the ability to generate tasks based on parametric (parameter) sweeps, randomized input, file enumeration, replication, and custom Python code-based generators (an illustrative configuration sketch follows this list)
  • Support for multi-instance tasks to accommodate MPI and multi-node cluster applications packaged as Docker or Singularity containers on compute pools with automatic job completion and task termination
  • Seamless, direct high-level configuration support for popular MPI runtimes including OpenMPI, MPICH, MVAPICH, and Intel MPI with automatic configuration for Infiniband, including SR-IOV RDMA VM sizes
  • Seamless integration with Azure Batch job, task and file concepts along with full pass-through of the Azure Batch API to containers executed on compute nodes
  • Support for Azure Batch task dependencies allowing complex processing pipelines and DAGs
  • Support for merge or final task specification that automatically depends on all other tasks within the job
  • Support for job schedules and recurrences for automatic execution of tasks at set intervals
  • Support for live job and job schedule migration between pools
  • Support for Low Priority Compute Nodes
  • Support for deploying Batch compute nodes into a specified Virtual Network and with pre-defined public IP addresses
  • Automatic setup of SSH or RDP users to all nodes in the compute pool and optional creation of SSH tunneling scripts to Docker Hosts on compute nodes
  • Support for custom host images including Shared Image Gallery
  • Support for Windows Containers on compliant Windows compute node pools with the ability to activate Azure Hybrid Use Benefit if applicable
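
For a flavor of the configuration-driven approach, a task factory parametric sweep might be declared roughly as follows in the jobs configuration. This is an illustrative sketch only: the property names follow the task factory feature described above but may differ between Batch Shipyard versions, and the image and command are placeholders.

{
    "job_specifications": [
        {
            "id": "sweepjob",
            "tasks": [
                {
                    "image": "busybox",
                    "task_factory": {
                        "parametric_sweep": {
                            "product": [
                                {"start": 0, "stop": 10, "step": 1}
                            ]
                        }
                    },
                    "command": "/bin/sh -c \"echo task {0}\""
                }
            ]
        }
    ]
}

Each generated task would receive one value from the sweep, substituted into the command via the {0} placeholder.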

Installation

Local Installation

Please see the installation guide for more information regarding the various local installation options and requirements.

Azure Cloud Shell

Batch Shipyard is integrated directly into Azure Cloud Shell and you can execute any Batch Shipyard workload using your web browser or the Microsoft Azure Android and iOS app.

Simply request a Cloud Shell session and type shipyard to invoke the CLI; no installation is required. Try Batch Shipyard now in your browser.

Documentation and Recipes

Please refer to the Batch Shipyard Documentation on Read the Docs.

Visit the Batch Shipyard Recipes section for various sample container workloads using Azure Batch and Batch Shipyard.

Batch Shipyard Compute Node Host OS Support

Batch Shipyard is currently compatible with popular Azure Batch supported Marketplace Linux VMs, compliant Linux custom images, and native Azure Batch Windows Server with Containers VMs. Please see the platform image support documentation for more information specific to Batch Shipyard support of compute node host operating systems.

Change Log

Please see the Change Log for project history.


Please see this project's Code of Conduct and Contributing guidelines.

batch-shipyard's Issues

Windows Server 2016 Support

Support Windows Server 2016, specifically the 2016-Datacenter-with-Containers sku.

  • Publisher/Offer/Sku allowable in convoy
  • Data Ingress changes
    • Disallow direct ingress
    • Disallow data ingress command to windows nodes
  • Disallow glusterfs
  • Disallow storage cluster mounts
  • wrap commands exit on first error
  • Node prep powershell script
    • Follow nativedocker, skip cascade
    • Download blobxfer binary
    • Pull windows cargo Docker image
    • Azure File Share Volume mount under pool admin user
  • Add RDP admin users to pool
    • Password support
      • Autogen support
    • pool aru
    • pool dru
    • Do not invoke SSH path
  • Blobxfer support
    • Blobxfer binary for Windows
    • Blobxfer transfer cmd script
  • Cargo Docker image for Windows
    • Shell scripts to cmd scripts
    • Add build to AppVeyor
  • data getfile
  • data getnodefile
  • Jobs config
    • Disallow auto-IB/GPU in jobs
    • Disallow user identities
  • Doc updates
    • Data movement guide Windows notes
    • Update usage guide with new commands
    • Update current limitations
    • FAQ
    • README (node support)
  • Add Windows Docker build for CLI in AppVeyor
  • Banner keywords

Future:

  • Enable Windows Server File Share support
  • Allow Samba mounts from storage clusters
  • pool listimages support (via task)
  • Credential Encryption support
  • Port some recipes for Windows containers

Deprecation Notice: [2.0.0] batch-shipyard:cascade-latest and batch-shipyard:tfm-latest docker images

Hello batch-shipyard users,

This is an announcement that the backend batch-shipyard:cascade-latest and batch-shipyard:tfm-latest docker images are being deprecated. Releases after 2.0.0 will no longer use these images on the backend, and the images will eventually be deleted from the public Docker Hub repository. Moving forward, Batch Shipyard will use versioned docker images on the backend to ensure that future changes do not break users on earlier versions. This change is transparent, but you should upgrade to the latest release. Please follow the upgrade instructions as found in this doc.

The batch-shipyard:cli-latest image will continue to be generated with each commit to master, in addition to tagged release versions.

NOTE: batch-shipyard:cascade-latest and batch-shipyard:tfm-latest (2.0.0) will be removed from the repository on or after January 31, 2017.

What is the best way to convert the job parameters for a deep learning run into the input parameters for a batch job?

We currently run multiple jobs via pre-configured Virtual Machines. The jobs are read off an Azure Storage Queue by a Python script and executed as per the instructions in the queue. If I were to extend this to Azure Batch, running the same job in a pre-configured docker job pool, what is the best way to pass such instructions? Is there a way to pass the job parameters directly via the queue?

Internal refactor to allow easier script integration

You can already import shipyard with some small code hacks to use the existing package in scripts directly. However, we should allow for something along the lines of import batch_shipyard to enable easier direct integration into scripts, with formal objects to be held by the caller. The existing CLI experience should not change.

  • tox and pytest setup
  • Object exposure
    • API
    • Models
    • Operations
  • Configuration to object conversion

Deprecation Notice: [1.1.0] batch-shipyard:latest docker image

Hello Batch Shipyard users,

This is an announcement that the batch-shipyard:latest docker image is being deprecated. Releases after 1.1.0 will no longer use this image on the backend, and the image will eventually be deleted from the public Docker Hub repository. Moving forward, the batch-shipyard repository will contain three different images that provide functionality for different parts of the system. These tags are:

  • cli-latest: Docker image containing the complete CLI functionality of shipyard.py.
  • cascade-{version}: Docker image containing some of the backend functionality of Batch Shipyard.
  • tfm-{version}: Docker image containing backend task file movement capability of Batch Shipyard.

Apologies for the inconvenience due to the changes. We strongly recommend upgrading your Batch Shipyard installation to the latest release with git pull, or downloading the latest release, in order to take advantage of the newest features and bugfixes before the image is removed.

NOTE: batch-shipyard:latest (1.1.0) will be removed from the repository on or after December 31, 2016.

Thanks!

Consistent Key Error for Storage Account

I have encountered this consistently after rebuilding 4 times; the key is not being recognized.

Command:
./batch-shipyard/shipyard.py pool add --credentials ./credentials.json --config ./config.json --pool ./pool.json

Error:
KeyError: '[storageaccountname]'

Notice: Azure Storage Data Movement Breaking Change

blobxfer, which powers the data movement engine between local machines and Azure Storage and between compute nodes and Azure Storage, is undergoing a breaking change to its CLI interface with the upcoming 1.0.0 release.

Any Batch Shipyard version prior to 2.5.3 will not be able to handle blobxfer 1.0.0 (when it is released) for data movement with Azure Storage. Please migrate your pools (by recreating them) to the newest version of Batch Shipyard to prevent a disruption to your jobs in Azure Storage data movement scenarios. If you require no downtime, you can create a new pool with the new version of Batch Shipyard and submit your work against the new pool while your old pool drains.

Thanks for your understanding in the matter. The improvements to blobxfer will percolate into Batch Shipyard after it is released.

Action items:

  • Update config templates (a rough sketch of the new form follows this list)
    • Allow multiple includes
    • Allow excludes
    • Egress allow remote_path and local_path
    • Ingress allow remote_path and local_path
  • data logic updates
  • task_factory:file config and logic updates
  • Update shell script
  • Update docs regarding old blobxfer limitations and example extra options
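
As a rough sketch of what the updated data egress configuration for a task could look like after these changes (property names are taken from the action items above and are illustrative, not a final schema; the image, command, and paths are placeholders):

{
    "image": "myimage",
    "command": "/bin/sh -c \"produce output\"",
    "output_data": {
        "azure_storage": [
            {
                "storage_account_settings": "mystorageaccount",
                "remote_path": "mycontainer/output",
                "local_path": "$AZ_BATCH_TASK_DIR/wd",
                "include": ["*.tgz"],
                "exclude": ["*.tmp"]
            }
        ]
    }
}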

Deprecation Notice: Removal of self-hosted private registry functionality

Hello Batch Shipyard users,

The feature enabling a self-hosted private registry with Docker images backed by Azure Storage blobs will be removed from Batch Shipyard in the next major release (3.0.0). Please migrate your images stored in this manner to an alternate private registry, such as the Azure Container Registry.

Action items:

  • Do not create/add to the prefixregistry table
  • Modify cascade to not read from prefixregistry table
  • Delete private registry setup py from cascade
  • Force delete prefixregistry table
  • Remove deprecation warning
  • Remove from config templates
  • Remove from docs
  • Remove mention in README

"The value provided for one of the properties in the request body is invalid." error when trying to create pool

I have a recurring job that runs every day, scheduled using an external scheduler. It worked just fine yesterday, but stopped working today. Tried kicking it off again a few times, with the same problem.

This error occurs when the job tries to create the pool (some names changed)

2017-03-11T00:18:06.4385430Z Unable to find image 'alfpark/batch-shipyard:cli-latest' locally
2017-03-11T00:18:07.5574940Z cli-latest: Pulling from alfpark/batch-shipyard
2017-03-11T00:18:07.5607490Z 6daefd62341a: Pulling fs layer
2017-03-11T00:18:07.5625380Z 2aa297eab108: Pulling fs layer
2017-03-11T00:18:08.1042870Z 6daefd62341a: Download complete
2017-03-11T00:18:08.3003220Z 6daefd62341a: Pull complete
2017-03-11T00:18:11.6265530Z 2aa297eab108: Verifying Checksum
2017-03-11T00:18:11.6282660Z 2aa297eab108: Download complete
2017-03-11T00:18:14.5793500Z 2aa297eab108: Pull complete
2017-03-11T00:18:14.6183050Z Digest: sha256:7286eeaf0d3cb776acef202d96d16a55987cf31154a03d12ee44bdc2df8c24e7
2017-03-11T00:18:14.6440140Z Status: Downloaded newer image for alfpark/batch-shipyard:cli-latest
2017-03-11T00:18:15.6050450Z 2017-03-11 00:18:15,603Z DEBUG convoy.keyvault:parse_secret_ids:248 fetching batch account key from keyvault
2017-03-11T00:18:16.0461670Z 2017-03-11 00:18:16,042Z DEBUG convoy.keyvault:parse_secret_ids:263 fetching storage account key for link data from keyvault
2017-03-11T00:18:16.3074910Z 2017-03-11 00:18:16,306Z DEBUG convoy.keyvault:parse_secret_ids:263 fetching storage account key for link batch from keyvault
2017-03-11T00:18:16.6545270Z 2017-03-11 00:18:16,653Z DEBUG convoy.keyvault:parse_secret_ids:278 fetching docker registry password for registry myregistry.azurecr.io from keyvault
2017-03-11T00:18:17.3565740Z 2017-03-11 00:18:17,355Z INFO convoy.storage:create_storage_containers:469 creating container: shipyardtor-mybatchaccount-mypool
2017-03-11T00:18:17.5785020Z 2017-03-11 00:18:17,575Z INFO convoy.storage:create_storage_containers:474 creating table: shipyardtorrentinfo
2017-03-11T00:18:17.7681260Z 2017-03-11 00:18:17,766Z INFO convoy.storage:create_storage_containers:477 creating queue: shipyardgr-mybatchaccount-mypool
2017-03-11T00:18:17.9617670Z 2017-03-11 00:18:17,958Z INFO convoy.storage:create_storage_containers:474 creating table: shipyardgr
2017-03-11T00:18:18.0079060Z 2017-03-11 00:18:18,006Z INFO convoy.storage:create_storage_containers:474 creating table: shipyardimages
2017-03-11T00:18:18.0591210Z 2017-03-11 00:18:18,057Z INFO convoy.storage:create_storage_containers:474 creating table: shipyardregistry
2017-03-11T00:18:18.1066530Z 2017-03-11 00:18:18,105Z INFO convoy.storage:create_storage_containers:474 creating table: shipyarddht
2017-03-11T00:18:18.1601350Z 2017-03-11 00:18:18,159Z INFO convoy.storage:create_storage_containers:469 creating container: shipyardrf-mybatchaccount-mypool
2017-03-11T00:18:18.2105060Z 2017-03-11 00:18:18,209Z INFO convoy.storage:_clear_blobs:384 deleting blobs: shipyardtor-mybatchaccount-mypool
2017-03-11T00:18:18.2860930Z 2017-03-11 00:18:18,285Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=mybatchaccount$mypool): shipyardtorrentinfo
2017-03-11T00:18:18.3684180Z 2017-03-11 00:18:18,367Z INFO convoy.storage:clear_storage_containers:452 clearing queue: shipyardgr-mybatchaccount-mypool
2017-03-11T00:18:18.4156300Z 2017-03-11 00:18:18,414Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=mybatchaccount$mypool): shipyardgr
2017-03-11T00:18:18.5136990Z 2017-03-11 00:18:18,512Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=mybatchaccount$mypool): shipyardimages
2017-03-11T00:18:18.5619460Z 2017-03-11 00:18:18,560Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=mybatchaccount$mypool): shipyardperf
2017-03-11T00:18:18.6100800Z 2017-03-11 00:18:18,608Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=mybatchaccount$mypool): shipyardregistry
2017-03-11T00:18:18.7062410Z 2017-03-11 00:18:18,705Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=mybatchaccount$mypool): shipyarddht
2017-03-11T00:18:18.8991810Z 2017-03-11 00:18:18,897Z INFO convoy.storage:_clear_blobs:384 deleting blobs: shipyardrf-mybatchaccount-mypool
2017-03-11T00:18:19.2330150Z 2017-03-11 00:18:19,231Z WARNING convoy.fleet:_adjust_settings_for_pool_creation:1006 forcing shipyard docker image to be used due to VM config, publisher=openlogic offer=centos sku=7.2
2017-03-11T00:18:19.2809030Z 2017-03-11 00:18:19,279Z INFO convoy.storage:_add_global_resource:255 adding global resource: docker:myimage
2017-03-11T00:18:19.7223520Z 2017-03-11 00:18:19,721Z INFO convoy.storage:upload_resource_files:338 uploading file /opt/batch-shipyard/scripts/shipyard_nodeprep.sh as 'shipyard_nodeprep.sh'
2017-03-11T00:18:19.9671350Z 2017-03-11 00:18:19,965Z INFO convoy.storage:upload_resource_files:338 uploading file /opt/batch-shipyard/scripts/docker_jp_block.sh as 'docker_jp_block.sh'
2017-03-11T00:18:20.2068330Z 2017-03-11 00:18:20,205Z INFO convoy.storage:upload_resource_files:338 uploading file /opt/batch-shipyard/scripts/shipyard_blobxfer.sh as 'shipyard_blobxfer.sh'
2017-03-11T00:18:20.4706070Z 2017-03-11 00:18:20,469Z INFO convoy.batch:create_pool:361 Attempting to create pool: mypool
2017-03-11T00:18:20.5340780Z Traceback (most recent call last):
2017-03-11T00:18:20.5356860Z   File "/opt/batch-shipyard/shipyard.py", line 941, in <module>
2017-03-11T00:18:20.5371900Z     cli()
2017-03-11T00:18:20.5385490Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 716, in __call__
2017-03-11T00:18:20.5398770Z     return self.main(*args, **kwargs)
2017-03-11T00:18:20.5412370Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 696, in main
2017-03-11T00:18:20.5428400Z     rv = self.invoke(ctx)
2017-03-11T00:18:20.5441240Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
2017-03-11T00:18:20.5454590Z     return _process_result(sub_ctx.command.invoke(sub_ctx))
2017-03-11T00:18:20.5467890Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
2017-03-11T00:18:20.5481400Z     return _process_result(sub_ctx.command.invoke(sub_ctx))
2017-03-11T00:18:20.5495210Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 889, in invoke
2017-03-11T00:18:20.5508400Z     return ctx.invoke(self.callback, **ctx.params)
2017-03-11T00:18:20.5521630Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 534, in invoke
2017-03-11T00:18:20.5535560Z     return callback(*args, **kwargs)
2017-03-11T00:18:20.5548800Z   File "/usr/lib/python3.5/site-packages/click/decorators.py", line 64, in new_func
2017-03-11T00:18:20.5561900Z     return ctx.invoke(f, obj, *args[1:], **kwargs)
2017-03-11T00:18:20.5575330Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 534, in invoke
2017-03-11T00:18:20.5588100Z     return callback(*args, **kwargs)
2017-03-11T00:18:20.5601120Z   File "/opt/batch-shipyard/shipyard.py", line 607, in pool_add
2017-03-11T00:18:20.5614220Z     ctx.table_client, ctx.config)
2017-03-11T00:18:20.5627110Z   File "/opt/batch-shipyard/convoy/fleet.py", line 1200, in action_pool_add
2017-03-11T00:18:20.5639870Z     _add_pool(batch_client, blob_client, config)
2017-03-11T00:18:20.5653520Z   File "/opt/batch-shipyard/convoy/fleet.py", line 640, in _add_pool
2017-03-11T00:18:20.5666650Z     nodes = batch.create_pool(batch_client, config, pool)
2017-03-11T00:18:20.5679700Z   File "/opt/batch-shipyard/convoy/batch.py", line 365, in create_pool
2017-03-11T00:18:20.5692760Z     batch_client.pool.add(pool)
2017-03-11T00:18:20.5705710Z   File "/usr/lib/python3.5/site-packages/azure/batch/operations/pool_operations.py", line 291, in add
2017-03-11T00:18:20.5718720Z     raise models.BatchErrorException(self._deserialize, response)
2017-03-11T00:18:20.5732820Z azure.batch.models.batch_error.BatchErrorException: {'value': 'The value provided for one of the properties in the request body is invalid.\nRequestId:79b2f4d8-8b47-49fd-a85e-206574727170\nTime:2017-03-11T00:18:20.6184337Z', 'lang': 'en-US'}

Any idea why this might be happening? I haven't touched my pool.json since yesterday.

My pool.json:

{
    "pool_specification": {
        "id": "mypool",
        "vm_size": "STANDARD_A2_V2",
        "vm_count": 10,
        "max_tasks_per_node": 2,

        "publisher": "OpenLogic",
        "offer": "CentOS",
        "sku": "7.2",

        "reboot_on_start_task_failed": true,
        "block_until_all_global_resources_loaded": true
    }
}

UserSubscription Batch Account Support

Allow UserSubscription batch accounts.

  • AAD auth for Batch account
    • Doc TFM with AAD is not supported yet
    • Add code in data movement to check/block?
  • Allow VNet Id in pool (see the sketch after this list)
    • Create VNet/subnet if not found option?
  • Remove 40 VM limit for inter node comm enabled pools and UserSubscription batch accounts
    • Update current limitations doc
  • Add to limitations doc that custom images are not supported (yet)
  • Link to docs on how to create a user subscription batch account
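
For example, the pool specification might reference an existing VNet along these lines (an illustrative sketch only; the property name and shape are assumptions, not a finalized schema, and the resource id is a placeholder):

{
    "pool_specification": {
        "id": "mypool",
        "virtual_network": {
            "id": "/subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>"
        }
    }
}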

Support for recurring jobs and accessing credentials from containers

Is there a recommended way to use batch-shipyard with the Azure Batch job scheduler? I'd like to be able to schedule Docker jobs to run at a recurring interval, and I couldn't find documentation on this topic.

Also, I was wondering if there was a way to expose credentials to the running Docker container without saving them as variables in jobs.json? I'd like to provide some credentials for external services to the container, but I don't want to check these into source control. Using KeyVault from inside the container would also work, but I don't think those env vars are passed to the pool.

(p.s. is this the right place to ask questions about batch-shipyard usage, or is there a more appropriate forum for this?)

Support recurring jobs

Support JobSchedules and recurrences. See issue #15 for more details.

  • Json schema for job schedule/recurrence (sketched after this list)
  • Generic job manager
    • Docker image
    • Spec transfer
    • Spec reader/submitter
    • Optional task monitor
  • jobs list should include both regular jobs and schedules
  • jobs del/jobs term/jobs disable/jobs enable/jobs migrate should detect schedules
    • --jobscheduleid flag to explicitly act on schedule rather than job
    • disambiguate if --all flag exists
  • Update configuration doc
  • Update usage doc
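
A job recurrence might be expressed along these lines in the jobs configuration (a hypothetical sketch; the recurrence property names here are assumptions based on Azure Batch job schedule concepts, not a finalized schema):

{
    "job_specifications": [
        {
            "id": "recurringjob",
            "recurrence": {
                "schedule": {
                    "recurrence_interval": "01:00:00"
                }
            }
        }
    ]
}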

Occasional 1-2 node startup tasks failing with no error message

Hey, I've started using the latest develop branch (4eea944), and now every time I create a pool with 10 nodes, I inevitably get one or two nodes that fail to start properly. I hadn't had any fail to start before. The only messages I get from the startup task are:

in startup/stdout.txt:

/dev/sdb1 on /mnt type fuseblk (rw,relatime,user_id=0,group_id=0,allow_other,blksize=4096)
/dev/sdb1 temp disk is mounted as fuseblk/ntfs

startup/stderr.txt is empty.

My pool.json looks like:

{
    "pool_specification": {
        "id": "mypool",
        "vm_size": "STANDARD_D11_V2",
        "vm_count": 10,
        "max_tasks_per_node": 2,

        "publisher": "Canonical",
        "offer": "UbuntuServer",
        "sku": "16.04.0-LTS",

        "ssh": {
            "username": "hi"
        },

        "reboot_on_start_task_failed": true,
        "block_until_all_global_resources_loaded": true
    }
}

With 5 nodes, I haven't run into any issues. Any idea what might be causing this?

Support Autopools

Allow jobs to be executed without an active pool, but instead with an autopool link. See #19 for the initial discussion.

  • Cascade will require an auto pool env var to strip off the trailing GUID
  • Jobs specification auto_pool property (sketched after this list)
    • pool_lifetime
    • keep_alive
  • Further refactor pool add so the creation of the pool add param and RF upload is independent of the actual create pool call
  • Logic change in jobs add
    • Autopool property will invoke part of pool add to get pool add param and RF upload
  • Warn with pool options with autopool
    • GlusterFS on compute
    • Auto ssh user
    • Local data movement to pool-level
  • Provide storage cleanup options
    • Add --poolid option to both storage clear and storage del
  • Config guide update
    • Add note about orphaned storage data with autopools
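
A job using an autopool might then be specified roughly as follows (an illustrative sketch; property names mirror the checklist above and may not match the final schema):

{
    "job_specifications": [
        {
            "id": "myautopooljob",
            "auto_pool": {
                "pool_lifetime": "job",
                "keep_alive": false
            }
        }
    ]
}

Here pool_lifetime would bind the autopool's lifetime to the job, and keep_alive would control whether the pool survives job completion.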

Allow job-level override to run missing pre-loaded docker images

  • Job level json property: allow_run_on_missing_image (see the sketch after this list)
  • Modify JP to not run jp block script if above property is true
  • Prepend private registry to image name in tasks under job
  • Update docs
    • Add note that passthrough on missing config only applies to config.json images
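
A minimal sketch of the proposed job-level override (the property name is taken from the first item above; its placement and the surrounding values are assumptions and placeholders):

{
    "job_specifications": [
        {
            "id": "myjob",
            "allow_run_on_missing_image": true,
            "tasks": [
                {
                    "image": "myregistry.azurecr.io/myimage:latest",
                    "command": "/bin/sh -c \"echo hello\""
                }
            ]
        }
    ]
}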

Documentation

I am staring at the TensorFlow recipes, looking desperately for how to kick off my first load. I have 150 GB sitting in Data Lake. Do I provision Batch? Do I set up VMs? What, step by step, do I need to do to use random images in Data Lake with some random label file or set of label files?

How to add tasks?

Hi,

I'm new to both Docker and Azure Batch, but I feel it is what I need.

My use case:
I need to process batches of thousands of images. My input for each task is an xml file and 5 images. Then a Linux executable wrapped in a Python script processes this input and produces 3 images and an xml file as output. I wrapped the code in a Docker image that processes one set of images. Both the input and output files are in Azure blob storage. Also this process is part of a bigger automated pipeline and I need to monitor when the batch is done.

I used batch-shipyard to create a pool. Now my question is, how do I create the job and its tasks (from code)? Am I supposed to generate a jobs.json with thousands of tasks? Or is there another way? Can I use the Azure Batch API as well?

Thanks in advance for clarifying this.

Possible scenarios for using pool auto-resize?

I noticed in the "current issues" documentation that you recommend using batch-shipyard to resize pools.  I looked at the resizing code in fleet.py and batch.py, and I was wondering if there were scenarios where I could possibly use pool auto-resizing? e.g. if I didn't care about SSHing into nodes, and I didn't use GlusterFS.

Unused reference to the azure-mgmt package in convoy/keyvault.py?

I just tried running the Docker container version of the CLI, and I ran into this error:

> docker run --rm -it alfpark/batch-shipyard:cli-latest
Traceback (most recent call last):
  File "/opt/batch-shipyard/shipyard.py", line 42, in <module>
    import convoy.fleet
  File "/opt/batch-shipyard/convoy/fleet.py", line 52, in <module>
    from . import keyvault
  File "/opt/batch-shipyard/convoy/keyvault.py", line 40, in <module>
    import azure.mgmt.resource.resources
ImportError: No module named 'azure.mgmt'

Seems like this import in convoy/keyvault.py is unused? I'll try removing that line and building the container on my end, and see if it breaks or not. Edit: seems like it works.

Broken pipes in Blobxfer / requests when outputting from many nodes concurrently

When I'm running a large parallel job with hundreds of simultaneous tasks, I'm running into an issue with blobxfer and requests failing to output data to blob storage.

I originally started the parallel job with 10 nodes, 7 tasks per node, for a total of 70 simultaneous tasks, and my tasks were completing correctly and uploading their results to blob storage as expected. Since each job is independent and I wanted to speed this job up, I then used pool resize to scale up the pool from 10 nodes to 50 nodes.

After the resize was complete, all of the jobs started failing in the output step when blobxfer is attempting to upload each task's result to blob storage. The error reports multiple broken pipes (assuming each one is a retry attempt) in requests. Unfortunately, I forgot to grab an error log before shutting down the cluster, but if it occurs again I'll post it here.

I'm assuming this is because I'm trying to shove too many simultaneous uploads into Blob storage at once? Is there an inherent limit to the number of simultaneous blob storage uploads? I wasn't able to find a good metric for this in the Azure documentation.

If task name is too long, fails to add task

I have a job with >1000 tasks. If the task id is left null in jobs.json, shipyard fails when adding task number 1001 (automatically named "dockertask-1000"). The relevant message is: "The specified task already exists."
I suspect the task name becomes too long, the last digit is silently dropped, and thus the task ends up with the exact same name as dockertask-100.

Indeed, assigning (short) custom names solves the problem.
If my interpretation is correct, I suggest adding a run-time check on the validity of task names.

2017-01-24 23:01:51,152Z INFO convoy.batch:add_jobs:1894 Adding task: dockertask-1000
Traceback (most recent call last):
  File "/home/adotti/Work/Azure/batch-shipyard/shipyard.py", line 921, in <module>
    cli()
  File "/home/adotti/.local/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/adotti/.local/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/adotti/.local/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/adotti/.local/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/adotti/.local/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/adotti/.local/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/adotti/.local/lib/python2.7/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args[1:], **kwargs)
  File "/home/adotti/.local/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/adotti/Work/Azure/batch-shipyard/shipyard.py", line 741, in jobs_add
    recreate, tail)
  File "/home/adotti/Work/Azure/batch-shipyard/convoy/fleet.py", line 1453, in action_jobs_add
    _BLOBXFER_FILE, recreate, tail)
  File "/home/adotti/Work/Azure/batch-shipyard/convoy/batch.py", line 1895, in add_jobs
    batch_client.task.add(job_id=job.id, task=batchtask)
  File "/home/adotti/.local/lib/python2.7/site-packages/azure/batch/operations/task_operations.py", line 107, in add
    raise models.BatchErrorException(self._deserialize, response)
azure.batch.models.batch_error.BatchErrorException: {'lang': u'en-US', 'value': u'The specified task already exists.\nRequestId:789cbe9f-3006-4325-9ec2-6c094cb64808\nTime:2017-01-25T07:01:50.9582196Z'}

Support pool autoscale

Allow for pool autoscale formulas. See issue #25.

Enabling changes:

  • Need to redesign how docker images are pulled since the queue message time-to-live limit is 7 days
  • RF SASes should have sufficiently large se parameter

Autoscale changes:

  • Add json properties for pool (a sketch follows this list)
    • autoscale
      • evaluation_interval
      • formula
      • scenario
        • name
        • maximum_vm_count
        • node_deallocation_option
        • sample_lookback_interval
        • required_sample_percentage
        • bias_last_sample
        • bias_node_type
        • rebalance_preemption_percentage
  • Add autoscale scenarios
    • active_tasks
    • pending_tasks
    • workday
    • workday_with_offpeak_max_low_priority
    • weekday
    • weekend
  • Modify Pool add parameter
  • Add autoscale subcommand
    • pool autoscale evaluate
    • pool autoscale enable
    • pool autoscale disable
    • pool autoscale lastexec
    • Prevent enable command from running on any pool with metadata version < 2.9.0

Other changes:

  • Block pool creation for the following
    • Autoscale and GlusterFS on compute
    • Autoscale and peer-to-peer
  • Emit warning if pool ssh user is detected with autoscale
  • Update docs
  • Add autoscale guide
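
Putting the properties from the checklist above together, a scenario-based autoscale pool might look roughly like this (an illustrative sketch constructed from the listed property names; values and the exact shape are assumptions):

{
    "pool_specification": {
        "id": "myautoscalepool",
        "vm_size": "STANDARD_D2_V2",
        "autoscale": {
            "evaluation_interval": "00:05:00",
            "scenario": {
                "name": "active_tasks",
                "maximum_vm_count": 16,
                "node_deallocation_option": "taskcompletion",
                "sample_lookback_interval": "00:10:00",
                "required_sample_percentage": 70
            }
        }
    }
}

Alternatively, a raw formula property could be supplied instead of a named scenario.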

Azure KeyVault support for credentials.json

Support AAD to KeyVault access for credential secrets.

  • Entire credentials.json (zlib+base64) as secret
    • Store CLI option
    • Delete CLI option
  • Individual secrets support per key with KeyVault Secret Id
  • CLI options with envvar support
    • AAD Service Principal (Tenant/Directory ID, Client/Application ID, Secret)
    • AAD Service Principal Certificate auth
    • AAD User and Password
    • List all secret ids in uri as convenience command
  • AAD/Keyvault creds in credentials.json support
  • Guides/Docs
    • New Batch Shipyard and Azure KeyVault guide
    • Usage doc (keyvault command)
    • Configuration doc (*_keyvault_secret_id properties)

KeyError: account_key on cli-latest

I'm running into this blocking issue running the cli-latest docker container.

Traceback:

2017-03-18T00:05:26.9865110Z Digest: sha256:31fc61d165291b4c5a186ceb74827226397e32c04d8df349bb197ef67e5ccd4e
2017-03-18T00:05:27.0034980Z Status: Downloaded newer image for alfpark/batch-shipyard:cli-latest
2017-03-18T00:05:27.9223690Z Traceback (most recent call last):
2017-03-18T00:05:27.9238610Z   File "/opt/batch-shipyard/shipyard.py", line 1452, in <module>
2017-03-18T00:05:27.9254630Z     cli()
2017-03-18T00:05:27.9268580Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 722, in __call__
2017-03-18T00:05:27.9280400Z     return self.main(*args, **kwargs)
2017-03-18T00:05:27.9292230Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 697, in main
2017-03-18T00:05:27.9303840Z     rv = self.invoke(ctx)
2017-03-18T00:05:27.9316070Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
2017-03-18T00:05:27.9328400Z     return _process_result(sub_ctx.command.invoke(sub_ctx))
2017-03-18T00:05:27.9340310Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
2017-03-18T00:05:27.9352410Z     return _process_result(sub_ctx.command.invoke(sub_ctx))
2017-03-18T00:05:27.9364450Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 895, in invoke
2017-03-18T00:05:27.9376980Z     return ctx.invoke(self.callback, **ctx.params)
2017-03-18T00:05:27.9388590Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 535, in invoke
2017-03-18T00:05:27.9400430Z     return callback(*args, **kwargs)
2017-03-18T00:05:27.9412260Z   File "/usr/lib/python3.5/site-packages/click/decorators.py", line 64, in new_func
2017-03-18T00:05:27.9424690Z     return ctx.invoke(f, obj, *args[1:], **kwargs)
2017-03-18T00:05:27.9436310Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 535, in invoke
2017-03-18T00:05:27.9448360Z     return callback(*args, **kwargs)
2017-03-18T00:05:27.9461250Z   File "/opt/batch-shipyard/shipyard.py", line 1038, in pool_add
2017-03-18T00:05:27.9473730Z     ctx.initialize_for_batch()
2017-03-18T00:05:27.9486240Z   File "/opt/batch-shipyard/shipyard.py", line 124, in initialize_for_batch
2017-03-18T00:05:27.9498770Z     skip_global_config=False, skip_pool_config=False, fs_storage=False)
2017-03-18T00:05:27.9511520Z   File "/opt/batch-shipyard/shipyard.py", line 321, in _init_config
2017-03-18T00:05:27.9524240Z     convoy.fleet.populate_global_settings(self.config, fs_storage)
2017-03-18T00:05:27.9537360Z   File "/opt/batch-shipyard/convoy/fleet.py", line 209, in populate_global_settings
2017-03-18T00:05:27.9550490Z     sc = settings.credentials_storage(config, bs.storage_account_settings)
2017-03-18T00:05:27.9563550Z   File "/opt/batch-shipyard/convoy/settings.py", line 837, in credentials_storage
2017-03-18T00:05:27.9576550Z     account_key=conf['account_key'],
2017-03-18T00:05:27.9589290Z KeyError: 'account_key'

This started happening today with my scheduled task just a few hours ago, and doesn't seem to be transient (I reran the scheduled task and it failed the second time).

Yesterday's run worked fine, and cli 2.5.4 seems to work fine as well. I haven't tried 2.6.0b1 yet. If you'd like, I can try that as well.

My credentials.json:

{
    "credentials": {
        "batch": {
            "account": "mybatchaccount",
            "account_key_keyvault_secret_id": "https://myvault.vault.azure.net/secrets/batch",
            "account_service_url": "https://mybatchaccount.westus.batch.azure.com"
        },
        "storage": {
            "batch": {
                "account": "mybatchstorage",
                "account_key_keyvault_secret_id": "https://myvault.vault.azure.net/secrets/storage-batch"
            },
            "data": {
                "account": "storage",
                "account_key_keyvault_secret_id": "https://myvault.vault.azure.net/secrets/storage"
            }
        },
        "docker_registry": {
            "myregistry-on.azurecr.io": {
                "username": "myuser",
                "password_keyvault_secret_id": "https://myvault.vault.azure.net/secrets/acr-myregistry-reader"
            }
        }
    }
}

Exception in task_file_mover when ingressing files from other batch tasks

I've set up a job which contains several fetch tasks, and a single processing task that depends on the fetch tasks. For convenience, I tried using the Azure Batch input_data type in the processing task to get all the data from the preceding fetch tasks, but I'm running into this exception with task_file_mover.

Traceback (most recent call last):
  File "task_file_mover.py", line 148, in <module>
    main()
  File "task_file_mover.py", line 123, in main
    batch_client = _create_credentials()
  File "task_file_mover.py", line 60, in _create_credentials
    ba, url, bakey = os.environ['SHIPYARD_BATCH_ENV'].split(';')
ValueError: not enough values to unpack (expected 3, got 2)

I'm using KeyVault for supplying the batch credentials, like:

{
    "credentials": {
        "batch": {
            "account": "myaccount",
            "account_key_keyvault_secret_id": "https://myvault.vault.azure.net/secrets/batchkey",
            "account_service_url": "https://myaccount.westus.batch.azure.com"
        }
    }
}

Make the default configdir './'

Passing '--configdir' is often redundant, and makes the commands bulky. This would allow for very clean looking shell scripts:

cd config
shipyard pool add
shipyard pool asu
shipyard pool ssh

As an example. I can submit a PR if that looks like a good idea.

Will shared data volumes defined in config.json be shared across pools?

From the documentation at https://github.com/Azure/batch-shipyard/blob/master/docs/10-batch-shipyard-configuration.md#global-config, it is not clear if the "shared_data_volumes", say e.g. GlusterFS setup, will be available across different pools created with the same config.json.

I haven't tried it out yet, but it seems that the volumes would only be available within a pool, in which case shouldn't the configuration exist in pool.json?

Safe to use pool autoscaling with Batch Shipyard?

I've been setting pool autoscaling on pools created using Batch Shipyard pool add, and I've been testing it out for a while now with no issues. I just wanted to confirm that Batch Shipyard is safe to use with autoscaling, and whether there are any considerations I should keep in mind? I'm not running multi-instance jobs.

max_task_retry_count support?

Do you have plans to add support for setting the max_task_retry_count property for tasks in jobs? Some of my tasks may experience transient failures, and being able to retry right inside Azure Batch would be super convenient.

The Azure Batch SDK for python seems to support it at the job and task level:

https://github.com/Azure/azure-sdk-for-python/blob/f07caf6d435bd49cbfc654c77d20a2fc3f8357c5/azure-batch/azure/batch/models/task_constraints.py

If you'd like, I can also try my hand at making a PR for this feature, where I'd add a max_task_retry_count property to both job and task definitions in jobs.json. I just wanted to check if you're already working on this so I don't step on any toes.
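
For illustration, the proposal might look something like this in jobs.json (hypothetical; this property does not exist in Batch Shipyard at the time of writing, and the image and command are placeholders):

{
    "job_specifications": [
        {
            "id": "myjob",
            "max_task_retry_count": 3,
            "tasks": [
                {
                    "image": "myimage",
                    "max_task_retry_count": 1,
                    "command": "/bin/sh -c \"may-fail-transiently\""
                }
            ]
        }
    ]
}

In this proposal, the task-level value would override the job-level default.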

Thanks!

Default `docker run --shm-size=64MB` inadequate for some Intel MPI jobs

When using the combined shm:dapl Intel MPI fabric, the /dev/shm device is exposed to the Docker container from the host and is then used for intra-node MPI communication. Unfortunately, the default size of /dev/shm is restricted to 64 MB, which is inadequate. The result is MPI applications that crash at random points.

Fix:

In jobs.json set the additional_docker_run_options to be --shm-size=256m (or as appropriate).

https://github.com/Azure/batch-shipyard/blob/master/config_templates/jobs.json
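
A minimal jobs.json fragment applying the workaround might look like this (a sketch only; additional_docker_run_options is assumed here to take a list of strings, and the image and command are placeholders):

{
    "job_specifications": [
        {
            "id": "mpijob",
            "tasks": [
                {
                    "image": "myorg/my-intel-mpi-app",
                    "additional_docker_run_options": [
                        "--shm-size=256m"
                    ],
                    "command": "/opt/run_mpi.sh"
                }
            ]
        }
    ]
}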

This is more of a suggestion/warning than a bug report.

Joint work with @chrisrichardson.

Batch account_key needs to be defined even if account_key_keyvault_secret_id is set

I'm trying to use credentials.json with KeyVault to just store secret IDs, and provide the KeyVault parameters through environment vars.

When using credentials.json this way, it appears that credentials.batch.account_key needs to exist even if credentials.batch.account_key_keyvault_secret_id is set. If it doesn't exist, then shipyard assumes that the credentials.json is bad and tries to retrieve the entire credentials.json from KeyVault, which fails if the credentials secret id is not set. Since I'm using a credentials.json in my repository, rather than a KeyVault-stored credentials.json, this throws many exceptions.

convoy.keyvault:fetch_credentials_json:140 fetching credentials json from keyvault
Traceback (most recent call last):
  File "/mnt/batch-shipyard/shipyard.py", line 183, in _init_config
    convoy.settings.credentials_batch(self.config)
  File "/mnt/batch-shipyard/convoy/settings.py", line 592, in credentials_batch
    account_key=conf['account_key'],
KeyError: 'account_key'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.5/site-packages/azure/keyvault/key_vault_id.py", line 33, in _validate_string_argument
    prop = prop.strip()
AttributeError: 'NoneType' object has no attribute 'strip'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/batch-shipyard/shipyard.py", line 919, in <module>
    cli()
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/user/.local/lib/python3.5/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args[1:], **kwargs)
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/mnt/batch-shipyard/shipyard.py", line 535, in keyvault_list
    ctx.initialize(creds_only=True, no_config=True)
  File "/mnt/batch-shipyard/shipyard.py", line 93, in initialize
    self._init_config(creds_only)
  File "/mnt/batch-shipyard/shipyard.py", line 192, in _init_config
    self.keyvault_credentials_secret_id)
  File "/mnt/batch-shipyard/convoy/fleet.py", line 265, in fetch_credentials_json_from_keyvault
    keyvault_client, keyvault_uri, keyvault_credentials_secret_id)
  File "/mnt/batch-shipyard/convoy/keyvault.py", line 141, in fetch_credentials_json
    cred = client.get_secret(keyvault_credentials_secret_id)
  File "/home/user/.local/lib/python3.5/site-packages/azure/keyvault/key_vault_client.py", line 135, in get_secret
    sid = parse_secret_id(secret_identifer)
  File "/home/user/.local/lib/python3.5/site-packages/azure/keyvault/key_vault_id.py", line 142, in parse_secret_id
    return parse_object_id('secrets', id)
  File "/home/user/.local/lib/python3.5/site-packages/azure/keyvault/key_vault_id.py", line 79, in parse_object_id
    id = _validate_string_argument(id, 'id')
  File "/home/user/.local/lib/python3.5/site-packages/azure/keyvault/key_vault_id.py", line 36, in _validate_string_argument
    raise TypeError("argument '{}' must by of type string".format(name))
TypeError: argument 'id' must by of type string

If I add a blank account_key property, then everything seems to work fine:

{
    "credentials": {
        "batch": {
            "account": "mybatchaccount",
            "account_key": "",
            "account_key_keyvault_secret_id": "https://myvault.vault.azure.net/secrets/batchkey",
            "account_service_url": "https://mybatchaccount.region.batch.azure.com"
        }
    }
}

Is this intended behavior? The documentation seems to suggest that, if I define a particular *_keyvault_secret_id property, then I can omit the * property itself. The other storage and docker registry *_keyvault_secret_id properties seem to read fine without needing to define their corresponding *s.

Thanks!

Run container as specific uid/gid

This can already be accomplished with additional docker run options.

However, native support in the job/task config should be present to remap to the azbatch user if wanted, or any other uid/gid that is present, which will help with storage cluster integration.

  • Create job/task config option for remap
  • Auto remap host passwd/group/sudoers to container as ro
  • Doc updates

Nodes and jobs getting stuck in weird states - disk full?

I've been encountering issues with batch tasks and nodes getting stuck in weird states several hours after starting. Running short jobs works fine, but if I run full job workloads (each job downloads and processes ~60 GB of data), nodes start getting stuck in "Waiting for start task" or "Idle", and tasks start getting stuck in "Running" or "Preparing" with no way for me to see the files for the specific task. I'm not sure if this is a batch-shipyard issue or an issue with Azure Batch itself. Could it have anything to do with running out of space on the node?

If it is a space issue, since I have remove_container_after_exit set to true, will this remove the data in $AZ_BATCH_TASK_WORKING_DIR? If not, is there a recommended way of removing this data? Since I'm running data fetch tasks, I have to egress most of this data to blob storage at the end of the task, so I can't remove the data before blobxfer runs. 

Missing argument in NAMD config file

I wanted to test your NAMD container with DC/OS and run a simple job. It turned out that /sw/NAMD_2.11_Linux-x86_64-TCP/apoa1/apoa1.namd.template is missing the outputname parameter, which prevents namd2 from running. When fixed, namd2 apoa1.namd.template works.

I don't know if this is relevant, but maybe others will be interested in how to run a simple job within the container. Could you add a section to the documentation on how to run a 'hello world' example with the apoa1 molecule?

Remote FS Cluster Support

Add support for creating a standalone filesystem with attached data disks on a VNet. Linking remote fs clusters created by Batch Shipyard to compute pools will only be supported with UserSubscription accounts.

  • Add fs.json file (a rough sketch follows this checklist)
    • Resource group
    • Location
    • Managed disks (k:v)
      • Resource group
      • Location/region
      • disks (array)
        • Name
        • Disk size
        • Storage account type (allow premium)
    • Storage cluster
      • Name (for internal linkage only)
      • VNet: ID, subnet info
      • File Server
        • Type: nfs and glusterfs
        • Mount point
      • VM Count
      • VM Size
      • Static IP enabled?
      • Network security rules
      • Node to disk map
        • RAID 0 flag for multiple disks?
        • Format as
  • Add fs command:
    • disks
      • add
      • del
      • list
    • cluster
      • add
      • del
      • expand
      • resize
      • ssh
      • status
      • start
      • suspend
  • Allow remotefs to be linked to config.json shared data volumes
    • mount option exposure?
  • Pool configuration changes
    • VNet Id
  • RemoteFS guide
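
A rough sketch of what such an fs.json could look like, with property names simply mirroring the checklist above (illustrative only, not the final schema; all values are placeholders):

{
    "remote_fs": {
        "resource_group": "my-storage-cluster-rg",
        "location": "eastus",
        "managed_disks": {
            "resource_group": "my-storage-cluster-rg",
            "location": "eastus",
            "disks": [
                {
                    "name": "disk0",
                    "disk_size_gb": 128,
                    "storage_account_type": "premium"
                }
            ]
        },
        "storage_cluster": {
            "name": "mynfs",
            "virtual_network": {
                "id": "myvnet",
                "subnet": "my-subnet"
            },
            "file_server": {
                "type": "nfs",
                "mount_point": "/data"
            },
            "vm_count": 1,
            "vm_size": "STANDARD_DS3_V2",
            "static_ip": true
        }
    }
}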

question - Infiniband job isolation

Hi All,

Is it possible to use shipyard to run multiple jobs in the same pool and have each job isolated from the others in terms of networking and storage?

For example, can I map in a specific data area for each job that is not accessible to other jobs running in the same pool and potentially on the same node?

Also, I note that enabling infiniband forces docker to use the host's networking stack. Does this mean that all containers can communicate with each other even if they are part of another job but in the same pool?

Thanks,
h

Tasks get stuck (in the transfer of output via blobxfer)

Hello, I have some jobs with many tasks attached (up to 10k). I expect the jobs to finish on my test pool with 20 cores in about seven days.
Everything runs smoothly for a few days, but then some jobs simply get stuck in what seems to be the transfer of the generated output to my output storage account.
The end of the stdout.txt file for the stuck jobs reads as follows. I checked and the rest of the output is correct; in particular, the file to be copied is present:

=====================================
 azure blobxfer parameters [v0.12.1]
=====================================
             platform: Linux-4.4.0-47-generic-x86_64-with
   python interpreter: CPython 3.5.2
     package versions: az.common=1.1.4 az.sml=0.20.5 az.stor=0.33.0 crypt=1.6 req=2.12.3
      subscription id: None
      management cert: None
   transfer direction: local->Azure
       local resource: .
      include pattern: *.tgz
      remote resource: None
   max num of workers: 12
              timeout: None
      storage account: geant4data
              use SAS: True
  upload as page blob: False
  auto vhd->page blob: False
 upload to file share: False
 container/share name: XXXXXXXXXXXXXXXXXX
  container/share URI: XXXXXXXXXXXXXXXXXX 
    compute block MD5: False
     compute file MD5: True
    skip on MD5 match: True
   chunk size (bytes): 4194304
     create container: False
  keep mismatched MD5: False
     recursive if dir: True
component strip on up: 1
        remote delete: False
           collate to: disabled
      local overwrite: True
      encryption mode: disabled
         RSA key file: disabled
         RSA key type: disabled
=======================================

script start time: 2017-02-28 13:46:22

The output file is truncated at this point and has not been updated for several hours. The Azure portal reports the task in the "preparing" state.
Any idea?

Exception when providing job environment variables without task environment variables

If my jobs.json file contains an environment_variables property in a job specification object, but no task-level property, I get the following exception:

Traceback (most recent call last):
  File "/mnt/batch-shipyard/shipyard.py", line 919, in <module>
    cli()
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/user/.local/lib/python3.5/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args[1:], **kwargs)
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/mnt/batch-shipyard/shipyard.py", line 739, in jobs_add
    ctx.batch_client, ctx.blob_client, ctx.config, recreate, tail)
  File "/mnt/batch-shipyard/convoy/fleet.py", line 1451, in action_jobs_add
    recreate, tail)
  File "/mnt/batch-shipyard/convoy/batch.py", line 1742, in add_jobs
    job_env_vars, task.environment_variables)
  File "/mnt/batch-shipyard/convoy/util.py", line 199, in merge_dict
    raise ValueError('dict1 or dict2 is not a dictionary')
ValueError: dict1 or dict2 is not a dictionary

Seems like https://github.com/Azure/batch-shipyard/blob/master/convoy/batch.py#L1738 checks for the task-but-no-job case, but not the job-but-no-task case.
