
batch-shipyard's Introduction


Batch Shipyard

PROJECT STATUS

This toolkit is no longer actively maintained. The develop branch has proposed fixes for outstanding issues, but they will not be merged back to master. Please see the main Azure Batch GitHub repository for more information about Azure Batch.


Batch Shipyard is a tool to help provision, execute, and monitor container-based batch processing and HPC workloads on Azure Batch. Batch Shipyard supports both Docker and Singularity containers. No experience with the Azure Batch SDK is needed; run your containers with easy-to-understand configuration files. All Azure regions are supported, including non-public Azure regions.

Additionally, Batch Shipyard provides the ability to provision and manage entire standalone remote file systems (storage clusters) in Azure, independent of any integrated Azure Batch functionality.

Major Features

Container Runtime and Image Management

Data Management and Shared File Systems

Monitoring

Open Source Scheduler Integration

Azure Ecosystem Integration

Azure Batch Integration and Enhancements

  • Federation support: enables unified, constraint-based scheduling to collections of heterogeneous pools, including across multiple Batch accounts and Azure regions
  • Support for simple, scenario-based pool autoscale and autopool to dynamically scale and control computing resources on-demand
  • Support for Task Factories with the ability to generate tasks based on parametric (parameter) sweeps, randomized input, file enumeration, replication, and custom Python code-based generators (an illustrative configuration sketch follows this list)
  • Support for multi-instance tasks to accommodate MPI and multi-node cluster applications packaged as Docker or Singularity containers on compute pools with automatic job completion and task termination
  • Seamless, direct high-level configuration support for popular MPI runtimes including OpenMPI, MPICH, MVAPICH, and Intel MPI with automatic configuration for Infiniband, including SR-IOV RDMA VM sizes
  • Seamless integration with Azure Batch job, task and file concepts along with full pass-through of the Azure Batch API to containers executed on compute nodes
  • Support for Azure Batch task dependencies allowing complex processing pipelines and DAGs
  • Support for merge or final task specification that automatically depends on all other tasks within the job
  • Support for job schedules and recurrences for automatic execution of tasks at set intervals
  • Support for live job and job schedule migration between pools
  • Support for Low Priority Compute Nodes
  • Support for deploying Batch compute nodes into a specified Virtual Network and with pre-defined public IP addresses
  • Automatic setup of SSH or RDP users to all nodes in the compute pool and optional creation of SSH tunneling scripts to Docker Hosts on compute nodes
  • Support for custom host images including Shared Image Gallery
  • Support for Windows Containers on compliant Windows compute node pools with the ability to activate Azure Hybrid Use Benefit if applicable
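
For a flavor of the configuration-driven approach, a task factory parametric sweep might be declared roughly as follows in the jobs configuration. This is an illustrative sketch only: the property names follow the task factory feature described above but may differ between Batch Shipyard versions, and the image and command are placeholders.

{
    "job_specifications": [
        {
            "id": "sweepjob",
            "tasks": [
                {
                    "image": "busybox",
                    "task_factory": {
                        "parametric_sweep": {
                            "product": [
                                {"start": 0, "stop": 10, "step": 1}
                            ]
                        }
                    },
                    "command": "/bin/sh -c \"echo task {0}\""
                }
            ]
        }
    ]
}

Each generated task would receive one value from the sweep, substituted into the command via the {0} placeholder.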

Installation

Local Installation

Please see the installation guide for more information regarding the various local installation options and requirements.

Azure Cloud Shell

Batch Shipyard is integrated directly into Azure Cloud Shell and you can execute any Batch Shipyard workload using your web browser or the Microsoft Azure Android and iOS app.

Simply request a Cloud Shell session and type shipyard to invoke the CLI; no installation is required. Try Batch Shipyard now in your browser.

Documentation and Recipes

Please refer to the Batch Shipyard Documentation on Read the Docs.

Visit the Batch Shipyard Recipes section for various sample container workloads using Azure Batch and Batch Shipyard.

Batch Shipyard Compute Node Host OS Support

Batch Shipyard is currently compatible with popular Azure Batch supported Marketplace Linux VMs, compliant Linux custom images, and native Azure Batch Windows Server with Containers VMs. Please see the platform image support documentation for more information specific to Batch Shipyard support of compute node host operating systems.

Change Log

Please see the Change Log for project history.


Please see this project's Code of Conduct and Contributing guidelines.

batch-shipyard's Issues

Windows Server 2016 Support

Support Windows Server 2016, specifically the 2016-Datacenter-with-Containers sku.

  • Publisher/Offer/Sku allowable in convoy
  • Data Ingress changes
    • Disallow direct ingress
    • Disallow data ingress command to windows nodes
  • Disallow glusterfs
  • Disallow storage cluster mounts
  • wrap commands exit on first error
  • Node prep powershell script
    • Follow nativedocker, skip cascade
    • Download blobxfer binary
    • Pull windows cargo Docker image
    • Azure File Share Volume mount under pool admin user
  • Add RDP admin users to pool
    • Password support
      • Autogen support
    • pool aru
    • pool dru
    • Do not invoke SSH path
  • Blobxfer support
    • Blobxfer binary for Windows
    • Blobxfer transfer cmd script
  • Cargo Docker image for Windows
    • Shell scripts to cmd scripts
    • Add build to AppVeyor
  • data getfile
  • data getnodefile
  • Jobs config
    • Disallow auto-IB/GPU in jobs
    • Disallow user identities
  • Doc updates
    • Data movement guide Windows notes
    • Update usage guide with new commands
    • Update current limitations
    • FAQ
    • README (node support)
  • Add Windows Docker build for CLI in AppVeyor
  • Banner keywords

Future:

  • Enable Windows Server File Share support
  • Allow Samba mounts from storage clusters
  • pool listimages support (via task)
  • Credential Encryption support
  • Port some recipes for Windows containers

Deprecation Notice: [2.0.0] batch-shipyard:cascade-latest and batch-shipyard:tfm-latest docker images

Hello batch-shipyard users,

This is an announcement that the backend batch-shipyard:cascade-latest and batch-shipyard:tfm-latest docker images are being deprecated. Releases after 2.0.0 will no longer use these images on the backend, and the images will eventually be deleted from the public Docker Hub repository. Moving forward, Batch Shipyard will use versioned docker images on the backend to ensure that future changes do not break users on earlier versions. This change is transparent, but you should upgrade to the latest release. Please follow the upgrade instructions as found in this doc.

The batch-shipyard:cli-latest image will continue to be generated with each commit to master, in addition to tagged release versions.

NOTE: batch-shipyard:cascade-latest and batch-shipyard:tfm-latest (2.0.0) will be removed from the repository on or after January 31, 2017.

What is the best way to convert the job parameters for a deep learning run into the input parameters for a batch job?

We currently run multiple jobs via pre-configured Virtual Machines. The jobs are read off an Azure Storage Queue by a Python script and executed as per the instructions in the queue. If I were to extend this to Azure Batch, running the same job in a pre-configured docker job pool, what is the best way to pass such instructions? Is there a way to pass the job parameters directly via the queue?

Internal refactor to allow easier script integration

You can already import shipyard with some small code hacks to use the existing package in scripts directly. However, we should allow for something along the lines of import batch_shipyard to enable easier direct integration into scripts, with formal objects to be held by the caller. The existing CLI experience should not change.

  • tox and pytest setup
  • Object exposure
    • API
    • Models
    • Operations
  • Configuration to object conversion

Deprecation Notice: [1.1.0] batch-shipyard:latest docker image

Hello Batch Shipyard users,

This is an announcement that the batch-shipyard:latest docker image is being deprecated. Releases after 1.1.0 will no longer use this image on the backend, and the image will eventually be deleted from the public Docker Hub repository. Moving forward, the batch-shipyard repository will contain three different images that provide functionality for different parts of the system. These tags are:

  • cli-latest: Docker image containing the complete CLI functionality of shipyard.py.
  • cascade-{version}: Docker image containing some of the backend functionality of Batch Shipyard.
  • tfm-{version}: Docker image containing backend task file movement capability of Batch Shipyard.

Apologies for the inconvenience due to the changes. We strongly recommend upgrading your Batch Shipyard installation to the latest release with git pull, or downloading the latest release, in order to take advantage of the newest features and bugfixes before the image is removed.

NOTE: batch-shipyard:latest (1.1.0) will be removed from the repository on or after December 31, 2016.

Thanks!

Consistent Key Error for Storage Account

I have encountered this consistently after rebuilding 4 times; the key is not being recognized.

Command:
./batch-shipyard/shipyard.py pool add --credentials ./credentials.json --config ./config.json --pool ./pool.json

Error:
KeyError: '[storageaccountname]'

Notice: Azure Storage Data Movement Breaking Change

blobxfer, which powers the data movement engine between local machines and Azure Storage and between compute nodes and Azure Storage, is undergoing a breaking change to its CLI interface with the upcoming 1.0.0 release.

Any Batch Shipyard version prior to 2.5.3 will not be able to handle blobxfer 1.0.0 (when it is released) for data movement with Azure Storage. Please migrate your pools (by recreating them) to the newest version of Batch Shipyard to prevent a disruption to your jobs in Azure Storage data movement scenarios. If you require no downtime, you can create a new pool with the new version of Batch Shipyard and submit your work against the new pool while your old pool drains.

Thanks for your understanding in the matter. The improvements to blobxfer will percolate into Batch Shipyard after it is released.

Action items:

  • Update config templates (a rough sketch of the new form follows this list)
    • Allow multiple includes
    • Allow excludes
    • Egress allow remote_path and local_path
    • Ingress allow remote_path and local_path
  • data logic updates
  • task_factory:file config and logic updates
  • Update shell script
  • Update docs regarding old blobxfer limitations and example extra options
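
As a rough sketch of what the updated data egress configuration for a task could look like after these changes (property names are taken from the action items above and are illustrative, not a final schema; the image, command, and paths are placeholders):

{
    "image": "myimage",
    "command": "/bin/sh -c \"produce output\"",
    "output_data": {
        "azure_storage": [
            {
                "storage_account_settings": "mystorageaccount",
                "remote_path": "mycontainer/output",
                "local_path": "$AZ_BATCH_TASK_DIR/wd",
                "include": ["*.tgz"],
                "exclude": ["*.tmp"]
            }
        ]
    }
}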

Deprecation Notice: Removal of self-hosted private registry functionality

Hello Batch Shipyard users,

The feature enabling a self-hosted private registry with Docker images backed by Azure Storage blobs will be removed from Batch Shipyard in the next major release (3.0.0). Please migrate your images stored in this manner to an alternate private registry, such as the Azure Container Registry.

Action items:

  • Do not create/add to the prefixregistry table
  • Modify cascade to not read from prefixregistry table
  • Delete private registry setup py from cascade
  • Force delete prefixregistry table
  • Remove deprecation warning
  • Remove from config templates
  • Remove from docs
  • Remove mention in README

"The value provided for one of the properties in the request body is invalid." error when trying to create pool

I have a recurring job that runs every day, scheduled using an external scheduler. It worked just fine yesterday, but stopped working today. Tried kicking it off again a few times, with the same problem.

This error occurs when the job tries to create the pool (some names changed)

2017-03-11T00:18:06.4385430Z Unable to find image 'alfpark/batch-shipyard:cli-latest' locally
2017-03-11T00:18:07.5574940Z cli-latest: Pulling from alfpark/batch-shipyard
2017-03-11T00:18:07.5607490Z 6daefd62341a: Pulling fs layer
2017-03-11T00:18:07.5625380Z 2aa297eab108: Pulling fs layer
2017-03-11T00:18:08.1042870Z 6daefd62341a: Download complete
2017-03-11T00:18:08.3003220Z 6daefd62341a: Pull complete
2017-03-11T00:18:11.6265530Z 2aa297eab108: Verifying Checksum
2017-03-11T00:18:11.6282660Z 2aa297eab108: Download complete
2017-03-11T00:18:14.5793500Z 2aa297eab108: Pull complete
2017-03-11T00:18:14.6183050Z Digest: sha256:7286eeaf0d3cb776acef202d96d16a55987cf31154a03d12ee44bdc2df8c24e7
2017-03-11T00:18:14.6440140Z Status: Downloaded newer image for alfpark/batch-shipyard:cli-latest
2017-03-11T00:18:15.6050450Z 2017-03-11 00:18:15,603Z DEBUG convoy.keyvault:parse_secret_ids:248 fetching batch account key from keyvault
2017-03-11T00:18:16.0461670Z 2017-03-11 00:18:16,042Z DEBUG convoy.keyvault:parse_secret_ids:263 fetching storage account key for link data from keyvault
2017-03-11T00:18:16.3074910Z 2017-03-11 00:18:16,306Z DEBUG convoy.keyvault:parse_secret_ids:263 fetching storage account key for link batch from keyvault
2017-03-11T00:18:16.6545270Z 2017-03-11 00:18:16,653Z DEBUG convoy.keyvault:parse_secret_ids:278 fetching docker registry password for registry myregistry.azurecr.io from keyvault
2017-03-11T00:18:17.3565740Z 2017-03-11 00:18:17,355Z INFO convoy.storage:create_storage_containers:469 creating container: shipyardtor-mybatchaccount-mypool
2017-03-11T00:18:17.5785020Z 2017-03-11 00:18:17,575Z INFO convoy.storage:create_storage_containers:474 creating table: shipyardtorrentinfo
2017-03-11T00:18:17.7681260Z 2017-03-11 00:18:17,766Z INFO convoy.storage:create_storage_containers:477 creating queue: shipyardgr-mybatchaccount-mypool
2017-03-11T00:18:17.9617670Z 2017-03-11 00:18:17,958Z INFO convoy.storage:create_storage_containers:474 creating table: shipyardgr
2017-03-11T00:18:18.0079060Z 2017-03-11 00:18:18,006Z INFO convoy.storage:create_storage_containers:474 creating table: shipyardimages
2017-03-11T00:18:18.0591210Z 2017-03-11 00:18:18,057Z INFO convoy.storage:create_storage_containers:474 creating table: shipyardregistry
2017-03-11T00:18:18.1066530Z 2017-03-11 00:18:18,105Z INFO convoy.storage:create_storage_containers:474 creating table: shipyarddht
2017-03-11T00:18:18.1601350Z 2017-03-11 00:18:18,159Z INFO convoy.storage:create_storage_containers:469 creating container: shipyardrf-mybatchaccount-mypool
2017-03-11T00:18:18.2105060Z 2017-03-11 00:18:18,209Z INFO convoy.storage:_clear_blobs:384 deleting blobs: shipyardtor-mybatchaccount-mypool
2017-03-11T00:18:18.2860930Z 2017-03-11 00:18:18,285Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=mybatchaccount$mypool): shipyardtorrentinfo
2017-03-11T00:18:18.3684180Z 2017-03-11 00:18:18,367Z INFO convoy.storage:clear_storage_containers:452 clearing queue: shipyardgr-mybatchaccount-mypool
2017-03-11T00:18:18.4156300Z 2017-03-11 00:18:18,414Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=mybatchaccount$mypool): shipyardgr
2017-03-11T00:18:18.5136990Z 2017-03-11 00:18:18,512Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=mybatchaccount$mypool): shipyardimages
2017-03-11T00:18:18.5619460Z 2017-03-11 00:18:18,560Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=mybatchaccount$mypool): shipyardperf
2017-03-11T00:18:18.6100800Z 2017-03-11 00:18:18,608Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=mybatchaccount$mypool): shipyardregistry
2017-03-11T00:18:18.7062410Z 2017-03-11 00:18:18,705Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=mybatchaccount$mypool): shipyarddht
2017-03-11T00:18:18.8991810Z 2017-03-11 00:18:18,897Z INFO convoy.storage:_clear_blobs:384 deleting blobs: shipyardrf-mybatchaccount-mypool
2017-03-11T00:18:19.2330150Z 2017-03-11 00:18:19,231Z WARNING convoy.fleet:_adjust_settings_for_pool_creation:1006 forcing shipyard docker image to be used due to VM config, publisher=openlogic offer=centos sku=7.2
2017-03-11T00:18:19.2809030Z 2017-03-11 00:18:19,279Z INFO convoy.storage:_add_global_resource:255 adding global resource: docker:myimage
2017-03-11T00:18:19.7223520Z 2017-03-11 00:18:19,721Z INFO convoy.storage:upload_resource_files:338 uploading file /opt/batch-shipyard/scripts/shipyard_nodeprep.sh as 'shipyard_nodeprep.sh'
2017-03-11T00:18:19.9671350Z 2017-03-11 00:18:19,965Z INFO convoy.storage:upload_resource_files:338 uploading file /opt/batch-shipyard/scripts/docker_jp_block.sh as 'docker_jp_block.sh'
2017-03-11T00:18:20.2068330Z 2017-03-11 00:18:20,205Z INFO convoy.storage:upload_resource_files:338 uploading file /opt/batch-shipyard/scripts/shipyard_blobxfer.sh as 'shipyard_blobxfer.sh'
2017-03-11T00:18:20.4706070Z 2017-03-11 00:18:20,469Z INFO convoy.batch:create_pool:361 Attempting to create pool: mypool
2017-03-11T00:18:20.5340780Z Traceback (most recent call last):
2017-03-11T00:18:20.5356860Z   File "/opt/batch-shipyard/shipyard.py", line 941, in <module>
2017-03-11T00:18:20.5371900Z     cli()
2017-03-11T00:18:20.5385490Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 716, in __call__
2017-03-11T00:18:20.5398770Z     return self.main(*args, **kwargs)
2017-03-11T00:18:20.5412370Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 696, in main
2017-03-11T00:18:20.5428400Z     rv = self.invoke(ctx)
2017-03-11T00:18:20.5441240Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
2017-03-11T00:18:20.5454590Z     return _process_result(sub_ctx.command.invoke(sub_ctx))
2017-03-11T00:18:20.5467890Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
2017-03-11T00:18:20.5481400Z     return _process_result(sub_ctx.command.invoke(sub_ctx))
2017-03-11T00:18:20.5495210Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 889, in invoke
2017-03-11T00:18:20.5508400Z     return ctx.invoke(self.callback, **ctx.params)
2017-03-11T00:18:20.5521630Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 534, in invoke
2017-03-11T00:18:20.5535560Z     return callback(*args, **kwargs)
2017-03-11T00:18:20.5548800Z   File "/usr/lib/python3.5/site-packages/click/decorators.py", line 64, in new_func
2017-03-11T00:18:20.5561900Z     return ctx.invoke(f, obj, *args[1:], **kwargs)
2017-03-11T00:18:20.5575330Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 534, in invoke
2017-03-11T00:18:20.5588100Z     return callback(*args, **kwargs)
2017-03-11T00:18:20.5601120Z   File "/opt/batch-shipyard/shipyard.py", line 607, in pool_add
2017-03-11T00:18:20.5614220Z     ctx.table_client, ctx.config)
2017-03-11T00:18:20.5627110Z   File "/opt/batch-shipyard/convoy/fleet.py", line 1200, in action_pool_add
2017-03-11T00:18:20.5639870Z     _add_pool(batch_client, blob_client, config)
2017-03-11T00:18:20.5653520Z   File "/opt/batch-shipyard/convoy/fleet.py", line 640, in _add_pool
2017-03-11T00:18:20.5666650Z     nodes = batch.create_pool(batch_client, config, pool)
2017-03-11T00:18:20.5679700Z   File "/opt/batch-shipyard/convoy/batch.py", line 365, in create_pool
2017-03-11T00:18:20.5692760Z     batch_client.pool.add(pool)
2017-03-11T00:18:20.5705710Z   File "/usr/lib/python3.5/site-packages/azure/batch/operations/pool_operations.py", line 291, in add
2017-03-11T00:18:20.5718720Z     raise models.BatchErrorException(self._deserialize, response)
2017-03-11T00:18:20.5732820Z azure.batch.models.batch_error.BatchErrorException: {'value': 'The value provided for one of the properties in the request body is invalid.\nRequestId:79b2f4d8-8b47-49fd-a85e-206574727170\nTime:2017-03-11T00:18:20.6184337Z', 'lang': 'en-US'}

Any idea why this might be happening? I haven't touched my pool.json since yesterday.

My pool.json:

{
    "pool_specification": {
        "id": "mypool",
        "vm_size": "STANDARD_A2_V2",
        "vm_count": 10,
        "max_tasks_per_node": 2,

        "publisher": "OpenLogic",
        "offer": "CentOS",
        "sku": "7.2",

        "reboot_on_start_task_failed": true,
        "block_until_all_global_resources_loaded": true
    }
}

UserSubscription Batch Account Support

Allow UserSubscription batch accounts.

  • AAD auth for Batch account
    • Doc TFM with AAD is not supported yet
    • Add code in data movement to check/block?
  • Allow VNet Id in pool (see the sketch after this list)
    • Create VNet/subnet if not found option?
  • Remove 40 VM limit for inter node comm enabled pools and UserSubscription batch accounts
    • Update current limitations doc
  • Add to limitations doc that custom images are not supported (yet)
  • Link to docs on how to create a user subscription batch account
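
For example, the pool specification might reference an existing VNet along these lines (an illustrative sketch only; the property name and shape are assumptions, not a finalized schema, and the resource id is a placeholder):

{
    "pool_specification": {
        "id": "mypool",
        "virtual_network": {
            "id": "/subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>"
        }
    }
}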

Support for recurring jobs and accessing credentials from containers

Is there a recommended way to use batch-shipyard with the Azure Batch job scheduler? I'd like to be able to schedule Docker jobs to run at a recurring interval, and I couldn't find documentation on this topic.

Also, I was wondering if there was a way to expose credentials to the running Docker container without saving them as variables in jobs.json? I'd like to provide some credentials for external services to the container, but I don't want to check these into source control. Using KeyVault from inside the container would also work, but I don't think those env vars are passed to the pool.

(p.s. is this the right place to ask questions about batch-shipyard usage, or is there a more appropriate forum for this?)

Support recurring jobs

Support JobSchedules and recurrences. See issue #15 for more details.

  • Json schema for job schedule/recurrence (sketched after this list)
  • Generic job manager
    • Docker image
    • Spec transfer
    • Spec reader/submitter
    • Optional task monitor
  • jobs list should include both regular jobs and schedules
  • jobs del/jobs term/jobs disable/jobs enable/jobs migrate should detect schedules
    • --jobscheduleid flag to explicitly act on schedule rather than job
    • disambiguate if --all flag exists
  • Update configuration doc
  • Update usage doc
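
A job recurrence might be expressed along these lines in the jobs configuration (a hypothetical sketch; the recurrence property names here are assumptions based on Azure Batch job schedule concepts, not a finalized schema):

{
    "job_specifications": [
        {
            "id": "recurringjob",
            "recurrence": {
                "schedule": {
                    "recurrence_interval": "01:00:00"
                }
            }
        }
    ]
}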

Occasional 1-2 node startup tasks failing with no error message

Hey, I've started using the latest develop branch (4eea944), and now every time I create a pool with 10 nodes, I inevitably get one or two nodes that fail to start properly. I hadn't had any fail to start before. The only messages I get from the startup task are:

in startup/stdout.txt:

/dev/sdb1 on /mnt type fuseblk (rw,relatime,user_id=0,group_id=0,allow_other,blksize=4096)
/dev/sdb1 temp disk is mounted as fuseblk/ntfs

startup/stderr.txt is empty.

My pool.json looks like:

{
    "pool_specification": {
        "id": "mypool",
        "vm_size": "STANDARD_D11_V2",
        "vm_count": 10,
        "max_tasks_per_node": 2,

        "publisher": "Canonical",
        "offer": "UbuntuServer",
        "sku": "16.04.0-LTS",

        "ssh": {
            "username": "hi"
        },

        "reboot_on_start_task_failed": true,
        "block_until_all_global_resources_loaded": true
    }
}

With 5 nodes, I haven't run into any issues. Any idea what might be causing this?

Support Autopools

Allow jobs to be executed without an active pool, but instead with an autopool link. See #19 for the initial discussion.

  • Cascade will require an auto pool env var to strip off the trailing GUID
  • Jobs specification auto_pool property (sketched after this list)
    • pool_lifetime
    • keep_alive
  • Further refactor pool add so the creation of the pool add param and RF upload is independent of the actual create pool call
  • Logic change in jobs add
    • Autopool property will invoke part of pool add to get pool add param and RF upload
  • Warn with pool options with autopool
    • GlusterFS on compute
    • Auto ssh user
    • Local data movement to pool-level
  • Provide storage cleanup options
    • Add --poolid option to both storage clear and storage del
  • Config guide update
    • Add note about orphaned storage data with autopools
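
A job using an autopool might then be specified roughly as follows (an illustrative sketch; property names mirror the checklist above and may not match the final schema):

{
    "job_specifications": [
        {
            "id": "myautopooljob",
            "auto_pool": {
                "pool_lifetime": "job",
                "keep_alive": false
            }
        }
    ]
}

Here pool_lifetime would bind the autopool's lifetime to the job, and keep_alive would control whether the pool survives job completion.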

Allow job-level override to run missing pre-loaded docker images

  • Job level json property: allow_run_on_missing_image (see the sketch after this list)
  • Modify JP to not run jp block script if above property is true
  • Prepend private registry to image name in tasks under job
  • Update docs
    • Add note that passthrough on missing config only applies to config.json images
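
A minimal sketch of the proposed job-level override (the property name is taken from the first item above; its placement and the surrounding values are assumptions and placeholders):

{
    "job_specifications": [
        {
            "id": "myjob",
            "allow_run_on_missing_image": true,
            "tasks": [
                {
                    "image": "myregistry.azurecr.io/myimage:latest",
                    "command": "/bin/sh -c \"echo hello\""
                }
            ]
        }
    ]
}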

Documentation

I am staring at the TensorFlow recipes, looking desperately for how to kick off my first load. I have 150 GB sitting in Data Lake. Do I provision Batch? Do I set up VMs? What, step by step, do I need to do to use random images in Data Lake with some random label file or set of label files?

How to add tasks?

Hi,

I'm new to both Docker and Azure Batch, but I feel it is what I need.

My use case:
I need to process batches of thousands of images. My input for each task is an xml file and 5 images. Then a Linux executable wrapped in a Python script processes this input and produces 3 images and an xml file as output. I wrapped the code in a Docker image that processes one set of images. Both the input and output files are in Azure blob storage. Also this process is part of a bigger automated pipeline and I need to monitor when the batch is done.

I used batch-shipyard to create a pool. Now my question is, how do I create the job and its tasks (from code)? Am I supposed to generate a jobs.json with thousands of tasks? Or is there another way? Can I use the Azure Batch API as well?

Thanks in advance for clarifying this.

Possible scenarios for using pool auto-resize?

I noticed in the "current issues" documentation that you recommend using batch-shipyard to resize pools.  I looked at the resizing code in fleet.py and batch.py, and I was wondering if there were scenarios where I could possibly use pool auto-resizing? e.g. if I didn't care about SSHing into nodes, and I didn't use GlusterFS.

Unused reference to the azure-mgmt package in convoy/keyvault.py?

I just tried running the Docker container version of the CLI, and I ran into this error:

> docker run --rm -it alfpark/batch-shipyard:cli-latest
Traceback (most recent call last):
  File "/opt/batch-shipyard/shipyard.py", line 42, in <module>
    import convoy.fleet
  File "/opt/batch-shipyard/convoy/fleet.py", line 52, in <module>
    from . import keyvault
  File "/opt/batch-shipyard/convoy/keyvault.py", line 40, in <module>
    import azure.mgmt.resource.resources
ImportError: No module named 'azure.mgmt'

Seems like this import in convoy/keyvault.py is unused? I'll try removing that line and building the container on my end, and see if it breaks or not. Edit: seems like it works.

Broken pipes in Blobxfer / requests when outputting from many nodes concurrently

When I'm running a large parallel job with hundreds of simultaneous tasks, I'm running into an issue with blobxfer and requests failing to output data to blob storage.

I originally started the parallel job with 10 nodes, 7 tasks per node, for a total of 70 simultaneous tasks, and my tasks were completing correctly and uploading their results to blob storage as expected. Since each job is independent and I wanted to speed this job up, I then used pool resize to scale up the pool from 10 nodes to 50 nodes.

After the resize was complete, all of the jobs started failing in the output step when blobxfer is attempting to upload each task's result to blob storage. The error reports multiple broken pipes (assuming each one is a retry attempt) in requests. Unfortunately, I forgot to grab an error log before shutting down the cluster, but if it occurs again I'll post it here.

I'm assuming this is because I'm trying to shove too many simultaneous uploads into Blob storage at once? Is there an inherent limit to the number of simultaneous blob storage uploads? I wasn't able to find a good metric for this in the Azure documentation.

If task name is too long, fails to add task

I have a job with >1000 tasks. If the task id is left null in jobs.json, shipyard fails when adding task number 1001 (automatically named "dockertask-1000"). The relevant message is: "The specified task already exists."
I suspect the task name becomes too long, the last digit is silently dropped, and thus the task ends up with the exact same name as dockertask-100.

Indeed, assigning (short) custom names solves the problem.
If my interpretation is correct, I suggest adding a run-time check on the validity of task names.

2017-01-24 23:01:51,152Z INFO convoy.batch:add_jobs:1894 Adding task: dockertask-1000
Traceback (most recent call last):
  File "/home/adotti/Work/Azure/batch-shipyard/shipyard.py", line 921, in <module>
    cli()
  File "/home/adotti/.local/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/adotti/.local/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/adotti/.local/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/adotti/.local/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/adotti/.local/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/adotti/.local/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/adotti/.local/lib/python2.7/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args[1:], **kwargs)
  File "/home/adotti/.local/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/adotti/Work/Azure/batch-shipyard/shipyard.py", line 741, in jobs_add
    recreate, tail)
  File "/home/adotti/Work/Azure/batch-shipyard/convoy/fleet.py", line 1453, in action_jobs_add
    _BLOBXFER_FILE, recreate, tail)
  File "/home/adotti/Work/Azure/batch-shipyard/convoy/batch.py", line 1895, in add_jobs
    batch_client.task.add(job_id=job.id, task=batchtask)
  File "/home/adotti/.local/lib/python2.7/site-packages/azure/batch/operations/task_operations.py", line 107, in add
    raise models.BatchErrorException(self._deserialize, response)
azure.batch.models.batch_error.BatchErrorException: {'lang': u'en-US', 'value': u'The specified task already exists.\nRequestId:789cbe9f-3006-4325-9ec2-6c094cb64808\nTime:2017-01-25T07:01:50.9582196Z'}

Support pool autoscale

Allow for pool autoscale formulas. See issue #25.

Enabling changes:

  • Need to redesign how docker images are pulled since the queue message time-to-live limit is 7 days
  • RF SASes should have sufficiently large se parameter

Autoscale changes:

  • Add json properties for pool (a sketch follows this list)
    • autoscale
      • evaluation_interval
      • formula
      • scenario
        • name
        • maximum_vm_count
        • node_deallocation_option
        • sample_lookback_interval
        • required_sample_percentage
        • bias_last_sample
        • bias_node_type
        • rebalance_preemption_percentage
  • Add autoscale scenarios
    • active_tasks
    • pending_tasks
    • workday
    • workday_with_offpeak_max_low_priority
    • weekday
    • weekend
  • Modify Pool add parameter
  • Add autoscale subcommand
    • pool autoscale evaluate
    • pool autoscale enable
    • pool autoscale disable
    • pool autoscale lastexec
    • Prevent enable command from running on any pool with metadata version < 2.9.0

Other changes:

  • Block pool creation for the following
    • Autoscale and GlusterFS on compute
    • Autoscale and peer-to-peer
  • Emit warning if pool ssh user is detected with autoscale
  • Update docs
  • Add autoscale guide
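
Putting the properties from the checklist above together, a scenario-based autoscale pool might look roughly like this (an illustrative sketch constructed from the listed property names; values and the exact shape are assumptions):

{
    "pool_specification": {
        "id": "myautoscalepool",
        "vm_size": "STANDARD_D2_V2",
        "autoscale": {
            "evaluation_interval": "00:05:00",
            "scenario": {
                "name": "active_tasks",
                "maximum_vm_count": 16,
                "node_deallocation_option": "taskcompletion",
                "sample_lookback_interval": "00:10:00",
                "required_sample_percentage": 70
            }
        }
    }
}

Alternatively, a raw formula property could be supplied instead of a named scenario.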

Azure KeyVault support for credentials.json

Support AAD to KeyVault access for credential secrets.

  • Entire credentials.json (zlib+base64) as secret
    • Store CLI option
    • Delete CLI option
  • Individual secrets support per key with KeyVault Secret Id
  • CLI options with envvar support
    • AAD Service Principal (Tenant/Directory ID, Client/Application ID, Secret)
    • AAD Service Principal Certificate auth
    • AAD User and Password
    • List all secret ids in uri as convenience command
  • AAD/Keyvault creds in credentials.json support
  • Guides/Docs
    • New Batch Shipyard and Azure KeyVault guide
    • Usage doc (keyvault command)
    • Configuration doc (*_keyvault_secret_id properties)

KeyError: account_key on cli-latest

I'm running into this blocking issue running the cli-latest docker container.

Traceback:

2017-03-18T00:05:26.9865110Z Digest: sha256:31fc61d165291b4c5a186ceb74827226397e32c04d8df349bb197ef67e5ccd4e
2017-03-18T00:05:27.0034980Z Status: Downloaded newer image for alfpark/batch-shipyard:cli-latest
2017-03-18T00:05:27.9223690Z Traceback (most recent call last):
2017-03-18T00:05:27.9238610Z   File "/opt/batch-shipyard/shipyard.py", line 1452, in <module>
2017-03-18T00:05:27.9254630Z     cli()
2017-03-18T00:05:27.9268580Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 722, in __call__
2017-03-18T00:05:27.9280400Z     return self.main(*args, **kwargs)
2017-03-18T00:05:27.9292230Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 697, in main
2017-03-18T00:05:27.9303840Z     rv = self.invoke(ctx)
2017-03-18T00:05:27.9316070Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
2017-03-18T00:05:27.9328400Z     return _process_result(sub_ctx.command.invoke(sub_ctx))
2017-03-18T00:05:27.9340310Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
2017-03-18T00:05:27.9352410Z     return _process_result(sub_ctx.command.invoke(sub_ctx))
2017-03-18T00:05:27.9364450Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 895, in invoke
2017-03-18T00:05:27.9376980Z     return ctx.invoke(self.callback, **ctx.params)
2017-03-18T00:05:27.9388590Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 535, in invoke
2017-03-18T00:05:27.9400430Z     return callback(*args, **kwargs)
2017-03-18T00:05:27.9412260Z   File "/usr/lib/python3.5/site-packages/click/decorators.py", line 64, in new_func
2017-03-18T00:05:27.9424690Z     return ctx.invoke(f, obj, *args[1:], **kwargs)
2017-03-18T00:05:27.9436310Z   File "/usr/lib/python3.5/site-packages/click/core.py", line 535, in invoke
2017-03-18T00:05:27.9448360Z     return callback(*args, **kwargs)
2017-03-18T00:05:27.9461250Z   File "/opt/batch-shipyard/shipyard.py", line 1038, in pool_add
2017-03-18T00:05:27.9473730Z     ctx.initialize_for_batch()
2017-03-18T00:05:27.9486240Z   File "/opt/batch-shipyard/shipyard.py", line 124, in initialize_for_batch
2017-03-18T00:05:27.9498770Z     skip_global_config=False, skip_pool_config=False, fs_storage=False)
2017-03-18T00:05:27.9511520Z   File "/opt/batch-shipyard/shipyard.py", line 321, in _init_config
2017-03-18T00:05:27.9524240Z     convoy.fleet.populate_global_settings(self.config, fs_storage)
2017-03-18T00:05:27.9537360Z   File "/opt/batch-shipyard/convoy/fleet.py", line 209, in populate_global_settings
2017-03-18T00:05:27.9550490Z     sc = settings.credentials_storage(config, bs.storage_account_settings)
2017-03-18T00:05:27.9563550Z   File "/opt/batch-shipyard/convoy/settings.py", line 837, in credentials_storage
2017-03-18T00:05:27.9576550Z     account_key=conf['account_key'],
2017-03-18T00:05:27.9589290Z KeyError: 'account_key'

This started happening today with my scheduled task just a few hours ago, and doesn't seem to be transient (I reran the scheduled task and it failed the second time).

Yesterday's run worked fine, and cli 2.5.4 seems to work fine as well. I haven't tried 2.6.0b1 yet. If you'd like, I can try that as well.

My credentials.json:

{
    "credentials": {
        "batch": {
            "account": "mybatchaccount",
            "account_key_keyvault_secret_id": "https://myvault.vault.azure.net/secrets/batch",
            "account_service_url": "https://mybatchaccount.westus.batch.azure.com"
        },
        "storage": {
            "batch": {
                "account": "mybatchstorage",
                "account_key_keyvault_secret_id": "https://myvault.vault.azure.net/secrets/storage-batch"
            },
            "data": {
                "account": "storage",
                "account_key_keyvault_secret_id": "https://myvault.vault.azure.net/secrets/storage"
            }
        },
        "docker_registry": {
            "myregistry-on.azurecr.io": {
                "username": "myuser",
                "password_keyvault_secret_id": "https://myvault.vault.azure.net/secrets/acr-myregistry-reader"
            }
        }
    }
}

Exception in task_file_mover when ingressing files from other batch tasks

I've set up a job which contains several fetch tasks, and a single processing task that depends on the fetch tasks. For convenience, I tried using the Azure Batch input_data type in the processing task to get all the data from the preceding fetch tasks, but I'm running into this exception with task_file_mover.

Traceback (most recent call last):
  File "task_file_mover.py", line 148, in <module>
    main()
  File "task_file_mover.py", line 123, in main
    batch_client = _create_credentials()
  File "task_file_mover.py", line 60, in _create_credentials
    ba, url, bakey = os.environ['SHIPYARD_BATCH_ENV'].split(';')
ValueError: not enough values to unpack (expected 3, got 2)

I'm using KeyVault for supplying the batch credentials, like:

{
    "credentials": {
        "batch": {
            "account": "myaccount",
            "account_key_keyvault_secret_id": "https://myvault.vault.azure.net/secrets/batchkey",
            "account_service_url": "https://myaccount.westus.batch.azure.com"
        }
    }
}

Make the default configdir './'

Passing '--configdir' is often redundant, and makes the commands bulky. This would allow for very clean looking shell scripts:

cd config
shipyard pool add
shipyard pool asu
shipyard pool ssh

As an example. I can submit a PR if that looks like a good idea.

Will shared data volumes defined in config.json be shared across pools?

From the documentation at https://github.com/Azure/batch-shipyard/blob/master/docs/10-batch-shipyard-configuration.md#global-config, it is not clear if the "shared_data_volumes", say e.g. GlusterFS setup, will be available across different pools created with the same config.json.

I haven't tried it out yet, but it seems that the volumes would only be available within a pool, in which case shouldn't the configuration exist in pool.json?

Safe to use pool autoscaling with Batch Shipyard?

I've been setting pool autoscaling on pools created using Batch Shipyard pool add, and I've been testing it out for a while now with no issues. I just wanted to confirm that Batch Shipyard is safe to use with autoscaling, and whether there are any considerations I should keep in mind? I'm not running multi-instance jobs.

max_task_retry_count support?

Do you have plans to add support for setting the max_task_retry_count property for tasks in jobs? Some of my tasks may experience transient failures, and being able to retry right inside Azure Batch would be super convenient.

The Azure Batch SDK for python seems to support it at the job and task level:

https://github.com/Azure/azure-sdk-for-python/blob/f07caf6d435bd49cbfc654c77d20a2fc3f8357c5/azure-batch/azure/batch/models/task_constraints.py

If you'd like, I can also try my hand at making a PR for this feature, where I'd add a max_task_retry_count property to both job and task definitions in jobs.json. I just wanted to check if you're already working on this so I don't step on any toes.
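
For illustration, the proposal might look something like this in jobs.json (hypothetical; this property does not exist in Batch Shipyard at the time of writing, and the image and command are placeholders):

{
    "job_specifications": [
        {
            "id": "myjob",
            "max_task_retry_count": 3,
            "tasks": [
                {
                    "image": "myimage",
                    "max_task_retry_count": 1,
                    "command": "/bin/sh -c \"may-fail-transiently\""
                }
            ]
        }
    ]
}

In this proposal, the task-level value would override the job-level default.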

Thanks!

Default `docker run --shm-size=64MB` inadequate for some Intel MPI jobs

When using the combined shm:dapl Intel MPI fabric, the /dev/shm device is exposed to the Docker container from the host and is then used for intra-node MPI communication. Unfortunately, the default size of /dev/shm is restricted to 64 MB, which is inadequate. The result is MPI applications that crash at random points.

Fix:

In jobs.json set the additional_docker_run_options to be --shm-size=256m (or as appropriate).

https://github.com/Azure/batch-shipyard/blob/master/config_templates/jobs.json
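
A minimal jobs.json fragment applying the workaround might look like this (a sketch only; additional_docker_run_options is assumed here to take a list of strings, and the image and command are placeholders):

{
    "job_specifications": [
        {
            "id": "mpijob",
            "tasks": [
                {
                    "image": "myorg/my-intel-mpi-app",
                    "additional_docker_run_options": [
                        "--shm-size=256m"
                    ],
                    "command": "/opt/run_mpi.sh"
                }
            ]
        }
    ]
}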

This is more of a suggestion/warning than a bug report.

Joint work with @chrisrichardson.

Batch account_key needs to be defined even if account_key_keyvault_secret_id is set

I'm trying to use credentials.json with KeyVault to just store secret IDs, and provide the KeyVault parameters through environment vars.

When using credentials.json this way, it appears that credentials.batch.account_key needs to exist even if credentials.batch.account_key_keyvault_secret_id is set. If it doesn't exist, then shipyard assumes that the credentials.json is bad and tries to retrieve the entire credentials.json from KeyVault, which fails if the credentials secret id is not set. Since I'm using a credentials.json in my repository, rather than a KeyVault-stored credentials.json, this throws many exceptions.

convoy.keyvault:fetch_credentials_json:140 fetching credentials json from keyvault
Traceback (most recent call last):
  File "/mnt/batch-shipyard/shipyard.py", line 183, in _init_config
    convoy.settings.credentials_batch(self.config)
  File "/mnt/batch-shipyard/convoy/settings.py", line 592, in credentials_batch
    account_key=conf['account_key'],
KeyError: 'account_key'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.5/site-packages/azure/keyvault/key_vault_id.py", line 33, in _validate_string_argument
    prop = prop.strip()
AttributeError: 'NoneType' object has no attribute 'strip'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/batch-shipyard/shipyard.py", line 919, in <module>
    cli()
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/user/.local/lib/python3.5/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args[1:], **kwargs)
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/mnt/batch-shipyard/shipyard.py", line 535, in keyvault_list
    ctx.initialize(creds_only=True, no_config=True)
  File "/mnt/batch-shipyard/shipyard.py", line 93, in initialize
    self._init_config(creds_only)
  File "/mnt/batch-shipyard/shipyard.py", line 192, in _init_config
    self.keyvault_credentials_secret_id)
  File "/mnt/batch-shipyard/convoy/fleet.py", line 265, in fetch_credentials_json_from_keyvault
    keyvault_client, keyvault_uri, keyvault_credentials_secret_id)
  File "/mnt/batch-shipyard/convoy/keyvault.py", line 141, in fetch_credentials_json
    cred = client.get_secret(keyvault_credentials_secret_id)
  File "/home/user/.local/lib/python3.5/site-packages/azure/keyvault/key_vault_client.py", line 135, in get_secret
    sid = parse_secret_id(secret_identifer)
  File "/home/user/.local/lib/python3.5/site-packages/azure/keyvault/key_vault_id.py", line 142, in parse_secret_id
    return parse_object_id('secrets', id)
  File "/home/user/.local/lib/python3.5/site-packages/azure/keyvault/key_vault_id.py", line 79, in parse_object_id
    id = _validate_string_argument(id, 'id')
  File "/home/user/.local/lib/python3.5/site-packages/azure/keyvault/key_vault_id.py", line 36, in _validate_string_argument
    raise TypeError("argument '{}' must by of type string".format(name))
TypeError: argument 'id' must by of type string

If I add a blank account_key property, then everything seems to work fine:

{
    "credentials": {
        "batch": {
            "account": "mybatchaccount",
            "account_key": "",
            "account_key_keyvault_secret_id": "https://myvault.vault.azure.net/secrets/batchkey",
            "account_service_url": "https://mybatchaccount.region.batch.azure.com"
        }
    }
}

Is this intended behavior? The documentation seems to suggest that, if I define a particular *_keyvault_secret_id property, then I can omit the * property itself. The other storage and docker registry *_keyvault_secret_id properties seem to read fine without needing to define their corresponding *s.

Thanks!

Run container as specific uid/gid

This can already be accomplished with additional docker run options.

However, native support in the job/task config should be present to remap to the azbatch user if wanted, or any other uid/gid that is present, which will help with storage cluster integration.

  • Create job/task config option for remap
  • Auto remap host passwd/group/sudoers to container as ro
  • Doc updates

Nodes and jobs getting stuck in weird states - disk full?

I've been encountering issues with batch tasks and nodes getting stuck in weird states several hours after starting. Running short jobs works fine, but if I run full job workloads (each job downloads and processes ~60 GB of data), nodes start getting stuck in "Waiting for start task" or "Idle", and tasks start getting stuck in "Running" or "Preparing" with no way for me to see the files for the specific task. I'm not sure if this is a batch-shipyard issue or an issue with Azure Batch itself. Could it have anything to do with running out of space on the node?

If it is a space issue, since I have remove_container_after_exit set to true, will this remove the data in $AZ_BATCH_TASK_WORKING_DIR? If not, is there a recommended way of removing this data? Since I'm running data fetch tasks, I have to egress most of this data to blob storage at the end of the task, so I can't remove the data before blobxfer runs. 

Missing argument in NAMD config file

I wanted to test your NAMD container with DC/OS and run a simple job. It turned out that /sw/NAMD_2.11_Linux-x86_64-TCP/apoa1/apoa1.namd.template is missing the outputname parameter, which prevents namd2 from running. When fixed, namd2 apoa1.namd.template works.

I don't know if this is relevant, but maybe others will be interested in how to run a simple job within the container. Could you add a section to the documentation on how to run a 'hello world' example with the apoa1 molecule?

Remote FS Cluster Support

Add support for creating a standalone filesystem with attached data disks on a VNet. Linking remote fs clusters created by Batch Shipyard to compute pools will only be supported with UserSubscription accounts.

  • Add fs.json file (a rough sketch follows this checklist)
    • Resource group
    • Location
    • Managed disks (k:v)
      • Resource group
      • Location/region
      • disks (array)
        • Name
        • Disk size
        • Storage account type (allow premium)
    • Storage cluster
      • Name (for internal linkage only)
      • VNet: ID, subnet info
      • File Server
        • Type: nfs and glusterfs
        • Mount point
      • VM Count
      • VM Size
      • Static IP enabled?
      • Network security rules
      • Node to disk map
        • RAID 0 flag for multiple disks?
        • Format as
  • Add fs command:
    • disks
      • add
      • del
      • list
    • cluster
      • add
      • del
      • expand
      • resize
      • ssh
      • status
      • start
      • suspend
  • Allow remotefs to be linked to config.json shared data volumes
    • mount option exposure?
  • Pool configuration changes
    • VNet Id
  • RemoteFS guide
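
A rough sketch of what such an fs.json could look like, with property names simply mirroring the checklist above (illustrative only, not the final schema; all values are placeholders):

{
    "remote_fs": {
        "resource_group": "my-storage-cluster-rg",
        "location": "eastus",
        "managed_disks": {
            "resource_group": "my-storage-cluster-rg",
            "location": "eastus",
            "disks": [
                {
                    "name": "disk0",
                    "disk_size_gb": 128,
                    "storage_account_type": "premium"
                }
            ]
        },
        "storage_cluster": {
            "name": "mynfs",
            "virtual_network": {
                "id": "myvnet",
                "subnet": "my-subnet"
            },
            "file_server": {
                "type": "nfs",
                "mount_point": "/data"
            },
            "vm_count": 1,
            "vm_size": "STANDARD_DS3_V2",
            "static_ip": true
        }
    }
}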

question - Infiniband job isolation

Hi All,

Is it possible to use shipyard to run multiple jobs in the same pool and have each job isolated from the others in terms of networking and storage?

For example, can I map in a specific data area for each job that is not accessible to other jobs running in the same pool and potentially on the same node?

Also, I note that enabling infiniband forces docker to use the host's networking stack. Does this mean that all containers can communicate with each other even if they are part of another job but in the same pool?

Thanks,
h

Tasks get stuck (in the transfer of output via blobxfer)

Hello, I have some jobs with many tasks attached (up to 10k). I expect the jobs to finish on my test pool with 20 cores in about seven days.
Everything runs smoothly for a few days, but then some jobs simply get stuck in what seems to be the transfer of the generated output to my output storage account.
The end of the stdout.txt file for the stuck jobs reads as follows. I checked and the rest of the output is correct; in particular, the file to be copied is present:

=====================================
 azure blobxfer parameters [v0.12.1]
=====================================
             platform: Linux-4.4.0-47-generic-x86_64-with
   python interpreter: CPython 3.5.2
     package versions: az.common=1.1.4 az.sml=0.20.5 az.stor=0.33.0 crypt=1.6 req=2.12.3
      subscription id: None
      management cert: None
   transfer direction: local->Azure
       local resource: .
      include pattern: *.tgz
      remote resource: None
   max num of workers: 12
              timeout: None
      storage account: geant4data
              use SAS: True
  upload as page blob: False
  auto vhd->page blob: False
 upload to file share: False
 container/share name: XXXXXXXXXXXXXXXXXX
  container/share URI: XXXXXXXXXXXXXXXXXX 
    compute block MD5: False
     compute file MD5: True
    skip on MD5 match: True
   chunk size (bytes): 4194304
     create container: False
  keep mismatched MD5: False
     recursive if dir: True
component strip on up: 1
        remote delete: False
           collate to: disabled
      local overwrite: True
      encryption mode: disabled
         RSA key file: disabled
         RSA key type: disabled
=======================================

script start time: 2017-02-28 13:46:22

The output file is truncated at this point and has not been updated for several hours. The Azure portal reports the task in the "preparing" state.
Any idea?

Exception when providing job environment variables without task environment variables

If my jobs.json file contains an environment_variables property in a job specification object, but no task-level property, I get the following exception:

Traceback (most recent call last):
  File "/mnt/batch-shipyard/shipyard.py", line 919, in <module>
    cli()
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/user/.local/lib/python3.5/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args[1:], **kwargs)
  File "/home/user/.local/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/mnt/batch-shipyard/shipyard.py", line 739, in jobs_add
    ctx.batch_client, ctx.blob_client, ctx.config, recreate, tail)
  File "/mnt/batch-shipyard/convoy/fleet.py", line 1451, in action_jobs_add
    recreate, tail)
  File "/mnt/batch-shipyard/convoy/batch.py", line 1742, in add_jobs
    job_env_vars, task.environment_variables)
  File "/mnt/batch-shipyard/convoy/util.py", line 199, in merge_dict
    raise ValueError('dict1 or dict2 is not a dictionary')
ValueError: dict1 or dict2 is not a dictionary

Seems like https://github.com/Azure/batch-shipyard/blob/master/convoy/batch.py#L1738 checks for the task-but-no-job case, but not the job-but-no-task case.
