
torchelastic_lab's Introduction

Elastic PyTorch training on Kubernetes Self Serve Lab

This lab introduces elastic, fault-tolerant distributed PyTorch training on Kubernetes. As models grow larger, scaling out to a distributed cluster becomes increasingly important. Distributed training across multiple high-performance computing instances can significantly reduce the time needed to train modern deep neural networks on large datasets. Running the training on low-cost Azure Spot VMs cuts costs further.

Kubernetes enables machine learning teams to run training jobs distributed across fleets of powerful GPU instances. However, typical distributed training jobs are not fault tolerant, and a job cannot continue if a node fails or is reclaimed.

Elastic training goes further: it lets distributed training jobs run in a fault-tolerant and elastic manner on a set of Kubernetes nodes that can change dynamically, without disrupting the model training process. The PyTorch Elastic module lets users add elasticity and fault tolerance to a PyTorch training script. It is fully native to PyTorch, so users do not need to install additional frameworks such as Horovod, Ray, or Dask.
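Fault tolerance of this kind generally relies on the training script being able to resume from its last checkpoint after a worker restarts. The following is a minimal sketch of that resume pattern, not the lab's actual script. It uses only the Python standard library so it runs anywhere; the function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON state layout, and the `fail_at_epoch` parameter are hypothetical and exist only for illustration. A real PyTorch script would typically store the model and optimizer state with `torch.save` and restore it with `torch.load`.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so an eviction mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"epoch": 0, "weights": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_epochs, fail_at_epoch=None):
    """Run epochs starting from the last checkpoint.

    fail_at_epoch (hypothetical) simulates a node eviction by raising
    before the given epoch starts.
    """
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        if fail_at_epoch is not None and epoch == fail_at_epoch:
            raise RuntimeError("simulated node eviction")
        state["weights"] += 1.0  # stand-in for one epoch of real training
        state["epoch"] = epoch + 1
        save_checkpoint(path, state)
    return state
```

If `train` is interrupted and then called again with the same checkpoint path, it continues from the first epoch that was not yet checkpointed instead of starting over.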

What you'll build

In this lab, you will build an AKS cluster with a GPU-enabled Spot VM node pool, run an elastic distributed PyTorch training job, and test how the script survives interruptions when an AKS node is evicted. Upon completion, your infrastructure will contain:

  • An Azure Kubernetes Service (AKS) cluster with
    • A GPU-enabled Spot VM node pool for running elastic training
    • A CPU VM node pool for running the Rendezvous server (the training control plane)
  • An Azure Storage account for hosting training data and model training checkpoints
  • Notebooks that create the infrastructure and run the training jobs

What you'll learn

In this lab you will build the cloud-native infrastructure required for running distributed PyTorch jobs, deploy Kubernetes components such as the Rendezvous etcd server and the Torch Elastic Kubernetes operator, and run the training. You will learn:

  • How to train a large-scale PyTorch model on a cluster of low-cost Spot VMs while retaining job reliability
  • How to run elastic distributed PyTorch training on Kubernetes with the TorchElastic operator
  • How to leverage Kubernetes auto-scaling for elastic training jobs to shorten training time
  • How to simulate an Azure Spot VM eviction and verify that the elastic training job does not fail

Key Takeaways:

  • You will be able to apply what you learn to reduce costs for customers who run training jobs on Azure or on premises.

What you'll need

  • A basic understanding of Kubernetes is helpful but not required
  • An Azure subscription with quota for GPU cores
  • A Linux environment: Windows Subsystem for Linux, a DSVM, or an Azure ML compute instance
  • Azure CLI

Lab Steps

  • Step 0: Lab Environment Setup
  • Step 1: Infrastructure Setup (AKS + Spot VM Nodepool) and Torch Elastic
  • Step 2: Adjust script for Elastic training
  • Step 3: Run Torch Elastic ImageNet training on Spot VM Pool
  • Step 4: Simulate node eviction and verify training is unaffected
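As a rough illustration of what adjusting a script for elastic training (Step 2) can involve, the sketch below shows a worker reading its rank and the current world size from the `RANK` and `WORLD_SIZE` environment variables, which the TorchElastic launcher sets for each worker, and deriving its share of the dataset from them. Because the number of workers can change after a node is evicted or added, the shard has to be recomputed whenever workers restart. The helper names `worker_info` and `shard_indices` are hypothetical, and this is not the lab's actual code; a real PyTorch script would normally leave the sharding to `torch.utils.data.distributed.DistributedSampler`.

```python
import os


def worker_info(env=None):
    """Read this worker's rank and the current world size from the environment.

    Falls back to a single-worker setup when the variables are not set.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    return rank, world_size


def shard_indices(num_samples, rank, world_size):
    """Return the sample indices this worker should process (strided split)."""
    return list(range(rank, num_samples, world_size))
```

For example, with 10 samples and 4 workers, the worker with rank 1 processes samples 1, 5, and 9. If one node is evicted and the job continues with 3 workers, the same rank would instead process samples 1, 4, and 7.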

References

Torch Elastic Docs

Azure Spot VMs
