
transformer_central

Various tutorials and transformers for FSDP research and learning.

Lessons available:

1 - Using the FSDP Transformer Wrapper (video + notebook)

FSDP now has an express auto-wrapper for Transformer models. This lets FSDP create a 'model aware' sharding plan for how it breaks the model up across the GPUs, which can yield significant improvements in training speed.

Video and notebook are in the sub-folder here
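As a rough sketch (not the notebook's exact code), the wrapper is used roughly as follows, where `model` and `TransformerBlock` are placeholders for your own model and its transformer layer class:

```python
import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# `TransformerBlock` is a placeholder for your model's layer class
# (e.g. T5Block for a HuggingFace T5 model); `model` is your nn.Module.
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},
)

sharded_model = FSDP(
    model,
    auto_wrap_policy=auto_wrap_policy,
    device_id=torch.cuda.current_device(),
)
```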

2 - Using FSDP's checkpoint activations (video + notebook)

FSDP can now auto-insert activation checkpoints using a process similar to the Transformer wrapper, where you designate your layer class. We recommend watching video 1 above before this one.

Video and notebook are in the folder here
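A minimal sketch of the idea, reusing the `TransformerBlock` and `sharded_model` placeholders from lesson 1 (note the function name has shifted between PyTorch releases; older nightlies exposed it as `apply_activation_checkpointing_wrapper`):

```python
from functools import partial

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

# Wrap each designated layer class in a non-reentrant checkpoint wrapper.
non_reentrant_wrapper = partial(
    checkpoint_wrapper,
    checkpoint_impl=CheckpointImpl.NO_REENTRANT,
)
check_fn = lambda submodule: isinstance(submodule, TransformerBlock)

apply_activation_checkpointing(
    sharded_model,
    checkpoint_wrapper_fn=non_reentrant_wrapper,
    check_fn=check_fn,
)
```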

3 - Loading and Saving Model and Optimizer checkpoints, with FULL_STATE_DICT (video + notebook)

Saving and loading your training checkpoints is an essential task, and this tutorial covers how to do that with FSDP. There are multiple state dict types within FSDP - this tutorial covers FULL_STATE_DICT, which is the typical use case as long as the entire checkpoint (model or optimizer) fits within your available CPU memory. For models that go beyond CPU memory (e.g. 20-30B+), we'll use distributed checkpoints via LOCAL_STATE_DICT, which will be covered in a separate tutorial.

Video and notebook are in the folder here
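The core pattern is roughly the following sketch, assuming `sharded_model` and `optimizer` already exist and the process group is initialized:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType

# Gather the full (unsharded) model state dict onto rank 0, offloaded to CPU.
save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(sharded_model, StateDictType.FULL_STATE_DICT, save_policy):
    cpu_state = sharded_model.state_dict()

if dist.get_rank() == 0:
    torch.save(cpu_state, "model_checkpoint.pt")

# Optimizer state: full_optim_state_dict gathers the complete optimizer
# state (returned on rank 0 only by default).
optim_state = FSDP.full_optim_state_dict(sharded_model, optimizer)
if dist.get_rank() == 0:
    torch.save(optim_state, "optimizer_checkpoint.pt")
```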

4 - Backwards Prefetching - optimize your training speed by increasing communication and computation overlap (video + notebook)

FSDP has multiple options for optimizing communication and computation overlap during the backward pass of training. In this tutorial, we show the three current options, how to use them, and explain at a parameter level what the differences are.

Video and notebook are in the folder here
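The three options map to a single constructor argument; a sketch, using the placeholders from lesson 1:

```python
from torch.distributed.fsdp import BackwardPrefetch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

sharded_model = FSDP(
    model,
    auto_wrap_policy=auto_wrap_policy,
    # BACKWARD_PRE  - prefetch the next params before the current gradient computation
    # BACKWARD_POST - prefetch after the current gradient computation
    # None          - no prefetching (lowest memory, least overlap)
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
)
```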

5 - Maximizing your training speed with FSDP and gpu memory:

Conventional wisdom is that to maximize training throughput, you should push your batch size up until you OOM, back off slightly from there, and voilà, optimal throughput.

This is not correct, though: to get maximum speed, you also need to make sure you are not triggering cudaMalloc retries.

This tutorial walks through tuning a 2B model in FSDP and the improvement gained by avoiding retries (25% greater throughput vs. conventional practice), and offers a utility class (Memory_Maximizer) you can add to your project to automatically monitor GPU memory info and retry counts while you tune.

Video, notebook and utility file are in the folder here
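The underlying signal is available directly from PyTorch's allocator statistics; a minimal sketch of the kind of check Memory_Maximizer automates (not the utility's actual code):

```python
import torch

# "num_alloc_retries" counts cudaMalloc retries; a growing value usually means
# the batch size is too aggressive and throughput is silently suffering.
stats = torch.cuda.memory_stats()
retries = stats.get("num_alloc_retries", 0)
reserved_gb = torch.cuda.memory_reserved() / 1e9
print(f"cudaMalloc retries: {retries}, reserved memory: {reserved_gb:.2f} GB")
```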

6 - Sharding Strategies for FSDP (video + notebook):

FSDP has three different sharding strategies that let you customize the tradeoff between memory and communication, and thus, with a single line of code, move from DDP -> ZeRO-2 -> Full Shard. FSDP is thus becoming a universal training framework for models ranging from 100M to 1 Trillion+ parameters.

In this tutorial, you will learn how to change the FSDP sharding strategy, understand the relative tradeoffs, and see how much larger a model you can train on a fixed server simply by adjusting the sharding strategy.

Video and notebook are in the folder here
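Changing strategy is one argument on the constructor; a sketch, again using the lesson 1 placeholders:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

sharded_model = FSDP(
    model,
    auto_wrap_policy=auto_wrap_policy,
    # FULL_SHARD    - shard params, grads and optimizer states (default)
    # SHARD_GRAD_OP - shard grads and optimizer states only (ZeRO-2 style)
    # NO_SHARD      - no sharding, DDP-like behavior
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
)
```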

7 - Mixed Precision with FSDP (video + notebook + importable module):

FSDP lets you easily switch between datatypes (BFloat16, FP16, FP32) for your training via custom policies. You can thus control the datatype for your parameters, your gradient communication, and your buffers. This tutorial shows you how to do that, offers some best practices, and includes a BFloat16 checker module that will confirm both native GPU and network support for BFloat16.

Video and notebook are here
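A sketch of a BFloat16 policy (assuming your GPUs support it; `torch.cuda.is_bf16_supported()` covers the GPU side, while the repo's checker module also looks at network support):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision

assert torch.cuda.is_bf16_supported(), "GPU lacks native BFloat16 support"

bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,    # dtype used for parameters during compute
    reduce_dtype=torch.bfloat16,   # dtype used for gradient communication
    buffer_dtype=torch.bfloat16,   # dtype used for buffers
)

sharded_model = FSDP(
    model,
    auto_wrap_policy=auto_wrap_policy,
    mixed_precision=bf16_policy,
)
```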

8 - Saving and Loading models with FSDP Local State Dict (distributed checkpoints):

FSDP has two methods for saving and loading models. Full State Dict saves and loads a single file (.pt). By contrast, Local State Dict saves to a dedicated directory, with potentially thousands of smaller files plus a single .metadata file. The key benefit is that Local State Dict allows saving and loading of gigantic models where assembling a single file would exceed CPU memory.

This tutorial will show you how to work with local state / distributed checkpoints. The notebook has saving and loading functions you can directly leverage.

Video and notebook are here
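A rough sketch of the saving side (the distributed checkpoint module lived under `torch.distributed._shard.checkpoint` at the time; newer PyTorch releases expose it as `torch.distributed.checkpoint`; `checkpoint_dir/` is a placeholder path):

```python
import torch.distributed._shard.checkpoint as dist_cp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType

# Every rank writes its own shards into a shared checkpoint directory.
with FSDP.state_dict_type(sharded_model, StateDictType.LOCAL_STATE_DICT):
    state_dict = {"model": sharded_model.state_dict()}
    dist_cp.save_state_dict(
        state_dict=state_dict,
        storage_writer=dist_cp.FileSystemWriter("checkpoint_dir/"),
    )
```

Loading mirrors this with a FileSystemReader and load_state_dict; the notebook has complete save and load functions you can leverage directly.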

9 - Fine Tuning Models with FSDP (video + notebook):

FSDP currently does not support layer-level freezing for fine tuning (due to the sharding). However, this tutorial discusses how to use Child Fine Tuning, which has been shown to outperform vanilla fine tuning on a variety of language tasks.

Video and notebook are here
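For intuition only, the task-free flavor of child tuning boils down to masking gradients before the optimizer step; this is a hypothetical sketch (the helper name and details are illustrative, not the notebook's implementation):

```python
import torch

def apply_child_tuning_mask(model, reserve_p: float = 0.3):
    """Keep a random 'child network' of the gradients each step, scaled by 1/p."""
    for param in model.parameters():
        if param.grad is not None:
            mask = torch.bernoulli(torch.full_like(param.grad, reserve_p))
            param.grad.mul_(mask).div_(reserve_p)

# In the training loop:
#   loss.backward()
#   apply_child_tuning_mask(sharded_model, reserve_p=0.3)
#   optimizer.step()
```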

10 - End to End overview of FSDP in a working codebase (video):

This video tutorial does a 14 minute walkthrough of a codebase that is training a variety of models using FSDP. The goal of this video is to show the overall features of FSDP within a codebase. From there, you can dive into the detailed sub-tutorials on each specific topic of interest.

Video is here

11 - Using the new FSDP Rate Limiter to free up reserved memory and increase training speed:

With PyTorch Nightly 914 and higher, a new 'limit_all_gathers' param has been added to FSDP init, which controls the 'rate limiter' feature. When enabled, an internal rate limiter prevents over-buffering of GPU memory in some cases, and by reinvesting this newly freed memory you can potentially accelerate your training times. This 4-minute video walks you through the how and why of using the rate limiter for your training!

Video and notebook are here
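Enabling it is a single flag on the constructor; a sketch, using the lesson 1 placeholders:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

sharded_model = FSDP(
    model,
    auto_wrap_policy=auto_wrap_policy,
    limit_all_gathers=True,  # turn on the rate limiter
)
```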
