mosaicml / composer Goto Github PK

View Code? Open in Web Editor NEW

5.1K 50.0 407.0 14.42 MB

Supercharge Your Model Training

Home Page: http://docs.mosaicml.com

License: Apache License 2.0

Python 99.60% Dockerfile 0.37% Makefile 0.03%

deep-learning pytorch neural-networks ml-systems ml-efficiency ml-training machine-learning neural-network

composer's Introduction

Supercharge your Model Training

Deep Learning Framework for Training at Scale

[Website] - [Getting Started] - [Docs] - [We're Hiring!]

👋 Welcome

Composer is an open-source deep learning training library by MosaicML. Built on top of PyTorch, the Composer library makes it easier to implement distributed training workflows on large-scale clusters.

We built Composer to be optimized for scalability and usability, integrating best practices for efficient, multi-node training. By abstracting away low-level complexities like parallelism techniques, distributed data loading, and memory optimization, you can focus on training modern ML models and running experiments without slowing down.

We recommend using Composer to speedup your experimentation workflow if you’re training neural networks of any size, including:

Large Language Models (LLMs)
Diffusion models
Embedding models (e.g. BERT)
Transformer-based models
Convolutional Neural Networks (CNNs)

Composer is heavily used by the MosaicML research team to train state-of-the-art models like MPT, and we open-sourced this library to enable the ML community to do the same. This framework is used by organizations in both the tech industry and the academic sphere and is continually updated with new features, bug fixes, and stability improvements for production workloads.

🔑 Key Features

We designed Composer from the ground up for modern deep learning workloads. Gone are the days of AlexNet and ResNet, when state-of-the-art models could be trained on a couple of desktop GPUs. Today, developing the latest and greatest deep learning models often requires cluster-scale hardware — but with Composer’s help, you’ll hardly notice the difference.

The heart of Composer is our Trainer abstraction: a highly optimized PyTorch training loop designed to allow both you and your model to iterate faster. Our trainer has simple ways for you to configure your parallelization scheme, data loaders, metrics, loggers, and more.

Scalability

Whether you’re training on 1 GPU or 512 GPUs, 50MB or 10TB of data - Composer is built to keep your workflow simple.

FSDP: For large models that are too large to fit on GPUs, Composer has integrated PyTorch FullyShardedDataParallelism into our trainer and made it simple to efficiently parallelize custom models. We’ve found FSDP is competitive performance-wise with much more complex parallelism strategies. Alternatively, Composer also supports standard PyTorch distributed data parallelism (DDP) and Deepspeed execution.
Elastic sharded checkpointing: Save on eight GPUs, resume on sixteen. Composer supports elastic sharded checkpointing, so you never have to worry if your sharded saved state is compatible with your new hardware setup.
Data streaming: Working with large datasets? Download datasets from cloud blob storage on the fly by integrating with MosaicML StreamingDataset during model training.

Customizability

Other high-level deep learning trainers provide simplicity at the cost of rigidity. When you want to add your own features, their abstractions get in your way. Composer, on the other hand, provides simple ways for you to customize our Trainer to your needs.

Fig. 1: Composer’s training loop has a series of events that occur at each stage in the training process. Callbacks are functions that users write to run at specific events. For example, our Learning Rate Monitor Callback logs the learning rate at every BATCH_END event.

Callbacks: Composer’s callback system allows you to insert custom logic at any point in the training loop. We’ve written callbacks to monitor memory usage, log and visualize images, and estimate your model’s remaining training time, to name a few. This feature is popular among researchers who want to implement and experiment with custom training techniques.
Speedup algorithms: We draw from the latest research to create a collection of algorithmic speedups. Stack these speedups into MosaicML recipes to boost your training speeds. Our team has open-sourced the optimal combinations of speedups for different types of models.
- 8x speedup: Stable Diffusion
  - $200k original SD2 cost —> $50k (Blog)
- 7x speedup: ResNet-50 on ImageNet
  - 3h33m —> 25m on 8xA100 (Blog)
- 8.8x speedup: BERT-Base Pretraining
  - 10h —> 1.13h on 8xA100 (Blog)
- 5.4x speedup: DeepLab v3 on ADE20K
  - 3h30m —> 39m on 8xA100 (Blog)

Better workflows

Composer is built to automate away low-level pain points and headaches so you can focus on the important (and fun) parts of deep learning and iterate faster.

Auto-resumption: Failed training run? Have no fear — just re-run your code, and Composer will automatically resume from your latest saved checkpoint.
CUDA OOM Prevention: Say goodbye to out-of-memory errors. Set your microbatch size to “auto”, and Composer will automatically select the biggest one that fits on your GPUs.
Time Abstractions: Ever messed up your conversion between update steps, epochs, samples, and tokens? Specify your training duration with custom units (epochs, batches, samples, and tokens) in your training loop with our Time class.

Integrations

Integrate with the tools you know and love for experiment tracking and data streaming.

Cloud integrations: Our Checkpointing and logging features have first-class support for remote storage and loading from Cloud bucket (OCI, GCP, AWS S3).
Experiment tracking: Weights and Biases, MLFlow, CometML, and neptune.ai — the choice is yours, easily log your data to your favorite platform.

🚀 Getting Started

📍Prerequisites

Composer is designed for users who are comfortable with Python and have basic familiarity with deep learning fundamentals and PyTorch.

Software requirements: A recent version of PyTorch.

Hardware requirements: System with CUDA-compatible GPUs (AMD + RoCM coming soon!). Composer can run on CPUs, but for full benefits, we recommend using it on hardware accelerators.

💾 Installation

Composer can be installed with pip:

pip install mosaicml

To simplify the environment setup for Composer, we also provide a set of pre-built Docker images. We highly recommend you use our Docker images.

🏁 Quick Start

Here is a code snippet demonstrating our Trainer on the MNIST dataset.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

from composer import Trainer
from composer.models import ComposerClassifier
from composer.algorithms import LabelSmoothing, CutMix, ChannelsLast

class Model(nn.Module):
    """Toy convolutional neural network architecture in pytorch for MNIST."""

    def __init__(self, num_classes: int = 10):
        super().__init__()

        self.num_classes = num_classes

        self.conv1 = nn.Conv2d(1, 16, (3, 3), padding=0)
        self.conv2 = nn.Conv2d(16, 32, (3, 3), padding=0)
        self.bn = nn.BatchNorm2d(32)
        self.fc1 = nn.Linear(32 * 16, 32)
        self.fc2 = nn.Linear(32, num_classes)

    def forward(self, x):
        out = self.conv1(x)
        out = F.relu(out)
        out = self.conv2(out)
        out = self.bn(out)
        out = F.relu(out)
        out = F.adaptive_avg_pool2d(out, (4, 4))
        out = torch.flatten(out, 1, -1)
        out = self.fc1(out)
        out = F.relu(out)
        return self.fc2(out)

transform = transforms.Compose([transforms.ToTensor()])
dataset = datasets.MNIST("data", train=True, download=True, transform=transform)
train_dataloader = DataLoader(dataset, batch_size=128)

trainer = Trainer(
    model=ComposerClassifier(module=Model(), num_classes=10),
    train_dataloader=train_dataloader,
    max_duration="2ep",
    algorithms=[
        LabelSmoothing(smoothing=0.1),
        CutMix(alpha=1.0),
        ChannelsLast(),
    ],
)
trainer.fit()

Next, check out our Getting Started Colab for a walk-through of Composer’s main features. In this tutorial, we will cover the basics of the Composer Trainer:

Dataloader
Trainer
Optimizer and Scheduler
Logging
Training a baseline model
Speeding up training

📚 Learn more

Once you’ve completed the Quick Start, you can go through the below tutorials or our documentation to further familiarize yourself with Composer.

If you have any questions, please feel free to reach out to us on our Community Slack!

Here are some resources actively maintained by the Composer community to help you get started:

Resource	Details
Training BERTs with Composer and 🤗	A Colab Notebook showing how to train BERT models with Composer and 🤗!
Pretraining and Finetuning an LLM Tutorial	A tutorial from MosaicML’s LLM Foundry, using MosaicML Composer, StreamingDataset, and MCLI on training and evaluating LLMs.
Migrating from PyTorch Lightning	A tutorial is to illustrating a path from working in PyTorch Lightning to working in Composer.
Finetuning and Pretraining HuggingFace Models	Want to use Hugging Face models with Composer? No problem. Here, we’ll walk through using Composer to fine-tune a pretrained Hugging Face BERT model.
Building Speedup Methods	A Colab Notebook showing how to build new training modifications on top of Composer

🛠️ For Best Results, Use within the Databricks & MosaicML Ecosystem

Composer can be used on its own, but for the smoothest experience we recommend using it in combination with other components of the MosaicML ecosystem:

Mosaic AI training (MCLI)- Our proprietary Command Line Interface (CLI) and Python SDK for orchestrating, scaling, and monitoring the GPU nodes and container images executing training and deployment. Used by our customers for training their own Generative AI models.
- To get started, reach out here and check out our Training product pages
MosaicML LLM Foundry - This open source repository contains code for training, finetuning, evaluating, and preparing LLMs for inference with Composer. Designed to be easy to use, efficient and flexible, this codebase is designed to enable rapid experimentation with the latest techniques.
MosaicML StreamingDataset - Open-source library for fast, accurate streaming from cloud storage.
MosaicML Diffusion - Open-source code to train your own Stable Diffusion model on your own data. Learn more via our blogs: (Results , Speedup Details)

🏆 Project Showcase

Here are some projects and experiments that used Composer. Got something to add? Share in our Community Slack!

MPT Foundation Series: Commercially usable open source LLMs, optimized for fast training and inference and trained with Composer.
Mosaic Diffusion Models: see how we trained a stable diffusion model from scratch for <$50k
replit-code-v1-3b: A 2.7B Causal Language Model focused on Code Completion, trained by Replit on Mosaic AI training in 10 days.
BabyLLM: the first LLM to support both Arabic and English. This 7B model was trained by MetaDialog on the world’s largest Arabic/English dataset to improve customer support workflows (Blog)
BioMedLM: a domain-specific LLM for Bio Medicine built by MosaicML and Stanford CRFM

💫 Contributors

Composer is part of the broader Machine Learning community, and we welcome any contributions, pull requests, or issues!

To start contributing, see our Contributing page.

P.S.: We're hiring!

❓FAQ

What is the best tech stack you recommend when training large models?
- We recommend that users combine components of the MosaicML ecosystem for the smoothest experience:
  - Composer
  - StreamingDataset
  - MCLI (Databricks Mosaic AI Training)
How can I get community support for using Composer?
- You can join our Community Slack!
How does Composer compare to other trainers like NeMo Megatron and PyTorch Lightning?
- We built Composer to be optimized for both simplicity and efficiency. Community users have shared that they enjoy Composer for its capabilities and ease of use compared to alternative libraries.
How do I use Composer to train graph neural networks (GNNs), or Generative Adversarial Networks (GANs), or models for reinforcement learning (RL)?
- We recommend you use alternative libraries for if you want to train these types of models - a lot of assumptions we made when designing Composer are suboptimal for GNNs, RL, and GANs

✍️ Citation

@misc{mosaicml2022composer,
    author = {The Mosaic ML Team},
    title = {composer},
    year = {2021},
    howpublished = {\url{https://github.com/mosaicml/composer/}},
}

composer's People

Contributors

Stargazers

Watchers

Forkers

hanlint brando90 jbloxham landanjs ajaysaini725 dblalock corymosaicml jacobfulano anisehsani zeeroocooll averylamp siriuslee abhi-mosaic zhanglianng ravi-mosaicml lupesko rishikumarray stanford-crfm cronblocks-ai techthiyanes af-74413592 swagshaw neildg celestialized sidney1994 murthyn anhnguyendepocen christindbose stjordanis ai-hub-deep-learning-fundamental joskid zonezone12 ianworley dlmgary moinnadeem murilo elvishelvis vwxyzjn mrcodechef geoffreyporto mcneela urnotpopcorn huaxingxu mathpopo chatchanan-v liangrongzhi136 xuliji pavithranrao joeupwu strategist922 pikus16 ma-zx geothinking zenithez imflash217 fellowship ofirpress icodein tuozhanjun fdoperezi mbrukman azgo14 psyrtsov arslan-z ronenj changdong-zhou a-jacobson razaulazam alextrott16 knighton dskhudia mvpatel2000 zhouleidcc milocress ejyuen ishanashastri growlix codestar12 vladd-i renesugar codeaudit kaihsiangl arita37 navyjeff caorenzhi iainwh 9cat zinyurita bandish-shah florescl nqn dylankipkemoi neleon linden-li hurricanejin hongbo-miao shinyfoil behradtoghi hanrui-wang bdutta19

composer's Issues

Implement ESAM

Efficient SAM (anonymous, 2022) is a proposed duo of SAM optimizations to reduce the throughput hit of SAM. The composer repo already supports an interval hyperparameter which has empirically been found to maintain much of the quality improvement of SAM while sacrificing little throughput, but it would be interesting to see if ESAM could enable setting lower values of interval.

WandB logger should have `every_n_batches`

In order to space out calls to the wandb client, we should support the same frequency settings as the FileLogger.

Training run processes do not stop at the end of training

Environment

mosaicml/research:latest docker container on 3080s.

To reproduce

Command:
python examples/run_mosaic_trainer.py -f composer/yamls/models/resnet50.yaml --loggers wandb --loggers.wandb.entity mosaic-ml --loggers.wandb.project landan-random --callbacks speed_monitor lr_monitor --callbacks.speed_monitor.window_size 100

I believe Cory saw hanging at the end of the CIFAR-10 benchmark as well, so that may be sufficient to reproduce the bug.

Expected behavior

All (sub)processes to be killed at the end of training.

Additional context

Training runs hang at the end of training. This means the processes will continue to run although training is complete.

Remove deferred logging

With #65, the global rank is now known when the python process starts. Thus, for rank zero loggers, it is not necessary to wait until training start to initialize the logger. Instead, loggers should initialize on the INIT event, and process all logging calls immediately.
By convention, there will not be any calls to the loggers before the init event.

Auto-TCP Port Selection by default

Steps to reproduce the behavior:

Run .fit() twice on the same Trainer in a script or notebook (not cleaned up port usage for torch.distributed)

Expected behavior

Ideally the Trainer by default won't use a static port in TCPStore and instead select an open port to use for torch.distributed coordination.

Synthetic Data Generation for LM Datasets

LM datasets do not support synthetic data. It would be great to add support for it.

DDP Spawn `can only test a child process` error

** To reproduce

Steps to reproduce the behavior:
Produces a traceback in DDP spawn (cpu only), where workers crash (still trains fine)

from composer.trainer import TrainerHparams, Trainer

hparams = TrainerHparams.create('composer/yamls/models/classify_mnist_cpu.yaml')
hparams.set_datadir("~/datasets")
trainer = Trainer.create_from_hparams(hparams)

trainer.fit()

/home/avery/mosaic/composer/composer/utils/ddp.py:20: UserWarning: DDPDefaultValueWarning: RANK env var not set and process group not initialized; returning 0 for global rank.
  warnings.warn(f"DDPDefaultValueWarning: {env_var} env var not set"
Epoch 1:   3%|▎         | 1/29 [00:00<00:21,  1.32it/s]                                                                                      /home/avery/mosaic/composer/composer/utils/ddp.py:20: UserWarning: DDPDefaultValueWarning: WORLD_SIZE env var not set and process group not initialized; returning 1 for world size.
  warnings.warn(f"DDPDefaultValueWarning: {env_var} env var not set"
Epoch 1:   3%|▎         | 1/29 [00:00<00:21,  1.32it/s, loss/train=2.3191]                                                                   Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7ff33771e5e0>
Traceback (most recent call last):
  File "/home/avery/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1328, in __del__
    self._shutdown_workers()
  File "/home/avery/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1320, in _shutdown_workers
    if w.is_alive():
  File "/usr/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7ff33771e5e0>
Traceback (most recent call last):
  File "/home/avery/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1328, in __del__
    self._shutdown_workers()
  File "/home/avery/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1320, in _shutdown_workers
    if w.is_alive():
  File "/usr/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

Fix flaky convergence unit test

The trainer convergence test is flaky right now. This is likely due to the fact that we are using a CNN for the test which does significant dimensionality reduction and is thus hard to reason about in terms of linear separability of gaussian data. A fix would be to convert the test into training logistic regression.

** To reproduce

Run the test many times on the same code (seems to fail once every ~50-100 times)

Expected behavior

The test behavior should be consistent (i.e. if it passes once on some code then it should always pass on that code).

Add Memory Monitor Callback

🚀 Feature Request

Add a callback to monitor memory statistics during training such as memory reserved by the caching allocator, number of malloc calls, number of free calls, etc...

Motivation

Having memory allocator statistics during training available is very helpful for debugging issues such as OOM and memory leaks.

Implementation

See: https://pytorch.org/docs/stable/generated/torch.cuda.memory_stats.html#torch.cuda.memory_stats for the API that gives this information.

Symbolic links in models directory are broken

For example: https://github.com/mosaicml/composer/tree/main/composer/models/resnet50

The links seem to point to themselves.

Proper seeding for DDP

If the seed is not set in hparams, it is randomly selected in __init__. Each DDP process, when it starts up, gets a different random seed.

The seed from the rank 0 process is saved in checkpoints

When resuming from a checkpoint, the seed from the rank 0 process is restored across all DDP processes.
This leads to inconsistent behavior, since the non-rank-0 process now resume with a different seed than they first trained with.

To fix: add the seed to the RNG state, and sync across all DDP processes

Fix the `load_model` test for unet and GPT

The unet and gpt models currently fail on tests/test_load.py due to something about the mock model.
They likely need a mock model of the appropriate type.
Need to debug and fix these tests.

WandB error due to multiple artifacts with the same ID

When running a baseline resnet50 model on imagenet, I encountered this error:

wandb: ERROR Error while calling W&B API: Error 1062: Duplicate entry '6394579-1' for key 'unique_artifact_collection_membership_version' (<Response [409]>)
Exception in thread Thread-7:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/wandb/filesync/step_upload.py", line 50, in _thread_body
    self._handle_event(event)
  File "/usr/local/lib/python3.8/dist-packages/wandb/filesync/step_upload.py", line 79, in _handle_event
    self._maybe_commit_artifact(job.artifact_id)
  File "/usr/local/lib/python3.8/dist-packages/wandb/filesync/step_upload.py", line 161, in _maybe_commit_artifact
    self._api.commit_artifact(artifact_id)
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/internal_api.py", line 2235, in commit_artifact
    response = self.gql(mutation, variable_values={"artifactID": artifact_id})
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/lib/retry.py", line 102, in __call__
    result = self._call_fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/internal_api.py", line 147, in execute
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.8/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/internal_api.py", line 141, in execute
    return self.client.execute(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 52, in execute
    result = self._get_result(document, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 60, in _get_result
    return self.transport.execute(document, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/transport/requests.py", line 39, in execute
    request.raise_for_status()
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://api.wandb.ai/graphql

I've asked the WandB folks and they think it's from an attempted upload of an artifact with the same ID as another. The recent addition of artifact uploading from run_directory seems to be causing this, so PR #89 will disable it by default, but we need to verify that artifact uploads are working as expected.

Algorithm Composability API

When running tests, we validate that algorithms run on each model type. Some algorithms are not compatible with some models (e.g. NLP algs on image classification models), so we manually hard-code this in the tests. It would be helpful to have a first-class API to get which models support which algorithms, and which algorithms support which models.

The engine could also use this information to perform a static analysis to detect runtime issues before they arise.

One possible design could be to have a ModelType that would work like this:

class ModelType(StringEnum):
    CLASSIFICATION = "Classification"
    NLP = "Nlp"
    ...

class BaseMosaicModel:
    model_type: ModelType   # would be set on each model
    ...

class Algorithm:
    @classmethod
    def get_supported_model_types(cls):
        return list(ModelType)    # can be overridden on each algorithm

Implement `predict()` in `Trainer`

The Trainer class should have a predict() function as a convenience for a user who wants to run inference on a trained model.

Lazy loading of non-core dependencies

All non-core dependencies should be lazily loaded, so one can use the library without having to install composer[all]

This likely means that functions that depend on a non-core dependency should import that dependency inside the function, not at module-level.

Blob Store Uploading for the Run Directory

🚀 Feature Request

Add callbacks to upload the run directory to blob stores (s3, gcs)

Motivation

Currently, the run directory is only saved locally (or, uploaded to WANDB, but we're running into issues with that). When a K8S pod dies, we lose the run directory. We store logs, checkpoints, traces, etc... in the run directory, so this should be persisted.

[Optional] Implementation

This can be implemented via a callback, quite trivially. It would be best to delegate the directory monitoring / uploading to a subprocess (not sub thread), as not to use GIL time in the main training loop. While network I/O happens outside the GIL, other work related to uploading (e.g. computing file hashes) does occur within the GIL, so it would be best to offload this. However, an initial implementation can use a background thread.

For cross-cloud compatibility, going to use apache libcloud.

Remove `soft_cross_entropy`

Motivation

As of this update in pytorch: pytorch/pytorch#61044, we no longer need our implementation of soft_cross_entropy in composer/models/loss.py and should remove it in favor of the one in pytorch.

DeepSpeed Integration

🚀 Feature Request

Integration with DeepSpeed. The V0 use case is targeted only on data parallelism strategies like ZeRO.

Motivation

Necessary to train GPT models above 1.3B parameters.

Launch DDP processes before initializing trainer

🚀 Feature Request

Our current trainer relaunches itself N times to create N processes for DDP. The problem with this is that DDP does so by rerunning the very script that launched the trainer in the first place. This is problematic for any user invoking DDP via a custom script, and also for testing.

The canonical solution to this problem is to provide a launch executable that wraps a user provided script to initialize a trainer. The launch executable runs the script N times to create N processes. This appears to be the direction that many ML frameworks, including DeepSpeed, are moving towards.

Motivation

This will simplify testing and allow us to accurately calculate coverage metrics. This is also essentially a prerequisite to integrating the trainer with DeepSpeed, which also uses an executable.

Add Colab Example

Add Example Jupyter notebook to the examples folder
Add "Open in Colab" to the README.md

Errors are not printed to stdout when using multi-gpu training

Anytime an error occurs while I am using multi-gpu training, the job crashes, but the error is not printed. I need to run the experiment with a single GPU to find what the error was.

Is there a way to fix this? It makes determining issue very difficult.

I can try to create an example with the current release if needed.

Python 3.7 support (for use with colab)

🚀 Feature Request

add support for python 3.7 build

Motivation

I wanted to play with composer but was not able to install via pip because google Colab runs in a python 3.7 environment

Test support for Pytorch v1.10

Run regression tests on pytorch v1.10 (https://pytorch.org/blog/pytorch-1.10-released/)

Add GLUE Benchmark for NLP

Remove curriculum learning algorithm

This was renamed to seq_length_warmup but for some reason curriculum_learning.py still exists.

Remove seed from state

The seed is stored in the State object in the Trainer but instead it should be stored in the checkpoint_rng object.

Note that right now, if the user does not set a seed on trainer init, then a different seed is created on each process but only the rank 0 seed its saved in the checkpoint. We want to enforce each device using the same seed which will be addressed by #12

More efficient microbatch DDP sync behavior when `find_unused_parameters` is True

Ordinarily, when training with gradient accumulation, we only need to do a DDP sync on the final microbatch, because synced gradients aren't needed until the optimizer runs at the end of the batch. However, the find_unused_parameters flag indicates that some algorithms (such as stochastic depth) may cause not all gradients to be generated. Critically, the set of unused parameters may vary between microbatches. Syncing on only the last microbatch may cause some parameters used in earlier microbatches but unused in the final microbatch to not be properly synced - resulting in severe quality degradations.

Our current solution to this issue is just to sync all microbatches when the find_unused_paramaters flag is set, but this incurs a throughput penalty of about 5%, depending on gradient accumulation setting. We would like to investigate whether it is possible to sync all parameters used in any microbatch, to avoid this throughput penalty.

Support `loss.backward(create_graph=True)` in the Trainer

Methods such as AdaHessian or similar need support for this feature.

Override `max_epochs` on resume from checkpoint with SSR

When resuming from a checkpoint, max_epochs currently defaults to the original max_epochs which prevents users from being able to train for more than the original max_epochs when resuming from a checkpoint.

It would be good to be able to resume from checkpoint and train for more epochs than the original max_epochs. However, we need to come up with a scheme to make this work with scale_schedule_ratio because scale schedule ratios are computed assuming that max_epochs does not change.

How should we go about handling this?

Support for subset sampler

🚀 Feature Request

Add support for training on only a subset of the dataset on each epoch.

Motivation

During testing and profiling, it can be important to skip over the first epoch (e.g. to ignore io bandwidth), but it is usually not needed to train over the entire dataset. Only a small subset is needed.

Implementation

Add support for https://pytorch.org/docs/stable/_modules/torch/utils/data/sampler.html#SubsetRandomSampler.
It will be a bit more complicated to make a DDP version of this.

Checkpointing with DeepSpeed

Synthetic Data Generation for Brats Dataset

Brats does not support synthetic data. It would be great to add support for it.

GPT models differ between code and docs

The gpt2 models in the code (38m, 85m, 114m) are different from what's in the docs (52m, 83m, 125m). Also p sure d_attn for GPT2-52m is incorrect in the table in the docs.

Enable small "smoke test" runs

🚀 Feature Request

Add a --smoke-test flag or something similar.

Motivation

I would like to be able to start a run that simply checks one step of training and one step of validation to ensure as well as possible that the training pipeline is working. This will make it easier when running many runs in parallel, where a small bug in the validation loop can waste a lot of time and compute resources.

Synthetic Data Generation

When testing, benchmarking, smoke testing, and profiling, it is helpful to be able to easily get synthetic data that can then be passed into the model.forward() function for any type of model. However, it is impossible to automatically read the input (tensor) shape off of the model graph, so we are currently manually specifying the input shape wherever we perform synthetic passes (e.g. in tests, when constructing the synthetic dataset, etc...)

Because different models have different input formats, it would be difficult to describe this via a static parameter such as input shape -- e.g. nlp models use an input dictionary. As such, generating a synthetic batch would be preferred.

Option A: Add `get_synethic_batch(batch_size)` on each BaseMosaicModel:

Proposed Example

class BaseMosaicModel:
    @abc.abstractclass
    def get_synethic_batch(self, batch_size: int, synthetic_data_distribution: SynethicDataDistributionEnum) -> Batch:
        # for ease of subclass implementation, a set of helper methods would be available
        pass

Then, anything that needs to perform a forward pass could do:

def my_profiling_script(model: BaseMosaicModel):
    batch = model.get_synethic_batch(batch_size=10)  # returns a batch size of 10 samples that the model can train on
    output = model(batch)

We could also generalize the synthetic dataset to do something like:

class SyntheticDataset:
   def __init__(self, model):
        self.model = model
        
   def __getitem__(self, i):
         return self.model.get_synthetic_batch(1)

Option B: Add a SyntheticDatasetGenerator

Instead of storing how to generate synthetic batch information on each model, this could instead be stored in a common registry-like design. For example:

class SyntheticDatasetGenerator:
    def get_synethic_dataset(self, model, *args, **kwargs):
        if isinstance(model, MNIST):
             return SyntheticDataset(input_shape=(1, 28, 28), *args, **kwargs)
        if isinstance(model, Resent):
             return SyntheticDataset(input_shape=(3, 224, 224), *args, **kwargs)

This option would require generalization of the SyntheticDataset to support NLP data.

Cityscapes + Deeplabv3 benchmark

🚀 Feature Request

Add a semantic segmentation benchmark based on the Cityscapes dataset and the Deeplabv3 architecture.

Motivation

Prior work

Our current segmentation benchmark is based on the Multimodal Brain Tumor Segmentation Challenge (BraTS) and the Unet architecture. There are a couple of reasons why we may want to add another segmentation benchmark:

Dataset size: BraTS has lower resolution (192 x 160) and a smaller number of training images (500) than we expect from other segmentation datasets. As of now, we can train on BraTS in ~3 minutes
Recognition: BraTS does not seem as recognizable in the ML community, so it may be difficult for people to interpret our results. Also, a frequently used dataset would be beneficial to the community since proposed methods can be easily compared to prior work using the dataset.
Domain: this may just be a me thing, but I think it would be helpful to have a dataset in a similar domain to the ImageNet benchmark. This could help in determining whether the success/failure of a method depends on the task or the domain.

Cityscapes

Cityscapes appears to be the second most common semantic segmentation benchmark (behind Pascal VOC), so evaluating methods on Cityscapes should be relevant to the community. Cityscapes image resolution is 1024 x 2048 and the training set contains 2,975 densely and 20,000 coarsely annotated images (not as many as we would like, but a start). Alternatively, we could use ADE20k or Pascal VOC segmentation if others feel strongly towards either dataset.

Deeplabv3

It would be easier to benchmark with Deeplabv3 since the hyperparameters and target performance on Cityscapes are known. As of now, we have no numbers on training time, so this will be unknown. For Unet, we would need to tune hyperparameters and would not be sure if we are achieving an expected performance.

Implementation

Simple implementation outline, but should be made more detailed:

Cityscapes DataloaderSpec: will try to use torchvision.datasets.cityscapes if it fits our use case
Deeplabv3 BaseMosaicModel: will try to use torchvision.models.segmentation.deeplabv3_resnet101 if it fits our use case
Implement intersection-over-union (IoU) metric for evaluation: should be easyish?
Dataset and model throughput profiling
Dataset and model card

Use GPUs in tests

🚀 Feature Request

We have a number of unit tests (ex. tests/trainer/test_trainer.py and tests/trainer/test_checkpoint.py which use GPUs as a part of the test. However, these tests are not run as a part of the GitHub actions tests, which results in the potential for GPU-related bugs. We should have a system in place which runs GPU tests before code can be merged into dev.

Motivation

There have been GPU-specific bugs in the past that were not caught because GPU tests do not run in our unit testing suite.

Implementation

Can use CircleCI for this.

ZeRO Configuration with DeepSpeed

Supporting stage 3 is expected to be non-trivial, since we can no longer store a complete copy of the model on each node.

Support multiple `eval` datasets

🚀 Feature Request

For fine-tuning tasks (e.g. GLUE) and also many vision experiments, need to support multiple eval datasets. The metrics needed could be different across different datasets.

[Optional] Implementation

support for eval_dataloaders as a List
during the eval loop, run through multiple dataloaders and log the metrics for each dataset
to support different metrics, may need to either (1) store the metric with the datasets, or (2) have the model's metric function return different metrics depending on the dataset.

Add supported docker image matrix to docker/README.md

Add in the officially supported docker images to docker/README.md

Originally posted by @ravi-mosaicml in #66 (comment)

Precision Handling Support with DeepSpeed

DeepSpeed currently crashes if you try using it to train RN50 with FP16 (FP32 works fine). The problem is that the model needs the input tensor to also be in FP16, but the dataloader does nothing to change the dtype of the batches it returns according to the current precision. This isn't a problem for NLP models because the dtypes of NLP batches are generally all integer types anyways, so those models already handle casting the batch types (or something like that, I'm a bit unclear on exactly what's happening).

My proposed fix is fairly hacky. I'd like to avoid having to add code to dataloaders, datasets, and models to handle FP16 precision settings. Instead, I'd like to have the trainer itself handle casting batches to FP16 as appropriate. The hacky part of this is that the trainer needs to be able to determine when this cast should be done, as for ImageNet, but not for NLP. There's no perfect way to do this. I'm going to try having it just cast any FP32 tensor it sees in loaded batches to FP16.

Fix model surgery so `Event.INIT` can be removed from Trainer `init`

Right now model surgery does not work after the model parameters have been passed to an optimizer. As a result, we call the Event.INIT (which is used by model-modifying methods such as Blurpool and SqueezeExcite) call back in the Trainer __init__ before the optimizer is constructed rather than in the training loop.

This yields API complications because the user cannot pass a pre-constructed optimizer into the Trainer __init__.

We need to get surgery working properly and test it on Blurpool and SqueezeExcite to make sure there are no regressions.

`eval_only` flag

🚀 Feature Request

For post-hoc measurements on different datasets, we want to be able to load a checkpoint and run --eval_only.

[Optional] Implementation

Add a --eval_only flag that loads from checkpoint and only runs eval. User would need to specify a new dataset/dataloader that differs from the checkpointed hparams.

Add `callback.run_event()`

Add a helper method in callback for run_event. This helper method would then call the correct method on callback. It would look something like:

class Callback:
  def run_event(self, state; State, logger; Logger, event: Event):
    if event == Event.TRAINING_START:
        self.training_start(state, logger)
    if event == Event.BEFORE_FORWARD:
        self.before_forward(state, logger)
    ...

Then, the engine would do callback.run_event(state, logger, event).

This would help clean up code in the following places:

RankZeroCallback: Instead of monkeypatching each callback function, it would simply override run_event.
RankZeroLogger: No need for a private _training_start method that is different from all of the other callbacks
Checkpointing tests: The EventCounterCallback basically does this, via monkeypatching

Configure Jenkins

Enable github actions for:

pytest CPU runner
formatting and type checking (yapf, pyright)
docker builds
docs builds

Fix engine compile to ensure `selective_backprop` is first algorithm in `AFTER_DATALOADER` event

selective_backprop needs to be the first algorithm used in the AFTER_DATALOADER event because it prunes data samples and we only want to run other data-modification algorithms on the pruned set of data samples.

Graceful Trainer Cleanup upon `KeyboardInterrupt`

Trainers do not cleanup properly when KeyboardInterrupted. Should cleanup the model/possibly keep the model in a state where it can be partially trained but evaluated .fit() is exited early. Probably should gracefully exit and cleanup for interactive composer users

** To reproduce

from composer import algorithms, trainer, Trainer
from composer.core.types import Precision

hparams = trainer.load("classify_mnist_cpu")  # loads from composer/yamls/models/classify_mnist_cpu.yaml
hparams.algorithms = algorithms.load_multiple("blurpool", "label_smoothing")

# edit other properties in the hparams object
hparams.precision = Precision.FP32
hparams.grad_accum = 2
hparams.set_datadir("~/datasets")

trainer = Trainer.create_from_hparams(hparams)
trainer.fit()

then CTRL-C, Keyboard Escape
then

trainer.fit()

Produces

>>> trainer.fit()
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
    send_bytes(obj)
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
    send_bytes(obj)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
BrokenPipeError: [Errno 32] Broken pipe
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
  File "/home/avery/mosaic/composer/composer/trainer/trainer.py", line 356, in fit
BrokenPipeError: [Errno 32] Broken pipe
    self._train_loop()
  File "/home/avery/mosaic/composer/composer/trainer/trainer.py", line 488, in _train_loop
    assert isinstance(original_model, BaseMosaicModel)
AssertionError

Auto Grad Accum

🚀 Feature Request

The trainer can automatically determine the appropriate grad_accum to use based off of hardware properties.

Motivation

It is cumbersome to manually specify the grad accum for every hardware and model.

Implementation

while True:
  try:
      train_model()
  except CudaOOOM:
      state.grad_accum += 1

run_mosaic_trainer.py to print help text when invoked without arguments

🚀 Feature Request

Instead of getting a ValueError (and stack trace) when running run_mosaic_trainer.py without arguments, it might be a bit more friendly to print out the help text from -h.

Motivation

Getting a good first impression (and not feeling accused) as a CLI user is good practice, and we can print out the help text easily.

mosaicml / composer Goto Github PK

composer's Introduction

Supercharge your Model Training

Deep Learning Framework for Training at Scale

[Website] - [Getting Started] - [Docs] - [We're Hiring!]

👋 Welcome

🔑 Key Features

Scalability

Customizability

Better workflows

Integrations

🚀 Getting Started

📍Prerequisites

💾 Installation

🏁 Quick Start

📚 Learn more

🛠️ For Best Results, Use within the Databricks & MosaicML Ecosystem

🏆 Project Showcase

💫 Contributors

❓FAQ

✍️ Citation

composer's People

Contributors

Stargazers

Watchers

Forkers

composer's Issues

Environment

To reproduce

Expected behavior

Additional context

Expected behavior

Expected behavior

🚀 Feature Request

Motivation

Implementation

🚀 Feature Request

Motivation

[Optional] Implementation

Motivation

🚀 Feature Request

Motivation

🚀 Feature Request

Motivation

🚀 Feature Request

Motivation

🚀 Feature Request

Motivation

Implementation

🚀 Feature Request

Motivation

Synthetic Data Generation

Option A: Add get_synethic_batch(batch_size) on each BaseMosaicModel:

Proposed Example

Option B: Add a SyntheticDatasetGenerator

🚀 Feature Request

Motivation

Prior work

Cityscapes

Deeplabv3

Implementation

🚀 Feature Request

Motivation

Implementation

🚀 Feature Request

[Optional] Implementation

🚀 Feature Request

[Optional] Implementation

🚀 Feature Request

Motivation

Implementation

🚀 Feature Request

Motivation

Recommend Projects

Recommend Topics

Recommend Org

Option A: Add `get_synethic_batch(batch_size)` on each BaseMosaicModel: