add CONTRIBUTING guidelines

WARNING: THIS PROJECT IS CURRENTLY IN MAINTENANCE MODE, DUE TO COMPANY REORGANIZATION.

Bagua is a deep learning training acceleration framework for PyTorch developed by AI platform@Kuaishou Technology and DS3 Lab@ETH Zürich. Bagua currently supports:

Advanced Distributed Training Algorithms: Users can extend the training on a single GPU to multi-GPUs (may across multiple machines) by simply adding a few lines of code (optionally in elastic mode). One prominent feature of Bagua is to provide a flexible system abstraction that supports state-of-the-art system relaxation techniques of distributed training. So far, Bagua has integrated communication primitives including
- Centralized Synchronous Communication (e.g. Gradient AllReduce)
- Decentralized Synchronous Communication (e.g. Decentralized SGD)
- Low Precision Communication (e.g. ByteGrad)
- Asynchronous Communication (e.g. Async Model Average)
Cached Dataset: When data loading is slow or data preprocessing is tedious, they could become a major bottleneck of the whole training process. Bagua provides cached dataset to speedup this process by caching data samples in memory, so that reading these samples after the first time becomes much faster.
TCP Communication Acceleration (Bagua-Net): Bagua-Net is a low level communication acceleration feature provided by Bagua. It can greatly improve the throughput of AllReduce on TCP network. You can enable Bagua-Net optimization on any distributed training job that uses NCCL to do GPU communication (this includes PyTorch-DDP, Horovod, DeepSpeed, and more).
Performance Autotuning: Bagua can automatically tune system parameters to achieve the highest throughput.
Generic Fused Optimizer: Bagua provides generic fused optimizer which improve the performance of optimizers by fusing the optimizer .step() operation on multiple layers. It can be applied to arbitrary PyTorch optimizer, in contrast to NVIDIA Apex's approach, where only some specific optimizers are implemented.
Load Balanced Data Loader: When the computation complexity of samples in training data are different, for example in NLP and speech tasks, where each sample have different lengths, distributed training throughput can be greatly improved by using Bagua's load balanced data loader, which distributes samples in a way that each worker's workload are similar.
Integration with PyTorch Lightning: Are you using PyTorch Lightning for your distributed training job? Now you can use Bagua in PyTorch Lightning by simply set strategy=BaguaStrategy in your Trainer. This enables you to take advantage of a range of advanced training algorithms, including decentralized methods, asynchronous methods, communication compression, and their combinations!

Its effectiveness has been evaluated in various scenarios, including VGG and ResNet on ImageNet, BERT Large and many industrial applications at Kuaishou.

Links

Performance

The performance of different systems and algorithms on VGG16 with 128 GPUs under different network bandwidth.

Epoch time of BERT-Large Finetune under different network conditions for different systems.

For more comprehensive and up to date results, refer to Bagua benchmark page.

Installation

Wheels (precompiled binary packages) are available for Linux (x86_64). Package names are different depending on your CUDA Toolkit version (CUDA Toolkit version is shown in nvcc --version).

CUDA Toolkit version	Installation command
>= v10.2	`pip install bagua-cuda102`
>= v11.1	`pip install bagua-cuda111`
>= v11.3	`pip install bagua-cuda113`
>= v11.5	`pip install bagua-cuda115`
>= v11.6	`pip install bagua-cuda116`

Add --pre to pip install commands to install pre-release (development) versions. See Bagua tutorials for quick start guide and more installation options.

Quick Start on AWS

Thanks to the Amazon Machine Images (AMI), we can provide users an easy way to deploy and run Bagua on AWS EC2 clusters with flexible size of machines and a wide range of GPU types. Users can find our pre-installed Bagua image on EC2 by the unique AMI-ID that we publish here. Note that AMI is a regional resource, so please make sure you are using the machines in same reginon as our AMI.

Bagua version	AMI ID	Region
0.6.3	ami-0e719d0e3e42b397e	us-east-1
0.9.0	ami-0f01fd14e9a742624	us-east-1

To manage the EC2 cluster more efficiently, we use Starcluster as a toolkit to manipulate the cluster. In the config file of Starcluster, there are a few configurations that need to be set up by users, including AWS credentials, cluster settings, etc. More information regarding the Starcluster configuration can be found in this tutorial.

For example, we create a EC2 cluster with 4 machines, each of which has 8 V100 GPUs (p3.16xlarge). The cluster is based on the Bagua AMI we pre-installed in us-east-1 region. Then the config file of Starcluster would be:

# region of EC2 instances, here we choose us_east_1
AWS_REGION_NAME = us-east-1
AWS_REGION_HOST = ec2.us-east-1.amazonaws.com
# AMI ID of Bagua
NODE_IMAGE_ID = ami-0e719d0e3e42b397e
# number of instances
CLUSTER_SIZE = 4
# instance type
NODE_INSTANCE_TYPE = p3.16xlarge

With above setup, we created two identical clusters to benchmark a synthesized image classification task over Bagua and Horovod, respectively. Here is the screen recording video of this experiment.

Cite Bagua

% System Overview
@misc{gan2021bagua,
  title={BAGUA: Scaling up Distributed Learning with System Relaxations}, 
  author={Shaoduo Gan and Xiangru Lian and Rui Wang and Jianbin Chang and Chengjun Liu and Hongmei Shi and Shengzhuo Zhang and Xianghong Li and Tengxu Sun and Jiawei Jiang and Binhang Yuan and Sen Yang and Ji Liu and Ce Zhang},
  year={2021},
  eprint={2107.01499},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

% Theory on System Relaxation Techniques
@book{liu2020distributed,
  title={Distributed Learning Systems with First-Order Methods: An Introduction},
  author={Liu, J. and Zhang, C.},
  isbn={9781680837018},
  series={Foundations and trends in databases},
  url={https://books.google.com/books?id=vzQmzgEACAAJ},
  year={2020},
  publisher={now publishers}
}

	# weight_decay = group["weight_decay"] # TODO: @ganshaoduo check if this should be used
	beta1, beta2 = group["betas"]
	eps = group["eps"]

	for param_id, param in enumerate(group["params"]):
	state = self.state[param]

	): # TODO: consider the case where this is a scalar
	broadcast(inner_state, root=0)

	def _bagua_autotune_step(self):
	CYCLE_STEP = 100
	start_time = time.time()

	# TODO: @shjwudp add support for reporting tensor completion order so that the autotune service does not
	# rely on tensor registration order
	from bagua.torch_api.communication import get_bagua_hyperparameters

	self._bagua_autotune_client.report_metrics(
	rank=get_rank(),

	] # TODO: @shjwudp rename this to is autotune completed
	recommended_buckets = map(
	lambda x: list(map(lambda y: self._bagua_tensor_map[y["name"]], x)),
	response["recommended_hyperparameters"]["buckets"],
	)
	return list(recommended_buckets)

	# TODO: do we need to check whether optimizers and model parameters are the same?
	self.bagua_optimizers = (
	optimizers #: the optimizers passed in by ``with_bagua(...)``
	)
	self.bagua_algorithm = (
	algorithm #: the algorithm passed in by ``with_bagua(...)``

	get_bagua_hyperparameters().update( # TODO: @shjwudp do we need global hyperparameters?
	response["recommended_hyperparameters"]
	)

	self._bagua_autotune_completed = not response[
	"is_autotune_processing"

	) # TODO: check duplicated names
	bagua_buckets.append(bagua_bucket)
	return bagua_buckets

	def init_forward_pre_hook(self, bagua_module: BaguaModule):
	"""Given a `BaguaModule`, return a hook function that will be executed before the

	self._bagua_autotune_client.register_models( # TODO: @shjwudp rename to register tensors
	autotune_tensor_list, bagua_tensor_group_info
	).json() # TODO: @shjwudp error check

	def _bagua_autotune_get_buckets(self):
	response = self._bagua_autotune_client.ask_hyperparameters(

	] # TODO: remove parameter group logic
	)

	self._bagua_autotune_client.register_models( # TODO: @shjwudp rename to register tensors
	autotune_tensor_list, bagua_tensor_group_info
	).json() # TODO: @shjwudp error check

	# TODO: @shjwudp add support for reporting tensor completion order
	# so that the autotune service does not rely on tensor registration
	# order
	rsp = self._bagua_autotune_client.report_metrics(
	rank=get_rank(),
	unix_timestamp=time.time(),

	req: dict = request.get_json(force=True) # TODO: @shjwudp
	tensor_ready_order: list = req["tensor_ready_order"]
	communication_time_ms: float = req["communication_time_ms"]
	hyperparameters: dict = req["hyperparameters"]

	* `delay_reduce`(_bool_) - Overlap communication with computation. Default: `True`.
	* `delay_reduce`(_bool_): Delay all communication to the end of the backward pass. This disables

	# TODO: @shjwudp merge with service module
	import copy
	import collections
	import logging

	# score = np.mean(score_list) # TODO: @shjwudp check whether these are still needed
	# std = np.std(score_list)

	return np.mean(score_list), np.std(score_list), score_list.tolist()

baguasys / bagua Goto Github PK

bagua's Introduction

Links

Performance

Installation

Quick Start on AWS

Cite Bagua

Contributors

bagua's People

Contributors

Stargazers

Watchers

Forkers

bagua's Issues

This issue was generated by todo based on a TODO comment in 96cb6fe when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a TODO comment in 96cb6fe when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a TODO comment in 6f62a84. It's been assigned to @NOBLES5E because they committed the code.

This issue was generated by todo based on a TODO comment in 96cb6fe when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a TODO comment in 96cb6fe when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a TODO comment in 96cb6fe when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a TODO comment in 96cb6fe when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a TODO comment in 96cb6fe when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a TODO comment in 96cb6fe when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a TODO comment in 96cb6fe when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a TODO comment in 96cb6fe when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a TODO comment in 1bb1ad1 when #86 was merged. cc @shjwudp.

This issue was generated by todo based on a TODO comment in 96cb6fe when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a TODO comment in e4ee5ee. It's been assigned to @shjwudp because they committed the code.

This issue was generated by todo based on a TODO comment in 96cb6fe when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a TODO comment in 96cb6fe when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a TODO comment in 96cb6fe when #24 was merged. cc @BaguaSys.

Recommend Projects

Recommend Topics

Recommend Org

This issue was generated by todo based on a `TODO` comment in `96cb6fe` when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a `TODO` comment in `96cb6fe` when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a `TODO` comment in `6f62a84`. It's been assigned to @NOBLES5E because they committed the code.

This issue was generated by todo based on a `TODO` comment in `96cb6fe` when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a `TODO` comment in `96cb6fe` when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a `TODO` comment in `96cb6fe` when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a `TODO` comment in `96cb6fe` when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a `TODO` comment in `96cb6fe` when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a `TODO` comment in `96cb6fe` when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a `TODO` comment in `96cb6fe` when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a `TODO` comment in `96cb6fe` when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a `TODO` comment in `1bb1ad1` when #86 was merged. cc @shjwudp.

This issue was generated by todo based on a `TODO` comment in `96cb6fe` when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a `TODO` comment in `e4ee5ee`. It's been assigned to @shjwudp because they committed the code.

This issue was generated by todo based on a `TODO` comment in `96cb6fe` when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a `TODO` comment in `96cb6fe` when #24 was merged. cc @BaguaSys.

This issue was generated by todo based on a `TODO` comment in `96cb6fe` when #24 was merged. cc @BaguaSys.