We propose a new gradient compression algorithm, ACP-SGD, which alternates low-rank compression and aggregation between PowerSGD's two factors P and Q, enabling system optimization techniques such as ring all-reduce, pipelining, and tensor fusion. This repository contains ACP-SGD's source code (see acpsgd.py), as well as a set of benchmarking scripts for comparing the training performance of S-SGD, PowerSGD, and ACP-SGD.
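To illustrate the idea, below is a minimal sketch of alternating the aggregated factor between P and Q for a 2-D gradient M ≈ PQᵀ. It is illustrative only, not the repository's acpsgd.py implementation, and it glosses over error feedback and the exact orthonormalization schedule of the actual algorithm.

```python
# A minimal sketch, assuming PowerSGD's rank-r factorization M ~ P Q^T;
# illustrative only, not the repository's acpsgd.py implementation.
import torch
import torch.distributed as dist

def orthonormalize(x):
    # Orthonormalize the columns of x via a reduced QR decomposition.
    q, _ = torch.linalg.qr(x)
    return q

def acp_step(m, p, q, step):
    """One hypothetical ACP-SGD iteration on a 2-D gradient m.

    Unlike PowerSGD, which aggregates both factors every iteration,
    only one low-rank factor is all-reduced per iteration, alternating
    between P and Q, so ring all-reduce, pipelining, and tensor fusion
    each operate on a single contiguous low-rank buffer."""
    if step % 2 == 0:
        p = m @ q                      # project onto the current Q
        dist.all_reduce(p)             # aggregate P across workers
        p = orthonormalize(p)          # keep P a valid projection basis
    else:
        q = m.t() @ p                  # project onto the current P
        dist.all_reduce(q)             # aggregate Q across workers
        q /= dist.get_world_size()     # average
    return p, q, p @ q.t()             # low-rank gradient estimate
```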
Currently, it covers:
- S-SGD atop PyTorch DDP
- PowerSGD atop PyTorch DDP's communication hook
- ACP-SGD, which supports tensor fusion with the hyper-parameters rank and threshold (default: 25MB); a bucketing sketch is given below
We refer readers to gradient_reducers for evaluating more gradient compression methods, such as Top-k SGD and Sign-SGD.
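For tensor fusion, small gradients are typically packed into larger buckets before compression and communication. The sketch below shows one plausible bucketing scheme under the default 25MB threshold; the function name and exact policy are assumptions, not the repository's API.

```python
# A hypothetical bucketing sketch for tensor fusion; the real acpsgd.py
# policy may differ. The threshold defaults to 25MB as noted above.
def fuse_tensors(grads, threshold_bytes=25 * 1024 * 1024):
    buckets, current, size = [], [], 0
    for g in grads:
        nbytes = g.numel() * g.element_size()
        if current and size + nbytes > threshold_bytes:
            buckets.append(current)    # close the full bucket
            current, size = [], 0
        current.append(g)
        size += nbytes
    if current:
        buckets.append(current)
    return buckets  # each bucket is compressed and all-reduced as one message
```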
The benchmarking scripts cover:
- Convolutional neural networks (CNNs) on a fake ImageNet dataset (i.e., randomly generated 224x224x3 input images; see the sketch after this list)
- Transformers: BERT-Base and BERT-Large pretraining models
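The fake ImageNet inputs can be reproduced with a tiny synthetic dataset like the one below; the class is illustrative and not part of this repository.

```python
# A hypothetical synthetic dataset producing random 224x224x3 images,
# matching the fake ImageNet inputs described above.
import torch
from torch.utils.data import Dataset

class FakeImageNet(Dataset):
    def __init__(self, length=1281167, num_classes=1000):
        self.length = length
        self.num_classes = num_classes

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        image = torch.randn(3, 224, 224)  # random image in CHW layout
        label = torch.randint(0, self.num_classes, (1,)).item()
        return image, label
```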
The code requires:
- Python 3.6+
- CUDA 10+
- NCCL 2.4+
- PyTorch 1.12+
Before running the scripts, please carefully edit the configuration file envs.conf, e.g.,
- PY: the Python environment to use
- xxx_INTERFACE: the network interface(s)
- hosts: the cluster configuration (an example is given below)
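A hypothetical envs.conf might look as follows; the keys follow the list above, but all values (and the exact hosts format) are placeholders to replace with your own cluster settings.

```bash
# Example values only; adjust to your environment.
PY=/home/user/anaconda3/bin/python   # PY: python environment
ETH_INTERFACE=eth0                   # xxx_INTERFACE: network interface
hosts=node1,node2,node3,node4        # hosts: cluster configuration
```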
The benchmarks can be run in two modes:
- The batch mode: `bash batch.sh`
- The individual mode, e.g., `opt=acpsgd rank=4 dnn=resnet50 bs=64 nworkers=32 bash perf.sh`
For different experimental settings, users can modify the algorithm, the DNN model, the batch size, the number of GPUs, and the network configuration, e.g.:
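The accepted values for opt and dnn below are assumptions based on the supported algorithms and models listed above.

```bash
# Hypothetical variations of the individual mode above.
opt=powersgd rank=8 dnn=bert_base bs=32 nworkers=16 bash perf.sh
opt=ssgd dnn=resnet50 bs=128 nworkers=8 bash perf.sh
```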
The ACP-SGD distributed optimizer can be used in the same way as horovod.DistributedOptimizer():
```python
import acpsgd

acpsgd.init()  # initialize the distributed communication backend
...
optimizer = optim.SGD(model.parameters(), ...)
# Wrap the local optimizer to compress and aggregate gradients with ACP-SGD.
optimizer = acpsgd.DistributedOptimizer(optimizer, ...)
...
for i, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
...
```
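For completeness, here is a self-contained toy version of the snippet above. The model, loss, and random data are illustrative; only acpsgd.init() and acpsgd.DistributedOptimizer() come from this repository, and the extra constructor arguments elided above may be required.

```python
# Toy end-to-end example; the model and data are placeholders, not a benchmark.
import torch
import torch.nn as nn
import torch.optim as optim
import acpsgd

acpsgd.init()

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10)).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
optimizer = acpsgd.DistributedOptimizer(optimizer)

for step in range(10):
    data = torch.randn(64, 1, 28, 28).cuda()      # random MNIST-shaped batch
    target = torch.randint(0, 10, (64,)).cuda()   # random labels
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()                               # gradients are compressed and aggregated
    optimizer.step()
```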
An example script for training on MNIST is provided:
```bash
$ bash mnist.sh
```
If you use this repository for your paper, please cite our work:
```bibtex
@inproceedings{lin23acpsgd,
  author    = {Zhang, Lin and Zhang, Longteng and Shi, Shaohuai and Chu, Xiaowen and Li, Bo},
  title     = {Evaluation and Optimization of Gradient Compression for Distributed Deep Learning},
  booktitle = {IEEE International Conference on Distributed Computing Systems (ICDCS)},
  year      = {2023}
}
```