
Evaluation and Optimization of Gradient Compression for Distributed Deep Learning

Introduction

We propose a new gradient compression algorithm, ACP-SGD, which alternates the low-rank compression and aggregation between PowerSGD's P and Q matrices, enabling system optimization techniques such as ring all-reduce, pipelining, and tensor fusion. This repository contains ACP-SGD's source code (see acpsgd.py), as well as a set of benchmarking scripts for evaluating the training performance of S-SGD, PowerSGD, and ACP-SGD.
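To make the P/Q alternation concrete, the following is a minimal NumPy sketch of the low-rank power-iteration step underlying PowerSGD-style compression: a gradient matrix M is compressed to P = MQ (then orthogonalized) and Q = MᵀP in alternation. This is an illustrative sketch written for this README, not the repository's implementation; in distributed training, the aggregation (all-reduce) would be applied to P or Q instead of M itself.

```python
import numpy as np

def low_rank_step(M, Q):
    """One power-iteration step: compress M against Q, then refresh Q."""
    P = M @ Q               # compressed representation, shape (n, r)
    P, _ = np.linalg.qr(P)  # orthogonalize the columns of P
    Q = M.T @ P             # refreshed factor, shape (m, r)
    return P, Q

def approximate(M, rank=2, iters=4, seed=0):
    """Return a rank-`rank` approximation of M via alternating steps."""
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((M.shape[1], rank))
    for _ in range(iters):
        P, Q = low_rank_step(M, Q)
    return P @ Q.T          # low-rank reconstruction of M
```

Because P and Q are both of size O((n+m)r), communicating them is far cheaper than communicating the full n×m gradient when r is small.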

Currently, it covers:

Data Parallelism Algorithms

  • S-SGD atop PyTorch-DDP
  • PowerSGD atop PyTorch-DDP's communication hook
  • ACP-SGD, which supports tensor fusion with hyper-parameters rank and threshold (default: 25MB)
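The tensor fusion controlled by the threshold hyper-parameter above can be sketched as follows: per-tensor gradients are buffered until their total size reaches the threshold, then communicated as one fused message. This is an illustrative sketch for this README, not the repository's implementation; the 25MB value mirrors the default mentioned above.

```python
FUSION_THRESHOLD = 25 * 1024 * 1024  # 25 MB, matching the default above

def fuse(grad_sizes_bytes, threshold=FUSION_THRESHOLD):
    """Group per-tensor byte sizes into fused communication buckets."""
    buckets, current, current_bytes = [], [], 0
    for size in grad_sizes_bytes:
        current.append(size)
        current_bytes += size
        if current_bytes >= threshold:  # bucket full: flush it
            buckets.append(current)
            current, current_bytes = [], 0
    if current:                         # flush any remaining tensors
        buckets.append(current)
    return buckets
```

Fusing many small tensors into fewer, larger messages amortizes per-message communication latency, which is why a larger threshold trades memory for fewer all-reduce calls.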

We refer readers to gradient_reducers for evaluating more gradient compression methods, such as Top-k SGD and Sign-SGD.

Deep Neural Networks

Installation

Prerequisites

Configure the cluster settings

Before running the scripts, please carefully edit the configuration file envs.conf, e.g.

  • PY: the Python environment
  • xxx_INTERFACE: the network interface
  • hosts: the cluster configuration

Run benchmarks

  • The batch mode:
    bash batch.sh
  • The individual mode, e.g.:
    opt=acpsgd rank=4 dnn=resnet50 bs=64 nworkers=32 bash perf.sh

For different experimental settings, users can modify the algorithm, DNN model, batch size, the number of GPUs, and network configurations.
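Benchmark scripts of this style commonly read such settings from environment variables with fallback defaults, which is what makes the `opt=... bash perf.sh` invocation above work. The snippet below is a hypothetical sketch of that pattern, not the repository's perf.sh; the default values are taken from the individual-mode example above but are otherwise an assumption.

```shell
# Hypothetical sketch of env-var handling (not the actual perf.sh).
# Each setting falls back to a default when not provided by the caller.
opt="${opt:-acpsgd}"        # compression algorithm
rank="${rank:-4}"           # low-rank dimension
dnn="${dnn:-resnet50}"      # DNN model
bs="${bs:-64}"              # per-GPU batch size
nworkers="${nworkers:-32}"  # number of GPUs
echo "running $opt on $dnn: rank=$rank, bs=$bs, nworkers=$nworkers"
```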

ACP-SGD Usage

The ACP-SGD distributed optimizer can be used in the same way as horovod.DistributedOptimizer():

import torch.optim as optim
import acpsgd

acpsgd.init()  # set up the distributed environment
...
optimizer = optim.SGD(model.parameters(), ...)
# wrap the local optimizer so gradients are compressed and aggregated
optimizer = acpsgd.DistributedOptimizer(optimizer, ...)
...
for i, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()  # apply the aggregated update
...

ACP-SGD Example

An example script for training on MNIST is provided:

$ bash mnist.sh

Paper

If you are using this repository for your paper, please cite our work

@inproceedings{lin23acpsgd,
    author = {Zhang, Lin and Zhang, Longteng and Shi, Shaohuai and Chu, Xiaowen and Li, Bo},
    title = {Evaluation and Optimization of Gradient Compression for Distributed Deep Learning},
    booktitle = {IEEE International Conference on Distributed Computing Systems (ICDCS)},
    year = {2023}
}

Contributors

  • lzhangbv
