
pytorch-disttrain

Introduction

This is a demo of PyTorch distributed training. In this repo, you can find three simple demos for training a model with multiple GPUs, either on a single machine or across several machines. The main code is borrowed from pytorch-multigpu and pytorch-tutorial; I only did some code polishing, thanks to the two authors. In addition, a sample sbatch script is provided for running distributed training on an HPC (high-performance computing) cluster.

Requirements

  • PyTorch >= 1.0 is preferred.
  • Python 3 is preferred.
  • NFS: all compute nodes should preferably load the data from a shared Network File System.
  • Linux: the PyTorch distributed package currently runs only on Linux.

Run the demos

Demo 1

This demo is based on torch.nn.DataParallel(model), the simplest way to use multiple GPUs on a single compute node. A batch is automatically split into N mini-batches and processed by N GPUs, and the model replicas on the different GPUs stay synchronized throughout training.

python multigpu_demo_v1.py
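
For reference, here is a minimal sketch of the DataParallel pattern this demo relies on (the tiny model and tensor shapes are illustrative, not taken from the repo):

import torch
import torch.nn as nn

# Illustrative model; the actual demo builds its own network.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

if torch.cuda.device_count() > 1:
    # DataParallel scatters each input batch across the visible GPUs,
    # replicates the model on every GPU, and gathers the outputs back.
    model = nn.DataParallel(model)
model = model.cuda()

inputs = torch.randn(256, 32).cuda()  # one batch, split automatically across GPUs
outputs = model(inputs)               # forward pass runs on all GPUs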

Demo 2

This demo is based on the PyTorch distributed package. There are N independent training processes, each of which owns one GPU, and the models on the different GPUs stay synchronized throughout training. We use torch.distributed.launch to spawn the N processes.

python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 multigpu_demo_v2.py
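
For reference, a minimal sketch of the per-process setup that torch.distributed.launch expects (the model and environment handling are illustrative, not the repo's exact code):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torch.distributed.launch starts one copy of this script per GPU and tells
# each copy its local rank (via --local_rank or the LOCAL_RANK variable,
# depending on the PyTorch version).
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# All launched processes join the same process group via environment variables.
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(32, 10).cuda(local_rank)
# DDP averages gradients across processes during backward(), which keeps
# the model replicas synchronized.
model = DDP(model, device_ids=[local_rank])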

Demo 3

In this demo, I run three processes on three different compute nodes, and each process can use either one GPU or several GPUs on its node (in the same way as demo 1), specified by --gpu_devices 0 1. Of course, every compute node must have the same PyTorch runtime environment. I use a file on NFS as the init_method. Note that this NFS file (which should not already exist) must be accessible by all processes on the different compute nodes, because they all use it to communicate with each other while the process group is being built. The NFS file is automatically removed after training. There are also other ways to initialize a process group; please refer to here. A sketch of the initialization call appears after the commands below.

Manually launch one process on each compute node.

# node 1
python multigpu_demo_v3.py \
    --init_method file://<absolute path to nfs file> \
    --rank 0 \
    --world_size 3 \
    --gpu_devices 0 1
# node 2
python multigpu_demo_v3.py \
    --init_method file://<absolute path to nfs file> \
    --rank 1 \
    --world_size 3 \
    --gpu_devices 0 1
# node 3
python multigpu_demo_v3.py \
    --init_method file://<absolute path to nfs file> \
    --rank 2 \
    --world_size 3 \
    --gpu_devices 0 1
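
For reference, a minimal sketch of the file:// initialization that these commands rely on (the argument parsing is illustrative; the real script defines more options, such as --gpu_devices and --batch_size):

import argparse
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--init_method", type=str)  # e.g. file:///mnt/nfs/shared_init_file (hypothetical path)
parser.add_argument("--rank", type=int)
parser.add_argument("--world_size", type=int)
args = parser.parse_args()

# Every process on every node points at the same NFS file; PyTorch uses it
# only as a rendezvous point while the process group is being built.
dist.init_process_group(backend="nccl",
                        init_method=args.init_method,
                        rank=args.rank,
                        world_size=args.world_size)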

GPU cluster on HPC

  1. Create job.sh.
#!/bin/bash

#SBATCH --gres=gpu:2            # 2 GPUs per node
#SBATCH --ntasks-per-node=1     # one training process per node
#SBATCH --mem=12000
#SBATCH --time=20:00
#SBATCH --output=log
#SBATCH --ntasks=1
#SBATCH --array=0,1             # job array: one array task per node, indices 0 and 1

srun python multigpu_demo_v3.py \
    --init_method file://<absolute path to nfs file> \
    --rank $SLURM_ARRAY_TASK_ID \
    --world_size ${SLURM_ARRAY_TASK_COUNT} \
    --batch_size 256 \
    --gpu_devices 0 1
  2. Simply run sbatch job.sh on the login (interactive) node to submit the job.

Performance comparisons

The code for the first three tests comes from pytorch-multigpu.

settings

  • GPU: P100;
  • GPU cluster: 2 GPUs/node;
  • dataset: CIFAR10; batch size: 256; epochs: 1; iterations: 196.

results

method       epoch time   batch time
single GPU   2:34         1.04 s
v1           2:17         0.70 s
v2           2:09         0.60 s
v3           2:01         0.58 s

I did not expect the two-process single-machine version (v2) to be slightly slower than the two-node distributed version (v3); perhaps this is due to the HPC setup. In short, no matter how you use multiple GPUs, the speed-up will not be linear, because of the communication cost of model synchronization. The main extra benefit of multiple GPUs is a larger batch size.

Verify the models

This script will verify whether the models from different processes are synchronized.

python compare_models.py final_model_rank_0.pth final_model_rank_1.pth

# output
# layer3.15.bn3.running_var
# layer3.16.bn1.running_mean
# layer3.16.bn1.running_var
# layer3.16.bn2.running_mean
# layer3.16.bn2.running_var
# layer3.16.bn3.running_mean
# layer3.16.bn3.running_var
# bn_out.running_mean
# bn_out.running_var
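
For reference, a minimal sketch of the kind of comparison such a script performs (this is not the repo's compare_models.py; it assumes each .pth file stores a plain state_dict):

import sys
import torch

# Load both checkpoints on the CPU and print the names of parameters/buffers
# whose values differ between the two models.
state_a = torch.load(sys.argv[1], map_location="cpu")
state_b = torch.load(sys.argv[2], map_location="cpu")

for name in state_a:
    if not torch.equal(state_a[name], state_b[name]):
        print(name)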

From the output above, the only differences between the models are in the BatchNorm running statistics, because each process computes them from its own mini-batches and they are never synchronized. So none of these three multi-GPU approaches can improve BatchNorm in this respect. If you want synchronized statistics, sync BN may satisfy your needs.
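
A minimal sketch of converting a model to synchronized BatchNorm before wrapping it in DDP is shown below (assuming a process group is already initialized and your PyTorch version provides torch.nn.SyncBatchNorm; the model is illustrative):

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Illustrative model; the demos use their own networks.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# Replace every BatchNorm layer with SyncBatchNorm so the running statistics
# are computed across all processes instead of per process.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model).cuda()
model = DDP(model, device_ids=[torch.cuda.current_device()])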

Blogs

The Chinese blogs:
