torchgpipe's Introduction

torchgpipe

A GPipe implementation in PyTorch. It is optimized for CUDA rather than TPU.

from torchgpipe import GPipe
model = nn.Sequential(a, b, c, d)
model = GPipe(model, balance=[1, 1, 1, 1], chunks=8)
output = model(input)

What is GPipe?

GPipe is a scalable pipeline parallelism library published by Google Brain, which allows efficient training of large, memory-consuming models. According to the paper, GPipe can train a 25x larger model by using 8x devices (TPU), and train a model 3.5x faster by using 4x devices.

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Google trained AmoebaNet-B with 557M parameters using GPipe. This model achieved 84.3% top-1 and 97.0% top-5 accuracy on the ImageNet classification benchmark (state-of-the-art performance as of May 2019).

GPipe uses (a) pipeline parallelism and (b) automatic recomputation of the forward propagation during the backpropagation, thereby enabling the training of large models. We refer to (b) as checkpointing, following the well-known terminology in the PyTorch community.

Pipeline Parallelism
GPipe splits a model into multiple partitions and places each partition on a different device so that the model can occupy more memory capacity. It also splits a mini-batch into multiple micro-batches to keep the partitions working in parallel as much as possible.
Checkpointing
Checkpointing is applied to each partition to minimize the overall memory consumption of the model. During forward propagation, only the tensors at the boundaries between partitions are remembered. All other intermediate tensors are discarded and recomputed during backpropagation when necessary.
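
The same memory-for-recompute trade-off is available in plain PyTorch through torch.utils.checkpoint. The sketch below only illustrates the general technique and is not torchgpipe's internal code:

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
x = torch.randn(64, 1024, requires_grad=True)

# Activations inside 'block' are not stored during the forward pass; they are
# recomputed during backward, so only the boundary tensors stay in memory.
y = checkpoint(block, x)
y.sum().backward()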

Usage

Currently, torchgpipe requires the following environment:

  • Python 3.6+
  • PyTorch 1.1+

To use torchgpipe, install it via PyPI:

$ pip install torchgpipe

To train a module with GPipe, simply wrap it with torchgpipe.GPipe. Your module must be an nn.Sequential, as GPipe automatically splits it into partitions of consecutive layers. The balance argument determines the number of layers in each partition, and the chunks argument specifies the number of micro-batches. Input, output, and intermediate tensors must be Tensor or Tuple[Tensor, ...].

The example below splits a module with four layers into four partitions, each having a single layer. It also splits each mini-batch into 8 micro-batches:

from torch import nn
from torchgpipe import GPipe

# a, b, c and d are arbitrary nn.Module instances.
model = nn.Sequential(a, b, c, d)
model = GPipe(model, balance=[1, 1, 1, 1], chunks=8)

for input in data_loader:
    output = model(input)

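For a more complete picture, the sketch below extends the example with an optimizer and a loss; the SGD optimizer, the loss function, and the explicit placement of input and target are illustrative assumptions. GPipe consumes the input on its first device and produces the output on its last device:

from torch import nn
from torch.optim import SGD
from torchgpipe import GPipe

model = nn.Sequential(a, b, c, d)        # same placeholder layers as above
model = GPipe(model, balance=[1, 1, 1, 1], chunks=8)

in_device = model.devices[0]             # first partition's device
out_device = model.devices[-1]           # last partition's device

optimizer = SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for input, target in data_loader:        # data_loader as in the snippet above
    input = input.to(in_device, non_blocking=True)
    target = target.to(out_device, non_blocking=True)

    output = model(input)
    loss = loss_fn(output, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
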
Documentation

Visit torchgpipe.readthedocs.io for more information including the API references.

Benchmarking

The full details and more benchmarks are available in torchgpipe.readthedocs.io.

ResNet-101 Accuracy Benchmark

Batch size    torchgpipe    nn.DataParallel    Goyal et al.
256           21.99±0.13    22.02±0.11         22.08±0.06
1K            22.24±0.19    22.04±0.24         N/A
4K            22.13±0.09    N/A                N/A

GPipe should be transparent and introduce no additional hyperparameter tuning. To verify this transparency, we reproduced the top-1 error rate of ResNet-101 on ImageNet, as reported in Table 2(c) of Accurate, Large Minibatch SGD by Goyal et al.

U-Net (B, C) Memory Benchmark

Experiment    U-Net (B, C)    Parameters    Memory usage
baseline      (6, 72)         362.2M        20.3 GiB
pipeline-1    (11, 128)       2.21B         20.5 GiB
pipeline-2    (24, 128)       4.99B         43.4 GiB
pipeline-4    (24, 160)       7.80B         79.1 GiB
pipeline-8    (48, 160)       15.82B        154.1 GiB

The table shows how GPipe facilitates scaling U-Net models. baseline denotes the setting without pipeline parallelism or checkpointing, and pipeline-1, -2, -4, and -8 denote models trained with GPipe using the corresponding number of partitions.

Here we used a simplified U-Net architecture. The size of the model is determined by the hyperparameters B and C, which are proportional to the number of layers and filters, respectively.

U-Net (5, 64) Speed Benchmark

Experiment    Throughput    Speed up
baseline      28.500/s      1×
pipeline-1    24.456/s      0.858×
pipeline-2    35.502/s      1.246×
pipeline-4    67.042/s      2.352×
pipeline-8    88.497/s      3.105×

To verify efficiency with skip connections, we measured the throughput of U-Net with various numbers of devices. We chose U-Net since it has several long skip connections.

AmoebaNet-D (18, 256) Speed Benchmark

Experiment    Throughput    Speed up (torchgpipe)    Speed up (Huang et al.)
n=2, m=1      26.733/s      1×                       1×
n=2, m=4      41.133/s      1.546×                   1.07×
n=2, m=32     47.386/s      1.780×                   1.21×
n=4, m=1      26.827/s      1.006×                   1.13×
n=4, m=4      44.543/s      1.680×                   1.26×
n=4, m=32     72.412/s      2.711×                   1.84×
n=8, m=1      24.918/s      0.932×                   1.38×
n=8, m=4      70.065/s      2.625×                   1.72×
n=8, m=32     132.413/s     4.966×                   3.48×

(n: number of partitions, m: number of micro-batches)

The table shows the reproduced speed benchmark on AmoebaNet-D (18, 256), as reported in Table 2 of GPipe by Huang et al. Note that we replaced K in the paper with n.
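
For orientation, n and m map directly onto GPipe's constructor: n is the number of entries in the balance list (one per partition) and m is the chunks argument. The balance values below are placeholders, not the benchmark's actual configuration:

from torchgpipe import GPipe

# n = 4 partitions (four balance entries, values are placeholders),
# m = 32 micro-batches.
model = GPipe(model, balance=[3, 5, 5, 5], chunks=32)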

Notes

This project is functional, but its interface is not finalized yet. All public APIs are subject to change without warning until v0.1.0.

Authors and Licensing

The torchgpipe project is developed by Heungsub Lee, Myungryong Jeong, and Chiheon Kim at Kakao Brain, with the help of Sungbin Lim, Ildoo Kim, Woonhyuk Baek, and Boogeon Yoon. It is distributed under the 3-clause BSD license.

Citation

If you apply this library to any project or research, please cite our code:

@article{kim2020torchgpipe,
    title={torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models},
    author={Chiheon Kim and Heungsub Lee and Myungryong Jeong and Woonhyuk Baek and Boogeon Yoon and Ildoo Kim and Sungbin Lim and Sungwoong Kim},
    year={2020},
    eprint={2004.09910},
    archivePrefix={arXiv}
}

torchgpipe's People

Contributors

bgyoon, chiheonk, mrjeong, paul-june, sublee, wbaek

torchgpipe's Issues

Not sure if the backpropagation here also follows pipeline

Cool project!
In the code, queuing mechanisms ensure the pipelined execution of the forward propagation, but I am not sure how the pipeline is ensured during backward propagation. If there is such a mechanism, maybe you want to describe it in the documentation.

Checkpoint Issues

I tried the 'never' option for checkpointing. The idea was to see how the pipeline was performing without checkpointing overhead.

What I observed was that the performance is consistent across pipeline parallelism of 2, 4, and 8. Another important observation was that the performance is much lower than with checkpointing enabled.

Is this expected, or are there other tuning parameters to get better performance?

I also checked the backward time and the forward-to-backward time ratio.

Assuming backward time increases with checkpointing, is that logic valid for your implementation?
Meaning, when I turn off checkpointing, should the pipeline performance improve?

Could you clarify the implementation details on this?

About speedup

Hi,

I see that two factors contribute to the reported speedup, which is based on how many samples are processed per unit time:

  1. The use of a larger mini-batch size when increasing the number of micro-batches, because checkpointing (i.e. recomputation) reduces memory consumption.
  2. The pipeline mechanism itself.

Does the original GPipe paper also report speedup in this way? Reporting it like this seems somewhat unfair: we do not know how much 2) contributes to the reported speedup, and maybe all the benefit comes from using a larger mini-batch size. Besides, sometimes we do not want a very large mini-batch size in order to train well, although for ImageNet classification this is no longer an issue.

To enable a fair comparison, I think we should compare the speedup against checkpointing alone (Chen et al.) under the same mini-batch size. Reducing memory as in 1) was first proposed by Chen et al.; only 2) is new in GPipe. What do you think?

Thanks

Dual-license as BSD3 for PyTorch integration

The PyTorch team is currently evaluating pulling in fairscale's pipelining framework (https://github.com/facebookresearch/fairscale/tree/master/fairscale/nn/pipe) into PyTorch. Fairscale's pipeline framework is a fork of torchgpipe and to cleanly integrate this into PyTorch, the easiest option for us would be if torchgpipe is dual-licensed as BSD3 in addition to the Apache license.

Please let us know if this is feasible for the torchgpipe project since it would be a great addition to PyTorch core. We plan to add appropriate citations and acknowledgements to the original torchgpipe paper and project as part of this addition.

For more context, I have a PR on the PyTorch repo pulling in some of this work: pytorch/pytorch#44090.

Reproducing GPipe accuracy results

Hi, thank you for creating this code. Have you been able to reproduce any of the GPipe accuracy claims with torchgpipe? If so, could you make that training code available?

Balance Module

🐞 Bug

Balance by time does not exclude the time spent on CUDA initialization. I think adding a warm-up run could eliminate that effect.
Balance by size actually accumulates the sizes, so that a later layer always appears to consume more memory than the previous layers.

Code that reproduces

Changing the test to the following reproduces the balance-by-size issue:

import torch
from torch import nn
from torchgpipe.balance import balance_by_size

def test_balance_by_size():
    class Expand(nn.Module):
        def __init__(self, times):
            super().__init__()
            self.times = times

        def forward(self, x):
            for i in range(self.times):
                x = x + torch.rand_like(x, requires_grad=True)
            return x

    model = nn.Sequential(*[Expand(i) for i in [6, 5, 4, 3, 2, 1]])
    sample = torch.rand(10, 100, 100)
    balance = balance_by_size(model, sample, partitions=2, device='cuda')
    assert balance == [2, 4]
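
A possible workaround sketch for the balance-by-time half of this report, assuming the skew indeed comes from one-time CUDA context creation being charged to the first profiled layer; the warm-up below is illustrative and model is a placeholder nn.Sequential:

import torch
from torchgpipe.balance import balance_by_time

# Touch the GPU once before profiling so that CUDA context creation and the
# first kernel launches are not attributed to the earliest layers.
torch.ones(1, device='cuda')
torch.cuda.synchronize()

sample = torch.rand(128, 3, 224, 224)
balance = balance_by_time(2, model, sample)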

[Question] What is the purpose of "always" checkpointing mode?

The user guide mentions the following:

Usually, checkpointing at the last micro-batch may not be useful because the saved memory will be reconstructed immediately. That’s why we choose 'except_last' as the default option.

It seems like using the mode "always" might not have any benefit. Are there cases where it makes sense to use "always" instead of "except_last"?
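
For reference, the mode is selected through the checkpoint argument of GPipe; the balance and chunks values below are just the README example:

from torchgpipe import GPipe

# checkpoint accepts 'always', 'except_last' (default), or 'never'.
# 'always' trades extra recomputation of the last micro-batch for slightly
# lower peak memory, which may only matter when memory is extremely tight.
model = GPipe(model, balance=[1, 1, 1, 1], chunks=8, checkpoint='always')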

torchgpipe + Megatron

Can these two work together easily using this codebase? What are the expected hurdles/complications?

Could torchgpipe run on CPU-only machines?

Dear torchgpipe team,

Thanks for sharing the code! I wonder whether torchgpipe is suitable for CPU-only machines, or whether there is any other way to deploy it in CPU-only environments. Thanks!
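
Assuming the devices argument accepts CPU devices (worth verifying against the installed version), a CPU-only configuration might look like the sketch below; without CUDA streams there is no copy/compute overlap, so this is mainly useful for debugging:

import torch
from torch import nn
from torchgpipe import GPipe

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1))

# Assumption: passing CPU devices keeps every partition on the CPU.
model = GPipe(model, balance=[2, 1], devices=['cpu', 'cpu'], chunks=4)

output = model(torch.rand(8, 16))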

Issues on torchgpipe project and paper

Thanks for sharing this project and paper.

I have one doubt after reading your paper: how do you achieve concurrent copy and computation using only the CUDA streams wrapped by torch?

  1. According to NVIDIA's https://developer.download.nvidia.cn/CUDA/training/StreamsAndConcurrencyWebinar.pdf, it is impossible to run kernels on the default stream and on other streams simultaneously. If developers want to use two or more streams to overlap communication and computation, they have to explicitly create non-blocking streams, excluding the default stream.
  2. Besides, due to the Python GIL, it also seems hard or impossible to launch kernels into several non-blocking streams simultaneously from Python.

Or do you use other techniques to address this issue?

My doubts are as follows:

  1. Figure 5 in your paper shows kernels running on both the default stream and non-blocking streams; is that correct?
  2. Figure 7 shows the improvement you achieved, using a kernel timeline profiled with the NVIDIA Nsight tool, but if we zoom into the timeline, do the kernels actually overlap, or do they still execute sequentially overall despite using more streams?
  3. Is it possible that the improvement comes from reduced idle gaps between kernels rather than from overlapping communication and computation?

Thanks for your time and answer.
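
For reference, plain PyTorch can enqueue copies and kernels on distinct non-default streams from a single Python thread, since launches are asynchronous and return before the work runs; a minimal sketch of the pattern (illustrative only, not torchgpipe's code):

import torch

copy_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()

x_cpu = torch.rand(1024, 1024, pin_memory=True)
w = torch.rand(1024, 1024, device='cuda')

# Kernel launches and async copies return immediately, so one Python thread
# can enqueue work on both non-default streams back to back.
with torch.cuda.stream(copy_stream):
    x = x_cpu.to('cuda', non_blocking=True)

with torch.cuda.stream(compute_stream):
    y = w @ w

# Make the default stream wait for both streams before using the results.
torch.cuda.current_stream().wait_stream(copy_stream)
torch.cuda.current_stream().wait_stream(compute_stream)
torch.cuda.synchronize()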

CUDA issues

Hi, I have some questions regarding CUDA usage when I call the GPipe class.
When GPipe tries partition.to(device) inside split_module, the error torch._C._cuda_init() RuntimeError: No CUDA GPUs are available is raised.
However, if I don't use GPipe in the same environment, torch can find the GPUs.
Has anyone ever faced the same problem?
Thank you!

Question about worker thread in GPipe

Hi, thanks for the fantastic work!

I have a question about the micro-batch lockstep.

out_queue.join()

In the comment it says: "During this partition is executing a micro-batch, to copy a micro-batch by the next partition would be blocked." Why would putting the message into the queue be blocked by the next partition's thread?

Why does `Copy` compute gradients in reversed order

Niiiice work! I am confused about why Copy computes gradients in reversed order.

grad_input: Deque[Tensor] = deque(maxlen=len(grad_output))
input_stream = current_stream(get_device(prev_stream))

with use_stream(prev_stream), use_stream(next_stream):
    for x in reversed(grad_output):
        y = x.to(get_device(prev_stream))
        grad_input.appendleft(y)

        # 'next_stream' is not where 'x' has been allocated.
        record_stream(x, next_stream)
        # 'y' has been allocated on 'prev_stream'.
        # It might be used on the current stream captured as 'input_stream'.
        record_stream(y, input_stream)

If I were writing this part, I think it could be implemented directly like this:

grad_input: List[Tensor] = []
input_stream = current_stream(get_device(prev_stream))

with use_stream(prev_stream), use_stream(next_stream):
    for x in grad_output:
        y = x.to(get_device(prev_stream))
        grad_input.append(y)

        # 'next_stream' is not where 'x' has been allocated.
        record_stream(x, next_stream)
        # 'y' has been allocated on 'prev_stream'.
        # It might be used on the current stream captured as 'input_stream'.
        record_stream(y, input_stream)

Is there something I am not taking into account? Could you explain it? Thank you very much.

RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same

Hi there. First off, thanks for this great implementation of GPipe for PyTorch, it is very much appreciated. I am looking forward to getting it running for a research project I am working on, however, I have encountered an issue. I'm not sure if it is a bug, but I have essentially used the code exactly as provided in the docs.

In my project, I'm training ResNet on ImageNet. I am using the nn.Sequential ResNet adaptation provided under benchmarks, this function specifically:

def resnet101(**kwargs: Any) -> nn.Sequential:

I have avoided doing any manual CUDA device assignments or manipulation, but I am running on a single node with 2x P40 GPUs installed. Here is a simplified version of my code:

...
model = resnet101() # From torchgpipe
...
partitions = torch.cuda.device_count()
sample = torch.rand(128, 3, 224, 224)
balance = balance_by_time(partitions, model, sample)
model = GPipe(model, balance, chunks=8)
...
for i, (images, target) in enumerate(train_loader):
  ...
  # Exception is thrown here!
  # images has shape: torch.Size([256, 3, 224, 224])
  output = model(images)
  ...

And here is the actual error thrown:

Traceback (most recent call last):
  File "models/ImageNet/train.py", line 513, in <module>
    main()
  File "models/ImageNet/train.py", line 168, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "models/ImageNet/train.py", line 300, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "models/ImageNet/train.py", line 347, in train
    output = model(images)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/gpipe.py", line 376, in forward
    pipeline.run()
  File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/pipeline.py", line 115, in run
    self.compute(schedule, skip_trackers, in_queues, out_queues)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/pipeline.py", line 249, in compute
    raise exc_info[0].with_traceback(exc_info[1], exc_info[2])
  File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/worker.py", line 82, in worker
    batch = task.compute()
  File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/worker.py", line 57, in compute
    return self._compute()
  File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/checkpoint.py", line 95, in checkpoint
    self.function, input_atomic, *input)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/checkpoint.py", line 254, in forward
    output = function(input[0] if input_atomic else input)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/pipeline.py", line 202, in function
    return partition(input)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 446, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor

Any idea why this error is being thrown? If not, would you be able to recommend any steps for debugging? I believe my code is valid, but please let me know if you see any issues. I'm not sure if this issue occurs for other models that already inherit from nn.Sequential, but perhaps the issue could be with the ResNet implementation? I have successfully trained a torchvision.models.ResNet with this script (using DistributedDataParallel instead of torchgpipe), so I am less suspicious of the code I have omitted.

Thanks so much for your help!
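
A hedged guess at the cause: the traceback suggests the input micro-batch is still on the CPU while the first partition's weights are on CUDA. GPipe expects the caller to move the input to the first partition's device (and the target to the last device for the loss); a sketch of the change:

# Hypothetical fix: GPipe keeps the first partition on model.devices[0] and
# produces the output on model.devices[-1], so move the data accordingly.
in_device = model.devices[0]
out_device = model.devices[-1]

for i, (images, target) in enumerate(train_loader):
    images = images.to(in_device, non_blocking=True)
    target = target.to(out_device, non_blocking=True)
    output = model(images)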

How did you handle batch norm?

Hi,

Can you explain how you handle the batch norm operation when dealing with micro-batches?

Best
Niranda
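
For reference, the GPipe constructor exposes a deferred_batch_norm flag aimed at exactly this concern; the balance and chunks values below are just the README example:

from torchgpipe import GPipe

# deferred_batch_norm=True makes BatchNorm layers accumulate their running
# statistics over all micro-batches of a mini-batch instead of per micro-batch.
model = GPipe(model, balance=[1, 1, 1, 1], chunks=8, deferred_batch_norm=True)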

A more memory-efficient implementation

Hi,

I came across a paper called DAPPLE (url) about a more memory-efficient implementation of GPipe like this:

[figure from the DAPPLE paper omitted]

There seems to be some typo in the legend of (c). However, I see that you have implemented GPipe by building dependencies in the forward pass of micro-batches, and then the PyTorch autograd will automatically schedule the backward pass. So is it impossible to use PyTorch to implement DAPPLE, where the forward and backward pass are mixed?

Thank you for your help!

Benchmark Performance for Baseline vs Pipeline-1

With the speed benchmarks, the pipeline-1 benchmark time is higher than that of baseline benchmarks. Is there a clear reason why there is a significant overhead with pipeline-1 with respect to baseline experiments?

What I understood from the script is that the baseline runs on a single GPU. Is this right?
And pipeline-1 also runs on a single GPU. Is this right?

Gpipe Benchmark

I want to compare the GPipe benchmark to the torchgpipe benchmark. I ran some micro-benchmarks, and I want to measure the same overheads in GPipe. Is that script open source?

Forward in eval() mode.

Hi,

This is a very nice project! I have a quick question: does inference in evaluation mode follow the same procedure as the forward pass in training mode?

Thanks!
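
Mechanically, evaluation goes through the same pipelined forward; a sketch of how one would run it, where test_loader is a hypothetical DataLoader:

model.eval()
with torch.no_grad():
    for input in test_loader:                        # hypothetical loader
        input = input.to(model.devices[0], non_blocking=True)
        output = model(input)                        # lands on model.devices[-1]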

KeyError in Stash

Thank you so much for the awesome repository!

I use stash and found there is an error in PoPCat().isolate(ns).

I have checked that the stash itself is fine and has run successfully.

Thanks again for your help!

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torchgpipe/skip/skippable.py", line 173, in forward
    poppable_tensors[name] = skip_tracker.load(batch, ns, name)
  File "/opt/conda/lib/python3.6/site-packages/torchgpipe/skip/tracker.py", line 38, in load
    return self.tensors.pop((ns, name))
KeyError: (<Namespace 'bfb2fb79-66d2-4437-abf9-ef62cfb17836'>, 'k')
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torchgpipe/skip/skippable.py", line 175, in forward
    raise RuntimeError(f"'{name}' has not been stashed")
RuntimeError: 'k' has not been stashed

Using GPipe for Hessian Computation

Firstly thank you for this amazing library!!

I was trying to utilize this library for one of my projects, where I would like to do a Hessian analysis of the model in a distributed setting. I understand that you were able to resolve the issue with backpropagation in GPipe by establishing a relationship between the micro-batches. Since calculating any second-order derivative is the same as doing a second backpropagation through the computational graph, do you think it will be safe to do this operation using GPipe?

Thank you.
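
For context, a second-order quantity such as a Hessian-vector product is normally obtained with a double backward via create_graph=True, as in the generic sketch below (model, input, target, and loss_fn are placeholders); whether GPipe's checkpointed recomputation keeps the graph needed for the second pass is exactly what would need verifying:

import torch

# Generic Hessian-vector product Hv = d/dθ (∇L · v) via double backward.
params = [p for p in model.parameters() if p.requires_grad]
loss = loss_fn(model(input), target)

grads = torch.autograd.grad(loss, params, create_graph=True)
v = [torch.randn_like(p) for p in params]
dot = sum((g * vi).sum() for g, vi in zip(grads, v))
hvp = torch.autograd.grad(dot, params)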

convergence problem

        checkpoint = (j < checkpoint_stop)
        if checkpoint:
            chk = Checkpointing(partition, batch)
            task = Task(streams[i], compute=chk.checkpoint, finalize=chk.recompute)
            del chk

        else:
            def compute(batch: Batch = batch, partition: nn.Sequential = partition) -> Batch:
                return batch.call(partition)
            task = Task(streams[i], compute=compute, finalize=None)
            del compute

When I used the second branch (plain compute, without checkpointing), I found that the results of my network became worse, and the degradation is proportional to the number of partitions.

With the same batch size but different numbers of micro-batches, the results are inconsistent.

🐞 Bug

With the same batch size but different numbers of micro-batches (chunks), the training results are inconsistent.

I have fixed the random seed.

I set chunks to 2 or 4.

Code that reproduces

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torch.nn import functional as F
from torchgpipe import GPipe
import random, os
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class SimpleDNN(nn.Module):
    def __init__(self):
        super(SimpleDNN, self).__init__()
        self.fc1 = nn.Linear(28*28, 512)  # assuming input images are 28x28
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        x = x.view(-1, 28*28)  # flatten the image
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# set random seed
seed = 0
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
np.random.seed(seed)
torch.manual_seed(seed)

model = SimpleDNN().to(device)
model = nn.Sequential(
    model,
    nn.ReLU(),
    nn.Linear(10, 10)
)

chunks = 2  # Assume you want to divide the model into chunks

model = GPipe(model, balance=[1, 1, 1], chunks=chunks, devices=[device] * 3)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

total = 0
correct = 0
for epoch in range(1):
    for batch_idx, (data, target) in enumerate(train_loader):
        if data.size(0) % chunks != 0:
            continue  # Skip batches that do not have the correct size
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.cross_entropy(output, target)
        loss.backward()
        optimizer.step()
        _, predicted = torch.max(output, 1)
        total += target.size(0)
        correct += (predicted == target).sum().item()

        print(f'batch {batch_idx}, Accuracy: {100 * correct / total}%')

[Question] Inference time speed up or not?

Thanks for sharing this project and paper.
I'm using torchgpipe to test inference time on the same test dataset and compare it with running on a single GPU as the baseline.
1/ The inference time with GPipe seems slower than on a single GPU. Is GPipe therefore suitable for training large models but not effective for speeding up inference? Please correct me if I'm wrong.
2/ I'm also curious whether the GPipe library accounts for the communication latency between GPUs when intermediate data is transmitted from one GPU to the next.

Thank you

Failed when trying to use auto partition on running resnet101-speed benchmark

🐞 Bug


Code that reproduces

In torchgpipe/benchmarks/resnet101-speed/main.py, I added only two lines (marked below) to try automatic partitioning, but it failed.

@staticmethod
def pipeline2(model: nn.Module, devices: List[int]) -> Stuffs:
    batch_size = 25000
    chunks = 1667
    balance = [135, 235]
    sample = torch.empty(10, 3, 224, 224)        # added line
    balance = balance_by_time(2, model, sample)  # added line
    model = cast(nn.Sequential, model)
    model = GPipe(model, balance, devices=devices, chunks=chunks)
    return model, batch_size, list(model.devices)

Output from the above code:

root@90988a5b53ba:/torchgpipe/benchmarks/resnet101-speed# python3 main.py pipeline-2 --devices 5,6
Traceback (most recent call last):
  File "/torchgpipe/benchmarks/resnet101-speed/torchgpipe/skip/skippable.py", line 173, in forward
    poppable_tensors[name] = skip_tracker.load(batch, ns, name)
  File "/torchgpipe/benchmarks/resnet101-speed/torchgpipe/skip/tracker.py", line 38, in load
    return self.tensors.pop((ns, name))
KeyError: (<Namespace 'd46efc96-1063-49d7-982b-35c8ddf3bb72'>, 'identity')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 259, in <module>
    cli()
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "main.py", line 153, in cli
    model, batch_size, _devices = f(model, devices)
  File "main.py", line 45, in pipeline2
    balance = balance_by_time(2, model, sample)
  File "/torchgpipe/benchmarks/resnet101-speed/torchgpipe/balance/__init__.py", line 76, in balance_by_time
    times = profile_times(module, sample, timeout, torch.device(device))
  File "/torchgpipe/benchmarks/resnet101-speed/torchgpipe/balance/profile.py", line 67, in profile_times
    batch = batch.call(layer)
  File "/torchgpipe/benchmarks/resnet101-speed/torchgpipe/microbatch.py", line 69, in call
    return Batch(function(self.value))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/torchgpipe/benchmarks/resnet101-speed/torchgpipe/skip/skippable.py", line 175, in forward
    raise RuntimeError(f"'{name}' has not been stashed")
RuntimeError: 'identity' has not been stashed

Environment

Run the below Python code and paste the output:

import os, platform, torch, torchgpipe

if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_properties(0))
    print('Number of GPUs:', torch.cuda.device_count())
    print('CUDA:', torch.version.cuda)
    print('cuDNN:', torch.backends.cudnn.version())
else:
    print('No GPUs')

print('Python:', platform.python_version())
print('PyTorch:', torch.__version__)
print('torchgpipe:', torchgpipe.__version__)

try:
    with open(os.path.join(torchgpipe.__path__[0], '../.git/HEAD')) as f:
        print('torchgpipe.git:', f.read())
except FileNotFoundError:
    pass

Paste here:

GPU: _CudaDeviceProperties(name='Tesla V100-PCIE-32GB', major=7, minor=0, total_memory=32510MB, multi_processor_count=80)
Number of GPUs: 8
CUDA: 10.2
cuDNN: 7605
Python: 3.6.9
PyTorch: 1.8.1+cu102
torchgpipe: 0.0.7
torchgpipe.git: ref: refs/heads/master

Additional context

The same setup succeeds with the AmoebaNet-D model but fails on ResNet.

Typo in documentation

In the documentation (guide.rst),
I found a very trivial typo.

mode = GPipe(model, balance=[1, 1, 1, 1], devices=[4, 5, 6, 7], # Specify GPUs. chunks=8)

should be changed to

model = GPipe(model, balance=[1, 1, 1, 1], devices=[4, 5, 6, 7], # Specify GPUs. chunks=8)

i.e. the variable name 'mode' should be changed to 'model'.

Work with DistributedDataParallel

🐞 Bug

This is likely a new feature request:
on a 6-GPU node, group the first 3 GPUs as one pipeline and the remaining 3 as another.
The two model replicas are synchronized using nn.parallel.DistributedDataParallel.

Code that reproduces

        from torch.nn.parallel import DistributedDataParallel as DDP
        model = GPipe(model, balance=[1, 1, 2], devices=devices, chunks=CSZ)
        model = DDP(model)

the full version can be found here:
https://github.com/YHRen/gpipe_demo/blob/master/main.py

Traceback (most recent call last):
Traceback (most recent call last):
  File "main_dist.py", line 155, in <module>
  File "main_dist.py", line 155, in <module>
    loss.backward()
    loss.backward()
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/autograd/function.py", line 77, in apply
    allow_unreachable=True)  # allow_unreachable flag
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/autograd/function.py", line 77, in apply
    return self._forward_cls.backward(self, *args)
    return self._forward_cls.backward(self, *args)
  File "/ccsopen/home/yren/.local/lib/python3.6/site-packages/torchgpipe/checkpoint.py", line 269, in backward
  File "/ccsopen/home/yren/.local/lib/python3.6/site-packages/torchgpipe/checkpoint.py", line 269, in backward
    torch.autograd.backward(tensors, grad_output)
    torch.autograd.backward(tensors, grad_output)
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: has_marked_unused_parameters_ INTERNAL ASSERT FAILED at /opt/anaconda/conda-bld/pytorch-base_1594299597148/work/torch/csrc/distributed/c10d/reducer.cpp:290, please report a bug to PyTorch. 
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: has_marked_unused_parameters_ INTERNAL ASSERT FAILED at /opt/anaconda/conda-bld/pytorch-base_1594299597148/work/torch/csrc/distributed/c10d/reducer.cpp:290, please report a bug to PyTorch. 
Traceback (most recent call last):
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/distributed/launch.py", line 253, in <module>
    main()
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/distributed/launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/bin/python', '-u', 'main_dist.py', '--local_rank=1', '-m', 'cnn', '-b', '32', '-c', '4', '-d', '2048', '-w', '128', '-l', '5', '-e', '2', '--dist', '--gpus_per_group', '3', '--group_per_node', '2']' returned non-zero exit status 1.

If we wrap it the other way around:

        model = model.cuda() # default cuda id has been handled per rank. 
        model = DDP(model)
        model = GPipe(model, balance=[1, 1, 2], devices=devices, chunks=CSZ)

The error then complains that all tensors must be on the same device:

Traceback (most recent call last):
  File "main_dist.py", line 135, in <module>
    model = DDP(model)
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 325, in __init__
    self._ddp_init_helper()
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 343, in _ddp_init_helper
    self._module_copies = replicate(self.module, self.device_ids, detach=True)
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 96, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 75, in _broadcast_coalesced_reshape
    return comm.broadcast_coalesced(tensors, devices)
  File "/sw/ascent/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.7.0-0/lib/python3.6/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: all tensors must be on devices[0]

Environment

GPU: _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16128MB, multi_processor_count=80)
Number of GPUs: 6
CUDA: 10.2.89
cuDNN: 7605
Python: 3.6.10
PyTorch: 1.3.1
torchgpipe: 0.0.6

Additional context

I'm wondering if it is possible to run "DDP" for each split in the model pipeline.

Thank you.
