msr-fiddle / pipedream Goto Github PK

View Code? Open in Web Editor NEW

371.0 18.0 113.0 2.94 MB

License: MIT License

Dockerfile 0.16% Python 98.85% C++ 0.24% Cuda 0.62% Shell 0.12%

pipedream's Introduction

PipeDream: Pipeline Parallelism for DNN Training

This repository contains the source code implementation of the following papers:

"PipeDream: Generalized Pipeline Parallelism for DNN Training", which appeared at SOSP 2019 (pipedream branch).
"Memory-Efficient Pipeline-Parallel DNN Training", which appeared at ICML 2021 (pipedream_2bw branch).

This work was one as part of Microsoft Research's Project Fiddle. This source code is available under the MIT License.

Directory Structure

`graph`

This contains a Python implementation of a graph, used by the PipeDream profiler and optimizer. Profiling scripts in profiler generate graph profiles, that can then be ingested by the optimizer located in optimizer to generate a partitioned model, that can then be fed to the PipeDream runtime.

`profiler`

Instrumented PyTorch applications which return profiles that can be ingested by the optimizer.

`optimizer`

A Python implementation of PipeDream's optimizer.

`runtime`

PipeDream's runtime, which implements model parallelism, as well as input pipelining in PyTorch. This can be fused with data parallelism to give hybrid model and data parallelism, and input pipelining.

Setup

Software Dependencies

To run PipeDream, you will need a NVIDIA GPU with CUDA 10.0, GPU driver version 418.56, nvidia-docker2, and Python 3. On a Linux server with NVIDIA GPU(s) and Ubuntu 16.04, these dependencies can be installed using,

bash setup.sh

All dependencies are in the nvcr.io/nvidia/pytorch:19.05-py3 container, which can be downloaded using,

nvidia-docker pull nvcr.io/nvidia/pytorch:19.05-py3

To run the PipeDream profiler, you will need to build a new Docker image, which can be done using the Dockerfile in this directory. Note that the Dockerfile has a dependency on the pre_hook.patch and requirements.txt files in this directory. This container can be built using,

docker build --tag <CONTAINER_NAME> .

The PyTorch Docker Container can then be run using,

nvidia-docker run -it -v /mnt:/mnt --ipc=host --net=host <CONTAINER_NAME> /bin/bash

Data

Image Classification

All image classification experiments are run using the ImageNet ILSVC 2012 dataset. This can be downloaded using the following command (within the docker container above),

cd scripts; python download_imagenet.py --data_dir <DATASET_DIR>

Note that the ImageNet dataset is about 145GB, so this download script can take some time.

Translation

All translation experiments are run using the WMT En-De dataset, also used for the MLPerf translation (RNN) task. This can be downloaded using the instructions in the MLPerf repository.

End-to-end Workflow

To run a demo, run the following commands (the optimizer and runtime have been verified to work unchanged in nvcr.io/nvidia/pytorch:19.05-py3). More detailed instructions for each of the individual components are in the corresponding directory READMEs, and more detailed instructions on how to run the main experiments in the SOSP paper are in EXPERIMENTS.md.

[from pipedream/profiler/image_classification; you will need to have the changes to PyTorch listed above] Note that the profiling step must be run with only a single GPU (hence the CUDA_VISIBLE_DEVICES=0 before the command).

CUDA_VISIBLE_DEVICES=0 python main.py -a vgg16 -b 64 --data_dir <path to ImageNet directory>

[from pipedream/optimizer]

python optimizer_graph_hierarchical.py -f ../profiler/image_classification/profiles/vgg16/graph.txt -n 4 --activation_compression_ratio 1 -o vgg16_partitioned

[from pipedream/optimizer]

python convert_graph_to_model.py -f vgg16_partitioned/gpus=4.txt -n VGG16Partitioned -a vgg16 -o ../runtime/image_classification/models/vgg16/gpus=4 --stage_to_num_ranks 0:3,1:1

[from pipedream/runtime/image_classification; run on 4 GPUs (including a single server with 4 GPUs)]

python main_with_runtime.py --module models.vgg16.gpus=4 -b 64 --data_dir <path to ImageNet> --rank 0 --local_rank 0 --master_addr <master IP address> --config_path models/vgg16/gpus=4/hybrid_conf.json --distributed_backend gloo
python main_with_runtime.py --module models.vgg16.gpus=4 -b 64 --data_dir <path to ImageNet> --rank 1 --local_rank 1 --master_addr <master IP address> --config_path models/vgg16/gpus=4/hybrid_conf.json --distributed_backend gloo
python main_with_runtime.py --module models.vgg16.gpus=4 -b 64 --data_dir <path to ImageNet> --rank 2 --local_rank 2 --master_addr <master IP address> --config_path models/vgg16/gpus=4/hybrid_conf.json --distributed_backend gloo
python main_with_runtime.py --module models.vgg16.gpus=4 -b 64 --data_dir <path to ImageNet> --rank 3 --local_rank 3 --master_addr <master IP address> --config_path models/vgg16/gpus=4/hybrid_conf.json --distributed_backend gloo

master IP address here is the IP address of the rank 0 process. On a server with 4 GPUs, localhost can be specified.

When running DP setups, please use the nccl backend for optimal performance. When running hybrid setups, please use the gloo backend.

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

License

Licensed under the MIT license.

pipedream's People

Contributors

Stargazers

Watchers

Forkers

wangshuaizs fanshiqing uoft-ecosystem jeageun cnzhanj huaizhengzhang pengwwang zhaojp-frank jthelin mimic66 kkkwqy adam-ct leoyichen kanonjz stepyhan catfootprint mahajandivya vibhatha anijain2305 stjordanis jglicat tanmoy058 simonzsx jasonyanwenl nirandaperera aditbro joeyyoung valaydave jiashenc yanzhaowu laekov d3v3l0 neolinsu ziyueluocs ooffordable gudiandian pengmli mitchellx aoranwu enderdead gaurav274 codes1gn shigangli liuq4360 502110983 khabya lipaul tygu1004 ouyangyu erdnbt vihowe yxd886 sirius93123 siddharth9820 org-mars xxlest saisriker ymsdu2004 zanzong zmxdream taihulight codecaution zzqq2199 xrosliang coderxdy yxyoo charlesjiangxm parasj haolin-nju tianyi-ge jinalong sheele41 huyang719 vuanvin shyhuai gofire2000 mlkimmins gekeshi haishin songzhen-neu daiyaanarfeen habvt emeralddddd hongsunjang huhuzizi mahidhar96 im-moo machinelearningsystem allen-czyysx jxzhangjhu daitr616 wangxionghome kongweiming tweikiang dshruti20 otavioon trellixvulnteam 5001945 qinpr rutheniumnetwork

pipedream's Issues

insufficient shared memory (shm)

Here is the error message, why I could complete one epoch of training, the second epoch began to report errors: This might be caused by insufficient shared memory (shm).
I can't understand why this mistake happened?

Epoch 0: 6843.771 seconds
Epoch start time: 1577064742.170, epoch end time: 1577071585.941
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 480, in _try_get_batch
data = self.data_queue.get(timeout=timeout)
File "/opt/conda/lib/python3.6/queue.py", line 173, in get
self.not_empty.wait(remaining)
File "/opt/conda/lib/python3.6/threading.py", line 299, in wait
gotit = waiter.acquire(True, timeout)
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 325) is killed by signal: Bus error.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main_with_runtime.py", line 579, in
main()
File "main_with_runtime.py", line 311, in main
prec1 = validate(val_loader, r, epoch)
File "main_with_runtime.py", line 453, in validate
r.run_forward()
File "../runtime.py", line 498, in run_forward
self.receive_tensors_forward()
File "../runtime.py", line 387, in receive_tensors_forward
input = next(self.loader_iter)
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 545, in next
idx, batch = self._get_batch()
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 512, in _get_batch
success, data = self._try_get_batch()
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 488, in _try_get_batch
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 325) exited unexpectedly
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).

Gpu underutilization

Environment: 2 gpus in dgx-1
Bandwidth 20GB/s (considered NVLINK bandwidth)
Run the demo:
step1: CUDA_VISIBLE_DEVICES=0 python main.py -a vgg16 -b 32 --data_dir ./data

step2: python optimizer_graph_hierarchical.py -f ../profiler/image_classification/profiles/vgg16/graph.txt -n 2 --activation_compression_ratio 1 -o vgg16_partitioned --network_bandwidths [21474836480]

step3: python convert_graph_to_model.py -f vgg16_partitioned/gpus=2.txt -n VGG16Partitioned -a vgg16 -o ../runtime/image_classification/models/vgg16/gpus=2 --stage_to_num_ranks 0:1,1:1

step4: docker0 的ip为 10.1.1.4 ，docker1 的ip为 10.1.1.5

in docker0：
python main_with_runtime.py --module models.vgg16.gpus=2 -b 32 --data_dir ./data --rank 0 --local_rank 0 --master_addr 10.1.1.4 --config_path models/vgg16/gpus=2/hybrid_conf.json --distributed_backend gloo --eval-batch-size 32

in docker1：
python main_with_runtime.py --module models.vgg16.gpus=2 -b 32 --data_dir ./data --rank 1 --local_rank 1 --master_addr 10.1.1.4 --config_path models/vgg16/gpus=2/hybrid_conf.json --distributed_backend gloo --eval-batch-size 32

whats the role of optimizer/inference_optimizer_graph.py

How to use this script？
What does the -t parameter mean? Can you give me an example?
https://github.com/msr-fiddle/pipedream/blob/f50827f2e28cbdbd82a4ea686c0498272b1460d6/optimizer/inference_optimizer_graph.py

All ranks are not trained. They are blocked all the time

My environment:

server1：4GPUS

server2 : 4GPUS

Initialization has been completed. All ranks are not trained. They are blocked all the time

Here is the output of each rank：

in rank0: Finished initializing process group; backend: gloo, rank: 0, world_size: 8 Replicating stage: ranks=4, module_size=775424.000 Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]} Receive ranks: {} Setting up process groups for broadcasts... Letting in 1 warm-up minibatches Running training for 5004 minibatches Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes Epoch: 0 Step 0 Learning rate: 0.040000 Epoch: [0][0/5004] Memory: 8.499 (9.475)

in rank1: Finished initializing process group; backend: gloo, rank: 1, world_size: 8 Replicating stage: ranks=4, module_size=775424.000 Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]} Receive ranks: {} Setting up process groups for broadcasts... Letting in 1 warm-up minibatches Running training for 5004 minibatches Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes Epoch: 0 Step 0 Learning rate: 0.040000 Epoch: [0][0/5004] Memory: 8.499 (9.475)

in rank2: Finished initializing process group; backend: gloo, rank: 2, world_size: 8 Replicating stage: ranks=4, module_size=775424.000 Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]} Receive ranks: {} Setting up process groups for broadcasts... Letting in 1 warm-up minibatches Running training for 5004 minibatches Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes Epoch: 0 Step 0 Learning rate: 0.040000 Epoch: [0][0/5004] Memory: 8.499 (9.475)

in rank3: Finished initializing process group; backend: gloo, rank: 3, world_size: 8 Replicating stage: ranks=4, module_size=775424.000 Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]} Receive ranks: {} Setting up process groups for broadcasts... Letting in 1 warm-up minibatches Running training for 5004 minibatches Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes Epoch: 0 Step 0 Learning rate: 0.040000 Epoch: [0][0/5004] Memory: 8.499 (9.475)

in rank4: Finished initializing process group; backend: gloo, rank: 4, world_size: 8 Replicating stage: ranks=3, module_size=3678208.000 Send ranks: {'out1': [7], 'target': [7]} Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]} Setting up process groups for broadcasts... Letting in 1 warm-up minibatches Running training for 6672 minibatches Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes Epoch: 0 Step 0 Learning rate: 0.030000 Epoch: [0][0/6672] Memory: 1.209 (1.856) Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes Optimizer step took: 0.002

in rank5 : Finished initializing process group; backend: gloo, rank: 5, world_size: 8 Replicating stage: ranks=3, module_size=3678208.000 Send ranks: {'out1': [7], 'target': [7]} Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]} Setting up process groups for broadcasts... Letting in 1 warm-up minibatches Running training for 6672 minibatches Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes Epoch: 0 Step 0 Learning rate: 0.030000 Epoch: [0][0/6672] Memory: 1.209 (1.856) Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes Optimizer step took: 0.002

in rank6: Finished initializing process group; backend: gloo, rank: 6, world_size: 8 Replicating stage: ranks=3, module_size=3678208.000 Send ranks: {'out1': [7], 'target': [7]} Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]} Setting up process groups for broadcasts... Letting in 1 warm-up minibatches Running training for 6672 minibatches Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes Epoch: 0 Step 0 Learning rate: 0.030000 Epoch: [0][0/6672] Memory: 1.156 (1.804) Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes Optimizer step took: 0.002

in rank7: Finished initializing process group; backend: gloo, rank: 7, world_size: 8 Send ranks: {} Receive ranks: {'out1': [4, 5, 6], 'target': [4, 5, 6]} Setting up process groups for broadcasts... Letting in 0 warm-up minibatches Running training for 20016 minibatches Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690624.000 bytes send_tensors 0.000 seconds send_tensors_size 0.000 bytes Epoch: 0 Step 0 Learning rate: 0.010000 Epoch: [0][0/20016] Time: 8.293 (8.293) Epoch time [hr]: 0.002 (46.107) Memory: 1.284 (1.636) Loss: 6.9063 (6.9063) Prec@1: 0.000 (0.000)Prec@5: 0.000 (0.000) Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 25690112.000 bytes Optimizer step took: 0.005

Does using multiple GPUs on a single machine require the --local_rank option when running?

In the top level README, the example workflow has a line that says:

[from pipedream/runtime/image_classification; run on 4 GPUs (including a single server with 4 GPUs)]

Do those commands actually require setting the --local_rank option in order to map the different processes to different GPUs on a single machine setup? When I run nvidia-smi, I am not actually seeing multiple GPUs get used unless I set --local_rank to a unique value (between 0 and 4 in my 4 GPU setup).

I would like to know what the role of the following driver.py is？

https://github.com/msr-fiddle/pipedream/blob/f50827f2e28cbdbd82a4ea686c0498272b1460d6/runtime/driver.py

When I ran python driver.py --config . Only 4 files were generated, as follows:
command_history.log
gpus=16
machinefile
test_gpus16.yml

I understand that when the above command is run, the training begins, but only four files are generated, so I want to know what driver.py does

How to expose the "register_pre_hook()" interface?

Hi, I stack in the first step. How to profile the runtime of forward and backward? I've never tried to modified the pytorch source code before, and the 'python_cpp_function.h' or 'python_cpp_function.cpp' isn't clear for me to write the pre_hook interface. Could you provide the change you made to pytorch source code? Thx a lot.

optimizer speed for resnet101

Hi, I'm trying to compare resnet101 with model parallelism and your pipeline parallelism using a nvprof.
For this one, I'm trying to make an optimization code to launch.

I launched the python code below, and it takes a long time, so I can't get results for this code.

python convert_graph_to_model.py -f resnet101_partitioned/gpus=4.txt -n RESNET101Partitioned -a resnet101 -o ../runtime/image_classification/models/resnet101/gpus=4 --stage_to_num_ranks 0:1,1:1,2:1,3:1
How long did you take to create a resnet101 optimization result? Is there a way to get a result faster? If you have already computed the result for this, could you guys upload it for me?

Thanks

Infinite loop problem in convert_graph_to_model.py

I had a problem using the optimizer for the resnet101 model.
Following the README, the following commands were executed sequentially.

python optimizer_graph_hierarchical.py
-f ../profiler/image_classification/profiles/resnet101/graph.txt
--activation_compression_ratio 1
-o resnet101_partitioned
--all_num_machines 4 4
--network_bandwidths 15750000000 7000000000
--memory_size 16000000000

python convert_graph_to_model.py
-f resnet101_partitioned/gpus=16.txt
-n RESNET101Partitioned
-a resnet101
-o ../runtime/image_classification/models/gpus=16

In the resnet101 (or resnet152) model, an infinite loop occurs in convert_graph_to_model.py.
(resnet50 or vgg16 had no problem.)

When I checked it, an infinite loop occurs in the populate_depths function in pipedream/graph/graph.py.

Is my execution wrong?
Can you confirm what's the problem?

does branch mechanism has not been implemented yet？

Traceback (most recent call last): File "main_with_runtime.py", line 579, in <module> main() File "main_with_runtime.py", line 129, in main module = importlib.import_module(args.module) File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "<frozen importlib._bootstrap>", line 994, in _gcd_import File "<frozen importlib._bootstrap>", line 971, in _find_and_load File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 665, in _load_unlocked File "<frozen importlib._bootstrap_external>", line 678, in exec_module File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed File "/home/admin/pipedream/runtime/image_classification/models/inceptionv3/gpus=2/__init__.py", line 1, in <module> from .inceptionv3 import Inceptionv3Partitioned File "/home/admin/pipedream/runtime/image_classification/models/inceptionv3/gpus=2/inceptionv3.py", line 2, in <module> from .stage0 import Stage0 File "/home/admin/pipedream/runtime/image_classification/models/inceptionv3/gpus=2/stage0.py", line 24 self.layer19 = torch.nn.Branch 3

here is generated stages's code， it seems not legal python code：

class Stage0(torch.nn.Module): def __init__(self): super(Stage0, self).__init__() self.layer2 = torch.nn.Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), bias=False) self.layer3 = torch.nn.BatchNorm2d(32, eps=0.001, momentum=0.1, affine=True, track_running_stats=True) self.layer4 = torch.nn.ReLU(inplace=True) self.layer5 = torch.nn.Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), bias=False) self.layer6 = torch.nn.BatchNorm2d(32, eps=0.001, momentum=0.1, affine=True, track_running_stats=True) self.layer7 = torch.nn.ReLU(inplace=True) self.layer8 = torch.nn.Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) self.layer9 = torch.nn.BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True) self.layer10 = torch.nn.ReLU(inplace=True) self.layer11 = torch.nn.MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False) self.layer12 = torch.nn.Conv2d(64, 80, kernel_size=(1, 1), stride=(1, 1), bias=False) self.layer13 = torch.nn.BatchNorm2d(80, eps=0.001, momentum=0.1, affine=True, track_running_stats=True) self.layer14 = torch.nn.ReLU(inplace=True) self.layer15 = torch.nn.Conv2d(80, 192, kernel_size=(3, 3), stride=(1, 1), bias=False) self.layer16 = torch.nn.BatchNorm2d(192, eps=0.001, momentum=0.1, affine=True, track_running_stats=True) self.layer17 = torch.nn.ReLU(inplace=True) self.layer18 = torch.nn.MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False) self.layer19 = torch.nn.Branch 3 self.layer20 = torch.nn.Branch 2 self.layer21 = torch.nn.Branch 1 self.layer22 = torch.nn.Branch 0 self.layer24 = torch.nn.Branch 7 self.layer25 = torch.nn.Branch 6 self.layer26 = torch.nn.Branch 5 self.layer27 = torch.nn.Branch 4 self.layer29 = torch.nn.Branch 9 self.layer30 = torch.nn.Branch 8 self.layer31 = torch.nn.Branch 11 self.layer32 = torch.nn.Branch 10 self.layer34 = torch.nn.MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False) self.layer35 = torch.nn.Branch 13 self.layer36 = torch.nn.Branch 12

Batch size and optimizer

Hi,
I see that the profiler calculates memory and execution times based on a particular batch size.
But the optimizer code does not take in any batch size parameter. So, does that mean, in the optimizer logic, the execution times and activation memories are normalized?

Best

RuntimeError: ProcessGroupNCCL does not support send

i run the command as bellow:
`
worker-0:
python main_with_runtime.py --module models.vgg16.gpus=2 -b 64 --data_dir /home/admin/pipedream/data/sample --rank 0 --local_rank 0 --master_addr localhost --config_path models/vgg16/gpus=2/hybrid_conf.json --distributed_backend nccl

worker-1:
python main_with_runtime.py --module models.vgg16.gpus=2 -b 64 --data_dir /home/admin/pipedream/data/sample --rank 1 --local_rank 1 --master_addr localhost --config_path models/vgg16/gpus=2/hybrid_conf.json --distributed_backend nccl`

then i got the error:

`Exception in thread Thread-3:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "../communication.py", line 632, in send_helper_thread
sub_process_group=sub_process_group)
File "../communication.py", line 709, in _send
dist.send(tensor=tensor_shape, dst=dst_rank, tag=tag)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 608, in send
_default_pg.send([tensor], dst, tag).wait()
RuntimeError: ProcessGroupNCCL does not support send

Epoch: 0 Step 0 Learning rate: 0.100000
Epoch: [0][0/13] Memory: 6.323 (9.244)`

tenroflow support

@deepakn94 it seems that this project just work for pytorch, is there any cross-framework（such as tensorflow、mxnet） plan？thanks

What is the meaning of `antichain` in optimizer_graph_hierarchical.py ?

Could you please describe the meaning of antichain graph used in partitioning algorithm? Is it related to backward path computations?

can't find pipedream AMI in AWS console

Hi @deepakn94 , I can't find the AMI in the AWS console by searching the AMI ID or name in EXPERIMENTS.md. Is it a public image?

Actual results did not match the optimizer expectation

Hi,

I have been running resnet101 with batch size 64 on straight pipeline with 4 GPUs.

I ran the following commands for the profiler and the optimizer.

CUDA_VISIBLE_DEVICES=4 python main.py -a "resnet101"  -b 64 --data_dir "$HOME/data/imagenet-mini/" --profile_directory "profiles1/64"

python optimizer_graph_hierarchical.py -f "../profiler/image_classification/profiles1/64/resnet101/graph.txt" -n 4 -s 11000000000 --straight_pipeline -o "./optim/64/resnet101/gpus=4_straight" -b 2500000000 --use_memory_constraint
python convert_graph_to_model.py -f "./optim/64/resnet101/gpus=4_straight/gpus=4.txt" -n resnet101 -a resnet101 -o "./optim/64/resnet101/gpus=4_straight/"

On the optimizer step, I get the following output.

Time taken by single-stage pipeline: 2.0447990000000003
Time per stage in pipeline: 0.5839989999999989
Throughput increase (compared to single machine): 3.5013741461886134
[Note that single-machine and (4)-machine DP might not fit given memory constraints]
Throughput increase of (4)-machine DP compared to single machine: 3.62130052703585
Throughput increase (compared to (4)-machine DP): 0.966883063155932

So, my expectation was the straight pipeline would be roughly similar to the DP timings.

But the experimental results were drastically different for pipeline, but matches perfectly with data-parallel.

        model  batch     conf         mean  speed_up
21  resnet101     64   1_conf  1098.136000  1.000000
22  resnet101     64  mp_conf   770.499250  1.425227
23  resnet101     64  dp_conf   304.383375  3.607740

I'd be very grateful if you could help me, figuring out this discrepancy?

I have drawn the gannt charts for,
pipeline https://drive.google.com/file/d/1l8UcafF1CIVmUgOcmdRLztpFZcB0G7dp/view?usp=sharing
data parallel https://drive.google.com/file/d/1l8UcafF1CIVmUgOcmdRLztpFZcB0G7dp/view?usp=sharing
Each h-bar represents the time period from start to end of fwd(or bwd) annotated with + (or -).

It looks to me that each stage is stagnated on the comms for a considerable period of time.

RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:127] rp != nullptr. Unable to find address for: dgx-1.ai

When I launch docker specifying --net=host, and the training script specifying --master_addr: localhost, an error is thrown

Traceback (most recent call last):
File "main_with_runtime.py", line 579, in
main()
File "main_with_runtime.py", line 192, in main
enable_recompute=args.recompute)
File "../runtime.py", line 64, in init
master_addr, rank, local_rank, num_ranks_in_server)
File "../runtime.py", line 196, in initialize
backend=self.distributed_backend)
File "../communication.py", line 42, in init
dist.init_process_group(backend, rank=rank, world_size=world_size)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 370, in init_process_group
timeout=timeout)
RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:127] rp != nullptr. Unable to find address for: dgx-1.ai

profile problem

@deepakn94 it looks like that the container doesn't support NVIDIA GPU K40c, So, I decided modify the pytorch source code according pre_hook.patch, But I did not success:

After modify the code, I should set USE_NCCL = OFF? I did this and compile successfuly.
Is there any Points to care about when I use pytorch source code instead container? Because When I run image_classification in profiler, the output error is

=> creating model 'resnet50'
Collecting profile...

Total accounted time: 2364.255 ms
Traceback (most recent call last):
File "main.py", line 569, in
main()
File "main.py", line 283, in main
os.path.join(args.profile_directory, args.arch))
File "main.py", line 115, in create_graph
output = model(input)
File "/seu_share/home/zhanjun/.conda/envs/pipetorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 571, in call
result = self.forward(*input, **kwargs)
File "/seu_share/home/zhanjun/.conda/envs/pipetorch/lib/python3.6/site-packages/torchvision/models/resnet.py", line 207, in forward
x = torch.flatten(x, 1)
TypeError: flatten(): argument 'input' (position 1) must be Tensor, not TensorWrapper

Thank you!

how does the backward msg get the corresponding old weights in 1F1B-RR

hi, me again.. First of all, your work is really good, so I read it once more time.

For "1F1B-RR" mechanism, let's assume that there is 1 machine(A) in previous stage and 2 machines(B, C) in the current stage. It's a simple 1-2 configuration, and pipeline the data with in-flight=2.
A process forward of batch3, backward of batch2, the forward of batch4. Then A send batch3 to B and batch4 to C. I want to know what'll happen if C finish the stage then send the backward of batch4 to A earlier than B's batch3. I find that load_old_params() seems to just pop the buff queue. So the batch4 would get the wrong weights which doesn't include the gradients of batch2.
I'm not sure that the functionality of setup_messaging_schedule. Is setup_messaging_schedule function designed for this corner case?

Translation profiler missing init.py files for module location?

The translation profiler section appears to be have some issues. When attempting to profile, I ran into the following:

# python train.py --dataset-dir /mnt/ptb --target-bleu 21.8 --epochs 20 --math fp16 --print-freq 10 --arch gnmt --batch-size 64 --test-batch-size 128 --model-$onfig "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': False}" --optimization-config "{'optimizer': 'FusedAdam', 'lr': 1.75e-3}" --scheduler-config "{'lr_method':'mlperf', 'warmup_iters$:1000, 'remain_steps':1450, 'decay_steps':40}"
Traceback (most recent call last):
  File "train.py", line 20, in <module>
    from seq2seq.models.gnmt import GNMT
ModuleNotFoundError: No module named 'seq2seq.models'

I saw that there were local directories for that module, so I added empty __init__.py files into seq2seq and seq2seq/models to get around that. If the intention is to use these local versions, then you should probably just check in some empty __init__.py files into the repo.

No seq2seq.pack_utils module when running translation profiler

I worked around the issue in #8 , but I am now seeing another submodule missing:

ModuleNotFoundError: No module named 'seq2seq.pack_utils'

I found that seq2seq/csrc contains the C++ code that I assume gets called under the hood, but I am not sure what I need to do to make the python interpreter find that as a module. Is there some code generation step that needs to happen or anything like that?

How to determine replication factors

I run the following code
python optimizer_graph_hierarchical.py -f ../profiler/image_classification/profiles/vgg16/graph.txt -n 8 2 --activation_compression_ratio 1 -o vgg16_partitioned -b 4294967296 2147483648

i get the output.Can you tell me in this example: stage and replication factor?
How do they use the following output calculations?

Total number of states: 40
Solving optimization problem with 8 machines with inter-machine bandwidth of 4.29 GB/s
[[0.04692 0.056919000000000004 0.21645 ... 0.6715939999999999
0.6715939999999999 0.6725349999999999]
[None 0.009999000000000001 0.16953000000000001 ... 0.624674 0.624674
0.6256149999999999]
[None None 0.159531 ... 0.6146749999999999 0.6146749999999999
0.6156159999999998]
...
[None None None ... None 0.0 0.0009409999999999696]
[None None None ... None None 0.0009409999999999696]
[None None None ... None None None]]
Solving optimization problem with 2 machines with inter-machine bandwidth of 2.15 GB/s
[[0.005865730156898499 0.007115605156898499 0.027072026604413987 ...
0.09235717243739539 0.09235717243739539 0.09235717243739539]
[None 0.0012498750000000001 0.02120629644751549 ... 0.0856534978594099
0.0856534978594099 0.0856534978594099]
[None None 0.01995642144751549 ... 0.08422506928798132
0.08422506928798132 0.08422506928798132]
...
[None None None ... None 0.0 0.0009765625]
[None None None ... None None 0.001786962507337328]
[None None None ... None None None]]
[[0.002933282310962677 0.0035582198109626773 0.013545028504729271 ...
0.07743855509553638 0.07743855509553638 0.07839246224258628]
[None 0.0006249375000000001 0.010611746193766595 ... 0.0740863005740302
0.0740863005740302 0.07504020772108011]
[None None 0.009986808693766594 ... 0.07337208628831592
0.07337208628831592 0.07432599343536582]
...
[None None None ... None 0.0 0.0014421883970499039]
[None None None ... None None 0.0018473884007185679]
[None None None ... None None None]]

Level 2
Number of machines used: 2...
Compute time = 0.335797, Data-parallel communication time = 0.250080...

Number of machines in budget not used: 0...

(Split start, split end) / compute time taken per stage / replication factor per stage:
(0, 40) 0.6725349999999999 2
Total number of stages: 1
Level 1
Number of machines used: 1...
Split between layers 23 and 24...
Split before antichain ['node26']...
Compute time = 0.049474, Data-parallel communication time = 0.000000, Pipeline-parallel communication time = 0.023926...
Number of machines used: 7...
Compute time = 0.088874, Data-parallel communication time = 0.003483...
Number of machines in budget not used: 0...
(Split start, split end) / compute time taken per stage / replication factor per stage:
(0, 24) 0.6221200000000001 7
(24, 40) 0.050414999999999766 1

Total number of stages: 2
Time taken by single-stage pipeline: 0.6725349999999999
Time per stage in pipeline: 0.07839246224258628
Throughput increase (compared to single machine): 8.579077385257188
[Note that single-machine and (8,2)-machine DP might not fit given memory constraints]
Throughput increase of (8,2)-machine DP compared to single machine: 6.5655154772476045
Throughput increase (compared to (8,2)-machine DP): 1.3066875578905357

Can the profiler handle dynamic graphs?

Can the profiler which generates the graph handle conditionals and loops?

Does it work for a detection network with branches?

hi, i notice the implemetation for image classification in pipedream, but not sure it also works for object detection network?

How to use two GPUs in two servers separately to run Pipedream

I used two containers on two servers to run pipedream, data parallelism worked in nccl backend but model parallelism didn't work in gloo backend, and it seemed two servers just waited for connection and no further output.
Then I use runtime/tests/communication/point_to_point.py to test the connection, it was the same situation as above.

# server 1:
python point_to_point.py --backend gloo --master_addr xxx.xxxx.xx.xx --rank 0 --master_port 8888
# server 2
python point_to_point.py --backend gloo --master_addr xxx.xxxx.xx.xx --rank 1 --master_port 8888

@deepakn94 Could you help me?

Is this the version used in SOSP paper?

Hi,

I have been testing this repository to replicate results given in the SOSP paper.
https://dl.acm.org/doi/abs/10.1145/3341301.3359646

But I was unable to reproduce the results, and I'm seeing some data loading problems for Alexnet. I have started a discussion in PyTorch forum.
https://discuss.pytorch.org/t/strange-behavior-in-data-loader-with-workers/83769

So, is this the exact version that was used for the experiments in the SOSP paper?

Translation demo: Division by zero

I'm experiencing the below error which looks critical. I'm using revision f50827f with docker base nvcr.io/nvidia/pytorch:19.05-py3

Traceback (most recent call last):
  File "train.py", line 474, in <module>
    main()
  File "train.py", line 458, in main
    train_loss, train_perf = trainer.optimize(train_loader)
  File "/workspace/src/pipedream/profiler/translation/seq2seq/train/trainer.py", line 373, in optimize
    output = self.feed_data(data_loader, training=True)
  File "/workspace/src/pipedream/profiler/translation/seq2seq/train/trainer.py", line 330, in feed_data
    os.path.join("profiles", self.arch+'_2'))
  File "/workspace/src/pipedream/profiler/translation/seq2seq/train/trainer.py", line 42, in create_graph
    graph_creator.persist_graph(directory)
  File "../torchmodules/torchgraph/graph_creator.py", line 281, in persist_graph
    self.graph.render_bar_graphs_and_cdfs(directory)
  File "../../graph/graph.py", line 607, in render_bar_graphs_and_cdfs
    pdfs.append(((node.forward_compute_time + node.backward_compute_time) / (cdfs[-1][0] / 100.0),
ZeroDivisionError: float division by zero

Hanging with [4,3,1] GPU assignment

@deepakn94 Do you have any cues on how to fix the hang issue using two DPs with 4GPUs and 3GPUs.

I checked the code, it seems that all data from 4GPUs is only sent to one of the 3 GPUs (I guess it is due to the self.tensor_tags which can only store one tag for one input/output node). e.g, here

I also noticed a sentence called "TODO: don't current support uneven configurations." here

Directory mismatch in example workflow in documentation

Hi there, thanks for thoroughly documenting your work and providing examples! I think I found a documentation issue under https://github.com/msr-fiddle/pipedream#end-to-end-workflow

The second step of optimization, which takes place in pipedream/optimizer includes the following:

-o ../runtime/models/vgg16/gpus=4

This places files under pipedream/runtime/models

The next step, running, suggests you navigate over to pipedream/runtime/image_classification and then run some commands that include the following:

--config_path models/vgg16/gpus=4/hybrid_conf.json

This reads files from pipedream/runtime/image_classification/models.

Presumably this mismatch is unintentional.

Error occurs in profiler after I updated pytorch

Traceback (most recent call last):
  File "main.py", line 574, in <module>
    main()
  File "main.py", line 266, in main
    per_layer_times, data_time = profile_train(train_loader, model, criterion, optimizer)
  File "main.py", line 345, in profile_train
    with torchprofiler.Profiling(model, module_whitelist=[]) as p:
  File "../torchmodules/torchprofiler/profiling.py", line 25, in __enter__
    self.start()
  File "../torchmodules/torchprofiler/profiling.py", line 93, in start
    self.hook_modules(self.model)
  File "../torchmodules/torchprofiler/profiling.py", line 120, in hook_modules
    self.hook_modules(sub_module)
  File "../torchmodules/torchprofiler/profiling.py", line 120, in hook_modules
    self.hook_modules(sub_module)
  File "../torchmodules/torchprofiler/profiling.py", line 122, in hook_modules
    sub_module.reset_hooks()
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 585, in __getattr__
    type(self).__name__, name))
AttributeError: 'Conv2d' object has no attribute 'reset_hooks'

Imagenet download url in imagenet.py doesn't work

Imagenet doesn't support direct download, is there any solution?

"no kernel image is available for execution on the device" when doing translation profiling

I am running the translation profiler as follows:

# python train.py --dataset-dir /mnt/wmt16/ --target-bleu 21.8 --epochs 20 --math fp32 --print-freq 10 --arch gnmt --batch-size 64 --test-batch-size 128
 --model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': False}" --optimization-config "{'optimizer': 'FusedAdam', 'lr': 1.75e-3}" --scheduler-config "{'lr_method':'mlperf
', 'warmup_iters':1000, 'remain_steps':1450, 'decay_steps':40}"

When running, I encounter the following error when the first training epoch starts:

0: Starting epoch 0                                                                                                                                                                                 [7/1801]
:::MLPv0.5.0 gnmt 1569879709.258431435 (train.py:452) train_epoch: 0
THCudaCheck FAIL file=/opt/pytorch/pytorch/aten/src/THC/generic/THCTensorMath.cu line=35 error=209 : no kernel image is available for execution on the device
Traceback (most recent call last):
  File "train.py", line 474, in <module>
    main()
  File "train.py", line 458, in main
    train_loss, train_perf = trainer.optimize(train_loader)
  File "/workspace/pipedream/profiler/translation/seq2seq/train/trainer.py", line 372, in optimize
    self.preallocate(data_loader, training=True)
  File "/workspace/pipedream/profiler/translation/seq2seq/train/trainer.py", line 360, in preallocate
    self.iterate(src, tgt, update=False, training=training)
  File "/workspace/pipedream/profiler/translation/seq2seq/train/trainer.py", line 160, in iterate
    output = self.model(src, src_length, tgt[:-1])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 507, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/pipedream/profiler/translation/seq2seq/models/gnmt.py", line 62, in forward
    context = self.encode(input_encoder, input_enc_len)
  File "/workspace/pipedream/profiler/translation/seq2seq/models/seq2seq_base.py", line 34, in encode
    return self.encoder(inputs, lengths)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 507, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/pipedream/profiler/translation/seq2seq/models/encoder.py", line 127, in forward
    x = self.rnn_layers[0](x, lengths)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 507, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/pipedream/profiler/translation/seq2seq/models/encoder.py", line 64, in forward
    return self.emu_bidir_lstm(self.layer1, self.layer2, input, lengths)
  File "/workspace/pipedream/profiler/translation/seq2seq/models/encoder.py", line 53, in emu_bidir_lstm
    out1 = model1(inputl1)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 507, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 556, in forward
    return self.forward_tensor(input, hx)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 536, in forward_tensor
    output, hidden = self.forward_impl(input, hx, batch_sizes, max_batch_size, sorted_indices)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 509, in forward_impl
    dtype=input.dtype, device=input.device)
RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at /opt/pytorch/pytorch/aten/src/THC/generic/THCTensorMath.cu:35

Do you have any ideas what might be causing this? It isn't clear to me whether or not this is a software/environment issue, or an issue with my specific hardware.

bandwidth parameter

First of all, I want to say that your work is amazing.
I used you in vgg16. gpus = 16. hybrid_conf.json. (https://github.com/msr-fiddle/pipedream/blob/f50827f2e28cbdbd82a4ea686c0498272b1460d6/runtime/image_classification/models/vgg16/gpus%3D16/hybrid_conf.json)

It only takes 600 seconds to train an epoch in the Imagenet dataset. Can you tell me how you generated this configuration file? Can you tell me the bandwidth parameter?

the confused workflow about 1F1B-RR

hi, there
when I read pipedream, I notice that in fig 8, worker 1 will process batch5 after the backward of batch1. Then I read your code, the data parallelism of "1F1B-RR" mechanism is implemented using DistributedDataParallel, which, I think, is a sync operator. --> The docs.
So, in my opinion, in fig 8, the forward of batch5 should start after the backward of batch2.

Do I miss something?

Bandwidth within the machine

optimizer_graph_hierarchical.py The script's parameter (--network_bandwidth) is bandwidth within the machine. What is bandwidth considered between machines?

failed to build profiler image

i'am trying to build the docker image for running profiler.
i followed as bellow:
1.
sudo nvidia-docker run --name=pipedream -it -v /home/vincent.ym/pipedream:/home/admin/pipedream --ipc=host --net=host nvcr.io/nvidia/pytorch:19.05-py3 /bin/bash
2.
cd /home/admin/pipedream
3.
i copy the Dockefile content to a bash script: a.sh, an then run: sh a.sh

apt-get update && apt-get install -y --no-install-recommends \
        texlive-latex-extra \
      && \
    rm -rf /var/lib/apt/lists/

#COPY requirements.txt requirements.txt
pip install -r requirements.txt

# Bring in changes from outside container to /tmp
# (assumes pre_hook.patch is in same directory as Dockerfile)
cp pre_hook.patch /tmp

# Change working directory to PyTorch source path
cd /opt/pytorch

# Apply modifications and re-build PyTorch
cd pytorch && patch -p1 < /tmp/pre_hook.patch && \
    TORCH_CUDA_ARCH_LIST="5.2 6.0 6.1 7.0 7.5+PTX" \
    CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
    NCCL_INCLUDE_DIR="/usr/include/" \
    NCCL_LIB_DIR="/usr/lib/" \
    python setup.py install && python setup.py clean

# Reset default working directory
cd /workspace

finally, i got error as bellow while compiling:

[1162/3068] Linking CXX shared library lib/libthnvrtc.so
FAILED: lib/libthnvrtc.so
: && /usr/bin/c++  -fPIC -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -Wno-unused-but-set-variable -Wno-maybe-uninitialized -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3  -rdynamic -shared -Wl,-soname,libthnvrtc.so -o lib/libthnvrtc.so caffe2/torch/CMakeFiles/thnvrtc.dir/csrc/jit/fuser/cuda/thnvrtc.cpp.o  /usr/local/nvidia/lib/libcuda.so /usr/local/cuda/lib64/libnvrtc.so -Wl,-rpath,/usr/local/nvidia/lib:/usr/local/cuda/lib64:::::::: && :
/usr/local/nvidia/lib/libcuda.so: error adding symbols: File in wrong format
collect2: error: ld returned 1 exit status
[1172/3068] Generating ../aten/src/ATen/CPUBoolType.cpp, ../aten/src/ATen/CPUBoolType.h, ../aten/src/ATen/CPUByteType.../aten/src/ATen/SparseCUDALongType.h, ../aten/src/ATen/SparseCUDAShortType.cpp, ../aten/src/ATen/SparseCUDAShortType.h
/opt/pytorch/pytorch/aten/src/ATen/cwrap_parser.py:18: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  declaration = yaml.load('\n'.join(declaration_lines))
[1178/3068] Building CXX object third_party/fbgemm/CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8DepthwiseAvx2.cc.o
../third_party/fbgemm/src/FbgemmI8DepthwiseAvx2.cc: In constructor ‘fbgemm::PackedDepthWiseConvMatrix<KERNEL_PROD>::PackedDepthWiseConvMatrix(int, const int8_t*) [with int KERNEL_PROD = 9; int8_t = signed char]’:
../third_party/fbgemm/src/FbgemmI8DepthwiseAvx2.cc:50:3: warning: ignoring return value of ‘int posix_memalign(void**, size_t, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
   posix_memalign(
   ^
../third_party/fbgemm/src/FbgemmI8DepthwiseAvx2.cc: In constructor ‘fbgemm::PackedDepthWiseConvMatrix<KERNEL_PROD>::PackedDepthWiseConvMatrix(int, const int8_t*) [with int KERNEL_PROD = 27; int8_t = signed char]’:
../third_party/fbgemm/src/FbgemmI8DepthwiseAvx2.cc:50:3: warning: ignoring return value of ‘int posix_memalign(void**, size_t, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
[1181/3068] Building CXX object third_party/ideep/mkl-dnn/src/CMakeFiles/mkldnn.dir/cpu/cpu_engine.cpp.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "setup.py", line 722, in <module>
    build_deps()
  File "setup.py", line 285, in build_deps
    build_dir='build')
  File "/opt/pytorch/pytorch/tools/build_pytorch_libs.py", line 268, in build_caffe2
    check_call(ninja_cmd, cwd=build_dir, env=my_env)
  File "/opt/conda/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ninja', 'install']' returned non-zero exit status 1.
root@i39f13437:/home/admin/pipedream#

my machine information:

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

$gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC)

Unexpected Error

I use four GPUs to run Resnet50 in the hybrid mode, stage_to_rank_map is {"0": [0, 1], "1": [2], "2": [3]}, below error occured:

Traceback (most recent call last):
  File "main_with_runtime.py", line 594, in <module>
    main()
  File "main_with_runtime.py", line 310, in main
    train(train_loader, r, optimizer, epoch, epoch_time_list)
  File "main_with_runtime.py", line 419, in train
    r.run_backward()
  File "../runtime.py", line 613, in run_backward
    for output_name in outputs]))
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

What's the latest version of PyTorch supported?

Hi, what's the latest version of stable PyTorch release supported? Which version is pre_hook_pytorch_latest.patch for? Thanks for your reply in advance.

Translation demo: Installation instrutions issue. Missing CUDA kernels.

Hi. Looks like the installation instructions are incomplete for the Translation task. One need to mention that gnmt package has to be installed explicitly with cd ./runtime/translation; python setup.py install. Also it is worth mentioning that one may need to change GPU architecture to match their hardware in https://github.com/msr-fiddle/pipedream/blob/master/runtime/translation/setup.py.

Thank you.

I can't get the same result as your model after segmentation

I'm sorry to bother you, but I'm really confused
I use the graph.txt file generated by your single GPU. The bandwidth（B1,B2） is from 1GB to 30GB, and the interval is 500MB. I can't get the same result as your model after segmentation.

I think that after trying so many bandwidths, I can get a model that is the same as your segmentation, but I can only get 3 stages of models. The key problem is that almost none of these models work. So I want to ask you to get the specific parameters of the model divided into 2 stages in vgg16.gpus = 16

python optimizer_graph_hierarchical.py -f ../profiler/image_classification/profiles/vgg16/graph.txt -n 8 2 --activation_compression_ratio 1 -o vgg16_partitioned -b B1 B2

Error occurred in profiling

Similar to issue16, error occurred when I tried to profile resnet:

Traceback (most recent call last):
  File "main.py", line 597, in <module>
    main()
  File "main.py", line 311, in main
    os.path.join(args.profile_directory, args.arch))
  File "main.py", line 122, in create_graph
    output = model(input)
  File "/root/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 574, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gudiandian/pipedream/profiler/image_classification/models/resnetNew.py", line 93, in forward
    x = torch.flatten(x, 1)
TypeError: flatten(): argument 'input' (position 1) must be Tensor, not TensorWrapper

I have changed my torchvision version to 0.2.1 as suggested in issue16, However, this doesn't remove the error. My pytorch version is 1.6.0. I'm not sure if this torch version works for pipedream or not? Or is there any other problem? Thank you.

Segmentation fault when generating graph.txt

Hi,

When I ran the cmd CUDA_VISIBLE_DEVICES=0 python main.py -a vgg16 -b 64 --data_dir <path to ImageNet directory> to generate graph.txt, everyting went well until it ran to create_graph(model, train_loader, summary, os.path.join(args.profile_directory, args.arch))

The following is error log. It seems TensorWrapper is used? Is that a bug?

Traceback (most recent call last):
  File "main.py", line 579, in <module>
    main()
  File "main.py", line 289, in main
    os.path.join(args.profile_directory, args.arch))
  File "main.py", line 116, in create_graph
    output = model(input)
  File "/home/user/.conda/envs/pipedream/lib/python3.6/site-packages/torch/nn/modules/module.py", line 509, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/.conda/envs/pipedream/lib/python3.6/site-packages/torchvision/models/vgg.py", line 45, in forward
    x = torch.flatten(x, 1)
TypeError: flatten(): argument 'input' (position 1) must be Tensor, not TensorWrapper
Segmentation fault (core dumped)

Some error about communication

When I use 8GPUs to train imagenet, using VGG16, some error occured on pipedream/runtime/connunication.py, line 235 "assert forward_num_iterations % self.num_ranks_in_next_stage == 0".
The "forward_num_iterations" is 10009 and "num_ranks_in_next_stage" is 2, where forward_num_iterations="total number of training set"/"batch_size"/"number of stage".
When i annotation this line and line 242"assert backward_num_iterations % self.num_ranks_in_previous_stage == 0", the program can run successfully in one epoch, then some error occured:
Traceback (most recent call last): File "main_with_runtime.py", line 617, in <module> main() File "main_with_runtime.py", line 321, in main train(train_loader, r, optimizer, epoch) File "main_with_runtime.py", line 442, in train r.run_backward() File "../runtime.py", line 650, in run_backward for output_name in outputs])) File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward allow_unreachable=True) # allow_unreachable flag File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 479, in distributed_data_parallel_hook self._sync_reduction_works() File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 501, in _sync_reduction_works self.buckets_coalesced[bucket_idx]) RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:72] Timed out waiting 1800000ms for recv operation to complete
How should I do to sovle this problem, any answer wil be useful. Thank you very much!

Multi-machine distribution problem

server1: 8gpus
server2: 8gpus
The built-in distribution of torch can succeed, but there is an error using pipedream
The error information is as follows：

Finished initializing process group; backend: gloo, rank: 14, world_size: 16
Replicating stage: ranks=2, module_size=3151872.000
Send ranks: {'out4': [15], 'target': [15]}
Receive ranks: {'out3': [12], 'target': [12]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 10008 minibatches
Traceback (most recent call last):
File "main_with_runtime.py", line 578, in
main()
File "main_with_runtime.py", line 307, in main
train(train_loader, r, optimizer, epoch)
File "main_with_runtime.py", line 355, in train
r.run_forward()
File "../runtime.py", line 498, in run_forward
self.receive_tensors_forward()
File "../runtime.py", line 426, in receive_tensors_forward
backward=False)
File "../communication.py", line 592, in recv
index = self.get_messaging_index(sending=False)
File "../communication.py", line 496, in get_messaging_index
self.fwd_messaging_scheduling_row][
IndexError: list index out of range

image_processing main_with_runtime.py examples probably need to include distributed_backend parameter

Me again... this one is not urgent, and it may not even be an issue, but I want to capture it as I go just in case.

The top level README and the runtime README both have examples of running the main_with_runtime.py without setting the --distributed_backend parameter. When I try to run a single-machine-multi-gpu hybrid parallel scenario, if I do not specify that parameter then I see the following error get raised by torch:

ValueError: Backend name must be a string, but got: None

I am running each command as follows:

python main_with_runtime.py --module models.vgg16.gpus=4 -b 64 --data_dir /mnt --master_addr 127.0.0.1 --config_path models/vgg16/gpus=4/hybrid_conf.json -v 1 --rank <ID> --local_rank <ID>

Where ID is 0, 1, 2, and 3 for the four different processes I am trying to run. Does the documentation need updating, or am I doing things incorrectly?

Invalid shape when running example workflow

Hello,

I am walking through the example workflow, and I am hitting an issue when running:

# python main_with_runtime.py --module models.vgg16.gpus=4 -b 64 --data_dir /mnt --rank 0 --master_addr 127.0.0.1 --config_path models/vgg16/gpus=4/hybrid_conf.json
Traceback (most recent call last):
  File "main_with_runtime.py", line 578, in <module>
    main()
  File "main_with_runtime.py", line 149, in main
    output_tensors = stage(*tuple(input_tensors))
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 507, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/pipedream/runtime/image_classification/models/vgg16/gpus=4/stage1.py", line 60, in forward
    out16 = out14.view(out15)
RuntimeError: shape '[64]' is invalid for input of size 1605632

My environment is a single docker container running on a single 4 GPU machine configured following the instructions in https://github.com/msr-fiddle/pipedream#setup. The commands that I ran were:

cd profiler/image_classification
CUDA_VISIBLE_DEVICES=0 python main.py -a vgg16 -b 64 --data_dir /mnt
cd ../../optimizer/
python optimizer_graph_hierarchical.py -f ../profiler/image_classification/profiles/vgg16/graph.txt -n 4 --activation_compression_ratio 1 -o vgg16_partitioned
python convert_graph_to_model.py -f vgg16_partitioned/gpus=4.txt -n VGG16Partitioned -a vgg16 -o ../runtime/image_classification/models/vgg16/gpus=4 --stage_to_num_ranks 0:3,1:1
cd ../runtime/image_classification/
python main_with_runtime.py --module models.vgg16.gpus=4 -b 64 --data_dir /mnt --rank 0 --master_addr 127.0.0.1 --config_path models/vgg16/gpus=4/hybrid_conf.json

I ran through a similar issue trying to set up a model of alexnet with 4 GPUs using a batch size of 256, except that the runtime error complained about a shape of '[256]' instead. I can provide details on what I ran there if it is useful.

Is there anything obvious I am doing wrong here for my specific environment?

Multi node training

I hope to add examples of multi node training

Setting -m variable in optimizer_graph_hierarchical.py causes an error

https://github.com/msr-fiddle/pipedream/blob/master/optimizer/optimizer_graph_hierarchical.py#L201 includes a variable num_machines which has not been defined. The parameter passed into the main function is all_num_machines as opposed to num_machines. By default the code is not executed because if -m is not set when running the script, then the check is bypassed due to conditional short circuiting. But if you set the -m parameter when calling optimizer_graph_hierarchical.py, then an error is thrown.

"stage_to_depth_map" not found

Hi! I think I found a problem.

With this script (convert_graph_to_model.py) , we can get the configuration file (vgg16.gpus=16.hybrid_conf.json). But I didn't find the field "stage_to_depth_map" in the script(convert_graph_to_model.py). This field appears in vgg16.gpus=16.hybrid_conf.json.
{ "module_to_stage_map": [0, 1, 1], "stage_to_rank_map": { "0": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], "1": [15] }, "stage_to_depth_map": { "0": 1, "1": 0 } }
I don't know how this field appears.

docker pull error

Hi, @deepakn94
when I finished bash setup.sh, and using command:
$ nvidia-docker pull nvcr.io/nvidia/pytorch:19.05-py3

it has error with:
unauthorized: authentication required

what's the problem?