msr-fiddle / pipedream
License: MIT License
Hi there, thanks for thoroughly documenting your work and providing examples! I think I found a documentation issue under https://github.com/msr-fiddle/pipedream#end-to-end-workflow
The second step, optimization, takes place in pipedream/optimizer and includes the following:
-o ../runtime/models/vgg16/gpus=4
This places files under pipedream/runtime/models.
The next step, running, suggests you navigate over to pipedream/runtime/image_classification and then run some commands that include the following:
--config_path models/vgg16/gpus=4/hybrid_conf.json
This reads files from pipedream/runtime/image_classification/models.
Presumably this mismatch is unintentional.
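To make the mismatch concrete, here is a minimal sketch that resolves both relative paths against the working directories named in the README steps. The helper name `resolve` is made up for illustration, and the working directories are taken from the description above rather than checked against the repository.

```python
import posixpath


def resolve(cwd, relative):
    """Join a working directory with a relative path and collapse '..' segments."""
    return posixpath.normpath(posixpath.join(cwd, relative))


# Directory the optimizer step writes to when run from pipedream/optimizer.
written = resolve("pipedream/optimizer", "../runtime/models/vgg16/gpus=4")

# Directory the runtime step reads from when run from
# pipedream/runtime/image_classification.
read_from = posixpath.dirname(
    resolve("pipedream/runtime/image_classification",
            "models/vgg16/gpus=4/hybrid_conf.json"))

print("written:  ", written)
print("read from:", read_from)
print("same directory:", written == read_from)
```

If the runtime location is the intended one, passing -o ../runtime/image_classification/models/vgg16/gpus=4 in the optimizer step would make the two agree; that is my guess at the intent, not something the README confirms.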
When I use 8 GPUs to train ImageNet with VGG16, an error occurs in pipedream/runtime/communication.py, line 235: "assert forward_num_iterations % self.num_ranks_in_next_stage == 0".
The "forward_num_iterations" is 10009 and "num_ranks_in_next_stage" is 2, where forward_num_iterations = "total number of training samples" / "batch_size" / "number of stages".
When I comment out this line and line 242 ("assert backward_num_iterations % self.num_ranks_in_previous_stage == 0"), the program runs successfully for one epoch, and then the following error occurs:
Traceback (most recent call last):
File "main_with_runtime.py", line 617, in <module>
main()
File "main_with_runtime.py", line 321, in main
train(train_loader, r, optimizer, epoch)
File "main_with_runtime.py", line 442, in train
r.run_backward()
File "../runtime.py", line 650, in run_backward
for output_name in outputs]))
File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 479, in distributed_data_parallel_hook
self._sync_reduction_works()
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 501, in _sync_reduction_works
self.buckets_coalesced[bucket_idx])
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:72] Timed out waiting 1800000ms for recv operation to complete
How can I solve this problem? Any answer will be useful. Thank you very much!
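The divisibility condition in the quoted assert can be reproduced in isolation with the numbers from this report. The sketch below is not PipeDream code: `check_even_split` and `round_down_to_multiple` are hypothetical helpers, and rounding the iteration count down is only one conceivable workaround (it drops the leftover minibatch).

```python
def check_even_split(num_iterations, num_ranks):
    """Return True when the iterations divide evenly across the replicas."""
    return num_iterations % num_ranks == 0


def round_down_to_multiple(num_iterations, num_ranks):
    """Largest iteration count <= num_iterations that is divisible by num_ranks."""
    return num_iterations - (num_iterations % num_ranks)


# Values reported above.
forward_num_iterations = 10009
num_ranks_in_next_stage = 2

print(check_even_split(forward_num_iterations, num_ranks_in_next_stage))
print(round_down_to_multiple(forward_num_iterations, num_ranks_in_next_stage))
```

With an odd iteration count and two replicas in the next stage, one replica would expect a message that is never sent, which may be why commenting out the asserts ends in a receive timeout rather than a clean run.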
How to use this script?
What does the -t parameter mean? Can you give me an example?
https://github.com/msr-fiddle/pipedream/blob/f50827f2e28cbdbd82a4ea686c0498272b1460d6/optimizer/inference_optimizer_graph.py
With this script (convert_graph_to_model.py), we can get the configuration file (vgg16.gpus=16.hybrid_conf.json). But I didn't find the field "stage_to_depth_map" in the script (convert_graph_to_model.py), although this field appears in vgg16.gpus=16.hybrid_conf.json.
{ "module_to_stage_map": [0, 1, 1], "stage_to_rank_map": { "0": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], "1": [15] }, "stage_to_depth_map": { "0": 1, "1": 0 } }
I don't know where this field comes from.
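While waiting for an answer, the configuration itself can at least be checked for internal consistency. The sketch below only parses the JSON quoted above; the `summarize` helper is made up for illustration, and it does not explain how the script produces "stage_to_depth_map".

```python
import json

# The configuration quoted above, copied verbatim.
config_text = """
{ "module_to_stage_map": [0, 1, 1],
  "stage_to_rank_map": { "0": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], "1": [15] },
  "stage_to_depth_map": { "0": 1, "1": 0 } }
"""
config = json.loads(config_text)


def summarize(config):
    """Collect stage count, total ranks, rank uniqueness, and per-stage depth."""
    stages = sorted(set(config["module_to_stage_map"]))
    ranks = [r for s in stages for r in config["stage_to_rank_map"][str(s)]]
    return {
        "num_stages": len(stages),
        "world_size": len(ranks),
        "ranks_unique": len(ranks) == len(set(ranks)),
        "depths": {s: config["stage_to_depth_map"][str(s)] for s in stages},
    }


summary = summarize(config)
print(summary)
```

In this particular file the depth of each stage equals the number of stages minus one minus the stage id. That is only an observation about this example, not a documented rule.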
Traceback (most recent call last):
File "main_with_runtime.py", line 579, in <module>
main()
File "main_with_runtime.py", line 192, in main
enable_recompute=args.recompute)
File "../runtime.py", line 64, in __init__
master_addr, rank, local_rank, num_ranks_in_server)
File "../runtime.py", line 196, in initialize
backend=self.distributed_backend)
File "../communication.py", line 42, in __init__
dist.init_process_group(backend, rank=rank, world_size=world_size)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 370, in init_process_group
timeout=timeout)
RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:127] rp != nullptr. Unable to find address for: dgx-1.ai
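The message suggests that Gloo could not resolve the machine's own hostname (dgx-1.ai here). A quick way to check this outside of PyTorch is the standard-library sketch below; `can_resolve` is a hypothetical helper name, and the guess that name resolution is the root cause is an assumption based only on the error text.

```python
import socket


def can_resolve(hostname):
    """Return True if the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


# Check the name this machine reports for itself.
local_name = socket.gethostname()
print(local_name, "resolves:", can_resolve(local_name))
```

If this prints False for the local hostname, adding an entry for it in /etc/hosts is a common remedy, though I have not verified it against this setup.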
Hi, what's the latest version of stable PyTorch release supported? Which version is pre_hook_pytorch_latest.patch for? Thanks for your reply in advance.
I had a problem using the optimizer for the resnet101 model.
Following the README, the following commands were executed sequentially.
python optimizer_graph_hierarchical.py
-f ../profiler/image_classification/profiles/resnet101/graph.txt
--activation_compression_ratio 1
-o resnet101_partitioned
--all_num_machines 4 4
--network_bandwidths 15750000000 7000000000
--memory_size 16000000000
python convert_graph_to_model.py
-f resnet101_partitioned/gpus=16.txt
-n RESNET101Partitioned
-a resnet101
-o ../runtime/image_classification/models/gpus=16
In the resnet101 (or resnet152) model, an infinite loop occurs in convert_graph_to_model.py.
(resnet50 or vgg16 had no problem.)
When I checked it, an infinite loop occurs in the populate_depths function in pipedream/graph/graph.py.
Is my execution wrong?
Can you confirm what's the problem?
The translation profiler section appears to have some issues. When attempting to profile, I ran into the following:
# python train.py --dataset-dir /mnt/ptb --target-bleu 21.8 --epochs 20 --math fp16 --print-freq 10 --arch gnmt --batch-size 64 --test-batch-size 128 --model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': False}" --optimization-config "{'optimizer': 'FusedAdam', 'lr': 1.75e-3}" --scheduler-config "{'lr_method':'mlperf', 'warmup_iters':1000, 'remain_steps':1450, 'decay_steps':40}"
Traceback (most recent call last):
File "train.py", line 20, in <module>
from seq2seq.models.gnmt import GNMT
ModuleNotFoundError: No module named 'seq2seq.models'
I saw that there were local directories for that module, so I added empty __init__.py files into seq2seq and seq2seq/models to get around that. If the intention is to use these local versions, then you should probably just check some empty __init__.py files into the repo.
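For anyone hitting the same ModuleNotFoundError, the workaround described above can be scripted. This is a minimal sketch: `add_init_files` is a made-up helper, and the two package paths are the ones named in this report, relative to the directory that contains train.py.

```python
from pathlib import Path


def add_init_files(root, packages=("seq2seq", "seq2seq/models")):
    """Create an empty __init__.py in each existing package directory that lacks one."""
    created = []
    for package in packages:
        init_file = Path(root) / package / "__init__.py"
        if init_file.parent.is_dir() and not init_file.exists():
            init_file.touch()
            created.append(init_file)
    return created


if __name__ == "__main__":
    # Run from the directory that contains train.py.
    for path in add_init_files("."):
        print("created", path)
```

The function only touches directories that already exist, so running it in the wrong place does nothing.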
Hi,
I see that the profiler calculates memory and execution times based on a particular batch size.
But the optimizer code does not take in any batch size parameter. So, does that mean, in the optimizer logic, the execution times and activation memories are normalized?
Best
@deepakn94 It looks like the container doesn't support the NVIDIA K40c GPU, so I decided to modify the PyTorch source code according to pre_hook.patch, but I did not succeed:
=> creating model 'resnet50'
Collecting profile...
Total accounted time: 2364.255 ms
Traceback (most recent call last):
File "main.py", line 569, in <module>
main()
File "main.py", line 283, in main
os.path.join(args.profile_directory, args.arch))
File "main.py", line 115, in create_graph
output = model(input)
File "/seu_share/home/zhanjun/.conda/envs/pipetorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 571, in __call__
result = self.forward(*input, **kwargs)
File "/seu_share/home/zhanjun/.conda/envs/pipetorch/lib/python3.6/site-packages/torchvision/models/resnet.py", line 207, in forward
x = torch.flatten(x, 1)
TypeError: flatten(): argument 'input' (position 1) must be Tensor, not TensorWrapper
Thank you!
Imagenet doesn't support direct download, is there any solution?
Hi, me again. First of all, your work is really good, so I read it one more time.
setup_messaging_schedule function designed for this corner case?
Hi, I'm trying to compare resnet101 with model parallelism against your pipeline parallelism using nvprof.
To do this, I'm trying to run the optimization code.
I launched the Python command below, but it takes a long time, so I can't get results from it.
python convert_graph_to_model.py -f resnet101_partitioned/gpus=4.txt -n RESNET101Partitioned -a resnet101 -o ../runtime/image_classification/models/resnet101/gpus=4 --stage_to_num_ranks 0:1,1:1,2:1,3:1
How long did you take to create a resnet101 optimization result? Is there a way to get a result faster? If you have already computed the result for this, could you guys upload it for me?
Thanks
Hi,
I have been testing this repository to replicate results given in the SOSP paper.
https://dl.acm.org/doi/abs/10.1145/3341301.3359646
But I was unable to reproduce the results, and I'm seeing some data loading problems for Alexnet. I have started a discussion in PyTorch forum.
https://discuss.pytorch.org/t/strange-behavior-in-data-loader-with-workers/83769
So, is this the exact version that was used for the experiments in the SOSP paper?
Hi, I'm stuck on the first step. How do I profile the runtime of the forward and backward passes? I've never modified the PyTorch source code before, and 'python_cpp_function.h' and 'python_cpp_function.cpp' aren't clear enough for me to write the pre_hook interface. Could you provide the changes you made to the PyTorch source code? Thanks a lot.
Me again... this one is not urgent, and it may not even be an issue, but I want to capture it as I go just in case.
The top-level README and the runtime README both have examples of running main_with_runtime.py without setting the --distributed_backend parameter. When I try to run a single-machine, multi-GPU hybrid parallel scenario without specifying that parameter, I see the following error raised by torch:
ValueError: Backend name must be a string, but got: None
I am running each command as follows:
python main_with_runtime.py --module models.vgg16.gpus=4 -b 64 --data_dir /mnt --master_addr 127.0.0.1 --config_path models/vgg16/gpus=4/hybrid_conf.json -v 1 --rank <ID> --local_rank <ID>
Where ID is 0, 1, 2, and 3 for the four different processes I am trying to run. Does the documentation need updating, or am I doing things incorrectly?
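For reference, here is a small sketch that prints the four launch commands with the backend set explicitly. The flag name --distributed_backend comes from this report; the value gloo is an assumption (it is the backend that shows up in logs from other reports here), and `launch_commands` is a hypothetical helper.

```python
import shlex


def launch_commands(num_ranks, backend="gloo"):
    """Build one main_with_runtime.py command line per rank on a single machine."""
    base = [
        "python", "main_with_runtime.py",
        "--module", "models.vgg16.gpus=4",
        "-b", "64",
        "--data_dir", "/mnt",
        "--master_addr", "127.0.0.1",
        "--config_path", "models/vgg16/gpus=4/hybrid_conf.json",
        "--distributed_backend", backend,  # assumed value; not given in the READMEs
        "-v", "1",
    ]
    commands = []
    for rank in range(num_ranks):
        args = base + ["--rank", str(rank), "--local_rank", str(rank)]
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands


commands = launch_commands(4)
for command in commands:
    print(command)
```

Each printed line would be run in its own process, as in the original setup.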
@deepakn94 Do you have any clues on how to fix the hang when using two data-parallel stages with 4 GPUs and 3 GPUs?
I checked the code, and it seems that all data from the 4 GPUs is sent to only one of the 3 GPUs (I guess this is due to self.tensor_tags, which can store only one tag per input/output node), e.g., here.
I also noticed a comment reading "TODO: don't current support uneven configurations." here.
Hi,
I have been running resnet101 with batch size 64 on straight pipeline with 4 GPUs.
I ran the following commands for the profiler and the optimizer.
CUDA_VISIBLE_DEVICES=4 python main.py -a "resnet101" -b 64 --data_dir "$HOME/data/imagenet-mini/" --profile_directory "profiles1/64"
python optimizer_graph_hierarchical.py -f "../profiler/image_classification/profiles1/64/resnet101/graph.txt" -n 4 -s 11000000000 --straight_pipeline -o "./optim/64/resnet101/gpus=4_straight" -b 2500000000 --use_memory_constraint
python convert_graph_to_model.py -f "./optim/64/resnet101/gpus=4_straight/gpus=4.txt" -n resnet101 -a resnet101 -o "./optim/64/resnet101/gpus=4_straight/"
On the optimizer step, I get the following output.
Time taken by single-stage pipeline: 2.0447990000000003
Time per stage in pipeline: 0.5839989999999989
Throughput increase (compared to single machine): 3.5013741461886134
[Note that single-machine and (4)-machine DP might not fit given memory constraints]
Throughput increase of (4)-machine DP compared to single machine: 3.62130052703585
Throughput increase (compared to (4)-machine DP): 0.966883063155932
So, my expectation was that the straight pipeline would perform roughly the same as DP.
But the experimental results are drastically different for the pipeline, while they match the prediction almost perfectly for data parallel.
    model      batch  conf     mean         speed_up
21  resnet101  64     1_conf   1098.136000  1.000000
22  resnet101  64     mp_conf  770.499250   1.425227
23  resnet101  64     dp_conf  304.383375   3.607740
I'd be very grateful if you could help me figure out this discrepancy.
I have drawn the Gantt charts for:
pipeline https://drive.google.com/file/d/1l8UcafF1CIVmUgOcmdRLztpFZcB0G7dp/view?usp=sharing
data parallel https://drive.google.com/file/d/1l8UcafF1CIVmUgOcmdRLztpFZcB0G7dp/view?usp=sharing
Each h-bar represents the time period from start to end of fwd(or bwd) annotated with + (or -).
It looks to me as if each stage stalls on communication for a considerable period of time.
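The size of the gap can be read off the numbers already posted. The sketch below recomputes the optimizer's predicted speedups and the measured ones from the table; it assumes that "mean" is a time (lower is better) and that mp_conf is the straight-pipeline run, which is how I read the table but is not stated explicitly.

```python
# Optimizer output quoted above.
single_stage_time = 2.0447990000000003
time_per_stage = 0.5839989999999989
dp_speedup_predicted = 3.62130052703585

pipeline_speedup_predicted = single_stage_time / time_per_stage
pipeline_vs_dp_predicted = pipeline_speedup_predicted / dp_speedup_predicted

# Measured means from the table (assumed to be times).
mean_single = 1098.136
mean_pipeline = 770.49925   # mp_conf, assumed to be the straight pipeline
mean_dp = 304.383375        # dp_conf

pipeline_speedup_measured = mean_single / mean_pipeline
dp_speedup_measured = mean_single / mean_dp

print(f"pipeline: predicted {pipeline_speedup_predicted:.3f}, "
      f"measured {pipeline_speedup_measured:.3f}")
print(f"DP:       predicted {dp_speedup_predicted:.3f}, "
      f"measured {dp_speedup_measured:.3f}")
print(f"pipeline reaches {pipeline_speedup_measured / pipeline_speedup_predicted:.0%} "
      f"of its predicted speedup")
```

The data-parallel measurement lands close to its prediction, while the pipeline measurement falls well short of its own, which is the discrepancy being asked about.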
Hi @deepakn94, I can't find the AMI in the AWS console by searching for the AMI ID or name given in EXPERIMENTS.md. Is it a public image?
Can the profiler which generates the graph handle conditionals and loops?
Epoch 0: 6843.771 seconds
Epoch start time: 1577064742.170, epoch end time: 1577071585.941
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 480, in _try_get_batch
data = self.data_queue.get(timeout=timeout)
File "/opt/conda/lib/python3.6/queue.py", line 173, in get
self.not_empty.wait(remaining)
File "/opt/conda/lib/python3.6/threading.py", line 299, in wait
gotit = waiter.acquire(True, timeout)
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 325) is killed by signal: Bus error.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main_with_runtime.py", line 579, in <module>
main()
File "main_with_runtime.py", line 311, in main
prec1 = validate(val_loader, r, epoch)
File "main_with_runtime.py", line 453, in validate
r.run_forward()
File "../runtime.py", line 498, in run_forward
self.receive_tensors_forward()
File "../runtime.py", line 387, in receive_tensors_forward
input = next(self.loader_iter)
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 545, in __next__
idx, batch = self._get_batch()
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 512, in _get_batch
success, data = self._try_get_batch()
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 488, in _try_get_batch
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 325) exited unexpectedly
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
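Since the log itself points at shared memory, a first step is to see how much is actually available inside the container. This is a minimal sketch using only the standard library; the /dev/shm path is the usual Linux location and the 1 GiB threshold is an arbitrary illustrative number, not a PipeDream requirement.

```python
import shutil


def free_bytes(path="/dev/shm"):
    """Free space, in bytes, on the filesystem that holds the given path."""
    return shutil.disk_usage(path).free


def enough_shared_memory(required_bytes, path="/dev/shm"):
    """Return True if at least required_bytes are free at the given path."""
    return free_bytes(path) >= required_bytes


if __name__ == "__main__":
    one_gib = 1024 ** 3  # illustrative threshold only
    print("free bytes in /dev/shm:", free_bytes())
    print("at least 1 GiB free:", enough_shared_memory(one_gib))
```

If the value is small, the usual options are starting the container with Docker's --shm-size or --ipc=host flags, or lowering the DataLoader's num_workers. Which of these applies here is a guess on my part.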
I am trying to build the Docker image for running the profiler.
I proceeded as below:
1. sudo nvidia-docker run --name=pipedream -it -v /home/vincent.ym/pipedream:/home/admin/pipedream --ipc=host --net=host nvcr.io/nvidia/pytorch:19.05-py3 /bin/bash
2. cd /home/admin/pipedream
3. I copied the Dockerfile content into a bash script, a.sh, and then ran: sh a.sh
apt-get update && apt-get install -y --no-install-recommends \
texlive-latex-extra \
&& \
rm -rf /var/lib/apt/lists/
#COPY requirements.txt requirements.txt
pip install -r requirements.txt
# Bring in changes from outside container to /tmp
# (assumes pre_hook.patch is in same directory as Dockerfile)
cp pre_hook.patch /tmp
# Change working directory to PyTorch source path
cd /opt/pytorch
# Apply modifications and re-build PyTorch
cd pytorch && patch -p1 < /tmp/pre_hook.patch && \
TORCH_CUDA_ARCH_LIST="5.2 6.0 6.1 7.0 7.5+PTX" \
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
NCCL_INCLUDE_DIR="/usr/include/" \
NCCL_LIB_DIR="/usr/lib/" \
python setup.py install && python setup.py clean
# Reset default working directory
cd /workspace
Finally, I got the error below while compiling:
[1162/3068] Linking CXX shared library lib/libthnvrtc.so
FAILED: lib/libthnvrtc.so
: && /usr/bin/c++ -fPIC -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -Wno-unused-but-set-variable -Wno-maybe-uninitialized -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -rdynamic -shared -Wl,-soname,libthnvrtc.so -o lib/libthnvrtc.so caffe2/torch/CMakeFiles/thnvrtc.dir/csrc/jit/fuser/cuda/thnvrtc.cpp.o /usr/local/nvidia/lib/libcuda.so /usr/local/cuda/lib64/libnvrtc.so -Wl,-rpath,/usr/local/nvidia/lib:/usr/local/cuda/lib64:::::::: && :
/usr/local/nvidia/lib/libcuda.so: error adding symbols: File in wrong format
collect2: error: ld returned 1 exit status
[1172/3068] Generating ../aten/src/ATen/CPUBoolType.cpp, ../aten/src/ATen/CPUBoolType.h, ../aten/src/ATen/CPUByteType.../aten/src/ATen/SparseCUDALongType.h, ../aten/src/ATen/SparseCUDAShortType.cpp, ../aten/src/ATen/SparseCUDAShortType.h
/opt/pytorch/pytorch/aten/src/ATen/cwrap_parser.py:18: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
declaration = yaml.load('\n'.join(declaration_lines))
[1178/3068] Building CXX object third_party/fbgemm/CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8DepthwiseAvx2.cc.o
../third_party/fbgemm/src/FbgemmI8DepthwiseAvx2.cc: In constructor ‘fbgemm::PackedDepthWiseConvMatrix<KERNEL_PROD>::PackedDepthWiseConvMatrix(int, const int8_t*) [with int KERNEL_PROD = 9; int8_t = signed char]’:
../third_party/fbgemm/src/FbgemmI8DepthwiseAvx2.cc:50:3: warning: ignoring return value of ‘int posix_memalign(void**, size_t, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
posix_memalign(
^
../third_party/fbgemm/src/FbgemmI8DepthwiseAvx2.cc: In constructor ‘fbgemm::PackedDepthWiseConvMatrix<KERNEL_PROD>::PackedDepthWiseConvMatrix(int, const int8_t*) [with int KERNEL_PROD = 27; int8_t = signed char]’:
../third_party/fbgemm/src/FbgemmI8DepthwiseAvx2.cc:50:3: warning: ignoring return value of ‘int posix_memalign(void**, size_t, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
[1181/3068] Building CXX object third_party/ideep/mkl-dnn/src/CMakeFiles/mkldnn.dir/cpu/cpu_engine.cpp.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "setup.py", line 722, in <module>
build_deps()
File "setup.py", line 285, in build_deps
build_dir='build')
File "/opt/pytorch/pytorch/tools/build_pytorch_libs.py", line 268, in build_caffe2
check_call(ninja_cmd, cwd=build_dir, env=my_env)
File "/opt/conda/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ninja', 'install']' returned non-zero exit status 1.
root@i39f13437:/home/admin/pipedream#
my machine information:
$nvidia-smi
Tue Oct 22 14:06:45 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:04:00.0 Off | 0 |
| N/A 29C P0 25W / 250W | 10MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... On | 00000000:84:00.0 Off | 0 |
| N/A 27C P0 25W / 250W | 10MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC)
I'm experiencing the below error which looks critical. I'm using revision f50827f with docker base nvcr.io/nvidia/pytorch:19.05-py3
Traceback (most recent call last):
File "train.py", line 474, in <module>
main()
File "train.py", line 458, in main
train_loss, train_perf = trainer.optimize(train_loader)
File "/workspace/src/pipedream/profiler/translation/seq2seq/train/trainer.py", line 373, in optimize
output = self.feed_data(data_loader, training=True)
File "/workspace/src/pipedream/profiler/translation/seq2seq/train/trainer.py", line 330, in feed_data
os.path.join("profiles", self.arch+'_2'))
File "/workspace/src/pipedream/profiler/translation/seq2seq/train/trainer.py", line 42, in create_graph
graph_creator.persist_graph(directory)
File "../torchmodules/torchgraph/graph_creator.py", line 281, in persist_graph
self.graph.render_bar_graphs_and_cdfs(directory)
File "../../graph/graph.py", line 607, in render_bar_graphs_and_cdfs
pdfs.append(((node.forward_compute_time + node.backward_compute_time) / (cdfs[-1][0] / 100.0),
ZeroDivisionError: float division by zero
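The failing expression divides by cdfs[-1][0] / 100.0, so the error implies that this last value is zero. If that value is the accumulated compute time (my reading of the name, not confirmed from the source), it would be zero whenever every profiled node reports zero time. A guard along the following lines avoids the crash; `percentages` is a stand-alone illustration, not the code in graph.py.

```python
def percentages(times):
    """Express each time as a percentage of the total, tolerating a zero total."""
    total = sum(times)
    if total == 0:
        # Nothing was measured: report zeros instead of dividing by zero.
        return [0.0 for _ in times]
    return [100.0 * t / total for t in times]


print(percentages([1.0, 3.0]))
print(percentages([0.0, 0.0]))
```

This only hides the symptom. If all compute times really are zero, the profile itself is probably the thing to investigate.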
Hi there,
While reading the PipeDream paper, I noticed that in Fig. 8, worker 1 processes batch 5 after the backward pass of batch 1. Then I read your code: the data parallelism of the "1F1B-RR" mechanism is implemented using DistributedDataParallel, which, I think, is a synchronous operator (see the docs).
So, in my opinion, in Fig. 8, the forward pass of batch 5 should start after the backward pass of batch 2.
Am I missing something?
Traceback (most recent call last):
File "main_with_runtime.py", line 579, in <module>
main()
File "main_with_runtime.py", line 129, in main
module = importlib.import_module(args.module)
File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 678, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/home/admin/pipedream/runtime/image_classification/models/inceptionv3/gpus=2/__init__.py", line 1, in <module>
from .inceptionv3 import Inceptionv3Partitioned
File "/home/admin/pipedream/runtime/image_classification/models/inceptionv3/gpus=2/inceptionv3.py", line 2, in <module>
from .stage0 import Stage0
File "/home/admin/pipedream/runtime/image_classification/models/inceptionv3/gpus=2/stage0.py", line 24
self.layer19 = torch.nn.Branch 3
Here is the generated stage code; it does not seem to be legal Python:
class Stage0(torch.nn.Module):
    def __init__(self):
        super(Stage0, self).__init__()
        self.layer2 = torch.nn.Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), bias=False)
        self.layer3 = torch.nn.BatchNorm2d(32, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
        self.layer4 = torch.nn.ReLU(inplace=True)
        self.layer5 = torch.nn.Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), bias=False)
        self.layer6 = torch.nn.BatchNorm2d(32, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
        self.layer7 = torch.nn.ReLU(inplace=True)
        self.layer8 = torch.nn.Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        self.layer9 = torch.nn.BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
        self.layer10 = torch.nn.ReLU(inplace=True)
        self.layer11 = torch.nn.MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
        self.layer12 = torch.nn.Conv2d(64, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)
        self.layer13 = torch.nn.BatchNorm2d(80, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
        self.layer14 = torch.nn.ReLU(inplace=True)
        self.layer15 = torch.nn.Conv2d(80, 192, kernel_size=(3, 3), stride=(1, 1), bias=False)
        self.layer16 = torch.nn.BatchNorm2d(192, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
        self.layer17 = torch.nn.ReLU(inplace=True)
        self.layer18 = torch.nn.MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
        self.layer19 = torch.nn.Branch 3
        self.layer20 = torch.nn.Branch 2
        self.layer21 = torch.nn.Branch 1
        self.layer22 = torch.nn.Branch 0
        self.layer24 = torch.nn.Branch 7
        self.layer25 = torch.nn.Branch 6
        self.layer26 = torch.nn.Branch 5
        self.layer27 = torch.nn.Branch 4
        self.layer29 = torch.nn.Branch 9
        self.layer30 = torch.nn.Branch 8
        self.layer31 = torch.nn.Branch 11
        self.layer32 = torch.nn.Branch 10
        self.layer34 = torch.nn.MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
        self.layer35 = torch.nn.Branch 13
        self.layer36 = torch.nn.Branch 12
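A cheap way to catch this kind of output before launching the runtime is to parse each generated file with the standard ast module. The sketch below checks single statements taken from the listing above; `is_valid_python` is a hypothetical helper and is not part of convert_graph_to_model.py.

```python
import ast


def is_valid_python(source):
    """Return True if the source text parses as Python, False on a syntax error."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


good_line = ("self.layer2 = torch.nn.Conv2d(3, 32, kernel_size=(3, 3), "
             "stride=(2, 2), bias=False)")
bad_line = "self.layer19 = torch.nn.Branch 3"

print(is_valid_python(good_line))
print(is_valid_python(bad_line))
```

The same function can be applied to the full text of stage0.py. A False result means the converter emitted a node description (such as "Branch 3") where it should have emitted a module constructor.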
Hi, I noticed the implementation for image classification in PipeDream, but I'm not sure whether it also works for object detection networks?
Hi, @deepakn94
After I finished running bash setup.sh, I used the command:
$ nvidia-docker pull nvcr.io/nvidia/pytorch:19.05-py3
It fails with:
unauthorized: authentication required
What's the problem?
I run the following code
python optimizer_graph_hierarchical.py -f ../profiler/image_classification/profiles/vgg16/graph.txt -n 8 2 --activation_compression_ratio 1 -o vgg16_partitioned -b 4294967296 2147483648
Total number of states: 40
Solving optimization problem with 8 machines with inter-machine bandwidth of 4.29 GB/s
[[0.04692 0.056919000000000004 0.21645 ... 0.6715939999999999
0.6715939999999999 0.6725349999999999]
[None 0.009999000000000001 0.16953000000000001 ... 0.624674 0.624674
0.6256149999999999]
[None None 0.159531 ... 0.6146749999999999 0.6146749999999999
0.6156159999999998]
...
[None None None ... None 0.0 0.0009409999999999696]
[None None None ... None None 0.0009409999999999696]
[None None None ... None None None]]
Solving optimization problem with 2 machines with inter-machine bandwidth of 2.15 GB/s
[[0.005865730156898499 0.007115605156898499 0.027072026604413987 ...
0.09235717243739539 0.09235717243739539 0.09235717243739539]
[None 0.0012498750000000001 0.02120629644751549 ... 0.0856534978594099
0.0856534978594099 0.0856534978594099]
[None None 0.01995642144751549 ... 0.08422506928798132
0.08422506928798132 0.08422506928798132]
...
[None None None ... None 0.0 0.0009765625]
[None None None ... None None 0.001786962507337328]
[None None None ... None None None]]
[[0.002933282310962677 0.0035582198109626773 0.013545028504729271 ...
0.07743855509553638 0.07743855509553638 0.07839246224258628]
[None 0.0006249375000000001 0.010611746193766595 ... 0.0740863005740302
0.0740863005740302 0.07504020772108011]
[None None 0.009986808693766594 ... 0.07337208628831592
0.07337208628831592 0.07432599343536582]
...
[None None None ... None 0.0 0.0014421883970499039]
[None None None ... None None 0.0018473884007185679]
[None None None ... None None None]]
Level 2
Number of machines used: 2...
Compute time = 0.335797, Data-parallel communication time = 0.250080...
Number of machines in budget not used: 0...
(Split start, split end) / compute time taken per stage / replication factor per stage:
(0, 40) 0.6725349999999999 2
Total number of stages: 1
Level 1
Number of machines used: 1...
Split between layers 23 and 24...
Split before antichain ['node26']...
Compute time = 0.049474, Data-parallel communication time = 0.000000, Pipeline-parallel communication time = 0.023926...
Number of machines used: 7...
Compute time = 0.088874, Data-parallel communication time = 0.003483...
Number of machines in budget not used: 0...
(Split start, split end) / compute time taken per stage / replication factor per stage:
(0, 24) 0.6221200000000001 7
(24, 40) 0.050414999999999766 1
Total number of stages: 2
Time taken by single-stage pipeline: 0.6725349999999999
Time per stage in pipeline: 0.07839246224258628
Throughput increase (compared to single machine): 8.579077385257188
[Note that single-machine and (8,2)-machine DP might not fit given memory constraints]
Throughput increase of (8,2)-machine DP compared to single machine: 6.5655154772476045
Throughput increase (compared to (8,2)-machine DP): 1.3066875578905357
I'm sorry to bother you, but I'm really confused.
I use the graph.txt file that you generated on a single GPU. I swept the bandwidths (B1, B2) from 1 GB to 30 GB in steps of 500 MB, but I can't obtain the same partitioning as your model.
I expected that after trying so many bandwidths I would get a model partitioned the same way as yours, but I only ever get 3-stage models. The key problem is that almost none of these models work. So I would like to ask for the specific parameters that split the model into 2 stages in vgg16.gpus = 16
python optimizer_graph_hierarchical.py -f ../profiler/image_classification/profiles/vgg16/graph.txt -n 8 2 --activation_compression_ratio 1 -o vgg16_partitioned -b B1 B2
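For anyone wanting to repeat the sweep described above, here is a sketch that enumerates the (B1, B2) grid and builds the matching command lines. It assumes decimal units (1 GB = 10^9 bytes), and the per-run output directory name is invented so that runs do not overwrite each other; everything else is copied from the command above.

```python
import itertools

GB = 10 ** 9          # assumption: decimal gigabytes
STEP = 500 * 10 ** 6  # 500 MB


def bandwidth_values(start=1 * GB, stop=30 * GB, step=STEP):
    """Bandwidths in bytes from start to stop inclusive, in fixed steps."""
    return list(range(start, stop + 1, step))


def sweep_commands(values):
    """One optimizer command per (B1, B2) pair; the output directory name is hypothetical."""
    template = (
        "python optimizer_graph_hierarchical.py "
        "-f ../profiler/image_classification/profiles/vgg16/graph.txt "
        "-n 8 2 --activation_compression_ratio 1 "
        "-o vgg16_partitioned_b1={b1}_b2={b2} -b {b1} {b2}"
    )
    return [template.format(b1=b1, b2=b2)
            for b1, b2 in itertools.product(values, repeat=2)]


values = bandwidth_values()
commands = sweep_commands(values)
print(len(values), "values per level,", len(commands), "commands in total")
print(commands[0])
```

Sweeping both levels independently gives a few thousand runs, so restricting the grid to pairs where B1 is at least B2 may be worthwhile if intra-machine bandwidth is expected to be the larger one.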
I hope examples of multi-node training can be added.
I use four GPUs to run ResNet-50 in hybrid mode; stage_to_rank_map is {"0": [0, 1], "1": [2], "2": [3]}, and the error below occurred:
Traceback (most recent call last):
File "main_with_runtime.py", line 594, in <module>
main()
File "main_with_runtime.py", line 310, in main
train(train_loader, r, optimizer, epoch, epoch_time_list)
File "main_with_runtime.py", line 419, in train
r.run_backward()
File "../runtime.py", line 613, in run_backward
for output_name in outputs]))
File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Finished initializing process group; backend: gloo, rank: 14, world_size: 16
Replicating stage: ranks=2, module_size=3151872.000
Send ranks: {'out4': [15], 'target': [15]}
Receive ranks: {'out3': [12], 'target': [12]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 10008 minibatches
Traceback (most recent call last):
File "main_with_runtime.py", line 578, in <module>
main()
File "main_with_runtime.py", line 307, in main
train(train_loader, r, optimizer, epoch)
File "main_with_runtime.py", line 355, in train
r.run_forward()
File "../runtime.py", line 498, in run_forward
self.receive_tensors_forward()
File "../runtime.py", line 426, in receive_tensors_forward
backward=False)
File "../communication.py", line 592, in recv
index = self.get_messaging_index(sending=False)
File "../communication.py", line 496, in get_messaging_index
self.fwd_messaging_scheduling_row][
IndexError: list index out of range
Hi,
When I ran the command CUDA_VISIBLE_DEVICES=0 python main.py -a vgg16 -b 64 --data_dir <path to ImageNet directory> to generate graph.txt, everything went well until it reached create_graph(model, train_loader, summary, os.path.join(args.profile_directory, args.arch)).
The following is the error log. It seems TensorWrapper is used? Is that a bug?
Traceback (most recent call last):
File "main.py", line 579, in <module>
main()
File "main.py", line 289, in main
os.path.join(args.profile_directory, args.arch))
File "main.py", line 116, in create_graph
output = model(input)
File "/home/user/.conda/envs/pipedream/lib/python3.6/site-packages/torch/nn/modules/module.py", line 509, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/.conda/envs/pipedream/lib/python3.6/site-packages/torchvision/models/vgg.py", line 45, in forward
x = torch.flatten(x, 1)
TypeError: flatten(): argument 'input' (position 1) must be Tensor, not TensorWrapper
Segmentation fault (core dumped)
My environment:
server1: 4 GPUs
server2: 4 GPUs
Initialization completes, but no rank makes any training progress; all of them stay blocked.
Here is the output of each rank:
in rank0:
Finished initializing process group; backend: gloo, rank: 0, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)
in rank1:
Finished initializing process group; backend: gloo, rank: 1, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)
in rank2:
Finished initializing process group; backend: gloo, rank: 2, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)
in rank3:
Finished initializing process group; backend: gloo, rank: 3, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)
in rank4:
Finished initializing process group; backend: gloo, rank: 4, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.209 (1.856)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002
in rank5:
Finished initializing process group; backend: gloo, rank: 5, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.209 (1.856)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002
in rank6:
Finished initializing process group; backend: gloo, rank: 6, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.156 (1.804)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002
in rank7:
Finished initializing process group; backend: gloo, rank: 7, world_size: 8
Send ranks: {}
Receive ranks: {'out1': [4, 5, 6], 'target': [4, 5, 6]}
Setting up process groups for broadcasts...
Letting in 0 warm-up minibatches
Running training for 20016 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690624.000 bytes send_tensors 0.000 seconds send_tensors_size 0.000 bytes
Epoch: 0 Step 0 Learning rate: 0.010000
Epoch: [0][0/20016] Time: 8.293 (8.293) Epoch time [hr]: 0.002 (46.107) Memory: 1.284 (1.636) Loss: 6.9063 (6.9063) Prec@1: 0.000 (0.000) Prec@5: 0.000 (0.000)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 25690112.000 bytes
Optimizer step took: 0.005
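A note on the numbers in the logs above: the per-rank minibatch counts are consistent with a single global count being divided by each stage's replication factor (5004 × 4 = 6672 × 3 = 20016 × 1). The sketch below only reproduces that arithmetic. The function name is hypothetical and is not part of the PipeDream runtime, and the even-division requirement is an assumption.

```python
def per_rank_minibatches(total_minibatches, stage_replicas):
    """Split a global minibatch count across the replicas of each stage.

    Hypothetical helper for illustration; assumes the count must divide
    evenly by every stage's replication factor.
    """
    counts = []
    for replicas in stage_replicas:
        if total_minibatches % replicas != 0:
            raise ValueError(
                f"{total_minibatches} minibatches do not divide evenly "
                f"across {replicas} replicas"
            )
        counts.append(total_minibatches // replicas)
    return counts


# Stage replication factors 4, 3, 1 as reported in the logs.
print(per_rank_minibatches(20016, [4, 3, 1]))  # [5004, 6672, 20016]
```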
https://github.com/msr-fiddle/pipedream/blob/master/optimizer/optimizer_graph_hierarchical.py#L201 includes a variable num_machines which has not been defined. The parameter passed into the main function is all_num_machines, as opposed to num_machines. By default the code is not executed: if -m is not set when running the script, the check is bypassed due to conditional short-circuiting. But if you set the -m parameter when calling optimizer_graph_hierarchical.py, an error is thrown.
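To illustrate why the undefined name goes unnoticed by default, here is a minimal sketch of short-circuit evaluation hiding a NameError. The function and the condition are hypothetical stand-ins and are not copied from optimizer_graph_hierarchical.py; only the names num_machines and all_num_machines come from the report.

```python
def run_check(m_value, all_num_machines):
    """Hypothetical stand-in for the guarded check described above.

    m_value plays the role of the -m option (None when the flag is omitted).
    num_machines is deliberately left undefined, mirroring the report.
    """
    # `and` evaluates its right-hand side only when the left-hand side is
    # truthy, so the undefined name is looked up only when m_value is set.
    if m_value is not None and num_machines > 1:  # noqa: F821
        return "checked"
    return "skipped"


print(run_check(None, [4]))  # skipped: the right-hand side never runs

try:
    run_check(16, [4])
except NameError as exc:
    print("NameError:", exc)
```

Under this reading, the fix would be to reference the variable that is actually in scope; which one is intended is for the maintainers to confirm.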
First of all, I want to say that your work is amazing.
I used your hybrid_conf.json for vgg16 with gpus=16 (https://github.com/msr-fiddle/pipedream/blob/f50827f2e28cbdbd82a4ea686c0498272b1460d6/runtime/image_classification/models/vgg16/gpus%3D16/hybrid_conf.json).
It takes only 600 seconds to train one epoch on the ImageNet dataset. Can you tell me how you generated this configuration file, and which bandwidth parameter you used?
I used two containers on two servers to run PipeDream. Data parallelism worked with the nccl backend, but model parallelism did not work with the gloo backend: both servers seemed to just wait for a connection, with no further output.
I then used runtime/tests/communication/point_to_point.py to test the connection, and saw the same behavior.
# server 1:
python point_to_point.py --backend gloo --master_addr xxx.xxxx.xx.xx --rank 0 --master_port 8888
# server 2
python point_to_point.py --backend gloo --master_addr xxx.xxxx.xx.xx --rank 1 --master_port 8888
@deepakn94 Could you help me?
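One way to narrow down a hang like this is to check, independently of PyTorch, whether the master port is reachable from the second server at all. The sketch below uses only the Python standard library. The helper name is hypothetical, and it assumes the rank 0 process is already listening on the master port when the check runs.

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Replace 127.0.0.1 with the --master_addr value used in the commands above.
    print(can_reach("127.0.0.1", 8888, timeout=2.0))
```

If this returns False from inside the second container, the cause is likely container networking or a firewall rather than PipeDream itself.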
Traceback (most recent call last):
File "main.py", line 574, in <module>
main()
File "main.py", line 266, in main
per_layer_times, data_time = profile_train(train_loader, model, criterion, optimizer)
File "main.py", line 345, in profile_train
with torchprofiler.Profiling(model, module_whitelist=[]) as p:
File "../torchmodules/torchprofiler/profiling.py", line 25, in __enter__
self.start()
File "../torchmodules/torchprofiler/profiling.py", line 93, in start
self.hook_modules(self.model)
File "../torchmodules/torchprofiler/profiling.py", line 120, in hook_modules
self.hook_modules(sub_module)
File "../torchmodules/torchprofiler/profiling.py", line 120, in hook_modules
self.hook_modules(sub_module)
File "../torchmodules/torchprofiler/profiling.py", line 122, in hook_modules
sub_module.reset_hooks()
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 585, in __getattr__
type(self).__name__, name))
AttributeError: 'Conv2d' object has no attribute 'reset_hooks'
In optimizer_graph_hierarchical.py, the script's parameter (--network_bandwidth) is the bandwidth within a machine. What bandwidth is assumed between machines?
I worked around the issue in #8, but I am now seeing another submodule missing:
ModuleNotFoundError: No module named 'seq2seq.pack_utils'
I found that seq2seq/csrc contains the C++ code that I assume gets called under the hood, but I am not sure what I need to do to make the Python interpreter find that as a module. Is there some code generation step that needs to happen or anything like that?
Hello,
I am walking through the example workflow, and I am hitting an issue when running:
# python main_with_runtime.py --module models.vgg16.gpus=4 -b 64 --data_dir /mnt --rank 0 --master_addr 127.0.0.1 --config_path models/vgg16/gpus=4/hybrid_conf.json
Traceback (most recent call last):
File "main_with_runtime.py", line 578, in <module>
main()
File "main_with_runtime.py", line 149, in main
output_tensors = stage(*tuple(input_tensors))
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 507, in __call__
result = self.forward(*input, **kwargs)
File "/workspace/pipedream/runtime/image_classification/models/vgg16/gpus=4/stage1.py", line 60, in forward
out16 = out14.view(out15)
RuntimeError: shape '[64]' is invalid for input of size 1605632
My environment is a single docker container running on a single 4 GPU machine configured following the instructions in https://github.com/msr-fiddle/pipedream#setup. The commands that I ran were:
cd profiler/image_classification
CUDA_VISIBLE_DEVICES=0 python main.py -a vgg16 -b 64 --data_dir /mnt
cd ../../optimizer/
python optimizer_graph_hierarchical.py -f ../profiler/image_classification/profiles/vgg16/graph.txt -n 4 --activation_compression_ratio 1 -o vgg16_partitioned
python convert_graph_to_model.py -f vgg16_partitioned/gpus=4.txt -n VGG16Partitioned -a vgg16 -o ../runtime/image_classification/models/vgg16/gpus=4 --stage_to_num_ranks 0:3,1:1
cd ../runtime/image_classification/
python main_with_runtime.py --module models.vgg16.gpus=4 -b 64 --data_dir /mnt --rank 0 --master_addr 127.0.0.1 --config_path models/vgg16/gpus=4/hybrid_conf.json
I ran into a similar issue trying to set up an alexnet model with 4 GPUs using a batch size of 256, except that the runtime error complained about a shape of '[256]' instead. I can provide details on what I ran there if it is useful.
Is there anything obvious I am doing wrong here for my specific environment?
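For reference, the numbers in the error are consistent with a flatten step that lost its second dimension: 1605632 = 64 × 25088, and 25088 = 512 × 7 × 7. So a view to (64, -1) would be valid, while a view to (64,) alone cannot hold all the elements. The pure-Python sketch below mimics that element-count check. It is not PyTorch's implementation, the function name is hypothetical, and that the generated stage1.py was meant to call view(batch_size, -1) is an assumption.

```python
from math import prod


def resolve_view_shape(numel, shape):
    """Mimic the element-count check behind a reshape (at most one -1)."""
    shape = list(shape)
    if shape.count(-1) > 1:
        raise ValueError("only one dimension can be inferred")
    known = prod(d for d in shape if d != -1)
    if -1 in shape:
        if known == 0 or numel % known != 0:
            raise ValueError(f"shape {shape} is invalid for input of size {numel}")
        # Infer the missing dimension from the remaining element count.
        shape[shape.index(-1)] = numel // known
    elif known != numel:
        raise ValueError(f"shape {shape} is invalid for input of size {numel}")
    return tuple(shape)


print(resolve_view_shape(1605632, (64, -1)))  # (64, 25088)

try:
    resolve_view_shape(1605632, (64,))
except ValueError as exc:
    print("ValueError:", exc)
```

The same arithmetic would explain the '[256]' variant seen with alexnet at batch size 256.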
I am running the translation profiler as follows:
# python train.py --dataset-dir /mnt/wmt16/ --target-bleu 21.8 --epochs 20 --math fp32 --print-freq 10 --arch gnmt --batch-size 64 --test-batch-size 128 --model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': False}" --optimization-config "{'optimizer': 'FusedAdam', 'lr': 1.75e-3}" --scheduler-config "{'lr_method':'mlperf', 'warmup_iters':1000, 'remain_steps':1450, 'decay_steps':40}"
When running, I encounter the following error when the first training epoch starts:
0: Starting epoch 0 [7/1801]
:::MLPv0.5.0 gnmt 1569879709.258431435 (train.py:452) train_epoch: 0
THCudaCheck FAIL file=/opt/pytorch/pytorch/aten/src/THC/generic/THCTensorMath.cu line=35 error=209 : no kernel image is available for execution on the device
Traceback (most recent call last):
File "train.py", line 474, in <module>
main()
File "train.py", line 458, in main
train_loss, train_perf = trainer.optimize(train_loader)
File "/workspace/pipedream/profiler/translation/seq2seq/train/trainer.py", line 372, in optimize
self.preallocate(data_loader, training=True)
File "/workspace/pipedream/profiler/translation/seq2seq/train/trainer.py", line 360, in preallocate
self.iterate(src, tgt, update=False, training=training)
File "/workspace/pipedream/profiler/translation/seq2seq/train/trainer.py", line 160, in iterate
output = self.model(src, src_length, tgt[:-1])
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 507, in __call__
result = self.forward(*input, **kwargs)
File "/workspace/pipedream/profiler/translation/seq2seq/models/gnmt.py", line 62, in forward
context = self.encode(input_encoder, input_enc_len)
File "/workspace/pipedream/profiler/translation/seq2seq/models/seq2seq_base.py", line 34, in encode
return self.encoder(inputs, lengths)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 507, in __call__
result = self.forward(*input, **kwargs)
File "/workspace/pipedream/profiler/translation/seq2seq/models/encoder.py", line 127, in forward
x = self.rnn_layers[0](x, lengths)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 507, in __call__
result = self.forward(*input, **kwargs)
File "/workspace/pipedream/profiler/translation/seq2seq/models/encoder.py", line 64, in forward
return self.emu_bidir_lstm(self.layer1, self.layer2, input, lengths)
File "/workspace/pipedream/profiler/translation/seq2seq/models/encoder.py", line 53, in emu_bidir_lstm
out1 = model1(inputl1)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 507, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 556, in forward
return self.forward_tensor(input, hx)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 536, in forward_tensor
output, hidden = self.forward_impl(input, hx, batch_sizes, max_batch_size, sorted_indices)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 509, in forward_impl
dtype=input.dtype, device=input.device)
RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at /opt/pytorch/pytorch/aten/src/THC/generic/THCTensorMath.cu:35
Do you have any ideas what might be causing this? It isn't clear to me whether or not this is a software/environment issue, or an issue with my specific hardware.
Hi. It looks like the installation instructions are incomplete for the translation task. They should mention that the gnmt package has to be installed explicitly with cd ./runtime/translation; python setup.py install. It is also worth mentioning that one may need to change the GPU architecture in https://github.com/msr-fiddle/pipedream/blob/master/runtime/translation/setup.py to match their hardware.
Thank you.
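As a rough illustration of what matching the architecture to the hardware means: nvcc selects target architectures through options of the form -gencode arch=compute_XY,code=sm_XY, where X.Y is the GPU's compute capability (for example 6.0 for a Tesla P100 and 7.0 for a V100). The helper below is hypothetical and only builds such flag strings. How setup.py in the repository actually passes them is not shown here.

```python
def gencode_flags(capabilities):
    """Build nvcc -gencode flags for a list of (major, minor) capabilities."""
    flags = []
    for major, minor in sorted(set(capabilities)):
        arch = f"{major}{minor}"
        flags += ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]
    return flags


# Compute capability 6.0 (P100) and 7.0 (V100).
print(gencode_flags([(6, 0), (7, 0)]))
```

On a machine with PyTorch installed, torch.cuda.get_device_capability() reports the (major, minor) pair for the current device.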
Similar to issue 16, an error occurred when I tried to profile ResNet:
Traceback (most recent call last):
File "main.py", line 597, in <module>
main()
File "main.py", line 311, in main
os.path.join(args.profile_directory, args.arch))
File "main.py", line 122, in create_graph
output = model(input)
File "/root/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 574, in __call__
result = self.forward(*input, **kwargs)
File "/home/gudiandian/pipedream/profiler/image_classification/models/resnetNew.py", line 93, in forward
x = torch.flatten(x, 1)
TypeError: flatten(): argument 'input' (position 1) must be Tensor, not TensorWrapper
I have changed my torchvision version to 0.2.1 as suggested in issue 16, but this does not remove the error. My PyTorch version is 1.6.0. Does this torch version work with PipeDream, or is there some other problem? Thank you.
Hello, this is my GPU utilization. I don't know which step caused the low utilization of GPU 1.
| 0 Tesla P100-SXM2... Off | 00000000:89:00.0 Off | 0 |
| N/A 49C P0 180W / 300W | 9481MiB / 16280MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... Off | 00000000:8A:00.0 Off | 0 |
| N/A 35C P0 44W / 300W | 657MiB / 16280MiB | 1% Default |
Environment: 2 gpus in dgx-1
Bandwidth 20GB/s (considered NVLINK bandwidth)
Run the demo:
step1: CUDA_VISIBLE_DEVICES=0 python main.py -a vgg16 -b 32 --data_dir ./data
step2: python optimizer_graph_hierarchical.py -f ../profiler/image_classification/profiles/vgg16/graph.txt -n 2 --activation_compression_ratio 1 -o vgg16_partitioned --network_bandwidths [21474836480]
step3: python convert_graph_to_model.py -f vgg16_partitioned/gpus=2.txt -n VGG16Partitioned -a vgg16 -o ../runtime/image_classification/models/vgg16/gpus=2 --stage_to_num_ranks 0:1,1:1
step4: docker0's IP is 10.1.1.4, docker1's IP is 10.1.1.5
in docker0:
python main_with_runtime.py --module models.vgg16.gpus=2 -b 32 --data_dir ./data --rank 0 --local_rank 0 --master_addr 10.1.1.4 --config_path models/vgg16/gpus=2/hybrid_conf.json --distributed_backend gloo --eval-batch-size 32
in docker1:
python main_with_runtime.py --module models.vgg16.gpus=2 -b 32 --data_dir ./data --rank 1 --local_rank 1 --master_addr 10.1.1.4 --config_path models/vgg16/gpus=2/hybrid_conf.json --distributed_backend gloo --eval-batch-size 32
I ran the commands below:
worker-0:
python main_with_runtime.py --module models.vgg16.gpus=2 -b 64 --data_dir /home/admin/pipedream/data/sample --rank 0 --local_rank 0 --master_addr localhost --config_path models/vgg16/gpus=2/hybrid_conf.json --distributed_backend nccl
worker-1:
python main_with_runtime.py --module models.vgg16.gpus=2 -b 64 --data_dir /home/admin/pipedream/data/sample --rank 1 --local_rank 1 --master_addr localhost --config_path models/vgg16/gpus=2/hybrid_conf.json --distributed_backend nccl
Then I got this error:
Exception in thread Thread-3:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "../communication.py", line 632, in send_helper_thread
sub_process_group=sub_process_group)
File "../communication.py", line 709, in _send
dist.send(tensor=tensor_shape, dst=dst_rank, tag=tag)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 608, in send
_default_pg.send([tensor], dst, tag).wait()
RuntimeError: ProcessGroupNCCL does not support send
Epoch: 0 Step 0 Learning rate: 0.100000
Epoch: [0][0/13] Memory: 6.323 (9.244)
In the top-level README, the example workflow has a line that says:
[from pipedream/runtime/image_classification; run on 4 GPUs (including a single server with 4 GPUs)]
Do those commands actually require setting the --local_rank option in order to map the different processes to different GPUs on a single-machine setup? When I run nvidia-smi, I do not see multiple GPUs being used unless I set --local_rank to a unique value for each process (between 0 and 3 in my 4 GPU setup).
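For anyone hitting the same thing, here is a small sketch that generates one command per GPU with a distinct --local_rank. It assumes that on a single machine --local_rank should simply equal --rank, which matches what worked for me but is not confirmed by the README. The helper name is hypothetical, and only flags that appear in the commands on this page are used.

```python
def launch_commands(num_gpus, module, config_path, data_dir,
                    batch_size=64, master_addr="127.0.0.1"):
    """Build one main_with_runtime.py command line per GPU on one machine."""
    commands = []
    for rank in range(num_gpus):
        commands.append(
            f"python main_with_runtime.py --module {module} -b {batch_size} "
            f"--data_dir {data_dir} --rank {rank} --local_rank {rank} "
            f"--master_addr {master_addr} --config_path {config_path}"
        )
    return commands


for cmd in launch_commands(
    4, "models.vgg16.gpus=4", "models/vgg16/gpus=4/hybrid_conf.json", "/mnt"
):
    print(cmd)
```

Each printed line would be run in its own shell from pipedream/runtime/image_classification.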
@deepakn94 It seems that this project only works with PyTorch. Is there any plan for cross-framework support (such as TensorFlow or MXNet)? Thanks.
Could you please describe the meaning of the antichain graph used in the partitioning algorithm? Is it related to backward-path computations?
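For context on the general term: in order theory, an antichain is a set of elements no two of which are comparable. For a directed acyclic graph of layers, that means no node in the set can reach another node in the set. The sketch below only illustrates this general definition on a toy graph. It is not taken from the PipeDream optimizer, and the function names are hypothetical.

```python
def reachable(graph, start):
    """Return all nodes reachable from start via directed edges."""
    seen, stack = set(), list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def is_antichain(graph, nodes):
    """True if no node in the set can reach any other node in the set."""
    nodes = set(nodes)
    return all(not (reachable(graph, n) & (nodes - {n})) for n in nodes)


# Toy diamond-shaped DAG: a -> b, a -> c, b -> d, c -> d.
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(is_antichain(dag, {"b", "c"}))  # True: parallel branches
print(is_antichain(dag, {"a", "d"}))  # False: a reaches d
```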