Comments (4)
When I train VGG16 on ImageNet with 8 GPUs, an error occurs in pipedream/runtime/communication.py, line 235: "assert forward_num_iterations % self.num_ranks_in_next_stage == 0".
Here "forward_num_iterations" is 10009 and "num_ranks_in_next_stage" is 2, where forward_num_iterations = "total number of training samples" / "batch_size" / "number of stages".
When I comment out that assertion and the one on line 242 ("assert backward_num_iterations % self.num_ranks_in_previous_stage == 0"), the program runs successfully for one epoch, and then another error occurs:
Traceback (most recent call last):
  File "main_with_runtime.py", line 617, in <module>
    main()
  File "main_with_runtime.py", line 321, in main
    train(train_loader, r, optimizer, epoch)
  File "main_with_runtime.py", line 442, in train
    r.run_backward()
  File "../runtime.py", line 650, in run_backward
    for output_name in outputs]))
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 479, in distributed_data_parallel_hook
    self._sync_reduction_works()
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 501, in _sync_reduction_works
    self.buckets_coalesced[bucket_idx])
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:72] Timed out waiting 1800000ms for recv operation to complete
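One possible reading of this timeout, offered as an assumption rather than a confirmed diagnosis: with the assertions disabled, 10009 iterations cannot be split evenly across 2 replicas, so one replica may run one more iteration than the other and then wait in a gradient all-reduce that its peer never joins. The sketch below only models such a split with a round-robin assignment; `split_iterations` is a hypothetical helper, not PipeDream code, and the round-robin assignment itself is an assumption.

```python
def split_iterations(num_iterations, num_ranks):
    """Count how many iterations each rank receives under round-robin assignment.

    Hypothetical helper for illustration; not part of PipeDream.
    """
    counts = [0] * num_ranks
    for i in range(num_iterations):
        counts[i % num_ranks] += 1
    return counts


# Values taken from the report above: 10009 iterations, 2 ranks in the next stage.
counts = split_iterations(10009, 2)
print(counts)  # [5005, 5004]
```

If the replicas really do end up with unequal counts like this, the extra iteration on one replica would have no matching partner for its collective operation, which would be consistent with a recv timing out after 1800000 ms.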
How can I solve this problem? Any answer would be useful. Thank you very much!
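A possible workaround, sketched here under assumptions and not confirmed by the PipeDream maintainers: instead of removing the assertions, round the per-epoch iteration count down to a multiple of the replica counts so that both checks pass. `usable_iterations` is a hypothetical helper; where the resulting value would need to be applied inside main_with_runtime.py is not shown in this thread.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def usable_iterations(num_iterations, replica_counts):
    """Round num_iterations down to a multiple of every replica count.

    Hypothetical helper for illustration; not part of PipeDream.
    """
    multiple = reduce(lcm, replica_counts, 1)
    return (num_iterations // multiple) * multiple


# Values from the report above: 10009 iterations, 2 ranks in the next stage.
n = usable_iterations(10009, [2])
print(n)      # 10008
print(n % 2)  # 0, so the divisibility assertion would hold
```

The trade-off is that the last few batches of each epoch are skipped, which is usually negligible at this scale (1 of 10009 iterations here).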
I have the same issue, have you solved this problem?
Thank you.
from pipedream.
I have the same issue.
Have you solved this problem?
I have the same issue, too.
Have you solved this problem?
I have the same issue, have you solved this problem?
Sorry, I don't remember it.