Comments (3)
Eventually it throws this runtime error:
Exception in thread Thread-9:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "../communication.py", line 619, in recv_helper_thread
    sub_process_group=sub_process_group)
  File "../communication.py", line 654, in _recv
    group=sub_process_group)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 755, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:72] Timed out waiting 1800000ms for recv operation to complete
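For context: 1800000 ms is exactly torch.distributed's default 30-minute process-group timeout, so the recv in the helper thread was never matched by a send from the peer. Below is a minimal sketch of raising that timeout at initialization, assuming the stock torch.distributed API rather than PipeDream's own wrappers; note that a larger timeout only helps if the peer is merely slow, not if the communication schedule is mismatched and the send never happens:

import datetime
import os

import torch.distributed as dist

# The default timeout is datetime.timedelta(minutes=30), i.e. the 1800000 ms
# in the error above. The 2-hour value here is a hypothetical choice.
dist.init_process_group(
    backend="gloo",
    init_method="env://",  # expects MASTER_ADDR / MASTER_PORT in the environment
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
    timeout=datetime.timedelta(hours=2),
)

The same timeout argument can also be passed to torch.distributed.new_group for sub-groups like the broadcast groups the traceback goes through.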
from pipedream.
My environment:
server1: 4 GPUs
server2: 4 GPUs
Initialization completes, but no rank makes training progress; they all stay blocked.
Here is the output of each rank:
in rank0:
Finished initializing process group; backend: gloo, rank: 0, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)
in rank1:
Finished initializing process group; backend: gloo, rank: 1, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)
in rank2:
Finished initializing process group; backend: gloo, rank: 2, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)
in rank3:
Finished initializing process group; backend: gloo, rank: 3, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)
in rank4:
Finished initializing process group; backend: gloo, rank: 4, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.209 (1.856)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002
in rank5 :
Finished initializing process group; backend: gloo, rank: 5, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.209 (1.856)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002
in rank6:
Finished initializing process group; backend: gloo, rank: 6, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.156 (1.804)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002
in rank7:
Finished initializing process group; backend: gloo, rank: 7, world_size: 8
Send ranks: {}
Receive ranks: {'out1': [4, 5, 6], 'target': [4, 5, 6]}
Setting up process groups for broadcasts...
Letting in 0 warm-up minibatches
Running training for 20016 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690624.000 bytes send_tensors 0.000 seconds send_tensors_size 0.000 bytes
Epoch: 0 Step 0 Learning rate: 0.010000
Epoch: [0][0/20016] Time: 8.293 (8.293) Epoch time [hr]: 0.002 (46.107) Memory: 1.284 (1.636) Loss: 6.9063 (6.9063) Prec@1: 0.000 (0.000) Prec@5: 0.000 (0.000)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 25690112.000 bytes
Optimizer step took: 0.005
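Reading the per-rank output together: 4 × 5004 = 3 × 6672 = 20016, so the per-replica minibatch splits are consistent, and the hang is not in initialization. Stage 0 (ranks 0-3) is blocked right after its first send, while stages 1 and 2 finished step 0 and are waiting for more input, and the traceback points at a broadcast inside one of the replicated-stage sub-groups. The sketch below is a hypothetical two-rank script (not PipeDream code) that reproduces the same failure mode, where one member of a broadcast sub-group never enters the collective:

import datetime
import os

import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
dist.init_process_group("gloo", init_method="env://", rank=rank, world_size=2,
                        timeout=datetime.timedelta(seconds=30))
# Both ranks must create the sub-group, even the one that never uses it.
group = dist.new_group(ranks=[0, 1], timeout=datetime.timedelta(seconds=30))
t = torch.zeros(4)
if rank == 1:
    # rank 0 never calls broadcast, so the recv behind this call blocks and
    # eventually raises "Timed out waiting ... for recv operation to complete".
    dist.broadcast(t, src=0, group=group)
dist.destroy_process_group()

If a schedule mismatch like this is the cause, comparing each rank's Send ranks / Receive ranks maps (as printed above) against which broadcasts actually execute per step should show the unmatched pair.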
I have the same issue. Have you solved this problem?
Thank you.
from pipedream.
I have the same issue! Have you found a solution?
from pipedream.
Related Issues (20)
- Handling uneven number of batches per replicated instance of a layer
- GPU Peer2Peer communication via --num_ranks_in_server argument
- Resource temporarily unavailable
- To run PipeDream_2BW branch without --recompute_step
- The BLEU score of translation model seems abnormal. The model doesn't seem to train effectively.
- GPT2 355m model convergence with 2BW training
- Is there AllReduce in data parallelism?
- How is the Double-Buffered Weight Mechanism implemented?
- Supporting T5
- The arguments of self.start_helper_thread() should be more flexible instead of fixed as int64.
- Question about time complexity of PipeDream-2BW's planner algorithm
- Question about PipeDream's optimizer
- AttributeError: module 'models.resnet50.resnet50' has no attribute 'model'
- Is there any 2bw code that will run on the native GPU
- AttributeError: module 'torch.distributed' has no attribute 'P2POp'
- Running in docker will give you an error that you can't find a physical address
- what is the role of pre_hook_pytorch_latest.patch?
- When I was testing the pipedream code with version-updated torch, I encountered the following error (1.1.0 -> 1.11.0):
- optimizer got an empty parameter list when rank=1
- same train_loader but got different loader size