
Comments (3)

ADAM-CT commented on July 29, 2024

Finally, it throws this runtime error:

Exception in thread Thread-9:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "../communication.py", line 619, in recv_helper_thread
    sub_process_group=sub_process_group)
  File "../communication.py", line 654, in _recv
    group=sub_process_group)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 755, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:72] Timed out waiting 1800000ms for recv operation to complete
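
The 1800000 ms in the message is torch.distributed's default 30-minute timeout for Gloo. As a quick check (a sketch, not PipeDream's own code), the timeout can be raised where init_process_group is called, to rule out a transfer that is merely slow rather than truly stuck:

# Sketch only: raise torch.distributed's default 30-minute timeout
# (the 1800000 ms seen in the error above) when creating the process group.
# Whether this helps depends on whether the recv is slow or genuinely blocked.
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",
    init_method="env://",                 # assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set
    timeout=datetime.timedelta(hours=2),  # default is timedelta(minutes=30)
)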


letianzhao commented on July 29, 2024

My environment:

server1: 4 GPUs

server2: 4 GPUs

Initialization completes, but none of the ranks make any training progress; they stay blocked the whole time.

Here is the output of each rank:

in rank0:
Finished initializing process group; backend: gloo, rank: 0, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)

in rank1:
Finished initializing process group; backend: gloo, rank: 1, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)

in rank2:
Finished initializing process group; backend: gloo, rank: 2, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)

in rank3:
Finished initializing process group; backend: gloo, rank: 3, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)

in rank4:
Finished initializing process group; backend: gloo, rank: 4, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.209 (1.856)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002

in rank5:
Finished initializing process group; backend: gloo, rank: 5, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.209 (1.856)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002

in rank6:
Finished initializing process group; backend: gloo, rank: 6, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.156 (1.804)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002

in rank7:
Finished initializing process group; backend: gloo, rank: 7, world_size: 8
Send ranks: {}
Receive ranks: {'out1': [4, 5, 6], 'target': [4, 5, 6]}
Setting up process groups for broadcasts...
Letting in 0 warm-up minibatches
Running training for 20016 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690624.000 bytes send_tensors 0.000 seconds send_tensors_size 0.000 bytes
Epoch: 0 Step 0 Learning rate: 0.010000
Epoch: [0][0/20016] Time: 8.293 (8.293) Epoch time [hr]: 0.002 (46.107) Memory: 1.284 (1.636) Loss: 6.9063 (6.9063) Prec@1: 0.000 (0.000) Prec@5: 0.000 (0.000)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 25690112.000 bytes
Optimizer step took: 0.005

I have the same issue. Have you solved this problem?
Thank you.
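
For anyone trying to narrow this down: a minimal cross-server check (a hypothetical debugging script, not part of PipeDream) is to run a plain Gloo broadcast between the two machines. If this also hangs, the problem is likely the network interface Gloo selects (setting GLOO_SOCKET_IFNAME to the correct NIC is a common fix) rather than PipeDream itself:

# gloo_check.py (hypothetical helper): run on both servers with matching
# MASTER_ADDR/MASTER_PORT, WORLD_SIZE=2, and RANK=0 on one machine, RANK=1 on the other.
import torch
import torch.distributed as dist

def main():
    # Reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE from the environment.
    dist.init_process_group(backend="gloo", init_method="env://")
    rank = dist.get_rank()
    t = torch.full((4,), float(rank))
    dist.broadcast(t, src=0)  # every rank should end up with rank 0's values
    print("rank", rank, "received", t.tolist())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()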


Q1Shane commented on July 29, 2024

I have the same issue!
Have you found a solution?

