Comments (8)
Can you check if this line is successfully being executed or not?
Also, what does ifconfig give you?
from pipedream.
I'm sure it was stuck on that line.
First server:
docker0 Link encap:Ethernet HWaddr 02:42:7a:64:01:0b
inet addr:172.17.0.1 Bcast:172.17.255.255 Mask:255.255.0.0
inet6 addr: fe80::42:7aff:fe64:10b/64 Scope:Link
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:695688 errors:0 dropped:0 overruns:0 frame:0
TX packets:964730 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:36630114 (36.6 MB) TX bytes:2590485457 (2.5 GB)
enp1s0f0 Link encap:Ethernet HWaddr 50:6b:4b:48:f9:02
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1925264 errors:0 dropped:0 overruns:0 frame:0
TX packets:731800 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:329735619 (329.7 MB) TX bytes:303080112 (303.0 MB)
enp1s0f1 Link encap:Ethernet HWaddr 50:6b:4b:48:f9:03
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1913046 errors:0 dropped:0 overruns:0 frame:0
TX packets:744934 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:328516079 (328.5 MB) TX bytes:304361864 (304.3 MB)
enp6s0 Link encap:Ethernet HWaddr 18:31:bf:ce:75:69
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Memory:fa600000-fa67ffff
enp7s0 Link encap:Ethernet HWaddr 18:31:bf:ce:75:6a
inet addr:10.37.0.154 Bcast:10.37.255.255 Mask:255.255.0.0
inet6 addr: fe80::1a31:bfff:fece:756a/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:153985906 errors:0 dropped:75268 overruns:0 frame:0
TX packets:65997037 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:106082283261 (106.0 GB) TX bytes:65581165114 (65.5 GB)
Memory:fa500000-fa57ffff
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:16770839 errors:0 dropped:0 overruns:0 frame:0
TX packets:16770839 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:756604351890 (756.6 GB) TX bytes:756604351890 (756.6 GB)
Second server:
docker0 Link encap:Ethernet HWaddr 02:42:50:e2:d7:42
inet addr:172.17.0.1 Bcast:172.17.255.255 Mask:255.255.0.0
inet6 addr: fe80::42:50ff:fee2:d742/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1479723 errors:0 dropped:0 overruns:0 frame:0
TX packets:2101909 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:92751285 (92.7 MB) TX bytes:5170455762 (5.1 GB)
enp5s0 Link encap:Ethernet HWaddr 18:31:bf:ce:75:8d
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Memory:fa600000-fa67ffff
enp6s0 Link encap:Ethernet HWaddr 18:31:bf:ce:75:8e
inet addr:10.37.0.152 Bcast:10.37.255.255 Mask:255.255.0.0
inet6 addr: fe80::1a31:bfff:fece:758e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:179463516 errors:0 dropped:89540 overruns:0 frame:0
TX packets:68389093 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:107214923926 (107.2 GB) TX bytes:64154306470 (64.1 GB)
Memory:fa500000-fa57ffff
ens2f0 Link encap:Ethernet HWaddr ec:0d:9a:8c:02:9c
inet6 addr: fe80::69f0:163e:1b2:4f83/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2410894 errors:0 dropped:0 overruns:0 frame:0
TX packets:710506 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:586478625 (586.4 MB) TX bytes:126837006 (126.8 MB)
ens2f1 Link encap:Ethernet HWaddr ec:0d:9a:8c:02:9d
inet6 addr: fe80::7495:ecd9:9bb8:d8ca/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2401449 errors:0 dropped:0 overruns:0 frame:0
TX packets:720867 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:585506402 (585.5 MB) TX bytes:127871441 (127.8 MB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:69929743 errors:0 dropped:0 overruns:0 frame:0
TX packets:69929743 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:3841555161589 (3.8 TB) TX bytes:3841555161589 (3.8 TB)
veth8ab1497 Link encap:Ethernet HWaddr 1a:19:a4:98:c7:ba
inet6 addr: fe80::1819:a4ff:fe98:c7ba/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:194 errors:0 dropped:0 overruns:0 frame:0
TX packets:2922 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:2223765 (2.2 MB) TX bytes:725592 (725.5 KB)
from pipedream.
Can you try setting GLOO_SOCKET_IFNAME to enp6s0 on both containers?
from pipedream.
Thanks @deepakn94!! I solved it by setting GLOO_SOCKET_IFNAME to enp6s0 and enp7s0 separately.
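The fix described here can be sketched roughly as follows. This is a minimal illustration, not the code from this thread: the hostnames `server1` and `server2` and the helper `set_gloo_interface` are hypothetical, and the interface-to-server mapping is read off the ifconfig output above (enp7s0 carries 10.37.0.154 on the first server, enp6s0 carries 10.37.0.152 on the second).

```python
import os

# Hypothetical hostnames; the interface names are the ones that have an
# IPv4 address in the ifconfig output posted earlier in this thread.
IFNAME_BY_HOST = {
    "server1": "enp7s0",  # first server, 10.37.0.154
    "server2": "enp6s0",  # second server, 10.37.0.152
}


def set_gloo_interface(hostname, ifname_by_host=IFNAME_BY_HOST):
    """Set GLOO_SOCKET_IFNAME for this host and return the chosen name.

    The variable has to be set before the Gloo process group is created.
    """
    ifname = ifname_by_host[hostname]
    os.environ["GLOO_SOCKET_IFNAME"] = ifname
    return ifname


# In a real launch the argument would likely be socket.gethostname(), and
# the call would come before torch.distributed.init_process_group(backend="gloo", ...).
chosen = set_gloo_interface("server1")
print(chosen)
```

Exporting the variable in the shell of each container before launching should have the same effect; the point is only that each machine names its own interface.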
from pipedream.
I have another question. The speed of enp6s0 and enp7s0 is only 100M. If I try to use the faster ens2f1 and enp1s0f1 instead, this error occurs:
RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:88] ifa != nullptr. Unable to find address for: enp1s0f1
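One thing visible in the ifconfig output above is that enp1s0f1 has no `inet addr` line at all, and ens2f1 only has a link-local inet6 address. That may be why Gloo reports that it cannot find an address, though this is a guess rather than a confirmed diagnosis. A small sketch for checking which interfaces carry an IPv4 address, assuming the old net-tools ifconfig format shown in this thread (the helper name `ipv4_by_interface` and the `SAMPLE` text are illustrative only):

```python
import re


def ipv4_by_interface(ifconfig_text):
    """Map each interface name to its IPv4 address, or None if it has none."""
    result = {}
    current = None
    for line in ifconfig_text.splitlines():
        # An interface block starts with "<name> Link encap:...".
        header = re.match(r"\s*(\S+)\s+Link encap:", line)
        if header:
            current = header.group(1)
            result[current] = None
            continue
        # "inet addr:" is the IPv4 line; "inet6 addr:" does not match this pattern.
        addr = re.search(r"inet addr:(\d+\.\d+\.\d+\.\d+)", line)
        if addr and current is not None:
            result[current] = addr.group(1)
    return result


# Excerpt from the first server's output posted above.
SAMPLE = """\
enp1s0f1 Link encap:Ethernet HWaddr 50:6b:4b:48:f9:03
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
enp7s0 Link encap:Ethernet HWaddr 18:31:bf:ce:75:6a
inet addr:10.37.0.154 Bcast:10.37.255.255 Mask:255.255.0.0
inet6 addr: fe80::1a31:bfff:fece:756a/64 Scope:Link
"""

addresses = ipv4_by_interface(SAMPLE)
for name, addr in addresses.items():
    print(name, addr or "no IPv4 address")
```

If the faster interfaces really have no address assigned, giving each one an IP on a shared subnet would be the first thing to try before pointing GLOO_SOCKET_IFNAME at them.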
from pipedream.
Hmm, not sure about that one :/
from pipedream.
Thanks anyway and Merry Christmas!!
from pipedream.
Did you figure this out?
from pipedream.