Comments (10)
BINGO! Disabling IOMMU did the trick!
$ CUDA_VISIBLE_DEVICES=0,1 NCCL_SOCKET_IFNAME=lo NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
DeepWhite:3031:3031 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
DeepWhite:3031:3031 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
DeepWhite:3031:3031 [0] NCCL INFO NET/IB : No device found.
DeepWhite:3031:3031 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
DeepWhite:3031:3031 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
DeepWhite:3032:3032 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
DeepWhite:3032:3032 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
DeepWhite:3032:3032 [1] NCCL INFO NET/IB : No device found.
DeepWhite:3032:3032 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
DeepWhite:3032:3032 [1] NCCL INFO Using network Socket
DeepWhite:3031:3059 [0] NCCL INFO Channel 00/02 : 0 1
DeepWhite:3032:3060 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
DeepWhite:3031:3059 [0] NCCL INFO Channel 01/02 : 0 1
DeepWhite:3031:3059 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
DeepWhite:3031:3059 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff
DeepWhite:3031:3059 [0] NCCL INFO Channel 00 : 0[3000] -> 1[21000] via P2P/IPC
DeepWhite:3032:3060 [1] NCCL INFO Channel 00 : 1[21000] -> 0[3000] via P2P/IPC
DeepWhite:3031:3059 [0] NCCL INFO Channel 01 : 0[3000] -> 1[21000] via P2P/IPC
DeepWhite:3032:3060 [1] NCCL INFO Channel 01 : 1[21000] -> 0[3000] via P2P/IPC
DeepWhite:3032:3060 [1] NCCL INFO Connected all rings
DeepWhite:3032:3060 [1] NCCL INFO Connected all trees
DeepWhite:3032:3060 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
DeepWhite:3032:3060 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
DeepWhite:3031:3059 [0] NCCL INFO Connected all rings
DeepWhite:3031:3059 [0] NCCL INFO Connected all trees
DeepWhite:3031:3059 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
DeepWhite:3031:3059 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
DeepWhite:3032:3060 [1] NCCL INFO comm 0x7fdb7c002fb0 rank 1 nranks 2 cudaDev 1 busId 21000 - Init COMPLETE
DeepWhite:3031:3059 [0] NCCL INFO comm 0x7fcbac002fb0 rank 0 nranks 2 cudaDev 0 busId 3000 - Init COMPLETE
DeepWhite:3031:3031 [0] NCCL INFO Launch mode Parallel
[DeepWhite-1] is OK (global rank: 1/2)
[DeepWhite-0] is OK (global rank: 0/2)
pt=1.11.0+cu113, cuda=11.3, nccl=(2, 10, 3)
device compute capabilities=(8, 6)
pytorch compute capabilities=['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
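For later readers: those last three lines are the test script's own report, but the same info can be pulled straight from PyTorch with a one-liner. A minimal sketch, assuming at least one visible CUDA device:
$ python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version()); print(torch.cuda.get_device_capability(0)); print(torch.cuda.get_arch_list())"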
from ml-engineering.
So the key to unravelling this problem was noticing a page fault in syslog:
$ tail /var/log/syslog
Mar 23 10:57:11 deepwhite kernel: [ 1332.090989] nvidia 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001f address=0x19740001000 flags=0x0030]
Yes, that is correct. Thanks again for all your help!
from ml-engineering.
Oh, duh. You can also disable IOMMU in the BIOS. That's preferable to fiddling with GRUB, methinks.
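For anyone who does go the GRUB route instead, the usual recipe is roughly the following (a sketch, assuming a Debian/Ubuntu-style setup; the BIOS toggle achieves the same end result):
# in /etc/default/grub, add amd_iommu=off (or iommu=pt as a softer variant) to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off"
$ sudo update-grub && sudo reboot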
from ml-engineering.
Is your 192.168.50.21 firewalled? Or is it somehow a misconfigured network device?
Does it work if you use the loopback device 127.0.0.1?
NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=lo python -m torch.distributed.run --nproc_per_node 4 --nnodes 1 torch-distributed-gpu-test.py
If not, see what other local network devices you have via ifconfig, if any, and try one of those instead of lo. It's currently using enp67s0 in your case.
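i.e. list the candidate interfaces and point NCCL at one explicitly; a sketch, with <device> being whatever your box reports:
$ ip -br addr   # or: ifconfig -a
NCCL_SOCKET_IFNAME=<device> NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 4 --nnodes 1 torch-distributed-gpu-test.py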
If not, does it work if you use just the first 2 or the last 2 GPUs? Try the 1st pair:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
then the 2nd pair:
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
If not, attach to each process with sudo py-spy dump -n -p PID (after pip install py-spy) and share the tracebacks - one is enough if they are the same. PID is the process id of the hanging python processes.
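A quick way to grab them all in one go might be something like this (assuming the script name appears on the workers' command line):
for pid in $(pgrep -f torch-distributed-gpu-test.py); do sudo py-spy dump -n -p "$pid"; done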
from ml-engineering.
Pure gold! Thank you so much for the insight. I don't think it's a firewall/networking issue since this machine is on my desk, and I'm logged into it directly. I see some page faults in /var/log/syslog, but I don't know if that's bad or not. py-spy shows something about sleeping and libc-2.31.so at the top of the stack. Could this be the problem?
I get the same result every time: Each GPU being tested hangs, using one third of its total power and ~2GB of VRAM indefinitely.
$ CUDA_VISIBLE_DEVICES=0,1 NCCL_SOCKET_IFNAME=lo NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
DeepWhite:6484:6484 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
DeepWhite:6484:6484 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
DeepWhite:6484:6484 [0] NCCL INFO NET/IB : No device found.
DeepWhite:6484:6484 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
DeepWhite:6484:6484 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
DeepWhite:6485:6485 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
DeepWhite:6485:6485 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
DeepWhite:6485:6485 [1] NCCL INFO NET/IB : No device found.
DeepWhite:6485:6485 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
DeepWhite:6485:6485 [1] NCCL INFO Using network Socket
DeepWhite:6484:6516 [0] NCCL INFO Channel 00/02 : 0 1
DeepWhite:6485:6517 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
DeepWhite:6484:6516 [0] NCCL INFO Channel 01/02 : 0 1
DeepWhite:6484:6516 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
DeepWhite:6484:6516 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff
DeepWhite:6484:6516 [0] NCCL INFO Channel 00 : 0[3000] -> 1[21000] via P2P/IPC
DeepWhite:6485:6517 [1] NCCL INFO Channel 00 : 1[21000] -> 0[3000] via P2P/IPC
DeepWhite:6484:6516 [0] NCCL INFO Channel 01 : 0[3000] -> 1[21000] via P2P/IPC
DeepWhite:6485:6517 [1] NCCL INFO Channel 01 : 1[21000] -> 0[3000] via P2P/IPC
DeepWhite:6485:6517 [1] NCCL INFO Connected all rings
DeepWhite:6485:6517 [1] NCCL INFO Connected all trees
DeepWhite:6485:6517 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
DeepWhite:6485:6517 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
DeepWhite:6484:6516 [0] NCCL INFO Connected all rings
DeepWhite:6484:6516 [0] NCCL INFO Connected all trees
DeepWhite:6484:6516 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
DeepWhite:6484:6516 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
DeepWhite:6485:6517 [1] NCCL INFO comm 0x7fcec4002fb0 rank 1 nranks 2 cudaDev 1 busId 21000 - Init COMPLETE
DeepWhite:6484:6516 [0] NCCL INFO comm 0x7f5a98002fb0 rank 0 nranks 2 cudaDev 0 busId 3000 - Init COMPLETE
DeepWhite:6484:6484 [0] NCCL INFO Launch mode Parallel
$ tail /var/log/syslog
Mar 23 10:57:11 deepwhite kernel: [ 1332.090989] nvidia 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001f address=0x19740001000 flags=0x0030]
Mar 23 10:57:11 deepwhite kernel: [ 1332.090997] nvidia 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001f address=0x19740000000 flags=0x0030]
Mar 23 10:57:11 deepwhite kernel: [ 1332.091001] nvidia 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001f address=0xd2139068 flags=0x0020]
Mar 23 10:57:12 deepwhite kernel: [ 1332.223860] nvidia 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001f address=0xd2139070 flags=0x0020]
$ py-spy dump -n -p 14088
Process 14088: python3 -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
Python v3.9.7 (/home/matt/miniconda3/envs/nlp3.9/bin/python3.9)
Thread 14088 (idle): "MainThread"
select (libc-2.31.so)
time_sleep (python3.9)
_invoke_run (torch/distributed/elastic/agent/server/api.py:850)
run (torch/distributed/elastic/agent/server/api.py:709)
wrapper (torch/distributed/elastic/metrics/api.py:125)
launch_agent (torch/distributed/launcher/api.py:236)
__call__ (torch/distributed/launcher/api.py:131)
run (torch/distributed/run.py:715)
main (torch/distributed/run.py:724)
wrapper (torch/distributed/elastic/multiprocessing/errors/__init__.py:345)
<module> (torch/distributed/run.py:728)
_run_code (runpy.py:87)
_run_module_as_main (runpy.py:197)
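Side note for later readers: PID 14088 dumped above is the torch.distributed.run launcher itself, which is expected to sit in a select/sleep loop while it supervises the workers, so the more telling tracebacks are usually its child worker processes. A hedged way to find and dump those instead:
$ pgrep -P 14088                        # list the worker PIDs spawned by the launcher
$ sudo py-spy dump -n -p <worker-pid>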
from ml-engineering.
I also tried this CUDA bandwidthTest from Nvidia, and it passed. BTW, I have the fourth GPU unplugged for now, just because this Threadripper box needs a dedicated 20A power outlet to run on all cylinders.
/usr/local/cuda/samples/cuda-samples/Samples/1_Utilities/bandwidthTest$ ./bandwidthTest --device=all --mode=shmoo
[CUDA Bandwidth Test] - Starting...
!!!!!Cumulative Bandwidth to be computed from all the devices !!!!!!
Running on...
Device 0: NVIDIA RTX A6000
Device 1: NVIDIA RTX A6000
Device 2: NVIDIA RTX A6000
Shmoo Mode
...................................................................................................................................................................................................................................................
Host to Device Bandwidth, 3 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
1000 1.4
2000 2.7
3000 4.1
4000 5.4
5000 6.7
6000 8.0
7000 9.2
8000 10.6
9000 11.7
10000 12.5
11000 13.3
12000 14.3
13000 15.2
14000 16.1
15000 15.3
16000 17.2
17000 17.9
18000 18.9
19000 19.3
20000 20.2
22000 21.2
24000 22.2
26000 23.8
28000 24.3
30000 25.3
32000 26.3
34000 26.9
36000 27.7
38000 27.9
40000 28.3
42000 29.5
44000 30.6
46000 30.7
48000 31.3
50000 31.5
60000 33.9
70000 36.0
80000 36.9
90000 38.1
100000 39.1
200000 44.4
300000 46.4
400000 52.0
500000 54.4
600000 55.2
700000 56.2
800000 59.1
900000 59.5
1000000 59.9
2000000 62.8
3000000 63.6
4000000 64.3
5000000 65.1
6000000 65.5
7000000 65.7
8000000 65.6
9000000 65.8
10000000 66.1
11000000 66.0
12000000 66.3
13000000 66.2
14000000 66.3
15000000 66.2
16000000 66.4
18000000 66.3
20000000 66.4
22000000 66.4
24000000 66.5
26000000 66.6
28000000 66.6
30000000 66.6
32000000 66.6
36000000 66.6
40000000 66.6
44000000 66.6
48000000 66.7
52000000 66.7
56000000 66.8
60000000 66.7
64000000 66.8
68000000 66.7
...................................................................................................................................................................................................................................................
Device to Host Bandwidth, 3 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
1000 1.4
2000 2.7
3000 4.4
4000 5.8
5000 7.0
6000 8.7
7000 10.1
8000 11.6
9000 12.7
10000 14.5
11000 16.0
12000 17.2
13000 18.7
14000 20.1
15000 21.9
16000 23.1
17000 24.1
18000 24.2
19000 26.5
20000 27.9
22000 33.7
24000 34.9
26000 35.7
28000 36.4
30000 38.9
32000 40.0
34000 40.4
36000 42.0
38000 42.4
40000 42.8
42000 42.8
44000 44.3
46000 45.8
48000 46.1
50000 46.6
60000 49.4
70000 51.1
80000 52.7
90000 53.5
100000 54.5
200000 60.7
300000 62.9
400000 64.1
500000 64.3
600000 64.5
700000 65.5
800000 65.8
900000 65.1
1000000 65.4
2000000 66.5
3000000 66.9
4000000 67.0
5000000 67.1
6000000 66.9
7000000 67.0
8000000 67.0
9000000 67.1
10000000 67.0
11000000 67.0
12000000 66.9
13000000 66.7
14000000 66.7
15000000 66.7
16000000 66.8
18000000 66.7
20000000 66.6
22000000 66.6
24000000 66.6
26000000 66.5
28000000 66.7
30000000 66.5
32000000 66.6
36000000 62.6
40000000 60.7
44000000 60.5
48000000 60.7
52000000 60.5
56000000 60.4
60000000 60.7
64000000 60.9
68000000 60.6
...................................................................................................................................................................................................................................................
Device to Device Bandwidth, 3 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
1000 2.8
2000 4.0
3000 6.3
4000 8.4
5000 10.5
6000 12.6
7000 14.5
8000 16.9
9000 18.7
10000 20.9
11000 23.0
12000 25.4
13000 27.2
14000 29.7
15000 31.4
16000 33.7
17000 35.7
18000 37.9
19000 39.7
20000 42.1
22000 46.4
24000 50.8
26000 55.1
28000 59.4
30000 63.5
32000 67.7
34000 72.2
36000 76.6
38000 81.3
40000 85.2
42000 86.7
44000 94.5
46000 98.4
48000 102.3
50000 107.7
60000 128.0
70000 151.1
80000 172.9
90000 194.8
100000 216.8
200000 441.6
300000 678.2
400000 933.0
500000 1200.5
600000 1477.6
700000 1736.5
800000 1946.4
900000 2108.5
1000000 2287.1
2000000 2577.7
3000000 2586.3
4000000 1814.6
5000000 1575.0
6000000 1606.6
7000000 1595.9
8000000 1637.1
9000000 1675.1
10000000 1700.9
11000000 1754.5
12000000 1767.3
13000000 1784.8
14000000 1798.7
15000000 1805.3
16000000 1825.7
18000000 1858.4
20000000 1862.0
22000000 1878.9
24000000 1891.2
26000000 1907.0
28000000 1916.1
30000000 1916.8
32000000 1927.9
36000000 1941.4
40000000 1953.7
44000000 1959.2
48000000 1968.8
52000000 1975.5
56000000 1974.6
60000000 1984.2
64000000 2007.7
68000000 1990.8
Result = PASS
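One more cheap check that pairs well with these bandwidth numbers, if you haven't run it already: the PCIe/NUMA topology as the driver sees it (assuming a reasonably recent driver):
$ nvidia-smi topo -m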
from ml-engineering.
/usr/local/cuda/samples/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A6000, pciBusID: 21, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA RTX A6000, pciBusID: 49, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1 2
0 1 1 1
1 1 1 1
2 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 673.20 13.15 13.15
1 12.99 673.20 22.20
2 13.00 21.98 673.78
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2
0 673.49 2.60 1.58
1 2.12 672.33 1.70
2 2.12 1.60 673.78
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 677.73 18.01 18.02
1 19.25 678.32 27.11
2 19.25 28.02 678.43
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 678.91 5.21 5.21
1 3.30 678.46 5.56
2 3.49 3.21 677.73
P2P=Disabled Latency Matrix (us)
GPU 0 1 2
0 1.57 11.57 12.11
1 11.44 1.62 11.53
2 16.89 11.80 1.57
CPU 0 1 2
0 2.68 8.79 8.31
1 8.44 2.64 8.24
2 8.99 8.29 2.64
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2
0 1.62 49299.65 49299.60
1 49299.91 1.57 49299.87
2 49299.74 49299.72 1.64
CPU 0 1 2
0 2.73 2.18 3.01
1 2.28 2.81 2.21
2 3.51 2.43 2.77
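Those P2P=Enabled numbers are the smoking gun, by the way: with peer-to-peer on, the off-diagonal bandwidth collapses from ~13 GB/s to ~2 GB/s and the latencies blow up to ~49 ms, which is the classic signature of PCIe P2P transfers fighting a misbehaving IOMMU. One quick sanity check (a sketch, standard sysfs path):
$ ls /sys/class/iommu/   # non-empty output means the IOMMU is active
And if flipping the BIOS/GRUB setting isn't an option, NCCL_P2P_DISABLE=1 is a commonly used (slower) workaround that routes NCCL traffic through host memory instead of PCIe P2P.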
from ml-engineering.
Oh, look at this! Same page fault messages. Sounds like this might help. God, I hope I don't break GRUB. It was a nightmare getting this RAID-0 array set up. Stay tuned.
from ml-engineering.
Oh, wow! That's some awesome diagnostics you have performed - absolutely awesome, @mhillebrand! Glad to hear you got it working!
So the key to unravelling this problem was noticing a page fault in syslog:
$ tail /var/log/syslog
Mar 23 10:57:11 deepwhite kernel: [ 1332.090989] nvidia 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001f address=0x19740001000 flags=0x0030]
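For anyone debugging a similar hang later: an easy way to watch for these events live while the test runs is something along these lines:
$ sudo dmesg -wH | grep -i AMD-Vi   # follow kernel messages as they arrive
$ journalctl -kf                    # or, on systemd machines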
We probably should start compiling all the different causes somewhere so others will have it easier.
Glad you resolved it!
from ml-engineering.
@jeffra, tagging you on this one as FYI, since some users are likely to run into this with DeepSpeed.
And this is not the first problem with AMD and multi-GPU setups I have seen.
from ml-engineering.