Comments (10)
BINGO! Disabling IOMMU did the trick!
$ CUDA_VISIBLE_DEVICES=0,1 NCCL_SOCKET_IFNAME=lo NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
DeepWhite:3031:3031 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
DeepWhite:3031:3031 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
DeepWhite:3031:3031 [0] NCCL INFO NET/IB : No device found.
DeepWhite:3031:3031 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
DeepWhite:3031:3031 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
DeepWhite:3032:3032 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
DeepWhite:3032:3032 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
DeepWhite:3032:3032 [1] NCCL INFO NET/IB : No device found.
DeepWhite:3032:3032 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
DeepWhite:3032:3032 [1] NCCL INFO Using network Socket
DeepWhite:3031:3059 [0] NCCL INFO Channel 00/02 : 0 1
DeepWhite:3032:3060 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
DeepWhite:3031:3059 [0] NCCL INFO Channel 01/02 : 0 1
DeepWhite:3031:3059 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
DeepWhite:3031:3059 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff
DeepWhite:3031:3059 [0] NCCL INFO Channel 00 : 0[3000] -> 1[21000] via P2P/IPC
DeepWhite:3032:3060 [1] NCCL INFO Channel 00 : 1[21000] -> 0[3000] via P2P/IPC
DeepWhite:3031:3059 [0] NCCL INFO Channel 01 : 0[3000] -> 1[21000] via P2P/IPC
DeepWhite:3032:3060 [1] NCCL INFO Channel 01 : 1[21000] -> 0[3000] via P2P/IPC
DeepWhite:3032:3060 [1] NCCL INFO Connected all rings
DeepWhite:3032:3060 [1] NCCL INFO Connected all trees
DeepWhite:3032:3060 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
DeepWhite:3032:3060 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
DeepWhite:3031:3059 [0] NCCL INFO Connected all rings
DeepWhite:3031:3059 [0] NCCL INFO Connected all trees
DeepWhite:3031:3059 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
DeepWhite:3031:3059 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
DeepWhite:3032:3060 [1] NCCL INFO comm 0x7fdb7c002fb0 rank 1 nranks 2 cudaDev 1 busId 21000 - Init COMPLETE
DeepWhite:3031:3059 [0] NCCL INFO comm 0x7fcbac002fb0 rank 0 nranks 2 cudaDev 0 busId 3000 - Init COMPLETE
DeepWhite:3031:3031 [0] NCCL INFO Launch mode Parallel
[DeepWhite-1] is OK (global rank: 1/2)
[DeepWhite-0] is OK (global rank: 0/2)
pt=1.11.0+cu113, cuda=11.3, nccl=(2, 10, 3)
device compute capabilities=(8, 6)
pytorch compute capabilities=['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
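For later readers: those last three lines are the test script's own report, but the same info can be pulled straight from PyTorch with a one-liner. A minimal sketch, assuming at least one visible CUDA device:
$ python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version()); print(torch.cuda.get_device_capability(0)); print(torch.cuda.get_arch_list())"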
from ml-engineering.
So the key to unravelling this problem was noticing a page fault in syslog:
$ tail /var/log/syslog
Mar 23 10:57:11 deepwhite kernel: [ 1332.090989] nvidia 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001f address=0x19740001000 flags=0x0030]
Yes, that is correct. Thanks again for all your help!
from ml-engineering.
Oh, duh. You can also disable IOMMU in the BIOS. That's preferable to fiddling with GRUB, methinks.
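For anyone who does go the GRUB route instead, the usual recipe is roughly the following (a sketch, assuming a Debian/Ubuntu-style setup; the BIOS toggle achieves the same end result):
# in /etc/default/grub, add amd_iommu=off (or iommu=pt as a softer variant) to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off"
$ sudo update-grub && sudo reboot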
from ml-engineering.
Is your 192.168.50.21 firewalled? Or is it somehow a misconfigured network device?
Does it work if you use the loopback device 127.0.0.1?
NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=lo python -m torch.distributed.run --nproc_per_node 4 --nnodes 1 torch-distributed-gpu-test.py
If not, see what other local network devices you have via ifconfig, if any, and try one of those instead of lo. It's currently using enp67s0 in your case.
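i.e. list the candidate interfaces and point NCCL at one explicitly; a sketch, with <device> being whatever your box reports:
$ ip -br addr   # or: ifconfig -a
NCCL_SOCKET_IFNAME=<device> NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 4 --nnodes 1 torch-distributed-gpu-test.py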
If not, does it work if you use just the first 2 or the last 2 GPUs? Try the 1st pair:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
then the 2nd pair:
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
If not, attach to each process with sudo py-spy dump -n -p PID (after pip install py-spy) and share the tracebacks - one is enough if they are the same. PID is the process id of the hanging python processes.
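A quick way to grab them all in one go might be something like this (assuming the script name appears on the workers' command line):
for pid in $(pgrep -f torch-distributed-gpu-test.py); do sudo py-spy dump -n -p "$pid"; done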
from ml-engineering.
Pure gold! Thank you so much for the insight. I don't think it's a firewall/networking issue since this machine is on my desk, and I'm logged into it directly. I see some page faults in /var/log/syslog, but I don't know if that's bad or not. py-spy shows something about sleeping and libc-2.31.so at the top of the stack. Could this be the problem?
I get the same result every time: Each GPU being tested hangs, using one third of its total power and ~2GB of VRAM indefinitely.
$ CUDA_VISIBLE_DEVICES=0,1 NCCL_SOCKET_IFNAME=lo NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
DeepWhite:6484:6484 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
DeepWhite:6484:6484 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
DeepWhite:6484:6484 [0] NCCL INFO NET/IB : No device found.
DeepWhite:6484:6484 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
DeepWhite:6484:6484 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
DeepWhite:6485:6485 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
DeepWhite:6485:6485 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
DeepWhite:6485:6485 [1] NCCL INFO NET/IB : No device found.
DeepWhite:6485:6485 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
DeepWhite:6485:6485 [1] NCCL INFO Using network Socket
DeepWhite:6484:6516 [0] NCCL INFO Channel 00/02 : 0 1
DeepWhite:6485:6517 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
DeepWhite:6484:6516 [0] NCCL INFO Channel 01/02 : 0 1
DeepWhite:6484:6516 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
DeepWhite:6484:6516 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff
DeepWhite:6484:6516 [0] NCCL INFO Channel 00 : 0[3000] -> 1[21000] via P2P/IPC
DeepWhite:6485:6517 [1] NCCL INFO Channel 00 : 1[21000] -> 0[3000] via P2P/IPC
DeepWhite:6484:6516 [0] NCCL INFO Channel 01 : 0[3000] -> 1[21000] via P2P/IPC
DeepWhite:6485:6517 [1] NCCL INFO Channel 01 : 1[21000] -> 0[3000] via P2P/IPC
DeepWhite:6485:6517 [1] NCCL INFO Connected all rings
DeepWhite:6485:6517 [1] NCCL INFO Connected all trees
DeepWhite:6485:6517 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
DeepWhite:6485:6517 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
DeepWhite:6484:6516 [0] NCCL INFO Connected all rings
DeepWhite:6484:6516 [0] NCCL INFO Connected all trees
DeepWhite:6484:6516 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
DeepWhite:6484:6516 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
DeepWhite:6485:6517 [1] NCCL INFO comm 0x7fcec4002fb0 rank 1 nranks 2 cudaDev 1 busId 21000 - Init COMPLETE
DeepWhite:6484:6516 [0] NCCL INFO comm 0x7f5a98002fb0 rank 0 nranks 2 cudaDev 0 busId 3000 - Init COMPLETE
DeepWhite:6484:6484 [0] NCCL INFO Launch mode Parallel
$ tail /var/log/syslog
Mar 23 10:57:11 deepwhite kernel: [ 1332.090989] nvidia 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001f address=0x19740001000 flags=0x0030]
Mar 23 10:57:11 deepwhite kernel: [ 1332.090997] nvidia 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001f address=0x19740000000 flags=0x0030]
Mar 23 10:57:11 deepwhite kernel: [ 1332.091001] nvidia 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001f address=0xd2139068 flags=0x0020]
Mar 23 10:57:12 deepwhite kernel: [ 1332.223860] nvidia 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001f address=0xd2139070 flags=0x0020]
$ py-spy dump -n -p 14088
Process 14088: python3 -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
Python v3.9.7 (/home/matt/miniconda3/envs/nlp3.9/bin/python3.9)
Thread 14088 (idle): "MainThread"
select (libc-2.31.so)
time_sleep (python3.9)
_invoke_run (torch/distributed/elastic/agent/server/api.py:850)
run (torch/distributed/elastic/agent/server/api.py:709)
wrapper (torch/distributed/elastic/metrics/api.py:125)
launch_agent (torch/distributed/launcher/api.py:236)
__call__ (torch/distributed/launcher/api.py:131)
run (torch/distributed/run.py:715)
main (torch/distributed/run.py:724)
wrapper (torch/distributed/elastic/multiprocessing/errors/__init__.py:345)
<module> (torch/distributed/run.py:728)
_run_code (runpy.py:87)
_run_module_as_main (runpy.py:197)
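Side note for later readers: PID 14088 dumped above is the torch.distributed.run launcher itself, which is expected to sit in a select/sleep loop while it supervises the workers, so the more telling tracebacks are usually its child worker processes. A hedged way to find and dump those instead:
$ pgrep -P 14088                        # list the worker PIDs spawned by the launcher
$ sudo py-spy dump -n -p <worker-pid>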
from ml-engineering.
I also tried this CUDA bandwidthTest from Nvidia, and it passed. BTW, I have the fourth GPU unplugged for now, just because this Threadripper box needs a dedicated 20A power outlet to run on all cylinders.
/usr/local/cuda/samples/cuda-samples/Samples/1_Utilities/bandwidthTest$ ./bandwidthTest --device=all --mode=shmoo
[CUDA Bandwidth Test] - Starting...
!!!!!Cumulative Bandwidth to be computed from all the devices !!!!!!
Running on...
Device 0: NVIDIA RTX A6000
Device 1: NVIDIA RTX A6000
Device 2: NVIDIA RTX A6000
Shmoo Mode
...................................................................................................................................................................................................................................................
Host to Device Bandwidth, 3 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
1000 1.4
2000 2.7
3000 4.1
4000 5.4
5000 6.7
6000 8.0
7000 9.2
8000 10.6
9000 11.7
10000 12.5
11000 13.3
12000 14.3
13000 15.2
14000 16.1
15000 15.3
16000 17.2
17000 17.9
18000 18.9
19000 19.3
20000 20.2
22000 21.2
24000 22.2
26000 23.8
28000 24.3
30000 25.3
32000 26.3
34000 26.9
36000 27.7
38000 27.9
40000 28.3
42000 29.5
44000 30.6
46000 30.7
48000 31.3
50000 31.5
60000 33.9
70000 36.0
80000 36.9
90000 38.1
100000 39.1
200000 44.4
300000 46.4
400000 52.0
500000 54.4
600000 55.2
700000 56.2
800000 59.1
900000 59.5
1000000 59.9
2000000 62.8
3000000 63.6
4000000 64.3
5000000 65.1
6000000 65.5
7000000 65.7
8000000 65.6
9000000 65.8
10000000 66.1
11000000 66.0
12000000 66.3
13000000 66.2
14000000 66.3
15000000 66.2
16000000 66.4
18000000 66.3
20000000 66.4
22000000 66.4
24000000 66.5
26000000 66.6
28000000 66.6
30000000 66.6
32000000 66.6
36000000 66.6
40000000 66.6
44000000 66.6
48000000 66.7
52000000 66.7
56000000 66.8
60000000 66.7
64000000 66.8
68000000 66.7
...................................................................................................................................................................................................................................................
Device to Host Bandwidth, 3 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
1000 1.4
2000 2.7
3000 4.4
4000 5.8
5000 7.0
6000 8.7
7000 10.1
8000 11.6
9000 12.7
10000 14.5
11000 16.0
12000 17.2
13000 18.7
14000 20.1
15000 21.9
16000 23.1
17000 24.1
18000 24.2
19000 26.5
20000 27.9
22000 33.7
24000 34.9
26000 35.7
28000 36.4
30000 38.9
32000 40.0
34000 40.4
36000 42.0
38000 42.4
40000 42.8
42000 42.8
44000 44.3
46000 45.8
48000 46.1
50000 46.6
60000 49.4
70000 51.1
80000 52.7
90000 53.5
100000 54.5
200000 60.7
300000 62.9
400000 64.1
500000 64.3
600000 64.5
700000 65.5
800000 65.8
900000 65.1
1000000 65.4
2000000 66.5
3000000 66.9
4000000 67.0
5000000 67.1
6000000 66.9
7000000 67.0
8000000 67.0
9000000 67.1
10000000 67.0
11000000 67.0
12000000 66.9
13000000 66.7
14000000 66.7
15000000 66.7
16000000 66.8
18000000 66.7
20000000 66.6
22000000 66.6
24000000 66.6
26000000 66.5
28000000 66.7
30000000 66.5
32000000 66.6
36000000 62.6
40000000 60.7
44000000 60.5
48000000 60.7
52000000 60.5
56000000 60.4
60000000 60.7
64000000 60.9
68000000 60.6
...................................................................................................................................................................................................................................................
Device to Device Bandwidth, 3 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
1000 2.8
2000 4.0
3000 6.3
4000 8.4
5000 10.5
6000 12.6
7000 14.5
8000 16.9
9000 18.7
10000 20.9
11000 23.0
12000 25.4
13000 27.2
14000 29.7
15000 31.4
16000 33.7
17000 35.7
18000 37.9
19000 39.7
20000 42.1
22000 46.4
24000 50.8
26000 55.1
28000 59.4
30000 63.5
32000 67.7
34000 72.2
36000 76.6
38000 81.3
40000 85.2
42000 86.7
44000 94.5
46000 98.4
48000 102.3
50000 107.7
60000 128.0
70000 151.1
80000 172.9
90000 194.8
100000 216.8
200000 441.6
300000 678.2
400000 933.0
500000 1200.5
600000 1477.6
700000 1736.5
800000 1946.4
900000 2108.5
1000000 2287.1
2000000 2577.7
3000000 2586.3
4000000 1814.6
5000000 1575.0
6000000 1606.6
7000000 1595.9
8000000 1637.1
9000000 1675.1
10000000 1700.9
11000000 1754.5
12000000 1767.3
13000000 1784.8
14000000 1798.7
15000000 1805.3
16000000 1825.7
18000000 1858.4
20000000 1862.0
22000000 1878.9
24000000 1891.2
26000000 1907.0
28000000 1916.1
30000000 1916.8
32000000 1927.9
36000000 1941.4
40000000 1953.7
44000000 1959.2
48000000 1968.8
52000000 1975.5
56000000 1974.6
60000000 1984.2
64000000 2007.7
68000000 1990.8
Result = PASS
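One more cheap check that pairs well with these bandwidth numbers, if you haven't run it already: the PCIe/NUMA topology as the driver sees it (assuming a reasonably recent driver):
$ nvidia-smi topo -m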
from ml-engineering.
/usr/local/cuda/samples/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A6000, pciBusID: 21, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA RTX A6000, pciBusID: 49, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1 2
0 1 1 1
1 1 1 1
2 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 673.20 13.15 13.15
1 12.99 673.20 22.20
2 13.00 21.98 673.78
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2
0 673.49 2.60 1.58
1 2.12 672.33 1.70
2 2.12 1.60 673.78
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 677.73 18.01 18.02
1 19.25 678.32 27.11
2 19.25 28.02 678.43
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 678.91 5.21 5.21
1 3.30 678.46 5.56
2 3.49 3.21 677.73
P2P=Disabled Latency Matrix (us)
GPU 0 1 2
0 1.57 11.57 12.11
1 11.44 1.62 11.53
2 16.89 11.80 1.57
CPU 0 1 2
0 2.68 8.79 8.31
1 8.44 2.64 8.24
2 8.99 8.29 2.64
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2
0 1.62 49299.65 49299.60
1 49299.91 1.57 49299.87
2 49299.74 49299.72 1.64
CPU 0 1 2
0 2.73 2.18 3.01
1 2.28 2.81 2.21
2 3.51 2.43 2.77
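Those P2P=Enabled numbers are the smoking gun, by the way: with peer-to-peer on, the off-diagonal bandwidth collapses from ~13 GB/s to ~2 GB/s and the latencies blow up to ~49 ms, which is the classic signature of PCIe P2P transfers fighting a misbehaving IOMMU. One quick sanity check (a sketch, standard sysfs path):
$ ls /sys/class/iommu/   # non-empty output means the IOMMU is active
And if flipping the BIOS/GRUB setting isn't an option, NCCL_P2P_DISABLE=1 is a commonly used (slower) workaround that routes NCCL traffic through host memory instead of PCIe P2P.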
from ml-engineering.
Oh, look at this! Same page fault messages. Sounds like this might help. God, I hope I don't break GRUB. It was a nightmare getting this RAID-0 array set up. Stay tuned.
from ml-engineering.
Oh, wow! That's some awesome diagnostics you have performed - absolutely awesome, @mhillebrand! Glad to hear you got it working!
So the key to unravelling this problem was noticing a page fault in syslog:
$ tail /var/log/syslog
Mar 23 10:57:11 deepwhite kernel: [ 1332.090989] nvidia 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001f address=0x19740001000 flags=0x0030]
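For anyone debugging a similar hang later: an easy way to watch for these events live while the test runs is something along these lines:
$ sudo dmesg -wH | grep -i AMD-Vi   # follow kernel messages as they arrive
$ journalctl -kf                    # or, on systemd machines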
We probably should start compiling all the different causes somewhere so others will have it easier.
Glad you resolved it!
from ml-engineering.
@jeffra, tagging you on this one as FYI, since some users are likely to run into this with DeepSpeed.
And this is not the first problem with AMD and multi-GPU setups I have seen.
from ml-engineering.