I'm able to start training successfully on a single-node, single-GPU setup, but it fails when I increase the number of GPUs.
For example, on an A100 with 2 GPUs, if I run the following with deepspeed enabled:
CUDA_VISIBLE_DEVICES='0,1' python -m torch.distributed.launch --nproc_per_node=2 --master_port=5566 main_pretrain_yaml.py --config _args/args_pretrain.json
I can see that both GPUs (ranks 0 and 1) seemingly initialize distributed training, but while rank 0 continues to run as expected, rank 1 becomes unresponsive. Furthermore, it appears that only one process actually starts on the CPU and is pinned to one of the GPUs.
Here's a snippet from the logs:
INFO - __main__ - Init distributed training on local rank 0
INFO - __main__ - Init distributed training on local rank 1
INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 1
INFO - torch.distributed.distributed_c10d - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 0
INFO - torch.distributed.distributed_c10d - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
...
[INFO] [comm.py:594:init_distributed] cdb=None
INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:2 to store for rank: 0
INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:2 (world_size=2, worker_count=1, timeout=0:30:00)
INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:2 (world_size=2, worker_count=1, timeout=0:30:00)
INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:2 (world_size=2, worker_count=1, timeout=0:30:00)
...
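To check whether the hang is in process-group creation itself rather than in my training code, I believe a minimal standalone script along these lines (the file name check_pg.py and everything in it are illustrative, not part of my code base) should complete on both ranks when launched the same way:

# check_pg.py - minimal process-group sanity check, separate from the training code
import argparse
import os
import torch
import torch.distributed as dist

def main():
    # torch.distributed.launch passes --local_rank; torchrun sets the LOCAL_RANK env var instead
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=int(os.environ.get("LOCAL_RANK", 0)))
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)      # pin this process to its own GPU
    dist.init_process_group(backend="nccl")     # MASTER_ADDR/PORT, RANK, WORLD_SIZE come from the launcher
    dist.barrier()                              # both ranks must reach this point, otherwise it hangs
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} reached the barrier")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

launched with:
CUDA_VISIBLE_DEVICES='0,1' python -m torch.distributed.launch --nproc_per_node=2 --master_port=5566 check_pg.py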
This issue only arises when I attempt to distribute the computational workload across multiple data files (cc3m/webvid2.5m_train_0.caption.tsv through cc3m/webvid2.5m_train_9.caption.tsv); it does not occur with a single file (cc3m/webvid2.5m_train_0.caption.tsv). So it seems the problem may be in the CPU-side data loading/handling of the files. I have tried increasing the number of workers without success.
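To make the multi-file case concrete, the split amounts to each rank working from a subset of these caption shards, roughly along these lines (illustrative only; the helper function is made up and the actual dataset code may distribute the shards differently):

import torch.distributed as dist

# The ten caption shards involved in the failing configuration.
shards = [f"cc3m/webvid2.5m_train_{i}.caption.tsv" for i in range(10)]

def shards_for_rank():
    # Round-robin split of the shards across ranks (hypothetical helper).
    rank, world_size = dist.get_rank(), dist.get_world_size()
    return shards[rank::world_size]   # e.g. rank 0 -> shards 0, 2, 4, ..., rank 1 -> 1, 3, 5, ...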
Note that the hang occurs in the code at the call to
self.model, self.optzr, _, _ = deepspeed.initialize(config_params=config, model=self.model, optimizer=self.optzr, lr_scheduler=self.lr_scheduler)
And similarly, when DeepSpeed is not enabled, at
self.model = T.nn.parallel.DistributedDataParallel(self.model, device_ids=[get_local_rank()], output_device=get_local_rank(), find_unused_parameters=True)
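For reference, my understanding of the initialization order around these two calls is roughly the following (a simplified sketch with a toy model, an illustrative config, and an illustrative use_deepspeed flag, not my actual trainer code):

import os
import torch as T
import torch.distributed as dist
import deepspeed

local_rank = int(os.environ.get("LOCAL_RANK", 0))
T.cuda.set_device(local_rank)                 # one GPU per process, set before any collective call
dist.init_process_group(backend="nccl")       # deepspeed.init_distributed() would also set this up

model = T.nn.Linear(8, 8).cuda()              # toy model standing in for self.model
optimizer = T.optim.AdamW(model.parameters())

use_deepspeed = True
if use_deepspeed:
    ds_config = {"train_micro_batch_size_per_gpu": 1}   # minimal illustrative config
    model, optimizer, _, _ = deepspeed.initialize(
        config_params=ds_config, model=model, optimizer=optimizer)
else:
    model = T.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank,
        find_unused_parameters=True)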
Please help, thanks!