Comments (34)

vlad-oles avatar vlad-oles commented on August 25, 2024 1

Just created the pull request.

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

In GitLab by @RandomDefaultUser on Feb 16, 2021, 12:15

Also: we shouldn't just ramp up the batch size to gain performance; Horovod should yield a speedup for smaller batch sizes as well.
https://stats.stackexchange.com/questions/164876/what-is-the-trade-off-between-batch-size-and-number-of-iterations-to-train-a-neu

By Fiedler, Lenz (FWU) - 146409 on 2021-02-16T12:15:22 (imported from GitLab)

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

In GitLab by @RandomDefaultUser on Feb 15, 2021, 21:31

changed due date to February 19, 2021

By Fiedler, Lenz (FWU) - 146409 on 2021-02-15T21:31:35 (imported from GitLab)

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

In GitLab by @RandomDefaultUser on Feb 15, 2021, 21:32

changed the description

By Fiedler, Lenz (FWU) - 146409 on 2021-02-15T21:32:22 (imported from GitLab)

ellisja avatar ellisja commented on August 25, 2024

Hi Lenz, can you reattach your figure from the first comment? I believe it is missing. I'd be happy to look into it and try it out on our machine here to see whether it is a Horovod installation issue like you mentioned, or something else. Thanks!

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

Gladly! This is the original image.

[image attachment]

I would be very interested to see if you experience a similar behavior. I can also send you a run script to test this.

ellisja avatar ellisja commented on August 25, 2024

Yea, can you attach the run script you used? I'll see if I can reproduce. Thanks!

ellisja avatar ellisja commented on August 25, 2024

Still running and digging, but I believe this is causing the issue:
Network sent to the same GPU

If you run on a node with >1 GPU, I believe this call (self.to('cuda')) defaults to device cuda:0(?), but what you want is for rank 0 to send to cuda:0 and rank 1 to send to cuda:1. This would also explain the 2x slowdown you are seeing.

I believe something like this would solve the issue:

import torch
import horovod.torch as hvd

if params.use_cuda:
    # Pin this rank to its own GPU so each Horovod rank uses a different device
    torch.cuda.set_device(hvd.local_rank())
...
# Build Model
...
if params.use_cuda:
    # .cuda() now moves the model to the device selected above
    model.cuda()

Again, this is just my hunch for now.

ellisja avatar ellisja commented on August 25, 2024

self.to('cuda:%d' % hvd.local_rank()) should also work, for a single line fix.

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

Oh, that makes a lot of sense! That would definitely explain the slowdown when using more than one GPU per node. I think I have also seen erroneous behavior when using one GPU per node but multiple nodes; @OmarHexa and I are currently in the process of figuring that out. Thanks for the insight!

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

I just looked and we do have this line in the instantiation of the trainer:

torch.cuda.set_device(hvd.local_rank())

Shouldn't this do the trick?

ellisja avatar ellisja commented on August 25, 2024

If I had to guess, I would say that it would not affect the self.to('cuda') call, but would affect a call like mymodel.cuda() (or self.cuda()). If you can access the nodes during the run and run watch nvidia-smi, it will show you exactly which GPUs are busy and which are not.
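For what it's worth, a minimal way to verify this from inside the run (just a sketch; it assumes Horovod has already been initialized and that the built network lives in a variable called model) is to print the device that each rank's parameters actually ended up on:

import torch
import horovod.torch as hvd

# Assumes hvd.init() has already been called and 'model' is the built network.
model_device = next(model.parameters()).device
print("rank", hvd.rank(), "local_rank", hvd.local_rank(),
      "model on", model_device,
      "current CUDA device", torch.cuda.current_device())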

ellisja avatar ellisja commented on August 25, 2024

Our machine is down for the day, but I was able to find this:
https://discuss.pytorch.org/t/difference-between-cuda-0-vs-cuda-with-1-gpu/93080/4
Before, I had two cases running on Summit:

1 GPU: ~360 s epoch time
2 GPUs: ~360 s epoch time

which is still not correct, but different from what you were seeing.

Also, I saw that there are a few other calls to .to('cuda') in the training loops when moving samples to the device. I would have expected those to crash with my change to the model device placement, so that is somewhat strange.

I'll update again when I have more information. If this is the issue, then after we figure everything out I'll work on a PR that creates a parameter like params.gpu_device = "cuda:%d" % hvd.local_rank() or something similar.
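As a rough sketch of that idea (the params.gpu_device attribute is hypothetical at this point, as is the params.use_horovod flag; this is not the final PR code), the device string would be derived once from the local rank and then reused wherever the model or tensors are moved:

import horovod.torch as hvd

# Hypothetical central device string instead of hard-coded 'cuda' calls.
if params.use_cuda and params.use_horovod:
    params.gpu_device = "cuda:%d" % hvd.local_rank()
elif params.use_cuda:
    params.gpu_device = "cuda:0"
else:
    params.gpu_device = "cpu"

model.to(params.gpu_device)            # network placement (instead of self.to('cuda'))
inputs = inputs.to(params.gpu_device)  # sample placement in the training loop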

vlad-oles avatar vlad-oles commented on August 25, 2024

I changed line 62 of network.py from

self.to('cuda')

to

local_rank = hvd.local_rank() if self.use_horovod else 0
self.to(f'cuda:{local_rank}')

and now I'm getting ≈180s epoch time for 2 GPUs as compared to ≈360s epoch time for 1 GPU (on the same cases as Austin).

Also, running on 2 GPUs before the fix would crash MALA for me with the error RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable, which makes sense if both ranks try to move their tensors to the same GPU.

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

That sounds wonderful! I was getting the same crash. Did you push this already? @OmarHexa and I could test it further on our local cluster.

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

> Just created the pull request.

Great!

ellisja avatar ellisja commented on August 25, 2024

Hi @RandomDefaultUser, I think we have everything solved now (#129). Here are 3 figures with our results on Summit:

  • Scaling on 1 node with different GPU counts
  • Scaling over multiple nodes at 6 GPUs/node
  • Runtime with different numbers of PyTorch data workers (i.e. parameters.running.kwargs['num_workers']) at 6 GPUs (1 node)

mala_summit_gpus_vs_epoch_time.pdf
mala_summit_nodes_vs_epoch_time.pdf
mala_summit_dataworkers_vs_epoch_time.pdf

One thing: num_workers affects the data transfer time between CPU and GPU, but it is not usable with the lazy loading feature. You set num_workers = 0 by default. Does num_workers = 1 not work with lazy loading either? Thanks!
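For context, a generic PyTorch sketch of where num_workers and pin_memory enter (not the exact MALA call site; the dataset shapes are made up for illustration): the workers prepare batches in background processes, and pinned memory speeds up the subsequent CPU-to-GPU copies:

import torch
from torch.utils.data import DataLoader, TensorDataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
dataset = TensorDataset(torch.randn(1024, 94), torch.randn(1024, 250))

# num_workers > 0: batches are prepared by background worker processes.
# pin_memory=True: host memory is pinned, making .to(device) transfers faster.
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=1, pin_memory=True)

for inputs, targets in loader:
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ... forward/backward pass ...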

ellisja avatar ellisja commented on August 25, 2024

Please leave comments on the PR if you see anything. We will wait to merge until you can review and give an "okay". Feel free to test it out, especially any small lazy loading cases.

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

Hi @ellisja and @vlad-oles, thanks for all of your great work! The scaling behavior looks really great!
As for num_workers: I think I just set 0 because it was the standard value in the example @OmarHexa and I followed for the implementation. There was no deeper motivation behind it. I will test whether setting it to 1 breaks lazy loading, have a look at the PR, and then get back to you.

OmarHexa avatar OmarHexa commented on August 25, 2024

Hi @vlad-oles @ellisja, I have successfully tested the current Horovod implementation on our cluster (Hemera) with a single node and multiple GPUs. The current implementation seems very promising. However, with a multi-node setting I am getting a CUDA error: invalid device ordinal. I added some lines so that each process prints its device type, device ID, local rank and rank. Strangely, local rank and rank are the same in the 2-node, 1-GPU-per-node setting, which to my understanding should be different in this case. Could you please have a look at the output files to see what might be the problem?
out.txt
slurm.txt

ellisja avatar ellisja commented on August 25, 2024

Hi @OmarHexa, yeah, I see what you mean. It looks like the tasks are being placed on (node 0, gpu:0) and (node 0, gpu:1) instead of (node 0, gpu:0) and (node 1, gpu:0). I'm guessing that the --ntasks-per-node=1 and --gres=gpu:1 flags may not be sufficient for Horovod. Can you try adding export CUDA_VISIBLE_DEVICES=1 before the Horovod call? I think if we can get hvd.local_rank() to return the proper values, then it should work. Our machine is down for the day, but I will try it out tomorrow morning.
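A quick, MALA-independent way to sanity-check the rank assignment is to have every process print its Horovod ranks and what CUDA can actually see (a minimal sketch):

import os
import torch
import horovod.torch as hvd

hvd.init()
# For 2 nodes with 1 GPU each, this should print rank 0/local_rank 0 and
# rank 1/local_rank 0, with each process seeing exactly one GPU.
print("rank", hvd.rank(),
      "local_rank", hvd.local_rank(),
      "size", hvd.size(),
      "CUDA_VISIBLE_DEVICES", os.environ.get("CUDA_VISIBLE_DEVICES"),
      "visible GPUs", torch.cuda.device_count())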

OmarHexa avatar OmarHexa commented on August 25, 2024

Hi @ellisja, I have tried adding export CUDA_VISIBLE_DEVICES=1. In this case, no CUDA devices can be found and the device type is shown to be "cpu". I guess there is no GPU with index 1 on each node, since Slurm only allocates 1 GPU per node and it is indexed as 0 on each node.

OmarHexa avatar OmarHexa commented on August 25, 2024

I have done some benchmarking of the current Horovod implementation for multiple GPUs on a single node and different numbers of workers. The first benchmark consists of two snapshots; the second benchmark consists of eight snapshots with lazy loading on.
The benchmarks below suggest that the real power of Horovod is only unlocked by using more workers. I would suggest making num_workers=1 the default value.
multigpu_test_1
workers_1n1g_test_1
multigpu_test_2
workers_1n2g_test_2

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

I've been working on this a bit. I will have some performance curves soon.
In the meantime, I've been looking at the multinode issue.
Using the command line options from horovod/horovod#2097 and submitting a simple training script on two nodes (one task/gpu each) I was able to produce the following output:

Loading module gcc/7.3.0 
Loading module openmpi/2.1.2 
Test run. 
Test run. 
horovd on this device:  size= 2 global_rank= 1 local_rank= 0 device= Tesla V100-SXM2-32GB
horovd on this device:  size= 2 global_rank= 0 local_rank= 0 device= Tesla V100-SXM2-32GB
Checking the snapshots and your inputs for consistency. 
Checking descriptor file  Al_fp_200x200x200grid_94comps_snapshot0.npy at /net/cns/projects/ml-dft-casus-data/fp/298K/2.699gcc/ 
Checking targets file  Al_ldos_200x200x200grid_250elvls_snapshot0.npy at /net/cns/projects/ml-dft-casus-data/ldos/298K/2.699gcc/ 
Checking descriptor file  Al_fp_200x200x200grid_94comps_snapshot1.npy at /net/cns/projects/ml-dft-casus-data/fp/298K/2.699gcc/ 
Checking targets file  Al_ldos_200x200x200grid_250elvls_snapshot1.npy at /net/cns/projects/ml-dft-casus-data/ldos/298K/2.699gcc/ 
Running MALA without test data. If this is not what you wanted, please revise the input script. 
Consistency check successful. 
Initializing the data scalers. 
Input scaler parametrized. 
Output scaler parametrized. 
Data scalers initialized. 
Build datasets. 
Build dataset done. 
Read data: DONE. 
size= 2 global_rank= 0 local_rank= 0 device= Tesla V100-SXM2-32GB
Rescaling learning rate because multiple workers are used for training. 
[2021-07-12 10:16:53.422797: W /tmp/pip-install-wodhhghv/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Missing ranks:
1: [broadcast.layers.0.bias, broadcast.layers.0.weight, broadcast.layers.2.bias, broadcast.layers.2.weight, broadcast.layers.4.bias, broadcast.layers.4.weight ...]

The first two "horovod on this device" lines are output immediately after Horovod init. Everything here seems fine; both ranks look as intended. However, after "Read data: DONE." apparently only one rank is still doing something. This then causes the subsequent error messages. I have no idea yet what is happening.

The submit script I am using is:

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
#SBATCH --time=24:00:00
#SBATCH --job-name=yh_2n_1g_ram
#SBATCH --gres=gpu:1
#SBATCH --mem-per-cpu=80G
#SBATCH -A casus
#SBATCH -p casus

conda activate pytorch
module load gcc
module load openmpi


mpirun -np 2 -npernode 1 -bind-to none -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python3 train_yh_ram.py

@ellisja : I have not used export CUDA_VISIBLE_DEVICES=0 but will try that one next.

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

Ah, I think I have an idea what might be happening:

[ 0 ]: Running MALA without test data. If this is not what you wanted, please revise the input script. 
[ 0 ]: Consistency check successful. 
[ 0 ]: Initializing the data scalers. 
[ 1 ]: Checking targets file  Al_ldos_200x200x200grid_250elvls_snapshot0.npy at /net/cns/projects/ml-dft-casus-data/ldos/298K/2.699gcc/ 
[ 1 ]: Checking descriptor file  Al_fp_200x200x200grid_94comps_snapshot1.npy at /net/cns/projects/ml-dft-casus-data/fp/298K/2.699gcc/ 
[ 1 ]: Checking targets file  Al_ldos_200x200x200grid_250elvls_snapshot1.npy at /net/cns/projects/ml-dft-casus-data/ldos/298K/2.699gcc/ 
[ 1 ]: Running MALA without test data. If this is not what you wanted, please revise the input script. 
[ 1 ]: Consistency check successful. 
[ 1 ]: Initializing the data scalers. 
[ 0 ]: Input scaler parametrized. 
[ 0 ]: Output scaler parametrized. 
[ 0 ]: Data scalers initialized. 
[ 0 ]: Build datasets. 
[ 0 ]: Build dataset done. 
[ 0 ]: Read data: DONE. 
size= 2 global_rank= 0 local_rank= 0 device= Tesla V100-SXM2-32GB
[ 0 ]: Rescaling learning rate because multiple workers are used for training. 

Looks like rank 0 finishes initializing the DataScalers far before rank 1 does. I will try simply adding a barrier and see what happens.
I am not really sure why this only occurs now and not in the single-node setup; maybe there is an issue with data access that only becomes apparent now?
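For the barrier, a common pattern with Horovod's PyTorch API (and, judging by the allreduce.barrier entry in the log further below, roughly what ended up being used) is a dummy allreduce that every rank has to reach before any rank can continue. A minimal sketch:

import torch
import horovod.torch as hvd

def barrier(name="barrier"):
    # All ranks block here until every rank has called this allreduce;
    # the name is what shows up in the stall inspector (e.g. "allreduce.barrier").
    hvd.allreduce(torch.tensor(0.0), name=name)

# e.g. right after data loading / scaler initialization on every rank:
barrier("data_preparation_done")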

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

Also, I have merged the current develop branch back into the horovod branch. A lot has happened since we started developing, and I think it would be beneficial to use the new interfaces etc. for this development as well.

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

Adding a barrier at the end of data loading/preparation sort of helped, see here:
slurm-3616907.out.txt
(EDIT: Just to clarify: the fact that both ranks print separately is for debugging purposes, not another Horovod bug.)

I still get the same warnings about one process stalling, but only during the data loading process. Afterwards it is all good and the training goes smoothly. I think that confirms my suspicion that, for some reason, there is a problem when both ranks try to access the data simultaneously. I will try to investigate further and maybe come up with a solution. It might also be a hardware issue: there are several places on our cluster where data can be stored, and maybe we have it in the wrong place, i.e. one that is not well suited for parallel I/O.

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

I think this is really just an issue with the file system.
Switching to the recommended partition gives this:

[ 1 ] Data scalers initialized. 
[ 1 ] Build datasets. 
[ 0 ] Input scaler parametrized. 
[ 0 ] Output scaler parametrized. 
[ 0 ] Data scalers initialized. 
[ 0 ] Build datasets. 
[ 1 ] Build dataset done. 
[2021-07-12 18:05:46.986585: W /tmp/pip-install-wodhhghv/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Missing ranks:
0: [allreduce.barrier]
[ 0 ] Build dataset done. 
[ 0 ] Read data: DONE. 
[ 1 ] Read data: DONE. 
size= 2 global_rank= 1 local_rank= 0 device= Tesla V100-SXM2-32GB
[ 1 ] Rescaling learning rate because multiple workers are used for training. 
size= 2 global_rank= 0 local_rank= 0 device= Tesla V100-SXM2-32GB
[ 0 ] Rescaling learning rate because multiple workers are used for training. 
[ 1 ] Network setup: DONE. 
[ 1 ] Starting training. 
[ 0 ] Network setup: DONE. 
[ 0 ] Starting training. 
[ 0 ] Initial Guess - validation data loss:  0.11493272306676955 
[ 1 ] Initial Guess - validation data loss:  0.11493272306676955 
[ 0 ] Epoch:  0 validation data loss:  0.00011622552497192373 
[ 0 ] Time for epoch[s]: 362.26197266578674 

So one rank, at one point, is a little more than 60 s behind the other one, but thanks to the barrier this is not a problem. That warning could probably be suppressed further by adding more barriers to the process (after loading into RAM, after computing the scaling coefficients, etc.), but in general I would argue this should work. I will start doing more benchmarks using more nodes and tasks per node and report what I find.

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

I have new results, but before progressing I think it might make sense to define the still open tasks.

  • Test horovod on single node and see if it brings speedup (data in RAM)
  • Get horovod speedup from multinode setup (data in RAM)
  • Test horovod on single node and see if it brings speedup (data via lazy loading)
  • Get horovod speedup from multinode setup (data via lazy loading)

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

These are the results:

Screenshot from 2021-07-15 16-24-04
Screenshot from 2021-07-15 16-23-47

From these I would argue that the first three of the open tasks are fulfilled. I will run the RAM example on 8 nodes to get an additional data point, but in general this looks good. That leaves only the lazy loading case to be improved.

I will also add the submit.slurm configuration used for this to the documentation, so that the next user does not struggle with it.

Edit: Error bars indicate the standard deviation of the epoch time. Training ran for 100 epochs or 24 h, whichever was shorter.

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

I did some tests and noticed the following:
When we have, say, D data points and N Horovod ranks, each rank will process D/N points.
That means that the plots shown above are indeed directly comparable. The error has to be somewhere else. I'll keep digging.

Edit: One thing that may influence this: since in our case the data is scattered across snapshots, lazy loading means N * (number of snapshots) file reads. I am not sure whether this is a problem; in theory, file I/O on the correct HPC partitions should happen in parallel, correct?
It also means that the Horovod speedup scales with the batch size: the costly communication only happens after each mini-batch has been processed.
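A quick back-of-the-envelope illustration of that last point (numbers are made up for illustration): with D data points, N ranks and batch size B, each rank performs about D/(N*B) optimizer steps per epoch, and each step triggers one gradient allreduce, so a larger batch size means fewer communication rounds.

# Illustrative only: communication rounds per epoch in data-parallel training.
D = 8_000_000   # data points (e.g. one 200x200x200 snapshot)
N = 2           # Horovod ranks
for B in (64, 128, 256, 512):
    steps_per_epoch = D // (N * B)   # one gradient allreduce per step
    print("batch size %4d: ~%d allreduces per epoch per rank" % (B, steps_per_epoch))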

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

I did more tests.
For one, I noticed that most of the Horovod time is lost during the minibatch forward/backward pass. This is consistent with my expectation, so I don't think there is a very obvious bug here. The bad news is that I have no idea why exactly that step takes so long.

Furthermore, I played around with the call parameters for mpirun. I found that I can make the epochs even slower (great success!), which leads me to believe that I may not have the optimal call parameters yet... This is a rather frustrating aspect of Horovod. What if we switch clusters? Do we have to redo all of this then?

I hope this will get better once we have implemented data parallelism. (EDIT: On second thought, this wouldn't even help. The time is mostly lost during the minibatching... Sure, we get additional overhead from the repeated file I/O, but running the numbers, even without this overhead we would still be slower...)
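To narrow down where the per-step time actually goes, something like the following per-rank timing sketch could be used (generic PyTorch/Horovod; model, criterion, optimizer and loader are placeholders, the optimizer is assumed to be wrapped in hvd.DistributedOptimizer, and torch.cuda.synchronize() makes sure the timer measures the finished GPU work):

import time
import torch
import horovod.torch as hvd

# 'model', 'criterion', 'optimizer' (hvd.DistributedOptimizer) and 'loader'
# are assumed to exist already.
for inputs, targets in loader:
    torch.cuda.synchronize()
    start = time.perf_counter()

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()      # gradient allreduce is triggered by the Horovod optimizer
    optimizer.step()

    torch.cuda.synchronize()
    if hvd.rank() == 0:
        print("step time: %.4f s" % (time.perf_counter() - start))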

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

OK, I finally got it. The setup on my local cluster did not handle the GPU communication correctly; Horovod was not using NCCL support properly, and I have not been able to activate it. But that is a problem specific to my cluster/setup, so no fault of MALA. If I instead switch to CPU, I see the speedups reported by @ellisja (with the offset of GPU vs. CPU). Note that this is different data than in the plots above, because I currently cannot access the Al data. So Horovod in MALA works now. (A quick check for NCCL support is sketched after the figures below.)
Screenshot from 2022-02-16 19-38-38
Screenshot from 2022-02-16 19-38-29
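For anyone running into the same thing: whether a given Horovod installation was actually built with NCCL (and MPI/Gloo) support can be checked directly from Python, for example with this small sketch:

import horovod.torch as hvd

# These report how Horovod was compiled. If nccl_built() is false/0, GPU tensors
# are reduced via MPI or Gloo instead of NCCL, which is considerably slower.
print("NCCL built:", hvd.nccl_built())
print("MPI built: ", hvd.mpi_built())
print("Gloo built:", hvd.gloo_built())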

RandomDefaultUser avatar RandomDefaultUser commented on August 25, 2024

(correct assignment of GPUs was also tested)
