Comments (6)

mccoys commented on May 28, 2024

Hi,

I have a few comments that might help you investigate the causes of this problem, unless you have already tried all of this. Let me try anyway.

Concerning simulations with a cold or hot core, there is always going to be one MPI process that ends up with most of the work. To reduce this problem you may use load balancing:

  • Have you activated MPI load balancing? It could help move patches between MPI processes (see the namelist sketch after this list).

  • Have you tried making use of OpenMP? Your simulations seem to use only 1 OpenMP thread per MPI process. You should look into the architecture of your machines: typically, you want as many MPI processes as there are nodes and as many threads per MPI process as there are cores per node, although this can vary (see the launch example after this list).

  • To have efficient local (OpenMP) and global (MPI) load balancing, you must ensure you have more patches than the total number of threads. You should also make sure that your patches are smaller than the plasma core itself, so that its particles are shared between several patches.
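
For reference, both of these are controlled from the namelist. Below is a minimal sketch, assuming the standard Main and LoadBalancing blocks; the numbers are only illustrative, not tuned for your case:

    # namelist excerpt (Python)
    Main(
        # ... geometry, cell_length, timestep, etc. ...
        number_of_patches = [16, 16],   # keep this larger than the total number of threads
    )

    LoadBalancing(
        initial_balance = True,   # redistribute patches across MPI processes at startup
        every = 150,              # then re-balance every 150 iterations
    )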
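
On the launch side, a hybrid MPI + OpenMP run then typically looks like the lines below (4 nodes of 16 cores taken as an example; the -ppn flag depends on your MPI implementation, and a dynamic OpenMP schedule helps threads share unevenly loaded patches):

$ export OMP_NUM_THREADS=16   # assuming 16 cores per node
$ export OMP_SCHEDULE=dynamic
$ mpirun -np 4 -ppn 1 ./smilei hotcoreinput.py 2>&1 |tee smilei_004_MPI_016_OMP.log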

Concerning the hot homogeneous simulation, this is a bit puzzling. From your outputs, it seems that most of the simulation time is taken in the synchronisation of particles. However, with a decently low temperature, the time for synchronisations should be low compared to the time for computing particles. If the temperature is very high, there might be too many communications because the particles travel fast, causing the synchronisation lag. This still looks surprising. Are you sure the simulation stays homogeneous over time?

jderouillat commented on May 28, 2024

Hi,
I ran the hothomogenouscorescaninput.py simulation on our cluster (each node has 2 Intel Sandy Bridge processors of 8 cores each) for 1 to 256 MPI processes (1 OpenMP thread each).
The test is homogeneous, so I didn't oversplit the domain into patches: I used 1 patch per MPI process and no dynamic load balancing.

I split the domain to be as square as possible. The only change to the namelist is the removal of the patchesx and patchesy definitions, which I instead set through Main.number_of_patches on the command line:

$ mpirun -np 1 ./smilei  hothomogenouscorescaninput.py Main.number_of_patches=[1,1]  2>&1 |tee smilei_001_MPI_001_OMP.log
$ mpirun -np 2 ./smilei  hothomogenouscorescaninput.py Main.number_of_patches=[2,1]  2>&1 |tee smilei_002_MPI_001_OMP.log
$ mpirun -np 4 ./smilei  hothomogenouscorescaninput.py Main.number_of_patches=[2,2]  2>&1 |tee smilei_004_MPI_001_OMP.log
$ mpirun -np 8 ./smilei  hothomogenouscorescaninput.py Main.number_of_patches=[4,2]  2>&1 |tee smilei_008_MPI_001_OMP.log
$ mpirun -np 16 ./smilei  hothomogenouscorescaninput.py Main.number_of_patches=[4,4]  2>&1 |tee smilei_016_MPI_001_OMP.log
$ mpirun -np 32 -ppn 16 ./smilei  hothomogenouscorescaninput.py Main.number_of_patches=[8,4]  2>&1 |tee smilei_032_MPI_001_OMP.log
$ mpirun -np 64 -ppn 16 ./smilei  hothomogenouscorescaninput.py Main.number_of_patches=[8,8]  2>&1 |tee smilei_064_MPI_001_OMP.log
$ mpirun -np 128 -ppn 16 ./smilei  hothomogenouscorescaninput.py Main.number_of_patches=[16,8]  2>&1 |tee smilei_128_MPI_001_OMP.log
$ mpirun -np 256 -ppn 16 ./smilei  hothomogenouscorescaninput.py Main.number_of_patches=[16,16] 2>&1 |tee smilei_256_MPI_001_OMP.log
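
(For context, the namelist presumably contained something like the hypothetical lines below before this change; dropping them lets Main.number_of_patches be overridden directly from the command line as in the runs above.)

    # hypothetical lines removed from the namelist
    patchesx = 8
    patchesy = 8
    Main(
        # ...
        number_of_patches = [patchesx, patchesy],
    )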

From my point of view the results are good considering the total number of particles.
In the 256 MPI process simulation, the number of particles per MPI process, counting both species, is 1600:

$ grep "Time in time loop" smilei*
smilei_001_MPI_001_OMP.log: Time in time loop :	74.567	99.843% coverage
smilei_002_MPI_001_OMP.log: Time in time loop :	35.904	99.835% coverage
smilei_004_MPI_001_OMP.log: Time in time loop :	18.314	99.785% coverage
smilei_008_MPI_001_OMP.log: Time in time loop :	9.738	99.722% coverage
smilei_016_MPI_001_OMP.log: Time in time loop :	5.421	99.466% coverage
smilei_032_MPI_001_OMP.log: Time in time loop :	2.813	99.247% coverage
smilei_064_MPI_001_OMP.log: Time in time loop :	1.653	98.879% coverage
smilei_128_MPI_001_OMP.log: Time in time loop :	1.114	97.852% coverage
smilei_256_MPI_001_OMP.log: Time in time loop :	0.895	97.031% coverage
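
As a quick sanity check, the implied strong-scaling speedup and parallel efficiency can be computed from these timings with a few lines of Python (a throwaway helper using the numbers above, not part of Smilei):

    # timings in seconds, copied from the grep output above
    timings = {1: 74.567, 2: 35.904, 4: 18.314, 8: 9.738, 16: 5.421,
               32: 2.813, 64: 1.653, 128: 1.114, 256: 0.895}
    t1 = timings[1]
    for nproc, t in sorted(timings.items()):
        speedup = t1 / t
        print(f"{nproc:4d} MPI: speedup {speedup:6.1f}, efficiency {speedup / nproc:6.1%}")

With these numbers the efficiency stays above roughly 80% up to 32 MPI processes and then falls off (about 33% at 256), which is what you would expect once only ~1600 particles remain per process.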

jderouillat commented on May 28, 2024

I looked at hotcoreinput.py too; for this beginning of the simulation it will be very hard to get honest scaling on more than 4 compute units.
At initialization, the plasma is distributed in a circle with a radius of 1.5.
The cell_length is 0.2, so with a minimum patch size of 6 cells a patch spans 6 × 0.2 = 1.2, almost a quarter of the plasma, and 4 patches nearly cover the whole plasma. By the end of this part of the simulation the particles haven't moved enough to allow better scaling.
Below is the scalability for 1 to 4 MPI processes / patches:

$ grep "Time in time" smilei*.log
smilei_001_MPI_001_OMP.log: Time in time loop :	16.202	98.869% coverage
smilei_002_MPI_001_OMP.log: Time in time loop :	8.825	98.801% coverage
smilei_004_MPI_001_OMP.log: Time in time loop :	4.533	98.629% coverage

JimmyHolloway commented on May 28, 2024

Hi Mccoys, Jderouillat,

I took your advice, made use of OpenMP and load balancing, and investigated the temperature's effect on runtime.

The high temperatures:
I ran a cold-homogeneous simulation set to compare with the hot-homogeneous set. The cold-homogeneous set scaled sensibly. The log files show that the code spent much less time synchronising particles and fields in the cold-homogeneous simulations. I am convinced the problem was just the very high temperature of the hot-homogeneous simulations.
(attached figure: coldhomo)

OpenMP:
I performed the same simulation set as above but with OpenMP used properly and got good computational savings:
(attached figure: coldhomoopm)

I am still testing load balancing, but it looks like it gives further computational savings with the 'core' simulations.

Thank you both for helping me with this!
Best,
Jimmy.

MickaelGrech commented on May 28, 2024

Hi Jimmy,
Are you now happy with how the code performs?
If so, can you close this issue?

JimmyHolloway commented on May 28, 2024

Hi Mickael,
Yes, the issue was resolved. Thank you.
