Comments (33)
Ok. I have launched a job and I'll let you know if it works with aggressive load balancing. I have set `every=40`. Just to be sure: the default load balancing is `every=150`, as written on the documentation page? I'm using vectorization `every=20`.
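For reference, a sketch of how these settings sit in the namelist (the adaptive mode and `reconfigure_every` parameter name are my assumptions for how the vectorization period is set; check the documentation for your Smilei version):

```python
# Sketch of the blocks discussed above, with the values from this comment.
# Assumption: adaptive vectorization with a reconfiguration period.
LoadBalancing(
    every = 40          # aggressive: rebalance every 40 iterations
)

Vectorization(
    mode = "adaptive",
    reconfigure_every = 20
)
```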
from smilei.
Thanks. I have tried with the dummy values `mi=50` and `vUPC=0.01` and was able to reproduce a problem. I will look into it.
I see `Address not mapped to object [0xfffffffffffffffd]` and `failed: Cannot allocate memory`. You probably ran out of memory.
The Smilei output file shows very little memory usage, e.g. 60 GB, while the nodes have 256 GB of memory each. In the past I did encounter memory issues, but then the Smilei output file would also show it.
I agree with @mccoys , it looks like a memory problem. Where did you see a memory occupation of 60 GB ?
In any case, the memory occupation is always underestimated because of many temporary buffers. A more accurate (but still underestimated) way to measure memory occupation is to use the Performance diagnostic. A possible scenario is that a strong load imbalance drives a peak of memory occupation on a single node and crashes it.
I notice that you are using very small patches with respect to your number of threads (more than 100 patches per openMP thread). You can try using larger patches. This should reduce the memory overhead induced by patch communication.
If you detect a peak of memory occupation somewhere that crashes a node you can also consider using the particle merging feature to mitigate that effect.
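To enable the Performance diagnostic mentioned above, a minimal sketch of the namelist block (the output period is an arbitrary example value, not a recommendation):

```python
# Sketch: Performance diagnostic block in a Smilei namelist.
# The period (every = 100) is an arbitrary example value.
DiagPerformances(
    every = 100
)
```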
It's in the stdout files I attached with this message earlier (see the first message). It says 60 GB. I can use the performance diagnostic to see if memory is indeed the issue.
Last year, I asked about a memory issue and followed up on your suggestion to use larger patches. However, the runtimes got really slow and I couldn't finish simulations even after restarting them a few times. Then I tried a large number of processors, e.g. 35000, for this problem. I could finish simulations in a shorter time, albeit with somewhat low CPU usage. Last year I also tried the particle merging feature, but I couldn't optimize the merging parameters very well for my simulations.
Looking at the memory bandwidth per socket, I see very little memory usage (see the attached file)
If you need small patches for performance, it confirms that your case is strongly imbalanced. It also explains why you have poor CPU usage when scaling. It should show on the performance diag. Any chance you could use more openMP threads and fewer MPI processes? Or are you already bound by the number of cores per socket of your system?
At the end of the stdout, it says:
`Maximum memory per node: 57.321124 GB (defined as MaxRSS*Ntasks/NNodes)`
Is that used memory or available memory? I ask because in your document, the maximum memory per node appears to be about 50 GB, which is dangerously close to that limit above.
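As a worked illustration of that `MaxRSS*Ntasks/NNodes` formula (the task and node counts below are made up for the example, not taken from this run):

```python
# Illustration of the stdout metric: MaxRSS * Ntasks / NNodes.
# The numbers below are hypothetical, not from the actual job.
def max_memory_per_node(max_rss_gb, ntasks, nnodes):
    """Estimate peak memory per node from the per-task resident set size."""
    return max_rss_gb * ntasks / nnodes

print(max_memory_per_node(2.0, 96, 4))  # 48.0 GB: 96 tasks of 2 GB on 4 nodes
```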
@mccoys Maximum memory per node is 256 GB.
@beck-llr it's a collisionless shock simulation, so of course it can be imbalanced. I tried vectorization, SDMD, particle merging and OpenMP tasks to speed things up, but with limited success so far. I'm only using either 4 or 6 MPI processes per node and 12 or 19 OpenMP threads on two different machines, because this gives the best performance.
Just to add that vectorization does help and compute time improves by 2x.
Note that load balancing produces a memory spike that can be very substantial. The crash appears at that moment and seems related to MPI not being able to send all the data between MPI processes. Have you tried doing load balancing more/less often?
I do load balancing rather often, every 150 iterations. Should I increase it even more? I can try it tonight.
No, I bet you should reduce it. If you do it rarely, it has to do a lot of patch swaps, meaning a lot of memory allocation.
The default is 20, but maybe not optimal for your case.
Yes, the default is 150 according to `pyinit.py`. Another metric that you can monitor is the number of patches per MPI process. You can check it directly in the `patch_load.txt` file. It displays the number of patches per MPI process after each load-balance operation. You have a problem if an MPI process ends up with only a couple of patches.
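A quick way to watch for that failure mode is to scan the file for the per-rank minimum. A sketch, assuming each line of `patch_load.txt` holds one whitespace-separated patch count per MPI rank (check the actual layout for your Smilei version):

```python
# Sketch: report min/max patches per MPI rank after each load-balance step.
# Assumption: one line per rebalance, whitespace-separated counts per rank.
def patch_balance(path):
    spans = []
    with open(path) as f:
        for line in f:
            counts = [int(tok) for tok in line.split() if tok.isdigit()]
            if counts:
                spans.append((min(counts), max(counts)))
    return spans

# A rank holding only a couple of patches shows up as a tiny minimum,
# e.g. a span like (2, 4680) means one rank is nearly starved.
```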
Unfortunately this simulation failed even earlier than before. I attach the err, out and `patch_load.txt` files. From the `patch_load.txt` file, I see almost 200 out of 1000 patches per thread. So I guess this is fine?
Although the simulation is imbalanced, when I plot the results up to the crash I don't see any unexpected behaviour. Everything seems physical and expected. This is why I'm worried. I asked technical support and they also suggested that debugging this would be very hard.
tjob_hybrid.err.9787451.txt
tjob_hybrid.out.9787451.txt
I had another quick look at this issue and UCX errors are usually related to MPI or network settings, allowing for different memory or cache amounts for MPI transfers. It is not directly a Smilei issue, so I am closing this.
Reopening from indication of @Tissot11 elsewhere that this is a regression as it used to work in v5.0. Can you confirm this? Do you have a case we could test?
Yeah, I do have a case... After I switched to Smilei v5.0 last year, I have seen numerous segmentation faults (with 2D simulations) on different machines with different compilers and libraries. Last month, I managed to run the same simulation I complained about at the beginning of this thread with Smilei v4.7, without a segmentation fault or memory-related crash.
Because of these widespread segmentation faults, I started using other codes for simulations. If you investigate this issue and we can hope to resolve it quickly, then I can prepare a case and give it to you...
It depends on whether we are able to reproduce the error. If it requires a large allocation to reproduce, it will of course take longer.
Hi. It is indeed a large simulation and it will be difficult to provide a fix if one is really required.
@Tissot11 are you positive that there is a regression and that you observe the crash in an exactly identical configuration as before (same simulation size, number of patches, physical configuration, compiler, mpi module etc.) ?
I had a look at the logs you provided and it is indeed an extremely unbalanced simulation. After the last load balancing, the number of patches per MPI rank spans from 176 to 4680!! I assume this puts a lot of pressure on the dynamic load balancing process and on MPI exchanges.
Moreover, you are using a very high number of patches, which also increases memory and communication overheads. Even 176 is a lot of patches when you have only 12 openMP threads.
I would strongly advise dividing your total number of patches by at least a factor of 4. You previously answered that this would slow down your simulation too much. By how much did you decrease your number of patches? Did you check the minimum number of patches per MPI? As long as you have at least 24 patches per MPI (with 12 openMP threads) it should not slow down dramatically. It is only when you go down to less than one patch per thread that you are going too far.
P.S.: You may observe a serious slowdown because of cache effects beyond a certain patch size. In that case you could try dividing your number of patches by only a factor of 2. I'd be really surprised if it didn't help, but you can never know for sure :-)
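The rule of thumb above can be written down as a tiny check (the threshold of 2 patches per thread is my reading of the "24 patches per MPI with 12 threads" advice, not an official limit):

```python
# Sketch: sanity check of patch granularity per MPI rank.
# Assumption: aim for at least ~2 patches per OpenMP thread.
def enough_patches(min_patches_per_mpi, omp_threads, patches_per_thread=2):
    return min_patches_per_mpi >= patches_per_thread * omp_threads

print(enough_patches(176, 12))  # True: 176 patches is well above the 24 minimum
print(enough_patches(10, 12))   # False: fewer than one patch per thread
```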
Also, for the particle merging to be efficient, you need to know what the distribution of your macro-particles in your most populated patches/cells looks like. I'm still convinced it could be very helpful in your case, but it does require a bit of tuning.
Note that the default `merge_momentum_cell_size` is VERY conservative. Do not hesitate to reduce it significantly. Conversely, make sure that `merge_min_particles_per_cell` is not too low: you are only interested in merging particles in cells with many more particles than average.
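As a sketch, these merging parameters sit inside the Species block; the values below are illustrative guesses, not tuned recommendations (check the Smilei documentation for the defaults of your version):

```python
# Sketch: merging-related parameters inside a Species block (values illustrative).
Species(
    name = "electron",
    # ... usual species parameters ...
    merging_method = "vranic_spherical",
    merge_every = 10,
    merge_min_particles_per_cell = 64,    # keep high: merge only crowded cells
    merge_momentum_cell_size = [8, 8, 8]  # smaller than default = more merging
)
```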
Indeed, the problems I first reported were with large 2D simulations. One of these simulations, with a larger domain and >25K CPUs, I managed to run with Smilei v4.7 (without any filtering; not so efficient, as you explained, due to the patches) using an older Intel compiler and libraries (compiler/intel/2022.0.2 numlib/mkl/2022.0.2 mpi/impi/2021.5.1 lib/hdf5/1.12) on Horeka. I should emphasise that I mostly use interpolation order 4, but sometimes I also use order 2.
However, I have now prepared a simple case (2D) that I ran on 8 nodes of Hawk at HLRS, and on 4 nodes of another HPC machine. To summarize:
- This simulation runs fine with the custom MPI library MPT at HLRS, and with OpenMPI 5.0 and gcc 10.2. However, it starts showing segmentation faults with OpenMPI if I just change the mass ratio and nothing else in the namelist.
- Even with the MPT library, it shows segmentation faults (also with Smilei 4.7) if I enable the Friedman filter. The same segmentation faults occur with the Intel MPI library on another machine. Please see the attached namelist.
I fear that newer compilers and the changes made in Smilei 5.0 have some subtle issues, at least for 2D simulations, since in 1D simulations I do not see any issues. I have spent a lot of time trying to run the same and similar 2D simulations with several combinations of libraries and compilers, and spent the last few months talking with technical support, and nothing came of it. This is why I have started using other codes.
I will be very happy if we could figure this out so that I can use Smilei for 2D simulations.
namelist.py.txt
Shock_test.e2581658.txt
Shock_test.e2581722.txt
Shock_test.e2581744.txt
Shock_test.e2583577.txt
Shock_test.e2583698.txt
tjob_hybrid.err.12833017.txt
I had this problem last year with memory. I started using interpolation order 4 and fewer particles per cell, and launching 4-6 MPI processes with 12 OpenMP threads on a single node. With this approach I no longer had memory issues, as the memory usage reported by every tool remained below 256 GB per node. However, I sometimes saw memory-related segmentation faults, as I reported before, which you and @mccoys attributed to intermittent memory spikes that I could not catch in any performance monitoring tool. I suspect the problem is with the MPI communication, and that's why segmentation faults have become a very frequent occurrence with these 2D simulations.
`mi` and `vUPC` are undefined in the namelist you provided.
Sorry! This is a redacted version and I forgot that I still use these parameters later in the diagnostic
@beck-llr would it be possible to have a `maximum_npatch_per_MPI` in the load balancing? It would prevent overloading ranks when there is a strong load imbalance. Maybe this is not the issue here, but the older logs really look like MPI is overloaded.
Now the new logs are different, so we have to see (errors in the projectors usually mean that particles are not where they are supposed to be).
@mccoys There are already options to tune the dynamic load balancing, like `cell_load` for instance, which will influence the min and max number of patches per MPI. In the present case I am more concerned with enforcing a minimum number of patches (which can be achieved by increasing the cell load). But in fact the min and max are linked: if you increase the min, you mechanically decrease the max.
From my first tests, the problem here now lies within the Friedman filter. I think it has been problematic for a while. This is a good opportunity to have a close look at it.
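For reference, `cell_load` is set in the LoadBalancing block; a sketch (the value is an arbitrary example, not a recommendation):

```python
# Sketch: increasing cell_load weights grid cells more heavily relative to
# particles, which raises the minimum patch count a rank can end up with.
LoadBalancing(
    every = 150,
    cell_load = 4.   # illustrative value; tune for your imbalance
)
```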
@beck-llr, so should I change the `cell_load` for my simulations? I have never set it in my simulations. As @mccoys says, something automatic to reduce load imbalance would be useful, since most plasma physics simulations develop load imbalance after a short interaction time. With laser-solid interactions, this could be even more demanding than shock simulations...
Besides the Friedman filter, I have also seen segmentation faults with different MPI libraries. In general, it would be nice to have Smilei always work with OpenMPI and show no segmentation faults, except for obvious, understandable reasons...
I was wondering if there is any relevant info you would want to share at this stage?
I would appreciate it if you let me know the possible causes of these segmentation faults and whether you intend to address them soon. This would help me decide whether I should wait to use Smilei for simulating this problem or not...
The bug in the Friedman filter is reproducible and will be fixed in the relatively short term.
For the rest, the issue is unclear and not reproducible for the moment. That does not mean there is no problem, but it does not affect many people and I don't know exactly what we can do about it. Do you think you could provide a case that reproduces the problem without using the Friedman filter?
The same namelist also crashed without the Friedman filter for me when choosing a larger simulation domain and a longer duration. I could, however, run it successfully, albeit inefficiently, using the older Smilei version (4.7). Even this reduced version suddenly shows higher push times after a sufficiently long runtime. I guess this sudden increase in push times (more than a factor of 5 or 6) could be linked to memory load, leading to the segmentation faults mentioned before in this thread. However, I have never caught any unreasonable memory usage with any of the tools at my disposal.