Comments (33)
Ok. I have launched a job and I'll let you know if it works with aggressive load balancing. I have set `every=40`. Just to be sure: the default load balancing is `every=150`, as written on the documentation page? I'm using vectorization `every=20`.
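For reference, a sketch of how these settings sit in the namelist (the adaptive mode and `reconfigure_every` parameter name are my assumptions for how the vectorization period is set; check the documentation for your Smilei version):

```python
# Sketch of the blocks discussed above, with the values from this comment.
# Assumption: adaptive vectorization with a reconfiguration period.
LoadBalancing(
    every = 40          # aggressive: rebalance every 40 iterations
)

Vectorization(
    mode = "adaptive",
    reconfigure_every = 20
)
```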
from smilei.
Thanks. I have tried with the dummy values `mi=50` and `vUPC=0.01` and was able to reproduce a problem. I will look into it.
I see `Address not mapped to object [0xfffffffffffffffd]` and `failed: Cannot allocate memory`. You probably ran out of memory.
The Smilei output file shows very little memory usage, e.g. 60 GB, while the nodes have 256 GB of memory each. In the past I did encounter memory issues, but then the Smilei output file would also show it.
I agree with @mccoys , it looks like a memory problem. Where did you see a memory occupation of 60 GB ?
In any case, the memory occupation is always underestimated because of many temporary buffers. A more accurate (but still underestimated) way to measure memory occupation is to use the Performance diagnostic. A possible scenario is that a strong load imbalance drives a peak of memory occupation on a single node and crashes it.
I notice that you are using very small patches with respect to your number of threads (more than 100 patches per openMP thread). You can try using larger patches. This should reduce the memory overhead induced by patch communication.
If you detect a peak of memory occupation somewhere that crashes a node you can also consider using the particle merging feature to mitigate that effect.
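To enable the Performance diagnostic mentioned above, a minimal sketch of the namelist block (the output period is an arbitrary example value, not a recommendation):

```python
# Sketch: Performance diagnostic block in a Smilei namelist.
# The period (every = 100) is an arbitrary example value.
DiagPerformances(
    every = 100
)
```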
It's in the stdout files I attached with this message earlier (see the first message). It says 60 GB. I can use the performance diagnostic to see if memory is indeed the issue.
Last year, I asked about a memory issue and followed up on your suggestion to use larger patches. However, the runtimes got really slow and I couldn't finish simulations even after restarting them a few times. Then I tried a large number of processors, e.g. 35000, for this problem. I could finish simulations in a shorter time, albeit with somewhat low CPU usage. Last year I also tried the particle merging feature, but I couldn't optimize the merging parameters very well for my simulations.
Looking at the memory bandwidth per socket, I see very little memory usage (see the attached file)
If you need small patches for performance, it confirms that your case is strongly imbalanced. It also explains why you have poor CPU usage when scaling. It should show on the performance diag. Any chance you could use more openMP threads and fewer MPI processes? Or are you already bound by the number of cores per socket of your system?
At the end of the stdout, it says:
`Maximum memory per node: 57.321124 GB (defined as MaxRSS*Ntasks/NNodes)`
Is that used memory or available memory? I ask because in your document, the maximum memory per node appears to be about 50 GB, which is dangerously close to that limit above.
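As a worked illustration of that `MaxRSS*Ntasks/NNodes` formula (the task and node counts below are made up for the example, not taken from this run):

```python
# Illustration of the stdout metric: MaxRSS * Ntasks / NNodes.
# The numbers below are hypothetical, not from the actual job.
def max_memory_per_node(max_rss_gb, ntasks, nnodes):
    """Estimate peak memory per node from the per-task resident set size."""
    return max_rss_gb * ntasks / nnodes

print(max_memory_per_node(2.0, 96, 4))  # 48.0 GB: 96 tasks of 2 GB on 4 nodes
```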
@mccoys Maximum memory per node is 256 GB.
@beck-llr it's a collisionless shock simulation, so of course it can be imbalanced. I tried vectorization, SDMD, particle merging and OpenMP tasks to speed things up, but with limited success so far. I'm only using either 4 or 6 MPI processes per node and 12 or 19 OpenMP threads on two different machines, because this gives the best performance.
Just to add that vectorization does help and compute time improves by 2x.
Note that load balancing produces a memory spike that can be very substantial. The crash appears at that moment and seems related to MPI not being able to send all the data between MPI processes. Have you tried doing load balancing more/less often?
I do load balancing rather often, every 150 iterations. Should I increase it even more? I can try it tonight.
No, I bet you should reduce it. If you do it rarely, it has to do a lot of patch swaps, meaning a lot of memory allocation.
The default is 20, but maybe not optimal for your case.
Yes, the default is 150 according to `pyinit.py`. Another metric that you can monitor is the number of patches per MPI process. You can check it directly in the `patch_load.txt` file. It displays the number of patches per MPI process after each load-balance operation. You have a problem if an MPI process ends up with only a couple of patches.
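A quick way to watch for that failure mode is to scan the file for the per-rank minimum. A sketch, assuming each line of `patch_load.txt` holds one whitespace-separated patch count per MPI rank (check the actual layout for your Smilei version):

```python
# Sketch: report min/max patches per MPI rank after each load-balance step.
# Assumption: one line per rebalance, whitespace-separated counts per rank.
def patch_balance(path):
    spans = []
    with open(path) as f:
        for line in f:
            counts = [int(tok) for tok in line.split() if tok.isdigit()]
            if counts:
                spans.append((min(counts), max(counts)))
    return spans

# A rank holding only a couple of patches shows up as a tiny minimum,
# e.g. a span like (2, 4680) means one rank is nearly starved.
```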
Unfortunately this simulation failed even earlier than before. I attach the err, out and `patch_load.txt` files. From the `patch_load.txt` file, I see almost 200 out of 1000 patches per thread. So I guess this is fine?
Although the simulation is imbalanced, when I plot the results up to the crash I don't see any unexpected behaviour. Everything seems physical and expected. This is why I'm worried. I asked technical support and they also suggested that debugging this would be very hard.
tjob_hybrid.err.9787451.txt
tjob_hybrid.out.9787451.txt
I had another quick look at this issue and UCX errors are usually related to MPI or network settings, allowing for different memory or cache amounts for MPI transfers. It is not directly a Smilei issue, so I am closing this.
Reopening from indication of @Tissot11 elsewhere that this is a regression as it used to work in v5.0. Can you confirm this? Do you have a case we could test?
Yeah, I do have a case... After I switched to Smilei v5.0 last year, I have seen numerous segmentation faults (with 2D simulations) on different machines with different compilers and libraries. Last month, I managed to run the same simulation I complained about at the beginning of this thread with Smilei v4.7, without a segmentation fault or memory-related crash.
Because of these widespread segmentation faults, I started using other codes for simulations. If you investigate this issue and we can hope to resolve it quickly, then I can prepare a case and give it to you...
It depends on whether we are able to reproduce the error. If it requires a large allocation to reproduce, it will of course take longer.
Hi. It is indeed a large simulation and it will be difficult to provide a fix if one is really required.
@Tissot11 are you positive that there is a regression and that you observe the crash in an exactly identical configuration as before (same simulation size, number of patches, physical configuration, compiler, mpi module etc.) ?
I had a look at the logs you provided and it is indeed an extremely unbalanced simulation. After the last load balancing, the number of patches per MPI rank spans from 176 to 4680!! I assume this puts a lot of pressure on the dynamic load balancing process and on MPI exchanges.
Moreover, you are using a very high number of patches, which also increases memory and communication overheads. Even 176 is a lot of patches when you have only 12 openMP threads.
I would strongly advise dividing your total number of patches by at least a factor of 4. You previously answered that this would slow down your simulation too much. By how much did you decrease your number of patches? Did you check the minimum number of patches per MPI? As long as you have at least 24 patches per MPI (with 12 openMP threads) it should not slow down dramatically. It is only when you go down to less than one patch per thread that you are going too far.
P.S.: You may observe a serious slowdown because of cache effects beyond a certain patch size. In that case you could try dividing your number of patches by only a factor of 2. I'd be really surprised if it didn't help, but you can never know for sure :-)
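The rule of thumb above can be written down as a tiny check (the threshold of 2 patches per thread is my reading of the "24 patches per MPI with 12 threads" advice, not an official limit):

```python
# Sketch: sanity check of patch granularity per MPI rank.
# Assumption: aim for at least ~2 patches per OpenMP thread.
def enough_patches(min_patches_per_mpi, omp_threads, patches_per_thread=2):
    return min_patches_per_mpi >= patches_per_thread * omp_threads

print(enough_patches(176, 12))  # True: 176 patches is well above the 24 minimum
print(enough_patches(10, 12))   # False: fewer than one patch per thread
```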
Also, for the particle merging to be efficient, you need to know what the distribution of your macro-particles in your most populated patches/cells looks like. I'm still convinced it could be very helpful in your case, but it does require a bit of tuning.
Note that the default `merge_momentum_cell_size` is VERY conservative. Do not hesitate to reduce it significantly. Conversely, make sure that `merge_min_particles_per_cell` is not too low: you are only interested in merging particles in cells with many more particles than average.
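As a sketch, these merging parameters sit inside the Species block; the values below are illustrative guesses, not tuned recommendations (check the Smilei documentation for the defaults of your version):

```python
# Sketch: merging-related parameters inside a Species block (values illustrative).
Species(
    name = "electron",
    # ... usual species parameters ...
    merging_method = "vranic_spherical",
    merge_every = 10,
    merge_min_particles_per_cell = 64,    # keep high: merge only crowded cells
    merge_momentum_cell_size = [8, 8, 8]  # smaller than default = more merging
)
```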
Indeed, the problems I first reported were with large 2D simulations. One of these simulations, with a larger domain and >25K CPUs, I managed to run with Smilei v4.7 (without any filtering; not so efficient, as you explained, due to the patches) using an older Intel compiler and libraries (compiler/intel/2022.0.2 numlib/mkl/2022.0.2 mpi/impi/2021.5.1 lib/hdf5/1.12) on Horeka. I should emphasise that I mostly use interpolation order 4, but sometimes I also use order 2.
However, I have now prepared a simple case (2D) that I ran on 8 nodes of Hawk at HLRS, and on 4 nodes of another HPC machine. To summarize:
- This simulation runs fine with the custom MPI library MPT at HLRS, and with OpenMPI 5.0 and gcc 10.2. However, it starts showing segmentation faults with OpenMPI if I just change the mass ratio and nothing else in the namelist.
- Even with the MPT library, it shows segmentation faults (also with Smilei 4.7) if I enable the Friedman filter. The same segmentation faults occur with the Intel MPI library on another machine. Please see the attached namelist.
I fear that newer compilers and the changes made in Smilei 5.0 have some subtle issues, at least for 2D simulations, since in 1D simulations I do not see any issues. I have spent a lot of time trying to run the same and similar 2D simulations with several combinations of libraries and compilers, and spent the last few months talking with technical support, and nothing came of it. This is why I have started using other codes.
I will be very happy if we could figure this out so that I can use Smilei for 2D simulations.
namelist.py.txt
Shock_test.e2581658.txt
Shock_test.e2581722.txt
Shock_test.e2581744.txt
Shock_test.e2583577.txt
Shock_test.e2583698.txt
tjob_hybrid.err.12833017.txt
I had this problem last year with memory. I started using interpolation order 4 and fewer particles per cell, and launching 4-6 MPI processes with 12 OpenMP threads on a single node. With this approach I no longer had memory issues, as the memory usage reported by every tool remained below 256 GB per node. However, I sometimes saw memory-related segmentation faults, as I reported before, which you and @mccoys attributed to intermittent memory spikes that I could not catch in any performance monitoring tool. I suspect the problem is with the MPI communication, and that's why segmentation faults have become a very frequent occurrence with these 2D simulations.
`mi` and `vUPC` are undefined in the namelist you provided.
Sorry! This is a redacted version and I forgot that I still use these parameters later in the diagnostic
@beck-llr would it be possible to have a `maximum_npatch_per_MPI` in the load balancing? It would prevent overloading ranks when there is a strong load imbalance. Maybe this is not the issue here, but the older logs really look like MPI is overloaded.
Now the new logs are different, so we have to see (errors in the projectors usually mean that particles are not where they are supposed to be).
@mccoys There are already options to tune the dynamic load balancing, like `cell_load` for instance, which will influence the min and max number of patches per MPI. In the present case I am more concerned with enforcing a minimum number of patches (which can be achieved by increasing the cell load). But in fact the min and max are linked: if you increase the min, you mechanically decrease the max.
From my first tests, the problem here now lies within the Friedman filter. I think it has been problematic for a while. This is a good opportunity to have a close look at it.
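For reference, `cell_load` is set in the LoadBalancing block; a sketch (the value is an arbitrary example, not a recommendation):

```python
# Sketch: increasing cell_load weights grid cells more heavily relative to
# particles, which raises the minimum patch count a rank can end up with.
LoadBalancing(
    every = 150,
    cell_load = 4.   # illustrative value; tune for your imbalance
)
```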
@beck-llr, so should I change the `cell_load` for my simulations? I have never set it in my simulations. As @mccoys says, something automatic to reduce load imbalance would be useful, since most plasma physics simulations develop load imbalance after a short interaction time. With laser-solid interactions, this could be even more demanding than shock simulations...
Besides the Friedman filter, I have also seen segmentation faults with different MPI libraries. In general, it would be nice to have Smilei always work with OpenMPI and show no segmentation faults, except for obvious, understandable reasons...
I was wondering if there is any relevant info you would want to share at this stage?
I would appreciate it if you let me know the possible causes of these segmentation faults and whether you intend to address them soon. This would help me decide whether I should wait to use Smilei for simulating this problem or not...
The bug in the Friedman filter is reproducible and will be fixed in the relatively short term.
For the rest, the issue is unclear and not reproducible for the moment. That does not mean there is no problem, but it does not affect many people and I don't know exactly what we can do about it. Do you think you could provide a case that reproduces the problem without using the Friedman filter?
The same namelist also crashed without the Friedman filter for me when choosing a larger simulation domain and a longer duration. I could, however, run it successfully, albeit inefficiently, using the older Smilei version (4.7). Even this reduced version suddenly shows higher push times after a sufficiently long runtime. I guess this sudden increase in push times (more than a factor of 5 or 6) could be linked to memory load, leading to the segmentation faults mentioned before in this thread. However, I have never caught any unreasonable memory usage with any of the tools at my disposal.