fbpic / fbpic
Spectral, quasi-3D Particle-In-Cell code, for CPU and GPU
Home Page: http://fbpic.github.io
License: Other
The CUDA libraries in Accelerate are now open-source (and renamed pyculib)!
We should switch to pyculib, which can be installed with:
conda install -c numba pyculib
This is minor and maybe doesn't even require any changes, but I want to bring it up anyway, as I just ran into it.
The example input file only initialises the laser on the first start, like:
# Load initial fields
if use_restart is False:
    # Add a laser to the fields of the simulation
    add_laser( sim, a0, w0, ctau, z0 )
else:
    # Load the fields and particles from the latest checkpoint file
    restart_from_checkpoint( sim )
I needed to use an external electron beam instead of a laser, so I just naively replaced add_laser() by add_elec_bunch(). This is fine for the first start, but leads to problems when restarting from a checkpoint. This is because restart_from_checkpoint() only loads particle data that is in sim.ptcl, which only contains the plasma particles when jumping into the else statement. A workaround is to just reinitialise the particle beam and then have restart_from_checkpoint() overwrite the data.
My suggestion would be to have restart_from_checkpoint() add a fresh Particles object for every species in the checkpoint openPMD file.
Another way to avoid this confusion could be to change the example file to something like:
# Load initial fields
# Add a laser to the fields of the simulation
add_laser( sim, a0, w0, ctau, z0 )
if use_restart is True:
    # Load the fields and particles from the latest checkpoint file
    restart_from_checkpoint( sim )
This way, the setup is always the same, and restart_from_checkpoint() just updates it to the current iteration.
Right now the ionization test checks the number of ionized ions from the particle arrays directly. It would be good to check that the number of ionized ions is consistent in the openPMD file.
I think at some point it would make sense to add a staggered version of the (Galilean)-PSATD as well as the PSTD solver. This will enable us to make more comparisons between the different solvers in the future.
@RemiLehe What do you think?
In pull request #6, the field push routine for the standard PSATD is replaced by the field push routine for the Galilean/comoving scheme.
However, I think this makes it less readable for people who are just interested in the standard PSATD scheme. Although the Galilean/comoving coefficients reduce to those of the standard PSATD when ps.V is 0, this is not obvious at first glance. I think keeping a separate, simpler function for the standard PSATD would be easier.
@MKirchen : Any opinion on this?
This could be added to the Advanced section.
We should probably also add a safeguard that prevents the user from having c*dt > dr in the boosted-frame case (since this becomes unstable).
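A minimal sketch of what such a safeguard could look like (the function name, signature, and error message are hypothetical; the actual check would live in the Simulation initialization):

```python
c = 299792458.  # speed of light in m/s

def check_boosted_timestep(dt, dr, gamma_boost=None):
    """Hypothetical safeguard: refuse unstable boosted-frame setups."""
    if gamma_boost is not None and c * dt > dr:
        raise ValueError(
            "In the boosted frame, c*dt (%g m) must not exceed dr (%g m): "
            "this setup is unstable." % (c * dt, dr))

# A timestep with c*dt > dr is rejected in the boosted frame:
try:
    check_boosted_timestep(dt=1e-14, dr=1e-6, gamma_boost=10.)
except ValueError as err:
    print("rejected:", err)
```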
Right now, the following code prints information about the number of MPI processes and the number of threads used:
if comm.rank == 0:
    if use_cuda:
        message = "\nRunning fbpic-%s on GPU " %__version__
    else:
        message = "\nRunning fbpic-%s on CPU " %__version__
    message += "with %d proc" %comm.size
    if use_threading and not use_cuda:
        message += " (%d threads per proc)" %numba.config.NUMBA_NUM_THREADS
    message += ".\n"
    print( message )
We should change it to the following for better readability:
(Note that I exchanged "proc" with "MPI processes")
if comm.rank == 0:
    if use_cuda:
        message = "\nRunning fbpic-%s on GPU " %__version__
    else:
        message = "\nRunning fbpic-%s on CPU " %__version__
    message += "with %d MPI processes" %comm.size
    if use_threading and not use_cuda:
        message += " (%d threads per MPI process)" %numba.config.NUMBA_NUM_THREADS
    message += ".\n"
    print( message )
Hi,
Currently it is not possible to use use_all_mpi_ranks=True together with checkpoints.
The problem is that use_all_mpi_ranks=True causes the Simulation object to set self.mpi_comm=None. However, the checkpoint routines try to call sim.comm.mpi_comm.Barrier(), which is then not defined. This call is used to make the other ranks wait for rank 0 while it creates directories and checks files.
I'm not too familiar with MPI, but I think we could either create a communicator even when using use_all_mpi_ranks=True, or have every rank check its own files and create its own directory.
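The second option could look roughly like this (hypothetical function and directory layout; no Barrier is needed, since each rank only touches its own files):

```python
import os

def prepare_checkpoint_dir(rank, base='checkpoints'):
    # Each rank creates and checks its own directory, instead of
    # waiting on rank 0 behind an MPI Barrier.
    checkpoint_dir = os.path.join(base, 'proc%d' % rank)
    os.makedirs(checkpoint_dir, exist_ok=True)
    return checkpoint_dir

# Rank 3 would write its checkpoints under checkpoints/proc3
print(prepare_checkpoint_dir(3))
```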
This option was useful in the early stages of the code, when debugging. However, now it is cluttering the code, and making the field push routines unnecessarily long. I would remove it.
I needed to use a while loop (instead of a for loop) in the 3rd-order shape factor gathering functions, so that they compile properly with numba's prange function. This is probably related to a bug in numba. We should change this back to a for loop once the bug is fixed.
The functions that create new electron bunches and calculate their space charge field have several issues that need to be fixed:
- w has not been updated with respect to the new definition of w (see #98)

@Escapado and I did several implementations of the charge and current deposition for linear and higher-order (mainly cubic) shape factors. We benchmarked the different implementations, and it was not easy to beat the original implementation of the deposition in terms of speed. (Actually, I was quite surprised that the original implementation is really fast compared to all the other implementations we tried.)
In summary, we developed a new implementation that relies on both looping over the sorted particles and atomically adding the summed local arrays to the global grid array. Thereby, we can remove the add_J and add_rho kernels. Also, the implementation can be written in a nice way with cuda thread-local arrays and for loops. However, thread-local arrays are stored in global memory instead of registers, which makes this implementation 2 times slower than the original one.
We solved this performance issue by explicitly unrolling the for loops and storing the different values directly in registers. With this, we now get the same performance as with the original implementation; however, the code is again very long and does not look nice (but performance is more important in this case!).
Furthermore, we made two decisions:
The new implementation is restricted to work either with linear shapes or with cubic shapes in both dimensions. (I think this restriction is okay, as we don't really need quadratic shapes, and it should be fine to always have the same shape factors in both dimensions.)
We decided to keep the original implementation (without atomic adds), in order to be able to compare the different implementations on more use cases in the future. (This is up for discussion.)
@RemiLehe Any thoughts on this?
These methods are superseded by the new threading methods of PR #80.
Note: in order to still be able to use old versions of numba (i.e. <0.34), we should do something like the following in threading_utilities.py:
if numba.__version__ > '0.34': # The syntax may be a bit different
    threading_installed = True
    prange = numba.prange
else:
    threading_installed = False
    prange = range
Update: We still need to remove:
I recently observed that the boosted particle diagnostics fails to retrieve all particles. For instance, in the case of a Gaussian bunch with 30000 macroparticles, the boosted particle diagnostics write approx 25000 macroparticles only.
This issue only occurs on GPU; on CPU, the full 30000 macroparticles are written.
Right now it is not working, because the charge is considered constant for the whole species, in the boosted-frame openPMD output.
This could potentially speed up execution time and startup time respectively.
As mentioned here, because there is no global flag to turn these options on, we would need to redefine the jit function within fbpic, using functools.
Right now, the charge/current smoothing is equivalent to a binomial filter. This can affect medium-range frequencies by some small amount (e.g. at 20 points per wavelength, the amplitude of the current produced by the laser field is reduced by 2.5% by the smoother).
One option would be to add a compensator, which modifies the smoothing so that medium-range frequencies are less affected.
However, when smoothing charge density distributions with sharp edges, this may result in small regions of opposite-sign density, near the edges of the distribution, which might confuse the user.
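For reference, the numbers above can be checked from the transfer functions of the standard 1-2-1 binomial filter and of a compensated version (a sketch of the textbook stencils, not fbpic's actual implementation):

```python
import numpy as np

# Transfer function of the standard 1-2-1 binomial smoother, and of a
# compensated version (compensator stencil -1/4, 3/2, -1/4).
def binomial_transfer(points_per_wavelength):
    theta = np.pi / points_per_wavelength  # k*dz/2
    return np.cos(theta)**2

def compensated_transfer(points_per_wavelength):
    theta = np.pi / points_per_wavelength
    # cos^2 * (1 + sin^2) = 1 - sin^4 : flat to 4th order in k*dz
    return np.cos(theta)**2 * (1 + np.sin(theta)**2)

# At 20 points per wavelength:
print(1 - binomial_transfer(20))     # ≈ 0.0245, the ~2.5% reduction above
print(1 - compensated_transfer(20))  # ≈ 6e-4, much smaller
```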
At several places in the PIC cycle, we could perform the Hankel transform in an asynchronous manner, using cuda streams.
For instance, we could launch the Hankel transform of J (after the current deposition) on different cuda streams (maybe one for Jz and one for Jr/Jt + one cuda stream for each azimuthal mode), and then go on with the particle half-push.
From an off-line discussion with @MKirchen :
It would be interesting to take a similar approach to PR #137
The sum_reduce operation (on CPU) is expensive when using a large number of cores, and it is done for each particle species. One solution would be to modify the code so that this operation is performed only once per deposition (just like the divide by volume operation).
The module cProfile allows profiling the Python execution on the CPU side. In particular, we could add
pr = cProfile.Profile()
pr.enable()
at the beginning of the step function, and
pr.disable()
at the end. We could then print the results in a separate file for each MPI proc.
This would have several advantages:
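The profiling approach above could be sketched as follows (step here is a stand-in for fbpic's step function, and rank would come from the MPI communicator):

```python
import cProfile
import io
import pstats

def step():
    # Stand-in for the PIC step function
    return sum(i * i for i in range(1000))

pr = cProfile.Profile()
pr.enable()
step()
pr.disable()

# Collect the sorted results; write them to a separate file per MPI rank
rank = 0  # would be comm.rank in fbpic
stream = io.StringIO()
pstats.Stats(pr, stream=stream).sort_stats('cumulative').print_stats(10)
report = stream.getvalue()
# In practice: open('profile_rank%d.txt' % rank, 'w').write(report)
print(report.splitlines()[0])
```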
We recently observed that the latest pyfftw is sometimes surprisingly slow on CPU.
It would be good to try the MKL FFT (which should normally be automatically called by np.fft.fft).
It seems that, since we merged PR #80 (Implementation of CPU multi-threading), the code raises errors when attempting to run on GPU. These errors seem to be minor typos, but I have not had time to investigate further.
Right now, the ionization kernel on GPU loops over the particles in such a way that different threads have to access memory with a large stride (strided by the batch size). This is inefficient, but it was done so that each thread can calculate a partial sum of the number of ionized particles.
Another option would be for one batch of particles to correspond to one block instead of one thread. In this case, each thread in the block can access memory with a stride of 1 (which should be much more efficient). We can then do an in-block reduction, in a similar way as here.
We recently changed the code, such that the moving window updates and shifts the fields each timestep. However, this is currently done in spatial space, and involves several extra kernels as well as expensive allocation and deallocation of CUDA device memory. To get better performance for the shifting of the fields, we should implement the shifting in spectral space. Furthermore, we would then also want to shift the "previous" charge density rho in spectral space, which we currently redeposit every timestep.
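Shifting in spectral space amounts to multiplying each mode by a phase factor exp(i kz dz), instead of moving data in memory. A small numpy check of this identity (an illustration, not fbpic's actual kernels):

```python
import numpy as np

nz, dz = 64, 0.1
z = np.arange(nz) * dz
f = np.exp(-(z - 3.0)**2 / 0.5)  # some smooth field profile

kz = 2 * np.pi * np.fft.fftfreq(nz, d=dz)
f_hat = np.fft.fft(f)
# Shift the field by one cell (as the moving window advances):
f_shifted = np.fft.ifft(f_hat * np.exp(1j * kz * dz)).real

# For a periodic box, this matches shifting the array directly:
assert np.allclose(f_shifted, np.roll(f, -1))
```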
In the current implementation, we output x(n) and p(n-1/2) for the openPMD diagnostics. This should not be a problem, but at some point it would be nice to correct this, by pushing the particles back by half a timestep before writing the diagnostics, and pushing them forward again by half a timestep after writing the diagnostics. That way, we would output x(n-1/2) and p(n-1/2), so both values are defined at the same time.
We should make sure that we explain in the documentation at which loop-times the quantities are defined in our OpenPMD diagnostics.
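The re-centering could be sketched as follows for the position (a simplified, non-relativistic illustration, not the actual Vay pusher):

```python
# Leapfrog staggering: x is known at t = n*dt, p (hence v) at t = (n-1/2)*dt.
dt = 0.1
x_n = 1.0        # position, known at time n*dt
v_nmhalf = 2.0   # velocity, known at time (n-1/2)*dt

# Push back by half a step before writing the diagnostics...
x_nmhalf = x_n - 0.5 * dt * v_nmhalf
# ... write (x_nmhalf, p_nmhalf) to openPMD ...
# ...then push forward again, so the main loop is unaffected:
x_restored = x_nmhalf + 0.5 * dt * v_nmhalf

assert x_restored == x_n
```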
Initializing the random seed (on the CPU, as well as GPU when used) would make the simulations exactly reproducible.
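On the CPU side, this would essentially amount to seeding numpy's generator (a sketch; the GPU random states, when used, would need to be seeded separately):

```python
import numpy as np

def seeded_samples(seed, n=5):
    # Seeding the generator makes the "noise" exactly reproducible
    np.random.seed(seed)
    return np.random.normal(size=n)

# Two runs with the same seed produce identical samples:
a = seeded_samples(42)
b = seeded_samples(42)
assert np.array_equal(a, b)
```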
As a reminder, the random number generator is currently used for:
The file tests/test_boosted.py
tests the growth of the NCI in a periodic box.
Here is the result when running it on the CPU:
Here is the result when running it on the GPU:
I think that part of the problem comes from the fact that the instability starts from numerical noise, which is different on the GPU and on the CPU. But still the fact that the pseudo-galilean scheme fails to suppress the instability on GPU is weird...
I'll investigate.
When using the checkpoint/restarts with the boosted-frame, the following scenario could occur:
This would make the tests much faster, as we would avoid reinstalling the dependencies every time.
In particular:
- weight

This includes:
- the weights function (it is currently taking a substantial fraction of the time and is not optimized). It could be implemented directly in the deposition functions, as is done for the CUDA kernels. (Implemented in PR #80)

This could be done with a standard simulation script with a moving window and a box full of electrons.
Even if the optimal number of threads might not generalize to other situations, I think it will still be better than the current arbitrary choice (i.e. 256 threads per block for all kernels).
We could use this feature for the sum_reduce step after the deposition, with multi-threaded execution on CPU.
The estimator that computes the remaining time uses the time taken by the first step. However, the first step is special, since it includes JIT compilation; it should not be included.
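A minimal sketch of an estimator that skips the first, compilation-heavy step (function name and signature are hypothetical):

```python
def estimate_remaining_time(step_durations, steps_left):
    """Estimate the remaining run time from the measured step durations,
    excluding the first step, which includes JIT compilation."""
    if len(step_durations) > 1:
        step_durations = step_durations[1:]
    return steps_left * sum(step_durations) / len(step_durations)

# The slow first step (JIT) no longer skews the estimate:
print(estimate_remaining_time([10.0, 1.0, 1.0], 5))  # → 5.0
```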
This includes:
At the beginning of the loop, the code should print key information about the simulation, including:
Feel free to suggest other important information. However, bear in mind that this should remain concise; only useful information should be printed.
This would allow parallelizing over threads within numba functions.
Right now, the field push functions on the CPU are written in numpy, inside of fields.py. I think it would be better to have them written in numba, in a separate file. Although this is a minor point, I see two advantages to this:
- fields.py would be shorter and more readable.

These different quantities are used by the moving window to inject particles.
However, these values could in principle be different for each species. For instance, for species that are sorted more rarely (e.g. photons), prefix_sum_shift is different from that of species which are sorted frequently.
This includes:
Eventually, we will need to add support for these modes on GPU, but this probably requires more profound changes (e.g. merging all azimuthal modes into a single array)
To give this repository more visibility, it would be nice to be able to cite it in publications. To do this in a somewhat standardised form, a DOI would be good to have.
This guide describes the necessary steps to get a DOI via a service called Zenodo.
As I understand it, Zenodo takes the releases from the repository, archives them, and links them with a DOI. The DOI link then points to a page on Zenodo, where citations can be generated in the standard formats, e.g. bibtex.
One downside I see in this is that the DOI links to the Zenodo page, which only then, indirectly, links to github.
Here, as an example, is the DOI of the openPMD standard.
Another option would be to create a custom bibtex snippet which we are happy with, and then put it somewhere in this repo, e.g. in the README. That way, citations would also be somewhat consistent.
Pinned arrays cannot be swapped out to disk, and therefore the CPU-to-GPU transfer of data can be faster.
To have a more flexible way of initializing laser beams in a simulation, we could introduce laser classes which are passed to the antenna or to the direct initialization.
I would propose something like this:
In profiles.py, define classes like:
class GenericLaser(object):
    def __init__(self, a0, w0, lam, boost, ...):
        # The class contains all information about the laser.
        self.a0 = a0
        self.w0 = w0
        self.lam = lam  # `lambda` is a reserved word in Python

    def profile(self, z, r, t):
        # A function that only depends on time and space coordinates
        return E
These objects would then be passed to LaserAntenna() or add_laser() as a parameter laser, where laser.profile(z,r,t) would then be called.
For a user, it would look like this:
sim = Simulation(...)
laser = FancyLaserClassIMade(a0=12345)
add_laser(sim, laser=laser)
step(10000)
The benefit of this would be that, I think, it declutters the code a little, since all laser parameters would live only in profiles.py, and that the user can define a custom laser in their input script. Plus, as all the laser information is stored in an object, it could be used for e.g. a simulation preview plot, etc.
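A runnable toy example of what a concrete profile class could look like under this proposal (the class name, the Gaussian envelope, and the field normalization are all hypothetical illustrations):

```python
import numpy as np

class GaussianLaser(object):
    """Hypothetical example of the proposed laser-profile interface."""
    def __init__(self, a0, w0, lam):
        self.a0 = a0
        self.w0 = w0
        self.lam = lam  # `lambda` is a reserved word in Python

    def profile(self, z, r, t):
        # Transverse Gaussian envelope with a carrier oscillation
        # (t is unused in this static toy profile)
        k = 2 * np.pi / self.lam
        return self.a0 * np.exp(-r**2 / self.w0**2) * np.cos(k * z)

laser = GaussianLaser(a0=1.0, w0=5e-6, lam=0.8e-6)
# On axis (r=0), at a field crest (z=0), the profile equals a0:
print(laser.profile(0.0, 0.0, 0.0))  # → 1.0
```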
When particles do not fill the grid radially (e.g. for a particle bunch covering only part of the simulation space), the incl_prefix_sum kernel is very slow.
This is probably because only one thread fills multiple cells, while the other threads in the same block are idle. This, however, needs to be confirmed and optimized.
For boosted-frame simulations, particles often need to be injected through a plane, where the particles behind the plane do not feel the EM fields and the particles past the plane do feel them.
This could be implemented with a modified push_p, which checks the position of the particles before calling the Vay pusher.
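The logic could be sketched like this (a simplified, non-relativistic stand-in for the actual Vay pusher; function and parameter names are hypothetical):

```python
import numpy as np

def push_p_with_plane(z, p, E, z_plane, dt, q_over_m):
    # Only particles past the injection plane feel the EM fields;
    # particles behind the plane keep their momentum unchanged.
    feels_field = z >= z_plane
    return p + np.where(feels_field, q_over_m * E * dt, 0.0)

z = np.array([-1.0, 0.5, 2.0])   # one particle behind the plane, two past it
p = np.zeros(3)
p_new = push_p_with_plane(z, p, E=1.0, z_plane=0.0, dt=1.0, q_over_m=1.0)
print(p_new)  # → [0. 1. 1.]
```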