fbpic / fbpic
Spectral, quasi-3D Particle-In-Cell code, for CPU and GPU
Home Page: http://fbpic.github.io
License: Other
The CUDA libraries in Accelerate are now open-source (and renamed pyculib)!
We should switch to pyculib, which can be installed with:
conda install -c numba pyculib
This is minor and maybe doesn't even require any changes, but I want to bring it up anyway, as I just ran into it.
The example input file only initialises the laser on the first start, like:
# Load initial fields
if use_restart is False:
    # Add a laser to the fields of the simulation
    add_laser( sim, a0, w0, ctau, z0 )
else:
    # Load the fields and particles from the latest checkpoint file
    restart_from_checkpoint( sim )
I needed to use an external electron beam instead of a laser, so I just naively replaced add_laser() by add_elec_bunch(). This is fine for the first start, but leads to problems when restarting from a checkpoint. This is because restart_from_checkpoint() only loads particle data that is in sim.ptcl, which only contains the plasma particles when jumping into the else statement. A workaround is to just reinitialise the particle beam and then have restart_from_checkpoint() overwrite the data.
My suggestion would be to have restart_from_checkpoint() add a fresh Particles object for every species in the checkpoint openPMD file.
Another way to avoid this confusion could be to change the example file to something like:
# Load initial fields
# Add a laser to the fields of the simulation
add_laser( sim, a0, w0, ctau, z0 )
if use_restart is True:
    # Load the fields and particles from the latest checkpoint file
    restart_from_checkpoint( sim )
This way, the setup is always the same, and restart_from_checkpoint() just updates it to the current iteration.
Right now the ionization test checks the number of ionized ions from the particle arrays directly. It would be good to check that the number of ionized ions is consistent in the openPMD file.
I think at some point it would make sense to add a staggered version of the (Galilean)-PSATD as well as the PSTD solver. This will enable us to make more comparisons between the different solvers in the future.
@RemiLehe What do you think?
In pull request #6, the field push routine for the standard PSATD is replaced by the field push routine for the Galilean/comoving scheme.
However, I think this makes it less readable for people who are just interested in the standard PSATD scheme. Although the Galilean/comoving coefficients reduce to those of the standard PSATD when ps.V is 0, this is not obvious at first glance. I think keeping a separate, simpler function for the standard PSATD would be easier.
@MKirchen : Any opinion on this?
This could be added to the Advanced section.
We should probably also add a safeguard that prevents the user from having c*dt > dr in the boosted-frame case (since this becomes unstable).
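A minimal sketch of what such a safeguard could look like (the function name, signature, and error message are hypothetical; the actual check would live in the Simulation initialization):

```python
c = 299792458.  # speed of light in m/s

def check_boosted_timestep(dt, dr, gamma_boost=None):
    """Hypothetical safeguard: refuse unstable boosted-frame setups."""
    if gamma_boost is not None and c * dt > dr:
        raise ValueError(
            "In the boosted frame, c*dt (%g m) must not exceed dr (%g m): "
            "this setup is unstable." % (c * dt, dr))

# A timestep with c*dt > dr is rejected in the boosted frame:
try:
    check_boosted_timestep(dt=1e-14, dr=1e-6, gamma_boost=10.)
except ValueError as err:
    print("rejected:", err)
```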
Right now, the following code prints information about the number of MPI processes and the number of threads used:
if comm.rank == 0:
    if use_cuda:
        message = "\nRunning fbpic-%s on GPU " %__version__
    else:
        message = "\nRunning fbpic-%s on CPU " %__version__
    message += "with %d proc" %comm.size
    if use_threading and not use_cuda:
        message += " (%d threads per proc)" %numba.config.NUMBA_NUM_THREADS
    message += ".\n"
    print( message )
We should change it to the following for better readability:
(Note that I exchanged "proc" with "MPI processes")
if comm.rank == 0:
    if use_cuda:
        message = "\nRunning fbpic-%s on GPU " %__version__
    else:
        message = "\nRunning fbpic-%s on CPU " %__version__
    message += "with %d MPI processes" %comm.size
    if use_threading and not use_cuda:
        message += " (%d threads per MPI process)" %numba.config.NUMBA_NUM_THREADS
    message += ".\n"
    print( message )
Hi,
Currently it is not possible to use use_all_mpi_ranks=True together with checkpoints.
The problem is that use_all_mpi_ranks=True causes the Simulation object to set self.mpi_comm=None. However, the checkpoint routines try to call sim.comm.mpi_comm.Barrier(), which is then not defined. This call is used to make the other ranks wait for rank 0 while it creates directories and checks files.
I'm not too familiar with MPI, but I think we could either create a communicator even when using use_all_mpi_ranks=True, or have every rank check its own files and create its own directory.
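The second option could look roughly like this (hypothetical function and directory layout; no Barrier is needed, since each rank only touches its own files):

```python
import os

def prepare_checkpoint_dir(rank, base='checkpoints'):
    # Each rank creates and checks its own directory, instead of
    # waiting on rank 0 behind an MPI Barrier.
    checkpoint_dir = os.path.join(base, 'proc%d' % rank)
    os.makedirs(checkpoint_dir, exist_ok=True)
    return checkpoint_dir

# Rank 3 would write its checkpoints under checkpoints/proc3
print(prepare_checkpoint_dir(3))
```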
This option was useful in the early stages of the code, when debugging. However, now it is cluttering the code, and making the field push routines unnecessarily long. I would remove it.
I needed to use a while loop (instead of a for loop) in the 3rd-order shape factor gathering functions, so that they compile properly with numba's prange function. This is probably related to a bug in numba. We should change this back to a for loop once the bug is fixed.
The functions that create new electron bunches and calculate their space charge field have several issues that need to be fixed:
- w has not been updated with respect to the new definition of w (see #98)

@Escapado and I did several implementations of the charge and current deposition for linear and higher-order (mainly cubic) shape factors. We benchmarked the different implementations, and it was not easy to beat the original implementation of the deposition in terms of speed. (Actually, I was quite surprised that the original implementation is really fast compared to all the other implementations we tried.)
In summary, we developed a new implementation that relies on both looping over the sorted particles and atomically adding the summed local arrays to the global grid array. Thereby, we can remove the add_J and add_rho kernels. Also, the implementation can be written in a nice way with cuda thread-local arrays and for loops. However, thread-local arrays are stored in global memory instead of registers, which makes this implementation 2 times slower than the original one.
We solved this performance issue by explicitly unrolling the for loops and storing the different values directly in registers. With this, we now get the same performance as with the original implementation; however, the code is again very long and does not look nice (but performance is more important in this case!).
Furthermore, we made two decisions:
The new implementation is restricted to work either with linear shapes or with cubic shapes in both dimensions. (I think this restriction is okay, as we don't really need quadratic shapes, and it should be fine to always have the same shape factors in both dimensions.)
We decided to keep the original implementation (without atomic adds), in order to be able to compare the different implementations on more use cases in the future. (This is up for discussion.)
@RemiLehe Any thoughts on this?
These methods are superseded by the new threading methods of PR #80.
Note: in order to still be able to use old versions of numba (i.e. <0.34), we should do something like the following in threading_utilities.py:
if numba.__version__ > '0.34': # The syntax may be a bit different
    threading_installed = True
    prange = numba.prange
else:
    threading_installed = False
    prange = range
Update: We still need to remove:
I recently observed that the boosted particle diagnostics fails to retrieve all particles. For instance, in the case of a Gaussian bunch with 30000 macroparticles, the boosted particle diagnostics write approx 25000 macroparticles only.
This issue only occurs on GPU; on CPU, the full 30000 macroparticles are written.
Right now it is not working, because the charge is considered constant for the whole species, in the boosted-frame openPMD output.
This could potentially speed up execution time and startup time respectively.
As mentioned here, because there is no global flag to turn these options on, we would need to redefine the jit function within fbpic, using functools.
Right now, the charge/current smoothing is equivalent to a binomial filter. This can affect medium-range frequencies by some small amount (e.g. at 20 points per wavelength, the amplitude of the current produced by the laser field is reduced by 2.5% by the smoother).
One option would be to add a compensator, which modifies the smoothing so that medium-range frequencies are less affected.
However, when smoothing charge density distributions with sharp edges, this may result in small regions of opposite-sign density, near the edges of the distribution, which might confuse the user.
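For reference, the numbers above can be checked from the transfer functions of the standard 1-2-1 binomial filter and of a compensated version (a sketch of the textbook stencils, not fbpic's actual implementation):

```python
import numpy as np

# Transfer function of the standard 1-2-1 binomial smoother, and of a
# compensated version (compensator stencil -1/4, 3/2, -1/4).
def binomial_transfer(points_per_wavelength):
    theta = np.pi / points_per_wavelength  # k*dz/2
    return np.cos(theta)**2

def compensated_transfer(points_per_wavelength):
    theta = np.pi / points_per_wavelength
    # cos^2 * (1 + sin^2) = 1 - sin^4 : flat to 4th order in k*dz
    return np.cos(theta)**2 * (1 + np.sin(theta)**2)

# At 20 points per wavelength:
print(1 - binomial_transfer(20))     # ≈ 0.0245, the ~2.5% reduction above
print(1 - compensated_transfer(20))  # ≈ 6e-4, much smaller
```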
At several places in the PIC cycle, we could perform the Hankel transform in an asynchronous manner, using cuda streams.
For instance, we could launch the Hankel transform of J (after the current deposition) on different cuda streams (maybe one for Jz and one for Jr/Jt + one cuda stream for each azimuthal mode), and then go on with the particle half-push.
From an off-line discussion with @MKirchen :
It would be interesting to take a similar approach to PR #137
The sum_reduce operation (on CPU) is expensive when using a large number of cores, and it is done for each particle species. One solution would be to modify the code so that this operation is performed only once per deposition (just like the divide by volume operation).
The module cProfile allows profiling the Python execution on the CPU side. In particular, we could add
pr = cProfile.Profile()
pr.enable()
at the beginning of the step function, and
pr.disable()
at the end. We could then print the results in a separate file for each MPI proc.
This would have several advantages:
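The profiling approach above could be sketched as follows (step here is a stand-in for fbpic's step function, and rank would come from the MPI communicator):

```python
import cProfile
import io
import pstats

def step():
    # Stand-in for the PIC step function
    return sum(i * i for i in range(1000))

pr = cProfile.Profile()
pr.enable()
step()
pr.disable()

# Collect the sorted results; write them to a separate file per MPI rank
rank = 0  # would be comm.rank in fbpic
stream = io.StringIO()
pstats.Stats(pr, stream=stream).sort_stats('cumulative').print_stats(10)
report = stream.getvalue()
# In practice: open('profile_rank%d.txt' % rank, 'w').write(report)
print(report.splitlines()[0])
```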
We recently observed that the latest pyfftw is sometimes surprisingly slow on CPU.
It would be good to try the MKL FFT (which should normally be automatically called by np.fft.fft).
It seems that, since we merged PR #80 (Implementation of CPU multi-threading), the code raises errors when attempting to run on GPU. These errors seem to be minor typos, but I have not had time to investigate further.
Right now, the ionization kernel on GPU loops over the particles in such a way that different threads have to access memory with a large stride (strided by the batch size). This is inefficient, but it was done so that each thread can calculate a partial sum of the number of ionized particles.
Another option would be for one batch of particles to correspond to one block instead of one thread. In this case, each thread in the block can access memory with a stride of 1 (which should be much more efficient). We can then do an in-block reduction, in a similar way as here.
We recently changed the code, such that the moving window updates and shifts the fields each timestep. However, this is currently done in spatial space, and involves several extra kernels as well as expensive allocation and deallocation of CUDA device memory. To get better performance for the shifting of the fields, we should implement the shifting in spectral space. Furthermore, we would then also want to shift the "previous" charge density rho in spectral space, which we currently redeposit every timestep.
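Shifting in spectral space amounts to multiplying each mode by a phase factor exp(i kz dz), instead of moving data in memory. A small numpy check of this identity (an illustration, not fbpic's actual kernels):

```python
import numpy as np

nz, dz = 64, 0.1
z = np.arange(nz) * dz
f = np.exp(-(z - 3.0)**2 / 0.5)  # some smooth field profile

kz = 2 * np.pi * np.fft.fftfreq(nz, d=dz)
f_hat = np.fft.fft(f)
# Shift the field by one cell (as the moving window advances):
f_shifted = np.fft.ifft(f_hat * np.exp(1j * kz * dz)).real

# For a periodic box, this matches shifting the array directly:
assert np.allclose(f_shifted, np.roll(f, -1))
```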
In the current implementation, we output x(n) and p(n-1/2) for the openPMD diagnostics. This should not be a problem, but at some point it would be nice to correct this, by pushing the particles back by half a timestep before writing the diagnostics, and pushing them forward again by half a timestep after writing the diagnostics. That way, we would output x(n-1/2) and p(n-1/2), so both values are defined at the same time.
We should make sure that we explain in the documentation at which loop-times the quantities are defined in our OpenPMD diagnostics.
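The re-centering could be sketched as follows for the position (a simplified, non-relativistic illustration, not the actual Vay pusher):

```python
# Leapfrog staggering: x is known at t = n*dt, p (hence v) at t = (n-1/2)*dt.
dt = 0.1
x_n = 1.0        # position, known at time n*dt
v_nmhalf = 2.0   # velocity, known at time (n-1/2)*dt

# Push back by half a step before writing the diagnostics...
x_nmhalf = x_n - 0.5 * dt * v_nmhalf
# ... write (x_nmhalf, p_nmhalf) to openPMD ...
# ...then push forward again, so the main loop is unaffected:
x_restored = x_nmhalf + 0.5 * dt * v_nmhalf

assert x_restored == x_n
```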
Initializing the random seed (on the CPU, as well as GPU when used) would make the simulations exactly reproducible.
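On the CPU side, this would essentially amount to seeding numpy's generator (a sketch; the GPU random states, when used, would need to be seeded separately):

```python
import numpy as np

def seeded_samples(seed, n=5):
    # Seeding the generator makes the "noise" exactly reproducible
    np.random.seed(seed)
    return np.random.normal(size=n)

# Two runs with the same seed produce identical samples:
a = seeded_samples(42)
b = seeded_samples(42)
assert np.array_equal(a, b)
```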
As a reminder, the random number generator is currently used for:
The file tests/test_boosted.py
tests the growth of the NCI in a periodic box.
Here is the result when running it on the CPU:
Here is the result when running it on the GPU:
I think that part of the problem comes from the fact that the instability starts from numerical noise, which is different on the GPU and on the CPU. But still the fact that the pseudo-galilean scheme fails to suppress the instability on GPU is weird...
I'll investigate.
When using the checkpoint/restarts with the boosted-frame, the following scenario could occur:
This would make the tests much faster, as we would avoid reinstalling the dependencies every time.
In particular:
- weight

This includes:
- the weights function (it is currently taking a substantial fraction of the time and is not optimized). It could be implemented directly in the deposition functions, as is done for the CUDA kernels. (Implemented in PR #80)

This could be done with a standard simulation script with a moving window and a box full of electrons.
Even if the optimal number of threads might not generalize to other situations, I think it will still be better than the current arbitrary choice (i.e. 256 threads per block for all kernels).
We could use this feature for the sum_reduce step after the deposition, with multi-threaded execution on CPU.
The estimator that computes the remaining time uses the time taken by the first step. However, the first step is special, since it includes JIT compilation; it should not be included.
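A minimal sketch of an estimator that skips the first, compilation-heavy step (function name and signature are hypothetical):

```python
def estimate_remaining_time(step_durations, steps_left):
    """Estimate the remaining run time from the measured step durations,
    excluding the first step, which includes JIT compilation."""
    if len(step_durations) > 1:
        step_durations = step_durations[1:]
    return steps_left * sum(step_durations) / len(step_durations)

# The slow first step (JIT) no longer skews the estimate:
print(estimate_remaining_time([10.0, 1.0, 1.0], 5))  # → 5.0
```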
This includes:
At the beginning of the loop, the code should print key information about the simulation, including:
Feel free to suggest other important information. However, bear in mind that this should remain concise; only useful information should be printed.
This would allow parallelizing over threads within numba functions.
Right now, the field push functions on the CPU are written in numpy, inside of fields.py. I think it would be better to have them written in numba, in a separate file. Although this is a minor point, I see two advantages to this:
- fields.py would be shorter and more readable.

These different quantities are used by the moving window to inject particles.
However, these values could in principle be different for each species. For instance, for species that are sorted more rarely (e.g. photons), prefix_sum_shift is different from that of species which are sorted frequently.
This includes:
Eventually, we will need to add support for these modes on GPU, but this probably requires more profound changes (e.g. merging all azimuthal modes into a single array)
To give this repository more visibility, it would be nice to be able to cite it in publications. To do this in a somewhat standardised form, a DOI would be good to have.
This guide describes the necessary steps to get a DOI via a service called Zenodo.
As I understand it, Zenodo takes the releases from the repository, archives them, and links them with a DOI. The DOI link then points to a page on Zenodo, where citations can be generated in the standard formats, e.g. bibtex.
One downside I see in this is that the DOI links to the Zenodo page, which only then, indirectly, links to github.
Here, as an example, is the DOI of the openPMD standard.
Another option would be to create a custom bibtex snippet which we are happy with, and then put it somewhere in this repo, e.g. in the README. That way, citations would also be somewhat consistent.
Pinned arrays cannot be swapped out to disk, and therefore the CPU-to-GPU transfer of data can be faster.
To have a more flexible way of initializing laser beams in a simulation, we could introduce laser classes which are passed to the antenna or to the direct initialization.
I would propose something like this:
In profiles.py, define classes like:
class GenericLaser(object):
    def __init__(self, a0, w0, lam, boost, ...):
        # The class contains all information about the laser.
        self.a0 = a0
        self.w0 = w0
        self.lam = lam  # `lambda` is a reserved word in Python

    def profile(self, z, r, t):
        # A function that only depends on time and space coordinates
        return E
These objects would then be passed to LaserAntenna() or add_laser() as a parameter laser, where laser.profile(z,r,t) would then be called.
For a user, it would look like this:
sim = Simulation(...)
laser = FancyLaserClassIMade(a0=12345)
add_laser(sim, laser=laser)
step(10000)
The benefit of this would be that, I think, it declutters the code a little, since all laser parameters would live only in profiles.py, and that the user can define a custom laser in their input script. Plus, as all the laser information is stored in an object, it could be used for e.g. a simulation preview plot, etc.
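A runnable toy example of what a concrete profile class could look like under this proposal (the class name, the Gaussian envelope, and the field normalization are all hypothetical illustrations):

```python
import numpy as np

class GaussianLaser(object):
    """Hypothetical example of the proposed laser-profile interface."""
    def __init__(self, a0, w0, lam):
        self.a0 = a0
        self.w0 = w0
        self.lam = lam  # `lambda` is a reserved word in Python

    def profile(self, z, r, t):
        # Transverse Gaussian envelope with a carrier oscillation
        # (t is unused in this static toy profile)
        k = 2 * np.pi / self.lam
        return self.a0 * np.exp(-r**2 / self.w0**2) * np.cos(k * z)

laser = GaussianLaser(a0=1.0, w0=5e-6, lam=0.8e-6)
# On axis (r=0), at a field crest (z=0), the profile equals a0:
print(laser.profile(0.0, 0.0, 0.0))  # → 1.0
```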
When particles do not fill the grid radially (e.g. for a particle bunch covering only part of the simulation space), the incl_prefix_sum kernel is very slow.
This is probably because only one thread fills multiple cells, while the other threads in the same block are idle. This, however, needs to be confirmed and optimized.
For boosted-frame simulations, particles often need to be injected through a plane, where the particles behind the plane do not feel the EM fields and the particles past the plane do feel them.
This could be implemented with a modified push_p, which checks the position of the particles before calling the Vay pusher.
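The logic could be sketched like this (a simplified, non-relativistic stand-in for the actual Vay pusher; function and parameter names are hypothetical):

```python
import numpy as np

def push_p_with_plane(z, p, E, z_plane, dt, q_over_m):
    # Only particles past the injection plane feel the EM fields;
    # particles behind the plane keep their momentum unchanged.
    feels_field = z >= z_plane
    return p + np.where(feels_field, q_over_m * E * dt, 0.0)

z = np.array([-1.0, 0.5, 2.0])   # one particle behind the plane, two past it
p = np.zeros(3)
p_new = push_p_with_plane(z, p, E=1.0, z_plane=0.0, dt=1.0, q_over_m=1.0)
print(p_new)  # → [0. 1. 1.]
```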