Comments (8)
Thanks for the update, and for the great work!
I think it would be best to wait until release 0.2.0 is done (hopefully by the end of the week). Once this is done, could you guys:
- Merge version 0.2.0 into your branch
- Redo the performance tests with the merged version (just in case)
- Push this branch to the main repository (since we might have to work together on the branch, I think it is more convenient to push it there than pushing it on a fork)
- Do a pull request to `dev`
Then, based on this, we can discuss the code in more detail.
Does that sound good to you?
Also, how much work do you think it would be to do the 3rd-order deposition with the original scheme (i.e. having 8 copies of rho and J in this case, to avoid atomic adds)? If it is not too much work, it would allow us to compare the performance of both implementations for the 3rd order.
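For context, the "copies" scheme avoids atomic adds by giving each relative deposition offset its own copy of the grid, so that no two threads ever write to the same location, and then summing the copies in a second kernel. Here is a minimal sketch of that reduction step, with hypothetical names and real-valued arrays for simplicity (not FBPIC's actual API):

```python
from numba import cuda

@cuda.jit
def add_rho_copies(rho_copies, rho):
    # rho_copies has shape (n_copies, Nz, Nr); this reduction is the
    # extra kernel that an atomic-add scheme would eliminate
    iz, ir = cuda.grid(2)
    if iz < rho.shape[0] and ir < rho.shape[1]:
        total = 0.
        for k in range(rho_copies.shape[0]):
            total += rho_copies[k, iz, ir]
        rho[iz, ir] = total
```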
So preliminary tests show that using atomics is actually faster, both for bigger grid sizes and for smaller ones.
To explain how the new scheme works, here is a quick rundown:
- For each cell, create local variables for the 4 (linear) or 16 (cubic) grid points that we want to deposit into.
- Loop over all the particles in that cell and deposit into the local variables.
- Atomically add the results each thread produces to the field array.

This eliminates the need for a second kernel that adds together the copies of the grid. Furthermore, it simplifies the procedure slightly compared to the old version (see the sketch at the end of this comment).
To maximize performance, the new deposition kernels do not use `cuda.local.array` anymore. Instead, there are named variables for all the local copies. This way, the values are stored in registers or the L1/L2 caches, which are generally faster than global memory. Additionally, I found that unrolling loops resulted in a performance increase; ideally, the compiler should take care of this to achieve greater instruction-level parallelism. Another performance gain came from removing the imaginary part of rho and J for mode 0, as it should be zero anyway.
However, removing loops and `cuda.local.array` results in a huge wall of code: the kernel for depositing J with cubic particle shapes is about 800 lines long.
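To make the scheme concrete, here is a minimal sketch of a deposition kernel written in this style, for linear shapes and a single real-valued array; this is not FBPIC's actual kernel, and all names (`deposit_rho_linear`, `cell_first`, `cell_last`, etc.) are made up for illustration. It uses named register variables instead of `cuda.local.array`, with the accumulation and the final atomic adds fully unrolled:

```python
from numba import cuda

@cuda.jit
def deposit_rho_linear(z, r, w, cell_first, cell_last, rho, invdz, invdr, Nr):
    """One thread per cell; particles are assumed pre-sorted by cell,
    with cell_first/cell_last giving each cell's particle range.
    Guard cells and the azimuthal modes are omitted for brevity."""
    ic = cuda.grid(1)
    if ic < cell_first.shape[0]:
        # Indices of the lower-left grid point of this cell
        iz = ic // Nr
        ir = ic % Nr
        # Named register variables for the 4 (linear) target points;
        # the cubic version needs 16 of these
        r_00 = 0.; r_01 = 0.; r_10 = 0.; r_11 = 0.
        # Accumulate all particles of this cell into the registers
        for ip in range(cell_first[ic], cell_last[ic]):
            sz = z[ip] * invdz - iz   # fractional offset in z, in [0, 1)
            sr = r[ip] * invdr - ir   # fractional offset in r, in [0, 1)
            r_00 += w[ip] * (1. - sz) * (1. - sr)
            r_01 += w[ip] * (1. - sz) * sr
            r_10 += w[ip] * sz * (1. - sr)
            r_11 += w[ip] * sz * sr
        # One atomic add per target point resolves the races between
        # neighbouring cells: no grid copies, no second kernel
        cuda.atomic.add(rho, (iz, ir), r_00)
        cuda.atomic.add(rho, (iz, ir + 1), r_01)
        cuda.atomic.add(rho, (iz + 1, ir), r_10)
        cuda.atomic.add(rho, (iz + 1, ir + 1), r_11)
```

Since each thread owns a whole cell, the atomics fire once per cell and target point rather than once per particle, which keeps contention low.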
The newest commit in the higher-order shape branch includes a gathering routine for cubic particle shapes. I tried to optimize the kernel's performance by playing around with different implementation details. For example, I tried to eliminate `if/else` statements by using absolute values when adding field values, or to unroll loops and use registers instead of `cuda.local.array`. However, it turns out that these approaches in fact decrease performance: for this specific kernel, using loops, `cuda.local.array`, and `if/else` when taking absolute values is about 50-70% faster than the two other implementation approaches.
Further performance improvements will be investigated.
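For concreteness, here is a minimal sketch of the loop-based style that turned out to be fastest for gathering: one thread per particle, with the cubic shape factors held in `cuda.local.array`. The shape factors below are the standard cubic B-spline weights; the kernel and all names are illustrative rather than FBPIC's actual code, and boundary handling is omitted:

```python
import math
from numba import cuda, float64

@cuda.jit(device=True)
def cubic_shape(s, S):
    # Fill S (a length-4 local array) with the cubic B-spline
    # weights for a fractional offset s in [0, 1)
    S[0] = (1. - s)**3 / 6.
    S[1] = (4. - 6. * s**2 + 3. * s**3) / 6.
    S[2] = (1. + 3. * s + 3. * s**2 - 3. * s**3) / 6.
    S[3] = s**3 / 6.

@cuda.jit
def gather_Ez_cubic(z, r, Ez_grid, Ez_part, invdz, invdr):
    ip = cuda.grid(1)
    if ip < z.shape[0]:
        zn = z[ip] * invdz
        rn = r[ip] * invdr
        iz = int(math.floor(zn))
        ir = int(math.floor(rn))
        # cuda.local.array plus plain loops: far less code than the
        # fully unrolled variant, and faster for this particular kernel
        Sz = cuda.local.array(4, dtype=float64)
        Sr = cuda.local.array(4, dtype=float64)
        cubic_shape(zn - iz, Sz)
        cubic_shape(rn - ir, Sr)
        F = 0.
        for k in range(4):
            for m in range(4):
                F += Sz[k] * Sr[m] * Ez_grid[iz - 1 + k, ir - 1 + m]
        Ez_part[ip] = F
```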
Great! Thanks a lot for the implementation, and for your careful performance optimization!
The new kernel looks great, and is very readable.
One very minor remark: would you mind renaming the kernel `gather_field_gpu` to `gather_field_gpu_linear`, for clarity?
@RemiLehe Good idea. Just added a commit that does that.
Also, I changed the implementation of the CPU gathering and deposition methods to support cubic particle shapes as well. These new methods replace the old ones, as they should be strictly equivalent for the linear case. In principle, these methods could be used for arbitrary-order particle shapes; however, the `weights()` utility function would need to be updated to support that, as the shape functions are hard-coded right now.
Maybe that could be done in the future, but creating a general-purpose algorithm is a little tricky, as the formulas for higher orders involve convolutions of functions (see the sketch below).
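To illustrate why: the shape of order n is the n-fold convolution of the top-hat (order-0) shape, i.e. a uniform B-spline, so the hard-coded factors could in principle be replaced by a recurrence. A small sketch of what a generalized `weights()` might compute (a hypothetical helper, not checked against FBPIC's index conventions):

```python
def bspline_weights(s, order):
    """Weights at the (order + 1) grid points touched by a particle
    with fractional offset s in [0, 1), for the uniform B-spline
    shape of the given order. Each pass of the outer loop performs
    one more convolution with the top-hat shape (de Boor recurrence)."""
    w = [1.0]
    for p in range(1, order + 1):
        w_new = [0.0] * (p + 1)
        for j in range(p + 1):
            left = w[j - 1] if j >= 1 else 0.0
            right = w[j] if j <= p - 1 else 0.0
            w_new[j] = ((s + p - j) * left + (j + 1 - s) * right) / p
        w = w_new
    return w

# order=1 recovers the linear factors [1 - s, s]; order=3 recovers
# the hard-coded cubic factors, e.g. (1 - s)**3 / 6 at the leftmost point.
```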
I think it would be a good idea for me to do a couple more benchmarks on the exact performance differences for a few different cases, and to consolidate the results in a document.
Yes, doing some benchmarking and sharing the results would be a great idea!
We should probably also include the use of the 3rd-order shape (at least on the CPU) in the existing automated tests. I can do this, if that is okay with you.
Sure, go ahead!
I'll try to do all the benchmarking next Tuesday and post the results here.
The following benchmarks were done using one Nvidia K20x. I measured the total compute time in milliseconds for the deposition (J and rho combined) and the gathering kernels with linear and cubic order in the periodic plasma wave test at three different grid sizes:
test_periodic_plasma_wave.py at 200x64 cells

| Kernel | Linear | Cubic |
|---|---|---|
| Deposition | 1.42 ms | 6.62 ms |
| Gathering | 0.53 ms | 2.38 ms |

test_periodic_plasma_wave.py at 200x512 cells

| Kernel | Linear | Cubic |
|---|---|---|
| Deposition | 9.57 ms | 42.14 ms |
| Gathering | 4.04 ms | 18.06 ms |

test_periodic_plasma_wave.py at 2000x512 cells

| Kernel | Linear | Cubic |
|---|---|---|
| Deposition | 93.24 ms | 419.33 ms |
| Gathering | 40.13 ms | 180.58 ms |
As one can see, the slowdown from using cubic particle shapes is around fourfold, which is to be expected considering that each particle deposits to 16 instead of 4 grid points in that case (and analogously for gathering).