
Comments (8)

RemiLehe commented on July 18, 2024

Thanks for the update, and for the great work!
I think it would be best to wait until release 0.2.0 is done (hopefully by the end of the week). Once this is done, could you guys:

  • Merge version 0.2.0 into your branch
  • Redo the performance tests with the merged version (just in case)
  • Push this branch to the main repository (since we might have to work together on the branch, I think it is more convenient to push it there than pushing it on a fork)
  • Open a pull request to dev

Then, based on this, we can discuss the code in more detail.

Does that sound good to you?

Also, how much work do you think it would be to do the 3rd-order deposition with the original scheme (i.e. having 8 copies of rho and J in this case, to avoid atomic adds)? If it is not too much work, it would allow us to compare the performance of both implementations in the 3rd-order case.
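For context, the copy-based scheme avoids write conflicts by letting different subsets of cells deposit into separate copies of the grid, and then summing the copies in a second kernel. A minimal sketch of that reduction step, assuming a real-valued grid and hypothetical array names:

```python
from numba import cuda

@cuda.jit
def sum_grid_copies(rho_copies, rho):
    # rho_copies: (ncopies, Nz, Nr) per-copy grids, written without atomics
    # rho:        (Nz, Nr) final charge-density array
    iz, ir = cuda.grid(2)
    if iz < rho.shape[0] and ir < rho.shape[1]:
        s = 0.
        for c in range(rho_copies.shape[0]):
            s += rho_copies[c, iz, ir]
        rho[iz, ir] = s
```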


Escapado commented on July 18, 2024

Preliminary tests show that using atomics is actually faster, both for larger grid sizes and for smaller ones.
To explain how the new scheme works, here is a quick rundown:

  1. For each cell, create local variables for the 4 (linear) or 16 (cubic) grid points that we want to deposit into.
  2. Loop over all the particles in that cell and deposit into the local variables.
  3. Atomically add the results each thread produces to the field array.

This eliminates the need for a second kernel that adds together the copies of the grid, and it also slightly simplifies the procedure compared to the old version.
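A minimal sketch of steps 1-3 for the linear (4-point) case, assuming particles pre-sorted by cell, a real-valued grid with one more point than cells along each axis, and illustrative array names (this is not fbpic's actual kernel):

```python
from numba import cuda

@cuda.jit
def deposit_rho_per_cell(cell_offsets, sorted_w, Sz0, Sz1, Sr0, Sr1, rho):
    # cell_offsets[i]:cell_offsets[i+1] is the particle range of cell i;
    # Sz*/Sr* hold each particle's precomputed shape factors along z and r.
    i = cuda.grid(1)  # one thread per cell
    if i < cell_offsets.shape[0] - 1:
        nr_cells = rho.shape[1] - 1   # cells per radial row
        iz = i // nr_cells
        ir = i % nr_cells
        # Step 1: local accumulators for the 4 grid points this cell
        # touches; these live in registers, not in global memory.
        r00 = 0.; r01 = 0.; r10 = 0.; r11 = 0.
        # Step 2: accumulate all of this cell's particles locally.
        for p in range(cell_offsets[i], cell_offsets[i + 1]):
            w = sorted_w[p]
            r00 += Sz0[p] * Sr0[p] * w
            r01 += Sz0[p] * Sr1[p] * w
            r10 += Sz1[p] * Sr0[p] * w
            r11 += Sz1[p] * Sr1[p] * w
        # Step 3: one atomic add per grid point, since neighbouring
        # cells deposit into overlapping points.
        cuda.atomic.add(rho, (iz, ir), r00)
        cuda.atomic.add(rho, (iz, ir + 1), r01)
        cuda.atomic.add(rho, (iz + 1, ir), r10)
        cuda.atomic.add(rho, (iz + 1, ir + 1), r11)
```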

To maximize performance, the new deposition kernels no longer use cuda.local.array; instead, there are named variables for all the local copies. This way the values are stored in registers and the L1/L2 caches, which are generally faster than global memory. Additionally, I found that unrolling loops resulted in a performance increase, presumably through greater instruction-level parallelism; ideally, the compiler would take care of this itself. Another performance gain came from removing the imaginary part of rho and J for mode 0, as it should be zero anyway.
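To illustrate the difference with a toy gathering example (not one of the actual kernels): variant A keeps per-thread values in a cuda.local.array, which the compiler may spill to slow off-chip local memory, while variant B uses named scalars and a hand-unrolled loop, so the values stay in registers and independent multiplies can overlap:

```python
from numba import cuda, float64

@cuda.jit
def gather_with_local_array(E, out):
    # Variant A: per-thread cuda.local.array, filled and read in loops
    i = cuda.grid(1)
    if i < out.shape[0]:
        S = cuda.local.array(4, float64)
        for k in range(4):
            S[k] = 0.25  # placeholder shape factors
        acc = 0.
        for k in range(4):
            acc += S[k] * E[i, k]
        out[i] = acc

@cuda.jit
def gather_with_registers(E, out):
    # Variant B: named scalars, loop written out by hand
    i = cuda.grid(1)
    if i < out.shape[0]:
        s0 = 0.25; s1 = 0.25; s2 = 0.25; s3 = 0.25
        out[i] = s0 * E[i, 0] + s1 * E[i, 1] + s2 * E[i, 2] + s3 * E[i, 3]
```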

However, removing loops and cuda.local.array results in a huge wall of code: the kernel for depositing J with cubic particle shapes is about 800 lines long.


Escapado commented on July 18, 2024

The newest commit in the higher-order-shape branch includes a gathering routine for cubic particle shapes. I tried to optimize the kernel's performance by experimenting with different implementation details.

For example, I tried to eliminate if/else statements by using absolute values when adding field values, and to unroll loops and use registers instead of cuda.local.array. However, it turns out that these approaches in fact decrease performance: for this specific kernel, using loops, cuda.local.array, and if/else statements (rather than absolute values) is about 50-70% faster than the two alternative implementations.
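To illustrate the first trade-off (with a hypothetical mirror-across-the-axis rule, not fbpic's actual axis treatment):

```python
def read_field_branchy(field, iz, ir):
    # Explicit if/else: divergent branches on the GPU, but simple loads
    if ir < 0:
        return -field[iz, -ir]
    return field[iz, ir]

def read_field_branchless(field, iz, ir):
    # abs() plus an arithmetic sign factor: no branch, extra arithmetic
    sign = (ir >= 0) * 2.0 - 1.0
    return sign * field[iz, abs(ir)]
```

The two functions are equivalent, and in this kernel the branchy style turned out to be the faster one.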

Further performance improvements will be investigated.


RemiLehe commented on July 18, 2024

Great! Thanks a lot for the implementation, and for your careful performance optimization!
The new kernel looks great, and is very readable.

One very minor remark: would you mind renaming the kernel gather_field_gpu to gather_field_gpu_linear for clarity?


Escapado commented on July 18, 2024

@RemiLehe Good idea. Just added a commit that does that.

Also, I changed the implementation of the CPU gathering and deposition methods to support cubic particle shapes as well. These new methods replace the old ones, as they should be strictly equivalent for the linear case. In principle, these methods could be used for arbitrary-order particle shapes, but the weights() utility function would need to be updated to support that, as the shape functions are hard-coded right now.
Maybe that could be done in the future, but creating a general-purpose algorithm is a little tricky, as the formulas for higher orders involve convolutions of functions (see the sketch below).
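For reference, here is what hard-coded B-spline shape factors typically look like; these are the standard forms, and the actual weights() in fbpic may differ in its conventions:

```python
import numpy as np

def shape_linear(x):
    # Order-1 B-spline: hat function spanning 2 cells
    ax = np.abs(x)
    return np.where(ax < 1.0, 1.0 - ax, 0.0)

def shape_cubic(x):
    # Order-3 B-spline: piecewise cubic spanning 4 cells
    ax = np.abs(x)
    inner = 2.0 / 3.0 - ax**2 + 0.5 * ax**3  # |x| < 1
    outer = (2.0 - ax)**3 / 6.0              # 1 <= |x| < 2
    return np.where(ax < 1.0, inner, np.where(ax < 2.0, outer, 0.0))
```

Each order is the previous one convolved with the order-0 top-hat (S_n = S_0 * S_{n-1}), which is why a general-purpose routine is trickier to write than a hard-coded one for a given order.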

I think it would be a good idea for me to run a few more benchmarks on the exact performance differences for a few different cases and to consolidate the results in a document.


RemiLehe commented on July 18, 2024

Yes, doing some benchmarking and publishing the results would be a great idea!

Also, we should probably include the use of the 3rd-order shape (at least on CPU) in the existing automated tests. I can do this, if it is okay with you.


Escapado commented on July 18, 2024

Sure, go ahead!

I'll try to do all the benchmarking next Tuesday and post the results here.


Escapado commented on July 18, 2024

The following benchmarks were run on a single Nvidia K20X. I measured the total compute time in milliseconds for the deposition kernels (J and rho combined) and the gathering kernels, with linear and cubic particle shapes, in the periodic plasma wave test at three different grid sizes:

test_periodic_plasma_wave.py at 200x64 cells
              Linear      Cubic
  Deposition  1.42 ms     6.62 ms
  Gathering   0.53 ms     2.38 ms

test_periodic_plasma_wave.py at 200x512 cells
              Linear      Cubic
  Deposition  9.57 ms     42.14 ms
  Gathering   4.04 ms     18.06 ms

test_periodic_plasma_wave.py at 2000x512 cells
              Linear      Cubic
  Deposition  93.24 ms    419.33 ms
  Gathering   40.13 ms    180.58 ms

As one can see, the slowdown from using cubic particle shapes is around four-fold (between about 4.4x and 4.7x in every case), which is to be expected considering that each particle then deposits to 16 instead of 4 grid points (and analogously for gathering).
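The ratios can be checked directly from the tables above:

```python
# Cubic/linear timing ratios (ms) taken from the tables above
deposition = [(1.42, 6.62), (9.57, 42.14), (93.24, 419.33)]
gathering = [(0.53, 2.38), (4.04, 18.06), (40.13, 180.58)]
for linear, cubic in deposition + gathering:
    print(f"{cubic / linear:.2f}x")  # all between 4.40x and 4.66x
```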

