
Comments (8)

RemiLehe commented on July 18, 2024

Thanks for the update, and for the great work!
I think it would be best to wait until release 0.2.0 is done (hopefully by the end of the week). Once this is done, could you guys:

  • Merge version 0.2.0 into your branch
  • Redo the performance tests with the merged version (just in case)
  • Push this branch to the main repository (since we might have to work together on the branch, I think it is more convenient to push it there than pushing it on a fork)
  • Open a pull request to dev

Then, based on this, we can discuss the code in more detail.

Does that sound good to you?

Also, how much work do you think it would be to do the 3rd-order deposition with the original scheme (i.e. having 8 copies of rho and J in this case, to avoid atomic adds)? If it is not too much work, it would allow us to compare the performance of both implementations in the 3rd-order case.
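For context, the copy-based scheme avoids write conflicts by letting different subsets of cells deposit into separate copies of the grid, and then summing the copies in a second kernel. A minimal sketch of that reduction step, assuming a real-valued grid and hypothetical array names:

```python
from numba import cuda

@cuda.jit
def sum_grid_copies(rho_copies, rho):
    # rho_copies: (ncopies, Nz, Nr) per-copy grids, written without atomics
    # rho:        (Nz, Nr) final charge-density array
    iz, ir = cuda.grid(2)
    if iz < rho.shape[0] and ir < rho.shape[1]:
        s = 0.
        for c in range(rho_copies.shape[0]):
            s += rho_copies[c, iz, ir]
        rho[iz, ir] = s
```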


Escapado commented on July 18, 2024

Preliminary tests show that using atomics is actually faster, both for larger grid sizes and for smaller ones.
To explain how the new scheme works, here is a quick rundown:

  1. For each cell, create local variables for the 4 (linear) or 16 (cubic) grid points that we want to deposit into.
  2. Loop over all the particles in that cell and deposit into the local variables.
  3. Atomically add the results each thread produces to the field array.

This eliminates the need for a second kernel that adds together the copies of the grid, and it also slightly simplifies the procedure compared to the old version.
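A minimal sketch of steps 1-3 for the linear (4-point) case, assuming particles pre-sorted by cell, a real-valued grid with one more point than cells along each axis, and illustrative array names (this is not fbpic's actual kernel):

```python
from numba import cuda

@cuda.jit
def deposit_rho_per_cell(cell_offsets, sorted_w, Sz0, Sz1, Sr0, Sr1, rho):
    # cell_offsets[i]:cell_offsets[i+1] is the particle range of cell i;
    # Sz*/Sr* hold each particle's precomputed shape factors along z and r.
    i = cuda.grid(1)  # one thread per cell
    if i < cell_offsets.shape[0] - 1:
        nr_cells = rho.shape[1] - 1   # cells per radial row
        iz = i // nr_cells
        ir = i % nr_cells
        # Step 1: local accumulators for the 4 grid points this cell
        # touches; these live in registers, not in global memory.
        r00 = 0.; r01 = 0.; r10 = 0.; r11 = 0.
        # Step 2: accumulate all of this cell's particles locally.
        for p in range(cell_offsets[i], cell_offsets[i + 1]):
            w = sorted_w[p]
            r00 += Sz0[p] * Sr0[p] * w
            r01 += Sz0[p] * Sr1[p] * w
            r10 += Sz1[p] * Sr0[p] * w
            r11 += Sz1[p] * Sr1[p] * w
        # Step 3: one atomic add per grid point, since neighbouring
        # cells deposit into overlapping points.
        cuda.atomic.add(rho, (iz, ir), r00)
        cuda.atomic.add(rho, (iz, ir + 1), r01)
        cuda.atomic.add(rho, (iz + 1, ir), r10)
        cuda.atomic.add(rho, (iz + 1, ir + 1), r11)
```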

To maximize performance, the new deposition kernels no longer use cuda.local.array; instead, there are named variables for all the local copies. This way the values are stored in registers and the L1/L2 caches, which are generally faster than global memory. Additionally, I found that unrolling loops resulted in a performance increase, presumably through greater instruction-level parallelism; ideally, the compiler would take care of this itself. Another performance gain came from removing the imaginary part of rho and J for mode 0, as it should be zero anyway.
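To illustrate the difference with a toy gathering example (not one of the actual kernels): variant A keeps per-thread values in a cuda.local.array, which the compiler may spill to slow off-chip local memory, while variant B uses named scalars and a hand-unrolled loop, so the values stay in registers and independent multiplies can overlap:

```python
from numba import cuda, float64

@cuda.jit
def gather_with_local_array(E, out):
    # Variant A: per-thread cuda.local.array, filled and read in loops
    i = cuda.grid(1)
    if i < out.shape[0]:
        S = cuda.local.array(4, float64)
        for k in range(4):
            S[k] = 0.25  # placeholder shape factors
        acc = 0.
        for k in range(4):
            acc += S[k] * E[i, k]
        out[i] = acc

@cuda.jit
def gather_with_registers(E, out):
    # Variant B: named scalars, loop written out by hand
    i = cuda.grid(1)
    if i < out.shape[0]:
        s0 = 0.25; s1 = 0.25; s2 = 0.25; s3 = 0.25
        out[i] = s0 * E[i, 0] + s1 * E[i, 1] + s2 * E[i, 2] + s3 * E[i, 3]
```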

However, removing loops and cuda.local.array results in a huge wall of code: the kernel for depositing J with cubic particle shapes is about 800 lines long.


Escapado commented on July 18, 2024

The newest commit in the higher-order-shape branch includes a gathering routine for cubic particle shapes. I tried to optimize the kernel's performance by experimenting with different implementation details.

For example, I tried to eliminate if/else statements by using absolute values when adding field values, and to unroll loops and use registers instead of cuda.local.array. However, it turns out that these approaches in fact decrease performance: for this specific kernel, using loops, cuda.local.array, and if/else statements (rather than absolute values) is about 50-70% faster than the two alternative implementations.
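To illustrate the first trade-off (with a hypothetical mirror-across-the-axis rule, not fbpic's actual axis treatment):

```python
def read_field_branchy(field, iz, ir):
    # Explicit if/else: divergent branches on the GPU, but simple loads
    if ir < 0:
        return -field[iz, -ir]
    return field[iz, ir]

def read_field_branchless(field, iz, ir):
    # abs() plus an arithmetic sign factor: no branch, extra arithmetic
    sign = (ir >= 0) * 2.0 - 1.0
    return sign * field[iz, abs(ir)]
```

The two functions are equivalent, and in this kernel the branchy style turned out to be the faster one.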

Further performance improvements will be investigated.


RemiLehe commented on July 18, 2024

Great! Thanks a lot for the implementation, and for your careful performance optimization!
The new kernel looks great, and is very readable.

One very minor remark: would you mind renaming the kernel gather_field_gpu to gather_field_gpu_linear for clarity?


Escapado commented on July 18, 2024

@RemiLehe Good idea. Just added a commit that does that.

Also, I changed the implementation of the CPU gathering and deposition methods to support cubic particle shapes as well. These new methods replace the old ones, as they should be strictly equivalent for the linear case. In principle, these methods could be used for arbitrary-order particle shapes, but the weights() utility function would need to be updated to support that, as the shape functions are hard-coded right now.
Maybe that could be done in the future, but creating a general-purpose algorithm is a little tricky, as the formulas for higher orders involve convolutions of functions (see the sketch below).
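For reference, here is what hard-coded B-spline shape factors typically look like; these are the standard forms, and the actual weights() in fbpic may differ in its conventions:

```python
import numpy as np

def shape_linear(x):
    # Order-1 B-spline: hat function spanning 2 cells
    ax = np.abs(x)
    return np.where(ax < 1.0, 1.0 - ax, 0.0)

def shape_cubic(x):
    # Order-3 B-spline: piecewise cubic spanning 4 cells
    ax = np.abs(x)
    inner = 2.0 / 3.0 - ax**2 + 0.5 * ax**3  # |x| < 1
    outer = (2.0 - ax)**3 / 6.0              # 1 <= |x| < 2
    return np.where(ax < 1.0, inner, np.where(ax < 2.0, outer, 0.0))
```

Each order is the previous one convolved with the order-0 top-hat (S_n = S_0 * S_{n-1}), which is why a general-purpose routine is trickier to write than a hard-coded one for a given order.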

I think it would be a good idea for me to run a few more benchmarks on the exact performance differences for a few different cases and to consolidate the results in a document.


RemiLehe commented on July 18, 2024

Yes, doing some benchmarking and publishing the results would be a great idea!

Also, we should probably include the use of the 3rd-order shape (at least on CPU) in the existing automated tests. I can do this, if it is okay with you.


Escapado commented on July 18, 2024

Sure, go ahead!

I'll try to do all the benchmarking next Tuesday and post the results here.


Escapado commented on July 18, 2024

The following benchmarks were run on a single Nvidia K20X. I measured the total compute time in milliseconds for the deposition kernels (J and rho combined) and the gathering kernels, with linear and cubic particle shapes, in the periodic plasma wave test at three different grid sizes:

test_periodic_plasma_wave.py at 200x64 cells
              Linear      Cubic
  Deposition  1.42 ms     6.62 ms
  Gathering   0.53 ms     2.38 ms

test_periodic_plasma_wave.py at 200x512 cells
              Linear      Cubic
  Deposition  9.57 ms     42.14 ms
  Gathering   4.04 ms     18.06 ms

test_periodic_plasma_wave.py at 2000x512 cells
              Linear      Cubic
  Deposition  93.24 ms    419.33 ms
  Gathering   40.13 ms    180.58 ms

As one can see, the slowdown from using cubic particle shapes is around four-fold (between about 4.4x and 4.7x in every case), which is to be expected considering that each particle then deposits to 16 instead of 4 grid points (and analogously for gathering).
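The ratios can be checked directly from the tables above:

```python
# Cubic/linear timing ratios (ms) taken from the tables above
deposition = [(1.42, 6.62), (9.57, 42.14), (93.24, 419.33)]
gathering = [(0.53, 2.38), (4.04, 18.06), (40.13, 180.58)]
for linear, cubic in deposition + gathering:
    print(f"{cubic / linear:.2f}x")  # all between 4.40x and 4.66x
```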

