Code Monkey home page Code Monkey logo

Comments (8)

VincentSC avatar VincentSC commented on June 15, 2024

This algorithm has not been optimized yet just like various other algorithms, as functional completeness had a higher priority. So no need to benchmark before the final delivery. By remaining on Master, your code will become faster automatically.

from rocprim.

jszuppe avatar jszuppe commented on June 15, 2024

For very short and fixed-size segmented segmented radix sort may not be the best choice. I'd recommend using warp_sort since 32 is smaller than max warp size. It should be the fastest method. You can check benchmark_hip_warp_sort.cpp to see how to use warp_sort in a kernel.

Edit:

Unless you're using rocPRIM via hipCUB. Then the best choice is block_radix_sort since CUB does not provide warp-level sort primitive.

from rocprim.

ntrost57 avatar ntrost57 commented on June 15, 2024

Thank you for the answer. However, for my application, the segment size will eventually vary between 4 and 128 with single peaks larger than blocksize, being not fixed over the full array. I will wait for the final delivery with my tests, thank you.

from rocprim.

jszuppe avatar jszuppe commented on June 15, 2024

Performance of segmented radix sort with a fixed segment size of 32 on rocPRIM and CUB are comparable. Segmented radix sort just isn't designed for short fixed size segments.

Vega

sort_keys<int>(32)/manual_time                      3946 ms          0 ms          1   129.767MB/s   32.4418M items/s
sort_keys<long long>(32)/manual_time                9625 ms          1 ms          1   106.387MB/s   13.2984M items/s
sort_pairs<int, float>(32)/manual_time              4447 ms          0 ms          1   230.265MB/s   28.7832M items/s
sort_pairs<long long, double>(32)/manual_time      19015 ms          1 ms          1   107.707MB/s   6.73166M items/s

980

SortKeys<int>(32)/manual_time                      4381 ms       4382 ms          1   116.863MB/s   29.2157M items/s
SortKeys<long long>(32)/manual_time                6874 ms       6874 ms          1   148.972MB/s   18.6215M items/s
SortPairs<int, float>(32)/manual_time              6035 ms       6035 ms          1   169.678MB/s   21.2097M items/s
SortPairs<long long, double>(32)/manual_time      15661 ms      15662 ms          1   130.768MB/s   8.17303M items/s

from rocprim.

jszuppe avatar jszuppe commented on June 15, 2024

the segment size will eventually vary between 4 and 128

If it will be fixed and that you can use warp_sort or block_sort either way.

from rocprim.

ntrost57 avatar ntrost57 commented on June 15, 2024

Here, I do not agree. For testing I have created a very naive warp based segmented radix sort that does the initial test case within less than 1 ms on Vega10. It is limited to segment sizes of 32 tho.

from rocprim.

ex-rzr avatar ex-rzr commented on June 15, 2024

@ntrost57

vary between 4 and 128 with single peaks larger than blocksize

If this means something like, for example, 90% segments are <= 256, 9% <= 2000, 1% > 2000, then we have a solution for this (I hope good enough, but probably not perfect, as there is always a balance in short-long sequences performance) .

This solution is available here: https://github.com/ROCmSoftwarePlatform/rocPRIM/tree/v0.3.2

rocPRIM (not hipCUB!!!) supports custom configs for device-level functions:

using config = rocprim::segmented_radix_sort_config<6, 5, rocprim::kernel_config<64, 4>>;
rocrpim::segmented_radix_sort_keys<config>(...);

In this example sort segments longer than 64 * 4 (BlockSize * ItemsPerThread) will be sorted in multiple iterations (passes) using radix of 6 and 5 bits (BlockSize cannot be larger than 2^(radix bits) due to implementation restrictions).

Shorter sequences will be sorted using as small ItemsPerThread as possible (by dividing ItemsPerThread by 2), i.e. 200 will be sorted using 4 items per thread, 100 - 2 items per thread, 10 - 1 item.

You need to choose BlockSize and ItemsPerThread that have the best performance in most cases:

  • BlockSize should be small (64) if there are many short sequences and large (256) if many long sequences.
  • ItemsPerThread usually should be large (10-15).

Here "* ms" column means time for 10 repeats of 4M, so divide ms by 10.

using config = rocprim::segmented_radix_sort_config<6, 5, rocprim::kernel_config<64, 4>>;
sort_keys<int>(32)/manual_time          14 ms          0 ms         49   10.8313GB/s   2.70781G items/s
sort_keys<int>(64)/manual_time           7 ms          0 ms         94   21.0067GB/s   5.25167G items/s
sort_keys<int>(128)/manual_time          6 ms          0 ms        128   27.5956GB/s    6.8989G items/s
sort_keys<int>(192)/manual_time          6 ms          0 ms        127   28.0194GB/s   7.00485G items/s
sort_keys<int>(256)/manual_time          4 ms          0 ms        160   35.7357GB/s   8.93392G items/s

Is the performance enough for your task?

As an opposite 256 has much lower performance for short segments:

using config = rocprim::segmented_radix_sort_config<7, 6, rocprim::kernel_config<256, 7>>;
sort_keys<int>(32)/manual_time           114 ms          0 ms          6   1.37003GB/s   350.728M items/s
sort_keys<int>(64)/manual_time            57 ms          0 ms         12   2.72697GB/s   698.105M items/s
sort_keys<int>(128)/manual_time           29 ms          0 ms         24   5.40673GB/s   1.35168G items/s
sort_keys<int>(192)/manual_time           19 ms          0 ms         35   8.06406GB/s   2.01601G items/s
sort_keys<int>(256)/manual_time           15 ms          0 ms         47   10.6808GB/s   2.67021G items/s
sort_keys<int>(1000)/manual_time           8 ms          0 ms         85   18.4266GB/s   4.60665G items/s
sort_keys<int>(10000)/manual_time         17 ms          0 ms         41   9.14953GB/s   2.28738G items/s

Btw, don't forget that if you know in advance the range of key values then you can set end_bit lower than sizeof(Key) * 8.

from rocprim.

ntrost57 avatar ntrost57 commented on June 15, 2024

ex-rzr thank you very much for your hint. With tuning the blocksize and items per thread performance looks much better.

from rocprim.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.