Comments (8)
This algorithm has not been optimized yet just like various other algorithms, as functional completeness had a higher priority. So no need to benchmark before the final delivery. By remaining on Master, your code will become faster automatically.
from rocprim.
For very short and fixed-size segmented segmented radix sort may not be the best choice. I'd recommend using warp_sort
since 32 is smaller than max warp size. It should be the fastest method. You can check benchmark_hip_warp_sort.cpp
to see how to use warp_sort
in a kernel.
Edit:
Unless you're using rocPRIM via hipCUB. Then the best choice is block_radix_sort
since CUB does not provide warp-level sort primitive.
from rocprim.
Thank you for the answer. However, for my application, the segment size will eventually vary between 4 and 128 with single peaks larger than blocksize, being not fixed over the full array. I will wait for the final delivery with my tests, thank you.
from rocprim.
Performance of segmented radix sort with a fixed segment size of 32 on rocPRIM and CUB are comparable. Segmented radix sort just isn't designed for short fixed size segments.
Vega
sort_keys<int>(32)/manual_time 3946 ms 0 ms 1 129.767MB/s 32.4418M items/s
sort_keys<long long>(32)/manual_time 9625 ms 1 ms 1 106.387MB/s 13.2984M items/s
sort_pairs<int, float>(32)/manual_time 4447 ms 0 ms 1 230.265MB/s 28.7832M items/s
sort_pairs<long long, double>(32)/manual_time 19015 ms 1 ms 1 107.707MB/s 6.73166M items/s
980
SortKeys<int>(32)/manual_time 4381 ms 4382 ms 1 116.863MB/s 29.2157M items/s
SortKeys<long long>(32)/manual_time 6874 ms 6874 ms 1 148.972MB/s 18.6215M items/s
SortPairs<int, float>(32)/manual_time 6035 ms 6035 ms 1 169.678MB/s 21.2097M items/s
SortPairs<long long, double>(32)/manual_time 15661 ms 15662 ms 1 130.768MB/s 8.17303M items/s
from rocprim.
the segment size will eventually vary between 4 and 128
If it will be fixed and that you can use warp_sort
or block_sort
either way.
from rocprim.
Here, I do not agree. For testing I have created a very naive warp based segmented radix sort that does the initial test case within less than 1 ms on Vega10. It is limited to segment sizes of 32 tho.
from rocprim.
vary between 4 and 128 with single peaks larger than blocksize
If this means something like, for example, 90% segments are <= 256, 9% <= 2000, 1% > 2000, then we have a solution for this (I hope good enough, but probably not perfect, as there is always a balance in short-long sequences performance) .
This solution is available here: https://github.com/ROCmSoftwarePlatform/rocPRIM/tree/v0.3.2
rocPRIM (not hipCUB!!!) supports custom configs for device-level functions:
using config = rocprim::segmented_radix_sort_config<6, 5, rocprim::kernel_config<64, 4>>;
rocrpim::segmented_radix_sort_keys<config>(...);
In this example sort segments longer than 64 * 4 (BlockSize * ItemsPerThread) will be sorted in multiple iterations (passes) using radix of 6 and 5 bits (BlockSize cannot be larger than 2^(radix bits) due to implementation restrictions).
Shorter sequences will be sorted using as small ItemsPerThread as possible (by dividing ItemsPerThread by 2), i.e. 200 will be sorted using 4 items per thread, 100 - 2 items per thread, 10 - 1 item.
You need to choose BlockSize and ItemsPerThread that have the best performance in most cases:
- BlockSize should be small (64) if there are many short sequences and large (256) if many long sequences.
- ItemsPerThread usually should be large (10-15).
Here "* ms" column means time for 10 repeats of 4M, so divide ms by 10.
using config = rocprim::segmented_radix_sort_config<6, 5, rocprim::kernel_config<64, 4>>;
sort_keys<int>(32)/manual_time 14 ms 0 ms 49 10.8313GB/s 2.70781G items/s
sort_keys<int>(64)/manual_time 7 ms 0 ms 94 21.0067GB/s 5.25167G items/s
sort_keys<int>(128)/manual_time 6 ms 0 ms 128 27.5956GB/s 6.8989G items/s
sort_keys<int>(192)/manual_time 6 ms 0 ms 127 28.0194GB/s 7.00485G items/s
sort_keys<int>(256)/manual_time 4 ms 0 ms 160 35.7357GB/s 8.93392G items/s
Is the performance enough for your task?
As an opposite 256 has much lower performance for short segments:
using config = rocprim::segmented_radix_sort_config<7, 6, rocprim::kernel_config<256, 7>>;
sort_keys<int>(32)/manual_time 114 ms 0 ms 6 1.37003GB/s 350.728M items/s
sort_keys<int>(64)/manual_time 57 ms 0 ms 12 2.72697GB/s 698.105M items/s
sort_keys<int>(128)/manual_time 29 ms 0 ms 24 5.40673GB/s 1.35168G items/s
sort_keys<int>(192)/manual_time 19 ms 0 ms 35 8.06406GB/s 2.01601G items/s
sort_keys<int>(256)/manual_time 15 ms 0 ms 47 10.6808GB/s 2.67021G items/s
sort_keys<int>(1000)/manual_time 8 ms 0 ms 85 18.4266GB/s 4.60665G items/s
sort_keys<int>(10000)/manual_time 17 ms 0 ms 41 9.14953GB/s 2.28738G items/s
Btw, don't forget that if you know in advance the range of key values then you can set end_bit
lower than sizeof(Key) * 8
.
from rocprim.
ex-rzr thank you very much for your hint. With tuning the blocksize and items per thread performance looks much better.
from rocprim.
Related Issues (20)
- Block_reduce fails to distribute correct answer to all lanes when hipBlockDim > 64 HOT 5
- Invalid use of inline assembly for GFX1010 target HOT 14
- Compiler Build issue HOT 1
- ROCm 3.9.1 fails to compile benchmark_warp_scan HOT 2
- error: reference to __host__ function 'inclusive_scan<rocprim::default_config, double *, double *, thrust::plus<double>>' in __host__ __device__ function HOT 2
- Ignoring return value warning HOT 2
- error: invalid operands to binary expression HOT 6
- Please enable two factor authentication in your github account
- Alternative documentation HOT 2
- error: no matching function for call to 'ceiling_div' HOT 2
- Add support for NAVI22 and NAVI23 i.e. gfx1031 and gfx1032 HOT 1
- rocPRIM reduction (block_reduce_int) issue HOT 2
- Follow the example of rocPRIM custom types example but compile failed. HOT 2
- rocPRIM on NAVI22 = gfx1031 only one test fails - can we fix this? HOT 6
- rocPRIM -- rocThrust dependency: error: no member named 'init_offset_scan_state_kernel' in rocprim::detail HOT 2
- rocPRIM 5.4.3 failed device_adjacent_difference on gfx906 HOT 4
- rocPRIM selects wrong code paths for warp reductions and scas on gfx1036 HOT 3
- Page fault during `rocprim::select` for no obvious reason. HOT 2
- `rocprim::block_load` fails to instantiate HOT 8
- first installation failed
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from rocprim.