Comments (5)
You may run benchmark_hip_device_radix_sort
.
I have one old file with results but I'm not sure the results are up-to-date (but should be close IIRC):
Vega20
sort_keys<int>/manual_time 111 ms 1 ms 6 11.2297GB/s 2.80742G items/s
sort_keys<long long>/manual_time 287 ms 1 ms 2 8.69748GB/s 1113.28M items/s
sort_keys<char>/manual_time 30 ms 1 ms 23 10.3015GB/s 10.3015G items/s
sort_keys<short>/manual_time 55 ms 1 ms 13 11.3174GB/s 5.65868G items/s
sort_pairs<int, float>/manual_time 153 ms 1 ms 5 16.3351GB/s 2.04189G items/s
sort_pairs<int, double>/manual_time 264 ms 1 ms 3 14.2078GB/s 1.18398G items/s
sort_pairs<int, custom_float2>/manual_time 265 ms 1 ms 3 14.1767GB/s 1.18139G items/s
sort_pairs<int, custom_double2>/manual_time 300 ms 1 ms 2 20.8282GB/s 1066.4M items/s
sort_pairs<long long, float>/manual_time 550 ms 1 ms 1 6.81494GB/s 581.542M items/s
sort_pairs<long long, double>/manual_time 653 ms 1 ms 1 7.65336GB/s 489.815M items/s
sort_pairs<long long, custom_float2>/manual_time 630 ms 1 ms 1 7.93732GB/s 507.989M items/s
sort_pairs<long long, custom_double2>/manual_time 650 ms 1 ms 1 11.5404GB/s 492.389M items/s
9300
sort_keys<int>/manual_time 133 ms 1 ms 5 9.37972GB/s 2.34493G items/s
sort_keys<long long>/manual_time 409 ms 2 ms 2 6.11434GB/s 782.635M items/s
sort_keys<char>/manual_time 31 ms 0 ms 23 10.1063GB/s 10.1063G items/s
sort_keys<short>/manual_time 65 ms 1 ms 12 9.60201GB/s 4.801G items/s
sort_pairs<int, float>/manual_time 198 ms 1 ms 3 12.5979GB/s 1.57473G items/s
sort_pairs<int, double>/manual_time 278 ms 1 ms 3 13.4663GB/s 1.12219G items/s
sort_pairs<int, custom_float2>/manual_time 243 ms 1 ms 3 15.4204GB/s 1.28503G items/s
sort_pairs<int, custom_double2>/manual_time 540 ms 1 ms 1 11.5652GB/s 592.14M items/s
sort_pairs<long long, float>/manual_time 571 ms 2 ms 1 6.56262GB/s 560.011M items/s
sort_pairs<long long, double>/manual_time 739 ms 3 ms 1 6.76718GB/s 433.1M items/s
sort_pairs<long long, custom_float2>/manual_time 681 ms 2 ms 1 7.34559GB/s 470.118M items/s
sort_pairs<long long, custom_double2>/manual_time 1268 ms 3 ms 1 5.91699GB/s 252.458M items/s
MI25
sort_keys<int>/manual_time 98 ms 1 ms 7 12.7012GB/s 3.17529G items/s
sort_keys<long long>/manual_time 332 ms 3 ms 2 7.51984GB/s 962.539M items/s
sort_keys<char>/manual_time 21 ms 1 ms 34 15.2263GB/s 15.2263G items/s
sort_keys<short>/manual_time 39 ms 1 ms 18 15.9251GB/s 7.96255G items/s
sort_pairs<int, float>/manual_time 182 ms 1 ms 4 13.7336GB/s 1.7167G items/s
sort_pairs<int, double>/manual_time 222 ms 1 ms 3 16.9125GB/s 1.40938G items/s
sort_pairs<int, custom_float2>/manual_time 222 ms 1 ms 3 16.8677GB/s 1.40564G items/s
sort_pairs<int, custom_double2>/manual_time 468 ms 1 ms 2 13.3555GB/s 683.804M items/s
sort_pairs<long long, float>/manual_time 503 ms 1 ms 1 7.45113GB/s 635.83M items/s
sort_pairs<long long, double>/manual_time 604 ms 1 ms 1 8.28461GB/s 530.215M items/s
sort_pairs<long long, custom_float2>/manual_time 598 ms 3 ms 1 8.35538GB/s 534.744M items/s
sort_pairs<long long, custom_double2>/manual_time 1061 ms 3 ms 1 7.06978GB/s 301.644M items/s
This is for 32M items.
Usually we use radix4 implementation.
What performance do you have for radix4?
I'll answer your question about histogram a bit later.
from rocprim.
I think that I can say that we don't use histograms the way you described.
So for the "7 + 7 + 6 + 6 + 6" case we run 5 iterations of these steps (let's use 7 bits as an example):
- calculate global histograms: how many items for each radix value (0..127) in each block
- calculate where each block will write items of each radix value (128 scans)
- calculate where items of each radix value start (one scan)
- sort items in each block (
BlockSize * ItemsPerThread
items) - scatter sorted items using positions calculated in 2 and 3
Step 4 doesn't use histograms for block-level sorting, in fact you may consider block_radix_sort
as 7 iterations of 1-bit sorting, i.e. it's something like radix1 histogram. To calculate such "histograms" I use quite efficient "bit" scan which uses ballot and masked bit count for warp-level scan.
So LDS is used only for exchanging items between threads, we need to store BlockSize * ItemsPerThread
items (256 * 15 ints
, for example).
from rocprim.
With Radix4 i see below performance on Vega10 which has 64 compute Units
Performance of Radix Sort key value pair where key is float type
elsatp time 21.4826ms
szElements 32000000
This timing doesn't include data transfer from host to device and device to host
What is the performance with rocPRIM on key value pair where key is float?
Are you seeing better than this?
More performance data:
Size | Time in ms |
---|---|
50000 | 0.438831 |
100000 | 0.474663 |
200000 | 0.520248 |
400000 | 0.672608 |
800000 | 1.00078 |
1600000 | 1.48771 |
3200000 | 2.62033 |
6400000 | 4.63337 |
12800000 | 8.84509 |
25600000 | 17.3193 |
51200000 | 34.1521 |
102400000 | 66.0865 |
204800000 | 132.304 |
from rocprim.
Newer results on ROCm 2.2/2.3:
-----------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------
Vega 20 [Radeon Pro Vega 20] 2.2
sort_keys<int>/manual_time 82 ms 2 ms 9 15.2244GB/s 3.80609G items/s
sort_keys<long long>/manual_time 218 ms 73 ms 3 11.4501GB/s 1.43127G items/s
sort_keys<float>/manual_time 80 ms 80 ms 9 15.6974GB/s 3.92435G items/s
sort_keys<double>/manual_time 217 ms 217 ms 3 11.5044GB/s 1.43805G items/s
sort_keys<char>/manual_time 19 ms 18 ms 37 16.2946GB/s 16.2946G items/s
sort_keys<short>/manual_time 38 ms 37 ms 18 16.2368GB/s 8.11839G items/s
sort_pairs<int, float>/manual_time 121 ms 2 ms 6 20.6264GB/s 2.5783G items/s
sort_pairs<int, double>/manual_time 197 ms 1 ms 4 19.05GB/s 1.5875G items/s
sort_pairs<int, custom_float2>/manual_time 200 ms 1 ms 4 18.7696GB/s 1.56413G items/s
sort_pairs<int, custom_double2>/manual_time 275 ms 273 ms 3 22.7394GB/s 1.13697G items/s
sort_pairs<float, float>/manual_time 114 ms 114 ms 6 21.9102GB/s 2.73878G items/s
sort_pairs<float, double>/manual_time 190 ms 188 ms 4 19.7542GB/s 1.64618G items/s
sort_pairs<float, custom_float2>/manual_time 192 ms 190 ms 4 19.5257GB/s 1.62714G items/s
sort_pairs<float, custom_double2>/manual_time 257 ms 255 ms 3 24.3047GB/s 1.21524G items/s
sort_pairs<long long, float>/manual_time 404 ms 400 ms 2 9.28342GB/s 792.186M items/s
sort_pairs<long long, double>/manual_time 479 ms 475 ms 2 10.4415GB/s 668.254M items/s
sort_pairs<long long, custom_float2>/manual_time 473 ms 473 ms 2 10.5676GB/s 676.323M items/s
sort_pairs<long long, custom_double2>/manual_time 626 ms 623 ms 1 11.9751GB/s 510.938M items/s
sort_pairs<double, float>/manual_time 409 ms 2 ms 2 9.16793GB/s 782.33M items/s
sort_pairs<double, double>/manual_time 484 ms 2 ms 2 10.3371GB/s 661.576M items/s
sort_pairs<double, custom_float2>/manual_time 478 ms 2 ms 2 10.4629GB/s 669.627M items/s
sort_pairs<double, custom_double2>/manual_time 593 ms 593 ms 1 12.6523GB/s 539.833M items/s
Fiji [Radeon R9 FURY / NANO Series] 2.3 -DCMAKE_CXX_FLAGS=-DROCPRIM_TARGET_ARCH=803
sort_keys<int>/manual_time 147 ms 2 ms 4 8.47622GB/s 2.11906G items/s
sort_keys<long long>/manual_time 423 ms 4 ms 2 5.9156GB/s 757.197M items/s
sort_keys<float>/manual_time 148 ms 30 ms 5 8.44191GB/s 2.11048G items/s
sort_keys<double>/manual_time 443 ms 443 ms 2 5.63737GB/s 721.583M items/s
sort_keys<char>/manual_time 31 ms 31 ms 22 10.0914GB/s 10.0914G items/s
sort_keys<short>/manual_time 68 ms 68 ms 12 9.25364GB/s 4.62682G items/s
sort_pairs<int, float>/manual_time 206 ms 206 ms 3 12.1418GB/s 1.51773G items/s
sort_pairs<int, double>/manual_time 293 ms 293 ms 2 12.8072GB/s 1092.88M items/s
sort_pairs<int, custom_float2>/manual_time 294 ms 294 ms 2 12.7493GB/s 1087.94M items/s
sort_pairs<int, custom_double2>/manual_time 564 ms 564 ms 1 11.0839GB/s 567.494M items/s
sort_pairs<float, float>/manual_time 204 ms 204 ms 3 12.2681GB/s 1.53352G items/s
sort_pairs<float, double>/manual_time 290 ms 290 ms 2 12.9205GB/s 1102.55M items/s
sort_pairs<float, custom_float2>/manual_time 295 ms 295 ms 2 12.7127GB/s 1084.82M items/s
sort_pairs<float, custom_double2>/manual_time 499 ms 499 ms 2 12.5331GB/s 641.695M items/s
sort_pairs<long long, float>/manual_time 700 ms 700 ms 1 5.35667GB/s 457.103M items/s
sort_pairs<long long, double>/manual_time 766 ms 766 ms 1 6.52996GB/s 417.918M items/s
sort_pairs<long long, custom_float2>/manual_time 776 ms 1 ms 1 6.44308GB/s 412.357M items/s
sort_pairs<long long, custom_double2>/manual_time 1324 ms 2 ms 1 5.66297GB/s 241.62M items/s
sort_pairs<double, float>/manual_time 609 ms 3 ms 1 6.1557GB/s 525.286M items/s
sort_pairs<double, double>/manual_time 811 ms 2 ms 1 6.16156GB/s 394.34M items/s
sort_pairs<double, custom_float2>/manual_time 805 ms 3 ms 1 6.20775GB/s 397.296M items/s
sort_pairs<double, custom_double2>/manual_time 1223 ms 3 ms 1 6.13253GB/s 261.654M items/s
Vega 10 [Radeon Instinct MI25] 2.3
sort_keys<int>/manual_time 86 ms 2 ms 8 14.5287GB/s 3.63219G items/s
sort_keys<long long>/manual_time 301 ms 3 ms 2 8.30259GB/s 1062.73M items/s
sort_keys<float>/manual_time 85 ms 83 ms 8 14.7651GB/s 3.69127G items/s
sort_keys<double>/manual_time 300 ms 298 ms 2 8.33531GB/s 1066.92M items/s
sort_keys<char>/manual_time 19 ms 18 ms 36 16.7683GB/s 16.7683G items/s
sort_keys<short>/manual_time 39 ms 38 ms 18 16.0629GB/s 8.03145G items/s
sort_pairs<int, float>/manual_time 141 ms 140 ms 5 17.7015GB/s 2.21269G items/s
sort_pairs<int, double>/manual_time 194 ms 1 ms 4 19.2917GB/s 1.60764G items/s
sort_pairs<int, custom_float2>/manual_time 196 ms 1 ms 4 19.0876GB/s 1.59064G items/s
sort_pairs<int, custom_double2>/manual_time 340 ms 1 ms 2 18.3827GB/s 941.196M items/s
sort_pairs<float, float>/manual_time 134 ms 81 ms 5 18.6017GB/s 2.32521G items/s
sort_pairs<float, double>/manual_time 188 ms 186 ms 4 19.9576GB/s 1.66314G items/s
sort_pairs<float, custom_float2>/manual_time 190 ms 188 ms 4 19.7484GB/s 1.6457G items/s
sort_pairs<float, custom_double2>/manual_time 321 ms 319 ms 2 19.4688GB/s 996.804M items/s
sort_pairs<long long, float>/manual_time 460 ms 456 ms 2 8.15707GB/s 696.07M items/s
sort_pairs<long long, double>/manual_time 552 ms 549 ms 1 9.05005GB/s 579.203M items/s
sort_pairs<long long, custom_float2>/manual_time 549 ms 549 ms 1 9.10125GB/s 582.48M items/s
sort_pairs<long long, custom_double2>/manual_time 804 ms 801 ms 1 9.32271GB/s 397.769M items/s
sort_pairs<double, float>/manual_time 461 ms 457 ms 2 8.13677GB/s 694.338M items/s
sort_pairs<double, double>/manual_time 544 ms 540 ms 1 9.18701GB/s 587.969M items/s
sort_pairs<double, custom_float2>/manual_time 543 ms 539 ms 1 9.21393GB/s 589.692M items/s
sort_pairs<double, custom_double2>/manual_time 779 ms 775 ms 1 9.62865GB/s 410.822M items/s
Integer keys are uniformly random [std::numeric_limits<key_type>::min(), std::numeric_limits<key_type>::max()]
, floating point keys are uniformly random [-1000, 1000]
. This explains why benchmarks with float keys are a bit faster than ints: not all bits as random -> less scattered writes to global memory.
If you want you can change this behavior in the benchmark source to match your keys generation scheme.
So Vega 10 [Radeon Instinct MI25]
for 32M float-float shows 2.32521G items/s, 13.4 ms (the Time column is sum of 10 runs).
Also, configs (radix sizes, items per thread) may be non-optimal for some types on the current ROCm versions. So config tuning may increase performance slightly.
from rocprim.
cool. Thanks Anton!
from rocprim.
Related Issues (20)
- Block_reduce fails to distribute correct answer to all lanes when hipBlockDim > 64 HOT 5
- Invalid use of inline assembly for GFX1010 target HOT 14
- Compiler Build issue HOT 1
- ROCm 3.9.1 fails to compile benchmark_warp_scan HOT 2
- error: reference to __host__ function 'inclusive_scan<rocprim::default_config, double *, double *, thrust::plus<double>>' in __host__ __device__ function HOT 2
- Ignoring return value warning HOT 2
- error: invalid operands to binary expression HOT 6
- Please enable two factor authentication in your github account
- Alternative documentation HOT 2
- error: no matching function for call to 'ceiling_div' HOT 2
- Add support for NAVI22 and NAVI23 i.e. gfx1031 and gfx1032 HOT 1
- rocPRIM reduction (block_reduce_int) issue HOT 2
- Follow the example of rocPRIM custom types example but compile failed. HOT 2
- rocPRIM on NAVI22 = gfx1031 only one test fails - can we fix this? HOT 6
- rocPRIM -- rocThrust dependency: error: no member named 'init_offset_scan_state_kernel' in rocprim::detail HOT 2
- rocPRIM 5.4.3 failed device_adjacent_difference on gfx906 HOT 4
- rocPRIM selects wrong code paths for warp reductions and scas on gfx1036 HOT 3
- Page fault during `rocprim::select` for no obvious reason. HOT 2
- `rocprim::block_load` fails to instantiate HOT 8
- first installation failed
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from rocprim.