Comments (19)
$ grep 'model name' /proc/cpuinfo
model name : AMD Ryzen 7 2700X Eight-Core Processor
$ grep 'avx2' /proc/cpuinfo
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
from pospop.
Thanks for the report! Please let me know what CPU model you are programming for so I can investigate this further.
from pospop.
I started using bit matrix transpositions in 41dbbc5 (speedup after d24d616).
Note that until d6e39e5, the count8 code had changed very little and never used the transposing kernel, as I found it to be slower than doing it the other way (using VPMOVMSKB). Nevertheless, the new and improved kernel code of 836a368 proved faster than the old variant on my Skylake machine, and I suppose it can be made faster on your system too, with some tweaking.
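To make the contrast concrete, here is a minimal scalar sketch in Go of how a VPMOVMSKB-style kernel can count bits per position. It is an illustration under assumptions, not the code from the repository: `movemask` and `count8Movemask` are hypothetical names, and the 32-byte chunk stands in for one AVX2 register.

```go
package main

import "math/bits"

// movemask models what VPMOVMSKB does for a 32-byte vector: it gathers the
// most significant bit of every byte into a single 32-bit mask.
func movemask(chunk *[32]byte) uint32 {
	var m uint32
	for i, b := range chunk {
		m |= uint32(b>>7) << uint(i)
	}
	return m
}

// count8Movemask is a hypothetical scalar model of a movemask-based kernel.
// For each bit position i, it adds to counts[i] the number of bytes in buf
// that have bit i set.
func count8Movemask(counts *[8]int, buf []byte) {
	for len(buf) >= 32 {
		var chunk [32]byte
		copy(chunk[:], buf[:32])
		for bit := 7; bit >= 0; bit-- {
			// The mask holds the current top bit of all 32 bytes; a scalar
			// population count turns it into a count for this position.
			counts[bit] += bits.OnesCount32(movemask(&chunk))
			// Shift every byte left so the next lower bit becomes the top bit.
			for i := range chunk {
				chunk[i] <<= 1
			}
		}
		buf = buf[32:]
	}
	// Tail: fewer than 32 bytes remain, handle them bit by bit.
	for _, b := range buf {
		for bit := 0; bit < 8; bit++ {
			counts[bit] += int(b >> uint(bit) & 1)
		}
	}
}
```

In this model the population counts run on scalar masks rather than on vector data, which is the kind of work a movemask-based kernel keeps off the vector units.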
On Linux, type
grep 'model name' /proc/cpuinfo
to find the CPU model. That's the information I need.
from pospop.
Aha! I've never tested on Ryzen. Let me investigate that for you.
from pospop.
Could you please post a full set of benchmark outputs for the current revision as well as for v1.0.4? I'd like to understand the performance characteristics better. The main problem is that Zen and Zen+ implement AVX by using a 128-bit FPU twice. Thus, my original Count8 kernel is slightly faster as it moves some load from the FPU into scalar operations.
from pospop.
d6e39e5
BenchmarkCount8/avx2/32-16 74067950 15.4 ns/op 2077.69 MB/s
BenchmarkCount8/avx2/64-16 57037659 20.8 ns/op 3079.35 MB/s
BenchmarkCount8/avx2/128-16 38254251 30.1 ns/op 4247.99 MB/s
BenchmarkCount8/avx2/256-16 24741512 52.4 ns/op 4890.02 MB/s
BenchmarkCount8/avx2/512-16 31537677 38.5 ns/op 13298.46 MB/s
BenchmarkCount8/avx2/1000-16 19230327 60.8 ns/op 16436.63 MB/s
BenchmarkCount8/avx2/10000-16 2023684 539 ns/op 18567.05 MB/s
BenchmarkCount8/avx2/100000-16 258202 4568 ns/op 21889.58 MB/s
BenchmarkCount8/avx2/1000000-16 24387 45925 ns/op 21774.52 MB/s
BenchmarkCount8/avx2/10000000-16 2398 485719 ns/op 20588.05 MB/s
BenchmarkCount8/avx2/100000000-16 232 4848575 ns/op 20624.62 MB/s
BenchmarkCount8/sse2/32-16 76123047 15.1 ns/op 2114.13 MB/s
BenchmarkCount8/sse2/64-16 57018202 20.9 ns/op 3064.25 MB/s
BenchmarkCount8/sse2/128-16 35960908 33.7 ns/op 3801.73 MB/s
BenchmarkCount8/sse2/256-16 44616007 25.4 ns/op 10067.82 MB/s
BenchmarkCount8/sse2/512-16 30101331 41.9 ns/op 12208.97 MB/s
BenchmarkCount8/sse2/1000-16 16921921 70.7 ns/op 14148.10 MB/s
BenchmarkCount8/sse2/10000-16 1937534 608 ns/op 16439.59 MB/s
BenchmarkCount8/sse2/100000-16 188360 5801 ns/op 17237.63 MB/s
BenchmarkCount8/sse2/1000000-16 20062 57294 ns/op 17453.74 MB/s
BenchmarkCount8/sse2/10000000-16 1959 603064 ns/op 16582.00 MB/s
BenchmarkCount8/sse2/100000000-16 190 6374957 ns/op 15686.38 MB/s
BenchmarkCount8/generic/32-16 15368632 80.0 ns/op 399.99 MB/s
BenchmarkCount8/generic/64-16 8109308 145 ns/op 441.79 MB/s
BenchmarkCount8/generic/128-16 3836839 280 ns/op 457.66 MB/s
BenchmarkCount8/generic/256-16 2114299 543 ns/op 471.73 MB/s
BenchmarkCount8/generic/512-16 1076280 1079 ns/op 474.31 MB/s
BenchmarkCount8/generic/1000-16 552895 2162 ns/op 462.43 MB/s
BenchmarkCount8/generic/10000-16 49760 20990 ns/op 476.41 MB/s
BenchmarkCount8/generic/100000-16 5702 209239 ns/op 477.92 MB/s
BenchmarkCount8/generic/1000000-16 500 2106739 ns/op 474.67 MB/s
BenchmarkCount8/generic/10000000-16 55 21090807 ns/op 474.14 MB/s
BenchmarkCount8/generic/100000000-16 5 207750018 ns/op 481.35 MB/s
BenchmarkCount16/avx2/32-16 68553722 17.3 ns/op 1845.48 MB/s
BenchmarkCount16/avx2/64-16 52283364 22.4 ns/op 2856.84 MB/s
BenchmarkCount16/avx2/128-16 36027898 32.1 ns/op 3991.84 MB/s
BenchmarkCount16/avx2/256-16 21851757 51.5 ns/op 4969.70 MB/s
BenchmarkCount16/avx2/512-16 31586925 39.5 ns/op 12947.11 MB/s
BenchmarkCount16/avx2/1000-16 18997122 65.5 ns/op 15271.71 MB/s
BenchmarkCount16/avx2/10000-16 2206296 516 ns/op 19379.26 MB/s
BenchmarkCount16/avx2/100000-16 250183 4701 ns/op 21270.91 MB/s
BenchmarkCount16/avx2/1000000-16 24770 45694 ns/op 21884.69 MB/s
BenchmarkCount16/avx2/10000000-16 2401 472581 ns/op 21160.40 MB/s
BenchmarkCount16/avx2/100000000-16 231 4895301 ns/op 20427.75 MB/s
BenchmarkCount16/sse2/32-16 77070478 15.4 ns/op 2071.72 MB/s
BenchmarkCount16/sse2/64-16 53893412 21.8 ns/op 2931.79 MB/s
BenchmarkCount16/sse2/128-16 34292407 34.1 ns/op 3756.48 MB/s
BenchmarkCount16/sse2/256-16 43857051 26.4 ns/op 9682.58 MB/s
BenchmarkCount16/sse2/512-16 24215067 42.7 ns/op 11981.72 MB/s
BenchmarkCount16/sse2/1000-16 16146453 76.2 ns/op 13124.74 MB/s
BenchmarkCount16/sse2/10000-16 1694382 610 ns/op 16395.64 MB/s
BenchmarkCount16/sse2/100000-16 201802 5603 ns/op 17848.50 MB/s
BenchmarkCount16/sse2/1000000-16 21052 57298 ns/op 17452.57 MB/s
BenchmarkCount16/sse2/10000000-16 1906 600737 ns/op 16646.22 MB/s
BenchmarkCount16/sse2/100000000-16 180 6108449 ns/op 16370.77 MB/s
BenchmarkCount16/generic/32-16 14856538 80.5 ns/op 397.56 MB/s
BenchmarkCount16/generic/64-16 8379171 144 ns/op 445.30 MB/s
BenchmarkCount16/generic/128-16 4122723 275 ns/op 464.92 MB/s
BenchmarkCount16/generic/256-16 2119821 532 ns/op 480.80 MB/s
BenchmarkCount16/generic/512-16 1000000 1052 ns/op 486.63 MB/s
BenchmarkCount16/generic/1000-16 579739 2011 ns/op 497.30 MB/s
BenchmarkCount16/generic/10000-16 54300 19199 ns/op 520.86 MB/s
BenchmarkCount16/generic/100000-16 5833 198476 ns/op 503.84 MB/s
BenchmarkCount16/generic/1000000-16 518 1984898 ns/op 503.80 MB/s
BenchmarkCount16/generic/10000000-16 60 20070014 ns/op 498.26 MB/s
BenchmarkCount16/generic/100000000-16 5 200073221 ns/op 499.82 MB/s
BenchmarkCount32/avx2/32-16 68429533 17.1 ns/op 1873.94 MB/s
BenchmarkCount32/avx2/64-16 47244782 22.2 ns/op 2883.31 MB/s
BenchmarkCount32/avx2/128-16 36776332 33.5 ns/op 3817.41 MB/s
BenchmarkCount32/avx2/256-16 19787458 53.5 ns/op 4782.09 MB/s
BenchmarkCount32/avx2/512-16 27683703 40.1 ns/op 12783.01 MB/s
BenchmarkCount32/avx2/1000-16 18520581 62.6 ns/op 15965.38 MB/s
BenchmarkCount32/avx2/10000-16 2222316 537 ns/op 18627.00 MB/s
BenchmarkCount32/avx2/100000-16 248487 4571 ns/op 21876.67 MB/s
BenchmarkCount32/avx2/1000000-16 24266 47142 ns/op 21212.68 MB/s
BenchmarkCount32/avx2/10000000-16 2472 499993 ns/op 20000.29 MB/s
BenchmarkCount32/avx2/100000000-16 235 4939368 ns/op 20245.51 MB/s
BenchmarkCount32/sse2/32-16 69231907 18.2 ns/op 1759.66 MB/s
BenchmarkCount32/sse2/64-16 43461750 24.3 ns/op 2634.68 MB/s
BenchmarkCount32/sse2/128-16 32750986 36.3 ns/op 3527.61 MB/s
BenchmarkCount32/sse2/256-16 41501784 28.7 ns/op 8907.62 MB/s
BenchmarkCount32/sse2/512-16 25459093 45.1 ns/op 11344.38 MB/s
BenchmarkCount32/sse2/1000-16 16290914 73.9 ns/op 13522.65 MB/s
BenchmarkCount32/sse2/10000-16 1906378 612 ns/op 16343.56 MB/s
BenchmarkCount32/sse2/100000-16 200224 5768 ns/op 17336.41 MB/s
BenchmarkCount32/sse2/1000000-16 19930 57189 ns/op 17485.98 MB/s
BenchmarkCount32/sse2/10000000-16 1924 607809 ns/op 16452.54 MB/s
BenchmarkCount32/sse2/100000000-16 202 5915212 ns/op 16905.57 MB/s
BenchmarkCount32/generic/32-16 12836344 90.3 ns/op 354.54 MB/s
BenchmarkCount32/generic/64-16 7625206 154 ns/op 416.33 MB/s
BenchmarkCount32/generic/128-16 3842911 274 ns/op 467.84 MB/s
BenchmarkCount32/generic/256-16 2111515 542 ns/op 472.63 MB/s
BenchmarkCount32/generic/512-16 1142546 1044 ns/op 490.51 MB/s
BenchmarkCount32/generic/1000-16 580545 1992 ns/op 501.99 MB/s
BenchmarkCount32/generic/10000-16 51256 19798 ns/op 505.11 MB/s
BenchmarkCount32/generic/100000-16 5929 197870 ns/op 505.38 MB/s
BenchmarkCount32/generic/1000000-16 615 2001184 ns/op 499.70 MB/s
BenchmarkCount32/generic/10000000-16 60 19551422 ns/op 511.47 MB/s
BenchmarkCount32/generic/100000000-16 5 200235963 ns/op 499.41 MB/s
BenchmarkCount64/avx2/32-16 56106811 20.7 ns/op 1545.96 MB/s
BenchmarkCount64/avx2/64-16 45492897 25.3 ns/op 2532.35 MB/s
BenchmarkCount64/avx2/128-16 32034108 36.1 ns/op 3541.27 MB/s
BenchmarkCount64/avx2/256-16 21864559 54.7 ns/op 4678.54 MB/s
BenchmarkCount64/avx2/512-16 28554049 43.4 ns/op 11806.46 MB/s
BenchmarkCount64/avx2/1000-16 17307355 65.9 ns/op 15163.51 MB/s
BenchmarkCount64/avx2/10000-16 2158840 536 ns/op 18670.93 MB/s
BenchmarkCount64/avx2/100000-16 247810 4773 ns/op 20951.19 MB/s
BenchmarkCount64/avx2/1000000-16 24412 46870 ns/op 21335.82 MB/s
BenchmarkCount64/avx2/10000000-16 2444 483615 ns/op 20677.60 MB/s
BenchmarkCount64/avx2/100000000-16 232 4956579 ns/op 20175.20 MB/s
BenchmarkCount64/sse2/32-16 53249934 22.6 ns/op 1413.04 MB/s
BenchmarkCount64/sse2/64-16 42888222 28.0 ns/op 2289.32 MB/s
BenchmarkCount64/sse2/128-16 29366755 40.2 ns/op 3184.26 MB/s
BenchmarkCount64/sse2/256-16 35718535 32.2 ns/op 7952.15 MB/s
BenchmarkCount64/sse2/512-16 23685950 47.8 ns/op 10718.39 MB/s
BenchmarkCount64/sse2/1000-16 14958813 78.2 ns/op 12790.76 MB/s
BenchmarkCount64/sse2/10000-16 1871818 613 ns/op 16307.44 MB/s
BenchmarkCount64/sse2/100000-16 213234 5746 ns/op 17402.78 MB/s
BenchmarkCount64/sse2/1000000-16 20287 57818 ns/op 17295.75 MB/s
BenchmarkCount64/sse2/10000000-16 1917 602384 ns/op 16600.71 MB/s
BenchmarkCount64/sse2/100000000-16 195 6003837 ns/op 16656.01 MB/s
BenchmarkCount64/generic/32-16 9844602 118 ns/op 271.61 MB/s
BenchmarkCount64/generic/64-16 5404071 189 ns/op 337.91 MB/s
BenchmarkCount64/generic/128-16 3587383 306 ns/op 418.25 MB/s
BenchmarkCount64/generic/256-16 2038212 557 ns/op 459.82 MB/s
BenchmarkCount64/generic/512-16 1088886 1088 ns/op 470.66 MB/s
BenchmarkCount64/generic/1000-16 564288 2081 ns/op 480.61 MB/s
BenchmarkCount64/generic/10000-16 58940 19670 ns/op 508.40 MB/s
BenchmarkCount64/generic/100000-16 5829 197991 ns/op 505.07 MB/s
BenchmarkCount64/generic/1000000-16 520 2018133 ns/op 495.51 MB/s
BenchmarkCount64/generic/10000000-16 60 19406719 ns/op 515.29 MB/s
BenchmarkCount64/generic/100000000-16 6 188469498 ns/op 530.59 MB/s
v1.0.4
BenchmarkCount8/avx2/32-16 113284057 10.4 ns/op 3077.83 MB/s
BenchmarkCount8/avx2/64-16 73449952 15.4 ns/op 4160.15 MB/s
BenchmarkCount8/avx2/128-16 47169706 25.2 ns/op 5074.83 MB/s
BenchmarkCount8/avx2/256-16 21865110 54.3 ns/op 4714.16 MB/s
BenchmarkCount8/avx2/512-16 34523324 34.5 ns/op 14859.67 MB/s
BenchmarkCount8/avx2/1000-16 22098831 53.8 ns/op 18583.47 MB/s
BenchmarkCount8/avx2/10000-16 2579947 472 ns/op 21174.07 MB/s
BenchmarkCount8/avx2/100000-16 289872 3960 ns/op 25252.14 MB/s
BenchmarkCount8/avx2/1000000-16 28410 39198 ns/op 25511.27 MB/s
BenchmarkCount8/avx2/10000000-16 2738 421837 ns/op 23705.83 MB/s
BenchmarkCount8/avx2/100000000-16 267 4081937 ns/op 24498.17 MB/s
BenchmarkCount8/sse2/32-16 78757671 15.5 ns/op 2067.16 MB/s
BenchmarkCount8/sse2/64-16 53975407 21.9 ns/op 2925.07 MB/s
BenchmarkCount8/sse2/128-16 34664730 32.9 ns/op 3889.01 MB/s
BenchmarkCount8/sse2/256-16 47214710 24.8 ns/op 10318.63 MB/s
BenchmarkCount8/sse2/512-16 27582103 42.2 ns/op 12120.04 MB/s
BenchmarkCount8/sse2/1000-16 16634394 70.2 ns/op 14246.76 MB/s
BenchmarkCount8/sse2/10000-16 2056755 610 ns/op 16385.02 MB/s
BenchmarkCount8/sse2/100000-16 199066 5686 ns/op 17586.62 MB/s
BenchmarkCount8/sse2/1000000-16 20299 58304 ns/op 17151.34 MB/s
BenchmarkCount8/sse2/10000000-16 1942 592495 ns/op 16877.79 MB/s
BenchmarkCount8/sse2/100000000-16 200 6038125 ns/op 16561.43 MB/s
BenchmarkCount8/generic/32-16 14566357 79.5 ns/op 402.32 MB/s
BenchmarkCount8/generic/64-16 8044152 150 ns/op 427.21 MB/s
BenchmarkCount8/generic/128-16 3772062 288 ns/op 444.01 MB/s
BenchmarkCount8/generic/256-16 2092000 551 ns/op 464.87 MB/s
BenchmarkCount8/generic/512-16 1083063 1080 ns/op 474.01 MB/s
BenchmarkCount8/generic/1000-16 570868 2114 ns/op 473.00 MB/s
BenchmarkCount8/generic/10000-16 49572 20788 ns/op 481.04 MB/s
BenchmarkCount8/generic/100000-16 5528 207059 ns/op 482.95 MB/s
BenchmarkCount8/generic/1000000-16 600 2076995 ns/op 481.46 MB/s
BenchmarkCount8/generic/10000000-16 55 21229168 ns/op 471.05 MB/s
BenchmarkCount8/generic/100000000-16 5 212457142 ns/op 470.68 MB/s
BenchmarkCount16/avx2/32-16 65816288 17.9 ns/op 1791.83 MB/s
BenchmarkCount16/avx2/64-16 50757498 22.7 ns/op 2822.43 MB/s
BenchmarkCount16/avx2/128-16 35025753 32.6 ns/op 3927.95 MB/s
BenchmarkCount16/avx2/256-16 21941785 53.6 ns/op 4774.50 MB/s
BenchmarkCount16/avx2/512-16 28513166 42.8 ns/op 11956.74 MB/s
BenchmarkCount16/avx2/1000-16 17108034 69.9 ns/op 14298.87 MB/s
BenchmarkCount16/avx2/10000-16 2014814 614 ns/op 16275.12 MB/s
BenchmarkCount16/avx2/100000-16 208441 5295 ns/op 18887.42 MB/s
BenchmarkCount16/avx2/1000000-16 21088 53613 ns/op 18652.22 MB/s
BenchmarkCount16/avx2/10000000-16 2085 554284 ns/op 18041.29 MB/s
BenchmarkCount16/avx2/100000000-16 213 5355222 ns/op 18673.36 MB/s
BenchmarkCount16/sse2/32-16 73891788 15.9 ns/op 2016.92 MB/s
BenchmarkCount16/sse2/64-16 53584813 22.2 ns/op 2888.94 MB/s
BenchmarkCount16/sse2/128-16 33977923 35.5 ns/op 3606.94 MB/s
BenchmarkCount16/sse2/256-16 44680690 26.5 ns/op 9649.60 MB/s
BenchmarkCount16/sse2/512-16 27087153 43.5 ns/op 11758.51 MB/s
BenchmarkCount16/sse2/1000-16 16411941 72.6 ns/op 13775.03 MB/s
BenchmarkCount16/sse2/10000-16 1913266 619 ns/op 16144.38 MB/s
BenchmarkCount16/sse2/100000-16 209222 6000 ns/op 16665.77 MB/s
BenchmarkCount16/sse2/1000000-16 20011 59011 ns/op 16946.10 MB/s
BenchmarkCount16/sse2/10000000-16 1972 586847 ns/op 17040.22 MB/s
BenchmarkCount16/sse2/100000000-16 193 5713296 ns/op 17503.03 MB/s
BenchmarkCount16/generic/32-16 14932627 75.8 ns/op 422.29 MB/s
BenchmarkCount16/generic/64-16 8251359 143 ns/op 446.01 MB/s
BenchmarkCount16/generic/128-16 3995899 271 ns/op 472.81 MB/s
BenchmarkCount16/generic/256-16 2194146 516 ns/op 496.23 MB/s
BenchmarkCount16/generic/512-16 1130038 1031 ns/op 496.69 MB/s
BenchmarkCount16/generic/1000-16 603375 1962 ns/op 509.78 MB/s
BenchmarkCount16/generic/10000-16 52326 19742 ns/op 506.54 MB/s
BenchmarkCount16/generic/100000-16 6142 193415 ns/op 517.02 MB/s
BenchmarkCount16/generic/1000000-16 568 1979633 ns/op 505.14 MB/s
BenchmarkCount16/generic/10000000-16 57 19146762 ns/op 522.28 MB/s
BenchmarkCount16/generic/100000000-16 6 196704699 ns/op 508.38 MB/s
BenchmarkCount32/avx2/32-16 72864624 17.1 ns/op 1866.88 MB/s
BenchmarkCount32/avx2/64-16 55126480 21.9 ns/op 2919.02 MB/s
BenchmarkCount32/avx2/128-16 37342851 32.2 ns/op 3977.41 MB/s
BenchmarkCount32/avx2/256-16 22630946 52.0 ns/op 4923.01 MB/s
BenchmarkCount32/avx2/512-16 27548682 41.3 ns/op 12391.07 MB/s
BenchmarkCount32/avx2/1000-16 17340909 69.7 ns/op 14337.86 MB/s
BenchmarkCount32/avx2/10000-16 2076030 593 ns/op 16867.66 MB/s
BenchmarkCount32/avx2/100000-16 233301 5252 ns/op 19041.26 MB/s
BenchmarkCount32/avx2/1000000-16 21309 52752 ns/op 18956.65 MB/s
BenchmarkCount32/avx2/10000000-16 2095 551314 ns/op 18138.48 MB/s
BenchmarkCount32/avx2/100000000-16 208 5611711 ns/op 17819.88 MB/s
BenchmarkCount32/sse2/32-16 65645158 17.2 ns/op 1864.68 MB/s
BenchmarkCount32/sse2/64-16 48960619 24.1 ns/op 2650.37 MB/s
BenchmarkCount32/sse2/128-16 32815432 35.6 ns/op 3591.24 MB/s
BenchmarkCount32/sse2/256-16 42327723 28.2 ns/op 9086.10 MB/s
BenchmarkCount32/sse2/512-16 20128138 59.2 ns/op 8642.98 MB/s
BenchmarkCount32/sse2/1000-16 13085713 90.6 ns/op 11035.66 MB/s
BenchmarkCount32/sse2/10000-16 1872972 634 ns/op 15781.59 MB/s
BenchmarkCount32/sse2/100000-16 201450 5716 ns/op 17495.91 MB/s
BenchmarkCount32/sse2/1000000-16 20067 59066 ns/op 16930.34 MB/s
BenchmarkCount32/sse2/10000000-16 2084 576319 ns/op 17351.50 MB/s
BenchmarkCount32/sse2/100000000-16 193 5898738 ns/op 16952.78 MB/s
BenchmarkCount32/generic/32-16 13305507 90.3 ns/op 354.55 MB/s
BenchmarkCount32/generic/64-16 7781857 151 ns/op 425.06 MB/s
BenchmarkCount32/generic/128-16 3962030 281 ns/op 455.87 MB/s
BenchmarkCount32/generic/256-16 2177896 525 ns/op 487.75 MB/s
BenchmarkCount32/generic/512-16 1122494 1029 ns/op 497.53 MB/s
BenchmarkCount32/generic/1000-16 594394 1903 ns/op 525.54 MB/s
BenchmarkCount32/generic/10000-16 51724 19835 ns/op 504.16 MB/s
BenchmarkCount32/generic/100000-16 5848 191398 ns/op 522.47 MB/s
BenchmarkCount32/generic/1000000-16 518 1998705 ns/op 500.32 MB/s
BenchmarkCount32/generic/10000000-16 61 19430005 ns/op 514.67 MB/s
BenchmarkCount32/generic/100000000-16 6 199107659 ns/op 502.24 MB/s
BenchmarkCount64/avx2/32-16 57851385 20.2 ns/op 1582.56 MB/s
BenchmarkCount64/avx2/64-16 43540004 26.4 ns/op 2419.68 MB/s
BenchmarkCount64/avx2/128-16 33230350 36.9 ns/op 3473.00 MB/s
BenchmarkCount64/avx2/256-16 22741604 56.3 ns/op 4549.09 MB/s
BenchmarkCount64/avx2/512-16 25663096 45.6 ns/op 11232.62 MB/s
BenchmarkCount64/avx2/1000-16 16008861 73.2 ns/op 13658.94 MB/s
BenchmarkCount64/avx2/10000-16 2083137 591 ns/op 16928.89 MB/s
BenchmarkCount64/avx2/100000-16 223752 5431 ns/op 18413.79 MB/s
BenchmarkCount64/avx2/1000000-16 21462 56225 ns/op 17785.57 MB/s
BenchmarkCount64/avx2/10000000-16 2154 551933 ns/op 18118.14 MB/s
BenchmarkCount64/avx2/100000000-16 206 5470769 ns/op 18278.97 MB/s
BenchmarkCount64/sse2/32-16 53932040 22.0 ns/op 1457.63 MB/s
BenchmarkCount64/sse2/64-16 42586113 27.8 ns/op 2304.66 MB/s
BenchmarkCount64/sse2/128-16 28128896 39.6 ns/op 3231.79 MB/s
BenchmarkCount64/sse2/256-16 35379745 32.2 ns/op 7956.09 MB/s
BenchmarkCount64/sse2/512-16 23239102 48.9 ns/op 10470.67 MB/s
BenchmarkCount64/sse2/1000-16 15173030 80.2 ns/op 12470.03 MB/s
BenchmarkCount64/sse2/10000-16 1912424 598 ns/op 16709.88 MB/s
BenchmarkCount64/sse2/100000-16 200026 5873 ns/op 17028.50 MB/s
BenchmarkCount64/sse2/1000000-16 19759 57521 ns/op 17385.04 MB/s
BenchmarkCount64/sse2/10000000-16 2040 597267 ns/op 16742.92 MB/s
BenchmarkCount64/sse2/100000000-16 192 5978950 ns/op 16725.34 MB/s
BenchmarkCount64/generic/32-16 9928184 117 ns/op 273.63 MB/s
BenchmarkCount64/generic/64-16 6628939 183 ns/op 350.39 MB/s
BenchmarkCount64/generic/128-16 3611708 304 ns/op 421.13 MB/s
BenchmarkCount64/generic/256-16 2057230 563 ns/op 454.42 MB/s
BenchmarkCount64/generic/512-16 1083744 1056 ns/op 484.70 MB/s
BenchmarkCount64/generic/1000-16 576555 2028 ns/op 493.06 MB/s
BenchmarkCount64/generic/10000-16 52099 19867 ns/op 503.35 MB/s
BenchmarkCount64/generic/100000-16 5992 197052 ns/op 507.48 MB/s
BenchmarkCount64/generic/1000000-16 524 1945532 ns/op 514.00 MB/s
BenchmarkCount64/generic/10000000-16 61 19713293 ns/op 507.27 MB/s
BenchmarkCount64/generic/100000000-16 6 194035798 ns/op 515.37 MB/s
677120e
BenchmarkCount8/avx2/32-16 180887323 6.50 ns/op 4919.63 MB/s
BenchmarkCount8/avx2/64-16 110568928 10.3 ns/op 6241.69 MB/s
BenchmarkCount8/avx2/128-16 63895208 17.8 ns/op 7181.04 MB/s
BenchmarkCount8/avx2/256-16 43572312 29.1 ns/op 8811.63 MB/s
BenchmarkCount8/avx2/512-16 41616794 29.9 ns/op 17109.62 MB/s
BenchmarkCount8/avx2/1000-16 20170315 59.1 ns/op 16918.99 MB/s
BenchmarkCount8/avx2/10000-16 2519863 453 ns/op 22055.16 MB/s
BenchmarkCount8/avx2/100000-16 302140 3975 ns/op 25154.25 MB/s
BenchmarkCount8/avx2/1000000-16 28449 38766 ns/op 25795.63 MB/s
BenchmarkCount8/avx2/10000000-16 2769 414361 ns/op 24133.54 MB/s
BenchmarkCount8/avx2/100000000-16 276 4010607 ns/op 24933.88 MB/s
BenchmarkCount8/popcnt/32-16 135109920 8.60 ns/op 3722.75 MB/s
BenchmarkCount8/popcnt/64-16 90996325 13.8 ns/op 4630.66 MB/s
BenchmarkCount8/popcnt/128-16 43640628 26.9 ns/op 4759.17 MB/s
BenchmarkCount8/popcnt/256-16 60230776 21.1 ns/op 12153.03 MB/s
BenchmarkCount8/popcnt/512-16 27787201 41.8 ns/op 12261.81 MB/s
BenchmarkCount8/popcnt/1000-16 14916432 81.4 ns/op 12292.15 MB/s
BenchmarkCount8/popcnt/10000-16 1940953 578 ns/op 17286.56 MB/s
BenchmarkCount8/popcnt/100000-16 210847 5706 ns/op 17524.07 MB/s
BenchmarkCount8/popcnt/1000000-16 20604 56097 ns/op 17826.36 MB/s
BenchmarkCount8/popcnt/10000000-16 2008 607747 ns/op 16454.21 MB/s
BenchmarkCount8/popcnt/100000000-16 195 5686707 ns/op 17584.87 MB/s
BenchmarkCount8/sse2/32-16 63942829 19.1 ns/op 1679.66 MB/s
BenchmarkCount8/sse2/64-16 37631593 33.4 ns/op 1918.07 MB/s
BenchmarkCount8/sse2/128-16 19319570 60.8 ns/op 2106.38 MB/s
BenchmarkCount8/sse2/256-16 33807202 35.1 ns/op 7294.89 MB/s
BenchmarkCount8/sse2/512-16 18850315 65.2 ns/op 7855.70 MB/s
BenchmarkCount8/sse2/1000-16 8662965 121 ns/op 8291.53 MB/s
BenchmarkCount8/sse2/10000-16 1152369 1011 ns/op 9894.85 MB/s
BenchmarkCount8/sse2/100000-16 125665 9287 ns/op 10767.45 MB/s
BenchmarkCount8/sse2/1000000-16 12408 95965 ns/op 10420.51 MB/s
BenchmarkCount8/sse2/10000000-16 1267 942471 ns/op 10610.41 MB/s
BenchmarkCount8/sse2/100000000-16 124 9998809 ns/op 10001.19 MB/s
BenchmarkCount8/generic/32-16 7558437 168 ns/op 190.43 MB/s
BenchmarkCount8/generic/64-16 3833438 324 ns/op 197.50 MB/s
BenchmarkCount8/generic/128-16 1826013 638 ns/op 200.49 MB/s
BenchmarkCount8/generic/256-16 942129 1291 ns/op 198.33 MB/s
BenchmarkCount8/generic/512-16 451552 2506 ns/op 204.33 MB/s
BenchmarkCount8/generic/1000-16 238472 4968 ns/op 201.27 MB/s
BenchmarkCount8/generic/10000-16 23505 50168 ns/op 199.33 MB/s
BenchmarkCount8/generic/100000-16 2397 485936 ns/op 205.79 MB/s
BenchmarkCount8/generic/1000000-16 229 4884775 ns/op 204.72 MB/s
BenchmarkCount8/generic/10000000-16 22 49935088 ns/op 200.26 MB/s
BenchmarkCount8/generic/100000000-16 3 476854149 ns/op 209.71 MB/s
from pospop.
I'll work on some more improvements to the code. Let me know if they improve the performance for you to the point where the old AVX2 kernel is obsolete. If they don't, I may have to find another solution.
from pospop.
I've improved the performance by another 6% in 9d68d4e. Please let me know how much this closes the performance gap on Zen for you. It should be more effective because it replaces shifts by logic operations that can run on more ports and eliminates some operations.
from pospop.
I'm sorry, almost no change.
from pospop.
How unfortunate. Sadly, the new code is almost 16% faster than the old Count8 kernel on Intel, so I'm a bit conflicted about keeping the old one. I'll have to think of a solution that works for both of us.
from pospop.
Thank you so much!
from pospop.
Hi @shenwei356!
In the carry branch, I have provided a version of the AVX2 kernel with a further 30% improvement. Please let me know if it finally manages to beat the original code on your machine.
from pospop.
Thank you for continuing to update this package!
Unfortunately, the benchmark fails to run.
$ go test -bench=Benchmark*
SIGILL: illegal instruction
AMD Ryzen 7 2700X Eight-Core Processor CPU flags
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
$ go test -bench=Benchmark*
SIGILL: illegal instruction
PC=0x4eb5cc m=0 sigcode=2
instruction bytes: 0x62 0xf1 0x5 0x28 0xdf 0xf4 0xc5 0xcd 0x72 0xd6 0x1 0xc4 0xe3 0x55 0x46 0xe6
goroutine 19 [running]:
countavxcarry()
/home/shenwei/shenwei/scripts/go/src/github.com/clausecker/pospop/countavx2carry_amd64.s:188 +0x40c fp=0xc00009bd68 sp=0xc00009bd38 pc=0x4eb5cc
github.com/clausecker/pospop.count8avx2carry(0xc0000e2a80, {0xc0000cd001, 0x3ff, 0x3ff})
/home/shenwei/shenwei/scripts/go/src/github.com/clausecker/pospop/countavx2carry_amd64.s:501 +0x36 fp=0xc00009bd78 sp=0xc00009bd68 pc=0x4ebc76
github.com/clausecker/pospop.count8avx2carry(0x40cea7, {0xc0000cd001, 0x4e8bcb, 0x7})
<autogenerated>:1 +0x2b fp=0xc00009bda8 sp=0xc00009bd78 pc=0x4ee30b
github.com/clausecker/pospop.Count8(0xc0000e2a80, {0xc0000cd001, 0x0, 0x0})
/home/shenwei/shenwei/scripts/go/src/github.com/clausecker/pospop/dispatch.go:107 +0x25 fp=0xc00009bdd8 sp=0xc00009bda8 pc=0x4e57c5
github.com/clausecker/pospop.testCount8(0xc000083860, 0x523c58)
/home/shenwei/shenwei/scripts/go/src/github.com/clausecker/pospop/count_test.go:59 +0x157 fp=0xc00009bf50 sp=0xc00009bdd8 pc=0x4e8e57
github.com/clausecker/pospop.TestCount8.func1(0x0)
/home/shenwei/shenwei/scripts/go/src/github.com/clausecker/pospop/count_test.go:136 +0x25 fp=0xc00009bf70 sp=0xc00009bf50 pc=0x4ea485
testing.tRunner(0xc000083860, 0x523c98)
/usr/local/go/src/testing/testing.go:1253 +0x102 fp=0xc00009bfc0 sp=0xc00009bf70 pc=0x4b0302
testing.(*T).Run·dwrap·21()
/usr/local/go/src/testing/testing.go:1300 +0x2a fp=0xc00009bfe0 sp=0xc00009bfc0 pc=0x4b100a
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc00009bfe8 sp=0xc00009bfe0 pc=0x463681
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:1300 +0x35a
goroutine 1 [chan receive]:
testing.(*T).Run(0xc000083380, {0x51a843, 0x465d93}, 0x523ca0)
/usr/local/go/src/testing/testing.go:1301 +0x375
testing.runTests.func1(0xc000083380)
/usr/local/go/src/testing/testing.go:1592 +0x6e
testing.tRunner(0xc000083380, 0xc000093d18)
/usr/local/go/src/testing/testing.go:1253 +0x102
testing.runTests(0xc0000d2000, {0x5fcd60, 0x4, 0x4}, {0x4721ed, 0x64, 0x601040})
/usr/local/go/src/testing/testing.go:1590 +0x43f
testing.(*M).Run(0xc0000d2000)
/usr/local/go/src/testing/testing.go:1498 +0x51d
main.main()
_testmain.go:59 +0x14b
goroutine 18 [chan receive]:
testing.(*T).Run(0xc000083520, {0x51a135, 0x465d93}, 0x523c98)
/usr/local/go/src/testing/testing.go:1301 +0x375
github.com/clausecker/pospop.TestCount8(0x0)
/home/shenwei/shenwei/scripts/go/src/github.com/clausecker/pospop/count_test.go:136 +0x35
testing.tRunner(0xc000083520, 0x523ca0)
/usr/local/go/src/testing/testing.go:1253 +0x102
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:1300 +0x35a
rax 0xfffb
rbx 0x4eb8c0
rcx 0x0
rdx 0x1e0
rdi 0xc0000e2a80
rsi 0xc0000cd400
rbp 0xc00009bd58
rsp 0xc00009bd38
r8 0x7fe8873875b8
r9 0x0
r10 0x7fe887393c88
r11 0x0
r12 0xc0000e2a80
r13 0x1
r14 0xc000083a00
r15 0xffffffffffffffff
rip 0x4eb5cc
rflags 0x10206
cs 0x33
fs 0x0
gs 0x0
exit status 2
FAIL github.com/clausecker/pospop 0.005s
from pospop.
Hi @shenwei356,
I apologise for the problem. I accidentally used an incorrect machine instruction in this kernel. I have pushed a fix for the problem in df1eae5. Please let me know if it does the trick.
from pospop.
Yes! It's faster for > 1000 elements but slower for <= 512 elements.
I miss the old version 677120e, with 6069.95 MB/s for 64 elements. 😆
BenchmarkCount8/avx2carry/32-16 75166225 17.01 ns/op 1881.52 MB/s
BenchmarkCount8/avx2carry/64-16 49655649 22.10 ns/op 2896.57 MB/s
BenchmarkCount8/avx2carry/128-16 36818702 32.02 ns/op 3997.43 MB/s
BenchmarkCount8/avx2carry/256-16 23530563 53.38 ns/op 4795.83 MB/s
BenchmarkCount8/avx2carry/512-16 31999482 41.50 ns/op 12338.46 MB/s
BenchmarkCount8/avx2carry/1000-16 22948182 51.15 ns/op 19551.49 MB/s
BenchmarkCount8/avx2carry/10000-16 2811368 417.8 ns/op 23932.42 MB/s
BenchmarkCount8/avx2carry/100000-16 314949 3606 ns/op 27732.07 MB/s
BenchmarkCount8/avx2carry/1000000-16 33084 36173 ns/op 27645.11 MB/s
BenchmarkCount8/avx2carry/10000000-16 2142 558955 ns/op 17890.52 MB/s
BenchmarkCount8/avx2carry/100000000-16 181 5880131 ns/op 17006.42 MB/s
BenchmarkCount8/avx2/32-16 70049467 15.88 ns/op 2014.53 MB/s
BenchmarkCount8/avx2/64-16 54589984 21.07 ns/op 3036.93 MB/s
BenchmarkCount8/avx2/128-16 33218904 31.37 ns/op 4080.63 MB/s
BenchmarkCount8/avx2/256-16 20424726 52.44 ns/op 4882.03 MB/s
BenchmarkCount8/avx2/512-16 30637970 38.84 ns/op 13181.97 MB/s
BenchmarkCount8/avx2/1000-16 20244832 60.53 ns/op 16520.15 MB/s
BenchmarkCount8/avx2/10000-16 2364079 488.0 ns/op 20490.45 MB/s
BenchmarkCount8/avx2/100000-16 255962 4521 ns/op 22120.96 MB/s
BenchmarkCount8/avx2/1000000-16 25255 44622 ns/op 22410.66 MB/s
BenchmarkCount8/avx2/10000000-16 1872 629837 ns/op 15877.13 MB/s
BenchmarkCount8/avx2/100000000-16 152 7730327 ns/op 12936.06 MB/s
BenchmarkCount8/sse2/32-16 76672119 15.68 ns/op 2040.40 MB/s
BenchmarkCount8/sse2/64-16 52885789 21.45 ns/op 2983.43 MB/s
BenchmarkCount8/sse2/128-16 34227018 33.60 ns/op 3809.34 MB/s
BenchmarkCount8/sse2/256-16 44672299 26.23 ns/op 9760.54 MB/s
BenchmarkCount8/sse2/512-16 26528048 40.21 ns/op 12732.26 MB/s
BenchmarkCount8/sse2/1000-16 17626179 67.69 ns/op 14773.48 MB/s
BenchmarkCount8/sse2/10000-16 2042779 557.2 ns/op 17946.54 MB/s
BenchmarkCount8/sse2/100000-16 217683 5400 ns/op 18519.26 MB/s
BenchmarkCount8/sse2/1000000-16 21778 55213 ns/op 18111.77 MB/s
BenchmarkCount8/sse2/10000000-16 1868 642465 ns/op 15565.06 MB/s
BenchmarkCount8/sse2/100000000-16 151 7525209 ns/op 13288.67 MB/s
BenchmarkCount8/generic/32-16 34176175 33.74 ns/op 948.44 MB/s
BenchmarkCount8/generic/64-16 17622030 65.02 ns/op 984.27 MB/s
BenchmarkCount8/generic/128-16 8667603 129.9 ns/op 985.19 MB/s
BenchmarkCount8/generic/256-16 5341986 189.7 ns/op 1349.80 MB/s
BenchmarkCount8/generic/512-16 2922559 379.3 ns/op 1349.78 MB/s
BenchmarkCount8/generic/1000-16 1484482 765.8 ns/op 1305.89 MB/s
BenchmarkCount8/generic/10000-16 160234 7287 ns/op 1372.38 MB/s
BenchmarkCount8/generic/100000-16 16412 70696 ns/op 1414.50 MB/s
BenchmarkCount8/generic/1000000-16 1756 715249 ns/op 1398.11 MB/s
BenchmarkCount8/generic/10000000-16 163 7090338 ns/op 1410.37 MB/s
BenchmarkCount8/generic/100000000-16 15 69222069 ns/op 1444.63 MB/s
from pospop.
@shenwei356 Thank you for your input. I'll see if I can add a fast path for very short arrays. I mean, such a fast path kinda already exists. Maybe it is not good enough.
from pospop.
Cool! I have also started learning some basic Go assembly and wrote a package with avo: https://github.com/shenwei356/pand . Could you please take a look? :)
from pospop.
@shenwei356 Looks good! Some suggestions:
- conditional branches are expensive if they are hard to predict. Consider reducing the amount of conditional layers to just two (one SIMD register full, then byte-by-byte)
- also consider unrolling the main loop a bit more
- for AVX-512 you can use masking instead of a separate loop to deal with the tail
- you should align at least one of the inputs to one SIMD register worth of data before you start with the main loop. Memory accesses crossing cache line boundaries incur an extra penalty
- there is probably not too much of a benefit in using 512 bit registers here since the code is largely memory bound. Using 512 bit registers incurs a thermal throttle, so it's only advisable for long compute bound sections
- instead of moving two pointers and an index, consider using a double-indexed addressing mode so you only have to advance one register per iteration
- the tail code is wrong: it always writes 8 full bytes, even if the slice is shorter. This causes incorrect results when, for example, you slice from a larger array and only compute the bitwise AND of the small slice
For inspiration on how to do better, consider asking a C compiler. For example, clang suggests this kind of code for AVX2, which addresses the issues I pointed out.
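Separately from the clang output mentioned above, the tail-handling point can be sketched in plain Go. This is a minimal model under assumptions, not pand's actual API: `andBytes` is a hypothetical function that uses just two layers (8-byte words, then single bytes) and never writes past the shortest of the three slices.

```go
package main

import "encoding/binary"

// andBytes is a hypothetical pure-Go model of a bitwise AND kernel.
// It stores a[i] & b[i] into dst[i] for every i below the shortest of the
// three lengths and returns how many bytes it wrote.
func andBytes(dst, a, b []byte) int {
	n := len(dst)
	if len(a) < n {
		n = len(a)
	}
	if len(b) < n {
		n = len(b)
	}
	i := 0
	// Main loop: one full 8-byte word per iteration.
	for ; i+8 <= n; i += 8 {
		x := binary.LittleEndian.Uint64(a[i:])
		y := binary.LittleEndian.Uint64(b[i:])
		binary.LittleEndian.PutUint64(dst[i:], x&y)
	}
	// Tail: byte by byte, so nothing at or beyond index n is touched, even
	// when dst is a short slice of a larger backing array.
	for ; i < n; i++ {
		dst[i] = a[i] & b[i]
	}
	return n
}
```

The byte-wise tail is what keeps the result correct when `dst` was sliced from a larger array: bytes of the backing array outside the slice stay as they were.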
from pospop.
Thanks, I need some time to digest these suggestions.
from pospop.