
Comments (19)

shenwei356 commented on September 21, 2024
$ grep 'model name' /proc/cpuinfo
model name      : AMD Ryzen 7 2700X Eight-Core Processor

$ grep 'avx2' /proc/cpuinfo          
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca

from pospop.

clausecker commented on September 21, 2024

Thanks for the report! Please let me know what CPU model you are programming for so I can investigate this further.


clausecker commented on September 21, 2024

Starting using bit matrix transpositions: 41dbbc5 (speedup after d24d616)

Note that until d6e39e5, the count8 code changed very little and never used the transposing kernel, as I found it to be slower than the other approach (using VPMOVMSKB). Nevertheless, the new and improved kernel code of 836a368 proved to be faster than the old variant on my Skylake machine, and I suppose it can be made faster on your system, too, with some tweaking.
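
For readers unfamiliar with what the count8 kernels compute, here is a minimal scalar sketch of a positional population count over bytes. The helper name `count8Reference` is hypothetical and this is not the library's implementation; it only illustrates the result that the VPMOVMSKB-based and transposing kernels are both expected to produce.

```go
package main

import "fmt"

// count8Reference is a hypothetical reference helper (not part of pospop).
// For each bit position i in 0..7 it adds to counts[i] the number of bytes
// in buf that have bit i set.
func count8Reference(counts *[8]int, buf []uint8) {
	for _, b := range buf {
		for i := 0; i < 8; i++ {
			counts[i] += int((b >> i) & 1)
		}
	}
}

func main() {
	var counts [8]int
	// 0x01 sets bit 0, 0x03 sets bits 0 and 1, 0x80 sets bit 7.
	count8Reference(&counts, []uint8{0x01, 0x03, 0x80})
	fmt.Println(counts) // prints [2 1 0 0 0 0 0 1]
}
```

A simple loop like this is also a convenient oracle when checking that two kernel variants agree on the same input.
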
On Linux, type

grep 'model name' /proc/cpuinfo

to find the CPU model. That's the information I need.


clausecker commented on September 21, 2024

Aha! I've never tested on Ryzen. Let me investigate that for you.


clausecker commented on September 21, 2024

Could you please post a full set of benchmark outputs for the current revision as well as for v1.0.4? I'd like to understand the performance characteristics better. The main problem is that Zen and Zen+ implement AVX by using a 128-bit FPU twice. Thus, my original Count8 kernel is slightly faster, as it moves some load from the FPU into scalar operations.


shenwei356 commented on September 21, 2024
d6e39e5
BenchmarkCount8/avx2/32-16              74067950                15.4 ns/op      2077.69 MB/s
BenchmarkCount8/avx2/64-16              57037659                20.8 ns/op      3079.35 MB/s
BenchmarkCount8/avx2/128-16             38254251                30.1 ns/op      4247.99 MB/s
BenchmarkCount8/avx2/256-16             24741512                52.4 ns/op      4890.02 MB/s
BenchmarkCount8/avx2/512-16             31537677                38.5 ns/op      13298.46 MB/s
BenchmarkCount8/avx2/1000-16            19230327                60.8 ns/op      16436.63 MB/s
BenchmarkCount8/avx2/10000-16            2023684               539 ns/op        18567.05 MB/s
BenchmarkCount8/avx2/100000-16            258202              4568 ns/op        21889.58 MB/s
BenchmarkCount8/avx2/1000000-16            24387             45925 ns/op        21774.52 MB/s
BenchmarkCount8/avx2/10000000-16                    2398            485719 ns/op        20588.05 MB/s
BenchmarkCount8/avx2/100000000-16                    232           4848575 ns/op        20624.62 MB/s
BenchmarkCount8/sse2/32-16                      76123047                15.1 ns/op      2114.13 MB/s
BenchmarkCount8/sse2/64-16                      57018202                20.9 ns/op      3064.25 MB/s
BenchmarkCount8/sse2/128-16                     35960908                33.7 ns/op      3801.73 MB/s
BenchmarkCount8/sse2/256-16                     44616007                25.4 ns/op      10067.82 MB/s
BenchmarkCount8/sse2/512-16                     30101331                41.9 ns/op      12208.97 MB/s
BenchmarkCount8/sse2/1000-16                    16921921                70.7 ns/op      14148.10 MB/s
BenchmarkCount8/sse2/10000-16                    1937534               608 ns/op        16439.59 MB/s
BenchmarkCount8/sse2/100000-16                    188360              5801 ns/op        17237.63 MB/s
BenchmarkCount8/sse2/1000000-16                    20062             57294 ns/op        17453.74 MB/s
BenchmarkCount8/sse2/10000000-16                    1959            603064 ns/op        16582.00 MB/s
BenchmarkCount8/sse2/100000000-16                    190           6374957 ns/op        15686.38 MB/s
BenchmarkCount8/generic/32-16                   15368632                80.0 ns/op       399.99 MB/s
BenchmarkCount8/generic/64-16                    8109308               145 ns/op         441.79 MB/s
BenchmarkCount8/generic/128-16                   3836839               280 ns/op         457.66 MB/s
BenchmarkCount8/generic/256-16                   2114299               543 ns/op         471.73 MB/s
BenchmarkCount8/generic/512-16                   1076280              1079 ns/op         474.31 MB/s
BenchmarkCount8/generic/1000-16                   552895              2162 ns/op         462.43 MB/s
BenchmarkCount8/generic/10000-16                   49760             20990 ns/op         476.41 MB/s
BenchmarkCount8/generic/100000-16                   5702            209239 ns/op         477.92 MB/s
BenchmarkCount8/generic/1000000-16                   500           2106739 ns/op         474.67 MB/s
BenchmarkCount8/generic/10000000-16                   55          21090807 ns/op         474.14 MB/s
BenchmarkCount8/generic/100000000-16                   5         207750018 ns/op         481.35 MB/s
BenchmarkCount16/avx2/32-16                     68553722                17.3 ns/op      1845.48 MB/s
BenchmarkCount16/avx2/64-16                     52283364                22.4 ns/op      2856.84 MB/s
BenchmarkCount16/avx2/128-16                    36027898                32.1 ns/op      3991.84 MB/s
BenchmarkCount16/avx2/256-16                    21851757                51.5 ns/op      4969.70 MB/s
BenchmarkCount16/avx2/512-16                    31586925                39.5 ns/op      12947.11 MB/s
BenchmarkCount16/avx2/1000-16                   18997122                65.5 ns/op      15271.71 MB/s
BenchmarkCount16/avx2/10000-16                   2206296               516 ns/op        19379.26 MB/s
BenchmarkCount16/avx2/100000-16                   250183              4701 ns/op        21270.91 MB/s
BenchmarkCount16/avx2/1000000-16                   24770             45694 ns/op        21884.69 MB/s
BenchmarkCount16/avx2/10000000-16                   2401            472581 ns/op        21160.40 MB/s
BenchmarkCount16/avx2/100000000-16                   231           4895301 ns/op        20427.75 MB/s
BenchmarkCount16/sse2/32-16                     77070478                15.4 ns/op      2071.72 MB/s
BenchmarkCount16/sse2/64-16                     53893412                21.8 ns/op      2931.79 MB/s
BenchmarkCount16/sse2/128-16                    34292407                34.1 ns/op      3756.48 MB/s
BenchmarkCount16/sse2/256-16                    43857051                26.4 ns/op      9682.58 MB/s
BenchmarkCount16/sse2/512-16                    24215067                42.7 ns/op      11981.72 MB/s
BenchmarkCount16/sse2/1000-16                   16146453                76.2 ns/op      13124.74 MB/s
BenchmarkCount16/sse2/10000-16                   1694382               610 ns/op        16395.64 MB/s
BenchmarkCount16/sse2/100000-16                   201802              5603 ns/op        17848.50 MB/s
BenchmarkCount16/sse2/1000000-16                   21052             57298 ns/op        17452.57 MB/s
BenchmarkCount16/sse2/10000000-16                   1906            600737 ns/op        16646.22 MB/s
BenchmarkCount16/sse2/100000000-16                   180           6108449 ns/op        16370.77 MB/s
BenchmarkCount16/generic/32-16                  14856538                80.5 ns/op       397.56 MB/s
BenchmarkCount16/generic/64-16                   8379171               144 ns/op         445.30 MB/s
BenchmarkCount16/generic/128-16                  4122723               275 ns/op         464.92 MB/s
BenchmarkCount16/generic/256-16                  2119821               532 ns/op         480.80 MB/s
BenchmarkCount16/generic/512-16                  1000000              1052 ns/op         486.63 MB/s
BenchmarkCount16/generic/1000-16                  579739              2011 ns/op         497.30 MB/s
BenchmarkCount16/generic/10000-16                  54300             19199 ns/op         520.86 MB/s
BenchmarkCount16/generic/100000-16                  5833            198476 ns/op         503.84 MB/s
BenchmarkCount16/generic/1000000-16                  518           1984898 ns/op         503.80 MB/s
BenchmarkCount16/generic/10000000-16                  60          20070014 ns/op         498.26 MB/s
BenchmarkCount16/generic/100000000-16                  5         200073221 ns/op         499.82 MB/s
BenchmarkCount32/avx2/32-16                     68429533                17.1 ns/op      1873.94 MB/s
BenchmarkCount32/avx2/64-16                     47244782                22.2 ns/op      2883.31 MB/s
BenchmarkCount32/avx2/128-16                    36776332                33.5 ns/op      3817.41 MB/s
BenchmarkCount32/avx2/256-16                    19787458                53.5 ns/op      4782.09 MB/s
BenchmarkCount32/avx2/512-16                    27683703                40.1 ns/op      12783.01 MB/s
BenchmarkCount32/avx2/1000-16                   18520581                62.6 ns/op      15965.38 MB/s
BenchmarkCount32/avx2/10000-16                   2222316               537 ns/op        18627.00 MB/s
BenchmarkCount32/avx2/100000-16                   248487              4571 ns/op        21876.67 MB/s
BenchmarkCount32/avx2/1000000-16                   24266             47142 ns/op        21212.68 MB/s
BenchmarkCount32/avx2/10000000-16                   2472            499993 ns/op        20000.29 MB/s
BenchmarkCount32/avx2/100000000-16                   235           4939368 ns/op        20245.51 MB/s
BenchmarkCount32/sse2/32-16                     69231907                18.2 ns/op      1759.66 MB/s
BenchmarkCount32/sse2/64-16                     43461750                24.3 ns/op      2634.68 MB/s
BenchmarkCount32/sse2/128-16                    32750986                36.3 ns/op      3527.61 MB/s
BenchmarkCount32/sse2/256-16                    41501784                28.7 ns/op      8907.62 MB/s
BenchmarkCount32/sse2/512-16                    25459093                45.1 ns/op      11344.38 MB/s
BenchmarkCount32/sse2/1000-16                   16290914                73.9 ns/op      13522.65 MB/s
BenchmarkCount32/sse2/10000-16                   1906378               612 ns/op        16343.56 MB/s
BenchmarkCount32/sse2/100000-16                   200224              5768 ns/op        17336.41 MB/s
BenchmarkCount32/sse2/1000000-16                   19930             57189 ns/op        17485.98 MB/s
BenchmarkCount32/sse2/10000000-16                   1924            607809 ns/op        16452.54 MB/s
BenchmarkCount32/sse2/100000000-16                   202           5915212 ns/op        16905.57 MB/s
BenchmarkCount32/generic/32-16                  12836344                90.3 ns/op       354.54 MB/s
BenchmarkCount32/generic/64-16                   7625206               154 ns/op         416.33 MB/s
BenchmarkCount32/generic/128-16                  3842911               274 ns/op         467.84 MB/s
BenchmarkCount32/generic/256-16                  2111515               542 ns/op         472.63 MB/s
BenchmarkCount32/generic/512-16                  1142546              1044 ns/op         490.51 MB/s
BenchmarkCount32/generic/1000-16                  580545              1992 ns/op         501.99 MB/s
BenchmarkCount32/generic/10000-16                  51256             19798 ns/op         505.11 MB/s
BenchmarkCount32/generic/100000-16                  5929            197870 ns/op         505.38 MB/s
BenchmarkCount32/generic/1000000-16                  615           2001184 ns/op         499.70 MB/s
BenchmarkCount32/generic/10000000-16                  60          19551422 ns/op         511.47 MB/s
BenchmarkCount32/generic/100000000-16                  5         200235963 ns/op         499.41 MB/s
BenchmarkCount64/avx2/32-16                     56106811                20.7 ns/op      1545.96 MB/s
BenchmarkCount64/avx2/64-16                     45492897                25.3 ns/op      2532.35 MB/s
BenchmarkCount64/avx2/128-16                    32034108                36.1 ns/op      3541.27 MB/s
BenchmarkCount64/avx2/256-16                    21864559                54.7 ns/op      4678.54 MB/s
BenchmarkCount64/avx2/512-16                    28554049                43.4 ns/op      11806.46 MB/s
BenchmarkCount64/avx2/1000-16                   17307355                65.9 ns/op      15163.51 MB/s
BenchmarkCount64/avx2/10000-16                   2158840               536 ns/op        18670.93 MB/s
BenchmarkCount64/avx2/100000-16                   247810              4773 ns/op        20951.19 MB/s
BenchmarkCount64/avx2/1000000-16                   24412             46870 ns/op        21335.82 MB/s
BenchmarkCount64/avx2/10000000-16                   2444            483615 ns/op        20677.60 MB/s
BenchmarkCount64/avx2/100000000-16                   232           4956579 ns/op        20175.20 MB/s
BenchmarkCount64/sse2/32-16                     53249934                22.6 ns/op      1413.04 MB/s
BenchmarkCount64/sse2/64-16                     42888222                28.0 ns/op      2289.32 MB/s
BenchmarkCount64/sse2/128-16                    29366755                40.2 ns/op      3184.26 MB/s
BenchmarkCount64/sse2/256-16                    35718535                32.2 ns/op      7952.15 MB/s
BenchmarkCount64/sse2/512-16                    23685950                47.8 ns/op      10718.39 MB/s
BenchmarkCount64/sse2/1000-16                   14958813                78.2 ns/op      12790.76 MB/s
BenchmarkCount64/sse2/10000-16                   1871818               613 ns/op        16307.44 MB/s
BenchmarkCount64/sse2/100000-16                   213234              5746 ns/op        17402.78 MB/s
BenchmarkCount64/sse2/1000000-16                   20287             57818 ns/op        17295.75 MB/s
BenchmarkCount64/sse2/10000000-16                   1917            602384 ns/op        16600.71 MB/s
BenchmarkCount64/sse2/100000000-16                   195           6003837 ns/op        16656.01 MB/s
BenchmarkCount64/generic/32-16                   9844602               118 ns/op         271.61 MB/s
BenchmarkCount64/generic/64-16                   5404071               189 ns/op         337.91 MB/s
BenchmarkCount64/generic/128-16                  3587383               306 ns/op         418.25 MB/s
BenchmarkCount64/generic/256-16                  2038212               557 ns/op         459.82 MB/s
BenchmarkCount64/generic/512-16                  1088886              1088 ns/op         470.66 MB/s
BenchmarkCount64/generic/1000-16                  564288              2081 ns/op         480.61 MB/s
BenchmarkCount64/generic/10000-16                  58940             19670 ns/op         508.40 MB/s
BenchmarkCount64/generic/100000-16                  5829            197991 ns/op         505.07 MB/s
BenchmarkCount64/generic/1000000-16                  520           2018133 ns/op         495.51 MB/s
BenchmarkCount64/generic/10000000-16                  60          19406719 ns/op         515.29 MB/s
BenchmarkCount64/generic/100000000-16                  6         188469498 ns/op         530.59 MB/s
v1.0.4
BenchmarkCount8/avx2/32-16              113284057               10.4 ns/op      3077.83 MB/s
BenchmarkCount8/avx2/64-16              73449952                15.4 ns/op      4160.15 MB/s
BenchmarkCount8/avx2/128-16             47169706                25.2 ns/op      5074.83 MB/s
BenchmarkCount8/avx2/256-16             21865110                54.3 ns/op      4714.16 MB/s
BenchmarkCount8/avx2/512-16             34523324                34.5 ns/op      14859.67 MB/s
BenchmarkCount8/avx2/1000-16            22098831                53.8 ns/op      18583.47 MB/s
BenchmarkCount8/avx2/10000-16            2579947               472 ns/op        21174.07 MB/s
BenchmarkCount8/avx2/100000-16            289872              3960 ns/op        25252.14 MB/s
BenchmarkCount8/avx2/1000000-16            28410             39198 ns/op        25511.27 MB/s
BenchmarkCount8/avx2/10000000-16                    2738            421837 ns/op        23705.83 MB/s
BenchmarkCount8/avx2/100000000-16                    267           4081937 ns/op        24498.17 MB/s
BenchmarkCount8/sse2/32-16                      78757671                15.5 ns/op      2067.16 MB/s
BenchmarkCount8/sse2/64-16                      53975407                21.9 ns/op      2925.07 MB/s
BenchmarkCount8/sse2/128-16                     34664730                32.9 ns/op      3889.01 MB/s
BenchmarkCount8/sse2/256-16                     47214710                24.8 ns/op      10318.63 MB/s
BenchmarkCount8/sse2/512-16                     27582103                42.2 ns/op      12120.04 MB/s
BenchmarkCount8/sse2/1000-16                    16634394                70.2 ns/op      14246.76 MB/s
BenchmarkCount8/sse2/10000-16                    2056755               610 ns/op        16385.02 MB/s
BenchmarkCount8/sse2/100000-16                    199066              5686 ns/op        17586.62 MB/s
BenchmarkCount8/sse2/1000000-16                    20299             58304 ns/op        17151.34 MB/s
BenchmarkCount8/sse2/10000000-16                    1942            592495 ns/op        16877.79 MB/s
BenchmarkCount8/sse2/100000000-16                    200           6038125 ns/op        16561.43 MB/s
BenchmarkCount8/generic/32-16                   14566357                79.5 ns/op       402.32 MB/s
BenchmarkCount8/generic/64-16                    8044152               150 ns/op         427.21 MB/s
BenchmarkCount8/generic/128-16                   3772062               288 ns/op         444.01 MB/s
BenchmarkCount8/generic/256-16                   2092000               551 ns/op         464.87 MB/s
BenchmarkCount8/generic/512-16                   1083063              1080 ns/op         474.01 MB/s
BenchmarkCount8/generic/1000-16                   570868              2114 ns/op         473.00 MB/s
BenchmarkCount8/generic/10000-16                   49572             20788 ns/op         481.04 MB/s
BenchmarkCount8/generic/100000-16                   5528            207059 ns/op         482.95 MB/s
BenchmarkCount8/generic/1000000-16                   600           2076995 ns/op         481.46 MB/s
BenchmarkCount8/generic/10000000-16                   55          21229168 ns/op         471.05 MB/s
BenchmarkCount8/generic/100000000-16                   5         212457142 ns/op         470.68 MB/s
BenchmarkCount16/avx2/32-16                     65816288                17.9 ns/op      1791.83 MB/s
BenchmarkCount16/avx2/64-16                     50757498                22.7 ns/op      2822.43 MB/s
BenchmarkCount16/avx2/128-16                    35025753                32.6 ns/op      3927.95 MB/s
BenchmarkCount16/avx2/256-16                    21941785                53.6 ns/op      4774.50 MB/s
BenchmarkCount16/avx2/512-16                    28513166                42.8 ns/op      11956.74 MB/s
BenchmarkCount16/avx2/1000-16                   17108034                69.9 ns/op      14298.87 MB/s
BenchmarkCount16/avx2/10000-16                   2014814               614 ns/op        16275.12 MB/s
BenchmarkCount16/avx2/100000-16                   208441              5295 ns/op        18887.42 MB/s
BenchmarkCount16/avx2/1000000-16                   21088             53613 ns/op        18652.22 MB/s
BenchmarkCount16/avx2/10000000-16                   2085            554284 ns/op        18041.29 MB/s
BenchmarkCount16/avx2/100000000-16                   213           5355222 ns/op        18673.36 MB/s
BenchmarkCount16/sse2/32-16                     73891788                15.9 ns/op      2016.92 MB/s
BenchmarkCount16/sse2/64-16                     53584813                22.2 ns/op      2888.94 MB/s
BenchmarkCount16/sse2/128-16                    33977923                35.5 ns/op      3606.94 MB/s
BenchmarkCount16/sse2/256-16                    44680690                26.5 ns/op      9649.60 MB/s
BenchmarkCount16/sse2/512-16                    27087153                43.5 ns/op      11758.51 MB/s
BenchmarkCount16/sse2/1000-16                   16411941                72.6 ns/op      13775.03 MB/s
BenchmarkCount16/sse2/10000-16                   1913266               619 ns/op        16144.38 MB/s
BenchmarkCount16/sse2/100000-16                   209222              6000 ns/op        16665.77 MB/s
BenchmarkCount16/sse2/1000000-16                   20011             59011 ns/op        16946.10 MB/s
BenchmarkCount16/sse2/10000000-16                   1972            586847 ns/op        17040.22 MB/s
BenchmarkCount16/sse2/100000000-16                   193           5713296 ns/op        17503.03 MB/s
BenchmarkCount16/generic/32-16                  14932627                75.8 ns/op       422.29 MB/s
BenchmarkCount16/generic/64-16                   8251359               143 ns/op         446.01 MB/s
BenchmarkCount16/generic/128-16                  3995899               271 ns/op         472.81 MB/s
BenchmarkCount16/generic/256-16                  2194146               516 ns/op         496.23 MB/s
BenchmarkCount16/generic/512-16                  1130038              1031 ns/op         496.69 MB/s
BenchmarkCount16/generic/1000-16                  603375              1962 ns/op         509.78 MB/s
BenchmarkCount16/generic/10000-16                  52326             19742 ns/op         506.54 MB/s
BenchmarkCount16/generic/100000-16                  6142            193415 ns/op         517.02 MB/s
BenchmarkCount16/generic/1000000-16                  568           1979633 ns/op         505.14 MB/s
BenchmarkCount16/generic/10000000-16                  57          19146762 ns/op         522.28 MB/s
BenchmarkCount16/generic/100000000-16                  6         196704699 ns/op         508.38 MB/s
BenchmarkCount32/avx2/32-16                     72864624                17.1 ns/op      1866.88 MB/s
BenchmarkCount32/avx2/64-16                     55126480                21.9 ns/op      2919.02 MB/s
BenchmarkCount32/avx2/128-16                    37342851                32.2 ns/op      3977.41 MB/s
BenchmarkCount32/avx2/256-16                    22630946                52.0 ns/op      4923.01 MB/s
BenchmarkCount32/avx2/512-16                    27548682                41.3 ns/op      12391.07 MB/s
BenchmarkCount32/avx2/1000-16                   17340909                69.7 ns/op      14337.86 MB/s
BenchmarkCount32/avx2/10000-16                   2076030               593 ns/op        16867.66 MB/s
BenchmarkCount32/avx2/100000-16                   233301              5252 ns/op        19041.26 MB/s
BenchmarkCount32/avx2/1000000-16                   21309             52752 ns/op        18956.65 MB/s
BenchmarkCount32/avx2/10000000-16                   2095            551314 ns/op        18138.48 MB/s
BenchmarkCount32/avx2/100000000-16                   208           5611711 ns/op        17819.88 MB/s
BenchmarkCount32/sse2/32-16                     65645158                17.2 ns/op      1864.68 MB/s
BenchmarkCount32/sse2/64-16                     48960619                24.1 ns/op      2650.37 MB/s
BenchmarkCount32/sse2/128-16                    32815432                35.6 ns/op      3591.24 MB/s
BenchmarkCount32/sse2/256-16                    42327723                28.2 ns/op      9086.10 MB/s
BenchmarkCount32/sse2/512-16                    20128138                59.2 ns/op      8642.98 MB/s
BenchmarkCount32/sse2/1000-16                   13085713                90.6 ns/op      11035.66 MB/s
BenchmarkCount32/sse2/10000-16                   1872972               634 ns/op        15781.59 MB/s
BenchmarkCount32/sse2/100000-16                   201450              5716 ns/op        17495.91 MB/s
BenchmarkCount32/sse2/1000000-16                   20067             59066 ns/op        16930.34 MB/s
BenchmarkCount32/sse2/10000000-16                   2084            576319 ns/op        17351.50 MB/s
BenchmarkCount32/sse2/100000000-16                   193           5898738 ns/op        16952.78 MB/s
BenchmarkCount32/generic/32-16                  13305507                90.3 ns/op       354.55 MB/s
BenchmarkCount32/generic/64-16                   7781857               151 ns/op         425.06 MB/s
BenchmarkCount32/generic/128-16                  3962030               281 ns/op         455.87 MB/s
BenchmarkCount32/generic/256-16                  2177896               525 ns/op         487.75 MB/s
BenchmarkCount32/generic/512-16                  1122494              1029 ns/op         497.53 MB/s
BenchmarkCount32/generic/1000-16                  594394              1903 ns/op         525.54 MB/s
BenchmarkCount32/generic/10000-16                  51724             19835 ns/op         504.16 MB/s
BenchmarkCount32/generic/100000-16                  5848            191398 ns/op         522.47 MB/s
BenchmarkCount32/generic/1000000-16                  518           1998705 ns/op         500.32 MB/s
BenchmarkCount32/generic/10000000-16                  61          19430005 ns/op         514.67 MB/s
BenchmarkCount32/generic/100000000-16                  6         199107659 ns/op         502.24 MB/s
BenchmarkCount64/avx2/32-16                     57851385                20.2 ns/op      1582.56 MB/s
BenchmarkCount64/avx2/64-16                     43540004                26.4 ns/op      2419.68 MB/s
BenchmarkCount64/avx2/128-16                    33230350                36.9 ns/op      3473.00 MB/s
BenchmarkCount64/avx2/256-16                    22741604                56.3 ns/op      4549.09 MB/s
BenchmarkCount64/avx2/512-16                    25663096                45.6 ns/op      11232.62 MB/s
BenchmarkCount64/avx2/1000-16                   16008861                73.2 ns/op      13658.94 MB/s
BenchmarkCount64/avx2/10000-16                   2083137               591 ns/op        16928.89 MB/s
BenchmarkCount64/avx2/100000-16                   223752              5431 ns/op        18413.79 MB/s
BenchmarkCount64/avx2/1000000-16                   21462             56225 ns/op        17785.57 MB/s
BenchmarkCount64/avx2/10000000-16                   2154            551933 ns/op        18118.14 MB/s
BenchmarkCount64/avx2/100000000-16                   206           5470769 ns/op        18278.97 MB/s
BenchmarkCount64/sse2/32-16                     53932040                22.0 ns/op      1457.63 MB/s
BenchmarkCount64/sse2/64-16                     42586113                27.8 ns/op      2304.66 MB/s
BenchmarkCount64/sse2/128-16                    28128896                39.6 ns/op      3231.79 MB/s
BenchmarkCount64/sse2/256-16                    35379745                32.2 ns/op      7956.09 MB/s
BenchmarkCount64/sse2/512-16                    23239102                48.9 ns/op      10470.67 MB/s
BenchmarkCount64/sse2/1000-16                   15173030                80.2 ns/op      12470.03 MB/s
BenchmarkCount64/sse2/10000-16                   1912424               598 ns/op        16709.88 MB/s
BenchmarkCount64/sse2/100000-16                   200026              5873 ns/op        17028.50 MB/s
BenchmarkCount64/sse2/1000000-16                   19759             57521 ns/op        17385.04 MB/s
BenchmarkCount64/sse2/10000000-16                   2040            597267 ns/op        16742.92 MB/s
BenchmarkCount64/sse2/100000000-16                   192           5978950 ns/op        16725.34 MB/s
BenchmarkCount64/generic/32-16                   9928184               117 ns/op         273.63 MB/s
BenchmarkCount64/generic/64-16                   6628939               183 ns/op         350.39 MB/s
BenchmarkCount64/generic/128-16                  3611708               304 ns/op         421.13 MB/s
BenchmarkCount64/generic/256-16                  2057230               563 ns/op         454.42 MB/s
BenchmarkCount64/generic/512-16                  1083744              1056 ns/op         484.70 MB/s
BenchmarkCount64/generic/1000-16                  576555              2028 ns/op         493.06 MB/s
BenchmarkCount64/generic/10000-16                  52099             19867 ns/op         503.35 MB/s
BenchmarkCount64/generic/100000-16                  5992            197052 ns/op         507.48 MB/s
BenchmarkCount64/generic/1000000-16                  524           1945532 ns/op         514.00 MB/s
BenchmarkCount64/generic/10000000-16                  61          19713293 ns/op         507.27 MB/s
BenchmarkCount64/generic/100000000-16                  6         194035798 ns/op         515.37 MB/s

677120e
BenchmarkCount8/avx2/32-16              180887323                6.50 ns/op     4919.63 MB/s
BenchmarkCount8/avx2/64-16              110568928               10.3 ns/op      6241.69 MB/s
BenchmarkCount8/avx2/128-16             63895208                17.8 ns/op      7181.04 MB/s
BenchmarkCount8/avx2/256-16             43572312                29.1 ns/op      8811.63 MB/s
BenchmarkCount8/avx2/512-16             41616794                29.9 ns/op      17109.62 MB/s
BenchmarkCount8/avx2/1000-16            20170315                59.1 ns/op      16918.99 MB/s
BenchmarkCount8/avx2/10000-16            2519863               453 ns/op        22055.16 MB/s
BenchmarkCount8/avx2/100000-16            302140              3975 ns/op        25154.25 MB/s
BenchmarkCount8/avx2/1000000-16            28449             38766 ns/op        25795.63 MB/s
BenchmarkCount8/avx2/10000000-16                    2769            414361 ns/op        24133.54 MB/s
BenchmarkCount8/avx2/100000000-16                    276           4010607 ns/op        24933.88 MB/s
BenchmarkCount8/popcnt/32-16                    135109920                8.60 ns/op     3722.75 MB/s
BenchmarkCount8/popcnt/64-16                    90996325                13.8 ns/op      4630.66 MB/s
BenchmarkCount8/popcnt/128-16                   43640628                26.9 ns/op      4759.17 MB/s
BenchmarkCount8/popcnt/256-16                   60230776                21.1 ns/op      12153.03 MB/s
BenchmarkCount8/popcnt/512-16                   27787201                41.8 ns/op      12261.81 MB/s
BenchmarkCount8/popcnt/1000-16                  14916432                81.4 ns/op      12292.15 MB/s
BenchmarkCount8/popcnt/10000-16                  1940953               578 ns/op        17286.56 MB/s
BenchmarkCount8/popcnt/100000-16                  210847              5706 ns/op        17524.07 MB/s
BenchmarkCount8/popcnt/1000000-16                  20604             56097 ns/op        17826.36 MB/s
BenchmarkCount8/popcnt/10000000-16                  2008            607747 ns/op        16454.21 MB/s
BenchmarkCount8/popcnt/100000000-16                  195           5686707 ns/op        17584.87 MB/s
BenchmarkCount8/sse2/32-16                      63942829                19.1 ns/op      1679.66 MB/s
BenchmarkCount8/sse2/64-16                      37631593                33.4 ns/op      1918.07 MB/s
BenchmarkCount8/sse2/128-16                     19319570                60.8 ns/op      2106.38 MB/s
BenchmarkCount8/sse2/256-16                     33807202                35.1 ns/op      7294.89 MB/s
BenchmarkCount8/sse2/512-16                     18850315                65.2 ns/op      7855.70 MB/s
BenchmarkCount8/sse2/1000-16                     8662965               121 ns/op        8291.53 MB/s
BenchmarkCount8/sse2/10000-16                    1152369              1011 ns/op        9894.85 MB/s
BenchmarkCount8/sse2/100000-16                    125665              9287 ns/op        10767.45 MB/s
BenchmarkCount8/sse2/1000000-16                    12408             95965 ns/op        10420.51 MB/s
BenchmarkCount8/sse2/10000000-16                    1267            942471 ns/op        10610.41 MB/s
BenchmarkCount8/sse2/100000000-16                    124           9998809 ns/op        10001.19 MB/s
BenchmarkCount8/generic/32-16                    7558437               168 ns/op         190.43 MB/s
BenchmarkCount8/generic/64-16                    3833438               324 ns/op         197.50 MB/s
BenchmarkCount8/generic/128-16                   1826013               638 ns/op         200.49 MB/s
BenchmarkCount8/generic/256-16                    942129              1291 ns/op         198.33 MB/s
BenchmarkCount8/generic/512-16                    451552              2506 ns/op         204.33 MB/s
BenchmarkCount8/generic/1000-16                   238472              4968 ns/op         201.27 MB/s
BenchmarkCount8/generic/10000-16                   23505             50168 ns/op         199.33 MB/s
BenchmarkCount8/generic/100000-16                   2397            485936 ns/op         205.79 MB/s
BenchmarkCount8/generic/1000000-16                   229           4884775 ns/op         204.72 MB/s
BenchmarkCount8/generic/10000000-16                   22          49935088 ns/op         200.26 MB/s
BenchmarkCount8/generic/100000000-16                   3         476854149 ns/op         209.71 MB/s

from pospop.

clausecker avatar clausecker commented on September 21, 2024

I'll work on some more improvements to the code. Let me know if they improve the performance for you to the point where the old AVX2 kernel is obsolete. If they don't, I may have to find another solution.

from pospop.

clausecker avatar clausecker commented on September 21, 2024

I've improved the performance by another 6% in 9d68d4e. Please let me know how much this closes the performance gap on Zen for you. It should be more effective because it replaces shifts with logic operations, which can run on more execution ports, and eliminates some operations altogether.

from pospop.

shenwei356 avatar shenwei356 commented on September 21, 2024

I'm sorry, almost no change.

from pospop.

clausecker avatar clausecker commented on September 21, 2024

How unfortunate. Sadly, the new code is almost 16% faster than the old Count8 kernel on Intel, so I'm a bit conflicted about keeping the old one. I'll have to think of a solution that works for both of us.

from pospop.

shenwei356 avatar shenwei356 commented on September 21, 2024

Thank you so much!

from pospop.

clausecker avatar clausecker commented on September 21, 2024

Hi @shenwei356!

In the carry branch, I have provided a version of the AVX2 kernel with a further 30% improvement. Please let me know if it finally manages to beat the original code on your machine.

from pospop.

shenwei356 avatar shenwei356 commented on September 21, 2024

Thank you for continuing to update this package!

Unfortunately, the benchmark fails to run.

AMD Ryzen 7 2700X Eight-Core Processor CPU flags

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca

$ go test -bench=Benchmark*
SIGILL: illegal instruction
PC=0x4eb5cc m=0 sigcode=2
instruction bytes: 0x62 0xf1 0x5 0x28 0xdf 0xf4 0xc5 0xcd 0x72 0xd6 0x1 0xc4 0xe3 0x55 0x46 0xe6

goroutine 19 [running]:
countavxcarry()
        /home/shenwei/shenwei/scripts/go/src/github.com/clausecker/pospop/countavx2carry_amd64.s:188 +0x40c fp=0xc00009bd68 sp=0xc00009bd38 pc=0x4eb5cc
github.com/clausecker/pospop.count8avx2carry(0xc0000e2a80, {0xc0000cd001, 0x3ff, 0x3ff})
        /home/shenwei/shenwei/scripts/go/src/github.com/clausecker/pospop/countavx2carry_amd64.s:501 +0x36 fp=0xc00009bd78 sp=0xc00009bd68 pc=0x4ebc76
github.com/clausecker/pospop.count8avx2carry(0x40cea7, {0xc0000cd001, 0x4e8bcb, 0x7})
        <autogenerated>:1 +0x2b fp=0xc00009bda8 sp=0xc00009bd78 pc=0x4ee30b
github.com/clausecker/pospop.Count8(0xc0000e2a80, {0xc0000cd001, 0x0, 0x0})
        /home/shenwei/shenwei/scripts/go/src/github.com/clausecker/pospop/dispatch.go:107 +0x25 fp=0xc00009bdd8 sp=0xc00009bda8 pc=0x4e57c5
github.com/clausecker/pospop.testCount8(0xc000083860, 0x523c58)
        /home/shenwei/shenwei/scripts/go/src/github.com/clausecker/pospop/count_test.go:59 +0x157 fp=0xc00009bf50 sp=0xc00009bdd8 pc=0x4e8e57
github.com/clausecker/pospop.TestCount8.func1(0x0)
        /home/shenwei/shenwei/scripts/go/src/github.com/clausecker/pospop/count_test.go:136 +0x25 fp=0xc00009bf70 sp=0xc00009bf50 pc=0x4ea485
testing.tRunner(0xc000083860, 0x523c98)
        /usr/local/go/src/testing/testing.go:1253 +0x102 fp=0xc00009bfc0 sp=0xc00009bf70 pc=0x4b0302
testing.(*T).Run·dwrap·21()
        /usr/local/go/src/testing/testing.go:1300 +0x2a fp=0xc00009bfe0 sp=0xc00009bfc0 pc=0x4b100a
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc00009bfe8 sp=0xc00009bfe0 pc=0x463681
created by testing.(*T).Run
        /usr/local/go/src/testing/testing.go:1300 +0x35a

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000083380, {0x51a843, 0x465d93}, 0x523ca0)
        /usr/local/go/src/testing/testing.go:1301 +0x375
testing.runTests.func1(0xc000083380)
        /usr/local/go/src/testing/testing.go:1592 +0x6e
testing.tRunner(0xc000083380, 0xc000093d18)
        /usr/local/go/src/testing/testing.go:1253 +0x102
testing.runTests(0xc0000d2000, {0x5fcd60, 0x4, 0x4}, {0x4721ed, 0x64, 0x601040})
        /usr/local/go/src/testing/testing.go:1590 +0x43f
testing.(*M).Run(0xc0000d2000)
        /usr/local/go/src/testing/testing.go:1498 +0x51d
main.main()
        _testmain.go:59 +0x14b

goroutine 18 [chan receive]:
testing.(*T).Run(0xc000083520, {0x51a135, 0x465d93}, 0x523c98)
        /usr/local/go/src/testing/testing.go:1301 +0x375
github.com/clausecker/pospop.TestCount8(0x0)
        /home/shenwei/shenwei/scripts/go/src/github.com/clausecker/pospop/count_test.go:136 +0x35
testing.tRunner(0xc000083520, 0x523ca0)
        /usr/local/go/src/testing/testing.go:1253 +0x102
created by testing.(*T).Run
        /usr/local/go/src/testing/testing.go:1300 +0x35a

rax    0xfffb
rbx    0x4eb8c0
rcx    0x0
rdx    0x1e0
rdi    0xc0000e2a80
rsi    0xc0000cd400
rbp    0xc00009bd58
rsp    0xc00009bd38
r8     0x7fe8873875b8
r9     0x0
r10    0x7fe887393c88
r11    0x0
r12    0xc0000e2a80
r13    0x1
r14    0xc000083a00
r15    0xffffffffffffffff
rip    0x4eb5cc
rflags 0x10206
cs     0x33
fs     0x0
gs     0x0
exit status 2
FAIL    github.com/clausecker/pospop    0.005s

from pospop.

clausecker avatar clausecker commented on September 21, 2024

Hi @shenwei356,

I apologise for the problem. I accidentally used an incorrect machine instruction in this kernel. I have pushed a fix for the problem in df1eae5. Please let me know if it does the trick.
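
A possible reading of the crash dump above, offered as an assumption rather than as the maintainer's diagnosis: the faulting instruction starts with the byte 0x62, which in 64-bit mode is the EVEX prefix used by AVX-512 encodings, while the reported CPU flags contain no avx512 entries. The sketch below only checks those two observations; `startsWithEVEX` and `hasFlag` are hypothetical helper names and are not part of pospop.

```go
package main

import (
	"fmt"
	"strings"
)

// hasFlag reports whether name appears as a whole word in a
// /proc/cpuinfo-style flags string. (Hypothetical helper.)
func hasFlag(flags, name string) bool {
	for _, f := range strings.Fields(flags) {
		if f == name {
			return true
		}
	}
	return false
}

// startsWithEVEX reports whether the instruction bytes begin with 0x62,
// the EVEX prefix byte in 64-bit mode. (Hypothetical helper.)
func startsWithEVEX(code []byte) bool {
	return len(code) > 0 && code[0] == 0x62
}

func main() {
	// First instruction bytes from the SIGILL report.
	code := []byte{0x62, 0xf1, 0x05, 0x28, 0xdf, 0xf4}
	// Abbreviated excerpt of the flags line from the report.
	flags := "sse sse2 ssse3 sse4_1 sse4_2 popcnt avx avx2 bmi1 bmi2"

	fmt.Println("EVEX-encoded:", startsWithEVEX(code))
	fmt.Println("CPU has avx512f:", hasFlag(flags, "avx512f"))
}
```

An EVEX-encoded instruction on a CPU without AVX-512 support would be consistent with the illegal-instruction signal, but the thread itself only says that an incorrect machine instruction was used.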

from pospop.

shenwei356 avatar shenwei356 commented on September 21, 2024

Yes! It's faster for > 1000 elements but slower for <= 512 elements.

I miss the old version 677120e, with 6069.95 MB/s for 64 elements. 😆

BenchmarkCount8/avx2carry/32-16         75166225                17.01 ns/op     1881.52 MB/s
BenchmarkCount8/avx2carry/64-16         49655649                22.10 ns/op     2896.57 MB/s
BenchmarkCount8/avx2carry/128-16        36818702                32.02 ns/op     3997.43 MB/s
BenchmarkCount8/avx2carry/256-16        23530563                53.38 ns/op     4795.83 MB/s
BenchmarkCount8/avx2carry/512-16        31999482                41.50 ns/op     12338.46 MB/s
BenchmarkCount8/avx2carry/1000-16       22948182                51.15 ns/op     19551.49 MB/s
BenchmarkCount8/avx2carry/10000-16       2811368               417.8 ns/op      23932.42 MB/s
BenchmarkCount8/avx2carry/100000-16       314949              3606 ns/op        27732.07 MB/s
BenchmarkCount8/avx2carry/1000000-16       33084             36173 ns/op        27645.11 MB/s
BenchmarkCount8/avx2carry/10000000-16       2142            558955 ns/op        17890.52 MB/s
BenchmarkCount8/avx2carry/100000000-16       181           5880131 ns/op        17006.42 MB/s
BenchmarkCount8/avx2/32-16              70049467                15.88 ns/op     2014.53 MB/s
BenchmarkCount8/avx2/64-16              54589984                21.07 ns/op     3036.93 MB/s
BenchmarkCount8/avx2/128-16             33218904                31.37 ns/op     4080.63 MB/s
BenchmarkCount8/avx2/256-16             20424726                52.44 ns/op     4882.03 MB/s
BenchmarkCount8/avx2/512-16             30637970                38.84 ns/op     13181.97 MB/s
BenchmarkCount8/avx2/1000-16            20244832                60.53 ns/op     16520.15 MB/s
BenchmarkCount8/avx2/10000-16            2364079               488.0 ns/op      20490.45 MB/s
BenchmarkCount8/avx2/100000-16            255962              4521 ns/op        22120.96 MB/s
BenchmarkCount8/avx2/1000000-16            25255             44622 ns/op        22410.66 MB/s
BenchmarkCount8/avx2/10000000-16            1872            629837 ns/op        15877.13 MB/s
BenchmarkCount8/avx2/100000000-16            152           7730327 ns/op        12936.06 MB/s
BenchmarkCount8/sse2/32-16              76672119                15.68 ns/op     2040.40 MB/s
BenchmarkCount8/sse2/64-16              52885789                21.45 ns/op     2983.43 MB/s
BenchmarkCount8/sse2/128-16             34227018                33.60 ns/op     3809.34 MB/s
BenchmarkCount8/sse2/256-16             44672299                26.23 ns/op     9760.54 MB/s
BenchmarkCount8/sse2/512-16             26528048                40.21 ns/op     12732.26 MB/s
BenchmarkCount8/sse2/1000-16            17626179                67.69 ns/op     14773.48 MB/s
BenchmarkCount8/sse2/10000-16            2042779               557.2 ns/op      17946.54 MB/s
BenchmarkCount8/sse2/100000-16            217683              5400 ns/op        18519.26 MB/s
BenchmarkCount8/sse2/1000000-16            21778             55213 ns/op        18111.77 MB/s
BenchmarkCount8/sse2/10000000-16            1868            642465 ns/op        15565.06 MB/s
BenchmarkCount8/sse2/100000000-16            151           7525209 ns/op        13288.67 MB/s
BenchmarkCount8/generic/32-16           34176175                33.74 ns/op      948.44 MB/s
BenchmarkCount8/generic/64-16           17622030                65.02 ns/op      984.27 MB/s
BenchmarkCount8/generic/128-16           8667603               129.9 ns/op       985.19 MB/s
BenchmarkCount8/generic/256-16           5341986               189.7 ns/op      1349.80 MB/s
BenchmarkCount8/generic/512-16           2922559               379.3 ns/op      1349.78 MB/s
BenchmarkCount8/generic/1000-16          1484482               765.8 ns/op      1305.89 MB/s
BenchmarkCount8/generic/10000-16          160234              7287 ns/op        1372.38 MB/s
BenchmarkCount8/generic/100000-16          16412             70696 ns/op        1414.50 MB/s
BenchmarkCount8/generic/1000000-16          1756            715249 ns/op        1398.11 MB/s
BenchmarkCount8/generic/10000000-16          163           7090338 ns/op        1410.37 MB/s
BenchmarkCount8/generic/100000000-16          15          69222069 ns/op        1444.63 MB/s

from pospop.

clausecker avatar clausecker commented on September 21, 2024

@shenwei356 Thank you for your input. I'll see if I can add a fast path for very short arrays. I mean, such a fast path kinda already exists. Maybe it is not good enough.

from pospop.

shenwei356 avatar shenwei356 commented on September 21, 2024

Cool! I've also started learning some basic Go assembly and wrote a package with avo: https://github.com/shenwei356/pand . Could you please take a look? :)

from pospop.

clausecker avatar clausecker commented on September 21, 2024

@shenwei356 Looks good! Some suggestions:

  • conditional branches are expensive if they are hard to predict. Consider reducing the number of conditional layers to just two (one full SIMD register at a time, then byte by byte)
  • also consider unrolling the main loop a bit more
  • for AVX-512 you can use masking instead of a separate loop to deal with the tail
  • you should align at least one of the inputs to one SIMD register's worth of data before you start the main loop. Memory accesses crossing cache line boundaries incur an extra penalty
  • there is probably not much benefit in using 512-bit registers here since the code is largely memory bound. Using 512-bit registers incurs a thermal throttle, so it's only advisable for long compute-bound sections
  • instead of moving two pointers and an index, consider using a double-indexed addressing mode so you only have to advance one register per iteration
  • the tail code is wrong: it always writes 8 full bytes, even if the slice is shorter. This causes incorrect results when, for example, you slice from a larger array and only compute the bitwise AND of the small slice

For inspiration on how to do better, consider asking a C compiler. For example, clang suggests this kind of code for AVX2, which addresses the issues I noted.

from pospop.

shenwei356 avatar shenwei356 commented on September 21, 2024

Thanks, I need some time to digest these suggestions.

from pospop.
