Code Monkey home page Code Monkey logo

cpu_gemm_opt's Introduction

gemm_opt

currently only consider signle cpu core, and sgemm only

optimize gemm on x86 arch, tested on Intel(R) Xeon(R) Gold 6142 CPU

  • L1d cache: 32K
  • L1i cache: 32K
  • L2 cache: 1024K
  • L3 cache: 22528K
  • L1d TLB : 64 entries

most of the size exceed openblas(0.3.4)

detail is in this PDF for how to design the blocking/kernel.

some reference resource on gemm:

notes:

  • 96% cpu usage perf is achieved by tune the parameter. wait for tuning finish to update following chart
  • size below like 768 can't achieve 96%, may need reconsideration on blocking method
  • more fine grain optimization for micro kernel is needed for better perf above 96%, target 98%~99%
# 6x16 micro kernel
# need disable intel HT(hyperthread) in BIOS. also, frequency boost is prefered to close.
# ./gemm_driver  -kc 360 -nc 672 -mc 3072
# following char is used by tuned db

cpu:2, freq: 2600.0MHz, theoritical: 81.250 gflops (avx256,fmadd)
l1_size:32K, l2_size:1M, l3_size:22M, page_size:4096, tlb_entry_l1d:64
MC:3072, NC:672, KC:360, MR:6, NR:16
layout:CblasRowMajor, trans_a:CblasNoTrans, trans_b:CblasNoTrans
Considerations:
 L1: MR*KC+NR*KC+MR*NR+NR+MR*NR < L1_size/d_size, lhs:8128, rhs:8192, match?yes
 L2: NC*KC+MR*KC+MR*NC+MR*KC+MR*NC < L2_size/d_size, lhs:254304, rhs:262144, match?yes
 L3: MC*KC+NC*KC+MC*NC+NC*KC+MC*NC < L3_size/d_size, lhs:5718528, rhs:5767168, match?yes
 L1D TLB:
  TA:CEIL(MR*KC*d_size/PAGE_SIZE)+1, 4
  TB:CEIL(NR*KC*d_size/PAGE_SIZE)+1, 7
  TC:up to MR, 6
  TA+2*(TB)+TC < T_entry_total, lhs:24, rhs:64, match?yes

    M    N    K alpha beta   mc    nc   kc  mr  nr   gflops(%)   gflops_ref(%)
   48   48   48  1.0  1.0    108   16   48   6  16  60.13(74.01)  43.05(52.98)  [t]
   96   96   96  1.0  1.0    456   48   96   6  16  67.06(82.53)  56.63(69.70)  [t]
  144  144  144  1.0  1.0    624  448  160   6  16  55.85(68.74)  64.26(79.09)  [t]
  192  192  192  1.0  1.0    432  144  240   6  16  75.79(93.28)  68.14(83.86)  [t]
  240  240  240  1.0  1.0    516  112  320   6  16  74.75(92.00)  69.70(85.78)  [t]
  288  288  288  1.0  1.0    384   80  336   6  16  74.42(91.60)  69.12(85.07)  [t]
  384  384  384  1.0  1.0    432  240  224   6  16  75.11(92.45)  70.93(87.30)  [t]
  480  480  480  1.0  1.0    624  224  256   6  16  76.48(94.13)  73.23(90.13)  [t]
  576  576  576  1.0  1.0    816  240  320   6  16  76.76(94.47)  74.45(91.64)  [t]
  768  768  768  1.0  1.0   1344  384  256   6  16  77.69(95.62)  71.73(88.29)  [t]
  960  960  960  1.0  1.0   1872  336  320   6  16  78.31(96.38)  75.11(92.44)  [t]
 1152 1152 1152  1.0  1.0   2280  384  352   6  16  78.26(96.32)  74.82(92.08)  [t]
 1344 1344 1344  1.0  1.0   1656  448  352   6  16  78.07(96.08)  74.73(91.97)  [t]
 1536 1536 1536  1.0  1.0   2808  512  256   6  16  78.15(96.19)  71.88(88.47)  [t]
 1728 1728 1728  1.0  1.0   2424  448  352   6  16  78.21(96.26)  74.16(91.27)  [t]
 1920 1920 1920  1.0  1.0   3336  384  352   6  16  78.12(96.15)  73.65(90.64)  [t]
 2112 2112 2112  1.0  1.0   2496  448  352   6  16  77.45(95.32)  73.86(90.91)  [t]
 2400 2400 2400  1.0  1.0   2640  448  352   6  16  77.02(94.79)  73.63(90.62)  [t]
 2688 2688 2688  1.0  1.0   2976  448  352   6  16  77.66(95.58)  73.23(90.13)  [t]
 2976 2976 2976  1.0  1.0   3168  576  352   6  16  78.34(96.41)  72.97(89.80)  [t]
 3264 3264 3264  1.0  1.0   3456  640  256   6  16  78.20(96.24)  73.41(90.35)  [t]
 3552 3552 3552  1.0  1.0   3552  576  352   6  16  78.36(96.45)  73.03(89.88)  [t]
 3840 3840 3840  1.0  1.0   3888  512  352   6  16  78.49(96.60)  72.25(88.93)  [t]
 4128 4128 4128  1.0  1.0   4128  448  352   6  16  78.37(96.46)  73.30(90.21)  [t]
 4512 4512 4512  1.0  1.0   2256  576  352   6  16  78.35(96.43)  72.54(89.29)  [t]
 4896 4896 4896  1.0  1.0   4896  384  352   6  16  78.44(96.54)  73.32(90.25)  [t]
 5280 5280 5280  1.0  1.0   2640  576  352   6  16  78.42(96.51)  72.80(89.61)  [t]
 5664 5664 5664  1.0  1.0   3792  512  352   6  16  78.33(96.41)  73.26(90.16)  [t]
 6048 6048 6048  1.0  1.0   3120  576  352   6  16  78.38(96.47)  72.67(89.44)  [t]
 6432 6432 6432  1.0  1.0   3456  576  352   6  16  78.46(96.56)  73.16(90.05)  [t]
 6816 6816 6816  1.0  1.0   3456  576  352   6  16  78.56(96.69)  72.80(89.60)  [t]
 7200 7200 7200  1.0  1.0   3696  512  352   6  16  78.54(96.67)  72.98(89.82)  [t]
 7584 7584 7584  1.0  1.0   3840  512  352   6  16  78.59(96.72)  72.66(89.43)  [t]
 7968 7968 7968  1.0  1.0   4080  448  352   6  16  78.54(96.67)  72.92(89.74)  [t]
 8352 8352 8352  1.0  1.0   4320  448  352   6  16  78.56(96.68)  72.56(89.31)  [t]
 8736 8736 8736  1.0  1.0   4368  448  352   6  16  78.70(96.87)  73.14(90.02)  [t]
 9120 9120 9120  1.0  1.0   3072  576  352   6  16  78.62(96.76)  72.70(89.48)  [t]


cpu_gemm_opt's People

Contributors

carlushuang avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

cpu_gemm_opt's Issues

Performance of mkl's sgemm

Powerful designs and beautiful introductions~ How about the 1-thread avx2 sgemm performance of Intel MKL (MKL_CBWR="AVX2") on Intel Xeon Gold 6142 ? I'm just curious about that..

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.