TensorOp Matmul Tutorial

This is an example repo for CUDA MatMul implementations. Its aim is to give CUDA beginners some insight into high-performance kernel design. Currently, the only implementation examples are in examples/matmul/this. Contributions of more kernels and other MatMul implementations are very welcome.

About

A detailed explanation of the different versions of the MatMul kernels is in examples/matmul/this.

Contents

  • examples:

    • matmul: The MatMul implementations

      • this: The MatMul implemented by this repo
      • cublas: Call CuBLAS for performance test
      • cutlass: Call CUTLASS for performance test
      • mlir-gen: The CUDA code generated by MLIR
      • triton: Call Triton for performance test
      • tvm: Call Relay+CUTLASS/CuBLAS or TensorIR for performance test
    • atom: Usage of individual intrinsics/instructions

    • reduction: Some reduction kernels for epilogue

Performance Results

Figure: Overall performance comparison among Relay, CuBLAS, CUTLASS, TensorIR, Triton, and our implementations. The y-axis is the speedup over Relay+CUTLASS.

Overall, the geometric mean speedup of our implementations is 1.73x over Relay+CUTLASS, 1.22x over TensorIR (1000 MetaSchedule tuning trials per case), 1.00x over CuBLAS, 0.999x over CUTLASS, and 1.07x over Triton. The 61 shapes are:

No. M N K
1 5376 5376 2048
2 5376-128 5376 2048
3 5376-2*128 5376 2048
... ... ... ...
11 5376-10*128 5376 2048
12 5376+128 5376 2048
13 5376+2*128 5376 2048
... ... ... ...
21 5376+10*128 5376 2048
22 5376 5376-128 2048
23 5376 5376-2*128 2048
... ... ... ...
31 5376 5376-10*128 2048
32 5376 5376+128 2048
33 5376 5376+2*128 2048
... ... ... ...
41 5376 5376+10*128 2048
42 5376 5376 2048-128
43 5376 5376 2048-2*128
... ... ... ...
51 5376 5376 2048-10*128
52 5376 5376 2048+128
53 5376 5376 2048+2*128
... ... ... ...
61 5376 5376 2048+10*128
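
The shape list and the aggregation used above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the repo: the names `benchmark_shapes`, `matmul_flops`, and `geometric_mean` are hypothetical, and the rows elided with "..." in the table are assumed to continue the same 128-step pattern.

```python
import math

BASE = (5376, 5376, 2048)  # (M, N, K) of shape No. 1
STEP = 128                 # step size used in the table
NUM_STEPS = 10             # steps in each direction per dimension


def benchmark_shapes():
    """Return (M, N, K) tuples in table order: base, then M-, M+, N-, N+, K-, K+."""
    shapes = [BASE]
    for dim in range(3):           # 0 = M, 1 = N, 2 = K
        for sign in (-1, +1):      # the table lists decrements before increments
            for i in range(1, NUM_STEPS + 1):
                shape = list(BASE)
                shape[dim] += sign * i * STEP
                shapes.append(tuple(shape))
    return shapes


def matmul_flops(m, n, k):
    """Floating-point operations of an (M x K) by (K x N) product: one multiply and one add per term."""
    return 2 * m * n * k


def geometric_mean(speedups):
    """Geometric mean of per-shape speedups, computed in log space."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


shapes = benchmark_shapes()
print(len(shapes))   # 61
print(shapes[1])     # (5248, 5376, 2048), i.e. M = 5376-128
print(shapes[60])    # (5376, 5376, 3328), i.e. K = 2048+10*128
```

Given measured runtimes per shape, a per-shape speedup is the baseline time divided by the candidate time, and `geometric_mean` over the 61 values gives a single summary number of the kind quoted above. The geometric mean is the usual choice for ratios because a 2x speedup and a 0.5x slowdown cancel to 1x.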

MLIR Generated CUDA kernels

I also use MLIR to generate MatMul kernels. The generated kernels are in examples/matmul/mlir-gen. Their performance relative to the handwritten ones (examples/matmul/this) is shown below. Because the MLIR-generated kernels implement only part of the optimizations used by the handwritten ones, we call the MLIR-generated kernels "partial" and the handwritten ones "full".

Figure: Performance of MLIR-generated (partial) kernels relative to handwritten (full) kernels.

Overall, the MLIR-generated versions achieve 86% of the performance of the handwritten kernels.

Plan

More kernels

I plan to implement kernels for other operators, such as softmax, in the future.

Use CUTLASS in implementation

I plan to use the CuTe interface of CUTLASS to implement high-performance kernels.

Contributors

knowingnothing, l1nkr
