Shaojie WANG's Projects
Learning the AITemplate codebase
AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
A list of awesome compiler projects and papers for tensor computation and deep learning.
Buy things from the Taobao website
Collect performance data for CK/MISA/MIOpen to quickly create presentation sheets.
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
How to design a CPU GEMM on x86 with AVX2 (256-bit) that can beat OpenBLAS.
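As a rough illustration of the idea behind that project (not its actual kernel), a competitive CPU GEMM starts from cache blocking with the reduction loop hoisted for operand reuse; the block size `BLK` and function name below are hypothetical, and a real AVX2 implementation would add register tiling and `_mm256_fmadd_ps` intrinsics on top of this loop structure:

```c
#include <stddef.h>
#include <string.h>

#define BLK 64  /* hypothetical cache-block size; tuned per CPU in practice */

/* Cache-blocked GEMM sketch: C = A * B, row-major n x n matrices.
 * The i/k/j blocking keeps tiles of A and B hot in cache, and hoisting
 * A[i][k] out of the innermost loop exposes it for register reuse. */
static void gemm_blocked(const float *A, const float *B, float *C, size_t n) {
    memset(C, 0, n * n * sizeof(float));
    for (size_t i0 = 0; i0 < n; i0 += BLK)
        for (size_t k0 = 0; k0 < n; k0 += BLK)
            for (size_t j0 = 0; j0 < n; j0 += BLK)
                for (size_t i = i0; i < i0 + BLK && i < n; ++i)
                    for (size_t k = k0; k < k0 + BLK && k < n; ++k) {
                        float a = A[i * n + k];  /* reused across the j loop */
                        for (size_t j = j0; j < j0 + BLK && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The innermost `j` loop is contiguous over both `B` and `C`, which is what a compiler (or hand-written intrinsics) can vectorize with 256-bit FMAs.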
CUDA Templates for Linear Algebra Subroutines
Transformer related optimization, including BERT, GPT
FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/
Implement an assembly GEMM on Vega64 for 4096x4096 FP32 matrices
A helper to check a GPU kernel's shared memory usage
GPU coding practice
A performance benchmark for GPGPUs and GPU-based AI chips.
14 basic topics for Vega64 performance optimization
Open deep learning compiler stack for CPUs, GPUs and specialized accelerators
This is where I keep code for some Kaggle problems. (卡狗 is my nickname for Kaggle.)
Code implementations for the book Statistical Learning Methods (《统计学习方法》)
Inference code for LLaMA models
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
This is the canonical git mirror of the LLVM subversion repository. The repository does not accept github pull requests at this moment. Please submit your patches at http://reviews.llvm.org.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Ongoing research training transformer models at scale
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle 『飞桨』: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
Tensors and Dynamic neural networks in Python with strong GPU acceleration
ROCm Communication Collectives Library (RCCL)