
Composable Kernel

The Composable Kernel (CK) library provides a programming model for writing performance-critical kernels for machine learning workloads across multiple architectures (GPUs, CPUs, etc.). The CK library uses general-purpose kernel languages, such as HIP C++.

CK uses two concepts to achieve performance portability and code maintainability:

  • A tile-based programming model
  • Algorithm complexity reduction for complex machine learning (ML) operators. This uses an innovative technique called Tensor Coordinate Transformation.
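As a toy illustration of the idea behind Tensor Coordinate Transformation (this is not CK's API, just the underlying concept of a "merge" transform, with names made up for the example): a 2D tensor coordinate (h, w) can be mapped to and from a flat coordinate, letting a kernel address a reshaped view of a tensor without moving any data. CK composes chains of such transforms to present complex ML operators as plain GEMM-like loops.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Toy "merge" transform: fold an (H, W) coordinate into a single linear
// coordinate of length H*W, and unfold it back. Hypothetical names,
// for illustration only.
struct Merge2D
{
    std::size_t H, W;

    // (h, w) -> flat index
    std::size_t fold(std::size_t h, std::size_t w) const { return h * W + w; }

    // flat index -> (h, w)
    std::pair<std::size_t, std::size_t> unfold(std::size_t i) const
    {
        return {i / W, i % W};
    }
};
```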


The current CK library is structured into four layers:

  • Templated Tile Operators
  • Templated Kernel and Invoker
  • Instantiated Kernel and Invoker
  • Client API


General information

To build our documentation locally, use the following code:

cd docs
pip3 install -r sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html

You can find a list of our developers and contributors on our Contributors page.

If you use CK, cite us as follows:

* [Realizing Tensor Operators Using Coordinate Transformations and Tile Based Programming](???):
  This paper will be available on arXiv soon.
* [CITATION.cff](/CITATION.cff)

CK is released under the MIT license.

Building CK

We recommend building CK inside Docker containers, which include all necessary packages. Pre-built Docker images are available on DockerHub.

  1. To build a new Docker image, use the Dockerfile provided with the source code:

    DOCKER_BUILDKIT=1 docker build -t ck:latest -f Dockerfile .
  2. Launch the Docker container:

    docker run                                     \
    -it                                            \
    --privileged                                   \
    --group-add sudo                               \
    -w /root/workspace                             \
    -v ${PATH_TO_LOCAL_WORKSPACE}:/root/workspace  \
    ck:latest                                      \
    /bin/bash
  3. Clone CK source code from the GitHub repository and start the build:

    git clone https://github.com/ROCm/composable_kernel.git && \
    cd composable_kernel && \
    mkdir build && \
    cd build

    You must set the GPU_TARGETS CMake variable to specify the GPU target architecture(s) you want to run CK on. You can specify single or multiple architectures. If you specify multiple architectures, use a semicolon between each; for example, gfx908;gfx90a;gfx940.

    cmake                                                                                             \
    -D CMAKE_PREFIX_PATH=/opt/rocm                                                                    \
    -D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc                                                         \
    -D CMAKE_BUILD_TYPE=Release                                                                       \
    -D GPU_TARGETS="gfx908;gfx90a"                                                                    \
    ..

    If you don't set GPU_TARGETS on the cmake command line, CK is built for all GPU targets supported by the current compiler (this may take a long time).

  4. Build the entire CK library:

    make -j
  5. Install CK:

    make -j install

Optional post-install steps

  • Build examples and tests:

    make -j examples tests
  • Build and run all examples and tests:

    make -j check

    You can find instructions for running each individual example in the example/ directory.

  • Build ckProfiler:

    make -j ckProfiler

    You can find instructions for running ckProfiler in the profiler/ directory.

Note the -j option for building with multiple threads in parallel. This speeds up the build significantly, but depending on the number of CPU cores and the amount of RAM on your system, you may want to limit the number of threads.

By default, -j launches one thread per CPU core, which can cause the build to run out of memory and crash. For example, on a 128-core CPU with 64 GB of RAM, you can reduce the number of threads to 32 by using -j32.

Additional cmake flags can be used to significantly speed up the build:

  • INSTANCES_ONLY (default is OFF) must be set to ON in order to build only the instances and library while skipping all tests, examples, and profiler. This is useful in cases when you plan to use CK as a dependency and don't plan to run any examples or tests.

  • DTYPES (default is not set) can be set to any subset of "fp64;fp32;fp16;fp8;bf16;int8" to build instances of select data types only. The main default data types are fp32 and fp16; you can safely skip other data types.

  • DL_KERNELS (default is OFF) must be set to ON in order to build instances such as gemm_dl or batched_gemm_multi_d_dl. These instances are useful on architectures like NAVI2x; most other platforms have faster instances, such as xdl or wmma, available.

Using sccache for building

The default CK Docker images come with a pre-installed version of sccache that supports using clang as the HIP compiler (`-x hip`). Using sccache can reduce rebuild times from hours to 1-2 minutes. To invoke sccache, run:

 sccache --start-server

then add the following flags to the cmake command line:

 -DCMAKE_CXX_COMPILER_LAUNCHER=sccache -DCMAKE_C_COMPILER_LAUNCHER=sccache

You may need to clean the build folder and repeat the cmake and make steps in order to take advantage of sccache during subsequent builds.

Using CK as pre-built kernel library

You can find instructions for using CK as a pre-built kernel library in the client_example/ directory.

Contributing to CK

When you contribute to CK, make sure you run clang-format on all changed files. We highly recommend using git hooks that are managed by the pre-commit framework. To install hooks, run:

sudo script/install_precommit.sh

With this approach, pre-commit adds the appropriate hooks to your local repository and automatically runs clang-format (and possibly additional checks) before any commit is created.

If you need to uninstall hooks from the repository, you can do so by running the following command:

script/uninstall_precommit.sh

If you need to temporarily disable pre-commit hooks, you can add the --no-verify option to the git commit command.


composable_kernel's Issues

Some kernels have verification failures if enabled inter-wave scheduling

There are a few FP16 non-shuffle XDL GEMM kernels using inter-wave scheduling after PR #184. This is suspected to be related to a compiler issue and is under investigation:
https://ontrack-internal.amd.com/browse/SWDEV-330459

The issue affects other operation types, such as FP16 group_gemm.

test fp16 TT

Perf: 0.693763 ms, 1.54771 TFlops, 12.2804 GB/s, DeviceGemmXdl<256, 256, 128, 4, 8, 32, 32, 4, 2>
max err: 4049
Perf: 0.633443 ms, 1.695088 TFlops, 13.4498 GB/s, DeviceGemmXdl<256, 128, 256, 4, 8, 32, 32, 2, 4>
max err: 4311
Perf: 0.541443 ms, 1.983112 TFlops, 15.73514 GB/s, DeviceGemmXdl<128, 128, 128, 4, 8, 32, 32, 4, 2>
max err: 4070

test fp16 TN

Perf: 0.425282 ms, 2.52478 TFlops, 20.033 GB/s, DeviceGemmXdl<128, 128, 128, 4, 8, 32, 32, 4, 2>
max err: 4070

test fp16 NT

Perf: 0.577923 ms, 1.85793 TFlops, 14.7419 GB/s, DeviceGemmXdl<256, 256, 128, 4, 8, 32, 32, 4, 2>
max err: 4049
Perf: 0.485923 ms, 2.209695 TFlops, 17.53298 GB/s, DeviceGemmXdl<128, 128, 128, 4, 8, 32, 32, 4, 2>
max err: 4070

test fp16 NN

Perf: 0.546723 ms, 1.96396 TFlops, 15.5832 GB/s, DeviceGemmXdl<256, 256, 128, 4, 8, 32, 32, 4, 2>
max err: 4049
Perf: 0.468322 ms, 2.292742 TFlops, 18.19193 GB/s, DeviceGemmXdl<128, 128, 128, 4, 8, 32, 32, 4, 2>
max err: 4070

Post-merge of PR #134

[avx2] design/issues with avx2 prototyping

This issue tracks problems encountered while developing the AVX2 version of CK.

  1. CPU-only compilation. Many headers include hip_runtime.h and use the __device__ / __host__ qualifiers to mark host/device code. Device-related code should be better decoupled to support CPU-only builds.

  2. DynamicBuffer contains GPU intrinsics for memory operations. For CPU operation, we may need to use AVX-related intrinsics instead.

  3. __attribute__((ext_vector_type(N))) does not seem to map cleanly onto the 64/128/256-bit registers. ext_vector_type(8) generates two xmm registers, ext_vector_type(4) generates a single xmm register (which is what we want), but ext_vector_type(2) still generates a single xmm register. This makes it difficult to implement the vector type on CPU.

  4. Also, a ymm/xmm register cannot be iterated over its inner 8 floats / 4 floats one by one to apply an elementwise operation; the register must be treated as a whole. Hence StaticallyIndexedArray cannot be utilized.

  5. Registers available for front-end programming are limited, which implies we should not implement thread-local buffers by holding data in registers. Every micro-kernel therefore needs to write its result into memory (cache) before the next iteration.

  6. For task distribution, we design the following multi-level GEMM:
    a). Thread-wise GEMM: the micro-kernel, with the A/B matrices expected to reside in L1 cache.
    b). Block-wise GEMM: the A/B matrices are expected to reside in L2/L3 cache (a "cache block"). Despite the name, this still runs on a single thread.
    c). Grid-wise GEMM: the whole task size; multi-threading is applied at this level.

  7. NUMA binding, thread binding.
    In a multi-threaded environment, binding threads to different cores makes a big performance difference, especially on the current Zen chiplet design.
    TODO: Zen optimization guide

  8. Tile blocking must support blocks that do not divide evenly:
    a). Grid-wise/block-wise/thread-wise GEMM need to calculate the current block size at runtime instead of compile time.
    b). Thread-wise GEMM needs to dispatch to different kernels.
    c). Thread-wise transfer needs to deal with unevenly divided sizes and packing (or no packing).

  9. Transpose while reading/writing using AVX registers (how to describe this with a tensor transform?).

  10. DimAccessOrder with OpenMP.
    E.g. the order is <0, 1, 2>, and we need to merge dim 1 and dim 2 to utilize OpenMP for multi-thread distribution.

  11. Dynamic thread-wise copy.
    Today the GPU code uses static_ford over the copy dimensions, but on CPU the number of iterations can be thousands or tens of thousands, which is too many to statically unroll.
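The multi-level GEMM of item 6 can be sketched as ordinary cache blocking. This is an illustrative, single-threaded sketch, not CK code; the tile sizes BM/BN/BK and the function name are made up for the example. It also shows the runtime tail handling required by item 8 when a block does not divide the problem size evenly.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative cache-blocked GEMM (C += A * B), row-major storage.
// The BM x BN x BK loop nest plays the role of the block-wise GEMM
// (tiles expected in L2/L3); the innermost loops are the thread-wise
// micro-kernel (tiles expected in L1).
void gemm_blocked(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t M, std::size_t N, std::size_t K)
{
    const std::size_t BM = 4, BN = 4, BK = 8; // hypothetical tile sizes
    for(std::size_t m0 = 0; m0 < M; m0 += BM)
        for(std::size_t n0 = 0; n0 < N; n0 += BN)
            for(std::size_t k0 = 0; k0 < K; k0 += BK)
            {
                // current block size computed at runtime (uneven tails, item 8)
                const std::size_t mEnd = std::min(m0 + BM, M);
                const std::size_t nEnd = std::min(n0 + BN, N);
                const std::size_t kEnd = std::min(k0 + BK, K);
                for(std::size_t m = m0; m < mEnd; ++m)
                    for(std::size_t n = n0; n < nEnd; ++n)
                    {
                        float acc = 0.f;
                        for(std::size_t k = k0; k < kEnd; ++k)
                            acc += A[m * K + k] * B[k * N + n];
                        // micro-kernel result written back to memory (item 5)
                        C[m * N + n] += acc;
                    }
            }
}
```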

Naming for Reduction Operation

Originally posted by @j4yan in #128 (comment)

IdentityValue

This value is a mathematical property of the reduction type; it is deterministic and should not be specified by the user of the reduction.

https://docs.oracle.com/javase/tutorial/collections/streams/reduction.html


InitialValue

This value can be specified at runtime; in a multi-stage reduction it can hold the accumulated value of the previous stage.

Suggested implementation

struct BaseReductionType {};

template <typename T>
struct ReduceSum : public BaseReductionType
{
    static constexpr T kIdentityValue = T{0}; // not modifiable

    static constexpr void Reduce(T& acc, const T& v) { acc += v; }
};

template <typename ReduceType, typename T>
struct ReductionOperation
{
    T initialValue_ = ReduceType::kIdentityValue; // modifiable

    constexpr void SetInitialValue(const T& v) { initialValue_ = v; }

    constexpr void InitializeAcc(T& acc) const { acc = initialValue_; }

    constexpr void Reduce(T& acc, const T& v) const { ReduceType::Reduce(acc, v); }
};
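A self-contained sketch of how the two values differ in a multi-stage reduction. The struct definitions below restate the proposal (with void return types so it compiles); the ReduceRange helper is made up for the example and is not part of the proposal.

```cpp
#include <cstddef>
#include <vector>

struct BaseReductionType {};

template <typename T>
struct ReduceSum : public BaseReductionType
{
    static constexpr T kIdentityValue = T{0};            // fixed mathematical identity
    static constexpr void Reduce(T& acc, const T& v) { acc += v; }
};

template <typename ReduceType, typename T>
struct ReductionOperation
{
    T initialValue_ = ReduceType::kIdentityValue;        // runtime-modifiable
    constexpr void SetInitialValue(const T& v) { initialValue_ = v; }
    constexpr void InitializeAcc(T& acc) const { acc = initialValue_; }
    constexpr void Reduce(T& acc, const T& v) const { ReduceType::Reduce(acc, v); }
};

// Hypothetical helper: reduce one "stage" (a sub-range), starting the
// accumulator from the operation's initial value.
template <typename Op, typename T>
T ReduceRange(const Op& op, const std::vector<T>& data, std::size_t begin, std::size_t end)
{
    T acc;
    op.InitializeAcc(acc);
    for(std::size_t i = begin; i < end; ++i)
        op.Reduce(acc, data[i]);
    return acc;
}
```

Stage 1 starts from the identity value; stage 2 calls SetInitialValue with stage 1's result, so the accumulation continues across stages.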

Pointwise operation for reduction should be unified with normal pointwise operation

https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/91d8b7d67ae9dbf8a6e691ea3e17c0b9705c6ba7/include/ck/tensor_operation/gpu/element/element_wise_operation.hpp#L146-L330

A strange use case of UnaryIdentic

Batched GEMM performance issue

The new batched GEMM performance (develop cb87b04) is ~2TFlops lower than the old one (b53e9d0). This can be reproduced by running

./bin/ckProfiler batched_gemm 0 0 1 2 0 5 1024 512 2048 -1 -1 -1 8

Other observations include:

  1. Both versions have exactly the same number of scalar (s_) and vector (v_) instructions and almost the same number of VGPRs in the main K0 loop.
  2. The old version has scratch memory if HasMainK0BlockLoop=true; the new version does not.

CK fails to build for gfx90a on ROCm 5.0.0

When compiling for gfx90a using the following command line:

CXX=/opt/rocm/bin/hipcc cmake -DCMAKE_PREFIX_PATH=/opt/rocm -D CMAKE_CXX_FLAGS=" --offload-arch=gfx90a  -O3 " ..

The following error is encountered:

fatal error: error in backend: Error while trying to spill VGPR0 from class VGPR_32: Cannot scavenge register without an emergency spill slot!
clang-14: error: clang frontend command failed with exit code 70 (use -v to see invocation)
AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.0 22051 235b6880e2e515507478181ec11a20c1ec87945b)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.0.0/llvm/bin
clang-14: note: diagnostic msg: Error generating preprocessed source(s).
library/src/tensor_operation_instance/gpu/gemm/CMakeFiles/device_gemm_instance.dir/build.make:374: recipe for target 'library/src/tensor_operation_instance/gpu/gemm/CMakeFiles/device_gemm_instance.dir/device_gemm_xdl_c_shuffle_2_stage_f16_f16_f16_mk_nk_mn_instance.cpp.o' failed
make[2]: *** [library/src/tensor_operation_instance/gpu/gemm/CMakeFiles/device_gemm_instance.dir/device_gemm_xdl_c_shuffle_2_stage_f16_f16_f16_mk_nk_mn_instance.cpp.o] Error 70
make[2]: *** Waiting for unfinished jobs....

Use int32_t instead of int for mfma

9110 compiler issue

Reproduction steps

In the branch 9110_issue_gemm_ex3:
https://github.com/ROCmSoftwarePlatform/composable_kernel/tree/9110_issue_gemm_ex3

$ mkdir build
$ cd build
$ cmake                                                                \
-D CMAKE_BUILD_TYPE=Release                                            \
-D HIP_ONLINE_COMPILER_FLAGS="-DCK_AMD_GPU_GFX908"                     \
-D CMAKE_CXX_FLAGS="-DCK_AMD_GPU_GFX908 --amdgpu-target=gfx908 -O3 "   \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc                              \
-D CMAKE_PREFIX_PATH=/opt/rocm                                         \
..
$ make -j gemm_xdl_bias_relu_add
$ ./example/gemm_xdl_bias_relu_add 1 1 5 3840 4096 4096 4096 4096 4096

The result is

a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
c0_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
c1_m_n: dim 2, lengths {3840, 4096}, strides {1, 0}
arg.a_grid_desc_k0_m_k1_{512, 3840, 8}
arg.b_grid_desc_k0_n_k1_{512, 4096, 8}
arg.c_grid_desc_m_n_{ 3840, 4096}
arg.c0_grid_desc_m_n_{ 3840, 4096}
arg.c1_grid_desc_m_n_{ 3840, 4096}
launch_and_time_kernel: grid_dim {480, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 5 times...
Perf: 1.04685 ms, 123.082 TFlops, 90.1481 GB/s
error: 7.22387e+07
max_diff: 3345, 247, 3592

However, when using the Docker image of the CI server
https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/develop/Dockerfile
or
rocm/tensorflow:rocm4.3.1-tf2.6-dev

a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
c0_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
c1_m_n: dim 2, lengths {3840, 4096}, strides {1, 0}
arg.a_grid_desc_k0_m_k1_{512, 3840, 8}
arg.b_grid_desc_k0_n_k1_{512, 4096, 8}
arg.c_grid_desc_m_n_{ 3840, 4096}
arg.c0_grid_desc_m_n_{ 3840, 4096}
arg.c1_grid_desc_m_n_{ 3840, 4096}
launch_and_time_kernel: grid_dim {480, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 5 times...
Perf: 1.17188 ms, 109.951 TFlops, 80.5305 GB/s
error: 0
max_diff: 0, 1055, 1055

No error occurs.

[cppcheck][warning][missingReturn] cppcheck 2.6 warnings on missing returns

[Keywords]:
cppcheck; warning;
[Urgency: Low; Importance: Low]

[Env]:
[email protected]

[Description]:
The issue is 19 [missingReturn] warnings after the cppcheck upgrade; 17 of them are within the Composable Kernel repository:
[LOG]: [Jenkins Log]
(http://micimaster.amd.com/blue/rest/organizations/jenkins/pipelines/MLLibs/pipelines/MIOpen/branches/jenkins-ci-rocm-4.5/runs/11/nodes/13/steps/73/log/?start=0)
Nothing is logically wrong with the code; we just wish to silence a few warnings and improve quality in static checks.

[Recommendation]:

  1. Add throw std::runtime_error{"Unsupported DataType"}; // unreachable under each static_assert whenever there is no default return.
  2. Since you already have a static_assert, at least one of the else ifs is unnecessary.
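A minimal illustration of the first recommendation. The function and type names are made up for the example; the point is the unreachable throw after a static_assert-guarded if constexpr chain, which silences cppcheck's [missingReturn] (and compilers' -Wreturn-type) without changing behavior.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <type_traits>

// Hypothetical type-dispatch helper in the style the warning is about.
template <typename DataType>
std::string DataTypeName()
{
    static_assert(std::is_same_v<DataType, float> || std::is_same_v<DataType, std::int8_t>,
                  "wrong! unsupported data type");

    if constexpr(std::is_same_v<DataType, float>)
    {
        return "fp32";
    }
    else if constexpr(std::is_same_v<DataType, std::int8_t>)
    {
        return "int8";
    }

    // Unreachable because of the static_assert above, but it gives
    // every control path a visible exit, so static checkers are happy.
    throw std::runtime_error{"Unsupported DataType"};
}
```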

messy header file dependency

This is to follow up a discussion #130 (comment)

I suggest including any needed header files directly in a header (or source file). For example, if function void foo() declared/defined in A.hpp is used in B.hpp, A.hpp should be #included in B.hpp.

If this dependency issue is resolved, we can order the #include directives any way we want, for example alphabetically using clang-format.

Polymorphic function IsSupportedArgument(const BaseArgument* p_base_arg) should check if the pointer can be down-cast

  • Have a PR that fixes every Device Op class

Issue brought up by @j4yan
#128 (comment)

Example fix (needs to be applied to all Device Op classes):

--- a/include/ck/tensor_operation/gpu/device/device_gemm_xdl.hpp
+++ b/include/ck/tensor_operation/gpu/device/device_gemm_xdl.hpp
@@ -412,9 +412,17 @@ struct DeviceGemmXdl
     }

     // polymorphic
-    bool IsSupportedArgument(const BaseArgument* p_arg) override
+    bool IsSupportedArgument(const BaseArgument* p_base_arg) override
     {
-        return IsSupportedArgument(*dynamic_cast<const Argument*>(p_arg));
+        const Argument* p_arg = dynamic_cast<const Argument*>(p_base_arg);
+
+        // make sure p_base_arg can be downcasted
+        if(!p_arg)
+        {
+            return false;
+        }
+
+        return IsSupportedArgument(*p_arg);
     }
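The reasoning behind the fix can be shown with a self-contained sketch (the class names below are stand-ins, not the real Device Op types): dynamic_cast on a pointer yields nullptr when the downcast is invalid, so the null check turns what would be a crash on a foreign argument type into a clean "not supported" answer.

```cpp
// Stand-in class hierarchy mimicking BaseArgument / Argument.
struct BaseArgument
{
    virtual ~BaseArgument() = default; // polymorphic base, so dynamic_cast works
};

struct GemmArgument : BaseArgument
{
    bool valid = true;
};

struct ConvArgument : BaseArgument {};

// Mirrors the corrected IsSupportedArgument: reject foreign argument
// types instead of dereferencing a failed dynamic_cast.
bool IsSupportedArgument(const BaseArgument* p_base_arg)
{
    const GemmArgument* p_arg = dynamic_cast<const GemmArgument*>(p_base_arg);

    if(!p_arg)
    {
        return false; // not our Argument type: cannot be supported
    }

    return p_arg->valid;
}
```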

post merge issue of PR "Reduction in Composable Kernel #82"

ConvBackwardWeightAlgo::V4R4R5XDLATOMICNHWC failure

./host/driver_offline/conv_wrw_driver_offline 1 4 1 4 0 1 1 256 128 3 3 14 14 1 1 1 1 1 1 1 1 1

result

layout: 1
in: dim 4, lengths {1, 14, 14, 128}, strides {25088, 1792, 128, 1}
wei: dim 4, lengths {256, 3, 3, 128}, strides {1152, 384, 128, 1}
out: dim 4, lengths {1, 14, 14, 256}, strides {50176, 3584, 256, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {1, 1, }
ConvDilations size 2, {1, 1, }
device_convolution_backward_weight_implicit_gemm_v4r4r5_xdlops_atomic_nhwc_kyxc_nhwk
GemmKTotal: 196 GrideSizeMN: 9 GemmKBatch: 1 GemmK0: 24 gemmKPad: 192
a_b_k0_m_k1_grid_desc{1, 24, 256, 8}
b_b_k0_n_k1_grid_desc{1, 24, 1152, 8}
c_m_n_grid_desc{ 256, 1152}
gridSize : 9
launch_and_time_kernel: grid_dim {9, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 0 times...
error: 2.50662e+06
max_diff: 74, 1, -73

major code refactor

major

  • Add license
  • use absolute path for header
  • fix header dependency: #170
  • Tensor operation naming: grid/block/warp/thread-level operation, conv/gemm/reduce/elementwise operation
  • #198
  • #169

misc

batched-GEMM memory access fault

docker

compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:9110_ubuntu18.04_py3.6_pytorch_rocm5.0_internal_testing_7ff5b54 \

CMD

./bin/ckProfiler batched_gemm 1 1 0 0 0 5 384 64 384 384 64 64 384

Result

a_g_m_k: dim 3, lengths {384, 384, 384}, strides {147456, 384, 1}
b_g_k_n: dim 3, lengths {384, 384, 64}, strides {4096, 1, 64}
c_g_m_n: dim 3, lengths {384, 384, 64}, strides {24576, 64, 1}
arg.a_grid_desc_k0_m_k1_{48, 512, 8}
arg.b_grid_desc_k0_n_k1_{48, 128, 8}
arg.c_grid_desc_m_n_{512, 128}
launch_and_time_kernel: grid_dim {768, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 5 times...
:0:rocdevice.cpp            :2594: 1147793389293 us: 32964: [tid:0x7f71ef795700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b
Aborted (core dumped)

Report "SmallVector unable to grow" when compiling

When compiling all tests on the branch fp16_transfer_to_bf16, the compiler reports the following error:

fatal error: error in backend: SmallVector unable to grow. Requested capacity (4294967296) is larger than maximum value for size type (4294967295)
clang-14: error: clang frontend command failed with exit code 70 (use -v to see invocation)
AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.1.0 22114 5cba46feb6af367b1cafaa183ec42dbfb8207b14)

This issue is recorded in ticket:
https://ontrack-internal.amd.com/browse/SWDEV-335749

[cppcheck][warning][ODR] possible UB on multiple `struct get_ref_desc_types` in ck namespace

[Keywords]:
cppcheck; warning;
[Urgency: Low; Importance: Medium]

[Env]:
[email protected]

[Description]:
A search on struct get_ref_desc_types turns up 13 files with the same-named struct in the same namespace "ck". This violates the One Definition Rule (ODR) and may potentially result in undefined behavior (UB).
https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/6260ced2f3a4d9a2a832563905135c01ba72b56b/composable_kernel/src/kernel_wrapper/gridwise_generic_reduction_first_call_blockwise_reduce_all_dims.cpp#L161
(other files omitted, just search in the same folder)

[Recommendation]:

  1. A temporary fix is an anonymous namespace: wrapping the struct makes it local to its own translation unit (the .cpp file and all of its includes).
  2. A deeper search reveals a lot of duplication among these 13 files. Taking this struct as an example, only first call vs. second call and all-dim vs. partial-dim differ. Can we move the struct definition to a separate file and consolidate them?
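The first recommendation, sketched on a stand-in struct (one translation unit shown; the real fix wraps get_ref_desc_types the same way in each .cpp):

```cpp
#include <string>

// Wrapping the helper in an anonymous namespace gives it internal linkage,
// so an identically named struct in another .cpp no longer violates the ODR.
namespace
{
struct get_ref_desc_types // stand-in for the real struct's contents
{
    static std::string name() { return "first_call_blockwise"; }
};
} // anonymous namespace

// Only this externally visible function is exported from the translation unit.
std::string DescribeKernel() { return get_ref_desc_types::name(); }
```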

Kernels with LDS bank conflicts

Our kernel sees partial 2-way bank conflict for K-contiguous matrices, and full 2-way/4-way conflict for MN-contiguous matrices (dependent on tile sizes). Profiling has shown partial 2-way bank conflict for K-contiguous matrices doesn't result in observable impact on LDS issue latency. Therefore, our focus is to resolve bank conflict for MN-contiguous matrices.

The following LDS layout is chosen to avoid LDS write bank conflict for MN-contiguous matrices:

  • K0_M_1/K0_N_1 (fp32)
  • K0_M_2/K0_2_N (bf16/fp16)
  • K0_M_4/K0_N_4 (int8)

Conflict-free kernels are already added to C-shuffle device GEMM instances defined in library/src/tensor_operation_instance/gpu/gemm/device_gemm_xdl_c_shuffle_(type)_(layout)_instance.cpp. GridwiseGemm_k0mk1_k0nk1_mn_xdl_cshuffle_v1 (the underlying implementation of C-shuffle device GEMM) supports different K1 value for each individual A/B matrix to facilitate preferred conflict-free LDS layout for each A/B tile.

Other GEMM/Conv device kernels that do not yet facilitate the latest C-shuffle gridwise GEMM implementation still observe bank conflicts.

GEMM

  • Batched GEMM
  • Group GEMM
  • Split-K GEMM
  • Non-C-shuffle ordinary GEMM DeviceGemmXdl

Conv

  • Backward data (conflict in weight)
  • Backward weight (conflict in both output gradient and activation)
  • Forward prop maps to K-contiguous implicit GEMM layout so there's no observable bank conflict.
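As a generic illustration of why the chosen LDS layouts help (this models the textbook banking mechanism, assuming 32 banks of 4-byte words, not CK's actual K0_M_x layouts): when consecutive lanes read a column of a row-major fp32 tile whose row stride is a multiple of the bank count, they all hit the same bank and serialize; padding the stride by one word spreads the accesses across all banks.

```cpp
#include <cstddef>
#include <set>

// Count how many distinct banks 32 "lanes" touch when lane t reads word
// index t * row_stride_words (a column access of a row-major tile).
// Fewer distinct banks means more bank-conflict serialization.
std::size_t distinct_banks(std::size_t row_stride_words)
{
    const std::size_t kNumBanks = 32; // typical LDS: 32 banks of 4-byte words
    std::set<std::size_t> banks;
    for(std::size_t t = 0; t < 32; ++t)
        banks.insert((t * row_stride_words) % kNumBanks);
    return banks.size();
}
```

A stride of 32 words maps every lane to one bank (full conflict); padding to 33 words touches all 32 banks (conflict-free).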

conv2d_fwd_xdl_bias_relu_atomic_add incorrect results.

conv2d_fwd_xdl_bias_relu_atomic_add does not generate correct results.

./example/conv2d_fwd_xdl_bias_relu_atomic_add 1 3 5

Perf: 1.3421 ms, 109.356 TFlops, 248.561 GB/s
error: 6.43832e+08
max_diff: 161.844, 32.7812, 194.625

NaNs for BF16 2D/3D convolution forward.

Issue in this PR #133

Steps to reproduce on ROCm 5.0 and 4.3.1

branch: aosewski/conv_3d
The issue occurs only for the first instance of the generic convolution algorithm, like this one:
https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/d332ff7f3ab2e9e415ed516611c27cf6780cf7da/library/src/tensor_operation_instance/gpu/conv1d_fwd/device_conv1d_fwd_xdl_nwc_kxc_nwk_bf16_instance.cpp#L37
https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/d332ff7f3ab2e9e415ed516611c27cf6780cf7da/library/src/tensor_operation_instance/gpu/conv2d_fwd/device_conv2d_fwd_xdl_nhwc_kyxc_nhwk_bf16_instance.cpp#L38
https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/d332ff7f3ab2e9e415ed516611c27cf6780cf7da/library/src/tensor_operation_instance/gpu/conv3d_fwd/device_conv3d_fwd_xdl_ndhwc_kzyxc_ndhwk_bf16_instance.cpp#L37
So those lines need to be uncommented.

The issue happens only when the input tensors are initialized with integer values, so the current input/weights tensor initialization needs to be changed to integer-value initialization. You can use the init.diff.txt file.

cmake -DBUILD_DEV=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-DCK_AMD_GPU_GFX908 --amdgpu-target=gfx908 -O3" -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc -DCMAKE_PREFIX_PATH="/opt/rocm/;/opt/rocm/hip" ..
cmake --build . -j 36 --target test_convnd_fwd
./bin/test_conv1d_fwd
./bin/test_conv2d_fwd
./bin/test_conv3d_fwd

The error looks like the following:

DeviceConv1DFwdXdl_Input_N_Hi_Wi_C_Weight_K_Y_X_C_Output_N_Ho_Wo_K<256, 256, 128, 4, Default>....                Supported!
launch_and_time_kernel: grid_dim {36, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
        out[0] != ref[0]: -nan != 0
Error: incorrect results!
        out[1] != ref[1]: -nan != 0
Error: incorrect results!
        out[2] != ref[2]: -nan != 0
Error: incorrect results!
        out[3] != ref[3]: -nan != 0
Error: incorrect results!
   max err: 0

TestConv1DNWCBF16Instances ..... FAILURE
>>>>>>>> RUN test conv2d nhwc <<<<<<<<<<
DeviceConv2DFwdXdl_Input_N_Hi_Wi_C_Weight_K_Y_X_C_Output_N_Ho_Wo_K<256, 256, 128, 4, Default>....                Supported!
launch_and_time_kernel: grid_dim {1296, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
        out[0] != ref[0]: -nan != -nan
Error: incorrect results!
        out[1] != ref[1]: -nan != -nan
Error: incorrect results!
        out[2] != ref[2]: -nan != -nan
Error: incorrect results!
        out[3] != ref[3]: -nan != -nan
Error: incorrect results!
   max err: 0

TestConv2DNHWCBF16Instances ..... FAILURE
>>>>>>>> RUN test conv3d ndhwc <<<<<<<<<<
DeviceConv3DFwdXdl_Input_N_Hi_Wi_C_Weight_K_Y_X_C_Output_N_Ho_Wo_K<256, 256, 128, 4, Default>....                Supported!
launch_and_time_kernel: grid_dim {256, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
        out[0] != ref[0]: -nan != -nan
Error: incorrect results!
        out[1] != ref[1]: -nan != -nan
Error: incorrect results!
        out[2] != ref[2]: -nan != -nan
Error: incorrect results!
        out[3] != ref[3]: -nan != -nan
Error: incorrect results!
   max err: 0

TestConv3DNDHWCBF16Instances ..... FAILURE

Rename elementwise operation (single source)

Rename type, variable and file like
AElementwiseOperation a_element_op
BElementwiseOperation b_element_op
CElementwiseOperation c_element_op
SrcElementwiseOperation src_element_op_
DstElementwiseOperation dst_element_op_
element_wise_reduce_operation
element_wise_operation.hpp

Because we need to add another kind of elementwise operation (two-source), we should rename the original concept to prevent confusion, e.g. tensor functor.

definition of relative error

Currently the relative error when the reference value is zero is defined using an epsilon to avoid division by zero; see https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/313bbea5886850acab286f45e9d9816cf0b0dca0/library/include/ck/library/host_tensor/host_tensor.hpp#L327

The problem is that this approach results in a very large relative error even when the absolute error is small. For example, with (v = 1e-8, v_ref = 0, eps = 1e-10), rel_diff = 100. We can use a more reasonable definition, for example:
rel_diff = (v - v_ref) / max(|v|, |v_ref|). See https://en.wikipedia.org/wiki/Relative_change_and_difference for other alternatives.
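A sketch comparing the two definitions on the example above (the function names are made up; the eps value and the max-based formula follow the issue text):

```cpp
#include <algorithm>
#include <cmath>

// Current definition: divide by max(|v_ref|, eps) to avoid division by zero.
double rel_diff_eps(double v, double v_ref, double eps = 1e-10)
{
    return std::abs(v - v_ref) / std::max(std::abs(v_ref), eps);
}

// Proposed definition: scale by the larger magnitude of the two values.
double rel_diff_max(double v, double v_ref)
{
    const double denom = std::max(std::abs(v), std::abs(v_ref));
    return denom == 0.0 ? 0.0 : std::abs(v - v_ref) / denom;
}
```

On (v = 1e-8, v_ref = 0), the eps-based definition blows up to 100 while the max-based one is bounded by 1; for near-zero references, a combined absolute/relative tolerance is a common further refinement.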
