
Composable Kernel

The Composable Kernel (CK) library provides a programming model for writing performance-critical kernels for machine learning workloads across multiple architectures (GPUs, CPUs, etc.). The CK library uses general-purpose kernel languages, such as HIP C++.

CK uses two concepts to achieve performance portability and code maintainability:

  • A tile-based programming model
  • Algorithm complexity reduction for complex machine learning (ML) operators. This uses an innovative technique called Tensor Coordinate Transformation.
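As a toy illustration of the idea behind Tensor Coordinate Transformation (this is not CK's API, just the underlying concept of a "merge" transform, with names made up for the example): a 2D tensor coordinate (h, w) can be mapped to and from a flat coordinate, letting a kernel address a reshaped view of a tensor without moving any data. CK composes chains of such transforms to present complex ML operators as plain GEMM-like loops.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Toy "merge" transform: fold an (H, W) coordinate into a single linear
// coordinate of length H*W, and unfold it back. Hypothetical names,
// for illustration only.
struct Merge2D
{
    std::size_t H, W;

    // (h, w) -> flat index
    std::size_t fold(std::size_t h, std::size_t w) const { return h * W + w; }

    // flat index -> (h, w)
    std::pair<std::size_t, std::size_t> unfold(std::size_t i) const
    {
        return {i / W, i % W};
    }
};
```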


The current CK library is structured into four layers:

  • Templated Tile Operators
  • Templated Kernel and Invoker
  • Instantiated Kernel and Invoker
  • Client API


General information

To build our documentation locally, use the following code:

cd docs
pip3 install -r sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html

You can find a list of our developers and contributors on our Contributors page.

If you use CK, cite us as follows:

* [Realizing Tensor Operators Using Coordinate Transformations and Tile Based Programming](???):
  This paper will be available on arXiv soon.
* [CITATION.cff](/CITATION.cff)

CK is released under the MIT license.

Building CK

We recommend building CK inside Docker containers, which include all necessary packages. Pre-built Docker images are available on DockerHub.

  1. To build a new Docker image, use the Dockerfile provided with the source code:

    DOCKER_BUILDKIT=1 docker build -t ck:latest -f Dockerfile .
  2. Launch the Docker container:

    docker run                                     \
    -it                                            \
    --privileged                                   \
    --group-add sudo                               \
    -w /root/workspace                             \
    -v ${PATH_TO_LOCAL_WORKSPACE}:/root/workspace  \
    ck:latest                                      \
    /bin/bash
  3. Clone CK source code from the GitHub repository and start the build:

    git clone https://github.com/ROCm/composable_kernel.git && \
    cd composable_kernel && \
    mkdir build && \
    cd build

    You must set the GPU_TARGETS CMake variable to specify the GPU target architecture(s) you want to run CK on. You can specify single or multiple architectures. If you specify multiple architectures, use a semicolon between each; for example, gfx908;gfx90a;gfx940.

    cmake                                                                                             \
    -D CMAKE_PREFIX_PATH=/opt/rocm                                                                    \
    -D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc                                                         \
    -D CMAKE_BUILD_TYPE=Release                                                                       \
    -D GPU_TARGETS="gfx908;gfx90a"                                                                    \
    ..

    If you don't set GPU_TARGETS on the cmake command line, CK is built for all GPU targets supported by the current compiler (this may take a long time).

  4. Build the entire CK library:

    make -j
  5. Install CK:

    make -j install

Optional post-install steps

  • Build examples and tests:

    make -j examples tests
  • Build and run all examples and tests:

    make -j check

    You can find instructions for running each individual example in the example/ directory.

  • Build ckProfiler:

    make -j ckProfiler

    You can find instructions for running ckProfiler in the profiler/ directory.

Note the -j option for building with multiple threads in parallel. This speeds up the build significantly, but depending on the number of CPU cores and the amount of RAM on your system, you may want to limit the number of threads.

By default, -j launches one thread per CPU core, which can cause the build to run out of memory and crash. For example, on a 128-core CPU with 64 GB of RAM, you can reduce the number of threads to 32 by using -j32.

Additional cmake flags can be used to significantly speed up the build:

  • INSTANCES_ONLY (default is OFF) must be set to ON in order to build only the instances and library while skipping all tests, examples, and profiler. This is useful in cases when you plan to use CK as a dependency and don't plan to run any examples or tests.

  • DTYPES (default is not set) can be set to any subset of "fp64;fp32;fp16;fp8;bf16;int8" to build instances of select data types only. The main default data types are fp32 and fp16; you can safely skip other data types.

  • DL_KERNELS (default is OFF) must be set to ON in order to build instances such as gemm_dl or batched_gemm_multi_d_dl. These instances are useful on architectures like NAVI2x; most other platforms have faster instances, such as xdl or wmma, available.

Using sccache for building

The default CK Docker images come with a pre-installed version of sccache that supports using clang as the HIP compiler (`-x hip`). Using sccache can reduce rebuild times from hours to 1-2 minutes. To invoke sccache, run:

 sccache --start-server

then add the following flags to the cmake command line:

 -DCMAKE_CXX_COMPILER_LAUNCHER=sccache -DCMAKE_C_COMPILER_LAUNCHER=sccache

You may need to clean the build folder and repeat the cmake and make steps in order to take advantage of sccache during subsequent builds.

Using CK as pre-built kernel library

You can find instructions for using CK as a pre-built kernel library in the client_example/ directory.

Contributing to CK

When you contribute to CK, make sure you run clang-format on all changed files. We highly recommend using git hooks that are managed by the pre-commit framework. To install hooks, run:

sudo script/install_precommit.sh

With this approach, pre-commit adds the appropriate hooks to your local repository and automatically runs clang-format (and possibly additional checks) before any commit is created.

If you need to uninstall hooks from the repository, you can do so by running the following command:

script/uninstall_precommit.sh

If you need to temporarily disable pre-commit hooks, you can add the --no-verify option to the git commit command.


composable_kernel's Issues

Some kernels have verification failures if enabled inter-wave scheduling

There are a few FP16 non-shuffle XDL GEMM kernels using inter-wave scheduling after PR #184. This is suspected to be related to a compiler issue and is under investigation:
https://ontrack-internal.amd.com/browse/SWDEV-330459

The issue affects other operation types, such as FP16 group_gemm.

test fp16 TT

Perf: 0.693763 ms, 1.54771 TFlops, 12.2804 GB/s, DeviceGemmXdl<256, 256, 128, 4, 8, 32, 32, 4, 2>
max err: 4049
Perf: 0.633443 ms, 1.695088 TFlops, 13.4498 GB/s, DeviceGemmXdl<256, 128, 256, 4, 8, 32, 32, 2, 4>
max err: 4311
Perf: 0.541443 ms, 1.983112 TFlops, 15.73514 GB/s, DeviceGemmXdl<128, 128, 128, 4, 8, 32, 32, 4, 2>
max err: 4070

test fp16 TN

Perf: 0.425282 ms, 2.52478 TFlops, 20.033 GB/s, DeviceGemmXdl<128, 128, 128, 4, 8, 32, 32, 4, 2>
max err: 4070

test fp16 NT

Perf: 0.577923 ms, 1.85793 TFlops, 14.7419 GB/s, DeviceGemmXdl<256, 256, 128, 4, 8, 32, 32, 4, 2>
max err: 4049
Perf: 0.485923 ms, 2.209695 TFlops, 17.53298 GB/s, DeviceGemmXdl<128, 128, 128, 4, 8, 32, 32, 4, 2>
max err: 4070

test fp16 NN

Perf: 0.546723 ms, 1.96396 TFlops, 15.5832 GB/s, DeviceGemmXdl<256, 256, 128, 4, 8, 32, 32, 4, 2>
max err: 4049
Perf: 0.468322 ms, 2.292742 TFlops, 18.19193 GB/s, DeviceGemmXdl<128, 128, 128, 4, 8, 32, 32, 4, 2>
max err: 4070

Post-merge of PR #134

[avx2] design/issues with avx2 prototyping

This issue tracks problems encountered while developing the AVX2 version of CK.

  1. CPU-only compilation. Many headers include hip_runtime.h and use the __device__ / __host__ qualifiers to mark host/device code. Device-related code should be better decoupled to support CPU-only builds.

  2. DynamicBuffer contains GPU intrinsics for memory operations. For CPU operation, we may need to use AVX-related intrinsics instead.

  3. __attribute__((ext_vector_type(N))) does not seem to map cleanly onto the 64/128/256-bit registers. ext_vector_type(8) generates two xmm registers, ext_vector_type(4) generates a single xmm register (which is what we want), but ext_vector_type(2) still generates a single xmm register. This makes it difficult to implement the vector type on CPU.

  4. Also, a ymm/xmm register cannot be iterated over its inner 8 floats / 4 floats one by one to apply an elementwise operation; the register must be treated as a whole. Hence StaticallyIndexedArray cannot be utilized.

  5. Registers available for front-end programming are limited, which implies we should not implement thread-local buffers by holding data in registers. Every micro-kernel therefore needs to write its result into memory (cache) before the next iteration.

  6. For task distribution, we design the following multi-level GEMM:
    a). Thread-wise GEMM: the micro-kernel, with the A/B matrices expected to reside in L1 cache.
    b). Block-wise GEMM: the A/B matrices are expected to reside in L2/L3 cache (a "cache block"). Despite the name, this still runs on a single thread.
    c). Grid-wise GEMM: the whole task size; multi-threading is applied at this level.

  7. NUMA binding, thread binding.
    In a multi-threaded environment, binding threads to different cores makes a big performance difference, especially on the current Zen chiplet design.
    TODO: Zen optimization guide

  8. Tile blocking must support blocks that do not divide evenly:
    a). Grid-wise/block-wise/thread-wise GEMM need to calculate the current block size at runtime instead of compile time.
    b). Thread-wise GEMM needs to dispatch to different kernels.
    c). Thread-wise transfer needs to deal with unevenly divided sizes and packing (or no packing).

  9. Transpose while reading/writing using AVX registers (how to describe this with a tensor transform?).

  10. DimAccessOrder with OpenMP.
    E.g. the order is <0, 1, 2>, and we need to merge dim 1 and dim 2 to utilize OpenMP for multi-thread distribution.

  11. Dynamic thread-wise copy.
    Today the GPU code uses static_ford over the copy dimensions, but on CPU the number of iterations can be thousands or tens of thousands, which is too many to statically unroll.
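The multi-level GEMM of item 6 can be sketched as ordinary cache blocking. This is an illustrative, single-threaded sketch, not CK code; the tile sizes BM/BN/BK and the function name are made up for the example. It also shows the runtime tail handling required by item 8 when a block does not divide the problem size evenly.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative cache-blocked GEMM (C += A * B), row-major storage.
// The BM x BN x BK loop nest plays the role of the block-wise GEMM
// (tiles expected in L2/L3); the innermost loops are the thread-wise
// micro-kernel (tiles expected in L1).
void gemm_blocked(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t M, std::size_t N, std::size_t K)
{
    const std::size_t BM = 4, BN = 4, BK = 8; // hypothetical tile sizes
    for(std::size_t m0 = 0; m0 < M; m0 += BM)
        for(std::size_t n0 = 0; n0 < N; n0 += BN)
            for(std::size_t k0 = 0; k0 < K; k0 += BK)
            {
                // current block size computed at runtime (uneven tails, item 8)
                const std::size_t mEnd = std::min(m0 + BM, M);
                const std::size_t nEnd = std::min(n0 + BN, N);
                const std::size_t kEnd = std::min(k0 + BK, K);
                for(std::size_t m = m0; m < mEnd; ++m)
                    for(std::size_t n = n0; n < nEnd; ++n)
                    {
                        float acc = 0.f;
                        for(std::size_t k = k0; k < kEnd; ++k)
                            acc += A[m * K + k] * B[k * N + n];
                        // micro-kernel result written back to memory (item 5)
                        C[m * N + n] += acc;
                    }
            }
}
```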

Naming for Reduction Operation

Originally posted by @j4yan in #128 (comment)

IdentityValue

This value is a mathematical property of the reduction type; it is deterministic and should not be specified by the user of the reduction.

https://docs.oracle.com/javase/tutorial/collections/streams/reduction.html


InitialValue

This value can be specified at runtime; in a multi-stage reduction it can hold the accumulated value of the previous stage.

Suggested implementation

struct BaseReductionType {};

template <typename T>
struct ReduceSum : public BaseReductionType
{
    static constexpr T kIdentityValue = T{0}; // not modifiable

    static constexpr void Reduce(T& acc, const T& v) { acc += v; }
};

template <typename ReduceType, typename T>
struct ReductionOperation
{
    T initialValue_ = ReduceType::kIdentityValue; // modifiable

    constexpr void SetInitialValue(const T& v) { initialValue_ = v; }

    constexpr void InitializeAcc(T& acc) const { acc = initialValue_; }

    constexpr void Reduce(T& acc, const T& v) const { ReduceType::Reduce(acc, v); }
};
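A self-contained sketch of how the two values differ in a multi-stage reduction. The struct definitions below restate the proposal (with void return types so it compiles); the ReduceRange helper is made up for the example and is not part of the proposal.

```cpp
#include <cstddef>
#include <vector>

struct BaseReductionType {};

template <typename T>
struct ReduceSum : public BaseReductionType
{
    static constexpr T kIdentityValue = T{0};            // fixed mathematical identity
    static constexpr void Reduce(T& acc, const T& v) { acc += v; }
};

template <typename ReduceType, typename T>
struct ReductionOperation
{
    T initialValue_ = ReduceType::kIdentityValue;        // runtime-modifiable
    constexpr void SetInitialValue(const T& v) { initialValue_ = v; }
    constexpr void InitializeAcc(T& acc) const { acc = initialValue_; }
    constexpr void Reduce(T& acc, const T& v) const { ReduceType::Reduce(acc, v); }
};

// Hypothetical helper: reduce one "stage" (a sub-range), starting the
// accumulator from the operation's initial value.
template <typename Op, typename T>
T ReduceRange(const Op& op, const std::vector<T>& data, std::size_t begin, std::size_t end)
{
    T acc;
    op.InitializeAcc(acc);
    for(std::size_t i = begin; i < end; ++i)
        op.Reduce(acc, data[i]);
    return acc;
}
```

Stage 1 starts from the identity value; stage 2 calls SetInitialValue with stage 1's result, so the accumulation continues across stages.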

Pointwise operation for reduction should be unified with normal pointwise operation

https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/91d8b7d67ae9dbf8a6e691ea3e17c0b9705c6ba7/include/ck/tensor_operation/gpu/element/element_wise_operation.hpp#L146-L330

A strange use case of UnaryIdentic

Batched GEMM performance issue

The new batched GEMM performance (develop cb87b04) is ~2TFlops lower than the old one (b53e9d0). This can be reproduced by running

./bin/ckProfiler batched_gemm 0 0 1 2 0 5 1024 512 2048 -1 -1 -1 8

Other observations include:

  1. Both versions have exactly the same number of scalar (s_) and vector (v_) instructions and almost the same number of VGPRs in the main K0 loop.
  2. The old version has scratch memory if HasMainK0BlockLoop=true; the new version does not.

CK fails to build for gfx90a on ROCm 5.0.0

When compiling for gfx90a using the following command line:

CXX=/opt/rocm/bin/hipcc cmake -DCMAKE_PREFIX_PATH=/opt/rocm -D CMAKE_CXX_FLAGS=" --offload-arch=gfx90a  -O3 " ..

The following error is encountered:

fatal error: error in backend: Error while trying to spill VGPR0 from class VGPR_32: Cannot scavenge register without an emergency spill slot!
clang-14: error: clang frontend command failed with exit code 70 (use -v to see invocation)
AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.0 22051 235b6880e2e515507478181ec11a20c1ec87945b)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.0.0/llvm/bin
clang-14: note: diagnostic msg: Error generating preprocessed source(s).
library/src/tensor_operation_instance/gpu/gemm/CMakeFiles/device_gemm_instance.dir/build.make:374: recipe for target 'library/src/tensor_operation_instance/gpu/gemm/CMakeFiles/device_gemm_instance.dir/device_gemm_xdl_c_shuffle_2_stage_f16_f16_f16_mk_nk_mn_instance.cpp.o' failed
make[2]: *** [library/src/tensor_operation_instance/gpu/gemm/CMakeFiles/device_gemm_instance.dir/device_gemm_xdl_c_shuffle_2_stage_f16_f16_f16_mk_nk_mn_instance.cpp.o] Error 70
make[2]: *** Waiting for unfinished jobs....

Use int32_t instead of int for mfma

9110 compiler issue

Reproduction steps

In the branch 9110_issue_gemm_ex3:
https://github.com/ROCmSoftwarePlatform/composable_kernel/tree/9110_issue_gemm_ex3

$ mkdir build
$ cd build
$ cmake                                                                \
-D CMAKE_BUILD_TYPE=Release                                            \
-D HIP_ONLINE_COMPILER_FLAGS="-DCK_AMD_GPU_GFX908"                     \
-D CMAKE_CXX_FLAGS="-DCK_AMD_GPU_GFX908 --amdgpu-target=gfx908 -O3 "   \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc                              \
-D CMAKE_PREFIX_PATH=/opt/rocm                                         \
..
$ make -j gemm_xdl_bias_relu_add
$ ./example/gemm_xdl_bias_relu_add 1 1 5 3840 4096 4096 4096 4096 4096

The result is

a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
c0_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
c1_m_n: dim 2, lengths {3840, 4096}, strides {1, 0}
arg.a_grid_desc_k0_m_k1_{512, 3840, 8}
arg.b_grid_desc_k0_n_k1_{512, 4096, 8}
arg.c_grid_desc_m_n_{ 3840, 4096}
arg.c0_grid_desc_m_n_{ 3840, 4096}
arg.c1_grid_desc_m_n_{ 3840, 4096}
launch_and_time_kernel: grid_dim {480, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 5 times...
Perf: 1.04685 ms, 123.082 TFlops, 90.1481 GB/s
error: 7.22387e+07
max_diff: 3345, 247, 3592

However, when using the Docker image of the CI server
https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/develop/Dockerfile
or
rocm/tensorflow:rocm4.3.1-tf2.6-dev

a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
c0_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
c1_m_n: dim 2, lengths {3840, 4096}, strides {1, 0}
arg.a_grid_desc_k0_m_k1_{512, 3840, 8}
arg.b_grid_desc_k0_n_k1_{512, 4096, 8}
arg.c_grid_desc_m_n_{ 3840, 4096}
arg.c0_grid_desc_m_n_{ 3840, 4096}
arg.c1_grid_desc_m_n_{ 3840, 4096}
launch_and_time_kernel: grid_dim {480, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 5 times...
Perf: 1.17188 ms, 109.951 TFlops, 80.5305 GB/s
error: 0
max_diff: 0, 1055, 1055

No error occurs.

[cppcheck][warning][missingReturn] cppcheck 2.6 warnings on missing returns

[Keywords]:
cppcheck; warning;
[Urgency: Low; Importance: Low]

[Env]:
[email protected]

[Description]:
The issue is 19 [missingReturn] warnings after the cppcheck upgrade; 17 of them are within the Composable Kernel repository:
[LOG]: [Jenkins Log]
(http://micimaster.amd.com/blue/rest/organizations/jenkins/pipelines/MLLibs/pipelines/MIOpen/branches/jenkins-ci-rocm-4.5/runs/11/nodes/13/steps/73/log/?start=0)
Nothing is logically wrong with the code; we just wish to silence a few warnings and improve quality in static checks.

[Recommendation]:

  1. Add throw std::runtime_error{"Unsupported DataType"}; // unreachable under each static_assert whenever there is no default return.
  2. Since you already have a static_assert, at least one of the else ifs is unnecessary.
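A minimal illustration of the first recommendation. The function and type names are made up for the example; the point is the unreachable throw after a static_assert-guarded if constexpr chain, which silences cppcheck's [missingReturn] (and compilers' -Wreturn-type) without changing behavior.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <type_traits>

// Hypothetical type-dispatch helper in the style the warning is about.
template <typename DataType>
std::string DataTypeName()
{
    static_assert(std::is_same_v<DataType, float> || std::is_same_v<DataType, std::int8_t>,
                  "wrong! unsupported data type");

    if constexpr(std::is_same_v<DataType, float>)
    {
        return "fp32";
    }
    else if constexpr(std::is_same_v<DataType, std::int8_t>)
    {
        return "int8";
    }

    // Unreachable because of the static_assert above, but it gives
    // every control path a visible exit, so static checkers are happy.
    throw std::runtime_error{"Unsupported DataType"};
}
```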

messy header file dependency

This is to follow up a discussion #130 (comment)

I suggest including any needed header files directly in a header (or source file). For example, if function void foo() declared/defined in A.hpp is used in B.hpp, A.hpp should be #included in B.hpp.

If this dependency issue is resolved, we can order the #include directives any way we want, for example alphabetically using clang-format.

Polymorphic function IsSupportedArgument(const BaseArgument* p_base_arg) should check if the pointer can be down-cast

  • Have a PR that fixes every Device Op class

Issue brought up by @j4yan
#128 (comment)

Example fix (needs to be applied to all Device Op classes):

--- a/include/ck/tensor_operation/gpu/device/device_gemm_xdl.hpp
+++ b/include/ck/tensor_operation/gpu/device/device_gemm_xdl.hpp
@@ -412,9 +412,17 @@ struct DeviceGemmXdl
     }

     // polymorphic
-    bool IsSupportedArgument(const BaseArgument* p_arg) override
+    bool IsSupportedArgument(const BaseArgument* p_base_arg) override
     {
-        return IsSupportedArgument(*dynamic_cast<const Argument*>(p_arg));
+        const Argument* p_arg = dynamic_cast<const Argument*>(p_base_arg);
+
+        // make sure p_base_arg can be downcasted
+        if(!p_arg)
+        {
+            return false;
+        }
+
+        return IsSupportedArgument(*p_arg);
     }
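The reasoning behind the fix can be shown with a self-contained sketch (the class names below are stand-ins, not the real Device Op types): dynamic_cast on a pointer yields nullptr when the downcast is invalid, so the null check turns what would be a crash on a foreign argument type into a clean "not supported" answer.

```cpp
// Stand-in class hierarchy mimicking BaseArgument / Argument.
struct BaseArgument
{
    virtual ~BaseArgument() = default; // polymorphic base, so dynamic_cast works
};

struct GemmArgument : BaseArgument
{
    bool valid = true;
};

struct ConvArgument : BaseArgument {};

// Mirrors the corrected IsSupportedArgument: reject foreign argument
// types instead of dereferencing a failed dynamic_cast.
bool IsSupportedArgument(const BaseArgument* p_base_arg)
{
    const GemmArgument* p_arg = dynamic_cast<const GemmArgument*>(p_base_arg);

    if(!p_arg)
    {
        return false; // not our Argument type: cannot be supported
    }

    return p_arg->valid;
}
```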

post merge issue of PR "Reduction in Composable Kernel #82"

ConvBackwardWeightAlgo::V4R4R5XDLATOMICNHWC failure

./host/driver_offline/conv_wrw_driver_offline 1 4 1 4 0 1 1 256 128 3 3 14 14 1 1 1 1 1 1 1 1 1

result

layout: 1
in: dim 4, lengths {1, 14, 14, 128}, strides {25088, 1792, 128, 1}
wei: dim 4, lengths {256, 3, 3, 128}, strides {1152, 384, 128, 1}
out: dim 4, lengths {1, 14, 14, 256}, strides {50176, 3584, 256, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {1, 1, }
ConvDilations size 2, {1, 1, }
device_convolution_backward_weight_implicit_gemm_v4r4r5_xdlops_atomic_nhwc_kyxc_nhwk
GemmKTotal: 196 GrideSizeMN: 9 GemmKBatch: 1 GemmK0: 24 gemmKPad: 192
a_b_k0_m_k1_grid_desc{1, 24, 256, 8}
b_b_k0_n_k1_grid_desc{1, 24, 1152, 8}
c_m_n_grid_desc{ 256, 1152}
gridSize : 9
launch_and_time_kernel: grid_dim {9, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 0 times...
error: 2.50662e+06
max_diff: 74, 1, -73

major code refactor

major

  • Add license
  • use absolute path for header
  • fix header dependency: #170
  • Tensor operation naming: grid/block/warp/thread-level operation, conv/gemm/reduce/elementwise operation
  • #198
  • #169

misc

batched-GEMM memory access fault

docker

compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:9110_ubuntu18.04_py3.6_pytorch_rocm5.0_internal_testing_7ff5b54 \

CMD

./bin/ckProfiler batched_gemm 1 1 0 0 0 5 384 64 384 384 64 64 384

Result

a_g_m_k: dim 3, lengths {384, 384, 384}, strides {147456, 384, 1}
b_g_k_n: dim 3, lengths {384, 384, 64}, strides {4096, 1, 64}
c_g_m_n: dim 3, lengths {384, 384, 64}, strides {24576, 64, 1}
arg.a_grid_desc_k0_m_k1_{48, 512, 8}
arg.b_grid_desc_k0_n_k1_{48, 128, 8}
arg.c_grid_desc_m_n_{512, 128}
launch_and_time_kernel: grid_dim {768, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 5 times...
:0:rocdevice.cpp            :2594: 1147793389293 us: 32964: [tid:0x7f71ef795700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b
Aborted (core dumped)

Report "SmallVector unable to grow" when compiling

When compiling all tests on the branch fp16_transfer_to_bf16, the compiler reports the following error:

fatal error: error in backend: SmallVector unable to grow. Requested capacity (4294967296) is larger than maximum value for size type (4294967295)
clang-14: error: clang frontend command failed with exit code 70 (use -v to see invocation)
AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.1.0 22114 5cba46feb6af367b1cafaa183ec42dbfb8207b14)

This issue is recorded in ticket:
https://ontrack-internal.amd.com/browse/SWDEV-335749

[cppcheck][warning][ODR] possible UB on multiple `struct get_ref_desc_types` in ck namespace

[Keywords]:
cppcheck; warning;
[Urgency: Low; Importance: Medium]

[Env]:
[email protected]

[Description]:
A search on struct get_ref_desc_types turns up 13 files with the same-named struct in the same namespace "ck". This violates the One Definition Rule (ODR) and may potentially result in undefined behavior (UB).
https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/6260ced2f3a4d9a2a832563905135c01ba72b56b/composable_kernel/src/kernel_wrapper/gridwise_generic_reduction_first_call_blockwise_reduce_all_dims.cpp#L161
(other files omitted, just search in the same folder)

[Recommendation]:

  1. A temporary fix is an anonymous namespace: wrapping the struct makes it local to its own translation unit (the .cpp file and all of its includes).
  2. A deeper search reveals a lot of duplication among these 13 files. Taking this struct as an example, only first call vs. second call and all-dim vs. partial-dim differ. Can we move the struct definition to a separate file and consolidate them?
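The first recommendation, sketched on a stand-in struct (one translation unit shown; the real fix wraps get_ref_desc_types the same way in each .cpp):

```cpp
#include <string>

// Wrapping the helper in an anonymous namespace gives it internal linkage,
// so an identically named struct in another .cpp no longer violates the ODR.
namespace
{
struct get_ref_desc_types // stand-in for the real struct's contents
{
    static std::string name() { return "first_call_blockwise"; }
};
} // anonymous namespace

// Only this externally visible function is exported from the translation unit.
std::string DescribeKernel() { return get_ref_desc_types::name(); }
```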

Kernels with LDS bank conflicts

Our kernel sees partial 2-way bank conflict for K-contiguous matrices, and full 2-way/4-way conflict for MN-contiguous matrices (dependent on tile sizes). Profiling has shown partial 2-way bank conflict for K-contiguous matrices doesn't result in observable impact on LDS issue latency. Therefore, our focus is to resolve bank conflict for MN-contiguous matrices.

The following LDS layout is chosen to avoid LDS write bank conflict for MN-contiguous matrices:

  • K0_M_1/K0_N_1 (fp32)
  • K0_M_2/K0_2_N (bf16/fp16)
  • K0_M_4/K0_N_4 (int8)

Conflict-free kernels are already added to C-shuffle device GEMM instances defined in library/src/tensor_operation_instance/gpu/gemm/device_gemm_xdl_c_shuffle_(type)_(layout)_instance.cpp. GridwiseGemm_k0mk1_k0nk1_mn_xdl_cshuffle_v1 (the underlying implementation of C-shuffle device GEMM) supports different K1 value for each individual A/B matrix to facilitate preferred conflict-free LDS layout for each A/B tile.

Other GEMM/Conv device kernels that do not yet facilitate the latest C-shuffle gridwise GEMM implementation still observe bank conflicts.

GEMM

  • Batched GEMM
  • Group GEMM
  • Split-K GEMM
  • Non-C-shuffle ordinary GEMM DeviceGemmXdl

Conv

  • Backward data (conflict in weight)
  • Backward weight (conflict in both output gradient and activation)
  • Forward prop maps to K-contiguous implicit GEMM layout so there's no observable bank conflict.
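As a generic illustration of why the chosen LDS layouts help (this models the textbook banking mechanism, assuming 32 banks of 4-byte words, not CK's actual K0_M_x layouts): when consecutive lanes read a column of a row-major fp32 tile whose row stride is a multiple of the bank count, they all hit the same bank and serialize; padding the stride by one word spreads the accesses across all banks.

```cpp
#include <cstddef>
#include <set>

// Count how many distinct banks 32 "lanes" touch when lane t reads word
// index t * row_stride_words (a column access of a row-major tile).
// Fewer distinct banks means more bank-conflict serialization.
std::size_t distinct_banks(std::size_t row_stride_words)
{
    const std::size_t kNumBanks = 32; // typical LDS: 32 banks of 4-byte words
    std::set<std::size_t> banks;
    for(std::size_t t = 0; t < 32; ++t)
        banks.insert((t * row_stride_words) % kNumBanks);
    return banks.size();
}
```

A stride of 32 words maps every lane to one bank (full conflict); padding to 33 words touches all 32 banks (conflict-free).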

conv2d_fwd_xdl_bias_relu_atomic_add incorrect results.

conv2d_fwd_xdl_bias_relu_atomic_add does not generate correct results.

./example/conv2d_fwd_xdl_bias_relu_atomic_add 1 3 5

Perf: 1.3421 ms, 109.356 TFlops, 248.561 GB/s
error: 6.43832e+08
max_diff: 161.844, 32.7812, 194.625

NaNs for BF16 2D/3D convolution forward.

Issue in this PR #133

Steps to reproduce on ROCm 5.0 and 4.3.1

branch: aosewski/conv_3d
The issue occurs only for the first instance of the generic convolution algorithm, like this one:
https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/d332ff7f3ab2e9e415ed516611c27cf6780cf7da/library/src/tensor_operation_instance/gpu/conv1d_fwd/device_conv1d_fwd_xdl_nwc_kxc_nwk_bf16_instance.cpp#L37
https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/d332ff7f3ab2e9e415ed516611c27cf6780cf7da/library/src/tensor_operation_instance/gpu/conv2d_fwd/device_conv2d_fwd_xdl_nhwc_kyxc_nhwk_bf16_instance.cpp#L38
https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/d332ff7f3ab2e9e415ed516611c27cf6780cf7da/library/src/tensor_operation_instance/gpu/conv3d_fwd/device_conv3d_fwd_xdl_ndhwc_kzyxc_ndhwk_bf16_instance.cpp#L37
So those lines need to be uncommented.

The issue happens only when the input tensors are initialized with integer values, so the current input/weights tensor initialization needs to be changed to integer-value initialization. You can use the init.diff.txt file.

cmake -DBUILD_DEV=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-DCK_AMD_GPU_GFX908 --amdgpu-target=gfx908 -O3" -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc -DCMAKE_PREFIX_PATH="/opt/rocm/;/opt/rocm/hip" ..
cmake --build . -j 36 --target test_convnd_fwd
./bin/test_conv1d_fwd
./bin/test_conv2d_fwd
./bin/test_conv3d_fwd

The error looks like the following:

DeviceConv1DFwdXdl_Input_N_Hi_Wi_C_Weight_K_Y_X_C_Output_N_Ho_Wo_K<256, 256, 128, 4, Default>....                Supported!
launch_and_time_kernel: grid_dim {36, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
        out[0] != ref[0]: -nan != 0
Error: incorrect results!
        out[1] != ref[1]: -nan != 0
Error: incorrect results!
        out[2] != ref[2]: -nan != 0
Error: incorrect results!
        out[3] != ref[3]: -nan != 0
Error: incorrect results!
   max err: 0

TestConv1DNWCBF16Instances ..... FAILURE
>>>>>>>> RUN test conv2d nhwc <<<<<<<<<<
DeviceConv2DFwdXdl_Input_N_Hi_Wi_C_Weight_K_Y_X_C_Output_N_Ho_Wo_K<256, 256, 128, 4, Default>....                Supported!
launch_and_time_kernel: grid_dim {1296, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
        out[0] != ref[0]: -nan != -nan
Error: incorrect results!
        out[1] != ref[1]: -nan != -nan
Error: incorrect results!
        out[2] != ref[2]: -nan != -nan
Error: incorrect results!
        out[3] != ref[3]: -nan != -nan
Error: incorrect results!
   max err: 0

TestConv2DNHWCBF16Instances ..... FAILURE
>>>>>>>> RUN test conv3d ndhwc <<<<<<<<<<
DeviceConv3DFwdXdl_Input_N_Hi_Wi_C_Weight_K_Y_X_C_Output_N_Ho_Wo_K<256, 256, 128, 4, Default>....                Supported!
launch_and_time_kernel: grid_dim {256, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
        out[0] != ref[0]: -nan != -nan
Error: incorrect results!
        out[1] != ref[1]: -nan != -nan
Error: incorrect results!
        out[2] != ref[2]: -nan != -nan
Error: incorrect results!
        out[3] != ref[3]: -nan != -nan
Error: incorrect results!
   max err: 0

TestConv3DNDHWCBF16Instances ..... FAILURE

Rename elementwise operation (single source)

Rename type, variable and file like
AElementwiseOperation a_element_op
BElementwiseOperation b_element_op
CElementwiseOperation c_element_op
SrcElementwiseOperation src_element_op_
DstElementwiseOperation dst_element_op_
element_wise_reduce_operation
element_wise_operation.hpp

Because we need to add another kind of elementwise operation (two-source), we should rename the original concept to prevent confusion, e.g. tensor functor.

definition of relative error

Currently the relative error when the reference value is zero is defined using an epsilon to avoid division by zero; see https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/313bbea5886850acab286f45e9d9816cf0b0dca0/library/include/ck/library/host_tensor/host_tensor.hpp#L327

The problem is that this approach results in a very large relative error even when the absolute error is small. For example, with (v = 1e-8, v_ref = 0, eps = 1e-10), rel_diff = 100. We can use a more reasonable definition, for example:
rel_diff = (v - v_ref) / max(|v|, |v_ref|). See https://en.wikipedia.org/wiki/Relative_change_and_difference for other alternatives.
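A sketch comparing the two definitions on the example above (the function names are made up; the eps value and the max-based formula follow the issue text):

```cpp
#include <algorithm>
#include <cmath>

// Current definition: divide by max(|v_ref|, eps) to avoid division by zero.
double rel_diff_eps(double v, double v_ref, double eps = 1e-10)
{
    return std::abs(v - v_ref) / std::max(std::abs(v_ref), eps);
}

// Proposed definition: scale by the larger magnitude of the two values.
double rel_diff_max(double v, double v_ref)
{
    const double denom = std::max(std::abs(v), std::abs(v_ref));
    return denom == 0.0 ? 0.0 : std::abs(v - v_ref) / denom;
}
```

On (v = 1e-8, v_ref = 0), the eps-based definition blows up to 100 while the max-based one is bounded by 1; for near-zero references, a combined absolute/relative tolerance is a common further refinement.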
