KTT - Kernel Tuning Toolkit

KTT is an autotuning framework for OpenCL and CUDA kernels and GLSL compute shaders. It primarily focuses on GPU accelerators, but can also be used to autotune code for other devices (e.g., CPUs) when OpenCL is utilized. Version 2.1, which introduces API bindings for Python and a new onboarding guide, is now available.

Main features

  • Ability to define kernel tuning parameters such as kernel thread sizes, vector data types and loop unroll factors to optimize computation for a particular device.
  • Support for iterative kernel launches and composite kernels.
  • Support for multiple compute queues and asynchronous operations.
  • Support for online auto-tuning - kernel tuning combined with regular kernel running.
  • Ability to automatically ensure the correctness of tuned computation with reference kernel or C++ function.
  • Support for multiple compute APIs, switching between CUDA, OpenCL and Vulkan requires only minor changes in C++ code (e.g., changing the kernel source file), no library recompilation is needed.
  • Public API available in C++ (native) and Python (bindings).
  • Many customization options, including support for kernel arguments with user-defined data types, ability to change kernel compiler flags and more.

Getting started

  • Introductory guide to KTT can be found here.
  • Full documentation for KTT API can be found here.
  • KTT FAQ can be found here.
  • The newest release of the KTT framework can be found here.
  • Prebuilt binaries are not provided due to many different combinations of compute APIs and build options available. The Building KTT section contains detailed instructions on how to perform a build.

Tutorials

Tutorials are short examples that serve as an introduction to the KTT framework. Each tutorial covers a specific part of the API. All tutorials are available for both the OpenCL and CUDA backends, and most are also available for Vulkan. The tutorials assume that the reader has some knowledge of C++ and GPU programming. List of the currently available tutorials:

  • Info: Retrieving information about compute API platforms and devices through KTT API.
  • KernelRunning: Running simple kernel with KTT framework and retrieving output.
  • KernelTuning: Simple kernel tuning using a small number of tuning parameters and reference computation to validate output.
  • CustomArgumentTypes: Usage of kernel arguments with custom data types and validating the output with value comparator.
  • ComputeApiInitializer: Providing tuner with custom compute context, queues and buffers.
  • VectorArgumentCustomization: Showcasing different usage options for vector kernel arguments.
  • PythonInterfaces: Implementing custom searchers and stop conditions in Python, which can afterward be used with the tuner.
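The idea behind the PythonInterfaces tutorial can be sketched without the pyktt API itself: a searcher proposes the next configuration to benchmark from the not-yet-explored part of the search space, and signals a stop condition when the space is exhausted. A framework-agnostic Python sketch (class and method names are illustrative, not the KTT interface):

```python
import random


class RandomSearcher:
    """Illustrative stand-in for a KTT-style searcher: proposes the next
    configuration to benchmark from the unexplored part of the space."""

    def __init__(self, configurations, seed=0):
        self._unexplored = list(configurations)
        self._rng = random.Random(seed)

    def get_next_configuration(self):
        if not self._unexplored:
            return None  # stop condition: search space exhausted
        index = self._rng.randrange(len(self._unexplored))
        return self._unexplored.pop(index)


searcher = RandomSearcher([{"BLOCK_SIZE": b} for b in (32, 64, 128)])
while (config := searcher.get_next_configuration()) is not None:
    print(config)  # each configuration is proposed exactly once
```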

Examples

Examples showcase how the KTT framework can be utilized in real-world scenarios. They are more complex than the tutorials and assume that the reader is familiar with the KTT API. List of some of the currently available examples:

  • CoulombSum2d: Tuning of electrostatic potential map computation, focuses on a single slice.
  • CoulombSum3dIterative: 3D version of the previous example, utilizes kernel from 2D version and launches it iteratively.
  • CoulombSum3d: Alternative to iterative version, utilizes kernel which computes the entire map in a single invocation.
  • Nbody: Tuning of N-body simulation.
  • Reduction: Tuning of vector reduction, launches a kernel iteratively.
  • Sort: Radix sort example, combines multiple kernels into a composite kernel.
  • Bicg: Biconjugate gradients method example, features reference computation, composite kernels and constraints.

Building KTT

  • KTT can be built as a dynamic (shared) library using the command line build tool Premake. Currently supported operating systems are Linux and Windows.

  • The prerequisites to build KTT are:

    • C++17 compiler, for example Clang 7.0, GCC 9.1, MSVC 14.16 (Visual Studio 2017) or newer
    • OpenCL, CUDA or Vulkan library, supported SDKs are AMD OCL SDK, Intel SDK for OpenCL, NVIDIA CUDA Toolkit and Vulkan SDK
    • Command line build tool Premake 5
    • (Optional) Python 3 with NumPy for Python bindings support
    • (Optional) NVIDIA CUPTI or AMD GPU Performance API for profiling and the profile-based searcher
  • Build under Linux (inside KTT root folder):

    • ensure that the path to the vendor SDK is correctly set in the environment variables
    • run ./premake5 gmake to generate makefile
    • run cd Build to get inside the build directory
    • afterwards run make config={configuration}_{architecture} to build the project (e.g., make config=release_x86_64)
  • Build under Windows (inside KTT root folder):

    • ensure that the path to the vendor SDK is correctly set in the environment variables; this should be done automatically during SDK installation
    • run premake5.exe vs20xx (e.g., premake5.exe vs2019) to generate Visual Studio project files
    • open generated solution file and build the project inside Visual Studio
  • The following build options are available:

    • --outdir=path specifies custom build directory, default build directory is Build
    • --platform=vendor specifies SDK used for building KTT, useful when multiple SDKs are installed
    • --profiling=library enables compilation of kernel profiling functionality using specified library
    • --vulkan enables compilation of experimental Vulkan backend
    • --python enables compilation of Python bindings
    • --no-examples disables compilation of examples
    • --no-tutorials disables compilation of tutorials
    • --tests enables compilation of unit tests
    • --no-cuda disables the inclusion of CUDA API during compilation, only affects Nvidia platform
    • --no-opencl disables the inclusion of OpenCL API during compilation
  • KTT relies on external dynamic (shared) libraries to work correctly. There are multiple ways to make these libraries accessible, e.g., copying a given library into the application folder or adding its containing folder to the library path (example for Linux: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/shared/library). Libraries which are bundled with device drivers are usually visible by default. The list of libraries currently utilized by KTT (typically not all of them are required; this depends on the options passed to premake5):

    • OpenCL distributed with specific device drivers (OpenCL only)
    • cuda distributed with specific device drivers (CUDA only)
    • nvrtc distributed with specific device drivers (CUDA only)
    • cupti bundled with Nvidia CUDA Toolkit (CUDA profiling only)
    • nvperf_host bundled with Nvidia CUDA Toolkit (new CUDA profiling only)
    • nvperf_target bundled with Nvidia CUDA Toolkit (new CUDA profiling only)
    • GPUPerfAPICL bundled with KTT distribution (AMD OpenCL profiling only)
    • vulkan distributed with specific device drivers (Vulkan only)
    • shaderc_shared bundled with Vulkan SDK (Vulkan only)

Using KTT in user applications

Applications using KTT need to link the KTT dynamic library and include KTT headers.

  • The application has to include Ktt.h (located in the KTT/Source folder)
  • During the application build, KTT headers have to be available, e.g., by copying them into a standard headers location or by passing their location to the compiler (e.g., -Ilocation_of_my_KTT/Source with g++)
  • During the application build, the KTT library has to be linked, e.g., with -lktt, or -Llocation_of_my_KTT/Build/x86_64_Release/ -lktt. The KTT dynamic library can also be copied into a standard location for dynamic libraries (e.g., /usr/lib).
  • During application execution, the KTT library has to be available (e.g., located in the application's folder, in a standard location for dynamic libraries, or in any place included in the corresponding environment variable, such as LD_LIBRARY_PATH)

Python bindings

To use the KTT Python API, the KTT module must be built with the --python option. For this build option to work, access to the Python development headers and library must be provided through the environment variables PYTHON_HEADERS and PYTHON_LIB, respectively. Once the build is finished, a Python module is created in addition to the regular C++ module (named pyktt.pyd under Windows and pyktt.so under Linux). This module can be imported into Python programs in the same way as regular modules. Note that Python must have access to all libraries which the KTT module depends on (e.g., various profiling libraries), otherwise the loading will fail.
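Because the import can fail for environment reasons (the module is not on sys.path, or a dependent shared library cannot be resolved by the OS loader), it is worth guarding it. A minimal sketch; the fallback behavior is illustrative, not part of KTT:

```python
import importlib


def load_pyktt():
    """Try to import the KTT Python module, returning None when unavailable.

    The import fails not only when pyktt itself is missing, but also when
    any shared library it depends on (e.g., CUDA/OpenCL runtimes or
    profiling libraries) cannot be resolved by the OS loader.
    """
    try:
        return importlib.import_module("pyktt")
    except ImportError as error:
        print(f"pyktt unavailable: {error}")
        return None


ktt_module = load_pyktt()
```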

Related projects

The KTT API is based on the CLTune project, and certain parts of the API are similar to CLTune. However, the internal structure has been completely rewritten from scratch. The ClTuneGemm and ClTuneConvolution examples are adapted from CLTune.

KTT search space generation and tuning configuration storage techniques are derived from the ATF project. Due to differences in API and available framework features, certain modifications were made to the original ATF algorithms. The examples stored in the AtfSamples folder are adapted from ATF.

How to cite

F. Petrovič et al. A benchmark set of highly-efficient CUDA and OpenCL kernels and its dynamic autotuning with Kernel Tuning Toolkit. In Future Generation Computer Systems, Volume 108, 2020.

ktt's People

Contributors

dstrelak, jiri-filipovic, knutkirkhorn, rtrembecky


ktt's Issues

Unable to compile without CUDA

Hi,
I am unable to compile KTT, development branch.
The compilation fails with the message attached here make.log

First, settings:

  • I have Intel OpenCl installed and paths set up
  • I DO not have CUDA installed, CUDA_PATH is not set

Second, what I have tried and failed to resolve the situation:

  • clean build to rule out some old and obsolete settings in Makefiles
  • premake --no-cuda --platform=intel to explicitly tell that I do not have CUDA
  • new git clone to rule out the influence of any of my modifications

Master branch works fine, but it does not contain all new features I need to develop a new example.
Happy to answer any additional questions. I hope to see this resolved soon.
Best regards,
Janka

HostZeroCopy not working on OpenCL

Hi,
I have tried out ArgumentMemoryLocation::HostZeroCopy and came across some unexpected behaviour on OpenCL.

I have a small example showing the issue in https://github.com/jpazurikova/KTT/tree/sort-memory-mapping
There is just one kernel in the composition. Its arguments are initialized with random values, the first several values of the arrays are written on the output. The kernel itself just prints out several of the values from the beginning of the arrays (starts with "beginning"), adds 2 to each value and writes the values again (starts with "end").

The CUDA version works as expected; it is clear that the values printed from the kernel are the same as the ones initialized in the host code. I am printing just the first value of uint2, which is why only every other value is printed out by the kernel. Example output:

0 128959393 128959393
1 1692901013 1692901013
2 436085873 436085873
3 748533630 748533630
4 776550279 776550279
5 289134331 289134331
6 807385195 807385195
7 556889022 556889022
8 95168426 95168426
9 1888844001 1888844001
[INFO] Initializing tuner for device GeForce GTX 680
[INFO] Computing reference class result for kernel sort
[INFO] Running kernel composition sort with configuration: global size (1, 1, 1), local size (1, 1, 1), parameters: SORT_BLOCK_SIZE: 32, SCAN_BLOCK_SIZE: 32, SORT_VECTOR: 2, SCAN_VECTOR: 2
[DEBUG] Uploading buffer for argument 0, event id: 0
[DEBUG] Performing buffer operation synchronization for event id: 0
[DEBUG] Uploading buffer for argument 1, event id: 1
[DEBUG] Uploading buffer for argument 2, event id: 2
[DEBUG] Uploading buffer for argument 3, event id: 3
[DEBUG] Launching kernel radixSortBlocks, event id: 4
[DEBUG] Performing kernel synchronization for event id: 4
beginning
0 128959393 128959393
1 436085873 436085873
2 776550279 776550279
3 807385195 807385195
4 95168426 95168426
5 1353699824 1353699824
6 686611921 686611921
7 912905316 912905316
8 1698004709 1698004709
9 1532780646 1532780646
end
0 128959395 128959393
1 436085875 436085873
2 776550281 776550279
3 807385197 807385195
4 95168428 95168426
5 1353699826 1353699824
6 686611923 686611921
7 912905318 912905316
8 1698004711 1698004709
9 1532780648 1532780646

However, OpenCL kernel prints out zeros, the initialized values do not "get" to the kernel. Example output:

0 128959393 128959393
1 1692901013 1692901013
2 436085873 436085873
3 748533630 748533630
4 776550279 776550279
5 289134331 289134331
6 807385195 807385195
7 556889022 556889022
8 95168426 95168426
9 1888844001 1888844001
[INFO] Initializing tuner for device Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
[INFO] Computing reference class result for kernel sort
[INFO] Running kernel composition sort with configuration: global size (1, 1, 1), local size (1, 1, 1), parameters: SORT_BLOCK_SIZE: 32, SCAN_BLOCK_SIZE: 32, SORT_VECTOR: 2, SCAN_VECTOR: 2
[DEBUG] Uploading buffer for argument 0, event id: 0
[DEBUG] Performing buffer operation synchronization for event id: 0
[DEBUG] Uploading buffer for argument 1, event id: 1
[DEBUG] Uploading buffer for argument 2, event id: 2
[DEBUG] Uploading buffer for argument 3, event id: 3
[DEBUG] Launching kernel radixSortBlocks, event id: 4
[DEBUG]beginning
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
end
0 2 2
1 2 2
2 2 2
3 2 2
4 2 2
5 2 2
6 2 2
7 2 2
8 2 2
9 2 2

This is obviously unexpected behaviour. If KTT does not support HostZeroCopy for OpenCL, it should say so and end the execution, or fall back to Device instead of HostZeroCopy.

I have debugged this a little and pinned it down to the following: the bufferEvent for the vectors created at opencl_engine.cpp:279 gets erased in opencl_engine.cpp:525 due to validFlag set to false. The flag set to true causes segfault, so it's probably false for a good reason. You clearly know better what to do, I just wanted to help with finding the origin of the error.

Hope this helps. If you have any questions, just ask.
Janka

global and local sizes in output are not correct if they were changed in manipulator

I am running a kernel composition using manipulator. Here is a code snippet of important lines to describe what I do:

main function:

const ktt::DimensionVector ndRangeDimensions(M, N);
const ktt::DimensionVector workGroupDimensions(32, 32);
ktt::KernelId kernelFusedId = tuner.addKernelFromFile(kernelFile, "bicgFused", ndRangeDimensions, workGroupDimensions);
std::vector<ktt::ArgumentId> kernelArgs = { AId, x1Id, y1Id, x2Id, y2Id, mId, nId };
ktt::KernelId kernelId = tuner.addComposition("BicgPolyBenchAndFused", std::vector<ktt::KernelId>{kernel1Id, kernel2Id, kernelFusedId}, std::make_unique<BicgManipulator>(kernel1Id, kernel2Id, kernelFusedId));
manipulator's launchComputation method:

void launchComputation(const ktt::KernelId kernelId) override
{
    ktt::DimensionVector globalSize = getCurrentGlobalSize(kernelFusedId);
    ktt::DimensionVector localSize = getCurrentLocalSize(kernelFusedId);
    std::vector<ktt::ParameterPair> parameterValues = getCurrentConfiguration();

    const int rowsProcessed = getParameterValue("ROWS_PROCESSED", parameterValues);
    const int tile = getParameterValue("TILE", parameterValues);
    globalSize.setSizeX(M);
    globalSize.setSizeY(N / rowsProcessed * tile);
    localSize.setSizeX(tile);
    localSize.setSizeY(tile);
    printf("changed global to %d x %d and local to %d x %d\n", globalSize.getSizeX(), globalSize.getSizeY(), localSize.getSizeX(), localSize.getSizeY());

    runKernel(kernelFusedId, globalSize, localSize);
}

In the output, I see global size 2: 8192, 8192, 1; local size 2: 32, 32, 1;, even though I would expect the changed values printed by my printf: changed global to 8192 x 1024 and local to 16 x 16.

Document required .dll / .so files

KTT requires multiple .dll / .so libraries to work correctly. Add information about these files to the documentation, along with examples of how to set the path to them.

Vector values are not passed properly

The vector std::vector<size_t> test{0, 76}; passed to the kernel is read in the kernel as {0, 0}.

Minimal code to reproduce:

#include <vector>
#include <iostream>
#include "tuner_api.h"

int main(int argc, char **argv) {
        ktt::Tuner tuner(0, 0, ktt::ComputeAPI::CUDA);
        tuner.setGlobalSizeType(ktt::GlobalSizeType::CUDA);

        ktt::KernelId kernelId = tuner.addKernelFromFile("../examples/sparse/bug.cu", "bug", ktt::DimensionVector(1),
                                                         ktt::DimensionVector(1));
        std::vector<size_t> test{0, 76};
        tuner.setKernelArguments(kernelId, {tuner.addArgumentVector(test, ktt::ArgumentAccessType::ReadOnly)});

        std::cout << test[0] << " " << test[1] << std::endl;

        tuner.tuneKernel(kernelId);

        return 0;
}

bug.cu:

extern "C" __global__ void bug(const size_t* csrRowPtrA) {
        if (threadIdx.x == 0)
                printf("[0]: %d, [1]: %d\n", csrRowPtrA[0], csrRowPtrA[1]);
}

Output: [0]: 0, [1]: 0.

KTT version: fba8d73

Unify kernel and kernel composition usage outside kernel manager

Components such as tuning runner and configuration manager now need to work with both kernels and compositions which leads to significant code duplication.

Standalone kernels can be wrapped into a composition so that only kernel manager has to directly work with both types of these structures. Other components will utilize only compositions.

addCompositionKernelParameter method is missing a version without a thread modifier

According to the KTT documentation, the addCompositionKernelParameter method has only one signature, whereas addParameter has two.

addCompositionKernelParameter ( const KernelId compositionId, const KernelId kernelId, const std::string & parameterName, const std::vector< size_t > & parameterValues, const ModifierType modifierType, const ModifierAction modifierAction, const ModifierDimension modifierDimension );
addParameter ( const KernelId id, const std::string & parameterName, const std::vector< size_t > & parameterValues );
addParameter ( const KernelId id, const std::string & parameterName, const std::vector< size_t > & parameterValues, const ModifierType modifierType, const ModifierAction modifierAction, const ModifierDimension modifierDimension );

I am using ModifierType::None, but it is not a nice solution, as the other two arguments are useless.
