
kernel_tuner's Introduction



Create optimized GPU applications in any mainstream GPU programming language (CUDA, HIP, OpenCL, OpenACC).


Installation

  • First, make sure you have your CUDA, OpenCL, or HIP compiler installed
  • Then type: pip install kernel_tuner[cuda], pip install kernel_tuner[opencl], or pip install kernel_tuner[hip]
  • or why not all of them: pip install kernel_tuner[cuda,opencl,hip]

More information on installation, including for other languages, can be found in the installation guide.

Example

import numpy as np
from kernel_tuner import tune_kernel

kernel_string = """
__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * block_size_x + threadIdx.x;
    if (i<n) {
        c[i] = a[i] + b[i];
    }
}
"""

n = np.int32(10000000)

a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros_like(a)

args = [c, a, b, n]

# block_size_x is a tunable parameter; Kernel Tuner inserts it into the kernel as a C-preprocessor define
tune_params = {"block_size_x": [32, 64, 128, 256, 512]}

tune_kernel("vector_add", kernel_string, n, args, tune_params)

More examples here.


Kernel Tuner ecosystem

  • Kernel Launcher: C++ magic to integrate auto-tuned kernels into C++ applications
  • Kernel Float: C++ data types for mixed-precision CUDA kernel programming
  • Kernel Tuner dashboard: monitor, analyze, and visualize auto-tuning runs

Communication & Contribution

  • GitHub Issues: Bug reports, install issues, feature requests, work in progress
  • GitHub Discussion group: General questions, Q&A, thoughts

Contributions are welcome! For feature requests, bug reports, or usage problems, please feel free to create an issue. For more extensive contributions, check the contribution guide.

Citation

If you use Kernel Tuner in research or research software, please cite the most relevant among the publications on Kernel Tuner. To refer to the project as a whole, please cite:

@article{kerneltuner,
  author  = {Ben van Werkhoven},
  title   = {Kernel Tuner: A search-optimizing GPU code auto-tuner},
  journal = {Future Generation Computer Systems},
  year    = {2019},
  volume  = {90},
  pages   = {347--358},
  url     = {https://www.sciencedirect.com/science/article/pii/S0167739X18313359},
  doi     = {10.1016/j.future.2018.08.004}
}

kernel_tuner's People

Contributors

abelsiqueira, anantzoid, benvanwerkhoven, bouweandela, csbnw, dependabot[bot], egpbos, felipez, fjwillemsen, hannospreeuw, hiker, ipelupessy, isazi, jhidding, liuliujie, loostrum, milolurati, nicorenaud, sakehl, schoonhovenrichard, stijnh, wjp


kernel_tuner's Issues

backend function benchmark

c.py, cuda.py, and opencl.py

  1. returning the full times array would offer more flexibility than returning the mean of times[1:-1], e.g. when you're interested in the distribution of the measured times.
  2. the variable iterations seems more at home as an argument, .benchmark(self, func, gpu_args, threads, grid, iterations=7), than as an instance variable (self.iterations).
  3. if the times[1:-1] construct is to remain, we should add a check such as assert len(times) >= 3, "not enough iterations".
  4. it is currently not possible to adjust the number of iterations of a benchmark through the public interface (see the sketch after this list).
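
A minimal sketch of what points 1-3 could look like combined (run_and_time is a hypothetical helper, for illustration only):

import numpy as np

def benchmark(self, func, gpu_args, threads, grid, iterations=7):
    """Benchmark a compiled kernel and return the full array of measured times."""
    assert iterations >= 3, "not enough iterations"  # point 3
    times = np.zeros(iterations, dtype=np.float64)
    for i in range(iterations):
        # run_and_time is a hypothetical helper that runs the kernel once
        # and returns the measured time
        times[i] = self.run_and_time(func, gpu_args, threads, grid)
    # point 1: return the raw times, so callers can still compute
    # np.mean(times[1:-1]) or inspect the full distribution
    return times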

in cuda.py compiler_options = ['-Xcompiler=-Wall'] not used

The source seems to suggest the following compiler option should be used:

compiler_options = ['-Xcompiler=-Wall']

Later on self.compiler_options is added, but then in the compile step:

self.current_module = source_mod(kernel_string, options=self.compiler_options + ["-e", kernel_name],
                                 arch='compute_' + self.cc, code='sm_' + self.cc,
                                 cache_dir=False, no_extern_c=no_extern_c)

i.e. self.compiler_options is used, not the local compiler_options. Maybe check whether the option is needed and correct accordingly.

example of hostcode that uses OpenCL runtime compilation

As a step towards providing a library for integrating kernels into whole applications, we need some simple and clean examples of host code that uses OpenCL runtime compilation. This could just be for a simple vector_add kernel.
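
Not part of the issue itself, but a minimal pyopencl sketch of the idea, to show what such an example could boil down to (a C host-code example would follow the same structure):

import numpy as np
import pyopencl as cl

kernel_source = """
__kernel void vector_add(__global float *c, __global const float *a,
                         __global const float *b, int n) {
    int i = get_global_id(0);
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
"""

n = 1024
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros_like(a)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
c_g = cl.Buffer(ctx, mf.WRITE_ONLY, c.nbytes)

program = cl.Program(ctx, kernel_source).build()  # runtime compilation happens here
program.vector_add(queue, (n,), None, c_g, a_g, b_g, np.int32(n))
cl.enqueue_copy(queue, c, c_g)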

Add argument to pass options to the tuning strategies

The idea is to add the option for users of tune_kernel to pass a dictionary to control settings specific to the search strategies (see the sketch after this list). Certain options could be shared between different search strategies, for example:

  • "popsize" to control the population size in genetic algorithm, differential evolution, particle swarm optimization, and the firefly algorithm strategies.
  • "maxiter" to control the number of iterations or generations in genetic algorithm, particle swarm optimization, firefly algorithm, and many of the methods supported by minimize.
  • "T" to control the 'temperature' parameter in basin hopping and simulated annealing.
  • "sample_fraction" to control the fraction of the population that random_sample works on, currently this is an optional argument to tune_kernel.
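
A sketch of what such a call could look like (the strategy_options argument name is an assumption here):

# reusing the vector_add example from the README above
strategy_options = {"popsize": 20, "maxiter": 100}
tune_kernel("vector_add", kernel_string, n, args, tune_params,
            strategy="genetic_algorithm", strategy_options=strategy_options)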

Units for timings in tests

The times reported in the tests don't state their units (s? ms? ns?), nor the type of time measured (wall clock, CPU, or GPU time).

add pep8 to dev tools

something along the lines of

pip install pytest-pep8

or

pip install pylint

  • make the code pep8 compliant (currently, the pylint score is just over 50%)
  • add pep8 to the 'how to contribute' documentation

avoid ambiguity of original_kernel

currently, original_kernel (here, here, and here) can be either a filename or a string, while here and here only a string is expected. I think it's clearer to have arguments that are one thing only; that makes it easier to reason about the code. This can easily be achieved by having a kernel_string argument, which is always a string, and a new argument, kernel_filename or similar, which is always a filename. Both should default to None, and the user can then override one of them with a non-default value.

For inspiration, look here.

the function interfaces would then become:

# old 
def detect_language(lang, original_kernel) 
# new
def detect_language(lang, kernel_string)

# old 
def looks_like_a_filename(original_kernel)
# new
# no longer needed

# old
def get_kernel_string(original_kernel)
# new
def get_kernel_string(kernel_string=None, kernel_filename=None)

# old
def prepare_kernel_string(original_kernel, params, grid=(1, 1, 1))
# new
def prepare_kernel_string(kernel_string, params, grid=(1, 1, 1))

# old
def setup_kernel_strings(kernel_name, original_kernel, params, grid)
# new
def setup_kernel_strings(kernel_name, kernel_string, params, grid)

put functions in alphabetical order

I asked around (during one of our clean coding sessions) what people think one should do regarding the ordering of methods/functions. In the Java world at least, the 'right' way of doing it is to order them depending on what level a function is on in the call graph (I find that confusing, BTW). This leaves me unsure how to resolve this issue. Regardless, I still think it would be a good idea to have some kind of ordering of the function names. I leave it up to you whether to pursue this or not.

example of hostcode that uses CUDA runtime compilation

As a step towards providing a library for integrating kernels into whole applications, we need some simple and clean examples of host code that uses CUDA runtime compilation. This could just be for a simple vector_add kernel.
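
Analogous to the OpenCL case above, a minimal pycuda sketch (not from the issue itself) of runtime compilation for a vector_add kernel:

import numpy as np
import pycuda.autoinit  # noqa: F401 -- importing this creates a CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

kernel_source = """
__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
"""

n = 1024
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros_like(a)

module = SourceModule(kernel_source)  # runtime compilation happens here
vector_add = module.get_function("vector_add")
vector_add(drv.Out(c), drv.In(a), drv.In(b), np.int32(n),
           block=(128, 1, 1), grid=((n + 127) // 128, 1))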

improved error message for when pycuda is not (correctly) installed

Sometimes people upgrade CUDA versions and get errors like:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/sheldens/KM3Net/test/test_prefix_sum_kernel.py", line 34, in test_prefix_sum_kernel
    problem_size, args, params)
  File "/home/sheldens/.local/lib/python2.7/site-packages/kernel_tuner/interface.py", line 451, in run_kernel
    func = dev.compile_kernel(instance, False)
  File "/home/sheldens/.local/lib/python2.7/site-packages/kernel_tuner/core.py", line 238, in compile_kernel
    raise e
TypeError: 'NoneType' object is not callable

These aren't very informative. There should be a nicer message stating that we were unable to load pycuda. The most common cause is that pycuda was installed against a previous CUDA version.
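
A minimal sketch of how the import could be wrapped to produce a clearer message (placement and wording are assumptions):

try:
    import pycuda.driver as drv
except ImportError as e:
    raise ImportError("Could not load pycuda. Check that pycuda is installed "
                      "and that it was built against the CUDA version that is "
                      "currently installed on this system. Original error: " + str(e))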

reorganize unit tests

Some functions have moved around quite a bit within the Kernel Tuner code base, but the test suite still reflects the original file structure. It would be nice if the tests could be reorganized such that it will be easier to find the tests that are responsible for a single module.

Also some tests may be considered integration tests and should perhaps be moved to a separate place (I'm talking about tests that actually call run_kernel or tune_kernel).

I'm perfectly fine with the fact that some unit tests require a GPU (and use skiptest when one is not present). I know this goes against the general unit-testing opinion that unit tests should not depend on external factors, but it's something we can't avoid when working with GPU code.

check if the list of arguments supplied matches kernel signature

This issue will be a bit more work as it requires that we inspect the code being passed to the tuner, which could be CUDA, OpenCL, or C code.

We know the name of the function to be called, so we can check whether the signature of the kernel being called actually corresponds to the number of arguments, and whether the data types match the argument list that has been passed by the user.

There is already a check_argument_list function, but that currently only checks whether the user has supplied numpy objects. The idea is to check whether the type (for example numpy.ndarray with dtype=numpy.float32) actually matches the argument to the kernel (for example float *).

Also check whether we need to do anything special in case the user supplies a code generating function rather than a string of source code or a filename.
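
A rough sketch of the kind of check this could add (the dtype map below is illustrative and far from complete):

import numpy as np

# illustrative mapping from C parameter types to the expected numpy dtype
dtype_map = {"float*": np.float32, "double*": np.float64, "int*": np.int32}

def check_argument_types(c_parameters, arguments):
    """Check user-supplied arguments against the parsed kernel signature."""
    if len(c_parameters) != len(arguments):
        raise TypeError("number of arguments does not match kernel signature")
    for c_type, arg in zip(c_parameters, arguments):
        if c_type in dtype_map and arg.dtype != dtype_map[c_type]:
            raise TypeError("argument of dtype %s does not match parameter %s"
                            % (arg.dtype, c_type))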

faster build in travis

In order to speed up the build on Travis we need to remove the sudo dependency from the .travis.yml file (i.e., set sudo: false). This allows using the faster Docker-based stack offered by Travis CI.
See the Travis blog for further explanation.

option quiet does not work

The sequential runner and tune_kernel should not print to stdout when the user passes quiet=True to tune_kernel.

expand matrix multiply tutorial

The goal is to turn this into a full tutorial for implementing a tiled version of a matrix multiply kernel and tuning thread block and tile sizes with Kernel Tuner. The tutorial should explain the basic GPU implementation and how the user can create a script to tune that kernel with Kernel Tuner.

We can start from what we already have at http://benvanwerkhoven.github.io/kernel_tuner/matrix.html, expand it, and turn it into a Jupyter Notebook.

use skiptest decorator

We currently have some tests that call a function to check if there's a GPU and a CUDA installation (or OpenCL device and pyopencl).

We should see whether we can use the skiptest decorator offered by pytest instead of calling the function ourselves.
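
For example, the check could become a module-level flag used with pytest's skipif marker (a sketch; where the flag lives is an assumption):

import pytest

try:
    import pycuda.driver
    cuda_present = True
except ImportError:
    cuda_present = False

@pytest.mark.skipif(not cuda_present, reason="pycuda not installed or no CUDA device")
def test_tune_kernel():
    ...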

check tunable parameters for special names that should not be used

The user can decide on their own names for tunable parameters, but some names will lead to errors because they are also used by the tuner internally.

These include 'time', 'times', which are used by the tuner to report output back to the user.

The parameters 'grid_size_x', 'grid_size_y', and 'grid_size_z' are inserted into the code as C-preprocessor defines; if the user also uses these names for their tunable parameters, this will most likely lead to compile errors.

The idea is to check that the tunable parameters do not contain these reserved keywords when the user calls tune_kernel, and to report a user-friendly error if they do.
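
A minimal sketch of such a check (the set of reserved names follows the issue text and may be incomplete):

RESERVED_NAMES = {"time", "times", "grid_size_x", "grid_size_y", "grid_size_z"}

def check_tune_params(tune_params):
    """Raise a user-friendly error if a tunable parameter name is reserved."""
    forbidden = RESERVED_NAMES & set(tune_params)
    if forbidden:
        raise ValueError("tunable parameter name(s) %s are reserved for internal "
                         "use, please rename them" % sorted(forbidden))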

use more advanced parsing for argument lists

Currently the argument checker only looks for the first pair of parentheses "(" and ")"; we should do something a little more advanced and try to really get the argument list. Note that just looking for the first occurrence of the kernel name is not safe enough either: the name itself could appear in multiple places, as a substring of something else, or inside a comment.

So, depending on the language, it would be best to look for something like "__kernel" "void" "kernel_name" "(" comma-separated argument list ")". Multiline regexes are probably what we need.
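
A rough sketch of such a regex (illustrative only; it still does not handle comments or more exotic signatures):

import re

def extract_argument_list(kernel_source, kernel_name):
    """Return the argument list of 'kernel_name' as a list of strings."""
    pattern = (r"(?:__kernel|__global__)?\s*void\s+" + re.escape(kernel_name)
               + r"\s*\(([^)]*)\)")
    match = re.search(pattern, kernel_source, re.DOTALL)
    if match is None:
        raise ValueError("could not find the signature of " + kernel_name)
    return [arg.strip() for arg in match.group(1).split(",")]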

continue on non-fatal cuda import fails

It may happen that the import of DynamicSourceModule fails while DynamicSourceModule will also not be used later on; theoretically, the script should not fail in that case, at least that is what is implied by:

if int(self.cc) >= 35:
    source_mod = DynamicSourceModule
else:
    source_mod = SourceModule

See #31; otherwise the ImportError message should be changed.

check if reference output matches arguments array

Users of the tuner can supply a reference output using the 'answer=' option of tune_kernel. Currently we don't perform any checks on this input, while it could lead to weird errors if the user makes a mistake.

We should check that the length of the list passed to answer matches that of the arguments passed to the kernel. For all objects in the list that are not None, we should check whether their datatype matches the datatype of the corresponding entry in the arguments list.

There is one exception: when the user supplies a custom function for output verification (using the 'verify=' optional argument of tune_kernel), we should be more flexible. A non-None value in answer means that there is output at this position; the corresponding entry in arguments should at least be an array and not a scalar, but the dtype and length of the array should not be checked, as these are then the responsibility of the user-supplied verification function.
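
A sketch of these checks (the function and argument names are made up for illustration):

import numpy as np

def check_answer(answer, arguments, has_custom_verify=False):
    """Check the 'answer' list against the kernel argument list."""
    if len(answer) != len(arguments):
        raise ValueError("length of answer does not match the argument list")
    for expected, arg in zip(answer, arguments):
        if expected is None:
            continue  # no output expected at this position
        if not isinstance(arg, np.ndarray):
            raise TypeError("answer marks an output here, but the argument is a scalar")
        # with a custom verify function, dtype and length checks are left
        # to that function, so only check the dtype in the default case
        if not has_custom_verify and arg.dtype != np.asarray(expected).dtype:
            raise TypeError("dtype of answer does not match the argument")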

Including CUDA header files in generated code

I was trying the generator system with some vector operations. In OpenCL the operators on vector types are part of the language, so no issue there, but to use them in CUDA I need to include "helper_math.h".
However, in cuda.py the whole generated code is wrapped in extern "C", and this generates a plethora of naming errors with this specific header file.
To make everything work I set no_extern_c to True in cuda.py and added an extern "C" statement around the kernel only. To my understanding, this solution would require the user to add extern "C" to every kernel, breaking compatibility with current code and making everything more cumbersome.
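
A sketch of what the described workaround looks like at the kernel-code level (the float4 kernel is illustrative):

#include "helper_math.h"   // C++ header, must stay outside extern "C"

extern "C" __global__ void vector_add(float4 *c, float4 *a, float4 *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];   // float4 operator+ comes from helper_math.h
    }
}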

Reorganize tests

The way the tests are organized still reflects one of the very early structures of Kernel Tuner. The plan is to reorganize them in a way that reflects the current structure.

simplify installation process

We can use the 'extras' feature inside setup.py to specify a number of extras that require optional dependencies.

Kernel Tuner has several optional dependencies, for example you could only be interested in tuning CUDA kernels, so it would not make sense to require OpenCL to be installed, or vice versa. You can read more about this in the install guide.

The idea is to turn these optional dependencies into extras, such that the user can install Kernel Tuner in a single command instead of first installing Kernel Tuner and then having to also install all optional dependencies.

Currently we have a file called requirements-dev.txt for all the dependencies that are required to start developing Kernel Tuner. These could also be integrated using an 'extra'.
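
A sketch of the relevant part of setup.py (the exact extras and package lists are assumptions):

from setuptools import setup

setup(
    name="kernel_tuner",
    # ... other metadata ...
    install_requires=["numpy"],
    extras_require={
        "cuda": ["pycuda"],
        "opencl": ["pyopencl"],
        "dev": ["pytest", "mock", "pylint"],  # could replace requirements-dev.txt
    },
)

Users could then run, for example, pip install kernel_tuner[cuda] in a single step.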

Please don't forget to update the install and contribution guides accordingly when you work on this issue.

Log benchmarked instances to file and restart from file

The idea is to track progress during tuning by logging benchmarked instances to a user-specified file, probably just using JSON. If the file already exists, the tuner continues from where the tuning process left off. For most optimization strategies this is very easy to implement, because the only change is not starting from an empty cache but from a pre-filled cache instead.

The brute_force strategy will need some adaptation to use a cache, and I'm not entirely sure what the semantics will be for the random_sample strategy. Perhaps it should just sample the space that is left unexplored.
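
A minimal sketch of the log-and-restart idea (the file format and function names are assumptions):

import json
import os

def load_cache(filename):
    """Resume from an earlier run if the log file already exists."""
    if os.path.isfile(filename):
        with open(filename) as fh:
            return json.load(fh)
    return {}

def log_instance(filename, cache, params, time):
    """Record one benchmarked instance in memory and on disk."""
    cache[json.dumps(params, sort_keys=True)] = time
    with open(filename, "w") as fh:
        json.dump(cache, fh)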

fully migrate from nosetests to pytest

The goal is to use only pytest for testing Kernel Tuner, but here and there the tests still use the no-longer-maintained nosetests. We want to replace these uses with their pytest equivalents as much as possible.
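
For example, a typical replacement:

# nose style
from nose.tools import assert_equal

def test_add_nose():
    assert_equal(1 + 1, 2)

# pytest style: a plain assert; pytest reports the compared values on failure
def test_add_pytest():
    assert 1 + 1 == 2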

tutorial review by jspaaks

I'm collecting some notes here pertaining to the text of the kernel_tuner documentation, including its tutorial.

The kernel_tuner code base is currently at: https://github.com/benvanwerkhoven/kernel_tuner/tree/d36feb61eab791e08132051ffefa9458df79ad80

(I'm using GitHub's blame option to be able to refer to specific line numbers in specific commits for .rst files)

  1. I would remove https://github.com/benvanwerkhoven/kernel_tuner/blame/d36feb61eab791e08132051ffefa9458df79ad80/README.rst#L47 since you already have #48
  2. in the install section, add a reference to the dependencies section so a user knows what to have installed before attempting to install the tuner itself. Speaking of dependencies, isn't it just a case of pip install -r requirements.txt? That already lists pycuda and pyopencl.
  3. my default python is 2.7 for some reason; it looks like that results in an error (this could have been related to the errors I got below)
  4. using virtual env for python3 should make things easier, e.g.
    # use virtualenv to create a python3 virtual environment called .venv-3
    # (needs virtualenv?)
    virtualenv -p python3 .venv-3 
    
    # activate the virtual environment
    . ./.venv-3/bin/activate
    
    # install requirements
    pip install --requirement requirements.txt
    
    # install the tuner itself
    pip install .
  5. the code snippet above doesn't work yet; it complains about nvcc not being present. Tried to fix that with
    sudo apt install nvidia-cuda-toolkit
    (it then still complains about missing numpy). Installing the CUDA toolkit from the repos is apparently not advisable; get the newest version instead from https://developer.nvidia.com/cuda-downloads
  6. (as discussed) remove pycuda and pyopencl from requirements.txt; possibly introduce them later, when the user knows what he/she needs. Maybe also remove the 'devDependencies': mock, nose, and pytest. Possibly use different sets of requirements, e.g.
    • requirements-testing
    • requirements-opencl
    • requirements-cuda, etc
  7. solved install problems as follows:
    # removed python2 entirely
    sudo apt install python3
    sudo apt install python3-pip
    # edited the requirements so it only includes numpy mock nose  pytest
    pip3 install -r requirements.txt --no-cache-dir
    pip3 install .
  8. about the example about heat diffusion: you mention somewhere that 'temperature' is being dispersed, but technically it's 'heat'. You'd need the heat capacity of the material in order to calculate the change in temperature after a change in heat due to dispersion.
  9. you use lower case delta to signify a distance (in time and space), AFAICT this should be upper case, the difference being that lower case delta is used to signify an infinitesimal distance, whereas your distance is 1.0 in space, and 0.225 in time.
  10. I'd benefit from a graph/schematic explaining x and y, and u at various combinations of x and y (not sure this is possible in a Jupyter notebook).
  11. I'd benefit from a derivation of the diffusion equation (maybe not in the main document but a link to some other document that uses the same symbols).
  12. I see you use %matplotlib inline somewhere. I didn't know this syntax (never worked with notebooks before). Perhaps add a comment that explains it?
  13. Depending on the intended audience, maybe provide a quick recap of threads / blocks and why they matter; or point to some of your other material?
  14. in In[3], I'd break the line field[numpy.random.randint(0,nx,size=10), numpy.random.randint(0,ny,size=10)] = 1e3 down into something like:
    nHotspots = 10
    cols = numpy.random.randint(0, nx, size=nHotspots)
    rows = numpy.random.randint(0, ny, size=nHotspots)
    tempHot = 1000.
    tempCold = 1.
    field = numpy.ones((ny, nx)).astype(numpy.float32) * tempCold
    field[rows, cols] = tempHot
    return field
  15. In In[7] and elsewhere, you do a memcpy_htod. It may be good to mention that htod stands for host-to-device (I think).

Support for C++ templated kernels

CUDA supports the use of C++ in the kernel language, including the use of templates. Currently, using pycuda, kernels are required to have extern "C" linkage, which prevents templated kernels from being used directly. Willem Jan has started work on making it possible to call such kernels directly using nvrtc.
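
For reference, a minimal example of the kind of kernel this is about; the template makes extern "C" linkage impossible:

template <typename T>
__global__ void vector_add(T *c, T *a, T *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}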

time increase after using this tool as on-line tuner

I use kernel_tuner as an online tuner for GEMM: when I want to run a GEMM, I first tune the parameters and then use Kernel Tuner's result to set them. I find that the kernel running time after tuning is larger than the time before tuning, when the parameters were simply picked from the search space at random. Do you have any suggestions for this problem?
