
kernel_tuner's Introduction



Create optimized GPU applications in any mainstream GPU programming language (CUDA, HIP, OpenCL, OpenACC).


Installation

  • First, make sure you have your CUDA, OpenCL, or HIP compiler installed
  • Then type: pip install kernel_tuner[cuda], pip install kernel_tuner[opencl], or pip install kernel_tuner[hip]
  • or why not all of them: pip install kernel_tuner[cuda,opencl,hip]

More information on installation, including for other languages, can be found in the installation guide.

Example

import numpy as np
from kernel_tuner import tune_kernel

kernel_string = """
__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * block_size_x + threadIdx.x;
    if (i<n) {
        c[i] = a[i] + b[i];
    }
}
"""

n = np.int32(10000000)

a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros_like(a)

args = [c, a, b, n]

# block_size_x is a tunable parameter; Kernel Tuner inserts it into the kernel as a C-preprocessor define
tune_params = {"block_size_x": [32, 64, 128, 256, 512]}

tune_kernel("vector_add", kernel_string, n, args, tune_params)

More examples here.


Kernel Tuner ecosystem

  • Kernel Launcher: C++ magic to integrate auto-tuned kernels into C++ applications
  • Kernel Float: C++ data types for mixed-precision CUDA kernel programming
  • Kernel Tuner dashboard: monitor, analyze, and visualize auto-tuning runs

Communication & Contribution

  • GitHub Issues: Bug reports, install issues, feature requests, work in progress
  • GitHub Discussion group: General questions, Q&A, thoughts

Contributions are welcome! For feature requests, bug reports, or usage problems, please feel free to create an issue. For more extensive contributions, check the contribution guide.

Citation

If you use Kernel Tuner in research or research software, please cite the most relevant among the publications on Kernel Tuner. To refer to the project as a whole, please cite:

@article{kerneltuner,
  author  = {Ben van Werkhoven},
  title   = {Kernel Tuner: A search-optimizing GPU code auto-tuner},
  journal = {Future Generation Computer Systems},
  year    = {2019},
  volume  = {90},
  pages   = {347--358},
  url     = {https://www.sciencedirect.com/science/article/pii/S0167739X18313359},
  doi     = {10.1016/j.future.2018.08.004}
}

kernel_tuner's People

Contributors

abelsiqueira, anantzoid, benvanwerkhoven, bouweandela, csbnw, dependabot[bot], egpbos, felipez, fjwillemsen, hannospreeuw, hiker, ipelupessy, isazi, jhidding, liuliujie, loostrum, milolurati, nicorenaud, sakehl, schoonhovenrichard, stijnh, wjp


kernel_tuner's Issues

backend function benchmark

c.py, cuda.py, and opencl.py

  1. returning the full times array would offer more flexibility than returning the mean of times[1:-1], e.g. when you're interested in the distribution of the measured times.
  2. the variable iterations seems more at home as an argument, .benchmark(self, func, gpu_args, threads, grid, iterations=7), than as an instance variable (self.iterations).
  3. if the times[1:-1] construct is to remain, we should add a check such as assert len(times) >= 3, "not enough iterations".
  4. it is currently not possible to adjust the number of iterations of a benchmark through the public interface (see the sketch after this list).
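
A minimal sketch of what points 1-3 could look like combined (run_and_time is a hypothetical helper, for illustration only):

import numpy as np

def benchmark(self, func, gpu_args, threads, grid, iterations=7):
    """Benchmark a compiled kernel and return the full array of measured times."""
    assert iterations >= 3, "not enough iterations"  # point 3
    times = np.zeros(iterations, dtype=np.float64)
    for i in range(iterations):
        # run_and_time is a hypothetical helper that runs the kernel once
        # and returns the measured time
        times[i] = self.run_and_time(func, gpu_args, threads, grid)
    # point 1: return the raw times, so callers can still compute
    # np.mean(times[1:-1]) or inspect the full distribution
    return times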

in cuda.py compiler_options = ['-Xcompiler=-Wall'] not used

The source seems to suggest the following compiler option should be used:

compiler_options = ['-Xcompiler=-Wall']

Later on self.compiler_options is added, but then in the compile step:

self.current_module = source_mod(kernel_string, options=self.compiler_options + ["-e", kernel_name],
                                 arch='compute_' + self.cc, code='sm_' + self.cc,
                                 cache_dir=False, no_extern_c=no_extern_c)

i.e. self.compiler_options is used, not the local compiler_options. Maybe check whether the option is needed and correct accordingly.

example of hostcode that uses OpenCL runtime compilation

As a step towards providing a library for integrating kernels into whole applications, we need some simple and clean examples of host code that uses OpenCL runtime compilation. This could just be for a simple vector_add kernel.
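
Not part of the issue itself, but a minimal pyopencl sketch of the idea, to show what such an example could boil down to (a C host-code example would follow the same structure):

import numpy as np
import pyopencl as cl

kernel_source = """
__kernel void vector_add(__global float *c, __global const float *a,
                         __global const float *b, int n) {
    int i = get_global_id(0);
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
"""

n = 1024
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros_like(a)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
c_g = cl.Buffer(ctx, mf.WRITE_ONLY, c.nbytes)

program = cl.Program(ctx, kernel_source).build()  # runtime compilation happens here
program.vector_add(queue, (n,), None, c_g, a_g, b_g, np.int32(n))
cl.enqueue_copy(queue, c, c_g)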

Add argument to pass options to the tuning strategies

The idea is to add the option for users of tune_kernel to pass a dictionary to control settings specific to the search strategies (see the sketch after this list). Certain options could be shared between different search strategies, for example:

  • "popsize" to control the population size in genetic algorithm, differential evolution, particle swarm optimization, and the firefly algorithm strategies.
  • "maxiter" to control the number of iterations or generations in genetic algorithm, particle swarm optimization, firefly algorithm, and many of the methods supported by minimize.
  • "T" to control the 'temperature' parameter in basin hopping and simulated annealing.
  • "sample_fraction" to control the fraction of the population that random_sample works on, currently this is an optional argument to tune_kernel.
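
A sketch of what such a call could look like (the strategy_options argument name is an assumption here):

# reusing the vector_add example from the README above
strategy_options = {"popsize": 20, "maxiter": 100}
tune_kernel("vector_add", kernel_string, n, args, tune_params,
            strategy="genetic_algorithm", strategy_options=strategy_options)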

Units for timings in tests

The times reported in the tests don't state their units (s? ms? ns?), nor the type of time measured (wall clock, CPU, or GPU time).

add pep8 to dev tools

something along the lines of

pip install pytest-pep8

or

pip install pylint

  • make the code pep8 compliant (currently, the pylint score is just over 50%)
  • add pep8 to the 'how to contribute' documentation

avoid ambiguity of original_kernel

currently, original_kernel (here, here, and here) can be either a filename or a string, while here and here only a string is expected. I think it's clearer to have arguments that are one thing only; that makes it easier to reason about the code. This can easily be achieved by having a kernel_string argument, which is always a string, and a new argument, kernel_filename or similar, which is always a filename. Both should default to None, and the user can then override one of them with a non-default value.

For inspiration, look here.

the function interfaces would then become:

# old 
def detect_language(lang, original_kernel) 
# new
def detect_language(lang, kernel_string)

# old 
def looks_like_a_filename(original_kernel)
# new
# no longer needed

# old
def get_kernel_string(original_kernel)
# new
def get_kernel_string(kernel_string=None, kernel_filename=None)

# old
def prepare_kernel_string(original_kernel, params, grid=(1, 1, 1))
# new
def prepare_kernel_string(kernel_string, params, grid=(1, 1, 1))

# old
def setup_kernel_strings(kernel_name, original_kernel, params, grid)
# new
def setup_kernel_strings(kernel_name, kernel_string, params, grid)

put functions in alphabetical order

I asked around (during one of our clean coding sessions) what people think one should do regarding the ordering of methods/functions. In the Java world at least, the 'right' way of doing it is to order them depending on what level a function is on in the call graph (I find that confusing, BTW). This leaves me unsure how to resolve this issue. Regardless, I still think it would be a good idea to have some kind of ordering of the function names. I leave it up to you whether to pursue this or not.

example of hostcode that uses CUDA runtime compilation

As a step towards providing a library for integrating kernels into whole applications, we need some simple and clean examples of host code that uses CUDA runtime compilation. This could just be for a simple vector_add kernel.
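
Analogous to the OpenCL case above, a minimal pycuda sketch (not from the issue itself) of runtime compilation for a vector_add kernel:

import numpy as np
import pycuda.autoinit  # noqa: F401 -- importing this creates a CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

kernel_source = """
__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
"""

n = 1024
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros_like(a)

module = SourceModule(kernel_source)  # runtime compilation happens here
vector_add = module.get_function("vector_add")
vector_add(drv.Out(c), drv.In(a), drv.In(b), np.int32(n),
           block=(128, 1, 1), grid=((n + 127) // 128, 1))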

improved error message for when pycuda is not (correctly) installed

Sometimes people upgrade CUDA versions and get errors like:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/sheldens/KM3Net/test/test_prefix_sum_kernel.py", line 34, in test_prefix_sum_kernel
    problem_size, args, params)
  File "/home/sheldens/.local/lib/python2.7/site-packages/kernel_tuner/interface.py", line 451, in run_kernel
    func = dev.compile_kernel(instance, False)
  File "/home/sheldens/.local/lib/python2.7/site-packages/kernel_tuner/core.py", line 238, in compile_kernel
    raise e
TypeError: 'NoneType' object is not callable

These aren't very informative. There should be a nicer message stating that we were unable to load pycuda. The most common cause is that pycuda was installed against a previous CUDA version.
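
A minimal sketch of how the import could be wrapped to produce a clearer message (placement and wording are assumptions):

try:
    import pycuda.driver as drv
except ImportError as e:
    raise ImportError("Could not load pycuda. Check that pycuda is installed "
                      "and that it was built against the CUDA version that is "
                      "currently installed on this system. Original error: " + str(e))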

reorganize unit tests

Some functions have moved around quite a bit within the Kernel Tuner code base, but the test suite still reflects the original file structure. It would be nice if the tests could be reorganized such that it will be easier to find the tests that are responsible for a single module.

Also some tests may be considered integration tests and should perhaps be moved to a separate place (I'm talking about tests that actually call run_kernel or tune_kernel).

I'm perfectly fine with the fact that some unit tests require a GPU (and use skiptest when one is not present). I know this goes against the general unit-testing opinion that unit tests should not depend on external factors, but it's something we can't avoid when working with GPU code.

check if the list of arguments supplied matches kernel signature

This issue will be a bit more work as it requires that we inspect the code being passed to the tuner, which could be CUDA, OpenCL, or C code.

We know the name of the function to be called, so we can check whether the signature of the kernel being called actually corresponds to the number of arguments, and whether the data types match the argument list that has been passed by the user.

There is already a check_argument_list function, but that currently only checks whether the user has supplied numpy objects. The idea is to check whether the type (for example numpy.ndarray with dtype=numpy.float32) actually matches the argument to the kernel (for example float *).

Also check whether we need to do anything special in case the user supplies a code generating function rather than a string of source code or a filename.
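
A rough sketch of the kind of check this could add (the dtype map below is illustrative and far from complete):

import numpy as np

# illustrative mapping from C parameter types to the expected numpy dtype
dtype_map = {"float*": np.float32, "double*": np.float64, "int*": np.int32}

def check_argument_types(c_parameters, arguments):
    """Check user-supplied arguments against the parsed kernel signature."""
    if len(c_parameters) != len(arguments):
        raise TypeError("number of arguments does not match kernel signature")
    for c_type, arg in zip(c_parameters, arguments):
        if c_type in dtype_map and arg.dtype != dtype_map[c_type]:
            raise TypeError("argument of dtype %s does not match parameter %s"
                            % (arg.dtype, c_type))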

faster build in travis

In order to speed up the build on Travis we need to remove the sudo dependency from the .travis.yml file (i.e., set sudo: false). This allows using the faster Docker-based stack offered by Travis CI.
See the Travis blog for further explanation.

option quiet does not work

The sequential runner and tune_kernel should not print to stdout when the user passes quiet=True to tune_kernel.

expand matrix multiply tutorial

The goal is to turn this into a full tutorial for implementing a tiled version of a matrix multiply kernel and tuning thread block and tile sizes with Kernel Tuner. The tutorial should explain the basic GPU implementation and how the user can create a script to tune that kernel with Kernel Tuner.

We can start from what we already have at http://benvanwerkhoven.github.io/kernel_tuner/matrix.html, expand it, and turn it into a Jupyter Notebook.

use skiptest decorator

We currently have some tests that call a function to check if there's a GPU and a CUDA installation (or OpenCL device and pyopencl).

We should see whether we can use the skiptest decorator offered by pytest instead of calling the function ourselves.
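
For example, the check could become a module-level flag used with pytest's skipif marker (a sketch; where the flag lives is an assumption):

import pytest

try:
    import pycuda.driver
    cuda_present = True
except ImportError:
    cuda_present = False

@pytest.mark.skipif(not cuda_present, reason="pycuda not installed or no CUDA device")
def test_tune_kernel():
    ...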

check tunable parameters for special names that should not be used

The user can decide on their own names for tunable parameters, but some names will lead to errors because they are also used by the tuner internally.

These include 'time', 'times', which are used by the tuner to report output back to the user.

The parameters 'grid_size_x', 'grid_size_y', and 'grid_size_z' are inserted into the code as C-preprocessor defines; if the user also uses these names for their tunable parameters, this will most likely lead to compile errors.

The idea is to check that the tunable parameters do not contain these reserved keywords when the user calls tune_kernel, and to report a user-friendly error if they do.
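
A minimal sketch of such a check (the set of reserved names follows the issue text and may be incomplete):

RESERVED_NAMES = {"time", "times", "grid_size_x", "grid_size_y", "grid_size_z"}

def check_tune_params(tune_params):
    """Raise a user-friendly error if a tunable parameter name is reserved."""
    forbidden = RESERVED_NAMES & set(tune_params)
    if forbidden:
        raise ValueError("tunable parameter name(s) %s are reserved for internal "
                         "use, please rename them" % sorted(forbidden))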

use more advanced parsing for argument lists

Currently the argument checker only looks for the first pair of parentheses "(" and ")"; we should do something a little more advanced and try to really get the argument list. Note that just looking for the first occurrence of the kernel name is not safe enough either: the name itself could appear in multiple places, as a substring of something else, or inside a comment.

So, depending on the language, it would be best to look for something like "__kernel" "void" "kernel_name" "(" comma-separated argument list ")". Multiline regexes are probably what we need.
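
A rough sketch of such a regex (illustrative only; it still does not handle comments or more exotic signatures):

import re

def extract_argument_list(kernel_source, kernel_name):
    """Return the argument list of 'kernel_name' as a list of strings."""
    pattern = (r"(?:__kernel|__global__)?\s*void\s+" + re.escape(kernel_name)
               + r"\s*\(([^)]*)\)")
    match = re.search(pattern, kernel_source, re.DOTALL)
    if match is None:
        raise ValueError("could not find the signature of " + kernel_name)
    return [arg.strip() for arg in match.group(1).split(",")]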

continue on non-fatal cuda import fails

It may happen that the import of DynamicSourceModule fails while DynamicSourceModule will also not be used later on; theoretically, the script should not fail in that case, at least that is what is implied by:

if int(self.cc) >= 35:
    source_mod = DynamicSourceModule
else:
    source_mod = SourceModule

See #31; otherwise the ImportError message should be changed.

check if reference output matches arguments array

Users of the tuner can supply a reference output using the 'answer=' option of tune_kernel. Currently we don't perform any checks on this input, while it could lead to weird errors if the user makes a mistake.

We should check that the length of the list passed to answer matches that of the arguments passed to the kernel. For all objects in the list that are not None, we should check whether their datatype matches the datatype of the corresponding entry in the arguments list.

There is one exception: when the user supplies a custom function for output verification (using the 'verify=' optional argument of tune_kernel), we should be more flexible. A non-None value in answer means that there is output at this position; the corresponding entry in arguments should at least be an array and not a scalar, but the dtype and length of the array should not be checked, as these are then the responsibility of the user-supplied verification function.
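
A sketch of these checks (the function and argument names are made up for illustration):

import numpy as np

def check_answer(answer, arguments, has_custom_verify=False):
    """Check the 'answer' list against the kernel argument list."""
    if len(answer) != len(arguments):
        raise ValueError("length of answer does not match the argument list")
    for expected, arg in zip(answer, arguments):
        if expected is None:
            continue  # no output expected at this position
        if not isinstance(arg, np.ndarray):
            raise TypeError("answer marks an output here, but the argument is a scalar")
        # with a custom verify function, dtype and length checks are left
        # to that function, so only check the dtype in the default case
        if not has_custom_verify and arg.dtype != np.asarray(expected).dtype:
            raise TypeError("dtype of answer does not match the argument")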

Including CUDA header files in generated code

I was trying the generator system with some vector operations. In OpenCL the operators on vector types are part of the language, so no issue there, but to use them in CUDA I need to include "helper_math.h".
However, in cuda.py the whole generated code is wrapped in extern "C", and this generates a plethora of naming errors with this specific header file.
To make everything work I set no_extern_c to True in cuda.py and added an extern "C" statement around the kernel only. To my understanding, this solution would require the user to add extern "C" to every kernel, breaking compatibility with current code and making everything more cumbersome.
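
A sketch of what the described workaround looks like at the kernel-code level (the float4 kernel is illustrative):

#include "helper_math.h"   // C++ header, must stay outside extern "C"

extern "C" __global__ void vector_add(float4 *c, float4 *a, float4 *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];   // float4 operator+ comes from helper_math.h
    }
}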

Reorganize tests

The way the tests are organized still reflects one of the very early structures of Kernel Tuner. The plan is to reorganize them in a way that reflects the current structure.

simplify installation process

We can use the 'extras' feature inside setup.py to specify a number of extras that require optional dependencies.

Kernel Tuner has several optional dependencies, for example you could only be interested in tuning CUDA kernels, so it would not make sense to require OpenCL to be installed, or vice versa. You can read more about this in the install guide.

The idea is to turn these optional dependencies into extras, such that the user can install Kernel Tuner in a single command instead of first installing Kernel Tuner and then having to also install all optional dependencies.

Currently we have a file called requirements-dev.txt for all the dependencies that are required to start developing Kernel Tuner. These could also be integrated using an 'extra'.
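
A sketch of the relevant part of setup.py (the exact extras and package lists are assumptions):

from setuptools import setup

setup(
    name="kernel_tuner",
    # ... other metadata ...
    install_requires=["numpy"],
    extras_require={
        "cuda": ["pycuda"],
        "opencl": ["pyopencl"],
        "dev": ["pytest", "mock", "pylint"],  # could replace requirements-dev.txt
    },
)

Users could then run, for example, pip install kernel_tuner[cuda] in a single step.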

Please don't forget to update the install and contribution guides accordingly when you work on this issue.

Log benchmarked instances to file and restart from file

The idea is to track progress during tuning by logging benchmarked instances to a user-specified file, probably just using JSON. If the file already exists, the tuner continues from where the tuning process left off. For most optimization strategies this is very easy to implement, because the only change is not starting from an empty cache but from a pre-filled cache instead.

The brute_force strategy will need some adaptation to use a cache, and I'm not entirely sure what the semantics will be for the random_sample strategy. Perhaps it should just sample the space that is left unexplored.
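
A minimal sketch of the log-and-restart idea (the file format and function names are assumptions):

import json
import os

def load_cache(filename):
    """Resume from an earlier run if the log file already exists."""
    if os.path.isfile(filename):
        with open(filename) as fh:
            return json.load(fh)
    return {}

def log_instance(filename, cache, params, time):
    """Record one benchmarked instance in memory and on disk."""
    cache[json.dumps(params, sort_keys=True)] = time
    with open(filename, "w") as fh:
        json.dump(cache, fh)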

fully migrate from nosetests to pytest

The goal is to use only pytest for testing Kernel Tuner, but here and there the tests still use the no-longer-maintained nosetests. We want to replace these uses with their pytest equivalents as much as possible.
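
For example, a typical replacement:

# nose style
from nose.tools import assert_equal

def test_add_nose():
    assert_equal(1 + 1, 2)

# pytest style: a plain assert; pytest reports the compared values on failure
def test_add_pytest():
    assert 1 + 1 == 2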

tutorial review by jspaaks

I'm collecting some notes here pertaining to the text of the kernel_tuner documentation, including its tutorial.

The kernel_tuner code base is currently at: https://github.com/benvanwerkhoven/kernel_tuner/tree/d36feb61eab791e08132051ffefa9458df79ad80

(I'm using GitHub's blame option to be able to refer to specific line numbers in specific commits for .rst files)

  1. I would remove https://github.com/benvanwerkhoven/kernel_tuner/blame/d36feb61eab791e08132051ffefa9458df79ad80/README.rst#L47 since you already have #48
  2. in the install section, add a reference to the dependencies section so a user knows what to have installed before attempting to install the tuner itself. Speaking of dependencies, isn't it just a case of pip install -r requirements.txt? That already lists pycuda and pyopencl.
  3. my default python is 2.7 for some reason; it looks like that results in an error (this could have been related to the errors I got below)
  4. using virtual env for python3 should make things easier, e.g.
    # use virtualenv to create a python3 virtual environment called .venv-3
    # (needs virtualenv?)
    virtualenv -p python3 .venv-3 
    
    # activate the virtual environment
    . ./.venv-3/bin/activate
    
    # install requirements
    pip install --requirement requirements.txt
    
    # install the tuner itself
    pip install .
  5. the code snippet above doesn't work yet; it complains about nvcc not being present. Tried to fix that with
    sudo apt install nvidia-cuda-toolkit
    (it then still complains about missing numpy). Installing the CUDA toolkit from the repos is apparently not advisable; get the newest version instead from https://developer.nvidia.com/cuda-downloads
  6. (as discussed) remove pycuda and pyopencl from requirements.txt; possibly introduce them later, when the user knows what he/she needs. Maybe also remove the 'devDependencies': mock, nose, and pytest. Possibly use different sets of requirements, e.g.
    • requirements-testing
    • requirements-opencl
    • requirements-cuda, etc
  7. solved install problems as follows:
    # removed python2 entirely
    sudo apt install python3
    sudo apt install python3-pip
    # edited the requirements so it only includes numpy mock nose  pytest
    pip3 install -r requirements.txt --no-cache-dir
    pip3 install .
  8. about the example about heat diffusion: you mention somewhere that 'temperature' is being dispersed, but technically it's 'heat'. You'd need the heat capacity of the material in order to calculate the change in temperature after a change in heat due to dispersion.
  9. you use lower case delta to signify a distance (in time and space), AFAICT this should be upper case, the difference being that lower case delta is used to signify an infinitesimal distance, whereas your distance is 1.0 in space, and 0.225 in time.
  10. I'd benefit from a graph/schematic explaining x and y, and u at various combinations of x and y (not sure this is possible in a Jupyter notebook).
  11. I'd benefit from a derivation of the diffusion equation (maybe not in the main document but a link to some other document that uses the same symbols).
  12. I see you use %matplotlib inline somewhere. I didn't know this syntax (never worked with notebooks before). Perhaps add a comment that explains it?
  13. Depending on the intended audience, maybe provide a quick recap of threads / blocks and why they matter; or point to some of your other material?
  14. in In[3], I'd break the line field[numpy.random.randint(0,nx,size=10), numpy.random.randint(0,ny,size=10)] = 1e3 down into something like:
    nHotspots = 10
    cols = numpy.random.randint(0, nx, size=nHotspots)
    rows = numpy.random.randint(0, ny, size=nHotspots)
    tempHot = 1000.
    tempCold = 1.
    field = numpy.ones((ny, nx)).astype(numpy.float32) * tempCold
    field[rows, cols] = tempHot
    return field
  15. In In[7] and elsewhere, you do a memcpy_htod. It may be good to mention that htod stands for host-to-device (I think).

Support for C++ templated kernels

CUDA supports the use of C++ in the kernel language, including the use of templates. Currently, using pycuda, kernels are required to have extern "C" linkage, which prevents templated kernels from being used directly. Willem Jan has started work on making it possible to call such kernels directly using nvrtc.
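
For reference, a minimal example of the kind of kernel this is about; the template makes extern "C" linkage impossible:

template <typename T>
__global__ void vector_add(T *c, T *a, T *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}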

time increase after using this tool as on-line tuner

I use kernel_tuner as an online tuner for GEMM: when I want to run a GEMM, I first tune the parameters and then use Kernel Tuner's result to set them. I find that the kernel running time after tuning is larger than the time before tuning, when the parameters were simply picked from the search space at random. Do you have any suggestions for this problem?
