CLBlast: The tuned OpenCL BLAS library

CLBlast is a modern, lightweight, performant and tunable OpenCL BLAS library written in C++11. It is designed to leverage the full performance potential of a wide variety of OpenCL devices from different vendors, including desktop and laptop GPUs, embedded GPUs, and other accelerators. CLBlast implements BLAS routines: basic linear algebra subprograms operating on vectors and matrices.

Note that the CLBlast library is actively being developed, and might not be mature enough for production environments. This preview version doesn't yet support the less commonly used routines: they will be added in due time. It also lacks extensive tuning on some common OpenCL platforms: out-of-the-box performance on some devices might be poor. See below for more details (and for how to tune the library yourself).

Why CLBlast and not clBLAS or cuBLAS?

Use CLBlast instead of clBLAS:

  • When you care about achieving maximum performance.
  • When you want to be able to inspect the BLAS kernels or easily customize them to your needs.
  • When you run on exotic OpenCL devices which you need to tune yourself.
  • When you are still running on OpenCL 1.1 hardware.
  • When you value an organized and modern C++ codebase.
  • When you target Intel CPUs and GPUs or embedded devices.

Use CLBlast instead of cuBLAS:

  • When you want your code to run on devices other than NVIDIA CUDA-enabled GPUs.
  • When you want to tune for a specific configuration (e.g. rectangular matrix sizes).
  • When you sleep better if you know that the library you use is open-source.

When not to use CLBlast:

  • When you run on NVIDIA's CUDA-enabled GPUs only and can benefit from cuBLAS's assembly-level tuned kernels.
  • When you need those BLAS routines that are not yet supported by CLBlast.

Compilation and installation

The prerequisites for compilation of CLBlast are:

  • CMake version 2.8.10 or higher
  • A C++11 compiler, for example:
    • GCC 4.7.0 or newer
    • Clang 3.3 or newer
    • AppleClang 5.0 or newer
    • ICC 14.0 or newer
    • MSVC (Visual Studio) 2015 or newer
  • An OpenCL 1.1 or newer library, for example:
    • Apple OpenCL
    • NVIDIA CUDA SDK
    • AMD APP SDK
    • Intel OpenCL
    • Beignet

Furthermore, to build the (optional) correctness and performance tests, another BLAS library is needed to serve as a reference. This can be either:

  • The OpenCL BLAS library clBLAS (maintained by AMD)
  • A regular CPU Netlib BLAS library, e.g.:
    • OpenBLAS
    • BLIS
    • Accelerate

An example of an out-of-source build using a command-line compiler and make (starting from the root of the CLBlast folder):

mkdir build
cd build
cmake ..
make
sudo make install

When using Visual Studio, the project-files can be generated as follows:

mkdir build
cd build
cmake -G "Visual Studio 14 Win64" ..

A custom installation folder can be specified when calling CMake:

cmake -DCMAKE_INSTALL_PREFIX=/path/to/install/directory ..

Using the library

Like clBLAS and cuBLAS, CLBlast also requires OpenCL device buffers as arguments to its routines. This means you'll have full control over the OpenCL buffers and the host-device memory transfers. CLBlast's API is designed to resemble clBLAS's C API as much as possible, requiring little integration effort in case clBLAS was previously used. Using CLBlast starts by including the C++ header:

#include <clblast.h>

Or alternatively the plain C version:

#include <clblast_c.h>

Afterwards, any of CLBlast's routines can be called directly: there is no need to initialize the library. The available routines and the required arguments are described in the clblast.h include file and the included API documentation. Additionally, a couple of stand-alone example programs are included in samples/.
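As a rough illustration of the buffer, offset and leading-dimension argument style used by clBLAS-like APIs, the following is a minimal CPU-only sketch of what a single-precision GEMM call computes. The function name ReferenceGemm is hypothetical and is not part of CLBlast; it uses host vectors instead of OpenCL device buffers, and it assumes column-major, non-transposed matrices. The actual routine signatures are the ones listed in clblast.h.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, NOT part of CLBlast: computes
//   C = alpha * A * B + beta * C
// for column-major, non-transposed matrices (A is m x k, B is k x n,
// C is m x n). Each matrix is described by a buffer, an offset into that
// buffer and a leading dimension (ld).
void ReferenceGemm(std::size_t m, std::size_t n, std::size_t k,
                   float alpha,
                   const std::vector<float>& a, std::size_t a_offset, std::size_t a_ld,
                   const std::vector<float>& b, std::size_t b_offset, std::size_t b_ld,
                   float beta,
                   std::vector<float>& c, std::size_t c_offset, std::size_t c_ld) {
  // In column-major storage the leading dimension must span a full column.
  assert(a_ld >= m && b_ld >= k && c_ld >= m);
  for (std::size_t col = 0; col < n; ++col) {
    for (std::size_t row = 0; row < m; ++row) {
      float acc = 0.0f;
      for (std::size_t i = 0; i < k; ++i) {
        acc += a[a_offset + i * a_ld + row] * b[b_offset + col * b_ld + i];
      }
      float& out = c[c_offset + col * c_ld + row];
      out = alpha * acc + beta * out;
    }
  }
}
```

A device-side call follows the same shape, except that the matrices live in OpenCL buffers that you allocate and fill yourself, and the work is submitted to an OpenCL command queue.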

Using the tuners (optional)

The CLBlast library will be tuned in the future for the most commonly used OpenCL devices. This pre-release of CLBlast is only tuned for a limited number of devices, in particular those with the following CL_DEVICE_NAME values:

  • NVIDIA GPUs:
    • GeForce GTX 480
    • GeForce GTX 680
    • GeForce GTX 750 Ti
    • GeForce GTX 980
    • GeForce GTX Titan
    • GeForce GTX Titan X
    • Tesla K20m
    • Tesla K40m
  • AMD GPUs:
    • Tahiti
    • Hawaii
    • Pitcairn
    • R9 M370X
  • Intel GPUs:
    • Iris
    • Iris Pro
  • Intel CPUs:
    • Core i5-6200U
    • Core i7-3770K
    • Core i7-5930K
  • Other devices:
    • ARM Mali-T628 GPU
    • Intel MIC

If your device is not (yet) on this list or if you want to tune CLBlast for specific parameters (e.g. rectangular matrix sizes), you should compile the library with the optional tuners:

cmake -DTUNERS=ON ..

Note that CLBlast's tuners are based on the CLTune auto-tuning library, which has to be installed separately (version 1.7.0 or higher). CLTune is available from GitHub.

Compiling with -DTUNERS=ON will generate a number of tuners, each named clblast_tuner_xxxxx, in which xxxxx corresponds to a .opencl kernel file as found in src/kernels. These kernels correspond to routines (e.g. xgemm) or to common pre-processing or post-processing kernels (copy and transpose). Running such a tuner will test a number of parameter-value combinations on your device and report which one gave the best performance. Running make alltuners runs all tuners for all precisions in one go. You can set the default device and platform for alltuners by setting the DEFAULT_DEVICE and DEFAULT_PLATFORM environment variables before running CMake.
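The idea of testing parameter-value combinations and keeping the best one can be sketched as a plain exhaustive search. This is only an illustration: the Config fields and the FindBestConfig function are hypothetical names, not CLBlast's real tuning parameters, and the actual tuners rely on CLTune, whose search strategy may differ. The measure callback stands in for running a kernel on the device and returning its execution time.

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical parameter set; the field names are illustrative only.
struct Config {
  std::size_t work_group_size;
  std::size_t vector_width;
};

// Exhaustive search: evaluate every combination of the candidate values and
// keep the one with the lowest measured cost (e.g. kernel time in ms).
Config FindBestConfig(const std::vector<std::size_t>& work_group_sizes,
                      const std::vector<std::size_t>& vector_widths,
                      const std::function<double(const Config&)>& measure) {
  Config best{0, 0};
  double best_cost = std::numeric_limits<double>::infinity();
  for (std::size_t wgs : work_group_sizes) {
    for (std::size_t vw : vector_widths) {
      const Config candidate{wgs, vw};
      const double cost = measure(candidate);
      if (cost < best_cost) {
        best_cost = cost;
        best = candidate;
      }
    }
  }
  return best;
}
```

Passing the measurement as a callback keeps the search itself independent of how a kernel is compiled and timed, so the same loop can be exercised with a synthetic cost function.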

The tuners output a JSON-file with the results. The best results need to be added to include/internal/database/xxxxx.h in the appropriate section. However, this can be done automatically based on the JSON-data using a Python script in scripts/database/database.py. If you want the found parameters to be included in future releases of CLBlast, please attach the JSON files to the corresponding issue on GitHub or email the main author.

In summary, tuning the entire library for your device can be done as follows (starting from the root of the CLBlast folder):

mkdir build
cd build
cmake -DTUNERS=ON ..
make
make alltuners
python ../scripts/database/database.py . ..
make

Compiling the correctness and performance tests (optional)

To make sure CLBlast is working correctly on your device (recommended), compile with the tests enabled:

cmake -DTESTS=ON ..

Afterwards, executables in the form of clblast_test_xxxxx are available, in which xxxxx is the name of a routine (e.g. xgemm). Note that CLBlast is best tested against clBLAS for correctness. If the library clBLAS is not installed on your system, it will use a regular CPU BLAS library to test against. If both are present, setting the command-line option -clblas 1 or -cblas 1 will select the library to test against for the clblast_test_xxxxx executables.
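Testing against a reference library comes down to comparing two result buffers element by element within a tolerance, since floating-point results from different implementations rarely match bit for bit. The helper below is a minimal sketch of such a check; the name AllClose and the tolerance values are assumptions for illustration and are not taken from the CLBlast test suite.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns true when every element of `result` matches
// `reference` within a relative tolerance, with a small absolute floor so
// that values near zero can still compare equal.
bool AllClose(const std::vector<float>& result,
              const std::vector<float>& reference,
              float rel_tol = 1e-4f, float abs_tol = 1e-6f) {
  if (result.size() != reference.size()) { return false; }
  for (std::size_t i = 0; i < result.size(); ++i) {
    const float diff = std::fabs(result[i] - reference[i]);
    const float limit = abs_tol + rel_tol * std::fabs(reference[i]);
    // Written as !(diff <= limit) so that a NaN difference also fails.
    if (!(diff <= limit)) { return false; }
  }
  return true;
}
```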

With the -DTESTS=ON flag, additional performance tests are compiled. These come in the form of client executables named clblast_client_xxxxx, in which xxxxx is the name of a routine (e.g. xgemm). These clients take a bunch of configuration options and directly run CLBlast in a head-to-head performance test against clBLAS and/or a CPU BLAS library.

Performance remarks

The CLBlast library provides pre-tuned parameter-values for a number of OpenCL devices. If your device is not among these, then out-of-the-box performance might be poor. Even if the device is included, performance might be poor in some cases: the preview version is not thoroughly tested for performance yet. See above under Using the tuners to find out how to tune for your device.

The folder doc/performance contains some PDF files with performance results on tested devices. Performance is compared against a tuned version of the clBLAS library. The graphs of the level-3 routines (Xgemm, Xsymm, Xsyrk) show the strong points of CLBlast:

  • The library reaches a high peak performance for large matrix sizes, in some cases a factor of 2 higher than clBLAS.
  • The performance for non-power-of-2 sizes (e.g. 1000) is roughly equal to that for power-of-2 sizes (e.g. 1024). This is not the case for clBLAS, which sometimes shows a drop of a factor of 2.
  • The performance is also constant for different layouts and transpose options. Again, this is not the case for clBLAS.

The graphs also show the current weak points of CLBlast: for small sizes the benefit is minimal or non-existent, and for some specific configurations clBLAS is still faster.

These graphs can be generated automatically on your own device. First, compile CLBlast with the tests enabled. Then, make sure your installation of the reference clBLAS is performance-tuned by running the tune executable. Finally, run one of the graph-scripts found in test/performance/graphs using R. For example, to generate the Xgemm PDF on device 1 of platform 0:

Rscript path/to/test/performance/graphs/xgemm.r 0 1
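When comparing timings across matrix sizes such as 1000 and 1024, it helps to normalize the run time by the amount of work. The sketch below assumes the common convention of counting 2·m·n·k floating-point operations for a GEMM and reporting GFLOPS; the text above does not state which metric the graphs use, and the function names are hypothetical.

```cpp
#include <cstddef>

// Floating-point operations in one GEMM call: each of the m*n output
// elements needs k multiplications and k additions.
double GemmFlops(std::size_t m, std::size_t n, std::size_t k) {
  return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
         static_cast<double>(k);
}

// Throughput in GFLOPS for a measured run time in seconds.
double Gflops(double flops, double seconds) {
  return flops / (seconds * 1.0e9);
}
```

For example, a 1000x1000x1000 GEMM that takes 2 ms corresponds to roughly 1000 GFLOPS under this convention.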

Supported routines

CLBlast is in active development but already supports almost all the BLAS routines. The supported routines are marked with '✔' in the following tables. Routines marked with '-' do not exist: they are not part of BLAS at all.

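Level-1   S  D  C  Z
xSWAP     ✔  ✔  ✔  ✔
xSCAL     ✔  ✔  ✔  ✔
xCOPY     ✔  ✔  ✔  ✔
xAXPY     ✔  ✔  ✔  ✔
xDOT      ✔  ✔  -  -
xDOTU     -  -  ✔  ✔
xDOTC     -  -  ✔  ✔
xNRM2     ✔  ✔  ✔  ✔
xASUM     ✔  ✔  ✔  ✔
IxAMAX    ✔  ✔  ✔  ✔

Level-2   S  D  C  Z
xGEMV     ✔  ✔  ✔  ✔
xGBMV     ✔  ✔  ✔  ✔
xHEMV     -  -  ✔  ✔
xHBMV     -  -  ✔  ✔
xHPMV     -  -  ✔  ✔
xSYMV     ✔  ✔  -  -
xSBMV     ✔  ✔  -  -
xSPMV     ✔  ✔  -  -
xTRMV     ✔  ✔  ✔  ✔
xTBMV     ✔  ✔  ✔  ✔
xTPMV     ✔  ✔  ✔  ✔
xGER      ✔  ✔  -  -
xGERU     -  -  ✔  ✔
xGERC     -  -  ✔  ✔
xHER      -  -  ✔  ✔
xHPR      -  -  ✔  ✔
xHER2     -  -  ✔  ✔
xHPR2     -  -  ✔  ✔
xSYR      ✔  ✔  -  -
xSPR      ✔  ✔  -  -
xSYR2     ✔  ✔  -  -
xSPR2     ✔  ✔  -  -

Level-3   S  D  C  Z
xGEMM     ✔  ✔  ✔  ✔
xSYMM     ✔  ✔  ✔  ✔
xHEMM     -  -  ✔  ✔
xSYRK     ✔  ✔  ✔  ✔
xHERK     -  -  ✔  ✔
xSYR2K    ✔  ✔  ✔  ✔
xHER2K    -  -  ✔  ✔
xTRMM     ✔  ✔  ✔  ✔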
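The S, D, C and Z columns in the tables above follow the usual BLAS precision prefixes: single-precision real, double-precision real, single-precision complex and double-precision complex. The sketch below shows that mapping with a hypothetical CPU reference for AXPY (y = alpha * x + y) written once as a template; ReferenceAxpy is an illustrative name and not a CLBlast routine.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// BLAS precision prefix -> element type:
//   S -> float                 D -> double
//   C -> std::complex<float>   Z -> std::complex<double>

// Hypothetical reference AXPY, templated over the element type so that one
// definition covers the S, D, C and Z variants.
template <typename T>
void ReferenceAxpy(T alpha, const std::vector<T>& x, std::vector<T>& y) {
  // Only the overlapping range is updated if the lengths differ.
  const std::size_t n = x.size() < y.size() ? x.size() : y.size();
  for (std::size_t i = 0; i < n; ++i) {
    y[i] += alpha * x[i];
  }
}
```

Routines marked '-' for a precision, such as xDOT for C and Z, simply have no counterpart of that type in BLAS, which is why the complex dot products appear as the separate xDOTU and xDOTC rows.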

In addition, some non-BLAS routines are also supported by CLBlast. They are experimental and should be used with care:

Additional  S  D  C  Z
xSUM        ✔  ✔  ✔  ✔
IxMAX       ✔  ✔  ✔  ✔
IxMIN       ✔  ✔  ✔  ✔

Some BLAS routines are not supported yet by CLBlast. They are shown in the following table:

Unsupported  S  D  C  Z
xROTG              -  -
xROTMG             -  -
xROT               -  -
xROTM              -  -
xTRSV
xTBSV
xTPSV
xTRSM

Contributing

Contributions are welcome in the form of tuning results for previously untested OpenCL devices. Furthermore, merge requests are welcome as long as they contain unit additions or modifications. They should also follow the CLBlast coding style, which is based on the Google C++ style guide and the Effective C++ books by Scott Meyers.

The contributing authors (code, pull requests, testing) so far are:

Tuning and testing on a variety of OpenCL devices was made possible by:

Support us

This project started in March 2015 as an evenings and weekends free-time project next to a full-time job for Cedric Nugteren. If you are in the position to support the project by OpenCL-hardware donations or otherwise, please find contact information on the website of the main author.

To-do list before release of version 1.0

  • Add half-precision routines (e.g. HGEMM)
  • Add API documentation
