Code Monkey home page Code Monkey logo

feta2.cuh's Introduction

FETA2.cuh

FETA2.cuh stands for Fast Expression Template Algebra for CUDA/C++. It allows you to make use of the powerful and well-established Eigen C++ linear algebra library in CUDA device code, while avoiding the hassle of CUDA's memory management and its performance traps.

FETA2.cuh provides types representing ensembles (i.e., large collections) of Eigen matrices/vectors. For instance, a feta::Vector3dEnsemble behaves a lot like an std::vector<Eigen::Vector3d>, but it'll be significantly faster for GPU computations, and it greatly simplifies memory management on host, device global and device shared memory.

See feta2_demo.cu for a quick overview of FETA2's functionality.

FETA2 is particularly useful if you want to parallelise a many small but independent linear algebra operations in GPU code. For instance, if you have 1M matrices and want to multiply each by a vector, FETA2 allows you to easily perform each matrix-vector operation in a different thread on the GPU. This kind of scenario is common in scientific computing, for instance in Monte-Carlo simulations.

FETA2 is not useful if you want to accelerate a few large linear algebra operations. For this you should use dedicated libraries such as cuBLAS or Eigen's tensor module.

Why not just Eigen?

You may ask: why not just use Eigen by itself? After all, it seems pretty straightforward:

__global__ void matMul(Eigen::Matrix3d* mats, Eigen::Vector3d* vecs, int nSamples)
{
    const int i = threadIdx.x + blockIdx.x * blockDim.x;

    if (i >= nSamples) return;

    vecs[i] = mats[i] * vecs[i];
}

While this code works, it has a hidden performance trap. The issue is memory layout: each matrix/vector is stored contiguously in memory. For instance, the data in the Vector3d* is stored as xyzxyzxyzxyz.... This leads to strided accesses to global memory, which has a dramatic impact on performance.

To mitigate the issue, we should use a different memory layout, where each component of each vector is contiguous in memory: xxx...yyy...zzz.... This will lead to improved cache utilisation and greatly improved performance. You may know these two layouts as the AoS and SoA.

For vectors the solution is simple -- just store the vectors as columns in a row-major matrix. But for matrices, it's not so obvious.

Enter FETA:

__global__ void matMul(feta::Matrix3dEnsemble::GRef mats, feta::Vector3dEnsemble::GRef vecs)
{
    const feta::SampleIndex si(threadIdx.x, blockIdx.x, blockDim.x);

    if (si.global() >= vecs.size()) return;

    vecs[si] = mats[si] * vecs[si];
}

FETA's ensemble types allow you to treat them as if they were arrays of Eigen objects, but use a GPU-optimised memory layout internally. The element access operator, e.g. mats[is], gives you access to a regular Eigen matrix/vector type, just like with an array.

Here, ::GRef stands for "global reference", and indicates that the type references data that "lives" in GPU global memory. FETA also allows you to easily work with shared memory, which is why we use SampleIndex -- but that is a topic for another time.

Benchmarks

All benchmarks reported here were performed on an NVIDIA Tesla V100 16GB GPU, with 10M samples, averaged over 100k repetitions, and using double-precision arithmetic. The code for each can be found under tests/benchmarks.

"Eigen (naive)" refers to simply passing an array of Eigen objects (e.g., Eigen::Vector3d *) into the kernel. "Manual" refers to a manual but well-optimised implementation, i.e. passing a raw array of doubles and using a GPU-optimised memory layout. Finally, "FETA2" implementations use FETA2's ensemble types and then use Eigen's vector arithmetics to actually perform the operations.

  • Vector dot product out[i] = a[i].dot(b[i]) for 3D, 6D and 9D vectors.

Performance of VecDot implementations

  • Euclidean distance out[i] = (a[i] - b[i]).norm() for 3D, 6D and 9D vectors.

Performance of VecDist implementations

feta2.cuh's People

Contributors

seanbone avatar

Stargazers

Peter Arzt avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.