
htm.cuda

Implementation of HTM Spatial Pooler algorithm in CUDA

This is my CUDA implementation of Numenta's Spatial Pooling algorithm.

My goal in starting this project was to create an efficient CUDA backend for the entire Numenta Platform for Intelligent Computing (NuPIC). I am open to starting an OSS project around it; if you are interested, do not hesitate to contact me!

Details of CUDA implementation

Each SP column is mapped to a single thread. Unlike the original sequential version, inhibition is performed only within a CUDA block of 1024 threads (some newer devices allow 2048 threads per block). This can be further improved in a future version, e.g. by mapping multiple columns to a single thread. It also means that the size of the spatial pooler is limited by the maximum number of threads your device can launch per kernel and by the amount of global memory.

Initial permanences, potential connections and boost factors are initialized on the device; the only data transferred is the input vector. Transfer from pinned memory is faster, but pinned memory is incomparably slower to allocate. The potential connections are stored as indices relative to the start of the input slice belonging to the corresponding CUDA block, so the overlap can be computed simply as a block-wise CSR matrix-vector multiplication (with boost factors folded in), with the input values stored in shared memory. According to this paper, this should be a reasonably efficient choice.

Future optimization

Although most methods make extensive use of parallel, block-wise reduction to speed up their execution, some further optimization is possible: adaptSynapses() exhibits some thread divergence (though I do not know how this could be avoided), and inhibitColumns() could use a more efficient sorting method (this is probably not a major bottleneck, however).

Performance

This code performs significantly faster than the sequential C++ implementation and can handle input of almost unlimited size. For example, with an input of 131072 neurons and 32768 SP columns, the kernel performs one iteration in 28.536 ms on a (rather obsolete) GeForce GTX 670, while the sequential implementation handles an input and column count of half that size (it does not allow larger inputs) in 268.98 ms, a 9.43x speedup even on the larger input.

Requirements

  • NVIDIA GPU with CUDA compute capability >= 2.0 (most Tesla products and newer)
  • CUDA Toolkit installed
  • A C++ compiler (such as GCC)

You can check the compute capability of your device in NVIDIA's list of CUDA-enabled GPUs. The newest version of the CUDA Toolkit is available for download from NVIDIA's website.

How to build

To build the main program, simply type

nvcc HelloSP.cu -o HelloSP -std=c++11

Building the unit tests is done analogously:

nvcc UnitSP.cu -o UnitSP -std=c++11

Unit tests

The unit tests are run simply by executing the compiled binary:

./UnitSP

Acknowledgement

The initial work was funded by the Grant agency of CTU Prague under project SGS16/231/OHK3/3T/13.

Contributors

breznak, mirgee


Open issues

Add documentation

Add to the README:

  • requirements
  • steps to build
  • high-level documentation of this work
    • differences from the C++ version / limitations (only local inhibition), forced separation of columns into "blocks" (how many, ...)
    • high-level description of what it does, CUDA-wise (loads data in a batch, executes each "block" in parallel, ...)
  • comments in SpatialPooler.cu and HelloSP.cu explaining why there are "duplicate" methods (getPermanences); document the CUDA implementation a bit
  • mention performance gains
  • notes on future work (what else can be optimized, in the SP, in HTM, ...)

C++ wrapper for CUDA class SP

The aim is to run the CUDA SP seamlessly from C++, ideally as a drop-in replacement.

  • integrate HelloSP and SpatialPooler.cu into one file
    • there should be a SpatialPooler.cu and a main.cu (main only calls the initialization and compute of SP.cu)
  • CMake detects CUDA, nvcc and NVIDIA hardware availability
    • CMake compiles using nvcc and runs the tests
  • C++ wrapper
    • wrap all CUDA code in #ifdef HAS_CUDA
    • SpatialPoolerCuda.hpp wraps the implemented .cu code in a C++ API (minimal initialize and compute)
    • SpatialPoolerCuda.cpp implements simple buffering
      • since the C++ compute is called with a single datum, while the CUDA version is called with a block of data
    • CMake builds the C++ wrapper
