Code Monkey home page Code Monkey logo

project2-stream-compaction's Introduction

CUDA Stream Compaction

University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2

  • Matt Elser
  • Tested on: Tested on: Ubuntu 20.04, i3-10100F @ 3.6GHz 16GB, GeForce 1660 Super 6GB

timing data table

Main Features

This project implements an exclusive scan (performing an operation on (in this case a sum of) all previous elements along an array) and stream compaction (removing elements from an array based on a condition) using the CPU, GPU with CUDA, and the CUDA powered library Thrust.

  • all required algorithms work
  • Efficient scan and compaction are faster than CPU implementation for arrays of sufficient size
  • Radix sort has been implemented (without tiling or bitonic merge)

Time Comparison

The test setup in main.cpp has been coopted to set up repeated timings to remove noise from measurements.

scan comparison plot compact comparison plot

CPU scan and compact may be faster for smaller arrays, but the efficient GPU algorithms become more efficient at array size 2^16 for compact and 2^17 for efficient scan.

Block Size Comparison

changing the number of threads per blocks did not have a noticeable impact on timing. Minor differences are can be seen in the graph below, but almost all are around one standard deviation of another, so this may just be noise. Data was gathered by timing 100 runs of each algorithm with an array of size 2^22, see the bottom of the readme for the data table.

block comparison plot

Known limitations

  • [FIXED] The Naive implementation fails for array sizes greater than 2^25.
    • Naive was calling an inefficient number of threads, leading to higher-than needed threadIdx.x values. When multiplied to get the index this overflowed int and yielded a negative index. Logic around indices (reasonably) assumed positive values and therefore caused an out of bounds write.
  • compact scan fails for array sizes greater than 2^28 due to running out of CPU memory on the (16Gb) test machine.

Extra Credit

  • Work Efficient GPU algorithms are more efficient than CPU (for large array sizes). This was achieved by removing divergent threads from upsweep and downsweep kernel calls. Prior to the discussion in class of modifying the indexing for optimal thread scheduling, these algorithms were implemented to acheive the same end iteratively. Each layer is a separate kernel call, and each kernel call only spawns one thread per index being written to (i.e. n/2 for the first call, n/4 for the next, etc.). This does not have the same benefit of contiguous memory reads, however. Of note, this iterative method does simplify syncing of threads/kernels. No threads on any given layer are reading/writing from/to any of the indices being read/written from/to by any other thread. Since All layers are separate kernel calls (on the default stream and therefore with an implicit join/sync), so no explicit syncs are needed.
  • Radix sort has been implemented, though without tiling or bitonic merge. It sorts correctly for all array sizes (power of two and non-power of two) up until 2^27, at which point the test machine runs out of memory. Radix sorting has been validated against Thrust's sort (though the timing of the two are different by several orders of magnitude). The algorithm has not been optimized to use shared memory or contiguous memory reads, and would fail for arrays with negative values. Here is a plot comparing the timing of the radix sort implementation with Thrusts sort sort runtime comparison

block comparison table

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.