
SVD Solver

GPGPU-based SVD solver for large dense matrices

Abstract

Singular value decomposition (SVD) is a fundamental factorization in linear algebra used in wide-ranging scientific and computer science applications. This report presents a comparison of three optimized computational models for computing the SVD of large dense matrices, with the following corresponding implementations: (1) cache-optimized single-core; (2) parallelized multi-core, using a 6-core 2.6 GHz Intel Xeon processor; (3) GPGPU-accelerated, using an NVIDIA Tesla V100 GPU. The proposed implementations show progressive and substantial performance improvements over a naive baseline single-core implementation, and the multi-core and GPU implementations show marked speedups over the optimized single-core implementation.

Reproducibility

This section details how to replicate the results presented in this report. These instructions are also available in the Final Report document. To generate performance results, there are two main executable binaries:

  1. CPU SVD Benchmarks (all single-core and multi-core implementations);
  2. CUDA-1 and CUDA-2 SVD Benchmarks (two-stage bidiagonal reduction).

Note: CUDA and the NVIDIA drivers, as well as the nvcc compiler, must be installed to build executables from the .cu source files. OpenMP is required for the multi-core SVD.

Benchmarks

CPU (Single-core and Multi-core) Benchmarks

To compile the (C++) benchmark source code, which includes OpenMP directives for parallel computing, use the following command-line compiler invocation:

c++ -std=c++17 -O3 -fopenmp -mavx -march=native svd_cpu.cpp -o svd_cpu

The SVD CPU Benchmark tool accepts the following options:

  • [base|singlecore|multicore|diagonal]: Computational model.
    • base: Golub-Kahan bidiagonal reduction
    • singlecore: Blocked (panel) bidiagonal reduction (requires block size)
    • multicore: Tiled bidiagonal reduction (requires block size)
    • diagonal: QR diagonalization
  • [<int> Step size]: The increase in matrix dimensions with each test iteration.
  • [<int> Number of steps]: Number of times to increase the matrix size (i.e., how many matrix sizes are tested).
  • [<int> Number of test instances]: How many runs to average for a given matrix size.
  • [<int> Block size]: Band or tile size for the reduction; a value of 32 has been used for all multi-core benchmark tests in this report. The band size should evenly divide the matrix dimensions.

Benchmarks automatically write .csv files to a directory called data, which must be created in the main application directory before running.
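
For clarity on how these parameters interact, the sketch below shows the shape of a benchmark sweep: the matrix dimension grows by the step size at each step, and the time for each size is averaged over the number of test instances. This is only an illustration of the parameter semantics; run_svd is a hypothetical placeholder, not a function from the source code.

#include <chrono>
#include <cstdio>

// Illustrative benchmark sweep (not the actual driver code): square matrices
// of dimension step_size, 2*step_size, ..., num_steps*step_size are timed,
// and each timing is averaged over num_instances runs.
void benchmark_sweep(int step_size, int num_steps, int num_instances, int block_size) {
    (void)block_size;  // the band/tile size would be forwarded to the reduction under test
    for (int s = 1; s <= num_steps; ++s) {
        const int n = s * step_size;               // matrix dimension for this step
        double total = 0.0;
        for (int i = 0; i < num_instances; ++i) {
            auto t0 = std::chrono::steady_clock::now();
            // run_svd(n, block_size);             // placeholder for the SVD implementation under test
            auto t1 = std::chrono::steady_clock::now();
            total += std::chrono::duration<double>(t1 - t0).count();
        }
        std::printf("N = %d | %g sec\n", n, total / num_instances);
    }
}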

Golub-Kahan Algorithm
  • Example: ./svd_cpu base 320 10 1

Optimized Single-core SVD (One-Stage Bidiagonal Reduction)
  • Example: ./svd_cpu singlecore 320 10 1 8

Multi-core SVD (Two-Stage Bidiagonal Reduction)
  • Must include a band size.
  • Example: ./svd_cpu multicore 320 10 1 32

QR Diagonalization
  • Example: ./svd_cpu diagonal 320 10 1

GPU (CUDA) Benchmarks

To compile the CUDA source code for the GPU-accelerated program (e.g., on the UVic Azure instance) with the nvcc command-line compiler, use the following commands for the CUDA-1 and CUDA-2 versions, respectively:

  • nvcc -O3 svd_cuda_1.cu -o svd_cuda1
  • nvcc -O3 svd_cuda_2.cu -o svd_cuda2

The SVD CUDA Benchmark tool accepts the following options:

  • [benchmark]: Run benchmark mode.
    • [<int> Block size]: Band or tile size for the reduction; a value of 32 has been used for the GPU benchmark tests in this report. The band size should evenly divide the matrix dimensions.
    • [<int> Step size]: The increase in matrix dimensions with each test iteration.
    • [<int> Number of steps]: Number of times to increase the matrix size (i.e., how many matrix sizes are tested).
    • [<int> Number of test instances]: How many runs to average for a given matrix size.
  • [check]: Run a correctness check.
    • [<int> 64|512|1024]: Check correctness for a test matrix of size 64, 512, or 1024.

CUDA GPU-accelerated SVD (Two-Stage Bidiagonal Reduction) - Version 1

To run benchmarks similar to those presented in this report, the baseline and the CUDA executions must be run separately. To replicate the exact parameters, use the following command for the baseline (a single test instance):

./svd_cpu multicore 320 10 1 32

CUDA GPU-accelerated SVD (Two-Stage Bidiagonal Reduction) - Version 2

As above, the baseline and the CUDA executions must be run separately. To replicate the exact parameters, use the following commands (for a single test instance):

./svd_cuda1 benchmark 32 320 12 1
./svd_cuda2 benchmark 32 320 12 1

Benchmarks save their data in CSV format to the data folder; these files can be loaded to generate plots (see below).

Plot Results

To generate plots of the results, open and run the Python notebook generate_results_plots.ipynb found in the repository; it loads the results produced by the baseline and CUDA benchmarks.

Correctness Tests

Commands are also available to verify the correctness of the computed band reductions against precomputed reference results. Test matrices of size 64x64, 512x512, and 1024x1024, with corresponding reference band and bidiagonal reductions, can be checked using the following commands, respectively:

./svd_cuda1 check 64
./svd_cuda2 check 64

./svd_cuda1 check 512
./svd_cuda2 check 512

./svd_cuda1 check 1024
./svd_cuda2 check 1024

Note that a band size of 4 is used for these tests.
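
As the example output below shows, the CUDA and baseline band reductions can differ in the signs of individual rows (a normal consequence of Householder reflector sign conventions), so a natural correctness check compares the two results element-wise up to sign with a small floating-point tolerance. The following is an illustrative sketch of such a check; bands_match is a hypothetical helper, not the repository's test routine.

#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative comparison of two band-reduction results, ignoring sign
// differences that arise from Householder reflector conventions.
bool bands_match(const std::vector<double>& a, const std::vector<double>& b,
                 double tol = 1e-3) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (std::fabs(std::fabs(a[i]) - std::fabs(b[i])) > tol) return false;
    }
    return true;
}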

Matrix parameters

  • cols: n columns
  • rows: m rows
  • min_val: lower bound for matrix values
  • max_val: upper bound for matrix values
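
These parameters control how test matrices are generated: an m x n matrix filled with values between min_val and max_val (the 8 x 8 preview in the example output shows entries roughly in the range 1-4). The following is a minimal sketch of such a generator, assuming uniform sampling; make_random_matrix is a hypothetical name, not the repository's API.

#include <random>
#include <vector>

// Hypothetical generator for an m x n test matrix with uniform values in
// [min_val, max_val], stored in row-major order.
std::vector<double> make_random_matrix(int rows, int cols,
                                       double min_val, double max_val) {
    std::mt19937 gen(42);  // fixed seed so test matrices are reproducible
    std::uniform_real_distribution<double> dist(min_val, max_val);
    std::vector<double> a(static_cast<std::size_t>(rows) * cols);
    for (auto& x : a) x = dist(gen);
    return a;
}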

Files

  • svd_cpu.cpp: Routine to run the CPU (baseline, single-core, and multi-core) benchmarks.
  • svd_cuda.cpp: Routine to run the CUDA (GPGPU) benchmarks and correctness tests.
  • matrix.h: Matrix data structure; class and methods for 2D m x n matrices (a minimal sketch of such a structure appears after this list).
  • matrix_gpu.h: GPU-enabled matrix data structure; class and methods for 2D m x n matrices.
  • svd_cuda_1.cu: GPU SVD algorithm implementation (version 1). Applies a two-step SVD reduction of the m x n matrix A toward the form A = U.Sigma.VT, where the columns of U are orthonormal, the rows of VT are orthonormal, and Sigma is an m x n diagonal matrix whose non-negative entries are the singular values of A. Kernels are two-stage: (I) dense -> band matrix; (II) band matrix -> bidiagonal.
  • svd_cuda_2.cu: GPU SVD algorithm implementation (version 2). Applies the same two-stage reduction as svd_cuda_1.cu: (I) dense -> band matrix; (II) band matrix -> bidiagonal.
  • svd_serial.h: Serial SVD algorithm implementation. Applies the same two-step SVD reduction on a single core.
  • svd_parallel.h: Parallel (multi-core) SVD algorithm implementation. Applies the same two-step SVD reduction with two-stage kernels: (I) dense -> band matrix; (II) band matrix -> bidiagonal.
  • benchmarking.cpp: Benchmarking routine adapted from the CSC586C (Sean Chester) benchmarking application.
  • tests.cpp: Generates a unit test suite for all tests using the Catch2 header-only library.
  • timing.h: Timing library used to benchmark and comparatively analyse the different implementation approaches.
  • utils.h: Utility functions.
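
To give a sense of the data layout these headers provide, the following is a minimal sketch of a row-major m x n matrix structure in the spirit of matrix.h; the member and method names are illustrative assumptions, not the actual interface.

#include <cstddef>
#include <vector>

// Illustrative row-major m x n matrix; the actual class in matrix.h may
// differ in layout and interface.
struct Matrix {
    std::size_t m = 0, n = 0;        // rows, columns
    std::vector<double> data;        // m * n elements, row-major

    Matrix(std::size_t rows, std::size_t cols)
        : m(rows), n(cols), data(rows * cols, 0.0) {}

    double&       operator()(std::size_t i, std::size_t j)       { return data[i * n + j]; }
    const double& operator()(std::size_t i, std::size_t j) const { return data[i * n + j]; }
};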

Public Methods

  • brd: Bidiagonal reduction (Golub-Kahan algorithm). Computes the bidiagonal matrix B = U1'.A.V1 using Householder transformations to reduce the upper and lower triangles (a minimal sketch of this procedure appears after this list).

    • Input: Matrix A (m x n matrix)
    • Output: Matrix B (bidiagonal m x n matrix), Matrix U1 (left-side orthogonal matrix), Matrix V1 (right-side orthogonal matrix)
  • block_brd: Blocked (panel) bidiagonal reduction. Computes the bidiagonal matrix B = U1'.A.V1 using Householder transformations to reduce the upper and lower triangles.

    • Input: Matrix A (m x n matrix), t (transpose flag)
    • Output: Matrix B (bidiagonal m x n matrix), Matrix U1 (left-side orthogonal matrix), Matrix V1 (right-side orthogonal matrix)
  • brd_p1: Dense-to-band reduction (Großer and Benedikt, 1999). Completes Stage I of the two-stage process (dense matrix -> banded matrix). Computes the banded matrix B = U1'.A.V1 by applying QR and LQ transformations to the upper and lower diagonals.

    • Input: Matrix A (m x n matrix)
    • Output: Matrix B (banded m x n matrix), Matrix U1 (left-side orthogonal matrix), Matrix V1 (right-side orthogonal matrix)
  • brd_p2: Band-to-bidiagonal reduction (Haider et al., 2013). Completes Stage II of the two-stage process (banded matrix -> bidiagonal matrix). Computes the bidiagonal matrix B = U1'.A.V1 by applying Householder transformations to the upper and lower diagonals of the banded matrix.

    • Input: Matrix B (banded m x n matrix)
    • Output: Matrix B (bidiagonal m x n matrix), Matrix U1 (left-side orthogonal matrix), Matrix V1 (right-side orthogonal matrix)
  • cuda_brd_p1: Dense-to-band reduction on the GPU (Gates et al., 2018). Completes Stage I of the two-stage process (dense matrix -> banded matrix). Computes the banded matrix B = U1'.A.V1 by applying QR and LQ transformations to the upper and lower diagonals.

    • Input: Matrix A (m x n matrix)
    • Output: Matrix B (banded m x n matrix), Matrix U1 (left-side orthogonal matrix), Matrix V1 (right-side orthogonal matrix)
  • diag_reduce: Convergent application of the "chase-the-bulge" algorithm, adapted from "Accurate Singular Values of Bidiagonal Matrices" (Demmel and Kahan, 1990). The algorithm begins and ends with vectors d and e, representing the diagonal and superdiagonal of a bidiagonal matrix; d has length n and e has length n-1.

    • Input: B (m x n bidiagonal matrix)
    • Output: Sigma (diagonal of singular values)
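
The one-stage routines (brd and block_brd) follow the classic Golub-Kahan scheme: at column k, a Householder reflector applied from the left zeroes the column below the diagonal, and a reflector applied from the right zeroes the row beyond the superdiagonal. The following is a minimal, unblocked sketch of that scheme, assuming a plain row-major array with m >= n; it does not accumulate U1 or V1 and is not the repository's implementation.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative unblocked Golub-Kahan bidiagonalization: reduces a row-major
// m x n matrix A (m >= n) in place to upper bidiagonal form using Householder
// reflectors applied from the left and right.
void golub_kahan_bidiagonalize(std::vector<double>& A, std::size_t m, std::size_t n) {
    auto at = [&](std::size_t i, std::size_t j) -> double& { return A[i * n + j]; };
    std::vector<double> v(std::max(m, n));

    // Builds the Householder vector for x = (get(0), ..., get(len-1)) with
    // v[0] = 1, and returns beta so that (I - beta*v*v^T)*x is zero below
    // its first entry.
    auto house = [&](std::size_t len, auto get) -> double {
        double sigma = 0.0;
        for (std::size_t i = 1; i < len; ++i) sigma += get(i) * get(i);
        v[0] = 1.0;
        for (std::size_t i = 1; i < len; ++i) v[i] = get(i);
        if (sigma == 0.0) return 0.0;                  // nothing to annihilate
        const double x0 = get(0);
        const double mu = std::sqrt(x0 * x0 + sigma);
        const double v0 = (x0 <= 0.0) ? (x0 - mu) : (-sigma / (x0 + mu));
        const double beta = 2.0 * v0 * v0 / (sigma + v0 * v0);
        for (std::size_t i = 1; i < len; ++i) v[i] /= v0;
        return beta;
    };

    for (std::size_t k = 0; k < n; ++k) {
        // Left reflector: zero out column k below the diagonal.
        double beta = house(m - k, [&](std::size_t i) { return at(k + i, k); });
        for (std::size_t j = k; j < n; ++j) {
            double s = 0.0;
            for (std::size_t i = 0; i < m - k; ++i) s += v[i] * at(k + i, j);
            s *= beta;
            for (std::size_t i = 0; i < m - k; ++i) at(k + i, j) -= s * v[i];
        }
        // Right reflector: zero out row k beyond the superdiagonal.
        if (k + 2 < n) {
            beta = house(n - k - 1, [&](std::size_t j) { return at(k, k + 1 + j); });
            for (std::size_t i = k; i < m; ++i) {
                double s = 0.0;
                for (std::size_t j = 0; j < n - k - 1; ++j) s += v[j] * at(i, k + 1 + j);
                s *= beta;
                for (std::size_t j = 0; j < n - k - 1; ++j) at(i, k + 1 + j) -= s * v[j];
            }
        }
    }
}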

Example Output

Benchmark: CUDA-1 Band Reduction
	Band size: 32
	Step size: 320
	Number of steps: 10
	Number of test instances: 1

Average time per CUDA-1 Band Reduction
N = 320 | 1.64687 sec
N = 640 | 0.594505 sec
N = 960 | 1.31839 sec
N = 1280 | 2.37395 sec
N = 1600 | 3.82875 sec
N = 1920 | 5.87805 sec
N = 2240 | 7.97496 sec
N = 2560 | 11.0635 sec
N = 2880 | 15.4251 sec
N = 3200 | 22.0778 sec


Checking correctness ...
Reading file: /data/spencerrose/test_float_512_512.bin

-------
Matrix capacity: 512 [262144 elements; m = 512, n = 512]
Matrix overhead: 12352b
Size of Payload: 1048576b
Matrix total size: 1060928b

 1.8647  1.4094  2.0559  2.1145  2.2026  3.8622  3.0560  1.3343
 2.1022  3.8643  1.6300  2.2606  2.4072  1.7867  2.4081  3.9153
 1.8043  2.1092  3.0419  1.6380  2.1340  1.5234  1.6438  3.8442
 2.9696  2.1960  2.4494  3.2856  3.5322  3.6105  3.5070  2.2075
 3.5272  1.6055  1.1876  1.4772  1.0043  2.4362  1.6640  1.9256
 2.7712  2.8750  1.7220  2.4514  3.1797  3.3102  3.5173  2.0833
 3.2625  3.0375  3.9408  1.1109  1.5304  1.0457  3.0382  2.5855
 1.4313  2.9194  2.5709  1.7505  3.9874  1.7198  2.4379  3.5617


CUDA-1 Test (Band):

-------
Matrix capacity: 512 [262144 elements; m = 512, n = 512]
Matrix overhead: 12352b
Size of Payload: 1048576b
Matrix total size: 1060928b

 -70.001366  -64.015396  -63.014526  -62.436245  1419.793213  0.000008  0.000004  0.000008  0.000004  ...  0.000008
 0.000001  35.563316  21.200325  16.327179  -380.614532  -27.962111  0.000000  0.000000  0.000002   ...  0.000000
 0.000001  -0.000001  31.628269  11.523483  -256.873993  -1.875523  -26.901138  0.000000  0.000000   ...  0.000000
 0.000001  0.000000  -0.000000  -30.793488  166.446198  0.998970  1.384160  25.481604  0.000000   ...  -0.000001
 0.000001  -0.000001  -0.000000  -0.000000  294.396088  0.871984  0.063367  1.140689  -26.300608  ...  0.000000
 0.000000  0.000001  0.000000  0.000000  0.000002  -25.009996  -2.588540  0.323188


Baseline Test (Band):

-------
Matrix capacity: 512 [262144 elements; m = 512, n = 512]
Matrix overhead: 12352b
Size of Payload: 1048576b
Matrix total size: 1060928b

 70.001434  64.015419  63.014412  62.436203  -1419.794922  0.000000  0.000000  0.000000   ...  0.000004
 0.000000  -35.563320  -21.200319  -16.327200  380.615082  -27.962122  0.000000  0.000000   ...  0.000000
 0.000000  0.000000  31.628269  11.523487  -256.873779  1.875516  -26.901150  -0.000000   ...  0.000000
 0.000000  0.000000  0.000000  -30.793514  166.446259  -0.998982  1.384142  -25.481659   ...  0.000000
 0.000000  -0.000000  0.000000  0.000000  294.395996  -0.871994  0.063342  -1.140695  26.300621

