
FusedGCN: A Systolic Three-Matrix Multiplication Architecture for Graph Convolutional Networks

In this work, we focus on accelerating the inference of Graph Convolutional Networks (GCNs), one of the first and most effective processing models for Graph Neural Networks (GNNs). GNNs learn graph-related representations, where nodes represent objects and edges represent the relationships between them. As a result, GNNs have been widely adopted in applications that operate on graph-structured data, such as social networks and recommendation systems.

The inference of a GCN layer can be expressed, as shown by T. Kipf and M. Welling in Semi-Supervised Classification with Graph Convolutional Networks (ICLR 2017), as the product of the normalized adjacency matrix S, the input feature matrix H, and the weight matrix W.

H^(k) = σ( S · H^(k-1) · W^(k) )

where σ is an activation function, H^(k) is the output feature matrix of the k-th layer, W^(k) is the weight matrix of the k-th layer, and S is the normalized adjacency matrix, computed as follows:

S = D^(-1/2) · (A + I_N) · D^(-1/2)

where A is the graph's adjacency matrix, I_N is the N×N identity matrix (adding a self-loop to every node), and D is the diagonal degree matrix of A + I_N, i.e., D_ii = Σ_j (A + I_N)_ij.
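
As a small illustration, the following minimal sketch builds S from a raw adjacency matrix for a hypothetical 4-node graph (toy data, not taken from the repository).

// Minimal sketch: build S = D^(-1/2) * (A + I_N) * D^(-1/2) for a toy 4-node graph.
#include <cmath>
#include <cstdio>

int main() {
  const int N = 4;
  // Hypothetical undirected adjacency matrix (edges 0-1, 0-3, 1-2).
  double A[N][N] = {{0, 1, 0, 1},
                    {1, 0, 1, 0},
                    {0, 1, 0, 0},
                    {1, 0, 0, 0}};
  double A_loop[N][N], S[N][N], deg[N];

  // Add self-loops (A + I_N) and compute the diagonal degree matrix D.
  for (int i = 0; i < N; i++) {
    deg[i] = 0.0;
    for (int j = 0; j < N; j++) {
      A_loop[i][j] = A[i][j] + (i == j ? 1.0 : 0.0);
      deg[i] += A_loop[i][j];
    }
  }

  // s_ij = (A + I_N)_ij / sqrt(d_i * d_j), i.e., S = D^(-1/2) (A + I_N) D^(-1/2).
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
      S[i][j] = A_loop[i][j] / std::sqrt(deg[i] * deg[j]);
      printf("%.3f ", S[i][j]);
    }
    printf("\n");
  }
  return 0;
}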

In FusedGCN, we propose computing this three-matrix product in a fused manner. To do so, let us first denote the product HW as C. Each element c_jq of C is the dot product of the j-th row of H and the q-th column of W. The final output can then be computed as H* = S·C. For every element of H*, we have

h*_iq = Σ_j s_ij · c_jq = Σ_j s_ij · ( Σ_p h_jp · w_pq )

Assuming a dense matrix S, the fused three-matrix multiplication can be expressed with the unified loop structure shown below.

for (int i = 0; i < N; i++) //temporal
  for (int p = 0; p < I; p++) //spatial parallel
    for (int q = 0; q < O; q++) //spatial parallel
      for (int j = 0; j < N; j++) //temporal
        h_star[i][q] += (s[i][j] * h[j][p]) * w[p][q];

Adjacency matrices of real-world graphs are extremely sparse. A simple example of a normalized adjacency matrix in CSR format is shown in the following figure.

CSR example
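
For concreteness, here are the three CSR arrays of the toy 4-node matrix S computed in the earlier sketch (values rounded to three decimals; this is not the matrix of the figure). S_pos[i] and S_pos[i+1] delimit the non-zeros of row i, S_col[j] holds the column index of the j-th non-zero, and S_val[j] holds its value.

// CSR arrays of the toy normalized adjacency matrix S (rounded values).
const int   S_pos[] = {0, 3, 6, 8, 10};
const int   S_col[] = {0, 1, 3,   0, 1, 2,   1, 2,   0, 3};
const float S_val[] = {0.333f, 0.333f, 0.408f,
                       0.333f, 0.333f, 0.408f,
                       0.408f, 0.500f,
                       0.408f, 0.500f};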

To transform the fused three-matrix multiplication to operate directly on CSR representation, we need to replace the instances of s[i][j] in the previous code block with the corresponding CSR representation of the non-zero elements of S.

for (int i = 0; i < N; i++) //temporal
  for (int p = 0; p < I; p++) //spatial parallel
    for (int q = 0; q < O; q++) //spatial parallel
      for (int j = S_pos[i]; j < S_pos[i+1]; j++) //temporal
        h_star[i][q] += (S_val[j] * h[S_col[j]][p]) * w[p][q];

The outer loop passes through all the rows of matrix S (each node of the graph is visited once). Each non-zero element of the selected row of S (row 2 in the following figure), with value S_val[j], is multiplied with all the elements of the row of H that corresponds to column S_col[j]. The product derived for each non-zero element is then multiplied with all columns of W to produce a complete row of the output. Since S is sparse, only selected rows of H are fetched: the ones that correspond to the columns of the non-zero elements of S.

Sparse multiplication example
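
For reference, the following self-contained sketch (hypothetical toy sizes and data, not the repository's testbench) runs the CSR-based fused loop above and cross-checks its output against the explicit dense computation S · (H · W).

// Sketch: fused CSR-based kernel vs. dense reference on toy data.
#include <cmath>
#include <cstdio>
#include <cstdlib>

const int N = 4;  // nodes
const int I = 3;  // input features per node
const int O = 2;  // output features per node

int main() {
  // Same toy CSR matrix S as in the earlier sketches.
  const int   S_pos[N + 1] = {0, 3, 6, 8, 10};
  const int   S_col[10]    = {0, 1, 3, 0, 1, 2, 1, 2, 0, 3};
  const float S_val[10]    = {0.333f, 0.333f, 0.408f, 0.333f, 0.333f,
                              0.408f, 0.408f, 0.5f, 0.408f, 0.5f};
  float h[N][I], w[I][O], h_star[N][O] = {};

  // Arbitrary feature and weight values.
  for (int j = 0; j < N; j++)
    for (int p = 0; p < I; p++) h[j][p] = 0.1f * (j + p + 1);
  for (int p = 0; p < I; p++)
    for (int q = 0; q < O; q++) w[p][q] = 0.05f * (p - q + 1);

  // Fused three-matrix multiplication directly on the CSR form of S.
  for (int i = 0; i < N; i++)
    for (int p = 0; p < I; p++)
      for (int q = 0; q < O; q++)
        for (int j = S_pos[i]; j < S_pos[i + 1]; j++)
          h_star[i][q] += (S_val[j] * h[S_col[j]][p]) * w[p][q];

  // Reference: dense C = H*W, then H* = S*C, using the dense form of S.
  float S[N][N] = {}, c[N][O] = {}, ref[N][O] = {};
  for (int i = 0; i < N; i++)
    for (int j = S_pos[i]; j < S_pos[i + 1]; j++) S[i][S_col[j]] = S_val[j];
  for (int j = 0; j < N; j++)
    for (int q = 0; q < O; q++)
      for (int p = 0; p < I; p++) c[j][q] += h[j][p] * w[p][q];
  for (int i = 0; i < N; i++)
    for (int q = 0; q < O; q++)
      for (int j = 0; j < N; j++) ref[i][q] += S[i][j] * c[j][q];

  // The two results should match up to floating-point rounding.
  for (int i = 0; i < N; i++)
    for (int q = 0; q < O; q++)
      if (std::fabs(h_star[i][q] - ref[i][q]) > 1e-4f) {
        printf("mismatch at (%d,%d)\n", i, q);
        return EXIT_FAILURE;
      }
  printf("fused and dense results match\n");
  return EXIT_SUCCESS;
}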

In FusedGCN, we adopt a weight-stationary systolic-array philosophy, whereby the weight matrix of a convolution layer is pre-loaded and reused for computing the convolution of all nodes of the graph. The computation for the i-th output row is split into N steps executed serially in time, but in a pipelined fashion. The same steps are performed in parallel for all elements that belong to the same output row, using the following systolic array.

Fused systolic

In each cycle, we take one non-zero element of S with value S_val[j] and multiply it in parallel with all elements of the corresponding row of matrix H (selected by S_col[j]). The outputs of these multipliers are then multiplied (dot product) in parallel with all columns of the weight matrix by the systolic array. Each processing element of the systolic array multiplies the broadcast value with its locally-stored weight and adds to this product the partial sum arriving from the processing element above it in the same column. Once the addition is completed, it forwards the result to the next processing element of the same column in the next cycle.

The partial result computed at the output of each column of the systolic array for one non-zero element of S should be added to the results produced by the remaining non-zero elements of the same row of S. This addition is done by the accumulators shown at the bottom of each column of the systolic array. The computation for the i-th output row is completed when all elements of the i-th row of S have entered the systolic array. When starting to compute a new output row, we first need to re-initialize the value of the output accumulators to zero. The already loaded weights do not need to change and they can be reused for all rows of S.
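
The behavior of one column of this weight-stationary array can be summarized with the following behavioral sketch (an illustration, not the repository's HLS code): each processing element holds a weight, multiplies the broadcast value by it, and adds the partial sum coming from the element above, while the accumulator at the bottom sums the column outputs over all non-zeros of the current row of S.

// Behavioral sketch of one weight-stationary systolic-array column.
#include <cstddef>
#include <vector>

struct PE {
  float weight;    // pre-loaded weight, stays fixed for the whole layer
  float pipe_reg;  // partial sum forwarded to the PE below on the next cycle

  // One step: consume the broadcast value and the partial sum from above.
  void cycle(float broadcast, float from_above) {
    pipe_reg = from_above + broadcast * weight;
  }
};

struct Column {
  std::vector<PE> pes;  // one PE per input feature handled by this column
  float acc = 0.0f;     // output accumulator at the bottom of the column

  explicit Column(const std::vector<float>& weights) {
    for (float w : weights) pes.push_back({w, 0.0f});
  }

  // Feed one set of broadcast values (S_val[j] * h[S_col[j]][p] for every
  // row p of the array) and fold the column output into the accumulator.
  // A cycle-accurate model would skew the inputs in time; the pipeline is
  // collapsed here for clarity.
  void feed(const std::vector<float>& broadcast) {
    float partial = 0.0f;
    for (std::size_t p = 0; p < pes.size(); p++) {
      pes[p].cycle(broadcast[p], partial);
      partial = pes[p].pipe_reg;
    }
    acc += partial;  // accumulate over the non-zeros of the current row of S
  }

  // Re-initialize the accumulator when a new output row starts.
  void start_new_output_row() { acc = 0.0f; }
};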

In real-world applications it is infeasible to fetch a whole row of H from memory in a single cycle. Thus, a row of H would be fetched in multiple segments/partitions over consecutive clock cycles. The same limitation exists on the output side, i.e., when trying to write a computed row of the output H* to memory. In most practical cases, the computed row would be stored in multiple segments. The computation engine should match these read/write throughput limits to avoid underutilization. Therefore, a systolic array of size K × M is enough to sustain the input/output bandwidth, assuming that we can read K words and write M words in each cycle.

The figure below depicts a 3×3 systolic array used to compute, in tiles, the fused product of a GCN layer with I = 9 input features and O = 6 output features per node. Even though the computation now evolves over more steps to accommodate the limited input/output bandwidth, the basic rule of FusedGCN's operation does not change: one output row is fully computed before moving to the next one.

Tiled systolic

The tiled computation shown above can be expressed with the following loop structure.

for (int i = 0; i < N; i++) //temporal
  for (int k = 0; k < I/K; k++) //temporal: tiles of input features
    for (int r = 0; r < O/M; r++) //temporal: tiles of output features
      for (int p = 0; p < K; p++) //spatial parallel
        for (int q = 0; q < M; q++) //spatial parallel
          for (int j = S_pos[i]; j < S_pos[i+1]; j++) //temporal
            h_star[i][q+(r*M)] += (S_val[j] * h[S_col[j]][p+(k*K)]) * w[p+(k*K)][q+(r*M)];
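
As a quick sanity check (a sketch with hypothetical toy data, not part of the repository), the tiled loop can be run for the dimensions of the figure, I = 9, O = 6 and a 3×3 array (K = M = 3), and compared against the untiled CSR loop; the tile loops only reorder the accumulation, so the two results coincide up to floating-point rounding.

// Sketch: tiled vs. untiled fused loop on toy data.
#include <cmath>
#include <cstdio>

const int N = 4, I = 9, O = 6;  // toy graph: 4 nodes, 9 input / 6 output features
const int K = 3, M = 3;         // 3x3 systolic array (tile sizes)

int main() {
  // Same toy CSR matrix S as in the earlier sketches.
  const int   S_pos[N + 1] = {0, 3, 6, 8, 10};
  const int   S_col[10]    = {0, 1, 3, 0, 1, 2, 1, 2, 0, 3};
  const float S_val[10]    = {0.333f, 0.333f, 0.408f, 0.333f, 0.333f,
                              0.408f, 0.408f, 0.5f, 0.408f, 0.5f};
  float h[N][I], w[I][O], tiled[N][O] = {}, untiled[N][O] = {};
  for (int j = 0; j < N; j++)
    for (int p = 0; p < I; p++) h[j][p] = 0.1f * (j + p + 1);
  for (int p = 0; p < I; p++)
    for (int q = 0; q < O; q++) w[p][q] = 0.05f * (p - q + 1);

  // Untiled CSR-based fused loop (reference).
  for (int i = 0; i < N; i++)
    for (int p = 0; p < I; p++)
      for (int q = 0; q < O; q++)
        for (int j = S_pos[i]; j < S_pos[i + 1]; j++)
          untiled[i][q] += (S_val[j] * h[S_col[j]][p]) * w[p][q];

  // Tiled loop of the text: k/r iterate over tiles, p/q over the array.
  for (int i = 0; i < N; i++)
    for (int k = 0; k < I / K; k++)
      for (int r = 0; r < O / M; r++)
        for (int p = 0; p < K; p++)
          for (int q = 0; q < M; q++)
            for (int j = S_pos[i]; j < S_pos[i + 1]; j++)
              tiled[i][q + r * M] +=
                  (S_val[j] * h[S_col[j]][p + k * K]) * w[p + k * K][q + r * M];

  float max_err = 0.0f;
  for (int i = 0; i < N; i++)
    for (int q = 0; q < O; q++)
      max_err = std::fmax(max_err, std::fabs(tiled[i][q] - untiled[i][q]));
  printf("max |tiled - untiled| = %g\n", max_err);
  return 0;
}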

Repository hierarchy

The folders of this repository are organized as follows:

  • images: figures shown in readme
  • tiled_systolic_example: example of a real-world application using a systolic array smaller than the matrix dimensions

Implementation

FusedGCN is implemented in C++ using the Fast-Float4HLS floating-point representation, which depends on the ac_fixed library from HLSLibs.

Also, the post-synthesis RTL co-simulation of the given examples requires the sc_verify flow of Catapult HLS. The necessary header (mc_scverify) is publicly available in ac_simutils.

The required header-only libraries can be downloaded by running the provided set_libs.sh script.

All the examples implement a 2-layer GCN using the ReLU activation function for the hidden layer and a hard decision (maximum) function for the output layer.
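
For illustration, the structure of such a 2-layer model can be sketched as follows (a minimal sketch with hypothetical helper names; fused_layer() stands for the fused S·H·W kernel described above and is not the repository's actual API).

#include <algorithm>

// ReLU activation applied element-wise to the hidden feature matrix.
inline float relu(float x) { return std::max(x, 0.0f); }

// Hard decision: index of the maximum output feature of one node.
template <int O>
int hard_decision(const float (&row)[O]) {
  int best = 0;
  for (int q = 1; q < O; q++)
    if (row[q] > row[best]) best = q;
  return best;
}

// Overall structure of the examples (pseudocode, fused_layer is hypothetical):
//   H1 = relu( fused_layer(S, X,  W1) )   // hidden layer
//   H2 =       fused_layer(S, H1, W2)     // output layer
//   prediction[i] = hard_decision(H2[i])  // class of node i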

Future Work

  • Hardware implementation of the softmax activation function

Reference

FusedGCN was presented at the IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), July 2022. You can find the paper here. To cite this work, please use:

@inproceedings{FusedGCN,
  author    = {C. Peltekis and D. Filippas and C. Nicopoulos and G. Dimitrakopoulos},
  booktitle = {IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP)},
  title     = {Fused{GCN}: A Systolic Three-Matrix Multiplication Architecture for Graph Convolutional Networks},
  year      = {2022}
}

License

FusedGCN is licensed under the MIT License.


