AIC2021 Project1 - TPU

tags: aic2021

Project Description

Design a Tensor Processing Unit (TPU) with 4x4 processing elements (PEs) that can compute a (4*K)*(K*4) 8-bit integer matrix multiplication, where K is limited by the size of the input global buffer.
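As a reference point, the product can be modeled in a few lines of Python, the language of the provided tb/matmul.py. This sketch assumes signed 8-bit inputs, which the description does not state explicitly. It also computes how wide a PE accumulator must be so that summing K products cannot overflow. The helper names matmul_int8 and acc_bits are hypothetical.

```python
def matmul_int8(a, b):
    """Reference (M*K)*(K*N) product of signed 8-bit matrices (hypothetical helper)."""
    m, k = len(a), len(a[0])
    assert len(b) == k, "inner dimensions must match"
    n = len(b[0])
    # Assumption: inputs are signed 8-bit integers.
    for row in a + b:
        assert all(-128 <= v <= 127 for v in row), "values must fit in int8"
    # Python ints never overflow, so this is an exact golden result.
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(n)]
            for i in range(m)]


def acc_bits(k):
    """Signed accumulator width needed to sum k int8*int8 products without overflow."""
    # The largest-magnitude product is (-128) * (-128) = 16384.
    worst = k * 128 * 128
    # A w-bit two's-complement register holds values up to 2**(w-1) - 1.
    return worst.bit_length() + 1
```

For test3 (K=9), acc_bits(9) returns 19, so a 19-bit signed accumulator per PE is enough under the signed-input assumption. Unsigned inputs would need a different bound.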

Project Constraints

  1. Your design must be written in Verilog.
  2. Your PE array must not exceed 4x4, and a 2D systolic array architecture is strictly required in this project.
  3. The design uses 8-bit data.
  4. The total global buffer size is 3 KiB.

Project Deadline

  • May 7, 2021, 23:00

  • Keep in mind that in a real system, a TPU is a Deep Learning Processor (DLP) on the common bus, and data is continuously prepared by the CPU or DMA from DRAM.
  • In this project, focus only on the design and dataflow inside the TPU, rather than a full-system simulation including the CPU, DMA, and DRAM (keep it simple 😄 unless you want more of a challenge).

Project directory hierarchy

AIC2021_TPU/
    +-- tb/
    |   +-- matmul.py
    |   +-- top_tb.v
    +-- src/
    |   +-- define.v
    |   +-- global_buffer.v
    |   +-- top.v
    |
    +-- Makefile

TOP Simulation Architecture

  • Your TPU design should be placed under the top module provided by the TA.
  • The TOP module includes three global buffers prepared for your TPU. Each global buffer has its own read/write port and holds 256 x 32 bits = 1 KiB, for a total of 3 KiB.
  • Although the TA provides the global buffers, you are free to design their behavior yourself, except for their names, which the testbench already uses to load the data and check the correctness of the output.
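The buffer sizes above can be checked with a little arithmetic. The sketch below also gives an upper bound on K for a 4-row operand, assuming the operand is packed densely at four 8-bit values per 32-bit word; the real bound depends on the memory mapping you use, and the names here are illustrative only.

```python
WORDS = 256        # entries per global buffer (from the spec)
WORD_BITS = 32     # bits per entry (from the spec)
NUM_BUFFERS = 3    # buffers A, B, and output

BYTES_PER_BUFFER = WORDS * WORD_BITS // 8      # 1024 bytes = 1 KiB
TOTAL_BYTES = NUM_BUFFERS * BYTES_PER_BUFFER   # 3072 bytes = 3 KiB
INT8_PER_WORD = WORD_BITS // 8                 # 4 values per word


def max_k(rows=4):
    """Upper bound on K for a rows x K int8 operand, assuming dense packing (hypothetical helper)."""
    return WORDS * INT8_PER_WORD // rows
```

Under this assumption a 4-row A fits for K up to 256, far above the K=9 used by test3.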

Testbench

  • At the start of the simulation, the testbench loads global buffers A and B, assuming that the CPU or DMA has already prepared the data for the TPU. When start==1, the sizes of the two matrices (m, n, k) are available to the TPU.
    • A(M*K)*B(K*N)
  • You should implement your own data loader, processing elements (PEs), and controller, which schedules the data in global buffers A and B for computation in the systolic array.
  • When you finish the calculation (done==1), the testbench compares your output global buffer with the golden output.
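One common way to schedule a 4x4 systolic array is an output-stationary dataflow: each PE keeps one element of the result, while A values flow right and B values flow down, with a one-cycle skew per row and column. The project does not mandate this dataflow; the Python model below is only a cycle-level sketch of that choice, with a hypothetical function name, to help check skewing and timing before writing Verilog. It does not model the start/done handshake or writing results back to the output buffer.

```python
def systolic_matmul(a, b, size=4):
    """Cycle-level model of an output-stationary size x size systolic array
    computing a (size x K) @ b (K x size). Returns (result, cycles)."""
    k = len(a[0])
    acc = [[0] * size for _ in range(size)]     # one accumulator per PE
    a_reg = [[0] * size for _ in range(size)]   # A operand latched in each PE
    b_reg = [[0] * size for _ in range(size)]   # B operand latched in each PE
    # PE(i, j) sees a[i][s] and b[s][j] at cycle t = s + i + j, so the last
    # pair (s = k-1) reaches PE(size-1, size-1) at cycle k - 1 + 2*(size-1).
    cycles = k + 2 * (size - 1)
    for t in range(cycles):
        new_a = [[0] * size for _ in range(size)]
        new_b = [[0] * size for _ in range(size)]
        for i in range(size):
            for j in range(size):
                if j == 0:
                    s = t - i                       # row i of A is skewed by i cycles
                    new_a[i][j] = a[i][s] if 0 <= s < k else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]   # A moves one PE to the right
                if i == 0:
                    s = t - j                       # column j of B is skewed by j cycles
                    new_b[i][j] = b[s][j] if 0 <= s < k else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]   # B moves one PE down
        a_reg, b_reg = new_a, new_b
        for i in range(size):
            for j in range(size):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]   # multiply-accumulate in place
    return acc, cycles
```

For K=9 the last operands reach PE(3,3) after K + 2*(4-1) = 15 cycles, not counting the cycles needed to drain the results into the output buffer.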

Prerequisite

  • python3 with numpy library installed
  • iverilog, ncverilog (or any other Verilog compiler)

Makefile

  • make test1
    • A(2*2)*B(2*2)
  • make test2
    • A(4*4)*B(4*4)
  • make test3
    • A(4*K)*B(K*4), where K=9
  • make monster (extra)
    • A(M*K)*B(K*N), where K<10, M<10, N<10
    • Although our target is (4*K)*(K*4) matrix multiplication, when M and N are small enough to fit in the input global buffers, provide a solution for input matrices of that size. 😄
  • make clean
    • This will remove the build/ folder
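For the monster case, one straightforward approach (an assumption, not a requirement) is to split the M*N output into 4x4 tiles and run one pass of the array per tile, leaving PEs outside a partial tile idle. The sketch below, with hypothetical names, counts the passes and checks the result against a direct product; the inner loops stand in for one systolic pass.

```python
def tiles(m, n, size=4):
    """Yield (row0, col0, rows, cols) for each size x size tile of an m x n output."""
    for r0 in range(0, m, size):
        for c0 in range(0, n, size):
            yield r0, c0, min(size, m - r0), min(size, n - c0)


def tiled_matmul(a, b, size=4):
    """Compute (M*K)*(K*N) one output tile at a time. Returns (result, passes)."""
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[0] * n for _ in range(m)]
    passes = 0
    for r0, c0, rows, cols in tiles(m, n, size):
        passes += 1
        # One pass of the size x size array: PE (i, j) owns C[r0+i][c0+j].
        # PEs with i >= rows or j >= cols have no valid output and stay idle.
        for i in range(rows):
            for j in range(cols):
                c[r0 + i][c0 + j] = sum(a[r0 + i][t] * b[t][c0 + j] for t in range(k))
    return c, passes
```

With M=N=9, for example, this gives ceil(9/4)^2 = 9 passes. Each group of 4 rows of A is reused across every column tile of B, which is one place a data reuse scheme can save buffer reads.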

Global buffer mapping

build/
    +-- matrix_a.bin
    +-- matrix_b.bin
    +-- golden.bin
  • Memory Mapping - Type A (with transpose)
  • Memory Mapping - Type B (without transpose)
  • The two figures above use A(6*6)*B(6*6) as an example to show how 8-bit matrix data are mapped into the 32-bit global buffer. Your output global buffer must follow memory mapping type B.
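The figures define which matrix elements share a word, so the sketch below covers only the basic step of fitting four 8-bit values into one 32-bit word. It assumes signed values and places the first element in the most significant byte; both are assumptions to verify against the figures and matmul.py, and the helper names are hypothetical.

```python
def pack_word(values):
    """Pack four signed int8 values into one 32-bit word; values[0] goes to bits [31:24] (assumed order)."""
    assert len(values) == 4, "a 32-bit word holds exactly four 8-bit values"
    word = 0
    for v in values:
        assert -128 <= v <= 127, "value must fit in int8"
        word = (word << 8) | (v & 0xFF)   # append the two's-complement byte
    return word


def unpack_word(word):
    """Split a 32-bit word back into four signed int8 values (inverse of pack_word)."""
    values = []
    for shift in (24, 16, 8, 0):
        byte = (word >> shift) & 0xFF
        values.append(byte - 256 if byte >= 128 else byte)   # sign-extend
    return values
```

For example, pack_word([1, 2, 3, 4]) equals 0x01020304, and unpack_word reverses it.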

Grading Scores

  • Testbench1~3 (70%)
    • Design of the dataflow in the TPU
    • Execution time ranking in class
    • Data reuse method
    • Pass at least test1~3
  • Readme (20%)
    • Members' student IDs
    • TPU architecture graph
    • Explain your dataflow in the TPU
    • Please describe it in as much detail as you can
    • ...
  • Extra (20%)
    • Support (M*K)*(K*N)
    • or other features
      • Please provide your own testbench for the extra features
    • Good coding style
    • Plagiarizing (copy-&-paste) others' code is prohibited
      • Don't try it 😄. Warning from the TAs: -100%
    • Synthesis (10%)
      • Any synthesis tool is acceptable
      • Synthesize only your TPU module
      • List your cell library or FPGA
      • Provide timing and area reports

Contributors

  • oscarkai9139
