The Verilog design file can be found here.
- Each memory is 64 locations deep.
- Each memory location is 16 bytes wide and is written (or read) byte by byte.
- There are four address pointers A, B, C and D that can read (or write) 4 bytes of data simultaneously (1 byte per pointer per cycle). This is how data is fed as input into the systolic array unit.
- Each pointer stays at a particular address for 17 cycles. This limits the maximum number of rows (and columns, based on AxB) an input matrix can have to 16.
- The first cycle is used to write a 128-bit 0 into all the address locations; each byte is then overwritten in the subsequent cycles based on the address given.
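The write pattern for the input memories can be sketched as a small behavioural model in Python. This is an illustration only, not the Verilog design: the class and function names (`InputMemory`, `load_block`) are hypothetical, and it assumes that "all the address locations" in the clear cycle means the four locations currently selected by pointers A, B, C and D.

```python
DEPTH = 64         # locations per memory (from the text)
WIDTH_BYTES = 16   # bytes per location (from the text)


class InputMemory:
    """Hypothetical behavioural model of memory A or B."""

    def __init__(self):
        self.rows = [bytearray(WIDTH_BYTES) for _ in range(DEPTH)]

    def clear(self, addr):
        # Write a 128-bit zero into one location.
        self.rows[addr] = bytearray(WIDTH_BYTES)

    def write_byte(self, addr, lane, value):
        # Overwrite a single byte lane of one location.
        self.rows[addr][lane] = value & 0xFF


def load_block(mem, addrs, rows):
    """Load up to 16 bytes into each of four locations; return cycles used.

    addrs: the four addresses held by pointers A, B, C and D.
    rows:  four byte sequences, each at most 16 bytes long.
    """
    assert len(addrs) == 4 and len(rows) == 4
    assert all(len(row) <= WIDTH_BYTES for row in rows)

    cycles = 0
    # Cycle 0: zero every addressed location (assumed interpretation).
    for addr in addrs:
        mem.clear(addr)
    cycles += 1

    # Cycles 1..16: one byte per pointer per cycle.
    for lane in range(WIDTH_BYTES):
        for addr, row in zip(addrs, rows):
            if lane < len(row):
                mem.write_byte(addr, lane, row[lane])
        cycles += 1
    return cycles
```

Calling `load_block(InputMemory(), [0, 1, 2, 3], [[1, 2, 3], [4], [], [5, 6]])` spends 17 cycles in this model, and any byte lane that is never written stays at zero, which is what makes the fixed-size multiplication work for smaller matrices.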
- This memory has 256 address locations.
- Each location is 4 bytes wide.
- The outputs of four PEs from one row are concatenated and stored in one address location of this register. This way, only 4 address locations are utilised.
- The data in each address gets overwritten each cycle, and the final result is obtained 17 cycles after the multiplication starts.
- No matter the matrix size, the multiplication is always 4x16 : 16x4 (the unused bytes are padded with 0).
- The number of cycles taken for this operation is pre-calculated. After that many cycles, a complete flag is raised, indicating that the data in mem_C is ready to be read.
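The fixed 4x16 : 16x4 multiplication and the packing of one PE row into a single mem_C location can be illustrated with a short Python sketch. It is a reference model rather than the hardware: the function names are hypothetical, each PE output is assumed to occupy one byte (four outputs share a 4-byte location), values that overflow a byte are assumed to be truncated, and the byte order inside the word is also an assumption.

```python
ROWS, INNER, COLS = 4, 16, 4   # fixed operand shape: 4x16 times 16x4


def pad(matrix, n_rows, n_cols):
    """Copy `matrix` into the top-left corner of an n_rows x n_cols zero matrix."""
    assert len(matrix) <= n_rows
    assert all(len(row) <= n_cols for row in matrix)
    out = [[0] * n_cols for _ in range(n_rows)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            out[r][c] = value
    return out


def multiply_padded(a, b):
    """Multiply a (up to 4x16) by b (up to 16x4) after zero padding."""
    a_p = pad(a, ROWS, INNER)
    b_p = pad(b, INNER, COLS)
    return [
        [sum(a_p[i][k] * b_p[k][j] for k in range(INNER)) for j in range(COLS)]
        for i in range(ROWS)
    ]


def pack_row(row):
    """Concatenate four PE outputs into one 32-bit word.

    Assumption: one byte per PE, first PE in the most significant byte,
    larger values truncated to 8 bits.
    """
    assert len(row) == COLS
    word = 0
    for value in row:
        word = (word << 8) | (value & 0xFF)
    return word


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = multiply_padded(a, b)
mem_c = [pack_row(row) for row in c]   # only 4 locations are used
```

Because the padding bytes are zero, they contribute nothing to the sums, so the top-left corner of `c` equals the ordinary product of `a` and `b`.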
- After receiving the data_incoming signal from the source, write enable is made high for memories A and B for 17 cycles. The first cycle is used to reset these memories.
- Data is written into A and B based on the addresses given by source in the next 16 cycles.
- After 17 cycles, write enable is disabled and read enable is forced high, so the data is read byte by byte.
- Therefore writing and reading don't happen simultaneously.
- The read_enable signal for A and B also raises write enable for C. This goes low only as mentioned in step-2.
- read_enable is high for mem_C for at least 4 cycles after this.
- All this while (from the time read enable is raised for A and B until step-8 is done), a busy flag is raised and given as input to the source to indicate that it cannot write data.
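The sequencing above can be laid out cycle by cycle with a small Python sketch. It is a rough timeline under stated assumptions, not the control logic itself: the signal names (`we_ab`, `re_ab`, `we_c`, `re_c`, `busy`) and the function `control_schedule` are hypothetical, the compute phase is assumed to last 17 cycles, the mem_C read-out is assumed to last exactly 4 cycles, and busy is assumed to stay high until that read-out finishes.

```python
WRITE_CYCLES = 17   # 1 reset cycle + 16 byte-write cycles (from the text)


def control_schedule(compute_cycles=17, readout_cycles=4):
    """Return one dict of control signals per cycle after data_incoming."""
    schedule = []

    # Phase 1: the source writes into memories A and B.
    for _ in range(WRITE_CYCLES):
        schedule.append({"we_ab": 1, "re_ab": 0, "we_c": 0, "re_c": 0, "busy": 0})

    # Phase 2: A and B are read into the array while results are written to C.
    for _ in range(compute_cycles):
        schedule.append({"we_ab": 0, "re_ab": 1, "we_c": 1, "re_c": 0, "busy": 1})

    # Phase 3: the results are read out of mem_C (assumed to keep busy high).
    for _ in range(readout_cycles):
        schedule.append({"we_ab": 0, "re_ab": 0, "we_c": 0, "re_c": 1, "busy": 1})

    return schedule


schedule = control_schedule()
for cycle, signals in enumerate(schedule):
    print(cycle, signals)
```

In this model write enable and read enable for A and B are never high in the same cycle, which matches the statement that writing and reading do not happen simultaneously.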
- Conversion of the control logic into Verilog code.
- Checking the design with a proper test-bench.
- Improving the speed of the design by taking the input matrices' size into consideration.