In this project, you will be introduced to the Gemmini accelerator. As a RoCC (Rocket Custom Coprocessor) accelerator, Gemmini executes custom instructions sent by the RISC-V processor and accelerates matrix multiplication with a systolic array. Although these instructions give full control over how a computation is optimized, they are complicated to use directly: we need to coordinate data movement between main memory (in practice, the L2 cache) and Gemmini's scratchpad and accumulator SRAMs, and also feed the systolic array.
For this project we will focus on the helper functions defined by the Gemmini software, in particular tiled_matmul_auto, which takes care of moving matrices into Gemmini, performing the multiplication, and moving the results back out.
Since Gemmini makes heavy use of the Chipyard framework, we first need to initialize the Chipyard environment and change to the directory 'chipyard/generators/gemmini' in a VS Code terminal.
ubuntu@ubuntu:~$ source chipyard/env.sh
...
ubuntu@ubuntu:~$ cd chipyard/generators/gemmini
...
ubuntu@ubuntu:~/chipyard/generators/gemmini$ pushd software/gemmini-rocc-tests; ./build.sh; popd
...
~/chipyard/generators/gemmini

Make sure to stay in 'chipyard/generators/gemmini' and use the pushd/popd commands to move between these directories when necessary. In particular, don't forget to run "pushd software/gemmini-rocc-tests; ./build.sh; popd" if you have modified a program.
It is very convenient to use VS Code to browse the programs in 'chipyard/generators/gemmini/software/gemmini-rocc-tests/bareMetalC'. Quite a few of them demonstrate how to use tiled_matmul_auto in different ways. Let's focus on 'tiled_matmul_ws.c', where the weight-stationary dataflow is used. Read this file and run a Spike simulation to understand the structure of the code.
...
ubuntu@ubuntu:~/chipyard/generators/gemmini$ ./scripts/run-spike.sh tiled_matmul_ws
MAT_DIM_I: 64
MAT_DIM_J: 64
MAT_DIM_K: 64
Gemmini extension configured with:
    dim = 16
Starting slow CPU matmul
Cycles taken: 2130523
Starting gemmini matmul
Cycles taken: 96
The huge number of cycles required to compute the matrix multiplication on the CPU means that a Verilator simulation of this step would take too long to complete. On the other hand, the 96 cycles reported for the Gemmini computation seem too good to be true (Why? We'll come back to this). Turn off the CPU computation by changing the line '#define CHECK_RESULT 1' to '#define CHECK_RESULT 0'. Rebuild the program and run a Verilator simulation to get the actual number of cycles needed.
Here are the tasks/questions that you need to do/answer for this project. You may need to perform additional searches online.
Complete the above tasks/questions for a total of 15 points. Write a project report in .doc/.docx or .pdf format, include screenshots as needed, and submit it to Canvas before the deadline.