ECE 587 Spring 2025

RISC-V Prototyping III - Gemmini



Report Due: 04/20 (Sun.), by the end of the day (Chicago time)
Late submissions will NOT be graded

I. Objective

In this project, you will be introduced to the Gemmini accelerator. As a RoCC (RISC-V Custom Coprocessor) accelerator, Gemmini executes custom instructions sent by the RISC-V processor, and accelerates matrix multiplication with a systolic array. Although these instructions provide full potential to optimize computations, they are quite complicated as we will need to coordinate the data movement between the main memory (L2 cache actually) and the Gemmini scratchpad and accumulator SRAM, as well as to feed the systolic array.

For this project we will focus on the helper functions defined by the Gemmini software, in particular tiled_matmul_auto that takes care of moving matrices into Gemmini, performing multiplication, moving out the results.


II. Gemmini

Since Gemmini makes heavy use of the Chipyard framework, we will need to initialize Chipyard environment and go to the directory 'chipyard/generators/gemmini' using a terminal in VS Code.

ubuntu@ubuntu:~$ source chipyard/env.sh 
... ubuntu@ubuntu:~$ cd chipyard/generators/gemmini
... ubuntu@ubuntu:~/chipyard/generators/gemmini$ pushd software/gemmini-rocc-tests; ./build.sh; popd
...
~/chipyard/generators/gemmini
Make sure to stay in 'chipyard/generators/gemmini' and use pushd/popd commands to move between these directories when necessary. In particular, don't forget to run "pushd software/gemmini-rocc-tests; ./build.sh; popd" if you have modified a program.

It is very convenient to use VS Code to browse the programs in 'chipyard/generators/gemmini/software/gemmini-rocc-tests/bareMetalC'. Quite a few of them demonstrates how to use tiled_matmul_auto in different ways. Let's focus on 'tiled_matmul_ws.c' where the weight-stationary dataflow is used. Read this file and run Spike simulation to understand the structure of the code.

... ubuntu@ubuntu:~/chipyard/generators/gemmini$ ./scripts/run-spike.sh tiled_matmul_ws
MAT_DIM_I: 64
MAT_DIM_J: 64
MAT_DIM_K: 64
Gemmini extension configured with:
    dim = 16
Starting slow CPU matmul
Cycles taken: 2130523
Starting gemmini matmul
Cycles taken: 96

The huge number of cycles required to compute the matrix multiplication using CPU simply means Verilator simulation will take too much time to complete. On the other hand, the 96 cycles required to complete the computation in Gemmini seems too good to be true (Why? We'll come back to this). Turn off CPU computation by changing the line '#define CHECK_RESULT 1' into '#define CHECK_RESULT 0'. Rebuild the program and run Verilator simulation to get the actual number of cycles need.


III. Project Deliverables

Here are the tasks/questions that you need to do/answer for this project. You may need to perform additional searches online.

Complete the above tasks/questions for a total of 15 points. Write a project report in .doc/.docs or .pdf format, include screenshots as need, and submit it to Canvas before the deadline.