RV32I Pipelined SoC on Zybo Z7
This project was an end-to-end implementation of a small RISC-V RV32I SoC in SystemVerilog, taken from individual modules through full FPGA bring-up on a Zybo Z7. The focus was building a clean pipeline incrementally, proving each subsystem in simulation, and then integrating memory + MMIO with predictable 1-cycle timing so software bring-up was repeatable instead of “trial and hope.”
Figure 0. Project hero image (block diagram / Zybo photo / UART output screenshot).
Quick Specs
- ISA — RISC-V RV32I
- Pipeline — IF/ID/EX/MEM/WB with forwarding, load-use stall, and redirect flush
- Memory — dual-port BRAM (Port A instruction, Port B data with byte strobes)
- Boot — program image loaded from prog.hex via $readmemh
- Peripherals — MMIO UART TX + MMIO LED GPIO
- Board I/O — 125 MHz clock, BTN reset, 4 LEDs, UART TX on PMOD pin
Implementation Process
The build was intentionally staged so each layer could be validated before moving on. The overall pattern was: implement a minimal version, write a tiny test to break it, fix timing/edge cases, then integrate the next block. The “definition of done” for each step was: deterministic waveforms in simulation and a simple hardware check on the Zybo.
Milestones
- Bring up a minimal datapath: regfile + ALU + decoder producing stable control.
- Stand up pipeline registers and run a small instruction subset with NOP padding.
- Add hazard handling in a practical order: forwarding first, then load-use stall, then redirect flush.
- Integrate BRAM with a strict 1-cycle model and confirm loads/stores against a test program.
- Integrate MMIO using the same timing model as BRAM so core assumptions stay consistent.
- Close the loop on FPGA: constraints, bitstream, UART prints, LED writes, then deeper tests.
Core Bring-Up: From Simple to Pipelined
The core started as simple building blocks (decoder, regfile, ALU) verified in isolation. Once those were stable, the pipeline registers were introduced and the CPU was run with conservative assumptions first (known-good NOPs between dependent instructions). That created a baseline where “the pipeline works” before adding hazard complexity.
Hazards and Control Flow (Implemented Incrementally)
- Forwarding was added first to eliminate obvious RAW hazards without slowing everything down.
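- A 1-cycle load-use stall was then implemented for the one case forwarding cannot solve: an instruction that needs a load result in the very next cycle, before the data has come back from memory.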
- Redirect + flush was added after that, with branch/jump decisions resolved in EX and bubbles injected upstream.
Once hazard handling was in place, I shifted effort into verification by writing small “trap” programs that intentionally trigger each hazard. The goal was not just correctness, but predictability: the same program should produce the same writebacks and memory side effects every run.
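The forwarding and load-use decisions above can be sketched as a small behavioral model. This is Python rather than the project's SystemVerilog, and it follows the textbook five-stage formulation; the function and signal names (forward_select, ex_mem_rd, and so on) are hypothetical, and the real RTL may also qualify the checks by whether an instruction actually reads rs1/rs2.

```python
def forward_select(rs, ex_mem_rd, ex_mem_regwrite, mem_wb_rd, mem_wb_regwrite):
    """Choose the source for one EX-stage operand.

    Returns "EX_MEM", "MEM_WB", or "REGFILE". The younger result (EX/MEM)
    wins when both stages target the same register. x0 is never forwarded
    because it always reads as zero.
    """
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == rs:
        return "EX_MEM"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == rs:
        return "MEM_WB"
    return "REGFILE"


def load_use_stall(id_ex_memread, id_ex_rd, if_id_rs1, if_id_rs2):
    """True when the instruction in ID needs the result of a load in EX.

    The load data does not exist yet, so forwarding cannot help and the
    pipeline has to hold IF/ID for one cycle and inject a bubble.
    """
    return bool(
        id_ex_memread
        and id_ex_rd != 0
        and id_ex_rd in (if_id_rs1, if_id_rs2)
    )
```

A "trap" program for each case then reduces to a sequence whose register numbers make exactly one of these conditions true, which keeps the expected waveform easy to predict.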
Memory System Implementation (1-Cycle BRAM Model)
Instead of treating memory as combinational (which rarely matches FPGA reality), BRAM was implemented with registered addressing and a consistent 1-cycle read response. This forced the core to behave like it would on actual FPGA memory and prevented a common failure mode where simulation “works” but hardware breaks.
- Port A was dedicated to instruction fetch to keep timing simple and stable.
- Port B handled data loads/stores with byte strobes, enabling SB/SH/SW behavior.
- Program loading used $readmemh so early bring-up could run real software immediately.
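To illustrate how byte strobes give SB/SH/SW behavior on a 32-bit data port, here is a minimal Python sketch. It is a behavioral model, not the project's RTL: the names store_strobes and apply_store are hypothetical, and rejecting misaligned halfword/word stores is an assumption, since the write-up does not say how misalignment is handled.

```python
def store_strobes(addr, wdata, size):
    """Return (wstrb, aligned_wdata) for a store of `size` bytes (1, 2 or 4).

    wstrb is a 4-bit mask, one bit per byte lane of the 32-bit word.
    The store data is shifted into the lane selected by addr[1:0].
    """
    offset = addr & 0x3
    if size not in (1, 2, 4):
        raise ValueError("size must be 1 (SB), 2 (SH) or 4 (SW)")
    if offset % size != 0:
        # Assumption: misaligned stores are not supported in this model.
        raise ValueError("misaligned store")
    lane_mask = (1 << size) - 1            # 0b0001, 0b0011 or 0b1111
    data_mask = (1 << (8 * size)) - 1      # 0xFF, 0xFFFF or 0xFFFFFFFF
    wstrb = (lane_mask << offset) & 0xF
    aligned = ((wdata & data_mask) << (8 * offset)) & 0xFFFFFFFF
    return wstrb, aligned


def apply_store(word, wstrb, wdata):
    """Merge wdata into an existing 32-bit word, one byte lane per strobe bit."""
    for lane in range(4):
        if (wstrb >> lane) & 1:
            lane_bits = 0xFF << (8 * lane)
            word = (word & ~lane_bits) | (wdata & lane_bits)
    return word & 0xFFFFFFFF
```

For example, an SH to an address ending in 2 produces strobes 0b1100 and only replaces the upper halfword of the stored word.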
Figure 1. System integration view (core ↔ BRAM/MMIO via address decode).
MMIO Integration (Matched to BRAM Timing)
MMIO was implemented to behave like BRAM from the core’s perspective. Requests are registered, and read data is computed combinationally from the registered request. That decision avoids a “two timing models” situation where RAM loads behave one way and MMIO loads behave another, which makes software bring-up painful.
MMIO Address Map
| Address | Register | Access | Notes |
|---|---|---|---|
| 0x1000_0000 | UART_TX | W | Write [7:0] to transmit when ready. |
| 0x1000_0004 | UART_STAT | R | [0] returns ready status for polling. |
| 0x2000_0000 | LED | W | Write [3:0] to drive the user LEDs. |
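The register semantics in the table can be captured in a small software-visible model, which is handy as a reference when checking test programs. This is a hedged Python sketch: the MmioModel class is hypothetical, and three behaviors are assumptions not stated above (a UART_TX write while not ready is dropped, reads of write-only registers return 0, and the unused UART_STAT bits read as 0).

```python
UART_TX = 0x1000_0000
UART_STAT = 0x1000_0004
LED = 0x2000_0000


class MmioModel:
    """Software-visible behavior of the three MMIO registers."""

    def __init__(self):
        self.uart_ready = True   # mirrors UART_STAT[0]
        self.tx_log = []         # bytes accepted for transmission
        self.leds = 0            # current 4-bit LED value

    def write(self, addr, data):
        if addr == UART_TX:
            # Only bits [7:0] are used. Assumption: a write while the
            # transmitter is busy is ignored rather than queued.
            if self.uart_ready:
                self.tx_log.append(data & 0xFF)
        elif addr == LED:
            self.leds = data & 0xF   # only bits [3:0] drive LEDs

    def read(self, addr):
        if addr == UART_STAT:
            return 1 if self.uart_ready else 0
        return 0   # assumption: everything else reads as zero
```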
UART TX Implementation and Debug
The UART transmitter was built as a small, parameterized module with a divider derived from CLK_HZ and BAUD. I verified framing in simulation first (start bit, 8 data bits, stop bit), then used the Zybo hardware to validate end-to-end output by printing short strings and confirming timing at 115200 baud.
- Kept the interface simple: start pulse + data, plus a ready flag.
- Generated the start pulse only when the transmitter is ready, to avoid overruns.
- Used polling via UART_STAT during early software bring-up to keep control simple.
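The divider and framing described above are easy to sanity-check with a few lines of Python. This is a sketch under assumptions: rounding to the nearest integer is one common way to derive the divider and may not be what the RTL does, and the function names are hypothetical. LSB-first bit order is standard for UART.

```python
def uart_divider(clk_hz, baud):
    """Clock cycles per bit, rounded to the nearest integer."""
    return (clk_hz + baud // 2) // baud


def baud_error(clk_hz, baud):
    """Relative error of the achieved baud rate versus the target."""
    actual = clk_hz / uart_divider(clk_hz, baud)
    return abs(actual - baud) / baud


def uart_frame(byte):
    """Line levels for one frame: start (0), 8 data bits LSB first, stop (1)."""
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]
```

With the board's 125 MHz clock and 115200 baud, uart_divider returns 1085 cycles per bit, and the resulting rate error is far below the few percent a UART receiver typically tolerates. Comparing uart_frame(0x41) against the simulated TX waveform is a quick way to confirm bit order.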
Figure 2. UART output / TX waveform capture (optional).
Top-Level Integration (Address Decode + Return Data Alignment)
The SoC wrapper was treated as an engineering project on its own: make address decode obvious, keep the interface to the core stable, and guarantee readback alignment. The key detail here is that return data selection uses registered selects so both RAM and MMIO read paths line up with the pipeline’s expectations.
- Decode RAM vs MMIO based on address region.
- Register the selected target so read data returns on a consistent cycle.
- Keep instruction fetch isolated from data accesses to reduce integration coupling.
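The registered-select idea can be shown with a cycle-level Python model: the request is captured on the clock edge, and the read data seen in the following cycle is chosen by the captured select, not by whatever address is currently on the bus. This is a sketch only. Decoding on the top address nibble and placing RAM at region 0x0 are assumptions that merely agree with the three addresses in the map, and the ReadPath class is hypothetical.

```python
UART_STAT = 0x1000_0004  # from the MMIO address map


def decode(addr):
    """Assumed region decode on addr[31:28]; real boundaries may differ."""
    region = (addr >> 28) & 0xF
    return {0x0: "RAM", 0x1: "UART", 0x2: "LED"}.get(region, "NONE")


class ReadPath:
    """Data read path with a registered target select."""

    def __init__(self, ram_words, uart_ready=True):
        self.ram = ram_words          # dict: word index -> 32-bit value
        self.uart_ready = uart_ready
        self.sel_q = "NONE"           # registered select
        self.addr_q = 0               # registered address

    def clock(self, addr):
        """Rising edge: capture the request presented this cycle."""
        self.sel_q = decode(addr)
        self.addr_q = addr

    @property
    def rdata(self):
        """Read data for the captured request, visible one cycle later."""
        if self.sel_q == "RAM":
            return self.ram.get(self.addr_q >> 2, 0)
        if self.sel_q == "UART" and self.addr_q == UART_STAT:
            return int(self.uart_ready)
        return 0
```

Because both RAM and MMIO responses pass through the same registered select, a load from UART_STAT returns on the same cycle offset as a load from RAM, which is the property the core relies on.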
FPGA Bring-Up (Constraints, Timing, First Light)
Hardware bring-up started with the smallest possible proof: LEDs and UART. Once those worked, I expanded into memory tests and hazard-triggering instruction sequences. This kept debugging grounded in observable signals instead of guessing.
Bring-Up Checklist
- Confirm the clock constraint matches the board clock (expected 125 MHz).
- Program FPGA and validate reset behavior (BTN reset, clean startup state).
- Run a UART “hello world” by polling UART_STAT and writing UART_TX.
- Write a simple LED pattern through the MMIO LED register.
- Run RAM read/write tests (bytes/halfwords/words) to validate strobes and packing.
- Run hazard tests (forwarding, load-use stall, redirect flush) and compare expected vs observed output.
Demo Ideas
- UART boot banner with a short self-test summary.
- LED counter driven by memory-mapped writes.
- Waveform capture of UART TX showing correct start/data/stop framing.
- Short “hazard demo” program that prints pass/fail per case.
Figure 3. Demo media placeholder (UART terminal / scope capture / short video).