TordBoyau

A pipelined RISC-V processor

Instructions

The included Vivado project is configured for an ARTY A35T.

  • Load the project in Vivado, synthesize, generate the bitstream, and send it to the device
  • The included firmware ray-traces an image and displays it on the TTY using ANSI codes
  • Connect a terminal at 1000000 baud (see terminal.sh and adapt it to your setup)

Instructions for other boards / Yosys-NextPNR

Plug in your board, then run BOARDS/run_<BOARD_NAME>.sh. Currently implemented for:

  • ulx3s

(other boards are coming)

Configuration

Several parameters can be configured in soc.v:

| Name | Description |
|------|-------------|
| CPU_FREQ | Depending on the options, timings will validate around 100-120 MHz |
| CONFIG_PC_PREDICT | Enables D-F path, used by branch prediction and return address stack |
| CONFIG_RAS | Enables return address stack |
| CONFIG_GSHARE | GSHARE branch predictor (uses BTFNT if not set) |
| CONFIG_RV32M | RV32M instruction set (MUL, DIV, REM) |
| CONFIG_DEBUG | Enables built-in debugger/disassembler (used in simulation) |
| CONFIG_INITIALIZE | Initializes register file and BHT (required by Icarus and some synth tools) |
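
For readers unfamiliar with the two prediction schemes named in the table, here is a minimal behavioral sketch in Python. It is not taken from the TordBoyau Verilog: the table size (`BHT_ADDR_BITS`), the history length (`HISTORY_BITS`), the 2-bit saturating counters and the way history is XORed into the index are all assumptions chosen for illustration.

```python
# Behavioral model only. All sizes below are assumptions, not TordBoyau's values.
BHT_ADDR_BITS = 12   # assumed: 4096-entry branch history table (BHT)
HISTORY_BITS = 9     # assumed: length of the global branch history


def btfnt_predict(branch_offset: int) -> bool:
    """Static BTFNT: Backward Taken, Forward Not Taken."""
    return branch_offset < 0


class Gshare:
    """Textbook gshare: BHT indexed by PC bits XOR global history."""

    def __init__(self) -> None:
        self.history = 0
        # 2-bit saturating counters, initialized to "weakly not taken".
        self.counters = [1] * (1 << BHT_ADDR_BITS)

    def _index(self, pc: int) -> int:
        pc_bits = (pc >> 2) & ((1 << BHT_ADDR_BITS) - 1)
        # Align the history with the upper bits of the index before XORing.
        return pc_bits ^ (self.history << (BHT_ADDR_BITS - HISTORY_BITS))

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)
        mask = (1 << HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


# Usage: train on a branch at a hypothetical address that is always taken.
predictor = Gshare()
for _ in range(20):
    predictor.update(0x100, True)
print(predictor.predict(0x100))
```

The static scheme needs no state at all, which is consistent with its small LUT cost in the performance tables; gshare pays for its table and history register.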

Firmware

Firmware takes the form of two files: PROGROM.hex, which contains the code, and DATARAM.hex, which contains the initial values of variables. The included firmware ray-traces an image and sends it to the TTY (1000000 baud). It also measures the average CPI and a 'raystones' performance score (pixels/s/MHz).
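
The two metrics can be written out as follows. This is a sketch of the definitions implied by the units above, not the firmware's actual code; the function names and the sample numbers are made up for illustration.

```python
import math


def cpi(cycles: int, instructions: int) -> float:
    """Average cycles per retired instruction."""
    return cycles / instructions


def raystones(pixels: int, seconds: float, cpu_freq_mhz: float) -> float:
    """Score in pixels/s/MHz, as the unit in the text suggests."""
    return pixels / seconds / cpu_freq_mhz


def raystones_from_cycles(pixels: int, cycles: int) -> float:
    """Same score from a cycle count: the clock frequency cancels out,
    leaving pixels per million cycles."""
    return pixels * 1e6 / cycles


# Hypothetical run: 1000 pixels in 2 s at 100 MHz, i.e. 200 million cycles.
a = raystones(1000, 2.0, 100.0)
b = raystones_from_cycles(1000, 200_000_000)
print(a, b, math.isclose(a, b))
```

Because the frequency cancels, the score measures the core's per-cycle efficiency, which is why it can be compared across configurations that run at different clock speeds.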

Some precompiled firmware images are available in PRECOMPILED_FIRMWARE/<arch>/<progname>/PROGROM.hex and DATARAM.hex. To use one of them, copy PROGROM.hex and DATARAM.hex into TordBoyau/ (the same directory that contains soc.v) and re-synthesize (or launch the simulation).

Other firmware can be compiled; see learn-fpga, pipeline tutorial for more details. PROGROM.hex and DATARAM.hex are portable between both projects, just make sure you target the same instruction set (RV32I or RV32IM). You will also need to remove all the lines of zeroes after line 1024 in DATARAM.hex (the core in learn-fpga is configured with 64 kB of data RAM, whereas here it is 16 kB, which suffices for most examples).
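
Trimming DATARAM.hex by hand is error-prone, so here is a small Python sketch of that step. It assumes the file is plain text with hexadecimal digits on each line and nothing else (no address markers); the function names are hypothetical. It refuses to trim if anything after line 1024 is non-zero, since that would mean the firmware really needs more than 16 kB.

```python
from pathlib import Path


def trim_zero_tail(lines, keep_lines=1024):
    """Return the first keep_lines lines, checking the rest is all zeroes."""
    head, tail = lines[:keep_lines], lines[keep_lines:]
    for number, line in enumerate(tail, start=keep_lines + 1):
        digits = "".join(line.split())      # drop any whitespace
        if digits.strip("0"):               # something other than '0' remains
            raise ValueError(f"line {number} is not all zeroes: {line!r}")
    return head


def trim_dataram(src, dst, keep_lines=1024):
    """File wrapper around trim_zero_tail (paths are up to the caller)."""
    lines = Path(src).read_text().splitlines()
    kept = trim_zero_tail(lines, keep_lines)
    Path(dst).write_text("\n".join(kept) + "\n")
    return len(lines) - len(kept)           # number of lines removed


# Usage on in-memory data: 1024 data lines followed by 8 lines of zeroes.
sample = ["DEADBEEF"] * 1024 + ["00000000"] * 8
print(len(trim_zero_tail(sample)))
```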

Performance (RV32I) (A35T/Vivado)

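| Branch prediction | CoreMarks/MHz | DMips/MHz | Raystones | LUTs | FFs | MaxFreq |
|-------------------|---------------|-----------|-----------|------|-----|---------|
| none | 0.928 | 1.298 | 5.665 | 909 | 517 | 125 MHz |
| static (BTFNT) | 1.118 | 1.488 | 6.633 | 938 | 516 | 125 MHz |
| static + RAS | 1.147 | 1.528 | 6.795 | 1040 | 676 | 105 MHz |
| gshare | 1.124 | 1.562 | 7.186 | 1297 | 547 | 120 MHz |
| gshare + RAS | 1.153 | 1.606 | 7.375 | 1388 | 711 | 100 MHz |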
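
Since the per-MHz score and MaxFreq move in opposite directions, it can help to multiply them. The sketch below does this for the CoreMarks/MHz column of the RV32I table. It is a derived estimate, not a measurement: it assumes each configuration is clocked exactly at its reported MaxFreq and that the per-MHz score does not depend on the clock.

```python
# (CoreMarks/MHz, MaxFreq in MHz) copied from the RV32I table.
RV32I = {
    "none": (0.928, 125),
    "static (BTFNT)": (1.118, 125),
    "static + RAS": (1.147, 105),
    "gshare": (1.124, 120),
    "gshare + RAS": (1.153, 100),
}


def coremark_at_maxfreq(table):
    """Estimated absolute CoreMark score: score per MHz times MaxFreq."""
    return {name: per_mhz * mhz for name, (per_mhz, mhz) in table.items()}


scores = coremark_at_maxfreq(RV32I)
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name:16s} {score:7.2f}")
print("highest estimate:", best)
```

Under these assumptions the static BTFNT configuration comes out ahead, because the per-MHz gains of gshare and the RAS are offset by their lower MaxFreq.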

Performance (RV32IM) (A35T/Vivado)

| Branch prediction | CoreMarks/MHz | DMips/MHz | Raystones | LUTs | FFs | MaxFreq |
|-------------------|---------------|-----------|-----------|------|-----|---------|
| none | 2.387 | 1.341 | 15.296 | 1368 | 681 | < 80 MHz |
| static (BTFNT) | 2.763 | 1.545 | 16.097 | 1363 | 680 | < 80 MHz |
| static + RAS | 2.790 | 1.579 | 16.476 | 1478 | 840 | < 80 MHz |
| gshare | 2.837 | 1.597 | 17.753 | 1760 | 711 | < 80 MHz |
| gshare + RAS | 2.866 | 1.634 | 18.215 | 1801 | 875 | < 80 MHz |

  • Vivado reports that timing is not met even at 80 MHz, to be investigated...
  • However, in practice, it seems to work at 140 MHz with the largest configuration (gshare + RAS): CoreMarks and Dhrystones both validate correct operation, and RayStones generates the correct image.

Debugger / disassembler

Simulation can be started using BOARDS/run_verilator.sh. If CONFIG_DEBUG is set in soc.v, one can see the content of the pipeline stages, the hazards, register forwarding, branch prediction, and the return address stack. It is also possible to create "breakpoints" by defining the breakpoint signal in TordBoyau5.v (the default breakpoint is on TTY character display).

Sequential pipeline

A completely sequential version, TordBoyau5_sequential, is included. It has a state machine that executes each stage sequentially, without hazards or data forwarding. It is there to estimate an upper bound on the maximum frequency one can expect on a given FPGA. On the ARTY, it validates at 150 MHz (and still works at 160 MHz).

Documentation on the design

Next steps / TODO

  • Try to validate RV32IM at 140 MHz or so
    • Activating RAS makes maxfreq drop, to be investigated.
    • Activating RV32M makes maxfreq dramatically drop, to be investigated.
    • I don't have a Branch Target Buffer, I'm always computing the branch target, which may not be optimal.
  • RAM is loaded at the end of the Execute stage and written in the Mem stage. This may not be ideal (especially if it uses two ports of the BRAM).
  • The register bank is read at the beginning of Execute instead of Decode, which is not classical. On the positive side, the register forwarding muxes then only need to be three-way. On the negative side, it probably lengthens the critical path.
  • Write Amaranth glue code for LiteX, so that we can run Doom on it. Doom already works on the simpler non-pipelined FemtoRV cores. Here we need to adapt the LiteX cache and plug it onto PROGROM and DATARAM.
  • It seems that the alignment logic for loads and stores contributes to the critical path. A 6-stage pipeline may work better, to be tested.
  • Write scripts to synthesize using yosys and nextpnr-xilinx
  • Write scripts for other boards (ULX3S, OrangeCrab, ...)