Preprint: doi.org/10.31224/7181 (EngrXiv landing page)
This repository is the code-and-data supplement for the paper "Speculative GHR
Forwarding: Eliminating Stale Branch-Predictor State in Deep FPGA Pipelines"
(targeting ACM Transactions on Reconfigurable Technology and Systems, TRETS).
The write-up is in the ACM acmart format and split into the main paper
paper/main.tex and a companion
paper/supplement.tex holding the corroborating figures
and tables (area, power/efficiency, BRAM, published-cores, CoreMark detail, CPI
ablation, the inter-branch-distance and hazard tables, and the Mechanism-B
susceptibility screen). Both build against the same
paper/figures/ and paper/references.bib.
The paper's contribution is Speculative GHR Forwarding (SGF). When an in-order FPGA soft core is pipelined deeper to raise its clock, the branch predictor's global history register goes stale - the latency from a branch resolving to the next prediction grows, so closely-spaced branches are predicted from history that does not yet reflect the previous branch's outcome - and mispredictions inflate even though the predictor hardware is unchanged. SGF removes that penalty with a dual-GHR scheme whose per-branch checkpoint is forwarded through the pipeline's existing registers, requiring no reorder buffer. To measure the effect cleanly and isolate exactly what SGF corrects, we build five pipeline variants (4–8 stages) from identical RTL for every functional unit, vary only the staging, and synthesize all of them on a Xilinx Artix-7 (Vivado 2025.2, 10 synthesis directive combinations = 50 post-place-and-route runs):
- Speculative GHR Forwarding (SGF): a ROB-free realization of speculative branch history for an in-order FPGA soft core. SGF cuts 7-stage CoreMark mispredictions by 31.4% at a 0.7% area cost and with no frequency loss.
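The dual-GHR idea described above can be illustrated with a small behavioral model. This is a minimal sketch under stated assumptions, not the repo's RTL: the class name `DualGhr`, its method names, and the exact repair rule on a misprediction are hypothetical, chosen only to show a speculative history that is updated at predict time and a per-branch checkpoint that travels with the branch until it resolves.

```python
class DualGhr:
    """Behavioral sketch of a dual-GHR scheme (hypothetical, not the repo's RTL)."""

    def __init__(self, bits=6):
        self.mask = (1 << bits) - 1
        self.spec = 0       # speculative history: updated when a branch is predicted
        self.committed = 0  # resolved history: updated when a branch resolves

    def predict(self, predicted_taken):
        """Shift the predicted direction into the speculative GHR.

        Returns the checkpoint: the history this branch was predicted with,
        which the branch carries down the pipeline alongside its other state.
        """
        checkpoint = self.spec
        self.spec = ((self.spec << 1) | int(predicted_taken)) & self.mask
        return checkpoint

    def resolve(self, checkpoint, predicted_taken, actual_taken):
        """Record the real outcome; on a misprediction, repair the speculative GHR."""
        self.committed = ((self.committed << 1) | int(actual_taken)) & self.mask
        if predicted_taken != actual_taken:
            # Assumption: younger in-flight branches are flushed, so the
            # speculative history is rebuilt from the checkpoint plus the real outcome.
            self.spec = ((checkpoint << 1) | int(actual_taken)) & self.mask


ghr = DualGhr(bits=6)
cp_a = ghr.predict(True)        # branch A predicted taken
cp_b = ghr.predict(True)        # branch B is predicted before A resolves
ghr.resolve(cp_a, True, False)  # A turns out not taken; B is flushed
```

In this model branch B is predicted with history `0b000001`, which already includes A's predicted outcome. A baseline that updates its history only at resolve time would have predicted B from the stale value `0b000000`. After the misprediction is repaired, the speculative and committed histories agree again.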
This README is organized so a reviewer can verify each code-backed claim from the committed source, scripts, and result logs without re-running the full Vivado flow. Every number in the paper traces to a committed result log; the Claim-to-Code Map below is the index.
Authors: Devansh Joshi and Shafin Ula (The University of Texas at Austin)
The paper's claims are measured outcomes, not fabric- or predictor-independent
laws. They hold under one FPGA family (Artix-7 XC7A35T), one toolchain
(Vivado 2025.2), one in-order microarchitecture, the gshare predictor (with
measured generalization to tournament and TAGE), and -O1 compilation (with an
-O2 re-measurement confirming that Mechanism B and SGF's benefit persist).
Absolute frequency/area/power do not transfer without re-synthesis; what
transfers is the Mechanism-A/B decomposition and the ROB-free dual-GHR
construction. See paper/main.tex §Limitations.
- Xilinx Vivado 2025.2 (XSim for cycle-accurate simulation; post-place-and-route for F_max/area). The simulations are deterministic and single-clock-domain.
- riscv-none-elf-gcc 15.2.0 (xPack) - only needed to rebuild benchmark hex; prebuilt hex is committed under programs/.
- Python 3 with matplotlib (+ numpy) - only for regenerating figures.
Machines without a Vivado installation cannot re-run the synthesis and simulation flow, but all result logs and the figure-regeneration scripts are committed, so every numeric claim stays inspectable offline.
Portability / scratch directory. No absolute machine paths are baked into the repo - every driver script resolves the repo root from its own location ([info script]) and writes Vivado projects and temporaries to a git-ignored ./.vivado_work/ by default. Set the RISCV_WORK environment variable to redirect that scratch to a faster or larger disk. Committed result logs record only the testbench filename (no drive letters), so they reproduce identically on any machine.
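The path-resolution pattern described above can be sketched in Python for readers who want to follow the same convention in their own helper scripts. The actual drivers are Tcl; the function name `scratch_dir`, the assumption that scripts sit one level below the repo root, and the assumption that RISCV_WORK replaces the whole scratch path are all illustrative, not taken from the repo.

```python
import os
from pathlib import Path


def scratch_dir(script_path, env=None):
    """Return the scratch directory for a script (hypothetical helper).

    Assumptions: the script lives in <repo>/scripts/, and RISCV_WORK, when set,
    is used as the scratch directory itself.
    """
    env = os.environ if env is None else env
    repo_root = Path(script_path).resolve().parent.parent
    override = env.get("RISCV_WORK")
    return Path(override) if override else repo_root / ".vivado_work"


# Typical use from inside a script under scripts/:
work = scratch_dir(__file__) if "__file__" in globals() else None
```

Because the root is derived from the script's own location, nothing depends on the current working directory or on a machine-specific absolute path.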
Drive the experiments from the Vivado Tcl console (source <script>). All
driver scripts live in scripts/; the committed result log that
backs each paper section lives in results/, named in the comments
below.
# --- Synthesis: F_max + area, 10 directive combinations, all five depths ---
source scripts/synth_seeds.tcl ;# -> results/seed_results.log (two-tier F_max)
source scripts/synth_all.tcl ;# -> results/synth_results.log (area, DSP)
source scripts/synth_bram.tcl ;# BRAM instr-memory variants -> results/bram_synth_results.log
# --- Baseline CPI / misprediction measurements (Mechanism A vs B) ---
source scripts/run_coremark.tcl ;# -> results/coremark_official_results.log
source scripts/run_dhrystone.tcl ;# -> results/dhrystone_results.log
source scripts/run_diagnostic.tcl ;# -> results/diagnostic_results.log
source scripts/run_embench_official.tcl ;# -> results/embench_official_results.log
# --- Causal isolation of Mechanism B (the two controls) ---
source scripts/run_bimodal_coremark.tcl ;# no-GHR control -> results/bimodal_coremark_results.log
source scripts/run_pht_sweep_coremark.tcl ;# 32..1024 PHT sweep -> results/pht_sweep_coremark_results.log
# --- SGF and orthogonal techniques ---
source scripts/run_sgf_eval.tcl ;# SGF 7/8-stage bench + synth
;# -> results/sgf_benchmark_results.log, results/sgf_synth_results.log
source scripts/run_sgf_6stage.tcl ;# SGF on the 6-stage (recommended depth)
# --- Mechanism-B susceptibility screen (additional Embench-IoT kernels) ---
# Build the kernel hex first: bash scripts/build_extra_embench.sh
source scripts/run_extra_embench.tcl ;# huffbench, sglib-combined: baseline + 6/7/8-stage SGF
;# -> results/extra_embench_results.log
# --- Generalization to other predictor families (measured, 7-stage) ---
source scripts/run_tournament_7stage.tcl ;# tournament -> results/tournament_results.log, tournament_synth_results.log
source scripts/run_tage_7stage.tcl ;# downscaled TAGE -> results/tage_results.log, tage_synth_results.log
# --- Robustness and power ---
source scripts/run_O2_7stage.tcl ;# -O2 recompilation, 7-stage -> results/o2_7stage_results.log
source scripts/run_saif_workload.tcl ;# workload-specific (CoreMark) SAIF power, all depths
Note: long benchmark simulations generate large XSim waveform temporaries. The scripts delete each project directory after scraping its result log so the disk doesn't fill; keep waveform logging off for multi-million-cycle runs.
Regenerate all figures from the committed data (no Vivado needed):
python scripts/generate_plots_5depth.py # fmax, area, cpi, mispred, coremark_mhz, cpi_vs_depth (+ CPI-model fit)
python scripts/regen_throughput_published.py # throughput, published-cores sanity check
python scripts/regen_power_figs.py # power, efficiency (workload-SAIF)
python scripts/regen_bram_fig.py # LUT- vs BRAM-memory F_max
Every code-backed claim in the paper, mapped to the source / script / committed log that supports it.
| Paper claim | Code / artifact |
|---|---|
| Two-tier F_max (70–74 vs 115–118 MHz, p < 0.001, d = 18.5); execute-stage split is the only knee | scripts/synth_seeds.tcl, scripts/synth_all.tcl → results/seed_results.log, results/synth_results.log; fig scripts/regen_throughput_published.py |
| 6-stage is throughput- and energy-optimal on this fabric (64.1 MIPS, lowest EDP) | CPI from results/coremark_official_results.log (4/5/6-stage) and results/7s8s_rerun_results.log (canonical 7/8-stage, fixed-hex) × F_max above; fig scripts/regen_throughput_published.py |
| Depth splits into Mechanism A (all workloads) + Mechanism B (CoreMark +5.7%, statemate +10%) | 4/5/6-stage: results/coremark_official_results.log, results/embench_official_results.log, results/dhrystone_results.log, results/diagnostic_results.log; 7/8-stage baseline: results/7s8s_rerun_results.log (canonical 2,893,145-instr CoreMark; the 7/8-stage rows in *_official_results.log are a superseded pre-fixed-hex run) |
| Mechanism B requires a GHR (bimodal control: zero depth inflation, 85,493/63,272 at every depth) | rtl/branch_predictor_bimodal.sv → results/bimodal_coremark_results.log |
| Mechanism B is not PHT aliasing (32→1024 sweep, inflation persists) | results/pht_sweep_coremark_results.log |
| SGF: −31.4% CoreMark / −22.7% statemate, +0.7% area, no F_max loss | rtl/branch_predictor_sgf.sv, rtl_7stage/rv32i_pipeline_7stage_sgf_top.sv → results/sgf_benchmark_results.log, results/sgf_synth_10seeds_results.log (10-directive F_max) |
| SGF on the 6-stage (−23.9% CoreMark) | rtl/rv32i_pipeline_sgf_top.sv, scripts/run_sgf_6stage.tcl → results/sgf_6stage_results.log |
| Mechanism B reproduces on two further screened kernels (huffbench, sglib-combined; SGF −42% / −22% at 7-stage) | scripts/build_extra_embench.sh, scripts/run_extra_embench.tcl → results/extra_embench_results.log |
| Analytical CPI model R² = 0.977, R²_CV = 0.941 (LOWO) | scripts/generate_plots_5depth.py (fit + cpi_vs_depth); inputs are the benchmark logs above |
| Workload-SAIF power falls with depth (0.235→0.217 W, opposite to uniform-toggle) | scripts/run_saif_workload.tcl → results/saif_workload_results.log; fig scripts/regen_power_figs.py |
| SGF benefit survives -O2 (CoreMark −43.6%, statemate −12.0%; IBD stays < 10) | scripts/build_O2_benchmarks.sh, scripts/run_O2_7stage.tcl → results/o2_7stage_results.log |
| SGF generalizes to a 2nd predictor family: tournament (CoreMark −3.1%, statemate −10.5%, every workload down), +0.6% LUT / +0.5% FF, no F_max loss | rtl/branch_predictor_tournament.sv, rtl/branch_predictor_tournament_sgf.sv, rtl_7stage/rv32i_pipeline_7stage_tournament{,_sgf}_top.sv, scripts/run_tournament_7stage.tcl → results/tournament_results.log, results/tournament_synth_results.log |
| SGF on a 3rd family: TAGE (CoreMark −31.1%, statemate −11.1%; aha-mont64 +5.0% CPI-neutral edge case) - benefit generalizes but cost is +21% LUTs (multi-table), and the gshare confidence filter backfires here | rtl/branch_predictor_tage.sv, rtl/branch_predictor_tage_sgf.sv, rtl_7stage/rv32i_pipeline_7stage_tage{,_sgf}_top.sv, scripts/run_tage_7stage.tcl → results/tage_results.log, results/tage_synth_results.log, results/tage_filtered_results.log (confidence-filter variant) |
| BRAM instr-memory lets 7/8-stage near-match 6-stage F_max | rtl/rv32i_pipeline_bram_top.sv, rtl_7stage/rv32i_pipeline_7stage_bram_top.sv, scripts/synth_bram.tcl → results/bram_synth_results.log; fig scripts/regen_bram_fig.py |
| Determinism: identical instruction/branch counts + memory checksums across all depths | every benchmark log above; tb/rv32i_benchmark_tb.sv, tb/rv32i_riscv_tests_tb.sv |
| Functional correctness (37 riscv-tests + 24-point bench) | tb/rv32i_riscv_tests_tb.sv, tb/rv32i_comprehensive_tb.sv, scripts/run_all_tests.tcl → results/test_results.log |
A pipelined RV32IM core in SystemVerilog: all 40 RV32I + 8 RV32M instructions, M-mode CSRs / trap handling / MRET, a gshare predictor (64-entry PHT, 6-bit GHR) with a 32-entry BTB and 4-entry RAS, a direct-mapped I-cache, 3-source forwarding, and 64-bit performance counters. Validated against the 37-test riscv-tests suite and a 24-point comprehensive testbench; the 6-stage variant has been deployed on a Basys 3 at 100 MHz. Predictor and all functional units are held constant across depths - only the pipeline staging changes. Each split carries structural side effects (load-use stall, IF2 bubble, forwarding-source count); the paper makes these explicit and decomposes them in the CPI model rather than claiming depth varies in complete isolation.
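To make the predictor geometry concrete, here is a minimal sketch of gshare indexing for a 64-entry PHT and a 6-bit GHR. Which PC bits feed the XOR is an assumption (word-aligned PC, low six bits above the byte offset); the RTL in rtl/branch_predictor.sv is the authority on the real hashing.

```python
PHT_ENTRIES = 64  # from the description above
GHR_BITS = 6      # from the description above


def gshare_index(pc, ghr):
    """XOR selected PC bits with the global history to pick a PHT entry.

    Assumption: instructions are 4-byte aligned, so the two low PC bits are
    dropped and the next six bits are used.
    """
    pc_bits = (pc >> 2) & (PHT_ENTRIES - 1)
    history = ghr & ((1 << GHR_BITS) - 1)
    return pc_bits ^ history


same_branch_fresh = gshare_index(0x104, 0b101010)
same_branch_stale = gshare_index(0x104, 0b010101)
```

The same branch address lands on different PHT entries depending on the history value, which is why a history that lags behind the most recent outcomes can send a closely-spaced branch to the wrong counter.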
| Variant | Stages | Notes |
|---|---|---|
| 4-stage | IF, ID, EX, MEM/WB | merged MEM/WB (no load-use stall), longest critical path |
| 5-stage | IF, ID, EX, MEM, WB | classic RISC |
| 6-stage | IF, ID, EX1, EX2, MEM, WB | execute split → +56% F_max; throughput-optimal |
| 7-stage | IF1, IF2, ID, EX1, EX2, MEM, WB | fetch split (registered PC); 1-cycle IF2 bubble |
| 8-stage | IF1, IF2, ID, EX1, EX2, MEM1, MEM2, WB | memory split |
Variant tops with the contribution applied: SGF (*_sgf_top.sv). (The repo
also retains exploratory next-line-predictor tops, *_nlp_top.sv, which the paper
does not use: on this fabric the next-line predictor's IF1 redirect lowers
frequency below the deep tier, a net throughput loss.)
| Metric | 4-stg | 5-stg | 6-stg | 7-stg | 8-stg |
|---|---|---|---|---|---|
| F_max (MHz) | 70.4 | 74.0 | 117.3 | 115.0 | 117.5 |
| Slice LUTs | 7,219 | 7,429 | 7,631 | 7,766 | 7,841 |
| Slice Regs | 8,040 | 8,190 | 8,501 | 8,550 | 8,678 |
| CoreMark CPI | 1.54 | 1.63 | 1.83 | 2.02 | 2.18 |
| CoreMark MIPS | 45.7 | 45.4 | 64.1 | 56.9 | 53.9 |
SGF on the 7-stage: +49 LUT (+0.6%), +57 FF (+0.7%), F_max 117.5 vs 115.0 MHz (no degradation). SGF ships without a confidence filter - we measured one that fixes a zero-CPI crc32 edge case but costs ~11 points of the CoreMark benefit, so it is off by default and retained only as a build option (see the paper's crc32 / confidence-filter section).
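
The throughput row and the SGF overhead percentages follow directly from the other numbers quoted here, and a reader can cross-check them in a few lines. The sketch below uses only values copied from this README; the variable names are illustrative.

```python
# Values copied from the table above (depth -> value).
fmax_mhz = {4: 70.4, 5: 74.0, 6: 117.3, 7: 115.0, 8: 117.5}
cpi = {4: 1.54, 5: 1.63, 6: 1.83, 7: 2.02, 8: 2.18}

# Throughput in MIPS is clock frequency (MHz) divided by cycles per instruction.
mips = {depth: round(fmax_mhz[depth] / cpi[depth], 1) for depth in fmax_mhz}
best_depth = max(mips, key=mips.get)


def overhead_pct(extra, baseline):
    """Relative overhead in percent, rounded to one decimal place."""
    return round(100 * extra / baseline, 1)


# SGF on the 7-stage: +49 LUTs over 7,766 and +57 FFs over 8,550.
lut_overhead = overhead_pct(49, 7766)
ff_overhead = overhead_pct(57, 8550)

print(mips, best_depth, lut_overhead, ff_overhead)
```

The computed values reproduce the CoreMark MIPS row, confirm that the 6-stage variant has the highest throughput, and match the +0.6% LUT / +0.7% FF figures.
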
Seven workloads, all -O1, bare-metal: CoreMark (EEMBC), Dhrystone 2.1,
a mixed Diagnostic program, and four Embench-IoT kernels (aha-mont64,
crc32, statemate, edn). Sources and prebuilt hex are under
programs/ (asm/, c/, coremark/, embench/). Each reads the
hardware counters before/after and stores results to data memory for testbench
readout. CoreMark and statemate are additionally rebuilt at -O2, confirming
Mechanism B and SGF's benefit are not an -O1 artifact (results/o2_7stage_results.log).
A supplementary Mechanism-B susceptibility screen adds two further low-IBD,
data-dependent Embench-IoT kernels (huffbench, sglib-combined). Both are
Mechanism-B positive and SGF removes the inflation at the 6/7/8-stage, bringing
the count of workloads exhibiting Mechanism B to four. Build with
scripts/build_extra_embench.sh and run with
scripts/run_extra_embench.tcl → results/extra_embench_results.log.
paper/
main.tex main paper (acmart / ACM TRETS format)
supplement.tex supplementary material (corroborating figures + tables)
references.bib bibliography
figures/ all figures (.png + .pdf)
rtl/ shared functional units + 6-stage baseline + SGF/bimodal/BRAM tops
branch_predictor.sv gshare + BTB + RAS (baseline)
branch_predictor_sgf.sv dual-GHR speculative-forwarding predictor (SGF)
branch_predictor_bimodal.sv no-GHR control predictor (causal isolation)
rv32i_pipeline_top.sv 6-stage baseline
rv32i_pipeline_sgf_top.sv 6-stage SGF
alu / mdu / csr_unit / regfile / imm_gen / control / icache / imem / dmem / ...
rtl_4stage/ rtl_5stage/ depth-specific tops + forwarding/hazard units
rtl_7stage/ 7-stage baseline, SGF, NLP, SGF+NLP, BRAM tops + pipe_if1_if2
rtl_8stage/ 8-stage tops (incl. pipe_mem1_mem2)
tb/ single-cycle, pipeline, comprehensive, riscv-tests benches
programs/ asm/ , c/ , coremark/ , embench/ (sources + hex)
scripts/ ALL driver TCL (synthesis, baseline, causal, SGF, tournament,
TAGE, -O2, power) + regen_*.py figure scripts + synth .xdc
_synth_rtl/ flat RTL snapshot used by the synthesis/baseline harness
results/ ALL result logs (every paper number traces to one of these)
constraints/ Basys 3 XDC
tools/ hex disassembler, riscv-tests runner
The original work in this repository - the RV32IM RTL, testbenches, driver and
figure scripts, result logs, constraints, and paper sources - is released under
the MIT License (see LICENSE).
The benchmark programs under programs/ are not covered by MIT;
each retains its upstream license (CoreMark - EEMBC, Apache-2.0; Embench-IoT;
Dhrystone 2.1 - public domain). Only our bare-metal porting layer, linker
scripts, and derived .hex images are MIT-licensed. See NOTICE for
details.
If you use this work, please cite the paper (Joshi and Ula, "Speculative GHR Forwarding: Eliminating Stale Branch-Predictor State in Deep FPGA Pipelines").