Building an Out-of-Order Superscalar CPU Simulator
How I built a configurable superscalar CPU simulator in C++ to explore the design space of modern processors — and what the bottleneck analysis revealed.
Motivation
Modern CPUs are superscalar and out-of-order: they fetch, rename, issue, and retire multiple instructions per cycle, executing them out of program order while maintaining the illusion of sequential execution. Understanding *why* a particular microarchitectural configuration performs the way it does requires more than textbook knowledge — it requires building and measuring.
Architecture of the Simulator
The simulator models the key stages of an out-of-order pipeline:
Fetch & Decode
Up to WIDTH instructions are fetched per cycle from an instruction trace. Each instruction carries an opcode, destination register, and source registers.
Register Renaming
A rename map table (RMT) maps architectural registers to physical registers. The rename stage allocates entries in the Reorder Buffer (ROB) and updates the RMT. This eliminates false dependencies (WAR and WAW hazards).
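The renaming step can be sketched as follows. This is a minimal illustration, not the simulator's actual code: the class and field names (`RenameMap`, `free_list`, and so on) are hypothetical, and ROB allocation is omitted to focus on the mapping itself.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Simplified rename stage: maps architectural to physical registers.
// Names here are illustrative, not the simulator's actual identifiers.
struct RenameMap {
    std::vector<int> rmt;        // architectural -> physical register
    std::deque<int>  free_list;  // free physical registers

    RenameMap(int arch_regs, int phys_regs) : rmt(arch_regs) {
        // Initially, arch reg i lives in phys reg i; the rest are free.
        for (int i = 0; i < arch_regs; ++i) rmt[i] = i;
        for (int p = arch_regs; p < phys_regs; ++p) free_list.push_back(p);
    }

    struct Renamed { int src1, src2, dst; };

    // Rename one instruction: read sources through the RMT first, then
    // allocate a fresh physical register for the destination. This is
    // what removes WAR/WAW hazards: a later writer never reuses the
    // physical register an earlier reader still depends on.
    Renamed rename(int src1, int src2, int dst) {
        Renamed r;
        r.src1 = (src1 >= 0) ? rmt[src1] : -1;
        r.src2 = (src2 >= 0) ? rmt[src2] : -1;
        r.dst  = -1;
        if (dst >= 0) {
            assert(!free_list.empty() && "rename stalls when out of regs");
            r.dst = free_list.front();
            free_list.pop_front();
            rmt[dst] = r.dst;
        }
        return r;
    }
};
```

A WAW pair such as two back-to-back writes to `r2` now lands in two different physical registers, so both can be in flight at once.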
Issue Queue & Wakeup/Select
Instructions enter the issue queue and wait for their source operands to become ready. The wakeup logic broadcasts completing tags each cycle; the select logic picks up to WIDTH ready instructions for execution, respecting functional unit availability.
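One wakeup/select cycle can be sketched like this. The names are hypothetical and functional-unit availability checks are left out to keep the core loop visible; lowest index stands in for instruction age.

```cpp
#include <vector>

// Simplified issue-queue entry; tag fields hold physical-register ids.
struct IQEntry {
    int  tag1, tag2;      // source tags (-1 means no operand)
    bool ready1, ready2;  // operand availability
    bool valid;           // slot occupied
};

// Wakeup: broadcast each completing tag to every entry's sources.
void wakeup(std::vector<IQEntry>& iq, const std::vector<int>& done_tags) {
    for (auto& e : iq) {
        if (!e.valid) continue;
        for (int t : done_tags) {
            if (e.tag1 == t) e.ready1 = true;
            if (e.tag2 == t) e.ready2 = true;
        }
    }
}

// Select: pick up to `width` fully ready entries, oldest first.
// Returns the chosen entry indices; selected entries leave the queue.
std::vector<int> select(std::vector<IQEntry>& iq, int width) {
    std::vector<int> picked;
    for (int i = 0; i < (int)iq.size() && (int)picked.size() < width; ++i) {
        IQEntry& e = iq[i];
        bool r1 = (e.tag1 < 0) || e.ready1;
        bool r2 = (e.tag2 < 0) || e.ready2;
        if (e.valid && r1 && r2) {
            picked.push_back(i);
            e.valid = false;
        }
    }
    return picked;
}
```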
Execute
The simulator models multi-latency functional units: integer ALU (1 cycle), multiplier (3 cycles), and memory (variable latency depending on cache hit/miss).
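The latency classes above, plus one execute tick, can be sketched as follows. The ALU and multiplier latencies match the text; the memory hit/miss values here are illustrative placeholders, and the names are hypothetical.

```cpp
#include <vector>

enum class Op { IntAlu, Mul, MemHit, MemMiss };

// Latency table: ALU 1 cycle, multiplier 3 cycles, memory variable.
// The memory values (2 / 20) are illustrative, not the simulator's.
int latency(Op op) {
    switch (op) {
        case Op::IntAlu:  return 1;
        case Op::Mul:     return 3;
        case Op::MemHit:  return 2;
        case Op::MemMiss: return 20;
    }
    return 1;
}

// In-flight op: destination tag plus cycles until completion.
struct InFlight { int dst_tag; int cycles_left; };

// One execute tick: age every in-flight op and return the tags that
// complete this cycle (these feed the wakeup broadcast).
std::vector<int> execute_tick(std::vector<InFlight>& fu) {
    std::vector<int> done;
    for (auto it = fu.begin(); it != fu.end();) {
        if (--it->cycles_left == 0) {
            done.push_back(it->dst_tag);
            it = fu.erase(it);
        } else {
            ++it;
        }
    }
    return done;
}
```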
Retire
Instructions retire in program order from the ROB head, freeing physical registers and ROB entries.
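In-order retirement can be sketched with a circular-buffer ROB (names hypothetical): entries retire only from the head, and only once complete, so a finished younger instruction waits behind an unfinished older one.

```cpp
#include <vector>

// Simplified ROB entry: completion flag plus the previous physical
// register mapped to the destination, which is freed at retire.
struct ROBEntry { bool done; int prev_phys; };

struct ROB {
    std::vector<ROBEntry> buf;
    int head = 0, tail = 0, count = 0;
    explicit ROB(int size) : buf(size) {}

    bool full() const { return count == (int)buf.size(); }

    // Allocate at the tail; returns the entry index.
    int dispatch(int prev_phys) {
        buf[tail] = {false, prev_phys};
        int idx = tail;
        tail = (tail + 1) % (int)buf.size();
        ++count;
        return idx;
    }

    // Retire up to `width` completed entries from the head, in program
    // order; returns the physical registers that become free.
    std::vector<int> retire(int width) {
        std::vector<int> freed;
        while (count > 0 && width-- > 0 && buf[head].done) {
            freed.push_back(buf[head].prev_phys);
            head = (head + 1) % (int)buf.size();
            --count;
        }
        return freed;
    }
};
```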
Performance Optimization
The initial implementation was correct but slow. Profiling revealed:
- Issue queue search was O(n) per wakeup — I switched to a tag-indexed wakeup list so each broadcast touches only the entries actually waiting on that tag.
- ROB operations used a linked list — switching to a circular buffer with head/tail indices eliminated pointer chasing.
- Cache simulation had poor locality — restructuring the tag arrays to be contiguous in memory improved cache performance of the simulator itself.
These optimizations enabled running design-space sweeps across thousands of configurations in reasonable time.
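The tag-indexed wakeup from the first bullet can be sketched like this; the names (`WakeupList`, `subscribe`, `broadcast`) are illustrative. Instead of scanning every issue-queue entry per completing tag, each tag keeps a list of its waiting (entry, operand) pairs, so a broadcast touches only actual waiters.

```cpp
#include <utility>
#include <vector>

// Per-tag waiter lists replacing the O(n) issue-queue scan.
struct WakeupList {
    // waiters[tag] -> list of (iq_entry_index, operand_slot 0/1)
    std::vector<std::vector<std::pair<int, int>>> waiters;
    explicit WakeupList(int num_tags) : waiters(num_tags) {}

    // Register an operand as waiting on `tag` when it enters the IQ.
    void subscribe(int tag, int entry, int slot) {
        waiters[tag].push_back({entry, slot});
    }

    // Broadcast one completing tag; returns the operands made ready.
    // A physical register is written exactly once, so the list is spent.
    std::vector<std::pair<int, int>> broadcast(int tag) {
        std::vector<std::pair<int, int>> woken = std::move(waiters[tag]);
        waiters[tag].clear();
        return woken;
    }
};
```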
What the Bottleneck Analysis Showed
By sweeping WIDTH (1–8), ROB size (32–256), and IQ size (16–128), I found:
- ROB pressure dominates at narrow widths: a WIDTH=2 machine with ROB=32 sustains only 1.4 IPC because the ROB fills up with long-latency instructions.
- IQ pressure dominates at wide widths: a WIDTH=8 machine with IQ=16 can't find enough ready instructions to fill the pipeline.
- FU contention becomes the bottleneck once ROB and IQ are sufficiently large: adding a second multiplier unit improved IPC by 12% on multiply-heavy traces.
Key Insight
The simulator confirmed a fundamental principle of computer architecture: the bottleneck shifts as you scale different resources. There's no single "right" configuration — it depends on the workload mix and the area/power budget.