Loop Unrolling: Why It Behaves Completely Differently on Old CPUs, Modern CPUs, and GPUs
It’s crucial to distinguish between:
● Old CPUs (in-order, no renaming, weak branch prediction)
● Modern CPUs (OoO, register renaming, large ROB, advanced predictors)
● Modern GPUs (SIMT, no branch prediction, no dynamic renaming)
Each reacts very differently to loop unrolling.
✔ 1. On Old CPU Architectures, Loop Unrolling Did Improve Performance
Especially on:
● Intel 80386 / 80486
● Pentium (classic)
● Early MIPS / SPARC
● ARM7, ARM9
● Old consoles (PS1, PS2 Emotion Engine, Dreamcast SH4)
Why?
A) Branches were expensive
Branch prediction was either absent (80386, 80486, ARM7) or primitive (1-bit or 2-bit counters, as on the classic Pentium).
A loop with cmp + jnz per iteration caused:
● pipeline bubbles
● a pipeline flush whenever the taken branch was not predicted
● 3–6 cycle penalties (or more) each time
B) Loop overhead was expensive
Example on an in-order CPU:
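loop_top:
    ; ... loop body, which also updates ecx ...
    cmp ecx, 0       ; compare the loop counter with 0
    jnz loop_top     ; branch back while ecx != 0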
If each iteration did very little work, the loop overhead dominated.
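To make the overhead argument concrete, here is a minimal C sketch of the same reduction written rolled and unrolled by 4. The function names sum_rolled and sum_unrolled4 are made up for this illustration, and the code only shows where the branches go; it is not a benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rolled loop: one counter update and one branch per element. */
int32_t sum_rolled(const int32_t *a, size_t n)
{
    assert(a != NULL || n == 0);
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by 4: one counter update and one branch per four elements. */
int32_t sum_unrolled4(const int32_t *a, size_t n)
{
    assert(a != NULL || n == 0);
    int32_t sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; i++)
        sum += a[i];
    return sum;
}
```

Both functions return the same result; the unrolled one executes roughly a quarter of the loop branches. Note that a current optimizing compiler may unroll or vectorize the rolled version on its own, so the difference is mostly visible with hand-written assembly or with optimizations turned down.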
C) No register renaming
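Architectures had only a small set of architectural registers and no renaming hardware:
● 8 GPRs (x86)
● No OoO execution
● No massive PRF (Physical Register File)
Thus:
● no “rename pressure”
● more live values did not hurt performance, as long as they still fit in the architectural registers (otherwise values had to be spilled to memory)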
Therefore, on old CPUs:
Effect | Result
Fewer branches | ✔ Faster
Less loop overhead | ✔ Faster
More ILP even on simple pipelines | ✔ Faster
On old CPUs, loop unrolling almost always improved performance.
2×, 4×, sometimes even 8× unrolling could help.
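A rough way to see why 2× and 4× helped a lot while 8× helped less is a toy cost model. The sketch below is an assumption made for illustration, not a description of any real CPU: it charges a fixed number of cycles of work per element plus a fixed loop overhead per trip, and the function name cycles_per_element and both cycle parameters are hypothetical.

```c
#include <assert.h>

/* Toy model (assumed, not measured):
 *   body_cycles     = cycles of useful work per element
 *   overhead_cycles = cycles for counter update + compare + branch per trip
 *   unroll          = elements processed per trip
 * The overhead is paid once per trip, so it is divided by the unroll factor. */
double cycles_per_element(double body_cycles, double overhead_cycles,
                          unsigned unroll)
{
    assert(unroll > 0);
    return body_cycles + overhead_cycles / (double)unroll;
}
```

With an assumed 2 cycles of work and 4 cycles of overhead, the model gives 6 cycles per element rolled, 3 at 4× and 2.5 at 8×. Going from 1× to 4× halves the cost, while going from 4× to 8× gains much less, which matches the diminishing returns described above. The numbers are purely illustrative.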
✔ 2. On Modern Out-of-Order CPUs, the Behavior Is Very Different
Modern CPUs include:
● 160+ rename registers
● Large ROB (200–300+ entries)
● Advanced branch prediction (TAGE, loop predictors, LTarget)
● uop cache
● Macro-fusion
● Branch elimination
● Aggressive prefetchers
● Deep speculation
Now:
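• Loop control is almost free
The loop branch is predicted correctly on almost every iteration; usually only the final exit mispredicts.
The compare and the branch often macro-fuse into 1 µop.
The body is usually served from the uop cache, as long as it fits.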
• Unrolling too much becomes expensive
Over-unrolling increases pressure on:
● rename registers
● ROB
● scheduling windows
● execution ports
● L1I instruction cache and uop cache
This can cause stalls, backpressure, and slowdowns.
Moderate unrolling can still help, for example when it breaks a long dependency chain, but excessive unrolling is often slower on modern CPUs.
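One case where moderate unrolling tends to pay off on out-of-order cores is a reduction, where a single accumulator forces every add to wait for the previous one. The sketch below unrolls by 4 with four independent accumulators. The names dot_single and dot_acc4 are invented for this example, and the factor 4 is an arbitrary choice rather than a tuned value.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: each add depends on the result of the previous add. */
float dot_single(const float *x, const float *y, size_t n)
{
    assert((x != NULL && y != NULL) || n == 0);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Unrolled by 4 with four accumulators: four independent dependency
 * chains and only four extra live values. */
float dot_acc4(const float *x, const float *y, size_t n)
{
    assert((x != NULL && y != NULL) || n == 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; i++)
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}
```

Because floating-point addition is not associative, dot_acc4 can differ from dot_single in the last bits for general inputs. Pushing the same idea to 16 or 32 accumulators is where the register pressure discussed above starts to bite.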
✔ 3. On Modern GPUs (SIMT), the Rules Change Completely
GPUs have:
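● NO branch predictor
● NO out-of-order execution
● NO dynamic register renaming
Branches whose condition differs between threads of the same warp cause warp divergence: the warp executes both paths one after the other with part of its threads masked off, which is expensive. A branch that every thread of the warp takes the same way does not diverge, but the loop still pays its control overhead.
Thus:
✔ Unrolled + branchless > Loop with conditions
(Generally true on NVIDIA/AMD/Apple mobile GPUs)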
Shader compilers aggressively unroll loops for this reason.
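The transformation can be sketched in plain C, standing in here for shader code. TAPS, sum_positive_branchy, keep_positive and sum_positive_unrolled are hypothetical names for this illustration, and the fixed trip count of 4 is an assumption that makes full unrolling possible.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 4  /* assumed fixed trip count, known at compile time */

/* Loop with a data-dependent condition: on a GPU, threads of one warp
 * with different inputs may take different paths inside the loop. */
float sum_positive_branchy(const float v[TAPS])
{
    assert(v != NULL);
    float acc = 0.0f;
    for (int i = 0; i < TAPS; i++) {
        if (v[i] > 0.0f)
            acc += v[i];
    }
    return acc;
}

/* The condition written as a select: every thread runs the same
 * instructions and only the selected value differs. */
static float keep_positive(float x)
{
    return x > 0.0f ? x : 0.0f;
}

/* Fully unrolled and branchless version of the loop above. */
float sum_positive_unrolled(const float v[TAPS])
{
    assert(v != NULL);
    return keep_positive(v[0]) + keep_positive(v[1])
         + keep_positive(v[2]) + keep_positive(v[3]);
}
```

Whether the ternary actually becomes a select instruction is up to the compiler, and GPU compilers often predicate a short if like the one in the first function on their own. The sketch only shows the shape of code that the unroll-and-flatten approach aims for.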
✔ 4. FINAL ULTRA-PRECISE SUMMARY
Architecture | Does Unrolling Help? | Why?
Old CPU (80386, MIPS, PS1/PS2, ARM7) | ✔ Yes | Branches expensive, pipeline in-order, no renaming
Modern CPU (OoO x86, ARM big cores) | ⚠ Sometimes | Excessive unrolling increases PRF/ROB pressure & I-cache footprint
Modern GPU (SIMT) | ✔ Yes, strongly | Avoids divergence, branchless = faster, no dynamic renaming
On modern CPUs: Very large unrolling = more live values = pressure on PRF = stalls.