Branchless: Modern CPU vs GPU vs old CPU

Post by **Noiecity** » Fri Nov 28, 2025 3:48 am

I've been reading several shader books that argue in favor of branchless programming because of better latency and better performance in general... the same holds for older CPUs, but not for modern CPUs... in fact it often adds latency there... I asked ChadGPT about this and it answered something interesting:

Loop Unrolling: Why It Behaves Completely Differently on Old CPUs, Modern CPUs, and GPUs

It’s crucial to distinguish between:

● Old CPUs (in-order, no renaming, weak branch prediction)
● Modern CPUs (OoO, register renaming, large ROB, advanced predictors)
● Modern GPUs (SIMT, no branch prediction, no dynamic renaming)

Each reacts very differently to loop unrolling.

✔ 1. On Old CPU Architectures, Loop Unrolling Did Improve Performance

Especially on:

● Intel 80386 / 80486
● Pentium (classic)
● Early MIPS / SPARC
● ARM7, ARM9
● Old consoles (PS1, PS2 Emotion Engine, Dreamcast SH4)

Why?

A) Branch misprediction was catastrophic

Prediction was absent or extremely primitive (static prediction, or 1-bit/2-bit counters).

A loop with cmp + jnz per iteration caused:

● pipeline bubbles
● full pipeline flush
● 3–6 cycle penalties (or more)

B) Loop overhead was expensive

Example on an in-order CPU:

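    loop_top:             ; loop body goes here
        dec ecx           ; count down
        cmp ecx, 0        ; test the counter (redundant after dec, which already sets ZF)
        jnz loop_top      ; branch back while ecx != 0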

This could cost 3–5 cycles by itself.

If each iteration did very little work, the loop overhead dominated.
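To make the overhead argument concrete, here is a minimal C sketch of the same loop written rolled and unrolled by 4. The function names `sum_rolled` and `sum_unrolled4` are made up for this illustration. Whether a compiler emits exactly one compare-and-branch per trip depends on the compiler and flags, and an optimizing compiler may unroll or vectorize the rolled version on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Rolled: one counter update, compare and branch per element. */
uint32_t sum_rolled(const uint32_t *a, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}

/* Unrolled by 4: one counter update, compare and branch per four
   elements, plus a cleanup loop for the leftover 0..3 elements. */
uint32_t sum_unrolled4(const uint32_t *a, size_t n) {
    uint32_t s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++) {   /* remainder */
        s += a[i];
    }
    return s;
}
```

Both functions return the same sum; the only difference is how many times the loop-control instructions execute per element.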

C) No register renaming

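Architectures had only a few architectural registers and no renaming hardware behind them:

● 8 GPRs (x86)
● No OoO execution
● No massive PRF (Physical Register File)

Thus:

● no “rename pressure”
● more live values did not hurt performance, as long as they still fit in the architectural registers (on x86, with only 8 GPRs, anything beyond that spilled to the stack)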

Therefore, on old CPUs:

| Effect | Result |
|---|---|
| Fewer branches | ✔ Faster |
| Less loop overhead | ✔ Faster |
| More ILP even on simple pipelines | ✔ Faster |

Conclusion:

On old CPUs, loop unrolling almost always improved performance.

2×, 4×, sometimes even 8× unrolling could help.
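A rough back-of-the-envelope model shows why the gain flattens out as the unroll factor grows. The symbols and the numbers plugged in are assumptions for illustration, not measurements: $c_{\text{body}}$ is the cost of the useful work per element, $c_{\text{loop}}$ is the loop-control cost per trip (taken here as 4 cycles, inside the 3–5 cycle range mentioned above), and $U$ is the unroll factor.

```latex
% Simplified per-element cost on an in-order CPU (assumed model):
c(U) = c_{\text{body}} + \frac{c_{\text{loop}}}{U}

% Example with assumed values c_body = 2 cycles, c_loop = 4 cycles:
c(1) = 2 + \tfrac{4}{1} = 6, \qquad
c(2) = 2 + \tfrac{4}{2} = 4, \qquad
c(4) = 2 + \tfrac{4}{4} = 3, \qquad
c(8) = 2 + \tfrac{4}{8} = 2.5
```

Under these assumptions, going from 1× to 4× halves the per-element cost, while going from 4× to 8× only saves another half cycle, and the cost can never drop below $c_{\text{body}}$.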

✔ 2. On Modern Out-of-Order CPUs, the Trade-off Is Different

Modern CPUs include:

● 160+ rename registers
● Large ROB (200–300+ entries)
● Advanced branch prediction (TAGE, loop predictors, LTarget)
● µop cache
● Macro-fusion
● Branch elimination
● Aggressive prefetchers
● Deep speculation

Now:

• Loop control is almost free

The branch predictor rarely mispredicts a loop branch (typically only on the final iteration).

Loop control often fuses into 1 µop.

The body comes from the µop cache.

• Unrolling too much becomes expensive

Over-unrolling increases pressure on:

● rename registers
● ROB
● scheduling windows
● execution ports
● L1I instruction cache

This causes stalls, backpressure, and slowdowns.

Conclusion:

Moderate unrolling can help, but excessive unrolling is slower on modern CPUs.
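As a sketch of what "moderate unrolling" can look like, here is a dot product unrolled by 4 with four independent accumulators. The function name `dot_unrolled4` is hypothetical. The point is that each accumulator breaks the dependency chain a little more, but is also one more live value that needs a physical register, which is where the pressure described above comes from if the factor keeps growing.

```c
#include <stddef.h>

/* Dot product unrolled by 4 with four independent accumulators.
   The four chains s0..s3 do not depend on each other, so an
   out-of-order core can overlap them. Each one is a live value. */
float dot_unrolled4(const float *a, const float *b, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);   /* combine the partial sums */
    for (; i < n; i++) {               /* leftover 0..3 elements */
        s += a[i] * b[i];
    }
    return s;
}
```

Because floating-point addition is not associative, the result can differ in the last bits from a strictly sequential sum. Whether 4 accumulators is the right number on a given CPU is something to measure, not assume.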

✔ 3. On Modern GPUs (SIMT), the Rules Change Completely

GPUs have:

● NO branch predictor
● NO out-of-order execution
● NO dynamic register renaming

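Branches whose condition differs between threads of the same warp cause warp divergence, which is extremely expensive: the warp executes both sides one after the other, with the inactive lanes masked off.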
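The cost of divergence can be sketched with a toy model in plain C. This is not a real GPU API: `warp_branch_cost` is a made-up function, the 32-lane warp width is an assumption (warp or wavefront width varies by vendor), and real hardware has extra costs for reconvergence that are ignored here.

```c
#include <stdint.h>

/* Toy model of one if/else executed by a 32-lane warp (assumed width).
   taken_mask has one bit per lane; a set bit means that lane takes
   the 'then' side. Costs are in arbitrary units. */
unsigned warp_branch_cost(uint32_t taken_mask,
                          unsigned cost_then, unsigned cost_else) {
    const uint32_t all_lanes = 0xFFFFFFFFu;
    if (taken_mask == all_lanes) {
        return cost_then;             /* uniform: every lane takes 'then' */
    }
    if (taken_mask == 0u) {
        return cost_else;             /* uniform: every lane takes 'else' */
    }
    return cost_then + cost_else;     /* divergent: both sides run, lanes masked */
}
```

In this model a single lane disagreeing with the other 31 is enough to make the whole warp pay for both sides, while a branch that all lanes agree on costs only the side that is actually taken.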

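Thus:

✔ Unrolled + branchless > Loop with conditions

(Generally true on NVIDIA, AMD, Apple, and mobile GPUs when the condition varies between threads of a warp; a branch that is uniform across the warp is cheap.)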

Shader compilers aggressively unroll loops for this reason.
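For readers who have not seen it spelled out, this is roughly what "branchless" means at the code level, sketched in C rather than a shading language. The names `select_branchy` and `select_branchless` are hypothetical. Note that a compiler is free to turn the branchy version into a conditional move or select instruction by itself, so the two may compile to the same thing.

```c
#include <stdint.h>

/* Branchy form: written as a conditional. */
uint32_t select_branchy(int cond, uint32_t a, uint32_t b) {
    if (cond) {
        return a;
    }
    return b;
}

/* Branchless form: turn the condition into an all-ones or all-zeros
   mask and blend the two values with bitwise operations. */
uint32_t select_branchless(int cond, uint32_t a, uint32_t b) {
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);  /* 0xFFFFFFFF or 0 */
    return (a & mask) | (b & ~mask);
}
```

The branchless form always computes with both `a` and `b`, so it only pays off when evaluating both inputs is cheap compared with the cost of a mispredicted or divergent branch.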

✔ 4. FINAL ULTRA-PRECISE SUMMARY

| Architecture | Does Unrolling Help? | Why? |
|---|---|---|
| Old CPU (80386, MIPS, PS1/PS2, ARM7) | ✔ Yes | Branches expensive, pipeline in-order, no renaming |
| Modern CPU (OoO x86, ARM big cores) | ⚠ Sometimes | Excessive unrolling increases PRF/ROB pressure & I-cache footprint |
| Modern GPU (SIMT) | ✔ Yes, strongly | Avoids divergence, branchless = faster, no dynamic renaming |

On modern CPUs: Very large unrolling = more live values = pressure on PRF = stalls.