Branchless: Modern CPU vs GPU vs old CPU

Noiecity
Posts: 336
Joined: Wed Aug 23, 2023 7:22 pm


Post by Noiecity »

I've been reading several shader books that argue in favor of branchless programming because of better latency and better performance in general. The same was true for older CPUs, but not for modern CPUs, where it can actually add latency in many cases. I asked ChadGPT about this and it gave an interesting answer:

Loop Unrolling: Why It Behaves Completely Differently on Old CPUs, Modern CPUs, and GPUs


It’s crucial to distinguish between:

- Old CPUs (in-order, no renaming, weak branch prediction)
- Modern CPUs (OoO, register renaming, large ROB, advanced predictors)
- Modern GPUs (SIMT, no branch prediction, no dynamic renaming)

Each reacts very differently to loop unrolling.


✔ 1. On Old CPU Architectures, Loop Unrolling Did Improve Performance


Especially on:

- Intel 80386 / 80486
- Pentium (classic)
- Early MIPS / SPARC
- ARM7, ARM9
- Old consoles (PS1, PS2 Emotion Engine, Dreamcast SH4)

Why?

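A) Branches and mispredictions were expensive

Prediction was absent or very primitive: the 386 and 486 had no dynamic predictor at all, and later in-order designs used simple 1-bit or 2-bit counters.

A loop with cmp + jnz per iteration caused:

- pipeline bubbles on taken branches
- a full pipeline flush on each misprediction
- 3–6 cycle penalties (or more)

With a loop body of only a few cycles, that penalty was a large fraction of every iteration.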
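B) Loop overhead was expensive

Example on an in-order CPU:

Code:

    cmp ecx, 0       ; compare the loop counter against zero
    jnz loop_top     ; jump back to the top of the loop while ecx != 0 (label renamed: "loop" is itself an x86 mnemonic)

This pair could cost 3–5 cycles by itself.

If each iteration did very little work, the loop overhead dominated.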

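C) No register renaming

These architectures only had their architectural registers to work with:

- 8 GPRs (x86)
- No OoO execution
- No massive PRF (Physical Register File)

Thus:

- no “rename pressure”
- more live values did not hurt performance, as long as they still fit in the architectural registers (past that point the compiler had to spill to the stack)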
Therefore, on old CPUs:

Effect                             | Result
Fewer branches                     | ✔ Faster
Less loop overhead                 | ✔ Faster
More ILP even on simple pipelines  | ✔ Faster

📌 Conclusion:

On old CPUs, loop unrolling almost always improved performance.

2×, 4×, sometimes even 8× unrolling could help.
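
To make the transformation concrete, here is a minimal sketch in C of the same loop written rolled and manually unrolled 4×. The function names `sum_rolled` and `sum_unrolled4` are made up for illustration; the point is only that the unrolled version executes one compare-and-branch per four elements instead of one per element, which is the overhead the old in-order CPUs paid for so dearly.

```c
#include <stddef.h>
#include <stdint.h>

/* Rolled version: one compare-and-branch per element. */
uint32_t sum_rolled(const uint32_t *data, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

/* 4x unrolled version (hypothetical name): one compare-and-branch
 * per four elements, plus a short remainder loop for the leftovers. */
uint32_t sum_unrolled4(const uint32_t *data, size_t n)
{
    uint32_t sum = 0;
    size_t i = 0;

    /* Main loop: handles elements in groups of four. */
    for (; i + 4 <= n; i += 4) {
        sum += data[i];
        sum += data[i + 1];
        sum += data[i + 2];
        sum += data[i + 3];
    }

    /* Remainder loop: handles the last 0..3 elements. */
    for (; i < n; ++i)
        sum += data[i];

    return sum;
}
```

Both functions return the same result for any input; only the number of loop-control instructions executed differs. A modern optimizing compiler may well unroll the rolled version on its own, so the manual form mostly mattered with the simpler compilers of that era.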


✔ 2. On Modern Out-of-Order CPUs, the Trade-off Is Different


Modern CPUs include:

- 160+ rename registers
- Large ROB (200–300+ entries)
- Advanced branch prediction (TAGE, loop predictors, LTarget)
- uop cache
- Macro-fusion
- Branch elimination
- Aggressive prefetchers
- Deep speculation

Now:

• Loop control is almost free

The branch predictor rarely mispredicts a simple counted loop (usually at most once, on loop exit).

The compare and branch often macro-fuse into 1 µop.

A small loop body is served from the uop cache.

• Unrolling too much becomes expensive

Over-unrolling increases pressure on:

- rename registers
- ROB
- scheduling windows
- execution ports
- L1I instruction cache

This causes stalls, backpressure, and slowdowns.


📌 Conclusion:

Moderate unrolling can help, but excessive unrolling is slower on modern CPUs.
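
As a rough illustration of what "moderate" unrolling looks like, here is a sketch in C of a dot product with one accumulator versus four independent accumulators. The names `dot_1acc` and `dot_4acc` are hypothetical. Each extra accumulator is one more live value, so this is also where the register pressure described above comes from: going from 4 to 32 accumulators would not give 8× more parallelism, it would just need more registers than the core can usefully supply.

```c
#include <stddef.h>

/* Single accumulator: every add depends on the previous add,
 * so the loop is limited by the latency of that dependency chain. */
double dot_1acc(const double *a, const double *b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

/* 4x unrolled with four independent accumulators (hypothetical name).
 * The four chains do not depend on each other, so an out-of-order
 * core is free to overlap them. */
double dot_4acc(const double *a, const double *b, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }

    /* Combine the partial sums, then handle the last 0..3 elements. */
    double acc = (acc0 + acc1) + (acc2 + acc3);
    for (; i < n; ++i)
        acc += a[i] * b[i];

    return acc;
}
```

Note that the two versions add the products in a different order, so with general floating-point data the results can differ in the last bits. Whether the unrolled form is actually faster depends on the CPU and the compiler, so it is worth measuring rather than assuming.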


✔ 3. On Modern GPUs (SIMT), the Rules Change Completely


GPUs have:

- NO branch predictor
- NO out-of-order execution
- NO dynamic register renaming
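Branches cause warp divergence when threads in the same warp take different paths: the warp then executes both paths one after the other, which can be very expensive. A branch that every thread in the warp takes the same way does not diverge.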


Thus:

✔ Unrolled + branchless > Loop with conditions

(Generally true on NVIDIA/AMD/Apple mobile GPUs when the condition varies between threads in a warp)


Shader compilers aggressively unroll loops for this reason.
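
For readers who have not seen the "branchless" style the shader books recommend, here is a small sketch written in C rather than in a shading language, so it can be compiled anywhere. The helpers `step_f` and `mix_f` are hypothetical stand-ins modeled on the GLSL built-ins `step` and `mix`, and `select_branch` / `select_branchless` are names invented for this example.

```c
/* Stand-in for GLSL step(edge, x): 0.0 if x < edge, otherwise 1.0.
 * The comparison result (0 or 1) is converted to float directly,
 * so there is no if/else in the source. */
static float step_f(float edge, float x)
{
    return (float)(x >= edge);
}

/* Stand-in for GLSL mix(a, b, t): linear blend a*(1-t) + b*t. */
static float mix_f(float a, float b, float t)
{
    return a * (1.0f - t) + b * t;
}

/* Branching version: threads in a warp with different x values
 * may take different paths here. */
float select_branch(float edge, float x, float a, float b)
{
    if (x >= edge)
        return b;
    return a;
}

/* Branchless version: every thread runs the same instructions
 * and the condition only changes the blend weight (0 or 1). */
float select_branchless(float edge, float x, float a, float b)
{
    return mix_f(a, b, step_f(edge, x));
}
```

Both functions return the same value for finite inputs. For a branch this small, a shader compiler may emit a select instruction for the `if` version anyway, so the rewrite pays off mainly when each side of the branch does real work and the condition differs between neighboring threads.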


✔ 4. FINAL ULTRA-PRECISE SUMMARY

Architecture                          | Does Unrolling Help? | Why?
Old CPU (80386, MIPS, PS1/PS2, ARM7)  | ✔ Yes                | Branches expensive, pipeline in-order, no renaming
Modern CPU (OoO x86, ARM big cores)   | ⚠ Sometimes          | Excessive unrolling increases PRF/ROB pressure & I-cache footprint
Modern GPU (SIMT)                     | ✔ Yes, strongly      | Avoids divergence, branchless = faster, no dynamic renaming

On modern CPUs: Very large unrolling = more live values = pressure on PRF = stalls.