C++ Performance: Demystifying Cache Line Contention and alignas

Discussion about everything. New games, 3d math, development tips...
Post Reply
Noiecity
Posts: 372
Joined: Wed Aug 23, 2023 7:22 pm
Contact:

C++ Performance: Demystifying Cache Line Contention and alignas

Post by Noiecity »

I'm sharing an excerpt from a book I was reading, and another from a post on a site; I found them interesting enough to pass along.

C++ Performance: Demystifying Cache Line Contention and alignas:
https://code-examples.net/en/q/4be52cd/ ... nd-alignas

Optimizing Software in C++, by Agner Fog.
Pages 107-108, section "9.10 Cache contentions in large data structures":
Let's look at what happens inside the loop, for example when r = 28. We take the elements from row 28 below the diagonal and swap these elements with column 28 above the diagonal. The first eight elements in row 28 share the same cache line. But these eight elements will go into eight different cache lines in column 28 because the cache lines follow the rows, not the columns. Every fourth of these cache lines belong to the same set in the cache. When we reach element number 16 in column 28, the cache will evict the cache line that was used by element 0 in this column. Number 17 will evict number 1. Number 18 will evict number 2, etc. This means that all the cache lines we used above the diagonal have been lost at the time we are swapping column 29 with line 29. Each cache line has to be reloaded eight times because it is evicted before we need the next element. I have confirmed this by measuring the time it takes to transpose a matrix using example 9.9a on a Pentium 4 with different matrix sizes. The results of my experiment are given below. The time unit is clock cycles per array element.
---
Matrix size | Total size | Time per element (clock cycles)
   63x63    |    31 kB   |  11.6
   64x64    |    32 kB   |  16.4
   65x65    |    33 kB   |  11.8
  127x127   |   126 kB   |  12.2
  128x128   |   128 kB   |  17.4
  129x129   |   130 kB   |  14.4
  511x511   |  2040 kB   |  38.7
  512x512   |  2048 kB   | 230.7
  513x513   |  2056 kB   |  38.1
---
"The table shows that it takes 40% more time to transpose the matrix when the size of the matrix is a multiple of the level-1 cache size. This is because the critical stride is a multiple of the size of a matrix line. The delay is less than the time it takes to reload the level-1 cache from the level-2 cache because the out-of-order execution mechanism can prefetch the data.
The effect is much more dramatic when contentions occur in the level-2 cache. The level-2 cache is 512 kb, 8 ways. The critical stride for the level-2 cache is 512 kb / 8 = 64 kb. This corresponds to 16 lines in a 512x512 matrix. My experimental results in table 9.1 show that it takes six times as long to transpose a matrix when contentions occur in the level-2 cache as when contentions do not occur. The reason why this effect is so much stronger for level-2 cache contentions than for level-1 cache contentions is that the level-2 cache cannot prefetch more than one line at a time.
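The critical-stride arithmetic above is easy to check by hand. Here is a small sketch of my own (not from the book) that maps byte offsets to cache sets for the level-1 cache the text describes; I'm assuming the Pentium 4 figures of 8 kB, 4-way, 64-byte lines, which give 32 sets and a critical stride of 8192 / 4 = 2048 bytes, i.e. exactly four 512-byte rows of a 64x64 double matrix.

Code: Select all

```cpp
#include <cstddef>

// My assumptions, matching the text's Pentium 4 level-1 data cache:
// 8 kB, 4-way set associative, 64-byte lines.
constexpr std::size_t LINE  = 64;
constexpr std::size_t CACHE = 8192;
constexpr std::size_t WAYS  = 4;
constexpr std::size_t SETS  = CACHE / WAYS / LINE;  // 32 sets
constexpr std::size_t CRIT  = CACHE / WAYS;         // 2048-byte critical stride

// Cache set that a given byte offset maps to:
constexpr std::size_t setOf(std::size_t byteOffset) {
    return (byteOffset / LINE) % SETS;
}

// Byte offset of element a[r][c] in a 64x64 double matrix (one row = 512 bytes):
constexpr std::size_t offsetOf(std::size_t r, std::size_t c) {
    return (r * 64 + c) * sizeof(double);
}

// Rows 4 apart differ by 2048 bytes = the critical stride, so a[0][28],
// a[4][28], a[8][28], a[12][28] all land in the same set; with only 4 ways,
// touching a[16][28] evicts the line holding a[0][28], just as the text says.
static_assert(setOf(offsetOf(0, 28)) == setOf(offsetOf(4, 28)),  "same set");
static_assert(setOf(offsetOf(0, 28)) == setOf(offsetOf(16, 28)), "evicts row 0");
static_assert(setOf(offsetOf(0, 28)) != setOf(offsetOf(1, 28)),  "adjacent rows differ");
```

The static_asserts reproduce the book's claim that every fourth cache line in a column belongs to the same set.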
A simple way of solving the problem is to make the rows in the matrix longer than needed in order to avoid that the critical stride is a multiple of the matrix line size. I tried to make the matrix 512x520 and leave the last 8 columns unused. This removed the contentions and the time consumption was down to 36.
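For what it's worth, here is how I read the padding fix as code; this is a sketch of my own (the names N and STRIDE are mine), storing the 512x512 matrix with 520 doubles per row so the row stride of 520 * 8 = 4160 bytes no longer divides the 64 kB level-2 critical stride.

Code: Select all

```cpp
#include <utility>
#include <vector>

// Sketch of the padding idea: a 512x512 matrix stored in rows of 520
// doubles, last 8 columns unused, so column walks no longer cycle
// through the same few cache sets.
constexpr int N = 512;       // logical matrix size
constexpr int STRIDE = 520;  // padded row length in elements

void transpose(std::vector<double>& a) {
    for (int r = 1; r < N; r++)         // loop through rows
        for (int c = 0; c < r; c++)     // loop columns below diagonal
            std::swap(a[r * STRIDE + c], a[c * STRIDE + r]);
}
```

The indexing `r * STRIDE + c` is the only change from the simple method of example 9.9a; the unused columns cost a little memory in exchange for removing the contentions.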
There may be cases where it is not possible to add unused columns to a matrix. For example, a library of math functions should work efficiently on all sizes of matrices. An efficient solution in this case is to divide the matrix into smaller squares and handle one square at a time. This is called square blocking or tiling. This technique is illustrated in example 9.9b.

Code: Select all

// Example 9.9b
void transpose(double a[SIZE][SIZE]) {
    // Define macro to swap two elements:
    #define swapd(x,y) {temp=x; x=y; y=temp;}
    // Check if level-2 cache contentions will occur:
    if (SIZE > 256 && SIZE % 128 == 0) {
        // Cache contentions expected. Use square blocking:
        int r1, r2, c1, c2; double temp;
        // Define size of squares:
        const int TILESIZE = 8; // SIZE must be divisible by TILESIZE
        // Loop r1 and c1 for all squares:
        for (r1 = 0; r1 < SIZE; r1 += TILESIZE) {
            for (c1 = 0; c1 < r1; c1 += TILESIZE) {
                // Loop r2 and c2 for elements inside square:
                for (r2 = r1; r2 < r1+TILESIZE; r2++) {
                    for (c2 = c1; c2 < c1+TILESIZE; c2++) {
                        swapd(a[r2][c2], a[c2][r2]);
                    }
                }
            }
            // At the diagonal there is only half a square.
            // This triangle is handled separately:
            for (r2 = r1+1; r2 < r1+TILESIZE; r2++) {
                for (c2 = r1; c2 < r2; c2++) {
                    swapd(a[r2][c2], a[c2][r2]);
                }
            }
        }
    }
    else {
        // No cache contentions. Use simple method.
        // This is the code from example 9.9a:
        int r, c; double temp;
        for (r = 1; r < SIZE; r++) {    // loop through rows
            for (c = 0; c < r; c++) {   // loop columns below diagonal
                swapd(a[r][c], a[c][r]); // swap elements
            }
        }
    }
}
This code took 50 clock cycles per element for a 512x512 matrix in my experiments.
Contentions in the level-2 cache are so expensive that it is very important to do something about them. You should therefore be aware of situations where the number of columns in a matrix is a high power of 2. Contentions in the level-1 cache are less expensive. Using complicated techniques like square blocking for the level-1 cache may not be worth the effort.
Square blocking and similar methods are further described in the book "Performance Optimization of Numerically Intensive Codes", by S. Goedecker and A. Hoisie, SIAM 2001."
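The alignas part of the thread title comes from the linked article, which is about a related problem: two threads writing variables that happen to share one cache line (false sharing). A minimal sketch of the usual fix, entirely my own illustration and assuming a 64-byte cache line:

Code: Select all

```cpp
#include <thread>
#include <utility>

// Sketch (my own, assuming 64-byte cache lines): alignas(64) gives each
// counter its own cache line, so two threads incrementing different
// counters do not invalidate each other's line on every write.
struct alignas(64) PaddedCounter {
    long value = 0;
};

static_assert(alignof(PaddedCounter) == 64, "each counter starts on its own line");
static_assert(sizeof(PaddedCounter) == 64,  "struct is padded to a full line");

// Increment two counters from two threads and return the totals:
std::pair<long, long> runCounters(int iters) {
    PaddedCounter counters[2];
    std::thread t0([&] { for (int i = 0; i < iters; i++) counters[0].value++; });
    std::thread t1([&] { for (int i = 0; i < iters; i++) counters[1].value++; });
    t0.join();
    t1.join();
    return {counters[0].value, counters[1].value};
}
```

Without the alignas, the two longs would typically sit 8 bytes apart on the same line and the cores would ping-pong that line between them; the results are identical either way, only the speed differs.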


edit: I don't know in what context or how to apply what is said here, but I found it interesting.
Irrlicht is love, Irrlicht is life, long live to Irrlicht