Page 1 of 1

Debate over CPU architecture (cache coerency) and Peterson

Posted: Thu Mar 05, 2015 11:09 pm
by REDDemon
Today we had a small debate about CPU architecture in the forum of our classroom :D

Basically the teacher proposed a problem:

The following code implements Peterson algorithm, it is prooven to be correct but on a modern x86 processor does not work (code given by the teacher):

Code: Select all

 
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
 
static volatile int flag1 = 0;
static volatile int flag2 = 0;
static volatile int turn = 1;
static volatile int gSharedCounter = 0;
int gLoopCount;
 
void proc1() {
    flag1 = 1;
    turn = 2;
    while((flag2 == 1) && (turn == 2)) ;
    // Critical section
    gSharedCounter++;
    // Let the other task run
    flag1 = 0;
}
 
void proc2() {
    flag2 = 1;
    turn = 1;
    while((flag1 == 1) && (turn == 1)) ;
    // critical section
    gSharedCounter++;
    // leave critical section
    flag2 = 0;
}
 
//
// Tasks, as a level of indirection
//
void *task1(void *arg) {
    int i;
    printf("Starting task1n");
    for(i=0;i<gLoopCount;i++) proc1();
}
 
void *task2(void *arg) {
    int i;
    printf("Starting task1n");
    for(i=0;i<gLoopCount;i++) proc2();
}
int
main(int argc, char ** argv)
{
    int loopCount = 0;
    pthread_t dekker_thread_1;
    pthread_t dekker_thread_2;
    void * returnCode;
    int result;
    int expected_sum;
    
    /* Check arguments to program*/
    if(argc != 2)
    {
        fprintf(stderr, "USAGE: %s <loopcount>n", argv[0]);
        exit(1);
    }
    
    /* Parse argument */
    loopCount = atoi(argv[1]);
    gloopCount = loopCount;
    expected_sum = 2*gloopCount;
    
    /* Start the threads */
    result = pthread_create(&dekker_thread_1, NULL, task1, NULL);
    result = pthread_create(&dekker_thread_2, NULL, task2, NULL);
    
    /* Wait for the threads to end */
    result = pthread_join(dekker_thread_1,&returnCode);
    result = pthread_join(dekker_thread_2,&returnCode);
    printf("Both threads terminatedn");
    
    /* Check result */
    if( gSharedCounter != expected_sum ) {
        printf("[-] Sum %d rather than %d.n", gSharedCounter, expected_sum);
        return 1;
        } else {
        printf("[+] Okn");
        return 0;
    }
}
 
the above algorithm infact does not work, and prints "Sum RANDOM NUMBER rather than 1000000000" (when input of program for gloopCount is 500000000).

However he asked to investigate the problem and find possible solutions.

Here's mine (c++11):

Code: Select all

 
#include <thread>
#include <limits>
#include <iostream>
#undef NDEBUG
#include <cassert>
 
// make globals local to this file (do no interfere with other tests namespace)
namespace {
       
        alignas(4) static volatile int flag1 = 0;
        alignas(4) static volatile int flag2 = 0;
        alignas(4) static volatile int turn = 1;
        alignas(4) static volatile int gSharedCounter = 0;
        alignas(4) int gLoopCount = 500000000;
       
        void proc1() {
                flag1 = 1;
                turn = 2;
                while((flag2 == 1) && (turn == 2)) ;
               
                // ENTER
                ++gSharedCounter;
                // LEAVE
               
                flag1 = 0;
        }
 
        void proc2() {
                flag2 = 1;
                turn = 1;
                while((flag1 == 1) && (turn == 1)) ;
               
                // ENTER
                ++gSharedCounter;
                // LEAVE
               
                flag2 = 0;
        }
       
        void task1(){
                for(alignas(4) int i = 0; i<gLoopCount; i++)
                        proc1();
               
        }
       
        void task2(){
                for(alignas(4) int i = 0; i<gLoopCount; i++)
                        proc2();
        }
       
}
 
 
// Runned by CMake Test runner. See CMakeLists.txt
int main(int argc, char ** argv){
        std::cout<<"running peterson with iterations: "<<gLoopCount<<std::endl;
        std::thread t1(task1);
        std::thread t2(task2);
        t2.join();
        t1.join();
        assert(gLoopCount*2==gSharedCounter);
        return 0;
}
the real change in the code is that I use "alignas(4)". But in theory the compiler should already align integers to 4 bytes and hence to cache lines

The debate is that according to teacher's assistant the problem is not alignment of variables and infact applying "alignas(4)" to the original C code does not make the algorithm working.(however I read alignas of C is different from C++.. or that is false?)

So here we have 2 different versions:
1) not working C version (indipendently of the fact someone used alignas macro)
2) working C++ version (however people here a bit lazy, no one tested running my code, so it could be a peculiarity of my CPU)

Wanna join the debate?

Could be possible that code of teacher does not work because we are running it inside a VM? (well I don't know in wich environment the teacher's assistent runned the code, but at least me I runned the first algorithm on a Linux VM (because using pthreads), while I runned the 2nd algoirthm under windows Vista without using any Virtual env. How I love no one tried to run my code. ^^)

Little warning, the test requires ~2 minutes to end

EDIT: another intersting thing.

I was compiling using "-O3" with that flag my code works nice, without it doesn't work (maybe it depends on MMX registers too?)

Re: Debate over CPU architecture (cache coerency) and Peters

Posted: Fri Mar 06, 2015 12:03 am
by mongoose7
You can't do this:

Code: Select all

 
    flag1 = 1;
    turn = 2;
    while((flag2 == 1) && (turn == 2)) ;
BTW Intel CPUs are always cache coherent I believe. They use the I2 bus I think. Other architectures may only force cache coherency in synchronisation primitives.

Re: Debate over CPU architecture (cache coerency) and Peters

Posted: Fri Mar 06, 2015 7:24 am
by REDDemon
Apparently with -O3 optimization I can do that and it works, but why?

As far as i know all modern (x86) cpus are cache coerent. probably I'm wrong, but I do not understand why -O3 makes a difference here (the algorithm should never work, regardless of the architecture)

Re: Debate over CPU architecture (cache coerency) and Peters

Posted: Fri Mar 06, 2015 9:08 am
by hendu
(Am I doing your homework for you?)

It would only be correct in a non-optimizing compiler. The C standard allows the compiler to optimize the sets out, reorder them, or any other such transformation. Volatile alone is not what you think it is, the compiler is not thread aware.

Re: Debate over CPU architecture (cache coerency) and Peters

Posted: Sat Mar 07, 2015 3:56 am
by mongoose7
I had a loooong look into this. It appears that coherence is a subjective matter. It means that you can interleave operations from the two threads in such a way that the observation by each thread is consistent. The sample code is:

Code: Select all

void proc1() {
    flag1 = 1;
    turn = 2;
    while((flag2 == 1) && (turn == 2)) ;
    // Critical section
    gSharedCounter++;
    // Let the other task run
    flag1 = 0;
}
But flag1 is never read, so the following is actually cache consistent.

Code: Select all

void proc1() {
    turn = 2;
    while((flag2 == 1) && (turn == 2)) ;
    // Critical section
    gSharedCounter++;
    // Let the other task run
    flag1 = 1;
    flag1 = 0;
}
What is needed here is a fence.

Re: Debate over CPU architecture (cache coerency) and Peters

Posted: Sun Mar 22, 2015 3:15 pm
by REDDemon
yup worked with fence. The reason why It won't worked without it was really complex

Re: Debate over CPU architecture (cache coerency) and Peters

Posted: Mon Mar 23, 2015 1:24 am
by mongoose7
It did appear to defy logic. However, you probably cannot ask the cache consistency code to solve multithreading issues for you. I thought that each processor could update each others cache, but in case of a conflict you cannot signal the other processor that the update has been ignored (and the data at that location is not what you wrote). It starts to look like multithreaded coding in hardware. Your example was interesting and instructive.

Re: Debate over CPU architecture (cache coerency) and Peters

Posted: Sun Mar 29, 2015 5:26 pm
by REDDemon
Thanks, but the example came from my teacher :D. I just wrote my first spinlock and workstealing queue :O__