Slide Set 11 for ENCM 501 Winter 2015 Lecture Section 01

Steve Norman, PhD, PEng
Electrical & Computer Engineering
Schulich School of Engineering
University of Calgary
Winter Term, 2015


Contents

Race conditions
Introduction to Pthreads programming
Locks


Race conditions

A race condition is a situation in which the evolution of the state of some system depends critically on the exact order of nearly-simultaneous events in two or more subsystems.

Race conditions can exist as a pure hardware problem in the design of sequential logic circuits.

Race conditions can also arise at various levels of parallelism in software systems, including instruction-level parallelism and thread-level parallelism.


TLP race condition example

Two threads are running in the same loop, both reading and writing a global variable called counter ...

    Thread A:                      Thread B:
    while ( condition ) {          while ( condition ) {
        do some work                   do some work
        counter++;                     counter++;
    }                              }

Before considering the race condition in a multicore system, let's address this question: What is the potential failure if this program is running in a uniprocessor system?

Imagine a uniprocessor MIPS64 system, with 100(R28) being the address of counter, and time flowing down the page ...

    Thread A                      Thread B
    LD     R8, 100(R28)           [suspended]
    DADDIU R8, R8, 1
            --- context switch ---
    [suspended]                   LD     R8, 100(R28)
                                  DADDIU R8, R8, 1
                                  SD     R8, 100(R28)
                                  ...
            --- context switch ---
    SD     R8, 100(R28)           [suspended]
    ...

The value of counter is supposed to increase by 2, but will actually increase by only 1.


TLP race condition example, continued

The same defective program, now with the threads running simultaneously in two cores ...

    Thread A:                      Thread B:
    while ( condition ) {          while ( condition ) {
        do some work                   do some work
        counter++;                     counter++;
    }                              }

Again assume MIPS64, with 100(R28) being the address of counter ...

    Thread A                      Thread B
    ...                           ...
    LD     R8, 100(R28)
    DADDIU R8, R8, 1              LD     R8, 100(R28)
    SD     R8, 100(R28)           DADDIU R8, R8, 1
    ...                           SD     R8, 100(R28)
                                  ...

Thread A loses the race to write to counter before Thread B reads from counter. Again the value of counter is supposed to increase by 2, but will actually increase by only 1. A cache coherency protocol such as MSI will not prevent failure!

In fact, MSI might even fail to help in this situation, in which Thread A starts to write before Thread B starts to read ...

    Thread A                      Thread B
    ...                           ...
    LD     R8, 100(R28)
    DADDIU R8, R8, 1
    SD     R8, 100(R28)           LD     R8, 100(R28)
    ...                           DADDIU R8, R8, 1
                                  SD     R8, 100(R28)
                                  ...

Due to latency in a snoop-based implementation of MSI, Thread B could do its load before an invalidate command from Thread A's store is processed.


A memory-register ISA does not solve race condition problems

This x86-64 instruction ...

    addq $1, (%rax)

... does a memory read, an addition, and a memory write. Why can't that be used to eliminate the problem we saw with MIPS64 instructions?
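
To see the lost-update failure concretely, here is a minimal runnable C sketch of the same race, written with Pthreads. It is not from the course materials; the iteration count and names are illustrative, and it should be built without optimization (for example, gcc -O0 -pthread) so the compiler keeps the load-add-store sequence for counter++.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000

    long counter = 0;               /* shared and unprotected: the bug */

    void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < N; i++)
            counter++;              /* load, add, store: not atomic */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Expected 2 * N, but lost updates usually make it smaller. */
        printf("counter = %ld (expected %ld)\n", counter, 2L * N);
        return 0;
    }

On a multicore machine the printed value is almost always well short of 2000000, for exactly the reasons shown in the instruction interleavings above.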
Making the two-thread program safe

    Thread A:                      Thread B:
    while ( condition ) {          while ( condition ) {
        do some work                   do some work
        acquire lock                   acquire lock
        counter++;                     counter++;
        release lock                   release lock
    }                              }

Setting up some kind of lock (often called a mutual-exclusion lock, or mutex) prevents the failures seen in the past two slides.

Let's make some notes about how this would work in a uniprocessor system.

What about the multicore case? The lock will prevent Thread B from reading counter while Thread A is doing a load-add-store update of counter. Is that enough? Consider the MSI protocol again. What must happen either as a result of Thread A releasing the lock, or as a result of Thread B acquiring the lock?


ISA and microarchitecture support for concurrency

Instruction sets have to provide special atomic instructions to allow software implementation of synchronization facilities such as mutexes. An atomic instruction (or a sequence of instructions that is intended to provide the same kind of behaviour, such as MIPS LL/SC) typically works like this:

- a memory read is attempted at some location;
- some kind of write data is generated;
- a memory write to the same location is attempted.

The key aspects of atomic instructions are:

- the whole operation succeeds or the whole operation fails, in a clean way that can be checked after the attempt was made;
- if two or more threads attempt the operation, such that the attempts overlap in time, one thread will succeed, and all the other threads will fail.


Introduction to Pthreads programming

Pthreads (short for POSIX threads) is a specification for a C library to support multi-threaded programming.

Even if you don't program directly with Pthreads in practical projects, it's good to have a basic understanding of it, because it's often used as "infrastructure" for higher-level systems for multi-threaded programming, such as OpenMP or the C++11 <thread> library.

Please see Assignment 10 for details about types, function prototypes, and so on!


Adding up array elements using threads

Here's one way to add up a[0], a[1], ..., a[n-1] using a single thread:

    double sum(double *a, int n)
    {
        double result = 0.0;
        double *p, *end;
        end = a + n;
        for (p = a; p != end; p++)
            result += *p;
        return result;
    }

Note the use of the pointer end to indicate the address of the first chunk of memory we don't want to read with the expression *p.

Suppose we want to work in parallel with 4 threads. We can divide the problem into Problem 0, Problem 1, Problem 2, and Problem 3, each adding up roughly 1/4 of the array elements ...

    Problem 0: from a            up to, not including, a + n/4
    Problem 1: from a + n/4      up to, not including, a + n/2
    Problem 2: from a + n/2      up to, not including, a + 3*n/4
    Problem 3: from a + 3*n/4    up to, not including, end (that is, a + n)

This approach works for any number of threads; 4 is just an example.
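
Here is a rough sketch of how that division of work can be coded with Pthreads. It is not the tutorial solution mentioned below; the struct and function names are illustrative, and n is assumed to be divisible by the number of threads to keep the index arithmetic simple.

    #include <pthread.h>

    #define NTHREADS 4

    struct task {
        const double *start;    /* first element for this thread */
        const double *end;      /* one past the last element     */
        double partial;         /* this thread's result          */
    };

    void *sum_chunk(void *arg)
    {
        struct task *t = arg;
        double s = 0.0;
        for (const double *p = t->start; p != t->end; p++)
            s += *p;
        t->partial = s;         /* safe without a lock: each thread
                                   writes only its own struct task */
        return NULL;
    }

    double sum_parallel(const double *a, int n)
    {
        pthread_t ids[NTHREADS];
        struct task tasks[NTHREADS];
        int chunk = n / NTHREADS;   /* assumes NTHREADS divides n */

        for (int i = 0; i < NTHREADS; i++) {
            tasks[i].start = a + i * chunk;
            tasks[i].end = a + (i + 1) * chunk;
            pthread_create(&ids[i], NULL, sum_chunk, &tasks[i]);
        }

        double result = 0.0;
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(ids[i], NULL);   /* wait for worker i */
            result += tasks[i].partial;   /* then combine */
        }
        return result;
    }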
The solution to the April 9 tutorial shows how to get threads started working on Problem 0, Problem 1, etc., in parallel, wait for the worker threads to complete, then combine the results into an overall result.

Let's look at some performance numbers for adding up an array of 100 million doubles on a quad-core Core i7.


Locks

Let's return to the problem of using locks to avoid race conditions. Before that, though, here's a question: Why is it not necessary to use locks when adding up array elements with threads?


Using locks to manage access to shared data

    Thread A:                      Thread B:
    while ( condition ) {          while ( condition ) {
        do some work                   do some work
        acquire lock                   acquire lock
        counter++;                     counter++;
        release lock                   release lock
    }                              }

Correct updates to counter will be done by both threads, because only one thread can have the lock at any given time. If A has the lock and B tries to acquire the lock, B will have to wait.


Example of locking with a Pthreads mutex

This C code sets up a global variable of type pthread_mutex_t, and initializes it to the unlocked state ...

    pthread_mutex_t the_lock = PTHREAD_MUTEX_INITIALIZER;

Functions pthread_mutex_lock and pthread_mutex_unlock can be used to acquire and release the lock, as seen on the next slide.

Note the abstraction here: the interface is designed to hide ISA and microarchitecture details from the programmer.

    pthread_mutex_t the_lock = PTHREAD_MUTEX_INITIALIZER;

    void * foo(void * arg)
    {
        while (/* some condition */) {
            /* Do some work. */

            // Get the lock.
            // If another thread has the lock, wait.
            pthread_mutex_lock(&the_lock);

            /* Access shared memory. */

            // Release the lock.
            pthread_mutex_unlock(&the_lock);
        }
        /* ... */
    }


How NOT to set up a lock

    int my_lock = 0;  // 0: unlocked; 1: locked.

    void get_lock(void)
    {
        // Keep trying until my_lock == 0.
        while (my_lock != 0)
            ;
        // Aha! The lock is available!
        my_lock = 1;
    }

Why is this approach totally useless?


ISA and microarchitecture support for concurrency

Instruction sets have to provide special atomic instructions to allow software implementation of synchronization facilities such as mutexes (locks) and semaphores.

An atomic RMW (read-modify-write) instruction (or a sequence of instructions that is intended to provide the same kind of behaviour, such as MIPS LL/SC) typically works like this:

- a memory read is attempted at some location;
- some kind of write data is generated;
- a memory write to the same location is attempted.

The key aspects of atomic RMW instructions are:

- the whole operation succeeds or the whole operation fails, in a clean way that can be checked after the attempt was made;
- if two or more threads attempt the operation, such that the attempts overlap in time, one thread will succeed, and all the other threads will fail.
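
The trouble with the get_lock function above is that nothing stops two threads from both seeing my_lock == 0 and then both setting it to 1: the test and the set are separate, non-atomic steps. For contrast, here is a hedged sketch of a spinlock that does work, using C11 <stdatomic.h>, which compilers implement with exactly the kind of atomic RMW instruction just described. This is only a sketch, not how Pthreads mutexes are actually implemented, and the names are illustrative.

    #include <stdatomic.h>

    atomic_flag my_lock = ATOMIC_FLAG_INIT;   // clear means unlocked

    void get_lock(void)
    {
        // atomic_flag_test_and_set atomically sets the flag and
        // returns its previous value.  Only a thread that saw
        // "previously clear" has acquired the lock; all others
        // keep spinning.
        while (atomic_flag_test_and_set(&my_lock))
            ;   // spin: another thread holds the lock
    }

    void release_lock(void)
    {
        atomic_flag_clear(&my_lock);
    }

The key difference from the broken version: the test and the set happen as one indivisible operation, so two overlapping attempts cannot both succeed.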
MIPS LL and SC instructions

LL (load linked): This is like a normal LW instruction, but it also gets the processor ready for an upcoming SC instruction.

SC (store conditional): The assembler syntax is

    SC GPR1, offset(GPR2)

If SC succeeds, it works like SW, but also writes 1 into GPR1. If SC fails, there is no memory write, and GPR1 gets a value of 0.

The hardware ensures that if two or more threads attempt LL/SC sequences that overlap in time, SC will succeed in only one thread.


Use of LL and SC to lock a mutex

Suppose R9 points to a memory word used to hold the state of a mutex: 0 for unlocked, 1 for locked. Here is code for MIPS, with delayed branch instructions; the comments explain how it works.

    L1: LL   R8, (R9)      # read the lock word and link to its address
        BNE  R8, R0, L1    # nonzero means locked: go back and retry
        ORI  R8, R0, 1     # (delay slot) R8 <- 1, the "locked" value
        SC   R8, (R9)      # try to store 1; R8 <- 1 on success, 0 on failure
        BEQ  R8, R0, L1    # SC failed: another thread won the race, retry
        NOP                # (delay slot)

What would the code be to unlock the mutex?


Spinlocks

The example on the last slide demonstrates spinning in a loop to acquire a lock.

Suppose Thread A is spinning, waiting to acquire a lock. Then Thread A is occupying a core, using energy, and not really doing any work. That's fine if the lock will soon be released. However, if the lock may be held for a long time, a more sophisticated algorithm is better:

- The thread spins, but gives up after some fixed number of iterations.
- The thread then makes a system call to the OS kernel, asking to sleep until the lock is available.

(A C sketch of this pattern appears at the end of this slide set.)


SC and similar instructions have long latencies

In a multicore system, SC, and instructions that are intended for similar purposes in other ISAs (see, for example, CMPXCHG in x86 and x86-64), will necessarily have to inspect some shared global state to determine success or failure. There is no way to make a safe decision about SC simply by looking within one core's private cache!

Execution of SC by a core must therefore cause a many-cycle stall in that core. It's only one instruction, but that doesn't mean it's cheap in terms of time!


Locks aren't free, but are often necessary

It should be clear that any kind of variable or data structure that could be written to by two or more threads must be protected by a lock.

Consider a program in which only one thread writes to a variable or data structure, but many threads read that variable or data structure from time to time. Why might a lock be necessary in this one-writer, many-reader case? What kind of modification to the lock design might improve program efficiency?
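
A lock can be necessary even with a single writer because a reader might otherwise observe a half-completed update of a multi-word data structure. As for efficiency, one common modification is a read-write lock, which admits any number of concurrent readers but gives a writer exclusive access. Pthreads provides this directly; here is a minimal sketch, in which the shared array and function names are illustrative.

    #include <pthread.h>

    pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;
    int shared_table[100];          /* illustrative shared data */

    int reader_lookup(int i)
    {
        // Any number of threads may hold the read lock at once.
        pthread_rwlock_rdlock(&table_lock);
        int value = shared_table[i];
        pthread_rwlock_unlock(&table_lock);
        return value;
    }

    void writer_update(int i, int value)
    {
        // The write lock excludes all readers and all other writers.
        pthread_rwlock_wrlock(&table_lock);
        shared_table[i] = value;
        pthread_rwlock_unlock(&table_lock);
    }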
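
Finally, here is a hedged sketch of the spin-then-sleep idea from the Spinlocks slide, using pthread_mutex_trylock for the spinning phase and pthread_mutex_lock, which can put the thread to sleep in the kernel, as the fallback. The spin limit is illustrative; real implementations tune it carefully, and glibc's "adaptive" mutex type does something similar internally.

    #include <pthread.h>

    #define SPIN_LIMIT 1000     /* illustrative; real limits are tuned */

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    void spin_then_block(void)
    {
        // Phase 1: spin for a bounded number of attempts, hoping
        // the lock will be released soon.
        for (int i = 0; i < SPIN_LIMIT; i++)
            if (pthread_mutex_trylock(&m) == 0)
                return;         /* acquired the lock cheaply */

        // Phase 2: stop occupying the core; pthread_mutex_lock lets
        // the OS put this thread to sleep until the lock is free.
        pthread_mutex_lock(&m);
    }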