
Slide Set 11
for ENCM 501 Winter 2015 Lecture Section 01
Steve Norman, PhD, PEng
Electrical & Computer Engineering
Schulich School of Engineering
University of Calgary
Winter Term, 2015
Contents
Race conditions
Introduction to Pthreads programming
Locks
Race conditions
A race condition is a situation in which the evolution of the
state of some system depends critically on the exact order of
nearly-simultaneous events in two or more subsystems.
Race conditions can exist as a pure hardware problem in the
design of sequential logic circuits.
Race conditions can also arise at various levels of parallelism in
software systems, including instruction-level parallelism and
thread-level parallelism.
TLP race condition example
Two threads are running in the same loop, both reading and
writing a global variable called counter . . .
Thread A:

while ( condition ) {
    do some work
    counter++;
}

Thread B:

while ( condition ) {
    do some work
    counter++;
}
Before considering the race condition in a multicore system,
let’s address this question: What is the potential failure if this
program is running in a uniprocessor system?
Imagine a uniprocessor MIPS64 system, with 100(R28) being the
address of counter . . .
Thread A                        Thread B

LD     R8, 100(R28)
DADDIU R8, R8, 1
          --- context switch ---
[suspended]                     LD     R8, 100(R28)
                                DADDIU R8, R8, 1
                                SD     R8, 100(R28)
                                ...
          --- context switch ---
SD     R8, 100(R28)             [suspended]
The value of counter is supposed to increase by 2, but will
actually increase by only 1.
TLP race condition example, continued
The same defective program, now with the threads running
simultaneously in two cores . . .
Thread A:

while ( condition ) {
    do some work
    counter++;
}

Thread B:

while ( condition ) {
    do some work
    counter++;
}
Again assume MIPS64, with 100(R28) being the address of
counter . . .
Thread A                        Thread B

...                             ...
LD     R8, 100(R28)
DADDIU R8, R8, 1                LD     R8, 100(R28)
SD     R8, 100(R28)             DADDIU R8, R8, 1
...                             SD     R8, 100(R28)
                                ...

(Time runs downward in each column.)
Thread A loses the race to write to counter before Thread B
reads from counter.
Again the value of counter is supposed to increase by 2, but will
actually increase by only 1.
A cache coherency protocol such as MSI will not prevent failure!
In fact, MSI might even fail to help in this situation, in which
Thread A starts to write before Thread B starts to read . . .
Thread A                        Thread B

...                             ...
LD     R8, 100(R28)
DADDIU R8, R8, 1
SD     R8, 100(R28)             LD     R8, 100(R28)
...                             DADDIU R8, R8, 1
                                SD     R8, 100(R28)
                                ...
Due to latency in a snoop-based implementation of MSI, Thread B
could do its load before an invalidate command from Thread A’s
store is processed.
A memory-register ISA does not solve race condition problems
This x86-64 instruction . . .
addq $1, (%rax)
. . . does a memory read, an addition, and a memory write.
Why can’t that be used to eliminate the problem we saw with
MIPS64 instructions?
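On a multicore system, the single addq is still performed as a separate read, add, and write at the memory-system level, so another core's access to counter can land between the read and the write; x86 requires an explicit lock prefix to make the read-modify-write atomic. For contrast, here is a minimal C11 sketch in which the increment is guaranteed atomic (the function name bump and the use of <stdatomic.h> are illustrative choices, not part of the original example):

#include <stdatomic.h>

atomic_long counter = 0;   /* shared by all threads */

void bump(void)
{
    /* An atomic read-modify-write: for this, an x86-64 compiler
       will typically emit a lock-prefixed add instruction. */
    atomic_fetch_add(&counter, 1);
}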
Making the two-thread program safe
Thread A:

while ( condition ) {
    do some work
    acquire lock
    counter++;
    release lock
}

Thread B:

while ( condition ) {
    do some work
    acquire lock
    counter++;
    release lock
}
Setting up some kind of lock (often called a mutual-exclusion lock, or mutex) prevents the failures seen in the past two slides.

How would this work in a uniprocessor system? The earlier failure required a context switch between the load and the store. If a thread is preempted partway through the locked load-add-store sequence, it still holds the lock, so the other thread's attempt to acquire the lock fails, and that thread must wait until the update is complete.
Thread A:

while ( condition ) {
    do some work
    acquire lock
    counter++;
    release lock
}

Thread B:

while ( condition ) {
    do some work
    acquire lock
    counter++;
    release lock
}
What about the multicore case? The lock will prevent
Thread B from reading counter while Thread A is doing a
load-add-store update with counter.
Is that enough? Consider the MSI protocol again. What must happen either as a result of Thread A releasing the lock, or as a result of Thread B acquiring the lock? At minimum, Thread A's store to counter must become visible to Thread B before B's load: under MSI, A's store invalidates the stale copy of counter's block in B's cache, so B's next load misses and fetches the updated value.
Introduction to Pthreads programming
Pthreads (short for POSIX threads) is a specification for a C
library to support multi-threaded programming.
Even if you don’t program directly with Pthreads in practical
projects, it’s good to have a basic understanding of it, because
it’s often used as “infrastructure” for higher-level systems for
multi-threaded programming, such as OpenMP or the C++11
<thread> library.
Please see Assignment 10 for details about types, function
prototypes, and so on!
Adding up array elements using threads
Here’s one way to add up a[0], a[1], . . . , a[n-1] using a
single thread:
double sum(double *a, int n) {
    double result = 0.0;
    double *p, *end;
    end = a + n;
    for (p = a; p != end; p++)
        result += *p;
    return result;
}
Note the use of the pointer end: it holds the address one past the last array element, which is the first location we must not read with the expression *p.
Suppose we want to work in parallel with 4 threads. We can divide
the problem into Problem 0, Problem 1, Problem 2, and
Problem 3, each adding up roughly 1/4 of the array elements . . .
[Figure: the array is divided into four contiguous ranges, one per problem.
Each problem gets its own start pointer a and limit pointer end:
    Problem 0:  a           to  a + n/4
    Problem 1:  a + n/4     to  a + n/2
    Problem 2:  a + n/2     to  a + 3*n/4
    Problem 3:  a + 3*n/4   to  a + n]
This approach works for any number of threads. 4 is just an
example.
The solution to the April 9 tutorial shows how to get threads
started working on Problem 0, Problem 1, etc., in parallel,
wait for the worker threads to complete, then combine the
results into an overall result.
Let’s look at some performance numbers for adding up an
array of 100 million doubles on a quad-core Core i7.
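The tutorial solution itself is not reproduced here, but the following is a minimal sketch of the general create-then-join pattern it uses (the names sum_task, part_sum, and sum_parallel, and the fixed NTHREADS, are invented for illustration; error checking is omitted):

#include <pthread.h>

#define NTHREADS 4

struct sum_task {           /* one worker thread's subproblem */
    const double *start;    /* first element to add */
    const double *end;      /* one past the last element */
    double result;          /* partial sum, written only by this worker */
};

static void *part_sum(void *arg)
{
    struct sum_task *t = arg;
    const double *p;
    double s = 0.0;
    for (p = t->start; p != t->end; p++)
        s += *p;
    t->result = s;          /* no lock needed: no location is written
                               by more than one thread */
    return NULL;
}

double sum_parallel(const double *a, int n)
{
    pthread_t ids[NTHREADS];
    struct sum_task tasks[NTHREADS];
    double total = 0.0;
    int i;
    for (i = 0; i < NTHREADS; i++) {   /* start the workers */
        tasks[i].start = a + (long)n * i / NTHREADS;
        tasks[i].end   = a + (long)n * (i + 1) / NTHREADS;
        pthread_create(&ids[i], NULL, part_sum, &tasks[i]);
    }
    for (i = 0; i < NTHREADS; i++) {   /* wait, then combine results */
        pthread_join(ids[i], NULL);
        total += tasks[i].result;
    }
    return total;
}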
Locks
Let’s return to the problem of using locks to avoid race
conditions.
Before that, though, here's a question: Why is it not necessary to use locks when adding up array elements with threads? (Hint: in that program, no memory location is written by more than one thread; each worker updates only its own partial result.)
Using locks to manage access to shared data
Thread A:

while ( condition ) {
    do some work
    acquire lock
    counter++;
    release lock
}

Thread B:

while ( condition ) {
    do some work
    acquire lock
    counter++;
    release lock
}
Correct updates to counter will be done by both threads,
because only one thread can have the lock at any given time.
If A has the lock and B tries to acquire the lock, B will have
to wait.
Example of locking with a Pthreads mutex
This C code sets up a global variable of type
pthread_mutex_t, and initializes it to the unlocked state . . .
pthread_mutex_t the_lock = PTHREAD_MUTEX_INITIALIZER;
Functions pthread_mutex_lock and
pthread_mutex_unlock can be used to acquire and release
the lock, as seen on the next slide.
Note the abstraction here: The interface is designed to hide
ISA and microarchitecture details from the programmer.
pthread_mutex_t the_lock = PTHREAD_MUTEX_INITIALIZER;

void * foo(void * arg) {
    while (/* some condition */) {
        /* Do some work. */

        // Get the lock.
        // If another thread has the lock, wait.
        pthread_mutex_lock(&the_lock);

        /* Access shared memory. */

        // Release the lock.
        pthread_mutex_unlock(&the_lock);
    }
    /* ... */
}
How NOT to set up a lock
int my_lock = 0; // 0: unlocked; 1: locked.

void get_lock(void) {
    // Keep trying until my_lock == 0.
    while (my_lock != 0)
        ;
    // Aha! The lock is available!
    my_lock = 1;
}
Why is this approach totally useless? The test of my_lock and the assignment my_lock = 1 are two separate, ordinary memory accesses: two threads can both see my_lock equal to 0 and both proceed to "acquire" the lock at the same time. (A compiler is also free to assume my_lock never changes inside the empty loop and optimize the test away.)
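One standard repair is to make the test and the set a single atomic step. Here is a minimal sketch using the C11 atomic_exchange operation (the use of <stdatomic.h> is my choice of mechanism for illustration, and release_lock is invented):

#include <stdatomic.h>

atomic_int my_lock = 0; // 0: unlocked; 1: locked.

void get_lock(void)
{
    // atomic_exchange writes 1 and returns the previous value as one
    // indivisible operation, so only one thread can observe the 0 -> 1
    // transition and leave the loop holding the lock.
    while (atomic_exchange(&my_lock, 1) != 0)
        ; // another thread holds the lock; keep trying
}

void release_lock(void)
{
    atomic_store(&my_lock, 0);
}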
ISA and microarchitecture support for concurrency
Instruction sets have to provide special atomic instructions to
allow software implementation of synchronization facilities
such as mutexes (locks) and semaphores.
An atomic RMW (read-modify-write) instruction (or a
sequence of instructions that is intended to provide the same
kind of behaviour, such as MIPS LL/SC) typically works like
this:
- a memory read is attempted at some location;
- some kind of write data is generated;
- a memory write to the same location is attempted.
The key aspects of atomic RMW instructions are:
- the whole operation succeeds or the whole operation fails, in a clean way that can be checked after the attempt was made;
- if two or more threads attempt the operation, such that the attempts overlap in time, one thread will succeed, and all the other threads will fail.
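To make these two properties concrete, here is a minimal C11 sketch built on a compare-and-exchange operation (the function try_increment is invented for illustration): the attempt either succeeds completely or fails cleanly, and the return value tells the caller which happened.

#include <stdatomic.h>
#include <stdbool.h>

atomic_int value = 0;   /* shared by all threads */

bool try_increment(void)
{
    int expected = atomic_load(&value);
    /* Atomic RMW: if value still equals expected, store expected + 1
       and return true; if an overlapping attempt by another thread
       changed value first, store nothing and return false. */
    return atomic_compare_exchange_strong(&value, &expected, expected + 1);
}

A failed attempt can simply be retried, which is exactly the pattern the MIPS LL/SC code below uses.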
MIPS LL and SC instructions
LL (load linked): This is like a normal LW instruction, but it also gets the processor ready for an upcoming SC instruction.

SC (store conditional): The assembler syntax is

    SC GPR1, offset(GPR2)

If SC succeeds, it works like SW, but also writes 1 into GPR1. If SC fails, there is no memory write, and GPR1 gets a value of 0.
The hardware ensures that if two or more threads attempt
LL/SC sequences that overlap in time, SC will succeed in
only one thread.
Use of LL and SC to lock a mutex
Suppose R9 points to a memory word used to hold the state of
a mutex: 0 for unlocked, 1 for locked. Here is code for MIPS,
with delayed branch instructions.
L1: LL     R8, (R9)    # linked load: R8 gets the mutex state
    BNE    R8, R0, L1  # if R8 != 0, the mutex is locked: retry
    ORI    R8, R0, 1   # (delay slot) R8 gets 1, the "locked" value
    SC     R8, (R9)    # try to store 1; R8 gets 1 on success, 0 on failure
    BEQ    R8, R0, L1  # SC failed: another thread intervened, so retry
    NOP                # (delay slot)
The comments explain how the mutex is locked. What would the code be to unlock the mutex? No LL/SC sequence is needed: only the thread holding the mutex writes to it at that point, so a plain store of zero, such as SW R0, (R9), releases it. (In practice a SYNC instruction would come first, so that writes made inside the critical section become visible to other cores before the mutex appears free.)
Spinlocks
The example on the last slide demonstrates spinning in a loop
to acquire a lock.
Suppose Thread A is spinning, waiting to acquire a lock. Then
Thread A is occupying a core, using energy, and not really
doing any work.
That’s fine if the lock will soon be released. However, if the
lock may be held for a long time, a more sophisticated
algorithm is better:
- The thread spins, but gives up after some fixed number of iterations.
- The thread then makes a system call to the OS kernel, asking to sleep until the lock is available.
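One way to sketch that spin-then-sleep policy with standard Pthreads calls (acquire and SPIN_LIMIT are invented for illustration; pthread_mutex_trylock returns 0 only if it obtained the lock without blocking):

#include <pthread.h>

#define SPIN_LIMIT 1000   /* tuning parameter: how many times to spin */

void acquire(pthread_mutex_t *m)
{
    int i;
    for (i = 0; i < SPIN_LIMIT; i++)
        if (pthread_mutex_trylock(m) == 0)
            return;   /* got the lock while spinning */
    /* Give up spinning; this call may put the thread to sleep in the
       kernel until the lock is released. */
    pthread_mutex_lock(m);
}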
SC and similar instructions have long latencies
In a multicore system, SC, and instructions that are intended
for similar purposes in other ISAs—see, for example,
CMPXCHG in x86 and x86-64—will necessarily have to inspect
some shared global state to determine success or failure.
There is no way to make a safe decision about SC by looking only at one core's private cache!
Execution of SC by a core must therefore cause a many-cycle
stall in that core. It’s only one instruction, but that doesn’t
mean it’s cheap in terms of time!
Locks aren’t free, but are often necessary
It should be clear that any kind of variable or data structure
that could be written to by two or more threads must be
protected by a lock.
Consider a program in which only one thread writes to a
variable or data structure, but many threads read that variable
or data structure from time to time.
Why might a lock be necessary in this one-writer, many-reader case? Without one, a reader could observe a data structure that is only partway through being updated. What kind of modification to the lock design might improve program efficiency? A reader-writer lock helps: it admits any number of concurrent readers, but gives a writer exclusive access.
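Pthreads provides a reader-writer lock type that matches this pattern. A minimal sketch (shared_value, read_it, and write_it are invented for illustration):

#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
double shared_value;   /* written by one thread, read by many */

double read_it(void)
{
    double copy;
    pthread_rwlock_rdlock(&rw);   /* shared: other readers may enter too */
    copy = shared_value;
    pthread_rwlock_unlock(&rw);
    return copy;
}

void write_it(double x)
{
    pthread_rwlock_wrlock(&rw);   /* exclusive: no readers, no other writers */
    shared_value = x;
    pthread_rwlock_unlock(&rw);
}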