
HOW TO SURVIVE DEVELOPING MULTI-CORE PROGRAMS
Michael Weber
Alfons Laarman
University of Twente
IPA Spring Days
2010–04–23
Outline
1. Hardware
2. Parallel Performance
3. Multi-Core Pitfalls
4. Memory Models
5. Synchronization
6. Lessons Learned
Moore’s Law
“Transistor count in integrated
circuits doubles every two years.”
Gordon E. Moore (1965)
http://www.intel.com/technology/mooreslaw/
[Charts omitted (Sources: Smoothspan, Anandtech)]
2–8 Cores (Intel® x86, e.g., Nehalem)
200+ Cores (GeForce 260)
600+ Cores (3× GeForce)
13 GPUs (FASTRA II): ≈ 6,000 EUR, 12 TFLOPS
http://fastra2.ua.ac.be/
Moore’s Law: Corollaries
The number of cores increases, e.g., doubles every 2 years (CPU clock speed stays stable).
Consequence for software: parallelism must double every 2 years.
This is generally difficult.
Efficiency
Amdahl’s Law
Given a job executed on N processors, let p ∈ [0, 1] be the fraction of the job that can be parallelized (over N processors), and let sequential execution of the job take 1 time unit.
Then parallel execution of the job takes (1 − p) + p/N time units.
So the speed-up is 1 / ((1 − p) + p/N).
Amdahl, Gene (1967). "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities". AFIPS Conference Proceedings (30): 483–485.
Amdahl’s Law: Examples
N = 10
p = 0.6 gives speed-up of 1 / (0.4 + 0.6/10) ≈ 2.2
p = 0.9 gives speed-up of 1 / (0.1 + 0.9/10) ≈ 5.3
p = 0.99 gives speed-up of 1 / (0.01 + 0.99/10) ≈ 9.2
Conclusion:
To make efficient use of multiprocessors, it is important to minimize sequential parts and to reduce the idle time in which threads wait.
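A small Java sketch (class name ours, not from the slides) that reproduces the example speed-ups above:

class Amdahl {
    // speed-up = 1 / ((1 - p) + p/N)
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }
    public static void main(String[] args) {
        System.out.println(speedup(0.6, 10));   // ≈ 2.2
        System.out.println(speedup(0.9, 10));   // ≈ 5.3
        System.out.println(speedup(0.99, 10));  // ≈ 9.2
    }
}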
Measuring Parallelism: Speed-up
Given a multi-threaded program:
T1: wall-clock time of the “best” sequential algorithm
TP: minimum time to execute the program on P processors (1 ≤ P ≤ ∞)
T∞: critical path length
Speed-up: T1/TP
Linear speed-up: if T1/TP = Θ(P)
Maximum speed-up: T1/T∞
Measuring Parallelism: Efficiency
Speed-up: SP = T1/TP = P × EP  (serial time over parallel time)
Efficiency: EP = T1 / (P × TP) = SP / P  (serial cost over parallel cost)
EP is given in percent; E1 = T1/T1 = 100%
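A hedged sketch (helper names ours) of computing both measures from measured wall-clock times:

class Measures {
    static double speedup(double t1, double tp) { return t1 / tp; }      // SP
    static double efficiency(double t1, double tp, int p) {              // EP
        return speedup(t1, tp) / p;
    }
}
// Example: T1 = 100 s, T16 = 10 s on P = 16 cores:
// speedup = 10.0, efficiency = 10/16 = 0.625 (62.5%)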
Plots: TP
[Plot of wall-clock times TP omitted]
Missing data in such plots:
• What is measured?
• Hardware specs
• Software versions
• Average of what? (benchmark instances)

Plots: TP (log-scale)
[Log-scale plot omitted: a log scale compresses a large range]
Plots: Speed-up (SP)
[Annotated speed-up plot omitted]
Interesting points:
• What is the base, i.e., T1?
• Hyper-threaded i7: 8 physical / 16 virtual cores
• What is going on here? (6/10 cores)
• Uses a coordinator thread
• Beware of averages!
Speed-up plots can be deceiving. Always cross-check with real (wall-clock) time.
Efficiency: References
D. M. Nicol and F. H. Willard. Problem size, parallel architecture, and optimal speedup. J. Parallel Distrib. Comput., 5:404–420, 1988.
J. P. Singh, J. L. Hennessy, and A. Gupta. Scaling parallel programs for multiprocessors: methodology and examples. IEEE Computer, 26(7):42–50, 1993.
X. H. Sun and L. M. Ni. Scalable problems and memory-bound speedup. J. Parallel Distrib. Comput., 19:27–37, 1993.
P. H. Worley. The effect of time constraints on scaled speedup. SIAM J. Sci. Stat. Comput., 11:838–858, 1990.
Performance
Know Your Target Architecture
static int i, j;

Thread 0:
for (i = 0; i < 10000000; ++i) {
    /* local computation */
}

Thread 1:
for (j = 0; j < 10000000; ++j) {
    /* local computation */
}

What is the expected efficiency of this multi-threaded code?
Data Layout Matters! Why?
Know Your Target Architecture
static int i; static char padding[SZ_CACHE_LINE]; static int j;

Thread 0:
for (i = 0; i < 10000000; ++i) {
    /* local computation */
}

Thread 1:
for (j = 0; j < 10000000; ++j) {
    /* local computation */
}

What is the expected efficiency of this multi-threaded code?
Data Layout Matters! Why?
False Sharing
Sharing Cache Lines:
Bad Performance
Profile your code
Be Cache-Conscious:
Small working set
Languages: not helpful…
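A self-contained Java sketch of the same effect (class and constants are ours, not from the slides): two threads update adjacent fields that likely share a cache line. Uncommenting the padding, as in the C example above, typically speeds this up noticeably on machines with 64-byte cache lines; note the JVM may reorder fields, so hand padding is best-effort (JDK 8+ offers @Contended for this).

class FalseSharingDemo {
    static class Counters {
        volatile long a;                     // written only by thread 0
        // long p1, p2, p3, p4, p5, p6, p7;  // uncomment: pads a and b onto
        //                                   // different 64-byte cache lines
        volatile long b;                     // written only by thread 1
    }

    public static void main(String[] args) throws InterruptedException {
        final Counters c = new Counters();
        Thread t0 = new Thread(() -> { for (long n = 0; n < 100_000_000L; n++) c.a++; });
        Thread t1 = new Thread(() -> { for (long n = 0; n < 100_000_000L; n++) c.b++; });
        long start = System.nanoTime();
        t0.start(); t1.start();
        t0.join();  t1.join();
        System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
    }
}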
RAM(*)
(*) Performance only for the supported access patterns
“RAM is the New Disk”; “Disk is the New Tape”
Modern Hardware
Long pipelines
RAM has not been “uniform cost” for a while
Cache lines
Expensive cache transfers (Source: Wiggers, Bakker, Kokkeler, Smit, 2007)
Cache coherence overhead
Know Your Algorithmic Models
Consequences for the analysis of algorithms:
The uniform cost model is inadequate to predict the behavior of an algorithm.
Alternatives:
I/O Complexity: how many blocks of data are touched? (illustrated below)
Cache-conscious Algorithms
Cache-oblivious Algorithms
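A small Java illustration (ours, not from the slides) of why the uniform cost model misleads: both methods perform exactly the same additions, but the column-major loop touches a different row, and hence a different cache line, on nearly every access, so its I/O complexity (blocks touched) is far worse.

class TraversalOrder {
    // Streams through each row: consecutive accesses hit the same cache line.
    static long rowMajor(int[][] m) {
        long sum = 0;
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[i].length; j++)
                sum += m[i][j];
        return sum;
    }

    // Jumps between rows: nearly every access touches a new cache line.
    static long colMajor(int[][] m) {
        long sum = 0;
        for (int j = 0; j < m[0].length; j++)
            for (int i = 0; i < m.length; i++)
                sum += m[i][j];
        return sum;
    }
}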
Correctness
“What You See Is Not What’s eXecuted” (Tom Reps)
Pipeline: Source Code → Compiler/JIT → Processor → Cache → Actual Execution
• Compiler/JIT: optimizations (CSE, register allocation)
• Processor: prefetching, speculative execution, out-of-order execution
• Cache: store buffers, shared caches, local caches, registers
These transformations are invisible. The programmer operates under the assumption that:
• the order of executed operations is equivalent to some sequential execution according to program order
• writes become visible to all processors at the same time
System assumption: programs should be synchronized correctly!
Invisible Transformations
Register allocation:
    t0 = var
    update t0 (repeatedly)
    var = t0
Speculative execution (e.g., for better branch prediction) turns
    if (condition)
        update var
into
    t0 = var
    update var            /* temporary garbage */
    if (not condition)
        var = t0          /* undo */
Key issue: the system “invents” updates
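A runnable Java sketch (our example) of the register-allocation hazard above: without volatile, the JIT may hoist the read of stop into a register, so the loop never observes the writer’s update. Declaring stop volatile forbids the hoisting and makes the worker terminate reliably.

class HoistingDemo {
    static boolean stop = false;       // deliberately NOT volatile

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            long spins = 0;
            while (!stop) {            // may be compiled as if stop never changes
                spins++;
            }
            System.out.println("stopped after " + spins + " spins");
        });
        worker.setDaemon(true);        // let the JVM exit even if the loop never stops
        worker.start();
        Thread.sleep(100);
        stop = true;                   // this write may never become visible
        worker.join(1000);
        System.out.println(worker.isAlive() ? "worker still spinning!" : "worker exited");
    }
}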
Read/Write Tearing
static volatile unsigned long gx;

Thread 0:
int i = 0;
for (;;) {
    gx = (i == 0) ? 0L : 0xaaaabbbbccccddddL;
    i = 1 - i;
}

Thread 1:
for (;;) {
    unsigned long x = gx;
    assert (x == 0 || x == 0xaaaabbbbccccddddL);
}

Why does the assert fire? On a 32-bit platform, writing the 64-bit value takes two operations; Thread 1 might see 0xaaaabbbb00000000 or 0x00000000ccccdddd.
Know Your Atomicity Guarantees
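Java has the same hazard: the JLS (§17.7) permits reads and writes of non-volatile long and double to be split into two 32-bit halves; declaring the field volatile restores atomicity. A sketch (our code) of the analogous test — it illustrates the semantics rather than reliably reproducing tearing, since 64-bit JVMs write the value atomically anyway and the non-volatile read may itself be hoisted:

class TearingDemo {
    static long gx;    // NOT volatile: torn reads are permitted by the JMM

    public static void main(String[] args) {
        new Thread(() -> {
            for (int i = 0; ; i = 1 - i) {
                gx = (i == 0) ? 0L : 0xaaaabbbbccccddddL;
            }
        }).start();
        for (;;) {
            long x = gx;   // may observe half-written values on 32-bit JVMs
            if (x != 0L && x != 0xaaaabbbbccccddddL) {
                System.out.printf("torn read: 0x%016x%n", x);
                return;
            }
        }
    }
}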
Memory Models
“Memory Model = Instruction Reordering + Store Atomicity”, Arvind and Maessen, SIGARCH Vol. 34, Issue 2, May 2006
Java Memory Model
JSR-133: "Java Memory Model and Thread Specification Revision"
Java 1.5
Several attempts to get JMM right (cf. papers of Bill Pugh)
Intel Memory Model
"Intel 64 and IA-32 Architectures Software Developer's Manual",
Vol. 3A, Chapter 8, 2009
Weak guarantees on ordering of memory load/stores
Effects of Relaxed Memory Models
boolean wantp ← false, wantq ← false

Thread 0:
p2: wantp ← true
p3: loop
p4:   if wantq …
p8: critical section

Thread 1:
q2: wantq ← true
q3: loop
q4:   if wantp …
q8: critical section

Ensures mutual exclusion if the architecture supports Sequential Consistency (and the compiler does not reorder code…).
Most architectures do not enforce ordering of accesses between different memory locations (wantp, wantq).
Does not ensure MutEx under weaker memory models.
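In Java, one hedged fix (our sketch, mirroring Thread 0’s entry protocol; the “…” contention-resolution step stays elided as on the slide) is to declare the flags volatile: the JMM makes volatile accesses sequentially consistent, which forbids reordering the store to wantp with the later load of wantq.

class DekkerFlags {
    // volatile forbids the store/load reordering discussed above
    static volatile boolean wantp = false, wantq = false;

    static void enterP() {
        wantp = true;
        while (wantq) {
            /* contention resolution elided (the “…” on the slide) */
        }
        /* critical section */
    }
}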
Unusual Effects of Memory Models
int A ← 0, flag1 ← 0, flag2 ← 0

Thread 0 (int reg1, reg2):
flag1 ← 1
A ← 1
reg1 ← A
reg2 ← flag2

Thread 1 (int reg3, reg4):
flag2 ← 1
A ← 2
reg3 ← A
reg4 ← flag1

Result: reg1 = 1, reg3 = 2, reg2 = reg4 = 0
Possible on the SPARC TSO (total store order) model:
the write to A is propagated only to local reads of A, and
the reads of the flags can happen before the writes to the flags.
Synchronization
What is Wrong with Locking?
Not robust: If a thread holding a lock is delayed, other threads
cannot make progress.
Hard to use: Even a simple queue based on fine-grained locking is
a tour de force.
Deadlock: Can occur if threads attempt to lock the same
objects in different orders (lock-order reversal).
Not composable: Managing concurrent locks to, e.g., atomically
delete an item from one table and insert it in another table is
essentially impossible without breaking the lock internals.
Relies on conventions: Nobody really knows how to organize
and maintain large systems that rely on locking.
Specific Locking Problems
Amdahl’s Law… (every critical section adds to the sequential part)
Stampede: all waiting threads are woken up, but only one can proceed
Lock convoys: threads make progress in lock-step, one lock-holder at a time
Two-step dance: a woken-up thread immediately blocks again on a lock its waker still holds
Priority inversion: a low-priority thread holding a lock blocks a high-priority thread
Progress Properties
Blocking
Deadlock-free: some thread trying to get the lock eventually
succeeds.
Starvation-free: every thread trying to get the lock eventually
succeeds.
Non-blocking
Lock-free: some thread calling a method eventually returns.
Wait-free: every thread calling a method eventually returns.
Lock- and wait-freedom disallow blocking methods like locks; they guarantee that the system can cope with crash failures.
Which progress property to pick for a given application again depends on its needs.
Compare And Set/Swap (CAS)
Building block for lock-free/wait-free algorithms
In Java, some standard read-modify-write “registers” are:
getAndSet(v): assign v and return the prior value.
compareAndSet(e, u): if the prior value is e, then replace it by u, else leave it unchanged; return a boolean to indicate whether the value was changed.
(get(): returns the current value)
No double CAS on commodity hardware
Susceptible to the ABA problem
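A minimal sketch (class name ours) of the CAS retry loop that the lock-free stack below is built on, using java.util.concurrent.atomic:

import java.util.concurrent.atomic.AtomicLong;

class CasCounter {
    private final AtomicLong value = new AtomicLong();

    long increment() {
        long prior;
        do {
            prior = value.get();                          // read the current value
        } while (!value.compareAndSet(prior, prior + 1)); // retry if another thread raced us
        return prior + 1;
    }
}

This is lock-free: a failed CAS means some other thread’s CAS succeeded, so the system as a whole always makes progress.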
Lock-Free Stack
Treiber’s Lock-Free Stack
import java.util.concurrent.atomic.AtomicReference;

public class LockFreeStack<T> {
    AtomicReference<Node<T>> head = new AtomicReference<Node<T>>();
    public void push(T item) { … }
    public T pop()           { … }
    static class Node<T> {
        final T item;
        Node<T> next;
        public Node(T item) { this.item = item; }
    }
}
Lock-Free Stack
Treiber’s Lock-Free Stack
public void push(T item) {
    Node<T> newHead = new Node<T>(item);
    Node<T> oldHead;
    do {
        oldHead = head.get();               // read the current top
        newHead.next = oldHead;             // link the new node in front of it
    } while (!head.compareAndSet(oldHead, newHead));   // retry if the top changed meanwhile
}

public T pop() {
    Node<T> oldHead;
    Node<T> newHead;
    do {
        oldHead = head.get();
        if (oldHead == null) return null;   // empty stack
        newHead = oldHead.next;
    } while (!head.compareAndSet(oldHead, newHead));   // retry if the top changed meanwhile
    return oldHead.item;
}

Why does it work?
Linearization points?
Best-known method under low load, but it scales poorly due to contention and an inherent sequential bottleneck.
Better: Elimination Backoff Stack [TAMP Sec. 11.4, p. 249]
Challenge: write a lock-free stack using an array instead of a linked list
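A quick usage sketch for Treiber’s stack above (the driver code is ours, not from the slides):

LockFreeStack<Integer> stack = new LockFreeStack<Integer>();
stack.push(42);                  // linearizes at the successful CAS on head
Integer top = stack.pop();       // 42, or null if another thread popped first

Note: in Java, garbage collection sidesteps the ABA problem mentioned earlier, because a popped node cannot be recycled while any thread still holds a reference to it.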
A Wait-Free (But Lossy) Set…
Adding K elements to a set
N threads attempt to add elements
simultaneously
Worst case: only one succeeds,
N-1 are lost
But: now K-1 elements left to insert
Eventually all will be added!
Wait-Free Lossy Set
Can be implemented even without CAS and fences; needs only atomicity guarantees for writing machine words
For performance, minimize loss (easy)
Excellent scalability!
The class of revisiting-resistant algorithms [2008] remains correct even with intermediate wrong answers,
e.g., the set of visited states in graph-search problems
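A deliberately simplified Java sketch (all names ours; it ignores JMM visibility subtleties and hash collisions, and assumes word-atomic stores as on the slide) of the lossy insert: two writers racing on the same empty slot may overwrite each other, so an add can be lost, and a revisiting-resistant client simply re-inserts until membership holds.

class LossySet {
    private final int[] slots;   // 0 means empty; assumes keys are nonzero

    LossySet(int capacity) { slots = new int[capacity]; }

    private int slot(int key) { return Math.floorMod(key * 0x9E3779B9, slots.length); }

    boolean contains(int key) { return slots[slot(key)] == key; }

    void add(int key) {          // plain word write, no CAS, no fence: may be lost
        int s = slot(key);
        if (slots[s] == 0) slots[s] = key;
    }
}

// Revisiting-resistant usage: retry until the insert is visible.
// while (!set.contains(state)) set.add(state);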
On Abstractions
Abstractions: Tear Them Down!
Data layout is critical for performance: no mainstream language provides high-level control over it
Locks are not compositional
Generic solutions are prone to underperform
Design for debugging & profiling
Tailor-Made Implementations
What works for us:
Build specific solutions
Develop the right mindset (experiments and experience)
Keep It Simple & Static (KISS): avoid frameworks, avoid memory allocation
Reliable performance of low-level code trumps high-level code with uncertain pay-offs
Retaining Sanity in Multi-Core Programming
Tailor-Made Solutions
Tear Down Abstractions
KISS (Design for Debugging)
Know Your Target Architecture
Pitfalls: False Sharing, Convoys,
Stampedes, Priority Inversion, …
Trust Nobody (WYSINWX)
Know Your Algorithmic Models
Performance Analysis
Experiments
Throw away the expected results,
explain the unexpected
Recommended Material
The Art of Multiprocessor Programming.
Maurice Herlihy and Nir Shavit.
Morgan Kaufmann, 2008.
Programming Erlang: Software for a Concurrent World.
Joe Armstrong.
Pragmatic Bookshelf, 2007.
“Machine Architecture:
Things Your Programming Language Never Told You”
Herb Sutter, 2007.