HOW TO SURVIVE DEVELOPING MULTI-CORE PROGRAMS
Michael Weber, Alfons Laarman
University of Twente
IPA Spring Days, 2010-04-23

Outline
1. Hardware
2. Parallel Performance
3. Multi-Core Pitfalls
4. Memory Models
5. Synchronization
6. Lessons Learned

Moore’s Law
“Transistor count in integrated circuits doubles every two years.” Gordon E. Moore (1965)
http://www.intel.com/technology/mooreslaw/
(Source: Smoothspan) (Source: Anandtech)

2–8 Cores (Intel® x86) (Intel Nehalem)
200+ Cores (GeForce 260)
600+ Cores (3x GeForce)
13 GPUs (FASTRA II): ≈ 6,000 EUR, 12 TFLOPS, http://fastra2.ua.ac.be/

Moore’s Law: Corollaries (flickr: chasingfun)
• The number of cores increases, e.g., doubles every 2 years (CPU clock speed stays stable)
• Software parallelism must double every 2 years
• This is difficult, generally

Efficiency

Amdahl’s Law
Given a job that is executed on $N$ processors.
Let $p \in [0, 1]$ be the fraction of the job that can be parallelized (over $N$ processors).
Let sequential execution of the job take 1 time unit.
Then parallel execution of the job takes $(1 - p) + \frac{p}{N}$ time units.
So the speed-up is $\frac{1}{(1 - p) + \frac{p}{N}}$.
Amdahl, Gene (1967). "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities". AFIPS Conference Proceedings (30): 483–485.

Amdahl’s Law: Examples
$N = 10$
$p = 0.6$ gives a speed-up of $\frac{1}{0.4 + \frac{0.6}{10}} = 2.2$
$p = 0.9$ gives a speed-up of $\frac{1}{0.1 + \frac{0.9}{10}} = 5.3$
$p = 0.99$ gives a speed-up of $\frac{1}{0.01 + \frac{0.99}{10}} = 9.2$
Conclusion: To make efficient use of multiprocessors, it is important to
• minimize sequential parts
• reduce idle time in which threads wait

Measuring Parallelism: Speed-up
Given a multi-threaded program:
$T_1$: wall clock time of the “best” sequential algorithm
$T_P$: minimum time to execute the program on $P$ processors ($1 \le P \le \infty$)
$T_\infty$: critical path length
Speed-up: $T_1 / T_P$
Linear speed-up: if $T_1 / T_P = \Theta(P)$
Maximum speed-up: $T_1 / T_\infty$

Measuring Parallelism: Efficiency
Speed-up: $S_P = T_1 / T_P = P \cdot E_P$ (serial time / parallel time)
Efficiency: $E_P = \frac{T_1}{P \times T_P} = S_P / P$ (serial cost / parallel cost)
$E_P$ is given in percent; $E_1 = T_1 / T_1 = 100\%$

Plots: $T_P$
Missing data:
• What is measured?
• Hardware specs
• Software versions
• Average of what? (benchmark instances)

Plots: $T_P$ (log-scale)
Log scale compresses large range

Plots: Speed-up ($S_P$)
Annotations on the plot:
• Interesting points
• What is the base, i.e., $T_1$?
• Hyper-threaded i7: 8 physical / 16 virtual cores
• What is going on here? (6/10 cores)
• Uses a coordinator thread
• Beware of averages!
Speedup plots can be deceiving. Always cross-check with real (wall-clock) time.

Efficiency: References
D. M. Nicol and F. H. Willard, Problem size, parallel architecture, and optimal speedup, J. Parallel Distrib. Comput., 5:404-420, 1988
J. P. Singh, J. L. Hennessy, and A.
Gupta, Scaling parallel programs for multiprocessors: methodology and examples, IEEE Computer, 26(7):42-50, 1993
X. H. Sun and L. M. Ni, Scalable problems and memory-bound speedup, J. Parallel Distrib. Comput., 19:27-37, 1993
P. H. Worley, The effect of time constraints on scaled speedup, SIAM J. Sci. Stat. Comput., 11:838-858, 1990

Performance

Know Your Target Architecture

    static int i, j;

    Thread 0:
        for (i = 0; i < 10000000; ++i) {
            local computation
        }

    Thread 1:
        for (j = 0; j < 10000000; ++j) {
            local computation
        }

What is the expected efficiency of this multi-threaded code?
Data Layout Matters! Why?

Know Your Target Architecture

    static int i;
    static char padding[SZ_CACHE_LINE];  /* keeps i and j on different cache lines */
    static int j;

    Thread 0:
        for (i = 0; i < 10000000; ++i) {
            local computation
        }

    Thread 1:
        for (j = 0; j < 10000000; ++j) {
            local computation
        }

What is the expected efficiency of this multi-threaded code?
Data Layout Matters! Why? False Sharing

Sharing Cache Lines: Bad Performance
• Profile your code
• Be cache-conscious: small working set
• Languages: not helpful…

RAM*
(*) Performance only for the supported access patterns
“RAM is the New Disk”
“Disk is the New Tape”

Modern Hardware
• Long pipelines
• RAM has not been “uniform cost” for a while
• Cache lines
• Expensive cache transfers (Source: Wiggers, Bakker, Kokkeler, Smit (2007))
• Cache coherence overhead

Know Your Algorithmic Models
Consequences for the analysis of algorithms: the Uniform Cost Model is inadequate to predict the behavior of an algorithm.
Alternatives:
• I/O Complexity: how many blocks of data are touched?
• Cache-conscious Algorithms
• Cache-oblivious Algorithms

Correctness

“What You See Is Not What’s eXecuted” (Tom Reps)
Layers between source code and actual execution:
• Source Code
• Compiler/JIT: Optimizations (CSE, Register Alloc.)
• Processor: Prefetching, Speculative Exec., Out-of-Order Exec.
• Cache: Store Buffers, Shared Caches, Local Caches, Registers
• Actual Execution
These transformations are invisible.
The programmer operates under the assumption:
• The order of executed operations is equivalent to some sequential execution according to program order
• Writes become visible to all processors at the same time
System assumption: Programs should be synchronized correctly!

Invisible Transformations
Register Allocation:

    t0 = var
    update t0 (repeatedly)
    var = t0

Speculative Execution (e.g., for better branch prediction). The source

    if (condition) update var

may be executed as

    t0 = var
    update var                            (var holds temporary garbage)
    if (not condition) var = t0 /* undo */

Key issue: System “invents” updates

Read/Write Tearing

    /* 64-bit value; unsigned long is only 32 bits on most 32-bit platforms */
    static volatile unsigned long long gx;

    Thread 0:
        int i = 0;
        for (;;) {
            gx = (i == 0) ? 0ULL : 0xaaaabbbbccccddddULL;
            i = 1 - i;
        }

    Thread 1:
        for (;;) {
            unsigned long long x = gx;
            assert(x == 0 || x == 0xaaaabbbbccccddddULL);
        }

Why does the assert fire?
On a 32-bit platform, writing the value takes two operations; Thread 1 might see 0xaaaabbbb00000000 or 0x00000000ccccdddd.
Know Your Atomicity Guarantees

Memory Models
Memory Model = Instruction Reordering + Store Atomicity, Arvind and Maessen, SIGARCH Vol. 34, Issue 2 (May 2006)
Java Memory Model
• JSR-133: "Java Memory Model and Thread Specification Revision", Java 1.5
• Several attempts to get the JMM right (cf. papers of Bill Pugh)
Intel Memory Model
• "Intel 64 and IA-32 Architectures Software Developer's Manual", Vol. 3A, Chapter 8, 2009
• Weak guarantees on ordering of memory loads/stores

Effects of Relaxed Memory Models

    boolean wantp ← false, wantq ← false

    Thread 0:
        p2: wantp ← true
        p3: loop
        p4:   if wantq …
        p8: critical section

    Thread 1:
        q2: wantq ← true
        q3: loop
        q4:   if wantp …
        q8: critical section

• Ensures mutual exclusion if the architecture supports Sequential Consistency (and the compiler does not reorder code…)
• Most architectures do not enforce ordering of accesses between different memory locations (wantp, wantq)
• Does not ensure mutual exclusion under weaker memory models

Unusual Effects of Memory Models

    int A ← 0, flag1 ← 0, flag2 ← 0

    Thread 0:
        int reg1, reg2
        flag1 ← 1
        A ← 1
        reg1 ← A
        reg2 ← flag2

    Thread 1:
        int reg3, reg4
        flag2 ← 1
        A ← 2
        reg3 ← A
        reg4 ← flag1

Result: reg1 = 1, reg3 = 2, reg2 = reg4 = 0
Possible on the SPARC TSO model (TSO: total store order):
• The write to A is propagated only to local reads of A
• Reads of flags can happen before writes to flags

Synchronization

What is Wrong with Locking?
• Not robust: If a thread holding a lock is delayed, other threads cannot make progress.
• Hard to use: Even a simple queue based on fine-grained locking is a tour de force.
• Deadlock: Can occur if threads attempt to lock the same objects in different orders (lock-order reversal).
• Not composable: Managing concurrent locks to, e.g., atomically delete an item from one table and insert it in another table is essentially impossible without breaking the lock internals.
• Relies on conventions: Nobody really knows how to organize and maintain large systems that rely on locking.

Specific Locking Problems
• Amdahl’s Law…
• Stampede
• Lock Convoys
• Two-Step Dance
• Priority Inversion

Progress Properties
Blocking:
• Deadlock-free: some thread trying to get the lock eventually succeeds.
• Starvation-free: every thread trying to get the lock eventually succeeds.
Non-blocking:
• Lock-free: some thread calling a method eventually returns.
• Wait-free: every thread calling a method eventually returns.
Lock- and wait-freeness disallow blocking methods like locks. They guarantee that the system can cope with crash-failures.
Which progress property to pick depends on the needs of the application.

Compare And Set/Swap (CAS)
Building block for lock-free/wait-free algorithms
In Java, some standard read-modify-write “registers” are:
• getAndSet(v): assign v, and return the prior value.
• compareAndSet(e,u): if the prior value is e, then replace it by u, else leave it unchanged; return a boolean to indicate whether the value was changed.
• (get(): returns the current value)
No Double CAS on commodity hardware
Susceptible to the ABA Problem

Lock-Free Stack
Treiber’s Lock-Free Stack

    public class LockFreeStack<T> {
        AtomicReference<Node<T>> head = new AtomicReference<Node<T>>();

        public void push(T item) { … }
        public T pop() { … }

        static class Node<T> {
            final T item;
            Node<T> next;
            public Node(T item) { this.item = item; }
        }
    }

    public void push(T item) {
        Node<T> newHead = new Node<T>(item);
        Node<T> oldHead;
        do {
            oldHead = head.get();
            newHead.next = oldHead;
        } while (!head.compareAndSet(oldHead, newHead));  // retry if head changed
    }

    public T pop() {
        Node<T> oldHead;
        Node<T> newHead;
        do {
            oldHead = head.get();
            if (oldHead == null) return null;  // empty stack
            newHead = oldHead.next;
        } while (!head.compareAndSet(oldHead, newHead));  // retry if head changed
        return oldHead.item;
    }

Why does it work? Linearization Points?
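The two questions above can be explored with a small experiment. The following is a minimal sketch, not part of the slides: it repeats the stack code so that it is self-contained, and adds a hypothetical driver class `TreiberStackDemo` whose thread count and item counts are arbitrary assumptions. Each successful `compareAndSet` is commented as the point where the operation takes effect, which is one common way to argue linearizability for this stack.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical driver, added for illustration only.
class TreiberStackDemo {
    public static void main(String[] args) throws InterruptedException {
        final int threads = 4, perThread = 10_000;
        LockFreeStack<Integer> stack = new LockFreeStack<>();

        // Several threads push disjoint ranges of integers concurrently.
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) stack.push(base + i);
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();

        // Drain single-threaded: every pushed item must come out exactly once.
        boolean[] seen = new boolean[threads * perThread];
        int count = 0;
        for (Integer x; (x = stack.pop()) != null; count++) {
            if (seen[x]) throw new AssertionError("duplicate item " + x);
            seen[x] = true;
        }
        System.out.println("popped " + count + " of " + threads * perThread);
    }
}

class LockFreeStack<T> {
    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    public void push(T item) {
        Node<T> newHead = new Node<>(item);
        Node<T> oldHead;
        do {
            oldHead = head.get();
            newHead.next = oldHead;
            // A successful CAS publishes the node: the push takes effect here.
        } while (!head.compareAndSet(oldHead, newHead));
    }

    public T pop() {
        Node<T> oldHead;
        Node<T> newHead;
        do {
            oldHead = head.get();
            if (oldHead == null) return null; // empty: takes effect at this read
            newHead = oldHead.next;
            // A successful CAS unlinks the node: the pop takes effect here.
        } while (!head.compareAndSet(oldHead, newHead));
        return oldHead.item;
    }

    private static final class Node<T> {
        final T item;
        Node<T> next;
        Node(T item) { this.item = item; }
    }
}
```

A failed `compareAndSet` means another thread changed `head` in the meantime, so some other operation made progress; that is the lock-free guarantee, not the wait-free one. Because Java is garbage-collected and this sketch never reuses nodes, a node cannot reappear as `head` while another thread still holds a reference to it, so the ABA problem does not arise here.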
Best-known low-load method, but scales poorly due to contention and an inherent sequential bottleneck
Better: Elimination Backoff Stack [TAMP Sec. 11.4, p.249]
Challenge: Write a lock-free stack using an array instead of a linked list

A Wait-Free (But Lossy) Set…
• Adding K elements to a set
• N threads attempt to add elements simultaneously
• Worst case: only one succeeds, N-1 are lost
• But: now K-1 elements left to insert
• Eventually all will be added!

Wait-Free Lossy Set
• Can be implemented even without CAS and fences; it needs just atomicity guarantees for writing machine words
• For performance, minimize loss (easy)
• Excellent scalability!
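One possible reading of the lossy set described above is sketched below. This is an assumption-laden illustration, not the authors' implementation: the class names `LossySet` and `LossySetDemo`, the open-addressing layout with linear probing, the hash constant, and the use of `0` as the empty marker are all hypothetical choices. The sketch relies only on the fact that Java reads and writes of `int` are atomic; it uses no CAS, no locks, and no explicit fences, so a racing writer may overwrite a slot and an element may be lost.

```java
// Hypothetical driver, added for illustration only.
class LossySetDemo {
    public static void main(String[] args) throws InterruptedException {
        final int keys = 1000, threads = 4;
        // Capacity exceeds the total number of add() calls, so the table never fills.
        LossySet set = new LossySet(8 * keys);

        // Racy phase: all threads add the same keys without synchronization.
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 1; k <= keys; k++) set.add(k);
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();

        // Retry phase: whatever was lost is simply added again.
        int lost = 0;
        for (int k = 1; k <= keys; k++) {
            if (!set.contains(k)) { lost++; set.add(k); }
        }
        for (int k = 1; k <= keys; k++) {
            if (!set.contains(k)) throw new AssertionError("missing " + k);
        }
        System.out.println("lost in racy phase, re-added afterwards: " + lost);
    }
}

class LossySet {
    private final int[] slots; // 0 marks an empty slot, so keys must be non-zero

    LossySet(int capacity) { slots = new int[capacity]; }

    /** Wait-free: at most slots.length probes, independent of other threads. */
    public boolean add(int key) {
        if (key == 0) throw new IllegalArgumentException("0 is reserved");
        for (int probe = 0; probe < slots.length; probe++) {
            int i = slot(key, probe);
            int cur = slots[i];
            if (cur == key) return false; // already present
            if (cur == 0) {
                slots[i] = key; // plain write: a racing writer may overwrite it
                return true;
            }
        }
        return false; // table full: the element is dropped
    }

    public boolean contains(int key) {
        for (int probe = 0; probe < slots.length; probe++) {
            int cur = slots[slot(key, probe)];
            if (cur == key) return true;
            if (cur == 0) return false;
        }
        return false;
    }

    // Linear probing from a hashed home slot.
    private int slot(int key, int probe) {
        int home = Math.floorMod(key * 0x9E3779B1, slots.length);
        return (home + probe) % slots.length;
    }
}
```

The driver mirrors the argument on the slide: the racy phase may lose insertions, but every retry of a missing element either finds it or inserts it, so the set converges once the retries run. How many elements are lost in a given run depends on timing and is often zero on small inputs.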
• Class of Revisiting-Resistant Algorithms [2008]: they remain correct, even with intermediate wrong answers
• E.g., the set of visited states in graph search problems

On Abstractions

Abstractions: Tear Them Down!
• Data layout is critical for performance: no mainstream language provides high-level influence over it
• Locks are not compositional
• Generic solutions are prone to underperform
• Design for debugging & profiling

Tailor-Made Implementations
What works for us:
• Build specific solutions
• Develop the right mindset (experiments and experience)
• Keep It Simple & Static (KISS): avoid frameworks, avoid memory allocation
• Reliable low-level code performance trumps high-level code with uncertain pay-offs

Retaining Sanity in Multi-Core Programming
• Tailor-Made Solutions
• Tear Down Abstractions
• KISS (Design for Debugging)
• Know Your Target Architecture
• Pitfalls: False Sharing, Convoys, Stampedes, Priority Inversion, …
• Trust Nobody (WYSINWX)
• Know Your Algorithmic Models
• Performance Analysis
• Experiments
• Throw away the expected results, explain the unexpected

Recommended Material
• [TAMP] The Art of Multiprocessor Programming. Maurice Herlihy and Nir Shavit. Morgan Kaufmann, 2008.
• Programming Erlang: Software for a Concurrent World. Joe Armstrong. Pragmatic Programmer, 2007.
• “Machine Architecture: Things Your Programming Language Never Told You”. Herb Sutter, 2007.