Design of Parallel and High-Performance Computing Fall 2013 Lecture: Locks and Lock-Free Instructor: Torsten Hoefler & Markus Püschel TA: Timo Schneider Administrivia Next week – progress presentations Make sure, Timo knows about your team (this step is important!) Send slides (ppt or pdf) by Sunday 11:59pm to Timo! 10 minutes per team (hard limit) Prepare! This is your first impression, gather feedback from us! Rough guidelines: Present your plan Related work (what exists, literature review!) Preliminary results (not necessarily) Main goal is to gather feedback, so present some details Pick one presenter (make sure to switch for other presentations!) Intermediate (very short) presentation: Thursday 11/21 during recitation Final project presentation: Monday 12/16 during last lecture 2 Review of last lecture Language memory models History Java/C++ overview Locks Two-thread Peterson N-thread Many different locks, strengths and weaknesses Lock options and parameters Formal proof methods Correctness (mutual exclusion as condition) Progress 3 DPHPC Overview 4 Goals of this lecture Hardware operations for concurrency control More on locks (using advanced operations) Spin locks Various optimized locks Even more on locks (issues and extended concepts) Deadlocks, priority inversion, competitive spinning, semaphores Case studies Barrier Reasoning about semantics Locks in practice: a set structure 5 Lamport’s Bakery Algorithm (1974) Is a FIFO lock (and thus fair) Each thread takes number in doorway and threads enter in the order of their number! volatile int flag*n+ = ,0,0,…,0-; volatile int label*n+ = ,0,0,….,0-; void lock() { flag[tid] = 1; // request label[tid] = max(label[0], ...,label[n-1]) + 1; // take ticket while ((∃k != tid)(flag[k] && (label[k],k) <* (label[tid],tid))) {}; } public void unlock() { flag[tid] = 0; } 6 Lamport’s Bakery Algorithm Advantages: Elegant and correct solution Starvation free, even FIFO fairness Not used in practice! Why? Needs to read/write N memory locations for synchronizing N threads Can we do better? Using only atomic registers/memory 7 A Lower Bound to Memory Complexity Theorem 5.1 in [1]: “If S is a [atomic] read/write system with at least two processes and S solves mutual exclusion with global progress [deadlock-freedom], then S must have at least as many variables as processes” So we’re doomed! Optimal locks are available and they’re fundamentally non-scalable. Or not? [1] J. E. Burns and N. A. Lynch. Bounds on shared memory for mutual exclusion. Information and Computation, 107(2):171–184, December 1993 8 Hardware Support? Hardware atomic operations: Test&Set Write const to memory while returning the old value Atomic swap Atomically exchange memory and register Fetch&Op Get value and apply operation to memory location Compare&Swap Compare two values and swap memory with register if equal Load-linked/Store-Conditional LL/SC Loads value from memory, allows operations, commits only if no other updates committed mini-TM Intel TSX (transactional synchronization extensions) Hardware-TM (roll your own atomic operations) 9 Relative Power of Synchronization Design-Problem I: Multi-core Processor Which atomic operations are useful? Design-Problem II: Complex Application What atomic should I use? Concept of “consensus number” C if a primitive can be used to solve the “consensus problem” in a finite number of steps (even if a threads stop) atomic registers have C=1 (thus locks have C=1!) TAS, Swap, Fetch&Op have C=2 CAS, LL/SC, TM have C=∞ 10 Test-and-Set Locks Test-and-Set semantics Memoize old value Set fixed value TASval (true) Return old value After execution: Post-condition is a fixed (constant) value! bool test_and_set (bool *flag) { bool old = *flag; *flag = true; return old; } // all atomic! 11 Test-and-Set Locks Assume TASval indicates “locked” Write something else to indicate “unlocked” TAS until return value is != TASval When will the lock be granted? volatile int lock = 0; Does this work well in practice? void lock() { while (TestAndSet(&lock) == 1); } void unlock() { lock = 0; } 12 Contention On x86, the XCHG instruction is used to implement TAS For experts: x86 LOCK is superfluous! Cacheline is read and written Ends up in exclusive state, invalidates other copies Cacheline is “thrown” around uselessly High load on memory subsystem movl $1, %eax xchg %eax, (%ebx) x86 bus lock is essentially a full memory barrier 13 Test-and-Test-and-Set (TATAS) Locks Spinning in TAS is not a good idea Spin on cache line in shared state All threads at the same time, no cache coherency/memory traffic Danger! Efficient but use with great care! Generalizations are dangerous volatile int lock = 0; void lock() { do { while (lock == 1); } while (TestAndSet(&lock) == 1); } void unlock() { lock = 0; } 14 Warning: Even Experts get it wrong! Example: Double-Checked Locking 1997 Problem: Memory ordering leads to race-conditions! 15 Contention? Do TATAS locks still have contention? When lock is released, k threads fight for cache line ownership One gets the lock, all get the CL exclusively (serially!) What would be a good solution? (think “collision avoidance”) volatile int lock = 0; void lock() { do { while (lock == 1); } while (TestAndSet(&lock) == 1); } void unlock() { lock = 0; } 16 TAS Lock with Exponential Backoff Exponential backoff eliminates contention statistically Locks granted in unpredictable order Starvation possible but unlikely How can we make it even less likely? volatile int lock = 0; void lock() { while (TestAndSet(&lock) == 1) { wait(time); time *= 2; // double waiting time } } void unlock() { lock = 0; } Similar to: T. Anderson: “The performance of spin lock alternatives for shared-memory multiprocessors”, TPDS, Vol. 1 Issue 1, Jan 1990 17 TAS Lock with Exponential Backoff Exponential backoff eliminates contention statistically Locks granted in unpredictable order Starvation possible but unlikely Maximum waiting time makes it less likely volatile int lock = 0; const int maxtime=1000; void lock() { while (TestAndSet(&lock) == 1) { wait(time); time = min(time * 2, maxtime); } } void unlock() { lock = 0; } Similar to: T. Anderson: “The performance of spin lock alternatives for shared-memory multiprocessors”, TPDS, Vol. 1 Issue 1, Jan 1990 18 Comparison of TAS Locks 19 Improvements? Are TAS locks perfect? What are the two biggest issues? Cache coherency traffic (contending on same location with expensive atomics) -- or - Critical section underutilization (waiting for backoff times will delay entry to CR) What would be a fix for that? How is this solved at airports and shops (often at least)? Queue locks -- Threads enqueue Learn from predecessor if it’s their turn Each threads spins at a different location FIFO fairness 20 Array Queue Lock Array to implement queue Tail-pointer shows next free volatile int array*n+ = ,1,0,…,0-; volatile int index*n+ = ,0,0,…,0-; volatile int tail = 0; queue position Each thread spins on own location void lock() { CL padding! index[tid] = GetAndInc(tail) % n; index[] array can be put in TLS while (!array[index[tid]]); // wait to receive lock } So are we done now? What’s wrong? Synchronizing M objects requires Θ(NM) storage What do we do now? void unlock() { array[index[tid]] = 0; // I release my lock array[(index[tid] + 1) % n] = 1; // next one } 21 CLH Lock (1993) List-based (same queue principle) Discovered twice by Craig, Landin, Hagersten 1993/94 2N+3M words N threads, M locks Requires thread-local qnode pointer Can be hidden! typedef struct qnode { struct qnode *prev; int succ_blocked; } qnode; qnode *lck = new qnode; // node owned by lock void lock(qnode *lck, qnode *qn) { qn->succ_blocked = 1; qn->prev = FetchAndSet(lck, qn); while (qn->prev->succ_blocked); } void unlock(qnode **qn) { qnode *pred = (*qn)->prev; (*qn)->succ_blocked = 0; *qn = pred; } 22 CLH Lock (1993) Qnode objects represent thread state! succ_blocked == 1 if waiting or acquired lock succ_blocked == 0 if released lock List is implicit! One node per thread Spin location changes NUMA issues (cacheless) Can we do better? typedef struct qnode { struct qnode *prev; int succ_blocked; } qnode; qnode *lck = new qnode; // node owned by lock void lock(qnode *lck, qnode *qn) { qn->succ_blocked = 1; qn->prev = FetchAndSet(lck, qn); while (qn->prev->succ_blocked); } void unlock(qnode **qn) { qnode *pred = (*qn)->prev; (*qn)->succ_blocked = 0; *qn = pred; } 23 MCS Lock (1991) Make queue explicit Acquire lock by appending to queue Spin on own node until locked is reset Similar advantages as CLH but Only 2N + M words Spinning position is fixed! Benefits cache-less NUMA What are the issues? Releasing lock spins More atomics! typedef struct qnode { struct qnode *next; int succ_blocked; } qnode; qnode *lck = NULL; void lock(qnode *lck, qnode *qn) { qn->next = NULL; qnode *pred = FetchAndSet(lck, qn); if(pred != NULL) { qn->locked = 1; pred->next = qn; while(qn->locked); }} void unlock(qnode * lck, qnode *qn) { if(qn->next == NULL) , // if we’re the last waiter if(CAS(lck, qn, NULL)) return; while(qn->next == NULL); // wait for pred arrival } qn->next->locked = 0; // free next waiter qn->next = NULL; } 24 Lessons Learned! Key Lesson: Reducing memory (coherency) traffic is most important! Not always straight-forward (need to reason about CL states) MCS: 2006 Dijkstra Prize in distributed computing “an outstanding paper on the principles of distributed computing, whose significance and impact on the theory and/or practice of distributed computing has been evident for at least a decade” “probably the most influential practical mutual exclusion algorithm ever” “vastly superior to all previous mutual exclusion algorithms” fast, fair, scalable widely used, always compared against! 25 Time to Declare Victory? Down to memory complexity of 2N+M Probably close to optimal Only local spinning Several variants with low expected contention But: we assumed sequential consistency Reality causes trouble sometimes Sprinkling memory fences may harm performance Open research on minimally-synching algorithms! Come and talk to me if you’re interested 26 More Practical Optimizations Let’s step back to “data race” (recap) two operations A and B on the same memory cause a data race if one of them is a write (“conflicting access”) and neither AB nor BA So we put conflicting accesses into a CR and lock it! This also guarantees memory consistency in C++/Java! Let’s say you implement a web-based encyclopedia Consider the “average two accesses” – do they conflict? 27 Reader-Writer Locks Allows multiple concurrent reads Multiple reader locks concurrently in CR Guarantees mutual exclusion between writer and writer locks and reader and writer locks Syntax: read_(un)lock() write_(un)lock() 28 A Simple RW Lock Seems efficient!? Is it? What’s wrong? Polling CAS! Is it fair? Readers are preferred! Can always delay writers (again and again and again) const W = 1; const R = 2; volatile int lock=0; // LSB is writer flag! void read_lock(lock_t lock) { AtomicAdd(lock, R); while(lock & W); } void write_lock(lock_t lock) { while(!CAS(lock, 0, W)); } void read_unlock(lock_t lock) { AtomicAdd(lock, -R); } void write_unlock(lock_t lock) { AtomicAdd(lock, -W); } 29 Fixing those Issues? Polling issue: Combine with MCS lock idea of queue polling Fairness: Count readers and writers (1991) The final algorithm (Alg. 4) has a flaw that was corrected in 2003! 30 Deadlocks Kansas state legislature: “When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone.” [according to Botkin, Harlow "A Treasury of Railroad Folklore" (pp. 381)] What are necessary conditions for deadlock? 31 Deadlocks Necessary conditions: Mutual Exclusion Hold one resource, request another No preemption Circular wait in dependency graph One condition missing will prevent deadlocks! Different avoidance strategies (which?) 32 Issues with Spinlocks Spin-locking is very wasteful The spinning thread occupies resources Potentially the PE where the waiting thread wants to run requires context switch! Context switches due to Expiration of time-slices (forced) Yielding the CPU 33 What is this? 34 Why is the 1997 Mars Rover in our lecture? It landed, received program, and worked … until it spuriously rebooted! watchdog Scenario (vxWorks RT OS): Single CPU Two threads A,B sharing common bus, using locks (independent) thread C wrote data to flash Priority: ACB (A highest, B lowest) Thread C would run into a lifelock (infinite loop) Thread B was preempted by C while holding lock Thread A got stuck at lock [http://research.microsoft.com/en-us/um/people/mbj/Mars_Pathfinder/Authoritative_Account.html] 35 Priority Inversion If busy-waiting thread has higher priority than thread holding lock ⇒ no progress! Can be fixed with the help of the OS E.g., mutex priority inheritance (temporarily boost priority of task in CR to highest priority among waiting tasks) 36 Condition Variables Allow threads to yield CPU and leave the OS run queue Other threads can get them back on the queue! cond_wait(cond, lock) – yield and go to sleep cond_signal(cond) – wake up sleeping threads Wait and signal are OS calls Often expensive, which one is more expensive? Wait, because it has to perform a full context switch 37 Condition Variable Semantics Hoare-style: Signaler passes lock to waiter, signaler suspended Waiter runs immediately Waiter passes lock back to signaler if it leaves critical section or if it waits again Mesa-style (most used): Signaler keeps lock Waiter simply put on run queue Needs to acquire lock, may wait again 38 When to Spin and When to Block? Spinning consumes CPU cycles but is cheap “Steals” CPU from other threads Blocking has high one-time cost and is then free Often hundreds of cycles (trap, save TCB …) Wakeup is also expensive (latency) Also cache-pollution Strategy: Poll for a while and then block 39 When to Spin and When to Block? What is a “while”? Optimal time depends on the future When will the active thread leave the CR? Can compute optimal offline schedule Actual problem is an online problem Competitive algorithms An algorithm is c-competitive if for a sequence of actions x and a constant a holds: C(x) ≤ c*Copt(x) + a What would a good spinning algorithm look like and what is the competitiveness? 40
© Copyright 2024