Lock Cohorting: A General Technique for Designing NUMA Locks
Aaron Schutza
COMP 522
November 6th, 2014

Outline
• Non-Uniform Memory Access (NUMA) architecture
• Differences between NUMA systems and uniform memory access
• Synchronization on NUMA architectures
• Lock cohorting
• A transformation for creating NUMA locks
• Different cohort lock designs
• Example of a cohort lock in action
• Empirical results
• Conclusion

NUMA Architectures
• Consequence of scaling memory bandwidth with processor count
• Not all memory is equidistant to all cores
• Typically has system-wide cache coherence
• Memory access latency depends on the distance between the core and the data location
  • Memory or cache in local socket
  • Memory or cache in remote socket
• Multiple levels of memory locality
  • 1 hop, 2 hop, 3 hop…

A Multi-Socket NUMA System
[Figure: a multi-socket NUMA system]

Synchronization on NUMA
• Problem:
  • Passing locks between threads on different sockets can be costly
  • Overhead comes from transferring both the lock and the data it protects
  • Data last accessed on a remote socket causes long-latency cache misses
• Solution:
  • Design locks to improve locality of reference
  • Encourage threads with mutual locality to acquire a given lock consecutively
• Benefits:
  • Reduces migration of locks between NUMA nodes
  • Reduces cache misses

Example: Hierarchical Backoff Lock
• Test-and-test-and-set lock with a backoff scheme that reduces cross-node contention on the lock variable
• Thread locality is used to tune the backoff delay
  • When acquiring the lock, record the acquiring thread's node ID in the lock state
  • When spin-waiting, compare own node ID with the lock holder's and back off longer when they differ
• Limitations:
  • Reduces lock migration only probabilistically
  • Lots of invalidation traffic: costly on NUMA
• For more details see Radović & Hagersten, HPCA 2003

Lock Cohorting
• Use two levels of locks:
  • A global lock
  • Local locks, one for each socket or cluster (NUMA node)
• First in socket to acquire local lock:
  • Acquire socket lock, then the global lock
  • Pass local lock to other waiters in the local node
  • Eventually relinquish global lock to give other nodes a chance
• Recipe for NUMA-aware locks without special algorithms
  • Cohorting can compose any kind of lock into a NUMA lock
  • Augments properties of cohorted locks with locality preservation
• Benefits:
  • Reduces average overhead of lock acquisition
  • Reduces interconnect traffic for lock and protected data

Global and Local Lock Properties
• Global lock G:
  • Thread-oblivious: acquiring thread can differ from releasing thread
  • Globally available to all nodes of the system
• Local lock S:
  • Cohort detection property: a thread releasing the lock can detect whether other threads are attempting to acquire it
  • Records the state of the last release as global or local
• Once S is acquired:
  • Last release was local → proceed to critical section (G is already held by this node)
  • Last release was global → try to acquire G
• Upon release of S:
  • IF alone? OR NOT may_pass_local → release globally
  • ELSE → release locally

Lock Cohorting in Action
• Suppose L is a cohort lock implemented by global lock G and socket locks S1 and S2
• t1 encounters the lock L first
[Diagram: Node 1: t1 acquiring L.]

Lock Cohorting in Action
• To enter the critical section, t1 must acquire S1 first
[Diagram: Node 1: t1 acquiring S1.]

Lock Cohorting in Action
• After t1 acquires S1, G must be acquired for t1 to enter the critical section
[Diagram: Node 1: t1 holds S1, acquiring G.]

Lock Cohorting in Action
• t1 acquires G and enters the critical section
• Subsequently t5 and t6 attempt to acquire L
[Diagram: Node 1: t1 holds S1 and G. Node 2: t5 and t6 acquiring L.]

Lock Cohorting in Action
• t5 and t6 first compete to acquire S2; t5 wins
• t6 is added to S2's cohort since it is spinning on S2
• t5 spins on G
• Threads on node 2 wait until G is released
[Diagram: Node 1: t1 holds S1 and G. Node 2: t5 holds S2 (cohort {t6}), acquiring G; t6 acquiring S2.]

Lock Cohorting in Action
• Next, t2 and t3 encounter L and first seek S1
• t2 and t3 join S1's cohort
[Diagram: Node 1: t1 holds S1 (cohort {t2, t3}) and G; t2 and t3 acquiring S1. Node 2: t5 holds S2 (cohort {t6}), acquiring G; t6 acquiring S2.]

Lock Cohorting in Action
• t1 finishes with the lock and sees that S1's cohort is not empty
• t1 releases S1 locally and G remains locked
• t2 acquires S1 and enters the critical section
[Diagram: Node 1: t2 holds G and S1 (cohort {t3}); t3 acquiring S1. Node 2: t5 holds S2 (cohort {t6}), acquiring G; t6 acquiring S2.]

Lock Cohorting in Action
• t2 finishes the critical section and sees that its cohort is still not empty
• t2 releases S1 locally and t3 acquires it
[Diagram: Node 1: t3 holds G and S1 (cohort {}). Node 2: t5 holds S2 (cohort {t6}), acquiring G; t6 acquiring S2.]

Lock Cohorting in Action
• t3 exits and sees an empty cohort for S1
• t3 releases S1 globally and then releases G
• The next thread to acquire S1 will see that it was released globally and will seek G
[Diagram: Node 1: t3 releasing G, S1 cohort {}. Node 2: t5 holds S2 (cohort {t6}), acquiring G; t6 acquiring S2.]

Lock Cohorting in Action
• Now t5 acquires G and enters the critical section
• Upon exiting, it will see a non-empty cohort for S2 and release S2 locally
[Diagram: Node 2: t5 holds S2 (cohort {t6}) and G; t6 acquiring S2.]

Lock Cohorting in Action
• t6 then enters the critical section
• Upon exiting, it will see an empty cohort for S2 and release S2 globally
[Diagram: Node 2: t6 holds G and S2 (cohort {}).]

Lock Cohorting in Action
• t6 releases G; S1 and S2 are both left in the globally released state
• "Cohorting" within a node ensures that a lock will not unnecessarily migrate between nodes 1 and 2
[Diagram: no locks held on Node 1 or Node 2.]

Cohort Lock Designs
• C-BO-BO lock
  • Global backoff (BO) lock and local backoff locks per node
• C-TKT-TKT lock
  • Global ticket (TKT) lock and local ticket locks per node
• C-BO-MCS lock
  • Global backoff lock and local Mellor-Crummey and Scott (MCS) locks per node
• C-MCS-MCS lock
• C-TKT-MCS lock
• Use of abortable locks in cohort designs needs extra features to limit aborting while in a cohort:
  • A-C-BO-BO lock
  • A-C-BO-CLH lock (queue lock of Craig, Landin, & Hagersten)

C-BO-MCS Lock in Action
• 1A acquires the local MCS lock and then acquires the global lock
[Figure credit for the C-BO-MCS slides: Dice et al.]

C-BO-MCS Lock in Action
• 2A acquires its local MCS lock and then spins on the global lock
• 1A enters the critical section

C-BO-MCS Lock in Action
• 1B and 1C add themselves to the local MCS queue

C-BO-MCS Lock in Action
• 1A exits the critical section and sees that its queue node points to 1B
• Because 1A's next pointer is not null, 1A releases the MCS lock locally and leaves the global lock untouched
• 1B is allowed to enter the critical section

C-BO-MCS Lock in Action
• 1B exits the critical section and sees that its queue node points to 1C
• The MCS lock is released locally and acquired by 1C
• 1C enters the critical section

C-BO-MCS Lock in Action
• 1C exits the critical section and sees that its next pointer is null (no local waiters)
• 1C then releases the global lock and the local lock

C-BO-MCS Lock in Action
• 2A acquires the global lock and the critical section passes to the other cluster

Empirical Results
• Dice et al. conduct experiments on benchmarks that test the performance of each lock design
• A microbenchmark, LBench, is used as a representative workload
• LBench launches identical threads
• Each thread loops as follows:
  • Acquire central lock
  • Access shared data in critical section
  • Release lock
  • ~4ms of non-critical work
• Run on an Oracle T5440 series machine
  • 256 hardware threads
  • 4 NUMA clusters
• Evaluation shows that cohort locks outperform previous locks by at least 60%

Average Throughput vs. # of Threads
[Figure: average throughput vs. number of threads, from Dice et al.]
• These results use LBench
• Similar results were found for different LBench thread settings
• MCS is the baseline scalable lock: low performance without locality awareness

Conclusions
• Lock cohorting yields an improvement over previous NUMA-aware lock designs
• Powerful lock design
  • No special locks required
• Versatility
  • Can be extended to further layers of locality
    • e.g., tile-based systems where locality is based on grid position
  • Multiple levels of lock cohorts
• Performance scaling with thread count is better with lock cohorting

References
Z. Radović and E. Hagersten. Hierarchical Backoff Locks for Nonuniform Communication Architectures. In HPCA-9, pages 241–252, Anaheim, California, USA, Feb. 2003.
D. Dice, V. J. Marathe, and N. Shavit. Lock Cohorting: A General Technique for Designing NUMA Locks. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '12), pages 247–256. ACM, New York, NY, USA, 2012. DOI: 10.1145/2145816.2145848
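
Supplementary sketch: cohort lock control flow
The acquire and release rules from the "Global and Local Lock Properties" slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from Dice et al. The class names, the explicit node argument, the waiter counter used for cohort detection, and the hand-off bound of 64 are all hypothetical choices made for this example. It relies on the fact that Python's threading.Lock may be released by a thread other than the one that acquired it, which gives the thread-oblivious property required of G. The sketch shows the control flow only and says nothing about NUMA performance.

```python
import threading


class _NodeLock:
    """Per-node state for the local lock S (hypothetical helper)."""

    def __init__(self):
        self.lock = threading.Lock()  # the local lock S
        self.meta = threading.Lock()  # protects the waiter count
        self.waiters = 0              # threads waiting on S (the cohort)
        self.global_held = False      # True if S was last released locally
        self.passes = 0               # consecutive local hand-offs so far


class CohortLock:
    """Two-level lock: one global lock G plus one local lock per node."""

    def __init__(self, num_nodes, max_local_passes=64):
        # threading.Lock can be released by any thread (thread-oblivious).
        self._global = threading.Lock()
        self._nodes = [_NodeLock() for _ in range(num_nodes)]
        self._max_local_passes = max_local_passes  # assumed fairness bound

    def acquire(self, node):
        s = self._nodes[node]
        with s.meta:
            s.waiters += 1            # announce ourselves to the cohort
        s.lock.acquire()
        with s.meta:
            s.waiters -= 1
        if not s.global_held:         # last release of S was global
            self._global.acquire()
            s.global_held = True
            s.passes = 0
        # otherwise G was inherited from the previous local holder

    def release(self, node):
        s = self._nodes[node]
        with s.meta:
            alone = s.waiters == 0    # cohort detection
        may_pass_local = s.passes < self._max_local_passes
        if alone or not may_pass_local:
            s.global_held = False     # global release: give up G
            self._global.release()
        else:
            s.passes += 1             # local release: keep G on this node
        s.lock.release()


def demo(num_nodes=2, threads_per_node=4, iters=1000):
    """Increment a shared counter under the cohort lock; return the total."""
    lock = CohortLock(num_nodes)
    counter = [0]

    def worker(node):
        for _ in range(iters):
            lock.acquire(node)
            counter[0] += 1           # critical section
            lock.release(node)

    threads = [threading.Thread(target=worker, args=(n,))
               for n in range(num_nodes) for _ in range(threads_per_node)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Calling demo() runs eight threads across two simulated nodes and returns the final counter, which equals the total number of increments when mutual exclusion holds. The waiter counter is a conservative stand-in for cohort detection: a thread that arrives just after the check simply triggers a global release and reacquires G itself.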