Lock Cohorting: A General Technique for Designing NUMA Locks
Aaron Schutza

COMP 522
November 6th, 2014
Outline
•  Non-Uniform Memory Access (NUMA) architecture
•  Differences between NUMA systems and uniform memory access
•  Synchronization on NUMA architectures
•  Lock cohorting
•  A transformation for creating NUMA locks
•  Different cohort lock designs
•  Example of a cohort lock in action
•  Empirical results
•  Conclusion
NUMA Architectures
•  Consequence of scaling memory bandwidth with
processor count
•  Not all memory is equidistant to all cores
•  Typically has system-wide cache coherence
•  Memory access latency depends on the distance between
the core and data location
•  Memory or cache in local socket
•  Memory or cache in remote socket
•  Multiple levels of memory locality
•  1 hop, 2 hop, 3 hop…
A Multi-Socket NUMA System
Synchronization on NUMA
•  Problem:
•  Passing locks between threads on different sockets can be costly
•  Overhead from passing lock and data it protects
•  Data last accessed on a remote socket causes long-latency
cache misses
•  Solution:
•  Locks can be designed to improve locality of reference
•  Encourage threads with mutual locality to acquire a given lock
consecutively
•  Benefits:
•  Reduces migration of locks between NUMA nodes
•  Reduces cache misses
Example: Hierarchical Backoff Lock
•  Test-and-test-and-set lock with a backoff scheme that reduces
cross-node contention on the lock variable
•  Thread locality is used to tune the backoff delay
•  When acquiring the lock, write the acquirer's node ID into the lock state
•  When spin-waiting, compare own node ID with the lock holder's and back off
longer when the holder is on a remote node
•  Limitations:
•  Reduces lock migration only probabilistically
•  Lots of invalidation traffic: costly for NUMA
•  For more details see Radović & Hagersten HPCA 2003
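The backoff idea above can be sketched as follows. This is a minimal illustration, not the published HBO algorithm: it assumes each thread knows its NUMA node ID, uses `threading.Lock` as a stand-in for an atomic test-and-set word, and the names `HBOLock`, `backoff_cap`, `LOCAL_CAP`, and `REMOTE_CAP` (and their values) are hypothetical.

```python
import random
import threading
import time

# Hypothetical backoff ceilings in seconds; a real lock would tune these per machine.
LOCAL_CAP = 1e-5
REMOTE_CAP = 1e-3


def backoff_cap(my_node, holder_node):
    """Short ceiling if the holder is on our node, long ceiling otherwise."""
    return LOCAL_CAP if holder_node == my_node else REMOTE_CAP


class HBOLock:
    def __init__(self):
        self._word = threading.Lock()  # stand-in for an atomic test-and-set word
        self.holder_node = None        # node ID of the current holder (advisory only)

    def acquire(self, my_node):
        delay = 1e-6
        while True:
            # Test-and-test-and-set: read first, then attempt the atomic acquire.
            if not self._word.locked() and self._word.acquire(blocking=False):
                self.holder_node = my_node
                return
            # Remote waiters back off longer, so same-node waiters tend to win.
            cap = backoff_cap(my_node, self.holder_node)
            time.sleep(random.uniform(0, min(delay, cap)))
            delay *= 2

    def release(self):
        self.holder_node = None
        self._word.release()
```

Because the preference for local waiters comes only from shorter sleeps, a remote thread can still win the lock at any time, which is why migration is reduced only probabilistically.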
Lock Cohorting
•  Use two levels of locks:
•  Global locks
•  Local locks, one for each socket or cluster (NUMA node)
•  First thread in a socket to acquire the local lock:
•  Acquire socket lock then the global lock
•  Pass local lock to other waiters in the local node
•  Eventually relinquish global lock to give other nodes a chance
•  Recipe for NUMA-aware locks without special algorithms
•  Cohorting can compose any kind of lock into a NUMA lock
•  Augments properties of cohorted locks with locality preservation
•  Benefits:
•  Reduces average overhead of lock acquisition
•  Reduces interconnect traffic for lock and protected data
Global and Local Lock Properties
•  Global lock G:
•  Thread-oblivious: acquiring thread can differ from releasing thread
•  Globally available to all nodes of the system
•  Local lock S:
•  Cohort detection property: a thread releasing the lock can detect if
there are threads attempting to acquire the lock
•  Records last state of release as global or local
•  Once S is acquired:
•  Local release → proceed to critical section
•  Global release → try to acquire G
•  Upon release of S:
•  IF alone? OR NOT may_pass_local → release globally
•  ELSE → release locally
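The acquire and release rules above can be sketched as a generic transformation. This is a minimal sketch under stated assumptions, not the authors' implementation: `threading.Lock` stands in for the underlying spin locks (it may be released by a thread other than the one that acquired it, so it is thread-oblivious), cohort detection is done with a hypothetical waiter counter, and `may_pass_local` is modeled as a bound on consecutive local handoffs (`max_local_passes`, an assumed parameter).

```python
import threading


class LocalLock:
    """Local lock S with cohort detection, built on threading.Lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count_guard = threading.Lock()
        self._waiters = 0
        self.released_globally = True  # state of the last release

    def acquire(self):
        with self._count_guard:
            self._waiters += 1
        self._lock.acquire()
        with self._count_guard:
            self._waiters -= 1

    def alone(self):
        """True if no other thread is currently trying to acquire S."""
        with self._count_guard:
            return self._waiters == 0

    def release(self, globally):
        self.released_globally = globally  # written while still holding S
        self._lock.release()


class CohortLock:
    def __init__(self, num_nodes, max_local_passes=64):
        self.global_lock = threading.Lock()  # G: any thread may release it
        self.local = [LocalLock() for _ in range(num_nodes)]
        self.passes = [0] * num_nodes        # consecutive local handoffs per node
        self.max_local_passes = max_local_passes

    def acquire(self, node):
        s = self.local[node]
        s.acquire()
        if s.released_globally:              # global release: must obtain G
            self.global_lock.acquire()
            self.passes[node] = 0
        # Local release: G was inherited from the previous holder on this node.

    def release(self, node):
        s = self.local[node]
        may_pass_local = self.passes[node] < self.max_local_passes
        if s.alone() or not may_pass_local:
            self.global_lock.release()
            s.release(globally=True)
        else:
            self.passes[node] += 1
            s.release(globally=False)
```

A thread calls `lock.acquire(node)` and `lock.release(node)` with its own node index. The handoff bound keeps one node from holding G indefinitely while other nodes wait.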
Lock Cohorting in Action
•  Suppose L is a cohort lock implemented by a global lock
G and socket locks S1 and S2
•  t1 encounters the lock L first
[Diagram: threads t1–t8 spread across Node 1 and Node 2; t1 is acquiring L.]
Lock Cohorting in Action
•  To enter the critical section t1 must acquire S1 first
[Diagram: t1 is acquiring S1.]
Lock Cohorting in Action
•  After t1 acquires S1, G must be acquired for t1 to enter
the critical section
[Diagram: t1 holds S1 and is acquiring G.]
Lock Cohorting in Action
•  t1 acquires G and enters the critical section
•  Subsequently t5 and t6 attempt to acquire L
[Diagram: t1 holds S1 and G; t5 and t6 are acquiring L.]
Lock Cohorting in Action
•  t5 and t6 first compete to acquire S2; t5 wins
•  t6 joins S2’s cohort since it is spinning on S2
•  t5 spins on G
•  Threads on node 2 wait until G is released
[Diagram: t1 holds S1 and G; t5 holds S2 (cohort {t6}) and is acquiring G; t6 is acquiring S2.]
Lock Cohorting in Action
•  Next, t2 and t3 encounter L and first seek S1
•  t2 and t3 join S1’s cohort
[Diagram: t1 holds S1 (cohort {t2, t3}) and G; t2 and t3 are acquiring S1; t5 holds S2 (cohort {t6}) and is acquiring G; t6 is acquiring S2.]
Lock Cohorting in Action
•  t1 finishes its critical section and sees that S1’s cohort is not empty
•  t1 releases S1 locally and G remains locked
•  t2 acquires S1 and enters the critical section
[Diagram: t2 holds S1 (cohort {t3}) with G still held on Node 1; t3 is acquiring S1; t5 holds S2 (cohort {t6}) and is acquiring G; t6 is acquiring S2.]
Lock Cohorting in Action
•  t2 finishes the critical section and sees its cohort is still not
empty
•  t2 releases S1 locally and t3 acquires it
[Diagram: t3 holds S1 (cohort {}) with G still held on Node 1; t5 holds S2 (cohort {t6}) and is acquiring G; t6 is acquiring S2.]
Lock Cohorting in Action
•  t3 exits and sees an empty cohort for S1
•  t3 releases S1 globally and then releases G
•  The next acquisition of S1 will see that it was released
globally and that thread will seek G
[Diagram: t3 is releasing G and S1’s cohort is {}; t5 holds S2 (cohort {t6}) and is acquiring G; t6 is acquiring S2.]
Lock Cohorting in Action
•  Now t5 acquires G and enters the critical section
•  Upon exiting it will see a non-empty cohort for S2 and
release it locally
[Diagram: t5 holds S2 (cohort {t6}) and G; t6 is acquiring S2.]
Lock Cohorting in Action
•  t6 then enters the critical section
•  Upon exiting it will see an empty cohort for S2 and
globally release it
[Diagram: t6 holds S2 (cohort {}) and G.]
Lock Cohorting in Action
•  t6 releases G; S1 and S2 are now both in the globally released
state
•  "Cohorting" within a node ensures that a lock will not
unnecessarily migrate between nodes 1 and 2
[Diagram: no thread holds G, S1, or S2.]
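The walk-through can be replayed with a small deterministic model. It is only a sketch: it assumes every listed thread is already waiting before the first release, serves nodes in order of first arrival, and uses hypothetical helper names (`cohort_order`, `migrations`) and a hypothetical handoff bound `max_local_passes`.

```python
from collections import deque


def cohort_order(arrivals, max_local_passes=64):
    """Order of critical-section entries for a list of (thread, node) arrivals."""
    cohorts = {}          # node -> threads queued on that node's local lock
    node_queue = deque()  # nodes contending for G, in order of first arrival
    for thread, node in arrivals:
        if node not in cohorts:
            cohorts[node] = deque()
            node_queue.append(node)
        cohorts[node].append(thread)

    order = []
    while node_queue:
        node = node_queue.popleft()      # this node's leader acquires G
        cohort, passes = cohorts[node], 0
        while cohort:
            order.append(cohort.popleft())
            if not cohort:
                break                    # alone: release S and G globally
            if passes == max_local_passes:
                node_queue.append(node)  # bound reached: release globally
                break
            passes += 1                  # release S locally, keep G
    return order


def migrations(order, node_of):
    """Number of times the critical section moves between nodes."""
    return sum(node_of[a] != node_of[b] for a, b in zip(order, order[1:]))


arrivals = [("t1", 1), ("t5", 2), ("t6", 2), ("t2", 1), ("t3", 1)]
node_of = dict(arrivals)
cohort = cohort_order(arrivals)
fifo = [thread for thread, _ in arrivals]
print(cohort, migrations(cohort, node_of))
print(fifo, migrations(fifo, node_of))
```

With the arrival order from the slides, the cohort schedule runs t1, t2, t3 on Node 1 before t5, t6 on Node 2, so the lock crosses nodes once. Strict arrival order would cross twice.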
Cohort Lock Designs
•  C-BO-BO lock
•  Global backoff (BO) lock and local backoff locks per node
•  C-TKT-TKT lock
•  Global ticket lock and local ticket (TKT) locks per node
•  C-BO-MCS lock
•  Global backoff lock and local Mellor-Crummey Scott (MCS) lock
•  C-MCS-MCS lock
•  C-TKT-MCS lock
•  Use of abortable locks in cohort designs needs extra
features to limit aborting while in a cohort:
•  A-C-BO-BO lock
•  A-C-BO-CLH lock (queue lock of Craig, Landin, & Hagersten)
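As an illustration of why ordinary locks fit the cohort recipe, here is a sketch of a ticket lock with a cohort-detection check. It is an assumption-laden model rather than the paper's code: Python has no atomic fetch-and-add, so a small `threading.Lock` (`_faa`, a hypothetical name) emulates it, and waiters yield with `time.sleep(0)` instead of spinning on hardware.

```python
import threading
import time


class TicketLocalLock:
    def __init__(self):
        self._next_ticket = 0          # request counter
        self._now_serving = 0          # grant counter
        self._faa = threading.Lock()   # emulates an atomic fetch-and-add

    def acquire(self):
        with self._faa:
            my_ticket = self._next_ticket
            self._next_ticket += 1
        while self._now_serving != my_ticket:
            time.sleep(0)              # yield while waiting for our turn

    def alone(self):
        """Called by the holder: True if nobody has taken a later ticket."""
        with self._faa:
            return self._next_ticket == self._now_serving + 1

    def release(self):
        self._now_serving += 1         # only the current holder writes this
```

Comparing the two counters gives cohort detection with no extra state. Since release only advances the grant counter, any thread can perform it, which is the thread-oblivious behavior a global lock needs.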
C-BO-MCS Lock in Action
•  1A acquires local MCS lock and then acquires the global
lock
[Figure credit: Dice et al.]
C-BO-MCS Lock in Action
•  2A acquires local MCS lock and then spins on the global
lock
•  1A enters the critical section
C-BO-MCS Lock in Action
•  1B and 1C add themselves to local MCS queue
C-BO-MCS Lock in Action
•  1A exits the critical section and sees that its queue node points to 1B
•  Because 1A’s successor pointer is not null, 1A releases the
MCS lock and leaves the global lock untouched
•  1B is allowed to enter the critical section
C-BO-MCS Lock in Action
•  1B exits the critical section and sees that the queue points
to 1C
•  The MCS lock is released locally and acquired by 1C
•  1C enters the critical section
C-BO-MCS Lock in Action
•  1C exits the critical section and sees that its successor
pointer is null
•  1C then releases the global lock and the local lock
C-BO-MCS Lock in Action
•  2A acquires the global lock and the critical section passes
to the other cluster
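The sequence on these slides can be sketched in code. This is a rough model under several assumptions, not the implementation from Dice et al.: `threading.Lock` with a non-blocking acquire stands in for the global test-and-set backoff lock, a per-node guard lock (`_swap`, hypothetical) emulates the atomic swap and compare-and-swap on the MCS tail, the release state is passed through a three-valued field on the successor's queue node, and `max_local_passes` is an assumed handoff bound.

```python
import random
import threading
import time

BUSY, RELEASE_LOCAL, RELEASE_GLOBAL = 0, 1, 2


class QNode:
    def __init__(self):
        self.next = None
        self.state = BUSY


class CBOMCSLock:
    def __init__(self, num_nodes, max_local_passes=64):
        self._global = threading.Lock()   # global backoff (BO) lock word
        self._tails = [None] * num_nodes  # one MCS queue tail per node
        self._swap = [threading.Lock() for _ in range(num_nodes)]  # emulates swap/CAS
        self._passes = [0] * num_nodes
        self._max = max_local_passes

    def _acquire_global(self):
        delay = 1e-6
        while not self._global.acquire(blocking=False):  # test-and-set attempt
            time.sleep(random.uniform(0, delay))
            delay = min(2 * delay, 1e-3)                 # capped exponential backoff

    def acquire(self, node_id, qnode):
        qnode.next, qnode.state = None, BUSY
        with self._swap[node_id]:                        # swap(tail, qnode)
            pred, self._tails[node_id] = self._tails[node_id], qnode
        if pred is not None:
            pred.next = qnode
            while qnode.state == BUSY:
                time.sleep(0)                            # wait for predecessor
            if qnode.state == RELEASE_LOCAL:
                return                                   # inherited the global lock
        self._acquire_global()
        self._passes[node_id] = 0

    def release(self, node_id, qnode):
        if qnode.next is not None and self._passes[node_id] < self._max:
            self._passes[node_id] += 1
            qnode.next.state = RELEASE_LOCAL             # local handoff, keep G
            return
        self._global.release()                           # let other nodes compete
        if qnode.next is None:
            with self._swap[node_id]:                    # CAS(tail, qnode, None)
                if self._tails[node_id] is qnode:
                    self._tails[node_id] = None
                    return
            while qnode.next is None:
                time.sleep(0)                            # successor is mid-enqueue
        qnode.next.state = RELEASE_GLOBAL
```

Each thread passes its own `QNode` together with its node index, for example `lock.acquire(0, q)` followed by `lock.release(0, q)`.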
Empirical Results
•  Dice et al. conduct experiments on benchmarks that test the
performance of each lock design
•  A microbenchmark LBench is used as a representative
workload
•  LBench launches identical threads
•  Each thread loops as follows:
•  Acquire central lock
•  Access shared data in critical section
•  Release lock
•  ~4 µs of non-critical work
•  Run on Oracle T5440 series machine
•  256 hardware threads
•  4 NUMA clusters
•  Evaluation shows that cohort locks outperform previous locks
by at least 60%
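The per-thread loop described above can be sketched as follows. This is not LBench itself: the function name `lbench` and its parameters are hypothetical, each thread runs a fixed number of iterations rather than for a fixed time, the lock is any object with `acquire()` and `release()`, and the length of the non-critical phase is passed in rather than assumed.

```python
import threading
import time


def lbench(lock, num_threads, iterations, non_critical_s):
    """Run an LBench-style loop; return (per-thread counts, shared counter, elapsed s)."""
    shared = [0]                  # data protected by the central lock
    counts = [0] * num_threads

    def worker(tid):
        for _ in range(iterations):
            lock.acquire()
            shared[0] += 1        # critical section: touch shared data
            lock.release()
            counts[tid] += 1
            deadline = time.perf_counter() + non_critical_s
            while time.perf_counter() < deadline:
                pass              # non-critical work: idle spin

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts, shared[0], time.perf_counter() - start


# Arbitrary illustrative settings, not the configuration used by Dice et al.
counts, total, elapsed = lbench(threading.Lock(), num_threads=4,
                                iterations=50, non_critical_s=1e-5)
throughput = total / elapsed      # lock acquisitions per second
```

Throughput here is total acquisitions divided by wall-clock time. A Python model like this illustrates the loop structure only and says nothing about NUMA effects.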
Average Throughput vs. # of Threads
•  These results use LBench
•  Similar results were found for different LBench thread settings
•  MCS is the baseline scalable lock: low performance without locality awareness
[Plot: average throughput vs. number of threads; credit: Dice et al.]
Conclusions
•  Lock cohorting yields an improvement over previous
NUMA aware lock designs
•  Powerful lock design
•  No special locks required
•  Versatility
•  Can be extended to further layers of locality
•  e.g., tile based systems where locality is based on grid position
•  Multiple levels of lock cohorts
•  Performance scaling with thread count is better with lock
cohorting
References
Z. Radović and E. Hagersten. Hierarchical Backoff Locks for Nonuniform
Communication Architectures. In HPCA-9, pages 241–252, Anaheim, California,
USA, Feb. 2003.
D. Dice, V. J. Marathe, and N. Shavit. Lock Cohorting: A General Technique for
Designing NUMA Locks. In Proceedings of the 17th ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming (PPoPP '12), pages 247–256,
New York, NY, USA, 2012. ACM. DOI: 10.1145/2145816.2145848