Thread-Level Speculation on Off-the-Shelf Hardware Transactional Memory
Rei Odaira, Takuya Nakaike
IBM Research – Tokyo
© 2014 IBM Corporation

Thread-Level Speculation (TLS) [Franklin et al.,’92]
or Speculative Multithreading (SpMT)
Speculatively parallelize a sequential program into a multithreaded program.
What is parallelization?
  – To find data-independent tasks from a program.
Why speculation?
  – Because a compiler cannot detect every data dependence.
[Figure: sequential execution of three tasks vs. TLS execution with 3 threads]

Runtime Requirements for TLS
With TLS:
  – Compiler finds probably data-independent tasks.
  – Runtime guarantees data independence among tasks.
(Minimum) runtime requirements for TLS
  – Data dependence (= conflict) detection among tasks
  – Execution rollback at a conflict
  – Ordered commit of tasks
[Figure: TLS execution with 3 threads, marking a conflict, a rollback, and ordered commit]

Hardware Transactional Memory (HTM) Coming into the Market
Blue Gene/Q, zEC12, POWER8, 4th Generation Core Processor (Haswell)
HTM supports…
  – Conflict detection among transactions
  – Execution rollback at a conflict
HTM satisfies 2/3 of the runtime requirements for TLS!
  – Task = transaction

Our Goal
How well can TLS improve the performance on real HTM hardware?
  – Used Intel 4th Generation Core Processor (Intel TSX).
  – Manually modified and measured SPEC CPU2006.

Our True Goal
How poorly can TLS improve the performance on real HTM hardware?
Because proposed TLS systems had advanced hardware support.
  – E.g. ordered transactions, data forwarding, etc.
Blue Gene/Q is the only real system supporting advanced hardware for TLS.
  – Ordered transactions
What kind of hardware support should be implemented next in the off-the-shelf HTM?

Transactional Memory
At programming/compile time
  – Enclose critical sections with transaction begin/end operations.
        xbegin();
        a->count++;
        xend();
At execution time
  – Memory operations within a transaction are observed as one step by other threads.
  – Multiple transactions are executed in parallel as long as their memory operations do not conflict.
[Figure: threads X and Y running transactions on a->count one after another, and transactions on a->count and b->count side by side]

HTM
Instruction set (Intel TSX)
  – XBEGIN: Begin a transaction
  – XEND: End a transaction
  – XABORT, etc.
        XBEGIN abort_handler
        ...
        XEND
    abort_handler:
        ...
Micro-architecture
  – Read and write sets held in CPU caches
  – Conflict detection using CPU cache coherence protocol
      Conflict detection at cache-line granularity
  – Rollback by discarding write set and restoring registers
Abort reasons:
  – Read set and write set conflict
  – Read set and write set overflow
  – External interruptions, etc.

TLS for Loops
We focus on frequently executed loops.
  – Task = iteration(s) = transaction
Why not parallelize function calls?
  – Difficult to implement TLS for function calls on HTM. (Refer to the paper for the details.)
[Figure: sequential execution of iterations 1 to 3 vs. TLS execution with 3 threads]

TLS on HTM
Enclose each iteration with XBEGIN and XEND.
Re-execute iteration in case of abort.
[Figure: iterations 1 to 3 each enclosed in XBEGIN/XEND; iteration 3 hits a conflict and is re-executed]

Ordered Transactions
Must commit in the same order as sequential execution.
  – Because data independence can be guaranteed only after all of the preceding iterations have committed.
[Figure: iterations 1 to 3 in XBEGIN/XEND transactions, illustrating commit order inversion]

Ordered Transactions by Software
Hardware support by proposed TLS systems
  – Wait until the preceding iterations commit.
Software implementation by checking commit order
  – Use a global variable to indicate the next iteration to commit.
  – Abort if cannot commit.
Why not spin-wait? Refer to our paper.
[Figure: each iteration checks "Can commit?" before XEND; iteration 3 cannot commit and is re-executed]

Our Goal
How poorly can TLS improve the performance on real HTM hardware?
What kind of hardware support should be implemented next in the off-the-shelf HTM?
  – Will hardware support for ordered transactions really help?

False Sharing due to Cache-Line Granularity Conflict Detection
    double array[];
    ...
    for (int i = ...; i < ...; i++) {
        ...
        array[i] = ...;
        ...
    }
[Figure: under TLS, writes by Threads 1, 2, and 3 land on neighboring elements of array[]; cache line = 64 bytes on x86]

Transaction Coarsening to Avoid False Sharing
[Figure: each XBEGIN/XEND transaction covers a block of consecutive iterations (1 to 8, 9 to 16, 17 to 24), so writes by Threads 1, 2, and 3 fall in separate regions of array[]]

Benchmarks and Methodology
SPEC CPU2006
  – 6 benchmarks showing more than 1.5-fold speedups with 4 threads in a previous TLS study [Packirisamy et al., 2009]
  – 429.mcf, 433.milc, 456.hmmer, 464.h264ref, 470.lbm, and 482.sphinx3
Manually modified frequently executed loops.
  – Inserted XBEGIN, XEND, and commit order checks.
  – Transformed a target loop into a doubly-nested loop for transaction coarsening
Experimental environment
  – Core i7-4770 processor (4 cores, 2-way SMT)
  – 4-GB memory
  – Linux 2.6.32-431 / GCC 4.9.0

Normalized Throughput Results
[Figure: throughput (1 = sequential, higher is better) vs. number of software threads for 429.mcf, 433.milc, 456.hmmer, 464.h264ref, 470.lbm, and 482.sphinx3, with an "SMT enabled" region marked]
Up to 11% speedups with 2 or 4 threads.
But mostly degraded the throughput.

433.milc
[Figure: two panels vs. number of software threads: throughput (1 = sequential), and abort ratio (%) broken down into Total, Order inversion, Conflict, Overflow, and Other]
Parallel program
  – Loop coverage: 23%
Commit order inversion is a dominant abort reason.
Hardware support for ordered transactions will help.
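
The loop transformation described under "Transaction Coarsening to Avoid False Sharing" and "Ordered Transactions by Software" can be sketched roughly as follows. This is a single-threaded illustration under stated assumptions, not the code used in the study: tx_begin, tx_end, can_commit, coarsening_factor, and run_coarsened are hypothetical names, the transaction primitives are no-op stand-ins so the sketch runs without TSX hardware, and the point at which next_to_commit is updated is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64 /* x86 cache-line size, as stated in the slides */

/* Hypothetical no-op stand-ins so the sketch runs without TSX hardware.
 * On a TSX machine they would correspond to the _xbegin()/_xend()
 * intrinsics from <immintrin.h>; the real abort path is not modeled. */
void tx_begin(void) {}
void tx_end(void) {}

/* Global commit-order variable: index of the next chunk allowed to commit. */
size_t next_to_commit = 0;

/* Commit-order check made just before the end of a transaction. */
int can_commit(size_t chunk) { return chunk == next_to_commit; }

/* Iterations per transaction so that one transaction writes one whole
 * cache line (assumes the array itself is 64-byte aligned). */
size_t coarsening_factor(size_t elem_size) {
    assert(elem_size > 0 && elem_size <= CACHE_LINE_BYTES);
    return CACHE_LINE_BYTES / elem_size;
}

/* Doubly nested form of: for (i = 0; i < n; i++) array[i] = 2.0 * i;
 * Returns the number of transactions (chunks) executed. */
size_t run_coarsened(double *array, size_t n) {
    size_t len = coarsening_factor(sizeof(double));
    size_t chunks = 0;
    next_to_commit = 0;
    for (size_t start = 0; start < n; start += len) { /* outer: one transaction per chunk */
        size_t chunk = start / len;
        size_t end = (start + len < n) ? start + len : n;
        do {
            tx_begin();
            for (size_t i = start; i < end; i++) /* inner: original iterations */
                array[i] = 2.0 * (double)i;
            /* With real HTM, a failed check would abort (XABORT), roll back
             * the writes, and re-execute; here it only repeats the chunk. */
        } while (!can_commit(chunk));
        tx_end();
        next_to_commit = chunk + 1; /* assumption: updated right after commit */
        chunks++;
    }
    return chunks;
}
```

With 8-byte doubles the coarsening factor is 8, which matches the blocks of 8 iterations per transaction in the coarsening figure. In a multithreaded version each thread would run a different chunk, and a thread that reaches the check out of order would abort and re-execute its chunk.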

Abort Statistics (1/2)
[Figure: abort ratio (%) vs. number of software threads for 429.mcf, 433.milc, and 456.hmmer, broken down into Total, Order inversion, Buffer overflow, Conflict, and Other]
Conflicts were a dominant abort reason in all of the benchmarks except 433.milc.

Abort Statistics (2/2)
[Figure: abort ratio (%) vs. number of software threads for 464.h264ref, 470.lbm, and 482.sphinx3, broken down into Total, Order inversion, Buffer overflow, Conflict, and Other]

Reasons for Conflicts and Possible Hardware Support
Benchmark      Conflict reason                                  Possible hardware support
429.mcf        RAW dependence                                   Data forwarding
433.milc       No
456.hmmer      RAW dependence                                   Data forwarding
464.h264ref    WAR dependence                                   Multi-version cache
               WAW dependence (false sharing by prefetching)    (Fix in prefetcher)
470.lbm        WAW dependence (false sharing)                   Word-level conflict detection
482.sphinx3    WAW dependence (false sharing by prefetching)    (Fix in prefetcher)

Examples of Read-After-Write Data Dependence
429.mcf
    static int size;
    static DATA array[N];
    func() {
        ...
        for (...) {
            ...
            if (...) {
                size++;
                array[size]->field = ...;
            }
        }
        ...
    }
456.hmmer
    for (k = 1; k <= M; k++) {
        ...
        dc[k] = dc[k-1] + ...;
        ...
    }
Hardware support already proposed in the TLS literature.
  – Data forwarding.

Example of Write-After-Read Data Dependence
464.h264ref
    for (...) {
        ...
        line = func();
        ... = line[0];
        ...
    }

    static DATA line[N];
    DATA *func() {
        ...
        line[0] = ...;
        ...
        return line;
    }
Difficult for a compiler to analyze.
  – WAR dependence across different functions in different source files.
Multi-version caches needed.

Conflicts Precede Commit Order Inversion
Commit order matters only when most of the transactions reach the committing points.
With data dependence, most of the transactions cannot run to the end.
[Figure: iterations 1 to 3 in XBEGIN/XEND transactions, marking a conflict and a commit order inversion]

Conflicts due to Prefetching
Even with transaction coarsening, conflicts still happened.
  – 464.h264ref and 482.sphinx3.
Prefetched adjacent cache lines caused conflicts.
[Figure: writes by Thread 1 and Thread 2 go to neighboring 64-byte cache lines; prefetching the adjacent line causes a conflict]

Conclusion
How well can TLS improve the performance on real HTM hardware?
  – Up to 11% speedups with 4 threads in SPEC CPU2006 on 4th Generation Core Processor.
  – But degraded throughput in most cases.
What kind of hardware support should be implemented next in the off-the-shelf HTM?
  – Hardware support for ordered transactions will help in parallel programs.
  – However, many programs contain data dependence.
Not only ordered transactions, but also other hardware facilities to avoid conflicts should be implemented.
  – (Intel should fix the adjacent cache line prefetcher!)
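
The effect described under "Conflicts due to Prefetching" can be illustrated with a little address arithmetic. The sketch below assumes that the adjacent cache line prefetcher pairs 64-byte lines into 128-byte aligned blocks and that offsets are measured from a 128-byte aligned array base; PAIR_BYTES and all function names are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64 /* conflict-detection granularity on x86 */
#define PAIR_BYTES 128      /* assumption: prefetcher pairs lines in 128-byte aligned blocks */

/* Index of the 64-byte cache line that holds a byte offset. */
size_t line_of(size_t byte_offset) { return byte_offset / CACHE_LINE_BYTES; }

/* Index of the 128-byte aligned pair of lines that holds a byte offset. */
size_t pair_of(size_t byte_offset) { return byte_offset / PAIR_BYTES; }

/* Byte offset of element `index` in an array of doubles. */
size_t offset_of_double(size_t index) { return index * sizeof(double); }

/* Plain false sharing: both writes land in the same cache line. */
int false_sharing(size_t off_a, size_t off_b) {
    return line_of(off_a) == line_of(off_b);
}

/* Different lines, but the same pair: fetching one line may pull the
 * neighbor into the cache as well, which can still raise a conflict. */
int prefetch_may_conflict(size_t off_a, size_t off_b) {
    return line_of(off_a) != line_of(off_b) && pair_of(off_a) == pair_of(off_b);
}
```

Under these assumptions, elements 7 and 8 of a double array sit in different cache lines, so coarsening to 8 iterations removes the plain false sharing between them, yet they share a 128-byte pair and may still conflict through prefetching. Elements 15 and 16 fall in different pairs and would not.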