
Thread-Level Speculation on Off-the-Shelf Hardware Transactional Memory
Rei Odaira
Takuya Nakaike
IBM Research – Tokyo
© 2014 IBM Corporation
IBM Research - Tokyo
Thread-Level Speculation (TLS) [Franklin et al.,’92]
or Speculative Multithreading (SpMT)
Speculatively parallelize a sequential program into a
multithreaded program.
What is parallelization?
– Finding data-independent tasks in a program.
Why speculation?
– Because a compiler cannot detect every data dependence.
[Figure: sequential execution runs three tasks one after another; TLS execution with 3 threads runs the three tasks in parallel.]
Runtime Requirements for TLS
With TLS:
– The compiler finds tasks that are probably data-independent.
– The runtime guarantees data independence among tasks.
(Minimum) runtime requirements for TLS
– Data dependence (= conflict) detection among tasks
– Execution rollback at a conflict
– Ordered commit of tasks
[Figure: TLS execution with 3 threads, showing a conflict, a rollback, and ordered commit.]
Hardware Transactional Memory (HTM) Coming into the Market
Blue Gene/Q
zEC12
POWER8
4th Generation Core Processor (Haswell)
HTM supports…
– Conflict detection among transactions
– Execution rollback at a conflict
HTM satisfies 2/3 of the runtime requirements for TLS!
– Task = transaction
Our Goal
How well can TLS improve the performance on real
HTM hardware?
– Used Intel 4th Generation Core Processor
(Intel TSX).
– Manually modified and measured SPEC CPU2006.
Our True Goal
How poorly can TLS improve the performance on real
HTM hardware?
Because previously proposed TLS systems relied on advanced
hardware support.
– E.g., ordered transactions and data forwarding.
Blue Gene/Q is the only real system with advanced
hardware support for TLS.
– Ordered transactions
What kind of hardware support should be
implemented next in the off-the-shelf HTM?
Transactional Memory
At programming/compile time
– Enclose critical sections with transaction begin/end operations.
    xbegin();
    a->count++;
    xend();
At execution time
– Memory operations within a transaction observed as one step by other threads.
– Multiple transactions executed in parallel as long as their memory operations do not conflict.
[Figure: Thread X and Thread Y. Transactions that both update a->count appear one after the other; transactions that update a->count and b->count run in parallel.]
HTM
Instruction set (Intel TSX)
– XBEGIN: Begin a transaction
– XEND: End a transaction
– XABORT, etc.
    XBEGIN abort_handler
    ...
    ...
    XEND
    abort_handler:
    ...
Micro-architecture
– Read and write sets held in CPU caches
– Conflict detection using the CPU cache coherence protocol,
at cache-line granularity
– Rollback by discarding write set and restoring registers
Abort reasons:
– Read set and write set conflict
– Read set and write set overflow
– External interrupts, etc.
TLS for Loops
We focus on frequently executed loops.
– Task = iteration(s) = transaction
Why not parallelize function calls?
– Difficult to implement TLS for function calls on HTM.
(Refer to the paper for the details.)
[Figure: sequential execution runs Iterations 1, 2, and 3 one after another; TLS execution with 3 threads runs Iterations 1, 2, and 3 in parallel.]
TLS on HTM
Enclose each iteration with XBEGIN and XEND.
Re-execute iteration in case of abort.
[Figure: Iterations 1, 2, and 3 each run between XBEGIN and XEND on separate threads; Iteration 3 hits a conflict and is re-executed.]
Ordered Transactions
Must commit in the same order as sequential execution.
– Because data independence can be guaranteed only after
all of the preceding iterations have committed.
[Figure: Iterations 1, 2, and 3 each run between XBEGIN and XEND on separate threads, with a commit order inversion marked.]
Ordered Transactions by Software
Hardware support by proposed TLS systems
– Wait until the preceding iterations commit.
Software implementation by checking commit order
– Use a global variable to indicate the next iteration to commit.
– Abort if cannot commit.
Why not spin-wait? Refer to our paper.
[Figure: each iteration checks "Can commit?" before XEND. Iteration 3 reaches the check before Iterations 1 and 2 have committed, so it is re-executed and checks again.]
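The commit-order check can be modeled without HTM hardware. The sketch below is a deterministic, single-threaded model and not the implementation from the paper: `next_to_commit` plays the role of the global variable, and `arrival[]` is a hypothetical list giving the order in which iterations (numbered from 0) reach their commit point. Iterations that arrive too early are counted as aborted and retried in the next round, which is a simplifying assumption about re-execution timing.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ITERS 64

/* Returns the number of aborts caused by commit order inversion.
 * arrival[] must be a permutation of 0..n-1. */
int count_order_aborts(const int *arrival, int n)
{
    assert(n >= 0 && n <= MAX_ITERS);
    bool committed[MAX_ITERS] = { false };
    int next_to_commit = 0;   /* global "next iteration to commit" */
    int aborts = 0;

    while (next_to_commit < n) {
        int before = next_to_commit;
        for (int j = 0; j < n; j++) {
            int iter = arrival[j];
            if (committed[iter])
                continue;                 /* already committed earlier */
            if (iter == next_to_commit) {
                committed[iter] = true;   /* "Can commit?" yes: XEND */
                next_to_commit++;
            } else {
                aborts++;                 /* too early: abort and retry */
            }
        }
        assert(next_to_commit > before);  /* each round makes progress */
    }
    return aborts;
}
```

In this model, iterations arriving in sequential order never abort, while every iteration that arrives ahead of its predecessors aborts once per round until its turn comes.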
Our Goal
How poorly can TLS improve the performance on real
HTM hardware?
What kind of hardware support should be
implemented next in the off-the-shelf HTM?
– Will hardware support for ordered transactions really
help?
False Sharing due to Cache-Line Granularity Conflict Detection
    double array[];
    ...
    for (int i = ...; i < ...; i++) {
        ...
        array[i] = ...;
        ...
    }
[Figure: under TLS, writes by Threads 1, 2, and 3 go to neighboring elements of array[] within the same cache line.]
Cache line = 64 bytes on x86
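The false-sharing condition comes down to simple address arithmetic. The sketch below assumes that `array` starts on a cache-line boundary and that `sizeof(double)` is 8; the function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64   /* from the slide: 64 bytes on x86 */

/* Index of the cache line holding element i of a double array whose
 * first element is assumed to be cache-line aligned. */
size_t line_of_element(size_t i)
{
    return (i * sizeof(double)) / CACHE_LINE_BYTES;
}

/* Nonzero if iterations i and j write different elements that live in
 * the same cache line, which line-granularity detection reports as a
 * conflict. */
int falsely_shares(size_t i, size_t j)
{
    assert(i != j);
    return line_of_element(i) == line_of_element(j);
}
```

With 8-byte doubles, eight consecutive elements share one line, so iterations 0 through 7 conflict with each other when they run in different transactions, while iterations 7 and 8 do not.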
Transaction Coarsening to Avoid False Sharing
[Figure: each transaction (XBEGIN to XEND) covers a block of consecutive iterations: Iterations 1 to 8, 9 to 16, and 17 to 24. Writes by Threads 1, 2, and 3 therefore go to separate regions of array[].]
Benchmarks and Methodology
SPEC CPU2006
– 6 benchmarks showing more than 1.5-fold speedups with 4
threads in a previous TLS study [Packirisamy et al., 2009]
– 429.mcf, 433.milc, 456.hmmer, 464.h264ref, 470.lbm, and
482.sphinx3
Manually modified frequently executed loops.
– Inserted XBEGIN, XEND, and commit order checks.
– Transformed a target loop into a doubly-nested loop for
transaction coarsening
Experimental environment
– Core i7-4770 processor (4 cores, 2-way SMT)
– 4-GB memory
– Linux 2.6.32-431 / GCC 4.9.0
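The doubly-nested loop used for transaction coarsening might look like the following sketch. `tx_begin` and `tx_end` are hypothetical hooks standing in for XBEGIN and for the commit-order check plus XEND; here they only count transactions so that the code runs on any machine. The loop body is a placeholder as well.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transaction hooks (stand-ins for XBEGIN / XEND). */
static int tx_count = 0;
static void tx_begin(void) { tx_count++; }
static void tx_end(void)   { }

/* Placeholder loop body: one write per iteration. */
static void body(double *array, size_t i) { array[i] = (double)i * 2.0; }

/* Coarsened loop: the outer loop steps over chunks, and the inner loop
 * runs `chunk` consecutive iterations inside a single transaction. */
void coarsened_loop(double *array, size_t n, size_t chunk)
{
    assert(chunk > 0);
    for (size_t start = 0; start < n; start += chunk) {
        size_t end = (start + chunk < n) ? start + chunk : n;
        tx_begin();
        for (size_t i = start; i < end; i++)
            body(array, i);
        tx_end();
    }
}
```

With `chunk` set to 8 and an aligned array of doubles, each transaction writes a whole 64-byte line. A larger chunk lowers per-transaction overhead but enlarges the read and write sets, which raises the risk of overflow aborts.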
Normalized Throughput Results
[Figure: throughput (1 = sequential; higher is better) versus number of software threads for 429.mcf, 433.milc, 456.hmmer, 464.h264ref, 470.lbm, and 482.sphinx3, with the SMT-enabled range marked.]
Up to 11% speedup with 2 or 4 threads.
But throughput mostly degraded.
433.milc
[Figure: left, throughput (1 = sequential) versus number of software threads; right, abort ratio (%) versus number of software threads, broken down into Total, Order inversion, Conflict, Overflow, and Other.]
Parallel program
– Loop coverage: 23%
Commit order inversion is a dominant abort reason.
Hardware support for ordered transactions will help.
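Loop coverage bounds what TLS can gain. As a rough estimate, assuming Amdahl's law applies and the covered loops parallelize perfectly with no transaction overhead (an idealization, not a measured result), the speedup S(n) with n threads at coverage p = 0.23 is at most:

```latex
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}, \qquad p = 0.23

S(2) = \frac{1}{0.77 + 0.115} \approx 1.13, \qquad
S(4) = \frac{1}{0.77 + 0.0575} \approx 1.21, \qquad
\lim_{n \to \infty} S(n) = \frac{1}{0.77} \approx 1.30
```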
Abort Statistics (1/2)
[Figure: abort ratio (%) versus number of software threads for 429.mcf, 433.milc, and 456.hmmer, broken down into Total, Order inversion, Buffer overflow, Conflict, and Other.]
Conflicts were a dominant abort reason in all of the
benchmarks except 433.milc.
Abort Statistics (2/2)
[Figure: abort ratio (%) versus number of software threads for 464.h264ref, 470.lbm, and 482.sphinx3, broken down into Total, Order inversion, Buffer overflow, Conflict, and Other.]
Conflicts were a dominant abort reason in all of the
benchmarks except 433.milc.
Reasons for Conflicts and Possible Hardware Support
Benchmark     Conflict reason                                  Possible hardware support
429.mcf       RAW dependence                                   Data forwarding
433.milc      No
456.hmmer     RAW dependence                                   Data forwarding
464.h264ref   WAR dependence                                   Multi-version cache
              WAW dependence (false sharing by prefetching)    (Fix in prefetcher)
470.lbm       WAW dependence (false sharing)                   Word-level conflict detection
482.sphinx3   WAW dependence (false sharing by prefetching)    (Fix in prefetcher)
Examples of Read-After-Write Data Dependence
429.mcf
    static int size;
    static DATA array[N];
    func() {
        ...
        for (...) {
            ...
            if (...) {
                size++;
                array[size]->field = ...;
            }
        }
        ...
    }
456.hmmer
    for (k = 1; k <= M; k++) {
        ...
        dc[k] = dc[k-1] + ...;
        ...
    }
Hardware support has already been proposed in the TLS literature.
– Data forwarding.
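To see why such a recurrence defeats speculation, compare the sequential loop with a model in which every iteration reads its input before any other iteration has written it. This is an illustrative sketch only: `delta[]` is a hypothetical stand-in for the elided term in the 456.hmmer loop, and `recurrence_stale` models what unchecked parallel execution would compute, not anything the benchmark does.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential recurrence: iteration k reads what iteration k-1 wrote. */
void recurrence_sequential(int *dc, const int *delta, size_t m)
{
    assert(dc != NULL && delta != NULL);
    for (size_t k = 1; k <= m; k++)
        dc[k] = dc[k - 1] + delta[k];
}

/* Model of unchecked speculation: every iteration reads dc[k-1] from a
 * snapshot taken before the loop, so all iterations after the first
 * consume a stale value (a read-after-write violation). */
void recurrence_stale(const int *dc_snapshot, int *dc_out,
                      const int *delta, size_t m)
{
    assert(dc_snapshot != NULL && dc_out != NULL && delta != NULL);
    dc_out[0] = dc_snapshot[0];
    for (size_t k = 1; k <= m; k++)
        dc_out[k] = dc_snapshot[k - 1] + delta[k];
}
```

Only iteration 1 gets the right answer in the stale model. HTM prevents the wrong results by aborting the later iterations, which is why conflicts dominate here; data forwarding would instead pass the value of `dc[k-1]` to the next iteration as soon as it is produced.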
Example of Write-After-Read Data Dependence
464.h264ref
    for (...) {
        ...
        line = func();
        ... = line[0];
        ...
    }

    static DATA line[N];
    DATA *func() {
        ...
        line[0] = ...;
        ...
        return line;
    }
Difficult for a compiler to analyze.
– WAR dependence across different functions in different
source files.
Multi-version caches needed.
Conflicts Precede Commit Order Inversion
Commit order matters only when most of the
transactions reach the committing points.
With data dependence, most of the transactions
cannot run to the end.
[Figure: Iterations 1, 2, and 3 each run between XBEGIN and XEND on separate threads, with a conflict and a commit order inversion marked.]
Conflicts due to Prefetching
Even with transaction coarsening, conflicts still
happened.
– 464.h264ref and 482.sphinx3.
Prefetched adjacent cache lines caused conflicts.
[Figure: Threads 1 and 2 write to different 64-byte cache lines; a prefetch of the adjacent line triggered by one thread conflicts with the other thread's writes.]
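The effect of adjacent-line prefetching on conflict detection can be approximated with the arithmetic below. It rests on an assumption about the prefetcher, namely that it fetches the other line of the same 128-byte aligned pair; the function names are illustrative, and real prefetch behavior depends on the microarchitecture and access pattern.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64

static_assert((CACHE_LINE_BYTES & (CACHE_LINE_BYTES - 1)) == 0,
              "line size must be a power of two");

/* Cache line index of a byte address. */
uint64_t line_index(uint64_t addr) { return addr / CACHE_LINE_BYTES; }

/* Assumed pairing rule: the partner is the other line of the same
 * 128-byte aligned pair, i.e. the line index with its lowest bit flipped. */
uint64_t buddy_line(uint64_t addr) { return line_index(addr) ^ 1u; }

/* Nonzero if writes to addr_a and addr_b by different threads may
 * conflict: they hit the same line, or one thread's access prefetches
 * the line the other thread writes. */
int may_conflict_with_prefetch(uint64_t addr_a, uint64_t addr_b)
{
    return line_index(addr_a) == line_index(addr_b)
        || buddy_line(addr_a) == line_index(addr_b);
}
```

Under this model, coarsening to 64-byte blocks is not enough: neighboring threads would need blocks of at least 128 aligned bytes to stay out of each other's prefetched lines.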
Conclusion
How well can TLS improve the performance on real
HTM hardware?
– Up to 11% speedups with 4 threads in SPEC CPU2006
on 4th Generation Core Processor.
– But degraded throughput in most cases.
What kind of hardware support should be
implemented next in the off-the-shelf HTM?
– Hardware support for ordered transactions will help in
parallel programs.
– However, many programs contain data dependences.
Not only ordered transactions, but also other
hardware facilities to avoid conflicts should be
implemented.
– (Intel should fix the adjacent cache line prefetcher!)