Energy-Efficient Parallel Computer Architecture
Christopher Batten
Computer Systems Laboratory
School of Electrical and Computer Engineering
Cornell University
Fall 2014
• Batten Research Group Overview •
XLOOPS: Specialization for Loop Patterns
Motivating Trends in Computer Architecture
[Figure: log-scale plot (10^0 to 10^7) versus year, 1975 to 2015, of five series: transistors (thousands), SPECint performance, frequency (MHz), typical power (W), and number of cores. Labeled processors: MIPS R2K, DEC Alpha 21264, Intel P4, AMD 4-Core Opteron, Intel 48-Core Prototype. The right edge of the plot is annotated "Parallelism?".]
Trend 1: Power & energy constrain all systems
Trend 2: Transition to multicore processors
Trend 3: Inevitable end of Moore's law
Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten
[Figure: energy efficiency (tasks per joule) versus performance (tasks per second), bounded by a design power constraint and a design performance constraint. Plotted design points: simple processor, embedded architectures, high-performance architectures, more flexible accelerator, less flexible accelerator, custom ASIC. An arrow is labeled "Flexibility vs. Specialization".]
Fine-Grain Heterogeneous Architectures
[Figure: chip floorplan of heterogeneous tiles, each with L1 instruction and data caches, connected through on-chip routers to L2 cache banks. Tile contents include out-of-order GPPs, in-order GPPs, a control processor (CP) with microthreads μT0 to μT7 behind a SIMT issue unit and SIMT memory unit, and CP/GPP tiles with instruction memory, data memory, and LD/ST units.]
Latency-Optimized Tile: few large out-of-order cores
Throughput-Optimized Tile: many small in-order cores
Data-Parallel-Optimized Tile: flexible data-parallel accelerator
Application-Optimized Tile: instruction set specialization; flexible algorithm or data-structure accelerators
Projects Within the Batten Research Group
• GPGPU Microarchitecture
  [Figure: control processor, SIMT issue unit, SIMT lanes, SIMT memory unit, L1 data cache.]
• XLOOPS: Specialization for Loop Patterns
• XPC: Explicit-Parallel-Call Architectures
      instrA
      instrB
      pcall n, func1
      ...
      psync
    func1:
      instrC
      pcall func2
      ...
      pret
• Polymorphic Hardware Specialization
  [Figure: control processor (CP) with L1 I$, control crossbar, two Poly ASUs, three Poly DSUs, memory crossbar, L1 data cache.]
• Reconfigurable Power Distribution Networks
• Complexity Effective On-Chip Networks
• Hardware Acceleration for Dynamic Languages (Scheme, Python, JIT compiler)
• Python-Based Hardware Modeling
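    class Reg (Model):
      def __init__( s ):
        # `in` is a reserved word in Python, so the input port is named in_
        s.in_ = InPort(32)
        s.out = OutPort(32)
      def elaborate( s ):
        # update the output on every rising clock edge
        @s.posedge_clk
        def seq_logic():
          s.out.next = s.in_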
Vertically Integrated Research Methodology
Our research involves reconsidering all aspects of the computing stack,
including applications, programming frameworks, compiler optimizations,
runtime systems, instruction set design, microarchitecture design, VLSI
implementation, and hardware design methodologies.
[Figure: toolflow diagram. C++ applications pass through a cross compiler to produce a binary. The design is modeled at four levels, each with its own simulator: C++ functional specification (functional simulator), C++ cycle-level model (cycle-level simulator), Verilog RTL model (RTL simulator), and gate-level model (gate-level simulator). Synthesis and place-and-route produce the gate-level model and the layout; switching activity feeds power analysis.]
Key Metrics: Cycle Count, Cycle Time, Area, Energy
Experimenting with full-chip layout, FPGA prototypes, and test chips is a key part of our research methodology.
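The key metrics listed above combine into the two axes used earlier in the talk, performance (tasks per second) and energy efficiency (tasks per joule). A minimal C++ sketch of that arithmetic follows. The struct, its field names, and the chosen units are illustrative assumptions, not part of the group's toolflow.

```cpp
#include <cassert>

// Hypothetical container for the metrics measured for one task.
struct TaskMetrics {
  double cycle_count;    // cycles needed to finish one task
  double cycle_time_ns;  // clock period in nanoseconds
  double energy_nj;      // energy consumed by one task in nanojoules
};

// Execution time = cycle count x cycle time.
double exec_time_ns(const TaskMetrics& m) {
  assert(m.cycle_count > 0 && m.cycle_time_ns > 0);
  return m.cycle_count * m.cycle_time_ns;
}

// Performance in tasks per second (1 s = 1e9 ns).
double tasks_per_second(const TaskMetrics& m) {
  return 1e9 / exec_time_ns(m);
}

// Energy efficiency in tasks per joule (1 J = 1e9 nJ).
double tasks_per_joule(const TaskMetrics& m) {
  assert(m.energy_nj > 0);
  return 1e9 / m.energy_nj;
}

// Average power in watts: nJ / ns = J / s = W.
double average_power_w(const TaskMetrics& m) {
  return m.energy_nj / exec_time_ns(m);
}
```

For example, a task that takes 1000 cycles at a 2 ns cycle time and consumes 4000 nJ runs at 500,000 tasks per second and 250,000 tasks per joule, with an average power of 2 W.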
XLOOPS: Architectural Specialization for
Inter-Iteration Loop Dependence Patterns
Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu,
Zhiru Zhang, and Christopher Batten
47th ACM/IEEE Int’l Symp. on Microarchitecture (MICRO)
Cambridge, UK, Dec. 2014
Inter-Iteration Loop Dependence Patterns
[Figure: iterations 0, 1, 2, 3, ..., n-1 of a loop shown side by side; each iteration executes inst0, inst1, inst2, inst3, ..., branch.]
Explicit loop specialization (XLOOPS) elegantly encodes inter-iteration
loop dependence patterns in the instruction set and enables
traditional, specialized, and adaptive execution.
XLOOPS ISA: Unordered Concurrent
    for ( i=0; i<N; i++ )
      C[i] = A[i] * B[i];

    loop:
      lw        r2, 0(rA)
      lw        r3, 0(rB)
      mul       r4, r2, r3
      sw        r4, 0(rC)
      addiu.xi  rA, 4
      addiu.xi  rB, 4
      addiu.xi  rC, 4
      addiu     r1, r1, 1
      xloop.uc  r1, rN, loop

[Figure: iterations 0 through 3 shown side by side; each executes inst0, inst1, inst2, inst3, ..., xloop.uc.]
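A minimal C++ sketch of why the loop above qualifies as unordered concurrent: no iteration reads a value that another iteration writes, so any execution order gives the same result. The function names are hypothetical, and reverse order is used only as a simple stand-in for "any order".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Loop from the slide, C[i] = A[i] * B[i], executed in program order.
std::vector<int> multiply_in_order(const std::vector<int>& a,
                                   const std::vector<int>& b) {
  assert(a.size() == b.size());
  std::vector<int> c(a.size());
  for (std::size_t i = 0; i < a.size(); i++)
    c[i] = a[i] * b[i];
  return c;
}

// The same iterations executed from last to first.
std::vector<int> multiply_reversed(const std::vector<int>& a,
                                   const std::vector<int>& b) {
  assert(a.size() == b.size());
  std::vector<int> c(a.size());
  for (std::size_t i = a.size(); i-- > 0;)
    c[i] = a[i] * b[i];
  return c;
}
```

Both functions return identical vectors for any pair of equally sized inputs, which is the property that lets hardware run such iterations concurrently.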
XLOOPS ISA: Ordered-Through-Registers
    for ( X=0, i=0; i<N; i++ ) {
      X += A[i];
      B[i] = X;
    }

    loop:
      lw        r2, 0(rA)
      addu      rX, r2, rX
      sw        rX, 0(rB)
      addiu.xi  rA, 4
      addiu.xi  rB, 4
      addiu     r1, r1, 1
      xloop.or  r1, rN, loop

[Figure: iterations 0 through 3 shown side by side; each executes inst0, inst1, inst2, inst3, ..., xloop.or.]
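The running-sum loop above carries X from each iteration to the next through a register (rX in the assembly). The C++ sketch below illustrates that dependence by splitting the loop into two chunks and forwarding the carried value between them. The chunking and the function names are assumptions made for illustration; they do not describe how XLOOPS hardware schedules iterations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Runs iterations [begin, end) of the slide's loop, starting from the
// carried-in value x_in, and returns the value carried out.
int running_sum_range(const std::vector<int>& a, std::vector<int>& b,
                      std::size_t begin, std::size_t end, int x_in) {
  assert(begin <= end && end <= a.size() && b.size() == a.size());
  int x = x_in;  // plays the role of rX
  for (std::size_t i = begin; i < end; i++) {
    x += a[i];
    b[i] = x;
  }
  return x;
}

// Whole loop as two chunks: the second chunk cannot produce correct sums
// until it receives x from the first.
std::vector<int> running_sum_two_chunks(const std::vector<int>& a) {
  std::vector<int> b(a.size());
  std::size_t mid = a.size() / 2;
  int x = running_sum_range(a, b, 0, mid, 0);
  running_sum_range(a, b, mid, a.size(), x);
  return b;
}
```

Starting the second chunk with 0 instead of the forwarded value yields wrong results, which is why this pattern needs ordering through the register while the loads and address updates remain independent.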
XLOOPS ISA: Ordered-Through-Memory
    for ( i=K; i<N; i++ )
      A[i] = A[i] * A[i-K];

    # r1 = rK
    # r3 = rA + 4*rK
    loop:
      lw        r4, 0(r3)
      lw        r5, 0(rA)
      mul       r6, r4, r5
      sw        r6, 0(r3)
      addiu.xi  r3, 4
      addiu.xi  rA, 4
      addiu     r1, r1, 1
      xloop.om  r1, rN, loop

[Figure: iterations 0 through 3 shown side by side; each executes inst0, inst1, inst2, inst3, ..., xloop.om.]
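In the loop above, iteration i reads A[i-K], which iteration i-K wrote, so the dependence flows through memory at a distance of K iterations. The C++ sketch below illustrates that distance: any K consecutive iterations are mutually independent, so running each group of K in reverse order still matches the in-order result. The chunked schedule and function names are assumptions for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Loop from the slide, executed in program order.
std::vector<int> scale_in_order(std::vector<int> a, std::size_t k) {
  assert(k > 0);
  for (std::size_t i = k; i < a.size(); i++)
    a[i] = a[i] * a[i - k];
  return a;
}

// Same loop in chunks of k iterations. Within a chunk, iteration i reads
// a[i - k], which belongs to the previous chunk, so the order inside a
// chunk does not matter; here each chunk runs backwards.
std::vector<int> scale_by_chunks(std::vector<int> a, std::size_t k) {
  assert(k > 0);
  for (std::size_t start = k; start < a.size(); start += k) {
    std::size_t end = std::min(start + k, a.size());
    for (std::size_t i = end; i-- > start;)
      a[i] = a[i] * a[i - k];
  }
  return a;
}
```

With A = {2, 3, 1, 1, 1, 1} and K = 2, both functions return {2, 3, 2, 3, 2, 3}.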
XLOOPS ISA: Unordered Atomic
    for ( i=0; i<N; i++ ) {
      B[ A[i] ]++;
      D[ C[i] ]++;
    }

    loop:
      lw        r6, 0(rA)
      lw        r7, 0(r6)
      addiu     r7, r7, 1
      sw        r7, 0(r6)
      addiu.xi  rA, rA, 4
      ...
      addiu     r1, r1, 1
      xloop.ua  r1, rN, loop

[Figure: iterations shown in the order 0, 1, 3, 2; each executes inst0, inst1, inst2, inst3, ..., xloop.ua.]
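The histogram-style updates above may collide when two iterations pick the same bin, so iterations can run in any order but each update must be atomic. The C++ sketch below shows the software analogue with `std::atomic` and `std::thread`. It simplifies the slide's loop to a single histogram, and the function name, the round-robin split of iterations across threads, and the thread count are assumptions for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Counts how often each key occurs. Iterations are spread round-robin over
// num_threads threads; atomic increments keep colliding updates correct.
std::vector<int> histogram(const std::vector<std::size_t>& keys,
                           std::size_t num_bins, unsigned num_threads) {
  assert(num_threads > 0);
  std::vector<std::atomic<int>> counts(num_bins);
  for (auto& c : counts) c.store(0);

  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; t++) {
    workers.emplace_back([&keys, &counts, num_threads, t] {
      for (std::size_t i = t; i < keys.size(); i += num_threads) {
        assert(keys[i] < counts.size());
        counts[keys[i]].fetch_add(1);  // atomic read-modify-write
      }
    });
  }
  for (auto& w : workers) w.join();

  std::vector<int> result(num_bins);
  for (std::size_t b = 0; b < num_bins; b++) result[b] = counts[b].load();
  return result;
}
```

The result does not depend on how the threads interleave, because increments commute and each one is indivisible. On some toolchains this needs the `-pthread` flag to link.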
XLOOPS ISA: Dynamic Bound
[Figure: iterations 0 through 5 shown side by side; each executes inst0, inst1, inst2, inst3, ..., xloop.db.]
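The slide gives no source code for this pattern. Assuming that "dynamic bound" refers to loops whose iteration count can grow while the loop is running, a worklist traversal is a typical example. The C++ sketch below is such an illustration; the graph representation and the function name are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the vertices reachable from source, in the order they are found.
// The loop bound worklist.size() is re-read every iteration and grows
// whenever an iteration discovers a new vertex.
std::vector<std::size_t> reachable_from(
    const std::vector<std::vector<std::size_t>>& adj, std::size_t source) {
  assert(source < adj.size());
  std::vector<bool> seen(adj.size(), false);
  std::vector<std::size_t> worklist{source};
  seen[source] = true;

  for (std::size_t i = 0; i < worklist.size(); i++) {
    std::size_t v = worklist[i];  // copy, since push_back may reallocate
    for (std::size_t next : adj[v]) {
      assert(next < adj.size());
      if (!seen[next]) {
        seen[next] = true;
        worklist.push_back(next);  // extends the loop bound
      }
    }
  }
  return worklist;
}
```

For the graph with edges 0→1, 0→2, 1→3, 2→3, and 4→0, starting from vertex 0 the loop begins with a bound of one iteration and ends after four, returning {0, 1, 2, 3}.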
XLOOPS Compiler
Kernel implementing Floyd-Warshall shortest path algorithm

    for ( int k = 0; k < n; k++ )
      #pragma xloops ordered
      for ( int i = 0; i < n; i++ )
        #pragma xloops unordered
        for ( int j = 0; j < n; j++ )
          path[i][j] = min( path[i][j], path[i][k] + path[k][j] );

Kernel implementing greedy algorithm for maximal matching

    #pragma xloops ordered
    for ( int i = 0; i < num_edges; i++ ) {
      int v = edges[i].v; int u = edges[i].u;
      if ( vertices[v] < 0 && vertices[u] < 0 ) {
        vertices[v] = u; vertices[u] = v; out[k++] = i;
      }
    }
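The maximal-matching kernel above is annotated as ordered because each iteration reads `vertices[]` entries that earlier iterations may have written. A self-contained C++ sketch of the same kernel follows, with the missing declarations filled in. The `Edge` struct, the function signature, and the use of -1 for "unmatched" are assumptions; the pragma appears only as a comment so the code compiles without the XLOOPS compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Edge { int v; int u; };

// Greedily adds each edge whose endpoints are both still unmatched and
// returns the indices of the chosen edges.
std::vector<int> greedy_matching(const std::vector<Edge>& edges,
                                 int num_vertices) {
  assert(num_vertices >= 0);
  std::vector<int> vertices(static_cast<std::size_t>(num_vertices), -1);
  std::vector<int> out;

  // #pragma xloops ordered  (annotation from the slide)
  for (std::size_t i = 0; i < edges.size(); i++) {
    int v = edges[i].v;
    int u = edges[i].u;
    assert(v >= 0 && v < num_vertices && u >= 0 && u < num_vertices);
    if (vertices[v] < 0 && vertices[u] < 0) {
      vertices[v] = u;  // later iterations observe these writes
      vertices[u] = v;
      out.push_back(static_cast<int>(i));
    }
  }
  return out;
}
```

On the path 0-1, 1-2, 2-3 with four vertices, edge 0 is taken, edge 1 is rejected because vertex 1 is already matched, and edge 2 is taken, so the function returns {0, 2}.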
XLOOPS Microarchitecture
• Elegant hardware/software abstraction enables efficient traditional execution on general-purpose processor
• Loop-pattern specialization unit enables specialized execution for improved performance and efficiency
• Adaptive execution uses two profiling phases to adaptively determine best location to run loop

[Figure: block diagram. A GPP (GPR RF 32 × 32b, 2r2w; SLFU; L1 I$ 16 KB; L1 D$ 16 KB) is attached to a lane management unit with DBN that drives lanes 0, 1, ..., 3. Each lane has an IDQ, an instruction buffer (128×), a lane RF (24 × 32b, 2r2w), a CIB (8×), an SLFU, and an LSQ (16×). Shared resources: LLFU, D$ request/response crossbar, L2 request and response crossbars.]
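The adaptive-execution bullet above can be read as a measure-then-choose policy. The C++ sketch below is one hypothetical interpretation: it assumes each profiling phase reports how many iterations it completed and how many cycles it took, and that the rest of the loop then runs wherever iterations were cheaper. The type names, the cost metric, and the tie-breaking rule are assumptions for illustration; the slide does not specify them.

```cpp
#include <cassert>
#include <cstdint>

enum class Target { GPP, LPSU };

// Hypothetical result of one profiling phase.
struct ProfilePhase {
  std::uint64_t iterations;  // loop iterations completed during the phase
  std::uint64_t cycles;      // cycles the phase took
};

// Picks the target with the lower cycles-per-iteration cost.
// Cross-multiplying compares gpp.cycles / gpp.iterations against
// lpsu.cycles / lpsu.iterations without division.
// Ties go to the GPP, an arbitrary choice for this sketch.
Target choose_target(const ProfilePhase& gpp, const ProfilePhase& lpsu) {
  assert(gpp.iterations > 0 && lpsu.iterations > 0);
  bool gpp_no_slower =
      gpp.cycles * lpsu.iterations <= lpsu.cycles * gpp.iterations;
  return gpp_no_slower ? Target::GPP : Target::LPSU;
}
```

For example, if 100 iterations take 400 cycles on the GPP and 150 cycles on the LPSU, the sketch selects the LPSU.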
XLOOPS Cycle-Level Performance Results
[Figure: bar chart of speedup normalized to a simple RISC core (vertical axis 1.0 to 4.0) for three configurations: OOO 2-Way, OOO 4-Way, and OOO 2-Way with LPSU. Benchmarks are grouped by loop type: xloop.uc, xloop.or, xloop.{om,orm}, xloop.ua, xloop.*.db.]
• XLOOPS vs. Simple Core: Always higher performance
• XLOOPS vs. OOO 2-way: Generally competitive or higher performance
• XLOOPS vs. OOO 4-way: Mixed results
XLOOPS Cycle-Level Energy-Efficiency Results
[Figure: three scatter plots of normalized energy efficiency versus normalized performance, one per baseline: XLOOPS vs. Simple RISC Core, XLOOPS vs. OOO 2-Way, XLOOPS vs. OOO 4-Way.]
• XLOOPS vs. Simple Core: Competitive energy efficiency
• XLOOPS vs. OOO 2-way: Always more energy efficient
• XLOOPS vs. OOO 4-way: Always more energy efficient
VLSI Implementation
[Figure: chip layout with labeled blocks: scalar processor, loop pattern specialization unit with four L0 instruction buffers, 32b integer mul/div unit, 32b IEEE floating point unit, ICache (16KB SRAM for cache lines, plus tags), DCache (16KB SRAM for cache lines, plus tags).]
• RISC scalar processor with 4-lane LPSU
• Implemented in TSMC 40 nm using ASIC CAD toolflow
• ≈40% extra area compared to simple RISC processor
• TSMC 40 nm standard-cell-based implementation
• Supports xloop.uc
Take-Away Points
• We envision future computing systems using
  fine-grain heterogeneous architectures with a mix
  of tiles optimized for latency, throughput, classes
  of applications, or even a specific application
• We are using a vertically integrated research
  methodology to explore a variety of projects that
  will hopefully enable such fine-grain
  heterogeneous architectures