Energy-Efficient Parallel Computer Architecture Christopher Batten Computer Systems Laboratory School of Electrical and Computer Engineering Cornell University Fall 2014 • Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns Motivating Trends in Computer Architecture 7 ? Intel 48-Core Prototype AMD 4-Core Opteron Intel P4 10 6 10 5 Transistors (Thousands) Parallelism? SPECint Performance 10 DEC Alpha 4 21264 10 3 10 Frequency (MHz) MIPS R2K 2 10 Typical Power (W) 1 10 Number of Cores 0 Trend 1 Power & energy constrain all systems Trend 2 Transition to multicore processors Trend 3 Inevitable end of Moore’s law 10 1975 1980 1985 1990 1995 2000 2005 2010 2015 Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten Cornell University Christopher Batten 2 / 19 XLOOPS: Specialization for Loop Patterns Design Performance Constraint ci al iz at io n bi lit y More Flexible Accelerator .S pe Less Flexible Accelerator vs Embedded Architectures Custom ASIC Fl ex i Energy Efficiency (Tasks per Joule) • Batten Research Group Overview • Design Power Constraint Simple Processor High-Performance Architectures Performance (Tasks per Second) Cornell University Christopher Batten 3 / 19 • Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns Fine-Grain Heterogeneous Architectures L1 I$ L1 I$ OoO GPP OoO GPP L1 I$ L1 D$ L1 I$ L1 D$ GPP L1 D$ L1 D$ L1 I$ L1 D$ GPP L1 I$ L1 D$ L1 I$ GPP L1 D$ L1 I$ GPP On-Chip Router L2 Cache Bank L2 Cache Bank L2 Cache Bank L2 Cache Bank On-Chip Router On-Chip Router μT0 μT4 μT1 μT5 μT2 μT6 SIMT Memory Unit L1 I$ Cornell University L1 D$ Throughput-Optimized Tile: many small in-order cores Data-Parallel-Optimized Tile: flexible data-parallel accelerator Instruction Memory SIMT Issue Unit μT3 μT7 CP LD ST Latency-Optimized Tile: few large out-of-order cores GPP L1 D$ On-Chip Router CP GPP ST Application-Optimized Tile: instruction set specialization; flexible algorithm or data-structure accelerators Data Memory Christopher Batten 4 / 19 • Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns Projects Within the Batten Research Group GPGPU Microarchitecture XLOOPS: Specialization XPC: Explicit-Parallel- Polymorphic Hardware Specialization Call Architectures for Loop Patterns Ctrl Proc SIMT Issue Unit SIMT Lanes SIMT Mem Unit L1 Data Cache Reconfigurable Power Distribution Networks Complexity Effective On-Chip Networks instrA instrB pcall n, func1 ... psync func1: instrC pcall func2 ... pret Hardware Acceleration for Dynamic Languages Scheme Python JIT Compiler Control CP L1 I$ Control Crossbar Poly ASU Poly ASU Poly DSU Poly DSU Poly DSU Memory Crossbar L1 Data Cache Python-Based Hardware Modeling class Reg (Model): def __init__( s ): s.in = InPort(32) s.out = OutPort(32) def elaborate( s ): @s.posedge_clk def seq_logic(): s.out.next = s.in Cornell University Christopher Batten 5 / 19 • Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns Vertically Integrated Research Methodology Our research involves reconsidering all aspects of the computing stack including applications, programming frameworks, compiler optimizations, runtime systems, instruction set design, microarchitecture design, VLSI implementation, and hardware design methodologies C++ Applications Cross Compiler Verilog RTL Model RTL Simulator Binary Functional Simulator Synthesis Place&Route Gate-Level Model Cycle-Level Simulator Gate-Level Simulator C++ Functional C++ Cycle-Level Switching Activity Specification Model Key Metrics: Cycle Count, Cycle Time, Area, Energy Cornell University Experimenting with full-chip layout, FPGA prototypes, and test chips is a key part of our research methodology Layout Power Analysis Christopher Batten 6 / 19 Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns • XLOOPS: Architectural Specialization for Inter-Iteration Loop Dependence Patterns Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu, Zhiru Zhang, and Christopher Batten 47th ACM/IEEE Int’l Symp. on Microarchitecture (MICRO) Cambridge, UK, Dec. 2014 Cornell University Christopher Batten 7 / 19 • XLOOPS: Specialization for Loop Patterns • Batten Research Group Overview Inter-Iteration Loop Dependence Patterns Iteration 0 Iteration 1 Iteration 2 Iteration 3 Iteration n-1 inst0 inst1 inst2 inst3 ... branch inst0 inst1 inst2 inst3 ... branch inst0 inst1 inst2 inst3 ... branch inst0 inst1 inst2 inst3 ... branch inst0 inst1 inst2 inst3 ... branch Explicit loop specialization (XLOOPS) elegantly encodes inter-iteration loop dependence patterns in the instruction set and enables traditional, specialized, and adaptive execution. Cornell University Christopher Batten 8 / 19 Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns • XLOOPS ISA: Unordered Concurrent for ( i=0; i<N; i++ ) C[i] = A[i] * B[i] loop: lw lw mul sw addiu.xi addiu.xi addiu.xi addiu xloop.uc Cornell University r2, r3, r4, r4, rA, rB, rC, r1, r1, 0(rA) 0(rB) r2, r3 0(rC) 4 4 4 r1, 1 rN, loop Iteration 0 inst0 inst1 inst2 inst3 ... xloop.uc Christopher Batten Iteration 1 inst0 inst1 inst2 inst3 ... xloop.uc Iteration 2 inst0 inst1 inst2 inst3 ... xloop.uc Iteration 3 inst0 inst1 inst2 inst3 ... xloop.uc 9 / 19 Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns • XLOOPS ISA: Ordered-Through-Registers for ( X=0, i=0; i<N; i++ ) X += A[i] B[i] = X loop: lw addu sw addiu.xi addiu.xi addiu xloop.or Cornell University r2, rX, rX, rA, rB, r1, r1, 0(rA) r2, rX 0(rB) 4 4 r1, 1 rN, loop Iteration 0 inst0 inst1 inst2 inst3 ... xloop.or Christopher Batten Iteration 1 inst0 inst1 inst2 inst3 ... xloop.or Iteration 2 inst0 inst1 inst2 inst3 ... xloop.or Iteration 3 inst0 inst1 inst2 inst3 ... xloop.or 10 / 19 Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns • XLOOPS ISA: Ordered-Through-Memory for ( i=K; i<N; i++ ) A[i] = A[i] * A[i-K] # r1 = rK # r3 = rA loop: lw lw mul sw addiu.xi addiu.xi addiu xloop.om Cornell University + 4*rK r4, r5, r6, r6, r3, rA, r1, r1, Iteration 0 inst0 inst1 inst2 inst3 ... xloop.om 0(r3) 0(rA) r4, r5 0(r3) 4 4 r1, 1 rN, loop Christopher Batten Iteration 1 inst0 inst1 inst2 inst3 ... xloop.om Iteration 2 inst0 inst1 inst2 inst3 ... xloop.om Iteration 3 inst0 inst1 inst2 inst3 ... xloop.om 11 / 19 Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns • XLOOPS ISA: Unordered Atomic for ( i=0; i<N; i++ ) B[ A[i] ]++ D[ C[i] ]++ loop: lw lw addiu sw addiu.xi ... addiu xloop.ua Cornell University r6, r7, r7, r7, rA, 0(rA) 0(r6) r7, 1 0(r6) rA, 4 Iteration 0 inst0 inst1 inst2 inst3 ... xloop.ua Iteration 1 inst0 inst1 inst2 inst3 ... xloop.ua Iteration 3 inst0 inst1 inst2 inst3 ... xloop.ua Iteration 2 inst0 inst1 inst2 inst3 ... xloop.ua r1, r1, 1 r1, rN, loop Christopher Batten 12 / 19 • XLOOPS: Specialization for Loop Patterns • Batten Research Group Overview XLOOPS ISA: Dynamic Bound Iteration 0 inst0 inst1 inst2 inst3 ... xloop.db Cornell University Iteration 1 inst0 inst1 inst2 inst3 ... xloop.db Iteration 2 inst0 inst1 inst2 inst3 ... xloop.db Iteration 3 inst0 inst1 inst2 inst3 ... xloop.db Christopher Batten Iteration 4 inst0 inst1 inst2 inst3 ... xloop.db Iteration 5 inst0 inst1 inst2 inst3 ... xloop.db 13 / 19 Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns • XLOOPS Compiler Kernel implementing Floyd-Warshall shortest path algorithm for ( int k = 0; k < n; k++ ) #pragma xloops ordered for ( int i = 0; i < n; i++ ) #pragma xloops unordered for ( int j = 0; j < n; j++ ) path[i][j] = min( path[i][j], path[i][k] + path[k][j] ); Kernel implementing greedy algorithm for maximal matching #pragma xloops ordered for ( int i = 0; i < num_edges; i++ ) int v = edges[i].v; int u = edges[i].u; if ( vertices[v] < 0 && vertices[u] < 0 ) vertices[v] = u; vertices[u] = v; out[k++] = i; Cornell University Christopher Batten 14 / 19 • XLOOPS: Specialization for Loop Patterns • Batten Research Group Overview XLOOPS Microarchitecture I Elegant hardware/software Lane Management Unit DBN GPP GPR RF 32 × 32b 2r2w Lane 0 IDQ Lane 1 IDQ Lane 3 IDQ Inst Buf 128× Inst Buf 128× Inst Buf 128× Lane RF 24 × 32b 2r2w Lane RF 24 × 32b 2r2w Lane RF 24 × 32b 2r2w CIB 8× CIB 8× CIB 8× SLFU SLFU SLFU LSQ 16× LSQ 16× LSQ 16× abstraction enables efficient traditional execution on general-purpose processor SLFU LLFU D$ Request/Response Crossbar L1 I$ 16 KB I Loop-pattern specialization unit enables specialized execution for improved performance and efficiency I Adaptive execution uses two profiling phases to adaptively determine best location to run loop L1 D$ 16 KB L2 Request and Response Crossbars Cornell University Christopher Batten 15 / 19 Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns • XLOOPS Cycle-Level Performance Results OOO 4-Way OOO 2-Way with LPSU 4.0 3.5 3.0 2.5 2.0 1.5 1.0 rg b2 cm sg yk em -u ss m c ea -u r c sy chm uc m vi -u te c rb iw uc ad arpc uc m co -o va r di r-o km the r ea r-or ns sh or s a dy ym -or np mro or gks k om ac nn ks k-sm om ac k- om lg w om a m r-om st men or ci m lbt orm re hs e-u hu or a ffm t-u an a rs -ua o bf rt-u qs s-u a or ct-u db cdb Speedup Normalized to Simple RISC Core OOO 2-Way xloop.uc xloop.or xloop.{om,orm} xloop.ua xloop.*.db I XLOOPS vs. Simple Core : Always higher performance I XLOOPS vs. OOO 2-way : Generally competitive or higher performance I XLOOPS vs. OOO 4-way : Mixed results Cornell University Christopher Batten 16 / 19 Batten Research Group Overview • XLOOPS: Specialization for Loop Patterns • XLOOPS Cycle-Level Energy-Efficiency Results Normalized Energy Efficiency XLOOPS vs. Simple RISC Core XLOOPS vs. OOO 2-Way XLOOPS vs. OOO 4-Way 3.5 3.0 2.5 2.0 1.5 1.0 0.5 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 0.5 1.0 1.5 2.0 2.5 3.0 Normalized Performance Normalized Performance 0.5 1.0 1.5 2.0 2.5 Normalized Performance I XLOOPS vs. Simple Core : Competitive energy efficiency I XLOOPS vs. OOO 2-way : Always more energy efficient I XLOOPS vs. OOO 4-way : Always more energy efficient Cornell University Christopher Batten 17 / 19 Batten Research Group Overview DCache 16KB SRAM for Cache Lines • XLOOPS: Specialization for Loop Patterns • 32b Integer Mul/Div Unit DCache Tags ICache 16KB SRAM for Cache Lines ICache Tags Loop Pattern Specialization Unit 32b IEEE Floating Point Unit I RISC scalar processor with 4-lane LPSU I Implemented in TSMC 40 nm using ASIC CAD toolflow I ≈40% extra area compared to simple RISC processor Scalar Processor Cornell University I TSMC 40 nm standard-cell-based implementation I Supports xloop.uc L0 L0 Instr Instr Buffer Buffer L0 L0 Instr Instr Buffer Buffer VLSI Implementation Christopher Batten 18 / 19 Batten Research Group Overview XLOOPS: Specialization for Loop Patterns Take-Away Points I We envision future computing systems using fine-grain heterogeneous architectures with a mix of tiles optimized for latency, throughput, classes of applications, or even a specific application I We are using a vertically integrated research methodology to explore a variety of projects that will hopefully enable such fine-grain heterogeneous architectures Cornell University Christopher Batten 19 / 19
© Copyright 2024