Designing energy-‐‑efficient microprocessor: How to fight process variations Ruzica Jevtic Berkeley Wireless Research Center 3 Moore’s law )*01/-5(627()8920.:;(<:/- !""#$%&'()*'+ ,--. 10 10000 Nominal feature size Transistors Per Die 1010 109 108 107 106 105 104 0? 103 102 101 01 ,1 .0,! ,./! 0,2! >9)8:;<4', 5#"6$&&"# /7! >9)8:;<4'5#"6$&&"# 0/! 5$89:;<= 7'5#"6$&&"# 7! 5$89:;<= >>>'5#"6$&&"# 0! 5$89:;<='>>'5#"6$&&"# ,./? 5$89:;<=''5#"6$&&"# /7? 72/4'5#"6$&&"# 0/? 32/4'5#"6$&&"# 7? 2-,2/ 2-2- 2-2/ 2--2 1965 Data (Moore) 7--7 Memory Microprocessor 1965 1970 1975 1980 1985 Graph from S.Chou, ISSCC’2005 1990 1995 2000 2005 Pm Gate Length 0.1 0.7X every 2 years 130nm 90nm 65nm 45nm 32nm 22nm 1000 nm 100 70nm 50nm 35nm ~30nm 10 1970 2010 Source: Intel 250nm 180nm 0.01 100 1960 1 0@ 1980 1990 2000 2010 2020 )8920.:;(>:/-(;-1>/8(?(1+@01:;(A-:/B*-(20C-(:A/-*(331@D Source: Intel, IEDM presentations Scaling consequences: • !""#$%&'A)*')8B'6"&9 Process variations • Power has become critical! (c) Ruzica Jevtic 2012 3 Scaling consequences • Fabrication: - Litography with liquids - Use of difraction… • Design rules: - No corners - One direction lines… • Additional steps: - Optical Proximity Check • Consequences: - Random dopant fluctuations - Hot-carrier trapping… (c) Ruzica Jevtic 2012 J. Hartmann, ISSCC 2007 Litography +,"-./0&$-1*!)&4,'/ 10 • Litography issues: 193nm wavelength for lines as small as 30nm! 10000 Nominal feature size scaling 1 1000 365nm 248nm Pm nm 193nm 250nm 180nm 130nm 90nm 65nm 45nm 32nm 22nm 0.1 100 EUV 13nm 0.01 10 !"#$%&'()(*+,-./0,-1+2& 1970 1980 1990 2000 2010 2020 567*8 9#)-'.4./1*.:*"-#*:;";0#*<:.0#=#0>? /0+-,.3215(6,(7.,-21"+-.&.+&3 193nm light Mask (c) Ruzica Jevtic 2012 Light Light Process variations B. Nikolic, TCAS-‐‑I, 2011 • • • • Process corners: typical, fast and slow All corners coexist on the same wafer If we design for the worst case, it is too pessimistic! Need to know after fab chip features observation circuits (c) Ruzica Jevtic 2012 3 Moore’s law )*01/-5(627()8920.:;(<:/- !""#$%&'()*'+ ,--. 10 10000 Nominal feature size Transistors Per Die 1010 109 108 107 106 105 104 0? 103 102 101 01 ,1 .0,! ,./! 0,2! >9)8:;<4', 5#"6$&&"# /7! >9)8:;<4'5#"6$&&"# 0/! 5$89:;<= 7'5#"6$&&"# 7! 5$89:;<= >>>'5#"6$&&"# 0! 5$89:;<='>>'5#"6$&&"# ,./? 5$89:;<=''5#"6$&&"# /7? 72/4'5#"6$&&"# 0/? 32/4'5#"6$&&"# 7? 2-,2/ 2-2- 2-2/ 2--2 1965 Data (Moore) 7--7 Memory Microprocessor 1965 1970 1975 1980 1985 Graph from S.Chou, ISSCC’2005 1990 1995 2000 2005 Pm Gate Length 0.1 0.7X every 2 years 130nm 90nm 65nm 45nm 32nm 22nm 1000 nm 100 70nm 50nm 35nm ~30nm 10 1970 2010 Source: Intel 250nm 180nm 0.01 100 1960 1 0@ 1980 1990 2000 2010 2020 )8920.:;(>:/-(;-1>/8(?(1+@01:;(A-:/B*-(20C-(:A/-*(331@D Source: Intel, IEDM presentations Scaling consequences: • !""#$%&'A)*')8B'6"&9 Process variations • Power has become critical! (c) Ruzica Jevtic 2012 3 Scaling issues: Power *+,#"-./00/123/+& Power Trends in Intel's Microprocessors 1000 Has been > doubling every 2 years Power [W] 100 Itanium II Itanium Pentium III Core 2 Pentium 4 Pentium Pro 10 Pentium Pentium II 486DX 80286 8086 1 8088 8008 Has to stay ~constant 386DX 8080 4004 0.1 1970 1975 1980 1985 1990 1995 2000 2005 4 • Operating voltage scaled to avoid device breakdown 3 • Not scaled enough to keep the performance boosting up • Power has become a critical design constraint (c) Ruzica Jevtic 2012 Power issues • Power became an issue before variations: - battery life for mobile devices - power limits performance! - performance is what sells the product! • Power classification: - Dynamic power (60%) P = sw ! f !Vdd2 ! C - Static power (30%) (c) Ruzica Jevtic 2012 exp(Vdd) Power-‐‑performance trade-‐‑offs • The best way to reduce power (both leakage and active) is to reduce the power supply • How to maintain throughput under reduced supply? • Introducing more parallelism/pipelining • Dynamic voltage scaling with variable throughput Energy Performance (c) Ruzica Jevtic 2012 Overview • Error detection and correction circuits • Dynamic voltage and frequency scaling • Control voltage through DC-DC converters • Conclusions (c) Ruzica Jevtic 2012 Impact of Dynamic Variations VCC Droop MSFF CLK D Q MSFF CLK CLK Timing with VCC Droop Timing with Nominal VCC Setup Time D D Timing Guardband Ø Guardbands required to ensure correct operation within the presence of dynamic variations (c) Ruzica Jevtic 2012 K. Bowman, CMOS Emerging Tech. Workshop, 2009 Timing-‐‑Error Detection Error-‐‑Detection Sequential (EDS) LATCH RAZOR ERROR VCC Droop MSFF CLK D MSFF CLK CLK Data Arrives Late Error Detected D ERROR [1] P. Franco, et al., VLSI Test Symp., 1994. [2] M. Nicolaidis, VLSI Test Symp., 1999. [3] D. Ernst, et al., MICRO, 2003. Q (c) Ruzica Jevtic 2012 Recovery mechanism TUNING DVS PROCESSOR USING DELAY-ERROR DETECTION AND CORRECTION d pipeline recovery mechanism. • Correct data restored in the following clock cycle state• of All the shadow latch. Delay buffers are re- following is nullified by a “bubble previous pipeline stages have the toerrant be stage flushed serted in those paths which fail to meet this min- indicates to the next and subsequent stages that th • Program counter resumes at the next instruction ay constraint imposed by the shadow latch. The is invalid. Second, a backward propagating flush (c) Ruzica elay buffers incurs power overhead because of gered by asserting the stage identifier (ID) of the Jevtic 2012 Razor – issues D Q D Q Short path Long path D Q ERROR! CLK Shadow Latch Good idea but has a lot of issues: - Duty cycle constraints the shortest path - Metastability at the output of the main FF - The longest paths changed if a latch is added to FF (c) Ruzica Jevtic 2012 Improvements ARM’10 Intel’09 D Q ERROR PG RISING * * * * EN ERROR LATCH PG FALLING ü No added buffers for short paths ü Data metastability detected w/o additional detector - Detection window tuned in prefab (c) Ruzica Jevtic 2012 ü Lower clock power ü Data metastability solved -‐‑ Additonal buffers needed for short paths EDS Tunable Replica Circuit (TRC) Calibration Bits Logic stages: Ø Inverter Ø NAND Ø NOR Ø Pass gates Ø Repeated interconnects J. Tschanz, et al., Symp. VLSI Circuits, 2009. (c) Ruzica Jevtic 2012 error TRC – cont’d • Not so accurate as EDS, but no interfering with the longest path • If properly tuned, no recovery mechanism needed (c) Ruzica Jevtic 2012 QJH D'HOD\FKDQJHZ Observation circuits -‐‑ summary Advantages: ü Reduce margins for process variations ü EDS: • enable instruction recovery • detect errors in pipeline stages ü TRC: • capture clock-to-data delay per pipeline stage • have low design overhead - EDS: Disadvantages: • Adding buffers for short paths • Metastability issues • The longest path is affected by additional circuits - TRC: • Cannot detect local dynamic variations • Requires margin between TRC & the longest path • Requires post-silicon calibration (c) Ruzica Jevtic 2012 Overview • Error detection and correction circuits • Dynamic voltage and frequency scaling • Control voltage through DC-DC converters • Conclusions (c) Ruzica Jevtic 2012 DVFS Processor Usage Model Desired Throughput g p Compute-intensive and low-latency processes Maximum Processor Speed System Idle time Background and high-latency processes System Optimizations: (c) Ruzica Jevtic 2012• Maximize Peak Throughput Burd, ISSCC’00Burd ISSCC’00 DVFS – cont’d Dynamic Voltage Scaling (DVS) 2 Dynamically adapt Delivered Throughput 1 Vary !CLK,VDD time Burd ISSCC’00 • Dynamically scale energy/operation with throughput. Burd, ISSCC’00 •(c)Always minimize speed o minimize average energy/operation. Ruzica Jevtic 2012 Traditional DVFS fcpu Counter Fcpu_av + ... Ring Oscillator uP -‐‑ DC-‐‑DC Burd, ISSCC’00 • • • Traditional DVFS use a ring oscillator for frequency detection (Xscale, PowerPC, Pentium M, …) Impossible to use nowadays: ring oscillator frequency change does not reflect the cpu frequency change Different paths have different behavior with Vdd change (c) Ruzica Jevtic 2012 Resilient DVFS uP EDS Slow down TRC Slow down EDS with recovery Speed up Max. path delay CLK Cntrl DC-‐‑DC Fcpu_av • Resilient DVFS use TRCs and EDS as freq. indicator • Two options: TRC+EDS or EDS with recovery J. Tschanz, et al., Symp. VLSI Circuits, 2009. (c) Ruzica Jevtic 2012 RAVEN microprocessor • Implemented in the newest technology: 28nm! (apple i5/i7 cores in 32nm, altera chips in 28nm) • Mobile applications: battery life important! • Manycore architecture: exploiting parallelism! • Error detection circuits for observation: improvement over ARM architecture! • Unconventional DVFS scheme (c) Ruzica Jevtic 2012 Motivation -‐‑ DVFS • Single core E Two cores E Perf " " Perf Performance dictated by the slower core " In a conventional DVFS synchronous system (c) Ruzica Jevtic 2012 Goal • Manycore Instead of operating here E 10x Would like to operate here " Perf Make sure to operate at the optimal energyperformance point (c) Ruzica Jevtic 2012 Per-Core Supply/Clock Control FIFO FIFO DC-‐‑DC DC-‐‑DC EDS EDS PLL EDS EDS DC-‐‑DC DC-‐‑DC FIFO (c) Ruzica Jevtic 2012 FIFO DVFS on manycore processor TRC fdesired Control EDS Clock Generator Energy Delay • uP Tracking DC-‐‑DC Allow the ripple at the output and track the voltage through clock generation for beler energy-‐‑efficiency (c) Ruzica Jevtic 2012 Overview • Error detection and correction circuits • Dynamic voltage and frequency scaling • Control voltage through DC-DC converters • Conclusions (c) Ruzica Jevtic 2012 DC-‐‑DC converters Two phases: 1. Loading the energy from the battery 2. Transferring loaded energy to the output (c) Ruzica Jevtic 2012 Switched Capacitor DC-‐‑DC Converter Φ1 Φ1 Φ2 Vout Vin Vout Vin !" !" #$%# $"%&'( )"*+(&,-"% ./0 .-!!1 &()"'02)"$ %3&)*+"$4*252*&)/0 $*4$* */(6"0)"0% Φ2 ñ Cfly Φ2 Vout Φ1 • Advantages: - 2Fully integrated on chip .9:; 7; <=> 7#8 ?@AB4CDEF %* $*4$* GDFHAI@AI =FC a <J> 9@? DBAI=@9DF=K E=HALDIM?; - Smaller area (no inductive components) - Large power density =? = LDIM DL ?AI9A? KD?? ?9M9K=I @D @PA ?E &&; %* */(6"0)"0 !/%% 2(2!1%&% 2($ /5)&N&O2)&/( • &FDisadvantages: DICAI @D =GP9AHA @PA DB@9M=K @I=CADLL JA@EAAF BDEAI CAF4 &F =CC9@9DFR =FQ %* GDFHAI@AI E9KK =K?D =IA 9FCABAFCAF@ DL @PA KD=C GUIIAF@R 9FGK ?9@Q =FC -AL!G9AFGQR 9F @P9? ?AG@9DF EA E9KK !I?@ =F=KQSA @PA DBAI4 Discrete output voltage =@9DF =FC KD?? MAGP=F9?M? DL %* GDFHAI@AI?; )P9? =F=KQ?9? E9KK BK=@A G=B=G9@DI ?E9@GP9F: KD??A?; (D@A @P Low efficiency KA=C @D CA?9:F ATU=@9DF? LDI ?E9@GP9F: LIATUAFGQ =FC ?E9@GP LDI =F %* GDFHAI@AI E9KK =K?D GDF@I9JU@A E9C@P @P=@ M9F9M9SA (c) Ruzica Jevtic 2012KD??A? 9F = :9HAF @AGPFDKD:Q =FC BDEAI JA FA:KAG@AC PAIA ?9FGA @P9? KD?? 9? @QB9 Discrete points solution 0.96V 0.78V 0.55V H-‐‑P. Le, ISSCC 2011 0.43V Vin [V] ra)o Vout [V] Range[V] 1 1/2 0.5 0.45 – 0.55 1 2/3 0.67 0.55 – 0.71 1 1/3 Too small Too small 1.8 1/2 0.9 0.79 – 0.96 (c) Ruzica Jevtic 2012 Efficiency solution Loss conventional our approach Efficiency can be improved by more than 10%! (c) Ruzica Jevtic 2012 DVFS details Tracking DC-‐DC Configuration Fcpu_av Vin Vout=1/2 Vin Vout=2/3 Vin ... Vout TRC EDS ... uP Vref Energy Clock gen. Delay • Vo Set the ripple size at the optimal energy-‐‑delay point clk (c) Ruzica Jevtic 2012 Overview • Error detection and correction circuits • Dynamic voltage and frequency scaling • Control voltage through DC-DC converters • Conclusions (c) Ruzica Jevtic 2012 Summary • Process variations introduce difficulties in circuit design • Power is a critical design constraint • Observation circuits needed in order to avoid too conservative design decisions (EDS and TRC) • Multicore/manycore architectures open spatial dimension for energy optimization through DVFS • Fine granularity V-F control enabled through performance observation and DC-DC converters (c) Ruzica Jevtic 2012 Acknowledgement Marie Curie FP7 People program (c) Ruzica Jevtic 2012
© Copyright 2024