
Designing energy-efficient microprocessors: How to fight process variations
Ruzica Jevtic
Berkeley Wireless Research Center
Moore’s law
[Figure: Moore's Law - 2005. Transistors per die (10^1 to 10^10) vs. year, 1965-2005, for memory and microprocessors, with 1965 data (Moore). Graph from S. Chou, ISSCC’2005.]
[Figure: Printed vs. physical gate. Nominal feature size (250nm down to 22nm, 0.7X every 2 years) and gate length, in µm and nm, 1970-2020. Source: Intel, IEDM presentations.]
Scaling consequences:
•  Moore's law and cost
•  Process variations
•  Power has become critical!
(c) Ruzica Jevtic 2012
Scaling consequences
•  Fabrication:
- Lithography with liquids
- Use of diffraction…
•  Design rules:
- No corners
- One direction lines…
•  Additional steps:
- Optical Proximity Correction
•  Consequences:
- Random dopant fluctuations
- Hot-carrier trapping…
J. Hartmann, ISSCC 2007
Lithography
•  Lithography issues: 193nm wavelength for lines as small as 30nm!
[Figure: Lithography scaling. Nominal feature size (250nm down to 22nm) vs. lithography wavelength (365nm, 248nm, 193nm, EUV 13nm), in µm and nm, 1970-2020.]
•  EUV: technology of the future (forever)?
[Figure: 193nm light and mask.]
Process variations
B. Nikolic, TCAS-­‐‑I, 2011
•  Process corners: typical, fast and slow
•  All corners coexist on the same wafer
•  If we design for the worst case, it is too pessimistic!
•  Need to know the chip's features after fabrication: observation circuits
Scaling issues: Power
[Figure: Power trends in Intel's microprocessors. Power [W] on a log scale (0.1 to 1000) vs. year (1970-2005), from the 4004 to the Itanium II and Core 2. Annotations: "Has been > doubling every 2 years"; "Has to stay ~constant".]
•  Operating voltage scaled to avoid device breakdown
•  Not scaled enough, in order to keep performance increasing
•  Power has become a critical design constraint
Power issues
•  Power became an issue before variations:
- battery life for mobile devices
- power limits performance!
- performance is what sells the product!
•  Power classification:
- Dynamic power (60%): $P = \mathit{sw} \cdot f \cdot V_{dd}^{2} \cdot C$
- Static power (30%): grows as $\exp(V_{dd})$
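The dynamic power expression P = sw · f · Vdd² · C can be evaluated numerically to see how strongly the supply voltage matters. A minimal sketch; the function name and all operating-point values are illustrative assumptions, not numbers from the slides:

```python
def dynamic_power(sw: float, f_hz: float, vdd: float, c_farads: float) -> float:
    """Dynamic power P = sw * f * Vdd^2 * C, in watts."""
    return sw * f_hz * vdd ** 2 * c_farads


# Hypothetical operating point: 10% switching activity, 1 GHz, 1.0 V, 1 nF.
p_nominal = dynamic_power(sw=0.1, f_hz=1e9, vdd=1.0, c_farads=1e-9)

# Same activity, frequency and capacitance, but half the supply voltage.
p_half_vdd = dynamic_power(sw=0.1, f_hz=1e9, vdd=0.5, c_farads=1e-9)

# Because of the quadratic Vdd term, the ratio is 4.
print(f"nominal: {p_nominal:.3f} W, half Vdd: {p_half_vdd:.3f} W")
```

The quadratic dependence on Vdd is why the supply voltage is the main lever for reducing dynamic power.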
Power-performance trade-offs
•  The best way to reduce power (both leakage and active) is to reduce the supply voltage
•  How to maintain throughput under reduced supply?
- Introduce more parallelism/pipelining
- Dynamic voltage scaling with variable throughput
[Figure: energy vs. performance.]
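The parallelism argument can be sketched numerically: with two parallel units, each unit may run at half the clock rate, so its gates may be twice as slow and the supply can drop. The sketch below assumes the alpha-power-law delay model with hypothetical parameters (threshold voltage 0.3 V, alpha 1.3, nominal supply 1.0 V) and ignores the overhead of the extra hardware; none of these values come from the slides.

```python
def gate_delay(vdd: float, vt: float = 0.3, alpha: float = 1.3) -> float:
    """Relative gate delay from the alpha-power law (assumed model)."""
    return vdd / (vdd - vt) ** alpha


def vdd_for_slowdown(slowdown: float, vdd_nom: float = 1.0) -> float:
    """Lowest Vdd whose delay is at most `slowdown` times the nominal delay."""
    target = slowdown * gate_delay(vdd_nom)
    lo, hi = 0.31, vdd_nom  # stay just above the assumed threshold voltage
    for _ in range(60):  # bisection; delay decreases monotonically with Vdd here
        mid = (lo + hi) / 2
        if gate_delay(mid) > target:
            lo = mid  # too slow: the supply must be higher
        else:
            hi = mid  # fast enough: try a lower supply
    return hi


# Two parallel units: each may be 2x slower for the same total throughput.
v2 = vdd_for_slowdown(2.0)
# Energy per operation scales with Vdd^2 (dynamic energy only).
energy_ratio = (v2 / 1.0) ** 2
print(f"Vdd = {v2:.3f} V, energy/operation ratio = {energy_ratio:.2f}")
```

Under these assumptions the energy per operation drops to well under half of the nominal value, at the cost of duplicated hardware.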
Overview
•  Error detection and correction circuits
•  Dynamic voltage and frequency scaling
•  Control voltage through DC-DC converters
•  Conclusions
Impact of Dynamic Variations
[Figure: two master-slave flip-flops (MSFF) clocked by CLK, with timing diagrams for nominal VCC and for VCC droop showing the setup time and the timing guardband.]
•  Guardbands are required to ensure correct operation in the presence of dynamic variations
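The role of the guardband can be illustrated with a simple setup-time check. This is a sketch under assumed numbers (900 ps nominal path delay, 50 ps setup time, a droop that slows the path by 10%); the function names are hypothetical.

```python
def min_clock_period_ps(path_delay_ps: float, setup_ps: float,
                        guardband: float = 0.0) -> float:
    """Clock period that covers the path delay, inflated by a fractional guardband."""
    return path_delay_ps * (1.0 + guardband) + setup_ps


def meets_setup(path_delay_ps: float, setup_ps: float, period_ps: float) -> bool:
    """True if data arrives at least one setup time before the next clock edge."""
    return path_delay_ps + setup_ps <= period_ps


NOMINAL_DELAY_PS = 900.0   # assumed critical-path delay at nominal VCC
SETUP_PS = 50.0            # assumed flip-flop setup time
DROOP_SLOWDOWN = 0.10      # assumed delay increase during a VCC droop

drooped_delay_ps = NOMINAL_DELAY_PS * (1.0 + DROOP_SLOWDOWN)
period_no_margin = min_clock_period_ps(NOMINAL_DELAY_PS, SETUP_PS)
period_guardbanded = min_clock_period_ps(NOMINAL_DELAY_PS, SETUP_PS,
                                         guardband=DROOP_SLOWDOWN)

print("no margin, droop:", meets_setup(drooped_delay_ps, SETUP_PS, period_no_margin))
print("guardbanded, droop:", meets_setup(drooped_delay_ps, SETUP_PS, period_guardbanded))
```

The guardbanded clock is slower all the time, even though the droop is rare; that lost performance is what the observation circuits in the following slides try to recover.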
K. Bowman, CMOS Emerging Tech. Workshop, 2009
Timing-Error Detection
Error-Detection Sequential (EDS)
[Figure: error-detection sequential (labels: LATCH, RAZOR, ERROR) on a path between two MSFFs; with VCC droop the data arrives late and the error is detected.]
[1] P. Franco, et al., VLSI Test Symp., 1994.
[2] M. Nicolaidis, VLSI Test Symp., 1999.
[3] D. Ernst, et al., MICRO, 2003.
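The idea behind an error-detection sequential can be sketched as double sampling: the main flip-flop samples the data at the clock edge, a second element samples it again a little later, and a mismatch flags a timing error. The model below is a behavioural assumption for illustration (a single data transition, ideal sampling, hypothetical names and times), not the circuit from the cited papers.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    main_ff: int   # value captured at the clock edge
    shadow: int    # value captured at the end of the detection window
    error: bool    # True when the two samples disagree


def eds_sample(old: int, new: int, arrival_ps: float,
               clock_edge_ps: float, window_ps: float) -> Sample:
    """Sample a signal that switches from `old` to `new` at `arrival_ps`."""
    def value_at(t_ps: float) -> int:
        return new if t_ps >= arrival_ps else old

    main = value_at(clock_edge_ps)
    shadow = value_at(clock_edge_ps + window_ps)
    return Sample(main, shadow, main != shadow)


# Assumed timing: clock edge at 1000 ps, 200 ps detection window.
on_time = eds_sample(old=0, new=1, arrival_ps=900, clock_edge_ps=1000, window_ps=200)
late = eds_sample(old=0, new=1, arrival_ps=1100, clock_edge_ps=1000, window_ps=200)
too_late = eds_sample(old=0, new=1, arrival_ps=1300, clock_edge_ps=1000, window_ps=200)
print(on_time, late, too_late, sep="\n")
```

In this model, data that arrives after the detection window closes goes undetected, which is why the window has to cover the worst expected delay increase.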
Recovery mechanism
•  Correct data restored in the following clock cycle
•  All previous pipeline stages have to be flushed
•  Program counter resumes at the next instruction
[Figure: pipeline recovery mechanism.]
Razor – issues
[Figure: flip-flops (D, Q) connected by a short path and a long path, with a shadow latch on CLK and an ERROR! flag.]
Good idea, but it has a lot of issues:
-  Duty cycle constrains the shortest path
-  Metastability at the output of the main FF
-  The longest paths change if a latch is added to the FF
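The short-path issue can be made concrete with a small check. The sketch assumes that the detection window equals the clock-high phase (period times duty cycle), so any path faster than that window could corrupt the late sample with next-cycle data and needs delay padding. The path names and delays are hypothetical.

```python
from typing import Dict


def detection_window_ps(period_ps: float, duty_cycle: float) -> float:
    """Length of the clock-high phase, taken here as the detection window."""
    return period_ps * duty_cycle


def padding_needed_ps(path_delays_ps: Dict[str, float],
                      period_ps: float, duty_cycle: float) -> Dict[str, float]:
    """Delay to add to every path that is shorter than the detection window."""
    window = detection_window_ps(period_ps, duty_cycle)
    return {name: window - delay
            for name, delay in path_delays_ps.items()
            if delay < window}


# Hypothetical minimum path delays into error-detecting flip-flops.
paths = {"bypass": 120.0, "mux": 400.0, "alu": 850.0}
padding = padding_needed_ps(paths, period_ps=1000.0, duty_cycle=0.5)
print(padding)
```

Every padded path costs area and power in the added buffers, which is the overhead the later designs try to avoid.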
Improvements
ARM’10
Intel’09
[Figure: error-detecting circuits with D, Q, EN, LATCH and ERROR signals and rising/falling pulse generators (PG RISING, PG FALLING).]
✓  No added buffers for short paths
✓  Data metastability detected without an additional detector
-  Detection window tuned pre-fabrication

✓  Lower clock power
✓  Data metastability solved
-  Additional buffers needed for short paths
Tunable Replica Circuit (TRC)
[Figure: TRC with calibration bits and an error output.]
Logic stages:
•  Inverter
•  NAND
•  NOR
•  Pass gates
•  Repeated interconnects
J. Tschanz, et al., Symp. VLSI Circuits, 2009.
TRC – cont’d
•  Not as accurate as EDS, but does not interfere with the longest path
•  If properly tuned, no recovery mechanism is needed
Observation circuits - summary
Advantages:
✓  Reduce margins for process variations
✓  EDS:
•  enable instruction recovery
•  detect errors in pipeline stages
✓  TRC:
•  capture clock-to-data delay per pipeline stage
•  have low design overhead
Disadvantages:
-  EDS:
•  Adding buffers for short paths
•  Metastability issues
•  The longest path is affected by additional circuits
-  TRC:
•  Cannot detect local dynamic variations
•  Requires margin between TRC & the longest path
•  Requires post-silicon calibration
DVFS
Processor usage model
[Figure: desired throughput vs. time, up to the maximum processor speed: compute-intensive and low-latency processes, background and high-latency processes, and system idle.]
System optimizations:
•  Maximize peak throughput
Burd, ISSCC’00
DVFS – cont’d
Dynamic Voltage Scaling (DVS)
[Figure: delivered throughput vs. time: (1) vary f_CLK, V_DD; (2) dynamically adapt.]
•  Dynamically scale energy/operation with throughput
•  Always minimize speed → minimize average energy/operation
Burd, ISSCC’00
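The benefit of scaling energy/operation with throughput can be sketched with a small comparison between always running at the highest voltage and picking the slowest operating point that still meets the desired throughput. The operating points, the throughput trace and the effective capacitance are hypothetical, and energy per cycle is taken as C · Vdd² (dynamic energy only).

```python
# (frequency in Hz, supply in V), sorted by frequency; hypothetical values.
OPERATING_POINTS = [(200e6, 0.6), (500e6, 0.8), (800e6, 1.0), (1000e6, 1.2)]


def pick_operating_point(required_hz: float):
    """Slowest operating point that still delivers the required throughput."""
    for f_hz, vdd in OPERATING_POINTS:
        if f_hz >= required_hz:
            return f_hz, vdd
    return OPERATING_POINTS[-1]  # saturate at the fastest point


def energy_joules(cycles: float, vdd: float, c_eff: float = 1e-9) -> float:
    """Dynamic energy for `cycles` clock cycles at supply `vdd`."""
    return cycles * c_eff * vdd ** 2


# Desired throughput (cycles per second) in four consecutive 1 ms intervals.
trace_hz = [900e6, 150e6, 450e6, 100e6]
interval_s = 1e-3
v_max = OPERATING_POINTS[-1][1]

e_fixed = sum(energy_joules(r * interval_s, v_max) for r in trace_hz)
e_dvs = sum(energy_joules(r * interval_s, pick_operating_point(r)[1])
            for r in trace_hz)
print(f"fixed voltage: {e_fixed * 1e3:.3f} mJ, DVS: {e_dvs * 1e3:.3f} mJ")
```

The same number of cycles is executed in both cases; the saving comes only from executing the low-throughput intervals at a lower supply.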
Traditional DVFS
[Figure: control loop with Fcpu_av, counter (fcpu), ring oscillator, uP and DC-DC converter.]
Burd, ISSCC’00
•  Traditional DVFS uses a ring oscillator for frequency detection (XScale, PowerPC, Pentium M, …)
•  Impossible to use nowadays: a change in ring oscillator frequency does not reflect the change in CPU frequency
•  Different paths behave differently when Vdd changes
Resilient DVFS
[Figure: control loop with the uP, TRC and EDS ("slow down"), EDS with recovery ("speed up"), max. path delay, CLK control, DC-DC converter and Fcpu_av.]
•  Resilient DVFS uses TRCs and EDS as frequency indicators
•  Two options: TRC+EDS or EDS with recovery
J. Tschanz, et al., Symp. VLSI Circuits, 2009.
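A resilient control loop of this kind can be sketched as a simple rule: slow down when an observation circuit reports an error, speed up otherwise. The sketch below is an assumed behavioural model, not the controller from the cited paper: the error signal is replaced by a comparison against a hypothetical "maximum safe frequency" that drops during a droop, and the step size and limits are made up.

```python
def control_step(freq_mhz: float, error_flag: bool,
                 step_mhz: float = 10.0, f_max_mhz: float = 1000.0) -> float:
    """Slow down by one step on an error, otherwise speed up by one step."""
    if error_flag:
        return max(freq_mhz - step_mhz, step_mhz)
    return min(freq_mhz + step_mhz, f_max_mhz)


def run_loop(f_safe_trace_mhz, f_start_mhz: float = 500.0):
    """Apply the control rule once per entry of the safe-frequency trace."""
    freq = f_start_mhz
    history = []
    for f_safe in f_safe_trace_mhz:
        error = freq > f_safe  # stand-in for an EDS/TRC error signal
        freq = control_step(freq, error)
        history.append(freq)
    return history


# Safe frequency is 600 MHz for 30 steps, then a droop lowers it to 550 MHz.
history = run_loop([600.0] * 30 + [550.0] * 30)
print(history[-5:])
```

In this model the clock settles within one step of whatever the chip can currently sustain, instead of being fixed at a guardbanded worst-case value.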
RAVEN microprocessor
•  Implemented in the newest technology: 28nm!
(Intel i5/i7 cores in 32nm, Altera chips in 28nm)
•  Mobile applications: battery life important!
•  Manycore architecture: exploiting parallelism!
•  Error detection circuits for observation:
improvement over ARM architecture!
•  Unconventional DVFS scheme
Motivation - DVFS
[Figure: energy (E) vs. performance (Perf) for a single core and for two cores.]
•  In a conventional synchronous DVFS system, performance is dictated by the slower core
Goal
•  Manycore
[Figure: energy (E) vs. performance (Perf): "Instead of operating here" and "Would like to operate here", with a 10x annotation.]
•  Make sure to operate at the optimal energy-performance point
Per-Core Supply/Clock Control
[Figure: block diagram with four DC-DC converters, four EDS blocks and four FIFOs around one PLL.]
DVFS on manycore processor
[Figure: control loop with fdesired, control, TRC, EDS, clock generator, uP and DC-DC converter (tracking), next to an energy-delay plot.]
•  Allow ripple at the output and track the voltage through clock generation for better energy efficiency
DC-DC converters
Two phases:
1. Loading the energy from the battery
2. Transferring the loaded energy to the output
Switched Capacitor DC-DC Converter
[Figure: switched-capacitor converter with a flying capacitor Cfly between Vin and Vout, with switches driven by clock phases Φ1 and Φ2.]
•  Advantages:
- Fully integrated on chip
- Smaller area (no inductive components)
- Large power density
•  Disadvantages:
- Discrete output voltage
- Low efficiency
Discrete points solution
[Figure: voltage levels 0.96V, 0.78V, 0.55V and 0.43V.]
H-P. Le, ISSCC 2011

Vin [V]   Ratio   Vout [V]    Range [V]
1         1/2     0.5         0.45 - 0.55
1         2/3     0.67        0.55 - 0.71
1         1/3     Too small   Too small
1.8       1/2     0.9         0.79 - 0.96
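The table of discrete points can be read as a lookup: for a requested output voltage, pick the input voltage and conversion ratio whose range covers it. A minimal sketch using the three usable rows of the table; the type and function names are hypothetical, and targets that fall between ranges return no configuration.

```python
from fractions import Fraction
from typing import NamedTuple, Optional


class Config(NamedTuple):
    vin: float        # input voltage [V]
    ratio: Fraction   # conversion ratio Vout/Vin
    vout_min: float   # lower end of the usable output range [V]
    vout_max: float   # upper end of the usable output range [V]


# Usable rows of the table (the 1/3 ratio is listed as "too small").
CONFIGS = [
    Config(1.0, Fraction(1, 2), 0.45, 0.55),
    Config(1.0, Fraction(2, 3), 0.55, 0.71),
    Config(1.8, Fraction(1, 2), 0.79, 0.96),
]


def select_config(vout_target: float) -> Optional[Config]:
    """First configuration whose output range contains the target, else None."""
    for cfg in CONFIGS:
        if cfg.vout_min <= vout_target <= cfg.vout_max:
            return cfg
    return None


for target in (0.5, 0.6, 0.75, 0.9):
    print(target, select_config(target))
```

The gap between 0.71 V and 0.79 V in the table shows up here as a target with no matching configuration.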
Efficiency solution
[Figure: loss, conventional vs. our approach.]
Efficiency can be improved by more than 10%!
DVFS details
[Figure: tracking loop with the DC-DC converter configuration (Vout = 1/2 Vin, Vout = 2/3 Vin, ...), Vin, Vout, Vref, Fcpu_av, TRC and EDS in the uP, and the clock generator (clk), next to an energy-delay plot.]
•  Set the ripple size at the optimal energy-delay point
Summary
•  Process variations introduce difficulties in circuit
design
•  Power is a critical design constraint
•  Observation circuits (EDS and TRC) are needed to avoid overly
conservative design decisions
•  Multicore/manycore architectures open spatial
dimension for energy optimization through DVFS
•  Fine granularity V-F control enabled through
performance observation and DC-DC converters
Acknowledgement
Marie Curie FP7 People program