slides

Power Optimization for Clock Network
with Clock Gate
Cloning and Flip-Flop Merging
Shih-Chuan Lo
Chih-Cheng Hsu
Mark Po-Hung Lin
Department of Electrical Engineering
National Chung Cheng University, Chiayi, Taiwan
page 1
Outline
• Introduction
• Preliminaries
• The Proposed Algorithms
• Experimental Results
• Conclusions
page 2
Outline
• Introduction
 Low Power Design Methodologies
 The Concept of Clock-Gating Cell
 The Concept of Clock-Gate Cloning
 The Concept of Flip-Flop Merging
 Previous Work
 Our Contributions
• Preliminaries
• The Proposed Algorithms
• Experimental Results
• Conclusions
page 3
Low Power Design Methodologies
• Clock gating cell (CG)
 [Wu et al., TCAS'00], [Shen et al., TVLSI'10]
• Clock gate cloning
 [Teng & Soin, ICSE'10], [Vishweshwara et al., ISQED'12]
• Multi-bit flip-flop (MBFF)
 [Pokala et al., ASIC'92], [Kretchmer, EE Times Asia'01],
[Chen et al., SNUG'10], [Lin et al., TCAD'11],
[Wang et al., TCAD'12], [Jiang et al., TCAD'12],
[Shyu et al., TVLSI'13], [Tsai et al., ISPD'13]
• …
page 4
The Concept of Clock-Gating Cell
• A clock-gating cell can turn off the clocks at flip-flop inputs
when they are not required.
 In Fig.(a), the FFs load new data at their input pins “D” only when
the enable signal “EN” is active.
 In Fig.(b), the CG shuts off “gclk” to the FFs when “Din” does not
change.
Less clock network power
and smaller chip area
page 5
The Concept of Clock-Gate Cloning
• A clock buffer chain may result in:
 Longer delay
 Degraded circuit performance
 Additional power consumption
• After replicating sufficient CGs and
connecting each CG to a smaller
number of FFs:
 The number of required clock
buffers can be reduced.
 Power consumption and path delay
of the gated clock network can be
minimized.
page 6
The Concept of Multi-Bit Flip-Flop
• Replacing 1-bit FFs with MBFFs
can reduce total clock power by
up to 30%.
 [Jiang et al., TCAD'12]
• An MBFF contains several 1-bit
FFs which share common
inverters in the MBFF cell.
 [Chen et al., SNUG'10]
• Replacing several 1-bit FFs with
an MBFF will reduce
 Inverters in FF cells
 Clock sinks
 Clock drivers
page 7
Previous Work of CG Cloning
• [Teng & Soin, ICSE'10]
 Introduced a cutting-based algorithm to split a CG and redistribute the CG
fanout according to the cut line.
 The CG splitting algorithm is performed iteratively until the timing
violation of each CG’s enable signal is eliminated.
• [Vishweshwara et al., ISQED'12]
 Proposed a clustering-based algorithm to recursively replicate a CG
when the CG has a large number of fanouts, or when the spreading area
of its fanouts is larger than a limit.
page 8
Previous Work of FF Merging
• [Kretchmer, EE Times Asia'01], [Chen et al., SNUG'10]
 Demonstrated the feasibility of applying MBFFs during logic synthesis.
• [Pokala et al., ASIC'92]
 Applied MBFFs before placement optimization.
• [Tsai et al., ISPD'13]
 Applied MBFFs during placement optimization.
• [Lin et al., TCAD'11], [Wang et al., TCAD'12],
[Jiang et al., TCAD'12], [Shyu et al., TVLSI'13]
 Performed power optimization with MBFFs at the post-placement stage
for better timing budgeting.
page 9
Our Contributions
• We present the first problem formulation for gated clock network
optimization with simultaneous CG cloning and FF merging.
• We introduce a novel optimization flow consisting of
 MBFF-aware CG cloning
 CG-based FF merging
 MBFF and CG placement optimization
• We formulate the MBFF-aware CG cloning optimization
problem as a partitioning problem.
 Our formulation maximizes the skew slack corresponding to different
CGs subject to bounded slack constraints.
• Our experimental results show that the proposed approach
reduces dynamic power and clock wirelength.
page 10
Outline
• Introduction
• Preliminaries
 Power Model of Gated Clock Network
 Inter-CG Clock Skew due to CG Cloning
 Control-Path Timing Constraint for Gated Clock Network
 Data-Path Timing Constraint for FF Merging
 Placement Density Constraint for CGs and MBFFs
 Problem Formulation
• The Proposed Algorithms
• Experimental Results
• Conclusions
page 11
Power Model of Gated Clock Network
• The power dissipated in the gated clock network can be
modelled as follows.
 [Shen et al., TVLSI'10]

$$P_d = \Big[ \underbrace{\alpha_{clk}\,(c_0 l_{clk} + C_{buf} + C_{cg}^{clk})}_{\text{clock net}} + \underbrace{\alpha_{gclk}\,(c_0 l_{gclk} + C_{buf} + C_f)}_{\text{gated clock tree}} + \underbrace{0.5\,\alpha_{en}\,(c_0 l_{en} + C_{buf} + C_{cg}^{en})}_{\text{enable signal net}} \Big] \cdot V_{dd}^2 \cdot \frac{1}{T_{period}}$$

$P_d$: dynamic power consumption
$V_{dd}$: supply voltage
$T_{period}$: clock period
$c_0$: unit wire capacitance
$C$: input capacitance
$l$: wirelength
$\alpha$: switching activity
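The power model can be evaluated numerically as in the following minimal sketch. It assumes the three-term reading of the model (clock net, gated clock tree, enable signal net, all scaled by the squared supply voltage over the clock period). The function and parameter names are illustrative and do not come from the slides.

```python
def gated_clock_power(alpha_clk, alpha_gclk, alpha_en,
                      l_clk, l_gclk, l_en,
                      c0, c_buf, c_cg_clk, c_cg_en, c_ff,
                      vdd, t_period):
    """Dynamic power of a gated clock network (illustrative sketch).

    Each term is a switching activity times the capacitance it drives:
    wire capacitance (c0 * wirelength) plus buffer and pin capacitances.
    """
    clock_net = alpha_clk * (c0 * l_clk + c_buf + c_cg_clk)
    gated_clock_tree = alpha_gclk * (c0 * l_gclk + c_buf + c_ff)
    enable_net = 0.5 * alpha_en * (c0 * l_en + c_buf + c_cg_en)
    return (clock_net + gated_clock_tree + enable_net) * vdd ** 2 / t_period


# Example with arbitrary, unitless numbers (not taken from the experiments).
p = gated_clock_power(alpha_clk=1.0, alpha_gclk=0.5, alpha_en=0.2,
                      l_clk=10.0, l_gclk=20.0, l_en=5.0,
                      c0=1.0, c_buf=2.0, c_cg_clk=3.0, c_cg_en=1.0, c_ff=4.0,
                      vdd=1.0, t_period=1.0)
```

Shortening the gated clock wirelength or lowering the gated clock activity reduces the second term, which is the term that CG cloning and FF merging target.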
page 12
Inter-CG Clock Skew due to CG Cloning
• When a CG is replicated in the gated clock network, the inter-CG clock skew $T_{skew}$ can be calculated as follows.
• To minimize $T_{skew}$, we shall balance the wirelength and flip-flop fanout numbers among all different CGs.

$$T_{skew} = \left| (T_i^{clk} + T_i^{gclk}) - (T_j^{clk} + T_j^{gclk}) \right|$$

$g_i$: the $i$-th CG
$T^{CG}$: CG delay
$T_{skew}$: inter-CG clock skew among gated FFs
$T_i^{clk}$: interconnection delay from the clock root to $g_i$
$T_i^{gclk}$: interconnection delay from $g_i$ to the farthest gated FF
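As a small illustration, the worst inter-CG skew over a set of cloned CGs can be computed from the per-CG delay pairs. This sketch assumes the skew between two CGs is the absolute difference of their total clock-root-to-FF delays and that the CG delay is identical for all clones, so it cancels. The function name is hypothetical.

```python
def inter_cg_skew(delays):
    """Worst pairwise inter-CG skew.

    delays: list of (t_clk, t_gclk) pairs, one per CG, where t_clk is the
    delay from the clock root to the CG and t_gclk is the delay from the
    CG to its farthest gated FF.
    """
    totals = [t_clk + t_gclk for t_clk, t_gclk in delays]
    return max(totals) - min(totals)


skew = inter_cg_skew([(1.0, 2.0), (1.5, 2.5), (0.5, 2.0)])
```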
page 13
Control-Path Timing Constraint for Gated
Clock Network
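• The figure shows the control-path timing of the gated clock
network.

$$T_i^{en} + T_i^{gclk} \le T_{period} - T^{EL} - T^{CG}$$

$T_{period}$: clock period
$T^{EL}$: delay of the enable logic (EL)
$T_i^{en}$: interconnection delay of the enable signal net from the EL to $g_i$
$T^{CG}$: CG delay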
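A check of this constraint can be sketched as follows, assuming it reads "enable net delay plus gated clock delay must not exceed the clock period minus the enable logic delay and the CG delay". The function names and the sample numbers are illustrative only.

```python
def control_path_slack(t_en, t_gclk, t_period, t_el, t_cg):
    """Remaining timing budget on the control path of one CG.

    A non-negative value means the assumed constraint
    t_en + t_gclk <= t_period - t_el - t_cg is satisfied.
    """
    return (t_period - t_el - t_cg) - (t_en + t_gclk)


def control_path_ok(t_en, t_gclk, t_period, t_el, t_cg):
    return control_path_slack(t_en, t_gclk, t_period, t_el, t_cg) >= 0


slack = control_path_slack(t_en=0.25, t_gclk=0.5, t_period=2.0,
                           t_el=0.5, t_cg=0.25)
```

A CG whose slack is negative is a candidate for cloning, since a clone with fewer and closer fanout FFs has a smaller gated clock delay.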
page 14
Data-Path Timing Constraint for FF Merging
• Only FFs whose timing-feasible regions have a common intersection can be merged.
 [Lin et al., TCAD'11], [Wang et al.,TCAD'12], [Jiang et al., TCAD'12]
• The timing-feasible region of a flip-flop can be obtained from
the available timing slack on the corresponding data paths.
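The merging condition can be illustrated with a short sketch. It assumes each timing-feasible region has already been expressed as an axis-aligned rectangle `(xmin, ymin, xmax, ymax)`; the slides do not state the region shape, so this representation and the function name are assumptions.

```python
def common_region(regions):
    """Intersection of axis-aligned rectangles, or None if it is empty.

    regions: list of (xmin, ymin, xmax, ymax) tuples, one per FF.
    """
    xmin = max(r[0] for r in regions)
    ymin = max(r[1] for r in regions)
    xmax = min(r[2] for r in regions)
    ymax = min(r[3] for r in regions)
    if xmin > xmax or ymin > ymax:
        return None  # no common intersection, so the FFs cannot be merged
    return (xmin, ymin, xmax, ymax)


mergeable = common_region([(0, 0, 4, 4), (2, 1, 6, 5), (3, 3, 8, 8)])
not_mergeable = common_region([(0, 0, 1, 1), (2, 2, 3, 3)])
```

When the result is not `None`, the merged MBFF can be placed anywhere inside the returned rectangle without violating the data-path timing of any of its member FFs.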
page 15
Placement Density Constraint for CGs and MBFFs
• We divide the chip area into a number of bins with equal size.
 [Lin et al., TCAD'11], [Wang et al.,TCAD'12], [Jiang et al., TCAD'12]
• A CG or an MBFF can only be placed in a bin whose density is
less than the maximum placement density.
 This distributes logic cells evenly throughout the chip area and
avoids routing congestion.
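A bin-based density check might look like the sketch below. The class name, the area-based density definition, and the rule that a placement is rejected if it would push the bin above the limit are assumptions made for illustration.

```python
from collections import defaultdict


class DensityGrid:
    """Uniform grid of placement bins with a maximum density per bin."""

    def __init__(self, bin_w, bin_h, max_density):
        self.bin_w = bin_w
        self.bin_h = bin_h
        self.max_density = max_density
        self.used = defaultdict(float)  # (col, row) -> occupied cell area

    def bin_of(self, x, y):
        return (int(x // self.bin_w), int(y // self.bin_h))

    def density(self, b):
        return self.used[b] / (self.bin_w * self.bin_h)

    def try_place(self, x, y, cell_area):
        """Place a cell at (x, y) if its bin stays within the limit."""
        b = self.bin_of(x, y)
        new_density = (self.used[b] + cell_area) / (self.bin_w * self.bin_h)
        if new_density > self.max_density:
            return False
        self.used[b] += cell_area
        return True


grid = DensityGrid(bin_w=10, bin_h=10, max_density=0.5)
first = grid.try_place(3, 4, cell_area=30)    # bin (0, 0) goes to 0.3
second = grid.try_place(5, 5, cell_area=30)   # would reach 0.6, rejected
third = grid.try_place(12, 5, cell_area=30)   # different bin (1, 0)
```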
page 16
Problem Formulation
• Input
 A clock-gating domain containing a set of FFs which are controlled by
gated clock signals whose switching activities are the same.
 A cell library containing both CG and MBFF cells.
• Objectives
 Minimize Pd and Tskew of the clock-gating domain
(Pd is the primary objective, while Tskew is the secondary one because
Tskew can be further minimized after clock tree routing.)
• Constraints
 Control-path timing constraint
 Data-path timing constraint
 Placement density constraint
page 17
Outline
• Introduction
• Preliminaries
• The Proposed Algorithms
 The Proposed Algorithms Flow
 MBFF-aware CG Cloning
 CG-based FF Merging
 MBFF and CG Placement Optimization
• Experimental Results
• Conclusions
page 18
The Proposed Algorithms Flow
Initial placement / Cell library / Design constraints
→ MBFF-aware CG Cloning
→ CG-based FF Merging
→ MBFF & CG Placement Opt.
→ Optimized placement containing newly generated CGs and MBFFs
page 19
MBFF-aware CG Cloning
• The CG must be replicated and its fanout FFs bisected when:
 The control path violates the timing constraint
 The CG drives too many FFs, leading to larger clock power consumption.
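The two cloning triggers can be sketched as a recursive bisection. This is only an illustration: the recursion, the median split along x, the `max_fanout` threshold, and the `timing_ok` callback are assumptions standing in for the cut-line selection and timing analysis described on the following slides.

```python
def clone_cgs(ffs, max_fanout, timing_ok):
    """Split a CG's fanout FFs until every group is acceptable.

    ffs: list of (x, y) FF positions driven by one CG.
    max_fanout: hypothetical limit on the number of FFs per CG.
    timing_ok: callable taking a group of FFs and returning True when the
        control path of a CG driving that group meets timing.
    Returns a list of FF groups, one per resulting CG.
    """
    if len(ffs) <= 1 or (len(ffs) <= max_fanout and timing_ok(ffs)):
        return [ffs]
    ordered = sorted(ffs)          # placeholder cut: median split along x
    mid = len(ordered) // 2
    return (clone_cgs(ordered[:mid], max_fanout, timing_ok)
            + clone_cgs(ordered[mid:], max_fanout, timing_ok))


ffs = [(i, 0) for i in range(10)]
groups = clone_cgs(ffs, max_fanout=3, timing_ok=lambda group: True)
```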
page 20
Hypergraph Construction
• According to the timing-feasible region of each FF, we construct
the hypergraph, H(V,E).
 vi: the timing-feasible region of the FF fi.
 ei: the intersection among the timing-feasible regions of different fi.
 w(ei): the number of vertices connected by ei.
• Example in the figure: w(e1)=4, w(e2)=2, w(e3)=3
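One way to build such a hypergraph is sketched below, assuming the timing-feasible regions are axis-aligned rectangles and that each hyperedge is a maximal group of regions sharing a common point. Both choices, and the function name, are assumptions; the slides do not specify them.

```python
def build_hypergraph(regions):
    """Return {hyperedge: weight} for rectangles (xmin, ymin, xmax, ymax).

    A hyperedge is a frozenset of region indices with a common
    intersection; its weight is the number of regions it connects.
    The lower-left corner of any common intersection combines some
    region's xmin with some region's ymin, so those points suffice
    as candidates.
    """
    xs = {r[0] for r in regions}
    ys = {r[1] for r in regions}
    candidates = set()
    for x in xs:
        for y in ys:
            members = frozenset(
                i for i, (x0, y0, x1, y1) in enumerate(regions)
                if x0 <= x <= x1 and y0 <= y <= y1)
            if len(members) >= 2:
                candidates.add(members)
    # Keep only maximal groups (drop proper subsets of other groups).
    maximal = [e for e in candidates
               if not any(e < other for other in candidates)]
    return {e: len(e) for e in maximal}


regions = [(0, 0, 4, 4), (2, 2, 6, 6), (3, 3, 5, 5),
           (10, 10, 12, 12), (11, 11, 13, 13)]
hypergraph = build_hypergraph(regions)
```

In this example the first three regions overlap at a common point and form one hyperedge of weight 3, while the last two form a hyperedge of weight 2.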
page 21
Cut-line Determination with
Inter-CG Skew Budgeting
• The cut direction is determined by the physical dimension of
the FF bounding box. [Teng & Soin, ICSE'10]
 A vertical (horizontal) cut is applied if the dimension in x-direction is
larger (smaller) than that in y-direction.
• To balance the delay passing through different CGs, we sweep
the cut line to search for the maximum skew slack $T_{skew\_slack}^{max}$.
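The cut-direction rule and the resulting bisection can be sketched as follows. The tie-breaking choice for a square bounding box and the function names are assumptions not covered by the slide.

```python
def cut_direction(ff_positions):
    """Choose the cut direction from the FF bounding box dimensions."""
    xs = [x for x, _ in ff_positions]
    ys = [y for _, y in ff_positions]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    # Ties are not specified on the slide; "vertical" is an arbitrary choice.
    return "vertical" if width >= height else "horizontal"


def bisect_ffs(ff_positions, direction, cut):
    """Split FFs by a vertical cut (x = cut) or a horizontal cut (y = cut)."""
    axis = 0 if direction == "vertical" else 1
    low = [p for p in ff_positions if p[axis] <= cut]
    high = [p for p in ff_positions if p[axis] > cut]
    return low, high


ffs = [(0, 0), (10, 1), (4, 2), (7, 3)]
direction = cut_direction(ffs)
low, high = bisect_ffs(ffs, direction, cut=5)
```

Sweeping the cut line then amounts to calling `bisect_ffs` for each candidate cut coordinate and keeping the one with the largest skew slack.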
page 22
Skew Slack (1/2)
• In Fig.(c) (Fig.(d)), the CGs are placed at the position closest to
(farthest from) the clock root within the respective FF
bounding boxes, resulting in the shortest (longest) clock signal
delay from the clock root to the FFs.
[Figure: Fig.(c) is annotated with the shortest delays $(T_i^{clk} + T_i^{gclk})_{min}$ and $(T_j^{clk} + T_j^{gclk})_{min}$; Fig.(d) is annotated with the longest delays $(T_i^{clk} + T_i^{gclk})_{max}$ and $(T_j^{clk} + T_j^{gclk})_{max}$.]
page 23
Skew Slack (2/2)
• The skew slack can be calculated as the difference between
the minimum longest and the maximum shortest clock signal
delay.
• To balance the delay passing through different CGs more easily,
we would like to find a physical cut line which
maximizes the skew slack.

$$T_{skew\_slack} = \min\left\{ (T_i^{clk} + T_i^{gclk})_{max},\ (T_j^{clk} + T_j^{gclk})_{max} \right\} - \max\left\{ (T_i^{clk} + T_i^{gclk})_{min},\ (T_j^{clk} + T_j^{gclk})_{min} \right\}$$
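The skew slack is the overlap between the achievable delay ranges of the two CGs, which the following sketch computes. It assumes each CG is summarized by its shortest and longest clock-root-to-FF delay; the function names and the candidate format are illustrative.

```python
def skew_slack(range_i, range_j):
    """Minimum of the longest delays minus maximum of the shortest delays.

    range_i, range_j: (shortest_delay, longest_delay) for CG i and CG j.
    A positive value means the two ranges overlap, so the CGs can be
    placed such that both delays are equal.
    """
    (min_i, max_i), (min_j, max_j) = range_i, range_j
    return min(max_i, max_j) - max(min_i, min_j)


def best_cut(candidates):
    """Pick the candidate cut with the largest skew slack.

    candidates: list of (cut_position, range_i, range_j) tuples.
    """
    return max(candidates, key=lambda c: skew_slack(c[1], c[2]))


candidates = [
    (2.0, (1.0, 3.0), (4.0, 9.0)),   # slack = 3 - 4 = -1
    (5.0, (2.0, 6.0), (3.0, 7.0)),   # slack = 6 - 3 = 3
    (8.0, (3.0, 9.0), (1.0, 4.0)),   # slack = 4 - 3 = 1
]
chosen = best_cut(candidates)
```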
page 24
MBFF-aware FF Swapping
• We perform the FM (Fiduccia-Mattheyses) algorithm on H(V,E) to move FFs between
different FF sets such that the cut size is minimized.
 Cut size: sum of edge weights on the cut line
• Balance condition: the skew slack after moving an FF to the other
FF set must be no less than the balance factor times $T_{skew\_slack}^{max}$.
 The balance factor lies between 0 and 1.
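The two quantities that guide each FF move can be sketched as below. This is not a full FM implementation (no gain buckets or move locking); it only shows the cut-size evaluation and the balance test, with hypothetical names and sample data.

```python
def cut_size(hyperedges, side):
    """Sum of weights of hyperedges that span both FF sets.

    hyperedges: {frozenset_of_vertices: weight}
    side: {vertex: 0 or 1}, the FF set each vertex currently belongs to.
    """
    return sum(weight for edge, weight in hyperedges.items()
               if len({side[v] for v in edge}) > 1)


def move_allowed(slack_after_move, max_skew_slack, balance_factor):
    """Balance condition: the skew slack must stay above a fraction of
    the maximum skew slack found during the cut-line sweep."""
    return slack_after_move >= balance_factor * max_skew_slack


edges = {frozenset({0, 1, 2}): 3, frozenset({2, 3}): 2}
before = cut_size(edges, {0: 0, 1: 0, 2: 0, 3: 1})   # only {2, 3} is cut
after = cut_size(edges, {0: 0, 1: 0, 2: 0, 3: 0})    # nothing is cut
```

A move such as shifting vertex 3 to the other set lowers the cut size from 2 to 0 here, and would be accepted only if `move_allowed` also holds for the resulting skew slack.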
page 25
CG-based FF Merging
• We merge 1-bit FFs into MBFFs starting from the four
boundaries of the FF bounding box to the center area, based on
 INTEGRA [Jiang et al., TCAD'12]
 Spiral clustering technique [Chang et al., ISPD'12]
page 26
MBFF and CG Placement Optimization
• We perform MBFF and CG placement optimization to
 Minimize inter-CG clock skew
 Minimize wirelength
 Minimize required clock buffers
 Satisfy control/data-path timing constraints
 Satisfy placement density constraints
page 27
MBFF Placement
• When placing the MBFFs controlled by the same CG, we
search for placement bins which satisfy the following:
 The placement density constraint is met.
 The bin lies in the timing-feasible region corresponding to each MBFF.
 The FF bounding box of the CG fanouts is minimized.
• A smaller FF bounding box results in shorter gated clock signal
wirelength, and hence smaller $T^{gclk}$ and $P_d$.
page 28
CG Placement
• The CGs are initially placed inside their feasible regions,
which satisfy the control-path timing constraints.
 The feasible region of a CG is roughly an ellipse whose two foci are
at the positions of the enable logic and one of the CG fanout FFs.
• We perform an iterative optimization algorithm to:
 Move CGs around their feasible regions until inter-CG clock skew
cannot be further minimized.
 Add clock buffers to either clock path from the clock root to a CG for
delay balance.
 Insert buffers to either enable signal path from the enable logic to a CG
for a larger feasible region of the CG.
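The elliptical feasible region can be tested with a short sketch. It assumes Euclidean distance and a single distance budget `max_total_dist` standing in for the control-path timing budget; how that budget is derived from the timing constraint is not given in the slides, so both the parameter and the function name are hypothetical.

```python
import math


def in_cg_feasible_region(p, el_pos, ff_pos, max_total_dist):
    """True if point p lies inside the ellipse with foci el_pos and ff_pos.

    A point is inside an ellipse when the sum of its distances to the two
    foci does not exceed a constant, here max_total_dist.
    """
    return math.dist(p, el_pos) + math.dist(p, ff_pos) <= max_total_dist


el = (-3.0, 0.0)   # enable logic position (illustrative)
ff = (3.0, 0.0)    # one fanout FF position (illustrative)
center_ok = in_cg_feasible_region((0.0, 0.0), el, ff, 10.0)
boundary_ok = in_cg_feasible_region((0.0, 4.0), el, ff, 10.0)
outside = in_cg_feasible_region((0.0, 5.0), el, ff, 10.0)
```

Inserting a buffer on the enable signal path corresponds to increasing the budget, which enlarges the ellipse.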
page 29
Outline
• Introduction
• Preliminaries
• The Proposed Algorithms
• Experiments
 Experimental Setups
 Experimental Comparisons
 Experimental Results
• Conclusions
page 30
Experimental Setups
• Programming language
 C++
• Platform
 2.26GHz Intel Xeon machine under the Linux operating system
• We adopted the benchmark circuits in [Jiang et al., TCAD'12]
 Added other logical, physical, and timing information for CGs, clock root, and EL.
 Referred to the Nangate 45nm Open Cell Library to set the input capacitance.
 Assumed that all FFs in each circuit are initially connected to the same CG.
 Chose the circuits containing fewer than 1,000 FFs with reasonable FF bounding boxes.
page 31
Experimental Comparisons
• Reference Flow 1 & 2
 The CG cloning technique is based on the MBFF-aware CG cloning without
applying MBFF-aware FF swapping.
 The FF merging technique is exactly the same as the CG-based FF merging.
page 32
Experimental Results (1/2)
• Comparison of the numbers of MBFFs with different bit
numbers (“# of FFs”) and CG numbers (“# of CGs”).
 Compared with “Reference Flow 1”, the proposed flow results
in many more MBFFs with similar clock gate numbers.
 Compared with “Reference Flow 2”, the proposed flow results
in slightly more CGs and slightly fewer MBFFs.
page 33
Experimental Results (2/2)
• Comparisons of the dynamic power consumption
 15% less than that resulting from “Reference Flow 1”.
 10% less than that resulting from “Reference Flow 2”.
• Comparisons of the clock net wirelength
 22% less than that resulting from “Reference Flow 1”.
 18% less than that resulting from “Reference Flow 2”.
• Comparisons of the signal net wirelength
 2% less than that resulting from “Reference Flow 2”.
page 34
Outline
• Introduction
• Preliminaries
• The Proposed Algorithms
• Experimental Results
• Conclusions
page 35
Conclusions
• We have presented a new problem formulation for clock
network optimization with both CGs and MBFFs.
• We have also introduced novel techniques to optimize the gated
clock network with CG cloning and FF merging simultaneously.
• The experimental results have shown that the proposed
approach results in lower dynamic power and shorter clock wirelength
than approaches which optimize the gated clock network with
CGs and MBFFs separately.
page 36
Thanks for Your Attention
page 37