
Sequential Optimization for Low Power Digital Design
by
Aaron Paul Hurst
B.S. (Carnegie Mellon University) 2002
M.S. (Carnegie Mellon University) 2002
A dissertation submitted in partial satisfaction
of the requirements for the degree of
Doctor of Philosophy
in
Electrical Engineering and Computer Science
in the
GRADUATE DIVISION
of the
UNIVERSITY OF CALIFORNIA, BERKELEY
Committee in charge:
Professor Robert K. Brayton, Chair
Professor Andreas Kuehlmann
Professor Margaret Taylor
Spring 2008
The dissertation of Aaron Paul Hurst is approved.
University of California, Berkeley
Spring 2008
Sequential Optimization for Low Power Digital Design
Copyright © 2008 by
Aaron Paul Hurst
Abstract
Sequential Optimization for Low Power Digital Design
by
Aaron Paul Hurst
Doctor of Philosophy in Electrical Engineering and Computer Science
University of California, Berkeley
Professor Robert K. Brayton, Chair
The power consumed by digital integrated circuits has grown with increasing transistor
density and system complexity. One of the particularly power-hungry design features is the
generation, distribution, and utilization of one or more synchronization signals (clocks). In
many state-of-the-art designs, 30% to 50% of the total power is dissipated in the clock
distribution network.
In this work, we examine the application of sequential logic synthesis techniques to
reduce the dynamic power consumption of the clocks. These optimizations are sequential
because they alter the structural location, functionality, and/or timing of the synchronization elements (registers) in a circuit netlist. A secondary focus is on developing algorithms
that scale well to large industrial designs.
The first part of the work deals with the use of retiming to minimize the number of
registers and therefore the capacitive load on the clock network. We introduce a new
formulation of the problem and then show how it can be extended to include necessary
constraints on the worst-case timing and initializability of the resulting netlist. It is then
demonstrated how retiming can be combined with the orthogonal technique of intentional
clock skewing to minimize the combined capacitive load under a timing constraint.
The second part introduces a new technique for inserting clock gating logic, whereby a
clock’s propagation is conditionally blocked for subsets of the registers in the design that
are not actively switching logic state. The conditions under which the clock is disabled are
detected through the use of random simulation and Boolean satisfiability checking. This
process is quite scalable and also offers the potential for additional logic simplification.
Professor Robert K. Brayton
Dissertation Committee Chair
Contents

Contents
List of Figures
List of Tables
Acknowledgements

1 Introduction
  1.1 Low Power Digital Design
    1.1.1 Technological
    1.1.2 Commercial
    1.1.3 Environmental
  1.2 Sequential Optimization
    1.2.1 Retiming
    1.2.2 Clock Skew Scheduling
  1.3 Organization of this Dissertation

2 Unconstrained Min-Register Retiming
  2.1 Problem
    2.1.1 Motivation
  2.2 Previous Work
    2.2.1 LP Formulation
    2.2.2 Min-Cost Network Circulation Formulation
  2.3 Algorithm
    2.3.1 Definitions
    2.3.2 Single Frame
    2.3.3 Multiple Frames
  2.4 Analysis
    2.4.1 Proof
    2.4.2 Complexity
    2.4.3 Limitations
  2.5 Experimental Results
    2.5.1 Setup
    2.5.2 Runtime
    2.5.3 Characteristics
    2.5.4 Large Artificial Benchmarks
  2.6 Summary

3 Timing-Constrained Min-Register Retiming
  3.1 Problem
  3.2 Previous Work
    3.2.1 LP Formulation
    3.2.2 Minaret
  3.3 Algorithm
    3.3.1 Single Frame
    3.3.2 Multiple Frames
    3.3.3 Examples
  3.4 Analysis
  3.5 Proof
    3.5.1 Complexity
  3.6 Experimental Results
    3.6.1 Runtime
    3.6.2 Characteristics
  3.7 Summary

4 Guaranteed Initializability Min-Register Retiming
  4.1 Problem
  4.2 Previous Work
    4.2.1 Initial State Computation
    4.2.2 Constraining Retiming
  4.3 Algorithm
    4.3.1 Feasibility Constraints
    4.3.2 Incremental Bias
  4.4 Analysis
    4.4.1 Proof
    4.4.2 Complexity
  4.5 Experimental Results
  4.6 Summary

5 Min-Cost Combined Retiming and Skewing
  5.1 Problem
    5.1.1 Motivation
    5.1.2 Definitions
  5.2 Previous Work
  5.3 Algorithm: Exact
  5.4 Algorithm: Heuristic
    5.4.1 Incremental Retiming
    5.4.2 Overview
  5.5 Experimental Results
  5.6 Summary

6 Clock Gating
  6.1 Problem
    6.1.1 Motivation
    6.1.2 Implementation
  6.2 Previous Work
    6.2.1 Structural Analysis
    6.2.2 Symbolic Analysis
    6.2.3 RTL Analysis
    6.2.4 ODC-Based Gating
  6.3 Algorithm
    6.3.1 Definitions
    6.3.2 Power Model
    6.3.3 Overview
    6.3.4 Literal Collection
    6.3.5 Candidate Pruning
    6.3.6 Candidate Proof
    6.3.7 Candidate Grouping
    6.3.8 Covering
  6.4 Circuit Minimization
  6.5 Experimental Results
    6.5.1 Setup
    6.5.2 Structural Analysis
    6.5.3 Power Savings
    6.5.4 Circuit Minimization
  6.6 Summary

7 Conclusion
  7.1 Minimizing Total Clock Capacitance
  7.2 Minimizing Effective Clock Switching Frequency

Bibliography

A Benchmark Characteristics
List of Figures

1.1 Tradeoff of performance and power.
1.2 Cost of IC cooling system technologies.
1.3 Overview of US power consumption. [1]
1.4 Forward and backward retiming moves.
1.5 A circuit and its corresponding retiming graph.
1.6 Retiming to improve worst-case path length.
1.7 Retiming to reduce the number of registers.
1.8 Intentional clock skewing.
2.1 The elimination of clock endpoints also reduces the number of distributive elements required.
2.2 A scan chain for manufacturing test.
2.3 A three bit binary counter with enable.
2.4 An example circuit requiring unit backward flow.
2.5 An example circuit requiring multiple backward flow.
2.6 Fan-out sharing in flow graph.
2.7 The illegal retiming regions induced by the primary input/outputs.
2.8 The corresponding flow problem for a combinational network.
2.9 Flow chart of min-register retiming over multiple frames.
2.10 A cut in the unrolled circuit.
2.11 Retiming cut composition.
2.12 The runtime of flow-based retiming vs. CS2 and MCF for the largest designs.
2.13 The runtime of flow-based retiming vs. CS2 and MCF for the medium designs.
2.14 The distribution of design size vs. total number of iterations in the forward and backward directions.
2.15 The percentage of register savings contributed by each direction / iteration.
3.1 Bounding timing paths using ASAP and ALAP positions.
3.2 The computation of conservative long path timing constraints.
3.3 The implementation of conservative timing constraints.
3.4 The computation of exact long path timing constraints.
3.5 The implementation of exact long path timing constraints.
3.6 An example of timing-constrained min-register forward retiming.
3.7 An example of timing-constrained min-register retiming on a critical cycle.
3.8 Average fraction of conservative nodes refined in each iteration.
3.9 Registers in over-constrained cut vs. under-constrained cut over time relative to final solution.
3.10 Registers after min-reg retiming vs. max delay constraint for selected designs.
4.1 A circuit with eight registers and their initial states.
4.2 Computing the initial states after a forward retiming move.
4.3 Computing the initial states after a backward retiming move.
4.4 Binary search for variables in feasibility constraint.
4.5 Feasibility bias structure.
5.1 Costs of moving register boundary with retiming and skew on different topologies.
5.2 Overall progression of retiming exploration.
5.3 Dynamic power of two designs over course of optimization.
6.1 Clock gating circuits.
6.2 Opportunities for structural gating.
6.3 Non-structural gating.
6.4 Unknown relationship between BDDs and post-synthesis logic.
6.5 Timing constraints based upon usage.
6.6 Distance constraints.
6.7 Proving candidate function.
6.8 Heuristic candidate grouping.
6.9 ODC-based circuit simplification after gating.
6.10 Four-cut for structural check.
List of Tables

1.1 Power consumption of performance-oriented NVIDIA GPUs in 2004 and 2008 [2].
2.1 Worst-case runtimes of various min-cost network flow algorithms.
2.2 Worst-case runtimes of selected maximum network flow algorithms [3].
2.3 Unconstrained min-reg runtime, LGsynth benchmarks.
2.4 Unconstrained min-reg runtime, QUIP benchmarks.
2.5 Unconstrained min-reg runtime, OpenCores benchmarks.
2.6 Unconstrained min-reg runtime, Intel benchmarks.
2.7 Unconstrained min-reg characteristics, LGsynth benchmarks w/ improv.
2.8 Unconstrained min-reg characteristics, QUIP benchmarks.
2.9 Unconstrained min-reg characteristics, Intel benchmarks.
2.10 Unconstrained min-reg characteristics, OpenCores benchmarks.
2.11 Unconstrained min-reg runtime, large artificial benchmarks.
3.1 Delay-constrained min-reg runtime vs. Minaret.
3.2 Period-constrained min-reg characteristics, LGsynth benchmarks.
3.3 Period-constrained min-reg characteristics, OpenCores benchmarks.
3.4 Period-constrained min-reg characteristics, QUIP benchmarks.
3.5 Min-delay-constrained min-reg characteristics, LGsynth benchmarks.
3.6 Min-delay-constrained min-reg characteristics, OpenCores benchmarks.
3.7 Min-delay-constrained min-reg characteristics, QUIP benchmarks.
4.1 Guaranteed-initializability retiming applied to benchmarks.
5.1 Runtime and quality of exact and heuristic approaches.
5.2 Power-driven combined retiming/skew optimization.
5.3 Area-driven combined retiming/skew optimization.
5.4 Results summary.
6.1 Structural clock gating results.
6.2 New clock gating results.
6.3 ODC-based simplification results.
A.1 Benchmark Characteristics: LGsynth
A.2 Benchmark Characteristics: QUIP
A.3 Benchmark Characteristics: OpenCores
A.4 Benchmark Characteristics: Intel
Acknowledgements
Prof. Robert Brayton has my infinite gratitude for making the last five years enjoyable
and educational and for allowing my graduate school experience to exceed my expectations.
There was never a thought that he was unwilling to explore, and I thank him for the
intellectual freedom to walk down so many paths and the experienced guidance on every
one of them. His impression on his students and the field as a whole is immeasurable.
I’d like to thank Prof. Andreas Kuehlmann for his support in so many different and
varied ways: as an instructor (twice), for a GSI experience, for an internship (twice), as a
committee member (on all of my preliminary, qualifying, and dissertation committees), and
as a manager. His possession of both detailed insight and broad vision is a rare combination.
Alan Mishchenko has been an absolute joy to interact with, and I thank him for all
of his effort on paper writing, in code, and in ideas. I can only aspire to a fraction of
his perpetual enthusiasm for new ideas. It was a conversation in his car that sparked my
interest in pursuing flow-based retiming, and he deserves credit for much of it.
Christoph Albrecht has been a wonderful collaborator, coworker, and mentor throughout
this work. His thoughtfulness and careful precision are unsurpassed and have challenged me
in many ways. His expertise in sequential optimization has also contributed much to this
work.
Philip Chong was a mentor for my EE219B project, my first foray into the area of
sequential optimization. He was great to collaborate with on that and other projects,
including my summer work on clock skewing under variation and OpenAccess Gear.
I would like to thank Prof. Andrew Neureuther for the feedback during my qualifying
exam, and Prof. Margaret Taylor for being on my committee and supporting this small
piece of work on reducing unnecessary power usage. It is but a small step in our larger
pursuit towards better energy policy.
My summer internships were an invaluable piece of my education, and I’d like to thank
everyone who gave me a taste of the world outside of academia. I thank Premal Buch
and C. Van Eijk at Magma Design Automation for enriching my first summer, everyone at
Cadence Research Labs, and Peter Hazewindus, Ken McElvain, and Bing Tian at Synplicity
for making my two-hour commute absolutely worthwhile. Thank you to Katharina Grote-Schwinges, Lydia Probst, and Miteinander for an unforgettable summer spent pursuing
interests outside of engineering.
Bob’s other students, past and present, will hopefully remain lifelong collaborators,
compatriots, and friends. I’ve enjoyed my interaction with Fan Mo, Yinghua Li, William
Jiang, Zile Wei, and Sungmin Cho and would like to especially thank Shauki Elisaad,
Satrajit Chatterjee, and Mike Case. I will come to miss our late Friday meetings. Donald
Chai and Nathan Kitchen also deserve thanks for ideas, feedback, and enjoyable trips out
of town. Thank you Arthur Quiring, Martin Barke, and Sayak Ray for the efforts on our
joint projects on clock gating.
My studies were supported through the generous contributions of the State of California
MICRO program, the Center for Circuits and Systems Solutions (C2S2), and our industrial
collaborators Actel, Altera, Calypto, Intel, Magma, Synopsys, Synplicity, and Xilinx. I will
strive to repay their far-sighted investment into the educational system.
Without a constant flow of caffeine, I’d have been a walking zombie for the last few
years. Perhaps more importantly, the coffee shops of Berkeley gave me a place to escape
to work and a truly comfortable third space. Thank you (in no particular order) to the
employees and owners of Cafe Strada, Milano, Spasso, Roma, Nomad, Jumpin’ Java, A
Cuppa Team, Bittersweet, and Peet’s.
My friends and roommates have been an integral part of the last five years and deserve
credit for keeping me sane: thank you Andrew Main, Tim DeBenedictis, Ryan Huebsch,
Bryan Vodden, Josh Walstrom, Jay Kuo, Jimmy Tiehm, William Ma, Jen Archuleta, Steve
Ulrich, Andrew MacBride, Adrian Rivera, Luis di Silva, and Simon Goldsmith.
Thank you Chris for tolerating the late hours and all the sacrifices that were made in
the name of completing this dissertation. Your support has meant the world to me. I will
do everything in my power to return the favor when it comes your turn!
A life-long thank you is owed to my family for the unconditional support and love
through all of these years. I must have been destined to be an electrical engineer from
the weekends spent filling breadboards in my father’s electronics lab: to this day, I still
remember the function of a 74LS90 (it’s a 4-bit decimal counter). Perhaps the greatest
credit is due my parents for instilling in me a love of science and thought that has propelled
me this far.
Chapter 1
Introduction
This dissertation is a study of how automatic digital circuit design techniques that
manipulate the sequential components can be used to minimize the power consumption of
integrated circuit devices. In the course of this study, several new techniques are introduced
to enhance the potential for power reduction and are characterized on a set of benchmark
designs.
Before moving to the main part of the work, we begin with an introduction to and some
background in the two facets of the subject: Low Power Digital Design, discussed in Section
1.1, and Sequential Optimization, discussed in Section 1.2.
It is assumed throughout that the reader has some familiarity and comfort with mathematical and algorithmic notation, the vocabulary and terminology of digital design, and
computer science, especially with regard to complexity theory.
1.1 Low Power Digital Design
The power consumption of CMOS integrated circuits (ICs) has remained at most a
secondary concern for most of their history. While low power devices and design technologies
have been in existence for decades, it is only in the last ten years that low power has been promoted from a niche or secondary issue to a critical concern in digital
design. The convergence of technological, market, and societal forces has brought this issue
to the forefront.
As a broad motivation of this work, we examine in detail why power consumption is
such an important issue at present. Section 1.1.1 examines the technological changes that
lie behind skyrocketing power densities and increasing per-die consumption. Section 1.1.2
discusses the commercial applications and drivers behind the push for lower power technology. Finally, on a macroscopic level, Section 1.1.3 examines the environmental ramifications
of these technological trends.
1.1.1 Technological
As in many other aspects of integrated circuit technology, the fundamental driver of the
changing role of power is continued semiconductor device scaling. The ever-shrinking size
of each transistor results in ever-increasing power consumption through the consequent increases in speed, density, and parasitic device behavior.
The total power consumed by a digital design can be decomposed into two main components: dynamic and static. The static component of power is that which is consumed by
a device regardless of its operational behavior; this includes the case when all transistors
are quiescent. In a modern CMOS design, static power includes transistor gate leakage
but is dominated by the sub-threshold leakage: the flow of current that passes through
the transistor stack from the supply to ground due to the gate voltage being insufficiently
above/below the threshold and the transistors incompletely switched off. The sub-threshold
leakage current scales exponentially with the threshold voltage Vth. Smaller transistors and smaller supply voltages have driven this parameter downward and in a short time brought the resulting leakage power from near zero to a real concern.
The dynamic component of the power is the energy that is dissipated per unit time
due to the switching of transistors. In a CMOS circuit, this is primarily composed of two components: the short-circuit current and the capacitive switching. In a well-balanced cell
library, the short-circuit current is a small fraction of the total. Generally, the capacitive
switching dominates. As the capacitive elements in a circuit (e.g. the nets, transistor
gates, and internal capacitances) switch logic state and charge from a low to high voltage,
a quantity of energy is required to effect the transition. The power required is a function of
the capacitance to be charged, the switching frequency, and the rail-to-rail voltage. This is
expressed by Equation 1.1. Here, f is the transition frequency, Vdd the supply voltage, and
C the switched capacitance.
P = ½ Vdd² C f    (1.1)
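Equation 1.1 is straightforward to evaluate; the following sketch plugs in purely illustrative values (the capacitance, supply voltage, and frequency below are assumptions for the example, not figures from this work):

```python
# Dynamic switching power from Equation 1.1: P = 1/2 * Vdd^2 * C * f.
def dynamic_power(c_farads, vdd_volts, f_hertz):
    """Average dynamic power dissipated by capacitive switching."""
    return 0.5 * vdd_volts ** 2 * c_farads * f_hertz

# Illustrative: 1 nF of switched clock capacitance at Vdd = 1.2 V, 1 GHz.
print(dynamic_power(1e-9, 1.2, 1e9))  # -> 0.72 (watts)
```

Halving either the switched capacitance C or the effective switching frequency f halves this figure, which is precisely the lever pulled by the techniques studied here.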
The focus in this work is on the dynamic power dissipated during capacitive switching.
This presents three variables available for optimization, all of which are affected by
synthesis choices. While the supply voltage can be increased to improve performance (at the
expense of power), the challenges to its further decrease are substantial: the largest being
the maintenance of the relationship Vth ≈ ½ Vdd and the increase in static power dissipation
that results from pushing Vth any lower. Although useful, we do not wish to consider the
tradeoff between dynamic and static power at this time and instead turn to methods that
accomplish a straightforward reduction.
Chapters 2 through 4 discuss techniques for reducing the switched capacitance on the
clock, Cclk. In a typical design, the clock network possesses both the single largest total capacitive sink and the greatest switching frequency: its share of the power is accordingly
large and often in the range of 30% to 50% of the total. The clock network presents an
important and attractive target for power optimization techniques.
Chapter 6 introduces a new algorithm for reducing the average frequency with which
the clock must be switched, fclk , for particular subsets of the network. This is completely
compatible with the above methods.
1.1.2 Commercial
The market forces that lie behind the drive to reduce integrated circuit power requirements are driven not by the quantity and cost of the energy itself (this will be examined in the next section) so much as by the consequences of the power usage on device functionality and value. Unlike the cost of the energy use itself, these pressures are felt more directly by the manufacturers of the integrated circuits, and it is these manufacturers who are the consumers of design technology such as that presented in this research.
There are many possible channels through which power affects the functionality and
competitive value of a particular digital device, and we examine two of them in more
detail now. Consider the set of all digital devices characterized jointly by performance
(measured via clock frequency, computational operations per second, etc.) and total power
consumption. As illustrated in Figure 1.1, there is a direct trade-off between these two
characteristics through the frequency-power relationship of Equation 1.1 but also through
other design choices and variables. Given the current state-of-the-art design
technology, we can then establish a maximum performance-power frontier, as is illustrated
by the curve in Figure 1.1. This curve aids in differentiating two broad market segments of
interest.
High-Performance Systems
One lies to the right of the graph and could be labeled
high-performance systems. For our purposes, this includes scientific and super-computers,
information servers, networking equipment, and personal computers: any system whose
market value is driven primarily or in part by its performance. For this type of design,
the power consumption is an issue not because of its effect on value but because of the
limitations it presents in the pursuit of continued performance improvement. Because of
cost or application requirements, these designs eventually face such a limitation in the form of thermal constraints. A hypothetical barrier Pmax^thermal is depicted on the graph.

Figure 1.1. Tradeoff of performance and power.
The thermal constraints arise from the fact that the energy consumed (through the
mechanisms described in the previous section) ends up almost entirely as waste heat. At high
enough rates of energy consumption, the accumulation of waste heat surpasses the ability
of the integrated circuit’s environment to passively dissipate it. The resulting temperature
increase can quickly disrupt or even permanently damage the device. This necessitates
the inclusion of heat-dissipation systems, from passive heat-sinks to active air-flow control
and air-conditioning and eventually to liquid cooling. However, the cost of these options
does not scale well with increased capacity and presents an economic limitation on chasing
increased computational performance. Beyond 35-40W, the cost of additional capacity is
approximately $1 per watt [4]. An overview of the capacity and cost of various cooling
technologies is outlined in Figure 1.2.
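The cooling-cost behavior described above can be sketched as a simple piecewise-linear model; the 35 W passive budget and strict $1/W slope are simplifying assumptions for illustration, not data from [4]:

```python
# Rough cooling-cost model implied by the text: passive dissipation is
# treated as free up to roughly 35 W, and additional cooling capacity
# costs approximately $1 per watt beyond that. Both the threshold and
# the linearity are simplifying assumptions for this sketch.
def cooling_cost_dollars(power_watts, passive_limit_watts=35.0,
                         dollars_per_watt=1.0):
    return max(0.0, power_watts - passive_limit_watts) * dollars_per_watt

print(cooling_cost_dollars(30.0))   # -> 0.0 (within the passive budget)
print(cooling_cost_dollars(100.0))  # -> 65.0
```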
The high supply currents also add cost to the power delivery and regulation
systems. This can be especially costly in large installations with multiple computers.
5
Figure 1.2. Cost of IC cooling system technologies.
Portable Systems
At the opposite end of the performance-power curve lies a market
segment identifiable as portable systems. These are the devices that depend on mobile
power sources (e.g. batteries) and include phones, music and media players, hand-held
computers and game systems, and remote sensors and monitoring devices. Here, the power
consumption affects value through the cost and weight of the energy storage necessary
to meet the minimal functionality as well as the single-charge operating lifetime. This
hypothetical barrier is depicted as fmin .
Energy storage has become an issue because of the divergence between the quantities
of energy that can be stored per unit weight and the quantities consumed by increasingly
power-hungry devices. Improvements in battery technology and the amount of energy that
can be stored have lagged significantly behind increases in the rate at which it is consumed.
Whereas there has been an exponential increase in transistor density (a la Moore's law), battery energy-density has only improved about 6% per year [5]. The next-generation technologies (e.g. fuel cells) are not yet close to productization. Any decrease in power consumption can therefore be translated into either increased single-charge lifetime or reduced battery costs.
Both high-performance and portable system applications are facing immediate constraints imposed by the power consumption of digital integrated circuits. This work has
industrial applications for both market segments.
1.1.3 Environmental
Beyond the market pressures that are driving low power IC technology, there are strong
reasons to strive for minimizing the energy consumed in digital devices. While the power
required to charge a 0.2 femtofarad transistor gate input to 1.5V is trivially small– 0.2 fJ,
approximately the same energy to lift a grain of chalk dust a few centimeters– the combined
frequency, transistor density, and pervasiveness of digital devices totals to a substantial rate
of energy use. Furthermore, as each of these quantities continues to grow so does the energy
used. In a carbon-based energy economy, this exploding growth in electricity consumption
represents a dangerous proposition for atmospheric health.
Consider the 2001 total U.S. annual energy usage, broken down by sector in the left side
of Figure 1.3. Within the residential component, retail electric power (and the accompanying loss through distribution) accounted for 70% of the total, and within this 70%, 67%
was used for appliances (excluding air conditioners, water heaters, and household heating).
While large mechanical appliances (and especially refrigerators) make up the bulk of this
total, the home office and entertainment devices that are wholly digital or digitally-centric
contribute a 10% share. This corresponds to 82 billion kilowatt-hours per year or an average 9,360 megawatts of continuous usage. The relative contribution of digital devices in
the commercial segment is even higher.
While non-digital uses for energy still represent the most substantial target for energy-efficiency technology and conservation efforts, this is rapidly changing. The trend in power
usage (both per capita and total) of many large appliances has actually been shrinking
in recent years due to continuing improvements in efficiency and an effective campaign to
replace older models with newer energy-saving versions. Unfortunately, this trend does not
extend to digital-centric devices: several of the last few years have seen double-digit growth
in their combined power draw. Again, this can likely be attributed to both their increased
Figure 1.3. Overview of US power consumption. [1]
proliferation and increasing per-device energy consumption. This represents an increasingly large cause for attention.
While this type of top-down analysis illustrates the total energy used by digital devices,
it’s difficult to isolate the exact contribution of integrated circuit power consumption to the
total. Even within a personal computer, a substantial fraction of the power goes towards
the power supply, cooling system, and mechanical disks. To make a case that individual
integrated circuits consume a non-trivial fraction of the total energy output, we examine a
case built from the bottom-up.
As an example, consider the latest GPU (graphics processing unit) offerings from
NVIDIA, Inc. This one company represents a tiny fraction of the integrated circuit industry, though their products do find themselves in a sizeable number of personal computers.
Based on the figures from [6], approximately 115.9 million desktop-based GPUs were sold
in 2007.
GPU                    Geforce 5900 Ultra       Geforce 8800 GTX
Year                   2004                     2008
Market Positioning     Consumer,                Consumer,
                       Performance-Oriented     Performance-Oriented
Idle Power (W)         26.8                     46.4
Peak Power (W)         59.2                     131.5

Table 1.1. Power consumption of performance-oriented NVIDIA GPUs in 2004 and 2008 [2].
The power consumption of one of the latest performance desktop products, the Geforce
8800 GTX, is presented in Table 1.1. Note that the idle power– when neither the GPU nor
the computer are performing any computation– is 46.4W. While it’s difficult to estimate
typical usage patterns, it’s not unreasonable to assume that a significant fraction of the
host machines are on at any given time and wasting this power. If 115.9 million Geforce
8800 GTX units are on (and idle), the combined power draw would total 5,380 MW.
While the capacity of any given generation station may vary dramatically, a typical
output of a coal-based electric generator is roughly 1,000 MW. Approximately five coal
plants are required to supply the energy that is wasted in this scenario. As the average
pollution rate for a coal-fired electricity station was 2.095 pounds CO2 per kilowatt-hour in
1999 [7], the resulting carbon released would amount to 49.3 million tons in one year.
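The arithmetic behind these estimates can be checked directly. The sketch below uses only the constants quoted above (unit count, idle power, and the 1999 coal emission rate):

```python
# Back-of-the-envelope check of the idle-power scenario described above.
units = 115.9e6          # desktop GPUs sold in 2007 [6]
idle_power_w = 46.4      # Geforce 8800 GTX idle power, watts (Table 1.1)
co2_lb_per_kwh = 2.095   # average coal-plant emission rate, 1999 [7]

total_mw = units * idle_power_w / 1e6             # combined idle draw, MW
kwh_per_year = total_mw * 1e3 * 24 * 365          # MW -> kW, times hours/year
co2_tons = kwh_per_year * co2_lb_per_kwh / 2000   # short tons of CO2

print(f"{total_mw:,.0f} MW")                      # -> 5,378 MW
print(f"{co2_tons / 1e6:.1f} million tons CO2")   # -> 49.3 million tons CO2
```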
Not all of NVIDIA’s graphics products consume as much electricity as their performance
products, but the trend towards increased energy usage is unmistakable. Table 1.1 also lists
the power consumed by a component with identical market positioning just four years ago.
Within this span of time, the idle power dissipation has increased 2.2x and the peak power 2.8x! This trend has been continuing for a long time (even though power was not enough of a concern to have been widely characterized for desktop components in earlier times) and is likely to continue into the future.
An unfortunate consequence of the relationship between power and performance (as
depicted in Figure 1.1) is that any improvements in digital power-efficiency are likely to be
traded for increased computational performance in the class of speed-driven devices that
consume the bulk of the total IC power draw. While this is an unmitigated good for the
future utility of computation and the many benefits that it brings, it utterly fails to address
the problem of the resulting energy use. As with most viable attempts to address the
problem of energy sustainability, it is likely that new technology will have to be coupled
with fresh approaches to policy to achieve the needed results.
1.2 Sequential Optimization
The collection of low power technologies discussed in this work fall into the class of
logic synthesis techniques broadly known as sequential optimization. In synchronous digital
designs, the correct temporal behavior of the system is achieved through the insertion of
synchronization logic. This usually consists of state storage elements known as registers
that are driven by one or more clock signals. This is the overwhelmingly dominant paradigm
for current digital design. It is the manipulation of these elements on which we focus.
In contrast, combinational optimization represents a variety of logic synthesis techniques
that treat certain aspects of the circuit behavior as invariants. Though the logic implementation can be dramatically altered, the function implemented at the inputs of every register
is exactly preserved. For timing-driven combinational optimizations, it is also assumed that the timing relationships at and between the registers are fixed. It is exactly the relaxation of
these two assumptions that is considered in this work. We also explicitly consider the clock
network and its accordant power consumption; the mechanics of the clock distribution are
outside the scope of combinational logic synthesis.
The rest of this section gives a general overview of two of the sequential optimization
techniques that are central to several of the chapters in this dissertation: retiming [8] and
clock skew scheduling [9]. If additional background is necessary, we refer the reader to
the original works; these are complete and still very relevant sources for understanding
the motivations for and details of the transformations. The other general technique, clock
gating, is introduced and motivated in Chapter 6.
1.2.1 Retiming
Retiming is a method for relocating the structural positions of the registers in a design
such that the output functionality is preserved. First proposed by [8], it has been utilized
for two decades. Implementations of retiming are found in all of the major commercial logic
synthesis tools in both the ASIC and FPGA markets.
Figure 1.4. Forward and backward retiming moves.
The retiming transformation can be most easily understood as the repeated application of a set of simple moves. If every direct output of a combinational gate is a register, these
registers can be removed and one inserted on every input of the node. Correspondingly, if
every direct input of a node is a register, these can be removed and a register inserted on every output. In this manner sequential elements can be “pushed across” combinational ones.
This is illustrated in Figure 1.4. Every valid retiming transformation can be decomposed
into a sequence of these incremental moves. The work of [8] describes an elegant method of
capturing any legal retiming without explicitly enumerating a sequence of moves; we review
this now.
A retiming graph G is defined as follows. Let G = ⟨V, E, wi⟩ be a directed graph with edge weights. The vertices correspond to the combinational elements and external connections in a circuit and the edges E ⊆ V × V to the dependencies between them. (Hereafter, edges are interchangeably referred to by their endpoints or their label: for example, e ≡ (u, v).) Each edge e represents a path between two combinational elements or a primary IO through zero or more sequential elements. The number of registers present on each edge is captured by wi(e) : E → Z; wi is the initial register weight or sequential latency. In the timing-constrained version of the problem, each combinational element has an associated worst-case delay, d(v) : V → ℝ.
Figure 1.5. A circuit and its corresponding retiming graph.
A circuit and its corresponding retiming graph are depicted in Figure 1.5. The sequential
elements have been removed. Note that there are two edges g2 → g4 because there are two
paths with different sequential latencies.
The problem of retiming consists of generating a new graph G ′ by altering only the
number of registers on each edge w(e). The retiming transformation can be completely
described by a retiming lag function r(v) : V → Z. The lag function describes the number
of registers that are to be moved backwards over each combinational node. After the
registers have been relocated, the number present on each edge, wr, is given by Equation 1.2.

wr(u, v) = wi(u, v) − r(u) + r(v)    (1.2)
There may be restrictions imposed on the lag function. For the retimed circuit to be
physical, the final register count wr (e) must be non-negative for every edge. This imposes
a constraint on the lag function of the form of Equation 1.3. It is typically also desirable to fix the lags of all external inputs and outputs to zero; this prevents desynchronization with the environment.

r(u) − r(v) ≤ wi(u, v)    (1.3)
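Equations 1.2 and 1.3 are mechanical enough to apply directly. The sketch below uses a small hypothetical edge list (it is not the algorithm developed in Chapter 2) to apply a lag function and verify legality:

```python
def retime(edges, w_i, r):
    """Apply Equation 1.2 to every edge and enforce the legality
    condition of Equation 1.3: w_r(e) >= 0 for all edges e."""
    w_r = {}
    for (u, v) in edges:
        w_r[(u, v)] = w_i[(u, v)] - r.get(u, 0) + r.get(v, 0)
        if w_r[(u, v)] < 0:
            raise ValueError(f"illegal retiming: w_r{(u, v)} < 0")
    return w_r

# A small hypothetical retiming graph: g1 -> g2 -> g3, plus a feedback edge.
edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g1")]
w_i = {("g1", "g2"): 0, ("g2", "g3"): 1, ("g3", "g1"): 1}

# Lag r(g2) = 1 moves one register backwards across g2:
# it leaves edge (g2, g3) and appears on edge (g1, g2).
w_r = retime(edges, w_i, {"g2": 1})
print(w_r)  # {('g1', 'g2'): 1, ('g2', 'g3'): 0, ('g3', 'g1'): 1}
```

Note that the total register count around the loop is preserved; only the positions change.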
Typically, other constraints are imposed upon the selection of this function and an
objective is defined. Common objectives include minimizing the register count, minimizing
the worst-case combinational path delay, or either with a constraint imposed on the other.
The additional details of these problems will be discussed in later chapters.
Examples of how retiming can be applied to improve either the worst-case delay or the
number of registers are illustrated in Figures 1.6 and 1.7, respectively. The green combinational gates are labelled with their worst-case delays D. The optimum retiming moves
(for either delay or register count) are illustrated using magenta arcs to the new silhouetted
locations. The corresponding retiming lags r(v) are also indicated on the combinational
gates (where the value is non-zero).
The retiming problem is an instance of integer linear programming (ILP). In most
cases, the structure can be used with specialized solvers to attack the specific problem more
efficiently. If the network structure of the problem is maintained, the worst-case bound is polynomial (and strongly polynomial for the classes of problems that we will be examining). This
Figure 1.6. Retiming to improve worst-case path length.
Figure 1.7. Retiming to reduce the number of registers.
will be examined more closely in Chapter 2. However, we will also see multiple subproblems
that require the solution of a mixed-integer linear program (MILP). The MILP variant is
NP-hard.
Multiple clocks Modern synchronous designs may utilize anywhere between one and hundreds of clocks. The complicating problem is that registers with different clock signals cannot be merged into a single register in either the final solution or any of the intermediate points over which the registers must be retimed. This reality is often not explicitly addressed in works on retiming, but there is a relatively straightforward solution. The registers can be partitioned into clock domains, and the domain boundaries never crossed.
A similar partitioning is also necessary for differences in any other asynchronous control
signals. Though all of our example circuits have a single clock and reset, we assume that
this method would be applied for multiple-clocked designs.
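A minimal sketch of such a partitioning, assuming hypothetical netlist records that carry a clock and an asynchronous-reset field (the record format is invented for illustration):

```python
from collections import defaultdict

def partition_domains(registers):
    """Group registers by (clock, async control) so that a retiming pass
    never merges registers that belong to different domains."""
    domains = defaultdict(list)
    for reg in registers:
        key = (reg["clock"], reg.get("async_reset"))
        domains[key].append(reg["name"])
    return dict(domains)

regs = [{"name": "r0", "clock": "clk_a", "async_reset": "rst_n"},
        {"name": "r1", "clock": "clk_a", "async_reset": "rst_n"},
        {"name": "r2", "clock": "clk_b", "async_reset": "rst_n"}]
print(partition_domains(regs))
# {('clk_a', 'rst_n'): ['r0', 'r1'], ('clk_b', 'rst_n'): ['r2']}
```

Retiming is then run independently within each group, so no move ever crosses a domain boundary.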
Advantages As a sequential optimization, the advantages of retiming are several. The
first is relative ease of computation: the runtime of the optimization algorithm grows with
the size of the circuit with a low-degree polynomial asymptote. This is better than many
combinational synthesis problems– let alone sequential ones, many of which lie in the PSPACE class.
Despite being easy to compute, retiming often holds significant potential to improve
the desired objective. In the case of performance-oriented optimization, the resulting improvement in speed can be quite significant. In [10], a profiling of several industrial circuits
indicated that the worst-case average cycle length was significantly shorter than the single
worst-case path length. Through some combination of misbalanced sequential partitioning of the original designs and the inability of the design tools to perfectly balance the path lengths, there remains significant potential to reposition the registers and balance slack.
Retiming can also be applied to reduce power consumption. This will be examined in
more detail in Chapters 2 through 5.
Challenges The primary challenges involved with retiming a design arise from the re-encoding of the state machine. The values (and number of bits) stored in the retimed registers at each cycle do not correspond with those in the original design. First and foremost, this complicates the formal verification of the netlist revision. The problem no longer becomes one of combinational equivalence checking at the register inputs and primary outputs, though the problem is still quite tractable due to the maintenance of other
equivalent points. However, if the retiming is interleaved with additional resynthesis, it
immediately becomes very difficult to resolve even potential locations for equivalence between the original and transformed circuits. Until very recently, there were no commercial
verification tools that could overcome this problem in a completely satisfactory manner,
and in industrial practice, no change is generally allowable unless it can be verified. While
still a difficulty, advances in sequential verification have brought retiming into the realm of
verifiable optimizations.
The burden due to state re-encoding is not only borne by the automated tools but by the
human designer as well. The latched state values are often the primary points for debugging
a simulated version of the design and the only points for debugging a silicon device. The
translation of these values to and from the original specification requires additional tools
and/or effort.
1.2.2 Clock Skew Scheduling
Clock skew scheduling [9] offers a technique for balancing the computation across sequential elements by applying different non-zero delays on the clock inputs of each register.
To differentiate from the unwanted version, this is often called intentional skewing.
The latest arrival times at the latches in a single design may vary considerably. This
imbalance may come as a result of timing misprediction in the design flow or because of
a fundamental imbalance in the sequential partitioning of a design. Since the latches in
a single clock domain must all operate at the same frequency, performance is limited by
the slowest delay path, even if the others could operate at a higher speed. With the delay balancing of skew scheduling, the timing of the design is no longer limited by the single worst-case path but by the maximum average delay around any loop of register-to-register path segments.
The insertion of intentional clock skew is illustrated in Figure 1.8. The register-to-
Figure 1.8. Intentional clock skewing.
register timing paths are labeled with a maximum delay D and minimum delay d. The arrival of the clock at register R2 is intentionally delayed by τ(R2) time. There is assumed
to be clock insertion delay along all of the paths, and the τ value represents the deviation
from the nominal. We will see that only the relative values of τ matter; the choice of the
nominal value is arbitrary.
The re-balancing of the timing criticality can then be observed. While the insertion of an intentional skew of delay τ(R2) > 0 delays the arrival of the clock at R2 and increases the allowable delay D(R1→R2) along the longest path R1 ⇝ R2 (before the setup timing of register R2 is violated), the permissible worst-case delay D(R2→R3) along the path R2 ⇝ R3 is correspondingly decreased. Any timing slack added to the incoming paths of R2 is directly borrowed from the outgoing paths. The opposite effect occurs for the shortest paths and hold timing constraints.
The problem of computing an optimal clock skew schedule can be formulated as a
continuous linear program. The objective is to minimize the clock period by choosing a
set of per-register skews, τ (r), subject to the linear constraints arising from setup and hold
constraints along each register-to-register timing path. The setup and hold constraints are
Equations 1.4 and 1.5, respectively. D(u, v) is the maximum path delay along u ⇝ v, d(u, v) is the minimum path delay along u ⇝ v, and T is the clock period.

Su→v :    D(u, v) ≤ T − τ(u) + τ(v)    (1.4)

Hu→v :    d(u, v) ≥ τ(v) − τ(u)    (1.5)
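Because the setup and hold inequalities are difference constraints on the τ values, feasibility for a fixed period T can be checked with a Bellman-Ford pass over the constraint graph, and the minimum period found by binary search on T. The sketch below is illustrative only (register numbering and delays are invented); it adopts the convention that a hold constraint bounds τ(v) − τ(u) by the minimum path delay, i.e. data launched at u must not overtake the capturing clock edge at v:

```python
def feasible(n, setup, hold, T):
    """Check the difference constraints with Bellman-Ford.
    setup: (u, v, D) giving tau(u) - tau(v) <= T - D
    hold:  (u, v, d) giving tau(v) - tau(u) <= d
    Registers are numbered 0..n-1."""
    # A constraint x_j - x_i <= c becomes an arc (i, j, c);
    # Bellman-Ford then relaxes dist[j] <= dist[i] + c.
    arcs = [(v, u, T - D) for (u, v, D) in setup]
    arcs += [(u, v, d) for (u, v, d) in hold]
    dist = [0.0] * n                  # implicit source at distance 0 to all
    for _ in range(n):
        changed = False
        for (i, j, c) in arcs:
            if dist[i] + c < dist[j]:
                dist[j] = dist[i] + c
                changed = True
        if not changed:
            return True               # converged: a schedule exists
    return False                      # negative cycle: infeasible

def min_period(n, setup, hold, lo=0.0, hi=1e6, eps=1e-6):
    """Binary search for the smallest feasible clock period T."""
    while hi - lo > eps:
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if feasible(n, setup, hold, mid) else (mid, hi)
    return hi

# Hypothetical 3-register loop: max delays 10, 2, 3; generous min delays.
setup = [(0, 1, 10.0), (1, 2, 2.0), (2, 0, 3.0)]
hold = [(0, 1, 6.0), (1, 2, 1.0), (2, 0, 1.0)]
print(round(min_period(3, setup, hold), 3))  # -> 5.0
```

The result, 5.0, is the mean delay around the loop ((10 + 2 + 3)/3), anticipating the cycle-based view described next.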
The final problem can be solved using a general approach to linear programming (e.g.
simplex, interior-point methods, etc.). While in the worst-case simplex is exponential in
the size of the problem, it does not generally perform so poorly for practical problems. Our
experience is that it is slow but tractable for solving these scheduling problems. In any case, weakly polynomial alternatives exist [11] [12].
This minimization also corresponds to the determination of the maximum mean distance
around any cycle in the register-to-register timing graph. There exist several algorithms [13]
for solving the maximum-mean-cycle-time problem, several of which are quite efficient in
practice. Our experience agrees with the observation that Howard’s algorithm is the most
efficient, though Burns’ method is also useful for incremental analysis and computations
performed directly on the circuit structure.
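Of the algorithms surveyed in [13], Karp's characterization is perhaps the simplest to sketch. The version below computes the maximum mean cycle directly from longest k-edge walks; the example graph is hypothetical, and this is not an implementation of Howard's or Burns' methods:

```python
from math import inf

def max_mean_cycle(n, edges):
    """Karp-style maximum mean cycle weight.
    edges: list of (u, v, w); nodes 0..n-1, all reachable from node 0."""
    # D[k][v] = maximum weight of a k-edge walk from node 0 to v.
    D = [[-inf] * n for _ in range(n + 1)]
    D[0][0] = 0.0
    for k in range(1, n + 1):
        for (u, v, w) in edges:
            if D[k - 1][u] > -inf:
                D[k][v] = max(D[k][v], D[k - 1][u] + w)
    # mu* = max over v of min over k of (D[n][v] - D[k][v]) / (n - k)
    best = -inf
    for v in range(n):
        if D[n][v] > -inf:
            best = max(best, min((D[n][v] - D[k][v]) / (n - k)
                                 for k in range(n) if D[k][v] > -inf))
    return best

# A hypothetical 3-register loop with path delays 10, 2, and 3.
print(max_mean_cycle(3, [(0, 1, 10.0), (1, 2, 2.0), (2, 0, 3.0)]))  # -> 5.0
```

The maximum mean cycle weight equals the minimum achievable clock period under an optimal skew schedule (here, (10 + 2 + 3)/3 = 5).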
Advantages In contrast to retiming, clock skew scheduling also possesses the desirable
feature of preserving circuit structure and functionality. The verification and testing issues
that plague retiming do not apply to clock skew scheduling. In recent years, it has gained
practical acceptance in multiple design tools, usually at the end of the flow after physical
synthesis is nearly complete.
Challenges There are real difficulties in the implementation of a specific clock skew schedule. The challenges of constructing a near-zero-skew clock distribution network are already significant, and the requirement that each endpoint have a different and specific insertion
delay complicates the problem. Furthermore, the physical difficulties of inserting multiple
buffers in the vicinity of every skewed register are also problematic.
There have been some recent advances in the ease of clock skew implementation. The use
of routing delays to insert skews has been studied in [14], thereby minimizing the number
of delay buffers that must be placed. The clock skew schedule itself can be altered by using
the flexibility of the non-critical constraints and/or the relinquishment of optimality. This
was studied by [15] with favorable results.
1.3 Organization of this Dissertation
The overall theme of this dissertation is low-power design using sequential logic optimization techniques. Within this space, each chapter represents the study of a problem and
a corresponding solution. We define a problem as consisting of the optimization of some objective under a given transformation that is subject to a particular set of constraints. There
is (unsurprisingly) significant overlap in these elements between the chapters, and it may
benefit the reader to consider the thesis in its entirety. However, significant effort has been
put into the organization to provide coherent boundaries between the particular problem
features that may be of interest to different readers. A secondary but very important
theme that is common to all of the work present here is scalability to large designs. While
computationally expensive and powerful optimization techniques can be applied to small
circuits with impressive results, the utility of such methods is very limited in practice. We
have intentionally focused on algorithms that are applicable to large (and growing) design
sizes– even if this comes at the expense of obviously better or more complete solutions.
This important property of our approach should be observed throughout this dissertation.
The structure of each chapter follows the following general format (with the section titles in bold): an introduction to the Problem and its motivation, background and information about Previous Work, a detailed description of our Algorithm and a corresponding
Analysis of its behavior, and finally a presentation of Experimental Results.
The content of each chapter is roughly as follows:
• Chapter 2. Unconstrained Minimum-Register Retiming.
We introduce a new algorithm for the minimization of the number of registers in a
circuit using retiming. At this point, the only constraint on the solution is functional
correctness. The technique is compared to existing solutions both analytically and
empirically. This chapter serves as the foundation of the two subsequent ones.
• Chapter 3. Delay-Constrained Minimum-Register Retiming
In this chapter we extend the algorithm in Chapter 2 to include constraints on both the worst-case minimum and maximum path delays in the problem of minimizing the number of registers in a circuit under retiming. For synthesis applications, these
constraints are critical to ensure the timing correctness of the resulting circuit.
• Chapter 4. Guaranteed Initializable Minimum-Register Retiming
The algorithm in Chapter 2 is further extended to guarantee that the resulting retiming will be initializable to a state that corresponds to the initial one in the original
circuit. The worst-case complexity of this problem is in class NP, but we show that
our technique is quite efficient in practice for the examples that we examined.
• Chapter 5. Combined Minimum-Cost Retiming and Clock Skewing
We discuss algorithms for simultaneously minimizing both the number of registers in a
circuit and the number of clock skew buffers under a maximum path delay constraint.
A general cost function is defined (that is inclusive of power minimization) and both its exact and heuristic minimization are studied. It is demonstrated that combining the features of both retiming and skewing can lead to a significantly better solution than either on its own.
• Chapter 6. A New Technique for Clock Gating
A new technique for the synthesis of clock gating logic is introduced using the efficient
combination of functional simulation and a satisfiability solver. Clock gating inserts
combinational logic on the clock path to minimize the conditions under which the
registers in the design must be switched. We improve on previous methods in runtime,
quality, and/or the minimization of netlist perturbation.
Chapter 2
Unconstrained Min-Register Retiming
In this chapter we introduce a new algorithm for the minimization of the number of
registers in a circuit using retiming. At this point, the only constraint on the solution is
functional correctness: the primary outputs in the retimed design must exhibit functionally
identical behavior to the original circuit under every possible sequence of inputs. It is assumed
that the registers do not have any specific reset or initial state. This flavor of the retiming
problem is known as unconstrained min-register retiming.
We assume that the retiming transformation is understood. The reader may review
Section 1.2.1 for more background on retiming.
The chapter begins in Section 2.1 by defining the problem of register minimization and
discussing the motivations behind and importance of reducing the number of registers in the
design. In Section 2.2, we discuss the background and previous solutions to this problem.
Section 2.3 introduces a new algorithm to compute the optimal min-register retiming using
a maximum-flow-based formulation and illustrates its behavior on several small examples.
Further analysis of the correctness, complexity, and limitations of the new algorithm is
described in Section 2.4. Finally, experimental results– including a direct comparison with
existing best practices– are presented in Section 2.5.
Chapters 3 and 4 further develop the maximum-flow-based retiming technique introduced in this chapter, describing the means to constrain the solution’s worst-case delay and
correctness at initialization, respectively.
2.1 Problem
A circuit is assumed to be a directed hypergraph Ghyp = ⟨V, H⟩ where the set of directed hyperedges H ⊆ 2^V × 2^V. There exist three classes of vertices: Vseq, the sequential
elements (hereafter referred to as registers), Vcomb , the combinational gates, and Vio , the primary inputs and outputs (PIOs). The sequential and combinational gates may correspond
to either specific objects in a technology library, generic primitives, black-boxed hierarchy,
or some other complex technology-independent descriptions (e.g. sum-of-products). This
flexibility makes retiming applicable to any stage of a synthesis flow, from a block-level RTL
netlist to a placed physical one.
We first decompose Ghyp into an equivalent graph G with pair-wise directed edges E ⊆
V × V such that E = {u → v : ∃h s.t. u ∈ sources(h) ∧ v ∈ sinks(h)}. Each hyperedge is
broken into the complete set of connections from sources to sinks.
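This decomposition is mechanical; the sketch below applies the definition of E to a hypothetical two-hyperedge netlist fragment:

```python
def decompose(hyperedges):
    """Expand each directed hyperedge (sources, sinks) into the complete
    set of pairwise source-to-sink edges, per the definition of E above."""
    return {(u, v) for (sources, sinks) in hyperedges
                   for u in sources for v in sinks}

# A hypothetical fragment: g1 fans out to g2 and g3, which both drive g4.
h = [({"g1"}, {"g2", "g3"}), ({"g2", "g3"}, {"g4"})]
print(sorted(decompose(h)))
# [('g1', 'g2'), ('g1', 'g3'), ('g2', 'g4'), ('g3', 'g4')]
```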
The problem studied here is the simple minimization of the number of sequential vertices
|Vseq | via retiming. The only constraint is that the functionality of the circuit, as observed
at the primary outputs, remains unaltered. Here, functionality does not include timing
or electrical considerations; only the logical values at the end of an unbounded period of
evaluation determine correctness. It is assumed that the initial values of the registers are
unspecified.
2.1.1 Motivation
Registers are particularly important targets for optimization. There are common optimizations that are shared between combinational and sequential elements: for example,
several functionally identical cells with varying drive strengths and threshold voltages may
be present in a library to trade off area and dynamic power against performance and leakage power. There are also unique complexities inherent only to sequential elements. These
present opportunities for improving design characteristics that are not applicable to the
combinational logic. It is these design features that make sequential optimizations (such as
retiming) of particular interest, and we examine them now.
The first critical aspect of design that is not directly touched by combinational optimization is squarely within the domain of sequential optimization: the clock.
Clock Power It is typical for the current generation of integrated circuits to spend about 30% of their total dynamic power in the generation, distribution, and utilization of sequential synchronization signals, and it is possible for this fraction to climb
above one half [16]. In most architectures, this takes the form of one or more large clock
networks [17]. These signals must be distributed with extreme timing precision across large
areas– or in many cases the entire die– to thousands of synchronization points.
The total dynamic power consumed in the clock network takes the form of Equation
2.1, where Vpp is the peak-to-peak signal voltage, Cclk is the total capacitance in the clock
distribution network (including endpoints), and fclk is the frequency. The minimum voltage
necessary to switch the transistors– and therefore Vpp – is dictated by the process technology.
The performance is proportional to fclk and is often either tightly constrained or is the
primary optimization objective.
P = (1/2) Vpp² Cclk fclk    (2.1)
This leaves the total capacitance of the clock network as the best target for minimizing
the dynamic power consumption. The components of this capacitance can be broken into
three categories: wire, intermediate buffers, and the clock-driven sequential gates. This
is captured in Equation 2.2. We are mostly concerned with minimizing the total power
consumption by minimizing the capacitance on the leaves of the clock distribution network,
the registers R.
Cclk = Cnet + Cbuf + Σ_{i=1..R} c^i_reg    (2.2)
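A hypothetical back-of-the-envelope evaluation of Equations 2.1 and 2.2 (all values invented for illustration) shows how the register count feeds directly into clock power:

```python
# Hypothetical values, purely illustrative.
V_pp = 1.0        # peak-to-peak clock swing, volts
f_clk = 1e9       # clock frequency, Hz
C_net = 200e-12   # clock wire capacitance, farads
C_buf = 100e-12   # clock buffer capacitance, farads
n_regs = 50_000   # number of registers (clock leaves)
c_reg = 2e-15     # clock-pin capacitance per register, farads

C_clk = C_net + C_buf + n_regs * c_reg  # Equation 2.2
P = 0.5 * V_pp**2 * C_clk * f_clk       # Equation 2.1
print(f"{P * 1e3:.0f} mW")              # -> 200 mW

# Eliminating 20% of the registers removes 20% of the leaf capacitance.
C_clk2 = C_net + C_buf + 0.8 * n_regs * c_reg
P2 = 0.5 * V_pp**2 * C_clk2 * f_clk
print(f"{(P - P2) * 1e3:.0f} mW saved")  # -> 10 mW saved
```

The saving computed here counts only the leaf term; as noted below, fewer leaves also shrink Cnet and Cbuf, so the real benefit is larger.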
We assume that the capacitance of each clock input (that is, each register or latch) is
determined by the technology and the details of the standard cell implementation. Further advances in device and library technology present excellent opportunities for reducing
clock power consumption by improving these values. However, with these characteristics
fixed, the design problem becomes one of minimizing the total number of clock inputs, or
correspondingly, minimizing the total number of sequential elements.
Reducing the number of points to which the clock must be distributed also reduces the
power that must be consumed for the purely distributional components (i.e. Cnet and Cbuf ).
For this reason, minimizing the number of registers has a greater effect on the total power
consumption than the reduction in leaf capacitance alone. Though we do not measure and
include this effect in our results, its contribution should be recognized.
An alternative to minimize the clock power would be to abandon the synchronous
paradigm and eliminate the need for an explicit clock entirely. Various asynchronous design
methods have been proposed that do not require the expensive distribution of regional synchronization signals [18] [19] [20] [21]. There has even been success in employing this
strategy in both academic [22] and commercial designs [23], but for now, the synchronous model retains a commanding dominance. Its simplicity, tool support, and maturity are unmatched. For the immediate future, digital design is wedded to the existence of a clock.
Clock Tree Synthesis Effort
The design of the clock network is consistently one of
the most challenging aspects of VLSI timing closure and typically involves significant effort
on the part of both the automated tools and the human designers. Routing, placement,
and buffering each present physical, electrical, and timing challenges. While this process
is beyond the scope of this work, [17] presents an overview of the problem and current
methodologies.
Figure 2.1. The elimination of clock endpoints also reduces the number of distributive elements required.

The reduction in the number of clock distribution endpoints simplifies each of these problems. As illustrated in Figure 2.1, if retiming is able to eliminate registers R5 and R6,
additional savings in design effort, area, power, and routing can be realized because of the
elimination of buffers B3 and B6 . In general, the reduction in registers may not be concentrated in any one branch of the clock network; however, because the levels are (re)allocated
to balance the electrical loads, the effect is the same. Retiming is often performed before
clock tree synthesis.
Manufacturing Test Because of the tremendous complexity of a silicon device, there are a vast number of opportunities and locations for defects to appear during manufacturing. An overview is presented in [24]. Ensuring that each device conforms to the
functional and operational specifications is time-consuming, challenging, and requires expensive equipment. However, the costs for missing defects range from loss in yield to in-field
replacement to the consequences of jeopardizing human safety.
There are several different styles of test logic, but the dominant one involves the insertion
of scan chains, as illustrated by Figure 2.2. A serially-connected path is created through
all of the testable registers in the design. The testing then consists of three phases: first, the register scan inputs and outputs (“scin” and “scout”) are enabled to allow the shift of test vectors into the sequential elements; second, the regular inputs and outputs are enabled and the values of the next state are computed; third, these result vectors are shifted out and evaluated
for correctness. The shift in and out operations can be combined, but because of the large
number of registers, the length of a chain can still be quite long. With increasing design
complexity, the number of registers in a design grows, and the time required to shift in/out
a single vector increases proportionally. As a complex design can require thousands of test
vectors, this becomes the driving component of total test time.
Figure 2.2. A scan chain for manufacturing test.
The total test time is the main component of per-unit test cost. Register reduction addresses this directly: a decrease in the register count yields a proportional decrease in the scan chain length and in the time required to load each vector. This offers a valuable means
for reducing the test cost.
Verification
While the flow-based formulation of min-register retiming developed in this
chapter is extended in Chapters 3 and 4 to include design constraints that are necessary
for synthesis, the unconstrained problem does have intrinsic value in the area of sequential
verification.
The goal of sequential verification is to prove one or more properties over the entire
operation of the state machine implemented by a sequential circuit [25] [26] [27] [28]. The
state space of the design is the critical driver of the complexity; the number of potential
design states grows exponentially with the number of state bits.
Because each register implements a state bit, register minimization can be used to
significantly reduce the size of the problem. While the corresponding reduction in the total
state space is exponential, this reduction doesn’t necessarily come within the reachable state
space. Nevertheless, the guaranteed linear reduction in the state representation is useful
to improve the practical memory and runtime requirements of a sequential verification tool.
The work of [29] does demonstrate an empirical relationship between retimed register
count and the difficulty of sequential equivalence checking. In this work, it was shown that
preprocessing with retiming decreases the total runtime of sequential verification. Retiming
is used industrially in IBM’s SixthSense tool to this same end [30]. Although anecdotal, the
experience of others in using these retiming algorithms in sequential verification has been
decidedly positive.
2.2 Previous Work

2.2.1 LP Formulation
The use of retiming to minimize the number of registers in a design was first suggested by [8]; this objective was among the original applications proposed for retiming.
The problem can be formulated as an integer linear program of the form of Equation 2.3. As introduced in Section 1.2.1, let G = <V, E> be a retiming graph, let r(v) be a retiming lag function, and let wi(u, v) be the initial number of registers present on each edge u → v. The number of registers on edge u → v after retiming is wr(u, v) = wi(u, v) − r(u) + r(v).

min Σ_{∀e∈E} wr(e)   s.t.   r(u) − r(v) ≤ wi(u, v)   ∀ u → v   (2.3)
Let |G| be the total number of registers present in the circuit described by the graph G. If the circuit is retimed as described by the lag function r(v), let |G′| be the total number of registers after retiming. This quantity can be computed from r(v) as described by Equation 2.4. Let the outdegree of a node be the number of outgoing edges: outdegree(v) = |{e = v → u : ∃u ∈ V ∧ e ∈ E}|. Indegree is defined similarly.

|G′| = Σ_{∀e∈E} wr(e)   (2.4)
     = |G| + Σ_{∀v∈V} r(v)(indegree(v) − outdegree(v))   (2.5)
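The agreement between Equations 2.4 and 2.5 can be checked on a small example. The following Python sketch (a hypothetical three-node retiming graph; all names and values are ours) evaluates both forms and confirms that they coincide:

```python
def registers_after_retiming(edges, wi, r):
    """Equation 2.4: sum of retimed weights wr(u,v) = wi(u,v) - r(u) + r(v)."""
    return sum(wi[(u, v)] - r[u] + r[v] for (u, v) in edges)

def registers_via_degrees(nodes, edges, wi, r):
    """Equation 2.5: |G| plus the lag-weighted degree imbalance."""
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for (u, v) in edges:
        outdeg[u] += 1
        indeg[v] += 1
    g = sum(wi.values())  # |G|: the initial register count
    return g + sum(r[v] * (indeg[v] - outdeg[v]) for v in nodes)

# Hypothetical 3-node cycle with a lag of 1 applied to node "a".
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a")]
wi = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 2}
r = {"a": 1, "b": 0, "c": 0}
```

On a cycle every node has equal in- and out-degree, so the total register count is preserved, as Equation 2.5 makes immediately visible.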
Fan-out Sharing Because the retiming graph G represents all connectivity as pair-wise
edges, it does not adequately model the hypergraph connectivity of the netlist Ghyp . A single
physical wire may implement multiple point-to-point connections. For certain applications
of retiming, this is irrelevant to the problem; for the minimum-register objective, correctly
accounting for the connectivity of the retimed circuit is imperative. In particular, the
registers on edges fanning out from the same vertex can be shared. With this register fan-out
sharing, the correct register count is described by Equation 2.6.
|G′ | =
X
∀u∈V
max
∀{(u,v)|u→v,∃v}
29
wr (u, v)
(2.6)
Leiserson and Saxe introduce a transformed graph Ĝ that exactly models register fan-out sharing. A mirror vertex v̂ is added for every vertex v that has outdegree(v) > 1 (i.e. multiple fan-outs). For every edge v → u, a mirror edge ê = u → v̂ is also created and assigned an initial register count as given by Equation 2.7. An edge breadth function β(e) : E → ℜ is also applied, where β(e) = 1/outdegree(v). With the edge breadths, the total number of registers in the retimed circuit becomes Equation 2.8. The number of registers |Ĝ| can be shown to be identical to the number of registers after maximally collapsing the registers in G with fan-out sharing.

wi(u → v̂) = max_{∀e∈outgoing(v)} wi(e) − wi(v → u)   (2.7)

|Ĝ′| = Σ_{∀e∈E} β(e) wr(e)   (2.8)
     = |Ĝ| + Σ_{∀v∈V} r(v) ( Σ_{∀e∈incoming(v)} β(e) − Σ_{∀e∈outgoing(v)} β(e) )   (2.9)
We assume this method of modeling fan-out sharing is used throughout this work.
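The effect of fan-out sharing in Equation 2.6 can be illustrated concretely. The sketch below (a hypothetical graph and weights, names ours) compares the per-edge count against the shared count, where a vertex fanning out to two branches needs only one register:

```python
from collections import defaultdict

def count_without_sharing(edges, wr):
    # one register per unit of retimed edge weight, counted independently
    return sum(wr[e] for e in edges)

def count_with_sharing(edges, wr):
    # Equation 2.6: each source vertex contributes only the maximum
    # weight over its outgoing edges, since fan-out registers are shared
    by_source = defaultdict(list)
    for (u, v) in edges:
        by_source[u].append(wr[(u, v)])
    return sum(max(ws) for ws in by_source.values())

# Hypothetical graph: "a" fans out to "b" and "c" with one register on
# each branch; a single shared register can drive both.
edges = [("a", "b"), ("a", "c"), ("b", "c")]
wr = {("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 0}
```

Here the pair-wise edge model would report two registers while the shared count is one, which is why correct fan-out modeling is imperative for the minimum-register objective.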
Problem Size
The size of the final unconstrained min-register linear program is quite
compact. The number of variables is 2Vcomb , where Vcomb is the number of combinational
nodes in the circuit. The number of constraints is proportional to the number of pairwise
combinational edges. The LP can be generated from the circuit in O(P ) time, where P is
the number of node connections (i.e. pins).
This can be solved directly using a general integer linear programming (ILP) solver.
In practice, a more efficient solution is possible due to the specific nature of the problem.
The dual of the problem does not require integer methods (that are NP-hard in the worst
case) and can instead be solved as a continuous LP. Furthermore, the min-register retiming
formulation has a particular structure that makes it well suited to network solutions. Next,
we look at the minimum-cost network circulation problem and how it can be applied to the
problem at hand.
2.2.2 Min-Cost Network Circulation Formulation
The dual of the linear program in Equation 2.3 possesses a network structure. In particular, this problem corresponds to the computation of the minimum-cost network circulation
(MCC).
The minimum-cost network circulation problem is as follows. Given a graph G = (V, E),
let u(e) : E → ℜ be the capacity of each edge, and c(e) : E → ℜ be the cost per unit of
flow along each edge. A flow demand d(v) : V → ℜ is associated with each vertex; the total
demand of all vertices is zero. The objective is to find a flow along each edge f (e) : E → ℜ
that satisfies the demand at each vertex and minimizes the total cost. An MCC problem
can be expressed as a linear program of the form of Equation 2.11.
min Σ_{∀e∈E} c(e)f(e)   s.t.

f(e) ≤ u(e)   ∀e ∈ E   (2.10)

Σ_{∀e∈incoming(v)} f(e) − Σ_{∀e∈outgoing(v)} f(e) = d(v)   ∀v ∈ V   (2.11)
∀e∈outgoing(v)
When MCC is applied to retiming, the vertices V in the minimum-cost circulation problem correspond exactly to the vertices in the retiming graph. The demand d(v) is defined by Equation 2.12 and is equal to the number of incoming edges less the number of outgoing edges. The capacity of each edge u(e) is unbounded, and the cost of each edge c(e) is equal to the register weight, as specified in Equations 2.13 and 2.14, respectively. All of the costs and demands are integers (or rationals, if fan-out sharing is used); this property can be shown to improve the worst-case runtime of some methods.

d(v) = |incoming(v)| − |outgoing(v)|   (2.12)

u(e) = ∞   (2.13)

c(e) = wi(e)   (2.14)
Algorithm             Year   Worst-Case Runtime          Strongly Polynomial
Edmonds and Karp      1972   O(e(log U)(e + v log v))    No
Tardos                1985   O(e^4)                      Yes
Goldberg and Tarjan   1987   O(ve^2 log v log(v^2/e))    Yes
Goldberg and Tarjan   1988   O(ve^2 log^2 v)             Yes
Ahuja et al.          1988   O(ve log log U log(vC))     Yes
Orlin                 1988   O(e(log v)(e + v log v))    Yes
Table 2.1. Worst-case runtimes of various min-cost network flow algorithms
The algorithms available to solve MCC problems have expanded and been improved in
recent decades. In Table 2.1, the worst-case asymptotic runtime bounds of several algorithms are compared. Here, e is the number of arcs, v is the number of vertices, U is the
maximum capacity of any edge, and C is the maximum cost of any edge.
Currently, the (generally) best-performing solution methods are based upon scaling and preflow-push. [31] describes an algorithm with O(VE log(V^2/E) log(VC)) worst-case time, although other methods have non-comparable bounds.
Within the class of network linear programs, minimum-cost flow appears to be one of the
trickier problems. While its worst case bound is not strictly greater than other similar
problems (e.g. maximum-flow), its application to practical problems is widely understood
to require a greater degree of effort.
Theoretically, it has also proved to be a challenge. It wasn't until 1985, with the work of [32], that an algorithm with strongly polynomial worst-case runtime was developed; this is over a decade after the same bound was established for computing maximum-flow.
Several of the procedures in Table 2.1 even require solving a maximum-flow problem in the
course of the algorithm.
2.3 Algorithm
We introduce a new method for unconstrained minimum-register retiming. Instead of
the traditional minimum-cost network circulation formulation, we utilize a technique based
upon iterating a maximum network flow problem.
The overall outline of our algorithm is presented in Algorithm 1. A maximum network
flow problem is constructed from the circuit graph and solved, the residual graph is used
to generate a minimum cut, and the registers are retimed forward to the cut location; this
procedure is iterated until a fix-point is reached. Next, a similar sequence of operations is
performed to retime the registers backward. When the backward fix-point is reached, the
resulting circuit is the optimal minimum register retiming.
Algorithm 1: Flow-based Min-register Retiming: FRETIME()
Input : a sequential circuit G
Output: min-register retimed circuit G′
let |G| be the number of registers in G
direction ← forward
repeat
nprev ← |G|
Gresidual ← maxflow(G)
C ← mincut(Gresidual )
move registers to C
until |G| = nprev
direction ← backward
repeat
nprev ← |G|
Gresidual ← maxflow(G)
C ← mincut(Gresidual )
move registers to C
until |G| = nprev
2.3.1 Definitions
A combinational frame of the circuit comprises the acyclic network between the register outputs / PIs and the register inputs / POs. An example of this is illustrated in Figure 2.3(ii) for the circuit in 2.3(i). The inputs (the register outputs / PIs) lie on the left, and the outputs (the register inputs / POs) on the right. The registers are denoted with rectangles
and the primary IOs with squares. The registers are duplicated for ease of illustration,
and in the duplicated names we use superscripts to denote the n-th cycle replication of
the element. The cycles that exist in the original sequential circuit are implied by the connections through the registers between the inputs and the duplicated outputs. These conventions apply to subsequent diagrams of a similar nature.
Figure 2.3. A three bit binary counter with enable.
Let G =< V, E, Vsrc , Vsink > be a directed acyclic graph with a set of source nodes
Vsrc ⊂ V with no incoming edges and a set of sink nodes Vsink ⊂ V with no outgoing edges.
A source-to-sink path is a set of edges p = vsrc ⇝ vsink that transitively connects some vsrc ∈ Vsrc and vsink ∈ Vsink.
The fan-out of v is the set of nodes U = {u : v → u ∈ E}. Similarly, the fan-in of v
is the set of nodes U = {u : u → v ∈ E}. The set TFO(v) is the set of vertices in the transitive fan-out of v; TFI(v) is the transitive fan-in of v. Unless stated otherwise, the transitive connectivity is assumed to be broken at sequential vertices.
A cut of a graph G is a subset of edges C ⊆ E that partitions G into two disjoint subgraphs with the source and sink nodes in separate halves. It holds that for any cut there exists no source-to-sink path p where p ∩ C = ∅; every source-to-sink path is cut at least once.
A retiming cut of G is a cut such that there exists no path p from vsrc → vsink where
|p ∩ C| > 1. Every source-to-sink path is cut exactly once.
2.3.2 Single Frame
The core of the algorithm consists of minimizing the number of registers within a single
combinational frame. Let us consider only the paths through the combinational logic that
lie between two registers (thus temporarily ignoring the primary inputs and outputs). In
this combinational frame, we assign Vsrc = R^0 and Vsink = R^1. The current position of
the registers clearly forms a complete cut through the network (immediately at its inputs)
and also meets the above definition of a retiming cut. The width of the cut is the initial
number of registers.
Consider retiming the registers in the forward direction through the combinational circuit. As the registers are retimed over the combinational nodes, the corresponding cut moves
forward through the network and may grow or shrink in width as registers are replicated
and/or shared as dictated by the graph structure.
The problem of minimizing the number of registers by retiming them to new positions
within the scope of the combinational frame is equivalent to finding a minimum width cut.
This is the dual of the maximum network flow problem, for which efficient solutions exist.
Maximum Flow
The maximum flow problem is defined as follows.
A flow graph G =< V, E, Vsrc , Vsink , u > extends the previous graph by adding a capacity
function. u(e) : E → ℜ is the capacity of each edge. Without loss of generality, we
also identify a single source and sink: vsrc and vsink . The multiple sources and sinks can
be simulated by adding an unconstrained edge from every element of those sets to the
appropriate singular version.
The objective is to find a flow along every edge f (e) : E → ℜ that maximizes the
total flow from vsrc to vsink without violating any of the individual edge capacities. The
maximum-flow problem can also be expressed as a linear program of the form of Equation
2.15.
max Σ_{∀e≡(s,t):s=vsrc} f(e)   s.t.   (2.15)

f(e) ≤ u(e)   ∀e ∈ E

Σ_{∀e∈incoming(v)} f(e) − Σ_{∀e∈outgoing(v)} f(e) = 0   ∀v ∈ V \ {vsrc, vsink}

Σ_{∀e∈incoming(vsink)} f(e) − Σ_{∀e∈outgoing(vsrc)} f(e) = 0
Maximum-flow is one of the fundamental problems in the class of network algorithms. It can be viewed as one of the essential “halves” of the more general minimum-cost network circulation problem (the other being the shortest path computation) [33]. A maximum-flow problem can be written in the more general MCC form by setting all of the costs to zero and adding an unconstrained edge from vsink to vsrc.
Similarly to minimum-cost network circulation, there are specialized solution methods
that make use of the particular structure of the maximum-flow problem. Table 2.2 describes
some of these algorithms of historical and practical interest. Here, v is the number of nodes
in the problem and e the number of edges. U is the maximum capacity of any edge in
the graph. There are both pseudo-polynomial algorithms (whose complexity involves the
maximum edge capacity U ) and strongly polynomial algorithms (whose complexity only
depends on the size of the flow graph).
Given a flow f(e) in the original graph G, the residual graph Gresidual = <V, E, Vsrc, Vsink, uresidual> is defined as having the same structure as the original flow network but a set of capacities uresidual(e) as in Equation 2.16. The residual graph captures the amount of remaining capacity available on each edge. If the flow f(e) is indeed maximal, there will exist no path vsrc ⇝ vsink in the edges with remaining capacity in the residual
Algorithm             Year   Worst-Case Runtime
Dantzig               1951   O(v^2 e U)
Ford and Fulkerson    1956   O(veU)
Dinitz                1970   O(v^2 e)
Edmonds and Karp      1972   O(e^2 log U)
Karzanov              1974   O(v^3)
Cherkassky            1977   O(v^2 e^(1/2))
Sleator and Tarjan    1983   O(ve log v)
Goldberg and Tarjan   1986   O(ve log(v^2/e))
Ahuja and Orlin       1987   O(ve + v^2 log U)
Cheriyan et al.       1990   O(v^3 / log v)
Goldberg and Rao      1997   O(min(v^(2/3), e^(1/2)) e log(v^2/e) log U)
Table 2.2. Worst-case runtimes of selected maximum network flow algorithms [3]
graph. Correspondingly, the residual capacity of every edge in a minimum-width cut will be zero.
uresidual(e) = u(e) − f(e)   (2.16)

Deriving Minimum Cut Once the maximum flow through the combinational network
has been determined, the corresponding minimum cut is derived. The width of this cut is
identical to the maximum flow and corresponds to the number of registers in the circuit
after the retiming has been applied.
The residual graph is used to generate a corresponding minimum cut. The vertices in
the network are partitioned into two sets: S, those that are reachable in the residual graph
with additional flow from the source, and R, those that are not. Generating this partition
is O(E) in the worst case. The partition must be a complete cut, because there can exist
no additional flow path from the source to the sink if a maximal flow has already been
assigned. We define the minimum-width cut Cmin to be the set of edges u → v where
(u ∈ S) ∧ (v ∈ R).
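The max-flow and cut-derivation steps just described can be sketched in Python. This is a minimal Edmonds–Karp-style illustration with standard residual reverse arcs, not the dissertation's implementation, and it omits the retiming-specific modifications (unbounded reverse edges, node splitting) discussed below; all names and the example graph are ours:

```python
from collections import defaultdict, deque

def max_flow_min_cut(edges, cap, src, sink):
    # residual[u][v] holds the remaining capacity on arc u -> v
    residual = defaultdict(lambda: defaultdict(int))
    for (u, v) in edges:
        residual[u][v] += cap[(u, v)]

    def bfs_path():
        # breadth-first search for an augmenting path with spare capacity
        parent = {src: None}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    if v == sink:
                        return parent
                    queue.append(v)
        return None

    flow = 0
    while (parent := bfs_path()) is not None:
        # trace the path back from the sink and find its bottleneck
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for (u, v) in path)
        for (u, v) in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck  # reverse arc allows later undo
        flow += bottleneck

    # S: vertices still reachable from the source in the residual graph
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        for v, c in residual[u].items():
            if c > 0 and v not in seen:
                seen.add(v)
                queue.append(v)
    cut = [(u, v) for (u, v) in edges if u in seen and v not in seen]
    return flow, cut

# Hypothetical diamond graph: two disjoint-capacity paths from s to t.
edges = [("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")]
cap = {("s", "a"): 3, ("s", "b"): 2, ("a", "t"): 2, ("b", "t"): 3}
flow, cut = max_flow_min_cut(edges, cap, "s", "t")
```

On this small example the cut edges are recovered purely from residual reachability, mirroring the S/R partition described above, and the cut width equals the maximum flow.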
Reverse Edges Without additional constraints, the cut Cmin may not be a legal retiming.
A cut in a directed graph only guarantees that all paths in the graph are cut at least once.
This is a necessary but not sufficient condition for the cut to be a valid retiming.
In the initial circuit, it is evident that any path in the combined graph passes through
exactly one register (i.e. it is a retiming cut). Any valid retiming must preserve this
property. If this were not the case, the latency of that path would be altered and the
sequential behavior of the circuit changed. We seek the minimum cut in the graph such
that all paths are crossed exactly once.
Figure 2.4 illustrates a simple example of how this can lead to an illegal retiming. There are exactly two forward flow paths: {R1^0 → v1 → R1^1} and {R3^0 → v4 → R3^1}. The corresponding cut Cmin = {(v1, R1^1), (R3^0, v4)}, but this is illegal. The path {R3^0 → v4 → v3 → v2 → v1 → R1^1} now has two registers where it previously had one. This would insert additional sequential latency and alter the functionality of the circuit. Another example is provided in Figure 2.5.
Figure 2.4. An example circuit requiring unit backward flow.
Figure 2.5. An example circuit requiring multiple backward flow.
The network flow problem can be altered to eliminate the possibility that a path is
crossed more than once. Reverse edges with unbounded capacity are added in the direction
opposite to the constrained edges in the original network. These additional paths may
increase the maximum flow (and therefore the size of the minimum cut) but guarantee that
the resulting minimum cut will correspond to a legal retiming. For a path in the original
graph to cross the finite width cut more than once from S → R, there must be at least one
edge that crosses from R → S. If the unbounded reverse edges are also present, this would
imply an infinite-capacity edge from S → R, thus violating the finiteness of the cut-width.
We label this Property 1.
Property 1. If there exists an unconstrained flow edge e = u → v, a finite minimum cut
Cmin will never contain edge e.
Proof. If e were in Cmin, this implies that u ∈ S ∧ v ∈ R from the method in which the cut is generated. However, if u(e) = ∞, the edge can never become saturated; v will always remain reachable from u in the residual graph, so this edge could not have possibly been included in Cmin.
The addition of the unbounded reverse flow corrects the example in Figure 2.4. A new flow path is created: {R2^0 → v2 → v3 → R2^1}. With this increase in the maximum flow comes an increase in the width of the minimum cut; the new locations of the registers after retiming are identical to their pre-retiming positions.
Similarly, the number of registers in Figure 2.5 will be increased, and the correct functionality of the circuit restored. In this example, f(v1 → v6) = 2 in the maximum possible flow. This demonstrates that allowing only unit flow on the reverse edges is not sufficient to guarantee that the resulting cut is a legal retiming cut.
Fan-out Sharing As in the method of Section 2.2.1, it is also necessary to account
for the sharing of registers at nodes with multiple fan-outs. This requires another simple
modification to the network flow problem. Each circuit node v is decomposed into two
vertices: a receiver of all of the former fan-in arcs v receiver and an emitter of all of the
former fan-out arcs v emitter . The flow constraints are removed from these structural edges,
and a single edge with a unit flow constraint is inserted from the receiver to the emitter.
This transformation is depicted in Figure 2.6.
Figure 2.6. Fan-out sharing in flow graph.
Via Property 1, the unconstrained edges can not participate in the minimum cut; only
the internal edge is available to make a unit contribution to the cut-width. Each node
will therefore require at most one register regardless of its fan-out degree. Then, to model
fan-out (as opposed to fan-in) sharing, the reverse edges are connected between adjacent
receivers. This idea can also be extended to model fan-in sharing as in [34].
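The node-splitting transformation can be sketched as follows. This illustrative fragment (naming scheme and data structures are ours) emits a unit-capacity internal edge per node and unconstrained structural arcs; for simplicity the reverse edges here merely oppose each structural arc, rather than connecting adjacent receivers as described above:

```python
INF = float("inf")

def split_for_fanout_sharing(nodes, arcs):
    """Return flow-graph edges (u, v, capacity) modeling register sharing."""
    edges = []
    for v in nodes:
        # only the internal edge can appear in a finite minimum cut
        # (Property 1), so each node contributes at most one register
        edges.append((f"{v}.recv", f"{v}.emit", 1))
    for (u, v) in arcs:
        edges.append((f"{u}.emit", f"{v}.recv", INF))  # structural arc
        edges.append((f"{v}.recv", f"{u}.emit", INF))  # reverse edge
    return edges

flow_edges = split_for_fanout_sharing(["g1", "g2"], [("g1", "g2")])
```

Because every structural arc is unconstrained, a cut can only pass through receiver-to-emitter edges, charging each circuit node at most one register regardless of its fan-out degree.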
Primary Inputs and Outputs The primary inputs and outputs (PIOs) can be treated
in different ways, depending on the application. The allowed locations of the minimum cut
and the subsequent insertion/removal of registers can be adjusted to either fix or selectively
alter the sequential behavior of the circuit with respect to the external environment.
In synthesis, the relative latencies at all of the PIOs are assumed to be invariant. In
verification applications, it is not necessary to preserve the synchronization of the inputs
and outputs. It may be desirable to borrow or loan registers to the environment individually
for each PIO if the result is a net decrease in the total register count.
To allow register borrowing, the external connections should be left dangling. Registers
will be donated to the environment if the minimum cut extends past the dangling primary
outputs (POs); conversely, registers will be borrowed if the minimum cut appears in the
transitive fan-out region of the dangling primary inputs (PIs). The inclusion of this region
introduces additional flow paths and allows additional possibilities for minimizing the total
register count.
Figure 2.7. The illegal retiming regions induced by the primary input/outputs.
To disallow desynchronization with the environment, a host node and normalization can
be employed, or alternatively, the flow problem suitably modified: the POs are connected
to the sink and the transitive fan-out of the PIs is blocked from participating in the minimum cut. This is illustrated in Figure 2.7. All paths through the combinational network
that originate from a PI have a sequential latency that must remain at zero; inserting a
register anywhere in the T F O({PIs}) would alter this. To exclude this region, one of two
methods can be used: (i) temporarily redirecting to the sink all edges e ≡ (u, v) where
v ∈ T F O({PIs}), or (ii) replacing the constrained flow arcs in this fan-out cone with unconstrained ones, thus preventing these nodes from participating in the minimum cut. Both
methods restrict the insertion or deletion of registers in the invalid region. We primarily
utilize (ii), as it decreases the length of the flow paths from source to sink. Selectively
disallowing desynchronization during verification may be motivated by the need to control
complexity. Because register borrowing requires the initial values of the new registers to be
constrained to those reachable in the original circuit, it is necessary to construct additional
combinational logic for computing the initial state. If the size of this logic grows undesirably
large, register borrowing can be turned off at any point for any individual inputs or outputs.
Implementing the Retiming
The mechanics of moving the registers to their new locations are straightforward. First, the register nodes are disconnected from their original graph locations. The former fan-outs of each register are reconnected to its fan-ins. Second, new register nodes are inserted along the arcs in the minimum cut, one register per arc source (if fan-out sharing is enabled). Note that this does not require that all of the fan-outs of a node in the circuit are transferred to the register; the connections to fan-outs that correspond to any outgoing arcs of a node that do not cross the minimum cut are left untouched.
If the registers have defined initial states, some computation must be performed to
calculate a set of equivalent initial states at the new positions (if one exists). This is
addressed in detail in Chapter 4.
We can now prove an important property of the resulting retiming: that it minimizes
the registers with the minimum amount of movement.
Let C1 and C2 be two minimum width cuts. We define topological nearness to v as follows. Let p = v ⇝ v′ be a path in G that consists of a set of ordered edges e_{0..i}. Cut C1 is strictly nearer than C2 if ∀p, ei ∈ C1 ∧ ej ∈ C2 ⟹ i ≤ j. It is partially nearer if ∃p, ei ∈ C1 ∧ ej ∈ C2 ∧ i < j. A cut is the strictly nearest of a set of cuts if and only if there exists no other member of the set that is partially nearer.
Lemma 2. The returned cut Cmin will be the minimum-width retiming cut that is topologically strictly nearest to vsrc .
Proof. We prove this by demonstrating that the existence of a nearer cut violates the manner
in which Cmin was constructed. The source u of every edge (u, v) ∈ Cmin must have been
reachable in the residual graph from vsrc , but this can not be the case for every edge if there
exists another closer cut.
Assume that there exists some minimum-width retiming cut C′ that lies topologically partially nearer to vsrc than the returned cut Cmin. From the definition of partial topological nearness, we know that there exists some path p = vsrc ⇝ v′ such that ei ∈ C′ ∧ ej ∈ Cmin ∧ i < j.
If ej is further from vsrc than ei in path p, then we know that for every other path containing ej there can exist no edge ek ∈ C′ where k ≥ j. If this were the case, then by transitivity ek ∈ TFO(ej) and C′ would not be a retiming cut.
However, because C′ is a cut, every path vsrc ⇝ vsink must contain some edge ek ∈ C′, including those paths containing ej. Therefore, every path to ej must have an edge ek ∈ C′ where k < j.
Because uresidual (ek ) = 0, there could not have been a flow path in the residual graph to
ej . Because Cmin was constructed exactly by partitioning the nodes by source-reachability
and ej is in the shallower source-reachable region, this implies a contradiction. This situation
can not happen.
Cmin must therefore be the topologically strictly nearest cut to vsrc .
Lemma 2 also holds when expressed in terms of the corresponding retiming lag functions.
Let r1 (v) and r2 (v) be two valid retiming lag functions. r1 is strictly nearer than r2 if
∀v ∈ V, r1 (v) ≤ r2 (v). It is partially nearer if ∃v ∈ V, r1 (v) < r2 (v). Given a set of valid
retimings r1..i , one can be said to be strictly nearest if and only if there exists no other
member of the set that is partially nearer. The proof is similar.
Final Problem
The final flow graph on which the minimum-register retiming with fan-out sharing can be computed for a single combinational frame is illustrated in Figure 2.8(ii). The complete local flow graph is shown for gate g3. Solid lines represent edges with unit flow constraints and dotted lines edges with unbounded flow constraints. This has been derived from 2.8(i) by decomposing the hypergraph into arcs, splitting the nodes into emitters and receivers to model fan-out sharing, and adding reverse unconstrained edges. We will see in Section 2.4.2 that it is not necessary to explicitly construct this network.
2.3.3 Multiple Frames
Thus far, we have only considered the forward retiming of registers in the circuit. It
is sufficient to consider only one direction if the circuit is strongly connected (i.e. through
the use of a host or environment node) and normalization is applied. However, in general,
the optimum minimum-register retiming requires both forward and backward moves. The
procedure for a single iteration of backward retiming is nearly identical, except that the
maximum flow from the register inputs (sources) to the primary inputs and register outputs
(sinks) is computed. Note that the fan-out sharing receivers now correspond to the original
outputs and the fan-out sharing emitters to the inputs.
The overall algorithm consists of two iterative phases: forward and backward. In each
phase, the single-frame iteration is repeated until the number of registers reaches a
fix-point. The procedure is outlined in Algorithm 1 and Figure 2.9.
Figure 2.8. The corresponding flow problem for a combinational network.

At no point during retiming is it necessary to unroll the circuit or alter the combinational logic; only the register boundary is moved by extracting registers from their initial position and inserting them in their final position. Therefore, each iteration is fast. In each
iteration, every node’s retiming lag r(v) is either changed by one or unchanged.
The ordering of the two phases (forward and backward) doesn't affect the number of registers in the result, but we chose to perform forward retiming first. In general, min-register retiming is not unique: the registers can be moved to identically-sized cuts that are closest in either the forward or backward directions. It is also possible to interleave forward
and backward steps. However, the forward-first approach reduces the amount of logic that
has to be retimed backward, thereby reducing the difficulty of computing a new initial state.
This process will be explained in Chapter 4.
Figure 2.9. Flow chart of min-register retiming over multiple frames
2.4
Analysis
2.4.1
Proof
In this section, we prove two facts about our flow-based retiming algorithm: (i) the
result preserves functional correctness (modulo initialization behavior), and (ii) the result
has the minimum number of registers possible via retiming alone.
Correctness
Theorem 3. Algorithm 1 preserves functional correctness.
Proof. From the result of [8], we know that all legal retimings preserve functionality, and
also that a transformation is a legal retiming if and only if it can be described by some
retiming lag function. To prove functional correctness, it is therefore sufficient to describe
the retiming lag function that exactly corresponds to the transformation produced by our
algorithm. We do this constructively.
We begin with the registers in their initial positions and the lag value of every combinational node initialized to zero; that is, ∀v ∈ V, r_0(v) = 0.
The algorithm proceeds by iterating the single frame register minimization in either
direction. In each iteration, a minimum cut is computed under the constraint that every
directed path from the starting positions of the registers to either a register or an output
crosses the cut exactly once. Let C0 be the original locations of the registers and Cmin be
the set of edges in this cut. We move the registers to these edges.
Lemma 4 (Correspondence of lag function to cut). The movement between two retiming
cuts C_0 and C_1 corresponds to a retiming lag function r_c(v).
There exists a retiming lag function that reproduces this register movement exactly.
Given the lag function r_i(v) that generates the circuit at the start of each iteration i, let
r_{i+1}(v) be the lag function that generates the circuit at the end of the iteration. This is
computed as stated in Equation 2.17. Both of these are transformations from the initial
register positions; the movement of this one iteration alone is captured by r_{i+1}(v) − r_i(v).
             | r_i(v) + 1   if v ∈ TFO(C) and the direction is backward
r_{i+1}(v) = | r_i(v) − 1   if v ∈ TFI(C) and the direction is forward      (2.17)
             | r_i(v)       otherwise
In the case where the flow graph has been transformed to model fan-out sharing, consider
a cut edge between v_receiver → v_emitter to be a cut among all structural edges
{v → u : u ∈ fanout(v)}.
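As a concrete illustration, the per-iteration lag update of Equation 2.17 can be written as a short sketch. The TFO/TFI sets of the cut are assumed to be computed elsewhere; this is not the dissertation's code.

```python
def update_lags(lag, cut_tfo, cut_tfi, direction):
    """Apply Equation 2.17: increment lags in TFO(C) for a backward move,
    decrement lags in TFI(C) for a forward move, leave all others unchanged."""
    new_lag = dict(lag)
    if direction == "backward":
        for v in cut_tfo:
            new_lag[v] = lag[v] + 1
    elif direction == "forward":
        for v in cut_tfi:
            new_lag[v] = lag[v] - 1
    return new_lag
```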
We now show that r_{i+1}(v) − r_i(v) exactly reproduces the movement of the registers from
their positions at the start of the iteration to those at the end. Registers are removed from
C_0, added to C_1, and the remaining edges are left untouched.
First, consider the edge u → v in the retiming graph that corresponds to each of the
original register locations. If the minimum cut does not lie on exactly this edge, it must
be in its fanout. Therefore, either the lag of u will be incremented or the lag of node v
decremented (depending on the retiming direction of this iteration). We also know that
the lag at the other end of the edge remains constant, as this marks the boundary of the
combinational frame in which the cut is contained. The result is a net decrease in the
retimed register weight, w_r(u → v) (from Equation 1.2), of exactly one. If the minimum
cut lies on exactly this edge, the register is preserved as expected.
Next, consider the edge u → v in the retiming graph that corresponds to each edge in C.
Here, the lag of the node in the direction of the movement will remain constant, as this lies
beyond the cut and cannot be in its transitive fanout/in. Because no path can cross the cut
more than once, no other edge in C could contain the node in its transitive fanout/in. (This
is exactly the condition that would be violated and would lead to functionally incorrect
solutions if a minimum cut were computed without the single-crossing-per-path constraint.)
As the other end of the edge is incremented/decremented, the net register weight is increased
by exactly one.
All edges in the retiming graph that do not lie at the original register boundary or the
minimum cut will have no change in register weight. We know that w_i(u → v) = 0 because
it was not an original register location. This means that there is a combinational path
between u and v. Because the cut C does not lie between the two, the transitivity of the
TFO/TFI operations in Equation 2.17 dictates that u is incremented if and only if v is
incremented. There will be no net change on any of these edges and no new registers.
We have shown that there exists a retiming lag function that reproduces the register
relocation performed in each iteration. The described function exactly: (i) removes the
registers from their positions at the start of the iteration, (ii) inserts registers at exactly
the new positions at the end of the iteration (and no others). We have also described how
to compute the cumulative lag function that describes the total change over all iterations.
From the result of [8], we can establish that this transformation is indeed a valid and
functionally correct retiming.
If the transitive fan-in(out) of the PIs(POs) is excluded from the minimum cut computation
via the mechanism described earlier, the lag at the input and output nodes (as given by the
lag function) will remain zero. This guarantees that the sequential latency of every output
is preserved.
Optimality
First, we demonstrate the converse of Lemma 4. We proved Lemma 4 by considering
combinational frames one at a time and composing the solution. The idea can be generalized
to multiple simultaneous combinational frames (e.g. an unrolled circuit). For Lemma 5, we
examine the unrolled circuit directly.
Consider unrolling the sequential circuit by n cycles, where n > max_v r(v) − min_v r(v).
Each vertex v is replicated n times, producing a corresponding set of unrolled vertices
v^0, ..., v^n. If v^0 represents the vertex in the reference frame, v^i corresponds to that
vertex in the ith unrolled frame; similarly for edges. The register inputs from time cycle i
are connected to the register outputs of time cycle i + 1. An example of this is contained
in Figure 2.10.
Lemma 5 (Correspondence of cut to lag function). Every retiming lag function r_c(v)
corresponds to a cut C in the unrolled circuit.
Proof. A retiming lag function r(v) corresponds to a cut C in the unrolled circuit. This cut
consists of the edges e^i = u^i → v^i where r(u) ≤ i < r(v).
The positions of the registers of the reference cycle after any retiming r(v) can be
expressed as a cut C in the edges of this unrolled circuit. The elements of C are the register
positions. The unretimed cut, C_init (such that r(v) = 0), lies at the base of the unrolled
circuit. The size of this cut, |C|, is the number of registers post-retiming, or equivalently,
the number of combinational nodes whose fan-out hyperedges cross the cut.
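The correspondence between a lag function and a cut in the unrolled circuit can be illustrated with a short sketch. It assumes, for simplicity, edges with no initial registers, so that each structural edge u → v contributes the unrolled copies with r(u) ≤ i < r(v); this is an illustrative assumption, not code from the dissertation.

```python
def unrolled_cut(edges, r):
    """Enumerate the unrolled-circuit edges (u^i -> v^i) crossed by the cut
    induced by lag function r.  For an initially register-free edge u -> v,
    the number of crossed copies equals the retimed register weight
    r(v) - r(u)."""
    cut = []
    for (u, v) in edges:
        for i in range(r[u], r[v]):  # copies r(u) <= i < r(v)
            cut.append((u, i, v))
    return cut
```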
A cut C is a valid retiming if every path through the combinational network passes
through it exactly once. This implies that for any two registers R_1, R_2 ∈ C,
R_1 ∉ TFO(R_2) and vice versa. If this were not the case, additional latency would be
introduced and the functionality of the circuit would be altered.
A combinational frame of the cut C with retiming function r(v) is the region in the
unrolled circuit between C and C′, where C′ is generated by r′(v) = r(v) + 1. If the circuit
were retimed to C, this corresponds exactly to the register-free combinational network
structure that would lie on the outputs of the register boundary.
Figure 2.10. A cut in the unrolled circuit.
Theorem 6. Algorithm 1 results in the minimum number of registers achievable by any
retiming.
Proof. Consider an optimal minimum-register retiming and its corresponding cut C_min.
While there exist many such cuts, assume C_min to be the one that lies strictly forward
of the initial register positions and is topologically closest to C_init. It can be shown with
Lemma 7 that there is one unambiguously closest cut.
Our algorithm iteratively computes the nearest cut of minimum width reachable within
one combinational frame and terminates when there is no change in the result. Let the
resulting cut after iteration i be C_i. The cut C_i at termination will be identical to C_min
if the following two conditions are met.
Condition 1. No register in C_i lies topologically forward of any register in C_min.
Condition 2. After each iteration, |C_{i+1}| < |C_i| unless C_i = C_min.
Figure 2.11. Retiming cut composition.
Lemma 7 (Cut composition). Let C_i and C_j be two valid retiming cuts, and {s_i, t_i} and
{s_j, t_j} be a partitioning of each: (s ∪ t = C) ∧ (s ∩ t = ∅). Also, for any path p,
(p ∩ s_i ≠ ∅) ⇔ (p ∩ s_j ≠ ∅) and (p ∩ t_i ≠ ∅) ⇔ (p ∩ t_j ≠ ∅). If this is the case, the
cuts {s_i, t_j} and {s_j, t_i} are also valid retimings.
Proof. One example of such a partitioning is induced by topological order. If the points
of intersection of the cuts with a path p are R_i ∈ C_i and R_j ∈ C_j, we can assign the
registers to s if R_j ∈ TFO(R_i), and to t otherwise. The s sets will include the registers that
are topologically closer in C_i than C_j; the t sets will include the registers that are in both
sets or topologically closer in C_j than C_i.
If a given p crosses C_i at R_i ∈ s_i, it cannot cross any other register in t_i (from the
definition of a partition). It also cannot contain any register in t_j (from the definition
of the sets). The cut {s_i, t_j} thus has no more than one register on any path. If a p does
not intersect s_i, then we know that it must cross at some R_i ∈ t_i (from the definition of a
partition). It must then also intersect some register in t_j (from the definition of the sets).
The cut {s_i, t_j} has at least one register on any path. Therefore, {s_i, t_j} is crossed by
every path exactly once and is a valid retiming. Similarly for {s_j, t_i}.
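The single-crossing validity test used throughout this argument is easy to state in code. This checker is an illustration only, with each path represented as the list of registers it passes through.

```python
def crossings(path, cut):
    """Number of registers of `cut` that lie on `path`."""
    return sum(1 for reg in path if reg in cut)

def valid_retiming_cut(paths, cut):
    """A cut is a valid retiming iff every complete path crosses it exactly once."""
    return all(crossings(p, cut) == 1 for p in paths)
```

For example, with two paths crossing cuts C_i = {R1, R2} and C_j = {R3, R4}, the composed cut {R1, R4} from Lemma 7 also crosses each path exactly once.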
Proof of Condition 1. Consider a cut C_i that violates Condition 1. Let {s_i, t_i} be a
partition of C_i and {s_min, t_min} be a partition of C_min such that s_i is the subset of
registers in C_i that lie topologically forward of the subset s_min of the registers in C_min.
This is illustrated in Figure 2.11. By Lemma 7, we know that both {s_i, t_min} and
{s_min, t_i} are valid cuts.
Because a single iteration returns the nearest cut of minimum width within a frame,
this C_i = {s_i, t_i} must be strictly smaller than the closer {s_min, t_i}. This implies that
|s_i| < |s_min| and that |{s_i, t_min}| < |{s_min, t_min}| = |C_min|. This is impossible by
definition. Therefore, Condition 1 must be true.
Observation 1. Retiming by an entire combinational frame does not change any of the
register positions in the resulting circuit and also represents a valid retiming cut. Because
a register is moved over every combinational node, the retiming lag function is universally
incremented. Since the number of registers on a particular edge is a relative quantity, the
result is structurally identical to the original.
Proof of Condition 2. We can use the minimum cut to generate a cut that is strictly
smaller than C_i and reachable within a combinational frame. Consider the cut C′_min that
is generated from C_min via Observation 1 such that its deepest point is reachable within
the combinational frame of C_i. Some of the retiming lags may be temporarily negative.
Let {s_i, t_i} be a partition of C_i and {s_min, t_min} be a partition of C′_min such that
s_min are the deepest registers in C′_min that lie topologically forward of the subset s_i of
the registers in C_i. Note that s_min ≠ ∅ if C_i ≠ C_min. Using the reasoning from
Condition 1, both {s_i, t_min} and {s_min, t_i} are valid cuts.
We know that |s_min| < |s_i|; otherwise, the existence of a topologically nearer cut
|{s_i, t_min}| ≤ |C_min| would be implied. Therefore, the cut {s_min, t_i} is strictly smaller
than C_i, is reachable within one combinational frame, and would be returned by a single
iteration of the algorithm. Note that this does not imply that there are no other smaller
cuts, only that there must exist at least one that is strictly smaller. Therefore, Condition 2
must also be true.
2.4.2
Complexity
The core of our algorithm consists of computing the maximum flow through a single
combinational frame of the circuit. If the circuit has V gates and E pairwise connections
between them, the corresponding flow problem will have 2V vertices and V + 2E edges. The
doubling of vertices is due to the split into receiver/emitter pairs to model fanout sharing,
and the edge total is due to the structural edges, the reverse edges, and the internal edges
between the receiver and emitter nodes. We assume that every vertex has at least one
structural fanout and that V < E. The size of the flow graph is therefore linear in that of
the original circuit.
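The construction can be sketched as follows. `build_flow_graph` is a hypothetical helper written for illustration; the sizes of its output match the 2V-vertex, V + 2E-edge totals derived above.

```python
def build_flow_graph(nodes, edges):
    """Split each circuit node v into a receiver (v,'r') and an emitter
    (v,'e').  Flow edges: one unit-capacity internal edge per node (this
    models fan-out sharing), plus an unbounded structural edge and an
    unbounded reverse edge per pairwise connection.  Totals: 2V vertices
    and V + 2E edges."""
    INF = float("inf")
    vertices = [(v, p) for v in nodes for p in ("r", "e")]
    flow_edges = [((v, "r"), (v, "e"), 1) for v in nodes]   # internal, capacity 1
    for (u, v) in edges:
        flow_edges.append(((u, "e"), (v, "r"), INF))        # structural
        flow_edges.append(((v, "r"), (u, "e"), INF))        # reverse, unconstrained
    return vertices, flow_edges
```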
Based on the result of [35], the worst-case runtime of computing the maximum flow
through this graph is O(V E log(V²/E)). We can then derive a minimum cut from the
residual graph in O(E) time, as this is just a check for source reachability. The maximum
number of iterations can also be bounded by R via Condition 2 in the above proof. The
total worst-case runtime is therefore O(RV E log(V²/E)) using this method.
Our experience with [35] indicates that the algorithmic enhancements to improve the
worst-case bound of the maximum-flow runtime do not provide much savings in terms of
average runtime for min-reg retiming on our examples. We initially used the HIPR tool
[36] to compute the maximum flow and determined that it was not any more effective than
a simple and efficient implementation that utilizes the structure of our particular problem.
We describe this now.
Binary Simplification
The specific nature of the flow problem constructed to solve for the minimum-register
retiming within a single combinational frame (as shown in Figure 2.8) permits simplification
in the method used to solve for the maximum flow.
This simplification is premised on the observation that the capacity of every edge in the
flow graph is either one or infinity. Furthermore, with fanout sharing, there is exactly one
unit of flow that may pass from the input to the output of a gate in the circuit. Instead
of having to maintain the residual flow on each edge, we therefore need only store for each
node: (i) whether its internal edge is at capacity, and (ii) the last internal edge in the flow
path.
We introduce the binary maximum flow technique described by Algorithm 2. This
technique is based on shortest-path augmentation and proves to be quite fast and efficient
for solving minimum-register retiming problems. In the context of path augmentation, the
above two per-node pieces of information permit checking for remaining capacity (on the
unit-constrained edges) and unwinding flow segments (to redirect them along other paths).
Because the model of fan-out sharing requires two (implicit) nodes in the flow graph
for every node in the circuit graph, we introduce the notion of a vertex pole to differentiate
between the fan-out emitter and receiver: a vertex pole v_pole ∈ V × {r, e}.
The reason that shortest-path augmentation performs favorably compared to the
more sophisticated capacity-scaling pre-flow push method is that both the capacities and
the resulting flows are quite uniform. Scaling does not work because every path from source
to sink passes through an edge of minimum capacity (i.e. 1). With few exceptions (Figure
2.5 being one), the maximal flow along every edge will be zero or one. This negates much
of the benefit of pre-flow, in which bundles of flow can be pushed along large edges in a
single early step and shared amongst smaller edges in later ones.
Algorithm 2: Binary Max-flow: MAXFLOW()
1:  Input: a combinational circuit graph G = <V, E>
2:  Output: minimum retiming cut width
3:  define V_pole = V × {r, e}
4:  let d(v̂) : V_pole → Z be initially unassigned
5:  let f(v) : V → {0, 1} be flow markers
6:  let pred(v) : V → V be predecessor markers
7:  let H(x) : Z → Z = 0 be histogram of sink distances
8:  let flow = 0
9:  INITDIST()
10: while ADVANCE_E(v_src, ∅) do
11:   increment flow
12: return flow
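For illustration, a plain shortest-augmenting-path max flow (Edmonds-Karp) applied to the split receiver/emitter graph computes the same cut width. Unlike Algorithm 2, this textbook stand-in stores residual capacities explicitly; the node names and toy edge list below are assumptions for the example, not from the dissertation.

```python
from collections import defaultdict, deque

def max_flow(edge_list, source, sink):
    """BFS shortest-path augmentation over residual capacities; a simplified
    stand-in for the binary scheme of Algorithm 2, which exploits the
    one-or-infinity capacity structure instead of storing per-edge residuals."""
    res = defaultdict(lambda: defaultdict(int))
    for u, v, c in edge_list:
        res[u][v] += c
    value = 0
    while True:
        # breadth-first search for a shortest augmenting path
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return value  # no augmenting path remains
        # recover the path and push the bottleneck capacity along it
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
        value += bottleneck

INF = 10**9  # stands in for the unbounded capacities

# Toy frame: the source feeds node a, whose two fanouts b and c each reach
# the sink; with fan-out sharing (unit internal edges) one register suffices.
EDGES = [("s", "a_r", INF), ("a_r", "a_e", 1),
         ("a_e", "b_r", INF), ("a_e", "c_r", INF),
         ("b_r", "b_e", 1), ("c_r", "c_e", 1),
         ("b_e", "t", INF), ("c_e", "t", INF)]
```

On this example the flow value, and hence the retiming cut width, is 1: the unit internal edge of node a carries all flow, reflecting the fan-out sharing modeled in Figure 2.8.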
The worst-case runtime of our binary-simplified method can be bounded strictly in
terms of the vertices and edges in the circuit, but it is useful to introduce an alternative
bound. Let R be the number of registers in the design. Because the number of registers
decreases in each iteration, we know that the min-cut will never be larger than R. This
also arises structurally from the fact that the flow out of the source has no more than R
paths through the initial positions of the registers. As the time to compute each of the at
most R augmenting paths is O(E), the runtime of determining the maximum flow can be
bounded by O(RE).
Using this bound on the maximum flow, the total runtime of the algorithm is O(R²E).
This is incomparable with (neither strictly better nor strictly worse than) the runtime of
the best minimum-cost flow algorithm.
While the maximum number of iterations in a real circuit appears to be quite small
(based on the results in Section 2.5), a bound may still be desirable to limit the worst-case
runtime. This comes at the expense of optimality. If the number of iterations is bounded by
a constant, our algorithm's runtime becomes O(RE). This is the time in which we can
obtain a usable reduction in register count.
Algorithm 3: Binary Max-flow: INITDIST(v̂)
1:  Input: a vertex pole v̂
2:  let Π be a queue of V_pole
3:  d(v_sink, r) ← 0
4:  push (v_sink, r) → Π
5:  while Π do
6:    pop Π → v̂ ≡ (v, y)
7:    increment H(d(v̂))
8:    if y = r then
9:      for all û ≡ (u, e) s.t. ∃(u, v) ∈ E do
10:       if d(û) is unassigned then
11:         d(û) ← d(v̂) + 1
12:         push û → Π
13:     for all û ≡ (u, r) s.t. ∃(v, u) ∈ E do
14:       if forward and d(û) is unassigned then
15:         d(û) ← d(v̂) + 1
16:         push û → Π
17:   if y = e then
18:     if d(v, r) is unassigned then
19:       d(v, r) ← d(v̂) + 1
20:       push (v, r) → Π
21:     for all û ≡ (u, e) s.t. ∃(v, u) ∈ E do
22:       if backward and d(û) is unassigned then
23:         d(û) ← d(v̂) + 1
24:         push û → Π
Algorithm 4: Binary Max-flow: ADVANCE_R(v, v_pred)
1:  Input: a vertex v
2:  Input: a vertex v_pred, the flow predecessor
3:  Output: {true, false}
4:  if v ≡ v_sink then
5:    return true
6:  let v̂ = (v, r); rsl = false
7:  mark v̂ visited
8:  // reverse unconstrained edges (forward)
9:  for all û ≡ (u, r) s.t. ∃(u, v) ∈ E do
10:   if forward and d(û) + 1 = d(v̂) and û unvisited then
11:     rsl ← ADVANCE_R(u, v_pred)
12: // unwinding flow to another node
13: if f(v) ∧ v_pred ≠ ∅ ∧ d(v_pred, e) + 1 = d(v̂) ∧ (v_pred, e) unvisited then
14:   rsl ← ADVANCE_E(v_pred, pred(v))
15:   pred(v) ← v_pred if rsl
16: // adding internal flow
17: if ¬f(v) and d(v, e) + 1 = d(v̂) and (v, e) unvisited then
18:   if ADVANCE_E(v, v_pred) then
19:     f(v) ← true; rsl ← true; pred(v) ← v_pred
20: mark v̂ unvisited
21: if ¬rsl then
22:   RETREAT(v̂)
23: return rsl
Algorithm 5: Binary Max-flow: ADVANCE_E(v, v_pred)
1:  Input: a vertex v
2:  Input: a vertex v_pred, the flow predecessor
3:  Output: {true, false}
4:  if v ≡ v_sink then
5:    return true
6:  let v̂ = (v, e); rsl = false
7:  mark v̂ visited
8:  // structural edges (backward)
9:  for all û ≡ (u, r) s.t. ∃(u, v) ∈ E do
10:   if backward and d(û) + 1 = d(v̂) and û unvisited then
11:     rsl ← ADVANCE_R(u, v_pred)
12: // structural edges (forward)
13: for all û ≡ (u, r) s.t. ∃(v, u) ∈ E do
14:   if forward and d(û) + 1 = d(v̂) and û unvisited then
15:     rsl ← ADVANCE_R(u, v_pred)
16: // reverse unconstrained edges (backward)
17: for all û ≡ (u, e) s.t. ∃(v, u) ∈ E do
18:   if backward and d(û) + 1 = d(v̂) and û unvisited then
19:     rsl ← ADVANCE_E(u, v_pred)
20: // unwinding internal flow
21: if f(v) and d(v, r) + 1 = d(v̂) and (v, r) unvisited then
22:   if ADVANCE_R(v, pred(v)) then
23:     f(v) ← 0; rsl ← true; pred(v) ← ∅
24: mark v̂ unvisited
25: if ¬rsl then
26:   RETREAT(v̂)
27: return rsl
Algorithm 6: Binary Max-flow: RETREAT(v̂)
1:  Input: a vertex pole v̂ ≡ (v, y)
2:  Output: none
3:  let m = ∞
4:  if y = r then
5:    // unwinding flow to another node
6:    m ← min(m, d(pred(v), e)) if f(v)
7:    // adding internal flow
8:    m ← min(m, d(v, e)) if ¬f(v)
9:    // reverse unconstrained edges (forward)
10:   if forward then
11:     m ← min(m, d(u, r)) ∀u s.t. ∃(u, v) ∈ E
12: if y = e then
13:   // unwinding internal flow
14:   m ← min(m, d(v, r)) if f(v)
15:   // structural edges
16:   if backward then
17:     m ← min(m, d(u, r)) ∀u s.t. ∃(u, v) ∈ E
18:   else
19:     m ← min(m, d(u, r)) ∀u s.t. ∃(v, u) ∈ E
20:   // reverse unconstrained edges (backward)
21:   if backward then
22:     m ← min(m, d(u, e)) ∀u s.t. ∃(v, u) ∈ E
23: let d′ = d(v̂)
24: decrement H(d′); d(v̂) ← m + 1; increment H(d(v̂))
25: if H(d′) = 0 then
26:   exit with maximum flow
2.4.3
Limitations
Retiming does not exist in a vacuum. It is but one of many optimizations available
to the designer for exchanging performance for power, area, and complexity. When the
entire space of available design transformations is considered, retiming is only one of many
degrees of freedom, and the best solution in the joint space does not generally correspond
to the power-minimal retiming of the initial solution. For example, it may be desirable to
retime a circuit to maximize performance at the expense of power consumption so as to
allow for smaller combinational gates (thereby consuming less dynamic power) along the
critical paths; this may be a better overall solution than retiming to minimize clock power
at the expense of even greater power consumption in the combinational gates. In addition,
higher speed can be traded for power by scaling back Vdd. In general, this problem is
complex, though a limited exploration is discussed in Chapter 5.
Instead, we focus on the design space reachable with retiming alone and how this
can be used to minimize power consumption. However, even if the choices of the other
design elements are decoupled from the power-minimization problem, these details still
interact with and are dependent upon the choice of retiming. For example, the positions of
the registers in the netlist can affect the global and local placement of standard cells, the
congestion of the routing problem, and the capacitance of the routed wires. Two design
alternatives could be propagated in parallel through the design process and evaluated at
a stage where the physical model is sufficiently detailed to make an accurate comparison.
However, there remains unpredictability in the process until very late in the flow (e.g.
routing), and it is not practical or cost-effective to evaluate alternatives in this manner.
The consequences on power consumption of retiming one or more registers are therefore
greatly dependent on the model used to evaluate power and the design details available.
Despite the inaccuracy, computational requirements typically dictate that the model used
to evaluate and select retiming choices lacks physical implementation detail. Discussions
of optimality are therefore confined to this view.
2.5
Experimental Results
2.5.1
Setup
The experimental setup consisted of a pool of 3.0 GHz AMD x64 machines made available
by [37]. All applications were written in C/C++, compiled using GNU g++ version 4.1,
and run under Linux using PBS.
Several different sources of circuit benchmarks were used to evaluate the behavior and
performance of flow-based retiming. These can be grouped into four sets: ISCAS/LGsynth,
OpenCores, QUIP, and Intel. The basic characteristics of each of these are described in
Appendix A.
The ISCAS/LGsynth benchmarks are well known to the logic synthesis community,
having originated at ISCAS in 1987. The original set was extended in 1989 and then again
in 1991 and 1993; these versions were obtained in BLIF format through the LGsynth suite.
A large fraction of the elements therein are purely combinational and not of interest to this
work; these have been excluded from consideration. The remaining sequential circuits are
described in Table A.1. On average, the circuits in this set are by far the smallest in size,
both in number of combinational and sequential elements.
The OpenCores benchmarks are examples from the OpenCores open-source hardware
designs [38]. These designs were synthesized and offered as benchmarks for the synthesis
community in 2005 [39]. A large number of the OpenCores designs implement hardware
controllers and are excellent examples to evaluate the efficacy of (minimum-register) retiming on practical sequential circuits. We use versions synthesized from the RTL originals
using Altera’s Quartus [40] tool as a front end. It should be noted that the Quartus flow
includes optimizations to both the combinational and sequential behavior. The statistics
presented in Table A.3 already include this pre-optimization and any improvements reported
elsewhere are in addition to it.
The QUIP [41] benchmarks were provided in behavioral Verilog as part of the Altera
University QUIP package and were also synthesized using Quartus as a front end. This set
contains the single largest example, “uoft raytracer”.
The final set of designs were provided by Intel and are derived from various verification
problems. All of the circuits in this suite were provided as single-output AIGer files. While
not synthesis examples, this set provides a means of evaluating the utility of flow-based
retiming in another domain. On average, this set contains the largest and most difficult
examples.
All of the above circuits were imported into the ABC logic synthesis and verification
system. The original structures of the LGsynth examples were preserved, but the other
three sets were further optimized. Hierarchy removal, dangling node removal, structural
hashing, greedy rewriting, and algebraic re-balancing were applied to minimize the number
of combinational nodes. No changes were made to the sequential elements or sequential
behavior of the circuits.
2.5.2
Runtime
One of the primary contributions of the flow-based unconstrained minimum-register
retiming approach is the reduction in the computational effort required to compute the
optimal minimum-register solution. In this section, we contrast the runtime of our approach
against the best-known available alternatives.
Though the previous approaches to minimum-register retiming all utilize a formulation
of the problem as an instance of minimum-cost network circulation, there are many methods
for solving this class of linear programs, including some that are specialized to and highly
effective on exactly this problem. To strengthen the comparison, we present results from
two different tools that use two different solution methods. Both are mature, off-the-shelf,
publicly available programs. We believe these are representative of the best practices
available.
The first comparison is made against the CS2 software package from Andrew Goldberg’s
Network Optimization Library [42]. The source of CS2 is available under a restricted
academic license and was compiled using the same tool flow described in Section 2.5.1.

Name       flow time   cs2 time   cs2 perc   mcf time   mcf perc
s641       0.00        0.00                  0.00
s13207     0.13        0.08       -38.5%     0.05       -61.5%
s9234      0.01        0.02                  0.01
s713       0.00        0.00                  0.00
s953       0.00        0.00                  0.00
s38584.1   0.07        0.46       557.1%     0.23       228.6%
s400       0.00        0.00                  0.00
s382       0.00        0.00                  0.00
s38417     0.86        0.46       -46.5%     0.31       -64.0%
s5378      0.00        0.02                  0.01
s444       0.01        0.00                  0.00
AVERAGE                           157.4%                34.4%
Table 2.3. Unconstrained min-reg runtime, LGsynth benchmarks.

The
algorithmic core of CS2 is the cost- and capacity-scaling preflow-push methods described
in [43]; this method possesses one of the best known worst-case bounds for solving the
minimum-cost network circulation problem.
The second comparison is made against the MCF package [44], a tool available in C++
from the Zuse Institute Berlin that is free of charge for academic use. MCF is an
implementation of a primal and dual network simplex algorithm. As this represents a means
of solving the minimum-cost flow problem using a different class of solution methods, it
provides a second comparison point against which to evaluate our method. The algorithmic
basis used in the MCF tool is described in [45] and [33].
Tables 2.3, 2.4, 2.5, and 2.6 describe the results of applying the above unconstrained
register minimization algorithms to all four of the benchmark suites described in Appendix
A. The minimized register counts were identical across the methods and are presented in
later tables; only the runtime values are given here. Only the subset of benchmarks that
had a non-zero decrease in register count is presented. The first column in each section
lists the total runtime of each approach. The second columns (for CS2 and MCF) give the
percentage increase in the runtimes over our approach. The runtimes are only compared
for examples where the runtime is greater than 0.05 seconds; the other values are not
included in the totals.
Name                 flow time   cs2 time   cs2 perc   mcf time   mcf perc
barrel16 opt         0.00        0.00                  0.00
barrel16a opt        0.00        0.00                  0.00
barrel32 opt         0.00        0.00                  0.00
nut 004 opt          0.00        0.00                  0.00
nut 002 opt          0.00        0.00                  0.00
mux32 16bit opt      0.01        0.00                  0.00
mux8 64bit opt       0.00        0.00                  0.00
nut 000 opt          0.01        0.01                  0.00
nut 003 opt          0.01        0.01                  0.01
mux64 16bit opt      0.02        0.01                  0.00
mux8 128bit opt      0.02        0.01                  0.01
barrel64 opt         0.01        0.01                  0.01
nut 001 opt          0.03        0.02                  0.01
radar12 opt          0.19        n/a        n/a        0.38       100.0%
radar20 opt          0.89        n/a        n/a        1.61       80.9%
uoft raytracer opt   3.29        4.01       21.9%      10.95      232.8%
AVERAGE                                     21.9%                 232.8%
Table 2.4. Unconstrained min-reg runtime, QUIP benchmarks.
Name                  flow time   cs2 time   cs2 ratio   mcf time   mcf ratio
oc ata vhd 3 opt      0.01        0.02                   0.01
oc cfft 1024x12 opt   0.16        0.05                   0.03
oc dct slow opt       0.01        0.00                   0.00
oc ata ocidec3 opt    0.01        0.02                   0.01
oc pci opt            0.04        0.07        1.75       0.06       1.50
oc aquarius opt       0.06        0.25        4.17       0.21       3.50
oc miniuart opt       0.01        0.00                   0.00
oc oc8051 opt         0.02        0.05        2.50       0.10       5.00
oc ata ocidec2 opt    0.00        0.01                   0.01
oc aes core inv opt   0.01        0.03                   0.05
oc aes core opt       0.01        0.02                   0.02
oc vid comp sys h     0.00        0.01                   0.01
os blowfish opt       0.01        0.03        3.00       0.04       4.00
oc wb dma opt         0.05        0.13        2.60       0.10       2.00
oc smpl fm rcvr opt   0.00        0.01                   0.01
oc vid comp sys h     0.00        0.01                   0.00
oc vid comp sys d     0.22        0.27        1.23       0.66       3.00
oc vga lcd opt        0.02        0.06        3.00       0.06       3.00
oc fpu opt            0.08        1.01        12.63      0.31       3.88
oc mem ctrl opt       0.08        0.12        1.50       0.10       1.25
oc des perf opt opt   0.24        0.05        0.21       0.04       0.17
oc ethernet opt       0.03        0.08        2.67       0.09       3.00
oc minirisc opt       0.01        0.01                   0.01
oc hdlc opt           0.01        0.01                   0.01
AVERAGE               1.00                    3.20                  2.75
Table 2.5. Unconstrained min-reg runtime, OpenCores benchmarks.
Name        flow time   cs2 time   cs2 ratio   mcf time   mcf ratio
intel 005   0.00        0.02                   0.01
intel 001   0.01        0.00                   0.00
intel 002   0.00        0.00                   0.00
intel 003   0.01        0.01                   0.00
intel 004   0.00        0.00                   0.00
intel 028   0.97        17.55      18.09       55.90      57.63
intel 029   0.05        0.12       2.40        0.13       2.60
intel 030   0.90        8.26       9.18        34.90      38.78
intel 031   0.04        0.18       4.50        0.12       3.00
intel 013   2.12        11.46      5.41        141.92     66.94
intel 025   0.09        0.32       3.56        0.25       2.78
intel 026   0.03        0.10       3.33        0.07       2.33
intel 027   0.61        3.51       5.75        22.85      37.46
intel 036   0.90        4.82       5.36        31.68      35.20
intel 037   0.78        9.77       12.53       33.03      42.35
intel 038   1.23        22.75      18.50       65.97      53.63
intel 039   1.64        8.90       5.43        56.85      34.66
intel 032   0.09        0.39       4.33        0.36       4.00
intel 033   0.68        6.42       9.44        19.79      29.10
intel 034   0.48        1.08       2.25        2.09       4.35
intel 035   0.94        6.57       6.99        15.60      16.60
intel 014   0.48        7.36       15.33       12.97      27.02
intel 015   0.03        0.14       4.67        0.13       4.33
intel 016   0.24        1.02       4.25        1.24       5.17
intel 009   0.86        6.44       7.49        36.59      42.55
intel 010   0.05        0.16       3.20        0.13       2.60
intel 011   0.04        0.21       5.25        0.12       3.00
intel 012   0.74        9.08       12.27       35.04      47.35
intel 021   0.03        0.14       4.67        0.06       2.00
intel 022   0.04        0.20       5.00        0.12       3.00
intel 023   0.03        0.16       5.33        0.07       2.33
intel 024   0.02        0.10       5.00        0.06       3.00
intel 017   0.03        0.08       2.67        0.08       2.67
intel 018   0.03        0.08       2.67        0.08       2.67
intel 019   0.03        0.10       3.33        0.09       3.00
intel 020   0.02        0.10       5.00        0.06       3.00
intel 042   1.28        9.36       7.31        69.34      54.17
intel 041   1.28        6.00       4.69        62.66      48.95
intel 040   1.63        7.64       4.69        69.38      42.56
intel 006   0.01        0.02                   0.03
intel 007   0.08        0.18       2.25        0.41       5.13
intel 043   1.06        9.58       9.04        46.34      43.72
AVERAGE     1.00                   6.42                   21.66
Table 2.6. Unconstrained min-reg runtime, Intel benchmarks.
The runtimes are presented graphically in Figure 2.12 (the large benchmarks) and Figure
2.13 (the medium-sized benchmarks).
Figure 2.12. The runtime of flow-based retiming vs. CS2 and MCF for the largest designs.
2.5.3
Characteristics
For the average circuit, the unconstrained minimum-register solution had 16.9% fewer latches than the original circuit, though the optimization potential varied greatly with the function and structure of each benchmark circuit. For the synthesis examples (LGsynth, OpenCores, and QUIP), the average reduction was 5.6%; for the verification examples (Intel), the average reduction was 44.3%. If we restrict our attention to only the circuits that saw any improvement in latch count, the result was an average 26.1% reduction. All of the verification circuits had some reduction; only 51.0% of the synthesis examples did. The average non-zero reduction in the synthesis examples was 11.1%. The largest reduction in a verification example was 64.0% and occurred in “intel 002”. The largest reduction in a synthesis example was 62.5% and occurred in the OpenCores circuit “oc fpu opt”.
Figure 2.13. The runtime of flow-based retiming vs. CS2 and MCF for the medium designs.
Tables 2.7, 2.8, 2.9, and 2.10 describe the characteristics of the retimed result and the flow-based algorithm. The final number of registers is in column Fin Regs and the percentage decrease versus the original count (available in Appendix A) in column % ∆Regs. The columns #F Its and #B Its are the number of iterations in each direction that reduced the number of registers. This excludes the final dummy iteration in each direction to detect that the termination condition had been reached. The circuits without any decrease in register count are omitted: for these examples, the number of levels remains unchanged, and both the forward and backward iteration counts were zero.
These tables also present the number of levels in the longest path after retiming, Fin Levs, and the percentage increase from the length of the longest path before retiming, % ∆Levs. Because the minimum-register retiming was performed without a constraint on the path length, it most often resulted in an increase in the longest path. In some cases, however, the maximum path length was actually decreased: this was the case for most of the Intel verification examples as well as two of the LGsynth designs.
Figure 2.14 charts the number of forward and backward iterations that were required
Name | Fin Regs | % ∆Regs | Fin Levs | % ∆Levs | #F Its | #B Its
s382 | 18 | -14.30% | 17 | 0.00% | 0 | 1
s400 | 18 | -14.30% | 17 | 0.00% | 0 | 1
s444 | 18 | -14.30% | 19 | -5.00% | 0 | 1
s641 | 17 | -10.50% | 78 | 0.00% | 0 | 1
s713 | 17 | -10.50% | 86 | 0.00% | 0 | 1
s953 | 22 | -24.10% | 28 | 3.70% | 0 | 1
s5378 | 140 | -14.10% | 38 | 15.20% | 1 | 1
s9234 | 127 | -5.90% | 58 | 5.50% | 1 | 0
s13207 | 466 | -30.30% | 54 | -8.50% | 3 | 5
s38584.1 | 1425 | -0.10% | 70 | 0.00% | 1 | 0
s38417 | 1285 | -12.30% | 75 | 15.40% | 2 | 2
AVERAGE | – | -13.70% | – | 2.39% | – | –
Table 2.7. Unconstrained min-reg characteristics, LGsynth benchmarks w/ improv.
Name | Fin Regs | % ∆Regs | Fin Levs | % ∆Levs | #F Its | #B Its
barrel16 opt | 32 | -13.50% | 9 | 12.50% | 1 | 0
barrel16a opt | 32 | -13.50% | 11 | 10.00% | 1 | 0
barrel32 opt | 64 | -8.60% | 11 | 10.00% | 1 | 0
nut 004 opt | 167 | -9.70% | 17 | 41.70% | 2 | 2
nut 002 opt | 195 | -8.00% | 19 | 0.00% | 1 | 2
mux32 16bit opt | 493 | -7.50% | 8 | 33.30% | 1 | 1
mux8 64bit opt | 513 | -11.40% | 6 | 50.00% | 1 | 1
nut 000 opt | 315 | -3.40% | 55 | 0.00% | 1 | 2
nut 003 opt | 234 | -11.70% | 40 | 11.10% | 1 | 1
mux64 16bit opt | 975 | -6.80% | 8 | 33.30% | 1 | 1
mux8 128bit opt | 1025 | -11.30% | 6 | 50.00% | 1 | 1
barrel64 opt | 128 | -5.20% | 12 | 9.10% | 1 | 0
nut 001 opt | 437 | -9.70% | 78 | 41.80% | 2 | 2
radar12 opt | 3767 | -2.80% | 44 | 0.00% | 1 | 3
radar20 opt | 5357 | -10.70% | 44 | 0.00% | 2 | 1
uoft raytracer opt | 11609 | -11.20% | 243 | 161.30% | 3 | 2
AVERAGE | – | -9.06% | – | 29.01% | – | –
Table 2.8. Unconstrained min-reg characteristics, QUIP benchmarks.
to reach the optimal retiming against the total design size of each of the benchmarks in Appendix A. Each design is colored according to the benchmark source. While there doesn’t appear to be much of a relationship between individual design size and iteration count, there is some evident correlation with the benchmark sources. The structure of the design– in which there is some commonality within each benchmark source– is likely the primary determinant. For the Intel set, many of the examples were the same underlying circuit with a different safety property at the output.
Name | Fin Regs | % ∆Regs | Fin Levs | % ∆Levs | #F Its | #B Its
intel 001 | 17 | -52.80% | 18 | -48.60% | 1 | 1
intel 004 | 38 | -56.30% | 55 | -36.00% | 1 | 1
intel 002 | 27 | -64.00% | 43 | -40.30% | 1 | 1
intel 003 | 45 | -48.30% | 72 | -31.40% | 1 | 1
intel 005 | 67 | -60.60% | 114 | -32.90% | 1 | 1
intel 006 | 172 | -50.90% | 181 | -48.30% | 1 | 1
intel 024 | 237 | -33.60% | 612 | -0.30% | 1 | 1
intel 023 | 238 | -33.50% | 593 | -3.30% | 1 | 1
intel 020 | 232 | -34.50% | 622 | -0.30% | 1 | 1
intel 017 | 337 | -45.50% | 438 | -0.50% | 1 | 1
intel 021 | 242 | -33.70% | 634 | -0.30% | 1 | 1
intel 026 | 293 | -40.40% | 660 | -0.30% | 1 | 1
intel 018 | 318 | -35.20% | 738 | -0.30% | 1 | 1
intel 019 | 336 | -34.10% | 763 | -0.30% | 1 | 1
intel 015 | 379 | -31.50% | 918 | -1.80% | 1 | 1
intel 022 | 357 | -32.60% | 952 | -0.20% | 1 | 1
intel 029 | 388 | -31.20% | 1007 | -0.20% | 1 | 1
intel 031 | 358 | -32.60% | 954 | -0.20% | 1 | 1
intel 011 | 360 | -32.50% | 1001 | -0.20% | 1 | 1
intel 010 | 366 | -32.10% | 992 | -0.20% | 1 | 1
intel 007 | 593 | -54.60% | 596 | -55.20% | 1 | 1
intel 025 | 605 | -46.00% | 1098 | -0.20% | 1 | 1
intel 032 | 635 | -33.90% | 1785 | -0.10% | 1 | 1
intel 016 | 1306 | -43.10% | 2469 | -0.10% | 1 | 1
intel 034 | 1257 | -61.90% | 1297 | -1.00% | 1 | 1
intel 014 | 2317 | -46.20% | 3068 | -4.00% | 1 | 1
intel 035 | 2357 | -46.50% | 5946 | 0.00% | 1 | 1
intel 033 | 2370 | -46.30% | 6105 | 0.00% | 1 | 1
intel 027 | 2773 | -46.10% | 3434 | -20.80% | 1 | 1
intel 012 | 3096 | -47.40% | 3813 | -20.20% | 1 | 1
intel 037 | 3139 | -47.00% | 3949 | -17.80% | 1 | 1
intel 030 | 2879 | -46.70% | 7136 | 0.00% | 1 | 1
intel 009 | 2881 | -46.60% | 7134 | 0.00% | 1 | 1
intel 036 | 3140 | -45.90% | 7279 | 0.00% | 1 | 1
intel 043 | 3755 | -48.00% | 4429 | -26.30% | 1 | 1
intel 028 | 3885 | -47.80% | 4585 | -25.90% | 1 | 1
intel 042 | 4660 | -48.30% | 5362 | -22.00% | 1 | 1
intel 038 | 4664 | -48.20% | 5356 | -22.10% | 1 | 1
intel 040 | 4820 | -49.30% | 5052 | -28.70% | 1 | 1
intel 041 | 4786 | -48.40% | 5507 | -21.40% | 1 | 1
intel 039 | 4850 | -49.00% | 5572 | -21.40% | 1 | 1
intel 013 | 7076 | -47.00% | 8194 | -23.60% | 1 | 1
AVERAGE | – | -44.29% | – | -13.25% | – | –
Table 2.9. Unconstrained min-reg characteristics, Intel benchmarks.
Name | Fin Regs | % ∆Regs | Fin Levs | % ∆Levs | #F Its | #B Its
oc miniuart opt | 88 | -2.20% | 12 | 20.00% | 1 | 1
oc dct slow opt | 165 | -7.30% | 21 | 23.50% | 0 | 2
oc ata ocidec2 opt | 283 | -6.60% | 13 | 0.00% | 1 | 0
oc minirisc opt | 264 | -8.70% | 28 | 21.70% | 2 | 1
oc vid comp sys h | 47 | -20.30% | 22 | 69.20% | 1 | 0
oc vid comp sys h | 60 | -1.60% | 20 | 0.00% | 0 | 1
oc hdlc opt | 374 | -12.20% | 16 | 33.30% | 1 | 3
oc smpl fm rcvr opt | 222 | -1.80% | 37 | 8.80% | 1 | 0
oc ata ocidec3 opt | 555 | -6.60% | 13 | -13.30% | 1 | 1
oc ata vhd 3 opt | 560 | -5.70% | 15 | 0.00% | 1 | 1
oc aes core opt | 394 | -2.00% | 13 | 0.00% | 0 | 1
oc aes core inv opt | 658 | -1.60% | 13 | 0.00% | 1 | 1
oc cfft 1024x12 opt | 704 | -33.00% | 157 | 647.60% | 13 | 1
oc vga lcd opt | 1078 | -2.70% | 34 | 0.00% | 2 | 1
os blowfish opt | 827 | -7.20% | 36 | -2.70% | 1 | 0
oc pci opt | 1308 | -3.40% | 46 | 0.00% | 2 | 1
oc ethernet opt | 1259 | -1.00% | 33 | 0.00% | 1 | 2
oc des perf opt opt | 1088 | -44.90% | 81 | 1520.00% | 16 | 0
oc oc8051 opt | 739 | -2.00% | 60 | 15.40% | 1 | 1
oc mem ctrl opt | 1812 | -0.70% | 32 | 0.00% | 1 | 1
oc wb dma opt | 1749 | -1.50% | 33 | 83.30% | 1 | 1
oc aquarius opt | 1474 | -0.20% | 99 | 0.00% | 1 | 0
oc fpu opt | 247 | -62.50% | 1080 | 4.90% | 2 | 0
oc vid comp sys d | 2305 | -35.10% | 38 | 22.60% | 1 | 1
AVERAGE | – | -11.28% | – | 102.26% | – | –
Table 2.10. Unconstrained min-reg characteristics, OpenCores benchmarks.
small: the average was 2.7 (with an average 1.5 forward and 1.2 backward). The maximum
of any design was 16. Furthermore, not only is the number of iterations small, but the
majority of the reduction comes in the earliest iterations. Figure 2.15 illustrates the fraction
of the total register reduction that was contributed by each iteration for each benchmark
source. F0 is the contribution of the first forward iteration, F1 the second forward iteration,
and F2+ all other forward iterations; the backward iterations are labelled similarly. Almost
all of the reduction in the number of registers occurs within the first iteration in either
direction. The number of iterations can be bounded as necessary to control the runtime
without sacrificing much of the improvement.
Figure 2.14. The distribution of design size vs. total number of iterations in the forward
and backward directions.
2.5.4
Large Artificial Benchmarks
Because the runtimes of the benchmarks in Appendix A are relatively fast, a set of
larger artificial circuits was created by combining the OpenCores benchmarks in Table A.3.
As the number of retiming iterations required appears to be independent of the circuit
size– probably because the maximum latency around any loop or from input to output
is also size independent-the circuits ”large1” and ”large2” were constructed via parallel
composition to preserve this property. The 2 and 4 million gate circuits, ”larger5” and
”larger6”, were generated similarly. In contrast, the two circuits ”deep3” and ”deep4” were
built by splitting and serially composing the components. The runtime of our flow-based
Figure 2.15. The percentage of register savings contributed by each direction / iteration.
Name | Nodes | Init Regs | Fin Regs | CS2 Time | Flow #F Its | Flow #B Its | Flow Time | Incr
large1 | 1 006 k | 72.9 k | 66.9 k | 147.9s | 3 | 3 | 33.0s | 4.48x
large2 | 1 005 k | 82.7 k | 76.9 k | 131.3s | 3 | 3 | 24.5s | 5.36x
deep3 | 1 010 k | 74.7 k | 67.6 k | 182.0s | 3 | 21 | 34.2s | 5.32x
deep4 | 1 074 k | 86.4 k | 82.0 k | 130.3s | 3 | 3 | 17.9s | 7.27x
larger5 | 2 003 k | 151.1 k | 139.5 k | 410.6s | 3 | 3 | 67.2s | 6.11x
largest6 | 4 008 k | 300.1 k | 279.0 k | 818.3s | 3 | 3 | 139.9s | 5.85x
Table 2.11. Unconstrained min-reg runtime, large artificial benchmarks.
algorithm is compared to that of the minimum-cost network-flow-based solution with CS2 and presented in Table 2.11.
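The Incr column appears to be the ratio of the CS2 runtime to the flow-based runtime. A quick transcription check (our own sanity-check snippet, not part of the dissertation):

```python
# Values transcribed from Table 2.11: (name, CS2 seconds, flow seconds, Incr).
rows = [
    ("large1",   147.9,  33.0, 4.48),
    ("large2",   131.3,  24.5, 5.36),
    ("deep3",    182.0,  34.2, 5.32),
    ("deep4",    130.3,  17.9, 7.27),
    ("larger5",  410.6,  67.2, 6.11),
    ("largest6", 818.3, 139.9, 5.85),
]
# Each Incr entry matches cs2 / flow to within rounding.
for name, cs2_s, flow_s, incr in rows:
    assert abs(cs2_s / flow_s - incr) < 0.02, name
```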
2.6
Summary
The contribution of this chapter is a new algorithm for computing a minimum-register
retiming without any constraints. This has useful applications in physical design, verification, and synthesis. The improvements over previous techniques are:
Faster Runtime. The worst-case bound is non-comparable (i.e. neither better nor worse) with existing minimum-cost network circulation-based approaches, but the empirical runtime comparison on both synthesis and verification examples is favorable. We measure an average improvement of 4.6x and 14.9x over two state-of-the-art solution methods. The absolute runtime is also quite fast: less than 0.10s of CPU time for 76% of the benchmarks and a maximum of 3.29s.
Scalable Effort. Each iteration of the single-frame register minimization problem
produces a result that is strictly better than the previous. The algorithm can therefore be
terminated after an arbitrary number of iterations with partial improvement. This feature
is important for guaranteeing scalability in runtime-limited applications. Furthermore, our
experience indicates that the vast majority of the improvement comes from the first iteration
in each direction. This incrementality is not an obvious feature of the alternative minimum-cost network circulation-based approaches.
Minimal Perturbation. In general, there are multiple optimal minimum-register
retiming solutions. Ours returns the one with the minimum register movement. This feature
is important to minimize the perturbation of the netlist and avoid unnecessary synthesis
instability.
Extensible problem representation. The maximum-flow-based formulation provides a framework into which necessary problem constraints can be easily incorporated. We will see two examples of this in Chapters 3 (timing constraints) and 4 (initializability constraints).
Chapter 3
Timing-Constrained Min-Register
Retiming
In this chapter we extend the flow-based algorithm of Chapter 2 to include constraints on both the worst-case minimum and maximum propagation delays in the problem of minimizing the number of registers in a circuit under retiming. For synthesis applications, these constraints are critical to ensure the timing correctness of the resulting circuit.
Again, we assume that the retiming transformation is understood. The reader may review Section 1.2.1 for more background on retiming. The content of this chapter also depends on the maximum-flow-based formulation of minimum-register retiming from Chapter 2; an understanding of that material is prerequisite.
The chapter begins in Section 3.1 by defining the problem of register minimization under delay constraints. The existing approaches to solving this problem are described in Section 3.2. Our new maximum-flow-based approach is described in Section 3.3, including a few examples. Analyses of the correctness, complexity, and limitations are presented in Section 3.4. Some experimental results are contained in Section 3.6.
3.1
Problem
The algorithm described in Chapter 2 finds the minimum number of registers– but
without regard to the effect on the lengths of the longest and shortest delay paths. In most
synthesis applications (as opposed to verification), it is necessary to introduce constraints
on the minimum and maximum combinational path delays. This problem is known as
timing-constrained minimum-register retiming. Its computational difficulty exceeds that of
both the minimum-register and minimum-delay problems.
Let d(u, v) be the minimum combinational path delay along u ⇝ v and D(u, v) be the maximum combinational path delay along u ⇝ v. We do not make any assumptions about the timing model. Let Dk(u, v) be the maximum combinational delay that passes through exactly k registers on a path u ⇝ v. Therefore D(u, v) ≡ D0(u, v). Let the number of registers– the sequential latency– along a path u ⇝ v be δ(u, v).
The timing constraints arise from the setup and hold constraints at the register inputs.
The inputs must be stable for a defined period both before and after each clock edge.
Under a given clock period, the setup and hold constraints lead to limits on the worst-case
propagation delays along the longest and shortest combinational paths that terminate at
each register.
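These path-delay quantities can be computed by dynamic programming over a topological order. The sketch below is our own illustration, not the dissertation's code: the graph encoding, function names, and per-node gate delays are all assumptions. It computes Dk(src, v), the maximum delay of a path src ⇝ v crossing exactly k registers, for every reachable node v:

```python
from collections import defaultdict

def dk_from_source(succ, gate_delay, src, kmax):
    """Longest-path DP on an acyclic graph: best[v][k] = D_k(src, v), the
    maximum delay of a path src ~> v crossing exactly k registers (the
    delays of both endpoints are included in this accounting).
    succ[u] lists (v, registers_on_edge) pairs."""
    # Topological order of the nodes reachable from src via DFS.
    order, seen = [], set()

    def dfs(u):
        seen.add(u)
        for v, _ in succ.get(u, []):
            if v not in seen:
                dfs(v)
        order.append(u)

    dfs(src)
    order.reverse()

    NEG = float("-inf")
    best = defaultdict(lambda: [NEG] * (kmax + 1))
    best[src][0] = gate_delay.get(src, 0)
    for u in order:
        for v, regs in succ.get(u, []):
            for k in range(kmax + 1 - regs):
                if best[u][k] > NEG:
                    cand = best[u][k] + gate_delay.get(v, 0)
                    if cand > best[v][k + regs]:
                        best[v][k + regs] = cand
    return best
```

Under this model, D(src, v) is the k = 0 entry and D1(src, v) the k = 1 entry; the minimum-delay quantity d(u, v) follows from the same recurrence with min in place of max.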
3.2
Previous Work
3.2.1
LP Formulation
The existing timing-constrained min-register algorithms utilize an extension of the linear
program described in Section 2.2.1. Recall that G = <V, E, wi> is a retiming graph, r(v) is a retiming lag function, and wi(u, v) is the initial number of registers present on each edge u → v.
To model the timing constraints, we define two matrices Wi(u, v) : V × V → Z and D(u, v) : V × V → ℜ. Wi(u, v) is the minimum total retiming weight along any path u ⇝ v, and D(u, v) is the worst-case total combinational delay along this path. Because wi(u, v) ≥ 0 along each edge and path delays are non-negative around any cycle, only acyclic paths u ⇝ v need to be considered in the generation of Wi and D. The constraint on the resulting maximum path length can then be expressed as in Equation 3.2 in terms of the desired clock period T. (Equation 3.1 is the previously introduced constraint on non-negative register weight.)

r(u) − r(v) ≤ wi(u, v)    ∀ u → v    (3.1)
r(u) − r(v) ≤ Wi(u, v) + 1 − ⌈D(u, v)/T⌉    ∀ u ⇝ v    (3.2)
The fundamental bottleneck of this approach lies in the enumeration and incorporation of all pair-wise delay constraints u ⇝ v. In the original algorithm, all connected pairs were examined, resulting in an O(V³) procedure. The bound can be improved slightly: a refined algorithm can accomplish this in O(V E) time.
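Given precomputed Wi and D matrices, a candidate lag function r can be checked against Equations 3.1 and 3.2 directly. The following is a sketch under assumed dictionary encodings of wi, Wi, and D (these data structures are hypothetical, not the dissertation's):

```python
import math

def retiming_feasible(r, wi, Wi, Dmat, T):
    """Check a candidate retiming lag function r against Eq. 3.1 and 3.2."""
    # Eq. 3.1: no edge may end up with a negative register count.
    for (u, v), w in wi.items():
        if r[u] - r[v] > w:
            return False
    # Eq. 3.2: a path with worst-case delay D(u, v) must retain at least
    # ceil(D(u, v) / T) - 1 registers after retiming, i.e.
    # r(u) - r(v) <= Wi(u, v) + 1 - ceil(D(u, v) / T).
    for (u, v), W in Wi.items():
        if r[u] - r[v] > W + 1 - math.ceil(Dmat[(u, v)] / T):
            return False
    return True
```

For example, an edge u → v with one register, minimum path weight 1, and path delay 3 tolerates the lag r = (u: 1, v: 0) only when the period T is large enough that the whole path fits in one cycle.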
3.2.2
Minaret
It is not actually necessary to include a delay constraint for every possible pair of connected vertices. Constraints that are redundant or will never become critical can be ignored. Path containment presents one opportunity to prune constraints. A technique to ignore redundant edges on-the-fly is described by [?].
The Minaret algorithm [46] provided a leap forward by using a retiming-skew equivalence to restrict the constraint set to those constraints that are potentially timing-critical. This is accomplished by computing the minimum and maximum clock skews for each register and using these values to bound the permissible retiming locations. Pair-wise delay constraints need not be calculated along paths that always contain a register.
One such unnecessary constraint is depicted in Figure 3.1. The earliest and latest skew values for register R are τASAP(R) and τALAP(R), respectively, and these values can be used to derive a corresponding ASAP and ALAP retiming position of R. Because R will always lie between nodes u and v, it is not necessary to explore or incorporate a timing constraint between these nodes.
Figure 3.1. Bounding timing paths using ASAP and ALAP positions.
3.3
Algorithm
Our algorithm provides a leap forward over Minaret by enumerating only the non-conservative constraints in the subset of the circuit that is both timing- and area-critical. Only constraints that lie in the path to area improvement need to be generated. The others are not ignored but are instead replaced with a fast but safe approximation. The number of constraints that must be generated and introduced is much smaller. Only the intersection between the area- and timing-critical regions must be treated in detail.
An additional advantage– critical for industrial scalability– is that the optimum solution is approached via a set of intermediate solutions, which are monotonically improving and always timing-feasible. Thus, the algorithm can be terminated at any point with an improved, timing-feasible solution. While an early termination of the unconstrained problem would only become necessary for extremely large designs or very tight runtime constraints, the increased difficulty of timing-constrained minimum-register retiming makes runtime a potential concern.
Finally, short-path timing constraints are also handled.
The timing constraints are defined as follows. Consider the presence of a register R on any allowable net n in the design. Let A^max_n be the maximum allowable arrival time (such that the setup constraint of R is met) and A^min_n be the minimum allowable arrival time (such that the hold constraint of R is met). In a simple application, the maximum arrival time constraints at every net would be uniformly equal to the clock period and the minimum arrival constraints zero. In a more precise application, these values would include local variations such as the estimated local unwanted clock skew (δclk), the timing parameters of the register cell appropriate to drive the capacitive load (e.g. setup Sn and hold Hn), and the maximum period of the local clock domain (Tclk). Equations 3.3 and 3.4 suggest definitions in terms of these parameters.

A^max_n = Tclk − Sn − δclk    (3.3)
A^min_n = Hn − δclk    (3.4)
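These definitions translate directly into code. A minimal sketch follows; the parameter names mirror the equations, and the numeric values in the usage note are purely illustrative:

```python
def arrival_bounds(Tclk, Sn, Hn, dclk):
    """Per-net arrival-time window from Equations 3.3 and 3.4."""
    a_max = Tclk - Sn - dclk  # Eq. 3.3: latest arrival meeting setup
    a_min = Hn - dclk         # Eq. 3.4: earliest arrival meeting hold
    return a_min, a_max
```

For instance, a 10 ns period with 0.5 ns setup, 0.3 ns hold, and 0.2 ns skew yields the window (0.1 ns, 9.3 ns).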
If there is physical information available, the location of net n can be used to estimate
the location of a register that is retimed to its output. This can be invaluable for refining
the estimates of the above parameters due to any dependence with spatial location. One
example of this is the local clock skew δclk . If the register’s physical location is changed,
it will require either: (i) additional clock routing latency or (ii) reconnection to a different
branch of the clock distribution network with a different nominal value and variation of its
latency. Other spatially-dependent effects might include manufacturing variation, supply
voltage variation, and budgets for wire-delays due to known local placement or routing
congestion.
An important prerequisite is that the initial positions of the registers meet these maximum and minimum arrival constraints. If it is desired that a higher frequency be achieved
through retiming, the design would need to be retimed first by one of the many delay-minimizing retiming algorithms [8] [47] [48] [49], among which efficient exact and heuristic
solutions are available.
3.3.1
Single Frame
Consider retiming one or more registers in either direction within one combinational
frame of the circuit (as described in Section 2.4.1) to the output net of some node v in the
circuit. Let Rv be the potential new register. There are four timing constraints that are
affected by this move: the latest and earliest arrival times on the timing paths that start
and end at the retimed register. At the start or end of the path, two constraints are made
potentially critical; the other two can be ignored. Observe that the degree of criticality of
these constraints is strictly increasing with the distance that the registers move.
We introduce two versions of each of these constraints: conservative and exact. In the
conservative version, it is assumed that the end of the timing path opposite the moving
register remains fixed. This is an over-constraint: the other register may have moved also
in the same direction, thereby relaxing the timing criticality. In the exact constraints, we
also consider the position of the register at the other end of the timing path and specify the
location to which it must be moved for the path to meet the constraint.
Conservative Constraints The set of conservative constraints is Ccons ⊆ V. The set Ccons defines the vertices past which a register cannot be retimed without violating a delay constraint. A node v is marked as being conservatively constrained if there exists some register Ru such that D1(Ru ⇝ v) > A^max_v or d0(v ⇝ Ru) < A^min_Ru.
The entire set of conservative constraints can be computed in O(E) time with a static
timing analysis (STA) of the original circuit. The short-path constraints can be identified
in one pass. The long-path constraints require two passes (to capture the components of
the path on either side of the moved register); register output arrivals are seeded with their
input arrivals from the first pass and then those values are propagated forward.
This process is illustrated in Figure 3.2. Each gate in this example is assumed to have unit delay and is labelled with its arrival time. In Figure 3.2(i)– the first pass– all register outputs have an initial arrival time of zero, and STA is used to propagate the arrival times forward to the outputs of each combinational gate. In Figure 3.2(ii)– the second pass– the register outputs are seeded with the arrival times at their inputs in the previous pass, and another pass of STA is applied. The resulting label at each node v is exactly max_{Ru∈R} D1(Ru, v). If the maximum delay constraint is 5, the nodes labelled with arrival times higher than 5 will be conservatively constrained.
Figure 3.2. The computation of conservative long path timing constraints.
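The two-pass computation can be sketched in a few lines of Python. The netlist encoding and unit gate delays are our own assumptions for illustration: `gates` is a topologically ordered list of (name, fanins, is_register) tuples.

```python
def two_pass_sta(gates):
    """Pass 1: register outputs start at arrival 0, and each combinational
    gate adds unit delay. Pass 2: each register output is re-seeded with
    the pass-1 arrival at its input, so arr2[v] equals the maximum over
    registers Ru of D1(Ru, v) under this unit-delay model."""
    arr = {}
    for name, fanins, is_reg in gates:
        if is_reg:
            arr[name] = 0
        else:
            arr[name] = max((arr[f] for f in fanins), default=0) + 1
    arr2 = {}
    for name, fanins, is_reg in gates:
        if is_reg:
            arr2[name] = max((arr[f] for f in fanins), default=0)
        else:
            arr2[name] = max((arr2[f] for f in fanins), default=0) + 1
    return arr2

def conservative_nodes(gates, a_max):
    """Nodes whose pass-2 arrival exceeds the maximum arrival bound."""
    arr2 = two_pass_sta(gates)
    return {n for n, _, is_reg in gates if not is_reg and arr2[n] > a_max}
```

Both passes touch each gate and edge once, consistent with the O(E) bound quoted above.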
A conservative timing constraint can be enforced by simply redirecting the fan-ins of
the constrained node to the flow sink (or, equivalently, increasing the capacity of its flow
constraint to infinity). Afterward, these timing-constrained nodes will not participate in
the resulting minimum cut. This operation is depicted in Figure 3.3: the conservative
constraints on nodes v1 and v4 in sub-figure (i) are implemented by modifying the flow
graph to that of sub-figure (ii). The redirected edges are highlighted.
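At the level of the flow-graph data structure, the redirection can be sketched as follows. The encoding is hypothetical (a capacity map over directed edge pairs, a fan-in list, and a named sink), not the dissertation's implementation:

```python
def enforce_conservative(capacity, fanins, v, snk):
    """Enforce a conservative constraint on node v by redirecting each of
    v's fan-in edges to the flow sink. Afterward no finite cut can pass
    through v, so v drops out of the resulting minimum cut."""
    for u in fanins[v]:
        cap = capacity.pop((u, v))
        capacity[(u, snk)] = capacity.get((u, snk), 0) + cap
    fanins[v] = []
```

The text notes the equivalent alternative: leave the edges in place and raise v's internal flow-constraint capacity to infinity.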
Figure 3.3. The implementation of conservative timing constraints.
Exact Constraints The set of exact constraints Cexact ⊆ V × V does not presume that the other end of a timing path has remained stationary. Each exact constraint encodes the position u to which the register on the other end of a timing path would have to move for the new register Rv on the output of node v to be timing feasible. The set of exact constraints Cexact defines the vertex pairs (u, v) that describe these dependencies. These can be computed easily: the exact constraints of vertex v are the nodes U described by Equations 3.5 and 3.6, and they are exactly the base of the transitive fan-in/out cones whose combinational “depth” is A and which cross exactly one register. The total time to enumerate all possible exact constraints for all nodes is O(V E).

Umin = {u : d1(v → u) > A^min_v ∧ ∀u′ ∈ fanin(u), d1(v → u′) ≤ A^min_v}    (3.5)
Umax = {u : D1(v → u) > A^max_v ∧ ∀u′ ∈ fanout(u), D1(v → u′) ≤ A^max_v}    (3.6)
An example of the computation of a set of exact constraints is depicted in Figure 3.4 for node vc. The cone of D1(u → vc) ≤ 3 is colored orange. The resulting exact constraints are formed between the base (fan-ins) of this cone and vc: (v1, vc) and (v2, vc). It is helpful to note that several nodes at the end of a path of length 3 do not spawn exact constraints: vi, because the path vi ⇝ vc contains two registers, and vj and vk, because the paths to vc do not cross any registers.
Figure 3.4. The computation of exact long path timing constraints.
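The frontier condition of Equation 3.6 can be sketched as memoized recursion over the fan-in cone. The netlist model is our own assumption (succ[u] lists (fanout, registers-on-edge) pairs, every gate has unit delay, and the explored cone is acyclic); it is an illustration of the idea, not the dissertation's implementation:

```python
from functools import lru_cache

def exact_partners(succ, nodes, v, a_max):
    """Nodes u whose one-register path delay to v first exceeds a_max:
    u violates the bound while all of u's fanouts (which lie closer to v)
    satisfy it. d0/d1 are the register-free / exactly-one-register longest
    path delays to v, counting one unit per gate including both endpoints."""
    NEG = float("-inf")

    @lru_cache(maxsize=None)
    def d0(u):  # longest register-free path u ~> v
        if u == v:
            return 1
        best = max((d0(w) for w, regs in succ.get(u, []) if regs == 0),
                   default=NEG)
        return best + 1 if best > NEG else NEG

    @lru_cache(maxsize=None)
    def d1(u):  # longest path u ~> v crossing exactly one register
        if u == v:
            return NEG  # the empty path crosses no register
        cands = [d0(w) if regs == 1 else d1(w)
                 for w, regs in succ.get(u, []) if regs <= 1]
        best = max(cands, default=NEG)
        return best + 1 if best > NEG else NEG

    def violates(u):
        return d1(u) > a_max

    return {u for u in nodes
            if violates(u) and not any(violates(w)
                                       for w, _ in succ.get(u, []))}
```

In the chain a →(reg)→ b → c → v with unit delays, node a is the lone frontier node once the one-register delay 4 exceeds a bound of 3.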
Enforcement of the exact timing constraints is accomplished by introducing additional
unconstrained flow edges into the graph. An edge n → m is added for every exact constraint,
where n is the potential new register position and m is the point to which the register
boundary must also move. This is depicted in Figure 3.5, where m ≡ v4 and n ≡ v2 . By
Property 1, these unconstrained flow edges will restrict the resulting cut in exactly the cases that are timing infeasible; the cut will be the minimum one that meets the exact constraints.
Because the depth of the cone is at least the current period, each timing arc will terminate at a node that is never topologically deeper than its source. The arcs will therefore
always be in the direction from sink to source, and the flow from source to sink will remain
finite. This motivates the requirement that the circuit initially meets the timing constraints.
There may be cycles within the set of constraints; this occurs whenever a critical sequential cycle is present in the netlist. A correct result is such that moves of registers within a critical cycle are synchronized.
Figure 3.5. The implementation of exact long path timing constraints.
Timing Refinement The overall algorithm (Algorithm 7) consists of an iterative refinement of the conservatism of the timing constraints until the optimal solution has been reached for a combinational frame. Conservative constraints are removed and replaced with exact ones. As we will see, the refinement need only be performed for regions whose timing conservatism is preventing further area improvement, i.e. regions that are area-critical.
Let Ccons be the current set of conservative constraints and Cexact be the current set of exact constraints. The algorithm begins by initializing Cexact to be empty and Ccons to be complete (that is, all nodes which could possibly be conservatively constrained). During an iteration, each net will be in one of three states: (i) none, if retiming a register to this net will not introduce a timing violation, (ii) conservative, or (iii) exact.

Algorithm 7: Single-Frame Flow-Based Min-Register Retiming: FRETIME_1()
Input: a combinational circuit graph G = <V, E>
Output: a retiming cut R
  let Ccons = {n ∈ V : ∃m s.t. d1(m → n) ≥ A^max_n}
  let Cexact = ∅ ⊆ V × V
  repeat
    compute min cut Runder under Cexact
    compute min cut Rover under Cexact ∪ Ccons
    Ntighten ← TFI(Runder) ∩ TFO(Rover) ∩ Ccons
    forall n ∈ Ntighten do
      Plong ← {m ∈ V : A^max_n − dm < d1(m ⇝ n) ≤ A^max_n}
      Pshort ← {m ∈ V : A^min_m − dm ≤ d1(n ⇝ m) < A^min_n}
      Ccons ← Ccons − {n}
      Cexact ← Cexact ∪ ({n} × Plong) ∪ ({n} × Pshort)
  until Ntighten = ∅
  return R = Rover = Runder
The refinement of the area-critical region is accomplished as follows. We compute two minimum cuts: Runder, the minimum cut under only the (current) exact constraints, and Rover, the minimum cut under both the exact and conservative constraints. Note that Runder is under-constrained and Rover is over-constrained. Therefore, Runder will be at least as deep as Rover, whose tighter constraints prevent the registers from being pushed as far, because the criticality of both types of timing constraints increases monotonically with greater register movement.
The vertices whose timing is to be tightened are those that are conservatively constrained and lie (topologically) between the two cuts. The nodes shallower than Rover are not of interest because the over-constrained solution already lies beyond them. Likewise, the nodes deeper than Runder are not of interest because the under-constrained solution does not reach them. The exact constraints are computed for each of the tightened nodes and inserted into Cexact, and the nodes are removed from Ccons. The refinement terminates after an iteration if no new constraints are introduced or, equivalently, when Runder = Rover.
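At the level of sets, the refinement loop can be sketched as below. The callbacks are hypothetical stand-ins for the flow solver and timing engine: min_cut(exact, cons) returns a cut under the given constraints, tfi/tfo return transitive fan-in/fan-out node sets, and expand_exact(n) enumerates the exact-constraint partners of n.

```python
def refine(c_cons, c_exact, min_cut, tfi, tfo, expand_exact):
    """One frame of constraint refinement: tighten conservatively
    constrained nodes lying between the over- and under-constrained cuts
    until no conservative node remains area-critical."""
    while True:
        r_under = min_cut(c_exact, set())
        r_over = min_cut(c_exact, c_cons)
        tighten = tfi(r_under) & tfo(r_over) & c_cons
        if not tighten:
            return r_over  # no conservative node is area-critical
        for n in tighten:
            c_cons.discard(n)
            c_exact |= {(n, m) for m in expand_exact(n)}
```

Because each pass strictly shrinks c_cons, the loop terminates after at most |V| sub-iterations.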
3.3.2
Multiple Frames
The complete algorithm proceeds identically to the unconstrained version: the single-frame procedure is iterated in both the forward and backward directions until a fixed point is reached. To differentiate the steps of timing refinement from the iteration across multiple frames, we refer to the refinement steps as sub-iterations and the computation on each frame as an iteration.
3.3.3
Examples
Figure 3.6 illustrates the application of timing-constrained minimum-register retiming to the depicted circuit. It is assumed that each buffer has a unit delay: the maximum path length is 3 units. The maximum delay constraint A^max at all points is 3; there are no minimum delay constraints (or equivalently, they are −∞). The arrival time at the primary inputs is undefined and assumed to be −∞.
The maximum D1(u, n) values of the two-pass timing analysis are labelled below each of the combinational nodes n in Figure 3.6(i). Because these values are greater than the maximum delay constraint at these nodes, n4..7 are marked as being conservatively constrained.
The first iteration of forward min-register retiming is performed. In the first constraint refinement sub-iteration (Figure 3.6(ii.a)), the over- and under-constrained cuts (Rover and Runder, respectively) are computed. In this case, Rover is identical to the original positions of the registers and has a cut width of 4; Runder lies at a position with a cut width of 3. Because node n7 lies between these two cuts and is conservatively constrained, it is a target for refinement. The base U of the fan-in cone with a combinational delay of A^max and a sequential delay of 1 (s.t. D1(u ⇝ n7) = A^max) is collected; in this case, U = {n4}.
Figure 3.6. An example of timing-constrained min-register forward retiming.
This introduces one new exact constraint, for which a corresponding unconstrained edge is
added to the flow graph: from n7 → n4 . The conservative constraint on n7 is removed.
This concludes the first sub-iteration.
In the second refinement sub-iteration (Figure 3.6(ii.b)), we compute a new Runder and
Rover . The addition of the exact constraint has altered the location of Runder . However, because the two cuts are not identical, we continue with refinement on the set of conservatively
constrained nodes between the two cuts. There must exist another (different) conservatively
constrained node that is limiting area improvement, which in this case is node n4 . The one
exact constraint on n4 arises from n1 , and an unconstrained edge n4 → n1 is added to
the flow graph. The conservative constraint on n4 is removed. This concludes the second
sub-iteration.
In the third and final refinement sub-iteration (Figure 3.6(ii.c)), we compute the new
Runder and Rover . In this case, we find that they have both moved and are now identical.
There are still conservative timing constraints on nodes {n2 , n3 , n5 , n6 }, but these do not
affect the register minimization and are not area-critical. We can terminate the refinement
with the expected minimum of registers that meets all timing constraints.
Although not illustrated, the global minimization algorithm would repeat another forward iteration and one backward iteration. As there is no further register minimization
possible, it is not necessary to perform any further timing refinement. In both cases, Rover
and Runder will be immediately identical and equal to the same positions of the registers
as after the first forward iteration. The final solution has 3 registers and successfully meets
all constraints with a maximum path delay of 3.
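The flow-graph mechanics behind these examples can be illustrated with a small amount of code. The sketch below is not the dissertation's implementation: the graph, node names, and capacities are invented for illustration. It computes a min-cut with a textbook Edmonds-Karp max-flow and shows how adding an unconstrained (infinite-capacity) edge for an exact timing constraint forbids any cut that separates the constrained pair, so the registers can only move in lock-step:

```python
from collections import defaultdict, deque

INF = float("inf")

def max_flow(edge_list, s, t):
    """Edmonds-Karp max-flow / min-cut. `edge_list` holds (u, v, capacity)
    triples; returns (flow value, set of nodes on the source side of a
    minimum cut)."""
    cap = defaultdict(int)
    adj = defaultdict(set)
    for u, v, c in edge_list:
        cap[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)  # allow residual (reverse) traversal
    flow = 0
    while True:
        # BFS for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:  # no augmenting path: flow is maximum
            return flow, set(parent)
        # Walk back from t, find the bottleneck, and augment.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(cap[e] for e in path)
        for u, v in path:
            cap[(u, v)] -= b
            cap[(v, u)] += b
        flow += b

# Structural edges only: the cheapest cut (size 3) leaves q on the sink side.
base = [("S", "p", INF), ("p", "q", 1), ("p", "T", 2), ("q", "T", 2)]
flow1, side1 = max_flow(base, "S", "T")

# An exact timing constraint p -> q becomes an infinite-capacity edge: any
# cut with p on the source side and q on the sink side now costs infinity,
# so p and q can only be crossed together.
flow2, side2 = max_flow(base + [("p", "q", INF)], "S", "T")
```

On this toy graph the first call finds a cut of size 3 with source side {S, p}; once the constraint edge is added, the minimum cut grows to 4 and its source side becomes {S, p, q}, the "lock-step" effect described above.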
Figure 3.7 illustrates the application of timing-constrained minimum-register retiming
to a circuit with a critical cycle. Again, it is assumed that each buffer has a unit delay: the
maximum path length is 2 units. The maximum delay constraint is uniformly 2; there are
no minimum delay constraints (or equivalently, they are −∞).
In the first refinement sub-iteration, the conservative constraint of node n2 in Figure 3.7(ii.a) is replaced by an exact constraint n2 → n4 in (ii.b). Then, in the second sub-iteration, the conservative constraint of n4 is replaced with the exact constraint n4 → n2.
There is now a cycle in the flow graph! The effect of this is that a register can not be
retimed past any of the nodes in the cycle unless registers are retimed past all of them. In
terms of the original circuit in Figure 3.7(i), the consequence is that the registers are forced
to move around the cycle in “lock-step”. This is the expected behavior for such a critical
cycle. If the minimum retiming required the shift to extend over multiple combinational nodes, multiple disjoint cycles of unconstrained edges would be introduced into the flow graph.
Figure 3.7. An example of timing-constrained min-register retiming on a critical cycle.
The final solution in Figure 3.7(ii.c) is reached in the next refinement sub-iteration. The
resulting single-frame solution has 2 registers and a maximum delay of 2. This is also the
globally minimum multi-frame solution.
3.4 Analysis
3.5 Proof
In this section, we prove two facts about our flow-based retiming algorithm: (i) the
result is correct, meeting both the constraints on functionality and timing, and (ii) the
result has the optimally minimum number of registers.
Correctness: Functionality As the timing-constrained flavor of the minimum-register
retiming algorithm is a more strictly-constrained version of the approach in Chapter 2, the
proof for functional correctness proceeds identically to that of Section 2.4.1. We refer to
that explanation for more detail.
Correctness: Timing
Theorem 8. Algorithm 7 meets the latest arrival time constraint Amax_r and the earliest arrival time constraint Amin_r at every register r.
Proof. Without loss of generality, let us consider the long-path constraints during forward
retiming. The proof is similar for backward retiming and short-path constraints.
Let vR be the node that drives the input of a new register R. At the end of the iteration
that led R to be retimed to the output of vR , we know that vR was in one of two states:
(i) it had never been constrained, or (ii) it was originally conservatively constrained but is
now exactly constrained. vR must not have remained subject to a conservative constraint, because the cut cannot lie at the output of vR due to the unconstrained flow path along the timing edge in the flow graph from vR to the sink. We show that in both of the possible cases there exists no long combinational path D0(u, vR) for any u that violates the timing constraint at R.
• Case (i) Not having ever been constrained implies that there existed no long-path D1(Ru, vR) > Amax_vR for any register Ru. Because register R had been retimed from the transitive fan-in of vR to its output, every combinational path D0 now corresponds to a previous D1 (though the converse is not necessarily true). Because there existed no D1 with length greater than the constraint, there is now no combinational path that violates it.
• Case (ii) Having been originally conservatively constrained implies that there was some path p = u ⇝ vR such that D1(u, vR) > Amax_vR. Let the single register along this path be R1. Because R1 will be retimed forward to R, δ(p) could potentially be reduced to zero, thereby creating a long-path timing violation. We must show that δ(p) remains > 0.
Every such path must contain two consecutive nodes u′ and u′′ such that D1(u′, vR) > Amax_vR ∧ D1(u′′, vR) ≤ Amax_vR. The first node u′ must always exist, as u itself satisfies this condition. The second node u′′ must also exist: the zero-register path delay D0(R1, vR) is ≤ Amax_vR because the original circuit meets timing. There must therefore be some node between u and R1 where the length of the path is ≤ Amax_vR and crosses exactly one register (i.e. R1).
The node u′ happens to be the vertex that satisfies the condition (from Equation 3.5)
that generates an exact constraint when vR is tightened. An unconstrained edge will
be added from vR → u′ in the flow graph. From Property 1, this imposes the condition
that the cut lies beyond vR only if it lies beyond u′ . That is, R is retimed to vR only
if another register is retimed past u′ . This adds one unit of sequential latency to p
and ensures that δ(p) > 0.
Because there exists no path delay D1(u, vR) that becomes combinational and introduces a timing violation D0(u, vR) > Amax_vR, the timing at register R will therefore always be correct.
Optimality
Theorem 9. Algorithm 1 results in the minimum number of registers of any retiming that
meets the given timing constraints.
Proof. Consider a counterexample: suppose Cresult is the cut returned by some iteration of our algorithm, and let C′ be a smaller cut that also meets all of the timing constraints. We show that this situation can never arise.
By minimum-cut maximum-flow duality, Cresult is larger than C ′ because the corresponding flow graph has one or more additional paths from the source to the sink. Each
one of these paths must also cross C ′ . Because the structural edges in the two flow graphs
are identical (because the circuits are the same), the additional edge across C ′ must be due
to a timing constraint. This was the result of the implementation of either a conservative
or an exact constraint.
If this edge were due to a conservative constraint, it must have originated at a node vcons that lies topologically between Cresult and C′. (It must be deeper than Cresult, as the unconstrained edge in question could not have been part of the finite-width cut Cresult.) However, this situation violates the termination condition of our algorithm: vcons is a conservatively constrained node that lies between the over-constrained cut Cresult and a strictly smaller under-constrained cut.
If this edge were due to an exact constraint, let that particular timing arc be e = uS →
uR . Because this adds additional flow from S → R (where S is the subgraph partition
created by C ′ that lies closer to the source; R is the other), this implies that uS ∈ S and
uR ∈ R. Therefore, in the C ′ solution a register is being retimed past uS but not uR .
This explicitly violates the corresponding timing constraint and would imply that C ′ is not
timing feasible.
Therefore, there can exist no such cut C ′ that has fewer registers and also meets the
timing constraints. The result returned by the algorithm is optimal.
3.5.1 Complexity
The complexity of the original formulation of delay-constrained minimum-register retiming is limited by the pair-wise delay constraints. While we significantly improve upon
the average runtime by identifying many cases that do not require the enumeration of these
constraint pairs, the worst case is still limited by this quadratic behavior. An antagonistic circuit (with a very wide and interconnected structure) can be constructed with V nodes and V² critical delay constraints.
In our algorithm, each delay constraint pair results in an additional edge in the flow graph. Because the complexity of computing the minimum cut is O(RE), the addition of these delay constraints increases the worst-case runtime to O(RV²). (The number of original edges E is at most V² and thus subsumed by this quantity.)
The minimum-cut needs to be computed (twice) in each timing refinement sub-iteration.
An antagonistic circuit (with a very long and narrow structure) can be constructed that
results in the refinement of only one node per sub-iteration. The worst-case total runtime would then be O(RV³) per iteration.
It should be noted that the structures which result in a large number of sub-iterations are apparently complementary to those that result in a quadratic number of delay constraints. It may be possible to use this fact to prove a tighter bound than O(RV³).
Finally, the number of iterations of the main loop remains at most R. The final worst-case bound is therefore O(R²V³). This is worse than the unconstrained version of the problem by a factor of V.
3.6 Experimental Results
We utilize an experimental setup identical to that described in Section 2.5.
Name          Gates   Init Regs   Delay   Final Regs   Flow-Based Runtime   Minaret Runtime
s38417        19.5k   1465        54      1288         2.08s                17.8s
b17 opt       49.3k   1414        44      1413         6.9s                 227.6s
mux8 128bit   7.8k    1155        14      1149         0.07s                0.5s
oc cfft       19.5k   1051        111     874          12.1s                769.s
oc des perf   41.3k   1976        31      1920         10.2s                114.6s
oc pci        19.6k   1354        88      1311         0.10s                33.8s
oc wb dma     29.2k   1775        36      1754         0.24s                24.6s
oc vga        17.1k   1108        123     1079         0.10s                30.6s
MEAN                                                   1x                   102x
Table 3.1. Delay-constrained min-reg runtime vs. Minaret.
3.6.1 Runtime
One of the primary contributions of the flow-based timing-constrained minimum-register
retiming approach is the reduction in the computational effort required to compute the
optimal minimum-register solution. In this section, we contrast the runtime of our approach
against the best-known available alternative.
The Minaret tool [46] serves as the primary comparison point as the best-known publicly-available software for this problem. We greatly appreciate the source being made available by the authors; it was recompiled on our experimental platform. Minaret ran correctly on its packaged benchmarks, but we encountered errors when applying it to some of the circuits in our benchmark suites. Because of these errors (which we were not able to correct), not all of the circuits in Appendix A are available as comparison points.
Table 3.1 compares the performance of our timing-constrained algorithm against
Minaret. The test-cases presented are the ones with over 1000 registers that were processed by Minaret without error. The maximum arrival constraint Amax for every node was
set to the initial circuit delay, and minimum delay constraints Amin were set to negative
infinity (because these are not supported by Minaret). The runtimes of both Minaret and
our flow-based method are listed. The average runtime of Minaret is 102x that of our tool.
We implemented a unit timing model for comparison with Minaret (hence the integer
worst-case delay values), but the algorithm can be used with one that is much more descriptive. A second implementation used a standard load- and slew-dependent interpolating
table lookup to compute path delays. Because computation of timing data dominates the
runtime, this extra effort increased the runtime to 5x that of the unit delay version.
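Such a lookup can be sketched as a standard bilinear interpolation over a (slew, load) grid. The fragment below only illustrates the technique; the clamping policy, axis points, and delay values are invented for this example and are not the characterization tables used in the experiments.

```python
import bisect

def interp_delay(slew, load, slews, loads, table):
    """Bilinearly interpolate a delay from a 2-D table indexed by input
    slew (rows) and output load (columns), clamping at the table edges."""
    def bracket(x, axis):
        # Index of the grid cell containing x, and the fractional position.
        i = min(max(bisect.bisect_right(axis, x) - 1, 0), len(axis) - 2)
        t = (x - axis[i]) / (axis[i + 1] - axis[i])
        return i, min(max(t, 0.0), 1.0)  # clamp outside the characterized range
    i, tx = bracket(slew, slews)
    j, ty = bracket(load, loads)
    return ((1 - tx) * (1 - ty) * table[i][j]
            + (1 - tx) * ty * table[i][j + 1]
            + tx * (1 - ty) * table[i + 1][j]
            + tx * ty * table[i + 1][j + 1])
```

A cell-library delay model typically provides one such table per timing arc; the retiming algorithm itself is unchanged, only the D0/D1 path delays feeding the constraints are computed this way.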
The load-aware timing analysis can be written to include not only the effects of the
capacitive loads on the propagation delays through the combinational elements, but also
on the potential positions of the retimed registers. The constraints Amax
and Amin
at each
v
v
node v are those that would apply to a register at its output; these values can be adjusted
to include the effects of using a register instead of the existing combinational gate to drive
the capacitive load at both ends of the critical path. However, under such a non-linear delay
model, the result may be more accurate but can no longer be guaranteed to be optimal.
3.6.2 Characteristics
Tables 3.2, 3.3, and 3.4 present some characteristics of the behavior of the timing-constrained retiming algorithm on the synthesis benchmarks. Again, the initial period of the circuit (column Initial Del) was used as the maximum arrival constraint of each register. The worst-case combinational delay path after retiming is listed in column Final Del; this is verifiably no greater than the constraint in the previous column. The resulting number of
registers is in column Final Regs. Further details about the behavior are captured in the
next four columns: Iters, the number of iterations with improvement, Cons, the number
of initially conservatively constrained nodes, Refined, the number of the initially conservatively constrained nodes that were refined, and Exact, the number of exact constraints that
resulted. For the latter three metrics, the average values of the first forward and backward
iteration are presented. Finally, column Time lists the total runtime in seconds.
A closer examination of the number of conservative nodes that needed to be refined in
each refinement iteration is presented in Figure 3.8. The upper graph measures the number
of refined nodes; the bottom the number of resulting exact constraints. These quantities
are averaged across all iterations and all non-verification benchmarks, though the forward
and backward phases are graphed separately. From these values, we can conclude that the
majority of the conservatively-constrained nodes that need to be refined are identified in
Name       Initial Regs   Initial Del   Final Del   Final Regs   Iters   Cons   Refined   Exact   Time
s641       19             78            78          17           1       111    0         0       0.01
s713       19             86            86          17           1       122    0         0       0
s400       21             17            17          18           2       93     8         38      0.01
s382       21             17            17          18           2       86     7         32      0
s444       21             20            19          18           1       123    25        199     0
s953       29             27            26          24           1       322    68        1100    0.02
s9234      135            55            55          127          1       715    2         14      0.01
s5378      163            33            33          149          2       317    0         0       0.02
s13207     669            59            54          466          8       279    12        71      0.46
s38584.1   1426           70            70          1425         1       849    0         0       0.24
s38417     1465           65            65          1288         4       1358   910       24786   6.37
Table 3.2. Period-constrained min-reg characteristics, LGsynth benchmarks.
Name                Initial Regs   Initial Del   Final Del   Final Regs   Iters   Cons   Refined   Exact   Time
oc miniuart         90             10            10          89           1       190    2         5       0.01
oc dct slow         178            17            17          168          2       33     3         9       0.01
oc simple fm rcvr   226            34            34          224          1       184    5         12      0.03
oc minirisc         289            23            23          269          2       308    3         37      0.02
oc ata ocidec2      303            13            13          283          1       146    19        77      0.01
oc aes core         402            13            13          394          1       828    0         0       0.01
oc hdlc             426            12            12          383          3       186    9         74      0.04
oc ata ocidec3      594            15            13          555          2       161    24        89      0.02
oc ata vhd 3        594            15            15          560          2       225    35        132     0.03
oc fpu              659            1030          1030        298          2       288    133       1172    2.51
oc aes core inv     669            13            13          658          2       512    0         0       0.03
oc oc8051           754            52            52          746          2       1793   2         17      0.06
os blowfish         891            37            36          827          1       17     0         0       0.03
oc cfft 1024x12     1051           21            21          910          10      596    1154      9881    3.27
oc vga lcd          1108           34            34          1078         3       191    0         0       0.05
oc ethernet         1272           33            33          1259         3       315    0         0       0.09
oc pci              1354           46            46          1308         3       46     0         0       0.09
oc aquarius         1477           99            99          1474         1       5567   0         0       0.09
oc wb dma           1775           18            18          1767         2       1323   191       3614    0.43
oc mem ctrl         1825           32            32          1812         2       1043   0         0       0.13
oc des perf         1976           5             5           1976         0       6077   408       6714    0.08
Table 3.3. Period-constrained min-reg characteristics, OpenCores benchmarks.
Name             Initial Regs   Initial Del   Final Del   Final Regs   Iters   Cons   Refined   Exact   Time
barrel16         37             8             6           36           1       17     0         0       0
barrel16a        37             10            10          34           1       18     0         0       0
barrel32         70             10            8           69           1       33     0         0       0.01
barrel64         135            11            9           134          1       65     0         0       0.01
nut 004          185            12            12          168          4       18     3         10      0.02
nut 002          212            19            19          195          3       57     0         0       0.01
nut 003          265            36            36          235          2       102    2         10      0.03
nut 000          326            55            55          315          3       62     0         0       0.01
nut 001          484            55            55          446          5       142    16        77      0.17
mux32 16bit      533            6             6           503          1       140    0         0       0.01
mux8 64bit       579            4             4           573          1       290    0         0       0.01
mux64 16bit      1046           6             6           985          1       149    0         0       0.02
mux8 128bit      1155           4             4           1149         1       578    0         0       0.02
radar12          3875           44            44          3767         4       21     0         0       0.45
radar20          6001           44            44          5357         3       23     0         0       2.48
uoft raytracer   13079          93            93          12030        4       7126   4261      57401   58.91
Table 3.4. Period-constrained min-reg characteristics, QUIP benchmarks.
the first two sub-iterations. Also, it appears that there is a strong relationship between the
number of tightened nodes and the resulting number of exact constraints.
Figure 3.9 illustrates the effect of the refinement of conservative constraints into exact
constraints on the over- and under-constrained cuts. In each refinement sub-iteration, the
removal of conservative constraints decreases the number of registers in the over-constrained
cut while the addition of exact constraints increases the number in the under-constrained
one. Eventually, the number of registers in (and structural location of) these two cuts converges. In this graph, the number of registers in these two cuts is presented relative to the size of the final one, and the progressive convergence of their sizes is captured over time. Again, these are the average values over all iterations and all non-verification benchmarks; the forward and backward phases are graphed separately. Not all of the benchmarks required so many iterations, and almost all of the refinement occurred in the last one or two iterations before convergence.
Figure 3.9 can be used to get an idea of the optimality that is lost through early
termination with an over-constrained cut. Even after only the first sub-iteration of timing
refinement, the over-constrained cut is suboptimal by less than 6% on average.
Figure 3.8. Average fraction of conservative nodes refined in each iteration.
We also examine the relationship between the tightness of the maximum delay constraints and the ability of the minimum-register retiming to decrease the registers in the
design. This is primarily a characteristic of the target circuits and not the minimization
algorithm, but it gives an idea of the tradeoffs involved in timing-constrained register minimization. Five designs are described in detail in Figure 3.10. The delay values on the x-axis are normalized to the delay of the unconstrained minimum-register solution. (Points toward the left are more tightly constrained.) The number of registers on the y-axis is normalized to the number of registers in the unconstrained minimum-register solution. Even within
this small sample of designs, it can be observed that the relationship between delay and register count varies widely and is highly design-dependent.
Figure 3.9. Registers in over-constrained cut vs. under-constrained cut over time, relative to the final solution.
The relative tightness of the imposed delay constraints affects not only the result but also the behavior of the algorithm. Tighter constraints result in more of the nodes being initially constrained, more of the nodes being refined, and more exact constraints per tightened node. On the other hand, the mobility of the registers is more limited. To
quantify these tradeoffs, we took all of the designs in Tables 3.2, 3.3, and 3.4 and applied
the heuristic minimum-delay retiming algorithm of [47]. The resulting circuits had smaller delays and (typically) an increased register count. We then applied the delay-constrained
min-register algorithm using the optimized worst-case delays as the global constraints. The
characteristics of the resulting run are presented in Tables 3.5, 3.6, and 3.7. The meaning of the table columns is identical to that in the original-period-constrained tables. One design, "uoft raytracer", is omitted due to problems in the min-delay optimization. There were examples for which the runtime increased and examples for which it decreased. While the decreases outnumbered the increases, the average runtime rose 10.0x.
As an experiment to verify the correctness of our result, we compared the result of
applying our algorithm to both the original circuits and the delay-minimized versions using
the same original delay constraints. The expected result of an identical minimized register
count was observed. Also, as expected, the resulting circuits were generally not isomorphic,
due to the differing initial positions of the registers at the entry to our algorithm.
Figure 3.10. Registers after min-reg retiming vs. max delay constraint for selected designs.
Name       Initial Regs   Initial Del   Final Del   Final Regs   Iters   Cons   Refined   Exact   Time
s382       30             17            11          26           1       131    9         61      0
s400       32             17            11          27           1       142    11        76      0
s444       36             20            12          30           4       88     20        108     0.01
s953       33             27            22          31           1       358    81        1898    0.03
s5378      177            33            29          166          1       785    6         28      0.02
s9234      135            55            51          127          1       816    8         62      0.02
s13207     690            59            46          455          8       469    12        91      0.49
s38417     1629           65            45          1356         4       3173   1837      46937   11.33
s38584.1   1428           70            63          1427         1       974    0         0       0.22
s641       19             78            78          17           1       111    0         0       0
s713       19             86            86          17           1       122    0         0       0
Table 3.5. Min-delay-constrained min-reg characteristics, LGsynth benchmarks.
Name                Initial Regs   Initial Del   Final Del   Final Regs   Iters   Cons   Refined   Exact   Time
oc des perf         1976           5             5           1976         0       6077   408       6714    0.08
oc miniuart         90             10            10          89           1       190    2         5       0
oc hdlc             431            12            9           397          2       581    29        286     0.03
oc ata ocidec2      313            13            10          295          2       184    21        156     0.03
oc aes core         402            13            13          394          1       828    0         0       0.01
oc aes core inv     669            13            13          658          2       512    0         0       0.03
oc ata ocidec3      616            15            11          566          3       380    58        393     0.07
oc ata vhd 3        619            15            11          575          3       405    55        397     0.07
oc dct slow         204            17            7           192          2       145    40        244     0.01
oc wb dma           1870           18            13          1793         2       1370   122       2670    0.39
oc cfft 1024x12     1580           21            13          1088         6       720    1000      7185    1.57
oc minirisc         292            23            21          273          2       348    7         64      0.01
oc mem ctrl         1835           32            31          1813         2       1051   3         19      0.25
oc ethernet         1298           33            28          1274         3       500    27        213     0.18
oc simple fm rcvr   254            34            22          241          1       236    110       910     0.04
oc vga lcd          1121           34            23          1091         3       415    0         0       0.06
os blowfish         899            37            33          834          1       48     0         2       0.03
oc pci              1421           46            24          1371         3       584    24        62      0.15
oc oc8051           770            52            49          765          2       1960   3         57      0.06
oc aquarius         1531           99            92          1504         2       4217   37        2096    0.29
oc fpu              1992           1030          217         856          5       2457   5217      29768   12.23
Table 3.6. Min-delay-constrained min-reg characteristics, OpenCores benchmarks.
Name          Initial Regs   Initial Del   Final Del   Final Regs   Iters   Cons     Refined   Exact   Time
mux8 64bit    767            4             2           767          0       955      95        159     0.02
mux8 128bit   1535           4             2           1535         0       1915     191       319     0.03
mux32 16bit   653            6             3           586          2       300      68        108     0.02
mux64 16bit   1244           6             3           1171         2       518.5    98        263.5   0.05
barrel16      86             8             3           86           0       82       32        232     0
barrel16a     99             10            4           87           2       48.5     69.5      292     0.01
barrel32      198            10            4           183          1       137      96        736     0.02
barrel64      647            11            4           647          0       514      320       3712    0.1
nut 004       242            12            5           204          2       163.5    64.5      288     0.02
nut 002       228            19            13          210          3       104.5    13        56      0.02
nut 003       323            36            19          280          2       194.5    44.5      114.5   0.04
radar12       3970           44            23          3840         2       847.5    0.5       24      0.59
radar20       6862           44            23          5893         3       1746.5   1548      11181   4.63
nut 000       414            55            24          367          3       161      47        268     0.04
nut 001       579            55            33          485          4       351.5    52        354     0.11
Table 3.7. Min-delay-constrained min-reg characteristics, QUIP benchmarks.
3.7 Summary
The contribution of this chapter is a new algorithm for computing a minimum-register retiming under both minimum and maximum path delay constraints. The improvements over previous techniques include:
Faster Runtime. We measured our technique against the best-known published solution to the delay-constrained minimum-register retiming problem, implemented in the academic Minaret tool. Our runtime was on average 102x faster. The worst-case bound is difficult
to compare to other techniques, but the empirical improvement for real-life examples is
substantial.
Intermediate timing feasibility. In every iteration of both the outer algorithm and
the inner timing refinement, there exists a solution that is strictly smaller than the original
solution and meets all of the timing constraints. The algorithm can therefore be terminated
at any point, and the tradeoff of runtime versus quality adjusted as desired.
Flexible timing model. Our formulation considers both long- and short-path constraints. While Minaret uses a unit-delay timing model, we allow arbitrary values. The
timing analysis can also be replaced with other more descriptive models.
A common problem formulation. This approach to delay-constrained register minimization introduces timing constraints into the problem structure described in Chapter 2
but does not alter the underlying algorithm. Other constraint types (such as those from
the subsequent chapter) are completely compatible with this optimization.
Chapter 4
Guaranteed Initializability Min-Register Retiming
In this chapter we extend the algorithm in Chapter 2 to include constraints on the
initializability of the circuit. The initializability requirement is a guarantee that upon reset,
the registers can be initialized to a set of values that results in behavior identical to that of the original circuit.
Again, we assume that the retiming transformation is understood. The reader may
review Section 1.2.1 for more background. The content of this chapter also depends on
the maximum-flow-based formulation of minimum-register retiming from Chapter 2. An
understanding of that material is a prerequisite.
The chapter begins in Section 4.1 by defining the problem of guaranteeing initializability after retiming. The existing approaches to solving this problem are described in Section 4.2. Our new maximum-flow-based approach is described in Section 4.3, including a few examples. An analysis of the correctness, complexity, and limitations is presented in Section 4.4. Some experimental results are given in Section 4.6.
4.1 Problem
The majority of sequential devices contain a mechanism for bringing the state to a
known value. This is useful at initialization, upon the detection of an unexpected error,
or when a restart is desired by the user. The typical implementation of this mechanism is
through a reset signal that is distributed to all sequential components. When the reset is
asserted, all state elements are reverted to a known initial state or initialization value.
There do exist other procedures for bringing the system to a known state, but we only
address resetting the registers to specific values at this time.
The initial state of each register bit is specified as part of the design. It may be required
to be either zero or one or allowed to be either (often referred to as x-valued). In the
unspecified case, the register may still require a reset to drive its output to a legal logic
value (to guarantee correct electrical functionality by removing any meta-stability present
in the physical device), but the particular logic value is unimportant. This affords additional freedom that can be exploited at different points of the design flow, including during
retiming.
When a circuit is transformed by retiming, the initialization values of the new retimed
registers must be assigned to maintain the output functionality of the circuit. From the
point of reset onward, the output must be identical under any possible input trace. A set
of initial values that satisfy this requirement is known as an equivalent initial state.
There may not exist any such equivalent initial state. The retiming must then be
rejected, adjusted, or the circuit structure altered. The difficulty of efficiently modifying
the retiming and/or the circuit so that the circuit has identical initialization behavior is
one of the central challenges to retiming. It has even been suggested that this is one of
primary obstacles to its further industrial adoption [50], though this does not conform to
our experience. It is the problem of retiming under the constraint that an equivalent initial
state must exist that we turn to now.
4.2 Previous Work
The effect of retiming on reset behavior can be conceptualized using the equivalent finite
state machine representation of the sequential circuit. Each register corresponds to a bit in
the set of possible state representations, and the initial value (or multiple possible values, if unspecified) of each register dictates the one or more initial states from which the state machine progresses. There may be multiple states that belong to an initialization sequence
and are not reachable from later states. The remaining set of states comprise the cyclic
core. While retiming is guaranteed to maintain the cyclic core of the state machine, the
transformation may alter the initialization sequences [51].
If the retimed circuit has an equivalent initial state, then transitively, it is known that
all of the subsequent initialization sequences must also exist. The problem of finding an
equivalent initial state is known as the initial state computation. In the next subsection,
Section 4.2.1, we discuss a method for computing an equivalent initial state or determining
that none exists. In Section 4.2.2, we give an overview of previous solutions for excluding
particular retimings that do not have such a state.
4.2.1 Initial State Computation
The problem of computing an equivalent initial state can be solved by finding a set
of register assignments that reproduces the same logic values on the outputs of all combinational gates. While this is sufficient to guarantee output equivalence, it is not strictly
necessary; however, this formulation confines the scope of the computation to the retimed
logic and is the problem commonly solved in practice. Determining such an equivalent
initial state is substantially different for combinational nodes with a positive retiming lag
(over which registers were retimed in the direction of signal propagation) than for ones with
a negative retiming lag.
For registers that were retimed in the forward direction, an equivalent initial state can
be computed by logically propagating the initial states forward through the combinational
logic to the new register locations. This process is identical to functional simulation. With unspecified x values, three-valued simulation should be used. In either case, an equivalent initial state is guaranteed to exist.
Figure 4.1. A circuit with eight registers and their initial states.
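For illustration, the forward computation by three-valued simulation can be sketched in a few lines. This is a toy model, not the dissertation's implementation: it supports only AND and NOT gates, and the net-map encoding is invented here.

```python
def and3(a, b):
    # Three-valued AND: a controlling 0 dominates an unknown "x".
    if a == 0 or b == 0:
        return 0
    if a == 1 and b == 1:
        return 1
    return "x"

def not3(a):
    return "x" if a == "x" else 1 - a

def forward_init_state(gates, init):
    """Propagate register initial values through the combinational logic.
    `gates` maps each net to ("AND", in1, in2) or ("NOT", in1) and is
    assumed to be in topological order; `init` gives the values (0, 1,
    or "x") on the original register output nets."""
    vals = dict(init)
    for net, gate in gates.items():
        if gate[0] == "AND":
            vals[net] = and3(vals[gate[1]], vals[gate[2]])
        else:
            vals[net] = not3(vals[gate[1]])
    return vals
```

The nets where the forward-retimed registers now lie are simply read out of the resulting value map, exactly as r1−4 reads off the value 0 in the example below.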
For registers that were retimed in the backward direction, it is necessary to find a
set of values that, when propagated forward, is identical to the original initial state. This
problem can be solved using SAT. As the registers are moved backwards, the initialization
problem can be simultaneously constructed by unrolling the logic over which the registers
were moved. If necessary, this may result in some or all of the circuit being replicated
multiple times. The SAT problem then consists of finding an assignment to the base of
this unrolled cone (corresponding to the new positions of the registers) such that the values
at the leaves (corresponding to the original locations of the registers) are identical to their
original initial values. Any registers that do not have a reset value specified are omitted as
constraints at the top of the cone.
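A toy version of the backward computation, with the SAT search replaced by explicit enumeration over candidate values at the new register positions, might look as follows. All names and the circuit encoding are invented for illustration; a realistic implementation hands the unrolled cone to a SAT solver instead.

```python
from itertools import product

def backward_init_state(cut_nets, simulate, targets):
    """Search for an assignment to the new (backward-retimed) register
    positions `cut_nets` such that forward propagation reproduces every
    specified original initial value. `simulate` maps an assignment dict
    to the resulting net values; original registers whose initial value
    is "x" are simply omitted from `targets`."""
    for bits in product((0, 1), repeat=len(cut_nets)):
        assignment = dict(zip(cut_nets, bits))
        values = simulate(assignment)
        if all(values.get(net) == v for net, v in targets.items()):
            return assignment
    return None  # uninitializable: no equivalent initial state exists
```

A conflicting set of target values makes the search return None, which is precisely the uninitializable situation that arises in the example of Figure 4.3(i) below.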
Consider the example of initial state computation after retiming the circuit depicted in
Figure 4.1. The original circuit contains eight registers r1 through r8 , each labeled with
their values at initialization. We will apply minimum-register retiming in separate forward
and backward phases (as is done in the algorithm in Chapter 2).
The forward minimum-register retiming phase will replace the four registers r1 through
r4 with one register r1−4 in the location shown in Figure 4.2. If the initial values of the
original registers are propagated forward, it can be seen that the logical state of the net on which the new register lies is 0. If the new register r1−4 initializes to 0, the retimed circuit will behave identically to the original.
Figure 4.2. Computing the initial states after a forward retiming move.
Figure 4.3. Computing the initial states after a backward retiming move.
The backward minimum-register retiming phase will replace the four registers r5 through
r8 with the one register r5−8 in the location shown in Figure 4.3(i). Here, we must find
an initial value assignment to the new register that when propagated forward results in
the specified initial values on the nets on which the original registers lie. In this particular
example, no such single satisfying assignment exists; this is not an initializable retiming.
One could accept the solution after forward retiming and abandon the backward phase
completely, though this clearly sacrifices significant optimization potential. This is overkill
for a conflict that may be confined to a very local portion of the circuit. There exist other
backward retiming solutions that improve the objective and yet possess an equivalent initial
state. One such example is illustrated in Figure 4.3(ii). The question at hand is how to
constrain the retiming algorithm to avoid only the uninitializable solutions.
4.2.2 Constraining Retiming
It is always possible to restrict the retiming transformation such that the resulting
circuit will have a set of feasible initial states. The most straightforward such restriction is
that the registers only be moved in the forward direction. However, this comes at the cost
of a loss in optimization potential. Our experiments indicate that almost half of the average
reduction in register count is achieved only through movement in the backward direction
(Figure 2.15); it is clearly desirable to capture as much of this improvement as possible.
One solution is to transform a general retiming solution into an equivalent forward-only
one. The lag function of the minimal forward-only retiming, r′(v) : V → Z^{0,+}, can be
derived from any other lag function r(v) : V → Z by subtracting the minimum element in
the range of r(v) (Equation 4.1). There is no loss in generality in this procedure.
r′(v) := r(v) − min_{u∈V} r(u)    (4.1)
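A minimal sketch of this normalization, with lags stored in a Python dictionary:

```python
# Sketch of Equation 4.1: shift any lag function so its minimum is zero,
# yielding an equivalent forward-only retiming.

def forward_only_lags(r):
    """Convert a general lag function r : V -> Z into the equivalent
    forward-only lag function r' : V -> Z>=0 by subtracting the
    minimum element of its range."""
    m = min(r.values())
    return {v: lag - m for v, lag in r.items()}

r = {'u': -2, 'v': 0, 'w': 1}
rp = forward_only_lags(r)  # {'u': 0, 'v': 2, 'w': 3}
```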
This offset also applies to the lags of primary inputs (and outputs) and will result in
non-zero lags on these nodes if there exists any node v such that r(v) < 0 (i.e. one that was
backward retimed). The registers that were “passed through the environment” may not
have a corresponding single initialization value; in general, additional combinational logic is
required to handle their initialization. The synthesis of this is described by [52]. There are
limitations to this approach in both analysis and implementation: the construction of the
circuit’s state machine is necessary to perform the analysis (which can only be explicitly
built for small control circuits), and the resulting changes to the combinational netlist may
be unpredictable and/or extensive. It is highly desirable for retiming to leave the structure
of the rest of the netlist mostly intact.
The ability to retime in the reverse direction while still maintaining initial state feasibility was first addressed by [53]. This is an improvement over the forward-only retiming
of [52] as it allows individual registers to be moved backwards without having to be pushed
through the environment. However, the method used to restore initializability is heuristic
and not well-targeted to the conflict. A constraint on the minimum lag is introduced to
gradually reduce backward movement.
For minimum-delay retiming, [54] describes an elegant method that searches for the provably
“most-initializable” co-optimal solution, relaxing the target period until it is found.
The relaxation of the target period allows the magnitude of the retiming lags in all portions
of the circuit to be scaled back simultaneously. This approach is unfortunately not applicable to the min-register problem. Here, there is no equivalent global objective to be relaxed:
an incremental relaxation of the objective (an additional register) can be assigned to any of a
number of locations and improves the initializability of only a local portion of the
circuit. Choosing the optimal assignments in the min-register problem is a significantly
more difficult problem.
For the min-register problem, the work of [55] maintains the guarantee of optimality in
the number of registers. Complete sets of feasible initial states are generated by justifying
the original initial values backward through the circuit; there are many such sets as the
justification process is non-unique. The choice of initial states then imposes constraints on
the retiming which are incorporated by only allowing registers with identical values to be
merged. If this constrained problem produces a result that is identical to the lower bound
on the number of registers with retiming, a feasible optimal solution has been identified;
if not, additional sets of initial values are generated until all have been enumerated. This
approach, however, is not scalable to large circuits. The constrained retiming problem
(when fan-out sharing is considered) is formulated as a general MILP (mixed-integer linear
program). Without leveraging any particular structure of the problem, scalability is lost
to the poor performance of a general solver on a large instance.
Furthermore, the number of initial states is exponential in the worst case, and this
approach requires as many iterations whenever the feasible optimal solution is even slightly worse
than the lower bound.
4.3 Algorithm
We propose a technique to generate a minimum-register retiming with a known equivalent initial state that is both optimal under this constraint and empirically scalable in
its runtime. This is accomplished by using the formulation of minimum-register retiming
introduced in Chapter 2. While the problem remains NP-hard, the algorithm appears to
be efficient for real circuits.
The general procedure consists of generating a set of feasibility constraints, incorporated
into the maximum-flow problem, that bias the registers away from being retimed into local portions
of the circuit that are known to introduce conflicts. These constraints are introduced
incrementally and in such a way that every iteration adds exactly one additional register
to the final solution.
In this section, we use the notion of depth to express the distance with which the
registers have been retimed from their original locations. As the initializability algorithm
only operates during the backward phase of retiming, a deeper retiming refers to a cut that
has a more negative lag function and lies more backward relative to the direction of signal
propagation. In terms of the corresponding flow problem, however, a deeper cut lies closer
to the sink and more forward relative to the direction of flow.
4.3.1 Feasibility Constraints
It is important to draw the distinction between the original circuit and the initialization
circuit which is used to test for and compute a new set of initial register values. The circuit
on which the initial state computation is performed consists of the unrolled logic between the
original and retimed locations of the registers. This initialization circuit is not necessarily
a subset of the original design, as each of the original nodes may be replicated zero or more
times.
Let this initialization circuit be the graph Ginit = ⟨W, E⟩, where W ⊆ V × Z. V contains
the nodes in the original design, and each initialization node w corresponds to a vertex in
the original problem and a lag value at the time that copy w was added to the problem.
Let a feasibility constraint γ be a subset of the problem variables W that has the
following property: it is sufficient for a retiming to be as topologically deep as γ to imply
infeasibility. Correspondingly, a retiming must be at least partially shallower than γ to be
feasible. We require γ to be a partial cut in the initialization circuit: a set of vertices in W
where there exist no i, j ∈ γ such that i ∈ TFO(j).
Lemma 10 implies that infeasibility is monotonically increasing with topological order,
and that such a partial cut γ must exist.
Lemma 10. If a particular retiming is initial state infeasible, all strictly deeper retimings
are also infeasible.
Proof. Consider a feasible assignment at some strictly deeper cut. The forward propagation
of these initial states implies a set of initial values at the location of shallower cut. This set
of values comprises a feasible initial state for a retiming at the shallower cut and violates
the assumption.
We find such a partial cut γ using the procedure described by Algorithm 8, as follows.
The initialization circuit is ordered topologically from the retimed locations of the
registers to the initial locations. Binary search is then used to find the shallowest complete
cut, which results in an UNSAT initial state. In each step at node w, the implied cut lies
between the nodes whose ordering label is ≤ w and the nodes whose ordering label is > w.
The subset of the variables in the SAT problem beyond w is excluded (i.e. all of the
clauses that contain them are removed). This reduces the problem to the exact point in the
topological order at which it is sufficient to imply infeasibility. Note that this point is a
result of the particular ordering chosen amongst the multiple partial orderings implied by
topology.
The last variable w that was required to produce UNSAT is then added to γ. However,
in the new test for SAT, only the variables and constraints in the transitive fan-in of
Figure 4.4. Binary search for variables in feasibility constraint.
γ are included in the problem. This is done to ignore the variables that were included
not because of their topological relationship to w but because of the particular ordering.
Because of the exclusion of this region, the initialization circuit may be SAT once again, and
the addition of more variables required. The procedure is repeated until the transitive fan-in
of γ by itself is sufficient to imply UNSAT. In subsequent binary searches, the variables in
TFO(γ) are always included in the problem. The final γ is the new feasibility constraint.
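The binary-search step can be sketched against a stand-in oracle. Here `unsat_up_to(i)` is a hypothetical callback abstracting the SAT call with all variables beyond topological position i excluded; Lemma 10 provides the monotonicity that makes binary search sound.

```python
# Sketch of the binary search for the shallowest UNSAT prefix of the
# topological order. `unsat_up_to` is a hypothetical SAT oracle.

def shallowest_unsat_prefix(order, unsat_up_to):
    """Find the first node w in topological order such that restricting
    the initialization problem to order[: i + 1] is already UNSAT.
    By Lemma 10, infeasibility is monotone in the prefix length."""
    lo, hi = 0, len(order) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if unsat_up_to(mid):
            hi = mid          # UNSAT already at mid: search shallower
        else:
            lo = mid + 1      # still SAT: must include more nodes
    return order[lo]

# Toy monotone oracle: the problem becomes UNSAT once node 'd' is included.
order = ['a', 'b', 'c', 'd', 'e', 'f']
oracle = lambda i: i >= order.index('d')
w = shallowest_unsat_prefix(order, oracle)  # 'd'
```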
Figure 4.4 illustrates the identification of a variable w through binary search on the
topological ordering. The green region represents the ordered initialization circuit Ginit in
which w is a node. The cut Ctopo>w excludes this particular variable while the cut Ctopo≥w
includes it. The transitive fan-out cone of w that is tested as being sufficient to imply
UNSAT is highlighted in orange.
If it is available, the UNSAT core can be used to generate a feasibility constraint. The
UNSAT core localizes the variables (e.g. the nets) in the problem that resulted in the
Algorithm 8: Construction of a feasibility constraint: FIND_FEAS_CONST()
Input : an initialization circuit graph Ginit = ⟨W, E⟩
Output: a feasibility constraint γ
let γ ⊆ W be empty
assert SAT(Ginit) = false
repeat
    topologically order W
    binary search w ∈ W until
        ¬SAT(Ginit w/o vars TFO(γ) ∪ {u : topo(u) ≥ topo(w)})
        ∧ SAT(Ginit w/o vars TFO(γ) ∪ {u : topo(u) > topo(w)})
    add w to γ
until SAT(Ginit w/o vars TFO(γ)) = true
return γ
conflict that prevented the existence of a satisfying assignment. This eliminates the need
for a binary search of the problem variables.
4.3.2 Incremental Bias
To implement each constraint γ, a penalty structure is added to the flow graph to bias
it against any cuts that lie further from the initial positions of the registers than γ. This
is accomplished using the graph feature in Figure 4.5. A new node nbias is added: its flow
fan-ins are the nodes v ∈ γ, and its fan-out is the sink node vsink . The effect of this structure
is to add exactly one additional unit of flow from γ to vsink . Without the model for fan-out
sharing, the edge from nbias → vsink would constrain the flow; with fan-out sharing, the
internal edge in nbias with unit capacity constrains the flow. The sum total of the width of
the edges crossing Cmin is increased by one, and this cut may no longer be the minimum
width cut in the graph. The next iteration has been incrementally biased against selecting
Cmin and any other cut that lies beyond γ and cuts this additional flow path.
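The effect of the bias structure can be illustrated with a toy max-flow computation. The Edmonds-Karp implementation and the graph below are illustrative only, not the construction of Figure 4.5; the point is that routing one extra unit of flow from γ to the sink increases the width of any cut beyond γ by one.

```python
# Illustrative sketch of the bias structure's effect on the minimum cut.
from collections import deque, defaultdict

def add_edge(cap, u, v, c):
    cap[u][v] = cap[u].get(v, 0) + c
    cap[v].setdefault(u, 0)  # residual back edge

def max_flow(cap, s, t):
    """Tiny Edmonds-Karp max-flow on cap = {u: {v: capacity}}."""
    flow = defaultdict(lambda: defaultdict(int))
    total = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap[u].items():
                if v not in parent and c - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return total
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[u][v] - flow[u][v] for u, v in path)
        for u, v in path:
            flow[u][v] += aug
            flow[v][u] -= aug
        total += aug

# Toy graph (not the dissertation's construction): min cut is initially 1.
cap = defaultdict(dict)
add_edge(cap, 's', 'a', 2)
add_edge(cap, 'a', 'b', 1)
add_edge(cap, 'b', 't', 1)
base = max_flow(cap, 's', 't')    # 1

# Bias structure: one extra unit of flow from gamma = {'a'} through a new
# node 'nbias' to the sink, so any cut beyond 'a' now costs one more.
add_edge(cap, 'a', 'nbias', 1)
add_edge(cap, 'nbias', 't', 1)
biased = max_flow(cap, 's', 't')  # 2
```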
The feasibility constraint γ is in the space of W , but it must be applied to the flow graph
Figure 4.5. Feasibility bias structure.
in the space of V . This is accomplished by delaying the implementation of γ’s bias until all
of the nodes (with the appropriate lags) come into the scope of the current combinational
frame. Until this happens, the cut is also prevented from prematurely passing any of
the nodes already in scope. Their fan-outs are temporarily redirected to the sink. This
temporary delay does not affect the result.
Each feasibility constraint introduces exactly one register. As the register count is
increased, one of two cases will occur: (i) the minimum cut is now shallower than γ and
the result is initializable, (ii) the minimum cut is still as deep as γ and another penalty is
necessary. In this manner, the minimum cut is “squeezed forward” out of the conflict region
and the register count is incremented until it first becomes possible to find an equivalent
initial state.
The overall algorithm consists of the repeated identification of a feasibility constraint γ
and its addition to the cumulative set Cf eas . The bias structure for every constraint in Cf eas
is added to the graph and the new (larger) backward minimum cut computed. The iteration
terminates when the minimum cut has an equivalent initial state. This is summarized in
Algorithm 9.
If optimality is desired and multiple penalties with overlapping elements are generated,
search is required to check for the cases where confining the biases to a single subset of the
constraint variables is sufficient to push the resulting cut forward beyond that subset of
the constraint. The solution may be feasible due to the Boolean relationships between the
various overlapping elements and constraints; this cannot be addressed with any strictly
topological analysis. As the search process is exponential, the expected NP-hard worst-case
complexity of the initialization problem is contained within this case.
However, if optimality is not necessary, the problem can be simplified. The biases can
be added in a straightforward manner until the minimum cut is pushed forward beyond
the feasibility constraints. Alternatively, the initialization circuit can be chopped at the
feasibility cut (similarly to the manner in which conservative timing constraints were added
in Section 3.3) and the cut made feasible in a single step.
We can also consider the duplication of registers on the same net to provide both initial
states (if this is allowed). This corresponds to the case where some portion of the minimum
cut was pushed to exactly w ∈ γ, the point at which register duplication is capable of
resolving the conflict. If the fan-out of w is greater than 2, this technique may be more
efficient than biasing the cut until it is pushed past w. Both duplication and structural bias
can be integrated into the search for the case requiring multiple registers.
4.4 Analysis
4.4.1 Proof
Correctness: Functionality
As the guaranteed-initializable minimum-register retiming
algorithm is a more strictly-constrained version of the approach in Chapter 2, the proof
for functional correctness proceeds identically to that of Section 2.4.1. We refer to that
explanation for more detail.
Algorithm 9: Guaranteed Initializable Retiming: INIT_RETIME()
Input : a combinational circuit graph G = ⟨V, E⟩
Output: an initializable retiming cut R
let the initial state problem variables be W ⊆ V × Z
let Cfeas be an empty list of subsets of W
// forward retiming phase
repeat
    nprev ← |G|
    G ← MIN_REG_forward(G)
until |G| = nprev
// backward retiming phase
Gsaved ← G
repeat
    G ← Gsaved
    Ginit ← ∅
    repeat
        nprev ← |G|
        G ← MIN_REG_backward(G)
        build Ginit
    until |G| = nprev
    if G has initial state then
        return G
    γ ← FIND_FEAS_CONST(Ginit)
    Cfeas ← Cfeas ∪ {γ}
until forever
Correctness: Initializability
The initializability of the final circuit is implicit in the
termination condition of the outermost loop of the algorithm. The equivalent initial state
is computed, and if one exists, we have an example that proves the initializability of the
resulting circuit. If no such state exists, another iteration is necessary to modify the retiming
solution.
4.4.2 Complexity
Because the problem of computing an equivalent initial state after backward retiming
(let alone transforming that retiming) is already NP-hard, it is not possible to establish a
polynomial upper bound on the runtime of this algorithm. However, this in no way precludes
its speed and scalability on the class of circuits typically seen in the real world. Our
experience has shown that, for the circuits that we examined, the check for an equivalent
initial state via SAT is extremely fast. The total number of calls to the SAT solver is
bounded by O(FR log R), where R is the original number of registers in the design and F
is the number of additional registers that are required to ensure initial state feasibility. F
is quite small in all of the examined circuits.
4.5 Experimental Results
We utilize an experimental setup identical to that described in Section 2.5.
First, unconstrained minimum-register retiming was applied to all of the non-verification
benchmarks. The initial state was preserved in the overwhelming majority of cases; only
one design in the entire suite (“s400”) did not have an equivalent initial state. This may be
an unrepresentatively low rate of non-initializability; industrial collaborators have reported
experiencing a rate of approximately 10%.
To provide a more thorough evaluation of the initial-state-feasible minimum-register
retiming, the initializable benchmarks were modified to create initialization conflicts. The
Name          Nodes    Init Regs    Infeas Regs    Addl Feas Regs    Avg |γ|    Runtime (s)
s400          0.3k     21           18             1                 8          0.08
oc_aes_core   16.6k    402          395            3                 2          2.55
oc_vga_lcd    17.1k    1108         1087           1                 1          1.09
nut_003       6.6k     484          450            3                 1          1.41
radar12       71.1k    3875         3771           27                2.3        108.3
oc_wb_dma     29.2k    1775         1757           2                 3.5        5.7

Table 4.1. Guaranteed-initializability retiming applied to benchmarks.
original reset values of the registers were replaced with random bits, though in many cases,
even after multiple sets of random values were tried, the result was still initializable.
The results of applying our algorithm to both “s400” and the initial-state-randomized
designs are described in Table 4.1. The number of initial registers in each design is listed
in column Init Regs, and after unconstrained minimum-register retiming, this value was
reduced to the number of registers in column Infeas Regs. There existed no equivalent
initial states for these solutions. The guaranteed-initializable version was then applied, and
the number of additional registers (or, equivalently, the number of iterations) is listed in
the column Addl Feas Regs. Column Avg. |γ| is the average number of nodes in each of the
feasibility constraints. Runtime is the total runtime in seconds.
The randomization of the initial states likely results in more difficult problems than
would be generated in any actual design, and yet the optimal feasible retiming can be
found in a median runtime of a little over a second. The circuit “radar12” is the outlier
and presents a challenge due to its particular arithmetic structure.
The small average size of the feasibility constraints indicates that the conflicts that
prevent the existence of an equivalent initial state are indeed very local.
4.6 Summary
The contribution of this chapter is a new algorithm for computing a minimum-register
retiming that is guaranteed to have an equivalent initial state. This is a requirement for
the correct initialization behavior of the retimed design. The solution can be computed
either optimally or heuristically, and the approach appears to scale well to moderately sized industrial designs.
Chapter 5
Min-Cost Combined Retiming and
Skewing
In this chapter we discuss algorithms for simultaneously minimizing both the number
of registers in a circuit and the number of clock skew buffers under a maximum path delay
constraint.
We assume that the reader is familiar with both retiming and clock skew scheduling,
which are introduced in Sections 1.2.1 and 1.2.2, respectively. This chapter also utilizes
minimum-register retiming, such as was discussed in Chapter 2, though the material is not
predicated on that particular algorithm.
The chapter begins in Section 5.1 by defining the problem of joint register and skew
buffer minimization. We introduce a combined cost function that can be generalized to
different objectives, including dynamic power consumption. In Section 5.2, we discuss
previous work. Section 5.3 introduces a formulation that solves the joint optimization
exactly under linear cost functions. Because the solution of this problem is not scalable
to larger designs, we instead turn to the new heuristic algorithm described in Section 5.4.
Experimental results are presented in Section 5.5.
5.1 Problem
5.1.1 Motivation
Both retiming [8] and skew scheduling [9] are sequential optimizations with different
means of implementation that have the same objective: balancing the delay along long
combinational paths with adjacent shorter ones. Retiming relocates the structural position
of the registers in a design, and skew scheduling inserts intentional delays into the clock
distribution network to move the temporal position of the registers. The optimal minimum
delay of both techniques is bounded by the maximum mean cycle time of the worst-case
register-to-register delays.
There are costs associated with applying each of the two techniques. Retiming alters
the number of registers (in either direction), affecting the area, dynamic power, and the
other design metrics discussed in Section 2.1. Clock skewing requires the implementation
of a particular set of relative delays. These specific and non-uniform clock path delay
requirements impose a real challenge to clock network design. The implementation is usually
accomplished with carefully-planned additional wiring, buffers, or delay elements, and each
of these elements consumes more power and area.
We use the notion of a cost function to describe the value of a particular implementation
choice. Any or all of these metrics could be included in such a function, either quantitatively
or heuristically. For this reason, we treat cost as a very general and user-definable concept.
However, our focus is primarily on the dynamic power consumption of the registers and
skew elements that must be driven by the clock tree. Separately and secondarily, we also
consider the area of these cells. Whenever the concept of cost is visited, these two metrics
could be used by the reader as a concrete example of its potential use.
The problem that we consider in this chapter is how to select a combination of simultaneous retiming and skewing to meet the given delay constraint in such a way that the
desired cost function is minimized.
Figure 5.1. Costs of moving register boundary with retiming and skew on different topologies.
This approach is motivated by the observation that the cost of both sequential optimization
techniques has a strong dependence on the circuit topology, but in different and
often complementary ways. Within a single design, there are critical elements where performance can be improved more efficiently through retiming and others where skewing is
more suitable. When the two are used in combination, a performance improvement can be
obtained with less implementation cost than either in isolation.
Figure 5.1 describes a pair of examples illustrating this difference in cost. There are two
circuits in parts (i) and (ii) with different topologies; in both of these we desire to move
the register boundary forward in time and/or structure (such that the slacks on the fanout
paths are increased). The graphs below each circuit demonstrate the approximate clock
tree (power or area) cost of implementing this re-balancing with either retiming or skew.
In Figure (i), a circuit with a narrowing fanout can be retimed forward with an outright
reduction in cost; skewing is expensive. In Figure (ii), the transitive fan-out width grows
outward from the latch boundary; in this case, retiming is expensive but skewing is relatively
cheap. Moving the register boundary forward by some amount of time requires
fewer skew buffers in (ii) than in (i) due to the smaller initial number of registers.
The simultaneous application of retiming and intentional skew also has the advantage of
avoiding extreme solutions and the associated problems of either. While both clock skewing
and retiming are present in several commercial design tools, their role is typically
confined to small incremental resynthesis. The scope of the allowed change is local and
limited. Global optimization is specifically avoided because of the unpredictable difficulty
of implementing the extreme (i.e. globally optimal) solutions.
While the algorithms for pursuing optimal solutions are well understood, strategies for
backing off from extreme solutions to feasible intermediates are less developed. An example
is delay-constrained minimum-area retiming. Even if this technique were computationally
tractable for large designs, it gives no information about the shape of the cost curve or
the quality of nearby alternatives (such as was presented in Figure 3.10 at the expense of
significant computational effort). Especially in the context of a complete design flow, the
designer is left with little to no information about how to balance the extent of retiming with
other means of meeting the design specifications. Combining multiple techniques provides
exactly such a mechanism to back off from the extreme solutions of either.
Because the cost of a retiming movement can be negative when registers are shared,
it is possible to reduce the cost of a set of registers with retiming below their cost in the
original design. This makes it attractive, where possible, to relax performance in order to recover
registers. Even with an aggressive performance constraint, it may still be beneficial to
introduce timing violations with retiming and then correct for them with intentional skew–
if the cost of the additional skew buffers is outweighed by the reduction in registers. Figure
5.1 (i) could be one such example of this.
5.1.2 Definitions
First, we consider the exact formulation of a cost function to measure and minimize the
dynamic power consumption of the skew elements and registers in the clock network. We
assume that either or both retiming and clock skew scheduling have been applied.
Let Creg be the clock input capacitance of a register. The (weighted) number of registers
between two combinational nodes is wr (e); this is exactly the retimed register weight in
Section 1.2.1 and is defined in terms of the original register weight and some retiming lag
function r(v) that completely characterizes the retiming transformation.
We assume that skews are implemented at each register r’s clock input with a string of
delay buffers that produces the required relative skew τ(r). The input capacitance of each
one of these buffers is Cbuf and its delay dbuf . The periodicity of the clock can be used to
reduce the required delay to the fractional component of a clock cycle T .
The power after retiming (with a lag function r(v)) is expressed by Equation 5.1. The
power to implement a clock skew schedule τ (r) is expressed by Equation 5.2. Ptot is the
sum of these two quantities.
Pret = Creg Σ_{e∈E} wr(e)    (5.1)

Pskew = (Cbuf/dbuf) Σ_{r∈R} ( τ(r) − T ⌊τ(r)/T⌋ )    (5.2)

Ptot = Pret + Pskew    (5.3)
A similar set of cost functions can be described to capture the change in area. Let
Areg be the area of a register and Abuf be the area of a buffer. The total area cost Atot of
applying a (simultaneous) retiming transformation and skew is expressed by Equation 5.6.
Aret = Areg Σ_{e∈E} wr(e)    (5.4)

Askew = (Abuf/dbuf) Σ_{r∈R} ( τ(r) − T ⌊τ(r)/T⌋ )    (5.5)

Atot = Aret + Askew    (5.6)
If we ignore the utilization of periodicity, each of the above cost functions is linear in
the number of registers and the applied skew schedule.
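A small sketch of Equations 5.1 and 5.2 in Python, with illustrative (not benchmark-derived) capacitance and delay values:

```python
# Sketch of the power cost functions; all numeric values are illustrative.
from math import floor

def retiming_power(retimed_weights, C_reg):
    """Equation 5.1: clock power of the registers is proportional to the
    total retimed register count, the sum of wr(e) over all edges."""
    return C_reg * sum(retimed_weights)

def skew_power(taus, T, C_buf, d_buf):
    """Equation 5.2: each register r needs tau(r) mod T worth of delay
    buffers; the clock's periodicity lets whole cycles of skew be dropped."""
    return (C_buf / d_buf) * sum(tau - T * floor(tau / T) for tau in taus)

P_ret = retiming_power([1, 0, 2, 1], C_reg=1.0)                     # 4.0
P_skew = skew_power([0.25, 1.5, 2.0], T=1.0, C_buf=0.5, d_buf=0.1)  # 3.75
P_tot = P_ret + P_skew                                              # Eq. 5.3
```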
5.2 Previous Work
The parallels between clock skew scheduling and retiming have been recognized and
utilized by others.
The work of [56] describes the technique of continuous retiming, whereby the sequential
arrival times are calculated and then converted into a retiming lag function. The real-valued
arrival times can be computed quite efficiently using Bellman-Ford [57], and while this does
not provide an optimal solution to the discrete retiming problem, it often provides a very
good solution.
The Minaret algorithm [58] (discussed in depth in Chapter 3) and its min-delay variant
ASTRA [49] both utilize retiming-skew equivalence to bound the number of pair-wise delay
constraints that must be enumerated. This is accomplished by computing the as-late-as-possible (ALAP) and as-soon-as-possible (ASAP) skews, again using a
propagation of sequential arrival times via Bellman-Ford. A register in the circuit cannot be retimed across gates whose total delay exceeds those skews in either direction and still
yield a valid solution.
While these techniques utilize the similarities between skewing and retiming to simplify
the computational effort, as far as we are aware, this is the first work that motivates and
explores the implementation benefits of the simultaneous use and joint optimization of
retiming and skew.
5.3 Algorithm: Exact
We now construct an exact formulation to minimize the cost of the simultaneous application of retiming and skew under a set of simple longest-path delay constraints.
Retiming Component
In Section 1.2.1, we discussed a network ILP formulation of
retiming that required the enumeration of all pair-wise timing paths to incorporate delay
constraints. There is an MILP formulation of the delay-constrained problem that, while
not as efficient to solve directly, describes an equivalent problem. For this formulation, we
introduce a real-valued quantity R(v) : V → ℝ that describes the sequential arrival time
at each combinational node v. Note that our definition of the real-valued function R(v) is
different from the similar quantity described by [8] but is linearly related. This change will
be motivated shortly.
The constraints on R(v) and the retiming lag function r(v) are captured in Equations
5.7 to 5.9. As before, wi (e) is the initial register weight of each edge, and wr (e) is the
retimed weight. d(v) is the worst-case delay of combinational node v, and T is the overall
period constraint. With these constraints, the retimed circuit will satisfy: correct timing
propagation (Equation 5.7), non-negative register count (Equation 5.8), and correct setup
timing (Equation 5.9).
R(v) − R(u) ≥ d(v) − wr(e)T    ∀e = (u → v)    (5.7)
r(u) − r(v) ≤ wi(u → v)    ∀e = (u → v)    (5.8)
d(v) ≤ R(v) ≤ T    ∀v ∈ V    (5.9)

Clock Skew Component
Clock skew scheduling is typically formulated in terms of con-
straints and variables on the registers in the circuit. Enumerating the registers is convenient,
because one independent variable (i.e. its skew) can be created for each. We depart from
this traditional formulation and instead introduce one that is compatible with the MILP
retiming above. The MILP formulation can be reduced to the traditional skew scheduling problem by 1) assuming all retiming lags r(v) = 0 and 2) removing the constraint of
Equation 5.9.
Because all of the integer variables are fixed to zero, the resulting problem is a linear
program. Furthermore, determining the minimum feasible T is an instance of the maximum
mean cycle problem, for which several efficient algorithms exist [13].
The resulting solution will comprise a set of feasible sequential arrival times R(v), and
while there is not a one-to-one correspondence between these variables and the skews of the
registers in the design, the mapping is trivial. Given the sequential arrival time R(u) at the
output of node u, the necessary skew S on the k-th register (ordered topologically on edge
u → v) is as in Equation 5.10.
S(u → v, k) = R(u) − kT    (5.10)
In general, there are many possible skew schedules to meet a target period. Analogously
to delay-constrained minimum-area retiming, a schedule can be chosen that minimizes the
amount of total skew by adding an appropriate optimization objective. However, compared
to retiming, this is much less difficult; the minimum-cost schedule can be generated on the
graph using the Bellman-Ford algorithm [57].
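A minimal sketch of this relaxation: the difference constraints of Equation 5.7 (with lags fixed) are relaxed Bellman-Ford style to a least fixpoint, and Equation 5.10 then maps arrival times to register skews. The graph, delays, and iteration bound below are illustrative.

```python
# Sketch of Bellman-Ford relaxation of the sequential timing constraints.

def sequential_arrival_times(nodes, edges, d, T, iters=None):
    """Relax R(v) >= R(u) + d(v) - w(e)*T (Equation 5.7) together with
    R(v) >= d(v). edges is a list of (u, v, w) with w registers on u->v.
    Returns the minimal feasible R, or None if no fixpoint is reached
    (T below the maximum mean cycle); a real implementation would detect
    positive cycles more carefully."""
    R = {v: float(d[v]) for v in nodes}
    for _ in range(iters or len(nodes) * len(edges) + 1):
        changed = False
        for u, v, w in edges:
            lb = R[u] + d[v] - w * T
            if lb > R[v] + 1e-12:
                R[v] = lb
                changed = True
        if not changed:
            return R
    return None

def skew(R_u, k, T):
    """Equation 5.10: skew of the k-th register on edge u -> v."""
    return R_u - k * T

# Two nodes in a loop with one register on each edge; mean cycle = 2.0.
nodes = ['a', 'b']
edges = [('a', 'b', 1), ('b', 'a', 1)]
d = {'a': 3.0, 'b': 1.0}
R = sequential_arrival_times(nodes, edges, d, T=2.0)  # {'a': 3.0, 'b': 2.0}
```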
Combined Formulation
The problems of retiming and skewing can be combined into
a single MILP to minimize the cost when simultaneously employing both retiming and
skewing. In this combined problem, because the number and location of registers will vary
with the retiming, the skew component of the total cost is not a straightforward quantity.
Instead, we describe a set of variables cs (e) to capture the total sum of skews along any
edge e in the retiming graph. This is described in Equation 5.11.
cs(e) = Σ_{k=1..wr(e)} S(e, k)    (5.11)
This is a piecewise linear function of R(u) and wr (u → v). First, consider the case where
126
there are either zero or one registers present on the edge. If zero, the skew cost should also
be zero, and if one, the cost should be S(e, 1). This is realized in the linear constraint of
Equation 5.12, where β is arbitrary and larger than any R(v).
cs(u → v) ≥ β(wr(u → v) − 1) + R(u) − T    (5.12)
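The behavior of this big-M-style constraint can be checked numerically; the sketch below (with an assumed value for β) shows the bound becoming inactive when the edge carries no register and tight when it carries one:

```python
def skew_cost_bound(beta, w_r, R_u, T):
    """Right-hand side of Equation 5.12. With w_r = 0 the bound falls to
    -beta + R(u) - T, far below zero, so the constraint is inactive; with
    w_r = 1 it equals the skew R(u) - T of the single register on the edge."""
    return beta * (w_r - 1) + R_u - T
```

Because cs(u → v) is also constrained to be a cost (non-negative in the objective), the inactive case leaves cs free to settle at zero.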
In our experience, there seems to be little to no loss in optimality from retaining the restriction that wr(e) is at most one, but for completeness, it is possible to relax this constraint by introducing a set of M ordered binary indicator variables wr^1..M(e) to represent wr(e) via the constraints of Equations 5.13 and 5.14. The general expression of cs(e) that allows up to an arbitrary maximum of M registers to be retimed along each edge is Equation 5.15.

Σ_{j=1..M} wr^j(e) = wr(e)    (5.13)

0 ≤ wr^M(e) ≤ ... ≤ wr^2(e) ≤ wr^1(e) ≤ 1    (5.14)

cs(u → v) ≥ Σ_{j=1..M} [ β(wr^j(u → v) − 1) + R(u) − jT ]    (5.15)
Coupled with an objective that is a linear function of the variables (such as the ones
from Section 5.1.2), the program becomes a mixed integer linear program (MILP). The
result will be the optimal combination of retiming and skewing to minimize the given linear
cost. While complete, this formulation for minimum cost is of little practical use: it is
computationally intractable for all but the smallest circuits. A better approach is needed.
5.4 Algorithm: Heuristic
We describe a heuristic technique to minimize the cost of the joint application of retiming
and clock skew to meet a period target Ttarg . We refer to this general process as end-to-end
retiming in [59] because it visits:
127
1. both the min-delay retiming solution and min-register retiming, and
2. a continuous set of solutions between them
In contrast to retiming to a single solution with a specific goal (e.g. minimum-delay,
delay-constrained minimum-area), end-to-end retiming explores an entire spectrum of performance possibilities. Other than the endpoints, no guarantee is made that any single point
is exactly optimal in either register count or delay; in general, the path that is generated
will be suboptimal to varying degrees.
Each of these retiming solutions defines the retiming component of the joint skew-retiming solution. The remaining skew necessary to meet the performance target Ttarg can
be efficiently computed using Burns’ algorithm [60].
The motivation behind this approach is the value in having a complete performance/cost
curve available and the information it gives to optimize the desired cost function. A heuristic
solution chosen with knowledge of its alternatives is often more valuable than meeting the
performance target with a fixed but blindly chosen combination of the two optimizations,
even if each is applied optimally.
5.4.1 Incremental Retiming
The process of incremental retiming for delay is the primary engine for generating a sequence of possible retimings. In incremental retiming, the registers are only retimed over
a single gate at a time.
A heuristic incremental retiming for minimum delay that produces near-optimal results
but allows a wide variety of design constraints to be included in the problem is described by
[47]. Because the decision-making is not premised on a simplified timing model, the timing
information can be as accurate as needed, even including wire delays and other physical
information. The flexibility in including constraints is particularly powerful. The authors
of [47] concentrate on excluding solutions that violate physical constraints, but any move
that leads to a blow-up of any implementation cost can be similarly blocked or delayed.
128
If the optimal minimum-delay retiming is required, [48] proposes an elegant incremental
algorithm for finding the exact minimum-delay solution, even with non-uniform gate delays.
The drawback of this approach is its simplistic timing model.
We intentionally do not specify an exact recipe for choosing incremental moves, because
both of the above solutions offer different but useful tradeoffs of timing accuracy, computational effort, and optimality. It is also possible to devise other customized alternatives as
the application sees fit. Our only requirements are that:
1. Each register is retimed over no more than one gate per iteration.
2. The solution is legal after every iteration.
3. The objective of each move is to minimize the worst-case path delay.
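A minimal sketch of one such move on a register-weighted retiming graph (a forward retiming of a single node; the data structures and names are assumptions for illustration, not a prescribed engine):

```python
def forward_retime_move(w, node, fanins, fanouts):
    """Apply one incremental move: shift one register from every fanin
    edge of `node` to every fanout edge. `w` maps an edge (u, v) to its
    register count. The move is legal (requirement 2) only if every
    fanin edge carries a register, and each register crosses at most one
    gate (requirement 1). Returns True if the move was applied."""
    if not all(w[(u, node)] >= 1 for u in fanins[node]):
        return False
    for u in fanins[node]:
        w[(u, node)] -= 1
    for v in fanouts[node]:
        w[(node, v)] += 1
    return True
```

A driver satisfying requirement 3 would attempt such moves only on nodes along the current critical path.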
5.4.2 Overview
The overall algorithm is described in Algorithm 10 and illustrated graphically in Figure
5.2. The general progression consists of multiple applications of the supplied incremental
retiming engine from a few important starting points (which will be described shortly). At
each solution, the required skew is computed to meet the performance target Ttarg and the implied total cost is evaluated. This combination gives a new point along a performance/cost trade-off curve, along which the best solution(s) are retained for later use.
After a fixed number of incremental retiming steps, the optimal minimum-delay retiming
is (optionally) computed and examined. Then, the solution is retimed to the minimum-area solution and incremental delay retiming is applied for a number of iterations.
Starting from the original design, incremental delay retiming is applied. Once the first
application of incremental retiming for delay has reached its limit at the minimum-delay
solution, the set of points with better performance than the initial design has been fully
explored; however, this is only half of the space.
Next, minimum-register retiming is applied to generate a solution that has the exact
minimum number of registers of any retiming. The minimum-register retiming algorithm
129
Algorithm 10: End-to-end retiming: END2END()
Input : a sequential circuit G, a target period Ttarg, number of retiming steps k
Output: a retiming lag function r(v)
Output: a clock skew schedule τ(r)
// Original solution
Data: Gcur ← G
Data: rcur(v), rbest(v) : V → Z = 0
Data: τcur(r), τbest(r) : R → ℜ = computeskew(Gcur, Ttarg)
Data: costcur, costbest = cost(τcur, rcur)
// Incremental retiming
repeat
    Gcur, rcur = incremental retiming of Gcur
    τcur = computeskew(Gcur, Ttarg)
    costcur = cost(τcur, rcur)
    if (costcur < costbest) then
        costbest, rbest, τbest ← costcur, rcur, τcur
until k times
// Min-register solution
Gcur, rcur = min-register retiming of G
τcur = computeskew(Gcur, Ttarg)
costcur = cost(τcur, rcur)
if (costcur < costbest) then
    costbest, rbest, τbest ← costcur, rcur, τcur
// Incremental retiming from the min-register solution
repeat
    Gcur, rcur = incremental retiming of Gcur
    τcur = computeskew(Gcur, Ttarg)
    costcur = cost(τcur, rcur)
    if (costcur < costbest) then
        costbest, rbest, τbest ← costcur, rcur, τcur
until k times
return τbest, rbest
Figure 5.2. Overall progression of retiming exploration.
described in Chapter 2 is applied; it remains efficient even for the largest circuits. Because it is also canonical in the
number of registers, the result is not dependent upon the particular retiming supplied as
an input.
A second phase of incremental retiming is then applied until a period at least as small
as that of the original circuit has been recovered, at which point the algorithm terminates.
The netlist at exit does not generally correspond to the original, and no guarantee can be
made about the relative numbers of registers or the total cost. Empirically, after retiming
across the entire performance axis, a heuristic incremental method is typically not able to
reproduce the quality of the original and exits with a slight increase in register count. This
motivates the exploration in two segments, starting first from the initial netlist, as shown
in Figure 5.2.
The smoothness of the resulting curve follows from the restriction on the incremental
moves to be one gate at a time. The distance between two adjacent points on the retiming
curve can thus be guaranteed to be within a certain delay granularity g. If the delay model
is load-independent, then g can be bounded by Equation 5.16: the largest gate delay. If
the delay model is load dependent, that is, d(v) ≈ dintrinsic (v) + dload (v)Cload (v), the bound
becomes Equation 5.17.
g = max_{∀v∈V} d(v)    (5.16)

g = max_{∀v∈V} [ dintrinsic(v) + dload(v)·∆Cload(v) ]    (5.17)
In almost all cases, the total change in load capacitance along any path is very small;
however, if necessary, the maximum change in capacitive load can be bounded by limiting the number of registers that can be retimed in any iteration.
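The bounds of Equations 5.16 and 5.17 amount to a maximum over the gates; a trivial sketch (the field names of the gate records are illustrative assumptions):

```python
def delay_granularity(gates, load_dependent=False):
    """Bound g on the delay step between adjacent points of the retiming
    curve. Each gate is a dict with key 'd' (fixed delay, Eq. 5.16) or,
    for the load-dependent model of Eq. 5.17, keys 'd_int', 'd_load',
    and 'dC' (the change in load capacitance caused by one move)."""
    if load_dependent:
        return max(g["d_int"] + g["d_load"] * g["dC"] for g in gates)
    return max(g["d"] for g in gates)
```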
5.5 Experimental Results
End-to-end retiming is applied to a set of industry-supplied and academic designs. All
benchmarks were first pre-optimized using the ABC logic synthesis package [3]. The timing
data was extracted using a full table-based slew and load-aware timing analysis, and this
model was used in an incremental min-delay retiming algorithm similar to [18]. The maximum mean cycle times were measured and used as the target performances. The following
experiments were conducted on a set of 64-bit 2.33GHz Pentium Xeon computers.
First, the combined retiming/skewing obtained from end-to-end retiming is compared
against the optimum solutions obtained from the exact formulation to illustrate the tradeoff
between optimality and runtime. Because of the limitations of the exact method, only the
smallest of the ISCAS benchmarks, solvable as a MILP within an hour of runtime, were
used. The results of four of the largest designs are presented in Table 5.1. In half of these
cases, the optimal minimum-cost solution was found with heuristic end-to-end retiming,
though the average runtime was over two orders of magnitude faster.
Next, end-to-end retiming was applied to a set of larger benchmarks [38]. Our technique
was used to minimize both the dynamic power consumption of the clock endpoints (Equation
5.3) and the total area of the required registers and buffers (Equation 5.6). The power, area,
Name    | Exact Min-Power     | Heuristic Min-Power
        | Area     Runtime    | Area     Runtime
s349    | 2.80e3   49.0s      | 3.00e3   0.07s
s526n   | 3.70e3   34.0s      | 4.14e3   0.03s
s1196   | 2.92e3   1.5s       | 2.92e3   0.04s
s1423   | 1.25e4   2.3s       | 1.25e4   0.14s

Table 5.1. Runtime and quality of exact and heuristic approaches.
Name             Orig   Sk-Only  Ret-Only  Comb   %Improv
mux32 16bit       7.7     8.3      10.5     8.3     0.0%
mux64 16bit      15.1    15.9      23.2    15.2     4.4%
mux8 128bit      16.6    20.1      24.6    15.2    24.4%
mux8 64bit        8.3    10.1      12.3     7.9    22.3%
nut 000           4.7     7.5       5.7     5.6     2.6%
nut 001           7      19.2      13.8    10.8    21.7%
nut 002           2.4     3.4       3.3     3.1     4.6%
nut 003           3.8     4.2       4.4     3.8     7.9%
oc ata ocidec2    4.4     4.7       4.7     4.5     4.0%
oc ata v 3        2.3     3         2.8     2.7     4.6%
oc cordir p2r    10.4    57.7      38.6    22      43.0%
oc hdlc           6.1     6.1       6.2     5.5    10.1%
oc minirisc       4.2     4.3       4.3     4.2     0.7%
oc mips          18.1    41.7      25.5    20.8    18.4%
oc oc8051        10.9    10.9      11.1    10.7     1.8%
oc pavr          17.7    34.1      26.5    23.3    12.1%
oc pci           19.5    21.8      21.2    20.3     4.2%
oc vga lcd       16      16.4      18.1    15.9     3.0%
oc wb dma        25.6    27.7      27.4    26.1     4.7%
os blowfish      12.8    28.3      20.2    14      30.7%
radar12          55.8    74.8      59.5    59.4     0.2%
AVERAGE                                            10.7%

Table 5.2. Power-driven combined retiming/skew optimization.
and skew buffer delay values were taken from the GSC 0.13um standard cell library provided
with [39]. Short-path timing was not an issue in these designs.
The power-driven results are presented in Table 5.2. The dynamic clock power consumption of the original design (before delay optimization) is listed in column Orig. The
sequential elements in the circuit were then optimized to meet the minimum feasible delay
using three different methods: only skewing, only retiming, and a combined application of
retiming and skewing computed with our algorithm. These results are listed under columns
Name             Orig     Sk-Only  Ret-Only  Comb     %Improv
mux32 16bit      8.64E+4  8.88E+4  1.14E+5   8.68E+4    2.3%
mux64 16bit      1.70E+5  1.73E+5  2.58E+5   1.64E+5    5.2%
mux8 128bit      1.87E+5  2.02E+5  2.72E+5   1.68E+5   16.8%
mux8 64bit       9.38E+4  1.01E+5  1.36E+5   8.55E+4   15.3%
nut 000          5.28E+4  6.43E+4  6.44E+4   6.24E+4    3.0%
nut 001          7.84E+4  1.29E+5  1.15E+5   1.03E+5   10.4%
nut 002          2.76E+4  3.14E+4  3.66E+4   2.90E+4    7.6%
nut 003          4.29E+4  4.36E+4  4.98E+4   3.88E+4   11.0%
oc ata ocidec2   4.91E+4  9.64E+4  5.30E+4   4.89E+4    7.7%
oc ata v 3       2.54E+4  9.64E+4  3.06E+4   2.82E+4    7.8%
oc cordir p2r    1.17E+5  3.14E+5  2.90E+5   2.53E+5   12.8%
oc hdlc          6.90E+4  6.89E+4  7.00E+4   6.11E+4   11.3%
oc minirisc      4.68E+4  4.72E+4  4.72E+4   4.72E+4    0.0%
oc mips          2.04E+5  3.02E+5  2.87E+5   2.30E+5   19.9%
oc oc8051        1.22E+5  1.22E+5  1.24E+5   1.21E+5    0.8%
oc pavr          2.00E+5  2.68E+5  2.98E+5   2.55E+5    4.9%
oc pci           2.19E+5  2.29E+5  2.38E+5   2.24E+5    2.2%
oc vga lcd       1.80E+5  1.81E+5  2.04E+5   1.76E+5    2.8%
oc wb dma        2.88E+5  2.96E+5  3.08E+5   2.92E+5    1.4%
os blowfish      1.44E+5  2.09E+5  1.79E+5   1.43E+5   20.1%
radar12          6.28E+5  7.05E+5  6.54E+5   6.53E+5    0.2%
AVERAGE                                                 7.8%

Table 5.3. Area-driven combined retiming/skew optimization.
Sk-only, Ret-only, and Comb, respectively. In the few cases where retiming alone was not
able to meet the target (due to the discrete delays of the gates), the difference was corrected with a small amount of skew, which was included in that cost. Finally, the column %Improv indicates the improvement in power using our technique over the best of either
the skew-only or retiming-only solution. On average, the combined solution results in 10.7%
less dynamic clock power consumption.
The area-driven results are presented in Table 5.3. The columns are identical in meaning to those of the power-driven version, but here we measure the total number of layout unit squares
required for the registers and clock skew buffers. The average reduction in area using the
combined optimization is 7.8% better than the best solution of either retiming or skewing
alone.
This is not an entirely fair way to compare the optimizations, however. As the size of the original
Name             Ttarg/Torig  Power %Improv  Area %Improv  Runtime (s)
mux32 16bit         0.37          0.0%           2.3%            2.8
mux64 16bit         0.36          4.4%           5.2%            9.3
mux8 128bit         0.38         24.4%          16.8%           10.4
mux8 64bit          0.38         22.3%          15.3%            2.7
nut 000             0.46          2.6%           3.0%            2.3
nut 001             0.49         21.7%          10.4%           16.7
nut 002             0.41          4.6%           7.6%            1.2
nut 003             0.65          7.9%          11.0%            2.6
oc ata ocidec2      0.93          4.0%           7.7%            5.1
oc ata v 3          0.93          4.6%           7.8%            5.8
oc cordir p2r       0.71         43.0%          12.8%          149.8
oc hdlc             0.93         10.1%          11.3%           16.3
oc minirisc         0.92          0.7%           0.0%            3.9
oc mips             0.95         18.4%          19.9%          380.9
oc oc8051           0.95          1.8%           0.8%          164.9
oc pavr             0.82         12.1%           4.9%          292
oc pci              0.55          4.2%           2.2%           90.1
oc vga lcd          0.71          3.0%           2.8%          293.4
oc wb dma           0.87          4.7%           1.4%          614.7
os blowfish         0.53         30.7%          20.1%          165.9
radar12             0.51          0.2%           0.2%         2337
AVERAGE                          10.7%           7.8%

Table 5.4. Results summary.
design is fixed by functionality and timing, it is really only this increase that we are targeting
for improvement. Alternatively, we can compare the difference in additional power or area
required to improve the worst-case delays from the original values to the tighter targets. On
average, meeting the delay target with combined retiming/skewing required over 131% less
additional area and 79% less additional dynamic power than the best of either technique in
isolation. In many cases, the faster period was met using less area and/or power than the
slower original design (resulting in the >100% reduction in average additional area). This
is due to the improvement in area from the application of the technique in Chapter 2.
Table 5.4 summarizes both the power- and area-driven results. Here, we also specify
the target delay constraint that was used. Column Ttarg /Torig expresses this as a fraction
of the original period, indicating the relative aggressiveness of the delay optimization. The
column Runtime measures the total runtime of the heuristic in seconds: this was identical
for both the area- and power-driven optimizations.
Figure 5.3 illustrates the cost curve generated by end-to-end retiming in more detail for
two benchmarks. The two regions represent the relative contributions of skew buffers and
registers to the total dynamic power consumed on the leaves of the clock tree. The circuit
“nut 003” is a case where the minimum-cost solution lies in the portion of the retiming
curve revealed by minimum-register retiming. The aggressive target timing (Ttarg = 24.0)
can be met with the minimum dynamic power consumption by first retiming to a slower
period (T = 49.0) and then recovering the performance with clock skew. The increase in
additional skew buffers is outweighed by the decrease in the number of registers.
The balance between retiming and skewing in the minimum cost solution was highly
design dependent. In some of the benchmarks, the result consisted of either maximally
retiming or maximally skewing (as is the case in “oc pavr” and “nut 003”, respectively, in
Figure 5.3). However, in general, the minimum-cost solution utilized instances of both.
5.6 Summary
The technique efficiently explores a smooth set of retiming solutions between the
minimum-register and minimum-delay retiming solutions. Because the cost of retiming
is unpredictable a priori, combined retiming and skewing allows an informed decision about
the best balance between the two optimizations. This was used to minimize the dynamic
power in the clock tree. For the set of benchmarks examined, the total dynamic power
consumption of the clock tree endpoints was reduced by an average of 10.7%.
Figure 5.3. Dynamic power of two designs over course of optimization.
Chapter 6
Clock Gating
In this chapter we examine an algorithm for introducing clock gating, a power- and area-optimization technique whereby the clock signal is selectively propagated to subgroups of registers in the design.
Because clock gating represents a fundamentally different optimization technique than
retiming and clock skew scheduling, the content of this chapter is fairly self-contained. The
necessary background will be introduced herein.
The chapter begins by introducing and motivating the technique of clock gating. The
existing approaches to computing clock gating conditions are summarized in Section 6.2.
Our new simulation- and SAT-based approach is described in Section 6.3. Experimental results
are presented in Section 6.5.
6.1 Problem
Clock gating inserts conditions on the propagation of a clock transition to one or more
registers in the design. By limiting any unnecessary switching, the dynamic power required
to charge and discharge the capacitive load of the register inputs is reduced. The capacitance
of a large group of registers is thereby shielded behind the smaller capacitance of a single
clock gate.
The condition under which a clock transition is inhibited is known as the gating condition, clock disable, or activation function. In general, this function G may be sequential and
dependent on variables from previous time frames. Architectural implementations of clock
gating often implicitly make use of this property to implement more powerful conditions
(e.g. through the use of a dedicated low-power controller). In this work, we restrict the
problem to the combinational version.
We also focus on the application of clock gating to a netlist-level circuit. All of the
functionality is assumed to have been decomposed into a set of standard cells. Physical
information may be available.
The problem is how to find a function G that preserves functionality, maximizes the
power savings, minimizes the perturbation to the netlist, and can be identified and synthesized in a manner that is scalable to large designs.
6.1.1 Motivation
The dynamic switching of the clock network typically accounts for 30-50% of the total
power consumption of a modern design, and with the proliferation of low-power requirements
and thermal limitations, minimizing this total is imperative. One of the most effective and
widely adopted techniques is clock gating, whereby the clock signal is selectively blocked
for registers in the design that are inactive or do not otherwise need to be switched.
6.1.2 Implementation
Two example implementations of clock gates are depicted in Figure 6.1. If the gating
condition is true (here, G = g1 ∨ g2), the clock will be blocked from passing through the
clock gate. If G is monotonic or its transitions are strictly confined to one half of the
clock cycle, the gating can be implemented with only one logic gate (e.g. Figure 6.1(i)).
Otherwise, the gating condition must be latched as in Figure 6.1(ii) to prevent glitches from
being propagated onto the clock line. Glitches are undesirable because of both the extra
dynamic power required and the potential change in the sequential behavior of the circuit.
In the circuit of Figure 6.1(ii), any glitch is either filtered by the controlling logic value of
the clock or by the non-transparency of the latch; each cycle of the clock will be either fully
propagated or completely constant. Many standard cell libraries include a merged gate and
latch, often referred to as a clock gating integrated cell (CGIC).
Figure 6.1. Clock gating circuits.
6.2 Previous Work
The most common clock gating approach is to identify architectural components that
can be deactivated and to explicitly design the control logic of the gating signal. However,
the benefits of gating can also be extended to very local sections of the circuit and small
clusters of registers. Utilizing clock gating on this finer level of design abstraction requires
automatic synthesis to be practical.
6.2.1 Structural Analysis
The most straightforward automatic synthesis of useful clock gating conditions relies
solely on structural analysis. A typical implementation of structural gating involves using
either existing synchronous enables or detecting multiplexors at the input of a register that
implement synchronous-enable-like behavior. These two structures are illustrated in Figure
6.2(i) and (ii), respectively.
Figure 6.2. Opportunities for structural gating.
The advantage of structural methods is that the runtime is low and grows only linearly
with the number of registers in the circuit. Only a small region local to each register is
examined for specific patterns that imply the ability to gate the register.
The disadvantage is that the limitation on the utilized gating functions is unnecessarily
strict and may miss significant potential for additional savings. A simple example is illustrated by Figure 6.3: each of the pair of registers may be gated by the function g. This is
demonstrated in the truth table (ii). The columns G_R1^max and G_R2^max describe the cases where
it is safe to gate each of the respective registers. Note that there is no self-feedback loop
from the register outputs to the register inputs. These missed gating opportunities can be
caught by using a stronger functional analysis.
6.2.2 Symbolic Analysis
Even if there is not a physical signal whose structure indicates that a register can be
gated, it is possible to compute and analyze the next state function of a register to generate
a functional description of when it is safe to gate it.
The methods of [61, 62] use symbolic representations to directly compute the conditions under which a register does not switch. This requires generating a BDD [63] for
the next state function of a register. Unfortunately, the limits of such symbolic functional
manipulation are often below the size of even moderate designs. This problem is further
Figure 6.3. Non-structural gating.
compounded by the need to find conditions that are able to gate multiple registers simultaneously, requiring that multiple next state functions are kept in memory. The grouping
of gated registers may not even be known a priori; it may be desirable to consider all of
them simultaneously. Constructing the BDDs for the entire circuit is expensive and often
impossible.
Figure 6.4. Unknown relationship between BDDs and post-synthesis logic.
Once (and if) a gating condition can be derived symbolically, it must be implemented
in the netlist. This requires a general synthesis method to implement the BDDs as mapped
logic. A strong disadvantage to this technique is that this general synthesis may result in
an unknown amount of additional logic, as is suggested by Figure 6.4. Even if the physical
design is not disrupted by the additional area and wire requirements of this hardware,
its dynamic power consumption eats away from the power-saving benefits that it seeks to
provide.
It is therefore necessary to prune the coverage of the function to save on implementation
cost, but determining a good balance is a difficult synthesis problem. When timing must
also be considered, the complexity is increased further.
6.2.3 RTL Analysis
There are also clock-gating techniques that target higher-level descriptions of a system
than a netlist. One example is described in [64] and operates on an RTL description of a
system. A second example is the industrial tool PowerPro from Calypto [65].
Because the design representation is abstract, an RTL-level analysis may facilitate a
functional analysis that is more powerful and deeper than is possible with a finer abstraction. For example, it is often possible to identify idle cycles at the beginning of a pipelined
computation (sometimes known as pipeline “bubbles”) and use these to gate the subsequent registers. This analysis requires reaching across multiple clock cycles and complex
arithmetic components and is not suited to a gate-level approach.
The critical disadvantage to an RTL-based approach is the lack of information about
design timing, placement, or logic implementation. It is not possible to back-annotate or
predict this information with much accuracy and still reorganize the RTL netlist. This
makes the consequences of clock gating on important metrics such as timing and area (and
even power) difficult to predict. The same problem suffered by symbolic gating techniques
is faced to an even greater degree by RTL-based methods.
Ideally, both RTL and netlist level clock gating have an important place in a low-power
design flow. The new algorithm described in this chapter is not meant to replace other
levels of analysis so much as to complement them.
6.2.4 ODC-Based Gating
The work of [62] leverages another opportunity to gate unnecessary clock transitions.
Gating may be applied not only when the state of a register does not change but also when a change that does occur is never observable at a primary output.
This presents an opportunity to gate a register that is distinct from whether it is
switching or not. The ODC-based techniques can be applied in parallel with ones that
predict switching. A combined algorithm for capturing both observability and switching
is of interest and in development, though it will not be discussed within the scope of this
work.
6.3 Algorithm
We examine the automatic synthesis of combinational gating logic for netlist-level circuits and propose an approach that addresses the dual problems of gating condition selection
and synthesis by constructing these functions out of signals in the existing logic network.
While this is less flexible than the synthesis of an arbitrary function, the result is still quite
good and, importantly, scalable to large designs with very predictable results. We also discuss how necessary constraints on the placement and timing of the design can be included
in the problem. These allow control over the resulting netlist perturbation.
In particular, the algorithm introduced in this chapter seeks to maximize the power
savings by finding a gating condition GR for each register R such that GR is the disjunction
of up to M literals, as described in Equation 6.1.
GR = ∨_{j=1..M} gj(x)    (6.1)
6.3.1 Definitions
We model a circuit to be clock gated as a hypergraph whose nodes are either single-bit registers or single-output combinational logic nodes. The combinational nodes may
implement any arbitrary functions.
If there are multiple clock domains, each group of similarly-clocked registers must be
gated separately. While the same gating conditions can be used in multiple clock domains,
the clock gates themselves can not be shared. The net costs of implementing these gating
opportunities are therefore independent, and it is desirable from a complexity perspective
to treat the problems independently.
Let x be the set of external inputs and current state variables. xR is the current state
of register R, and FR (x) is its next state function.
Let fn(x) be a function of the external inputs and current state variables that is implemented at some circuit node n’s output. We define a literal gn(x) to be either fn or its complement ¬fn. The
set of literals is the set of functions implemented at node outputs and their complements.
The support of function fn (x), support(fn ), is the subset of x (the primary inputs and
current state variables) on which the function has an algebraic dependence. Structurally,
this implies that node n lies in the transitive fan-out of x: support(fn) = {x : n ∈ TFO(x)}.
To maintain the functional correctness of the circuit, each register’s gating condition
GR (x) must only be active when the register does not change state. This functional correctness condition is described by Equation 6.2.
GR(x) ⇒ ¬(FR(x) ⊕ xR)    (6.2)

If GR(x) = ¬(FR(x) ⊕ xR), then it is the unique maximal complete gating condition; otherwise it is incomplete. The complete condition captures every set of inputs for which the register does not switch. Because the timing requirements of the clock gate typically necessitate that GR is available earlier than FR, it is desirable to find an incomplete gating
condition that can be generated early in the clock cycle with maximal coverage and minimal
implementation cost. Furthermore, the maximal gating condition is typically not useful for gating multiple registers, and power considerations usually dictate choosing a condition that is incomplete but correct for multiple registers.
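For a single register, the maximal complete condition is simply the truth table of "next state equals current state"; a toy enumeration (the function and argument names are illustrative assumptions):

```python
from itertools import product

def maximal_gating_condition(next_state, n_inputs, x_index):
    """Truth table of the unique maximal complete gating condition for a
    register R: true exactly when F_R(x) equals the current state x_R,
    i.e. when the register does not switch."""
    return {v: int(next_state(v) == v[x_index])
            for v in product((0, 1), repeat=n_inputs)}
```

For a register with a synchronous enable, this enumeration recovers the expected condition: the gate is safe whenever the enable is low, and also whenever the new data happens to equal the held value.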
We define two probabilistic quantities of interest. Given a set of simulation vectors vn^1..i for net n, the signal probability P^signal(n) and switching probability P^switch(n) are defined by Equation 6.3 and Equation 6.4, respectively.
P^signal(n) = (1/i) Σ_{j=1..i} { 1 if vn^j, 0 otherwise }    (6.3)

P^switch(n) = (1/(i−1)) Σ_{j=1..i−1} { 1 if vn^j ⊕ vn^{j+1}, 0 otherwise }    (6.4)
The concepts of signal and switching probability can be extended to functions that are not themselves present in the netlist but describe combinations of physical nets.
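Both estimates are straightforward to compute from a simulation trace of 0/1 values; a minimal sketch:

```python
def signal_probability(vectors):
    """P_signal(n): fraction of simulation vectors in which net n is 1
    (Equation 6.3)."""
    return sum(vectors) / len(vectors)

def switching_probability(vectors):
    """P_switch(n): fraction of adjacent vector pairs in which net n
    toggles (Equation 6.4)."""
    toggles = sum(1 for a, b in zip(vectors, vectors[1:]) if a != b)
    return toggles / (len(vectors) - 1)
```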
6.3.2 Power Model
The power that is saved by implementing a set of gating signals G for some set of
registers RG ⊆ R (where Gr is the signal used to gate each register r) is approximated by
Equation 6.5. This quantity is a function of (i) the probability that Gr disables a given
clock, P(Gr), (ii) the number of registers gated, and (iii) the capacitances of the register clock inputs and of the clock gate, Cr and Ccg respectively. It is assumed that the clocks
of each register are always switching, but if this is not the case (perhaps due to existing
clock gating), these cases can be probabilistically excluded from P(Gr ). The registers in
the design need not all have identical clock input capacitances.
Psavings = Σ_{∀r∈RG} Cr·P(Gr) − Σ_{∀ unique G} Ccg    (6.5)
Typical values of Ccg and Cr imply that the gated clock signals must be shared amongst
multiple registers, for each of which the corresponding gating condition must be valid. This
motivates a global approach to the clock gating synthesis problem, where multiple registers
are considered at the same time.
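Equation 6.5 can be evaluated directly for a candidate grouping. In the sketch below, `groups` maps each unique gating signal to its disable probability and the clock-input capacitances of the registers it gates (this representation is an assumption for illustration):

```python
def power_savings(groups, C_cg):
    """Approximate clock-power savings of a gating assignment (Eq. 6.5).
    Each gated register r saves C_r * P(G_r), while each unique gating
    signal adds one clock gate of input capacitance C_cg."""
    saved = sum(p * sum(caps) for p, caps in groups.values())
    return saved - C_cg * len(groups)
```

A grouping is only worthwhile when the savings term outweighs the per-gate cost, which is why conditions shared across many registers are preferred.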
Additional dynamic power may be dissipated in the logic network by increasing the
fan-out loads of the literals used to generate a gating condition, but this is typically much
smaller than the power saved in the clock network. We restrict our power analysis to only
the clock network.
6.3.3 Overview
Our approach is based on the combination of simulation and SAT-checking. This algorithmic duo has proven to be incredibly powerful in several contexts, and it is useful in clock
gating synthesis as well. Random simulation quickly identifies most of the signals in the
design as being useless for gating a given register, and a SAT solver is used to conclusively
prove the functional correctness of those that remain.
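The two-stage filter can be sketched for a single register as follows. Here exhaustive enumeration stands in for the SAT call so the example is self-contained; the names and data structures are assumptions, not the actual implementation:

```python
from itertools import product

def correct_gating_literals(literals, next_state, x_index, sim_vectors, n_inputs):
    """Two-stage candidate filter for one register. Stage 1 drops a
    candidate literal the first time simulation sees it at 1 while the
    register switches; stage 2 proves that each survivor implies the
    register holds its value, over all input assignments (a toy stand-in
    for the SAT check)."""
    def switches(v):
        return next_state(v) != v[x_index]
    # Stage 1: cheap pruning by (random) simulation
    survivors = {name: f for name, f in literals.items()
                 if not any(f(v) and switches(v) for v in sim_vectors)}
    # Stage 2: complete proof that g => (register does not switch)
    return {name for name, f in survivors.items()
            if all(not (f(v) and switches(v))
                   for v in product((0, 1), repeat=n_inputs))}
```

In the real flow, stage 2 asks a SAT solver whether v ∧ (FR(x) ⊕ xR) is satisfiable; unsatisfiability proves the literal correct.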
The overall steps of our technique are summarized in Algorithm 11 and described in the
following sections in sequential order.
Algorithm 11: Simulation and SAT-based Clock Gating: SIMSAT CLG()
Input : a sequential circuit graph <V, E>
Output: a correct gating condition GR for each register R, if one exists
let R be the set of registers
forall r ∈ R do
    collect the candidate literal set Vcand(R)
    repeat
        prune every literal v ∈ Vcand(R) for which v ∧ (FR(x) ⊕ xR) is observed
    until k simulation steps
    prove for each remaining literal v that v ⇒ ¬(FR(x) ⊕ xR)
    create disjunctive candidate sets ⊂ Vcand(R)
select a subset of disjunctive sets G to cover registers R
return selected gating conditions
6.3.4 Literal Collection
The first step consists of extracting a set of candidate nodes for the inputs of the gating
signal GR , for each register R. Inclusion in the candidate set is not (yet) meant to imply
correctness. All node functions and their complements could initially be considered as
candidates for each register, but it is useful and necessary to immediately narrow the set
by removing nodes that violate timing, physical, or structural constraints.
Timing Constraints The added delay of the clock gating logic and the clock gate itself
dictate that every candidate literal be available earlier than the latest register input. This
can be expressed by a timing constraint as in Equation 6.6: ag is the latest arrival time at
g, dgate is the delay through the clock gate, and rclk is the required time at which the clock
must be gated. If all of the times are relative to the same clock, this can be expressed in
terms of the period T .
ag + dgate ≤ rclk        (6.6)
ag + dgate ≤ T − Sclk        (6.7)
We also introduce a term Sclk , the setup time of the clock gate. The meaning of this
quantity is identical to the setup time at a register input. While delay is typically measured
between the mid-points of two transitions (which is what has been assumed here), the clock
gating condition must not still be in transition (either at its input or internally
within the clock gate) when the clock edge arrives. The partially-switched enable transistor
would introduce additional slew onto the clock line: this is typically unacceptable. Quantity
Sclk therefore pads the slack to ensure that the clock gating condition is fully “setup”.
The consequence of the timing constraints is that it is only necessary to select from
amongst nodes that will be available early enough to meet the timing requirements. The
literals can be further subdivided into groups that restrict exactly how they can be used
in a gating condition (e.g. only directly, complemented, in a disjunctive set, etc.). This is
illustrated in Figure 6.5.
Figure 6.5. Timing constraints based upon usage.
Physical Constraints It is undesirable to route gated clock signals over large distances,
as the resulting wire propagation delay may produce a late-arriving gating control
signal and a timing violation. These long wires may also unnecessarily complicate routing.
Constraints between the proximity of the candidate gating literal and the gated registers
are therefore necessary. To this end, we introduce distance constraints of the form of
Equation 6.8. (xgr , ygr ) is the placed location of the candidate literal and (xr , yr ) the
location of a register. dmax is the proximity constraint, here in terms of L1 distance.
distL1(gr, r) = (|xgr − xr| + |ygr − yr|) ≤ dmax        (6.8)
The result is to limit the region from which the literals used in gating conditions are
selected to one that is local to the register(s) to be gated. This is illustrated in Figure 6.6.
An important effect of the distance constraints is to bound the number of literals that
are considered for any register in the design. In practice, the size of the die is sufficiently
larger than the maximum allowable separation, and this is a strong constraint on the literal
Figure 6.6. Distance constraints.
count. This permits a linear worst-case bound on the runtime of the literal collection: O(|R|),
where R is the set of registers to be gated.
With a model to estimate wire delays from pin locations, it is possible to implicitly constrain the proximity of the gating logic and the gated register using the timing constraints.
A routing estimator may provide a function dwire (pdriver , pload , P ′load ), whose inputs are
pdriver , the location of driving pin, pload , the location of the load pin of interest (e.g. the
clock gate), and P ′load , the locations of the other load pins. Including this information into
the timing constraint of Equation 6.6, we have something of the form of Equation 6.9. This
physically-aware timing constraint is now a function of location.
agr + dgate + dwire(pgr, pgate, P′fanout(gr)) ≤ rclk        (6.9)
Structural Constraints The candidate literals can be restricted to those whose structural support is partially common to the next state function or includes the register output.
This is implied by Equation 6.2. If this were not the case for a candidate literal, it could not
possibly have any predictive value in determining whether the register might switch. (Unless
the register always switches.)
A tighter structural constraint can be applied if it is known whether the register contains
its output in the support of its next state function (i.e. xR ∈ support(FR(x))). If runtime
considerations require bounds to be placed on structural traversal – such that it cannot be
determined whether a register has any self-feedback path – it is conservative to assume that
no such loop exists.
(support(g) ∩ support(FR(x)) ≠ ∅) ∧ (xR ∈ (support(FR(x)) ∪ support(g)))        (6.10)

(support(g) ∩ support(FR(x)) ≠ ∅) ∧ { 1 if xR ∈ support(FR(x)); xR ∈ support(g) otherwise }        (6.11)
Signal and Switching Probability Constraints As it is only possible to gate a register
when it does not switch state (assuming there is no knowledge of its future observability),
those that are frequently switching are not likely to offer many possibilities to gate their
clocks. To reduce the runtime of the gating optimization, it may be desirable to exclude such
registers from consideration. Given the switching probability of a register output RQ , we
can ignore those that exceed some maximum frequency smax .
Pswitch(RQ) > smax        (6.12)
Similarly, literals that are not often true are not likely to gate the clock with much
frequency. While these would probably never be selected by the optimization technique, it
may be beneficial to exclude them from the beginning to reduce runtime. Given the signal
probability of a literal g, we ignore those that fall below some threshold pmin .
Psignal(g) < pmin        (6.13)
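The constraints of this section compose into a single filtering pass over the candidate literals of each register. The following Python sketch illustrates that composition; all parameter names and data structures are hypothetical, and the structural-support check (Equations 6.10–6.11) is omitted for brevity since it requires netlist traversal machinery.

```python
def collect_candidates(literals, reg, T, s_clk, d_gate, d_max, p_min,
                       arrival, loc, p_signal, p_switch, s_max):
    """Filter candidate gating literals for one register (Section 6.3.4).

    arrival[g] is the latest arrival time a_g, loc[x] is a placed (x, y)
    location, and the probabilities come from simulation.
    """
    # Eq. 6.12: skip frequently-switching registers entirely.
    if p_switch[reg] > s_max:
        return []
    rx, ry = loc[reg]
    keep = []
    for g in literals:
        # Eq. 6.7: the gating condition must be fully "setup" before
        # the clock edge, including the clock-gate setup time S_clk.
        if arrival[g] + d_gate > T - s_clk:
            continue
        # Eq. 6.8: L1 proximity between the literal and the register.
        gx, gy = loc[g]
        if abs(gx - rx) + abs(gy - ry) > d_max:
            continue
        # Eq. 6.13: rarely-true literals gate few clock edges.
        if p_signal[g] < p_min:
            continue
        keep.append(g)
    return keep
```

In practice each predicate also bounds the candidate count, which is what yields the linear overall collection runtime discussed above.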
6.3.5 Candidate Pruning
Because the sum of a set of terms satisfies the correctness condition (Equation 6.2) only
if each term satisfies it, a literal is only useful in a gating condition if it itself is a correct
gating signal. Therefore, each literal is only kept as a candidate if it is not inconsistent
with this condition. Simulation is applied in several passes to prune the set of candidate
literal/register pairs. The pruning passes are quite fast and effective, and if any literal is
found to violate the correctness condition, it is immediately removed from consideration.
Besides generating counterexamples to the correctness condition for candidate literals,
simulation provides a probabilistic estimate of the number of unnecessary clock transitions
that each legal literal g will block, P(g). This provides more accurate information than
assuming that the size of the Boolean ON-set of a gating condition correlates to its actual
ON-probability.
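The pruning pass can be sketched with bit-parallel simulation semantics, where each machine word packs 64 simulated cycles. The evaluator callbacks below are hypothetical stand-ins for a real netlist simulator; the structure of the loop is what matters.

```python
def prune_by_simulation(candidates, eval_literal, eval_next, eval_state,
                        num_words):
    """Prune candidate literals for one register by random simulation.

    eval_literal(g, w), eval_next(w), and eval_state(w) return 64-bit
    words of simulated values of literal g, the next-state function
    F_R(x), and the register output x_R for word index w. Returns the
    surviving literals with a count of the cycles each would have gated,
    which serves as the probabilistic estimate of P(g).
    """
    survivors = {g: 0 for g in candidates}
    for w in range(num_words):
        switch = eval_next(w) ^ eval_state(w)    # F_R(x) XOR x_R per cycle
        for g in list(survivors):
            v = eval_literal(g, w)
            if v & switch:
                # Counterexample: g is true on a cycle where the
                # register switches, so g can never gate this register.
                del survivors[g]
            else:
                survivors[g] += bin(v).count("1")   # cycles g disables clk
    return survivors
```

The gated-cycle counts collected here are exactly the switching-based estimate preferred over ON-set size in the text above.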
6.3.6 Candidate Proof
Once the set of candidates has been reduced with pruning to literal/register pairs that
are reasonably likely to be legal, these are proved to satisfy the correctness condition using a
satisfiability solver. The test structure is depicted in Figure 6.7. If the output is satisfiable,
then there exists an input that violates the correctness condition; otherwise, g is now known
to be a valid gating condition for register R.
There are two ways that the features of a modern SAT solver can be leveraged to speed
up the repeated proving of candidate gating conditions. A single problem structure for the
circuit can be constructed and reused for repeated queries to the same solver running in
incremental mode. Because the portion of the problem describing the circuit functionality
does not change, learned clauses are kept to speed up future runs.
Alternatively, the structure that is generated can be restricted to only the transitive
fan-in cones of the registers and gating conditions under test. As these regions may only
comprise a tiny fraction of the total circuit, the overall size of the SAT problem is dramatically reduced.
Figure 6.7. Proving candidate function.
Our experience indicates that the latter is the faster of the two methods for CNF-based
SAT solvers that generate assignments to all circuit variables. (This was the case with our
use of MiniSat [66].) This may not hold true for circuit-based SAT.
The counterexamples returned by the SAT solver are also of immense utility. Because
these exercise corners of the Boolean space that were not reached by simulation alone, it
is likely that they might serve as counterexamples for other not-yet-disproven candidate
literals. We use these results to further prune the candidates.
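The proof obligation itself is small: g is a valid gating condition exactly when g ∧ (FR(x) ⊕ xR) is unsatisfiable. The sketch below substitutes exhaustive enumeration for the SAT query, which is only viable for tiny fan-in cones; a real implementation builds the miter of Figure 6.7 in CNF and hands it to an incremental solver such as MiniSat. All function names are illustrative.

```python
from itertools import product

def prove_candidate(g_fn, next_fn, num_inputs):
    """Check that literal g can never be true while register R switches.

    g_fn(x) evaluates the candidate literal on input tuple x;
    next_fn(x) returns (F_R(x), x_R). Returns None if g is a proven
    gating condition (the miter is UNSAT), else a counterexample input.
    """
    for x in product([0, 1], repeat=num_inputs):
        f, q = next_fn(x)
        if g_fn(x) and f != q:
            return x          # SAT: g holds on a cycle where R switches
    return None               # UNSAT: g is a valid gating condition

# Toy register: F(x) = x0 AND x1; the register output is x2.
next_fn = lambda x: (x[0] & x[1], x[2])
# g = NOT x0 is incorrect: x = (0, 0, 1) gives F = 0 while x_R = 1.
assert prove_candidate(lambda x: 1 - x[0], next_fn, 3) is not None
# g = x0 AND x1 AND x2: whenever g holds, F = 1 = x_R, so R never switches.
assert prove_candidate(lambda x: x[0] & x[1] & x[2], next_fn, 3) is None
```

Any counterexample returned here would then be simulated against the remaining not-yet-disproven candidates, often disproving several more at no additional solver cost.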
6.3.7 Candidate Grouping
From simulation, an estimate of each node’s probability of inhibiting the switching of
each register is known. However, assuming that it is not possible to keep the full set of
simulation traces for every node, we lack any information about the correlations between
these probabilities. The minimum-cost set of activation functions cannot be constructed
without information about their overlap.
Instead, an intermediate set of candidate groups is generated, each a set of 1 to M
unique nodes. A candidate set describes a potential gating condition G of the form of
Equation 6.1 for one or more registers. If M is sufficiently small, it is feasible to enumerate
all such subsets, but we also propose the following heuristic to extract a number of useful
candidate sets. This method is motivated by its guarantee that there will exist at least one
cover for each register that contains each legal term.
A set is constructed from a "seed", consisting of a node/register pair. The seed node
is inserted into the set, and the set is incrementally expanded by adding candidates to
maximize the sum of the gating probabilities under the constraint that the set will continue
to gate the seed register. The initial set can legally gate exactly the registers of the initial
candidate. Each addition may simultaneously (i) shrink the set of registers for which the
set is valid, because if one term is not a legal gating condition for a register, the set is no
longer a legal gating function for that register, and (ii) increase the probability that the
clock transitions of others will be gated. Again, because information about the correlation
between the multiple elements in the set is not known, the probabilities are heuristically
summed (and may therefore be greater than one). The process terminates when there are
no more candidates to be greedily added or the maximum set size M has been reached.
Non-unique results can be discarded.
Figure 6.8. Heuristic candidate grouping.
The candidate grouping process is illustrated in Figure 6.8 for the seed pair (g3 , R2 ).
The columns of the table are the gateable registers, and the rows of the table are the
candidate gating functions. The value at each entry is the probability that the corresponding
function will gate the corresponding register; empty entries indicate that the resulting gated
register would not be functionally correct. Consisting of only the seed function, the initial
set {g3 } can gate the registers R1 , R2 , and R3 with a summed gating probability of 0.6. This
combination is contained within the red box labelled 1. We then proceed to greedily add
other signals to the set. While adding g5 would increase the sum of the set by the largest
net amount (1.6), it would remove the seed register and is therefore ignored. Function
g1 represents the next best improvement: the estimated gating probability of registers R1
and R2 is increased by 1.6 but register R3 must be dropped from the set, resulting in a
net increase of 1.4. The new combination is depicted with the box labelled
2. Lastly, we add the function g4 for the final candidate set of {g1 , g3 , g4 } with a summed
gating probability of 2.4. There are no other candidates for greedy addition that would
increase this total.
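The greedy expansion can be sketched as follows. The probability table below is hypothetical and only mimics the flavor of Figure 6.8 (the figure's actual values are not reproduced); as in the text, probabilities are heuristically summed because their correlations are unknown.

```python
def grow_candidate_set(seed_fn, seed_reg, prob, max_size):
    """Greedily grow one disjunctive candidate set from a seed pair.

    prob[g] is a dict {register: gating probability} from simulation;
    a register absent from prob[g] cannot legally be gated by g.
    The seed register must remain gateable throughout.
    """
    group = {seed_fn}
    regs = set(prob[seed_fn])          # registers the set can still gate

    def score(rs, grp):                # heuristic sum of probabilities
        return sum(prob[g][r] for g in grp for r in rs)

    while len(group) < max_size:
        best, best_gain = None, 0.0
        for g in set(prob) - group:
            new_regs = regs & set(prob[g])
            if seed_reg not in new_regs:
                continue               # would drop the seed register
            gain = score(new_regs, group | {g}) - score(regs, group)
            if gain > best_gain:
                best, best_gain = g, gain
        if best is None:
            break                      # no candidate improves the total
        group.add(best)
        regs &= set(prob[best])
    return group, regs
```

On a table where the seed g3 gates {R1, R2, R3} weakly, g1 gates {R1, R2} strongly, and g5 would drop the seed register, the loop reproduces the narrative above: g5 is rejected, g1 is added first (shrinking the register set), and a third function is added last.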
6.3.8 Covering
The circuit is simulated again – with actual simulation traces, if available – and probabilistic information is collected about the candidate sets, thereby capturing the correlated
probabilities. The correlation between sets is not pertinent: each register can only be
switched by a single gated clock, generated by the one gating condition that is chosen for
it. (This restriction can be relaxed by amending our technique to employ hierarchical clock
gating.)
The problem now reduces to the weighted maximum set covering problem, where the
weight of each element set is exactly its net contribution to the power objective (as in
Equation 6.5), the total dynamic power. If an insufficient number of registers or clock
transitions are gated, the net weight of an element may be negative; these will never be
selected. The maximum set covering problem is NP-hard, but there exist good heuristics.
The problem is also less difficult for practical circuits because of the relatively small number
of partially overlapping sets. We utilize the greedy addition heuristic [67].
Once a subset of candidate sets has been selected, the disjunction of each is used as
the disable of a clock gate to produce a single gated clock signal. This gated clock is then
connected to the covered registers.
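A minimal sketch of the greedy covering step follows, with set weights computed as the net power contribution of Equation 6.5 over the still-uncovered registers. The data layout is illustrative only.

```python
def greedy_cover(candidate_sets, cap_r, c_cg):
    """Greedy selection for the weighted covering step (Section 6.3.8).

    candidate_sets: list of (set_name, {register: P(clock gated)}).
    The weight of a set is the sum of C_r * P over its still-uncovered
    registers minus one clock-gate load C_cg (Equation 6.5).
    """
    covered, chosen = set(), []
    while True:
        best, best_w = None, 0.0
        for name, regs in candidate_sets:
            if name in chosen:
                continue
            w = sum(cap_r[r] * p for r, p in regs.items()
                    if r not in covered) - c_cg
            if w > best_w:             # negative-weight sets never win
                best, best_w = (name, regs), w
        if best is None:
            break
        chosen.append(best[0])
        covered |= set(best[1])
    return chosen
```

Because each register may be driven by only one gated clock, registers already covered contribute nothing to the weight of later sets, which is what makes the greedy recurrence well-defined.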
6.4 Circuit Minimization
The insertion of a clock gating condition creates a set of observability don’t cares (ODCs)
for the next state function FR (x) at the input of register R. When the gated clock signal
is inactive, the value of the next state function is irrelevant; the output of the register will
remain constant. This fact can be used to minimize the logic implementation of the next
state function.
In general, the task of reducing a large logic network with ODCs is difficult, but in this
specific case, a structural simplification can be immediately applied. Let h be an immediate
fan-out of the node of literal g. If the combinational transitive fan-out of any h does not
include any (i) primary outputs, (ii) clock gate inputs, or (iii) register inputs gated by a
signal G such that g ∉ G, this connection can be replaced with a constant. The inserted
constants are then propagated forward in the network and any dangling portions dropped.
In many instances, the function GR is constructed of terms entirely from within R’s fan-in
cone. Multiple ODC-based simplifications can generally not be simultaneously applied, but
in this case, their mutual compatibility is guaranteed because the structure of all GR signals
is perfectly preserved.
An example of this simplification is illustrated in Figure 6.9. The signal g has been
selected to gate the two registers. Because the propagation of the clock is disabled when
g is 1, this introduces don’t cares. Using the structural simplification described above, we
can replace the connection from g → h with a constant 0. If this constant is propagated
forward, we reduce the next state function of both registers to i. This also leaves a dangling
input and allows the fan-out-free fan-in of node h (the OR-gate) to be dropped.
Figure 6.9. ODC-Based Circuit Simplification after Gating.
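The forward sweep after tying an ODC-justified connection to a constant can be illustrated on a toy expression representation. The structure below is hypothetical and only mimics the flavor of Figure 6.9; a real implementation performs the same rules over the AIG.

```python
def simplify(expr):
    """Constant-propagate through a nested (op, a, b) expression.

    An expression is ('AND', a, b), ('OR', a, b), a variable name, or a
    constant 0/1. This is a toy stand-in for sweeping an AIG forward
    after an ODC-justified input has been tied to a constant.
    """
    if isinstance(expr, (int, str)):
        return expr                       # variable or constant leaf
    op, a, b = expr[0], simplify(expr[1]), simplify(expr[2])
    if op == 'AND':
        if a == 0 or b == 0:
            return 0                      # AND with 0 is 0
        if a == 1:
            return b
        if b == 1:
            return a
    elif op == 'OR':
        if a == 1 or b == 1:
            return 1                      # OR with 1 is 1
        if a == 0:
            return b
        if b == 0:
            return a
    return (op, a, b)                     # no constant to propagate

# With the gated input tied to 0, the next-state expression collapses
# to i; the now fan-out-free OR-gate is dropped along the way.
f = ('OR', ('AND', 0, ('OR', 'a', 'b')), 'i')
print(simplify(f))   # -> i
```

The dangling-logic removal described above corresponds to the sub-expressions that simply never appear in the simplified result.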
6.5 Experimental Results
6.5.1 Setup
We used OpenAccess Gear [68] as the platform upon which the algorithm described
in this chapter was implemented. OpenAccess Gear is “an open source software initiative
intended to provide pieces of the critical integration and analysis infrastructure that are
taken for granted in proprietary tools.” [69] It is built upon the OpenAccess database, an
industry-standard database and API for manipulating and utilizing design data. The combination provides a powerful set of tools to explore new VLSI design techniques and software.
In this work, the Func [70], Aig, and Dd components were utilized.
The logic representation in OpenAccess Gear is a sequential AND/INVERTER graph
(AIG). For a modern treatment of the features and properties of AIGs, we refer the reader
to [71] or [34]. The sequential extension is described in [72].
The experimental setup consisted of a 2.66 GHz AMD x64 machine running our tools
under Linux. They were all written in C/C++ and compiled using GNU g++ version 4.1.
All benchmarks were pre-optimized using ABC. Greedy AIG rewriting was applied until
a fixed point was reached (i.e. no further reduction in the number of nodes was seen).
For the purposes of power optimization, the relative capacitances of the clock gate
and register clock inputs were assumed identical. An additional constraint was added to
further constrain the insertion of clock gates: every gated clock signal was required to drive
a minimum of three registers. This requirement was applied to both the structural and
sim-sat-based algorithms.
6.5.2 Structural Analysis
Our algorithm is compared against a purely structural analysis. The goal of this technique is to identify particular circuit structures (e.g. Figure 6.2) that can be used as clock
gating conditions. Because there was not a readily-available algorithm or tool to perform
this optimization, we implemented our own version.
As our circuit representation is based entirely on AND/INVERTER graphs with simple
sequential elements, it was not possible to explicitly identify either multiplexors or synchronous enable signals. However, we describe a technique that can detect many of their
resulting representations in the corresponding AIG. As we have knowledge of the synthesis flow, it is known that the initial implementations (before the rewriting passes) of both
of these situations conform to the structure detected by this method. In general, however, AIGs are non-canonical representations of functionality, and there are a multitude of
possible structures to represent a global or local function.
There is also the potential to catch additional circuit structures that are functionally
identical to the two previously mentioned. For example, alternative mappings of the multi158
Figure 6.10. Four-cut for structural check.
plexor in Figure 6.2(ii) will be detected. The generality is further increased when structural
hashing is applied during the construction of the AIG.
Our structural analysis proceeds as follows. For every register, we seek to identify two
signals: one that implies that the state of the register does not change, and one that indicates
the next state if it does. These are functionally equivalent to the synchronous enable and D
input, respectively, of the enable-DFF depicted in Figure 6.2. The minimum AIG structure
required to implement this synchronous-enable-like behavior contains at least three AND
nodes: this is exactly the size of the simplest representation of a 2-input multiplexor. With
fewer AND nodes it is not possible to either (i) drive the next state to both 0 and 1 and
(ii) accomplish it with the same input signal. However, because of the input permutations
and non-canonical edge complementation, there are still many graph structures that meet this
description.
To identify the three-AND structures that describe synchronous-enable-like behavior,
we collect the fan-in cone of depth 2 at the input of each register. This cone is depicted
in Figure 6.10; the large vertices are AND nodes and the small hashed circles represent
complementation that may or may not be present on the edge. If it does not contain 3
AND nodes, the register is skipped. Otherwise, the resulting cut at its base is tested to
see if it contains any two signals that satisfy the above criteria. This is done by testing
the properties described in Equation 6.14. Here, ≡ denotes structural equivalence and =
functional equivalence. In our tool, functional equivalence was checked with BDDs.
∃(s0, s1, loop) ∈ {0..3} such that        (6.14)
s0 ≠ s1 ≠ loop
is0 ≡ is1
iloop ≡ q
((is0 ⇒ d = q) = 1) ∨ ((¬is0 ⇒ d = q) = 1)
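The check of Equation 6.14 can be sketched with truth tables standing in for the BDD-based functional-equivalence test used in our tool, and with value equality standing in for structural equivalence ≡. The enable-DFF example below is hypothetical.

```python
def find_enable(cut, d, q, nbits):
    """Search a four-cut for synchronous-enable behavior (Equation 6.14).

    cut: truth tables (as nbits-wide bitmask ints) of the four cut
    signals; d, q: truth tables of the register input and output.
    Returns a satisfying (s0, s1, loop) triple of cut indices, or None.
    """
    mask = (1 << nbits) - 1
    eq = ~(d ^ q) & mask                  # truth table of (d = q)
    for s0 in range(4):
        for s1 in range(4):
            for loop in range(4):
                if len({s0, s1, loop}) != 3:
                    continue              # s0, s1, loop must be distinct
                if cut[s0] != cut[s1] or cut[loop] != q:
                    continue              # i_s0 ≡ i_s1 and i_loop ≡ q
                en = cut[s0]
                # (en => d = q) = 1  or  (not en => d = q) = 1
                if ((~en & mask) | eq) == mask or (en | eq) == mask:
                    return (s0, s1, loop)
    return None

# A hypothetical enable-DFF cut over three variables (8-row tables):
# x0 = enable, x1 = data, x2 = q; d = x0*x1 + (not x0)*x2, a 2-input mux.
x0, x1, x2 = 0xAA, 0xCC, 0xF0
d = (x0 & x1) | (~x0 & 0xFF & x2)
print(find_enable([x0, x0, x2, x1], d, x2, 8))  # -> (0, 1, 2)
```

Here the enable appears twice in the cut (positions 0 and 1), the register output closes the self-loop at position 2, and the second implication of Equation 6.14 holds: whenever the enable is low, d equals q.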
Table 6.1 describes the results of applying the structural gating to the OpenCores and
QUIP benchmark suites. The column Tot is the total number of registers in the design.
Match is the number of these for which a gating condition was detected structurally, but
because these did not necessarily result in a power savings, only the number in column
Shared were sufficiently common to be implemented. % Shared is the percentage of the
total registers with shared enable signals. This is the fraction that is gateable.
The number of enables that were used to gate the registers is listed in column # En.
On average, each one of these enables gated the number of registers in column Regs/En. As
expected, the average ratio (as well as the individual ratio of each selected enable) satisfies
the aforementioned constraint on the minimum number of gated registers per clock gate.
The last two columns provide estimates on both the number of clock transitions avoided at
the register inputs and also the total estimated power savings.
6.5.3 Power Savings
The results of applying the simulation-and-SAT-based gating approach to the same set
of benchmarks as Table 6.1 are presented in Table 6.2. The column Gated is the number of
registers out of Total that were clock gated. Again, # En is the number of enable signals
used. The percentage of the total clock transitions and power saved are in columns Clks
and Pow, respectively. The power savings is also compared against the purely structural
approach in the next column. Here, the † symbol indicates the cases where gating was
                                                              % Saved
Name                 Tot  Match  Shared  %Shared  #En  Regs/En   Clks    Pow
fip cordic cla        55      0       0     0.0%    0       --   0.0%   0.0%
fip cordic rca        55      0       0     0.0%    0       --   0.0%   0.0%
oc vid com enc        59     28      24    40.7%    1     24.0  20.5%  18.8%
oc vid com dec        61     28      26    42.6%    2     13.0  11.3%   8.0%
oc miniuart           90     34      26    28.9%    3      8.7  11.1%   7.8%
oc ssram              95      0       0     0.0%    0       --   0.0%   0.0%
oc gpio              100     83       0     0.0%    0       --   0.0%   0.0%
oc sdram             112     74      54    48.2%    4     13.5  26.8%  23.3%
oc rtc               114     59      27    23.7%    1     27.0  11.8%  11.0%
oc i2c               129     89      86    66.7%    9      9.6  48.1%  41.1%
os sdram16           147    105      82    55.8%    4     20.5  13.2%  10.5%
oc ata v             157     79      79    50.3%    3     26.3  22.1%  20.2%
oc dct slow          178     38      38    21.3%    4      9.5  15.4%  13.1%
nut 004              185     15       0     0.0%    0       --   0.0%   0.0%
nut 002              212     21       0     0.0%    0       --   0.0%   0.0%
oc correlator        219      0       0     0.0%    0       --   0.0%   0.0%
oc simple fm rcvr    226      0       0     0.0%    0       --   0.0%   0.0%
nut 003              265     45       0     0.0%    0       --   0.0%   0.0%
oc ata ocidec1       269    191     191    71.0%    7     27.3  34.0%  31.4%
oc minirisc          289    106      74    25.6%    6     12.3   7.9%   5.8%
oc ata ocidec2       303    192     191    63.0%    7     27.3  30.1%  27.8%
nut 000              326     15       0     0.0%    0       --   0.0%   0.0%
oc aes core          402    128     128    31.8%    1    128.0  15.9%  15.7%
oc hdlc              426    138      62    14.6%    7      8.9   4.9%   3.2%
nut 001              484     50       0     0.0%    0       --   0.0%   0.0%
oc ata vhd 3         594    373     298    50.2%   13     22.9  22.4%  20.2%
oc ata ocidec3       594    365     290    48.8%   12     24.2  21.8%  19.7%
oc fpu               659      8       8     1.2%    1      8.0   0.9%   0.8%
oc aes core inv      669    132     132    19.7%    2     66.0  10.0%   9.7%
oc oc8051            754    481     236    31.3%   24      9.8  14.2%  11.0%
os blowfish          891    373     372    41.8%   10     37.2  11.5%  10.3%
oc cordic r2p       1015      0       0     0.0%    0       --   0.0%   0.0%
oc cfft 1024x12     1051    163     143    13.6%    7     20.4   4.4%   3.7%
oc vga lcd          1108    672     659    59.5%   30     22.0  29.2%  26.4%
fip risc8           1140   1097      62     5.4%    8      7.8   3.0%   2.3%
oc pavr             1231   1102     822    66.8%   40     20.6  49.7%  46.4%
oc mips             1256   1204     211    16.8%    9     23.4   7.6%   6.9%
oc ethernet         1272    653     254    20.0%   19     13.4   6.2%   4.7%
oc pci              1354    768     470    34.7%   21     22.4   9.1%   7.5%
oc aquarius         1477   1378     801    54.2%   27     29.7  20.9%  19.0%
oc wb dma           1775   1489     936    52.7%   42     22.3  12.4%  10.1%
oc mem ctrl         1825   1382     901    49.4%   38     23.7  25.3%  23.2%
oc vid com d        3549     12      12     0.3%    1     12.0   0.1%   0.1%
radar12             3875   2222    2019    52.1%   46     43.9  17.5%  16.4%
oc vid com j        3972     88      88     2.2%    5     17.6   0.8%   0.7%
radar20             6001   2922    2679    44.6%   59     45.4  16.0%  15.1%
uoft raytracer     13079   3926    2216    16.9%   70     31.7   5.8%   5.3%
AVERAGE                                    26.9%          25.1   9.1%   8.0%
Table 6.1. Structural clock gating results.
found with our method but none were found with the structural analysis. Although they
represent a clear improvement, these instances are not reflected in the average additional
power savings. The remaining average power reduction was 27% greater than that of the
purely structural method, and as much as 2.2 times higher.
6.5.4 Circuit Minimization
The post-gating optimization described in Section 6.4 was then applied, and the resulting netlist optimized with the ABC package using the same procedure of repeated rewriting.
This was done primarily to perform the constant propagation and to leverage any resulting simplifications. The results for a selected subset of the benchmarks are in Table 6.3.
The final number of AND nodes and the relative improvement are reported in the final two
columns. On average, the size of the combinational logic was reduced by 7.0%. In five of the
benchmarks the depth of the combinational logic was also reduced, resulting in a potential
performance improvement.
The correctness of both the gating conditions (modeled as synchronous enables) and
the logic optimization was successfully verified using combinational equivalence checking.
6.6 Summary
We have introduced a method for clock gate synthesis that constructs the gating condition out of the disjunction of functions that are already present in the existing logic and
their complements. Applied to a set of industry-supplied benchmarks, the dynamic clock
power consumption is reduced over synchronous methods alone. The gating condition can
also be utilized in a straightforward structural optimization to reduce the area of the circuit.
                                       % Saved
Name               Total  Gated  #En   Clks    Pow  vs Struct  Runtime(s)
fip cordic cla        55      3    1   2.3%   0.5%        †         2.50
fip cordic rca        55      3    1   2.3%   0.4%        †         2.48
oc vid com enc        59     28    2  23.9%  20.5%     9.0%         2.84
oc vid com dec        61     26    2  11.3%   8.0%     0.0%         2.28
oc miniuart           90     44    6  21.2%  14.5%    85.8%         1.56
oc ssram              95     32    2  15.8%  13.7%        †         0.82
oc gpio              100      0    0   0.0%   0.0%     0.0%         1.76
oc sdram             112     91   10  43.3%  34.4%    47.8%         2.56
oc rtc               114     27    1  11.8%  11.0%     0.0%         4.47
oc i2c               129     86    9  48.1%  41.1%     0.0%         3.13
os sdram16           147    122   11  26.0%  18.5%    76.5%         5.92
oc ata v             157     79    3  22.1%  20.2%     0.0%         3.12
oc dct slow          178     47    5  17.9%  15.1%    14.9%         2.65
nut 004              185      8    2   3.0%   1.9%        †         1.05
nut 002              212     19    5   4.0%   1.7%        †         1.31
oc correlator        219      0    0   0.0%   0.0%     0.0%         4.38
oc simple fm rcvr    226      3    1   0.7%   0.2%        †         4.19
nut 003              265     17    5   3.6%   1.7%        †         4.23
oc ata ocidec1       269    191    7  34.0%  31.4%     0.0%         8.23
oc minirisc          289     86    8   9.1%   6.4%     8.7%         5.72
oc ata ocidec2       303    191    7  30.1%  27.8%     0.0%         9.22
nut 000              326     13    4   1.6%   0.4%        †         3.05
oc aes core          402    128    1  15.9%  15.7%     0.0%        15.90
oc hdlc              426    133   16  12.9%   9.2%   183.6%         5.08
nut 001              484     23    6   3.0%   1.8%        †         9.39
oc ata vhd 3         594    298   13  22.4%  20.2%     0.0%        19.08
oc ata ocidec3       594    293   13  22.0%  19.8%     0.5%        19.07
oc fpu               659     56    2   2.8%   2.5%   223.7%        40.79
oc aes core inv      669    132    2  10.0%   9.7%     0.0%        17.52
oc oc8051            754    339   34  22.4%  17.9%    62.5%        58.91
os blowfish          891    378   11  11.7%  10.5%     1.4%        30.34
oc cordic r2p       1015      0    0   0.0%   0.0%     0.0%        40.94
oc cfft 1024x12     1051    279   16   8.5%   7.0%    89.2%        26.33
oc vga lcd          1108    707   35  31.6%  28.4%     7.6%        44.09
fip risc8           1140     68   10   3.3%   2.4%     4.9%        44.94
oc pavr             1231    841   43  50.2%  46.7%     0.6%        97.34
oc mips             1256    243   10   8.9%   8.1%    16.7%        78.21
oc ethernet         1272    367   35  10.8%   8.0%    72.5%        40.59
oc pci              1354    615   33  12.4%  10.0%    33.1%        48.33
oc aquarius         1477    816   29  21.2%  19.3%     1.2%        72.36
oc wb dma           1775    971   44  13.3%  10.8%     7.4%        82.16
oc mem ctrl         1825   1064   45  27.8%  25.4%     9.3%        93.92
oc vid com d        3549     16    2   0.1%   0.1%    33.3%       141.35
radar12             3875   2209   64  20.4%  18.7%    14.6%       188.64
oc vid com j        3972     93    6   0.8%   0.7%     3.0%       227.75
radar20             6001   2968   81  18.8%  17.4%    15.8%       476.36
uoft raytracer     13079   2355  102   6.5%   5.7%     8.3%      1695.09
AVERAGE                               14.7%  12.4%    27.2%
Table 6.2. New clock gating results.
Name             Init ANDs  % Gated Regs  Final ANDs  % Change
oc ssram               274        33.68%         179    34.70%
oc sdram               894        81.25%         720    19.50%
oc hdlc               1873        31.22%        1734     7.40%
oc vga lcd            6923        63.81%        6555     5.30%
oc ethernet           8926        28.85%        8890     0.40%
oc cfft               9177        26.55%        9124     0.60%
oc 8051               9746        44.96%        9622     1.30%
oc fpu               16260         8.50%       16179     0.50%
radar20              60835        49.46%       60576     0.40%
uoft raytracer      138895        18.01%      138542     0.30%
AVERAGE                           38.63%                 7.04%
Table 6.3. ODC-based simplification results.
Chapter 7
Conclusion
We believe we have described a set of sequential optimization techniques that are novel
and useful when applied to low power digital design. In this chapter, we review and highlight
the main contributions of this work.
We have focused on reducing the dynamic power that is dissipated due to capacitive
switching. This quantity is proportional to the total capacitance switched, the square of
the voltage swing, and the switching frequency. Because one particular class of signal – the
clock – both drives more capacitance and switches more frequently (by at least a factor of
two) than any other synchronous signal in the design, it accounts for
approximately 30%-50% of the total power consumed in a modern integrated circuit.
The clock and the synchronizing circuit elements that it drives are exactly the targets
of sequential optimization. We believe this strongly motivates the application of this class
of synthesis transformations for reducing clock dynamic power consumption. Experimental
results have been presented throughout this work to quantify the successful minimization of
the clock's fraction of the total power. This is accomplished through two different avenues:
reducing the total clock capacitance and reducing the effective switching frequency. We
now summarize the algorithms described to improve each of these objectives and review
their main features.
7.1 Minimizing Total Clock Capacitance
The sinks of the clock distribution network are the sequential components of the design.
For the purposes of this work, this is the set of registers in the design.
Chapter 2 introduces a new formulation of the unconstrained minimum-register retiming
problem. The number of registers in the synthesis examples in our benchmark can be
reduced on average by 11% using retiming. The contribution of our algorithm is a runtime speed-up of approximately 5 times over the fastest available method, using a
formulation of the problem as an instance of minimum-cost network circulation (MCC).
The algorithm has the desirable property of minimizing the relocation of registers within
the circuit. The solution returned is the optimal one that moves the registers over the fewest
number of gates, or equivalently, minimizes the sum of the absolute values of the retiming
lag function.
Another important feature – one not available when using MCC-based solution
techniques – is that the result after each iteration is both legal and monotonically decreasing
in the number of registers. It is possible to terminate early with a result that is still
better than the original. This is important for industrial scalability and runtime-limited
applications.
We have shown in Chapter 3 how to incorporate constraints on the maximum and minimum combinational path delays into the min-register retiming problem. Timing constraints
are critical for performance-constrained synthesis applications. The resulting algorithm runs
an average of 102 times faster than the best available academic tool. Our timing-constrained
min-register retiming algorithm also has the property that termination in either its inner
or outer loop results in an improved and timing feasible result. This is again a feature not
present in competing methodologies.
Chapter 4 extends the constraint set to include the requirement that the resulting
retiming has an equivalent initial state. While this is an NP-hard problem in the worst case, we show that the runtime of our method is quite tractable for all of our example
166
circuits. The median benchmark required the addition of only 2 extra registers to restore
initializability.
Finally, Chapter 5 explores the simultaneous combination of retiming and clock skew
scheduling to minimize the overall dynamic power of the clock network under a delay
constraint. Depending on the topology of the circuit, skewing is in some cases the more
power-effective means of improving worst-case delay, and retiming is more power-effective
in others. Our experiments show that a heuristically determined combination of both can
improve the total dynamic power consumption of the clock endpoints by 11%.
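The tradeoff between the two techniques stems from the timing constraints each must satisfy. A clock skew schedule assigns an arrival time s(i) to each register's clock; a long path then only needs s(i) + d_max(i, j) <= s(j) + T for clock period T, while short paths must keep s(i) + d_min(i, j) >= s(j). A small sketch of this feasibility check (the delays, skews, and period are invented numbers, not from the dissertation's experiments):

```python
def skew_schedule_feasible(paths, skew, period):
    """Check a clock skew schedule against setup and hold constraints.

    paths: dict (i, j) -> (d_min, d_max), the shortest and longest
           combinational delays from launching register i to capturing
           register j
    skew:  dict register -> clock arrival time s(register)
    """
    for (i, j), (d_min, d_max) in paths.items():
        if skew[i] + d_max > skew[j] + period:   # setup (long-path) violation
            return False
        if skew[i] + d_min < skew[j]:            # hold (short-path) violation
            return False
    return True

# Illustrative numbers: delaying the capture clock by 1 unit accommodates
# a 6-unit path under a 5-unit period without touching the netlist.
paths = {("r1", "r2"): (2, 6)}
zero_skew_ok = skew_schedule_feasible(paths, {"r1": 0, "r2": 0}, period=5)
skewed_ok = skew_schedule_feasible(paths, {"r1": 0, "r2": 1}, period=5)
```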
7.2 Minimizing Effective Clock Switching Frequency
Clock gating reduces the effective frequency at which some register clock inputs must be
switched: the clock is selectively disabled in cycles where a register need not latch a new
value.
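Under the usual dynamic power model P = C * Vdd^2 * f, a clock pin that toggles every cycle dissipates power proportional to the clock frequency, and gating scales this by the fraction of cycles in which the clock remains enabled. A back-of-the-envelope sketch (the capacitance, voltage, and frequency values are purely illustrative):

```python
def clock_pin_power(c_pin, vdd, f_clk, enabled_fraction=1.0):
    """Dynamic power of one register clock pin under the model
    P = C * Vdd^2 * f_eff, where clock gating lowers the effective
    switching frequency to f_eff = f_clk * enabled_fraction."""
    return c_pin * vdd ** 2 * f_clk * enabled_fraction

# Illustrative numbers: a 1 fF clock pin at 1.0 V and 1 GHz.
ungated = clock_pin_power(1e-15, 1.0, 1e9)       # 1 uW per pin
gated = clock_pin_power(1e-15, 1.0, 1e9, 0.6)    # clock enabled 60% of cycles
savings = 1.0 - gated / ungated                  # 40% reduction at this pin
```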
Chapter 6 describes a new method for synthesizing clock gating logic. We employ
a combination of simulation and SAT to identify existing signals within the logic network
that are functionally correct gating conditions. This technique is more general and powerful
than strictly structural methods but does not have the scalability problems associated with
symbolic methods. The result is an average reduction of 12% in the dynamic power of the
clock for the sets of synthesis benchmarks that were examined.
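The two-phase candidate-filtering flow can be illustrated on a toy netlist: random simulation cheaply discards signals that ever disagree with the gating requirement, and only the survivors are handed to a complete decision procedure. In this sketch, exhaustive enumeration stands in for the SAT query, and the netlist, signal names, and candidates are all invented for illustration:

```python
import itertools
import random

# Toy sequential netlist: register r with next-state function over
# primary inputs (en, d).  When en = 0, r holds its value.
def next_r(en, d, r):
    return d if en else r

def holds(en, d, r):
    """True iff r would keep its value in this cycle (clock not needed)."""
    return next_r(en, d, r) == r

# Candidate gating signals assumed to already exist in the netlist:
# the clock may be disabled whenever the candidate evaluates to 1.
candidates = {
    "not_en": lambda en, d, r: int(not en),   # correct: en = 0 implies r holds
    "not_d":  lambda en, d, r: int(not d),    # incorrect in general
}

# Phase 1: random simulation quickly discards most invalid candidates.
random.seed(0)
survivors = set(candidates)
for _ in range(64):
    en, d, r = (random.randint(0, 1) for _ in range(3))
    for name in list(survivors):
        if candidates[name](en, d, r) and not holds(en, d, r):
            survivors.discard(name)

# Phase 2: a complete check (stand-in for the SAT call) proves that a
# surviving candidate implies the register holds, over all valuations.
proven = {name for name in survivors
          if all(holds(en, d, r) or not candidates[name](en, d, r)
                 for en, d, r in itertools.product((0, 1), repeat=3))}
```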
The synthesis of the gating logic is done so as to control the perturbation to the netlist.
Unlike RTL or symbolic methods, implementing a gating condition does not require the
general synthesis of a function, which can have unpredictable effects and result in an
unknown amount of additional logic. Our method requires only the OR and possible
complementation of a set of existing signals.
Furthermore, it is quite easy to incorporate important constraints on the resulting clock
gates. Physical constraints can be applied to limit the maximum distance between the
driver of a gating signal and the register that it gates. Timing constraints can be used to
ensure that there is no timing violation in the resulting netlist.
Finally, the use of disjunctions of circuit signals allows a straightforward logic
simplification to be applied. In certain cases, the signal used in the gating condition can
be replaced with a constant and the surrounding logic simplified. We have shown an average
7% decrease in the number of AIG AND nodes for a subset of the benchmarks, which is
likely to yield additional power savings.
Appendix A
Benchmark Characteristics
Name                # Gates  # Registers  Max Depth  Pri Inps  Pri Outps
daio                23       4            6          2         2
s208.1              140      8            16         11        1
mm4a                146      12           19         8         4
traffic             176      13           11         6         8
s344                238      15           28         10        11
s349                242      15           28         4         6
s382                242      21           17         10        11
s400                256      21           17         4         6
mult16a             258      16           43         18        1
s526n               261      21           14         4         6
s420.1              282      16           18         19        1
s444                284      21           20         18        1
mult16b             284      30           11         4         6
s526                287      21           14         4         6
s641                399      19           78         36        23
s713                437      19           86         36        23
mult32a             498      32           75         34        1
s838.1              566      32           22         35        1
mm9a                582      27           55         13        9
s838                595      32           94         36        2
s953                661      29           27         17        23
s1196               716      18           34         15        14
s1238               719      18           32         15        14
mm9b                729      26           80         13        9
s1423               819      74           67         18        5
gcd                 1011     59           33         19        25
sbc                 1024     27           22         41        56
ecc                 1477     115          19         12        14
phase decoder       1602     55           33         4         10
daio receiver       1796     83           45         17        46
mm30a               1905     90           139        34        30
parker1986          2558     178          61         50        9
s5378               2828     163          33         36        49
s9234               2872     135          55         37        39
bigkey              2977     224          9          263       197
dsip                3429     224          21         229       197
s13207              8027     669          59         31        121
s38584.1            18734    1426         70         39        304
s38417              22821    1465         65         29        106
clma                25124    33           78         383       82
Table A.1. Benchmark Characteristics: LGsynth
Name                # Gates  # Registers  Max Depth  Pri Inps  Pri Outps
ts mike fsm         48       3            6          5         10
xbar 16x16          178      32           4          81        16
barrel16            263      37           8          22        16
barrel16a           293      37           10         22        16
barrel32            712      70           10         39        32
nut 004             714      185          12         31        202
nut 002             760      212          19         34        78
mux32 16bit         887      533          6          38        16
mux8 64bit          965      579          4          12        64
fip cordic rca      983      55           45         19        34
fip cordic cla      1044     55           49         19        34
nut 000             1160     326          55         27        237
nut 003             1507     265          36         106       102
mux64 16bit         1704     1046         6          71        16
mux8 128bit         1925     1155         4          12        128
barrel64            1933     135          11         72        64
nut 001             2620     484          55         76        59
fip risc8           2951     1140         39         30        83
radar12             11005    3875         44         2769      1870
radar20             29552    6001         44         3292      1988
uoft raytracer      71598    13079        93         4364      4033
Table A.2. Benchmark Characteristics: QUIP
Name                # Gates  # Registers  Max Depth  Pri Inps  Pri Outps
oc ssram            238      95           2          110       88
oc miniuart         381      90           10         16        11
oc gpio             385      100          10         74        67
oc dct slow         513      178          17         6         18
oc ata v            516      157          13         65        60
oc i2c              598      129          19         19        14
oc correlator       613      219          21         83        2
oc ata ocidec1      697      269          13         65        60
oc sdram            731      112          13         95        90
oc ata ocidec2      842      303          13         65        60
oc rtc              887      114          37         58        35
os sdram16          947      147          22         19        80
oc minirisc         1062     289          23         48        83
oc vid comp sys h   1162     59           13         13        10
oc des area         1192     64           14         125       64
oc vid comp sys h   1328     61           20         19        19
oc hdlc             1341     426          12         61        84
oc smpl fm rcvr     1377     226          34         18        23
oc des des3area     1784     64           22         240       64
oc ata ocidec3      1858     594          15         99        103
oc ata vhd 3        1919     594          15         99        103
oc aes core         2806     402          13         387       258
oc cordic p2r       3175     719          21         50        32
oc aes core inv     3523     669          13         516       395
oc cfft 1024x12     3977     1051         21         52        86
oc vga lcd          4039     1108         34         223       284
os blowfish         4118     891          37         840       282
oc cordic r2p       4495     1015         22         34        40
oc pci              5061     1354         46         304       385
oc ethernet         5633     1272         33         192       239
oc des perf         5818     1976         5          121       64
oc oc8051           5945     754          52         166       176
oc mem ctrl         6792     1825         32         115       152
oc mips             7036     1256         72         54        178
oc wb dma           7680     1775         18         226       218
oc pavr             8051     1231         58         35        55
oc aquarius         11825    1477         99         464       283
oc fpu              15431    659          1030       262       232
oc vid comp sys d   20480    3549         31         1903      1077
oc vid comp sys j   22798    3972         30         1720      985
oc des des3perf     24582    5850         7          346       187
Table A.3. Benchmark Characteristics: OpenCores
Name                # Gates  # Registers  Max Depth  Pri Inps  Pri Outps
intel 001           240      36           35         31        1
intel 004           618      87           86         82        1
intel 002           720      75           72         72        1
intel 003           848      87           105        82        1
intel 005           1538     170          170        165       1
intel 006           2738     350          350        345       1
intel 024           5212     357          614        352       1
intel 023           5226     358          613        353       1
intel 020           5248     354          624        349       1
intel 017           5337     618          440        613       1
intel 021           5373     365          636        360       1
intel 026           5462     492          662        486       1
intel 018           5871     491          740        486       1
intel 019           6100     510          765        505       1
intel 015           7739     553          935        548       1
intel 022           8042     530          954        525       1
intel 029           8045     564          1009       559       1
intel 031           8095     531          956        523       1
intel 011           8233     533          1003       528       1
intel 010           8267     539          994        534       1
intel 007           11387    1307         1329       1302      1
intel 025           11550    1120         1100       1112      1
intel 032           14268    961          1787       890       1
intel 016           24869    2297         2471       2232      1
intel 034           25637    3297         1310       3292      1
intel 014           51042    4309         3197       4293      1
intel 035           64406    4404         5948       4407      1
intel 033           65090    4416         6107       4419      1
intel 027           65271    5143         4337       5127      1
intel 012           71108    5884         4776       5874      1
intel 037           72345    5927         4806       5911      1
intel 030           77923    5397         7138       5400      1
intel 009           77932    5399         7136       5400      1
intel 036           83547    5805         7281       5807      1
intel 043           86086    7223         6006       7213      1
intel 028           88692    7436         6186       7426      1
intel 042           99495    9005         6876       8994      1
intel 038           99712    9010         6876       8992      1
intel 040           101201   9510         7084       9499      1
intel 041           101826   9271         7009       9261      1
intel 039           103147   9501         7086       9493      1
intel 013           159957   13354        10725      13284     1
Table A.4. Benchmark Characteristics: Intel