Advanced Digital Design [VU]
Homework II - Sample Solution
Vienna University of Technology
January 21, 2014

Contents
1 Specialized Synchronizer
  1.1 Branches and Delay Lines
  1.2 Sync derive Circuit
2 Micropipeline
  2.1 Advantages
  2.2 Behavior of the Muller Pipeline
  2.3 Questions
3 Pausable Clocking
4 XOR-Gate for QDI
  4.1 Implementation with RS-Latches
  4.2 Implementation with C-Elements
  4.3 Implementation with D-Latches
5 Protocols
  5.1 4-Phase Bundled Data
  5.2 2-Phase Bundled Data
  5.3 4-Phase Dual Rail (NCL)
  5.4 2-Phase Dual Rail (FSL/LEDR)
References
1 Specialized Synchronizer

1.1 Branches and Delay Lines

Figure 1: Conflict detector circuit (mutex elements ME0 to ME3, each with requests R1/R2 and grants G1/G2; labeled signals: data, data_del, clk, clk_del, data_g1, delayed_data_g2, delayed_data_n_g2, conflict1, conflict2, conflict)
Figure 1 illustrates the conflict detector circuit. In order to get a better understanding
of this circuit a simulation was performed and the results can be seen in Figure 2.
Figure 2: Simulation of the conflict detector circuit
The output of ME0 (data_g1) becomes 1 if the rising edge on data appears before the
rising edge of clk_del, or if data is high while clk becomes 0. The output of ME1
(delayed_data_g2) becomes 1 if a rising edge of clk appears before a rising edge of data_del,
or if clk is 1 and data_del becomes 0. Together, these two MEs form a time window
that detects whether the data signal changes within a defined delay before or after the
rising edge of clk. The delay of clk determines the window in which a change on data
after the change of clk is detected, and the delay of data determines the window in which a
change on data before a change on clk is detected. Both situations are combined into the
signal conflict1. ME2 and ME3 behave in the same way, but use the inverted data and therefore
detect the falling edges of data.
Thus, the delay of clock and data may be used to determine a window around the rising
edge of the clock. The delay of the data signal is related to the setup-time and the delay of
the clock signal is related to the hold time of the flip-flop. The following section discusses
the calculation of appropriate values.
Assumptions
The assumed setup time tsetup = 1.3ns and hold time thold = 1.4ns determine the coarse
values of the delays of the clock and data signals at the conflict detector. It also has to
be taken into consideration that the conflict signal has to be synchronized by the Sync
derive circuit. Because of this synchronization and the realization of the Data Delay FSM,
a conflict is resolved by the FSM 3 cycles after it was detected by the conflict detector.
During these 3 cycles the clock and data still drift together and thus the window of the
conflict detector has to be expanded. The two signals drift by
$$t_{drft} = \frac{1}{f_{clk}} - \frac{1}{f_{data}} = \frac{1}{25.175\,\mathrm{MHz}} - \frac{1}{25.2\,\mathrm{MHz}} = 39.41\,\mathrm{ps}$$

at each clock cycle, thus the window has to be expanded by 3 · 39.41ps = 118.22ps.
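The drift arithmetic above can be checked with a few lines of Python. This is only a sketch: the frequencies and the 3-cycle latency are taken from the text, while the function and constant names are made up for illustration.

```python
# Frequencies as stated in the text; names are illustrative only.
F_CLK_HZ = 25.175e6
F_DATA_HZ = 25.2e6
SYNC_CYCLES = 3  # cycles until the FSM resolves a detected conflict


def drift_per_cycle(f_clk_hz, f_data_hz):
    """Difference between the clock period and the data period, in seconds."""
    return 1.0 / f_clk_hz - 1.0 / f_data_hz


t_drift = drift_per_cycle(F_CLK_HZ, F_DATA_HZ)
window_expansion = SYNC_CYCLES * t_drift

print(f"drift per cycle:  {t_drift * 1e12:.2f} ps")
print(f"window expansion: {window_expansion * 1e12:.2f} ps")
```

The per-cycle value reproduces the drift of about 39.41ps derived above; the expansion is simply three times that value.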
Moreover, the ME elements of the conflict detector may become metastable and it has
to be investigated whether this behavior has an effect on the subsequent circuit. Since the conflict
signal becomes 1 at the rising clock edge if a conflict was detected and 0 again after half
a clock cycle, the conflict signal has to be sampled at the falling clock edge. This is done
by the Sync derive circuit. If an ME element becomes metastable, this metastability has to
be resolved before the first stage of the Sync derive circuit samples the value. Therefore,
it has to resolve within

$$t_{dly} = \frac{T_{clk}}{2} - t_{setup} - t_{CO,nom} = \frac{39.72\,\mathrm{ns}}{2} - 1.3\,\mathrm{ns} - 0.9\,\mathrm{ns} = 17.66\,\mathrm{ns}$$
Note that wire and gate delays are assumed to be zero! The following equation from
the lecture slides describes the output delay tdly as a function of the data to clock distance
∆Tin.

$$t_{dly}(\Delta T_{in}) = \tau_C \cdot \ln\frac{T_{W0}}{\Delta T_{in}}$$
Figure 3 depicts this correlation.
Figure 3: Output delay tdly over input phase ∆Tin (y-axis: tdly from 0 to 900ps; x-axis: ∆Tin from -200ps to 200ps)
The equation can be transformed to calculate the data to clock distance that causes a
certain output delay.

$$t_{dly}(\Delta T_{in}) = \tau_C \cdot \ln\frac{T_{W0}}{\Delta T_{in}} \;\rightarrow\; \Delta T_{in} = T_{W0} \cdot e^{-\frac{t_{dly}}{\tau_C}}$$
If the previously calculated delay of tdly = 17.66ns is inserted into this equation, the
resulting data to clock distance is in the order of 10^-69 and has almost no effect on the
calculation of the decision window. This can also be seen in Figure 3. Since tdly is plotted
on the y-axis, ∆Tin on the x-axis and tdly is very large (in fact it is even outside the plot),
∆Tin has to be extremely small.
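The two steps of this argument (available resolution time, then the input distance that would still be unresolved after that time) can be sketched as follows. The clock period, setup time and clock-to-output delay come from the text. The values of τC and TW0 are not given in this document, so `TAU_C` and `T_W0` below are purely hypothetical placeholders, and the resulting number will not match the order of magnitude quoted in the text.

```python
import math

# Values stated in the text.
T_CLK = 39.72e-9
T_SETUP = 1.3e-9
T_CO_NOM = 0.9e-9

# Hypothetical device parameters (NOT from the text), for illustration only.
TAU_C = 100e-12
T_W0 = 200e-12


def available_resolution_time(t_clk, t_setup, t_co):
    """Time a metastable ME element has to resolve before it is sampled."""
    return t_clk / 2 - t_setup - t_co


def critical_input_distance(t_dly, tau_c, t_w0):
    """Invert t_dly = tau_c * ln(t_w0 / dT_in) to obtain dT_in."""
    return t_w0 * math.exp(-t_dly / tau_c)


t_dly = available_resolution_time(T_CLK, T_SETUP, T_CO_NOM)
d_t_in = critical_input_distance(t_dly, TAU_C, T_W0)

print(f"t_dly  = {t_dly * 1e9:.2f} ns")
print(f"dT_in  = {d_t_in:.3e} s")
```

Whatever realistic parameters are inserted, the exponential makes the critical distance vanishingly small compared to the setup and hold times, which is the point made above.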
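Since the delay elements are assumed to be perfect, they do not have to be considered
separately when calculating the delays.
A good choice for the delay elements, including safety margins, would be

$$data_{del} = t_{setup} + 3 \cdot t_{drft} = 1.3\,\mathrm{ns} + 118.2\,\mathrm{ps} \approx 1.5\,\mathrm{ns}$$
$$clk_{del} = t_{hold} + 3 \cdot t_{drft} = 1.4\,\mathrm{ns} + 118.2\,\mathrm{ps} \approx 1.6\,\mathrm{ns}$$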
1.2 Sync derive Circuit

As can be seen in Figure 2, the conflict signal is only high for the first half of the period
of clk and is therefore sampled at the falling edge of the clock, which is performed by the
two lower flip-flops of the circuit. Since the Sync derive circuit is also used to synchronize
the conflict180 signal, it is also necessary to synchronize on the rising edge of the clock.
The conflict180 signal is derived from a Conflict Detector which uses the inverted clock
and nevertheless has to be synchronous to the positive clock at the Data Delay FSM.
2 Micropipeline

2.1 Advantages

1. Micropipelines are elastic. The rate at which items are stored in and read from the
pipeline may differ. The number of elements stored in the pipeline can also vary.
2. They do not have a common clock and every stage works at its own speed. Therefore
no (global) clock routing is necessary, which makes them very energy efficient.
3. The concept of transition signaling is easy to design and understand.
4. Transition signaling uses both edges of the req/ack signals, which increases the speed
of the circuit.
5. Components which operate at different speeds can easily be connected.
6. The control signals are the same in both directions of the pipeline.
7. Empty pipelines behave like combinatorial logic.
2.2 Behavior of the Muller Pipeline

Figure 4 shows the Muller Pipeline and Figure 5 the related signal traces.
Figure 4: Behaviour of the Muller Pipeline (four C-elements with outputs Q1 to Q4; ports req_in, ack_in, req_out, ack_out)

Figure 5: Behaviour of the Muller Pipeline (signal traces over time t for req_in, ack_in/Q1, Q2, Q3, req_out/Q4 and ack_out; the two requests on req_in are marked REQ1 and REQ2)
2.3 Questions

Is the shown pipeline protocol 2-phase or 4-phase? Explain.
As described in [1, Ch. 2.3], the Muller Pipeline can be used within 2-phase or 4-phase
schemes depending on the implementation of the latches of the datapath. If simple latches
with enable inputs are used, the pipeline implements a 4-phase approach; if capture-pass
latches are used, it uses transition signaling (2-phase). In Figure 5 the two rising edges of
the req_in signal are marked as valid requests, thus a 4-phase scheme is used.
How exactly can you recognize a full pipeline and an empty pipeline?
Because a pipeline only reacts to signal events (rising or falling edges), the
actual logical state of a Muller-C gate only matters with respect to the states of the other
elements. In an empty pipeline every Muller-C element has the same state. Therefore
the empty pipeline stores either 0000 or 1111. If every stage has the opposite state of its
predecessor and successor, the pipeline is full. This case corresponds to the states
0101 and 1010.
How in particular can you distinguish an empty pipeline from one that has only one
entry?
Figure 6 shows an empty pipeline and a pipeline with one entry. The two colors mark the
different logic states of the signals. As mentioned before it is not relevant if the pipeline is
initialized with all elements set to 1 or 0. The difference between the two scenarios shown
in the figure is that the req_out and ack_out signals of the empty pipeline have the same
logical value. The req_out line of the pipeline with one entry in it has a different value
than the ack_out line. This means that there is data in the pipeline which has not been
acknowledged by the following logic (by inverting the ack_out signal).
Figure 6: empty pipeline (left), pipeline with one element (right) (each drawn as four C-elements between req_in/ack_in and req_out/ack_out)
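The criteria for empty, full and one-entry pipelines can be tried out on a small behavioural model. The sketch below assumes the usual Muller pipeline wiring, in which each C-element sees the output of its predecessor and the inverted output of its successor; the zero-delay update loop and all function names are illustrative assumptions, not part of the original solution.

```python
def c_element(a, b, prev):
    """Muller C-element: copy the inputs when they agree, otherwise hold."""
    return a if a == b else prev


def settle(q, req_in, ack_out, max_steps=100):
    """Update all stages in lockstep until the state no longer changes."""
    q = list(q)
    for _ in range(max_steps):
        nxt = []
        for i in range(len(q)):
            left = req_in if i == 0 else q[i - 1]
            right = ack_out if i == len(q) - 1 else q[i + 1]
            nxt.append(c_element(left, 1 - right, q[i]))
        if nxt == q:
            return q
        q = nxt
    raise RuntimeError("pipeline did not settle")


def is_empty(q, ack_out):
    """All stages equal and the last request has been acknowledged."""
    return len(set(q)) == 1 and q[-1] == ack_out


def is_full(q):
    """Every stage differs from its predecessor and successor."""
    return all(a != b for a, b in zip(q, q[1:]))


def output_pending(req_out, ack_out):
    """An entry waits at the output while req_out differs from ack_out."""
    return req_out != ack_out


# Toggle req_in four times while the environment never acknowledges.
state, history = [0, 0, 0, 0], []
for req_in in (1, 0, 1, 0):
    state = settle(state, req_in, ack_out=0)
    history.append(state)

for s in history:
    print(s, "full" if is_full(s) else "not full")
```

With ack_out held at 0, the stages fill up from the output side until they alternate, after which a further event on req_in is no longer accepted by the first C-element.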
3 Pausable Clocking

A pausable clock circuit, based on masking a free running oscillator, must have two important properties:
• There must not be a glitch on clkout
• Neither the low nor the high period of clkout may be shorter than the corresponding
period of clkin
Such a circuit can basically be divided into two main parts as shown in Figure 7. The
clock gate performs the actual masking operation on clkin and is controlled by the enable
signal of the synchronizer.
Figure 7: Pausable clock (a clock gate with input clkin and output clkout, controlled by the enable signal of a synchronizer with req and ack)
Figure 8 shows two possibilities for the clock gate part. Note however, that these gates
have different requirements for the enable signal. The AND gate must only be switched
on (enable = 1) and off (enable = 0) during the low period of clkin , because otherwise
our two constraints from above would be violated. The MUTEX can be switched off
(enable = 1) during both the low and the high period of the clock, but not exactly at the
clock edge (that’s why we need the synchronizer). Care must be taken when the MUTEX
is switched on again (enable = 0). This must only happen during the low period of clkin .
Additionally the MUTEX provides an ack signal indicating if the clkin has been masked.
Figure 8: Possible clock gates (an AND gate with inputs clkin and enable and output clkout; a MUTEX with inputs clkin and enable and outputs clkout and ack)
One very simple solution is shown in Figure 9. It uses an ordinary n-stage D flip-flop
synchronizer to synchronize the incoming req signal to the falling edge of clkin.
This ensures that the generated enable signal for the AND-gate only changes its value at the
(beginning of the) low period of clkin. The MTBU of this circuit can be calculated very
easily. One drawback of this solution is the high delay (n clock cycles) of the circuit.
Figure 9: Pausable clock circuit with AND gate and D-flipflops (n-stage synchronizer of D flip-flops driving the enable input of the AND gate; signals clkin, clkout, enable, ack)
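Why the enable signal may only change during the low period of clkin can be illustrated with a coarse discrete-time model. The sketch below is an idealized simulation under several assumptions that are not in the text: time is counted in ticks, four ticks make up one half period, the asynchronous input is modeled directly as a `run` level (1 = clock enabled), and metastability inside the flip-flops is ignored.

```python
HALF_PERIOD = 4  # ticks per clock half period (hypothetical resolution)


def clkin(t):
    """Free running clock: high during the first half of each period."""
    return 1 if t % (2 * HALF_PERIOD) < HALF_PERIOD else 0


def simulate(run, n_stages=2, n_ticks=80):
    """Return (naive, gated) clkout waveforms as lists of 0/1 per tick.

    naive: clkin AND the raw asynchronous run level.
    gated: clkin AND the run level after an n-stage shift register that is
           clocked on the falling edge of clkin.
    """
    sync = [1] * n_stages
    naive, gated = [], []
    prev = 0
    for t in range(n_ticks):
        clk = clkin(t)
        if prev == 1 and clk == 0:  # falling edge of clkin
            sync = [run(t)] + sync[:-1]
        naive.append(clk & run(t))
        gated.append(clk & sync[-1])
        prev = clk
    return naive, gated


def high_pulse_lengths(wave):
    """Lengths (in ticks) of all high pulses in a waveform."""
    lengths, count = [], 0
    for level in wave + [0]:
        if level:
            count += 1
        elif count:
            lengths.append(count)
            count = 0
    return lengths


def run(t):
    """Asynchronous pause request: clock disabled for ticks 18..44."""
    return 0 if 18 <= t < 45 else 1


naive, gated = simulate(run)
print("naive pulse lengths:", high_pulse_lengths(naive))
print("gated pulse lengths:", high_pulse_lengths(gated))
```

In this model the directly gated clock shows a truncated high pulse when the pause arrives in the middle of the high period, while the synchronized enable only ever passes or suppresses complete pulses, at the cost of the n-cycle latency mentioned above.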
4 XOR-Gate for QDI

Recap: FSL (Four State Logic) uses two rails (data, parity) to encode one bit of information. The data rail carries the binary representation of the transmitted information, while
the parity is used to indicate the current phase (even parity → ϕ0 , odd parity → ϕ1 ). Figure 10 shows a state chart of the encoding scheme. Note that only one rail (data or parity)
toggles its logical value per phase.
Figure 10: FSL encoding (states as (data, parity): l = (0, 0) in ϕ0, h = (1, 1) in ϕ0, L = (0, 1) in ϕ1, H = (1, 0) in ϕ1)
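The encoding rule from the recap (data rail carries the value, parity of both rails gives the phase) can be written down in a few lines. This is a minimal sketch with illustrative function names; it also checks the remark that exactly one rail toggles from one phase to the next.

```python
def fsl_encode(bit, phase):
    """Return (data, parity) for a bit sent in phase 0 or phase 1."""
    return bit, bit ^ phase


def fsl_phase(data, parity):
    """Even parity of the two rails means phase 0, odd parity phase 1."""
    return data ^ parity


def fsl_stream(bits):
    """Encode a bit sequence with alternating phases, starting in phase 0."""
    return [fsl_encode(b, i % 2) for i, b in enumerate(bits)]


words = fsl_stream([0, 1, 1, 0, 0])
toggles = [
    (d0 != d1) + (p0 != p1)
    for (d0, p0), (d1, p1) in zip(words, words[1:])
]

print(words)    # rail values per transmitted bit
print(toggles)  # number of rails that changed between consecutive words
```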
4.1 Implementation with RS-Latches

Figure 11 shows a general structure for an FSL logic gate with two inputs and RS-latches
as state holding elements. In order to construct an XOR gate we now have to implement
the combinatorial logic blocks a-d, which are responsible for setting and resetting the
RS-latches.
Figure 11: 2-input FSL gate with RS-latches (inputs A.d, A.p, B.d, B.p; logic blocks a, b, c, d drive L0.set, L0.reset, L1.set, L1.reset; latch L0 produces Q.d, latch L1 produces Q.p)
A.d A.p B.d B.p | phase(A/B) | L0.s L0.r L1.s L1.r | Q.d Q.p
 0   0   0   0  |   ϕ0/ϕ0    |  0    1    0    1   |  0   0
 0   0   0   1  |   ϕ0/ϕ1    |  0    0    0    0   |  H   H
 0   0   1   0  |   ϕ0/ϕ1    |  0    0    0    0   |  H   H
 0   0   1   1  |   ϕ0/ϕ0    |  1    0    1    0   |  1   1
 0   1   0   0  |   ϕ1/ϕ0    |  0    0    0    0   |  H   H
 0   1   0   1  |   ϕ1/ϕ1    |  0    1    1    0   |  0   1
 0   1   1   0  |   ϕ1/ϕ1    |  1    0    0    1   |  1   0
 0   1   1   1  |   ϕ1/ϕ0    |  0    0    0    0   |  H   H
 1   0   0   0  |   ϕ1/ϕ0    |  0    0    0    0   |  H   H
 1   0   0   1  |   ϕ1/ϕ1    |  1    0    0    1   |  1   0
 1   0   1   0  |   ϕ1/ϕ1    |  0    1    1    0   |  0   1
 1   0   1   1  |   ϕ1/ϕ0    |  0    0    0    0   |  H   H
 1   1   0   0  |   ϕ0/ϕ0    |  1    0    1    0   |  1   1
 1   1   0   1  |   ϕ0/ϕ1    |  0    0    0    0   |  H   H
 1   1   1   0  |   ϕ0/ϕ1    |  0    0    0    0   |  H   H
 1   1   1   1  |   ϕ0/ϕ0    |  0    1    0    1   |  0   0

Table 1: Truthtable describing the FSL XOR gate with RS-latches
Table 1 shows the truthtable for the set and reset inputs of the RS-latches L0 and L1 as
well as the data and parity outputs Q.d and Q.p of the overall circuit. If the inputs A and
B are not phase aligned the output Q must hold its previous value.
The KV maps in Figure 12 are used to derive optimized boolean equations in CNF.
Figure 12: KV maps for the function blocks a-d (L0.set, L0.reset, L1.set and L1.reset over the inputs A.d, A.p, B.d, B.p)
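$$
\begin{aligned}
L0.set &= (A.d \lor B.d) \land (\overline{A.d} \lor \overline{B.d}) \land (A.p \lor B.p) \land (\overline{A.p} \lor \overline{B.p}) \\
&= (A.d \text{ xor } B.d) \land (A.p \text{ xor } B.p) \\
L0.reset &= (A.d \lor \overline{B.d}) \land (\overline{A.d} \lor B.d) \land (A.p \lor \overline{B.p}) \land (\overline{A.p} \lor B.p) \\
&= (A.d \text{ xnor } B.d) \land (A.p \text{ xnor } B.p) \\
L1.set &= (A.p \lor B.d) \land (\overline{A.p} \lor \overline{B.d}) \land (A.d \lor B.p) \land (\overline{A.d} \lor \overline{B.p}) \\
&= (A.p \text{ xor } B.d) \land (A.d \text{ xor } B.p) \\
L1.reset &= (A.d \lor \overline{B.p}) \land (\overline{A.d} \lor B.p) \land (A.p \lor \overline{B.d}) \land (\overline{A.p} \lor B.d) \\
&= (A.d \text{ xnor } B.p) \land (A.p \text{ xnor } B.d)
\end{aligned}
$$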
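The xor/xnor forms of the four equations can be checked exhaustively against the intended behaviour (XOR of the data values, same phase as the inputs, hold when the inputs are not phase aligned). The following sketch assumes an idealized zero-delay RS-latch; the helper names are illustrative.

```python
from itertools import product


def set_reset(ad, ap, bd, bp):
    """Set/reset signals of L0 and L1 according to the equations above."""
    l0 = ((ad ^ bd) & (ap ^ bp), (1 ^ ad ^ bd) & (1 ^ ap ^ bp))
    l1 = ((ap ^ bd) & (ad ^ bp), (1 ^ ad ^ bp) & (1 ^ ap ^ bd))
    return l0, l1


def rs_latch(s, r, q):
    """Ideal RS-latch; S = R = 1 must not occur for settled inputs."""
    if s and r:
        raise ValueError("set and reset active at the same time")
    return 1 if s else 0 if r else q


def reference_xor(ad, ap, bd, bp, q_prev):
    """Expected (Q.d, Q.p): hold if A and B are not phase aligned."""
    phase_a, phase_b = ad ^ ap, bd ^ bp
    if phase_a != phase_b:
        return q_prev
    qd = ad ^ bd
    return qd, qd ^ phase_a


def check_all():
    """Compare the gate with the reference for all inputs and states."""
    for ad, ap, bd, bp in product((0, 1), repeat=4):
        for q_prev in product((0, 1), repeat=2):
            (s0, r0), (s1, r1) = set_reset(ad, ap, bd, bp)
            got = (rs_latch(s0, r0, q_prev[0]), rs_latch(s1, r1, q_prev[1]))
            if got != reference_xor(ad, ap, bd, bp, q_prev):
                return False
    return True


print("equations match the truth table:", check_all())
```

Note that this only covers settled input vectors; it says nothing about the transient behaviour between two vectors.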
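By rearranging the equations for L0.reset and L1.reset, we can reuse sub terms already calculated for the set inputs of the latches:

$$
\begin{aligned}
L0.reset &= (A.d \text{ xnor } B.d) \land (A.p \text{ xnor } B.p) = \overline{(A.d \text{ xor } B.d)} \land \overline{(A.p \text{ xor } B.p)} \\
&= \overline{(A.d \text{ xor } B.d) \lor (A.p \text{ xor } B.p)} \\
L1.reset &= (A.d \text{ xnor } B.p) \land (A.p \text{ xnor } B.d) = \overline{(A.d \text{ xor } B.p)} \land \overline{(A.p \text{ xor } B.d)} \\
&= \overline{(A.d \text{ xor } B.p) \lor (A.p \text{ xor } B.d)}
\end{aligned}
$$

If we put everything together we obtain the circuit shown in Figure 13.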
Figure 13: FSL XOR gate with RS-latches (inputs A.d, A.p, B.d, B.p; latch L0 produces Q.d, latch L1 produces Q.p)
Note that in this circuit it is possible that both the set and the reset inputs of one of the
RS-latches are set to one at the same time. If the RS-latch reacts to this condition by setting
the output Q to one, the circuit might produce invalid outputs. For the following discussion
we refer to the circuit shown in Figure 11. Consider the following scenario: Assume that
the delay of logic block b is much higher than the one of logic block a (∆a << ∆b ).
The input vector of the circuit (A.d, A.p, B.d, B.p) is set to (0, 0, 0, 0). Hence L0.reset
is one, L0.set is zero and the output Q.d is zero as well. Now the input vector (1, 0, 0, 1)
is applied. After ∆a a one will emerge on L0.set, and since ∆a << ∆b L0.reset stays
high as well. The RS-latch sets the output Q.d, the handshaking mechanism proceeds and
the next input vector (0, 0, 1, 1) is applied to the circuit. If the set-logic is analyzed for
hazards, it turns out that the transition from (1, 0, 0, 1) to (0, 0, 1, 1) produces a static-1 (S1) hazard
on L0.set. But since L0.reset is still set (because of the long delay ∆b ), the glitch is
propagated through the RS-latch and visible on the output Q.d.
Although this scenario might seem very unlikely, it demonstrates an important concept
in QDI circuit design. Every transition that happens during the evaluation of some input
data must be "visible" on the output. This means that a circuit must not proceed to the next
data-wave before all parts of the circuit are finished processing the current data-wave. If
this cannot be guaranteed (like in our case), timing/delay assumptions are required.
4.2 Implementation with C-Elements

The FSL XOR gate can also be implemented using C-gates instead of RS-latches as state
holding elements. Figure 14 shows this transformation. If both inputs (set and reset) are
set to zero, the C-gate holds its state. A one on the set input results in a one on the output Q,
i.e. the C-gate is set. Applying a one to the reset input (and a zero to the set input) resets
the C-gate.
Because of the different behavior of the C-gate, the resulting circuit, shown in Figure
15, does not suffer from the problem discussed in the previous section. Assume reset is
one and set is zero, i.e. the C-gate output is zero. If set now switches to one as well, the
output of the C-gate will not change until reset switches to low.
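Figure 14: RS-latch, C-gate (the RS-latch with inputs S and R and output Q is replaced by a C-gate driven by set and reset)

Figure 15: FSL XOR gate with C-gates (inputs A.d, A.p, B.d, B.p; one C-gate produces Q.d, the other Q.p)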
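The difference between the two state holding elements in the critical case (set and reset both one) can be made concrete with two small behavioural models. Both are assumptions for illustration: the RS-latch is modeled as set-dominant, which is the problematic reaction considered in Section 4.1, and the C-gate is modeled as a C-element fed with set and the inverted reset, which matches the behaviour described above.

```python
def rs_latch_set_dominant(s, r, q):
    """RS-latch that reacts to S = R = 1 by driving Q to one (assumed)."""
    if s:
        return 1
    if r:
        return 0
    return q


def c_gate(s, r, q):
    """C-element with inputs set and NOT reset (assumed wiring)."""
    a, b = s, 1 - r
    return a if a == b else q


def run(element, stimulus, q=0):
    """Apply a sequence of (set, reset) pairs and record the output."""
    trace = []
    for s, r in stimulus:
        q = element(s, r, q)
        trace.append(q)
    return trace


# reset active, then set arrives early, then reset finally drops, then idle
stimulus = [(0, 1), (1, 1), (1, 0), (0, 0)]

print("RS-latch:", run(rs_latch_set_dominant, stimulus))
print("C-gate:  ", run(c_gate, stimulus))
```

In this model the RS-latch output already rises while reset is still one, whereas the C-gate waits until reset has been withdrawn.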
4.3 Implementation with D-Latches

Table 2 shows the truthtable for the D-latch implementation of the FSL XOR gate. If the
input signals A and B are not phase aligned the en signal of the latches is zero. In this
case the values of L0.d and L1.d do not matter and we can therefore set them to x (don’t
care). Observe that in rows where the enable signal of the latches (en) is one the input
rails (A.d, A.p, B.d, B.p) have an even parity, i.e. the number of ones on the input rails is
even. Hence a parity function can be used to generate this signal.
A.d A.p B.d B.p | phase(A/B) | L0.d L1.d en | Q.d Q.p
 0   0   0   0  |   ϕ0/ϕ0    |  0    0   1  |  0   0
 0   0   0   1  |   ϕ0/ϕ1    |  x    x   0  |  H   H
 0   0   1   0  |   ϕ0/ϕ1    |  x    x   0  |  H   H
 0   0   1   1  |   ϕ0/ϕ0    |  1    1   1  |  1   1
 0   1   0   0  |   ϕ1/ϕ0    |  x    x   0  |  H   H
 0   1   0   1  |   ϕ1/ϕ1    |  0    1   1  |  0   1
 0   1   1   0  |   ϕ1/ϕ1    |  1    0   1  |  1   0
 0   1   1   1  |   ϕ1/ϕ0    |  x    x   0  |  H   H
 1   0   0   0  |   ϕ1/ϕ0    |  x    x   0  |  H   H
 1   0   0   1  |   ϕ1/ϕ1    |  1    0   1  |  1   0
 1   0   1   0  |   ϕ1/ϕ1    |  0    1   1  |  0   1
 1   0   1   1  |   ϕ1/ϕ0    |  x    x   0  |  H   H
 1   1   0   0  |   ϕ0/ϕ0    |  1    1   1  |  1   1
 1   1   0   1  |   ϕ0/ϕ1    |  x    x   0  |  H   H
 1   1   1   0  |   ϕ0/ϕ1    |  x    x   0  |  H   H
 1   1   1   1  |   ϕ0/ϕ0    |  0    0   1  |  0   0

Table 2: Truthtable describing the FSL XOR gate with D-latches

The KV maps in Figure 16 are used to derive optimized boolean equations for the
function blocks f and g. The "don’t care" elements are used to construct larger groups of
ones.
Figure 16: KV maps for the function blocks f and g (L0.d and L1.d over the inputs A.d, A.p, B.d, B.p)
Figure 17 shows the resulting circuit. The logic blocks for L0.d and L1.d are reduced
to two xor gates. The parity function can be constructed with three xor gates and an
inverter.

$$L0.d = (A.d \land \overline{B.d}) \lor (\overline{A.d} \land B.d) = A.d \text{ xor } B.d$$
$$L1.d = (A.d \land \overline{B.p}) \lor (\overline{A.d} \land B.p) = A.d \text{ xor } B.p$$
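The D-latch variant can be modeled in the same way as the truth table: the enable is the even parity of the four input rails and the two data inputs are the xor terms above. The sketch below uses an ideal, zero-delay transparent latch, so it deliberately ignores any race between enable and data; the function names are illustrative.

```python
from itertools import product


def latch_inputs(ad, ap, bd, bp):
    """Return (L0.d, L1.d, en); en is one for an even number of ones."""
    en = 1 ^ ad ^ ap ^ bd ^ bp
    return ad ^ bd, ad ^ bp, en


def d_latch(d, en, q):
    """Ideal transparent latch: follow d while en is one, else hold."""
    return d if en else q


def fsl_xor(a, b, q_prev):
    """a, b and q_prev are (data, parity) pairs; returns the new (Q.d, Q.p)."""
    l0d, l1d, en = latch_inputs(a[0], a[1], b[0], b[1])
    return d_latch(l0d, en, q_prev[0]), d_latch(l1d, en, q_prev[1])


def matches_reference():
    """Check all settled inputs: XOR when phase aligned, hold otherwise."""
    for ad, ap, bd, bp in product((0, 1), repeat=4):
        for q_prev in product((0, 1), repeat=2):
            if ad ^ ap == bd ^ bp:
                qd = ad ^ bd
                expected = (qd, qd ^ ad ^ ap)
            else:
                expected = q_prev
            if fsl_xor((ad, ap), (bd, bp), q_prev) != expected:
                return False
    return True


q = (0, 0)
q = fsl_xor((1, 0), (0, 0), q)   # A already in phase 1, B still in phase 0
held = q
q = fsl_xor((1, 0), (0, 1), q)   # B follows into phase 1
print(held, q)
print("matches truth table:", matches_reference())
```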
Figure 17: FSL XOR gate with D-latches (inputs A.d, A.p, B.d, B.p; D-latches L0 and L1 with inputs D and EN produce Q.d and Q.p)
It is obvious that the D-latch circuit is not much cheaper (in terms of logic gates)
than the previously presented ones. However, it has serious disadvantages, which make
it not applicable in practice. As mentioned before, the idea behind this circuit is that the
parity function disables the data inputs of the D-latches as soon as the inputs are not phase
aligned. Note that this assumption leads to a race condition between the enable and data
inputs (L0.d and L1.d) of the D-latches. We use ∆P and ∆f,g to refer to the delays of the
parity function and the logic blocks f and g, respectively. The enable signal must switch to
zero before the data at the D-inputs of the latches becomes invalid (∆P < ∆f,g ). However,
it must not switch to one again before the signals on the D-inputs of the latches are valid
(∆P > ∆f,g ). Both conditions cannot hold at the same time for fixed delays.
5 Protocols

5.1 4-Phase Bundled Data

Figure 18 shows the waveform of the 4-phase bundled data protocol. The transmitting
pipeline stage signals the receiving stage that new data is available by a rising edge on req,
which is eventually acknowledged by a rising edge on ack. To complete the handshake
the transmitting stage deasserts req, which is acknowledged by a falling edge on ack. Since
it must be guaranteed that the transmitted data is valid and stable at the receiving pipeline
latch before the request signal arrives, the request line must be equipped with a delay line.
The delay must be at least as long as the time it takes the data to pass through the data path
(combinatorial logic) between the two pipeline latches.
Figure 18: 4-phase bundled data protocol (traces: req, ack, data; transmitted data words: 01, 00, 00)
5.2 2-Phase Bundled Data

Figure 19 shows the waveform of the 2-phase bundled data protocol. A rising edge on req
is acknowledged by a rising edge on ack, and a falling edge on req by a falling edge on ack.
Corresponding edges are marked with the same color. As with 4-phase bundled data, a
matched delay is required for the request signal.
Figure 19: 2-phase bundled data protocol (traces: req, ack, data; transmitted data words: 01, 00, 00)
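The difference between the two bundled data handshakes can be summarized as event sequences. The sketch below only lists the order of req/ack transitions per transfer and adds a simple check for the bundling constraint; the function names and the example delay values are illustrative assumptions.

```python
def four_phase(n_transfers):
    """req/ack events of a 4-phase handshake: both return to zero."""
    events = []
    for _ in range(n_transfers):
        events += [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]
    return events


def two_phase(n_transfers):
    """req/ack events of a 2-phase handshake: every edge is an event."""
    events, level = [], 0
    for _ in range(n_transfers):
        level ^= 1
        events += [("req", level), ("ack", level)]
    return events


def bundling_ok(matched_delay, data_path_delay):
    """The delayed request must not arrive before the data is stable."""
    return matched_delay >= data_path_delay


print(four_phase(1))
print(two_phase(2))
# Hypothetical delays in seconds, only to show the check.
print(bundling_ok(matched_delay=1.2e-9, data_path_delay=1.0e-9))
```

Per transfer the 4-phase protocol needs four signal transitions and the 2-phase protocol two, while the matched delay requirement is the same for both.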
5.3 4-Phase Dual Rail (NCL)

Figure 20 shows the waveform of the 4-phase dual rail protocol. The transmission starts
in the null phase. To transmit data, either the true or the false rail of a signal (D0 or D1)
has to switch to one. A completion detector asserts the ack signal if all input signals (D0
and D1) carry valid data (data phase). Hence there is no explicit request; the request is
implicitly encoded in the data. This also means that no matched delay is required. To
complete the handshake all rails have to switch to zero again (null phase), which is in turn
acknowledged by the completion detector by deasserting the ack signal. The time periods
where all signals are in their data phase are marked gray, whereas the null phase is marked
blue. Note that the length of each phase can be arbitrarily long and can even vary during
runtime.
Figure 20: 4-phase dual rail (NCL) protocol (traces: ack, D0.t, D0.f, D1.t, D1.f; data value sequence: NULL, 01, NULL, 00, NULL, 00, NULL)
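The role of the completion detector in the 4-phase dual rail protocol can be sketched with a small model. The rail order (true rail, false rail), the hold behaviour while only some signals have switched, and all names are assumptions made for this illustration.

```python
NULL = (0, 0)


def encode(bit):
    """Dual rail code word as (true rail, false rail)."""
    return (1, 0) if bit else (0, 1)


def completion_detector(signals, ack_prev):
    """ack = 1 once all signals are valid, 0 once all are NULL, else hold."""
    if all(t != f for t, f in signals):
        return 1
    if all(pair == NULL for pair in signals):
        return 0
    return ack_prev


def trace_ack(steps, ack=0):
    """Apply a sequence of (D0, D1) rail states and record ack."""
    result = []
    for signals in steps:
        ack = completion_detector(signals, ack)
        result.append(ack)
    return result


# One handshake: D0 becomes valid first, D1 later; then both return to NULL.
steps = [
    (NULL, NULL),
    (encode(0), NULL),
    (encode(0), encode(1)),
    (NULL, encode(1)),
    (NULL, NULL),
]
print(trace_ack(steps))
```

The ack signal only changes when every signal has reached the new phase, which is why the individual rails may switch at arbitrary times without a matched delay.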
5.4 2-Phase Dual Rail (FSL/LEDR)

Figure 21 shows the waveform of the 2-phase dual rail protocol. The first data word (01) is
transmitted in phase 0. Just like in the NCL protocol, a completion detector is required to
check if all data signals (D0 and D1) are phase aligned. A matched delay is not necessary
since the request is again implicitly encoded in the data. If all data signals are in
phase 0, the completion detector asserts the ack signal (which completes one handshake).
The ack signal stays high until all data signals carry valid phase 1 values. The time periods
where all signals are phase aligned are marked gray (phase 0) and blue (phase 1).
Figure 21: 2-Phase dual rail (FSL/LEDR) (traces: ack, D0.d, D0.p, D1.d, D1.p; data values 01, 00, 00 transmitted in phases 0, 1, 0)
References
[1] J. Sparsø. Asynchronous circuit design - a tutorial, Dec. 2001.