Advanced Digital Design [VU]
Homework II - Sample Solution
Vienna University of Technology
January 21, 2014

Contents

1 Specialized Synchronizer
  1.1 Branches and Delay Lines
  1.2 Sync derive Circuit
2 Micropipeline
  2.1 Advantages
  2.2 Behavior of the Muller Pipeline
  2.3 Questions
3 Pausable Clocking
4 XOR-Gate for QDI
  4.1 Implementation with RS-Latches
  4.2 Implementation with C-Elements
  4.3 Implementation with D-Latches
5 Protocols
  5.1 4-Phase Bundled Data
  5.2 2-Phase Bundled Data
  5.3 4-Phase Dual Rail (NCL)
  5.4 2-Phase Dual Rail (FSL/LEDR)
References

1 Specialized Synchronizer

1.1 Branches and Delay Lines

[Figure 1: Conflict detector circuit (mutual exclusion elements ME0 to ME3 with request inputs R1, R2 and grant outputs G1, G2; input signals data, data_del, clk, clk_del; output signals conflict1, conflict2, conflict)]

Figure 1 illustrates the conflict detector circuit. To get a better understanding of this circuit a simulation was performed; the results can be seen in Figure 2.

[Figure 2: Simulation of the conflict detector circuit]

The output of ME0 (data_g1) becomes 1 if the rising edge on data appears before the rising edge of clk_del, or if data is high while clk becomes 0. The output of ME1 (delayed_data_g2) becomes 1 if a rising edge of clk appears before a rising edge of data_del, or if clk is 1 and data_del becomes 0. This means that these two MEs form a time window which is able to detect whether the data signal changes within a defined delay before or after the rising edge of clk.
The delay of clk determines the window in which a change on data after the change of clk is detected, and the delay of data determines the window in which a change on data before a change on clk is detected. Both situations are combined into the signal conflict1. ME2 and ME3 behave in the same way, but use the inverted data and therefore detect the falling edges of data. Thus, the delays of clock and data may be used to define a window around the rising edge of the clock. The delay of the data signal is related to the setup time and the delay of the clock signal is related to the hold time of the flip-flop. The following section discusses the calculation of appropriate values.

Assumptions

The assumed setup time tsetup = 1.3ns and hold time thold = 1.4ns determine the coarse values of the delays of the clock and data signals at the conflict detector. It also has to be taken into consideration that the conflict signal has to be synchronized by the Sync derive circuit. Because of this synchronization and the way the Data Delay FSM is realized, the FSM resolves a conflict 3 cycles after it was detected by the conflict detector. During these 3 cycles the clock and data continue to drift relative to each other, and thus the window of the conflict detector has to be expanded. The two signals drift by

\[ t_{drft} = \frac{1}{f_{clk}} - \frac{1}{f_{data}} = \frac{1}{25.175\,\mathrm{MHz}} - \frac{1}{25.2\,\mathrm{MHz}} = 39.41\,\mathrm{ps} \]

at each clock cycle, thus the window has to be expanded by 3 · 39.41ps ≈ 118.2ps.

Moreover, the ME elements of the conflict detector may become metastable, and it has to be investigated whether this behavior has an effect on the subsequent circuit. Since the conflict signal becomes 1 at the rising clock edge if a conflict was detected and 0 again after half a clock cycle, the conflict signal has to be sampled at the falling clock edge. This is done by the Sync derive circuit. If an ME element becomes metastable, this metastability has to be resolved before the first stage of the Sync derive circuit samples the value.
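The drift arithmetic above is easy to recheck numerically. The following is only a sketch: the two frequencies and the 3-cycle latency are taken from the text, while the function and constant names are made up for this illustration.

```python
# Sketch only: recompute the clock/data drift from the frequencies quoted in
# the text. All names below are hypothetical helpers, not part of the homework.
F_CLK_HZ = 25.175e6      # clock frequency from the text
F_DATA_HZ = 25.2e6       # data frequency from the text
RESOLVE_CYCLES = 3       # cycles until the Data Delay FSM resolves a conflict


def drift_per_cycle(f_clk_hz: float, f_data_hz: float) -> float:
    """Difference of the two periods in seconds (drift per clock cycle)."""
    return 1.0 / f_clk_hz - 1.0 / f_data_hz


def window_expansion(f_clk_hz: float, f_data_hz: float, cycles: int) -> float:
    """Drift accumulated over the given number of cycles, in seconds."""
    return cycles * drift_per_cycle(f_clk_hz, f_data_hz)


if __name__ == "__main__":
    per_cycle = drift_per_cycle(F_CLK_HZ, F_DATA_HZ)
    total = window_expansion(F_CLK_HZ, F_DATA_HZ, RESOLVE_CYCLES)
    print(f"drift per cycle:  {per_cycle * 1e12:.2f} ps")
    print(f"window expansion: {total * 1e12:.2f} ps")
```

Working in seconds and converting to picoseconds only for printing keeps the rounding out of the intermediate results.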
The metastability therefore has to resolve within

\[ t_{dly} = \frac{T_{clk}}{2} - t_{setup} - t_{CO,nom} = \frac{39.72\,\mathrm{ns}}{2} - 1.3\,\mathrm{ns} - 0.9\,\mathrm{ns} = 17.66\,\mathrm{ns} \]

Note that wire and gate delays are assumed to be zero! The following equation from the lecture slides describes the output delay tdly as a function of the data to clock distance ∆Tin.

\[ t_{dly}(\Delta T_{in}) = \tau_C \cdot \ln\frac{T_{W0}}{\Delta T_{in}} \]

Figure 3 depicts this correlation.

[Figure 3: Output delay tdly (y-axis, 0 to 900ps) over input phase ∆Tin (x-axis, -200ps to 200ps)]

The equation can be transformed to calculate the data to clock distance that causes a certain output delay.

\[ t_{dly}(\Delta T_{in}) = \tau_C \cdot \ln\frac{T_{W0}}{\Delta T_{in}} \quad\rightarrow\quad \Delta T_{in} = T_{W0} \cdot e^{-\frac{t_{dly}}{\tau_C}} \]

If the previously calculated delay of tdly = 17.66ns is inserted into this equation, the resulting data to clock distance is on the order of 10^-69 and has almost no effect on the calculation of the decision window. This can also be seen in Figure 3: tdly is plotted on the y-axis and ∆Tin on the x-axis, and since tdly is very large (in fact it is even outside the plot), ∆Tin has to be extremely small. Since the delay elements are assumed to be perfect, they need not be considered when calculating the delays. A good choice for the delay elements, including safety margins, would be

\[ data_{del} = t_{setup} + 3 \cdot t_{drft} = 1.3\,\mathrm{ns} + 118.2\,\mathrm{ps} \approx 1.5\,\mathrm{ns} \]
\[ clk_{del} = t_{hold} + 3 \cdot t_{drft} = 1.4\,\mathrm{ns} + 118.2\,\mathrm{ps} \approx 1.6\,\mathrm{ns} \]

1.2 Sync derive Circuit

As can be seen in Figure 2, the conflict signal is only high for the first half of the period of clk and is therefore sampled at the falling edge of the clock, which is performed by the two lower flip-flops of the circuit. Since the Sync derive circuit is also used to synchronize the conflict180 signal, it is also necessary to synchronize on the rising edge of the clock. The conflict180 signal is derived from a Conflict Detector which uses the inverted clock and nevertheless has to be synchronous to the positive clock at the Data Delay FSM.

2 Micropipeline

2.1 Advantages

1. Micropipelines are elastic.
   The speed at which items are stored in and read from the pipeline may differ. Also the number of elements stored in the pipeline can vary.
2. They do not have a common clock and every stage works at its own speed. Therefore no (global) clock routing is necessary, which makes them very energy efficient.
3. The concept of transition signaling is easy to design and understand.
4. Transition signaling uses both edges of the req/ack signals, which increases the speed of the circuit.
5. Components which operate at different speeds can easily be connected.
6. The control signals are symmetric in both directions of the pipeline.
7. Empty pipelines behave like combinatorial logic.

2.2 Behavior of the Muller Pipeline

Figure 4 shows the Muller Pipeline and Figure 5 the related signal traces.

[Figure 4: Behaviour of the Muller Pipeline (four C-elements with outputs Q1 to Q4 between req_in/ack_in and req_out/ack_out)]

[Figure 5: Behaviour of the Muller Pipeline (traces of req_in, ack_in/Q1, Q2, Q3, req_out/Q4 and ack_out over time t; requests REQ1 and REQ2 marked)]

2.3 Questions

Is the shown pipeline protocol 2-phase or 4-phase? Explain.

As described in [1, Ch. 2.3] the Muller Pipeline can be used within 2-phase or 4-phase schemes depending on the implementation of the latches of the datapath. If simple latches with enable inputs are used, the pipeline implements a 4-phase approach; if capture-pass latches are used, it uses transition signaling (2-phase). In Figure 5 the two rising edges of the req_in signal are marked as valid requests, thus a 4-phase scheme is used.

How exactly can you recognize a full pipeline and an empty pipeline?

Because a pipeline only reacts to signal events (rising or falling edges), the actual logical state of a Muller-C gate only matters with respect to the states of the other elements. In an empty pipeline every Muller-C element has the same state. Therefore the empty pipeline stores either 0000 or 1111.
If every stage has the opposite state of its predecessor and successor, the pipeline is full. This case corresponds to the states 0101 and 1010.

How in particular can you distinguish an empty pipeline from one that has only one entry?

Figure 6 shows an empty pipeline and a pipeline with one entry. The two colors mark the different logic states of the signals. As mentioned before, it is not relevant whether the pipeline is initialized with all elements set to 1 or to 0. The difference between the two scenarios shown in the figure is that the req_out and ack_out signals of the empty pipeline have the same logical value, whereas the req_out line of the pipeline with one entry has a different value than the ack_out line. This means that there is data in the pipeline which has not yet been acknowledged by the following logic (by inverting the ack_out signal).

[Figure 6: empty pipeline (left), pipeline with one element (right)]

3 Pausable Clocking

A pausable clock circuit, based on masking a free running oscillator, must have two important properties:

• There must not be a glitch on clkout
• It must not be possible that the low or high period of clkout is shorter than that of clkin

Such a circuit can basically be divided into two main parts as shown in Figure 7. The clock gate performs the actual masking operation on clkin and is controlled by the enable signal of the synchronizer.

[Figure 7: Pausable clock (clock gate between clkin and clkout, controlled by the enable output of a synchronizer with req and ack)]

Figure 8 shows two possibilities for the clock gate part. Note however, that these gates have different requirements for the enable signal. The AND gate must only be switched on (enable = 1) and off (enable = 0) during the low period of clkin, because otherwise our two constraints from above would be violated.
The MUTEX can be switched off (enable = 1) during both the low and the high period of the clock, but not exactly at the clock edge (that's why we need the synchronizer). Care must be taken when the MUTEX is switched on again (enable = 0). This must only happen during the low period of clkin. Additionally the MUTEX provides an ack signal indicating whether clkin has been masked.

[Figure 8: Possible clock gates (AND gate with inputs clkin and enable and output clkout; MUTEX with inputs clkin and enable and outputs clkout and ack)]

One very simple solution is shown in Figure 9. It uses an ordinary n-stage D flip-flop synchronizer to synchronize the incoming req signal to the falling edge of clkin. This ensures that the generated enable signal for the AND gate only changes its value at the (beginning of the) low period of clkin. The MTBU of this circuit can be calculated very easily. One drawback of this solution is the high delay (n clock cycles) of the circuit.

[Figure 9: Pausable clock circuit with AND gate and D-flipflops (n-stage synchronizer)]

4 XOR-Gate for QDI

Recap: FSL (Four State Logic) uses two rails (data, parity) to encode one bit of information. The data rail carries the binary representation of the transmitted information, while the parity is used to indicate the current phase (even parity → ϕ0, odd parity → ϕ1). Figure 10 shows a state chart of the encoding scheme. Note that only one rail (data or parity) toggles its logical value per phase.

[Figure 10: FSL encoding, states written as (data, parity): l(0, 0) in ϕ0, h(1, 1) in ϕ0, L(0, 1) in ϕ1, H(1, 0) in ϕ1]

4.1 Implementation with RS-Latches

Figure 11 shows a general structure for an FSL logic gate with two inputs and RS-latches as state holding elements. In order to construct an XOR gate we now have to implement the combinatorial logic blocks a-d, which are responsible for setting and resetting the RS-latches.
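The encoding recapped at the start of this section can be sketched in a few lines. This is a minimal illustration, assuming the four code words of Figure 10; the function names are hypothetical, and the rule "parity rail = bit XOR phase" is read off those four states rather than quoted from the lecture.

```python
# Sketch of the FSL code words from Figure 10, written as (data, parity).
# Function names are invented for this example.

def fsl_encode(bit: int, phase: int) -> tuple[int, int]:
    """Return (data, parity) for a bit sent in the given phase (0 or 1)."""
    return bit, bit ^ phase


def fsl_phase(data: int, parity: int) -> int:
    """Phase of a code word: 0 for even parity of the two rails, 1 for odd."""
    return data ^ parity


def rails_toggled(old: tuple[int, int], new: tuple[int, int]) -> int:
    """Number of rails that change between two code words."""
    return (old[0] ^ new[0]) + (old[1] ^ new[1])


if __name__ == "__main__":
    for phase in (0, 1):
        for bit in (0, 1):
            print(f"bit={bit} phase={phase} -> (d, p)={fsl_encode(bit, phase)}")
```

Alternating the phase between consecutive code words changes exactly one rail regardless of the transmitted bits, which is the property noted in the recap.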
[Figure 11: 2-input FSL gate with RS-latches (inputs A.d, A.p, B.d, B.p; logic blocks a, b, c, d driving L0.set, L0.reset, L1.set, L1.reset; RS-latches L0 and L1 with outputs Q.d and Q.p)]

A.d A.p B.d B.p | phase(A/B) | L0.s L0.r L1.s L1.r | Q.d Q.p
 0   0   0   0  |   ϕ0/ϕ0    |  0    1    0    1   |  0   0
 0   0   0   1  |   ϕ0/ϕ1    |  0    0    0    0   |  H   H
 0   0   1   0  |   ϕ0/ϕ1    |  0    0    0    0   |  H   H
 0   0   1   1  |   ϕ0/ϕ0    |  1    0    1    0   |  1   1
 0   1   0   0  |   ϕ1/ϕ0    |  0    0    0    0   |  H   H
 0   1   0   1  |   ϕ1/ϕ1    |  0    1    1    0   |  0   1
 0   1   1   0  |   ϕ1/ϕ1    |  1    0    0    1   |  1   0
 0   1   1   1  |   ϕ1/ϕ0    |  0    0    0    0   |  H   H
 1   0   0   0  |   ϕ1/ϕ0    |  0    0    0    0   |  H   H
 1   0   0   1  |   ϕ1/ϕ1    |  1    0    0    1   |  1   0
 1   0   1   0  |   ϕ1/ϕ1    |  0    1    1    0   |  0   1
 1   0   1   1  |   ϕ1/ϕ0    |  0    0    0    0   |  H   H
 1   1   0   0  |   ϕ0/ϕ0    |  1    0    1    0   |  1   1
 1   1   0   1  |   ϕ0/ϕ1    |  0    0    0    0   |  H   H
 1   1   1   0  |   ϕ0/ϕ1    |  0    0    0    0   |  H   H
 1   1   1   1  |   ϕ0/ϕ0    |  0    1    0    1   |  0   0

Table 1: Truthtable describing the FSL XOR gate with RS-latches

Table 1 shows the truthtable for the set and reset inputs of the RS-latches L0 and L1 as well as the data and parity outputs Q.d and Q.p of the overall circuit. If the inputs A and B are not phase aligned the output Q must hold its previous value (marked H in the table). The KV maps in Figure 12 are used to derive optimized boolean equations in CNF.
[Figure 12: KV maps for the function blocks a-d (one map each for L0.set, L0.reset, L1.set and L1.reset over A.d, A.p, B.d, B.p)]

\[ L0.set = (A.d \lor B.d) \land (\overline{A.d} \lor \overline{B.d}) \land (A.p \lor B.p) \land (\overline{A.p} \lor \overline{B.p}) = (A.d \;\mathrm{xor}\; B.d) \land (A.p \;\mathrm{xor}\; B.p) \]
\[ L0.reset = (A.d \lor \overline{B.d}) \land (\overline{A.d} \lor B.d) \land (A.p \lor \overline{B.p}) \land (\overline{A.p} \lor B.p) = (A.d \;\mathrm{xnor}\; B.d) \land (A.p \;\mathrm{xnor}\; B.p) \]
\[ L1.set = (A.p \lor B.d) \land (\overline{A.p} \lor \overline{B.d}) \land (A.d \lor B.p) \land (\overline{A.d} \lor \overline{B.p}) = (A.p \;\mathrm{xor}\; B.d) \land (A.d \;\mathrm{xor}\; B.p) \]
\[ L1.reset = (A.d \lor \overline{B.p}) \land (\overline{A.d} \lor B.p) \land (A.p \lor \overline{B.d}) \land (\overline{A.p} \lor B.d) = (A.d \;\mathrm{xnor}\; B.p) \land (A.p \;\mathrm{xnor}\; B.d) \]

By rearranging the equations for L0.reset and L1.reset, we can reuse sub terms already calculated for the set inputs of the latches:

\[ L0.reset = (A.d \;\mathrm{xnor}\; B.d) \land (A.p \;\mathrm{xnor}\; B.p) = \overline{(A.d \;\mathrm{xor}\; B.d)} \land \overline{(A.p \;\mathrm{xor}\; B.p)} = \overline{(A.d \;\mathrm{xor}\; B.d) \lor (A.p \;\mathrm{xor}\; B.p)} \]
\[ L1.reset = (A.d \;\mathrm{xnor}\; B.p) \land (A.p \;\mathrm{xnor}\; B.d) = \overline{(A.d \;\mathrm{xor}\; B.p)} \land \overline{(A.p \;\mathrm{xor}\; B.d)} = \overline{(A.d \;\mathrm{xor}\; B.p) \lor (A.p \;\mathrm{xor}\; B.d)} \]

If we put everything together we obtain the circuit shown in Figure 13.

[Figure 13: FSL XOR gate with RS-latches (inputs A.d, A.p, B.d, B.p; RS-latches L0 and L1 with outputs Q.d and Q.p)]

Note that in this circuit it is possible that both the set and the reset inputs of one of the RS-latches are set to one at the same time. If the RS-latch reacts to this condition by setting the output Q to one, the circuit might produce invalid outputs. For the following discussion we refer to the circuit shown in Figure 11. Consider the following scenario: Assume that the delay of logic block b is much higher than the one of logic block a (∆a << ∆b). The input vector of the circuit (A.d, A.p, B.d, B.p) is set to (0, 0, 0, 0).
Hence L0.reset is one, L0.set is zero and the output Q.d is zero as well. Now the input vector (1, 0, 0, 1) is applied. After ∆a a one will emerge on L0.set, and since ∆a << ∆b, L0.reset stays high as well. The RS-latch sets the output Q.d, the handshaking mechanism proceeds and the next input vector (0, 0, 1, 1) is applied to the circuit. If the set-logic is analyzed for hazards, it turns out that the transition from (1, 0, 0, 1) to (0, 0, 1, 1) produces a static-1 (S1) hazard on L0.set. But since L0.reset is still set (because of the long delay ∆b) the glitch is propagated through the RS-latch and is visible on the output Q.d.

Although this scenario might seem very unlikely, it demonstrates an important concept in QDI circuit design. Every transition that happens during the evaluation of some input data must be "visible" on the output. This means that a circuit must not proceed to the next data-wave before all parts of the circuit have finished processing the current data-wave. If this cannot be guaranteed (as in our case), timing/delay assumptions are required.

4.2 Implementation with C-Elements

The FSL XOR gate can also be implemented using C-gates instead of RS-latches as state holding elements. Figure 14 shows this transformation. If both inputs (set and reset) are set to zero, the C-gate is excited. A one on the set input results in a one on the output Q, i.e. the C-gate is set. Applying a one to the reset input (and a zero to the set input) resets the C-gate. Because of the different behavior of the C-gate, the resulting circuit, shown in Figure 15, does not suffer from the problem discussed in the previous section. Assume reset is one and set is zero, i.e. the C-gate output is zero. Now if set switches to one as well, the output of the C-gate will not change until reset switches to low.
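The difference between the two state holding elements can be made concrete with a small behavioural model. This is only a sketch under two assumptions that are not spelled out in the text: the reset input is inverted before it enters the C-element (which matches the set/reset behaviour described above), and the RS-latch is modelled as set-dominant, i.e. the problematic case in which set = reset = 1 yields Q = 1. All function names are hypothetical.

```python
# Behavioural sketch, not a gate-level model. Names are invented for this example.

def c_element(a: int, b: int, prev: int) -> int:
    """Muller C-element: follow the inputs when they agree, otherwise hold."""
    return a if a == b else prev


def c_gate(set_: int, reset: int, prev: int) -> int:
    """C-gate used as a latch; ASSUMPTION: reset enters the C-element inverted."""
    return c_element(set_, 1 - reset, prev)


def rs_latch_set_dominant(set_: int, reset: int, prev: int) -> int:
    """RS-latch; ASSUMPTION: set wins when set and reset are both one."""
    if set_:
        return 1
    if reset:
        return 0
    return prev


if __name__ == "__main__":
    # Scenario from the text: reset is still one when set rises.
    q_c = c_gate(0, 1, prev=0)
    q_rs = rs_latch_set_dominant(0, 1, prev=0)
    for set_, reset in [(1, 1), (1, 0)]:
        q_c = c_gate(set_, reset, q_c)
        q_rs = rs_latch_set_dominant(set_, reset, q_rs)
        print(f"set={set_} reset={reset}: C-gate Q={q_c}, RS-latch Q={q_rs}")
```

In this model the C-gate output only rises after reset has gone low, whereas the set-dominant latch reacts as soon as set rises.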
[Figure 14: RS-latch, C-gate (RS-latch with inputs S, R and output Q replaced by a C-gate with inputs set, reset and output Q)]

[Figure 15: FSL XOR gate with C-gates (inputs A.d, A.p, B.d, B.p; two C-gates with outputs Q.d and Q.p)]

4.3 Implementation with D-Latches

Table 2 shows the truthtable for the D-latch implementation of the FSL XOR gate. If the input signals A and B are not phase aligned the en signal of the latches is zero. In this case the values of L0.d and L1.d do not matter and we can therefore set them to x (don't care). Observe that in rows where the enable signal of the latches (en) is one, the input rails (A.d, A.p, B.d, B.p) have an even parity, i.e. the number of ones on the input rails is even. Hence a parity function can be used to generate this signal. The KV maps in Figure 16 are used to derive optimized boolean equations for the function blocks f and g. The "don't care" elements are used to construct larger groups of ones.

A.d A.p B.d B.p | phase(A/B) | L0.d L1.d en | Q.d Q.p
 0   0   0   0  |   ϕ0/ϕ0    |  0    0   1  |  0   0
 0   0   0   1  |   ϕ0/ϕ1    |  x    x   0  |  H   H
 0   0   1   0  |   ϕ0/ϕ1    |  x    x   0  |  H   H
 0   0   1   1  |   ϕ0/ϕ0    |  1    1   1  |  1   1
 0   1   0   0  |   ϕ1/ϕ0    |  x    x   0  |  H   H
 0   1   0   1  |   ϕ1/ϕ1    |  0    1   1  |  0   1
 0   1   1   0  |   ϕ1/ϕ1    |  1    0   1  |  1   0
 0   1   1   1  |   ϕ1/ϕ0    |  x    x   0  |  H   H
 1   0   0   0  |   ϕ1/ϕ0    |  x    x   0  |  H   H
 1   0   0   1  |   ϕ1/ϕ1    |  1    0   1  |  1   0
 1   0   1   0  |   ϕ1/ϕ1    |  0    1   1  |  0   1
 1   0   1   1  |   ϕ1/ϕ0    |  x    x   0  |  H   H
 1   1   0   0  |   ϕ0/ϕ0    |  1    1   1  |  1   1
 1   1   0   1  |   ϕ0/ϕ1    |  x    x   0  |  H   H
 1   1   1   0  |   ϕ0/ϕ1    |  x    x   0  |  H   H
 1   1   1   1  |   ϕ0/ϕ0    |  0    0   1  |  0   0

Table 2: Truthtable describing the FSL XOR gate with D-latches

[Figure 16: KV maps for the function blocks f and g (one map each for L0.d and L1.d over A.d, A.p, B.d, B.p)]

Figure 17 shows the resulting circuit. The logic blocks for L0.d and L1.d are reduced to two xor gates. The parity function can be constructed with three xor gates and an inverter.
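The D-latch variant can be cross-checked against the FSL semantics by brute force over all 16 input vectors. The sketch below picks one set of XOR inputs that is consistent with Table 2; the helper names and the reference model (XOR of the data rails, output phase equal to the common input phase) are assumptions made for this illustration.

```python
from itertools import product

# Sketch: candidate logic for the D-latch inputs, checked against the
# expected FSL XOR output. All names are invented for this example.

def dlatch_inputs(ad: int, ap: int, bd: int, bp: int) -> tuple[int, int, int]:
    """Return (L0.d, L1.d, en) for the input rails."""
    l0_d = ad ^ bd                    # data rail of the result
    l1_d = ad ^ bp                    # parity rail of the result
    en = 1 - (ad ^ ap ^ bd ^ bp)      # even parity: three XORs and an inverter
    return l0_d, l1_d, en


def expected_output(ad: int, ap: int, bd: int, bp: int):
    """(Q.d, Q.p) if A and B are phase aligned, otherwise None (hold)."""
    phase_a, phase_b = ad ^ ap, bd ^ bp
    if phase_a != phase_b:
        return None
    q = ad ^ bd
    return q, q ^ phase_a


def check_all() -> bool:
    """True if the latch inputs match the expectation whenever en is one."""
    for rails in product((0, 1), repeat=4):
        l0_d, l1_d, en = dlatch_inputs(*rails)
        want = expected_output(*rails)
        if (en == 1) != (want is not None):
            return False
        if en == 1 and (l0_d, l1_d) != want:
            return False
    return True


if __name__ == "__main__":
    print("D-latch logic consistent with FSL XOR:", check_all())
```

The check only covers the static truth table; it says nothing about the relative timing of the enable and data paths.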
\[ L0.d = (\overline{A.d} \land B.d) \lor (A.d \land \overline{B.d}) = A.d \;\mathrm{xor}\; B.d \]
\[ L1.d = (\overline{A.d} \land B.p) \lor (A.d \land \overline{B.p}) = A.d \;\mathrm{xor}\; B.p \]

[Figure 17: FSL XOR gate with D-latches (inputs A.d, A.p, B.d, B.p; D-latches L0 and L1 with common enable EN and outputs Q.d and Q.p)]

It is obvious that the D-latch circuit is not much cheaper (in terms of logic gates) than the previously presented ones. However, it has serious disadvantages, which make it not applicable in practice. As mentioned before, the idea behind this circuit is that the parity function disables the data inputs of the D-latches as soon as the inputs are not phase aligned. Note that this assumption leads to a race condition between the enable and data inputs (L0.d and L1.d) of the D-latches. We use ∆P and ∆f,g to refer to the delays of the parity function and the logic blocks f and g, respectively. The enable signal must switch to zero before the data at the D-inputs of the latches becomes invalid (∆P < ∆f,g). However, it must not switch to one again before the signals on the D-inputs of the latches are valid (∆P > ∆f,g). These two requirements contradict each other, so they cannot both be met with one fixed pair of delays.

5 Protocols

5.1 4-Phase Bundled Data

Figure 18 shows the waveform of the 4-phase bundled data protocol. The transmitting pipeline stage signals the receiving stage that new data is available by a rising edge on req, which is eventually acknowledged by a rising edge on ack. To complete the handshake the transmitting stage deasserts req, which is acknowledged by a falling edge on ack. Since it must be guaranteed that the transmitted data is valid and stable at the receiving pipeline latch before the request signal arrives, the request line must be equipped with a delay line. The delay must be at least as long as the time it takes the data to pass through the data path (combinatorial logic) between the two pipeline latches.

[Figure 18: 4-phase bundled data protocol (waveforms of req, ack and data; transmitted values 01, 00, 00)]

5.2 2-Phase Bundled Data

Figure 19 shows the waveform of the 2-phase bundled data protocol. A rising edge on req is acknowledged by a rising edge on ack, and a falling edge on req by a falling edge on ack.
Corresponding edges are marked with the same color. As with 4-phase bundled data, a matched delay is required for the request signal.

[Figure 19: 2-phase bundled data protocol (waveforms of req, ack and data; transmitted values 01, 00, 00)]

5.3 4-Phase Dual Rail (NCL)

Figure 20 shows the waveform of the 4-phase dual rail protocol. The transmission starts in the null phase. To transmit data, either the true or the false rail of a signal (D0 or D1) has to switch to one. A completion detector asserts the ack signal if all input signals (D0 and D1) carry valid data (data phase). Hence there is no explicit request; the request is implicitly encoded in the data. This also means that no matched delay is required. To complete the handshake all rails have to switch to zero again (null phase), which is in turn acknowledged by the completion detector by deasserting the ack signal. The time periods where all signals are in their data phase are marked gray, whereas the null phase is marked blue. Note that the length of each phase can be arbitrarily long and can even vary during runtime.

[Figure 20: 4-phase dual rail (NCL) protocol (waveforms of ack, D0.t, D0.f, D1.t, D1.f; data value sequence NULL, 01, NULL, 00, NULL, 00, NULL)]

5.4 2-Phase Dual Rail (FSL/LEDR)

Figure 21 shows the waveform of the 2-phase dual rail protocol. The first data word (01) is transmitted in phase 0. Just like in the NCL protocol, a completion detector is required to check whether all data signals (D0 and D1) are phase aligned. A matched delay is not necessary since the request is again implicitly encoded in the data. If all data signals are in phase 0, the completion detector asserts the ack signal (which completes one handshake). The ack signal stays high until all data signals carry valid phase 1 values. The time periods where all signals are phase aligned are marked gray (phase 0) and blue (phase 1).

[Figure 21: 2-Phase dual rail (FSL/LEDR) (waveforms of ack, D0.d, D0.p, D1.d, D1.p; data values 01, 00, 00 in phases 0, 1, 0)]

References

[1] J. Sparsø. Asynchronous circuit design - a tutorial, Dec. 2001.