Week 3 out-of-class notes, discussions and sample problems

We wrap up our discussion of scoreboard and Tomasulo-style dynamic issue processors with a look at
their implementation details, cost and complicating factors. We start with the nature of WAW and WAR
hazards. First, let’s consider from a high level language point of view what a WAW hazard is. Consider
the following two instructions:
x = y + 1;
x = z * 2;
Why would anyone write such code? The only rational explanation is that the programmer made a
mistake and the first instruction was to be replaced by the second but the programmer forgot (or was too
lazy) to remove the first assignment. The first value stored in x is never used. Now, a WAW hazard has
two consecutive writes to a register without an intervening read, but the two writes wind up occurring in
opposite order. The following MIPS FP code would result in a WAW hazard:
MUL.D  F0, F2, F4     // F0 = F2 * F4
ADD.D  F0, F6, F8     // F0 = F6 + F8
The reason for the hazard is that the multiply takes 3 more cycles than the add, and so F6 + F8 is
computed and stored first, and then F2 * F4 overwrites the sum and the product is put into F0. F0 winds
up with the wrong result. Yet why would a programmer put those two instructions together like that?
The product is never used, there is no need for it. The reason for this combination of instructions is quite
subtle. The compiler will schedule code for us to optimize it for the given pipeline. Among the
optimizations is branch delay slot scheduling. It is possible that the compiler could make such a move
on us. The following is such an example:
      BNEZ  R1, Foo
      DIV.D F0, F2, F4
      …
Foo:  L.D   F0, …
Here, the branch delay slot has a lengthy division operation. If the branch is taken, then we reach L.D
before the lengthy division will complete, so the loaded value into F0 is eventually replaced by the
division result. If the branch is not taken, the WAW hazard may still exist; it depends on the number of
instructions that appear in place of the … above.
While two successive writes to one location would not normally be a problem, the different lengths in
execution cause the WAW hazard. The MIPS solution – as soon as the WAW hazard is detected, ensure
that the MUL.D does not write to F0.
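The timing behind this WAW hazard can be checked with a few lines of Python. This is only a sketch under assumptions that are not spelled out in the text above: the pipeline is modeled as IF, ID, the execute stages, MEM, WB, and the latencies in EX_LATENCY (7 for MUL.D, 4 for ADD.D) are chosen so that the multiply takes 3 more cycles than the add. The function names are made up for illustration.

```python
# Assumed execute-stage latencies: the multiply takes 3 more cycles than the add.
EX_LATENCY = {"MUL.D": 7, "ADD.D": 4}

def write_cycle(fetch_cycle, op):
    """Cycle in which the instruction writes its destination register.

    Simplified model: IF in fetch_cycle, ID in the next cycle, then the
    execute stages, then MEM, then WB.
    """
    return fetch_cycle + 1 + EX_LATENCY[op] + 2

def waw_hazard(first, second):
    """True if two writes to one register complete out of program order.

    Each argument is (fetch_cycle, op, dest); `first` comes earlier in the program.
    """
    (c1, op1, d1), (c2, op2, d2) = first, second
    return d1 == d2 and write_cycle(c2, op2) <= write_cycle(c1, op1)

mul = (1, "MUL.D", "F0")   # MUL.D F0, F2, F4 fetched in cycle 1
add = (2, "ADD.D", "F0")   # ADD.D F0, F6, F8 fetched in cycle 2

print("MUL.D writes F0 in cycle", write_cycle(1, "MUL.D"))
print("ADD.D writes F0 in cycle", write_cycle(2, "ADD.D"))
print("WAW hazard:", waw_hazard(mul, add))
```

Under these assumptions the later ADD.D writes F0 two cycles before the earlier MUL.D does, which is exactly the out-of-order pair of writes described above.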
WAR hazards are a different type of problem that do not arise in MIPS until we move to the scoreboard.
In any typical sequence of MIPS operations, whether integer or FP, operand reads always take place in
the 2nd stage and execution takes place afterward. So no earlier instruction’s read will happen after a later
instruction’s write.
Consider the following variation of a MIPS pipeline which could result in a WAR hazard. In this
pipeline, ALU operations skip the MEM stage and memory operations have two cycles of MEM stages.
The first cycle of the MEM stage tests the cache to ensure it is a cache hit and the second cycle of the
MEM stage obtains the datum from register and sends it to data cache. Recall that any register write will
occur earlier in a stage than a register read during the same clock cycle. So what we see is the DADD
stores its result in R2 before the SW reads the datum.
SW   R2, 0(R1)     IF    ID    EX    MEM1  MEM2  WB
DADD R2, R3, R4          IF    ID    EX    WB
This example is not plausible because the datum is actually retrieved from the register file in the ID stage.
However, as we move to dynamic issue, we will see changes to when register values are read. First, in
the scoreboard approach, registers are only read once both register values become available. Consider the
following instruction sequence:
L.D    F0, 0(R1)
ADD.D  F2, F0, F4
L.D    F6, 0(R2)
MUL.D  F8, F2, F6
The add is issued to the adder, but its operands are not read until both F0 and F4 are available. F0 is
being loaded from cache, so it may postpone the register access of F4 for a cycle (or more). This is not
significant in this problem. However, the MUL.D does not read either F2 or F6 until they are both
available. The load will conclude prior to the add. This creates no hazard. But now consider this
sequence instead:
L.D    F0, 0(R1)
MUL.D  F2, F0, F4
L.D    F6, 0(R2)
MUL.D  F8, F2, F6
ADD.D  F6, F10, F12
Replacing the second instruction with a longer MUL.D lengthens the time before F2
becomes available for the fourth instruction. While the fourth instruction waits for F2 to become
available, it waits before reading F6 as well. Unfortunately, the next instruction, ADD.D, is able to issue,
read its operands, and execute all in the time the second MUL.D is waiting. The result is that the ADD.D
can write its sum to F6 before the second MUL.D is able to read the old value of F6. This is a WAR
hazard.
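To see where the WAR hazard comes from, it can help to list the register dependences in the five-instruction sequence mechanically. The following is a minimal sketch, not a model of the scoreboard itself: the `classify` helper and the tuple encoding of instructions are assumptions made for illustration. It reports dependences between pairs of instructions; whether a given dependence becomes a hazard still depends on timing, as discussed above.

```python
def classify(earlier, later):
    """Return the kinds of dependence between two instructions in program order.

    Each instruction is (dest, sources), with register names as strings.
    """
    e_dest, e_src = earlier
    l_dest, l_src = later
    found = set()
    if e_dest in l_src:
        found.add("RAW")   # later reads what earlier writes
    if l_dest in e_src:
        found.add("WAR")   # later writes what earlier reads
    if e_dest == l_dest:
        found.add("WAW")   # both write the same register
    return found

# The second sequence from the text, encoded as (name, (dest, sources)).
program = [
    ("L.D",   ("F0", ["R1"])),
    ("MUL.D", ("F2", ["F0", "F4"])),
    ("L.D",   ("F6", ["R2"])),
    ("MUL.D", ("F8", ["F2", "F6"])),
    ("ADD.D", ("F6", ["F10", "F12"])),
]

for i in range(len(program)):
    for j in range(i + 1, len(program)):
        kinds = classify(program[i][1], program[j][1])
        if kinds:
            print(f"{i + 1}:{program[i][0]} -> {j + 1}:{program[j][0]}", sorted(kinds))
```

The pair of interest is the fourth and fifth instructions: the second MUL.D reads F6 and the ADD.D writes it, which is the WAR dependence that turns into a hazard when the MUL.D delays its register read.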
As with the WAR hazard in the altered MIPS pipeline, this example is also an artifact of a peculiar design
decision. Why shouldn’t the second MUL.D read F6 immediately upon being issued? The answer has to
do with the number of buses available to move data between registers and functional units. To keep costs
down, we only need enough bus lines to accommodate 2 register reads per cycle. Those reads will be for
the instruction issued earliest whose operands have become available. If the scoreboard could operate by
having a functional unit read an operand as soon as it is available, it would avoid WAR hazards.
In Tomasulo’s approach, both WAW and WAR hazards are avoided through register renaming. The cost
is 1. added logic in the issue stage to detect the hazards (this logic would be required in the scoreboard as
well unless we prevent the WAR hazards as explained in the previous paragraph) and 2. the added
temporary registers needed to implement renaming. Since we could avoid the WAR hazard as explained
above, and WAW hazards should not normally arise, should we instead forego the register renaming
approach? The answer is no because with dynamic scheduling, you can never be certain of when data
will become available and when the functional units will read the data. Therefore, although there is added
expense, the approach permits dynamic issue of instructions which itself improves overall performance.
The benefits of dynamic issue may not be apparent yet. As we continue to expand the capability of the
processor, we will see their advantages. However, with dynamic issue, loop unrolling can take place
naturally without compiler optimization. The primary cost of dynamic issue (at least at this point) is with
the functional units (which we would have added anyway), reservation stations and temporary registers (a
minor cost today), and the added logic (fairly minimal). The main disadvantage of dynamic issue is the
reliance on the single CDB (common data bus). We can alleviate this bottleneck slightly by having two,
one for integer values and one for FP values.
The implementation of the Tomasulo architecture is given here. It is shown in figure 3.9 on page 180.
The description below will hopefully be easier to read and understand.
Instruction fetch unit: fetches instructions one at a time, incrementing the PC, and queuing each
instruction in the instruction buffer. If a branch is issued, the behavior of the pipeline is different, as we
will discuss below.
Issue Stage: the instruction is decoded by type. Assume the instruction involves source registers rs and rt
(some instructions do not have a second source operand) and destination register rd. If a reservation
station for this type of instruction is available, send the instruction to that functional unit and store it in the
available reservation station.
FP operation:
Qi ← this reservation station
Qj, Qk ← if register value for the source operand is not available, the reservation
station that will be forwarding the value to it, otherwise 0
Vj, Vk ← register value from register file if operand is available
Busy ← this reservation station’s busy flag set to yes
Integer operation: same as FP
Load/store operation: Qj, Qk, Vj, Vk same as FP operation
A ← the immediate datum field from the instruction
Qi ← reservation station number (for loads only), no Qi used for stores
The Qj/Qk value is where register renaming takes place. If a datum is coming from a reservation station,
we record that location rather than the register file location. Thus, each reservation station’s operand
fields serve as the temporary registers that implement renaming.
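The issue-stage bookkeeping above can be sketched in Python. This is a simplified illustration under assumptions, not the hardware described in the text: there is a single pool of three stations named RS1 to RS3 rather than one pool per functional unit type, loads and stores are left out, and the class and method names are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional, Union

Tag = Union[int, str]  # 0 means "no producer pending", otherwise a station name

@dataclass
class Station:
    name: str
    busy: bool = False
    op: str = ""
    Vj: Optional[float] = None
    Vk: Optional[float] = None
    Qj: Tag = 0
    Qk: Tag = 0

class IssueStage:
    def __init__(self, station_names, registers):
        self.stations = [Station(n) for n in station_names]
        self.regs = dict(registers)            # register file values
        self.Qi = {r: 0 for r in registers}    # which station will write each register

    def _source(self, reg):
        """Return (V, Q): the value if it is ready, else the producing station's tag."""
        tag = self.Qi[reg]
        return (self.regs[reg], 0) if tag == 0 else (None, tag)

    def issue(self, op, rd, rs, rt):
        """Place the instruction in a free station; return None if issue must stall."""
        for st in self.stations:
            if not st.busy:
                st.busy, st.op = True, op
                st.Vj, st.Qj = self._source(rs)   # read sources before renaming rd
                st.Vk, st.Qk = self._source(rt)
                self.Qi[rd] = st.name             # rd is now "renamed" to this station
                return st
        return None

regs = {f"F{n}": float(n) for n in range(0, 14, 2)}   # F0..F12 with placeholder values
cpu = IssueStage(["RS1", "RS2", "RS3"], regs)
first = cpu.issue("MUL.D", "F2", "F0", "F4")
second = cpu.issue("MUL.D", "F8", "F2", "F6")
third = cpu.issue("ADD.D", "F6", "F10", "F12")
print(second.Qj, second.Vk, cpu.Qi["F6"])
```

In this run the second MUL.D records RS1 in its Qj (it waits for F2) but copies the current value of F6 into Vk at issue time, so the later ADD.D can write F6 without disturbing it. That copy is why the WAR hazard from the scoreboard example does not arise here.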
Execute: once both source operands are available, the instruction can execute. If two instructions obtain
their source operands in the same cycle, the instruction to move to the functional unit is randomly
selected. Since functional units are pipelined, any waiting instruction can begin executing in the next
cycle.
FP operation:
Qj, Qk = 0: execute on Vj, Vk
Integer:
same as FP
Load/store:
Qj = 0: A ← Vj + A (compute effective address)
Read from memory location [A] (loads)
Write Vk to memory location [A] (stores)
NOTE: loads and stores take 2 cycles to execute, the first cycle computes the effective address and the
second performs the memory access.
Write result: two things need to take place here, first the result has to be written to the register file and
second the result has to be sent out on the CDB.
All operations except store:
Register[Qi] ← result
If Qi is listed in any reservation station under its Qj or Qk, forward result and set
that Qj or Qk ← 0
Qi ← 0 (indicate that this destination register is now available)
Busy ← no
Store:
Send Vk to memory location [A]
Busy ← no
The CDB broadcasts at most 1 result per clock cycle, but the result is broadcast to all waiting
reservation stations. Therefore, the write-result stage writes the result to every waiting reservation station, the
register file, and the store buffer at the same time.
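The write-result step can be sketched as a single broadcast function. This is a minimal illustration under assumptions: stations are plain dictionaries, the register status table maps each register to the tag of the station that will write it (0 if none), and the store buffer is omitted. None of these names come from the text.

```python
def broadcast(tag, result, stations, reg_status, regs):
    """Deliver one CDB result to everything that is waiting on `tag`."""
    for st in stations:
        if st["Qj"] == tag:                 # waiting first operand
            st["Vj"], st["Qj"] = result, 0
        if st["Qk"] == tag:                 # waiting second operand
            st["Vk"], st["Qk"] = result, 0
        if st["name"] == tag:               # the producer is now free
            st["busy"] = False
    for reg, producer in reg_status.items():
        if producer == tag:                 # only the latest writer updates the register
            regs[reg] = result
            reg_status[reg] = 0

stations = [
    {"name": "RS1", "busy": True, "Qj": 0, "Qk": 0, "Vj": 1.0, "Vk": 4.0},
    {"name": "RS2", "busy": True, "Qj": "RS1", "Qk": 0, "Vj": None, "Vk": 6.0},
]
reg_status = {"F2": "RS1", "F8": "RS2"}
regs = {"F2": 0.0, "F8": 0.0}

broadcast("RS1", 4.0, stations, reg_status, regs)
print(stations[1]["Vj"], stations[1]["Qj"], regs["F2"], reg_status["F2"])
```

Because a register is written only when its status entry still names the broadcasting station, a result from an instruction whose destination has since been claimed by a later instruction reaches the waiting stations but leaves the register file alone.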
The hardware for Tomasulo’s approach permits loop unrolling, but in fact it does not execute as we
thought. The instruction fetch unit continues to fetch instructions sequentially until a branch is
completed. What happens to instructions fetched sequentially after a branch was fetched and issued but
not yet completed? If the branch was taken, we would have the wrong instructions in the queue and/or
issued to reservation stations. How do we know which ones? Consider the following code:
      L.D     F0, 0(R1)
      ADD.D   F2, F0, F4
      C.LT.D  F2, F6
      BC1F    foo
      MUL.D   F8, F10, F12
      …
foo:
Here, we load a datum, and use it in an add. Next, we compare the result in order to determine whether
we should branch around a multiply or not. Assuming that the ADD.D takes 4 cycles to execute, and
because the L.D will take 2 cycles to execute, the MUL.D operation will have been issued to a FP
multiply unit before the ADD.D completes. Assuming F10 and F12 are available, the multiply will even
begin executing before we have determined the branch condition. If the branch is taken, we want to shut
down the MUL.D. But how does the FP multiply functional unit know that it was dependent on a branch?
Unless we set some mechanisms up to handle this, we would have to delay issuing the MUL.D because it
is after a branch.
The text cryptically mentions that instructions after a branch are postponed in the issue stage (see the
second to last sentence in the caption for figure 3.9 on page 180). This would mean that our loop
unrolling example wouldn’t actually execute as stated:
L.D
MUL.D
S.D
L.D
MUL.D
S.D
The second iteration would be stalled in the issue stage until the first iteration’s branch completed. In
which case, dynamic scheduling does nothing useful for us and we would want to rely on compiler-based
loop unrolling instead! For now, we will assume that branches do not stall the issue stage and that there is
a mechanism available to flush the reservation stations/functional units of instructions issued after a
branch if the branch is taken.
Next week, we will continue to expand our processor by focusing on the superscalar – a pipeline that
permits multiple instruction issues per cycle. You can think of this either as parallel pipelines, or a
Tomasulo-style processor where 2 (or more) instructions are issued at the issue stage, each to independent
functional units. We will see that the Tomasulo-based superscalar approach is common in today’s
processors. We will also examine how to support branch speculation so that we can bypass the problem
discussed in the previous two paragraphs.
The remainder of these notes covers some sample problems.
Sample Problems:
1. For each of the following situations, provide an example of MIPS code that will result in the
given hazard for the MIPS floating point pipeline, or explain why the hazard cannot arise.
a. Structural hazard in the EX stage
b. Structural hazard in the MEM stage
c. WAR hazard
d. WAW hazard
Solution:
a. Structural hazard in the EX stage – this arises when we have two division instructions
within 25 cycles of each other since the division unit is not pipelined.
b. Structural hazard in the MEM stage – this arises whenever two instructions leave the EX
stage during the same cycle, for instance a FP add followed 3 cycles later by a load:
ADD.D    IF  ID  A1  A2  A3  A4  M   WB
Instr2       IF  ID  EX  M   WB
Instr3           IF  ID  EX  M   WB
LW                   IF  ID  EX  M   WB
c. WAR hazard – this cannot arise in the MIPS pipeline whether integer or floating point
because all register reads happen in the 2nd stage and all writes happen later on, so no
later instruction would write to the register file earlier than an earlier instruction reads
from the register file
d. WAW hazard – this can arise with two instructions that have out-of-order completion
such as:
ADD.D F1, …    IF  ID  A1  A2  A3  A4  M   WB
L.D   F1, …        IF  ID  EX  M   WB
2. Repeat #1 for the scoreboard and Tomasulo architectures.
Solution:
a. The structural hazard in the EX stage exists if we do not pipeline our functional units.
Additionally, if we have pipelined functional units, the structural hazard in the EX stage
arises in Tomasulo if we run out of reservation stations.
b. This does not exist because our MEM stage now has its own buffer.
c. In the scoreboard, this exists when an instruction waiting at a functional unit to read its
registers waits so long that a later instruction executes and writes its result to the same
register as one that the waiting instruction needs to read. These hazards are avoided by
stalling any such situation in the issue stage. With register renaming, WAR hazards
cannot arise in the Tomasulo approach.
d. Same as c.
3. Using the 7 cycle execution time for the MUL.D as presented in appendix C (as opposed to those
of section 3.2), unroll and schedule the following loop to remove all stalls for the MIPS FP
pipeline. Assume that the MUL.D and S.D can both enter the MEM and WB stages together.
Loop: L.D    F0, 0(R1)
      MUL.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDI  R1, R1, #8
      BNE    R1, R3, Loop
Solution: the greatest source of stalls exists between the MUL.D and S.D (5 cycles worth, this
would normally be 6 cycles worth if we could not accommodate both MUL.D and S.D in the
MEM stage at the same time). We can improve on this by scheduling the DADDI between
MUL.D and S.D and moving the S.D to the branch delay slot. This would reduce the number of
stalls needed to 3 cycles. However, if we unroll the loop, we can only place one S.D in the
branch delay slot, so at best, we still need to find 4 more instructions to exist between the MUL.D
and S.D other than the DADDI. We will unroll the loop for 5 total iterations.
Loop: L.D    F0, 0(R1)
      L.D    F6, 8(R1)
      L.D    F10, 16(R1)
      L.D    F14, 24(R1)
      L.D    F18, 32(R1)
      MUL.D  F4, F0, F2
      MUL.D  F8, F6, F2
      MUL.D  F12, F10, F2
      MUL.D  F16, F14, F2
      MUL.D  F20, F18, F2
      DADDI  R1, R1, #40
      S.D    F4, -40(R1)
      S.D    F8, -32(R1)
      S.D    F12, -24(R1)
      S.D    F16, -16(R1)
      BNE    R1, R3, Loop
      S.D    F20, -8(R1)
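The unrolled schedule follows a regular pattern (all loads, then all multiplies, then the DADDI, then the stores with the last one in the branch delay slot), so it can be generated mechanically. The sketch below is one way to do that in Python. The `unroll` function is hypothetical, and the register numbering simply mirrors the solution above: F0 and then F6, F10, ... for loaded values, F4, F8, ... for products, with F2 holding the multiplier.

```python
def unroll(factor, stride=8):
    """Return the scheduled, unrolled loop as a list of instruction strings."""
    load_regs = ["F0"] + [f"F{4 * i + 2}" for i in range(1, factor)]
    prod_regs = [f"F{4 * i + 4}" for i in range(factor)]
    total = factor * stride                      # bytes covered per unrolled pass

    code = [f"L.D    {r}, {stride * i}(R1)" for i, r in enumerate(load_regs)]
    code += [f"MUL.D  {p}, {src}, F2" for p, src in zip(prod_regs, load_regs)]
    code.append(f"DADDI  R1, R1, #{total}")
    # R1 has already been advanced, so store offsets are negative.
    stores = [f"S.D    {p}, {stride * i - total}(R1)" for i, p in enumerate(prod_regs)]
    code += stores[:-1]
    code.append("BNE    R1, R3, Loop")
    code.append(stores[-1])                      # last store fills the branch delay slot
    code[0] = "Loop: " + code[0]
    return code

for line in unroll(5):
    print(line)
```

Each multiply takes its operand from the load with the same index, which is an easy property to get wrong when unrolling by hand.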
4. For the following loop, first determine the stalls, next schedule the code to reduce the stalls, and
finally determine, based on the number of stalls that remain, how many times you would have to
unroll the loop in order to have enough code to schedule such that you remove all remaining
stalls. Assume the MIPS FP pipeline and the FP latencies as presented in Appendix C, not
chapter 3.2. Assume an FP and an S.D can reach the MEM stage at the same time but not two FP
operations.
Loop: L.D    F0, 0(R1)
      MUL.D  F2, F0, F10
      L.D    F4, 4(R1)
      ADD.D  F6, F2, F4
      S.D    F6, 8(R1)
      DADDI  R1, R1, #12
      DSUBI  R2, R2, #1
      BNEZ   R2, Loop
This code is roughly equivalent to the following for loop:
for(i=0;i<3*n;i+=3)
a[i+2]=a[i]*s+a[i+1];
Solution: the stalls occur as follows, 1 after the first L.D, 1 after MUL.D, 5 after the second L.D (which subsumes
both the RAW hazard from L.D to ADD.D and from MUL.D to ADD.D), 2 after ADD.D, 1 after
DSUBI, 1 after BNE. The biggest source of stalls is after the MUL.D. Notice that unlike the
previous example which had a RAW hazard between MUL.D and S.D, we have a hazard between
MUL.D and ADD.D. The result is that we have 1 additional cycle of delay because ADD.D
needs the datum in the A1 stage, S.D needed it in the MEM stage. This extra cycle of delay
though is subsumed by the second L.D operation. Unfortunately, unlike the MUL.D/S.D example
where the MUL.D and S.D could both reach the MEM and WB stages at the same time, this is not
true of MUL.D and ADD.D. So in fact, we have an additional cycle of stall to avoid that
structural hazard! We can schedule the code to remove some stalls as follows:
Loop: L.D    F0, 0(R1)
      L.D    F4, 4(R1)
      MUL.D  F2, F0, F10
      DSUBI  R2, R2, #1
      DADDI  R1, R1, #12
      ADD.D  F6, F2, F4
      BNEZ   R2, Loop
      S.D    F6, -4(R1)
This code has 4 cycles of stalls between MUL.D and ADD.D because of the RAW hazard but 1
additional cycle of stall from the structural hazard of the MUL.D and ADD.D reaching the MEM
stage at the same time! We also have 1 stall after the ADD.D before the S.D because of that
RAW hazard. NOTE: we do not want to insert the stall after the BNE because that would fill the
branch delay, so the stall has to go after ADD.D and before BNE. Because we have 5 cycles
worth of stalls, we need to fill the void between MUL.D and ADD.D with 5 instructions. We
unroll the loop a total of 6 iterations. The code would have 12 L.Ds followed by 6 MUL.Ds
followed by the DSUBI and DADDI followed by 6 ADD.Ds, followed by 5 S.Ds, 1 BNE and the
last S.D. We would have to alter the memory offsets appropriately.
5. For the following code, show a table of when each instruction is issued, reads operands, executes
and writes results using a Scoreboard-based architecture of Appendix C (the table will look
something like that of figure 3.20 on page 202, but it will fit the Scoreboard architecture, not
Tomasulo’s). Assume that you have the following unpipelined functional units (along with their
execution times):
1 FP adder which takes 4 cycles to execute
2 FP multipliers which take 10 cycles to execute
1 FP divider which takes 20 cycles to execute
2 Load/store units which take 2 cycles to execute (the load/store unit has its own adder so
the integer EX does not need to be used, but only 1 memory operation can be
done in the same 2 cycle period)
1 integer EX unit which takes 1 cycle to execute
In addition, assume that a functional unit is busy from the time an instruction is issued to it
through the cycle when it writes its results. If an instruction is waiting for a functional unit to
become available before it is issued, it can only be issued the cycle after the functional unit is
freed. Also assume that a functional unit waiting to read registers can read them the cycle they
are written (writes occur first in the cycle, then reads), and can begin executing the cycle after the
read occurs. Note that only one functional unit can write to the register file in one cycle and only
one functional unit can read operands in one cycle.
L.S    F1, 0(R3)
MUL.S  F5, F1, F2       // assume F2 already has a value
L.S    F2, 4(R3)
DIV.S  F6, F2, F3       // assume F3 already has a value
L.S    F3, 8(R3)
MUL.S  F7, F6, F5
SUB.S  F4, F1, F2
MUL.S  F7, F1, F2
ADD.S  F8, F7, F4
L.S    F1, 12(R3)
ADD.S  F2, F1, F8
S.S    F2, 16(R3)
DADDI  R3, R3, #20
Solution:
Instruction          Issue  Read Operands  Execute  Writes Result  Comments
L.S   F1, 0(R3)      1      2              3-4      5
MUL.S F5, F1, F2     2      5              6-15     16             RAW hazard with previous L.S
L.S   F2, 4(R3)      3      4              5-6      7
DIV.S F6, F2, F3     4      7              8-27     28             RAW hazard with previous L.S
L.S   F3, 8(R3)      6      8              9-10     11             No WAR hazard with previous DIV.S
MUL.S F7, F6, F3     7      28             29-38    39             RAW hazard with DIV.S
SUB.S F4, F1, F2     8      9              10-13    14
MUL.S F7, F1, F2     40     41             42-51    52             Stalls because of WAW with MUL.S
ADD.S F8, F7, F4     41     52             53-56    57             RAW hazard with previous MUL.S
L.S   F1, 12(R3)     42     43             44-45    46
ADD.S F2, F1, F8     58     59             60-63    64             Structural hazard – only 1 FP adder
S.S   F2, 16(R3)     59     64             65-66                   RAW hazard with ADD.S
DADDI R3, R3, #20    60     61             62       65             WAR hazard with S.S
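A table like this is easy to get wrong by one cycle, so it is worth checking it against the rules stated in the problem. The following Python sketch does that. The rows are transcribed from the solution, with the assumption that the S.S row has no write-result entry (the solution lists only 12 write cycles for 13 instructions). The `check` function and the LATENCY table are names made up for this example.

```python
# Execution times from the problem statement, keyed by opcode.
LATENCY = {"L.S": 2, "S.S": 2, "MUL.S": 10, "DIV.S": 20,
           "ADD.S": 4, "SUB.S": 4, "DADDI": 1}

# (op, issue, read operands, execute start, execute end, write result)
ROWS = [
    ("L.S",   1,  2,  3,  4,  5),
    ("MUL.S", 2,  5,  6,  15, 16),
    ("L.S",   3,  4,  5,  6,  7),
    ("DIV.S", 4,  7,  8,  27, 28),
    ("L.S",   6,  8,  9,  10, 11),
    ("MUL.S", 7,  28, 29, 38, 39),
    ("SUB.S", 8,  9,  10, 13, 14),
    ("MUL.S", 40, 41, 42, 51, 52),
    ("ADD.S", 41, 52, 53, 56, 57),
    ("L.S",   42, 43, 44, 45, 46),
    ("ADD.S", 58, 59, 60, 63, 64),
    ("S.S",   59, 64, 65, 66, None),   # assumed: no write-result entry for the store
    ("DADDI", 60, 61, 62, 62, 65),
]

def check(rows):
    """Return a list of rule violations found in the table (empty if none)."""
    problems = []
    prev_issue = 0
    for n, (op, issue, read, start, end, write) in enumerate(rows, 1):
        if issue <= prev_issue:
            problems.append(f"row {n}: issue is not in order")
        if read <= issue:
            problems.append(f"row {n}: operands read before the cycle after issue")
        if start != read + 1:
            problems.append(f"row {n}: execution does not start the cycle after the read")
        if end - start + 1 != LATENCY[op]:
            problems.append(f"row {n}: execution length does not match {op} latency")
        if write is not None and write <= end:
            problems.append(f"row {n}: result written before execution finishes")
        prev_issue = issue
    reads = [r[2] for r in rows]
    writes = [r[5] for r in rows if r[5] is not None]
    if len(set(reads)) != len(reads):
        problems.append("two functional units read operands in the same cycle")
    if len(set(writes)) != len(writes):
        problems.append("two functional units write results in the same cycle")
    return problems

print(check(ROWS) or "table is consistent with the stated rules")
```

The check covers only the mechanical rules (latencies, ordering within a row, one read and one write per cycle). It does not verify that each stall is the minimum forced by the hazards named in the comments column.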
6. For each of the following situations, explain how the MIPS floating point pipeline, the
Scoreboard approach and Tomasulo’s approach each handle the situation, if the situation might
result in stalling the entire instruction stream or just the affected instruction, or if the given
situation cannot arise in that architecture (and if this is the case, why not).
a. RAW data hazards
b. WAW data hazards
c. WAR data hazards
d. Structural hazards from trying to enter the same FP functional unit
e. Structural hazards from trying to read registers/source operands
f. Structural hazards from trying to write results
g. Structural hazard from performing load/store to memory
h. Control hazards from a branch
For instance, the MIPS pipeline handles RAW hazards by forwarding when possible, and stalling
the pipeline when necessary, whereas WAR hazards cannot arise, and control hazards are handled
by filling the branch delay slot or by assuming not taken and flushing the pipeline of the wrong
instruction when branch is taken.
Solution:
MIPS FP pipeline:
a. Prevented by forwarding when possible, stalling when needed, notably more common with FP
operations.
b. WAW only possible in FP operations, earlier operation does not write its result, essentially
becoming a no-op.
c. WAR hazards are not possible since all reads occur in the 2nd stage and writes occur in stage 5 or
later.
d. The only structural hazard from the functional units is for divides, all others are pipelined or in
the int unit’s case, it only takes 1 cycle.
e. This structural hazard does not arise since all reads are in the ID stage.
f. Since instructions must stall before entering the MEM stage if they are going to collide there,
only one instruction enters the WB stage per cycle so this hazard does not arise (see g).
g. Up to 4 instructions might try to enter the MEM stage at any one time (one from each of EX, A4,
M7, DIV), so all later instructions trying this will stall. This may or may not stall the earlier
(IF/ID) parts of the pipe.
h. Handled by either freezing/flushing the instruction in IF, or using the branch delay slot.
Scoreboard:
a. Results sent to registers when produced, the Scoreboard alerts functional units when the values
are available, so there is no stalling of instruction issue, but an instruction might wait at a
functional unit for a while.
b. The latter instruction is stalled from being issued when the hazard is detected, until the earlier
instruction completes and writes its result.
c. Writing of results by the later instruction is postponed until the earlier instruction can read both
operands, causing the writing instruction to wait in the functional unit until the writing can be
performed.
d. If a needed functional unit is busy, the instruction stalls at the issue stage, stalling the instruction
stream.
e. Only one instruction can read registers in a cycle, the Scoreboard selects which instruction
performs read operands based on the one waiting the longest, all others wait in their functional
unit.
f. Only one instruction can write results to a register in any cycle, all others must wait in their
functional unit.
g. A separate memory unit handles memory accesses, so the memory unit stores instructions in a
queue and services them one at a time, so there is only one memory access per cycle enforced by
this unit, all others wait.
h. Unless a branch target buffer is used, incorrect instructions might be issued and later must be
turned into no-ops. If a target buffer is used, the next instruction will already be in the instruction
stream when it is needed.
Tomasulo:
a. RAW hazards are handled by forwarding results over the CDB and having all reservation stations
keep an eye on the CDB for results from functional units that they are waiting for.
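b. WAW hazards are handled by register renaming: a register accepts only the result of the most recently
issued instruction that writes it, so the earlier instruction’s result is forwarded to any waiting
reservation stations but is not written to the register.
c. WAR hazards are handled by register renaming: the destination of the later instruction is renamed to a
reservation station, and all subsequent instructions that use this register as a source refer to the
renamed location as well.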
d. Usually there are more reservation stations than functional units, so instructions will be issued
and will wait in reservation stations for a functional unit to become available and for the source
operands to become available, if no reservation stations are available, the instruction issue stalls,
stalling the instruction stream.
e. While only one set of register reads can occur in one cycle, causing other reservation stations to
wait, any number of reservation stations can read from the CDB.
f. Only one reservation station can write a result to the CDB at any one time, later instructions stall
in their functional unit.
g. Same answer as with the Scoreboard.
h. Same answer as with the Scoreboard – we will see a better solution later in chapter 3 by using a
“Reorder Buffer” which collects results and allows those results to be stored to memory or
register only once we have determined that we predicted correctly.