Some Sample Problems For Exam 2

1. (30%) Consider the following code segment in MIPS:

LOOP: LD    F2, 100(R2)   : load F2 from memory
      ADDD  F4, F2, F8    : F4 = F2 + F8
      MULD  F6, F4, F10   : F6 = F4 * F10
      SD    100(R2), F6   : store F6 to memory
      ADDDI R2, R2, #8    : R2 = R2 + 8
      BNE   R2, R6, LOOP

Assume that you have the normal 5-stage pipeline (Instruction Fetch, Instruction Decode, Execute, Memory and Write-back). Also we will use the following latencies:

Integer operations   1
LD and STORE         2
Floating Add         3
Floating Mult        5

a) Show how many cycles are needed to complete one iteration of the loop without reordering the code.
b) Use a delayed branch and show the number of cycles needed. You can also reorder the code if you see opportunities.

Key.
a) Note that the latencies indicate when the results are available for use. So the result of LD can be used 2 cycles after LD starts, and if we assume the data can be forwarded to the instruction that needs it, we only need one stall after LD before ADDD can use the LD result.

LOOP: LD    F2, 100(R2)
      stall
      ADDD  F4, F2, F8
      stall
      stall
      MULD  F6, F4, F10
      stall
      stall
      stall
      stall
      SD    100(R2), F6
      ADDDI R2, R2, #8
      BNE   R2, R6, LOOP
      stall

I included a stall after BNE since we do not know whether the branch will be taken. With this we need 14 cycles to complete one iteration.

b) We can move SD into the branch delay slot and also reorder some instructions to reduce the number of cycles needed. We have to modify the offset of the SD instruction since R2 has already been incremented by 8.

LOOP: LD    F2, 100(R2)
      stall
      ADDD  F4, F2, F8
      stall
      stall
      MULD  F6, F4, F10
      stall
      stall
      ADDDI R2, R2, #8
      BNE   R2, R6, LOOP
      SD    92(R2), F6

Now we need 11 cycles to complete one iteration.

Comments. Some of you assumed that a latency of 2 meant you needed 2 stalls. Some of you had difficulty in reordering instructions or using delay slots.

2. (30%) Consider that you want to build a pipelined processor to implement ACCUMULATOR type instructions.
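Before moving on, the cycle counts in the key to problem 1 can be reproduced with a short script. This is a sketch, not part of the exam: the (op, destination, sources) encoding is made up, and it assumes, as the key states, that a result becomes usable `latency` cycles after its producer starts, with full forwarding and one extra stall after BNE.

```python
# Sketch: reproduce the cycle counts in the key to problem 1.
# Assumption: a result is usable `latency` cycles after its producer starts.
LATENCY = {"LD": 2, "SD": 2, "ADDD": 3, "MULD": 5, "ADDDI": 1, "BNE": 1}

def issue_cycles(instrs):
    """Greedy in-order schedule: each instruction starts as soon as the
    previous one has started and all of its source values are usable."""
    ready, cycle, out = {}, 0, []
    for op, dest, srcs in instrs:
        cycle = max([cycle + 1] + [ready.get(r, 0) for r in srcs])
        out.append(cycle)
        if dest:
            ready[dest] = cycle + LATENCY[op]
    return out

loop = [
    ("LD",    "F2", ["R2"]),
    ("ADDD",  "F4", ["F2", "F8"]),
    ("MULD",  "F6", ["F4", "F10"]),
    ("SD",    None, ["R2", "F6"]),
    ("ADDDI", "R2", ["R2"]),
    ("BNE",   None, ["R2", "R6"]),
]
starts = issue_cycles(loop)
assert starts == [1, 3, 6, 11, 12, 13]
assert starts[-1] + 1 == 14   # plus the stall after BNE, as in part (a)
```

Reordering the list as in part (b), with SD in the delay slot, gives the 11-cycle schedule the same way.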
For the purpose of this problem, we will assume that there is only one general purpose register, the accumulator (ACC for short). So we will have instructions like:

LOAD address   : load ACC from address
ADD  address   : add to ACC the value at address; the result is stored back into ACC

Likewise you can have other instructions for Subtract, Multiply and Store. We can also have immediate forms, with "address" replaced by a constant value, e.g., ADDI #literal. Branch instructions compare the Accumulator to zero, so we can have BZ, BNZ, BNEG and BPOS; for these instructions, "address" is a displacement that is added to the PC.

Sketch pipelined data paths for this type of an architecture.

Key. For this problem I will use 4 stages: Instruction Fetch, Decode/Memory, Execute and Writeback.

Instruction Fetch. Fetch the instruction and increment the PC by 4.

Decode/Memory. Decode the instruction and access data memory (either get an operand or store the contents of the Accumulator into memory); also read the Accumulator for use in an arithmetic operation. Note that if the instruction has a literal value, the literal is used in the arithmetic operation instead of the data read from memory. If the instruction is a Store, write the contents of the Accumulator to memory.

Execute. If the instruction is an arithmetic instruction, perform the arithmetic operation on the value in the Accumulator and either the data from memory or the literal value from the instruction. At the same time we can also test the value in the Accumulator to see if it is zero, non-zero, positive or negative.

Writeback. Write the result of the instruction back to the Accumulator. The result may come from the arithmetic unit or from memory. Change the PC as needed depending on the condition (if the instruction is a branch).

Comments. Some of you seem to be completely lost. Some of you used the same data paths as those for the MIPS architecture.
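As a rough illustration of the instruction semantics described above (not of the pipeline itself), here is a minimal accumulator-machine interpreter. The program encoding, the memory-as-dictionary model and the halting rule are assumptions made for this sketch.

```python
# Minimal interpreter sketch for the accumulator ISA described above.
def run(program, memory):
    acc, pc = 0, 0
    while 0 <= pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "LOAD":
            acc = memory[arg]
        elif op == "ADD":
            acc += memory[arg]
        elif op == "ADDI":
            acc += arg                  # immediate instead of a memory operand
        elif op == "STORE":
            memory[arg] = acc
        elif op == "BZ" and acc == 0:
            pc += arg                   # displacement added to the PC
    return acc

mem = {0: 5, 1: 7, 2: 0}
acc = run([("LOAD", 0), ("ADD", 1), ("ADDI", 3), ("STORE", 2)], mem)
assert acc == 15 and mem[2] == 15
```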
[Pipeline diagram for the accumulator data path: IF/MEM, MEM/EX and EX/WB latches; the PC (with a +4 integer adder) feeds instruction fetch; the data memory supplies an operand in Decode/Memory; the Accumulator feeds the arithmetic unit and a test-for-condition block.]

3. (15%) In a standard 5-stage pipeline, the outcome of a branch is known in the 3rd stage (Execute). Consider two choices for branch instructions:

a) Use a delayed branch with one delay slot (i.e., one instruction after the branch will be executed) and stop fetching any additional instructions upon discovering a branch instruction.
b) Use a delayed branch with 2 delay slots (i.e., two instructions after the branch will be executed).

In the first case, we have a stall on a branch and lose one cycle (even if we save a cycle because of the delay slot). In the second case there will be no stalls. However, it is more difficult to find two useful instructions that can be placed after a branch (and fill the 2 delay slots). If we cannot find a useful instruction for a delay slot, we use a NOOP, and for the purpose of this example we will assume that it is a wasted slot. Compare these two alternatives if 20% of all instructions are branches, and an optimizing compiler can find a useful instruction for one delay slot 80% of the time but can fill 2 delay slots only 25% of the time.

Key. Consider the case with 1 delay slot. We still lose one cycle even if we can use the delay slot, and we lose 2 cycles if we cannot use the delay slot. Since we can use the one delay slot 80% of the time, we have

20%*1*80% + 20%*2*20% = 0.16 + 0.08 = 0.24 cycles lost.

In the second case, if we can use both delay slots, we lose no cycles; if only one delay slot can be used, we lose one cycle; and if neither delay slot can be used, we lose two cycles. Thus we have

20%*1*(100-25)%*80% + 20%*2*(100-25)%*20% = 0.12 + 0.06 = 0.18 cycles lost.

In this example, creating two delay slots is better.

Comments. Some of you just compared the stalls for branch instructions only (that is OK with me).
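The expected-loss arithmetic in the key to problem 3 can be spelled out directly, using only the percentages given in the problem:

```python
# Expected cycles lost per instruction to branches, as in the key.
branches = 0.20

# One delay slot: lose 1 cycle when the slot is filled (80% of branches),
# 2 cycles when it is not (20%).
one_slot = branches * (0.80 * 1 + 0.20 * 2)

# Two delay slots: no loss when both are filled (25%); otherwise (75%),
# one slot is filled 80% of the time (lose 1) and neither 20% (lose 2).
two_slots = branches * 0.75 * (0.80 * 1 + 0.20 * 2)

assert abs(one_slot - 0.24) < 1e-9
assert abs(two_slots - 0.18) < 1e-9
```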
Some of you did not account for all the cases in the second option. You may be able to fill both delay slots with useful instructions (25% of the time), fill only one of the two delay slots (80% of the remaining 75% of the time), or fill neither of the two delay slots.

4. (25%) For the code in problem #1 above, use the Scoreboard technique and show the contents of the Instruction Status and Functional Unit Status tables for two snapshots:

a) at the initial state, when no instruction has completed execution;
b) when LD and ADDD have completed.

Remember that for the Scoreboard we assume one Integer unit (for LD, SD and integer arithmetic instructions), 2 floating-point Multiply units, one floating-point Add unit and one floating-point Divide unit. For your convenience, I am giving you the template for the Instruction Status and Functional Unit Status tables.

Key.
a) Before any instruction has completed execution.

Instruction Status       Issue   Read Operands   Execute   Write Results
LD    F2, 100(R2)          X           X
ADDD  F4, F2, F8           X
MULD  F6, F4, F10          X
SD    100(R2), F6
ADDDI R2, R2, #8
BNE   R2, R6, Loop

Functional Unit Status
Name      Busy   Op     Fi    Fj    Fk     Qj        Qk    Rj    Rk
Integer   Yes    Load   F2    R2                           No
MULT-1    Yes    Mult   F6    F4    F10    ADD             No    Yes
MULT-2    No
ADD       Yes    Add    F4    F2    F8     Integer         No    Yes
DIV       No

Note that we show No under Rj for the Load instruction since in this snapshot we have already read R2, and we want to indicate that WAR is no longer an issue (if there is an instruction waiting to write to R2, it can proceed).
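The issue rule the Scoreboard applies in snapshot (a) is: issue only if the required functional unit is free and no already-issued instruction will write the same destination register (no WAW). A sketch of that test follows; the dictionary layout is an illustrative assumption, not the exam's template.

```python
# Sketch of the Scoreboard issue condition (unit free and no WAW hazard).
def can_issue(fu, dest, fu_status):
    unit_free = not fu_status[fu]["busy"]
    no_waw = all(not s["busy"] or s["Fi"] != dest for s in fu_status.values())
    return unit_free and no_waw

# Snapshot (a): Integer busy with LD (Fi=F2), MULT-1 busy (Fi=F6), ADD busy (Fi=F4).
status = {
    "Integer": {"busy": True,  "Fi": "F2"},
    "MULT-1":  {"busy": True,  "Fi": "F6"},
    "MULT-2":  {"busy": False, "Fi": None},
    "ADD":     {"busy": True,  "Fi": "F4"},
    "DIV":     {"busy": False, "Fi": None},
}
# SD needs the Integer unit, which is busy, so it cannot issue.
assert not can_issue("Integer", None, status)
# A hypothetical MULD writing F12 could issue on the free MULT-2 unit.
assert can_issue("MULT-2", "F12", status)
```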
b) After LD and ADDD have completed.

Instruction Status       Issue   Read Operands   Execute   Write Results
LD    F2, 100(R2)          X           X            X            X
ADDD  F4, F2, F8           X           X            X            X
MULD  F6, F4, F10          X           X
SD    100(R2), F6          X
ADDDI R2, R2, #8
BNE   R2, R6, Loop

Functional Unit Status
Name      Busy   Op     Fi    Fj    Fk     Qj    Qk        Rj    Rk
Integer   Yes    SD           R2    F6           MULT-1    No    No
MULT-1    Yes    Mult   F6    F4    F10                    No    No
MULT-2    No
ADD       No
DIV       No

Note that for Multiply I am assuming that the instruction has read its operands; thus both F4 and F10 have been read, and any other instruction waiting to write to these registers can proceed (no longer a WAR issue), as indicated by No in the Rj and Rk fields. Note that ADDDI and BNE cannot proceed since they are waiting for the Integer unit, which is now waiting to execute SD. If we had reordered the code (and possibly used a delayed branch), we could have moved ADDDI and BNE before SD and completed those instructions.

Comments. Some of you did not issue MULD in the first snapshot. Remember you can issue an instruction if the functional unit is available and there is no WAW dependency on the destination register.

5. (35%) In some older architectures an indirect memory address mode is permitted. Consider the following instruction:

LWI Rd, disp

This instruction uses disp as an indirect address. That is, it uses disp as a memory address, reads the contents of memory at that address, and uses the value just read as the address of the real operand (that is, it reads memory again and stores the value in Rd). Consider the following example:

LWI R2, 100

Let us assume that at memory address 100 we have the value 1500. The instruction uses 100 as an indirect address and obtains the value 1500 stored at memory address 100. The actual address of the operand is 1500. If we have a value of -10 in memory location 1500, then the instruction will load -15 into R2. Note that such indirect addressing is applicable to both Load and Store. Describe how we can modify our pipeline design for DLX to implement the indirect address mode.
Show the pipeline stages and data paths to indicate which hardware units are accessed in each pipeline stage. Also describe in English the functionality of each stage. Indicate the number of read and write ports needed to the data memory to avoid structural hazards.

Key. As most of you discovered the typo: in the above example instruction using the indirect mode, we load -10 into R2 (not -15 as indicated in the problem description). I am sorry for this typo that may have confused some of you.

Since the indirect address mode requires that we access memory twice, we need to design a pipeline with two memory-access stages. Consider the following diagram.

[Pipeline diagram with six stages: Instr Fetch, Instr Decode/Reg. Fetch, Execute, Memory-1, Memory-2, Write-Back. The PC and +4 adder feed the instruction memory; the displacement (disp) addresses the data memory in Memory-1; an "Indirect?" check decides whether the value fetched in Memory-1 is used as an address for a second access in Memory-2; a mux selects the Memory-1 operand (direct) or the Memory-2 operand (indirect) for Write-Back; stores write in Memory-1 (direct) or Memory-2 (indirect).]

To simplify, I have eliminated some of the data paths that are needed to handle branch instructions and immediate operands (i.e., sign-extension hardware, testing for zero for branch instructions, etc.). The main change is the introduction of two separate memory stages. In Memory-1, we use the displacement from the instruction to fetch a data value. We check whether the instruction is an indirect instruction; if so, we use the data fetched from memory as an address and access memory again in Memory-2. The mux in Memory-2 forwards either the data fetched in Memory-1 or the data fetched in Memory-2 to Write-Back. Note that the decision is based on the opcode, which is obtained from the pipeline latches. For stores, we may also have direct or indirect addresses. If the opcode is SW (direct), we store in the Memory-1 stage; if the opcode is SWI, we do the store in the Memory-2 stage. I hope the diagram is clear with these data paths.
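The indirect-load semantics that this pipeline implements can be stated in a few lines. This is a sketch; the dictionary-as-memory model is an assumption, and the values come from the example in the problem statement.

```python
# Indirect load: disp names a memory cell whose contents are the address
# of the real operand, so two memory accesses are needed.
def lw(memory, disp):           # direct load: one access
    return memory[disp]

def lwi(memory, disp):          # indirect load: two accesses
    return memory[memory[disp]]

mem = {100: 1500, 1500: -10}
assert lw(mem, 100) == 1500
assert lwi(mem, 100) == -10     # the value actually loaded into R2
```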
As can be seen, we need two read ports and two write ports to the data memory to avoid stalls, since it is possible to have two indirect loads in sequence, or an SWI followed by an SW. Consider the following examples:

LWI R1, disp1          SWI disp1, R1
LW  R2, disp2          SW  disp2, R2

The example on the left shows why we need two read ports to the data memory (assuming instructions are in a separate memory that is accessed by IF; otherwise we need 3 read ports). The example on the right shows why we need two write ports, since the second store (SW) writes in its Memory-1 stage in the same cycle in which SWI writes in its Memory-2 stage.

Although I did not ask for dependencies, here is a bit of discussion on data dependencies. If we have the following sequence of instructions:

LWI R1, disp
ADD R3, R1, R2

the ADD will incur two stalls, since the operand (R1) for ADD will not be available until LWI passes the Memory-2 stage (we can forward the data from MEM2/WB to ID/EX). Likewise, if we have

LWI R1, disp
<some instruction with no dependency on R1>
ADD R3, R1, R2

the ADD will incur one stall (and we need to forward the data from MEM2/WB to ID/EX).

Note that we could have rearranged the pipe stages (say, IF, ID, Mem-1, Mem-2, EX, WB) to reduce or eliminate stalls from LW or LWI to an ALU instruction. However, this can cause additional stalls on branches, since the value of a register tested by a branch may not be available until EX.

6. (20%) Consider the following code segment:

1: ADD  R3, R0, R7
2: LW   R8, 0(R3)
3: ADDI R3, R3, #4
4: LW   R9, 0(R3)
5: SUB  R1, R8, R9
6: BNEG R1, Exit

Note R0 is hardwired to zero. List all dependencies (i.e., RAW, WAR, WAW) among these instructions. Use register renaming to eliminate as many of these dependencies as possible (and indicate which dependencies were eliminated).

Key.
RAW on R3 from 1 to 2.
RAW on R3 from 1 to 3.
RAW on R8 from 2 to 5.
RAW on R3 from 3 to 4.
RAW on R9 from 4 to 5.
RAW on R1 from 5 to 6.
WAW on R3 from 1 to 3.
WAR on R3 from 2 to 3.

If we assume that this code segment is not in a loop, we can eliminate the use of R3 for the loads completely; R3 is the only register that causes the unnecessary WAW and WAR dependencies. So our new code looks like:

1: LW   R8, 0(R7)
2: LW   R9, 4(R7)
3: SUB  R1, R8, R9
4: ADDI R3, R7, #4
5: BNEG R1, Exit

Note that statement 4 is included because we do not know whether the value of R3 is needed later. Now we have only RAW (true) dependencies on R8 and R9 between 1, 2 and 3, and a RAW on R1 between 3 and 5. If we assume a compare-and-branch instruction (Branch on Less Than, comparing two registers), we can also eliminate the RAW on R1:

1: LW   R8, 0(R7)
2: LW   R9, 4(R7)
3: ADDI R3, R7, #4
4: BLT  R8, R9, Exit

Note that when you rename R3 (as most of you did), you need to change the references to R3 to use the new register, as in the second LW. Some of you did not consider that the value of R3 after the ADDI may be needed elsewhere.

7. (30%) You are given the following code. Note that floating-point instructions use floating-point registers labeled F; integer instructions use integer registers labeled R. We are given the following latencies (that is, a dependent instruction must wait this many cycles for the data from its predecessor):

Floating-point Add/Sub                      2
Floating-point Multiply                     3
Load                                        1
Integer arithmetic (using data forwarding)  0

Loop: LD    F0, 0(R1)
      MULD  F0, F0, F2
      LD    F4, 0(R2)
      ADDD  F0, F0, F4
      SD    0(R2), F0
      SUBUI R1, R1, #8
      SUBUI R2, R2, #8
      BNEZ  R1, Loop

Assuming a single-issue pipeline, unroll the loop 3 times and schedule the instructions to minimize the number of cycles needed to execute the code.

Key. Let us look at the original code with the appropriate number of stalls:

Loop: LD    F0, 0(R1)
      stall
      MULD  F0, F0, F2
      stall
      LD    F4, 0(R2)
      stall
      ADDD  F0, F0, F4
      stall
      stall
      SD    0(R2), F0
      SUBUI R1, R1, #8
      SUBUI R2, R2, #8
      BNEZ  R1, Loop

We need 13 cycles to complete one iteration of the loop.
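The 13-cycle count can be cross-checked with a small script. This is a sketch with a made-up (op, destination, sources) encoding; per the latency table above, it assumes a dependent instruction may start latency + 1 cycles after its producer. Its greedy schedule places the second LD one cycle earlier than the key's listing, but it reaches the same 13-cycle total.

```python
# Sketch: cross-check the 13-cycle count for one iteration of problem 7.
LAT = {"LD": 1, "MULD": 3, "ADDD": 2, "SD": 1, "SUBUI": 0, "BNEZ": 0}

def issue_cycles(instrs):
    ready, cycle, out = {}, 0, []
    for op, dest, srcs in instrs:
        cycle = max([cycle + 1] + [ready.get(r, 0) for r in srcs])
        out.append(cycle)
        if dest:
            ready[dest] = cycle + LAT[op] + 1   # usable latency+1 cycles later
    return out

loop = [
    ("LD",    "F0", ["R1"]),
    ("MULD",  "F0", ["F0", "F2"]),
    ("LD",    "F4", ["R2"]),
    ("ADDD",  "F0", ["F0", "F4"]),
    ("SD",    None, ["R2", "F0"]),
    ("SUBUI", "R1", ["R1"]),
    ("SUBUI", "R2", ["R2"]),
    ("BNEZ",  None, ["R1"]),
]
starts = issue_cycles(loop)
assert starts[-1] == 13   # BNEZ issues in cycle 13, matching the key
```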
Now look at the loop unrolled 3 times (using additional registers, and correcting the displacements to load and store values from different array locations):

Loop: LD    F0, 0(R1)
      LD    F4, 0(R2)
      LD    F6, -8(R1)
      LD    F8, -8(R2)
      LD    F10, -16(R1)
      LD    F12, -16(R2)
      MULD  F0, F0, F2
      MULD  F6, F6, F2
      MULD  F10, F10, F2
      stall
      ADDD  F0, F0, F4
      ADDD  F6, F6, F8
      ADDD  F10, F10, F12
      SD    0(R2), F0
      SD    -8(R2), F6
      SD    -16(R2), F10
      SUBUI R1, R1, #24
      SUBUI R2, R2, #24
      BNEZ  R1, Loop

We need the one stall before ADDD F0, F0, F4 since MULD F0, F0, F2 needs 3 cycles before the data in F0 can be used. We can reorder the instructions (by moving one of the SUBUI instructions, SUBUI R1, R1, #24) to eliminate this stall:

Loop: LD    F0, 0(R1)
      LD    F4, 0(R2)
      LD    F6, -8(R1)
      LD    F8, -8(R2)
      LD    F10, -16(R1)
      LD    F12, -16(R2)
      MULD  F0, F0, F2
      MULD  F6, F6, F2
      MULD  F10, F10, F2
      SUBUI R1, R1, #24
      ADDD  F0, F0, F4
      ADDD  F6, F6, F8
      ADDD  F10, F10, F12
      SD    0(R2), F0
      SD    -8(R2), F6
      SD    -16(R2), F10
      SUBUI R2, R2, #24
      BNEZ  R1, Loop

Now we need 18 cycles to complete 3 iterations, or 6 cycles per iteration.

8. (20%) This problem deals with branch target buffers, BTB (to store the address to which a branch is taken). Assume that the branch misprediction penalty is 4 cycles. You are given that the branch misprediction rate is 10%, and the probability of finding the branch target (the hit rate in the BTB) is 80%. On a miss in the BTB, the penalty (you have to complete execution of the branch instruction) is 3 cycles. 20% of all instructions are branch instructions. The base CPI is 1.

a) What is the CPI contribution of branch instructions when using the BTB as described above?
b) What is the CPI contribution of branch instructions if you are not using a BTB?

Key.
a) Notice we have two cases here: whether we find an entry in the BTB for the branch instruction, and whether the branch prediction is correct.

Hit in the BTB (80% of the time):  80%*[90%*1 + 10%*4] = 1.04
Miss in the BTB (20% of the time): 20%*3 = 0.6
Total = 1.04 + 0.6 = 1.64 cycles per branch

But only 20% of the instructions are branches.
So the CPI contribution of branches = 20%*1.64 = 0.328 cycles.

b) If there is no BTB, all branch instructions take 3 cycles. Since 20% of all instructions are branches, the CPI contribution = 20%*3 = 0.6 cycles.

9. (25%) Examine the following code segment. Assume a 5-stage pipeline and normal forwarding of data on read-after-write dependencies.

Loop: LD    R3, 0(R5)
      ADDD  R7, R7, R3
      LD    R4, 4(R5)
      MULD  R8, R8, R4
      ADDD  R10, R7, R8
      SD    R10, 0(R5)
      BEQ   R10, R11, Loop

a) Show how many cycles are needed to execute this sequence of code.
b) Can you reorder the instructions to improve the number of cycles needed? Show the reordered code.

Key.
a) Assuming normal forwarding, only an LD followed by an arithmetic instruction that uses its result causes a stall. We will also assume that the branch causes a stall. Thus:

Loop: LD    R3, 0(R5)
      stall
      ADDD  R7, R7, R3
      LD    R4, 4(R5)
      stall
      MULD  R8, R8, R4
      ADDD  R10, R7, R8
      SD    R10, 0(R5)
      BEQ   R10, R11, Loop
      stall

We need 10 cycles. If we follow the flow through the pipeline (S marks a stall cycle), we have:

Cycle       1  2  3  4  5  6  7  8  9  10 11 12 13
LD   R3     F  D  E  M  W
ADDD R7        F  S  D  E  M  W
LD   R4           F  S  D  E  M  W
MULD R8              F  S  S  D  E  M  W
ADDD R10                F  S  S  D  E  M  W
SD   R10                   F  S  S  D  E  M  W
BEQ                           F  S  S  D  E  M  W

b) Reordering the code is easy. I will also assume one delay slot so that we have no stalls:

Loop: LD    R3, 0(R5)
      LD    R4, 4(R5)
      ADDD  R7, R7, R3
      MULD  R8, R8, R4
      ADDD  R10, R7, R8
      BEQ   R10, R11, Loop
      SD    R10, 0(R5)

Without any stalls, each iteration of the loop takes 7 cycles. The flow through the pipeline is shown below.

Cycle       1  2  3  4  5  6  7  8  9  10 11
LD   R3     F  D  E  M  W
LD   R4        F  D  E  M  W
ADDD R7           F  D  E  M  W
MULD R8              F  D  E  M  W
ADDD R10                F  D  E  M  W
BEQ                        F  D  E  M  W
SD   R10                      F  D  E  M  W

10. (30%) For the code in problem 9, let us assume that the latency for multiplication is 5 cycles and the latency for ADD is 3 cycles. The latency for all other instructions is 1 cycle.
Using a single-issue speculative processor, show a table similar to that on page 237 (note we are using single issue, unlike the figure on page 237). Show the table for 3 iterations of the loop.

Key.

Iter  Instruction           Issued  Executes  Memory  Write CDB  Commits  Comments
1     LD   R3, 0(R5)           1       2        3        4          5
1     ADDD R7, R7, R3          2       5                 8          9     Wait for LD
1     LD   R4, 4(R5)           3       4        5        6         10
1     MULD R8, R8, R4          4       7                12         13     Wait for LD
1     ADDD R10, R7, R8         5      13                16         17     Wait for MULD
1     SD   R10, 0(R5)          6      17       18                         Wait for ADDD
1     BEQ  R10, R11, Loop      7      17                           18     Wait for ADDD
2     LD   R3, 0(R5)           8       9       10       11         19
2     ADDD R7, R7, R3          9      12                15         20     Wait for LD
2     LD   R4, 4(R5)          10      11       12       13         21
2     MULD R8, R8, R4         11      14                19         22     Wait for LD
2     ADDD R10, R7, R8        12      20                23         24     Wait for MULD
2     SD   R10, 0(R5)         13      24       25                         Wait for ADDD
2     BEQ  R10, R11, Loop     14      24                           25     Wait for ADDD
3     LD   R3, 0(R5)          15      16       17       18         26     Wait for SD
3     ADDD R7, R7, R3         16      19                22         27     Wait for LD and previous ADDD
3     LD   R4, 4(R5)          17      18       19       20         28
3     MULD R8, R8, R4         18      21                26         29     Wait for LD
3     ADDD R10, R7, R8        19      27                30         31     Wait for MULD
3     SD   R10, 0(R5)         20      31       32                         Wait for ADDD
3     BEQ  R10, R11, Loop     21      31                           32     Wait for ADDD

Note we are using single issue, not multiple issue. We will assume one FP adder, one FP multiplier and one LD/SD unit (I am not using multiple reservation stations with the adders and multiplier). We need to account for possible structural hazards in starting instructions: we may have to delay an issue if the required functional unit is not available (you can also show this as delaying execution). Likewise, we may have to delay posting results on the CDB if the CDB is being used by an earlier instruction. In this example we need no delays due to structural hazards. Also, in this example we are committing one instruction at a time (except for SD, since no commit is needed for a store); however, it may be possible to commit all instructions that have completed.
So we have a total of 32 cycles to complete 3 iterations, or 10.67 cycles per iteration, or 21 instructions in 32 cycles for an IPC of 0.656 instructions per cycle.
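The summary arithmetic for problem 10 follows directly from the 7-instruction loop body and the 32-cycle total:

```python
# Problem 10 summary: 3 iterations of a 7-instruction loop in 32 cycles.
instructions = 7 * 3
cycles = 32
assert instructions == 21
assert round(cycles / 3, 2) == 10.67             # cycles per iteration
assert round(instructions / cycles, 3) == 0.656  # instructions per cycle (IPC)
```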
© Copyright 2024