ENCM 501 Winter 2015 Assignment 8
for the Week of March 23
Steve Norman
Department of Electrical & Computer Engineering
University of Calgary
March 2015
Assignment instructions and other documents for ENCM 501 can be found at
http://people.ucalgary.ca/~norman/encm501winter2015/
1 Administrative details
1.1 Group work is permitted
Here are the options:
• You may do your work entirely individually.
• A group of two or three students may hand in a single assignment for the
whole group.
• Collaboration at the level of individual exercises is acceptable. In that case,
submissions of complete, individual assignments are required, with explicit
acknowledgments given as needed on an exercise-by-exercise basis.
Informal discussion of assignment exercises between students is encouraged, and
does not need to be acknowledged.
Please be aware that all students are expected to understand all assignment exercises! Collaboration is of course not allowed on quizzes, the midterm test, and the
final exam.
1.2 Due Dates
The Due Date for this assignment is 3:30pm, Thursday, March 26.
The Late Due Date is 3:30pm, Friday, March 27.
The penalty for handing in an assignment after the Due Date but before the Late
Due Date is 3 marks. In other words, X/10 becomes (X–3)/10 if the assignment
is late. There will be no credit for assignments turned in after the Late Due Date;
they will be returned unmarked.
1.3 Marking scheme
A        10 marks
B         5 marks
C         4 marks
total    19 marks
1.4 How to package and hand in your assignments
Please see the instructions in Assignment 1.
2 Exercise A: A scalar pipeline with branch prediction
2.1 Read This First
Figure 1 is an overview of the 7-stage pipeline introduced in the tutorial of March
18, 2015. The enhancement of dynamic branch prediction is proposed on slides 11
and 12 for that tutorial, but there wasn’t time to look at those slides in the tutorial
period. This exercise will look at dynamic branch prediction for that 7-stage pipeline
in more detail.
Figure 1: 7-stage pipeline from a tutorial period.
[Diagram: stages IT, IC, DEC, EX, DT, DC and WB, separated by the clocked pipeline registers IT/IC, IC/DEC, DEC/EX, EX/DT, DT/DC and DC/WB. Labelled blocks: next PC logic with branch predict, GPRs, ALU.]
Figure 2 shows much more detail for the first three pipeline stages. Let’s assume
that the instructions supported are these from the MIPS32 ISA: the five R-type
instructions ADD, SUB, AND, OR, SLT, plus LW, SW and BEQ.
To understand how the circuit handles branches, it is perhaps best to trace a
detailed example. Here is a code sequence, with virtual instruction addresses given:
0x00400064        ADD  R4, R4, R8
0x00400068        ADD  R5, R5, R8
0x0040006c        ADD  R6, R6, R8
0x00400070        BEQ  R9, R0, L1
0x00400074        OR   R10, R17, R18
0x00400078        LW   R11, 8(R19)
                  more instructions
0x00400094   L1:  AND  R20, R20, R24
0x00400098        SUB  R21, R21, R25
We’ll assume, unrealistically, that there are never any TLB or cache misses, so
we can concentrate on the processing of the BEQ instruction and its neighbours.
We’ll also assume that this code fragment has recently been processed several times
as part of a program run, so the branch predictor contains information about the
branch target address (0x00400094 in this example) and has a prediction about
whether the branch should be taken. Let’s use Cycle 1 as a name for the clock cycle
in which the IT stage of the BEQ instruction takes place, Cycle 2 as a name for the
very next cycle, and so on. Figure 3 describes what happens over Cycles 1, 2, and 3,
in the case where the prediction about the BEQ is that it will be taken.
Figure 2: Sketch of the IT, IC and DEC stages. Most of the details needed to make
branches work are shown, but other important details (logic to handle TLB and cache
misses, and logic in the DEC stage to support LW, SW and R-type instructions) are
left out. The signal Branch is 1 if and only if the Control Unit sees that bits 31:26 of
the instruction are 000100 in binary, the opcode for BEQ. The pipeline registers have CLR (clear)
inputs, to help with cancelling instructions, and EN (enable) inputs, to help implement
stalls.
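[Figure 2 diagram. Blocks: next PC logic with branch predict, PC, ITLB, +4 adder, Icache, control, GPRs, forwarding muxes, Make BTA, two equality comparators, and the IT/IC and IC/DEC pipeline registers with EN and CLR inputs. Signals: PTakenIT, PTakenIC, PTakenD, PTargetIT, PTargetIC, PCPlus4IT, PCPlus4IC, PCPlus4D, Instr (fields 31:26, 25:21, 20:16, 15:0, 5:0), Branch, BranchTarget, GPRMatch, GoodTarget, TakeWasWrong.]
2.2 What to Do, Part I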
Study Figures 2 and 3 carefully. Then draw a diagram like the one below, with two
more rows, to confirm that the number of lost clock cycles for misprediction in the
example of Figure 3 is exactly two.
       cycle 1   cycle 2   cycle 3   cycle 4     cycle 5     cycle 6     cycle 7
BEQ    IT        IC        DEC
AND              IT        IC        cancelled   cancelled   cancelled   cancelled
2.3 What to Do, Part II
Why is the comparison that generates the GoodTarget signal necessary? Why would
the predictor ever predict that a branch would be taken, but produce an incorrect
prediction about the branch target address? (Hint: Think about context switches.)
Answer briefly, but precisely.
Figure 3: Trace of handling BEQ in a case where the branch predictor predicts a taken
branch.
• Cycle 1.
– PC = 0x00400070; this is the virtual address of BEQ.
– PTakenIT = 1 to indicate a prediction of taken.
– PTargetIT = 0x00400094; this is the predicted branch target address.
– PCPlus4IT = 0x00400074
– The I-TLB writes the physical address of the BEQ instruction into the IT/IC
register.
• Cycle 2.
– PC = 0x00400094; this is the virtual address of AND—the predictor got
that ready because it predicted a taken branch in the previous cycle.
– PTakenIT = 0; there was a miss in the branch history table within the branch
predictor.
– PTargetIT = don’t care—this value will never be used.
– PCPlus4IT = 0x00400098
– The I-TLB writes the physical address of the AND instruction into the IT/IC
register.
– PCPlus4IC = 0x00400074
– The I-cache writes the machine code for BEQ into the IC/DEC register.
• Cycle 3. These things happen regardless of whether the prediction was correct . . .
– PC = 0x00400098; this is the virtual address of SUB.
– PTakenIT = 0, due to another miss in the branch history table.
– PTargetIT = don’t care
– PCPlus4IT = 0x0040009c
– PCPlus4D = 0x00400074
– The output of the Make BTA unit is 0x00400094, the same as the predicted target.
– The four bits Branch, PTakenD, GPRMatch, and GoodTarget are copied
back to the branch predictor to help it choose the PC for Cycle 4, and to
help it update its branch history table.
Misprediction is detected if Branch and PTakenD are both 1 and either GoodTarget
is 0 (which doesn’t happen in this example) or GPRMatch is 0 (which would happen
in this example if R9 ≠ R0). This kind of misprediction results in TakeWasWrong
= 1. In the case of misprediction . . .
– The CLR inputs of IC/DEC and IT/IC will be asserted, so the AND and
SUB instructions get cancelled.
– The PCPlus4D value will be used as the PC value in Cycle 4 to start the
fetch of OR.
In the case of successful prediction . . .
– AND and SUB instructions are allowed to continue.
– The PC value in Cycle 4 will be 0x0040009c.
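The Branch, Make BTA and TakeWasWrong descriptions above can be modelled in a few lines of C. This is only a software sketch under stated assumptions, not the hardware design from the tutorial. The function names is_beq, make_bta and take_was_wrong are hypothetical. The BEQ opcode value and the target formula (PC+4 plus four times the sign-extended immediate) are the standard MIPS32 definitions. The immediate value 8 used for the example BEQ is derived here from the addresses 0x00400070 and 0x00400094, not given in the assignment. take_was_wrong covers only the predicted-taken case traced in Figure 3.

```c
#include <stdint.h>

/* Branch signal: 1 if bits 31:26 hold the MIPS32 BEQ opcode (000100 binary). */
int is_beq(uint32_t instr)
{
    return (instr >> 26) == 0x04u;
}

/* Make BTA (assumed standard MIPS32 rule): PC+4 plus the sign-extended
   16-bit immediate multiplied by 4. */
uint32_t make_bta(uint32_t pc_plus4, uint16_t imm16)
{
    int32_t offset = (int32_t)(int16_t)imm16;   /* sign-extend bits 15:0 */
    return pc_plus4 + (uint32_t)(offset * 4);
}

/* TakeWasWrong for a predicted-taken branch: the instruction is a BEQ,
   it was predicted taken, and either the predicted target was wrong or
   the two registers turned out not to be equal. */
int take_was_wrong(int branch, int ptaken_d, int gpr_match, int good_target)
{
    return branch && ptaken_d && (!good_target || !gpr_match);
}
```

For the BEQ at 0x00400070, make_bta(0x00400074, 8) evaluates to 0x00400094, the address of L1.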
2.4 What to Do, Part III
Now let’s consider the case that the predictor predicts that the BEQ will be untaken.
In the style of Figure 3, outline what will happen for both misprediction and correct
prediction. Here is a start:
• Cycle 1.
– PC = 0x00400070; this is the virtual address of BEQ.
– PTakenIT = 0 to indicate a prediction of untaken.
– PTargetIT = don’t care
– PCPlus4IT = 0x00400074
– The I-TLB writes the physical address of the BEQ instruction into the IT/IC
register.
In Cycle 3, be very precise about what the condition is to distinguish misprediction
from correct prediction.
After you’ve finished writing your outline, draw a diagram like the one you drew
for Part I to confirm that in this case the penalty for misprediction is also two
cycles.
2.5 What to Do, Part IV
The vaguely-described “forwarding muxes” unit in Figure 2 hints that in addition
to cycles lost due to misprediction, branch instructions may cause loss of cycles due
to data hazards. Here is an example of a very common pattern in MIPS code:
SLT  R8, R16, R17
BEQ  R8, R0, L99
Regardless of what prediction was made about the BEQ, the BEQ instruction needs
the SLT result to check whether the prediction was correct. This leads to a one-cycle
stall to allow forwarding of the EX result from SLT into the DEC stage of
BEQ:
                                cycle 1   cycle 2   cycle 3   cycle 4     cycle 5   cycle 6   cycle 7
SLT                             IT        IC        DEC       EX          DT        DC        WB
BEQ                                       IT        IC        cancelled
BEQ (IC repeated after stall)                                 IC          DEC
Make a similar diagram to find out how many cycles are lost in the data hazard in
this other common pattern in MIPS code:
LW   R8, 0(R16)
BEQ  R8, R0, L98
(Assume no D-TLB or D-cache misses—those would make everything much more
complicated!)
2.6 What to Do, Part V
Suppose that a program run has an IC (instruction count) of 1 billion, with the
following facts:
• Of those 1 billion instructions, 150 million are branches.
• 90% of branches are predicted correctly.
• The misprediction penalty is two cycles.
• 15% of branches require 3-cycle stalls due to data hazards.
• 20% of branches require 2-cycle stalls due to data hazards.
• 40% of branches require 1-cycle stalls due to data hazards.
What is the total number of cycles lost due to the branch instructions, and, assuming
unrealistically that no cycles are lost for any other reason, what is the CPI of the
program run?
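The bookkeeping behind this kind of question can be sketched in C. This is an illustration under assumptions, not a worked answer. The function names event_lost_cycles and cpi_with_lost_cycles are hypothetical. The sketch assumes a base CPI of exactly 1, that is, one instruction completes per cycle when nothing is lost, which matches the "no cycles are lost for any other reason" assumption.

```c
#include <assert.h>

/* Cycles lost to one kind of event: the number of branches, times the
   fraction of branches that suffer the event, times the penalty in cycles. */
double event_lost_cycles(double branch_count, double fraction, double penalty_cycles)
{
    return branch_count * fraction * penalty_cycles;
}

/* CPI assuming (hypothetically) a base CPI of 1: each instruction costs
   one cycle, and the lost cycles are added on top. */
double cpi_with_lost_cycles(double instr_count, double lost_cycles)
{
    assert(instr_count > 0.0);
    return (instr_count + lost_cycles) / instr_count;
}
```

With made-up numbers unrelated to this exercise, 80 instructions containing 40 branches, a quarter of which lose 2 cycles each, give event_lost_cycles(40, 0.25, 2) = 20 lost cycles and cpi_with_lost_cycles(80, 20) = 1.25. Losses from several kinds of event are summed before the CPI is computed.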
2.7 What to Do, Part VI
The data hazard penalties can be reduced by delaying the branch test to the EX
stage, at the cost of increasing the misprediction penalties.
For the program run of Part V, recalculate the total number of cycles
lost due to branch instructions, and recalculate the CPI, to account for the design
change.
If you have to make any assumptions, state clearly what they are.
2.8 What to Hand In
Part I: A diagram showing why the misprediction penalty is two cycles.
Part II: An answer to the question about GoodTarget.
Part III: Detailed outline of events in Cycles 2 and 3; another diagram showing why
the misprediction penalty in this case is also two cycles.
Part IV: Diagram, and a statement of the number of stall cycles required.
Parts V and VI: Clearly explained calculations for numbers of stall cycles and CPIs.
3 Exercise B: Cost of floating-point division
3.1 Read This First
As mentioned in a lecture, division operations tend to have much more latency than
addition, subtraction, or multiplication. This exercise will briefly look at that idea
by running some code in the ICT 320 lab.
3.2 What to Do
1. Make sure you are using a machine with an E4600 Core 2 Duo processor.
2. Copy the file scaleArray.c, and get copies of ts_funcs.c and ts_funcs.h,
which were used in previous assignments.
3. Build an executable called scaleArray:
gcc -O2 scaleArray.c ts_funcs.c -o scaleArray -lrt
Run the program several times with the command
./scaleArray 1.1
Use the results to get an estimate for the running time of the scale_array
function for the given input of 1.1. Let’s call this running time the base
experiment result.
4. Division hardware is often sophisticated enough to finish early
for easy special cases such as division by one or by infinity.
(a) Run the program several times with the command
./scaleArray 1.0
What is the speedup relative to the base experiment result?
(b) The largest finite 64-bit IEEE 754 FP value is about 1.797 × 10^308. Run
the program several times with the command
./scaleArray 2e308
That will have the effect of passing +∞ as the scale argument to
scale_array. What is the speedup relative to the base experiment result?
5. With exact real arithmetic, division by x is the same as multiplication by
1.0/x. But with FP arithmetic the value of (1.0 / x) * y may differ from
that of y / x, due to rounding. So an optimizing compiler is not free to
replace FP division with FP multiplication, unless the compiler is told in
some way that it is allowed to let FP results differ slightly from what the
source code is asking for.
Build an executable called scaleArray-fm:
gcc -O2 -ffast-math scaleArray.c ts_funcs.c -o scaleArray-fm -lrt
Run the program several times with the command
./scaleArray-fm 1.1
What is the speedup relative to the base experiment result?
6. Of course, it is possible to edit the source code to improve speed. Make a
copy of scaleArray.c called scaleArrayMult.c. In the new file, replace
a[i] = a[i] / scale;
with
a[i] = a[i] * (1.0 / scale);
Build an executable called scaleArrayMult:
gcc -O2 scaleArrayMult.c ts_funcs.c -o scaleArrayMult -lrt
Run the program several times with 1.1 as input, and calculate the speedup
relative to the base experiment result.
7. The change in the previous part might seem to make things worse instead
of better, because now the loop contains both an FP division and an FP
multiplication. Explain why that is not the case.
8. Suppose the speedup in Part 6 is 5.0. (It won’t be—that is just an example). Explain briefly why it would be wrong to conclude from that speedup
number that the latency for FP division must be 5 times the latency for FP
multiplication.
9. Denormalized numbers are nonzero FP numbers with magnitudes too small
to represent in the usual FP format. In the IEEE 754 64-bit system, the
best approximation to 1.0 × 10^−308 is a denormalized number. Run your
scaleArrayMult executable this way:
./scaleArrayMult 1e308
so that the value of 1.0 / scale will be approximately 1.0 × 10^−308. Calculate
the “speedup” relative to the base experiment result.
Suggest a reason why multiplication with a denormalized input is so much
slower than “normal” FP multiplication.
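Three of the floating-point facts used in the steps above can be checked directly in C: the rounding difference between y / x and y * (1.0 / x) from Step 5, the overflow of 2e308 to +∞ from Step 4(b), and the denormalized value of 1.0 / 1e308 from Step 9. The following is a minimal sketch. The function names div_and_mul_differ, parses_to_infinity and is_denormal are hypothetical. The use of strtod is an assumption, because the assignment does not say how scaleArray.c converts its command-line argument. The code must be built without -ffast-math for the first check to be meaningful.

```c
#include <math.h>
#include <stdlib.h>

/* 1 if y / x and y * (1.0 / x) round to different double values. */
int div_and_mul_differ(double y, double x)
{
    double quotient = y / x;
    double product = y * (1.0 / x);
    return quotient != product;
}

/* 1 if the decimal text is too large for a double and strtod returns
   infinity (assumes an IEEE 754 platform, where HUGE_VAL is +infinity). */
int parses_to_infinity(const char *text)
{
    return isinf(strtod(text, NULL)) != 0;
}

/* 1 if value is a denormalized (subnormal) number. */
int is_denormal(double value)
{
    return fpclassify(value) == FP_SUBNORMAL;
}
```

For example, div_and_mul_differ(3.0, 10.0) returns 1 because 3.0 / 10.0 and 3.0 * (1.0 / 10.0) round to neighbouring doubles, while div_and_mul_differ(4.0, 2.0) returns 0 because 1.0 / 2.0 is exact.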
3.3 What to Hand In
Answers for Parts 3–9.
4 Exercise C: Multiple cycles in the EX stage
4.1 Read This First
Figure 4: 5-stage latency for FP multiplication, 3-stage latency for FP addition/subtraction, and 2-stage latency for data memory access steps D1 and D2. (D1
might be needed, say, for an ISA in which loads and stores have fancy addressing modes.)
[Figure 4 diagram. Clocked stages shown: IF, ID, EX, M1 to M5, A1 to A3, D1, D2 and WB.]
Figure 4 is a variation on Figure C.35 in the course textbook. The latencies in
Figure 4 are shorter than the textbook latencies for FP arithmetic, but data memory
access takes two cycles.
4.2 What to Do
1. Determine the total number of stall cycles lost in the sequence below if forwarding is done at the earliest possible time.
L.D    F2, (R4)
ADD.D  F4, F4, F2
Assume that the only hazard is the obvious one related to the F2 register.
Hint: Draw a diagram, and realize that the A1 step of the ADD.D can’t start
until the D2 step of the L.D is over.
2. Repeat the above for this sequence, which computes a sum of squares:
MUL.D  F6, F2, F2
MUL.D  F8, F4, F4
ADD.D  F10, F6, F8
Assume that there are no hazards related to F2 or F4.
4.3 What to Hand In
Diagrams and clear explanations regarding the number of stall cycles needed in each
case.