ENCM 501 Winter 2015
Assignment 8 for the Week of March 23

Steve Norman
Department of Electrical & Computer Engineering
University of Calgary
March 2015

Assignment instructions and other documents for ENCM 501 can be found at http://people.ucalgary.ca/~norman/encm501winter2015/

1 Administrative details

1.1 Group work is permitted

Here are the options:
• You may do your work entirely individually.
• A group of two or three students may hand in a single assignment for the whole group.
• Collaboration at the level of individual exercises is acceptable. In that case, submissions of complete, individual assignments are required, with explicit acknowledgments given as needed on an exercise-by-exercise basis.

Informal discussion of assignment exercises between students is encouraged, and does not need to be acknowledged. Please be aware that all students are expected to understand all assignment exercises! Collaboration is of course not allowed on quizzes, the midterm test, and the final exam.

1.2 Due Dates

The Due Date for this assignment is 3:30pm, Thursday, March 26. The Late Due Date is 3:30pm, Friday, March 27.

The penalty for handing in an assignment after the Due Date but before the Late Due Date is 3 marks. In other words, X/10 becomes (X - 3)/10 if the assignment is late. There will be no credit for assignments turned in after the Late Due Date; they will be returned unmarked.

1.3 Marking scheme

A       10 marks
B        5 marks
C        4 marks
total   19 marks

1.4 How to package and hand in your assignments

Please see the instructions in Assignment 1.

2 Exercise A: A scalar pipeline with branch prediction

2.1 Read This First

Figure 1 is an overview of the 7-stage pipeline introduced in the tutorial of March 18, 2015. The enhancement of dynamic branch prediction is proposed on slides 11 and 12 for that tutorial, but there wasn’t time to look at those slides in the tutorial period.
This exercise will look at dynamic branch prediction for that 7-stage pipeline in more detail.

Figure 1: 7-stage pipeline from a tutorial period.
[Figure 1 diagram not reproduced. Labels in the figure: the stages IT, IC, DEC, EX, DT, DC and WB; the pipeline registers IT/IC, IC/DEC, DEC/EX, EX/DT, DT/DC and DC/WB; the ALU; the GPRs; and the next PC logic with branch predict.]

Figure 2 shows much more detail for the first three pipeline stages. Let’s assume that the instructions supported are these from the MIPS32 ISA: the five R-type instructions ADD, SUB, AND, OR, SLT, plus LW, SW and BEQ.

To understand how the circuit handles branches, it is perhaps best to trace a detailed example. Here is a code sequence, with virtual instruction addresses given:

0x00400064        ADD   R4, R4, R8
0x00400068        ADD   R5, R5, R8
0x0040006c        ADD   R6, R6, R8
0x00400070        BEQ   R9, R0, L1
0x00400074        OR    R10, R17, R18
0x00400078        LW    R11, 8(R19)
                  more instructions
0x00400094   L1:  AND   R20, R20, R24
0x00400098        SUB   R21, R21, R25

We’ll assume, unrealistically, that there are never any TLB or cache misses, so we can concentrate on the processing of the BEQ instruction and its neighbours. We’ll also assume that this code fragment has recently been processed several times as part of a program run, so the branch predictor contains information about the branch target address (0x00400094 in this example) and has a prediction about whether the branch should be taken.

Let’s use Cycle 1 as a name for the clock cycle in which the IT stage of the BEQ instruction takes place, Cycle 2 as a name for the very next cycle, and so on. Figure 3 describes what happens over Cycles 1, 2, and 3, in the case where the prediction about the BEQ is that it will be taken.

Figure 2: Sketch of the IT, IC and DEC stages. Most of the details needed to make branches work are shown, but other important details (logic to handle TLB and cache misses, and logic in the DEC stage to support LW, SW and R-type instructions) are left out.
The signal Branch is 1 if and only if the Control Unit sees that bits 31:26 of the instruction are 000100 in binary, the opcode for BEQ. The pipeline registers have CLR (clear) inputs, to help with cancelling instructions, and EN (enable) inputs, to help implement stalls.
[Figure 2 diagram not reproduced. Components shown: PC, next PC logic with branch predict, ITLB, Icache, the IT/IC and IC/DEC pipeline registers with EN and CLR inputs, GPRs, forwarding muxes, the Make BTA unit and the control unit. Signals shown include PTakenIT, PTakenIC, PTakenD, PTargetIT, PTargetIC, PCPlus4IT, PCPlus4IC, PCPlus4D, Instr (fields 31:26, 25:21, 20:16, 15:0, 5:0), Branch, BranchTarget, GPRMatch, GoodTarget and TakeWasWrong.]

2.2 What to Do, Part I

Study Figures 2 and 3 carefully. Then draw a diagram like the one below, with two more rows, to confirm that the number of lost clock cycles for misprediction in the example of Figure 3 is exactly two.

[Diagram: a pipeline timing chart with columns cycle 1 through cycle 7. The BEQ row shows IT, IC and DEC in cycles 1 to 3. The AND row shows IT and IC in cycles 2 and 3, followed by "cancelled" entries.]

2.3 What to Do, Part II

Why is the comparison that generates the GoodTarget signal necessary? Why would the predictor ever predict that a branch would be taken, but produce an incorrect prediction about the branch target address? (Hint: Think about context switches.) Answer briefly, but precisely.

Figure 3: Trace of handling BEQ in a case where the branch predictor predicts a taken branch.

• Cycle 1.
  – PC = 0x00400070; this is the virtual address of BEQ.
  – PTakenIT = 1 to indicate a prediction of taken.
  – PTargetIT = 0x00400094; this is the predicted branch target address.
  – PCPlus4IT = 0x00400074
  – The I-TLB writes the physical address of the BEQ instruction into the IT/IC register.
• Cycle 2.
  – PC = 0x00400094; this is the virtual address of AND. The predictor got that ready because it predicted a taken branch in the previous cycle.
  – PTakenIT = 0; there was a miss in the branch history table within the branch predictor.
  – PTargetIT = don’t care; this value will never be used.
  – PCPlus4IT = 0x00400098
  – The I-TLB writes the physical address of the AND instruction into the IT/IC register.
  – PCPlus4IC = 0x00400074
  – The I-cache writes the machine code for BEQ into the IC/DEC register.
• Cycle 3.
  These things happen regardless of whether the prediction was correct ...
  – PC = 0x00400098; this is the virtual address of SUB.
  – PTakenIT = 0, due to another miss in the branch history table.
  – PTargetIT = don’t care
  – PCPlus4IT = 0x0040009c
  – PCPlus4D = 0x00400074
  – The output of the Make BTA unit is 0x00400094, the same as the target address that was predicted in Cycle 1.
  – The four bits Branch, PTakenD, GPRMatch, and GoodTarget are copied back to the branch predictor to help it choose the PC for Cycle 4, and to help it update its branch history table. Misprediction is detected if Branch and PTakenD are both 1 and either GoodTarget is 0 (which doesn’t happen in this example) or GPRMatch is 0 (which would happen in this example if R9 ≠ R0). This kind of misprediction results in TakeWasWrong = 1.
  In the case of misprediction ...
  – The CLR inputs of IC/DEC and IT/IC will be asserted, so the AND and SUB instructions get cancelled.
  – The PCPlus4D value will be used as the PC value in Cycle 4 to start the fetch of OR.
  In the case of successful prediction ...
  – AND and SUB instructions are allowed to continue.
  – The PC value in Cycle 4 will be 0x0040009c.

2.4 What to Do, Part III

Now let’s consider the case that the predictor predicts that the BEQ will be untaken. In the style of Figure 3, outline what will happen for both misprediction and correct prediction. Here is a start:

• Cycle 1.
  – PC = 0x00400070; this is the virtual address of BEQ.
  – PTakenIT = 0 to indicate a prediction of untaken.
  – PTargetIT = don’t care
  – PCPlus4IT = 0x00400074
  – The I-TLB writes the physical address of the BEQ instruction into the IT/IC register.

In Cycle 3, be very precise about what the condition is to distinguish misprediction from correct prediction.
After you’ve finished writing your outline, draw a diagram like the one you drew for Part I to confirm that in this case the penalty for misprediction is also two cycles.

2.5 What to Do, Part IV

The vaguely-described “forwarding muxes” unit in Figure 2 hints that in addition to cycles lost due to misprediction, branch instructions may cause loss of cycles due to data hazards. Here is an example of a very common pattern in MIPS code:

    SLT   R8, R16, R17
    BEQ   R8, R0, L99

Regardless of what prediction was made about the BEQ, the BEQ instruction needs the SLT result to check whether the prediction was correct. This leads to a one-cycle stall to allow forwarding of the EX result from SLT into the DEC stage of BEQ:

                                 cycle 1  cycle 2  cycle 3  cycle 4    cycle 5  cycle 6  cycle 7
SLT                              IT       IC       DEC      EX         DT       DC       WB
BEQ                                       IT       IC       cancelled
BEQ (IC repeated after stall)                               IC         DEC

Make a similar diagram to find out how many cycles are lost in the data hazard in this other common pattern in MIPS code:

    LW    R8, 0(R16)
    BEQ   R8, R0, L98

(Assume no D-TLB or D-cache misses. Those would make everything much more complicated!)

2.6 What to Do, Part V

Suppose that a program run has an IC (instruction count) of 1 billion, with the following facts:
• Of those 1 billion instructions, 150 million are branches.
• 90% of branches are predicted correctly.
• The misprediction penalty is two cycles.
• 15% of branches require 3-cycle stalls due to data hazards.
• 20% of branches require 2-cycle stalls due to data hazards.
• 40% of branches require 1-cycle stalls due to data hazards.

What is the total number of cycles lost due to the branch instructions, and, assuming unrealistically that no cycles are lost for any other reason, what is the CPI of the program run?

2.7 What to Do, Part VI

The data hazard penalties can be reduced by delaying the branch test to the EX stage, at the cost of increasing the misprediction penalties.
For the program run of Part V, recalculate the total number of cycles lost due to branch instructions, and recalculate the CPI, to account for the design change. If you have to make any assumptions, state clearly what they are.

2.8 What to Hand In

Part I: A diagram showing why the misprediction penalty is two cycles.
Part II: An answer to the question about GoodTarget.
Part III: Detailed outline of events in Cycles 2 and 3; another diagram showing why the misprediction penalty in this case is also two cycles.
Part IV: Diagram, and a statement of the number of stall cycles required.
Parts V and VI: Clearly explained calculations for numbers of stall cycles and CPIs.

3 Exercise B: Cost of floating-point division

3.1 Read This First

As mentioned in a lecture, division operations tend to have much more latency than addition, subtraction, or multiplication. This exercise will briefly look at that idea by running some code in the ICT 320 lab.

3.2 What to Do

1. Make sure you are using a machine with an E4600 Core 2 Duo processor.
2. Copy the file scaleArray.c, and get copies of ts_funcs.c and ts_funcs.h, which were used in previous assignments.
3. Build an executable called scaleArray:
       gcc -O2 scaleArray.c ts_funcs.c -o scaleArray -lrt
   Run the program several times with the command
       ./scaleArray 1.1
   Use the results to get an estimate for the running time of the scale_array function for the given input of 1.1. Let’s call this running time the base experiment result.
4. Division hardware is usually sophisticated enough to work faster than usual for easy special cases such as division by one or by infinity.
   (a) Run the program several times with the command
           ./scaleArray 1.0
       What is the speedup relative to the base experiment result?
   (b) The largest finite 64-bit IEEE 754 FP value is about 1.797 × 10^308.
       Run the program several times with the command
           ./scaleArray 2e308
       That will have the effect of passing +∞ as the scale argument to scale_array. What is the speedup relative to the base experiment result?
5. With exact real arithmetic, division by x is the same as multiplication by 1.0/x. But with FP arithmetic the value of (1.0 / x) * y may differ from that of y / x, due to rounding. So an optimizing compiler is not free to replace FP division with FP multiplication, unless the compiler is told in some way that it is allowed to let FP results differ slightly from what the source code is asking for.
   Build an executable called scaleArray-fm:
       gcc -O2 -ffast-math scaleArray.c ts_funcs.c -o scaleArray-fm -lrt
   Run the program several times with the command
       ./scaleArray-fm 1.1
   What is the speedup relative to the base experiment result?
6. Of course, it is possible to edit the source code to improve speed. Make a copy of scaleArray.c called scaleArrayMult.c. In the new file, replace
       a[i] = a[i] / scale;
   with
       a[i] = a[i] * (1.0 / scale);
   Build an executable called scaleArrayMult:
       gcc -O2 scaleArrayMult.c ts_funcs.c -o scaleArrayMult -lrt
   Run the program several times with 1.1 as input, and calculate the speedup relative to the base experiment result.
7. The change in the previous part might seem to make things worse instead of better, because now the loop contains both an FP division and an FP multiplication. Explain why that is not the case.
8. Suppose the speedup in Part 6 is 5.0. (It won’t be; that is just an example.) Explain briefly why it would be wrong to conclude from that speedup number that the latency for FP division must be 5 times the latency for FP multiplication.
9. Denormalized numbers are nonzero FP numbers with magnitudes too small to represent in the usual FP format. In the IEEE 754 64-bit system, the best approximation to 1.0 × 10^−308 is a denormalized number.
   Run your scaleArrayMult executable this way:
       ./scaleArrayMult 1e308
   so that the value of 1.0 / scale will be approximately 1.0 × 10^−308. Calculate the “speedup” relative to the base experiment result. Suggest a reason why multiplication with a denormalized input is so much slower than “normal” FP multiplication.

3.3 What to Hand In

Answers for Parts 3 to 9.

4 Exercise C: Multiple cycles in the EX stage

4.1 Read This First

Figure 4: 5-stage latency for FP multiplication, 3-stage latency for FP addition/subtraction, and 2-stage latency for data memory access steps D1 and D2. (D1 might be needed, say, for an ISA in which loads and stores have fancy addressing modes.)
[Figure 4 diagram not reproduced. Labels in the figure: IF, ID, EX, M1 to M5, A1 to A3, D1, D2 and WB.]

Figure 4 is a variation on Figure C.35 in the course textbook. The latencies in Figure 4 are shorter than the textbook latencies for FP arithmetic, but data memory access takes two cycles.

4.2 What to Do

1. Determine the total number of stall cycles lost in the sequence below if forwarding is done at the earliest possible time.
       L.D     F2, (R4)
       ADD.D   F4, F4, F2
   Assume that the only hazard is the obvious one related to the F2 register. Hint: Draw a diagram, and realize that the A1 step of the ADD.D can’t start until the D2 step of the L.D is over.
2. Repeat the above for this sequence, which computes a sum of squares:
       MUL.D   F6, F2, F2
       MUL.D   F8, F4, F4
       ADD.D   F10, F6, F8
   Assume that there are no hazards related to F2 or F4.

4.3 What to Hand In

Diagrams and clear explanations regarding the number of stall cycles needed in each case.
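
A side note on Exercise B, Part 5, which states that (1.0 / x) * y may differ from y / x due to rounding. The following is a minimal sketch of one way to observe that, not part of scaleArray.c. The function names count_rounding_mismatches and report_mismatches are hypothetical, and the sketch assumes IEEE 754 double arithmetic and a build without -ffast-math.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not taken from scaleArray.c.
 * Counts the integers y in 1..n for which y / x and y * (1.0 / x)
 * produce different doubles. Assumes IEEE 754 double arithmetic
 * and a build without -ffast-math. */
int count_rounding_mismatches(double x, int n)
{
    double recip;
    int i;
    int mismatches = 0;

    assert(n >= 0);
    recip = 1.0 / x;                     /* rounded once here */
    for (i = 1; i <= n; i++) {
        double y = (double)i;
        double by_division = y / x;      /* a single rounding */
        double by_multiply = y * recip;  /* a second rounding on top of recip */
        if (by_division != by_multiply)
            mismatches++;
    }
    return mismatches;
}

/* Hypothetical helper: prints the mismatch count for one divisor. */
void report_mismatches(double x, int n)
{
    printf("x = %g: %d of %d quotients differ\n",
           x, count_rounding_mismatches(x, n), n);
}
```

Calling report_mismatches from a small main with a few divisors is enough to see the effect. A divisor that is a power of two, such as 2.0, gives no mismatches because its reciprocal is exact. A divisor such as 3.0 does give mismatches, because 1.0 / 3.0 is already rounded before the multiplication is rounded again.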