
Computer System Architecture
Final Examination Sample Problems
May 12, 1999
Professor Arvind
Name: _______________________
Remember to write your name on every page!!!
This is an open book, open notes exam.
180 Minutes
21 Pages
New Questions: 3-7, Part 4
Question 1 (2 parts):   ________ 15 Points
Question 2 (1 part):    ________ 24 Points
Question 3 (4 parts):   ________ 26 Points
Question 4 (3 parts):   ________ 15 Points
Question 5 (4 parts):   ________ 29 Points
Question 6 (3 parts):   ________ 26 Points
Question 7 (3 parts):   ________ 25 Points
Total:                  ________ 160 Points
Question 1. Virtual Memory
Consider a byte-addressed system with 32-bit virtual and 24-bit physical addresses and 4096-byte pages.
[Figure: 32-bit virtual address, bits 31 … 0]
Question 1.1 (5 points)
How large can a direct-mapped, physically addressed cache be in a design where the cache
and TLB are accessed in parallel?
Question 1.2 (10 points)
The initial contents of the TLB and page table are shown below:
TLB:

  VPN     PPN   V   D
  00020   017   1   0
  00046   089   1   0

Page Table:

  VPN     PPN   V   D
  00020   017   1   0
  00021   083   0   0
  00022   022   1   0
  ….
  00046   089   1   0
  00047   054   0   0
  00048   035   0   0
  00049   073   1   0
  00050   054   1   0
  ….
Note: All addresses are shown in hexadecimal. The virtual page number is specified by the high
order bits of the virtual address. For example, given the virtual address 0x64f3c, the
corresponding VPN would be 00064.
Suppose you are given the following code:
          Address    Instruction
line 1    0x20fec    addi R1, R0, 0x46000
line 2    0x20ff0    lw   R2, 0x0(R1)
line 3    0x20ff4    lw   R3, 0x1000(R1)
line 4    0x20ff8    lw   R4, 0x1100(R1)
line 5    0x20ffc    lw   R5, 0x3000(R1)
line 6    0x21000    lw   R6, 0x3100(R1)
line 7    0x21040    sw   R5, 0x3000(R1)
          ….
Identify the lines that cause TLB misses.
Identify the lines that cause page faults.
Question 2. Analysis of Two-level Caches
buf is an R-byte character array. The inner loop in the following program fetches, in order, the
characters whose position is a multiple of S in buf. The outer loop repeats the process
indefinitely.
char buf[R];

while (true) {
    i = 0;
    while (i < R) {
        dummy = buf[i];    /* time this fetch */
        i = i + S;
    }
}
The program is executed on a byte-addressable machine where a character is a byte. For a range
of different R’s and S’s, we measured the average latency (in clock ticks) of the character fetch
in the inner loop. The results are tabulated below:
[Table: measured average latency, in clock ticks, of the inner-loop character fetch. Rows give R (powers of two up to 2^23); columns give S (powers of two up to 2^10). The entries take the values 1, 1.5, 2, 2.1, 3.2, 3.3, 5.5, 5.7, 10, 20, 30, and 50.]
Question 2.1 (24 points)
You are informed that the computer has two levels of caches and does not support virtual address translation (i.e., programs use physical addresses directly). L2 is inclusive of L1. The caches use LRU replacement where applicable. Based on the table, deduce the cache size, the block size (a.k.a. cacheline size), and the associativity of L1 and L2. If you do not have enough information, give the tightest bound possible. Support each answer with a brief explanation. (Simply identifying a collection of rows or columns is not acceptable.)
Cache   Parameter        Value   Explanation
L1      Cache Size
        Block Size
        Associativity
L2      Cache Size
        Block Size
        Associativity
Question 3. Branch Prediction
In this problem, we will examine two branch prediction schemes and compare their relative
performances.
Specification of the Benchmark
To analyze the performance of these branch prediction schemes, let's consider the following code
sequence. (Assume i_max and j_max are both greater than or equal to one.)
for (i = 1; i <= i_max; i++) {
    for (j = 1; j <= j_max; j++) {
        <body statements here>
    }
}
Suppose the above code is compiled to the following assembly code sequence.
        ADDI Ri, R0, _i_max     ; Ri <- i_max
_outer: ADDI Rj, R0, _j_max     ; Rj <- j_max
_inner: <body statements here>
        .
        .
        .
        ADDI Rj, Rj, #-1        ; Rj <- Rj - 1
Bj:     BNEZ Rj, _inner         ; branch if (Rj ≠ 0)
        ADDI Ri, Ri, #-1        ; Ri <- Ri - 1
Bi:     BNEZ Ri, _outer         ; branch if (Ri ≠ 0)
The actual branch outcomes for this code exhibit the following pattern:
Bj is taken j_max-1 times, then Bj is not taken once, then Bi is taken once. The same pattern
is then repeated i_max-1 more times, except Bi is not taken for the last repetition, when the
program terminates.
For example, for i_max=4, j_max=3, the pattern is 1101 1101 1101 1100 (1 = taken, 0 = not
taken).
Two-bit Saturation Counter Prediction Scheme
[Figure: the low-order bits of the branch PC index a table of 2-bit counters; the indexed counter supplies the prediction.]

We have studied this scheme in lecture (Slide L15-16) and in the practice final exam ('97, Question 3.1). As shown in the figure above, the low-order bits of the branch address are used as an index into a table of two-bit counters. The content of these counters is the same as the BP bits in a branch target buffer (BTB).

A two-bit counter encodes four states, and is updated as shown below. Note that the branch prediction is equal to the high-order bit of the counter (1 = taken, 0 = not taken).
current    prediction   next state,       next state,
state                   actually taken    actually not taken
  00       not taken        01                00
  01       not taken        11                00
  10       taken            11                00
  11       taken            11                10

[Figure: state-transition diagram of the two-bit counter, equivalent to the table above.]
Question 3.1 (6 points)
Assuming all the counters are initialized to weakly-taken (state 10). Give the number of
mispredictions (in terms of i_max and j_max, if applicable) and circle the final states of the
counters for the branches Bi and Bj when the above code sequence is executed using the
two-bit saturation counter prediction scheme.
Bi: _________ mispredictions, final state of counter is:  00  01  10  11  (circle one).
Bj: _________ mispredictions, final state of counter is:  00  01  10  11  (circle one).
N-bit Global History Correlating Prediction Scheme
[Figure: an N-bit global history register indexes a table of 2^N 2-bit counters; the indexed counter supplies the prediction.]
In this scheme, an N-bit global history register is used to store the outcomes of the last N branch
resolutions. This history register is used as an index into the counter table (with 2^N 2-bit
counters), from which the branch prediction is taken. The two-bit counters are updated in the
same manner as in the two-bit saturation counter prediction scheme.
Each time a branch outcome is resolved, the global history register is also updated as follows: all
bits are left-shifted by one, and the right-most bit is updated with the most recent branch
outcome (1 = taken, 0 = not taken).
Question 3.2 (8 points)
Assuming the global history register is initialized to all zeros and all the counters are initialized
to weakly-taken (state 10). For i_max=100, j_max=2, and N=3, fill in the states of the table
of two-bit counters when Ri=17 , Rj=1, and the PC is in the loop body part of the
benchmark assembly code.
Global History Register Bits    2-bit Counter Bits
(oldest … newest)
000                             ________
001                             ________
010                             ________
011                             ________
100                             ________
101                             ________
110                             ________
111                             ________
Question 3.3 (6 points)
Assuming all the counters are initialized to weakly taken (state 10), give the number of
mispredictions and circle the final states of the counters for the branches Bi and Bj when
the above code sequence is executed using the N-bit global history correlating
prediction scheme.
Bi: _________ mispredictions, final state of counter is:  00  01  10  11  (circle one).
Bj: _________ mispredictions, final state of counter is:  00  01  10  11  (circle one).
Question 3.4 (6 points)
How well does this scheme do for larger values of j_max (i.e., when the number of
iterations of the inner loop increases)? Explain your answer by giving a rough estimate of
the number of mispredictions (in terms of i_max and j_max, if appropriate).
Question 4. Cache Coherence
After examining the rules for cache coherence given in (L22-12-14), Ben Bitdiddle comes up
with the following rule, which he thinks will improve the performance of his protocol.
Pushout Rule (Child to Parent)
<id, Cell(a, u, (Ex, W(id_k))) | m>, <id_k, Cell(a, v, (Ex, R(dir))) | m_k>
  →
<id, Cell(a, v, (Ex, R(ε))) | m>, <id_k, m_k>
Question 4.1 (5 points)
Show that the Pushout Rule is correct because its behavior can be simulated by the other
caching rules.
Question 4.2 (5 points)
Give a scenario which shows that the Pushout rule can do better than the rules given in the
class.
Question 4.3 (5 points)
Suppose we replace the Writeback Rule (L22-14) with the Pushout Rule. Can the new set
of rules still show all the behaviors that the original set of rules could show? Argue by
showing that the Writeback rule can be simulated by the Pushout Rule.
_____________________________________________________________________________
(Do not write below this line)
Question 5. Superscalar AX
This problem explores the issues in building a superscalar AX from the pipelined AX described
in Lecture 13.
Question 5.1 (2 points)
Suppose the following instructions are in the pipeline of the pipelined AX, all of which have
already passed the decode (ID) stage but none of which has completed the write-back (WB)
stage:
I1: ADD R1, R2, R3    ; Regs[R1] <- Regs[R2] + Regs[R3]
I2: LW  R4, 0(R3)     ; Regs[R4] <- Mem[0 + Regs[R3]]
I3: ADD R5, R6, R7    ; Regs[R5] <- Regs[R6] + Regs[R7]
Suppose the following instruction is in the fetch (IF) stage:
I4: ADD R8, R4, R5    ; Regs[R8] <- Regs[R4] + Regs[R5]
Can we dispatch instruction I4 into the decode stage? Explain.
_____________________________________________________________________________
(Do not write below this line)
Recall from Lecture Slide L13-11, the Op decode rule for the pipelined AX is given by:
Op decode rule
Proc((ia, rf, IB(sia, r:=Op(r1, r2));bsD, bsE, bsM, bsW), im, dm)
  if  r1 ∉ Dest(bsE) and r2 ∉ Dest(bsE)
  and r1 ∉ Dest(bsM) and r2 ∉ Dest(bsM)
  and r1 ∉ Dest(bsW) and r2 ∉ Dest(bsW)
  →
Proc((ia, rf, bsD, bsE;ITB(sia, r:=Op(rf[r1], rf[r2])), bsM, bsW), im, dm)
To simplify the boolean expression in the predicate, let’s define the following notations:
Sources(inst)        ≡ source register(s) of instruction inst
Dest(inst)           ≡ destination register of instruction inst (or instruction template, if any)
Dest(bs)             ≡ the union of all destination registers of all instructions in buffer bs
Dests(a1,a2,...,an)  ≡ Dest(a1) ∪ Dest(a2) ∪ ... ∪ Dest(an)
With these new definitions, the decode stage rules can be combined into one rule:
Decode rule
Proc((ia, rf, IB(sia, inst1);bsD, bsE, bsM, bsW), im, dm)
  if for all s ∈ Sources(inst1), s ∉ Dests(bsE, bsM, bsW)
  →
Proc((ia, rf, bsD, bsE;ITB(sia, it1), bsM, bsW), im, dm)
where it1 is the instruction template for instruction inst1
(i.e., the source registers of inst1 have been replaced by the appropriate
values from the register file)
Now, suppose we extend the pipelined AX to support the dispatching of two instructions at a
time. For this new machine, which we will call AX2, at every clock cycle:
(1) Two instructions are fetched from the instruction memory;
(2) Up to two instructions can propagate to the next stage in the pipeline.
Here, we assume the additional hardware (e.g., extra read/write ports for the register file and
memory, another ALU, additional data paths and muxes) necessary for implementing AX2 is
available.
Question 5.2 (2 points)
Ben Bitdiddle is writing a new set of TRS rules for the AX2. He begins with the fetch stage rule:
Fetch stage rule
Proc((ia, rf, bsD, bsE, bsM, bsW), im, dm)
  →
Proc((ia+1, rf, bsD;IB(ia, inst1);IB(ia, inst2), bsE, bsM, bsW), im, dm)
where inst1 = im[ia], inst2 = im[ia+1]
Circle and correct the one mistake Ben made in writing the fetch stage rule for AX2.
Question 5.3 (10 points)
There are two decode stage rules for AX2: one rule for dispatching two instructions to the
execute stage; another rule for dispatching only one instruction to the execute stage. Complete
the two decode stage rules by providing the predicates and filling in the blanks in the terms
to the right of the arrows.
Decode stage rules
Proc((ia, rf, IB(sia,inst1);IB(sia+1,inst2);bsD, bsE, bsM, bsW), im, dm)
  if ____________________________________________________
  →
Proc((____________________________________________________), im, dm)
where it1 and it2 are the instruction templates for instructions inst1 and
inst2, respectively.

Proc((ia, rf, IB(sia,inst1);IB(sia+1,inst2);bsD, bsE, bsM, bsW), im, dm)
  if ____________________________________________________
  →
Proc((____________________________________________________), im, dm)
where it1 and it2 are the instruction templates for instructions inst1 and
inst2, respectively.
Question 5.4 (15 points)
There are six execute stage rules, depending on the types of instructions to be executed. Ben
wrote one of the execute stage rules:
Execute stage rules
Proc((ia, rf, bsD, ITB(sia, it1);ITB(sia+1, it2);bsE, bsM, bsW), im, dm)
  if it1 ≠ r:=Jz(-,-) and it2 ≠ r:=Jz(-,-)
  →
Proc((ia, rf, bsD, bsE, bsM;ITB(sia, it1*);ITB(sia+1, it2*), bsW), im, dm)
where it1* and it2* are executed versions of it1 and it2, respectively

Complete the remaining five execute stage rules by filling in the blanks in the terms to the
right of the arrows. Note that the terms (but not the predicates) to the left of the arrows are
identical for all six execute stage rules. For clarity, only the predicates are provided below.
  if it1 = r:=Jz(0,nia)
  →
Proc((____________________________________________________), im, dm)

  if it1 = r:=Jz(v,-), v ≠ 0 and it2 = r:=Jz(0,nia)
  →
Proc((____________________________________________________), im, dm)

  if it1 = r:=Jz(v,-), v ≠ 0 and it2 = r:=Jz(v,-), v ≠ 0
  →
Proc((____________________________________________________), im, dm)

  if it1 ≠ r:=Jz(-,-) and it2 = r:=Jz(0,nia)
  →
Proc((____________________________________________________), im, dm)

  if it1 ≠ r:=Jz(-,-) and it2 = r:=Jz(v,-), v ≠ 0
  →
Proc((____________________________________________________), im, dm)
Question 6. Sequential Consistency and Out-of-order Execution
Consider the following parallel program for two processors.
Processor 1          Processor 2
Store α, 10          Store γ, 100
R1 ← Load β          R1 ← Load α
R2 ← Load γ          Store β, R1
R3 ← R1 + R2

α, β, and γ are three distinct addresses. Initially mem[α] = mem[β] = mem[γ] = 0.
Question 6.1 (8 points)
Suppose Processor 1 and Processor 2 are the speculative, out-of-order processors (Ps) described
in (L15-7~12). Assume the processors have no caches and their data memories (dm) are
shared. The Ps rules speculate and reorder the execution of instructions other than Loads and Stores,
which are dispatched from the ROB in order.

Ps-Load Rule:
Proc((ia, rf, ROB(t, ia, r:=Load(a));rob, btb), im, dm)
  →
Proc((ia, rf, ROB(t, ia, r:=dm[a]);rob, btb), im, dm)

Ps-Store Rule:
Proc((ia, rf, ROB(t, ia, Store(a, v));rob, btb), im, dm)
  →
Proc((ia, rf, rob, btb), im, dm[a:=v])

These rules can ensure sequential consistency in a system without caches.
What are the possible values in R3 of Processor 1 at the end of an execution?
Question 6.2 (8 points)
PSR is identical to Ps except in its Load and Store dispatch rules, which are given below.

PSR-Load Rule:
Proc((ia, rf, rob1;ROB(t, ia, r:=Load(a));rob2, btb), im, dm)
  if Store(a,-) ∉ rob1 and Store(t’,-) ∉ rob1
  →
Proc((ia, rf, rob1;ROB(t, ia, r:=dm[a]);rob2, btb), im, dm)

PSR-Store Rule:
Proc((ia, rf, rob1;ROB(t, ia, Store(a, v));rob2, btb), im, dm)
  if Store(a,-) ∉ rob1 and Store(t’,-) ∉ rob1
  and -:=Load(a) ∉ rob1 and -:=Load(t’’) ∉ rob1
  →
Proc((ia, rf, rob1;rob2, btb), im, dm[a:=v])
Give a value for R3 of Processor 1 at the end of an execution that is allowed by PSR but not
by Ps, and number the instructions below from 1 to 7 to indicate an execution order that
would lead to your answer R3 = __________________.

Order   Processor 1          Order   Processor 2
_____   Store α, 10          _____   Store γ, 100
_____   R1 ← Load β          _____   R1 ← Load α
_____   R2 ← Load γ          _____   Store β, R1
_____   R3 ← R1 + R2
Question 6.3 (10 points)
A memory-barrier instruction is introduced in PSRB to restore sequential consistency. PSRB is
identical to PSR except in its Load and Store dispatch rules and the mem-barrier dispatch rule.

PSRB-Load Rule:
Proc((ia, rf, rob1;ROB(t, ia, r:=Load(a));rob2, btb), im, dm)
  if Store(a,-) ∉ rob1 and Store(t’,-) ∉ rob1 and mem-barrier ∉ rob1
  →
Proc((ia, rf, rob1;ROB(t, ia, r:=dm[a]);rob2, btb), im, dm)

PSRB-Store Rule:
Proc((ia, rf, rob1;ROB(t, ia, Store(a, v));rob2, btb), im, dm)
  if Store(a,-) ∉ rob1 and Store(t’,-) ∉ rob1
  and -:=Load(a) ∉ rob1 and -:=Load(t’’) ∉ rob1 and mem-barrier ∉ rob1
  →
Proc((ia, rf, rob1;rob2, btb), im, dm[a:=v])

PSRB-Mem-Barrier Rule:
Proc((ia, rf, ROB(t, ia, mem-barrier);rob, btb), im, dm)
  →
Proc((ia, rf, rob, btb), im, dm)
Memory barriers can be inserted in a program to make its behavior sequentially consistent, i.e.,
the same as Ps, as shown below:

Processor 1          Processor 2
Store α, 10          Store γ, 100
mem-barrier          mem-barrier
R1 ← Load β          R1 ← Load α
mem-barrier          mem-barrier
R2 ← Load γ          Store β, R1
mem-barrier
R3 ← R1 + R2

Cross out the extra mem-barrier instructions that are not necessary to guarantee
sequential consistency for this particular program.
Question 7. Atomic Operations and Cache Coherence Too . . .
Consider the function below to increment a counter.
Increment(int *counter) {
    R = M[counter];
    R = R + 1;
    M[counter] = R;
}
If multiple processes could increment the same counter simultaneously, reading and updating
the counter must be performed in an atomic manner. In this problem, you are asked to
implement the function using different atomic operations. You may use a mixture of pseudo-code and DLX assembly. You can assume the existence of temporary registers.
Assuming your implementation is intended for a cache-coherent multiprocessor system, make
your implementation as efficient as possible in terms of memory and cache subsystem
operations.
Question 7.1 (10 Points)
Give an implementation of Increment() using the Swap instruction (L20-19).
Swap(m, R):
    Rt ← M[m];
    M[m] ← (R);
    R ← (Rt);
Question 7.2 (10 points)
Give an implementation of Increment() using the Compare&Swap Instruction (L20-21).
Compare&Swap(m, Rt, Rs):
    if (Rt == M[m]) then
        M[m] = Rs;
        Rs = Rt;
        status ← success;
    else
        status ← fail;
Question 7.3 (5 Points)
In terms of memory and cache subsystem operations, give an advantage or a disadvantage of
the Compare&Swap atomic instruction relative to the Load-reserve/Store-conditional
combination (L20-22).
Load-reserve(m, R):
    <reserve, address> ← <1, m>;
    R ← M[m];

Store-conditional(m, R):
    if <reserve, address> = <1, m>
    then cancel other processors’ reservations on m;
         M[m] ← (R); status ← succeed;
    else status ← fail;
Part 4: Scheduling an Irregular Instruction Pipeline
TIPS Inc. hires you to add a multiply instruction to their integer DLX2000. The original
DLX2000 only supports a subset of memory (LW, SW only) and ALU/ALUi (ADD, ADDI,
SUB, SUBI, etc.) instructions. You can ignore branch/jump for this part of the quiz. Their 5-stage pipeline is similar to what was presented in Lectures 9 and 10. Operands to all instructions
are required at the beginning of the E stage. The result of an ALU instruction is available by the
end of the E stage. The result of LW is available by the end of the M stage. Unless stalled, all
instructions must follow the same execution template, even though some stages may be idle for
some instructions.
        t    t+1  t+2  t+3  t+4
Inst    F    D    E    M    W
The pipeline is fully bypassed. The F and D stages stall only under the following condition (the
right-hand-side signals are defined in L10-10):

Stall_original = { opcodeE == LW } and
                 { [ (wsE == rf1D) and re1D ] or
                   [ (wsE == rf2D) and re2D ] }
IMUL performs half-length integer multiplication (it only computes a 32-bit product) on 2 source
registers and 1 destination register in the GPR file.

IMUL: Regs[Rf3] ← Regs[Rf1] × Regs[Rf2]
The integer multiply unit (IMU) is separate from the main ALU. The IMU has 3 stages and is
pipelined to accept a new multiplication on each cycle. The multiplicands are required at the
beginning of IMU1 and the product is available at the end of IMU3. The template for IMUL is:

        t    t+1  t+2   t+3   t+4   t+5
IMUL    F    D    IMU1  IMU2  IMU3  W
Question 13 (6 points):
TIPS requires you to implement an in-order issue and in-order completion design. Briefly
describe any new pipeline hazards introduced by the addition of IMUL and IMU to
DLX2000.
Data Hazard:
Structural Hazard:
Question 14 (6 points):
Assuming all feasible data bypasses to the D stage have been provided, specify the
necessary conditions for stalling the F and D stages to resolve the remaining hazards. (For
questions 14 and 15, you must assume none of the other stages can be stalled.) Your
equation can make use of weX, wsX, re1X, re2X, rf1X, rf2X, and opcodeX from stages D,
E, M, W, IMU1, IMU2, and IMU3. (The signals from L10-10 are extended for IMUL below.)
The opcode of an unoccupied stage is NOP. Your answer can also make use of Stall_original from
the original DLX2000 pipeline without the IMU.
wsX  = Case opcodeX
         ALU, IMUL   ⇒ rf3
         ALUi, LW    ⇒ rf2
         JAL, JALR   ⇒ 31

weX  = Case opcodeX
         ALU, ALUi, LW, IMUL, JAL, JALR ⇒ (wsX ≠ r0)
         ...         ⇒ off

re1X = Case opcodeX
         ALU, ALUi, LW, SW, IMUL, BZ, JR, JALR ⇒ on
         J, JAL      ⇒ off

re2X = Case opcodeX
         ALU, SW, IMUL ⇒ on
         ...         ⇒ off

Stall_new =
Question 15 (5 points): This is the last question of the exam.
Complete the pipeline resource diagram for the following instruction sequence. You need to
minimize the number of stalls. The first instruction has been filled in for you.
I1: LW   0(r1), r2     ;; r2 ← M[(r1)]
I2: IMUL r2, r3, r4    ;; r4 ← (r2) × (r3)
I3: IMUL r1, r5, r1    ;; r1 ← (r1) × (r5)
I4: ADD  r4, r5, r6    ;; r6 ← (r4) + (r5)
I5: IMUL r7, r6, r8    ;; r8 ← (r7) × (r6)
I6: ADD  r2, r3, r9    ;; r9 ← (r2) + (r3)

       0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17
F      I1
D          I1
E              I1
M                  I1
W                      I1
IMU1
IMU2
IMU3