Lecture 11: Memory Data Flow Techniques
Load/store buffer design, memory-level parallelism, consistency model, memory disambiguation

Load/Store Execution Steps
Load: LW R2, 0(R1)
1. Generate the virtual address; may wait on the base register
2. Translate the virtual address into a physical address
3. Read the data cache
Store: SW R2, 0(R1)
1. Generate the virtual address; may wait on the base register and the data register
2. Translate the virtual address into a physical address
3. Write the data cache
Unlike register accesses, memory addresses are not known prior to execution
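A minimal C sketch of these steps, assuming a hypothetical one-level page table and flat arrays standing in for memory; names such as page_table, exec_load, and exec_store are illustrative, not part of any real design.

#include <stdint.h>

#define PAGE_BITS 12                      /* 4 KB pages (assumption) */

static uint32_t regs[32];                 /* architectural registers */
static uint32_t page_table[1u << 20];     /* VPN -> PPN, hypothetical */
static uint32_t mem[1u << 20];            /* word-addressed data store */

/* LW rd, offset(base): the three load steps. */
static uint32_t exec_load(int rd, int base, int32_t offset) {
    uint32_t va = regs[base] + (uint32_t)offset;               /* 1. generate virtual address */
    uint32_t pa = (page_table[va >> PAGE_BITS] << PAGE_BITS)   /* 2. translate to physical */
                | (va & ((1u << PAGE_BITS) - 1));
    regs[rd] = mem[(pa >> 2) & ((1u << 20) - 1)];              /* 3. read the data cache */
    return regs[rd];
}

/* SW rs, offset(base): same address steps, but the store also waits on the data register. */
static void exec_store(int rs, int base, int32_t offset) {
    uint32_t va = regs[base] + (uint32_t)offset;               /* 1. generate virtual address */
    uint32_t pa = (page_table[va >> PAGE_BITS] << PAGE_BITS)   /* 2. translate to physical */
                | (va & ((1u << PAGE_BITS) - 1));
    mem[(pa >> 2) & ((1u << 20) - 1)] = regs[rs];              /* 3. write the data cache */
}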
Load/store Buffer in Tomasulo
Loads wait in the load buffer until their address is ready; memory reads are then processed
Stores wait in the store buffer until their address and data are ready; memory writes wait further until the stores are committed
[Figure: IM and Fetch Unit feed Decode/Rename; Regfile, Reorder Buffer, and RS feed FU1 and FU2; the S-buf and L-buf sit between the pipeline and DM]
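The readiness conditions above, as a small C sketch; the entry layouts and field names (addr_ready, data_ready, committed) are assumptions, not the actual buffer format.

#include <stdbool.h>
#include <stdint.h>

struct load_entry  { bool addr_ready; uint32_t addr; };
struct store_entry { bool addr_ready, data_ready, committed; uint32_t addr, data; };

/* A load may read memory as soon as its address is ready. */
static bool load_may_read(const struct load_entry *l) {
    return l->addr_ready;
}

/* A store needs its address and data to be ready, and its memory write
   must additionally wait until the store has committed. */
static bool store_may_write(const struct store_entry *s) {
    return s->addr_ready && s->data_ready && s->committed;
}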
Load/store Unit with Centralized RS
Supports memory-level parallelism
The centralized RS includes part of the load/store buffer in Tomasulo
Loads and stores wait in the RS until they are ready
[Figure: IM and Fetch Unit feed Decode/Rename; Regfile, Reorder Buffer, and RS feed FU1, FU2, and the S-unit and L-unit, which compute addresses; store data waits in the store buffer before reaching the cache]
Memory-level Parallelism
for (i = 0; i < 100; i++)
    A[i] = A[i]*2;
Loop: L.S  F2, 0(R1)
      MULT F2, F2, F4
      SW   F2, 0(R1)
      ADD  R1, R1, 4
      BNE  R1, R3, Loop
(F4 holds 2.0)
[Figure: LW1-LW3 and SW1-SW3 overlapped in time; a significant improvement over sequential reads and writes]

Memory Consistency
Memory contents must be the same as under sequential execution
Must respect RAW, WAW, and WAR dependences
Practical implementations:
1. Reads may proceed out-of-order
2. Writes proceed to memory in program order
3. Reads may bypass earlier writes only if their addresses are different
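Rule 3 amounts to an address check against every older write that is still pending; a minimal C sketch, where the pending_store layout is an assumption:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pending_store { bool addr_known; uint32_t addr; };

/* A read may bypass earlier writes only if every older store address is
   known and differs from the load address (rule 3).  The writes themselves
   still go to memory in program order (rule 2). */
static bool read_may_bypass(uint32_t load_addr,
                            const struct pending_store *older, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (!older[i].addr_known || older[i].addr == load_addr)
            return false;
    return true;
}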
Store Stages in Dynamic Execution
1. Wait in the RS until the base address and store data are available (ready)
2. Move to the store unit for address calculation and address translation
3. Move to the store buffer (finished)
4. Wait for ROB commit (completed)
5. Write to the data cache (retired)
Stores always retire in order, for WAW and WAR dependences
[Figure: the RS feeds the store unit and load unit; finished and completed store-buffer entries sit in front of the D-cache. Source: Shen and Lipasti, page 197]
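A sketch of the five stages as a simple state machine; the stage names follow the slide, while the struct and function names are illustrative.

#include <stdbool.h>

/* The five stages a store passes through. */
enum store_stage {
    ST_IN_RS,       /* 1. waiting for base address and store data */
    ST_IN_UNIT,     /* 2. address calculation and translation */
    ST_FINISHED,    /* 3. sitting in the store buffer */
    ST_COMPLETED,   /* 4. the ROB has committed the store */
    ST_RETIRED      /* 5. data written to the D-cache */
};

struct store {
    enum store_stage stage;
    bool base_ready, data_ready, rob_committed;
};

/* Advance one store by a single stage when its condition is met. */
static void store_step(struct store *s) {
    switch (s->stage) {
    case ST_IN_RS:     if (s->base_ready && s->data_ready) s->stage = ST_IN_UNIT; break;
    case ST_IN_UNIT:   s->stage = ST_FINISHED; break;
    case ST_FINISHED:  if (s->rob_committed) s->stage = ST_COMPLETED; break;
    case ST_COMPLETED: s->stage = ST_RETIRED; /* D-cache write, in order */ break;
    case ST_RETIRED:   break;
    }
}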
Load Bypassing and Memory Disambiguation
To exploit memory-level parallelism, loads have to bypass earlier writes, but this may violate RAW dependences
Dynamic memory disambiguation: dynamic detection of memory dependences
Compare the load address with every older store address
Load Bypassing Implementation
[Figure: the RS issues stores and loads in order to the store unit and load unit; store-buffer entries hold addr and data; the load address is checked against them before the D-cache read]
1. Address calculation
2. Address translation
3. If no match, update the destination register
Associative search for a matching address
Assume in-order execution of loads/stores

Load Forwarding
Load forwarding: if a load address matches an older write address, the data can be forwarded
If a match is found, forward the related data to the destination register (in the ROB)
Multiple matches may exist; the last one wins
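A C sketch of the associative search behind load bypassing and forwarding; the store-buffer entry layout is an assumption, and program order is modeled with sequence numbers. The youngest matching older store supplies the data ("last one wins").

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sb_entry { uint32_t addr, data; uint64_t seq; };   /* finished stores */

/* Search all stores older than the load (smaller seq).  If one with the
   same address exists, forward the data of the youngest such store;
   otherwise the load bypasses the stores and reads the D-cache.
   Returns true when data was forwarded. */
static bool load_forward(uint32_t load_addr, uint64_t load_seq,
                         const struct sb_entry *sb, size_t n, uint32_t *out) {
    bool match = false;
    uint64_t best = 0;
    for (size_t i = 0; i < n; i++) {
        if (sb[i].seq < load_seq && sb[i].addr == load_addr &&
            (!match || sb[i].seq > best)) {
            match = true;
            best  = sb[i].seq;
            *out  = sb[i].data;             /* last (youngest older) match wins */
        }
    }
    return match;                           /* false: no match, read the D-cache */
}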
In-order Issue Limitation
for (i = 0; i < 100; i++)
    A[i] = A[i]/2;
Loop: L.S  F2, 0(R1)
      DIV  F2, F2, F4
      SW   F2, 0(R1)
      ADD  R1, R1, 4
      BNE  R1, R3, Loop
Any store in the RS may block all following loads
When is F2 of the SW available?
When is the next L.S ready?
Assume reasonable FU latencies and pipeline length
Forwarding does not always work if some addresses are unknown

Speculative Load Execution
Loads and stores issue out of order from the RS
No match: predict that the load has no RAW dependence on older stores
Matching happens at completion: a store checks its address against the finished load buffer
If a match is found, the prediction was wrong; flush the pipeline at commit
[Figure: the store unit searches the finished load buffer; the load unit reads the D-cache speculatively]
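A sketch of this speculation in C: loads that execute before all older store addresses are known record themselves in a finished-load buffer, and a store searches that buffer once its address is computed; a match means the prediction was wrong. The buffer layout and names are assumptions.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FLB_SIZE 32

/* Finished-load buffer: addresses of loads that executed speculatively,
   i.e. before all older store addresses were known. */
static struct { bool valid; uint32_t addr; } flb[FLB_SIZE];

static void record_speculative_load(size_t slot, uint32_t addr) {
    flb[slot].valid = true;
    flb[slot].addr  = addr;
}

/* Called when an older store finally computes its address: a matching
   speculative load means a RAW dependence was missed, so the pipeline
   must be flushed (at commit). */
static bool store_detects_violation(uint32_t store_addr) {
    for (size_t i = 0; i < FLB_SIZE; i++)
        if (flb[i].valid && flb[i].addr == store_addr)
            return true;                    /* flush pipeline at commit */
    return false;
}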
Alpha 21264 Pipeline
[Figure: the integer and FP issue queues feed two address ALUs, two integer ALUs, and the FP ALUs; two 80-entry integer register files and a 72-entry FP register file; the address ALUs access the D-TLB, the load queue (L-Q), the store queue (S-Q), and a dual-ported D-cache]

Alpha 21264 Load/Store Queues
32-entry load queue, 32-entry store queue
Load: wait if the LQ head is not completed, then move the LQ head
Store: mark the SQ head as completed, then move the SQ head
[Figure: the LQ and SQ sit between the issue queues (IQ) and the D-cache; at commit the ROB checks whether the instruction is a load or a store and updates the matching queue]
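A minimal C sketch of this commit behavior, assuming simple circular queues; the struct and function names are illustrative.

#include <stdbool.h>

#define QSIZE 32

struct lsq {
    struct { bool completed; } entry[QSIZE];
    unsigned head;                            /* oldest in-flight entry */
};

static struct lsq lq, sq;                     /* 32-entry load and store queues */

/* Committing a load: wait (return false) if the LQ head has not completed,
   otherwise advance the LQ head. */
static bool commit_load(void) {
    if (!lq.entry[lq.head].completed)
        return false;                         /* WAIT */
    lq.head = (lq.head + 1) % QSIZE;
    return true;
}

/* Committing a store: mark the SQ head as completed (its D-cache write may
   now drain), then advance the SQ head. */
static void commit_store(void) {
    sq.entry[sq.head].completed = true;
    sq.head = (sq.head + 1) % QSIZE;
}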
Load Bypassing, Forwarding, and RAW Detection
A load searches the SQ for an older store with a matching address; if a match is found, that store's data is forwarded (load forwarding)
A store searches the LQ for a completed younger load with a matching address; if a match is found, a store-load trap is marked to flush the pipeline (at commit)
[Figure: the SQ match path provides forwarding alongside the D-cache access; the LQ match path raises the store-load trap]
Speculative Memory Disambiguation
[Figure: the fetch PC indexes a 1024-entry, 1-bit stWait table; the renamed instruction carries its stWait bit into the int issue queue]
• When a load is trapped at commit, set the stWait bit in the table, indexed by the load's PC
• When the load is fetched, get its stWait bit from the table
• The load waits in the issue queue until older stores are issued
• The stWait table is cleared periodically
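A sketch of the stWait predictor in C: a 1024-entry, 1-bit table indexed by the load's PC; the PC hashing and the function names are assumptions.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static bool stwait[1024];                     /* 1024 1-bit entries */

static unsigned stwait_index(uint32_t pc) { return (pc >> 2) & 1023; }  /* assumed hash */

/* A load was trapped at commit (a missed RAW dependence): set its bit. */
static void on_load_trap(uint32_t pc)   { stwait[stwait_index(pc)] = true; }

/* At fetch, the load reads its stWait bit; if set, it waits in the issue
   queue until older stores have issued. */
static bool load_must_wait(uint32_t pc) { return stwait[stwait_index(pc)]; }

/* Cleared periodically so stale bits do not throttle loads forever. */
static void stwait_clear(void) { memset(stwait, 0, sizeof stwait); }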
Architectural Memory States
[Figure: completed entries reside in the SQ; committed states propagate down through the L1-Cache, L2-Cache, L3-Cache (optional), Memory, and Disk, Tape, etc.]
Memory request: search the hierarchy from top to bottom
Source: Shen & Lipasti
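A small C sketch of searching the hierarchy from top to bottom; the level list and the hit test are purely illustrative.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct level {
    const char *name;                              /* "SQ", "L1", "L2", "L3", "Memory", ... */
    bool (*lookup)(uint32_t addr, uint32_t *data); /* returns true on a hit */
};

/* Probe each level in order; the first hit supplies the data, and lower
   levels (down to disk) are only consulted on a miss. */
static bool memory_request(uint32_t addr, uint32_t *data,
                           const struct level *levels, size_t nlevels) {
    for (size_t i = 0; i < nlevels; i++)
        if (levels[i].lookup(addr, data))
            return true;
    return false;
}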
Summary of Superscalar Execution
Instruction flow techniques: branch prediction, branch target prediction, and instruction prefetch
Register data flow techniques: register renaming, instruction scheduling, in-order commit, mis-prediction recovery
Memory data flow techniques: load/store units, memory consistency