
Introduction
Snooping
Directory
Conclusion
[Figure: memory holds blocks 1-5 with values A-E; Caches 1-4 each hold copies of some blocks, and every cached copy matches memory]
[Figure: Cache 2 now holds block 3 with a new value F while memory and the other caches still hold the old value C: the copies are incoherent]
[Figure: as the write propagates, some caches hold the new value F for block 3 while at least one cache still holds the stale value C]
[Figure: coherence controllers. A cache controller sits between its core's cache and the interconnection network: it serves loads and stores from the core, returns loaded values, and exchanges issued and received coherence requests and responses with the network. A memory controller connects the memory to the interconnection network and exchanges the same coherence requests and responses, but has no core side.]
The goal of a coherence protocol is to maintain coherence by enforcing the SWMR
invariant:
Single-Writer, Multiple-Reader (SWMR) invariant: for any memory location "A", at any
given time, either there is a single core that may write to A (and may also read it), or
there are some number of cores that may read it.
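The invariant can be phrased as a predicate over per-core permissions. The sketch below is a hypothetical illustration (the helper name and the "W"/"R"/None permission encoding are our own, not part of any protocol):

```python
# Hypothetical sketch: check the SWMR invariant for one memory location,
# given each core's current permission: "W" (write), "R" (read), or None.
def swmr_holds(permissions):
    """True iff at most one core may write, and a writer excludes readers."""
    writers = [c for c, p in permissions.items() if p == "W"]
    readers = [c for c, p in permissions.items() if p == "R"]
    if len(writers) > 1:
        return False   # two simultaneous writers are forbidden
    if writers and readers:
        return False   # a single writer must be the only accessor
    return True

assert swmr_holds({"c0": "W", "c1": None})            # one writer alone
assert swmr_holds({"c0": "R", "c1": "R", "c2": "R"})  # many readers
assert not swmr_holds({"c0": "W", "c1": "R"})         # writer plus reader
```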
Coherence controllers are finite state machines that implement the SWMR invariant.
Each coherence controller implements a set of finite state machines, one per block.
On the core side, a cache controller:
 interfaces to the processor core;
 receives loads and stores from the core and returns values to the core.
On the network side, a cache controller:
 interfaces to the rest of the system through the interconnection network;
 on a cache miss, issues a coherence request to obtain that block;
 receives either data in response to its own requests, or coherence requests from other controllers on the network.
A memory controller is similar to a cache controller, but it has only the network side.
 The state of a cache block consists of four main elements:
 Valid: a valid block has the most up-to-date value for this block. A valid block may be read, but may be written only if it is also exclusive.
 Dirty: a cache block is dirty if its value is the most up-to-date value and this value differs from the value in memory.
 Exclusive: a cache block is exclusive if it is the only copy of this block among all caches.
 Owner: a cache controller (or memory controller) is the owner of a block if it is responsible for responding to coherence requests for that block. An owned block cannot be evicted without passing ownership to another coherence controller. In most protocols, there is exactly one owner for each block.
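These four elements can be captured as flags on a block, with the read and write rules above derived from them. A hypothetical sketch (the record and field names are our own):

```python
from dataclasses import dataclass

# Hypothetical record of the four state elements described above.
@dataclass
class BlockState:
    valid: bool      # holds the most up-to-date value
    dirty: bool      # value differs from the copy in memory
    exclusive: bool  # the only cached copy among all caches
    owner: bool      # must respond to coherence requests for this block

    def may_read(self):
        return self.valid

    def may_write(self):
        # writable only if the block is both valid and exclusive
        return self.valid and self.exclusive

m = BlockState(valid=True, dirty=True, exclusive=True, owner=True)
s = BlockState(valid=True, dirty=False, exclusive=False, owner=False)
assert m.may_write() and s.may_read() and not s.may_write()
```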
Stable states: most protocols use a subset of the classic five-state MOESI model
(pronounced "MO-zee"). Each state is a different combination of the elements described
in the previous slide.
 Modified (M): valid, exclusive, owned, and potentially dirty. May be read or written. The only valid copy of this block; must respond to requests for this block. The memory copy of this block is potentially stale.
 Shared (S): valid, but not exclusive, not dirty, and not owned. The cache has a read-only copy of this block. Other caches might have valid, read-only copies of the block.
 Invalid (I): the block is invalid. The cache either does not contain the block or holds a stale version of it. It may not be read or written.
 Owned (O): the block is valid, owned, and potentially dirty, but not exclusive. The cache has a read-only copy of this block and must respond to requests for it. The memory copy is potentially stale.
 Exclusive (E): the block is valid, exclusive, and not dirty. The cache has a read-only copy of this block. The memory copy of this block is up to date.
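The five states are just fixed combinations of the four elements. A hypothetical table in code (the dictionary is our own; "dirty" here means potentially dirty for M and O, per the definitions above):

```python
# Hypothetical encoding of the five MOESI stable states as combinations
# of the four state elements; dirty=True means *potentially* dirty.
MOESI = {
    "M": dict(valid=True,  exclusive=True,  owned=True,  dirty=True),
    "O": dict(valid=True,  exclusive=False, owned=True,  dirty=True),
    "E": dict(valid=True,  exclusive=True,  owned=False, dirty=False),
    "S": dict(valid=True,  exclusive=False, owned=False, dirty=False),
    "I": dict(valid=False, exclusive=False, owned=False, dirty=False),
}

def must_respond(state):
    # the owner states (M and O here) respond to coherence requests
    return MOESI[state]["owned"]

assert must_respond("M") and must_respond("O")
assert not must_respond("S") and not must_respond("I")
```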
 Transient states occur during the transition from one stable state to another.
 XYZ: the block is in transition from stable state X to stable state Y, and the transition will not complete until an event of type Z occurs.
 For example, IMD denotes that a block was in the I state and will move to the M state when data (D) is received.
There are two general approaches to naming the states of blocks in memory. The
choice of naming does not affect functionality or performance.
 Cache-centric: the state of a block in memory is an aggregation of the states of the block in the caches.
 For example, if a block is in state I in all caches, the memory state for this block is I. If one or more copies are in S, then the block is in S in memory. If the block is in state M in one cache, it is in M in memory.
 Memory-centric: the state of the block corresponds to the memory controller's permissions for this block.
 For example, if a block is in I in all caches, its memory state will be O, because the memory behaves as its owner. If all copies are in S, the memory state will still be O. If the block is in M or O in one cache, then its memory state will be I, since the memory holds a stale copy.
 To maintain the state of blocks in caches, the most common approach is to add some extra bits to each block; for example, MOESI needs 3 bits to encode the state.
 To maintain the state of blocks in memory, we can use the same approach. Alternatively, we can use logic gates: for example, feeding each cache's OWNED bit into a NOR gate, so that if any input is OWNED = 1, the state of the block in memory is I (output 0).
[Figure: each cache block stores its data alongside a 3-bit state field, e.g. 000 = I, 001 = O, 101 = M; the block's state bits from each cache feed the logic that derives the state of the block in memory]
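The gate-based derivation above can be written in one line. A hypothetical sketch (the function name is our own):

```python
# Hypothetical sketch of the gate-based alternative: the memory acts as
# the block's owner exactly when no cache asserts its OWNED bit, so the
# memory's "owner" signal is the NOR of the caches' OWNED bits.
def memory_is_owner(owned_bits):
    """NOR over the caches' OWNED bits: 1 only when no cache owns it."""
    return int(not any(owned_bits))

assert memory_is_owner([0, 0, 0]) == 1   # no cache owns: memory is owner
assert memory_is_owner([0, 1, 0]) == 0   # a cache owns: memory state is I
```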
 Most protocols have a similar set of transactions, because the basic goals of the coherence controllers are similar.
 Transactions are all initiated by cache controllers responding to requests from their associated cores.

Transaction: Goal
GetShared (GetS): obtain block in Shared (read-only) state.
GetModified (GetM): obtain block in Modified (read-write) state.
Upgrade (Upg): upgrade block state from read-only (Shared or Owned) to read-write (Modified); Upg (unlike GetM) does not require data to be sent to the requestor.
PutShared (PutS): evict block in Shared state.
PutExclusive (PutE): evict block in Exclusive state.
PutOwned (PutO): evict block in Owned state.
PutModified (PutM): evict block in Modified state.
 Events are core requests to their cache controllers.

Event: Response of cache controller
Load: if cache hit, respond with data from cache; else initiate GetS transaction.
Store: if cache hit in state E or M, write data into cache; else initiate GetM or Upg transaction.
Atomic read-modify-write: if cache hit in state E or M, atomically execute read-modify-write semantics; else initiate GetM or Upg transaction.
Instruction fetch: if cache hit (in I-cache), respond with instruction from cache; else initiate GetS transaction.
Read-only prefetch: if cache hit, ignore; else may optionally initiate GetS transaction.
Read-write prefetch: if cache hit in state M, ignore; else may optionally initiate GetM or Upg transaction.
Replacement: depending on state of block, initiate PutS, PutE, PutO, or PutM transaction.
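A few rows of the table above can be sketched as a dispatch function. This is a hypothetical illustration (our own helper, covering only load, store, and replacement):

```python
# Hypothetical dispatch sketch for three rows of the event table:
# which action a cache controller takes for each core event.
def react(event, hit, state=None):
    if event == "load":
        return "data from cache" if hit else "GetS"
    if event == "store":
        if hit and state in ("E", "M"):
            return "write into cache"
        return "GetM or Upg"
    if event == "replacement":
        # the Put transaction depends on the block's current state
        return {"S": "PutS", "E": "PutE", "O": "PutO", "M": "PutM"}[state]
    raise ValueError(event)

assert react("load", hit=False) == "GetS"
assert react("store", hit=True, state="M") == "write into cache"
assert react("replacement", hit=True, state="O") == "PutO"
```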
The other major design decision in a coherence protocol is what to do when a core
writes to a block. There are two options:
 Invalidate protocol: when a core wishes to write to a block, it initiates a coherence transaction to invalidate the copies in all other caches. Thus, if other cores want to read this block afterwards, they need to issue a new request to obtain a fresh copy.
 Update protocol: when a core wishes to write to a block, it initiates a coherence transaction to update the copies in all other caches to reflect the new value it wrote to the block.
 Update protocols reduce read latency, but they use more bandwidth, since their messages are bigger (they carry data as well).
Snooping
Total order: all coherence controllers observe (snoop) coherence requests in the
same order. By requiring that all requests to a given block arrive in order, a
snooping system enables the distributed coherence controllers to correctly
update the finite state machines that collectively represent a cache block's
state.
Snooping protocols broadcast requests to all coherence controllers, including
the controller that initiated the request. The coherence requests typically
travel on an ordered broadcast network, such as a bus.
With a total order (both caches snoop the two GetM requests in the same order):
Time 0: C1 A:I; C2 A:I; Memory A:I, Owner
Time 1: C1 snoops GetM from C1 / M, Owner; C2 snoops GetM from C1 / I; Memory: GetM from C1 / M
Time 2: C1 snoops GetM from C2 / I; C2 snoops GetM from C2 / M, Owner; Memory: GetM from C2 / M

Without a total order (C2 snoops the requests in the opposite order):
Time 0: C1 A:I; C2 A:I; Memory A:I, Owner
Time 1: C1 snoops GetM from C1 / M, Owner; C2 snoops GetM from C2 / M, Owner; Memory: GetM from C1 / M
Time 2: C1 snoops GetM from C2 / I; C2 snoops GetM from C1 / I; Memory: GetM from C2 / M
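The divergence can be reduced to a one-line toy model. This is a hypothetical sketch (our own model, not protocol code): each controller decides who owns block A from the order in which it snooped the GetM requests, last one winning.

```python
# Hypothetical toy model of why the total order matters.
def final_owner_view(observed_order):
    """The core this controller believes ended up with block A in M."""
    return observed_order[-1]

# Total order: every controller snoops [C1, C2], so all agree C2 owns A.
assert final_owner_view(["C1", "C2"]) == "C2"

# Without a total order, C1 snoops [C1, C2] but C2 snoops [C2, C1]:
# C1 believes C2 owns A while C2 believes C1 does, so the distributed
# state machines diverge.
assert final_owner_view(["C1", "C2"]) != final_owner_view(["C2", "C1"])
```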
[Figure: baseline snooping system model. Each core has a cache controller and a private data (L1) cache; an interconnection network connects them to the LLC/directory controller and the last-level cache (LLC) on the multicore processor chip, which in turn connects to main memory.]
The baseline protocol implements two atomicity properties:
 Atomic Requests: a coherence request is ordered in the same cycle that it is issued.
 Atomic Transactions: coherence transactions are atomic, in that a subsequent request for the same block may not appear on the bus until after the first transaction completes (i.e., until after the response has appeared on the bus).
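The Atomic Transactions property can be sketched as a tiny arbiter. This is a hypothetical illustration (the class name and API are our own):

```python
# Hypothetical sketch of the Atomic Transactions property: a request
# for a block is stalled while an earlier transaction on that block
# still awaits its response on the bus.
class AtomicBus:
    def __init__(self):
        self.in_flight = set()         # blocks with a pending response

    def request(self, block):
        if block in self.in_flight:
            return "stall"             # transaction still in progress
        self.in_flight.add(block)
        return "granted"

    def respond(self, block):
        self.in_flight.discard(block)  # the response completes the transaction

bus = AtomicBus()
assert bus.request("A") == "granted"
assert bus.request("A") == "stall"     # second request must wait
bus.respond("A")
assert bus.request("A") == "granted"   # allowed once the response appeared
```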
[Table: baseline MSI snooping protocol, cache controller transitions. States I, ISD, IMD, S, SMD, M; core events load, store, and replacement; bus events for the controller's own transaction, other cores' transactions (GetS, GetM, PutM), and data. In I, a load issues GetS / ISD and a store issues GetM / IMD; transient states stall loads, stores, and evictions until data arrives and is copied into the cache (load hit / S, store hit / M); in M, an observed GetS sends data to the requestor and memory / S, an observed GetM sends data to the requestor / I, and a replacement issues PutM, sending data to memory / I. Entries marked (A) are impossible given the atomicity properties.]

[Table: baseline MSI snooping protocol, memory controller transitions. States IorS, IorSD, M. In IorS, a GetS sends the data block to the requestor / IorS and a GetM sends it to the requestor / M; in M, a GetS moves to IorSD awaiting data from the owner, and a PutM updates the data block in memory / IorS.]
 Small table and few possible states.
 Easy to understand and implement.
 Multiple copies of the same block can coexist thanks to the Shared state.

 Many transitions are impossible due to the atomic-transaction property, and there are many stalls.
 Lower throughput.
 Higher latency.
 Unnecessary broadcast of invalidate messages: when a core wants to write a block, it must obtain the block in state M and send an invalidate message to all other cores, even when it already holds the only copy of that block.
 Trade-off: on a remote request, downgrade from M to S or to I? We would need to predict whether the block is going to be used again.
Implements the Atomic Transactions property, with non-atomic requests.
The Exclusive state is used in almost all commercial coherence protocols because
it optimizes a common case: a core first reads a block and then subsequently
writes it.
 In MSI, a core needs to issue a GetS message to get read permission (in case of a cache miss) and then has to issue a GetM message to get write permission.
 In MESI, a core can obtain the block in the Exclusive state, so that no other cache can access it; the core then does not need to issue a GetM message before writing.
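The saving can be sketched by counting bus transactions for the read-then-write pattern. This is a hypothetical illustration (our own counting helper), assuming that under MESI the first miss is granted E because no other cache holds the block:

```python
# Hypothetical sketch of the common case the E state optimizes: a read
# miss followed by a write, when no other cache holds the block.
def bus_requests(protocol):
    msgs = ["GetS"]                    # the read miss goes to the bus
    granted = "E" if protocol == "MESI" else "S"
    if granted != "E":                 # in MSI the store still needs GetM/Upg
        msgs.append("GetM")
    return msgs                        # E -> M happens silently in MESI

assert bus_requests("MSI") == ["GetS", "GetM"]   # two bus transactions
assert bus_requests("MESI") == ["GetS"]          # one bus transaction
```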
[Table: MESI snooping protocol, cache controller transitions. Stable states I, S, E, M plus transient states ISAD, ISD, IMAD, IMD, SMAD, SMD, MIA, EIA, IIA; core events load, store, and replacement; bus events own/other GetS, GetM, PutM, and data. A store hit in E upgrades silently to M; in E or M, an observed GetS sends data to the requestor and memory / S, and an observed GetM sends data to the requestor / I. Entries marked (A) are impossible given the Atomic Transactions property.]

[Table: MESI snooping protocol, memory controller transitions. States I, S, EorM and transient states ID, SD, EorMD; events GetS, GetM, PutM, Data, NoData, and NoData-E.]
 Silent transition from the Exclusive state to the Modified state; no unnecessary invalidate messages are issued.
 A block can be read and then written after issuing only one request.
 Fewer messages overall.
 Less traffic on the bus, so lower bandwidth usage.
 Extra hardware is needed to implement the Exclusive state.
 When a cache has a block in state M or E and receives a GetS from another core, under the MSI or MESI protocol the cache must
 change the block state from M or E to S, and
 send the data to both the requestor and the memory controller.
This raises the question of how a snooping protocol can minimize accesses to memory, or eliminate:
1. the extra data message to update the LLC/memory when a cache receives a GetS request in the M (or E) state?
2. the potentially unnecessary write to the LLC?
The answer: augment the baseline MSI (Modified, Shared, Invalid) protocol with the Owned state.
 The key difference is what happens when a cache with a block in state M receives a GetS from another core.
 In a MOSI protocol, the cache
 changes the block state to O (instead of S), and
 retains ownership of the block (instead of transferring ownership to the LLC/memory).
 The O state enables the cache to avoid updating the LLC/memory.
 The protocol adds two transient cache states in addition to the stable O state:
 the transient OIA state helps handle replacements of blocks in the O state;
 the transient OMA state handles upgrades back to state M after a store.
[Table: MOSI snooping protocol, cache controller transitions. Stable states I, S, O, M plus transient states ISAD, ISD, IMAD, IMD, SMAD, SMD, OMA, MIA, OIA, IIA; core events load, store, and replacement; bus events OwnGetS, OwnGetM, OwnPutM, OtherGetS, OtherGetM, OtherPutM, and the own data response. In O, a store issues GetM / OMA, a replacement issues PutM / OIA, and observed GetS/GetM requests are answered with data by this cache (the owner) rather than by memory; in M, an observed GetS now sends data to the requestor and moves to O instead of S, with no data sent to memory.]
[Table: MOSI snooping protocol, memory controller transitions. States IorS, MorO and transient states IorSD, MorOD; in IorS, a GetS sends data to the requestor and a GetM sends data to the requestor / MorO; PutM, data-from-owner, and NoData messages resolve the transient states by writing the data back to memory.]
Example 1: MSI uses 6 messages and 20 stalls; MOSI uses 13 messages and 24 stalls.
Example 2: MSI uses 2 messages and 0 stalls; MOSI uses 2 messages and 0 stalls.
 Example walkthrough (MOSI, cycle by cycle; Core 1 loads a block, Core 2 stores to it, then Core 1 loads it again):
 Cycle 1: Core 1's cache controller issues GetS / ISAD.
 Cycle 2: Core 2's cache controller issues GetM / IMAD.
 Cycle 3: C1's request appears on the bus: GetS (C1).
 Cycle 4: C1 moves - / ISD; the memory controller sends data to C1 / IorS.
 Cycle 5: data on the bus: data from LLC/mem.
 Cycle 6: C1 copies the data from LLC/mem / S; C2's request appears on the bus: GetM (C2).
 Cycle 7: C1 moves - / I; the memory controller sends data to C2 / MorO; C2 moves - / IMD.
 Cycle 8: data on the bus: data from LLC/mem.
 Cycle 9: C2 copies the data from LLC/mem / M.
 Cycle 10: C1's cache controller issues GetS / ISAD.
 Cycle 11: C1's request appears on the bus: GetS (C1).
 Cycle 12: C1 moves - / ISD; the memory controller stays - / MorO; C2 sends data to C1 / O.
 Cycle 13: data on the bus: data from C2.
 Cycle 14: C1 copies the data from C2 / S.
The Owner state of a cache block
 supplies the data to another processor instead of having that processor read the data from memory;
 reduces the number of write-backs to main memory;
 comes with moderate implementation complexity.
 When going from the Shared state to the Modified state, the block must pass through the Invalid state.
[Figure: atomic bus. The address bus carries Request 1, then Request 2, then Request 3; each request's response occupies the data bus before the next request may proceed.]
To implement atomic transactions, the simplest way is to use
 a shared-wire bus with
 an atomic bus protocol: all bus transactions consist of an indivisible request-response pair,
 unpipelined (like an unpipelined processor core), with no way to overlap activities that could proceed in parallel.
 Simple, but it sacrifices performance:
 throughput is limited by the sum of the latencies of a request and its response (including any wait cycles between them).
 A pipelined (non-atomic) bus provides responses in the same order as the requests.
[Figure: pipelined bus. Address bus: Request 1, Request 2, Request 3; data bus: Response 1, Response 2, Response 3, in the same order.]
 A split-transaction (non-atomic) bus may provide responses in an order different from the request order.
[Figure: split-transaction bus. Address bus: Request 1, Request 2, Request 3; data bus: Response 2, Response 3, Response 1.]
 Key advantage of a non-atomic bus: not having to wait for a response before a subsequent request can be serialized on the bus.
 The bus can achieve much higher bandwidth using the same set of shared wires.
 The advantage of a split-transaction bus, with respect to a pipelined bus, is that a low-latency response does not have to wait for a long-latency response to a prior request.
 One issue raised by a split-transaction bus is matching responses with requests:
 the response must carry the identity of the request or of the requestor.
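The matching scheme can be sketched with tags. This is a hypothetical illustration (the class and method names are our own): each request receives a tag, and the response echoes the tag back so the requestor can pair them.

```python
import itertools

# Hypothetical sketch of matching out-of-order responses to requests
# on a split-transaction bus.
class SplitBus:
    def __init__(self):
        self._tags = itertools.count()
        self.pending = {}                   # tag -> outstanding request

    def issue(self, request):
        tag = next(self._tags)              # the request carries this tag
        self.pending[tag] = request
        return tag

    def deliver(self, tag, data):
        return self.pending.pop(tag), data  # pair the response with its request

bus = SplitBus()
t0 = bus.issue("GetS A")
t1 = bus.issue("GetM B")
# Responses may arrive in a different order than the requests went out.
assert bus.deliver(t1, "data B") == ("GetM B", "data B")
assert bus.deliver(t0, "data A") == ("GetS A", "data A")
```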
[Figure: split-transaction system model. FIFO queues buffer incoming and outgoing messages at each controller; the memory controller does not have a connection for making its own requests.]
[Table: MSI with a split-transaction bus, cache controller transitions. Transient states now distinguish waiting for the own request to be ordered (A) from waiting for data (D): I, ISAD, ISD, ISA, IMAD, IMD, IMA, S, SMAD, SMD, SMA, M, MIA, IIA. Core events load, store, and replacement; bus events OwnGetS/OwnGetM, OwnPutM, OtherGetS, OtherGetM, OtherPutM, and the own data response. Because transactions are no longer atomic, a block in a transient state can now observe another core's request on the bus between its own request and its data.]
[Table: MSI with a split-transaction bus, memory controller transitions. States IorS, IorSD, IorSA, M. In IorS, a GetS sends data to the requestor and a GetM sends data and sets Owner to the requestor / M; PutM from the owner or from a non-owner, and Data messages, resolve the transient states by writing data to memory.]
Example 1: MSI uses 6 messages and 20 stalls; MSI with a split-transaction bus uses 5 messages and 33 stalls.
Example 2: MSI uses 2 messages and 0 stalls; MSI with a split-transaction bus uses 2 messages and 3 stalls (until data arrives to satisfy the in-flight request).
Stalling has three drawbacks:
1. It sacrifices performance.
2. It raises the potential for deadlock (because of circular chains of stalls).
3. It enables a requestor to observe a response to its request before processing its own request.
 By stalling a request, the protocol
 stalls all requests behind the stalled request, and
 delays those transactions from completing.
How can a coherence controller process the requests behind a stalled one? Process all messages, in order, instead of stalling:
 add transient states that record messages the coherence controller has received but must remember to complete at a later event.
 For example, a cache with a block in state ISD previously stalled instead of processing an Other-GetM for that block.
 Now, if the cache controller observes an Other-GetM on the bus, it changes the block state to ISDI: "in I, going to S, waiting for data, and when the data arrives, will go to I".
[Table: non-stalling MSI with a split-transaction bus, cache controller transitions. Instead of stalling, new transient states record requests observed while awaiting data: ISDI, IMDI, IMDS, IMDSI, SMDI, SMDS, SMDSI. For example, a block in ISD that observes an Other-GetM moves to ISDI and, when its data arrives, performs the load hit and then moves to I; a block in IMD that observes another core's request moves to IMDS or IMDI and, when its data arrives, performs the store hit and sends the data to the observed requestor (and memory, for a GetS).]
 Uses MOESI.
 Non-atomic requests and transactions.
 Supports up to 64 processors.
 Wired snooping buses consume a lot of energy and thus do not scale to large numbers of cores; to solve this problem, the E10000 uses point-to-point links instead.
 Uses a separate bus for sending out-of-order data response messages.
Splash-2 implements 8 applications, including:
 LU: dense matrix manipulation.
 OCEAN: simulation of large-scale ocean movements.
 Cholesky: sparse matrix manipulation.
 Radix: radix-based integer sorting.
 …
Also used: a benchmark for measuring the performance of Java servers and applications…, and a benchmark for shared-memory, multithreaded programs.
Metrics measured:
 Processor utilization
 Bus utilization
 Number of accesses to physical memory
 Benchmark suite: Splash-2
 Simulator: gem5, SE mode
 Hardware: four CPUs; each CPU has a private L1 cache of 32 KB with associativity 4. The default cache line size is 64 bytes, which we configure for our experiments.
Write-backs per memory references vs. L1 cache size:
16 KB: 17300; 32 KB: 12672; 64 KB: 5251; 128 KB: 0.
[Chart: write-backs fall as the L1 cache size grows.]

Write-backs per memory references vs. L1 block size:
16 bytes: 11214; 32 bytes: 12350; 64 bytes: 12672; 128 bytes: 13001.
[Chart: write-backs rise slightly as the L1 block size grows.]
Benchmark suite: PARSEC
Benchmark applications: blackscholes, bodytrack, canneal, facesim, fluidanimate, freqmine, raytrace, and swaptions.
Protocols: MESI, MOSI, and MOESI (compared to MSI).
 Across all the benchmarks and input sizes, MESI and MOESI reduce the number of broadcasts by 7% on average.
 MOSI and MOESI reduce the number of write-backs by 5% on average.
 Since MOSI and MOESI substantially reduce the number of write-backs for these workloads, they reduce the energy consumption of the LLC by 4% on average.
 MOSI and MOESI show only a small additional benefit in write-back traffic reduction compared to MSI and MESI.
Benchmark suite: Splash-2
Benchmark applications: Barnes-Hut, LU, OCEAN, Radiosity, Radix, Ray Trace
Protocols: MESI and MSI
Hardware: ?
Protocols: MSI and MESI, MOSI, MOESI
Hardware
Splash-2 inputs and applications
Directory
 Directory protocols were originally developed to address the lack of scalability of snooping protocols.
 The aim of directory protocols is to avoid the broadcast nature of snooping.
 Snooping systems broadcast all requests on a totally ordered interconnection network, and all requests are snooped by all coherence controllers.
 Directory protocols instead use indirection to avoid both the ordered broadcast network and having each cache controller process every request.
 Directory-based protocols should be competitive with snooping protocols.
[Figure: directory system model. Same as the snooping system model, but the LLC/directory controller now also holds a directory alongside the last-level cache (LLC).]
Protocol: Ordered network? / Advantages / Disadvantages
Snooping protocol: yes / simple / difficult to scale
Directory-based protocol: no / scalable / indirection, extra hardware
 A directory in the directory system model maintains a global view of the coherence state of each block.
 It keeps track of the copies of cached blocks and their states; every block has associated directory information.
 Every request goes to the directory, and the directory then sends directives to each cache.
 One restriction on the interconnection network is that it enforces point-to-point ordering: if controller A sends two messages to controller B, then the messages arrive at controller B in the same order in which they were sent.
 The figure shows the transactions in which a cache controller issues coherence requests to change permissions from I to S, I or S to M, M to I, and S to I.
 For a store miss, the cache controller sends a GetM request to the directory, and the directory takes two actions. First, it responds to the requestor with a message that includes the data and the AckCount, the number of current sharers of the block.
 Second, the directory sends an Invalidation message to all of the current sharers. Each sharer, upon receiving the Invalidation, sends an Invalidation-Ack to the requestor.
 For an eviction from M, the cache controller sends a PutM message that includes the data to the directory. The directory responds with a Put-Ack. If the PutM did not carry the data with it, the protocol would require a third message (a data message from the cache controller to the directory with the evicted block that had been in state M) to be sent in a PutM transaction.
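The directory's two actions on a GetM can be sketched in a few lines. This is a hypothetical illustration (the function and message names are our own):

```python
# Hypothetical sketch of the directory on a GetM: reply to the requestor
# with Data plus an AckCount, and send Inv to every other current sharer.
def handle_getm(requestor, sharers):
    others = sorted(sharers - {requestor})
    msgs = [("Data+AckCount", requestor, len(others))]
    msgs += [("Inv", sharer, requestor) for sharer in others]
    return msgs  # the requestor then waits for len(others) Inv-Acks

msgs = handle_getm("C1", sharers={"C1", "C2", "C3"})
assert ("Data+AckCount", "C1", 2) in msgs       # two sharers to invalidate
assert ("Inv", "C2", "C1") in msgs and ("Inv", "C3", "C1") in msgs
```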
I to S (common case #1)
The cache controller sends a GetS request to the directory and changes the block state from I to
ISD. The directory receives this request and, if the directory is the owner (i.e., no cache currently
has the block in M), the directory responds with a Data message, changes the block's state to S (if
it is not S already), and adds the requestor to the sharer list. When the Data arrives at the requestor,
the cache controller changes the block's state to S, completing the transaction.
I to S (common case #2)
The cache controller sends a GetS request to the directory and changes the block state from I to
ISD. If the directory is not the owner (i.e., there is a cache that currently has the block in M), the
directory forwards the request to the owner and changes the block’s state to the transient state SD.
The owner responds to this Fwd-GetS message by sending Data to the requestor and changing the
block’s state to S. The now-previous owner must also send Data to the directory since it is
relinquishing ownership to the directory, which must have an up-to-date copy of the block. When
the Data arrives at the requestor, the cache controller changes the block state to S and considers the
transaction complete. When the Data arrives at the directory, the directory copies it to memory,
changes the block state to S, and considers the transaction complete.
 Consider a complete directory maintaining the complete state of each block, including the full set of caches that may have shared copies.
 Assume point-to-point ordering for the Forwarded-Request network.
 Recall: if a cache has a block in the Owned state, then the block is valid, read-only, dirty (i.e., it must eventually update memory), and owned (i.e., the cache must respond to coherence requests for the block).
 Adding the Owned state changes the protocol (compared with MSI) in three important ways:
1. More coherence requests are satisfied by caches (in the O state) than by the LLC/memory.
2. There are more 3-hop transactions.

MOSI Directory Protocol – Cache Controller

GetS transaction:
• If the directory is the owner: (1) Req (I→S) sends GetS to the Dir (S→S).
• If the directory is not the owner: (1) Req (I→S) sends GetS to the Dir (M→O or O→O); (2) the Dir sends Fwd-GetS to the Owner (M→O or O→O).

Cache controller table events (columns): load, store, replacement, Fwd-GetS, Fwd-GetM, Inv, Put-Ack, Data from Dir (ack=0), Data from Owner (ack>0), AckCount from Dir, Inv-Ack, Last-Inv-Ack.
First entry: I: load → send GetS to Dir/ISD.

MOSI Directory Protocol – Cache Controller (cont.)

Load-path rows so far:
• I: load → send GetS to Dir/ISD
• ISD: load, store, replacement, Inv → stall; Data → -/S
• S: load → hit
MOSI Directory Protocol – Cache Controller (cont.)

GetS transaction with Data responses:
• If the directory is the owner: (1) Req (I→S) sends GetS to the Dir (S→S); (2) the Dir sends Data to Req.
• If the directory is not the owner: (1) Req sends GetS to the Dir (M→O or O→O); (2) the Dir sends Fwd-GetS to the Owner (M→O or O→O); (3) the Owner sends Data to Req.
• ISD: I → S, waiting for Data.
Table so far: I: load → send GetS to Dir/ISD; ISD: Data from Dir (ack=0) or Data from Owner (ack>0) → -/S, everything else stalls; S: load → hit.

GetM transaction:
• (1) Req (I→M or S→M) sends GetM to the Dir (S→M); any Sharers are invalidated (S→I).
• New transient states: I → IMAD → IMA → M and S → SMAD → SMA → M.
• IMAD: the cache wants I → M and waits for Data plus (possibly) Inv-Acks.
• The cache knows how many Inv-Acks it expects to receive.
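The ack-counting described in the bullets above can be sketched as follows. This is an illustrative sketch, not slide code; the names (`CacheBlock`, `IM_AD`, `IM_A`) are ours, standing in for the IMAD and IMA transient states.

```python
class CacheBlock:
    """Sketch of a cache controller's store-miss ack counting."""

    def __init__(self):
        self.state = "I"
        self.acks_needed = 0   # may go negative if Inv-Acks race ahead of Data

    def store_miss(self, send_getm):
        send_getm()            # send GetM to the directory
        self.state = "IM_AD"   # I -> M, waiting for Data + possibly Inv-Acks

    def on_data(self, ack_count):
        # Data carries the number of sharers that must acknowledge; any
        # Inv-Acks that raced ahead were already counted as negative.
        self.acks_needed += ack_count
        self.state = "M" if self.acks_needed == 0 else "IM_A"

    def on_inv_ack(self):
        self.acks_needed -= 1
        if self.state == "IM_A" and self.acks_needed == 0:
            self.state = "M"   # last Inv-Ack completes the transaction
```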
GetM transaction with invalidations:
• (1) Req sends GetM to the Dir; (2) the Dir sends Data [ack=0] or Data [ack>0] to Req and Inv to each Sharer; (3) each Sharer sends Inv-Ack to Req and goes S→I.
Store-path rows of the table:
• I: store → send GetM to Dir/IMAD
• IMAD: Data from Dir (ack=0) → -/M; Data from Owner (ack>0) → -/IMA; Inv-Ack → ack--; everything else stalls
• IMA: Inv-Ack → ack--; Last-Inv-Ack → -/M; everything else stalls
• S: store → send GetM to Dir/SMAD; Inv → send Inv-Ack to Req/I
• SMAD, SMA: load → hit; Inv (in SMAD) → send Inv-Ack to Req/IMAD; Data and Inv-Ack handling as in IMAD/IMA, ending in M; everything else stalls
GetM transaction (continued), adding the M and replacement entries:
• S: replacement → send PutS to Dir/SIA
• M: load, store → hit; replacement → send PutM + data to Dir/MIA; Fwd-GetS → send data to Req/O; Fwd-GetM → send data to Req/I
GetM transaction from the Owned state:
• (1) Req (O→M) sends GetM to the Dir (O→M); (2) the Dir sends AckCount to Req and Inv to each Sharer (S→I); (3) each Sharer sends Inv-Ack to Req.
O-state rows of the cache controller table:
• O: load → hit; store → send GetM to Dir/OMAC; replacement → send PutO + data to Dir/OIA; Fwd-GetS → send data to Req; Fwd-GetM → send data to Req/I
• OMAC: load → hit; store, replacement → stall; Fwd-GetS → send data to Req; Fwd-GetM → send data to Req/IMAD; AckCount from Dir → -/OMA; Inv-Ack → ack--
• OMA: load → hit; store, replacement, Fwd-GetM → stall; Fwd-GetS → send data to Req; Inv-Ack → ack--; Last-Inv-Ack → -/M
Eviction (Put) transactions:
• PutS: (1) Req (S→I) sends PutS to the Dir (S→S, or S→I if Req is the last sharer); (2) the Dir sends Put-Ack to Req.
• PutM: (1) Req (M→I) sends PutM + data to the Dir (M→I); (2) the Dir sends Put-Ack to Req.
• PutO: (1) Req (O→I) sends PutO + data to the Dir (O→S); (2) the Dir sends Put-Ack to Req.
Eviction rows of the cache controller table:
• MIA: load, store, replacement → stall; Fwd-GetS → send data to Req/OIA; Fwd-GetM → send data to Req/IIA; Put-Ack → -/I
• OIA: load, store, replacement → stall; Fwd-GetS → send data to Req; Fwd-GetM → send data to Req/IIA; Put-Ack → -/I
• SIA: load, store, replacement → stall; Inv → send Inv-Ack to Req/IIA; Put-Ack → -/I
• IIA: load, store, replacement → stall; Put-Ack → -/I
MOSI Directory Protocol – Cache Controller (complete table)

Events (columns): load, store, replacement, Fwd-GetS, Fwd-GetM, Inv, Put-Ack, Data from Dir (ack=0), Data from Owner (ack>0), AckCount from Dir, Inv-Ack, Last-Inv-Ack. Notation: “-/X” means take no action and change state to X; “ack--” decrements the expected Inv-Ack count.

• I: load → send GetS to Dir/ISD; store → send GetM to Dir/IMAD
• ISD: load, store, replacement, Inv → stall; Data from Dir or Owner → -/S
• IMAD: load, store, replacement, Fwd-GetS, Fwd-GetM → stall; Data from Dir (ack=0) → -/M; Data from Owner (ack>0) → -/IMA; Inv-Ack → ack--
• IMA: load, store, replacement, Fwd-GetS, Fwd-GetM → stall; Inv-Ack → ack--; Last-Inv-Ack → -/M
• S: load → hit; store → send GetM to Dir/SMAD; replacement → send PutS to Dir/SIA; Inv → send Inv-Ack to Req/I
• SMAD: load → hit; store, replacement, Fwd-GetS, Fwd-GetM → stall; Inv → send Inv-Ack to Req/IMAD; Data from Dir (ack=0) → -/M; Data from Owner (ack>0) → -/SMA; Inv-Ack → ack--
• SMA: load → hit; store, replacement, Fwd-GetS, Fwd-GetM → stall; Inv-Ack → ack--; Last-Inv-Ack → -/M
• M: load, store → hit; replacement → send PutM + data to Dir/MIA; Fwd-GetS → send data to Req/O; Fwd-GetM → send data to Req/I
• MIA: load, store, replacement → stall; Fwd-GetS → send data to Req/OIA; Fwd-GetM → send data to Req/IIA; Put-Ack → -/I
• O: load → hit; store → send GetM to Dir/OMAC; replacement → send PutO + data to Dir/OIA; Fwd-GetS → send data to Req; Fwd-GetM → send data to Req/I
• OMAC: load → hit; store, replacement → stall; Fwd-GetS → send data to Req; Fwd-GetM → send data to Req/IMAD; AckCount from Dir → -/OMA; Inv-Ack → ack--
• OMA: load → hit; store, replacement, Fwd-GetM → stall; Fwd-GetS → send data to Req; Inv-Ack → ack--; Last-Inv-Ack → -/M
• OIA: load, store, replacement → stall; Fwd-GetS → send data to Req; Fwd-GetM → send data to Req/IIA; Put-Ack → -/I
• SIA: load, store, replacement → stall; Inv → send Inv-Ack to Req/IIA; Put-Ack → -/I
• IIA: load, store, replacement → stall; Put-Ack → -/I
MOSI Directory Protocol – Directory Controller

Events (columns): GetS, GetM from Owner, GetM from NonOwner, PutS-NotLast, PutS-Last, PutM + data from Owner, PutM + data from NonOwner, PutO + data from Owner, PutO + data from NonOwner.

• I: GetS → send Data to Req, add Req to Sharers/S; GetM → send Data to Req, set Owner to Req/M; PutS (NotLast or Last) → send Put-Ack to Req; PutM + data from NonOwner → send Put-Ack to Req; PutO + data from NonOwner → send Put-Ack to Req
• S: GetS → send Data to Req, add Req to Sharers; GetM → send Data to Req, send Inv to Sharers, set Owner to Req, clear Sharers/M; PutS-NotLast → remove Req from Sharers, send Put-Ack to Req; PutS-Last → remove Req from Sharers, send Put-Ack to Req/I; PutM + data from NonOwner → remove Req from Sharers, send Put-Ack to Req; PutO + data from NonOwner → remove Req from Sharers, send Put-Ack to Req
• O: GetS → forward GetS to Owner, add Req to Sharers; GetM from Owner → send AckCount to Req, send Inv to Sharers, clear Sharers/M; GetM from NonOwner → forward GetM to Owner, send Inv to Sharers, set Owner to Req, clear Sharers, send AckCount to Req/M; PutS (NotLast or Last) → remove Req from Sharers, send Put-Ack to Req; PutM + data from Owner → remove Req from Sharers, copy data to memory, send Put-Ack to Req, clear Owner/S; PutO + data from Owner → copy data to memory, send Put-Ack to Req, clear Owner/S; PutO + data from NonOwner → remove Req from Sharers, send Put-Ack to Req
• M: GetS → forward GetS to Owner, add Req to Sharers/O; GetM from NonOwner → forward GetM to Owner, set Owner to Req; PutS (NotLast or Last) → send Put-Ack to Req; PutM + data from Owner → copy data to memory, send Put-Ack to Req, clear Owner/I; PutM + data from NonOwner → send Put-Ack to Req; PutO + data from NonOwner → send Put-Ack to Req
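The directory side of a GetM for the stable states, as described above, can be sketched in a few lines. This is a minimal illustration with our own naming, not slide code:

```python
def dir_on_getm(state, owner, sharers, req, send):
    """Handle GetM at the directory; returns (new_state, new_owner, new_sharers)."""
    if state == "I":
        send(req, ("Data", 0))                 # no sharers to invalidate
        return "M", req, set()
    if state == "S":
        invalidees = sharers - {req}
        send(req, ("Data", len(invalidees)))   # Data carries the ack count
        for s in invalidees:
            send(s, ("Inv", req))              # each sharer Inv-Acks the requestor
        return "M", req, set()
    if state == "M":
        send(owner, ("Fwd-GetM", req))         # 3-hop transaction via the owner
        return "M", req, set()
    raise ValueError(f"unhandled state {state}")
```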


Comparison between the cache controllers in MSI and MOSI:

                        MSI    MOSI
Total # of messages      15      20
Total # of stalls        31      38

Comparison between the memory controllers in MSI and MOSI:

                        MSI    MOSI
Total # of messages      19      28
Total # of stalls         2       2

• We have assumed a complete directory maintaining the complete state of each block, including the full set of caches that may have shared copies
• Coarse directories and limited pointers are two ways to reduce how much state the directory maintains
• Complete directory: 2-bit state + log2(C)-bit owner + C-bit complete sharer list (bit vector); each bit in the sharer list represents one cache
• Coarse directory: 2-bit state + log2(C)-bit owner + (C/K)-bit coarse sharer list (bit vector); each bit in the sharer list represents K caches
• Limited directory: 2-bit state + i·log2(C) bits of pointers to i sharers; the sharer list is divided into i entries, each of which is a pointer to a cache
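The storage cost of the three organizations above is easy to compute. A back-of-the-envelope sketch (the function names are ours) for C caches, K caches per coarse bit, and i limited pointers:

```python
from math import ceil, log2

def complete_bits(C):
    # 2-bit state + log2(C)-bit owner + C-bit full sharer bit vector
    return 2 + ceil(log2(C)) + C

def coarse_bits(C, K):
    # each sharer-list bit now covers K caches
    return 2 + ceil(log2(C)) + ceil(C / K)

def limited_bits(C, i):
    # i pointers of log2(C) bits each instead of a sharer bit vector
    return 2 + i * ceil(log2(C))

# With C = 256 caches: complete = 266 bits per entry,
# coarse with K = 8 -> 42 bits, limited with i = 4 -> 34 bits.
```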

• Idea: in a system with N directories, block B’s directory can live at directory B modulo N, because the allocation of memory addresses to nodes is often static.
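The static mapping can be sketched in one line. An illustrative helper (the name and the 64-byte block size are our assumptions):

```python
def home_directory(block_addr: int, n_dirs: int, block_bytes: int = 64) -> int:
    """Home directory of a block: block number modulo the number of directories."""
    return (block_addr // block_bytes) % n_dirs

# Every node computes the same home for a given address, so no lookup
# structure is needed to locate a block's directory.
```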
[Figure: two nodes, each containing a core, cache, cache controller, memory, and a directory controller holding its slice of the directory, connected by the interconnection network.]
• Multiple directories provide greater bandwidth for coherence transactions.
• Recall: one limitation of directory protocols is that stalls happen frequently.
• When a cache controller has a block in state IMA and receives a Fwd-GetS, it processes the request and changes the block’s state to IMAS.
• This state indicates that after the cache controller’s GetM transaction completes (i.e., when the last Inv-Ack arrives), the cache controller will change the block’s state to S.
• The cache controller must also send the block to the requestor of the GetS and to the directory, which is now the owner.
• Conclusion: by not stalling on the Fwd-GetS, the cache controller improves performance by continuing to process other forwarded requests behind that Fwd-GetS in its incoming queue.
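The non-stalling behavior above can be sketched as a small state machine. The naming (`IM_A_S` for the IMAS state) is our own illustration, not slide code:

```python
class Block:
    """Sketch of the IMA -> IMAS non-stalling optimization."""

    def __init__(self, acks_needed):
        self.state = "IM_A"
        self.acks_needed = acks_needed
        self.pending_gets_req = None

    def on_fwd_gets(self, req):
        if self.state == "IM_A":
            self.pending_gets_req = req  # remember the request instead of stalling
            self.state = "IM_A_S"        # transaction will end in S, not M

    def on_inv_ack(self, send):
        self.acks_needed -= 1
        if self.acks_needed > 0:
            return
        if self.state == "IM_A_S":
            # GetM transaction done; now service the deferred GetS: data goes
            # to the GetS requestor and to the directory, which becomes owner.
            send(self.pending_gets_req, "Data")
            send("Dir", "Data")
            self.state = "S"
        else:
            self.state = "M"
```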

• NOTE: from here on, we no longer assume point-to-point ordering in the interconnection network.

• Consider MOSI as an example:
(a) Example with point-to-point ordering.
(b) Example without point-to-point ordering. Note that C2’s Fwd-GetS arrives at C1 in state I, and thus C1 does not respond.
• One approach is to use a customized message to handle this situation.
• Adaptive routing enables a message to dynamically choose its path as it traverses the network:
• Congested links and switches can be avoided.
• Moreover, the point-to-point ordering problem can also be addressed.
(a) Adaptive Routing Example

• Flat memory-based directory protocol
• Uses a bit-vector directory representation
• Consists of up to 512 nodes
• Two processors per node, but there is no snooping protocol within a node; combining multiple processors in a node reduces cost

Distinguishing features:
• For scalability, each directory entry contains fewer bits than necessary to represent every possible cache that could be sharing a block.
• Since the network provides no ordering, several new messages are used for reordering purposes.
• The directory dynamically chooses between the coarse-bit-vector and limited-pointer representations.
• The protocol handles all of these conditions without enforcing ordering in the network.
• Uses only two networks (request and response) to avoid deadlock, even though the directory has three types of messages (request, forwarded request, and response).

• Supports scalability
• Able to handle message reordering
• More complicated than snooping
• Many transactions are time-inefficient, since they require an extra message when the home is not the owner
• High storage overhead for the directory data structure

Benchmarks:
• SPLASH-2: FFT, Barnes-Hut, LU, Ocean, Radiosity, Radix, Raytrace
• SPECjbb: benchmark for measuring the performance of Java servers and applications
• PARSEC: benchmark suite for shared-memory, multithreaded programs

Metrics:
• System performance (time efficiency)
• Processor utilization (time spent waiting for memory)
• Directory utilization
• Number of accesses to physical memory
• Power consumption (difficult to measure)

• Benchmark suite: SPLASH-2
• Simulator: gem5, SE mode
• Hardware: Hydra (UCDenver)
Example results:

L1 cache size (KB)    Write-backs / memory references
16                    17300
32                    12672
64                    5251
128                   0

L1 block size (bytes)   Write-backs / memory references
16                      11214
32                      12350
64                      12672
128                     13001

[Charts: write-backs vs. L1 cache size (KB), and write-backs vs. L1 block size (bytes).]
• Benchmark suite: PARSEC
• Benchmark applications: blackscholes, bodytrack, canneal, facesim, fluidanimate, freqmine, raytrace, and swaptions
• Protocols: MESI, MOSI, and MOESI (compared to MSI)
• Calculate the number of messages exchanged between entities
• Analyze the results obtained

Example results:
[1] – Daniel J. Sorin, Mark D. Hill, and David A. Wood, “A Primer on Memory Consistency and Cache Coherence,” Morgan & Claypool Publishers, 2011.
[2] – Linda Bigelow, Veynu Narasiman, and Aater Suleman, “An Evaluation of Snoop-Based Cache Coherence Protocols.”
[3] – Anoop Tiwari, “Performance comparison of cache coherence protocol on multi-core architecture,” Diss., 2014.
[4] – Mu-Tien Chang, Shih-Lien Lu, and Bruce Jacob, “Impact of Cache Coherence Protocols on the Power Consumption of STT-RAM-Based LLC.”
[5] – CMU 15-418: Parallel Computer Architecture and Programming, lecture series, Spring 2012.