Introduction Snooping Directory Conclusion

[Figure: main memory holds blocks 1 to 5 with values A to E, and Cache 1 to Cache 4 each hold copies of some of those blocks. In the following panels one cached copy of block 3 changes from C to F, and later a second copy does as well, while the remaining copies still hold C.]

[Figure: system model. Each core sends loads and stores to its cache and receives loaded values. The cache controller and the memory controller exchange issued and received coherence requests and responses over the interconnection network.]

The goal of a coherence protocol is to maintain coherence by enforcing the SWMR invariant.

Single-Writer, Multiple-Reader (SWMR) invariant: for any memory location A, at any given time, there is either a single core that may write A (and may also read it) or some number of cores that may only read it.

Coherence controllers are finite state machines that implement the SWMR invariant. Each coherence controller implements a set of finite state machines, one per block.

Cache controller
• Core side: interfaces to the processor core. It receives loads and stores from the core and returns loaded values to the core.
• Network side: interfaces to the rest of the system through the interconnection network. On a cache miss, it issues a coherence request to obtain the block. It may receive data in response to its own request, and it may receive coherence requests from the network.

Memory controller
• Similar to the cache controller, but it has only the network side.
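The SWMR invariant above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical encoding in which each core's permission for one memory location is "W" (may write and read), "R" (may only read), or None (no access); the function name `swmr_holds` is illustrative and not part of any protocol.

```python
from typing import Dict, Optional

# Hypothetical permission labels for ONE memory location:
# "W" = may write (and read), "R" = may only read, None = no access.
Permission = Optional[str]


def swmr_holds(permissions: Dict[str, Permission]) -> bool:
    """Return True if the per-core permissions for one location satisfy SWMR."""
    writers = [core for core, p in permissions.items() if p == "W"]
    readers = [core for core, p in permissions.items() if p == "R"]
    if len(writers) > 1:
        return False  # two cores may write at the same time
    if writers and readers:
        return False  # a writer coexists with read-only copies
    return True


print(swmr_holds({"core1": "R", "core2": "R", "core3": None}))  # True
print(swmr_holds({"core1": "W", "core2": None}))                # True
print(swmr_holds({"core1": "W", "core2": "R"}))                 # False
```

A coherence protocol never runs such a check at run time; the controllers' state machines are designed so that the invariant holds by construction.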
The state of a cache block is described by four main characteristics:
• Validity: a valid block has the most up-to-date value for this block. A valid block may be read, but it may be written only if it is also exclusive.
• Dirtiness: a cache block is dirty if its value is the most up-to-date value and this value differs from the value in memory.
• Exclusivity: a cache block is exclusive if it is the only copy of this block among all caches.
• Ownership: a cache controller (or memory controller) is the owner of a block if it is responsible for responding to coherence requests for that block. An owned block cannot be evicted without giving ownership to another coherence controller. In most protocols, there is exactly one owner for each block.

Stable states: most protocols use a subset of the classic five-state MOESI model (pronounced "MO-zee"). Each state is a different combination of the characteristics described on the previous slide.
• M (Modified): valid, exclusive, owned, and potentially dirty. The block may be read or written. This cache holds the only valid copy of the block and must respond to requests for it. The memory copy of the block is potentially stale.
• S (Shared): valid, not exclusive, not dirty, and not owned. The cache has a read-only copy of the block. Other caches might have valid, read-only copies of the block.
• I (Invalid): the block is invalid. The cache either does not contain the block or has a stale version of it. It may not be read or written.
• O (Owned): valid, owned, and potentially dirty, but not exclusive. The cache has a read-only copy of the block and must respond to requests for it. The memory copy is potentially stale.
• E (Exclusive): valid, exclusive, and not dirty. The cache has a read-only copy of the block. The memory copy of the block is up-to-date.

Transient states occur during the transition from one stable state to another one.
• XY^Z (also written XYZ): the block is transitioning from stable state X to stable state Y, and the transition will not be complete until an event of type Z occurs.
• IM^D (IMD): the block was in state I and will enter state M when data (D) is received.

There are two general approaches to naming the states of blocks in memory. The choice of naming does not affect functionality or performance.
• Cache-centric: the state of a block in memory is an aggregation of the block's states in the caches. For example, if a block is in state I in all caches, its memory state is I. If one or more copies are in S, the block is in S in memory. If the block is in state M in one cache, it is in M in memory.
• Memory-centric: the state of a block in memory corresponds to the memory controller's permissions to the block. For example, if a block is in I in all caches, its memory state is O, because memory behaves like its owner. If the cached copies are all in S, the memory state is also O. If the block is in M or O in one cache, its memory state is I, because memory holds an invalid copy.

To maintain the state of blocks in caches, the most common approach is to add some extra state bits to each block. For example, the five stable states of MOESI need 3 bits.

Block data   State
10011…….     000 -> I
11111…….     001 -> O
00000…….     101 -> M

To maintain the state of blocks in memory, we can use the same approach. Alternatively, we can use logic gates. For example, we can use a NOR gate: if one of its inputs is Owned = 1, its output is 0 and the state of the block in memory is I.

[Figure: a gate whose inputs are the block state in cache 1, cache 2, and cache 3, and whose output is the state of the block in memory.]

Most protocols have a similar set of transactions, because the basic goals of the coherence controllers are similar.
Transactions are all initiated by cache controllers that are responding to requests from their associated cores.

Transaction: Goal
• GetShared (GetS): obtain block in Shared (read-only) state.
• GetModified (GetM): obtain block in Modified (read-write) state.
• Upgrade (Upg): upgrade block state from read-only (Shared or Owned) to read-write (Modified); Upg (unlike GetM) does not require data to be sent to the requestor.
• PutShared (PutS): evict block in Shared state.
• PutExclusive (PutE): evict block in Exclusive state.
• PutOwned (PutO): evict block in Owned state.
• PutModified (PutM): evict block in Modified state.

Events are core requests to their cache controllers.

Event: Response of cache controller
• Load: if cache hit, respond with data from cache; else initiate GetS transaction.
• Store: if cache hit in state E or M, write data into cache; else initiate GetM or Upg transaction.
• Atomic read-modify-write: if cache hit in state E or M, atomically execute read-modify-write semantics; else initiate GetM or Upg transaction.
• Instruction fetch: if cache hit (in I-cache), respond with instruction from cache; else initiate GetS transaction.
• Read-only prefetch: if cache hit, ignore; else may optionally initiate GetS transaction.
• Read-write prefetch: if cache hit in state M, ignore; else may optionally initiate GetM or Upg transaction.
• Replacement: depending on the state of the block, initiate PutS, PutE, PutO, or PutM transaction.

The other major design decision in a coherence protocol is what to do when a core writes to a block. There are two options:
• Invalidate protocol: when a core wishes to write to a block, it initiates a coherence transaction to invalidate the copies in all other caches. Thus, if other cores want to read this block, they need to issue a new request to obtain a new copy of it.
• Update protocol: when a core wishes to write a block, it initiates a coherence transaction to update the copies in all other caches to reflect the new value it wrote to the block. Update protocols reduce read latency, but they use more bandwidth because their messages are bigger (they carry data as well).

In a snooping protocol, all coherence controllers observe (snoop) coherence requests in the same order. By requiring that all requests to a given block arrive in order, a snooping system enables the distributed coherence controllers to correctly update the finite state machines that collectively represent a cache block's state.

Snooping protocols broadcast requests to all coherence controllers, including the controller that initiated the request. The coherence requests typically travel on an ordered broadcast network, such as a bus.

Example 1 (block A): every controller observes the two GetM requests in the same order.
Time 0: C1 A:I | C2 A:I | Memory A:I, Owner
Time 1: C1 GetM from C1 / M, Owner | C2 GetM from C1 / I | Memory GetM from C1 / M
Time 2: C1 GetM from C2 / I | C2 GetM from C2 / M, Owner | Memory GetM from C2 / M

Example 2 (block A): C1 and C2 observe the two GetM requests in different orders, so both caches end in I while memory records M.
Time 0: C1 A:I | C2 A:I | Memory A:I, Owner
Time 1: C1 GetM from C1 / M, Owner | C2 GetM from C2 / M, Owner | Memory GetM from C1 / M
Time 2: C1 GetM from C2 / I | C2 GetM from C1 / I | Memory GetM from C2 / M

[Figure: multicore processor chip. Each core has a cache controller and a private data (L1) cache. The cache controllers connect through the interconnection network to the LLC/directory controller and the last-level cache (LLC), which connects to main memory.]

The baseline snooping protocol implements two atomicity properties:
• Atomic requests: a coherence request is ordered in the same cycle in which it is issued.
• Atomic transactions: coherence transactions are atomic in that a subsequent request for the same block may not appear on the bus until after the first transaction completes (i.e., until after the response has appeared on the bus).
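The effect of request ordering in the two GetM examples above can be reproduced with a few lines of code. This is a simplified sketch, not the protocol itself: it assumes only the stable states I and M, ignores data transfer and transient states, and the names `snoop` and `run` are hypothetical.

```python
from typing import Dict, List


def snoop(me: str, requestor: str) -> str:
    """Stable state of cache `me` for block A after it observes one GetM."""
    # Own GetM: gain read-write permission. Another core's GetM: invalidate.
    return "M" if requestor == me else "I"


def run(observed: Dict[str, List[str]]) -> Dict[str, str]:
    """Apply each cache's observed sequence of GetM requestors, starting in I."""
    final = {}
    for cache, order in observed.items():
        state = "I"
        for requestor in order:
            state = snoop(cache, requestor)
        final[cache] = state
    return final


# Both caches observe GetM(C1) and then GetM(C2): exactly one writer remains.
same_order = run({"C1": ["C1", "C2"], "C2": ["C1", "C2"]})
# C2 observes the two requests in the opposite order: nobody holds the block.
diff_order = run({"C1": ["C1", "C2"], "C2": ["C2", "C1"]})

print(same_order)  # {'C1': 'I', 'C2': 'M'}
print(diff_order)  # {'C1': 'I', 'C2': 'I'}
```

In the second run both caches believe the other one owns the block, which is why snooping requires a single total order of requests.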
MSI snooping protocol, cache controller. Core events: Load, Store, Replacement. Bus events: Data for own transaction, and other cores' GetS, GetM, and PutM. Cells marked (A) cannot occur because transactions are atomic.
• I: Load: issue GetS/ISD. Store: issue GetM/IMD.
• ISD: Load: stall. Store: stall. Replacement: stall. Data: copy data into cache, load hit/S. Other GetS, GetM, PutM: (A).
• IMD: Load: stall. Store: stall. Replacement: stall. Data: copy data into cache, store hit/M. Other GetS, GetM, PutM: (A).
• S: Load: hit. Store: issue GetM/SMD. Replacement: -/I. Other GetM: -/I.
• SMD: Load: hit. Store: stall. Replacement: stall. Data: copy data into cache, store hit/M. Other GetS, GetM, PutM: (A).
• M: Load: hit. Store: hit. Replacement: issue PutM, send data to memory/I. Other GetS: send data to requestor and memory/S. Other GetM: send data to requestor/I.

MSI snooping protocol, memory controller. Bus events: GetS, GetM, PutM, Data from Owner.
• IorS: GetS: send data block to requestor/IorS. GetM: send data block to requestor/M.
• IorSD: GetS: (A). GetM: (A). Data from Owner: update data block in memory/IorS.
• M: GetS: -/IorSD. PutM: -/IorSD.

Advantages:
• Small table and few possible states.
• Easy to understand and implement.
• Multiple read-only copies of the same block can exist at once because of the Shared state.

Disadvantages:
• Many impossible states due to the atomic transactions property, and many stalls.
• Lower throughput.
• Higher latency.
• Unnecessary broadcast of invalidate messages: a core that wants to write a block must obtain it in state M and send an invalidate message to all other cores, whether or not it holds the only copy of that block.
• Tradeoff: downgrade from M to S or to I? We need to predict whether the block is going to be used again.

The MESI protocol implements atomic transactions and non-atomic requests.

The Exclusive state is used in almost all commercial coherence protocols because it optimizes a common case: a core first reads a block and then subsequently writes it. In MSI, a core needs to issue a GetS message to get read permission (in case of a cache miss) and then has to issue a GetM message to get write permission. In MESI, if no other cache holds the block, the core's GetS obtains the block in the Exclusive state.
Thus, the core does not need to issue a GetM message.

[Table: MESI snooping protocol, cache controller. States: I, ISAD, ISD, IMAD, IMD, S, SMAD, SMD, E, M, MIA, EIA, IIA. Core events: Load, Store, Replacement. Bus events: own GetS, GetM, PutM; other cores' GetS, GetM, PutM; own data responses (Data leading to S or M, and exclusive Data leading to E). Cells marked (A) cannot occur because transactions are atomic. "R" denotes the requestor and "M" in an action denotes memory, for example "data to R & M/S".]

[Table: MESI snooping protocol, memory controller. States: I, S, EorM, ID, SD, EorMD. Bus events: GetS, GetM, PutM, Data, NoData, NoData-E.]

Advantages:
• Silent transition from the Exclusive state to the Modified state on a store.
• No unnecessary invalidate messages are issued.
• A read followed by a write needs only one request.
• Fewer messages.
• Less traffic on the bus and lower bandwidth usage.

Disadvantages:
• Extra hardware is needed to implement the Exclusive state.

When a cache has a block in state M or E and receives a GetS from another core, under the MSI protocol or the MESI protocol the cache must
• change the block state from M or E to S, and
• send the data to both the requestor and the memory controller.

This raises the question of how a snooping protocol can minimize accesses to memory, that is, how it can eliminate
1. the extra data message to update the LLC/memory when a cache receives a GetS request in the M (and E) state?
2. the potentially unnecessary write to the LLC?
Augmenting the baseline Modified-Shared-Invalid (MSI) protocol with the Owned state

The key difference is what happens when a cache with a block in state M receives a GetS from another core.
• In a MOSI protocol, the cache changes the block state to O (instead of S) and retains ownership of the block (instead of transferring ownership to the LLC/memory).
• The O state enables the cache to avoid updating the LLC/memory.

The protocol adds two transient cache states in addition to the stable O state:
• The transient OIA state helps handle replacements of blocks in state O.
• The transient OMA state handles upgrades back to state M after a store.

[Table: MOSI snooping protocol, cache controller. States: I, ISAD, ISD, IMAD, IMD, S, SMAD, SMD, O, OMA, M, MIA, OIA, IIA. Processor core events: load, store, replacement. Bus events: OwnGetS, OwnGetM, OwnPutM, OtherGetS, OtherGetM, OtherPutM, own Data response. Cells marked (A) cannot occur because transactions are atomic.]

[Table: MOSI snooping protocol, memory controller. States: IorS, IorSD, MorO, MorOD. Bus events: GetS, GetM, PutM, Data from Owner, NoData.]

             MSI   MOSI
# Messages   6     13
# Stalls     20    24

             MSI   MOSI
# Messages   2     2
# Stalls     0     0

Example (two cores C1 and C2, with their cache controllers, and the memory controller on a shared bus):
• Cycle 1: C1 cache controller: issue GetS / ISAD.
• Cycle 2: C2 cache controller: issue GetM / IMAD.
• Cycle 3: request on bus: GetS (C1).
• Cycle 4: C1 cache controller: - / ISD. Memory controller: send data to C1 / IorS.
• Cycle 5: data on bus: data from LLC/mem.
• Cycle 6: C1 cache controller: copy data from LLC/mem / S. Request on bus: GetM (C2).
• Cycle 7: C1 cache controller: - / I. Memory controller: send data to C2 / MorO. C2 cache controller: - / IMD.
• Cycle 8: data on bus: data from LLC/mem.
• Cycle 9: C2 cache controller: copy data from LLC/mem / M.
• Cycle 10: C1 cache controller: issue GetS / ISAD.
• Cycle 11: request on bus: GetS (C1).
• Cycle 12: C1 cache controller: - / ISD. Memory controller: - / MorO. C2 cache controller: send data to C1 / O.
• Cycle 13: data on bus: data from C2.
• Cycle 14: C1 cache controller: copy data from C2 / S.

MOSI summary:
• A cache holding a block in the Owned state supplies the data to another processor instead of having that processor read the data from memory.
• It reduces the number of write-backs to main memory.
• It has medium complexity.
• When going from the Shared state to the Modified state, the block must pass through the Invalid state.

Atomic bus
[Figure: address bus and data bus timeline in which Request 1, Response 1, Request 2, Response 2, Request 3, and Response 3 strictly alternate.]
• The simplest way to implement atomic transactions is to use a shared-wire bus with an atomic bus protocol, in which every bus transaction is an indivisible request-response pair.
• It is analogous to an unpipelined processor core: there is no way to overlap activities that could proceed in parallel.
• It is simple but sacrifices performance.
• Throughput is limited by the sum of the latencies of a request and a response (including any wait cycles between them).

Pipelined (non-atomic) bus: provides responses in the same order as the requests.
[Figure: address bus carries Request 1, Request 2, Request 3; data bus carries Response 1, Response 2, Response 3.]

Split-transaction (non-atomic) bus: provides responses in an order different from the request order.
[Figure: address bus carries Request 1, Request 2, Request 3; data bus carries Response 2, Response 3, Response 1.]

Key advantage of a non-atomic bus: "NOT having to wait for a response before a subsequent request can be serialized on the bus". The bus can achieve much higher bandwidth using the same set of shared wires.
The advantage of a split-transaction bus, with respect to a pipelined bus, is that "a low-latency response does not have to wait for a long-latency response to a prior request".

One issue raised by a split-transaction bus is matching responses with requests. The response must carry the identity of the request or the requestor.

[Figure: system model with a split-transaction bus. FIFO queues buffer incoming and outgoing messages. The memory controller does not have a connection to make requests.]

[Table: MSI snooping protocol with a split-transaction bus, cache controller. States: I, ISAD, ISD, ISA, IMAD, IMD, IMA, S, SMAD, SMD, SMA, M, MIA, IIA. Processor core events: load, store, replacement. Bus events: OwnGetS or OwnGetM, OwnPutM, OtherGetS, OtherGetM, OtherPutM, own Data response (for own request). Slide annotation on the transient states: "It now can receive an Other-GetS".]

[Table: MSI snooping protocol with a split-transaction bus, memory controller. States: IorS, IorSD, IorSA, M. Bus events: GetS, GetM, PutM from Owner, PutM from Non-Owner, Data. Actions include "send data to requestor", "set Owner to requestor", "clear Owner", and "write data to memory".]

             MSI   MSI with Split-Transaction Bus
# Messages   6     5
# Stalls     20    33

             MSI   MSI with Split-Transaction Bus
# Messages   2     2
# Stalls     0     3

Issues (when a controller stalls until data arrives to satisfy the in-flight request):
1. It sacrifices performance.
2. Stalling raises the potential of deadlock (because of circular chains of stalls).
3. It enables a requestor to observe a response to its request before processing its own request.

By stalling a request, the protocol stalls all requests after the stalled request and delays those transactions from completing.

How can a coherence controller process requests behind the stalled one?
• Process all messages, in order, instead of stalling.
• Add transient states that reflect messages that the coherence controller has received but must remember to complete at a later event.

For example, a cache with a block in state ISD previously stalled instead of processing an Other-GetM for that block.
In this case, if the cache controller observes an Other-GetM on the bus, then it changes the block state to ISDI: "in I, going to S, waiting for data, and when data arrives will go to I".

[Table: non-stalling MSI snooping protocol, cache controller. States: I, ISAD, ISD, ISA, ISDI, IMAD, IMD, IMA, IMDI, IMDS, IMDSI, S, SMAD, SMD, SMA, SMDI, SMDS, SMDSI, M, MIA, IIA. Processor core events: load, store, replacement. Bus events: OwnGetS or OwnGetM, OwnPutM, OtherGetS, OtherGetM, OtherPutM, own Data response (for own request). Actions in the added states include "store hit, send data to GetM requestor/I", "store hit, send data to GetS requestor and mem/S", and "load hit/I".]

[Table: non-stalling MSI snooping protocol, memory controller. States: IorS, IorSD, IorSA, M. Bus events: GetS, GetM, PutM from Owner, PutM from Non-Owner, Data.]

• Uses MOESI.
• Non-atomic requests and transactions.
• Supports up to 64 processors.
• Wired snooping buses consume a lot of energy, so they do not scale to large numbers of cores. To solve this problem, the E10000 uses point-to-point links instead.
• Uses a separate bus for sending out-of-order data response messages.

Benchmarks:
• Splash-2 implements 8 applications: LU (dense matrix manipulation), OCEAN (large-scale movements), Cholesky (sparse matrix manipulation), Radix (radix-based integer sorting), …
• A benchmark for measuring the performance of Java servers and applications, …
• A benchmark for shared-memory, multithreaded programs.

Metrics:
• Processor utilization
• Bus utilization
• Number of accesses to physical memory

Benchmark suite: Splash-2
Simulator: gem5, SE mode
Hardware: four CPUs. Each CPU has a private L1 cache of 32 KB with associativity 4. The default cache line size is 64 bytes; we vary it for our experiment.

L1 cache size (KB)   Write-backs / memory references
16                   17300
32                   12672
64                   5251
128                  0

L1 block size (bytes)   Write-backs / memory references
16                      11214
32                      12350
64                      12672
128                     13001

[Figures: write-backs versus L1 cache size (KB), and write-backs versus L1 block size (bytes).]

Benchmark suite: SPEC
Benchmark applications: blackscholes, bodytrack, canneal, facesim, fluidanimate, freqmine, raytrace, and swaptions.
Protocols: MESI, MOSI, and MOESI (compared to MSI).
• Across all the benchmarks and input sizes, MESI and MOESI reduce the number of broadcasts by 7% on average.
• MOSI and MOESI reduce the number of write-backs by 5% on average.
• Because MOSI and MOESI reduce the number of write-backs for these workloads, they reduce the energy consumption of the LLC by 4% on average.
• MOSI and MOESI show only a small additional benefit in write-back traffic reduction compared to MSI and MESI.
Benchmark suite: Splash-2
Benchmark applications: Barnes-Hut, LU, OCEAN, Radiosity, Radix, Ray Trace
Protocols: MESI and MSI
Hardware: ?

Protocols: MSI and MESI, MOSI, MOESI
Hardware
Splash-2 inputs and applications

Directory protocols were originally developed to address the lack of scalability of snooping protocols.
• The goal of directory protocols is to avoid the broadcast nature of snooping.
• Snooping systems broadcast all requests on a totally ordered interconnection network, and all requests are snooped by all coherence controllers.
• Directory protocols instead use indirection to avoid both the ordered broadcast network and having each cache controller process every request.
• Directory-based protocols should be competitive with snooping protocols.

[Figure: multicore processor chip. Each core has a cache controller and a private data (L1) cache. The cache controllers connect through the interconnection network to the LLC/directory controller, the last-level cache (LLC), and the directory, which connect to main memory.]

Protocol: Ordered network; Advantages; Disadvantages
• Snooping protocol: Yes; simple; difficult to scale.
• Directory-based protocol: No; scalable; indirection, extra hardware.

A directory in the directory system model maintains a global view of the coherence state of each block.
• It keeps track of copies of cached blocks and their states.
• Every block has associated directory information.
• Every request goes to the directory, and the directory then sends directives to each cache.

One restriction on the interconnection network is that it enforces point-to-point ordering. That is, if controller A sends two messages to controller B, then the messages arrive at controller B in the same order in which they were sent.

In the figure, we show the transactions in which a cache controller issues coherence requests to change permissions from I to S, I or S to M, M to I, and S to I.
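The directory bookkeeping described above (one entry per block that tracks the cached copies and sends directives to caches) can be sketched as follows. This is a minimal illustration under stated assumptions: an MSI-style entry, no transient states, no data movement, and hypothetical names (`DirectoryEntry`, `get_s`, `get_m`). The message names Data, AckCount, Inv, Fwd-GetS, and Fwd-GetM follow the transactions used in this deck.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

Directive = Tuple[Optional[str], str]  # (destination cache, message)


@dataclass
class DirectoryEntry:
    """Hypothetical directory state for ONE block (stable states only)."""
    state: str = "I"                     # "I", "S", or "M"
    owner: Optional[str] = None          # cache holding the block in M
    sharers: Set[str] = field(default_factory=set)

    def get_s(self, req: str) -> List[Directive]:
        """Handle a GetS from cache `req` and return the directives to send."""
        if self.state == "M":
            # A cache owns the block: forward the request to that owner,
            # which becomes a sharer together with the requestor.
            prev = self.owner
            self.sharers = {prev, req}
            self.owner = None
            self.state = "S"
            return [(prev, f"Fwd-GetS for {req}")]
        # The directory is the owner: reply with data, record the sharer.
        self.sharers.add(req)
        self.state = "S"
        return [(req, "Data")]

    def get_m(self, req: str) -> List[Directive]:
        """Handle a GetM from cache `req` and return the directives to send."""
        if self.state == "M":
            prev = self.owner
            self.owner = req
            return [(prev, f"Fwd-GetM for {req}")]
        # Data plus AckCount to the requestor, Inv to every other sharer.
        others = sorted(self.sharers - {req})
        directives: List[Directive] = [(req, f"Data, AckCount={len(others)}")]
        directives += [(s, f"Inv for {req}") for s in others]
        self.sharers = set()
        self.owner = req
        self.state = "M"
        return directives


entry = DirectoryEntry()
print(entry.get_s("C1"))  # [('C1', 'Data')]
print(entry.get_s("C2"))  # [('C2', 'Data')]
print(entry.get_m("C3"))
print(entry.state, entry.owner)  # M C3
```

The third call returns one Data message carrying AckCount=2 for C3 and one Inv message for each of C1 and C2; a real controller would also move through transient states while those messages are in flight.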
I or S to M: the cache controller sends a GetM request to the directory, and the directory takes two actions. First, it responds to the requestor with a message that includes the data and the AckCount, which is the number of current sharers of the block. Second, the directory sends an Invalidation message to all of the current sharers. Each sharer, upon receiving the Invalidation, sends an Invalidation-Ack to the requestor.

M to I: the cache controller sends a PutM message that includes the data to the directory. The directory responds with a Put-Ack. If the PutM did not carry the data with it, then the protocol would require a third message to be sent in a PutM transaction: a data message from the cache controller to the directory with the evicted block that had been in state M.

I to S (common case #1): the cache controller sends a GetS request to the directory and changes the block state from I to ISD. The directory receives this request and, if the directory is the owner (i.e., no cache currently has the block in M), the directory responds with a Data message, changes the block's state to S (if it is not S already), and adds the requestor to the sharer list. When the Data arrives at the requestor, the cache controller changes the block's state to S, completing the transaction.

I to S (common case #2): the cache controller sends a GetS request to the directory and changes the block state from I to ISD. If the directory is not the owner (i.e., there is a cache that currently has the block in M), the directory forwards the request to the owner and changes the block's state to the transient state SD. The owner responds to this Fwd-GetS message by sending Data to the requestor and changing the block's state to S. The now-previous owner must also send Data to the directory, since it is relinquishing ownership to the directory, which must have an up-to-date copy of the block. When the Data arrives at the requestor, the cache controller changes the block state to S and considers the transaction complete.
When the Data arrives at the directory, the directory copies it to memory, changes the block state to S, and considers the transaction complete. Consider a complete directory maintaining complete state of each block, including the full set of caches that may have shared copies Point-to-point ordering for the Forwarded Request network Recall: if a cache has a block in the Owned state, then the block is valid, read-only, dirty (i.e., it must eventually update memory), and owned (i.e., the cache must respond to coherence requests for the block) Adding Owned State changes the protocols (compare with MSI) in three important ways: 1. More coherence requests are satisfied by caches (in O state) than by the LLC/mem 2. There are more 3-hop transactions If directory is the owner If directory is not the owner (2) Fwd-GetS (1) GetS (1) GetS Req IS SS Req IS Dir MO OO Req IS Owner MO OO I ISD S send GetS to Dir/ISD Last-InvAck Inv-Ack Inv-Ack AckCount from Dir Data from Owner (ack >0) Data form Dir (ack =0) Put-Ack Inv Fwd-getM Fwd-GetS replaceme nt store load MOSI Directory Protocol – Cache Controller If directory is the owner If directory is not the owner (1) GetS (1) GetS Req IS Req IS SS (2) Fwd-GetS Dir MO OO Req IS Owner MO OO send GetS to Dir/ISD ISD Stall S Stall -/S -/S Last-InvAck Inv-Ack Inv-Ack AckCount from Dir Data from Owner (ack >0) Data form Dir (ack =0) Put-Ack Inv Stall Fwd-getM Stall Fwd-GetS replacement I store load MOSI Directory Protocol – Cache Control (1) GetS If directory is the owner Req IS SS Req IS (2) Data (2) Fwd-GetS (1) GetS If directory is not the owner Stall S Hit Stall -/S -/S Last-Inv-Ack Inv-Ack Inv-Ack AckCount from Dir Data form Dir (ack =0) Put-Ack Inv Stall Fwd-getM Stall Fwd-GetS replacement ISD Store Send GetS to Dir/ISD Data from Owner (ack >0) (3) Data ISD: I -> S, waits for D I Owner M O OO Dir M O OO Req IS load (1) GetM (1) GetM Req IS SM Req IM Req IM Sharer SI Dir SM Sharer SI I IMAD IMA S SMAD SMA M Send GetM to Dir/IMAD 
[Figure: GetM transaction. (1) GetM from the requestor to the directory; (2) Data from the directory to the requestor (with ack = 0 or ack > 0); (2) Inv from the directory to each sharer; (3) Inv-Ack from each sharer to the requestor. The requestor goes from I to M, the directory from S to M, and each sharer from S to I.]

• IMAD: the cache wants I -> M and waits for D (data) plus possibly Acks.
• The cache knows how many Acks it expects to receive.

[Figure: GetM transaction from state O. (1) GetM from the requestor to the directory; (2) AckCount from the directory to the requestor; (2) Inv from the directory to each sharer; (3) Inv-Ack from each sharer to the requestor. The requestor and the directory go from O to M, and each sharer from S to I.]

[Figure: Put transactions. (1) PutS, PutM + data, or PutO + data from the requestor to the directory; (2) Put-Ack from the directory to the requestor. The requestor goes from S, M, or O to I.]

[Table: MOSI directory protocol, cache controller. Rows (states): I, ISD, IMAD, IMA, S, SMAD, SMA, M, MIA, O, OMAC, OMA, OIA, SIA, IIA. Columns (events): load, store, replacement, Fwd-GetS, Fwd-GetM, Inv, Put-Ack, Data from Dir (ack = 0), Data from Owner (ack > 0), AckCount from Dir, Inv-Ack, Last-Inv-Ack.]
[Table: MOSI directory protocol, directory controller. Rows (states): I, S, O, M. Columns (events): GetS, GetM from Owner, GetM from NonOwner, PutS-NotLast, PutS-Last, PutM+data from Owner, PutM+data from NonOwner, PutO+data from Owner, PutO+data from NonOwner.]

Comparison between the cache controller in MSI and MOSI:

                        MSI    MOSI
Total # of messages      15      20
Total # of stalls        31      38

Comparison between the memory controller in MSI and MOSI:

                        MSI    MOSI
Total # of messages      19      28
Total # of stalls         2       2

We have assumed a complete directory that maintains the complete state of each block, including the full set of caches that may have shared copies. Coarse directories and limited pointers are two ways to reduce how much state the directory maintains. For a system with C caches:

• Complete directory: 2-bit state, log2(C)-bit owner, C-bit complete sharer list (bit vector). Each bit in the sharer list represents one cache.
• Coarse directory: 2-bit state, log2(C)-bit owner, C/K-bit coarse sharer list (bit vector). Each bit in the sharer list represents K caches.
• Limited directory: 2-bit state, log2(C)-bit owner, i*log2(C)-bit pointers to i sharers. The sharer list is divided into i entries, each of which is a pointer to a cache.

Idea: in a system with N directories,
block B's directory might be at directory B modulo N, because the allocation of memory addresses to nodes is often static.

[Figure: two nodes, each with a core, a cache and cache controller, a directory and directory controller, and memory, connected by the interconnection network.]

Multiple directories provide greater bandwidth for coherence transactions.

Recall: one limitation of directory protocols is that stalls happen frequently.

When a cache controller has a block in state IMA and receives a Fwd-GetS, it processes the request and changes the block's state to IMAS. This state indicates that after the cache controller's GetM transaction completes (i.e., when the last Inv-Ack arrives), the cache controller will change the block state to S. The cache controller must also send the block to the requestor of the GetS and to the directory, which is now the owner.

Conclusion: by not stalling on the Fwd-GetS, the cache controller can improve performance by continuing to process other forwarded requests behind that Fwd-GetS in its incoming queue.

NOTE: Up to this point we assumed point-to-point ordering; we now consider an interconnection network without it.

Consider the MOSI protocol as an example:
(a) Example with point-to-point ordering.
(b) Example without point-to-point ordering. Note that C2's Fwd-GetS arrives at C1 in state I, and thus C1 does not respond.
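The non-stalling IMAS optimization described above can be sketched as a tiny state machine for one block. This is a minimal sketch under stated assumptions: the class name Block, its fields, and the `sent` log are hypothetical, only the IMA, IMAS, M, and S states are modeled, and the count of outstanding Inv-Acks is assumed to be known already.

```python
class Block:
    """One cache block that has issued a GetM and is waiting for Inv-Acks."""

    def __init__(self, acks_needed: int):
        self.state = "IMA"            # data received, still waiting for Acks
        self.acks_needed = acks_needed
        self.pending_gets = None      # requestor of a deferred Fwd-GetS
        self.sent = []                # log of (destination, message) pairs

    def on_fwd_gets(self, requestor: str) -> None:
        # Instead of stalling, remember the requestor and keep going.
        if self.state != "IMA":
            raise ValueError(f"Fwd-GetS not modeled in state {self.state}")
        self.pending_gets = requestor
        self.state = "IMAS"

    def on_inv_ack(self) -> None:
        self.acks_needed -= 1
        if self.acks_needed > 0:
            return                    # not the last Inv-Ack yet
        if self.state == "IMAS":
            # GetM is complete: serve the deferred GetS, then downgrade to S.
            self.sent.append((self.pending_gets, "Data"))
            self.sent.append(("Dir", "Data"))   # the directory is now the owner
            self.state = "S"
        else:
            self.state = "M"


block = Block(acks_needed=2)
block.on_fwd_gets("C2")   # arrives while Acks are still outstanding
block.on_inv_ack()
block.on_inv_ack()        # last Inv-Ack: block ends in S, data sent to C2 and Dir
```

Because on_fwd_gets returns immediately, later forwarded requests queued behind the Fwd-GetS are not blocked, which is the performance benefit the slide points out.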
One approach is to use a customized message to handle this situation.

Adaptive routing enables a message to dynamically choose its path as it traverses the network:
• Congested links and switches can be avoided.
• Moreover, the point-to-point ordering problem could also be solved.

[Figure: (a) Adaptive routing example]

• Flat memory-based directory protocol
• Uses a bit vector directory representation
• Consists of 512 nodes
• Two processors per node, but there is no snooping protocol within a node; combining multiple processors in a node reduces cost

Distinguishing Features
• For scalability, each directory entry contains fewer bits than necessary to represent every possible cache that could be sharing a block.
• Since the network provides no ordering, several new messages are used for reordering purposes.
• The directory dynamically chooses between a coarse bit vector and a limited pointer representation.
• The protocol handles all of these conditions without enforcing ordering in the network.
• Uses only two networks (request and response) to avoid deadlock. Note that the directory protocol has three types of messages (request, forwarded request, and response).

• Supports scalability
• Able to handle message ordering
• More complicated than snooping
• Has many transactions, which are inefficient in time because they require an extra message when the home is not the owner
• High storage overhead of the directory data structure

Benchmarks:
• SPLASH-2: FFT, Barnes-Hut, LU, Ocean, Radiosity, Radix, Raytrace
• SPECjbb: benchmark for evaluating the performance of Java servers and applications
• PARSEC: benchmark for shared-memory, multithreaded programs
Metrics
• System performance (time efficiency)
• Processor utilization (time spent waiting for memory)
• Directory utilization
• Number of accesses to physical memory
• Power consumption (difficult)

Benchmark suite: SPLASH-2
Benchmark application: gem5, SE mode
Hardware: Hydra (UCDenver)
Example results:

L1 Cache Size (KB)       Write-Back / Memory References
16                       17300
32                       12672
64                       5251
128                      0

L1 Block Size (bytes)    Write-Back / Memory References
16                       11214
32                       12350
64                       12672
128                      13001

[Plots: write-backs versus L1 cache size (KB) and write-backs versus L1 block size (bytes)]

Benchmark suite: SPEC
Benchmark applications: blackscholes, bodytrack, canneal, facesim, fluidanimate, freqmine, raytrace, and swaptions.
Protocols: MESI, MOSI, and MOESI (compared to MSI).
• Calculate the number of messages exchanged between entities
• Analyze the results obtained
Example results: [figure]
© Copyright 2024