Lecture 13: Memory Hierarchy—Ways to Reduce Misses
Review: Who Cares About the Memory Hierarchy?
• Processor Only Thus Far in Course:
– CPU cost/performance, ISA, Pipelined Execution
[Figure: processor vs. DRAM performance, 1980-2000 (performance axis 1 to 1000). µProc performance ("Moore's Law") grows ~60%/yr while DRAM performance grows ~7%/yr, so the processor-memory performance gap grows ~50% per year.]
• 1980: no cache in a microprocessor; 1995: two-level cache on chip
(1989: first Intel microprocessor with an on-chip cache)
The Goal: Illusion of large, fast, cheap memory
• Fact: Large memories are slow,
fast memories are small
• How do we create a memory that is large, cheap and
fast (most of the time)?
• Hierarchy of Levels
– Uses smaller and faster memory technologies close to the
processor
– Fast access time in highest level of hierarchy
– Cheap, slow memory furthest from processor
• The aim of memory hierarchy design is to have
access time close to the highest level and size equal
to the lowest level
Recap: Memory Hierarchy Pyramid
[Figure: memory hierarchy pyramid. The processor (CPU) sits at the top, connected by the transfer datapath (bus) to Level 1, Level 2, Level 3, ..., Level n below. Moving down the pyramid, distance from the CPU and access time (memory latency) increase, while cost per MB decreases and the size of memory at each level grows.]
Memory Hierarchy: Terminology
• Hit: data is found in some block in the upper level (Block X)
– Hit Rate: the fraction of memory accesses found in the upper level
– Hit Time: time to access the upper level = time to determine hit/miss + memory access time
• Miss: data must be retrieved from a block in the lower level (Block Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Note: Hit Time << Miss Penalty
Current Memory Hierarchy
Processor (control, datapath, regs) -> L1 cache -> L2 cache -> Main memory -> Secondary memory

Level:         Regs      L1 cache   L2 cache   Main Memory   Secondary Memory
Speed (ns):    0.5       2          6          100           10,000,000
Size (MB):     0.0005    0.05       1-4        100-1000      100,000
Cost ($/MB):   --        $100       $30        $1            $0.05
Technology:    Regs      SRAM       SRAM       DRAM          Disk
Memory Hierarchy: Why Does it Work? Locality!
[Figure: probability of reference plotted over the address space (0 to 2^n - 1): at any moment, references are concentrated in a small portion of the address space.]
• Temporal Locality (Locality in Time):
=> Keep most recently accessed data items closer to the
processor
• Spatial Locality (Locality in Space):
=> Move blocks consisting of contiguous words to the upper levels
[Figure: upper-level memory holding block Blk X and lower-level memory holding block Blk Y; data moves to/from the processor through the upper level, and blocks are transferred between the two levels.]
Memory Hierarchy Technology
• Random Access:
– “Random” is good: access time is the same for all locations
– DRAM: Dynamic Random Access Memory
» High density, low power, cheap, slow
» Dynamic: need to be “refreshed” regularly
– SRAM: Static Random Access Memory
» Low density, high power, expensive, fast
» Static: content lasts “forever” (until power is lost)
• “Not-so-random” Access Technology:
– Access time varies from location to location and from time
to time
– Examples: Disk, CDROM
• Sequential Access Technology: access time linear in location (e.g., tape)
• We will concentrate on random access technology
– Main memory: DRAM; caches: SRAM
Introduction to Caches
• Cache
– is a small very fast memory (SRAM, expensive)
– contains copies of the most recently accessed
memory locations (data and instructions): temporal
locality
– is fully managed by hardware (unlike virtual memory)
– storage is organized in blocks of contiguous memory
locations: spatial locality
– unit of transfer to/from main memory (or L2) is the
cache block
• General structure
– n blocks per cache organized in s sets
– b bytes per block
– total cache size n*b bytes
Cache Organization
(1) How do you know if something is in the cache?
(2) If it is in the cache, how to find it?
• Answer to (1) and (2) depends on type or
organization of the cache
• In a direct mapped cache, each memory address is
associated with one possible block within the
cache
– Therefore, we only need to look in a single location in the
cache for the data if it exists in the cache
Simplest Cache: Direct Mapped
[Figure: 16-block main memory mapped onto a 4-block direct-mapped cache; memory blocks 2 (0010), 6 (0110), 10 (1010), and 14 (1110) all map to cache index 2, given by their low-order bits. The memory block address is split into a tag and a cache index.]
• index determines block in cache
• index = (address) mod (# blocks)
• If number of cache blocks is power of 2,
then cache index is just the lower n bits
of memory address [ n = log2(# blocks) ]
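• Example: with the 4-block cache above, memory block address 13 (1101 in binary) maps to index 13 mod 4 = 1, i.e. its lower 2 bits (01); the remaining upper bits (11) are stored as the tag.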
Issues with Direct-Mapped
• If block size > 1, rightmost bits of index are
really the offset within the indexed block
ttttttttttttttttt iiiiiiiiii oooo
– tag: checked to see whether we have the correct block
– index: selects the block within the cache
– byte offset: selects the byte within the block
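A minimal Python sketch of this address split, assuming a 32-bit address and an illustrative geometry of 1024 blocks of 16 bytes (the function and its defaults are made up for the example; real hardware does the same split with wiring, not arithmetic):

```python
def split_address(addr, num_blocks=1024, block_size=16):
    """Split an address into (tag, index, byte offset) for a direct-mapped
    cache. Field widths are derived from the assumed cache geometry."""
    offset_bits = block_size.bit_length() - 1        # log2(block_size)
    index_bits = num_blocks.bit_length() - 1         # log2(num_blocks)
    offset = addr & (block_size - 1)                 # oooo
    index = (addr >> offset_bits) & (num_blocks - 1) # iiiiiiiiii
    tag = addr >> (offset_bits + index_bits)         # ttttttttttttttttt
    return tag, index, offset

# Two addresses with the same index but different tags conflict in a
# direct-mapped cache even when the cache is mostly empty.
print(split_address(0x00001230))   # tag 0,  index 0x123, offset 0
print(split_address(0x00041230))   # tag 16, index 0x123, offset 0
```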
64KB Cache with 4-word (16-byte) blocks
[Figure: direct-mapped 64 KB cache with 16-byte blocks. The 32-bit address splits into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4) selecting one of 4K entries, a 2-bit block offset (bits 3-2) selecting a word, and a 2-bit byte offset (bits 1-0). Each entry holds a valid bit, a 16-bit tag, and 128 bits of data; Hit is signalled when the stored tag matches the address tag, and a mux selects the requested 32-bit word.]
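As a check, the field widths in this organization follow directly from the geometry; a small Python sketch (variable names are illustrative):

```python
import math

cache_bytes = 64 * 1024     # 64 KB of data
block_bytes = 16            # 4 words x 4 bytes
addr_bits = 32

entries = cache_bytes // block_bytes                 # 4096 = 4K entries
index_bits = int(math.log2(entries))                 # 12
word_offset_bits = int(math.log2(block_bytes // 4))  # 2 (selects word in block)
byte_offset_bits = 2                                 # selects byte in word
tag_bits = addr_bits - index_bits - word_offset_bits - byte_offset_bits  # 16

print(entries, index_bits, word_offset_bits, tag_bits)  # 4096 12 2 16
```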
Direct-mapped Cache Contd.
• The direct mapped cache is simple to design and its
access time is fast (Why?)
• Good for L1 (on-chip cache)
• Problem: conflict misses lower the hit ratio
– Conflict misses are misses caused by accessing different memory locations that map to the same cache index
– In a direct-mapped cache there is no flexibility in where a memory block can be placed, which contributes to conflict misses
Another Extreme: Fully Associative
• Fully Associative Cache (8 word block)
– Omit cache index; place item in any block!
– Compare all Cache Tags in parallel
[Figure: fully associative cache with 8-word (32-byte) blocks. The address splits into a 27-bit cache tag and a byte offset (bits 4-0); the tag is compared in parallel against every valid cache tag, and the matching entry supplies bytes B0-B31.]
• By definition: Conflict Misses = 0 for a fully
associative cache
Fully Associative Cache
• Must search all tags in cache, as item can be in
any cache block
• Search for tag must be done by hardware in
parallel (other searches too slow)
• But, the necessary parallel comparator hardware
is very expensive
• Therefore, fully associative placement practical
only for a very small cache
Compromise: N-way Set Associative Cache
• N-way set associative:
N cache blocks for each Cache Index
– Like having N direct mapped caches operating in parallel
– Select the one that gets a hit
• Example: 2-way set associative cache
– Cache Index selects a “set” of 2 blocks from the cache
– The 2 tags in set are compared in parallel
– Data is selected based on the tag result (which matched the
address)
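A minimal Python sketch of the 2-way lookup just described; the set count, block size, and the simple fill-on-miss behaviour are illustrative assumptions, not a real design (replacement policy is covered a few slides later):

```python
class TwoWaySetAssociativeCache:
    """Toy 2-way set-associative cache: the index selects a set, both tags
    in the set are checked, and a hit returns the matching way."""

    def __init__(self, num_sets=256, block_size=16):
        self.num_sets = num_sets
        self.block_size = block_size
        # Each set holds 2 ways: (valid, tag)
        self.sets = [[{"valid": False, "tag": None} for _ in range(2)]
                     for _ in range(num_sets)]

    def _split(self, addr):
        block_addr = addr // self.block_size
        return block_addr // self.num_sets, block_addr % self.num_sets  # tag, index

    def access(self, addr):
        tag, index = self._split(addr)
        ways = self.sets[index]
        for way in ways:                       # "compare the 2 tags in parallel"
            if way["valid"] and way["tag"] == tag:
                return True                    # hit
        # Miss: fill the first invalid way, else way 0 (no LRU yet)
        victim = next((w for w in ways if not w["valid"]), ways[0])
        victim["valid"], victim["tag"] = True, tag
        return False

c = TwoWaySetAssociativeCache()
print(c.access(0x1230), c.access(0x1230))   # False (miss), then True (hit)
```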
Example: 2-way Set Associative Cache
[Figure: 2-way set-associative cache. The address splits into tag, index, and offset; the index selects a set of two blocks (one per way), the two stored tags are compared against the address tag in parallel, and a mux driven by the comparison result selects the cache block that hit.]
Set Associative Cache Contd.
• Direct Mapped, Fully Associative can be seen as
just variations of Set Associative block placement
strategy
• Direct Mapped = 1-way Set Associative Cache
• Fully Associative = n-way Set Associative for a cache with exactly n blocks
Alpha 21264 Cache Organization
[Figure: Alpha 21264 cache organization diagram.]
Block Replacement Policy
• N-way Set Associative and Fully Associative caches have a choice of where to place a block (and therefore of which block to replace)
– Of course, if there is an invalid block, use it
• Whenever we get a cache hit, record the cache block that was touched
• When we need to evict a cache block, choose one which hasn't been touched recently: “Least Recently Used” (LRU)
– Past is prologue: history suggests it is least likely of the
choices to be used soon
– Flip side of temporal locality
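A small Python sketch of this LRU bookkeeping for one set; the OrderedDict stands in for whatever use bits or counters the hardware would actually keep, and the class is purely illustrative:

```python
from collections import OrderedDict

class LRUSet:
    """Tracks blocks in one set of an N-way cache; evicts the least
    recently used block when the set is full."""

    def __init__(self, ways=4):
        self.ways = ways
        self.blocks = OrderedDict()          # tag -> most recently used at the end

    def access(self, tag):
        if tag in self.blocks:               # hit: record the touch
            self.blocks.move_to_end(tag)
            return "hit"
        if len(self.blocks) >= self.ways:    # miss with a full set: evict LRU
            self.blocks.popitem(last=False)
        self.blocks[tag] = True              # place the new block
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in [1, 2, 1, 3, 2]])  # ['miss','miss','hit','miss','miss']
```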
Review: Four Questions for
Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level?
(Block placement)
– Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level?
(Block identification)
– Tag/Block
• Q3: Which block should be replaced on a miss?
(Block replacement)
– Random, LRU
• Q4: What happens on a write?
(Write strategy)
– Write Back or Write Through (with Write Buffer)
Write Policy:
Write-Through vs Write-Back
• Write-through: all writes update cache and underlying memory/cache
– Can always discard cached data - most up-to-date data is in memory
– Cache control bit: only a valid bit
• Write-back: all writes simply update cache
– Can’t just discard cached data - may have to write it back to memory
– Flagged write-back
– Cache control bits: both valid and dirty bits
• Other Advantages:
– Write-through:
» memory (or other processors) always have latest data
» Simpler management of cache
– Write-back:
» Needs much lower bus bandwidth due to infrequent access
» Better tolerance to long-latency memory?
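A minimal sketch of the difference on a write hit, assuming a toy cache line represented as a Python dict (function names and structures are illustrative only):

```python
def write_hit_write_through(cache_line, memory, addr, value):
    """Write-through: update the cache and the underlying memory;
    no dirty bit is needed, so the line can be discarded at any time."""
    cache_line["data"] = value
    memory[addr] = value

def write_hit_write_back(cache_line, addr, value):
    """Write-back: update only the cache and set the dirty bit;
    memory is updated later, when the block is evicted."""
    cache_line["data"] = value
    cache_line["dirty"] = True

def evict_write_back(cache_line, memory, addr):
    """On eviction, a dirty block must be written back before reuse."""
    if cache_line.get("dirty"):
        memory[addr] = cache_line["data"]
        cache_line["dirty"] = False
```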
Write Through: Write Allocate vs
Non-Allocate
• Write allocate: allocate new cache line in cache
– Usually means that you have to do a “read
miss” to fill in rest of the cache-line!
– Alternative: per/word valid bits
• Write non-allocate (or “write-around”):
– Simply send write data through to underlying
memory/cache - don’t allocate new cache line!
Write Buffers
• Write Buffers (for write-through)
– buffers words to be written
in L2 cache/memory along
with their addresses.
– 2 to 4 entries deep
– all read misses are
checked against pending
writes for dependencies
(associatively)
– allows reads to proceed
ahead of writes
– can coalesce writes to
same address
• Write-back Buffers
– between a write-back
cache and L2 or MM
– algorithm
» move dirty block to
write-back buffer
» read new block
» write dirty block in L2 or
MM
– can be associated with
victim cache (later)
[Figure: the write buffer sits between the L1 cache and L2; L1 supplies data to the CPU while buffered writes drain to L2.]
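A sketch of the write-buffer behaviour listed on the left, under the assumptions stated there (a few entries, coalescing of writes to the same address, and read misses checked against pending writes); the class and method names are invented for illustration:

```python
class WriteBuffer:
    """Small write-through buffer: coalesces writes to the same address
    and lets read misses check it for pending data."""

    def __init__(self, depth=4):
        self.depth = depth
        self.pending = {}                  # address -> data waiting for L2/memory

    def write(self, addr, data):
        if addr in self.pending or len(self.pending) < self.depth:
            self.pending[addr] = data      # coalesce, or take a free entry
            return True
        return False                       # buffer full: the store must stall

    def check_read_miss(self, addr):
        """Read misses are checked against pending writes (associatively)."""
        return self.pending.get(addr)      # pending data, or None

    def drain_one(self, memory):
        if self.pending:
            addr = next(iter(self.pending))         # oldest entry first
            memory[addr] = self.pending.pop(addr)   # retire to L2/memory
```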
Write Merge
[Figure: write merging in a write buffer: multiple writes to the same block are combined into a single buffer entry.]
Review: Cache performance
• Miss-oriented Approach to Memory Access:
CPUtime = IC × (CPI_Execution + MemAccess/Inst × MissRate × MissPenalty) × CycleTime

CPUtime = IC × (CPI_Execution + MemMisses/Inst × MissPenalty) × CycleTime
– CPI_Execution includes ALU and Memory instructions
• Separating out the Memory component entirely
– AMAT = Average Memory Access Time
– CPI_AluOps does not include memory instructions

CPUtime = IC × (AluOps/Inst × CPI_AluOps + MemAccess/Inst × AMAT) × CycleTime

AMAT = HitTime + MissRate × MissPenalty
     = (HitTime_Inst + MissRate_Inst × MissPenalty_Inst)
     + (HitTime_Data + MissRate_Data × MissPenalty_Data)
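These formulas can be written out as a small calculator; this is just a sketch in Python, with parameter names mirroring the symbols above:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = HitTime + MissRate x MissPenalty."""
    return hit_time + miss_rate * miss_penalty

def cpu_time_miss_oriented(ic, cpi_execution, mem_access_per_inst,
                           miss_rate, miss_penalty, cycle_time):
    """CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime."""
    return ic * (cpi_execution +
                 mem_access_per_inst * miss_rate * miss_penalty) * cycle_time

def cpu_time_amat(ic, alu_ops_per_inst, cpi_alu_ops,
                  mem_access_per_inst, amat_cycles, cycle_time):
    """CPUtime = IC x (AluOps/Inst x CPI_AluOps + MemAccess/Inst x AMAT) x CycleTime."""
    return ic * (alu_ops_per_inst * cpi_alu_ops +
                 mem_access_per_inst * amat_cycles) * cycle_time
```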
Impact on Performance
• Suppose a processor executes at
– Clock Rate = 200 MHz (5 ns per cycle), Ideal (no misses) CPI = 1.1
– 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations (Data) get 50 cycle miss
penalty
• Suppose that 1% of instructions get same miss penalty
• CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles/inst)
      + [0.30 (DataMops/inst) × 0.10 (miss/DataMop) × 50 (cycles/miss)]
      + [1 (InstMop/inst) × 0.01 (miss/InstMop) × 50 (cycles/miss)]
      = (1.1 + 1.5 + 0.5) cycles/inst = 3.1
• Data-memory stalls add 1.5 cycles/inst; measured against the 2.6 cycles/inst of execution plus data stalls, the processor is stalled waiting for data memory 58% (1.5/2.6) of that time!
• Total memory accesses per instruction = 1 (fetch) + 0.3 (data) = 1.3, so
  AMAT = (1/1.3) × [1 + 0.01 × 50] + (0.3/1.3) × [1 + 0.1 × 50] = 2.54 cycles, instead of 1 cycle.
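A quick check of the arithmetic above in Python, using the numbers from the slide:

```python
ideal_cpi = 1.1
data_stall = 0.30 * 0.10 * 50    # 1.5 cycles/inst
inst_stall = 1.00 * 0.01 * 50    # 0.5 cycles/inst
cpi = ideal_cpi + data_stall + inst_stall
print(round(cpi, 2))             # 3.1

# AMAT over the 1.3 memory accesses per instruction (1 fetch + 0.3 data)
amat_cycles = (1/1.3) * (1 + 0.01 * 50) + (0.3/1.3) * (1 + 0.10 * 50)
print(round(amat_cycles, 2))     # 2.54 cycles
```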
Impact of Change in cc (Clock Cycle Time)
• Suppose a processor has the following parameters:
– CPI = 2 (w/o memory stalls)
– mem access per instruction = 1.5
• Compare AMAT and CPU time for a direct mapped cache and a 2-way set
associative cache assuming:
                 Direct mapped    2-way set associative
cc               1 ns             1.25 ns (why?)
Hit cycles       1                1
Miss penalty     75 ns            75 ns
Miss rate        1.4%             1.0%
AMAT_direct = hit time + miss rate × miss penalty = 1 × 1 + 0.014 × 75 = 2.05 ns
AMAT_2-way = 1 × 1.25 + 0.01 × 75 = 2 ns < 2.05 ns
CPU time_direct = (CPI × cc + mem stall time per inst) × IC = (2 × 1 + 1.5 × 0.014 × 75) × IC = 3.575 × IC
CPU time_2-way = (2 × 1.25 + 1.5 × 0.01 × 75) × IC = 3.625 × IC > CPU time_direct!
• A change in cc affects all instructions, while a reduction in miss rate benefits only memory accesses.
Miss Penalty for an Out-of-Order (OOO) Execution Processor
• In OOO processors, memory stall cycles are
overlapped with execution of other
instructions. Miss penalty should not include
this overlapped part.
mem stall cycle per instruction = mem miss per
instruction x (total miss penalty – overlapped miss
penalty)
• For the previous example, suppose 30% of the 75 ns miss penalty can be overlapped. What are the AMAT and CPU time?
– Assume a direct-mapped cache with cc = 1.25 ns to handle out-of-order execution.
AMAT_direct = 1 × 1.25 + 0.014 × (75 × 0.7) = 1.985 ns
With 1.5 memory accesses per instruction,
CPU time_direct = (2 × 1.25 + 1.5 × 0.014 × (75 × 0.7)) × IC = 3.6025 × IC < CPU time_2-way
Lock-Up Free Cache Using MSHR
(Miss Status Holding Register)
[Figure: MSHR file with n entries (mshr 1 ... mshr n). Each entry holds a valid bit (1 bit), the block request address (32 bits), and source node bits (16 bits), and has its own comparator so that new misses can be matched against outstanding ones.]
Avg. Memory Access Time vs. Miss Rate
• Associativity reduces miss rate, but increases hit time
due to increase in hardware complexity!
• Example: For an on-chip cache, assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way, relative to the clock cycle time (CCT) of a direct-mapped cache
                          Associativity
Cache Size (KB)   1-way   2-way   4-way   8-way
      1           2.33    2.15    2.07    2.01
      2           1.98    1.86    1.76    1.68
      4           1.72    1.67    1.61    1.53
      8           1.46    1.48    1.47    1.43
     16           1.29    1.32    1.32    1.32
     32           1.20    1.24    1.25    1.27
     64           1.14    1.20    1.21    1.23
    128           1.10    1.17    1.18    1.20

(In the original slide, red entries mark where A.M.A.T. is not improved by more associativity.)
Unified vs Split Caches
• Unified vs Separate I&D
[Figure: unified vs. split caches. Left: processor with a unified L1 cache (Unified Cache-1) backed by a unified L2 (Unified Cache-2). Right: processor with split L1 instruction and data caches (I-Cache-1, D-Cache-1) backed by the same unified L2 (Unified Cache-2).]
• Example:
– 16KB I&D: Inst miss rate=0.64%, Data miss rate=6.47%
– 32KB unified: Aggregate miss rate=1.99%
• Which is better (ignore L2 cache)?
– Assume 33% data ops => 75% of accesses are instruction fetches (1.0/1.33)
– hit time=1, miss time=50
– Note that data hit has 1 stall for unified cache (only one port)
AMATHarvard = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05
AMATUnified = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 2.24