Sample Cache Access Diagram

64-bit address, 2-way set associative, cache size 128KB (data part only), block size 16B.

[Figure: the 64-bit data address is split into a 48-bit tag (bits 63–16), a 12-bit index (bits 15–4) that chooses a set, and a 4-bit block offset (bits 3–0) that chooses the byte where the access starts. Each set holds two records, each with a valid bit, a 48-bit tag and 16B of data; both stored tags are compared (=?) against the address tag and the selected data is sent to the CPU.]

How the field widths follow from the cache parameters:
• The cache holds 128KB / 16B = 2^13 records, i.e. 2^12 sets (two records per set)
• Block size = 2^4 bytes, so the offset is 4 bits
• 2^12 sets, so the index is 12 bits
• Tag = 64 - offset - index = 48 bits

Improving Cache Performance

There are several possible directions for improvement:
• Reduce miss penalty
• Reduce miss rate
• Increase parallelism
• Reduce hit time

1. Reducing Miss Penalty: Multilevel Caches

There is a trade-off between the size of the cache and the access time:
• Small caches are faster but cannot hold much data, and therefore have a larger miss rate
• Large caches can hold enough data but are slower, which increases hit time
• We can add another, smaller cache (L1) between the current cache (L2) and the CPU
• Since everything that is in L1 is also in L2, it only makes sense to add L2 if it is much bigger than L1

Average_Memory_Access_Time = Hit_Time_L1 + Miss_Rate_L1 * Miss_Penalty_L1
Miss_Penalty_L1 = Hit_Time_L2 + Miss_Rate_L2 * Miss_Penalty_L2

We now have local and global miss rates:
• The local miss rate is the number of misses in a given cache level divided by the number of accesses to that level; it equals Miss_Rate_L1 for L1 and Miss_Rate_L2 for L2
• The global miss rate is the number of misses in a given cache level divided by the total number of accesses generated by the CPU; it equals Miss_Rate_L1 for L1, but Miss_Rate_L1 * Miss_Rate_L2 for L2

Average_Memory_Stalls_per_Instruction = Misses_per_Instruction_L1 * Hit_Time_L2 + Misses_per_Instruction_L2 * Miss_Penalty_L2

Example
Assume that in 1000 memory references there are 40 misses in the L1 cache and 20 misses in the L2 cache. Assume the miss penalty for L2 is 100 clock cycles, the hit time of L2 is 10 clock cycles, the hit time of L1 is 1 clock cycle, and there are 1.5 memory references per instruction. What are the various miss rates, the average memory access time and the average memory stalls per instruction?

Example
Given the data below, what is the impact of second-level cache associativity on its miss penalty?
• Hit time for a direct-mapped L2 is 10 clock cycles
• Two-way associativity increases the hit time by 0.1 clock cycles
• The local miss rate in L2 for direct mapped is 25%
• The local miss rate in L2 for two-way set associative is 20%
• The miss penalty for L2 is 100 clock cycles

2. Reducing Miss Penalty: Critical Word First and Early Restart

The CPU is usually stalled waiting for one word, not the whole block of data:
• Critical word first: request this one word from memory and send it to the CPU first, thus reducing the miss penalty, then continue loading the block
• Early restart: request the block from the beginning, but when the desired word arrives send it immediately to the CPU

Example
Assume that a computer has 64B blocks and the L2 cache takes 11 clock cycles to get the critical 8 bytes and then 2 clock cycles per 8 bytes to fetch the rest of the block. Calculate the average miss penalty with and without the critical word first technique. What if the following instructions wait for the block to load and then read the words following the first one, 8 bytes each? (a small sketch of the basic arithmetic follows below)

These techniques do not help considerably, as there is a high chance that subsequent CPU accesses will be to other words in the block.
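The following is a minimal sketch of the with/without comparison asked for in the example above, using only the stated timing (11 cycles for the critical 8 bytes, then 2 cycles per additional 8-byte chunk of a 64B block):

  /* Sketch: miss penalty with and without critical word first, for the
     example above (64B block, 11 cycles for the first 8 bytes, 2 cycles
     per remaining 8-byte chunk). */
  #include <stdio.h>

  int main(void) {
      int chunks = 64 / 8;                       /* 8 chunks of 8 bytes       */
      int whole_block = 11 + (chunks - 1) * 2;   /* wait for the whole block  */
      int critical_first = 11;                   /* restart after first chunk */
      printf("whole block: %d cycles, critical word first: %d cycles\n",
             whole_block, critical_first);
      return 0;
  }

With critical word first the CPU can restart after 11 cycles instead of 25, but if the following instructions then read the remaining words of the block one after another, they still pay 2 cycles per 8-byte chunk, which is why the technique helps less than the raw numbers suggest.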
3. Reducing Miss Penalty: Giving Priority to Read Miss over Write

If we have a write-through cache, we can avoid stalls on writes by putting the data into a write buffer and then releasing the CPU.
• What happens if a subsequent read miss needs data from the write buffer that has not been written to memory yet?

  SW R3, 512(R0)   /* cache index 0 */
  LW R1, 1024(R0)  /* cache index 0 */
  LW R2, 512(R0)   /* cache index 0 */

If the read miss on the last load goes to memory before the store waiting in the write buffer has completed, R2 receives the old value.
• The solution is to check the write buffer before going to memory
• If we have a write-back cache, we can store the dirty block in the write buffer, read the new data from memory into the cache and release the CPU, then perform the write from the write buffer to memory

4. Reducing Miss Penalty: Merging Write Buffer

Writes usually modify one word in a block. If the write buffer already contains some words from the given data block, we merge the currently modified word with the block parts already in the buffer.

5. Reducing Miss Penalty: Victim Caches

A victim cache holds data that has been evicted from the cache, in case it is needed again. It is a small fully associative cache and can help reduce misses for direct-mapped or set-associative caches.

Reducing Miss Rate

Miss categories:
• Compulsory – the first time we access a block
• Capacity – the cache cannot hold all blocks needed by the program
• Conflict – with a direct-mapped or set-associative strategy, two blocks may map to the same record in the cache

1. Reducing Miss Rate: Larger Block Size

Larger blocks reduce the number of compulsory misses:
• Larger blocks take advantage of spatial locality

But they increase the miss penalty, and for a fixed cache size a larger block means fewer blocks fit in the cache, which increases capacity misses and conflict misses.

Example
The memory system takes 80 clock cycles of overhead and then delivers 16 bytes every 2 clock cycles. Miss rates for various block sizes and cache sizes are as follows. Which block size gives us the smallest average memory access time? (for the solution see page 429; a small sketch of the calculation follows below)

  Block size |  Cache 4K    16K      64K     256K
  -----------+-------------------------------------
        16   |   8.57%     3.94%    2.04%   1.09%
        32   |   7.24%     2.87%    1.35%   0.70%
        64   |   7.00%     2.64%    1.06%   0.51%
       128   |   7.78%     2.77%    1.02%   0.49%
       256   |   9.51%     3.29%    1.15%   0.49%
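A minimal sketch of the average memory access time calculation for the table above, assuming a 1-clock-cycle hit time (an illustrative assumption, not stated in the example); the miss penalty for a block of B bytes is 80 cycles of overhead plus 2 cycles per 16 bytes delivered:

  /* Sketch: AMAT versus block size for the 4K-cache column of the table.
     Assumes a 1-cycle hit time; miss penalty = 80 + 2 * (block bytes / 16). */
  #include <stdio.h>

  int main(void) {
      int block[5] = {16, 32, 64, 128, 256};
      double miss_rate_4k[5] = {0.0857, 0.0724, 0.0700, 0.0778, 0.0951};
      for (int i = 0; i < 5; i++) {
          double penalty = 80 + 2.0 * (block[i] / 16);
          double amat = 1 + miss_rate_4k[i] * penalty;
          printf("block %3dB: penalty %3.0f cycles, AMAT %.2f cycles\n",
                 block[i], penalty, amat);
      }
      return 0;
  }

Under this hit-time assumption the 4K cache does best with 32-byte blocks; repeating the loop with the other columns shows the larger caches favoring 64-byte blocks.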
2. Reducing Miss Rate: Larger Caches

Larger caches reduce the number of capacity misses, but they increase hit time and have a higher cost.

3. Reducing Miss Rate: Higher Associativity

Experiments show that:
• An 8-way set associative cache has almost the same miss rate as a fully associative cache
• A direct-mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2 (the 2:1 cache rule of thumb)

Greater associativity can come at the cost of increased hit time.

Example
Assume higher associativity would increase the clock cycle time over a direct-mapped cache as follows: 2-way 1.36 times, 4-way 1.44 times, 8-way 1.52 times. The hit time is 1 clock cycle. The miss penalty for the direct-mapped cache is 25 clock cycles. Using the miss rates from figure 5.14, find the average memory access times for the different cache sizes. (for the solution see page 430; note the mistake in the solution – the 25-cycle miss penalty should also be multiplied by the increased clock cycle time)

4. Reducing Miss Rate: Way Prediction and Pseudoassociative Caches

High associativity increases hit time.

Way prediction:
• Keep a predictor with each set that predicts which block in the set will be needed on the next access, then compare the tag of this block with the tag we are looking for
• On a correct prediction the hit time is greatly reduced
• The simplest prediction is to remember which block was requested last time and to change the prediction when we are wrong

Pseudoassociativity:
• For a two-way set associative cache, always check the first block
• If we miss, check the second block before going to memory
• Upon a miss, swap the two blocks

5. Reducing Miss Rate: Compiler Optimizations

Make accesses within the same block rather than across blocks.

Loop interchange:

  /* before: inner loop strides down a column */
  for(j=0;j<100;j++)
    for(i=0;i<5000;i++)
      x[i][j]=2*x[i][j];

  ⇒

  /* after: inner loop walks along a row, i.e. within cache blocks */
  for(i=0;i<5000;i++)
    for(j=0;j<100;j++)
      x[i][j]=2*x[i][j];

Blocking:

  /* before: straightforward matrix multiply */
  for(i=0;i<N;i++)
    for(j=0;j<N;j++)
    { r=0;
      for(k=0;k<N;k++)
        r=r+y[i][k]*z[k][j];
      x[i][j]=r;
    }

  ⇒

  /* after: work on B-by-B blocks so the reused data fits in the cache */
  for(jj=0;jj<N;jj=jj+B)
    for(kk=0;kk<N;kk=kk+B)
      for(i=0;i<N;i++)
        for(j=jj;j<min(jj+B,N);j++)
        { r=0;
          for(k=kk;k<min(kk+B,N);k++)
            r=r+y[i][k]*z[k][j];
          x[i][j]=x[i][j]+r;
        }

Improving Cache via Parallelism

1. Improving Cache via Parallelism: Nonblocking Caches

Usually, during a cache miss, the cache waits for the data to be read from memory and stalls all further requests. A blocking cache therefore serializes misses; a nonblocking cache does not:
• Nonblocking caches continue to serve future requests that result in a hit while the data for the miss is being fetched – hit under one miss
• Further optimization may allow multiple outstanding misses (i.e., the nonblocking cache would also serve future requests that result in a miss)
• We could have multiple misses for the same block

Example
For the following cache, which optimization gives better performance: 2-way set associativity or hit under one miss? Calculate this both for FP and for integer programs. Assume that the FP miss rate is 11.4% with a direct-mapped cache and 10.4% with a 2-way set associative cache. The miss rate for integer programs is 7.4% with a direct-mapped cache and 6% with a 2-way set associative cache. The miss penalty is 16 cycles. Assume that the average memory stall time is simply the product of miss rate and miss penalty. Hit under one miss reduces the average memory stall time to 76% for FP programs and to 81% for integer programs. (a small sketch of the comparison follows below)
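A minimal sketch of the comparison asked for in the example above, using only the miss rates, the 16-cycle miss penalty, and the 76%/81% hit-under-one-miss factors given there; as stated in the example, the stall time is taken to be simply miss rate times miss penalty:

  /* Sketch: average memory stall cycles per access for the example above.
     stall = miss rate * 16-cycle penalty; hit under one miss scales the
     direct-mapped stall time by the given factor (0.76 FP, 0.81 integer). */
  #include <stdio.h>

  int main(void) {
      double penalty = 16.0;
      /* {direct-mapped miss rate, 2-way miss rate, hit-under-one-miss factor} */
      double fp[3]   = {0.114, 0.104, 0.76};
      double intg[3] = {0.074, 0.060, 0.81};
      double *prog[2] = {fp, intg};
      const char *name[2] = {"FP", "integer"};
      for (int i = 0; i < 2; i++) {
          double dm  = prog[i][0] * penalty;   /* blocking, direct mapped   */
          double two = prog[i][1] * penalty;   /* blocking, 2-way set assoc */
          double hum = prog[i][2] * dm;        /* hit under one miss        */
          printf("%s: direct mapped %.2f, 2-way %.2f, hit under one miss %.2f\n",
                 name[i], dm, two, hum);
      }
      return 0;
  }

With these numbers, hit under one miss gives the larger improvement for FP programs, while for integer programs it comes out essentially even with 2-way set associativity.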
2. Improving Cache via Parallelism: Hardware Prefetching

Prefetch items from memory before they have been requested by the processor:
• The prefetched data can be placed into the cache or into a separate buffer
• For instructions, the processor typically fetches two blocks on a miss – the requested one and the next consecutive one
• If the prefetched data is used, this generates the next prefetch request

3. Improving Cache via Parallelism: Compiler-Controlled Prefetching

The compiler can insert special instructions to request prefetching:
• A register prefetch asks for the data to be loaded into a register
• A cache prefetch asks for the data to be loaded into the cache
• A prefetch can be faulting or non-faulting
• Prefetching only makes sense if the processor does not stall while waiting for the prefetched data, i.e., with a nonblocking cache
• A write hint informs the processor of a write miss that writes a whole block, to avoid an unnecessary read

Improving Hit Time

Hit time is critical because it determines (limits) the cycle time of the processor. Most of it is spent accessing the tag memory and comparing the tag to the block address.

1. Improving Hit Time: Small and Simple Caches

For the first-level cache:
• Small, so that it can fit on the same chip as the processor
• Simple, direct-mapped, so that reading the block can be overlapped with the tag check

For the second-level cache:
• Keeping the tag memory on chip and the data off chip provides fast checks

2. Improving Hit Time: Avoiding Address Translation

The CPU generates requests for virtual addresses. The data may be in main memory, but the address must first be translated from a virtual address to a physical address. To make the common case fast, we can store virtual addresses in the cache – a virtual cache.

A virtual cache removes the address translation overhead, but it:
• May violate page-level protection
• Has to be flushed on a context switch
• Sometimes programs use different virtual addresses for the same physical address
• I/O uses physical addresses

3. Improving Hit Time: Pipelined Cache Access

Pipeline the cache access so that it can last multiple clock cycles:
• Now the clock cycle can be short
• But we get a greater penalty for mispredicted branches and more stalls for RAW dependencies

4. Improving Hit Time: Trace Cache

A trace cache collects temporal information about the instructions that were actually executed and loads this sequence into a cache block (instead of a physical block). Branch prediction is folded into the cache. This solves the problem of low cache utilization – because of branches, only a small portion of a traditional cache block is used.

Main Memory Organizations for Improving Performance

Main memory communicates with the cache and with I/O devices. Performance measures:
• Latency – how long it takes for a request to be answered
• Bandwidth – how much data can be read/written at once

Since just making memory faster (reducing latency) is hard, the organizations below improve performance mainly by increasing bandwidth.

Base Case
• 4 clock cycles to send the address
• 56 clock cycles access time per word
• 4 clock cycles to send a word of data
• Assume the cache block is 4 words long and a word is 8B
• Miss penalty = 4 * (4 + 56 + 4) = 256 cycles

1. Higher Bandwidth: Wider Main Memory

The wider the memory, the fewer accesses we need on a miss:
• In the previous example, if the memory were 2 words wide we would need only 128 cycles
• The cache width is the same as the memory width
• As the CPU still accesses the cache a word at a time, the words have to go through a multiplexer so that the correct word gets selected
• A second-level cache can help, since the multiplexing can then be done between L1 and L2

(A small sketch of the miss-penalty arithmetic follows below.)
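A minimal sketch of the miss-penalty arithmetic for the base case and for wider memories, using the timing given above (4 cycles to send the address, 56 cycles access time, 4 cycles to transfer one memory-width of data):

  /* Sketch: miss penalty for a 4-word cache block as the memory gets wider.
     Each memory access costs 4 (address) + 56 (access) + 4 (transfer) cycles. */
  #include <stdio.h>

  int main(void) {
      int block_words = 4;
      for (int width = 1; width <= 4; width *= 2) {
          int accesses = block_words / width;      /* accesses per miss   */
          int penalty  = accesses * (4 + 56 + 4);  /* 256, 128, 64 cycles */
          printf("memory %d word(s) wide: %d cycles per miss\n", width, penalty);
      }
      return 0;
  }

Interleaving attacks the same miss differently: with four banks the accesses overlap, so the penalty is 4 + 56 + 4 * 4 = 76 cycles, which is the figure quoted in the next section.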
2. Higher Bandwidth: Simple Interleaved Memory

Memory is organized into banks. An interleaved memory organization makes use of this by letting the banks work at the same time, overlapping multiple reads:
• In the base case above, with four banks the miss penalty would be 76 cycles
• Writes can also be interleaved
• If the interleaving is done so that adjacent words are stored in different banks, it optimizes sequential memory accesses
• To see a benefit we need more banks than cycles to access the memory

Example
The block size is 1 word, as is the memory bus width. The miss rate is 3%, there are 1.2 memory accesses per instruction, the cache miss penalty is 64 cycles, and the CPI with an ideal cache is 2. If we change the block size to 2 words, the miss rate falls to 2%, and a 4-word block has a miss rate of 1.2%. Calculate the performance with 2- and 4-word blocks, with and without interleaving. (see the solution on page 452)

3. Higher Bandwidth: Independent Memory Banks

Memory is organized into banks, but they are not accessed sequentially. Each device, such as an I/O controller or a cache, accesses its own bank.

Exercise 5.1
We are given two machines, A and B, with the same processor (2GHz) and main memory, a CPI of 1, and a miss penalty of 100ns. Writing a word to memory takes 100ns, and writing a block takes 200ns. Cache A is 2-way set associative with 32B blocks; it is write-through and does not allocate a block on a write miss. Cache B is direct-mapped with 32B blocks; it is write-back and allocates a block on a write miss. How would you write a benchmark so that A is better than B? How would you write a benchmark so that B is better than A? (one possible sketch follows below)
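One possible way to write the two benchmarks from Exercise 5.1 is sketched below; the array size and strides are illustrative assumptions, not part of the exercise. The first loop writes each 32B block exactly once and never reuses it, which favors machine A (write-through, no write allocate), because B must allocate a block on every write miss and later write the whole dirty block back. The second loop keeps writing the same block, which favors machine B (write-back), because A sends every single store to memory while B hits in the cache after the first miss.

  /* Sketch for Exercise 5.1 (sizes and strides are illustrative only).
     Assumes 8-byte longs, so a stride of 4 elements is one 32B block. */
  #include <stddef.h>

  #define N (1 << 20)
  static volatile long a[N];   /* volatile so the stores are actually performed */

  /* Favors A (write-through, no write allocate): one store per block,
     each block touched exactly once. */
  void favor_A(void) {
      for (size_t i = 0; i < N; i += 4)
          a[i] = (long)i;
  }

  /* Favors B (write-back, write allocate): repeated stores to one block. */
  void favor_B(void) {
      for (size_t i = 0; i < N; i++)
          a[0] = (long)i;
  }

  int main(void) {
      favor_A();
      favor_B();
      return 0;
  }

Another way to favor A would be to alternate accesses between two blocks that conflict in B's direct-mapped cache but fit in one set of A's 2-way set associative cache.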