Week 7 out-of-class notes, discussions and sample problems

In these notes, we concentrate on the lower levels of the memory hierarchy: main memory (DRAM) and virtual memory (swap space). We will also briefly look at flash memory, a form of removable storage. We start with a look at memory technology and improvements.

SRAM – static RAM, built out of flip-flops. Six transistors can be used to build one storage cell. SRAM is used to make both registers and cache. Although we refer to cache as “on-chip” and “off-chip”, the latest generation of processors has enough room for 2 to 3 caches on the chip (L1, L2 and optionally L3). The largest L3 on-chip cache is 12 MB. Often we see smaller L2 caches in our personal computer and laptop markets. As we already studied cache in class, we go on to the other forms of memory.

DRAM – dynamic RAM, using a single transistor and a capacitor to store each bit. The “dynamic” name is applied because the capacitor quickly loses any charge placed there, requiring that each DRAM cell be recharged fairly often. Additionally, DRAM offers only destructive reads: reading the contents of a cell causes that cell to be discharged. Therefore, DRAM is set up so that an outgoing charge circles around and goes back into the cell to recharge it. The same mechanism is used to recharge all memory locations.

Early DRAM may have offered as much as 1MB of space, which would require 20 address bits. Because microprocessors (CPUs on 1 chip) had few pins, the address had to be multiplexed, that is, delivered in two or more bus transfers. Commonly, an address would be divided into two parts, a row and a column. The two addresses (row number, column number) would be sent in two bus cycles. Decoders would be used to take each address and select the byte (or word) within memory corresponding to that row and column.

More recently, memory has been divided into banks. This permits accesses to different banks to proceed in parallel.
Consecutive memory locations can either appear in the same bank, so that different banks store different areas of memory (high order interleave), or appear in consecutive banks (low order interleave).

In order to recharge cells, DRAM uses a memory controller which, in essence, reads one entry from each bank in each cycle. Even though the contents of the read are not sent anywhere, the read refreshes each of those memory locations. The controller steps through each address in the banks one at a time until it reaches the last address and then cycles around to start again. Recharging the full bank might take 8 ms, which is fast enough that no memory is lost due to discharge. The refreshing strategy is not in use continuously, or else DRAM would not be able to respond to requests from other sources (cache or a DMA controller from the I/O subsystem). In fact, according to the authors, the memory controller would be refreshing memory about 5% of the time.

DRAM has lagged behind SRAM/CPU speed (see the chart on the next page). Let’s take a look here at how DRAM has improved over the years. First, DRAM capacity continues to grow at an exponential rate, just as transistor count has. For CPUs, we estimate a doubling of transistors on a chip roughly every 18-24 months (doubling every 2 years for instance). For DRAM, we see a quadrupling roughly every 3 years, which is an increase of nearly 60% per year (4^(1/3) is about 1.59). Thus, it is common today for our personal computers to store as much as 8GB (and in some cases 16GB) whereas 10 years ago the size was not even 1GB.

Unfortunately, access time in DRAM has lagged behind. While we expect processors to be 20-30 times faster today than those of 10 years ago, DRAM access speed has only increased by a factor of 2 or less in the same time. The consequence is that processor speed improves much faster than DRAM speed.
So there is an ever widening gulf between CPU time and DRAM access time. In fact, since 1980, DRAM speed has only increased by a factor of about 6 whereas processors are thousands of times faster over the same time interval. See figure 2.13, p. 99.

Since DRAM itself seems to offer little speedup, what can we do about this ever increasing gap between processor and DRAM? Architects and computer engineers have hit on a number of ideas.

First, architects added a buffer to retain the row address. This allowed successive accesses to memory to send only one address (the column), because successive accesses usually fall within the same row.

The next enhancement was to move DRAM from an asynchronous interface to one that was regulated by the system bus. Thus, synchronous DRAM or SDRAM was created, which resulted in a faster access response time because memory would be attuned to the system clock to start its response. SDRAM also added a byte count register so that, in one CPU request, multiple consecutive bytes could be accessed. While this does not impact access time, it permits overlapping accesses, or a pipelining of memory reads, in which successive reads can be sent back over the bus to the CPU in successive clock cycles. For instance, a DRAM access might take 80 cycles and a bus transfer 2. Thus, the first byte (or word) is transferred back in 82 cycles from the start of the request, the next at cycle 84 (2 cycles later), the next at cycle 86, etc. This is known as burst mode.

Another innovation is to increase the bus transfer rate. Today, we see DDR (double data rate) DRAMs. The idea is that data can be transferred on both the leading edge and the trailing edge of the clock cycle, thus doubling the data transfer rate. The introduction of banks helps support this by permitting parallel accesses.
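The burst-mode arithmetic above can be sketched in C. This is only an illustration, not a model of any real SDRAM part: the function names and parameters are made up for the example, and the sketch assumes the access latency is paid once, that transfers then follow back to back, and that words_per_transfer is greater than zero.

```c
/* Hypothetical helper: cycle (counted from the start of the request) at
 * which word k (0-based) of a burst reaches the CPU. Assumes one up-front
 * access latency followed by back-to-back bus transfers of one word each. */
unsigned burst_arrival_cycle(unsigned access_cycles,
                             unsigned transfer_cycles,
                             unsigned k)
{
    return access_cycles + transfer_cycles * (k + 1u);
}

/* Hypothetical helper: total cycles to deliver `words` words when each
 * transfer carries `words_per_transfer` words (2 models the DDR idea of
 * moving data on both clock edges). Assumes words_per_transfer > 0. */
unsigned burst_total_cycles(unsigned access_cycles,
                            unsigned transfer_cycles,
                            unsigned words,
                            unsigned words_per_transfer)
{
    /* Round up so that a partial final transfer still costs a full one. */
    unsigned transfers =
        (words + words_per_transfer - 1u) / words_per_transfer;
    return access_cycles + transfer_cycles * transfers;
}
```

With an 80-cycle access and a 2-cycle transfer, burst_arrival_cycle(80, 2, k) gives 82, 84 and 86 for k = 0, 1, 2, matching the figures in the text. With one-cycle transfers that each carry two words, burst_total_cycles(80, 1, 8, 2) gives 84 cycles for eight words.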
Consider using low order interleave. Assuming 8 banks, we would see address patterns as follows:

Bank 0   Bank 1   Bank 2   Bank 3   Bank 4   Bank 5   Bank 6   Bank 7
n        n+1      n+2      n+3      n+4      n+5      n+6      n+7
n+8      n+9      n+10     n+11     n+12     n+13     n+14     n+15
n+16     n+17     n+18     n+19     n+20     n+21     n+22     n+23

So, if we want to retrieve word n, we can also initiate the transfers for n+1 through n+7 simultaneously. After waiting 80 ns for the accesses to take place, pairs of words are sent back every clock cycle (using DDR). So, assuming a 1 ns clock cycle, we can retrieve 8 words in just 84 ns. If our bus is wider (say 64 bits instead of 32), we could even retrieve the 8 words in just 82 ns. High order interleave would not permit this, but we could obtain a different form of parallelism by letting multiple processes call upon memory simultaneously. For statistics of various forms of speedup obtained by DDR SDRAMs, see figure 2.14 on page 101.

One last innovation of note is the GDRAM, or the Graphics Data RAM. This is an SDRAM tailored for the higher bandwidth demands that come about from heavy graphics usage. Among other things, these DRAMs have both higher clock rates for their pins and wider buses to provide a speedup of at least 2 over SDRAMs (at a higher cost of course).

Finally, let’s consider Flash memory. Although Flash memory is a form of removable storage and so should be grouped with optical disc storage, we tend to use Flash more like main memory in many cases, especially with handheld devices which use flash memory in lieu of DRAM or hard disk. Flash memory is a form of EEPROM (electrically erasable programmable ROM). The EPROM (erasable programmable ROM) is a memory that can only be wholly erased, that is, erasing requires erasing the entire contents of memory (much like a CD-RW). The EEPROM allows piecemeal deletions but, unlike main memory, you cannot delete at the level of individual words. Instead, you can only delete at the level of blocks, much like hard disk deletion. Unlike SRAM and DRAM, Flash memory is non-volatile so that the contents can be stored permanently.
Drawbacks of Flash memory are that it has a limited lifetime (early Flash memory would only stand up to about 1000 erasures for any given block before that block failed; today the number is at least 100,000 erasures) and much slower access times. It is estimated that Flash memory is at least 4 times slower than SDRAM for reads and at least 10 times slower for writes, possibly much slower. But with the demise of floppy disks, flash drives have become a preferred method for porting data, as flash memories can store as much as 8GB.

We now turn to virtual memory. The need for virtual memory arose because of limited main memory sizes back in the 1960s. At that time, programmers would have to ensure that their programs were small enough to fit into the available memory space (perhaps 1 or a few megabytes). If a program did not fit, they used a technique called overlays. The idea behind an overlay is that the program would reside on disk and, when the first half of the program was done, the program would load the second half, overlay it over the first half, and then branch back to the “top” of the program (which would now be the start of the second half). This had to be done carefully to ensure that memory addresses were correctly reflected throughout the entire program’s execution. Also, if the program were to branch back to the first half, the first overlay would need to be loaded back into memory. Notice that if a program were moved from one computer to another, part of it would probably have to be rewritten to reflect the new computer’s memory size.

The idea behind virtual memory is similar to overlays: program code is loaded into memory when needed. However, virtual memory turns the various chores of loading program units and memory mapping over to the operating system rather than the programmer.
We see a lot of benefits for VM:
– allows programs to be larger in size than main memory without having to resort to overlays
– allows multiprogramming/multitasking as we can now place parts of several processes in memory
– avoids memory fragmentation (described below)
– provides an easy mechanism for relocation of code
– allows quicker initial load time from disk for the program (because only part of the program is initially loaded)

Prior to VM, memory was allocated to processes in contiguous blocks. Consider a multiprogramming system (an OS which would switch between processes when one process had to perform I/O or wait on another process) in which there are currently five processes plus the OS loaded in memory. Assume process P2 terminates and can be removed; this creates a gap. If P6 is selected to run next, it fills only part of that gap, leaving a fragment. Next, if P4 terminates and P7 starts, we have another gap. P8 cannot fit into memory as is, but if we were to compress the processes (move the fragments together), we would have enough space for P8.

To avoid fragmentation and improve performance as described in the list above, VM was created. Now, all programs are decomposed into fixed-size pages and main memory is decomposed into frames of the same fixed size. One page fits exactly into one frame. The operating system moves some initial program pages into available memory frames and keeps track of where things are through page tables. The page table is merely a record, for each page, of the frame it occupies if it is resident in memory, or an invalid marker if it is not in memory. There is one page table per running process. Below is an example mapping a 4-page process into 3 frames in memory and 1 on disk. The area of disk is called the swap space.

VM has a lot of advantages as noted above. What are the costs?
1. We use some of the hard disk space for swap space (this is pretty negligible today now that our hard disks are so large).
2. We need to implement some form of protection to ensure that one process does not use the space allocated to another process. In fact, this was also needed with contiguous allocation but the mechanism for that form of protection was quite simple.
3. We need to perform address translation (mapping) so that, for instance, we know to map a memory location in page A into the frame starting at 16K.
4. Swapping, the process of moving a new page from swap space to memory, will greatly impact processor performance. With contiguous allocation, all disk access was “up front” and once the process started, we no longer needed disk access (except to deal with data files). Now, we will have to access the disk often. We have to perform swapping whenever an address is requested that is not currently in memory; this event is known as a page fault.

The 4 questions that we asked regarding cache access (slide 4 of the powerpoint notes) are also applicable for VM:
1. Where can a block be placed? Our answer here is much simpler than in cache, where the block address and the type of cache dictated where a block could be placed. We can insert a page in any free frame. However, within that answer, we have to balance two additional things. First, we will use a replacement strategy to select a frame if none are free. Second, we might have an OS policy that limits the number of pages allowable per process. For instance, if a process is only given 1024 pages and all are used, then any new page will have to replace one of this process’s own pages and not a page belonging to another process.
2. How is a block found? As with cache, we need to map from the virtual address to the physical location. This can be done easily by just exchanging a page number for a frame number. We will use a page table for this.
3. Which block should be replaced? Again, we need a replacement strategy.
In cache, we had to make this choice in hardware and quickly. Here, because disk access is so slow, we want to make a wise choice. So we use the OS to make the decision and it will use either an LRU (least recently used) algorithm or an approximation of LRU. Another factor is that if we choose a page that is dirty (modified), we have to write it back to disk, slowing down the swapping process. So our LRU policy may favor clean pages over dirty pages, as a clean page can be discarded without being written back to disk.
4. What happens on a write? There is no efficient way to implement a write through policy as the disk access is just too time consuming, so we will always use write back and include a dirty bit in our page table to indicate whether a page that is to be removed must first be written back to disk.

Notice in the above figure how every memory access requires first accessing the page table (in memory). Even if the item we seek is in cache, we must first translate the address from virtual to physical, requiring a page table access. So we will copy the most recently accessed parts of the page table into a cache called the translation lookaside buffer (TLB). TLBs tend to be very small and highly associative, often fully associative. For instance, the Opteron used a 40 entry TLB. We may want 2 TLBs, one for instruction access and one for data access.

Also, recall that one of our cache optimizations was to avoid address translation. This can be done by storing entries in L1 caches using virtual addresses and not physical addresses, which would then allow us to perform virtual memory mapping only if we missed the L1 cache. It is also possible to use virtual addresses for other caches although this is more problematic. There are a number of drawbacks in storing cache items by virtual addresses, as discussed in pages B-36 – B-40 of the textbook.

About the only issues that need addressing in VM are the page size and how to implement protection.
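Before turning to those two issues, the page-table lookup described above (exchange the page number for a frame number, fault if the page is not resident, mark the page dirty on a write) can be sketched in C. This is an assumption-laden illustration rather than how any particular OS or MMU does it: the pte_t layout, the 4KB PAGE_SIZE, and the translate function are all hypothetical, and in a real machine the lookup is done by hardware with the TLB consulted first.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u /* assumed page size of 4KB */

/* Hypothetical page-table entry: a frame number plus two status bits. */
typedef struct {
    uint32_t frame; /* frame holding the page (meaningful only if valid) */
    bool valid;     /* page is resident in memory */
    bool dirty;     /* page has been written since it was loaded */
} pte_t;

/* Translate a virtual address using one process's page table.
 * Returns true and stores the physical address in *paddr on success.
 * Returns false to signal a page fault (page not resident or out of range);
 * the OS would then have to swap the page in. */
bool translate(pte_t *table, size_t num_pages, uint32_t vaddr,
               bool is_write, uint32_t *paddr)
{
    uint32_t page = vaddr / PAGE_SIZE;   /* virtual page number */
    uint32_t offset = vaddr % PAGE_SIZE; /* unchanged by translation */

    if (page >= num_pages || !table[page].valid)
        return false; /* page fault */

    if (is_write)
        table[page].dirty = true; /* write back: note that disk copy is stale */

    *paddr = table[page].frame * PAGE_SIZE + offset;
    return true;
}
```

Only the page number changes during translation; the offset within the page is carried over as is, which is why pages and frames must be the same size.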
For page sizes, notice that the size of a page has a direct impact on the size of a page table. Imagine a process of 1MB and a page size of 1KB. This gives us 1K pages for this process, so our page table will have 1K entries. If we use a page size of 4KB, we would only have 256 pages. So the larger the page size, the smaller the page table. This is not much of an issue today because main memory sizes are definitely large enough to store even large page tables for dozens of processes. However, the larger page size has a few other significant impacts:
– Page transfer time is longer because there is more to transfer, so the miss penalty to swap space is greater. However, this also has a benefit in that a larger transfer is worthwhile because you are already invested in accessing disk. The slowest part of the disk access is relocating the read/write head to the proper location on disk; the actual disk read and data transfer take less time. So once you begin the lengthy access, spending more time to bring more from disk into memory might be worthwhile as it is more efficient than two disk accesses.
– The larger the page size, the longer a process may go before a page fault because there is already more of the process in memory. This may lead to a lower memory miss rate.
– On the other hand, the larger the page size, the fewer pages can be kept in memory. Imagine that we have a policy that says that a process only gets 1/32 of memory. If memory is 4GB and the page size is 4KB, then there are 1M frames, so a process gets to move 32K pages into memory. But if the page size were 1KB, each process would get 128K pages instead.
– The larger the page size, the smaller the page table, which means a larger proportion of the page table can be kept in the TLB at any time. For instance, if the 1MB process has 1KB pages and the TLB stores only 40 entries, then only 4% of the process’s page information is in the TLB. If the page size were 4 times greater, the process would only have 256 pages and 16% of the process’s page information could be kept in the TLB.
– Larger page sizes mean larger frame sizes which in turn can permit caches to be both larger and faster.

For protection, we need to ensure that any process which generates an address generates a legal address. This address must fall within the process’s own memory space or a shared space. Additionally, if space is shared, we need proper synchronization techniques. This becomes complicated in the Pentium architecture, which combined segmentation with paging. In paging, we just have to ensure that the page table is up-to-date. Since only the OS can modify the page table, we are guaranteed that a process cannot change page table information so that its pages point to frames owned by other processes. In addition to this restriction, processors may include additional bits in the page table to indicate a variety of information such as:
– Valid – page is in memory
– Read/write – page is read only or readable/writable
– User/supervisor – page is owned by the user and accessible, or page is owned by the OS and has limited or no access for the user program
– Dirty – page has been modified
– Accessed – page has been accessed (whether read or write) in the recent past (used to help decide if the page should be replaced or not in the near future)
– No execute – prevents code from executing on some pages, for instance if the page stores data only
– Page level cache disable – can this page be cached?
– Page level write through – can this page, in cache, be written through or does it have to use write back?

For more information on protection and also a discussion of virtual machines, read section 2.4. We will skip the examination of two memory hierarchies, the ARM Cortex-A8 and the Intel Core i7 (section 2.6).

Sample problems:

Let’s examine pseudo-associativity as a solution to improve the hit rate of a direct-mapped cache.
Which provides faster avg memory access time for 4KB and 256 KB caches: direct-mapped, 2-way associative or pseudo-associative (PAC)? Assume hit time of 1 cycle for direct mapped, 1.36 for 2-way set associative, and miss penalty of 50 cycles. The PAC may have two attempts at accesses, the first to the address as generated by the CPU and a second on a miss by altering the address. Assume the second address takes 3 cycles.

We alter our formula for PAC because a miss does not necessarily accrue a 50 cycle penalty but instead a 3 cycle penalty if the item is in the other position in cache. We need two hit rates: the normal hit rate and the hit rate of finding the item in the second position (we will call this the alternative hit rate).

alternative hit rate = hit rate(2-way) - hit rate(1-way)
                     = (1 - miss rate(2-way)) - (1 - miss rate(1-way))
                     = miss rate(1-way) - miss rate(2-way)

Avg mem access time PAC = 1 + (miss rate(1-way) - miss rate(2-way)) * 3 + miss rate(2-way) * miss penalty

PAC: 4 KB = 1 + (.098 - .076) * 3 + (.076 * 50) = 4.866
PAC: 256 KB = 1 + (.013 - .012) * 3 + (.012 * 50) = 1.603
Direct-mapped: 4 KB = 1 + .098 * 50 = 5.9
Direct-mapped: 256 KB = 1 + .013 * 50 = 1.65
2-way: 4 KB = 1.36 + .076 * 50 = 5.16
2-way: 256 KB = 1.36 + .012 * 50 = 1.96

So, pseudo-associative cache outperforms both!

Let’s look at the impact of pipelining cache access. The advantage is that it allows us to reduce clock cycle time; the disadvantage is that with a shorter clock, cache misses have a larger impact. Compare the MIPS 5-stage pipeline vs. the MIPS R4000 8-stage pipeline, assuming clock rates of 1 GHz for MIPS and 1.8 GHz for MIPS R4000, a main memory access time of 50 ns (we will assume no second level cache) and a cache miss rate of 5%. Assuming no other source of stalls, which machine is faster?

First, we have to convert main memory access time into clock cycles as the two machines have two different clock cycle rates.
MIPS miss penalty = 50 ns / (1 / 1) ns = 50 cycles
MIPS R4000 miss penalty = 50 ns / (1 / 1.8) ns = 90 cycles
CPU time MIPS = (1 + .05 * 50) * 1 = 3.5 ns per instruction
CPU time MIPS R4000 = (1 + .05 * 90) * 1 / 1.8 = 3.06 ns per instruction

So the gain of increased clock speed by pipelining cache accesses more than offsets the increased miss penalty. To truly see if this is advantageous, we would also have to factor in the impact of structural hazards and branch penalties. The longer the pipeline, the greater that impact is.

Another form of compiler optimization to support cache access is to merge loops. Consider the following loop:

    for(i=0;i<n;i++)
        a[i]=b[i]*c[i];

If we assume n is a fairly large value, then it is likely that we would be loading a[i], b[i] and c[i] into cache in blocks, but as i increases, we would be discarding previous blocks of the three arrays. For instance, let’s assume a data cache of 256 blocks where each block stores 16 words of 4 bytes each. We will also assume that a, b, c are doubles and that n=1024. Thus, the memory space required for all of a, b, and c is 1024 * 8 * 3 bytes = 24KB. Our cache is 256 * 16 * 4 bytes = 16 KB. So 8KB of data, half the capacity of our cache, would be replaced before the loop terminates. Now this problem is further exacerbated if we have a later loop like this:

    for(i=0;i<n;i++)
        d[i]=b[i]+c[i];

because we will have already discarded half of b and c (at least). So a compiler might rearrange the code:

    for(i=0;i<n;i++)
        a[i]=b[i]*c[i];
    for(i=0;i<n;i++)
        d[i]=b[i]+c[i];

into

    for(i=0;i<n;i++) {
        a[i]=b[i]*c[i];
        d[i]=b[i]+c[i];
    }

Can you figure out the improvement in terms of the number of cache misses that we would remove, assuming that none of a, b, c or d are in cache initially and assuming no other data is stored in the data cache?
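One way to check an answer to this question is to simulate the cache. The sketch below is only one possible model and rests on assumptions the notes do not state: the cache is treated as fully associative with LRU replacement, the four arrays are laid out back to back starting at a block-aligned address 0, a double is 8 bytes, and every array reference in the source counts as one cache access. All names (cache_t, cache_access, elem_addr, count_misses) are invented for the example.

```c
#include <stdint.h>
#include <string.h>

#define NUM_BLOCKS  256u  /* cache size in blocks, from the example */
#define BLOCK_BYTES 64u   /* 16 words * 4 bytes per word */
#define N           1024u /* elements per array */
#define ELEM_BYTES  8u    /* assumed size of a double */

/* Assumed model: fully associative cache with LRU replacement. */
typedef struct {
    uint64_t block[NUM_BLOCKS];     /* block number held by each entry */
    uint64_t last_used[NUM_BLOCKS]; /* time of the most recent access */
    int      valid[NUM_BLOCKS];
    uint64_t now;
    unsigned long misses;
} cache_t;

static void cache_access(cache_t *c, uint64_t addr)
{
    uint64_t blk = addr / BLOCK_BYTES;
    unsigned victim = 0;

    c->now++;
    for (unsigned i = 0; i < NUM_BLOCKS; i++) {
        if (c->valid[i] && c->block[i] == blk) {
            c->last_used[i] = c->now; /* hit: refresh LRU timestamp */
            return;
        }
    }
    c->misses++;
    /* Prefer an empty entry; otherwise evict the least recently used. */
    for (unsigned i = 0; i < NUM_BLOCKS; i++) {
        if (!c->valid[i]) { victim = i; break; }
        if (c->last_used[i] < c->last_used[victim]) victim = i;
    }
    c->valid[victim] = 1;
    c->block[victim] = blk;
    c->last_used[victim] = c->now;
}

enum { ARR_A, ARR_B, ARR_C, ARR_D };

/* Assumed layout: arrays a, b, c, d stored back to back from address 0. */
static uint64_t elem_addr(unsigned array_index, unsigned i)
{
    return ((uint64_t)array_index * N + i) * ELEM_BYTES;
}

/* Count data cache misses for the two separate loops (fused == 0)
 * or for the merged loop (fused != 0), starting from an empty cache. */
unsigned long count_misses(int fused)
{
    cache_t c;
    memset(&c, 0, sizeof c);

    if (!fused) {
        for (unsigned i = 0; i < N; i++) {      /* a[i] = b[i] * c[i] */
            cache_access(&c, elem_addr(ARR_B, i));
            cache_access(&c, elem_addr(ARR_C, i));
            cache_access(&c, elem_addr(ARR_A, i));
        }
        for (unsigned i = 0; i < N; i++) {      /* d[i] = b[i] + c[i] */
            cache_access(&c, elem_addr(ARR_B, i));
            cache_access(&c, elem_addr(ARR_C, i));
            cache_access(&c, elem_addr(ARR_D, i));
        }
    } else {
        for (unsigned i = 0; i < N; i++) {      /* both statements per i */
            cache_access(&c, elem_addr(ARR_B, i));
            cache_access(&c, elem_addr(ARR_C, i));
            cache_access(&c, elem_addr(ARR_A, i));
            cache_access(&c, elem_addr(ARR_B, i));
            cache_access(&c, elem_addr(ARR_C, i));
            cache_access(&c, elem_addr(ARR_D, i));
        }
    }
    return c.misses;
}
```

Subtracting count_misses(1) from count_misses(0) gives the number of misses removed by merging under these assumptions. With a direct-mapped cache or a different placement of the arrays in memory, conflict misses could change the result, so treat the output as a check on the reasoning rather than a universal answer.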
Assume memory is organized as follows:
– two L1 caches (one data, one instruction)
– one L2 cache
– main memory
– disk cache
– disk (swap space)

Assume miss rates and access times of
– data cache: 5%, 1 clock cycle
– instruction cache: 1%, 1 clock cycle
– L2 cache: 10%, 10 clock cycles
– main memory: 0.2%, 100 clock cycles
– disk cache: 20%, 1000 clock cycles
– swap space: 0%, 250000 clock cycles

If 40% of all instructions are loads or stores, what is the effective memory access time for this machine?

Average memory access time =
    %instruction * (hit time instr cache + miss rate instr cache * (hit time second level cache + miss rate second level cache * (hit time main memory + miss rate main memory * (hit time disk cache + miss rate disk cache * hit time disk))))
  + %data * (hit time data cache + miss rate data cache * (hit time second level cache + miss rate second level cache * (hit time main memory + miss rate main memory * (hit time disk cache + miss rate disk cache * hit time disk))))

With 1.4 memory accesses per instruction, the % of instruction accesses = 1.0 / 1.4 = 71.4%, and the % of data accesses = 0.4 / 1.4 = 28.6%.

Average memory access time =
    71.4% * (1 + .01 * (10 + .10 * (100 + .002 * (1000 + .20 * 250000))))
  + 28.6% * (1 + .05 * (10 + .10 * (100 + .002 * (1000 + .20 * 250000))))
  = 1.647 clock cycles.
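The nested formula above is evaluated from the innermost level outward, which maps naturally onto a short loop. The following C sketch reproduces the calculation; the level_t type and the function names are assumptions made for this example, and the hit times and miss rates are simply the values given in the problem.

```c
#include <stddef.h>

/* Hypothetical description of one level of the hierarchy. */
typedef struct {
    double hit_time;  /* access time in clock cycles */
    double miss_rate; /* fraction of accesses passed to the next level */
} level_t;

/* Evaluate hit_time[0] + miss_rate[0] * (hit_time[1] + miss_rate[1] * ...)
 * by starting at the last level and working back toward the first. */
double access_time(const level_t *levels, size_t n)
{
    double t = 0.0;
    for (size_t i = n; i-- > 0; )
        t = levels[i].hit_time + levels[i].miss_rate * t;
    return t;
}

/* Weighted average over instruction fetches and data accesses, where
 * data_refs_per_instr is the fraction of instructions that are loads
 * or stores (0.4 in the problem above). */
double effective_access_time(double data_refs_per_instr)
{
    /* Values from the problem statement: {hit time, miss rate}. */
    const level_t instr_path[] = {
        {1.0, 0.01},     /* instruction cache */
        {10.0, 0.10},    /* L2 cache */
        {100.0, 0.002},  /* main memory */
        {1000.0, 0.20},  /* disk cache */
        {250000.0, 0.0}  /* swap space */
    };
    const level_t data_path[] = {
        {1.0, 0.05},     /* data cache */
        {10.0, 0.10},
        {100.0, 0.002},
        {1000.0, 0.20},
        {250000.0, 0.0}
    };
    double accesses = 1.0 + data_refs_per_instr; /* per instruction */
    double t_instr = access_time(instr_path, 5);
    double t_data = access_time(data_path, 5);

    return (1.0 / accesses) * t_instr
         + (data_refs_per_instr / accesses) * t_data;
}
```

Calling effective_access_time(0.4) yields approximately 1.647 cycles, in agreement with the hand calculation. Changing a single miss rate in the tables is a quick way to see how sensitive the result is to the slow lower levels.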
© Copyright 2025