Week 4 out-of-class notes, discussions and sample problems

Thread Level Parallelism

We seem to have reached the limit of ILP. We have studied means of obtaining more parallelism with LLP (loop unrolling) and we will also explore other forms of speculating across branches. But we have another source of potential parallelism: thread level parallelism (TLP). We have multithreaded operating systems, and much of the time we are running threads, not processes. So the question is, can we find parallelism to exploit among the threads of a single process?

Threads are multiple streams of execution within the same process, where each thread shares some of the same information with the other threads. Specifically, threads use the same code, the same memory space (including page table) and some of the same global values. Where they differ is that each thread has its own register values, stack space and local data. Context switches between threads are therefore faster than context switches between processes. With architectural support, a processor can exploit thread-level parallelism directly. Thus, multithreading can be supported within a single processor rather than issuing threads to multiple processors (or multiple cores), as is the case when exploiting process-level parallelism. There are three forms of thread-level support: fine-grained multithreading, coarse-grained multithreading, and simultaneous multithreading.

NOTE: to gain any advantage from TLP, there must be multiple threads in the given process. Threads are created by the programmer, although in some cases a compiler can also attempt to generate threads if the program is written in such a way that threads are indicated, for instance if the language directly supports parallelism with language constructs.

In fine-grained multithreading, context switches between threads of one process take place every clock cycle. Thus, the multiple threads are executed in an interleaved fashion. For instance, if there are three threads of an active process, T0, T1 and T2, they will execute as follows:

Cycle:   1   2   3   4   5   6   7   8   9   10  ...
Thread:  T0  T1  T2  T0  T1  T2  T0  T1  T2  T0  ...

Threads that are currently suspended (for instance because they are waiting for an event) are ignored. A queue of active threads for the given process is used in a round-robin fashion to issue one instruction per cycle. The primary advantage of this approach is that stalls introduced by mis-speculation or cache misses can be partially or completely hidden because the stall is "swallowed up" by the delay between executing one cycle of the thread and its next cycle. Above, if T0 in cycle 4 has a 2-cycle stall, by the time we revisit it in cycle 7, that stalling situation will have elapsed and the result is that T0 would not need to stall.

How can a processor support fine-grained multithreading? This is simple IF the processor contains multiple sets of registers. Recall that the only essential difference between threads T0 and T1 is their register values, stack space and perhaps some local variables. Assume all local variables are currently stored in registers and/or stack space. We can then view, from an execution perspective, that the only difference between T0 and T1 is the current values of their registers: PC, IR, stack pointer, status flags, and data/address registers. Assume we equip a CPU with two complete sets of registers. Then, we can switch between T0 and T1 easily by just alternating which set of registers the processor is using in each stage.
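To make the fine-grained approach concrete, here is a minimal sketch in C of the round-robin issue policy described above. The thread contexts, the number of cycles simulated, and the cache-miss stall injected for T0 are hypothetical values chosen only for illustration; this is not the organization of any particular processor.

#include <stdio.h>
#include <stdbool.h>

#define NUM_THREADS 3

/* Hypothetical per-thread context: all the processor must switch each
   cycle is this small amount of per-thread state.                     */
typedef struct {
    int  pc;           /* this thread's program counter                */
    int  stall_until;  /* first cycle at which the thread may issue    */
    bool active;       /* false if the thread is suspended             */
} ThreadContext;

int main(void) {
    ThreadContext t[NUM_THREADS] = { {0,0,true}, {0,0,true}, {0,0,true} };
    int next = 0;      /* round-robin pointer into the active threads  */

    for (int cycle = 1; cycle <= 12; cycle++) {
        int chosen = -1;
        /* Pick the next thread in round-robin order that is active and
           not in the middle of a stall.                                */
        for (int i = 0; i < NUM_THREADS; i++) {
            int cand = (next + i) % NUM_THREADS;
            if (t[cand].active && t[cand].stall_until <= cycle) {
                chosen = cand;
                break;
            }
        }
        if (chosen < 0) {
            printf("cycle %2d: nothing to issue\n", cycle);
            continue;
        }
        printf("cycle %2d: issue one instruction from T%d (pc=%d)\n",
               cycle, chosen, t[chosen].pc);
        t[chosen].pc++;

        /* Hypothetical 2-cycle stall (e.g., a cache miss) on T0's 4th
           instruction: T0 may not issue again until 2 extra cycles
           have passed.                                                 */
        if (chosen == 0 && t[chosen].pc == 4)
            t[chosen].stall_until = cycle + 3;

        next = (chosen + 1) % NUM_THREADS;
    }
    return 0;
}

With three threads in the rotation, T0 is not due to issue again until two cycles after its stall begins anyway, so the stall is absorbed by the rotation rather than appearing as wasted cycles; this is the effect described above for T0 in cycles 4 through 7. The only per-thread state the sketch carries is the small ThreadContext; in hardware, the analogous cost is the duplicated register set.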
Recall that in the MIPS 5-stage pipeline, the relevant register values are shunted down the pipeline stage by stage. Now we have two sets of registers to move down the pipeline, and each stage merely alternates between set 0 and set 1 in each successive cycle. Notice that if we limit our pipeline to executing just two threads, this does not introduce an enormous amount of overhead. We could not, however, use this approach to accommodate an unlimited number of threads.

The Sun T1 processor, introduced in 2005, performs fine-grained multithreading. The T1 is an 8-core processor, each core supporting up to 4 threads. Each core consists of a 6-stage, single-issue pipeline. Threads waiting due to a cache miss or pipeline stall are bypassed if they are reached again before the stalling situation is resolved. A core idles only if all four of its threads are idle or stalled. All 8 cores share a single FP functional unit, so a structural hazard arises if two or more threads of different cores attempt to issue an FP instruction in the same cycle. We will examine the T1's performance later in these notes.

Coarse-grained multithreading switches between threads only when a costly (time consuming) stall arises. Such a stall would be, for instance, caused by an L2 or L3 cache miss, but not an L1 cache miss or mis-speculation. The idea is that switching to a new thread and filling the pipeline with the new thread's instructions will be less time consuming than waiting out the stall. Unlike fine-grained multithreading, there is little or no hardware support required to implement this form of multithreading. There is still overhead associated with a context switch: the cost of switching the register values and the cost of filling the pipeline with the new thread's instructions. However, this is very cheap to implement, and so this form of TLP avoids lengthy stalls with little extra cost but does not promote true TLP parallelism the way the fine-grained approach does.

The most common form of TLP implementation is simultaneous multithreading. Here, multi-issue, dynamically scheduled processing is used. That is, in any cycle, multiple instructions are issued dynamically and scheduled using Tomasulo-style hardware. Unlike traditional dynamic scheduling, here the multiple instructions issued in any one cycle come from different threads. Recall the challenge of finding enough ILP in a single process to perform multiple issue in a cycle. But because there are several threads of the current process, each thread could have an instruction issued from it in each cycle without concern for dependence or conflict. Switching between threads is still available, but the switching occurs as a true context switch, that is, based on the timer and/or when a thread voluntarily leaves the processor for I/O or other matters.

Consider, for instance, a process of four threads (t0, t1, t2, t3) running on a simultaneous multithreaded processor that can issue up to two instructions per cycle. Assume at the moment that both t0 and t1 are active. The processor attempts to issue one instruction per cycle from both t0 and t1. At some point, a context switch causes the processor to switch from t0 to t2 and from t1 to t3. Now the processor attempts to issue one instruction per cycle from both t2 and t3. At the next context switch, the processor switches back to t0 and t1. So, we would see instruction fetches taking one instruction each from t0 and t1, cycle after cycle.
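As a rough illustration of the fetch behavior just described, here is a minimal sketch in C in which, each cycle, one instruction is fetched from each of the two currently active threads into a shared queue. The IssueQueue and FetchStream structures, the queue capacity and the per-thread PCs are assumptions made only for this sketch, not the front end of any real SMT processor.

#include <stdio.h>

#define QUEUE_CAP 16

/* Hypothetical shared instruction queue fed by the two active threads. */
typedef struct {
    int thread_id[QUEUE_CAP];
    int pc[QUEUE_CAP];
    int count;
} IssueQueue;

/* Per-thread fetch state: just a PC for this sketch. */
typedef struct {
    int id;
    int pc;
} FetchStream;

/* Each cycle, fetch one instruction from each of the two currently
   active threads and append them to the shared queue, as long as
   there is room.                                                      */
static void fetch_cycle(IssueQueue *q, FetchStream *a, FetchStream *b) {
    FetchStream *streams[2] = { a, b };
    for (int s = 0; s < 2 && q->count < QUEUE_CAP; s++) {
        q->thread_id[q->count] = streams[s]->id;
        q->pc[q->count] = streams[s]->pc++;
        q->count++;
    }
}

int main(void) {
    IssueQueue q = { {0}, {0}, 0 };
    FetchStream t0 = { 0, 0 }, t1 = { 1, 0 };

    for (int cycle = 1; cycle <= 4; cycle++)
        fetch_cycle(&q, &t0, &t1);

    for (int i = 0; i < q.count; i++)
        printf("queue slot %2d: thread t%d, pc %d\n",
               i, q.thread_id[i], q.pc[i]);
    /* After a context switch, the same routine would simply be called
       with t2 and t3 as the two active streams.                        */
    return 0;
}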
The issue stage would attempt to issue the next instruction in the queue from t0 and from t1 each cycle (until either the reservation stations were full or the ROB was full). Instructions would execute at reservation stations when their data became available and forward results on the CDB. Instructions would be committed each cycle as possible. At the next context switch, after the switch took place, the instruction fetch would begin to fill the queue with instructions from t2 and t3.

In the above example, t0 and t1 were "synched" together, as are t2 and t3. This is probably not likely, so we may instead see a behavior more like this:

t0-------------------t2-------------------t0-------------------t2-------------------t0
t3-------------------t1-------------------t3-------------------t1-------------------t3

Or some other behavior. For instance, imagine that midway through the above timeline, t0 is forced into a waiting situation. This may leave the processor with just 3 threads, or alternatively there may be a fifth thread that can be switched to.

The simultaneous multithreaded solution is more popular than fine-grained because many processors already have the Tomasulo-style, dynamically scheduled, multi-issue hardware in place. All that is needed on top of this is the ability to handle per-thread register renaming tables and the ability to maintain separate PCs for the threads being scheduled (in the above example, we would need 2 PCs, one for t0/t2 and one for t1/t3). If we wish to issue instructions from 4 threads per cycle, we would need 4 PCs. The Pentium 4 Extreme is one such processor.

In summary, there are four different ways that a processor can handle threads. First, we can use a superscalar with no multithreading support. In this case, threads can still be switched between, but the switch has no hardware support, and when a thread stalls, another thread does not fill in the stall cycles. Next is fine-grained multithreading. Here, each clock cycle is given to a separate thread. As stated above, the T1 uses a single, simple pipeline; however, with superscalar support, each cycle could potentially issue multiple instructions of the single thread. By switching between threads each cycle, stalls go largely unnoticed because of the other threads. In coarse-grained multithreading, a single thread executes until it stalls (if it is a lengthy stall), and then the next thread is executed. Long stalls still have an impact because of the time required for the context switch and the time to refill the pipeline, but there is a reduction in the impact of the stalls. With superscalar support, the processor attempts to dynamically issue multiple instructions per cycle from the single thread. Finally, in simultaneous multithreading, multiple threads and multiple instructions are issued in each cycle, with stalls largely being "swallowed up" because other threads and other instructions are available.

The figure below illustrates these concepts. In each case in this figure, we have a 4-issue superscalar, so up to 4 instructions can be issued per cycle. In the first three of the four forms of TLP, the instructions issued in a cycle are from the same thread. Only in the simultaneous multithreaded approach can we issue instructions of different threads in any one cycle. The white boxes in the figure are empty issue slots. These are caused by not having enough instructions to issue in one cycle, or, in the cases of the superscalar with no TLP and the coarse-grained approach, actual stalling situations.
Shaded boxes represent the different threads. Notice that the fewest empty slots occur in the simultaneous multithreading processor, and the next fewest occur in the fine-grained multithreaded processor.

Let's examine the performance of the Sun T1 (fine-grained) and Pentium 4 Extreme (simultaneous). A T1 running one core (the typical T1 has 8 cores) supports fine-grained multithreading, so it switches between threads every clock cycle. The T1 is set up to switch between four threads. This requires four sets of registers to trade between as instructions move down the pipeline. The sources of stalls in the T1 are:

1. Structural hazards on the single FP functional unit (this is not an issue if we assume a single core because at most 1 FP operation can be issued in any cycle; we will assume the FP unit is pipelined).
2. 3-cycle penalties from RAW hazards after loads and for branches; these can be "hidden" by the other 3 threads if 4 threads are executing. If fewer than 4 threads are available, these penalties could impact performance.
3. A 23-cycle penalty from an L1 miss, assuming no contention for access to L2 (4 separate L2 caches help reduce any such contention).
4. All four of the threads that are currently "active" for the processor are idle.

How much can the multithreaded approach "hide" the impact of the above stalls? Obviously, if the T1 is running a single thread, the stalls from 1 do not arise and the stalls from 2 fully impact performance because there is no recourse other than to wait (4 does not come into play because a stall causes the only thread to idle). So we really want to see what impact 3 has on our performance. Based on statistics gathered from running benchmarks on the T1 using single threading versus multithreading, an L1 miss when running a single thread has between a 10% and 25% higher impact than an L1 miss when running four threads. An L2 miss when running a single thread has less than a 10% increase in latency over running four threads. See figure 3.30 on page 228. This indicates that the cache miss penalty is so large that, whether running one thread or four, it still greatly impacts performance. However, as stated earlier in this paragraph, when running four threads, the stalls from 1 and 2 are largely subsumed. Unfortunately, 30-50% of the time that threads are "idle" (see 4 above) is caused by cache misses, so in spite of the performance increase obtained by fine-grained multithreading, the processor is still susceptible to this problem. The CPI for the single-core T1 on three popular benchmarks averages about 1.6, so the CPI is still higher (worse) than what we would see in a typical superscalar processor. Does this mean that a fine-grained approach is not useful? No. Recall that the T1 is not a superscalar, does not use dynamic scheduling, nor does it use branch speculation. Perhaps a more aggressive processor could take more advantage of the fine-grained approach.

In order to achieve a worthwhile speedup through simultaneous multithreading, you need additional hardware support. First, recall that in simultaneous multithreading, you are issuing multiple instructions per cycle from multiple threads. Thus, you must be able to fetch several instructions at one time from several threads (for instance, 2-4 instructions per thread). This in itself calls for a wide bus and larger caches. Second, you need to dynamically schedule and issue instructions.
This calls for multiple register renaming tables, numerous functional units including parallel load/store units (one per thread), and additional renaming registers and reservation stations. It can also call for more common data buses, as well as a greater amount of logic that must, in parallel, track dependencies within any one thread. It would also benefit from accurate branch prediction.

According to the authors, when the simultaneous multithreading approach was first researched in the 2000-2001 time period, it was thought that this approach would be followed over the next 5 years. No processor, however, has come close to the level discussed. Instead, restrictions are made, such as no more than one thread being fetched and issued at a time and no more than four instructions issued at a time. The Pentium 4 Extreme achieves a speedup of only 1% (integer) to 7% (FP) over an ordinary Pentium 4. Implementing simultaneous multithreading on an Intel i7 provides a speedup of about 28% on two Java benchmark programs. In this case, the i7 supports two threads and can issue up to 6 micro-operations per cycle between the two threads. Therefore, we see the ability to issue more instructions (micro-ops) per cycle because the second thread provides greater parallelism. The conclusion we can draw, however, is that ultimately thread level parallelism does not offer us enough parallelism to avoid the common stalling situations in any architecture, while still requiring additional support. Also keep in mind that multiple threads are only available in a multi-threaded process. We will return to thread-level parallelism at the end of the semester when we briefly consider multi-processor systems. In week 8, we will consider multiprocessor and vector solutions to handling multiple processes.

Sample Problems:

We wrap up these notes with some sample problems on material covered in the lecture earlier in the week. Let's return to the idea of a statically scheduled superscalar.

1. Consider the following simple loop. Assume the MUL.D takes 7 cycles to compute. Perform loop unrolling and scheduling to remove all stalls and issue up to 2 instructions per cycle on a superscalar pipeline which permits any single integer operation and any single FP operation per cycle. Assume loads, stores and branches are part of the integer pipeline, not the FP. Use the MIPS FP pipeline and assume that an FP and a load/store can enter the MEM stage at the same time.

Loop:  L.D    F0, 0(R1)
       MUL.D  F2, F0, F4
       S.D    F2, 0(R1)
       DSUBI  R1, R1, #8
       BNE    R1, R2, Loop

Solution: We need to find 6 instructions to put between the cycle that the first MUL.D is issued and the first S.D. This requires having 7 total loop iterations.

       Integer instruction        FP instruction
Loop:  L.D    F0, 0(R1)
       L.D    F6, 8(R1)
       L.D    F10, 16(R1)         MUL.D  F2, F0, F4
       L.D    F14, 24(R1)         MUL.D  F8, F6, F4
       L.D    F18, 32(R1)         MUL.D  F12, F10, F4
       L.D    F22, 40(R1)         MUL.D  F16, F14, F4
       L.D    F26, 48(R1)         MUL.D  F20, F18, F4
       DSUBI  R1, R1, #56         MUL.D  F24, F22, F4
       S.D    F2, -56(R1)         MUL.D  F28, F26, F4
       S.D    F8, -48(R1)
       S.D    F12, -40(R1)
       S.D    F16, -32(R1)
       S.D    F20, -24(R1)
       S.D    F24, -16(R1)
       BNE    R1, R2, Loop
       S.D    F28, -8(R1)

2. Repeat #1 assuming that you can issue up to 1 FP instruction per cycle, but you can issue up to 2 integer instructions per cycle.

Solution: This solution is more complicated than that of #1 because we can now place L.D and S.D operations in the second integer pipeline. This requires a greater number of unrollings. What we will quickly find out is that we run out of registers! So I will assume that we have 64 registers instead of 32.
       Integer instruction 1      Integer instruction 2      FP instruction
Loop:  L.D    F0, 0(R1)           L.D    F6, 8(R1)
       L.D    F10, 16(R1)         L.D    F14, 24(R1)
       L.D    F18, 32(R1)                                    MUL.D  F2, F0, F4
       L.D    F22, 40(R1)                                    MUL.D  F8, F6, F4
       L.D    F26, 48(R1)                                    MUL.D  F12, F10, F4
       L.D    F30, 56(R1)         L.D    F34, 64(R1)         MUL.D  F16, F14, F4
       L.D    F38, 72(R1)         DSUBI  R1, R1, #80         MUL.D  F20, F18, F4
       S.D    F2, -80(R1)         S.D    F8, -72(R1)         MUL.D  F24, F22, F4
       S.D    F12, -64(R1)        S.D    F16, -56(R1)        MUL.D  F28, F26, F4
       S.D    F20, -48(R1)        S.D    F24, -40(R1)        MUL.D  F32, F30, F4
       S.D    F28, -32(R1)        S.D    F32, -24(R1)        MUL.D  F36, F34, F4
       S.D    F36, -16(R1)                                   MUL.D  F40, F38, F4
       BNE    R1, R2, Loop        S.D    F40, -8(R1)

3. Repeat #1 on a VLIW which can support up to 2 load/store, 2 FP and 1 integer operation per cycle.

Solution: For brevity, all registers and addresses are omitted. We would need registers F0-F62!

       Memory op 1   Memory op 2   FP op 1   FP op 2   Integer op
Loop:  L.D           L.D
       L.D           L.D
       L.D           L.D           MUL.D     MUL.D
       L.D           L.D           MUL.D     MUL.D
       L.D           L.D           MUL.D     MUL.D
       L.D           L.D           MUL.D     MUL.D
       L.D           L.D           MUL.D     MUL.D
       L.D           L.D           MUL.D     MUL.D     DSUBI
       S.D           S.D           MUL.D     MUL.D
       S.D           S.D           MUL.D     MUL.D
       S.D           S.D
       S.D           S.D
       S.D           S.D
       S.D           S.D
       S.D           S.D                               BNE
       S.D           S.D

4. Show how the following code would execute on a 2-issue Tomasulo-based superscalar with speculation. There are 2 FP adders, 1 load/store unit and 1 integer unit. Use a table like that on slide 31 of the lecture4.pptx notes. Assume an FP ADD.D takes 4 cycles to execute and that up to 2 instructions can commit per cycle. Assume perfect branch prediction. Show the first 2 iterations of the loop.

Loop:  L.D    F0, 0(R1)
       L.D    F2, 8(R1)
       ADD.D  F4, F0, F2
       S.D    F6, 0(R2)
       DADDI  R1, R1, #16
       DADDI  R2, R2, #8
       BNE    R2, R3, Loop

Solution:

Iteration  Instruction   Issue  Exec   Mem Acc  CDB  Commit  Comments
1          L.D F0        1      2      3        4    5
1          L.D F2        1      4      5        6    7       Only 1 load at a time
1          ADD.D         2      7-10            11   12
1          S.D           2      3                    13      Has to wait for ADD.D to commit
1          DADDI (R1)    3      4               5    14
1          DADDI (R2)    3      5               6    14      Up to 2 commits per cycle
1          BNE           4      7                    15
2          L.D F0        5      6      7        8    16
2          L.D F2        5      7      8        9    17
2          ADD.D         6      10-13           14   18
2          S.D           6      7                    19      Has to wait for DADDI before it executes
2          DADDI (R1)    7      8               10   20      Collision on CDB with L.D F0 of iteration 2
2          DADDI (R2)    7      9               12   20      Collision on CDB with first ADD.D
2          BNE           8      13                   21

5. Provide an example in C or Java of a sequence of if-else statements where some of the branch conditions could benefit from correlating predictors.

Solution:

if (x != y)              // condition 1
    if (y == z)          // condition 2
    {
        y = 0;
        …
    }
if (y == 0) { … }        // condition 3
if (x == 0) { … }        // condition 4

Assume the … has no impact on either x or y. If condition 2 is reached and is true, then condition 3 will always be true. If condition 1 is false and condition 3 is true, then condition 4 will be true. Finally, if condition 1 is true and condition 3 is true, then condition 4 will be false. (A small sketch of a correlating predictor exploiting this behavior appears after problem 6.)

6. Given a 2-issue Tomasulo-style superscalar, let's attempt to determine the impact of a branch misprediction on performance. Assume 1 integer unit and issue/execute/CDB/commit behavior like that of problem 4 above. How many cycles of penalty would we see on a mispredicted branch? Can we figure this out in an absolute way?

Solution: As we see in #4, the BNE does not commit until after all instructions preceding it commit, so the penalty will be based on the code that comes before it. Since the first BNE commits in cycle 15, we would have already fetched at least 1 additional complete iteration, a penalty of 7 instructions over 4 cycles. But, since we do not know the exact instruction mix, we cannot provide an absolute penalty. It would almost certainly be more than 4 cycles in any FP code.
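Relating back to problem 5: below is a minimal sketch in C of a (1,2) correlating predictor, that is, a table of 2-bit saturating counters indexed by a few low bits of the branch address together with 1 bit of global history. The table size, the branch addresses and the sequence of x values are made up purely for illustration, and "taken" is treated here as "condition evaluated true."

#include <stdio.h>

#define TABLE_BITS 4
#define TABLE_SIZE (1 << TABLE_BITS)

/* One bit of global history plus low bits of the branch address select a
   2-bit saturating counter; this is a (1,2) correlating predictor.      */
static unsigned char counters[2][TABLE_SIZE];  /* [history][index]       */
static int global_history = 0;                 /* outcome of last branch */

static int predict(unsigned addr) {
    unsigned idx = addr & (TABLE_SIZE - 1);
    return counters[global_history][idx] >= 2;  /* taken if counter 2 or 3 */
}

static void update(unsigned addr, int taken) {
    unsigned idx = addr & (TABLE_SIZE - 1);
    unsigned char *c = &counters[global_history][idx];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    global_history = taken ? 1 : 0;
}

int main(void) {
    /* Hypothetical branch addresses for conditions 1 and 4 of problem 5. */
    const unsigned br1 = 0x40, br4 = 0x4c;
    int x_values[] = { 0, 5, 0, 7, 0, 3 };

    for (int i = 0; i < 6; i++) {
        int x = x_values[i], y = 0;      /* y is 0 throughout this run    */
        int taken1 = (x != y);           /* condition 1                   */
        update(br1, taken1);             /* history now records cond. 1   */
        int p = predict(br4);            /* predict condition 4           */
        int taken4 = (x == 0);           /* condition 4                   */
        printf("x=%d: predict cond4 %staken, actually %staken\n",
               x, p ? "" : "not ", taken4 ? "" : "not ");
        update(br4, taken4);
    }
    return 0;
}

After the first couple of occurrences, the counter selected when condition 1 was true learns "not taken" for condition 4, while the counter selected when condition 1 was false learns "taken"; a single shared counter with no history bit would tend to keep mispredicting on an alternating pattern of x values.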
© Copyright 2024