Week 4 out-of-class notes, discussions and sample problems
Thread Level Parallelism
We seem to have reached the limit of ILP. We have studied means of obtaining more parallelism with
loop-level parallelism (LLP, via loop unrolling) and we will also explore other forms of speculating across
branches. But we have another source of potential parallelism: thread level parallelism (TLP). We have
multithreaded operating systems, and much of the time we are running threads, not processes. So the
question is: can we find parallelism to exploit among the threads of a single process?
Threads are multiple running instances of the same process where each thread shares some of the same
information with the other threads. Specifically, threads use the same code, the same memory space
(including the page table) and some of the same reference values. Where they differ is that each thread has
its own register values, stack space and local data. Context switches between threads are faster than
context switches between processes. Moreover, with architectural support, a processor can exploit thread-level
parallelism directly. Thus, multithreading can be supported within a single processor rather than issuing
threads to multiple processors (or multiple cores), as is the case when exploiting process-level parallelism.
There are three forms of thread-level support: fine-grained multithreading, coarse-grained multithreading,
and simultaneous multithreading.
NOTE: to gain any advantage from TLP, there must be multiple threads in the given process. Threads are
implemented by the programmer, although in some cases a compiler can also attempt to generate threads
if the program is written in such a way that threads are indicated, for instance, if the language directly
supports parallelism with language constructs.
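As a point of reference, here is a minimal sketch of programmer-created threads using POSIX threads in C.
The function name worker and the thread count are illustrative; the point is simply that all threads run the
same code while each gets its own stack, registers and local data.

/* Minimal sketch of programmer-created threads (POSIX threads). */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4   /* illustrative thread count */

/* Every thread executes the same code, but has its own stack and registers. */
static void *worker(void *arg) {
    long id = (long)arg;              /* per-thread local data */
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);   /* wait for all threads to finish */
    return 0;
}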
In fine-grained multithreading, context switches between threads of one process take place every clock
cycle. Thus, the multiple threads are executed in an interleaved fashion. For instance, if there are three
threads of an active process, T0, T1 and T2, they will execute as follows:
Cycle:   1    2    3    4    5    6    7    8    9    10   …
Thread:  T0   T1   T2   T0   T1   T2   T0   T1   T2   T0   …
Threads that are currently suspended (for instance, because they are waiting for an event) are skipped. A
queue of active threads for the given process is used in round-robin fashion to issue one instruction per
cycle. The primary advantage of this approach is that stalls introduced by miss-speculation or cache
misses can be partially or completely hidden, because the stall is “swallowed up” by the delay between
executing one cycle of a thread and its next cycle. Above, if T0 incurs a 2-cycle stall in cycle 4, by the
time we revisit it in cycle 7 that stalling situation will have elapsed, and the result is that T0 does not
need to stall.
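The following is a minimal sketch, in C, of the round-robin issue policy described above (not the logic of
any particular processor). Each cycle the next ready thread issues one instruction; a stalled thread is
skipped and revisited later, which is how its stall cycles get hidden. The thread count and the 2-cycle stall
injected for T0 after cycle 4 mirror the example above.

/* Sketch of fine-grained, round-robin issue with stall hiding. */
#include <stdio.h>

#define NTHREADS 3

int main(void) {
    int ready_at[NTHREADS] = {1, 1, 1};  /* first cycle each thread may issue */
    int next = 0;                        /* round-robin pointer */

    for (int cycle = 1; cycle <= 12; cycle++) {
        int issued = -1;
        /* scan the threads once, starting from the round-robin pointer */
        for (int k = 0; k < NTHREADS; k++) {
            int t = (next + k) % NTHREADS;
            if (ready_at[t] <= cycle) {  /* thread is ready: issue from it */
                issued = t;
                next = (t + 1) % NTHREADS;
                break;
            }
        }
        if (issued >= 0)
            printf("cycle %2d: issue from T%d\n", cycle, issued);
        else
            printf("cycle %2d: all threads stalled\n", cycle);

        /* illustrative event: T0 takes a 2-cycle stall after issuing in cycle 4,
           so it cannot issue in cycles 5-6 and is ready again in cycle 7 */
        if (issued == 0 && cycle == 4)
            ready_at[0] = cycle + 3;
    }
    return 0;
}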
How can a processor support fine-grained multithreading? This is simple IF the processor contains
multiple sets of registers. Recall that the only essential difference between threads T0 and T1 is their
register values, stack space and perhaps some local variables. Assume all local variables are currently
stored in registers and/or stack space. From an execution perspective, then, the only difference between
T0 and T1 is the current values of their registers: PC, IR, stack pointer, status flags, and data/address
registers. Assume we equip a CPU with two complete sets of registers. Then we can switch between T0
and T1 easily by just alternating which set of registers the processor is using in each stage. Recall that in
the MIPS 5-stage pipeline, relevant register values are shunted down the pipeline stage by stage. Now we
have two sets of registers to move down the pipeline, and each stage merely alternates between set 0 and
set 1 in successive cycles. Notice that if we limit our pipeline to executing just two threads, this does not
introduce an enormous amount of overhead. We could not, however, use this approach to accommodate
an unlimited number of threads.
The Sun T1 processor, introduced in 2005, performs fine-grained multithreading. The T1 is an 8-core
processor, each core supporting up to 4 threads. Each core consists of a 6-stage, single-issue pipeline.
Threads waiting due to a cache miss or pipeline stall are bypassed if they are reached again before the
stalling situation is resolved. A core idles only if all four threads are idle or stalled. All 8 cores share a
single FP functional unit, so a structural hazard arises if two or more threads of different cores attempt to
issue an FP instruction in the same cycle. We will examine the T1’s performance later in these notes.
Coarse-grained multithreading switches between threads only when a costly (time-consuming) stall arises.
Such a stall would be caused, for instance, by an L2 or L3 cache miss, but not by an L1 cache miss or
miss-speculation. The idea is that switching to a new thread and filling the pipeline with the new thread’s
instructions will be less time consuming than waiting out the stall. Unlike fine-grained multithreading,
little or no hardware support is required to implement this form of multithreading. There is still overhead
associated with a context switch: the cost of switching the register values and the cost of filling the
pipeline with the new thread’s instructions. However, this is very cheap to implement, and so this form of
TLP avoids lengthy stalls at little extra cost, but it does not promote true TLP parallelism like the
fine-grained approach.
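A minimal sketch of the coarse-grained policy follows, with assumed numbers: the stall-length threshold
and the pipeline-refill cost are illustrative, not taken from any real design. The only decision being made
is whether an observed stall is long enough to justify paying the refill cost of a switch.

/* Sketch of a coarse-grained switch-on-long-stall policy (assumed numbers). */
#include <stdio.h>

#define SWITCH_COST 3    /* cycles to refill the pipeline (assumed) */
#define LONG_STALL  20   /* stalls at least this long trigger a switch (assumed) */

/* Decide whether a stall of 'latency' cycles should trigger a thread switch. */
static int should_switch(int latency) {
    return latency >= LONG_STALL && latency > SWITCH_COST;
}

int main(void) {
    int events[] = {1, 2, 23, 1, 40};   /* illustrative stall latencies */
    int n = sizeof(events) / sizeof(events[0]);
    for (int i = 0; i < n; i++) {
        if (should_switch(events[i]))
            printf("stall of %2d cycles: switch threads (pay %d-cycle refill)\n",
                   events[i], SWITCH_COST);
        else
            printf("stall of %2d cycles: keep running, absorb the stall\n",
                   events[i]);
    }
    return 0;
}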
The most common form of TLP implementation is simultaneous multithreading. Here, multi-issue,
dynamically scheduled processing is used. That is, at any cycle, multiple instructions are issued
dynamically and scheduled using Tomasulo-style hardware. Unlike traditional dynamic scheduling, here
the multiple instructions issued in any one cycle come from different threads. Recall the challenge of
finding enough ILP in a single process to perform multiple issue in a cycle. But because there are several
threads of the current process, each thread could have an instruction issued from it in each cycle without
concern of dependence or conflict. Switching between threads is still available, but the switching occurs
as a true context switch, that is, based on the timer and/or when a thread voluntarily leaves the processor
for I/O or other matters.
Consider, for instance, a process of four threads (t0, t1, t2, t3) running on a simultaneous multithreaded
processor that can issue up to two instructions per cycle. Assume at the moment that both t0 and t1 are
active. The processor attempts to issue one instruction per cycle from both t0 and t1. At some point, a
context switch causes the processor to switch from t0 to t2 and t1 to t3. Now the processor attempts to
issue one instruction per cycle from both t2 and t3. At the next context switch, the processor switches
back to t0 and t1.
So, we would see instruction fetches taking one instruction each from t0 and t1, cycle after cycle. The
issue stage would attempt to issue the next instruction in the queue from t0 and from t1 each cycle (until
either reservation stations were full or the ROB was full). Instructions would execute at reservation
stations when data became available and forward results on the CDB. Instructions would be committed
each cycle as possible. At the next context switch, after the switch took place, the instruction fetch would
begin to fill the queue with instructions from t2 and t3.
In the above example, t0 and t1 were “synched” together as are t2 and t3. This is probably not likely, so
for instance we may see a behavior more like this:
Slot 1:  t0----------------t2----------------t0----------------t2----------------t0--------
Slot 2:  --------t3----------------t1----------------t3----------------t1----------------t3
Or some other behavior. For instance, imagine that midway through the above timeline, t0 is forced into
a waiting situation. This may leave the processor with just 3 threads, or alternatively there may be a fifth
thread that can be switched to.
The simultaneous multithreaded solution is more popular than fine-grained because many processors
already have the Tomasulo-style, dynamically scheduled, multi-issue hardware in place. All that is needed
on top of this is the ability to handle per-thread register renaming tables and the ability to maintain
separate PCs for the threads being scheduled (in the above example, we would need 2 PCs, one for t0/t2
and one for t1/t3). If we wished to issue instructions from 4 threads per cycle, we would need 4 PCs. The
Pentium 4 Extreme is one such processor.
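A minimal sketch of the front-end bookkeeping this requires appears below, with illustrative names and
sizes: one PC (and one renaming table) per hardware thread context, with the fetch stage taking one
instruction from each context in the same cycle.

/* Sketch of per-thread-context state for simultaneous multithreading. */
#include <stdio.h>

#define CONTEXTS 2   /* illustrative: two hardware thread contexts */

struct hw_context {
    int thread_id;        /* which software thread currently owns this context */
    unsigned long pc;     /* per-thread program counter */
    int rename_table[32]; /* per-thread register renaming map (illustrative size) */
};

int main(void) {
    struct hw_context ctx[CONTEXTS] = {
        { .thread_id = 0, .pc = 0x1000 },
        { .thread_id = 1, .pc = 0x2000 },
    };

    for (int cycle = 1; cycle <= 4; cycle++) {
        /* fetch one instruction from each context in the same cycle */
        for (int c = 0; c < CONTEXTS; c++) {
            printf("cycle %d: fetch from t%d at PC 0x%lx\n",
                   cycle, ctx[c].thread_id, ctx[c].pc);
            ctx[c].pc += 4;   /* assume 4-byte instructions, no branch taken */
        }
    }
    return 0;
}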
In summary, there are four different ways that a processor can handle TLP. First, we can use a
superscalar with no multithreading support. In this case, the OS can still switch between threads, but the
hardware provides no support for the switch, and when a thread stalls, another thread does not fill in the
stall cycles. Next is fine-grained multithreading. Here, a different thread executes in each clock cycle. As
stated above, the T1 used a single, simple pipeline; with superscalar support, however, each cycle could
potentially issue multiple instructions of the single thread. By switching between threads every cycle,
stalls go largely unnoticed because of the other threads. In coarse-grained multithreading, a single thread
executes until it encounters a lengthy stall, and then the next thread is executed. Long stalls still have an
impact because of the time required for the context switch and the time to refill the pipeline, but the
impact of the stalls is reduced. With superscalar support, the processor attempts to dynamically issue
multiple instructions per cycle from the single thread. Finally, in a simultaneous multithreaded processor,
multiple instructions from multiple threads are issued in each cycle, with stalls largely being “swallowed
up” because other threads and other instructions are available. The figure below illustrates these
concepts. In each case in this figure, we have a 4-issue superscalar, so up to 4 instructions can be issued
per cycle. In the first three of the four forms of TLP, the instructions issued in a cycle are from the same
thread. Only in the simultaneous multithreaded approach can we issue instructions of different threads in
any one cycle. The white boxes in the figure are empty issue slots. These are caused by there not being
enough instructions to issue in one cycle or, in the cases of the superscalar with no TLP and the
coarse-grained approach, actual stalling situations. Shaded boxes belong to the different threads. Notice
that the fewest empty slots occur in the simultaneous multithreading processor, and the next fewest in the
fine-grained multithreaded processor.
Let’s examine the performance of the Sun T1 (fine-grained) and Pentium 4 Extreme (simultaneous). A
T1 running one core (the typical T1 has 8 cores) supports fine-grained multithreading, so it switches
between threads in every clock cycle. The T1 is set up to switch between four threads. This requires
four sets of registers to alternate between as instructions move down the pipeline.
The sources of stalls in the T1 arise from:
1. Structural hazards on the single FP functional unit (this is not an issue if we assume a single core
because at most one FP operation can be issued in any cycle; we will assume the FP unit is
pipelined).
2. 3-cycle penalties from RAW hazards after loads and from branches – these can be “hidden” by the
other 3 threads if 4 threads are executing. If fewer than 4 threads are available, these penalties
could impact performance.
3. A 23-cycle penalty for an L1 miss, assuming no contention for access to L2 (4 separate L2 caches
help reduce any such contention).
4. All four of the threads that are currently “active” on the core being idle or stalled, in which case
the core itself idles.
How much can the multithreaded approach “hide” the impact of the above stalls? Obviously, if the T1 is
running a single thread, the stalls from 1 do not arise, and the stalls from 2 fully impact performance
because there is no recourse other than to wait (4 does not come into play because a stall causes the only
thread to idle). So we really want to see what impact 3 has on our performance. Based on statistics
gathered from running benchmarks on the T1 with single threading versus multithreading, an L1 miss in
a single thread has between a 10% and 25% higher impact than an L1 miss when running four threads.
An L2 miss when running a single thread has less than a 10% increase in latency over running four
threads. See figure 3.30 on page 228. This indicates that the cache penalty is so large that, whether we
run one thread or four, it still impacts performance greatly. However, as stated earlier in this paragraph,
when running four threads, stalls from 1 and 2 are largely subsumed. Unfortunately, 30-50% of the cases
in which threads are “idle” (see 4 above) are caused by cache misses, so in spite of the performance
increase obtained by fine-grained multithreading, the processor is still susceptible to this problem.
The CPI for the single-core T1 on three popular benchmarks averages about 1.6; therefore the CPI is still
higher than what we would see in a typical superscalar processor. Does this mean that a fine-grained
approach is not useful? No. Recall that the T1 is not a superscalar, does not use dynamic scheduling, nor
does it use branch speculation. Perhaps a more aggressive processor could take more advantage of the
fine-grained approach.
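To get a feel for why cache misses still dominate, here is a back-of-envelope sketch in C. The 23-cycle
penalty is the one listed above; the base CPI of 1, the 5% miss rate and the fraction of miss cycles covered
by the other threads are purely assumed, illustrative numbers, not T1 measurements.

/* Back-of-envelope effect of hiding miss cycles with other threads (assumed numbers). */
#include <stdio.h>

int main(void) {
    double base_cpi  = 1.0;
    double miss_rate = 0.05;   /* assumed misses per instruction */
    double penalty   = 23.0;   /* L1-to-L2 miss penalty from the notes */
    double hidden    = 0.75;   /* assumed fraction of miss cycles covered by other threads */

    double single = base_cpi + miss_rate * penalty;
    double multi  = base_cpi + miss_rate * penalty * (1.0 - hidden);

    printf("single-thread CPI:      %.2f\n", single);  /* 1 + 0.05*23        = 2.15 */
    printf("per-thread CPI, hidden: %.2f\n", multi);   /* 1 + 0.05*23*0.25   = 1.29 */
    return 0;
}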
In order to achieve a worthwhile speedup through simultaneous multithreading, you need additional
hardware support. First, recall that in simultaneous multithreading, you are issuing multiple instructions
per cycle from multiple threads. Thus, you must be able to fetch several instructions at one time from
several threads (for instance, 2-4 instructions per thread). This in itself calls for a wide bus and larger
caches. Second, you need to dynamically schedule and issue instructions. This calls for multiple register
renaming tables, numerous functional units including parallel load/store units (one per thread), and
additional renaming registers and reservation stations. It can also call for more common data buses. It
will also call for a greater amount of logic that must, in parallel, track dependencies within any one
thread. It would also benefit from accurate branch prediction. According to the authors, when the
simultaneous multithreading approach was first researched in the 2000-2001 time period, it was thought
that this approach would be followed over the next 5 years. No processor, however, has gotten close to the
level discussed. Instead, restrictions are made, such as no more than one thread being fetched and issued
at a time and no more than four instructions issued at a time.
The Pentium 4 Extreme achieves a speedup of only 1% (integer) to 7% (FP) over an ordinary
Pentium 4. Implementing simultaneous multithreading on an Intel i7 provides a speedup of about 28% on
two Java benchmark programs. In this case, the i7 supports two threads and can issue up to 6
micro-operations per cycle between the two threads. Therefore, we see the ability to issue more instructions
(micro-ops) per cycle because the second thread provides greater parallelism. The conclusion we can
draw, however, is that ultimately thread-level parallelism does not offer us enough parallelism to avoid the
common stalling situations in any architecture, while requiring additional support. Also keep in mind that
multiple threads are only available in a multithreaded process. We will return to thread-level parallelism
at the end of the semester when we briefly consider multiprocessor systems.
In week 8, we will consider multiprocessor and vector solutions to handling multiple processes.
Sample Problems: We wrap up these notes with some sample problems on material covered in the lecture
earlier in the week. Let’s return to the idea of a statically scheduled superscalar.
1. Consider the following simple loop. Assume the MUL.D takes 7 cycles to compute. Perform loop
unrolling and scheduling to remove all stalls and issue up to 2 instructions per cycle on a superscalar
pipeline which permits any single integer operation and any single FP operation per cycle. Assume
loads, stores and branches are part of the integer pipeline, not the FP. Use the MIPS FP pipeline and
assume that an FP and Load/store can enter the MEM stage at the same time.
Loop: L.D    F0, 0(R1)
      MUL.D  F2, F0, F4
      S.D    F2, 0(R1)
      DSUBI  R1, R1, #8
      BNE    R1, R2, Loop
Solution: We need to find 6 instructions to put between the cycle that the first MUL.D is issued
and the first S.D. This requires having 7 total loop iterations.
Cycle   Integer pipe                 FP pipe
1       Loop: L.D   F0, 0(R1)
2             L.D   F6, -8(R1)
3             L.D   F10, -16(R1)     MUL.D  F2, F0, F4
4             L.D   F14, -24(R1)     MUL.D  F8, F6, F4
5             L.D   F18, -32(R1)     MUL.D  F12, F10, F4
6             L.D   F22, -40(R1)     MUL.D  F16, F14, F4
7             L.D   F26, -48(R1)     MUL.D  F20, F18, F4
8             DSUBI R1, R1, #56      MUL.D  F24, F22, F4
9             S.D   F2, 56(R1)       MUL.D  F28, F26, F4
10            S.D   F8, 48(R1)
11            S.D   F12, 40(R1)
12            S.D   F16, 32(R1)
13            S.D   F20, 24(R1)
14            S.D   F24, 16(R1)
15            BNE   R1, R2, Loop
16            S.D   F28, 8(R1)
2. Repeat #1 assuming that you can issue up to 1 FP instruction per cycle, but you can issue up to 2
integer instructions per cycle.
Solution: This solution is more complicated than that of #1 because we can now place L.D and S.D
operations in the second pipeline. This requires a greater number of unrollings. What we will quickly
find out is that we run out of registers! So I will assume that we have 64 registers instead of 32.
Cycle   Integer pipe                  Second pipe (int or FP)
1       Loop: L.D   F0, 0(R1)         L.D   F6, -8(R1)
2             L.D   F10, -16(R1)      L.D   F14, -24(R1)
3             L.D   F18, -32(R1)      MUL.D F2, F0, F4
4             L.D   F22, -40(R1)      MUL.D F8, F6, F4
5             L.D   F26, -48(R1)      MUL.D F12, F10, F4
6             L.D   F30, -56(R1)      MUL.D F16, F14, F4
7             L.D   F34, -64(R1)      MUL.D F20, F18, F4
8             L.D   F38, -72(R1)      MUL.D F24, F22, F4
9             DSUBI R1, R1, #80       MUL.D F28, F26, F4
10            S.D   F2, 80(R1)        MUL.D F32, F30, F4
11            S.D   F8, 72(R1)        MUL.D F36, F34, F4
12            S.D   F12, 64(R1)       MUL.D F40, F38, F4
13            S.D   F16, 56(R1)
14            S.D   F20, 48(R1)
15            S.D   F24, 40(R1)
16            S.D   F28, 32(R1)
17            S.D   F32, 24(R1)
18            S.D   F36, 16(R1)
19            BNE   R1, R2, Loop      S.D   F40, 8(R1)
3. Repeat #1 on a VLIW which can support up to 2 load/store, 2 FP and 1 integer operation per cycle.
Solution: For brevity, all registers and addresses are omitted. We would need registers F0-F62!
Cycle   Load/store 1   Load/store 2   FP 1     FP 2     Integer
1       Loop: L.D      L.D
2       L.D            L.D
3       L.D            L.D            MUL.D    MUL.D
4       L.D            L.D            MUL.D    MUL.D
5       L.D            L.D            MUL.D    MUL.D
6       L.D            L.D            MUL.D    MUL.D
7       L.D            L.D            MUL.D    MUL.D
8       L.D            L.D            MUL.D    MUL.D
9       S.D            S.D            MUL.D    MUL.D    DSUBI
10      S.D            S.D            MUL.D    MUL.D
11      S.D            S.D
12      S.D            S.D
13      S.D            S.D
14      S.D            S.D
15      S.D            S.D
16      S.D            S.D                              BNE
4. Show how the following code would execute on a 2-issue Tomasulo-based superscalar with
speculation. There are 2 FP adders, 1 load/store unit, 1 integer unit. Use a table like that on slide 31
of the lecture4.pptx notes. Assume an FP ADD.D takes 4 cycles to execute and that up to 2
instructions can commit per cycle. Assume perfect branch prediction. Show the first 2 iterations of
the loop.
Loop: L.D    F0, 0(R1)
      L.D    F2, 8(R1)
      ADD.D  F4, F0, F2
      S.D    F6, 0(R2)
      DADDI  R1, R1, #16
      DADDI  R2, R2, #8
      BNE    R2, R3, Loop
Solution:
(Entries are clock cycle numbers.)

Iter.  Instruction  Issue  Exec    Mem Acc  CDB  Commit  Comments
1      L.D F0       1      2       3        4    5
1      L.D F2       1      4       5        6    7       Only 1 load at a time
1      ADD.D        2      7-10             11   12
1      S.D          2      3       13            14      Has to wait for ADD.D to commit
1      DADDI        3      4                5    14      Up to 2 commits per cycle
1      DADDI        3      5                6    15
1      BNE          4      7                     15
2      L.D F0       5      6       7        8    16      Has to wait for DADDI before it executes
2      L.D F2       5      7       8        9    17
2      ADD.D        6      10-13            14   17
2      S.D          6      7       18            19
2      DADDI        7      8                10   20      Collision on CDB with L.D F0 of iteration 2
2      DADDI        7      9                12   20      Collision on CDB with first ADD.D
2      BNE          8      13                    21
5. Provide an example in C or Java of a sequence of if-else statements where some of the branch
conditions could benefit from correlating predictors.
Solution:
if (x != y)        // condition 1
    if (y == z) {  // condition 2
        y = 0; …
    }
if (y == 0) { … }  // condition 3
if (x == 0) { … }  // condition 4
Assume the … has no impact on either x or y. If y==z, then the next condition (condition 3) will always be true.
If condition 1 is false and condition 3 is true then condition 4 will be true. Finally, if condition 1 is
true and condition 3 is true then condition 4 will be false.
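To make the mechanism concrete, here is a minimal sketch in C of a (1,1) correlating predictor: one bit of
global history (the outcome of the most recent branch) selects which of two 1-bit predictors is consulted
for the current branch. The table size, the indices and the outcomes used in main are illustrative choices,
not taken from any real predictor.

/* Sketch of a (1,1) correlating branch predictor. */
#include <stdio.h>

#define ENTRIES 16   /* illustrative table size */

static int pred[ENTRIES][2];   /* pred[index][global_history] = last outcome seen */
static int ghist = 0;          /* 1 bit of global history: was the last branch taken? */

static int predict(int index) {
    return pred[index % ENTRIES][ghist];
}

static void update(int index, int taken) {
    pred[index % ENTRIES][ghist] = taken;   /* 1-bit predictor: remember last outcome */
    ghist = taken;                          /* shift the new outcome into the history */
}

int main(void) {
    /* The outcome of the previous branch (e.g., condition 3) is part of the index
       used for the next branch (e.g., condition 4), which is what lets a correlating
       predictor exploit the relationships described in the example above. */
    int idx3 = 3, idx4 = 4;        /* illustrative branch indices */
    int outcome3 = 1, outcome4 = 1;/* illustrative outcomes */

    printf("predict condition 3: %d\n", predict(idx3));
    update(idx3, outcome3);
    printf("predict condition 4: %d\n", predict(idx4));
    update(idx4, outcome4);
    return 0;
}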
6. Given a 2-issue Tomasulo-style superscalar, let’s attempt to determine the impact of a branch
miss-prediction on performance. Assume 1 integer unit and issue/execute/CDB/commit behavior like that
of example 4 above. How many cycles of penalty would we see on a miss-predicted branch? Can
we figure this out in an absolute way?
Solution: As we see in #4, the BNE does not commit until after all instructions preceding it commit,
so the penalty will be based on the code that comes before it. Since the first BNE commits in cycle
15, we would have already fetched at least 1 additional complete iteration, a penalty of 7 instructions
over 4 cycles. But since we do not know the exact instruction mix, we cannot provide an absolute
penalty. It would almost certainly be more than 4 cycles in any FP code.