Objectives: Multicore Processors and Threading
Ref: [Tanenbaum, sect 8.1], [O’H&Bryant, sect 1.7.2, 1.9.1, 12.1, 12.3-4]
• to understand the basic concepts of threading
• to grasp the main ideas of the shared memory programming model
• to understand how multiple cores on a chip operate
• to appreciate how they arose
• to understand the idea of hardware threading
(some figures in these slides are from Magee & Kramer, Concurrency, Wiley; Lin & Snyder, Principles of
Parallel Programming, Pearson; Chapman et al, Using OpenMP)
Processes and Threads
• processes may have one or more threads of execution within them
  – these all share the same address space and manipulate the same memory areas
• (OS-style) processes: the Operating System view (courtesy Magee & Kramer)
  – a (heavyweight) process in an operating system is represented by its descriptor, code, data and the state of the machine registers
  – to support multiple (lightweight) threads of control, it has multiple stacks, one for each thread
  – the (specific) state at any time of a running thread includes its stack plus the values in the registers of the CPU it runs on
The Thread Life-cycle
• an overview of the life-cycle of a thread as state transitions:
• normally, the operating system manages these
• a specific system call creates a new thread, e.g. Solaris lwp_create()
  – this will call some function, and the thread terminates when it exits that function
  – once terminated, it cannot be restarted
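As a concrete illustration (a minimal sketch using the portable POSIX threads API rather than the Solaris lwp calls; not from the slides), a new thread starts in a given function and terminates when that function returns:

    #include <pthread.h>
    #include <stdio.h>

    /* the thread begins execution in this function and
       terminates (and cannot be restarted) when it returns */
    void *worker(void *arg) {
        printf("hello from thread %d\n", *(int *)arg);
        return NULL;
    }

    int main(void) {
        pthread_t tid;
        int id = 0;
        pthread_create(&tid, NULL, worker, &id);  /* create: thread becomes runnable */
        pthread_join(tid, NULL);                  /* wait until it has terminated   */
        return 0;
    }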
Shared Memory Multiprocessors
• a number of processors (CPUs) connected to a globally addressable memory
  – through a bus ([Lin & Snyder, fig 2.3])
  or, better,
  – memory is organized into modules, all connected by an interconnect
  – at any time, different threads (or processes) can run in parallel on different CPUs
• need caches! the memory consistency problem is now exacerbated!
• consider programs (threads) on processors 0 and 1 attempting to acquire a lock
  – this is needed when one wishes to update a shared data structure
  – the hardware must support this via some atomic instructions, e.g.
        ! %o0 has address of the lock
        mov    0xff, %o1        ! 0xff is value for acquiring lock
loop:                           ! loop to acquire lock
        brnz   %o1, loop        ! exit if %o1 = 0 (lock value was 0)
        ldstub [%o0], %o1       ! atomic swap %o1 and value at lock (in the branch delay slot)
        ...                     ! safely update shared data structure
        stb    %g0, [%o0]       ! release lock
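The same test-and-set idea can be expressed portably; a minimal sketch using C11 atomics (an illustrative assumption, not the course's code) in which atomic_flag_test_and_set plays the role of ldstub:

    #include <stdatomic.h>

    atomic_flag lock = ATOMIC_FLAG_INIT;      /* clear means the lock is free */

    void acquire(void) {
        /* atomically set the flag and obtain its old value;
           busy-wait until the old value was clear (lock was free) */
        while (atomic_flag_test_and_set(&lock))
            ;
    }

    void release(void) {
        atomic_flag_clear(&lock);             /* release lock */
    }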
The Shared Memory Programming Model
• uses a fork-join variant of the threaded programming model [Chap&Jost&derPas, fig 2.1]
  – the team of threads executes a parallel region
  – for parallel speedup, each thread needs to be allocated to a different processor
  – in the low-level threads library (pthreads), one can only fork or join one thread at a time (e.g. race.c)
• threads communicate via global variables in common memory sections (e.g. static data, heap); each thread has a private stack for its thread-local variables
• the main synchronization mechanisms are locks and barriers:
pthread_mutex_init(&mutex1, NULL);
...
pthread_mutex_lock(&mutex1);     // involves a busy-wait loop
/* mutually exclusive access to shared resource */
pthread_mutex_unlock(&mutex1);
...
pbarrier();                      // wait until other threads reach the same point
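Putting the pieces together, a hypothetical sketch of the kind of race that race.c illustrates, with the mutex making the shared update safe (the names and counts here are illustrative, not the course code):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NITER    100000

    long counter = 0;                                /* shared: lives in static data */
    pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;

    void *work(void *arg) {
        for (int i = 0; i < NITER; i++) {            /* i is on this thread's private stack */
            pthread_mutex_lock(&mutex1);             /* without the lock, counter++ races   */
            counter++;
            pthread_mutex_unlock(&mutex1);
        }
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (int t = 0; t < NTHREADS; t++)           /* fork one thread at a time */
            pthread_create(&tid[t], NULL, work, NULL);
        for (int t = 0; t < NTHREADS; t++)           /* join one thread at a time */
            pthread_join(tid[t], NULL);
        printf("counter = %ld\n", counter);          /* NTHREADS * NITER with the lock held */
        return 0;
    }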
Overview of the OpenMP Programming Model
• the idea is to add directives to ordinary C, C++ or Fortran code
• example: matrix-vector multiply y ← y + Ax ([Chap&Jost&derPas, fig 3.9])
double A[N][N], x[N], y[N]; int i, j;
...
#pragma omp parallel for private(j)
for (i = 0; i < N; i++)            // each thread updates a segment of y[]
    for (j = 0; j < N; j++)
        y[i] += A[i][j] * x[j];

// alternatively:
for (i = 0; i < N; i++) {
    double s = y[i];
#pragma omp parallel for reduction(+:s)
    for (j = 0; j < N; j++)        // each thread computes a partial sum in its
        s += A[i][j] * x[j];       //   own version of s, which are later
    y[i] = s;                      //   summed into the global s
}
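(aside, not from the figure: these loops need an OpenMP-aware compiler, e.g. gcc -fopenmp prog.c; the environment variable OMP_NUM_THREADS sets the default size of the thread team)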
• more generally, directives apply over regions: #pragma omp parallel { ... }
• within a parallel region, a barrier may be inserted (#pragma omp barrier)
• mutual exclusion via #pragma omp critical { ... } (the three are combined in the sketch below)
• for more details, see the OpenMP web site or the LLNL Tutorial
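A minimal sketch (not from the slides) combining these three constructs:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int total = 0;
        #pragma omp parallel                   /* a team of threads executes this region  */
        {
            int id = omp_get_thread_num();     /* private to each thread                   */

            #pragma omp critical               /* mutual exclusion for the shared update   */
            total += id;

            #pragma omp barrier                /* wait until all threads reach this point  */

            if (id == 0)
                printf("sum of thread ids = %d\n", total);
        }
        return 0;
    }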
What is Multicore?
• also known as chip multiprocessing (CMP): multiple shared memory processors (‘cores’ = CPUs) on a single chip
• each core has its own register set and operates on a common main memory
  – why? because we can! also because we must!
  – can run multiple applications (processes) in parallel
  – can run a single (threaded) application in parallel; herein lies the challenge!
• (large-scale) parallelism is now cheap, mainstream
• memory (data access) is an increasingly important consideration!
Advent of Multicore
• caused by the end of Dennard scaling and the end of further improvement from pipelining / superscalar techniques
• for a long time, Moore’s Law permitted an exponential increase in clock speed at constant power density (Dennard scaling), as well as in the number of transistors per chip
• extrapolation of the exponential power density increase over 1985–2000 indicates we are at the limit!
• a 2000 Intel chip already had the power density of a hotplate; extrapolating, it would have reached that of a rocket nozzle by 2010!
• dissipated power is given by P ∝ V²·f ∝ f³ (since V must scale roughly with f), where V is the supply voltage and f is the clock frequency (speed)
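A rough worked example (assuming, as under Dennard-style scaling, that V rises roughly in proportion to f):
  – doubling the clock: P ∝ (2f)³ = 8× the original power, for at most 2× the performance
  – two cores at the original f: ≈ 2× the power for roughly 2× the peak throughput
This arithmetic is why the extra transistors now go into more cores rather than higher clock speeds.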
Hardware Threading
• for each core, have a number of ‘virtual CPUs’, each with its own register set
• the core’s control unit can flexibly select instructions ready to execute in each of these
• the operating system sees each as a separate CPU
• we can thus hide the effects of cache misses
• this tends to be effective on memory-intensive workloads with high degrees of concurrency (e.g. database and web servers)
• for other applications, there is still some limited benefit (e.g. 25% speedup for 4-way hardware threading)
On the other hand, it is a ‘cheap’ technique in terms of extra hardware.