Tutorial sessions 1 & 2 on OpenMP

OpenMP Programming
OpenMP – Open Multi-Processing
A standard API for writing multithreaded applications for a variety of
shared memory architectures
 Comprises three major components
 Compiler Directives
 Runtime Library Routines
 Environment Variables
C, C++ and Fortran support, JOMP for Java
 An explicit programming model, offering complete control to the user
over parallelization
 OpenMP uses the fork-join model of execution
 The master thread forks/creates a team of parallel threads which
execute the accompanying block of statements in parallel
 Upon completion of the parallel region, the threads join, leaving only the master
 Each thread is assigned a unique ID within the team, with the
master being assigned ID 0
 A user can specify any number of logical threads to be created
#include <omp.h> // file BasicOpenMP.c
#include <stdio.h>
int main(void) {
    omp_set_num_threads(4);
    int tid = 0;
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Hello World from thread %d\n", tid);
    }
    return 0;
}
Compilation – for the gcc and g++ compilers, only the OpenMP flag needs to be
enabled
gcc -fopenmp <filename>
O/P (the order of the lines may vary across runs)
Hello World from thread 0
Hello World from thread 1
Hello World from thread 2
Hello World from thread 3
Compiler Directives
#pragma omp directive-name [clause, …] newline
 Applies to the immediately following statement /code block
 Can have other directives nested within it
 Parallel directive – causes the encountering thread to fork a team of threads, with
itself as the master
#pragma omp parallel [clause, …] newline <structured block>
The block following the directive is executed in parallel by all threads in the team
An implicit barrier at the end of the parallel region causes the threads to join,
leaving only the master
Any part of the program to be executed in parallel is enclosed within this
directive
Clauses
Clauses specify conditions imposed while executing the code block
 These include
num_threads() : # threads to be created
#pragma omp parallel num_threads(4)
shared/private/firstprivate/lastprivate : behavior of data variables for each
thread
#pragma omp parallel shared(list)
schedule(type, size) : scheduling mechanism
#pragma omp parallel for schedule(type, size)
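As a rough sketch of how these clauses combine on one directive (the array and
chunk size below are illustrative, not from the tutorial):
#include <omp.h>
#include <stdio.h>
int main(void) {
    int a[8] = {0};
    int i;
    // num_threads, shared, private and schedule combined on a single
    // parallel-for; each iteration records the thread that executed it
    #pragma omp parallel for num_threads(4) shared(a) private(i) schedule(static, 2)
    for (i = 0; i < 8; i++) {
        a[i] = omp_get_thread_num();
    }
    for (i = 0; i < 8; i++)
        printf("iteration %d ran on thread %d\n", i, a[i]);
    return 0;
}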
Runtime Library Routines
 Prototypes of all the routines are in omp.h
 omp_set_num_threads() – Sets the #threads to be created for parallel execution
 The actual #threads created are implementation dependent
 omp_get_num_threads() – Gives the actual #threads created for a parallel
section
 Most implementations provide a default value for #threads which is predefined
or determined dynamically depending on the availability of the resources
 omp_get_dynamic() – returns true (1) if the runtime is allowed to dynamically
adjust the number of threads created at run-time
 Whether dynamic adjustment is on by default is implementation dependent; it can
be changed using omp_set_dynamic()
#include <omp.h>
#include <stdio.h>
int main(void) {
    if (omp_get_dynamic() == 1) {
        omp_set_dynamic(0); // disable dynamic adjustment of the thread count
    }
    omp_set_num_threads(10);
    #pragma omp parallel
    {
        int NumThreads = omp_get_num_threads();
        printf("#threads created %d\n", NumThreads);
    }
    return 0;
}
O/P
#threads created 10 (each of the 10 threads prints this)
(with dynamic adjustment left on, the runtime might create fewer, e.g. #threads created 4)
Work Sharing Constructs
Divide the execution of the enclosed code region among the thread team
They do not create new threads, hence must be enclosed within a parallel region
 An implied barrier at the end of the construct
 For Directive – shares iterations of the loop among threads (Data Parallel)
#pragma omp for [clauses, …] newline <for_loop>
 The loop variable must be an integer and is implicitly private to each thread
 The loop body cannot contain break or goto statements that jump out of the loop
 The chunk allocated to each thread depends on the schedule clause specified
 Beware of dependencies across iterations (see the sketch below)
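For instance, a minimal sketch of such a dependency (not from the tutorial):
parallelizing the loop below gives wrong results, because iteration i reads a value
that another thread may not have written yet.
int a[1000];
a[0] = 0;
// UNSAFE under omp for: iteration i depends on iteration i-1
#pragma omp parallel for
for (int i = 1; i < 1000; i++) {
    a[i] = a[i-1] + 1; // loop-carried dependence
}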
#include <omp.h>
#include <stdio.h>
#define N 1000
#define chunk_size 100
int main(void) {
    omp_set_dynamic(0);
    omp_set_num_threads(10);
    int i = 0, res = 0;
    #pragma omp parallel
    {
        #pragma omp for schedule(static, chunk_size) reduction(+: res)
        for (i = 0; i < N; i++) {
            res = res + i;
        }
    }
    printf("res = %d\n", res); // 0 + 1 + … + 999 = 499500
    return 0;
}
Matrix Multiplication
#define size 100
int A[size][size], B[size][size], C[size][size];
int i, j, k, sum;
#pragma omp parallel for
for (i = 0; i < size; i++) {
    #pragma omp parallel for
    for (j = 0; j < size; j++) {
        C[i][j] = 0;
        sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for (k = 0; k < size; k++)
            sum += A[i][k] * B[k][j];
        C[i][j] = sum;
    }
}
shared(j) – what happens now??
To enable nested parallelism, use omp_set_nested(1)
 Schedule(type, chunk) : determines how the loop iterations will be divided
among threads. Default is implementation dependent
 static – loop iterations are divided into contiguous groups of size chunk and
statically assigned to threads
 dynamic – loop iterations are divided as above but allocated dynamically to the
threads
 guided – similar to dynamic, except that the chunk size decreases each time an
iteration group is assigned to a thread
 runtime – the scheduling decision is deferred until run-time and is taken from the
OMP_SCHEDULE environment variable (or set via omp_set_schedule())
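A small sketch of selecting the schedule at run-time (the chunk size is illustrative):
#include <omp.h>
#include <stdio.h>
int main(void) {
    int i;
    // With schedule(runtime), the schedule is taken from the OMP_SCHEDULE
    // environment variable, e.g.: export OMP_SCHEDULE="dynamic,4"
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < 16; i++)
        printf("iteration %d on thread %d\n", i, omp_get_thread_num());
    return 0;
}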
 Reduction(op : list) – The op is applied to the individual copies of the list variable
generated by each thread and the result is stored in the original list variable
 The list variable can only be scalar and the op can only be a non-overloaded
binary associative operator.
int A[n][n][n];
for (k = 1; k < n; k++) { // k starts at 1: iteration k reads layer k-1
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        #pragma omp parallel for
        for (j = 0; j < n; j++) {
            A[k][i][j] = min(A[k-1][i][j], A[k-1][i][k] + A[k-1][k][j]); // min as a macro
        }
    }
}
Why can the outermost loop not be parallelized?? Because iteration k reads the
values written by iteration k-1, the k-loop carries a dependence and must run serially
 Sections Directive – Independent Section directives are nested within it. Each
Section is executed by one of the threads in the team (Task Parallel)
#pragma omp sections [clauses, …] newline
{
#pragma omp section newline
<structured block>
#pragma omp section newline
<structured block>
}
Scheduling of sections among threads is dynamic. In case #threads < #sections,
scheduling of extra sections is implementation dependent
 goto and other jump statements cannot be used to branch into or out of sections
 OpenMP work-sharing pragmas must be encountered by all threads in a team or by
none at all, hence they cannot be placed under a condition that only some threads
satisfy (see the sketch below)
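As a sketch of the illegal pattern (do_work() is a placeholder routine): only
thread 0 encounters the work-sharing directive below, so the program is non-conforming.
#pragma omp parallel
{
    if (omp_get_thread_num() == 0) { // only thread 0 takes this branch
        #pragma omp sections         // ILLEGAL: not encountered by all threads
        {
            #pragma omp section
            do_work();               // placeholder routine
        }
    }
}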
Example on Sections
void QuickSort(int list[], int lower, int upper) {
    if (lower < upper) {
        int pos = partition(list, lower, upper); // assumes a Lomuto-style partition
        #pragma omp parallel sections
        {
            #pragma omp section
            QuickSort(list, lower, pos - 1);
            #pragma omp section
            QuickSort(list, pos + 1, upper); // pos + 1: the pivot is already placed
        }
    }
}
Data Scope
 private(list) – declares all variables in its list as private for each thread
 shared(list) – declares variables in its list as shared among all threads. By default,
most variables visible when a parallel region is entered are shared
 firstprivate(list) – same as private, with automatic initialization of each private
copy from the original variable's value before entering the construct
 lastprivate(list) – same as private, with automatic assignment from the sequentially
last loop iteration or the lexically last section (program order, not execution
order) to the original variable
 default(shared|none) – allows the user to specify a default scope for all variables
used in the construct. If 'none' is specified, then the scope of each variable
used must be specified explicitly
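A minimal sketch contrasting firstprivate and lastprivate (the values are illustrative):
#include <omp.h>
#include <stdio.h>
int main(void) {
    int x = 10, last = -1;
    int i;
    // firstprivate: each thread's copy of x starts at 10
    // lastprivate: after the loop, last holds the value assigned in the
    // sequentially last iteration (i = 9)
    #pragma omp parallel for firstprivate(x) lastprivate(last)
    for (i = 0; i < 10; i++) {
        last = x + i;
    }
    printf("last = %d\n", last); // prints 19 (10 + 9)
    return 0;
}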
Synchronization
 Used to impose order constraints and protect access to shared data
 Atomic – allows a specific memory location to be updated atomically
#pragma omp atomic newline <statement>
#pragma omp parallel for
for (i = 0; i < N; i++) {
    #pragma omp atomic
    res = res + i;
}
 Barrier – synchronizes all threads in a team. A thread has to wait until all the
threads within a team reach this point
#pragma omp barrier newline
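A short sketch of barrier use (the arrays a and b and heavy_compute() are
placeholders): no thread may read its neighbour's slot until every thread has
written its own.
#pragma omp parallel
{
    int id = omp_get_thread_num();
    a[id] = heavy_compute(id);  // phase 1: each thread fills its own slot
    #pragma omp barrier         // wait until all slots are written
    b[id] = a[(id + 1) % omp_get_num_threads()]; // phase 2: safe to read neighbours
}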
 Ordered – specifies that the enclosed block must be executed in the same order as
the loop iterations would run sequentially
#pragma omp parallel
{
    #pragma omp for ordered
    for (i = 1; i < N; i++) {
        #pragma omp ordered
        A[i] = A[i] + A[i-1];
    }
}
 An ordered region must be closely nested inside a loop region with an ordered
clause
 Critical – specifies that the enclosed block can only be executed by one thread at
a time
#pragma omp critical [name] newline <structured block>
 If a thread is currently executing a critical section and another thread reaches
that critical region, then it will be blocked until the first thread completes its
execution of the critical section
 Multiple critical regions can be defined using names. Critical regions with same
names form a group and are mutually exclusive of each other. Anonymous critical
regions together form one group
 The critical directive enforces exclusive access with respect to critical directives
in all threads, not just the current team.
Producer Consumer problem
int buffer[N], size = 0;
#pragma omp parallel sections
{
    #pragma omp section // producer
    {
        if (size < N) { // cheap unsynchronized check
            #pragma omp critical (prod_cons)
            {
                if (size < N) // re-check while holding the critical section
                    buffer[size++] = 1;
            }
        }
    }
    #pragma omp section // consumer
    {
        if (size > 0) {
            #pragma omp critical (prod_cons)
            {
                if (size > 0)
                    buffer[--size] = 0;
            }
        }
    }
}
Nesting of Critical regions
 OpenMP does not allow nesting of critical regions of the same name to avoid potential
deadlock scenarios
Critical regions inside Subroutines
 Critical sections inside subroutines also have the same global scope
 Care must be taken to ensure that the code does not nest two critical sections
void RecProd(void) {
    #pragma omp critical
    {
        RecProd(); // re-enters the same unnamed critical region
    }
}
Not allowed, as it results in nesting of critical sections: even though the same
thread is making the recursive call, it deadlocks waiting for itself to leave the
region
Comparing Synchronization Directives
 OpenMP provides fairly easy to use constructs for achieving synchronization.
However, they can severely impact performance if not used wisely
 Atomic and Critical achieve the same objectives but atomic is much cheaper to
implement internally
 Atomic should be preferred over critical for single memory updates; critical makes
sense when a block of statements needs to be executed by one thread at a
time
Locks : low level synchronization
OpenMP provides a set of runtime routines to support the use of locks
Simple Locks – used to protect shared resources. A simple lock can be set only if it is currently unset
 omp_init_lock() – used to initialize a lock associated with a lock variable
 omp_set_lock() – used to set the lock. If the lock is already set, the calling
thread blocks until it becomes available
 omp_test_lock() – tests whether a lock is set; if unset, sets it without blocking
 omp_unset_lock() – used to release the lock for others to use
 omp_destroy_lock() – destroys the lock freeing the lock variable
omp_lock_t lck;
omp_init_lock(&lck);
#pragma omp parallel private (tmp, id)
{
id = omp_get_thread_num();
tmp = do_lots_of_work(id);
omp_set_lock(&lck);
printf("%d %d\n", id, tmp); // one thread prints at a time
omp_unset_lock(&lck);
}
omp_destroy_lock(&lck);
 Nested Locks – same as simple locks except that they allow a thread currently
holding a lock to re-acquire it. No other thread can acquire the set lock
 Same functions as above such as omp_init_nest_lock()
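A minimal sketch of a nestable lock, assuming a recursive routine that must hold
the lock across its recursive calls (the routine and depth are illustrative):
#include <omp.h>
omp_nest_lock_t nlck;
void update(int depth) {
    omp_set_nest_lock(&nlck);    // the owning thread may re-acquire the lock
    if (depth > 0)
        update(depth - 1);       // would deadlock with a simple lock
    omp_unset_nest_lock(&nlck);  // each set must be matched by an unset
}
int main(void) {
    omp_init_nest_lock(&nlck);
    #pragma omp parallel
    update(3);
    omp_destroy_nest_lock(&nlck);
    return 0;
}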
 Synchronization pragmas are easier to use than locks and significantly reduce the
need to check for deadlocks and for lock variables that are never destroyed
 Most synchronization pragmas add little run-time overhead
 However locks offer greater flexibility and control to the user than constructs
such as critical in terms of nesting and subroutine calls
Other Clauses & Directives
 nowait clause – removes the implicit barrier at the end of the work sharing
construct
 threadprivate(list) – declares static and file scope variables as private for each
thread
 if clause – used with the parallel directive; if the expression evaluates to true,
new threads are created for parallel execution of the enclosed code block,
otherwise the block is executed serially by a single thread
 #pragma omp master – if used, the enclosed piece of code is only executed by
the master
 #pragma omp single – if used, the enclosed piece of code is executed by only
one of the threads in the team. An implicit barrier is placed at the end
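A short sketch combining these directives (init_data(), process() and N are
placeholders):
#pragma omp parallel
{
    #pragma omp single     // one thread initializes; implicit barrier at the end
    init_data();
    #pragma omp for nowait // threads leave the loop without waiting for the others
    for (int i = 0; i < N; i++)
        process(i);
    #pragma omp master       // only the master executes this; no implied barrier,
    printf("master done\n"); // so it may run while others are still in the loop
}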