OpenMP Programming
OpenMP – Open Multi-processing
A standard API for writing multithreaded applications for a variety of
shared memory architectures
Comprises three major components
Compiler Directives
Runtime Library Routines
Environment Variables
C, C++ and Fortran support, JOMP for Java
An explicit programming model, offering complete control to the user
over parallelization
OpenMP uses the Fork Join model for execution
Master thread forks/creates a team of parallel threads which
execute the accompanying block of statements in parallel
Upon completion of the task, the threads join, leaving only the master
Each thread is assigned a unique ID within the team, with the
master being assigned ID 0
A user can specify any number of logical threads to be created
#include <omp.h>    // file BasicOpenMP.c
#include <stdio.h>
int main(void) {
    omp_set_num_threads(4);
    int tid = 0;
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Hello World from thread %d\n", tid);
    }
    return 0;
}
Compilation – for the gcc/g++ compilers, only the OpenMP flag needs to be
enabled
gcc -fopenmp <filename>
O/P (order of the lines may vary)
Hello World from thread 0
Hello World from thread 1
Hello World from thread 2
Hello World from thread 3
Compiler Directives
#pragma omp directive-name [clause, …] newline
Applies to the immediately following statement /code block
Can have other directives nested within it
Parallel directive – causes the thread to fork into a team with itself being the
master
#pragma omp parallel [clause, …] newline {}
The block following the directive is executed in parallel by all threads in the team
An implicit barrier at the end of the parallel section causes the threads to join
leaving only the master
Any part of the program to be executed in parallel is enclosed within this
directive
Clauses
Clauses specify conditions imposed while executing the code block
These include
num_threads() : # threads to be created
#pragma omp parallel num_threads(4)
shared / private / firstprivate / lastprivate : behavior of data variables for each
thread
#pragma omp parallel shared(list)
schedule(type, size) : scheduling mechanism
#pragma omp parallel for schedule(type, size)
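A minimal sketch combining these clauses (the variable names are illustrative, not
from the slides):
#include <omp.h>
#include <stdio.h>

int main(void) {
    int n = 8;        /* shared: one copy visible to all threads        */
    int tid = -1;     /* private: each thread works on its own copy     */
    #pragma omp parallel num_threads(4) shared(n) private(tid)
    {
        tid = omp_get_thread_num();
        printf("thread %d sees n = %d\n", tid, n);
    }
    return 0;
}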
Runtime Library Routines
Definitions of all the routines are in omp.h
omp_set_num_threads() – Sets the #threads to be created for parallel execution
The actual #threads created are implementation dependent
omp_get_num_threads() – Gives the actual #threads created for a parallel
section
Most implementations provide a default value for #threads which is predefined
or determined dynamically depending on the availability of the resources
omp_get_dynamic() – returns whether dynamic adjustment of the number of
threads is enabled. If enabled (returns 1), the runtime may adjust the number of
threads created for a parallel region at run time.
The default is implementation dependent; it can be changed using omp_set_dynamic()
#include <omp.h>
#include <stdio.h>
int main(void) {
    if (omp_get_dynamic() == 1)
    {
        omp_set_dynamic(0);   // disable dynamic adjustment of the team size
    }
    omp_set_num_threads(10);
    #pragma omp parallel
    {
        int NumThreads = omp_get_num_threads();
        printf("#threads created %d\n", NumThreads);
    }
}
O/P with dynamic adjustment disabled (as above): #threads created 10
O/P if dynamic adjustment were left enabled: #threads created 4
Work Sharing Constructs
Divides the execution of the enclosed code region among the thread team
Do not create new threads, hence must be enclosed within a parallel region
An implied barrier at the end of the construct
For Directive – shares iterations of the loop among threads (Data Parallel)
#pragma omp for [clauses, …] newline <for_loop>
The loop variable must be an integer and is implicitly private for each thread
The loop cannot contain any break, goto statements
The chunk allocated to each thread depends on the schedule clause specified
Beware of dependencies across iterations
#include <omp.h>
#define N 1000
#define chunk_size 100
int main(void) {
    omp_set_dynamic(0);
    omp_set_num_threads(10);
    int i = 0, res = 0;
    #pragma omp parallel
    {
        #pragma omp for schedule(static, chunk_size) reduction(+: res)
        for (i = 0; i < N; i++) {
            res = res + i;
        }
    }
}
Matrix Multiplication
#define size 100
int A[size][size], B[size][size], C[size][size];
int i, j, k, sum;
#pragma omp parallel for
for (i = 0; i < size; i++) {
    #pragma omp parallel for
    for (j = 0; j < size; j++) {
        C[i][j] = 0;
        sum = 0;
        #pragma omp parallel for reduction(+: sum)
        for (k = 0; k < size; k++)
            sum += A[i][k] * B[k][j];
        C[i][j] = sum;
    }
}
To enable nested parallelism, use omp_set_nested(1)
shared(j) – what happens now ??
Schedule(type, chunk) : determines how the loop iterations will be divided
among threads. Default is implementation dependent
static – loop iterations are divided into contiguous groups of size chunk and
statically assigned to threads
dynamic – loop iterations are divided as above but allocated dynamically to the
threads
guided – similar to dynamic, except that the chunk size decreases each time an
iteration group is assigned to a thread
runtime – the scheduling decision is deferred until run time and is taken from the
OMP_SCHEDULE environment variable (or set via omp_set_schedule())
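A hedged sketch of dynamic scheduling on an imbalanced loop; work() is a
hypothetical helper standing in for iterations of uneven cost:
#include <omp.h>
#include <stdio.h>

/* work(i): cost grows with i, so a static split would leave the last
   threads with most of the work. */
static long work(int i) {
    long s = 0;
    for (int j = 0; j < i * 1000; j++) s += j;
    return s;
}

int main(void) {
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+: total)
    for (int i = 0; i < 256; i++) {
        total += work(i);   /* idle threads grab the next chunk of 4 iterations */
    }
    printf("total = %ld\n", total);
    return 0;
}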
Reduction(op : list) – the op is applied to the individual copies of the list variables
generated by each thread and the result is stored in the original variable
The list variables can only be scalar and the op can only be a non-overloaded
binary associative operator.
#define min(a, b) ((a) < (b) ? (a) : (b))
int A[n][n][n];
for (k = 1; k < n; k++) {     // start at k = 1 so that A[k-1] is always valid
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        #pragma omp parallel for
        for (j = 0; j < n; j++) {
            A[k][i][j] = min(A[k-1][i][j], (A[k-1][i][k] + A[k-1][k][j]));
        }
    }
}
Why can the outermost loop not be parallelized ?? Each iteration k reads values
computed in iteration k-1, so the k-loop carries a dependence across iterations.
Sections Directive – Independent Section directives are nested within it. Each
Section is executed by one of the threads in the team (Task Parallel)
#pragma omp sections [clauses, …] newline
{
    #pragma omp section newline
        <structured block>
    #pragma omp section newline
        <structured block>
}
Scheduling of sections among threads is dynamic. In case #threads < #sections,
scheduling of extra sections is implementation dependent
Branching into or out of a section (e.g. with goto or jump statements) is not allowed
Work-sharing directives must be encountered by all threads in a team or by none
at all, hence guarding them with a condition is not allowed
Example on Sections
QuickSort(int list[], int lower, int upper) {
if (lower < upper) {
int pos = partition(list, lower, upper);
#pragma omp parallel sections {
#pragma omp section {
QuickSort(list, lower, pos-1);
}
#pragma omp section {
QuickSort(list, pos, upper);
}
}
}
}
Data Scope
private(list) – declares all variables in its list as private for each thread
shared(list) – declares variables in its list as shared among all threads. By default,
most variables (those visible before the construct is entered) are shared
firstprivate(list) – same as private, but each thread's copy is initialized with the
variable's value from just before the construct is entered
lastprivate(list) – same as private, but the value from the sequentially last loop
iteration or the lexically last section (not the last one to finish executing) is
copied back to the original variable
default(scope|none) – allows the user to specify a default scope for each
variable used in the construct. If specified as ‘none’, then the scope for each
variable needs to be specified
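A small illustrative example of firstprivate and lastprivate (variable names are
made up for illustration):
#include <omp.h>
#include <stdio.h>

int main(void) {
    int x = 10;       /* firstprivate: each thread starts with its copy == 10 */
    int last = -1;    /* lastprivate: gets the value from the last iteration  */
    #pragma omp parallel for firstprivate(x) lastprivate(last)
    for (int i = 0; i < 100; i++) {
        last = x + i; /* private copies, so no races on x or last             */
    }
    printf("last = %d\n", last);  /* prints 109: the value from i == 99       */
    return 0;
}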
Synchronization
Used to impose order constraints and protect access to shared data
Atomic – allows a specific memory location to be updated atomically
#pragma omp atomic newline <statement>
#pragma omp parallel for
for (i = 0; i < N; i++) {
    #pragma omp atomic
    res = res + i;
}
Barrier – synchronizes all threads in a team. A thread has to wait until all the
threads within a team reach this point
#pragma omp barrier newline
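A minimal sketch of barrier use, assuming each thread first writes its own slot
and only then reads a neighbour's slot:
#include <omp.h>
#include <stdio.h>
#define NT 4

int main(void) {
    int partial[NT];
    #pragma omp parallel num_threads(NT)
    {
        int id = omp_get_thread_num();
        partial[id] = id * id;    /* phase 1: each thread fills its own slot   */
        #pragma omp barrier       /* wait until every slot has been written    */
        int next = (id + 1) % omp_get_num_threads();
        printf("thread %d reads partial[%d] = %d\n", id, next, partial[next]);
    }
    return 0;
}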
Ordered – specifies that the enclosed block inside a loop must be executed in the
order of the loop iterations (serially with respect to that block)
#pragma omp parallel
{
    #pragma omp for ordered
    for (i = 1; i < N; i++) {
        #pragma omp ordered
        A[i] = A[i] + A[i-1];
    }
}
An ordered region must be closely nested inside a loop region with an ordered
clause
Critical – specifies that the enclosed block can only be executed by one thread at
a time
#pragma omp critical [name] newline <structured block>
If a thread is currently executing a critical section and another thread reaches
that critical region, then it will be blocked until the first thread completes its
execution of the critical section
Multiple critical regions can be defined using names. Critical regions with same
names form a group and are mutually exclusive of each other. Anonymous critical
regions together form one group
The critical directive enforces exclusive access with respect to critical directives
in all threads, not just the current team.
Producer Consumer problem
int buffer[N], size;
#pragma omp sections
{
    #pragma omp section       // producer
    {
        if (size < N) {
            #pragma omp critical (prod_cons)
            {
                if (size < N)
                    buffer[size++] = 1;
            }
        }
    }
    #pragma omp section       // consumer
    {
        if (size > 0) {
            #pragma omp critical (prod_cons)
            {
                if (size > 0)
                    buffer[--size] = 0;
            }
        }
    }
}
Nesting of Critical regions
OpenMP does not allow nesting of critical regions of the same name to avoid potential
deadlock scenarios
Critical regions inside Subroutines
Critical sections inside subroutines also have the same global scope
Care must be taken to ensure that the code does not nest two critical sections
void RecProd() {
    #pragma omp critical
    {
        RecProd();
    }
}
Not allowed as it results in nesting of critical
sections even though the same thread is making
the recursive call
Comparing Synchronization Directives
OpenMP provides fairly easy to use constructs for achieving synchronization.
However, they can severely impact performance if not used wisely
Atomic and Critical achieve the same objectives but atomic is much cheaper to
implement internally
Atomic should be preferred over critical for single memory updates. Critical
makes sense when a block of statements needs to be executed by one thread at a
time
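An illustrative contrast (hypothetical counters), using atomic for the single-word
update and critical for the multi-statement update:
#include <omp.h>
#include <stdio.h>

int main(void) {
    int hits = 0;                  /* single-word counter: atomic is enough     */
    int max_id = -1, max_sq = -1;  /* two related variables: needs critical     */
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();
        #pragma omp atomic
        hits++;                    /* one memory update -> atomic               */

        #pragma omp critical (track_max)
        {                          /* both updates must stay consistent         */
            if (id * id > max_sq) {
                max_sq = id * id;
                max_id = id;
            }
        }
    }
    printf("hits = %d, max_id = %d\n", hits, max_id);
    return 0;
}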
Locks : low level synchronization
OpenMP provides a set of runtime routines to support the use of locks
Simple Locks – used to protect resources. A simple lock can be set only if it is
currently unset (even the thread holding it cannot set it again)
omp_init_lock() – used to initialize a lock associated with a lock variable
omp_set_lock() – used to set the lock. If it is already set, the calling thread blocks
until the lock becomes available
omp_test_lock() – attempts to set the lock without blocking; sets it if it is unset
omp_unset_lock() – used to release the lock for others to use
omp_destroy_lock() – destroys the lock freeing the lock variable
omp_lock_t lck;
int id, tmp;
omp_init_lock(&lck);
#pragma omp parallel private(tmp, id)
{
    id = omp_get_thread_num();
    tmp = do_lots_of_work(id);
    omp_set_lock(&lck);          // one thread at a time prints its result
    printf("%d %d\n", id, tmp);
    omp_unset_lock(&lck);
}
omp_destroy_lock(&lck);
Nested Locks – same as simple locks except that they allow a thread currently
holding a lock to re-acquire it. No other thread can acquire the set lock
Same functions as above such as omp_init_nest_lock()
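A hedged sketch of a nested lock protecting a recursive update; add_levels() is a
hypothetical helper used only to show re-acquisition by the owning thread:
#include <omp.h>
#include <stdio.h>

omp_nest_lock_t nlck;

/* The same thread re-acquires the nested lock it already holds,
   which would deadlock with a simple lock. */
void add_levels(int *counter, int depth) {
    if (depth == 0) return;
    omp_set_nest_lock(&nlck);
    (*counter)++;
    add_levels(counter, depth - 1);  /* re-entry by the owning thread is allowed */
    omp_unset_nest_lock(&nlck);
}

int main(void) {
    int counter = 0;
    omp_init_nest_lock(&nlck);
    #pragma omp parallel num_threads(4)
    add_levels(&counter, 3);
    omp_destroy_nest_lock(&nlck);
    printf("counter = %d\n", counter);   /* with 4 threads: 4 * 3 = 12 */
    return 0;
}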
Synchronization pragmas are easier to use than locks and significantly reduce the
need to check for deadlocks and memory leaks.
Most synchronization pragmas incur little to no overhead in terms of
implementation
However locks offer greater flexibility and control to the user than constructs
such as critical in terms of nesting and subroutine calls
Other Clauses & Directives
nowait clause – removes the implicit barrier at the end of the work sharing
construct
threadprivate(list) – declares static and file scope variables as private for each
thread
if clause – used with the parallel directive; if the expression evaluates to ‘true’,
new threads are created for parallel execution of the enclosed code block,
otherwise the block is executed serially by a single thread
#pragma omp master – if used, the enclosed piece of code is only executed by
the master
#pragma omp single – if used, the enclosed piece of code is executed by only
one of the threads in the team. An implicit barrier is placed at the end
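A small illustrative sketch of master and single inside one parallel region:
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        printf("single: printed once, by whichever thread gets here first\n");
        /* implicit barrier after single: all threads wait before continuing */

        #pragma omp master
        printf("master: printed only by thread 0\n");
        /* no implied barrier after master */
    }
    return 0;
}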