Introduction to SHARCNET and High Performance Computing
Fall Workshop 2006

What is SHARCNET?
• Vision:
  – To establish a world-leading, multi-university and college, interdisciplinary institute with an active academic-industry partnership, enabling forefront computational research in critical areas of science, engineering and business.
• Mission:
  – SHARCNET exists to enable world-class computational research so as to accelerate the production of research results.
SHARCNET…in brief
• General Objectives:
  – provide otherwise unattainable computing resources
  – build common, seamless computing environments
  – promote remote collaboration and research
• SHARCNET Web Portal: http://www.sharcnet.ca

Academic and Affiliated Partners
• Founding members (June 2001)
  The University of Western Ontario
  University of Guelph
  McMaster University
  Wilfrid Laurier University
  University of Windsor
  Fanshawe College
  Sheridan College
• New Partners (June 2003)
  University of Waterloo
  Brock University
  University of Ontario Institute of Technology
  York University
• New Partners (Dec 2005)
  Trent University
  Laurentian University
  Lakehead University
• New Partners (March 2006)
  Ontario College of Art and Design
  Perimeter Institute for Theoretical Physics
• Affiliated Partners
  Robarts Research Institute
  Fields Institute for Mathematical Sciences
Industrial and Government Partners
• Private Sector
  – Hewlett Packard
  – SGI
  – Quadrics Supercomputing World
  – Platform Computing
  – Nortel Networks
  – Bell Canada
• Government
  – Canada Foundation for Innovation
  – Ontario Innovation Trust
  – Ontario R&D Challenge Fund
  – Optical Regional Advanced Network of Ontario (ORANO)
SHARCNET Overview

Getting an Account
• FREE to academic researchers
• Compute-intensive problems
  – not intended to replace desktop computing; SHARCNET users can productively conduct HPC research on a variety of systems, each optimized for specific HPC tasks
• Academic HPC research
  – research can be industry-related, but must be done in collaboration with an academic researcher
• Apply for an account online (webportal)
  – http://www.sharcnet.ca/Help/account.php - “Getting an Account” link
  – http://www.sharcnet.ca/Portal/index.php - “New User” link
• Account approval:
  – Faculty accounts are approved by the site leader
  – Students/postdocs/fellows require a faculty sponsor
    • the sponsor approves such accounts directly
  – Non-SHARCNET institution accounts are approved by the Scientific Director
• Fair access
  – users have access to all systems
  – jobs are run in batch mode (scheduling system) with fairshare
• You will have a webportal account that allows you to access information/files, submit requests and manage your own profile
• All systems are accessed via SSH (using the same account information)
  – e.g. ssh username@machine.sharcnet.ca

SHARCNET Environment
[Diagram: Cluster of Clusters (COC): each cluster has a login node and compute nodes on a local LAN; clusters are interconnected by 10Gbps links]

SHARCNET Facilities
• Systems: clusters, shared memory (SMP)
  – Architectures: Opteron, Itanium, Xeon, Alpha, PowerPC
    • recommended compilers: Pathscale/PGI, Intel, Compaq
  – Intended Use: HPC task-specific
    • Parallel (capability, utility, SMP, etc.), Serial (throughput, etc.)
    • Visualization clusters
• Access Grid
  – multi-media video conferencing
  – collaboration, remote instruction, etc.
Facilities: Intended Use
http://www.sharcnet.ca/Facilities/index.php

Cluster                                       | CPUs | RAM/node | Storage | Interconnect | Intended Use
requin (Capability)                           | 1536 | 8 GB     | 70 TB   | Quadrics     | Resource-intensive MPI (fine grained, large mem.)
narwhal (Utility)                             | 1068 | 8 GB     | 70 TB   | Myrinet      | MPI, small-scale SMP
whale (Throughput)                            | 3072 | 4 GB     | 70 TB   | GigE         | Serial
bull (SMP-friendly)                           | 384  | 32 GB    | 70 TB   | Quadrics     | High RAM/BW MPI & small-scale SMP
silky (SMP)                                   | 128  | 256 GB   | 4 TB    | NUMAlink     | Large memory/medium-scale SMP
bala, bruce, dolphin, megaladon, tiger, zebra | 128  | 8 GB     | 4 TB    | Myrinet      | General purpose

Software Resources
• OS - Linux
  – HP XC 3.0 on all major Opteron-based clusters
  – SGI Linux on silky
• Compilers
  – Opteron: Pathscale (pathcc, pathCC, pathf90), PGI (pgcc, pgCC, pgf77, pgf90)
  – Alpha: Compaq (ccc, cxx, fort)
  – Itanium/Xeon: Intel (icc, ifort)
• Scheduler
  – LSF, SQ
• Parallel development support
  – MPI (HPMPI, OpenMPI, MPICH)
  – Threads (pthreads, OpenMP)
Software Resources (cont.)
• Libraries
– ACML (AMD), CXML (Alpha), SCSL (SGI), Atlas, GSL, ScaLAPACK, FFTW, PETSc, …
• Debugging/Profiling Tools
– Debugging (DDT, gdb, …)
– Profiling/Optimization (OPT, gprof, …)
• Application packages
– R, Gromacs, BLAST, NWChem, Octave
Software Resources (cont.)
• Commercial packages
– bring your own license to SHARCNET (such as Lumerical)
– for software with large user groups (such as Gaussian, Fluent, etc.), SHARCNET may purchase the license; user groups are asked to share the cost
• Others
– you ask/provide, we can install or advise
• More information:
http://www.sharcnet.ca/Facilities/software/softwarePage.php
Support: Info on the Web
• http://www.sharcnet.ca
  – for general information: SHARCNET FAQ
    • http://www.sharcnet.ca/Help/faq.php
  – up-to-date facility information: facilities page
    • http://www.sharcnet.ca/Facilities/index.php
  – online problem tracking system (login required)
    • https://www.sharcnet.ca/Portal/problems/problem_search.php
  – support personnel:
    • http://www.sharcnet.ca/Help/index.php

Support: People
• HPC Software Analysts
  – central resource for user support
  – software/development support
  – analysis of requirements
  – development/delivery of educational resources
  – research computing consultations
• System Administrators
  – user accounts
  – system software
  – hardware and software maintenance

Web Portal Features: Job Statistics
Web Portal Features: Problem Reporting
• Online problem tracking system to submit questions/problems
• Problems will be handled by appropriate staff
• Record of the problem/resolution is tracked for future reference
• User can add/modify comments while the problem remains open

Access Grid
• AG rooms on all SHARCNET sites
• Currently live:
  – UWO, Waterloo, Guelph, McMaster
  – Usage: meetings, seminars/workshops, remote collaboration, etc.
SHARCNET and You
• User involvement:
  – efficient use of resources
  – contribution of hardware/software
  – acknowledge SHARCNET in publications
  – report on publications (supports our own reporting obligations)

SHARCNET Essentials
Login and File Transfers
• Login: only by SSH
  – domain name of SHARCNET systems:
    • external: machine.sharcnet.ca
    • internal: machine.sharcnet
  – Linux: ssh username@machine.sharcnet.ca
  – Windows: ssh available as part of Cygwin tools (http://www.cygwin.com), and many graphical clients available (PuTTY, etc.)
• File Transfer:
  – Linux: scp filename machine.sharcnet.ca:/scratch/username
  – Windows: ssh file transfer, e.g. via a graphical client

File Systems
• Policies
  – same username/password across all systems (including webportal)
  – user self-management on webportal (site leader, sponsor, group member)
  – common home directory across SHARCNET (exceptions: wobbe, cat)
  – SHARCNET-maintained software is in /opt/sharcnet
  – /home is backed up
File Systems (cont.)

Pool     | Quota | Expiry  | Purpose
/home    | 200MB | none    | source, small config files
/work    | none  | none    | active data files
/scratch | none  | none    | active data files
/tmp     | 160GB | 10 days | node-local scratch

• /scratch and /work are local to each cluster
  – not backed up
  – important: run jobs from /scratch or /work (performance!)

Compilers
• Opteron clusters:
  – Pathscale (pathcc, pathCC, pathf90) [default]
  – Portland Group (PGI) (pgcc, pgCC, pgf77/pgf90)
• Intel Itanium/Xeon clusters (silky)
  – Intel (icc, ifort)
• Alpha cluster (greatwhite)
  – Compaq (ccc, cxx, fort)
• Development/misc clusters (mako, spinner, coral)
  – GNU (gcc, g++, g77/gfortran)
Compilers: “compile” script
• SHARCNET provides a “compile” script (use it as a “generic” compiler)
  – tries to do what it “should” - optimizations, compiler, etc.
  – recommended unless you know better

Command       | Language      | Extension                 | Example
cc            | C             | .c                        | cc -o code.exe code.c
CC            | C++           | .C, .cc, .cpp, .cxx, .c++ | CC -o code.exe code.cpp
c++, cxx      | C++           | (as CC)                   |
f77           | Fortran 77    | .f, .F                    | f77 -o Fcode.exe Fcode.f
f90/f95       | Fortran 90/95 | .f90, .f95, .F90, .F95    | f90 -o Fcode.exe Fcode.f90
mpicc         | C             | .c                        | mpicc -o mpicode.exe mpicode.c
mpiCC         | C++           | (as CC)                   | mpiCC -o mpicode.exe mpicode.cc
mpif77        | Fortran 77    | (as f77)                  | mpif77 -o mpicode.exe mpicode.f
mpif90/mpif95 | Fortran 90/95 | (as f90/f95)              | mpif90 -o mpicode.exe mpicode.f90

Compilers: Commonly Used Options
• e.g. Pathscale:

-c         Do not link; generate an object file only.
-o file    Write output to the specified file instead of the default.
-Ipath     Add path to the search path for include files.
-llibrary  Link against the library named liblibrary.a or liblibrary.so, such as -lmpi, -lacml.
-Lpath     Add path to the search path for libraries.
-g[N]      Specify the level of debugging support produced by the compiler.
  -g0      No debugging information for symbolic debugging is produced. This is the default.
  -g2      Produce additional debugging information for symbolic debugging.
-O[n]      Optimization level, n=0 to 3. Default is -O2.
  -O0      Turns off all optimizations.
  -O1      Turns on local optimizations that can be done quickly.
  -O2      Turns on extensive optimization. This is the default.
  -O3      Turns on aggressive optimization (e.g. the loop nest optimizer).
  -Ofast   Equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno.
-pg        Generate extra code to produce profile information suitable for the analysis program pathprof.
-Wall      Enable most warning messages.

• There are minor differences between compilers (see the man page for details)

Compilers: Examples
• Basic:
  pathcc -o executable source.c
• Add optimizations:
  pathcc -Ofast -o executable source.c
• Specify additional include paths:
  pathcc -I/work/project/include -c source.c
• Link to the ACML library (requires both -L and -l):
  pathf90 -L/opt/sharcnet/acml/current/pathscale64/lib -o executable source.f90 -lacml
• SHARCNET defaults (i.e. the compile script):
  – Pathscale: -Ofast
  – Compaq: -fast
  – Intel: -O2 -xW -tpp
• Make is recommended for all compilation

Software/Library Usage
• Optimized Math Libraries (architecture specific):
  – ACML (AMD)
  – CXML/CPML (Alpha)
  – MKL (Intel)
  – SCSL (SGI)
  – Atlas, ScaLAPACK, GSL
• Managed software typically includes itself in your environment automatically, so simply add the library link:
  f90 -o executable source.f90 -lacml
• If header files are needed, the -I compile option may be required:
  cc -I/opt/sharcnet/acml/current/pathscale64/include -o executable source.c -lacml
• List and detailed usage online:
  – http://www.sharcnet.ca/Facilities/software/softwarePage.php
Submitting Jobs
• Log on to the desired cluster/machine
  – ensure files are in /scratch/username (or /work)
  – do not run jobs out of /home --- it can be very slow
  – jobs are submitted using the SQ system (wrapper, e.g. around LSF)
• Available Queues:
  – NOTE: all jobs are submitted to queues and run in batch mode
  – no interactive jobs are permitted, except on mako (devel. cluster)

Job Type | Queue    | CPUs (default=1)
Parallel | mpi      | >2
Parallel | threaded | machine dependent: 2 (requin, wobbe, cat); 2-4 (most clusters); >2 (silky)
Serial   | serial   | 1

Job Scheduling
• All clusters have the same basic queues (mpi, threaded, serial) but relative priority is cluster specific
  – note: the test queue is for debugging - limited CPU time
  – keep in mind, fairshare will influence your “effective” priority
• Consider narwhal (utility cluster):

nar317:~% bqueues
QUEUE_NAME  PRIO  STATUS       MAX JL/U JL/P JL/H NJOBS  PEND  RUN  SUSP
staff        150  Open:Active    -    -    -    -     0     0    0     0
test         100  Open:Active    -    -    -    -     0     0    0     0
threaded      80  Open:Active    -    -    -    -    68    40   28     0
mpi           80  Open:Active    -    -    -    -  1738  1080  658     0
serial        40  Open:Active    -    -    -    -   257     0  257     0

Job Scheduling (cont.)
• Contrast with whale (throughput) and bull (big mem/smp-friendly)

wha783:~% bqueues
QUEUE_NAME  PRIO  STATUS       MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
staff        150  Open:Active    -    -    -    -     0     0     0     0
test         100  Open:Active    -    -    -    -     0     0     0     0
serial        80  Open:Active    -    -    -    -  1385     0  1385     0
threaded      80  Open:Active    -    -    -    -   112     0   112     0
mpi           40  Open:Active    -    -    -    -   768     0   760     8

bl124:~% bqueues
QUEUE_NAME  PRIO  STATUS       MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
staff        150  Open:Active    -    -    -    -     0     0     0     0
test         100  Open:Active    -    -    -    -     0     0     0     0
gaussian     100  Open:Active    -    -    -    -    16     4    12     0
threaded      80  Open:Active    -    -    -    -    36    16    20     0
mpi           80  Open:Active    -    -    -    -   328    44   284     0
serial        40  Open:Active    -    -    -    -    81    13    68     0

Commonly Used SQ/LSF Commands
• bqueues
  – status of available queues
• sqsub
  – submit a job for execution on the system
• sqjobs
  – list status of all submitted jobs
• sqkill
  – kill a job by job ID
• bhist
  – display history of jobs
sqsub (submit job)

sqsub [-o ofile][-i ifile][-t][-q queue][-n ncpus] command …

Options:
-q queue      queue name (serial, threaded, mpi - default: serial)
-n ncpus      execute on n cpus (default: 1)
-i ifile      job reads stdin from ifile
-o ofile      job sends stdout to ofile (default: LSF e-mail)
-e efile      job sends stderr to efile (default: ofile)
-t or --test  test mode (short but immediate)

Examples:
sqsub -q mpi -n 4 -o mpi_hello.log ./mpi_hello
sqsub -q threaded -n 4 -o pthread_hello.log ./pthread_hello
sqsub -q serial -o hello.log ./hello
sqsub -q mpi -t -n 4 -o mpi_hello.log ./mpi_hello

sqjobs (view job status)

sqjobs [-r][-q][-z][-v][-u user][-n][--summary][jobid]

Commonly used options:
-a or --all   show all jobs: all users, all states
-r            show running jobs
-q            show queued jobs
-z            show suspended/pre-empted jobs
-u user       show jobs for the specified user
--summary     show a line-per-user summary
-h or --help  show usage
--man         show man page
jobid         one or more jobids to examine

e.g.
nar317:~/pub/exercises% sqjobs
nar317:~/pub/exercises% sqsub -q mpi -t -n 4 -o mpi_hello.log ./mpi_hello
Job <134227> is submitted to queue <test>.
nar317:~/pub/exercises% sqjobs
jobid  queue state ncpus nodes time command
------ ----- ----- ----- ----- ---- -----------
134227 test  Q         4       3s   ./mpi_hello
2972 CPUs total, 0 idle, 2972 busy; 2020 jobs running; 0 suspended, 269 queued.
nar317:~/pub/exercises% sqkill 134227
Job <134227> is being terminated
nar317:~/pub/exercises% sqjobs
nar317:~/pub/exercises%

sqkill (kill a job)

sqkill jobid [jobid][…]

• You may need to wait a few seconds for the system to kill a job

bhist (display job history)

bhist [-a | -d | -p | -r | -s][-b | -w][-l][-t] …

Options:
-a       displays finished and unfinished jobs (overrides -d, -p, -s and -r)
-b       brief format; if used with the -s option, shows the reason why jobs were suspended
-d       only display finished jobs
-l       long format; displays additional information
-u user  display jobs submitted by the specified user
bhist example

nar317:~/pub/exercises% bhist -a
Summary of time in seconds spent in various states:
JOBID  USER JOB_NAME PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
134177 dbm  *o_mpi_c    8     0  37     0     0     0    45
134227 dbm  *o_mpi_c   10     0  10     0     0     0    20

nar317:~/pub/exercises% bhist -l 134177
Job <134177>, User <dbm>, Project <dbm>, Job Group </dbm/dbm>,
Command </opt/hpmpi/bin/mpirun -srun -o mpi_hello.log ./mpi_hello>
Fri Sep 15 13:06:08: Submitted from host <wha780>, to Queue <test>, CWD <$HOME/scratch/examples>, Notify when job ends, 4 Processors Requested, Requested Resources <type=any>;
Fri Sep 15 13:06:16: Dispatched to 4 Hosts/Processors <4*lsfhost.localdomain>;
Fri Sep 15 13:06:16: slurm_id=318135;ncpus=4;slurm_alloc=wha2;
Fri Sep 15 13:06:16: Starting (Pid 29769);
Fri Sep 15 13:06:17: Running with execution home </home/dbm>, Execution CWD </scratch/dbm/examples>, Execution Pid <29769>;
Fri Sep 15 13:06:53: Done successfully. The CPU time used is 0.3 seconds;
Fri Sep 15 13:06:57: Post job process done successfully;

Summary of time in seconds spent in various states by Fri Sep 15 13:06:57
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
   8     0  37     0     0     0    45
LSF E-mail
• Job results are returned via your profile-specified e-mail address
  – LSF summary
  – output from all processes
  – change this behaviour using the -o, -e options to sqsub
• Default (on or off?)
  – there is an option to turn LSF e-mail off/on in sqsub
DDT: Distributed Debugging Tool
• A powerful debugger for parallel programs with an advanced GUI
  – works best with MPI programs, but can also be used for threaded and serial jobs
  – supports C, C++ and many flavours of Fortran (77, 90, 95)
  – features: one-click access to all processes, advanced source browsing, syntax highlighting
• Installed on requin, narwhal, bull and six PoP clusters
• To use DDT:
  ddt program [arguments]
  – then choose the number of processes to run and press “Submit”
  – DDT itself invokes the scheduler using the test queue
    • the debugging session starts almost immediately, but has a 1 hour time limit

High Performance Computing
High Performance Computing
• Definition is nebulous
  – resource (processing) intensive computation
  – computing where the need for speed is compelling
    • computing nearer the limit of what is feasible
  – parallel computing (this is too strict)
• In reality, HPC is concerned with varied issues involving:
  – hardware
    • pipelining, instruction sets, multi-processors, inter-connects
  – algorithms
    • efficiency, techniques for concurrency
  – software
    • compilers (optimization/parallelization), libraries

Problems Faced
• Hardware
  – in order to facilitate processors working together they must be able to communicate
  – interconnect hardware is complex
    • sharing memory is easy to say, harder to realize as the system scales
    • communication over any kind of network is still painfully slow compared to bus speed --- overhead can be significant
• Software
  – parallel algorithms are actually fairly well understood
  – the realization of algorithms in software is non-trivial
  – compilers
    • automated parallelism is difficult
  – design
    • portability and power are typically at odds with each other
Building Parallel Machines
• Symmetric Multiprocessors (SMP)
  – shared memory with uniform memory access (UMA)
  – issues of scalability
    • accessing memory is complex with multiple processors
    • crossbar switches can only get so big in practice
[Diagram: processors p1, p2, …, pn sharing a single memory. Adapted from “Using MPI: Portable Parallel Programming with the Message Passing Interface” (2e), Gropp, Lusk and Skjellum, 1999.]

Building Parallel Machines (cont.)
• Clusters
  – simple (relatively) nodes connected by a network
  – inter-process communication is explicit (e.g. message passing)
  – much better scalability at the cost of communication overhead (performance)
• Non-Uniform Memory Access (NUMA)
  – processors can see all of memory, but cannot access it all at the same speed
  – some memory is local (fast), other memory is global (slower)
[Diagram: nodes connected by a network. Adapted from “Using MPI: Portable Parallel Programming with the Message Passing Interface” (2e), Gropp, Lusk and Skjellum, 1999.]

Sequential Computing
• Traditional model of computing
  – von Neumann architecture
  – one processor + one memory + bus
  – processor executes one instruction at a time
    • where does the notion of pipelining fit in to this?
• Characteristics of sequential algorithms
  – statement of a step-wise solution to a problem using simple instructions and the following three types of operation:
    1. sequence
    2. iteration
    3. conditional branching

Parallel Computing
• The general idea is that if one processor is good, many processors will be better
• Parallel programming is not generally trivial
  – tools for automated parallelism are either highly specialized or absent
• Many issues need to be considered, many of which don’t have an analog in serial computing
  – data vs. task parallelism
  – problem structure
  – parallel granularity
Data Parallelism
• Data is distributed (blocked) across processors
  – easily automated with compiler hints (e.g. OpenMP, HPF; see the sketch after this slide)
  – code otherwise looks fairly sequential
  – benefits from minimal communication overhead
• e.g. Vector Machines
  – a single CPU with a number of subordinate ALUs, each with its own memory
  – operate on an array of similar data items during a single operation
    • each cycle: load in parallel from all memories into the ALUs and perform the same instruction on their local data item
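A minimal sketch of the compiler-hint style mentioned above (illustrative only; it uses an OpenMP directive, and OpenMP itself is covered later in this talk). Note how the loop body keeps its sequential shape:

#include <stdio.h>

#define N 1000000

double a[N], b[N], c[N];

int main(void)
{
    int i;
    /* the directive asks the compiler to split the iterations
       across threads; each iteration touches independent data */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];
    printf("c[0] = %f\n", c[0]);
    return(0);
}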
Task Parallelism
• Work to be done is decomposed across processors
  – e.g. divide and conquer, recursive doubling, etc.
  – each processor is responsible for some part of the algorithm
  – communication mechanism is significant
• Must be possible for different processors to be performing different tasks
  – shared memory
    • multi-threading, SMP
  – distributed memory
    • network of workstations, message passing, remote memory operations

Problem Structure
• Structure of the problem dictates the ease with which we can implement parallel solutions
  easy: perfectly (painful) parallelism - independent calculations
        pipeline parallelism - overlap otherwise sequential work
        synchronous parallelism - parallel work is well synchronized
  hard: asynchronous parallelism - dependent calculations; parallel work is loosely synchronized

Parallel Granularity
• A measure of the amount of processing performed before communication between processes is required
• Fine-grained parallelism
  – constant communication necessary
  – best suited to shared memory environments
• Coarse-grained parallelism
  – significant computation performed before communication is necessary
  – ideally suited to message-passing environments
• Perfect parallelism
  – no communication necessary
Flynn’s Taxonomy (1966)
• Flynn characterized systems by
  – number of instruction streams
  – number of data streams

                       | Data: single                               | Data: multiple
Instructions: single   | SISD (1 processor, 1 program; von Neumann) | SIMD (1 CPU / many ALUs, each with memory; synchronous)
Instructions: multiple | MISD (odd concept; vector processors (?))  | MIMD (multiple CPUs, each with memory; asynchronous)

SISD
• Single Instruction Single Data (SISD)
• Traditional model of sequential computation
  – one processor
  – one memory space
• Processor executes one instruction at a time
• Sequential algorithms state a step-wise solution to a problem using simple instructions and only the following three classes of operation
  – sequence
  – iteration
  – branching
SIMD
• Single Instruction Multiple Data (SIMD)
  – one processor (conceptually)
  – multiple memories
    • or concurrent access to a single shared memory
  – all ALUs execute the exact same instruction
  – e.g.
    • Vector machines
    • AltiVec (Motorola, IBM, Apple)
    • MMX/SSE/SSE2/SSE3 (Intel)
[Diagram: SIMD parallelism: one CPU executing a := b + c (as in: int foo() { a = b + c; return(a); }) drives ALU1 (b: 5, c: 2), ALU2 (b: 8, c: -1) and ALU3 (b: 1, c: 4) simultaneously]
SIMD (cont.)
• Ideally suited to data parallel problems
  – solution requires we apply the same operation/algorithm to data elements that can be operated upon independently
    • expectation of high independence of operations (ideally perfectly parallel data)
  – e.g. vector arithmetic
    • compute <c> := <a> + <b>
    • on each processor i:
      – load elements ai and bi
      – execute ai + bi
      – store result ci

SIMD (cont.)
• Still amenable where communication patterns are highly regular and involve the same operations
  – e.g.
    • Euclidean inner product
    • summation
    • AND/OR
    • max/min
    • etc.
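As a concrete (hypothetical) illustration of the <c> := <a> + <b> example using one of the SIMD instruction sets named earlier (SSE), the sketch below applies one add instruction to four floats at a time; the array names and length are illustrative:

#include <xmmintrin.h>   /* SSE intrinsics */

#define N 1024

float a[N], b[N], c[N];

void vector_add(void)
{
    int i;
    for (i = 0; i < N; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);          /* load a[i..i+3]  */
        __m128 vb = _mm_loadu_ps(&b[i]);          /* load b[i..i+3]  */
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb)); /* store c[i..i+3] */
    }
}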
MISD
• Multiple Instruction Single Data (MISD)
  – unusual model to conceptualize
    • some argue there is no such thing
  – multiple operations applied to a single stream of data
    • contrast with SIMD
  – e.g. pipelining
    • operations naturally divided into steps that make use of different resources
    • issues: branch prediction, instruction stalls, flushing a pipeline

MISD (cont.)
[Diagram: pipelining as a MISD model: instructions i1, i2, i3 from the instruction stream move through the fetch, decode, execute and writeback stages in overlapping time steps, each stage using different processor resources]
MIMD
• Multiple Instruction Multiple Data (MIMD)
  – many processors executing different programs
• Memory organization is significant
  – Distributed memory
    • unique memory associated with each processor
    • issues: communication overhead
  – Shared memory
    • processors share a single physical memory
    • programs can share blocks of memory between them
    • issues: exclusive access, race conditions, synchronization, scalability
• Most powerful and general model for parallel computing
  – it should be clear that SIMD models are a subset of MIMD
  – required to achieve task parallelism

MIMD (Distributed Memory)
[Diagram: distributed memory MIMD: three CPUs, each with its own private memory, run different code. CPU1 executes a = b + c in int foo() with local b: 5, c: 2; CPU2 executes if (x > b) in void bar() with its own local x: 5, b: 5; CPU3 calls f() from int main()]

MIMD (Shared Memory)
[Diagram: shared memory MIMD: three CPUs share one memory holding b: 5, c: 2, x: 5. CPU1 executes a = b + c in int foo(); CPU2 executes if (x > 1) in void bar(); CPU3 calls f() from int main()]

MIMD (cont.)
• Unrestricted, unsynchronized parallelism can be difficult to design, test and deploy in practice
  – it is more common for the solution to a problem to be somewhat regular
    • natural points of synchronization exist even where the behaviour across processes diverges
  – typical MIMD solutions involve partitioning the solution space and have only processes on partition boundaries communicate locally with one another (sub-problems are all nearly identical)
SPMD
• Single Program Multiple Data (SPMD)
  – special case of MIMD
  – many processors executing the same program
    • conditional branches used where specific behaviour is required on the processors
  – shared/distributed memory organization

SPMD (cont.)
[Diagram: SPMD illustrating conditional branching to control divergent behaviour: all three CPUs run the same program, int main() { if cpu_1 b = 5; else b = 3; }. CPU1 executes b := 5 (b: 5, c: 2) while CPU2 and CPU3 execute b := 3 (b: 3, c: 2)]
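In MPI (introduced in the next section), the same conditional-branching idea is usually expressed by testing the process rank. The program below is a minimal sketch of this, not from the original slides:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, b;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* which copy of the program am I? */
    if (rank == 0)
        b = 5;   /* behaviour specific to process 0 */
    else
        b = 3;   /* every other process takes this branch */
    printf("process %d: b = %d\n", rank, b);
    MPI_Finalize();
    return(0);
}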
Communication
• Communication plays a key role in evaluating parallel models
• Sockets
  – standard model for network communication
  – issues of network traffic (if shared), slow interface
  – will not leverage a dedicated interconnect without effort
• Message Passing
  – explicit communication between processes
  – e.g. Message Passing Interface (MPI)
    • standardized libraries for message passing in MIMD environments
• Shared Memory
  – implicit communication via common memory blocks
  – e.g. OpenMP, pthreads

Programming for Clusters using MPI
Message Passing Interface (MPI)
• Library providing message passing support for parallel/distributed applications
  – not a language: a collection of subroutines (Fortran), functions/macros (C)
  – explicit communication between processes
• Advantages
  – standardized
  – scalability generally good
  – memory is local to a process (debugging/performance)
• Disadvantages
  – more complex than implicit techniques
  – communication overhead

MPI Programming Basics
• Include MPI header file
  – C
    • #include "mpi.h"
  – Fortran
    • include "mpif.h"
• Compile with MPI library
  – mpicc
  – mpif90
  – -lmpi [-lgm] …
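Combining this with the earlier sqsub slides, a typical build-and-submit session for an MPI code would look like the following (file names are illustrative):

  mpicc -o mpi_hello mpi_hello.c
  sqsub -q mpi -n 4 -o mpi_hello.log ./mpi_hello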
Core Functions
• MPI_Init()
• MPI_Finalize()
• MPI_Comm_rank()
• MPI_Comm_size()
• MPI_Send()
• MPI_Recv()

Example: MPI “Hello, world!”

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello, world! from process %d of %d\n", rank, size);

    MPI_Finalize();
    return(0);
}
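MPI_Send() and MPI_Recv() are listed above but not used in the hello-world example. The following minimal sketch (hypothetical, assuming the job runs with at least two processes) shows a basic point-to-point exchange:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;  /* illustrative payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return(0);
}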
Exercise
1) The mpi_hello.c file in ~dbm/public/exercises/intro is a copy of the one used in the previous example
2) Compile this program and submit it for execution on one of the SHARCNET clusters (try 2, 4, 8, 16 processors)
  • what compiler will you use? how do you submit the job?
  • when the job completes, where are the results available?
3) Food for thought…
  • Modify this program so that each process i sends its text string to process (i + 1) % n, where n is the number of processes
    – the receiving process should make use of the message it received in its output, to verify you are sending the data correctly
    – draw a diagram illustrating the order of events implied by your code

Programming Shared Memory Machines using OpenMP/pthreads
Processes
• DEF’N: a process is a program in execution, as seen by the operating system
• A process imposes a specific set of management responsibilities on the OS
  – a unique virtual address space (stack, heap, code)
  – file descriptor table
  – program counter
  – register states, etc.
[Diagram: conceptual view of a process (simplified): a virtual address space containing heap and stack, plus a file descriptor table, register states and a program counter]

Threads
• DEF’N: a thread is a sequence of executable code within a process
• A serial process can be seen, at its simplest, as a single thread (a single “thread of control”)
  – represented by the program counter
  – sequence (increment PC), iteration/conditional branching (set value of PC)
• In terms of record-keeping, only a small subset of a process is relevant when considering a thread
  – register states; program counter
[Diagram: conceptual view of a thread (simplified): the thread within a process’s virtual address space, with its own stack, register states and program counter]
Multi-threading
• Distribute work by defining multiple threads to do the work
  – e.g. OpenMP, pthreads
[Diagram: a main thread spawning thread_1, thread_2 and thread_3]
• Advantages
  – all process resources are implicitly shared (memory, file descriptors, etc.)
  – overhead incurred to manage multiple threads is relatively low
  – looks much like serial code
• Disadvantages
  – all data being implicitly shared creates a world of hammers, and your code is the thumb
  – exclusive access, contention, etc.

OpenMP Basics
• Uses compiler directives, library routines and environment variables to automatically generate threaded code
• Available for FORTRAN, C/C++
• Compile with OpenMP support
  – f90 -openmp …
  – note: C typically requires a header file
    • #include "omp.h"

Example: OpenMP “Hello, world!”

PROGRAM HELLO
INTEGER ID, NTHRDS
INTEGER OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS

!$OMP PARALLEL PRIVATE(ID)
ID = OMP_GET_THREAD_NUM()
PRINT *, 'HELLO, WORLD! FROM THREAD', ID
!$OMP BARRIER
IF (ID .EQ. 0) THEN
  NTHRDS = OMP_GET_NUM_THREADS()
  PRINT *, 'THERE ARE', NTHRDS, 'THREADS'
END IF
!$OMP END PARALLEL
END
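For C programmers, an equivalent program is sketched below (a minimal sketch, not from the original deck; build it with your compiler's OpenMP flag, in the spirit of the f90 -openmp line above):

#include <stdio.h>
#include "omp.h"

int main(void)
{
    #pragma omp parallel
    {
        int id = omp_get_thread_num();   /* private to each thread */
        printf("Hello, world! from thread %d\n", id);
        #pragma omp barrier
        if (id == 0)
            printf("There are %d threads\n", omp_get_num_threads());
    }
    return(0);
}

It would be submitted to the threaded queue, e.g. sqsub -q threaded -n 4 -o openmp_hello.log ./openmp_hello (following the sqsub examples earlier).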
Exercise
1) The openmp_hello.f90 file in ~dbm/public/exercises/intro is a copy of the one used in the previous example
2) Compile this program and submit it for execution on one of the SHARCNET clusters (try 1, 2, 3, 4 processors --- on silky you could submit this on up to 128 processors in principle)
  • how will you compile this program? how do you submit the job?
  • how are the results of the job returned to you?
3) Food for thought
  • compare what is happening in this code, as compared to an equivalent program written using pthreads (next section) --- which do you think will be more efficient? which do you think is easier to code?
  • is pthreads able to offer you anything OpenMP is not?
Pthreads Basics
• Include the Pthreads header file
  – #include "pthread.h"
• Compile with Pthreads support/library
  – cc -pthread …
    • compiler vendors may differ in their usage to support pthreads (link a library, etc.)
    • the Pathscale, Intel and gcc compilers all use the -pthread argument, so this will suffice for the moment
    • when in doubt, consult the man page for the compiler in question

Pthreads Programming Basics (cont.)
• Note that all processes have an implicit “main thread of control”
• We need a means of creating a new thread
  – pthread_create()
• We need a way of terminating a thread
  – threads terminate implicitly when the function that was the entry point of the thread returns, or explicitly via pthread_exit()
• We may need to distinguish one thread from another at run-time
  – pthread_self(), pthread_equal()

Example: pthreads “Hello, world!”

#include <stdio.h>
#include <stdlib.h>      /* for atoi() */
#include "pthread.h"

void *output(void *);

int main(int argc, char *argv[])
{
    int id, nthreads = atoi(argv[1]);
    int num[nthreads];               /* one id per thread: avoids all threads
                                        sharing (and racing on) one variable */
    pthread_t thread[nthreads];

    for (id = 0; id < nthreads; id++) {
        num[id] = id;
        pthread_create(&thread[id], NULL, output, &num[id]);
    }
    for (id = 0; id < nthreads; id++)
        pthread_join(thread[id], NULL);  /* wait, or main may exit first */
    return(0);
}

void *output(void *arg)
{
    printf("Hello, world! from thread %d\n", *(int *)arg);
    return(NULL);
}

Exercise
1) The pthread_hello.c file in ~dbm/public/exercises/intro is a copy of the one used in the previous example
2) Compile this program and submit it for execution on one of the SHARCNET clusters (try 1, 2, 3, 4 processors --- on silky you could submit this on up to 128 processors in principle)
  • how will you compile this program? how do you submit the job?
  • how are the results of the job returned to you?
3) Food for thought…
  • Modify this program so that each thread returns its thread id number, with the main thread summing these values and printing the result
    o if you add nothing to the main routine other than the print statement, what happens when you run it? what additional information do you need to make this work as intended?
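Hint for the food-for-thought question: the key piece of "additional information" is pthread_join()'s second argument, which collects a thread's exit value. A minimal, hypothetical sketch:

#include <stdio.h>
#include "pthread.h"

#define NTHREADS 4

void *worker(void *arg)
{
    long id = (long)arg;        /* the id was passed by value in the pointer */
    pthread_exit((void *)id);   /* return the id as the thread's exit value */
}

int main(void)
{
    pthread_t thread[NTHREADS];
    void *status;
    long id, sum = 0;

    for (id = 0; id < NTHREADS; id++)
        pthread_create(&thread[id], NULL, worker, (void *)id);
    for (id = 0; id < NTHREADS; id++) {
        pthread_join(thread[id], &status);  /* blocks until the thread exits */
        sum += (long)status;
    }
    printf("sum of thread ids: %ld\n", sum);  /* 0+1+2+3 = 6 */
    return(0);
}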
Parallel Design Patterns

Parallel Design Basics
• In principle, a parallel solution could be arbitrarily asynchronous, with little structure to the code
• The reality is that most parallel algorithms draw from a relatively small set of basic designs
  – well understood
  – easier to implement and debug
  – reasonably general
    • most are applicable to both threaded and message-passing environments
• Always consider these common patterns when thinking about a parallel implementation of your problem

Fork/Join Parallelism
• Multiple processes/threads are created, executing the same or different routines concurrently
  – also referred to as a Peer model
    • classic model of data parallelism
  – synchronization upon exit is a common requirement
[Diagram: a main thread forking thread_1, thread_2, thread_3, …, thread_N, which later join back into the main thread]

Master/Slave Parallelism
• Processes/threads are created on demand as the main process requires work to be done
  – appropriate threads are created to service tasks, and run concurrently once created
  – this is difficult to realize on clusters running batch schedulers, as there is no support for dynamic process creation
    • you would be more likely to use the process/thread pool: worker processes are created when the job starts, and are assigned work from a queue
[Diagram: a main thread creating thread_taskA and thread_taskB as needed]
Process/Thread Pool
• A number of processes/threads are created, capable of doing work
  – the main process/thread adds jobs to a queue of work to be done and signals slaves that there is work available
  – slave processes/threads consume tasks from the shared queue and perform the work
[Diagram: a main thread adding jobs to a queue as needed; thread_1, thread_2, … consume tasks from the queue]

Pipeline Parallelism
• Multiple processes/threads are created, each performing a discrete stage in some execution sequence
  – new input is fed to the first worker, and the output from each stage is passed to the next one
  – after N time steps, the pipeline is filled and a result is produced every time step, with all stages processing concurrently
  – buffering and synchronization are significant here
[Diagram: a main thread feeding thread_stage1, whose output passes to thread_stage2 and onward to thread_stageN]