Fall Workshop 2006
Introduction to SHARCNET and High Performance Computing

SHARCNET
• Vision: What is SHARCNET?
  – To establish a world-leading, multi-university and college, interdisciplinary institute with an active academic-industry partnership, enabling forefront computational research in critical areas of science, engineering and business.
• Mission:
  – SHARCNET exists to enable world-class computational research so as to accelerate the production of research results.

SHARCNET…in brief
• General Objectives:
  – provide otherwise unattainable computing resources
  – build common, seamless computing environments
  – promote remote collaboration and research
• SHARCNET Web Portal: http://www.sharcnet.ca

Academic and Affiliated Partners
• Founding members (June 2001): The University of Western Ontario, University of Guelph, McMaster University, Wilfrid Laurier University, University of Windsor, Fanshawe College, Sheridan College
• New Partners (June 2003): University of Waterloo, Brock University, University of Ontario Institute of Technology, York University
• New Partners (Dec 2005): Trent University, Laurentian University, Lakehead University
• New Partners (March 2006): Ontario College of Art and Design, Perimeter Institute for Theoretical Physics
• Affiliated Partners: Robarts Research Institute, Fields Institute for Mathematical Sciences

Industrial and Government Partners
• Private Sector
  – Hewlett Packard
  – SGI
  – Quadrics Supercomputing World
  – Platform Computing
  – Nortel Networks
  – Bell Canada
• Government
  – Canada Foundation for Innovation
  – Ontario Innovation Trust
  – Ontario R&D Challenge Fund
  – Optical Regional Advanced Network of Ontario (ORANO)

SHARCNET Overview

Getting an Account
• FREE to academic researchers
  – http://www.sharcnet.ca/Help/account.php - "Getting an Account" link
  – http://www.sharcnet.ca/Portal/index.php - "New User" link
• Academic HPC research
  – research can be industry-related, but must be done in collaboration with an academic researcher
• Account approval:
  – faculty accounts are approved by the site leader
  – students/postdocs/fellows require a faculty sponsor
    • the sponsor approves such accounts directly
  – non-SHARCNET institution accounts are approved by the Scientific Director
• Apply for an account online (webportal)
  – you will have a webportal account that allows you to access information/files, submit requests and manage your own profile
• All systems are accessed via SSH (using the same account information)
  – e.g. ssh [email protected]

SHARCNET Environment
• Compute-intensive problems
  – not intended to replace desktop computing; SHARCNET users can productively conduct HPC research on a variety of systems, each optimized for specific HPC tasks
• Fair access
  – users have access to all systems
  – jobs are run in batch mode (scheduling system) with fairshare

SHARCNET Facilities
• Systems: clusters, shared memory (SMP)
  – Architectures: Opteron, Itanium, Xeon, Alpha, PowerPC
    • recommended compilers: Pathscale/PGI, Intel, Compaq
  – Intended use: HPC task-specific
    • Parallel (capability, utility, SMP, etc.), Serial (throughput, etc.)
• Visualization clusters
• Access Grid
  – multi-media video conferencing
  – collaboration, remote instruction, etc.
• Cluster of Clusters (COC)
  [diagram: each cluster has a login node and compute nodes on a local LAN; the clusters are linked to one another by a 10 Gbps network]
Facilities: Intended Use
http://www.sharcnet.ca/Facilities/index.php

Cluster                        CPUs  RAM/node  Storage  Interconnect  Intended Use
requin (Capability)            1536  8 GB      70 TB    Quadrics      resource-intensive MPI (fine grained, large memory)
narwhal (Utility)              1068  8 GB      70 TB    Myrinet       MPI, small-scale SMP
whale (Throughput)             3072  4 GB      70 TB    GigE          serial
bull (SMP-friendly)             384  32 GB     70 TB    Quadrics      high RAM/bandwidth MPI & small-scale SMP
silky (SMP)                     128  256 GB    4 TB     NUMAlink      large memory / medium-scale SMP
bala, bruce, dolphin,           128  8 GB      4 TB     Myrinet       general purpose
  megaladon, tiger, zebra

Software Resources
• OS - Linux
  – HP XC 3.0 on all major Opteron-based clusters
  – SGI Linux on silky
• Compilers
  – Opteron: Pathscale (pathcc, pathCC, pathf90), PGI (pgcc, pgCC, pgf77, pgf90)
  – Alpha: Compaq (ccc, cxx, fort)
  – Itanium/Xeon: Intel (icc, ifort)
• Scheduler
  – LSF, SQ
• Parallel development support
  – MPI (HPMPI, OpenMPI, MPICH)
  – Threads (pthreads, OpenMP)

Software Resources (cont.)
• Libraries
  – ACML (AMD), CXML (Alpha), SCSL (SGI), Atlas, GSL, ScaLAPACK, FFTW, PETSc, …
• Debugging/Profiling Tools
  – Debugging (DDT, gdb, …)
  – Profiling/Optimization (OPT, gprof, …)
• Application packages
  – R, Gromacs, BLAST, NWChem, Octave

Software Resources (cont.)
• Commercial packages
  – bring your license to SHARCNET (such as Lumerical)
  – for software with large user groups (such as Gaussian, Fluent, etc.), SHARCNET may purchase the license; user groups are asked to share the cost
• Others
  – you ask/provide, we can install or advise
• More information: http://www.sharcnet.ca/Facilities/software/softwarePage.php

Support: People
• HPC Software Analysts
  – central resource for user support
  – software/development support
  – analysis of requirements
  – development/delivery of educational resources
  – research computing consultations
• System Administrators
  – user accounts
  – system software
  – hardware and software maintenance

Support: Info on the Web
• http://www.sharcnet.ca
  – general information: SHARCNET FAQ
    • http://www.sharcnet.ca/Help/faq.php
  – up-to-date facility information: facilities page
    • http://www.sharcnet.ca/Facilities/index.php
  – online problem tracking system (login required)
    • https://www.sharcnet.ca/Portal/problems/problem_search.php
  – support personnel
    • http://www.sharcnet.ca/Help/index.php

Web Portal Features: Job Statistics
[screenshot: per-user and per-cluster job statistics in the webportal]

Web Portal Features: Problem Reporting
• Online problem tracking system to submit questions/problems
• Problems will be handled by appropriate staff
• A record of the problem/resolution is tracked for future reference
• Users can add/modify comments while the problem remains open

Access Grid
• AG rooms on all SHARCNET sites
• Currently live:
  – UWO, Waterloo, Guelph, McMaster
• Usage: meetings, seminars/workshops, remote collaboration, etc.

SHARCNET and You
• User involvement:
  – efficient use of resources
  – contribution of hardware/software
  – acknowledge SHARCNET in publications
  – report on publications (supports our own reporting obligations)

SHARCNET Essentials

Login and File Transfers
• Login: only by SSH
  – domain names of SHARCNET systems:
    • external: machine.sharcnet.ca
    • internal: machine.sharcnet
  – Linux: ssh [email protected]
  – Windows: ssh is available as part of the Cygwin tools (http://www.cygwin.com), and many graphical clients are available (PuTTY, etc.)
• File Transfer:
  – Linux: scp filename machine.sharcnet.ca:/scratch/username
  – Windows: graphical ssh file transfer clients
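Putting the pieces together, a typical Linux-side session might look like the following (a sketch only; the machine name, username and file name are placeholders):

  # from your workstation: copy input data to the cluster's scratch space
  scp data.tar.gz [email protected]:/scratch/username/
  # log in (the same password works on every system) and unpack
  ssh [email protected]
  cd /scratch/username
  tar xzf data.tar.gz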
Policies
• same username/password across all systems (including the webportal)
• user self-management on the webportal (site leader, sponsor, group member)
• common home directory across SHARCNET (exceptions: wobbe, cat)
• SHARCNET-maintained software is in /opt/sharcnet
• /home is backed up

File Systems

File system  Quota   Expiry   Purpose
/home        200 MB  none     source, small config files
/work        none    none     active data files
/scratch     none    none     active data files
/tmp         160 GB  10 days  node-local scratch

• /scratch and /work are local to each cluster
  – not backed up
  – important: run jobs from /scratch or /work (performance!)

Compilers
• Opteron clusters:
  – Pathscale (pathcc, pathCC, pathf90) [default]
  – Portland Group (PGI) (pgcc, pgCC, pgf77/pgf90)
• Intel Itanium/Xeon clusters (silky)
  – Intel (icc, ifort)
• Alpha cluster (greatwhite)
  – Compaq (ccc, cxx, fort)
• Development/misc clusters (mako, spinner, coral)
  – GNU (gcc, g++, g77/gfortran)

Compilers: "compile" script
• SHARCNET provides a "compile" script (use it as a "generic" compiler)
  – tries to do what it "should" - optimizations, compiler selection, etc.
  – recommended unless you know better

Command        Language       Extensions              Example
cc             C              .c                      cc -o code.exe code.c
CC, c++, cxx   C++            .C, .cc, .cpp, .cxx     CC -o code.exe code.cpp
f77            Fortran 77     .f, .F                  f77 -o Fcode.exe Fcode.f
f90/f95        F90/F95        .f90, .f95, .F90, .F95  f90 -o Fcode.exe Fcode.f90
mpicc          C              .c                      mpicc -o mpicode.exe mpicode.c
mpiCC          C++            .cc                     mpiCC -o mpicode.exe mpicode.cc
mpif77         Fortran 77     .f                      mpif77 -o mpicode.exe mpicode.f
mpif90/mpif95  Fortran 90/95  .f90                    mpif90 -o mpicode.exe mpicode.f90

Compilers: Commonly Used Options
• e.g. Pathscale:
  -c         Do not link; generate an object file only.
  -o file    Write output to the specified file instead of the default.
  -Ipath     Add path to the search path for include files.
  -llibrary  Search the library named liblibrary.a or liblibrary.so (e.g. -lmpi, -lacml).
  -Lpath     Add path to the search path for libraries.
  -g[N]      Specify the level of debugging support produced by the compiler:
             -g0  No debugging information for symbolic debugging (the default).
             -g2  Produce additional debugging information for symbolic debugging.
  -O[n]      Optimization level, n = 0 to 3 (default -O2):
             -O0    Turns off all optimizations.
             -O1    Turns on local optimizations that can be done quickly.
             -O2    Turns on extensive optimization (the default).
             -O3    Turns on aggressive optimization (e.g. the loop nest optimizer).
             -Ofast Equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno
  -pg        Generate extra profiling code suitable for the analysis program pathprof.
  -Wall      Enable most warning messages.
• There are minor differences between compilers (see the man pages for details)

Compilers: Examples
• Basic:
    pathcc -o executable source.c
• Add optimizations:
    pathcc -Ofast -o executable source.c
• Specify additional include paths:
    pathcc -I/work/project/include -c source.c
• Link to the ACML library (requires both -l and -L):
    pathf90 -L/opt/sharcnet/acml/current/pathscale64/lib -o executable source.f90 -lacml
• SHARCNET defaults (i.e. the compile script):
  – Pathscale: -Ofast
  – Compaq: -fast
  – Intel: -O2 -xW -tpp
• Make is recommended for all compilation
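Since Make is recommended, a minimal Makefile along the lines of the examples above might look like this (a sketch only; the source file, include path and library path reuse the placeholder values shown earlier):

  CC      = pathcc
  CFLAGS  = -Ofast -I/work/project/include
  LDFLAGS = -L/opt/sharcnet/acml/current/pathscale64/lib
  LIBS    = -lacml

  # link step
  executable: source.o
  	$(CC) $(LDFLAGS) -o executable source.o $(LIBS)

  # compile step
  source.o: source.c
  	$(CC) $(CFLAGS) -c source.c

  clean:
  	rm -f executable source.o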
Software/Library Usage
• Optimized math libraries (architecture specific):
  – ACML (AMD)
  – CXML/CPML (Alpha)
  – MKL (Intel)
  – SCSL (SGI)
  – Atlas, ScaLAPACK, GSL
• Managed software typically includes itself in your environment automatically, so simply add the library link:
    f90 -o executable source.f90 -lacml
• If header files are needed, the -I compile option may be required:
    cc -I/opt/sharcnet/acml/current/pathscale64/include -o executable source.c -lacml
• List and detailed usage online:
  – http://www.sharcnet.ca/Facilities/software/softwarePage.php

Submitting Jobs
• Log on to the desired cluster/machine
  – ensure your files are in /scratch/username (or /work)
  – do not run jobs out of /home --- it can be very slow
• Jobs are submitted using the SQ system (a wrapper, e.g. around LSF)
  – NOTE: all jobs are submitted to queues and run in batch mode
  – no interactive jobs are permitted, except on mako (the development cluster)

Job Scheduling
• All clusters have the same basic queues (mpi, threaded, serial), but relative priority is cluster specific
  – note: the test queue is for debugging - limited CPU time
  – keep in mind, fairshare will influence your "effective" priority
• Available queues:

  Job Type  Queue     CPUs (default = 1)
  Parallel  mpi       >2
  Parallel  threaded  machine dependent: 2 (requin, wobbe, cat); 2-4 (most clusters); >2 (silky)
  Serial    serial    1

• Consider narwhal (utility cluster):

  nar317:~% bqueues
  QUEUE_NAME  PRIO  STATUS       NJOBS  PEND   RUN  SUSP
  staff        150  Open:Active      0     0     0     0
  test         100  Open:Active      0     0     0     0
  threaded      80  Open:Active     68    40    28     0
  mpi           80  Open:Active   1738  1080   658     0
  serial        40  Open:Active    257     0   257     0

Job Scheduling (cont.)
• Contrast with whale (throughput) and bull (big memory/SMP-friendly):

  wha783:~% bqueues
  QUEUE_NAME  PRIO  STATUS       NJOBS  PEND   RUN  SUSP
  staff        150  Open:Active      0     0     0     0
  test         100  Open:Active      0     0     0     0
  serial        80  Open:Active   1385     0  1385     0
  threaded      80  Open:Active    112     0   112     0
  mpi           40  Open:Active    768     0   760     8

  bl124:~% bqueues
  QUEUE_NAME  PRIO  STATUS       NJOBS  PEND   RUN  SUSP
  staff        150  Open:Active      0     0     0     0
  test         100  Open:Active      0     0     0     0
  gaussian     100  Open:Active     16     4    12     0
  threaded      80  Open:Active     36    16    20     0
  mpi           80  Open:Active    328    44   284     0
  serial        40  Open:Active     81    13    68     0
Commonly Used SQ/LSF Commands
• bqueues – status of available queues
• sqsub – submit a job for execution on the system
• sqjobs – list the status of all submitted jobs
• sqkill – kill a job by job ID
• bhist – display the history of jobs

sqsub (submit job)

  sqsub [-o ofile][-i ifile][-t][-q queue][-n ncpus] command …

Commonly used options:
  -q queue      queue name (serial, threaded, mpi - default: serial)
  -n ncpus      execute on n cpus (default: 1)
  -i ifile      job reads stdin from ifile
  -o ofile      job sends stdout to ofile (default: LSF e-mail)
  -e efile      job sends stderr to efile (default: ofile)
  -t or --test  test mode (short but immediate)

Examples:
  sqsub -q mpi -n 4 -o mpi_hello.log ./mpi_hello
  sqsub -q threaded -n 4 -o pthread_hello.log ./pthread_hello
  sqsub -q serial -o hello.log ./hello
  sqsub -q mpi -t -n 4 -o mpi_hello.log ./mpi_hello

sqjobs (view job status)

  sqjobs [-r][-q][-z][-v][-u user][-n][--summary][jobid]

Options:
  -a or --all   show all jobs: all users, all states
  -r            show running jobs
  -q            show queued jobs
  -z            show suspended/pre-empted jobs
  -u user       show jobs for the specified user
  --summary     show a line-per-user summary
  -h or --help  show usage
  --man         show the man page
  jobid         one or more jobids to examine

e.g.
  nar317:~/pub/exercises% sqjobs
  nar317:~/pub/exercises% sqsub -q mpi -t -n 4 -o mpi_hello.log ./mpi_hello
  Job <134227> is submitted to queue <test>.
  nar317:~/pub/exercises% sqjobs
  jobid  queue state ncpus nodes time command
  ------ ----- ----- ----- ----- ---- ------------
  134227 test  Q     4           3s   ./mpi_hello
  2972 CPUs total, 0 idle, 2972 busy; 2020 jobs running; 0 suspended, 269 queued.

sqkill (kill a job)

  sqkill jobid [jobid][…]

  nar317:~/pub/exercises% sqkill 134227
  Job <134227> is being terminated
  nar317:~/pub/exercises% sqjobs
  nar317:~/pub/exercises%

• You may need to wait a few seconds for the system to kill a job

bhist (display job history)

  bhist [-a | -d | -p | -r | -s][-b | -w][-l][-t] …

Options:
  -a       display finished and unfinished jobs (overrides -d, -p, -s and -r)
  -b       brief format; if used with the -s option, shows the reason why jobs were suspended
  -d       display only finished jobs
  -l       long format; displays additional information
  -u user  display jobs submitted by the specified user
bhist example

  nar317:~/pub/exercises% bhist -a
  Summary of time in seconds spent in various states:
  JOBID  USER JOB_NAME PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
  134177 dbm  *o_mpi_c    8     0  37     0     0     0    45
  134227 dbm  *o_mpi_c   10     0  10     0     0     0    20

  nar317:~/pub/exercises% bhist -l 134177
  Job <134177>, User <dbm>, Project <dbm>, Job Group </dbm/dbm>, Command
      </opt/hpmpi/bin/mpirun -srun -o mpi_hello.log ./mpi_hello>
  Fri Sep 15 13:06:08: Submitted from host <wha780>, to Queue <test>, CWD
      <$HOME/scratch/examples>, Notify when job ends, 4 Processors Requested,
      Requested Resources <type=any>;
  Fri Sep 15 13:06:16: Dispatched to 4 Hosts/Processors <4*lsfhost.localdomain>;
  Fri Sep 15 13:06:16: slurm_id=318135;ncpus=4;slurm_alloc=wha2;
  Fri Sep 15 13:06:16: Starting (Pid 29769);
  Fri Sep 15 13:06:17: Running with execution home </home/dbm>, Execution CWD
      </scratch/dbm/examples>, Execution Pid <29769>;
  Fri Sep 15 13:06:53: Done successfully. The CPU time used is 0.3 seconds;
  Fri Sep 15 13:06:57: Post job process done successfully;
  Summary of time in seconds spent in various states by Fri Sep 15 13:06:57
  PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
     8     0  37     0     0     0    45

LSF E-mail
• By default, job results are returned via your profile-specified e-mail address
  – LSF summary
  – output from all processes
• Change this behaviour using the -o and -e options to sqsub; sqsub also has an option to turn LSF e-mail on or off

DDT: Distributed Debugging Tool
• A powerful debugger for parallel programs with an advanced GUI
  – works best with MPI programs, but can also be used for threaded and serial jobs
  – supports C, C++ and many flavours of Fortran (77, 90, 95)
  – one-click access to all processes, advanced source browsing, syntax highlighting
• Installed on requin, narwhal, bull and six PoP clusters
• To use DDT:
    ddt program [arguments]
  – then choose the number of processes to run and press "Submit"
  – DDT itself invokes the scheduler using the test queue
    • the debugging session starts almost immediately, but has a 1 hour time limit

High Performance Computing

High Performance Computing
• The definition is nebulous:
  – resource (processing) intensive computation
  – computing where the need for speed is compelling
    • computing nearer the limit of what is feasible
  – parallel computing (this is too strict)
• In reality, HPC is concerned with varied issues involving:
  – hardware
    • pipelining, instruction sets, multi-processors, inter-connects
  – algorithms
    • efficiency, techniques for concurrency
  – software
    • compilers (optimization/parallelization), libraries

Problems Faced
• Hardware
  – in order to facilitate processors working together, they must be able to communicate
  – interconnect hardware is complex
    • sharing memory is easy to say, harder to realize as a system scales
    • communication over any kind of network is still painfully slow compared to bus speed --- the overhead can be significant
• Software
  – parallel algorithms are actually fairly well understood
  – the realization of algorithms in software is non-trivial
  – compilers
    • automated parallelism is difficult
  – design
    • portability and power are typically at odds with each other

Building Parallel Machines
• Symmetric Multiprocessors (SMP)
  – shared memory with uniform memory access (UMA)
  – issues of scalability
    • accessing memory is complex with multiple processors
    • crossbar switches can only get so big in practice
  [diagram: processors p1 … pn sharing a single memory. Adapted from "Using MPI: Portable Parallel Programming with the Message Passing Interface" (2e), Gropp, Lusk and Skjellum, 1999.]

Building Parallel Machines (cont.)
• Clusters
  – (relatively) simple nodes connected by a network
  – inter-process communication is explicit (e.g. message passing)
  – much better scalability, at the cost of communication overhead (performance)
• Non-Uniform Memory Access (NUMA)
  – processors can see all of memory, but cannot access it all at the same speed
  – some memory is local (fast), other memory is global (slower)
  [diagrams: nodes joined by a network; NUMA memory hierarchy. Adapted from "Using MPI: Portable Parallel Programming with the Message Passing Interface" (2e), Gropp, Lusk and Skjellum, 1999.]

Sequential Computing
• Traditional model of computing
  – von Neumann architecture
  – one processor + one memory + bus
  – the processor executes one instruction at a time
    • where does the notion of pipelining fit into this?
• Characteristics of sequential algorithms
  – a statement of a step-wise solution to a problem using simple instructions and the following three types of operation:
    1. sequence
    2. iteration
    3. conditional branching
Parallel Computing
• The general idea: if one processor is good, many processors will be better
• Parallel programming is not generally trivial
  – tools for automated parallelism are either highly specialized or absent
• Many issues need to be considered, many of which don't have an analog in serial computing
  – data vs. task parallelism
  – problem structure
  – parallel granularity

Data Parallelism
• Data is distributed (blocked) across processors
  – easily automated with compiler hints (e.g. OpenMP, HPF); see the sketch after this slide
  – the code otherwise looks fairly sequential
  – benefits from minimal communication overhead
• e.g. Vector Machines
  – a single CPU with a number of subordinate ALUs, each with its own memory
  – operate on an array of similar data items during a single operation
    • each cycle: load in parallel from all memories into the ALUs and perform the same instruction on each local data item
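As a concrete illustration (not from the original slides), a data-parallel loop in C using an OpenMP compiler hint; the arrays and their size are hypothetical, and the program must be compiled with the compiler's OpenMP flag:

  #include <stdio.h>
  #define N 1000000

  static double a[N], b[N], c[N];

  int main(void)
  {
      int i;
      /* the hint distributes blocks of iterations across threads;
         the loop body itself is unchanged from the serial version */
  #pragma omp parallel for
      for (i = 0; i < N; i++)
          c[i] = a[i] + b[i];
      printf("c[0] = %f\n", c[0]);
      return 0;
  }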
Task Parallelism
• Work to be done is decomposed across processors
  – e.g. divide and conquer, recursive doubling, etc.
  – each processor is responsible for some part of the algorithm
  – the communication mechanism is significant
• It must be possible for different processors to be performing different tasks
  – shared memory
    • multi-threading, SMP
  – distributed memory
    • networks of workstations, message passing, remote memory operations

Problem Structure
• The structure of the problem dictates the ease with which we can implement parallel solutions, from easy to hard:
  – perfect parallelism - independent calculations
  – pipeline parallelism - overlap otherwise sequential work
  – synchronous parallelism - parallel work is well synchronized
  – asynchronous parallelism - dependent calculations; parallel work is loosely synchronized

Parallel Granularity
• A measure of the amount of processing performed before communication between processes is required
• Fine-grained parallelism
  – constant communication necessary
  – best suited to shared memory environments
• Coarse-grained parallelism
  – significant computation is performed before communication is necessary
  – ideally suited to message-passing environments
• Perfect parallelism
  – no communication necessary

Flynn's Taxonomy (1966)
• Flynn characterized systems by
  – number of instruction streams
  – number of data streams

                            Data streams
                            single              multiple
  Instruction   single      SISD                SIMD
  streams                   (1 processor,       (1 CPU / many ALUs,
                             1 program,          each with memory,
                             von Neumann)        synchronous)
                multiple    MISD                MIMD
                            (odd concept;       (multiple CPUs,
                             vector              each with memory,
                             processors?)        asynchronous)

SISD
• Single Instruction Single Data (SISD)
  – the traditional model of sequential computation
  – one processor, one memory space
  – the processor executes one instruction at a time
• Sequential algorithms state a step-wise solution to a problem using simple instructions and only the following three classes of operation:
  – sequence
  – iteration
  – branching

SIMD
• Single Instruction Multiple Data (SIMD)
  – one processor (conceptually), with multiple memories
    • or concurrent access to a single shared memory
  – all ALUs execute the exact same instruction on their own local data
  [diagram: one CPU running "a = b + c" drives ALU1, ALU2, ALU3, each holding its own values of b and c]
  – e.g.
    • vector machines
    • AltiVec (Motorola, IBM, Apple)
    • MMX/SSE/SSE2/SSE3 (Intel)

SIMD (cont.)
• Ideally suited to data parallel problems
  – the solution requires that we apply the same operation/algorithm to data elements that can be operated upon independently
    • expectation of high independence of operations (ideally perfectly parallel data)
  – e.g. vector arithmetic
    • compute <c> := <a> + <b>
    • on each processor i:
      – load elements a_i and b_i
      – execute a_i + b_i
      – store result c_i

SIMD (cont.)
• Still amenable where communication patterns are highly regular and involve the same operations
  – e.g.
    • Euclidean inner product
    • summation
    • AND/OR
    • max/min
    • etc.

MISD
• Multiple Instruction Single Data (MISD)
  – multiple operations applied to a single stream of data
    • contrast with SIMD
  – an unusual model to conceptualize
    • some argue there is no such thing

MISD (cont.)
• e.g. pipelining
  – operations are naturally divided into steps that make use of different processor resources
  – issues: branch prediction, instruction stalls, flushing the pipeline
  [diagram: instructions i1, i2, i3 advancing through fetch, decode, execute and writeback stages over time --- pipelining as an MISD model]

MIMD
• Multiple Instruction Multiple Data (MIMD)
  – many processors executing different programs
• Memory organization is significant
  – Distributed memory
    • unique memory associated with each processor
    • issues: communication overhead
  [diagram: Distributed Memory MIMD --- CPU1, CPU2, CPU3 each with private memory, running different functions]
  – Shared memory
    • processors share a single physical memory
    • programs can share blocks of memory between them
    • issues: exclusive access, race conditions, synchronization, scalability
  [diagram: Shared Memory MIMD --- multiple CPUs running different functions against one shared memory]

MIMD (cont.)
• The most powerful and general model for parallel computing
  – it should be clear that SIMD models are a subset of MIMD
  – required to achieve task parallelism
• Unrestricted, unsynchronized parallelism can be difficult to design, test and deploy in practice
  – it is more common for the solution to a problem to be somewhat regular
    • natural points of synchronization exist even where the behaviour across processes diverges
  – typical MIMD solutions involve partitioning the solution space, with only processes on partition boundaries communicating locally with one another (sub-problems are all nearly identical)

SPMD
• Single Program Multiple Data (SPMD)
  – a special case of MIMD
  – many processors executing the same program
    • conditional branches are used where specific behaviour is required on particular processors
  – shared or distributed memory organization
  [diagram: three CPUs run the same program "if cpu_1 then b = 5 else b = 3"; CPU1 ends with b = 5, CPU2 and CPU3 with b = 3 --- SPMD illustrating conditional branching to control divergent behaviour; a concrete sketch follows]
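A minimal MPI rendering of the diagram above (a sketch; rank 0 plays the role of cpu_1, and the variable b follows the slide):

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
      int rank, b;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* every process runs this same program; the branch on rank
         is what makes the behaviour diverge */
      if (rank == 0)
          b = 5;
      else
          b = 3;

      printf("process %d: b = %d\n", rank, b);

      MPI_Finalize();
      return 0;
  }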
Communication
• Communication plays a key role in evaluating parallel models
• Sockets
  – the standard model for network communication
  – issues of network traffic (if shared), slow interface
  – will not leverage a dedicated interconnect without effort
• Message Passing
  – explicit communication between processes
  – e.g. Message Passing Interface (MPI)
    • standardized libraries for message passing in MIMD environments
• Shared Memory
  – implicit communication via common memory blocks
  – e.g. OpenMP, pthreads

Programming for Clusters using MPI

Message Passing Interface (MPI)
• A library providing message passing support for parallel/distributed applications
  – not a language: a collection of subroutines (Fortran) and functions/macros (C)
  – explicit communication between processes
• Advantages
  – standardized
  – scalability is generally good
  – memory is local to a process (debugging/performance)
• Disadvantages
  – more complex than implicit techniques
  – communication overhead

MPI Programming Basics
• Include the MPI header file
  – C: #include "mpi.h"
  – Fortran: include "mpif.h"
• Compile with the MPI library
  – mpicc, mpif90
  – or link explicitly: -lmpi [-lgm] …

Core Functions
• MPI_Init()
• MPI_Finalize()
• MPI_Comm_rank()
• MPI_Comm_size()
• MPI_Send()
• MPI_Recv()

Example: MPI "Hello, world!"

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      printf("Hello, world! from process %d of %d\n", rank, size);

      MPI_Finalize();
      return(0);
  }

Exercise
1) The mpi_hello.c file in ~dbm/public/exercises/intro is a copy of the one used in the previous example
2) Compile this program and submit it for execution on one of the SHARCNET clusters (try 2, 4, 8, 16 processors)
   • what compiler will you use?
   • how do you submit the job?
   • when the job completes, where are the results available?
3) Food for thought…
   • Modify this program so that each process i sends its text string to process (i + 1) % n, where n is the number of processes (one possible shape is sketched below)
     o the receiving process should make use of the message it received in its output, to verify you are sending the data correctly
     o draw a diagram illustrating the order of events implied by your code
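One possible shape for the suggested modification, sketched here rather than prescribed (MPI_Sendrecv posts the send and the receive together, which avoids the deadlock you might otherwise get from mis-ordered blocking MPI_Send/MPI_Recv pairs; using separate calls changes the order-of-events diagram the exercise asks for):

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
      int rank, size;
      char out[64], in[64];
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      sprintf(out, "Hello, world! from process %d of %d", rank, size);

      /* send to (rank+1) % size, receive from (rank-1+size) % size */
      MPI_Sendrecv(out, 64, MPI_CHAR, (rank + 1) % size, 0,
                   in,  64, MPI_CHAR, (rank + size - 1) % size, 0,
                   MPI_COMM_WORLD, &status);

      printf("process %d received: %s\n", rank, in);

      MPI_Finalize();
      return 0;
  }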
Programming Shared Memory Machines using OpenMP/pthreads

Processes
• DEF'N: a process is a program in execution, as seen by the operating system
• A process imposes a specific set of management responsibilities on the OS:
  – a unique virtual address space (stack, heap, code)
  – file descriptor table
  – program counter
  – register states, etc.
  [diagram: conceptual view of a process (simplified) --- virtual address space with heap and stack, file descriptor table, register states, program counter]

Threads
• DEF'N: a thread is a sequence of executable code within a process
• A serial process can be seen, at its simplest, as a single thread (a single "thread of control")
  – represented by the program counter
  – sequence (increment the PC), iteration/conditional branch (set the value of the PC)
• In terms of record-keeping, only a small subset of a process is relevant when considering a thread
  – register states; program counter
  [diagram: conceptual view of a thread (simplified) --- stack, register states, program counter]

Multi-threading
• Distribute work by defining multiple threads to do the work
  – e.g. OpenMP, pthreads
  [diagram: main thread spawning thread_1, thread_2, thread_3]
• Advantages
  – all process resources are implicitly shared (memory, file descriptors, etc.)
  – the overhead incurred to manage multiple threads is relatively low
  – the code looks much like serial code
• Disadvantages
  – all data being implicitly shared creates a world of hammers, and your code is the thumb
  – exclusive access, contention, etc.

OpenMP Basics
• Uses compiler directives, library routines and environment variables to automatically generate threaded code
• Available for FORTRAN and C/C++
• Compile with OpenMP support
  – f90 -openmp …
  – note: C typically requires a header file
    • #include "omp.h"

Example: OpenMP "Hello, world!"

  PROGRAM HELLO
    INTEGER ID, NTHRDS
    INTEGER OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS
  !$OMP PARALLEL PRIVATE(ID)
    ID = OMP_GET_THREAD_NUM()
    PRINT *, 'HELLO, WORLD! FROM THREAD', ID
  !$OMP BARRIER
    IF (ID .EQ. 0) THEN
      NTHRDS = OMP_GET_NUM_THREADS()
      PRINT *, 'THERE ARE', NTHRDS, 'THREADS'
    END IF
  !$OMP END PARALLEL
  END

Exercise
1) The openmp_hello.f90 file in ~dbm/public/exercises/intro is a copy of the one used in the previous example
2) Compile this program and submit it for execution on one of the SHARCNET clusters (try 1, 2, 3, 4 processors --- on silky you could submit this on up to 128 processors in principle)
   • how will you compile this program?
   • how do you submit the job? how are the results of the job returned to you?
3) Food for thought
   • compare what is happening in this code with an equivalent program written using pthreads (next section) --- which do you think will be more efficient? which do you think is easier to code? is pthreads able to offer you anything OpenMP is not? (a C translation of the example above follows)
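For comparison with the pthreads version in the next section, a C translation of the Fortran example above (a sketch; note the omp.h header, and that the OpenMP compile flag varies by compiler, e.g. -openmp, -mp or -fopenmp):

  #include <stdio.h>
  #include <omp.h>    /* C requires the OpenMP header */

  int main(void)
  {
  #pragma omp parallel
      {
          int id = omp_get_thread_num();  /* private to each thread */
          printf("Hello, world! from thread %d\n", id);

  #pragma omp barrier
          if (id == 0)
              printf("There are %d threads\n", omp_get_num_threads());
      }
      return 0;
  }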
Pthreads Basics
• Include the Pthreads header file
  – #include "pthread.h"
• Compile with Pthreads support/library
  – cc -pthread …
  – compiler vendors may differ in their usage to support pthreads (link a library, etc.)
    • the Pathscale, Intel and gcc compilers all use the -pthread argument, so this will suffice for the moment
    • when in doubt, consult the man page for the compiler in question

Pthreads Programming Basics (cont.)
• Note that all processes have an implicit "main thread of control"
• We need a means of creating a new thread
  – pthread_create()
• We need a way of terminating a thread
  – threads are terminated implicitly when the function that was the entry point of the thread returns, or can be explicitly destroyed using pthread_exit()
• We may need to distinguish one thread from another at run-time
  – pthread_self(), pthread_equal()

Example: pthreads "Hello, world!"

  #include <stdio.h>
  #include <stdlib.h>
  #include "pthread.h"

  void *output(void *);

  int main(int argc, char *argv[])
  {
      int n = atoi(argv[1]);
      int id;
      pthread_t thread[n];
      int thread_num[n];             /* per-thread id: passing &id to every
                                        thread would race on the loop variable */

      for (id = 0; id < n; id++) {
          thread_num[id] = id;
          pthread_create(&thread[id], NULL, output, &thread_num[id]);
      }
      for (id = 0; id < n; id++)     /* wait, or main may exit first */
          pthread_join(thread[id], NULL);
      return(0);
  }

  void *output(void *arg)
  {
      printf("Hello, world! from thread %d\n", *(int *)arg);
      return(NULL);
  }

Exercise
1) The pthread_hello.c file in ~dbm/public/exercises/intro is a copy of the one used in the previous example
2) Compile this program and submit it for execution on one of the SHARCNET clusters (try 1, 2, 3, 4 processors --- on silky you could submit this on up to 128 processors in principle)
   • how will you compile this program?
   • how do you submit the job? how are the results of the job returned to you?
3) Food for thought…
   • Modify this program so that each thread returns its thread id number, with the main thread summing these values and printing the result
     o if you add nothing to the main routine other than the print statement, what happens when you run it? what additional information do you need to make this work as intended?

Parallel Design Patterns

Parallel Design Basics
• In principle, a parallel solution could be arbitrarily asynchronous, with little structure to the code
• The reality is that most parallel algorithms draw from a relatively small set of basic designs
  – well understood
  – easier to implement and debug
  – reasonably general
    • most are applicable to both threaded and message-passing environments
• Always consider these common patterns when thinking about a parallel implementation of your problem

Fork/Join Parallelism
• Multiple processes/threads are created, executing the same or different routines concurrently
  – also referred to as a Peer model
  – the classic model of data parallelism
  – synchronization upon exit is a common requirement
  [diagram: main thread forks thread_1 … thread_N, which join back to the main thread]

Master/Slave Parallelism
• Processes/threads are created on demand as the main process requires work to be done
  – appropriate threads are created to service tasks, and run concurrently once created
  – this is difficult to realize on clusters running batch schedulers, as there is no support for dynamic process creation
    • you would be more likely to use the process/thread pool: worker processes are created when the job starts, and are assigned work from a queue
  [diagram: main thread creates thread_taskA, thread_taskB, … as needed]

Process/Thread Pool
• A number of processes/threads are created, capable of doing work
  – the main process/thread adds jobs to a queue of work to be done and signals the slaves that there is work available
  – slave processes/threads consume tasks from the shared queue and perform the work (a sketch follows after this section)
  [diagram: main thread adds jobs to a queue consumed by thread_1, thread_2, …]

Pipeline Parallelism
• Multiple processes/threads are created, each performing a discrete stage in some execution sequence
  – new input is fed to the first worker, and the output from each stage is passed to the next one
  – after N time steps the pipeline is filled, and a result is produced every time step, with all stages processing concurrently
  – buffering and synchronization are significant here
  [diagram: main thread feeds thread_stage1 → thread_stage2 → … → thread_stageN]
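To make the pool pattern concrete, a minimal pthreads sketch (illustrative only: workers claim task numbers from a shared counter protected by a mutex; a real pool would use a condition variable and a genuine task queue so the master can keep adding work while workers run):

  #include <stdio.h>
  #include <pthread.h>

  #define NWORKERS 4
  #define NTASKS   16

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static int next_task = 0;   /* stands in for the queue of jobs */

  void *worker(void *arg)
  {
      int id = *(int *)arg;
      int task;

      for (;;) {
          pthread_mutex_lock(&lock);    /* exclusive access to the "queue" */
          task = next_task++;
          pthread_mutex_unlock(&lock);

          if (task >= NTASKS)           /* no work left */
              break;
          printf("worker %d performing task %d\n", id, task);
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t thread[NWORKERS];
      int id[NWORKERS];
      int i;

      for (i = 0; i < NWORKERS; i++) {  /* pool is created once, up front */
          id[i] = i;
          pthread_create(&thread[i], NULL, worker, &id[i]);
      }
      for (i = 0; i < NWORKERS; i++)
          pthread_join(thread[i], NULL);
      return 0;
  }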