How to Teach a Thousand Cores to Perform Scientific Simulations

Credits
Motivation
Dense Computations
KAUST-BLAS and SpMV
Stencil Computations
Dr. Hatem Ltaief
Research Scientist
Center of Extreme Computing
KACSTIT 2013, Riyadh
November 18th, 2013
Outline
1 Motivation
2 Dense Computations
3 KAUST-BLAS and SpMV
4 Stencil Computations
5 Summary
Credits - KAUST People
• Ahmad Abdelfattah (PhD Student)
• Rabab Alomairy (MS Student)
• Saber Feki (KSL Computational Scientist)
• Huda Ibeid (PhD Student)
• David Keyes (AMCS Professor, SIEC Director)
• Rooh Khurram (KSL Computational Scientist)
• Tareq Malas (PhD Student)
• Dalal Sukkari (MS Student)
• Rio Yokota (SIEC Research Scientist)
• Michael Young (IT Research Computing)
Credits - External Collaborators/Sponsors
• Research Institutions
• Industrial Partners
KAUST Homegrown Applications
Slide courtesy: Pr. Keyes
A Common Denominator
1 Dense solvers: systems of linear equations,
eigenvalue/singular value problems.
2 Sparse iterative solvers: sparse matrix-vector multiplication
(SpMV).
3 Sparse explicit solvers: stencil computations.
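As a point of reference for the second kernel class, here is a minimal sketch of SpMV. The slides do not say which sparse storage format is used, so the compressed sparse row (CSR) layout and the names `spmv_csr`, `row_ptr`, `col_idx` and `val` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for an n-row sparse matrix A in CSR format (assumed layout):
 *   row_ptr[i] .. row_ptr[i+1]-1 index the nonzeros of row i,
 *   col_idx[k] is the column of nonzero k and val[k] is its value. */
void spmv_csr(size_t n, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    assert(row_ptr && col_idx && val && x && y);
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];   /* indirect access to x */
        y[i] = sum;
    }
}
```

Each stored nonzero is loaded once and used in a single multiply-add, which is why this kernel tends to be limited by memory bandwidth rather than by arithmetic.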
The Top500 List, June’13: http://www.top500.org
Rank  Name        Vendor / Processor             Cores      Rmax (Pflop/s)  Power (MW)
1     Tianhe-2    Intel Xeon E5-2692 + Phi       3,120,000  33.9            17.8
2     Titan       Cray                           560,640    17.6            8.21
3     Sequoia     IBM BG/Q                       1,572,864  17.2            7.89
4     K Computer  Fujitsu SPARC64                705,024    10.51           12.66
5     Mira        IBM BG/Q                       786,432    8.6             3.95
6     Stampede    Intel Xeon E5-2680 + Phi       462,462    5.17            4.51
7     JuQUEEN     IBM BG/Q                       458,752    5               2.3
8     Vulcan      IBM BG/Q                       393,216    4.3             1.97
9     SuperMUC    Intel Xeon E5-2680             147,456    2.9             3.42
10    Tianhe-1A   Intel Xeon X5670 + M2050       186,368    2.57            4.04
...
52    KACST       Intel Xeon E5-2650 + FirePro   38,400     0.53            0.18
...
164   KAUST       IBM BG/P                       65,536     0.19            0.5
• Ever increasing hardware complexity
• Hardware landscape in constant evolution
• Human brain: 20 Pflop/s! (cf. Kurzweil)
Power is THE Big Deal
• Approximate power costs (in picoJoules):
• Goal: crunching more per loaded data.
The Two Distinct Optimization Paths
• By using the existing hardware features (sometimes comes for
free).
• By redesigning the numerical methods and optimizing the
actual implementation (requires effort and time).
⇒ Separation of concerns:
• Abstracting the hardware complexity: dynamic runtime
systems.
• Novel Algorithmic challenges: numerical accuracy/stability.
A Look Back...
Software infrastructure and algorithmic design follow hardware
evolution in time:
• 70’s - LINPACK, vector operations: Level-1 BLAS operation
• 80’s - LAPACK, block, cache-friendly: Level-3 BLAS operation
• 90’s - ScaLAPACK, distributed memory: PBLAS, message passing
• 00’s - PLASMA, MAGMA, many x86/cuda cores friendly:
DAG scheduler, tile data layout, some extra kernels
Facts on LAPACK/ScaLAPACK
• Open-source packages (downloaded more than 60 million
times)
• Large community of contributors
• Integrated in open-source libraries (PETSc, SLEPc, MUMPS,
CPMD, CP2K ...)
• Integrated in vendor numerical software (Mathworks, Intel,
Cray, IBM, HP, Fujitsu ...)
• Critical library for many scientific applications
(Cardio-Magnetism, Wave guide propagation, Image
Processing, Quantum chemistry/physics, Atomic structure
calculations, Electromechanics, Geophysics/seismology,
Nonlinear Mechanics)
Block Algorithms: Fork-Join Paradigm
PLASMA: Tile Algorithms
PLASMA: Parallel Linear Algebra for Scalable Multi-core
Architectures
• Parallelism is brought to the fore
• May require the redesign of linear algebra algorithms
• Tile data layout translation
• Remove unnecessary synchronization points between
Panel-Update sequences
• DAG execution where nodes represent tasks and edges define
dependencies between them
• Dynamic runtime system environment QUARK targeting only
x86
• Similar to LAPACK in functionality
Tile Data Layout Format
LAPACK: column-major format
PLASMA: tile format
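To make the difference between the two formats concrete, the following sketch copies a column-major matrix into a tile layout. It assumes square nb x nb tiles, each stored contiguously in column-major order, with the tiles themselves ordered column by column. This is one common convention and is not taken from the slide; the function name `colmajor_to_tile` is hypothetical and not part of the PLASMA API.

```c
#include <assert.h>
#include <stddef.h>

/* Copy an n x n column-major matrix a (leading dimension n) into tile
 * layout t. Tiles are nb x nb, stored contiguously in column-major order,
 * and ordered tile-column by tile-column. n must be a multiple of nb,
 * an assumption that keeps the sketch short. */
void colmajor_to_tile(size_t n, size_t nb, const double *a, double *t)
{
    assert(nb > 0 && n % nb == 0);
    size_t nt = n / nb;                        /* tiles per row and column */
    for (size_t tj = 0; tj < nt; tj++)         /* tile column */
        for (size_t ti = 0; ti < nt; ti++) {   /* tile row */
            double *tile = t + (tj * nt + ti) * nb * nb;
            for (size_t j = 0; j < nb; j++)
                for (size_t i = 0; i < nb; i++)
                    tile[j * nb + i] = a[(tj * nb + j) * n + (ti * nb + i)];
        }
}
```

After the copy, every tile occupies one contiguous block of nb * nb values, so a kernel working on a single tile touches a compact memory region instead of nb strided column segments.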
Dynamic Scheduling Framework: QUARK
Conceptually similar to out-of-order processor scheduling:
• Dynamic runtime DAG scheduler
• Out-of-order execution flow of tasks
• Task scheduling as soon as dependencies are satisfied
• Potential Overlapping of operations between computational
stages
• DataFlow Programming (Five decades OLD concept)
Tridiagonal Reduction - Block Algorithms
Block Algorithms
[Figure: Gflop/s and TLB misses (x 10^7) versus matrix size (0 to 5000).]
Figure: Performance evaluation and TLB miss analysis of the one-stage
LAPACK TRD algorithm with optimized Intel MKL BLAS, on a
dual-socket quad-core Intel Xeon (8 cores total).
PLASMA Tridiagonal Reduction - Two Stage
Need for Autotuning for Tile Size (NB)
[Figure: time in seconds (0 to 100) versus tile size NB (0 to 200);
curves labeled 2520, 4000, 5040, 7560 and 9240.]
H. Ltaief, P. Luszczek and J. Dongarra, ACM TOMS 2011.
TRD Performance Results on four sockets 12 AMD cores
[Figure: Gflop/s (0 to 140) versus matrix size (2k to 24k).
Curves: PLASMA DSYTRD; PLASMA DSYTRD w/o grouping; MKL-SBR-DSYRDB;
SBR-toolkit-DSYRDD; MKL-DSYTRD; MKL-Scalapack-PDSYTRD;
LPK-reference-DSYTRD. Annotations: 50X and 12X.]
A. Haidar, H. Ltaief and J. Dongarra, SC’11.
A. Haidar, H. Ltaief and J. Dongarra, SIAM SISC 2012.
BRD Performance Results on four sockets 12 AMD cores
[Figure: Gflop/s (0 to 140) versus matrix size (2k to 24k).
Curves: PLASMA DGEBRD; PLASMA DGEBRD w/o grouping; DGEBRD from [4];
MKL-DGEBRD; MKL-Scalapack-PDGEBRD; LPK-reference-DGEBRD.
Annotations: 50X and 12X.]
A. Haidar, H. Ltaief, P. Luszczek and J. Dongarra, IPDPS’12.
GSEVP: What we solve?
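$Ax = \lambda Bx$
Real case: $A, B \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^{n}$, $\lambda \in \mathbb{R}$, $A = A^{T}$
or
Complex case: $A, B \in \mathbb{C}^{n \times n}$, $x \in \mathbb{C}^{n}$, $\lambda \in \mathbb{R}$, $A = A^{H}$
$x^{H} B x > 0$
A is symmetric or Hermitian
B is symmetric positive definite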
GSEVP: Why we solve it?
To obtain energy eigenstates in:
• Chemical cluster theory
• Electronic structure of semiconductors
• Ab-initio energy calculations of solids
GSEVP: How to solve it?
$Ax = \lambda Bx$
Step  Operation                            Explanation                        LAPACK routine name
1     $B = L \times L^{T}$                 Cholesky factorization             POTRF
2     $C = L^{-1} \times A \times L^{-T}$  application of triangular factors  SYGST or HEGST
3     $T = Q^{T} \times C \times Q$        tridiagonal reduction              SYEVD or HEEVD
4     $Tx = \lambda x$                     QR iteration                       STERF
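The four steps rest on a change of variables that the table leaves implicit. A short derivation for the real symmetric case follows (replace $T$ by $H$ for the Hermitian case). The vector $y = L^{T}x$ is introduced here for illustration and does not appear on the slide.

```latex
\begin{aligned}
Ax &= \lambda Bx, \qquad B = LL^{T} \\
L^{-1}A\,x &= \lambda\,L^{T}x \\
\underbrace{L^{-1}AL^{-T}}_{C}\;\underbrace{L^{T}x}_{y} &= \lambda\,\underbrace{L^{T}x}_{y}
\quad\Longrightarrow\quad Cy = \lambda y \\
C = QTQ^{T} &\quad\Longrightarrow\quad T\,(Q^{T}y) = \lambda\,(Q^{T}y)
\end{aligned}
```

The eigenvalues $\lambda$ are therefore unchanged by steps 1 to 3, and an eigenvector of the original problem is recovered as $x = L^{-T}y$.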
All computational stages: separately
[Figure: task DAGs of the computational stages, drawn separately.]
All computational stages: combined
• Dependencies are tracked inside PLASMA by QUARK.
[Figure: single task DAG of the combined stages.]
Combining stages: matrix view
Results on Four-Socket AMD Magny Cours (48 cores)
[Figure: time in seconds (0 to 2500) versus matrix size (2000 to 16000).
Curves: PLASMA xSYGST + two-stage TRD; MKL xSYGST + MKL SBR;
MKL xSYGST + MKL TRD; MKL xSYGST + Netlib SBR;
LAPACK xSYGST + LAPACK TRD.]
H. Ltaief, P. Luszczek, A. Haidar and J. Dongarra, ParCo’11.
$A^{-1}$, Seriously???
• YES!
• Critical component of the variance-covariance matrix
computation in statistics (cf. Higham, Accuracy and Stability
of Numerical Algorithms, Second Edition, SIAM, 2002)
• A is a dense symmetric positive definite matrix (required for the
Cholesky factorization)
• Three steps:
1 Cholesky factorization (DPOTRF)
2 Inverting the Cholesky factor (DTRTRI)
3 Calculating the product of the inverted Cholesky factor with its
transpose (DLAUUM)
• StarPU dynamic runtime system used here
$A^{-1}$, Hybrid Architecture Targeted
• PCIe 3.0 x16 interconnect: 64 Gb/s, a very thin pipe!
• Fermi C2050: 448 CUDA cores, 515 Gflop/s
• Kepler K20x: 2688 CUDA cores, 1.31 Tflop/s
Performance Results
[Figure: Gflop/s (0 to 500) versus matrix size (0.5 to 2.5, x 10^4).
Curves: Tile Hybrid CPU-GPU; MAGMA; PLASMA; LAPACK.]
H. Ibeid, D. Kaushik, D. Keyes and H. Ltaief, HIPC’11, India.
System with More Sockets
• Run into NUMA-related issues
• Reduce data movement between NUMA nodes on a single
chip.
• Need new scheduling policy: socket-aware as opposed to
socket-oblivious
• Keep productivity in mind! But no free lunch
• Redesign the algorithm by introducing nested parallelism to
enhance data locality
Nested Feature with OmpSs
• Extension to OpenMP
• Task parameters and directionality provided by the user
through pragmas
• Focus on tasks and Data dependences
• Algorithmic changes to suit the underlying NUMA-based
System
[Figure: grids of tiles labeled 0 to 3, in two panels:
(a) Tile data layout. (b) within a block.]
Performance on Eight Sockets Six AMD Istanbul Cores
R. Al-Omairy, MS Thesis, 2013. Paper in preparation for SIAM
SISC.
QUARK and DVFS
G. Bosilca, J. Dongarra and H. Ltaief, EnaHPC’12.
H. Ltaief, P. Luszczek and J. Dongarra, EnaHPC’13.
QUARK and Resiliency
K-BLAS
• KBLAS is a subset of BLAS kernels for GPUs (Integration to
NVIDIA CUBLAS library).
• Current release is KBLAS-0.2
(http://ksl.kaust.edu.sa/Pages/HatemLtaief.aspx).
• GEMV/SYMV, ALL precisions, support for Fermi and Kepler
GPUs (AMD FirePro?).
• Hiding latency through double buffering, stressing register usage
and optimizing shared memory usage.
K-BLAS Performance
A. Abdelfattah, J. Dongarra, D. Keyes and H. Ltaief, VECPAR’11.
K-BLAS Performance
A. Abdelfattah, D. Keyes and H. Ltaief, HeteroPar’12.
SpMV Performance
Research project (reservoir) funded by Saudi Aramco.
Seismic Acquisition
Standard Stencil Time Integration
• Reverse Time Migration
• Acoustic Wave Equation with constant density
• Numerical solution: Finite Difference
• Isotropic Case: Axis directions (X,Y,Z)
• Tilted Transverse Isotropic (TTI) Case: Axis directions
(X,Y,Z) & Cross Derivatives (XY, YZ, XZ)
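For the isotropic case, one explicit time step can be sketched as below. This is a minimal illustration under stated assumptions: second order in both time and space (a 7-point stencil), a uniform grid spacing h, a constant velocity c, and untouched boundary points. The function name `wave_step_iso` and the parameter `r2` are hypothetical and not taken from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Linear index of grid point (x, y, z) in an nx x ny x nz grid, x fastest. */
#define AT(x, y, z, nx, ny) (((z) * (ny) + (y)) * (nx) + (x))

/* One explicit time step of the constant-density acoustic wave equation
 * u_tt = c^2 * laplacian(u), using only the axis directions (X, Y, Z).
 * r2 = (c * dt / h)^2. Only interior points of next are written. */
void wave_step_iso(size_t nx, size_t ny, size_t nz, double r2,
                   const double *prev, const double *cur, double *next)
{
    assert(nx >= 3 && ny >= 3 && nz >= 3);
    for (size_t z = 1; z + 1 < nz; z++)
        for (size_t y = 1; y + 1 < ny; y++)
            for (size_t x = 1; x + 1 < nx; x++) {
                size_t p = AT(x, y, z, nx, ny);
                double lap = cur[p - 1] + cur[p + 1]              /* X */
                           + cur[p - nx] + cur[p + nx]            /* Y */
                           + cur[p - nx * ny] + cur[p + nx * ny]  /* Z */
                           - 6.0 * cur[p];
                next[p] = 2.0 * cur[p] - prev[p] + r2 * lap;
            }
}
```

A TTI kernel would add the cross-derivative terms (XY, YZ, XZ) to the same loop nest, which enlarges the stencil and the number of values loaded per updated point.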
Standard Stencil Time Integration
• Isotropic stencil
• TTI stencil
Standard Stencil Time Integration
• Explicit time integration scheme with domain decomposition
• Order one in space and time
New Stencil Time Integration
• Communication-avoiding by reducing halo exchange
frequency
• Synchronization-reducing by instruction reordering
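One common way to exchange halos less often is to widen the halo to s points and recompute part of it redundantly, so that a subdomain can advance s time steps between two exchanges. The sketch below shows this idea in 1D with a 3-point averaging stencil. It is an illustration under these assumptions and not necessarily the scheme used in the talk; the function name `advance_deep_halo` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Advance a local block by s time steps with no communication.
 * buf holds m owned points with s ghost points on each side
 * (length m + 2*s); the ghosts must be up to date on entry.
 * After step t only points [t, m + 2*s - t) are still valid, so the
 * owned points [s, s + m) are exact after s steps.
 * tmp is scratch space of the same length as buf. */
void advance_deep_halo(double *buf, double *tmp, size_t m, size_t s)
{
    assert(buf && tmp);
    size_t len = m + 2 * s;
    for (size_t t = 1; t <= s; t++) {
        for (size_t i = t; i < len - t; i++)
            tmp[i] = (buf[i - 1] + buf[i] + buf[i + 1]) / 3.0;
        for (size_t i = t; i < len - t; i++)
            buf[i] = tmp[i];
    }
    /* A halo exchange with the neighbours is needed only here,
     * once every s steps instead of once per step. */
}
```

The price is redundant work in the shrinking ghost region, which is traded against fewer and larger messages.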
Paulius’s Strategy
Our Pyramid Strategy
Performance Results on 8 NVIDIA Kepler GPUs
A. Abdelfattah, D. Keyes and H. Ltaief, to be submitted to ACM
TOMS.
Stencil kernels
• Seismic kernels are typically order 2 in time and order 8 in
space
• New opportunities: how about considering lower order
methods?
• Two phases: calculate solutions inside a cone and then
outside the cone
• The cone kernel becomes the fine-grain task to dynamically
schedule (shipped to cuda SMs)
• No numerical instabilities added to the original scheme
• Load imbalance issues due to PML high compute intensity
Summary
• Teaching a thousand cores through dynamic runtime systems.
• No free lunch.
• Need new flexible/asynchronous algorithms.
• Need to expose the parallelism.
• Dataflow programming model can play a major role for
exascale challenges.
• Need efficient runtime with high productivity in mind.
• Works for dense and potentially for sparse computations.
HPC Recipe For Exascale Computing
• Fine-grain parallelism
• Power efficiency
• Dynamic runtime systems
• Auto-Tuning
• Data Motion Reducing
• Synchronization Reducing
• Hardware and Software Co-design
⇒ 10^18
Just in case you are wondering what’s beyond ExaFlops...
Thank you!
What Can FMM Do For You?
Data Driven Fast Multipole Method
Tree Traversal
Strong Scaling w/ QUARK on a 16 Core ShMem System
H. Ltaief and R. Yokota, Concurrency and Computations: Practice
Power Monitoring with PowerPack
Power Rate of LAPACK TRD
[Figure: power in Watts (0 to 300) versus time in seconds (0 to 350).
Curves: System; CPU; Memory; Motherboard; Fan.]
H. Ltaief, P. Luszczek, J. Dongarra, EnaHPC’11, Germany
Power Rate of PLASMA TRD
[Figure: power in Watts (0 to 300) versus time in seconds (0 to 350).
Curves: System; CPU; Memory; Motherboard; Fan.]
H. Ltaief, P. Luszczek, J. Dongarra, EnaHPC’11, Germany
OmpSs Snapshot
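/* Task declarations: each pragma gives the directionality of the arguments. */
#pragma css task input(A[NB][NB]) inout(T[NB][NB])
void dsyrk(double *A, double *T);
#pragma css task inout(T[NB][NB])
void dpotrf(double *T);
#pragma css task input(A[NB][NB], B[NB][NB]) inout(C[NB][NB])
void dgemm(double *A, double *B, double *C);
#pragma css task input(T[NB][NB]) inout(B[NB][NB])
void dtrsm(double *T, double *B);
/* Tile Cholesky factorization: the runtime derives the task DAG from the
   declared data dependences between these calls. */
#pragma css start
for (k = 0; k < TILES; k++) {
    for (n = 0; n < k; n++)
        dsyrk(A[k][n], A[k][k]);               /* update the diagonal tile */
    dpotrf(A[k][k]);                           /* factor the diagonal tile */
    for (m = k+1; m < TILES; m++) {
        for (n = 0; n < k; n++)
            dgemm(A[k][n], A[m][n], A[m][k]);  /* update a tile below the diagonal */
        dtrsm(A[k][k], A[m][k]);               /* triangular solve on that tile */
    }
}
#pragma css finish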