Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations How to Teach a Thousand Cores to Perform Scientific Simulations Dr. Hatem Ltaief Research Scientist Center of Extreme Computing KACSTIT 2013, Riyadh November 18th , 2013 H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Outline 1 Motivation 2 Dense Computations 3 KAUST-BLAS and SpMV 4 Stencil Computations 5 Summary H. Ltaief Stencil Computations Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Credits - KAUST People • Ahmad Abdelfattah (PhD Student) • Rabab Alomairy (MS Student) • Saber Feki (KSL Computational Scientist) • Huda Ibeid (PhD Student) • David Keyes (AMCS Professor, SIEC Director) • Rooh Khurram (KSL Computational Scientist) • Tareq Malas (PhD Student) • Dalal Sukkari (MS Student) • Rio Yokota (SIEC Research Scientist) • Michael Young (IT Research Computing) H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Credits - External Collaborators/Sponsors • Research Institutions • Industrial Partners H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Outline 1 Motivation 2 Dense Computations 3 KAUST-BLAS and SpMV 4 Stencil Computations 5 Summary H. Ltaief Stencil Computations Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations KAUST Homegrown Applications Slide courtesy: Pr. Keyes H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary A Common Denominator 1 2 3 Dense solvers: systems of linear equations, eigenvalue/singular value problems. Sparse iterative solvers: sparse matrix-vector multiplication (SpMV). Sparse explicit solvers: stencil computations. H. Ltaief Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary The Top500 List, June’13: http://www.top500.org Rank 1 2 3 4 5 6 7 8 9 10 Name Tianhe-2 Titan Sequoia K Computer Mira Stampede JuQUEEN Vulcan SuperMUC Tianhe-1A Vendor Intel Xeon E5-2692 + Phi Cray IBM BG/Q Fujitsu SPARC64 IBM BG/Q Intel Xeon E5-2680 + Phi IBM BG/Q IBM BG/Q Intel Xeon E5-2680 Intel Xeon X5670 + M2050 Cores 3,120,000 560,640 1,572,864 705,024 786,432 462,462 458,752 393,216 147,456 186,368 Rmax (Pflop/s) 33,9 17,6 17,2 10,51 8,6 5,17 5 4,3 2,9 2,57 Ever increasing hardware complexity H. Ltaief Power (MW) 17,8 8,21 7,89 12,66 3,95 4,51 2,3 1,97 3,42 4,04 Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary The Top500 List, June’13: http://www.top500.org Rank 1 2 3 4 5 6 7 8 9 10 ... 52 ... 164 Name Tianhe-2 Titan Sequoia K Computer Mira Stampede JuQUEEN Vulcan SuperMUC Tianhe-1A ... KACST ... KAUST Vendor Intel Xeon E5-2692 + Phi Cray IBM BG/Q Fujitsu SPARC64 IBM BG/Q Intel Xeon E5-2680 + Phi IBM BG/Q IBM BG/Q Intel Xeon E5-2680 Intel Xeon X5670 + M2050 ... Intel Xeon E5-2650 + FirePro ... IBM BG/P Cores 3,120,000 560,640 1,572,864 705,024 786,432 462,462 458,752 393,216 147,456 186,368 ... 38,400 ... 65,536 Rmax (Pflop/s) 33,9 17,6 17,2 10,51 8,6 5,17 5 4,3 2,9 2,57 ... 0,53 ... 0,19 Hardware Landscape in Constant Evolution H. Ltaief Power (MW) 17,8 8,21 7,89 12,66 3,95 4,51 2,3 1,97 3,42 4,04 ... 0,18 ... 0,5 Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary The Top500 List, June’13: http://www.top500.org Rank 1 2 3 4 5 6 7 8 9 10 ... 52 ... 164 Name Tianhe-2 Titan Sequoia K Computer Mira Stampede JuQUEEN Vulcan SuperMUC Tianhe-1A ... KACST ... KAUST Vendor Intel Xeon E5-2692 + Phi Cray IBM BG/Q Fujitsu SPARC64 IBM BG/Q Intel Xeon E5-2680 + Phi IBM BG/Q IBM BG/Q Intel Xeon E5-2680 Intel Xeon X5670 + M2050 ... Intel Xeon E5-2650 + FirePro ... IBM BG/P Cores 3,120,000 560,640 1,572,864 705,024 786,432 462,462 458,752 393,216 147,456 186,368 ... 38,400 ... 65,536 Rmax (Pflop/s) 33,9 17,6 17,2 10,51 8,6 5,17 5 4,3 2,9 2,57 ... 0,53 ... 0,19 Human brain: 20 Pflop/s! (cf Kurzweil) H. Ltaief Power (MW) 17,8 8,21 7,89 12,66 3,95 4,51 2,3 1,97 3,42 4,04 ... 0,18 ... 0,5 Credits Motivation Dense Computations KAUST-BLAS and SpMV Power is THE Big Deal • Approximate power costs (in picoJoules): • Goal: crunching more per loaded data. H. Ltaief Stencil Computations Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary The Two Distinct Optimization Paths • By using the existing hardware features (sometimes comes for free). • By redesigning the numerical methods and optimizing the actual implementation (requires effort and time). =⇒ Separation of concerns: • Abstracting the hardware complexity: dynamic runtime systems. • Novel Algorithmic challenges: numerical accuracy/stability. H. Ltaief Credits Motivation Dense Computations KAUST-BLAS and SpMV Outline 1 Motivation 2 Dense Computations 3 KAUST-BLAS and SpMV 4 Stencil Computations 5 Summary H. Ltaief Stencil Computations Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations A Look Back... Software infrastructure and algorithmic design follow hardware evolution in time: • 70’s - LINPACK, vector operations: Level-1 BLAS operation H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations A Look Back... Software infrastructure and algorithmic design follow hardware evolution in time: • 70’s - LINPACK, vector operations: Level-1 BLAS operation • 80’s - LAPACK, block, cache-friendly: Level-3 BLAS operation H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations A Look Back... Software infrastructure and algorithmic design follow hardware evolution in time: • 70’s - LINPACK, vector operations: Level-1 BLAS operation • 80’s - LAPACK, block, cache-friendly: Level-3 BLAS operation • 90’s - ScaLAPACK, distributed memory: PBLAS Message passing H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations A Look Back... Software infrastructure and algorithmic design follow hardware evolution in time: • 70’s - LINPACK, vector operations: Level-1 BLAS operation • 80’s - LAPACK, block, cache-friendly: Level-3 BLAS operation • 90’s - ScaLAPACK, distributed memory: PBLAS Message passing • 00’s: • PLASMA, MAGMA, many x86/cuda cores friendly: DAG scheduler, tile data layout, some extra kernels H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary Facts on LAPACK/ScaLAPACK • Open-source packages (downloaded more than 60 million times) • Large community contributors • Integrated in open-source libraries (PETSc, SLEPc, MUMPS, CPMD, CP2K ...) • Integrated in vendor numerical softwares (Mathworks, Intel, Cray, IBM, HP, Fujitsu ...) • Critical library for many scientific applications (Cardio-Magnetism, Wave guide propagation, Image Processing, Quantum chemistry/physics, Atomic structure calculations, Electromechanics, Geophysics/seismology, Nonlinear Mechanics) H. Ltaief Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Block Algorithms: Fork-Join Paradigm H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary PLASMA: Tile Algorithms PLASMA: Parallel Linear Algebra for Scalable Multi-core Architectures • Parallelism is brought to the fore • May require the redesign of linear algebra algorithms • Tile data layout translation • Remove unnecessary synchronization points between Panel-Update sequences • DAG execution where nodes represent tasks and edges define dependencies between them • Dynamic runtime system environment QUARK targeting only x86 • Similar to LAPACK in functionality H. Ltaief Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary Tile Data Layout Format LAPACK: column-major format H. Ltaief PLASMA: tile format Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Dynamic Scheduling Framework: QUARK Conceptually similar to out-of-order processor scheduling: • Dynamic runtime DAG scheduler • Out-of-order execution flow of tasks • Task scheduling as soon as dependencies are satisfied • Potential Overlapping of operations between computational stages • DataFlow Programming (Five decades OLD concept) H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Tridiagonal Reduction - Block Algorithms H. Ltaief Summary Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary Block Algorithms 7 x 10 10 TLB Misses 5 Gflop/s Credits 5 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 0 5000 Matrix Sizes Figure : Performance evaluation and TLB miss analysis of the one-stage LAPACK TRD algorithm with optimized Intel MKL BLAS, on a dual-socket quad-core Intel Xeon (8 cores total). H. Ltaief Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations PLASMA Tridiagonal Reduction - Two Stage H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Need for Autotuning for Tile Size (NB) 100 80 seconds 60 40 20 9240 7560 5040 4000 2520 0 0 20 40 60 80 100 NB 120 140 160 180 200 H. Ltaief, P. Luszczek and J. Dongarra, ACM TOMS 2011. H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary TRD Performance Results on four sockets 12 AMD cores 140 130 120 110 100 PLASMA DSYTRD PLASMA DSYTRD w/o grouping MKLïSBRïDSYRDB SBRïtoolkitïDSYRDD MKLïDSYTRD MKLïScalapackïPDSYTRD LPKïreferenceïDSYTRD 50X Gflop/s 90 12X 80 70 60 50 40 30 20 10 0 2k 3k 4k 6k 8k 10k 12k 14k Matrix size 16k 18k 20k 22k 24k A. Haidar, H. Ltaief and J. Dongarra, SC’11. A. Haidar, H. Ltaief and J. Dongarra, SIAM SISC 2012. H. Ltaief Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary BRD Performance Results on four sockets 12 AMD cores 140 130 120 110 100 PLASMA DGEBRD PLASMA DGEBRD w/o grouping DGEBRD from [4] MKLïDGEBRD MKLïScalapackïPDGEBRD LPKïreferenceïDGEBRD 50X Gflop/s 90 12X 80 70 60 50 40 30 20 10 0 2k 3k 4k 6k 8k 10k 12k 14k Matrix size 16k 18k 20k 22k 24k A. Haidar, H. Ltaief, P. Luszczek and J. Dongarra, IPDPS’12. H. Ltaief Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary GSEVP: What we solve? Ax = λBx A, B ∈ Rn×n , A = AT or xBx H > 0 x ∈ Rn , λ ∈ R or A, B ∈ Cn×n , x ∈ Cn , λ ∈ R A = AH A is symmetric or Hermitian B is symmetric positive definite H. Ltaief Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations GSEVP: Why we solve it? To obtain energy eigenstates in: • Chemical cluster theory • Electronic structure of semiconductors • Ab-initio energy calculations of solids H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary GSEVP: How to solve it? Ax = λBx Operation 1 2 Explanation B = L × LT L−1 C= ×A× or HEGST LAPACK routine name Cholesky factorization L−T POTRF application of triangular factors SYGST 3 T = Q T × C × Q tridiagonal reduction SYEVD or HEEVD 4 Tx = λx QR iteration H. Ltaief STERF Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations All computational stages: separately 1:1 2:2 1:1 1:1 3:3 2:2 2:3 4:2 3:3 3:6 5:2 4:2 4:1 5:2 6:1 5:2 6:1 7:2 6:3 7:2 8:1 7:1 8:1 9:1 8:1 9:1 10:1 9:1 10:1 10:1 11:1 H. Ltaief 11:1 Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations All computational stages: combined • Dependencies are tracked inside PLASMA by QUARK. 1:1 2:4 3:9 4:4 5:11 6:8 7:6 8:5 9:7 10:4 11:4 12:2 13:2 14:3 15:3 16:1 17:2 18:1 19:1 20:1 21:1 22:1 23:1 24:1 H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Combining stages: matrix view H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary Results on Four-Socket AMD Magny Cours (48 cores) 2500 Time in seconds 2000 PLASMA xSYGST + two−stage TRD MKL xSYGST + MKL SBR MKL xSYGST + MKL TRD MKL xSYGST + Netlib SBR LAPACK xSYGST + LAPACK TRD 1500 1000 500 2000 4000 6000 8000 Matrix size 10000 12000 14000 16000 H. Ltaief, P. Luszczek, A. Haidar and J. Dongarra, ParCo’11. H. Ltaief Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary A−1 , Seriously??? • YES! • Critical component of the variance-covariance matrix computation in statistics (cf. Higham, Accuracy and Stability of Numerical Algorithms, Second Edition, SIAM, 2002) • A is a dense symmetric matrix • Three steps: 1 Cholesky factorization (DPOTRF) 2 Inverting the Cholesky factor (DTRTRI) 3 Calculating the product of the inverted Cholesky factor with its transpose (DLAUUM) • StarPU dynamic runtime system used here H. Ltaief Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations A−1 , Hybrid Architecture Targeted • =⇒ PCIe 3.0 x16 Interconnect 64Gb/s, very thin pipe! • =⇒ Fermi C2050 448 cuda cores 515 Gflop/s • =⇒ Kepler K20x 2688 cuda cores 1.31 Tflop/s H. Ltaief Summary Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Performance Results 500 450 Tile Hybrid CPU−GPU MAGMA PLASMA LAPACK 400 350 300 Gflop/s Credits 250 200 150 100 50 0.5 1 1.5 Matrix size 2 2.5 4 x 10 H. Ibeid, D. Kaushik, D. Keyes and H. Ltaief, HIPC’11, India. H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations System with More Sockets • Run into NUMA-related issues • Reduce data movement between NUMA nodes on a single chip. • Need new scheduling policy: socket-aware as opposed to socket-oblivious • Keep productivity in mind! But no free lunch • Redesign the algorithm by introducing nested parallelism to enhance data locality H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Nested Feature with OMPSs • Extension to OpenMP • Task parameters and directionality provided by the user through pragmas • Focus on tasks and Data dependences • Algorithmic changes to suit the underlying NUMA-based System 0 2 0 2 0 2 0 2 0 2 0 2 1 3 1 3 1 3 1 3 1 3 1 3 0 2 0 2 0 2 0 2 0 2 0 2 1 3 1 3 1 3 1 3 1 3 1 3 0 2 0 2 0 2 0 2 0 2 0 2 1 3 1 3 1 3 1 3 1 3 1 3 0 2 0 2 0 2 0 2 0 2 0 2 1 3 1 3 1 3 1 3 1 3 1 3 0 2 0 2 0 2 0 2 0 2 0 2 1 3 1 3 1 3 1 3 1 3 1 3 0 2 0 2 0 2 0 2 0 2 0 2 1 3 1 3 1 3 1 3 1 3 1 3 0 2 0 2 0 2 0 2 0 2 0 2 (a) Tile data layout. 1 3 1 3 1 3 1 3 1 3 1 3 0 2 0 2 0 2 0 2 0 2 0 2 1 3 1 3 1 3 1 3 1 3 1 3 0 2 0 2 0 2 0 2 0 2 0 2 1 3 1 3 1 3 1 3 1 3 1 3 0 2 0 2 0 2 0 2 0 2 0 2 1 3 1 3 1 3 1 3 1 3 1 3 0 2 0 2 0 2 0 2 0 2 0 2 1 3 1 3 1 3 1 3 1 3 1 3 0 2 0 2 0 2 0 2 0 2 0 2 1 3 1 3 1 3 1 3 1 3 1 3 (b) within a block. H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary Performance on Eight Sockets Six AMD Istanbul Cores R. Al-Omairy, MS Thesis, 2013. Paper in preparation for SIAM SISC. H. Ltaief Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary Performance on Eight Sockets Six AMD Istanbul Cores R. Al-Omairy, MS Thesis, 2013. Paper in preparation for SIAM SISC. H. Ltaief Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations QUARK and DVFS G. Bosilca, J. Dongarra and H. Ltaief, EnaHPC’12. H. Ltaief, P. Luszczek and J. Dongarra, EnaHPC’13. H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV QUARK and Resiliency H. Ltaief Stencil Computations Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Outline 1 Motivation 2 Dense Computations 3 KAUST-BLAS and SpMV 4 Stencil Computations 5 Summary H. Ltaief Stencil Computations Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary K-BLAS • KBLAS is a subset of BLAS kernels for GPUs (Integration to NVIDIA CUBLAS library). • Current release is KBLAS-0.2 (http://ksl.kaust.edu.sa/Pages/HatemLtaief.aspx). • GEMV/SYMV, ALL precisions, support for Fermi and Kepler GPUs (AMD FirePro?). • Hiding latency through double buffering, stressing register and optimizing shared memory usage. H. Ltaief Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary K-BLAS Performance A. Abdelfattah, J. Dongarra, D. Keyes and H. Ltaief, VECPAR’11. H. Ltaief Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations K-BLAS Performance A. Abdelfattah, D. Keyes and H. Ltaief, HeteroPar’12. H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV K-BLAS Performance H. Ltaief Stencil Computations Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations SpMV Performance Research project (reservoir) funded by Saudi Aramco. H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Outline 1 Motivation 2 Dense Computations 3 KAUST-BLAS and SpMV 4 Stencil Computations 5 Summary H. Ltaief Stencil Computations Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Seismic Acquisition H. Ltaief Stencil Computations Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Standard Stencil Time Integration • Reverse Time Migration • Acoustic Wave Equation with constant density • Numerical solution: Finite Difference • Isotropic Case: Axis directions (X,Y,Z) • Tilted Transverse Isotropic (TTI) Case: Axis directions (X,Y,Z) & Cross Derivatives (XY, YZ, XZ) H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Standard Stencil Time Integration • Isotropic stencil • TTI stencil H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Standard Stencil Time Integration • Explicit time integration scheme with domain decomposition • Order one in space and time H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Standard Stencil Time Integration • Explicit time integration scheme with domain decomposition • Order one in space and time H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Standard Stencil Time Integration • Explicit time integration scheme with domain decomposition • Order one in space and time H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Standard Stencil Time Integration • Explicit time integration scheme with domain decomposition • Order one in space and time H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Standard Stencil Time Integration • Explicit time integration scheme with domain decomposition • Order one in space and time H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Standard Stencil Time Integration • Explicit time integration scheme with domain decomposition • Order one in space and time H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations New Stencil Time Integration • Communication-avoiding by reducing halo exchanges frequency • Synchronization-reducing by instruction reordering H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations New Stencil Time Integration • Communication-avoiding by reducing halo exchanges frequency • Synchronization-reducing by instruction reordering H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations New Stencil Time Integration • Communication-avoiding by reducing halo exchanges frequency • Synchronization-reducing by instruction reordering H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations New Stencil Time Integration • Communication-avoiding by reducing halo exchanges frequency • Synchronization-reducing by instruction reordering H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations New Stencil Time Integration • Communication-avoiding by reducing halo exchanges frequency • Synchronization-reducing by instruction reordering H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations New Stencil Time Integration • Communication-avoiding by reducing halo exchanges frequency • Synchronization-reducing by instruction reordering H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations New Stencil Time Integration • Communication-avoiding by reducing halo exchanges frequency • Synchronization-reducing by instruction reordering H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Paulius’s Strategy H. Ltaief Stencil Computations Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Our Pyramid Strategy H. Ltaief Stencil Computations Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Performance Results on 8 NVIDIA Kepler GPUs A. Abdelfattah, D. Keyes and H. Ltaief, to be submitted to ACM TOMS. H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Stencil kernels • Seismic kernels are typically order 2 in time and order 8 in space • New opportunities: how about considering lower order methods? • Two phases: calculate solutions inside a cone and then outside the cone • The cone kernel becomes the fine-grain task to dynamically schedule (shipped to cuda SMs) • No numerical instabilities added to the original scheme • Load imbalance issues due to PML high compute intensity H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Outline 1 Motivation 2 Dense Computations 3 KAUST-BLAS and SpMV 4 Stencil Computations 5 Summary H. Ltaief Stencil Computations Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary Summary • Teaching a thousand cores through dynamic runtime systems. • No free lunch. • Need new flexible/asynchronous algorithms. • Need to expose the parallelism. • Dataflow programming model can play a major role for exascale challenges. • Need efficient runtime with high productivity in mind. • Work for dense and potentially for sparse computations. H. Ltaief Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations HPC Recipe For Exascale Computing F pa ine ra -g lle ra lis in m ef Po fic w ie er nc y Dynamic runtime systems Auto-Tuning D a R ta ed M Sy uc oti nc i n on g R hro ed n uc i z a in ti g on Credits Hardware and Software Co-design 1018 H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary Just in case you are wondering what’s beyond ExaFlops... H. Ltaief Credits Motivation Dense Computations KAUST-BLAS and SpMV Thank you! H. Ltaief Stencil Computations Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations What Can FMM Do For You? H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Data Driven Fast Multipole Method H. Ltaief Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Tree Traversal H. Ltaief Stencil Computations Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Summary Strong Scaling w/ QUARK on a 16 Core ShMem System H. Ltaief and R. Yokota, Concurrency and Computations: Practice H. Ltaief Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Power Monitoring with PowerPack H. Ltaief Summary Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Power Rate of LAPACK TRD 300 250 200 Power (Watts) Credits 150 System CPU Memory Motherboad Fan 100 50 0 0 50 100 150 200 Time (seconds) 250 300 H. Ltaief, P. Luszczek, J. Dongarra – EnaHPC’11, Germany H. Ltaief 350 Summary Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations Power Rate of PLASMA TRD 300 250 200 Power (Watts) Credits 150 System CPU Memory Motherboad Fan 100 50 0 0 50 100 150 200 Time (seconds) 250 300 H. Ltaief, P. Luszczek, J. Dongarra – EnaHPC’11, Germany H. Ltaief 350 Summary Credits Motivation Dense Computations KAUST-BLAS and SpMV Stencil Computations OmpSs Snapshot #pragma css task input(A[NB][NB]) inout(T[NB][NB]) void dsyrk(double *A, double *T); #pragma css task inout(T[NB][NB]) void dpotrf(double *T); #pragma css task input(A[NB][NB], B[NB][NB]) inout(C[NB][NB]) void dgemm(double *A, double *B, double *C); #pragma css task input(T[NB][NB]) inout(B[NB][NB]) void dtrsm(double *T, double *C); #pragma css start for (k = 0; k < TILES; k++) { for (n = 0; n < k; n++) dsyrk(A[k][n], A[k][k]); dpotrf(A[k][k]); for (m = k+1; m < TILES; m++) { for (n = 0; n < k; n++) dgemm(A[k][n], A[m][n], A[m][k]); dtrsm(A[k][k], A[m][k]); } } #pragma css finish H. Ltaief Summary
© Copyright 2024