07/09/2013

How to Use a Quantum Chemistry Code over 100,000 CPUs
Edoardo Aprà
Pacific Northwest National Laboratory

Outline
• NWChem description
• Parallelization strategies

NWChem
• Developed at EMSL/PNNL
• Provides major modeling and simulation capability for molecular science
  ‣ Broad range of molecules, including catalysts, biomolecules, and heavy elements
  ‣ Solid-state capabilities
• Performance characteristics: designed for MPP
  ‣ Runs on a wide range of computers
• Open source → large user community
• Uses Global Arrays/ARMCI for parallelization

NWChem Structure
[Architecture diagram] Layered design, from top to bottom:
• Generic tasks: energy, structure, optimize, dynamics, …
• Molecular calculation modules: SCF energy and gradient, DFT energy and gradient, MD, NMR, solvation, …
• Molecular modeling toolkit: basis set object, geometry object, integral API, PeIGS, …
• Molecular software development toolkit: run-time database, memory allocator, parallel I/O, Global Arrays
Design principles:
• Object-oriented design: abstraction, data hiding, APIs
• Parallel programming model: non-uniform memory access, Global Arrays, MPI
• Infrastructure: GA, parallel I/O, RTDB, MA, …
• Program modules: communication only through the run-time database; persistence for easy restart

Replicated vs Distributed: Matrix distribution
Example: a parallel computer made of 8 processors. How do we distribute the elements of a 2-D matrix?
[Figure: Distributed: the matrix is split into 8 patches, one per processor. Replicated: every processor holds a full copy of the matrix.]

Approaches to parallelization
• Distributed data - Pros
  ‣ Smaller memory usage
  ‣ Better scaling at large numbers of processors (more later)
• Distributed data - Cons
  ‣ More network traffic required at small/moderate numbers of processors
• Replicated data - Pros
  ‣ Better scaling at small to moderate numbers of processors
• Replicated data - Cons
  ‣ Larger memory footprint
  ‣ Worse scaling at large numbers of processors, because of the collective operations needed to merge the resulting matrices
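A minimal Fortran sketch (not part of NWChem; matrix size, process grid, and element choice are all illustrative) of what "distributed" means for a 2-D matrix on 8 processors arranged as a 2x4 grid of blocks: each global element (i,j) is owned by exactly one process and lives at a local offset in that process's buffer, whereas in the replicated scheme every process would hold the full n x n copy.

  ! Illustrative only: map a global element (i,j) to its owning process
  ! and to its local (column-major) offset under a 2x4 block distribution.
  program matrix_owner_sketch
    implicit none
    integer, parameter :: n = 8              ! global matrix is n x n
    integer, parameter :: prow = 2, pcol = 4 ! 8 processes in a 2 x 4 grid
    integer :: i, j, brow, bcol, iproc, iloc, jloc, offset

    brow = n/prow                            ! rows per process block
    bcol = n/pcol                            ! columns per process block
    i = 6
    j = 7
    ! which process owns global element (i,j) (0-based process numbering)
    iproc = ((i-1)/brow)*pcol + (j-1)/bcol
    ! local position and offset of (i,j) inside that process's brow x bcol buffer
    iloc = mod(i-1, brow) + 1
    jloc = mod(j-1, bcol) + 1
    offset = (jloc-1)*brow + iloc
    write(*,'(3(a,i0))') 'A(', i, ',', j, ') is owned by process ', iproc
    write(*,'(a,i0,a)') 'and stored at buf(', offset, ') of that process''s local buffer'
  end program matrix_owner_sketch

Hiding exactly this index translation is what the Global Arrays layer, described next, provides.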
Global Arrays
• Distributed dense arrays that can be accessed through a shared-memory-like style
• High-level abstraction layer for the application developer (that's me!)
• One-sided model: no need to worry about matching send/receive pairs
• Physically distributed data presented as a single, shared data structure with global indexing
  ‣ e.g., access A(4,3) rather than buf(7) on task 2
• Global address space

Gaussian DFT computational kernel
Evaluation of the XC potential matrix element:
• the density at each grid point x_q is built from the density matrix D, fetched with ga_get:
  ρ(x_q) = Σ_{μν} D_{μν} χ_μ(x_q) χ_ν(x_q)
• the quadrature contribution is accumulated into the Fock matrix F with ga_acc:
  F^xc_{μν} += Σ_q w_q χ_μ(x_q) V_xc[ρ(x_q)] χ_ν(x_q)
The work is distributed with a shared counter (a fuller sketch of this loop is given after these slides):

  my_next_task = SharedCounter()
  do i=1,max_i
     if(i.eq.my_next_task) then
        call ga_get()
        (do work)
        call ga_acc()
        my_next_task = SharedCounter()
     endif
  enddo
  barrier()

Both GA operations are strongly dependent on the communication latency.

Parallel scaling of the DFT code
• Si28O148H66, 3554 basis functions
• LDA wavefunction, XC build
• Benchmark run on Cray XK6

Parallel scaling of the DFT code
• Si159O264H110, 7108 basis functions
• LDA wavefunction, XC build
• Benchmark run on Cray XK6

Hybrid computing
Goes beyond the send-receive construct of message passing (a.k.a. MPI):
• OpenMP (directives)
• Shared memory
• Cilk
• Intel TBB
• Threads

Mirrored Arrays: Matrix distribution
Global Arrays that are replicated between SMP nodes but distributed within SMP nodes
[Figure: Distributed: one patch per processor across the whole machine. Mirrored: each SMP node holds a copy of the matrix, distributed among the processors of that node. Replicated: every processor holds a full copy.]

Trends in Chemistry Codes
• Multi-level parallelism
  ‣ Effective path for most applications to scale to O(10^5) processors
  ‣ Examples: coarse grain over vibrational degrees of freedom in a numerical Hessian, or over geometries in a surface scan or parameter study
  ‣ Conventional distributed memory within each subtask
  ‣ Fine-grain parallelism within a few-processor SMP (multi-threads, OpenMP, parallel BLAS, …)
  ‣ Example of application later in the talk
• Efficient exploitation of fine-grain parallelism is a major concern on future architectures
• Emergence of accelerators
  ‣ GPUs
  ‣ Intel Xeon Phi
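The shared-counter loop on the "Gaussian DFT computational kernel" slide can be filled in with the actual Global Arrays calls. The Fortran sketch below is not NWChem source: the handles g_d (density matrix), g_f (Fock matrix) and g_cnt (a 1x1 integer GA used as the shared counter), the row blocking, and the subroutine name are assumptions for illustration, and the "(do work)" step is left as a stub.

  ! Minimal sketch of the dynamically load-balanced get / compute / accumulate
  ! loop: ga_read_inc is GA's atomic read-and-increment on an integer array,
  ! ga_get fetches a patch of the distributed D, ga_acc adds the local XC
  ! contribution into the global F.
  subroutine xc_build_sketch(g_d, g_f, g_cnt, n, nblk)
    implicit none
    integer, intent(in) :: g_d, g_f, g_cnt   ! GA handles: D, F, shared counter
    integer, intent(in) :: n, nblk           ! basis size, number of row blocks
    double precision, allocatable :: dbuf(:,:), fbuf(:,:)
    integer :: blksz, itask, ib, ilo, ihi, nrow
    integer, external :: ga_read_inc         ! declared by GA's global.fh in real code

    blksz = (n + nblk - 1)/nblk
    allocate(dbuf(blksz,n), fbuf(blksz,n))
    itask = ga_read_inc(g_cnt, 1, 1, 1)      ! my_next_task = SharedCounter()
    do ib = 0, nblk - 1
       if (ib .eq. itask) then
          ilo = ib*blksz + 1
          ihi = min(n, ilo + blksz - 1)
          nrow = ihi - ilo + 1
          ! one-sided fetch: the patch may live in any remote process's memory
          call ga_get(g_d, ilo, ihi, 1, n, dbuf, blksz)
          ! (do work): evaluate rho(x_q) from dbuf and the XC contribution in fbuf
          fbuf(1:nrow, 1:n) = 0.0d0
          ! one-sided accumulate of the local contribution into the global F
          call ga_acc(g_f, ilo, ihi, 1, n, fbuf, blksz, 1.0d0)
          itask = ga_read_inc(g_cnt, 1, 1, 1) ! grab the next task
       endif
    enddo
    call ga_sync()                            ! barrier() on the slide
    deallocate(dbuf, fbuf)
  end subroutine xc_build_sketch

Both one-sided calls complete without any action by the process that owns the patch, which is why only the final ga_sync is needed; it is also why, as noted on the slide, the whole scheme is sensitive to communication latency.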
What is CCSD(T)?
• Coupled-cluster (CC) theory is a numerical many-body technique that incorporates the effect of electron correlation on the electronic structure of molecular systems
• CCSD(T) estimates the effect of electron correlation by considering single, double, and (perturbatively) triple excitations
• Valence-only CCSD(T) calculations are the "gold standard of quantum chemistry" for their chemical accuracy in determining molecular energetics
• Numerical cost scales as N^7 (N = number of electrons)

CCSD(T) algorithm
• "aijkbc algorithm" of Rendell and coworkers
• No use of I/O
• Intermediate quantities (two-electron integrals and coupled-cluster wave-function amplitudes) stored in global memory (GA)
• Floating-point intensive kernel of this algorithm: BLAS DGEMM calls

Main steps of a CCSD(T) run
• MP2 energy & transformation of the molecular orbitals
• Generation of the required 2-electron integrals and storage in a Global Array (ga_acc)
• Main CCSD(T) loop
  ‣ fetching the 2-electron integrals with ga_get
  ‣ computationally intensive kernel via DGEMM

CCSD(T) kernel: code features - I
• To scale at 1K procs: increased data locality to reduce communication
• To scale at 40K procs: implemented more careful tiling of intermediates to
  ‣ reduce memory consumption and
  ‣ increase parallelism and load balance
• These two modifications can be seen as PGAS-style programming (distinction between local and global memory)

CCSD(T) kernel: code features - II
• Three levels of the memory hierarchy in a dynamically load-balanced algorithm
  ‣ intermediate results fit in the available global memory
  ‣ nested loops tiled so that the data for each task fits into local memory
  ‣ each process accesses a global shared counter to determine its next task
• Data moved from global into local memory via ga_get

CCSD(T) run on Cray XT5: 18 water molecules (February 2009)
• Floating-point performance at 90K cores: 358 TFlops
• (H2O)18, 54 atoms, 918 basis functions, cc-pVTZ(-f) basis

CCSD(T) run on Cray XT5: 20 water molecules (February 2009)
• Floating-point performance at 96K cores: 480 TFlops, efficiency > 50%
• (H2O)20, 60 atoms, 1020 basis functions, cc-pVTZ(-f) basis

NWChem code changes required to scale beyond 100K cores
• Many-to-one communication patterns cause the job
  ‣ to progress very slowly (best case)
  ‣ to hang (or worse)
• Diagnosis: cumbersome … lucky coredumps!
• Cause: several processors simultaneously accessing the same patch of a matrix, where the patch is owned by a single processor
• Solutions (see the sketch after these slides)
  ‣ Staggered access
  ‣ Use of a subset of processing elements
  ‣ Atomic shared counter: first token is static

CCSD(T) run on Cray XT5: 24 water molecules (November 2009)
• Floating-point performance at 223K cores: 1.39 PetaFLOP/s
• (H2O)24, 72 atoms, 1224 basis functions, cc-pVTZ(-f) basis
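One reading of the "staggered access" fix, sketched below in Fortran, is the following: if every process starts its sweep over tiles at tile 0, the first wave of one-sided gets all lands on the single process that owns that tile; starting each process at a rank-dependent offset and wrapping around spreads those requests over the whole machine. Everything here is an illustrative assumption rather than the NWChem implementation: the GA handle g_v, the square tiling, the dummy DGEMM shapes, and the use of a plain static loop instead of the shared-counter scheduler.

  ! Illustrative sketch of staggered tile access combined with the
  ! DGEMM-dominated tile kernel; not the NWChem CCSD(T) code.
  subroutine staggered_tiles_sketch(g_v, ntile, tilesz)
    implicit none
    integer, intent(in) :: g_v            ! GA holding a distributed intermediate
    integer, intent(in) :: ntile, tilesz  ! number of tiles, tile edge length
    double precision, allocatable :: a(:,:), b(:,:), c(:,:)
    integer :: me, nproc, k, it, ilo, ihi
    integer, external :: ga_nodeid, ga_nnodes

    me    = ga_nodeid()
    nproc = ga_nnodes()
    allocate(a(tilesz,tilesz), b(tilesz,tilesz), c(tilesz,tilesz))
    c = 0.0d0
    do k = 0, ntile - 1
       ! rank-dependent starting tile, wrapping around the tile list
       it  = mod(k + (me*ntile)/nproc, ntile)
       ilo = it*tilesz + 1
       ihi = ilo + tilesz - 1
       ! one-sided fetch of one tile of the distributed intermediate
       call ga_get(g_v, ilo, ihi, ilo, ihi, a, tilesz)
       b = a
       ! floating-point intensive kernel: BLAS DGEMM on local tiles
       call dgemm('n', 'n', tilesz, tilesz, tilesz, 1.0d0, a, tilesz, &
                  b, tilesz, 1.0d0, c, tilesz)
    enddo
    call ga_sync()
    deallocate(a, b, c)
  end subroutine staggered_tiles_sketch

In practice this is combined with the other two fixes listed above; in particular, once the dynamic shared counter has handed out the first (statically assigned) token, processes naturally fall out of lock-step and the many-to-one hot spot does not recur.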
Tensor Contraction Engine (TCE) (K. Kowalski)
• Symbolic algebra systems for coding complicated tensor expressions: the Tensor Contraction Engine (TCE)
  ‣ Hirata, J. Phys. Chem. A 107, 9887 (2003)
  ‣ Sadayappan, Krishnamoorthy, et al., Proceedings of the IEEE 93, 276 (2005)
  ‣ Lai, Zhang, Rajbhandari, Valeev, Kowalski, Sadayappan, Procedia Computer Science (2012)
• New implementation of CC methods (since 2003)
  ‣ more effective for implementing new methods
  ‣ easier tuning and porting

Tensor Contraction Engine (TCE) (K. Kowalski)
• Tile structure: [Figure: occupied and unoccupied spinorbitals partitioned into tiles S1, S2, …]
• Tensor structure: amplitudes and integrals are stored as blocks labeled by tile indices, e.g. T_a^i held as tiles T[[p][h]]

New elements of parallel design for the iterative EOMCCSD method (K. Kowalski)
• Use of Global Arrays (GA) to implement a distributed-memory model
• Iterative CCSD/EOMCCSD methods (basic challenges)
  ‣ Global memory requirements
  ‣ Complex load balancing
  ‣ Complicated communication pattern: use of one-sided ga_get, ga_put, ga_acc
• Implementation improvements
  ‣ New way of representing antisymmetric 2-electron integrals for the restricted (RHF) and restricted open-shell (ROHF) references
  ‣ Replication of low-rank tensors
  ‣ New task scheduling for the CCSD/EOMCCSD methods

New elements of parallel design for the non-iterative CR-EOMCCSD(T) method (K. Kowalski)
• Use of Global Arrays (GA) to implement a distributed-memory model
• Basic challenges for the non-iterative CR-EOMCCSD(T) method
  ‣ Local memory requirements: (tilesize)^4 (EOMCCSD) vs. M*(tilesize)^6 (CR-EOMCCSD(T))
• Implementation improvements
  ‣ Two-fold reduction of local memory use: 2*(tilesize)^6
  ‣ New algorithms which enable the decomposition of six-dimensional tensors

Scalability of iterative EOMCC methods (K. Kowalski)
• Alternative task schedulers use a "global task pool"
• Improve load balancing
• Reduce the number of synchronization steps to the absolute minimum
• Larger tiles can be used effectively

Scalability of the non-iterative EOMCC code (K. Kowalski)
• 94% parallel efficiency using 210,000 cores
• Scalability of the triples part of the CR-EOMCCSD(T) approach for the FBP-f-coronene system in the AVTZ basis set. Timings were determined from calculations on the Jaguar Cray XT5 computer system at NCCS/ORNL in 2011.

MRCC theory in a nutshell (K. Kowalski)
[Figure: the reference function is expanded over a model space of M reference determinants]
Schematic representation of the complete model space corresponding to two active electrons distributed over two active orbitals (red lines). Only determinants with M_S = 0 are included in the model space.

MRCC approaches: main challenges (K. Kowalski)
• Intruder-state/intruder-solution problems
• Complete model space
  ‣ Huge dimensionality
  ‣ A large number of superfluous configurations not contributing to a given state
• Overall cost of the MRCC methods
  ‣ M×N^6 (iterative MRCCSD)
  ‣ M×N^7 (non-iterative MRCCSD(T))
• Algebraic complexity of the MRCC methods

Processor groups (PGs) and reference-level parallelism (K. Kowalski)
• Coupled reference-dependent equations of the form
  F^(μ)(T^(μ)) + G^(μ)(T^(1), …, T^(μ), …, T^(M)) = 0,   μ = 1, …, M
• The reference-level parallelism can be applied to
  ‣ solving the coupled reference-dependent MRCC iterative equations
  ‣ building efficient parallel schemes for non-iterative MRCC methods
(see the processor-group sketch after these slides)

Processor groups (PGs) and reference-level parallelism (K. Kowalski)
• Scalability of the BW-MRCCSD methods for β-carotene in the 6-31G basis set (470 basis functions); a (4,4) model space of 20 references

When triple excitations are needed: MRCCSD(T) (K. Kowalski)
• Improves the quality of the MRCCSD approaches
• Counteracts the intruder-state problem
• The effective Hamiltonian from the iterative MRCCSD step is corrected non-iteratively with triples: H^eff(SD) → H^eff(SD(T))
• Numerical complexity: M×N^7
• Scalability: M × (scalability of the CCSD(T) approach)

GPU implementation of the non-iterative part of the MRCCSD(T) approach (K. Kowalski)
• ~4x speed-up; 5x observed in CCSD(T)
• Ongoing effort towards GPU-ing the iterative part of the MRCCSD(T) approach
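The reference-level parallelism above amounts to splitting the pool of processes into M groups, one per reference, with conventional distributed-data parallelism inside each group. The Fortran sketch below illustrates only that grouping step, expressed as an MPI communicator split; the implementation described on these slides is built on Global Arrays processor groups rather than raw MPI communicators, and the reference count and all names here are illustrative assumptions.

  ! Illustrative only: divide the processes into nref groups, one group per
  ! reference of the model space; each group would then solve the CC equations
  ! for its own reference with distributed-data parallelism.
  program reference_level_parallelism_sketch
    use mpi
    implicit none
    integer, parameter :: nref = 20          ! M references, e.g. a 20-reference model space
    integer :: ierr, me, nproc, my_ref, group_comm, group_me

    call mpi_init(ierr)
    call mpi_comm_rank(mpi_comm_world, me, ierr)
    call mpi_comm_size(mpi_comm_world, nproc, ierr)

    ! color = which reference this process is assigned to
    my_ref = mod(me, nref)
    call mpi_comm_split(mpi_comm_world, my_ref, me, group_comm, ierr)
    call mpi_comm_rank(group_comm, group_me, ierr)

    ! each group now runs the solver for its reference; here we only report the assignment
    write(*,'(3(a,i0))') 'world rank ', me, ' works on reference ', my_ref, &
         ' as group rank ', group_me

    call mpi_comm_free(group_comm, ierr)
    call mpi_finalize(ierr)
  end program reference_level_parallelism_sketch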
Thanks
• Karol Kowalski, Kiran Bhaskaran-Nair (PNNL)
• Wenjing Ma, Sriram Krishnamoorthy, Jeff Daily, Abhinav Vishnu, Bruce Palmer (PNNL)
• Vinod Tipparaju (AMD)
• Ryan Olson (Cray)
• Jiri Brabec (Czech Ac. Sc.)
• Oreste Villa & Norbert Juffa (NVIDIA)

Acknowledgements
• PNNL eXtreme Scale Computing Initiative
• Dept. of Energy, Office of Biological and Environmental Research
• Resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, allocated through the INCITE program
• EMSL computing resources (Chinook HP system)

Thank you

Backup

GPU implementation of the non-iterative part of the MRCCSD(T) approach (K. Kowalski)
• Ongoing effort towards GPU-ing the iterative part of the MRCCSD(T) approach (Oreste Villa & Norbert Juffa, NVIDIA)