How to Use a Quantum Chemistry Code over 100,000 CPUs

07/09/2013
Edoardo Aprà
Pacific Northwest National Laboratory
Outline
• NWChem description
• Parallelization strategies
NWChem
• Developed at EMSL/PNNL
• Provides major modeling and simulation capability for molecular science
‣ Broad range of molecules, including catalysts, biomolecules, and heavy elements
‣ Solid-state capabilities
• Performance characteristics: designed for MPP
‣ Runs on a wide range of computers
• Open source → large user community
• Uses Global Arrays/ARMCI for parallelization
NWChem Structure
[Architecture diagram: layered software stack]
• Generic Tasks: Energy, structure, ...; Optimize, Dynamics, ...
• Molecular Calculation Modules: SCF energy, gradient, ...; DFT energy, gradient, ...; MD, NMR, Solvation, ...
• Molecular Modeling Toolkit: Basis Set Object, Geometry Object, Integral API, PeIGS, ...
• Molecular Software Development Toolkit: Run-time database, Parallel IO, Memory Allocator, Global Arrays
• Object-oriented design
‣ abstraction, data hiding, APIs
• Parallel programming model
‣ non-uniform memory access, Global Arrays, MPI
• Infrastructure
‣ GA, Parallel I/O, RTDB, MA, ...
• Program modules
‣ communication only through the database
‣ persistence for easy restart
Replicated vs Distributed: Matrix distribution
Example: a parallel computer made of 8 processors. How should the elements of a 2-D matrix be distributed?
[Figure: the same matrix stored two ways: Distributed (each of the 8 processors owns one block of the matrix) vs. Replicated (each processor holds a full copy)]
Approaches to parallelization
• Distributed data - Pros
‣ Smaller memory usage
‣ Better scaling at large numbers of processors (more later)
• Replicated data - Pros
‣ Better scaling at small to moderate numbers of processors
• Distributed data - Cons
‣ More network traffic required at small/moderate numbers of processors
• Replicated data - Cons
‣ Larger memory footprint
‣ Worse scaling at large numbers of processors, because of the collective operations needed to merge the resulting matrices
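To make the memory trade-off concrete (a back-of-the-envelope estimate, not a figure from the slides): a 7108 × 7108 double-precision matrix, the size of the larger DFT benchmark later in the talk, occupies about 0.4 GB. Replicated, every process holds the full 0.4 GB; distributed over 1024 processes, each process holds roughly 0.4 MB plus communication buffers.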
Global Arrays
• Distributed dense arrays that can be accessed through a shared-memory-like style
• High-level abstraction layer for the application developer (that's me!)
• One-sided model = no need to worry about matching send/receive
[Figure: Global Address Space. Physically distributed data is presented as a single, shared data structure with global indexing, e.g. access A(4,3) rather than buf(7) on task 2]
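As a concrete illustration (a minimal sketch, not code from the talk), the classic Global Arrays Fortran interface lets any process read or update any patch of a distributed matrix by its global indices; initialization details (MPI vs. TCGMSG startup, MA sizes, header paths) depend on how GA was built:

      program ga_onesided_demo
      implicit none
#include "mafdecls.fh"
#include "global.fh"
      include 'mpif.h'
      integer g_a, ld, ierr
      parameter (ld = 10)
      double precision buf(ld, ld)
      logical ok
c     start MPI and the GA runtime (assumes an MPI-based GA build)
      call mpi_init(ierr)
      call ga_initialize()
      ok = ma_init(MT_DBL, 100000, 100000)
c     create a 100 x 100 double-precision array distributed over all ranks
      ok = ga_create(MT_DBL, 100, 100, 'A', -1, -1, g_a)
      call ga_zero(g_a)
c     one-sided fetch of the patch A(1:10,1:10) into a local buffer,
c     addressed by global indices: no need to know which rank owns it
      call ga_get(g_a, 1, ld, 1, ld, buf, ld)
      buf(4, 3) = buf(4, 3) + 1.0d0
c     one-sided accumulate (scaled add) of the buffer back into the patch
      call ga_acc(g_a, 1, ld, 1, ld, buf, ld, 1.0d0)
      call ga_sync()
      ok = ga_destroy(g_a)
      call ga_terminate()
      call mpi_finalize(ierr)
      end

The same two calls, ga_get and ga_acc, are the building blocks of the DFT and CCSD(T) kernels discussed next.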
Gaussian DFT computational kernel
Evaluation of the XC potential matrix elements:

  ρ(x_q) = Σ_{μν} D_{μν} χ_μ(x_q) χ_ν(x_q)
  F_{μν} += Σ_q w_q χ_μ(x_q) V_xc[ρ(x_q)] χ_ν(x_q)

  my_next_task = SharedCounter()
  do i=1,max_i
    if(i.eq.my_next_task) then
      call ga_get()        ! fetch the needed patch of D
      (do work)
      call ga_acc()        ! accumulate the contribution into F
      my_next_task = SharedCounter()
    endif
  enddo
  barrier()

Both GA operations (the ga_get on D and the ga_acc on F) depend strongly on the communication latency.
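The SharedCounter() in the pseudocode can be realized with GA's atomic read-and-increment on a one-element integer Global Array. Below is a minimal sketch (the handles g_d, g_f, g_cnt, the patch bounds, and the buffers are illustrative placeholders, not the actual NWChem variables; g_cnt is assumed to be a 1x1 MT_INT array created and zeroed before entry):

      subroutine xc_task_loop(g_d, g_f, g_cnt, max_i,
     &                        ilo, ihi, jlo, jhi, dbuf, fbuf, ld)
      implicit none
#include "global.fh"
      integer g_d, g_f, g_cnt, max_i, ilo, ihi, jlo, jhi, ld, i, next
      double precision dbuf(ld,*), fbuf(ld,*)
c     ga_read_inc returns the counter's current value and increments it
c     atomically, giving every process a distinct task number
      next = ga_read_inc(g_cnt, 1, 1, 1)
      do i = 0, max_i - 1
         if (i .eq. next) then
c           one-sided fetch of the needed patch of the density matrix D
            call ga_get(g_d, ilo, ihi, jlo, jhi, dbuf, ld)
c           ... quadrature work: build the local XC contribution in fbuf ...
c           one-sided accumulate of the contribution into the Fock matrix F
            call ga_acc(g_f, ilo, ihi, jlo, jhi, fbuf, ld, 1.0d0)
            next = ga_read_inc(g_cnt, 1, 1, 1)
         end if
      end do
      call ga_sync()
      end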
Parallel scaling of the DFT code
[Scaling plot] Si28O148H66, 3554 basis functions, LDA wavefunction, XC build. Benchmark run on a Cray XK6.
Parallel scaling of the DFT code
[Scaling plot] Si159O264H110, 7108 basis functions, LDA wavefunction, XC build. Benchmark run on a Cray XK6.
Hybrid computing
Goes beyond the send/receive constructs of message passing (a.k.a. MPI):
• OpenMP (directives)
• Shared memory
• Cilk
• Intel TBB
• Threads
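For example (a generic illustration rather than NWChem code), an OpenMP directive adds node-local, fine-grain parallelism to an existing loop while MPI/GA keeps handling the distributed level:

      subroutine axpy_omp(n, alpha, x, y)
      implicit none
      integer n, i
      double precision alpha, x(n), y(n)
c     fine-grain shared-memory parallelism within one SMP node via OpenMP
!$omp parallel do private(i) shared(n, alpha, x, y)
      do i = 1, n
         y(i) = y(i) + alpha*x(i)
      end do
!$omp end parallel do
      end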
Mirrored Arrays: Matrix distribution
Global Arrays that are replicated between SMP nodes but distributed within SMP nodes.
[Figure: the same 2-D matrix shown Distributed (one block per processor across all 8 processors), Mirrored (one full copy per SMP node, each copy distributed over that node's processors), and Replicated (one full copy per processor)]
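A hedged sketch of how a mirrored array can be requested through the Global Arrays handle/property interface; the calls (ga_create_handle, ga_set_data, ga_set_pgroup, ga_pgroup_get_mirror, ga_merge_mirrored) are taken from the GA documentation, but their exact Fortran bindings and availability should be checked against the installed GA version:

      integer function create_mirrored(n)
      implicit none
#include "mafdecls.fh"
#include "global.fh"
      integer n, g_a, dims(2)
      logical ok
      dims(1) = n
      dims(2) = n
c     request an n x n double-precision array that is mirrored: one copy
c     per SMP node, distributed over the processes within each node
      g_a = ga_create_handle()
      call ga_set_data(g_a, 2, dims, MT_DBL)
      call ga_set_pgroup(g_a, ga_pgroup_get_mirror())
      ok = ga_allocate(g_a)
c     after node-local accumulation, the per-node copies are combined
c     across nodes with ga_merge_mirrored(g_a)
      create_mirrored = g_a
      end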
Trends in Chemistry Codes
• Multi-level parallelism
‣ Effective path for most applications to scale to O(10^5) processors
‣ Examples: coarse grain over vibrational degrees of freedom in a numerical Hessian, or over geometries in a surface scan or parameter study
‣ Conventional distributed memory within each subtask
‣ Fine-grain parallelism within a few-processor SMP (multi-threads, OpenMP, parallel BLAS, ...)
‣ Example of an application later in the talk
• Efficient exploitation of fine-grain parallelism is a major concern on future architectures
• Emergence of accelerators
‣ GPUs
‣ Intel Xeon Phi
What is CCSD(T)?
• Coupled-cluster (CC) theory is a numerical many-body technique that incorporates the effect of electron correlation on the electronic structure of molecular systems
• CCSD(T) estimates the effect of electron correlation by considering single, double, and triple excitations
• Valence-only CCSD(T) calculations are the "gold standard of quantum chemistry" for their chemical accuracy in determining molecular energetics
• Numerical cost scales as N^7 (N = number of electrons)
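To put the N^7 scaling in perspective (simple arithmetic, not a number quoted on the slides): doubling the system size multiplies the cost of the CCSD(T) step by 2^7 = 128, which is why calculations on larger molecules need both many more processors and very good parallel efficiency.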
CCSD(T) algorithm
• "aijkbc algorithm" of Rendell and coworkers
• No use of I/O
• Intermediate quantities (two-electron integrals and coupled-cluster wave function amplitudes) stored in global memory (GA)
• Floating-point intensive kernel of this algorithm: BLAS DGEMM calls

Main steps of a CCSD(T) run
• MP2 energy & transformation of molecular orbitals
• Generation of the needed 2-electron integrals and their storage in a Global Array (ga_acc)
• Main CCSD(T) loop
‣ fetching the 2-electron integrals with ga_get
‣ computationally intensive kernel via DGEMM
CCSD(T) kernel: code features - I
• To scale at 1K procs: increased data locality to reduce communication
• To scale at 40K procs: implemented more careful tiling of intermediates to
‣ reduce memory consumption and
‣ increase parallelism and load balance
• These two modifications can be seen as PGAS-style programming (distinction between local and global memory)

CCSD(T) kernel: code features - II
• Three levels of the memory hierarchy in a dynamically load-balanced algorithm (see the sketch after this list)
‣ intermediate results fit in available global memory
‣ nested loops tiled so that the data for each task fits into local memory
‣ each process accesses a global shared counter to determine its next task
• Data moved from global into local memory via ga_get
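A minimal sketch of what one such task could look like (the tile size, array handles, and index ranges are illustrative placeholders, not the actual NWChem tiling or data layout): a tile of 2-electron integrals and a tile of T2 amplitudes are fetched one-sidedly into local memory and contracted with DGEMM.

      subroutine ccsd_t_task(g_v, g_t2, tlen)
      implicit none
#include "global.fh"
      integer g_v, g_t2, tlen, m, k
      double precision, allocatable :: vbuf(:,:), tbuf(:,:), rbuf(:,:)
      m = tlen*tlen
      k = tlen
      allocate(vbuf(m,k), tbuf(k,m), rbuf(m,m))
c     one-sided fetch of one integral tile and one amplitude tile from
c     global memory (GA) into local buffers
      call ga_get(g_v,  1, m, 1, k, vbuf, m)
      call ga_get(g_t2, 1, k, 1, m, tbuf, k)
c     floating-point intensive kernel: local tile contraction via DGEMM
      call dgemm('N', 'N', m, m, k, 1.0d0, vbuf, m, tbuf, k,
     &           0.0d0, rbuf, m)
c     ... rbuf would then be reduced into the (T) energy contribution ...
      deallocate(vbuf, tbuf, rbuf)
      end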
CCSD(T) run on Cray XT5: 18 water molecules (February 2009)
Floating-point performance at 90K cores: 358 TFlop/s
(H2O)18, 54 atoms, 918 basis functions, cc-pVTZ(-f) basis

CCSD(T) run on Cray XT5: 20 water molecules (February 2009)
Floating-point performance at 96K cores: 480 TFlop/s, efficiency > 50%
(H2O)20, 60 atoms, 1020 basis functions, cc-pVTZ(-f) basis
NWChem code changes required to scale beyond 100K cores
• Many-to-one communication patterns cause the job
‣ to progress very slowly (best case)
‣ to hang (or worse)
• Diagnosis: cumbersome ... lucky coredumps!
• Cause: several processes simultaneously accessing the same patch of a matrix, where the patch is owned by a single process
• Solutions (a staggered-access sketch follows this list):
‣ Staggered access
‣ Use of a subset of processing elements
‣ Atomic shared counter: first token is static
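A minimal sketch of the staggered-access idea (variable names and the blocking are illustrative): instead of every process reading the column blocks of a Global Array in the same order, each process starts from a block chosen from its rank, so concurrent ga_get calls land on different owning processes.

      subroutine staggered_read(g_a, n, nblk, blk, buf)
      implicit none
#include "global.fh"
      integer g_a, n, nblk, blk, ib, jb, jlo, jhi, me
      double precision buf(n, blk)
      me = ga_nodeid()
      do ib = 0, nblk - 1
c        rank-dependent starting block: all processes sweep the same set
c        of blocks, but in different orders, avoiding many-to-one hot spots
         jb  = mod(ib + me, nblk)
         jlo = jb*blk + 1
         jhi = min((jb + 1)*blk, n)
         call ga_get(g_a, 1, n, jlo, jhi, buf, n)
c        ... work on this block of columns ...
      end do
      end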
CCSD(T) run on Cray XT5: 24 water molecules (November 2009)
Floating-point performance at 223K cores: 1.39 PetaFLOP/s
(H2O)24, 72 atoms, 1224 basis functions, cc-pVTZ(-f) basis
K. Kowalski
Tensor Contraction Engine (TCE)
• Symbolic algebra system for coding complicated tensor expressions: the Tensor Contraction Engine (TCE)
‣ Hirata, J. Phys. Chem. A 107, 9887 (2003)
‣ Sadayappan, Krishnamoorthy, et al., Proceedings of the IEEE 93, 276 (2005)
‣ Lai, Zhang, Rajbhandari, Valeev, Kowalski, Sadayappan, Procedia Computer Science (2012)
• New implementation of CC methods (since 2003)
‣ more effective for implementing new methods
‣ easier tuning and porting
K. Kowalski
Tensor Contraction Engine (TCE)
[Figure: tile structure. The occupied and unoccupied spinorbitals are each partitioned into tiles S1, S2, ...; tensors are stored and contracted tile by tile. Tensor structure: T_a^i → T[[phnm]]]
New elements of parallel design for the iterative EOMCCSD method
K. Kowalski
• Use of Global Arrays (GA) to implement a distributed memory model
• Iterative CCSD/EOMCCSD methods (basic challenges)
‣ Global memory requirements
‣ Complex load balancing
‣ Complicated communication pattern: use of one-sided ga_get, ga_put, ga_acc
• Implementation improvements
‣ New way of representing antisymmetric 2-electron integrals for the restricted (RHF) and restricted open-shell (ROHF) references
‣ Replication of low-rank tensors
‣ New task scheduling for the CCSD/EOMCCSD methods
New elements of parallel design for the non-iterative CR-EOMCCSD(T) method
K. Kowalski
• Use of Global Arrays (GA) to implement a distributed memory model
• Basic challenges for the non-iterative CR-EOMCCSD(T) method
‣ Local memory requirements: (tilesize)^4 (EOMCCSD) vs. M*(tilesize)^6 (CR-EOMCCSD(T))
• Implementation improvements (see the worked estimate below)
‣ Two-fold reduction of local memory use: 2*(tilesize)^6
‣ New algorithms that enable the decomposition of six-dimensional tensors
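To see why the (tilesize)^6 objects dominate local memory (illustrative arithmetic with an assumed tile size, not a number from the slides): with tiles of 20 spinorbitals, one six-dimensional tile holds 20^6 = 6.4x10^7 doubles, about 0.5 GB, so keeping only two such buffers per process, the 2*(tilesize)^6 scheme, rather than M of them is what lets the non-iterative step fit in node memory.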
K. Kowalski
Scalability of iterative EOMCC methods
Alternative task schedulers:
• use a "global task pool"
• improve load balancing
• reduce the number of synchronization steps to the absolute minimum
• allow larger tiles to be used effectively
K. Kowalski
Scalability of the non-iterative EOMCC code
94% parallel efficiency using 210,000 cores
[Figure: scalability of the triples part of the CR-EOMCCSD(T) approach for the FBP-f-coronene system in the AVTZ basis set. Timings were determined from calculations on the Jaguar Cray XT5 computer system at NCCS/ORNL in 2011.]
K. Kowalski
MRCC theory in a nutshell

Reference function expanded in the model space M₀:
|Ψ⟩ = Σ_{μ=1}^{M} c_μ |Φ_μ⟩

[Figure: schematic representation of the complete model space corresponding to two active electrons distributed over two active orbitals (red lines). Only determinants with M_S = 0 are included in the model space.]
K. Kowalski
MRCC approaches: main challenges
• Intruder-state/intruder-solution problems
• Complete model space
‣ Huge dimensionality
‣ A large number of superfluous configurations not contributing to a given state
• Overall cost of the MRCC methods
‣ M × N^6 (iterative MRCCSD)
‣ M × N^7 (non-iterative MRCCSD(T))
• Algebraic complexity of the MRCC methods
K. Kowalski
Processor groups (PGs) and reference-level parallelism

For each reference μ, the MRCC residual equation couples its own cluster operator T^(μ) to those of all the other references:

R_μ ≡ F(T^(μ)) + G(T^(1), ..., T^(μ), ..., T^(M)) = 0,   μ = 1, ..., M

The reference-level parallelism can be applied to:
‣ solving the coupled reference-dependent MRCC iterative equations
‣ building efficient parallel schemes for non-iterative MRCC methods
(a processor-group sketch follows)
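A hedged sketch of reference-level parallelism with GA processor groups (the group layout, handles, and the per-reference work routine are illustrative; the ga_pgroup_* calls are from the GA API but should be checked against the installed version): the world is split into one group per reference, and each group solves the equations for its own reference.

      subroutine mrcc_reference_groups(nref, work_on_reference)
      implicit none
#include "global.fh"
      integer nref, me, nproc, per_grp, ig, i, grp
      integer list(4096)
      external work_on_reference
      me    = ga_nodeid()
      nproc = ga_nnodes()
c     split the world into nref processor groups, one per reference
c     (assumes nproc is a multiple of nref)
      per_grp = nproc/nref
      ig = me/per_grp                        ! group joined by this process
      do i = 1, per_grp
         list(i) = ig*per_grp + i - 1        ! ranks belonging to this group
      end do
      grp = ga_pgroup_create(list, per_grp)
      call ga_pgroup_set_default(grp)
c     Global Arrays created from here on live inside the group; each group
c     works on the iterative equations for its own reference (ig + 1)
      call work_on_reference(ig + 1)
      call ga_pgroup_sync(grp)
      end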
Processor groups (PGs) and reference-level parallelism
K. Kowalski
[Scaling plot] Scalability of the BW-MRCCSD method for β-carotene in the 6-31G basis set (470 basis functions); a (4,4) model space of 20 references.
When triple excitations are needed: MRCCSD(T)
K. Kowalski
• Improve the quality of the MRCCSD approaches
• Counteract the intruder-state problem

The effective Hamiltonian is corrected perturbatively for triples:
H^eff(T) = H^eff(SD) + δ(T)

• Numerical complexity ∝ M × N^7
• Scalability ≈ M × (scalability of the CCSD(T) approach)
GPU implementation of the non-iterative part of the MRCCSD(T) approach
K. Kowalski
• ~4x speed-up observed (5x in CCSD(T))
• Ongoing effort towards porting the iterative part of the MRCCSD(T) approach to GPUs
Thanks
• Karol Kowalski, Kiran Bhaskaran-Nair (PNNL)
• Wenjing Ma, Sriram Krishnamoorthy, Jeff Daily, Abhinav Vishnu, Bruce Palmer (PNNL)
• Vinod Tipparaju (AMD)
• Ryan Olson (Cray)
• Jiri Brabec (Czech Ac. Sc.)
• Oreste Villa & Norbert Juffa (NVIDIA)
Acknowledgements
• PNNL eXtreme Scale Computing Initiative
• Dept. of Energy Office of Biological and Environmental Research
• Resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, allocated through the INCITE program
• EMSL computing resources (Chinook HP system)
Thank you

Backup
GPU implementation of the non-iterative part of the MRCCSD(T) approach
K. Kowalski
• Ongoing effort towards porting the iterative part of the MRCCSD(T) approach to GPUs (Oreste Villa & Norbert Juffa, NVIDIA)