CRYSTAL in parallel: replicated and distributed (MPP) data
Roberto Orlando
Dipartimento di Chimica, Università di Torino, Via Pietro Giuria 5, 10125 Torino (Italy)
[email protected]
Torino, September 5th, 2013

Why parallel?
•Faster time to solution;
•More available memory;
•High Performance Computing (HPC) resources are available, and not many codes can run efficiently on thousands of processors.
The programmer's concerns:
•Load imbalance: the time taken is that of the longest job;
•Handling communications: the processors need to talk to each other, and communication is slow;
•Handling Input/Output (I/O): in most cases I/O is slow and should be avoided.
The user's concerns:
•Choose an appropriate number of processors for a job depending on the problem size (mostly determined by the number of basis functions in the unit cell).

Towards large unit cell systems
A CRYSTAL job can be run serially (crystal) or in parallel (Pcrystal, MPPcrystal), using the Message Passing Interface (MPI) for communications.
Pcrystal uses replicated data storage in memory.
MPPcrystal is meant for large unit cell systems on high-performance computers:
•use of parallel linear algebra library routines (ScaLAPACK) for diagonalization, matrix products, Cholesky decomposition;
•enhanced distribution of data in memory among processors.

Running CRYSTAL in parallel
Pcrystal
• full parallelism in the calculation of the interactions (one- and two-electron integrals)
• distribution of tasks in reciprocal space: one k point per processor
• no calls to external libraries
• few inter-process communications
MPPcrystal
• full parallelism in the calculation of the interactions (one- and two-electron integrals)
• double-level distribution of tasks in reciprocal space: one k point to a subset of processors
• use of parallel linear algebra library routines (ScaLAPACK) for diagonalization, matrix products, Cholesky decomposition
• enhanced distribution of data in memory among processors
• many inter-process communications

Pcrystal and MPPcrystal in action
Example of a CRYSTAL calculation: 10,000 basis functions, 16 processors available, 4 k points sampled in reciprocal space.
[Diagram] Pcrystal: all 16 processors are active for the tasks in real space (integrals); for the tasks in reciprocal space each k point (k1-k4) is assigned to a single processor, so only 4 processors are active and the other 12 are idle.
MPPcrystal: all 16 processors are active both for the tasks in real space (integrals) and for the tasks in reciprocal space, where each k point is assigned to a subset of four processors (0-3, 4-7, 8-11, 12-15).

Pcrystal - Implementation
Standard compliant:
•Fortran 90
•MPI for message passing
Replicated data:
•Each k point is independent: each processor performs the linear algebra (FC = EC) for a subset of the k points that the job requires;
•Very few communications (potentially good scaling), but potential load imbalance;
•Each processor holds a complete copy of all the matrices used in the linear algebra;
•The limit on the size of a job is set by the memory required to store the linear algebra matrices for one k point;
•The number of k points limits the number of processors that can be exploited: in general Pcrystal scales very well provided the number of processors ≤ number of k points.

MPPcrystal – Implementation I
Standard compliant:
•Fortran 90
•MPI for message passing
•ScaLAPACK 1.7 (Dongarra et al.) for linear algebra on distributed matrices
•www.netlib.org/scalapack/scalapack_home.html
Distributed data:
•Each processor holds only a part of each of the matrices used in the linear algebra (FC = EC);
•The number of processors that can be exploited is NOT limited by the number of k points (great for large Γ-point-only calculations);
•ScaLAPACK is used e.g. for Cholesky decomposition, matrix-matrix multiplies and linear equation solves;
•Since data are distributed, communications are required to perform the linear algebra;
•However, there are N³ operations but only N² data to communicate.
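The distributed-data idea can be pictured with a short, self-contained Fortran sketch (illustrative only, not CRYSTAL source code): a BLACS process grid is built, NUMROC/DESCINIT define the block-cyclic distribution so that each process allocates only its local piece of the matrix, and PDPOTRF computes the Cholesky factor in parallel. The matrix size, its content and the block factor are arbitrary choices made for the example.

      ! Minimal sketch of ScaLAPACK-style distributed linear algebra
      ! (illustrative only, NOT CRYSTAL source code).
      program distributed_cholesky_sketch
      implicit none
      integer, parameter :: n = 1000, nb = 64   ! global size and block factor (arbitrary)
      integer :: iam, nprocs, ictxt, nprow, npcol, myrow, mycol
      integer :: mloc, nloc, lld, info, i, j
      integer :: desca(9)
      double precision, allocatable :: a(:,:)
      integer, external :: numroc

      ! Build a (nearly) square BLACS process grid over all MPI processes.
      call blacs_pinfo(iam, nprocs)
      nprow = int(sqrt(dble(nprocs)))
      npcol = nprocs/nprow
      call blacs_get(-1, 0, ictxt)
      call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
      call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)
      if (myrow == -1) then                     ! processes left out of the grid
         call blacs_exit(0)
         stop
      end if

      ! Each process stores only its block-cyclic piece of the N x N matrix.
      mloc = numroc(n, nb, myrow, 0, nprow)
      nloc = numroc(n, nb, mycol, 0, npcol)
      lld  = max(1, mloc)
      allocate(a(lld, max(1, nloc)))
      call descinit(desca, n, n, nb, nb, 0, 0, ictxt, lld, info)

      ! Fill a diagonally dominant (hence positive definite) test matrix;
      ! pdelset stores element (i,j) only on the process that owns it.
      do j = 1, n
         do i = 1, n
            if (i == j) then
               call pdelset(a, i, j, desca, dble(n + 1))
            else
               call pdelset(a, i, j, desca, 1.0d0)
            end if
         end do
      end do

      ! Distributed Cholesky factorization: O(N**3) operations,
      ! but only O(N**2) data to hold and communicate.
      call pdpotrf('L', n, a, 1, 1, desca, info)
      if (myrow == 0 .and. mycol == 0) print *, 'pdpotrf info = ', info

      call blacs_gridexit(ictxt)
      call blacs_exit(0)
      end program distributed_cholesky_sketch

Compiled with an MPI Fortran compiler and linked against ScaLAPACK/BLACS (e.g. mpif90 sketch.f90 -lscalapack), the program can be launched on any number of MPI processes; only the processes mapped onto the grid take part in the factorization.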
MPPcrystal – Implementation II
Scaling:
•Scaling gets better for larger systems;
•Very rough rule of thumb: a system with N basis functions can exploit up to around N/20 processors (optimal ratio: N/50);
•A further device used by MPPcrystal is multilevel parallelism: with 4 real k points and 32 processors, each diagonalization is carried out by 8 processors, so each diagonalization only has to scale to a smaller number of processors;
•This is complicated by complex k points;
•Very useful for medium-to-large systems (a big enough problem can scale very well).
Features not implemented in MPPcrystal:
•MPPcrystal fails quickly and cleanly if a requested feature is not implemented, such as:
  symmetry adaptation of the crystalline orbitals (for large high-symmetry systems Pcrystal may be more effective)
  CPHF
  Raman intensities

MCM-41 mesoporous material model
P. Ugliengo, M. Sodupe, F. Musso, I. J. Bush, R. Orlando, R. Dovesi, Advanced Materials 20 (2008).
B3LYP approximation; hexagonal lattice with P1 symmetry; 580 atoms per cell (7800 basis functions).
[Figure: IR spectrum recorded on a micelle-templated silica calcined at 823 K, water outgassed at 423 K (MTS/423 K), compared with the simulated B3LYP spectrum in the 3000-3800 cm-1 region.]
Simulated powder spectrum: no relevant reflections at higher 2θ because of short-range disorder.

MCM-41: increasing the unit cell
R. Orlando, M. Delle Piane, I. J. Bush, P. Ugliengo, M. Ferrabone, R. Dovesi, J. Comput. Chem. 33, 2276 (2012).
Supercells of MCM-41 have been grown along the c crystallographic axis: in Xn the side along c is n times that in X1. X10 contains 77,560 AOs in the unit cell.
Speedup is referred to the 32-core run: Speedup(NC) = 32 × T32/TNC, where TNC is the wall-clock time on NC cores.
[Plot: speedup vs. number of cores (NC) for SCF + total energy gradient calculations.]
Calculations run on the IBM SP6 at Cineca:
•Power6 processors (4.7 GHz) with a peak performance of 101 Tflop/s;
•Infiniband X4 DDR internal network.

MCM-41: scaling of the main steps in MPPcrystal
[Plot (panels for different supercells, among which X4): parallelization efficiency (percentages) of the main steps: two-electron integrals, one-electron integrals, Fock matrix diagonalization, exchange-correlation functional integration, total energy gradient, preliminary steps. Data in parentheses: the amount of time spent in each task.]

Running MCM-41 on different HPC architectures
[Figure: timings for X1 on three machines: IBM Blue Gene/P at Cineca (Bologna), Cray XE6 HECToR (Edinburgh), IBM SP6 at Cineca (Bologna).]

Memory storage optimization
A typical symptom of the old fixed-dimension scheme:
TOO MANY K POINTS IN THE PACK-MONKHORST NET: INCREASE LIM001
Most of the static allocations have been made dynamic:
•array sizes now fit the exact memory requirement;
•no need to recompile the code for large calculations;
•a few remaining fixed limits can be extended from the input:
  CLUSTSIZE (maximum number of atoms in a generated cluster; default setting: the number of atoms in the unit cell)
  LATVEC (maximum number of lattice vectors to be classified; default value: 3500).
Arrays of size Natoms² are distributed among the cores.
Data are removed from memory as soon as they are no longer in use.
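As a schematic illustration of this change (hypothetical names and values, not CRYSTAL's actual source), the Fortran sketch below contrasts the old fixed-dimension storage, which stopped the run with the LIM001 message above, with run-time allocation sized exactly to the Pack-Monkhorst net requested in input.

      ! Sketch of the move from compile-time limits to run-time allocation
      ! (hypothetical names, NOT CRYSTAL source code).
      program kpoint_allocation_sketch
      implicit none
      ! Old scheme (commented out): a fixed compile-time dimension,
      !    integer, parameter :: lim001 = 100      ! illustrative value
      !    double precision   :: xk(3, lim001)
      ! Exceeding it stopped the run with
      !    TOO MANY K POINTS IN THE PACK-MONKHORST NET: INCREASE LIM001
      ! and forced the user to edit the limit and recompile.
      integer :: is1, is2, is3, nk, i1, i2, i3, k
      double precision, allocatable :: xk(:,:)

      ! New scheme: the shrinking factors are known from the input deck,
      ! so the array is allocated to the exact memory requirement.
      is1 = 8; is2 = 8; is3 = 8
      nk = is1*is2*is3                 ! net size before symmetry reduction
      allocate(xk(3, nk))

      ! Simple Gamma-centred net in fractional coordinates (illustration only).
      k = 0
      do i3 = 0, is3 - 1
         do i2 = 0, is2 - 1
            do i1 = 0, is1 - 1
               k = k + 1
               xk(:, k) = (/ dble(i1)/is1, dble(i2)/is2, dble(i3)/is3 /)
            end do
         end do
      end do
      print *, 'k points stored:', nk
      end program kpoint_allocation_sketch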
LOWMEM option
The LOWMEM keyword avoids the allocation of large arrays, generally at the cost of a slight increase in CPU time (it is the default in MPPcrystal):
•atomic orbital pair elements in matrices are located in real time, without storing a large look-up table in memory;
•Fock and density matrices are stored only in their "irreducible" forms; symmetry-related elements are computed in real time;
•the expansion of the AO pair electron density into multipole moments, used in the "bipolar" approximation of the two-electron integrals, is performed in real time instead of storing large buffers in memory;
•information about the grid of points used in the integration of the DFT exchange-correlation functional (Cartesian coordinates of the points, multiplicity, Becke's weights) is distributed among processors.
Dynamically allocated memory can be monitored by means of the MEMOPRT and MEMOPRT2 keywords.

Speeding up two-electron integrals
Two-electron contributions to a Fock matrix element F12g are sums of density matrix elements P34l multiplied by integrals of the type (1_0 2_g | 3_h 4_h+l): atomic orbitals 1 and 2 (in cells 0 and g) describe electron 1, orbitals 3 and 4 (in cells h and h+l) describe electron 2.
Integrals are screened on the basis of the overlap between atomic orbitals: in large unit cells a lot of (3,4) pairs do not overlap with the (1,2g) pair.
Integrals that differ only by a permutation of the atomic orbitals (for instance, exchanging the roles of the (1,2) and (3,4) pairs) are equivalent and are now computed only once. Implemented for P1 symmetry.
[Plot: wall-clock time T (s, 0-120) vs. supercell size Xn (n = 0-10) for the "linearization" and "permutation symmetry" schemes.]

Improved memory storage in Pcrystal
The transformation of the Fock and density matrices into the basis of the Symmetry-Adapted Crystalline Orbitals (SACO), Fg → F(k) → V(k)† F(k) V(k), is operated directly from the "irreducible" Fg to each block of V(k)† F(k) V(k) (one per irreducible representation), without ever forming the full blocks of F(k):
•the maximum size of the matrices to be diagonalized is that of the largest block;
•parallelization goes from k points down to the irreducible representations (many more than the number of k points in highly symmetric cases).

Memory storage for fullerenes of increasing size
(n,n)-fullerenes. NAO: number of basis functions; Sirr: size of the "irreducible" part of the overlap matrix represented in real space (number of matrix elements); Sred: size of the full overlap matrix represented in real space (number of matrix elements).

 n      NAO      Sirr        Sred
 1      840      1759      169980
 2     3360      6707      716130
 3     7560     14570     1609020
 4    13440     25377     2847690
 5    21000     39047     4432140
 6    30240     55661     6362370
 7    41160     75138     8638380
 8    53760     97559    11260170
 9    68040    122843    14227740
10    84000    151071    17541090

Fullerenes: matrix block size in the SACO basis

(n,n)     Ag    Au   F1g   F1u   F2g   F2u    Gg    Gu    Hg    Hu     NAO
(1,1)     10     4    18    24    18    24    28    28    38    32     840
(2,2)     34    22    78    90    78    90   112   112   146   134    3360
(3,3)     72    54   180   198   180   198   252   252   324   306    7560
(4,4)    124   100   324   348   324   348   448   448   572   548   13440
(5,5)    190   160   510   540   510   540   700   700   890   860   21000
(6,6)    270   234   738   774   738   774  1008  1008  1278  1242   30240
(7,7)    364   322  1008  1050  1008  1050  1372  1372  1736  1694   41160
(8,8)    472   424  1320  1368  1320  1368  1792  1792  2264  2216   53760
(9,9)    594   540  1674  1728  1674  1728  2268  2268  2862  2808   68040
(10,10)  730   670  2070  2130  2070  2130  2800  2800  3530  3470   84000

tSCF: wall-clock time (in seconds) for running 20 SCF cycles on a single core; reported values: 316, 6565, 20054, 47412, 87445.
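The gain can be made concrete with a small Fortran sketch (schematic, not CRYSTAL's implementation): in the SACO basis the matrix to be diagonalized is block diagonal over the irreducible representations, so each block is handled independently by a standard LAPACK call and the largest matrix ever touched is the largest block, e.g. 1736 (Hg) instead of 41160 basis functions for the (7,7) fullerene. The block sizes below are taken from the (7,7) row of the table; the matrix entries are dummy values.

      ! Schematic block-by-block diagonalization in a symmetry-adapted basis
      ! (illustrative only, NOT CRYSTAL source code).
      program saco_block_diagonalization_sketch
      implicit none
      integer, parameter :: nirrep = 3
      integer :: nb(nirrep), ib, n, lwork, info, i, j
      double precision, allocatable :: fblock(:,:), eig(:), work(:)

      nb = (/ 364, 1008, 1736 /)   ! Ag, F1g, Hg block sizes for the (7,7) fullerene

      do ib = 1, nirrep
         n = nb(ib)
         lwork = 3*n
         allocate(fblock(n, n), eig(n), work(lwork))
         ! Dummy symmetric block; in the real code this would be the
         ! irreducible-representation block of V(k)^dagger F(k) V(k).
         do j = 1, n
            do i = 1, n
               fblock(i, j) = 1.0d0/dble(i + j)
            end do
         end do
         ! Each block is diagonalized on its own: the cost scales with the
         ! cube of the block size, never with the full number of AOs.
         call dsyev('V', 'U', n, fblock, n, eig, work, lwork, info)
         print *, 'irrep block', ib, ' size', n, ' lowest eigenvalue', eig(1)
         deallocate(fblock, eig, work)
      end do
      end program saco_block_diagonalization_sketch

Because the blocks are independent, they can also be farmed out to different processors, which is the idea behind moving task farming in Pcrystal from the k-point level to that of the irreducible representations (see the Conclusions below).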
Conclusions
CRYSTAL:
•can be run in parallel on a large number of processors efficiently, with very good scalability;
•is portable to different HPC platforms;
•allowed the calculation of the total energy and wavefunction of MCM-41 X14, containing more than 100,000 basis functions (8000 atoms), on 2048 processors;
•has been improved as concerns data storage in memory;
•has been made more efficient in the calculation of the Coulomb and exchange series;
•memory storage for highly symmetric cases has been drastically reduced by extending the use of SACOs to all steps in reciprocal space;
•task farming in Pcrystal will soon be moved from the k-point level to that of the irreducible representations.