December 10, 2001.
Large-Scale Parallel FEM Computation by GeoFEM on
128 Nodes of Hitachi SR8000/MPP: 3D Linear Elastic
Problem with >800M DOF (335 GFLOPS Performance)
Kengo Nakajima(1) and Hiroshi Okuda(2)
(1) Department of Computational Earth Sciences, Research Organization for Information Science and Technology (RIST), Tokyo, Japan (e-mail: [email protected], phone: +81-3-3712-5321, fax: +81-3-3712-5552)
(2) Department of Quantum Engineering and Systems Science, The University of Tokyo, Tokyo, Japan (e-mail: [email protected], phone: +81-3-5841-7426, fax: +81-3-3818-3455)
An efficient parallel iterative method for unstructured grids, developed by the authors for shared-memory symmetric multiprocessor (SMP) cluster architectures on the GeoFEM platform, is presented. The method is based on a 3-level hybrid parallel programming model: message passing for inter-SMP-node communication, OpenMP directives for intra-SMP-node parallelization, and vectorization for each processing element (PE). Simple 3D linear elastic problems with more than 8.0×10⁸ DOF have been solved by localized 3×3 block ICCG(0) with additive Schwarz domain decomposition and PDJDS/CM-RCM (Parallel Descending-order Jagged Diagonal Storage / Cyclic Multicolor-Reverse Cuthill-McKee) reordering on 128 SMP nodes of the Hitachi SR8000/MPP parallel computer at the University of Tokyo, achieving a performance of 335 GFLOPS (18.6% of the peak). The PDJDS/CM-RCM reordering provides excellent vector and parallel performance within SMP nodes and is essential for parallelizing the forward/backward substitution of IC/ILU factorization, which has global data dependency.
[Reference] K. Nakajima and H. Okuda, "Parallel Iterative Solvers for Unstructured Grids using a Directive/MPI Hybrid Programming Model for the GeoFEM Platform on SMP Cluster Architectures", GeoFEM 2001-003, RIST, Japan, October 2001 (http://geofem.tokyo.rist.or.jp/report_en/2001_003.html).
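To make the 3-level model and the role of the reordering concrete, the following is a minimal C sketch, not the GeoFEM implementation: it assumes the rows owned by one SMP node have already been reordered by CM-RCM so that rows of the same color never reference each other in the lower triangular factor, and it uses a simple CRS layout instead of PDJDS. Colors are processed sequentially; within a color the rows are independent, so the loop over them can be distributed across the OpenMP threads of the node (and, in the actual PDJDS layout, vectorized on each PE). The inter-node level (MPI halo exchange and reductions) is omitted here.

    /*
     * Minimal sketch (not the GeoFEM code) of multicolored forward substitution
     * for the lower triangular factor of IC(0)/ILU(0).  Assumption: the rows on
     * this SMP node were reordered by CM-RCM, so rows of the same color have no
     * lower-triangular couplings to each other.
     */
    #include <omp.h>

    void forward_substitution(
        int n_colors,
        const int *color_ptr,   /* color c owns rows color_ptr[c] .. color_ptr[c+1]-1 */
        const int *row_ptr,     /* CRS pointers of the strictly lower triangle        */
        const int *col_idx,     /* column indices; all refer to earlier colors        */
        const double *val,      /* off-diagonal entries of the factor                 */
        const double *diag_inv, /* inverted diagonal of the factor                    */
        const double *r,        /* right-hand side                                    */
        double *z)              /* result of solving L z = r                          */
    {
        for (int c = 0; c < n_colors; c++) {       /* colors must stay sequential     */
    #pragma omp parallel for                       /* level 2: intra-node threads     */
            for (int i = color_ptr[c]; i < color_ptr[c + 1]; i++) {
                double sum = r[i];
                for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                    sum -= val[k] * z[col_idx[k]]; /* z[...] from earlier colors only */
                z[i] = sum * diag_inv[i];
            }
        }
    }

Without the coloring, row i would in general depend on row i-1, so the loop over i could be neither threaded nor vectorized; this is the global data dependency referred to above.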
Hardware System
• Name: Hitachi SR8000/MPP
• Location: Information Technology Center, the University of Tokyo
• Processors: 128 SMP nodes. Each SMP node consists of 8 processing elements (PEs); the total PE number is 1,024. Peak performance of each SMP node is 14.4 GFLOPS; peak performance of the entire system is 1.8 TFLOPS.
• Memory: 16 GB for each SMP node; 2 TB for the entire system.

Fig. 1 Problem definition and boundary conditions for the 3D solid mechanics example cases. Linear elastic problem with homogeneous material property. Each element is a cube with unit edge length; the problem has 3×Nx×Ny×Nz DOF in total. Boundary conditions shown in the figure: uniform distributed force in the z-direction @ z=Zmin, Ux=0 @ x=Xmin, Uy=0 @ y=Ymin, Uz=0 @ z=Zmin.
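As a concrete instance of the DOF count in Fig. 1 (this is the per-node mesh size used for the measurements in Fig. 3): Nx = Ny = Nz = 128 gives 3 × 128 × 128 × 128 = 6,291,456 DOF on one SMP node.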
Fig. 2 Parallel FEM computation on an SMP cluster architecture. The analysis domain is partitioned so that each partition corresponds to one SMP node (Node-0 to Node-3 in the figure); each SMP node consists of multiple PEs sharing a local memory.
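Figure 2 is also where the two outer levels of the hybrid model meet: each partition is one MPI process (one SMP node), and the PEs inside the node are OpenMP threads sharing the node's memory. The following is a hedged sketch, with illustrative names rather than GeoFEM's routines, of how a global quantity of the CG iteration (here a dot product) is formed across this hierarchy.

    /*
     * Sketch of the inter-node level of the hybrid model: each MPI process
     * is one SMP node, its threads (the PEs) compute a local partial sum,
     * and MPI combines the partial sums across nodes.  Not GeoFEM's code;
     * names are illustrative.
     */
    #include <mpi.h>
    #include <omp.h>

    double parallel_dot(MPI_Comm comm, int n_local,
                        const double *x, const double *y)
    {
        double local_sum = 0.0, global_sum = 0.0;

    #pragma omp parallel for reduction(+:local_sum)  /* level 2: threads in the node */
        for (int i = 0; i < n_local; i++)
            local_sum += x[i] * y[i];

        /* level 1: reduction over all SMP nodes */
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global_sum;
    }

Matrix-vector products additionally require exchanging the boundary (halo) values owned by neighboring partitions, which is the inter-SMP-node message passing mentioned in the abstract.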
Fig. 3 SMP node number and GFLOPS rate on the Hitachi SR8000/MPP (log-log plot of GFLOPS against SMP node #). The problem size per node is fixed at 128³ finite element nodes = 3×128³ = 6,291,456 DOF. Labeled data points: 16 SMP nodes, 100,663,296 DOF, 42.4 GFLOPS; 128 SMP nodes (1,024 PEs), 805,306,368 DOF, 335.2 GFLOPS. The latter is the largest case and the maximum measured performance (peak performance of the system = 1.8 TFLOPS).
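The efficiency quoted in the abstract follows directly from these numbers: 128 nodes × 6,291,456 DOF/node = 805,306,368 DOF, and 335.2 GFLOPS / 1,800 GFLOPS ≈ 18.6% of the 1.8 TFLOPS system peak (about 2.6 GFLOPS sustained per SMP node, whose peak is 14.4 GFLOPS).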
Fig. 4 Problem size vs. GFLOPS rate and iterations for convergence (ε = 10⁻⁸) for various problem sizes on 128 SMP nodes of the Hitachi SR8000/MPP (log-log plot; vertical axis: GFLOPS / iterations / work ratio (%), horizontal axis: DOF). Black circles: GFLOPS rate; white circles: iterations for convergence; white triangles: parallel work ratio among SMP nodes. The largest case is 1,610,612,736 DOF on 128 SMP nodes (1,024 PEs). Maximum performance was 335 GFLOPS (peak performance = 1.8 TFLOPS). The work ratio is more than 95% when the problem size is sufficiently large.
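The tolerance ε = 10⁻⁸ presumably refers to the usual relative residual stopping criterion of the preconditioned CG iteration, i.e. iterating until ||b − A x_k|| / ||b|| ≤ 10⁻⁸; this exact form is an assumption here rather than something stated above.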