December 10, 2001.
Large-Scale Parallel FEM Computation by GeoFEM on
128 Nodes of Hitachi SR8000/MPP: 3D Linear Elastic
Problem with >800M DOF (335 GFLOPS Performance)
Kengo Nakajima(1) and Hiroshi Okuda(2)
(1) Department of Computational Earth Sciences, Research Organization for Information Science and Technology (RIST), Tokyo, Japan (e-mail: [email protected], phone: +81-3-3712-5321, fax: +81-3-3712-5552)
(2) Department of Quantum Engineering and Systems Science, The University of Tokyo, Tokyo, Japan (e-mail: [email protected], phone: +81-3-5841-7426, fax: +81-3-3818-3455)
An efficient parallel iterative method for unstructured grids, developed by the authors for shared-memory symmetric multiprocessor (SMP) cluster architectures on the GeoFEM platform, is presented. The method is based on a 3-level hybrid parallel programming model: message passing for inter-SMP-node communication, OpenMP directives for intra-SMP-node parallelization, and vectorization for each processing element (PE). Simple 3D linear elastic problems with more than 8.0×10⁸ DOF have been solved by localized 3×3 block ICCG(0) with additive Schwarz domain decomposition and PDJDS/CM-RCM (Parallel Descending-order Jagged Diagonal Storage / Cyclic Multicolor-Reverse Cuthill-McKee) reordering on 128 SMP nodes of the Hitachi SR8000/MPP parallel computer at the University of Tokyo, achieving a performance of 335 GFLOPS (18.6% of the peak). The PDJDS/CM-RCM reordering provides excellent vector and parallel performance within SMP nodes and is essential for parallelizing the forward/backward substitution of IC/ILU factorization, which has global data dependency.
[Reference] K. Nakajima and H. Okuda, "Parallel Iterative Solvers for Unstructured Grids using a Directive/MPI Hybrid Programming Model for the GeoFEM Platform on SMP Cluster Architectures", GeoFEM 2001-003, RIST, Japan, October 2001 (http://geofem.tokyo.rist.or.jp/report_en/2001_003.html).
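To make the 3-level model and the role of the reordering concrete, the following is a minimal C sketch, not the GeoFEM implementation: it assumes the rows owned by one SMP node have already been reordered by CM-RCM so that rows of the same color never reference each other in the lower triangular factor, and it uses a simple CRS layout instead of PDJDS. Colors are processed sequentially; within a color the rows are independent, so the loop over them can be distributed across the OpenMP threads of the node (and, in the actual PDJDS layout, vectorized on each PE). The inter-node level (MPI halo exchange and reductions) is omitted here.

    /*
     * Minimal sketch (not the GeoFEM code) of multicolored forward substitution
     * for the lower triangular factor of IC(0)/ILU(0).  Assumption: the rows on
     * this SMP node were reordered by CM-RCM, so rows of the same color have no
     * lower-triangular couplings to each other.
     */
    #include <omp.h>

    void forward_substitution(
        int n_colors,
        const int *color_ptr,   /* color c owns rows color_ptr[c] .. color_ptr[c+1]-1 */
        const int *row_ptr,     /* CRS pointers of the strictly lower triangle        */
        const int *col_idx,     /* column indices; all refer to earlier colors        */
        const double *val,      /* off-diagonal entries of the factor                 */
        const double *diag_inv, /* inverted diagonal of the factor                    */
        const double *r,        /* right-hand side                                    */
        double *z)              /* result of solving L z = r                          */
    {
        for (int c = 0; c < n_colors; c++) {       /* colors must stay sequential     */
    #pragma omp parallel for                       /* level 2: intra-node threads     */
            for (int i = color_ptr[c]; i < color_ptr[c + 1]; i++) {
                double sum = r[i];
                for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                    sum -= val[k] * z[col_idx[k]]; /* z[...] from earlier colors only */
                z[i] = sum * diag_inv[i];
            }
        }
    }

Without the coloring, row i would in general depend on row i-1, so the loop over i could be neither threaded nor vectorized; this is the global data dependency referred to above.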
Hardware System
• Name: Hitachi SR8000/MPP
• Location: Information Technology Center, the University of Tokyo
• Processors: 128 SMP nodes. Each SMP node consists of 8 processing elements (PEs); the total PE number is 1,024. Peak performance of each SMP node is 14.4 GFLOPS; peak performance of the entire system is 1.8 TFLOPS.
• Memory: 16 GB for each SMP node; 2 TB for the entire system.

Fig. 1 Problem definition and boundary conditions for the 3D solid mechanics example cases. Linear elastic problem with homogeneous material property. Each element is a cube with unit edge length; the problem has 3×Nx×Ny×Nz DOF in total. Boundary conditions shown in the figure: uniform distributed force in the z-direction @ z=Zmin, Ux=0 @ x=Xmin, Uy=0 @ y=Ymin, Uz=0 @ z=Zmin.
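As a concrete instance of the DOF count in Fig. 1 (this is the per-node mesh size used for the measurements in Fig. 3): Nx = Ny = Nz = 128 gives 3 × 128 × 128 × 128 = 6,291,456 DOF on one SMP node.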
Fig. 2 Parallel FEM computation on an SMP cluster architecture. The analysis domain is partitioned so that each partition corresponds to one SMP node (Node-0 to Node-3 in the figure); each SMP node consists of multiple PEs sharing a local memory.
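Figure 2 is also where the two outer levels of the hybrid model meet: each partition is one MPI process (one SMP node), and the PEs inside the node are OpenMP threads sharing the node's memory. The following is a hedged sketch, with illustrative names rather than GeoFEM's routines, of how a global quantity of the CG iteration (here a dot product) is formed across this hierarchy.

    /*
     * Sketch of the inter-node level of the hybrid model: each MPI process
     * is one SMP node, its threads (the PEs) compute a local partial sum,
     * and MPI combines the partial sums across nodes.  Not GeoFEM's code;
     * names are illustrative.
     */
    #include <mpi.h>
    #include <omp.h>

    double parallel_dot(MPI_Comm comm, int n_local,
                        const double *x, const double *y)
    {
        double local_sum = 0.0, global_sum = 0.0;

    #pragma omp parallel for reduction(+:local_sum)  /* level 2: threads in the node */
        for (int i = 0; i < n_local; i++)
            local_sum += x[i] * y[i];

        /* level 1: reduction over all SMP nodes */
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global_sum;
    }

Matrix-vector products additionally require exchanging the boundary (halo) values owned by neighboring partitions, which is the inter-SMP-node message passing mentioned in the abstract.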
Fig. 3 SMP node number and GFLOPS rate on the Hitachi SR8000/MPP (log-log plot of GFLOPS against SMP node #). The problem size per node is fixed at 128³ finite element nodes = 3×128³ = 6,291,456 DOF. Labeled data points: 16 SMP nodes, 100,663,296 DOF, 42.4 GFLOPS; 128 SMP nodes (1,024 PEs), 805,306,368 DOF, 335.2 GFLOPS. The latter is the largest case and the maximum measured performance (peak performance of the system = 1.8 TFLOPS).
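The efficiency quoted in the abstract follows directly from these numbers: 128 nodes × 6,291,456 DOF/node = 805,306,368 DOF, and 335.2 GFLOPS / 1,800 GFLOPS ≈ 18.6% of the 1.8 TFLOPS system peak (about 2.6 GFLOPS sustained per SMP node, whose peak is 14.4 GFLOPS).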
Fig. 4 Problem size vs. GFLOPS rate and iterations for convergence (ε = 10⁻⁸) for various problem sizes on 128 SMP nodes of the Hitachi SR8000/MPP (log-log plot; vertical axis: GFLOPS / iterations / work ratio (%), horizontal axis: DOF). Black circles: GFLOPS rate; white circles: iterations for convergence; white triangles: parallel work ratio among SMP nodes. The largest case is 1,610,612,736 DOF on 128 SMP nodes (1,024 PEs). Maximum performance was 335 GFLOPS (peak performance = 1.8 TFLOPS). The work ratio is more than 95% when the problem size is sufficiently large.
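The tolerance ε = 10⁻⁸ presumably refers to the usual relative residual stopping criterion of the preconditioned CG iteration, i.e. iterating until ||b − A x_k|| / ||b|| ≤ 10⁻⁸; this exact form is an assumption here rather than something stated above.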