Library for Accelerated Math Applications
for Heterogeneous HPC Applications
The Library
Overview
LAMA is a framework for building efficient, extensible and flexible solvers for
sparse linear systems and for application domains that involve sparse and dense
numerical linear algebra. It supports heterogeneous shared- and distributed-memory
compute architectures, including various accelerators.
LAMA addresses users working on linear algebra with huge sparse matrices on
HPC clusters within heterogeneous environments. Possible use cases for LAMA
include solving elliptic partial differential equations, image processing and
generic BLAS applications. LAMA’s code portability increases your productivity
without affecting the numerical stability of the application.
Features
BLAS operations: Vec • Vec, Mat • Vec, Mat • Mat
Target Backends: Multicore CPU, GPU, Intel® Xeon Phi
Programming APIs: OpenMP, CUDA, OpenCL
Internode Communication: MPI, GPI2
Distributions: Block, Cyclic, Blockcyclic, Metis
Matrix Formats: CSR, ELL, JDS, COO, DIA, dense
Iterative Linear Solvers: Jacobi, SOR, (Bi)CG, GMRES, AMG
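To make two of the listed matrix formats concrete, the following is a minimal sketch of CSR and ELL storage and of the sparse matrix-vector product on each. It is not LAMA code: the struct and function names (`CsrMatrix`, `EllMatrix`, `toEll`, `multiply`) are hypothetical, and the column-major ELL layout is an assumption chosen because it is the layout commonly used on GPUs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration types, not LAMA classes.
// CSR: row r owns the entries rowPtr[r] .. rowPtr[r + 1] - 1.
struct CsrMatrix {
    std::size_t numRows = 0;
    std::vector<std::size_t> rowPtr;  // size numRows + 1
    std::vector<std::size_t> colIdx;  // size nnz
    std::vector<double> values;       // size nnz
};

// ELL: every row is padded to the same width. Storage is column-major
// (assumption): entry k of row r lives at index k * numRows + r.
struct EllMatrix {
    std::size_t numRows = 0;
    std::size_t width = 0;
    std::vector<std::size_t> colIdx;  // size numRows * width
    std::vector<double> values;       // padded slots hold 0.0
};

// y = A * x for a CSR matrix.
std::vector<double> multiply(const CsrMatrix& a, const std::vector<double>& x) {
    assert(a.rowPtr.size() == a.numRows + 1);
    std::vector<double> y(a.numRows, 0.0);
    for (std::size_t r = 0; r < a.numRows; ++r)
        for (std::size_t k = a.rowPtr[r]; k < a.rowPtr[r + 1]; ++k)
            y[r] += a.values[k] * x[a.colIdx[k]];
    return y;
}

// Convert CSR to ELL by padding each row to the longest row length.
EllMatrix toEll(const CsrMatrix& a) {
    assert(a.rowPtr.size() == a.numRows + 1);
    EllMatrix e;
    e.numRows = a.numRows;
    for (std::size_t r = 0; r < a.numRows; ++r)
        e.width = std::max(e.width, a.rowPtr[r + 1] - a.rowPtr[r]);
    e.colIdx.assign(e.numRows * e.width, 0);    // padding refers to column 0
    e.values.assign(e.numRows * e.width, 0.0);  // and contributes 0.0
    for (std::size_t r = 0; r < a.numRows; ++r)
        for (std::size_t k = a.rowPtr[r]; k < a.rowPtr[r + 1]; ++k) {
            const std::size_t slot = (k - a.rowPtr[r]) * e.numRows + r;
            e.colIdx[slot] = a.colIdx[k];
            e.values[slot] = a.values[k];
        }
    return e;
}

// y = A * x for an ELL matrix; the inner loop runs over consecutive rows.
std::vector<double> multiply(const EllMatrix& a, const std::vector<double>& x) {
    assert(a.values.size() == a.numRows * a.width);
    std::vector<double> y(a.numRows, 0.0);
    for (std::size_t k = 0; k < a.width; ++k)
        for (std::size_t r = 0; r < a.numRows; ++r)
            y[r] += a.values[k * a.numRows + r] * x[a.colIdx[k * a.numRows + r]];
    return y;
}
```

The regular, padded layout of ELL is one plausible reason why the benchmark footnote pairs CSR with the CPU and ELL with the GPU; the padding costs memory when row lengths vary strongly.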
Configuration Sample

Matrix configuration (porting to GPU):
    ELL {
        localContext = CUDA;
        HaloContext = CUDA;
        RowDistribution = Block( MPI );
        ColDistribution = Cyclic( MPI, 10 );
    }

Vector configurations rhs.dsl and sol.dsl (distributing with MPI):
    DenseVector {
        initialValue = 0.0;
        distribution = Block( MPI );
    }

    DenseVector {
        initialValue = 1.0;
        distribution = Cyclic( MPI, 10 );
    }

solver.dsl (nesting solvers):
    Logger log ( Solver, completeInformation, toConsoleOnly, Timer );
    SimpleAMG amg {
        logger = log;
        MaxLevels = 25;
        minVarsCoarseLevel = 200;
        hostOnlyLevel = 6;
        replicatedLevel = 8;
    }
    Criterion myStoppingCrit1 = ( IterationCount( 100 ) OR
        ( ResidualThreshold( L2Norm, 1e-14, Relative ) OR
          ResidualThreshold( L2Norm, 1e-12, Absolute ) ) );
    CG root {
        logger = log;
        preconditioner = amg;
        stoppingCriterion = myStoppingCrit1;
    }

Code Samples

    // load the matrix and the vectors together with their configurations
    MetaMatrix A( "symMatrix.mtx", "matrix.dsl" );
    DenseVector b( "rhs.dsl" );
    DenseVector x( "sol.dsl" );

    // use text-book-syntax
    x = A * b;

    // use solvers
    MetaSolver metaSolver( "MetaSolver", "solver.dsl" );
    metaSolver.initialize( A );
    metaSolver.solve( x, b );

Benchmarks

[Figure: Single-Node Performance* (in comparison). Performance in 10^9 nnz/sec of
LAMA_CPU, Paralution_CPU, LAMA_GPU and Paralution_GPU for the matrices Flan_1565,
cage15, ecology1, audikw, Geo_1438, dielFilterV2real, RM07R, nlpkkt80, Ga41As41H72,
Freescale1 and Hamrle3.]

[Figure: Multi-Node Performance* (weak scaling of a CG solver) for the 3D7P Poisson
equation. Performance in 10^9 nnz/node/sec on 1, 2, 4 and 8 nodes, for CPU and GPU,
each with synchronous and asynchronous communication.]

* All values are measured with double precision; formats: CPU: CSR, GPU: ELL.
CPU: Intel® Xeon E5-2670; GPU: Nvidia® K20m (ECC mode on, persistence mode on).
LAMA 1.0.0 (Alpamayo) built with gcc-4.4.7, boost-1.5.5, CUDA 5.5, OpenMPI 1.5.4.
Paralution (0.6.1): www.paralution.com

Release Information
LAMA is available under the MIT License, a free software license originating at the
Massachusetts Institute of Technology.
Our first release, LAMA Alpamayo, is available on SourceForge.net. To get the latest
updates or experimental features, you can check out our Git branches.

Architecture
LAMA’s software stack consists of backends for the compute locations, which encapsulate
the different programming APIs, a C++ core implementation and a solver framework. The
core provides all key features, such as the text-book syntax, memory management and
implicit parallelization, which are used to build the solvers on the next level. User
applications can interface directly with the LAMA API or with middleware using LAMA,
e.g. OpenFOAM.

Easy-to-use text-book syntax, supporting multiple matrix structures, with implicit
multinode parallelization, in heterogeneous environments and various numerical requirements.
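The text-book syntax from the code samples (`x = A * b`) can be illustrated with plain C++ operator overloading. This is only a sketch of the idea under simplifying assumptions (dense storage, eager evaluation); it does not show how LAMA implements its expressions, and `SketchMatrix` and `SketchVector` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only; these are not LAMA classes.
struct SketchVector {
    std::vector<double> values;
};

struct SketchMatrix {
    std::size_t numRows;
    std::size_t numCols;
    std::vector<double> values;  // dense, row-major: (i, j) at i * numCols + j
};

// Matrix-vector product, so that callers can write y = A * x.
SketchVector operator*(const SketchMatrix& a, const SketchVector& x) {
    assert(a.values.size() == a.numRows * a.numCols);
    assert(x.values.size() == a.numCols);
    SketchVector y;
    y.values.assign(a.numRows, 0.0);
    for (std::size_t i = 0; i < a.numRows; ++i)
        for (std::size_t j = 0; j < a.numCols; ++j)
            y.values[i] += a.values[i * a.numCols + j] * x.values[j];
    return y;
}

// Element-wise difference, so that a residual reads r = b - A * x.
SketchVector operator-(const SketchVector& lhs, const SketchVector& rhs) {
    assert(lhs.values.size() == rhs.values.size());
    SketchVector result;
    result.values.resize(lhs.values.size());
    for (std::size_t i = 0; i < lhs.values.size(); ++i)
        result.values[i] = lhs.values[i] - rhs.values[i];
    return result;
}
```

With these two operators, a residual can be written as `r = b - A * x`, which reads like the formula it implements. A production library would typically avoid the temporary created by `A * x`; that optimization is left out here.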
LAMA provides access to general BLAS functions and a wide set of iterative solvers
through a C++ interface. Users benefit from being able to run their applications
flexibly on various platforms.
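As an illustration of what such an iterative solver does, here is a minimal, unpreconditioned conjugate gradient sketch in standard C++. Its stopping rule mirrors the combined criterion of the configuration sample (iteration count OR relative residual OR absolute residual, all in the L2 norm). The names `LinearOperator`, `StoppingCriterion` and `conjugateGradient` are hypothetical and not part of the LAMA API, and the matrix is assumed to be symmetric positive definite.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
// Hypothetical matrix-free operator: returns A * v for a given v.
using LinearOperator = std::function<Vec(const Vec&)>;

double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Stop when any of the three conditions holds.
struct StoppingCriterion {
    std::size_t maxIterations;
    double relativeTolerance;  // compared against ||r|| / ||r0||
    double absoluteTolerance;  // compared against ||r||

    bool isSatisfied(std::size_t iteration, double residualNorm,
                     double initialNorm) const {
        return iteration >= maxIterations
            || residualNorm <= relativeTolerance * initialNorm
            || residualNorm <= absoluteTolerance;
    }
};

// Solves A x = b for symmetric positive definite A, starting from the
// values in x. Returns the number of iterations performed.
std::size_t conjugateGradient(const LinearOperator& applyA, const Vec& b,
                              Vec& x, const StoppingCriterion& stop) {
    assert(b.size() == x.size());
    const std::size_t n = b.size();
    Vec r = b;
    const Vec ax = applyA(x);
    for (std::size_t i = 0; i < n; ++i) r[i] -= ax[i];  // r = b - A x
    Vec p = r;
    double rr = dot(r, r);
    const double initialNorm = std::sqrt(rr);
    std::size_t iteration = 0;
    while (!stop.isSatisfied(iteration, std::sqrt(rr), initialNorm)) {
        const Vec ap = applyA(p);
        const double alpha = rr / dot(p, ap);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * ap[i];
        }
        const double rrNew = dot(r, r);
        const double beta = rrNew / rr;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rrNew;
        ++iteration;
    }
    return iteration;
}
```

Passing the operator as a callable keeps the solver independent of the matrix format, which is the same separation the poster describes between solvers and matrix formats.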
LAMA takes care of memory usage and synchronisation, and achieves efficient
parallelization through asynchronous data transfers and communication. Calculations
are highly optimized at kernel level for each backend, accelerating different
sparse matrix formats. Solvers, distributions and matrix formats are interchangeable.
Users can switch between compute locations, e.g. GPU or Intel® Xeon Phi, through a DSL.
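The difference between the block and cyclic distributions named in the configuration sample can be sketched as a mapping from a global row index to the owning process. The sketch below is plain index arithmetic without MPI; the function names are hypothetical, and it assumes that the second argument in `Cyclic( MPI, 10 )` is a chunk size, which the poster does not state explicitly.

```cpp
#include <cassert>
#include <cstddef>

// Block distribution: each process owns one contiguous range of
// ceil(globalSize / numProcs) indices (the last range may be shorter).
std::size_t blockOwner(std::size_t globalIndex, std::size_t globalSize,
                       std::size_t numProcs) {
    assert(numProcs > 0 && globalIndex < globalSize);
    const std::size_t blockSize = (globalSize + numProcs - 1) / numProcs;
    return globalIndex / blockSize;
}

// Cyclic distribution with chunks (assumed meaning of Cyclic( MPI, 10 )):
// chunks of chunkSize consecutive indices are dealt out round-robin.
std::size_t cyclicOwner(std::size_t globalIndex, std::size_t chunkSize,
                        std::size_t numProcs) {
    assert(numProcs > 0 && chunkSize > 0);
    return (globalIndex / chunkSize) % numProcs;
}
```

For 100 rows on 4 processes, the block distribution assigns rows 0 to 24 to process 0, while the cyclic distribution with chunk size 10 assigns rows 0 to 9, 40 to 49 and 80 to 89 to process 0. Interleaving of this kind can balance load when the work per row varies along the matrix.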
Future Work
Due to its easily extensible design, LAMA will be extended with more backends,
more solvers and upcoming sparse matrix formats and distribution patterns. Recent
work includes an FPGA backend, automatic distribution strategies and interfaces
to other programming languages such as C and Python. LAMA will be integrated into
high-level environments such as MATLAB or OpenFOAM.
Website: www.libama.org
Hosted on: sourceforge.net/projects/libama