LAMA – Library for Accelerated Math Applications for Heterogeneous HPC Applications

Overview

LAMA is a framework for building efficient, extensible and flexible solvers for sparse linear systems and for application domains that involve – sparse and dense – numerical linear algebra. It supports heterogeneous shared- and distributed-memory compute architectures, including various accelerators. LAMA addresses users working on linear algebra with huge sparse matrices on HPC clusters in heterogeneous environments. Possible use cases for LAMA include solving elliptic partial differential equations, image processing and generic BLAS applications. LAMA's code portability extends your productivity without affecting the numerical stability of the application.

Features

BLAS operations:           Vec • Vec, Mat • Vec, Mat • Mat
Target backends:           multicore CPU, GPU, Intel® Xeon Phi
Programming APIs:          OpenMP, CUDA, OpenCL
Internode communication:   MPI, GPI-2
Distributions:             Block, Cyclic, Block-cyclic, Metis
Matrix formats:            CSR, ELL, JDS, COO, DIA, dense
Iterative linear solvers:  Jacobi, SOR, (Bi)CG, GMRES, AMG

Benchmarks

Single-Node Performance* (in comparison with Paralution): bar chart of throughput in 10^9 nnz/sec for LAMA_CPU, Paralution_CPU, LAMA_GPU and Paralution_GPU on the matrices Flan_1565, cage15, ecology1, audikw, Geo_1438, dielFilterV2real, RM07R, nlpkkt80, Ga41As41H72, Freescale1 and Hamrle3.

Multi-Node Performance* (weak scaling of a CG solver, 3D 7-point Poisson equation): throughput in 10^9 nnz/node/sec on 1, 2, 4 and 8 nodes, comparing CPU and GPU runs with synchronous and asynchronous communication.

* All values are measured in double precision; matrix formats: CPU › CSR, GPU › ELL. CPU: Intel® Xeon E5-2670; GPU: Nvidia® K20m (ECC mode on, persistence mode on). LAMA 1.0.0 (Alpamayo) built with gcc 4.4.7, boost 1.55, CUDA 5.5, OpenMPI 1.5.4. Comparison: Paralution 0.6.1, www.paralution.com.

Release Information

LAMA is available under the MIT License, a free software license originating at the Massachusetts Institute of Technology. Our first release, LAMA 1.0.0 "Alpamayo", is available on SourceForge.net. To get the latest updates or experimental features, you can check out our Git branches.

Configuration Sample

matrix.dsl – porting to the GPU and distributing with MPI:

    ELL {
        localContext    = CUDA;
        HaloContext     = CUDA;
        RowDistribution = Block( MPI );
        ColDistribution = Cyclic( MPI, 10 );
    }

rhs.dsl – the right-hand side, distributed with MPI:

    DenseVector {
        initialValue = 1.0;
        distribution = Cyclic( MPI, 10 );
    }

sol.dsl – the solution vector:

    DenseVector {
        initialValue = 0.0;
        distribution = Block( MPI );
    }

solver.dsl – nesting solvers and combining various numerical requirements:

    Logger log ( Solver, completeInformation, toConsoleOnly, Timer );

    SimpleAMG amg {
        logger             = log;
        MaxLevels          = 25;
        minVarsCoarseLevel = 200;
        hostOnlyLevel      = 6;
        replicatedLevel    = 8;
    }

    Criterion myStoppingCrit1 = ( IterationCount( 100 ) OR
        ( ResidualThreshold( L2Norm, 1e-14, Relative ) OR
          ResidualThreshold( L2Norm, 1e-12, Absolute ) ) );

    CG root {
        logger            = log;
        preconditioner    = amg;
        stoppingCriterion = myStoppingCrit1;
    }

Code Samples

Reading the configuration files from above yields a matrix and vectors supporting multiple matrix structures, implicit multi-node parallelization and execution in heterogeneous environments:

    MetaMatrix A( "symMatrix.mtx", "matrix.dsl" );
    DenseVector b( "rhs.dsl" );
    DenseVector x( "sol.dsl" );

    // use text-book syntax
    x = A * b;

    // use solvers
    MetaSolver metaSolver( "MetaSolver", "solver.dsl" );
    metaSolver.initialize( A );
    metaSolver.solve( x, b );
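The Criterion line in solver.dsl combines an iteration limit with relative and absolute residual thresholds. The following self-contained C++ sketch only illustrates the stopping rule that such a nested criterion expresses when a solver checks it after each iteration; it does not use LAMA's actual criterion classes (IterationCount, ResidualThreshold, L2Norm appear in the DSL, but their C++ signatures are not shown here), and the SolverState type is purely hypothetical:

    // Concept sketch only (not LAMA API): the combined stopping rule
    //   IterationCount( 100 ) OR ResidualThreshold( L2Norm, 1e-14, Relative )
    //                         OR ResidualThreshold( L2Norm, 1e-12, Absolute )
    #include <cstddef>

    struct SolverState
    {
        std::size_t iteration;        // iterations completed so far
        double      residualNorm;     // current L2 norm of r = b - A * x
        double      initialResidual;  // L2 norm of the initial residual
    };

    // The solver stops as soon as any of the nested criteria is satisfied.
    bool isSatisfied( const SolverState& s )
    {
        const bool iterationLimitReached = s.iteration >= 100;
        const bool relativeThresholdMet  = s.residualNorm <= 1e-14 * s.initialResidual;
        const bool absoluteThresholdMet  = s.residualNorm <= 1e-12;
        return iterationLimitReached || relativeThresholdMet || absoluteThresholdMet;
    }

In the DSL the same rule is built by nesting criteria with OR, so a solver such as the CG root block stops on whichever condition is met first.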
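The text-book syntax used in the code sample (x = A * b) rests on operator overloading over an abstract matrix interface, so application code stays the same no matter which storage format is behind the matrix. The following minimal, library-independent sketch illustrates that design idea; the types Matrix, CsrMatrix and DenseVector here are stand-ins for illustration, not LAMA's classes:

    #include <cstddef>
    #include <vector>

    // Stand-in vector type for this illustration (not LAMA's DenseVector).
    struct DenseVector
    {
        std::vector<double> values;
    };

    // Abstract matrix interface: every storage format supplies its own apply().
    struct Matrix
    {
        virtual ~Matrix() = default;
        virtual DenseVector apply( const DenseVector& x ) const = 0;
    };

    // Text-book syntax: y = A * x works for any concrete format behind Matrix.
    DenseVector operator*( const Matrix& A, const DenseVector& x )
    {
        return A.apply( x );
    }

    // One concrete storage format: compressed sparse row (CSR).
    struct CsrMatrix : Matrix
    {
        std::vector<std::size_t> ia;  // row offsets, size numRows + 1
        std::vector<std::size_t> ja;  // column indices of the non-zero entries
        std::vector<double>      va;  // values of the non-zero entries

        DenseVector apply( const DenseVector& x ) const override
        {
            const std::size_t numRows = ia.empty() ? 0 : ia.size() - 1;
            DenseVector y;
            y.values.assign( numRows, 0.0 );
            for ( std::size_t i = 0; i < numRows; ++i )
                for ( std::size_t k = ia[i]; k < ia[i + 1]; ++k )
                    y.values[i] += va[k] * x.values[ ja[k] ];
            return y;
        }
    };

Because x = A * b depends only on the abstract interfaces, the same pattern lets the storage format, compute location and distribution be exchanged in the DSL without touching the application code.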
Architecture

LAMA's software stack consists of backends for the different compute locations – encapsulating the respective programming APIs –, a C++ core implementation and a solver framework. The core provides all the key features, such as the text-book syntax, the memory management and the implicit parallelization, which are used to build the solvers on the next level. User applications can interface the LAMA API directly or go through middleware that uses LAMA, e.g. OpenFOAM.

The Library

LAMA provides access to general BLAS functions and a wide set of iterative solvers through a C++ interface with an easy-to-use text-book syntax. Users benefit from being able to run their applications flexibly on various platforms. LAMA takes care of memory usage and synchronization and achieves efficient parallelization through asynchronous data transfers and communication. Computation is highly optimized at kernel level for each backend, accelerating the different sparse matrix formats. The concepts of solvers, distributions and matrix formats are exchangeable, and users can switch between compute locations, e.g. GPU or Intel® Xeon Phi, through a DSL.

Future Work

Thanks to its easily extensible design, LAMA will be broadened to further backends, more solvers, and upcoming sparse matrix formats and distribution patterns. Recent work includes an FPGA backend, automatic distribution strategies and interfaces to other programming languages such as C and Python. LAMA will also be integrated into higher-level environments such as MATLAB or OpenFOAM.

Website: www.libama.org
Hosted on: sourceforge.net/projects/libama