How to Write Fast Numerical Code
A Small Introduction
Markus Püschel
Computer Science
ETH Zürich, Switzerland
Picture: www.tapety-na-pulpit.org
September 1st
Electrical and Computer Engineering
Carnegie Mellon University
Computer Science
ETH Zürich
PhD and Postdoc openings (see also www.spiral.net):
High performance computing
Compilers
Theory
Programming languages/Generative programming
FPGAs
Scientific Computing: physics/biology simulations
Consumer Computing: audio/image/video processing
Embedded Computing: signal processing, communication, control
Unlimited need for performance
Large set of applications, but relatively small set of critical components (100s to 1000s):
Matrix multiplication
Discrete Fourier transform (DFT)
Viterbi decoder
Shortest path computation
Solving linear systems
….
The Problem: Example 1
DFT (single precision) on Intel Core i7 (4 cores, 2.66 GHz)
[Plot: performance in Gflop/s (0–40) vs. input size (16 to 1M).
Upper curve: fastest code (1 MB of source). Lower curve: straightforward "good" C code (1 KB of source).
Annotated speedup gaps: 12x and 35x.
Both: vendor compiler, best flags; roughly the same operation count.]
The Problem: Example 2
Matrix Multiplication (MMM) on 2 x Core 2 Duo 3 GHz
[Plot: performance in Gflop/s (0–50) vs. matrix size (0 to 9,000).
Upper curve: fastest code (100 KB of source). Lower curve: triple loop (< 1 KB of source).
Annotated speedup gap: 160x.
Both: vendor compiler, best flags; exactly the same operation count (2n³).]
What is going on?
Evolution of Processors (Intel)
[Figure: processor evolution over time; annotation: the era of parallelism]
And There Will Be Variety …
Arm Cortex A9
Nvidia G200
TI TNETV3020
Tilera Tile64
Core i7
Source: IEEE SP Magazine, Vol. 26, November 2009
DFT (single precision) on Intel Core i7 (4 cores, 2.66 GHz)
[Plot: performance in Gflop/s (0–40) vs. input size (16 to 1M), with the gains attributed to each optimization:
Multiple threads: 3x
Vector instructions: 3x
Memory hierarchy: 5x]
The compiler doesn't do the job; doing it by hand is a nightmare.
Matrix-Matrix Multiplication (MMM) on 2 x Core 2 Duo 3 GHz
[Plot: performance in Gflop/s (0–50) vs. matrix size (0 to 9,000), with the gains attributed to each optimization:
Multiple threads: 4x
Vector instructions: 4x
Memory hierarchy: 20x]
The compiler doesn't do the job; doing it by hand is a nightmare.
Summary and Facts I
Implementations with the same operation count can have vastly different performance (up to 100x and more)
A cache miss can be 100x more expensive than an operation
Vector instructions
Multiple cores = processors on one die
Minimizing operations ≠ maximizing performance
End of free speed-up for legacy code
Future performance gains through increasing parallelism
Summary and Facts II
It is very difficult to write the fastest code
Tuning for memory hierarchy
Vector instructions
Efficient parallelization (multiple threads)
Requires expert knowledge in algorithms, coding, and architecture
Fast code can be large
Can violate “good” software engineering practices
Compilers often can’t do the job
Often intricate changes in the algorithm are required
Automatic parallelization/vectorization is still unsolved
Highest performance is in general non-portable
Performance/Productivity Challenge
Current Solution
Legions of programmers implement and optimize the same functionality for every platform, and again whenever a new platform comes out.
Better Solution: Automation
Automate (parts of) the implementation or optimization
Research efforts:
Linear algebra: PHiPAC/ATLAS, LAPACK, Sparsity/Bebop/OSKI, FLAME
Tensor computations
PDE/finite elements: FEniCS
Adaptive sorting
Fourier transform: FFTW
Linear transforms: Spiral
… others
New compiler techniques
Promising new area, but much more work needed …
Proceedings of the IEEE special issue, Feb. 2005
This Tutorial
Increasing awareness for the problem
Focus: optimization for the memory hierarchy
Examples considered: MMM and DFT
Techniques from state-of-the-art research and implementations
Extract general principles for application to other numerical problems
Get a glimpse of how automatic performance tuning works …
MMM generator ATLAS
Adaptive DFT library FFTW
… and how automatic code production works
Program generator Spiral
Matrix-Matrix Multiplication
MMM
References used:
J. Demmel, Applied Numerical Linear Algebra, SIAM, 1997
R. C. Whaley, A. Petitet, and J. Dongarra, Automated empirical optimization of
software and the ATLAS project, Parallel Comput., 27(1–2), pp. 3–35, 2001
K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, and P. Stodghill,
Is Search Really Necessary to Generate High-Performance BLAS?, Proc. IEEE,
93(2), pp. 358–386, Feb. 2005
Linear Algebra Algorithms: Examples
Solving systems of linear equations
Computation of eigenvalues
Singular value decomposition
LU/Cholesky/QR/… decompositions
… many others
Make up large parts of the numerical computation across disciplines
(sciences, computer science, engineering)
Efficient software is extremely relevant
The Path to LAPACK
1960s/70s: EISPACK and LINPACK
Libraries for linear algebra algorithms
Cleve Moler et al.
Vector-based implementation
Problem: no locality; low performance became apparent in the 80s (deep memory hierarchies: CPU → cache → main memory)
Solution: LAPACK (late 80s, early 90s)
Reimplement the algorithms "block-based," i.e., with locality
Jim Demmel, Jack Dongarra et al.
Reuse and Blocking
Reuse = number of operations / size of data
Compute the reuse of:
Vector-vector operations (e.g., adding two vectors)
Matrix-vector operations (e.g., matrix-vector multiplication)
Matrix-matrix operations (e.g., MMM)
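As a quick check (standard operation counts, not spelled out on the slide): adding two vectors of length n performs n operations on roughly 3n data, so the reuse is O(1); matrix-vector multiplication performs about 2n² operations on about n² data, again O(1); MMM performs 2n³ operations on 3n² data, so the reuse is O(n); an FFT performs O(n log(n)) operations on O(n) data, so the reuse is O(log(n)).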
[Plots: performance vs. problem size for MMM (matrix sizes up to 9,000) and for the DFT (input sizes 16 to ~1M).
MMM: O(n) reuse. DFT: O(log(n)) reuse.
Annotation: performance drop outside the L2 cache.]
LAPACK and BLAS
Basic idea:
LAPACK: static
BLAS: reimplemented for each platform
BLAS = Basic Linear Algebra Subroutines
BLAS1: vector-vector operations; O(1) reuse
BLAS2: matrix-vector operations; O(1) reuse
BLAS3: matrix-matrix operations; O(n) reuse (mostly MMM)
LAPACK implemented on top of BLAS
Blocking: BLAS 3 is used as much as possible
MMM and Algorithms
Most important BLAS3 function and in all of numerical computing
C = A·B or C = C + A·B with A, B, C matrices (assume square)
By definition: 2n³ operations, i.e., O(n³)
The basis for practically all existing implementations
Strassen's algorithm: O(n^log2(7)) ≈ O(n^2.808)
Crossover point n = 654
Structurally more complex, worse numerical stability
Rarely implemented
Best known: Coppersmith/Winograd: O(n^2.376)
Large constant
Very complex
No implementation exists
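For reference, a minimal C version of the "triple loop" baseline from the plot above (not code from the slides); it performs exactly the 2n³ operations but ignores the memory hierarchy:

/* Naive MMM: C = C + A*B for n x n matrices in row-major layout.
   n^3 multiplications and n^3 additions = 2n^3 flops. */
void mmm_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = C[i*n + j];           /* scalar replacement of C[i][j] */
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}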
Optimizing MMM
Live demo:
Why blocking, and its effect on cache behavior
Blocking for cache and registers
Unrolling and scalar replacement
Final Code: Sketch
// MMM loop nest: blocking for cache (block size NB)
for i = 0:NB:N-1
  for j = 0:NB:M-1
    for k = 0:NB:K-1
      // mini-MMM loop nest: blocking for registers (MU x NU x KU)
      for i' = i:MU:i+NB-1
        for j' = j:NU:j+NB-1
          for k' = k:KU:k+NB-1
            // micro-MMM loop nest: unrolled, with scalar replacement
            for k'' = k':1:k'+KU-1
              for i'' = i':1:i'+MU-1
                for j'' = j':1:j'+NU-1
                  c[i''][j''] += a[i''][k'']*b[k''][j'']
Optimizations applied: blocking for cache, blocking for registers, unrolling, scalar replacement, other …
Degrees of freedom NB, MU, NU, KU, LS can be used for platform adaptation
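A compilable C sketch of the outermost (cache) level only, under the simplifying assumptions that NB divides n and that register blocking, unrolling, and scalar replacement are left to the lower levels as in the loop nest above (hypothetical helper, not ATLAS code):

#define NB 64   /* cache block size: one of the degrees of freedom, tuned per platform */

/* One level of blocking: the three outer loops walk over NB x NB blocks,
   the inner loops perform the mini-MMM on data that now fits in cache.
   Assumes n is a multiple of NB; row-major layout; computes C = C + A*B. */
void mmm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i += NB)
        for (int j = 0; j < n; j += NB)
            for (int k = 0; k < n; k += NB)
                /* mini-MMM on the blocks starting at (i,j), (i,k), (k,j) */
                for (int i2 = i; i2 < i + NB; i2++)
                    for (int k2 = k; k2 < k + NB; k2++) {
                        double a = A[i2*n + k2];   /* scalar replacement */
                        for (int j2 = j; j2 < j + NB; j2++)
                            C[i2*n + j2] += a * B[k2*n + j2];
                    }
}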
Automation: BLAS Generator ATLAS/PhiPAC
[Diagram: ATLAS infrastructure.
"Detect Hardware Parameters" measures L1Size, NR, MulAdd, L*.
The ATLAS Search Engine chooses NB, MU, NU, KU, xFetch, MulAdd, Latency and emits mini-MMM source.
The source is compiled, executed, and measured (Mflop/s); the result feeds back into the search.]
Idea: Automatic porting of LAPACK
LAPACK: static
BLAS: generated by the ATLAS MMM code generator, regenerated for each platform
But: parallelism killed it
Generative approach not suitable for vectorization
LAPACK/BLAS division suboptimal on multicore systems
Discrete Fourier Transform
DFT
References used:
Van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM,
1992
M. Frigo and S. Johnson, FFTW: An Adaptive Software Architecture for the FFT,
Proc. ICASSP, 1998, www.fftw.org
M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong,
F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo,
Spiral: Code Generation for DSP Transforms, Proc. IEEE 93(2), 2005, pp. 232-275
Linear Transforms
Linear transform = matrix-vector multiplication: y = T · x (output vector y, transform matrix T, input vector x)
Example: discrete Fourier transform (DFT)
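For concreteness, the standard definition (not spelled out on the slide): DFT_n = [ω_n^(kl)] for k, l = 0, …, n−1 with ω_n = e^(−2πi/n), so the transform computes y = DFT_n · x.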
Cooley/Tukey Fast Fourier Transform (FFT)
[Figure: Cooley/Tukey FFT for n = 4 = 2 × 2; annotated operation counts: 12 adds, 4 mults; stage annotations: 4 adds, 1 mult, 4 adds]
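A minimal C sketch (not code from the slides) of the recursive radix-2 Cooley/Tukey FFT, assuming n is a power of two; it shows only the divide-and-conquer structure, with none of the locality or vectorization optimizations discussed later:

#include <complex.h>

#define PI 3.14159265358979323846

/* Recursive radix-2 decimation-in-time FFT: y = DFT_n(x), n a power of two.
   The input is read at stride s so the even/odd splits need no copying. */
static void fft_rec(int n, int s, const double complex *x, double complex *y)
{
    if (n == 1) { y[0] = x[0]; return; }
    fft_rec(n/2, 2*s, x,     y);         /* even-indexed inputs -> y[0 .. n/2-1] */
    fft_rec(n/2, 2*s, x + s, y + n/2);   /* odd-indexed inputs  -> y[n/2 .. n-1] */
    for (int k = 0; k < n/2; k++) {
        double complex t = cexp(-2.0 * PI * I * k / n) * y[k + n/2];  /* twiddle */
        double complex u = y[k];
        y[k]       = u + t;              /* butterfly */
        y[k + n/2] = u - t;
    }
}

void dft(int n, const double complex *x, double complex *y) { fft_rec(n, 1, x, y); }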
FFT Dataflow
[Figure: dataflow for n = 16 = 4 × 4]
Optimizing Cooley-Tukey FFT and FFTW
Live
FFT as structured matrix-factorization
Recursive FFT for memory locality
Fusing steps for memory locality
Precomputing constants
Unrolling and scalar replacement
Adaptive library FFTW
Optimization for the Memory Hierarchy
Try to understand the cache behavior of your algorithm
Use recursive algorithms to increase locality/reuse
MMM: blocking, DFT: recursive FFT
Minimize passes over the dataset to increase locality/reuse
Use recursive algorithms; DFT: fuse computation steps
Once the recursion reaches a small problem size:
unroll the computation, use scalar replacement (x[i]→t)
MMM: for last level of blocking (for registers), DFT: for small DFT sizes
Use an initialization routine for things that have to be done once
DFT: precompute sines and cosines in the twiddle factors, best recursion strategy
Identify degrees of freedom and figure out the best choice
MMM: block sizes, DFT: in FFT recursion (factorization n = km)
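As a small illustration of the unrolling/scalar-replacement point above (a hypothetical 2x2 micro-kernel, not ATLAS or FFTW code): the loops are fully unrolled and array elements are copied into local scalars so the compiler can keep them in registers:

/* Fully unrolled 2x2 MMM update C = C + A*B with scalar replacement. */
void mmm_2x2(const double *A, const double *B, double *C)
{
    double c00 = C[0], c01 = C[1], c10 = C[2], c11 = C[3];
    double a00 = A[0], a01 = A[1], a10 = A[2], a11 = A[3];
    double b00 = B[0], b01 = B[1], b10 = B[2], b11 = B[3];
    c00 += a00*b00 + a01*b10;    c01 += a00*b01 + a01*b11;
    c10 += a10*b00 + a11*b10;    c11 += a10*b01 + a11*b11;
    C[0] = c00;  C[1] = c01;  C[2] = c10;  C[3] = c11;
}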
So Far
We have seen techniques to improve performance
We have seen tools that provide some automation
ATLAS
FFTW
But much is still manual:
Any form of parallelization
Choice of parameters (ATLAS)
Design of library (FFTW)
Is it possible to build a tool that automatically produces a highest-performance library?
Including support for parallelism
Performance of generated code should match best hand-tuned code
Spiral
Computer Synthesis of High Performance Libraries
Spiral team (only part shown)
Joint work with …
Franz Franchetti
Yevgen Voronenko
Srinivas Chellappa
Frédéric de Mesmay
Daniel McFarlin
José Moura
James Hoe
David Padua
Jeremy Johnson
…
References used:
At www.spiral.net
Current Approach
Algorithm knowledge + platform knowledge → brute force → optimized implementation (redone for every new platform)
Platform knowledge, e.g., vector intrinsics:
_mm_set1_epi8(x) = …
_mm_xor_si128(x,y) = …
_mm_avg_epu8(x,y) = …
_mm_cmpeq_epi8(x,y) = …
_mm_unpacklo_epi8(x,y) = …
…
Automation:
Algorithm knowledge + platform knowledge → Spiral → optimized implementation (regenerated for every new platform)
Platform knowledge, e.g., vector intrinsics:
_mm_set1_epi8(x) = …
_mm_xor_si128(x,y) = …
_mm_avg_epu8(x,y) = …
_mm_cmpeq_epi8(x,y) = …
_mm_unpacklo_epi8(x,y) = …
…
Organization
Spiral: Linear Transforms
Basic system
Parallelism
General input size
Beyond transforms
Conclusion
Linear Transforms
Mathematically: matrix-vector multiplication y = T · x (output vector y, transform matrix T, input vector x)
Example: discrete Fourier transform (DFT)
Transform Algorithms: Example 4-point FFT
Cooley/Tukey fast Fourier transform (FFT):
SPL (signal processing language): mathematical, declarative, point-free
Divide-and-conquer algorithms: Breakdown rules in SPL
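For n = km, one standard form of the Cooley/Tukey breakdown rule in SPL (as written in the Spiral and FFT literature) is
DFT_km → (DFT_k ⊗ I_m) · T^km_m · (I_k ⊗ DFT_m) · L^km_k
where T^km_m is the diagonal matrix of twiddle factors and L^km_k is a stride permutation.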
Breakdown Rules (>200 for >40 Transforms)
Breakdown rules in Spiral = algorithm knowledge
(≈200 journal papers)
Combining these rules yields many algorithms for every given transform
SPL to Code
Correct code: easy
Fast code: very difficult
Program Generation in Spiral (Sketched)
[Diagram: program generation in Spiral.
Transform (user specified)
→ breakdown rules (many choices)
→ fast algorithm in SPL (parallelization, vectorization)
→ Σ-SPL (loop optimizations)
→ C code (constant folding, scheduling, …)
Optimization happens at all abstraction levels; the process is iterated to search for the fastest result.]
But that's not all …
Organization
Spiral: Linear Transforms
Basic system
Parallelism
General input size
Beyond transforms
Conclusion
SPL to Shared Memory Code: Issues
Good SPL construct: a block-diagonal formula such as y = (I_4 ⊗ A) x: each of the four processors applies A to its own contiguous chunk of x and y.
Bad SPL construct: permutations (data accesses): elements move across cacheline boundaries, which causes false sharing between processors.
Parallelization: Basic Idea
Identify crucial hardware parameters
Number of processors: p
Cache line size: μ
Identify “good” SPL formulas
Use rewriting: SPL formulas → "good" SPL formulas
Parallelization by Rewriting
Rewriting rules in Spiral = platform knowledge
Example
"Good" SPL formula (load-balanced, no false sharing) in the sense of our definition
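A concrete instance (a simplified case, following the Spiral papers): since I_m = I_p ⊗ I_{m/p}, the formula I_m ⊗ A rewrites to I_p ⊗ (I_{m/p} ⊗ A), i.e., p equal-sized independent blocks of work, one per processor, with no data shared between them.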
Same Approach for Different Paradigms
Threading:
Vectorization:
GPUs:
Verilog for FPGAs:
Rigorous, correct by construction
Overcomes compiler limitations
Back to the Big Picture
Algorithm knowledge
Platform knowledge
Spiral
Optimized implementation
(regenerated for every new platform)
Organization
Spiral: Linear Transforms
Basic system
Parallelism
General input size
Beyond transforms
Conclusion
Challenge: General Size Libraries
Spiral so far: fixed size code
Challenge: recursive general size library (as in FFTW, MKL)

Fixed size code:
DFT_384(x, y) {
  …
  for(i = …) {
    t[2i]   = x[2i] + x[2i+1]
    t[2i+1] = x[2i] - x[2i+1]
  }
  …
}
• Algorithm fixed
• Nonrecursive code

General size library:
DFT(n, x, y) {
  …
  for(i = …) {
    DFT_strided(m, x+mi, y+i, 1, k)
  }
  …
}
• Algorithm cannot be fixed
• Recursive code
• Creates many challenges
Possible?
Vectorized, parallelized,
general-size, adaptive library
Challenge: Recursion Steps
Cooley-Tukey FFT
Implementation that increases locality (e.g., FFTW 2.x)
DFT(int n, complex *Y, complex *X) {
  k = choose_factor(n);
  // sketch: T = temporary buffer, precomputed_f = precomputed twiddle factors (setup omitted)
  for i = 0 to k-1
    DFT_strided(n/k, k, 1, Y + (n/k)*i, T + (n/k)*i)
  for i = 0 to n/k-1
    DFT_scaled(k, n/k, precomputed_f, Y + i, Y + i)
}
DFT_strided(int n, int is, int os, complex *Y, complex *X) {…}
DFT_scaled(int n, int s, complex *F, complex *Y, complex *X) {…}
2 additional functions (recursion steps) needed
How to discover automatically?
Σ-SPL: Basic Idea
Four additional matrix constructs: Σ, G, S, Perm
Σ (sum): explicit loop
G_f (gather): load data with index mapping f
S_f (scatter): store data with index mapping f
Perm_f: permute data with the index mapping f
Σ-SPL formulas = matrix factorizations
Example:
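(The slide's formula is not reproduced here. As a simple illustration of the constructs: a gather G_f with f(i) = 2i reads the even-indexed elements of its input, (G_f x)_i = x_{2i}, and the corresponding scatter S_f writes its input back to exactly those positions.)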
Find Recursion Step Closure
Input: transform T and a breakdown rule
Output: recursion step closure + implementation of each recursion step
Algorithm:
1. Apply the breakdown rule to T
2. Convert to Σ-SPL
3. Apply loop merging + index simplification rules.
4. Extract required recursion steps
5. Repeat until closure is reached
Recursion Step Closure: Example
DFT: scalar code
DCT4: vector code
Summary: Complete Automation for Transforms
• Memory hierarchy optimization: rewriting and search for algorithm selection; rewriting for loop optimizations
• Vectorization: rewriting
• Parallelization: rewriting
→ fixed input size code
• Derivation of library structure: rewriting and other methods
→ general input size library
DFT on Intel Multicore
[Plot: DFT performance; curve labeled "Spiral": 5 MB vectorized, threaded, general-size, adaptive library]
Often it Looks Like That
DCTs on 2.66 GHz Core2 (4-way SSSE3)
[Plot: performance in Gflop/s (0–12) vs. input size n (32 to 384).
Upper curve: Spiral-generated. Lower curves: available libraries.]
Generated Functions for Intel IPP 6.0
• 1M lines of code
• 3984 C functions
• Functionality: Cross product of
Transforms: DFT (fwd+inv), RDFT (fwd+inv), DCT2, DCT3, DCT4, DHT, WHT
Sizes: 2–64 (DFT, RDFT, DHT); 2-powers (DCTs, WHT)
Precision: single, double
Data type: scalar, SSE, AVX (DFT, DCT), LRB (DFT)
Written by a computer
Organization
Spiral: Linear Transforms
Beyond transforms
Conclusion
Going Beyond Transforms
Linear transforms: SPL (linear), built from basic matrices and matrix operators
Generalization: OL, built from basic operators and higher-order operators
Example: software-defined radio (SDR), plus operator definitions and additional rules (with SpiralGen, Jan 2010)
Organization
Spiral: Linear Transforms
Beyond transforms
Conclusion
Related Work
Automatic performance tuning
Linear algebra: PHiPAC/ATLAS (UTK), Sparsity/Bebop (Berkeley)
Tensor computations (Ohio State)
Adaptive sorting (UIUC)
Fourier transform: FFTW (MIT)
Program synthesis
Flame (UT Austin)
FFTW basic block generator
Generative programming (WG 2.11)
Iterative compilation
(U. Paris Sud, U. Edinburgh, …)
Spiral: Key Points
Spiral:
Successful approach to automating
the development of computing software
Commercial proof-of-concept
Key ideas:
Algorithm knowledge:
Domain specific symbolic representation
Platform knowledge:
Tagged rewrite rules, SIMD specification
Interdisciplinary approach
/* Example of Spiral-generated code: DFT of size 64, vectorized with 512-bit intrinsics */
void dft64(float *Y, float *X) {
__m512 U912, U913, U914, U915,...
__m512 *a2153, *a2155;
a2153 = ((__m512 *) X); s1107 = *(a2153);
s1108 = *((a2153 + 4));
t1323 = _mm512_add_ps(s1107,s1108);
t1324 = _mm512_sub_ps(s1107,s1108);
<many more lines>
U926 = _mm512_swizupconv_r32(…);
s1121 = _mm512_madd231_ps(_mm512_mul_ps(_mm512_mask_or_pi(
_mm512_set_1to16_ps(0.70710678118654757),0xAAAA,a2154,U926),t1341),
_mm512_mask_sub_ps(_mm512_set_1to16_ps(0.70710678118654757),…),
_mm512_swizupconv_r32(t1341,_MM_SWIZ_REG_CDAB));
U927 = _mm512_swizupconv_r32
<many more lines>
}
More Information:
www.spiral.net
Don’t Forget!
Matrix-Matrix Multiplication (MMM) on 2 x Core 2 Duo 3 GHz
[Plot: performance in Gflop/s (0–50) vs. matrix size (0 to 9,000), with the gains attributed to each optimization:
Multiple threads: 4x
Vector instructions: 4x
Memory hierarchy: 20x]
Programming languages
Program generation
Compilers
Software engineering
Performance/Productivity Challenge
Machine learning
Algorithms
Mathematics
Symbolic Computation
Rewriting
Great opportunities for interdisciplinary high-impact research
© Copyright 2025