How to Write Fast Numerical Code Spring 2013 Lecture: Spiral (Computer generation of FFT code) Instructor: Markus Püschel TA: Georg Ofenbeck & Daniele Spampinato Rest of Semester Today Lecture Project meetings Project presentations • 10 minutes each • random order • random speaker 2 © Markus Püschel Computer Science How to write fast numerical code Spring 2013 The Problem: Example DFT DFT on Intel Core i7 (4 Cores, 2.66 GHz) Performance [Gflop/s] 40 35 Fastest program 30 25 20 35x 15 10 12x 5 Direct implementation 0 16   64 256 1k 4k 16k 64k 256k 1M Same number of operations Best compiler DFT: Analysis DFT (single precision) on Intel Core i7 (4 cores, 2.66 GHz) Performance [Gflop/s] 40 35 30 25 3x 20 threading 15 10 3x vectorization 5 5x 0 16 © Markus Püschel Computer Science 64 256 1k  Compiler doesn’t do it  Doing by hand: Very tough 4k 16k 64k 256k 1M locality optimization How to write fast numerical code Spring 2013 Our Goal: Computer writes high performance library code “click” DFT IP Cores Viterbi Decoder @ www.spiral.net © Markus Püschel Computer Science How to write fast numerical code Spring 2013 Possible Approach: Capturing algorithm knowledge: Domain-specific languages (DSLs) Structural optimization: Rewriting systems High performance code style: Compiler Decision making for choices: Machine learning Organization © Markus Püschel Computer Science  Spiral: Basic system  Vectorization  General input size  Results  Final remarks How to write fast numerical code Spring 2013 Algorithms: Example FFT, n = 4 Fast Fourier transform (FFT) Representation using matrix algebra  SPL (Signal processing language): Mathematical, declarative, point-free  Divide-and-conquer algorithms = breakdown rules in SPL Decomposition Rules (>200 for >40 Transforms) ³ ³ ´´ ³ ´ > DFTn ! Pk=2;2m DFT2m © Ik=2¡1 -i C2m rDFT2m(i=k) RDFT0k -Im ; k even; ¯ ¯ ¯ 0 ¯ ¯11 0¯ ¯ 0¯ 1 ¯RDFTn¯ ¯RDFT ¯ ¯ rDFT ¯ ¯RDFT0 ¯ 2m¯ 2m (i=k) ¯ ¯ ¯ ¯ ¯ ¯ k¯ ¯ B¯ B ¯ ¯CC B¯ C 0¯ 0 ¯ 0¯ ¯RDFTn¯ B¯RDFT2m¯ B ¯ rDFT2m (i=k) ¯CC B¯RDFTk ¯ C > - I2 ) B¯ ¯ ¯ ! (Pk=2;m ¯ © BIk=2¡1 -i D2m ¯ ¯CC B¯ ¯ - ImC ; 0 ¯ DHTn ¯ ¯rDHT2m(i=k)¯AA @¯ DHTk ¯ @¯ DHT2m ¯ @ A ¯ ¯ ¯ ¯ ¯ ¯ ¯ ¯ ¯ DHT0 ¯ ¯ DHT0 ¯ ¯rDHT ¯ DHT0 ¯ (i=k)¯ 2m n 2m k ¯ ¯ ¯ ¯! Ã¯ ¯ Ã ! ¯ rDFT (u) ¯ ¯ rDFT ¯ ¯ ¯ ¯ ¯ ¯ ¯ rDFT2k (u) ¯ 2n 2m ((i + u)=k) ¯¯ I I ; ¯ ¯ ! L2n ¯ ¯ ¯ m k i ¯rDHT m ¯rDHT2n (u)¯ ¯rDHT2k (u)¯ 2m((i + u)=k)¯ RDFT-3n ! (Q> k=2;m - I2) (Ik -i rDFT2m )(i + 1=2)=k)) (RDFT-3k -Im ) ; ³ ³ > DCT-2n ! Pk=2;2m DCT-22m K22m © Ik=2¡1 - N2m RDFT-3> 2m DCT-3n ! DCT-2> n; ³ ´ ´´ k even; k even; n=2 Bn (Lk=2 - I2)(Im - RDFT0k )Qm=2;k ; n=2 > 0 DCT-4n ! Q> k=2;2m Ik=2 - N2m RDFT-32m Bn (Lk=2 - I2)(Im - RDFT-3k )Qm=2;k : Decomposition rules = Algorithm knowledge in Spiral n DFTn ! (DFTk - Im) Tn n = km m(Ik - DFTm) Lk ; DFTn ! Pn(DFTk - DFTm)Qn ; n = km; gcd(k; m) = 1 DFTp ! RpT (I1 © DFTp¡1)Dp(I1 © DFTp¡1)Rp ; p prime DCT-3n ! (Im © Jm) Ln -3m(1=4) © DCT-3m(3=4)) m (DCT " # Im 0 © ¡ Jm¡1 ¢(F2 - Im) n = 2m p1 (I1 ©2 Im) ; (from ≈100 publications) 2 DCT-4n ! SnDCT-2n diag0·k<n(1=(2 cos((2k + 1)¼=4n))) ÃÃ" IMDCT2m ! (Jm © Im © Im © Jm) WHT2k ! t Y (I i=1 k1 +¢¢¢+ki¡1 2 # ! 1 - Im © ¡1 - WHT2ki - I Ã" # ki+1 +¢¢¢+kt ); 2 !! ¡1 - Im ¡1 J2m DCT-42m k = k1 + ¢ ¢ ¢ + kt DFT2 ! F2 p DCT-22 ! diag(1; 1= 2) F2 DCT-42 ! J2 R13¼=8 Combining these rules yields many algorithms for every given transform © Markus Püschel Computer Science How to write fast numerical code Spring 2013 SPL to Code Correct code: easy fast code: very difficult Program Generation in Spiral Transform Decomposition rules Algorithm (SPL) parallelization vectorization Algorithm (∑-SPL) locality optimization C Program © Markus Püschel Computer Science void sub(double *y, double *x) { double f0, f1, f2, f3, f4, f7, f8, f10, f11; f0 = x[0] - x[3]; f1 = x[0] + x[3]; f2 = x[1] - x[2]; f3 = x[1] + x[2]; f4 = f1 - f3; y[0] = f1 + f3; y[2] = 0.7071067811865476 * f4; f7 = 0.9238795325112867 * f0; < more lines> basic block optimizations + Search or Learning How to write fast numerical code Spring 2013 Organization  Spiral: Basic system  Vectorization  General input size  Results  Final remarks Example: Vectorization in Spiral  © Markus Püschel Computer Science Relationship SPL expressions ↔ vectorization? ³ ´ y = DFT2 x y = DFT2 - I4 x one addition one subtraction one (4-way) vector addition one (4-way) vector subtraction How to write fast numerical code Spring 2013 Step 1: Identify “Good” Vector Constructs  Vector length: º  Good (= easily vectorizable) SPL constructs: A - Iº 2 2º Lºº ; L2º 2 ; Lº base cases SPL expressions recursively built from those  Idea: Convert a given SPL expression into a “good” SPL expression through rewriting (structural manipulation) Step 2: Find Manipulation Rules Lnº ! n ³ In=º - Lºº Lmn ! m ³ Lm Lnº ! º Il - Lkmn - Ir ! n Il - Lkmn - Ir n Il - Lkmn km - Ir Il - Lkmn r km - I´ ³ Im -An£n ³ ´ Im -An£n Lmn m ³ ´ Lmn Im -An£n n µ ¶ ³ ´ Ik - Im -An£n Lmn Lkmn m k µ ³ ´¶ kmn mn n£n Lmn Ik - Ln Im -A ! ! ! ³ ³ Ln º - Iº mn=º 2 ´³ ´³ Ln n=º - Iº º2 In=º - Lº - Iº ´³ ´ ´ Imn=º 2 - Lºº ´³ 2 ´³ In=º - Lm m=º - Iº ´ ´ mn I - Lkn n - Imr ´³ Ikl - Ln - Ir ´ ³ l kmn kmn I - Lkn - Ir Il - Lmn - Ir ³ l ´³ ´ kn I - Lmn m - Ir ´³ Il - Lk - Imr ´ ³ kl Il - Lkmn - Ir Il - Lkmn m - Ir ³ k ´ Lmn (Im=º -An£n) - Iº Lmn º mn=º µ ³ ´¶³ ´ mn=º Im=º - Lnº An£n - Iº Lm=º - Iº º ¶ ³ ´µ ³ ´ mn=º Ln - Iº Im=º - An£n - Iº Lnº n ¶³ ³ ´µ ³ ´ ´ km n£n kn Lk - In Im - Ik -A Lk Lmn m - Ik ³ ´µ ³ ´¶³ ´ Lmn Im - Lkn Ik -An£n Lkm n - Ik n m - In Manipulation rules = Processor knowledge in Spiral ! AB Am£m - Iº Im -An£n D P © Markus Püschel Computer Science ! ! ! ! ! ! ! ! ! A ³ B ´³ ´³ ´ Im - L2º Am£m - Iº Im - L2º º 2 I³m -An£n ´ ³ ´ 2º ~ I In=º - L2º D º n=º - L2 P - I2 How to write fast numerical code Spring 2013 Example mn DFT DFTm - In)Tmn n {z(Im - DFTn) Lm } | {z mn} ! ( | vec(º) vec(º) ::: ::: ::: µ ¶µ ¶ 0mn ! I mn - L2º DFTm - I n - Iº T n º º µ ³ I m - L2n º - Iº º ´µ º I 2n - Lºº º 2 ¶µ I n - L2º 2 - Iº º ¶³ DFTn - Iº ´¶ µ mn L mº - L2º 2 º ¶ vectorized arithmetic vectorized data accesses Automatically Generate Base Case Library  Goal: Given instruction set, generate base cases n º=4:  4 4 8 8 L4 2; I2 - L2; L2 - I2; L2; L4 o Idea: Instructions as matrices + search y = _mm_unpacklo_ps(x0, x1); y = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(1,2,1,2)); y = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(3,4,3,4)); 2 6 6 6 6 6 4 © Markus Püschel Computer Science 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 3 7 7 7 7 7 5 2 3 # 1 0 0 0 0 0 0 0 " x0 6 0 0 0 0 1 0 0 0 7 ~ ~ x1 0 0 0 0 0 1 0 0 2 3 # 1 0 0 0 0 0 0 0 " x0 6 0 1 0 0 0 0 0 0 7 ~ y=4 0 0 0 0 1 0 0 0 5 ~ x1 0 0 0 0 0 1 0 0 2 3 # 0 0 1 0 0 0 0 0 " x0 6 0 0 0 1 0 0 0 0 7 ~ y=4 0 0 0 0 0 0 1 0 5 ~ x1 0 0 0 0 0 0 0 1 y=4 0 1 0 0 0 0 0 0 5 y0 = _mm_shuffle_ps(x0, x1, y0 = _mm_unpacklo_ps(x[0], x[1]); _MM_SHUFFLE(1,2,1,2)); y1 = _mm_shuffle_ps(x0, x1, y1 = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(3,4,3,4)); _MM_SHUFFLE(3,4,3,4)); No base Base casecase How to write fast numerical code Spring 2013 Same Approach for Different Paradigms Threading: Vectorization: GPUs: Verilog for FPGAs:   Rigorous, correct by construction Overcomes compiler limitations Organization © Markus Püschel Computer Science  Spiral: Basic system  Vectorization  General input size  Results  Final remarks How to write fast numerical code Spring 2013 Challenge: General Size Libraries So far: Challenge: Code specialized to fixed input size Library for general input size DFT_384(x, y) { … for(i = …) { t[2i] = x[2i] + x[2i+1] t[2i+1] = x[2i] - x[2i+1] } … } DFT(n, x, y) { … for(i = …) { DFT_strided(m, x+mi, y+i, 1, k) } … } • Algorithm fixed • Nonrecursive code • Algorithm cannot be fixed • Recursive code • Creates many challenges Challenge: Recursion Steps  Cooley-Tukey FFT  Implementation that increases locality (e.g., FFTW 2.x) void DFT(int n, cpx *y, cpx *x) { int k = choose_dft_radix(n); for (int i=0; i < k; ++i) DFTrec(m, y + m*i, x + i, k, 1); for (int j=0; j < m; ++j) DFTscaled(k, y + j, t[j], m); } © Markus Püschel Computer Science How to write fast numerical code Spring 2013 S-SPL : Basic Idea   Four additional matrix constructs: S, G, S, Perm  S (sum) explicit loop  Gf (gather) load data with index mapping f  Sf (scatter) store data with index mapping f  Permf permute data with the index mapping f S-SPL formulas = matrix factorizations Example: Find Recursion Step Closure Voronenko, 2008 Repeat until closure © Markus Püschel Computer Science How to write fast numerical code Spring 2013 Recursion Step Closure: Examples DFT: scalar code DFT: full-fledged (vectorized and parallel code) Summary: Complete Automation for Transforms • Memory hierarchy optimization Rewriting and search for algorithm selection Rewriting for loop optimizations • Vectorization Rewriting • Parallelization Rewriting fixed input size code • Derivation of library structure Rewriting Other methods general input size library © Markus Püschel Computer Science How to write fast numerical code Spring 2013 Organization  Spiral: Basic system  Vectorization  General input size  Results  Final remarks DFT on Intel Multicore Spiral © Markus Püschel Computer Science 5MB vectorized, threaded, general-size, adaptive library How to write fast numerical code Spring 2013 Computer generated Functions for Intel IPP 6.0 3984 C functions 1M lines of code Transforms: DFT (fwd+inv), RDFT (fwd+inv), DCT2, DCT3, DCT4, DHT, WHT Sizes: 2–64 (DFT, RDFT, DHT); 2-powers (DCTs, WHT) Precision: single, double Data type: scalar, SSE, AVX (DFT, DCT), LRB (DFT) Computer generated Results: SpiralGen Inc. Organization © Markus Püschel Computer Science  Spiral: Basic system  Vectorization  General input size  Results  Final remarks How to write fast numerical code Spring 2013 Spiral: Summary  Spiral: Successful approach to automating the development of computing software void dft64(float *Y, float *X) { __m512 U912, U913, U914, U915,... __m512 *a2153, *a2155; a2153 = ((__m512 *) X); s1107 = *(a2153); s1108 = *((a2153 + 4)); t1323 = _mm512_add_ps(s1107,s1108); t1324 = _mm512_sub_ps(s1107,s1108); <many more lines> U926 = _mm512_swizupconv_r32(…); s1121 = _mm512_madd231_ps(_mm512_mul_ps(_mm512_mask_or_pi( _mm512_set_1to16_ps(0.70710678118654757),0xAAAA,a2154,U926),t1341), _mm512_mask_sub_ps(_mm512_set_1to16_ps(0.70710678118654757),…), _mm512_swizupconv_r32(t1341,_MM_SWIZ_REG_CDAB)); U927 = _mm512_swizupconv_r32 <many more lines> } Commercial proof-of-concept  Key ideas: Algorithm knowledge: Domain specific symbolic representation Platform knowledge: Tagged rewrite rules, SIMD specification 35x © Markus Püschel Computer Science How to write fast numerical code Spring 2013 Algorithms Algorithms Software Software Compilers Compilers Microarchitecture Microarchitecture performance Performance is different than other software quality features Research Questions © Markus Püschel Computer Science  How to automate the production of fastest numerical code?  Domain-specific languages  Rewriting  Compilers  Machine Learning  What program language features help with program generation?  What environment should be used to build generators?  How to represent mathematical functionality?  How to formalize the mapping to fast code?  How to handle various forms of parallelism?  How to integrate into standard work flows? How to write fast numerical code Spring 2013