How to Write Fast Numerical Code Rest of Semester Spring 2013

How to Write Fast Numerical Code
Spring 2013
Lecture: Spiral (Computer generation of FFT code)
Instructor: Markus Püschel
TA: Georg Ofenbeck & Daniele Spampinato
Rest of Semester
Today
Lecture
Project meetings
Project presentations
• 10 minutes each
• random order
• random speaker
2
© Markus Püschel
Computer Science
How to write fast numerical code
Spring 2013
The Problem: Example DFT
DFT on Intel Core i7 (4 Cores, 2.66 GHz)
Performance [Gflop/s]
40
35
Fastest program
30
25
20
35x
15
10
12x
5
Direct implementation
0
16


64
256
1k
4k
16k
64k
256k
1M
Same number of operations
Best compiler
DFT: Analysis
DFT (single precision) on Intel Core i7 (4 cores, 2.66 GHz)
Performance [Gflop/s]
40
35
30
25
3x
20
threading
15
10
3x vectorization
5
5x
0
16
© Markus Püschel
Computer Science
64
256
1k

Compiler doesn’t do it

Doing by hand: Very tough
4k
16k
64k
256k
1M
locality optimization
How to write fast numerical code
Spring 2013
Our Goal:
Computer writes high performance library code
“click”
DFT IP Cores
Viterbi Decoder
@ www.spiral.net
© Markus Püschel
Computer Science
How to write fast numerical code
Spring 2013
Possible Approach:
Capturing algorithm knowledge:
Domain-specific languages (DSLs)
Structural optimization:
Rewriting systems
High performance code style:
Compiler
Decision making for choices:
Machine learning
Organization
© Markus Püschel
Computer Science

Spiral: Basic system

Vectorization

General input size

Results

Final remarks
How to write fast numerical code
Spring 2013
Algorithms: Example FFT, n = 4
Fast Fourier transform (FFT)
Representation using matrix algebra

SPL (Signal processing language): Mathematical, declarative, point-free

Divide-and-conquer algorithms = breakdown rules in SPL
Decomposition Rules (>200 for >40 Transforms)
³
³
´´ ³
´
>
DFTn ! Pk=2;2m
DFT2m © Ik=2¡1 -i C2m rDFT2m(i=k)
RDFT0k -Im ; k even;
¯
¯
¯ 0
¯
¯11 0¯
¯
0¯
1
¯RDFTn¯
¯RDFT
¯
¯ rDFT
¯
¯RDFT0 ¯
2m¯
2m (i=k) ¯
¯
¯
¯
¯
¯
k¯
¯
B¯
B
¯
¯CC B¯
C
0¯
0 ¯
0¯
¯RDFTn¯
B¯RDFT2m¯ B
¯ rDFT2m (i=k) ¯CC B¯RDFTk ¯
C
>
- I2 ) B¯
¯
¯ ! (Pk=2;m
¯ © BIk=2¡1 -i D2m ¯
¯CC B¯
¯ - ImC ;
0
¯ DHTn ¯
¯rDHT2m(i=k)¯AA @¯ DHTk ¯
@¯ DHT2m ¯ @
A
¯
¯
¯
¯
¯
¯
¯
¯
¯ DHT0 ¯
¯ DHT0
¯
¯rDHT
¯ DHT0 ¯
(i=k)¯
2m
n
2m
k
¯
¯
¯
¯! ï
¯
Ã
!
¯ rDFT (u) ¯
¯ rDFT
¯
¯
¯
¯
¯
¯
¯ rDFT2k (u) ¯
2n
2m ((i + u)=k) ¯¯
I
I
;
¯
¯ ! L2n
¯
¯
¯
m
k i ¯rDHT
m
¯rDHT2n (u)¯
¯rDHT2k (u)¯
2m((i + u)=k)¯
RDFT-3n ! (Q>
k=2;m - I2) (Ik -i rDFT2m )(i + 1=2)=k)) (RDFT-3k -Im ) ;
³
³
>
DCT-2n ! Pk=2;2m
DCT-22m K22m © Ik=2¡1 - N2m RDFT-3>
2m
DCT-3n ! DCT-2>
n;
³
´
´´
k even;
k even;
n=2
Bn (Lk=2 - I2)(Im - RDFT0k )Qm=2;k ;
n=2
>
0
DCT-4n ! Q>
k=2;2m Ik=2 - N2m RDFT-32m Bn (Lk=2 - I2)(Im - RDFT-3k )Qm=2;k :
Decomposition rules = Algorithm knowledge in Spiral
n
DFTn ! (DFTk - Im) Tn
n = km
m(Ik - DFTm) Lk ;
DFTn ! Pn(DFTk - DFTm)Qn ; n = km; gcd(k; m) = 1
DFTp ! RpT (I1 © DFTp¡1)Dp(I1 © DFTp¡1)Rp ; p prime
DCT-3n ! (Im © Jm) Ln
-3m(1=4) © DCT-3m(3=4))
m (DCT
"
#
Im 0 © ¡ Jm¡1
¢(F2 - Im)
n = 2m
p1 (I1 ©2 Im) ;
(from ≈100 publications)
2
DCT-4n ! SnDCT-2n diag0·k<n(1=(2 cos((2k + 1)¼=4n)))
ÃÃ"
IMDCT2m ! (Jm © Im © Im © Jm)
WHT2k !
t
Y
(I
i=1
k1 +¢¢¢+ki¡1
2
#
!
1
- Im ©
¡1
- WHT2ki - I
Ã"
#
ki+1 +¢¢¢+kt );
2
!!
¡1
- Im
¡1
J2m DCT-42m
k = k1 + ¢ ¢ ¢ + kt
DFT2 ! F2
p
DCT-22 ! diag(1; 1= 2) F2
DCT-42 ! J2 R13¼=8
Combining these rules yields many algorithms for every given transform
© Markus Püschel
Computer Science
How to write fast numerical code
Spring 2013
SPL to Code
Correct code: easy
fast code: very difficult
Program Generation in Spiral
Transform
Decomposition rules
Algorithm
(SPL)
parallelization
vectorization
Algorithm
(∑-SPL)
locality
optimization
C Program
© Markus Püschel
Computer Science
void sub(double *y, double *x) {
double f0, f1, f2, f3, f4, f7, f8, f10, f11;
f0 = x[0] - x[3];
f1 = x[0] + x[3];
f2 = x[1] - x[2];
f3 = x[1] + x[2];
f4 = f1 - f3;
y[0] = f1 + f3;
y[2] = 0.7071067811865476 * f4;
f7 = 0.9238795325112867 * f0;
< more lines>
basic block
optimizations
+ Search or
Learning
How to write fast numerical code
Spring 2013
Organization

Spiral: Basic system

Vectorization

General input size

Results

Final remarks
Example: Vectorization in Spiral

© Markus Püschel
Computer Science
Relationship SPL expressions ↔ vectorization?
³
´
y = DFT2 x
y = DFT2 - I4 x
one addition
one subtraction
one (4-way) vector addition
one (4-way) vector subtraction
How to write fast numerical code
Spring 2013
Step 1: Identify “Good” Vector Constructs

Vector length: º

Good (= easily vectorizable) SPL constructs:
A - Iº
2
2º
Lºº ; L2º
2 ; Lº
base cases
SPL expressions recursively built from those

Idea: Convert a given SPL expression into a “good” SPL expression
through rewriting (structural manipulation)
Step 2: Find Manipulation Rules
Lnº
!
n
³
In=º - Lºº
Lmn
!
m
³
Lm
Lnº
!
º
Il - Lkmn
- Ir !
n
Il - Lkmn
- Ir
n
Il - Lkmn
km - Ir
Il - Lkmn
r
km - I´
³
Im -An£n
³
´
Im -An£n Lmn
m
³
´
Lmn
Im -An£n
n
µ
¶
³
´
Ik - Im -An£n Lmn
Lkmn
m
k
µ
³
´¶
kmn
mn
n£n
Lmn Ik - Ln
Im -A
!
!
!
³
³
Ln
º - Iº
mn=º
2
´³
´³
Ln
n=º - Iº
º2
In=º - Lº
- Iº
´³
´
´
Imn=º 2 - Lºº
´³
2
´³
In=º - Lm
m=º - Iº
´
´
mn
I - Lkn
n - Imr ´³ Ikl - Ln - Ir ´
³ l
kmn
kmn
I - Lkn - Ir Il - Lmn - Ir
³ l
´³
´
kn
I - Lmn
m - Ir ´³ Il - Lk - Imr ´
³ kl
Il - Lkmn
- Ir Il - Lkmn
m - Ir
³ k
´
Lmn
(Im=º -An£n) - Iº Lmn
º
mn=º
µ
³
´¶³
´
mn=º
Im=º - Lnº
An£n - Iº
Lm=º - Iº
º
¶
³
´µ
³
´
mn=º
Ln
- Iº
Im=º - An£n - Iº Lnº
n
¶³
³
´µ
³
´
´
km
n£n
kn
Lk - In
Im - Ik -A
Lk
Lmn
m - Ik
³
´µ
³
´¶³
´
Lmn
Im - Lkn
Ik -An£n
Lkm
n - Ik
n
m - In
Manipulation rules = Processor
knowledge in Spiral
!
AB
Am£m - Iº
Im -An£n
D
P
© Markus Püschel
Computer Science
!
!
!
!
!
!
!
!
!
A
³ B
´³
´³
´
Im - L2º
Am£m - Iº Im - L2º
º
2
I³m -An£n ´ ³
´
2º
~ I
In=º - L2º
D
º
n=º - L2
P - I2
How to write fast numerical code
Spring 2013
Example
mn
DFT
DFTm - In)Tmn
n {z(Im - DFTn) Lm }
| {z mn} ! (
|
vec(º)
vec(º)
:::
:::
:::
µ
¶µ
¶
0mn
! I mn - L2º
DFTm - I n - Iº T n
º
º
µ
³
I m - L2n
º - Iº
º
´µ
º
I 2n - Lºº
º
2
¶µ
I n - L2º
2 - Iº
º
¶³
DFTn - Iº
´¶ µ mn
L mº - L2º
2
º
¶
vectorized arithmetic
vectorized data accesses
Automatically Generate Base Case Library

Goal: Given instruction set, generate base cases
n
º=4:

4
4
8
8
L4
2; I2 - L2; L2 - I2; L2; L4
o
Idea: Instructions as matrices + search
y = _mm_unpacklo_ps(x0, x1);
y = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(1,2,1,2));
y = _mm_shuffle_ps(x0, x1, _MM_SHUFFLE(3,4,3,4));
2
6
6
6
6
6
4
© Markus Püschel
Computer Science
1
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
1
3
7
7
7
7
7
5
2
3
#
1 0 0 0 0 0 0 0 "
x0
6 0 0 0 0 1 0 0 0 7 ~
~
x1
0 0 0 0 0 1 0 0
2
3
#
1 0 0 0 0 0 0 0 "
x0
6 0 1 0 0 0 0 0 0 7 ~
y=4 0 0 0 0 1 0 0 0 5
~
x1
0 0 0 0 0 1 0 0
2
3
#
0 0 1 0 0 0 0 0 "
x0
6 0 0 0 1 0 0 0 0 7 ~
y=4 0 0 0 0 0 0 1 0 5
~
x1
0 0 0 0 0 0 0 1
y=4 0 1 0 0 0 0 0 0 5
y0 = _mm_shuffle_ps(x0, x1,
y0 = _mm_unpacklo_ps(x[0], x[1]);
_MM_SHUFFLE(1,2,1,2));
y1 = _mm_shuffle_ps(x0, x1,
y1 = _mm_shuffle_ps(x0, x1,
_MM_SHUFFLE(3,4,3,4));
_MM_SHUFFLE(3,4,3,4));
No base
Base
casecase
How to write fast numerical code
Spring 2013
Same Approach for Different Paradigms
Threading:
Vectorization:
GPUs:
Verilog for FPGAs:


Rigorous, correct by construction
Overcomes compiler limitations
Organization
© Markus Püschel
Computer Science

Spiral: Basic system

Vectorization

General input size

Results

Final remarks
How to write fast numerical code
Spring 2013
Challenge: General Size Libraries
So far:
Challenge:
Code specialized to fixed input size
Library for general input size
DFT_384(x, y) {
…
for(i = …) {
t[2i]
= x[2i] + x[2i+1]
t[2i+1] = x[2i] - x[2i+1]
}
…
}
DFT(n, x, y) {
…
for(i = …) {
DFT_strided(m, x+mi, y+i, 1, k)
}
…
}
• Algorithm fixed
• Nonrecursive code
• Algorithm cannot be fixed
• Recursive code
• Creates many challenges
Challenge: Recursion Steps

Cooley-Tukey FFT

Implementation that increases locality (e.g., FFTW 2.x)
void DFT(int n, cpx *y, cpx *x) {
int k = choose_dft_radix(n);
for (int i=0; i < k; ++i)
DFTrec(m, y + m*i, x + i, k, 1);
for (int j=0; j < m; ++j)
DFTscaled(k, y + j, t[j], m);
}
© Markus Püschel
Computer Science
How to write fast numerical code
Spring 2013
S-SPL : Basic Idea


Four additional matrix constructs: S, G, S, Perm
 S (sum)
explicit loop
 Gf (gather)
load data with index mapping f
 Sf (scatter)
store data with index mapping f
 Permf
permute data with the index mapping f
S-SPL formulas = matrix factorizations
Example:
Find Recursion Step Closure
Voronenko, 2008
Repeat until closure
© Markus Püschel
Computer Science
How to write fast numerical code
Spring 2013
Recursion Step Closure: Examples
DFT: scalar code
DFT: full-fledged (vectorized and parallel code)
Summary: Complete Automation for Transforms
• Memory hierarchy optimization
Rewriting and search for algorithm selection
Rewriting for loop optimizations
• Vectorization
Rewriting
• Parallelization
Rewriting
fixed input size code
• Derivation of library structure
Rewriting
Other methods
general input size library
© Markus Püschel
Computer Science
How to write fast numerical code
Spring 2013
Organization

Spiral: Basic system

Vectorization

General input size

Results

Final remarks
DFT on Intel Multicore
Spiral
© Markus Püschel
Computer Science
5MB vectorized, threaded,
general-size, adaptive library
How to write fast numerical code
Spring 2013
Computer generated Functions for Intel IPP 6.0
3984 C functions
1M lines of code
Transforms: DFT (fwd+inv), RDFT (fwd+inv), DCT2, DCT3, DCT4, DHT, WHT
Sizes: 2–64 (DFT, RDFT, DHT); 2-powers (DCTs, WHT)
Precision: single, double
Data type: scalar, SSE, AVX (DFT, DCT), LRB (DFT)
Computer generated
Results: SpiralGen Inc.
Organization
© Markus Püschel
Computer Science

Spiral: Basic system

Vectorization

General input size

Results

Final remarks
How to write fast numerical code
Spring 2013
Spiral: Summary

Spiral:
Successful approach to automating
the development of computing software
void dft64(float *Y, float *X) {
__m512 U912, U913, U914, U915,...
__m512 *a2153, *a2155;
a2153 = ((__m512 *) X); s1107 = *(a2153);
s1108 = *((a2153 + 4));
t1323 = _mm512_add_ps(s1107,s1108);
t1324 = _mm512_sub_ps(s1107,s1108);
<many more lines>
U926 = _mm512_swizupconv_r32(…);
s1121 = _mm512_madd231_ps(_mm512_mul_ps(_mm512_mask_or_pi(
_mm512_set_1to16_ps(0.70710678118654757),0xAAAA,a2154,U926),t1341),
_mm512_mask_sub_ps(_mm512_set_1to16_ps(0.70710678118654757),…),
_mm512_swizupconv_r32(t1341,_MM_SWIZ_REG_CDAB));
U927 = _mm512_swizupconv_r32
<many more lines>
}
Commercial proof-of-concept

Key ideas:
Algorithm knowledge:
Domain specific symbolic representation
Platform knowledge:
Tagged rewrite rules, SIMD specification
35x
© Markus Püschel
Computer Science
How to write fast numerical code
Spring 2013
Algorithms
Algorithms
Software
Software
Compilers
Compilers
Microarchitecture
Microarchitecture
performance
Performance is different than other software quality features
Research Questions
© Markus Püschel
Computer Science

How to automate the production of fastest numerical code?
 Domain-specific languages
 Rewriting
 Compilers
 Machine Learning

What program language features help with program generation?

What environment should be used to build generators?

How to represent mathematical functionality?

How to formalize the mapping to fast code?

How to handle various forms of parallelism?

How to integrate into standard work flows?
How to write fast numerical code
Spring 2013