
How to Write Fast Numerical Code
A Small Introduction
Markus Püschel
Computer Science
ETH Zürich, Switzerland
September 1st
Electrical and Computer Engineering, Carnegie Mellon University
Computer Science, ETH Zürich
PhD and Postdoc openings (see also www.spiral.net):
 High performance computing
 Compilers
 Theory
 Programming languages/generative programming
 FPGAs
Unlimited Need for Performance
 Scientific computing: physics/biology simulations
 Consumer computing: audio/image/video processing
 Embedded computing: signal processing, communication, control

Large set of applications, but a relatively small set of critical components (100s to 1000s):
 Matrix multiplication
 Discrete Fourier transform (DFT)
 Viterbi decoder
 Shortest path computation
 Solving linear systems
 …
The Problem: Example 1
DFT (single precision) on Intel Core i7 (4 cores, 2.66 GHz)

[Plot: Performance [Gflop/s], 0–40, vs. input size, 16 to 1M.
The fastest code (1 MB of code) beats straightforward “good” C code (1 KB of code)
by 12x to 35x.]

 Vendor compiler, best flags
 Roughly same operations count
The Problem: Example 2
Matrix multiplication (MMM) on 2 x Core 2 Duo, 3 GHz

[Plot: Performance [Gflop/s], 0–50, vs. matrix size, 0 to 9,000.
The fastest code (100 KB of code) beats the triple loop (< 1 KB of code)
by up to 160x.]

 Vendor compiler, best flags
 Exact same operations count (2n^3)

What is going on?
Evolution of Processors (Intel)

[Chart: evolution of Intel processors; the recent years mark the era of parallelism.]
And There Will Be Variety …
 Arm Cortex A9
 Nvidia G200
 TI TNETV3020
 Tilera Tile64
 Core i7
Source: IEEE Signal Processing Magazine, Vol. 26, November 2009
DFT (single precision) on Intel Core i7 (4 cores, 2.66 GHz)

[Plot: Performance [Gflop/s], 0–40, vs. input size, 16 to 1M, annotated with the sources
of speedup: memory hierarchy: 5x, vector instructions: 3x, multiple threads: 3x.]

 The compiler doesn’t do the job
 Doing it by hand: a nightmare
Matrix-matrix multiplication (MMM) on 2 x Core 2 Duo, 3 GHz

[Plot: Performance [Gflop/s], 0–50, vs. matrix size, 0 to 9,000, annotated with the sources
of speedup: memory hierarchy: 20x, vector instructions: 4x, multiple threads: 4x.]

 The compiler doesn’t do the job
 Doing it by hand: a nightmare
Summary and Facts I
 Implementations with the same operations count can have vastly different performance (up to 100x and more)
  A cache miss can be 100x more expensive than an operation
  Vector instructions
  Multiple cores = processors on one die
 Minimizing operations ≠ maximizing performance
 End of free speed-up for legacy code
  Future performance gains come from increasing parallelism
Summary and Facts II
 It is very difficult to write the fastest code
  Tuning for the memory hierarchy
  Vector instructions
  Efficient parallelization (multiple threads)
  Requires expert knowledge in algorithms, coding, and architecture
 Fast code can be large
  Can violate “good” software engineering practices
 Compilers often can’t do the job
  Often intricate changes in the algorithm are required
  Parallelization/vectorization still unsolved
 Highest performance is in general non-portable

The performance/productivity challenge
Current Solution
 Legions of programmers implement and optimize the same functionality for every platform, and do it again whenever a new platform comes out.
Better Solution: Automation
 Automate (parts of) the implementation or optimization
 Research efforts:
  Linear algebra: Phipac/ATLAS, LAPACK, Sparsity/Bebop/OSKI, Flame
  Tensor computations
  PDE/finite elements: Fenics
  Adaptive sorting
  Fourier transform: FFTW
  Linear transforms: Spiral
  … others
  New compiler techniques
 A promising new area, but much more work is needed …

See the Proceedings of the IEEE special issue, Feb. 2005
This Tutorial
 Increasing awareness of the problem
 Focus: optimization for the memory hierarchy
 Examples considered: MMM and DFT
  Techniques from state-of-the-art research and implementations
  Extract general principles for application to other numerical problems
 Get a glimpse of how automatic performance tuning works …
  MMM generator ATLAS
  Adaptive DFT library FFTW
 … and how automatic code generation works
  Program generator Spiral
Matrix-Matrix Multiplication (MMM)

References used:
 J. Demmel, Applied Numerical Linear Algebra, SIAM, 1997
 R. C. Whaley, A. Petitet, and J. Dongarra, Automated Empirical Optimization of Software and the ATLAS Project, Parallel Comput., 27(1–2), pp. 3–35, 2001
 K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, and P. Stodghill, Is Search Really Necessary to Generate High-Performance BLAS?, Proc. IEEE, 93(2), pp. 358–386, Feb. 2005
Linear Algebra Algorithms: Examples
 Solving systems of linear equations
 Computation of eigenvalues
 Singular value decomposition
 LU/Cholesky/QR/… decompositions
 … many others

 These make up large parts of numerical computation across disciplines (sciences, computer science, engineering)
 Efficient software is extremely relevant
The Path to LAPACK
 1960s/70s: EISPACK and LINPACK
  Libraries for linear algebra algorithms
  Cleve Moler et al.
  Vector-based implementation
 Problem: no locality
  Low performance became apparent in the 80s with deep memory hierarchies (CPU – cache – main memory)
 Solution: LAPACK (late 80s, early 90s)
  Reimplement the algorithms “block-based,” i.e., with locality
  Jim Demmel, Jack Dongarra et al.
Reuse and Blocking
 Reuse = (number of operations) / (size of data)
 Compute the reuse of:
  Vector-vector operations (e.g., adding two vectors): O(1)
  Matrix-vector operations (e.g., matrix-vector multiplication): O(1)
  Matrix-matrix operations (e.g., MMM): O(n)
  DFT: O(log(n))

[Plots: MMM performance (Gflop/s) vs. matrix size, and DFT performance (Gflop/s) vs.
input size; both show a performance drop outside the L2 cache.
MMM: O(n) reuse; DFT: O(log(n)) reuse.]
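As a worked check (a sketch using the operation counts from this deck; the DFT count assumes an FFT is used):

\mathrm{reuse}(\mathrm{MMM}) = \frac{2n^3}{3n^2} = \tfrac{2}{3}\,n \in O(n),
\qquad
\mathrm{reuse}(\mathrm{DFT}_n) = \frac{O(n \log n)}{O(n)} \in O(\log n)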
LAPACK and BLAS
 Basic idea: LAPACK stays static; BLAS is reimplemented for each platform
 BLAS = Basic Linear Algebra Subroutines
  BLAS1: vector-vector operations; O(1) reuse
  BLAS2: matrix-vector operations; O(1) reuse
  BLAS3: matrix-matrix operations; O(n) reuse (mostly MMM)
 LAPACK is implemented on top of BLAS
  Blocking: BLAS3 is used as much as possible
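To make the LAPACK-on-BLAS idea concrete, here is a minimal sketch of a routine handing its MMM to BLAS3 through the standard CBLAS interface (the wrapper name mmm_blas3 and the row-major, square-matrix setup are our assumptions):

#include <cblas.h>   /* CBLAS interface, e.g., from Netlib or OpenBLAS */

/* Sketch: C = C + A*B for row-major n x n matrices, via the BLAS3
   routine dgemm, which is tuned per platform and carries O(n) reuse. */
void mmm_blas3(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,     /* m, n, k       */
                1.0, A, n,   /* alpha, A, lda */
                B, n,        /*        B, ldb */
                1.0, C, n);  /* beta,  C, ldc */
}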
MMM and Algorithms
 The most important BLAS3 function, and central to numerical computing in general
 C = A·B or C = C + A·B, with A, B, C matrices (assumed square n x n)
 By definition: 2n^3 operations, i.e., O(n^3)
  The basis for practically all existing implementations (see the sketch below)
 Strassen’s algorithm: O(n^log2(7)) ≈ O(n^2.808)
  Crossover point around n = 654
  Structurally more complex, worse numerical stability
  Rarely implemented
 Best known: Coppersmith/Winograd: O(n^2.376)
  Large constant
  Very complex
  No implementation exists
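For reference, the “triple loop” baseline from the earlier plot performs exactly the 2n^3 operations of the definition (a minimal sketch; row-major layout is our assumption):

/* Naive triple loop: C = C + A*B for n x n row-major matrices.
   Exactly n^3 multiplications and n^3 additions, but poor locality:
   this is the slow baseline in the MMM plots above. */
void mmm_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}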
Optimizing MMM
 Live demo
 Why blocking, and its effect on cache behavior
 Blocking for cache and registers
 Unrolling and scalar replacement
Final Code: Sketch
// MMM loop nest: blocking for cache
for i = 0:NB:N-1
  for j = 0:NB:M-1
    for k = 0:NB:K-1
      // mini-MMM loop nest: blocking for registers
      for i’ = i:MU:i+NB-1
        for j’ = j:NU:j+NB-1
          for k’ = k:KU:k+NB-1
            // micro-MMM loop nest: unrolling, scalar replacement, other …
            for k” = k’:1:k’+KU-1
              for i” = i’:1:i’+MU-1
                for j” = j’:1:j’+NU-1
                  c[i”][j”] += a[i”][k”]*b[k”][j”]

Degrees of freedom (can be used for platform adaptation): NB, MU, NU, KU, LS
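A direct C rendering of just the outer (cache) level of this sketch, with the mini-MMM left as a plain loop nest (a sketch under the assumptions that matrices are row-major and n is divisible by NB):

enum { NB = 64 };   /* cache block size: a degree of freedom to be tuned */

/* Cache-level blocking only: C = C + A*B, n x n, row-major,
   n assumed divisible by NB. Each mini-MMM touches 3*NB^2 doubles,
   which should fit in cache, giving O(NB) reuse per element. */
void mmm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i += NB)
        for (int j = 0; j < n; j += NB)
            for (int k = 0; k < n; k += NB)
                /* mini-MMM on the NB x NB blocks at (i,j), (i,k), (k,j) */
                for (int i2 = i; i2 < i + NB; i2++)
                    for (int j2 = j; j2 < j + NB; j2++)
                        for (int k2 = k; k2 < k + NB; k2++)
                            C[i2*n + j2] += A[i2*n + k2] * B[k2*n + j2];
}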
Automation: BLAS Generators ATLAS/PhiPAC

[Diagram: Detect Hardware Parameters (L1Size, NR, MulAdd, L*)
 → ATLAS Search Engine (chooses NB, MU, NU, KU, xFetch, MulAdd, Latency)
 → ATLAS MMM Code Generator → MiniMMM source
 → Compile, Execute, Measure (Mflop/s), feeding the result back into the search.]

 Idea: automatic porting of LAPACK
  LAPACK: static
  BLAS: regenerated for each platform
 But: parallelism killed it
  The generative approach is not suitable for vectorization
  The LAPACK/BLAS division is suboptimal on multicore systems
Discrete Fourier Transform (DFT)

References used:
 C. Van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM, 1992
 M. Frigo and S. Johnson, FFTW: An Adaptive Software Architecture for the FFT, Proc. ICASSP, 1998, www.fftw.org
 M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo, Spiral: Code Generation for DSP Transforms, Proc. IEEE, 93(2), pp. 232–275, 2005
Linear Transforms
 Linear transform = matrix-vector multiplication: y = M·x
  (y: output vector, M: transform matrix, x: input vector)
 Example: discrete Fourier transform (DFT):
  DFT_n = [ω_n^(kl)], 0 ≤ k, l < n, with ω_n = e^(–2πi/n)
Cooley/Tukey Fast Fourier Transform (FFT)
 Example n = 4 = 2 x 2: DFT_4 is factored into a product of sparse matrices
 [Matrix factorization figure. Cost of the direct matrix-vector product: 12 adds, 4 mults;
  cost of the factored form, stage by stage: 4 adds, then 1 mult, then 4 adds.]
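The general form, used later as a breakdown rule, is the classic Cooley-Tukey factorization (standard notation: T_m^(km) is the diagonal matrix of twiddle factors, L_k^(km) the stride permutation):

\mathrm{DFT}_{km} = (\mathrm{DFT}_k \otimes I_m)\, T^{km}_m\, (I_k \otimes \mathrm{DFT}_m)\, L^{km}_k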
FFT Dataflow
 [Dataflow graph for n = 16 = 4 x 4]
Optimizing Cooley-Tukey FFT and FFTW
 Live demo
 FFT as structured matrix factorization
 Recursive FFT for memory locality
 Fusing steps for memory locality
 Precomputing constants
 Unrolling and scalar replacement
 Adaptive library FFTW
Optimization for the Memory Hierarchy
 Try to understand the cache behavior of your algorithm
 Use recursive algorithms to increase locality/reuse
  MMM: blocking; DFT: recursive FFT
 Minimize passes over the dataset to increase locality/reuse
  Use recursive algorithms; DFT: fuse computation steps
 Once the recursion reaches a small problem size: unroll the computation and use scalar replacement (x[i] → t), as in the sketch below
  MMM: at the last level of blocking (for registers); DFT: for small DFT sizes
 Use an initialization routine for things that have to be done once
  DFT: precompute the sines and cosines in the twiddle factors and the best recursion strategy
 Identify degrees of freedom and figure out the best choice
  MMM: block sizes; DFT: the FFT recursion (factorization n = km)
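To illustrate the unroll + scalar replacement step, a minimal sketch for the smallest DFT building block (a 2-point butterfly, kept real-valued for brevity):

/* Fully unrolled 2-point butterfly with scalar replacement:
   x[0] and x[1] are loaded once into scalars (x[i] -> t), so the
   compiler can keep them in registers; no loop overhead remains. */
void dft2(const double *x, double *y)
{
    double t0 = x[0];
    double t1 = x[1];
    y[0] = t0 + t1;
    y[1] = t0 - t1;
}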
So Far
 We have seen techniques to improve performance
 We have seen tools that provide some automation
  ATLAS
  FFTW
 But much is still manual:
  Any form of parallelization
  Choice of parameters (ATLAS)
  Design of the library (FFTW)
 Is it possible to build a tool that automatically produces a highest-performance library?
  Including support for parallelism
  The performance of the generated code should match the best hand-tuned code
Spiral
Computer Synthesis of High Performance Libraries

Joint work with the Spiral team (only part shown):
 Franz Franchetti
 Yevgen Voronenko
 Srinivas Chellappa
 Frédéric de Mesmay
 Daniel McFarlin
 José Moura
 James Hoe
 David Padua
 Jeremy Johnson
 …
References used: see www.spiral.net
Current Approach
 Algorithm knowledge + platform knowledge
 → brute force (human experts)
 → optimized implementation (redone for every new platform)

 Platform knowledge here means, e.g., SIMD intrinsics:
  _mm_set1_epi8(x), _mm_xor_si128(x,y), _mm_avg_epu8(x,y),
  _mm_cmpeq_epi8(x,y), _mm_unpacklo_epi8(x,y), …

Automation
 Algorithm knowledge + platform knowledge
 → Spiral
 → optimized implementation (regenerated for every new platform)
Organization
 Spiral: Linear Transforms
  Basic system
  Parallelism
  General input size
 Beyond transforms
 Conclusion
Linear Transforms
 Mathematically: matrix-vector multiplication y = M·x
  (y: output vector, M: transform matrix, x: input vector)
 Example: discrete Fourier transform (DFT)
Transform Algorithms: Example 4-point FFT
 Cooley/Tukey fast Fourier transform (FFT):
  DFT_4 = (DFT_2 ⊗ I_2) T (I_2 ⊗ DFT_2) L (see the breakdown rule above)
 SPL (signal processing language): mathematical, declarative, point-free
 Divide-and-conquer algorithms: breakdown rules in SPL

Breakdown Rules (>200 for >40 Transforms)
 Breakdown rules in Spiral = algorithm knowledge (≈200 journal papers)
 Combining these rules yields many algorithms for every given transform
SPL to Code
 Correct code: easy. Fast code: very difficult.
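As a tiny illustration (our sketch, not Spiral’s generated output): the SPL formula y = (I_4 ⊗ DFT_2) x translates directly into a loop over 2-point butterflies (real-valued for brevity):

/* y = (I_4 tensor DFT_2) x : four independent 2-point butterflies
   on consecutive pairs. Correct code is this easy; making it fast
   (vectorized, parallel, locality-aware) is the hard part. */
void i4_dft2(const double *x, double *y)
{
    for (int j = 0; j < 4; j++) {
        double t0 = x[2*j];
        double t1 = x[2*j + 1];
        y[2*j]     = t0 + t1;
        y[2*j + 1] = t0 - t1;
    }
}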
Program Generation in Spiral (Sketched)

[Diagram: transform (user specified)
 → breakdown rules → fast algorithm in SPL (many choices)
 → parallelization/vectorization → ∑-SPL
 → loop optimizations → C code (constant folding, scheduling, …).
 Optimization happens at all abstraction levels, and the whole process is
 iterated to search for the fastest implementation.]

But that’s not all …
Organization
 Spiral: Linear Transforms
  Basic system
  Parallelism
  General input size
 Beyond transforms
 Conclusion
SPL to Shared Memory Code: Issues
 Good SPL construct: y = (I_4 ⊗ A) x
  [Figure: block-diagonal structure; processors 0–3 each apply A to their own chunk of x]
 Bad SPL construct: permutations (data accesses that cross cacheline boundaries)
  [Figure: a permutation scattering data across cacheline boundaries]
Parallelization: Basic Idea
 Identify the crucial hardware parameters
  Number of processors: p
  Cache line size: μ
 Identify “good” SPL formulas
 Use rewriting: SPL formulas → “good” SPL formulas

Parallelization by Rewriting
 Rewriting rules in Spiral = platform knowledge
 Example: rewriting produces a “good” SPL formula, i.e., load-balanced with no false sharing in the sense of our definition (see the sketch below)
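A sketch of the flavor of these rules (standard SPL identities and the “good” shapes from the published Spiral parallelization work, not the exact rule set):

% Tensor-product identity used to expose the parallel form
% (L is the stride permutation):
A_m \otimes I_n = L^{mn}_m \, (I_n \otimes A_m) \, L^{mn}_n

% "Good" constructs for p processors and cache line size \mu:
%   I_p \otimes A        -- one block per processor: load-balanced
%   P  \otimes I_\mu     -- permutes whole cache lines: no false sharing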
Same Approach for Different Paradigms
 Threading
 Vectorization
 GPUs
 Verilog for FPGAs

 Rigorous, correct by construction
 Overcomes compiler limitations
Back to the Big Picture
 Algorithm knowledge + platform knowledge
 → Spiral
 → optimized implementation (regenerated for every new platform)
Organization
 Spiral: Linear Transforms
  Basic system
  Parallelism
  General input size
 Beyond transforms
 Conclusion
Challenge: General-Size Libraries
 Spiral so far: fixed-size code
 Challenge: a recursive general-size library (like FFTW or MKL)

Fixed-size code:
DFT_384(x, y) {
  …
  for(i = …) {
    t[2i]   = x[2i] + x[2i+1];
    t[2i+1] = x[2i] - x[2i+1];
  }
  …
}
• Algorithm fixed
• Nonrecursive code

General-size library:
DFT(n, x, y) {
  …
  for(i = …) {
    DFT_strided(m, x + m*i, y + i, 1, k);
  }
  …
}
• Algorithm cannot be fixed
• Recursive code
• Creates many challenges

Possible? A vectorized, parallelized, general-size, adaptive library?
Challenge: Recursion Steps
 Cooley-Tukey FFT
 Implementation that increases locality (e.g., FFTW 2.x):

DFT(int n, complex *Y, complex *X) {
  k = choose_factor(n);
  // k DFTs of size n/k, with strided I/O
  for (i = 0; i < k; i++)
    DFT_strided(n/k, k, 1, Y + (n/k)*i, T + (n/k)*i);
  // n/k DFTs of size k, scaled by precomputed twiddle factors
  for (i = 0; i < n/k; i++)
    DFT_scaled(k, n/k, precomputed_f, Y + i, Y + i);
}
DFT_strided(int n, int is, int os, complex *Y, complex *X) { … }
DFT_scaled(int n, int s, complex *F, complex *Y, complex *X) { … }

 Two additional functions (recursion steps) are needed
 How can they be discovered automatically?
∑-SPL: Basic Idea
 Four additional matrix constructs: ∑, G, S, Perm
  ∑ (sum) – explicit loop
  G_f (gather) – load data with index mapping f
  S_f (scatter) – store data with index mapping f
  Perm_f – permute data with index mapping f
 ∑-SPL formulas = matrix factorizations
 Example (see the sketch below):
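A minimal sketch of such a factorization, following the published ∑-SPL work (h_{b,s} denotes the affine index mapping i ↦ b + s·i):

I_m \otimes A_n \;=\; \sum_{j=0}^{m-1} S_{h_{jn,1}}\, A_n\, G_{h_{jn,1}}

The loop body applies A_n; the gather/scatter pair makes the loop’s data accesses explicit, so they can later be merged and simplified.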
Find Recursion Step Closure
 Input: transform T and a breakdown rule
 Output: recursion step closure + implementation of each recursion step
 Algorithm:
  1. Apply the breakdown rule to T
  2. Convert to ∑-SPL
  3. Apply loop merging + index simplification rules
  4. Extract the required recursion steps
  5. Repeat until closure is reached
Recursion Step Closure: Example
 DFT: scalar code [closure graph omitted]
 DCT4: vector code [closure graph omitted]
Summary: Complete Automation for Transforms
• Memory hierarchy optimization: rewriting and search for algorithm selection; rewriting for loop optimizations
• Vectorization: rewriting
• Parallelization: rewriting
 → fixed input size code
• Derivation of library structure: rewriting and other methods
 → general input size library
DFT on Intel Multicore
 [Performance plot omitted]
 Spiral: a 5 MB vectorized, threaded, general-size, adaptive library
Often it Looks Like That
DCTs on a 2.66 GHz Core2 (4-way SSSE3)

[Plot: Performance [Gflop/s], 0–12, vs. input size n, 32 to 384.
Spiral-generated code clearly outperforms the available libraries.]
Generated Functions for Intel IPP 6.0
• 1M lines of code
• 3984 C functions
• Functionality: cross product of
  Transforms: DFT (fwd+inv), RDFT (fwd+inv), DCT2, DCT3, DCT4, DHT, WHT
  Sizes: 2–64 (DFT, RDFT, DHT); 2-powers (DCTs, WHT)
  Precision: single, double
  Data type: scalar, SSE, AVX (DFT, DCT), LRB (DFT)
• Written by a computer
Organization
 Spiral: Linear Transforms
 Beyond transforms
 Conclusion
Going Beyond Transforms
 Linear transforms: SPL (linear)
  Basic matrices
  Matrix operators
 Generalization: OL (operator language)
  Basic operators
  Higher-order operators

Example: Software-Defined Radio (SDR)
 Requires operator definitions and additional rules
 [Performance results with SpiralGen, Jan 2010]
Organization
 Spiral: Linear Transforms
 Beyond transforms
 Conclusion
Related Work
 Automatic performance tuning
  Linear algebra: Phipac/ATLAS (UTK), Sparsity/Bebop (Berkeley)
  Tensor computations (Ohio State)
  Adaptive sorting (UIUC)
  Fourier transform: FFTW (MIT)
 Program synthesis
  Flame (UT Austin)
  FFTW basic block generator
  Generative programming (WG 2.11)
 Iterative compilation (U. Paris Sud, U. Edinburgh, …)
Spiral: Key Points
 Spiral:
  A successful approach to automating the development of computing software
  Commercial proof of concept
 Key ideas:
  Algorithm knowledge: domain-specific symbolic representation
  Platform knowledge: tagged rewrite rules, SIMD specification
  Interdisciplinary approach

Example of generated code (512-bit SIMD intrinsics):
void dft64(float *Y, float *X) {
  __m512 U912, U913, U914, U915, ...
  __m512 *a2153, *a2155;
  a2153 = ((__m512 *) X);
  s1107 = *(a2153);
  s1108 = *((a2153 + 4));
  t1323 = _mm512_add_ps(s1107, s1108);
  t1324 = _mm512_sub_ps(s1107, s1108);
  <many more lines>
  U926 = _mm512_swizupconv_r32(…);
  s1121 = _mm512_madd231_ps(_mm512_mul_ps(_mm512_mask_or_pi(
      _mm512_set_1to16_ps(0.70710678118654757), 0xAAAA, a2154, U926), t1341),
      _mm512_mask_sub_ps(_mm512_set_1to16_ps(0.70710678118654757), …),
      _mm512_swizupconv_r32(t1341, _MM_SWIZ_REG_CDAB));
  U927 = _mm512_swizupconv_r32
  <many more lines>
}
More Information:
www.spiral.net
Don’t Forget!
Matrix-matrix multiplication (MMM) on 2 x Core 2 Duo, 3 GHz

[Plot repeated from earlier: Performance [Gflop/s], 0–50, vs. matrix size, 0 to 9,000,
with speedups from memory hierarchy: 20x, vector instructions: 4x, multiple threads: 4x.]
The performance/productivity challenge touches programming languages, program generation, compilers, software engineering, machine learning, algorithms, mathematics, symbolic computation, and rewriting.

Great opportunities for interdisciplinary high-impact research