How to Write Fast Numerical Code
A Small Introduction
Markus Püschel
Computer Science
ETH Zürich, Switzerland
Picture: www.tapety-na-pulpit.org
September 1st
Electrical and Computer Engineering
Carnegie Mellon University
Computer Science
ETH Zürich
PhD and Postdoc openings (see also www.spiral.net):
High performance computing
Compilers
Theory
Programming languages/Generative programming
FPGAs
Scientific Computing: physics/biology simulations
Consumer Computing: audio/image/video processing
Embedded Computing: signal processing, communication, control
Unlimited need for performance
Large set of applications, but relatively small set of critical components (100s to 1000s):
Matrix multiplication
Discrete Fourier transform (DFT)
Viterbi decoder
Shortest path computation
Solving linear systems
….
The Problem: Example 1
DFT (single precision) on Intel Core i7 (4 cores, 2.66 GHz)
[Plot: performance in Gflop/s (0–40) vs. input size (16 to 1M).
Upper curve: fastest code (1 MB of source). Lower curve: straightforward "good" C code (1 KB of source).
Annotated speedup gaps: 12x and 35x.
Both: vendor compiler, best flags; roughly the same operation count.]
The Problem: Example 2
Matrix Multiplication (MMM) on 2 x Core 2 Duo 3 GHz
[Plot: performance in Gflop/s (0–50) vs. matrix size (0 to 9,000).
Upper curve: fastest code (100 KB of source). Lower curve: triple loop (< 1 KB of source).
Annotated speedup gap: 160x.
Both: vendor compiler, best flags; exactly the same operation count (2n³).]
What is going on?
Evolution of Processors (Intel)
[Figure: processor evolution over time; annotation: the era of parallelism]
And There Will Be Variety …
Arm Cortex A9
Nvidia G200
TI TNETV3020
Tilera Tile64
Core i7
Source: IEEE SP Magazine, Vol. 26, November 2009
DFT (single precision) on Intel Core i7 (4 cores, 2.66 GHz)
[Plot: performance in Gflop/s (0–40) vs. input size (16 to 1M), with the gains attributed to each optimization:
Multiple threads: 3x
Vector instructions: 3x
Memory hierarchy: 5x]
The compiler doesn't do the job; doing it by hand is a nightmare.
Matrix-Matrix Multiplication (MMM) on 2 x Core 2 Duo 3 GHz
[Plot: performance in Gflop/s (0–50) vs. matrix size (0 to 9,000), with the gains attributed to each optimization:
Multiple threads: 4x
Vector instructions: 4x
Memory hierarchy: 20x]
The compiler doesn't do the job; doing it by hand is a nightmare.
Summary and Facts I
Implementations with the same operation count can have vastly different performance (up to 100x and more)
A cache miss can be 100x more expensive than an operation
Vector instructions
Multiple cores = processors on one die
Minimizing operations ≠ maximizing performance
End of free speed-up for legacy code
Future performance gains through increasing parallelism
Summary and Facts II
It is very difficult to write the fastest code
Tuning for memory hierarchy
Vector instructions
Efficient parallelization (multiple threads)
Requires expert knowledge in algorithms, coding, and architecture
Fast code can be large
Can violate “good” software engineering practices
Compilers often can’t do the job
Often intricate changes in the algorithm are required
Automatic parallelization/vectorization is still unsolved
Highest performance is in general non-portable
Performance/Productivity Challenge
Current Solution
Legions of programmers implement and optimize the same functionality for every platform, and again whenever a new platform comes out.
Better Solution: Automation
Automate (parts of) the implementation or optimization
Research efforts:
Linear algebra: PHiPAC/ATLAS, LAPACK, Sparsity/Bebop/OSKI, FLAME
Tensor computations
PDE/finite elements: FEniCS
Adaptive sorting
Fourier transform: FFTW
Linear transforms: Spiral
… others
New compiler techniques
Promising new area, but much more work needed …
Proceedings of the IEEE special issue, Feb. 2005
This Tutorial
Increasing awareness for the problem
Focus: optimization for the memory hierarchy
Examples considered: MMM and DFT
Techniques from state-of-the-art research and implementations
Extract general principles for application to other numerical problems
Get a glimpse of how automatic performance tuning works …
MMM generator ATLAS
Adaptive DFT library FFTW
… and how automatic code production works
Program generator Spiral
Matrix-Matrix Multiplication
MMM
References used:
J. Demmel, Applied Numerical Linear Algebra, SIAM, 1997
R. C. Whaley, A. Petitet, and J. Dongarra, Automated empirical optimization of
software and the ATLAS project, Parallel Comput., 27(1–2), pp. 3–35, 2001
K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, and P. Stodghill,
Is Search Really Necessary to Generate High-Performance BLAS?, Proc. IEEE,
93(2), pp. 358–386, Feb. 2005
Linear Algebra Algorithms: Examples
Solving systems of linear equations
Computation of eigenvalues
Singular value decomposition
LU/Cholesky/QR/… decompositions
… many others
Make up large parts of the numerical computation across disciplines
(sciences, computer science, engineering)
Efficient software is extremely relevant
The Path to LAPACK
1960s/70s: EISPACK and LINPACK
Libraries for linear algebra algorithms
Cleve Moler et al.
Vector-based implementation
Problem: no locality; low performance became apparent in the 80s (deep memory hierarchies: CPU → cache → main memory)
Solution: LAPACK (late 80s, early 90s)
Reimplement the algorithms "block-based," i.e., with locality
Jim Demmel, Jack Dongarra et al.
Reuse and Blocking
Reuse = number of operations / size of data
Compute the reuse of:
Vector-vector operations (e.g., adding two vectors)
Matrix-vector operations (e.g., matrix-vector multiplication)
Matrix-matrix operations (e.g., MMM)
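As a quick check (standard operation counts, not spelled out on the slide): adding two vectors of length n performs n operations on roughly 3n data, so the reuse is O(1); matrix-vector multiplication performs about 2n² operations on about n² data, again O(1); MMM performs 2n³ operations on 3n² data, so the reuse is O(n); an FFT performs O(n log(n)) operations on O(n) data, so the reuse is O(log(n)).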
[Plots: performance vs. problem size for MMM (matrix sizes up to 9,000) and for the DFT (input sizes 16 to ~1M).
MMM: O(n) reuse. DFT: O(log(n)) reuse.
Annotation: performance drop outside the L2 cache.]
LAPACK and BLAS
Basic idea:
LAPACK: static
BLAS: reimplemented for each platform
BLAS = Basic Linear Algebra Subroutines
BLAS1: vector-vector operations; O(1) reuse
BLAS2: matrix-vector operations; O(1) reuse
BLAS3: matrix-matrix operations; O(n) reuse (mostly MMM)
LAPACK implemented on top of BLAS
Blocking: BLAS 3 is used as much as possible
MMM and Algorithms
Most important BLAS3 function and in all of numerical computing
C = A·B or C = C + A·B with A, B, C matrices (assume square)
By definition: 2n³ operations, i.e., O(n³)
The basis for practically all existing implementations
Strassen's algorithm: O(n^log2(7)) ≈ O(n^2.808)
Crossover point n = 654
Structurally more complex, worse numerical stability
Rarely implemented
Best known: Coppersmith/Winograd: O(n^2.376)
Large constant
Very complex
No implementation exists
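For reference, a minimal C version of the "triple loop" baseline from the plot above (not code from the slides); it performs exactly the 2n³ operations but ignores the memory hierarchy:

/* Naive MMM: C = C + A*B for n x n matrices in row-major layout.
   n^3 multiplications and n^3 additions = 2n^3 flops. */
void mmm_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = C[i*n + j];           /* scalar replacement of C[i][j] */
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}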
Optimizing MMM
Live demo:
Why blocking, and its effect on cache behavior
Blocking for cache and registers
Unrolling and scalar replacement
Final Code: Sketch
// MMM loop nest: blocking for cache (block size NB)
for i = 0:NB:N-1
  for j = 0:NB:M-1
    for k = 0:NB:K-1
      // mini-MMM loop nest: blocking for registers (MU x NU x KU)
      for i' = i:MU:i+NB-1
        for j' = j:NU:j+NB-1
          for k' = k:KU:k+NB-1
            // micro-MMM loop nest: unrolled, with scalar replacement
            for k'' = k':1:k'+KU-1
              for i'' = i':1:i'+MU-1
                for j'' = j':1:j'+NU-1
                  c[i''][j''] += a[i''][k'']*b[k''][j'']
Optimizations applied: blocking for cache, blocking for registers, unrolling, scalar replacement, other …
Degrees of freedom NB, MU, NU, KU, LS can be used for platform adaptation
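A compilable C sketch of the outermost (cache) level only, under the simplifying assumptions that NB divides n and that register blocking, unrolling, and scalar replacement are left to the lower levels as in the loop nest above (hypothetical helper, not ATLAS code):

#define NB 64   /* cache block size: one of the degrees of freedom, tuned per platform */

/* One level of blocking: the three outer loops walk over NB x NB blocks,
   the inner loops perform the mini-MMM on data that now fits in cache.
   Assumes n is a multiple of NB; row-major layout; computes C = C + A*B. */
void mmm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i += NB)
        for (int j = 0; j < n; j += NB)
            for (int k = 0; k < n; k += NB)
                /* mini-MMM on the blocks starting at (i,j), (i,k), (k,j) */
                for (int i2 = i; i2 < i + NB; i2++)
                    for (int k2 = k; k2 < k + NB; k2++) {
                        double a = A[i2*n + k2];   /* scalar replacement */
                        for (int j2 = j; j2 < j + NB; j2++)
                            C[i2*n + j2] += a * B[k2*n + j2];
                    }
}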
Automation: BLAS Generator ATLAS/PhiPAC
[Diagram: ATLAS infrastructure.
"Detect Hardware Parameters" measures L1Size, NR, MulAdd, L*.
The ATLAS Search Engine chooses NB, MU, NU, KU, xFetch, MulAdd, Latency and emits mini-MMM source.
The source is compiled, executed, and measured (Mflop/s); the result feeds back into the search.]
Idea: Automatic porting of LAPACK
LAPACK: static
BLAS: generated by the ATLAS MMM code generator, regenerated for each platform
But: parallelism killed it
Generative approach not suitable for vectorization
LAPACK/BLAS division suboptimal on multicore systems
Discrete Fourier Transform
DFT
References used:
Van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM,
1992
M. Frigo and S. Johnson, FFTW: An Adaptive Software Architecture for the FFT,
Proc. ICASSP, 1998, www.fftw.org
M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong,
F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo,
Spiral: Code Generation for DSP Transforms, Proc. IEEE 93(2), 2005, pp. 232-275
Linear Transforms
Linear transform = matrix-vector multiplication: y = T · x (output vector y, transform matrix T, input vector x)
Example: discrete Fourier transform (DFT)
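For concreteness, the standard definition (not spelled out on the slide): DFT_n = [ω_n^(kl)] for k, l = 0, …, n−1 with ω_n = e^(−2πi/n), so the transform computes y = DFT_n · x.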
Cooley/Tukey Fast Fourier Transform (FFT)
[Figure: Cooley/Tukey FFT for n = 4 = 2 × 2; annotated operation counts: 12 adds, 4 mults; stage annotations: 4 adds, 1 mult, 4 adds]
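A minimal C sketch (not code from the slides) of the recursive radix-2 Cooley/Tukey FFT, assuming n is a power of two; it shows only the divide-and-conquer structure, with none of the locality or vectorization optimizations discussed later:

#include <complex.h>

#define PI 3.14159265358979323846

/* Recursive radix-2 decimation-in-time FFT: y = DFT_n(x), n a power of two.
   The input is read at stride s so the even/odd splits need no copying. */
static void fft_rec(int n, int s, const double complex *x, double complex *y)
{
    if (n == 1) { y[0] = x[0]; return; }
    fft_rec(n/2, 2*s, x,     y);         /* even-indexed inputs -> y[0 .. n/2-1] */
    fft_rec(n/2, 2*s, x + s, y + n/2);   /* odd-indexed inputs  -> y[n/2 .. n-1] */
    for (int k = 0; k < n/2; k++) {
        double complex t = cexp(-2.0 * PI * I * k / n) * y[k + n/2];  /* twiddle */
        double complex u = y[k];
        y[k]       = u + t;              /* butterfly */
        y[k + n/2] = u - t;
    }
}

void dft(int n, const double complex *x, double complex *y) { fft_rec(n, 1, x, y); }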
FFT Dataflow
[Figure: dataflow for n = 16 = 4 × 4]
Optimizing Cooley-Tukey FFT and FFTW
Live
FFT as structured matrix-factorization
Recursive FFT for memory locality
Fusing steps for memory locality
Precomputing constants
Unrolling and scalar replacement
Adaptive library FFTW
Optimization for the Memory Hierarchy
Try to understand the cache behavior of your algorithm
Use recursive algorithms to increase locality/reuse
MMM: blocking, DFT: recursive FFT
Minimize passes over the dataset to increase locality/reuse
Use recursive algorithms; DFT: fuse computation steps
Once the recursion reaches a small problem size:
unroll the computation, use scalar replacement (x[i]→t)
MMM: for last level of blocking (for registers), DFT: for small DFT sizes
Use an initialization routine for things that have to be done once
DFT: precompute sines and cosines in the twiddle factors, best recursion strategy
Identify degrees of freedom and figure out the best choice
MMM: block sizes, DFT: in FFT recursion (factorization n = km)
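As a small illustration of the unrolling/scalar-replacement point above (a hypothetical 2x2 micro-kernel, not ATLAS or FFTW code): the loops are fully unrolled and array elements are copied into local scalars so the compiler can keep them in registers:

/* Fully unrolled 2x2 MMM update C = C + A*B with scalar replacement. */
void mmm_2x2(const double *A, const double *B, double *C)
{
    double c00 = C[0], c01 = C[1], c10 = C[2], c11 = C[3];
    double a00 = A[0], a01 = A[1], a10 = A[2], a11 = A[3];
    double b00 = B[0], b01 = B[1], b10 = B[2], b11 = B[3];
    c00 += a00*b00 + a01*b10;    c01 += a00*b01 + a01*b11;
    c10 += a10*b00 + a11*b10;    c11 += a10*b01 + a11*b11;
    C[0] = c00;  C[1] = c01;  C[2] = c10;  C[3] = c11;
}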
So Far
We have seen techniques to improve performance
We have seen tools that provide some automation
ATLAS
FFTW
But much is still manual:
Any form of parallelization
Choice of parameters (ATLAS)
Design of library (FFTW)
Is it possible to build a tool that automatically produces a highest-performance library?
Including support for parallelism
Performance of generated code should match best hand-tuned code
Spiral
Computer Synthesis of High Performance Libraries
Spiral team (only part shown)
Joint work with …
Franz Franchetti
Yevgen Voronenko
Srinivas Chellappa
Frédéric de Mesmay
Daniel McFarlin
José Moura
James Hoe
David Padua
Jeremy Johnson
…
References used:
At www.spiral.net
Current Approach
Algorithm knowledge + platform knowledge → brute force → optimized implementation (redone for every new platform)
Platform knowledge, e.g., vector intrinsics:
_mm_set1_epi8(x) = …
_mm_xor_si128(x,y) = …
_mm_avg_epu8(x,y) = …
_mm_cmpeq_epi8(x,y) = …
_mm_unpacklo_epi8(x,y) = …
…
Automation:
Algorithm knowledge + platform knowledge → Spiral → optimized implementation (regenerated for every new platform)
Platform knowledge, e.g., vector intrinsics:
_mm_set1_epi8(x) = …
_mm_xor_si128(x,y) = …
_mm_avg_epu8(x,y) = …
_mm_cmpeq_epi8(x,y) = …
_mm_unpacklo_epi8(x,y) = …
…
Organization
Spiral: Linear Transforms
Basic system
Parallelism
General input size
Beyond transforms
Conclusion
Linear Transforms
Mathematically: matrix-vector multiplication y = T · x (output vector y, transform matrix T, input vector x)
Example: discrete Fourier transform (DFT)
Transform Algorithms: Example 4-point FFT
Cooley/Tukey fast Fourier transform (FFT):
SPL (signal processing language): mathematical, declarative, point-free
Divide-and-conquer algorithms: Breakdown rules in SPL
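For n = km, one standard form of the Cooley/Tukey breakdown rule in SPL (as written in the Spiral and FFT literature) is
DFT_km → (DFT_k ⊗ I_m) · T^km_m · (I_k ⊗ DFT_m) · L^km_k
where T^km_m is the diagonal matrix of twiddle factors and L^km_k is a stride permutation.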
Breakdown Rules (>200 for >40 Transforms)
Breakdown rules in Spiral = algorithm knowledge
(≈200 journal papers)
Combining these rules yields many algorithms for every given transform
SPL to Code
Correct code: easy
Fast code: very difficult
Program Generation in Spiral (Sketched)
[Diagram: program generation in Spiral.
Transform (user specified)
→ breakdown rules (many choices)
→ fast algorithm in SPL (parallelization, vectorization)
→ Σ-SPL (loop optimizations)
→ C code (constant folding, scheduling, …)
Optimization happens at all abstraction levels; the process is iterated to search for the fastest result.]
But that's not all …
Organization
Spiral: Linear Transforms
Basic system
Parallelism
General input size
Beyond transforms
Conclusion
SPL to Shared Memory Code: Issues
Good SPL construct: a block-diagonal formula such as y = (I_4 ⊗ A) x: each of the four processors applies A to its own contiguous chunk of x and y.
Bad SPL construct: permutations (data accesses): elements move across cacheline boundaries, which causes false sharing between processors.
Parallelization: Basic Idea
Identify crucial hardware parameters
Number of processors: p
Cache line size: μ
Identify “good” SPL formulas
Use rewriting: SPL formulas → "good" SPL formulas
Parallelization by Rewriting
Rewriting rules in Spiral = platform knowledge
Example
"Good" SPL formula (load-balanced, no false sharing) in the sense of our definition
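A concrete instance (a simplified case, following the Spiral papers): since I_m = I_p ⊗ I_{m/p}, the formula I_m ⊗ A rewrites to I_p ⊗ (I_{m/p} ⊗ A), i.e., p equal-sized independent blocks of work, one per processor, with no data shared between them.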
Same Approach for Different Paradigms
Threading:
Vectorization:
GPUs:
Verilog for FPGAs:
Rigorous, correct by construction
Overcomes compiler limitations
Back to the Big Picture
Algorithm knowledge
Platform knowledge
Spiral
Optimized implementation
(regenerated for every new platform)
Organization
Spiral: Linear Transforms
Basic system
Parallelism
General input size
Beyond transforms
Conclusion
Challenge: General Size Libraries
Spiral so far: fixed size code
Challenge: recursive general size library (as in FFTW, MKL)

Fixed size code:
DFT_384(x, y) {
  …
  for(i = …) {
    t[2i]   = x[2i] + x[2i+1]
    t[2i+1] = x[2i] - x[2i+1]
  }
  …
}
• Algorithm fixed
• Nonrecursive code

General size library:
DFT(n, x, y) {
  …
  for(i = …) {
    DFT_strided(m, x+mi, y+i, 1, k)
  }
  …
}
• Algorithm cannot be fixed
• Recursive code
• Creates many challenges
Possible?
Vectorized, parallelized,
general-size, adaptive library
Challenge: Recursion Steps
Cooley-Tukey FFT
Implementation that increases locality (e.g., FFTW 2.x)
DFT(int n, complex *Y, complex *X) {
  k = choose_factor(n);
  // sketch: T = temporary buffer, precomputed_f = precomputed twiddle factors (setup omitted)
  for i = 0 to k-1
    DFT_strided(n/k, k, 1, Y + (n/k)*i, T + (n/k)*i)
  for i = 0 to n/k-1
    DFT_scaled(k, n/k, precomputed_f, Y + i, Y + i)
}
DFT_strided(int n, int is, int os, complex *Y, complex *X) {…}
DFT_scaled(int n, int s, complex *F, complex *Y, complex *X) {…}
2 additional functions (recursion steps) needed
How to discover automatically?
Σ-SPL: Basic Idea
Four additional matrix constructs: Σ, G, S, Perm
Σ (sum): explicit loop
G_f (gather): load data with index mapping f
S_f (scatter): store data with index mapping f
Perm_f: permute data with the index mapping f
Σ-SPL formulas = matrix factorizations
Example:
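(The slide's formula is not reproduced here. As a simple illustration of the constructs: a gather G_f with f(i) = 2i reads the even-indexed elements of its input, (G_f x)_i = x_{2i}, and the corresponding scatter S_f writes its input back to exactly those positions.)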
Find Recursion Step Closure
Input: transform T and a breakdown rule
Output: recursion step closure + implementation of each recursion step
Algorithm:
1. Apply the breakdown rule to T
2. Convert to Σ-SPL
3. Apply loop merging + index simplification rules.
4. Extract required recursion steps
5. Repeat until closure is reached
Recursion Step Closure: Example
DFT: scalar code
DCT4: vector code
Summary: Complete Automation for Transforms
• Memory hierarchy optimization: rewriting and search for algorithm selection; rewriting for loop optimizations
• Vectorization: rewriting
• Parallelization: rewriting
→ fixed input size code
• Derivation of library structure: rewriting and other methods
→ general input size library
DFT on Intel Multicore
[Plot: DFT performance; curve labeled "Spiral": 5 MB vectorized, threaded, general-size, adaptive library]
Often it Looks Like That
DCTs on 2.66 GHz Core2 (4-way SSSE3)
[Plot: performance in Gflop/s (0–12) vs. input size n (32 to 384).
Upper curve: Spiral-generated. Lower curves: available libraries.]
Generated Functions for Intel IPP 6.0
• 1M lines of code
• 3984 C functions
• Functionality: Cross product of
Transforms: DFT (fwd+inv), RDFT (fwd+inv), DCT2, DCT3, DCT4, DHT, WHT
Sizes: 2–64 (DFT, RDFT, DHT); 2-powers (DCTs, WHT)
Precision: single, double
Data type: scalar, SSE, AVX (DFT, DCT), LRB (DFT)
Written by a computer
Organization
Spiral: Linear Transforms
Beyond transforms
Conclusion
Going Beyond Transforms
Linear transforms: SPL (linear), built from basic matrices and matrix operators
Generalization: OL, built from basic operators and higher-order operators
Example: software-defined radio (SDR), plus operator definitions and additional rules (with SpiralGen, Jan 2010)
Organization
Spiral: Linear Transforms
Beyond transforms
Conclusion
Related Work
Automatic performance tuning
Linear algebra: PHiPAC/ATLAS (UTK), Sparsity/Bebop (Berkeley)
Tensor computations (Ohio State)
Adaptive sorting (UIUC)
Fourier transform: FFTW (MIT)
Program synthesis
Flame (UT Austin)
FFTW basic block generator
Generative programming (WG 2.11)
Iterative compilation
(U. Paris Sud, U. Edinburgh, …)
Spiral: Key Points
Spiral:
Successful approach to automating
the development of computing software
Commercial proof-of-concept
Key ideas:
Algorithm knowledge:
Domain specific symbolic representation
Platform knowledge:
Tagged rewrite rules, SIMD specification
Interdisciplinary approach
/* Example of Spiral-generated code: DFT of size 64, vectorized with 512-bit intrinsics */
void dft64(float *Y, float *X) {
__m512 U912, U913, U914, U915,...
__m512 *a2153, *a2155;
a2153 = ((__m512 *) X); s1107 = *(a2153);
s1108 = *((a2153 + 4));
t1323 = _mm512_add_ps(s1107,s1108);
t1324 = _mm512_sub_ps(s1107,s1108);
<many more lines>
U926 = _mm512_swizupconv_r32(…);
s1121 = _mm512_madd231_ps(_mm512_mul_ps(_mm512_mask_or_pi(
_mm512_set_1to16_ps(0.70710678118654757),0xAAAA,a2154,U926),t1341),
_mm512_mask_sub_ps(_mm512_set_1to16_ps(0.70710678118654757),…),
_mm512_swizupconv_r32(t1341,_MM_SWIZ_REG_CDAB));
U927 = _mm512_swizupconv_r32
<many more lines>
}
More Information:
www.spiral.net
Don’t Forget!
Matrix-Matrix Multiplication (MMM) on 2 x Core 2 Duo 3 GHz
[Plot: performance in Gflop/s (0–50) vs. matrix size (0 to 9,000), with the gains attributed to each optimization:
Multiple threads: 4x
Vector instructions: 4x
Memory hierarchy: 20x]
Programming languages
Program generation
Compilers
Software engineering
Performance/Productivity Challenge
Machine learning
Algorithms
Mathematics
Symbolic Computation
Rewriting
Great opportunities for interdisciplinary high-impact research
© Copyright 2025