Section 3
Optimising Code &
Identifying Areas for Parallelism
Hubert Haberstock
Technical Consulting Engineer
Intel Software and Services Group
Agenda
Optimization – by Compiler
– Inter-procedural Optimisations
– Profile Guided Optimisations
– Vectorisation
– Parallelisation
Identify Areas for Parallelization – by VTune
– Identifying Hotspots
BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Core
Inside, FlashFile, i960, InstantIP, Intel, Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside,
Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel
NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside,
MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro
Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.
* Other names and brands may be claimed as the property of others.
Copyright (C) 2008, Intel Corporation. All rights reserved.
Nov 20, 2008
Optimisation
Why optimise serial code (why not just make it parallel)?
Amdahl's law tells us: serial code limits speedup
Tparallel = {(1-P) + P/n} Tserial
Speedup = Tserial / Tparallel
n = number of processors
[Figure: Tserial is split into a serial part (1-P) and a parallel part P. With more processors the parallel part shrinks to P/2, …, P/∞, while (1-P) stays the same.]
Rule Number One: Speed up your original code first!
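The two formulas on this slide can be evaluated directly. The following is a minimal sketch; the function names amdahl_speedup and amdahl_limit are made up for illustration and do not come from the slides.

```cpp
#include <cassert>

// Amdahl's law as stated on the slide:
//   Tparallel = ((1 - P) + P / n) * Tserial
//   Speedup   = Tserial / Tparallel
// parallel_fraction is P (between 0 and 1), processors is n.
double amdahl_speedup(double parallel_fraction, int processors) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(processors >= 1);
    // Tparallel expressed as a fraction of Tserial.
    double relative_parallel_time =
        (1.0 - parallel_fraction) + parallel_fraction / processors;
    return 1.0 / relative_parallel_time;
}

// Limit for n -> infinity: only the serial part (1 - P) remains.
double amdahl_limit(double parallel_fraction) {
    assert(parallel_fraction >= 0.0 && parallel_fraction < 1.0);
    return 1.0 / (1.0 - parallel_fraction);
}
```

For example, with P = 0.9 eight processors give a speedup of about 4.7, and no processor count can push it past 10. That gap is why the serial part is worth optimising first.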
Common Optimization Switches
                                                    Windows        Linux & Mac OS*
Disable optimization                                /Od            -O0
Optimize for speed (no code size increase)          /O1            -O1
Optimize for speed (default)                        /O2            -O2
High-level optimizer, including prefetch, unroll    /O3            -O3
Create symbols for debugging                        /Zi            -g
Inter-procedural optimization                       /Qipo          -ipo
Profile guided optimization (multi-step build)      /Qprof-gen     -prof-gen
                                                    /Qprof-use     -prof-use
Optimize for speed, including IPO                   /fast          -fast
OpenMP 2.5 support                                  /Qopenmp       -openmp
Automatic parallelization                           /Qparallel     -parallel
Interprocedural Optimizations
IPO: Two Step Process
Enables inlining, better register usage, dead code elimination, etc.
Usage:
icpc -ip: single file IPO
icpc -ipo: multi-file IPO
Link time code generation - increases build time
Pass 1, Compiling: icpc -c -ipo a.cxx b.cxx (produces ipo objects)
Pass 2, Linking: icpc -ipo a.o b.o (produces executable)
Usability Tips:
• Try IPO on performance critical files/libs
• Don’t run IPO on tens of thousands of object files; avoid unnecessarily increased build time
• Remember to link with the -ipo option
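A small sketch of the kind of code that benefits from multi-file IPO. Both functions appear in one listing only so that the example compiles on its own; the split into a.cxx and b.cxx follows the file names in the slide's command lines, while the function names scale and scaled_sum are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>

// --- imagine this function in b.cxx ---
// A tiny helper: so cheap that the call overhead dominates unless it is inlined.
float scale(float value, float factor) {
    return value * factor;
}

// --- imagine this function in a.cxx ---
// Compiled separately, a.cxx sees only a declaration of scale() and the
// compiler must emit a real call per element. With -ipo on both the compile
// and the link step, the call can be inlined across the file boundary.
float scaled_sum(const float* data, std::size_t count, float factor) {
    assert(data != nullptr || count == 0);
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        sum += scale(data[i], factor);
    return sum;
}
```

Whether the compiler actually inlines a given call is its own decision; the optimization report is the place to check.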
Interprocedural Optimization
Extends optimizations across file boundaries
[Figure: Without IPO, file1.c, file2.c, file3.c and file4.c each go through a separate Compile & Optimize step. With IPO, all four files go through one Compile & Optimize step.]
-ip: Only between modules of one source file
-ipo: Modules of multiple files/whole application
Profile-Guided Optimizations
Optimizing with runtime feedback
Enhances all optimizations, especially IPO, register allocation, instruction cache usage, switch statement optimization, etc.
Code-Coverage and Test-Prioritization tools use PGO technology
Usability Tips:
- Run on “typical” input dataset(s)
- Each run generates a data file
- Compiler calculates averages of all runs
Profile-Guided Optimizations (PGO)
Use execution-time feedback to guide (final) optimization
Helps I-cache, paging, branch-prediction
Enabled optimizations:
– Basic block ordering
– Better register allocation
– Better decision on which functions to inline
– Function ordering
– Switch-statement optimization
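A sketch of code where a profile could help with switch-statement optimization and block ordering. The enum PacketKind, the two functions, and the assumed frequencies of each case are invented for illustration. The multi-step build would be: compile with -prof-gen, run on typical input, then recompile with -prof-use.

```cpp
#include <cassert>
#include <cstddef>

enum PacketKind { kData, kAck, kError };  // hypothetical example type

// From the source alone the compiler cannot tell which case dominates.
// A profile records how often each case ran, so the final build can test
// the common case first and keep the hot path together in the I-cache.
int handling_cost(PacketKind kind) {
    switch (kind) {
        case kError: return 100;  // assumed rare in a typical run
        case kAck:   return 2;
        case kData:  return 1;    // assumed to be the common case
    }
    return 0;  // not reached for valid enumerators
}

// Hot loop: this is where the instrumented run collects its counts.
int total_cost(const PacketKind* kinds, std::size_t count) {
    assert(kinds != nullptr || count == 0);
    int total = 0;
    for (std::size_t i = 0; i < count; ++i)
        total += handling_cost(kinds[i]);
    return total;
}
```

The profile is only as good as the training input: if the instrumented run sees mostly kError, the final build is tuned for the wrong case.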
Automatic Compiler Vectorization
Processor Specific Optimizations
Automatically generates SSE/SSE2/SSE3/SSSE3/SSE4 vector instructions
Vector processing: operate at once on:
• 4 single precision floating point values
• 2 double precision floating point values
• 4 32-bit integer values
• Etc.
Optimal code generation and instruction scheduling
– Large number of options for advanced control of vectorization
• Specify trip count, ignore dependencies (ivdep), specify alignment,
disable vectorization, etc.
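A sketch of the ivdep control mentioned above. The function name scale_shifted is invented; the loop shape follows the usual textbook case for this pragma. The preprocessor guard is an addition of this sketch so that other compilers do not see an unknown pragma.

```cpp
#include <cassert>

// Writes a[i + k] * c into a[i]. The compiler does not know the value of k,
// so it must assume a possible loop-carried dependence and may refuse to
// vectorize. "#pragma ivdep" tells the Intel compiler to ignore such assumed
// dependences. That promise is only safe here because k >= 0, which this
// sketch checks at run time. The array must hold at least m + k elements.
void scale_shifted(int* a, int k, int c, int m) {
    assert(k >= 0);
#if defined(__INTEL_COMPILER)
#pragma ivdep
#endif
    for (int i = 0; i < m; ++i)
        a[i] = a[i + k] * c;
}
```

With a negative k each iteration would read a value written by an earlier iteration, and asserting independence would produce wrong results once the loop is vectorized.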
Auto-Vectorization (IA-32 and Intel® 64):
Optimizing Loops with SSE/SSE2/SSE3/SSSE3/SSE4
Your Task: convert this…
$ cat w.c
void work( float* a, float *b, float *c, int MAX) {
    for (int I=0;I<=MAX;I++)
        c[I]=a[I]+b[I];
}
[Figure: scalar execution in 128-bit registers. Each add (A[0] + B[0] = C[0], then A[1] + B[1] = C[1]) occupies one lane; the remaining lanes of each register are not used.]
Auto-Vectorization (IA-32 and Intel® 64)
void work( float* a, float *b, float *c, int MAX) {
    for (int I=0;I<=MAX;I++)
        c[I]=a[I]+b[I];
}
$ icc w.c -c -xT
w.c(2) : (col. 3) remark: LOOP WAS VECTORIZED.
[Figure: vectorized execution in 128-bit registers. A[3], A[2], A[1], A[0] are added to B[3], B[2], B[1], B[0] in one operation, giving C[3], C[2], C[1], C[0].]
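What the vectorized loop does can be pictured with a plain scalar emulation: one step per group of four floats, plus a remainder loop. This is a conceptual sketch only, not the code the compiler emits. The function name work_by_fours is invented, and it takes an element count instead of the slide's inclusive MAX.

```cpp
#include <cassert>

// Scalar emulation of the vectorized loop: a 128-bit register holds four
// 32-bit floats, so one "vector" step covers four adjacent elements.
// A scalar remainder loop handles what is left when count is not a
// multiple of four.
void work_by_fours(const float* a, const float* b, float* c, int count) {
    assert(count >= 0);
    const int lanes = 4;
    int i = 0;
    for (; i + lanes <= count; i += lanes) {
        // On SSE hardware these four adds become one packed add.
        c[i + 0] = a[i + 0] + b[i + 0];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < count; ++i)  // remainder: fewer than four elements left
        c[i] = a[i] + b[i];
}
```

The grouping is valid because no element of c depends on another iteration's result, which is the property the vectorizer has to prove before it reports LOOP WAS VECTORIZED.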
Vectorization Report
“Loop was not vectorized” because:
– “Existence of vector dependence”
– “Subscript too complex”
– “Non-unit stride used”
– “Mixed Data Types”
– “Contains unvectorizable statement at line XX”
– “Condition too Complex”
– “Not Inner Loop”
– “Condition may protect exception”
– “Vectorization possible but seems inefficient”
– “Low trip count”
– “Operator unsuited for vectorization”
– “Unsupported Loop Structure”
Compiler Based Vectorization
Automatic Processor Dispatch – ax[?]
Single executable
– Optimized for Intel® Core Duo processors, plus generic code that runs on all IA-32 processors (-axT)
For each target processor it uses:
– Processor-specific instructions
– Vectorization
Low overhead
– Some increase in code size
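The idea behind automatic processor dispatch can be sketched by hand: two versions of the same routine in one executable and a selector that picks one at run time. Everything here is hypothetical (the names work_generic, work_tuned, select_work, and the boolean flag that stands in for a CPU feature check); with -ax the compiler generates the equivalent without source changes.

```cpp
#include <cassert>

typedef void (*WorkFn)(const float*, const float*, float*, int);

// Generic path: stands in for code that runs on any IA-32 processor.
void work_generic(const float* a, const float* b, float* c, int count) {
    for (int i = 0; i < count; ++i)
        c[i] = a[i] + b[i];
}

// Stands in for the processor-specific path. In a real -ax build this
// version would use newer instructions; here it is the same plain C++ so
// that the sketch runs on any machine.
void work_tuned(const float* a, const float* b, float* c, int count) {
    for (int i = 0; i < count; ++i)
        c[i] = a[i] + b[i];
}

// Hypothetical selector. has_target_features represents the result of a
// processor check made once at startup.
WorkFn select_work(bool has_target_features) {
    return has_target_features ? work_tuned : work_generic;
}
```

Carrying both versions is the source of the code size increase noted on the slide; the one-time selection is why the overhead stays low.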
Compiler Optimization Reports
Tells what optimizations were done and, most importantly, gives hints on what prevented a given optimization
Turn on Optimization Reports: -opt-report
Can be read by VTune™ Performance Analyzer
Default report is verbose; recommend selecting a specific optimization phase
• Enable Vectorizer reports: -vec-report3
• Enable Loop Optimizer (-O3): -opt-report-phase hlo
icc hpo.c -c -O3 -xT -vec-report3
loop was not vectorized: existence of vector dependence.
vector dependence: proven FLOW dependence between a line 48, and b line 48.
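A sketch of what a flow dependence looks like in source code, next to a loop without one. The source of hpo.c is not shown in the slides, so both functions (running_sum and elementwise_sum) are invented examples, not the code behind the report above.

```cpp
#include <cassert>

// Flow (read-after-write) dependence: iteration i reads a[i - 1], which
// iteration i - 1 has just written. The iterations cannot run side by side
// in vector lanes, so a vectorizer would be expected to leave this loop
// scalar and report a vector dependence.
void running_sum(float* a, const float* b, int count) {
    assert(count >= 0);
    for (int i = 1; i < count; ++i)
        a[i] = a[i - 1] + b[i];
}

// No dependence between iterations: each c[i] uses only inputs at the same
// index, so the loop is a candidate for vectorization (provided c does not
// overlap a or b).
void elementwise_sum(const float* a, const float* b, float* c, int count) {
    assert(count >= 0);
    for (int i = 0; i < count; ++i)
        c[i] = a[i] + b[i];
}
```

When the report names a dependence, the first question is whether it is real, as in running_sum, or only assumed because the compiler cannot prove that two pointers do not overlap.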
Identifying Hotspots
Pinpointing places where an application could be parallelised
Today’s Question
“Where do I split up my code to take advantage of multiple CPU cores?”
Intel® VTune – lots of things ‘under one hood’
Analyzer Projects
– Counter Monitor Wizard
– Sampler Wizard
– Call Graph Wizard
Threading Wizards
– Thread Checker
– Thread Profile
VTune™ Performance Analyzer
VTune
– Collects performance data
– Organizes and displays results
– Identifies potential performance issues
– Suggests improvements.
Host/Target Environment
Target System
• IA-32 or Itanium® processor family
• Windows*/Linux*
Host System
• Windows or Linux*
• Controls target
• View results of data collection
[Figure: host system and target system linked by a LAN connection]
VTune™ Analyzer Features and Usage Models
Sampling Collects System-wide Performance Data
VTune™ Analyzer Features and Usage Models
Sampling Over Time Views Show How Sampling Data Changes Over Time
VTune™ Analyzer Features and Usage Models
Sampling Source View Displays Source Code Annotated with Performance Data
VTune™ Analyzer Features and Usage Models
Call Graph Collects and Displays Information About the Program Flow of the Application
Analysis - Sampling
Use VTune Sampling to find hotspots in application
Let’s use the project PrimeSingle for analysis
– PrimeSingle <start> <end>
Usage: ./PrimeSingle 1 1000000
bool TestForPrime(int val)
{
    // let's start checking from 3
    int limit, factor = 3;
    limit = (long)(sqrtf((float)val)+0.5f);
    while( (factor <= limit) && (val % factor))
        factor ++;
    return (factor > limit);
}
void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;
    for( int i = start; i <= end; i+= 2 ){
        if( TestForPrime(i) )
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
}
Identifies the time consuming regions
Analysis - Call Graph
Used to find the proper level in the call tree to thread
[Figure callout: This is the level in the call tree where we need to thread]
Analysis
Where to thread?
– FindPrimes()
Is it worth threading a selected region?
– Appears to have minimal dependencies
– Appears to be data-parallel
– Consumes over 95% of the run time
[Figure: Baseline measurement]
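One way to test the "appears to be data-parallel" observation is to thread the loop and compare the result against the serial baseline. The sketch below is an assumption about how that might look, not the course's solution. It counts primes instead of filling the globalPrimes array so that it needs no globals, drops ShowProgress, and uses an OpenMP reduction. The names CountPrimesSerial and CountPrimesThreaded are invented. Without OpenMP enabled, the pragma is ignored and both functions run serially.

```cpp
#include <cassert>
#include <cmath>

// Same test as in the sampling example: trial division by 3, 4, 5, ... up
// to the rounded square root. Only meaningful for odd val.
bool TestForPrime(int val) {
    int factor = 3;
    int limit = static_cast<int>(std::sqrt(static_cast<float>(val)) + 0.5f);
    while ((factor <= limit) && (val % factor))
        factor++;
    return factor > limit;
}

// Serial baseline over the odd numbers in [start, end].
int CountPrimesSerial(int start, int end) {
    assert(start % 2 != 0);  // start is always odd
    int found = 0;
    for (int i = start; i <= end; i += 2)
        if (TestForPrime(i)) found++;
    return found;
}

// Threaded candidate: iterations share no data except the counter, which
// the reduction clause gives each thread a private copy of.
int CountPrimesThreaded(int start, int end) {
    assert(start % 2 != 0);
    int found = 0;
#pragma omp parallel for reduction(+ : found)
    for (int i = start; i <= end; i += 2)
        if (TestForPrime(i)) found++;
    return found;
}
```

The original loop is harder to thread than this: globalPrimes[gPrimesFound++] is a shared update, so a threaded FindPrimes would have to protect it, which is the kind of dependency the analysis step is meant to surface.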
Summary
Always analyze your existing application before implementing any threading or parallelism.
Tools can significantly help you to understand your application and its behavior on the micro-architecture
• Intel VTune™ analyzer can help you in many ways to reveal architectural issues and find the bottlenecks of your program