Section 3
Optimising Code & Identifying Areas for Parallelism

Hubert Haberstock
Technical Consulting Engineer
Intel Software and Services Group
Nov 20, 2008

Agenda
• Optimization by Compiler
  - Inter-procedural Optimisations
  - Profile Guided Optimisations
  - Vectorisation
  - Parallelisation
• Identify Areas for Parallelization by VTune
  - Identifying Hotspots

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Core Inside, FlashFile, i960, InstantIP, Intel, Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.
* Other names and brands may be claimed as the property of others.
Copyright © 2008, Intel Corporation. All rights reserved.

Optimisation: Rule Number One

Why optimise serial code (why not just make it parallel)?

Amdahl's law tells us that serial code limits speedup:

  $T_{parallel} = \{(1 - P) + P/n\} \cdot T_{serial}$

  P = fraction of the serial run time that can be parallelised
  n = number of processors

[Figure: the serial run time $T_{serial}$ split into a serial part (1-P) and a parallel part P. As n grows, the parallel part shrinks to P/2, ..., P/∞, while the serial part (1-P) stays the same.]

Speed up your original code first!
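To make the formula concrete, here is a minimal sketch in C that evaluates Amdahl's law. The function names amdahl_speedup and amdahl_limit are illustrative and not part of the original slides; p and n correspond to P and n in the formula.

```c
#include <assert.h>

/* Predicted speedup (T_serial / T_parallel) according to Amdahl's law.
 * p: fraction of the serial run time that can run in parallel (0..1)
 * n: number of processors (>= 1)
 * Names are illustrative; they mirror P and n on the slide. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0);
    assert(n >= 1);
    /* T_parallel / T_serial = (1 - P) + P / n */
    double relative_time = (1.0 - p) + p / (double)n;
    return 1.0 / relative_time;
}

/* Upper bound on the speedup as n grows without limit:
 * the parallel part vanishes and only the serial part (1 - P) remains. */
double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

As an example with assumed numbers: for p = 0.95, amdahl_speedup(0.95, 8) is about 5.9, and amdahl_limit(0.95) is 20 no matter how many processors are added. Shrinking the serial 5% is therefore worth as much as adding cores.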
$\text{Speedup} = T_{serial} / T_{parallel}$

Common Optimization Switches

  Purpose                                             Windows                  Linux & Mac OS*
  Disable optimization                                /Od                      -O0
  Optimize for speed (no code size increase)          /O1                      -O1
  Optimize for speed (default)                        /O2                      -O2
  High-level optimizer, including prefetch, unroll    /O3                      -O3
  Create symbols for debugging                        /Zi                      -g
  Inter-procedural optimization                       /Qipo                    -ipo
  Profile guided optimization (multi-step build)      /Qprof-gen, /Qprof-use   -prof-gen, -prof-use
  Optimize for speed, including IPO                   /fast                    -fast
  OpenMP 2.5 support                                  /Qopenmp                 -openmp
  Automatic parallelization                           /Qparallel               -parallel

Interprocedural Optimizations (IPO)

Enables inlining, better register usage, dead code elimination, etc.

Usage:
• icpc -ip: single-file IPO
• icpc -ipo: multi-file IPO. Code is generated at link time, which increases build time.

IPO: Two-Step Process
• Pass 1, compiling (produces ipo objects):
    icpc -c -ipo a.cxx b.cxx
• Pass 2, linking (produces the executable):
    icpc -ipo a.o b.o

Usability tips:
• Try IPO on performance-critical files/libs
• Don't run IPO on tens of thousands of object files; avoid unnecessarily increased build time
• Remember to link with the -ipo option

Interprocedural Optimization

Extends optimizations across file boundaries.

[Figure: Without IPO, file1.c, file2.c, file3.c and file4.c each go through their own "Compile & Optimize" step. With IPO, file1.c, file2.c, file3.c and file4.c go through a single "Compile & Optimize" step together.]

• -ip: only between modules of one source file
• -ipo: modules of multiple files / whole application
Profile-Guided Optimizations

Optimizing with runtime feedback.
• Enhances all optimizations, especially IPO, register allocation, instruction cache usage, switch statement optimization, etc.
• Code-coverage and test-prioritization tools use PGO technology.

Usability tips:
- Run on "typical" input dataset(s)
- Each run generates a data file
- The compiler calculates averages over all runs

Profile-Guided Optimizations (PGO)

Use execution-time feedback to guide (final) optimization.
Helps I-cache, paging and branch prediction.

Enabled optimizations:
- Basic block ordering
- Better register allocation
- Better decision on which functions to inline
- Function ordering
- Switch-statement optimization

Automatic Compiler Vectorization

Processor-specific optimizations:
• Automatically generates vector SSE/SSE2/SSE3/SSSE3/SSE4 code
• Vector processing operates at once on:
  - 4 single-precision floating-point values
  - 2 double-precision floating-point values
  - 4 integer values
  - etc.
• Optimal code generation and instruction scheduling
• Large number of options for advanced control of vectorization
  - Specify trip count, ignore dependencies (ivdep), specify alignment, disable vectorization, etc.
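As a conceptual sketch of what "operate at once on 4 floating point values" means, the plain C below writes the 4-wide loop structure out by hand. This only illustrates the shape of vectorized code: a real vectorizer emits packed SSE instructions rather than four scalar additions. The names add_scalar and add_four_wide are made up for this example.

```c
#include <stddef.h>

/* Scalar reference: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Hand-written picture of a 4-wide vectorized loop.
 * Each iteration of the main loop stands for ONE packed add
 * on a 128-bit register holding four 32-bit floats. */
void add_four_wide(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;

    /* Main loop: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }

    /* Remainder loop: the leftover n % 4 elements, one at a time. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Both functions produce the same results. The point is that the main loop runs roughly n/4 iterations, which is where the potential gain of vectorization comes from.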
Auto-Vectorization (IA-32 and Intel® 64): Optimizing Loops with SSE/SSE2/SSE3/SSSE3/SSE4

Your task: convert this...

    $ cat w.c
    void work( float* a, float *b, float *c, int MAX) {
      for (int I=0;I<=MAX;I++) c[I]=a[I]+b[I];
    }

[Figure: scalar addition C[i] = A[i] + B[i] in 128-bit registers; the remaining register lanes are marked "not used".]

Auto-Vectorization (IA-32 and Intel® 64)

...into this:

    void work( float* a, float *b, float *c, int MAX) {
      for (int I=0;I<=MAX;I++) c[I]=a[I]+b[I];
    }

    $ icc w.c -c -xT
    w.c(2) : (col. 3) remark: LOOP WAS VECTORIZED.

[Figure: packed addition in 128-bit registers; A[3] A[2] A[1] A[0] + B[3] B[2] B[1] B[0] = C[3] C[2] C[1] C[0], all four lanes in use.]

Vectorization Report

"Loop was not vectorized" because:
- "Existence of vector dependence"
- "Subscript too complex"
- "Non-unit stride used"
- "Mixed Data Types"
- "Contains unvectorizable statement at line XX"
- "Condition too Complex"
- "Not Inner Loop"
- "Condition may protect exception"
- "vectorization possible but seems inefficient"
- "Low trip count"
- "Operator unsuited for vectorization"
- "Unsupported Loop Structure"

Compiler Based Vectorization

Automatic Processor Dispatch: -ax[?]
• Single executable
  - Optimized for Intel® Core Duo processors, plus generic code that runs on all IA-32 processors (-axT)
• For each target processor it uses:
  - Processor-specific instructions
  - Vectorization
• Low overhead
  - Some increase in code size
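To illustrate the "existence of vector dependence" message from the Vectorization Report list, the sketch below contrasts two loops. The function names are hypothetical, and whether a given compiler vectorizes either loop depends on its version and options, so treat the comments as the typical expectation rather than a guarantee.

```c
#include <stddef.h>

/* Running sum: iteration i reads a[i - 1], which iteration i - 1 has
 * just written. This read-after-write (flow) dependence between
 * iterations is the kind of pattern a vectorizer typically reports as
 * "existence of vector dependence". */
void prefix_sum(float *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}

/* Independent iterations: each c[i] depends only on a[i] and b[i].
 * The C99 'restrict' qualifiers promise the compiler that the three
 * arrays do not overlap, which removes an assumed dependence and
 * usually lets the loop vectorize. */
void scale_add(float *restrict c, const float *restrict a,
               const float *restrict b, float s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = s * a[i] + b[i];
}
```

Without restrict (or an equivalent hint such as ivdep), the compiler has to assume that c might alias a or b and may refuse to vectorize even the second loop.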
Compiler Optimization Reports

The reports tell you which optimizations were done and, most importantly, give hints on what prevented a given optimization.

• Turn on optimization reports: -opt-report
• Reports can be read by VTune™ Performance Analyzer
• The default report is verbose, so it is recommended to select the optimizations to report on:
  - Enable vectorizer reports: -vec-report3
  - Enable loop optimizer (-O3) reports: -opt-report-phase hlo

    icc hpo.c -c -O3 -xT -vec-report3
    loop was not vectorized: existence of vector dependence.
    vector dependence: proven FLOW dependence between a line 48, and b line 48.

Identifying Hotspots

Pinpointing places where an application could be parallelised.

Today's Question

"Where do I split up my code to take advantage of multiple CPU cores?"

Intel® VTune: lots of things "under one hood"

• Analyzer Projects
  - Counter Monitor Wizard
  - Sampler Wizard
  - Call Graph Wizard
• Threading Wizards
  - Thread Checker
  - Thread Profile

VTune™ Performance Analyzer

VTune:
- Collects performance data
- Organizes and displays results
- Identifies potential performance issues
- Suggests improvements

Host/Target Environment

[Figure: a Target System and a Host System joined by a LAN connection, with the following labels.]
• IA-32 or Itanium® processor family
• Windows*/Linux
• Controls target
• View results of data collection
• Windows or Linux*

VTune™ Analyzer Features and Usage Models

• Sampling: collects system-wide performance data
• Sampling Over Time views: show how sampling data changes over time
• Sampling Source View: displays source code annotated with performance data
• Call Graph: collects and displays information about the program flow of the application
Analysis - Sampling

Use VTune Sampling to find hotspots in the application. Sampling identifies the time-consuming regions.

Let's use the project PrimeSingle for analysis:
- PrimeSingle <start> <end>
- Usage: ./PrimeSingle 1 1000000

    bool TestForPrime(int val)
    {   // let's start checking from 3
        int limit, factor = 3;
        limit = (long)(sqrtf((float)val)+0.5f);
        while( (factor <= limit) && (val % factor))
            factor ++;
        return (factor > limit);
    }

    void FindPrimes(int start, int end)
    {   // start is always odd
        int range = end - start + 1;
        for( int i = start; i <= end; i+= 2 ){
            if( TestForPrime(i) )
                globalPrimes[gPrimesFound++] = i;
            ShowProgress(i, range);
        }
    }

Analysis - Call Graph

Used to find the proper level in the call tree to thread.

[Figure: call graph, with the annotation "This is the level in the call tree where we need to thread".]

Analysis

Where to thread?
- FindPrimes()

Is it worth threading a selected region?
- Appears to have minimal dependencies
- Appears to be data-parallel
- Consumes over 95% of the run time

[Figure: baseline measurement.]

Summary

Always analyze your existing application before implementing any threading or parallelism.

Tools can significantly help you to understand your application and its behavior on the micro-architecture.
• Intel VTune™ analyzer can help you in many ways to reveal architectural issues and find the bottlenecks of your program
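The "baseline measurement" step on the Analysis slide can be roughly approximated in code before any threading is attempted. This is a minimal sketch and not the PrimeSingle source: it counts primes instead of storing them, omits ShowProgress, and replaces the slide's sqrtf limit with an equivalent factor * factor test. The names test_for_prime, count_primes and time_count_primes are assumptions made for this example, and clock() is no substitute for VTune sampling.

```c
#include <assert.h>
#include <time.h>

/* Trial division by 3, 4, 5, ... in the style of the slide.
 * val is assumed to be odd and >= 3. The slide computes a limit with
 * sqrtf; this sketch uses the equivalent test factor * factor <= val. */
static int test_for_prime(int val)
{
    int factor = 3;
    while ((long long)factor * factor <= val && (val % factor) != 0)
        factor++;
    return (long long)factor * factor > val;
}

/* Counts the primes among the odd numbers in [start, end].
 * start must be odd and >= 3, matching "start is always odd". */
int count_primes(int start, int end)
{
    assert(start >= 3 && (start % 2) == 1);
    int found = 0;
    for (int i = start; i <= end; i += 2)
        if (test_for_prime(i))
            found++;
    return found;
}

/* Serial baseline: CPU seconds spent in count_primes.
 * The number of primes found is returned through *found. */
double time_count_primes(int start, int end, int *found)
{
    clock_t t0 = clock();
    *found = count_primes(start, end);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Recording this serial time first gives a reference point: any later threaded version of the loop should return the same count, and its run time can be compared against the baseline to obtain the measured speedup.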