
SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop
Martin Schulz: LLNL, Jim Galarowicz: Krell Institute
Jennifer Green: LANL, Don Maghrak: Krell Institute
LLNL-PRES-503451
11/18/2013
Why This Tutorial?

• Performance analysis is becoming more important
  – Complex architectures
  – Complex applications
  – Mapping applications onto architectures

• More than just measuring time
  – What are the critical sections in a code?
  – Is a part of the code running efficiently or not?
  – Is the code using the underlying resources well?
  – Where is the greatest payoff for optimization?

• Often hard to know where to start
  – Which experiments to run first?
  – How to plan follow-on experiments?
  – What kind of problems can be explored?
  – How to interpret the data?
Tutorial Goals

• Basic introduction into performance analysis
  – Typical pitfalls w.r.t. performance
  – Wide range of types of performance tools and techniques

• Provide basic guidance on …
  – How to understand the performance of a code?
  – How to answer basic performance questions?
  – How to plan performance experiments?

• Provide you with the ability to …
  – Run these experiments on your own code
  – Provide a starting point for performance optimizations

• Open|SpeedShop as the running example
  – Introduction into one possible tool solution
  – Basic usage instructions
  – Pointers to additional documentation
Open|SpeedShop Tool Set

• Open source performance analysis tool framework
  – Most common performance analysis steps all in one tool
  – Combines tracing and sampling techniques
  – Extensible by plugins for data collection and representation
  – Gathers and displays several types of performance information

• Flexible and easy to use
  – User access through: GUI, command line, Python scripting, convenience scripts

• Scalable data collection
  – Instrumentation of unmodified application binaries
  – New option for hierarchical online data aggregation

• Supports a wide range of systems
  – Extensively used and tested on a variety of Linux clusters
  – Cray XT/XE/XK and Blue Gene L/P/Q support
“Rules”

• Let’s keep this interactive
  – Feel free to ask questions as we go along
  – Online demos as we go along
  – Ask if you would like to see anything specific

• We are interested in feedback
  – Both about O|SS and the tutorial in general
  – What was clear / what didn’t make sense?
  – What scenarios are missing?
  – How well does O|SS (or other tools) address the needs?
  – For O|SS: Please report bugs and incompatibilities

• Updated slides:
  – http://www.openspeedshop.org/wp/category/tutorials
  – Then choose the SC2013 Monday Nov 18 tutorial
Presenters

• Jim Galarowicz, Krell
• Donald Maghrak, Krell
• Martin Schulz, LLNL
• Jennifer Green, LANL

• Larger team
  – William Hachfeld and Dave Whitney, Krell
  – Dane Gardner, Argo Navis/Krell
  – Matt Legendre and Chris Chambreau, LLNL
  – Mahesh Rajan, SNL
  – David Montoya, Mike Mason, Phil Romero, LANL
  – Dyninst group (Bart Miller, UW & Jeff Hollingsworth, UMD)
  – Phil Roth, ORNL
  – Ciera Jaspan, CMU
Outline

• Welcome
1) Concepts in performance analysis
2) Introduction into tools and Open|SpeedShop
3) Basic timing experiments and their pros/cons
4) Advanced analysis with hardware counter experiments
5) Analysis of I/O
6) Analysis of parallel codes (MPI, threaded applications)
7) Analysis on heterogeneous platforms (e.g., for GPUs)
8) Conclusions: DIY and future work
SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop
Section 1
Concepts in Performance Analysis
Typical Development Cycle

• Performance tuning is an essential part of the development cycle
  – Potential impact at every stage:
    o Message patterns
    o Data structure layout
    o Algorithms
  – Should be done from early on in the life of a new HPC code

• Typical use
  – Measure performance
  – Analyze data
  – Modify code and/or algorithm
  – Repeat measurements
  – Analyze differences

[Diagram: development cycle — Algorithm → Coding → Code/Binary → Debugging → Correct Code → Tuning → Efficient Code]
What to Look For: Sequential Runs

• Step 1: Identify computationally intensive parts
  – Where am I spending my time?
    o Modules/Libraries
    o Functions
    o Statements
  – Is the time spent in the computational kernels?
  – Does this match my intuition?

• Impact of the memory hierarchy
  – Do I have excessive cache misses?
  – How is my data locality?
  – Impact of TLB misses?

• External resources
  – Is my I/O efficient?
  – Time spent in system libraries?

[Diagram: memory hierarchy — CPU, L1 Cache, L2 Cache, Shared L3 Cache, Main Memory]
What to Look For: Shared Memory

• Shared memory model
  – Single shared storage
  – Accessible from any CPU

• Common programming models
  – Explicit threads (e.g., Pthreads)
  – OpenMP

• Typical performance issues
  – False cache sharing
  – Excessive synchronization
  – Limited work per thread
  – Threading overhead

• Complications: NUMA
  – Memory locality critical
  – Thread:Memory assignments

[Diagrams: SMP node — CPUs sharing L1/L2 caches and main memory; NUMA node — CPUs each with local memory]
What to Look For: Message Passing

• Distributed memory model
  – Sequential/shared memory nodes coupled by a network
  – Only local memory access
  – Data exchange using message passing (e.g., MPI)

• Typical performance issues (see the sketch below)
  – Long blocking times waiting for data
  – Low message rates
  – Limited network bandwidth
  – Global collective operations

[Diagram: nodes, each with memory, application, MPI library, and NIC, connected by a network]
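As a hedged illustration of the first issue above (long blocking waits), this minimal C sketch posts a nonblocking receive early and overlaps independent computation before waiting; the two-rank exchange, buffer sizes, and names are illustrative assumptions, not from the tutorial:

#include <mpi.h>

#define N 1024

int main (int argc, char *argv[]) {
    MPI_Init (&argc, &argv);
    int rank;
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    double halo[N], interior[N];
    for (int i = 0; i < N; i++) interior[i] = rank;
    int peer = 1 - rank;   /* partner rank, assuming exactly 2 ranks */
    MPI_Request req;
    /* post the receive before doing local work ... */
    MPI_Irecv (halo, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
    MPI_Send (interior, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    /* ... compute on data that does not depend on the incoming halo ... */
    for (int i = 0; i < N; i++) interior[i] *= 2.0;
    /* ... and only block here; an MPI trace would show the shorter wait */
    MPI_Wait (&req, MPI_STATUS_IGNORE);
    MPI_Finalize ();
    return 0;
}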
A Case for Performance Tools

• First line of defense
  – Full execution timings (UNIX: “time” command)
  – Comparisons between input parameters
  – Keep and track historical trends

• Disadvantages
  – Measurements are coarse grain
  – Can’t pinpoint performance bottlenecks

• Alternative: code integration of performance probes (see the sketch below)
  – Hard to maintain
  – Requires significant a priori knowledge

• Performance tools
  – Enable fine grain instrumentation
  – Show relation to source code
  – Work universally across applications
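A minimal C sketch of the hand-coded probe approach, assuming a simple gettimeofday wall-clock wrapper (the region and names are illustrative); every probed region needs its own pair of calls, which is exactly why this approach is hard to maintain:

#include <stdio.h>
#include <sys/time.h>

static double now (void) {
    struct timeval tv;
    gettimeofday (&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main (void) {
    double t0 = now ();
    /* ... region of interest ... */
    double t1 = now ();
    printf ("region took %.6f s\n", t1 - t0);
    return 0;
}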
How to Select a Tool?

• A tool with the right features
  – Must be easy to use
  – Provides the capability to analyze the code at different levels

• A tool must match the application’s workflow
  – Requirements from the instrumentation technique:
    o Access to and knowledge about source code? Recompilation time?
    o Machine environments? Supported platforms?
  – Remote execution/analysis?
  – Local support and installation
Performance Tools Overview

• Basic OS tools
  – time, gprof, strace

• Sampling tools
  – Typically unmodified binaries
  – Callstack analysis
  – HPCToolkit (Rice U.)

• Profiling/direct measurements
  – MPI or OpenMP profiles
  – mpiP (LLNL&ORNL)
  – ompP (LMU Munich)

• Hardware counters
  – PAPI API & tool set
  – hwctime (AIX)

• Trace analysis
  – Capture all MPI events
  – Present as timeline
  – Vampir (TU-Dresden)
  – Jumpshot (ANL)

• Tracing tool kits
  – Profile and trace capture
  – Automatic (parallel) trace analysis
  – Kojak/Scalasca (JSC)
  – Paraver (BSC)

• Integrated tool kits
  – Typically profiling and tracing
  – Combined workflow
  – Typically GUI/some vis. support
  – Binary: Open|SpeedShop (Krell/TriLab)
  – Source: TAU (U. of Oregon)

• Specialized tools/techniques
  – Libra (LLNL): load balance analysis
  – Boxfish (LLNL/Utah/Davis): 3D visualization of torus networks
  – Rubik (LLNL): node mapping on torus architectures

• Vendor tools
Why We Picked/Developed Open|SpeedShop?

• Sampling and tracing in a single framework
• Easy to use GUI & command line options for remote execution
  – Low learning curve for end users
• Transparent instrumentation (preloading & binary)
  – No need to recompile the application
SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop
Section 2
Introduction into Tools and Open|SpeedShop
Open|SpeedShop Workflow

[Diagram: an MPI application normally launched as
    srun -n4 -N1 smg2000 -n 65 65 65
is run under O|SS by wrapping the launch line in a convenience script:
    osspcsamp “srun -n4 -N1 smg2000 -n 65 65 65”
and the results are analyzed post-mortem.]

http://www.openspeedshop.org/
Alternative Interfaces

• Scripting language
  – Immediate command interface
  – O|SS interactive command line (CLI)
    o Experiment commands: expAttach, expCreate, expDetach, expGo, expView
    o List commands: list –v exp, list –v hosts, list –v src
    o Session commands: setBreak, openGui

• Python module

import openss

my_filename=openss.FileList("myprog.a.out")
my_exptype=openss.ExpTypeList("pcsamp")
my_id=openss.expCreate(my_filename,my_exptype)

openss.expGo()

my_metric_list = openss.MetricList("exclusive")
my_viewtype = openss.ViewTypeList("pcsamp")
result = openss.expView(my_id,my_viewtype,my_metric_list)
Central Concept: Experiments

• Users pick experiments:
  – What to measure and from which sources?
  – How to select, view, and analyze the resulting data?

• Two main classes:
  – Statistical sampling
    o Periodically interrupt execution and record location
    o Useful to get an overview
    o Low and uniform overhead
  – Event tracing
    o Gather and store individual application events
    o Provides detailed per-event information
    o Can lead to huge data volumes

• O|SS can be extended with additional experiments
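To make the statistical sampling idea concrete, here is a minimal C sketch of a timer-driven sampler, assuming a POSIX system; a SIGPROF handler fires at a fixed rate, and a real profiler would record the interrupted program counter (and, for call-path experiments, the stack) instead of just counting:

#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

static volatile unsigned long samples = 0;

static void on_prof (int sig) {
    (void) sig;
    samples++;   /* a real sampler records the interrupted PC here */
}

int main (void) {
    struct sigaction sa = {0};
    sa.sa_handler = on_prof;
    sigemptyset (&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction (SIGPROF, &sa, NULL);
    /* 100 samples per second, the pcsamp default rate */
    struct itimerval it = { {0, 10000}, {0, 10000} };
    setitimer (ITIMER_PROF, &it, NULL);
    double x = 0.0;
    for (long i = 0; i < 200000000L; i++) x += i * 1e-9;   /* work to sample */
    printf ("%lu samples, x=%g\n", samples, x);
    return 0;
}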
Sampling Experiments in O|SS

• PC Sampling (pcsamp)
  – Record the PC repeatedly at a user defined time interval
  – Low overhead overview of time distribution
  – Good first step, lightweight overview

• Call Path Profiling (usertime)
  – PC sampling and call stacks for each sample
  – Provides inclusive and exclusive timing data
  – Use to find hot call paths, who is calling whom

• Hardware Counters (hwc, hwctime, hwcsamp)
  – Access to data like cache and TLB misses
  – hwc, hwctime:
    o Sample a HWC event based on an event threshold
    o Default event is PAPI_TOT_CYC overflows
  – hwcsamp:
    o Periodically sample up to 6 counter events
    o Default events are PAPI_FP_OPS and PAPI_TOT_CYC
Tracing Experiments in O|SS

• Input/Output Tracing (io, iop, iot)
  – Record invocation of all POSIX I/O events
  – Provides aggregate and individual timings
  – Lightweight I/O profiling (iop)
  – Store function arguments and return code for each call (iot)

• MPI Tracing (mpi, mpit, mpiotf)
  – Record invocation of all MPI routines
  – Provides aggregate and individual timings
  – Store function arguments and return code for each call (mpit)
  – Create Open Trace Format (OTF) output (mpiotf)

• Floating Point Exception Tracing (fpe)
  – Triggered by any FPE caused by the application
  – Helps pinpoint numerical problem areas
Performance Analysis in Parallel

• How to deal with concurrency?
  – Any experiment can be applied to a parallel application
    o Important step: aggregation or selection of data
  – Special experiments targeting parallelism/synchronization

• O|SS supports MPI and threaded codes
  – Automatically applied to all tasks/threads
  – Default views aggregate across all tasks/threads
  – Data from individual tasks/threads available
  – Thread support (incl. OpenMP) based on POSIX threads

• Specific parallel experiments (e.g., MPI)
  – Wraps MPI calls and reports
    o MPI routine time
    o MPI routine parameter information
  – The mpit experiment also stores function arguments and return code for each call
How to Run a First Experiment in O|SS?

1. Picking the experiment
  – What do I want to measure?
  – We will start with pcsamp to get a first overview

2. Launching the application
  – How do I control my application under O|SS?
  – Enclose how you normally run your application in quotes
  – osspcsamp “mpirun -np 256 smg2000 -n 65 65 65”

3. Storing the results
  – O|SS will create a database
  – Name: smg2000-pcsamp.openss

4. Exploring the gathered data
  – How do I interpret the data?
  – O|SS will print a default report
  – Open the GUI to analyze data in detail (run: “openss”)
Example Run with Output

• osspcsamp “mpirun -np 2 smg2000 -n 65 65 65” (1/2)

Bash> osspcsamp "mpirun -np 2 ./smg2000 -n 65 65 65"
[openss]: pcsamp experiment using the pcsamp experiment default sampling rate: "100".
[openss]: Using OPENSS_PREFIX installed in /opt/OSS-mrnet
[openss]: Setting up offline raw data directory in /tmp/jeg/offline-oss
[openss]: Running offline pcsamp experiment using the command:
"mpirun -np 2 /opt/OSS-mrnet/bin/ossrun "./smg2000 -n 65 65 65" pcsamp"

Running with these driver parameters:
(nx, ny, nz) = (65, 65, 65)
…
<SMG native output>
…
Final Relative Residual Norm = 1.774415e-07

[openss]: Converting raw data from /tmp/jeg/offline-oss into temp file X.0.openss
Processing raw data for smg2000
Processing processes and threads ...
Processing performance data ...
Processing functions and statements ...
Example Run with Output

• osspcsamp “mpirun -np 2 smg2000 -n 65 65 65” (2/2)

[openss]: Restoring and displaying default view for:
/home/jeg/DEMOS/demos/mpi/openmpi-1.4.2/smg2000/test/smg2000-pcsamp-1.openss
[openss]: The restored experiment identifier is: -x 1

Exclusive CPU time   % of CPU Time   Function (defining location)
in seconds.
3.630000000          43.060498221    hypre_SMGResidual (smg2000: smg_residual.c,152)
2.860000000          33.926453144    hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
0.280000000           3.321470937    hypre_SemiRestrict (smg2000: semi_restrict.c,125)
0.210000000           2.491103203    hypre_SemiInterp (smg2000: semi_interp.c,126)
0.150000000           1.779359431    opal_progress (libopen-pal.so.0.0.0)
0.100000000           1.186239620    mca_btl_sm_component_progress (libmpi.so.0.0.2)
0.090000000           1.067615658    hypre_SMGAxpy (smg2000: smg_axpy.c,27)
0.080000000           0.948991696    ompi_generic_simple_pack (libmpi.so.0.0.2)
0.070000000           0.830367734    __GI_memcpy (libc-2.10.2.so)
0.070000000           0.830367734    hypre_StructVectorSetConstantValues (smg2000: struct_vector.c,537)
0.060000000           0.711743772    hypre_SMG3BuildRAPSym (smg2000: smg3_setup_rap.c,233)

• View with GUI: openss -f smg2000-pcsamp-1.openss
Default Output Report View

[GUI screenshot: the default view shows performance data by function; the data is the sum over all processes and threads. A toolbar switches views; selecting “Functions” and clicking the D-icon restores the default view. A graphical representation is shown alongside.]
Statement Report Output View

[GUI screenshot: choosing the “Statements” view (select “Statements”, click the D-icon) shows performance data per source statement, highlighting the statement in the program that took the most time.]
Associate Source & Performance Data

[GUI screenshot: double-clicking a selected performance data point opens a source window positioned at the corresponding line; window controls split/arrange the panes.]
Summary

• Place the way you run your application normally in quotes and pass it as an argument to osspcsamp, or any of the other experiment convenience scripts: ossio, ossmpi, etc.
  – osspcsamp “srun -N 8 -n 64 ./mpi_application app_args”

• Open|SpeedShop sends a summary profile to stdout
• Open|SpeedShop creates a database file

• Display alternative views of the data with the GUI via:
  – openss -f <database file>

• Display alternative views of the data with the CLI via:
  – openss -cli -f <database file>

• On clusters, need to set OPENSS_RAWDATA_DIR
  – Should point to a directory in a shared file system
  – More on this later – usually done in a module or dotkit file

• Start with pcsamp for an overview of performance
• Then home in on performance issues with other experiments
Digging Deeper

• Multiple interfaces
  – GUI for easy display of performance data
  – CLI makes remote access easy
  – Python module allows easy integration into scripts

• Usertime experiments provide inclusive/exclusive times
  – Time spent inside a routine vs. its children
  – Key view: butterfly

• Comparisons
  – Between experiments to study improvements/changes
  – Between ranks/threads to understand differences/outliers

• Dedicated views for parallel executions
  – Load balance view
  – Use custom comparison to compare ranks or threads
SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop
Section 3
Basic timing experiments and their Pros/Cons
Step 1: Flat Profile

• Answers a basic question: Where does my code spend its time?
  – Time spent at each function
  – Aggregate (sum) measurements during collection
  – Over time and code sections

• Why use a profile?
  – Reduced size of performance data
  – Typically collected with low overhead
  – Provides a good overview of application performance

• Disadvantages of using a profile
  – Requires a-priori definition of aggregation
  – Omits performance details of individual events
  – Possible sampling frequency skew
Flat Profiles in O|SS: PCSamp

• Basic syntax (same as in the first experiment before):
  osspcsamp “how you run your executable normally”

• Examples:
  osspcsamp “smg2000 -n 50 50 50”
  osspcsamp “smg2000 -n 50 50 50” high

• Optional parameter:
  – Sampling frequency (samples per second)
  – Alternative parameter: high (200) | low (50) | default (100)

• Recommendation: compile code with -g to get statements!
How to Read Flat Profiles

[GUI screenshot: flat profile with results aggregated across all ranks, processes, and threads; the view choice “Statements” (select “Statements”, click the D-icon) and the function in the program that took the most time are highlighted.]
Identifying Critical Regions

• Profiles show computationally intensive code regions
  – First views: time spent per function or per statement

• Questions:
  – Are those functions/statements expected?
  – Do they match the computational kernels?
  – Any runtime functions taking a lot of time?

• Identify bottleneck components
  – View the profile aggregated by shared objects
  – Correct/expected modules?
  – Impact of support and runtime libraries
Adding Context through Stack Traces

• Missing information in flat profiles
  – Distinguish routines called from multiple callers
  – Understand the call invocation history
  – Context for performance data

• Critical technique: stack traces
  – Gather a stack trace for each performance sample
  – Aggregate only samples with an equal trace

• User perspective:
  – Butterfly views (caller/callee relationships)
  – Hot call paths
    o Paths through the application that take the most time

[Diagram: call tree of Functions A–E illustrating multiple call paths]
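A minimal C sketch of the stack-capture step, using glibc’s backtrace()/backtrace_symbols() (the function names are illustrative); a call-path profiler performs this capture inside its sampling handler and aggregates equal traces:

#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

static void leaf (void) {
    void *frames[32];
    int n = backtrace (frames, 32);            /* capture the call stack */
    char **names = backtrace_symbols (frames, n);
    for (int i = 0; i < n; i++) printf ("%s\n", names[i]);
    free (names);
}

static void middle (void) { leaf (); }

int main (void) {
    middle ();   /* prints a path like leaf <- middle <- main      */
    return 0;    /* compile with -rdynamic for readable symbol names */
}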
Inclusive vs. Exclusive Timing

• Stack traces enable calculation of inclusive/exclusive times
  – Time spent inside a function only (exclusive)
    o See: Function B
  – Time spent inside a function and its children (inclusive)
    o See: Function C and children

• Implementation similar to flat profiles
  – Sample PC information
  – Additionally collect call stack information at every sample

• Tradeoffs
  – Pro: obtain additional context information
  – Con: higher overhead / lower sampling rate

[Diagram: call tree of Functions A–E marking the exclusive time of B and the inclusive time of C]
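A tiny C example of the distinction, with illustrative sleep times standing in for work:

#include <unistd.h>

void kernel (void) { sleep (9); }   /* kernel: exclusive = inclusive = 9s */

int main (void) {
    sleep (1);                      /* main’s own (exclusive) work: 1s    */
    kernel ();                      /* adds 9s to main’s inclusive time   */
    return 0;                       /* main: exclusive 1s, inclusive 10s  */
}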
In/Exclusive Time in O|SS: Usertime

• Basic syntax:
  ossusertime “how you run your executable normally”

• Examples:
  ossusertime “smg2000 -n 50 50 50”
  ossusertime “smg2000 -n 50 50 50” low

• Parameters
  – Sampling frequency (samples per second)
  – Alternative parameter: high (70) | low (18) | default (35)

• Recommendation: compile code with -g to get statements!
Reading Inclusive/Exclusive Timings

• Default view
  – Similar to the pcsamp view from the first example
  – Calculates inclusive versus exclusive times

[GUI screenshot: usertime view with separate Exclusive Time and Inclusive Time columns]
Interpreting Call Context Data

• Inclusive versus exclusive times
  – If similar: child executions are insignificant
    o May not be useful to profile below this layer
  – If inclusive time is significantly greater than exclusive time:
    o Focus attention on the execution times of the children

• Hotpath analysis
  – Which paths take the most time?

• Butterfly analysis (similar to gprof)
  – Should be done on “suspicious” functions
    o Functions with large execution time
    o Functions with a large difference between implicit and explicit time
    o Functions of interest
    o Functions that “take unexpectedly long”
    o …
  – Shows split of time in callees and callers
Stack Trace Views: Hot Call Path

[GUI screenshot: the hot call path is highlighted; icons give access to all call paths (C+) or all call paths for a selected function (C).]
Stack Trace Views: Butterfly View

• Similar to the well known “gprof” tool

[GUI screenshot: butterfly view with the pivot routine “hypre_SMGSolve”, its callers above, and its callees below.]
Comparing Performance Data

• Key functionality for any performance analysis

• Typical examples
  – Before/after optimization
  – Different configurations or inputs
  – Different ranks, processes or threads

• Very limited support in most tools
  – Absolute numbers often don’t help
  – Need some kind of baseline / number to compare against
  – Manual operation after multiple runs
  – Requires lining up profile data
  – Even harder for traces

• Open|SpeedShop has support to line up profiles
  – Perform multiple experiments and create multiple databases
  – Script to load all experiments and create multiple columns
Comparing Performance Data in O|SS

• Convenience script: osscompare
  – Compares up to 8 Open|SpeedShop databases to each other
    o Syntax: osscompare “db1.openss,db2.openss,…” [options]
    o The osscompare man page has more details
  – Produces a side-by-side comparison listing
  – Metric option parameter:
    o Compare based on: time, percent, a hwc counter, etc.
  – Limit the number of lines with the “rows=nn” option
  – Specify the granularity with viewtype=[functions|statements|linkedobjects]
    o Controls the view granularity: compare at the function, statement, or library level; function level is the default
    o With the statements option, all comparisons are made on the performance of each statement in all specified databases; similarly for libraries if linkedobjects is selected
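For instance, a statement-level comparison limited to ten rows might look like this (an illustrative invocation, using the database names from the following slide):

osscompare “smg2000-pcsamp.openss,smg2000-pcsamp-1.openss” viewtype=statements rows=10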
Comparison Report in O|SS

osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss"
[openss]: Legend: -c 2 represents smg2000-pcsamp.openss
[openss]: Legend: -c 4 represents smg2000-pcsamp-1.openss

-c 2, Exclusive CPU   -c 4, Exclusive CPU   Function (defining location)
time in seconds.      time in seconds.
3.870000000           3.630000000           hypre_SMGResidual (smg2000: smg_residual.c,152)
2.610000000           2.860000000           hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
2.030000000           0.150000000           opal_progress (libopen-pal.so.0.0.0)
1.330000000           0.100000000           mca_btl_sm_component_progress (libmpi.so.0.0.2: topo_unity_component.c,0)
0.280000000           0.210000000           hypre_SemiInterp (smg2000: semi_interp.c,126)
0.280000000           0.040000000           mca_pml_ob1_progress (libmpi.so.0.0.2: topo_unity_component.c,0)
Summary / Timing Analysis

• Typical starting point:
  – Flat profile
  – Aggregated information on where time is spent in a code
  – Low and uniform overhead when implemented as sampling

• Adding context
  – From where was a routine called, which routines did it call
  – Enables the calculation of exclusive and inclusive timing
  – Technique: stack traces combined with sampling

• Key analysis options
  – Hot call paths that contain most execution time
  – Butterfly view to show relations to parents/children

• Comparative analysis
  – Absolute numbers often carry little meaning
  – Need the correct baseline
SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop
Section 4
Advanced analysis: Hardware Counter Experiments
What Timing Alone Doesn’t Tell You …

• Timing information shows where you spend your time
  – Hot functions / statements / libraries
  – Hot call paths

• BUT: it doesn’t show you why
  – Are the computationally intensive parts efficient?
  – Which resources constrain execution?

• The answer can be very platform dependent
  – Bottlenecks may differ
  – Cause of missing performance portability
  – Need to tune to architectural parameters

• Next: investigate hardware/application interaction
  – Efficient use of hardware resources
  – Architectural units (on/off chip) that are stressed
The Memory System

• Modern memory systems are complex
  – Deep hierarchies
  – Explicitly managed memory
  – NUMA behavior
  – Streaming/Prefetching

• Key: locality (see the sketch below)
  – Accessing the same data repeatedly (temporal)
  – Accessing neighboring data (spatial)

• Information to look for
  – Read/Write intensity
  – Prefetch efficiency
  – Cache miss rates at all levels
  – TLB miss rates
  – NUMA overheads
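A small C illustration of spatial locality (array size and names are ours): C stores rows contiguously, so traversal order alone changes how much of each fetched cache line is used, which shows up directly in the cache miss rates named above:

#include <stdio.h>

#define N 1024
static double a[N][N];

int main (void) {
    double s = 0.0;
    /* good spatial locality: the inner loop walks consecutive
       addresses, so every fetched cache line is fully used */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    /* poor spatial locality: swapping the loops strides N*8 bytes
       per access, touching a new cache line almost every iteration */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    printf ("%g\n", s);
    return 0;
}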
Other Architectural Features

• Computational intensity
  – Cycles per instruction (CPI)
  – Number of floating point instructions

• Branches
  – Number of branches taken (pipeline flushes)
  – Miss-speculations / wrong branch prediction results

• SIMD/Multimedia/Streaming extensions
  – Are they used in the respective code pieces?

• System-wide information
  – I/O busses
  – Network counters
  – Power/Temperature sensors
  – Difficulty: relating this information to source code
Hardware Performance Counters

• Architectural features
  – Typically/mostly packaged inside the CPU
  – Count hardware events transparently without overhead

• Newer platforms also provide system counters
  – Network cards and switches
  – Environmental sensors

• Drawbacks
  – Availability differs between platforms & processors
  – Slight semantic differences between platforms
  – In some cases: requires privileged access & kernel patches

• Recommended: access through PAPI (see the sketch below)
  – API for tools + simple runtime tools
  – Abstractions for system specific layers
  – http://icl.cs.utk.edu/papi/
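A minimal C sketch of direct PAPI use, assuming PAPI is installed and that these two preset events exist on the platform (check with papi_avail); tools like O|SS wrap this same API:

#include <stdio.h>
#include <papi.h>

int main (void) {
    int es = PAPI_NULL;
    long long counts[2];
    if (PAPI_library_init (PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    PAPI_create_eventset (&es);
    PAPI_add_event (es, PAPI_TOT_CYC);   /* total cycles          */
    PAPI_add_event (es, PAPI_L1_DCM);    /* L1 data cache misses  */
    PAPI_start (es);
    volatile double x = 0.0;             /* region of interest    */
    for (int i = 0; i < 1000000; i++) x += i;
    PAPI_stop (es, counts);
    printf ("cycles=%lld  L1 DCM=%lld\n", counts[0], counts[1]);
    return 0;
}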
The O|SS HWC Experiments

• Provides access to hardware counters
  – Implemented on top of PAPI
  – Access to PAPI and native counters
  – Examples: cache misses, TLB misses, bus accesses

• Basic model 1: timer based sampling: HWCsamp
  – Samples at a set sampling rate for the chosen events
  – Supports multiple counters
  – Lower statistical accuracy
  – Can be used to estimate good thresholds for hwc/hwctime

• Basic model 2: thresholding: HWC and HWCtime
  – User selects one counter
  – Run until a fixed number of events has been reached
  – Take a PC sample at that location
    o HWCtime also records the stacktrace
  – Reset the number of events
  – The ideal number of events (threshold) depends on the application
Examples of Typical Counters

PAPI Name      Description                              Threshold
PAPI_L1_DCM    L1 data cache misses                     high
PAPI_L2_DCM    L2 data cache misses                     high/medium
PAPI_L1_DCA    L1 data cache accesses                   high
PAPI_FPU_IDL   Cycles in which FPUs are idle            high/medium
PAPI_STL_ICY   Cycles with no instruction issue         high/medium
PAPI_BR_MSP    Miss-predicted branches                  medium/low
PAPI_FP_INS    Number of floating point instructions    high
PAPI_LD_INS    Number of load instructions              high
PAPI_VEC_INS   Number of vector/SIMD instructions       high/medium
PAPI_HW_INT    Number of hardware interrupts            low
PAPI_TLB_TL    Number of TLB misses                     low

Note: Threshold indications are just rough guidance and depend on the application.
Note: Not all counters exist on all platforms (check with papi_avail).
Recommended: Start with HWCsamp

• osshwcsamp “<command> <args>” [ default | <PAPI_event_list> | <sampling_rate> ]
  – Default events: PAPI_TOT_CYC and PAPI_FP_OPS
  – Default sampling_rate: 100
  – <PAPI_event_list>: comma separated PAPI event list (maximum of 6 events that can be combined)
  – <sampling_rate>: integer value sampling rate

• Sequential job example:
  – osshwcsamp “smg2000”

• Parallel job example:
  – osshwcsamp “mpirun -np 128 smg2000 -n 50 50 50” PAPI_L1_DCM,PAPI_L1_TCA 50

• Use event count values to guide the selection of thresholds for the HWC and HWCtime experiments for deeper analysis
Selecting the Counters & Sampling Rate

• For osshwcsamp, Open|SpeedShop supports …
  – Derived and non-derived PAPI presets
    o All derived and non-derived events reported by “papi_avail”
    o Also reported by running “osshwcsamp” with no arguments
    o Ability to sample up to six (6) counters at one time
  – All native events
    o Architecture specific (incl. naming)
    o Names listed in the PAPI documentation
    o Native events reported by “papi_native_avail”

• Sampling rate depends on the application
  – Overhead vs. accuracy
    o A lower sampling rate causes fewer samples

• NOTE: If a counter does not appear in the output, there may be a conflict in the hardware counters. To find conflicts use: papi_event_chooser PRESET <list of events>
Possible HWC Combinations To Use*

• Xeons
  – PAPI_FP_INS,PAPI_LD_INS,PAPI_SR_INS (load/store info, memory bandwidth needs)
  – PAPI_L1_DCM,PAPI_L1_TCA (L1 cache hit/miss ratios)
  – PAPI_L2_DCM,PAPI_L2_TCA (L2 cache hit/miss ratios)
  – LAST_LEVEL_CACHE_MISSES,LAST_LEVEL_CACHE_REFERENCES (L3 cache info)
  – MEM_UNCORE_RETIRED:REMOTE_DRAM,MEM_UNCORE_RETIRED:LOCAL_DRAM (local/nonlocal memory access)

• For Opterons only
  – PAPI_FAD_INS,PAPI_FML_INS (floating point add/mult)
  – PAPI_FDV_INS,PAPI_FSQ_INS (sqrt and divisions)
  – PAPI_FP_OPS,PAPI_VEC_INS (floating point and vector instructions)
  – READ_REQUEST_TO_L3_CACHE:ALL_CORES,L3_CACHE_MISSES:ALL_CORES (L3 cache)

*Credit: Koushik Ghosh, LLNL
Viewing hwcsamp Data

• hwcsamp default view: counters = total cycles & FP ops

[GUI screenshot: hwcsamp view; pcsamp-style timing information is also shown. Up to six events can be displayed; here we have the default two.]
Viewing hwcsamp Data in CLI

openss -cli -f smg2000-hwcsamp-1.openss
openss>>[openss]: The restored experiment identifier is: -x 1
openss>>expview

Exclusive CPU time   % of CPU Time   PAPI_TOT_CYC   PAPI_FP_OPS   Function (defining location)
in seconds.
3.920000000          44.697833523    11772604888    1198486900    hypre_SMGResidual (smg2000: smg_residual.c,152)
2.510000000          28.620296465     7478131309     812850606    hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
0.310000000           3.534777651      915610917      48863259    opal_progress (libopen-pal.so.0.0.0)
0.300000000           3.420752566      910260309     100529525    hypre_SemiRestrict (smg2000: semi_restrict.c,125)
0.290000000           3.306727480      874155835      48509938    mca_btl_sm_component_progress (libmpi.so.0.0.2)

openss>>expview -m flops

Mflops          Function (defining location)
478.639000000   hypre_ExchangeLocalData (smg2000: communication.c,708)
456.405900000   hypre_StructAxpy (smg2000: struct_axpy.c,25)
Viewing hwcsamp Data in CLI

openss>>expview -m flops -v statements

Mflops          Statement Location (Line Number)
462.420300000   smg3_setup_rap.c(677)
456.405900000   struct_axpy.c(69)
453.119050000   smg3_setup_rap.c(672)
450.492600000   semi_restrict.c(246)
444.427800000   struct_matrix.c(29)

openss>>expview -v linkedobjects

Exclusive CPU time   % of CPU Time   PAPI_TOT_CYC   PAPI_FP_OPS   LinkedObject
in seconds.
7.710000000           87.315968290   22748513124    2396367480    smg2000
0.610000000            6.908267271    1789631493     126423208    libmpi.so.0.0.2
0.310000000            3.510758777     915610917      48863259    libopen-pal.so.0.0.0
0.200000000            2.265005663     521249939      46127342    libc-2.10.2.so
8.830000000          100.000000000   25975005473    2617781289    Report Summary

openss>>
Deeper Analysis with HWC and HWCtime

• osshwc[time] “<command> <args>” [ default | <PAPI_event> | <PAPI threshold> | <PAPI_event> <PAPI threshold> ]
  – default: event (PAPI_TOT_CYC), threshold (10000)
  – <PAPI_event>: PAPI event name
  – <PAPI threshold>: PAPI integer threshold

• Sequential job example:
  – osshwc[time] “smg2000 -n 50 50 50” PAPI_FP_OPS 50000

• Parallel job example:
  – osshwc[time] “mpirun -np 128 smg2000 -n 50 50 50”

• NOTE: If the output is empty, try lowering the <threshold> value. There may not have been enough PAPI event occurrences to record and present.
Viewing hwc Data

• hwc default view: counter = total cycles

[GUI screenshot: flat hardware counter profile of a single hardware counter event; exclusive counts only.]
Viewing hwctime Data

• hwctime default view: counter = L1 cache misses

[GUI screenshot: calling context hardware counter profile of a single hardware counter event; exclusive and inclusive counts.]
Hardware Performance Counter Examples

1) Bring out the importance of simple hardware counters through easy to understand BLAS1, BLAS2, BLAS3 examples
  – Motivated by the difficulty of evaluating the performance efficiency of a profiled code section as gathered from counter data (usually a large number)
  – Use well understood code kernels’ use of hardware components (cache, vector units, etc.); counter data & their ratios from kernels serve as ‘markers’ to gauge data from real applications

2) Bring out the importance of a few useful hardware counter performance ratios
  – Computational intensity, CPI, memory ops per float_ops, …
Hardware Performance Counters: The Memory Pyramid and Performance

• Memory closer to the CPU: faster, smaller
• Memory further away from the CPU: slower, larger
• The most expensive operation is moving data
• Can only do useful work on data at the top

• Example Nehalem access latencies (clock cycles):
  – L1 = 4
  – L2 = 9
  – L3 = 47
  – Main local NUMA memory = 81
  – Main non-local NUMA memory = 128

[Diagram: memory pyramid — CPU, L1 Cache, L2 Cache, Shared L3 Cache, Main Memory, Remote Memory, Disk]

For a given algorithm, serial performance is all about maximizing the CPU flop rate and minimizing memory operations in scientific codes.
BLAS Operations Illustrate the Impact of Moving Data

A, B, C = n x n matrices; x, y = n x 1 vectors; k = scalar

Level   Operation    # Memory Refs or Ops   # Flops   Flops/Ops   Comments on Flops/Ops
1       y = kx + y   3n                     2n        2/3         Achieved in benchmarks
2       y = Ax + y   n^2                    2n^2      2           Achieved in benchmarks
3       C = AB + C   4n^2                   2n^3      n/2         Exceeds HW MAX

Use these Flops/Ops to understand how sections of your code relate to simple memory access patterns as typified by these BLAS operations.
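For the level-1 row, a C rendering of the kernel makes the counts easy to see (the Fortran version appears on the next slide); per iteration there are 2 loads + 1 store = 3 memory references and 1 multiply + 1 add = 2 flops, hence 2n/3n = 2/3:

/* BLAS level-1 daxpy: y = k*x + y */
void daxpy (int n, double k, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = k * x[i] + y[i];   /* 2 loads, 1 store, 2 flops */
}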
Analyzing HWC Data, Example: BLAS Level 1

O|SS experiments conducted: hwc or hwcsamp
PAPI counters: PAPI_FP_OPS, PAPI_TOT_CYC, PAPI_LD_INS, PAPI_ST_INS, PAPI_TOT_INS
Derived metrics (or ratios) of interest: GFLOPs, float_ops/cycle, instructions/cycle, loads/cycle, stores/cycle, and float_ops/memory_ops

BLAS 1 kernel: DAXPY; y = alpha * x + y

Kernel code:
do i=1,n
  y(i) = alpha * x(i) + y(i)
enddo

Kernel code parameters used in the experiments:
n = 10,000 (data fits in cache)
The kernel code is called in an outer loop 100,000 times to get reasonable run times.
Analyzing HWC Data, Example 1; BLAS 1; PAPI Data and Analysis

Setup: n = 10000; Mem Refs = 3n = 30000; calculated FLOPS = 2n = 20000; outer loop = 100000

Measured counters:
PAPI_LD_INS = 1.02E+09    PAPI_SR_INS = 5.09E+08    PAPI_FP_OPS = 1.03E+09
PAPI_TOT_CYC = 2.04E+09   PAPI_TOT_INS = 2.43E+09

Derived values:
code time = 6.4596E-06 secs    code GFLOPS = 3.096124    FPC = 0.505386876
IPC = 1.190989226              LPC = 0.500489716         SPC = 0.249412341
PAPI FLOPS error = -93.80%     corrected PAPI FLOPS error = 3.10%
Mem Refs error = -2.15%        PAPI_GFLOPS = 3.195244288
Flops/ops = 0.673937178        calculated FLOPS/OPS = 0.6666667

Although daxpy performs two floating point operations per iteration, it is counted as one floating point instruction in PAPI. Because of this, there are situations where PAPI_FP_INS may produce fewer floating point counts than expected. In this example PAPI_FP_OPS was multiplied by 2 to match the theoretical expected FLOP count.

Formula for calculating # load instructions (Intel counter: MEM_INST_RETIRED:LOADS; 16 byte aligned loads to XMM registers):
= (2 vectors) * (vec_length) * (loop) * (bytes_per_word) / (16 bytes_per_load)
= 2 * 10,000 * 100,000 * 8 / 16 = 1.0E+09, which matches the measured PAPI_LD_INS of 1.02E+09.
What Hardware Counter Metrics Can Tell Us about Code Performance

• A set of useful metrics that can be calculated for code functions:
  – FLOPS/Memory Ops (FMO): would like this to be large; implies good data locality (also called computational intensity or Ops/Refs)
  – Flops/Cycle (FPC): large values for FP intensive codes suggest efficient CPU utilization
  – Instructions/Cycle (IPC): large values suggest good balance with minimal stalls
  – Loads/Cycle (LPC): useful for calculating FMO; may indicate good stride through arrays
  – Stores/Cycle (SPC): useful for calculating FMO; may indicate good stride through arrays

Operation               Kernel         PAPI_GFLOPS   FMO    FPC    IPC    LPC    SPC
BLAS 1; y = alpha*x+y   do loop        0.67          0.67   0.51   1.19   0.50   0.25
BLAS 2; y = A*x + y     do loop        0.94          2.00   0.14   0.26   0.07   0.00
BLAS 2; y = A*x + y     DGEMV          1.89          —      0.29   0.42   0.15   0.03
BLAS 3; C = A*B + C     do loop(kji)   6.29          —      0.87   1.74   0.21   0.00
BLAS 3; C = A*B + C     DGEMM          12.96         —      1.84   1.26   0.59   0.01
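The ratios above are simple arithmetic over raw counts; a small C sketch, with hypothetical counter values standing in for measured data:

#include <stdio.h>

int main (void) {
    /* hypothetical raw counts; in practice these come from PAPI/O|SS */
    long long fp_ops = 2000000, ld_ins = 1500000, sr_ins = 500000;
    long long tot_cyc = 4000000, tot_ins = 5000000;

    double fmo = (double) fp_ops / (double)(ld_ins + sr_ins); /* flops per memory op    */
    double fpc = (double) fp_ops / (double) tot_cyc;          /* flops per cycle        */
    double ipc = (double) tot_ins / (double) tot_cyc;         /* instructions per cycle */
    double lpc = (double) ld_ins / (double) tot_cyc;          /* loads per cycle        */
    double spc = (double) sr_ins / (double) tot_cyc;          /* stores per cycle       */
    printf ("FMO=%.2f FPC=%.2f IPC=%.2f LPC=%.2f SPC=%.2f\n",
            fmo, fpc, ipc, lpc, spc);
    return 0;
}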
Hardware Counters for Simple Math Kernels

Other useful HWC metrics are shown here:

code                           Comp. Inten.   MFLOPS/   MFLOPS   percent   fpOps/TLB     fpOps/D1      fpOps/        ops/
                               (ops/ref)      papi      code     peak      miss          cache miss    DC_MISS       cycle
3dFFT; 256x256x256             1.33           952       1370     19.8      841.6515146   25.5290058    29.42427018   0.4
matmul 500x500                 1.71           4159      4187     86.7      9040759.488   167.9364898   170.5178224   1.75
HPCCG; sparseMV; 100x100x100   1.68           3738      4000     77.9      697703.964    144.9081716   149.9578195   1.56
QR Fact. N=2350                0.64           352       276      7.3       14.05636016   10.24364227   11.1702481    0.15
AMG: LLNL Sparse Solver Benchmark Example (uses the Hypre Library)

• Major reasons for on-node scaling limitations:
  – Memory bandwidth
  – Shared L3 cache

• L3 cache misses for 1, 2, 4 PEs match the expectation for strong scaling
  – Reduced data per PE
  – L3 misses decrease linearly up to 4 PEs

• On the other hand, L3 evictions for 1, 2, 4 PEs similarly decrease ‘near-perfect’, but dramatically increase to 100x at 8 PEs and 170x at 16 PEs

• L3 evictions are a good measure of a memory bandwidth limited performance bottleneck at a node

• General memory BW limitation remedies:
  – Blocking
  – Remove false sharing for threaded codes

[Charts: AMG intra-node scaling — speedup vs. number of cores for AMG weak, strong, and ideal scaling; L3_CACHE_MISSES:ALL and L3_EVICTIONS:ALL for 1, 2, 4, 8, 16 PEs, normalized to the 1 PE count (counts are averages of the PE values)]
A “False Cache Line Sharing” Example

! Cache line aligned
real*4, dimension(112,100)::c,d
!$OMP DO SCHEDULE(STATIC, 16)
do i=1,100
  do j=2, 100
    c(i,j) = c(i, j-1) + d(i,j)
  enddo
enddo
!$OMP END DO

! Cache line unaligned
real*4, dimension(100,100)::c,d
!$OMP PARALLEL DO
do i=1,100
  do j=2, 100
    c(i,j) = c(i, j-1) + d(i,j)
  enddo
enddo
!$OMP END PARALLEL DO

Same computation, but careful attention to alignment and independent OMP parallel cache-line chunks can have a big impact; L3_EVICTIONS is a good measure:

                Run Time   L3_EVICTIONS:ALL   L3_EVICTIONS:MODIFIED
Aligned         6.5e-03    9                  3
UnAligned       2.4e-02    1583               1422
Perf. Penalty   3.7        175                474
A PAPI_TLB_DM Example: Sandia’s CTH

Looking through the performance counters, the average per-PE PAPI counter seeing the most increase with scale (among all the profiled functions) is PAPI_TLB_DM, registered under MPI.

So the executable was relinked with -lhugetlbfs, HUGETLB_MORECORE=yes was set, and the run was executed with “aprun -m500hs …”

• 16 PE performance improvement: 7.35%
• 128 PE performance improvement: 8.14%
• 2048 PE performance improvement: 8.23%

[Charts: CTH intra-node scaling — speedup vs. number of cores for CTH weak, strong, and ideal scaling; MPI % vs. APP % of wall time per core/MPI-task count; growth in MPI time — MPI PAPI_TLB_DM ratio normalized to the 2 PE TLB_DM value]
SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop
Section 5
Analysis of I/O
Need for Understanding I/O

• I/O could be a significant percentage of execution time, dependent upon:
  – Checkpoint, analysis output, visualization & I/O frequencies
  – I/O pattern in the application: N-to-1, N-to-N; simultaneous writes or requests
  – Nature of the application: data intensive, traditional HPC, out-of-core
  – File system and striping: NFS, Lustre, Panasas, and # of Object Storage Targets (OSTs)
  – I/O libraries: MPI-IO, hdf5, PLFS, …
  – Other jobs stressing the I/O sub-systems

• Obvious candidates to explore first while tuning:
  – Use a parallel file system
  – Optimize for the I/O pattern
  – Match checkpoint I/O frequency to the MTBI of the system
  – Use appropriate libraries
I/O Performance Example

• Application: OOCORE benchmark from DOD HPCMO
  – Out-of-core SCALAPACK benchmark from UTK
  – Can be configured to be disk I/O intensive
  – Characterizes a very important class of HPC applications involving the use of the Method of Moments (MOM) formulation for investigating electromagnetics (e.g., radar cross section, antenna design)
  – Solves dense matrix equations by LU, QR or Cholesky factorization
  – “Benchmarking OOCORE, an Out-of-Core Matrix Solver,” Cable, S.B., D’Avezedo, E., SCALAPACK Team, University of Tennessee at Knoxville / U.S. Army Engineering and Development Center
Why Use This Example?

• Used by HPCMO to evaluate I/O system scalability
• Out-of-core dense solver benchmarks demonstrate the importance of the following in performance analysis:
  – I/O overhead minimization
  – Matrix multiply kernel – possible to achieve close to peak performance of the machine if tuned well
  – “Blocking” very important to tune for deep memory hierarchies
Use O|SS to Measure and Tune for I/O

Run on 16 cores of an SNL quad-core, quad-socket Opteron IB cluster.
Investigate file system impact with OpenSpeedShop: compare Lustre I/O with striping to NFS I/O.

run cmd: ossio “srun -N 1 -n 16 ./testzdriver-std”

INPUT: testdriver.in (ScaLAPACK out-of-core LU, QR, LL factorization input file):
6          device out
1          number of factorizations
LU         factorization methods -- QR, LU, or LT
1          number of problem sizes
31000      values of M
31000      values of N
1          values of nrhs
9200000    values of Asize
1          number of MB's and NB's
16         values of MB
16         values of NB
1          number of process grids
4          values of P
4          values of Q

Sample output (testdriver.out) from the Lustre run:
TIME M N MB NB NRHS P Q Fact/SolveTime Error Residual
---- ------ ------ --- --- ----- ----- --------------- ----------- --------
WALL 31000 31000 16 16 1 4 4 1842.20 1611.59 4.51E+15 1.45E+11
DEPS = 1.110223024625157E-016
sum(xsol_i) = (30999.9999999873,0.000000000000000E+000)
sum |xsol_i - x_i| = (3.332285336962339E-006,0.000000000000000E+000)
sum |xsol_i - x_i|/M = (1.074930753858819E-010,0.000000000000000E+000)
sum |xsol_i - x_i|/(M*eps) = (968211.548505533,0.000000000000000E+000)

From the output of two separate runs using Lustre and NFS:
LU Fact time with Lustre = 1842 secs; LU Fact time with NFS = 2655 secs
813 sec penalty (more than 30%) if you do not use a parallel file system like Lustre!
NFS and Lustre O|SS Analysis (screenshot from NFS)

I/O to Lustre instead of NFS reduces runtime 25%: (1360 + 99) – (847 + 7) = 605 secs

NFS RUN:
Min t (secs)   Max t (secs)   Avg t (secs)   call Function
1102.380076    1360.727283    1261.310157    __libc_read (/lib64/libpthread-2.5.so)
31.19218       99.444468      49.01867       __libc_write (/lib64/libpthread-2.5.so)

LUSTRE RUN:
Min t (secs)   Max t (secs)   Avg t (secs)   call Function
368.898283     847.919127     508.658604     __libc_read (/lib64/libpthread-2.5.so)
6.27036        7.896153       6.850897       __libc_write (/lib64/libpthread-2.5.so)
Lustre File System Striping

Lustre file system (lfs) commands:
lfs setstripe -s (size bytes; k, M, G) -c (count; -1 all) -i (index; -1 round robin) <file | directory>
Typical defaults: -s 1M -c 4 -i -1 (usually good to try first)
File striping is set upon file creation

lfs getstripe <file | directory>
Example: lfs getstripe --verbose ./oss_lfs_stripe_16 | grep stripe_count
stripe_count: 16 stripe_size: 1048576 stripe_offset: -1

[Diagram: compute and I/O nodes connected over a HS network to OSTs — 1 PE writes: BW limited; 1 file per process: BW enhanced; subset of PEs do I/O: could be most optimal]
OpenSpeedShop IO Experiment Used to Identify Optimal lfs Striping

(from the load balance view (max, min & avg) for a 16-way parallel run)

[Chart: OOCORE I/O performance — __libc_read wall time (secs) from OpenSpeedShop; MAX, MIN, and AVG bars for stripe count = 1, 4, 8, 16]
Additional I/O Analysis with O|SS

• I/O profiling (iop experiment)
  – Lighter weight I/O tracking experiment
  – Traces I/O functions but only records individual callpaths, not each individual event with its callpath (like usertime)

• Extended I/O tracing (iot experiment)
  – Records each event in chronological order
  – Collects additional information:
    o Function parameters
    o Function return value
  – When to use extended I/O tracing?
    o When you want to trace the exact order of events
    o When you want to see the return values or bytes read or written
    o When you want to see the parameters of the I/O call
Beware of Serial I/O in Applications (encountered in VOSS, code LeP)
The simple code here illustrates it (acknowledgment: Mike Davis, Cray, Inc.):
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define VARS_PER_CELL 15
/* Write a single restart file from many MPI processes */
int write_restart (
MPI_Comm comm
/// MPI communicator
, int num_cells
/// number of cells on this process
, double *cellv )
/// cell vector
{
int rank;
// rank of this process within comm
int size;
// size of comm
int tag;
// for MPI_Send, MPI_Recv
int baton;
// for serializing I/O
FILE *f;
// file handle for restart file
/* Procedure: Get MPI parameters */
MPI_Comm_rank (comm, &rank);
MPI_Comm_size (comm, &size);
tag = 4747;
if (rank == 0) {
/* Rank 0 create a fresh restart file,
* and start the serial I/O;
* write cell data, then pass the baton to rank 1 */
f = fopen ("restart.dat", "wb");
fwrite (cellv, num_cells, VARS_PER_CELL * sizeof (double), f);
fclose (f);
MPI_Send (&baton, 1, MPI_INT, 1, tag, comm);
} else {
/* Ranks 1 and higher wait for previous rank to complete I/O,
* then append its cell data to the restart file,
* then pass the baton to the next rank */
MPI_Recv (&baton, 1, MPI_INT, rank - 1, tag, comm, MPI_STATUS_IGNORE);
f = fopen ("restart.dat", "ab");
fwrite (cellv, num_cells, VARS_PER_CELL * sizeof (double), f);
fclose (f);
if (rank < size - 1) {
MPI_Send (&baton, 1, MPI_INT, rank + 1, tag, comm);
}
}
/* All ranks have posted to the restart file; return to caller */
return 0;
}
int main (int argc, char *argv[]) {
MPI_Comm comm;
int comm_rank;
int comm_size;
int num_cells;
double *cellv;
int i;
MPI_Init (&argc, &argv);
MPI_Comm_dup (MPI_COMM_WORLD, &comm);
MPI_Comm_rank (comm, &comm_rank);
MPI_Comm_size (comm, &comm_size);
/**
* Make the cells be distributed somewhat evenly across ranks
*/
num_cells = 5000000 + 2000 * (comm_size / 2 - comm_rank);
cellv = (double *) malloc (num_cells * VARS_PER_CELL * sizeof (double));
for (i = 0; i < num_cells * VARS_PER_CELL; i++) {
cellv[i] = comm_rank;
}
write_restart (comm, num_cells, cellv);
MPI_Finalize ();
return 0;
}
IOT O|SS Experiment of the Serial I/O Example

[GUI screenshot: the event-by-event list shows each traced call to an I/O function. Below it, a graphical trace view of the same data (rendered with another tool) shows the serialization of fwrite() — the red bars for each PE.]
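One common fix for this serialization, sketched below under our own assumptions (it is not part of the tutorial code), is collective MPI-IO: each rank derives its file offset with MPI_Exscan and all ranks write concurrently with MPI_File_write_at_all:

#include <mpi.h>

#define VARS_PER_CELL 15

/* Hedged sketch: write the same restart data collectively */
int write_restart_mpiio (MPI_Comm comm, int num_cells, double *cellv)
{
    MPI_File fh;
    long long mine = (long long) num_cells * VARS_PER_CELL;
    long long offset = 0;   /* rank 0's exscan result is undefined; keep 0 */
    MPI_Exscan (&mine, &offset, 1, MPI_LONG_LONG, MPI_SUM, comm);
    MPI_File_open (comm, "restart.dat",
                   MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all (fh, (MPI_Offset)(offset * sizeof (double)),
                           cellv, (int) mine, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close (&fh);
    return 0;
}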
Running I/O Experiments

Offline io/iop/iot experiment on the sweep3d application.
Convenience script basic syntax:
ossio[p][t] “executable” [ default | <list of I/O funcs> ]

• Parameters
  – I/O function list to sample (default is all)
  – creat, creat64, dup, dup2, lseek, lseek64, open, open64, pipe, pread, pread64, pwrite, pwrite64, read, readv, write, writev

Examples:
ossio “mpirun -np 256 sweep3d.mpi”
ossiop “mpirun -np 256 sweep3d.mpi” read,readv,write
ossiot “mpirun -np 256 sweep3d.mpi” read,readv,write
I/O Output via CLI (equivalent of HC in the GUI)

openss>>expview -vcalltrees,fullstack iot1

I/O Call Time(ms)   % of Total Time   Number of Calls   Call Stack Function (defining location)
_start (sweep3d.mpi)
> @ 470 in __libc_start_main (libmonitor.so.0.0.0: main.c,450)
>>__libc_start_main (libc-2.10.2.so)
>>> @ 428 in monitor_main (libmonitor.so.0.0.0: main.c,412)
>>>>main (sweep3d.mpi)
>>>>> @ 58 in MAIN__ (sweep3d.mpi: driver.f,1)
>>>>>> @ 25 in task_init_ (sweep3d.mpi: mpi_stuff.f,1)
>>>>>>>_gfortran_ftell_i2_sub (libgfortran.so.3.0.0)
>>>>>>>>_gfortran_ftell_i2_sub (libgfortran.so.3.0.0)
….
>>>>>>>>>>>>>_gfortran_st_read (libgfortran.so.3.0.0)
17.902981000   96.220812461   1   >>>>>>>>>>>>>>__libc_read (libpthread-2.10.2.so)
Section Summary – I/O Tradeoffs

• Avoid writing to one file from all MPI tasks
  – If you need to, be sure to distinguish offsets for each PE at a stripe boundary, and use buffered I/O

• If each process writes its own file, then the parallel file system attempts to load balance the Object Storage Targets (OSTs), taking advantage of the stripe characteristics

• Metadata server overhead can often create severe I/O problems
  – Minimize the number of files accessed per PE and minimize each PE doing operations like seek, open, close, stat that involve inode information

• I/O time is usually not measured, even in applications that keep some function profile
  – Open|SpeedShop can shed light on time spent in I/O using the io and iot experiments
SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop
Section 6
Analysis of parallel codes (MPI, threaded)
Parallel vs. Serial Performance Dependencies

• Serial application performance dependencies
  – Execution time can change based on the input deck
  – Compiler options and capabilities
  – Type of processor differences, multicore
  – Does the serial execution model really exist anymore?

• Parallel application performance dependencies
  – Execution time can change based on the input deck
  – Compiler options and capabilities
    o Threading support
    o Accelerator support (pragmas)
  – Type and speed of the processor units and accelerators
  – Number of processor units and accelerators
  – Processor interconnects and network speed
Parallel Application Performance Challenges

Architectures are Complex and Evolving Rapidly






Changing multicore processor designs
Emergence of accelerators (GPGPU, MIC, etc.)
Multi-level memory hierarchy
I/O storage sub-systems
Increasing scale: number of processors, accelerators
Parallel processing adds more performance factors






MPI communication time versus computation time
Threading synchronization time versus computation time
CPU time versus accelerator transfer and startup time tradeoffs
I/O device multi-process contention issues
Efficient memory referencing across processes/threads
Changes in application performance due to adapting to new
architectures
Parallel Execution Goals
 Ideal scenario
• Efficient threading when using pthreads or OpenMP
– All threads are assigned work that can execute concurrently
– Synchronization times are low
• Load balance for parallel jobs using MPI
– All MPI ranks do the same amount of work, so no MPI rank waits
• Hybrid applications have both MPI and threads running with limited synchronization idle time
* Diagram from Performance Metrics for Parallel Programs: http://web.info.uvt.ro/~petcu/calcul/
Parallel Execution Goals
 What causes the ideal goal to fail?*
• Equal work was not given to each rank
– The number of array operations is not equal for each rank
– Loop iterations were not evenly distributed
• Sometimes there are issues even if data is distributed equally
– Sparse arrays: some ranks have work, others have only zeros
– Adaptive grid models: some ranks need to redefine their mesh while others don't
– N-body simulations: some data/work migrates into another rank's domain, so that rank has more work
• In threaded applications, some threads do more work than others, causing the remaining threads to wait
*From LLNL parallel processing tutorial
Parallel Application Analysis Techniques
 What steps can we take to analyze parallel jobs?
 Get an overview of where the time is being spent
• A low-overhead way to get an overview is sampling (see the sketch below)
– Program counter, call stack, hardware counter
 Collect overview information for all ranks and/or threads in the program, then:
• Examine load balance by looking at the minimum, maximum, and average values across the ranks and/or threads
• If possible, look at this information per library as well as per function
• Use the above info to determine whether the program is well balanced or not
– Are the minimum and maximum values widely different?
– If so, you probably have load imbalance and need to look for the cause of the performance lost to the imbalance
– Not all ranks or threads are doing the same amount of work
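For example, with Open|SpeedShop the overview step might look like the following sketch. The load-balance metric flag is our assumption here (check "help expview" in your installed CLI version), and the database file name is illustrative:

osspcsamp "mpirun -np 256 sweep3d.mpi"     # low-overhead PC sampling across all ranks
openss -cli -f sweep3d.mpi-pcsamp.openss   # open the resulting database
openss>>expview                            # aggregated time per function
openss>>expview -m loadbalance             # min/max/average across ranks (assumed metric name)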
Parallel Execution Analysis Techniques
 What steps can we take to analyze parallel jobs?
 If imbalance is detected, then what? How do you find the cause?
• Look at the library time distribution across all the ranks/threads
– Is the MPI library taking a disproportionate amount of time?
• If it is an MPI application, use a tool that provides per-MPI-function call timings
– Can look at MPI function time distributions
o In particular MPI_Waitall
o Then look at the call paths to MPI_Waitall (see the sketch below)
– Can also look at source code relative to
o The MPI rank or particular pthread that is involved
o Is there any special processing for that particular rank or thread?
o Examine the call paths and check the code along each path
• Use a cluster analysis type feature, if the tool has this capability
– Cluster analysis can categorize threads or ranks that have similar performance into groups, identifying the outlier rank or thread
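With Open|SpeedShop, that drill-down could look like this sketch (the flags are the ones used elsewhere in this tutorial; the database file name is illustrative):

ossmpi "mpirun -np 256 sweep3d.mpi"                    # trace MPI function calls
openss -cli -f sweep3d.mpi-mpi.openss
openss>>expview                                        # time per MPI function
openss>>expview -vcalltrees,fullstack -f MPI_Waitall   # call paths to MPI_Waitall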
Parallel Execution Analysis Techniques
 Example steps: Open|SpeedShop case study on NPB: LU
 http://www.openspeedshop.org/wp/wp-content/uploads/2010/10/2010-11-sc-oss-tutorial-submitted.pdf
 Use program counter (pc) sampling to get an overview
• Noticed that smpi_net_lookup shows up high in the function load balance
• Because it is an MPI-related routine, take a look at the linked object (library) sampling data
• Load balance on linked objects also showed some imbalance
• Used cluster analysis on the data and found that rank 255 is an outlier
 Take a closer look at rank 255
• Saw that rank 255's pc output shows most time spent in smpi_net_lookup
• Run an MPI tracing experiment to determine if we can get more clues
• Note that a load balance view on the MPI trace data shows rank 255's MPI_Allreduce time is the highest of the 256 ranks used
• Show data for rank 255 and a representative rank from the rest of the ranks
• Note the differences in MPI_Wait, MPI_Send, and MPI_Allreduce
• Look at call paths to MPI_Wait
 Conclusions?
• We've found the call paths to examine for why the wait is occurring
How can O|SS help for parallel jobs?
 O|SS is designed to work on parallel jobs
• Support for threading and message passing
• Automatically tracks all ranks and threads during execution
• Records/stores performance info per process/rank/thread
 All experiments can be used on parallel jobs
• O|SS applies the experiment collector to all ranks or threads on all nodes
 MPI-specific tracing experiments
• Tracing of MPI function calls (individual, all, or a specific group)
• Three forms of MPI tracing experiments
Analysis of Parallel Codes
 By default, experiment collectors are run on all tasks
• Automatically detect all processes and threads
• Gathers and stores per-thread data
 Viewing data from parallel codes
• By default, all values are aggregated (summed) across all ranks
• Manually include/exclude individual ranks/processes/threads
• Ability to compare ranks/threads (see the CLI sketch below)
 Additional analysis options
• Load balance (min, max, average) across parallel executions
– Across ranks for hybrid OpenMP/MPI codes
– Focus on a single rank to see load balance across OpenMP threads
• Cluster analysis (finding outliers)
– Automatically creates groups of similar performing ranks or threads
– Available from the Stats Panel toolbar or context menu
– Note: can take a long time for large numbers of processors (current version)
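In the CLI, rank selection and comparison might look like this sketch; expview -r appears in the hot call paths example later in this section, while expcompare and its exact options are our assumption here (consult the CLI documentation):

openss>>expview -r 255                  # restrict the view to rank 255
openss>>expcompare -r 0 -r 255 -m time  # compare two ranks side by side (assumed syntax)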
Integration with MPI
 O|SS has been tested with a variety of MPIs
• Including: Open MPI, MVAPICH[2], MPICH2 (Intel, Cray), and MPT (SGI)
 Running O|SS experiments on MPI codes
• Just use the convenience script corresponding to the data you want to gather, and put the command you use to run your application in quotes:
– osspcsamp "mpirun -np 32 sweep3d.mpi"
– ossio "srun -N 4 -n 16 sweep3d.mpi"
– osshwctime "mpirun -np 128 sweep3d.mpi"
– ossusertime "srun -N 8 -n 128 sweep3d.mpi"
– osshwc "mpirun -np 128 sweep3d.mpi"
MPI Specific Tracing Experiments
 MPI function tracing
• Record all MPI call invocations
 Record call times and call paths (mpi)
• Convenience script: ossmpi
 Record call times, call paths, and argument info (mpit)
• Convenience script: ossmpit
 Equal events will be aggregated
• Saves space in the O|SS database
• Reduces overhead
 Public format: full MPI traces in Open Trace Format (OTF)
• Experiment name: mpiotf
• Convenience script: ossmpiotf
Running MPI Specific Experiments
Offline mpi/mpit/mpiotf experiments — convenience script basic syntax:
ossmpi[t] "mpi executable syntax" [default | <list of MPI functions> | mpi_category]
 Parameters
• Default is all MPI functions Open|SpeedShop traces
• MPI function list to trace (comma separated)
– e.g., MPI_Send, MPI_Recv, ...
• mpi_category, one of:
– "all", "asynchronous_p2p", "collective_com", "datatypes", "environment", "graphs_contexts_comms", "persistent_com", "process_topologies", "synchronous_p2p"
Examples:
ossmpi "srun -N 4 -n 32 smg2000 -n 50 50 50"
ossmpi "mpirun -np 4000 nbody" MPI_Send,MPI_Recv
Identifying Load Imbalance With O|SS
 Get an overview of the application
• Run one of these lightweight experiments: pcsamp, usertime, hwc
• Use this information to verify performance expectations
 Use the load balance view on pcsamp, usertime, or hwc
• Look for performance values outside the norm
– A somewhat large difference between the min, max, and average values
 Get an overview of the MPI functions used in the application
• If the MPI libraries show up in the load balance for pcsamp, then run an MPI-specific experiment
 Use the load balance view on the MPI experiment
• Look for performance values outside the norm
– A somewhat large difference between the min, max, and average values
• Focus on the MPI functions to find potential problems
pcsamp Default View
 Default aggregated pcsamp experiment view
[Screenshot: GUI Stats Panel — the information icon displays the experiment metadata; the table shows the aggregated results]
Load Balance View: NPB: LU
 Load balance view based on functions (pcsamp)
[Screenshot: the MPI library shows up high in the list, with the maximum time in rank 255. In the load balance view we look for performance numbers outside the norm of what is expected, i.e., large differences between the min, max, and/or average values]
Default Linked Object View: NPB: LU
 Default aggregated view based on linked objects (libraries)
[Screenshot: select "Linked Objects" and click the D icon to get the linked object (library) view. Note: the MPI library consumes a large portion of the application run time]
Link. Obj. Load Balance: Using NPB: LU
 Load balance view based on linked objects (libraries)
[Screenshot: rank 255 has the maximum MPI library time value and the minimum LU time]
MPI Tracing Results: Default View
 Default aggregated MPI experiment view
[Screenshot: the information icon displays the experiment metadata; the table shows the aggregated results]
View Results: Show MPI Callstacks
 Unique call paths view: click the C+ icon
[Screenshot: unique call paths to MPI_Waitall and other MPI functions]
Using Cluster Analysis in O|SS
 Can use with pcsamp, usertime, hwc
• Will group like-performing ranks/threads into groups
• Groups may identify outlier groups of ranks/threads
• Can examine the performance of a member of the outlier group (see the example below)
• Can compare that member with a member of an acceptably performing group
 Can use with mpi, mpit
• Same functionality as above
• But now focuses on the performance of individual MPI functions
• Key functions are MPI_Wait and MPI_Waitall
• Can look at the call paths to the key functions to analyze why they are being called, to find performance issues
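For example, once cluster analysis flags rank 255 as an outlier, the call paths to a key MPI function can be pulled up for just that rank, using the same kind of command as the hot call paths example later in this section:

openss>>expview -r 255 -vcalltrees,fullstack -f MPI_Waitall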
Link. Obj. Cluster Analysis: NPB: LU
 Cluster analysis view based on linked objects (libraries)
[Screenshot: in the cluster analysis results, rank 255 shows up as an outlier]
Single Rank View: NPB: LU
 pcsamp view of rank 255 performance data only
[Screenshot: smpi_net_lookup, an MPI-library-related routine, dominates rank 255]
Hot Call Paths View (CLI): NPB: LU
 Hot call paths for MPI_Wait, rank 255 only

openss -cli -f lu-mpi-256.openss
openss>>expview -r 255 -vcalltrees,fullstack -f MPI_Wait

This shows all call paths involving MPI_Wait, for rank 255 only. The first path below is the most expensive call path to MPI_Wait:

Exclusive MPI   % of      Number    Call Stack Function (defining location)
Call Time(ms)   Total     of Calls
                                    >>>>main (lu.C.256)
                                    >>>>> @ 140 in MAIN__ (lu.C.256: lu.f,46)
                                    >>>>>> @ 180 in ssor_ (lu.C.256: ssor.f,4)
                                    >>>>>>> @ 213 in rhs_ (lu.C.256: rhs.f,5)
                                    >>>>>>>> @ 224 in exchange_3_ (lu.C.256: exchange_3.f,5)
                                    >>>>>>>>> @ 893 in mpi_wait_ (mpi-mvapich-rt-offline.so: wrappers-fortran.c,893)
6010.978000     3.878405            >>>>>>>>>> @ 889 in mpi_wait (mpi-mvapich-rt-offline.so: wrappers-fortran.c,885)
                          250       >>>>>>>>>>> @ 51 in MPI_Wait (libmpich.so.1.0: wait.c,51)
                                    >>>>main (lu.C.256)
                                    >>>>> @ 140 in MAIN__ (lu.C.256: lu.f,46)
                                    >>>>>> @ 180 in ssor_ (lu.C.256: ssor.f,4)
                                    >>>>>>> @ 64 in rhs_ (lu.C.256: rhs.f,5)
                                    >>>>>>>> @ 88 in exchange_3_ (lu.C.256: exchange_3.f,5)
                                    >>>>>>>>> @ 893 in mpi_wait_ (mpi-mvapich-rt-offline.so: wrappers-fortran.c,893)
2798.770000     1.805823            >>>>>>>>>> @ 889 in mpi_wait (mpi-mvapich-rt-offline.so: wrappers-fortran.c,885)
                          250       >>>>>>>>>>> @ 51 in MPI_Wait (libmpich.so.1.0: wait.c,51)
Summary / Parallel Bottlenecks
 Open|SpeedShop supports MPI and threaded application performance analysis
• Works with multiple MPI implementations
 Parallel experiments
• Apply the sequential O|SS collectors to all nodes
• Specialized MPI tracing experiments
 Result viewing
• By default: results are aggregated across ranks/processes/threads
• Optionally: select individual ranks/threads or groups
• Specialized views:
– Load balance view to see min, max, and average values
– Cluster analysis
 Use these features to isolate sections of problem code
SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop
Section 7
Analysis of heterogeneous codes (GPU)
11/18/2013
Emergence of HPC Heterogeneous Processing
What led to increased GPU usage in HPC?
 Difficult to continue scaling processor frequencies
 Difficult to continue increasing power consumption
 Instead, move toward:
• Data level parallelism
– Vector units, SIMD execution
• Thread level parallelism
– Multithreading, multicore, manycore
*based on NVIDIA presentation
Emergence of HPC Heterogeneous Processing
GPU versus CPU comparison
 Different goals produce different designs
• GPU assumes the workload is highly parallel
• CPU must be good at everything, parallel or not
 CPU: minimize latency experienced by one thread
• Big on-chip caches
• Sophisticated control logic
 GPU: maximize throughput of all threads
• Number of threads in flight limited by resources => lots of resources (registers, bandwidth, etc.)
• Multithreading can hide latency => skip the big caches
• Share control logic across many threads
*from NVIDIA presentation
Heterogeneous Processors
Mixing GPU and CPU usage in applications
 Multicore CPU plus manycore GPU
 Data must be transferred between the CPU and the GPU for the GPU to operate on it and return the new values
[Diagram: multicore CPU connected to a manycore GPU; *NVIDIA image]
Programming for GPGPU
Prominent models for programming the GPGPU — augment current languages to access GPU strengths
 NVIDIA CUDA
• Scalable parallel programming model
• Minimal extensions to the familiar C/C++ environment
• Heterogeneous serial-parallel computing
• Supports NVIDIA only
 OpenCL (Open Computing Language)
• Open source, royalty-free
• Portable, can run on different types of devices
• Runs on AMD, Intel, and NVIDIA
 OpenACC
• Provides directives (hint commands inserted into source)
• Directives tell the compiler where to create acceleration (GPU) code, without the user modifying or adapting the code
Optimal Heterogeneous Execution
Considerations for best overall performance?
 Is it profitable to send a piece of work to the GPU? (the key issue)
 How much work is there to be done?
 Is there a vectorization opportunity?
 How is the parallel scaling for the application overall?
 What is the cost of the transfer of data to and from the GPU?
 Are there opportunities to chain together operations so the data can stay in the GPU for multiple operations?
 Can you balance the GPU and CPU workload?
GPGPU Performance Monitoring
How can performance tools help optimize code?
 Is it profitable to send a piece of work to the GPU?
• Tools can tell you this by measuring the costs:
– Transferring data to and from the GPU
– How much time is spent in the GPU versus the CPU
 Is there a vectorization opportunity?
• Could measure the mathematical operations versus the vector operations occurring in the application; experiment with compiler optimization levels, re-measure the operations, and compare
 How is the parallel scaling for the application overall?
• Use a performance tool to get an idea of real performance versus the expected parallel speed-up
Open|SpeedShop GPU support
What performance info will Open|SpeedShop provide?
 Reporting the time spent in the GPU device
 Reporting the cost and size of data transferred to and from the GPU
 Reporting information to understand the balance of CPU versus GPU utilization
 Reporting information to understand the balance of data transfers between the host and device memory with the execution of computational kernels
 Reporting information to understand the performance of the internal computational kernel code running on the GPU device
Open|SpeedShop GPU support
What specific data will Open|SpeedShop gather?
 Collect data for the following CUDA actions:
• Kernel invocations
• Memory copies
• Memory sets
 For the above we always gather this information:
• Thread
• Time the invocation/copy/set was requested
• CUDA context & stream
• Call site of the invocation/copy
Open|SpeedShop GPU support
What specific data will Open|SpeedShop gather?
 Action-specific data collected for kernel invocations:
• Time the kernel execution actually began/ended
• Kernel name
• Grid & block dimensions
• Cache preferences
• Registers
• Shared (both static & dynamic) memory
• Local memory reserved
 Kernel invocation representation:
• {kernel-name}<<<[{gx}, {gy}, {gz}], [{bx}, {by}, {bz}], {dynamic-shared-memory}>>>
– Where each {} is a value that is filled in from the data
– The format above is the same syntax that is used for CUDA kernel invocations in the actual source
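For example, the CUDA trace view a few slides ahead contains entries such as:

compute_degrees(int*, int*, int, int)<<<[256,1,1], [64,1,1]>>>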
Open|SpeedShop GPU support
What specific data will Open|SpeedShop gather?
 Action-specific data collected for memory copies:
• Time the copy actually began/ended
• Number of bytes copied
• Kind of copy (host to device, etc.)
• Memory source type (pageable/pinned/device/array)
• Memory destination type (pageable/pinned/device/array)
• Was the copy asynchronous?
 Memory copy representation:
• Copy {nbytes} from {source} to {destination}
– Where each {} is again a value filled in from the data; {source} and {destination} are things like "host", "device", "array", etc.
Open|SpeedShop GPU support
What specific data will Open|SpeedShop gather?
 Action-specific data collected for memory sets:
• Time the set actually began/ended
• Number of bytes set
 Memory set representation:
• Set {nbytes} on device
– Where {nbytes} is again a value filled in from the data
 Can show overview information for copies to/from the device:

Exclusive Call  % of      Number    Function (defining location)
Time(ms)        Total     of Calls
11.172864       1.061071  1         copy 64 MB from host to device (CUDA)
0.371616        0.035292  1         copy 2.1 MB from host to device (CUDA)
CUDA Trace View (Chronological order)
openss>>expview -v trace

Start Time(d:h:m:s)      I/O Call   % of      Call Stack Function (defining location)
                         Exclusive  Total
                         Time(ms)
2013/08/21 18:31:21.611  11.172864  1.061071  copy 64 MB from host to device (CUDA)
2013/08/21 18:31:21.622   0.371616  0.035292  copy 2.1 MB from host to device (CUDA)
2013/08/21 18:31:21.623   0.004608  0.000438  copy 16 KB from host to device (CUDA)
2013/08/21 18:31:21.623   0.003424  0.000325  set 4 KB on device (CUDA)
2013/08/21 18:31:21.623   0.003392  0.000322  set 137 KB on device (CUDA)
2013/08/21 18:31:21.623   0.120896  0.011481  compute_degrees(int*, int*, int, int)<<<[256,1,1], [64,1,1]>>> (CUDA)
2013/08/21 18:31:21.623  13.018784  1.236375  QTC_device(float*, char*, char*, int*, int*, int*, float*, int*, int, int, int, float, int, int, int, int, bool)<<<[256,1,1], [64,1,1]>>> (CUDA)
2013/08/21 18:31:21.636   0.035232  0.003346  reduce_card_device(int*, int)<<<[1,1,1], [1,1,1]>>> (CUDA)
2013/08/21 18:31:21.636   0.002112  0.000201  copy 8 bytes from device to host (CUDA)
2013/08/21 18:31:21.636   1.375616  0.130640  trim_ungrouped_pnts_indr_array(int, int*, float*, int*, char*, char*, int*, int*, float*, int*, int, int, int, float, int, bool)<<<[1,1,1], [64,1,1]>>> (CUDA)
2013/08/21 18:31:21.638   0.001344  0.000128  copy 260 bytes from device to host (CUDA)
2013/08/21 18:31:21.638   0.025600  0.002431  update_clustered_pnts_mask(char*, char*, int)<<<[1,1,1], [64,1,1]>>> (CUDA)
Open|SpeedShop GPU support
What is the status of this Open|SpeedShop work?
 Completed, but not hardened:
• NVIDIA GPU performance data collection
• Packaging the data into the Open|SpeedShop database
• Retrieving the data and forming a client and GUI view
– There is a rudimentary view, modeled similarly to the Open|SpeedShop I/O and I/O-extended views
– Function view, trace view, call path views
 Future work, if funded:
• Pursue obtaining more information about the code being executed inside the GPU device
• Research and support MIC and other accelerators
• Analysis of performance data and more meaningful views
• Look into supporting OpenCL
SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop
Section 8
Conclusions: DIY and Future Work
11/18/2013
Getting Open|SpeedShop
 SourceForge project home
• http://sourceforge.net/projects/openss
 CVS access
• http://sourceforge.net/scm/?type=cvs&group_id=176777
 Packages
• Accessible from the project home Download tab
 Installation instructions
• http://www.openspeedshop.org/docs/users_guide/BuildAndInstallGuide.html
• README file in the source distribution
 Additional information
• http://www.openspeedshop.org/
Open|SpeedShop Documentation
 Open|SpeedShop User Guide documentation
• http://www.openspeedshop.org/docs/user_guide/
• …/share/doc/packages/OpenSpeedShop/users_guide
 Python Scripting API documentation
• http://www.openspeedshop.org/docs/pyscripting_doc/
• …/share/doc/packages/OpenSpeedShop/pyscripting_doc
 Command Line Interface documentation
• http://www.openspeedshop.org/docs/user_guide/
• …/share/doc/packages/OpenSpeedShop/users_guide
• Man pages: OpenSpeedShop, osspcsamp, ossmpi, …
 Quick start guide downloadable from the web site
• http://www.openspeedshop.org
• Click on the "Download Quick Start Guide" button
Platforms
 System architecture
• AMD Opteron/Athlon
• Intel x86, x86-64, and Itanium-2
• IBM PowerPC and PowerPC64
 Operating system
• Tested on many popular Linux distributions
– SLES, SUSE
– RHEL
– Fedora Core, CentOS
– Debian, Ubuntu
– Varieties of the above
 Large scale platforms
• IBM Blue Gene/L, /P, and /Q
• Cray XT/XE/XK line
• GPU support available as an alpha version
• Available on many DOE systems in shared locations
• For special build & usage instructions, see the web site
O|SS in Transition
 Largest bottleneck in existing O|SS
• Data handling from the instrumentor to the analysis
• Need a hierarchical approach to conquer large scale
• A generic requirement for a wide range of tools
 Component Based Tool Framework (CBTF) project
• Generic tool and data transport framework
– Create "black box" components
– Connect these components into a component network
– Distribute these components and/or component networks across a tree-based transport mechanism (MRNet)
• Goal: reuse of components
– Aimed at giving the ability to rapidly prototype performance tools
– Create specialized tools to meet the needs of application developers
• http://ft.ornl.gov/doku/cbtfw/start
CBTF Structure and Dependencies
 Minimal dependencies
• Easier builds
 Tool-type independent
• Performance tools
• Debugging tools
• etc.
 Completed components
• Base library (libcbtf)
• XML-based component networks (libcbtf-xml)
• MRNet distributed components (libcbtf-mrnet)
 Ability for online aggregation
• Tree-based communication
• Online data filters
• Configurable
[Diagram: CBTF library stack — cbtf-gui, libcbtf-tcpip, libcbtf-mrnet, libcbtf-xml, and libcbtf — with external dependencies Qt, MRNet, Xerces-C, and Boost, connecting an MPI application to a tool]
Open|SpeedShop and CBTF
 CBTF will be/is the new platform for O|SS
• More scalability
• Aggregation functionality for new experiments
• Built with a new "instrumentor class"
 Existing experiments will not change
• Same workflow
• Same convenience scripts
 New experiments enabled through CBTF
• Advanced I/O profiling
• Threading experiments
• Memory usage analysis
• GPU/accelerator support
Work in Progress: New GUI
 Modular GUI framework based on Qt4
• Base for multiple tools
• Ability for remote connections
Summary
 Performance analysis is critical on modern systems
• Complex architectures vs. complex applications
• Need to break black-box behavior at multiple levels
• Lots of performance left on the table by default
 Performance tools can help
• Open|SpeedShop as one comprehensive option
• Scalability of tools is important
– Performance problems often appear only at scale
– We will see more and more online aggregation approaches
– CBTF as one generic framework to implement such tools
 Critical:
• Asking the right questions
• Comparing answers with good baselines or intuition
• Starting at a high level and iteratively digging deeper
Questions vs. Experiments
 Where do I spend my time?
• Flat profiles (pcsamp)
• Getting inclusive/exclusive timings with callstacks (usertime)
• Identifying hot callpaths (usertime + HP analysis)
 How do I analyze cache performance?
• Measure memory performance using hardware counters (hwc)
• Compare to flat profiles (custom comparison)
• Compare multiple hardware counters (N x hwc, hwcsamp)
 How to identify I/O problems?
• Study time spent in I/O routines (io)
• Compare runs under different scenarios (custom comparisons)
 How do I find parallel inefficiencies?
• Study time spent in MPI routines (mpi)
• Look for load imbalance (LB view) and outliers (CA view)
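In terms of the convenience scripts used throughout this tutorial, those experiments map to command lines like the following sketch (substitute your own application and launch command):

osspcsamp   "mpirun -np 128 app"   # flat profile
ossusertime "mpirun -np 128 app"   # inclusive/exclusive timings with callstacks
osshwc      "mpirun -np 128 app"   # hardware counter experiment
ossio       "mpirun -np 128 app"   # time spent in I/O routines
ossmpi      "mpirun -np 128 app"   # time spent in MPI routines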
Availability and Contact
 Current version: 2.1 has been released
 Open|SpeedShop website
• http://www.openspeedshop.org/
 Open|SpeedShop forum (now: Google Groups)
• http://www.openspeedshop.org/forums/
 Download options:
• Package with install script
• Source for tool and base libraries
 Platform-specific build information
• http://www.openspeedshop.org/wp/platform-specific-build-information/
 Feedback
• Bug tracking available from the website
• [email protected]
• Feel free to contact the presenters directly
• Support contracts and onsite training available