SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop

Martin Schulz, LLNL; Jim Galarowicz, Krell Institute
Jennifer Green, LANL; Don Maghrak, Krell Institute
LLNL-PRES-503451, 11/18/2013

Why This Tutorial?

Performance analysis is becoming more important
- More than just measuring time
- Complex architectures and complex applications
- Mapping applications onto architectures

Key questions
- What are the critical sections in a code?
- Is a part of the code running efficiently or not?
- Is the code using the underlying resources well?
- Where is the greatest payoff for optimization?

Often hard to know where to start
- Which experiments to run first?
- How to plan follow-on experiments?
- What kind of problems can be explored?
- How to interpret the data?

Tutorial Goals

Basic introduction into performance analysis
- Typical pitfalls with respect to performance
- Wide range of types of performance tools and techniques

Provide basic guidance on ...
- How to understand the performance of a code
- How to answer basic performance questions
- How to plan performance experiments

Provide you with the ability to ...
- Run these experiments on your own code
- Have a starting point for performance optimizations

Open|SpeedShop as the running example
- Introduction into one possible tool solution
- Basic usage instructions
- Pointers to additional documentation

Open|SpeedShop Tool Set

Open source performance analysis tool framework
- Most common performance analysis steps all in one tool
- Combines tracing and sampling techniques
- Extensible by plugins for data collection and representation
- Gathers and displays several types of performance information

Flexible and easy to use
- User access through: GUI, command line, Python scripting, convenience scripts

Scalable data collection
- Instrumentation of unmodified application binaries
- New option for hierarchical online data aggregation

Supports a wide range of systems
- Extensively used and tested on a variety of Linux clusters
- Cray XT/XE/XK and Blue Gene L/P/Q support

"Rules"

Let's keep this interactive
- Feel free to ask questions as we go along
- Online demos as we go along
- Ask if you would like to see anything specific

We are interested in feedback
- Both about O|SS and the tutorial in general
- What was clear / what didn't make sense?
- What scenarios are missing?
- How well does O|SS (or other tools) address the needs?
- For O|SS: please report bugs and incompatibilities

Updated slides: http://www.openspeedshop.org/wp/category/tutorials
(then choose the SC2013 Monday Nov 18 tutorial)

Presenters

- Jim Galarowicz, Krell
- Donald Maghrak, Krell
- Martin Schulz, LLNL
- Jennifer Green, LANL

Larger team
- William Hachfeld and Dave Whitney, Krell
- Dane Gardner, Argo Navis/Krell
- Matt Legendre and Chris Chambreau, LLNL
- Mahesh Rajan, SNL
- David Montoya, Mike Mason, Phil Romero, LANL
- Dyninst group (Bart Miller, UW & Jeff Hollingsworth, UMD)
- Phil Roth, ORNL
- Ciera Jaspan, CMU

Outline

Welcome
1) Concepts in performance analysis
2) Introduction into Tools and Open|SpeedShop
3) Basic timing experiments and their pros/cons
4) Advanced analysis with hardware counter experiments
5) Analysis of I/O
6) Analysis of parallel codes (MPI, threaded applications)
7) Analysis on heterogeneous platforms (e.g., for GPUs)
8) Conclusions: DIY and future work

SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop

Section 1: Concepts in Performance Analysis

Typical Development Cycle

Performance tuning is an essential part of the development cycle
- Potential impact at every stage:
  • Algorithms
  • Message patterns
  • Data structure layout
- Should be done from early on in the life of a new HPC code

Typical use
- Measure performance
- Analyze data
- Modify code and/or algorithm
- Repeat measurements
- Analyze differences

[Diagram: Algorithm -> Coding -> Code/Binary -> Debugging -> Correct Code -> Tuning -> Efficient Code]

What to Look For: Sequential Runs

Step 1: Identify computationally intensive parts
- Where am I spending my time?
  • Modules/libraries
  • Functions
  • Statements
- Is the time spent in the computational kernels?
- Does this match my intuition?

Impact of the memory hierarchy (CPU -> L1 cache -> L2 cache -> shared L3 cache -> main memory)
- Do I have excessive cache misses?
- How is my data locality?
- Impact of TLB misses?

External resources
- Is my I/O efficient?
- Time spent in system libraries?

What to Look For: Shared Memory

Shared memory model
- Single shared storage
- Accessible from any CPU

Common programming models
- Explicit threads (e.g., Pthreads)
- OpenMP

Typical performance issues
- False cache sharing
- Excessive synchronization
- Limited work per thread
- Threading overhead

Complications: NUMA
- Memory locality critical
- Thread-to-memory assignments matter
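False cache sharing, the first issue above, is worth a concrete picture (Section 4 revisits it with measured counter data on a Fortran kernel). A minimal C/OpenMP sketch, assuming a 64-byte cache line; the padded variant keeps each thread's accumulator on its own line:

    #include <omp.h>
    #include <stdio.h>

    #define N_THREADS 4
    #define ITERS 100000000L

    /* Unpadded: the four counters share one cache line, so every update by
       one thread invalidates the line in the other cores' caches. */
    long counters[N_THREADS];

    /* Padded: one counter per 64-byte line avoids the invalidation traffic. */
    struct { long v; char pad[64 - sizeof(long)]; } padded[N_THREADS];

    int main(void) {
        double t0 = omp_get_wtime();
        #pragma omp parallel num_threads(N_THREADS)
        { int id = omp_get_thread_num();
          for (long i = 0; i < ITERS; i++) counters[id]++; }
        double t1 = omp_get_wtime();

        #pragma omp parallel num_threads(N_THREADS)
        { int id = omp_get_thread_num();
          for (long i = 0; i < ITERS; i++) padded[id].v++; }
        double t2 = omp_get_wtime();

        printf("shared line: %.3fs, padded: %.3fs\n", t1 - t0, t2 - t1);
        return 0;
    }

Compiled with OpenMP enabled (e.g., -fopenmp), the unpadded loop typically runs several times slower even though both loops do identical work.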
What to Look For: Message Passing

Distributed memory model
- Sequential/shared memory nodes coupled by a network
- Only local memory access
- Data exchange using message passing (e.g., MPI)

[Diagram: nodes with local memory, application + MPI library + NIC, connected by a network]

Typical performance issues
- Long blocking times waiting for data
- Low message rates
- Limited network bandwidth
- Global collective operations

A Case for Performance Tools

First line of defense
- Full execution timings (UNIX "time" command)
- Comparisons between input parameters
- Keep and track historical trends

Disadvantages
- Measurements are coarse grain
- Can't pin down performance bottlenecks

Alternative: code integration of performance probes
- Hard to maintain
- Requires significant a priori knowledge

Performance tools
- Enable fine-grain instrumentation
- Show relation to source code
- Work universally across applications

How to Select a Tool?

A tool with the right features
- Must be easy to use
- Provides the capability to analyze the code at different levels

A tool must match the application's workflow
- Requirements from the instrumentation technique
  • Access to and knowledge about source code? Recompilation time?
  • Machine environments? Supported platforms? Remote execution/analysis?
- Local support and installation

Performance Tools Overview

Basic OS tools
- time, gprof, strace

Sampling tools
- Typically work on unmodified binaries, callstack analysis
- HPCToolkit (Rice U.)

MPI or OpenMP profiles
- mpiP (LLNL & ORNL)
- ompP (LMU Munich)

Tracing tool kits
- Capture all MPI events, present as timeline
- Vampir (TU Dresden)
- Jumpshot (ANL)

Trace analysis
- Profile and trace capture, automatic (parallel) trace analysis
- Kojak/Scalasca (JSC)
- Paraver (BSC)

Integrated tool kits
- Typically profiling and tracing in a combined workflow
- Typically GUI / some visualization support
- Binary instrumentation: Open|SpeedShop (Krell/Tri-Lab)
- Source instrumentation: TAU (U. of Oregon)

Hardware counters
- Profiling/direct measurements
- PAPI API & tool set
- hwctime (AIX)

Specialized tools/techniques
- Libra (LLNL): load balance analysis
- Boxfish (LLNL/Utah/Davis): 3D visualization of torus networks
- Rubik (LLNL): node mapping on torus architectures

Vendor tools

Why We Picked/Developed Open|SpeedShop?
- Sampling and tracing in a single framework
- Easy to use
  • GUI & command line options for remote execution
  • Low learning curve for end users
- Transparent instrumentation (preloading & binary)
  • No need to recompile the application

SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop

Section 2: Introduction into Tools and Open|SpeedShop

Open|SpeedShop Tool Set (recap)
- Open source performance analysis tool framework: the most common performance analysis steps in one tool, combining tracing and sampling techniques, extensible by plugins for data collection and representation
- Flexible and easy to use: GUI, command line, Python scripting, convenience scripts
- Scalable data collection on unmodified application binaries, with a new option for hierarchical online data aggregation
- Supports a wide range of systems: Linux clusters, Cray XT/XE/XK, Blue Gene L/P/Q

Open|SpeedShop Workflow

Instead of launching the MPI application directly:
  srun -n4 -N1 smg2000 -n 65 65 65
wrap the full launch command in quotes and hand it to a convenience script:
  osspcsamp "srun -n4 -N1 smg2000 -n 65 65 65"
O|SS runs the application, gathers performance data, and stores it for post-mortem analysis.
http://www.openspeedshop.org/

Alternative Interfaces

Scripting language
- Immediate command interface
- O|SS interactive command line (CLI)
  • Experiment commands: expAttach, expCreate, expDetach, expGo, expView
  • List commands: list -v exp, list -v hosts, list -v src
  • Session commands: setBreak, openGui

Python module:

  import openss
  my_filename = openss.FileList("myprog.a.out")
  my_exptype = openss.ExpTypeList("pcsamp")
  my_id = openss.expCreate(my_filename, my_exptype)
  openss.expGo()
  my_metric_list = openss.MetricList("exclusive")
  my_viewtype = openss.ViewTypeList("pcsamp")
  result = openss.expView(my_id, my_viewtype, my_metric_list)

Central Concept: Experiments

Users pick experiments:
- What to measure and from which sources?
- How to select, view, and analyze the resulting data?
Two main classes:

Statistical sampling
- Periodically interrupt execution and record location
- Useful to get an overview
- Low and uniform overhead

Event tracing
- Gather and store individual application events
- Provides detailed per-event information
- Can lead to huge data volumes

O|SS can be extended with additional experiments

Sampling Experiments in O|SS

PC Sampling (pcsamp)
- Record the PC repeatedly at a user-defined time interval
- Low-overhead overview of time distribution
- Good first step, lightweight overview

Call Path Profiling (usertime)
- PC sampling and call stacks for each sample
- Provides inclusive and exclusive timing data
- Use to find hot call paths, who is calling whom

Hardware Counters (hwc, hwctime, hwcsamp)
- Access to data like cache and TLB misses
- hwc, hwctime:
  • Sample a HWC event based on an event threshold
  • Default event is PAPI_TOT_CYC overflows
- hwcsamp:
  • Periodically sample up to 6 counter events
  • Default events are PAPI_FP_OPS and PAPI_TOT_CYC
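Statistical sampling, as used by pcsamp, usertime, and hwcsamp, boils down to a timer interrupt that periodically records where the program is. A minimal sketch of the mechanism (not O|SS code; a real profiler would record the interrupted PC from the signal context instead of just counting):

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    static volatile unsigned long nsamples;

    /* Profiling-timer handler: a real sampler would record the interrupted
       PC here and later map the counts back to functions and statements. */
    static void on_prof(int sig) { (void)sig; nsamples++; }

    int main(void) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_prof;
        sigaction(SIGPROF, &sa, NULL);

        /* 100 samples per second of CPU time, the pcsamp default rate. */
        struct itimerval t = { {0, 10000}, {0, 10000} };
        setitimer(ITIMER_PROF, &t, NULL);

        double s = 0.0;
        for (long i = 1; i <= 300000000L; i++) s += 1.0 / (double)i;  /* work */

        printf("s = %f, samples taken: %lu\n", s, nsamples);
        return 0;
    }

The overhead is one signal per sample regardless of what the code does, which is why sampling stays cheap and uniform, while tracing cost grows with the number of events.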
Tracing Experiments in O|SS

Input/output tracing (io, iop, iot)
- Record invocation of all POSIX I/O events
- Provides aggregate and individual timings
- Lightweight I/O profiling (iop)
- Store function arguments and return code for each call (iot)

MPI tracing (mpi, mpit, mpiotf)
- Record invocation of all MPI routines
- Provides aggregate and individual timings
- Store function arguments and return code for each call (mpit)
- Create Open Trace Format (OTF) output (mpiotf)

Floating-point exception tracing (fpe)
- Triggered by any FPE caused by the application
- Helps pinpoint numerical problem areas

Performance Analysis in Parallel

How to deal with concurrency?
- Any experiment can be applied to a parallel application
  • Important step: aggregation or selection of data
- Special experiments targeting parallelism/synchronization

O|SS supports MPI and threaded codes
- Automatically applied to all tasks/threads
- Default views aggregate across all tasks/threads
- Data from individual tasks/threads available
- Thread support (incl. OpenMP) based on POSIX threads

Specific parallel experiments (e.g., MPI)
- Wraps MPI calls and reports:
  • MPI routine time
  • MPI routine parameter information
- The mpit experiment also stores function arguments and return code for each call

How to Run a First Experiment in O|SS?

1. Picking the experiment: what do I want to measure?
   - We will start with pcsamp to get a first overview
2. Launching the application: how do I control my application under O|SS?
   - Enclose how you normally run your application in quotes:
     osspcsamp "mpirun -np 256 smg2000 -n 65 65 65"
3. Storing the results
   - O|SS will create a database; name: smg2000-pcsamp.openss
4. Exploring the gathered data: how do I interpret the data?
   - O|SS will print a default report
   - Open the GUI to analyze the data in detail (run: "openss")

Example Run with Output (1/2)

osspcsamp "mpirun -np 2 smg2000 -n 65 65 65"

Bash> osspcsamp "mpirun -np 2 ./smg2000 -n 65 65 65"
[openss]: pcsamp experiment using the default sampling rate: "100".
[openss]: Using OPENSS_PREFIX installed in /opt/OSS-mrnet
[openss]: Setting up offline raw data directory in /tmp/jeg/offline-oss
[openss]: Running offline pcsamp experiment using the command:
  "mpirun -np 2 /opt/OSS-mrnet/bin/ossrun "./smg2000 -n 65 65 65" pcsamp"

Running with these driver parameters:
  (nx, ny, nz) = (65, 65, 65)
... <SMG native output> ...
Final Relative Residual Norm = 1.774415e-07

[openss]: Converting raw data from /tmp/jeg/offline-oss into temp file X.0.openss
Processing raw data for smg2000
Processing processes and threads ...
Processing performance data ...
Processing functions and statements ...

Example Run with Output (2/2)

[openss]: Restoring and displaying default view for:
  /home/jeg/DEMOS/demos/mpi/openmpi-1.4.2/smg2000/test/smg2000-pcsamp-1.openss
[openss]: The restored experiment identifier is: -x 1

Exclusive CPU time  % of CPU Time  Function (defining location)
  in seconds.
  3.630000000       43.060498221   hypre_SMGResidual (smg2000: smg_residual.c,152)
  2.860000000       33.926453144   hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
  0.280000000        3.321470937   hypre_SemiRestrict (smg2000: semi_restrict.c,125)
  0.210000000        2.491103203   hypre_SemiInterp (smg2000: semi_interp.c,126)
  0.150000000        1.779359431   opal_progress (libopen-pal.so.0.0.0)
  0.100000000        1.186239620   mca_btl_sm_component_progress (libmpi.so.0.0.2)
  0.090000000        1.067615658   hypre_SMGAxpy (smg2000: smg_axpy.c,27)
  0.080000000        0.948991696   ompi_generic_simple_pack (libmpi.so.0.0.2)
  0.070000000        0.830367734   __GI_memcpy (libc-2.10.2.so)
  0.070000000        0.830367734   hypre_StructVectorSetConstantValues (smg2000: struct_vector.c,537)
  0.060000000        0.711743772   hypre_SMG3BuildRAPSym (smg2000: smg3_setup_rap.c,233)

View with GUI: openss -f smg2000-pcsamp-1.openss

Default Output Report View
- Toolbar to switch views
- Performance data default view: by function (data is the sum from all processes and threads)
- Select "Functions", click the D-icon
- Graphical representation

Statement Report Output View
- Performance data view choice: Statements
- Select "Statements", click the D-icon
- Shows the statement in the program that took the most time

Associate Source & Performance Data
- Double click to open the source window
- Use window controls to split/arrange windows
- The selected performance data point is highlighted in the source

Summary

- Place the way you run your application normally in quotes and pass it as an argument to osspcsamp, or any of the other experiment convenience scripts (ossio, ossmpi, etc.):
    osspcsamp "srun -N 8 -n 64 ./mpi_application app_args"
- Open|SpeedShop sends a summary profile to stdout
- Open|SpeedShop creates a database file
- Display alternative views of the data with the GUI via:
    openss -f <database file>
- Display alternative views of the data with the CLI via:
    openss -cli -f <database file>
- On clusters, you need to set OPENSS_RAWDATA_DIR
  • Should point to a directory in a shared file system
  • More on this later; usually done in a module or dotkit file
- Start with pcsamp for an overview of performance
- Then home in on performance issues with other experiments
Digging Deeper

Multiple interfaces
- GUI for easy display of performance data
- CLI makes remote access easy
- Python module allows easy integration into scripts

Usertime experiments provide inclusive/exclusive times
- Time spent inside a routine vs. its children
- Key view: butterfly

Comparisons
- Between experiments to study improvements/changes
- Between ranks/threads to understand differences/outliers

Dedicated views for parallel executions
- Load balance view
- Use custom comparison to compare ranks or threads

SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop

Section 3: Basic timing experiments and their Pros/Cons

Step 1: Flat Profile

Answers a basic question: where does my code spend its time?
- Time spent at each function

Why use a profile?
- Aggregates (sums) measurements during collection, over time and code sections
- Reduced size of performance data
- Typically collected with low overhead
- Provides a good overview of application performance

Disadvantages of using a profile
- Requires an a priori definition of the aggregation
- Omits performance details of individual events
- Possible sampling frequency skew

Flat Profiles in O|SS: PCSamp

Basic syntax (same as in the first experiment above):
  osspcsamp "how you run your executable normally"

Examples:
  osspcsamp "smg2000 -n 50 50 50"
  osspcsamp "smg2000 -n 50 50 50" high

Optional parameter: sampling frequency (samples per second)
- Alternative parameter: high (200) | low (50) | default (100)

Recommendation: compile the code with -g to get statements!

How to Read Flat Profiles
- Performance data view choice: Statements
- Select "Statements", click the D-icon
- Results are aggregated across all ranks, processes, threads
- Shows the function in the program that took the most time

Identifying Critical Regions

Profiles show computationally intensive code regions
- First views: time spent per function or per statement

Questions:
- Are those functions/statements expected?
- Do they match the computational kernels?
- Are any runtime functions taking a lot of time?

Identify bottleneck components
- View the profile aggregated by shared objects
- Correct/expected modules?
- Impact of support and runtime libraries

Adding Context through Stack Traces

Missing information in flat profiles
- Distinguish routines called from multiple callers
- Understand the call invocation history
- Context for performance data

Critical technique: stack traces
- Gather a stack trace for each performance sample
- Aggregate only samples with an equal trace

User perspective:
- Butterfly views (caller/callee relationships)
- Hot call paths
  • Paths through the application that take the most time

[Diagram: call tree with Function A calling Functions B and C, which in turn call Functions D and E]
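Why a flat profile loses this context is easy to see in code. A hypothetical C sketch: a flat profile charges all of smooth()'s time to a single bucket, while a call-path profile (usertime) splits it between the two call sites:

    #include <stdio.h>

    /* All the work happens here; a flat profile reports one number for it. */
    static double smooth(const double *v, long n) {
        double s = 0.0;
        for (long i = 1; i + 1 < n; i++)
            s += 0.25 * v[i-1] + 0.5 * v[i] + 0.25 * v[i+1];
        return s;
    }

    /* Two different callers: a call-path profile attributes smooth()'s time
       separately to each caller; a flat profile cannot tell them apart. */
    static double restrict_phase(const double *v, long n) {
        return smooth(v, n);
    }
    static double interpolate_phase(const double *v, long n) {
        double s = 0.0;
        for (int k = 0; k < 9; k++) s += smooth(v, n);  /* 9x more calls */
        return s;
    }

    int main(void) {
        enum { N = 1000000 };
        static double v[N];
        for (long i = 0; i < N; i++) v[i] = (double)i;
        double s = 0.0;
        for (int it = 0; it < 100; it++)
            s += restrict_phase(v, N) + interpolate_phase(v, N);
        printf("%g\n", s);
        return 0;
    }

Here roughly 90% of smooth()'s inclusive time belongs to interpolate_phase(), a fact only the call-path views below can show.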
Inclusive vs. Exclusive Timing

Stack traces enable the calculation of inclusive/exclusive times
- Time spent inside a function only (exclusive)
  • e.g., the exclusive time for Function B
- Time spent inside a function and its children (inclusive)
  • e.g., the inclusive time for Function C and its children

Implementation similar to flat profiles
- Sample PC information
- Additionally collect call stack information at every sample

Tradeoffs
- Pro: obtain additional context information
- Con: higher overhead / lower sampling rate

In/Exclusive Time in O|SS: Usertime

Basic syntax:
  ossusertime "how you run your executable normally"

Examples:
  ossusertime "smg2000 -n 50 50 50"
  ossusertime "smg2000 -n 50 50 50" low

Parameters: sampling frequency (samples per second)
- Alternative parameter: high (70) | low (18) | default (35)

Recommendation: compile the code with -g to get statements!

Reading Inclusive/Exclusive Timings

Default view
- Similar to the pcsamp view from the first example
- Calculates inclusive versus exclusive times (shown in separate columns)

Interpreting Call Context Data

Inclusive versus exclusive times
- If similar: child executions are insignificant
  • May not be useful to profile below this layer
- If the inclusive time is significantly greater than the exclusive time:
  • Focus attention on the execution times of the children

Hot path analysis
- Which paths take the most time?

Butterfly analysis (similar to gprof)
- Should be done on "suspicious" functions:
  • Functions with large execution time
  • Functions with a large difference between inclusive and exclusive time
  • Functions of interest
  • Functions that "take unexpectedly long"
- Shows the split of time between callees and callers

Stack Trace Views: Hot Call Path

Access to call paths:
- All call paths (C+)
- All call paths for the selected function (C)

Stack Trace Views: Butterfly View

Similar to the well-known "gprof" tool
- Callers of "hypre_SMGSolve"
- Pivot routine "hypre_SMGSolve"
- Callees of "hypre_SMGSolve"

Comparing Performance Data

Key functionality for any performance analysis
- Typical examples:
  • Before/after optimization
  • Different configurations or inputs
  • Different ranks, processes or threads

Very limited support in most tools
- Manual operation after multiple runs
- Requires lining up profile data
- Even harder for traces

Absolute numbers often don't help
- Need some kind of baseline / number to compare against

Open|SpeedShop has support to line up profiles
- Perform multiple experiments and create multiple databases
- Script to load all experiments and create multiple columns
Comparing Performance Data in O|SS

Convenience script: osscompare
- Compares up to 8 Open|SpeedShop databases to each other
  • Syntax: osscompare "db1.openss,db2.openss,..." [options]
  • The osscompare man page has more details
- Produces a side-by-side comparison listing
- Metric option parameter:
  • Compare based on: time, percent, a hwc counter, etc.
- Limit the number of lines with the "rows=nn" option
- Specify viewtype=[functions|statements|linkedobjects]
  • Controls the view granularity: compare at the function, statement, or library level
  • Function level is the default: the compare is done on the performance of functions in each of the databases
  • If the statements option is specified, all comparisons are made per statement across all specified databases; similarly for libraries if linkedobjects is selected

Comparison Report in O|SS

osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss"
[openss]: Legend: -c 2 represents smg2000-pcsamp.openss
[openss]: Legend: -c 4 represents smg2000-pcsamp-1.openss

-c 2, Exclusive CPU  -c 4, Exclusive CPU  Function (defining location)
time in seconds.     time in seconds.
  3.870000000          3.630000000        hypre_SMGResidual (smg2000: smg_residual.c,152)
  2.610000000          2.860000000        hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
  2.030000000          0.150000000        opal_progress (libopen-pal.so.0.0.0)
  1.330000000          0.100000000        mca_btl_sm_component_progress (libmpi.so.0.0.2: topo_unity_component.c,0)
  0.280000000          0.210000000        hypre_SemiInterp (smg2000: semi_interp.c,126)
  0.280000000          0.040000000        mca_pml_ob1_progress (libmpi.so.0.0.2: topo_unity_component.c,0)

Summary / Timing Analysis

Typical starting point: flat profile
- Aggregated information on where time is spent in a code
- Low and uniform overhead when implemented as sampling

Adding context
- From where was a routine called, and which routine did it call?
- Technique: stack traces combined with sampling
- Enables the calculation of exclusive and inclusive timing

Key analysis options
- Hot call paths that contain the most execution time
- Butterfly view to show relations to parents/children

Comparative analysis
- Absolute numbers often carry little meaning
- Need the correct baseline

SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop

Section 4: Advanced analysis: Hardware Counter Experiments

What Timing Alone Doesn't Tell You ...

Timing information shows where you spend your time
- Hot functions / statements / libraries
- Hot call paths

BUT: it doesn't show you why
- Are the computationally intensive parts efficient?
- Which resources constrain execution?
The answer can be very platform dependent
- Bottlenecks may differ
- Cause of missing performance portability
- Need to tune to architectural parameters

Next: investigate hardware/application interaction
- Efficient use of hardware resources
- Architectural units (on/off chip) that are stressed

The Memory System

Modern memory systems are complex
- Deep hierarchies
- Explicitly managed memory
- NUMA behavior
- Streaming/prefetching

Key: locality
- Accessing the same data repeatedly (temporal)
- Accessing neighboring data (spatial)

Information to look for
- Read/write intensity
- Prefetch efficiency
- Cache miss rates at all levels
- TLB miss rates
- NUMA overheads

Other Architectural Features

Computational intensity
- Cycles per instruction (CPI)
- Number of floating point instructions

Branches
- Number of branches taken (pipeline flushes)
- Mis-speculations / wrong branch prediction results

SIMD/multimedia/streaming extensions
- Are they used in the respective code pieces?

System-wide information
- I/O busses
- Network counters
- Power/temperature sensors
- Difficulty: relating this information to source code

Hardware Performance Counters

Architectural features
- Typically/mostly packaged inside the CPU
- Count hardware events transparently, without overhead

Newer platforms also provide system counters
- Network cards and switches
- Environmental sensors

Drawbacks
- Availability differs between platforms & processors
- Slight semantic differences between platforms
- In some cases: requires privileged access & kernel patches

Recommended: access through PAPI
- API for tools + simple runtime tools
- Abstractions for system-specific layers
- http://icl.cs.utk.edu/papi/

The O|SS HWC Experiments

Provides access to hardware counters
- Implemented on top of PAPI
- Access to PAPI and native counters
- Examples: cache misses, TLB misses, bus accesses

Basic model 1: timer-based sampling (HWCsamp)
- Samples at a set sampling rate for the chosen events
- Supports multiple counters
- Lower statistical accuracy
- Can be used to estimate a good threshold for hwc/hwctime

Basic model 2: thresholding (HWC and HWCtime)
- User selects one counter
- Run until a fixed number of events has been reached
- Take a PC sample at that location
  • HWCtime also records the stack trace
- Reset the number of events
- The ideal number of events (threshold) depends on the application

Examples of Typical Counters

PAPI Name      Description                            Threshold
PAPI_L1_DCM    L1 data cache misses                   high
PAPI_L2_DCM    L2 data cache misses                   high/medium
PAPI_L1_DCA    L1 data cache accesses                 high
PAPI_FPU_IDL   Cycles in which FPUs are idle          high/medium
PAPI_STL_ICY   Cycles with no instruction issue       high/medium
PAPI_BR_MSP    Mis-predicted branches                 medium/low
PAPI_FP_INS    Number of floating point instructions  high
PAPI_LD_INS    Number of load instructions            high
PAPI_VEC_INS   Number of vector/SIMD instructions     high/medium
PAPI_HW_INT    Number of hardware interrupts          low
PAPI_TLB_TL    Number of TLB misses                   low

Note: Threshold indications are just rough guidance and depend on the application.
Note: Not all counters exist on all platforms (check with papi_avail).
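O|SS gathers these counters for you, but the underlying PAPI library is simple enough to use directly when you want a quick check. A minimal sketch, assuming PAPI is installed and the two default hwcsamp events are available on the machine (compile with -lpapi):

    #include <papi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        long long values[2];
        int es = PAPI_NULL;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
        if (PAPI_create_eventset(&es) != PAPI_OK) exit(1);
        /* The two events O|SS samples by default in hwcsamp. */
        if (PAPI_add_event(es, PAPI_TOT_CYC) != PAPI_OK) exit(1);
        if (PAPI_add_event(es, PAPI_FP_OPS) != PAPI_OK) exit(1);

        double s = 0.0;
        PAPI_start(es);
        for (long i = 1; i < 50000000L; i++) s += 1.0 / (double)i;  /* kernel */
        PAPI_stop(es, values);

        printf("sum=%f cycles=%lld fp_ops=%lld flops/cycle=%.3f\n",
               s, values[0], values[1], (double)values[1] / (double)values[0]);
        return 0;
    }

This counts events for the whole region between start and stop; what the O|SS experiments add on top is the attribution of those counts to functions and statements.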
Recommended Start: HWCsamp

osshwcsamp "<command> <args>" [ default | <PAPI_event_list> | <sampling_rate> ]
- Default events: PAPI_TOT_CYC and PAPI_FP_OPS
- Default sampling_rate: 100
- <PAPI_event_list>: comma-separated PAPI event list (a maximum of 6 events that can be combined)
- <sampling_rate>: integer sampling rate

Sequential job example:
  osshwcsamp "smg2000"
Parallel job example:
  osshwcsamp "mpirun -np 128 smg2000 -n 50 50 50" PAPI_L1_DCM,PAPI_L1_TCA 50

Use the event count values to guide the selection of thresholds for the HWC
and HWCtime experiments for deeper analysis.

Selecting the Counters & Sampling Rate

For osshwcsamp, Open|SpeedShop supports ...
- Derived and non-derived PAPI presets
  • All derived and non-derived events reported by "papi_avail"
  • Also reported by running "osshwcsamp" with no arguments
  • Ability to sample up to six (6) counters at one time
- All native events
  • Architecture specific (incl. naming)
  • Names listed in the PAPI documentation
  • Native events reported by "papi_native_avail"

Sampling rate depends on the application
- Overhead vs. accuracy
  • A lower sampling rate causes fewer samples

NOTE: If a counter does not appear in the output, there may be a conflict
between the hardware counters. To find conflicts, use:
  papi_event_chooser PRESET <list of events>

Possible HWC Combinations to Use*

Xeons
- PAPI_FP_INS,PAPI_LD_INS,PAPI_SR_INS (load/store info, memory bandwidth needs)
- PAPI_L1_DCM,PAPI_L1_TCA (L1 cache hit/miss ratios)
- PAPI_L2_DCM,PAPI_L2_TCA (L2 cache hit/miss ratios)
- LAST_LEVEL_CACHE_MISSES,LAST_LEVEL_CACHE_REFERENCES (L3 cache info)
- MEM_UNCORE_RETIRED:REMOTE_DRAM,MEM_UNCORE_RETIRED:LOCAL_DRAM (local/nonlocal memory access)

For Opterons only
- PAPI_FAD_INS,PAPI_FML_INS (floating point add/mult)
- PAPI_FDV_INS,PAPI_FSQ_INS (sqrt and divisions)
- PAPI_FP_OPS,PAPI_VEC_INS (floating point and vector instructions)
- READ_REQUEST_TO_L3_CACHE:ALL_CORES,L3_CACHE_MISSES:ALL_CORES (L3 cache)

*Credit: Koushik Ghosh, LLNL

Viewing hwcsamp Data

hwcsamp default view: counters = total cycles & FP ops
- Also has pcsamp information
- Up to six events can be displayed; here we have the default two
Viewing hwcsamp Data in CLI

openss -cli -f smg2000-hwcsamp-1.openss
openss>> [openss]: The restored experiment identifier is: -x 1
openss>> expview

Exclusive CPU time  % of CPU Time  PAPI_TOT_CYC  PAPI_FP_OPS  Function (defining location)
  in seconds.
  3.920000000       44.697833523   11772604888   1198486900   hypre_SMGResidual (smg2000: smg_residual.c,152)
  2.510000000       28.620296465    7478131309    812850606   hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
  0.310000000        3.534777651     915610917     48863259   opal_progress (libopen-pal.so.0.0.0)
  0.300000000        3.420752566     910260309    100529525   hypre_SemiRestrict (smg2000: semi_restrict.c,125)
  0.290000000        3.306727480     874155835     48509938   mca_btl_sm_component_progress (libmpi.so.0.0.2)

openss>> expview -m flops
       Mflops  Function (defining location)
478.639000000  hypre_ExchangeLocalData (smg2000: communication.c,708)
456.405900000  hypre_StructAxpy (smg2000: struct_axpy.c,25)

openss>> expview -m flops -v statements
       Mflops  Statement Location (Line Number)
462.420300000  smg3_setup_rap.c(677)
456.405900000  struct_axpy.c(69)
453.119050000  smg3_setup_rap.c(672)
450.492600000  semi_restrict.c(246)
444.427800000  struct_matrix.c(29)

openss>> expview -v linkedobjects
Exclusive CPU time  % of CPU Time  PAPI_TOT_CYC  PAPI_FP_OPS  LinkedObject
  in seconds.
  7.710000000       87.315968290   22748513124   2396367480   smg2000
  0.610000000        6.908267271    1789631493    126423208   libmpi.so.0.0.2
  0.310000000        3.510758777     915610917     48863259   libopen-pal.so.0.0.0
  0.200000000        2.265005663     521249939     46127342   libc-2.10.2.so
  8.830000000      100.000000000   25975005473   2617781289   Report Summary
openss>>

Deeper Analysis with HWC and HWCtime

osshwc[time] "<command> <args>" [ default | <PAPI_event> | <PAPI threshold> | <PAPI_event> <PAPI threshold> ]
- Default: event (PAPI_TOT_CYC), threshold (10000)
- <PAPI_event>: PAPI event name
- <PAPI threshold>: PAPI integer threshold

Sequential job example:
  osshwc[time] "smg2000 -n 50 50 50" PAPI_FP_OPS 50000
Parallel job example:
  osshwc[time] "mpirun -np 128 smg2000 -n 50 50 50"

NOTE: If the output is empty, try lowering the <threshold> value. There may
not have been enough PAPI event occurrences to record and present.

Viewing hwc Data

hwc default view: counter = total cycles
- Flat hardware counter profile of a single hardware counter event
- Exclusive counts only

Viewing hwctime Data

hwctime default view: counter = L1 cache misses
- Calling-context hardware counter profile of a single hardware counter event
- Exclusive/inclusive counts

Hardware Performance Counter Examples

1) Bring out the importance of simple hardware counters through easy-to-understand BLAS1, BLAS2, BLAS3 examples
- Motivated by the difficulty of evaluating the efficiency of a profiled code section from raw counter data (usually a large number)
- Use well-understood kernels' use of hardware components (cache, vector units, etc.); counter data and their ratios from these kernels serve as "markers" to gauge data from real applications

2) Bring out the importance of a few useful hardware counter performance ratios
- Computational intensity, CPI, memory ops per float op, ...
Hardware Performance Counters: The Memory Pyramid and Performance

- Memory closer to the CPU: faster, smaller
- Memory further from the CPU: slower, larger
- The most expensive operation is moving data
- Useful work can only be done on data at the top

Example: Nehalem access latencies (clock cycles)
- L1 = 4
- L2 = 9
- L3 = 47
- Main local NUMA memory = 81
- Main non-local NUMA memory = 128

[Pyramid: CPU -> L1 cache -> L2 cache -> shared L3 cache -> main memory -> remote memory -> disk]

For a given algorithm, serial performance in scientific codes is all about
maximizing the CPU flop rate and minimizing memory operations.

BLAS Operations Illustrate the Impact of Moving Data

A, B, C = n x n matrices; x, y = n x 1 vectors; k = scalar

Level  Operation    # Memory Refs or Ops  # Flops  Flops/Ops  Comments on Flops/Ops
1      y = kx + y   3n                    2n       2/3        Achieved in benchmarks
2      y = Ax + y   n^2                   2n^2     2          Achieved in benchmarks
3      C = AB + C   4n^2                  2n^3     n/2        Exceeds HW max

Use these Flops/Ops to understand how sections of your code relate to simple
memory access patterns as typified by these BLAS operations.

Analyzing HWC Data, Example: BLAS Level 1

OSS experiments conducted: hwc or hwcsamp
PAPI counters: PAPI_FP_OPS, PAPI_TOT_CYC, PAPI_LD_INS, PAPI_ST_INS, PAPI_TOT_INS
Derived metrics (or ratios) of interest: GFLOPS, float_ops/cycle,
instructions/cycle, loads/cycle, stores/cycle, and float_ops/memory_ops

BLAS 1 kernel: DAXPY; y = alpha * x + y

Kernel code:
  do i = 1, n
     y(i) = alpha * x(i) + y(i)
  enddo

Kernel code parameters used in the experiments:
- n = 10,000 (data fits in cache)
- the kernel is called in an outer loop 100,000 times to get reasonable run times

PAPI data and analysis (n = 10,000; Mem Refs = 3n = 30,000 and calculated
FLOPs = 2n = 20,000 per call; outer loop = 100,000):

PAPI_LD_INS  PAPI_SR_INS  PAPI_FP_OPS  PAPI_TOT_CYC  PAPI_TOT_INS
1.02E+09     5.09E+08     1.03E+09     2.04E+09      2.43E+09

Derived values: code time = 6.4596E-06 secs per call; code GFLOPS = 3.096;
corrected PAPI GFLOPS = 3.195; FPC = 0.505; IPC = 1.191; LPC = 0.500;
SPC = 0.249; raw PAPI FLOPS error = -93.80% (as reported), corrected PAPI
FLOPS error = 3.10%, Mem Refs error = -2.15%; measured Flops/Ops = 0.674
vs. the calculated 2/3 = 0.667.

Although DAXPY performs two floating point operations per iteration, they are
counted as one floating point instruction by PAPI. Because of this, there are
situations where PAPI_FP_INS may produce fewer floating point counts than
expected: here the theoretical count is 2 * 10,000 * 100,000 = 2.0E+09, but
PAPI_FP_OPS reports 1.03E+09, roughly half. In this example PAPI_FP_OPS was
therefore multiplied by 2 to match the theoretical expected FLOP count.

Formula for calculating # load instructions (Intel counter
MEM_INST_RETIRED:LOADS; 16-byte aligned loads to XMM registers):

  (2 vectors) * (vec_length) * (loop) * (bytes_per_word) / (16 bytes_per_load)
  = 2 * 10,000 * 100,000 * 8 / 16 = 1.0E+09

which matches the measured PAPI_LD_INS of 1.02E+09.
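The derived ratios above are just arithmetic on the five counters. A small C helper, using the DAXPY measurements quoted above (the slide's ratios use the raw, uncorrected PAPI_FP_OPS value), makes the definitions explicit:

    #include <stdio.h>

    /* Derived-ratio helper for the five counters used in the BLAS
       experiments; values are the DAXPY measurements quoted above. */
    int main(void) {
        double ld  = 1.02e9;   /* PAPI_LD_INS  */
        double sr  = 5.09e8;   /* PAPI_SR_INS  */
        double fp  = 1.03e9;   /* PAPI_FP_OPS (uncorrected) */
        double cyc = 2.04e9;   /* PAPI_TOT_CYC */
        double ins = 2.43e9;   /* PAPI_TOT_INS */

        printf("FMO (flops/memory op)    = %.3f\n", fp / (ld + sr));  /* 0.674 */
        printf("FPC (flops/cycle)        = %.3f\n", fp / cyc);        /* 0.505 */
        printf("IPC (instructions/cycle) = %.3f\n", ins / cyc);       /* 1.191 */
        printf("LPC (loads/cycle)        = %.3f\n", ld / cyc);        /* 0.500 */
        printf("SPC (stores/cycle)       = %.3f\n", sr / cyc);        /* 0.250 */
        return 0;
    }

The printed values reproduce the FPC/IPC/LPC/SPC numbers in the table on the next slide.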
What Hardware Counter Metrics Can Tell Us about Code Performance

A set of useful metrics that can be calculated for code functions:
- FLOPS/Memory Ops (FMO): would like this to be large; implies good data locality (also called computational intensity, or ops/refs)
- Flops/Cycle (FPC): large values for FP-intensive codes suggest efficient CPU utilization
- Instructions/Cycle (IPC): large values suggest good balance with minimal stalls
- Loads/Cycle (LPC): useful for calculating FMO; may indicate good stride through arrays
- Stores/Cycle (SPC): useful for calculating FMO; may indicate good stride through arrays

Operation               Kernel         PAPI_GFLOPS  FMO   FPC   IPC   LPC   SPC
BLAS 1: y = alpha*x + y do loop        0.67         0.67  0.51  1.19  0.50  0.25
BLAS 2: y = A*x + y     do loop        0.94         2.00  0.14  0.26  0.07  0.00
BLAS 2: y = A*x + y     DGEMV          1.89         -     0.29  0.42  0.15  0.03
BLAS 3: C = A*B + C     do loop (kji)  6.29         -     0.87  1.74  0.21  0.00
BLAS 3: C = A*B + C     DGEMM          12.96        -     1.84  1.26  0.59  0.01

Hardware Counters for Simple Math Kernels

Other useful HWC metrics, shown here per code:

Metric                 3dFFT 256^3  matmul 500x500  HPCCG sparseMV 100^3  QR fact. N=2350
Comp. inten. (ops/ref) 1.33         1.71            1.68                  0.64
MFLOPS (papi)          952          4159            3738                  352
MFLOPS (code)          1370         4187            4000                  276
percent peak           19.8         86.7            77.9                  7.3
fpOps/TLB miss         841.65       9040759.49      697703.96             14.06
fpOps/D1 cache miss    25.53        167.94          144.91                10.24
fpOps/DC_MISS          29.42        170.52          149.96                11.17
ops/cycle              0.4          1.75            1.56                  0.15

AMG: LLNL Sparse Solver Benchmark Example (uses the Hypre library)

Major reasons for on-node scaling limitations:
- Memory bandwidth
- Shared L3 cache

[Charts: AMG intra-node scaling (weak, strong, and ideal speedup vs. number of
cores, up to 16) and L3_CACHE_MISSES:ALL / L3_EVICTIONS:ALL for 1, 2, 4, 8,
and 16 cores, normalized to the 1-PE counts; counts are averages of PE values]

- L3 cache misses for 1, 2, and 4 PEs match the expectation for strong
  scaling: with reduced data per PE, L3 misses decrease close to linearly up
  to 4 PEs.
- On the other hand, L3 evictions for 1, 2, and 4 PEs similarly decrease
  near-perfectly, but then increase dramatically: ~100x at 8 PEs and ~170x at
  16 PEs.
- L3 evictions are a good measure of a memory-bandwidth-limited performance
  bottleneck at a node.

General memory bandwidth limitation remedies:
- Blocking
- Remove false sharing for threaded codes

A "False Cache Line Sharing" Example

! Cache line aligned
real*4, dimension(112,100) :: c, d
!$OMP DO SCHEDULE(STATIC, 16)
do i = 1, 100
   do j = 2, 100
      c(i,j) = c(i,j-1) + d(i,j)
   enddo
enddo
!$OMP END DO

! Cache line unaligned
real*4, dimension(100,100) :: c, d
!$OMP PARALLEL DO
do i = 1, 100
   do j = 2, 100
      c(i,j) = c(i,j-1) + d(i,j)
   enddo
enddo
!$OMP END PARALLEL DO

Same computation, but careful attention to alignment and independent OMP
parallel cache-line chunks can have a big impact; L3_EVICTIONS is a good
measure:

            Run Time  L3_EVICTIONS:ALL  L3_EVICTIONS:MODIFIED
Aligned     6.5e-03   9                 3
Unaligned   2.4e-02   1583              1422
Perf. Penalty  3.7x   175x              474x

A PAPI_TLB_DM Example: Sandia's CTH

[Charts: CTH intra-node scaling (weak, strong, and ideal speedup vs. number of
cores); growth in MPI time (MPI % vs. APP % of wall time by core count); and
the MPI PAPI_TLB_DM ratio by core count, normalized to the 2-PE TLB_DM count]

Looking through the performance counters, the average per-PE PAPI counter
showing the most increase with scale (among all the profiled functions) is
PAPI_TLB_DM, registered under MPI. So the executable was relinked with
-lhugetlbfs, HUGETLB_MORECORE=yes was set, and the run was executed with
"aprun -m500hs ...":
- 16 PE performance improvement: 7.35%
- 128 PE performance improvement: 8.14%
- 2048 PE performance improvement: 8.23%

SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop

Section 5: Analysis of I/O

Need for Understanding I/O

I/O could be a significant percentage of execution time, dependent upon:
- Checkpoint, analysis output, visualization & I/O frequencies
- I/O pattern in the application: N-to-1, N-to-N; simultaneous writes or requests
- Nature of the application: data intensive, traditional HPC, out-of-core
- File system and striping: NFS, Lustre, Panasas, and # of Object Storage Targets (OSTs)
- I/O libraries: MPI-IO, hdf5, PLFS, ...
- Other jobs stressing the I/O sub-systems

Obvious candidates to explore first while tuning:
- Use a parallel file system
- Optimize for the I/O pattern
- Match checkpoint I/O frequency to the MTBI of the system
- Use appropriate libraries

I/O Performance Example

Application: OOCORE benchmark from DOD HPCMO
- Out-of-core SCALAPACK benchmark from UTK
- Can be configured to be disk I/O intensive
- Characterizes a very important class of HPC applications involving the use of Method of Moments (MOM) formulations for investigating electromagnetics (e.g., radar cross section, antenna design)
- Solves dense matrix equations by LU, QR or Cholesky factorization
- "Benchmarking OOCORE, an Out-of-Core Matrix Solver," Cable, S.B., D'Azevedo, E., SCALAPACK Team, University of Tennessee at Knoxville / U.S. Army Engineering and Development Center
Why Use This Example?

Used by HPCMO to evaluate I/O system scalability.

Out-of-core dense solver benchmarks demonstrate the importance of the
following in performance analysis:
- I/O overhead minimization
- Matrix multiply kernel: possible to achieve close to peak performance of the machine if tuned well
- "Blocking" is very important to tune for deep memory hierarchies

Use O|SS to Measure and Tune for I/O

Run on 16 cores on an SNL quad-core, quad-socket Opteron IB cluster.
Investigate file system impact with Open|SpeedShop: compare Lustre I/O with
striping to NFS I/O.

Run command: ossio "srun -N 1 -n 16 ./testzdriver-std"

INPUT: testdriver.in
  ScaLAPACK out-of-core LU, QR, LL factorization input file
  testdriver.out      device out
  1                   number of factorizations
  LU                  factorization methods -- QR, LU, or LT
  1                   number of problem sizes
  31000               values of M
  31000               values of N
  1                   values of nrhs
  9200000             values of Asize
  1                   number of MB's and NB's
  16                  values of MB
  16                  values of NB
  1                   number of process grids
  4                   values of P
  4                   values of Q

Sample output from the Lustre run:
  TIME  M      N      MB  NB  NRHS  P  Q  Fact/SolveTime    Error     Residual
  ----  -----  -----  --  --  ----  -  -  ----------------  --------  --------
  WALL  31000  31000  16  16  1     4  4  1842.20  1611.59  4.51E+15  1.45E+11

  DEPS = 1.110223024625157E-016
  sum(xsol_i) = (30999.9999999873, 0.000000000000000E+000)
  sum |xsol_i - x_i| = (3.332285336962339E-006, 0.000000000000000E+000)
  sum |xsol_i - x_i|/M = (1.074930753858819E-010, 0.000000000000000E+000)
  sum |xsol_i - x_i|/(M*eps) = (968211.548505533, 0.000000000000000E+000)

From the output of two separate runs using Lustre and NFS:
- LU factorization time with Lustre = 1842 secs
- LU factorization time with NFS = 2655 secs
- 813 sec penalty (more than 30%) if you do not use a parallel file system like Lustre!
NFS and Lustre O|SS Analysis

I/O to Lustre instead of NFS reduces runtime 25%:
(1360 + 99) - (847 + 7) = 605 secs

NFS run:
  Function                                 Min t (secs)  Max t (secs)  Avg t (secs)
  __libc_read (/lib64/libpthread-2.5.so)   1102.380076   1360.727283   1261.310157
  __libc_write (/lib64/libpthread-2.5.so)  31.19218      99.444468     49.01867

Lustre run:
  Function                                 Min t (secs)  Max t (secs)  Avg t (secs)
  __libc_read (/lib64/libpthread-2.5.so)   368.898283    847.919127    508.658604
  __libc_write (/lib64/libpthread-2.5.so)  6.27036       7.896153      6.850897

Lustre File System Striping

Lustre file system (lfs) commands:
  lfs setstripe -s (size bytes; k, M, G) -c (count; -1 all) -i (index; -1 round robin) <file | directory>
- Typical defaults: -s 1M -c 4 -i -1 (usually good to try first)
- File striping is set upon file creation

  lfs getstripe <file | directory>
Example:
  lfs getstripe --verbose ./oss_lfs_stripe_16 | grep stripe_count
  stripe_count: 16 stripe_size: 1048576 stripe_offset: -1

[Diagram: compute nodes and I/O nodes connected over a high-speed network to
the OSTs; 1 file per process (bandwidth enhanced); a subset of PEs doing I/O
(could be most optimal); 1 PE writes (bandwidth limited)]

Open|SpeedShop IO Experiment Used to Identify Optimal lfs Striping

[Chart: OOCORE I/O performance, __libc_read time from Open|SpeedShop; max, min
& avg wall time (secs) from the load balance view for a 16-way parallel run,
for stripe counts 1, 4, 8, and 16]

Additional I/O Analysis with O|SS

I/O profiling (iop experiment)
- Lighter-weight I/O tracking experiment
- Traces I/O functions but only records individual call paths, not each individual event with its call path (like usertime)

Extended I/O tracing (iot experiment)
- Records each event in chronological order
- Collects additional information:
  • Function parameters
  • Function return value

When to use extended I/O tracing?
- When you want to trace the exact order of events
- When you want to see the return values or bytes read or written
- When you want to see the parameters of the I/O call

Beware of Serial I/O in Applications

Encountered in VOSS, code LeP. The simple code below illustrates the pattern
(acknowledgment: Mike Davis, Cray, Inc.):
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define VARS_PER_CELL 15

/* Write a single restart file from many MPI processes */
int write_restart (
    MPI_Comm comm     /// MPI communicator
  , int num_cells     /// number of cells on this process
  , double *cellv )   /// cell vector
{
    int rank;    // rank of this process within comm
    int size;    // size of comm
    int tag;     // for MPI_Send, MPI_Recv
    int baton;   // for serializing I/O
    FILE *f;     // file handle for restart file

    /* Procedure: Get MPI parameters */
    MPI_Comm_rank (comm, &rank);
    MPI_Comm_size (comm, &size);
    tag = 4747;

    if (rank == 0) {
        /* Rank 0 creates a fresh restart file and starts the serial I/O:
         * write cell data, then pass the baton to rank 1 */
        f = fopen ("restart.dat", "wb");
        fwrite (cellv, num_cells, VARS_PER_CELL * sizeof (double), f);
        fclose (f);
        MPI_Send (&baton, 1, MPI_INT, 1, tag, comm);
    } else {
        /* Ranks 1 and higher wait for the previous rank to complete its I/O,
         * then append their cell data to the restart file,
         * then pass the baton to the next rank */
        MPI_Recv (&baton, 1, MPI_INT, rank - 1, tag, comm, MPI_STATUS_IGNORE);
        f = fopen ("restart.dat", "ab");
        fwrite (cellv, num_cells, VARS_PER_CELL * sizeof (double), f);
        fclose (f);
        if (rank < size - 1) {
            MPI_Send (&baton, 1, MPI_INT, rank + 1, tag, comm);
        }
    }
    /* All ranks have posted to the restart file; return to caller */
    return 0;
}

int main (int argc, char *argv[])
{
    MPI_Comm comm;
    int comm_rank;
    int comm_size;
    int num_cells;
    double *cellv;
    int i;

    MPI_Init (&argc, &argv);
    MPI_Comm_dup (MPI_COMM_WORLD, &comm);
    MPI_Comm_rank (comm, &comm_rank);
    MPI_Comm_size (comm, &comm_size);

    /* Make the cells be distributed somewhat evenly across ranks */
    num_cells = 5000000 + 2000 * (comm_size / 2 - comm_rank);
    cellv = (double *) malloc (num_cells * VARS_PER_CELL * sizeof (double));
    for (i = 0; i < num_cells * VARS_PER_CELL; i++) {
        cellv[i] = comm_rank;
    }
    write_restart (comm, num_cells, cellv);
    MPI_Finalize ();
    return 0;
}

IOT O|SS Experiment of the Serial I/O Example

The iot experiment shows an event-by-event list: clicking on it gives each
call to an I/O function being traced. A graphical trace view of the same data
(in another tool) shows the serialization of fwrite(): the red bars for each
PE fire one after another instead of concurrently.
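The usual cure for this baton-passing pattern is a collective write through MPI-IO, so all ranks write their block of the file concurrently and the file system can stripe the traffic. A minimal sketch of that alternative (not part of the original example; offsets are computed with an exclusive prefix sum via MPI_Exscan):

    #include <mpi.h>

    #define VARS_PER_CELL 15

    /* Collective alternative to the baton-passing writer: every rank writes
       its block at its own byte offset, concurrently. */
    int write_restart_mpiio (MPI_Comm comm, int num_cells, double *cellv)
    {
        MPI_File fh;
        int rank;
        long long mybytes = (long long)num_cells * VARS_PER_CELL * sizeof (double);
        long long offset = 0;

        MPI_Comm_rank (comm, &rank);
        /* Exclusive prefix sum of byte counts = this rank's starting offset.
           MPI_Exscan leaves the result on rank 0 undefined, so force it to 0. */
        MPI_Exscan (&mybytes, &offset, 1, MPI_LONG_LONG, MPI_SUM, comm);
        if (rank == 0) offset = 0;

        MPI_File_open (comm, "restart.dat",
                       MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at_all (fh, (MPI_Offset)offset, cellv,
                               num_cells * VARS_PER_CELL, MPI_DOUBLE,
                               MPI_STATUS_IGNORE);
        return MPI_File_close (&fh);
    }

Rerunning the iot experiment on a version using this writer should show the per-PE write bars overlapping instead of serializing.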
Running I/O Experiments

Offline io/iop/iot experiment on the sweep3d application.

Convenience script basic syntax:
  ossio[p][t] "executable" [ default | <list of I/O functions> ]
- Parameter: I/O function list to sample (default is all):
  creat, creat64, dup, dup2, lseek, lseek64, open, open64, pipe, pread,
  pread64, pwrite, pwrite64, read, readv, write, writev

Examples:
  ossio "mpirun -np 256 sweep3d.mpi"
  ossiop "mpirun -np 256 sweep3d.mpi" read,readv,write
  ossiot "mpirun -np 256 sweep3d.mpi" read,readv,write

I/O Output via CLI (equivalent of HC in the GUI)

openss>> expview -vcalltrees,fullstack iot1

I/O Call Time(ms)  % of Total Time  Number of Calls  Call Stack Function (defining location)
                                                     _start (sweep3d.mpi)
                                                     > @ 470 in __libc_start_main (libmonitor.so.0.0.0: main.c,450)
                                                     >>__libc_start_main (libc-2.10.2.so)
                                                     >>> @ 428 in monitor_main (libmonitor.so.0.0.0: main.c,412)
                                                     >>>>main (sweep3d.mpi)
                                                     >>>>> @ 58 in MAIN__ (sweep3d.mpi: driver.f,1)
                                                     >>>>>> @ 25 in task_init_ (sweep3d.mpi: mpi_stuff.f,1)
                                                     >>>>>>>_gfortran_ftell_i2_sub (libgfortran.so.3.0.0)
                                                     >>>>>>>>_gfortran_ftell_i2_sub (libgfortran.so.3.0.0)
                                                     ....
                                                     >>>>>>>>>>>>>_gfortran_st_read (libgfortran.so.3.0.0)
17.902981000       96.220812461     1                >>>>>>>>>>>>>>__libc_read (libpthread-2.10.2.so)

Section Summary: I/O Tradeoffs

- Avoid writing to one file from all MPI tasks
- If you need to, be sure to distinguish offsets for each PE at a stripe boundary, and use buffered I/O
- If each process writes its own file, the parallel file system attempts to load balance the Object Storage Targets (OSTs), taking advantage of the stripe characteristics
- Metadata server overhead can often create severe I/O problems
  • Minimize the number of files accessed per PE and minimize each PE doing operations like seek, open, close, stat that involve inode information
- I/O time is usually not measured, even in applications that keep some function profile
  • Open|SpeedShop can shed light on time spent in I/O using the io and iot experiments

SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop

Section 6: Analysis of parallel codes (MPI, threaded)

Parallel vs. Serial Performance Dependencies

Serial application performance dependencies
- Execution time can change based on the input deck
- Compiler options and capabilities
- Type of processor, differences, multicore
- Does the serial execution model really exist anymore?

Parallel application performance dependencies
- Execution time can change based on the input deck
- Compiler options and capabilities
  • Threading support
  • Accelerator support (pragmas)
- Type and speed of the processor units and accelerators
- Number of processor units and accelerators
- Processor interconnects and network speed

Parallel Application Performance Challenges

Architectures are complex and evolving rapidly
- Changing multicore processor designs
- Emergence of accelerators (GPGPU, MIC, etc.)
- Multi-level memory hierarchy
- I/O storage sub-systems
- Increasing scale: number of processors, accelerators

Parallel processing adds more performance factors
- MPI communication time versus computation time
- Threading synchronization time versus computation time
- CPU time versus accelerator transfer and startup time tradeoffs
- I/O device multi-process contention issues
- Efficient memory referencing across processes/threads
- Changes in application performance due to adapting to new architectures

Parallel Execution Goals

Ideal scenario
- Efficient threading when using pthreads or OpenMP
  • All threads are assigned work that can execute concurrently
  • Synchronization times are low
- Load balance for parallel jobs using MPI
  • All MPI ranks do the same amount of work, so no MPI rank waits
- Hybrid applications have both MPI and threads running with limited synchronization idle time

* Diagram from Performance Metrics for Parallel Programs: http://web.info.uvt.ro/~petcu/calcul/

Parallel Execution Goals

What causes the ideal goal to fail?*
- Equal work was not given to each rank
  • Number of array operations not equal for each rank
  • Loop iterations were not evenly distributed
- Sometimes there are issues even if data is distributed equally
  • Sparse arrays: some ranks have work, others have 0's
  • Adaptive grid models: some ranks need to redefine their mesh while others don't
  • N-body simulations: some data/work migrates into another rank's domain, so that rank has more work
- In threaded applications, some threads do more work than others and subsequently cause other threads to wait

*From the LLNL parallel processing tutorial

Parallel Application Analysis Techniques

What steps can we take to analyze parallel jobs?

Get an overview of where the time is being spent
- A low-overhead way to get an overview is sampling
  • Program counter, call stack, hardware counter
- Collect overview information for all ranks and/or threads in the program, then:
  • Examine load balance, looking at the minimum, maximum, and average values across the ranks and/or threads
  • If possible, look at this information per library as well as per function
  • Use the above info to determine whether the program is well balanced or not
    - Are the minimum and maximum values widely different?
    - If so, you probably have load imbalance and need to look for the cause of the performance lost because of the imbalance
    - Not all ranks or threads are doing the same amount of work

Parallel Execution Analysis Techniques

If imbalance is detected, then what? How do you find the cause?
- Look at the library time distribution across all the ranks and threads
  • Is the MPI library taking a disproportionate amount of time?
- If it is an MPI application, use a tool that provides per-MPI-function call timings
  • Can look at MPI function time distributions
    - In particular MPI_Waitall
    - Then look at the call path to MPI_Waitall
  • Also can look at source code relative to
    - the MPI rank or particular pthread that is involved
    - Is there any special processing for the particular rank or thread?
    - Examine the call paths and check the code along the path
- Use a cluster-analysis type feature, if the tool has this capability
  • Cluster analysis can categorize threads or ranks that have similar performance into groups, identifying the outlier rank or thread
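A toy program that exhibits exactly this failure mode is handy for trying out the load-balance views described above. A hypothetical sketch (work grows with rank, so min/max/average times diverge and the low ranks pile up in the final barrier); run it under osspcsamp or ossmpi:

    #include <mpi.h>
    #include <stdio.h>

    /* Deliberately imbalanced: rank r does (r+1) units of work, so the high
       ranks compute while the low ranks wait in MPI_Barrier. */
    int main (int argc, char **argv)
    {
        int rank, size;
        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        MPI_Comm_size (MPI_COMM_WORLD, &size);

        double s = 0.0;
        for (long u = 0; u < (long)(rank + 1) * 20000000L; u++)
            s += 1.0 / (double)(u + 1);

        double t0 = MPI_Wtime ();
        MPI_Barrier (MPI_COMM_WORLD);       /* low ranks wait here */
        double wait = MPI_Wtime () - t0;

        printf ("rank %d of %d: wait %.3fs (s=%g)\n", rank, size, wait, s);
        MPI_Finalize ();
        return 0;
    }

In the load balance view, the min/max/average compute times spread widely, and in an MPI trace the barrier (or MPI_Waitall in real codes) absorbs the slack, which is the signature discussed in the case study that follows.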
Parallel Application Analysis Techniques
What steps can we take to analyze parallel jobs?
• Get an overview of where the time is being spent
  - A low-overhead way to get an overview is sampling: program counter, call stack, hardware counters
• Collect overview information for all ranks and/or threads in the program, then:
  - Examine load balance by looking at the minimum, maximum, and average values across the ranks and/or threads
  - If possible, look at this information per library as well as per function
  - Use the above to determine whether the program is well balanced
    o Are the minimum and maximum values widely different?
    o If so, not all ranks or threads are doing the same amount of work; you probably have load imbalance and need to look for the cause of the performance lost to it

Parallel Execution Analysis Techniques
If imbalance is detected, how do you find the cause?
• Look at the library time distribution across all ranks and threads
  - Is the MPI library taking a disproportionate amount of time?
• For an MPI application, use a tool that provides per-MPI-function call timings
  - Look at the MPI function time distributions, in particular MPI_Waitall, then look at the call paths to MPI_Waitall
  - Also look at the source code relative to the MPI rank or particular pthread involved
    o Is there any special processing for that rank or thread?
    o Examine the call paths and check the code along each path
• Use a cluster analysis feature, if the tool has this capability
  - Cluster analysis can categorize threads or ranks with similar performance into groups, identifying the outlier rank or thread

Parallel Execution Analysis Techniques
Example steps: an Open|SpeedShop case study on NPB: LU
http://www.openspeedshop.org/wp/wp-content/uploads/2010/10/2010-11-sc-oss-tutorial-submitted.pdf
• Use program counter (pc) sampling to get an overview
  - Noticed that smpi_net_lookup shows up high in the function load balance
  - Because it is an MPI-related routine, look at the linked object (library) sampling data
  - The load balance on linked objects also showed some imbalance
  - Cluster analysis on the data found that rank 255 is an outlier
• Take a closer look at rank 255
  - Rank 255's pc output shows most of its time spent in smpi_net_lookup
  - Run an MPI tracing experiment to look for more clues
  - A load balance view on the MPI trace data shows rank 255's MPI_Allreduce time is the highest of the 256 ranks used
  - Show data for rank 255 and a representative rank from the rest; note the differences in MPI_Wait, MPI_Send, and MPI_Allreduce
  - Look at the call paths to MPI_Wait
• Conclusions? We have found the call paths to examine for why the wait occurs

How can O|SS help for parallel jobs?
O|SS is designed to work on parallel jobs:
• All experiments can be used on parallel jobs
• Support for threading and message passing
• Automatically tracks all ranks and threads during execution
• Records/stores performance info per process/rank/thread
• O|SS applies the experiment collector to all ranks or threads on all nodes
MPI-specific tracing experiments:
• Tracing of MPI function calls (individual, all, or a specific group)
• Three forms of MPI tracing experiments

Analysis of Parallel Codes
By default, experiment collectors run on all tasks:
• Automatically detects all processes and threads
• Gathers and stores per-thread data
Viewing data from parallel codes:
• By default, all values are aggregated (summed) across all ranks
• Manually include/exclude individual ranks/processes/threads
• Ability to compare ranks/threads
Additional analysis options:
• Load balance (min, max, average) across parallel executions
  - Across ranks for hybrid OpenMP/MPI codes
  - Focus on a single rank to see the load balance across its OpenMP threads
• Cluster analysis (finding outliers)
  - Automatically creates groups of similar-performing ranks or threads
  - Available from the Stats Panel toolbar or context menu
  - Note: can take a long time for large numbers of processors (current version)

Integration with MPI
O|SS has been tested with a variety of MPIs, including Open MPI, MVAPICH[2], MPICH2 (Intel, Cray), and MPT (SGI)
Running O|SS experiments on MPI codes: just use the convenience script corresponding to the data you want to gather, and put the command you use to run your application in quotes (a possible end-to-end session is sketched below):
• osspcsamp "mpirun -np 32 sweep3d.mpi"
• ossio "srun -N 4 -n 16 sweep3d.mpi"
• osshwctime "mpirun -np 128 sweep3d.mpi"
• ossusertime "srun -N 8 -n 128 sweep3d.mpi"
• osshwc "mpirun -np 128 sweep3d.mpi"
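Putting the convenience scripts and the CLI together, here is one possible session for a 256-rank run. The database file name shown follows the usual <executable>-<experiment>.openss pattern but is an assumption here, and the rank and function selections are illustrative:

    osspcsamp "mpirun -np 256 sweep3d.mpi"

    openss -cli -f sweep3d.mpi-pcsamp.openss
    openss>>expview
    openss>>expview -r 255

    ossmpi "mpirun -np 256 sweep3d.mpi" MPI_Wait,MPI_Send,MPI_Allreduce

The first expview shows the aggregated overview; the second narrows to a suspect rank. If an MPI library dominates, the follow-on ossmpi run traces only the suspect MPI calls, keeping overhead and database size down.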
MPI Specific Tracing Experiments
MPI function tracing:
• Records all MPI call invocations
• Record call times and call paths (mpi)
  - Convenience script: ossmpi
• Record call times, call paths, and argument info (mpit)
  - Convenience script: ossmpit
• Equal events are aggregated, which saves space in the O|SS database and reduces overhead
• Public format: full MPI traces in the Open Trace Format (OTF); experiment name: mpiotf
  - Convenience script: ossmpiotf

Running MPI Specific Experiments
Offline mpi/mpit/mpiotf experiment
Convenience script basic syntax:
    ossmpi[t] "mpi executable syntax" [ default | <list of MPI functions> | <mpi category> ]
Parameters
• Default: Open|SpeedShop traces all MPI functions
• MPI function list to trace (comma separated): MPI_Send, MPI_Recv, ...
• mpi_category: "all", "asynchronous_p2p", "collective_com", "datatypes", "environment", "graphs_contexts_comms", "persistent_com", "process_topologies", "synchronous_p2p"
Examples:
    ossmpi "srun -N 4 -n 32 smg2000 -n 50 50 50"
    ossmpi "mpirun -np 4000 nbody" MPI_Send,MPI_Recv

Identifying Load Imbalance With O|SS
Get an overview of the application:
• Run one of the lightweight sampling experiments: pcsamp, usertime, hwc
• Use this information to verify performance expectations
• Use the load balance view on the pcsamp, usertime, or hwc results
• Look for performance values outside the norm: a somewhat large difference between the min, max, and average values
Get an overview of the MPI functions used in the application:
• If the MPI libraries show up in the load balance for pcsamp, run an MPI-specific experiment
• Use the load balance view on the MPI experiment
• Again, look for values outside the norm, and focus on the MPI functions to find potential problems (a CLI sketch of this check follows)
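In CLI terms, "outside the norm" can be checked by viewing suspect ranks individually. The -r and expview usage below matches the examples elsewhere in this tutorial; the -v linkedobjects modifier is an assumption that the CLI mirrors the GUI's Linked Objects view, and the rank numbers are illustrative:

    openss>>expview -r 0 pcsamp1
    openss>>expview -r 255 pcsamp1
    openss>>expview -r 255 -v linkedobjects pcsamp1

Comparing the representative rank (0) against the suspected outlier (255), first per function and then per library, is the manual version of what the load balance and cluster analysis views automate.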
pcsamp Default View
• The default aggregated pcsamp experiment view displays the experiment metadata and the aggregated results

Load Balance View: NPB: LU
• Load balance view based on functions (pcsamp): the MPI library shows up high in the list, with the maximum time in rank 255
• With the load balance view we are looking for performance numbers outside the norm of what is expected: large differences between the min, max, and/or average values

Default Linked Object View: NPB: LU
• Default aggregated view based on linked objects (libraries): select "Linked Objects" and click the D icon
• NOTE: the MPI library consumes a large portion of the application run time

Linked Object Load Balance: NPB: LU
• Load balance view based on linked objects (libraries): rank 255 has the maximum MPI library time value and the minimum LU time

MPI Tracing Results: Default View
• Default aggregated MPI experiment view: the information icon displays the experiment metadata alongside the aggregated results

View Results: Show MPI Callstacks
• Unique call paths view: click the C+ icon to see the unique call paths to MPI_Waitall and the other MPI functions

Using Cluster Analysis in O|SS
Can be used with pcsamp, usertime, hwc:
• Groups like-performing ranks/threads; the groups may identify outlier ranks/threads
• Examine the performance of a member of the outlier group
• Compare that member with a member of an acceptably performing group
Can be used with mpi, mpit:
• Same functionality as above, but now focused on the performance of individual MPI functions
• Key functions are MPI_Wait and MPI_Waitall
• Look at the call paths to the key functions to analyze why they are being called, to find performance issues

Linked Object Cluster Analysis: NPB: LU
• Cluster analysis view based on linked objects (libraries): in the cluster analysis results, rank 255 shows up as an outlier

Single Rank View: NPB: LU
• pcsamp view of rank 255's performance data only: smpi_net_lookup, an MPI-library-related routine, dominates rank 255

Hot Call Paths View (CLI): NPB: LU
Show all call paths involving MPI_Wait, for rank 255 only:
    openss -cli -f lu-mpi-256.openss
    openss>>expview -r 255 -vcalltrees,fullstack -f MPI_Wait
Exclusive MPI Call Time(ms), % of Total, Number of Calls, Call Stack Function (defining location):

Most expensive call path to MPI_Wait (6010.978000 ms, 3.878405%, 250 calls):
    >>>>main (lu.C.256)
    >>>>> @ 140 in MAIN__ (lu.C.256: lu.f,46)
    >>>>>> @ 180 in ssor_ (lu.C.256: ssor.f,4)
    >>>>>>> @ 213 in rhs_ (lu.C.256: rhs.f,5)
    >>>>>>>> @ 224 in exchange_3_ (lu.C.256: exchange_3.f,5)
    >>>>>>>>> @ 893 in mpi_wait_ (mpi-mvapich-rt-offline.so: wrappers-fortran.c,893)
    >>>>>>>>>> @ 889 in mpi_wait (mpi-mvapich-rt-offline.so: wrappers-fortran.c,885)
    >>>>>>>>>>> @ 51 in MPI_Wait (libmpich.so.1.0: wait.c,51)

Second call path (2798.770000 ms, 1.805823%, 250 calls):
    >>>>main (lu.C.256)
    >>>>> @ 140 in MAIN__ (lu.C.256: lu.f,46)
    >>>>>> @ 180 in ssor_ (lu.C.256: ssor.f,4)
    >>>>>>> @ 64 in rhs_ (lu.C.256: rhs.f,5)
    >>>>>>>> @ 88 in exchange_3_ (lu.C.256: exchange_3.f,5)
    >>>>>>>>> @ 893 in mpi_wait_ (mpi-mvapich-rt-offline.so: wrappers-fortran.c,893)
    >>>>>>>>>> @ 889 in mpi_wait (mpi-mvapich-rt-offline.so: wrappers-fortran.c,885)
    >>>>>>>>>>> @ 51 in MPI_Wait (libmpich.so.1.0: wait.c,51)
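For readers wondering why call paths end at MPI_Wait: paths like the ones above typically come from a nonblocking exchange idiom. This is not the NPB LU source, just a hedged C sketch of the generic pattern (function and parameter names are invented):

    #include <mpi.h>

    /* Nonblocking neighbor exchange. The time a rank spends inside
       MPI_Wait is the time its neighbor keeps it waiting, which is why
       load imbalance surfaces as large MPI_Wait times on particular
       ranks - and why the call PATH to MPI_Wait (which exchange, which
       loop) matters more than the function name itself. */
    void exchange(double *sendbuf, double *recvbuf, int n,
                  int up, int down, MPI_Comm comm)
    {
        MPI_Request req;
        MPI_Irecv(recvbuf, n, MPI_DOUBLE, up, 0, comm, &req);
        MPI_Send(sendbuf, n, MPI_DOUBLE, down, 0, comm);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* waits as long as 'up' is late */
    }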
Summary / Parallel Bottlenecks
Open|SpeedShop supports MPI and threaded application performance analysis
Parallel experiments:
• Works with multiple MPI implementations
• Applies the sequential O|SS collectors to all nodes
• Specialized MPI tracing experiments
Result viewing:
• By default, results are aggregated across ranks/processes/threads
• Optionally, select individual ranks/threads or groups
• Specialized views: the load balance view and cluster analysis
• Use these features to isolate sections of problem code

SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop
Section 7: Analysis of heterogeneous codes (GPU)

Emergence of HPC Heterogeneous Processing
What led to increased GPU usage in HPC?*
• It became difficult to continue scaling processor frequencies, and to continue increasing power consumption
• Instead, move toward:
  - Data-level parallelism: vector units, SIMD execution
  - Thread-level parallelism: multithreading, multicore, manycore
*Based on an NVIDIA presentation

Emergence of HPC Heterogeneous Processing
GPU versus CPU comparison: different goals produce different designs*
• CPU: minimize the latency experienced by one thread
  - Must be good at everything, parallel or not
  - Big on-chip caches
  - Sophisticated control logic
• GPU: maximize the throughput of all threads; assumes the workload is highly parallel
  - The number of threads in flight is limited by resources, so provide lots of resources (registers, bandwidth, etc.)
  - Multi-threading can hide latency, so skip the big caches
  - Share control logic across many threads
*From an NVIDIA presentation

Heterogeneous Processors
Mixing GPU and CPU usage in applications:
• Multicore CPU plus manycore GPU
• Data must be transferred to/from the CPU to the GPU in order for the GPU to operate on it and return the new values (NVIDIA image)

Programming for GPGPU
Prominent models for programming the GPGPU augment current languages to access GPU strengths:
• NVIDIA CUDA
  - Scalable parallel programming model
  - Minimal extensions to a familiar C/C++ environment
  - Heterogeneous serial-parallel computing
  - Supports NVIDIA only
• OpenCL (Open Computing Language)
  - Open source, royalty-free
  - Portable: can run on different types of devices
  - Runs on AMD, Intel, and NVIDIA
• OpenACC
  - Provides directives (hint commands inserted into the source)
  - The directives tell the compiler where to create acceleration (GPU) code, without the user modifying or adapting the code (see the sketch after this list)
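A minimal sketch of the OpenACC directive approach just described. The pragma syntax is standard OpenACC; the loop itself is an invented example:

    #include <stdio.h>

    int main(void)
    {
        const int n = 1 << 20;
        static float x[1 << 20], y[1 << 20];

        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* The directive asks the compiler to generate accelerator code
           for this loop and to manage the x/y transfers; the loop body
           itself is unchanged C. */
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = 2.0f * x[i] + y[i];

        printf("y[0] = %f\n", y[0]);
        return 0;
    }

Built with an OpenACC-capable compiler the loop runs on the device; built without one, the pragma is ignored and the same code runs serially, which is the portability argument for the directive model.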
Optimal Heterogeneous Execution
Considerations for best overall performance:
• Is it profitable to send a piece of work to the GPU? This is the key issue: how much work is there to be done?
• Is there a vectorization opportunity?
• How is the parallel scaling for the application overall?
• What is the cost of transferring data to and from the GPU?
• Are there opportunities to chain operations together so the data can stay in the GPU for multiple operations?
• Can you balance the GPU and CPU workload?

GPGPU Performance Monitoring
How can performance tools help optimize code?
• Is it profitable to send a piece of work to the GPU? A tool can tell you this by measuring the costs:
  - Transferring data to and from the GPU
  - How much time is spent in the GPU versus the CPU
• Is there a vectorization opportunity? Measure the mathematical operations versus the vector operations occurring in the application; experiment with compiler optimization levels, then re-measure the operations and compare
• How is the parallel scaling for the application overall? Use a performance tool to get an idea of real performance versus the expected parallel speed-up

Open|SpeedShop GPU support
What performance info will Open|SpeedShop provide?
• The time spent in the GPU device
• The cost and size of data transferred to and from the GPU
• Information to understand the balance of CPU versus GPU utilization
• Information to understand the balance of data transfer between host and device memory against the execution of computational kernels
• Information to understand the performance of the computational kernel code running on the GPU device

Open|SpeedShop GPU support
What specific data will Open|SpeedShop gather?
Data is collected for the following CUDA actions:
• Kernel invocations
• Memory copies
• Memory sets
For all of the above, this information is always gathered:
• Thread
• Time the invocation/copy/set was requested
• CUDA context and stream
• Call site of the invocation/copy

Open|SpeedShop GPU support
Action-specific data for kernel invocations:
• Time the kernel execution actually began/ended
• Kernel name
• Grid and block dimensions
• Cache preferences
• Registers
• Shared memory (both static and dynamic)
• Local memory reserved
Kernel invocation representation:
    {kernel-name}<<<[{gx}, {gy}, {gz}], [{bx}, {by}, {bz}], {dynamic-shared-memory}>>>
• Each {} is a value filled in from the data
• This is the same syntax used for CUDA kernel invocations in actual source code

Open|SpeedShop GPU support
Action-specific data for memory copies:
• Time the copy actually began/ended
• Number of bytes copied
• Kind of copy (host to device, etc.)
• Memory source type (pageable/pinned/device/array)
• Memory destination type (pageable/pinned/device/array)
• Was the copy asynchronous?
Memory copy representation:
    Copy {nbytes} from {source} to {destination}
• Each {} is again a value filled in from the data; {source} and {destination} are things like "host", "device", "array", etc.

Open|SpeedShop GPU support
Action-specific data for memory sets:
• Time the set actually began/ended
• Number of bytes set
Memory set representation:
    Set {nbytes} on device
• Again, {nbytes} is a value filled in from the data
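As a hedged illustration of where these three action types come from in application code, here is a host-side C sketch against the standard CUDA runtime API (cudaMalloc, cudaMemset, cudaMemcpy are real API calls; the 64 MB size is arbitrary):

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void)
    {
        const size_t nbytes = 64 * 1024 * 1024;   /* arbitrary 64 MB */
        float *host = malloc(nbytes);
        float *dev  = NULL;

        cudaMalloc((void **)&dev, nbytes);

        /* Recorded as a memory set: "Set 64 MB on device" */
        cudaMemset(dev, 0, nbytes);

        /* Recorded as a memory copy: "Copy 64 MB from host to device" */
        cudaMemcpy(dev, host, nbytes, cudaMemcpyHostToDevice);

        /* A kernel launched at this point would be recorded as a kernel
           invocation and displayed with the {name}<<<...>>> syntax
           described above. */

        cudaMemcpy(host, dev, nbytes, cudaMemcpyDeviceToHost);

        cudaFree(dev);
        free(host);
        return 0;
    }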
Open|SpeedShop can show overview information for copies to/from the device:

Exclusive Call Time(ms)  % of Total  Number of Calls  Function (defining location)
11.172864                1.061071    1                copy 64 MB from host to device (CUDA)
0.371616                 0.035292    1                copy 2.1 MB from host to device (CUDA)

CUDA Trace View (chronological order)
    openss>>expview -v trace

Start Time(d:h:m:s)      Exclusive Call Time(ms)  % of Total  Call Stack Function (defining location)
2013/08/21 18:31:21.611  11.172864                1.061071    copy 64 MB from host to device (CUDA)
2013/08/21 18:31:21.622  0.371616                 0.035292    copy 2.1 MB from host to device (CUDA)
2013/08/21 18:31:21.623  0.004608                 0.000438    copy 16 KB from host to device (CUDA)
2013/08/21 18:31:21.623  0.003424                 0.000325    set 4 KB on device (CUDA)
2013/08/21 18:31:21.623  0.003392                 0.000322    set 137 KB on device (CUDA)
2013/08/21 18:31:21.623  0.120896                 0.011481    compute_degrees(int*, int*, int, int)<<<[256,1,1], [64,1,1]>>> (CUDA)
2013/08/21 18:31:21.623  13.018784                1.236375    QTC_device(float*, char*, char*, int*, int*, int*, float*, int*, int, int, int, float, int, int, int, int, bool)<<<[256,1,1], [64,1,1]>>> (CUDA)
2013/08/21 18:31:21.636  0.035232                 0.003346    reduce_card_device(int*, int)<<<[1,1,1], [1,1,1]>>> (CUDA)
2013/08/21 18:31:21.636  0.002112                 0.000201    copy 8 bytes from device to host (CUDA)
2013/08/21 18:31:21.636  1.375616                 0.130640    trim_ungrouped_pnts_indr_array(int, int*, float*, int*, char*, char*, int*, int*, float*, int*, int, int, int, float, int, bool)<<<[1,1,1], [64,1,1]>>> (CUDA)
2013/08/21 18:31:21.638  0.001344                 0.000128    copy 260 bytes from device to host (CUDA)
2013/08/21 18:31:21.638  0.025600                 0.002431    update_clustered_pnts_mask(char*, char*, int)<<<[1,1,1], [64,1,1]>>> (CUDA)
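Trace rows like these support quick effective-bandwidth estimates; here is a worked example for the first copy (whether "MB" in the view means 10^6 or 2^20 bytes is an assumption either way):

    effective bandwidth = bytes copied / copy time
    64 MB / 11.172864 ms = 64,000,000 B / 0.011172864 s ≈ 5.7 GB/s  (decimal MB)
    or 67,108,864 B / 0.011172864 s ≈ 6.0 GB/s                      (if MB = 2^20 B)

Either figure is plausible for a PCIe-attached device; doing the same arithmetic for the small copies shows why the transfer-cost questions in the "Optimal Heterogeneous Execution" list are worth asking before offloading small pieces of work.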
Open|SpeedShop GPU support
What is the status of this Open|SpeedShop work?
Completed, but not hardened:
• NVIDIA GPU performance data collection
• Packaging the data into the Open|SpeedShop database
• Retrieving the data and forming a client and GUI view
  - There is a rudimentary view, modeled on the Open|SpeedShop I/O and extended I/O views: function view, trace view, and call path views
Future work, if funded:
• Pursue obtaining more information about the code executing inside the GPU device
• Research and support MIC and other accelerators
• Analysis of the performance data and more meaningful views
• Look into supporting OpenCL

SC2013 Tutorial
How to Analyze the Performance of Parallel Codes 101
A case study with Open|SpeedShop
Section 8: Conclusions: DIY and Future Work

Getting Open|SpeedShop
• Sourceforge project home: http://sourceforge.net/projects/openss
• CVS access: http://sourceforge.net/scm/?type=cvs&group_id=176777
• Packages: accessible from the project home Download tab
• Installation instructions: http://www.openspeedshop.org/docs/users_guide/BuildAndInstallGuide.html and the README file in the source distribution
• Additional information: http://www.openspeedshop.org/

Open|SpeedShop Documentation
• Open|SpeedShop user guide documentation:
  http://www.openspeedshop.org/docs/user_guide/
  .../share/doc/packages/OpenSpeedShop/users_guide
• Python scripting API documentation:
  http://www.openspeedshop.org/docs/pyscripting_doc/
  .../share/doc/packages/OpenSpeedShop/pyscripting_doc
• Command line interface documentation:
  http://www.openspeedshop.org/docs/user_guide/
  .../share/doc/packages/OpenSpeedShop/users_guide
• Man pages: OpenSpeedShop, osspcsamp, ossmpi, ... (see the lookup example below)
• Quick start guide downloadable from the web site: http://www.openspeedshop.org, click the "Download Quick Start Guide" button
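Once installed, the man pages listed above are the quickest reference for each convenience script. A typical lookup, assuming the installation's man pages are on your MANPATH:

    man OpenSpeedShop
    man osspcsamp
    man ossmpi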
Platforms
System architecture:
• AMD Opteron/Athlon
• Intel x86, x86-64, and Itanium-2
• IBM PowerPC and PowerPC64
Operating system:
• Tested on many popular Linux distributions: SLES, SUSE, RHEL, Fedora Core, CentOS, Debian, Ubuntu, and varieties of the above
Large scale platforms:
• IBM Blue Gene/L, /P, and /Q
• Cray XT/XE/XK line
• GPU support available as an alpha version
• Available on many DOE systems in shared locations; for special build and usage instructions, see the web site

O|SS in Transition
The largest bottleneck in existing O|SS:
• Data handling from the instrumentor to analysis
• A hierarchical approach is needed to conquer large scale, a generic requirement for a wide range of tools
Component Based Tool Framework (CBTF) project:
• Generic tool and data transport framework
  - Create "black box" components
  - Connect these components into a component network
  - Distribute these components and/or component networks across a tree-based transport mechanism (MRNet)
• Goal: reuse of components
  - Aimed at giving the ability to rapidly prototype performance tools
  - Create specialized tools to meet the needs of application developers
• http://ft.ornl.gov/doku/cbtfw/start

CBTF Structure and Dependencies
(The slide's diagram shows the tool stack: cbtf-gui and libcbtf-tcpip/libcbtf-mrnet/libcbtf-xml layered over libcbtf, with external dependencies Qt, MRNet, Xerces-C, and Boost, connecting to the MPI application and tool.)
• Minimal dependencies, for easier builds
• Tool-type independent: performance tools, debugging tools, etc.
• Completed components:
  - Base library (libcbtf)
  - XML-based component networks (libcbtf-xml)
  - MRNet distributed components (libcbtf-mrnet)
• Ability for online aggregation: tree-based communication, online data filters, configurable

Open|SpeedShop and CBTF
CBTF will be (is becoming) the new platform for O|SS:
• Existing experiments will not change
• More scalability; aggregation functionality for new experiments
• Built with a new "instrumentor class"
• Same workflow, same convenience scripts
New experiments enabled through CBTF:
• Advanced I/O profiling
• Threading experiments
• Memory usage analysis
• GPU/accelerator support

Work in Progress: New GUI
• Modular GUI framework based on Qt4
• Base for multiple tools
• Ability for remote connections

Summary
Performance analysis is critical on modern systems:
• Complex architectures vs. complex applications
• Need to break black-box behavior at multiple levels
• Lots of performance is left on the table by default
Performance tools can help:
• Open|SpeedShop is one comprehensive option
• Scalability of tools is important: performance problems often appear only at scale, we will see more and more online aggregation approaches, and CBTF is one generic framework for implementing such tools
Critical: asking the right questions
• Comparing answers with good baselines or intuition
• Starting at a high level and iteratively digging deeper

Questions vs. Experiments
Where do I spend my time?
• Flat profiles (pcsamp)
• Inclusive/exclusive timings with callstacks (usertime)
• Identifying hot callpaths (usertime + hot path analysis)
How do I analyze cache performance?
• Measure memory performance using hardware counters (hwc)
• Compare to flat profiles (custom comparison)
• Compare multiple hardware counters (N x hwc, hwcsamp)
How do I identify I/O problems?
• Study time spent in I/O routines (io)
• Compare runs under different scenarios (custom comparisons)
How do I find parallel inefficiencies?
• Study time spent in MPI routines (mpi)
• Look for load imbalance (LB view) and outliers (CA view)

Availability and Contact
Current version: 2.1 has been released
• Download options at the Open|SpeedShop website: http://www.openspeedshop.org/
  - Package with install script
  - Source for tool and base libraries
• Platform-specific build information: http://www.openspeedshop.org/wp/platform-specific-build-information/
• Open|SpeedShop forum (now: Google Groups): http://www.openspeedshop.org/forums/
Feedback:
• Bug tracking available from the website
• [email protected]
• Feel free to contact the presenters directly
• Support contracts and onsite training available