Code-Agnostic Performance Characterisation and Enhancement
Ben Menadue, Academic Consultant
nci.org.au  @NCInews

Who Uses the NCI?
• NCI has a large user base – 1000s of users across 100s of projects.
• These projects encompass almost every research area:
  – physical sciences
  – Earth sciences
  – engineering
  – mathematics
  – finance
  – social science
• Correspondingly, there is huge variation in backgrounds and experience:
  – some are programmers – they can optimise their algorithm and code to suit the machine
  – most just run pre-packaged software – no control over the source

Performance Characterisation for Beginners
• If source code is available, we can instrument and profile in the usual fashion.
  – For non-advanced users, we often walk them through this and help them analyse the results.
• What about pre-built packages?
  – Use an LD_PRELOAD library to catch and log e.g. MPI calls.
• We provide several such tools:
  – IPM
  – mpiP
  – perf (not an LD_PRELOAD, but still doesn't require recompilation)
• IPM is our tool of choice:
  – easy to use: module load openmpi ipm
  – interfaces with PAPI for hardware counters
  – NCI patches for message binning, rounding off, suspend-resume, …

IPM Profile of CCAM
• Performance of CCAM on Raijin was not what we expected – slower than on Vayu!
• Profiled a run using IPM to see what was going on…

Performance Improvement in CCAM
• What can we do to improve the performance?
• CCAM is standard software in use by many researchers.
  – We can't change the algorithm or code.
• We need a different strategy to improve performance: work with the communication and system libraries instead.
• The IPM profile shows huge overhead coming from MPI calls.
• Mellanox Accelerators:
  – Messaging Accelerator (MXM) – improves message passing by using extra Mellanox hardware features.
  – Fabric Collective Accelerator (FCA) – offloads collectives from the processes to the interconnect hardware.

Performance Improvement in CCAM – Without Changing Code
[Chart: CCAM execution time (s) across the four configurations below]

  Date            Configuration                          Avg. Time (s)   Stdev
  July 2013       Original results                       1664.2          148.36
  November 2013   Mellanox Accelerators                   827.8           24.6
  March 2014      Kernel updates and tweaks, MXM, FCA     750.7            4
  April 2014      Latest result with HT                   700.07           2.14

  Avg. time is based on a varied number of runs.

Performance Improvement in CCAM
• The operating system can also be impacting performance:
  – Moved to the latest CentOS 6 kernel and operating system.
    • new task scheduling, memory management, …
  – Enabled hyperthreading.
    • allows operating system tasks to run on separate hardware threads – reduces their impact and jitter

Application Software Stack
• General package installations are made on request to a central location, /apps.
  – Lustre filesystem, mounted on all nodes.
• We typically build these so they pass all their tests.
  – This normally means default optimisation and gcc.
  – Fortran 90/03/08 modules and libraries are built using both gfortran and ifort, since the ABIs differ.
• While this gives quite reasonable performance, sometimes users need or want more.
• We worked closely with a developer of Fluidity to compile a custom software stack:
  – lots of dependencies: MPI, PETSc, Metis, Scotch, Zoltan, Python, GMSH, …
  – all built using the latest Intel compilers and Open MPI with very high optimisation settings
  – found several compiler bugs – reported to Intel, and several already fixed
• 20% improvement in runtime using the custom software stack!
  – Still using a debugging build of PETSc – known to have a significant performance impact.

Summary
• Even without changing a line of source code, there are still lots of performance enhancements available!
• A highly optimised software stack gives the best serial performance.
• Using the latest kernel and system libraries can reduce impact from the operating system.
• Hyperthreading reduces jitter and impact from O/S tasks.
• Mellanox Accelerators can significantly improve MPI performance – especially for collectives.
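For a pre-built application, the Mellanox accelerators discussed above are typically enabled at launch time rather than in code. A hedged illustration of what such a job-script fragment might look like – the MCA parameter names, core count, and the ccam binary are assumptions, not NCI's actual configuration, and the parameters exposed vary between Open MPI builds:

```shell
# Hypothetical fragment; check `ompi_info --param all all` for the
# parameters your Open MPI build actually provides.
module load openmpi ipm        # IPM attaches via LD_PRELOAD – no recompile

# Enable the Mellanox accelerators at run time:
#   MXM – point-to-point messaging acceleration
#   FCA – collective offload to the interconnect hardware
mpirun -np 64 \
    --mca mtl mxm \
    --mca coll_fca_enable 1 \
    ./ccam
```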
© Copyright 2024