An Analysis of Node Sharing on HPC Clusters using XDMoD/TACC_Stats
Joseph P. White, Ph.D., Scientific Programmer, Center for Computational Research, University at Buffalo, SUNY
XSEDE14, July 13-18, 2014

Outline
• Motivation
• Overview of tools (XDMoD, tacc_stats)
• Background
• Results
• Conclusions
• Discussion

Co-Authors
• Robert L. DeLeon (UB)
• Thomas R. Furlani (UB)
• Steven M. Gallo (UB)
• Matthew D. Jones (UB)
• Amin Ghadersohi (UB)
• Cynthia D. Cornelius (UB)
• Abani K. Patra (UB)
• James C. Browne (UTexas)
• William L. Barth (TACC)
• John Hammond (TACC)

Motivation
• Node sharing benefits:
– increases throughput by up to 26%
– increases energy efficiency by up to 22% (Breslow et al.)
• Node sharing disadvantages:
– resource contention
• The number of cores per node is increasing
• Ulterior motive:
– prove the toolset

A. D. Breslow, L. Porter, A. Tiwari, M. Laurenzano, L. Carrington, D. M. Tullsen, and A. E. Snavely. The case for colocation of HPC workloads. Concurrency and Computation: Practice and Experience, 2013. http://dx.doi.org/10.1002/cpe.3187

Tools
• XDMoD
– NSF-funded open-source tool that provides a wide range of usage and performance metrics on XSEDE systems
– Web-based interface
– Powerful charting features
• tacc_stats
– Low-overhead collection of system-wide performance data
– Runs on every node of a resource; collects data at job start, job end, and periodically during the job:
• CPU usage
• Hardware performance counters
• Memory usage
• I/O usage

Data flow
[Figure: data flow]

XDMoD Data Sources
[Figure: XDMoD data sources]

Background
• CCR's HPC resource "Rush"
– 8000+ cores
– Heterogeneous cluster: 8, 12, 16 or 32 cores per node
– InfiniBand interconnect
– Panasas parallel filesystem
– SLURM resource manager
• node sharing enabled by default
• cgroup plugin to isolate jobs
• Academic computing center: higher proportion of small jobs than the large XSEDE resources
• All data from Jan-Feb 2014 (~370,000 jobs)

Number of jobs by job size
[Histogram: number of jobs by job size]

Results
• Exclusive jobs: no other jobs ran concurrently on the allocated node(s) (left-hand side of plots)
• Shared jobs: at least one other job was running on the allocated node(s) (right-hand side)
• Metrics compared:
– Process memory usage
– Total OS memory usage
– LLC read miss rates
– Job exit status
– Parallel filesystem bandwidth
– InfiniBand interconnect bandwidth

Memory usage per core
• (MemUsed - FilePages - Slab) from /sys/devices/system/node/node0/meminfo (see the parsing sketch below)
[Histograms: memory usage per core (GB), exclusive jobs vs. shared jobs]
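The sketch below illustrates how this quantity can be computed on a node from the standard Linux per-NUMA-node meminfo files. It is a minimal illustration of the formula, not the tacc_stats or XDMoD implementation, and it assumes the usual sysfs layout.

```python
import glob
import os
import re

def node_meminfo(path):
    """Parse a /sys/devices/system/node/nodeN/meminfo file into {field: value in kB}."""
    fields = {}
    with open(path) as f:
        for line in f:
            # Lines look like: "Node 0 MemUsed:       12280672 kB"
            m = re.match(r"Node\s+\d+\s+(\S+):\s+(\d+)", line)
            if m:
                fields[m.group(1)] = int(m.group(2))
    return fields

def app_memory_per_core_gb():
    """(MemUsed - FilePages - Slab) summed over NUMA nodes, in GB per core.
    Subtracting the page cache (FilePages) and kernel slab leaves an estimate
    of the memory held by running processes."""
    used_kb = 0
    for path in sorted(glob.glob("/sys/devices/system/node/node*/meminfo")):
        info = node_meminfo(path)
        used_kb += info["MemUsed"] - info["FilePages"] - info["Slab"]
    return used_kb / (1024.0 * 1024.0) / os.cpu_count()

if __name__ == "__main__":
    print("Approximate application memory per core: %.2f GB" % app_memory_per_core_gb())
```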
Total memory usage per core (4 GB/core nodes)
[Histograms: total memory usage per core (GB), exclusive jobs vs. shared jobs]

Last level cache (LLC) read miss rate per socket
• UNC_LLC_MISS:READ on the Intel Westmere uncore
• Gives an upper-bound estimate of DRAM read bandwidth, since each LLC read miss implies a 64-byte cache-line fill from DRAM
[Histograms: LLC read miss rate (10^6/s), exclusive jobs vs. shared jobs]

Job exit status reported by SLURM
[Bar chart: fraction of jobs by exit status (Successful, Killed, Failed), exclusive vs. shared jobs]

Panasas parallel filesystem write rate per node
[Histograms: write rate per node (B/s), exclusive jobs vs. shared jobs]

InfiniBand write rate per node
• Peaks truncated: ~45,000 for exclusive jobs, ~80,000 for shared jobs
[Histograms: write rate per node (log10 B/s), exclusive jobs vs. shared jobs]

Conclusions
• Little difference on average between shared and exclusive jobs on Rush
• The majority of jobs use far less than the maximum available resources
• We have created data collection/processing software that facilitates easy evaluation of system usage

Discussion
• Limitations of current work:
– Unable to determine the impact (if any) on job wall time
– Comparing overall average values for jobs
– Statistics for jobs on shared nodes are convolved, since node-level counters mix contributions from co-located jobs
– Exit code is not a reliable way to determine failure

Future work
• Use Application Kernels to get a detailed analysis of interference
• Many more metrics now available (see the sketch after this list):
– FLOPS
– CPU clock cycles per instruction (CPI)
– CPU clock cycles per L1D cache load (CPLD)
• Add support for per-job metrics on shared nodes
• Study classes of applications
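As an aside on the last two metrics, the sketch below shows one way CPI and CPLD can be derived from periodic hardware-counter samples of the kind tacc_stats records. The sample structure and counter names (cycles, instructions_retired, l1d_loads) are illustrative placeholders, not the actual tacc_stats field names.

```python
def cycles_per_instruction(start, end):
    """CPI over an interval, from cumulative counter samples at its start and end.
    Counter names are illustrative placeholders."""
    return (end["cycles"] - start["cycles"]) / \
           (end["instructions_retired"] - start["instructions_retired"])

def cycles_per_l1d_load(start, end):
    """CPLD over an interval: clock cycles per L1D cache load."""
    return (end["cycles"] - start["cycles"]) / \
           (end["l1d_loads"] - start["l1d_loads"])

# Example with made-up counter values sampled at job start and job end:
start = {"cycles": 1000000, "instructions_retired": 800000, "l1d_loads": 300000}
end = {"cycles": 9000000, "instructions_retired": 6800000, "l1d_loads": 2700000}
print("CPI  = %.2f" % cycles_per_instruction(start, end))  # 8e6 / 6e6   -> 1.33
print("CPLD = %.2f" % cycles_per_l1d_load(start, end))     # 8e6 / 2.4e6 -> 3.33
```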
Questions
• BOF: XDMoD: A Tool for Comprehensive Resource Management of HPC Systems
– 6:00pm-7:00pm tomorrow, Room A602
• XDMoD: https://xdmod.ccr.buffalo.edu/
• tacc_stats: http://github.com/TACCProjects/tacc_stats
• Contact info: [email protected]

Acknowledgments
• This work is supported by the National Science Foundation under grant numbers OCI 1203560 and OCI 1025159 for the Technology Audit Service (TAS) for XSEDE