Michael Kluge, ZIH Correlating Multiple TB of Performance Data to User Jobs Lustre User Group 2015, Denver, Colorado Zellescher Weg 12 Willers-Bau A 208 Tel. +49 351 - 463 – 34217 Michael Kluge ([email protected]) About us ZIH: about us – HPC and service provider – Research Institute – Big Data Competence Center Main research areas: – Performance Analysis Tools – Energy Efficiency – Computer Architecture – Standardization Efforts (OpenACC, OpenMP) Michael Kluge, ZIH 2 ZIH HPC and BigData Infrastructure Erlangen, Potsdam 100 Gbit/s Cluster Uplink TU Dresden ZIHBackbone Home Archiv HTW, Chemnitz, Leipzig, Freiberg Parallel Machine BULL HPC Haswell processors 2 x 612 Nodes ~ 30.000 Cores other clusters Megware Cluster IBM iDataPlex High Throughput BULL Islands Westmere, Sandy Bridge, und Haswell processors, GPU-Nodes 700 Nodes ~15.000 Cores Shared Memory SGI Ultraviolet 2000 512 cores SandyBridge 8 TB RAM NUMA Storage Michael Kluge, ZIH 3 HRSK-II, Phase 2, currently installed GPU Isle 64 nodes Haswell 2 x 12 cores 64 GB RAM + 2 x Nvidia K 80 Troughput Isle 232 nodes Haswell 2 x 12 cores 64/128/256 GB RAM 128 GB SSD HPC Isle 612 nodes Haswell 2 x 12 cores 64 GB RAM 128 GB SSD HPC Isle 612 nodes Haswell 2 x 12 cores 64 GB RAM 128 GB SSD 2x Export 2 x SMP nodes 2 TB RAM, 48 cores 64 GB RAM 24 Lustre Server 5 PB NLSATA 80 GB/s 400.000 IOPS 40 TB SSD 10 GB/s 1 Mio. IOPS 4x Login to Campus 4 x 10GE Uplink, Home, Backup ~ 1,2 PF peak in CPUs, ~35.000 cores total ~ 300 TF peak in GPUs, 128 cards Michael Kluge, ZIH 4 FDR InfiniBand Architecture of our storage concept (FASS) ZIH-Infrastructure Flexible Storage System (FASS) Login … User Z Transaction SCRATCH User A Storage SSD Server 2 Server N Server/File systems User A User Z Net Switch 2 HPC-Component Switch 1 Export Batch-System Checkpoint. Scratch Server 1 … SAS SATA Throughput-Component Optimization and control Michael Kluge, ZIH Access statistics 5 Analysis Performance Analysis for I/O Activities (1) What is of interest – Application requests, Interface types – Access sizes/patterns – Network/Server/Storage utilization Level of detail: – Record everything – Record data every 5 to 10 seconds Challenge: – How to analyze this? – How to deal with anonymous data? Michael Kluge, ZIH 6 I/O and Job Correlation: Architecture/Software SLURM Batch System HPC File System node MDS node I/O network node RAID Batch System Log Files Global I/O Data OSS RAID node OSS SAN Cluster Formation RAID application events node local events Correlation System OSS utilization errors Job Clusters backend, cache throughput, IOPS Job Clusters with high I/O demands Agent Interface Agent Interface Agent Interface Agent Interface Data Processing Database Storage Database Storage Job Clusters with low I/O demands Database Storage Data Request s Dataheap Data Collection System https://fusionforge.zih.tu-dresden.de/projects/dataheap/ Michael Kluge, ZIH 7 Visualization of Live Data for Phase 1 Michael Kluge, ZIH 8 Visualization of Live Data for Phase 2 Michael Kluge, ZIH 9 Performance Analysis for I/O Activities (2) Amount of collected data: – 240 OSTs, 20 OSS servers, … – Looking at stats and brw_stats (how to store a histogram in database?) ~75.000 tables – about 1 GB of data per hour Analysis is supposed to cover 6 month – 4 TB data – No way to analyze this with any serial approach Michael Kluge, ZIH 10 Analysis Process map operations to a set of tables – modify all tables (align data, scale, aggregate over time) – create a new table from an input set of tables (aggregation) – plot (time series, histograms) currently: single thread Haskell program Swift/T work in progress – 4 TB means 50 seconds on an 80 GB/s file system for the initial read – Swift/T has nice ways to describe data dependencies and to parallelize this kind of workflows (open for suggestions!) Michael Kluge, ZIH 11 Approach to look at the Raw Data analysis of the raw performance data – take original data and create derived metrics – look at plots and histograms correlate metrics with user jobs – take job log and split jobs into classes – check for correlations between job classes and performance metrics Michael Kluge, ZIH 12 Metric Example: NetApp vs. OST Bandwidth (20s) Michael Kluge, ZIH 13 Metric Example: NetApp vs. OST Bandwidth (60s) Michael Kluge, ZIH 14 Metric Example: NetApp vs. OST Bandwidth (10 minutes) Michael Kluge, ZIH 15 Metric Example: 3 Month bandwidth vs. IOPS Michael Kluge, ZIH 16 I/O and Job Correlation: Starting Point (Bandwidth) Michael Kluge, ZIH 17 Building Job Classes a class is defined by: – user id, core count – similar runtime – similar start time (time isolation) sort all jobs by runtime, split list at gaps split these groups again if they are not overlapping (long gap between start) challenges: runtime fluctuation, large run time variations Michael Kluge, ZIH 18 I/O and Job Correlation: Preliminary Result (1) Michael Kluge, ZIH 19 I/O and Job Correlation: Preliminary Result (1) Michael Kluge, ZIH 20 I/O and Job Correlation: Correlation Factors Michael Kluge, ZIH 21 Open Topics How to store a histogram in database? Other approaches than Swift/T Deal with the zeros Cache intermediate results Michael Kluge, ZIH 22 Conclusion / Questions it is possible to map anonymous performance data back to user jobs advance towards storage controller analysis (caches, backend bandwidth, …) Michael Kluge, ZIH 23
© Copyright 2025