
Damaris: How To Efficiently Leverage
Multicore Parallelism To Achieve
Scalable, Jitter-Free I/O And Non-Intrusive
In-Situ Visualization
Work involving Matthieu Dorier (IRISA, ENS Cachan), Gabriel Antoniu (INRIA),
Marc Snir (ANL), Franck Cappello (ANL), Leigh Orf (CMICH), Dave Semeraro
(NCSA, UIUC), Roberto Sisneros (NCSA, UIUC), Tom Peterka (ANL)
Gabriel Antoniu
[email protected]
KerData Project-Team
Inria Rennes Bretagne Atlantique
Maison de la simulation, Saclay, January 2014
Context: HPC simulations on Blue Waters
•  INRIA – UIUC - ANL
Joint Lab for Petascale Computing
•  Blue Waters: sustained petaflop
•  Large-scale simulation at unprecedented accuracy
•  Terabytes of data produced every minute
Big Data challenge on post-petascale machines
•  How to efficiently store, move data?
•  How to index, process, compress these data?
•  How to analyze, visualize and finally understand them?
Outline
1.  Big Data challenge at exascale
2.  The Damaris approach
3.  Achieving scalable I/O
4.  Non-impacting in-situ visualization
5.  Conclusion
1
Big Data challenges at exascale
When parallel file systems don’t scale anymore
The standard I/O flow: offline data analysis
1. Periodic data generation from the simulation (100,000+ cores, petabytes of data)
2. Storage in a parallel file system (Lustre, PVFS, GPFS, …)
3. Offline data analysis on a separate, smaller cluster (~10,000 cores)
Writing from HPC simulations:
File per process vs. Collective I/O
Two main approaches for I/O in HPC simulations, implemented in MPI-I/O and available in HDF5, NetCDF, …
•  File-per-process
   -  Too many files
   -  High metadata overhead
   -  Hard to read back
•  Collective I/O
   -  Requires coordination
   -  Data communications
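To make the contrast concrete, here is a minimal sketch (not taken from the slides) of the two patterns in plain MPI; file names and the per-rank data size are illustrative.

#include <mpi.h>
#include <stdio.h>

#define N 1048576   /* local element count, illustrative */

void write_file_per_process(double *buf, int rank) {
    /* File-per-process: every rank writes its own file -> many files,
       heavy metadata load on the parallel file system. */
    char name[64];
    snprintf(name, sizeof(name), "out_%06d.dat", rank);
    FILE *f = fopen(name, "wb");
    fwrite(buf, sizeof(double), N, f);
    fclose(f);
}

void write_collective(double *buf, int rank) {
    /* Collective I/O: all ranks write into one shared file at computed
       offsets -> a single file, but coordination and data movement are needed. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out_shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}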
Periodic synchronous snapshots
lead to I/O bursts and high variability
Visualizing throughput variability: between cores and between iterations.
[Figure: the "cardiogram" of a PVFS data server (input and output throughput over time) during a run of the CM1 simulation]
2
The Damaris approach
Dedicating cores to enable scalable asynchronous I/O
Writing without Damaris
[Diagram: without Damaris, all cores of a multicore node write through main memory to the network at the same time]
Huge access contention, degraded performance
With Damaris: using dedicated I/O cores
[Diagram: one core of the multicore node is dedicated to data management; the other cores write to shared memory, and the dedicated core handles the network I/O]
•  Dedicate cores to I/O
•  Write in shared memory
•  Process data asynchronously
•  Produce results
•  Output results
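As an illustration of this space-partitioning pattern, a simplified sketch follows (not Damaris internals; the helper functions are hypothetical placeholders): ranks are grouped per node and one rank per node is set aside for data management.

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Group ranks by node, then pick rank 0 of each node as the dedicated core. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);
    int dedicated = (node_rank == 0);

    if (dedicated) {
        /* Dedicated core: drain the data handed over by the compute cores and
           write it out asynchronously with respect to the simulation. */
        /* drain_shared_memory_and_write();   hypothetical placeholder */
    } else {
        /* Compute cores: run the simulation; instead of hitting the file
           system, copy each snapshot into node-local shared memory. */
        /* compute_iteration();               hypothetical placeholder */
        /* copy_snapshot_to_shared_memory();  hypothetical placeholder */
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}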
Leave a core, go faster!
Time-partitioning vs. space-partitioning
[Diagram: time-partitioning interleaves I/O phases with computation on all cores; space-partitioning moves I/O to a dedicated core so that it overlaps with computation]
Damaris at a glance
•  Dedicated Adaptable Middleware for Application Resources Inline Steering
•  Main idea: dedicate one or a few cores in each SMP node for data management
•  Features:
   -  Shared-memory-based communications
   -  Plugin system (C, C++, Python)
   -  Connection to visualization engines (e.g. VisIt, ParaView)
   -  External XML description of data
Damaris: current state of the software
•  Version 0.7.2 available at http://damaris.gforge.inria.fr/ (1.0 available soon)
   -  Along with documentation, tutorials, examples and a demo!
•  Written in C++, uses
   -  Boost for IPC, Xerces-C and XSD for XML parsing
•  API for Fortran, C, C++
•  Tested on
   -  Grid’5000 (Linux Debian), Kraken (Cray XT5 - NICS), Titan (Cray XK6 - Oak Ridge), JYC, Blue Waters (Cray XE6 - NCSA), Surveyor, Intrepid (BlueGene/P - Argonne)
•  Tested with
   -  CM1 (climate), OLAM (climate), GTC (fusion), Nek5000 (CFD)
3
Results: achieving scalable I/O
Running the CM1 simulation on Kraken,
G5K and BluePrint with Damaris
•  The CM1 simulation
   -  Atmospheric simulation
   -  One of the Blue Waters target applications
   -  Uses HDF5 (file-per-process) and pHDF5 (for collective I/O)
•  Kraken
   -  Cray XT5 at NICS
   -  12 cores/node
   -  16 GB/node
   -  Lustre file system
•  Grid’5000
   -  24 cores/node
   -  48 GB/node
   -  PVFS file system
•  BluePrint
   -  Power5
   -  16 cores/node
   -  64 GB/node
   -  GPFS file system
Damaris achieves almost perfect scalability
[Left plot: application run time in seconds (50 iterations + 1 write phase) vs. number of cores (576, 2304, 9216) on Kraken (Cray XT5), comparing Damaris, File-per-process and Collective-I/O. Right plot: weak scalability factor vs. number of cores, with Damaris staying close to the perfect-scaling line.]
Weak scalability factor: S = N * Tbase / T, where
-  N: number of cores
-  Tbase: time of an iteration on one core, without a write
-  T: time of an iteration plus a write
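As a purely hypothetical illustration of the metric (these numbers are not taken from the experiments): with Tbase = 2.0 s and T = 2.2 s at N = 9216 cores,
S = 9216 * 2.0 / 2.2 ≈ 8378,
i.e. about 91% of the perfect-scaling value S = N = 9216.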
Damaris hides the I/O jitter
[Two plots: (1) average and maximum write time in seconds vs. number of cores (576, 2304, 9216) on Kraken (Cray XT5), 28 MB per process; (2) average, max and min write time in seconds vs. total amount of data (GB) on BluePrint (Power5, 1024 cores). Series: Collective-I/O, File-per-process, Damaris.]
Damaris increases effective throughput
[Plot: average aggregate throughput (GB/s, log scale) of the writer processes vs. number of cores on Kraken (Cray XT5), comparing File-per-process, Damaris and Collective-I/O.]
More results…
Damaris: How to Efficiently Leverage Multicore Parallelism
to Achieve Scalable, Jitter-free I/O
Matthieu Dorier, Gabriel Antoniu, Franck Cappello, Marc Snir, Leigh Orf
Proceedings of IEEE CLUSTER 2012 (Beijing, China)
Damaris spares time for data management
[Plots: time in seconds spent by Damaris writing data (used time) vs. time spent waiting (spare time), on Kraken (Cray XT5) for 576, 2304 and 9216 cores, and on BluePrint (Power5, 1024 cores) for total data amounts of 0.05, 5.8, 15.1 and 24.7 GB.]
Damaris spares time?
Let’s use it!
4
Recent work: non-impacting
in-situ visualization
Getting insights from running simulations
From offline to coupled visualization
•  Offline approach
   •  I/O performance issues in the simulation
   •  I/O performance issues in visualization software
   •  Too much data!!!
•  Coupled approach
   •  Direct insight into the simulation
   •  Bypasses the file system
   •  Interactive
   BUT
   •  Hardly accepted by users
Towards in-situ visualization
•  Loosely coupled strategy
- Visualization runs on a separate, remote set of resources
- Partially or fully asynchronous
- Solutions include staging areas and file-format wrappers (HDF5 DSM, ADIOS, …)
•  Tightly coupled strategy
- Visualization is collocated with the simulation
- Synchronous (time-partitioning): the simulation periodically stops
- Implemented by code instrumentation
- Memory constrained
Four main challenges
•  User friendliness: low impact on simulation code
•  Adaptability: to different simulations and visualization scenarios
•  Performance: low impact on simulation run time
•  Good resource utilization: low memory footprint, use of GPU, …
In-situ visualization strategies
Coupling             Tight    Loose        ?
Impact on code       High     Low          Minimal
Interactivity        Yes      None         Yes
Instrumentation      High     Low          Minimal
Impact on run time   High     Low          Minimal
Resource usage       Good     Non-optimal  Better
Adaptability
•  Researchers seldom accept tightly-coupled in-situ visualization
   -  Because of development overhead, performance impact, …
   -  “Users are stupid, greedy, lazy slobs” [1]
•  Is there a solution achieving all these goals?
•  Yes: Damaris
[1] D. Thompson, N. Fabian, K. Moreland, L. Ice, “Design issues for performing in-situ analysis of simulation data”, Tech. Report, Sandia National Lab
Connect to a visualization backend
[Diagram: simulation nodes, the file system, and a visualization backend connected to the nodes]
Interact with your simulation
[Diagram: simulation nodes, the file system, and a user interacting with the running simulation]
Let’s take a representative example
// rectilinear grid coordinates
float mesh_x[NX];
float mesh_y[NY];
float mesh_z[NZ];
// temperature field
double temperature[NX][NY][NZ];
“Instrumenting” with Damaris
Damaris_write("mesh_x", mesh_x);
Damaris_write("mesh_y", mesh_y);
Damaris_write("mesh_z", mesh_z);

Damaris_write("temperature", temperature);
(Yes, that’s all)
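For context, a minimal sketch of how these calls might sit inside the simulation's main loop. The initialization, end-of-iteration and finalization calls shown here are assumptions modeled on the Damaris client API, not taken from the slides; their names and signatures are illustrative.

/* Hypothetical sketch (not verbatim Damaris API): initialize from the XML
   description, write each variable once per iteration, then signal the end of
   the iteration so the dedicated cores can process the data asynchronously. */
Damaris_initialize("config.xml", MPI_COMM_WORLD);        /* assumed init call */
for (int it = 0; it < max_iterations; it++) {
    compute_step(mesh_x, mesh_y, mesh_z, temperature);   /* the simulation itself */
    Damaris_write("mesh_x", mesh_x);
    Damaris_write("mesh_y", mesh_y);
    Damaris_write("mesh_z", mesh_z);
    Damaris_write("temperature", temperature);
    Damaris_end_iteration();       /* assumed: hand off the data to the dedicated cores */
}
Damaris_finalize();                /* assumed finalization call */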
Now describe your data in an XML file
<parameter name="NX" type="int" value="4"/>
<layout name="px" type="float" sizes="NX"/>
<variable name="mesh_x" layout="px"/>
<!-- idem for NY and NZ, py and pz, mesh_y and mesh_z -->

<layout name="data_layout" type="double" sizes="NX,NY,NZ"/>
<variable name="temperature" layout="data_layout" mesh="my_mesh"/>

<mesh type="rectilinear" name="my_mesh" topology="3">
  <coord name="mesh_x" unit="cm" label="width"/>
  <coord name="mesh_y" unit="cm" label="depth"/>
  <coord name="mesh_z" unit="cm" label="height"/>
</mesh>
•  Unified data description for different visualization software
•  Damaris translates this description into the right function calls to any
such software (right now: Python and VisIt)
•  Damaris handles the interactivity through dedicated cores!
Plugin exposing vector fields in
the Nek5000 simulation with VTK
VisIt interacting with the CM1
simulation through Damaris
Damaris has a low impact on the code
           VisIt       Damaris
Curve.c    144 lines   6 lines
Mesh.c     167 lines   10 lines
Var.c      271 lines   12 lines
Life.c     305 lines   8 lines
Number of lines of code required to instrument sample simulations with VisIt and with Damaris
Nek5000:
-  VTK: 600 lines of C and Fortran
-  Damaris: 20 lines of Fortran, 60 of XML
Results with the Nek5000 simulation
•  The Nek5000 simulation
   •  CFD solver
   •  Based on the spectral element method
   •  Developed at ANL
   •  Written in Fortran 77 and MPI
   •  Scales to over 250,000 cores
•  Data in Nek5000
   •  Fixed set of elements constituting an unstructured mesh
   •  Each element is a curvilinear mesh
•  Damaris already knows how to handle curvilinear meshes and pass them to VisIt
•  Tested on up to 384 cores with Damaris so far
•  The traditional approach does not even scale to this number!
Damaris removes the variability
inherent to in-situ visualization tasks
Experiments done with the
turbChannel test-case in Nek5000,
on 48 cores (2 nodes) of
Grid’5000’s Reims cluster.
Conclusion: using Damaris for in-situ analysis
Coupling             Tight    Loose        Damaris
Impact on code       High     Low          Minimal
Interactivity        Yes      None         Yes
Instrumentation      High     Low          Minimal
Impact on run time   High     Low          Minimal
Resource usage       Good     Non-optimal  Better
Adaptability
•  Impact on code: 1 line per object (variable or event)
•  Adaptability to multiple visualization software (Python, VisIt, ParaView, etc.)
•  Interactivity through VisIt
•  Impact on run time: simulation run time independent of visualization
•  Resource usage:
   •  preserves the “zero-copy” capability of VisIt thanks to shared memory
   •  can asynchronously use the GPUs attached to Cray XK6 nodes
More results…
Damaris/Viz: a Nonintrusive, Adaptable and User-Friendly In
Situ Visualization Framework
M. Dorier, R. Sisneros, T. Peterka, G. Antoniu, D. Semeraro,
In Proc IEEE LDAV 2013 - IEEE Symposium on Large-Scale Data
Analysis and Visualization, Oct 2013, Atlanta, United States.
Conclusion
The Damaris approach in a nutshell
•  Dedicated cores
•  Shared memory
•  Highly adaptable system thanks to plugins and XML
•  Connection to VisIt
Results on I/O
•  Fully hides the I/O jitter and I/O-related costs
•  15× higher sustained write throughput compared to collective I/O
•  Almost perfect scalability
•  Execution time divided by 3.5 compared to collective I/O
•  Enables 600% compression ratio without any overhead
Recent work for in-situ visualization
•  Efficient coupling of simulation and analysis tools
•  Perfectly hides the run-time impact of in-situ visualization
•  Minimal code instrumentation
•  High adaptability
Thank you!
Damaris: How To Efficiently Leverage Multicore
Parallelism To Achieve Scalable, Jitter-Free I/O
And Non-Intrusive In-Situ Visualization
Work involving Matthieu Dorier (IRISA, ENS Cachan), Gabriel Antoniu (INRIA), Marc Snir
(ANL), Franck Cappello (ANL), Leigh Orf (CMICH), Dave Semeraro (NCSA, UIUC), Roberto
Sisneros (NCSA, UIUC), Tom Peterka (ANL)
Gabriel Antoniu
[email protected]
KerData Project-Team
Inria Rennes Bretagne Atlantique
Maison de la simulation, Saclay, January 2014