Presentation: EGU2015-8273 (ESSI session)
The NCI High Performance Computing (HPC) and
High Performance Data (HPD) Platform to Support
the Analysis of Petascale Environmental Data
Collections
Ben Evans1, Lesley Wyborn1, Tim Pugh2, Chris Allen1, Joseph Antony1,
Kashif Gohar1, David Porter1, Jon Smillie1, Claire Trenham1, Jingbo
Wang1, Irina Bastrakova3, Alex Ip3, Gavin Bell4
1ANU, 2Bureau of Meteorology, 3Geoscience Australia, 4The 6th Column Project
(The second part of this talk is in the next ESSI session.)
• High Performance Data (HPD): data that is carefully prepared, standardised
and structured so that it can be used in Data-Intensive Science on HPC
(Evans, ISESS 2015, Springer)
– HPC: turning compute-bound problems into IO-bound problems
– HPD: turning IO-bound problems into ontology and semantics problems
• What are the HPC and HPD drivers?
• How do you build environments on this infrastructure that make it easy for
users to do science?
Top 500 Supercomputer list since 1990

[Figure: Top500 performance development since 1990, with "Current NCI" and
"Next NCI" marked. Source: http://www.top500.org/statistics/perfdevel/]

• Fast and flexible access to structured data is required
• There needs to be a balance between processing power and the ability to
access data (data scaling)
• The focus is on on-demand direct access to large data sources, enabling
high-performance analytics and analysis tools directly on that content
Elephant Flows Place Great Demands on Networks

Round-trip time is essentially fixed, determined by the speed of light, so
even tiny loss rates dominate (assumptions: 10 Gbps TCP flow, 80 ms RTT):

• A physical pipe that leaks water at a rate of 0.0046% by volume still
delivers 99.9954% of the water.
• A network 'pipe' that drops packets at a rate of 0.0046% delivers 100% of
the data, but slowly: at <<5% of optimal speed.

With proper engineering, we can minimize packet loss.
See Eli Dart, Lauren Rotman, Brian Tierney, Mary Hester, and Jason Zurawski. The Science DMZ: A Network Design
Pattern for Data-Intensive Science. In Proceedings of the IEEE/ACM Annual SuperComputing Conference (SC13),
Denver CO, 2013.
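The "<<5% of optimal" figure can be sanity-checked with the Mathis et al.
(1997) steady-state model of a single TCP flow, rate <~ (MSS/RTT) * sqrt(3/2)
/ sqrt(loss). A minimal sketch in Python, assuming a standard 1460-byte MSS
alongside the slide's 80 ms RTT and 0.0046% loss:

import math

def mathis_rate_bps(mss_bytes=1460, rtt_s=0.080, loss=0.000046):
    # Mathis et al. (1997) bound on steady-state TCP throughput:
    # rate <~ (MSS / RTT) * sqrt(3/2) / sqrt(loss)
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss)

rate = mathis_rate_bps()
print(f"~{rate / 1e6:.0f} Mbit/s, i.e. {rate / 10e9:.2%} of the 10 Gbit/s link")

This bounds a single flow at roughly 26 Mbit/s, well under 1% of the link,
which is why engineering loss out of the path (the Science DMZ pattern)
matters more than adding raw bandwidth.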
Computational and Cloud Platforms
Raijin:
• 57,472 cores (Intel Xeon Sandy Bridge technology,
2.6 GHz) in 3592 compute nodes;
• 160 TBytes (approx.) of main memory;
• Infiniband FDR interconnect; and
• 7 PBytes (approx.) of usable fast filesystem (for
short-term scratch space).
• 1.5 MW power; 100 tonnes of water in cooling
Partner Cloud
• Same generation of technology as Raijin (Intel Xeon Sandy Bridge
technology, 2.6 GHz) but only 1500 cores;
• Infiniband FDR interconnect;
• Collaborative platform for services, and the platform for hosting
non-batch services.
NCI Nectar Cloud
• Same generation as the partner cloud
• Non-managed environment
• Weak integration
NCI Cloud

[Diagram: NCI cloud architecture - compute nodes with local SSD and NFS
servers on Lustre, connected over 56 Gb FDR InfiniBand fabrics; per-tenant
public IP assignments (CIDR boundaries - typically /29); OpenStack private
IPs on a quota-managed flat network.]
NCI’s integrated high-performance environment
[Diagram: Internet-facing NCI data movers and Raijin login/data-mover nodes
(10 GigE, with a link to the second data centre); Raijin HPC compute and the
cloud on 56 Gb FDR InfiniBand fabrics (/g/data fabric and Raijin fabric);
Massdata tape archive (cache 1.0 PB, tape 20 PB); the Raijin high-speed
filesystem; and the persistent global parallel filesystems.]

Filesystems:
/g/data1  7.4 PB
/g/data2  6.75 PB
/g/data3  9 PB
/short    7.6 PB
plus /home, /system, /images, /apps
10+ PB of Data for Interdisciplinary Science

[Chart: data holdings by domain and source (BOM, GA, CSIRO, ANU, other
national and international partners) - CMIP5 3 PB; Atmosphere 2.4 PB; Earth
Observation 2 PB; Ocean 1.5 PB; Weather 340 TB; Geophysics 300 TB; Astronomy
(optical) 200 TB; Bathymetry/DEM 100 TB; Marine videos 10 TB; Water.]
National Environment Research Data Collections (NERDC)
1. Climate/ESS Model Assets and Data Products
2. Earth and Marine Observations and Data Products
3. Geoscience Collections
4. Terrestrial Ecosystems Collections
5. Water Management and Hydrology Collections
Data Collections                                      Approx. Capacity
CMIP5, CORDEX                                         ~3 Pbytes
ACCESS products                                       2.4 Pbytes
LANDSAT, MODIS, VIIRS, AVHRR, INSAR, MERIS            1.5 Pbytes
Digital Elevation, Bathymetry, Onshore Geophysics     700 Tbytes
Seasonal Climate                                      700 Tbytes
Bureau of Meteorology Observations                    350 Tbytes
Bureau of Meteorology Ocean-Marine                    350 Tbytes
Terrestrial Ecosystem                                 290 Tbytes
Reanalysis products                                   100 Tbytes
Internationally sourced
• Satellite Data (USGS, NASA, JAXA, ESA, …)
• Reanalysis (ECMWF, NCEP, NCAR, …)
• Climate Data (CMIP5, AMIP, GeoMIP, CORDEX, …)
• Ocean Modelling (Earth Simulator, NOAA, GFDL, …)
These will only increase as we depend on more data, and some will be
replicated. How can we better keep this data in sync, versioned, and
back-referenced to the supplier?
• Organise "long-tail" data that calibrates and integrates with the big data.
How should we manage and version this data, and easily attribute the
supplier (researcher? collaboration? university? agency?)
Some Data Challenges

• Data formats
• Standardise data formats - time to convert legacy and proprietary ones
• Appropriately normalise the data models and conventions
• Adopt HPC-enabled libraries that abstract storage
• Expose all attributes for search (see the sketch after this list)
• Not just collection-level search, not just datasets: all data attributes
• What are the handles we need to access the data?
• Provide more programmatic interfaces and link up data and compute resources
• More server-side processing
• Add semantic meaning to the data
• Create useful datasets (in the programming context) from data collections
• Is it scientifically appropriate for a data service to aggregate/interpolate?
• What unique/persistent identifiers do we need?
• DOIs are only part of the story.
• Versioning is important.
• "Born linked" data and maintaining graph infrastructure
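To make "expose all attributes for search" concrete, here is a minimal
sketch, assuming the netCDF4 Python library and an illustrative file path,
that harvests every global and per-variable attribute from a netCDF file
into a dict ready to load into a search index:

from netCDF4 import Dataset  # pip install netCDF4

def harvest_attributes(path):
    # Collect global and per-variable attributes so they can be pushed
    # into a search index, not just a collection-level record.
    with Dataset(path) as nc:
        record = {"_global": {a: nc.getncattr(a) for a in nc.ncattrs()}}
        for name, var in nc.variables.items():
            record[name] = {a: var.getncattr(a) for a in var.ncattrs()}
    return record

# harvest_attributes("example.nc")  # path is illustrative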
Regularising High Performance Data using HDF5

[Diagram: a layered software stack -
Compilers & tools: Fortran, C, C++; Python, R, MatLab, IDL; Ferret, CDO,
NCL, NCO, GDL, GDAL, GrADS, GRASS, QGIS; GLOBE Claritas; Open Nav Surface.
Metadata layer: ISO 19115, RIF-CS, DCAT, etc.
Library layer 1: netCDF-CF on the netCDF-4 library; HDF-EOS5; libgdal;
[FITS]; airborne geophysics line data [SEG-Y]; BAG; ...
Library layer 2: HDF5 (MPI-enabled) and HDF5 (serial).
Storage: Lustre, plus other storage options.]
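The layering is literal: a netCDF-4 file is an HDF5 file, so the same bytes
can be read at either layer. A small sketch, assuming the netCDF4 and h5py
Python libraries and an illustrative path to a netCDF-4 (not classic
netCDF-3) file:

import h5py               # pip install h5py
from netCDF4 import Dataset

path = "example.nc"       # illustrative; must be a netCDF-4 format file

# Layer 1: the netCDF-4 library presents dimensions, variables and
# CF attributes.
with Dataset(path) as nc:
    print(list(nc.variables))

# Layer 2: the very same file opened as raw HDF5 - netCDF-4 is a profile
# of HDF5, so generic HDF5 tooling sees the same datasets underneath.
with h5py.File(path, "r") as f:
    f.visit(print)        # walk the underlying HDF5 object tree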
Regularising High Performance Data using HDF5 - including Data Services

[Diagram: the same layered stack as the previous slide, with a services tier
that exposes the data model and semantics - OPeNDAP and OGC WMS, WCS, WFS,
WPS and SOS - plus a fast "whole-of-library" catalogue alongside the
metadata layer (ISO 19115, RIF-CS, DCAT, etc.).]
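As an illustration of the services tier, a dataset exposed over OPeNDAP can
be opened with the same netCDF-4 library used for local files, so only the
requested subset crosses the network. A sketch with an illustrative URL and
variable name, assuming the netCDF C library was built with DAP support (the
default in most distributions):

from netCDF4 import Dataset

# Endpoint and variable name are illustrative; any OPeNDAP-enabled
# THREDDS URL behaves the same way.
url = "http://example.org/thredds/dodsC/path/to/dataset.nc"
ds = Dataset(url)          # opens remotely; nothing is downloaded yet
tas = ds.variables["tas"]  # hypothetical variable name
print(tas[0, :10, :10])    # only this slab crosses the network
ds.close()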
Finding data and services

[Diagram: a GeoNetwork catalogue (Lucene database) indexes the /g/data1 and
/g/data2 collections and feeds supercomputer access, virtual laboratories,
and DAP, OGC and other services.]

Trialling Elasticsearch.
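Since GeoNetwork publishes a standard OGC CSW interface, the catalogue can
also be queried programmatically. A minimal sketch using OWSLib, with an
illustrative endpoint and search term:

from owslib.csw import CatalogueServiceWeb  # pip install OWSLib
from owslib.fes import PropertyIsLike

# Endpoint is illustrative; GeoNetwork exposes CSW at a URL like this.
csw = CatalogueServiceWeb("http://example.org/geonetwork/srv/eng/csw")
csw.getrecords2(constraints=[PropertyIsLike("csw:AnyText", "%CMIP5%")],
                maxrecords=10)
for rec in csw.records.values():
    print(rec.title)  # matching catalogue records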
Prototype to Production - anti-"Mine" craft

Virtual Labs:
• Separating researchers from software builders
• Cloud is an enabler, but:
• don't make researchers become full system admins
• save developers from being operational

Project lifecycle - and preparing for success

[Chart: perspiration vs. productivity over the project lifecycle, from
Proj1:Start to Proj1:End and Proj2-4:Start to Proj2-4:End.]
Prototype to Production - anti-"Mine" craft

[Chart: headspace hours for VL managers and developers across the
development phase of a project, under poorly, reasonably, and well executed
transitions.]
Prototype to Production - anti-"Mine" craft (cont.)

[Chart: as on the previous slide, but with the scope changed once the VL is
adopted broadly - VL manager and developer headspace hours across the
development phase, under poorly, reasonably, and well executed transitions.]
Virtual Laboratory driven software patterns

[Diagram: software stacks assembled in layers - basic OS functions, common
modules, the NCI environment stack, special configuration choices, and
bespoke services (Workflow X, GridFTP, P2P), plus analytics and
visualisation stacks. Stacks are taken from upstream and used as bundles:
NCI Stack 1, 2x Stack 1, modified Stack 1, modified Stack 2, and a "super"
software stack.]
Transition from developer, to prototype, to DevOps
Step 1: Development
• Get a template for development
• Identify what is special; separate out what is common
• Reuse other software stacks where possible
Step 2: Prototype
• Deploy in an isolated tenant of a cloud
• Determine dependencies
• Test cases to demonstrate correct functioning
Step 3: Sustainability
• Pull the repo into the operational tenant
• Prepare the bundle for integration with the rest of the framework
• Hand back a cleaned bundle
• Establish a DevOps process
DevOps approach to building and operating environments

Virtual Laboratory Operational Bundle:
- Git controlled
- pull model
- continuous integration testing
Built from NCI Core Bundles plus community repos (Community1, Community2).
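A minimal sketch of what the git-controlled pull model with a test gate
could look like in practice (paths and test command are illustrative, not
NCI's actual tooling):

import subprocess
from pathlib import Path

def update_bundle(repo_dir, test_cmd=("python", "-m", "pytest")):
    # Pull-model update for an operational bundle: record the current
    # commit, fast-forward to upstream, run the bundle's tests, and
    # roll back if anything fails.
    repo = Path(repo_dir)
    old = subprocess.check_output(["git", "rev-parse", "HEAD"],
                                  cwd=repo, text=True).strip()
    subprocess.run(["git", "pull", "--ff-only"], cwd=repo, check=True)
    if subprocess.run(test_cmd, cwd=repo).returncode != 0:
        subprocess.run(["git", "reset", "--hard", old], cwd=repo, check=True)
        raise RuntimeError(f"tests failed; rolled back to {old[:8]}")

# update_bundle("/opt/vl-bundles/climate")  # path is illustrative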
Advantages
• Separates roles and responsibilities - from gatekeeper to DevOps management:
• specialist on the package
• VL managers
• system admin
• From "architecture" to "platform":
• flexible with technology change
• makes handover/maintenance easier
• Both test/dev/ops and patches/rollback become business as usual (BAU)
• Sharable bundles
• Can tag releases of software stacks
• Precondition for trusted software stacks
• Provenance - scientific / government policy scrutiny
A snapshot of layered bundles to build complex VLs
Easy analysis environments

Increasing use of IPython Notebooks.
VDI: an easy in-situ environment using virtual analysis desktops.
VDI – cont …
NCI Petascale Data-Intensive Science Platform

[Diagram: data services (THREDDS), server-side analysis and visualisation,
VDI cloud-scale user desktops on the data, and web-time analytics software,
all sitting over 10 PB+ of research data.]
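To make the data-services layer concrete, a THREDDS catalogue can be walked
programmatically with Unidata's siphon library. A small sketch with an
illustrative catalogue URL:

from siphon.catalog import TDSCatalog  # pip install siphon

# Catalogue URL is illustrative; every THREDDS server publishes catalog.xml.
cat = TDSCatalog("http://example.org/thredds/catalog.xml")
print(list(cat.catalog_refs)[:5])       # sub-catalogues to descend into
for name, ds in list(cat.datasets.items())[:5]:
    print(name, ds.access_urls.get("OPENDAP"))  # per-dataset service endpoints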
Summary: Progress toward Major Milestones

• Interdisciplinary Science
Publish, catalogue and access data and software to enhance
interdisciplinary, big data-intensive (HPD) science, with interoperable
data services and protocols.
• Integrity of Science
Managed services to capture a workflow's process as a comparable, traceable
output. Ease of access to data and software for enhanced workflow
development and repeatable science, conducted with less effort and with
accelerated outputs.
• Integrity of Data
Data repository services to ensure data integrity, provenance records,
universal identifiers, and repeatable data discovery and access from
workflows or interactive users.