Implementation of a high performance scientific cloud backed by Dell hardware

Dr. Muhammad Atif, Jakub Chrzeszczyk, Dongyang Li, Matthew Sanderson
National Computational Infrastructure, The Australian National University
nci.org.au | @NCInews
Agenda
• About NCI
• NCI Infrastructure
• Why Cloud @ NCI
• Benchmarks
• Conclusions and Future
• Questions
NCI – an overview

Mission:
• To foster ambitious and aspirational research objectives and to enable their realisation, in the Australian context, through world-class, high-end computing services.

NCI is:
• Driven by research objectives,
• An integral component of the Commonwealth's research infrastructure program,
• A comprehensive, vertically-integrated research service,
• Engaging with, and embedded in, research communities, high-impact centres, and institutions,
• Fostering aspiration in computational and data-intensive research, and increasing ambition in the use of HPC to enhance research impact,
• Delivering innovative virtual laboratories,
• Providing national access on priority and merit, and
• Built on, and sustained by, a collaboration of national organisations and research-intensive universities.

[Diagram: NCI's service stack connects the research objectives of communities and institutions, through access and services (expertise support and development; HPC services; virtual laboratories and data-intensive services; integration; compute (HPC/cloud); storage/network infrastructure), to research outcomes.]
NCI: Partnership, Governance, and Co-investment

• NCI Collaboration Agreement (2012–15)
  – Major Collaborators: ANU, CSIRO, BoM, GA
  – Universities: Adelaide, Monash, UNSW, Queensland, Sydney, Deakin (and ANU)
  – University Consortia: Intersect, QCIF
• Implications of co-investment:
  – Engaged and outcomes driven
  – Shape the enterprise in the national interest
  – Ownership and responsibility
  – Provide drivers for comprehensive and integrated services
• Growing co-investment:
  – 2007: $0M; 2008: $3.4M; 2009: $6.4M; 2011: $7.5M; 2012: $8.5M; 2013: $11M; 2014: $11+M; to provide for all recurrent operations
• Strong governance
  – NCI governed by ANU on advice from the NCI Board
  – Executive-level representation from ANU, CSIRO, BoM, GA, and research-intensive universities
NCI: comprehensive and integrated; quality and innovation

• NCI today
  – Computational and data-intensive services: 3,000+ users, 600 active projects
  – Robust: 19 staff in operations; 9 in user support
  – Innovative: 20 staff in applications support/development, virtual labs, collections management and visualisation, working in a dev-ops environment
  – Engaged: partner organisations and other e-infrastructure (NeCTAR, RDSI, ANDS)
Current Infrastructure

o Raijin – Australia's first petaflop supercomputer
  - 57,680 Intel Sandy Bridge cores (2.6 GHz)
  - 157 TBytes memory
  - FDR InfiniBand
  - CentOS 6.x Linux; PBS Pro scheduler
  - 1,195 Tflops, 1,400,000 SPECFPrate
  - 10 PBytes scratch file system at 150 GBytes/sec
  - 1.5 MW power; 100 tonnes of water
o Data Storage
  - HSM (massdata): 10 PB (dual site, redundant)
  - Global Lustre filesystems: /g/data{1,2,3}, over 15 PB of storage at up to 55 GBytes/sec
Cloud Infrastructure

o Cloud computing at NCI since 2009.
o Several clusters, all powered by Dell.
o Stable, solid, and best cost/performance.
o VMware Cloud (since 2009)
  o For web services and database clusters.
  o Due for an upgrade.
o DCC Cluster (2010), retired in favour of OpenStack.
  o Virtualised under VMware.
  o For workloads not typically suited to HPC clusters, e.g. an Oracle DB inside a compute node.
  o Single-node compute jobs.
  o OpenMP or Python.
o Red Hat OpenStack Cloud (2012)
  o 384-core private cloud; reuse of DCC hardware.
  o Typically for Virtual Laboratories.
NCI Cloud Infrastructure

o NeCTAR Research Cloud (private cloud)
  o Intel Sandy Bridge (3,200 cores with HT).
  o 800 GB of SSD per compute node.
  o 56G Ethernet.
  o No access to the internal environment (no access to the high-speed filesystems) due to security issues.
o Tenjin Partner Cloud (pre-production this month)
  o Flagship cloud.
  o Same hardware as NeCTAR, but without HT.
  o RDO and CentOS 6.x.
  o Architected for the strong computational and I/O performance needed for "big data" research.
  o SR-IOV (no live migration), FFT and 56G Ethernet.
  o On-demand access to GPU nodes.
  o Access to over 20 PB of Lustre storage.
NCI Cloud Infrastructure (continued)

o InfiniCloud
  o Experimental FDR InfiniBand cloud.
  o OpenStack Icehouse, heavily modified by NCI.
  o Mellanox is yet to release native IB support for Icehouse.
  o Native IB to the virtual machines via SR-IOV (see the sketch below).
  o Once stable, Tenjin will move to native IB.
o Another cloud (not yet named; not Dell)
  o Reuses NCI's previous supercomputer.
  o Native IB.
  o QDR.
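One quick way to confirm that an SR-IOV virtual function has actually been passed through to a guest is to look for InfiniBand devices in sysfs. The sketch below is purely illustrative and is not NCI tooling; the sysfs layout is standard Linux, but device names will vary.

# Illustrative sketch (not NCI tooling): list InfiniBand devices visible
# inside a guest, e.g. to confirm SR-IOV passthrough of a virtual function.
from pathlib import Path

IB_SYSFS = Path("/sys/class/infiniband")

def list_ib_devices():
    if not IB_SYSFS.is_dir():
        return []                      # no IB stack, or no device passed through
    devices = []
    for dev in sorted(IB_SYSFS.iterdir()):
        rate_files = sorted(dev.glob("ports/*/rate"))
        rate = rate_files[0].read_text().strip() if rate_files else "unknown rate"
        devices.append((dev.name, rate))
    return devices

devs = list_ib_devices()
if not devs:
    print("no InfiniBand devices visible in this guest")
for name, rate in devs:
    print(f"{name}: {rate}")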
NCI Data Centre

[Diagram: layout of the NCI datacenter and the Huxley datacenter, connected by a metro link: Raijin compute, storage/Raijin management, cloud, cloud backup, Deja-Vayu, FX10, storage (r.Lustre, gdata{1..n}) and tape robots, with 10G and 40G links.]
NCI Systems Connectivity

[Diagram: NCI data movers and the Raijin login/data mover nodes link the Internet and the Huxley DC to the Raijin 56Gb FDR IB fabric (Raijin compute; /short, 7.6PB; /home, /system, /images, /apps) and the /g/data 56Gb FDR IB fabric (/g/data1, ~7.4PB; /g/data2, ~6.5PB), with Massdata (cache 1.0PB, tape 12.3PB), the cloud and VMware attached via 10 GigE.]
Commercial Cloud vs Science Cloud

Commercial Cloud (& NeCTAR):
• Equilibrium/Uptime
• Run forever
• Many applications per server
• Light compute load
• Light data load
• No scheduling
• Regular production environment
• Examples:
  – Email servers
  – Web servers
  – CRM

HPC Science Cloud (e.g., NCI):
• Varying needs / loads
• Run until outcomes or insights
• Many servers per application
• Computationally intensive
• Data intensive
• Scheduling
• Specialised and integrated (high-performance) environment
• Examples:
  – Data-intensive computation: analysis, mining, visualisation
  – Parametric computation; Monte Carlo methods
NCI: OpenStack Cloud (NeCTAR and Tenjin)

Physical nodes:                   200+
Cores:                            3200+
Core type:                        Intel Sandy Bridge, 2.6 GHz
Total memory:                     25 TBytes
Memory / node:                    128 GB
Memory speed:                     1333 MHz
Local fast SSD capacity / node:   400/800 GB
Per node read / write:            500 / 460 MB/s
Per node IOPS read / write:       75,000 / 36,000
Other storage:                    ~650 TB (Swift/Ceph)
GPU:                              10
Network:                          56 Gbps IB FDR
Why High Performance Cloud?

Job statistics on Raijin show that users are heavily invested in parallel jobs.

[Chart: job statistics from NCI's dashboard. Courtesy of Dr. Lei Shang.]
Why High Performance Cloud? (cont.)

o Complements the NCI supercomputer.
o Single-node jobs are no fun for a big supercomputer.
o Virtual Laboratories.
o Remote job submission.
o Visualisation.
o Web services.
  o Mounts the global file-system at NCI.
o On-demand GPU access.
o Workloads not best suited to Lustre.
  o Local scratch is SSD on the NCI cloud, compared to SATA HDD on Raijin.
o Pipelines that are not suited to the supercomputer.
  o Packages that cannot or will not be supported.
  o Proofs of concept before making a big run.
o Cloud burst.
  o A fancy name for offloading single-node jobs to the cloud when the system is under heavy use.
o Student courses.
o NCI specific: a huge global Lustre file system for pre- and post-processing of jobs.
  o Cloud → Raijin → Cloud → Web (all @ NCI)
Architectural Design: High-Performance RC Node at NCI

Architecture of Tenjin – overview

The following diagram, Figure 3, depicts the proposed InfiniBand-based OpenStack cluster. InfiniBand is used as a converged fabric, which allows for the transport of Ethernet, storage and other traffic over a single interconnect.

Figure 3: Architecture of the InfiniBand-based OpenStack cluster.
Preliminary Results (Platform)

Cluster      Architecture                                    Interconnect                                           Location
Raijin       Intel Xeon E5-2670 @ 2.60 GHz (Sandy Bridge)    Mellanox FDR InfiniBand, FFT                           NCI
Tenjin       Intel Xeon E312xx @ 2.60 GHz (Sandy Bridge)     Mellanox FDR InfiniBand flashed to 56G Ethernet, FFT   NCI
InfiniCloud  Intel Xeon E5-2650 0 @ 2.00 GHz                 Mellanox FDR InfiniBand                                NCI
10G-Cloud    AMD Opteron 63xx                                10G Ethernet

o OpenMPI 1.8.
o All applications compiled with GCC at -O3; no Intel compilers were used, for a fair comparison.
o All clouds were based on OpenStack Icehouse.
o Not-so-scientific results: each benchmark was run 10 times, the maximum and minimum were discarded, and the rest averaged (see the sketch below).
o Comprehensive results to follow.
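As a concrete illustration of the averaging procedure above, the minimal sketch below drops the fastest and slowest of the runs and averages the rest. The run times are made-up placeholders, not NCI measurements.

# Minimal sketch of the "discard max and min, then average" procedure
# described above.  The sample run times are placeholders only.

def trimmed_mean(samples):
    """Average after discarding the single largest and smallest sample."""
    if len(samples) < 3:
        raise ValueError("need at least 3 samples to trim max and min")
    trimmed = sorted(samples)[1:-1]
    return sum(trimmed) / len(trimmed)

run_times = [12.4, 11.9, 12.1, 30.5, 12.0, 12.2, 11.8, 12.3, 12.1, 12.0]
print(f"trimmed mean over {len(run_times)} runs: {trimmed_mean(run_times):.2f} s")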
OSU Latency Benchmark

[Chart: OSU point-to-point latency in microseconds (log scale, lower is better) versus message size, for 10G-Cloud, Tenjin-TCP, Tenjin-Yalla, InfiniCloud and Raijin.]
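For context, the OSU latency test is essentially a ping-pong measurement between two MPI ranks: half the average round-trip time of small messages. The sketch below shows the same idea with mpi4py; it is an illustration only, not the OSU benchmark itself, and assumes mpi4py and NumPy are available in the image.

# Ping-pong latency sketch in the spirit of the OSU latency test (not the
# OSU code itself).  Run with: mpirun -np 2 python pingpong.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
iters, msg_size = 1000, 8                    # 1000 iterations of 8-byte messages
buf = np.zeros(msg_size, dtype=np.uint8)

comm.Barrier()
start = MPI.Wtime()
for _ in range(iters):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=0)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    # one-way latency is half the average round-trip time, in microseconds
    print(f"one-way latency: {elapsed / iters / 2 * 1e6:.2f} us")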
OSU Bandwidth Benchmark

[Chart: OSU point-to-point bandwidth in MB/s (higher is better) versus message size in bytes, for 10G-Cloud, Tenjin-TCP, Tenjin-Yalla, InfiniCloud and Raijin.]
Bioinformatics Workload

[Chart: speedups of Inchworm, Chrysalis and Butterfly on Raijin and Tenjin relative to the 10G-XXX-Cloud baseline, using 16 CPUs on one compute node.]

Trinity is a bioinformatics de novo sequence assembly package consisting of three programs: Inchworm (OpenMP, gcc), Chrysalis (OpenMP, gcc) and Butterfly (Java). The calculation followed the procedure published by B. J. Haas et al., Nature Protocols 8, 1494–1512 (2013).
Courtesy: Dr. Ching-Yeh (Leaf) Lin, NCI
NPB Results (32 processes only)

[Chart: NAS Parallel Benchmarks CG.C.32, EP.C.32, FT.C.32, IS.C.32, LU.C.32 and MG.C.32, showing speedup with respect to 10G-Cloud for Raijin-Partial, Raijin-Native-2.6GHz, Raijin-2GHz, InfiniCloud-Hypervisor, InfiniCloud-VM, Tenjin-VM-Partial, Tenjin-VM-Yalla and 10G-XXX-Cloud-2.3GHz-Opteron.]
NAS Parallel Benchmarks
64 processes (16 processes per virtual machine), Class C
Speedup relative to 10G-Node (higher is better)

Benchmark   10G-Node   Tenjin-56G-Ethernet
BT.C.64     1.00       1.49
CG.C.64     1.00       2.22
EP.C.64     1.00       1.84
FT.C.64     1.00       2.50
IS.C.64     1.00       1.90
LU.C.64     1.00       1.58
MG.C.64     1.00       1.81
SP.C.64     1.00       1.42
Scaling is a big problem for the OpenStack cloud

[Chart: compute time in seconds (log scale, lower is better) versus number of CPUs (1 to 128), comparing RDO and Raijin (RJ) with TCP, TCP+MXM, OIB and OIB+MXM transports.]

Computational physics: a custom-written hybrid Monte Carlo code for generating gauge fields for lattice QCD. For each iteration, calculating the Hamiltonian involves inverting a large, complex matrix using CGNE (see the sketch below). The code is written in Fortran, using pure MPI (no threading).
Courtesy: Dr. Benjamin Menadue
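To make the solver named in the caption concrete, here is a hedged NumPy sketch of conjugate gradient applied to the normal equations, the family CGNE belongs to; the exact formulation used in the Fortran code may differ. The small dense random matrix only stands in for the sparse fermion matrix of the real lattice QCD code.

# Sketch of CG on the normal equations, solving A x = b via A^H A x = A^H b.
# Illustrative only: a dense random matrix replaces the sparse fermion matrix.
import numpy as np

def cg_normal_equations(A, b, tol=1e-8, max_iter=1000):
    x = np.zeros(A.shape[1], dtype=A.dtype)
    r = A.conj().T @ (b - A @ x)             # residual of the normal equations
    p = r.copy()
    rs_old = np.vdot(r, r).real
    for _ in range(max_iter):
        Ap = A.conj().T @ (A @ p)            # apply A^H A without forming it
        alpha = rs_old / np.vdot(p, Ap).real
        x += alpha * p
        r -= alpha * Ap
        rs_new = np.vdot(r, r).real
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))
b = rng.standard_normal(64) + 1j * rng.standard_normal(64)
x = cg_normal_equations(A, b)
print("residual norm:", np.linalg.norm(A @ x - b))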
OpenStack-specific HPC issues

o Parallel jobs run on the cloud, but is it HPC?
o Hardware performance counters?
o It took over 5 years to realise that NUMA awareness is essential.
  o Addressed in Juno; we have not tested it yet.
o SR-IOV is still not the answer.
  o M(ulti)RIOV (Dell?)
o Locality-aware scheduling (see the sketch after this list).
  o Not there yet; NCI is working on it.
o Our benchmarks were hindered by the QPI performance of Sandy Bridge.
  o Mellanox suggests this is better in later Intel architectures.
o Lustre.
  o Security model (Manila?).
o Single-node performance is on par with bare metal.
  o No surprise here; it was the case 5 years ago as well.
o NCI plans to move to bare-metal provisioning for parallel applications.
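As a small illustration of the locality information such scheduling needs, the sketch below reads standard Linux sysfs entries to find which NUMA node a NIC is attached to and which CPUs share that node. It is purely illustrative and not NCI's implementation; the device name ib0 is an assumption.

# Illustrative sketch (not NCI's scheduler): report the NUMA node of a NIC's
# PCI device and the CPUs local to that node, using standard Linux sysfs.
from pathlib import Path

def numa_node_of_netdev(dev="ib0"):            # "ib0" is an assumed device name
    node_file = Path(f"/sys/class/net/{dev}/device/numa_node")
    if not node_file.exists():
        return None                            # virtual device or sysfs not exposed
    return int(node_file.read_text().strip())

def cpus_on_node(node):
    cpulist = Path(f"/sys/devices/system/node/node{node}/cpulist")
    return cpulist.read_text().strip() if cpulist.exists() else "unknown"

node = numa_node_of_netdev("ib0")
if node is None or node < 0:
    print("NUMA node not reported for this device")
else:
    print(f"ib0 is attached to NUMA node {node}, CPUs {cpus_on_node(node)}")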
Partnership with Dell

o Best cost/performance ratio.
o Fastest research cloud in Australia.
o Robust hardware.
o Four cloud clusters in production, all Dell.
o Excellent support.