Implementation of a high performance scientific cloud backed by Dell hardware
Dr. Muhammad Atif, Jakub Chrzeszczyk, Dongyang Li, Matthew Sanderson
National Computational Infrastructure, The Australian National University
nci.org.au | @NCInews

Agenda
• About NCI
• NCI Infrastructure
• Why Cloud @ NCI
• Benchmarks
• Conclusions and Future
• Questions

NCI – an overview
Mission:
• To foster ambitious and aspirational research objectives and to enable their realisation, in the Australian context, through world-class, high-end computing services.
NCI is:
• Driven by research objectives,
• An integral component of the Commonwealth's research infrastructure program,
• A comprehensive, vertically-integrated research service,
• Engaging with, and embedded in, research communities, high-impact centres and institutions,
• Fostering aspiration in computational and data-intensive research, and increasing ambition in the use of HPC to enhance research impact,
• Delivering innovative virtual laboratories,
• Providing national access on priority and merit, and
• Built on, and sustained by, a collaboration of national organisations and research-intensive universities.
[Diagram: NCI service stack – research objectives, communities and institutions, access and services, and research outcomes at the top, supported by expertise, support and development; HPC services; virtual laboratories and data-intensive services; integration; compute (HPC/cloud); and storage/network infrastructure.]

NCI: Partnership, Governance, and Co-investment
• NCI Collaboration Agreement (2012–15)
  – Major Collaborators: ANU, CSIRO, BoM, GA
  – Universities: Adelaide, Monash, UNSW, Queensland, Sydney, Deakin (and ANU)
  – University Consortia: Intersect, QCIF
• Implications of co-investment:
  – Engaged and outcomes driven
  – Shape the enterprise in the national interest
  – Ownership and responsibility
  – Provide drivers for comprehensive and integrated services
• Growing co-investment, providing for all recurrent operations:
  – 2007: $0M; 2008: $3.4M; 2009: $6.4M; 2011: $7.5M; 2012: $8.5M; 2013: $11M; 2014: $11+M
• Strong governance
  – NCI governed by ANU on advice from the NCI Board
  – Executive-level representation from ANU, CSIRO, BoM, GA and research-intensive universities

NCI: comprehensive and integrated; quality and innovation
• NCI today – computational and data-intensive services
  – 3,000+ users, 600 active projects
  – Robust: 19 staff in operations; 9 in user support
  – Innovative: 20 staff in applications support/development, virtual labs, collections management and visualisation, in a dev-ops environment
  – Engaged: partner organisations and other e-infrastructure (NeCTAR, RDSI, ANDS)

Current Infrastructure
o Raijin – Australia's first petaflop supercomputer
  - 57,680 Intel Sandy Bridge cores (2.6 GHz)
  - 157 TBytes of memory
  - FDR InfiniBand
  - CentOS 6.x Linux; PBS Pro scheduler
  - 1,195 Tflops peak, 1,400,000 SPECFPrate (a rough cross-check of the peak figure follows this slide)
  - 10 PBytes scratch file system at 150 GBytes/sec
  - 1.5 MW power; 100 tonnes of water
o Data Storage
  - HSM (massdata): 10 PB (dual site, redundant)
  - Global Lustre filesystems /g/data{1,2,3}: over 15 PB of storage
  - Up to 55 GBytes/sec
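A rough cross-check of the quoted peak figure, assuming (this is an assumption, not stated on the slide) 8 double-precision FLOPs per core per cycle for Sandy Bridge with AVX:

\[
57{,}680~\text{cores} \times 2.6~\text{GHz} \times 8~\tfrac{\text{FLOPs}}{\text{cycle}} \approx 1.2\times10^{15}~\text{FLOPS} \approx 1{,}200~\text{Tflops},
\]

which is in line with the quoted 1,195 Tflops peak.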
Cloud Infrastructure
o Cloud computing at NCI since 2009.
o Several clusters, all powered by Dell.
o Stable, solid, and the best cost/performance.
o VMware Cloud (since 2009)
  o For web services and database clusters.
  o Due for an upgrade.
o DCC Cluster (2010) – retired in favour of OpenStack.
  o Virtualised under VMware.
  o For workloads not typically suited to HPC clusters, e.g. an Oracle DB inside a compute node.
  o Single-node compute jobs: OpenMP or Python.
o Red Hat OpenStack Cloud (2012)
  o 384-core private cloud, reusing the DCC hardware.
  o Typically for Virtual Laboratories.

NCI Cloud Infrastructure
o NeCTAR Research Cloud (private cloud)
  o Intel Sandy Bridge (3,200 cores with HT).
  o 800 GB of SSD per compute node.
  o 56G Ethernet.
  o No access to the internal environment (no access to the high-speed filesystem) due to security issues.
o Tenjin Partner Cloud (pre-production this month)
  o Flagship cloud.
  o Same hardware as NeCTAR, but without HT.
  o RDO and CentOS 6.x.
  o Architected for the strong computational and I/O performance needed for "big data" research.
  o SR-IOV (no live migration), FFT and 56G Ethernet.
  o On-demand access to GPU nodes.
  o Access to over 20 PB of Lustre storage.

NCI Cloud Infrastructure (cont.)
o InfiniCloud
  o Experimental FDR InfiniBand cloud.
  o Icehouse, heavily modified by NCI; Mellanox is yet to release native IB support for Icehouse.
  o Native IB to the virtual machines via SR-IOV.
  o Once stable, Tenjin will move to native IB.
o Another cloud (not yet named; not a Dell)
  o Reuses NCI's previous supercomputer.
  o Native IB, QDR.

NCI Data Centre
[Diagram: NCI and Huxley data centres – Raijin compute, storage/Raijin management, cloud, Deja-Vayu, FX10, cloud backup, storage (r.Lustre, gdata{1..n}) and tape robots, linked by 10G/40G and Metro-X connections.]

NCI Systems Connectivity
[Diagram: Internet-facing NCI data movers and Raijin login/data-mover nodes; Raijin compute and /g/data on 56 Gb FDR IB fabrics with 10 GigE links; massdata cache (1.0 PB) and tape (12.3 PB); cloud and VMware; Raijin filesystems /short (7.6 PB), /home, /system, /images, /apps; /g/data1 (~7.4 PB) and /g/data2 (~6.5 PB); a link to the Huxley DC.]

Commercial Cloud vs Science Cloud
Commercial cloud (and NeCTAR):
• Equilibrium/uptime; run forever.
• Many applications per server.
• Light compute load; light data load.
• No scheduling.
• Examples: email servers, web servers, CRM.
• A regular production environment.
HPD science cloud (e.g., NCI):
• Varying needs/loads; run until outcomes or insights.
• Many servers per application.
• Computationally intensive and data intensive.
• Scheduling.
• Examples: data-intensive computation (analysis, mining, visualisation); parametric computation; Monte Carlo methods.
• A specialised and integrated (high-performance) environment.

NCI: OpenStack Cloud (NeCTAR and Tenjin)
• Physical nodes: 200+
• Cores: 3,200+
• Core type: Intel Sandy Bridge 2.6 GHz
• Total memory: 25 TBytes
• Memory per node: 128 GB at 1333 MHz
• Local fast SSD capacity per node: 400/800 GB
• Per-node read/write: 500 / 460 MB/s
• Per-node IOPS read/write: 75,000 / 36,000
• Other storage: ~650 TB (Swift/Ceph)
• GPUs: 10
• Network: 56 Gbps IB FDR

Why High Performance Cloud?
Job statistics on Raijin show that users are really into parallel jobs.
[Screenshot: NCI's dashboard of Raijin job statistics. Courtesy Dr. Lei Shang.]

Why High Performance Cloud? (cont.)
o Complements the NCI supercomputer: single-node jobs are no fun for a big supercomputer.
o Virtual Laboratories.
o Remote job submission.
o Visualisation.
o Web services that mount the global file system at NCI.
o On-demand GPU access.
o Workloads not best suited to Lustre: local scratch is SSD on the NCI cloud, compared to SATA HDD on Raijin.
o Pipelines that are not suited to the supercomputer:
  o Packages that cannot/will not be supported.
  o Proofs of concept before making a big run.
o Cloud burst: a fancy name for offloading single-node jobs to the cloud when the system is under heavy use (a minimal sketch of this kind of job follows this slide).
o Student courses.
o NCI specific: a huge global Lustre file system for pre- and post-processing of jobs, shared across the cloud, Raijin and web services (all @ NCI).
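To make the "single-node OpenMP job" category above concrete, here is a minimal, self-contained sketch of the kind of shared-memory workload that fits comfortably in one cloud VM rather than on the supercomputer. It is illustrative only; the midpoint-rule estimate of pi is an invented example, not an NCI workload.

/* Minimal single-node OpenMP job of the kind that suits a cloud VM.
 * Illustrative only: estimates pi by integrating 4/(1+x^2) on [0,1]
 * with the midpoint rule.
 * Build: gcc -O3 -fopenmp pi_openmp.c -o pi_openmp
 */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const long n = 100000000L;          /* number of midpoint intervals */
    const double h = 1.0 / (double)n;
    double sum = 0.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)   /* fan work across the VM's cores */
    for (long i = 0; i < n; i++) {
        double x = (i + 0.5) * h;
        sum += 4.0 / (1.0 + x * x);
    }
    double pi = sum * h;
    double t1 = omp_get_wtime();

    printf("pi ~= %.12f  (%d threads, %.3f s)\n",
           pi, omp_get_max_threads(), t1 - t0);
    return 0;
}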
Architectural Design: High-Performance RC Node at NCI (Architecture of Tenjin)
Overview: the following diagram, Figure 3, depicts the proposed InfiniBand-based OpenStack cluster.
[Figure 3: Architecture of the InfiniBand-based OpenStack cluster.]
InfiniBand is used as a converged fabric which allows for the transport of Ethernet, storage and other traffic.

Preliminary Results (Platform)
o Raijin: Intel Xeon E5-2670 @ 2.60 GHz (Sandy Bridge); Mellanox FDR InfiniBand, FFT; at NCI.
o Tenjin: Intel Xeon E312xx @ 2.60 GHz (Sandy Bridge); Mellanox FDR InfiniBand flashed to 56G Ethernet, FFT; at NCI.
o InfiniCloud: Intel Xeon E5-2650 @ 2.00 GHz; Mellanox FDR InfiniBand; at NCI.
o 10G-Cloud: AMD Opteron 63xx; 10G Ethernet.
o OpenMPI 1.8.
o All applications compiled with GCC at -O3. Note: no Intel compilers, for a fair comparison.
o All clouds were based on OpenStack Icehouse.
o Not-so-scientific results: 10 runs per test, with the max and min discarded and the rest averaged.
o Comprehensive results to follow.
A minimal ping-pong sketch in the spirit of the OSU point-to-point tests below is given after the results slides.

OSU Latency Benchmark
[Chart: latency in microseconds (log scale) vs message size; lower is better. Series: 10G-Cloud, Tenjin-TCP, Tenjin-Yalla, InfiniCloud, Raijin.]

OSU Bandwidth Benchmark
[Chart: bandwidth in MB/s vs message size in bytes; higher is better. Series: 10G-Cloud, Tenjin-TCP, Tenjin-Yalla, InfiniCloud, Raijin.]

Bioinformatics Workload
[Chart: speedups relative to 10G-XXX-Cloud on 16 CPUs (one compute node) for Inchworm, Chrysalis and Butterfly, comparing Raijin, Tenjin and 10G-XXX-Cloud.]
Trinity is a bioinformatics de novo sequence-assembly package consisting of three programs: Inchworm (OpenMP, gcc), Chrysalis (OpenMP, gcc) and Butterfly (Java). The calculation was carried out following the procedure published by B.J. Haas et al., Nature Protocols 8, 1494–1512 (2013). Courtesy: Dr. Ching-Yeh (Leaf) Lin, NCI.

NPB Results (32 processes only)
[Chart: NAS Parallel Benchmarks, Class C, 32 processes; speedup w.r.t. 10G-Cloud for CG, EP, FT, IS, LU and MG. Series: Raijin-Partial, Raijin-Native-2.6GHz, Raijin-2GHz, InfiniCloud-Hypervisor, InfiniCloud-VM, Tenjin-VM-Partial, Tenjin-VM-Yalla, 10G-XXX-Cloud-2.3GHz-Opteron.]

NAS Parallel Benchmarks, 64 processes (16 processes per virtual machine), Class C
Speedup relative to 10G-Node – higher is better:
                        BT.C.64  CG.C.64  EP.C.64  FT.C.64  IS.C.64  LU.C.64  MG.C.64  SP.C.64
  10G-Node                1.00     1.00     1.00     1.00     1.00     1.00     1.00     1.00
  Tenjin-56G-Ethernet     1.49     2.22     1.84     2.50     1.90     1.58     1.81     1.42

Scaling is a big problem for OpenStack Cloud
[Chart: compute time in seconds (log scale, lower is better) vs number of CPUs (1–128) for RDO TCP, RDO TCP MXM, RDO OIB, RDO OIB MXM, RJ TCP, RJ TCP MXM, RJ OIB and RJ OIB MXM.]
Computational physics: a custom-written hybrid Monte Carlo code for generating gauge fields for lattice QCD. For each iteration, calculating the Hamiltonian involves inverting a large, complex matrix using CGNE. Written in Fortran, using pure MPI (no threading). Courtesy: Dr. Benjamin Menadue.
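For context on the OSU-style point-to-point measurements behind the latency and bandwidth charts above, the following is a minimal MPI ping-pong sketch in the same spirit. It is not the OSU benchmark itself; the 8-byte message size and iteration counts are arbitrary illustrative choices.

/* Minimal MPI ping-pong latency sketch, in the spirit of the OSU latency test.
 * Build: mpicc -O3 pingpong.c -o pingpong
 * Run with exactly 2 ranks, e.g.: mpirun -np 2 ./pingpong
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const int iters = 10000;        /* timed round trips (illustrative) */
    const int warmup = 1000;        /* untimed warm-up round trips */
    const int msg_size = 8;         /* message size in bytes (illustrative) */
    char buf[8];
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    memset(buf, 0, sizeof buf);

    /* warm-up followed by timed ping-pong between rank 0 and rank 1 */
    double t0 = 0.0;
    for (int i = 0; i < warmup + iters; i++) {
        if (i == warmup) {
            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
        }
        if (rank == 0) {
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - t0;

    /* one-way latency = half the average round-trip time */
    if (rank == 0)
        printf("%d-byte one-way latency: %.2f us\n",
               msg_size, elapsed / iters / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}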
OpenStack-specific HPC issues
o Parallel jobs run on the cloud, but is it HPC?
o Hardware performance counters?
o It took over five years to realise that NUMA awareness is essential.
  o NUMA support arrives in Juno; we have not tested it yet.
o SR-IOV is still not the whole answer.
  o M(ulti)R-IOV (Dell?).
o Locality-aware scheduling.
  o Not there yet; NCI is working on it (a small guest-side topology check is sketched after this slide).
o Our benchmarks were hindered by the QPI performance of Sandy Bridge.
  o Mellanox suggested it is better in later versions of the Intel architecture.
o Lustre.
  o Security model (Manila?).
o Single-node performance is on par with bare metal.
  o No surprise here; it was the case five years ago as well.
o NCI plans to move to bare-metal provisioning for parallel applications.
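As a small illustration of the locality and NUMA points above, the sketch below is one possible check (an assumed approach, not an NCI tool): it prints the process's CPU affinity and the NUMA topology exposed inside a guest, which quickly shows whether the hypervisor passes through CPU pinning and a real NUMA layout to the VM.

/* Quick guest-side check of CPU affinity and NUMA topology, as one way to
 * see what the hypervisor actually exposes to a VM (illustrative sketch).
 * Build: gcc -O2 numa_check.c -o numa_check -lnuma
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <numa.h>

int main(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_CONF);

    /* Which CPUs is this process allowed to run on? With CPU pinning,
     * this set is a small, fixed subset of the visible cores. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    printf("CPUs visible: %ld, allowed:", ncpus);
    for (long c = 0; c < ncpus; c++)
        if (CPU_ISSET(c, &mask))
            printf(" %ld", c);
    printf("\n");

    /* Does the guest see a real NUMA topology, or one flat node? */
    if (numa_available() < 0) {
        printf("libnuma: NUMA not available in this guest\n");
        return 0;
    }
    printf("NUMA nodes exposed: %d\n", numa_num_configured_nodes());
    for (long c = 0; c < ncpus; c++)
        printf("  cpu %ld -> node %d\n", c, numa_node_of_cpu((int)c));
    return 0;
}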
Partnership with Dell
o Best cost/performance ratio.
o Fastest research cloud in Australia.
o Robust hardware.
o Four cloud clusters in production, all Dell.
o Excellent support.