Disruptive Strategies for Petabyte-scale Storage

Fernando J. Pineda
Dept. of Molecular Microbiology & Immunology
Dept. of Biostatistics
Director, Joint High Performance Computing Exchange
Johns Hopkins Bloomberg School of Public Health
[email protected]
www.pinedalab.org

"Flower Bomber" by Banksy

CASC Spring meeting, Arlington VA, March 30, 2015


Scale of our operation
•  The environment
   –  400 active NIH-funded users, from the JHU Schools of Public Health & Medicine
   –  2 FTEs (no formal application support)
   –  $400K/yr operating budget (mostly salaries)
   –  Mostly embarrassingly parallel applications in Genomics & Biostatistics
   –  Mostly batch jobs, ssh access only
   –  Business model is a Condo model with recovery of operating costs
   –  Most of the hardware purchased by Stakeholders
•  The Cluster
   –  2,500 cores and 20 TB RAM
   –  400 TB "enterprise" storage appliances (mostly ZFS)
   –  688 TB NFS-exported ZFS, green Dirt Cheap Storage (DCS)
   –  1.4 PB green Dirt Cheap Lustre (DCL)
   –  Figure of merit: 1 TB storage/core
•  The strategies we will discuss work in a genomics/biomedical research environment at Johns Hopkins. Your mileage may vary.


Motivation
1.  We at least double our storage capacity every year (since 2007)
2.  We can't afford Enterprise storage solutions
3.  What the stakeholders tell us about BIG data storage
   –  An extra week/year of up-time is in the noise.
   –  If an extra week of up-time doubles the cost of the storage, then they would rather spend that money on reagents.
   –  They want MORE storage, NOT faster storage.
   This is exactly the opposite of what SysAdmins and enterprise storage vendors are telling the decision makers.
4.  We observe that storage looks like linux clusters 15-20 years ago: (Commodity hardware) + (open-source system software) = disruption
5.  We observe that the storage software stack is no harder to administer than linux -- just different.
6.  We were highly motivated to develop storage systems that met OUR research requirements, NOT Enterprise requirements.


Risk & mitigation
•  Conflicting opinions on "DIY"
   –  100% of Vendors, Blogs/pundits, SysAdmins and bean counters all agree on one thing: You are crazy.
   –  100% of PIs AND the Director agree: Makes sense, let's try it.
•  Talk to people who have done this before, e.g.
   –  JHU/Physics & Astronomy (Alainna White, Alex Szalay)
   –  JHU/HLTCOE (Scott Roberts)
   –  Buck Institute (Ari Berman, now at BioTeam)
•  Hardware: Select experienced system integrators
   –  Don't try to build it yourself
   –  Let the integrator do the "burn in" at their site
   –  For DCS we used the integrator of the JHU Data-Scope storage (Seneca)
   –  For DCL we used Silicon Mechanics


green Dirt Cheap Storage (DCS)
A first effort completed in 2013
An NFS-exported ZFS file system
•  ZFS on linux file system
•  Low-power "consumer" NAS drives
•  Crowd funded
•  688 TB (RAIDz2 formatted)
•  1/2 to 2/3 the cost of anything available from appliance vendors
•  1/2 the power of anything available from appliance vendors
•  Stepping stone to bigger and better things

Technology: ZFS
•  ZFS designed for unreliable drives... remember?
   –  ZFS repairs some errors that would be flagged as a drive failure in conventional RAID arrays, so fewer drives get failed
   –  With a lower failure rate, we can accept longer rebuild times
   –  When a drive does actually fail, the rebuild time depends on the amount of data on the drive, not the size of the drive
•  Caching and logging
   –  Makes random i/o look more like streaming
   –  More streaming => less head motion => less power & less wear-and-tear on the drives
•  Why ZFS on linux in 2013 instead of more mature BSD or Solaris?
   –  Good timing: the first stable release of ZFS on linux was March 2013
   –  Lustre-on-ZFS was a glimmer in our eye in 2013
•  RAIDz2 zpools striped across 8 JBODs. Tolerates failure of 2 JBODs without data loss. (A back-of-the-envelope capacity check follows below.)
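
As a rough sanity check on that layout, the Python sketch below estimates usable capacity for RAIDz2 vdevs striped one drive per JBOD across the 8 JBODs. The 8-wide vdev geometry and the per-vdev drive placement are assumptions chosen to be consistent with the "tolerates 2 JBOD failures" claim; the deck itself only states the JBOD count, the drive purchase, and the 688 TB formatted total, so the gap between this estimate and 688 TB would be ZFS metadata, reservations, and TB-vs-TiB accounting.

```python
# Back-of-the-envelope capacity estimate for the DCS layout described above.
# ASSUMPTIONS (not stated in the talk): 360 x 3 TB drives (the purchase on the
# next slide), RAIDz2 vdevs that are 8 drives wide with one drive per JBOD,
# which is a layout that survives the loss of any 2 JBODs.

DRIVES     = 360          # assumed fleet size
DRIVE_TB   = 3            # raw TB per drive
JBODS      = 8            # from the slide
VDEV_WIDTH = JBODS        # one drive per JBOD per vdev (assumption)
PARITY     = 2            # RAIDz2

vdevs       = DRIVES // VDEV_WIDTH              # 45 vdevs
data_drives = vdevs * (VDEV_WIDTH - PARITY)     # 270 data drives
raw_tb      = DRIVES * DRIVE_TB                 # 1080 TB raw
usable_tb   = data_drives * DRIVE_TB            # 810 TB before overhead
usable_tib  = usable_tb * 1e12 / 2**40          # ~737 TiB

print(f"raw: {raw_tb} TB, usable (pre-overhead): {usable_tb} TB "
      f"(~{usable_tib:.0f} TiB)")
# The deck reports 688 TB formatted; the remaining gap is filesystem
# overhead and how "TB" is counted.
```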

Technology: Small-NAS drives (new in 2013)
•  Western Digital 3TB Red drives
   –  Designed for small NAS applications (e.g. up to 8-drive RAID)
   –  Designed for 24/7 operation
   –  Vibration control
   –  5400 RPM
   –  Good sequential-access performance*
   –  Sucky random-access performance*
   –  Low power (5 W/drive, 1.8 W/raw-TB)
   –  SATA interface
   –  Cheap! $165/3TB at the time ($100-$130 as of Mar. 2015)
•  We took the plunge and bought 360 of them.

* http://www.storagereview.com/western_digital_red_nas_hard_drive_review_wd30efrx
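
Scaling the per-drive figures above to the 360-drive purchase gives a feel for the totals involved. This is bare-drive arithmetic only (chassis, servers and networking are extra), using the prices and wattage quoted on this slide:

```python
# Fleet-level arithmetic for the 360-drive WD Red purchase,
# using the per-drive figures quoted above (2013 pricing).
DRIVES          = 360
WATTS_PER_DRIVE = 5
TB_PER_DRIVE    = 3
PRICE_2013      = 165   # $/drive at purchase time

print(f"raw capacity : {DRIVES * TB_PER_DRIVE} TB")            # 1080 TB
print(f"drive power  : {DRIVES * WATTS_PER_DRIVE / 1000} kW")  # 1.8 kW, drives only
print(f"drive cost   : ${DRIVES * PRICE_2013:,}")              # $59,400, drives only
```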

Technology: 3TB WD Red Drives
•  < 1% failure rate in first year of operation
   –  After preproduction burn-in
   –  Our experience is consistent with, if not better than, the HD reliability analysis conducted by Backblaze of 27,134 consumer-grade drives*
•  Reliability is as good as our "Enterprise" appliances.
•  Typical RAIDz2 rebuild time is ~12 hours.
•  Proved the concept of ZFS + small NAS drives. We need not pay a premium for "Enterprise" drives.

* http://blog.backblaze.com/2014/01/21/what-hard-drive-should-i-buy/


Crowd funding
•  Build it WHEN they come
   –  PIs get a "prospectus" and express interest in an allocation based on projected costs.
   –  Two components to cost: 1) projected buy-in and 2) a quarterly fee to cover operating costs & additional capital expenses.
   –  Iterate requested allocations and buy-in as design and costs sharpen up.
   –  No commitment from PIs until we actually pull the trigger.
•  Resources wind up being very well matched to the needs of PIs.
•  We escape from the yearly centralized planning cycle and competition for a cut of the institutional capital budget (F&A).
•  Makes the Dean happy by minimizing the load on the school's limited capital budgets.
•  But ... to stay out of Leavenworth we must satisfy OMB regulations, e.g. don't mix money with different colors. So we sent 3 POs to the vendor:
   –  Capital equipment (5-year recovery from fees): $56,590
   –  345 drives (PI sponsored budgets, i.e. grants): $54,878
   –  33 drives (PI non-sponsored budgets): $5,249

DCS01 vs deeply discounted ZFS enterprise storage appliance from a major vendor (2013)

                        ZFS appliance    DCS01
  Cost                  $161,876         $162,171
  Raw TB                492              1,080
  Formatted TB          394              670
  $/raw-TB              $329             $150*
  $/formatted-TB        $411             $242
  Power dissipation     4 kW (?)         3.5 kW
  Watts/formatted-TB    10 W/TB (?)      5.2 W/TB  Green!

  * $108/raw-TB exclusive of development costs
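
The per-TB rows in that table follow directly from the cost, capacity and power rows; a small check of the arithmetic:

```python
# Re-derive the per-TB figures in the comparison table from its raw rows.
systems = {
    # name            (cost $,  raw TB, formatted TB, power W)
    "ZFS appliance": (161_876, 492,    394,          4000),  # power is the slide's rough "4 kW (?)"
    "DCS01":         (162_171, 1080,   670,          3500),
}

for name, (cost, raw_tb, fmt_tb, watts) in systems.items():
    print(f"{name:14s}  ${cost / raw_tb:.0f}/raw-TB  "
          f"${cost / fmt_tb:.0f}/formatted-TB  "
          f"{watts / fmt_tb:.1f} W/formatted-TB")
# -> roughly $329 / $411 and 10 W/TB for the appliance,
#    $150 / $242 and 5.2 W/TB for DCS01, matching the table.
```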

Motivation for yet another storage system
•  Our DCS (based on NFS/ZFS) is not scale-out and slows to a crawl now-and-then when someone hammers it with hundreds of i/o-intensive processes.
•  But Enterprise High Performance File systems are even more expensive than enterprise NFS storage devices. Example scale-out Petabyte systems:
   –  IBM GPFS
   –  Lustre
   –  NetApp
•  We don't need crazy-fast performance. We don't need 99.999% uptime. We just need significantly better performance than NFS at a reasonable price and a reasonable power footprint.


Lustre
•  Lustre is used on the majority of the top 500 supercomputers, mostly as fast parallel scratch space for streaming large files.
•  Growing acceptance as a scale-out general-purpose file system.
•  The Lustre road map is very attractive for Genomics:
   –  Open source
   –  Hierarchical Storage Management (HSM) as of v2.5
   –  Small (few KB) files to be incorporated into the metadata server
   –  LLNL developed Lustre-on-ZFS-on-linux. Allows us to leverage ZFS's ability to exploit low-cost, low-power drives.


Disruptive thoughts
•  Lustre, on ZFS, on linux, on dirt cheap hardware
•  Implications of SATA instead of SAS disk drives:
   –  have to accept manual fail-over
   –  give up multipath performance
   –  Hmmm. Do I care, as long as I don't give up data integrity?
•  Expect: a low-cost, low-power, scale-out, high-speed general-purpose distributed file system that outperforms NFS.


Lustre system overview (BioTeam/PinedaLab)
[Architecture diagram. Recoverable details: a 10 Gbps cluster fabric with 160 Gbps to the switch; 2 x object store servers (OSS1/OSS2) at 80 Gbps each, each serving an OST of 225 x 4 TB 5400 RPM drives; metadata servers MDS1/MDS2 at 40 Gbps each, with MDTs on 24 x 300 GB 10,000 RPM drives; room left for future expansion. Software stack: Linux RHEL 6.5, ZFS 0.6.3, Lustre 2.5.32, IEEL 2.2.0.0.]
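
A quick consistency check between the overview hardware and the 1.4 PB DCL figure quoted elsewhere in the talk (reading "1.4 PB" as 1400 TB). The implied raw-to-usable efficiency, rather than any particular vdev geometry, is all this computes:

```python
# Consistency check between the overview hardware and the 1.4 PB DCL figure.
OSS_COUNT      = 2
DRIVES_PER_OST = 225
DRIVE_TB       = 4

raw_tb           = OSS_COUNT * DRIVES_PER_OST * DRIVE_TB   # 1800 TB raw
quoted_usable_tb = 1400                                     # "1.4 PB" from the talk

efficiency = quoted_usable_tb / raw_tb
print(f"raw: {raw_tb} TB, quoted usable: {quoted_usable_tb} TB "
      f"({efficiency:.0%} of raw)")
# ~78% of raw is in the right range for RAIDz2 vdevs plus ZFS/Lustre
# overhead, i.e. the quoted 1.4 PB is consistent with 2 x 225 x 4 TB OSTs.
```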

Lustre went live this morning!
•  Hoping to be in production by now, but we stalled for three months in procurement (22 budgets from 3 JHU divisions/affiliates and 5 POs). We'll do procurement differently next time!
•  ZFS and network tested/optimized last week
   –  The disk speed (5400 RPM) will not be the bottleneck
•  This morning: first-time-ever lustre performance test
   –  100 workers, each with 4 concurrent I/O processes
   –  >80% of flows in the hundreds of MB/s
   –  In aggregate pushing 24 GB/s read, 11.4 GB/s write
   –  All tests were within the OSS's; still need to test with real clients. Anticipate performance 10-30% lower in real life, limited by the network.
•  We may indeed approach the performance of expensive boutique distributed parallel storage for a fraction of the cost.
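
For readers who want to run that kind of first-light measurement themselves, here is a minimal sketch of an aggregate write-throughput test in the same spirit (many workers, each streaming data and timing it). It is not the tool used in the talk; the mount point, worker count, and file size are illustrative assumptions.

```python
# Minimal aggregate write-throughput sketch (NOT the benchmark used in the
# talk). Point TARGET_DIR at a scratch directory on the file system under
# test; worker count and file size are illustrative.
import os
import time
from multiprocessing import Pool

TARGET_DIR = "/mnt/lustre/benchtmp"   # hypothetical mount point
N_WORKERS  = 8                        # the talk used 100 workers x 4 streams
FILE_MIB   = 256                      # data written per worker
BLOCK      = 1024 * 1024              # 1 MiB writes

def write_stream(worker_id):
    """Write FILE_MIB of data, fsync, and return (bytes, seconds)."""
    path = os.path.join(TARGET_DIR, f"stream_{worker_id}.dat")
    buf = os.urandom(BLOCK)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(FILE_MIB):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())          # make sure the data actually lands
    return FILE_MIB * BLOCK, time.time() - start

if __name__ == "__main__":
    os.makedirs(TARGET_DIR, exist_ok=True)
    t0 = time.time()
    with Pool(N_WORKERS) as pool:
        results = pool.map(write_stream, range(N_WORKERS))
    wall = time.time() - t0
    total_bytes = sum(nbytes for nbytes, _ in results)
    print(f"aggregate write: {total_bytes / wall / 1e9:.2f} GB/s "
          f"across {N_WORKERS} streams")
```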

Cost?
•  $165,527 for 1.4 PB
•  Value proposition to users:
   –  $90/TB up-front buy-in
   –  < $50/TB-year fees (mostly salaries)
•  Next 700 TB increment ~$65K
•  Guess that we save 3.5 kW per OSS-OST => @ $0.11/kW-hr we expect to save $3.4K per OSS-OST per year. Or, for both OSS-OSTs and a PUE of 1.5, it's more than $10K saved per year. (Arithmetic below.)
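
The savings estimate in the last bullet is straightforward to reproduce; a worked version of that arithmetic, using only the figures from the slide (3.5 kW saved per OSS-OST, $0.11/kWh, PUE of 1.5):

```python
# Worked version of the power-savings estimate on the Cost? slide.
SAVED_KW_PER_OSS = 3.5        # guessed saving per OSS-OST (from the slide)
RATE             = 0.11       # $ per kWh
HOURS_PER_YEAR   = 24 * 365   # 8760
PUE              = 1.5        # data-center overhead multiplier
OSS_COUNT        = 2

per_oss = SAVED_KW_PER_OSS * HOURS_PER_YEAR * RATE
both    = per_oss * OSS_COUNT * PUE
print(f"per OSS-OST: ${per_oss:,.0f}/yr")       # ~$3,373  (the slide's ~$3.4K)
print(f"both, with PUE 1.5: ${both:,.0f}/yr")   # ~$10,118 (the slide's >$10K)
```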

Take aways
•  Wisdom of the crowd: hardware purchases are completely driven by PI demand. The purchases say > 1 TB mass storage/core. Storage business as usual is not sustainable.
•  Vendors are not selling the systems that we need.
•  Disk manufacturers are not providing the right disks. We need something like a WD Red 5400 RPM with a SAS instead of a SATA interface. "They" could build it cheaply. But "they" are not providing it ... To the ramparts!!!!
•  Storage technology looks like linux clusters 15-20 years ago: (Commodity hardware) + (open-source system software) = disruption
•  If we can do it, you can do it. Let's disrupt the storage industry!


Acknowledgements
Mark Miller (Pineda Lab)
Marvin Newhouse (retired, Pineda Lab)
Jiong Yang (Biostatistics)
Ari Berman
Aaron Gardner
High Performance Data Division, for assistance with Lustre