Big Data and The Network: A Genomics Case Study
Chris Konger

Discussion Outline
• Introduction & Preliminary Questions
• The Quick Version (for non-Techs)
• The Network
• The Researcher
• The Goal
• Bottlenecks on Bottlenecks on Bottlenecks
• Multitude of Possible Fixes
• Results from Last Summer/Fall
• Where we are headed
• Questions and Answers

The Quick Version (for non-Techs)
• Big Data may not be so big to you (only 12 TB)
• Original transfers were taking 8 days
• Analyses were being done in a destructive way (i.e., if something went wrong with a computation, the entire data set was downloaded again!)
• Current transfers take 5-10 hours
• We would like to get that down to 7-15 minutes
• Researchers have MUCH more flexibility when testing different ideas and algorithms (i.e., making a mistake no longer wastes a week or more). A back-of-the-envelope look at what these transfer times imply follows the network overview below.

The Network (for Techies)
• CC-NIE grant to connect 20 buildings at 40 Gbps
• Experimental network running OpenFlow
• Big Switch controller
• Mix of whitebox switch vendors
• Connected to Clemson's Brocade MLXe-32 (which is also where the Palmetto cluster connects)
• The MLXe-32 connects to I2 AL2S at 100G
• Identify researchers who would connect their workstations and test gear with 10G NICs

Let's play "Spot the issues with the following design!"
[Diagram: the CC-NIE Science DMZ design, showing perfSONAR nodes, Internet via C-Light, I2 AL2S at 100G, the Palmetto cluster behind the Brocade MLXe, and Dell S4810/Z9000 switches fanning out over 40G-LR uplinks to 10G-SR drops in roughly twenty campus buildings (ASC, Kinard, Poole, EIB, Sirrine, and others).]

High-Level Overview of CU Brocade Connectivity
[Diagram: the Brocade's paths to I2 AL2S (100G), legacy I2/Internet, regional and national peers (SoX/ATLA, CENIC, UEN/Utah, FRGP, CIC OmniPOP, NoX, OARnet, PNWGP), NIH/NCBI, GENI/FOAM, perfSONAR pairs, and the campus Science DMZ side (research servers, Palmetto, the GENI rack, an OrangeFS testbed, and the Poole/ITC borders). Notes: paths through AL2S are mapped with the OESS GUI; GENI is being migrated off ION to new mappings via FOAM; perfSONAR01 is being migrated from the Brocade to the T1600 so there will be dedicated pairs facing the Internet and AL2S; the default path out of the Brocade is through C-Light at 10G; non-RFC1918 traffic is statically routed across AL2S by destination, while traffic from private (RFC1918) sources is policy-routed through the campus firewalls, which translate to "real" addresses. Original by C. Konger 04-Mar-2014; revised 07-Apr-2015.]
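To put the Quick Version numbers in network terms, here is a small back-of-the-envelope sketch (mine, not part of the original talk) that converts each quoted transfer time for the 12 TB data set into the sustained throughput it implies. The data size and times come from the slides; decimal terabytes are assumed and protocol overhead is ignored.

    # Back-of-the-envelope: sustained throughput implied by each quoted
    # transfer time for the ~12 TB NCBI data set (sizes/times from the slides).

    DATA_SET_BYTES = 12e12  # 12 TB, decimal terabytes assumed

    scenarios = {
        "original transfer (8 days)":    8 * 24 * 3600,
        "current transfer (10 hours)":   10 * 3600,
        "current transfer (5 hours)":    5 * 3600,
        "target transfer (15 minutes)":  15 * 60,
        "target transfer (7 minutes)":   7 * 60,
    }

    for label, seconds in scenarios.items():
        gbps = DATA_SET_BYTES * 8 / seconds / 1e9  # bits per second -> Gbps
        print(f"{label:30s} ~{gbps:7.2f} Gbps sustained")

    # Rough output: the 8-day baseline works out to ~0.14 Gbps, the current
    # 5-10 hour window to roughly 2.7-5.3 Gbps, and the 7-15 minute target
    # to on the order of 100-230 Gbps aggregate, i.e., beyond any single
    # 10G host and into many-to-many DTN territory.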
The Researcher
• Alex Feltus, Associate Professor of Genetics and Biochemistry
• Looking for snippets of code scattered across the maize/corn genome (among others)
• Initially was not that interested … "What do I need to sign to make you go away?"
• The original discussion gained traction when the emphasis changed from net-speak "Gbps" to "What if we could reduce your transfers, which take over a week, to just a few hours – would that help with your workflows?"

The Researcher (cont'd)
• Alex has since become a HUGE proponent of the advantages of being able to move data quickly.
• He is developing some really creative ways of easily moving large data sets from point A to point B. I don't want to "steal his thunder" (he will be presenting his results at the upcoming I2 Global Summit in DC).
• Getting decent results took a much greater time investment than anyone expected. There were numerous issues and surprises.

The Goal
• Read 12 terabytes (TB) of genetic info from NCBI
• The NCBI (NIH) database was ~3 petabytes (PB)
• Sequence Read Archives (SRAs)
• Each file in its own subdirectory
• 7,535 files ranging from 20 MB to 61 GB
• Average file size 1.63 GB
• A flat (text) file lists the path and file name for each SRA, one line per file to retrieve

Bottlenecks on Bottlenecks on Bottlenecks
• Alex's server "Tikal2" was high powered, with a 10G network card and a huge storage array, but the latter could sustain continuous writes of only 120 MB/s (< 1 Gbps)
• Transfers were then shifted to a Palmetto data transfer node (DTN). Surprisingly, it didn't do much better (even after "TCP tuning")
• There were MTU mismatch issues (Cisco-Juniper)
• Scripts invoked the transfers sequentially/serially (see the parallel-transfer sketch after the Fixes slides)
• Transfer applications had very small buffers
• The storage subsystem had to be tweaked
• Even parallel transfers didn't give the expected results
• The NCBI/NIH side had throughput issues as well
• System calls (to the kernel) couldn't keep up

Multitude of Possible Fixes
• Applications were recompiled with bigger buffers
• System (kernel) buffering of streams was disabled
• The Storage group modified SAM-QFS settings, optimizing for the files in question (difficult due to the wide range of sizes)
• Memory-to-memory transfers were closer to expectations; whenever storage was accessed, the results were miserable
• DTrace utilities were used to identify the kernel issues. It was found that breaking the passed chunk down into linked chains (nodes) involved inefficient copying and was quite slow

Multitude of Possible Fixes (cont'd)
• Parallel transfers involving many nodes to one (and subsequently many-to-many) were evaluated
• NIH/NCBI's preferred vendor, Aspera, was pulled in
• Collaborator University of Utah was asked to test using their brand-new servers with 40G network cards. The results weren't much better than 10G!
• Academic papers started appearing about the performance hit of the kernel making multiple copies in memory on the way to storage. Aspera noticed one where a theoretical throughput of 5 Gbps from a 10 Gbps link was expected (perhaps the Luigi Rizzo paper, http://luca.ntop.org/10g.pdf ?)

Multitude of Possible Fixes (cont'd)
• Aspera worked with the Intel Data Plane Development Kit (DPDK) to bypass the kernel, keeping everything in user space with a single memory copy from the application to the I/O subsystem.
• Cache implications: "chunks" of data had to be sized to fit within the L3 cache (~35 MB); otherwise the processor would get busy with L3 <-> RAM shuffling, with a resulting performance hit.
• Aspera was able to achieve 80G, but with LOCAL STORAGE (not typical for most data centers). The user-space flow was NIC -> L3 cache -> PCIe -> controller -> SSDs.
• Aspera presented these results at the last I2 Indy gathering.
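The cache-sizing point can be illustrated without DPDK. The sketch below (mine, not Aspera's code) copies a file in fixed-size chunks and simply keeps the working buffer comfortably under the ~35 MB L3 cache mentioned above; the 32 MB chunk size and the file paths are illustrative assumptions.

    # Illustrative only: chunked, unbuffered copy with the working buffer sized
    # below the ~35 MB L3 cache cited in the talk, so the hot data can stay
    # cache-resident. Chunk size and paths are assumptions, not Aspera's code.

    L3_CACHE_BYTES = 35 * 1024 * 1024   # ~35 MB L3 cache (from the slides)
    CHUNK_BYTES = 32 * 1024 * 1024      # stay a bit under the cache size

    def cache_friendly_copy(src_path: str, dst_path: str) -> None:
        """Copy src to dst in chunks small enough to fit in L3 cache."""
        assert CHUNK_BYTES < L3_CACHE_BYTES
        with open(src_path, "rb", buffering=0) as src, \
             open(dst_path, "wb", buffering=0) as dst:
            while True:
                chunk = src.read(CHUNK_BYTES)
                if not chunk:
                    break
                dst.write(chunk)

    if __name__ == "__main__":
        # Hypothetical paths, for demonstration only.
        cache_friendly_copy("/tmp/sample.sra", "/tmp/sample.copy.sra")

Real kernel-bypass stacks go much further than this, but the chunk-sizing principle is the same.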
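As promised under the Bottlenecks slide, here is a minimal sketch of the serial-to-parallel change behind the one-to-one parallel numbers in the next table: read the flat file of SRA paths and fetch several files at once instead of one at a time, using a large application-side buffer. The list-file format, base URL, and worker count are assumptions for illustration; the actual workflow used a recompiled cURL-based transfer application and, later, Aspera.

    # Minimal sketch: parallel downloads driven by the flat file of SRA paths
    # (one line per file, as described on "The Goal" slide). The base URL,
    # list location, and worker count below are illustrative assumptions.

    import concurrent.futures
    import os
    import urllib.request

    BASE_URL = "https://example-sra-mirror.org"   # hypothetical source
    LIST_FILE = "sra_file_list.txt"               # one relative path per line
    DEST_DIR = "downloads"
    WORKERS = 8                                   # parallel streams in flight
    CHUNK_BYTES = 8 * 1024 * 1024                 # large application buffer

    def fetch(rel_path: str) -> str:
        """Download one SRA file in large chunks, mirroring its subdirectory."""
        dest = os.path.join(DEST_DIR, rel_path)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with urllib.request.urlopen(f"{BASE_URL}/{rel_path}") as resp, \
             open(dest, "wb") as out:
            while True:
                chunk = resp.read(CHUNK_BYTES)
                if not chunk:
                    break
                out.write(chunk)
        return rel_path

    if __name__ == "__main__":
        with open(LIST_FILE) as f:
            paths = [line.strip() for line in f if line.strip()]
        # Thread pool: several transfers in flight instead of one at a time.
        with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
            for done in pool.map(fetch, paths):
                print("finished", done)

Pointing different workers at different source hosts is, in effect, the "many-to-one" step shown in the results table.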
Results from Last Summer/Fall (stage, throughput achieved, notes)
• Tuned Workstation: 980 Mbps (single hard drive)
• Data Center Transfer: 1.40 Gbps (RAID storage)
• Custom File Transfer Application: 2.00 Gbps (larger application buffer and disabled I/O buffering; modified in a cURL recompile)
• File System Tuning: 2.50 Gbps (RAID parameter tuning)
• One-to-One Parallel File Transfer: 3.89 Gbps (multiple files downloaded simultaneously)
• Many-to-One Parallel File Transfer: 5.10 Gbps (multiple FTP servers in use)
• Aspera Client: 7.50 Gbps (memory to memory)

Where we are headed …
• NIH and Clemson are rebuilding DTNs and storage (for both commodity low-cost and high-speed transfers).
• Utah is reviewing its ongoing upgrade/deployment.
• Clemson has sought advice from commercial storage vendors on a product that could support much higher throughput from DTN to DTN. Utah has volunteered to test with Clemson. NIH is interested too, after they are finished with their rebuild.
• The bottleneck of local workstations/servers with slow arrays MAY be resolved in the near future as well (i.e., Non-Volatile Memory Express, aka NVMe, with SSDs capable of 2-3 GB/s continuous writes).
• Implications of this traffic on the I2 AL2S backbones

Questions and Answers