Big Data and the Network - A Genomics Case Study

Chris Konger

Discussion Outline
•  Introduction & Preliminary Questions
•  The Quick Version (for non-Techs)
•  The Network
•  The Researcher
•  The Goal
•  Bottlenecks on Bottlenecks on Bottlenecks
•  Multitude of Possible Fixes
•  Results from Last Summer-Fall
•  Where we are headed
•  Questions and Answers

The Quick Version (for non-Techs)
•  Big Data may not be so big to you (only 12 TB)
•  Original transfers were taking 8 days
•  Analyses were being done in a destructive way (i.e., if something went wrong with a computation, the entire data set was downloaded again!)
•  Current transfers take from 5-10 hours
•  We would like to get that down to 7-15 minutes (see the quick rate check after this list)
•  Researchers have MUCH more flexibility to test different ideas and algorithms (i.e., making a mistake no longer wastes a week or more).
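Those times translate into average rates as follows. This is my arithmetic, not a figure from the deck, assuming 12 TB means 12x10^12 bytes and ignoring protocol overhead:

# Rough effective throughput implied by the transfer times above
# (assumes 12 TB = 12e12 bytes; real transfers also carry protocol overhead).
DATA_BITS = 12e12 * 8  # 12 TB expressed in bits

def gbps(seconds):
    """Average rate in Gbps needed to move 12 TB in the given time."""
    return DATA_BITS / seconds / 1e9

print(f"8 days     -> {gbps(8 * 24 * 3600):6.2f} Gbps")  # ~0.14 Gbps
print(f"10 hours   -> {gbps(10 * 3600):6.2f} Gbps")      # ~2.7 Gbps
print(f"5 hours    -> {gbps(5 * 3600):6.2f} Gbps")       # ~5.3 Gbps
print(f"15 minutes -> {gbps(15 * 60):6.2f} Gbps")        # ~107 Gbps
print(f"7 minutes  -> {gbps(7 * 60):6.2f} Gbps")         # ~229 Gbps

In other words, the 7-15 minute target only pencils out once the 100G AL2S path, fast storage on both ends, and an efficient transfer stack are all in play.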
The Network (for Techies)
•  CC-NIE grant to connect 20 buildings at 40 Gbps
•  Experimental network running OpenFlow with a Big Switch controller
•  Mix of whitebox switch vendors
•  Connected to Clemson's Brocade MLXe-32 (which is also where the Palmetto cluster connects)
•  The MLXe-32 connects to I2 AL2S at 100G
•  Identify researchers who would connect their workstations and test gear with 10G NICs
Let's play "Spot the issues w/ the following design!"

[Network diagram: the CC-NIE / Science DMZ design. Shown: perfSONAR nodes, Internet via CLight, I2 AL2S at 100 gig, the Palmetto cluster, the Brocade MLXe (Palmetto core), S4810 and Z9000 switches, the campus network, and the connected campus buildings (ASC, Martin, Kinard, Poole, Sirrine, EIB, Hunter, Brackett, Rhodes, LSB, Brown Room, STI, Lee, Biotech, McAdams, AMRL, Daniel, Jordan, Riggs, Barre, Earl), each with a 10.19.x.x address, linked by a mix of 40 gig LR and 10 gig SR connections.]
High-Level Overview of CU Brocade Connectivity

[Network diagram: on the Clemson campus side, the ITC and Poole borders, ITC VSS and Poole VSS, Palmetto, the EIB GENI rack, SciDMZ research servers at 10G (40G²), OrangeFS testing at 40G, and perfSONAR 01⁴|02 and 03|04 nodes at 10G each connect through the Brocade and the CLight T1600 to the legacy I2/Internet and to I2 AL2S¹ at 100G. AL2S connectivity is shown to PNWGP (UW², Wa St²), FRGP (Colorado²), CIC OmniPOP (Wisc), OARnet (Ohio), NoX (Harvard²), NCBI/NIH, CENIC (USC; Hawaii² via TL/PW over UW/CENIC), UEN (Utah), SoX (ATLA), Vandy², ION-ATL, and GENI³ via FOAM, through AL2S nodes including SEAT, DENV, CHIC, CLEV, BOST, ASHB, LOSA, SALT, and HOUH.]

The default pathway out of the Brocade is through CLight. Static routes send traffic [non-RFC1918] across AL2S, based on destination hosts/subnets. Traffic from private (RFC1918) sources is policy-routed through campus firewalls, which do the necessary translation to "real" addresses.

Notes:
 ¹ Paths through AL2S are mapped with the OESS GUI
 ² Upcoming/Planned or Preliminary Discussions Have Occurred
 ³ GENI being migrated off ION (to new mappings via FOAM)
 ⁴ perfSONAR01 is being migrated from Brocade to T1600, so there will be dedicated pairs facing Internet and AL2S

Original by CKonger 04-Mar-2014 20:00 ET
Revised by CKonger 07-Apr-2015 17:00 ET
The Researcher
•  Alex Feltus, Associate Professor of Genetics and Biochemistry
•  Looking for snippets of code scattered across the maize/corn genome (among others)
•  Initially was not that interested … "what do I need to sign to make you go away?"
•  The original discussion gained traction when the emphasis changed from net-speak "Gbps" to "What if we could reduce your transfers, which take over a week, to just a few hours – would that help with your workflows?"

The Researcher (cont'd)
•  Alex has since become a HUGE proponent of the advantages of being able to move data quickly.
•  He is developing some really creative ways of easily moving large data sets from point A to point B. I don't want to "steal his thunder" (he will be presenting his results at the upcoming I2 Global Summit in DC).
•  Getting decent results took a much greater time investment than anyone expected. There were numerous issues and surprises.
The Goal
•  Read 12 Terabytes (TB) of genetic info from NCBI
•  The NCBI (NIH) database was ~3 Petabytes (PB)
•  Sequence Read Archives (SRAs)
•  Each file in its own subdirectory
•  7,535 files ranging from 20 MB to 61 GB
•  Average file size 1.63 GB
•  A flat file (text) lists the path and file for each SRA, with one line per file to retrieve (see the sketch after this list)
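The deck does not include the original retrieval scripts; the following is only a sketch of what a serial pull driven by that flat file could look like. The manifest name, directory layout, and base URL are illustrative placeholders, not the real NCBI endpoints.

# Minimal sketch of a serial fetch driven by the flat-file manifest.
# Assumptions (not from the slides): one "path/file" entry per line and an
# HTTPS-reachable base URL; the real SRA layout and transfer tool may differ.
import os
import urllib.request

BASE_URL = "https://example-sra-mirror.example.org"  # placeholder, not the real NCBI host
MANIFEST = "sra_manifest.txt"                         # flat file: one path per line

with open(MANIFEST) as manifest:
    for line in manifest:
        rel_path = line.strip()
        if not rel_path:
            continue
        local_path = os.path.join("downloads", rel_path)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        # One file at a time: this one-at-a-time pattern is the kind of serial
        # transfer that contributed to the original ~8-day pull.
        urllib.request.urlretrieve(f"{BASE_URL}/{rel_path}", local_path)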
Bottlenecks on Bottlenecks on Bottlenecks
•  Alex's server "Tikal2" was high-powered, with a 10G network card and a huge storage array, but the latter could sustain continuous writes of only 120 MB/s (< 1 Gbps)
•  Transfers were then shifted to a Palmetto data transfer node (DTN). Surprisingly, it didn't do much better (even after "TCP tuning")
•  There were MTU mismatch issues (Cisco-Juniper)
•  Scripts invoked the transfers sequentially/serially
•  Transfer applications had very small buffers (see the sketch after this list)
•  The storage subsystem had to be tweaked
•  Even parallel transfers didn't give the expected results
•  The NCBI/NIH side had throughput issues as well
•  System calls (to the kernel) couldn't keep up
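"TCP tuning" and "very small buffers" cover two layers: the kernel's socket limits on the host and the buffer sizes the transfer tool actually requests. A minimal illustration of the application side; the 32 MB request is an arbitrary example, not a value used in the project.

# Illustration of requesting larger socket buffers from an application.
# On Linux the kernel caps these at net.core.rmem_max / wmem_max, which is
# why host-level "TCP tuning" has to happen before this makes any difference.
import socket

BUF_BYTES = 32 * 1024 * 1024  # example request; not a project value

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_BYTES)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_BYTES)

# Read back what the kernel actually granted (Linux reports roughly twice the
# bookkeeping-adjusted value, so this will not match the request exactly).
print("effective rcvbuf:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
print("effective sndbuf:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))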
Multitude of Possible Fixes
•  Applications were recompiled with bigger buffers
•  System (kernel) buffering of streams was disabled
•  The Storage Group modified SAM-QFS settings, optimizing for the files in question (difficult due to the wide range of sizes)
•  Memory-to-memory transfers were closer to expectations; whenever storage was accessed, the results were miserable (a rough way to reproduce that comparison is sketched after this list)
•  DTrace utilities were used to identify the kernel issues. It was found that breaking the passed chunk down into linked chains (nodes) involved inefficient copying and was quite slow
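A rough way to see the memory-versus-storage gap on any node is to time the same byte stream into RAM and into the file system. This is only a local probe sketch (block and file sizes are arbitrary), not the methodology used in the project.

# Rough local benchmark: same byte stream into memory vs. onto disk.
# Numbers like these are what point the finger at the storage subsystem
# rather than the network path. Block/file sizes here are arbitrary choices.
import io
import os
import time

BLOCK = b"\0" * (4 * 1024 * 1024)   # 4 MiB per write
TOTAL_BLOCKS = 256                  # 1 GiB total

def time_writes(sink, fsync=False):
    """Return the achieved write rate in Gbps for TOTAL_BLOCKS writes of BLOCK."""
    start = time.perf_counter()
    for _ in range(TOTAL_BLOCKS):
        sink.write(BLOCK)
    sink.flush()
    if fsync:
        os.fsync(sink.fileno())      # force the data out of the page cache
    elapsed = time.perf_counter() - start
    return len(BLOCK) * TOTAL_BLOCKS * 8 / 1e9 / elapsed

with io.BytesIO() as mem_sink:
    print(f"memory-to-memory:   {time_writes(mem_sink):.2f} Gbps")

with open("throughput_probe.bin", "wb") as disk_sink:
    print(f"to the file system: {time_writes(disk_sink, fsync=True):.2f} Gbps")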
Multitude of Possible Fixes (cont'd)
•  Parallel transfers involving many nodes to one (and subsequently many-to-many) were evaluated (a minimal sketch of the parallel pattern follows this list)
•  NIH/NCBI's preferred vendor, Aspera, was pulled in
•  Collaborator University of Utah was asked to test using their brand-new servers with 40G network cards. The results weren't much better than 10G!
•  Academic papers started appearing about the performance hit of the kernel making multiple copies in memory on the way to storage. Aspera noticed one where a throughput of only about 5 Gbps was expected from a theoretical 10 Gbps (perhaps the Luigi Rizzo paper, http://luca.ntop.org/10g.pdf ?)
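Again, the actual parallel-transfer tooling is not shown in the deck; this is a minimal standard-library sketch of the one-to-one parallel pattern (several files in flight against a single source). Host, manifest, and worker count are illustrative; the many-to-one variant simply spreads the same list across several source servers.

# Minimal sketch of parallel ("several files in flight") retrieval, the
# parallel counterpart of the serial loop sketched earlier.
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://example-sra-mirror.example.org"  # placeholder host
MANIFEST = "sra_manifest.txt"                         # same flat-file manifest as before
WORKERS = 8  # number of simultaneous downloads; tune to the path and the storage

def fetch(rel_path):
    local_path = os.path.join("downloads", rel_path)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    urllib.request.urlretrieve(f"{BASE_URL}/{rel_path}", local_path)
    return rel_path

with open(MANIFEST) as manifest:
    paths = [line.strip() for line in manifest if line.strip()]

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    for done in pool.map(fetch, paths):
        print("finished", done)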
Multitude of Possible Fixes (cont'd)
•  Aspera worked with the Intel Data Plane Development Kit (DPDK) to bypass the kernel, keeping everything in user space with a single memory copy from the application to the I/O subsystem.
•  Cache implications: "chunks" of data had to be sized to fit within the L3 cache (~35 MB); otherwise the processor would get busy with L3<->RAM shuffling, with a resulting performance hit (see the sizing arithmetic below).
•  Aspera was able to achieve 80G, but with LOCAL STORAGE (not typical for most data centers). The user-space flow was NIC -> L3 Cache -> PCIe -> Controller -> SSDs.
•  Aspera presented these results at the last I2 Indy gathering.
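The cache constraint is easy to sanity-check with arithmetic. The ~35 MB L3 figure is from the slide; the buffer count and safety margin below are hypothetical knobs for illustration.

# Back-of-the-envelope chunk sizing for the kernel-bypass path described above.
L3_CACHE_BYTES = 35 * 1024 * 1024   # ~35 MB shared L3, per the slide
IN_FLIGHT_BUFFERS = 8               # hypothetical: chunks resident in cache at once
SAFETY_MARGIN = 0.75                # hypothetical: leave room for other cache users

chunk_bytes = int(L3_CACHE_BYTES * SAFETY_MARGIN / IN_FLIGHT_BUFFERS)
print(f"max chunk size: {chunk_bytes / 1024 / 1024:.1f} MiB per buffer")  # ~3.3 MiB
# Oversized chunks spill to RAM and force the L3<->RAM shuffling the slide
# warns about, which erases the benefit of bypassing the kernel.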
Results from Last Summer/Fall

Stage                                Throughput   Notes
Tuned workstation                    980 Mbps     Single hard drive
Data center transfer                 1.40 Gbps    RAID storage
Custom file transfer application     2.00 Gbps    Larger application buffer and disabled I/O buffering; modified in a cURL recompile
File system tuning                   2.50 Gbps    RAID parameter tuning
One-to-one parallel file transfer    3.89 Gbps    Multiple files downloaded simultaneously
Many-to-one parallel file transfer   5.10 Gbps    Multiple FTP servers in use
Aspera client                        7.50 Gbps    Memory to memory

Where we are headed …
•  NIH and Clemson are rebuilding DTNs and storage (for both commodity low-cost and high-speed transfers)
•  Utah is reviewing its ongoing upgrade/deployment
•  Clemson has sought advice from commercial storage vendors on a product that could support much higher throughput from DTN to DTN. Utah has volunteered to test with Clemson. NIH is interested too, after they are finished with their rebuild.
•  The bottleneck of local workstations/servers with slow arrays MAY be resolved in the near future as well (i.e., Non-Volatile Memory Express, aka NVMe, with SSDs capable of 2-3 GB/sec continuous writes; see the quick conversion after this list)
•  Implications of this traffic for the I2 AL2S backbones
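For scale, a quick conversion of those storage rates into network terms (my arithmetic; the 120 MB/s and 2-3 GB/s figures come from the slides, decimal units assumed):

# Quick unit conversion: sustained storage write rate -> usable network rate.
def gbps(bytes_per_second):
    return bytes_per_second * 8 / 1e9

print(f"original Tikal2 array: {gbps(120e6):5.2f} Gbps")  # ~0.96 Gbps
print(f"NVMe SSD (2 GB/s):     {gbps(2e9):5.2f} Gbps")    # ~16 Gbps
print(f"NVMe SSD (3 GB/s):     {gbps(3e9):5.2f} Gbps")    # ~24 Gbps

In other words, NVMe-class storage would comfortably cover the per-transfer rates achieved so far, unlike the ~1 Gbps array that started this story.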
Questions and Answers