Delivering Population-Scale Genomics: A Deep Dive into HiSeqX Performance and Capabilities

Justin Abreu, Kylee Bergin, Tim De Smet, Ryan Hegarty, Matthew Coole, Laurie Holmes, Katie Sullivan, Tim Fennell, DSDE, Niall Lennon, Danielle Perrin, Sheila Dodge and Stacey Gabriel

Genomics Platform, Broad Institute, 320 Charles Street, Cambridge, MA

The introduction of the Illumina HiSeqX sequencers has enabled the Genomics Platform at the Broad Institute to generate an unparalleled amount of data. To maximize the benefits of the HiSeqX platform, a significant effort was undertaken to scale up the output of high-quality sequencing data.

To accomplish this, we focused on:
• Increasing the percentage of clusters passing filter (% PF) while limiting data loss
• Increasing throughput and incorporating automation
• Scaling up to a 7-day process

These efforts have resulted in:
• Increase in machine utilization to full capacity
• An unprecedented amount of data output
• Improved sequencing data yield and quality

Increased Scale Through Process Optimization

Goal: Automate exclusion amplification striptube preparation to meet HiSeqX throughput and capacity, and to reduce variability between lanes and samples.

Automating sample striptube creation:
• Reduces potential for sample swaps
• Reduces failures related to pipetting errors
• Reduces process time
• Capable of preparing 96 individual libraries for cluster amplification

Figure 1: % PF per lane of automation vs. manual validation flowcells. The automated striptube workflow helped to increase the average % PF of samples and to decrease % PF variability between samples.

Long-term flowcell storage:
• Prepared flowcells may be stored up to 3 days at 4°C
• Inventory creation allows sequencing runs to occur 7 days a week
• Maximizes instrument utilization by minimizing instrument downtime

Observed Improvements:
• Improved throughput by 384%
• Increased overall data output
• Decreased flowcell failures

Year     30X Equivalent Genomes Completed    PF Bases (Tb) Generated - Genomes    Run Time (Days)
2013     3,021                               253.8                                10.8
2014     8,031                               769.3                                2.9
Total    11,052                              1,023.1

[Figure 2 panels: Total Number of Lanes Run; Sequence Generated, Cumulative (Terabases), 2009 to 2014; HiSeqX Lane Fail Rate Over Time, Fail Rate (%) by month.]

Figure 2: Data generation at scale. With the aforementioned process improvements in place, data generated via the HiSeqX platform were observed to increase significantly (A) while lane failure rate decreased (B) over time.

Understanding Sequencing Yield and Data Quality

Goal: Increase the amount of usable bases generated per lane of HiSeqX sequencing in order to maximize the PF Gb.

• Understanding the relationship between loading concentration and % PF has aided in maximizing the output of usable data.
• Initial testing revealed that % PF Clusters increased as loading concentration decreased.

Figure 3: Understanding loading concentration. % PF Clusters was found to increase as loading concentration decreased.

• Overall duplication was observed to increase as % PF Clusters increased, resulting in a decrease in usable data. This revelation sparked an effort to find a balance between % PF Clusters and % Duplication.

[Chart labels: % Duplication as a Function of % PF Clusters; % PF On Sequencer; % Excluded Duplication; Average of % PF (BC); Average of Exc Dupe % (WGS); June to December.]

Figure 4: Duplication and data loss. Efforts to improve % PF Clusters resulted in an increase in overall % duplication and data loss.

• New quality filters were established with the goal of determining the true performance of a WGS library on the HiSeqX.

PCT EXC DUPE: Percentage of bases excluded from coverage calculations because reads are marked as duplicates.
PCT EXC OVERLAP: Percentage of bases excluded from coverage calculations because they are a second observation of a single base from a single insert, caused by overlapping reads 1 and 2.
PCT EXC TOTAL: The sum of the above exclusions.

[Figure 5 panel: Average Coverage Loss as a Result of Applied Data Filters; series: Average Coverage Loss, Average Coverage; x-axis: March to December. Data labels in extraction order: Average Coverage 29.5%, 28.7%, 29.4%, 23.7%, 26.3%, 30.9%, 31.3%, 30.5%, 31.3%, 30.2%; Average Coverage Loss 15.1%, 10.1%, 11.5%, 10.7%, 12.6%, 13.1%, 8.3%, 10.9%, 7.3%, 8.1%.]

Figure 5: True genome coverage. Using these new data filters, we are able to truly assess how well we meet our coverage goals over time.

PCR-free WGS on HiSeqX

Combining the sequencing power of the HiSeqX with the Broad's PCR-free WGS protocol led to the generation of data of unprecedented quality and quantity.

Advantages
• Proven ability to generate ~30X coverage within a single lane of sequencing
• Improved coverage across the genome
• Reduction in base-specific biases that are attributed to DNA polymerases
• Increased sensitivity and fewer false-positive observations when calling indels and copy number variants

Sequencing Technology    Protocol        # Lanes of Sequencing    Library Size     Mean Coverage    % 15X    % 20X
HiSeq 2500               PCR-free WGS    2                        9,635,612,469    25               89.3     75.4
HiSeqX                   PCR-free WGS    1                        6,159,578,978    32               96.7     94.2
HiSeqX                   Standard WGS    1                        2,803,046,955    28               92.1     81.7

Table 2: Comparison of WGS protocol performance on the HiSeqX and HiSeq 2500.

Figure 6: GC bias comparison. PCR-free WGS libraries (blue) show significantly more even coverage across the GC spectrum than traditional WGS libraries (red and green).

Condition                   CNVs Called (5 kb)    Estimated False Detection Rate
Standard WGS HiSeqX v2.0    467                   8.3%
PCR-free WGS HiSeqX v2.0    499                   3.4%

Table 3: CNV and false detection call rate comparison.

Figure 7: SNP/Indel analysis. SNP and indel analysis was shown to be equivalent between the two protocols.

Acknowledgements
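The coverage filters described on this poster (PCT EXC DUPE, PCT EXC OVERLAP and PCT EXC TOTAL) can be illustrated with a short calculation. The following Python sketch is not the Broad pipeline: the class name `ExclusionFilters`, the function `usable_coverage`, and the example percentages are all hypothetical. It also assumes that usable coverage scales linearly with the fraction of bases that survive the filters, which the poster does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExclusionFilters:
    """Fractions (0.0 to 1.0) of bases excluded from coverage calculations.

    Hypothetical container for illustration; field names mirror the
    filter names used on the poster.
    """

    pct_exc_dupe: float     # bases in reads marked as duplicates
    pct_exc_overlap: float  # second observation of a base where reads 1 and 2 overlap

    @property
    def pct_exc_total(self) -> float:
        # The poster defines PCT EXC TOTAL as the sum of the exclusions above.
        return self.pct_exc_dupe + self.pct_exc_overlap


def usable_coverage(raw_mean_coverage: float, filters: ExclusionFilters) -> float:
    """Estimate mean coverage remaining after the exclusion filters.

    Assumption (not from the poster): coverage scales linearly with the
    fraction of bases kept.
    """
    total = filters.pct_exc_total
    if not 0.0 <= total <= 1.0:
        raise ValueError(f"total exclusion must be between 0 and 1, got {total}")
    return raw_mean_coverage * (1.0 - total)


if __name__ == "__main__":
    # Hypothetical values for illustration only; not taken from the poster.
    lane = ExclusionFilters(pct_exc_dupe=0.10, pct_exc_overlap=0.05)
    print(f"PCT EXC TOTAL: {lane.pct_exc_total:.1%}")
    print(f"Usable coverage: {usable_coverage(40.0, lane):.1f}X")
```

Keeping the individual exclusions separate, rather than storing only the total, makes it possible to see whether a drop in usable coverage comes from duplication (which the poster links to higher % PF Clusters) or from read overlap.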
© Copyright 2024