
Delivering Population-Scale Genomics: A Deep Dive into HiSeqX Performance and Capabilities

Justin Abreu, Kylee Bergin, Tim De Smet, Ryan Hegarty, Matthew Coole, Laurie Holmes, Katie Sullivan, Tim Fennell, DSDE, Niall Lennon, Danielle Perrin, Sheila Dodge and Stacey Gabriel

Genomics Platform, Broad Institute, 320 Charles Street, Cambridge, MA

Understanding Sequencing Yield and Data Quality
Year | 30X Equivalent Genomes Completed | PF Bases (Tb) Generated | Run Time (Days)
2014 | 8,031 | 769.3 | 2.9

[Figure residue: panel (B) "Sequence Generated, Cumulative (Terabases)"; axes "Fail Rate (%)", "Average of % PF (BC)" and "% PF On Sequencer"; monthly x-axis labels]
Figure 2: Data generation at scale. With the aforementioned process improvements in place, data generated via the HiSeqX platform were observed to increase significantly (A) while lane failure rate decreased (B) over time.

# Lanes of Sequencing | Library Size | Mean Coverage | % 15X | % 20X
2 | 9,635,612,469 | 25 | 89.3 | 75.4
1 | 6,159,578,978 | 32 | 96.7 | 94.2
1 | 2,803,046,955 | 28 | 92.1 | 81.7

Table 2: Comparison of WGS protocol performance on the HiSeqX and HiSeq 2500.

•  Improved coverage across the genome
Figure 4: Duplication and data loss. Efforts to improve % PF Clusters resulted in an increase in overall % duplication and data loss.

Figure 3: Understanding loading concentration. % PF Clusters was found to increase as loading concentration decreased.

•  Overall duplication was observed to increase as % PF Clusters increased, resulting in a decrease in usable data. This revelation sparked an effort to find a balance between % PF Clusters and % Duplication.
•  New quality filters were established with the goal of determining the true performance of a WGS library on the HiSeqX.

PCT EXC DUPE: Percentage of bases excluded from coverage calculations because reads are marked as duplicates.
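The balance between % PF Clusters and % duplication described above can be illustrated with a short calculation. This is a minimal sketch rather than the method used on the poster: it assumes the usable yield of a lane is proportional to the PF fraction multiplied by the non-duplicate fraction, and the function name and all input values are hypothetical.

```python
def usable_fraction(pct_pf: float, pct_dup: float) -> float:
    """Fraction of raw clusters that give usable (PF, non-duplicate) reads.

    Assumption: duplicates are counted among PF reads. Inputs are percentages.
    """
    if not (0.0 <= pct_pf <= 100.0 and 0.0 <= pct_dup <= 100.0):
        raise ValueError("percentages must lie between 0 and 100")
    return (pct_pf / 100.0) * (1.0 - pct_dup / 100.0)


# Hypothetical loading conditions (not measurements from the poster):
# a higher % PF does not help if duplication rises faster.
low_pf = usable_fraction(pct_pf=55.0, pct_dup=10.0)
high_pf = usable_fraction(pct_pf=65.0, pct_dup=30.0)
print(f"lower % PF, low duplication:   {low_pf:.3f}")
print(f"higher % PF, high duplication: {high_pf:.3f}")
```

Under these assumed inputs the lane with the lower % PF retains more usable data, which is the kind of trade-off the balancing effort targets.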
•  Reduction in base-specific biases that are attributed to DNA polymerases
•  Increased sensitivity and a reduction in false-positive observations when calling indels and copy number variants.

Figure 6: GC bias comparison. PCR-free WGS libraries (blue) show significantly more even coverage across the GC spectrum than traditional WGS libraries (red and green).

Condition | CNVs called (5kb) | Estimated False Detection Rate
Standard WGS HiSeqX v2.0 | 467 | 8.3%
PCR-free WGS HiSeqX v2.0 | 499 | 3.4%

Table 3: CNV and false detection call rate comparison.

Figure 7: SNP/Indel analysis. Analysis has shown SNP and indel analysis to be equivalent between the two protocols.

[Figure panel title: Average Coverage Loss as a Result of Applied Data Filters]
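The Table 3 figures can be combined into an expected number of true CNV calls per protocol. The sketch below assumes that the estimated false detection rate is the fraction of reported calls that are false; the poster does not state how the rate was estimated, and the function name is hypothetical.

```python
def expected_true_calls(calls: int, false_detection_rate: float) -> float:
    """Expected true calls, given a false detection rate expressed in [0, 1]."""
    if not 0.0 <= false_detection_rate <= 1.0:
        raise ValueError("false_detection_rate must lie between 0 and 1")
    return calls * (1.0 - false_detection_rate)


# Values taken from Table 3: (CNVs called, estimated false detection rate).
table3 = {
    "Standard WGS HiSeqX v2.0": (467, 0.083),
    "PCR-free WGS HiSeqX v2.0": (499, 0.034),
}

for condition, (calls, rate) in table3.items():
    true_calls = expected_true_calls(calls, rate)
    print(f"{condition}: about {true_calls:.0f} expected true CNV calls")
```

Read this way, the PCR-free protocol reports more calls and a larger share of them is expected to be real.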
[Figure residue: series "Average Coverage Loss" and "Average Coverage"; axis "Average of Exc Dupe % (WGS)"; x-axis labels covering the years 2009 to 2014 and individual months]

Table 2, additional columns (listed in extraction order):
Sequencing Technology | Protocol
HiSeq 2500 | PCR-free WGS
HiSeqX | PCR-free WGS
HiSeqX | Standard WGS
PCT EXC TOTAL: The sum of the above exclusions.
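As a worked illustration of how the exclusion metrics combine, the sketch below sums per-filter exclusion percentages into PCT EXC TOTAL and applies the result to a raw coverage figure. It assumes that coverage loss scales linearly with the fraction of excluded bases and includes only the two exclusions defined on this poster; the dictionary keys, function names, and numeric inputs are all hypothetical.

```python
from typing import Mapping


def pct_exc_total(exclusions: Mapping[str, float]) -> float:
    """Sum per-filter exclusion percentages into one total percentage."""
    if any(value < 0.0 for value in exclusions.values()):
        raise ValueError("exclusion percentages must be non-negative")
    total = sum(exclusions.values())
    if total > 100.0:
        raise ValueError("total exclusions cannot exceed 100%")
    return total


def retained_coverage(raw_coverage: float, exclusions: Mapping[str, float]) -> float:
    """Coverage left after filtering, assuming loss is linear in excluded bases."""
    return raw_coverage * (1.0 - pct_exc_total(exclusions) / 100.0)


# Hypothetical values for illustration only (not taken from the poster).
example = {"PCT_EXC_DUPE": 10.0, "PCT_EXC_OVERLAP": 5.0}
print(f"PCT EXC TOTAL: {pct_exc_total(example):.1f}%")
print(f"Retained coverage from 35X raw: {retained_coverage(35.0, example):.2f}X")
```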
[Figure residue: panel (A) "HiSeqX Lane Fail Rate Over Time"; axis "Total Number of Lanes Run"]

•  Proven ability to generate ~30X coverage within a single lane of sequencing.

PCT EXC OVERLAP: Percentage of bases excluded from coverage calculations because they are a second observation of a single base from a single insert, due to overlapping reads 1 and 2.

Year | 30X Equivalent Genomes Completed | PF Bases (Tb) Generated | Run Time (Days)
2013 | 3,021 | 253.8 | 10.8
Total | 11,052 | 1,023.1 |
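The yearly yield figures can be cross-checked with simple arithmetic. This sketch assumes the extracted values are per-year rows (2013 and 2014) plus a total, and that 1 Tb equals 1,000 Gb; the variable and function names are hypothetical.

```python
# (30X-equivalent genomes completed, PF bases generated in Tb) per year,
# as read from the poster's yield table.
yield_by_year = {
    2013: (3021, 253.8),
    2014: (8031, 769.3),
}
reported_total = (11052, 1023.1)


def gb_per_genome(genomes: int, pf_terabases: float) -> float:
    """Average PF gigabases generated per 30X-equivalent genome."""
    return pf_terabases * 1000.0 / genomes


total_genomes = sum(genomes for genomes, _ in yield_by_year.values())
total_tb = sum(tb for _, tb in yield_by_year.values())

# The per-year rows should add up to the reported totals.
print("genomes match:", total_genomes == reported_total[0])
print("terabases match:", abs(total_tb - reported_total[1]) < 0.05)

for year, (genomes, tb) in yield_by_year.items():
    print(f"{year}: {gb_per_genome(genomes, tb):.1f} Gb per 30X-equivalent genome")
```

The per-genome figure is only a consistency check: how the poster defines a 30X-equivalent genome is not stated.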
Advantages

[Figure residue: panel title "% Duplication as a Function of % PF Clusters"; axis "% Excluded Duplication"]

Goal: Automate exclusion amplification striptube preparation to meet HiSeqX throughput and capacity, and to reduce variability between lanes and samples.

Observed Improvements:
•  Improved throughput by 384%
•  Increased overall data output
•  Decreased flowcell failures

Combining the sequencing power of the HiSeqX with the Broad's PCR-free WGS protocol led to the generation of data of unprecedented quality and quantity.

•  Understanding the relationship between loading concentration and % PF has aided in maximizing the output of usable data.
•  Initial testing revealed that % PF Clusters increased as loading concentration decreased.

Increased Scale Through Process Optimization

Automating sample striptube creation:
•  Reduces potential for sample swaps
•  Reduces failures related to pipetting errors
•  Reduces process time
•  Capable of preparing 96 individual libraries for cluster amplification

Figure 1: % PF per lane of automation vs. manual validation flowcells. The automated striptube workflow helped to increase the average % PF of samples and to decrease % PF variability between samples.

Long-term flowcell storage:
•  Prepared flowcells may be stored up to 3 days at 4°C
•  Inventory creation allows sequencing runs to occur 7 days a week
•  Maximizes instrument utilization by minimizing instrument downtime

PCR-free WGS on HiSeqX

Goal: Increase the amount of usable bases generated per lane of HiSeqX sequencing in order to maximize the PF Gb.
The introduction of the Illumina HiSeqX sequencers has enabled the Genomics Platform at the Broad Institute to generate an unparalleled amount of data. To maximize the benefits of the HiSeqX platform, a significant effort was undertaken to scale up the output of high-quality sequencing data. To accomplish this, we focused on:
•  Increasing the percentage of clusters passing filter while limiting data loss
•  Increasing throughput and incorporating automation
•  Scaling up to a 7-day process

These efforts have resulted in:
•  An increase in machine utilization to full capacity
•  An unprecedented amount of data output
•  Improved sequencing data yield and quality

[Figure residue: monthly data labels, in extraction order: 15.1%, 10.1%, 11.5%, 10.7%, 12.6%, 13.1%; 30.9%, 31.3%, 30.5%, 31.3%, 30.2% (followed by x-axis labels August to December); 8.3%, 10.9%, 7.3%, 8.1%; 29.5%, 28.7%, 29.4%, 23.7%, 26.3% (followed by x-axis labels March to July)]

Figure 5: True genome coverage. Using these new data filters, we are able to truly assess how well we meet our coverage goals over time.

Acknowledgements