How To QC A KiloPipeline Per Month? Michael S. Noble

How To QC A
KiloPipeline Per Month?
Michael S. Noble
The Broad Institute of MIT & Harvard
TCGA Steering Committee Meeting
Houston, Texas
April 26, 2012
Born of the desire to systematize
analyses from The Cancer Genome
Atlas pilot and scale their execution
to the dozens of remaining diseases to
be studied, now sits atop 14 terabytes
of TCGA data and reliably executes
more than 1000 pipelines per month.
Data Snapshot
Data Snapshot
+821
CopyNumber
Data Snapshot
+821
CopyNumber
+917
Methylation
Data Snapshot
+821
CopyNumber
New datatype column
+2087 protein samples
+917
Methylation
GDAC Session Yesterday
Recommendation: formalize AWG co-chair role for each tumor type
Data Coordinator: ensure best possible data/analysis outcome.
YES!
GDAC Session Yesterday
Recommendation: formalize AWG co-chair role for each tumor type
Data Coordinator: ensure best possible data/analysis outcome.
YES!
Firehose approximates this for
23 tumorsets simultaneously
Volume, Change & Complexity
analyses:
stddata:
26 x 23 tumor sets / month
= 598
273 platforms over 23 tumorsets x 2/month = 546
Volume, Change & Complexity
analyses:
stddata:
26 x 23 tumor sets / month
= 598
273 platforms over 23 tumorsets x 2/month = 546
• Not even counting RPPA: ingested & almost ready
• CPTAC Proteomics Consortium interested, too
• Nothing on this scale ever attempted before?
• But worthless if we cannot establish scientific credibility
Enormous QC Challenge
Computing
Infrastructure
Scientific
Veracity
Enormous QC Challenge
Computing
Infrastructure
Firehose
Genepattern
HPC system (LSF)
Unix filesystems
DCC Mirroring
Normalization
Control Scripts
Website
Dashboards
Submission
etc
Scientific
Veracity
Data
How much?
Formatted ok?
New platforms?
Combine platforms?
Algorithms
What knob settings?
New versions?
Credible results?
Wired together properly?
Reports ok?
Scale preclude exhaustive inspection
Scale preclude exhaustive inspection
How We Cope
1. Automate
2. Aggregate
3. Clarify
4. Simplify
Scale preclude exhaustive inspection
How We Cope
1. Automate
2. Aggregate
3. Clarify
4. Simplify
Now some non-trivial examples ...
Automate: Continuous Unit Testing
•
•
•
•
Implemented in Bamboo framework
Regression tests run automatically
Immediately when changes checked in for covered tools
No need for CompBios / BInfs to explicitly run
Daniel DiCara
Automate: Continuous Unit Testing
•
•
•
•
Implemented in Bamboo framework
Regression tests run automatically
Immediately when changes checked in for covered tools
No need for CompBios / BInfs to explicitly run
“Seems to find bugs before Iʼve checked in the code!”
Daniel DiCara
Automate: Continuous Unit Testing
*
Gradual but steady cultural change
Where's the Real Bottleneck in Scientific Computing?
www.americanscientist.org Jan/Feb 2006
•
•
•
•
Implemented in Bamboo framework
Regression tests run automatically
Immediately when changes checked in for covered tools
No need for CompBios / BInfs to explicitly run
*
“Seems to find bugs before Iʼve checked in the code!”
Daniel DiCara
But Automation Not Enough
When N things break : manually diagnose / fix
Or automation N/A
: eyeball clusters/plots
But Automation Not Enough
When N things break : manually diagnose / fix
Or automation N/A
: eyeball clusters/plots
Aggregate: failure dashboard for all tumors & analyses
Clarify:
instant synoptic view of entire run
Saves 100s of clicks through Firehose GUI
Aggregate & Automate: Integration Testing
•
Unit tests must be predictable
•
Which means stable inputs and outputs
•
Essentially mandates old data
•
What about realtime? More current/live data?
Aggregate & Automate: Integration Testing
Establish that code changes play nice with rest of system
Aggregate & Automate: Integration Testing
brca
coad
gbm
kirc
kirp
laml
luad
lusc
ov
read
...
ucec
Establish that code changes play nice with rest of system
Across all datasets
Aggregate & Automate: Integration Testing
brca
coad
gbm
kirc
kirp
laml
luad
lusc
ov
read
...
ucec
Establish that code changes play nice with rest of system
Across all datasets
With O’s correctly wired to I’s
Aggregate & Automate: Integration Testing
brca
coad
gbm
kirc
kirp
laml
luad
lusc
ov
read
...
ucec
Establish that code changes play nice with rest of system
Across all datasets
With O’s correctly wired to I’s
Downstream dependents correctly read outputs
Aggregate & Automate: Integration Testing
brca
coad
gbm
kirc
kirp
laml
luad
lusc
ov
read
...
ucec
Establish that code changes play nice with rest of system
Across all datasets
With O’s correctly wired to I’s
Downstream dependents correctly read outputs
And remainder of workflow runs to completion
Aggregate & Automate: Integration Testing
brca
coad
gbm
kirc
kirp
laml
luad
lusc
ov
read
...
ucec
Establish that code changes play nice with rest of system
Across all datasets
With O’s correctly wired to I’s
Downstream dependents correctly read outputs
And remainder of workflow runs to completion
Using same automation infrastructure as production runs.
Clarify : Internal Process Flow
Clarify : Internal Process Flow
Previous
Slide
Clarify & Simplify:
TCGA MAF
WorkFlow
Nikki Schultz
Broad Institute
Prachi Kothiyal
Heidi Sofia &
Many Others
Simplify : firehose_get
Short 10k script
stddata
dashboard
analyses
dashboard
• Download all or parts
• Of data or analyses runs
• Open access : no password
• Select by run type & date
• Subselect by tumor type
• Or analyses type / name
• See what runs we did
• Or what tasks in each run
Simplify : firehose_get
Short 10k script
stddata
dashboard
analyses
dashboard
• Download all or parts
• Of data or analyses runs
• Open access : no password
• Select by run type & date
• Subselect by tumor type
• Or analyses type / name
• See what runs we did
• Or what tasks in each run
Easier
more eyeballs
higher quality
better science
Published studies have
that cells with
mutated
or methylated
of ourshown
studied HGS-OvCa
samples
had germline
or somatic mutations in
BRCA1 or mutated BRCA2
have
homologous
BRCA1/2,
that defective
11% lost BRCA1
expressionrecombination
through DNA hypermethy35–38
lationtoand
that epigenetic
of BRCA1
wasthat
mutually
Gene expression
. Fig.
3c shows
20%exclusive of
and are highly responsive
PARP
inhibitorssilencing
24
Low
High
, Fisher’smutations
exact test). Univariate
surBRCA1/2
mutations
5 4.4 3 10
of our studied HGS-OvCa
samples
had (P
germline
or somatic
in
vival analysis of BRCA1/2 status (Fig. 3c) showed better overall survival
BRCA1/2,
that
11%
lost
BRCA1
expression
through
DNA hypermethyTCGA training set
TCGA test set
for
BRCA1/2
mutated
cases
than
BRCA1/2
wild-type cases. Notably,
b
(Batches
17–24;
255
tumors)
(Batches
9–15;
215
tumors)
lation and that epigenetic
silencingsilenced
of BRCA1
was
mutually
exclusive
sion
epigenetically
BRCA1
cases
had survival
similarofto BRCA1/2
Death
event
(red)
24
High
exact
test). Univariate
sur-survivals of
BRCA1/2 mutations (P
5 4.4 3HGS-OvCa
10 , Fisher’s
wild-type
tumours
(respective
median overall
Poor prognosis
41.5 and
41.9 (Fig.
months,
5 0.69, log-rank
Supplementary
vival analysis of BRCA1/2
status
3c)Pshowed
better test;
overall
survival Methods,
genes (107)
section
and Supplementary
Fig. 8.13b).cases.
This suggests
that BRCA1 is
TCGA
training
set
TCGA test set
for BRCA1/2 mutated
cases8, than
BRCA1/2 wild-type
Notably,
OV
Analysis
Summary
Good prognosis
inactivated by mutually exclusive genomic and epigenomic mechanisms
genes (85) (Batches 17–24; 255 tumors)
(Batches 9–15; 215 tumors)
epigenetically
silenced
BRCA1 cases had survival similar to BRCA1/2
and that patient survival depends on the mechanism of inactivation.
red)
10
Prognostic
wild-type HGS-OvCaGenomic
tumours
(respective
median
overallrecombination
survivals of genes that
This is the analysis overview
for Firehose
run "21 March 2012".
0
alterations
in other
homologous
gene
score
sis
.......................................................................................................
–10
41.5
41.9riskmonths,might
P 5render
0.69, log-rank
test;
Methods,in this study
Lower risk
Higher risk
Lower
risk andHigher
cells sensitive
to Supplementary
PARP inhibitors39 discovered
The most robust consensus NMF
(Supplementary
Methods,This
section
8, andthat
Supplementary
section 8, and Supplementary
Fig. 8.13b).
suggests
BRCA1 is Fig. 8.12)
clustering of 565 samples using
sis
include
amplification
or and
mutation
of EMSY mechanisms
(also known as C11orf30)
c
genes was d inactivated
bycluster
mutually
exclusive
genomic
epigenomic
TCGA testthe
set1500 most variable
Ref. 25
Gene
1
(8%), focal deletion or mutation of PTEN (7%), hypermethylation of
identified 1for k = 3 clusters. We
Expr
D I M P
N = 237
N = 255
and
that
patient
survival
depends
on theof
mechanism
of inactivation.
Pulse
entire
tumor
analyses
0.8
0.8
10
computed the clustering for k = 2
C1 55 48 15 89
RAD51C
(3%),
mutation
of
ATM
or
ATR
(2%),
and mutation of
There were 547 tumor samples
miRNA
CN
to
k
=
8
and
0.6
0.6 used the cophenetic
0
Genomic
alterations
in
other
homologous
recombination
genes
that
29
51
21
40
C2
cluster
used in this analysis: 29 s ign
Fanconi
anaemia
genes
(5%).
Overall,
homologous
recombination
Clusters
In
key
representational
figures
correlation
0.4
0.4 coefficient to
39
C3 39 37 43 20
–10
.................................................
Lower
risk
Higher risk Cox P =Lower
risk
Higher
risk
discovered
might render cells sensitive
PARP
inhibitors
defects to
may
be present
in approximately
halfinofthis
all study
HGS-OvCa cases,
0.02 determine
Coxbest
P = solution.
8 × 10
0.2
0.2the
Peaks
Log-rank P = 0.0002
Log-rank P = 0.02
With
short
accompanying
providing
a rationale
for
clinical
trials of PARP
targeting
0
0
(Supplementary Methods,
section
8, and
Supplementary
Fig.inhibitors
8.12) text
20
40
60
20
40
60
0
0
ARTICLE RESEARCH tumours with these homologous-recombination-related aberrations.
Overall survival (months) Overall survival (months) e include TCGA
amplification
or mutation of EMSY (also known as C11orf30)
GA test set
Ref. 25
Gene cluster
1
d
Comparison between the complete set of BRCA inactivation events
Ref. 29
Ref. 30
N = 480
1
(8%),
focal
deletion
or
mutation of PTEN (7%), hypermethylation of
0.8
I M P
1 D
and
all recurrently
altered copy number
peaks revealed
an unexpectedly
N1= 237
N = 255
Executive
summary
from
N = 169 C1 55 48 15
N = 118
0.8
0.6 C1
89
RAD51C
(3%),
mutation
of
ATM
or
ATR
(2%),
and
mutation
ofinactivation
0.8
0.8
low frequency of CCNE1 amplification in cases with BRCA
C2
miRNA
0.4
0.6
0.6
C2 40 21 51 29
C3
50+
page
Nozzle
reports
cluster 0.6
Fanconi
anaemia genes
(5%).
Overall,
homologous
recombination
Survival
(8%
of
BRCA
altered
cases
had
CCNE1
amplification
whereas 26% of
0.2
0.4
0.4
0.4
Log-rank P = 0.007
C3 39 37 43 20
0
Cox P = 0.02
–5 Cox P = 0.0002
BRCA
wild-type
cases did;
0.0048,
adjusted
for
false-discovery
SMGs
defects
may40 be60present
in approximately
halfQof5 all
HGS-OvCa
cases,
= 0.02
Auto-excerpted
from
them
20
0
0.2 Cox P = 8 × 100.2 Log-rank P = 0.001 0.2 Log-rank P = 0.005
40
Log-rank P = 0.0002
ank P = 0.02
0
0
Overall survival
, overall
survival tended
to be lower for
previously
providing
a (months)
rationalerate).
for As
clinical
trialsreported
of PARP
inhibitors
targeting
0
20
40
60
20
40
60
0
0
20
40 Overall
60 survival (months) Overall survival (months)
0
40
60
0
patients with CCNE1 amplification than foraberrations.
patients in all other cases
tumours
with
these
homologous-recombination-related
survival (months) Overall survival (months) e
TCGA
(P 5 0.072, log-rank test; Supplementary Methods, section 8, and
Figure 2 | Gene and miRNA
expression patterns of molecular
subtype and
1
Comparison
between
the complete
of BRCA
inactivation
Supplementary
Fig. set
8.14a).
However,
no survivalevents
disadvantage for
Ref. 29
Ref.
30 prediction in HGS-OvCa.Path
NTumours
= 480
outcome
a,
from
TCGA
and
ref.
25
0.8
1
and all
recurrently
copy number
peaks
an unexpectedly
CCNE1-amplified
cases
(Prevealed
5 0.24, log-rank
test; Supplementary
separated
into four clusters
on the basis of gene expression.
b, Using
a training altered
N = 169
N = 118
0.6 C1
0.8
Methods,
section 8, and
Supplementary
Fig. 8.14b)
was apparent when
low frequency
ofset.
CCNE1
amplification
in cases
with BRCA
inactivation
data set, a prognostic gene signature
was defined and applied
to a test data
C2
Ways
0.4 C3
0.6
c, Kaplan–Meier analysis of four independent expression
data sets,
only CCNE1
at BRCA amplification
wild-type cases, whereas
suggesting26%
that the
(8%profile
of BRCA
alteredlooking
cases had
of previously
0.2
0.4
Log-rank
P
=
0.007
comparing survival for predicted higher-risk patients versus lower-risk
reported CCNE1 survival difference can be explained by the higher
0
= 0.0002
= 0.02
BRCA
wild-type
cases
did; Q 5 0.0048, adjusted for false-discovery
0.2 Cox Ppatients.
Univariate Cox P0value20for risk
40 index
60 included. d, Tumours separated
survival of40BRCA-mutated cases.
ank P = 0.001
Log-rank P = 0.005
0
(months)
, overall
survival tended
to model
be lower
for
As previously
into 40
three clusters
on theOverall
basis ofsurvival
miRNA
expression,rate).
overlapping
with gene- reported
41
) to
Finally, we
used a probabilistic
graphical
(PARADIGM
20
60
0
40
60
0
based
clusters
as
indicated.
D,
differentiated;
I,
immunoreactive;
M,
survival (months) Overall survival (months)
patients with CCNE1search
amplification
for patients
in all
other cases
for alteredthan
pathways
in the US
National
Cancer Institute
mesenchymal; P, proliferative (red bold indicates high degree of overlap).
42
, and found
that8,theand
FOXM1 tranInteraction Database
(P 5 0.072,
log-rankPathway
test; Supplementary
Methods,
section
e, Differences
in patient
survival among
the and
three miRNA-based
clusters.
ne and miRNA expression
patterns
of molecular
subtype
factor network
3d) is disadvantage
significantly altered
Supplementary Fig. scription
8.14a). However,
no (Fig.
survival
for in 87% of
iction in HGS-OvCa.section
a, Tumours
from
TCGA
and
ref.
25
6). Kaplan–Meier survival analysis of this
signature showed cases
cases (Supplementary
Methods, section
10, and Supplementary Figs
CCNE1-amplified
(P 5 0.24, log-rank
test; Supplementary
four clusters on the basis
of
gene
expression.
b,
Using
a
training
statistically significant association with survival in all validation data 10.1–10.3). FOXM1 and its proliferation-related target genes, AurB
Methods,
section 8, and
Supplementary
Fig. 8.14b)
wasPLK1,
apparent
when
gnostic gene signature sets
was (Fig.
defined
andSupplementary
applied to a test
data set.section
2d and
Methods,
6).
(AURKB),
CCNB1, BIRC5,
CDC25 and
were consistently
overier analysis of four independent
expression
profile
data
sets,
looking
only
at
BRCA
wild-type
cases,
suggesting
that
the
previously
Negative matrix factorization consensus clustering of miRNA expressed but not altered by DNA copy number changes, indicative of
1,000 g
Survival probability
Clarify & Simplify : Quicklook Summaries
•
•
•
c
RB and PI3K/RAS signalling
RB signalling
PI3K/RAS signalling
67% of cases altered
45% of cases altered
NF1
7%
12%
KRAS
18%
11%
Amplified
Amplified, mut. <1%
Amplified, mut. <1%
CCND2
4%
15%
Upregulated
RB1
AKT1
BRAF
3%
0.5%
Amplified
Mutated
Survival probability
3%
Log-rank test
P = 0.0008602
60
40
DNA damage
Sensors
ATM
1%
Mutated
FA core
complex
5%
Mutated
20
0
Proliferation/survival
50
Inactivated
d
0
50
100
Survival (months)
PLK1
HR pathway
51% of cases altered
ATR
<1%
Mutated
BRCA1
23%
Mutated,
hypermethyl.
EMSY
8%
Amplified,
mutated
FANCD2
<1%
Mutated
BRCA2
11%
Mutated
RAD51C
3%
Hypermethyl.
HR-mediated
repair
150
PTEN
7%
Deleted
•
•
•
84% of cases altered
FOXM1 signalling
50
Activated
ATM
TP53
ATR
RAD51
MAML1
2%
JAG1
JAG2
BRCA mutated [66]
BRCA1 epigenetically silenced [33]
BRCA wild type [212]
80
Percentage of cases (%)
NOTCH signalling
22% of cases altered
2%
Epigenetic silencing
via hypermethylation
0
6%
Amplified
Cell cycle progression
Somatic
mutation
100
AKT2
10%
Deleted 8%, mutated 2%
Amplified
Germline
mutation
Patient survival
PIK3CA
Amplified
BRCA2
Deleted 7%, mut. <1% Deleted 8%, mut. 4%
20%
CCND1
b
BRCA1
PTEN
32%
CCNE1
HR alterations
BRCA altered cases, N = 103 (33%)
CDKN2A
Downregulated 30%, deleted 2%
Survival probability
–5
a
Survival probability
1,000 genes
Proliferative
Inhibition
amplified / mutated
NOTCH3
11%
Amplified/mutated
amplified / mutated
Proliferation
2%
mutated
Amplified
CHEK2
Activation
MAML3
Cell cycle
progression
BRCA1
FOXM1
AURKB
MAML2
4%
CCNB1
BRCC
BRCA2
BIRC5
DNA
repair
CDC25B
Figure 3 | Altered pathways in HGS-OvCa. a, b, RB and PI3K/RAS pathways,
identified by curated analysis (a), and the NOTCH pathway, identified by
HOTNET analysis (b), are commonly altered. Alterations are defined by
somatic mutations, DNA copy number changes or, in some cases, by significant
up- or downregulation relative to expression in diploid tumours. Alteration
frequencies are expressed as a percentage of all cases; activated genes are red
and inactivated genes are blue. c, Genes in the homologous recombination
(HR) pathway are altered in up to 49% of cases. Survival analysis of BRCA1/2
status shows a divergent outcome for BRCA1/2 mutated cases (with higher
overall survival) than BRCA1/2 wild type, and that BRCA1 epigenetically
silenced cases have poorer outcomes. FA, Fanconi anaemia. d, FOXM1
transcription factor network is activated in 87% of cases. Each gene is depicted
as a multi-ring circle in which its copy number (outer ring) and gene expression
(inner ring) are plotted such that each ‘spoke’ in the ring represents a single
patient sample, with samples sorted in increasing order of FOXM1 expression.
Excitatory interactions (red arrows) and inhibitory interactions (blue lines)
were taken from the US National Cancer Institute Pathway Interaction
Database. Dashed lines indicate transcriptional regulation.
stands in striking contrast to previous TCGA findings in glioblastoma47, where there were more recurrently mutated genes with far
fewer chromosome arm-level or focal SCNAs (Fig. 1a). A high prevalence of mutations and promoter methylation in putative DNA repair
genes, including homologous recombination components, may
explain the high prevalence of SCNAs. The mutation spectrum marks
HGS-OvCa as completely distinct from other ovarian cancer histological subtypes. For example, clear-cell ovarian cancer tumours have
few TP53 mutations but have recurrent ARID1A and PIK3CA mutations48–50; endometrioid ovarian cancer tumours have frequent
CTNNB1, ARID1A and PIK3CA mutations and a lower rate of TP53
(refs 49, 50); and mucinous ovarian cancer tumours have prevalent
KRAS mutations51. These differences between ovarian cancer subtypes
METHODS SUMMARY
Human interpretation needed, but not automate-able
This distills into most concise aggregate form
All specimens were obtained from patients with appropriate consent from the
relevant institutional review board. DNA and RNA were collected from samples
using the Allprep kit (Qiagen). We used commercial technology for capture and
sequencing of exomes from whole-genome-amplified tumour DNA and normal
DNA. DNA sequences were aligned to NCBI Build 36 of the human genome;
duplicate reads were excluded from mutation calling. Validation of mutations
occurred on a separate whole-genome amplification of DNA from the same
tumour. Significantly mutated genes were identified by comparing them with
expectation models based on the exact measured rates of specific sequence lesions.
CHASM20 and MutationAssessor (Supplementary Methods, section 4) were used to
identify functional mutations. GISTIC analysis of the circular-binary-segmented
Agilent 1M feature copy number data was used to identify recurrent peaks by
comparison with the results from the other platforms, to determine likely plat-
Related Posters
Poster :
Poster :
Poster :
Poster :
Engineering Firehose
RNA-Seq in Firehose
GDAC Interoperability
Broad SNP6 Pipeline
(DiCara et al)
(Zhang et al)
(Cerami et al)
(Saksena et al)
Acknowledgements
Broad
Michael Noble
Douglas Voet
Gordon Saksena
Dan DiCara
Kristian Cibulskis
Juok Cho
Rui Jing
Michael Lawrence
Lee Lichtenstein
Pei Lin
Spring Liu
Aaron McKenna
Sachet Shukla
Raktim Sinha
Andrey Sivachenko
Carrie Sougnez
Petar Stojanov
Lihua Zhou
Hailei Zhang
Robert Zupko
PI: Lynda Chin, Gaddy Getz
Belfler-DFCI/MDACC
Yonghong Xiao
Juinhua Zhang
Terrence Wu
Harvard
Peter Park
Nils Gehlenborg
Semin Lee
Richard Park
IGV & GenePattern teams @ Broad
Jill Mesirov
Michael Reich
Peter Carr
Marc-Danie Nazaire
Jim Robinson
Helga Thorvaldsdottir
Matthew Meyerson
Todd Golub
Eric Lander