Open source analytics for Big Data in Big Pharma

Applications in next generation sequencing data
Big Data SIG
23 Apr 2015
Miika Ahdesmaki
Miika Ahdesmaki | 23 April 2015
Cambridge Wireless Big Data SIG | AstraZeneca
Crash course in molecular biology
Central dogma
• DNA is the ~static part
• RNA is the dynamic middle man
- Only 1% of DNA is protein-coding (or “exonic”)
• Proteins are involved in virtually all cell functions
• We can sequence DNA and RNA using ultra high throughput sequencing (3rd gen Next Generation Sequencing)
"Centraldogma nodetails" by Narayanese at English Wikipedia - Own work.
Licensed under Public Domain via Wikimedia Commons –
http://commons.wikimedia.org/wiki/File:Centraldogma_nodetails.png#/media/File:Centraldogma_nodetails.png
Why NGS?
• Personalised medicine:
- One drug for all patients no longer realistic (especially in oncology)
- Different demographics carry different risk profiles
- Understanding patient specific needs will help guide their individual medication
• Cancer is a genetic disease, most often the result of spurious mutations in DNA
- Understanding changes in cancer DNA can help defeat the disease
• Next generation high throughput sequencing offers genome DNA analyses within days and for under $10k
What is next generation sequencing?
Sequencing
• NGS: massively parallel DNA sequencing
• Oncology biggest consumer of NGS at AZ
• We sequence RNA and DNA e.g. from
- Clinical samples
- Cell lines
- Xenografts / explants
What is next generation sequencing?
Sequencing
• The DNA/RNA is pre-processed, fragmented and the short fragments are sequenced (in random order)
What is next generation sequencing?
Alignment
• The short fragments are aligned to a reference sequence, such as the human reference HG19
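As a rough illustration of the alignment step, the sketch below places each short read at the position on a reference where it has the fewest mismatches. It is a naive, ungapped scan under made-up assumptions: the reference string, the reads and the helper name `align_read` are all hypothetical, and production aligners such as bwa or bowtie2 rely on indexed data structures rather than this brute-force search.

```python
def align_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best ungapped placement, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        # Hamming distance between the read and this window of the reference.
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


# Toy reference and reads (hypothetical data, not real genome sequence).
reference = "ACGTTAGCCGATAGGCTTAACG"
reads = ["TAGCCG", "GGCTTA", "CGATTG"]

alignments = {read: align_read(read, reference) for read in reads}
for read, hit in alignments.items():
    print(read, hit)
```

The third read only matches with one mismatch, which is exactly the kind of difference the downstream variant analysis then has to interpret.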
What is next generation sequencing?
Downstream Processing (variants, expression)
• The alignments are further processed to answer the following questions
- How are the alignments different from the reference HG19 (SNPs, indels)?
- Which genes are expressed?
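The first question (how do the alignments differ from the reference?) can be sketched as a toy pileup: count the bases observed at each reference position and report positions where the most common base disagrees with the reference. This is only an assumption-laden illustration; the data, the function name `call_snps` and the thresholds are hypothetical, it handles substitutions only (no indels), and real callers such as freebayes use statistical models instead of a simple majority rule.

```python
from collections import Counter, defaultdict


def call_snps(reference, alignments, min_depth=3, min_fraction=0.5):
    """Report (pos, ref_base, alt_base, support, depth) for simple substitutions."""
    pileup = defaultdict(Counter)
    for start, read in alignments:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1

    calls = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        # Call a variant only with enough coverage and enough supporting reads.
        if base != reference[pos] and depth >= min_depth and support / depth >= min_fraction:
            calls.append((pos, reference[pos], base, support, depth))
    return calls


# Toy data: (0-based start position, read sequence); three of four reads carry a T at position 7.
reference = "ACGTTAGCCGATAGG"
alignments = [(2, "GTTAGT"), (4, "TAGTCG"), (5, "AGTCGA"), (6, "GCCGAT")]

snps = call_snps(reference, alignments)
print(snps)
```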
Uses of NGS
(Diagram, summarised)
• Sample sources: clinical samples, cell lines, explants, tumours (FFPE), tumours (fresh frozen)
• NGS data types: RNA-Seq; DNA-Seq (targeted, whole exome, whole genome)
• Readouts: expression, fusions, coding variants, coding and noncoding variants
• Uses: patient stratification; biomarkers for prognosis, drug response and safety; new target ID; mechanism of drug action; mechanism of disease; mechanisms of resistance
Data generation and volumes
• AZ: Mix of outsourced sequencing and internal data generation
• Typical size of files per sample:
- Whole genome: 60-180GB
- Exome DNA-seq: 10-20GB
- RNA-seq: 10-15GB
- Single gene targeted: 100-200MB
• In oncology, individuals are often studied in pairs (tumour/normal, parental/daughter), doubling the data volumes
• Typical study sizes: 100GB - 1TB raw compressed data
• One of our most frequent Big Data problems
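The per-sample sizes and the doubling from paired samples quoted on this slide translate into study sizes with simple arithmetic. The sketch below is a back-of-the-envelope helper, not a capacity planning tool; the function name, the dictionary keys and the example cohort of 25 individuals are hypothetical.

```python
# Per-sample size ranges in GB, taken from the figures on the slide.
SIZE_RANGE_GB = {
    "whole_genome": (60, 180),
    "exome": (10, 20),
    "rna_seq": (10, 15),
    "single_gene_targeted": (0.1, 0.2),
}


def study_size_range_gb(assay, individuals, paired=True):
    """Return (low, high) raw data volume in GB for a study."""
    # Paired designs (e.g. tumour/normal) sequence two samples per individual.
    samples = individuals * (2 if paired else 1)
    low, high = SIZE_RANGE_GB[assay]
    return samples * low, samples * high


low, high = study_size_range_gb("exome", individuals=25)
print(f"25 paired exomes: {low}-{high} GB")  # prints "25 paired exomes: 500-1000 GB"
```

A hypothetical cohort of 25 tumour/normal exome pairs therefore lands at the upper end of the typical 100GB - 1TB study size.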
Data generation and volumes
• Over the past 3-4 years we accumulated ~400TB of sequencing data via
- Acquiring public data sets (TCGA, ICGC)
- Vendor sequencing (major)
- Internal sequencing (minor)
• Over 2015-2016 we expect
- Internal sequencing to become the major data generation source (5 new sequencers in 2015 to accompany 2 sequencers in 2013-2014)
- 1PB of sequencing data by mid 2016
• Long term prediction of volumes difficult
• 3 tiered storage for processing, short term storage and long term storage
- Amazon Glacier strongly considered for long term storage
Partnering with the leaders
• “Illumina Announces Strategic Partnerships with AstraZeneca, Janssen and Sanofi to Redefine Companion Diagnostics for Oncology”
- http://investor.illumina.com/phoenix.zhtml?c=121127&p=irolnewsArticle&ID=1960007
- Illumina, Inc. … announced it has formed collaborative partnerships with leading pharmaceutical companies to develop a universal … NGS-based oncology test system
- The system will be used for clinical trials of targeted cancer therapies with a goal of developing and commercializing a multi-gene panel for therapeutic selection, resulting in a more comprehensive tool for precision medicine
Pipelines and analytics
Production – Dealing with the complexity
Number of NGS tools increases daily...
annotateBed
append_sff
bam12auxmerge
bam12split
bam12strip
bam2fastx
bamadapterclip
bamadapterfind
bamauxsort
bamcat
bamchecksort
bamclipreinsert
bamcollate
bamcollate2
bamdownsamplerandom
bamfilteraux
bamfilterflags
bamfilterheader
bamfilterrg
bamfixmateinformation
bamindex
bamleftalign
bammapdist
bammarkduplicates
bammarkduplicates2
bammaskflags
bammdnm
bammerge
bam_merge
bamrank
bamrecompress
bamreset
bamseqchksum
bamsort
bamsplit
bamsplitdiv
bamToBed
bamtofastq
bamToFastq
bamtools
bamtools-2.3.0
bamzztoname
bcbio_nextgen.py
bcftools
bed12ToBed6
bedGraphToBigWig
bedpeToBam
bedpeToBed12
bedpeToVcf
bedToBam
bedToBigBed
bedToIgv
bed_to_juncs
bedtools
bgzip
bigBedInfo
bigBedSummary
bigBedToBed
bigWigInfo
bigWigSummary
bigWigToBedGraph
bigWigToWig
blast2sam.pl
bowtie2
bowtie2-align
bowtie2-build
bowtie2-inspect
bowtie2sam.pl
brew
bwa
ccmake
closestBed
clusterBed
cmake
complementBed
contig_to_chr_coords
convert_trace
coverageBed
cpack
cpanm
cram_dump
cram_index
cramtools
crc32
ctest
cuffcompare
cuffdiff
cufflinks
cuffmerge
dbilogstrip
dbiprof
dbiproxy
expandCols
export2sam.pl
extract_fastq
extract_qual
extract_seq
faCount
faSize
fastaFromBed
fastqc
fastqtobam
faToTwoBit
featureCounts
fetchChromSizes
filter_vep.pl
fix_map_ordering
flankBed
freebayes
gatk-framework
GenomeAnalysisTK.jar
genomeCoverageBed
get_comment
getOverlap
gffread
glia
grabix
groupBy
gtf_juncs
gtf_to_fasta
gtfToGenePred
gtf_to_sam
hash_exp
hash_extract
hash_list
hash_sff
hash_tar
index_tar
interpolate_sam.pl
intersectBed
io_lib-config
isnovoindex
juncs_db
kmerprob
liftOver
linksBed
long_spanning_reads
lumpy
makeSCF
map2gtf
mapBed
maq2sam-long
maq2sam-short
maskFastaFromBed
md5fa
md5sum-lite
mergeBed
multiBamCov
multiIntersectBed
muTect-1.1.6.jar
normalisefasta
novo2paf
novo2sam.pl
novoalign
novoalignCS
novoalignCSMPI
novoalignMPI
novobarcode
novoindex
novomethyl
novope2bed.pl
novorun.pl
novosort
novoutil
nucBed
pairToBed
pairToPair
platypus
plot_roc.r
plot-vcfstats
prep_reads
psl2sam.pl
qualimap
randomBed
rtg
s3cmd
sam2vcf.pl
sambamba
samblaster
sam_juncs
samtools
samtools.pl
scalpel
scf_dump
scf_info
scf_update
scramble
scram_flagstat
scram_merge
scram_pileup
segment_juncs
seqtk
shuffleBed
slopBed
snpEff
soap2sam.pl
SomaticAnalysisTK.jar
sortBed
speedseq
speedseq.config
splitReadSamToBedpe
splitterToBreakpoint
sra_to_solid
srf2fasta
srf2fastq
srf_dump_all
srf_extract_hash
srf_extract_linear
srf_filter
srf_index_hash
srf_info
srf_list
STAR
subtractBed
tabix
tabtk
tagBam
tophat
tophat2
tophat-fusion-post
tophat_reports
trace_dump
twoBitInfo
twoBitToFa
unionBedGraphs
variant_effect_predictor.pl
vcf2fasta
vcf2sqlite.py
vcf2tsv
vcfaddinfo
vcfafpath
vcfallelicprimitives
vcfaltcount
vcfannotate
vcfannotategenotypes
vcfbiallelic
vcfbreakmulti
vcfcat
vcfcheck
vcfclassify
vcfcleancomplex
vcfclearid
vcfclearinfo
vcfcombine
vcfcommonsamples
vcfcomplex
vcfcountalleles
vcfcreatemulti
vcfdistance
vcfecho
vcfentropy
vcfevenregions
vcffilter
vcffixup
vcfflatten
vcfgeno2alleles
vcfgeno2haplo
vcfgenosamplenames
vcfgenosummarize
vcfgenotypecompare
vcfgenotypes
vcfglbound
vcfglxgt
vcfgtcompare.sh
vcfhetcount
vcfhethomratio
vcfindelproximity
vcfindels
vcfindex
vcfintersect
vcfkeepgeno
vcfkeepinfo
vcfkeepsamples
vcfleftalign
vcflength
vcfmultiallelic
vcfmultiway
vcfmultiwayscripts
vcfnobiallelicsnps
vcfnoindels
vcfnosnps
vcfnulldotslashdot
vcfnumalt
vcfoverlay
vcfparsealts
vcfplotaltdiscrepancy.r
vcfplotaltdiscrepancy.sh
vcfplotsitediscrepancy.r
vcfplottstv.sh
vcfprimers
vcfprintaltdiscrepancy.r
vcfprintaltdiscrepancy.sh
vcfqual2info
vcfqualfilter
vcfrandom
vcfrandomsample
vcfregionreduce
vcfregionreduce_and_cut
vcfregionreduce_pipe
vcfregionreduce_uncompressed
vcfremap
vcfremoveaberrantgenotypes
vcfremovenonATGC
vcfremovesamples
vcfroc
vcfsample2info
vcfsamplediff
vcfsamplenames
vcfsitesummarize
vcfsnps
vcfsom
vcfsort
vcfstats
vcfstreamsort
vcf_strip_extra_headers
vcfToBedpe
vcfuniq
vcfuniqalleles
vcfutils.pl
vcfvarstats
vep_convert_cache.pl
vep_install.pl
vt
wgsim
wgsim_eval.pl
wigToBigWig
windowBed
windowMaker
xmlwf
zoom2sam.pl
ztr_dump
300+ (OSS) tools within our production framework
Infinite number of combinations to “get it wrong”
Production – Overcoming the Complexity
Scalability, Reproducibility, Flexibility, Accessibility
• “Forced” to use open source tools and OS (Linux), no closed source alternatives exist
- Integration challenging
- Variant calling and expression analysis are very much open research questions, with rapidly changing code
- No licensing costs, but costs in internal and external consulting
• Bcbio-nextgen
- An open source Python toolkit providing best practice pipelines for fully automated NGS analysis
- Main developer Brad Chapman (HSPH)
- Unit tested, version controlled, development on GitHub: https://github.com/chapmanb/bcbio-nextgen
- Scalable across different clusters, schedulers, Amazon cloud
• AZ is an active, recognised contributor to bcbio-nextgen and collaborator with HSPH
Production – Overcoming the Complexity
Bcbio-nextgen overview
• The user writes/modifies a high level configuration file specifying inputs and analysis parameters
- Very few “tuning parameters” -> Given the same data, two analysts will produce the same results
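To give a feel for what such a high level configuration contains, the sketch below mirrors the layout of a sample description as a Python dictionary. bcbio-nextgen itself reads YAML; the key names used here (`details`, `analysis`, `genome_build`, `algorithm`, `files`, `upload`) follow the project's sample configuration as recalled and should be treated as assumptions to verify against its documentation. The sample name and file names are hypothetical.

```python
import json

# Hypothetical sample description; only the structure is the point here.
sample_config = {
    "details": [
        {
            "description": "Tumour_01",  # made-up sample name
            "files": ["Tumour_01_R1.fastq.gz", "Tumour_01_R2.fastq.gz"],  # made-up inputs
            "analysis": "variant2",
            "genome_build": "hg19",
            "algorithm": {"aligner": "bwa", "variantcaller": "freebayes"},
        }
    ],
    "upload": {"dir": "../final"},
}

# A stable, sorted text form makes it easy to check that two analysts
# really are running with identical settings.
config_text = json.dumps(sample_config, sort_keys=True, indent=2)
print(config_text)
```

Keeping the whole analysis in one short declarative file is what limits the number of tuning parameters an analyst can change by accident.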
Getting it right
• Given the rapid changes in the individual analysis tools, how do we know the pipeline “gets it right”?
• Solution: reference standards
• For germline sequencing, the Genome in a Bottle Consortium established a gold standard for an individual (NA12878)
- Samples from NA12878 can be bought off the shelf
- Compare sequencing and analytics results to the gold standard, establish sensitivity and PPV (positive predictive value) of variant calls, compare to other people’s results
• For tumour sequencing, several standards exist
- Horizon Diagnostics’ tumour standard
- ICGC-TCGA DREAM Mutation Calling challenge
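Comparing calls against a reference standard boils down to counting true positives, false positives and false negatives. The sketch below computes sensitivity and PPV from two sets of variants; the variants are invented for illustration and the function name is hypothetical. Real benchmarking additionally has to normalise how variants are represented and is typically restricted to the regions where the standard is confident, which this sketch ignores.

```python
def benchmark_calls(called, truth):
    """Compare called variants to a gold standard; variants are hashable tuples."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and in the standard
    fp = len(called - truth)   # called but not in the standard
    fn = len(truth - called)   # in the standard but missed
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn, "sensitivity": sensitivity, "ppv": ppv}


# Made-up variants as (chromosome, position, ref, alt).
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 75, "G", "A"), ("chr2", 300, "T", "C")}
called = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
          ("chr2", 75, "G", "A"), ("chr3", 10, "A", "T")}

report = benchmark_calls(called, truth)
print(report)
```

Tracking these two numbers across pipeline versions shows whether a tool upgrade improved or degraded the calls.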
Processing and managing the data
• NGS HPC clusters on 4 main R&D sites
- UK (SGE, ~200 cores, GPFS)
- Sweden (SLURM, >500 cores, Lustre)
- China (SGE, >100 cores, GPFS)
- US (UGE, >200 cores, GPFS)
• Data generated or received in one place is processed locally by the NGS Production Team (each member has access to all HPC clusters)
- Processed data handed over to disease area bioinformaticians in a controlled manner
• Fast network links between the sites allow data sharing when required
• Cloud computing …
NGS + Cloud
NGS Suited to using “Cloud”
• Large scale storage needs
• High computational power that can continue to scale
• Inherently (embarrassingly) parallel, easily ported
• Peaks and valleys in compute needs, so burst into cloud as needed instead of large investment upfront
• Launchable computing centre utilising Amazon EC2
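The “embarrassingly parallel” point can be sketched in a few lines: each sample is processed independently, so the work maps directly onto however many workers are available. In the sketch below `process_sample` is a hypothetical stand-in for a full per-sample pipeline run, and a thread pool is used only to keep the example self-contained; in practice the jobs would be dispatched to cluster schedulers or cloud nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample_name):
    """Stand-in for a per-sample pipeline run (alignment, variant calling, ...)."""
    return sample_name, f"{sample_name}.done"


def run_all(samples, max_workers=4):
    # No sample depends on another, so they can all be submitted at once;
    # adding workers (or cloud nodes) shortens the wall-clock time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(process_sample, samples))


results = run_all(["sample_A", "sample_B", "sample_C"])
print(results)
```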
StarCluster from MIT with our pipeline
[Diagram: compute nodes labelled “32 Core” and “320 SSD”, sharing a 40 TB GlusterFS file system mounted at /ngs]
Why not Hadoop?
• We rely on a large number of mostly academic open source tools, 99.9% of which are not written for Hadoop
• No existing pipeline wraps these tools in a Hadoop framework
• Disk I/O is admittedly the bottleneck in current parallel file system architectures for NGS analytics
- GPFS locally at AZ
- Lustre in AWS, local scratch SSD
Visualising the data
JBrowse genome browser
• Most popular genome analysis viewer is the Integrative Genomics Viewer (IGV, Broad Institute), a Java based standalone program
- Requires a Java app
- Requires configuration
• JBrowse, a web browser based genome viewer, is inherently easier for less tech-savvy users: point your browser to it and it just works
- Physical location of data less important, only the part that is shown is transferred
• Data of interest, such as genomic variants, can be annotated with a URL to JBrowse
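Annotating a variant with a link can be as simple as building a URL that opens the browser at the variant's location. The sketch below assumes a JBrowse instance that accepts the `loc` and `tracks` query parameters; the host name, track labels, coordinate and the helper name `jbrowse_url` are all hypothetical.

```python
from urllib.parse import urlencode


def jbrowse_url(base_url, chrom, start, end, tracks=(), padding=50):
    """Build a link that opens the viewer around a region of interest."""
    # Show some flanking sequence so the variant is seen in context.
    loc = f"{chrom}:{max(1, start - padding)}..{end + padding}"
    params = {"loc": loc}
    if tracks:
        params["tracks"] = ",".join(tracks)
    # Keep ':' and ',' unescaped so the link stays human-readable.
    return f"{base_url}?{urlencode(params, safe=':,')}"


# Hypothetical server, tracks and variant position.
url = jbrowse_url(
    "https://jbrowse.example.org/index.html",
    "chr13", 32900000, 32900000,
    tracks=["DNA", "variants"],
)
print(url)
```

A column of such links in a variant table lets a reader jump from a reported variant straight to the supporting reads.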
JBrowse
BRCA2 gene screenshot
[Screenshot labels: reference DNA sequence and amino acids; BRCA2 alternative exons; detected gene variant (G to A mutation); evidence in the data for the variant; noise in the data]
Summary
• NGS data is accumulating faster and faster
• Analysing and interpreting the data is I/O intensive (+CPU and RAM)
• Easily parallelised using SMP and simple schedulers (SGE, Slurm)
• Current challenges in integrating all the processed data (e.g. in NoSQL databases)
• Long term storage (due to e.g. regulatory requirements) in e.g. Amazon Glacier