Next Generation Sequencing The past, present, and future of DNA sequencing

*DNA sequencing:
Determining the number and order of nucleotides that
make up a given molecule of DNA.
Alex V. Postma, PhD
Department of Anatomy, Embryology & Physiology
Academic Medical Center
(Relevant) Trivia
How many base pairs (bp) are there in a human genome?
How much did it cost to sequence the first human genome?
How long did it take to sequence the first human genome?
When was the first human genome sequence complete?
Whose genome was it?
(Relevant) Trivia
How many base pairs (bp) are there in a human genome?
~3 billion (haploid)
How much did it cost to sequence the first human genome?
~$2.7 billion
How long did it take to sequence the first human genome?
~13 years
When was the first human genome sequence complete?
2000-2003
Genome Sequencing
• Goal
– Determine the order of nucleotides across a genome
• Problem
– Current DNA sequencing methods can handle only short stretches of DNA at once (<1-2 Kbp)
• Solution
– Sequence short pieces, then use computers to assemble them
Genome Sequencing
[Figure: Genome → Short fragments of DNA → Short DNA sequences → Sequenced genome]
Fragment ends shown in the figure:
TG..GT, TC..CC, AC..GC, CG..CA, TT..TC, TG..AC, AC..GC, GA..GC, CT..TG, AC..GC, GT..GC, AC..GC, AA..GC, AT..AT, TT..CC
Example reads:
ACGTGGTAA
CGTATACAC
TAGGCCATA
GTAATGGCG
CACCCTTAG
TGGCGTATA
CATA…
Assembled from the overlapping reads:
ACGTGGTAATGGCGTATACACCCTTAGGCCATA
Overlapping longer sequences shown in the figure:
ACGTGACCGGTACTGGTAACGTACA
CCTACGTGACCGGTACTGGTAACGT
ACGCCTACGTGACCGGTACTGGTAA
CGTATACACGTGACCGGTACTGGTA
ACGTACACCTACGTGACCGGTACTG
GTAACGTACGCCTACGTGACCGGTA
CTGGTAACGTATACCTCT...
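The "assemble the small pieces" step can be sketched as a greedy overlap merge. This is only a toy illustration under stated assumptions: the function names `overlap` and `greedy_assemble` and the minimum overlap of 3 bases are made up for this example, the reads are assumed to be error-free, and real assemblers use far more robust methods. The six reads are the complete short DNA sequences from the figure on this slide.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest match is shorter than `min_len`
    (min_len=3 is an arbitrary choice for this toy example).
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left that overlaps: keep separate contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Reads taken from the figure on this slide.
READS = ["ACGTGGTAA", "CGTATACAC", "TAGGCCATA",
         "GTAATGGCG", "CACCCTTAG", "TGGCGTATA"]

print(greedy_assemble(READS))
# → ['ACGTGGTAATGGCGTATACACCCTTAGGCCATA']
```

With real data, sequencing errors and repeated regions make the greedy choice unreliable, which is one reason short reads make de novo assembly difficult.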
Sanger Sequencing
• Mix DNA with dNTPs and
ddNTPs
• Amplify
• Run in Gel
– Fragments migrate a distance that is inversely related to their size (shorter fragments travel farther)
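The read-out logic of these steps can be sketched in a few lines. This is an idealized illustration, assuming exactly one chain-terminated fragment per position and perfect size separation; the function name `sanger_read` and the example template are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sanger_read(template, seed=0):
    """Reconstruct the newly synthesized strand from terminated fragments."""
    # Strand built by the polymerase, complementary to the template.
    new_strand = "".join(COMPLEMENT[base] for base in reversed(template))
    # Idealized reaction: one fragment terminated by a ddNTP at every position.
    fragments = [new_strand[:n] for n in range(1, len(new_strand) + 1)]
    # In the tube the fragments are an unordered mixture.
    random.Random(seed).shuffle(fragments)
    # The gel separates fragments by size; sorting by length mimics that.
    ladder = sorted(fragments, key=len)
    # The terminal base of each fragment (the ddNTP) gives one base of the read.
    return "".join(fragment[-1] for fragment in ladder)


print(sanger_read("TTGACA"))
# → TGTCAA
```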
Sanger Sequencing
• Advantages
– Long reads (~900 bp)
– Suitable for small projects
• Disadvantages
– Low throughput
– Expensive
Sanger Sequencing
[Timeline: 1980 to 2007]
1982: lambda virus, DNA stretches up to 30-40 Kbp (Sanger et al.)
1995: H. influenzae, 1.8 Mbp (Fleischmann et al.)
2001: H. sapiens, D. melanogaster, 3 Gbp (Venter et al.)
2007: Global Ocean Sampling Expedition, ~3,000 organisms, 7 Gbp (Venter et al.)
Next Generation Sequencing:
Why Now?
• Motivation: HGP and its derivatives,
personalized medicine
• Short-read applications: (re)sequencing, other methods (e.g. gene expression)
• Advancements in technology
High Parallelism is Achieved in
Polony Sequencing
Sanger
Polony
Generation of Polony array: DNA
Beads (454, SOLiD)
DNA Beads are generated using Emulsion PCR
Generation of Polony array: DNA
Beads (454, SOLiD)
DNA Beads are placed in wells
Generation of Polony array: BridgePCR (Solexa)
DNA fragments are attached to array and
used as PCR templates
Single Molecule Sequencing:
HeliScope
• Direct sequencing of DNA molecules: no
amplification stage
• DNA fragments are attached to array
• Potential benefits: higher throughput, fewer errors
Genome Sequencer 20 (454)
Ion torrent
Genome Analyzer (Solexa)
MinION
Technology Summary
Technology   Read length   Sequencing        Throughput (per run)   Cost (1 Mbp)*
Sanger       ~800 bp       Sanger            400 kbp                $500
454          ~400 bp       Polony            500 Mbp                $60
Solexa       75 bp         Polony            20 Gbp                 $2
SOLiD        75 bp         Polony            60 Gbp                 $2
Helicos      30-35 bp      Single molecule   25 Gbp                 $1
*Source: Shendure & Ji, Nat Biotech, 2008
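To put the table in perspective, the per-Mbp cost and per-run throughput can be scaled to a ~3 Gbp human genome. This is a rough back-of-the-envelope sketch: the function name `estimate` and the `coverage` parameter (how many times each base is sequenced on average) are assumptions added for illustration, and the numbers are simply the table values, not current prices.

```python
# Values copied from the Technology Summary table (Shendure & Ji, 2008).
COST_USD_PER_MBP = {"Sanger": 500, "454": 60, "Solexa": 2, "SOLiD": 2, "Helicos": 1}
THROUGHPUT_BP_PER_RUN = {
    "Sanger": 400_000,
    "454": 500_000_000,
    "Solexa": 20_000_000_000,
    "SOLiD": 60_000_000_000,
    "Helicos": 25_000_000_000,
}
HAPLOID_GENOME_BP = 3_000_000_000  # ~3 billion bp


def estimate(technology, coverage=1):
    """Return (cost in USD, number of runs) for one genome at a given coverage.

    `coverage` is an illustrative assumption, not a value from the table.
    """
    total_bp = HAPLOID_GENOME_BP * coverage
    cost_usd = COST_USD_PER_MBP[technology] * total_bp // 1_000_000
    runs = -(-total_bp // THROUGHPUT_BP_PER_RUN[technology])  # ceiling division
    return cost_usd, runs


for tech in COST_USD_PER_MBP:
    cost, runs = estimate(tech)
    print(f"{tech}: ~${cost:,} and {runs:,} run(s) at 1x coverage")
```

At 1x coverage this gives about $1.5 million and 7,500 runs for Sanger versus about $6,000 and a single run for Solexa, which is the gap the table is meant to show.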
Comparing Different Technologies
Sanger Sequencing
Advantages
– Lowest error rate
– Long read length (~750 bp)
– Can target a primer
Disadvantages
– High cost per base
– Long time to generate data
– Need for cloning
– Small amount of data per run
Comparing Different Technologies
454 Sequencing
Advantages
– Low error rate
– Medium read length (~400-600 bp)
Disadvantages
– Relatively high cost per base
– Must run at large scale
– Medium/high startup costs
Comparing Different Technologies
Ion Torrent Sequencing
Advantages
– Low startup costs
– Scalable (10-1000 Mb of data per run)
– Medium/low cost per base
– Low error rate
– Fast runs (<3 hours)
Disadvantages
– New, developing technology
– Cost not as low as Illumina
– Read lengths only ~100-200 bp so far
Comparing Different Technologies
Illumina Sequencing
Advantages
– Low error rate
– Lowest cost per base
– Tons of data
Disadvantages
– Must run at very large scale
– Short read length (50-75 bp)
– Runs take multiple days
– High startup costs
– De novo assembly difficult
Comparing Different Technologies
PacBio Sequencing
Advantages
– Can use single molecule as template
– Potential for very long reads (several kb+)
Disadvantages
– High error rate (~10-15%)
– Medium/high cost per base
– High startup costs
NGS Platforms Overview
• Differ in design and chemistries
• Fundamentally related: sequencing of thousands to millions of clonally amplified molecules in a massively parallel manner
• Orders of magnitude more information; will continue to evolve
• Attractive for clinical applications
– Individual sequencing assays are costly and laborious: serial “gene by gene” analysis
Pacific Biosciences
Helicos Biosciences
NABsys
VisiGen Biotechnologies
Complete Genomics
Oxford Nanopore Technologies
What, When and Why
• Sanger:
Small projects (less than 1Mbp)
• 454:
De-novo sequencing, metagenomics
• Solexa, SOLiD, HeliScope:
– Gene expression, protein-DNA interactions
– Resequencing
Sequencing the Human Genome
[Plot: Log10(price) vs. Year]
2001: Human Genome Project, $2.7G, 11 years
2001: Celera, $100M, 3 years
2007: 454, $1M, 3 months
2008: ABI SOLiD, $60K, 2 weeks
2009: Illumina, Helicos, $40-50K
2010: $5K, a few days?
2012: $100, <24 hrs?
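The plot uses a log10 price axis because the prices span many orders of magnitude. As a rough sketch, the annotated prices can be converted to that scale and to an average yearly fold-drop. The helper names `log10_price` and `yearly_fold_drop` are made up for this example, and only the data points with a single stated price are used.

```python
import math

# (year, label, price in USD) copied from the plot annotations.
MILESTONES = [
    (2001, "Human Genome Project", 2.7e9),
    (2001, "Celera", 100e6),
    (2007, "454", 1e6),
    (2008, "ABI SOLiD", 60e3),
]


def log10_price(price_usd):
    """Position of a price on the plot's vertical axis."""
    return math.log10(price_usd)


def yearly_fold_drop(first, last):
    """Average factor by which the price fell per year between two points."""
    (year0, _, price0), (year1, _, price1) = first, last
    return (price0 / price1) ** (1 / (year1 - year0))


for year, label, price in MILESTONES:
    print(f"{year} {label}: log10(price) = {log10_price(price):.2f}")

fold = yearly_fold_drop(MILESTONES[0], MILESTONES[-1])
print(f"2001 to 2008: roughly {fold:.1f}-fold cheaper per year on average")
```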
Sequencing costs have fallen
Next Generation Sequencing
Applications
• Mutation detection
• Foreign DNA detection
• Non-invasive diagnosis of aneuploidy
• Population characterization
• Cancer genetics
• Ancient DNA (Neanderthal)
• Expression analysis
• Transcription factor binding
• Chromosomal interaction
• Etc.
Exome Sequencing Identifies a Tibetan
Adaptation
Yi et al. Science 2010
The widespread mutation in Tibetans is near a gene called EPAS1, a so-called “super athlete gene”
identified several years ago and named because some variants of the gene are associated with
improved athletic performance.
The gene codes for a protein involved in sensing oxygen levels and perhaps balancing aerobic and
anaerobic metabolism.
• Degraded state of the sample → mtDNA sequencing
• Nuclear genomes of ancient remains: cave bear, mammoth, Neanderthal (10^6 bp)
Problems: contamination with modern human DNA and co-isolation of bacterial DNA
NGS Application Examples: Inherited Conditions
Discovery tool: single-gene disorders, e.g. autosomal dominant (AD) Kabuki syndrome (MLL2)
Causative mutations for multigenic diseases: superior to the “one by one” approach of traditional sequencing
Diagnostic advancements for diseases with overlapping symptoms and multiple possible syndromes/genes
Variant detection through next generation
sequencing
Meyerson et al. NRG 2010
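At its simplest, variant detection compares the bases of aligned reads with the reference at each position. The sketch below is a toy pileup caller, not the method of any particular pipeline: the function name `call_variants`, the thresholds `min_depth` and `min_fraction`, and the example reads are all assumptions made for illustration, and reads are assumed to be already aligned without insertions or deletions.

```python
from collections import Counter


def call_variants(reference, aligned_reads, min_depth=3, min_fraction=0.8):
    """Report positions where most reads disagree with the reference.

    aligned_reads: list of (start, sequence) with a 0-based start position
    on the reference; indels are not handled in this toy example.
    """
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            variants.append((pos, reference[pos], base, depth))
    return variants


REFERENCE = "ACGTGGTAATGGCG"
READS = [(0, "ACGTGCTAA"), (2, "GTGCTAATG"), (4, "GCTAATGGCG")]

print(call_variants(REFERENCE, READS))
# → [(5, 'G', 'C', 3)]
```

Real callers also weigh base quality, mapping quality, and heterozygosity, which is part of why the downstream data analysis needs dedicated expertise.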
Inherited Conditions: Challenges and Opportunities
Challenges
Example: Monogenic disorders
– Novel missense mutations
– Germ line mosaicism
– Structural aberrations
– Imprinting effects
– Epigenetic factors
Opportunities
Example: Multifactorial disease
– Risk loci more often in non-coding or intergenic regions
– Pathogenicity of variants often unclear: less testing vs. monogenic disease
– Reference human genome cataloguing of variants = more test offerings
Sequencing of a Single
Individual with Family Data
Lupski et al. NEJM 2010
The First 8 Human Genomes
SNP Distribution in Proband
Nonsynonymous SNPs in
Known Disease Genes
NGS Application Examples: Neoplastic Conditions
Cancer susceptibility genes
Patient stratification
Risk assessment
Risk management
Predictions of therapeutic
response
personalized treatment
Somatic/driver mutations
Therapeutic monitoring
Micro-RNAs
Methylation
Epigenetic changes
Prognosis
Alterations in gene expression
Molecular profiling
Tumor sub-typing
Exome Sequencing in Prostate
Cancer
Barbieri et al. Nature Genetics 2012
Nonsynonymous Somatic
Mutations in Neuroblastoma
Molenaar et al. Nature 2012
Mutation count associated
with age, stage, and survival
Molenaar et al. Nature 2012
Next Generation Sequencing
NGS diagnostics have shifted towards data analysis rather than the technical component
NGS infrastructures must include appropriate expertise and computational hardware
Unprecedented amounts of medical data and various processing algorithms necessitate adequate tools for:
– Data management (alignment and assembly)
– QC of image processing, base calling, filtering, alignment, and SNP finding/application steps
– Archiving
Considerations
• Evaluation of the variant positions “called” involves queries of all known relevant databases
• Lack of databases curated to accepted clinical standards is likely the most significant challenge in managing and reporting genome sequencing data
• EHR considerations: test ordering, archiving of NGS reports, patient consent, data (reinterpretation?)
NGS: Post-Analytical Considerations
• Expert interpretation and guidance: correlation of age, gender, clinical presentation, family history
• Team approach ideal: pathologists, geneticists, other providers
• Proficiency testing and alternative assessment are challenging
• Proficiency testing schemes based on NGS methods vs. specific genes are likely
Professional Considerations: Reimbursement and Gene Patents
• Challenging reimbursement issues
• Genome sequencing may potentially involve numerous patented gene sequences
• Development of an affordable system of common access to genes?
• What about mutations in known disease genes that are not evident from the patient's phenotype?