PAG 2014

Structural Variation Detection and De Novo Assembly in Complex
Genomes Using Extremely Long Single-Molecule Imaging
W Wang, H VanSteenhouse, A Hastie, E Lam, C Luo(1), J Ecker(1), S Rombauts(2), Y Gu(3), H Cao
BioNano Genomics, San Diego, California
(1) Salk Institute, San Diego, CA; (2) Ghent University, VIB, Gent, Belgium; (3) US Department of Agriculture–ARS, Albany, CA
Abstract
De novo genome assemblies using only short read data are generally incomplete and
highly fragmented due to the intractable complexity found in most genomes. This
complexity, consisting mainly of large duplications and repetitive regions, hinders
sequence assembly and subsequent comparative analyses. We present a single
molecule genome analysis system (Irys) based on NanoChannel Array technology that
linearizes extremely long DNA molecules for observation. This high-throughput platform
automates the imaging of single molecules of genomic DNA hundreds of kilobases in
size to measure sufficient sequence uniqueness for unambiguous assembly of complex
genomes. High-resolution genome maps assembled de novo from the extremely
long single molecules retain the original context and architecture of the genome. As a
result, genome maps improve contiguity and accuracy of whole genome assemblies,
permitting a more comprehensive analysis of functional genome biology and structural
variation.
Additionally, genome maps serve as a much-needed orthogonal validation method
for NGS assemblies. Free from reference bias, genome maps identify novel sequence
insertions and locate transgene or viral integration sites.
In addition to providing an introduction to this newly available technology, we will
demonstrate a number of examples of its utility in a variety of organisms, including crop
plants and pest insects.
Background
Generating high-quality finished genomes with accurate identification of structural variation and high
completion (minimal gaps) remains challenging using short-read sequencing technologies alone. Irys
technology instead provides direct visualization of long DNA molecules in their native state, avoiding the statistical
assumptions that are normally used to force sequence alignments of low-uniqueness elements. The resulting order
and orientation of sequence elements are demonstrated here by anchoring NGS contigs and detecting structural variation.
Methods
(1) Long molecules of DNA are labeled with IrysPrep™ reagents by (2) incorporation of fluorophore-labeled nucleotides
at a specific sequence motif throughout the genome. (3) The labeled genomic DNA is then linearized in the IrysChip™
nanochannels and single molecules are imaged by Irys. (4) Single-molecule data are collected and detected
automatically. (5) Each molecule carries a label pattern that serves as a unique signature for
assembly into genome maps. (6) Maps may be used in a variety of downstream analyses using IrysView™ software.
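To make the labeling step concrete, the sketch below predicts where labels would fall on a known sequence by locating a nickase recognition motif on either strand. It is an illustration only, not the IrysPrep or IrysView procedure. The motif strings for Nt.BspQI and Nt.BbvCI (the two enzymes named in the wheat figure) are the commonly documented recognition sequences and should be checked against the enzyme datasheets; the function names and the test sequence are hypothetical.

```python
# Recognition motifs (5'->3'); assumed from common enzyme documentation.
MOTIFS = {"Nt.BspQI": "GCTCTTC", "Nt.BbvCI": "CCTCAGC"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def motif_sites(sequence, motif):
    """Return sorted 0-based start positions of the motif on either strand."""
    sequence = sequence.upper()
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return [i for i in range(len(sequence) - k + 1) if sequence[i:i + k] in targets]


def label_intervals(sites):
    """Distances between consecutive predicted label sites."""
    return [b - a for a, b in zip(sites, sites[1:])]


# Hypothetical toy sequence with one forward and one reverse-strand site.
sites = motif_sites("AAGCTCTTCAAAAGAAGAGCTT", MOTIFS["Nt.BspQI"])
print(sites)                   # -> [2, 13]
print(label_intervals(sites))  # -> [11]
```

The list of intervals between consecutive sites is the kind of pattern that a measured molecule or genome map could be compared against; real comparisons would also need to account for sizing error and missing or extra labels, which this sketch ignores.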
Figure: Irys workflow.
1) IrysPrep Kit extraction of long DNA molecules (sources shown: Blood, Cells, Tissue, Microbes)
2) IrysPrep reagents label DNA at specific sequence motifs (diagram labels: Nickase, Recognition Motif, Nick Site, Polymerase, Displaced Strand)
3) IrysChip linearizes DNA in NanoChannels (free DNA in solution: Gaussian coil; DNA in a microchannel: partially elongated; DNA in a NanoChannel: linearized)
4) Irys automates imaging of single molecules in NanoChannels
5) Molecules and labels detected in images by instrument software
6) IrysView software assembles genome maps
Axis label in the figure: Position (kb)
Assembly of Wheat Genome Region
Figure: Genome Map for 2.1 Mb of Ae. tauschii, labeled with Nt.BbvCI and Nt.BspQI, with Scaffolding Sequences on the Genome Map. Clones 1 to 27 are placed against the Reference Assembly in two clone maps, one from genome mapping and one from SNaPshot DNA fingerprinting.
Rice De Novo Cultivar Comparison
M-202 Map Assembly
Total Map Length        389.5 Mb
Map N50                 2.4 Mb
% Overlap Reference     78.7%
Figure: M-202 Cultivar Genome Map aligned to the Nipponbare Reference v7, marking a Chr 1 Repeat and a Chr 4 Absent Region. Panels: Coverage Distribution for Chromosome 4 (Coverage (X) vs. Nucleotide position (nt)); Dual-Labeled Single Molecule Images; Genome-Wide Repeat Content (Frequency vs. Repeat Size (kb)), including a panel for M-202 Single Molecules.
Figure panel: M-202 Genome Map Assembly
Alignment and comparison of the M-202 cultivar genome map to the reference assembly identifies several
differences in genomic architecture and content. M-202 contains dense repeats that are not present in the
reference, and the reference contains large stretches that are not present in M-202. Genome maps derived from
single-molecule measurements of extremely long DNA enable comprehensive surveys of genome-wide repetitive
elements. The M-202 map has a distinct profile of simple repeat content compared to the reference sequence; in
particular, a markedly larger population of ~7 kb repeats is present. In addition to biological differences, the
reference may under-represent repeats because of inherent limitations in sequencing and assembly methods,
limitations that Irys overcomes through single-molecule detection and long “reads”.
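One way to picture how a repeat survey could work on map data: if each repeat unit contains one label site, a tandem array appears as a run of nearly equal label-to-label intervals. The sketch below finds such runs in a list of label positions. It is a simplified illustration under that assumption, not the method used by IrysView; the function name, the 10% tolerance, the minimum of three units, and the example positions are all hypothetical.

```python
def find_repeat_arrays(label_positions_kb, tolerance=0.1, min_units=3):
    """Find runs of near-equal label intervals along one map or molecule.

    Returns a list of (start_kb, end_kb, mean_unit_kb, n_units). A run grows
    while each interval is within `tolerance` (as a fraction) of the run's
    first interval, and is reported if it has at least `min_units` intervals.
    """
    intervals = [b - a for a, b in zip(label_positions_kb, label_positions_kb[1:])]
    arrays = []
    i = 0
    while i < len(intervals):
        j = i + 1
        while (j < len(intervals)
               and abs(intervals[j] - intervals[i]) <= tolerance * intervals[i]):
            j += 1
        n_units = j - i
        if n_units >= min_units:
            mean_unit = sum(intervals[i:j]) / n_units
            # Intervals i..j-1 span label positions i..j.
            arrays.append((label_positions_kb[i], label_positions_kb[j], mean_unit, n_units))
        i = j
    return arrays


# Hypothetical label positions (kb) containing four ~7 kb units.
print(find_repeat_arrays([0, 12, 19, 26.1, 33, 40, 70]))
```

Collecting the mean unit size of every reported run across all maps would give a repeat-size distribution of the kind plotted in the Genome-Wide Repeat Content histograms.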
Figure: repeat-size histograms (Frequency vs. Repeat Size (kb), 0 to 25 kb).
For the wheat genome region, a BAC minimum tiling path for a 2.1 Mb region of Ae. tauschii, the wheat D genome
donor, was used to create a dual-motif genome map. Using this map, a physical map and the sequence assembly were
corrected. The genome map-independent sequence assembly (454 single and paired-end reads) was
75% concordant with the genome map and was corrected to 95% accuracy by using the genome map.
Figure: Arabidopsis Col-0 Genome Map compared with the TAIR10 Reference. Panels: Arabidopsis Chr4 Telomeric Repeats; Whole Chr 4 Assembly Validation; Chr 4 rDNA Repeats.
Col-0 Map Assembly
Total Map Length        113.0 Mb
Map N50                 1.1 Mb
% Overlap Reference     91.2%
The Arabidopsis Col-0 de novo genome map aligns well to the reference assembly, excluding reference
gaps. In addition to this valuable validation, some maps provide additional information in repeat-rich
regions not represented in the reference.
Spider Mite Broken Gene Scaffolding
T. urticae DNA was used to create a de novo genome map and assemble sequence scaffolds and contigs.
The complete de novo sequence assembly using genome maps for super-scaffolding is 90.8 Mb. The
genome map was used to bridge important genes as well as validate and correct sequence assemblies.
The scaffold N50 improved from 3.0 Mb to 6.8 Mb, reducing the number of large scaffolds from 44 to 15.
                        Size (Mb)    N50 (Mb)
Original Assembly       90.8         3.0
Genome Map Assembly     90.8         6.8
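The N50 values reported in the tables follow the usual definition: the length L such that maps or scaffolds of length L or longer contain at least half of the total assembled length. A minimal sketch of that calculation is given here; the function name and the example lengths are hypothetical and are not the poster's data.

```python
def n50(lengths):
    """Return the N50 of a collection of map or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 requires at least one length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths in Mb: joining pieces raises the N50
# while the total assembled size stays the same.
before = [3, 3, 2, 2, 2, 2, 1, 1]
after = [6, 4, 3, 2, 1]
print(n50(before), n50(after))  # -> 2 4
```

This is why super-scaffolding can raise N50 without changing the total size, as in the table above where the assembly stays at 90.8 Mb.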
Figure: Putative fusion of a Fibroin gene. The Genome Map spans a ~25 kb gap in the Sequence between Scaffold 21 (Tetur21g03310, 1 to 15701) and Scaffold 8 (Tetur08g00010, 1486829 to 1523188).
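A gap size such as the ~25 kb shown between Scaffold 21 and Scaffold 8 can in principle be estimated from a genome map that spans both scaffolds. The sketch below assumes one label aligned near the end of the left scaffold and one near the start of the right scaffold: the map distance between those two labels, minus the sequence lying between each label and its scaffold end, is the unsequenced gap. This is an illustrative calculation with hypothetical names and numbers, not the procedure used to produce the figure.

```python
def estimate_gap_bp(map_pos_left, map_pos_right,
                    left_label_to_scaffold_end, right_scaffold_start_to_label):
    """Estimate the gap (bp) between two scaffolds anchored on one genome map.

    map_pos_left / map_pos_right: map coordinates (bp) of the anchoring labels.
    left_label_to_scaffold_end: sequence (bp) from the left label to the end
        of the left scaffold.
    right_scaffold_start_to_label: sequence (bp) from the start of the right
        scaffold to the right label.
    A negative result would suggest the scaffolds overlap rather than leave a gap.
    """
    map_distance = map_pos_right - map_pos_left
    return map_distance - left_label_to_scaffold_end - right_scaffold_start_to_label


# Hypothetical coordinates chosen to give a 25 kb gap.
print(estimate_gap_bp(100_000, 140_000, 6_000, 9_000))  # -> 25000
```

In practice the estimate inherits the sizing uncertainty of the map, so it would normally be reported as approximate.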
Conclusions
Irys enables direct visualization of single molecules of extremely long DNA for the observation and
measurement of genome complexities. This system permits accurate genome-wide assembly and detection of
structural variants that typically confound short-read genome assembly and comparative genomic analysis. Here
we demonstrate the genome assembly capabilities of the IrysChip nanochannel array and Irys imaging system:
overcoming repetitive regions to characterize complex crop and plant genomes, scaffolding important functional genes
in a pest insect, and assembling a difficult region of the wheat genome with multi-color mapping.
Rice sample generously provided by Yulin Jia Ph.D., USDA-ARS Dale Bumpers National Rice Research Center.