Structural Variation Detection and De Novo Assembly in Complex Genomes Using Extremely Long Single-Molecule Imaging

W. Wang, H. VanSteenhouse, A. Hastie, E. Lam, C. Luo¹, J. Ecker¹, S. Rombauts², Y. Gu³, H. Cao

BioNano Genomics, San Diego, California; ¹Salk Institute, San Diego, CA; ²Ghent University, VIB, Gent, Belgium; ³US Department of Agriculture–ARS, Albany, CA

Abstract

De novo genome assemblies using only short read data are generally incomplete and highly fragmented due to the intractable complexity found in most genomes. This complexity, consisting mainly of large duplications and repetitive regions, hinders sequence assembly and subsequent comparative analyses. We present a single molecule genome analysis system (Irys) based on NanoChannel Array technology that linearizes extremely long DNA molecules for observation. This high-throughput platform automates the imaging of single molecules of genomic DNA hundreds of kilobases in size to measure sufficient sequence uniqueness for unambiguous assembly of complex genomes. High-resolution genome maps assembled de novo from the extremely long single molecules retain the original context and architecture of the genome. As a result, genome maps improve contiguity and accuracy of whole genome assemblies, permitting a more comprehensive analysis of functional genome biology and structural variation. Additionally, genome maps serve as a much-needed orthogonal validation method to NGS assemblies. Free from reference bias, genome maps identify novel sequence insertions and locate transgene or viral integration sites. In addition to providing an introduction to this newly available technology, we will demonstrate a number of examples of its utility in a variety of organisms, including crop plants and pest insects.
Background

Generating high-quality finished genomes with accurate identification of structural variation and high completion (minimal gaps) remains challenging using short-read sequencing technologies alone. Irys technology instead provides direct visualization of long DNA molecules in their native state, avoiding the statistical assumptions that are normally used to force sequence alignments of low-uniqueness elements. The resulting order and orientation of sequence elements are demonstrated in anchoring NGS contigs and in structural variation detection.

Methods

(1) Long DNA molecules are labeled with IrysPrep™ reagents by (2) incorporation of fluorophore-labeled nucleotides at a specific sequence motif throughout the genome. (3) The labeled genomic DNA is then linearized in the IrysChip™ nanochannels, and single molecules are imaged by Irys. (4) Single-molecule data are collected and detected automatically. (5) Each molecule carries a label pattern that serves as a unique signature and is used to assemble the molecules into genome maps. (6) Maps may be used in a variety of downstream analyses using IrysView™ software.

[Figure: Irys workflow, panels 1-4. 1) IrysPrep Kit extraction of long DNA molecules; 2) IrysPrep reagents label DNA at specific sequence motifs; 3) IrysChip linearizes DNA in NanoChannels; 4) Irys automates imaging of single molecules in NanoChannels. Schematic labels: blood; displaced strand; free DNA in solution (Gaussian coil), DNA in a microchannel (partially elongated), DNA in a NanoChannel (linearized).]
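Because labels are placed at occurrences of a specific sequence motif, the label pattern expected for a known sequence can be predicted in silico and compared with measured maps. The following is a minimal Python sketch of that idea, not the IrysView implementation. It assumes the recognition sequences GCTCTTC for Nt.BspQI and CCTCAGC for Nt.BbvCI (the two nickases named in the figure labels; the sequences come from general enzyme documentation, not from this poster), reports motif start positions rather than exact nick sites, and ignores optical resolution limits. The function names are hypothetical.

```python
import re

# Assumed recognition motifs; confirm against the enzyme supplier's data sheet.
MOTIFS = {"Nt.BspQI": "GCTCTTC", "Nt.BbvCI": "CCTCAGC"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def label_positions(sequence, motif):
    """Return sorted 0-based start positions of the motif on either strand."""
    sequence = sequence.upper()
    hits = set()
    for pattern in {motif, reverse_complement(motif)}:
        # A lookahead reports overlapping occurrences as well.
        for match in re.finditer(f"(?={pattern})", sequence):
            hits.add(match.start())
    return sorted(hits)


def label_intervals(positions):
    """Return distances between consecutive label positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


if __name__ == "__main__":
    # Toy sequence: one forward-strand and one reverse-strand Nt.BspQI site.
    toy = "AA" + "GCTCTTC" + "AAAA" + "GAAGAGC" + "TT"
    sites = label_positions(toy, MOTIFS["Nt.BspQI"])
    print(sites, label_intervals(sites))
```

For the toy sequence, the two sites fall at positions 2 and 13, giving a single interval of 11 bases. On real genomes the intervals would be expressed in kilobases and compared with the spacing measured on imaged molecules.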
[Figure: Irys workflow, panels 5-6. 5) Molecules and labels detected in images by instrument software; 6) IrysView software assembles genome maps. Schematic labels: cells, tissue, microbes; nickase, recognition motif, nick site, polymerase; axis: Position (kb).]

[Figure: Assembly of Wheat Genome Region. Genome map for 2.1 Mb of Ae. tauschii, with panels "Clone map: Genome mapping", "Clone map: SNaPshot DNA fingerprinting", "Scaffolding Sequences on the Genome Map" and "Dual-Labeled Single Molecule Images" (Nt.BbvCI, Nt.BspQI). Clones 1 to 27 are shown.]

Rice De Novo Cultivar Comparison

M-202 Map Assembly
  Total Map Length       389.5 Mb
  Map N50                2.4 Mb
  % Overlap Reference    78.7%

[Figure: Nipponbare Reference v7 aligned to the M-202 cultivar genome map, with callouts "Chr 1 Repeat" and "Chr 4 Absent Region"; "Coverage Distribution for Chromosome 4" (Coverage (X) vs. Nucleotide position (nt)); "Genome-Wide Repeat Content" (histograms of Frequency vs. Repeat Size (kb); recovered panel labels: Reference Assembly, M-202 Single Molecules, M-202 Genome Map Assembly).]

Alignment and comparison of the M-202 cultivar genome map to the reference assembly identify several differences in genomic architecture and content. M-202 contains dense repeats that are not present in the reference, and the reference contains large stretches that are not present in M-202. Genome maps derived from single-molecule measurements of extremely long DNA enable comprehensive surveys of genome-wide repetitive elements.
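The poster does not describe how repeats were called. One simple way to survey them on a map, sketched here under stated assumptions, is to look for runs of near-equal spacing between consecutive labels: a tandem repeat whose unit contains a label site produces evenly spaced labels, and the spacing estimates the repeat unit size. The function name, the 10% tolerance and the minimum of three units are all hypothetical choices.

```python
def find_tandem_repeats(positions_kb, tolerance=0.1, min_units=3):
    """Find runs of near-equal label spacing on one map or molecule.

    positions_kb: sorted label positions in kb.
    Returns a list of (start_kb, end_kb, mean_unit_kb, n_units) tuples.
    """
    intervals = [b - a for a, b in zip(positions_kb, positions_kb[1:])]
    repeats = []
    i = 0
    while i < len(intervals):
        j = i + 1
        # Extend the run while each interval stays within the relative
        # tolerance of the first interval in the run.
        while (j < len(intervals)
               and abs(intervals[j] - intervals[i]) <= tolerance * intervals[i]):
            j += 1
        n_units = j - i
        if n_units >= min_units:
            mean_unit = sum(intervals[i:j]) / n_units
            # Intervals i..j-1 span label positions i..j.
            repeats.append((positions_kb[i], positions_kb[j], mean_unit, n_units))
            i = j
        else:
            i += 1
    return repeats


if __name__ == "__main__":
    # Toy label positions (kb) with four ~7 kb units between 30 and 58 kb.
    toy = [0.0, 12.5, 30.0, 37.0, 44.1, 51.0, 58.0, 90.0]
    for start, end, unit, n in find_tandem_repeats(toy):
        print(f"{start:.1f}-{end:.1f} kb: {n} units of ~{unit:.1f} kb")
```

Collecting the estimated unit sizes across all maps and binning them would give a Frequency vs. Repeat Size (kb) histogram of the kind shown in the figure.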
The M-202 map has a distinct profile of simple repeat content compared to the reference sequence; in particular, a significantly larger population of ~7 kb repeats is present. In addition to biological differences, it is possible that the reference under-represents repeats because of inherent limitations in sequencing and assembly methods, limitations that Irys overcomes through single-molecule detection and long "reads".

A BAC minimum tiling path for a 2.1 Mb region of Ae. tauschii, the wheat D genome donor, was used to create a dual-motif genome map. Using this map, a physical map and the sequence assembly were corrected. The sequence assembly built independently of the genome map (454 single reads and paired-end reads) was 75% concordant with the genome map and was corrected to 95% accuracy by using the genome map.

Arabidopsis Assembly Validation

Col-0 Map Assembly
  Total Map Length       113.0 Mb
  Map N50                1.1 Mb
  % Overlap Reference    91.2%

[Figure: TAIR10 reference aligned to the genome map for whole Chr 4, with callouts "Chr 4 Telomeric Repeats" and "Chr 4 rDNA Repeats".]

The Arabidopsis Col-0 de novo genome map aligns well to the reference assembly, excluding reference gaps. In addition to this valuable validation, some maps provide additional information in repeat-rich regions not represented in the reference.

Spider Mite Broken Gene Scaffolding

T. urticae DNA was used to create a de novo genome map and to assemble sequence scaffolds and contigs. The complete de novo sequence assembly using genome maps for super-scaffolding is 90.8 Mb. The genome map was used to bridge important genes as well as to validate and correct sequence assemblies. The scaffold N50 improved from 3.0 Mb to 6.8 Mb, reducing the number of large scaffolds from 44 to 15.
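The N50 values quoted for maps and scaffolds follow the standard definition: the length L such that pieces of length L or longer together cover at least half of the total assembly length. A minimal Python sketch of that calculation follows; the scaffold lengths in the example are invented toy values, not the T. urticae data.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or map lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    half = sum(lengths) / 2
    running = 0
    # Walk from the longest piece down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


if __name__ == "__main__":
    # Toy example (Mb): joining the 4 Mb and 3 Mb scaffolds into one 7 Mb
    # super-scaffold keeps the total at 12 Mb but raises the N50.
    before = [4, 3, 3, 2]
    after = [7, 3, 2]
    print(n50(before), n50(after))
```

In the toy example the total length is unchanged while the N50 rises from 3 to 7, which mirrors how super-scaffolding can improve N50 and reduce the scaffold count without changing the assembly size.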
                     Assembly Size (Mb)    N50 (Mb)
  Original Assembly  90.8                  3.0
  Genome Map         90.8                  6.8

[Figure: Putative fusion of a Fibroin gene. The genome map bridges sequence Scaffold 21 (Tetur21g03310, 1 to 15701) and Scaffold 8 (Tetur08g00010, 1486829 to 1523188) across a ~25 kb gap.]

Conclusions

Irys enables direct visualization of single molecules of extremely long DNA for the direct observation and measurement of genome complexities. This system permits accurate genome-wide assembly and detection of structural variants that typically confound short-read genome assembly and comparative genomic analysis. Here we demonstrate that the IrysChip nanochannel array and Irys imaging system can span repetitive regions to characterize complex crop and plant genomes, scaffold important functional genes in a pest insect, and assemble a difficult region of the wheat genome with multi-color mapping.

References

• Lam, E.T., et al. Genome mapping on nanochannel arrays for structural variation analysis and sequence assembly. Nature Biotechnology (2012); 10: 2303.
• Das, S.K., et al. Single molecule linear analysis of DNA in nano-channel labeled with sequence specific fluorescent probes. Nucleic Acids Research (2010); 38: 8.
• Xiao, M., et al. Rapid DNA mapping by fluorescent single molecule detection. Nucleic Acids Research (2007); 35: e16.
• Hastie, A.R., et al. Rapid Genome Mapping in Nanochannel Arrays for Highly Complete and Accurate De Novo Sequence Assembly of the Complex Aegilops tauschii Genome. PLoS ONE (2013); 8(2): e55864.
• Rice sample generously provided by Yulin Jia, Ph.D., USDA-ARS Dale Bumpers National Rice Research Center.