Long Span Mate Pairs Approaching 100 kb for Closing Genomes Closing and Finishing the Mitochondrial and Chloroplast Genomes of Sorghum Male-sterile A1 Cytoplasm Scott Monsma1 ,Robert Klein2, Brendan Keough1, Svetlana Jasinovica1, Erin Ferguson1, Michael Lodes1 and David Mead1 1Lucigen ABSTRACT Corporation, 2905 Parmenter Street, Middleton, WI 53562; 2USDA-ARS, College Station, TX NxSeq® Long Mate Pair Library Technology Long repetitive DNA sequences are abundant in most species, which creates technical challenges for the de novo assembly of even small genomes using short read next generation sequencing (NGS) methods. The incorporation of long span mate pair reads dramatically improves the success of de novo assembly and the discovery of structural variants by resolving repeat elements and ordering contigs. A new NGS library method that generates user defined mate pairs (MP) up to 100 kb has been developed. A fosmid based technology for 30-50 kb MP libraries has been used to close multiple repeat rich, complex genomes. A clone free version has been developed for 2-8 kb, 10-20 kb and beyond using bead-based and gel-based methods. A unique barcoding strategy is used to distinguish true mate pairs from false chimeric junctions, dramatically reducing the fraction of misassembled contigs. We report the closing and finishing of four bacterial genomes, and closing several plant genomes, including Miscanthus sinensis and the CMS sorghum mitochondrial and chloroplast genomes. Closing and Finishing the Sorghum CMS Mitochondrial And Chloroplast Genomes De novo sequence assembly of these repeat rich genomes is complicated by significant lateral gene transfer between the organelles. Using Lucigen’s long span library construction strategy we were able to assemble both genomes via 2 kb MP, 8 kb MP and a 35-50 kb MP library using 2x250 read chemistry. 1 SCAFFOLD, 2 GAPS Plastid Genome 147 kb METHODS AND RESULTS 40 kb NGS Mate Pair Fosmid Technology Figure 3. Schematic for user defined mate pair library construction up to 100 kb. Genomic DNA is sheared, end repaired, A-tailed and ligated to barcode adaptors prior to size selection. The insert is ligated to a unique coupler with encrypted Chimera Codes, samples are then treated with exonuclease to remove unwanted DNA, and finally digested with a selection of endonucleases to produce the correct sized Di-tags. Biotin capture allows for the removal of unwanted DNA fragments prior to the addition of a Junction Code adaptor and re-circularization. NxSeq® libraries can be made compatible with either IonTorrent or Illumina sequencing platforms. User Defined Mate Pairs up to 100 kb 8 kb NxSeq E. coli Mate Pair Library 20-40 kb NxSeq Taq Mate Pair Library MITOCHONDRIAL GENOME: 6 SCAFFOLDS, 7 GAPS B1 98,084 bp 1 A v1 D v1 C’’ Closing Plant & Animal Genomes Utilizing 30-50 kb Fosmid Mate Pairs A C’ v1 4 C’ v2 B2-a0 20 30 40 50 D v2 B2-c0 124,411 bp B2-b 5 6 199,072 bp A v2 70,887 bp Unique length: 433,377 bp Figure 6. Chloroplast and mitochondrial genome assemblies. The plastid chromosome assembled into a single circlular scaffold with only two gaps, however the mitochondrial genome appears to be substantially more complicated, with multiple segmental connections and circular subgenomic segments. Verification by PCR is in progress. Sequence Coverage of Mitochondrial Genome Indicates Variable Copy Number Subgenomes 20-100 kb Homo sapiens NxSeq Mate Pair Library 2.5 Gb Miscanthus sinensis Genome pNGS™-FOS Mate Pair Libraries pNGS Library Contigs Spanned Spans BfaI 252,236 96,205 CviQI 294,410 99,285 EcoRI 30,792 33,073 BamHI 28,510 25,234 HindIII 80,924 47,337 668,886 confirmed spans from all 5 libraries 301,134 gaps spanned N50 = 40 kb before pNGS FOS mate pair library N50 = 102 kb after pNGS FOS mate pair library Span sizes for pNGS BfaI mate pair library frequency B 2 4 6 8 10 12 14 10 Mate Pair Distance (kb) 145,782 bp 3 Total length: 726,759 bp US Patent 8329400 Figure 1. Schematic of pNGS FOS library construction workflow. Sheared and end-repaired gDNA is ligated to the pNGS™ FOS vector, packaged into lambda phage, and transfected into E. coli. Plasmid DNA is miniprepped, and after restriction of the insert DNA, the construct is rejoined and sequenced using Illumina platforms. 88,523 bp 2 base pairs C 1 Gb Fish Genome Span sizes for pNGS BfaI mate pair library frequency N50 = 80 kb before pNGS FOS mate pair library N50 = 477 kb after pNGS FOS mate pair library 0 10 20 30 40 50 60 70 80 90 100 Figure 4. Long mate pair libraries from three genomes An 8 kb NxSeq® mate pair library was constructed using gel-free methods, and the larger mate pair libraries using gel isolation. Resulting true mate pairs were mapped against the respective reference genome to determine the resulting mate pair distances. Figure 7. Variable coverage indicates sub-stoichiometric subgenomes. Reads from the un-amplified fragment library were mapped against the reference genome, revealing a complicated pattern with multiple levels of coverage, and certain areas lacking any read coverage (likely deletions). Several distinct levels of coverage are apparent, indicating altered stoichiometry for various segments. Comparison Of Long Read Technologies De novo Assembly and Closing Of Four Microbial Genomes Fragment + 10-20 kb Mate Pair Library Microbial Genome Size, GC content Scaffolds, Gaps Thermus aquaticus Staphylococcus aureus Streptomyces spp. Nonomurea spp. 2.3 Mb, 68% GC 2.8 Mb, 32% GC 8.6 Mb, 71% GC 10.3 Mb, 70% GC 1, 0 1, 0 1, 0 1, 3 Figure 5. De novo assembly and closing of four microbial genomes using a fragment and a 10-20 kb mate pair library. A fragment library (paired end, mean insert size ~500 bases) and a 1020kb mate pair library were constructed for four bacterial strains. Assembly of the first two (T. aquaticus and S. aureus) was performed with CLC Genomics Workbench 7.5, and the second two (Streptomyces and Nonomurea) were assembled with SPAdes 3.5.0 (spades.bioinf.spbau.ru). In all four cases, the assembler was able to create near-complete or single scaffolds, and the remaining gaps were readily filled and closed. Completely finished genomes were generated within 1-4 days. Figure 8. Comparison of size distributions from commercially available NGS sequencing technologies. Mate pair size distributions are plotted for various NxSeq and pNGS FOS libraries (orange), as compared to Pac Bio long reads (blue) and Illumina TruSeq synthetic long reads (green). The NxSeq Mate Pair Technologies enable mate pair libraries spanning 50 kb. Improvements approaching 100 kb have been demonstrated. CONCLUSIONS base pairs Figure 2. pNGS™ FOS libraries from large plant and animal genomes. pNGS™ FOS library mate pair data improves assemblies of large genomes by spanning gaps left by the assembly of fragment libraries only (A), thereby markedly increasing scaffold sizes (B & C). Lambda packaging ensures a predictable span distribution that lends greater confidence to the arrangement and orientation of contigs. Closing and finishing genomes has been simplified by a new paradigm for constructing 95% efficient mate pair libraries. Chimera Code™ helps prevent false mates, while incorporation of a Junction Code™ identifies mate pair junctions. NxSeq® and pNGS™-FOS Mate Pair Technology enables accurate, economical assembly of BACs and genomes and is compatible with both Ion Torrent and Illumina platforms. ACKNOWLEDGEMENTS This work was partly supported by NHGRI grants to ML and DM. We thank Mark Liles, Richard Davis, and Peter Panizzi from Auburn University for sharing microbial genome closing data, and Kankshita Swaminathan from University of Illinois for sharing Miscanthus genome closing data.
© Copyright 2024