Long Span Mate Pairs Approaching 100 kb for Closing Genomes
Closing and Finishing the Mitochondrial and Chloroplast Genomes of Sorghum Male-sterile A1 Cytoplasm
Scott Monsma1, Robert Klein2, Brendan Keough1, Svetlana Jasinovica1, Erin Ferguson1, Michael Lodes1 and David Mead1
1Lucigen Corporation, 2905 Parmenter Street, Middleton, WI 53562; 2USDA-ARS, College Station, TX
ABSTRACT
NxSeq® Long Mate Pair Library Technology
Long repetitive DNA sequences are abundant in most species, which
creates technical challenges for the de novo assembly of even small
genomes using short-read next-generation sequencing (NGS) methods.
The incorporation of long-span mate pair reads dramatically improves
the success of de novo assembly and the discovery of structural
variants by resolving repeat elements and ordering contigs. A new NGS
library method that generates user-defined mate pairs (MP) of up to 100
kb has been developed. A fosmid-based technology for 30-50 kb MP
libraries has been used to close multiple repeat-rich, complex genomes.
A clone-free version has been developed for 2-8 kb, 10-20 kb and
beyond using bead-based and gel-based methods. A unique barcoding
strategy is used to distinguish true mate pairs from false chimeric
junctions, dramatically reducing the fraction of misassembled contigs.
We report the closing and finishing of four bacterial genomes and the
closing of several plant genomes, including Miscanthus sinensis and the
CMS sorghum mitochondrial and chloroplast genomes.
Closing and Finishing the Sorghum CMS
Mitochondrial And Chloroplast Genomes
De novo sequence assembly of these repeat-rich genomes is
complicated by significant lateral gene transfer between the organelles.
Using Lucigen’s long-span library construction strategy, we were able to
assemble both genomes from a 2 kb MP, an 8 kb MP and a 35-50 kb MP
library using 2x250 read chemistry.
Plastid Genome: 147 kb, 1 scaffold, 2 gaps
METHODS AND RESULTS
40 kb NGS Mate Pair Fosmid Technology
Figure 3. Schematic for user-defined mate pair library
construction up to 100 kb. Genomic DNA is sheared, end repaired,
A-tailed and ligated to barcode adaptors prior to size selection. The
insert is ligated to a unique coupler with encrypted Chimera Codes.
Samples are then treated with exonuclease to remove unwanted DNA
and finally digested with a selection of endonucleases to produce
correctly sized di-tags. Biotin capture allows for the removal of
unwanted DNA fragments prior to the addition of a Junction Code
adaptor and re-circularization. NxSeq® libraries can be made
compatible with either Ion Torrent or Illumina sequencing platforms.
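The role of the two codes can be illustrated with a minimal sketch. It assumes, purely for illustration, that each sequenced di-tag reads as left tag + Chimera Code + Junction Code + Chimera Code + right tag, and that a true mate pair carries the same Chimera Code on both sides of the junction. The sequences, the code length and the function name are hypothetical and do not represent the actual NxSeq adaptor design.

```python
# Hypothetical read layout: <left tag><code><JUNCTION><code><right tag>
JUNCTION = "GTTGGAACCTTCAG"   # made-up Junction Code sequence
CODE_LEN = 6                   # made-up Chimera Code length


def classify_ditag(read):
    """Return (status, left_tag, right_tag) for one di-tag read.

    status is 'true_mate', 'chimera' or 'no_junction'.
    """
    idx = read.find(JUNCTION)
    if idx < CODE_LEN:
        # Junction absent (find returned -1) or too close to the read start
        return ("no_junction", None, None)
    right_start = idx + len(JUNCTION)
    left_code = read[idx - CODE_LEN:idx]
    right_code = read[right_start:right_start + CODE_LEN]
    if len(right_code) < CODE_LEN:
        # Read ends before a full code can be read on the right side
        return ("no_junction", None, None)
    left_tag = read[:idx - CODE_LEN]
    right_tag = read[right_start + CODE_LEN:]
    # Matching codes on both sides are taken as evidence of a true mate pair
    status = "true_mate" if left_code == right_code else "chimera"
    return (status, left_tag, right_tag)
```

Under these assumptions, reads classified as chimeras or lacking a junction would be discarded before the two tags are passed to the assembler as a mate pair.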
User Defined Mate Pairs up to 100 kb
8 kb NxSeq E. coli Mate Pair Library
20-40 kb NxSeq Taq Mate Pair Library
Closing Plant & Animal Genomes
Utilizing 30-50 kb Fosmid Mate Pairs
Mitochondrial Genome: 6 scaffolds, 7 gaps
Figure 6 panel labels (mitochondrial genome): A, A v1, A v2, B1, B2-a0, B2-b, B2-c0, C’ v1, C’ v2, C’’, D v1, D v2
Numbers shown: 1, 4, 5, 6
Lengths shown: 98,084 bp; 124,411 bp; 199,072 bp; 70,887 bp
Unique length: 433,377 bp
Figure 6. Chloroplast and mitochondrial genome assemblies. The
plastid chromosome assembled into a single circular scaffold with only
two gaps. The mitochondrial genome, however, appears to be
substantially more complicated, with multiple segmental connections
and circular subgenomic segments. Verification by PCR is in progress.
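As a quick arithmetic check on the mitochondrial panel, the sketch below assumes that the six bp values printed on Figure 6 are the lengths of the six scaffolds. Under that assumption they add up to the stated total length, and the difference between the total and the unique length gives the amount of sequence that would be counted more than once across scaffolds. The variable names are illustrative only.

```python
# Six bp values printed on the mitochondrial panel of Figure 6
# (assumed here to be the lengths of the six scaffolds).
scaffold_lengths_bp = [98_084, 88_523, 145_782, 124_411, 199_072, 70_887]

total_length_bp = sum(scaffold_lengths_bp)   # figure states 726,759 bp
unique_length_bp = 433_377                   # figure states 433,377 bp

# Sequence counted more than once across scaffolds (total minus unique)
repeated_bp = total_length_bp - unique_length_bp

print(f"total = {total_length_bp:,} bp, repeated = {repeated_bp:,} bp")
```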
Sequence Coverage of Mitochondrial Genome
Indicates Variable Copy Number Subgenomes
20-100 kb Homo sapiens NxSeq Mate Pair Library
2.5 Gb Miscanthus sinensis Genome
pNGS™-FOS Mate Pair Libraries
pNGS Library    Contigs Spanned    Spans
BfaI            252,236            96,205
CviQI           294,410            99,285
EcoRI           30,792             33,073
BamHI           28,510             25,234
HindIII         80,924             47,337
668,886 confirmed spans from all 5 libraries
301,134 gaps spanned
N50 = 40 kb before pNGS FOS mate pair library
N50 = 102 kb after pNGS FOS mate pair library
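The N50 values quoted for these assemblies follow the usual definition: the length L such that contigs or scaffolds of length L or longer contain at least half of the assembled bases. A minimal sketch of that calculation is shown here. The lengths in the usage lines are invented toy values (in kb), not data from this poster, and individual assemblers may differ in details such as how gaps are counted.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50() needs at least one length")


# Toy example: joining contigs into longer scaffolds raises the N50
before = [40, 40, 30, 20, 10, 10]   # invented contig lengths
after = [110, 30, 10]               # same 150 kb after hypothetical scaffolding
print(n50(before))  # -> 40
print(n50(after))   # -> 110
```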
Panel B. Span sizes for pNGS BfaI mate pair library (y-axis: frequency)
Axis label: Mate Pair Distance (kb)
Figure 6 panel labels (mitochondrial genome): 3; 145,782 bp
Total length: 726,759 bp
US Patent 8329400
Figure 1. Schematic of pNGS FOS library construction workflow.
Sheared and end-repaired gDNA is ligated to the pNGS™ FOS vector,
packaged into lambda phage, and transfected into E. coli. Plasmid
DNA is miniprepped, and after restriction of the insert DNA, the
construct is rejoined and sequenced using Illumina platforms.
Figure 6 panel labels (mitochondrial genome): 2; 88,523 bp
1 Gb Fish Genome
N50 = 80 kb before pNGS FOS mate pair library
N50 = 477 kb after pNGS FOS mate pair library
Panel C. Span sizes for pNGS BfaI mate pair library (axis labels: base pairs, frequency)
Figure 4. Long mate pair libraries from three genomes.
An 8 kb NxSeq® mate pair library was constructed using gel-free
methods, and the larger mate pair libraries were constructed using gel
isolation. The resulting true mate pairs were mapped against the
respective reference genomes to determine the mate pair distances.
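The mapping step described in this caption can be sketched as follows. The sketch assumes each true mate pair has already been reduced to two mapped positions on the same reference sequence. For a circular reference it takes the shorter way around the circle, which is only valid when the span is less than half the genome length. The function names and the 1 kb bin width are illustrative choices, not the pipeline used for the poster.

```python
from collections import Counter


def mate_distance(pos_a, pos_b, genome_length=None):
    """Distance in bp between two mapped mate positions on one reference.

    If genome_length is given, the reference is treated as circular and the
    shorter way around the circle is returned.
    """
    d = abs(pos_a - pos_b)
    if genome_length is not None:
        d = min(d, genome_length - d)
    return d


def distance_histogram(pairs, bin_kb=1, genome_length=None):
    """Count mate pairs per distance bin; keys are the lower bin edge in kb."""
    bin_bp = bin_kb * 1000
    counts = Counter()
    for pos_a, pos_b in pairs:
        d = mate_distance(pos_a, pos_b, genome_length)
        counts[(d // bin_bp) * bin_kb] += 1
    return dict(sorted(counts.items()))
```

Plotting the resulting counts against the bin edges gives a distance distribution of the kind shown in the figure panels.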
Figure 7. Variable coverage indicates sub-stoichiometric
subgenomes. Reads from the un-amplified fragment library were
mapped against the reference genome, revealing a complicated pattern
with multiple levels of coverage, and certain areas lacking any read
coverage (likely deletions). Several distinct levels of coverage are
apparent, indicating altered stoichiometry for various segments.
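A coverage profile of this kind can be computed from mapped read intervals. The sketch below is a simplified stand-in for whatever mapping and coverage tools were actually used. It assumes reads are given as 0-based, half-open (start, end) intervals that lie within a linear reference, and it reports the zero-coverage stretches that the caption interprets as likely deletions.

```python
def coverage_profile(read_intervals, ref_length):
    """Per-base read depth from half-open (start, end) intervals, 0-based."""
    diff = [0] * (ref_length + 1)
    for start, end in read_intervals:
        diff[start] += 1      # depth rises where a read begins
        diff[end] -= 1        # and falls just past where it ends
    depth, running = [], 0
    for change in diff[:ref_length]:
        running += change
        depth.append(running)
    return depth


def zero_coverage_regions(depth):
    """Half-open (start, end) runs of positions with no read coverage."""
    regions, run_start = [], None
    for i, d in enumerate(depth):
        if d == 0 and run_start is None:
            run_start = i
        elif d != 0 and run_start is not None:
            regions.append((run_start, i))
            run_start = None
    if run_start is not None:
        regions.append((run_start, len(depth)))
    return regions
```

Distinct plateaus in the depth values returned by `coverage_profile` would correspond to the different coverage levels described in the caption.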
Comparison Of Long Read Technologies
De novo Assembly and Closing
Of Four Microbial Genomes
Fragment + 10-20 kb Mate Pair Library
Microbial Genome         Size, GC content    Scaffolds, Gaps
Thermus aquaticus        2.3 Mb, 68% GC      1, 0
Staphylococcus aureus    2.8 Mb, 32% GC      1, 0
Streptomyces spp.        8.6 Mb, 71% GC      1, 0
Nonomuraea spp.          10.3 Mb, 70% GC     1, 3
Figure 5. De novo assembly and closing of four microbial genomes
using a fragment and a 10-20 kb mate pair library.
A fragment library (paired end, mean insert size ~500 bases) and a 10-20 kb
mate pair library were constructed for each of four bacterial strains.
Assembly of the first two (T. aquaticus and S. aureus) was performed with
CLC Genomics Workbench 7.5, and the second two (Streptomyces and
Nonomuraea) were assembled with SPAdes 3.5.0 (spades.bioinf.spbau.ru).
In all four cases, the assembler was able to create near-complete or
single scaffolds, and the remaining gaps were readily filled and closed.
Completely finished genomes were generated within 1-4 days.
Figure 8. Comparison of size distributions from commercially
available NGS technologies.
Mate pair size distributions are plotted for various NxSeq and pNGS
FOS libraries (orange), as compared to PacBio long reads (blue) and
Illumina TruSeq synthetic long reads (green). The NxSeq Mate Pair
Technologies enable mate pair libraries spanning 50 kb. Improvements
approaching 100 kb have been demonstrated.
Figure 2. pNGS™ FOS libraries from large plant and animal
genomes.
pNGS™ FOS library mate pair data improves assemblies of large
genomes by spanning gaps left by the assembly of fragment libraries
only (A), thereby markedly increasing scaffold sizes (B & C). Lambda
packaging ensures a predictable span distribution that lends greater
confidence to the arrangement and orientation of contigs.
CONCLUSIONS
Closing and finishing genomes has been simplified by a new paradigm for constructing 95% efficient mate pair
libraries. Chimera Code™ helps prevent false mates, while incorporation of a Junction Code™ identifies mate
pair junctions. NxSeq® and pNGS™-FOS Mate Pair Technology enables accurate, economical assembly of
BACs and genomes and is compatible with both Ion Torrent and Illumina platforms.
ACKNOWLEDGEMENTS
This work was partly supported by NHGRI grants to ML and DM. We thank Mark Liles, Richard Davis, and Peter Panizzi of
Auburn University for sharing microbial genome closing data, and Kankshita Swaminathan of the University of Illinois for
sharing Miscanthus genome closing data.