Understanding genome evolution in non-model taxa is negatively affected by
homology-based transposable element identification
Roy N. Platt II & David A. Ray
Department of Biological Sciences, Texas Tech University
Abstract
Transposable elements (TEs) are mobile genetic elements with the ability to replicate
throughout a host genome. In some taxa TEs reach copy numbers in the hundreds of
thousands or millions and can occupy more than half of the genome. The increasing number of
reference genomes from non-model species has outpaced efforts to identify and annotate TE
content and, when applied, annotation methods vary significantly between projects. Here we
demonstrate the pitfalls of a homology-dependent method of TE identification using examples
from Mammalia and Insecta. De novo repeat identification and manual curation identified
more than a hundred new TEs in both the naked mole rat (Heterocephalus glaber) and
prairie vole (Microtus ochrogaster) genomes. When these genomes were re-annotated using
these novel repeats as well as the available rodent repeat libraries, the portion of the genome
recognized as TE-derived increased by 3-5%. More importantly, the average genetic distance within
TE families decreased, implying younger, more recent TE accumulation than was previously thought.
Reanalysis of the postman butterfly (Heliconius melpomene) recovered similar results: increased
recognition of younger TEs. These observations imply that homology-based searches are
unable to identify novel lineage-specific repeats and that the accuracy of homology-based TE
annotations decreases as phylogenetic distance between taxa increases. This would mean that
families or, in the case of horizontal transfer events, entire classes of TEs may go unrecognized.
To understand the role that TEs may play in genome evolution, they must first be
identified using de novo repeat identification and manual curation.
[TE classification legend, Class I retrotransposons. LTRs: ERV, ERV1, ERV2, ERV3, Gypsy, LTR. LINEs: CR1, L1, L2, Penelope, R4, RTE, RTEX, Tx1. SINEs: SINE1/7SL, SINE2, SINE3/5S, Vingi, Unk. Unclassified non-LTRs: Unclassified.]
The Problem
[Figure 2 graphic not reproduced. x-axis: Millions of years; y-axis: Percent transposable elements identified.]
Figure 2. Homology-based TE annotations using human transposable elements (TEs). TEs in several
mammalian genomes were identified and quantified using human TEs. The amount identified
is given as a percentage of the known repeat content. Time since divergence from the
human lineage for each taxon was taken from TimeTree.org. Taxonomically similar species are grouped together
by color. The dotted line represents 100% recognition.
[Figure 1 graphic not reproduced. Axis label: Mismatches.]
Figure 1. Repeat identification. The accuracy of homology-based repeat identification is
driven by the query element used. In the example above, a “human” element is used to identify
elements in the mouse genome. As a result, all of the repeats are identified, but the
number of mismatches is artificially inflated. On the other hand, repeat identification with the
more appropriate “Rodent” repeat recovers the same repeats, but with only one mismatch
between the consensus element and each individual locus.
The number of mismatches, or mutations, is used as a proxy for age. The repeats in the
mouse genome are almost identical, but the estimated age of the elements varies drastically
based on the query element used.
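The effect described above can be illustrated with a minimal sketch. The sequences below are invented toy data, not real TE consensus sequences, and the substitution rate is a placeholder chosen only to make the arithmetic visible. The sketch assumes ungapped, equal-length alignments and a simple linear clock (age = divergence / rate), neither of which is stated on the poster.

```python
def count_mismatches(query: str, locus: str) -> int:
    """Count differing positions between two ungapped, equal-length sequences."""
    if len(query) != len(locus):
        raise ValueError("sequences must be the same length (ungapped alignment assumed)")
    return sum(q != c for q, c in zip(query, locus))


def apparent_age(divergence: float, rate_per_site_per_my: float) -> float:
    """Convert divergence to an age in millions of years under a linear clock."""
    return divergence / rate_per_site_per_my


# Toy sequences, invented for illustration only.
RODENT_QUERY = "ACGTACGTACGTACGTACGT"
HUMAN_QUERY = "ACGAACGTTCGTACCTACGT"  # differs from RODENT_QUERY at 3 sites
MOUSE_LOCI = [
    "AGGTACGTACGTACGTACGT",  # each locus differs from RODENT_QUERY at 1 site
    "ACGTACATACGTACGTACGT",
    "ACGTACGTACGTACGTATGT",
]
ASSUMED_RATE = 0.005  # substitutions per site per million years; placeholder, not a measured value


def summarize(query: str, loci: list[str], rate: float) -> list[tuple[int, float, float]]:
    """Return (mismatches, divergence, apparent age in My) for each locus."""
    rows = []
    for locus in loci:
        n = count_mismatches(query, locus)
        d = n / len(query)
        rows.append((n, d, apparent_age(d, rate)))
    return rows


if __name__ == "__main__":
    for name, query in (("rodent", RODENT_QUERY), ("human", HUMAN_QUERY)):
        for n, d, age in summarize(query, MOUSE_LOCI, ASSUMED_RATE):
            print(f"{name:>6} query: {n} mismatches, divergence {d:.2f}, apparent age {age:.0f} My")
```

With these toy sequences the rodent query yields one mismatch per locus and the human query yields four, so the apparent age is inflated fourfold whatever rate is assumed.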
Methods
Figure 1. Species examined. The naked mole rat (Heterocephalus glaber), prairie vole (Microtus
ochrogaster), and postman butterfly (Heliconius melpomene).
Homology-based curation
1. Genomes were masked with clade-specific repeats (Rodentia & Arthropoda)
2. Repeat content was quantified with RepeatMasker
De novo curation
1. Genomes were masked with clade-specific repeats (Rodentia & Arthropoda)
2. Novel repeats were identified with RepeatModeler (Heliconius from Lavoie et al. 2014)
3. Novel repeats were manually verified through a BLAST, extension, and alignment protocol
4. Repeats were classified based on sequence hallmarks and the 80-80-80 rule
5. Repeat content was quantified with the de novo and ancestral repeats using RepeatMasker

Table 1. Transposable element load (Mb) in the naked mole rat (Heterocephalus glaber) and the
prairie vole (Microtus ochrogaster) using rodent-specific and de novo transposable
element libraries. Rodent-specific libraries were taken from Repbase (August 2014). De novo
libraries were combined with the rodent-specific libraries in an effort to generate complete
annotations.

                            Naked mole rat              Prairie vole
                            Rodent (Mb)   De novo (Mb)  Rodent (Mb)   De novo (Mb)
Class I Retrotransposons    594.8         661.65        510.13        601.6
  LTRs                      157.39        175.2         169.23        251.4
    ERV                     7.55          7.45          1.17          0.9
    ERV1                    17.05         15.47         8.63          11.3
    ERV2                    21.35         14.61         71.46         147.7
    ERV3                    110.65        84.39         85.65         84.8
    Gypsy                   0.54          0.51          0.07          0.1
    LTR                     0.25          52.77         2.25          6.6
  LINEs                     368.84        400.35        172.28        188.6
    CR1                     16.18         15.94         1.61          1.6
    L1                      352.16        383.94        170.57        186.9
    L2                      0.12          0.11          0.02          0
    Penelope                0.01          0.01          0             0
    R4                      0.01          0.01          0             0
    RTE                     0.02          0.02          0.01          0
    RTEX                    0.33          0.31          0.07          0.1
    Tx1                     0.01          0.01          0             0
  SINEs                     68.51         86.04         168.60        161.6
    SINE1/7SL               68.42         74.29         66.38         58.4
    SINE2                   0             11.66         102.15        103.2
    SINE3/5S                0.04          0.04          0.04          0
    Vingi                   0             0             0             0
    Unk                     0.05          0.05          0.03          0
  Unclassified non-LTRs     0.06          0.06          0.02          0
    Unclassified            0.06          0.06          0.02          0
Class II DNA Transposons    33.16         40.15         14.75         14.7
  PiggyBac                  0             1.73          0             0
  TcMariner                 14.45         18.85         3.87          3.9
  hAT                       15.33         16.42         8.03          8.0
  MuDR                      1.43          1.24          0.25          0.3
  Helitron                  0.13          0.13          0.02          0
  Kolobok                   0.02          0.02          0.01          0
  Unk                       1.8           1.76          2.57          2.6
Unclassified TEs            5.61          20.1          7.00          7.6
  Unclassified              5.61          20.1          7.00          7.6
Total                       633.57        721.84        531.88        623.9

Major Findings
•Using de novo repeat identification, more than 100 novel TE subfamilies were recovered in
each of the prairie vole and naked mole rat genomes.
•Novel TE subfamilies occupied more than 100 Mb in both rodent genomes, an increase of 15-20% over what was previously estimated.

[Figure 3 graphic not reproduced. Rows: naked mole rat, prairie vole, postman butterfly; columns: homology only, full curation; x-axis: Divergence from consensus; y-axis: TE derived nucleotides.]
Figure 3. Differences in TE accumulation histories of the (A & B) naked mole rat (Heterocephalus glaber),
(C & D) prairie vole (Microtus ochrogaster), and (E & F) postman butterfly (Heliconius melpomene) before
and after de novo TE identification and curation. RepeatMasker searches against the (A) mole rat and (C)
prairie vole genomes used all known mammal TEs, and searches against the (E) postman
butterfly genome used all known arthropod TEs, to identify known TEs based on homology only. De novo
identification and curation altered the content, quantity and distribution of elements identified for the
(B) mole rat, (D) prairie vole and (F) postman butterfly genomes. Divergence of each element from its
consensus sequence was calculated and binned to show the accumulation profile for each taxon. For the
mole rat and prairie vole, highly mutable CpG sites were excluded from analyses.
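The binning described in the Figure 3 caption can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the poster does not state how CpG sites were excluded or how distances were computed. The sketch assumes uppercase, ungapped, equal-length alignments, skips every consensus position that is part of a CpG dinucleotide, and uses a raw mismatch fraction rather than a corrected distance. The sequences are invented toy data.

```python
from collections import Counter


def cpg_sites(consensus: str) -> set[int]:
    """Indices of consensus bases that belong to a CpG dinucleotide."""
    sites: set[int] = set()
    for i in range(len(consensus) - 1):
        if consensus[i] == "C" and consensus[i + 1] == "G":
            sites.update((i, i + 1))
    return sites


def site_counts(consensus: str, copy: str) -> tuple[int, int]:
    """Return (mismatches, sites compared), ignoring consensus CpG sites."""
    if len(consensus) != len(copy):
        raise ValueError("ungapped, equal-length alignment assumed")
    skip = cpg_sites(consensus)
    compared = [i for i in range(len(consensus)) if i not in skip]
    mismatches = sum(consensus[i] != copy[i] for i in compared)
    return mismatches, len(compared)


def accumulation_profile(consensus: str, copies: list[str]) -> dict[int, int]:
    """Sum TE-derived nucleotides (copy lengths) per 1% divergence bin."""
    profile: Counter[int] = Counter()
    for copy in copies:
        mismatches, compared = site_counts(consensus, copy)
        if compared == 0:
            continue  # nothing left to compare once CpG sites are removed
        percent_bin = (100 * mismatches) // compared  # integer bin, avoids float rounding
        profile[percent_bin] += len(copy)
    return dict(sorted(profile.items()))


# Toy data, invented for illustration only.
CONSENSUS = "AACGTTAGCATTGACCGTAT"  # CpG dinucleotides at positions 2-3 and 15-16
COPIES = [
    "AATGTTAGCATTGACCATAT",  # differs only at CpG sites
    "AACATTGGCATTGACCGTAT",  # one CpG change plus one non-CpG change
]

if __name__ == "__main__":
    print(accumulation_profile(CONSENSUS, COPIES))
```

In this toy case the first copy differs from the consensus only at CpG sites, so it falls in the 0% bin once those sites are excluded. Counting them would have placed it at 10% divergence.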
•In all three species, switching from homology-dependent to de novo repeat
identification reduced nucleotide diversity within repeat subfamilies.
•In mammals, re-analysis using de novo repeats shifted the estimated age of lineage-specific
repeats by 40-45 million years. This number reflects the age difference between the subject
genome and the closest relative with known repeats (usually Mus and Rattus).
Recommendations
The examples presented herein indicate that the homology-based analyses commonly
employed by genome sequencing, assembly and analysis projects do not provide an accurate
picture of TE content or accumulation patterns. As more genome sequences become available,
it is imperative to provide full, detailed repeat annotations. Relying on homology to elements
from a closest relative will create a negative feedback loop where poor repeat annotations in
“taxon A” will lead to poor annotations in “taxon B”…
The most accurate repeat annotations are possible through:
1. Repeat identification through de novo computational resources
2. Verification of element capture, often requiring manual curation
3. Classification of novel elements to the family level or beyond
4. Re-annotation using known ancestral repeats plus newly identified elements
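For the classification step, the 80-80-80 rule named in the Methods can be sketched as a simple threshold check. As the rule is commonly stated, two elements are placed in the same family when they align over at least 80 bp, with at least 80% identity, across at least 80% of the compared region. The class and function names below are hypothetical, and measuring coverage against the shorter element is an assumption. The alignment statistics are taken as given rather than computed.

```python
from dataclasses import dataclass


@dataclass
class AlignmentSummary:
    """Summary statistics for one pairwise alignment (hypothetical container)."""
    aligned_bp: int          # length of the aligned segment, in bp
    percent_identity: float  # identity across the aligned segment, 0-100
    percent_coverage: float  # share of the shorter element that is aligned, 0-100 (assumption)


def passes_80_80_80(
    hit: AlignmentSummary,
    min_bp: int = 80,
    min_identity: float = 80.0,
    min_coverage: float = 80.0,
) -> bool:
    """True when the alignment meets all three thresholds of the 80-80-80 rule."""
    return (
        hit.aligned_bp >= min_bp
        and hit.percent_identity >= min_identity
        and hit.percent_coverage >= min_coverage
    )


if __name__ == "__main__":
    # Invented example values, for illustration only.
    candidates = {
        "long, similar, well covered": AlignmentSummary(250, 91.2, 88.0),
        "too short": AlignmentSummary(60, 95.0, 100.0),
        "too divergent": AlignmentSummary(300, 72.5, 95.0),
        "partial overlap": AlignmentSummary(120, 90.0, 40.0),
    }
    for label, hit in candidates.items():
        print(f"{label}: same family = {passes_80_80_80(hit)}")
```

A check like this only automates the threshold comparison. Sequence hallmarks and manual inspection, as listed in the Methods, still decide ambiguous cases.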
Abiding by the principles outlined herein will significantly improve our ability to understand
the biology of TEs and genome evolution in general.
Acknowledgments
We would like to thank Robert Hubley, Laura Berdugo-Blanco, Sarah Mangum, and Wesli
Stubbs for their support. Citations are available upon request.