5 Phylogeny 5.1 References

66
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
5 Phylogeny
Evolutionary tree of organisms, Ernst Haeckel, 1866
5.1
References
• J. Felsenstein, Inferring Phylogenies, Sinauer, 2004.
• C. Semple and M. Steel, Phylogenetics, Oxford University Press, 2003.
• R. Durbin, S. Eddy, A. Krogh & G. Mitchison, Biological sequence analysis, Cambridge, 1998
• D.L. Swofford, G.J. Olsen, P.J.Waddell & D.M. Hillis, Phylogenetic Inference, in: D.M. Hillis
(ed.), Molecular Systematics, 2 ed., Sunderland Mass., 1996.
• M. Salemi and A-M. Vandamme, The Phylogenetic Handbook. Cambridge University Press
2003.
5.2
Why phylogenetic analysis?
Phylogenetic analysis has many applications, including understanding the emergence and development
of diseases:
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
5.3
67
Outline of the chapter
1. Enumerating phylogenetic trees
2. Computer representation of trees
3. Distance-based reconstruction methods
4. Parsimony-based reconstruction methods
5. Jukes-Cantor model of DNA sequences evolution
6. Maximum likelihood reconstruction methods
7. Bayesian reconstruction methods
8. Phylogenetic Networks
1, 3, 4 will cover some material already presented in “Grundlagen der Bioinformatik”
5.4
Phylogenetic analysis
The goal of phylogenetic analysis is to determine and describe the evolutionary relationship between a
givens set of species. In particular, this involves determining the order of speciation events and their
approximate timing.
It is generally assumed that speciation is a branching process: a population of organisms becomes
separated into two sub-populations. Over time, these evolve into separate species that do not crossbreed.
Because of this assumption, a tree is often used to represent a proposed phylogeny for a set of species,
showing how the species evolved from a common ancestor.
68
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
5.4.1
Early Evolutionary Studies
Before the early 1960s: Anatomical/Morphological features were the dominant criteria used to derive
evolutionary relationships between species
1958: Francis Crick suggested to use molecular sequences for phylogenetic tree reconstruction.
1962: Emile Zuckerkandl and Linus Pauling built the first phylogenetic tree from aligned amino acid
sequences. They also proposed the theory of the molecular clock.
1967: Walter Fitch and Emanuel Margoliash published the first algorithm for phylogenetic tree reconstruction (least-squares algorithm).
1977: Carl Woese uses phylogenies based on SSU rRNA sequences to show that are are three Domains
of Life: Eukaryotes, Bacteria and Archaea.
5.4.2
Phylogenetic trees
In the following, we will use X = {x1 , x2 , . . . , xn } to denote a set of taxa, where a taxon xi is simply
a representative of a group of individuals defined in some way.
Definition 5.4.1 (Phylogenetic tree) A phylogenetic tree (on X) is a system T = (V, E, λ) consisting of a connected graph (V, E) without cycles, together with a labeling λ of the leaves by elements
of X, such that:
1. every leaf is labeled by exactly one taxon, and
2. every taxon appears exactly once as a label.
5.4.3
Unrooted trees
An unrooted phylogenetic tree is obtained by placing a set of taxa on the leaves of a tree:
Pan_panisc
Gorilla
Homo_sap
Mus_mouse
Rattus_norv
harbor_sel
Bos_ta(cow)
fin_whale
blue whale
Taxa X
harbor_sel
Gorilla
Pan_panisc
Bos_ta(cow)
fin_whale
Homo_sap
Mus mouse
+
tree
⇒
blue_whale
Rattus norv
phylogenetic tree T on X
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
5.4.4
Unrooted and rooted trees
5.4.5
Enumerating phylogenetic trees
69
Let T be an unrooted phylogenetic tree on n taxa, i.e., with n leaves. We assume that T is bifurcating,
i.e. that each internal node has degree 3. Any non-bifurcating tree on n taxa will have less nodes and
edges.
Lemma 5.4.2 (Number of nodes and edges) A bifurcating phylogenetic tree T on n taxa has
2n − 2 nodes, 2n − 3 edges and n − 3 internal edges for all n ≥ 3.
Proof: by induction.
For n = 3 there is exactly one tree: it has 3 outer edges and zero internal edges. It has 3 leaves and
one internal node.
Assume that the statement holds for n. Any tree T 0 with n + 1 leaves can be obtained from some tree
T with n leaves by inserting a new node v into some edge e of T and connecting v to a new leaf w.
This increases both the number of nodes and the number of edges by 2.
Thus T 0 has 2n − 2 + 2 = 2(n + 1) − 2 nodes and 2n − 3 + 2 = 2n − 1 = 2(n + 1) − 3 edges.
The number of internal edges is equal to the total number of edges minus the number of edges leading
to leaves: (2n − 3) − n = n − 3.
Lemma 5.4.3 (Number of trees) There are
U (n) = (2n − 5)!! := 3 · 5 · 7 · . . . · (2n − 5)
unrooted trees on n leaves, and
R(n) = (2n − 3)!! = U (n) · (2n − 3) = 3 · 5 · . . . · (2n − 3)
rooted trees.
The number of possible trees grows extremely rapidly for increasing values of n. For example, for
n = 10 we have U (n) = 2.027 × 106 and R(n) = 34.459 × 108 .
5.4.6
Distances in phylogenetic trees
Let T = (V, E) be a phylogenetic tree with node set V and edge set E. An edge weight function
ω : E → R≥0 assigns a non-negative number ω(e) to every edge e.
These weights (or lengths) often represent
70
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
• the number of mutations on evolutionary path from one species to another, or
• a time estimate for evolution of one species into another.
If T is a phylogenetic tree with edge weights ω, then the tree distance (or patristic distance) between
i and j is given by
X
ω(e),
dT (i, j) =
e∈P (i,j)
where the sum is taken over all edges e that are contained in the path P (i, j) from i to j.
Example:
B
A
9
8
2
D
12
7
5
C
E
The tree distance between B and C is dT (B, C) = 9 + 12 + 7 = 28.
5.5
Computer representation of a phylogenetic tree
Let X be a set of taxa and T a phylogenetic tree on X.
To represent a phylogenetic tree in a computer we maintain a set of nodes V and a set of edges E.
Each edge e ∈ E maintains a reference to its source node s(e) and target node t(e). Each node v ∈ V
maintains a list of references to all incident edges.
Each leaf node v ∈ V maintains a reference λ(v) to the taxon that it is labeled by, and, vice versa,
ν(x) maps a taxon x to the node that it labels.
Each edge e ∈ E maintains its length ω(e).
5.5.1
Nested structure
A rooted phylogenetic tree T = (V, E, λ) is a nested structure: Consider any node v ∈ V . Let Tv
denote the subtree rooted at v. Let v1 , v2 , . . . , vk denote the children of v. Then Tv is obtainable from
its subtrees Tv1 , Tv2 , . . . , Tvk by connecting v to each of their roots.
Such a nested structure can be written using nested brackets:
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
Phylogenetic tree
nested description
v1
v10
taxon2
v7
v8
taxon6
v9
taxon1
v5 v6
taxon3
taxon4
v4
taxon5
v3
v2
71
v1
↓
(v2 , v3 )
↓
((v4 , v5 ), (v6 , v7 , v8 ))
↓
(((v9 , v10 ), v5 ), (v6 , v7 , v8 ))
Description: (((taxon1 , taxon2 ), taxon3 ), (taxon4 , taxon5 , taxon6 ))
5.5.2
Newick format
Phylogenetic trees are usually read and printed using the so-called Newick format. The Newick Standard for representing trees in computer-readable form makes use of the correspondence between trees
and nested parentheses, noticed in 1857 by the famous English mathematician Arthur Cayley. The
tree ends with a semicolon. Interior nodes are represented by a pair of matched parentheses. Between
them are representations of the nodes that are immediately descended from that node, separated by
commas. Leaves are represented by their names. Trees can be multifurcating at any level. Branch
lengths can be incorporated into a tree by putting a real number after a node and preceded by a colon.
This represents the length of the branch immediately below that node.
Examples:
Unrooted tree
Rooted tree with edge lengths
e
a
d
b
c
Newick string:
((a,b),c,(d,e));
Newick string:
(((A:0.5,B:0.5):1.5,C:2.0):1.0,(D:1.0,E:1.0):2.0);
In the Newick format, each pair of matching brackets corresponds to an internal node and each taxon
label corresponds to an external node. To add edge lengths to the format, specify the length len of
the edge along which given node we visited by adding :len behind the closing bracket, in the case of
an internal node, and behind the label, in the case of a leaf.
For example, consider the tree (a,b,(c,d)). If all edges have the same length 1, then this tree is
written as (a:1,b:1,(c:1,d:1):1).
5.5.3
Algorithm for Tree to Newick
This algorithm recursively prints a tree in Newick format. Initialisation with e = null and v = ρ, if
the tree is rooted, or v set to an arbitrary internal node, else.
Algorithm 5.5.1 (toNewick(e, v))
Input: Phylogenetic tree T = (V, E) on X, with labeling λ : V → X
72
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
Output: Newick description of tree
begin
if v is a leaf then
print λ(v)
else // v is an internal node
print ’(’
for each edge f 6= e incident to v do
If this is not the first pass of the loop, print ’,’
Let w 6= v be the other node incident to f
call toNewick(f, w)
print ’)’
print ’;’
end
5.5.4
Algorithm for parsing Newick
Algorithm parses a tree in Newick format from a (space-free) string s = s1 s2 . . . sm .
Initialisation: parseNewick(1, m, null).
Algorithm 5.5.2 (parseNewick(i,j,v))
Input: a substring si . . . sj and a root node v
Output: a phylogenetic tree T = (V, E, λ)
begin
while i ≤ j do
Add a new node w to V
if v 6= null then add a new edge {v, w} to E
if si = ’(’ then // hit a subtree
Set k to the position of the balancing close-bracket for i
call parseNewick(i + 1, k − 1, w) // strip brackets and recurse
else // hit a leaf
Set k = min{k ≥ i | sk+1 = ’,’ or k + 1 = j}
Set λ(w) = si . . . sk // grab label for leaf
Set i = k + 1 // advance to next ’,’ or j
if i < j then
Check that i + 1 ≤ j and si+1 = ’,’
Increment i // advance to next token
end
5.6
Constructing phylogenetic trees
There are four main approaches to constructing phylogenetic trees:
1. Using a distance method, one first computes a distance matrix for a given set of biological data
and then one computes a tree that represents these distances as closely as possible.
2. Maximum parsimony takes as input a set of aligned sequences and attempts to find a tree and
a labeling of its internal nodes by auxiliary sequences such that the number of mutations along
the tree is minimum.
3. Given a probabilistic model of evolution, maximum likelihood approaches aim at finding a phylogenetic tree that maximizes the likelihood of obtaining the given sequences.
4. Bayesian reconstruction methods estimate a posterior probability distribution of trees.
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
5.7
73
Distance-based tree reconstruction
Assume we are given a set X = {x1 , x2 , . . . , xn } of taxa. The input to a distance method is a distance
matrix D : X × X → R≥0 that associates a distance d(xi , xj ) with every pair of taxa xi , xj ∈ X.
We require that D satisfies at least the first two properties of a metric:
1. symmetry, d(x, y) = d(y, x),
2. the triangle inequality, d(x, y) + d(y, z) ≥ (dx, z) and
3. identity, x 6= y ⇒ d(x, y) > 0.
Distance-Based Phylogeny Problem: Reconstruct an evolutionary tree on n leaves from an n × n
distance matrix
Input: A distance matrix D = (dij ) on X.
Output: A phylogenetic tree T (rooted or unrooted) with edge lengths on X such that the distances
in T approximate D in some prescribed way.
The most commonly distance-based methods used are
1. UPGMA
2. Neighbor joining
3. Minimum evolution and related methods
5.8
The neighbor-joining algorithm
The most widely used distance method is neighbor joining (NJ), originally introduced by Saitou and
Nei (1987)1 and modified by Studier and Keppler (1988).
Given a distance matrix D, neighbor joining produces an unrooted phylogenetic tree T with edge
lengths.
It is more widely applicable than UPGMA, as it does not assume a “molecular clock”.
Starting from a star tree topology the algorithm proceeds to build a tree based on D by repeatedly
pairing “neighboring” taxa:
c
a
c
d
d
g
b
e
1
a
→
f
Saitou, N., Nei, Y. (1987) SIAM J. on Comp. 10:405-421
b
g
e
f
74
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
Algorithm 5.8.1 (Neighbor joining)
Input: Distance matrix D on X
Output: Unrooted phylogenetic tree T
Initialization:
Define L to be the set of leaf nodes, one for each taxon x ∈ X
Set T = L
while |L| > 2 do
Compute the “neighbor-joining matrix” N :
1 P
For all i, j set Nij = dij − (ri + rj ), where ri = |L|−2
k∈L dik
Pick a pair i, j ∈ L for which Nij is minimum
Create a new node k in T
Update D:
For all m ∈ L set dkm = 12 (dim + djm − dij )
Connect k to i and to j using two new edges of lengths
dik = 12 (dij + ri − rj ) and djk = dij − dik , respectively.
Remove i and j from L and add k to L.
if L = {i, j}, then create a new edge between i and j of length dij .
Let i and j be two neighboring leaves that have the same parent node, k. Remove i, j from the list of
nodes L and add k to the current list of nodes. How do we update its distance to any given leaf m?
Consider:
i
i
−→
m
j
j
k
m
We must have:
dim = dik + dkm , djm = djk + dkm and dij = dik + djk ,
thus
dim + djm = dik + dkm + djk + dkm = dij + 2dkm ,
which implies
1
dkm = (dim + djm − dij ).
2
Why do we set the length of the new edges to dik = 12 (dij + ri − rj ) and djk = dij − dik ?
1 P
1 P
By definition, ri = |L|−2
k∈L dik equals the average distance qi = |L|−2
k∈L,k6=i,j dik from i to all
other nodes m 6= i, j, plus
dij
|L|−2 :
m1
i
dij
j
k
m2
qi
average distance
qj
m5 m 4
m3
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
As we see from this figure:
2dik = dij + qi − qj = dij + qi − qj +
5.8.1
Taxa
A
B
C
D
dij
|L|−2
−
dij
|L|−2
75
= dij + ri − rj .
Example
A
-
B
8
-
C
7
9
-
D
12
14
11
-
Initialisation: L = X = {A, B, C, D}
Iteration:
1. Compute N: first compute column sums ri∗ :
∗
∗
∗
∗
rA
= 8 + 7 + 12 = 27, rB
= 8 + 9 + 14 = 31, rC
= 7 + 9 + 11 = 27, rD
= 12 + 14 + 11 = 37
N
=
T axa
A
B
C
D
A
−
∗
∗
8 − (rA
+ rB
)/2
∗
∗
7 − (rA + rC
)/2
∗
∗
)/2
+ rD
12 − (rA
B
8
−
∗
∗
9 − (rB
+ rC
)/2
∗
∗
)/2
+ rD
14 − (rB
=
T axa
A
B
C
D
A
−
−21
−20
−20
D
12
14
11
−
B
8
−
−20
−20
C
7
9
−
−21
C
7
9
−
∗
∗
)/2
+ rD
11 − (rC
D
12
14
11
−
One minimum is N (A, B) = −21. Define new internal node k to join A and B and compute edge
lengths of k to A and of k to B:
1
1
(d(A, B) + rA − rB ) = (8 + 27/2 − 31/2) = 3
2
2
d(B, k) = d(A, B) − d(A, k) = 8 − 3 = 5
d(A, k) =
Compute new D0 :
T axa k
C
D
k
− d(k, C) d(k, D)
C
−
11
D
−
where
d(k, C) =
d(k, D) =
1
(d(A, C) + d(B, C) − d(A, B)) = 4
2
1
(d(A, D) + d(B, D) − d(A, B)) = 9
2
Second iteration: Compute N: we have L = {AB, C, D}, |L| = 3, |L| − 2 = 1
∗
∗
rk∗ = 4 + 9 = 13, rC
= 4 + 11 = 15, rD
= 9 + 11 = 20
N
=
T axa
AB
C
D
AB
−
4 − (13 + 15)/1
9 − (13 + 20)/1
=
T axa
AB
C
D
AB
−
−24
−24
C
4
−
−24
D
9
11
−
C
4
−
11 − (15 + 20)/1
D
9
11
−
76
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
A minimum is given by N (AB, C). Define a new node k to join AB and C:
d(C, k)
=
d(AB, k)
=
1
1
(d(AB, C) + rC − rAB ) = (4 + 15 − 13) = 3
2
2
d(AB, C) − d(C, k) = 4 − 3 = 1
Termination:
Only two nodes left, D and ABC. Compute length of edge d(ABC, D):
d(ABC, D) = d(C, D) − d(ABC, C) = 11 − 3 = 8
This is the resulting tree:
B
5
8
1
D
3
A
3
C
5.8.2
Additivity and the four-point condition
A distance matrix D is called additive, or a tree metric, if it can be represented by a phylogenetic tree.
How to tell whether D is additive?
Definition 5.8.2 (Four-point condition) Assume we are given a set of taxa X. A distance matrix
D on X is said to fulfill the four-point condition if for all i, j, k, l ∈ X
d(i, j) + d(k, l) ≤ max{d(i, k) + d(j, l), d(i, l) + d(j, k)}
holds.
(That is, two of the three expressions d(xi , xj ) + d(xk , xl ), d(xi , xk ) + d(xj , xl ), and d(xi , xl ) + d(xj , xk )
are equal and are not smaller than the third.)
We have the following classical result (Buneman, 1971):
Theorem 5.8.3 (Additivity metrics and the 4PC) A distance matrix D on X is additive, if and
only if it fulfills the four-point condition.
We will prove one direction of the theorem.
Proof “⇒”:
Assume that D is additive. Then there exists a phylogenetic tree T that represents D.
Let i, j, k, l be elements of X. If they are all different, then we can assume w.l.o.g. that (i, j) and (k, l)
are neighbors in the induced tree T |{i,j,k,l} , respectively.
It follows that
d(i, j) + d(k, l) ≤ d(i, k) + d(j, l) = d(i, l) + d(j, k),
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
77
as can be seen from the following figure:
and thus D fulfills the four-point condition.
Example additive metric:


B C D
 A 7 6 5 

D=
 B
3 6 
C
5
A non-additive

B
 A 7
D=
 B
C
Check the four-point condition:
Check the four-point condition:
metric:
C
7
4

D
6 

7 
5
d(A, B) + d(C, D) = 7 + 5 = 12
d(A, C) + d(B, D) = 7 + 7 = 14
d(A, D) + d(B, C) = 6 + 4 = 10
⇒ d(A, C) + d(B, D) 6≤
max (d(A, B) + d(C, D), d(A, D) + d(B, C))
⇒ 4-point condition doesn’t hold.
d(A, B) + d(C, D) = 7 + 5 = 12
d(A, C) + d(B, D) = 6 + 6 = 12
d(A, D) + d(B, C) = 5 + 3 = 8
⇒ the four-point condition holds.
D
D
C
2.0
C
2.0
3.0
2.0
1.0
1.0
2.0
1.0
2.0
3.0
2.0
B
A
B
A
5.9
The Jukes-Cantor model of evolution
T. Jukes and C. Cantor (1969) introduced a very simple model of DNA sequence evolution:
Definition 5.9.1 (Jukes-Cantor model) The Jukes-Cantor model of evolution makes the following
assumptions:
1. We are given a rooted phylogenetic tree T along which sequences evolve.
2. The sequence length is an input parameter and possible states for each site are A, C, G and T.
3. The possible states for each site are A, C, G and T.
4. For each site of the sequence at the root of T , the state is drawn from a given distribution
(typically uniform).
5. The sites evolve identically and independently (i.i.d.) down the edges of the tree from the root
at a fixed rate u.
6. With each edge e ∈ E we associate a duration t = t(e) and the expected number of mutations per
site along e is uτ (e). The probabilities of change to each of the 3 remaining states are equal.
78
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
How do we “evolve” a sequence down an edge e under the Jukes-Cantor model?
Let v = (v1 , . . . , vn ) and w = (w1 , . . . , wn ) denote the source and target sequences associated with e.
We assume that v has already been determined and we want to determine w.
Under the Jukes-Cantor model, the evolutionary event C3 = nucleotide changes to one of the other
three bases occurs at a fixed rate of u.
Now consider the event C4 = nucleotide changes to any base (including back to the same base). The
rate of occurrences of this event is taken to be 43 u.
4
The probability that event C4 does not occur within time t is: e− 3 ut (Poisson distribution with k = 0).
Given that the event C4 occurs, the probability of ending in any particular base is 14 .
Let P and Q denote the base pair present at a given site before and after a given time period t. The
probability that an event of type C4 occurs and changes P ∈ {A, C, G, T} to base Q within time t is:
4
1
Prob(Q | P, t) =
1 − e− 3 ut .
4
Now, adding the probability that the event C4 does not occur, we obtain the probability that the state
does not change, i.e. P = Q:
1
4
4
4
1
1 − e− 3 ut =
1 + 3e− 3 ut .
Prob(Q = P | P, t) = e− 3 ut +
4
4
Thus, we obtain a probability-of-change formula for the probability of an observable change occurring
at any given site in time t:
4
Prob(change | t) = 1 − Prob(Q = P | P, t) = 1 − 41 1 + 3e− 3 ut
4
= 34 1 − e− 3 ut .
We can use this model to generate artificial sequences along a model tree. E.g., set the sequence
length to 10 and the mutation rate to u = 0.1. Initially, the root node is assigned a random sequence.
Then the sequences are evolved down the tree, at each edge using the probability-of-change formula
to decide whether a given base is to change:
1
a
1
b
1
c
1
d
1
e
1
f
2
ACGTTTAGAG
ACGTTTCGAG
2
1
*
3
ACCTTGCGGG
E.g., the probability of change along the edge marked ∗ is
4
0.75(1 − e− 3 ×0.1×3 ) = 0.75(1 − e−0.4 ) ≈ .247.
5.10
The Jukes-Cantor distance transformation
The Hamming distance between two sequences underestimates their true evolutionary distance (average number of mutations that took place per site over the elapsed time), because it doesn’t count
“back mutations” and multiple mutations at the same site.
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
79
Thus, a correction formula is required. For the Jukes-Cantor model, inversion of the probability-ofchange formula leads to the following:
Definition 5.10.1 (Jukes-Cantor transformation) The Jukes-Cantor distance transformation
for a pair of sequences a and b is defined as: formula:
3
4
JC(a, b) = − ln 1 − Ham(a, b) .
(5.1)
4
3
5.11
More general transformations
The Jukes-Cantor model assumes that all nucleotides occur with the same frequency. This assumption
is relaxed under the Felsenstein 81 model introduced by Joe Felsenstein (1981), which has the following
distance transformation:
F 81(ai , aj ) = −B ln(1 − Ham(ai , aj )/B),
(5.2)
where B = 1 − (πA2 + πC2 + πG2 + πT2 ) and πQ is the frequency of base Q. The base frequency is obtained
from the pair of sequences to be compared, or better, from the complete set of given sequences.
Note that this contains the Jukes-Cantor transformation as a special case with equal base frequencies
πA = πC = πG = πT = 0.25, thus B = 43 .
Unlike the Jukes-Cantor transformation, this transformation can also be applied to protein
P20 sequences,
2
if
all
amino
acids
are
generated
with
the
same
frequency,
and
B
=
1
−
setting B = 19
i=1 πi , in the
20
proportional case.
Given equal base frequencies, but different proportions P and Q of transitions and transversions
between ai and aj , the distance for the Kimura 2 parameter model is computed as:
1
1
1
1
K2P (ai , aj ) = ln
+ ln
.
(5.3)
2
1 − 2P − Q
4
1 − 2Q
If we drop the assumption of equal base frequencies, then we obtain the Felsenstein 84 transformation:
P
(A − B)Q
Q
F 84(ai , aj ) = −2A ln 1 −
−
+ 2(A − B − C) ln 1 −
,
(5.4)
2A
2AC
2C
where A = πC πT /πY +πA πG /πR , B = πC πT +πA πG , and C = πR πY with πY = πC +πT , πR = πA +πG .
Even more general models exist, but we will skip the details. The following figure summarizes the
complete overview of the time reversible models:
80
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
GTR
Equal base frequency
3 substitution parameters
(2 f. Transitions, 1 f. Transv.)
SYM
TrN
2 substitution parametera
(1 f. Transitions, 1 f. Transv.)
HKY
=F84
1 substitution parameter
3 substitution parameters
(1 f. Transitions, 2 f. Transv)
K3SP
Equal base
frequency
2 substitution parameters
(1 f. Transitions, 1 f. Transv)
K2P
F81
Equal base frequency
1 substitution parameter
JC
GTR
TrN
SYM
HKY
K3ST
=
=
=
=
=
“general time reversible”
“Tamura-Nei”
“symmetric”
“Hasegawa-Kishino-Yano”
“Kimura 3-Parameter”
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
5.12
81
Maximum parsimony
In phylogenetic analysis, the Maximum Parsimony problem is to find a phylogenetic tree that explains
a given set of aligned sequences using a minimum number of “evolutionary events”.
The maximum parsimony method is one of the most widely used used tree reconstruction methods.
Suppose we are given a multiple alignment of sequences A of length L and a phylogenetic tree T . The
maximum parsimony method makes two fundamental assumptions:
• Changes at different positions of the sequences are independent.
• Changes in different parts of the tree are independent.
5.12.1
The parsimony score of a tree
The difference between two sequences X = (x1 , . . . , xL ) and Y = (y1 , . . . , yL ) is simply their nonnormalized Hamming distance
d(X, Y ) = |{k | xk 6= yk }|,
that is, the number of positions at which they differ.
Assume we are given a multiple alignment of sequences A = {A1 , A2 , . . . , An } and a corresponding
phylogenetic tree T , leaf-labeled by A.
If we assign a hypothetical ancestor sequence to every internal node in T , then we can obtain a score
for T together with this assignment by summing over all differences d(X, Y ), where X and Y are two
sequences labeling two nodes that are joined by an edge in T .
Example:
AGT
AGA
1
0
CGA
0
CCA
1
AGA
CGA
⇒ score = 3
1
0
CGA
Definition 5.12.1 (Parsimony score of a tree) Suppose we are given a multiple alignment of sequences A of length L and a corresponding phylogenetic tree T leaf-labeled by A. Then the parsimony
score for T and A is defined as:
X
P S(T, A) = min
d(λ(u), λ(v)),
λ
{u,v}∈E
where the minimum is taken over all possible labelings λ of the nodes of T by sequences of length L
that extend the labeling of the leaves by A. .
Small Parsimony problem: The Small Parsimony problem is to compute the parsimony score for
a given tree T .
82
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
5.13
The Fitch algorithm
In 1971 Walter Fitch2 published a dynamic programming algorithm that solves the Small Parsimony
problem efficiently.
Input: A phylogenetic tree T with n leaves. The leaves are labeled by the sequences of an MSA with
n sequences over an alphabet Σ.
Since maximum parsimony assumes indepedence of each column, one can compute the parsimony
score column by column and the final result is given by the sum of the column results.
The Fitch algorithm runs in two passes: The forward pass first computes the parsimony score and the
backward pass then computes a labeling of the internal nodes.
5.13.1
The Fitch algorithm: Forward pass
As the algorithms proceeds column-by-column, we can assume that each leaf v is labeled by a single
character-state c(v).
We will assign a set S(v) of states and a number l(v) to each node v as follows:
S(v) = {c(v)}, l(v) = 0
For each leaf v:
For each inner node v
with children u, w:
S(v) =
S(u) ∩ S(w) if S(u) ∩ S(w) 6= ∅
S(u) ∪ S(w) if S(u) ∩ S(w) = ∅
l(v) = l(u) + l(w)
if S(u) ∩ S(w) 6= ∅
l(v) = l(u) + l(w) + 1 if S(u) ∩ S(w) = ∅
To compute S(v) and l(v) we traverse the tree in postorder or “bottom-up”. The number of changes
required is given by the value of l at the root.
The forward pass algorithm requires O(n) steps for each column and thus O(nL) in total.
5.13.2
The Fitch algorithm: Example
Example: we use the Fitch algorithm to compute the parsimony score for the following labeled tree:
G
5.13.3
A
G
C
C
A
Fitch algorithm: Backward pass
The forward pass algorithm computes the parsimony score. An optimal labeling of the internal nodes
is obtained via traceback: starting at the root node r, we label r using any character in S(r). Then,
2
Fitch, W. (1971) Toward defining the course of evolution: minimum change of specified tree topology. Syst. Zoology
20:406-416
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
83
for each child w, we use the same letter, if it is contained in S(w), otherwise we use any letter in S(w)
as label for w. We then visit the children von w etc.
{A,G}
{G}
{C,G}
{G}
{C}
{A,C}
{G} {C}
{C}
{A}
{A}
G
→
A
G
C
C
A
Again, the algorithm requires O(nL) steps, in total.
5.13.4
The Fitch algorithm: Backward pass
Algorithm 5.13.1 (Backward pass)
Input: A phylogenetic tree T and the set S(v) of each node v in T
Output: The fully labeled tree T
do for all nodes v, starting at the root node, going to the leaves
if v is the root node then
Set c(v) = t with t ∈ S(v)
else
Let w be the parent node of v
if c(w) ∈ S(v) then
Set c(v) = c(w)
else
Set c(v) = t with t ∈ S(v)
end
5.13.5
Sankoff ’s algorithm
In Fitch’s algorithm each mutation is equally weighted independent of the nature of the mutation. For
nucleotides and especially for amino acids the assumption of equal substitution costs for all possible
character changes is inappropriate.
In 1975, David Sankoff3 proposed an alternative to Fitch’s algorithm in which substitutions can be
weighted.
Input:
• a phylogenetic tree T = (V, E) with n leaves
• a character-state c(v) for each leaf v
• a cost matrix C that specifies the cost C(α, β) of changing state α into state β
Forward pass: In a post-order traversal, for each node v and each state α ∈ Σ we will compute
Sα (v) as the minimum cost of the subtree rooted at v, given c(v) = α.
Initialization:
For each leaf node v and state α set
Sα (v) =
3
0, c(v) = α
∞, else.
Sankoff D. (1975) Minimal mutation trees of sequences. SIAM J. Appl. Math. 28:35-42
84
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
Main recursion:
For each internal node v with children u and w recursively compute Sβ (u) and Sβ (w) for all states β.
Then set
Sα (v) = min(Sβ (u) + C(α, β)) + min(Sβ (w) + C(α, β)).
β
β
Result:
The minimal cost of the whole tree is given by
min Sα (ρ),
α
where ρ is the root of the tree.
Example:
Backward pass:
Based on the values of Sv (α) from the forward pass we will now choose an optimal character state α
for each inner node. As in Fitch’s algorithm we start at the root and visit all nodes in a pre-order
traversal.
For the root ρ set:
c(ρ) = arg min Sα (ρ)
α
For every internal node w 6= ρ with parent node v we set:
c(w) = arg min(Sα (w) + C(α, c(v))).
α
Example:
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
5.13.6
85
The Large Parsimony problem
Definition 5.13.2 (Parsimony score of an alignment) Given a multiple alignment A
{a1 , . . . , an }, its parsimony score is defined as
=
P S(A) = min{P S(T, A) | T is a phylogenetic tree on A}.
Definition 5.13.3 (Large Parsimony problem) The Large Parsimony problem is to compute
P S(A).
Theorem 5.13.4 (Parsimony is hard) The Large Parsimony problem is NP-hard.
5.14
Maximum likelihood estimation
Let A = (a1 , . . . , an ) be a multiple sequence alignment of length m on X .
The basic idea of the maximum likelihood estimation method is to compute a phylogenetic tree T ,
together with edge lengths ω, that maximizes the likelihood
P(A | T, ω, M )
that the multiple sequence alignment A was generated on the phylogenetic tree T with edge lengths
ω under the given model of sequence evolution M .
Such a model M specifies how to choose the initial sequence at the root of the tree T and how sequences
evolve along edges of the tree.
For example, the Jukes-Cantor model specifies that at any position i of the root sequence, all four
nucleotides appear with the same probability of 41 . The mutation rate is u.
Moreover, the length ω(e) of any edge e is interpreted as a duration and the probability that an
observable change occurs along an edge of length t is given by the probability-of-change formula:
4
3
P(change | t) =
1 − e− 3 ut .
4
In practice, the models employed are more general than the simple Jukes-Cantor model, usually allowing unequal nucleotide frequencies and different transition probabilities between different nucleotides.
The models are generally time reversible, which implies that the optimal score attainable is independent
of the location of the root.
A main advantage of maximum likelihood estimation is that it provides a systematic framework for
explicitly incorporating assumptions and knowledge about the process that generated the given data.
Although evolutionary models are only a rough estimate of real-world biological evolution, in practice,
maximum likelihood methods have proved to be quite robust to violations of the assumptions made
in the models.
5.14.1
Maximum likelihood estimation is hard
As with maximum parsimony, the main draw-back of this approach is that finding an optimal phylogenetic tree is NP-hard. In fact, the situation is in some sense worse:
Theorem 5.14.1 (Maximum likelihood estimation is hard) For a given phylogenetic tree T ,
the problem of defining edge lengths ω so as to maximize the likelihood P (A | T, ω, M ) is NP-hard.
86
5.14.2
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
Solving maximum likelihood
Assume we are given the following DNA alignment A as
1 2
i
a1 A C . . . G C
a2 A C . . . G C
a3 A C . . . T A
a4 A C . . . T G
input:
G
C
C
G
m
A
G
G
A
...
...
...
...
There are three possible unrooted phylogenetic trees on a set of four taxa:
a1
a3
a1
a2
a1
a2
a2
a4
a3
a4
a4
a3
To simplify the calculation of the maximum likelihood score for such a phylogenetic tree T we root it
by choosing one of its internal nodes to be the root ρ:
(1)
(1)
→
(2)
(2) (3)
C
(4)
C
A
G
(3)
(4)
(5)
(5)
→
(6)
(6)
We will discuss how to compute the maximum likelihood for the first tree of the three unrooted trees.
The other two trees are processed similarly.
Conceptually, we obtain the likelihood L(i) that a specific rooted phylogenetic tree T generated Ai ,
the i-th character of A, by summing over all possible labellings α of the internal nodes by states.
For a bifurcating unrooted phylogenetic tree on n taxa there are n − 2 internal nodes, and so there
are
k (n−2)
possible labellings to be considered, where k is the number of possible character states.
For our example, we need to consider 16 different possibilities:
 C C A G 
 A



L(i) = P Ai + P Ai A
 +
...
+ P Ai  +
...
+ P Ai C
C A G

C

A
C
C A G

G

C
C
C A G

T

T
In general, for a given rooted phylogenetic tree T and model M , the likelihood of T given a single
character Ai is defined as
L(i) = P(A
P i |T, ω, M )
=
P(Ai , α|T, ω, M )
Pα Q
=
P(α(ρ))
P
(α(v),
α(w))
,
e
α
e={v,w}
where the summation is over all possible labellings α of the internal nodes, P(α(ρ)) is the probability
of generating the state α(ρ) at the root and Pe (α(v), α(w)) is the probability of observing the two
states α(v) and α(w) at the two ends of an edge e = {v, w}.
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
87
The total likelihood that T generated the multiple sequence alignment A is obtained as the product
of the likelihoods for all positions of A:
L(T ) = L(1) · L(2) · · · · · L(m) =
m
Y
L(i).
i=1
The computation of L(i) is actually more complicated than we have described so far, because not only
do we have to consider all possible rooted phylogenetic trees on n taxa, but we must also consider all
possible ways of defining edge lengths ω on each such tree.
Assume that we have a rooted phylogenetic tree T with edge lengths ω, we will now discuss in more
detail how to calculate
Y
Pe (α(v), α(w))
P (Ai , α | T, ω, M ) = P(α(ρ))
e={v,w}
for the case of the Jukes-Cantor model with mutation rate u.
Under this model, the probability of observing any particular nucleotide at the root node is 41 . The
probability of observing two different states α(v) and α(w) at opposite ends of an edge e of length
t = ω(e) is given by
4
3
P (α(v) 6= α(w) | T ) = (1 − e− 3 ut ),
4
whereas the probability of observing the same state at opposite nodes of e equals
4
1
1 − P (α(v) = α(w) | T ) = (1 + 3e− 3 ut ).
4
For example, the value of
C C A G 

 A

P
A
i


A

is given by
P (A)Pe1 (A, C)Pe2 (A, C)Pe3 (A, A)Pe4 (A, A)Pe5 (A, G)
=
33
(1
46
4
4
4
4
4
− e− 3 ut1 )(1 − e− 3 ut2 )(1 + 3e− 3 ut3 )(1 + 3e− 3 ut4 )(1 − e− 3 ut5 ),
where e1 , . . . , e5 are the five edges in T and ti = ω(ei ) for i = 1, . . . , 5.
5.14.3
Felsenstein’s algorithm
One can efficiently compute the likelihood for a given bifurcating, rooted phylogenetic tree T with
edge lengths ω using Felsenstein’s algorithm (Felsenstein, 1981).
In this approach, for each node v in T and every possible character state a, in a post-order traversal
of T , one recursively computes the conditional likelihood Lv (a) for the subtree Tv rooted at v, under
the condition that the state of v is a, that is, we have α(v) = a.
More precisely, for any internal node v with children w1 and w2 , the main recursion is
!
!
X
X
Lv (a) =
P{v,w1 } (a, b)Lw1 (b)
P{v,w2 } (a, b)Lw2 (b) ,
b
b
88
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
where summation is over all possible states b and, as above, P{v,w} (a, b) is the probability of observing
states a and b at the two ends of the edge {v, w} under the given model M .
To initialize the computation, for each leaf v and every symbol a, we set Lv (a) = 1, if the node v is
labeled by the character state a, and Lv (a) = 0, else.
5.14.4
Summary
As mentioned above, we must consider all possible choices of edge lengths ω. Determining an optimal
choice of ω is in general a computationally hard problem. One possible heuristic is to optimize the
length of each edge in turn, while keeping all other edge lengths fixed.
In summary, for a given multiple sequence alignment A on X , in maximum likelihood estimation, a
result is obtained by searching through all possible phylogenetic trees T and for each such tree T ,
computing the maximum likelihood score. The method returns the tree that attains the best score.
5.15
Quartet puzzling
Quartet puzzling
4
is a heuristic for approximating the maximum likelihood (MLE) tree.
It is based on the idea of computing the MLE trees for all quartets of taxa and then using many
rounds of combination steps in which one attempts to fit a random set of quartet trees together so as
to obtain a tree on the whole dataset.
Assume we are given a multiple alignment of sequences A = {a1 , . . . , an }. The quartet puzzling
heuristic proceeds in three steps:
1. For each quartet q of four distinct taxa ai , aj , ak , al ∈ A compute the maximum likelihood tree
Tq . This step is called maximum likelihood step.
2. Choose a random order ai1 , ai2 , . . . , ain of the taxa and then use it to guide a combining or puzzle
step that produces a tree on the whole data set from quartet trees.
3. Repeat step (2) many times to obtain trees T1 , T2 , . . . , TR . Report the majority consensus tree
T.
5.15.1
Computing all quartet trees
Given four taxa {a1 , a2 , a3 , a4 }, there are three possible bifurcating topologies:
a1
a3
a1
a2
a1
a2
a2
a4
a3
a4
a4
a3
{{a1 , a2 }, {a3 , a4 }}
{{a1 , a3 }, {a2 , a4 }}
{{a1 , a4 }, {a2 , a3 }}
Now, let T be any phylogenetic tree that contains leaves labeled a1 , a2 , a3 , a4 , We say that T induces
the split or tree topology {{a1 , a2 }, {a3 , a4 }}, if there exists an edge in T that separates the nodes
labeled a1 and a2 from the nodes a3 and a4 .
We will sometimes abbreviate {{a1 , a2 }, {a3 , a4 }} by a1 , a2 | a3 , a4 .
Assume we are given this tree T :
4
Korbinian Strimmer and Arndt von Haeseler (1996) Quartet-puzzling: a quartet maximum-likelihood method for
reconstructing tree topologies. Mol Biol Evol 13, 964-969.
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
a1
89
a5
a2
a4
a3
This tree induces the following five quartet tree topologies:
a1
a3
a1
a3
a1
a4
a1
a4
a2
a4
a2
a5
a2
a2
a5
a3
a5
a4
a3
a5
It is easy to reconstruct the original tree from its set of quartets.
However, in the presence of false quartet topologies (as inevitably produced by any phylogenetic
reconstruction method), the reconstruction of the original tree can become difficult.
In quartet puzzling, quartets are greedily combined to produce a tree for the whole data set. In an
attempt to avoid the usual problems of a greedy approach, the combining step is run many times using
different random orderings of the taxa.
5.15.2
The puzzle step
The puzzle step is initialized with the maximum likelihood tree for {a1 , a2 , a3 , a4 }, e.g.:
a1
0 0
a2 0
0
a3
0
a4
Each edge e is labeled with a number of conflicts c(e) that is originally set to 0.
In the puzzle step, the task is to decide how to insert the next taxon ai+1 into the tree Ti already
constructed for taxa {a1 , . . . , ai }. We proceed as follows:
For each edge e in Ti we consider what would happen if we were to attach taxon ai+1 into the middle
of e by a new edge:
We consider all quartets consisting of the taxon ai+1 and three taxa from {a1 , a2 , . . . , ai }. For each
such quartet q we compare the precomputed maximum likelihood tree Tq with the restriction of Ti to
q, denoted by Ti |q . If these two trees differ, then we increment c(e) by one.
Finally, we attach e to an edge f in Ti that has a minimum number of conflicts c(f ), thereby obtaining
Ti+1 .
Example: We illustrate the puzzle step using the five taxon tree shown above. For each of the five
quartets we compute the respective ML tree. Assume that these are indeed
a1
a3
a1
a3
a1
a4
a1
a4
a2
a4
a2
a5
a2
a2
a5
a3
a5
a3
a4
a5
We start with the maximum likelihood tree on {a1 , a2 , a3 , a4 }. The puzzle step for a5 proceeds as
90
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
follows:
a1
0 0
Initial tree:
a2 0
a1
−→
a1 , a3 | a4 , a5
2 1
a2 2
0
a3
0
a4
1
a3
0
a4
a1
1 0
a2 1
a1
3 2
−→
a2 , a3 | a4 , a5
a2 3
0
a3
0
a4
2
a3
0
a4
a3
4
−→
a1 , a2 | a3 , a5
a1
−→
a1 , a2 | a4 , a5
2
2
0
4
a2
a4
Hence, the puzzle step suggests that taxon a5 should be attached in the edge adjacent to the leaf
labeled a4 to obtain the tree
a1
a5
a2
a3
a4
In geneal, this procedure is repeated for different orders of taxa many times (say, 1000 for a large
dataset).
Finally a consensus of all computed trees is derived as the final result.
5.15.3
The quartet puzzling algorithm
Algorithm 5.15.1 (Quartet puzzling)
Input: Aligned sequences A = {a1 , . . . , an }, number R of trees
Output: Phylogenetic trees T 1 , T 2 , . . . , T R on A
Initialization: Compute an MLE tree tq for each quartet q ⊆ A
for r = 1 to R do
Randomly permute the order of the sequences in A
Set T4 equal to the precomputed MLE tree on {a1 , a2 , a3 , a4 }
for i = 5 to n do
Set c(e) = 0 for all edges in Ti−1
for each quartet q ⊆ {a1 , . . . , ai } with ai ∈ q do
Let x, y | z, ai be the MLE tree topology on q
Let p denote the path from x to y in Ti−1
Increment c(e) by 1 for each edge e that is contained
in p or that is separated from z by p
Choose any edge f for which c(f ) is minimal
Obtain Ti from Ti−1 by attaching a new leaf with label ai to f
Output Tn
end
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
5.16
91
Bayesian methods
Bayesian inference of phylogenetic trees is a method whose goal is to obtain a quantity called the
posterior probability of a phylogenetic tree.
Generally speaking, the posterior probability of a result is the conditional probability of the result
being observed, computed after seeing a given input dataset. The prior probability of a result is the
marginal probability of the result being observed, set before the input dataset is examined.
In other words, the prior probability distribution of phylogenetic trees on X reflects our belief of
the distribution of all possible phylogenetic trees on X . This distribution is determined in advance,
without knowledge of the actual sequence data to be examined.
Let A be a multiple sequence alignment and T a phylogenetic tree with edge lengths ω, on X . Also,
we will assume that some fixed model of evolution M is given. The posterior probability of T can be
obtained from the prior probability of T with the help of the likelihood P (A | T ) = P (A | T, ω, M )
using Bayes’ Theorem:
Theorem 5.16.1 (Bayes’ Theorem)
P (T | A) =
P (A | T ) × P (T )
.
P (A)
The posterior probability P (T | A) can be interpreted as the probability that the tree T is correct,
given the data. In the simplest case, all trees on X are considered to be equally probable and then the
prior distribution is called a flat prior. In the case of a flat prior, the posterior probability P (T | A) is
proportional to the likelihood P (A | T ) (computed under the model M ) and the maximum likelihood
tree also has maximum posterior probability.
The maximum posterior probability over all trees is usually very low, if we think of tree space as a
discrete sample space.
In Bayesian tree inference the goal is not to determine a single phylogentic tree of maximum posterior
probability, but rather to compute a sample of phylogenetic trees according to the posterior probability
distribution of trees.
Such a sample of trees is then processed further, for example, to generate a single “consensus tree”
(as described later) and then to label the clusters in such a tree by their posterior probabilities.
The expression for the posterior probability given by Bayes’ Theorem looks quite simple and we can
efficiently compute P (A | T ) using Felsensteins algorithm, as described in in the context of maximum
likelihood.
However, it is usually impossible to solve the complete expression analytically, as computation of the
normalizing factor P (A) usually involves summing over all possible phylogenetic tree topologies T 0
and integrating over all possible edge lengths ω and model parameters µ (of the given model M ):
P (T | A) =
P (A | T )P (T )
P (A | T )P (T )
=P
,
0
0
P (A)
T 0 P (A | T )P (T )
where
0
Z Z
P (A | T ) =
ω
P (A | T 0 , ω, µ)P (ω, µ)dµdω.
µ
To avoid this problem, Bayesian inference employs the Markov chain Monte Carlo (MCMC) approach,
which is based on the idea of sampling from the posterior probability distribution using a suitably
constructed chain of results.
92
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
This is a general approach that is used in a wide range of applications. In the context of phylogenetic
tree reconstruction, the MCMC approach constructs a chain of phylogenetic trees.
At each step, a modification of the current phylogenetic tree is proposed and then a probabilistic
decision is made whether to “accept” the newly proposed phylogenetic tree or whether to keep the
current one.
5.16.1
Metropolis-Hastings algorithm
Algorithm 5.16.2 (Metropolis-Hastings) Let Ti be the phylogenetic tree at position i of the
MCMC chain and let T 0 be a proposed modification of Ti . Using what is called the Metropolis-Hastings
algorithm, the decision whether to accept T 0 or to keep Ti is based on the ratio of the posterior probabilities of the two trees, which is given by
R=
P (T 0 | A)
P (A | T 0 )P (T 0 )P (A)
P (A | T 0 )P (T 0 )
=
=
.
P (Ti | A)
P (A | Ti )P (Ti )P (A)
P (A | Ti )P (Ti )
Note that the normalizing factor P (A), which is so difficult to compute, cancels out and this expression
is much easier to calculate than the actual posterior probability of either Ti or T 0 .
Based on R, we accept the proposed phylogenetic tree T 0 with probability min(1, R).
In other words, we always accept T 0 , if it has a better posterior probability than Ti , and otherwise we
draw a random number p uniformly from the interval [0, 1] and accept the tree T 0 if p < R, and reject
T 0 , otherwise.
Acceptance means that we set Ti+1 = T 0 , whereas rejection implies Ti+1 = Ti .
The aim is to produce an MCMC chain M of bifurcating phylogenetic trees that has the property
that the stationary distribution of the trees (including the parameters ω and µ) in the chain equals
the posterior probability distribution of the trees (again, including ω and µ).
One can show that, by construction, the decision rule described above implies that the number of
times that a phylogenetic tree T occurs in M is a valid approximation of the posterior probability of
that tree.
5.16.2
Practical issues
For this approach to work well in practice, a number of issues must be taken into account.
• First, the mechanism for modifying the current phylogenetic tree must be designed with care.
On the one hand, the suggested changes should be quite small so as to avoid a drastic drop
in posterior probability. On the other hand, the modifications should be large enough so as to
ensure that M is able to explore a sufficient portion of parameter space in a reasonable amount
of time.
• Second, as the MCMC chain M starts at a random location in parameter space, usually of very
low posterior probability, the chain must be allowed to burn in, that is, it must be given time to
reach stationarity, a state in which the chain is actually sampling from the posterior probability.
In practice, the first 10,000 elements of M, say, are excluded from further analysis.
• Third, trees that are close to each other in the MCMC chain M are highly correlated by construction, as they differ only by small modifications. To avoid this problem of autocorrelation,
the final set of output trees is obtained by sparsely sampling M, retaining only every 1,000-th
tree, say.
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
5.16.3
93
The local algorithm
What types of modifications are used in practice? Branch-swapping methods such as NNI (nearest
neighbor interchange), SPR (subtree prune and regraft) and TBR (tree bisection and reattachment)
are used in the context of maximum parsimony and maximum likelihood methods to search through
the space of phylogenetic tree topologies. For the purposes of constructing an MCMC chain we need
to couple such an approach with a method for proposing modifications of edge lengths and parameters
of the evolutionary model.
We will now describe one approach that is used in practice called the local algorithm, which is a
modification of the NNI method. The algorithm starts by choosing an internal edge (v, w) of the
current phylogenetic tree Ti at random. As we assume that Ti is bifurcating, both of the nodes v and
w are each connected to two further nodes, which we will randomly label a and b for v, and c and d
for w, respectively. Let x, y and z denote the distance from a to v, w and d, respectively, as obtained
by adding the edge lengths ω of the edges from a to v, v to w, and w to d, appropriately.
Here we show a phylogenetic tree before and after a modification by the local algorithm:
0
x
y
z
a
v
w
d
b
c
0
−→
y'
x'
z'
a w
v
d
c
b
The distance z from node a to node d is modified by a randomly chosen factor to a new value of z 0 .
In this example, the node w is moved by a random amount along the path from a to d, whereas the
relative position of node v on the path from a to d is not changed.
5.16.4
Details of the local algorithm
The local algorithm changes the length of the path from a to d and modifies the positions of v and w
in that path. Here are some details:
• First define the new distance from a to d as z 0 = z × ep−0.5 , where p is a random number
uniformly chosen from the interval (0, 1). The lengths of edges on the path from a to d will be
modified so that the distance from a to d will be z 0 .
• Second, randomly decide with equal odds whether node v or node w is to be moved by a random
amount. If we decide to move node v, then define x0 = q × z 0 and y 0 = y × z 0 /z. Otherwise,
define x0 = x × z 0 /z and y 0 = q × z 0 . In both cases, q is a random number uniformly chosen from
the interval (0, 1).
• Third, to reposition the nodes v and w according to the values just chosen, remove the two nodes
v and w from the path connecting a to d and then reinsert them into the path at distances x0
and y 0 from a, respectively.
If x0 < y 0 , then the local algorithm does not result in a change of the current tree topology. In
94
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
this case, only the lengths of the edges from a to v, v to w, and w to d change. In more detail,
we set ω(a, v) = x0 , ω(v, w) = y 0 − x0 and ω(w, d) = z 0 − y 0 .
If x0 > y 0 , then the modifications result is a new topology in which a and w, w and v, and v and
d, respectively, are both connected by an edge. In this case, the edge lengths are ω(a, w) = y 0 ,
ω(w, v) = x0 − y 0 and ω(v, d) = z 0 − x0 .
Simple parameters of the evolutionary model, such as the mutation rate µ in the Jukes-Cantor model,
are modified by adding a random number uniformly chosen from an appropriate interval centered at
0. For a set of parameters that is constrained to sum to some specific value, such as 1 in the case of
probabilities, the values are randomly modified according to a so-called “Dirichlet distribution”.
A proposal for modifying the tree and a proposal for new values for all parameters of the specified
evolutionary model are generated independently and simultaneously, and then accepted or rejected
together in in a single application of the ratio-based decision step of the Metropolis-Hasting algorithm.
5.16.5
Metropolis-coupled Markov chain Monte Carlo
It is important that the MCMC chain M is given the opportunity to visit all relevant parts of parameter
space. In consequence, the larger the number of taxa is, and the more parameters the evolutionary
model M has, the longer the chain must be run.
A popular approach for exploring more of parameter space in less time is to use a variant of MCMC
called Metropolis-coupled Markov chain Monte Carlo, or (MC 3 ), for short. In this approach k different
MCMC chains M1 , . . . , Mk are run in parallel. The first chain M1 is called the cold chain and this
is the one from which inferences will be made. The other k − 1 chains are called heated chains. Each
of the heated chains aims at sampling from a posterior probability distribution raised to a power of
β > 1. Heating a chain has the effect of “flattening out” the probability landscape and the hope is
that a heated chain will be able to move more freely through parameter space.
For example, in a strategy called incremental heating, the i-th chain Mi aims at sampling from the
distribution P (T | A)β(i) , with
1
β(i) =
,
1 + (i − 1)t
where t is a user-specified heating parameter.
After all k chains have moved one step, a swap of states is proposed between two randomly chosen
chains Mi and Mj , using a Metropolis-like algorithm. Let Ti denote the current state of chain Mi ,
for all i = 1, . . . , k. (Here, by current state we mean the total state consisting of the current tree T ,
the current length parameters ω and the current model parameters µ.) A swap between Mi and Mj
is accepted with probability
P (Tj | A)β(i) P (Ti | A)β(j)
α=
.
P (Ti | A)β(i) P (Tj | A)β(j)
The main idea is that the heated chains act as “scouts” exploring the parameter space and that
swapping will occasionally help the cold chain to move to a new part of the space which it would
otherwise have difficulty reaching because of a deep valley in the original distribution. At the end of
the run, the heated chains are discarded and the cold chain is then processed as described above.
5.16.6
Outlook
A major challenge is how do we determine whether an MCMC chain M has been run long enough so
that that chain has converged, that is, it provides a good approximation of the posterior distribution
of trees. A simple approach is to monitor the likelihood of the phylogenetic trees added to M and to
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
95
assume that convergence has occurred once there is no further increase in likelihood. A more reliable
approach may be to use multiple MCMC chains in parallel and to keep them running until they appear
to be sampling from the same distribution.
One of the main attractions of Bayesian inference is that it produces a distribution of phylogenetic
trees rather than a point estimate, in many cases using less time than a maximum-likelihood analysis
followed by a bootstrap analysis. Areas of on-going research are how to ensure convergence and also
how to choose appropriate prior distributions.
96
5.17
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
Consensus trees
In most practical phylogenetic studies, a whole collection of different phylogenetic trees is computed
for the set of taxa under investigation.
These might be different phylogenetic “gene trees” that were obtained from a number of different gene
sequences in a multi-gene study.
Or, they might be different phylogenetic trees obtained from the same multiple sequence alignment
using different inference methods, such as maximum-parsimony, maximum-likelihood and so forth.
Yet another possibility is that the collection of phylogenetic trees was obtained using a method that
produces multiple trees, such as Bayesian inference.
In all such cases, one may want to treat the different phylogenetic trees as different estimations of the
same underlying “true” tree and then a “consensus method” can be used to construct a “consensus
tree” that represents those parts of the evolutionary history on which the different phylogenetic trees
agree on, in some sense.
Although consensus methods can be formulated for rooted phylogenetic trees, such methods are less
commonly used, probably because most tree inference methods produce unrooted trees, and also
perhaps because the location of the roots in the input trees may affect the resulting consensus tree
too much.
We will discuss the two most important consensus methods for unrooted phylogenetic trees, namely
the strict consensus and the majority consensus. But first we must introduce the important concept
of a split.
5.17.1
Trees and splits
Any edge e of a phylogenetic tree T defines a split S = A|A¯ on X, that is, a bipartition of X into two
non-empty sets A and A, consisting of all taxa on the one side and the other side of e, respectively.
t1
t8
t2
For example:
t3
e
t4
t7
t6
t5
Here, A = {t3 , t4 , t5 } and A = {t1 , t2 , t6 , t7 , t8 }.
We will use Σ(T ) to denote the split encoding of T , i.e. the set of all splits obtained from T .
Here is an unrooted phylogenetic tree and its split encoding:
b
a
c
e
d
(a) Unrooted tree T
{a} | {b, c, d, e}
{b} | {a, c, d, e}
{c} | {a, b, d, e}
{d} | {a, b, c, e}
{e} | {a, b, c, d}
{a, b} | {c, d, e}
{a, b, e} | {c, d}
(b) Split encoding Σ(T ).
For a given set tree it is easy to write down all splits. But, given a set of splits, how to decide whether
it comes from some tree?
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
97
Definition 5.17.1 (Compatible splits) Two split S1 = A1 |A¯1 and S2 = A2 |A¯2 on X are called
compatible, if one of the folowing four possible intersections of split parts is empty:
A1 ∩ A2 , A1 ∩ A¯2 , A¯1 ∩ A2 or A¯1 ∩ A¯2 .
A set of splits Σ on X is called compatible, if all pairs of splits in Σ are compatible.
Theorem 5.17.2 (Buneman, 1971) A set of splits Σ on X (that contains all trivial splits) is compatible, if and only if there exists a phylogenetic tree T on X with Σ = Σ(T ).
5.17.2
Strict consensus
Let X be a set of taxa and assume that we are given a collection of phylogenetic trees T = (T1 , . . . , Tk )
on X . Let Σstrict denote the set of all splits that occur in every input tree, that is, we define
\
Σstrict =
Σ(T ).
T ∈T
We call this the set of strict consensus splits of the collection of phylogenetic trees T . The strict
consensus tree is the (unique) phylogenetic tree Tstrict for which Σ(Tstrict ) = Σstrict holds. It can be
obtained from any tree in T by contracting every edge whose split is not contained in Σstrict .
In practice, the strict consensus tree may be quite unresolved, often being very close to the “star tree”,
which consists of one central node incident to all leaf edges. The “majority consensus” is usually more
informative.
5.17.3
Majority consensus
Let Σmajority denote the set of majority consensus splits, that is, the set of all splits that occur in
more than half of all trees in T , formally defined as
k
Σmajority = S : |{T ∈ T : S ∈ Σ(T )}| >
.
2
By the pigeon-hole principle, any two splits S1 and S2 that are contained in Σmajority will also occur
in some input tree together, and thus they are compatible. Thus, there exists a unique phylogenetic
tree Tmajority corresponding to Σmajority , which is called the majority consensus tree.
Here are four unrooted phylogenetic trees on X = {a, b, c, d, e, f } and the corresponding strict and
majority consensus trees:
c
d
d
d
c
c
d
c
e
b
e b
e b
b
e
f
a
f
f
a
a
a
f
(a) Tree T1
(b) Tree T2
c
(c) Tree T3
d
(d) Tree T4
c
d
e
e
b
b
f
a
(e) Strict consensus Tstrict
a
f
(f) Majority consensus Tmajority
98
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
5.18
Rooted phylogenetic networks
Rooted phylogenetic trees can be generalized to “rooted phylogenetic networks”:
Definition 5.18.1 (Rooted phylogenetic network) Let X be a set of taxa. A rooted phylogenetic
network N = (V, E, λ) on X consists of a directed graph G = (V, E) and a node labeling λ : X → V
such that:
1. The graph G is a directed, acyclic graph (DAG).
2. The graph G is rooted, that is, it has precisely one node ρ of indegree 0.
3. The node labeling λ assigns a label to each leaf of G.
We will usually assume that the network does not contain any node that is suppressible, in other
words, is an unlabeled node with indegree 1 and outdegree 1. In addition we will usually assume that
the taxon labeling λ is a bijection between the taxon set X and the set of leaves of N .
Note: a rooted DAG is always connected.
In the the rooted phylogenetic network N depicted below, the leaves are labeled by taxa a,. . . ,i, the
root node is ρ, there are two reticulate nodes labeled r and s, and all other nodes are tree nodes.
Reticulate edges are shown in gray. The node labeled v is suppressible. All tree nodes except the node
labeled w are bifurcations. The reticulation node r is a bicombination, whereas s is not.
½
v
w
s
r
a
b
c
d
e
f
g
h
i
Here is a summary of all the types of nodes and edges that we would like to distinguish among:
Definition 5.18.2 (Types of nodes and edges) Let N be a rooted phylogenetic network on a
taxon set X .
1. The one node with indegree 0 is called the root.
2. A node with outdegree 0 is called a leaf node.
3. A node with at most one in-edge is called a tree node.
4. Any edge leading to a tree node is called a tree edge.
5. A node with at least two in-edges is called a reticulate node or a reticulation.
6. Any edge leading to a reticulate node is called a reticulate edge.
7. A node with exactly two out-edges is called a bifurcation.
8. A node with exactly two in-edges is called a bicombination.
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
5.19
99
Representing trees in rooted networks
Let T = {T1 , T2 , . . . , Tm } be a collection of rooted phylogenetic trees on some taxon set X . For
example, these might be the result of a multiple gene analysis or they might be different estimations
of a single gene-based phylogeny.
In practice, such a collection of trees is often combined into a consensus tree T¯ that is interpreted as
representing the parts of the evolutionary history upon which the different trees agree.
An alternative idea is to use a rooted phylogenetic network N to represent T in the following sense:
Definition 5.19.1 (Rooted trees represented by a rooted network) Let N be a rooted phylogenetic network on a taxon set X . We say that a rooted phylogenetic tree T on X is represented by
N , or that T is displayed by N , if T is embedded in N . We will use T (N ) to denote the set of all
rooted phylogenetic trees on X that are represented by N .
Below we show two rooted phylogenetic trees T1 and T2 , and then two different phylogenetic networks
N1 and N2 that both represent the two trees:
(a) Tree T1 :
a
b
c
d
(b) Tree T2 :
e
f
a
(c) Network N1 :
b
c
d
e
f
(d) Network N2 :
r
a
5.20
b
c
d
e
f
a
b
c
d
e
f
Recombination networks
A recombination network is a rooted phylogenetic network that is used to describe the evolution of a set
of closely related sequences (usually from different individuals of a population) in terms of mutations
(along edges of the network), speciation or duplication events (at tree nodes) and recombination events
(at reticulate nodes).
In a simple recombination event, two sequences a and b of the same length L give rise to a new
sequence c of length L that consists of a prefix of the sequence a and a suffix of the sequence b (or
vice-versa). This is known as a single crossover event:
a ACGTTGGCC AGTTGA
b ccattgaat gtttca
c ACGTTGGCC gtttca
In a multiple crossover event, the resulting sequence c consists of a mosaic of segments from a and b.
In either type of event, the character state ci is either ai or bi for every position i = 1, . . . , L of the
sequence c.
100
5.20.1
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
Definition of a recombination network
Definition 5.20.1 (Recombination network) Let M be a multiple alignment of sequences of
length L, on X . A recombination network N representing M is given by a bicombining rooted phylogenetic network on X , together with two additional labellings:
1. Each node v of N is labeled by a sequence σ(v) of length L.
2. Each tree edge e is labeled by a set of positions δ(e) ⊆ {1, . . . , L}.
These two labellings must fulfill the following compatibility conditions:
1. The sequence σ(v) assigned to any leaf v must equal the sequence in M that is given for the
taxon associated with v.
2. If r is a reticulate node (often called a recombination node in this context) with parents v and w,
then the sequence σ(r) must be obtainable from the two sequences σ(v) and σ(w) by a crossover.
3. If e = (v, w) is a tree edge, then the set of positions at which the two sequences σ(v) and σ(w)
differ must equal δ(e).
For computational reasons, the following condition is usually also required:
(iv) Any given position may mutate at most once in the network. In other words, for any given
position i there exists at most one edge e with i ∈ δ(e).
The latter condition is usually referred to as the infinite sites assumption because for sequences of
infinite length it holds that the probability of the same site being hit by a mutation more than once
is zero, under a uniform distribution.
In population studies one sometimes extends the definition of a recombination network to allow more
than one root. This is necessary, for example, when different lineages are traced back to a founding
population. However, we will not explore this further.
Below we show (a) a multiple alignment of sequences of length five on X = {a, . . . , e} and (b) a
recombination network N for M . For each node v we show the sequence σ(v) and for each edge e we
show the set of positions δ(e). In this example, that set is either empty (unlabeled edges) or contains
only one position (from 1 to 5):
00000
1
a 10000
b 11000
c 11100
d 00110
e 00111
3
10000
2
11000
00100
r
4
00110
11100
5
(a) Alignment M
a
b
c
d
e
10000
11000
11100
00110
00111
(b) Recombination network N
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
5.20.2
101
The local-tree parsimony approach
Let M be a multiple alignment of sequences of length L and let N be a corresponding recombination
network, on X .
Consider a position i in the alignment. For any particular reticulate node r in N the character state
at position i of the sequence σ(r) is copied from exactly one of the two parents of r. In consequence,
each individual character in M evolves along some rooted phylogenetic tree embedded in N .
In the recombination network shown above, the first two characters evolve along the tree in which the
leaf labeled c is a child of the node labeled 11000, whereas the last three characters evolve along the
tree in which the leaf labeled c is a child of the node labeled 00100.
So, in consequence, the evolutionary history of any sufficiently short segment of sequence is a rooted
phylogenetic tree. This insight leads to the following approach:
1. Determine a suitable rooted phylogenetic tree Ti for each position i in the alignment M . We
will refer to any such tree Ti as a local tree.
2. Combine all the trees into a suitable rooted phylogenetic network N .
When doing this, there is trade-off to be made between the number of incompatibilities between a
character and its associated local tree on the one hand, and the “recombination cost ” of switching
from one tree topology to a different when going from position i to i + 1, on the other.
The first part can be formulated as a dynamic programming problem (Hein, 1993). For this, define
a local-tree graph G = (V, E) as follows. For each position i in M and each possible rooted tree T
on X let v(i, T ) be a node in V . Any two nodes v(i, T ) and v(i + 1, T 0 ) at adjacent positions in the
alignment are connected by a directed edge e = (v(i, T ), v(i + 1, T 0 )):
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
(1,T1 )
(11,T2)
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
Each node in G is assigned a weight ω(v(i, T )) equal to the minimum number of substitutions required
for the i-th character on the rooted tree T . Each edge e = (v(i, T ), v(i + 1, T 0 )) is assigned a weight
ω(v(i, T ), v(i + 1, T 0 )) = ω(e) that reflects the “recombination distance” between T and T 0 .
Now, in the local-tree graph G any path
P = (v(1, T1 ), v(2, T2 ), . . . , v(L, TL ))
that starts at some node v(1, T1 ) and ends at some node v(L, TL ) selects a rooted phylogenetic tree
Ti for each position i in M .
The total weight of such a path P is obtained by adding all node weights and all edge weights along
the path:
L
L−1
X
X
W (P ) =
ω(v(i, Ti )) +
ω(v(i, Ti ), v(i + 1, Ti+1 )).
i=1
i=1
102
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
A most parsimonious set of trees for M is given by a path P of minimum weight W (P ), which can be
computed by standard dynamic programming using the following recursion:

 minT 0 {W (i − 1, T 0 ) + ω((v(i − 1, T 0 ), v(i, T ))) + ω(v(i, T ))},
if i > 1,
W (i, T ) =

ω(v(1, T )), else.
To complete the recursion we need to specify how to determine the weight of a node ω(v(i, T )) and
the weight of an edge ω(v(i − 1, T 0 ), v(i + 1, T )):
• The weight of a node is simply the parsimony score for the i-th character on the tree T , which
can be efficiently computed using the Sankoff algorithm.
• The weight of an edge is the minimum number of reticulation nodes in a rooted phylogenetic
network that contains both trees T and T 0 . This problem is NP-hard problem.
The described approach is not practical for three reasons.
First, the number of possible rooted phylogenetic trees for n taxa is (2n − 3)!! and thus is very large
even for relatively small values of n.
Second, the computation of each edge weight in the graph is NP-hard.
Finally, the second part of the approach, computation of a suitable rooted phylogenetic network that
represents all the trees, is an NP-hard problem.
Although the approach is not practical, it is conceptually appealing because it explicitly addresses
the mosaic nature of aligned sequences: A long multiple sequence alignment consists of stretches of
sequence that have evolved along a common rooted phylogenetic tree and these stretches are separated
by crossover positions at which recombinations have occurred.
Here is an illustration of the approach:
1 2 3
a 1 0 0
b 1 1 0
c 1 1 0
d 0 0 1
e 0 0 1
...
...
...
...
...
...
4 5 6 7 8
0 1 1 0 0
0 0 1 0 0
0 0 0 1 0
1 1 0 1 1
1 1 0 1 1
(a) Alignment M
...
...
...
...
...
...
...
...
9
0
0
1
1
1
...
10
1
1
0
0
0
...
...
11
1
1
1
0
0
...
...
...
...
(1,T1 )
(11,T2)
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
(b) Local-trees graph G
The path highlighted in bold chooses the tree T1 for the first five positions of M and then the tree T2
for the remaining six positions:
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
(c) Tree T1 :
a
b
c
103
(d) Tree T2 :
d
a
e
b
c
d
e
(e) The resulting recombination network N :
a
b
c
d
e
Explaination: For the given multiple alignment M of sequences of length 11 on X = {a, b, c, d, e}
we indicate the corresponding local-tree graph G. The graph has 11 columns of nodes, each column
containing one node for each of the possible rooted phylogenetic trees on X . All nodes of a column are
connected to all nodes of the next column by edges. The nodes and edges are weighted as described
above. In this example the first five characters of M are all compatible with the tree T1 , whereas
the last six characters are all compatible with the tree T2 , and so in this case a path through G that
chooses T1 for the first five positions and then T2 for the remaining six positions is optimal. The
depicted rooted phylogenetic network N contains both T1 and T2 and is a recombination network for
M . More precisely, N is a recombination network if one adds the appropriate labeling of the nodes
by sequences and edges by mutations.
5.20.3
A heuristic for the local-tree approach
Let M be a multiple alignment of sequences of length L on X . We now look at a heuristic for computing
a recombination network N for M that is based on the ideas just discussed (Hein, 1993).
A first simplification is to use unrooted phylogenetic trees rather than rooted ones.
The main simplification is to consider only a small part of the local-tree graph.
The heuristic starts by computing a phylogenetic tree T from the whole dataset M using the maximum
parsimony method. This tree will be used as a start tree and is assigned to position 1 of the alignment.
Next, we compute the SPR-neighborhood S of T consisting of T and all phylogenetic trees that are
exactly one SPR modification away from T .
The main recursion of the previous section is repeatedly applied to the local-tree graph restricted to
nodes that contain a tree from S until a position is reached at which a jump to a different tree T 0 in
S is indicated by the recursion. We then determine the SPR-neighborhood S 0 of T 0 and continue the
calculation with T 0 and S 0 in place of T and S.
The heuristic algorithm considers only those trees in the local-tree graph that differ at most by one
“SPR” from the current tree:
104
Bioinformatics I, WS’09-10, D. Huson, December 1, 2009
...
...
...
...
...
...
...
...
...
...
...
(1,T1 )
(11,T2)
...
...
...
...
...
...
...
...
...
...
...
The resulting set of trees might depend quite strongly on the initial tree used. Hence, after completing
the first pass of the algorithm, a second pass is performed in the opposite direction, using the local
tree that was previously chosen for the last position of the alignment as the new start tree. This
should be repeated a couple of times to completely erase any dependency on the initial choice of start
tree. The final pass produces a set of local trees and breakpoint positions along the sequences that
reflect the recombination history of the given set of sequences.
This heuristic is an example of a general approach to studying recombination that operates by scanning
along a multiple sequence alignment, building trees for local segments of the sequence and then
comparing the trees in an attempt to determine recombination crossover positions.
Once a set of phylogenetic trees has been constructed for the different stretches of a multiple sequence
alignment M , then in a second step these trees may be combined into a recombination network to
represent the evolutionary history of the data.
By design, the local trees of neighboring positions will differ by at most one SPR and so in this case the
calculation of a corresponding network is easier than in the general case, although it is still challenging
in practice.
5.21
Summary
• Phylogenetic trees are used to represent the evolutionary history of species.
• They can be computed from distances using a method such as neighbor-joining.
• They can be computed from a multiple alignment of sequences using a method based on maximum parsimony, maximum likelihood or the Bayesian approach.
• Rooted phylogenetic networks can be used to represent multiple trees.
• The local-tree approach attempts to explain the evolution of a set of sequences in the presence
of recombination by finding a most parsimonious path through the local-tree graph.