Phylogenetic Reconstruction

Theory and Software
Many slides are from: Alexandros Stamatakis, Dan Graur, Paul Lewis
Definitions we need to know
● Leaf
● Node
● Branch
● Root
● MRCA (most recent common ancestor)
● Rooted and unrooted trees
Rooted vs unrooted phylogenies
Relatedness in phylogenetics
Two sequences A and B are more closely
related than A and B’ are when A and B share a
common ancestor more recently than A and B’ do.
Is B more closely related to A or to C?
Think about relatedness
As defined above, relatedness implies the concept of
“TIME”.
There is no concept of time in unrooted trees.
Thus, speaking about relatedness is rather
meaningless in unrooted trees.
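The definition of relatedness above can be sketched on a small rooted tree. This is a minimal illustration, not part of the original slides: the tree ((A,B),C), the node names and the node ages are all made-up assumptions.

```python
# Hypothetical rooted tree ((A,B),C), stored as child -> parent.
# Node names and ages are invented for illustration only.
PARENT = {"A": "AB", "B": "AB", "AB": "root", "C": "root"}

# Age of each internal node: time before present, in arbitrary units.
AGE = {"AB": 1.0, "root": 3.0}


def ancestors(node):
    """Return the ancestors of `node`, from its parent up to the root."""
    path = []
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def mrca(x, y):
    """Most recent common ancestor: the first ancestor of x shared with y."""
    y_ancestors = set(ancestors(y))
    for node in ancestors(x):
        if node in y_ancestors:
            return node
    raise ValueError("nodes are not in the same tree")


def more_closely_related(x, y, z):
    """True if x is more closely related to y than to z (younger MRCA)."""
    return AGE[mrca(x, y)] < AGE[mrca(x, z)]


print(mrca("A", "B"), mrca("B", "C"))
print(more_closely_related("B", "A", "C"))
```

Note that the comparison needs node ages, which only exist once the tree has a root. This is why relatedness is not defined on an unrooted tree.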
Distance and relatedness
Is B closer to C or to A?
Is C closer to D or to B?
[Figure: phylogeny with leaves A, B, C and D]
The concept “closeness” is not well defined. So you cannot
really answer the above questions precisely.
B is more related to A than to C.
Evolutionary distance between B and C is smaller than
between A and B.
How is this possible?
Evolutionary rate
The number of substitutions per site per unit of time.
This can differ from species to species, so a fast-evolving
lineage accumulates more distance in the same amount of time.
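A small numeric sketch may help to see how B can be more related to A and yet closer in distance to C. It assumes that a branch length equals rate times duration. The tree ((A,B),C), the split times and the rates are hypothetical numbers chosen for illustration.

```python
# Hypothetical rooted tree ((A,B),C): A and B split 1 time unit ago,
# their ancestor and C split 3 time units ago. All numbers are made up.
# Each branch: (duration in time units, substitutions per site per time unit).
BRANCHES = {
    "A": (1.0, 0.50),   # fast-evolving lineage
    "B": (1.0, 0.02),
    "AB": (2.0, 0.02),  # branch from the root to the ancestor of A and B
    "C": (3.0, 0.02),
}

# Branches on the path between two leaves of this particular tree.
PATHS = {
    ("A", "B"): ["A", "B"],
    ("B", "C"): ["B", "AB", "C"],
}


def branch_length(name):
    """Expected substitutions per site on one branch: rate * duration."""
    duration, rate = BRANCHES[name]
    return duration * rate


def distance(x, y):
    """Evolutionary distance: sum of branch lengths on the path from x to y."""
    return sum(branch_length(b) for b in PATHS[(x, y)])


print("d(A,B) =", round(distance("A", "B"), 3))
print("d(B,C) =", round(distance("B", "C"), 3))
```

In this sketch A and B share the most recent ancestor, but the high rate on the branch leading to A makes the A-B distance larger than the B-C distance.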
Tree Inference
A common task in evolutionary biology is to
infer the phylogenetic tree for a set of
sequences.
This means deciding which of the possible
phylogenies best explains the
data we observe.
Data -- this is what we observe
Finding the relations between sequences
Assuming that the whole body of evolutionary
theory is true :))
● Sequences should have common ancestors
  o i.e. they should be related
  o relatedness should affect sequence similarity via
    identity by descent.
  o if two sequences inherit a nucleotide from a common
    ancestor then both should carry the same nucleotide
    at a given site.
Similarity and Orthology
● Homology is a biological concept: two
objects (sequences, sites, proteins etc.) are
homologous if they share a common
ancestor. Orthology is the special case in which
the homologs diverged through a speciation event
● Similarity is not a biological concept
● Similarity does not necessarily imply orthology
● Find examples where similarity != orthology
… from Wikipedia
Homology among proteins or DNA is often
incorrectly concluded on the basis of sequence
similarity. The terms "percent homology" and
"sequence similarity" are often used
interchangeably. As with anatomical structures,
high sequence similarity might occur because
of convergent evolution, or, as with shorter
sequences, because of chance.
… two random sequences
are __ % similar
are __ % homologous (the term % homologous
is a bit strange anyway)
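One way to explore the similarity question is to simulate it. The sketch below is an illustration only. It assumes equal base frequencies, sequences of equal length and no gaps, under which roughly one position in four is expected to match by chance. It says nothing about homology, which is a statement about ancestry and cannot be measured this way.

```python
import random


def random_dna(length, rng):
    """Draw a random DNA sequence with equal base frequencies (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def percent_identity(s, t):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(s, t))
    return 100.0 * matches / len(s)


rng = random.Random(0)  # fixed seed so the run is repeatable
x = random_dna(10_000, rng)
y = random_dna(10_000, rng)
print(round(percent_identity(x, y), 1))
```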
Similarity and orthology
Practically
● Sequences might share a high similarity
because they are orthologous
● This is the basis of alignment
  o The process of arranging a pair of sequences so
    that homologous sites are in the same column
Let’s align these sequences
ATTTACCACGGA and ATTTACCACGGA
ATTTACCACGGA and ATTCACCACGGA
ATTTACCACGGA and ATTACCACGGA
ATTTACCACGGA and ATTTCGA
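The four pairs above can be aligned with a minimal global alignment sketch in the style of Needleman-Wunsch. The scoring scheme (match +1, mismatch -1, gap -1) is an assumption made for illustration. Real aligners use more elaborate scoring, and when several alignments are equally good the one reported here depends on the traceback order.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (aligned_s, aligned_t, score)."""
    n, m = len(s), len(t)
    # score[i][j] = best score for aligning s[:i] with t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                score[i - 1][j] + gap,      # gap in t
                score[i][j - 1] + gap,      # gap in s
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                a.append(s[i - 1])
                b.append(t[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1])
            b.append("-")
            i -= 1
        else:
            a.append("-")
            b.append(t[j - 1])
            j -= 1
    return "".join(reversed(a)), "".join(reversed(b)), score[n][m]


REFERENCE = "ATTTACCACGGA"
for other in ["ATTTACCACGGA", "ATTCACCACGGA", "ATTACCACGGA", "ATTTCGA"]:
    top, bottom, total = needleman_wunsch(REFERENCE, other)
    print(top)
    print(bottom)
    print("score:", total)
    print()
```

For the third pair the gap can sit at any position in the run of T's without changing the score. This shows that an alignment is a hypothesis about which sites are homologous.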
Why do we align sequences?
To find Phylogenetic Trees!!!!!
Note that:
During the alignment process we already “have
in mind” a phylogenetic tree
This is a small paradox of the
alignment/phylogeny:
● we align to find the tree
● we align “assuming” some tree
Alignment - Phylogeny problem
In practice:
First we find an alignment
Then we reconstruct the tree
There are methods, however, that do both
simultaneously (e.g. PRANK).
Phylogenetics
the art of tree reconstruction
Inferring UNROOTED phylogenies
All programs I know work with unrooted
phylogenies
Assume an alignment
A: ACGTCAACACCCA
B: ACGTCGACACCCA
C: ACGTCAACATTTT
D: ACGTTAACATTTT
Examining all possible phylogenies
A: ACGTCAACACCCA
B: ACGTCGACACCCA
C: ACGTCAACATTTT
D: ACGTTAACATTTT
Number of possible phylogenetic trees
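The growth of tree space can be made concrete with the standard counts for strictly bifurcating, leaf-labelled trees: (2n-5)!! unrooted and (2n-3)!! rooted topologies for n leaves. The function names below are hypothetical helpers for this sketch.

```python
def double_factorial_odd(k):
    """Product k * (k - 2) * ... * 3 * 1 for odd k; returns 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n_leaves):
    """Number of unrooted bifurcating topologies on n labelled leaves."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial_odd(2 * n_leaves - 5)


def num_rooted_trees(n_leaves):
    """Number of rooted bifurcating topologies on n labelled leaves."""
    if n_leaves < 2:
        raise ValueError("need at least 2 leaves")
    return double_factorial_odd(2 * n_leaves - 3)


for n in (4, 5, 10, 20, 50):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

For the four sequences A, B, C and D there are only 3 unrooted topologies, but the count grows faster than exponentially with the number of leaves.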
Scoring the trees
Impossible Calculations
● Since the tree space is so large it is not
possible to evaluate all trees
● Programs use smart heuristics to
examine the tree space and to report the
best “known” tree
● Keep in mind: since the space is huge,
the reported tree is almost certainly not the best one :( :)
UPGMA
Unweighted pair-group method with arithmetic means
UPGMA employs a sequential clustering
algorithm, in which local topological
relationships are identified in order of
decreased similarity, and the tree is built in
a stepwise manner.
[Figures: worked UPGMA example, joining simple OTUs into a composite OTU]
UPGMA only works if the distances
are strictly ultrametric, i.e. the
distance from the root to every leaf
is the same. Equivalently,
any 3 species in the distance matrix M can be
labelled i, j, k such that M(i,k) = M(j,k) >= M(i,j)
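The clustering procedure and the ultrametric condition can be sketched as follows. This is a minimal illustration on a made-up distance matrix, not the implementation used by any particular program. The tree is returned as nested tuples without branch lengths, and ties are broken by input order.

```python
from itertools import combinations


def make_dist(pairs):
    """Turn {("A", "B"): 2.0, ...} into a symmetric lookup keyed by frozenset."""
    return {frozenset(k): v for k, v in pairs.items()}


def is_ultrametric(dist, names, tol=1e-9):
    """Three-point condition: in every triple the two largest distances are equal."""
    for i, j, k in combinations(names, 3):
        d = sorted([dist[frozenset((i, j))],
                    dist[frozenset((i, k))],
                    dist[frozenset((j, k))]])
        if abs(d[2] - d[1]) > tol:
            return False
    return True


def upgma(dist, names):
    """Cluster with UPGMA; returns the topology as nested tuples."""
    d = dict(dist)
    size = {name: 1 for name in names}  # number of simple OTUs per cluster
    clusters = list(names)
    while len(clusters) > 1:
        # Join the two closest clusters into a composite OTU.
        x, y = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (x, y)
        size[merged] = size[x] + size[y]
        for z in clusters:
            if z not in (x, y):
                # Size-weighted mean = arithmetic mean over all simple OTU pairs.
                d[frozenset((merged, z))] = (
                    size[x] * d[frozenset((x, z))]
                    + size[y] * d[frozenset((y, z))]
                ) / size[merged]
        clusters = [c for c in clusters if c not in (x, y)] + [merged]
    return clusters[0]


NAMES = ["A", "B", "C", "D"]
# Hypothetical ultrametric distances, chosen for illustration.
DIST = make_dist({
    ("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
    ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0,
})

print(is_ultrametric(DIST, NAMES))
print(upgma(DIST, NAMES))
```

Running `is_ultrametric` before trusting a UPGMA tree is a cheap sanity check. When the condition fails, for example because rates differ between lineages, UPGMA can return the wrong topology.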
Parsimony
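As an illustration of parsimony scoring, the sketch below counts the minimum number of changes with Fitch's algorithm for each of the three unrooted topologies of the four-sequence alignment shown earlier. Rooting each quartet on its internal branch is an assumption made for convenience, since the parsimony score does not depend on where the root is placed.

```python
ALIGNMENT = {
    "A": "ACGTCAACACCCA",
    "B": "ACGTCGACACCCA",
    "C": "ACGTCAACATTTT",
    "D": "ACGTTAACATTTT",
}

# The three unrooted topologies for four leaves, rooted on the internal branch.
TOPOLOGIES = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}


def fitch_site(tree, column):
    """Return (possible ancestral states, minimum changes) for one column."""
    if isinstance(tree, str):          # a leaf: its observed state, no changes
        return {column[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, column)
    right_set, right_cost = fitch_site(right, column)
    common = left_set & right_set
    if common:                         # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum of the per-site minimum numbers of changes on `tree`."""
    length = len(next(iter(alignment.values())))
    total = 0
    for site in range(length):
        column = {name: seq[site] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


scores = {name: parsimony_score(tree, ALIGNMENT)
          for name, tree in TOPOLOGIES.items()}
best = min(scores, key=scores.get)
print(scores)
print("most parsimonious:", best)
```

The last four columns of the alignment are the only parsimony-informative sites. All of them group A with B and C with D, so that topology needs the fewest changes.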
Are there problems in parsimony?
● Long branch attraction
Huge data... better trees?
What is your opinion?
Bonobo, Chimp, Human
The bonobo genome compared with the chimpanzee and human genomes
http://www.nature.com/nature/journal/v486/n7404/full/nature11128.html
What does it mean for phylogenetic tree reconstruction?
What software do we use?
MAFFT: to obtain alignments
(but also MUSCLE, CLUSTAL etc.)
RAxML: to obtain phylogenetic trees
(but also FastTree... well, we do trust only RAxML :)