Phylogenetic Reconstruction: Theory and Software

Many slides are from: Alexandros Stamatakis, Dan Graur, Paul Lewis

Definitions we need to know
● Leaf
● Node
● Branch
● Root
● MRCA (most recent common ancestor)
● Rooted and unrooted trees

Rooted vs unrooted phylogenies

Relatedness in phylogenetics
Two sequences A and B are more closely related than A and B’ when A and B share a common ancestor more recently than A and B’ do.
Is B more closely related to A or to C?

Think about relatedness
As defined above, relatedness implies the concept of “TIME”. There is no concept of time in unrooted trees, so speaking about relatedness in an unrooted tree is rather meaningless.

Distance and relatedness
[Figure: tree with leaves A, B, C and D]
Is B closer to C or to A? Is C closer to D or to B?
The concept of “closeness” is not well defined, so the questions above cannot be answered precisely:
● B is more related to A than to C
● The evolutionary distance between B and C is smaller than between A and B
How is this possible?

Evolutionary rate
The number of substitutions per site per unit time. This can differ from species to species.

Tree Inference
A common task in evolutionary biology is to infer the phylogenetic tree for a set of sequences. This means deciding, from the set of possible phylogenies, which one best explains the data we observe.

Data -- this is what we observe

Finding the relations between sequences
Assuming that the whole body of evolutionary theory is true :))
● Sequences should have common ancestors
  o i.e. they should be related
  o relatedness should affect sequence similarity via identity by descent
  o if two sequences inherit a nucleotide from a common ancestor, then both should carry the same nucleotide at a given site

Similarity, Homology and Orthology
● Homology is a biological concept: two objects (sequences, sites, proteins, etc.) are homologous if they share a common ancestor; orthologs are homologs that were separated by a speciation event
● Similarity is not a biological concept
● Similarity does not necessarily imply homology (or orthology)
● Find examples where similarity does not imply homology

… from Wikipedia
Homology among proteins or DNA is often incorrectly concluded on the basis of sequence similarity. The terms "percent homology" and "sequence similarity" are often used interchangeably. As with anatomical structures, high sequence similarity might occur because of convergent evolution, or, as with shorter sequences, because of chance.

… two random sequences are __ % similar and __ % homologous (the term “% homologous” is a bit strange anyway)

Similarity and orthology
Practically:
● Sequences might share a high similarity because they are orthologous
● This is the basis of alignment
  o The process of arranging a pair of sequences so that homologous sites are in the same column

Let’s align these sequences
ATTTACCACGGA and ATTTACCACGGA
ATTTACCACGGA and ATTCACCACGGA
ATTTACCACGGA and ATTACCACGGA
ATTTACCACGGA and ATTTCGA

Why do we align sequences?
To find phylogenetic trees!!!!!
Note that during the alignment process we already “have in mind” a phylogenetic tree. This is a small paradox of alignment/phylogeny:
● we align to find the tree
● we align “assuming” some tree

Alignment - Phylogeny problem
In practice:
● First we find an alignment
● Then we reconstruct the tree
There are methods, however, that do both simultaneously (e.g. PRANK).
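The alignment exercise above can be tried out with a small script. The following is a minimal sketch of global pairwise alignment using Needleman-Wunsch dynamic programming, a technique the slides do not name. The function name `needleman_wunsch` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; this is not how MAFFT or PRANK work internally.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (aligned_a, aligned_b, score).

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# The four sequence pairs from the exercise.
pairs = [
    ("ATTTACCACGGA", "ATTTACCACGGA"),
    ("ATTTACCACGGA", "ATTCACCACGGA"),
    ("ATTTACCACGGA", "ATTACCACGGA"),
    ("ATTTACCACGGA", "ATTTCGA"),
]

for s, t in pairs:
    aligned_s, aligned_t, total = needleman_wunsch(s, t)
    print(aligned_s)
    print(aligned_t)
    print("score:", total)
    print()
```

Several alignments can share the same optimal score (for example, any one of the three T's can be the gapped site in the third pair), so the sketch reports only one of them. Note also that the scoring scheme carries no information about the tree, which is one way to see the alignment/phylogeny paradox mentioned above.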
Phylogenetics
the art of tree reconstruction

Inferring UNROOTED phylogenies
All programs I know work with unrooted phylogenies.

Assume an alignment
A: ACGTCAACACCCA
B: ACGTCGACACCCA
C: ACGTCAACATTTT
D: ACGTTAACATTTT

Examining all possible phylogenies

Number of possible phylogenetic trees

Scoring the trees

Impossible Calculations
● Since the tree space is so large, it is not possible to evaluate all trees
● Programs use smart heuristics to explore the tree space and report the best “known” tree
● Keep in mind: since the space is huge, the reported tree is almost certainly not the true one :( :)

UPGMA
Unweighted pair-group method with arithmetic means
UPGMA employs a sequential clustering algorithm, in which local topological relationships are identified in order of decreasing similarity, and the tree is built in a stepwise manner.
[Figures: UPGMA clustering steps, showing simple OTUs and a composite OTU]
UPGMA is only guaranteed to recover the correct tree if the distances are strictly ultrametric, i.e. the distance from the root to every leaf is the same. Equivalently, any three species in the distance matrix M can be labeled i, j, k so that
M_ik = M_jk >= M_ij

Parsimony
Are there problems in parsimony?
● long branch attraction

Huge data... better trees?
What is your opinion?

Bonobo, Chimp, Human
The bonobo genome compared with the chimpanzee and human genomes
http://www.nature.com/nature/journal/v486/n7404/full/nature11128.html
What does this mean for phylogenetic tree reconstruction?

What software do we use?
● MAFFT to obtain alignments (but also MUSCLE, CLUSTAL, etc.)
● RAxML to obtain phylogenetic trees (but also FastTree... well, we only trust RAxML :)
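To get a feeling for why evaluating all trees is impossible (the “Number of possible phylogenetic trees” and “Impossible Calculations” slides), the sketch below counts tree topologies. It relies on the standard result that n labeled leaves admit (2n-5)!! unrooted and (2n-3)!! rooted strictly bifurcating topologies; the function names are hypothetical and chosen only for this illustration.

```python
def double_factorial_odd(k):
    """Return k!! = k * (k-2) * ... * 3 * 1 for an odd k >= 1."""
    result = 1
    for factor in range(3, k + 1, 2):
        result *= factor
    return result


def num_unrooted_trees(n):
    """Number of unrooted bifurcating topologies on n labeled leaves (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial_odd(2 * n - 5)


def num_rooted_trees(n):
    """Number of rooted bifurcating topologies on n labeled leaves (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 leaves")
    return double_factorial_odd(2 * n - 3)


# Print how quickly tree space grows with the number of sequences.
for n in (4, 5, 10, 20, 50):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

For the four sequences A, B, C and D there are only 3 unrooted topologies, but 10 sequences already give 2,027,025, which is why programs rely on heuristic search rather than enumeration.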
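The UPGMA clustering described in the slides can be sketched on the four-sequence alignment of A, B, C and D. This is a minimal illustration, assuming that distances are the proportion of differing sites (p-distance), which the slides do not specify. The names `p_distance` and `upgma` are hypothetical, and ties between equally close pairs are broken by taking the first minimal pair found.

```python
from itertools import combinations


def p_distance(s, t):
    """Proportion of aligned sites at which sequences s and t differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s, t)) / len(s)


def upgma(seqs):
    """Cluster aligned sequences with UPGMA.

    Returns (newick_topology, root_height). Clusters are named by their
    Newick-style string, e.g. "(A,B)".
    """
    size = {name: 1 for name in seqs}  # number of leaves in each cluster
    dist = {
        frozenset((a, b)): p_distance(seqs[a], seqs[b])
        for a, b in combinations(sorted(seqs), 2)
    }
    height = 0.0
    while len(size) > 1:
        # Pick the closest pair of clusters (first one found on ties).
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        merged = f"({a},{b})"
        height = dist[pair] / 2  # the new node sits halfway between a and b
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted mean, i.e. the arithmetic mean over all leaf pairs.
        for c in size:
            if c in pair:
                continue
            dist[frozenset((merged, c))] = (
                size[a] * dist[frozenset((a, c))]
                + size[b] * dist[frozenset((b, c))]
            ) / (size[a] + size[b])
        # Drop every distance that involves the two clusters just merged.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    root = next(iter(size))
    return root, height


# The alignment from the slides.
alignment = {
    "A": "ACGTCAACACCCA",
    "B": "ACGTCGACACCCA",
    "C": "ACGTCAACATTTT",
    "D": "ACGTTAACATTTT",
}

tree, root_height = upgma(alignment)
print(tree + ";", root_height)
```

On this alignment the sketch groups A with B and C with D. The p-distances here are not exactly ultrametric (A and C differ at 4 of 13 sites, B and C at 5 of 13), so the node heights should be read as approximate.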
© Copyright 2024