Population structure
April 8, 2015

Outline
1. Motivation
2. Effects of population structure on gene genealogies
3. Testing for population structure
4. Effects of population structure on species tree inference (3 species)
5. Conclusions

One population, two samples
We’ll start with the Wright-Fisher setting (discrete generations, constant population size, random mating). At first, we’ll just look at what happens with a sample of two lineages.

One population, two samples, no population structure
[Figure: one population with n = 2, N = 10]

One population, two samples, population structure
There is a reduced probability that lineages from different subpopulations will have ancestors from the same subpopulation.
[Figure: one population with n = 2, N = 10, with population structure]

Speciation
[Figure slides]

Two questions
1. What effect does population structure have on gene trees?
2. What effect does population structure have on inferring species trees?

Effect of population structure on gene trees
From the slide with two lineages, we expect population structure to increase the time to coalescence for two lineages sampled from different subpopulations. What if there are more than two lineages?

Population structure with more than two lineages
With more than two lineages, the distribution of tree topologies is affected as well, for example whether a tree is more or less balanced.
If you sample two lineages in each subpopulation, the probability that the tree topology is balanced is increased. If you sample one lineage in one subpopulation and three in the other, the tree topology is less likely to be balanced than for a panmictic population.
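The balance claims above can be checked with a small simulation. The sketch below is a toy model rather than the structured Wright-Fisher process itself: under panmixia it joins uniformly random pairs of lineages, and for structure it assumes the extreme limit of no migration, so lineages coalesce within their own subpopulation before the two ancestral lineages join. The function names and the no-migration limit are assumptions made for illustration.

```python
import random


def coalesce(lineages, rng):
    """Join uniformly random pairs until one lineage (the tree) remains."""
    lineages = list(lineages)
    while len(lineages) > 1:
        i, j = sorted(rng.sample(range(len(lineages)), 2))
        right = lineages.pop(j)  # pop the larger index first
        left = lineages.pop(i)
        lineages.append((left, right))
    return lineages[0]


def count_leaves(tree):
    """Number of sampled lineages below a node (leaves are strings)."""
    if isinstance(tree, str):
        return 1
    return sum(count_leaves(child) for child in tree)


def is_balanced(tree):
    """True if both children of the root carry the same number of leaves."""
    left, right = tree
    return count_leaves(left) == count_leaves(right)


def balanced_fraction(sub_a, sub_b, structured, trials=20000, seed=1):
    """Fraction of simulated trees whose root split is balanced."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if structured:
            # Assumed extreme: no migration, so each subpopulation
            # coalesces to one ancestor before the two ancestors join.
            tree = (coalesce(sub_a, rng), coalesce(sub_b, rng))
        else:
            tree = coalesce(sub_a + sub_b, rng)
        hits += is_balanced(tree)
    return hits / trials


panmictic = balanced_fraction(["A", "B"], ["C", "D"], structured=False)
two_two = balanced_fraction(["A", "B"], ["C", "D"], structured=True)
one_three = balanced_fraction(["A"], ["B", "C", "D"], structured=True)
print(panmictic, two_two, one_three)
```

Under these assumptions the panmictic fraction should be close to 1/3, while the no-migration limit gives a balanced tree every time for the 2 + 2 sample and never for the 1 + 3 sample. With some migration, the fractions would be expected to fall between these extremes.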
Population structure with two lineages per subpopulation
This example results in a gene tree with a balanced topology, ((A, B), (C, D)).

No population structure as a null hypothesis
If we want to test whether or not there is population structure, we can treat a model with no population structure as the null hypothesis. Under this null, we can use statistics computed from the gene trees to check whether the data (gene trees) are consistent with the null hypothesis.

Using monophyly to check for population structure
Population structure makes monophyly more likely in the two subpopulations. For example, if lineages in subpopulation A are called A_1, A_2, ..., A_m, and lineages in subpopulation B are called B_1, B_2, ..., B_n, then a tree has reciprocal monophyly if the most recent common ancestor (MRCA) of the A_i is ancestral only to A lineages, and the MRCA of the B_j is ancestral only to B lineages.
If the probability of monophyly under the null hypothesis is low, yet monophyly is observed, then this is evidence of population structure, and we can interpret the probability of monophyly as a p-value.

The probability of monophyly
How do we get the probability of monophyly? Fortunately, there is a formula for the probability of reciprocal monophyly that applies under the coalescent. The probability happens to be the same under the Yule and coalescent models for trees. For m species in clade A and n species in clade B, the probability of reciprocal monophyly is
$$p(m, n) = \frac{2}{(m + n - 1)\binom{m+n}{m}}$$

The probability of monophyly
[Figure: probability of monophyly (log scale, 10^-1 to 10^-6) versus number of species per subpopulation (3 to 9)]

The probability of monophyly: multiple loci
The previous formula gave the probability of monophyly for one locus.
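The single-locus formula can be evaluated directly. The following Python sketch computes p(m, n) = 2 / ((m + n − 1) C(m + n, m)) as an exact fraction; the function name `reciprocal_monophyly_prob` is a hypothetical choice for illustration, not something from the slides.

```python
from fractions import Fraction
from math import comb


def reciprocal_monophyly_prob(m, n):
    """p(m, n) = 2 / ((m + n - 1) * C(m + n, m)) as an exact fraction."""
    if m < 1 or n < 1:
        raise ValueError("m and n must be positive integers")
    return Fraction(2, (m + n - 1) * comb(m + n, m))


# Equal sample sizes per group, as on the axis of the plot.
for k in (2, 3, 4, 5):
    p = reciprocal_monophyly_prob(k, k)
    print(k, p, float(p))
```

For m = n = 4 this gives 1/245, which agrees with the value 0.004081633 used for the multiple-loci example.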
Suppose instead that you have multiple loci. Suppose you have 10 loci, with 4 individuals per species, and 2 loci are reciprocally monophyletic. Is there evidence against random mating?

Multiple loci
Here you could use the idea that the number of loci that are reciprocally monophyletic is binomial (assuming loci are independent), with parameters 10 (the number of loci) and p = p(4, 4) = 0.004081633. You can do a binomial test to get a p-value. In this case the p-value is the probability of observing at least 2 monophyletic loci. Let X be the number of loci with reciprocal monophyly. Then
p-value = P(X ≥ 2) = 1 − P(X = 0) − P(X = 1).
This is easiest to implement in R using the binom.test function.

Multiple loci
Based on the binomial p-value, this is still a highly significant result. In fact, the probability of reciprocal monophyly is so low that even one locus with reciprocal monophyly out of 10 gives a p-value of 0.04.

Monophyly versus paraphyly and polyphyly
If one group is monophyletic but the other isn’t, then we have paraphyly. For example, if the MRCA of the A_i has only A lineages as descendants, but the MRCA of the B_j also has at least one A lineage as a descendant, then B is paraphyletic with respect to A, but A is monophyletic.
If both groups violate monophyly, then the MRCA of the A_i has at least one B lineage as a descendant and the MRCA of the B_j has at least one A lineage as a descendant, and we have polyphyly.
Probabilities of paraphyly and polyphyly have also been worked out under the null hypothesis (see Rosenberg, 2003, Evolution) and could be used for p-values instead.

Taxonomic Distinctiveness
In cases where one group is of particular interest, you might ask whether that group is taxonomically distinct. Here, we might just use the probability that a clade is observed for that particular group.
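For the multiple-loci example earlier (10 loci, 2 of them reciprocally monophyletic, p = p(4, 4)), the one-sided binomial p-value can be computed directly from the binomial probabilities. The Python sketch below is an illustrative stand-in for that calculation rather than the R call mentioned on the slide; `binomial_upper_tail` is a hypothetical helper name, and loci are assumed independent, as stated on the slide.

```python
from math import comb


def binomial_upper_tail(k, n_loci, p):
    """P(X >= k) for X ~ Binomial(n_loci, p)."""
    return sum(
        comb(n_loci, x) * p**x * (1 - p) ** (n_loci - x)
        for x in range(k, n_loci + 1)
    )


# p(4, 4) from the reciprocal monophyly formula: 2 / (7 * C(8, 4)).
p_single = 2 / ((4 + 4 - 1) * comb(8, 4))
p_two_of_ten = binomial_upper_tail(2, 10, p_single)  # P(X >= 2)
p_one_of_ten = binomial_upper_tail(1, 10, p_single)  # P(X >= 1)
print(p_single, p_two_of_ten, p_one_of_ten)
```

In R, a one-sided test of this kind corresponds to calling binom.test with alternative = "greater".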
The probability that the A group forms a clade is
$$\frac{2(m + n)}{m(m + 1)\binom{m+n}{m}}$$
where m is the number of lineages in A and n is the number of lineages not in A.

Taxonomic Distinctiveness
Using this probability, we can ask similar questions about whether the probability of observing a clade is sufficiently low to reject the null hypothesis of unstructured populations (Rosenberg, 2007, Evolution).

Complications
Two possible complications for this approach are that (1) estimated gene trees might have polytomies, and (2) for larger trees, monophyly might not be observed even though there is still significant structure in the trees.
An alternative approach is to quantify the degree of violation of monophyly while allowing trees to be multifurcating. This approach is taken by the genealogical sorting index (Cummings, Neel, and Shaw, Evolution, 2008).
http://www.genealogicalsorting.org/

Genealogical sorting index
[Figure: example tree in which gs for A is 1 and gs for B is 3/4]
gs value for a group: the numerator is n = (number of lineages in the group) − 1; the denominator (for a bifurcating tree) is the number of nodes in the subtree rooted at the MRCA of the group.

Genealogical sorting index
The maximum value of gs is 1, which occurs for a monophyletic group. The minimum value of gs is not 0, so the statistic is normalized to lie between 0 and 1. The normalized statistic is called gsi rather than gs.

Genealogical sorting index
The genealogical sorting index has some distribution, conditional on the tree topology and sampling scheme. This distribution is difficult to determine, so a permutation test is used: randomly permute the labels of the tree, recalculate the gsi, and repeat to get a distribution of values.
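The permutation procedure just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from the genealogical sorting software: trees are nested tuples with unique string leaf labels, the tree is bifurcating, the denominator counts internal nodes in the subtree rooted at the group's MRCA, and the unnormalized gs is permuted. Because the tree and the group size stay fixed under relabeling, any fixed increasing rescaling of gs to the interval from 0 to 1 leaves the permutation p-value unchanged. The example tree and all function names are hypothetical.

```python
import random


def leaves(tree):
    """Leaf labels below a node; leaves are strings, internal nodes tuples."""
    if isinstance(tree, str):
        return [tree]
    return [label for child in tree for label in leaves(child)]


def internal_nodes(tree):
    """Number of internal nodes in the subtree rooted at this node."""
    if isinstance(tree, str):
        return 0
    return 1 + sum(internal_nodes(child) for child in tree)


def mrca_subtree(tree, group):
    """Smallest subtree whose leaves include every label in the group."""
    for child in ([] if isinstance(tree, str) else tree):
        if group <= set(leaves(child)):
            return mrca_subtree(child, group)
    return tree


def gs(tree, group):
    """gs on a bifurcating tree; the group needs at least two labels."""
    return (len(group) - 1) / internal_nodes(mrca_subtree(tree, group))


def relabel(tree, mapping):
    """Copy of the tree with every leaf label replaced via the mapping."""
    if isinstance(tree, str):
        return mapping[tree]
    return tuple(relabel(child, mapping) for child in tree)


def permutation_p_value(tree, group, n_perm=2000, seed=1):
    """Share of label permutations with gs at least as large as observed."""
    rng = random.Random(seed)
    labels = leaves(tree)
    observed = gs(tree, group)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        permuted = relabel(tree, dict(zip(labels, shuffled)))
        hits += gs(permuted, group) >= observed
    return hits / n_perm


# Hypothetical example: A is monophyletic, B is not.
tree = (((("A1", "A2"), "A3"), "B1"), (("B2", "B3"), "B4"))
group_a = {"A1", "A2", "A3"}
group_b = {"B1", "B2", "B3", "B4"}
gs_a = gs(tree, group_a)
gs_b = gs(tree, group_b)
p_a = permutation_p_value(tree, group_a)
print(gs_a, gs_b, p_a)
```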
A p-value is obtained by considering P(gsi ≥ gsi_observed).

Ancient population structure
What are the consequences of ancient population structure? Here we might especially be concerned with the effects of ancient population structure on species tree inference. I think this problem is somewhat overlooked. The reasons usually listed for gene trees to disagree with the species tree are horizontal gene transfer, gene duplication, hybridization, and recombination. However, population structure should be on this list as well.

Discordance between gene trees and species trees
As a bit of terminology, I’ll follow Rosenberg in using discordance to refer to disagreements between gene tree topologies and species tree topologies. Incongruence will be used to refer to disagreements between gene tree topologies at different loci.

Gene tree estimation error
What about gene tree estimation error? Is this a source of discordance/incongruence as well? Yes, but conceptually it is quite different from the others. The other examples are biological processes that cause the ancestry of the gene lineages to disagree with each other and with the species tree, regardless of our knowledge of the genealogies.
Error due to misestimation can cause two gene trees that should agree to apparently disagree. It can also cause two gene trees that have different topologies to be incorrectly inferred to have the same topology, although this might be relatively rare for larger trees. Therefore, misestimation can also cause apparent congruence and concordance!

Ancient population structure
Ancient population structure can affect the probabilities of observing different gene tree topologies, and therefore the probability that a gene tree is discordant with the species tree.
We’ll focus on the case of three taxa and one lineage sampled per species (similar to Edwards’ finch data).

Ancient population structure
[Figure: labels T’ and T]

Ancient population structure
In many cases, having population structure might be equivalent to modifying the divergence time in the species tree, and therefore the internal branch lengths. In this case, the population structure causes a shortening of the internal branch. This might cause more discordance in the gene trees and make it more difficult to estimate the species tree, but it is not a problem for estimating species trees from gene tree topologies. It might make it harder for methods like MP-EST to estimate branch lengths correctly, but it is not an important model violation for estimating the species tree topology.
Note that it is also possible for population structure to effectively lengthen internal branches and make it easier to estimate the species tree.

Ancient population structure: Slatkin and Pollack example
Suppose we have population structure like this (Slatkin and Pollack, 2008, MBE).
[Figure: species A, B, and C]

Ancient population structure
In this case, strong population structure makes it unlikely for A lineages to coalesce with the other two lineages, so most coalescences between A and the other species could predate the divergence of species C. It is then possible that the most likely gene tree is ((B, C), A).
Another consequence of this model is that there can be three distinct probabilities for the gene trees, with
P[((B, C), A)] > P[((A, B), C)] > P[((A, C), B)]

Ancient population structure
Under the usual multispecies coalescent, if the species tree has topology ((A, B), C), then the following is true:
P[((A, B), C)] > 1/3 > P[((B, C), A)] = P[((A, C), B)]
Both the inequalities and the equality (tie in discordant gene tree probabilities) are useful for testing the multispecies coalescent. The model does not hold if the probabilities are (0.4, 0.4, 0.2) or (0.6, 0.3, 0.1), for example. Likewise, these predictions don’t hold for the previous example of ancient population structure.

Distinguishing sources of discordance
Unfortunately, it is often difficult to tell which biological processes are contributing to discordance. Asymmetry in gene tree probabilities can be caused by population structure, gene flow after speciation, or hybridization. Using topologies alone, it might not be possible to distinguish these processes.
However, these processes might imply different variances or other distributional properties of the coalescence times, so coalescence times might carry information for distinguishing which processes generated the data. For example, coalescence times might be bimodal under hybridization, but not under population structure.

Conclusions
- Population structure is often overlooked as a source of incongruence
- Robustness of species tree methods in the presence of structure needs to be investigated more
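As a supplement to the three-taxon predictions discussed earlier, the multispecies coalescent pattern P[((A, B), C)] > 1/3 > P[((B, C), A)] = P[((A, C), B)] can be written out and checked numerically. The sketch below uses the standard three-taxon result for one lineage per species, in which each discordant topology has probability exp(−t)/3 for an internal branch of length t in coalescent units. The symbol t, the function names, the tolerance, and the ordering of each probability triple (concordant topology first) are assumptions for illustration and do not come from the slides.

```python
from math import exp


def msc_triplet_probs(t):
    """Gene tree probabilities for species tree ((A, B), C).

    One lineage per species; t is the internal branch length in
    coalescent units. Returns probabilities in the order
    (P[((A,B),C)], P[((B,C),A)], P[((A,C),B)]).
    """
    discordant = exp(-t) / 3
    return (1 - 2 * discordant, discordant, discordant)


def consistent_with_msc(probs, tol=1e-9):
    """Check P1 > 1/3 > P2 and P2 = P3 (within tol) for an ordered triple."""
    p_ab, p_bc, p_ac = probs
    return p_ab > 1 / 3 > p_bc and abs(p_bc - p_ac) <= tol


ok_model = consistent_with_msc(msc_triplet_probs(1.0))
ok_first = consistent_with_msc((0.4, 0.4, 0.2))
ok_second = consistent_with_msc((0.6, 0.3, 0.1))
print(ok_model, ok_first, ok_second)
```

The check applies to exact probabilities. With estimated gene tree frequencies from data, a statistical test of the tie and the inequalities would be needed instead of a fixed tolerance.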