Population structure

Outline
1. Motivation
2. Effects of population structure on gene genealogies
3. Testing for population structure
4. Effects of population structure on species tree inference (3 species)
5. Conclusions
Population structure
April 8, 2015
1 / 38
One population, two samples
We’ll start with the Wright-Fisher setting (discrete generations, constant
population size, random mating).
At first, we’ll just look at what happens with samples.
One population, two samples, no population structure
[Figure: one population with n = 2, N = 10]
One population, two samples, population structure
There is a reduced probability that lineages from different subpopulations
will have ancestors from the same subpopulation.
[Figure: one population with n = 2, N = 10, with population structure]
Speciation
Two questions
1. What effect does population structure have on gene trees?
2. What effect does population structure have on inferring species trees?
Effect of population structure on gene trees
From the slide with two lineages, we expect population structure to
increase the time to coalescence for two lineages.
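The expectation above can be checked with a small simulation. This is a minimal sketch, not a model taken from the slides: it assumes N = 10 haploid gene copies, and for the structured case it assumes two demes of 5 copies each, with each lineage's parent lying in the other deme with probability 0.05 per generation. The function names, the deme layout, and the migration rate are all illustrative assumptions.

```python
import random

def panmictic_time(n_pop, rng):
    """Generations until two lineages pick the same parent among n_pop gene copies."""
    t = 0
    while True:
        t += 1
        if rng.randrange(n_pop) == rng.randrange(n_pop):
            return t

def structured_time(deme_size, migration, rng):
    """Same quantity for two demes of deme_size copies each.

    The two lineages start in different demes. Going back one generation,
    each lineage's parent is in the other deme with probability `migration`.
    Lineages can only coalesce while they are in the same deme.
    """
    deme = [0, 1]
    t = 0
    while True:
        t += 1
        for i in range(2):
            if rng.random() < migration:
                deme[i] = 1 - deme[i]
        same_deme = deme[0] == deme[1]
        if same_deme and rng.randrange(deme_size) == rng.randrange(deme_size):
            return t

def mean_time(sampler, reps, *args):
    """Average coalescence time over `reps` independent replicates."""
    return sum(sampler(*args) for _ in range(reps)) / reps

rng = random.Random(1)
mean_pan = mean_time(panmictic_time, 20000, 10, rng)
mean_str = mean_time(structured_time, 20000, 5, 0.05, rng)
print("panmictic mean:", mean_pan)
print("structured mean:", mean_str)
```

Under random mating the coalescence time is geometric with mean N, so the first average should be close to 10. The second should be clearly larger, because the lineages must first find their way into the same deme.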
What if there are more than two lineages?
Population structure with more than two lineages.
With two or more lineages in each subpopulation, distributions of different
topologies are affected as well, such as whether a tree is more or less
balanced.
If you sample two lineages in each subpopulation, the probability that a
tree topology is balanced is increased.
If you sample one lineage in one subpopulation, and three in the other, the
tree topology is less likely to be balanced than for a panmictic population.
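For a panmictic baseline to compare against, the topology probabilities for four lineages can be enumerated exactly. The sketch below assumes the standard coalescent rule that every pair of lineages is equally likely to coalesce next; the tip names and helper functions are illustrative.

```python
from fractions import Fraction
from itertools import combinations

def topology_probs(lineages):
    """Return {topology: probability} when each pair is equally likely to coalesce.

    Topologies are nested tuples; children are sorted so that the same
    topology reached by different coalescence orders maps to the same key.
    """
    if len(lineages) == 1:
        return {lineages[0]: Fraction(1)}
    pairs = list(combinations(range(len(lineages)), 2))
    probs = {}
    for i, j in pairs:
        merged = tuple(sorted((lineages[i], lineages[j]), key=str))
        rest = [x for k, x in enumerate(lineages) if k not in (i, j)]
        for topo, p in topology_probs(rest + [merged]).items():
            probs[topo] = probs.get(topo, Fraction(0)) + p / len(pairs)
    return probs

def is_balanced(tree):
    """A four-tip tree is balanced when both children of the root are pairs."""
    left, right = tree
    return isinstance(left, tuple) and isinstance(right, tuple)

probs = topology_probs(["A", "B", "C", "D"])
p_balanced = sum(p for topo, p in probs.items() if is_balanced(topo))
print("P(balanced topology) =", p_balanced)
```

With no structure, a four-tip tree is balanced with probability 1/3, whichever subpopulations the tips come from. The statements above describe how structure shifts the probability away from this value.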
Population structure with two lineages per subpopulation
This example results in a gene tree with a balanced topology, ((AB)(CD)).
No population structure as a null hypothesis
If we want to test whether or not there is population structure, we can
think of a model with no population structure as a null hypothesis. If there
is no population structure, we can use statistics on the gene trees to check
whether the data (gene trees) are consistent with the null hypothesis.
Using monophyly to check for population structure
Population structure makes monophyly more likely in the two subpopulations.
For example, if lineages in subpopulation A are called A1, A2, ..., Am, and
lineages in subpopulation B are called B1, B2, ..., Bn, then a tree has
reciprocal monophyly if the most recent common ancestor (MRCA) of the
Ai’s is ancestral only to A lineages, and the MRCA of the Bj’s is ancestral
only to B lineages.
If the probability of monophyly under the null hypothesis is low, yet
monophyly is observed, then this is evidence of population structure, and we
can interpret the probability of monophyly as a p-value.
The probability of monophyly
How to get the probability of monophyly? Fortunately there is a formula
for the probability of monophyly that applies under the coalescent. This
happens to be the same probability under the Yule and coalescent models
for trees.
For m species in clade A and n species in clade B, the probability of
reciprocal monophyly is
\[ p(m, n) = \frac{2}{(m + n - 1) \binom{m+n}{m}} \]
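As a quick numerical check, the formula for reciprocal monophyly can be evaluated directly. This is a minimal sketch; the function name is an illustrative choice.

```python
from math import comb

def p_reciprocal_monophyly(m, n):
    """Probability of reciprocal monophyly: 2 / ((m + n - 1) * C(m + n, m))."""
    return 2 / ((m + n - 1) * comb(m + n, m))

# Equal sample sizes from each group
for k in (3, 5, 7, 9):
    print(k, p_reciprocal_monophyly(k, k))

print("p(4, 4) =", p_reciprocal_monophyly(4, 4))
```

For m = n = 4 this gives 2/490, about 0.0041, and the probability falls off quickly as the sample sizes grow.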
The probability of monophyly
[Figure: probability of monophyly (log scale, 10^-1 to 10^-6) against number of species per subpopulation (3 to 9)]
The probability of monophyly: multiple loci
The previous formula gave the probability of monophyly for one locus.
Suppose instead that you have multiple loci. For example, suppose you have
10 loci, with 4 individuals per species, and 2 of the loci are reciprocally
monophyletic.
Is there evidence against random mating?
Multiple loci
Here you could use the idea that the number of loci which are reciprocally
monophyletic is binomial (assuming loci are independent), with parameters
n = 10 (the number of loci) and p = p(4, 4) = 0.004081633. You can do a
binomial test to get a p-value. In this case the p-value is the probability of
observing at least 2 monophyletic loci.
Let X be the number of loci with reciprocal monophyly. Then
p-value = P(X ≥ 2) = 1 − P(X = 0) − P(X = 1).
This is easiest to implement in R using the binom.test function.
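The same tail probability can be computed by hand. The sketch below is a Python version of the calculation (the slides use R). It sums the binomial probabilities directly and should match what binom.test reports for a one-sided "greater" alternative. The function name is illustrative.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 - P(X < k)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))

# Single-locus probability of reciprocal monophyly, p(4, 4) = 2 / (7 * C(8, 4))
p_single = 2 / (7 * comb(8, 4))

p_two = binomial_upper_tail(2, 10, p_single)  # at least 2 of 10 loci
p_one = binomial_upper_tail(1, 10, p_single)  # at least 1 of 10 loci
print("P(X >= 2) =", p_two)
print("P(X >= 1) =", p_one)
```

With 10 loci, P(X ≥ 2) is below 0.001, and P(X ≥ 1) is about 0.04.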
Multiple loci
Based on the binomial p-value, this is still a highly significant result. In
fact, the probability of reciprocal monophyly is so low that even one locus
with reciprocal monophyly gives a p-value of 0.04.
Monophyly versus paraphyly and polyphyly
If one group is monophyletic, but the other isn’t, then we have paraphyly.
For example, if the MRCA of the Ai’s has only Ai’s as descendants, but the
MRCA of the Bj’s also has at least one Ai as a descendant, then B is
paraphyletic with respect to A, but A is monophyletic.
If both groups violate monophyly, so that the MRCA of the Ai’s is ancestral
to at least one Bj and the MRCA of the Bj’s is ancestral to at least one Ai,
then we have polyphyly.
Probabilities of paraphyly and polyphyly have also been worked out under
the null hypothesis (see Rosenberg, 2003, Evolution) and could be used for
p-values instead.
Taxonomic Distinctiveness
In cases where one group is of particular interest, you might ask whether
that group is taxonomically distinct. Here, we might just use the
probability that a clade is observed for that particular group.
The probability that the A group forms a clade is
\[ \frac{2(m + n)}{m(m + 1) \binom{m+n}{m}} \]
where m is the number of lineages in A and n is the number of lineages
not in A.
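A short sketch of this clade probability, using exact fractions so that small cases can be checked by hand. The function name is illustrative.

```python
from fractions import Fraction
from math import comb

def p_clade(m, n):
    """Probability that m lineages in A form a clade among m + n lineages."""
    return Fraction(2 * (m + n), m * (m + 1) * comb(m + n, m))

print("two A lineages, one other:", p_clade(2, 1))
print("four A lineages, four others:", float(p_clade(4, 4)))
```

Two sanity checks: with m = 2 and n = 1 the formula gives 1/3, the chance that a particular pair among three lineages coalesces first. With m = 1 it gives 1, since a single lineage is trivially a clade.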
Taxonomic Distinctiveness
Using this probability, we can ask similar questions about whether the
probability of observing a clade is sufficiently low to reject the null
hypothesis of unstructured populations.
Rosenberg 2007, Evolution.
Complications
Two possible complications for this approach are that (1) estimated gene
trees might have polytomies, (2) for larger trees, monophyly might not be
observed but there could still be significant structure in the trees.
An alternative approach is to try to quantify the degree of violation of
monophyly, at the same time allowing trees to be multifurcating. This
approach is taken by the genealogical sorting index (Cummings, Neal, and
Shaw, Evolution, 2008).
http://www.genealogicalsorting.org/
Genealogical sorting index
Here gs for A is 1, gs for B is 3/4.
gs value for a group: n = number of lineages − 1 (numerator); denominator (for a bifurcating tree) = number of nodes in the subtree rooted at the MRCA of the group.
Genealogical sorting index
The maximum value of the genealogical sorting index is 1, which occurs
for a monophyletic group. The minimum value is not 0, so the index is
normalized to be between 0 and 1.
The normalized statistic is called gsi rather than gs.
Genealogical sorting index
The genealogical sorting index has some distribution, conditional on the
tree topology and sampling scheme. This distribution is difficult to
determine, so a permutation test is used: randomly permute the labels
of the tree, recalculate the gsi, and repeat to get a distribution of values. A
p-value is obtained by considering
P(gsi ≥ gsi_observed)
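The permutation test can be sketched as follows. Several things here are assumptions rather than the published method. The tree and tip names are hypothetical. The denominator of gs is taken to be the sum of (children − 1) over internal nodes that lie on paths from the group's tips to their MRCA, which is one reading of the description of gs. The test uses the unnormalized gs, on the assumption that the normalization to gsi depends only on the tree and the group size, so it would not change the ranking of permuted values.

```python
import random
from fractions import Fraction

# Hypothetical example tree as nested tuples; tip names are illustrative only.
TREE = ((("a1", "a2"), "b1"), ("b2", ("b3", "b4")))
GROUPS = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "b3": "B", "b4": "B"}

def tips(node):
    """List the tip names below a node."""
    if isinstance(node, str):
        return [node]
    return [t for child in node for t in tips(child)]

def gs(tree, groups, group):
    """gs = (group size - 1) / sum of (children - 1) over uniting nodes.

    Assumes the group has at least two tips in the tree.
    """
    members = {t for t in tips(tree) if groups[t] == group}
    # Descend to the MRCA: the smallest subtree containing every member.
    node = tree
    while True:
        inside = [c for c in node
                  if not isinstance(c, str) and members <= set(tips(c))]
        if not inside:
            break
        node = inside[0]

    def cost(sub):
        # Count only internal nodes that have at least one group tip below them.
        if isinstance(sub, str) or not members & set(tips(sub)):
            return 0
        return len(sub) - 1 + sum(cost(c) for c in sub)

    return Fraction(len(members) - 1, cost(node))

def permutation_p_value(tree, groups, group, reps, rng):
    """Fraction of label permutations with gs at least as large as observed."""
    observed = gs(tree, groups, group)
    names = list(groups)
    labels = [groups[t] for t in names]
    hits = 0
    for _ in range(reps):
        rng.shuffle(labels)
        if gs(tree, dict(zip(names, labels)), group) >= observed:
            hits += 1
    return hits / reps

print("gs(A) =", gs(TREE, GROUPS, "A"))
print("gs(B) =", gs(TREE, GROUPS, "B"))
print("p-value for A:", permutation_p_value(TREE, GROUPS, "A", 2000, random.Random(0)))
```

On this hypothetical tree the A group is monophyletic, so gs is 1. The B group needs four nodes to unite its four tips, so gs is 3/4.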
Ancient population structure
What are the consequences of ancient population structure? Here we
might especially be concerned with the effects of ancient population
structure on species tree inference. I think this problem is somewhat
overlooked.
The reasons usually listed for gene trees to disagree with the species
tree are horizontal gene transfer, gene duplication, hybridization, and
recombination. However, population structure should be on this list as well.
Discordance between gene trees and species trees
As a bit of terminology, I’ll follow Rosenberg in using discordance to refer
to disagreements between gene tree topologies and species tree topologies.
Incongruence will be used to refer to disagreements between gene tree
topologies at different loci.
Gene tree estimation error
What about gene tree estimation error? Is this a source of
discordance/incongruence as well?
Yes, but conceptually, it is quite different from the others. The other
examples are biological processes that cause the ancestry of the gene
lineages to disagree with each other and the species tree, regardless of our
knowledge of the genealogies. Error due to misestimation can cause two
gene trees that should agree to apparently disagree. It can also cause two
gene trees that have different topologies to be incorrectly inferred to have
the same topology, although this might be relatively rare for larger trees.
Therefore, misestimation can cause congruence and concordance!
Ancient population structure
Ancient population structure can affect the probabilities of observing
different gene tree topologies, and therefore the probability that a gene
tree is discordant with the species tree. We’ll focus on the case of three
taxa and one lineage sampled per species (similar to Edwards’ Finch data).
Ancient population structure
[Figure: divergence times T’ and T]
Ancient population structure
In many cases, having population structure might be equivalent to
modifying the divergence time in the species tree, and therefore the
internal branch lengths. In this case, the population structure causes a
shortening of the internal branch. This might cause more discordance in
the gene trees and make it more difficult to estimate the species tree, but
it is not a problem in principle for estimating species trees from gene tree
topologies. It might make it harder for methods like MP-EST to estimate
branch lengths correctly, but it is not an important model violation for
estimating the species tree topology.
Note that it is also possible for population structure to effectively lengthen
internal branches and make it easier to estimate the species tree as well.
Ancient population structure: Slatkin and Pollack example
Suppose we have population structure like this (Slatkin and Pollack, 2008,
MBE)
[Figure: population structure among species A, B, and C]
Ancient population structure
In this case, strong population structure makes it unlikely for the A
lineage to coalesce with the other two lineages, so most coalescences
between A and the other species could predate the divergence of
species C. It is then possible that the most likely gene tree is
((B, C), A).
Another consequence of this model is that there can be three distinct
probabilities for the gene trees with
P[((B, C), A)] > P[((A, B), C)] > P[((A, C), B)]
Ancient population structure
Under the usual multispecies coalescent, if the species tree has topology
((A, B), C), then the following is true
P[((A, B), C)] > 1/3 > P[((B, C), A)] = P[((A, C), B)]
Both the inequalities and the equality (tie in discordant gene tree
probabilities) are useful for testing the multispecies coalescent. For
example, the model does not hold if the probabilities are (0.4, 0.4, 0.2) or
(0.6, 0.3, 0.1). These predictions also fail for the previous example
of ancient population structure.
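These predictions can be written down explicitly. The sketch below uses the standard three-taxon result under the multispecies coalescent: with an internal branch of length t in coalescent units, each discordant topology has probability exp(−t)/3. The checker assumes the first entry of each triple is the topology matching the species tree, and it compares exact probabilities. Applying it to observed gene tree frequencies would need a statistical test, which is not shown. Both function names are illustrative.

```python
from math import exp

def msc_triplet_probs(t):
    """Gene tree probabilities for species tree ((A,B),C).

    t is the internal branch length in coalescent units.
    Returns (P[((A,B),C)], P[((B,C),A)], P[((A,C),B)]).
    """
    discordant = exp(-t) / 3
    return (1 - 2 * discordant, discordant, discordant)

def consistent_with_msc(probs, tol=1e-9):
    """Check concordant > 1/3 and a tie between the two discordant topologies."""
    concordant, d1, d2 = probs
    return concordant > 1 / 3 and d1 < 1 / 3 and abs(d1 - d2) <= tol

print(msc_triplet_probs(1.0), consistent_with_msc(msc_triplet_probs(1.0)))
print((0.4, 0.4, 0.2), consistent_with_msc((0.4, 0.4, 0.2)))
print((0.6, 0.3, 0.1), consistent_with_msc((0.6, 0.3, 0.1)))
```

Both example triples fail because the two discordant probabilities are unequal, whichever entry is taken as the concordant topology.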
Distinguishing sources of discordance
Unfortunately, it is often difficult to tell which biological processes are
contributing to discordance. Asymmetry in gene tree probabilities can be
caused by population structure, gene flow after speciation, or by
hybridization.
Just using topologies, it might not be possible to distinguish these
processes. However, these processes might imply different variances or
other distributional properties in the coalescence times, so these might
have information for distinguishing which processes generated the data.
For example, coalescence times might be bimodal under hybridization, but
not under population structure.
Conclusions
- Population structure is often overlooked as a source of incongruence
- Robustness of species tree methods in the presence of structure needs
to be investigated more