Molecular Biology & Evolution Letter Revised March 29, 2011 Fast and slow implementations of relaxed clock methods show similar patterns of accuracy in estimating divergence times Fabia U. Battistuzzi1, Paul Billing-Ross1, Aditya Paliwal1,2, and Sudhir Kumar*,1,2 1 Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, Tempe, AZ 85287-5301 2 School of Life Sciences, Arizona State University, Tempe, AZ 85287-4501 *Corresponding author: E-mail: [email protected] Address for Correspondence: Sudhir Kumar Biodesign A-240A Arizona State University Tempe, AZ 85287-5301 Phone: 480-727-6949 E-mail: [email protected] 1 Abstract Phylogenetic analyses are using increasingly larger datasets for estimating divergence times. With this increase in data sizes, the computation time required is becoming a bottleneck in evolutionary investigations. Our recent study of two relaxed-clock programs (BEAST and MultiDivTime) showed their usefulness in time estimation; however, they place a significant computational time burden on biologists even for moderately small datasets. Here, we report speed and accuracy of another relaxed-clock program (MCMCTree, MC2T). We find it to be much faster than both MDT and BEAST, while producing comparable time estimates. These results will encourage the analysis of larger datasets as well as the evaluation of the robustness of estimated times to changes in the model of evolutionary rates and clock calibrations. In Battistuzzi et al. (2010), we recently reported that the MultiDivTime (MDT) and BEAST software packages are accurate in estimating divergence times particularly when the model describing the change in evolutionary rate among lineages matched the underlying assumptions (autocorrelated [AR] or random rates [RR]) (Thorne and Kishino 2002; Drummond and Rambaut 2007). However, the computational time needed to complete calculations was rather large for even a small phylogeny (e.g., in BEAST) and exorbitant for large phylogenies (Fig. 1). Such requirements not only slow-down the pace of discovery, but ultimately discourage efforts to assess the robustness of inferred times to assumptions made regarding evolutionary rate models and calibrations used. Consequently, the quality of scientific discoveries also suffers. Therefore, a tool that performs as well as MDT and BEAST, but takes much less time is desired. Here we explored the use of the MCMCTree program, MC2T, which is based on a Bayesian framework for time estimation in the PAML package (Yang 2007). MC2T takes as input a tree topology, a sequence alignment, and bounds on clock calibrations. It first estimates branch lengths, which is followed by the inference of divergence times for each node in the tree based on the selection of prior distributions on clock calibrations (e.g., Gamma and skewed Normal) and the model of evolutionary rate change in the tree (AR or RR) (Battistuzzi et al. 2010). Using simulated sequences and perfect calibration 2 points (± 1 million years around the true time) we directly compared the computational speed and time accuracy of MC2T with those of MDT and BEAST. All three tools use independent algorithms and programming implementations of relaxed clock methods, and their direct comparison is interesting because of their applicability to estimating species divergence times. We first used alignments resulting from the concatenation of 10 individually simulated genes, which were the primary focus of our previous study (alignment length = ~8,000 − 23,000 base pairs) (Battistuzzi et al. 2010). In both the best and worst-case scenarios (model match and model violation), MC2T completes analyses at a fraction of the time taken by MDT or BEAST (Fig. 1E). At an average, MC2T took less than 2 minutes to generate divergence times with default parameters, but MDT consumed at least three-fold the time. BEAST required at least two orders of magnitude larger computational time. Furthermore, the rate of increase in computational time requirements increases much more slowly for MC2T than for MDT (Fig. 1D); BEAST computations did not complete for large datasets (see Supplementary Material). Does MC2T speed come at a price of decreased accuracy of inferred time estimates? To answer this question, we evaluated its accuracy in estimating divergence times under the cases where the rate change models assumed by the program match or violate the assumptions for generating sequences. In the model-match analyses, the normalized difference from true time of time estimates reported by MC2T shows a central tendency with an absolute mean that differs from the true time by 7.5% (Fig. 2A and B). MC2T shows greater difficulty in correctly estimating time when the lineage rates are autocorrelated (AR simulated datasets) as compared to RR simulated datasets (Fig. 2A and B histograms). This difficulty is also reflected in the success rate of the credibility intervals (CrIs) in including the true time: the success rate for AR datasets is only 86% as compared to 98% for RR datasets. In both cases, the width of the 95% CrIs was approximately one-third of the true time. As expected, divergence time estimates are less accurate when an incorrect model is chosen to describe the among lineage heterogeneity (model violation). The normalized distribution of estimated times is still symmetric, but has an absolute mean 20% higher than the model match case (Fig. 2C and D). Again, MC2T had greater difficulty with 3 accuracy for AR cases (Fig. 2C and D histograms) with CrIs success rates at 82% and 94% for AR and RR, respectively. The width of the 95% CrI was similar to the model match case. Overall, the above results are encouraging, as the estimated times were within 10% of the true time at an average, and the CrIs were not overly liberal. However, time estimates based on multiple concatenated gene sets are likely to be better because different sets of lineages evolve slower or faster than average in different genes, which will cancel (or at least reduce) rate differences across lineages (Battistuzzi et al. 2010). However, some empirical datasets will have genome-wide biases, which will not benefit from homogenization of the rate variations across lineages. Therefore, we evaluated the accuracy of MC2T for such a scenario and found divergence times to be, on average, 15% different from the true time (model match scenarios). Success rates were very high (96%), but the average width of CrIs were much larger (73% of the estimated time), because statistical methods cannot overcome uncertainty in the time estimate introduced by large degrees of rate variation present across lineages. Next, we compared MC2T results to those produced by MDT and BEAST. The implementation of AR and RR models in MC2T allows for a direct a comparison of this fast method with these two slower methods. In the model match cases, accuracies of estimated times for both the fast and slow methods show similar trends, with the slow method being 2.5% more accurate than the fast method (7.5% and 5.1% for fast and slow methods, respectively; these values are averages of AR and RR cases, see Fig. 2A and B). This is primarily because divergence times for AR cases are estimated less accurately in the fast method (Fig. 2A and 2B, compare line and bar graphs). Fast and slow methods show overall similar success rates (92% and 95%, respectively), but CrIs are ~20% larger in the fast method (26.5% and 34.5% for slow and fast methods, respectively). In contrast to the model match case, success rates for the slow methods (MDT and BEAST) are worse than those for the fast method (81% versus 88%) when the lineagerate model is violated. This is most likely due to larger CrI widths for the fast method (35% relative to the true time) as compared to the slow methods (27%). Instead, the average differences of time estimates from true times for fast and slow methods are similar (9.4% vs. 8.9%; Fig. 2C, D). Thus, one could conclude that slow and fast methods 4 have their individual strengths, but they are overall similar in accuracies despite differences in calculation time needed. Finally, we examined the performance of composite credibility intervals (cCrIs) yielded by fast and slow relaxed clock methods, because the actual model of rate variation among lineages is never known in real data analysis. For calculating cCrI for a given node in the tree, we use relaxed clock method with AR and with RR assumption to generate two CrIs. Then, the upper and lower bound for the cCrI is equal to the minimum of the two individual CrIs and the maximum of the two CrIs, respectively (Battistuzzi et al. 2010). Because of the fast computational speed of MC2T, this approach is more practical. The true time is included in cCrIs produced using MC2T in 89% − 98% of the simulated datasets (average = 96%), an improvement that is accompanied by a small widening of the confidence interval (0 – 12%; average = 4.5%). These results are similar to those reported previously for slower methods (97% accuracy with an average 10% CrI size increase). In summary, slow and fast relaxed-clock methods produce comparable results at least for datasets with small number of sequences. This enables researchers to evaluate the robustness of the estimated divergence times not only to the assumptions made regarding the model describing the inequality of lineage rates, but also to the taxon sampling and calibration boundaries. A more thorough application of molecular clocks will ultimately increase the confidence that scientists have in the estimated divergence times and result in more accurate timelines of species divergences and gene duplications. Methods We used previously simulated sequences, phylogeny, and calibration points (Battistuzzi et al. 2010) for evaluating the accuracy and speed of performance of the MC2T program (Yang 2007). MC2T analyses were carried out on a series of Intel Q8400 (2.66 GHz; 6−8 GB of memory; Windows 7) using the MC2T version with soft boundaries on calibrations; very similar results were obtained for the version that implements hard boundaries. The calculation times for MC2T included the time taken in all steps from sequence and tree inputs into PAML to the final timetree output. These MC2T computational times were compared to those of MultiDivTime (MDT) and 5 BEAST (v1.4.8), which were run on comparable computers (Intel Xeons 2.66−3.20 GHz; 4−24 GB of memory; UNIX). For these methods, computational times included the time taken for estimating branch length and divergence times. For making direct comparisons, the true topology was fully constrained in BEAST, MC2T, and MDT analyses, such that only the branch length and divergence times were optimized, using an exact and an approximate likelihood estimation in BEAST and in MC2T/MDT, respectively. In these analyses, chain lengths for the three methods were kept at default values (MC2T at 50,000; MDT at 1,000,000; BEAST at 10,000,000), because it is not possible to have prior knowledge of more appropriate values in practice. All programs reported reaching convergence with these defaults. We also estimated the accuracy of time estimates by increasing the numbers of independent samples (ESS values), which did not improve estimated times for the simulated data in our case (see Supplementary Information). Following Battistuzzi et al. (2010), we used three pairs of calibrations selected at different heights in the tree and described by perfect boundaries (uniform distribution of ±1 million year around the true time) (Fig. 1A). MC2T was used with approximate likelihood, and parameters coincided with those used to simulate the sequences. In MC2T, multiple values for the birth-death priors were tested following Yang (2006), and the combination of values that produced intermediate results and simulated more closely the birth-death process in MDT was chosen (λ = 2, μ = 2, ρ= 0.1) (Yang 2006; Yang and Rannala 2006; Rannala and Yang 2007). The fine-tuning step was repeated until values were within the range of 0.2−0.4. Other parameters (rgene_gamma, sigma2_gamma) correspond to the values used in MDT and BEAST, when applicable, and were optimized for each replicate independently. We also conducted some initial exploration of the effects of prior distribution on the estimated times, because of their importance in the Bayesian framework. We compared the prior distributions of MC2T to those of MDT for a subset of our simulations (10 concatenated gene datasets and 10 rate-synchronized datasets). As reported by others, MC2T showed older priors for the deepest nodes and younger priors for the shallower nodes as compared to MDT (Inoue, Donoghue, and Yang 2010), with 95% CrIs of the priors being narrower in MDT by an average of 20%. However, the posterior distributions were similar for these two methods. 6 Acknowledgments We thank Alan Filipski for helpful comments and discussions and Carol Williams for text editing support. This study was supported by the National Science Foundation and National Institute of Health to S. K., and by funds from the Howard Hughes Medical Institute through the Undergraduate Science Education Program and from the ASU School of Life Sciences. References Battistuzzi FU, Filipski A, Hedges SB, Kumar S. 2010. Performance of relaxed-clock methods in estimating evolutionary divergence times and their credibility intervals. Mol. Biol. Evol. 27:1289-1300. Drummond AJ, Rambaut A. 2007. Beast: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol 7:214. Inoue J, Donoghue PCJ, Yang Z. 2010. The impact of representation of fossil calibrations on bayesian estimation of species divergence times. Syst. Biol. 59:74-89. Rannala B, Yang Z. 2007. Bayesian estimation of species divergence times from multiple loci using multiple calibrations. Syst. Biol. 56:453-466. Thorne JL, Kishino H. 2002. Divergence time and evolutionary rate estimation with multilocus data. Syst. Biol. 51:689-702. Yang ZH. 2007. Paml 4: Phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24:1586-1591. Yang ZH. 2006. Computational molecular evolution. Oxford: Oxford University Press. Yang ZH, Rannala B. 2006. Bayesian estimation of species divergence times under a molecular clock using multiple fossil calibrations with soft bounds. Mol. Biol. Evol. 23:212-226. 7 Figure Legends Figure 1. Computational times of MDT, BEAST, and MC2T for a 14 sequence phylogeny (model-match case). (A) Phylogeny used to simulate sequences; calibration node pairs are h and k (46 and 10 million years ago; mya), h and d (46 and 81 mya), and j and k (23 and 10 mya). Distribution of computational times for (B) MDT and (C) BEAST. (D) Relationship between the number of sequences and the computational time for MDT and MC2T. (E) Overall comparison of computational times on a log-scale for MC2T, MDT, and BEAST. Figure 2. Distributions of the normalized difference (%) of estimated times from true times for slow (MDT and BEAST; lines) and fast methods (MC2T; histograms) when (A, B) the correct lineage rate variation [model match] and (C, D) the incorrect model [model-violation] were used. Graphs in the inset show histograms for the absolute values of the normalized differences. Numbers in parentheses are the mean and standard deviations of the distribution (in order). In the main panels, a negative value on the x-axis shows the degree of underestimation and positive values indicate overestimations. 8 A A B C D E F G H I J K L M f d h g b i e a c j k 200 150 100 l 50 0 Million years B 35 MDT frequency (%) 30 25 20 15 10 5 0 4 5 6 7 8 9 10 11 12 computational time (minutes) frequency (%) C 35 BEAST 30 25 20 15 10 5 0 5 10 15 20 25 30 35 40 computational time (hours) computational time (minutes) D 600 500 MDT 400 300 MC2T 200 100 0 0 E 90 10 20 30 40 # taxa MC2T MDT 50 60 70 BEAST 80 60 50 40 30 20 10 computational tim e (m inutes) 10 00 0 10 00 10 0 10 1 0 0. 1 frequency (%) 70 Figure 1 A Model match AR simulations 70 Frequency (%) 70 Slow (5.2%, 5.7) 60 60 50 Slow (-0.4%, 7.7) 50 40 30 Fast (9.4%, 9.7) 20 10 40 0 5 15 25 35 45 55 65 75 85 30 20 Fast (-2.4%,13.3) 10 75 65 55 35 45 25 5 15 -5 -15 -25 -35 -45 0 B Model match 70 RR simulations 70 60 Frequency (%) 60 50 40 Slow (5.0%, 4.3) 50 40 30 Fast (5.5%, 5.0) 20 Slow (-0.7%, 6.6) 10 0 5 15 25 35 45 55 65 75 85 30 Fast (-0.1%, 7.4) 20 75 65 55 45 25 15 5 -5 -15 -25 -35 -45 0 35 10 Percent time difference from the true time C Model violation 70 AR simulations 70 60 60 Frequency (%) 50 40 Slow (8.4%, 8.8) 50 40 30 Slow (-0.4%, 12.7) Fast (11.1%, 12.1) 20 10 0 5 15 25 35 45 55 65 75 85 30 Fast (-1.3%, 16.1) 20 10 75 65 55 45 35 25 15 5 -5 -15 -25 -35 -45 0 D Model violation 70 RR simulations 70 60 60 40 30 Fast (7.6%, 7.5) 20 Slow (-0.6%, 13.4) 10 0 5 15 25 35 45 55 65 75 85 30 Fast (-1%, 10.6) 20 10 Percent time difference from the true time 75 65 55 45 35 25 15 5 -5 -15 -25 -35 0 -45 Frequency (%) 50 40 Slow (9.3%, 10.0) 50 Figure 2 Supplementary Material Computer Simulations. In addition to our analysis of the accuracy of divergence times for the 14 taxa phylogeny described in the Methods section of the main text, we investigated the increase in computational times with an increasing number of sequences. For this purpose, we generated DNA sequence alignments containing 20 – 65 sequences that were simulated from corresponding-sized model trees derived from a master phylogeny of 765 taxa (Fig. S1). This master phylogeny was obtained by pruning taxa and groups from the tree of 1671 families in the Timetree of Life (Hedges and Kumar 2009), such that the final timetree was strictly bifurcating. The resultant timetree of 765 taxa spanned 4.2 billion years of evolution. This master topology was subsampled to generate model trees for simulating alignments containing 20 – 65 sequences. Branch lengths of the trees were determined by using a rate of 0.05 substitutions/site/billion years and by applying an autocorrelation parameter (ν, Thorne and Kishino 2002) of 1 to introduce rate variation among lineages. Because of the depth of the trees, the initial rate was selected to restrict maximum pairwise sequence divergence to be no more than 2 substitutions per site. Sequences were simulated using SeqGen (Rambaut and Grassly 1997) under the HKY model of nucleotide substitution (Hasegawa, Kishino, and Yano 1985) with G+C content of 48% and transition/transversion rate bias of 1.05. These substitution patterns were estimated from an alignment of 800 taxa of animals, fungi, plants, and archaebacteria. Sequence length was set at the average of the length of the 10gene concatenations used in the 14 taxa simulations (13,760 sites). For each phylogeny, we used four randomly selected perfect calibrations (±1 million years from the true time). All other parameters were identical to those used in the 14 taxa cases. Computational timing results from only MC2T and MDT were obtained and reported, because none of the BEAST analyses reached good mixing even with a threefold increase in the number of generations compared to the 14 taxa simulations. Accuracy measures. The accuracy of MC2T in dating divergence events was based on comparisons of estimated and true times as outlined in Battistuzzi et al. (2010). We estimated percent difference between estimated and true time and the percent CrIs around estimated times that included the true time. Results from all calibration scenarios and 1 simulation datasets were pooled. Additionally, we estimated the size of the CrIs normalized by the true time estimate, normalized CrI (NCrI) = (upper bound – lower bound)/true time. We used medians of NCrIs in order to avoid bias caused by a small fraction of outliers, especially when using MC2T. We also calculated composite CrIs (cCrIs) to incorporate the uncertainty in the time estimation given by the use of a potentially incorrect model of rate evolution (Battistuzzi et al. 2010) . It is important to note that Bayesian relaxed clock methods produce estimates of the effective number of independent parameter sets evaluated during the parameter sampling process, which is reported as an ESS (Effective Sample Size) value. An ESS ≥ 200 is considered satisfactory (Drummond et al. 2002). In our analysis, which employed default number of MCMC generations, MDT was found to always reach good ESS values (i.e., ≥ 95% of the parameters had ESS values > 200), while MC2T and BEAST reached good ESS values for approximately half of the samples (model match cases). For MC2T, we found that the accuracy of time estimates for datasets with ESS < 200 and ESS ≥ 200 were very similar (5.1% and 4.6% for time estimates, respectively). The difference was also small for BEAST (5.2% and 4.5%). As mentioned above, we used default settings for the number of generations in our analyses, with the smallest being for MC2T (50,000) and the largest being for BEAST (10,000,000). Because chain lengths can affect computational time required to complete the analyses, we evaluated the effect of using minimum (50,000) generations for all three methods on a subset of the replicates. MDT analyses converged (i.e., produced similar results in two independent runs) for the 14 taxa datasets. However, this lower number of generations decreased the time taken to complete analyses by ~30%, which happens because the estimation of branch lengths turns out to be the most time consuming step in MDT, rather than the MCMC sampling. BEAST analyses, instead, did not reach convergence and required the use of a large number of generations. References Battistuzzi FU, Filipski A, Hedges SB, Kumar S. 2010. Performance of relaxed-clock methods in estimating evolutionary divergence times and their credibility intervals. Mol. Biol. Evol. 27:1289-1300. 2 Drummond AJ, Nicholls GK, Rodrigo AG, Solomon W. 2002. Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics 161:1307-1320. Hasegawa M, Kishino H, Yano TA. 1985. Dating of the human ape splitting by a molecular clock of mitochondrial-DNA. J. Mol. Evol. 22:160-174. Hedges SB, Kumar S. 2009. Discovering the timetree of life. In: Hedges SB, and Kumar S, editors. The timetree of life. New York: Oxford University Press. Rambaut A, Grassly NC. 1997. Seq-gen: An application for the monte carlo simulation of DNA sequence evolution along phylogenetic frees. Comput. Appl. Biosci. 13:235-238. Thorne JL, Kishino H. 2002. Divergence time and evolutionary rate estimation with multilocus data. Syst. Biol. 51:689-702. 3
© Copyright 2024