Accessing the relative performance of fast molecular dating methods for phylogenomic data

doi:10.21203/rs.3.rs-1805291/v1

Research Article

This work is licensed under a CC BY 4.0 License

Due to advances in genome sequencing techniques, phylogenomic datasets have grown significantly. This massive amount of data represents a computational challenge for molecular dating with Bayesian approaches that relax the assumption of rate constancy. To overcome these issues, rapid molecular dating methods have been proposed over the last few decades. However, a comparative evaluation of their relative performance on empirical datasets is lacking. We analyzed 23 empirical phylogenomic datasets to investigate the performance of two commonly employed fast dating methodologies: penalized likelihood (PL), implemented in treePL, and the relative rate framework (RRF), implemented in RelTime. They were compared to Bayesian analyses under the same models and calibration settings. We found that the RRF was computationally faster and generally provided node age estimates statistically equivalent to Bayesian divergence times. Furthermore, in contrast to Bayesian dating, PL time estimates were excessively precise. To approximate Bayesian approaches, RelTime is an efficient method with significantly lower computational demand, being more than 100 times faster than treePL in some cases. Thus, to alleviate the computational burden of Bayesian divergence time inference in the era of massive genomic data, molecular dating can be facilitated using the RRF, so that evolutionary hypotheses can be tested more quickly and efficiently.

Bayesian analysis

confidence interval

divergence times

RelTime

treePL

BEAST

MCMCTree

PhyloBayes

Molecular dating is an essential component of contemporary evolutionary studies. The idea that substitutions accumulate in a time-correlated manner in molecular sequences has greatly impacted evolutionary biology since it was proposed in the 1960s [1–4]. Over the last decades, major breakthroughs in sequencing technologies have allowed the assembly of large molecular datasets to estimate divergence times between species [5–8]. Such huge datasets pose a computational burden to parameter-rich molecular dating methods that rely on Bayesian Markov chain Monte Carlo (MCMC) sampling, slowing the proposal and testing of evolutionary hypotheses [9–12]. Because of this, phylogenomic studies have frequently devised alternative strategies to compute biological timescales, including the use of reduced datasets [13–19] and the summarization of time estimates based on data partitioning schemes [20, 21].

Such limitations prompted the development of rapid methods to date lineage divergences that represent feasible alternatives to standard Bayesian molecular dating, hence accelerating evolutionary analysis in the big data era [22, 23]. Moreover, they compute divergence times without the premise of a strict molecular clock. Like Bayesian approaches, these efficient methods rely on their own assumptions, which mainly concern how substitution rates vary across the phylogenetic tree. Currently, the most frequently used approaches to accelerate the calculation of divergence times are the penalized likelihood (PL) [24] and the relative rate framework (RRF) [12, 25]. They have been applied to several branches of the Tree of Life, from prokaryotes to plants and animals [26–33]. Importantly, such methodologies are more environmentally friendly than highly parametric Bayesian analyses, as their associated carbon footprints are orders of magnitude smaller (Kumar 2022). Because of this, they might play an important role in the growing environmental awareness of bioinformatics research, i.e. green computing [34, 35].

Although both PL and RRF relax the assumption of rate constancy, they are fundamentally distinct. PL uses a penalty function to minimize rate changes between closely related branches globally [24]. Therefore, it assumes autocorrelation of evolutionary rates, which has been suggested to be pervasive across the tree of life [36, 37]. A key component of PL is the smoothing parameter (λ), which controls the global level of rate variation and is optimized by a cross-validation method. The lower the value of λ, the greater the level of rate variation across the phylogeny. PL was first implemented in the r8s software [38], and was later refined to deal with large phylogenies [39, 40]. RRF, in turn, minimizes the difference in evolutionary rates of ancestral and descendant lineages individually [12]. This eliminates the need to employ a global penalty function and still accommodates rate differences between sister lineages [23]. As a result, RRF does not require any additional analytical step, such as the cross-validation procedure, to select the optimal level of rate variation. It is implemented in the RelTime routine of the MEGA software [41].
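
As a rough illustration of the role of λ described above, the penalized-likelihood objective is commonly written in a form like the one below. The symbols Ψ, Φ, r_k and anc(k) are notation introduced here for illustration and simplify the published formulation, so this should be read as a sketch rather than the exact function optimized by r8s or treePL.

$$\Psi(\mathbf{r},\mathbf{t})=\log L(\mathbf{r},\mathbf{t}\mid X)-\lambda\,\Phi(\mathbf{r}),\qquad \Phi(\mathbf{r})=\sum_{k}\left(r_{k}-r_{\mathrm{anc}(k)}\right)^{2}$$

Here X is the sequence data (or the branch lengths derived from it), r_k is the rate on branch k and anc(k) is its parent branch. A large λ makes any rate change between parent and daughter branches costly, pushing the solution toward a clock-like tree, whereas a small λ lets rates vary almost freely.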

As they are currently implemented, PL and RRF also differ in the treatment of calibration information. While PL requires calibration information to be hard-bounded by minimum and/or maximum values [38], RRF via RelTime further allows the use of calibration densities [42]. Additionally, uncertainties associated with the estimates of node ages are handled differently. PL uses the bootstrap approach to compute error measures [38], whereas RelTime adopts an explicit analytical equation to calculate confidence intervals [42]. Both frameworks represent feasible alternatives to reduce computational requirements when compared to Bayesian relaxed clock methods. Because the algorithms of PL and RRF are distinct, they potentially impact divergence time estimates differently, and their relative performances compared to Bayesian approaches have not yet been evaluated with empirical datasets.

As PL and RRF have been increasingly used to estimate timescales in recent years, it is important to carry out such a large-scale evaluation against the popular Bayesian framework. While previous studies investigated both fast dating methods separately [22, 25, 40, 43–46], a joint assessment of their performance on empirical data is lacking [47]. Moreover, treePL, which is the most popular implementation of PL for large phylogenies, has not been extensively compared to any Bayesian method, and there is little information on how they behave comparatively on real data. In this sense, the accumulation of empirical phylogenomic datasets made available in recent years provides the ideal opportunity to investigate the relative performances of rapid and Bayesian methods.

We collected empirical datasets from 23 phylogenomic studies to measure the relative performance of fast dating methods as compared to Bayesian divergence time inference. Studies were selected based on the availability of Bayesian timetrees (or the input files used to carry out Bayesian inference) plus alignment data, deposited in public databases or as supplementary information. The data retrieved comprise DNA and amino acid sequences from diverse biological lineages with divergences as old as the Cambrian/Precambrian. The number of sequences ranged from tens to nearly a thousand and alignment lengths from ~5 kb to >4 Mb. Alignment lengths, data types, number of terminals, calibration information, methodology originally employed, as well as the labels used to refer to each study hereafter, are summarized in Table 1.

Table 1

Detailed information about the phylogenomic datasets analyzed.
Data reference	Label	Biological group	Data type^a	Site count	Taxa count	Calibration count	Software used	Substitution model^b
Allio et al. (2020)	Allio20	Arthropoda	AA	288,446	61	5	PhyloBayes	JTT + F + G₅ + I
Anderson et al. (2017)	Anderson17	Annelida	AA	16,541	39	3	PhyloBayes	JTT + G₅ + I
Blaimer et al. (2018)	Blaimer18	Arthropoda	N	33,874	155	7	BEAST	GTR + G₅
Borowiec (2019)	Borowiec19	Arthropoda	N	44,079	162	3	BEAST	GTR + G₄
Chazot et al. (2019)	Chazot19	Arthropoda	N	6,260	994	22	BEAST	GTR + G₅
Delsuc et al. (2018)	Delsuc18	Chordata	AA	66,593	63	11	PhyloBayes	LG + G₅^c
Delsuc et al. (2019)	Delsuc19	Chordata	N	15,157	40	4	PhyloBayes	GTR + G₄
dos Reis et al. (2018)	dosReis18	Chordata	N	61,132	372	17	MCMCTree	GTR + G₄
Fang et al. (2018)	Fang18	Chordata	N	8,079	128	3	BEAST	GTR + G₅
Feng et al. (2017)	Feng17	Chordata	N	88,302	164	20	MCMCTree	GTR + G₅
Hedin et al. (2019)	Hedin19	Arthropoda	N	71,483	27	3	PhyloBayes	GTR + G₅
Hughes et al. (2018)	Hughes18	Chordata	N	10,203	305	31	MCMCTree	HKY + G₅
Irisarri et al. (2017)	Irisarri17	Chordata	AA	14,043	100	14	PhyloBayes	JTT + F + G₄ + I
Johnson et al. (2019)	Johnson19	Arthropoda	N	131,013	193	23	MCMCTree	GTR + G₅
Kuntner et al. (2019)	Kuntner19	Arthropoda	N	89,212	34	2	MCMCTree	HKY + G₅
Pereira et al. (2017)	Pereira17	Chordata	N	12,354	294	22	MCMCTree	GTR + G₅
Pessoa-Filho et al. (2017)	PessoaFilho17	Streptophyta	N	135,255	30	1	BEAST	GTR + G₄
Peters et al. (2017)	Peters17	Arthropoda	AA	75,904	174	14	MCMCTree	JTT
Peters et al. (2018)	Peters18	Arthropoda	AA	1,469,006	48	3	MCMCTree	JTT
Ran et al. (2018)	Ran18	Streptophyta	N	4,246,454	16	4	MCMCTree	GTR + G₅
Sann et al. (2018)	Sann18	Arthropoda	N	284,607	184	10	MCMCTree	GTR + G₄
Wolfe et al. (2019)	Wolfe19	Arthropoda	AA	5,994	95	19	PhyloBayes	JTT + G₄^c
Yonezawa et al. (2017)	Yonezawa17	Chordata	N	873,274	45	6	MCMCTree	GTR + G₈
^aN = nucleotide; AA = amino acid.
^bThe model that was used for most partitions, if applicable. The number of discrete categories to approximate the Gamma distributions is shown.

The original studies have employed a Bayesian relaxed clock methodology as implemented in BEAST, MCMCTree or PhyloBayes, except for Kuntner et al. (2019), hereafter Kuntner19, which estimated divergence times using the RRF. In this case, the Bayesian timescale was inferred for the first time. Whenever possible, timetrees were directly obtained from the original works. Otherwise, divergence times were estimated using the input files provided by the authors. We also tried to keep substitution models matching the original studies. However, studies that used CAT models of amino acid substitution implemented in PhyloBayes [48] were subjected to model selection in MEGA X [41]. If the original study applied data partitioning with distinct substitution models, we chose the model used in most partitions.

Fast divergence time inference

We used the same alignment, topology and calibration information as originally employed by the authors to estimate absolute times in RelTime [12, 25] and treePL [40]. To standardize computation, all analyses were carried out on a machine with 3.2 GHz 6-Core Intel® i7 processor and 64 GB 2667 MHz DDR4 RAM. All branch lengths (in substitutions per site) used by both methods were estimated in MEGA X. RelTime calculations were performed with the command line version of MEGA X, and the confidence intervals (CI) of divergence times were calculated analytically, as implemented by the method.

In treePL, the program was first run with the ‘prime’ option to select the best optimization parameters. A cross-validation procedure was then performed to optimize the smoothing parameter for each dataset [24], with 10 optimization iterations and 10¹⁷ simulated annealing iterations. The ‘cvstart’ and ‘cvstop’ parameters were set to 10¹⁷ and 10⁻¹⁹, respectively, resulting in 37 smoothing parameter values tested. All analyses were run with the ‘thorough’ option. treePL CIs of time estimates were calculated from 100 bootstrap replicates, which were subsequently summarized in TreeAnnotator [49].
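
The count of 37 candidate values follows if successive candidates differ by a factor of ten between the two limits, which is the assumption made in the short sketch below. The function name `smoothing_grid` is hypothetical and is not part of treePL.

```python
def smoothing_grid(start_exp=17, stop_exp=-19):
    """Return candidate smoothing values 10^start_exp down to 10^stop_exp.

    Assumption: one candidate per power of ten, inclusive of both limits.
    """
    return [10.0 ** e for e in range(start_exp, stop_exp - 1, -1)]


grid = smoothing_grid()
print(len(grid))  # 37 candidates, from 1e17 down to 1e-19
```

Each candidate would then be scored by cross-validation, and the value with the lowest cross-validation error retained for the final dating run.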

Regarding calibration information, whenever the original studies employed uniform priors, the constraints of the uniform distributions were provided as minimum and maximum node boundaries in treePL, while in RelTime they were set as the lower and upper limits of a uniform distribution. When probability distributions other than the uniform were originally used, namely the normal, lognormal, exponential and skew-t distributions, they were also used in RelTime, except for the skew-t distribution, which is currently unavailable in this software. It was thus approximated by a normal distribution using the sn [50] and fitdistrplus [51] packages in R [52]. As treePL implements only minimum and/or maximum values as calibrations, we derived minimum and maximum bounds based on the 95% cumulative probability of the density distributions. For the skew-t distribution, we followed the same procedure, but using the normal distribution approximated for RelTime. Because the assumption of equal rates of evolution between ingroup and outgroup sequences is not testable (Kumar et al., 2016), all calibrations located on the root node and on the outgroup clade were automatically removed for both treePL and RelTime. As previously mentioned, Kuntner19 originally used only RelTime. We thus inferred a Bayesian timescale in MCMCTree [54, 55] under the same calibration information, employing the independent rates prior with the HKY + G(5) substitution model [56]. The Markov chain Monte Carlo analysis was run twice to check for convergence, and each chain was sampled every 100th cycle until the effective sample size (ESS) values were greater than 200.
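
The conversion of calibration densities into hard bounds for treePL can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes that "95% cumulative probability" means the central 95% interval (the 2.5% and 97.5% quantiles), and the function names, parameter names and optional `offset` argument are hypothetical.

```python
import math
from statistics import NormalDist


def bounds_normal(mean, sd, mass=0.95):
    """Central interval of a normal calibration density."""
    tail = (1.0 - mass) / 2.0
    dist = NormalDist(mean, sd)
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


def bounds_lognormal(mu, sigma, offset=0.0, mass=0.95):
    """Central interval of a lognormal density with log-scale parameters
    mu and sigma, optionally shifted by a minimum age (offset)."""
    lo, hi = bounds_normal(mu, sigma, mass)
    return offset + math.exp(lo), offset + math.exp(hi)


def bounds_exponential(mean, offset=0.0, mass=0.95):
    """Central interval of an exponential density with the given mean,
    optionally shifted by a minimum age (offset)."""
    tail = (1.0 - mass) / 2.0
    lo = -mean * math.log(1.0 - tail)   # quantile at probability `tail`
    hi = -mean * math.log(tail)         # quantile at probability 1 - tail
    return offset + lo, offset + hi


# Example: a normal calibration with mean 100 and SD 10 (arbitrary time units)
print(bounds_normal(100.0, 10.0))
```

The resulting pair would be supplied to treePL as the minimum and maximum age of the calibrated node. A one-sided reading of the 95% criterion would give different bounds, so the interval convention should be checked against the original analysis files.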

Evaluation of relative performance

To contrast RelTime and treePL estimates with those derived by Bayesian methods, we calculated a series of metrics. For Bayesian time estimates, either the mean or the median of the posterior distribution of divergence times was used, depending on which value was reported in the original study. For each dataset, we carried out linear regressions of RelTime- and treePL-derived estimates against Bayesian estimates. The coefficient of determination (R²) and the slope (β) of the linear regression through the origin were used as summary statistics to measure the correlation between fast and Bayesian dating methods.
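
A regression through the origin has a closed-form slope, so the two summary statistics can be sketched in a few lines. The function name is hypothetical, and the R² below uses the uncentered total sum of squares (the convention R applies to models without an intercept); the text does not state which convention was used, so that choice is an assumption.

```python
def regression_through_origin(bayes_ages, fast_ages):
    """Fit fast = beta * bayes with no intercept; return (beta, r_squared)."""
    sxy = sum(x * y for x, y in zip(bayes_ages, fast_ages))
    sxx = sum(x * x for x in bayes_ages)
    beta = sxy / sxx
    # Residual and (uncentered) total sums of squares
    ss_res = sum((y - beta * x) ** 2 for x, y in zip(bayes_ages, fast_ages))
    ss_tot = sum(y * y for y in fast_ages)
    return beta, 1.0 - ss_res / ss_tot


# Toy example: fast ages exactly twice the Bayesian ages
beta, r2 = regression_through_origin([10.0, 20.0, 30.0], [20.0, 40.0, 60.0])
print(beta, r2)  # 2.0 1.0
```

Under this parameterization, β > 1 means the fast method returns older ages than the Bayesian analysis and β < 1 means younger ages.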

For each dataset, the average difference between fast dating methods and Bayesian time estimates was normalized to make it comparable across studies, which focused on various depths of the Tree of Life. Given n divergence times in a dataset, with t_i denoting the age of the i-th node, the average difference was calculated as follows.

$$\bar{D}=\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\left|t_{i,\mathrm{FAST}}-t_{i,\mathrm{BAYES}}\right|}{t_{i,\mathrm{BAYES}}}\right)\times 100\%$$
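
The formula above translates directly into code. The sketch below uses a hypothetical function name and assumes the two lists hold ages for the same nodes in the same order.

```python
def mean_percent_difference(fast_ages, bayes_ages):
    """Mean absolute difference between fast and Bayesian node ages,
    expressed as a percentage of the Bayesian age of each node."""
    n = len(bayes_ages)
    total = sum(abs(f - b) / b for f, b in zip(fast_ages, bayes_ages))
    return 100.0 * total / n


# Two nodes, each 10% away from the Bayesian estimate
print(mean_percent_difference([110.0, 45.0], [100.0, 50.0]))
```

Because each difference is divided by the Bayesian age, a 5-million-year discrepancy counts the same at a 50-million-year node as a 50-million-year discrepancy at a 500-million-year node.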

Additionally, the CIs of fast methods were contrasted with the Bayesian measures of uncertainty reported in the original study, either the highest posterior densities (HPDs) or the credibility intervals (CrIs). Although fundamentally different from a statistical standpoint, these metrics are generally regarded as measures of the uncertainty and precision associated with the time estimate. Thus, RelTime CIs, treePL CIs and HPDs/CrIs from Bayesian analyses will hereafter be referred to simply as CIs.

For each dataset, two metrics were computed using CIs: the median CI width and the CI coverage. The median CI width of a method for each dataset was calculated as follows. For the i-th node age estimate, the difference between the maximum (t_max) and minimum (t_min) limits of the CI was normalized by the estimated node age (t).

$$\mathrm{CI\ width}_{i}=\frac{t_{i,\max}-t_{i,\min}}{t_{i}}$$

Therefore, the CI widths of a dataset were expressed as fractions of the estimated node ages, and their median value was calculated. Importantly, this measure was computed excluding nodes with ages smaller than 10⁻¹⁰, to avoid division by values near zero. Finally, the CI coverage is a measure analogous to a success rate, as it indicates the frequency with which node age estimates from fast methods fell within the CIs of Bayesian analyses. This frequency was computed for each dataset.
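
Both CI metrics can be sketched as below. Function and argument names are hypothetical; the 10⁻¹⁰ age threshold comes from the text, and treating the CI limits as inclusive when computing coverage is an assumption.

```python
from statistics import median


def median_relative_ci_width(ages, ci_lower, ci_upper, min_age=1e-10):
    """Median of (upper - lower) / age, skipping nodes younger than min_age."""
    widths = [
        (hi - lo) / t
        for t, lo, hi in zip(ages, ci_lower, ci_upper)
        if t >= min_age
    ]
    return median(widths)


def ci_coverage(fast_ages, bayes_lower, bayes_upper):
    """Fraction of fast-method node ages lying inside the Bayesian CI."""
    hits = sum(
        lo <= t <= hi
        for t, lo, hi in zip(fast_ages, bayes_lower, bayes_upper)
    )
    return hits / len(fast_ages)


# Three nodes; the zero-age node is excluded from the width calculation
print(median_relative_ci_width([100.0, 50.0, 0.0], [90.0, 40.0, 0.0], [110.0, 70.0, 0.0]))
print(ci_coverage([95.0, 80.0], [90.0, 40.0], [110.0, 70.0]))  # 0.5
```

A median relative width of 0.4 corresponds to a CI spanning 40% of the estimated node age, the scale on which the percentages in the Results are reported.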

We tested whether the number of terminals, the number of sites in the alignment and the calibration density (number of calibrations divided by the number of tree nodes) impacted the association between the Bayesian estimates and those from both fast-dating methods. Linear models were inferred using 1) the absolute deviations of the slope of the regression lines from 1 or 2) the mean squared errors (MSEs) as response variables. The importance of each feature was assessed by the varImp function [57] of the caret R package [58].
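
The predictor and response variables used in these linear models can be sketched as follows. All names are hypothetical. Two points are assumptions rather than statements from the text: the "number of tree nodes" is taken as the number of internal nodes of a rooted binary tree (taxa minus one), and the MSE is computed between fast and Bayesian node ages.

```python
def calibration_density(n_calibrations, n_taxa):
    """Calibrations per internal node, assuming a rooted binary tree
    with n_taxa - 1 internal nodes (an assumption, see lead-in)."""
    return n_calibrations / (n_taxa - 1)


def slope_deviation(beta):
    """Absolute deviation of the regression slope from a perfect fit (1)."""
    return abs(beta - 1.0)


def mean_squared_error(fast_ages, bayes_ages):
    """Mean squared difference between fast and Bayesian node ages."""
    n = len(bayes_ages)
    return sum((f - b) ** 2 for f, b in zip(fast_ages, bayes_ages)) / n


# Chazot19 in Table 1: 22 calibrations on a 994-taxon tree
print(calibration_density(22, 994))
```

Each dataset contributes one row of predictors (taxa count, site count, calibration density) and one value of each response, which is the input a linear model and a variable-importance routine would need.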

Fast methods produced time estimates that were highly correlated with Bayesian time estimates, regardless of the Bayesian method employed. All R² values of the linear regressions between fast-method and Bayesian node ages were ≥ 0.94, and most were higher than 0.98 for both treePL and RelTime. The slopes of the regression lines indicated a close correspondence between rapid methodologies and Bayesian node ages (Fig. 1a). The median slope values were 0.98 and 0.95 for treePL and RelTime, respectively. Nevertheless, the slopes of the regression lines between treePL and Bayesian time estimates presented a larger variance than those between RelTime and Bayesian node ages. For the Peters18 dataset, the comparison of treePL and Bayesian time estimates presented a β of 1.99, indicating that treePL node ages were generally 99% older than the times inferred by MCMCTree. For this dataset, RelTime node ages led to a β of 1.46 when compared to Bayesian divergence times. For three other datasets, treePL estimates presented very high β values when compared against Bayesian estimates: PessoaFilho17 (β_treePL = 1.57, β_RelTime = 1.15), Allio20 (β_treePL = 1.58, β_RelTime = 1.09) and Peters17 (β_treePL = 1.6, β_RelTime = 1.16). On the other hand, treePL produced much younger times for the dataset of Fang18 (β_treePL = 0.54, β_RelTime = 0.75). The highest β recovered for RelTime was for the dataset of Ran18 (β_RelTime = 1.5), which was very similar to the β recovered for treePL (β_treePL = 1.48). The lowest β values produced by RelTime node ages were for the datasets of Hedin19 (β_treePL = 0.54, β_RelTime = 0.75) and Fang18 (β_treePL = 0.78, β_RelTime = 0.75). Comparisons between Bayesian and fast-method time estimates per dataset can be accessed in Supporting information 1.

The distribution of treePL $\bar{D}$ values was also wider than that of RelTime (Fig. 1b). RelTime estimates were, on average, more similar to Bayesian time estimates, as the mean $\bar{D}$ was 26.5% for RelTime and 38.3% for treePL. When treePL was used to estimate divergence times, several datasets led to estimates that were on average more than 50% different from the Bayesian node ages. Conversely, RelTime molecular dates were, on average, more than 50% different from the Bayesian estimates for a single dataset (Ran18). For this dataset, both treePL and RelTime generated node ages that were around 60% different from Bayesian times. For most of the analyzed datasets (70%), RelTime produced time estimates that were on average less than 30% different from the Bayesian ones, while treePL estimated node ages that were less than 30% different from Bayesian times for only 26% of the datasets (Supporting information 2).

Regarding the precision of time estimates, treePL provided narrower CIs than Bayesian analyses (Fig. 2a). On the other hand, RelTime produced CI widths that were larger than Bayesian CIs. The distribution of the median CI widths across all datasets analyzed was centered around 16.4% for treePL, compared with around 64.3% for Bayesian analyses and 102.3% for RelTime. For some of the datasets (48%), treePL CIs sometimes did not include the node ages estimated by the method itself. In these cases, up to 9% of the node ages did not fall within the CI generated for treePL by the bootstrap approach. Regarding the frequency with which fast-method divergence times were included within the Bayesian credibility intervals, treePL and RelTime presented a similar performance. Mean coverage values were centered around 77.3% for RelTime node ages and around 67.3% for treePL (Fig. 2b). The percentage of datasets for which less than half of the estimated node ages of a phylogeny fell within the Bayesian CIs was 45% for treePL and 27% for RelTime. On the other hand, for 32% and 45% of the studies, more than 80% of the node ages fell within the Bayesian CIs when using treePL and RelTime, respectively.

For both fast-dating methods, deviation from the slope $\beta = 1$ was significantly explained by the three features investigated (p < 0.001 and R² = 0.59 for RelTime and p < 0.005 and R² = 0.40 for treePL). The data feature with the highest importance in determining the deviation from a perfect fit to Bayesian estimates was the number of sites in the alignment (importance of 60% for RelTime and 37% for treePL). For explaining MSEs, the calibration density was the feature with the highest importance for RelTime (69%, p < 0.001 and R² = 0.50), while treePL MSEs were not significantly predicted by any of the features analyzed (p > 0.05). Therefore, for RelTime, increasing the density of calibrations resulted in time estimates that differed more from those of the Bayesian analysis.

Computational efficiency differed markedly between the fast methods (Fig. 3). Average running times were around 51.8 hours for treePL and 0.9 hours for RelTime. For most of the datasets, treePL took more than 24 hours to complete the calculations. In fact, RelTime usually took less than 2% of the treePL running time, often being more than 60 times faster than treePL (Fig. 3). Because CIs are essential to retrieve uncertainty measures for divergence time estimates, treePL running times included the estimation of branch lengths for the one hundred bootstrap replicates that were used to compute treePL CIs.

We provided the first comprehensive analysis of two of the most frequently used fast dating methodologies against Bayesian molecular dating, employing several empirical phylogenomic datasets from distinct biological groups, including up to hundreds of taxa. We measured differences in node age estimates, coverage of the Bayesian credibility intervals and computational time efficiency. Our findings indicate that the RRF, as implemented in RelTime, is a fast alternative to time-consuming molecular dating software. RelTime was much faster and generally provided time estimates closer to the Bayesian node ages than treePL. treePL, which is considered a fast algorithm for molecular dating, required significant computational time. This was due to the bootstrapping strategy used to compute confidence intervals of time estimates. As CIs are necessary to interpret biological scenarios derived from timetrees, their calculation entailed running times that were comparable to those of Bayesian approaches, with some running times of more than one month.

Studies that have evaluated treePL performance against other approaches are scarce. The original work describing its implementation performed an evaluation using simulated and empirical data [40]. However, the simulations did not include alignments, as the divergence times were directly inferred from the true tree, and the empirical datasets did not consist of several loci. Previous works employing both Bayesian approaches and treePL compared time estimates for specific taxa [59, 60], and their results are contrasting, with treePL leading either to older time estimates and narrower CIs than BEAST in angiosperm evolution [60], or to younger node ages and wider CIs than BEAST in a flowering plant family [59]. In the present study, treePL CIs were narrower than Bayesian CIs for all datasets analyzed. This result is expected, because the bootstrap procedure leads to reduced parametric uncertainty as the number of sites increases, which is the case for phylogenomic data. Regarding time estimates, we found that treePL tended to produce older estimates than Bayesian analyses (Fig. 1a). This is in agreement with other works that have compared PL to Bayesian and non-Bayesian approaches [61–64]. It is already known that, because of optimization issues, PL may provide overly ancient divergence time estimates when there is no calibration information to limit node ages near the root [65]. The absence of efficient time constraints at deeper nodes was, in fact, common to all the analyses where older estimates were obtained (β > 1.1). For most of these datasets, treePL placed the age of the deep nodes exactly at or very close to the values provided as loose maxima. Additionally, our findings corroborate Barba-Montoya et al. (2021), where treePL was more impacted by small deviations from the molecular clock. This probably resulted in more asymmetrical distributions of $\bar{D}$ values for treePL, while RelTime presented lower asymmetry (Supporting information 2).

Comparisons between time estimates retrieved by the RRF and Bayesian methods have been carried out in several empirical studies [12, 22, 25, 42, 43, 66–68]. Mello et al. (2017) and Tao et al. (2020) employed phylogenomic datasets and found that RelTime produced reliable time estimates when compared to BEAST and MCMCTree. Here, we extended these findings to the PhyloBayes software, which implements more sophisticated substitution models. Although MEGA does not provide the option to use the site-heterogeneous models implemented in PhyloBayes, times inferred employing the simpler models available in MEGA exhibited good correspondence to PhyloBayes estimates. The equivalence between timescales from simple and complex homogeneous substitution models was reported elsewhere [69]. We confirmed this finding and showed that it can be extended to site-heterogeneous substitution models.

If researchers need a faster alternative to Bayesian dating, our work demonstrated the good performance of RelTime’s RRF when compared to treePL. Besides providing node ages closer to Bayesian estimates, RelTime-inferred ages lay within Bayesian CIs more frequently. Recently, using simulated data, Barba-Montoya et al. (2021) also recovered a greater accuracy for RelTime when compared to other fast dating methods, particularly when autocorrelated rates were used. We showed that for empirical phylogenomic datasets, in which the true rate model is unknown, RelTime also performed better than treePL in approximating the standard Bayesian procedure.

Although Bayesian credibility intervals are not strictly comparable with confidence intervals produced by bootstrapping, empirical biologists use both metrics as measures of uncertainty of the estimated node ages when testing evolutionary scenarios. Therefore, we were prompted to contrast the CI widths of Bayesian, RelTime and treePL estimates to compare the degree of uncertainty of the inferred times. On average, treePL CIs were 5 times narrower than Bayesian CIs, while RelTime produced CIs that were ca. 2 times wider than the Bayesian CIs. The greater precision of treePL when compared to Bayesian methods was also recovered using simulated data [47]. Simulations have also shown that RelTime CIs exhibit equivalent or greater coverage probabilities than Bayesian approaches [42]. Therefore, our analysis of empirical datasets confirmed that the uncertainty associated with RelTime estimates is closer to the Bayesian CIs than that associated with treePL estimates.

Besides having good statistical properties, fast dating methods are expected to reduce computational time significantly. We demonstrated that, on average, RelTime was 60 times faster than treePL. In the age of big data, such a speed-up makes large-scale testing of biological hypotheses feasible. Moreover, previous works based on simulations that assessed PL performance against Bayesian approaches and RelTime found that it performed worse than these methods under various scenarios of heterogeneous rates [25, 70]. These findings, together with our results that confirmed the speed of RelTime, demonstrate the usefulness of the RRF to obtain biological timescales for large datasets.

The discrepancy between divergence time estimates from fast-dating and Bayesian methods was primarily influenced by the alignment length. Longer alignments resulted in larger differences between methods. This result is expected if methods rely on different modeling assumptions regarding parameters and evolutionary rate variation. Consequently, as sample size approaches infinity, estimates become significantly different. Although it seems that fast-dating and Bayesian methods are asymptotically different, even in the dataset with the longest alignment length (Ran18), the inferred timescales led to very similar biological interpretation of the scenario in which lineage diversification took place. For RelTime, calibration density significantly impacted the MSE of time estimates, implying that, besides alignment length, increasing the number of time constraints also makes the differences between methods more pronounced [47].

While previous work has argued that the RRF may not be suitable to infer divergence times for deep-time datasets, leading to overly old time estimates [68], our analyses did not support this claim. Also, in contrast with a previous study [67], our results indicate that the strategy used by RelTime to calibrate timetrees [42] is as appropriate as the Bayesian calibration priors, yielding close correspondence between the timescales from both methods for most of the datasets (for ~78% of the datasets, β values deviated less than 0.2 from 1).

It is worth mentioning that larger differences between Bayesian analysis and RelTime may be retrieved at nodes connecting branches with lengths close to zero. Such a lack of substitutions along branches causes RelTime to estimate more recent node ages. The fact that fast methods use branch lengths to estimate divergence times without relying on priors for node ages implies that, when some branches have near-zero substitutions, they underestimate times when compared to Bayesian analysis. This occurs because divergence time priors assign lengths > 0 even when no substitutions are observed, as is the case for the coalescent prior [71]. This may also affect treePL estimates, as observed for the dataset of Fang18 (Supporting information 2), although treePL may also assign non-zero time values to branches in which the number of accumulated substitutions is effectively zero [40], leading to older inferred times than RelTime.

Our comparative analysis using a comprehensive set of empirical datasets has shown that fast dating methods are a viable alternative to time-consuming Bayesian methods to infer node ages for large-scale datasets. Additionally, we demonstrated that the RRF approach implemented in RelTime performed better, with a lower computational demand. Thus, we emphasize the efficacy of the RRF in establishing molecular timescales with excellent correspondence to those inferred by Bayesian approaches. Timescales from different dating frameworks were impacted by alignment length, suggesting that their asymptotic properties are different, although the biological meaning of the estimates did not change significantly. Furthermore, the quick estimation of confidence intervals of node ages allows for robust testing of alternative evolutionary hypotheses, alleviating the computational burden brought forth by big data in biology.

Ethical Approval and consent to participate: not applicable.

Consent to Publication: not applicable.

Data Availability statement

The datasets analyzed during the current study are available from the corresponding author on reasonable request.

Conflict of interest

The authors declare that they have no competing interests.

Funding

This research was supported by grants from the Brazilian Research Council (CNPq, 409152/2018-8 and 309165/2019-9) and Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ, E-26/211.248/2019, E-26/201.911/2019 and E-26/201.446/2022). FPC was supported by scholarships from CNPq (132838/2019-2) and FAPERJ (E-26/200.170/2020).

Acknowledgment

We thank the reviewers for helpful comments on previous versions of this manuscript.

Author contribution

BM conceived the ideas, and BM and FPC designed methodology; BM and FPC collected the data; BM, FPC and CGS analyzed the data; BM, FPC and CGS discussed results; BM and FPC led the writing of the manuscript. All authors contributed critically to the drafts and gave final approval for publication.

Doolittle RF, Blomback B. Amino-Acid Sequence Investigations of Fibrinopeptides from Various Mammals: Evolutionary Implications. Nature. 1964;202:147.
Margoliash E. Primary Structure and Evolution of Cytochrome C. Proc Natl Acad Sci U S A. 1963;50:672–9.
Zuckerkandl E, Pauling L. Molecular disease, evolution, and genic heterogeneity. In: Kasha M, Pullman B, editors. Horizons in Biochemistry. New York, USA: Academic Press; 1962. p. 189–225.
Zuckerkandl E, Pauling L. Evolutionary Divergence and Convergence in Proteins. In: Evolving Genes and Proteins. Elsevier; 1965. p. 97–166.
Blair C, Bryson RW, Linkem CW, Lazcano D, Klicka J, McCormack JE. Cryptic diversity in the Mexican highlands: Thousands of UCE loci help illuminate phylogenetic relationships, species limits and divergence times of montane rattlesnakes (Viperidae: Crotalus). Mol Ecol Resour. 2019;19:349–65.
Givnish TJ, Zuluaga A, Spalink D, Soto Gomez M, Lam VKY, Saarela JM, et al. Monocot plastid phylogenomics, timeline, net rates of species diversification, the power of multi-gene analyses, and a functional model for the origin of monocots. Am J Bot. 2018;105:1888–910.
Tarver JE, dos Reis M, Mirarab S, Moran RJ, Parker S, O’Reilly JE, et al. The Interrelationships of Placental Mammals and the Limits of Phylogenetic Inference. Genome Biol Evol. 2016;8:330–44.
Yang L, Su D, Chang X, Foster CSP, Sun L, Huang C-H, et al. Phylogenomic Insights into Deep Phylogeny of Angiosperms Based on Broad Nuclear Gene Sampling. Plant Commun. 2020;1:100027.
Battistuzzi FU, Billing-Ross P, Paliwal A, Kumar S. Fast and Slow Implementations of Relaxed-Clock Methods Show Similar Patterns of Accuracy in Estimating Divergence Times. Mol Biol Evol. 2011;28:2439–42.
Bromham L, Duchêne S, Hua X, Ritchie AM, Duchêne DA, Ho SYW. Bayesian molecular dating: opening up the black box. Biol Rev Camb Philos Soc. 2018;93:1165–91.
Crosby RW, Williams TL. Fast algorithms for computing phylogenetic divergence time. BMC Bioinformatics. 2017;18:514.
Tamura K, Tao Q, Kumar S. Theoretical Foundation of the RelTime Method for Estimating Divergence Times from Variable Evolutionary Rates. Mol Biol Evol. 2018;35:1770–82.
Aardema ML, Stiassny MLJ, Alter SE. Genomic Analysis of the Only Blind Cichlid Reveals Extensive Inactivation in Eye and Pigment Formation Genes. Genome Biol Evol. 2020;12:1392–406.
Del Cortona A, Jackson CJ, Bucchini F, Van Bel M, D’hondt S, Škaloud P, et al. Neoproterozoic origin and multiple transitions to macroscopic growth in green seaweeds. Proc Natl Acad Sci U S A. 2020;117:2551–9.
Helmstetter AJ, Béthune K, Kamdem NG, Sonké B, Couvreur TLP. Individualistic evolutionary responses of Central African rain forest plants to Pleistocene climatic fluctuations. Proc Natl Acad Sci U S A. 2020;117:32509–18.
Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, et al. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science. 2014;346:1320–31.
Percequillo AR, Prado JR do, Abreu EF, Dalapicolla J, Pavan AC, de Almeida Chiquito E, et al. Tempo and mode of evolution of oryzomyine rodents (Rodentia, Cricetidae, Sigmodontinae): A phylogenomic approach. Mol Phylogenet Evol. 2021;159:107120.
Smith SA, Brown JW, Walker JF. So many genes, so little time: A practical approach to divergence-time estimation in the genomic era. PLOS ONE. 2018;13:e0197433.
Wolfe JM, Breinholt JW, Crandall KA, Lemmon AR, Lemmon EM, Timm LE, et al. A phylogenomic framework, evolutionary timeline and genomic resources for comparative studies of decapod crustaceans. Proc R Soc B Biol Sci. 2019;286:20190079.
Irisarri I, Baurain D, Brinkmann H, Delsuc F, Sire J-Y, Kupfer A, et al. Phylotranscriptomic consolidation of the jawed vertebrate timetree. Nat Ecol Evol. 2017;1:1370–8.
Prum RO, Berv JS, Dornburg A, Field DJ, Townsend JP, Lemmon EM, et al. A comprehensive phylogeny of birds (Aves) using targeted next-generation DNA sequencing. Nature. 2015;526:569–73.
Mello B, Tao Q, Tamura K, Kumar S. Fast and Accurate Estimates of Divergence Times from Big Data. Mol Biol Evol. 2017;34:45–50.
Tao Q, Tamura K, Kumar S. Efficient Methods for Dating Evolutionary Divergences. In: Ho SYW, editor. The Molecular Evolutionary Clock. Cham: Springer International Publishing; 2020. p. 197–219.
Sanderson MJ. Estimating Absolute Rates of Molecular Evolution and Divergence Times: A Penalized Likelihood Approach. Mol Biol Evol. 2002;19:101–9.
Tamura K, Battistuzzi FU, Billing-Ross P, Murillo O, Filipski A, Kumar S. Estimating divergence times in large molecular phylogenies. Proc Natl Acad Sci. 2012;109:19333–8.
Bond JE, Garrison NL, Hamilton CA, Godwin RL, Hedin M, Agnarsson I. Phylogenomics Resolves a Spider Backbone Phylogeny and Rejects a Prevailing Paradigm for Orb Web Evolution. Curr Biol. 2014;24:1765–71.
Daane JM, Auvinet J, Stoebenau A, Yergeau D, Harris MP, Detrich HW. Developmental constraint shaped genome evolution and erythrocyte loss in Antarctic fishes following paleoclimate change. PLOS Genet. 2020;16:e1009173.
Fernández-Mazuecos M, Vargas P, McCauley RA, Monjas D, Otero A, Chaves JA, et al. The Radiation of Darwin’s Giant Daisies in the Galápagos Islands. Curr Biol. 2020;30:4989-4998.e7.
Harvey MG, Bravo GA, Claramunt S, Cuervo AM, Derryberry GE, Battilana J, et al. The evolution of a tropical biodiversity hotspot. Science. 2020;370:1343–8.
Marin J, Battistuzzi FU, Brown AC, Hedges SB. The Timetree of Prokaryotes: New Insights into Their Evolution and Speciation. Mol Biol Evol. 2016;:msw245.
Qiao J, Zhang X, Chen B, Huang F, Xu K, Huang Q, et al. Comparison of the cytoplastic genomes by resequencing: insights into the genetic diversity and the phylogeny of the agriculturally important genus Brassica. BMC Genomics. 2020;21:480.
Roxas BAP, Roxas JL, Claus-Walker R, Harishankar A, Mansoor A, Anwar F, et al. Phylogenomic analysis of Clostridioides difficile ribotype 106 strains reveals novel genetic islands and emergent phenotypes. Sci Rep. 2020;10:22135.
Shingate P, Ravi V, Prasad A, Tay B-H, Venkatesh B. Chromosome-level genome assembly of the coastal horseshoe crab (Tachypleus gigas). Mol Ecol Resour. 2020;20:1748–60.
Grealey J, Lannelongue L, Saw W-Y, Marten J, Méric G, Ruiz-Carmona S, et al. The Carbon Footprint of Bioinformatics. Mol Biol Evol. 2022;39:msac034.
Kumar S. Embracing Green Computing in Molecular Phylogenetics. Mol Biol Evol. 2022;39:msac043.
Lepage T, Bryant D, Philippe H, Lartillot N. A General Comparison of Relaxed Molecular Clock Models. Mol Biol Evol. 2007;24:2669–80.
Tao Q, Tamura K, Battistuzzi FU, Kumar S. A Machine Learning Method for Detecting Autocorrelation of Evolutionary Rates in Large Phylogenies. Mol Biol Evol. 2019;36:811–24.
Sanderson MJ. r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics. 2003;19:301–2.
Paradis E. Molecular dating of phylogenies by likelihood methods: A comparison of models and a new information criterion. Mol Phylogenet Evol. 2013;67:436–44.
Smith SA, O’Meara BC. treePL: divergence time estimation using penalized likelihood for large phylogenies. Bioinformatics. 2012;28:2689–90.
Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms. Mol Biol Evol. 2018;35:1547–9.
Tao Q, Tamura K, Mello B, Kumar S. Reliable Confidence Intervals for RelTime Estimates of Evolutionary Divergence Times. Mol Biol Evol. 2020;37:280–90.
Battistuzzi FU, Tao Q, Jones L, Tamura K, Kumar S. RelTime Relaxes the Strict Molecular Clock throughout the Phylogeny. Genome Biol Evol. 2018;10:1631–6.
Chernikova D, Motamedi S, Csürös M, Koonin EV, Rogozin IB. A late origin of the extant eukaryotic diversity: divergence time estimates using rare genomic changes. Biol Direct. 2011;6:26.
Filipski A, Murillo O, Freydenzon A, Tamura K, Kumar S. Prospects for Building Large Timetrees Using Molecular Data with Incomplete Gene Coverage among Species. Mol Biol Evol. 2014;31:2542–50.
Gunter NL, Weir TA, Slipinksi A, Bocak L, Cameron SL. If Dung Beetles (Scarabaeidae: Scarabaeinae) Arose in Association with Dinosaurs, Did They Also Suffer a Mass Co-Extinction at the K-Pg Boundary? PloS One. 2016;11:e0153570.
Barba-Montoya J, Tao Q, Kumar S. Assessing Rapid Relaxed-Clock Methods for Phylogenomic Dating. Genome Biol Evol. 2021;13:evab251.
Lartillot N, Philippe H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 2004;21:1095–109.
Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu C-H, Xie D, et al. BEAST 2: A Software Platform for Bayesian Evolutionary Analysis. PLOS Comput Biol. 2014;10:e1003537.
Azzalini A. The R package “sn”: The Skew-Normal and Related Distributions such as the Skew-t and the SUN. 2021.
Delignette-Muller ML, Dutang C. fitdistrplus: An R Package for Fitting Distributions. J Stat Softw. 2015;64.
R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2020.
Kumar S, Stecher G, Tamura K. MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets. Mol Biol Evol. 2016;33:1870–4.
dos Reis M, Yang Z. Approximate likelihood calculation on a phylogeny for Bayesian estimation of divergence times. Mol Biol Evol. 2011;28:2161–72.
Yang Z. PAML 4: Phylogenetic Analysis by Maximum Likelihood. Mol Biol Evol. 2007;24:1586–91.
Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985;22:160–74.
Gevrey M, Dimopoulos I, Lek S. Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecol Model. 2003;160:249–64.
Kuhn M. Building Predictive Models in R Using the caret Package. J Stat Softw. 2008;28.
Cai L, Xi Z, Peterson K, Rushworth C, Beaulieu J, Davis CC. Phylogeny of Elatinaceae and the Tropical Gondwanan Origin of the Centroplacaceae (Malpighiaceae, Elatinaceae) Clade. PLOS ONE. 2016;11:e0161881.
Magallón S, Gómez‐Acevedo S, Sánchez‐Reyes LL, Hernández‐Hernández T. A metacalibrated time‐tree documents the early rise of flowering plant phylogenetic diversity. New Phytol. 2015;207:437–53.
Britton T, Anderson CL, Jacquet D, Lundqvist S, Bremer K. Estimating Divergence Times in Large Phylogenetic Trees. Syst Biol. 2007;56:741–52.
Ericson PGP, Anderson CL, Britton T, Elzanowski A, Johansson US, Källersjö M, et al. Diversification of Neoaves: integration of molecular sequence data and fossils. Biol Lett. 2006;2:543–7.
Mulcahy DG, Noonan BP, Moss T, Townsend TM, Reeder TW, Sites JW, et al. Estimating divergence dates and evaluating dating methods using phylogenomic and mitochondrial data in squamate reptiles. Mol Phylogenet Evol. 2012;65:974–91.
Pérez-Losada M, Høeg JT, Crandall KA. Unraveling the Evolutionary Radiation of the Thoracican Barnacles Using Molecular and Morphological Evidence: A Comparison of Several Divergence Time Estimation Approaches. Syst Biol. 2004;53:244–64.
Sanderson MJ, Thorne JL, Wikström N, Bremer K. Molecular evidence on plant divergence times. Am J Bot. 2004;91:1656–65.
Battistuzzi FU, Billing-Ross P, Murillo O, Filipski A, Kumar S. A Protocol for Diagnosing the Effect of Calibration Priors on Posterior Time Estimates: A Case Study for the Cambrian Explosion of Animal Phyla. Mol Biol Evol. 2015;32:1907–12.
Beavan AJS, Donoghue PCJ, Beaumont MA, Pisani D. Performance of A Priori and A Posteriori Calibration Strategies in Divergence Time Estimation. Genome Biol Evol. 2020;12:1087–98.
Lozano-Fernandez J, dos Reis M, Donoghue PCJ, Pisani D. RelTime Rates Collapse to a Strict Clock When Estimating the Timeline of Animal Diversification. Genome Biol Evol. 2017;9:1320–8.
Tao Q, Barba-Montoya J, Huuki LA, Durnan MK, Kumar S. Relative Efficiencies of Simple and Complex Substitution Models in Estimating Divergence Times in Phylogenomics. Mol Biol Evol. 2020;37:1819–31.
Ho SYW. Accuracy of Rate Estimation Using Relaxed-Clock Models with a Critical Focus on the Early Metazoan Radiation. Mol Biol Evol. 2005;22:1355–63.
Mello B, Tao Q, Barba‐Montoya J, Kumar S. Molecular dating for phylogenies containing a mix of populations and species by using Bayesian and RelTime approaches. Mol Ecol Resour. 2021;21:122–36.


Editorial decision: Major revision
27 Jul, 2022
Reviews received at journal
10 Jul, 2022
Reviewers agreed at journal
30 Jun, 2022
Reviewers invited by journal
30 Jun, 2022
Editor assigned by journal
30 Jun, 2022
Editor invited by journal
30 Jun, 2022
Submission checks completed at journal
30 Jun, 2022
First submitted to journal
28 Jun, 2022
