The used Human Mitochondrial mutation rate. The efficiency of the molecular clock [18] is based on the reliability of several implicit assumptions as a) the correctness of the mutation rate point estimate (µ) of the gene under study; b) constancy with time of the rate of molecular substitution (Ɵ) and c) rate homogeneity among the different lineages involved in the phylogeny. To accomplish the first point, in this study I have used the full-length mtDNA germ-line mutation rate of 1.3 x 10-8 (interquartile range from 4.2 x 10-9 to 4.1 x 10-8) mutations per site per year (assuming a generation time of 20 years) and its derived rate scalar of one mutation every 4651 years estimated by others[19]. This mtDNA mutation rate is about ten times lower than estimates in most pedigree studies which the authors explain because, at analyzing two tissues, they could discard somatic heteroplasmies. In this respect, it has to be mentioned that second-generation massive sequencing has made possible the direct calculation of the germ-line human genomic mutation rate which resulted in half of the phylogenetic mutation rate thus, doubling the estimated divergence dates of Africans and proposing that crucial events in human evolution have occurred earlier than suggested previously[20].
Accounting for the time-dependence effect on the Rate of Molecular Evolution. It is well known that rates of molecular evolution are not constant neither at interspecific nor intraspecific levels[21]. In general, they decline with increasing divergence time, but the rate of decay differs among taxa. This time-dependent pattern has also been observed for human mtDNA in coding as well as non-coding regions[22, 23]. Purifying selection on deleterious mutations and mutation saturation has been suggested as the main forces responsible for this time rate decay[23]. However, the unrealistic large effective population sizes required to explain the long-term persistence of significantly deleterious mutations cast doubts on whether purifying selection alone can explain the observed rate acceleration[24]. It has also been found unlikely that the apparent decline in rates over time is due to mutational saturation[23]. Congruently, correcting for the effects of purifying selection and saturation has only slightly modified the mtDNA evolutionary mutation rate, providing molecular times still in apparent contradiction with archaeological and paleontological ages[17, 23]. Demographic processes such as serial bottlenecks and expansions have also been proposed to explain the differences in rate estimation over time[25]. It seems evident that some adjustment should be implemented to correct the time dependency of the molecular clock. In this paper, I propose a practical approach to counteract the time dependency effect on the molecular rate estimates taking into account tree topologies.
Approaching Lack of Mutation Rate homogeneity between Lineages. Since early molecular analyses, it was observed that rates of homologous nuclear DNA sequence evolution differ between taxonomic groups [26] which was also extensible to mtDNA[27]. Later, significant differences in rates of molecular evolution between mtDNA human lineages were also detected at haplogroup level[28–32]. Different relaxed molecular-clock methods have been implemented to incorporate rate variation among lineages[33, 34]. However, the application of these methods to the human mtDNA has yielded age estimates for the main milestones of human evolution in agreement with previous molecular estimates[35, 36]. In this paper, when distributing mutations of lineages with significant rate differences within coalescent periods, I have used a simple proportionality criterion. I allowed a window from 0 to 5 mutations between sequences within coalescent periods, as it has been demonstrated that, under a Poisson distribution, even in an extended period of 10,000 years, there could be lineages still carrying the same mutations that their common ancestor, and lineages that have accumulated five new mutations with probabilities higher than 0.05 percent[37]. Another technical problem is the mutation distribution into coalescent periods of those isolated sequences that directly radiate from ancestral nodes. To resolve this issue, I have used a weighted distribution, consisting in multiplying the number of mutations in the isolates by the number of mutations between internodes in each coalescent period, and then to divide the result by the total number of mutations in all the internodes.
Fluctuating Population Size effect on Mutation Substitution Rate.
In a population, the fixation time of a mutation (forward) or the coalescence time to the most recent common ancestor (backward), is usually calculated by the estimator Ɵ = 4Neµ (being Ne the effective population size). For the haploid mtDNA genome, Ɵ equals to 2Nefµ (being Nef the female effective population size). Kimura demonstrated that under strict neutral theory parameters, the rate of substitution is equated to the mutation rate[38]. However, the same author warned us that a clear distinction exists between mutation rate (µ) and mutation substitution (Ɵ). The former refers to the change of genetic material at the individual level, and the latter refers to that at the population level[38]. Thus, only when Ne is constant across generations, keeping small or large sizes, Ɵ equals to µ. This holds because with large constant sizes the number of new mutations incorporated into the population (Neµ) increases but, on the same path, the probability of fixation (1/Ne) decreases. Contrarily, with small constant sizes, the number of new mutations decreases but the probability of fixation of any of them increases at a similar level. It is widely admitted that N fluctuated largely during the human history and that global exponential population growth is happening since recent times[25]. In this paper, I have taken into account the changes in population size that occurred backward in time and its influence on the rate of gene substitution. When the population size fluctuates across generations, the probability of fixation of a mtDNA neutral variant (q) is no longer the 1/N constant. It will depend on the difference in population size of the next generation (N1) with respect to the initial size (N0):
[Please see the supplementary files section to view the equation.]
For example, if N1 is twice the size of N0, q equals 2/N0 and, on the contrary, if N1 is half the size of N0, q equals 1/2N0 . As a consequence, the rate of substitution for neutral mutations in a population with fluctuating size depends on the change in size between generations:
[Please see the supplementary files section to view the equation.]
Using a different approach, the dependence on population size of the substitution rate at neutral genes was already demonstrated for populations with fluctuating sizes and overlapping generations[39].
As human populations have been growing exponentially for several centuries, we should counterbalance this effect from the present-day generation (Nn) going backward in time by inverting the fraction between consecutive generations (Nn-1/Nn). Notice that this dependence might explain the differences in rate estimation over time observed empirically [22]. I will take into consideration this important relationship for the calculation of Ɵ.
A new Rho Statistic to estimate Coalescent Ages. Several statistics, based on DNA polymorphism, exist to estimate the parameter Ɵ. One, S, the number of segregating sites per nucleotide in a sample of sequences [40] is strongly dependent on the sample size. A second, π, is defined as the average number of nucleotide differences per site in a sample of sequences[41]. These two estimators were used to implement a statistical method for testing the neutral mutation hypothesis[42]. A third statistic, rho (ρ), is referred to as the mean number of nucleotide differences of a sample of sequences compared to their common ancestral type[15]. This last statistic is calculated from a rooted phylogenetic tree relating the sampled sequences. The accuracy of molecular dating with the rho statistic has been questioned by some, because it shows downward biased data estimations, large asymmetric variances and strong dependency of demographic factors[43], but defended by others[44]. Anyway, it is still a commonly used method to measure intraspecific mtDNA divergence events in humans. Although the distribution of pairwise nucleotide site differences between individuals has been used to detect episodes of population growth and decline[45], none of the above mentioned statistics considers the past demography of the sample in their age estimations. In this paper, I propose the use of a modified rho (ρm) that, taking into account the coalescent genealogical structure, significantly improves the molecular date estimation of key events in the human history based on mtDNA genome data. In order to make explicit our modifications to the classical rho, I have depicted, in figure 1, a real genealogy constructed from five lineages (a), and an idealized star-like phylogeny of the same five lineages (b) supposing a population exponential growth short after a severe bottleneck[46]. The number of lineages sampled is represented by ni; ti are the time periods defined by progressive coalescent events from the tips to the most recent common ancestor (TMRCA) root; i represents the number of independent lineages left after successive coalescences, ϒi is the number of mutations accumulated during each coalescent period, and mi the number of mutations accumulated along each lineage. Mutations in the star-like phylogeny are distributed into periods following the pattern found in the real phylogeny. As the accumulation of mutations along each lineage is an individual process driven by the mutation rate, µ, and distributed as independent Poisson processes, ρ, the average number of mutations per lineage, has the same value irrespective of the topology. However, as a consequence of the fact that the lineages in the real tree are not independent, because of their shared genealogy, the variance decay is much slower (1/logn) than in the independent star-like tree (1/n)[47]. As a consequence of this, for the rho calculation, mutations within lineages in the star-like phylogeny are counted only once. On the contrary, in the real phylogeny, only mutations occurring at the tips are counted only once while mutations in subsequent coalescent periods going backward to the MRCA node are counted as many times as the number of the period to which they belong. For this reason, mutations in the older periods are overrepresented in the rho calculation. Because of lineage independence, star-like phylogenies are statistically optimal to calculate ρ and π estimators. Under this topology, as mutations along lineages are counted from the root to the tips in ρ and from tip to tip in π pairwise comparisons, the value of π doubles that of ρ. However, as rho ignores the dependence among lineages existing in the majority of the trees, it is necessary to correct for this dependence. For this reason, I propose a modified rho statistic (ρm) that is the summation of classical rho statistics calculated for each coalescent period in the tree:
[Please see the supplementary files section to view the equation.]
This compound Poisson distribution is also Poisson distributed, therefore mean and variance are equal and, the standard deviation is the square root of this variance. Thus, uncertainty in the estimates were calculated using the Poisson confidence interval. Another important difference between the real and star-like genealogies is that in the first the number of lineages decreases as one stepwise function across coalescent periods while in the second, the number of lineages is constant until the root is reached. Equating the number of lineages in the sample as an approximation to the effective population size in the population, we should take into account this backward real decrease in Ne to improve the estimation of the MRCA age. In an ideal coalescent model, we should have i-1 population decreasing sizes, but in real phylogenies, in addition to bifurcations, there are also multifurcations and lineages with long internal segments without any branching event. Even so, I applied to each rho in consecutive periods going backward the reverse proportion used to counteract the time dependence effect on the evolutionary rate. That is, multiplying in each i-1 period the mutation rate µ by (i-1)/i and leaving the µ rate as calculated from germ-line estimations for the most recent period, comprising the tips of all the lineages sampled. With this method, I have obtained a time-dependent scaled mutation rate, Ɵ, which gave human mtDNA intraspecific ages congruent with the archaeological and paleontological calibrated nodes representing key events in the human history.
[Please see the supplementary files section to view the equation.]
Performing calculations on the empirical genealogy (Figure 1a), I have obtained an age of 22 277 ± 4 720 years using the standard ρ and age of 30 396 ± 5 513 years, 1.36 times greater, when using the time-dependent Ɵ estimator proposed here (Table 1).