Genomes tend to produce short repeats
We analyzed 55 randomly selected, previously reported sequence segments from animals, plants, fungi, protists, bacteria, archaea and viruses (Table S1). SSRs were extracted from all of these segments using a minimum-length threshold of 3 base pairs (nucleotides). Although di-, tri-, tetra-, penta- and hexanucleotide repeats with only 2 iterations are usually ignored in previous studies [2, 5, 25, 26, 28], we found that they occur in very large numbers. It is difficult to regard them as merely random rather than repetitive sequences, and it is equally inappropriate to regard mononucleotide repeats of 3 to 5 iterations as merely random. Therefore, the minimum iteration thresholds in this study were set at 3, 2, 2, 2, 2 and 2 for mono- to hexanucleotide repeats, respectively, to capture SSR occurrence more comprehensively, including shorter simple repeats that have not been analyzed before; two other thresholds were also applied to the same sequences for comparison. To test whether the SSRs detected under this threshold arise at random, we generated 55 mimic sequences, each with the same size and nucleotide composition as the corresponding reported sequence.
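The extraction and null-sequence steps described above can be sketched as follows. This is a minimal illustration, not the tool used in the study: the function names (`find_ssrs`, `ssr_percentage`, `mimic_sequence`), the rule of skipping units that are themselves repeats of a shorter unit, and the use of shuffling to build a mimic sequence of equal size and composition are all assumptions made for the example.

```python
import random

# Minimum iterations per repeat-unit size (mono- to hexanucleotide), as in the text.
MIN_ITERATIONS = {1: 3, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2}


def is_primitive(unit):
    """Return True if the unit is not itself a tandem repeat of a shorter unit."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq, min_iterations=MIN_ITERATIONS):
    """Return (start, unit, iterations) for each tandem repeat meeting the threshold."""
    seq = seq.upper()
    hits = []
    for size, min_iter in min_iterations.items():
        i = 0
        while i + size <= len(seq):
            unit = seq[i:i + size]
            if set(unit) <= set("ACGT") and is_primitive(unit):
                n = 1
                # Count how many times the unit is repeated in tandem from position i.
                while seq.startswith(unit, i + n * size):
                    n += 1
                if n >= min_iter:
                    hits.append((i, unit, n))
                    i += n * size
                    continue
            i += 1
    return hits


def ssr_percentage(seq, hits):
    """Percentage of positions in seq covered by at least one reported repeat."""
    covered = set()
    for start, unit, n in hits:
        covered.update(range(start, start + len(unit) * n))
    return 100.0 * len(covered) / len(seq) if seq else 0.0


def mimic_sequence(seq, seed=0):
    """Shuffle seq so that size and nucleotide composition are preserved (assumed method)."""
    bases = list(seq)
    random.Random(seed).shuffle(bases)
    return "".join(bases)


example = "AAAATATGCGCGCTTAGC"  # hypothetical toy sequence, not from Table S1
hits = find_ssrs(example)
print(hits, ssr_percentage(example, hits))
print(ssr_percentage(mimic_sequence(example), find_ssrs(mimic_sequence(example))))
```

Comparing `ssr_percentage` for a reported segment with the value for its shuffled counterpart mirrors the comparison between reported and generated sequences; other ways of generating composition-matched sequences would serve equally well.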
The analysis showed that, under the new threshold, SSRs constitute on average 44.4% of the reported segments, ranging from 36.4% to 60.0% (Fig. 1a, Table S1). Under the two comparison thresholds, the average SSR content of these segments was 18.8% and 5.0%. Because all segments were randomly selected from their genomes, these results indicate that a high content of short SSRs is a general feature retained in the genomes of all organisms after long periods of evolution, and suggest that the few previously well-studied repeats may represent only the proverbial tip of the iceberg [2, 3, 5–7]. In the null-hypothesis test, the SSR percentages of the generated segments were all lower than those of the corresponding reported segments, indicating that the high percentages of short SSRs in the reported segments are not retained by chance.
Although the evolutionary mechanism of nucleotide sequences is still hotly debated among evolutionary biologists, it is widely accepted that genomic sequences mutate continually; the neutral theory of molecular evolution and the molecular clock hypothesis suggest that the rate of nucleotide substitution is constant over evolutionary time; and thermodynamics as applied to biology states that an isolated system always tends toward disorder [41–46]. Because microsatellites are ordered sequences, these theories imply that, in the absence of selective pressure, ordered repeats should tend to mutate into disordered sequences over long evolutionary history. Repeat sequences should therefore tend to disappear from genomes. However, the high percentage of SSRs retained in genomes contradicts the idea that repetitive sequences tend to become non-repetitive. It can thus be inferred that there is most probably a mechanism that continually produces repeats to balance their continual loss, and that this mechanism is responsible for the high percentage of short repeat sequences retained in genomes (Fig. 1b).
Furthermore, SSRs with small iteration numbers were far more abundant than those with large iteration numbers in all analyzed segments (Table 1, Table S2). This observation indicates that SSRs with small iteration numbers may be the basis for forming SSRs with large iteration numbers; otherwise, SSRs with large iteration numbers would be expected to be retained at a higher level than, or at least a level similar to, SSRs with small iteration numbers. Some longer SSRs may also mutate into shorter SSRs by contraction and point mutation, as debated by many evolutionary biologists [5, 12, 47]; these debates may persist because most short repeats were not included in earlier statistics. Our observations generally suggest that most longer SSRs evolved from short SSRs by expansion. Genomes thus possibly tend to produce short repeats through a continual repeat-producing mechanism in which the probability of expansion is slightly higher than that of contraction.
Table 1
Lengths (bp) of SSRs by repeat unit type and iteration number in the segment of the reported human reference X chromosome sequence at positions 144822–231384 bp.
Iteration | Mono^a | Di | Tri | Tetra | Penta | Hexa | Total |
I2 | (18128)^b | 10040 | 3540 | 2056 | 1250 | 480 | 17366 |
I3 | 9702 | 1782 | 288 | 156 | 45 | 18 | 11991 |
I4 | 3844 | 368 | 12 | 112 | - | - | 4336 |
I5 | 2095 | 120 | 15 | 20 | - | - | 2250 |
I6 | 600 | 24 | 18 | 0 | - | - | 642 |
I7 | 182 | 14 | -^c | 28 | - | - | 224 |
I8 | 128 | 16 | - | 0 | - | - | 144 |
I9 | 54 | 18 | - | 36 | - | - | 108 |
I10 | 50 | 0 | - | - | - | - | 50 |
I11 | 55 | 22 | - | - | - | - | 77 |
I12 | 24 | - | - | - | - | - | 24 |
I13 | 65 | - | - | - | - | - | 65 |
I14 | 56 | - | - | - | - | - | 56 |
I15 | 45 | - | - | - | - | - | 45 |
I16 | 64 | - | - | - | - | - | 64 |
I17 | 0 | - | - | - | - | - | 0 |
I18 | 36 | - | - | - | - | - | 36 |
I19 | 19 | - | - | - | - | - | 19 |
I20 | 0 | - | - | - | - | - | 0 |
I21 | 42 | - | - | - | - | - | 42 |
I22 | 0 | - | - | - | - | - | 0 |
I23 | 23 | - | - | - | - | - | 23 |
I24 | 0 | - | - | - | - | - | 0 |
I25 | 25 | - | - | - | - | - | 25 |
I26 | - | - | - | - | - | - | - |
I27 | - | - | - | - | - | - | - |
I28 | - | - | - | - | - | - | - |
Sum | 17109 | 12404 | 3873 | 2408 | 1295 | 498 | 37587 |
a Mononucleotide repeat (Mono), dinucleotide repeat (Di), trinucleotide repeat (Tri), tetranucleotide repeat (Tetra), pentanucleotide repeat (Penta), hexanucleotide repeat (Hexa).
b The length of mononucleotide repeats with 2 iterations is not included in the statistics and is shown here for reference only.
c Iterations beyond the largest observed for this repeat unit type in the analyzed segment are shown as "-".
Relatively semi-conservative replication
It is well known that, in the double-helix model, each base pair of DNA corresponds one-to-one during replication, with no extra residues [35, 36]. Meselson and Stahl verified that DNA replication is semi-conservative using sedimentation techniques based on the density difference of DNA labeled with different isotopes, which also implies that the number of nucleotides in the replicating strand matches that in the template strand when replication is complete [48]. However, if the high percentage of short repeats is produced during replication as described above, the number of bases in the replicating strand must differ from that in the template strand, with one or several nucleotides or motifs repeated and therefore in excess of the template. In vitro experiments have also revealed the emergence of repeats during DNA replication, with the nascent chain gaining bases [27, 37, 38, 49]. In this case, replication is possibly only relatively semi-conservative and can be described by the following formulas:
$$N_i = \operatorname{int}\left[N_0 (1 + f_1\lambda_1)(1 + f_2\lambda_2)\cdots(1 + f_i\lambda_i)\right] \tag{1}$$

$$\Delta N_i = N_i - N_{i-1} = \operatorname{int}\left[N_0 f_i\lambda_i (1 + f_1\lambda_1)(1 + f_2\lambda_2)\cdots(1 + f_{i-1}\lambda_{i-1})\right] \ge 0 \tag{2}$$

$N_0$: The number of nucleotides in the initial template strand;
$N_i$: The number of nucleotides in the replicating strand during the $i$-th round of replication;
$\operatorname{int}[\,]$: Round the value down to the lower integer;
$\Delta N_i$: The difference in the number of nucleotides between $N_i$ and $N_{i-1}$;
$\lambda_i$ ($\lambda_i \to 0$): The coefficient of repeat occurrence during the $i$-th round of replication, most probably an infinitesimal related to the probability of repeat sequence occurrence;
$f_i$ ($0 \le f_i \le 1$): The fixation coefficient of repeat sequences during the $i$-th round of replication.
In general, the number of nucleotides in the replicating strand is found to be exactly equal to that in the template strand, possibly because the observed template strand is too short. For example, the initial template strand for stable PCR is at most two to three thousand nucleotides long. If we suppose $N_0 = 3000$, $\lambda_1 = 10^{-5}$ and $f_1 = 1$, then $N_0 f_1 \lambda_1 = 0.03$ and $\Delta N_1 = 0$ according to formula (2). Therefore $N_1 = N_0$: the replicating strand is neither longer nor shorter than the template strand, and no newly formed repeat can be detected. However, when the observed strand is long enough, $\Delta N_i$ can reach at least 1, and the number of nucleotides in the replicating strand differs from that in the template strand. For instance, if we suppose $N_0 = 10^{6}$, $\lambda_1 = 10^{-5}$ and $f_1 = 1$, then $\Delta N_1 = 10$, and the replicating strand probably has 10 more nucleotides (or repeat motifs) than the template strand after this round of replication. The increased number of nucleotides may thus represent newly formed repeat sequences.
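The two worked cases can be checked numerically with a short sketch of formulas (1) and (2). The function names `strand_length` and `length_gain` are hypothetical, and exact rational arithmetic is used here only to keep the rounding step free of floating-point error.

```python
from fractions import Fraction
from math import floor


def strand_length(n0, f, lam):
    """Formula (1): N_i = int[N0 * prod_{k<=i} (1 + f_k * lam_k)]."""
    product = Fraction(1)
    for f_k, lam_k in zip(f, lam):
        product *= 1 + Fraction(f_k) * Fraction(lam_k)
    return floor(n0 * product)


def length_gain(n0, f, lam):
    """Formula (2): dN_i = int[N0 * f_i * lam_i * prod_{k<i} (1 + f_k * lam_k)]."""
    product = Fraction(1)
    for f_k, lam_k in zip(f[:-1], lam[:-1]):
        product *= 1 + Fraction(f_k) * Fraction(lam_k)
    return floor(n0 * Fraction(f[-1]) * Fraction(lam[-1]) * product)


lam = [Fraction(1, 10**5)]  # lambda_1 = 10^-5, as in the text
f = [1]                     # f_1 = 1, as in the text

print(length_gain(3000, f, lam))    # 0: short (PCR-sized) template, no detectable gain
print(length_gain(10**6, f, lam))   # 10: long template gains 10 nucleotides
```

The lists `f` and `lam` hold one value per replication round, so later rounds can be explored by appending further coefficients.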
The occurrence of SSRs will possibly encounter selective pressure, which may differ between coding and non-coding regions; we therefore use $f_i$ to represent the probability that newly formed repeats are fixed under selective pressure. $f_i = 0$ when the new repeats are lethal mutations that cannot be fixed in the organism, or when they are removed by the DNA repair system [1, 50]. $0 < f_i < 1$ when the new SSRs are deleterious but are fixed in the genomes of living individuals, as in Huntington's disease [13]. When the new SSRs are neutral mutations, $0 \le f_i \le 1$, and they are fixed or lost depending on genetic drift. For beneficial mutations, $f_i = 1$, meaning that the new SSRs may help the organism survive. Therefore, the high percentage of short repeats retained in genomes suggests that replication frequently produces short repeat sequences, which may be fixed neutrally, beneficially, or deleteriously (with associated diseases), and also suggests that replication may be only relatively semi-conservative.
Folded slippage model
As discussed above, the nucleotide chains of various species tend to produce simple repeats during replication, so the number of nucleotides in the replicated strand may differ from that in the template strand. How simple repeats actually originate, however, remains a key point of debate [5, 47, 51]. The widely accepted mechanism of SSR formation is the replication slippage model, which readily explains the expansion and contraction of longer SSRs but has difficulty explaining the expansion and contraction of the very large number of short repeats. The current slippage model is a straight-template model: it does not consider that nucleotide bases require space or that phosphodiester bonds are much stronger than hydrogen bonds (Fig. 2a) [52, 53], nor does it consider what force drives slippage of the replicating strand. The straight replication slippage model gives no clear answer on these points and suggests only that SSRs arise through occasional slippage [12, 54–56]. In fact, a nucleotide contains about 33 atoms (A: 33, T: 33, G: 34, C: 31) [57], and the nucleotide base naturally occupies a certain space. Based on previous reports, we simplified the space occupied by a nucleotide into an intuitive planar model with a length of about 0.489 nm (length = (distance between the two strands of the double helix, 1.08 nm − hydrogen bond length, 0.102 nm) / 2) and a width of 0.34 nm, which is the distance between adjacent base pairs (Fig. 2a) [52, 53, 58]. We reconstructed the linear replication slippage model by CAD geometric calculation, taking the space of the bases into account (Fig. 2b, Figure S1). For the slippage bubble to have enough geometric space to accommodate the repeated bases, the phosphodiester bond would have to be stretched far beyond 0.34 nm, yet the phosphodiester bonds in DNA are much stronger than hydrogen bonds (Fig. 2a) [58].
It is therefore impossible to form a slippage bubble by greatly stretching the phosphodiester bonds to accommodate the repeated bases. The straight slippage model thus has great difficulty accounting for the occurrence of short repeats, and the slippage model most probably needs to be improved.
One fact is widely ignored in studies of replication slippage: the template strand is assumed to be straight in all replication models. This holds under normal conditions. However, it is also well known that genomic DNA chains are very long while the space in the nucleus is very limited (Fig. 3a); for example, the total length of the human genome is about 2 m ($2 \times 10^{9}$ nm), while the diameter of the nucleus in a human cell is below $10^{5}$ nm [57]. Genomic DNA chains are therefore, as widely accepted, highly curved and folded in the nucleus. The replicating molecule is believed to be straight [37–40], and the replication enzyme complexes usually straighten the template strand so that the replicating strand pairs well with it and semi-conservative replication can be completed [37, 59, 60]. However, many environmental factors, such as temperature, viral proteins or diseases, may disturb the normal function of the enzyme complexes. When the replication enzyme complexes are disturbed by such factors, the replicating part of the DNA molecule may partly return to a curved or folded state, and the template strand may then also be curved or folded to some extent.
First, we proposed a curved template slippage model. When a curved DNA strand serves as the template on the inner side, the replicating strand on the outer side is longer than the template strand and can incorporate more nucleotides during replication. This extra length can provide space to accommodate the extra repeated bases (Fig. 3b). However, it is well known that base pairing depends mainly on two types of hydrogen bonds, N-H···:N and N-H···:O [52], and that the strength of these hydrogen bonds is negatively correlated with the distance between the paired bases. The strength of a hydrogen bond is only about 3% of that of a 3',5'-phosphodiester bond [53, 58, 61, 62] (Fig. 2a), so the distance between the bases is fixed; even if there is space to form a slippage bubble, the hydrogen bonds would have to be stretched beyond the threshold of 0.167 nm [52] and would break easily under such conditions. The curved slippage model can therefore provide space for a slippage bubble only by forming double-stranded structures with unstable hydrogen bonds (Arm1 and Arm2) on both sides of the bubble (Fig. 3b, Figure S2), indicating that the curved slippage model is unreasonable.
We then proposed a folded slippage model. In this model, the folded template strand forms a slippage bubble above the folding site with sufficient space to accommodate the repeated nucleotides during replication; the phosphodiester bonds are not stretched, and the bases on both sides of the slippage bubble remain well paired through stable hydrogen bonds (Fig. 4). If the folding angle is appropriate, a very stable double-stranded folded slippage structure can most probably form and provide opportunities for producing repeats, consistent with the geometric space of nucleotides and the stability of phosphodiester and hydrogen bonds. The folded slippage model has two cases: when the template strand is on the inner side, the repeat unit is duplicated to produce a new repeat unit, that is, repeat expansion (Fig. 4); when the template strand is on the outer side, replication may cause the repetitive sequence to contract (Fig. 5). These features readily explain the widely observed microsatellite mutations involving expansion and contraction of repeat units [5, 12, 47, 55, 63]. In addition, replication slippage on template strands with different folding angles may result in the expansion or contraction of repeat units of different sizes. When the template chain is folded on the inner side at a rotation angle of 18°, 36°, 54°, 72°, 90° or 108°, the replicating strand produces an expansion of a mono-, di-, tri-, tetra-, penta- or hexanucleotide repeat, respectively (Fig. 4). Producing these repeats requires breaking from 2 to 18 hydrogen bonds without stretching the phosphodiester bonds, which suggests that repeat formation becomes progressively more difficult from mono- to hexanucleotide units and that the occurrence of mono-, di-, tri-, tetra-, penta- and hexanucleotide repeats decreases accordingly; this is consistent with our statistical data (Table 1, Table S2).
Conversely, when the template chain is folded on the outer side at a rotation angle of 18°, 36°, 54°, 72°, 90° or 108°, the replicating strand produces a contraction of the corresponding repeat (Fig. 5). These features correspond well to microsatellites, which usually refer to tandem repeats with repeat units from mono- to hexanucleotides [5, 19, 24]. Following this rule, we also describe possible folded template slippage models for hepta-, octa-, nona- and decanucleotide repeats (Figures S3 and S4). In these cases the replicating strand must break from 14 to 30 hydrogen bonds to form a folded slippage bubble; the energy needed to break so many hydrogen bonds approaches the energy of a phosphodiester bond, so such events are very difficult. This is consistent with the observation that such long tandem repeats are often not very abundant in genomes [55, 64]. The faster growth of (AmTn) repeats compared with (GmCn) repeats also suggests that the number of hydrogen bonds broken influences the speed of repeat expansion [18, 34, 65, 66]. Although the folded slippage model is described here only in a simple planar form, it can still clearly simulate and explain the process by which repeat sequences are produced. We also used the same space dimensions to build three-dimensional double-helical forms that show the folded slippage model more intuitively (Figs. 4 and 5); the precise folding angle in the three-dimensional double-helical form and other issues require further study.
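The hydrogen-bond ranges quoted above (2 to 18 for mono- to hexanucleotide units, 14 to 30 for hepta- to decanucleotide units) can be reproduced with a small sketch. It assumes that the bonds to be broken are those of the base pairs in one repeat unit, with the standard counts of 2 hydrogen bonds per A·T pair and 3 per G·C pair; the text does not state this derivation explicitly, and the function names are hypothetical. Folding angles are listed only for the unit sizes given in the text.

```python
# Standard Watson-Crick hydrogen-bond counts per base pair.
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}

# Folding angles (degrees) stated in the text for mono- to hexanucleotide units.
FOLD_ANGLE = {1: 18, 2: 36, 3: 54, 4: 72, 5: 90, 6: 108}


def hydrogen_bonds_to_break(unit):
    """Hydrogen bonds paired to one repeat unit (assumed cost of forming the bubble)."""
    return sum(H_BONDS[base] for base in unit.upper())


def bond_range(unit_size):
    """(minimum, maximum) bonds for a unit of this size: all A/T versus all G/C."""
    return 2 * unit_size, 3 * unit_size


for size in range(1, 11):
    low, high = bond_range(size)
    angle = FOLD_ANGLE.get(size, "not given")
    print(f"unit size {size}: fold angle {angle}, hydrogen bonds {low} to {high}")

# An (AT)n unit costs fewer bonds than a (GC)n unit of the same size.
print(hydrogen_bonds_to_break("AT"), hydrogen_bonds_to_break("GC"))
```

Under this assumption an A/T-rich unit always requires fewer broken bonds than a G/C-rich unit of the same size, which matches the direction of the (AmTn) versus (GmCn) observation cited above.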
Compared with the straight template slippage model, the slippage bubble in the folded template model has enough geometric space to accommodate the repeated nucleotides without stretching the phosphodiester bonds. In contrast to the curved template model, the two sides of the slippage bubble in the folded model are stably paired, forming Arm1 and Arm2 on either side, similar to those in the straight template replication model (Figs. 4 and 5). The folded model takes full account of the space required by nucleotides, the stability of phosphodiester bonds and the relative strengths of phosphodiester and hydrogen bonds, and it readily explains microsatellite mutations involving repeat unit expansion and contraction. We therefore propose that the folded template slippage model may be the most reasonable model for explaining the production of repeats during replication, and that it may be responsible for the continual production of repeat sequences and for the high percentage of repeat sequences retained in genomes.
Microsatellites tend to expand
As stated above, in the folded slippage model a template chain folded on the inner side may cause the replicating chain to slip and expand repeats, whereas a template chain folded on the outer side may cause it to slip and contract repeats; the probabilities of repeat expansion and contraction might therefore seem equal. However, repeat sequences can be reduced in two ways: by the template chain folding on the outer side, as mentioned above, and by the general mutations also described above. Because the high content of repeat sequences nevertheless remains stable in the genome of each species, the probability of repeat expansion should be higher than that of repeat contraction. Many reports also suggest that repeat expansion is more probable than repeat contraction [27, 49, 67].
Examining folded template slippage more closely, the straightened template DNA chain undergoing replication should return to a folded state under external forces from the narrow and crowded cell nucleus when the replication enzyme complexes are disturbed; normally, the replication enzyme complexes may supply the power to counterbalance these external forces and pull the template DNA molecule straight. We therefore proposed an external force function for the return of the template strand to the folded state, which may help in exploring the probabilities of expansion and contraction. When the template strand is on the inner side, the nucleotide bases point outward, and the space between bases at the folding site becomes wide and loose on the outer part; when it is on the outer side, the bases at the folding site are squeezed inward. Given this small difference in the space available to nucleotides at the folding site, it is reasonable to accept that the external force needed to fold the template strand with its bases loosened is smaller than that needed to fold it with its bases squeezed. The external force required to fold the template strand on the outer side ($F_o$) is therefore necessarily greater than that required on the inner side ($F_i$), that is, $F_o > F_i$, suggesting that the template strand is more likely to fold on the inner side than on the outer side. Because the folded slippage model suggests that repeats tend to expand when the template strand is on the inner side and to contract when it is on the outer side, the probability of repeat expansion ($P_e$) is most probably higher than that of repeat contraction ($P_c$), that is, $P_e > P_c$ (Fig. 6). SSR studies, such as those of the Huntington disease locus and the myotonic dystrophy type 1 locus, all showed a bias toward SSR expansion [12, 13, 68–70], indicating that expansion of short SSRs is more frequent than contraction.
Thus, according to formula (2):
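When the template strand is on the outer side, repeats tend to contract, so $\lambda_c < 0$, and thus

$$\Delta N_c = N_{c,i} - N_{c,i-1} = \operatorname{int}\left[N_0 f_{c,i}\lambda_{c,i}(1 + f_{c,1}\lambda_{c,1})(1 + f_{c,2}\lambda_{c,2})\cdots(1 + f_{c,i-1}\lambda_{c,i-1})\right] \le 0.$$

When the template strand is on the inner side, repeats tend to expand, so $\lambda_e > 0$, and thus

$$\Delta N_e = N_{e,j} - N_{e,j-1} = \operatorname{int}\left[N_0 f_{e,j}\lambda_{e,j}(1 + f_{e,1}\lambda_{e,1})(1 + f_{e,2}\lambda_{e,2})\cdots(1 + f_{e,j-1}\lambda_{e,j-1})\right] \ge 0.$$

The overall repeat expansion and contraction can be described as:

$$\left|\sum \Delta N_e\right| = \left|\operatorname{int}\left[\sum N_0 f_{e,j}\lambda_{e,j}(1 + f_{e,1}\lambda_{e,1})(1 + f_{e,2}\lambda_{e,2})\cdots(1 + f_{e,j-1}\lambda_{e,j-1})\right]\right|;$$

$$\left|\sum \Delta N_c\right| = \left|\operatorname{int}\left[\sum N_0 f_{c,i}\lambda_{c,i}(1 + f_{c,1}\lambda_{c,1})(1 + f_{c,2}\lambda_{c,2})\cdots(1 + f_{c,i-1}\lambda_{c,i-1})\right]\right|;$$

$$\sum \Delta N = \left|\sum \Delta N_e\right| - \left|\sum \Delta N_c\right| = \operatorname{int}\left[N_0 \sum \left[\left|f_{e,j}\lambda_{e,j}(1 + f_{e,1}\lambda_{e,1})(1 + f_{e,2}\lambda_{e,2})\cdots(1 + f_{e,j-1}\lambda_{e,j-1})\right| - \left|f_{c,i}\lambda_{c,i}(1 + f_{c,1}\lambda_{c,1})(1 + f_{c,2}\lambda_{c,2})\cdots(1 + f_{c,i-1}\lambda_{c,i-1})\right|\right]\right].$$

Because $\lambda$ was defined as the coefficient of repeat occurrence, the probability of repeat expansion ($P_e$) is proportional to $\lambda_e$ and the probability of contraction ($P_c$) is proportional to the absolute value of $\lambda_c$ ($|\lambda_c|$). If we suppose that $f_e = f_c = f$ and $i = j$, then, since generally $P_e > P_c$, we have $\lambda_e > |\lambda_c|$ and also

$$\sum \left|\lambda_{e,j}(1 + f\lambda_{e,1})(1 + f\lambda_{e,2})\cdots(1 + f\lambda_{e,j-1})\right| \ge \sum \left|\lambda_{c,i}(1 + f\lambda_{c,1})(1 + f\lambda_{c,2})\cdots(1 + f\lambda_{c,i-1})\right|,$$

therefore, $\sum \Delta N = \left|\sum \Delta N_e\right| - \left|\sum \Delta N_c\right| \ge 0$.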
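As a numerical illustration of the net-change expression, the sketch below evaluates the summed form of ∑ΔN for equal fixation coefficients and an equal number of expansion and contraction rounds. The function names and the example coefficients (an expansion coefficient of 2 × 10^-5 and a contraction coefficient of −10^-5 per round) are hypothetical values chosen only so that the expansion coefficient exceeds the magnitude of the contraction coefficient; they are not estimates from the data.

```python
from fractions import Fraction
from math import floor


def round_terms(rates, f):
    """|f * lam_k * prod_{m<k} (1 + f * lam_m)| for each round k."""
    terms = []
    growth = Fraction(1)
    for lam in rates:
        terms.append(abs(f * lam * growth))
        growth *= 1 + f * lam
    return terms


def net_change(n0, f, lam_e, lam_c):
    """int[N0 * sum(|expansion term| - |contraction term|)], paired round by round."""
    exp_terms = round_terms(lam_e, f)
    con_terms = round_terms(lam_c, f)
    return floor(n0 * sum(e - c for e, c in zip(exp_terms, con_terms)))


f = Fraction(1)                      # assumed common fixation coefficient
lam_e = [Fraction(2, 10**5)] * 3     # hypothetical expansion coefficients, 3 rounds
lam_c = [Fraction(-1, 10**5)] * 3    # hypothetical contraction coefficients, 3 rounds

print(net_change(10**6, f, lam_e[:1], lam_c[:1]))  # 10 after one round
print(net_change(10**6, f, lam_e, lam_c))          # 30 after three rounds
```

With any choice in which each expansion coefficient exceeds the magnitude of the paired contraction coefficient, the result is non-negative, in line with the inequality above.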
Thus, when the external forces that return the template strand to the folded state are considered, the probability of repeat expansion should be higher than that of repeat contraction, and the revised formula (2) can also explain the high percentage of short repeats retained in genomes through a mechanism of continual repeat production. This mechanism might result from folded template slippage, which is possibly responsible for the widespread occurrence of short tandem repeats, also called microsatellites or SSRs, in eukaryotic, prokaryotic and viral genomes. We improved the straight slippage model into the folded slippage model by fully considering the geometric space of nucleotide bases, the relationship between phosphodiester and hydrogen bonds, and the stability of these bonds. The folded slippage model shows that the straightened replicating template DNA may return to a partly folded state when the replication enzyme complexes are disturbed, providing opportunities for the continual production of large numbers of short repeats, although repeats with long units may be related to the earlier slippage model [30, 55]. The ease with which folded slippage forms may also account for the widely observed fact that the repetitive part of the genome usually evolves a hundred or more times faster than other parts, through expansion and contraction of repeat units alone [1, 17, 47, 71], although repeats occur more often in non-coding regions than in coding regions, possibly because of different selective pressures [5, 12, 55].
Most newly formed repeats should be lethal mutations and may have been lost through negative selection; some should be deleterious and are responsible for a series of diseases [69, 70, 72, 73]; many neutral repeat expansions may be lost or fixed in genomes without function through genetic drift [74]; and some beneficial repeat expansions may promote the emergence of new properties or functions, which is why repeat sequences have been reported to play so many different roles [8–10, 63, 65, 75, 76]. Longer repeats might originate from the expansion of short repeats through folded template slippage, and longer genomes possibly evolved from shorter genomes through the continual production of repeats by folded slippage over the long evolutionary history of replication.