While the literature on the biological, evolutionary, and pathological importance of STRs is growing, the most basic repeats of these elements have been largely overlooked. Here we identified unprecedented genomic colonies resulting from colonization of the dyads of GGC and GCC repeats, i.e., (GGC)2 and (GCC)2. Several of these colonies were human-specific or showed a directional increase in complexity in human versus other species. Based on these observations and the highly statistically significant occurrence of the identified colonies, we propose that they are linked to the evolution of the human species. The identified colonies may serve as codes, akin to computer codes, for various evolutionary and biological instructions.
The genomic rearrangements resulting in those colonies were also remarkable with respect to the frequency of the events within the genomic lengths in which they occurred. The identified colonies do not match the description of repetitive sequences 22. Nor do they match segmental duplications with respect to the density of the events. Of note, the shortest reported human segmental duplications, copy number variations, and other genomic rearrangements are estimated to involve ≥ 10 kb of genomic DNA 23–26.
The likely explanation for the occurrence of the identified colonies is recombination involving the dyads and the sequences flanking each dyad. In other words, the identified colonies may represent recombination hotspots. Following comparison of fine-scale recombination rates in human and chimpanzee, Winckler and co-workers reported that local patterns of recombination rate have evolved rapidly, in a manner disproportionate to the change in DNA sequence, and that recombination hotspots are rarely (if at all) found at the same locus in human and chimpanzee 27. However, if we assume that the colonies are, at least partially, formed by recombination, then recombination hotspots shared at the same locus between the two species are not as rare as previously reported. For example, C99, C51, and C38 appear to be recombination hotspots in great apes, although with a higher order of complexity of events in human. These are prime instances in the literature in which a directional increase in the density and complexity of repeats at a specific locus in the genome coincides with human evolution. Another interesting instance is a CT-repeat complex in the PAXBP1 core promoter and 5' untranslated region, which is maximally complex in human versus other species (OMIM: 617621) 28.
Based on our observations, the dyads and the sequences flanking each dyad are involved in the genomic rearrangements. Because the elements common to those colonies are the dyads, and not the flanking sequences, it is likely that the dyads, rather than their flanking sequences, are the main reason for the rearrangement hotspots in those colonies.
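To make the notion of dyads as the common elements of a colony concrete, the following minimal sketch locates (GGC)2 and (GCC)2 occurrences in a sequence and groups nearby occurrences into clusters. It is an illustration only and not the procedure used in this study: the function names, the `max_gap` and `min_count` thresholds, and the choice to count overlapping occurrences are all hypothetical assumptions.

```python
import re

# Dyads discussed in the text, written out as plain hexamers.
DYADS = {"(GGC)2": "GGCGGC", "(GCC)2": "GCCGCC"}


def find_dyads(seq, motif):
    """Return 0-based start positions of every motif occurrence.

    A zero-width lookahead is used so that overlapping occurrences
    are also reported (an assumption made for this sketch).
    """
    return [m.start() for m in re.finditer(f"(?={motif})", seq.upper())]


def group_into_clusters(positions, max_gap=500, min_count=3):
    """Group sorted positions whose consecutive spacing is <= max_gap.

    max_gap and min_count are hypothetical thresholds chosen for
    illustration; they are not values taken from the study.
    """
    clusters, current = [], []
    for pos in positions:
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_count:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_count:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    # Toy sequence: three closely spaced dyads, then one isolated dyad.
    toy = ("GGCGGC" + "A" * 10) * 2 + "GGCGGC" + "A" * 1000 + "GGCGGC"
    starts = find_dyads(toy, DYADS["(GGC)2"])
    print(group_into_clusters(starts, max_gap=50, min_count=3))
```

In this toy sequence, the three closely spaced dyads form one cluster and the isolated dyad is discarded; on real data, the thresholds would need to be chosen to match the colony definition given in the Methods.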
Several of the largest colonies were intergenic or located deep inside very large introns. For example, C219 and C71 were the largest (GCC)2 and (GGC)2 colonies, respectively, and both were intergenic. The nearest genes to C219 and C71 were COPS7B and WDR5, respectively, which, notably, interact directly at the protein level (https://string-db.org/). It has been shown that intergenic distance, and hence genome architecture, is highly non-random. Rather, it is shaped by regulatory information contained in noncoding DNA 29. It is, therefore, not unexpected that expansion of the non-coding genome and its regulatory potential underlies vertebrate neuronal diversity 30. This potential is in line with our observations that the largest colonies are mainly human-specific or are most complex in human versus other species, and that several of the genes containing those colonies show divergent brain expression in human (https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/) 31.
A subset of the (GCC)2 and (GGC)2 colonies were located deep inside very large introns. For a subset of genes, the most important regulatory sequences are situated not in the promoters, but within introns 32–34.
Whereas human chromosome Y is virtually devoid of GGC and GCC colonies of ≥ 4 repeats, numerous colonies of (GGC)2 and (GCC)2 were detected along this chromosome (https://figshare.com/articles/dataset/_GGC_2_and_GCC_2/22178102). The largest of these, at 36 (GCC)2, was located in IL3RA, in the pseudoautosomal regions of chromosomes Y and X. Multi-tissue expression of this gene was strictly human-specific and non-existent in multiple primate species 32.
A number of the identified colonies were close to lncRNAs. While the targets of many of those lncRNAs are not completely known, lncRNAs are attracting considerable interest because of their versatile roles in fine-tuning numerous signaling pathways 35.
Another category of colonies was located in pseudogenes. A number of the identified colonies inside pseudogenes were specific to great apes and showed a directional trend in human. Pseudogenes are abundant in the human genome and had long been thought of purely as nonfunctional gene fossils. Recent observations point to a role for pseudogenes in regulating genes transcriptionally and post-transcriptionally in human cells. Pseudogenes are pervasively transcribed on both strands and are common drivers of gene regulation, with consequences for health and disease 36–38.
The genomic colonies formed by (GGC)2 and (GCC)2 dyads only fractionally overlap with G-quadruplex (G4) structures 4,15,39. G4 structures are DNA tetraplexes that typically form in guanine-rich regions of genomes. Four guanine bases associate with each other through Hoogsteen hydrogen bonds to form a guanine tetrad plane (G-quartet), and two or more G-quartet planes then stack on top of each other to form a G4 structure 40. G4 structures may have evolved into a novel and elaborate transcriptional regulatory mechanism that benefits multiple physiological activities of higher organisms 41,42.
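As a rough illustration of how overlap between dyads and putative G4 motifs could be quantified, the sketch below uses a widely used sequence heuristic for putative G4-forming motifs (four G-runs separated by short loops) and reports the fraction of dyad occurrences that fall entirely inside such a motif. This is not the analysis performed in this study: the heuristic, the default `min_run` and `max_loop` values, and all function names are assumptions introduced here, and the pattern scans only the given strand.

```python
import re


def build_g4_pattern(min_run=3, max_loop=7):
    """Compile a heuristic regex for putative G4 motifs on one strand.

    Four runs of at least min_run guanines separated by loops of
    1..max_loop bases. The defaults are commonly used heuristic values
    and are assumptions here, not parameters from the study.
    """
    run = f"G{{{min_run},}}"
    loop = f"[ACGT]{{1,{max_loop}}}"
    return re.compile(f"(?:{run}{loop}){{3}}{run}")


def putative_g4_spans(seq, pattern=None):
    """Return (start, end) spans of non-overlapping heuristic G4 matches."""
    pattern = pattern or build_g4_pattern()
    return [m.span() for m in pattern.finditer(seq.upper())]


def motif_starts(seq, motif):
    """Return 0-based starts of every (possibly overlapping) motif hit."""
    return [m.start() for m in re.finditer(f"(?={motif})", seq.upper())]


def fraction_inside(starts, spans, motif_len):
    """Fraction of motif occurrences lying entirely within any span."""
    if not starts:
        return 0.0
    inside = sum(
        any(s <= p and p + motif_len <= e for s, e in spans) for p in starts
    )
    return inside / len(starts)


if __name__ == "__main__":
    seq = "TTGGGAGGGAGGGAGGGTTGGCGGC"
    spans = putative_g4_spans(seq)
    dyads = motif_starts(seq, "GGCGGC")
    print(spans, dyads, fraction_inside(dyads, spans, motif_len=6))
```

A sequence-based heuristic of this kind only flags candidate motifs; whether a G4 structure actually forms at a given locus requires experimental evidence.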
The supplementary data (https://figshare.com/articles/dataset/_GGC_2_and_GCC_2/22178102) will enable the study of a vast number of additional colonies and advance knowledge of the link between those colonies and the evolution of human and other great apes.