Here we report the first indication of purifying selection at an STR locus in humans. The primary importance of (GCC)-repeats stems from a possible link between this type of STR and natural selection, for two main reasons. Firstly, (GCC)-repeats are specifically enriched in exons. Secondly, GC-rich sequences are mutation hotspots [25] and are frequently interrupted by single nucleotide substitutions. Specific expansion of the SBF1 (GCC)-repeat in primates, and not in any other order, supports a selective advantage of the STR in this order.
In both the NCD and control groups, heterozygosity fell dramatically below the level expected for the observed allele frequencies, most likely due to selection against heterozygous genotypes. As a consequence, the homozygote compartment expanded significantly beyond expectation and exceeded 77% across the two groups. This anomaly could not be attributed to the excess of consanguineous marriages in Iran, as consanguinity contributes only between 2 and 11% excess homozygosity at a given locus [26, 27]. The homozygous genotypes could not be attributed to allele dropout either, as the frequency of such an event is less than 0.004 in amplification-based approaches [28]. Sampling error is another possible explanation for the observed genotypes. However, all samples were collected from the same districts in Iran, and key findings, such as the shrunken 8/9 genotype compartment and the excess of the 6/8 vs. 6/9 genotypes, were replicated in both groups. Nevertheless, this is a pilot study, and replication by independent studies is warranted.
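The comparison between expected and observed heterozygosity described above can be sketched as follows. This is a minimal illustration, assuming Hardy-Weinberg proportions as the null expectation. The genotype counts are hypothetical placeholders and are not the data of this study, and the function names are illustrative only.

```python
from collections import Counter


def allele_frequencies(genotype_counts):
    """Return allele frequencies from a {(allele1, allele2): count} mapping."""
    allele_counts = Counter()
    for (a1, a2), n in genotype_counts.items():
        # Each individual contributes two alleles (one per chromosome).
        allele_counts[a1] += n
        allele_counts[a2] += n
    total = sum(allele_counts.values())
    return {allele: c / total for allele, c in allele_counts.items()}


def expected_heterozygosity(freqs):
    """Expected heterozygosity under Hardy-Weinberg: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs.values())


def observed_heterozygosity(genotype_counts):
    """Fraction of individuals carrying two different alleles."""
    total = sum(genotype_counts.values())
    het = sum(n for (a1, a2), n in genotype_counts.items() if a1 != a2)
    return het / total


def fixation_index(genotype_counts):
    """F = 1 - H_obs / H_exp; positive values indicate a heterozygote deficit."""
    h_exp = expected_heterozygosity(allele_frequencies(genotype_counts))
    h_obs = observed_heterozygosity(genotype_counts)
    return 1.0 - h_obs / h_exp


# Hypothetical genotype counts keyed by (GCC)-repeat number (NOT study data).
counts = {
    (8, 8): 40, (9, 9): 30, (6, 6): 10,  # homozygotes
    (8, 9): 5, (6, 8): 10, (6, 9): 5,    # heterozygotes
}

h_exp = expected_heterozygosity(allele_frequencies(counts))
h_obs = observed_heterozygosity(counts)
f_index = fixation_index(counts)
print(f"H_exp={h_exp:.3f}  H_obs={h_obs:.3f}  F={f_index:.3f}")
```

Under this null model, a large positive F signals a heterozygote deficit. Whether such a deficit exceeds what consanguinity alone could explain has to be judged against estimates such as those cited above [26, 27].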
Searching the Genome Aggregation Database (gnomAD) for the human SBF1 (GCC)-repeat revealed inconclusive data for the annotated alleles and genotypes across all the populations studied (https://gnomad.broadinstitute.org). This is most likely due to the frequent failure of general whole-exome sequencing methods to capture GC-rich sequences. Successful PCR amplification of the human SBF1 gene necessitates stringent conditions and special GC-rich buffer preparations, as described in the Methods.
A likely hypothesis is that the heterozygous genotypes have been selected against in humans in the process of evolution. The studied (GCC)-repeat is located in the 5′ UTR, and it may be speculated that RNA heterodimers of, for example, the 8- and 9-repeat alleles, or the 6- and 9-repeat alleles, have a detrimental effect on downstream events such as transcript processing and translation. A possible mechanism might be connected to RNA structure and accessibility, which we showed to change with the number of (GCC)-repeats and which can affect at least exon 1.
An example of RNA heterodimer formation exists in the 5′ regulatory regions of human HIV-1/HIV-2 RNAs [29, 30], which requires GC-rich palindromic sequences among a number of other motifs [31]. It may be speculated that similar sequences in the GC-rich human SBF1 RNAs fulfill the conditions for potential RNA dimerization. Experimental synthetic stem-loop RNAs have been reported to alter the expression of a number of genes in bacteria [32].
SBF1 is predominantly expressed in the brain and skeletal muscle, and the protein encoded by this gene is a member of the myotubularin family. Myotubularin-related proteins, namely MTMR2, MTMR13/SBF2, and MTMR5/SBF1, are mainly involved in regulating endolysosomal trafficking [33] and mitochondrial functioning [34]. Dysregulation of SBF1 is linked to late-onset NCDs such as AD [17], which is also indicated by the observed genotype anomalies in the NCD group vs. controls in our study. An isolated instance of an NCD patient harboring a genotype that consisted of extremely short alleles may be of significance, although random co-occurrence should also be considered as a possibility. The secondary structure and accessibility effects of the 5/6 genotype were dramatically divergent, and the 5-repeat allele was not detected in the control group. It is possible that low-frequency alleles at the extreme ends of the allele distribution curve are subject to negative natural selection [8, 12, 35].
It remains to be clarified how certain heterozygous genotypes might have been selected against in humans, and how they may increase the risk of late-onset AD. Sequencing of this STR in larger samples and in a spectrum of neurological disorders is also warranted.