Here, we report a novel locus for late-onset NCD and an indication of natural selection at a (GCC)-repeat in the 5′ UTR of the human SMAD9 gene. This locus may harbor genotypes that specifically (unambiguously) predispose to, or protect against, moderate to severe late-onset NCD in humans. The NCD patients harboring these specific genotypes encompassed a spectrum of possible diagnoses, including CVD and AD.
The primary importance of (GCC)-repeats stems from a possible link between this type of STR and natural selection, mainly for two reasons. Firstly, (GCC)-repeats are specifically enriched in exons. Secondly, CpG-rich sequences are mutation hotspots 27 and are frequently interrupted by single nucleotide substitutions resulting from C to T transitions, which is also the likely origin of the (GCT)-residue in the sequence immediately downstream of the human SMAD9 (GCC)-repeat. Expansion of the SMAD9 (GCC)-repeat in primates, and not in any other order, supports a selective advantage of this repeat in primates.
We found a significant excess of the (GCC)7 allele in the NCD group, and genotypes that included (GCC)7 but not (GCC)9 were found in this group only. Conversely, genotypes that included (GCC)9 but not (GCC)7 were found in the control group only and not in the NCD group. Based on these findings, we propose that the (GCC)7 allele may function as a risk factor for late-onset NCD, whereas (GCC)9 may be protective. Similar to our findings in SMAD9, we have previously reported another predominantly biallelic (GCC)-repeat, of 8 and 9 repeats, in the 5′ UTR of the human SBF1 gene, in which an excess of the shorter allele was detected in the NCD group 28.
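The genotype grouping described above can be illustrated with a minimal Python sketch. The function names and the toy genotypes are hypothetical and introduced here for illustration only; they are not the analysis code or the data of this study. Each genotype is assumed to be given as a pair of (GCC)-repeat counts.

```python
from collections import Counter

def classify_genotype(genotype):
    """Label a diploid genotype given as a pair of (GCC)-repeat counts."""
    has_7 = 7 in genotype
    has_9 = 9 in genotype
    if has_7 and not has_9:
        return "7 without 9"   # pattern proposed above as risk-associated
    if has_9 and not has_7:
        return "9 without 7"   # pattern proposed above as protective
    return "other"             # e.g. 7/9 heterozygotes or rare-allele genotypes

def tabulate(genotypes):
    """Count how many genotypes fall into each class."""
    return Counter(classify_genotype(g) for g in genotypes)

# Hypothetical toy genotypes, NOT data from this study.
toy_group = [(7, 7), (7, 9), (9, 9), (7, 8), (8, 10)]
toy_counts = tabulate(toy_group)
```

Running `tabulate` separately on the NCD and control groups would show whether either class is confined to one group; any statistical comparison would still need an appropriate test.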
Searching the Genome Aggregation Database (gnomAD) for the human SMAD9 (GCC)-repeat yielded inconclusive data for the annotated alleles and genotypes (https://gnomad.broadinstitute.org). This is most likely due to the frequent failure of general whole-exome sequencing methods to capture GC-rich sequences. Successful PCR amplification of the human SMAD9 gene is challenging and requires stringent conditions and special GC-rich buffer preparations, as described in the Methods. Furthermore, this STR is imperfect: it is disrupted by T nucleotides at its 3′ end, as revealed by the (GCT)-residue, which indicates that conventional fragment analysis may not be an efficient method for scoring (GCC)-repeats. Accurate data therefore require sequencing of every sample included in the study.
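As an illustration of why sequence-level scoring matters for an imperfect STR, the following Python sketch counts (GCC) units and any immediately following (GCT) units in a sequenced read. It is a minimal sketch under stated assumptions: the function name, the regular expression, and the toy reads with their flanking bases are hypothetical and are not the scoring procedure or the sequences used in this study.

```python
import re

# A run of GCC units, optionally followed by a run of GCT units.
STR_PATTERN = re.compile(r"((?:GCC)+)((?:GCT)*)")

def score_repeat(read):
    """Return (GCC units, GCT units) for the longest GCC run in the read."""
    best = (0, 0)
    for match in STR_PATTERN.finditer(read.upper()):
        gcc_units = len(match.group(1)) // 3
        gct_units = len(match.group(2)) // 3
        if gcc_units > best[0]:
            best = (gcc_units, gct_units)
    return best

# Hypothetical toy reads with made-up flanks, NOT sequences from this study.
read_a = "AAT" + "GCC" * 7 + "GCT" * 1 + "GGA"
read_b = "AAT" + "GCC" * 8 + "GGA"

score_a = score_repeat(read_a)
score_b = score_repeat(read_b)
same_length = len(read_a) == len(read_b)
```

In this toy case the two reads have the same length, so sizing by fragment length alone would not separate them, whereas the sequence-based scores differ.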
SMAD9 is predominantly expressed in the brain and skeletal tissues 18,29,30, and the protein encoded by this gene can translocate into the nucleus and affect transcriptional regulation of target genes. Higher-order brain functions and skeletal phenotypes (characteristics that have significantly diverged in primates vs. other orders of animals) may be selection forces for the expansion of this STR in primates. The skewed genetic architecture in the group of NCD individuals with moderate to severe brain dysfunction in our study, reflected predominantly in the AMTS and MMSE tests, supports a role for this STR in human higher brain functions. Various (GCC/GGC)-repeats of a similar length range to the human SMAD9 gene STR can alter gene expression activity 31,32. Our bioinformatics analysis revealed that the number of (GCC)-repeats may change the RNA secondary structure (stem-loops) and accessibility (unpaired RNA bases) of, at least, exons 1 and 2 of human SMAD9 (Fig. 5). Structurome data reveal a widespread association of RNA stem-loops with protein binding sites 33, which may, in turn, alter the processes linked to transcription and translation.
Another interesting feature of this locus, and of a number of previously reported loci, is the presence of low-frequency alleles, which might have been subject to negative natural selection 8,14,34. In human SMAD9, examples of those alleles are (GCC)8 and (GCC)10. Two genotypes that included those rare alleles, i.e., 7/8 and 8/10, were detected in the NCD group only. Genotypes that included low-frequency alleles at a (GCC) locus were also detected in NCD patients at the RASGEF1C and SBF1 gene loci 14,28. Although neither the alleles nor the genotypes of the (GCT)-residue were skewed in the NCD group vs. controls, the conjunction of (GCT)1 and (GCT)3 with (GCC)7 was detected in two NCD patients and not in any controls.
Reported instances of STR allelic natural selection at non-coding loci in humans are rare 8,14, and the SMAD9 (GCC)-repeat provides a potentially valuable locus to further test this phenomenon.
Sequencing of this STR locus in larger samples and in a spectrum of neurological and skeletal disorders is warranted. The mechanisms underlying allele and genotype selection should also be examined in future functional studies.