Novel biological implication of a strictly monomorphic GCC repeat in the human PRKACB gene


Across human protein-coding genes, PRKACB (Protein Kinase CAMP-Activated Catalytic Subunit Beta) contains one of the longest GCC-repeats, and is predominantly expressed in the brain. Here we studied this STR in 300 human subjects, comprising patients with late-onset neurocognitive disorder (NCD) (N = 150) and controls (N = 150). We also studied the impact of this STR on the three-dimensional structure of DNA. While the PRKACB GCC-STR was otherwise strictly monomorphic at 7-repeats, we detected two 7/8 genotypes, both in the NCD group. In comparison with all other lengths, (GCC)7 had the least effect on the three-dimensional structure of DNA, evidenced by minimal divergence between 0 and 7-repeats (divergence score = 0.04) and significant divergence between 0 and 8 repeats (divergence score = 0.50). A comparable inert effect was not detected in other classes of STRs, such as GA and CA repeats. In conclusion, we report monomorphism of an exceptionally long GCC repeat in the PRKACB gene in human, its inert effect on DNA structure, and divergence in two cases of late-onset NCD. This is the first indication of natural selection for an exceptionally long monomorphic GCC-repeat, which probably evolved to function as an “epigenetic knob”, without changing the regional DNA structure.


Introduction
Short tandem repeats (STRs) are the most polymorphic genetic elements in vertebrate genomes. Because of their polymorphic nature and plasticity, these elements provide an efficient source of variation at the inter- and intraspecies levels [1][2][3][4][5][6]. Accumulating evidence indicates that certain STRs may be associated with selective advantage in human and other species [7][8][9][10]. Among the exceptionally long STRs spanning the core promoter and 5′ untranslated region (UTR) of human protein-coding genes, the protein kinase cAMP-activated catalytic subunit beta (PRKACB) gene contains one of the longest GCC-repeats, at 7-repeats [5]. Across human tissues, PRKACB has its highest level of expression in the brain [11] (https://www.proteinatlas.org/ENSG00000142875-PRKACB/tissue). Furthermore, in comparison with fourteen non-human primates, brain expression of this gene is highest in human (https://www.ncbi.nlm.nih.gov/ieb/research/acembly/av.cgi?db=human&term=PRKACB&submit=Go). The kinase encoded by PRKACB is involved in tau phosphorylation at Alzheimer's disease (AD)-related sites, and regulation of this gene has been linked to neuroprotective effects against tau and Aβ-induced toxicity [12,13].
Here we studied the PRKACB gene GCC-repeat in a sample of human subjects, consisting of late-onset neurocognitive disorder (NCD) (also incorrectly known as dementia) and controls, and analyzed its impact on DNA three-dimensional (3D) structure.

Results
Predominant monomorphism of the PRKACB GCC-repeat in human, at 7-repeats, and divergence from monomorphism in two patients afflicted with late-onset NCD.
We sequenced the PRKACB GCC-repeat in 300 human subjects and found this STR to be monomorphic at 7-repeats (Fig. 1), with the exception of two 7/8 genotypes in two patients with late-onset NCD (Fig. 2). The two patients harboring the 7/8 genotypes were women aged 78 (Patient 1) and 83 (Patient 2), with AMTS of 4 and 3, respectively.
In patient 1 the progression of neurocognitive symptoms was gradual, strengthening the possible diagnosis of AD. In patient 2, symptoms of neurocognitive impairment including aphonia and dementia occurred abruptly and subsequent to a cerebrovascular accident two years before the interview, strengthening the possible diagnosis of vascular dementia. Other causes of late-onset neurological disorders were ruled out in both patients by neurologists.

Status of the PRKACB GCC-repeat across vertebrates.
The PRKACB GCC-repeat was highly conserved in numerous orders of vertebrates (Table 1).
This repeat probably emerged in birds and reptiles, as we did not detect it in amphibians, fish, or other eukaryotes. In human, the interval from +1 to +100 relative to the transcription start site (TSS) contains two GCC-repeats, (GCC)3 and (GCC)7 (Table 1). We detected a long ancestral trace of GCC-repeats, identifiable by triplet nucleotides of GCC or non-GCC in various species (Fig. 3). This long trace of GCC resulted in complex and unique GCC blocks in every species.

In silico reconstruction of the human PRKACB 5′ UTR sequence encompassing various GCC repeats.
DNA reconstruction of the PRKACB 5′ UTR encompassing (GCC)7 revealed that (GCC)7 is an inert element that causes the smallest change to the structure of DNA in comparison with other repeat lengths (Table 2) (Fig. 4). For example, the 0- and 7-repeat constructs were strikingly similar (divergence score = 0.04), whereas the divergence score between the 7- and 8-repeat constructs, the genotypes of which were observed in two cases of late-onset NCD, was among the highest (0.50), indicating a dramatic change in the 3D structure between 7 and 8 repeats (Table 2).
Although we detected only two alleles in the human population studied (7- and 8-repeats), we reconstructed DNA for various repeat lengths in order to analyze their effect on the 3D structure of the region. This analysis revealed that subtraction or addition of 7-repeats resulted in the lowest divergence in the 3D structure in comparison with other repeat lengths (Table 2).
To explore the specificity of the inert effect of the PRKACB GCC repeat, we performed DNA reconstruction of additional GCC/GGC-repeats in the 5′ UTR of the SMAD9 and RASGEF1C genes, as well as a non-GCC STR, namely a GA-repeat in the 5′ UTR of human GPM6B.
Strikingly, while the reconstructed 3D structures were significantly divergent at various lengths in the case of the GPM6B 5′ UTR GA-repeat, the SMAD9 and RASGEF1C GCC 3D structures were almost identical for all repeat lengths studied (Suppl. 1). We also studied the interactive effect of divergent non-repeat flanking sequences with the GCC repeats in a non-human species, the capuchin. That analysis showed that the inert effect of the GCC block is dominant over the effect of the flanking non-repeat sequences (Suppl. 1).

Gene network reconstruction
Based on the available experimental evidence, the reconstructed interactive network consisted of 16 nodes and 45 edges (Fig. 5). Within this network, PRKACB interacts with genes of other subunits of the cAMP-dependent protein kinase complex, as well as other genes such as MAPT, FOXO1, and PDE5A, which are central to maintaining cell structure and function.

Discussion
Because of their propensity to epigenetic alterations such as hypermethylation, GCC-repeats are mutation hotspots [14,15], and this may be the reason why this class of repeats does not expand beyond certain lengths unless selected for, such as in gene regulatory regions [16]. Expansion of GCC repeats in the 5′ UTR of a number of genes is associated with hypermethylation and intellectual disability [17]. Based on the Ensembl database, the PRKACB gene is among the top ten genes with respect to the length of its GCC-repeat [5]. Here we show that this GCC-repeat is predominantly monomorphic in human at 7-repeats, and that there may be divergence from this monomorphism in late-onset NCD. We propose that there was a selective advantage in human for this particular STR to reach, and stabilize at, that length. This is in line with the observation that mutations with negative fitness consequences tend to be eliminated from the population [18]. This STR was found to be highly conserved across various mammalian species, suggesting an important role in growth and development. In general, STRs near the TSSs of genes are often highly conserved, and the distance from an STR to the nearest TSS is a good predictor of the STR conservation score [19].
We also found that (GCC)7 had the least effect on the 3D structure, in contrast to the more pronounced effects of other repeat lengths, which probably confers a modulatory effect on (GCC)7 without changing the structure of DNA or RNA. Interestingly, the divergence score between the 7- and 8-repeat constructs (the latter being an allele of the two diverged genotypes) was among the highest. While we also detected an inert effect of GCC/GGC repeats in two additional genes, significant divergence has been observed with various CA-repeat lengths in the case of the human NHLH2 gene [9]. The above findings indicate that exceptionally long GCC/GGC repeats might have evolved to exert epigenetic effects (epigenetic knobs) without changing the structure of the region.
GCC motifs of STRs significantly overlap with G-quadruplex (G4) non-B structures [16]. Recent research indicates that organisms may have evolutionarily developed G4 into a novel, reversible, and elaborate transcriptional regulatory mechanism benefiting multiple physiological activities of higher organisms [20][21][22].
Future studies are warranted to sequence the PRKACB GCC-STR in a large number of patients with neurological disorders. Considering the pivotal role of PRKACB in the brain, CRISPR/Cas9 methods may also be suitable for editing this STR at the genomic level and investigating its effect on the differentiation of human stem cells into neural cells.
In conclusion, we report the first instance of predominant monomorphism of an exceptionally long GCC repeat in human and divergence from this monomorphism in human disease. We also propose that GCC-STRs might have evolved in the genome as regulatory elements (for example, through epigenetic regulation) without dramatically changing the 3D structure of DNA.

Subjects
Three hundred unrelated Iranian subjects, consisting of late-onset NCD patients (N = 150) and controls (N = 150), were recruited from the provinces of Tehran, Qazvin, and Rasht. In each NCD case, the Persian version of the Abbreviated Mental Test Score (AMTS) [23,24] was implemented (AMTS of <7 was an inclusion criterion for NCD), medical records were reviewed in all participants, and CT-scans were taken where possible. The AMTS is currently one of the most accurate primary screening instruments for NCD [25], and the Persian version of the AMTS is a valid cognitive assessment tool for older Iranian adults, which can be used for NCD screening in Iran [23]. The control group was selected based on AMTS of >7, lack of major medical history, and CT-scan where possible. The cases and controls were matched based on age, gender, and residential district. The subjects' informed consent was obtained (from their guardians where necessary) and their identities remained confidential throughout the study. The research was approved by the Ethics Committee of the University of Social Welfare and Rehabilitation Sciences, Tehran, Iran, and was consistent with the principles outlined in an internationally recognized standard for the ethical conduct of human research. All methods were performed in accordance with the relevant guidelines and regulations.

Allele and genotype analysis of the PRKACB GCC-repeat.
Genomic DNA was obtained from peripheral blood using a standard salting-out method. PCR reactions for the amplification of the PRKACB GCC-repeat were set up with the following primers: […]. The PCR products were sequenced with the forward primer, using an ABI PRISM 377 DNA sequencer.
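The genotype counts reported in the Results (300 subjects, two of whom carried the 7/8 genotype) can be converted into allele frequencies with simple arithmetic. The following is a minimal sketch; the function name `allele_frequencies` is hypothetical and is not part of the original analysis.

```python
from collections import Counter

# Genotype counts taken from the Results: 300 subjects, two of them 7/8.
genotype_counts = {(7, 7): 298, (7, 8): 2}

def allele_frequencies(genotypes):
    """Return {allele: frequency} from {(allele_a, allele_b): n_subjects}."""
    alleles = Counter()
    for (a, b), n in genotypes.items():
        alleles[a] += n   # each subject contributes two alleles
        alleles[b] += n
    total = sum(alleles.values())
    return {allele: count / total for allele, count in alleles.items()}

freqs = allele_frequencies(genotype_counts)
for allele, f in sorted(freqs.items()):
    print(f"(GCC){allele}: {f:.4f}")
```

With these counts, 598 of the 600 alleles are (GCC)7 and 2 are (GCC)8.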

Analysis of the PRKACB (GCC)7 across vertebrates.
The interval between +1 and +100 relative to the transcription start site (TSS) of PRKACB was searched across several orders of vertebrates based on Ensembl Release 102 (https://asia.ensembl.org/index.html). Alignment was performed using CodonCode Aligner 9.0.1.
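As an illustration of how uninterrupted GCC runs can be located in such an interval, the sketch below scans a sequence with a regular expression. The function `find_gcc_runs` and the sequence `toy_utr` are hypothetical; the toy sequence is not the real PRKACB 5′ UTR, and this is not necessarily the procedure used in the study.

```python
import re

def find_gcc_runs(seq, min_repeats=3):
    """Return (start, n_repeats) for each uninterrupted GCC run of at least
    min_repeats units; start is a 0-based offset into seq."""
    pattern = re.compile(r"(?:GCC){%d,}" % min_repeats)
    return [(m.start(), len(m.group()) // 3)
            for m in pattern.finditer(seq.upper())]

# Hypothetical fragment for illustration only, NOT the real PRKACB sequence.
toy_utr = "ATCG" + "GCC" * 3 + "TTAGA" + "GCC" * 7 + "ATGC"
runs = find_gcc_runs(toy_utr)
print(runs)
```

The pattern matches only in the GCC register; runs written as CCG or CGC would need their own patterns.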

In silico DNA reconstruction of various GCC-repeat lengths in human.
The DNA structures across the PRKACB GCC-STR, SMAD9 GCC-STR, RASGEF1C GGC-STR, and GPM6B GA-STR were reconstructed according to the AA-Wedge model. For each reconstruction, the 100-nucleotide flanking sequences of the repeats were also included. Among the existing DNA predicting models, namely the Crothers, Dickerson, Jernigan, Tung-Harvey, Zhurkin, and AA-Wedge models, the AA-Wedge model has been reported as the most consistent and accurate [26]. Based on the twist, roll, and tilt, this model predicts experimental A-tract curvature as measured by gel retardation and cyclization kinetics [27]. After obtaining the coordinates of the nucleotides in 3D space, the DNA structure was visualized using the plot3D package in R [28].
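To make the idea of a wedge-type reconstruction concrete, the sketch below traces a helix-axis path from per-dinucleotide twist, roll, and tilt angles. It is a generic illustration under stated assumptions, not the authors' implementation: the angle table `TOY_STEPS` contains placeholder values rather than the published AA-Wedge parameters, the rotation order and the constant rise of 3.4 Å are assumed conventions, and the flanking sequences in `construct` are invented.

```python
import math

RISE = 3.4  # Angstrom per base-pair step; assumed constant (typical B-DNA)

# PLACEHOLDER (twist, roll, tilt) angles in degrees per dinucleotide step.
# Illustrative only; these are NOT the published AA-Wedge parameters.
DEFAULT_STEP = (36.0, 0.0, 0.0)
TOY_STEPS = {"AA": (35.6, -6.5, 3.2), "GC": (36.0, 1.0, 0.0)}

def rot(axis, deg):
    """3x3 rotation matrix (nested lists) about the x, y or z axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    if axis == "x":
        return [[1, 0, 0], [0, c, -s], [0, s, c]]
    if axis == "y":
        return [[c, 0, s], [0, 1, 0], [-s, 0, c]]
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def helix_axis_path(seq, steps=TOY_STEPS):
    """Return a list of (x, y, z) base-pair centres, one per nucleotide."""
    frame = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # columns = local x, y, z axes
    pos = (0.0, 0.0, 0.0)
    path = [pos]
    for i in range(len(seq) - 1):
        twist, roll, tilt = steps.get(seq[i:i + 2], DEFAULT_STEP)
        # Assumed convention: twist about z, then roll about y, then tilt about x.
        for axis, angle in (("z", twist), ("y", roll), ("x", tilt)):
            frame = matmul(frame, rot(axis, angle))
        # Advance one rise along the local helix axis (third column of frame).
        pos = tuple(pos[k] + RISE * frame[k][2] for k in range(3))
        path.append(pos)
    return path

def construct(n_repeats, flank5="ACGT" * 25, flank3="TGCA" * 25):
    """Hypothetical construct: invented 100-nt flanks around (GCC)n."""
    return flank5 + "GCC" * n_repeats + flank3

path7 = helix_axis_path(construct(7))
print(len(path7), "points")
```

Calling `helix_axis_path(construct(n))` for several values of `n` yields coordinate sets that can then be plotted or compared between repeat lengths.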

Divergence calculation across DNA constructs of various GCC repeat lengths.
Let $(x_k, y_k, z_k)$ for $k = 1, 2, \ldots, n$ be the coordinates of the points for each repeat. In the first step, we scaled these coordinates as follows: […] The (i, j)-th elements revealed the divergence scores obtained from the above method between the i-th and j-th diagrams.
The accuracy of the data was validated by pairwise comparison of constructs of identical lengths.
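The exact scaling and scoring formulas are not recoverable from the text above, so the sketch below substitutes a generic approach purely for illustration: each path is resampled to a common number of points, centred and scaled to unit root-mean-square radius, and the score is the mean point-wise distance. All function names and the toy paths are hypothetical, and the resulting numbers are not comparable to the divergence scores reported in this study.

```python
import math

def resample(path, m=200):
    """Linearly interpolate a list of (x, y, z) points to exactly m points."""
    out = []
    for j in range(m):
        t = j * (len(path) - 1) / (m - 1)   # fractional index into path
        lo = min(int(t), len(path) - 2)
        f = t - lo
        out.append(tuple(p + f * (q - p)
                         for p, q in zip(path[lo], path[lo + 1])))
    return out

def normalise(path):
    """Centre on the centroid and scale to unit root-mean-square radius."""
    n = len(path)
    c = [sum(p[k] for p in path) / n for k in range(3)]
    centred = [tuple(p[k] - c[k] for k in range(3)) for p in path]
    rms = math.sqrt(sum(x * x + y * y + z * z for x, y, z in centred) / n)
    return [tuple(v / rms for v in p) for p in centred]

def divergence(a, b, m=200):
    """Mean point-wise distance between two resampled, normalised paths."""
    pa, pb = normalise(resample(a, m)), normalise(resample(b, m))
    return sum(math.dist(p, q) for p, q in zip(pa, pb)) / m

def divergence_matrix(paths, m=200):
    """Symmetric matrix whose (i, j) entry compares path i with path j."""
    n = len(paths)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = divergence(paths[i], paths[j], m)
    return d

# Toy paths (hypothetical): a straight rod, a longer rod, and a bent rod.
ts = [i / 49 for i in range(50)]
straight = [(0.0, 0.0, 10 * t) for t in ts]
longer = [(0.0, 0.0, 20 * t) for t in ts]
bent = [(3 * t * t, 0.0, 10 * t) for t in ts]
D = divergence_matrix([straight, longer, bent])
```

In this sketch, two identical constructs score exactly zero, which mirrors the validation step of comparing constructs of identical lengths. The measure is not rotation-invariant, so it assumes all paths are built from the same starting frame.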

Gene network reconstruction
The STRING database (https://string-db.org) was used to find the interactions of PRKACB with other genes. STRING is a biological database of known and predicted molecular interactions, containing information from experimental data, computational prediction, and public text collections [29]. The minimum required interaction score was set to 0.7 and the maximum number of interactions to show was set to 20. To ensure the most important and reliable interactions, we selected only data from experimental studies as the active interaction sources. Subsequently, using Cytoscape version 3.8.2 and according to the interactions found, the interactive network was reconstructed [30].
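The effect of the two thresholds can be illustrated on a small edge list. The sketch below is a loose analogue of the filtering described above, not a reproduction of STRING's internal logic: the function `filter_network` is hypothetical, and the scores in `toy_edges` are invented rather than taken from the STRING database.

```python
# Hypothetical edge list (gene_a, gene_b, experimental_score). The scores are
# invented for illustration and are NOT values from the STRING database.
toy_edges = [
    ("PRKACB", "PRKAR1A", 0.95),
    ("PRKACB", "PRKAR2B", 0.92),
    ("PRKACB", "MAPT", 0.74),
    ("PRKAR1A", "PRKAR2B", 0.81),
    ("PRKACB", "GENE_X", 0.45),   # below the 0.7 threshold, so dropped
]

def filter_network(edges, seed="PRKACB", min_score=0.7, max_partners=20):
    """Keep edges scoring at least min_score, restricted to the seed gene and
    its max_partners highest-scoring direct partners."""
    kept = [e for e in edges if e[2] >= min_score]
    partners = {}
    for a, b, s in kept:
        if seed in (a, b):
            other = b if a == seed else a
            partners[other] = max(s, partners.get(other, 0.0))
    top = sorted(partners, key=partners.get, reverse=True)[:max_partners]
    nodes = {seed, *top}
    sub_edges = [(a, b, s) for a, b, s in kept if a in nodes and b in nodes]
    return nodes, sub_edges

nodes, sub_edges = filter_network(toy_edges)
print(len(nodes), "nodes,", len(sub_edges), "edges")
```

On this toy input the filter retains four nodes and four edges; the resulting node and edge lists are the kind of input that can be loaded into a network viewer such as Cytoscape.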

Author contributions
Safoura Khamse and Zahra Jafarian did the molecular experiments.
Ali Bozorgmehr conceived and performed the in silico modeling of the DNA constructs and wrote part of the manuscript.
Mostafa Tavakoli did the mathematical analyses.
Hossein Afshar collected the human samples and their clinical information.
Maryam Keshavarz and Razieh Moayedi provided critical comments and contributed to data collection.
Mina Ohadi conceived and supervised the project, and wrote the manuscript.

Conflict of interests
The authors declare that there is no conflict of interest.