2.1 PolyQ regions can be categorized based on their evolutionary stability
A protein sequence is often thought of as static: a fixed, ordered set of amino acids that defines the protein as a biological entity. In reality, every sequence is constantly under selective pressure that may favor sequence alterations. Here we study whether polyQ regions can be categorized based on their evolutionary stability, which we define by comparing polyQ regions across orthologs from taxonomically close species.
We prepared aligned sets of orthologs in which at least one protein has a polyQ region (see Methods for details), for four distinct taxonomic groups: Class Insecta, Infraclass Teleostei, Clade Sauria and Class Mammalia. The aligned polyQ regions were then studied in detail to assess the glutamine conservation per aligned position (Figure 1). Each position has three possible values, depending on the proportion of sequences having a glutamine:
- (a) conserved, if >=80% of the sequences have a glutamine;
- (b) unstable, if <80% but >=20% of the sequences have a glutamine; and
- (c) uncategorized, if <20% of the sequences have a glutamine.
For (a) and (c), the glutamine in that position is clearly conserved in or absent from the alignment, respectively. For (b), we further defined three sub-values:
- (b1) inserted, if there are more gaps than non-glutamine amino acids;
- (b2) mutated, if there are more non-glutamine amino acids than gaps;
- (b3) undefined, if the number of gaps and non-glutamine amino acids is the same.
The (b) value reflects an aligned position that is not dominated by glutamines, and its sub-value indicates the mechanism that may underlie the variation at that position. More gaps would indicate length variation in the sequence, whereas a majority of non-glutamine residues would imply point mutations either to or from glutamine.
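The per-position rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the function name `classify_position` is hypothetical, gaps are assumed to be written as `-`, and the proportion of glutamines is assumed to be computed over all aligned sequences, gaps included.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol in the alignment


def classify_position(column: str) -> str:
    """Label one alignment column by its glutamine conservation.

    `column` holds one character per aligned ortholog (residue or gap).
    Returns 'conserved', 'uncategorized', or one of the unstable
    sub-values 'inserted', 'mutated', 'undefined'.
    """
    n = len(column)
    if n == 0:
        raise ValueError("empty alignment column")
    counts = Counter(column.upper())
    q = counts["Q"]
    gaps = counts[GAP]
    other = n - q - gaps  # non-glutamine amino acids

    # Integer arithmetic avoids floating-point issues at the 80%/20% cut-offs.
    if 5 * q >= 4 * n:   # >= 80% glutamine
        return "conserved"
    if 5 * q < n:        # < 20% glutamine
        return "uncategorized"
    # Unstable position: decide the sub-value from gaps vs. other residues.
    if gaps > other:
        return "inserted"
    if other > gaps:
        return "mutated"
    return "undefined"
```

For example, a column `QQ---` would be labeled as inserted and a column `QQAAP` as mutated.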
The stability of the polyQ is categorized by taking into consideration all the values of glutamine conservation along the complete span of the polyQ region. Five categories are defined:
- Category 1, stable polyQ. More (a) conserved than (b) unstable positions.
- Category 2, unstable polyQ by length variation. More (b) unstable than (a) conserved positions, and more (b1) inserted than (b2) mutated and (b3) undefined.
- Category 3, unstable polyQ by mutations. More (b) unstable than (a) conserved positions, and more (b2) mutated than (b1) inserted and (b3) undefined.
- Category 4, unstable polyQ by undefined mechanism. More (b) unstable than (a) conserved positions, and not in category 2 or 3.
- Category 5, uncategorized. Not in any previous category.
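As a minimal sketch, the five categories can be derived from the per-position labels of a region as follows. The function name `categorize_polyq` is hypothetical, and two readings are assumptions rather than statements from the text: "more inserted than mutated and undefined" is taken as strictly more than each of the other two sub-values, and a tie between conserved and unstable positions falls into category 5.

```python
from collections import Counter
from typing import Iterable


def categorize_polyq(position_values: Iterable[str]) -> int:
    """Return the stability category (1-5) of a polyQ region.

    `position_values` contains one label per aligned position:
    'conserved', 'inserted', 'mutated', 'undefined' or 'uncategorized'.
    """
    counts = Counter(position_values)
    conserved = counts["conserved"]
    inserted = counts["inserted"]
    mutated = counts["mutated"]
    undefined = counts["undefined"]
    unstable = inserted + mutated + undefined

    if conserved > unstable:
        return 1  # stable polyQ
    if unstable > conserved:
        # Assumption: the winning sub-value must exceed each of the others.
        if inserted > mutated and inserted > undefined:
            return 2  # unstable by length variation
        if mutated > inserted and mutated > undefined:
            return 3  # unstable by mutations
        return 4      # unstable by undefined mechanism
    return 5          # uncategorized (ties, or no conserved/unstable positions)
```

Under these assumptions, a region with one conserved, three inserted and one mutated position would fall into category 2.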
Although five categories are defined, only for categories 1, 2 and 3 is it clear how the stability or instability of the polyQ region arises in the orthologs. In category 4 it is by construction not possible to determine whether the instability comes from an indel or a point mutation, and category 5 by definition collects the uncategorized regions. For the purpose of the present work we hereafter discard the polyQ regions in categories 4 and 5, and refer to the selected categories as stable (1), inserted (2) and mutated (3).
Stating that a polyQ is stable means that it is conserved in at least 80% of the orthologs. It follows that the categorization of a polyQ is intrinsically affected by the choice of species used in the study (Figure 2a), as more similar orthologs would be expected to yield more stable polyQ. However, this relation is not straightforward: sequence similarity among the orthologs (Figure 2b) is not directly correlated with having more stable polyQ regions (Figure 2c). Notably, Mammalia has a higher proportion of stable polyQ even compared with Teleostei and Sauria, for which we obtained similarly conserved orthologs. The difference from Insecta could be due to the higher divergence of the set of insect orthologs, which results in more polyQ identified as mutated or inserted. Nonetheless, with our choice of species the number of polyQ regions per taxon is high enough to obtain statistically significant results (2639, 3588, 1884 and 2639 in Insecta, Teleostei, Sauria and Mammalia, respectively). For the record, the number of discarded polyQ (categories 4 and 5) was 1496, 740, 359 and 388 in Insecta, Teleostei, Sauria and Mammalia, respectively.
We included results for the taxon Nephrozoa, taking one representative species from each of the taxa considered above (see Methods for details), to show the effect of lower sequence similarity when using distant orthologs (Figure 2b) and, as a result, the very low number of stable polyQ compared to unstable ones (Figure 2c).
2.2 Glutamine codon usage depends on polyQ stability category
Glutamine codon usage is biased towards CAG over CAA codons, especially in polyQ regions of Vertebrata species (3:1 proportion)9. Here we aim to determine whether this bias differs depending on the stability category of the polyQ region. To this end, we studied the codon usage of all glutamines within the orthologous polyQ regions. This means that not all glutamines considered are necessarily part of a polyQ themselves, but each lies within a region that forms a polyQ in at least one of the orthologs. Even if they do not form a polyQ themselves, placing them in an evolutionary context by studying their orthologs could show whether they have sequence or structural features consistent with forming a polyQ.
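The %CAG measure used in the following can be illustrated with a short sketch. It assumes an in-frame DNA coding sequence for the region of interest and defines %CAG as the share of CAG among all glutamine codons (CAG plus CAA); the function name `percent_cag` is hypothetical, and this is not necessarily the exact procedure used in this work.

```python
from typing import Optional


def percent_cag(cds: str) -> Optional[float]:
    """Percentage of glutamine codons that are CAG in an in-frame DNA sequence.

    Returns None when the sequence contains no glutamine codon.
    A trailing partial codon, if any, is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    cag = codons.count("CAG")
    caa = codons.count("CAA")
    if cag + caa == 0:
        return None
    return 100 * cag / (cag + caa)
```

A background ratio of 3:1 (CAG to CAA) corresponds to 75 %CAG, and a ratio of 1:1 to 50 %CAG.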
PolyQ regions in the chordate taxa (Teleostei, Sauria and Mammalia) are significantly enriched in CAG codons, close to or slightly above the background CAG to CAA codon ratio of 3:1, as expected (Figure 3b-d). In contrast, glutamine codon usage in polyQ from Insecta does not differ much from the background, which is close to 1:1 (Figure 3a); an unexpected decrease in %CAG is even apparent in stable polyQ. According to the significance values obtained when comparing the distributions, the most significant differences are a higher %CAG in inserted polyQ in Insecta and Mammalia, and a different %CAG in stable polyQ in Teleostei (higher) and in Sauria (lower). Separation above the background is larger in Sauria and Mammalia than in Insecta and Teleostei, with the highest %CAG in inserted polyQ. This is in line with previous results showing that trinucleotide repeats are subject to a CAG-slippage mechanism17.
The categories of polyQ stability can be significantly distinguished by their %CAG, but the differences depend on the taxon. The results suggest that the CAG-slippage mechanism by which polyQ regions vary in length is predominant in inserted polyQ in Amniota species (comprising the taxa Sauria and Mammalia).
2.3 The polyQ amino acid context is influenced by polyQ stability
Next, we studied the sequences surrounding the polyQ regions. Previous work has described how polyQ regions are usually followed by a proline-rich or polyP region13. In addition, we recently described an unusual peak of leucine residues in position -1 of polyQ regions from several taxonomically diverse species10. Here, we took one sequence at random from each set of orthologous regions, with the constraint that the selected sequence must have a polyQ. We then calculated the proportion of both leucine and proline residues in the preceding (from position -10 to -1) and following (from position +1 to +10) regions.
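The position-specific proportions described above could be computed along these lines. The sketch assumes that each selected sequence comes with the coordinates of its polyQ (0-based start, end-exclusive), and that the proportion at each position is taken over the sequences that are long enough to have a residue there; the function name `flank_frequencies` and both conventions are assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple


def flank_frequencies(
    regions: Iterable[Tuple[str, int, int]],
    residue: str,
    flank: int = 10,
) -> Dict[int, Optional[float]]:
    """Proportion of `residue` at positions -flank..-1 and +1..+flank.

    Each item of `regions` is (sequence, start, end), where the polyQ
    occupies sequence[start:end]. Position -1 is the residue right before
    the polyQ and +1 the residue right after it. Positions not covered by
    any sequence map to None.
    """
    offsets = [o for o in range(-flank, flank + 1) if o != 0]
    hits = dict.fromkeys(offsets, 0)
    totals = dict.fromkeys(offsets, 0)
    for seq, start, end in regions:
        for o in offsets:
            # Negative offsets count back from the polyQ start,
            # positive offsets count forward from the polyQ end.
            idx = start + o if o < 0 else end + o - 1
            if 0 <= idx < len(seq):
                totals[o] += 1
                if seq[idx] == residue:
                    hits[o] += 1
    return {o: (hits[o] / totals[o] if totals[o] else None) for o in offsets}
```

Calling the function once with `"L"` and once with `"P"` for the sequences of a given category would give the two profiles compared per category and taxon.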
Although the results for leucine residues are not consistent in any of the taxa, the highest proportion in position -1 is always found in stable polyQ (Figure 4a). This is especially clear for mammalian sequences. It must be noted that it is difficult to draw conclusions from the noisy results for Insecta due to the low number of sequences considered (123 stable polyQ, versus, for example, 1761 stable polyQ in Mammalia). We also note that the threshold used in this work to detect polyQ regions, 4/6, results in a lower signal for leucines in position -1 than other thresholds (4/4, 6/6, 6/8, 8/10) when examining human proteins10. While the 4/6 threshold maximizes the number of polyQ found, features specific to longer polyQ may be obscured by the larger number of short polyQ.
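To illustrate how the detection threshold affects the regions found, the following sketch scans a sequence with a sliding window. It rests on assumptions that should be checked against the Methods: a threshold X/Y is read as at least X glutamines in a window of Y residues, overlapping windows are merged, and the merged span is trimmed to its first and last glutamine. The function name `find_polyq` is hypothetical.

```python
from typing import List, Tuple


def find_polyq(seq: str, min_q: int = 4, window: int = 6) -> List[Tuple[int, int]]:
    """Return polyQ spans as (start, end), 0-based and end-exclusive.

    Assumed reading of a threshold 'min_q/window': a window of `window`
    residues qualifies if it contains at least `min_q` glutamines.
    """
    seq = seq.upper()
    spans: List[List[int]] = []
    for i in range(len(seq) - window + 1):
        if seq[i:i + window].count("Q") >= min_q:
            if spans and i <= spans[-1][1]:
                spans[-1][1] = i + window   # merge with the previous span
            else:
                spans.append([i, i + window])
    trimmed = []
    for start, end in spans:
        # Trim flanking non-glutamine residues (an assumption of this sketch).
        while seq[start] != "Q":
            start += 1
        while seq[end - 1] != "Q":
            end -= 1
        trimmed.append((start, end))
    return trimmed
```

Under this reading, a run of four glutamines is detected with 4/4 and 4/6 but not with 6/6, which illustrates why the 4/6 threshold yields a larger number of short polyQ.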
Almost the complete N-terminal and C-terminal surrounding sequences of inserted polyQ regions in the four taxa are enriched in prolines, especially C-terminally to polyQ in mammalian sequences (Figure 4b). The signal is also present to some extent in mutated polyQ. The C-terminal capping of polyQ regions has been proposed as a protective mechanism against the aggregation they induce14. As aggregation propensity is polyQ-length dependent16, it is not surprising that inserted polyQ (which is defined by having varying length) is the category that most clearly shows this position-specific enrichment. We observed a remarkable drop in proline frequency in position -2, shared by all categories in all taxa. Prolines in that position would hinder the function of polyQ in expanding a preceding alpha-helix upon protein interaction, since prolines act as helix breakers.
2.4 Stable polyQ have higher tendency to be preceded by helical structure
We previously reported that polyQ regions have a tendency to be preceded by a sequence in helical conformation and followed by a region in random coil conformation1. Following the strategy of the previous sections, and assuming that the secondary structure of all orthologous sequences will be similar, we predicted the secondary structure of one protein per set of orthologs using the tool JPred18. More specifically, we used as input sequence the polyQ region plus its sequence context. JPred uses a neural network to classify each residue as alpha helix (helical), beta sheet (extended) or other secondary structure.
In all polyQ categories and taxa there is a higher helical content N-terminally to the polyQ (with a steep increase from the N-terminal side for stable polyQ; Figure 5). The aggregation of structural predictions by taxon (last column) and by category (bottom row) highlights that the category in which a polyQ is classified (stable, inserted or mutated) is more relevant than the taxon with respect to its surrounding secondary structure.
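A per-position helical content of the kind summarized here could be obtained as sketched below. The sketch assumes that each prediction is a string of equal length covering the same window around the polyQ, with `H` for helix, `E` for extended and `-` for other; the function name `helix_fraction` and the windowing are assumptions for illustration only.

```python
from typing import List, Sequence


def helix_fraction(predictions: Sequence[str]) -> List[float]:
    """Fraction of predictions that are helical ('H') at each position.

    All prediction strings must have the same length, i.e. they are
    assumed to be aligned on the same window around the polyQ.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    n = len(predictions)
    return [sum(p[i] == "H" for p in predictions) / n for i in range(length)]
```

Computing this profile separately for the predictions of each category, or of each taxon, would give the aggregated views compared in the text.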
Using the results given by JPred we also studied the solvent accessibility of the residues in the input sequences. However, we found no differences between taxa or categories (data not shown). Similarly, we predicted the coiled coil propensity of the sequences using DeepCoil and also found no significant differences (data not shown).
2.5 The protein-protein interaction capacity of a polyQ is affected by its stability
PolyQ regions are associated with protein-protein interactions (PPI)13. The propensity of a polyQ region to interact may be related to its stability category, as the structural differences described in the previous section may play a role in its interaction capacity. We considered all proteins from the sets of orthologs, independently of whether they or their orthologs have the polyQ region, and calculated the number of high-confidence interactors per protein described in the STRING database. We discarded the proteins for which STRING had no entry, as we could not distinguish whether they indeed have no interactors or are simply missing from the database.
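Counting interactors per protein can be sketched as follows. The input is assumed to be a list of (protein, protein, combined score) records such as those distributed by STRING, and the cut-off of 700 corresponds to the level that STRING labels as high confidence (0.7); whether this exact cut-off was used here is an assumption, and the function name `count_interactors` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def count_interactors(
    links: Iterable[Tuple[str, str, int]],
    min_score: int = 700,  # assumed cut-off for "high confidence"
) -> Dict[str, int]:
    """Number of distinct high-confidence interactors per protein.

    Proteins that appear in `links` only with low-scoring interactions
    get a count of 0; proteins absent from `links` are absent from the
    result, so they can be discarded rather than counted as 0.
    """
    partners: Dict[str, Set[str]] = defaultdict(set)
    for a, b, score in links:
        partners[a]  # register both proteins as present in the database
        partners[b]
        if score >= min_score and a != b:
            # Sets make the count robust to pairs listed in both directions.
            partners[a].add(b)
            partners[b].add(a)
    return {protein: len(p) for protein, p in partners.items()}
```

Keeping proteins with a count of 0 separate from proteins that are missing mirrors the distinction made above between having no interactors and having no entry.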
The results show that, consistently in all taxa, proteins with stable polyQ have significantly more interactors than proteins with unstable polyQ (Figure 6a-d). As longer proteins tend to have more interactions13, we calculated the length distribution of the proteins to rule out the possibility that our results are due to stable polyQ being predominant in longer proteins: this is not the case (Figure 6e-h). Proteins with stable polyQ therefore have a greater number of interactors than those with unstable polyQ, and this is not explained by protein length. However, proteins with many interactors are more conserved, and this could explain the results.
2.6 Functional differences of polyQ categories
So far we have studied how the stability of a polyQ region affects its features at the nucleotide (codon usage), amino acid (sequence context), structural (secondary structure) and interaction (PPI) levels. Lastly, we aimed to check whether polyQ function depends on its stability. For this purpose, we performed GO enrichment analyses of all proteins from the sets of orthologs for one representative species per taxon. Analyses were done individually per category and species.
To ease the interpretation of the results, we extracted the top-10 enriched GO terms per dataset and analyzed the commonalities between categories within each species (Figure 7). Not surprisingly, the term ‘coiled coil’ is the top enriched term in almost all studied conditions, since polyQ are known to mediate coiled-coil interactions13. More interesting is the enrichment of the term ‘triplet repeat expansion’ exclusively for inserted polyQ of human proteins, which supports the classification of proteins according to the polyQ categories. Regarding subcellular location, stable polyQ are enriched in cytoplasm and nucleus, inserted polyQ only in nucleus, and mutated polyQ in cilium. Enriched terms related to signaling that are observed specifically for the human stable polyQ dataset and not for those of the other species (e.g. ‘protein phosphorylation’, Suppl. File 1) could account for the higher proportion of stable polyQ observed in mammals (Figure 2c), and are consistent with the enrichment in cytoplasmic location.
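The comparison of top-ranked terms between categories can be sketched with simple set operations. The sketch assumes that each enrichment result is a list of (term, p-value) pairs and that terms are ranked by increasing p-value; the ranking criterion and the function names `top_terms` and `exclusive_terms` are assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple


def top_terms(
    enrichment: Dict[str, List[Tuple[str, float]]], n: int = 10
) -> Dict[str, Set[str]]:
    """Top-n terms per category, ranked by increasing p-value (assumed)."""
    return {
        category: {term for term, _ in sorted(results, key=lambda r: r[1])[:n]}
        for category, results in enrichment.items()
    }


def exclusive_terms(top: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Terms found among the top terms of exactly one category."""
    exclusive = {}
    for category, terms in top.items():
        # Union of the top terms of every other category.
        others = set().union(*(t for c, t in top.items() if c != category))
        exclusive[category] = terms - others
    return exclusive
```

Terms shared by all categories of a species can be obtained analogously with `set.intersection(*top.values())`.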
Analyses were performed independently per species, with sets of proteins specific to each taxon, and as such the results need not be similar between species. However, it is interesting to see that some of them are comparable. As an example, GO terms related to ‘transcription’ are mostly limited to inserted polyQ. On the other hand, ‘cilium’ and ‘axon’ related terms are enriched only in mutated polyQ; both cellular structures are known to share proteins19.
We illustrate the presence of mutated polyQ in human proteins annotated with the term ‘cilium’ with the OFD1 protein (UniProtKB ID: O75665) encoded by the Ofd1 gene (Oral-facial-digital syndrome 1). This protein is required for the formation of primary cilia20. It is located in the distal centriole, at the cilium-centriole interface, and controls centriole length21. To examine the structural context of the mutated polyQ regions in this protein and in the other proteins annotated with the GO term ‘cilium’ (20 proteins in total), we searched for solved structures of the proteins or their homologs (using the web tool Aquaria22). Unfortunately, no structural information was available for or near the polyQ regions. For OFD1 we created a multiple sequence alignment including homologous and orthologous proteins of human OFD1 chosen from selected organisms, assisted by ProteinPathTracker23 (Figure 8a). The polyQ region (‘QQEQDQ’, amino acid positions 965-971) is aligned in a block with gaps but with variable Q content. The region is very close to the C-terminus of a coiled region (Figure 8b), which is consistent with the function expected for polyQ.