Genes under positive selection estimated using the branch model
The branch model was designed to detect branches or lineages with faster evolution [8]. Under a likelihood-ratio test framework, this methodology has been extensively used in screening positively selected genes in a specific lineage [9]. After filtering out genes without Ensembl-annotated orthologs, a total of 58 RVI genes were used for selection analyses. We compared the HPO-annotated responsive genes of recurrent viral, fungal, and bacterial infections (RVI, RFI and RBI respectively), which are phenotypes involving different classes of pathogens. The free-ratio model revealed that 77.59% of genes (45/58) showed signals of positive selection in at least one lineage (Supplementary table 1a and 1b). When we compared the free-ratio model with the one-ratio model, which assumes the absence of inter-lineage differences in the evolutionary rate, we found evidence that 70.69% of genes (41/58, χ2 test, p < 0.05) evolved at a rate that differed from the assumptions of neutral evolution. Among these, 55.17% of genes (32/58) were identified as positively selected genes. Notably, genes were identified as positively selected only if they conformed to two parameters simultaneously: positive selection identified with statistical significance (χ2 test, p < 0.05) and positive selection in at least one evolutionary lineage (ω > 1) (Supplementary Table 3 and Fig. 1a).
To make the fractions of positively selected genes (\({f}_{s}\)) of RVI comparable among the HPO-defined phenotypes, we firstly compared RVI with RBI and RFI, which are also phenotypes involved in pathogen-responses. Following the filtering of shared genes among the three phenotypic categories (RVI, RFI and RBI), we then focused on unique genes for each category. We found higher \({f}_{s}\) for unique genes of RVI (78.57%, 11/14) than for unique genes of RFI (57.14%, 12/21) and RBI (58.68%, 71/121) (Supplementary Table 1b, 1c and 1d). We further used the genes related to the "autoimmunity" category as a (proxy of) control, because autoimmunity is also within the scope of the immune system but not necessarily related to pathogens. When we focused on unique genes between the autoimmunity and RVI, the \({f}_{s}\) of the autoimmunity category (55.00%, 66/120) was also lower than that of the RVI (66.67%, 22/33) (Supplementary table 1e). Interestingly, on average, the background \({f}_{s}\) level of genome-wide protein-coding genes was 49.43% (8218/16625, Supplementary Table 2), which is also lower than that of RVI. These comparisons tentatively suggest that the probability of historical recurrent positive selection on genes responsive for antiviral responses is higher than the average genome-level positive selection, the non-pathogen immune phenotype (autoimmunity), and other pathogen-related immune phenotypes (bacteria and fungi).
To understand the long-term evolution of these genes, we mapped the genes with signals of significant positive selection (dN/dS > 1) onto the mammalian phylogenetic tree. All branches were decomposed into 12 phylostra or phylostratigraphic stages (PS), also the ancestral or outgroup branches. The genes with dN/dS ratios significantly larger than 1 in at least one branch were assigned into these stages. Interestingly, after counting the number of positively selected genes distributed across species, we found that the Old-World monkey species have the highest number of genes (19) with a positive selection signal (Fig. 1 and Supplementary Table 4, χ2 test, p < 0.05). Among all primates, 26 genes had significant positive selection signals (Supplementary table 5, χ2 test, p < 0.05). The Laurasiatherian species (cattle, cat, camel, dog, horse, pig, and yak) also demonstrated a higher number of genes (14) under positive selection than other remaining mammalian clades including bats and rodents (Fig. 1 and Supplementary Table 6).
The branch-site model revealed two sites of IRF9 are under a significant positive selection
To increase the resolution of positive selection analyses, we tried to understand whether the above 32 genes under positive selection have some specific sites under significant positive selection. To identify positively selected sites in the human lineage, we predefined the human branch as the foreground in the branch-site model analysis. We identified a significant signal of one gene, IRF9, with two sites under positive selection (χ2 test, p = 0.01036). The probabilities of positive selection on the two sites Val129 and Val333 were 0.992 and 0.652, respectively, based on the Naive Empirical Bayes (NEB) analysis (Table 1).
To assess whether the selection signals identified in IRF9 are robust, we attempted to corroborate these results using different tools in the HyPhy package. The aBSREL method found evidence of episodic diversifying selection on the human branch (p = 0.0114). For the likelihood test on sites, due to a coordinate change caused by an alignment gap, the site coordinate of Val333 in the branch-site model of PAML was changed into site 340. The MEME method confirmed that Val129 and Val333 (i.e., 340) are positively selected sites (LRT, p < 0.05). The BUSTED method with synonymous rate variation found evidence (LRT, p = 0.027) of episodic diversifying selection acting upon the IRF9 in the human lineage (selected as the foreground branch). Therefore, there is evidence that at least one site on the human branch has experienced diversifying selection (Fig. 2a). The FEL method also supported the positive selection on two sites Val129 and Val333 (i.e., 340) with statistical significance (Fig. 2b and Table 2, LRT p < 0.01). All these analyses based on different algorithms strongly suggest that IRF9 and its two sites Val129 and Val333 are under positive selection in the human lineage, though Val129 has stronger signal than Val 333.
We further assessed whether these two sites were conserved in non-human species (Fig. 2c and 2d). Among 26 species, site 129 had a conserved "S" amino acid in 23 species. This suggests that the ancestral state of this site is "S", and that it is highly conserved during the long-term evolution history in mammals (23/26 = 88.5%). In this light, only three species exhibited derived states (amino acid mutations) in this site: humans, megabat, and orangutan. The derived state in the human genome is "V", whereas in megabat is L, and orangutan P. As for site 333, 23/26 species had the "A" amino acid. Only two Old-world monkeys and humans had the amino acid "V". This suggests that "V" is the derived state, whereas "A" is the ancestral state. The ancestral states of these two sites were also confirmed with the Maximum likelihood (ML) inference based on packages of PAML [12] and MEGA [13] (Supplementary Fig. 1).
Table 1
The two sites of the human IRF9 detected to be under positive selection by the Naive Empirical Bayes (NEB) analysis.
Codon
|
Amino acid
|
Location in protein
|
Location in genome
|
GTA
|
V
|
129
|
chr14:24163398–24163400
|
GTC
|
V
|
333
|
chr14:24165852–24165854
|
Table 2
The sites of IRF9 detected to be under positive selection by the FEL algorithm. Note: Site 340 corresponds to the site 333 in the branch-site model of PAML due to coordinate shift caused by alignment gap. α and β denote synonymous and non-synonymous rates
Codon
|
α
|
β
|
α = β
|
LRT
|
p value
|
class
|
129
|
1.148
|
383.28
|
2.201
|
16.854
|
0
|
Diversifying
|
340
|
0
|
116.021
|
0.649
|
10.083
|
0.0015
|
Diversifying
|
119
|
1.422
|
86.249
|
2.345
|
5.483
|
0.0192
|
Diversifying
|
377
|
10000
|
0
|
53.61
|
4.645
|
0.0312
|
Purifying
|
The structure of IRF9 in humans and rats
To examine whether the amino acid substitutions may have caused a change in the structure, we compared both secondary and tertiary structures between human and rat orthologues. Based on the prediction of an online tool PSIPRED and the machine learning software Alphafold2, we revealed that, despite the above-mentioned changes in the conserved sites, the secondary and tertiary structures did not change between the human and rat orthologues (Fig. 3). The site Val129 in humans and its orthologous site Ser129 in rats were both located in the coil region. In addition, the site Val333 in humans and its orthologous site in rats also shared the structure of the helix.
The secondary structural similarities of the IRFs gene family
The gene family of IRFs has been investigated extensively. For example, IRF3 is one of the most well-characterized transcription factors involving innate immune responses [14]. The IAD domain was reported to be conserved in all IRFs, except IRF1 and IRF2 [15]. Here, we tried to understand the structural differences between the homologous proteins within the IRFs family with a focus on positively selected sites in IRF9. The homologous alignment revealed that common features of IRF structures comprised DBD (DNA Binding Region), IAD (IRF Associated Domain), and the linker region (Fig. 4a). Alignment of homologous genes revealed that Val129 can be aligned to the amino acid "S" within IRF3 (Fig. 4b). DBD can bind to its interferon-stimulated response element (ISRE), while IAD is responsible for binding with the signal transducer and activator of transcription 2 (STAT2) [16, 17]. The human Val129 lies within the linker region between DBD and IAD, while the human Val333 is located within the IAD region (Fig. 5a). The alignment similarities for these homologous proteins were higher in DBD and IAD regions than in the linker region (Fig. 5b), suggesting that the linker region, comprised of the random coil, might be the hotspot for positive selection due to its comparative abundance in amino acid differences among homologous genes.