Amplification of CRPs through whole genome and local tandem duplication
We collected a total of 240 plant genomes from Phytozome, Refseq, and other publicly available databases to identify CRPs. The plant species ranged from lower plants such as algae to higher plants i.e., seed plants, including diversified orders and families (Table S1). We obtained 80,953 CRP candidate genes using HMM and Cysmotif, based on a search for conserved protein domains and sequence homology, and we used these for our subsequent analysis (Table S2). An overview of the process used in our study is shown in Figure S1.
The distribution and counts of CRPs varied significantly among the 240 plant genomes (Fig. 1a). To determine if there is a correlation between the CRP copy number and plant genome size, we analyzed the linear regression of CRP copy number and genome size (Fig. 1b). We found that there is a positive correlation relationship – CRP copy number increases with genome size. The positive correlation relationship between CRPs and genome size suggested that CRPs were duplicated with whole genome duplication events. We hypothesize that whole genome duplication is the factor driving the expansion of the CRP gene family members.
Next, we analyzed the clustering of CRP genes in the plant genomes. We defined gene clusters as genes on the same chromosome with a distance between two CRP genes that is no greater than eight other non-CRP genes (Richly et al. 2002). Based on our definition, more than half of CRPs formed gene clusters, and gene clusters accounted for more than 50% of the total CRPs in more than 30% of the plant species (Table S3). Among them, all CRP genes in Monoraphidium neglectum, Triticum aestivum, and Humulus lupulus could be placed into gene clusters. Gene clusters form as a result of localized gene duplication – tandem duplication and translocation – typically followed by subsequent divergence. The tendency of CRPs to cluster on chromosomes is consistent with the evolution of CRPs by local tandem duplication.
We next assessed the homogeneity of these CRP clusters. We defined as “homogeneous” the genes belonging to the same gene cluster that were orthologous, while we defined as “heterogeneous” the genes belonging to the same gene cluster but not orthologous (classified in Methods, consistent with earlier definitions (Zhang et al. 2016). Here, we take A. thaliana as an example to show the position of homogeneous cluster on the chromosomes (Fig. S2). We calculated the proportion of homogeneous cluster in the CRP gene clusters; this demonstrated that more than a third of species have more homogeneous than heterogeneous clusters, while the CRPs of Trifolium pratense were completely homogeneous (Table S3). CRP genes of these species were composed of tandem duplications, while other species’ levels of heterogeneous genes were relatively high and represent translocations (Table S3). We conclude that the tendency of CRP genes to cluster on chromosomes shows that CRP duplication was influenced by local tandem duplication, and the pattern of CRP gene clusters on the chromosomes indicates that the duplication of CRPs reflects a complex evolutionary history.
CRP copy number variation among evolutionary branches and ecotypes
Statistical analysis of the distribution of CRPs in 240 plant genomes revealed that the copy number of CRPs varies highly among the species. These species were distributed across 78 families in 50 orders, and in 15 major branches (Fig. 1a, Table S1). A comparison of CRPs to a species tree indicated that CRPs exist in all 240 species, but their copy number per plant genome ranged from 13 to 2445. Notably, even in the same branch, the difference in the number of CRPs was significant; for example, in monocots, Thinopyrum intermedium had 2445 CRPs while Zostera marina had only 147 (Table S2).
The scattered distribution of species with unusually high or low copy number of CRPs among the 240 plant species caused us to speculate that rapid CRP expansion or contraction occurred during evolution. First, across different evolutionary branches (Fig. 1a), the copy number of CRPs was highly diverse. For example, more than 76% of the species in the algae had CRP gene copy numbers fewer than 100, while more than 92% of the species in the Malvids (a eudicot clade) had copy numbers greater than 200. The highest CRP copy number was found in the monocots, but six species of the top ten with the greatest copy number were in the Malvids. This shows that CRP copy number varies highly among different branches. Secondly, at the level of families, the number of CRPs also significantly differed. For better representation (the number of species > 3), we examined the distribution of CRPs among families, which revealed that the CRP copy number of all families significantly differed from each other (Fig. 2a). There was a significant difference in CRP gene copy numbers among different families after performing the Student’s t-test (Table S4). In addition, the copy number of CRPs varies substantially within the same family; for example, in the Poaceae, a 2- to 8-fold difference was recorded in gene copy number. The varied gene numbers suggested that CRP gene gain or loss occurs rapidly during speciation, which might serve to help plants respond to diversified environments.
We found that the diversity of CRPs was also reflected in the different ecotypes of plants and it was closely related to the environment in which they live. A previous study on NBS-LRR genes in angiosperms found that NBS-LRR gene reduction is associated with ecological specialization (Liu et al. 2021). In our current study, we speculated that there may be a relationship between the CRP copy number and different ecological phenotypes. We found that CRP distribution was correlated with three groups of different ecological phenotypes: herbaceous and woody species, land and aquatic species, and nodulating and non-nodulating species (Table S4). This shows that there is a substantial difference in the distribution of CRPs among each group of ecotypes. The average number of CRP genes in aquatic species was significantly lower than land species; herbaceous species had higher CRP copy numbers on average relative to woody species (Fig. 2b, Table S4). These results are consistent with the observation that the distribution of NBS-LRR genes in specialized lifestyles or environments was lower than common lifestyles or environments. Further, we hypothesized that the living environment of the species affects the CRP copy numbers.
Previous studies have shown that CRPs have an antimicrobial function, can defend against the invasion of pathogens, and can respond quickly to changing living conditions (Terras et al. 1995). To investigate our hypothesis that CRP copy numbers are affected by the living environment with different ecotypes, defensin-like genes were selected as representative relevant disease-resistance gene subfamily. We analyzed the number of these subfamilies of CRPs which displayed extremely significant differences in ecotypic groups (Fig. 2c) as verified by Student’s t-test (Table S4). The observed differences in CRP copy number between ecological conditions is consistent with the hypothesis that CRPs may have evolved to better cope with threats posed by different environments.
Conserved and specific CRP families involved in antimicrobial activities
To examine additional characteristics of CRPs, we analyzed the presence/absence of CRPs in each species by clustering CRPs into gene families (see Methods). We found that the highly conserved gene families accounted for only 0.74% of the total (Fig. 3a). Here, we took the highly conserved gene family of annotated chitin recognition protein as an example (Fig. 3b). These chitin recognition proteins were found in most species and only missing in 15 species of algae and 9 other species (Fig. 3b, Table S5). Chitin recognition proteins are the basis for recognition of fungi and resisting them via induction of a series of immune responses. This demonstrated that these families of chitin recognition protein are highly conserved.
On the other hand, the species-specific gene families accounted for 22.1% of total (Fig. 3a). For example, we observed a striking peak in the distribution of NCRs and found that most NCRs were present in M. truncatula, yet only a few presents in other species, further verifying that NCRs are species-specific (Fig. 3b, Table S5). NCRs are responsible for the control of bacteroid differentiation in IRLC legumes (Velde et al. 2010; Alunni and Gourion 2016), and maintaining the working balance during nitrogen-fixing symbiosis (Wang et al. 2010; Pan and Wang 2017). These results indicated that CRPs also contributed to species-specific plant-microbial interaction.
Despite genomic variation of CRPs, in particular defensins, their three-dimensional structures are relatively conserved, including a stable backbone structure formed by the connection of an α helix with three antiparallel β-strands through four conserved disulfide bonds (Kovaleva et al. 2020). Conserved cysteines form disulfide bonds and play an important role in maintaining stable, three-dimensional protein structures. We identified conserved cysteines in CRP subfamilies (schematic diagram in Fig. S3a), and explored the evolutionary pattern of conserved cysteines to infer the evolution of CRPs. We found that conserved cysteines were prevalent in CRP subfamilies. Further, we calculated the number of other amino acids between two conserved cysteines; we found that a distance of three amino acids was the highest frequency, followed by one and two amino acids (Fig. S3b). The frequency of the distance between cysteines implies that conserved cysteines are concentrated and highly localized.
The unique bi-domain CRPs generated through un-equal crossover event
Some CRPs showed unique characteristics of having two identical cysteine-rich domains; we investigated them. As shown in Figure 4A, there was one sequence (Medtr8g012775) that encoded two similar domains on chromosome 8, and another sequence (Medtr8g012795) that encoded the corresponding single domain elsewhere on chromosome 8. In the sequence with the bi-domain, exon-2 and its preceding sequence were duplicated to exon-3 and its preceding sequence and formed the bi-domain (Fig. 4a). Correspondingly, we predict the molecular mechanism by which genes encoding bi-domain proteins are formed is that homologous chromosomes were cross-exchanged which caused the single domain to be transformed into a bi-domain. In this process, the functional efficacy of the bi-domain may have become stronger than the single domain. A recent study demonstrated that the bi-domain MtDef5 has more potent antifungal activity against the fungal pathogen Botrytis cinerea than its two single domains, MtDef5A and MtDef5B (Islam et al. 2017). This suggests that, in at least this case, the bi-domain has stronger antifungal potential or efficacy than the single domain.
The linker peptide between two single domains may contribute to the potent antifungal activity of the bi-domain MtDef5 (Islam et al. 2017). We were curious about the origin of the linker that connects the two single domains. Thus, we aligned the linker sequence to the other sequences in this species, and found the linker matched the sequence on chromosome 8. This indicated that the linker sequences might be formed by homologous repair during the unequal cross-over of two single domains.
This provides a unique opportunity to explore the evolution of the bi-domain CRP proteins. Based on the sequence characteristics of MtDef5, we also identified additional distinct bi-domain proteins in other CRP sequences, via pfam annotation (Fig. 4b). The genes encoding these bi-domain proteins were derived from homologs encoding single domain proteins in each of the genomes in which they were found. The identification of some bi-domain proteins in Figure 4b suggests that they did not occur by accident in one particular species, but is widespread among plant species and might result from regular events over the course of evolution. It is interesting to hypothesize whether increasing the number of CRPs with bi-domains in plants could be used as a novel and more effective antimicrobial measure. Alternatively, another form of bi-domain proteins may result from the fusion of genes encoding different single domains. Perhaps these gene fusion events increase the efficiency or efficacy of interactions that would otherwise have to occur between separate molecules.