Evolution of antimicrobial cysteine-rich peptides in plants

We analyzed the evolutionary pattern of cysteine-rich peptides (CRPs) to infer the relationship between CRP copy number and plant ecotype, and the origin of bi-domains CRPs. Plants produce cysteine-rich peptides (CRPs) that have long-lasting broad-spectrum antimicrobial activity to protect themselves from various groups of pathogens. We analyzed 240 plant genomes, ranging from algae to eudicots, and discovered that CRPs are widely distributed in plants. Our comparative genomics results revealed that CRP genes have been amplified through both whole genome and local tandem duplication. The copy number of these genes varied significantly across lineages and was associated with the plant ecotype. This may be due to their resistance to changing pathogenic environments. The conserved and lineage-specific CRP families contribute to diverse antimicrobial activities. Furthermore, we investigated the unique bi-domain CRPs that result from unequal crossover events. Our findings provide a unique evolutionary perspective on CRPs and insights into their antimicrobial and symbiosis characteristics.


Introduction
As sessile organisms, plants are often exposed to diverse pathogenic microorganisms. It is crucial to understand the mechanisms of how plants defend against these biotic factors for both plant biotechnology and sustainable agriculture. Plants possess two layers of an innate immune system, known as pattern-triggered immunity (PTI) and effector-triggered immunity (ETI) (Jones and Dangl 2006). PTI is triggered by microbe-associated molecular patterns (MAMPs) that are detected by pattern-recognition receptors localized on the cell surface. In addition to PTI, ETI is activated by intracellular pathogen effector proteins that are specifically secreted into the host cell by pathogens and subsequently recognized by intracellular NOD-like receptor proteins, which include Toll/Interleukin-1 receptor (TIR) motifs (the TIR-NBS-LRRs) or coiled-coil (CC) domains at the N-terminus (the CC-NBS-LRRs) (Spoel and Dong 2012;Cui et al. 2015;Couto and Zipfel 2016;Yu et al. 2017). In recent years, studies have revealed the role of plant antimicrobial peptides (AMPs) as endogenous factors in plant immune response (Nawrot et al. 2014;Tam et al. 2015;Srivastava et al. 2021 AMPs are small proteins that have potent antibacterial, antiviral, and antifungal activities and are present in most multicellular eukaryotes. Both plant and animal species express dozens of distinct AMP genes. Plant AMPs are components of plant barrier defenses that exist in different tissues of diverse plants; exhibiting significant structural and functional diversity (Nawrot et al. 2014;Srivastava et al. 2021). Plant-derived AMPs exhibit broad-spectrum antimicrobial activity that modulates the innate immune system of different life forms such as plant and animal pathogens, protozoans, and insects; AMPs can even function against cancer cells (Campos et al. 2018;Srivastava et al. 2021). Based on their amino acid sequence identity, number of cysteine residues and their spacing, AMPs can be classified into different groups (Lay and Anderson 2005). AMPs rich in cysteine residues form multiple disulfide bridges and thus are protected from chemical and proteolytic degradation (Haag et al. 2012). These cysteine-rich peptides (CRPs), namely defensins, comprise one of largest families of AMPs (Kovaleva et al. 2020). The conserved N-terminal region includes a secretion peptide signal that guides CRPs to specific destinations through the secretory pathway, and the C-terminal region which includes a cysteine-rich domain (Zhu et al. 2005;Silverstein et al. 2007;Marshall et al. 2011;Tam et al. 2015;Srivastava et al. 2021). Based on amino acid sequence homology, small variable CRPs have mostly been classified as nodule-specific cysteine-rich (NCR) peptides (Scheres et al. 1990), defensins (Florack and Stiekema 1994;Broekaert et al. 1995), lipid transfer proteins Molina et al. 1993), hevein-like peptides (Van Parijs et al. 1991), knottin-type peptides (Cammue et al. 1992), snakins family members (Berrocal Lobo et al. 2002), and others (Silverstein et al. 2007;Hammami et al. 2009;Srivastava et al. 2021).
CRPs are not highly conserved due to variations in their copy number and domain arrangement among closely related species. For instance, NCRs have been found to drive the terminal differentiation of rhizobia into nitrogen-fixing bacteroids in the nodules of inverted repeat-lacking clade (IRLC) legumes Velde et al. 2010;Czernic et al. 2015;Alunni and Gourion 2016). Medicago truncatula contains approximately 700 NCR peptides, whereas only seven have been identified in Glycyrrhiza uralensis (Montiel et al. 2017;Roy et al. 2020). Another study has identified a bi-domain defensin, MtDef5, and several domains associated with their function (Islam et al. 2017). The bidomain MtDef5, containing two domains and exhibits more potent antifungal activity than single-domain MtDef5A or MtDef5B, indicating that the linker peptide APKKVEP contributes to the potent antifungal activity (Islam et al. 2017). However, the evolutionary history underlaying these complex CRPs remains unknown. Recent developments in comparative genomics have improved our understanding of CRPs and suggest that they are worthy of increased attention. In this study, we used HMMER and Cysmotif to search for CRPs in the genomes of 240 plant species, which we then clustered, annotated, and analyzed the evolution of the CRPs. Our study highlights the pattern of cysteines in the identified CRPs and explores the potential value of AMPs in future research.

Data used in this study
The data for 240 plant genomes were downloaded from Phytozome, Refseq, and other publicly available databases (Table S1). The branch, order, and family of the 240 plant genomes are referenced from the APG IV system (Chase et al. 2016), published plant genomes (https:// plabi pd. de/ plant_ genom es_ pa. ep) and PlantRep (Luo et al. 2022). The species tree is mainly referenced from the TimeTree of Life (Kumar et al. 2022), phyloT v2, which is based on NCBI or GTD taxonomy (Schoch et al. 2020), and the literature (Sanjur et al. 2002;Knapp et al. 2004;Lim et al. 2007;Mahelka et al. 2011;Arias and Pires 2012;Yang et al. 2013;Takahashi et al. 2016;Liu et al. 2020;Zhuang et al. 2020). The ecotypes of species were divided into herbaceous and woody species, land and aquatic species, and nodulating and nonnodulating species (Table S1).

Orthologous/paralogous groups identification, classification, and annotation
We used OrthoFinder Kelly 2015, 2019) to identify orthologous/paralogous groups, with default parameters based on protein similarity. We defined these groups as gene families. Then, we classified these families into three categories by R scripts: "highly conserved" families are those present in more than 80% of species in each evolutionary branch; "species-specific" families are those present only in a particular species not in other species; and others. The identified CRPs were functionally annotated using eggnogmapper (Cantalapiedra et al. 2021) (Table S5), then the annotation of highly conserved and species-specific families was done using shell script.

Definition and classification of CRP gene clusters
The CRP gene clusters in each family were calculated using customized Perl scripts. In one species, if two CRP genes were separated by less than or equal to eight other genes, we considered them to be located in the same CRP gene cluster (Richly et al. 2002). If a gene cluster contained CRP genes from multiple families, we defined it as a heterogeneous CRP cluster. If a cluster contained CRP genes only from one family, we defined it as a homogeneous CRP cluster. The location of homogeneous CRP clusters on the chromosome of Arabidopsis thaliana was shown with the assistance of TBtools (Chen et al. 2020).

Identification of patterns of cysteines
To accurately identify patterns of cysteines in the CRP proteins, the CRP families were clustered into smaller groups by BLASTP and mcl (van Dongen 2000). Next, we aligned them within the small group using mafft (Rozewicki et al. 2019). We masked the non-cysteine positions and generated the consensus sequence of each alignment with customized scripts (schematic diagram in Fig. S3a). In addition, we calculated the distance between two conserved cysteines and counted their frequencies using custom Perl script.

Identification of bi-domain CRPs
For each protein sequence, we used an HMM search (Johnson et al. 2010) to identify the CRP domain and select the high quality or 'credible' sequences containing CRP domains-those that had an alignment proportion greater than 80% with the reference CRP domain. Then, we defined the credible sequences with two similar domains as "bi-domain" and the credible sequences with different domains as "single domain". In addition, the identified bidomains were aligned to single domains in the same species to predict the formation of bi-domains.

Amplification of CRPs through whole genome and local tandem duplication
We collected a total of 240 plant genomes from various publicly available databases including Phytozome, Refseq, and others to identify CRPs. These plant species varied from lower plants such as algae to higher plants such as, seed plants, including diverse orders and families (Table S1). We obtained 80,953 CRP candidate genes using HMMER and Cysmotif, based on a search for conserved protein domains and sequence homology, and we used these for our subsequent analysis (Table S2). Meanwhile, we validated the accuracy of our CRP identification by examining their expression patterns in transcriptome data of maize, rice, soybean, and cotton . A flowchart depicting the process used in our study illustrated in Figure S1.
The distribution and count of CRPs significantly varied among the 240 plant genomes (Fig. 1a). To assess if correlation existed between the CRP copy number and plant genome size, we performed linear regression analysis of CRP copy number and genome size (Fig. 1b). Our results revealed that there is a positive correlation relationship existed, indicating that CRP copy number increased with genome size. This positive correlation relationship between CRPs and genome size suggested that CRPs were duplicated during whole-genome duplication events. Thus, we hypothesize that whole-genome duplication is the driving factor behind the expansion of the CRP gene family members.
Next, we investigated the clustering of CRP genes in the plant genomes. We defined gene clusters as genes on the same chromosome with a distance between two CRP genes that is no greater than eight other non-CRP genes (Richly et al. 2002). Based on our definition, more than half of the CRPs formed gene clusters, and gene clusters accounted for more than 50% of the total CRPs in over than 30% of the plant species (Table S3). Among them, all CRP genes in Monoraphidium neglectum, Triticum aestivum, and Humulus lupulus could be placed into gene clusters. Gene clusters are formed due to localized gene duplication-tandem duplication and translocation-typically followed by subsequent divergence. The tendency of CRPs to cluster on chromosomes is consistent with the evolution of CRPs by local tandem duplication.
We proceeded to evaluate the homogeneity of these CRP clusters. We designated genes in the same gene cluster as "homogeneous" if they were orthologous, while we classified genes in the same gene cluster but not orthologous as "heterogeneous" as defined in Methods section and consistent with previous study (Zhang et al. 2016). We used A. thaliana as an example to illustrate the location of homogeneous clusters on chromosomes (Fig. S2). We calculated the proportion of homogeneous clusters in the CRP gene clusters and found that over a third of species had more homogeneous than heterogeneous clusters, while the CRPs of Trifolium pratense were entirely homogeneous (Table S3). The CRP genes in these species consisted of tandem duplications, whereas other species had relatively high levels of heterogeneous genes that represented translocations (Table S3). We conclude that the tendency of CRP genes to cluster on chromosomes indicates that CRP duplication was influenced by local tandem duplication, and the pattern of CRP gene clusters on the chromosomes suggests that the duplication of CRPs reflects a complex evolutionary history.

CRP copy number variation among evolutionary branches and ecotypes
The statistical analysis conducted on the distribution of CRPs in 240 plant genomes revealed a high variation in the copy number of CRPs varied highly among the species. These species were distributed across 78 families in 50 orders, and in 15 major branches (Fig. 1a, Table S1).
Although CRPs were present in all 240 species, but their copy number per plant genome ranged from 13 to 2445. It is noteworthy that even in the same branch, there was a significant difference in the number of CRPs was significant. For instance, in monocots, Thinopyrum intermedium had 2445 CRPs while Zostera marina had only 147 (Table S2). The scattered distribution of species with unusually high or low copy numbers of CRPs among the 240 plant species led us to speculate that there was a rapid expansion or contraction of CRP occurred during evolution. Firstly, the copy number of CRPs was highly diverse across different evolutionary branches (Fig. 1). For example, more than 76% of the species in the algae had CRP gene copy numbers fewer than 100, while more than 92% of the species in the Malvids (a eudicot clade) had copy numbers greater than 200. The highest CRP copy number was found in the monocots, but six species of the top ten with the greatest copy number were in the Malvids. This implies that CRP copy numbers vary significantly among different branches. Secondly, at the family level of, the number of CRPs also significantly differed. For better representation (the number of species > 3), we examined the distribution of CRPs among families and found that the CRP copy number of all families significantly differed from each other (Fig. 2a). We also conducted Student's t-tests and found a significant difference in CRP gene copy numbers among different families (Table S4). Moreover, the copy number of CRPs varied substantially within the same family. For example, in the Poaceae, a 2-to 8-fold difference was recorded in gene copy number. These variations suggest that CRP gene gain or loss occurs rapidly during speciation and might help plants respond to diversified environments.
We observed that the variability in CRPs was associated with different plants ecotypes, which in turn correlated with their environment in which they live. Previous study on NBS-LRR genes in angiosperms suggested that NBS-LRR gene reduction is associated with ecological specialization (Liu et al. 2021). In our current study, we hypothesized that the copy number of CRPs might also be related to different ecological phenotypes. We found that the distribution of CRP was correlated with three different groups of ecological phenotypes: herbaceous and woody species, land and aquatic species, and nodulating and non-nodulating species (Table S4). These results showed that significant variations in the distribution of CRPs among each group of ecotypes. The average number of CRP genes in aquatic species was significantly lower than land species; herbaceous species had higher CRP copy numbers on average relative to woody species (Fig. 2b, Table S4). These results are consistent with the observation that the distribution of NBS-LRR genes in aquatic and land species, nodulating and non-nodulating species). c The violin plot of defensin-like subfamily in different ecotype groups specialized lifestyles or environments is lower than in common lifestyles or environments. After the colonization of land, plants faced more complex pathogenic environments, which might explain why the average CRP copy numbers are higher in land species (Qiu et al. 2012;Delaux and Schornack 2021;Liu et al. 2021). We further hypothesized that the living environment of the species affects the CRP copy numbers.
Previous studies have shown that CRPs have an antimicrobial function and can defend against the invasion of pathogens, and can respond quickly to changing living conditions . To investigate our hypothesis that CRP copy numbers are affected by the living environment with different ecotypes, defensin-like genes were selected as representative relevant disease-resistance gene subfamily. We analyzed the number of these subfamilies of CRPs which showed extremely significant differences in ecotypic groups (Fig. 2c) as confirmed by Student's t test (Table S4). These results strengthen the hypothesis that CRPs have evolved to cope better with threats posed by diverse environments.

Conserved and specific CRP families involved in antimicrobial activities
To investigate the diversity of CRPs, we analyzed their presence/absence of CRPs in each species by clustering them into gene families (see "Methods"). Our analysis revealed that the highly conserved gene families accounted for only 0.74% of all families (Fig. 3a). For example, the chitin recognition protein family, which is one of the highly conserved CRP families, was present in most species but absent in 15 species of algae and 9 other species (Fig. 3b, Table S5).
Chitin recognition proteins are involved in recognizing fungi and inducing immune responses to resist them. This finding suggests that a few CRP families maintain conserved defense mechanisms against pathogen across species.
On the other hand, the species-specific gene families accounted for 22.1% of total families (Fig. 3a). For example, we observed a striking peak in the distribution of NCRs and found that most NCRs were present in M. truncatula, yet only a few presents in other species, further verifying that NCRs are species-specific (Fig. 3b, Table S5). NCRs are responsible for the control of bacteroid differentiation in IRLC legumes (Velde et al. 2010;Wang et al. 2010). And recent reports have shown that NCRs play a negative role in inducing terminal differentiation of rhizobia and a positive role in maintaining the survive of rhizobia, which helps the host control the fate of bacteria and maintain a working balance during nitrogen-fixing symbiosis (Alunni and Gourion 2016;Pan and Wang 2017). These results suggested that CRPs also contributed to species-specific plant-microbial interaction.
Despite genomic variations in CRPs, especially defensins, their three-dimensional structures are relatively conserved, including a stable backbone structure formed by the connection of an α helix with three antiparallel β-strands through four conserved disulfide bonds (Kovaleva et al. 2020). The disulfide bonds, which are formed by conserved cysteines play critical roles in maintaining the proteins three-dimensional structure. We identified conserved cysteines in CRP subfamilies (schematic diagram in Fig. S3a) and explored the evolutionary pattern of conserved cysteines to infer the evolution of CRPs. Our analysis showed that conserved cysteines were prevalent in CRP subfamilies. Furthermore, Gene Count Gene Count Fig. 3 The classification of conservation and specificity of CRP. a The pie chart represents ratio of highly conserved gene families, species-specific gene families, and other. b The distribution of highly conserved chitin recognition proteins (above) and species-specific NCRs (below) among species from lower plants to higher plants we calculated the number of other amino acids between two conserved cysteines and found that a distance of three amino acids had the highest frequency, followed by one and two amino acids (Fig. S3b). The frequency of the distance between cysteines suggests that conserved cysteines are concentrated and highly localized.

The unique bi-domain CRPs generated through unequal crossover of single-domain CRP loci
We investigated some CRPs that displayed unique characteristics of having two identical cysteine-rich domains; we investigated them. As shown in Fig. 4a, there was one sequence (Medtr8g012775) that encoded two similar domains on chromosome 8, while another sequence (Medtr8g012795) that encoded the corresponding single domain elsewhere on the same chromosome 8. In the sequence with the bi-domain, exon-2 and its preceding sequence were duplicated to exon-3 and its preceding sequence to from the bi-domain (Fig. 4a). We predict that the molecular mechanism by which genes encoding bi-domain proteins are formed is that homologous chromosomes undergo cross-exchange, which transforms the single domain to be transformed into a bi-domain. In this process, the functional efficacy of the bi-domain may become stronger than the single domain. A recent study demonstrated that the bidomain MtDef5 has more potent antifungal activity against the fungal pathogen Botrytis cinerea than its two single domains, MtDef5A and MtDef5B (Islam et al. 2017). This suggests that, in at least this case, the bi-domain has stronger antifungal potential or efficacy than the single domain. The linker peptide between two single domains may contribute to the potent antifungal activity of the bi-domain MtDef5 (Islam et al. 2017). We were curious about the origin of the linker that connects the two single domains. Thus, we aligned the linker sequence with the other sequences in this species and found the linker matched the sequence on chromosome 8. This indicated that the linker sequences may be formed by homologous repair during the unequal crossover of two single domains.
This finding provides a unique opportunity to explore the evolution of the bi-domain CRP proteins. Based on the sequence characteristics of MtDef5, we also identified additional distinct bi-domain proteins in other CRP sequences, via pfam annotation (Fig. 4b). The genes encoding these bi-domain proteins were derived from homologs encoding single-domain proteins in each of the genomes in which they were found. The identification of some bi-domain proteins in Fig. 4b suggests that they did not occur by chance in one particular species, but are widespread among plant species and might result from regular events over the course of evolution. It is interesting to hypothesize whether increasing the number of CRPs with bi-domains in plants could be used as a novel and more effective antimicrobial measure. Alternatively, another form of bi-domain proteins may result from the fusion of genes encoding different single domains. Perhaps, these gene fusion events increase the efficiency or efficacy of interactions that would otherwise have to occur between separate molecules.

Discussion
With the increasing availability of sequenced plant genomes, our understanding of the evolution of CRPs in plants is improving. In this study, we constructed a dataset including 80,953 CRPs derived from 240 plant genomes (Table S1) to explore the evolutionary patterns of CRPs and the factors that influence their evolution.
As a polygenic family, CRPs have a variable distribution of copy numbers across plant genomes (Fig. 1a). Interestingly, there is a positive relationship between CRPs copy number and genome size (Fig. 1b), which suggests that CRP copy number increases as the size of plant genomes increase due to genome duplication size. It was speculated that the copy number of polygenic family will expand as the genome duplication as observed for CRPs.
CRPs play different roles in different species by their multiple subfamilies (Silverstein et al. 2007), in which NCRs are one of the prominent characteristics. NCRs play a key role in IRLC legumes and are indispensable peptides in the process of nodule formation and nitrogen fixation (Mergaert et al. 2003;Velde et al. 2010;Alunni and Gourion 2016). This is consistent with the fact that most of NCRs are distributed in M. truncatula in our study. Interestingly, it may be that the resulting balance between the effect of NCR peptides and the rhizobia's ability to resist NCR peptides is related to the content of NCR peptides (Wang et al. 2010;Pan and Wang 2017;Pan 2019). Symbiosis may fail if the concentration of NCR peptides becomes too low, but an overexpression of NCR peptides also caused bacterial death and led to the failure of symbiosis (Pan and Wang 2017). Therefore, only an optimal range and concentration of NCR peptides may give one partner some advantage while still maintaining a working symbiosis (Pan and Wang 2017).
Domains are considered the basic and evolutionary units of CRPs, and we identified multiple CRP family members that exist in the form of double or bi-domain proteins. It was reported that MtDef5, the cysteine-rich plant defensin in the genome of the model legume M. truncatula, is a unique bi-domain defensin (Islam et al. 2017;Velivelli et al. 2018). This 107-amino acid protein contains two domains, 50 amino acids each, linked by a short peptide APKKVEP (Islam et al. 2017  bi-domain of MtDef5 encoded by a single gene is a fusion of two recently-duplicated genes encoding single-domain defensins in the genomes of M. truncatula and M. sativa (Islam et al. 2017). In this study, we found the genomic imprint that MtDef5 might be formed by unequal crossover followed by homologous repair. According to the formation and evolution of bi-domain proteins, it is possible to explore the fusion of multiple functional domains onto one sequence to promote functional diversity.
The structure of CRP is composed of domain units, but its stability is maintained by the presence of disulfide bonds that maintains its stability. Disulfide bonds play an important role in protein folding and structural stability by creating covalent bonds between cysteine pairs in protein structures (Matsumura et al. 1989). We hypothesize that conserved cysteines are more likely to form disulfide bonds. In future studies, more sophisticated algorithms are needed to predict disulfide bonds and map the predicted disulfide bonds to the conserved cysteines in the peptide. This will provide further insight into the evolution of disulfide bonds and the conserved cysteines in CRP.
In conclusion, we have constructed a dataset containing 240 plant genomes providing a resource to explore the evolution of CRPs. Using this dataset, we have found that the complex evolution of the whole-genome duplication and local tandem duplication of CRP. The copy number of CRPs varies widely among evolutionary clades and ecotypes, which may be related to the pathogenic environment in which the plant is located. Different types of CRP subfamilies drive high conserved/specific or other antimicrobial activities. Furthermore, we have investigated the cysteine pattern of CRPs to explore the potential value of conserved cysteine for CRPs. It is conceivable that with the more comprehensive understanding of the functions of CRPs, we may be able to develop new environmentally friendly pesticides for agriculture or even antibiotics for medicine.