Developing a new residual score matrix for CLE motifs
To predict CLE genes in plants, we developed a new residual score matrix for CLE motifs by integrating the amino acid substitution matrix, amino acid usage frequency matrix and site weights of the CLE motif.
The amino acid composition of the CLE motif was analyzed in 69 species at different levels that included total proteins, small proteins (≤ 200 residues in length) and CLE precursor proteins (Additional file 1: Figure S1). The amino acid composition of small proteins was similar to total proteins (Pearson Correlation Coefficient: 0.23, p = 4.4e-17). However, a higher frequency of particular amino acid residues was observed in CLE precursors. For example, the frequency of histidine (H) was 74% higher in CLE precursors than in total proteins (Additional file 1: Figure S1). The amino acid composition in different regions of CLE proteins was also analyzed (Additional file 1: Figure S1). The frequency of P and H in CLE motifs were both more than fourfold higher than in total proteins, which provides evidence that they are functionally important amino acids for peptide processing or for the recognition of peptides by receptors (Figure 1A). In contrast, some residues were very scarce in CLE motifs, such as the three aromatic amino acids (phenylalanine (F), tyrosine (Y) and tryptophan (W)) and the two sulfur-containing amino acids (cysteine (C) and methionine (M)) (Figure 1A; Additional file 1: Figure S1). This strong bias in amino acid composition encouraged us to try and build a CLE-specific score matrix.
We tested three commonly used substitution matrices—BLOSUM62, BLOSUM80 and PAM250—and found that the performance of each was similar (Additional file 2: Figure S2). Nevertheless, the BLOSUM80 matrix provided a slightly better resolution of the motif scores and therefore, was used to develop the new score matrix. For the amino acid usage frequency matrix, 1628 reported CLE genes from 57 species  were chosen as references (Additional file 12: Table S1). The amino acid usage frequency at each site of the CLE motif was calculated as a percentage (Additional file 3: Figure S3) and was represented as a Weblogo sequence (Figure 1D). The conservativities of each site were largely different, as previously reported . Sites with higher conservativity were considered to hold higher weight in our new score matrix. Some sites contained two dominant residues, such as the 8th site (50.03% N and 47.06% D) and the 12th site (66.49% N and 31.38% H). Based on the modified method for evaluating site weight, the weight of the 12th site was set at 1.00, and the 1st, 6th-9th, and 11th sites had weights no less than 0.70 (Figure 1B).
In the new CLE score matrix, each residue of a candidate CLE motif made a contribution to its total score. Residues at the conserved sites contributed more than those at less conserved sites. Dominant residues contributed more than scarce residues. For example, the proline at the 9th site alone had a score of 6.62, which made the most striking contribution to the score matrix (Figure 1C). It is worth mentioning that the combination of 12 residues with the highest frequency at each site was “RLVPSGPNPLHN”, found in CLE9/10 in A. thaliana. The combination of residues with the highest score was “RRVPSGPNPLHN”. The total motif score was 38.00.
Machine learning aided the prediction of CLE genes in plants
In addition to the CLE motif score, we also included the protein length, motif position and signal peptide score to predict CLE genes in 69 plant species. Three machine learning algorithms, C4.5, ANN and SVM, were employed to categorize all candidate genes into CLE genes or non-CLE genes using the reported CLE genes in the training data set. Our analysis of the training data set, based on 53 species, yielded 1709 CLE candidate genes, including the 1529 genes that were predicted using the Hidden Markov Model (HMM) (Goad et al 2017) and 180 novel genes. All three machine learning algorithms supported 1475 (96.5%) of the reported CLEs and 106 (58.9%) of the novel CLEs. In total, 94 (5.5%) of the candidate genes were supported by only one algorithm (Figure 1E). Additionally, machine learning aided in the prediction of 447 novel CLE candidates from the 16 species in the testing data set. Therefore, our method identified a total of 2156 CLE candidates in 69 species (Additional file 12: Table S1).
A new CLE classification method based on the Euclidean distance of CLE motifs in a site-weight dependent manner
To group the 2156 CLE motifs, the Euclidean distances (d) between each CLE candidate and the 32 Arabidopsis CLE motifs (AtCLEs) were calculated (Figure 2 and Figure 3). Motifs from the top 5% maximum d to AtCLEs were classified into Group “Others”. The rest of the CLE motifs were classified with their closest AtCLE. Consequently, all of the CLE motifs were grouped into six groups, Group1-5 and Others. As a comparison, phylogenetic trees constructed using the A. thaliana CLE motifs (Figure 2A), CLE proteins without signal peptides (Figure 2B) and log-normalized rank of all-vs-all BLAST e-values of full-length CLE proteins (Figure 2C) were constructed using the NJ method, as previously described [7, 59]. The new AtCLE clustering using the HCL method was based on the Euclidean distance between each pair of AtCLE motifs (Figure 2D). The clustering results were similar to the third phylogenetic tree, except for AtCLE8, AtCLE40 and AtCLE43 (Figure 2 and Additional file 13: Table S2). In Group3, the AtCLE8 and AtCLE12 motifs, which are “RRVPTGPNPLHH” and “RRVPSGPNPLHH”, respectively, share high sequence similarity. However, clustering of AtCLE40 and AtCLE43 was not consistent among the four methods (Figure 2D). To determine the reasons for these discrepancies, Weblogos of the appropriate subgroups (Group5A and Group5B) were created. Both subgroups were less conserved relative to other subgroups (Additional file 4: Figure S4 and Additional file 14: Table S3).
A cluster tree of all CLE candidates in 69 species was drawn that includes a heatmap indicating the Euclidean distance between the CLE motifs (Figure 3). The heatmap demonstrated that the CLE candidates in Group5 and Group “Others” have a higher diversity in residual composition. Based on the cluster tree, the 26 AtCLE subgroups were then combined into 11 subgroups. Weblogos of the final 12 subgroups illustrated the importance of “heavy-weight” sites in the classification of CLE motifs (e.g., the 1st and 8th sites (Additional file 4: Figure S4)). Analysis of tandem CLE genes revealed that Group1 had the highest rate of tandem genes. Besides, candidates from monocots seemed to form clusters with other monocots, and candidates from dicots seemed to form clusters with other dicots. These data indicate a strong specificity among the monocot CLE motifs and the dicot CLE motifs (Figure 3). Statistical analysis of the different types of CLE motifs showed that monocots and dicots share very few CLE motifs (18 out of the 474 CLE motifs in dicots). Furthermore, there was no common TDIF/TDIF-like motif shared between monocot and dicot species, probably due to the evolution of distinct vascular patterns in monocots and dicots (Additional file 15: Table S4).
Evolution of CLE genes in plants
To understand the evolution of CLE genes in plants, the number of CLE genes in each species was counted (Figure 4 and Additional file 5: Figure S5). Although three CLE genes had been detected in algae, including one in Dunaliella salina and two in Coccomyxa subellipsoidea, the algal CLE genes were atypical because of their low motif scores, low signal peptide scores and poor motif positions (Additional file 16: Table S5). In contrast to algae, there were nine typical CLE genes in Physcomitrella patens (Figure 4; Additional file 5: Figure S5 and Additional file 16: Table S5), 11 genes in Sphagnum fallax and eight genes in Marchantia polymorpha (Figure 4, Additional file 5: Figure S5). These data provide evidence that CLE genes originated in a bryophyte.
The numbers of annotated transcripts in the 62 species of land plants were largely different, ranging from 19287 (M. polymorpha) to 99386 (Triticum aestivum) (Additional file 5: Figure S5). The proportion of CLE genes in different species was not fixed, ranging from 0.015% (Vitis vinifera) to 0.204% (Phaseolus vulgaris). The mean proportion of CLE candidates in dicots was slightly higher than in monocots, which were 0.105% and 0.091%, respectively. Their proportions in the three Bryophytes and the pteridophyte (Selaginella moellendorffii) were 0.027%, 0.041%, 0.041% and 0.036%, respectively, in general lower than in the monocots and dicots.
To further investigate the evolution of CLE genes in different species, the number of CLE genes in each subgroup was counted in each species (Figure 4). CLE candidates appeared in fewer subgroups in lower plants. For example, the nine CLE candidates in P. patens were all presented in Group3B. Mapoly1011s0001.1 from M. polymorpha was the first candidate identified in Group4. Its motif “HKNPAGPNPIGN” shared high similarity with the CLE motif from Arabidopsis CLE46, a homolog of TDIF. Although none of the Group1 candidates were identified in the bryophytes, two CLE candidates from Group1 were identified in S. moellendorffii.
In addition, to the finding that CLE motifs are most frequently found in monocots and dicots, the total number of each motif was counted in monocots and/or dicots (Additional file 6: Figure S6 and Additional file 16: Table S5). Our results indicate that the most frequent CLE motif in monocots is “RRVRRGSDPIH”—the same as CLE45 in A. thaliana—and that the most frequent CLE motif in dicots is “HEVPSGPNPISN”—the same as CLE41/44 (TDIF) in A. thaliana. Monocots and dicots have a strong bias for particular CLE motifs. For example, although the TDIF motif appeared 83 times in dicots, the TDIF motif was not found in monocots. Only a TDIF-like motif “HEVPSGPNPDSN” appeared in monocots (Additional file 17: Table S6).
Statistics analysis of CLE precursor proteins
CLE peptides are derived from nonfunctional precursor proteins by removal of the N-terminal signal peptide from the latter and by enzymatic processing to yield the mature peptide . In order to get a better understanding of CLE protein evolution, various characteristics of different groups were analyzed, including CLE motif score, protein length, relative position of the CLE motif, length of the C-terminal tail, signal peptide scores, and correlations among the major variables of the score matrix (Figure 5, Additional file 3: Figure S3, Additional file 7: Figure S7 and Additional file 8: Figure S8).
Group 3 had the highest median CLE motif score, followed by the rest of the groups in the following order: Group 1, Group 2, Group 4, Group 5 and Group “Others” (Figure 5A and Additional file 18: Table S7). Although about 90% of the CLE precursor proteins contain 50-150 amino acid residues, most groups have more candidates containing 50 to 100 residues, except for Group 4 (Figure 5B and Additional file 9: Figure S9). Group 1 to 4, particularly Group 3, has higher values for the relative position of the CLE motifs (i.e., means closer to the C-terminal end). In contrast, the motif positions in Group 5 and “others” are more widely distributed (Figure 5C). Approximately two thirds of the candidates have a 0 to 2 residues tail on the C-terminus of the CLE motif. More than 50% of the candidates from both Group 1 and Group 3 did not have a C-terminal tail. The basic amino acids Arginine (R) and lysine (K) dominated at the first amino acid residue position in the short C-terminal tails (1-2 residues), except for the candidates from Group 1 (Figure 5D and Additional file 7: Figure S7). The presence of a signal peptide in the CLE precursors was predicted online using the SignalP/TargetP server and illustrated with a violin plot. Most genes in Group 1 have high signal peptide scores (Figure 5E, F). However, about two-thirds of the genes in Grp. 2B and Grp. 5A have SignalP scores lower than the cut-off value (Additional file 19: Table S8). In general, the lengths of the CLE precursor proteins in the bryophytes and S. moellendorffii are slightly longer than the average. Other variables, including the signalP score, motif position and the CLE motif score, were not significantly different between vascular and non-vascular plants (Additional file 8: Figure S8).
To determine how much each variable contributed to each CLE candidate, correlations between the five variables and the decision to define a candidate as a CLE were calculated (Figure 5G-5I). Motif score and motif position were decisive factors when the length of the CLE proteins was between 50 and 150 residues. Protein length was positively correlated with the decision when the candidates were shorter than 100 residues. However, the correlation was negative for the candidates between 100 and 150 residues in length (Figure 5G and 5H). For longer candidates (> 150 residues), the correlation between motif position and the decision was less. For these candidates, the motif score was the only decisive factor (Figure 5I). It is worth mentioning that the correlation between the signal peptide scores and the decision was less than expected. In addition, we analyzed the gene structures of the CLE candidates in A. thaliana and Zea mays. The results provide evidence that alternative splicing may allow particular CLE genes to concurrently encode proteins with or without the CLE motif (e.g., CLE46 (AT5G59305) and GRMZM5G875999) (Additional file 10: Figure S10).
Identification of new types of CLE genes
By applying our new approach, 5% (n = 136) of the CLE candidates that are more distantly related to the Arabidopsis CLEs were clustered into Group “others”. A total of 31 of these candidates were reported previously . Based on the clustering, a novel subgroup of candidates (n = 26) was identified, with an unusual “serine (S)” at the 12th site of the CLE motif (Figure 6). This subgroup could be further divided into three types, mainly based on the last three residues of their CLE motifs. All members of this subgroup were from monocots and dicots, consistent with their recent evolution.
There were three subsets of CLE candidates containing a motif similar to the Arabidopsis secreted peptide IDA “PIPPSAPSKRHN” , that we named the SVPP-type (n = 6), PVPP-type (n = 7) and RIPP-type (n = 8), respectively, based on their first four residues (Figure 6). Most of the IDA/IDL-like candidates were from the monocots and dicots, except for MA_9094901g0010 and AmTr_v1.0_scaffold00135.62 from M. polymorpha and Amborella trichopoda, respectively. By clustering the IDA/IDL-like candidates together with the Arabidopsis IDA/IDL motifs, using PIP/PIPL motifs  as the outgroup, we found that the SVPP- and PVPP-type motifs were grouped with the IDA/IDL family, while the RIPP-type motif was more closely related to the CLV3 motif (Figure 7A). All of the PVPP-type genes were predicted to encode a potential signaling peptide “PVPPSGPSPCHN” (Figure 7B).
In addition to the novel CLE candidates from Group “others”, small sets of novel candidates were identified in the major groups. The most common residues at the 1st site of a typical CLE motif are arginine (R) and histidine (H). However, candidates with an initial lysine (K) or tryptophan (W) residue in the CLE motif were identified. These K-type and W-type CLE motifs are the most closely related to CLE16 (Group 3C) and CLE45 (Group 2A), respectively (Additional file 11: Figure S11). The 11 K-type candidates were all from monocots. The 13 W-type candidates were exclusively found in dicots.