Background: The CLV3/ESR-RELATED (CLE) gene family encodes small secreted peptides (SSPs) and plays vital roles in plant growth and development by promoting cell-to-cell communication. Predicting and classifying CLE genes is challenging because of their low sequence similarity.
Results: We developed a machine learning-aided method for predicting CLE genes that uses a CLE motif-specific residual score matrix and a novel clustering method based on the Euclidean distance between the 12 amino acid residues of the CLE motif, computed in a site-weight dependent manner. In total, 2156 CLE candidates, including 627 novel candidates, were predicted from 69 plant species. The results of our CLE motif-based clustering are consistent with previous reports that used the entire pre-propeptide. Characterization of the CLE candidates provided systematic statistics on protein lengths, signal peptides, relative motif positions, amino acid compositions of different parts of the CLE precursor proteins, and decisive factors of CLE prediction. The approach also provides information on the evolution of the CLE gene family and evidence that the CLE and IDA/IDL genes share a common ancestor.
Conclusions: Our new approach is applicable to SSPs and other proteins with short conserved domains, and hence provides a useful tool for gene prediction, classification and evolutionary analysis.
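As a rough illustration of what a site-weighted Euclidean distance between two 12-residue motifs could look like, the following Python sketch may help. It is not the authors' implementation: the one-hot residue encoding, the function names `one_hot` and `weighted_distance`, the example 12-mers and the weight values are all assumptions made for this example, and the abstract does not state how residues are encoded or how site weights are derived.

```python
import math

# Assumption: the 20 standard amino acids, one-hot encoded.
# The abstract does not specify the residue encoding actually used.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MOTIF_LENGTH = 12  # the CLE motif length given in the abstract


def one_hot(residue):
    """Encode one residue as a 20-dimensional 0/1 vector (all zeros if unknown)."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def weighted_distance(motif_a, motif_b, site_weights):
    """Euclidean distance between two motifs with a per-site weight.

    Each site contributes weight * squared distance between the encoded
    residues, so highly weighted sites dominate the result.
    """
    if not (len(motif_a) == len(motif_b) == len(site_weights) == MOTIF_LENGTH):
        raise ValueError("motifs and weights must all have length 12")
    total = 0.0
    for res_a, res_b, weight in zip(motif_a, motif_b, site_weights):
        vec_a, vec_b = one_hot(res_a), one_hot(res_b)
        total += weight * sum((x - y) ** 2 for x, y in zip(vec_a, vec_b))
    return math.sqrt(total)


if __name__ == "__main__":
    # Hypothetical 12-mers and weights, chosen only to exercise the function.
    motif_1 = "RTVPSGPDPLHH"
    motif_2 = "RQVPTGSDPLHH"
    uniform = [1.0] * MOTIF_LENGTH
    emphasised = [2.0 if i in (0, 3, 5) else 1.0 for i in range(MOTIF_LENGTH)]
    print(weighted_distance(motif_1, motif_2, uniform))
    print(weighted_distance(motif_1, motif_2, emphasised))
```

A pairwise matrix of such distances could then be passed to any standard clustering routine. With one-hot encoding, a mismatch at a site adds twice that site's weight to the squared distance, so the weights directly control how much each motif position influences the grouping.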

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7
This is a list of supplementary files associated with this preprint.
Additional file 1: Figure S1. Amino acid composition of all proteins, all small proteins, CLE precursors, CLE signal peptides, CLE variable regions and CLE motifs in 69 species.
Additional file 2: Figure S2. Comparison of three amino acid substitution matrices for evaluating CLE motif scores.
Additional file 3: Figure S3. Amino acid usage frequency of CLE motifs.
Additional file 4: Figure S4. Weblogo representation of CLE motifs in each group or subgroup of the CLE gene family in plants.
Additional file 5: Figure S5. Number and proportion of CLE genes in the genomes of 69 plant species.
Additional file 6: Figure S6. Top ten most frequently used CLE motifs in 69 species.
Additional file 7: Figure S7. Statistical analysis of the lengths of the C-terminal tails of CLE candidates.
Additional file 8: Figure S8. Statistical analysis of protein lengths, SignalP scores, motif positions and CLE motif scores of CLE candidates from 69 species.
Additional file 9: Figure S9. Distribution of protein lengths of CLE precursors in the range of 50-150 amino acid residues.
Additional file 10: Figure S10. Gene structure of CLE candidates regulated by alternative splicing in A. thaliana and Z. mays.
Additional file 11: Figure S11. K-type and W-type CLE candidates in plants.
Additional file 12: Table S1. Predicted CLE candidates in 69 plant species.
Additional file 13: Table S2. Comparison of group information on Arabidopsis CLE genes from the literature and from this study.
Additional file 14: Table S3. Comparison of E-values for evaluating the grouping of CLE motifs from Goad et al. [7] and from the site-weight based method described in this study.
Additional file 15: Table S4. Number of types of CLE motifs in 69 plant species.
Additional file 16: Table S5. CLE candidates in algae and bryophytes.
Additional file 17: Table S6. Top 10 most frequently used CLE motifs in monocots and dicots.
Additional file 18: Table S7. Number of CLE genes lying above the CLE motif scores of the corresponding quantiles.
Additional file 19: Table S8. Number of CLE genes lying above the threshold of the SignalP score.
Additional file 20: Table S9. CLE motifs used for finding the optimal amino acid substitution matrix.
Posted 12 Oct, 2020
On 01 Oct, 2020
On 29 Sep, 2020
Received 16 Sep, 2020
On 15 Sep, 2020
On 14 Sep, 2020
Invitations sent on 14 Sep, 2020
On 13 Sep, 2020
On 13 Sep, 2020
On 21 Aug, 2020
Received 17 Aug, 2020
On 03 Aug, 2020
Received 08 Jun, 2020
On 09 Feb, 2020
Invitations sent on 06 Feb, 2020
On 14 Jan, 2020
On 13 Jan, 2020
On 13 Jan, 2020
On 11 Jan, 2020