In this study, a web-based method is proposed to classify plant-derived peptides by antimicrobial, antibacterial, antifungal, or antiviral activity. To select an appropriate feature for classification, performance was compared across eight feature encodings: AAC (20D), DPC (400D), physicochemical (74D), composition (21D), transition (21D), distribution (105D), CTD (147D), and hybrid (641D). A second aim was to determine the best-performing ML algorithm among four algorithms commonly used for supervised classification problems. Finally, to help users classify their query sequences with ease, PTPAMP is provided as a free prediction server for plant-derived antimicrobial peptides. The complete workflow of this study, from dataset collection to development of the prediction server, is summarized in Fig. 1.
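To make the dimensions of the two compositional encodings concrete, the following minimal sketch computes AAC (20 values) and DPC (400 values) for one peptide. It assumes the encodings are plain fractions over the 20 standard residues; the exact normalization used for PTPAMP (for example, percentages) is not stated here, and the function names are illustrative only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(sequence):
    """Amino acid composition: fraction of each residue (20 values)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dpc(sequence):
    """Dipeptide composition: fraction of each adjacent pair (400 values)."""
    seq = sequence.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    # Overlapping adjacent pairs: a peptide of length n has n - 1 dipeptides.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(a + b) / len(pairs)
            for a, b in product(AMINO_ACIDS, repeat=2)]


# Ep-AMP1 sequence as listed in Table 3 (standard residues only).
peptide = "CVLIGQRCDNDRGPRCCSGQGNCVPLPFLGGVCAV"
aac_vector = aac(peptide)
dpc_vector = dpc(peptide)
print(len(aac_vector), len(dpc_vector))
```

Non-standard residues would simply be ignored by this sketch, so the fractions would then sum to less than one.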
3.1 Performance comparison of used feature spaces
To evaluate how accurately each feature predicts AMPs, five-fold cross-validation and independent tests were carried out for each activity using SVM as the classification algorithm. The AUCs of the best-performing feature models on the cross-validation datasets are 0.98, 0.97, 0.93, and 0.97 for the antimicrobial DPC model, antibacterial AAC model, antifungal DPC model, and antiviral hybrid model, respectively. The AUCs of the respective feature models on the independent test datasets are 0.98, 0.99, 0.90, and 0.86 for antimicrobial, antibacterial, antifungal, and antiviral activities, respectively. The relative performance of the eight feature encodings on the cross-validation and independent test datasets is illustrated in Figs. 2 and 3, respectively. Comparing the results of each feature model shows that the performance of the DPC models, whose parameters were optimized on the cross-validation datasets, is consistent with their performance on the respective independent test datasets for each functional activity. Therefore, DPC is used to compare the four ML algorithms. Detailed results for the cross-validation and independent test sets are given in Supplementary Tables S3-S6.
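The five-fold protocol can be illustrated with a small standard-library sketch. It assumes stratified folds (each fold keeps the positive/negative ratio); the text does not state how the folds were drawn, and the dataset below is hypothetical.

```python
import random


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs that keep the class ratio per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices round-robin into folds
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Hypothetical balanced dataset: 50 positive and 50 negative peptides.
labels = [1] * 50 + [0] * 50
splits = list(stratified_kfold(labels, k=5))
for train, test in splits:
    print(len(train), len(test), sum(labels[i] for i in test))
```

In each round a model would be fitted on the training indices and scored on the held-out fold, so every sequence is used for testing exactly once.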
Apart from AAC and DPC, the remaining six feature encodings also achieved reasonable performance, with AUCs ranging from 0.80–0.97, 0.69–0.95, 0.72–0.92, and 0.74–0.97 for antimicrobial, antibacterial, antifungal, and antiviral activities, respectively. These AUC ranges indicate that the remaining six features are also useful for peptide prediction, because they describe the sequences from complementary viewpoints.
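Since AUC is the metric used to rank the feature encodings, a brief sketch of how it can be computed from prediction scores may be useful. This version uses the pairwise (Mann-Whitney) formulation: the fraction of positive/negative pairs in which the positive sequence receives the higher score, with ties counted as one half. The labels and scores below are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for three positives and three negatives.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
example_auc = auc(labels, scores)
print(round(example_auc, 3))
```

Here eight of the nine positive/negative pairs are ordered correctly, giving an AUC of 8/9; a perfect ranking gives 1.0 and a random one about 0.5.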
3.2 Choosing the best classification algorithm
In general, no single algorithm is best for every classification problem; hence, as a rule of thumb, several algorithms are compared to find the one best suited to our data. As mentioned before, DPC is the most relevant feature for differentiating between the positive and negative data of each activity. Thus, to corroborate the use of DPC as the classifying feature, four ML models were trained with the DPC feature and their performance was examined on five-fold cross-validation and independent test datasets. The detailed performance of all models for each activity is listed in Table 2. As shown in Table 2, the models trained with the set 2 datasets of antimicrobial, antibacterial, and antiviral activity achieved cross-validation AUCs of (0.98, 0.90, 0.97, and 0.87), (0.96, 0.67, 0.95, and 0.92), and (0.95, 0.88, 0.94, and 0.94) for the SVM, KNN, RF, and NB algorithms, respectively. For antifungal activity, the DPC model developed with the set 1 dataset performed better on five-fold cross-validation, with AUCs of 0.94, 0.73, 0.92, and 0.82 for the SVM, KNN, RF, and NB algorithms, respectively. Therefore, the independent test datasets of set 2 (antimicrobial, antibacterial, and antiviral activity) and set 1 (antifungal activity) were selected for benchmarking, as outlined in Table 1. To demonstrate the applicability and effectiveness of the proposed SVM model, we compared its results with those of the other three algorithms. As evident from Table 2, the overall accuracy and MCC of the SVM models are generally higher than those of KNN, RF, and NB on the five-fold cross-validation and independent test datasets. SVM-derived models are therefore considered the most effective for the classification problems addressed in this study.
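The choice of dataset per activity can be reproduced from the SVM cross-validation AUCs reported in Table 2. The sketch below assumes that the set with the highest cross-validation AUC is the one selected, which is consistent with the sets named in the text; the variable names are illustrative.

```python
# SVM cross-validation AUCs of the DPC models, copied from Table 2.
svm_cv_auc = {
    "antimicrobial": {"Set 1": 0.93, "Set 2": 0.98, "Set 3": 0.95},
    "antibacterial": {"Set 1": 0.91, "Set 2": 0.96, "Set 3": 0.89},
    "antifungal": {"Set 1": 0.94, "Set 2": 0.92, "Set 3": 0.91},
    "antiviral": {"Set 1": 0.80, "Set 2": 0.95, "Set 3": 0.87},
}

# Assumed selection rule: keep the set with the highest cross-validation AUC.
best_set = {activity: max(aucs, key=aucs.get)
            for activity, aucs in svm_cv_auc.items()}

for activity, chosen in best_set.items():
    print(f"{activity}: {chosen} (AUC {svm_cv_auc[activity][chosen]:.2f})")
```

Under this rule set 2 is chosen for antimicrobial, antibacterial, and antiviral activity and set 1 for antifungal activity.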
Table 2
Performance comparison of SVM-derived DPC models with DPC models derived from three other ML algorithms (KNN, RF, and NB) on the five-fold cross-validation (CV) and independent test (Test) datasets
Activity | Set | Algorithm | CV Sn (%) | CV Sp (%) | CV Ac (%) | CV MCC | CV AUC | Test Sn (%) | Test Sp (%) | Test Ac (%) | Test MCC | Test AUC |
Antimicrobial | Set 1 | SVM | 91.34 | 89.32 | 90.38 | 0.81 | 0.93 | 84.77 | 91.36 | 87.88 | 0.76 | 0.93 |
Antimicrobial | Set 1 | KNN | 86.40 | 76.60 | 81.80 | 0.63 | 0.82 | 82.60 | 67.40 | 75.40 | 0.50 | 0.76 |
Antimicrobial | Set 1 | RF | 88.80 | 91.90 | 90.20 | 0.80 | 0.95 | 85.90 | 89.40 | 87.50 | 0.75 | 0.93 |
Antimicrobial | Set 1 | NB | 85.00 | 43.00 | 65.20 | 0.31 | 0.65 | 87.00 | 49.40 | 69.20 | 0.39 | 0.70 |
Antimicrobial | Set 2 | SVM | 88.33 | 96.99 | 92.66 | 0.86 | 0.98 | 87.42 | 98.01 | 92.72 | 0.86 | 0.98 |
Antimicrobial | Set 2 | KNN | 90.90 | 89.60 | 90.30 | 0.80 | 0.90 | 88.10 | 87.40 | 87.70 | 0.75 | 0.87 |
Antimicrobial | Set 2 | RF | 91.20 | 94.40 | 92.80 | 0.85 | 0.97 | 90.30 | 93.60 | 91.90 | 0.83 | 0.97 |
Antimicrobial | Set 2 | NB | 88.10 | 75.30 | 81.70 | 0.64 | 0.87 | 85.20 | 83.20 | 84.20 | 0.68 | 0.85 |
Antimicrobial | Set 3 | SVM | 85.68 | 93.02 | 89.35 | 0.79 | 0.95 | 84.11 | 94.70 | 89.40 | 0.79 | 0.96 |
Antimicrobial | Set 3 | KNN | 85.20 | 82.20 | 83.70 | 0.67 | 0.83 | 82.80 | 79.90 | 81.30 | 0.62 | 0.81 |
Antimicrobial | Set 3 | RF | 86.70 | 91.60 | 89.10 | 0.78 | 0.94 | 85.00 | 92.70 | 88.90 | 0.77 | 0.94 |
Antimicrobial | Set 3 | NB | 84.70 | 42.60 | 63.60 | 0.30 | 0.68 | 77.50 | 66.90 | 72.20 | 0.44 | 0.79 |
Antibacterial | Set 1 | SVM | 82.51 | 89.69 | 86.10 | 0.72 | 0.91 | 98.65 | 71.62 | 85.14 | 0.73 | 0.90 |
Antibacterial | Set 1 | KNN | 84.80 | 74.00 | 79.40 | 0.59 | 0.80 | 73.00 | 66.20 | 69.60 | 0.39 | 0.69 |
Antibacterial | Set 1 | RF | 79.40 | 87.90 | 83.60 | 0.67 | 0.91 | 85.10 | 70.30 | 77.70 | 0.56 | 0.82 |
Antibacterial | Set 1 | NB | 70.90 | 79.40 | 75.10 | 0.50 | 0.80 | 71.60 | 82.40 | 77.00 | 0.54 | 0.80 |
Antibacterial | Set 2 | SVM | 93.27 | 92.38 | 92.83 | 0.86 | 0.96 | 97.30 | 97.30 | 97.30 | 0.95 | 0.99 |
Antibacterial | Set 2 | KNN | 93.70 | 39.00 | 66.40 | 0.39 | 0.67 | 77.00 | 82.40 | 79.70 | 0.59 | 0.81 |
Antibacterial | Set 2 | RF | 88.80 | 93.70 | 91.30 | 0.82 | 0.95 | 87.80 | 93.20 | 90.50 | 0.81 | 0.95 |
Antibacterial | Set 2 | NB | 89.20 | 88.80 | 89.00 | 0.78 | 0.92 | 90.50 | 86.50 | 88.50 | 0.77 | 0.94 |
Antibacterial | Set 3 | SVM | 80.72 | 89.24 | 84.98 | 0.70 | 0.89 | 94.59 | 70.27 | 82.43 | 0.67 | 0.89 |
Antibacterial | Set 3 | KNN | 92.40 | 48.40 | 70.40 | 0.45 | 0.69 | 73.00 | 63.50 | 68.20 | 0.36 | 0.66 |
Antibacterial | Set 3 | RF | 85.20 | 83.90 | 84.50 | 0.69 | 0.89 | 81.10 | 77.00 | 79.10 | 0.58 | 0.83 |
Antibacterial | Set 3 | NB | 73.50 | 77.10 | 75.30 | 0.50 | 0.80 | 56.80 | 82.40 | 69.60 | 0.40 | 0.78 |
Antifungal | Set 1 | SVM | 89.94 | 81.56 | 85.75 | 0.72 | 0.94 | 78.99 | 92.44 | 85.71 | 0.72 | 0.90 |
Antifungal | Set 1 | KNN | 73.70 | 74.90 | 74.30 | 0.48 | 0.73 | 75.60 | 60.50 | 68.10 | 0.36 | 0.69 |
Antifungal | Set 1 | RF | 84.10 | 85.50 | 84.80 | 0.69 | 0.92 | 77.30 | 80.70 | 79.00 | 0.58 | 0.88 |
Antifungal | Set 1 | NB | 77.70 | 78.20 | 77.90 | 0.55 | 0.82 | 68.10 | 83.20 | 75.60 | 0.51 | 0.80 |
Antifungal | Set 2 | SVM | 80.73 | 90.50 | 85.61 | 0.72 | 0.92 | 68.07 | 95.80 | 81.93 | 0.66 | 0.89 |
Antifungal | Set 2 | KNN | 83.80 | 70.10 | 77.00 | 0.54 | 0.77 | 81.50 | 66.40 | 73.90 | 0.48 | 0.75 |
Antifungal | Set 2 | RF | 81.80 | 86.00 | 83.90 | 0.67 | 0.91 | 84.90 | 83.20 | 84.00 | 0.68 | 0.89 |
Antifungal | Set 2 | NB | 77.40 | 74.30 | 75.80 | 0.51 | 0.82 | 73.90 | 75.60 | 74.80 | 0.49 | 0.78 |
Antifungal | Set 3 | SVM | 81.56 | 83.24 | 82.40 | 0.65 | 0.91 | 79.83 | 73.11 | 76.47 | 0.53 | 0.84 |
Antifungal | Set 3 | KNN | 79.30 | 71.80 | 75.60 | 0.51 | 0.75 | 82.40 | 50.40 | 66.40 | 0.34 | 0.66 |
Antifungal | Set 3 | RF | 79.60 | 81.80 | 80.70 | 0.61 | 0.89 | 75.60 | 70.60 | 73.10 | 0.46 | 0.80 |
Antifungal | Set 3 | NB | 77.90 | 71.20 | 74.60 | 0.49 | 0.77 | 62.20 | 69.70 | 66.00 | 0.32 | 0.69 |
Antiviral | Set 1 | SVM | 86.11 | 81.94 | 84.03 | 0.68 | 0.80 | 95.83 | 70.83 | 83.33 | 0.69 | 0.82 |
Antiviral | Set 1 | KNN | 86.10 | 75.00 | 80.60 | 0.61 | 0.79 | 87.50 | 79.20 | 83.30 | 0.66 | 0.83 |
Antiviral | Set 1 | RF | 80.60 | 81.90 | 81.30 | 0.62 | 0.88 | 83.30 | 75.00 | 79.20 | 0.58 | 0.82 |
Antiviral | Set 1 | NB | 81.90 | 79.20 | 80.60 | 0.61 | 0.84 | 75.00 | 70.80 | 72.90 | 0.45 | 0.78 |
Antiviral | Set 2 | SVM | 90.28 | 98.61 | 94.44 | 0.89 | 0.95 | 95.83 | 100.00 | 97.92 | 0.96 | 0.99 |
Antiviral | Set 2 | KNN | 94.40 | 79.20 | 86.80 | 0.74 | 0.88 | 95.80 | 83.30 | 89.60 | 0.79 | 0.87 |
Antiviral | Set 2 | RF | 90.30 | 97.20 | 93.80 | 0.87 | 0.94 | 87.50 | 100.00 | 93.80 | 0.88 | 0.94 |
Antiviral | Set 2 | NB | 80.60 | 97.20 | 88.90 | 0.78 | 0.94 | 95.80 | 91.70 | 93.80 | 0.87 | 0.95 |
Antiviral | Set 3 | SVM | 90.28 | 77.78 | 84.03 | 0.69 | 0.87 | 95.83 | 75.00 | 85.42 | 0.72 | 0.92 |
Antiviral | Set 3 | KNN | 86.10 | 41.70 | 63.90 | 0.31 | 0.62 | 75.00 | 62.50 | 68.80 | 0.37 | 0.72 |
Antiviral | Set 3 | RF | 83.30 | 80.60 | 81.90 | 0.63 | 0.85 | 87.50 | 75.00 | 81.30 | 0.63 | 0.78 |
Antiviral | Set 3 | NB | 70.80 | 79.20 | 75.00 | 0.50 | 0.83 | 70.80 | 75.00 | 72.90 | 0.45 | 0.81 |
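The threshold-dependent columns of Table 2 (Sn, Sp, Ac, and MCC) follow from the confusion matrix by their standard definitions. The sketch below applies them to assumed counts: 24 positives and 24 negatives with one false negative. These counts are not given in the text; they were chosen because they reproduce the antiviral set 2 SVM values on the independent test dataset.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy (all in %) and MCC from counts."""
    sn = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    ac = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Ac": ac, "MCC": mcc}


# Assumed counts (hypothetical): 23 of 24 positives and all 24 negatives correct.
metrics = classification_metrics(tp=23, fn=1, tn=24, fp=0)
print({name: round(value, 2) for name, value in metrics.items()})
```

With these counts the rounded values are Sn 95.83, Sp 100.00, Ac 97.92, and MCC 0.96.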
3.3 Comparison with existing antimicrobial peptide predictors
Although no prediction method dedicated to plant-derived antimicrobial peptides has been available so far, we compared PTPAMP with existing approaches, namely amPEPpy1.0 (Lawrence et al. 2020), Antifp (Agrawal et al. 2018), Meta-iAVP (N et al. 2019), and iAMPpred (Meher et al. 2017), in terms of predicting plant-derived AMPs. The AUCs obtained on the datasets outlined in Table 1 were examined for all eight feature models across each activity. As shown in Fig. 4, the DPC-based antimicrobial model is comparatively efficient at distinguishing between the positive and negative data of the benchmark datasets. In contrast, the AAC-based models achieved higher AUCs for the other three activities, namely antibacterial, antifungal, and antiviral. Accordingly, the best-performing model for each activity across the six benchmark datasets was selected for implementation in the web server. The selected models were also compared with the state-of-the-art predictors, and the results are displayed in Fig. 5. The comparative results listed in Supplementary Table S7 demonstrate the relatively effective performance of our proposed model for each activity in classifying plant-derived antimicrobial peptides.
Even though the negative data of this study were taken from previously developed tools, the performance of those tools is not satisfactory when plant-derived peptides are given as queries. The improved performance of our proposed models can be explained by the following aspects: (i) the existing predictors were developed on datasets from multiple organisms, causing the models to learn generalized features, whereas our models were developed specifically for plant-derived datasets; (ii) the parameters of our proposed models were optimized on five-fold cross-validation datasets, which suggests that these parameters are more accurate and stable; (iii) among the distinct features used to construct models in this study, the compositional features, i.e. AAC and DPC, showed significant differences between positive and negative data. Many studies have reported the successful use of these two features in predicting peptides and proteins (Bhasin and Raghava 2004; Raghava and Han 2005; Garg and Raghava 2008; Kumar et al. 2008, 2015; Lata et al. 2010; Thakur et al. 2012; Gautam et al. 2013; Gupta et al. 2016). The literature also reports the successful use of binary features in classifying antifungal peptides (Agrawal et al. 2018); however, a drawback of this feature is that it requires peptides of equal length for developing prediction models. As described in Section 3.2, the number of sequences in our datasets is not large, so the binary feature was not included in this study to avoid reducing the dataset size. Additionally, the prediction models were developed with balanced sequence counts and lengths, which helps in classifying peptides of varied lengths precisely.
3.4 Structure analysis of the selected sequences
The secondary structures predicted by PSIPRED are shown in Fig. 6. As the alpha-helical content of a peptide increases, its hydrophobicity in aqueous environments also increases (Chen et al. 2007). Peptide hydrophobicity is believed to be an important factor in antimicrobial activity. Rather than simply higher or lower hydrophobicity, an optimum hydrophobicity window accounts for potent antimicrobial function. Beyond this window, antimicrobial activity decreases as hydrophobicity increases. The explanation for this trend is peptide self-association, which hinders peptide entry into the bacterial cell (Chen et al. 2007). The 3D structures generated by SWISS-MODEL for the selected sequences are shown in Fig. 7. To assess structure quality, GMQE, QMEANDisCo global scores (Studer et al. 2020), and the percentage of residues in the Ramachandran favored region were calculated for each sequence, as listed in Table 3. The total percentage of amino acid residues of each sequence in the Ramachandran favored and allowed regions is > 90%, making the constructed structures reasonable and convincing. Ideally, > 98% of amino acid residues would be in the favored region (Chen et al. 2010). In this study, this criterion is met by the structures of viscotoxin-A3, HsAFP1, beta-hordothionin, and ginkbilobin-2, with 100%, 98.08%, 100%, and 99.06% of residues, respectively. Moreover, the structures presented in Fig. 7 are determined by high-resolution X-ray diffraction, which is a preferred method in homology modeling.
Table 3
Description of structure quality for selected sequences
PPepDB ID | Peptide name | Sequence | Length | GMQE | QMEANDisCo global score | Ramachandran favoured (%) |
PPepDB_1622 | Hellethionin-D | KSCCRNTLARNCYNACRFTGGSQPTCGILCDCIHVTTTTCPSSHPS | 46 | 0.97 | 0.94 ± 0.12 | 97.73 |
PPepDB_1543 | Snakin-1 | GSNFCDSKCKLRCSKAGLADRCLKYCGICCEECKCVPSGTYGNKHECPCYRDKKNSKGKSKCP | 63 | 0.96 | 0.94 ± 0.11 | 96.72 |
PPepDB_1491 | Griffithsin | SLTHRKFGGSGGSPFSGLSSIAVRSGSYLDAIIIDGVHHGGSGGNLSPTFTFGSGEYISNMTIRSGDYIDNISFETNMGRRFGPYGGSGGSANTLSNVKVIQINGSAGDYLDSLDIYYEQY | 121 | 0.95 | 0.89 ± 0.08 | 97.48 |
PPepDB_2071 | Ep-AMP1 | CVLIGQRCDNDRGPRCCSGQGNCVPLPFLGGVCAV | 35 | 0.94 | 0.95 ± 0.12 | 93.94 |
PPepDB_3948 | Viscotoxin-A3 | KSCCPNTTGRNIYNACRLTGAPRPTCAKLSGCKIISGSTCPSDYPK | 46 | 0.93 | 0.89 ± 0.12 | 100.00 |
PPepDB_2152 | PsDef1 | RMCKTPSGKFKGYCVNNTNCKNVCRTEGFPTGSCDFHVAGRKCYCYKPCP | 50 | 0.93 | 0.93 ± 0.11 | 95.83 |
PPepDB_621 | Ginkbilobin-2 | ANTAFVSSACNTQKIPSGSPFNRNLRAMLADLRQNTAFSGYDYKTSRAGSGGAPTAYGRATCKQSISQSDCTACLSNLVNRIFSICNNAIGARVQLVDCFIQYEQRSF | 108 | 0.93 | 0.89 ± 0.08 | 99.06 |
PPepDB_2077 | HsAFP1 | DGVKLCDVPSGTWSGHCGSSSKCSQQCKDREHFAYGGACHYQFPSVKCFCKRQC | 54 | 0.92 | 0.91 ± 0.11 | 98.08 |
PPepDB_3974 | Beta-hordothionin | KSCCRSTLGRNCYNLCRVRGAQKLCANACRCKLTSGLKCPSSFPK | 45 | 0.90 | 0.83 ± 0.09 | 100.00 |
PPepDB_2206 | NaD1 | RECKTESNTFPGICITKPPCRKACISEKFTDGHCSKILRRCLCTKPC | 47 | 0.90 | 0.85 ± 0.09 | 95.56 |
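The > 98% criterion for the Ramachandran favored region can be checked directly against the last column of Table 3. The short sketch below copies those values and lists the peptides that exceed the threshold; the variable names are illustrative.

```python
# Ramachandran favoured percentages, copied from Table 3.
ramachandran_favoured = {
    "Hellethionin-D": 97.73,
    "Snakin-1": 96.72,
    "Griffithsin": 97.48,
    "Ep-AMP1": 93.94,
    "Viscotoxin-A3": 100.00,
    "PsDef1": 95.83,
    "Ginkbilobin-2": 99.06,
    "HsAFP1": 98.08,
    "Beta-hordothionin": 100.00,
    "NaD1": 95.56,
}

IDEAL_THRESHOLD = 98.0  # ideal share of residues in the favoured region (%)

ideal = sorted(name for name, pct in ramachandran_favoured.items()
               if pct > IDEAL_THRESHOLD)
print(ideal)
```

Four of the ten modeled structures pass this check, matching the peptides named in the text.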
The structures modeled by SWISS-MODEL for beta-hordothionin and NaD1 contain 4 and 2 salt bridges, respectively. Salt bridges confer proteolytic stability on a peptide and assist in peptide folding but are not vital for microbicidal action (Andersson et al. 2012).
3.5 Structural insights based on amino acid residue preferences
Most plant-derived AMPs, such as cyclotides and defensins, belong to the β or αβ family. The dominant amino acid differs between structural families. For instance, the β and αβ families are dominated by cysteine (C), a hydrophobic amino acid that is a prerequisite for peptide folding. The abundance of cysteine residues in plant peptides reflects frequently occurring disulfide bonds, which confer metabolic stability (Cole and Cole 2008). The presence of disulfide bonds also means that the majority of plant-derived peptides are disulfide-bonded defensin-like molecules (Li et al. 2021). Besides C, arginine (R) and lysine (K) are equally distributed in the αβ family, while glycine (G) and serine (S) are the two favored amino acids across all families (Mishra and Wang 2012). Moreover, short-chain residues such as glycine and serine in antimicrobial peptides account for their disordered regions, which provide structural flexibility (Tavares et al. 2012). Plant cyclotides are mainly rich in C, G, T/S, and K, which regulate the widespread β-sheet-containing scaffold in them (Wang 2010). In short, plant-derived peptides are mostly rich in cysteine and glycine. The presence of C, R, K, G, and S in our peptide datasets is visualized in Fig. 8. According to Wang et al. (2009), the amino acid composition of natural AMPs is directly related to their 3D structures.
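The residue preferences discussed above can be inspected for individual peptides with a few lines of code. This sketch reports the fractions of C, R, K, G, and S and the largest number of disulfide bonds the cysteine count allows (two cysteines per bond) for two sequences from Table 3. It is only an illustration and does not reproduce the dataset-level analysis behind Fig. 8.

```python
from collections import Counter

KEY_RESIDUES = "CRKGS"  # residues highlighted in the text


def key_residue_profile(sequence):
    """Fractions of C, R, K, G and S plus an upper bound on disulfide bonds."""
    seq = sequence.upper()
    counts = Counter(seq)
    fractions = {aa: counts[aa] / len(seq) for aa in KEY_RESIDUES}
    max_disulfides = counts["C"] // 2  # each bond pairs two cysteines
    return fractions, max_disulfides


# Sequences copied from Table 3.
table3_peptides = {
    "Ep-AMP1": "CVLIGQRCDNDRGPRCCSGQGNCVPLPFLGGVCAV",
    "NaD1": "RECKTESNTFPGICITKPPCRKACISEKFTDGHCSKILRRCLCTKPC",
}

profiles = {name: key_residue_profile(seq)
            for name, seq in table3_peptides.items()}
for name, (fractions, max_ss) in profiles.items():
    summary = ", ".join(f"{aa}={frac:.2f}" for aa, frac in fractions.items())
    print(f"{name}: {summary}; at most {max_ss} disulfide bonds")
```

The disulfide figure is only an upper bound from composition; the actual bonding pattern depends on the folded structure.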
3.6 PTPAMP web server
To maximize the utility of plant-derived AMPs, we established a web server, PTPAMP, which is aimed at the wider plant research community and enables users to predict plant-derived AMPs. So that our work can be validated, the datasets used in this study can be downloaded from the server. The steps for using the server are summarized below.
First, the peptide screening module lets users enter query peptide sequences (in FASTA format) in the provided text box or upload a peptide FASTA file from their system. This module categorizes the queries into four functional activities (AMP, ABP, AFP, or AVP) based on the scores computed by the integrated feature models. Results are shown in tabular form. Along with the prediction scores, four columns give the charge, molecular weight, isoelectric point, and hydrophobicity of each sequence. Additionally, a link leads to the peptide properties, where relevant graphs are plotted, such as AAC (with respect to physicochemical properties), hydrophobicity, charge, and hydrophobic moment plots. The next column provides a BLAST (SF et al. 1990) option to search for similar sequences in PlantPepDB.
Second, the peptide designing module lets users enter a query peptide sequence and obtain the possible peptide mutants with a single amino acid substitution. Based on the predicted scores and the peptide properties obtained using pepcalc (https://pepcalc.com/), which is placed on our server, users can select the best peptide mutant.
Third, the protein scan module generates peptides of the desired length and number of overlapping residues from a query protein, and each peptide is displayed with its prediction scores and motif search results. Users can paste the query protein sequence in the given text box or upload a FASTA file. This module helps to identify regions of a protein sequence that may be classified as AMPs. All three modules offer an SVM threshold option; users are advised to set a high value for results with high specificity or a low value for results with high sensitivity.
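The logic of the designing and protein scan modules, together with the threshold option, can be sketched as follows. This is not the server's code: the function names are hypothetical, "overlapping residues" is assumed to mean the number of residues shared by consecutive fragments, and the score scale in `classify` is arbitrary.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def single_substitution_mutants(peptide):
    """All peptides differing from the query at exactly one position."""
    mutants = []
    for i, original in enumerate(peptide):
        for aa in AMINO_ACIDS:
            if aa != original:
                mutants.append(peptide[:i] + aa + peptide[i + 1:])
    return mutants


def scan_protein(protein, length, overlap):
    """Cut a protein into fragments of `length` sharing `overlap` residues."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be >= 0 and smaller than length")
    step = length - overlap
    return [protein[i:i + length]
            for i in range(0, len(protein) - length + 1, step)]


def classify(score, threshold):
    """Label a prediction score against a user-chosen threshold."""
    return "AMP" if score >= threshold else "non-AMP"


# First residues of hellethionin-D (Table 3) used as a toy query.
mutants = single_substitution_mutants("KSCCRNTLAR")
fragments = scan_protein("KSCCRNTLARNCYNACRFTG", length=10, overlap=5)
print(len(mutants), fragments)
# The same hypothetical score flips label as the threshold is raised.
print(classify(0.3, threshold=0.0), classify(0.3, threshold=0.5))
```

A peptide of n standard residues yields 19n single-substitution mutants, so the candidate list grows linearly with peptide length. Raising the threshold marks fewer queries as positive, which trades sensitivity for specificity as described above.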