PTPAMP: prediction tool for plant-derived antimicrobial peptides

The emergence of antimicrobial peptides (AMPs) as a potential alternative to conventional antibiotics has led to the development of efficient computational methods for predicting AMPs. Among all organisms, the presence of multiple genes encoding AMPs in plants demands the development of a plant-based prediction tool. To this end, we developed models based on multiple peptide features, such as amino acid composition, dipeptide composition, and physicochemical attributes, for predicting plant-derived AMPs. The selected compositional models are integrated into a web server termed PTPAMP. The web server classifies a query peptide sequence into four functional activities, i.e., antimicrobial (AMP), antibacterial (ABP), antifungal (AFP), and antiviral (AVP). Our models achieved an average area under the curve of 0.95, 0.91, 0.85, and 0.88 for AMP, ABP, AFP, and AVP, respectively, on benchmark datasets, which is ~6.75% higher than the state-of-the-art methods. Moreover, our analysis indicates the abundance of cysteine residues in plant-derived AMPs and a distribution of other residues, such as G, S, K, and R, that differs according to the peptide structural family. Finally, we have developed a user-friendly web server, available at http://www.nipgr.ac.in/PTPAMP/. We expect this predictor to contribute substantially to the high-throughput identification of plant-derived AMPs and to provide additional insights into their functions.


Introduction
Plants produce a range of antimicrobial peptides, which are thought to be a pivotal molecular strategy for warding off microbial intrusion. Such peptides act at the point of entry, which makes them an important weapon of host innate immunity. This positions plant-derived peptides as plausible and efficient agents for dealing with the rapid increase in microbial resistance to antibiotics and drugs (Tam et al. 2015; Santos-Silva et al. 2020). Over the last few years, peptides have received growing attention as an alternative to antibiotics because of their small size and their cationic and amphipathic nature (Kamysz et al. 2003; Chen et al. 2007; Barashkova and Rogozhin 2020). Organisms ranging from prokaryotes to humans produce antimicrobial peptides, and plants are among them. These molecules are thought to be evolutionarily conserved across this range (Zhang and Gallo 2016). Plants use AMPs as part of their defense response, along with certain other toxic molecules that act on pathogens by interacting with their phospholipids, causing membrane permeabilization and cell death (Nawrot et al. 2014). In other words, AMPs mainly attack fundamental processes, for example by constraining cell wall formation or protein synthesis, as they can bind to DNA, RNA, or protein (Ganz 2003; Brogden 2005; Hancock 2006; Hale and Hancock 2007). Owing to their physicochemical properties, notably charge and hydrophobicity, AMPs are regarded as pivotal players in the plant defense system (Quintans et al. 2021).
It has been reported that AMPs are constitutively expressed in specific susceptible organs or are microbially induced at infection sites rather than circulating (Sels et al. 2008). Cysteine is the predominant amino acid in the majority of plant-derived AMPs (Hammami et al. 2009); it accounts for the many disulfide bonds that give the peptides a compact structure, which helps them endure stress conditions (Nawrot et al. 2014). This feature also makes the peptides more resistant to proteolytic and chemical degradation (Tam et al. 2015). Additionally, AMPs act on multiple low-affinity targets in microbes, which thwarts the development of microbial resistance (Maróti et al. 2011). Such resistance develops when a microbe stops responding, or responds more slowly, to an antibiotic used to cure an infection (Srivastava et al. 2021). Consequently, bacterial adaptive mechanisms allow resistant microbes to be selected under the selection pressure caused by the application of antibiotics (Morehead and Scarbrough 2018).
Based on the beneficial characteristics of AMPs, numerous peptide repositories have been developed, such as APD3 (Wang et al. 2016), PhytAMP (Hammami et al. 2009), and PlantPepDB (Das et al. 2020); the last is exclusively for plant-derived peptides with diverse activities. Still, many peptides await functional and structural characterization. In the post-genomic era, the avalanche of newly found protein and peptide sequences has prompted numerous prediction tools based on machine learning (ML), such as AntiBP2 (Lata et al. 2010), AVPpred (Thakur et al. 2012), iAMP-2L (Xiao et al. 2013), iAMPpred (Meher et al. 2017), Antifp (Agrawal et al. 2018), Deep-AmPEP30 (Yan et al. 2020), and amPEPpy 1.0 (Lawrence et al. 2020), to characterize the properties of peptide sequences. AntiBP2 integrates terminus-wise compositional features to develop models specifically for predicting antibacterial peptides with an accuracy of 98.95%, but it cannot make predictions for sequences shorter than 15 amino acids. AVPpred and Antifp use the support vector machine (SVM) algorithm to develop models for antiviral and antifungal peptides, with accuracies of 86% and 84.88% based on compositional and physicochemical features, respectively. Despite their high accuracy, these models cannot classify plant-derived peptides appropriately because they have not been trained on plant datasets. In contrast, iAMP-2L, iAMPpred, Deep-AmPEP30, and amPEP work for antimicrobial peptides, and to the best of our understanding, amPEP is the only predictor based on large and diverse AMP and non-AMP datasets. amPEP is an AMP classification method that applies a random forest (RF) model to selected physicochemical features. It achieved an accuracy of 96%, the highest among the prediction tools developed so far. It classifies a query sequence as AMP or non-AMP and assigns a score between 0 and 1. Such a prediction result is insufficient to characterize an unknown sequence, which motivates a web server equipped with models built upon plant-derived AMP datasets and with modules to elaborate the properties of a query sequence.
Compared with animals, plants have more genes for antimicrobial peptides and can generate hypervariable sequences, which makes plants a formidable repertoire of antimicrobial peptides against pathogenic microbes. They are equally capable of evolving with new specifications (Maróti et al. 2011). Plant-derived AMPs are classified into families based on their cysteine content (Farrokhi et al. 2008). According to an earlier study, the probability of finding hundreds of distinct AMPs in certain plant species is quite high (Nawrot et al. 2014). Moreover, identifying AMPs through wet-lab experiments is expensive in terms of resources and time. Therefore, using in silico methods to estimate the probability that a peptide is an AMP would help prioritize candidates for the experimental identification of antimicrobial peptides. Addressing these properties may lead to more precise predictors aimed at plant-derived peptides.
We therefore put forward a machine learning-based predictor, named PTPAMP, for classifying peptide sequences by antimicrobial, antibacterial, antifungal, or antiviral activity. To develop a prediction model for each activity, we encoded sequences with amino acid composition (AAC), dipeptide composition (DPC), physicochemical features, and composition-transition-distribution (CTD) descriptors. Subsequently, models were developed using SVM for each encoding, and the optimal feature set was identified from their results on cross-validation and independent test datasets. A comparison of PTPAMP with state-of-the-art methods on six benchmark datasets showed that our proposed models outperformed the existing predictors. To the best of our knowledge, PTPAMP is the first plant-based approach for the prediction of AMPs. The web server is freely accessible at http://www.nipgr.ac.in/PTPAMP/ and provides a platform for categorizing peptides among the four defined activities, along with a module that generates mutated sequences in the hope of obtaining more bioactive peptides. We anticipate that this framework will assist experimentalists in the discovery of novel plant-derived AMPs.

Overview of PTPAMP
Developing PTPAMP entailed the following steps: (i) collection of reliable datasets for training and validating the models; (ii) representation of peptide sequences in a manner that truly reflects their intrinsic properties; (iii) development of a classifier to perform the prediction; (iv) assessment of the classifier with relevant cross-validation tests; and (v) deployment of a web server integrating the machine learning models, where users can obtain their intended result without prior understanding of the classifier algorithm used. These steps are detailed in the following sections.

Data assembly
We collected a total of 2383 antimicrobial, 324 antibacterial, 530 antifungal, and 128 antiviral peptides from PlantPepDB as positive datasets; these peptides also show a range of other activities, such as anticancer, hemolytic, and cytotoxic, owing to their broad range of action. Of these, a few peptides have evidence at the transcript or protein level, some are inferred from homology, and the rest are predicted, according to the information available at PlantPepDB. After excluding redundant sequences and peptides with non-natural amino acid codes (B, J, O, U, Z, X), we obtained 1815 antimicrobial, 297 antibacterial, 477 antifungal, and 96 antiviral peptides. To build distinct models for comparative purposes, three sets were drawn up that differ in their negative datasets. While collating positive and negative data for each set, we maintained a sequence length range of 5-255 amino acids (aa) for antimicrobial, antibacterial, and antifungal activities and 10-121aa for antiviral activity. For the first set, PlantPepDB (Das et al. 2020) was used as the source. While collecting negative data for each activity, peptides annotated as antimicrobial, antibacterial, antifungal, antiviral, anticancer, cytotoxic, insecticidal, or hemolytic were excluded to avoid similarity between positive and negative data, as a majority of reported AMPs exhibit more than one activity and can target a wide range of microbes, including fungi, viruses, and bacteria.
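The clean-up steps described in this paragraph (dropping redundant sequences, dropping peptides that contain the codes B, J, O, U, Z, or X, and applying a length window) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the function name `filter_peptides` and the example sequences are hypothetical.

```python
# Hypothetical helper illustrating the dataset clean-up described in the text.
NON_NATURAL = set("BJOUZX")  # residue codes excluded in the study


def filter_peptides(sequences, min_len, max_len):
    """Drop duplicates, sequences with excluded codes, and out-of-range lengths."""
    seen = set()
    kept = []
    for seq in sequences:
        seq = seq.strip().upper()
        if seq in seen:
            continue  # redundant sequence
        seen.add(seq)
        if NON_NATURAL & set(seq):
            continue  # contains a non-natural residue code
        if not (min_len <= len(seq) <= max_len):
            continue  # outside the chosen length window
        kept.append(seq)
    return kept


# Invented example sequences, not taken from PlantPepDB.
example = ["KSCCRNTWAR", "KSCCRNTWAR", "GXCAKL", "ACD", "RKCGLPVCGET"]
print(filter_peptides(example, 5, 255))
```

Here the duplicate, the sequence containing X, and the three-residue sequence are discarded, leaving two peptides.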
For the second set, we used 1815 sequences generated from UniProt (as employed in the amPEPpy (Lawrence et al. 2020) method) as negative data for antimicrobial activity, covering the same sequence length range as the AMPs. For antibacterial activity, 285 sequences with lengths of 5-94aa from the non-experimental negative records of AntiBP2 (Lata et al. 2010) and 12 sequences with lengths of 95-255aa from MitPred (a source of non-secretory proteins) (Kumar et al. 2006) were used as negative data. We extracted 238 and 203 sequences from Antifp_DS1 and Antifp_DS2 (Agrawal et al. 2018), respectively, as negative data for antifungal activity. To keep the sequence count and length balanced, we also used 36 UniProt sequences (with lengths of 108-255aa), as in the amPEPpy (Lawrence et al. 2020) method. For antiviral activity, 89 non-experimental negative peptides (as used in a previous antiviral peptide prediction method (Thakur et al. 2012)) with lengths of 10-40aa were used, and the remaining sequences were taken from the negative data of AntiBP2 (Lata et al. 2010) and MitPred (Kumar et al. 2006). The negative sequences of these two sets were mutually distinct, with no recurring sequences.
For the third set, after shuffling the negative data of the two sets described above, we randomly picked an equal number of peptides from each and merged them, keeping the sequence count balanced with the positive data for each functional activity. Dataset statistics and the overlap among the positive datasets are presented in Supplementary Table S1 and Supplementary Fig. S1, respectively. All collected datasets were divided into 75% for training and 25% for testing the models. Separately, we constructed six benchmark datasets to compare our plant-based prediction server (PTPAMP) with previously developed prediction tools. The six benchmark datasets are summarized in Table 1. Owing to the efficient performance of iAMP-2L (Xiao et al. 2013), 599 non-AMP sequences from its benchmark dataset were employed as negative data for our first set.

Peptide features
Numerous features exist to define antimicrobial peptides, and many of them have been used in existing antimicrobial peptide prediction tools. We chose a few of them, with reference to selected studies (Lata et al. 2010; Thakur et al. 2012; Agrawal et al. 2018; Lawrence et al. 2020), for characterizing plant-derived AMPs: amino acid composition, dipeptide composition, physicochemical properties, and CTD descriptors. Additionally, we searched for motifs of antimicrobial, antibacterial, antifungal, and antiviral peptides separately, using the MERCI program (Vens et al. 2011). A complete list of motifs for each activity is provided in Supplementary sheet 1.

Amino acid and dipeptide composition
Machine learning algorithms require input features of fixed length. Because our peptides vary in length, we computed their amino acid and dipeptide composition profiles, which are vectors of 20 and 400 dimensions, respectively. The amino acid composition gives the fraction of each amino acid in a peptide sequence, while the dipeptide composition captures the global and local arrangement of amino acid residues in a sequence (Manavalan et al. 2017). We used a Perl script from the GPSR package (https://webs.iiitd.edu.in/raghava/gpsr/) to calculate the amino acid and dipeptide composition profiles.
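To make the two encodings concrete, the following Python sketch computes a 20-dimensional AAC vector and a 400-dimensional DPC vector for a single peptide. It is not the GPSR Perl script used in the study; the function names `aac` and `dpc`, the alphabetical residue ordering, and the choice to express values as percentages are assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def aac(seq):
    """Percentage of each residue: a 20-dimensional vector."""
    n = len(seq)
    return [100.0 * seq.count(a) / n for a in AMINO_ACIDS]


def dpc(seq):
    """Percentage of each overlapping dipeptide: a 400-dimensional vector."""
    total = len(seq) - 1  # number of overlapping residue pairs
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [100.0 * counts[d] / total for d in DIPEPTIDES]


peptide = "ACCA"  # toy sequence for illustration
print(len(aac(peptide)), len(dpc(peptide)))
```

Whatever the peptide length, both functions return vectors of fixed size, which is the property the classifier needs. The sketch assumes sequences of at least two residues.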

Physicochemical properties
Besides compositional features, physicochemical attributes are also important in describing a peptide. We computed aaDescriptors in addition to charge, mass, isoelectric point, hydrophobicity, aliphatic index, PPI (potential protein interaction) index, and hydrophobic moment using the R package 'Peptides' (Osorio et al. 2015). Five aspects of AMPs should always be in focus when studying antimicrobial peptides: charge, hydrophobicity, amphiphilicity, chain length, and secondary structure (Huan et al. 2020). Size is one of the features that AMPs commonly share (Ageitos et al. 2017). In addition, the aliphatic index is related to the thermal stability of proteins, and the PPI index helps differentiate the action mechanisms of AMPs (Boman 2003). The calculated physicochemical properties are encapsulated in a vector of 74 dimensions. The complete list of physicochemical attributes used in this study is provided in Supplementary Table S2.
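As one example of how such a property is derived from sequence alone, the sketch below computes the aliphatic index using the commonly cited definition of Ikai (1980), in which the mole percentages of Ala, Val, Ile, and Leu are weighted by 1, 2.9, 3.9, and 3.9. This is an assumption about the formula; the study itself used the R package 'Peptides', and whether this Python function reproduces that package's output exactly has not been checked. The function name is hypothetical.

```python
def aliphatic_index(seq):
    """Aliphatic index from mole percentages of Ala, Val, Ile, and Leu."""
    n = len(seq)
    pct = {aa: 100.0 * seq.count(aa) / n for aa in "AVIL"}
    # Weights follow the commonly cited Ikai (1980) definition (assumed here).
    return pct["A"] + 2.9 * pct["V"] + 3.9 * (pct["I"] + pct["L"])


print(aliphatic_index("AVIL"))
```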

CTD descriptors
According to the Global Protein Sequence Descriptors (Dubchak et al. 1995), amino acids are categorized into three groups for each of seven physicochemical attributes, and three kinds of descriptor sets can be calculated for each attribute. We computed all three kinds, i.e., the composition, transition, and distribution descriptor sets, individually as well as collectively to examine their importance in separating positive and negative data. The composition descriptor is encoded in 21 dimensions and describes the global percentage of the three groups for each physicochemical attribute. The transition and distribution descriptor sets have 21 and 105 dimensions, respectively. The protr package of R (Xiao et al. 2015) was used to calculate these three descriptors.
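The dimensions quoted above follow from the per-attribute layout: each attribute yields 3 composition values, 3 transition values, and 15 distribution values, and seven attributes give 21, 21, and 105. The Python sketch below computes the three descriptor sets for a single attribute. It is an illustration, not the protr implementation; the three-class hydrophobicity grouping, the rounding rule for the quantile positions, and the function name `ctd_one_attribute` are assumptions.

```python
# Assumed three-class hydrophobicity grouping, used for illustration only.
GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
CLASS_OF = {aa: g for g, members in GROUPS.items() for aa in members}


def ctd_one_attribute(seq):
    """Return (composition[3], transition[3], distribution[15]) for one attribute."""
    encoded = [CLASS_OF[aa] for aa in seq]
    n = len(encoded)
    # Composition: fraction of residues falling in each class.
    comp = [encoded.count(g) / n for g in "123"]
    # Transition: fraction of adjacent pairs that switch between two classes.
    pairs = list(zip(encoded, encoded[1:]))
    trans = []
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        hits = sum(1 for p in pairs if p in ((a, b), (b, a)))
        trans.append(hits / (n - 1))
    # Distribution: relative position (in % of length) of the first, 25%, 50%,
    # 75%, and last occurrence of each class along the sequence.
    dist = []
    for g in "123":
        positions = [i + 1 for i, c in enumerate(encoded) if c == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                dist.append(0.0)
                continue
            k = max(1, int(frac * len(positions)))  # 1-based occurrence index
            dist.append(100.0 * positions[k - 1] / n)
    return comp, trans, dist


comp, trans, dist = ctd_one_attribute("RGCRGC")
print(len(comp), len(trans), len(dist))
```

Repeating this for seven attribute groupings and concatenating the results would give the 147-dimensional CTD vector.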
In addition to the aforesaid feature spaces, we encoded each sequence with a hybrid vector of 641 features combining the AAC, DPC, physicochemical, and CTD descriptors (20 + 400 + 74 + 147 = 641). This was done to examine whether the hybrid feature performs better than the individual feature models.

Algorithm implementation
The ability of models to differentiate between positive and negative data depends not only on the feature representation but also on the ML algorithm employed. Using SVM, we developed models to separate positive and negative peptides for each activity: antimicrobial, antibacterial, antifungal, and antiviral. We used the freely downloadable SVMlight package, version 6.02 (Joachims 1998), to develop the prediction models. The basic idea behind SVM is to map the peptide data into a higher-dimensional space in which they become linearly separable. Many studies have reported the effective use of SVM for small sample sizes owing to its strong learning and generalization abilities (Hongjaisee et al. 2019). We used the RBF kernel and tuned its parameters to obtain the best results. Furthermore, for comparison, we employed other algorithms commonly used in supervised classification problems, namely Naïve Bayes (NB), K-Nearest Neighbor (KNN), and RF. The WEKA package (Hall et al. 2009) was used to implement these three algorithms.
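The mapping into a higher-dimensional space is never computed explicitly; an SVM only needs a kernel function that measures the similarity of two feature vectors. A minimal Python sketch of the RBF kernel is given below for illustration. The function name and the example vectors are hypothetical, and the value of `gamma` is arbitrary rather than the tuned value from the study.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Two toy feature vectors (for example, truncated composition profiles).
u = [10.0, 0.0, 5.0]
v = [8.0, 1.0, 5.0]
print(rbf_kernel(u, v, gamma=0.01))
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart, at a rate controlled by `gamma`.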

Performance assessment and validation
Five-fold cross-validation was adopted for evaluating the performance of the models. Under this technique, the dataset is randomly divided into five equally sized sets; each set is used once for testing while the other four are used for training. To appraise the performance of models, both threshold-dependent and threshold-independent parameters are required. The threshold-dependent parameters are sensitivity (Sn), specificity (Sp), accuracy (Ac), and Matthews correlation coefficient (MCC). The ROC (Receiver Operating Characteristic) curve, summarized by the area under the curve (AUC), is a threshold-independent parameter. We calculated these parameters for all developed models to assess their performance. The following equations are commonly used in the literature to assess prediction quality:

$$\mathrm{Sn} = \frac{TP}{TP + FN}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP}$$

$$\mathrm{Ac} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. The MCC value ranges between -1 and +1 and signifies the correlation between true and predicted classes. Larger MCC and AUC values denote better prediction.
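The evaluation procedure can be sketched in Python as follows, using the standard definitions of sensitivity, specificity, accuracy, and MCC in terms of TP, TN, FP, and FN. The function names, the fixed random seed, and the example counts are hypothetical; the sketch covers only fold assignment and the threshold-dependent measures, not model training or AUC.

```python
import math
import random


def five_fold_indices(n_samples, seed=0):
    """Shuffle sample indices and deal them into five near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::5] for i in range(5)]


def threshold_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and MCC from confusion counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    ac = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, ac, mcc


folds = five_fold_indices(23)
for k, test_fold in enumerate(folds):
    # Each fold serves once as the test set; the other four form the training set.
    train = [i for j, f in enumerate(folds) if j != k for i in f]
    assert len(train) + len(test_fold) == 23

# Invented confusion counts for illustration.
print(threshold_metrics(tp=40, tn=45, fp=5, fn=10))
```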

Peptide structural analysis
This study focuses on classifying plant-derived antimicrobial peptides; hence, a better understanding of these peptides is equally important. According to earlier studies (Li et al. 2021), their composition is quite complex, and their structure plays a significant role in their functional classification (Hammami et al. 2009). Therefore, to understand the structural aspects of plant-derived antimicrobial peptides, ten sequences validated at the protein level were handpicked from the complete positive datasets used in this study. For secondary structure prediction, we used PSIPRED, available at http://bioinf.cs.ucl.ac.uk/psipred/, which uses two feed-forward neural networks to analyze the output of PSI-BLAST (Jones 1999; Buchan et al. 2013). Additionally, homology modeling to obtain a three-dimensional (3D) structure for the selected sequences was carried out with SWISS-MODEL, available on the ExPASy website at https://swissmodel.expasy.org/ (Waterhouse et al. 2018). In homology modeling, evolutionarily related protein structures are identified for the query sequences by searching them against the SWISS-MODEL template library (SMTL) (Biasini et al. 2014). SWISS-MODEL uses two database search methods for this task, BLAST (Altschul et al. 1997; Camacho et al. 2009) and HHblits (Remmert et al. 2011). BLAST is useful and accurate for closely related templates, while HHblits increases sensitivity in cases of remote homology. The templates found are ranked according to the overall quality of the resulting model, as assessed by their GMQE (Global Model Quality Estimate) (Biasini et al. 2014). The sequences were selected according to the following criteria: (i) peptide length less than 180aa; (ii) sequence identity to the template structure > 95%; and (iii) GMQE ≥ 0.9 for the structure model.
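The three selection criteria amount to a simple filter over candidate models. The Python sketch below shows one way to express it; the record layout, the peptide names, and all numeric values are invented for illustration and do not correspond to the ten sequences analyzed in the study.

```python
# Hypothetical candidate records; the values are invented for illustration.
candidates = [
    {"name": "pep_A", "length": 46, "identity": 100.0, "gmqe": 0.95},
    {"name": "pep_B", "length": 210, "identity": 98.0, "gmqe": 0.93},
    {"name": "pep_C", "length": 54, "identity": 91.5, "gmqe": 0.97},
    {"name": "pep_D", "length": 47, "identity": 97.2, "gmqe": 0.90},
]


def passes_criteria(rec):
    """Length < 180 aa, template identity > 95%, and GMQE >= 0.9."""
    return rec["length"] < 180 and rec["identity"] > 95.0 and rec["gmqe"] >= 0.9


selected = [rec["name"] for rec in candidates if passes_criteria(rec)]
print(selected)
```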

Development of PTPAMP web server
The web server was developed using the Apache HTTP server (version 2.4.6) integrated with PHP (version 7.3.3) on a server machine running CentOS 7 Linux. CSS and HTML were used to make the template responsive. The best predictive model, as described in the Results section, was deployed on the web server. First, the web server accepts peptide sequences in FASTA format as input. Second, after the user clicks the submit button, the query sequences are subjected to compositional feature calculation and subsequently fed to the predictive models described previously. The prediction for each activity, along with the SVM score for the query sequences, is displayed in the prediction output box. The displayed results can also be downloaded as a CSV file using the download button directly above the result table.
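The input and output handling described here can be illustrated with a short Python sketch that parses FASTA text and serializes results as CSV. The actual server is implemented in PHP, so this is not its code; the function names, the column headers, and the example score are hypothetical, and the prediction step itself is omitted.

```python
import csv
import io


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def results_to_csv(rows):
    """Serialize (id, activity, score) rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "activity", "svm_score"])  # assumed column names
    writer.writerows(rows)
    return buf.getvalue()


query = ">p1\nACD\nEFG\n>p2\nkkk\n"
records = list(parse_fasta(query))
print(records)
print(results_to_csv([("p1", "AMP", 0.87)]))  # invented score
```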

Results and discussion
In this study, a web-based method is proposed to classify plant-derived peptides by antimicrobial, antibacterial, antifungal, or antiviral activity. To select an appropriate classifying feature, we compared the performance of eight types of feature encodings: AAC (20D), DPC (400D), physicochemical (74D), composition (21D), transition (21D), distribution (105D), CTD (147D), and hybrid (641D). We also determined the best performing ML algorithm among four algorithms commonly used for supervised classification problems. Finally, to help users classify their query sequences with ease, PTPAMP is provided as a free plant-derived antimicrobial peptide prediction server.
The schematic framework of this study starting from the dataset collection to the prediction of plant-derived AMPs is summarized in Fig. 1.

Performance comparison of used feature spaces
To evaluate the efficacy of each feature in accurately predicting AMPs, five-fold cross-validation and independent tests were carried out for each activity using SVM as the classification algorithm. The AUCs obtained for the best performing feature model on the cross-validation datasets are 0.98, 0.97, 0.93, and 0.97 for the antimicrobial DPC model, the antibacterial AAC model, the antifungal DPC model, and the antiviral hybrid model, respectively. The AUCs of the respective feature models on the independent test datasets are 0.98, 0.99, 0.90, and 0.86 for antimicrobial, antibacterial, antifungal, and antiviral activities, respectively. The relative performance of the eight feature encodings on the cross-validation and independent test datasets is illustrated in Figs. 2 and 3, respectively. Comparing the results of each feature model shows that the performance of the DPC model, developed after optimizing the parameters on the cross-validation dataset, is comparable to its performance on the respective independent test dataset for each functional activity. Therefore, DPC was used to compare the four ML algorithms. Detailed results for the cross-validation and independent test sets are given in Supplementary Tables S3-S6. Apart from AAC and DPC, the other six feature encodings also achieved reasonable performance, with AUCs ranging from 0.80-0.97, 0.69-0.95, 0.72-0.92, and 0.74-0.97 for antimicrobial, antibacterial, antifungal, and antiviral, respectively. These AUC ranges indicate the utility of the remaining six features in peptide prediction, as they describe the peptides from a complementary viewpoint. The physicochemical models contribute relatively little to classifying peptides correctly for each activity. However, when physicochemical features are used along with the AAC, DPC, and CTD descriptors, performance increases to a certain level. This signifies that physicochemical features alone are not sufficient to represent AMPs for classification. In general, all four activities share common physicochemical properties, i.e., they are mostly cationic, rich in hydrophobic and amphiphilic residues, and have both hydrophobic and hydrophilic regions (Yan et al. 2015; Lei et al. 2019).

Choosing the finest classification algorithm
The prediction accuracy of AMPs depends on multiple factors, including datasets, peptide feature encodings, and machine learning algorithms (Wang et al. 2022). In general, no single algorithm is best for every classification problem; hence, as a rule of thumb, several algorithms are tried to find the one best suited to the data. As mentioned before, DPC is the most relevant feature for differentiating between positive and negative data for each activity. Thus, to corroborate the use of DPC as a classifying feature, four different ML models were trained with the DPC feature and their performance was examined on five-fold cross-validation and independent test datasets. The detailed performance of all models for each activity is listed in Table 2. As stated in Table 2, the models trained with the set 2 datasets of antimicrobial, antibacterial, and antiviral activities achieved AUCs of (0.98, 0.90, 0.97, and 0.87), (0.96, 0.67, 0.95, and 0.92), and (0.95, 0.88, 0.94, and 0.94) for the SVM, KNN, RF, and NB algorithms, respectively. For antifungal activity, the DPC model developed with the set 1 dataset achieved better performance on five-fold cross-validation, with AUCs of 0.94, 0.73, 0.92, and 0.82 for the SVM, KNN, RF, and NB algorithms, respectively. Therefore, the independent test datasets of set 2 were selected for antimicrobial, antibacterial, and antiviral activity, and that of set 1 for antifungal activity, for benchmarking purposes, as outlined in Table 1. To demonstrate the applicability and effectiveness of the proposed SVM model, we compared its results with those of the other three algorithms. As evident from Table 2, the overall accuracy and MCC of the SVM models are higher than those of KNN, RF, and NB on the five-fold cross-validation and independent test datasets. The SVM-derived models are therefore the most effective for the classification problems addressed in this study.

Fig. 1 Schematic framework of PTPAMP. The major steps include: (i) dataset construction, (ii) representation of peptide sequences by eight feature encodings, (iii) construction of prediction models using ML algorithms, and (iv) performance assessment by cross-validation and benchmarking datasets.

Comparison with existing antimicrobial peptide predictors
Although no prediction method dedicated to plant-derived antimicrobial peptides has been available so far, we compared PTPAMP with existing approaches, namely amPEPpy 1.0 (Lawrence et al. 2020), Antifp (Agrawal et al. 2018), Meta-iAVP (N et al. 2019), and iAMPpred (Meher et al. 2017), in terms of predicting plant-derived AMPs. The AUC obtained with the datasets outlined in Table 1 was examined for all eight feature models across each activity. The antimicrobial model based on DPC is comparatively efficient in distinguishing between the positive and negative data of the benchmark datasets, as shown in Fig. 4. In contrast, the AAC-based model achieved higher AUCs for the other three activities, namely antibacterial, antifungal, and antiviral. Therefore, the best performing model for each activity on the six benchmark datasets was selected for implementation in the web server. The selected models were also compared with the state-of-the-art predictors, and the results are displayed in Fig. 5. The comparative performance results listed in Table S7 show that our proposed model for each activity classifies plant-derived antimicrobial peptides more effectively. Even though the negative data for this study were taken from previously developed tools, the performance of the respective tools was not satisfactory when plant-derived peptides were given as queries. The improved performance of our proposed models can be attributed to the following aspects: (i) the existing predictors were developed on datasets from multiple organisms, causing the models to learn generalized features, whereas we developed models specifically for plant-derived datasets; (ii) the parameters used for our proposed models were optimized on five-fold cross-validation datasets, which should make them more accurate and stable; (iii) among the distinct features used for constructing models in this study, we found significant differences between positive and negative data in the compositional analysis, i.e., AAC and DPC. Many studies have reported the successful use of these two features in predicting peptides and proteins (Bhasin and Raghava 2004; Raghava and Han 2005; Garg and Raghava 2008; Kumar et al. 2008, 2015; Lata et al. 2010; Thakur et al. 2012; Gautam et al. 2013; Gupta et al. 2016).
Analysis of the literature revealed the successful application of binary features in classifying antifungal peptides (Agrawal et al. 2018). However, a limitation of this feature is that it requires peptides of equal length for developing prediction models. As described in Sect. 3.2, our datasets are not large, so the binary feature was not included in this study to avoid reducing the dataset size. Additionally, the prediction models were developed by keeping sequence count and length balanced, which is beneficial for classifying peptides of varied lengths precisely.

Structure analysis of the selected sequences
The secondary structure predicted by PSIPRED is shown in Fig. 6. With the increase of alpha-helical structures in peptides, their hydrophobicity also increases in aqueous environments (Chen et al. 2007). Peptide hydrophobicity is believed to be an important factor in their antimicrobial activity (Huan et al. 2020). In lieu of the higher or lower hydrophobicity of a peptide, an optimum hydrophobicity window is accountable for its potent antimicrobial function. There will be a decreased antimicrobial activity with the increase in hydrophobicity. The explanation for this trend is the self-association of peptides, which acts as a hindrance for peptide entry into the bacterial cell (Chen et al. 2007). The 3D structure generated by SWISS-MODEL for the Fig. 3 Performance comparison of SVM models on respective independent test data of three sets and eight feature encodings for a antimicrobial, b antibacterial, c antifungal, and d antiviral functional activity selected sequences is represented in Fig. 7. To assess the quality of structure, GMQE, QMEANDisCo global scores (Studer et al. 2020), and the percentage of residues favored by Ramachandran were calculated for each sequence as listed in Table 3. The total percentage of amino acid residues of each sequence found in Ramachandran favored and allowed region is > 90%, making the constructed structures reasonable and convincing. The ideal case would be if  . 4 Comparison between the performance of SVM models prepared with respect to eight feature encodings when checked their results on six benchmark datasets. a, b, c, and d, respectively, describe the results obtained for antimicrobial, antibacterial, antifungal, and antiviral models more than 98% of amino acid residues were present in the favored region (Chen et al. 2010). In this study, the respective ideologies are fulfilled by the structures of viscotoxin A3, HsAFP1, beta-hordothionin, and ginkbilobin-2 with 100%, 98.08%, 100%, and 99.06% residues, respectively. 
Moreover, the structures presented in Fig. 7 are determined by high-resolution X-ray diffraction, which is a preferable method in homology modeling. The structures recorded by SWISS-MODEL for beta-hordothionin and NaD1 have four and two salt bridges, respectively. The formation of salt bridges in a peptide confers proteolytic stability and aids peptide folding, but is not vital for microbicidal action (Andersson et al. 2012).
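The favored-region criterion described above can be checked with a few lines of Python. This is only an illustrative sketch: the percentages are the four values quoted in the text, while the function name `meets_ideal` and the constant `IDEAL_FAVORED_PCT` are hypothetical names introduced here, not part of PTPAMP or SWISS-MODEL.

```python
# Percentage of residues in the Ramachandran favored region,
# as quoted in the text for four of the modeled structures.
favored_pct = {
    "viscotoxin A3": 100.0,
    "HsAFP1": 98.08,
    "beta-hordothionin": 100.0,
    "ginkbilobin-2": 99.06,
}

# "More than 98%" criterion for an ideal model (Chen et al. 2010).
# The constant name is an assumption made for this sketch.
IDEAL_FAVORED_PCT = 98.0


def meets_ideal(pct, cutoff=IDEAL_FAVORED_PCT):
    """Return True if strictly more than `cutoff` percent of residues are favored."""
    return pct > cutoff


ideal = {name: meets_ideal(pct) for name, pct in favored_pct.items()}

for name, ok in ideal.items():
    print(f"{name}: {favored_pct[name]:.2f}% favored, ideal={ok}")
```

The comparison is strict because the text states the ideal as "more than 98%", so a structure with exactly 98% favored residues would not qualify under this reading.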

Structural insights based on amino acid residue preferences
Most plant-derived AMPs, such as cyclotides and defensins, belong to the β or the αβ family. The dominant amino acid differs by structural family. For instance, the β and the αβ families are dominated by cysteine (C), the hydrophobic amino acid required for peptide folding. The abundance of cysteine residues in plant peptides indicates frequently occurring disulfide bonds, which confer metabolic stability (Cole and Cole 2008). The presence of disulfide bonds also means that the majority of plant-derived peptides are disulfide-bonded defensin-like molecules (Li et al. 2021). Other than C, arginine (R) and lysine (K) are equally distributed in the αβ family, whereas glycine (G) and serine (S) are the two favored amino acids across all families (Mishra and Wang 2012). Moreover, the presence of short-chain residues like glycine and serine in antimicrobial peptides accounts for their disordered regions, which provide structural flexibility to such peptides (Tavares et al. 2012). Plant cyclotides are mainly rich in C, G, T/S, and K, which maintain a widespread β-sheet-containing scaffold (Wang 2010). In short, plant-derived peptides are mostly rich in cysteine and glycine. The presence of C, R, K, G, and S in our peptide datasets is shown in Fig. 8. According to Wang et al. (2009), the amino acid composition of natural AMPs is directly related to their 3D structures.
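The residue preferences discussed above rest on amino acid composition (AAC), i.e., the percentage of each residue in a sequence. A minimal sketch of that calculation is given below. The function name `amino_acid_composition` and the toy peptide are hypothetical and chosen only for illustration; the toy sequence is not taken from the PTPAMP datasets, and the way non-standard characters are handled is an assumption.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the percentage of each standard residue in `seq`.

    Characters outside the 20 standard residues are ignored
    (an assumption of this sketch).
    """
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical toy peptide, used only to exercise the function.
toy_peptide = "GCCSKRGCGS"
aac = amino_acid_composition(toy_peptide)

# Report the residues highlighted in the text.
for residue in "CGSKR":
    print(f"{residue}: {aac[residue]:.1f}%")
```

Applied to every sequence in a dataset and averaged per structural family, the same function would give the kind of per-residue comparison the paragraph describes.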

PTPAMP web server
Fig. 6 Predicted secondary structures of selected sequences that have been validated at protein level. Secondary structures were predicted using PSIPRED with moderate confidence of prediction. The structures show that the majority of plant AMPs belong to the β or αβ family

In an effort to maximize the utility of plant-derived AMPs, we established a web server, PTPAMP, aimed at the wider plant research community and enabling its users to predict plant-derived AMPs. To allow validation of our work, the datasets used in this study can be downloaded from the server. The steps for using the server are summarized below. First, the peptide screening module lets users enter query peptide sequences (in FASTA format) in the provided text box or upload a peptide FASTA file from their system. This module categorizes queries into four functional activities (AMP, ABP, AFP, or AVP) based on the score computed by integrated feature models. Results are shown in tabular form. Along with prediction scores, four columns report the charge, molecular weight, isoelectric point, and hydrophobicity of each sequence. Additionally, a link to peptide properties leads to relevant plots, such as AAC (regarding physicochemical properties), hydrophobicity, charge, and hydrophobic moment. In the next column, a BLAST (Altschul et al. 1990) option is given to search for similar sequences in PlantPepDB. Second, the peptide designing module lets a user enter a query peptide sequence and obtain the probable peptide mutants with a single amino acid substitution. Based on the predicted scores and the peptide properties obtained from pepcalc (https://pepcalc.com/), which is placed on our server, users can ultimately select the best peptide mutant.
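The single-substitution step of the peptide designing module can be pictured with the short sketch below. It assumes that each position is replaced in turn by each of the other 19 standard amino acids; the server may restrict or rank substitutions differently. The function name `single_substitution_mutants`, the mutation labels, and the three-residue query are hypothetical.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitution_mutants(seq):
    """Yield (label, mutant) pairs for every single amino acid substitution.

    Labels follow the common <original><1-based position><new> style,
    e.g. "K1A". This enumeration is an assumption of the sketch,
    not the server's documented behavior.
    """
    seq = seq.upper()
    for pos, original in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != original:
                label = f"{original}{pos + 1}{aa}"
                yield label, seq[:pos] + aa + seq[pos + 1:]


# Hypothetical three-residue query, kept short so the output stays small.
mutants = dict(single_substitution_mutants("KCG"))
print(len(mutants), "mutants, e.g. K1A ->", mutants["K1A"])
```

A peptide of length n yields 19 × n mutants under this scheme, each of which could then be scored and compared by its predicted activity and physicochemical properties.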
Third, the protein scan module generates peptides of the desired length and number of overlapping residues, and each peptide is displayed with its prediction scores and motif search results. Users can paste the query protein sequence in the given text box or upload a FASTA file. This module helps identify regions of a protein sequence that may be classified as AMPs. In all three modules, there is an option to select the SVM threshold; users are advised to keep its value high for results with high specificity and low for results with high sensitivity.
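The protein scan and the threshold option can be sketched as follows. The sketch assumes that "overlapping residues" means the number of residues shared by consecutive peptides and that a peptide is called positive when its score is at or above the threshold; both are assumptions, as are the function names, the toy protein string, and the made-up scores, none of which come from the server.

```python
def overlapping_peptides(protein, length, overlap):
    """Return (1-based start, peptide) pairs from a sliding window.

    Consecutive windows share `overlap` residues, so the window
    advances by `length - overlap` positions each step.
    """
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("require length > 0 and 0 <= overlap < length")
    step = length - overlap
    last_start = len(protein) - length
    return [
        (start + 1, protein[start:start + length])
        for start in range(0, last_start + 1, step)
    ]


def predicted_positive(scores, threshold):
    """Return the names whose score is at or above `threshold`."""
    return [name for name, score in scores.items() if score >= threshold]


# Toy string standing in for a protein sequence (hypothetical).
windows = overlapping_peptides("ABCDEFGHIJ", length=4, overlap=2)
print(windows)

# Made-up scores, used only to show the effect of the threshold.
toy_scores = {"pep1": 0.9, "pep2": 0.2, "pep3": -0.4}
print(predicted_positive(toy_scores, threshold=0.0))
print(predicted_positive(toy_scores, threshold=0.5))
```

Raising the threshold keeps only the highest-scoring peptides, which trades sensitivity for specificity, in line with the guidance given for the three modules.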

Conclusions
To explore the properties of plant-derived peptides, we developed a predictor named PTPAMP that classifies peptide activity as antimicrobial, antibacterial, antifungal, or antiviral. In this predictor, an optimal feature set was selected after individually observing the performance of eight feature encodings. The compositional feature set, i.e., AAC and DPC, is expected to integrate multiple aspects of sequence information, thereby showing consistent performance across cross-validation and independent test datasets. To increase the robustness of these models, we trained them with multiple ML algorithms and, after comparing them, found SVM to be the best classifier. Furthermore, to make sure that our proposed models are sufficiently effective in classifying plant-derived AMPs, we carried out a comparative analysis with existing AMP predictors. Most existing AMP predictors can only classify a peptide into a single activity at a time, whereas PTPAMP users can classify query sequences into four different activities on a single platform. Given the evidence for the broad-spectrum activity of antimicrobial peptides (Ageitos et al. 2017; Campos et al. 2018), PTPAMP can support ongoing research toward understanding and sub-classifying the functional activity of plant-derived peptides. For the convenience of users, the models were deployed as a freely available, user-friendly web server. Besides functional activity prediction, our proposed framework can be used to generate peptide variants, either by producing mutant analogs of a given peptide sequence or by scanning given protein sequences and creating overlapping peptides of a user-specified window size. Additionally, as the number of experimentally verified plant peptides and novel features grows, the prediction of plant-derived AMPs is expected to improve.
PTPAMP is intended to serve as a valuable tool for the high-throughput and cost-effective identification of plant-derived AMPs, followed by their characterization according to physicochemical properties.