Characterization and Prediction of Dengue Virus Targeting Peptides Based on Combined Amino Acid Composition Descriptors Using Random Forest Algorithm.

Dengue virus peptides are emerging as potential therapeutics for dengue infection. Because these peptides play an important role in curbing dengue infection, their identification is crucial for infection biology. In this work, statistical tests and F-scores were used to quantify differences in amino acid and physicochemical attributes. The random forest algorithm was used to predict dengue peptides from grouped amino acid composition, transition and distribution. We used three descriptors: amino acid composition (AAC), grouped amino acid composition (GAAC), and composition, transition and distribution features (CTDC). We built a model for each descriptor and compared it with a combined model. Using the grouped amino acid composition as input parameters for the random forest algorithm, our classifier's overall accuracy increased to 88.80%, which was the greatest overall accuracy found in this investigation. Our classifier produced superior prediction outcomes compared with previously developed algorithms. In conclusion, we examined the differences in amino acids and physicochemical properties between dengue viral peptides, using the grouped amino acid composition to build a classifier that predicts dengue virus inhibitory peptides. The accuracy of the AAC_RF and GAAC_RF models improved to 88% and 87%, respectively.


Introduction
Dengue virus (DENV) is a mosquito-borne flavivirus that frequently infects people in tropical and subtropical areas. According to the World Health Organization, over 40% of the world's population is at risk of dengue infection [1]. Dengue virus infection can cause severe illness, known as dengue haemorrhagic fever (DHF). DHF is mainly characterized by vascular leakage, which can develop into life-threatening dengue shock syndrome (DSS) [2] and accounts for the high mortality of DHF/DSS. DENV non-structural protein 1 (NS1) is a 48-kDa glycoprotein that is highly conserved among all flaviviruses [3]. NS1 is essential for viral replication and immune evasion [4] [5]. Secreted NS1 is pathogenic: it triggers hyperpermeability of human endothelial cells in vitro and systemic vascular leakage in vivo [6]. NS1 disrupts the endothelial glycocalyx layer (EGL), inducing the shedding of heparan sulfate glycoprotein and the degradation of sialic acid. It has been shown that NS1 activates cathepsin L, which in turn activates heparanase via enzymatic cleavage. This enzyme breaks down heparan sulfate proteoglycans. Therefore, DENV patients have high levels of heparan sulfate and sialic acid in their serum [7].
The use of peptides as therapeutic agents for DENV infection has previously been investigated. As competitive inhibitors of virus entry and replication, these peptides were engineered to disrupt active regions of viral proteins or to mimic specific sections of viral proteins. Peptide inhibitors have been shown to target the viral structural proteins C, prM, and E, as well as viral NS1, NS2B/NS3 protease, and NS5 methyltransferase during DENV infection [8,9,10,11,12,13,14,15,16,17,18,19].
Here, we propose a classification algorithm to predict dengue virus inhibiting peptides using three main descriptors: amino acid composition (AAC), grouped amino acid composition (GAAC) and CTDC. The binary dataset for developing the machine learning model was taken from the literature and from dengue peptide-oriented databases. The Random Forest (RF) machine learning algorithm was applied to predict the top 5 models for each of the three descriptors. We compared each model with the combined descriptor model. The descriptors contributing to high model accuracy were amino acid composition and grouped amino acid composition.
These models were used to predict dengue virus inhibiting and non-inhibiting peptides. Of all the developed models, the best results were obtained with the AAC_RF and GAAC_RF models, which suggests that our classifier is better at predicting dengue viral peptides.

Dataset
In this study, 89 dengue virus inhibiting peptides were downloaded from AVPdb, a database of antiviral peptides that have been experimentally confirmed against medically significant viruses [20]. A further 11 peptides were taken from a paper entitled "Peptides targeting dengue viral nonstructural protein 1 inhibit dengue virus production". The negative dataset was taken from the AVPdb database [19].
All the peptide sequences were checked with Cluster Database at High Identity with Tolerance (CD-HIT) [21] in order to generate a high-quality dataset for this research. Finally, we split both datasets into training and testing sets in a 7:3 ratio.

Descriptor selection
We selected three descriptors. (1) Amino acid composition (AAC) calculates the frequency of each amino acid in a peptide sequence. (2) Grouped amino acid composition (GAAC) categorizes the twenty amino acids into five classes (aliphatic, aromatic, positively charged, negatively charged, uncharged) and calculates the frequency of each class. (3) The composition, transition and distribution (CTDC) features represent the amino acid distribution patterns of a specific structural or physicochemical property in a peptide sequence. We used the iLearnPlus web server [22] for descriptor selection and machine learning model development.
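As an illustration of the first two descriptors, the following minimal Python sketch computes AAC and GAAC for a single peptide. It is not the iLearnPlus implementation; the function names are hypothetical, and the assignment of residues to the five classes is an assumption based on the grouping commonly used for GAAC, since the text does not list the members of each class.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed class membership (the text names the five classes but not their members).
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}


def aac(seq):
    """Return the frequency of each of the 20 standard amino acids in seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty peptide sequence")
    counts = Counter(seq)
    # Non-standard residues count toward the length but are not reported.
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def gaac(seq):
    """Return the frequency of each amino acid class in seq."""
    freqs = aac(seq)
    return {name: sum(freqs[aa] for aa in members)
            for name, members in GROUPS.items()}
```

For a made-up peptide such as "WWGA", `aac` reports a tryptophan frequency of 0.5 and `gaac` reports 0.5 for both the aromatic and aliphatic classes, giving 20 and 5 values per peptide, respectively.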

Clustering and dimensionality reduction
The data for the three descriptors were used as input for clustering. K-means clustering was used with two clusters. The basic idea is to initialize the cluster centers, assign each point to its nearest center, update each center to the mean of its member points, and repeat the process until convergence [23].
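The iteration described above can be sketched in plain Python as follows. This is a minimal illustration, not the clustering code used by iLearnPlus; the function name, the random choice of initial centers and the `seed` and `max_iter` parameters are assumptions made for the example.

```python
import random


def kmeans(points, k=2, max_iter=100, seed=0):
    """Cluster points (sequences of floats) into k groups with Lloyd's iteration."""
    rng = random.Random(seed)
    # Initialization (assumed): pick k distinct data points as starting centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # Converged: no assignment changed.
        labels = new_labels
        # Update step: each center becomes the mean of its member points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

With k = 2, each peptide's descriptor vector receives label 0 or 1, which can then be compared against the inhibiting and non-inhibiting classes.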
Principal component analysis (PCA) is used to describe the most informative sources of variation in the data [24]. We applied PCA to the descriptor data for dimensionality reduction and retained the first three principal components. The dimensionality-reduced data were used as input for feature selection and normalization.
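A minimal sketch of this reduction to three components is given below, using NumPy's singular value decomposition. The function name is hypothetical, and iLearnPlus may compute the projection differently; the sketch only illustrates the standard centering and projection steps.

```python
import numpy as np


def pca(X, n_components=3):
    """Project the rows of X onto the first n_components principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # Center each feature at zero.
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)  # Fraction of variance per component.
    return scores, explained[:n_components]
```

Applied to a descriptor matrix with one row per peptide, `scores` has three columns, and `explained` indicates how much of the total variance those three components retain.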

Feature selection and normalization
The F-score is used for class discrimination: it measures the discrimination between two sets of real numbers [25]. For feature selection, the F-score was computed and the 10 best features were retained. The feature values were transformed into three principal components. The features were normalized using the Z-score, which is also used to normalize microarray data [26].
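The two operations in this step can be sketched as follows. The F-score shown is the commonly used definition (squared deviation of the class means from the overall mean, divided by the sum of the within-class sample variances); we assume, rather than know, that this is the variant computed by the software. The function names are hypothetical.

```python
from statistics import mean, pstdev


def f_score(pos, neg):
    """F-score of one feature, given its values in the positive and negative class."""
    overall = mean(pos + neg)
    mean_pos, mean_neg = mean(pos), mean(neg)
    between = (mean_pos - overall) ** 2 + (mean_neg - overall) ** 2
    # Within-class sample variances (needs at least two values per class).
    within = (sum((x - mean_pos) ** 2 for x in pos) / (len(pos) - 1)
              + sum((x - mean_neg) ** 2 for x in neg) / (len(neg) - 1))
    return between / within if within > 0 else float("inf")


def z_normalize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)  # Constant feature: nothing to scale.
    return [(v - mu) / sigma for v in values]
```

Ranking all features by `f_score` and keeping the ten largest values corresponds to the selection described above; a larger score means the feature separates the two classes more clearly.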

Machine learning
A large part of machine learning is classification: we want to know the class of a new peptide (dengue inhibiting or non-inhibiting). We chose random forest (RF) because it is a robust algorithm for classification; its uncorrelated trees produce ensemble predictions that are more accurate than any of the individual predictions [27]. The normalized dataset (training set: 102 × 3; testing set: 14 × 3, that is, samples × principal components) was taken as input and loaded for machine learning. The random forest algorithm was selected with the following parameters: tree number, 1000; number of threads, 2; tree range, from 50 to 500 in steps of 50. Five-fold cross-validation was used.
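For readers who wish to reproduce a comparable setup outside the web server, a minimal scikit-learn sketch is given below. It is not the iLearnPlus code: we assume that the tree range (50 to 500, step 50) is a grid searched by cross-validation, and the use of stratified folds, accuracy as the selection criterion, the `seed` parameter and the function name are all our own choices for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold


def fit_rf(X, y, tree_counts=range(50, 501, 50), folds=5, n_jobs=2, seed=0):
    """Pick the number of trees by cross-validation and refit on all of X, y."""
    # Stratified folds (assumed) keep the class ratio similar in every fold.
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    search = GridSearchCV(
        RandomForestClassifier(random_state=seed),
        param_grid={"n_estimators": list(tree_counts)},
        cv=cv,
        scoring="accuracy",
        n_jobs=n_jobs,  # Mirrors the "number of threads: 2" setting.
    )
    search.fit(X, y)
    return search.best_estimator_, search.best_params_["n_estimators"]
```

Here `X` would be the normalized principal component matrix of the training set and `y` the 0/1 class labels; the returned model can then be applied to the testing set with `predict` or `predict_proba`.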

Model validation in testing data
The AAC_RF, GAAC_RF, CTDC_RF and combined_model_RF models were validated on the testing data. The ROC and PRC curves were plotted, and the evaluation metrics were reported.
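As an illustration of the threshold-based part of such an evaluation, the sketch below derives common metrics from predicted and true labels. The specific set of metrics shown here is an assumption (the text does not enumerate them at this point), the function name is hypothetical, and the area under the ROC and PRC curves is not covered.

```python
from math import sqrt


def binary_metrics(y_true, y_pred):
    """Metrics for labels coded 1 (inhibiting) and 0 (non-inhibiting)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0  # Undefined ratios are reported as 0.

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
        "mcc": ratio(tp * tn - fp * fn,
                     sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }
```

Because the positive and negative sets differ considerably in size, specificity and the Matthews correlation coefficient are worth reading alongside accuracy: a model that labels every peptide as inhibiting would still reach a high accuracy.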
Results And Discussion

Dataset
Following the protocol of the iLearnPlus web server, we annotated all sequences for classification. Each sequence is written with a header line of the form >name|class|category, followed by the sequence itself. Any name can be given (alphanumeric with underscores). We had two classes: 1 for dengue virus inhibiting peptides and 0 for dengue virus non-inhibiting peptides. The category indicates whether the sequence belongs to the training or the testing dataset. In total, we collected 100 experimentally validated dengue virus inhibiting peptides and split them into training and testing sets in a 7:3 ratio. Similarly, we had 16 negative peptides, which we also split in a 7:3 ratio. These datasets are provided in the Supplementary file (S1).
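The annotated format can be read back with a few lines of Python, as sketched below. The parser and its function name are hypothetical, the two peptide sequences in the usage example are made up for illustration, and the category strings "training" and "testing" are assumed spellings.

```python
def parse_records(text):
    """Parse '>name|class|category' headers followed by sequence lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: three fields separated by '|'.
            name, label, category = line[1:].split("|")
            records.append({"name": name, "label": int(label),
                            "category": category, "sequence": ""})
        elif records:
            # Sequence lines are appended to the most recent header.
            records[-1]["sequence"] += line
    return records


example = """>pep_001|1|training
GWFKKL
>pep_002|0|testing
AAKDE
"""
records = parse_records(example)
```

Each record then carries the class label (1 or 0) and the training/testing assignment alongside the sequence, which is all that is needed to build the descriptor matrices.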

Descriptor generation and data distribution
We generated descriptors for all 116 peptides. The 20 descriptors generated under AAC are given in Supplementary Table 1. Each numeric value indicates the frequency of an amino acid in the peptide. In AAC, tryptophan frequency differentiates dengue virus inhibiting peptides from non-inhibiting peptides: in non-inhibiting peptides, the occurrence of tryptophan is almost 0. Several studies have shown that tryptophan is very important for antimicrobial activity [28,29]. Similarly, the frequencies of glycine, tryptophan and phenylalanine are lower in non-inhibiting peptides than in inhibiting peptides (Table 1), which is well supported by a published article [30].
The 5 descriptors generated under GAAC are given in Supplementary Table 2. In GAAC, aromatic amino acids in non-inhibiting peptides were found to be less than 5%. It has been reported that aromatic amino acids play a vital role in viral defense [31]. The 39 descriptors generated under CTDC are given in Supplementary Table 3. In CTDC, the distribution of solvent-accessible residues was found to be lower in non-inhibiting peptides. Alpha helices and beta sheets are equally distributed in dengue virus inhibiting peptides, whereas in non-inhibiting peptides the proportion of beta sheets is higher than that of alpha helices. The alpha-helical content of a peptide determines its antiviral activity [32]. The data distribution for AAC, GAAC and CTDC is given in Figure 1.
All descriptors under AAC, GAAC and CTDC were used for clustering (Figure 2) and dimensionality reduction (Figure 3). The top 10 features were selected and transformed into three principal components. Further, the principal component values for each sequence were normalized. The normalized data for AAC, GAAC and CTDC are shown in Figure 4.
The normalized data were used as input for model development using the Random Forest (RF) algorithm, which is widely used for understanding and predicting antiviral peptides [44]. The metrics of all models (AAC_RF, GAAC_RF, CTDC_RF and combined_RF) are given in Table 2, where the best predictive results of our classifier are shown in bold. The ROC and PRC curves for all models are shown in Figure 5, and boxplots for all models across 8 evaluation parameters are shown in Figure 6. Across these eight parameters, the AAC and GAAC models showed good prediction output. The correlation values between models are given in Figure 7. The highest correlation, 0.9978, was found between the AAC_RF and CTDC_RF models.
The predictive performance obtained in our study demonstrates that the combined descriptors (AAC (20 descriptors), GAAC (5 descriptors) and CTDC (39 descriptors)) with Random Forest are suitable for predicting peptides that inhibit dengue virus, but overall AAC and GAAC with random forest are the best choice for model development and prediction. The models were evaluated on the testing data. The ROC and PRC curves are plotted in Figure 8, and the evaluation metrics of all models are given in Table 3. Compared with the regular amino acid composition, the grouped amino acid composition, transition and distribution reduces information redundancy and overfitting and simplifies protein complexity.
To determine which amino acids and biological features were most discriminative between dengue virus inhibiting and non-inhibiting peptides, we analysed differences in amino acids and biological properties. We aimed to create a classifier that could predict dengue virus inhibitory peptides based on the composition, transition, and distribution of grouped amino acids. These descriptors therefore served as the input parameters for RF.

Conclusion
There is currently no effective dengue virus (DENV) therapeutic. In this study, we presented the first evidence, to our knowledge, of how dengue virus inhibiting and non-inhibiting peptides differ in amino acid usage and biological properties. We found that the frequencies of glycine (G), phenylalanine (F), and tryptophan (W) were significantly higher in dengue virus inhibitory peptides. Similarly, aromatic amino acids in non-inhibiting peptides were found to be less than 5%. The distribution of solvent-accessible residues was lower in non-inhibiting peptides than in inhibiting peptides. Alpha helices and beta sheets are equally distributed in dengue virus inhibiting peptides, whereas in non-inhibiting peptides the proportion of beta sheets is higher than that of alpha helices. An RF algorithm was applied to the three descriptors (AAC, GAAC and CTDC) to predict dengue virus inhibiting peptides. The predictive performance obtained in our study demonstrates that these descriptors combined with RF are suitable for predicting these two peptide categories. We also developed a combined model, but its accuracy was comparatively lower. The AAC_RF and GAAC_RF models achieved accuracies of 88% and 87%, respectively. Based on these data, we believe that our classifier, which uses the scheme of grouped amino acid composition, transition and distribution, may facilitate the prediction of dengue virus inhibiting peptides.
Declarations
Figure 1 The data distribution for AAC, GAAC and CTDC.
Figure 2 Clustering of all descriptors under AAC, GAAC and CTDC. The amino acid composition of a protein has been widely utilized for the prediction of peptide categories [33][34][35][36][37][38][39][40][41][42][43].
Figure 3 Dimensionality reduction of all descriptors under AAC, GAAC and CTDC.
Figure 4 The normalized data for AAC, GAAC and CTDC.
Figure 5 The ROC and PRC curves for all models.
Figure 6 The boxplot for all models with 8 different evaluation parameters.
Figure 8 The ROC and PRC curves for the models evaluated on the testing data.