DOI: https://doi.org/10.21203/rs.3.rs-1092942/v1
Dengue virus peptides are emerging as potential therapeutics for dengue infection. Because these peptides play an important role in curbing dengue infection, identifying them is crucial for infection biology. In this work, statistical tests and F-scores were used to quantify differences in amino acid usage and physicochemical attributes. The random forest algorithm was used to predict dengue peptides from three descriptors: amino acid content (AAC), grouped amino acid composition (GAAC), and composition, transition and distribution features (CTDC). We built a model for each descriptor and compared these models with a combined model. With amino acid content as the input to the random forest algorithm, our classifier reached an overall accuracy of 88.80%, the highest found in this investigation; the grouped amino acid composition model followed at 87.06%. Our classifier produced better predictions than previously developed algorithms. In conclusion, we examined the differences in amino acids and physicochemical properties between dengue virus inhibiting and non-inhibiting peptides, and used amino acid content and grouped amino acid composition to build a classifier that predicts dengue virus inhibitory peptides.
Dengue virus (DENV) is a mosquito-borne flavivirus that frequently infects people in tropical and subtropical areas. According to the World Health Organization, over 40% of the world’s population is at risk of dengue infection [1]. Dengue virus infection can cause severe illness, known as dengue haemorrhagic fever (DHF). DHF is characterized chiefly by vascular leakage, which can progress to life-threatening dengue shock syndrome (DSS) [2], and DHF/DSS carries high mortality. DENV non-structural protein 1 (NS1) is a 48-kDa glycoprotein that is highly conserved among flaviviruses [3]. NS1 is essential for viral replication and immune evasion [4][5]. Secreted NS1 is pathogenic: it triggers hyperpermeability of human endothelial cells in vitro and systemic vascular leakage in vivo [6]. NS1 disrupts the endothelial glycocalyx layer (EGL), inducing shedding of heparan sulfate proteoglycans and degradation of sialic acid. NS1 has been shown to activate cathepsin L, which in turn activates heparanase by enzymatic cleavage; heparanase then breaks down heparan sulfate proteoglycans. As a result, DENV patients have elevated heparan sulfate and sialic acid in their serum [7].
The use of peptides as therapeutic agents for DENV infection has previously been investigated. These peptides were engineered as competitive inhibitors of virus entry and replication, either disrupting active regions of viral proteins or mimicking specific sections of them. During DENV infection, peptide inhibitors have been shown to target the viral structural proteins C, prM, and E, as well as viral NS1, the NS2B/NS3 protease, and the NS5 methyltransferase [8–19].
Here, we propose a classification approach to predict dengue virus inhibiting peptides using three main descriptors: amino acid content (AAC), grouped amino acid composition (GAAC) and CTDC. The binary dataset for developing the machine learning models was taken from the literature and from dengue peptide-oriented databases. The Random Forest (RF) machine learning algorithm was applied to obtain the top 5 models for each of the three descriptors. We compared each model with the combined-descriptor model. The descriptors contributing most to model accuracy were amino acid content and grouped amino acid composition.
These models were used to classify peptides as dengue virus inhibiting or non-inhibiting. Of all the models developed, the AAC_RF and GAAC_RF models gave the best results, suggesting that classifiers built on these descriptors are better at predicting dengue virus inhibiting peptides.
In this study, dengue virus inhibiting peptides were downloaded from AVPdb, a database of antiviral peptides that have been experimentally confirmed against medically significant viruses [20]; it contained 89 dengue virus inhibiting peptides. A further 11 peptides were taken from the paper "Peptides targeting dengue viral nonstructural protein 1 inhibit dengue virus production". The negative dataset was also taken from AVPdb [19]. All peptide sequences were checked with Cluster Database at High Identity with Tolerance (CD-HIT) [21] to generate a high-quality dataset for this research. Finally, we split both datasets into training and testing sets in a 7:3 ratio.
We selected three descriptors: (1) amino acid content (AAC), which calculates the frequency of each amino acid in a peptide sequence; (2) grouped amino acid composition (GAAC), in which the twenty amino acids are categorized into five classes (aliphatic, aromatic, positive, negative, uncharged) and the frequency of each class is calculated; and (3) composition, transition and distribution (CTDC) features, which represent the amino acid distribution patterns of a specific structural or physicochemical property in a peptide sequence. We used the iLearnPlus web server [22] for descriptor selection and machine learning model development.
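As an illustration of the first two descriptors, the following Python sketch computes AAC and GAAC for a single peptide. The residue-to-class assignment in `GROUPS` is an assumption that follows a common convention and may differ from the grouping the web server applies; the example peptide is hypothetical and not taken from the dataset.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: the five class names come from the text, but this exact
# residue-to-class assignment is a common convention, not confirmed by the paper.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}


def aac(sequence):
    """Return the frequency of each of the 20 standard amino acids."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence.upper())
    return {aa: counts[aa] / len(sequence) for aa in AMINO_ACIDS}


def gaac(sequence):
    """Return the frequency of each amino acid class."""
    freqs = aac(sequence)
    return {
        name: sum(freqs[aa] for aa in members)
        for name, members in GROUPS.items()
    }


example = "GWFKAAD"  # hypothetical peptide
print(aac(example)["W"])
print(gaac(example))
```

AAC yields 20 values per peptide and GAAC yields 5, which matches the descriptor counts reported in the Results.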
The data for the three descriptors were used as input for clustering. K-means clustering was used with the number of clusters set to 2. The basic idea is to initialize the cluster centers, assign each point to its nearest center, update each center to the mean of its member points, and repeat the process until convergence [23].
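The assign-and-update loop described above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used by the web server: the random initialization, the Euclidean distance, the function name `kmeans` and the toy points are all assumptions.

```python
import math
import random


def kmeans(points, k=2, max_iter=100, seed=0):
    """Lloyd's algorithm; returns (centers, labels) for a list of vectors."""
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as starting centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each center to the mean of its member points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Toy two-dimensional points (hypothetical, not descriptor data).
toy = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]]
centers, labels = kmeans(toy, k=2)
print(labels)
```

With descriptor data, each point would be one peptide's feature vector (for example its 20 AAC values).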
Principal component analysis (PCA) is used to capture the informative variation in data [24]. The descriptor data were subjected to PCA for dimensionality reduction, and the first three principal components were retained. The dimensionality-reduced data were used as input for feature selection and normalization.
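A minimal NumPy sketch of this reduction to three principal components is given below. It assumes mean-centering without additional feature scaling, which the text does not specify; the function name `pca_scores` and the random demonstration matrix are hypothetical.

```python
import numpy as np


def pca_scores(X, n_components=3):
    """Project the rows of X onto the leading principal components.

    Returns (scores, explained_variance_ratio) for the kept components.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)  # PCA operates on mean-centered features
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]


# Hypothetical stand-in for a descriptor matrix: 116 peptides x 10 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(116, 10))
scores, ratio = pca_scores(X_demo)
print(scores.shape, ratio.round(3))
```

Each row of `scores` is one peptide described by three uncorrelated components.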
The F-score was used for class discrimination; it measures the discrimination between two sets of real numbers [25]. For feature selection, features were ranked by F-score and the 10 best features were selected. The values of these features were transformed into three principal components, and the features were normalized using the Z-score, which is also used to normalize microarray data [26].
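The two quantities used in this step can be sketched as follows. The F-score formula below is the widely used definition (between-class separation of the means divided by the sum of within-class sample variances) and is assumed, not confirmed, to be the one behind the reported values; the choice of the population standard deviation in the Z-score is likewise an assumption.

```python
from statistics import mean, pstdev, variance


def f_score(pos, neg):
    """F-score of one feature from its values in the two classes."""
    pos, neg = list(pos), list(neg)
    overall = mean(pos + neg)
    # Numerator: how far each class mean lies from the overall mean.
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    # Denominator: sample variances (n - 1) within each class.
    within = variance(pos) + variance(neg)
    return between / within if within else float("inf")


def z_normalize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    values = list(values)
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)  # constant feature: nothing to scale
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values for inhibiting (pos) and non-inhibiting (neg) peptides.
print(f_score([0.20, 0.25, 0.30], [0.00, 0.02, 0.01]))
print(z_normalize([1.0, 2.0, 3.0]))
```

Ranking all features by `f_score` and keeping the ten largest corresponds to the selection step described above.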
A central task in machine learning is classification: we want to know to which class a new peptide belongs (dengue virus inhibiting or non-inhibiting). We chose random forest (RF) because it is a robust classification algorithm in which uncorrelated models produce ensemble predictions that are more accurate than any of the individual predictions [27]. The normalized dataset (Training set: 102,3; Testing set: 14,3) was loaded as input for machine learning. The random forest algorithm was run with the following parameters: tree number, 1000; number of threads, 2; tree range from 50 to 500 in steps of 50. Five-fold cross-validation was used.
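For readers who want to reproduce a comparable setup outside the web server, the tree-count sweep with five-fold cross-validation can be approximated with scikit-learn. This is a sketch under assumptions, not the server's implementation: the function name `sweep_tree_count`, the accuracy scoring, the stratified shuffling and the random seed are all choices made here for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold


def sweep_tree_count(X, y, tree_counts=range(50, 501, 50), folds=5,
                     n_jobs=2, seed=0):
    """Choose the number of trees by stratified k-fold cross-validation.

    Returns (best_model, best_tree_count, mean_cv_accuracy).
    """
    search = GridSearchCV(
        # n_jobs here sets the number of threads used to build each forest.
        estimator=RandomForestClassifier(random_state=seed, n_jobs=n_jobs),
        param_grid={"n_estimators": list(tree_counts)},
        cv=StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed),
        scoring="accuracy",
    )
    search.fit(X, y)  # refits the best setting on all of X, y
    best_trees = search.best_params_["n_estimators"]
    return search.best_estimator_, best_trees, search.best_score_
```

Calling `sweep_tree_count(X_train, y_train)` on a normalized training matrix and its 0/1 labels would evaluate forests of 50 to 500 trees in steps of 50; the returned model can then be applied to the held-out testing set with `predict` or `predict_proba`.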
The AAC_RF, GAAC_RF, CTDC_RF and combined_model_RF models were validated on the testing data. The ROC and PRC curves were plotted, and the evaluation metrics were reported.
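The threshold-based metrics reported in the Results (sensitivity, specificity, precision, accuracy, MCC and F1) all follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are hypothetical, and values are returned as fractions rather than the percentages shown in the tables.

```python
import math


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = _ratio(tp, tp + fn)   # recall on inhibiting peptides
    specificity = _ratio(tn, tn + fp)   # recall on non-inhibiting peptides
    precision = _ratio(tp, tp + fp)
    accuracy = _ratio(tp + tn, tp + fp + tn + fn)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    mcc = _ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "mcc": mcc,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=8, fp=2, tn=6, fn=4))
```

Because the negative class is much smaller than the positive class in this dataset, specificity and MCC are more informative than accuracy alone.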
Following the protocol of the iLearnPlus web server, we annotated all sequences for classification. The format for writing each sequence is given below.
>name|class|category
sequence
Here, the name can be any alphanumeric string (underscores are allowed). We had two classes: 1 for dengue virus inhibiting peptides and 0 for dengue virus non-inhibiting peptides. The category field indicates whether a sequence belongs to the training or the testing dataset. In total, we collected 100 experimentally validated dengue virus inhibiting peptides and split them into training and testing sets in a 7:3 ratio. Similarly, the negative dataset contained 16 peptides, which we also split in a 7:3 ratio. These datasets are saved in Supplementary file S1.
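A file in this annotated format can be read with a few lines of Python. The sketch below is illustrative only: the record names, the sequences and the literal category strings `training` and `testing` are assumptions, since the text specifies only the `>name|class|category` layout.

```python
def parse_annotated_fasta(text):
    """Parse records written as '>name|class|category' plus sequence lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: three fields separated by '|'.
            name, label, category = line[1:].split("|")
            records.append({
                "name": name,
                "label": int(label),  # 1 = inhibiting, 0 = non-inhibiting
                "category": category,
                "sequence": "",
            })
        elif records:
            # Sequence lines may wrap; join them onto the current record.
            records[-1]["sequence"] += line
    return records


# Hypothetical records, not taken from the dataset.
demo = """>pep_1|1|training
GWFKAAD
>pep_2|0|testing
SSTTPNQ
"""
for record in parse_annotated_fasta(demo):
    print(record)
```

Grouping the parsed records by `category` recovers the training and testing sets, and grouping by `label` recovers the positive and negative sets.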
We generated descriptors for all 116 peptides. The 20 descriptors generated under AAC are given in Supplementary Table 1; each value is the frequency of an amino acid in a peptide. In AAC, tryptophan frequency differentiates dengue virus inhibiting peptides from non-inhibiting peptides: in non-inhibiting peptides, the occurrence of tryptophan is almost 0. Several studies have shown that tryptophan is very important for antimicrobial activity [28, 29]. More generally, the frequencies of glycine, tryptophan and phenylalanine are lower in non-inhibiting peptides than in inhibiting peptides (Table 1), which is well supported by a published article [30]. The 5 descriptors generated under GAAC are given in Supplementary Table 2. In GAAC, aromatic amino acids made up less than 5% of non-inhibiting peptides. Aromatic amino acids have been reported to play a vital role in viral defense [31]. The 39 descriptors generated under CTDC are given in Supplementary Table 3. In CTDC, non-inhibiting peptides had fewer solvent-accessible residues. Alpha helices and beta sheets are equally distributed in dengue virus inhibiting peptides, whereas in non-inhibiting peptides the proportion of beta sheets is higher than that of alpha helices.
Feature | Values |
---|---|
G | 0.458 |
F | 0.202 |
W | 0.181 |
N | 0.154 |
A | 0.135 |
I | 0.134 |
D | 0.083 |
L | 0.046 |
E | 0.044 |
V | 0.038 |
K | 0.025 |
P | 0.020 |
T | 0.018 |
C | 0.015 |
R | 0.015 |
Q | 0.013 |
H | 0.011 |
Y | 0.007 |
M | 0.001 |
S | 0.000 |
The alpha-helical content of a peptide determines its antiviral activity [32]. The data distributions for AAC, GAAC and CTDC are given in Figure 1.
The amino acid composition of a protein has been widely used for the prediction of peptide categories [33–43]. All descriptors under AAC, GAAC and CTDC were used for clustering (Figure 2) and dimensionality reduction (Figure 3). The top 10 features were selected and transformed into three principal components. The principal component values for each sequence were then normalized; the normalized data for AAC, GAAC and CTDC are shown in Figure 4. The normalized data were used as input for model development with the Random Forest (RF) algorithm, which is widely used for the study and prediction of antiviral peptides [44]. The metrics for all models (AAC_RF, GAAC_RF, CTDC_RF and combined_RF) are given in Table 2, in which the best predictive results of our classifier are shown in bold. The ROC and PRC curves for all models are shown in Figure 5. Boxplots of the eight evaluation parameters for all models are shown in Figure 6. Across the eight parameters, the AAC and GAAC models showed good predictive performance. The correlation values between models are given in Figure 7. The highest correlation, 0.9978, was found between the AAC_RF and CTDC_RF models.
Id | Sensitivity | Specificity | Precision | Accuracy | MCC | F1 | AUROC | AUPRC |
---|---|---|---|---|---|---|---|---|
CTDC_RF_model | 91.578 | 41.0 | 88.422 | 82.824 | 0.3357 | 0.8951 | 0.8431 | 0.961 |
GAAC_RF_model | 92.63 | 61.0 | 92.35 | 87.064 | 0.5235 | 0.9226 | 0.8663 | 0.9683 |
AAC_RF_model | 96.842 | 51.0 | 90.55 | 88.802 | 0.5388 | 0.9342 | 0.8487 | 0.9598 |
Combined_model | 87.368 | 55.0 | 91.332 | 81.956 | 0.3915 | 0.886 | 0.8503 | 0.9647 |
The predictive performance obtained in our study demonstrates that the combined descriptors (AAC, 20 descriptors; GAAC, 5 descriptors; CTDC, 39 descriptors) with Random Forest are suitable for predicting dengue virus inhibiting peptides, but overall AAC and GAAC with random forest are the best choice for model development and prediction. The models were evaluated on the testing data. The ROC and PRC curves are plotted in Figure 8, and the evaluation metrics of all models are given in Table 3.
Id | Sensitivity | Specificity | Precision | Accuracy | MCC | F1 | AUROC | AUPRC |
---|---|---|---|---|---|---|---|---|
AAC_RF | 97.89 | 95.24 | 98.94 | 97.41 | 0.9147 | 0.9841 | 0.9937 | 0.9986 |
GAAC_RF | 97.89 | 95.24 | 98.94 | 97.41 | 0.9147 | 0.9841 | 0.9937 | 0.9986 |
CTDC_RF | 94.74 | 95.24 | 98.9 | 94.83 | 0.842 | 0.967 | 0.990 | 0.997 |
Compared with plain amino acid composition, grouped amino acid composition, transition and distribution reduce information redundancy and overfitting and simplify protein complexity. To determine which amino acids and biological features discriminate best between dengue virus inhibiting and non-inhibiting peptides, we analysed the differences in amino acids and biological properties between the two groups. We aimed to create a classifier that could predict dengue virus inhibitory peptides from the composition, transition, and distribution of grouped amino acids, so these descriptors served as the input parameters of the RF.
There is currently no effective dengue virus (DENV) therapeutic. In this study, we present the first evidence, to our knowledge, relating amino acid usage and biological properties to the distinction between dengue virus inhibiting and non-inhibiting peptides. We found that the frequencies of glycine (G), phenylalanine (F), and tryptophan (W) were significantly higher in dengue virus inhibitory peptides. Aromatic amino acids made up less than 5% of non-inhibiting peptides, and non-inhibiting peptides had fewer solvent-accessible residues than inhibiting peptides. Alpha helices and beta sheets are equally distributed in dengue virus inhibiting peptides, whereas in non-inhibiting peptides the proportion of beta sheets is higher than that of alpha helices. An RF algorithm was applied to the three descriptors (AAC, GAAC and CTDC) to predict dengue virus inhibiting peptides. The predictive performance obtained in our study demonstrates that these descriptors combined with RF are suitable for distinguishing the two peptide categories. We also developed a combined model, but its accuracy was lower. The AAC_RF and GAAC_RF models reached accuracies of 88% and 87%, respectively. Based on these data, we believe that our classifier, which uses the scheme of grouped amino acid composition, transition and distribution, may facilitate the prediction of dengue virus inhibiting peptides.
Acknowledgements
We sincerely acknowledge the Centre for Bioinformatics, for providing computational facility to carry out this research work.
Funding
This project was supported by the Indian Council of Medical Research (ICMR, New Delhi).
Grant number: 45/36/2019-PHA/BMS
Ethics declarations: Not applicable, as no experiments involving model organisms were performed.
Competing interests
The authors declare that they have no competing interests.
Availability of data and material: All data and material were uploaded at submission.
Code availability: Not Applicable
Author information
Suresh Kumar Muthuvel designed this work and Elakkiya Elumalai performed analysis and wrote the manuscript.
Affiliations
Center for Bioinformatics, Pondicherry University, Pondicherry, India
Elakkiya Elumalai & Suresh Kumar Muthuvel
Contributions
SKM designed the experiments. EE performed the experiments. EE analyzed the data. EE wrote the manuscript. SKM proofread the manuscript. Both authors read and approved the final manuscript.
Corresponding author
Correspondence to Suresh Kumar Muthuvel.
Consent to participate: This is a computational biology study, so consent is not required.
Consent for publication: I, Prof. Suresh Kumar Muthuvel, undersigned, give my consent for the publication of identifiable details, which can include photograph(s) and/or videos and/or case history and/or details within the text (“Material”) to be published in the Journal and Article. Therefore, anyone can read material published in the Journal.
Supplementary material is not available with this version