ACP-Dnnel: anti-coronavirus peptides’ prediction based on deep neural network ensemble learning

The ongoing COVID-19 pandemic has caused dramatic loss of human life, and there is an urgent need for safe and efficient anti-coronavirus drugs. Anti-coronavirus peptides (ACovPs) can inhibit coronavirus infection. With high-efficiency, low-toxicity, and broad-spectrum inhibitory effects on coronaviruses, they are promising candidates for development into a new type of anti-coronavirus drug. Laboratory experiments are the traditional way to identify ACovPs, but they are inefficient and expensive. With the accumulation of experimental data on ACovPs, computational prediction provides a cheaper and faster way to find candidate anti-coronavirus peptides. In this study, we ensemble several state-of-the-art machine learning methodologies to build nine classification models for the prediction of ACovPs. These models were pre-trained using deep neural networks, and the performance of our ensemble model, ACP-Dnnel, was evaluated across three datasets and an independent dataset. We followed Chou's 5-step rules: (1) we constructed the benchmark datasets data1, data2, and data3 for training and testing, and introduced the independent validation dataset ACVP-M; (2) we analyzed the sequence composition features of the benchmark datasets; (3) we constructed the ACP-Dnnel model with a deep convolutional neural network (DCNN) merged with a bi-directional long short-term memory (BiLSTM) network as the base model, pre-trained to extract the features embedded in the benchmark datasets; nine classification algorithms were then ensembled for classification prediction and voting; (4) tenfold cross-validation was used during the training process, and the final model performance was evaluated; (5) finally, we constructed a user-friendly web server accessible to the public at http://150.158.148.228:5000/. The highest accuracy (ACC) of ACP-Dnnel reaches 97%, and the Matthew's correlation coefficient (MCC) value exceeds 0.9.
On three different datasets, its average accuracy is 96.0%. On the latest independent validation dataset, ACP-Dnnel's MCC, SP, and ACC values were 6.2%, 7.5%, and 6.3% greater, respectively, than those of existing methods. This suggests that ACP-Dnnel can be helpful for the laboratory identification of ACovPs, speeding up anti-coronavirus peptide drug discovery and development. The web server for anti-coronavirus peptide prediction is available at http://150.158.148.228:5000/.


Introduction
The damage caused by the COVID-19 pandemic is getting worse: there have been more than 765 million confirmed COVID-19 cases since the start of the pandemic and nearly 7 million people have died, according to WHO data (https://covid19.who.int/). There are many problems with traditional medicines in the treatment of viral infectious diseases, including drug resistance, toxic side effects, etc. (Lin et al. 2021). There is an urgent need to develop a safe and efficient drug that can inhibit coronaviruses (Timmons and Hewage 2021a). Peptides are biologically active molecules composed of amino acid residues; they can treat diseases with lower toxic side effects (Gomes et al. 2018; Pfalzgraff et al. 2018; O'Brien-Simpson et al. 2018). Anti-coronavirus peptides are one such class, characterized by high-efficiency, low-toxicity, broad-spectrum antiviral activity (Zhang et al. 2022). With the emergence and spread of COVID-19, the public health crisis still threatens the world (Mishal et al. 2020; Singh 2021). It is urgent to find more drug candidates for COVID-19. The traditional way to identify anti-coronavirus peptides is laboratory biological experiments (Wang et al. 2021), which are less efficient.
With the increase of peptide data resources, the advance of high-throughput sequencing technology, and the continuous reduction of sequencing costs, a large amount of sequence data has been generated. It is therefore necessary to introduce the latest machine learning algorithms to aid the identification of ACovPs (Manavalan et al. 2022). Computational methods are efficient, fast, and low-cost (Kieslich et al. 2021), and show better repeatability and batch-processing ability (Lee et al. 2015).
In 2012, the AVPpred (Nishant et al. 2012) model used support vector machines (SVM) (Boopathi et al. 2019) to predict antiviral peptides and constructed a benchmark dataset; its highest prediction accuracy was 85%, with a Matthew's correlation coefficient (MCC) (Chicco and Jurman 2020) of 0.70. In 2013, Chang and Yang (2013) used random forests (RF) (Genuer and Poggi 2020) to predict antiviral peptides with an accuracy of 90% and an MCC value of 0.79. In 2017, iAMPpred (Meher et al. 2017) used SVM to predict antibacterial and antiviral peptides; its antiviral peptide prediction accuracy was 90.08%, and its MCC value was 0.8. In 2019, PEPred-Suite (Wei et al. 2019), based on random forests, achieved an antiviral peptide prediction accuracy of 86.4% and an MCC value of 0.725. Also in 2019, AMPfun (Chung et al. 2020), a random forest-based antiviral peptide classification model, was tested on an independent dataset, achieving an antiviral peptide prediction accuracy of 86.13% and an MCC value of 0.71. The FIRM-AVP (Chowdhury et al. 2020) model in 2020 showed an accuracy of 92.4% in predicting antiviral peptides, with an MCC value of 0.84. AVPIden (Pang et al. 2021a) performs two-stage antiviral prediction: the first stage predicts antimicrobial peptides and the second stage predicts antiviral peptides. With the advancement of deep learning and neural networks, several neural network-based algorithms have been introduced for predicting antiviral peptides. iAMP-CA2L (Xiao et al. 2021) is based on cellular automata images for identifying antimicrobial peptides and their functional types; its highest accuracy for identifying antibacterial peptides reached 94.13%, and its accuracy for predicting antiviral peptides was 80.57%. In the study of anti-coronavirus peptides (ACovPs), PreAntiCoV (Pang et al. 2021b) was built to predict ACovPs in 2021.
Considering the imbalance of the data, the model introduced unbalanced random forest technology; its MCC value for anti-coronavirus peptide prediction was 0.57. In the same year, ENNAVIA (Timmons and Hewage 2021b) was specially modeled for the prediction of antiviral and anti-coronavirus peptides. The model was constructed on a deep neural network; its external test accuracy was 93.9%, and its MCC value was 0.87. iACVP (Kurata et al. 2022) combined RF with word2vec word embeddings, extracting contextual feature information from peptide sequences for the prediction of ACovPs. PACVP (Chen et al. 2023) used a stacked learning framework: the first layer is responsible for feature extraction, and the second layer trains the final model with a logistic regression (LR) algorithm to accomplish the prediction of ACovPs.
After conducting the literature review, it was found that current peptide sequence prediction analyses mostly rely on mathematical calculations of feature descriptors to extract sequence features. However, incomplete feature extraction may occur when the mathematical formula fails to capture all data features. Although the latest feature extraction technique using convolutional neural networks can comprehensively extract the amino acid composition of anti-coronavirus peptides, few anti-coronavirus peptide prediction algorithms extract the contextual feature information of amino acids in peptide sequences. Moreover, many studies use a single machine learning algorithm, which may encounter technical bottlenecks, and different machine learning models may display bias when solving problems. Therefore, there is a need to enhance the prediction accuracy of ACovPs. To address this issue, we propose an ensemble prediction classification algorithm based on deep neural networks, named ACP-Dnnel. It extracts the feature information contained in the ACovPs' sequence through a deep convolutional neural network, further extracts the contextual cross-feature information between these features through a bi-directional long short-term memory (BiLSTM) algorithm, combines them, and finally inputs this feature information into the ensemble algorithm model for classification learning and training to construct the final ACovPs' prediction model.

Datasets
We followed Chou's 5-step rules to first construct the benchmark dataset (Chou 2011). ACovPepDB (Zhang et al. 2022) is a dedicated anti-coronavirus peptide database. Its data are manually collected from public databases and published peer-reviewed articles. The database contains 518 entries, of which 214 are unique. All of these anti-coronavirus peptides are used in this project.
To construct a larger training dataset, we also collected 137 anti-coronavirus peptides from PreAntiCoV. We merged the two datasets and removed repeated entries, finally obtaining 252 non-redundant anti-coronavirus peptides, which are used as the positive dataset. We constructed three different negative training datasets. The first negative training dataset has 252 peptides extracted from AVPpred, which are experimentally verified non-antiviral peptides. The second negative dataset was extracted from UniProt: protein fragments and protein sequences of 5-100 residues in length, after removing sequences annotated with any of 'Antimicrobial', 'Antibiotic', 'Fungicide', or 'Defensin'. The third negative training dataset has 252 antiviral peptides extracted from AVPpred. For all datasets, 80% of the entries are used for model training, and the remaining 20% are used for independent evaluation. During the training process, tenfold cross-validation is used to validate the training models. We thus constructed three different datasets, namely dataset 1: anti-coronavirus peptides fused with non-antiviral peptides; dataset 2: anti-coronavirus peptides fused with non-antibacterial peptides; dataset 3: anti-coronavirus peptides fused with antiviral peptides. These simulate different scenarios in which anti-coronavirus peptides appear and further verify the robustness of the model. The three specific datasets are listed in Table 1.
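The 80/20 split described above can be sketched as follows. This is a minimal illustration in plain Python, assuming simple random shuffling (the paper does not specify its exact splitting procedure); the sequence names are placeholders:

```python
import random

def split_dataset(positives, negatives, test_fraction=0.2, seed=42):
    """Shuffle and split a labelled peptide dataset into train/test parts.

    `positives`/`negatives` are lists of peptide sequences; labels are
    attached as 1 (ACovP) and 0 (negative) before splitting.
    """
    data = [(seq, 1) for seq in positives] + [(seq, 0) for seq in negatives]
    rng = random.Random(seed)
    rng.shuffle(data)
    n_test = int(len(data) * test_fraction)
    return data[n_test:], data[:n_test]  # train, test

# toy example: 252 positives vs 252 negatives, as in dataset 1
pos = [f"P{i}" for i in range(252)]
neg = [f"N{i}" for i in range(252)]
train, test = split_dataset(pos, neg)
```

The held-out 20% is the independent evaluation set; tenfold cross-validation would then be applied inside the training portion only.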

Sequence encoding
Given an anti-coronavirus peptide sequence P, its format is shown in formula (1):

P = R1R2R3 ... RL  (1)

where R1 represents the first residue in the sequence, R2 represents the second residue, ..., and RL is the Lth residue. Then, each amino acid residue is converted to a number according to Table 2.
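As an illustration of this encoding step, the sketch below maps residues to integers and pads to a fixed length. The integer values are a hypothetical assignment; the exact numbering in the paper's Table 2 may differ, but the principle is identical:

```python
# Hypothetical residue-to-integer mapping; the actual Table 2 numbering may
# differ. Each of the 20 standard amino acids gets a small integer code,
# with 0 reserved for padding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

def encode_peptide(sequence, maxlen=100):
    """Convert a peptide string R1 R2 ... RL into a fixed-length integer
    vector, right-padded with 0 (mirroring Keras pad_sequences)."""
    codes = [AA_TO_INT[res] for res in sequence.upper()]
    return codes + [0] * (maxlen - len(codes))

encoded = encode_peptide("ACDE", maxlen=8)
```

The integer vector is what the embedding layer later consumes.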

Composition analysis
After the dataset was constructed, we counted the total number of amino acids in the benchmark dataset as M, and the number of occurrences of each amino acid as N. Then, N/M × 100% gives the distribution of each amino acid in the sample data.
We then extracted the 10 N-terminal and 10 C-terminal amino acids of each peptide sequence in the benchmark dataset for composition analysis, which is similar to the whole-sample composition analysis; the whole analysis process was done in Microsoft Office Excel 2021.
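The N/M × 100% distribution and the terminal-residue extraction can also be reproduced in a few lines of Python (shown here as an alternative sketch to the spreadsheet workflow the authors used):

```python
from collections import Counter

def aa_distribution(peptides):
    """Per-amino-acid percentage N/M * 100 over a set of peptide sequences."""
    counts = Counter()
    for seq in peptides:
        counts.update(seq)
    total = sum(counts.values())                      # M
    return {aa: counts[aa] / total * 100 for aa in counts}  # N/M * 100

def terminal_residues(peptides, n=10):
    """N-terminal and C-terminal n residues of each sequence."""
    return [seq[:n] for seq in peptides], [seq[-n:] for seq in peptides]

dist = aa_distribution(["EEKL", "KKLL"])          # toy sequences
n_term, c_term = terminal_residues(["EEKL", "KKLL"], n=2)
```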

Feature extraction
In the process of model building, the construction of the deep neural network models is inspired by the literature (Hu et al. 2019; Xiao et al. 2021); the construction of the entire model involves several stages. The first stage is feature encoding, which encodes the peptide sequence based on Table 2 to serve as the model's data input. The second stage is feature learning, which uses an embedding layer to transform the input sequence data into a multi-dimensional matrix and reduce the dimensionality of the sparse matrix. A convolutional neural network (CNN) (LeCun et al. 2015; Shin et al. 2016) then performs convolution learning on the multi-dimensional matrix to extract data features in the third stage. The features learned by the CNN are fed into a bi-directional long short-term memory (BiLSTM) network (Aslan et al. 2021) to extract contextual feature information between the features; this enables extraction of the potential relationship features between amino acids in the ACovPs' sequence. The feature information is further synthesized through three fully connected layers, with the last fully connected layer outputting a one-dimensional feature matrix. The fourth stage is model classification, which takes the features output by the last fully connected layer as the feature input for the following nine binary classification models. To implement the model, this paper introduces the Keras deep neural network library (Moolayil et al. 2019). First, the pad_sequences function is used to convert sequences of different lengths into sequences of equal length by padding. Next, the processed coding sequence is fed into an embedding layer for data preprocessing. The embedding layer converts large sparse vectors into a low-dimensional space that preserves semantic relationships and reduces the dimensionality of the input data.
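A minimal architecture sketch of such an embedding-CNN-BiLSTM backbone, assuming the Keras API; the layer widths, kernel size, and regularization strength are illustrative placeholders, not the paper's exact hyperparameters:

```python
# Sketch of the described feature-extraction backbone (assumed Keras API);
# all sizes below are illustrative, not the paper's tuned values.
from tensorflow.keras import layers, models, regularizers

def build_backbone(vocab_size=21, maxlen=100):
    model = models.Sequential([
        layers.Embedding(vocab_size, 64, input_length=maxlen),    # dense low-dim embedding
        layers.Conv1D(64, 3, activation="relu", padding="same"),  # CNN feature learning
        layers.Bidirectional(layers.LSTM(32)),                    # contextual features
        layers.Dropout(0.2),                                      # 20% dropout vs over-fitting
        layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(0.01)),   # regularized FC layer
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),                    # binary output node
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

In the paper's pipeline, the activations of the last fully connected layer (not the sigmoid output) would be extracted and passed to the nine downstream classifiers.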
To mitigate over-fitting during model training, a dropout layer is introduced with its rate set to 0.2; that is, 20% of the neurons in the neural network are randomly discarded to make the network simpler and more direct. Additionally, regularization is applied in the fully connected layers (Laarhoven 2017), with L2 regularization shown in formula (2) and L1 regularization shown in formula (3):

L2 = λ Σi wi²  (2)

L1 = λ Σi |wi|  (3)

where w is the weight of the model and λ is the regularization coefficient, which controls the degree of regularization. The output layer comprises a single node activated by the sigmoid function, consistent with typical binary classification neural networks; the sigmoid function is shown in formula (4):

σ(x) = 1/(1 + e^(-x))  (4)

To compute the loss, the binary cross-entropy loss function is utilized (Ruby and Yendapalli 2020), defined in formula (5):

L = -(1/N) Σi [yi log ŷi + (1 - yi) log(1 - ŷi)]  (5)

where yi is the true value of the ith sample and ŷi is the predicted value of the ith sample. When the loss function value approaches zero, the error rate of the model decreases and its performance improves.
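Formulas (4) and (5) can be checked directly in plain Python:

```python
import math

def sigmoid(x):
    """Formula (4): sigma(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_cross_entropy(y_true, y_pred):
    """Formula (5): L = -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / n

# two samples, both predicted with probability 0.9 for their true class
loss = binary_cross_entropy([1, 0], [0.9, 0.1])
```

With both samples assigned probability 0.9 to their true class, the loss reduces to -log(0.9) ≈ 0.105, close to zero as the text describes.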

Model construction
The model is built on the foundation of a feature extraction model that integrates a deep convolutional neural network and BiLSTM, with nine classifier models constructed on top of it. Figure 1 shows the diagram of the entire model. To construct an ensemble classification system, this paper imported nine binary classifier models using Python packages (Kramer and Kramer 2016). We then performed classification and evaluation and selected the optimal binary classification model. The specific parameters of the classification models were adjusted in the following manner:

1. SVM. Support vector machines (SVMs) (Boopathi et al. 2019) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. The kernel is set to 'rbf', with tenfold cross-validation.

2. RF. The random forest (Biau 2012) is an ensemble learning method for classification, regression, and other tasks, which operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes or the mean prediction of the individual trees. The parameter "n_estimators" is set to 10 and "random_state" is set to 123.

3. XGBoost. It is designed to provide a high-performance implementation of gradient boosting (Chen et al. 2016), with a focus on computational efficiency, scalability, and accuracy. The parameters are optimized as follows: "max_depth" set to 50, "n_estimators" set to 100, "learning_rate" set to 0.1, "colsample_bytree" set to 0.7, "gamma" set to 0, "reg_alpha" set to 4, "objective" set to 'binary:logistic', "eta" set to 0.3, "silent" set to 1, and "subsample" set to 0.8.

4. KNN. K-nearest neighbors (KNN) (Xing and Bei 2019) is a supervised machine learning algorithm used for classification and regression. It is a non-parametric method which classifies data based on the similarity of data points to their neighbors. The parameter "n_neighbors" is set to 2.

5. GNB. Gaussian naive Bayes (GNB) (Kamel et al. 2019) is a supervised machine learning algorithm based on Bayes' theorem that assumes the features in the data are independent of each other. As a probabilistic classifier, it calculates the probability of each class for a given set of features and selects the class with the highest probability as the output. All parameters are set to default.

6. LG. Logistic regression is a machine learning algorithm used for classification problems (Shipe et al. 2019). It is a supervised learning algorithm that uses a logistic function to model a binary dependent variable, finding the best-fitting model to describe the relationship between the dependent variable and one or more independent variables. All parameters are set to default.

7. DTREE. The decision tree classifier is a supervised learning algorithm used for classification problems (Yoo et al. 2020). It creates a tree-like structure of decisions and their possible consequences, predicting the value of a target variable by learning simple decision rules inferred from the data features. The parameter "criterion" is set to 'entropy', "random_state" to 1, and "max_depth" to None.

8. Bagging. Bagging is an ensemble machine learning algorithm used to improve the accuracy and stability of machine learning models (Sandag 2020). The scikit-learn BaggingClassifier implements the bagging algorithm: it creates multiple instances of the same base estimator and trains them on different subsets of the training data. This classifier requires a DecisionTreeClassifier as its base estimator, with "criterion" set to 'entropy', "random_state" to 1, and "max_depth" to None; the bagging parameters are "base_estimator" set to this decision tree, "n_estimators" to 50, "max_samples" to 1.0, "max_features" to 1.0, "bootstrap" to True, "bootstrap_features" to False, "n_jobs" to 1, and "random_state" to 1.

9. Stacking. The ensemble consists of multiple models stacked together (Dong et al. 2020) using the StackingClassifier technology, in two steps. First, the estimators are constructed from a RandomForestClassifier ("n_estimators" set to 100, "max_depth" to None), K-nearest neighbors ("n_neighbors" set to 20, "metric" to 'minkowski'), and SVM (kernel set to 'rbf'). Then, the "final_estimator" is constructed as a GradientBoostingClassifier with "n_estimators" set to 25, "subsample" to 1.0, "min_samples_leaf" to 25, "max_features" to None, and "random_state" to 42. The estimators and "final_estimator" are combined into a whole through the StackingClassifier for ACovPs' classification.

Figure 1 shows the exact technical procedure for this project. In general, the research process was divided into three stages. The first module extracts features from the ACovPs benchmark dataset after it is constructed. The amino acid properties of the ACovPs peptide sequences are extracted using a deep convolutional neural network in the second module. BiLSTM technology is then used to extract additional contextual feature information about the amino acid properties of the peptide sequences. The feature information is condensed and subjected to dimensionality reduction before being used as the learning features of the nine-model ensemble classification technique. Finally, the performance of the model is evaluated using the test datasets.
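The abstract states that the nine classifiers "vote together". A minimal majority-vote sketch of that combination step is shown below; the exact combination rule used by ACP-Dnnel is not detailed in the text, so the tie-breaking convention here is an assumption:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine binary predictions from several classifiers by majority vote.

    `predictions` is a list of per-classifier prediction lists, one entry per
    sample (1 = ACovP, 0 = not). Ties fall back to the positive class here
    purely as an illustrative convention.
    """
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        tally = Counter(clf[i] for clf in predictions)
        voted.append(1 if tally[1] >= tally[0] else 0)
    return voted

# three hypothetical classifiers voting on four samples
votes = majority_vote([[1, 0, 1, 1],
                       [1, 0, 0, 1],
                       [0, 1, 1, 1]])
```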

Evaluation
The receiver-operating characteristic (ROC) curve and the area under the curve (AUC) are also used in this paper. The ROC curve shows the trade-off between false-positive and true-positive rates for various cut-off points. The false-positive rate (FPR) is calculated as FP/(TN + FP) and the true-positive rate (TPR) as TP/(TP + FN). The shape of the ROC curve provides an indication of model performance, with a more convex curve indicating better performance. The AUC measures the area under the ROC curve: an AUC close to 1.0 indicates near-perfect prediction, while an AUC of 0.5 indicates random guessing (Dzisoo et al. 2019).
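The evaluation metrics used throughout the paper (ACC, SN, SP, MCC, and the ROC rates) follow their standard confusion-matrix definitions, which can be computed as:

```python
import math

def metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)          # accuracy
    sn = tp / (tp + fn)                            # sensitivity = TPR
    sp = tn / (tn + fp)                            # specificity; FPR = 1 - SP
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sn, sp, mcc

# hypothetical counts on a 100-sample test split
acc, sn, sp, mcc = metrics(tp=45, tn=48, fp=2, fn=5)
```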

Peptides' sequence composition analysis
Through the analysis of the amino acid composition of ACovPs, non-antiviral peptides, non-antibacterial peptides, and antiviral peptides, we can gain insights into the distribution patterns of amino acids across the four different datasets. Figure 2 shows that amino acids such as glutamic acid, lysine, and leucine are enriched in ACovPs. On the other hand, antiviral peptides contain more glutamic acid and isoleucine, while non-antiviral peptides display their own distinct enrichment pattern (Fig. 2).

Fig. 2 Full-sequence amino acid composition analysis (anti-coronavirus peptides, non-antiviral peptides, non-antibacterial peptides, antiviral peptides)

To further investigate the compositional characteristics of amino acids in the peptide sequence datasets, we conducted a statistical analysis of the ten amino acids at the C-terminal and N-terminal. Interestingly, a similar amino acid enrichment pattern was observed in comparison with the full-sequence analysis. Specifically, the amino acids enriched at the C-terminus of ACovPs were glutamic acid, serine, and leucine, while those enriched in non-antibacterial peptides were methionine and leucine, as shown in Fig. 3.
At the N-terminal of ACovPs, glutamic acid, serine, and leucine are also enriched. In contrast, lysine and leucine are enriched at the N-terminal of non-antibacterial peptides, as shown in Fig. 4.

Dataset 1 model performance
Upon completion of training for the dataset 1 model, its performance is evaluated using the test dataset. The accuracy exceeds 94%, and the AUC value exceeds 0.98. The model training process evaluation and receiver-operating characteristic (ROC) curve are shown in Fig. 5.
To enhance the performance of the model, the data features extracted by the deep neural network are input into nine distinct classification models for training, and the remaining 20% independent validation dataset is then used for classification validation. A comprehensive performance evaluation and comparison of the different models is shown in Table 3.
As can be seen from Table 3, the performance of the GNB model is outstanding: its accuracy increased by more than 1%, its MCC value increased to 0.904, and its SP rose to 100%. The resulting model exhibits a high degree of accuracy in distinguishing ACovPs from non-antiviral peptides.

Fig. 4 N-terminal peptide sequence composition analysis (anti-coronavirus peptides, non-antiviral peptides, non-antibacterial peptides, antiviral peptides)

Dataset 2 model performance
Dataset 2 is generated by merging anti-coronavirus peptides with non-antibacterial peptides. As with dataset 1, the performance of the model is evaluated using the test dataset. The accuracy rate is 95%, and the AUC value also exceeds 0.98. The model training process evaluation and ROC curve are shown in Fig. 6.
Subsequently, the deep neural network features of dataset 2 are fed into nine classification models for further training, and the remaining 20% independent validation dataset is then used for validation. The performance of the nine classification models is shown in Table 4.
It can be seen from Table 4 that the performance of five classifiers improved, with the mean accuracy rate increasing by 2%. Among them, the AUC of XGBoost improved to 96.9% and its MCC value to 0.940, with the best SN value increasing to 95.7%; XGBoost was significantly better than the other classification models.

Dataset 3 model performance
Dataset 3 combines the data of anti-coronavirus peptides and antiviral peptides. As in the previous training process, 80% of dataset 3 is used for model training, tenfold cross-validation is introduced during training, and the remaining 20% is used as the independent validation dataset for model performance evaluation. The model training process evaluation results are shown in Fig. 7. Its accuracy reaches 93.06%, and the AUC value is more than 0.95.
Then, the features extracted by the deep neural network from dataset 3 are fed into nine classification models for further learning and optimization. The remaining independent validation dataset is then utilized for model classification validation. The comprehensive comparison results are presented in Table 5, which shows that performance on the independent validation dataset improved after further training and classification with the classification models.

Compare with state-of-the-art methods
Due to the scarcity of machine learning predictive analysis algorithms for ACovPs and differences in the datasets, the performance of current state-of-the-art algorithm models on similar datasets is compared, as shown in Table 6. Table 6 reveals that ACP-Dnnel, based on the deep neural network model, achieves the highest SN value of 95.7% on dataset 2 and the highest MCC value of 0.946 on dataset 3, and generally outperforms the ENNAVIA algorithm, which is also based on a deep neural network, in terms of SP. It also outperforms PreAntiCoV, which is based on RF. Compared with them, the SP of the ACP-Dnnel model has increased by 1.3% and the MCC value by 3.6%, which fully demonstrates that the model proposed in this paper can accurately identify and classify ACovPs.
In the latest study, Manavalan et al. (2022) conducted an assessment of AVP and ACVP predictors using an independent dataset without overlapping data and concluded that ENNAVIA-D was the most effective methodology; their evaluation incorporated all of the currently available predictors. iACVP (Kurata et al. 2022) is the latest anti-coronavirus peptide prediction model, which concentrates the advantages of multiple earlier models and constructs an independent validation dataset. To compare model performance more objectively, this paper also uses the ACVP-M dataset as the independent validation dataset, and we evaluated nine state-of-the-art predictors and compared their performance against our proposed predictor.
Initially, the deep convolutional neural network underlying ACP-Dnnel was trained using the three previously constructed datasets and the negative samples of iACVP. The pre-trained features were subsequently extracted and utilized as the foundation for learning in the ensemble classification model. Due to the small number of anti-coronavirus peptide samples and the problem of data imbalance, the sample weights were adjusted during model training, with the weight of the anti-coronavirus peptide dataset adjusted according to its proportion of the overall dataset.
The independent validation dataset ACVP-M constructed by iACVP was then evaluated using the ensemble classification model.
To compare model prediction performance, two methods were used: one without processing the unbalanced data, and the other adjusting sample weights through weight redistribution during pre-training, in which the weight of the positive class with fewer samples was raised to match that of the negative class. Results of the independent dataset validation on ACVP-M without imbalance processing are shown in Table 7. Table 8 then shows the results of re-validation on the independent dataset ACVP-M after adjusting the sample weights of the training set. Comparison of Tables 7 and 8 reveals that preprocessing the imbalanced dataset leads to improved training and validation results on the independent dataset. Given the limited number of anti-coronavirus peptide samples relative to other samples, it is particularly critical to address data imbalance during model pre-training.
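The weight-redistribution step can be sketched with the common "balanced" heuristic w_c = N/(2·N_c). The paper only states that weights follow the class proportions, so this exact formula and the sample counts below are assumptions for illustration:

```python
def balanced_class_weights(n_positive, n_negative):
    """Reweight classes inversely to their frequency so the minority
    (positive, ACovP) class contributes as much total weight as the
    majority class. Uses the common 'balanced' heuristic w_c = N/(2*N_c);
    the paper's exact reweighting rule is not specified."""
    total = n_positive + n_negative
    return {1: total / (2 * n_positive), 0: total / (2 * n_negative)}

# hypothetical counts: 252 ACovPs against a 4x larger negative set
weights = balanced_class_weights(n_positive=252, n_negative=1008)
```

A dictionary of this shape is what Keras accepts as `class_weight` during training.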
The results of comparing ACP-Dnnel with state-of-the-art algorithms on the ACVP-M independent dataset are shown in Table 9. Upon comparison with the nine state-of-the-art predictive models, ACP-Dnnel demonstrated the highest MCC, SP, and ACC values (0.369, 0.981, and 0.958, respectively), whereas iACVP demonstrated the highest SN (0.825) and the highest AUC (0.920). ACP-Dnnel's MCC, SP, and ACC values were considerably higher (6.2%, 7.5%, and 6.3% greater, respectively) than those of iACVP, ENNAVIA-D, and PreAntiCoV. Furthermore, the MCC value can objectively reflect the overall performance of the model on imbalanced data.

Web server implement
By introducing web server technologies such as Flask and HTML (Yang et al. 2021; Zhou et al. 2022), we have established a web server system for predictive analysis of anti-coronavirus peptide data. Users can directly input the peptide sequence they wish to analyze on the web page and submit it to the analysis system by clicking the "commit" button. The new peptide sequence is input into the ensemble model; when the model's prediction exceeds the threshold of 0.5, the sequence is reported as an anti-coronavirus peptide; otherwise, it is output as a non-anti-coronavirus peptide. The prediction results are presented at the bottom of the web page. The analysis demonstration process is shown in Fig. 8, which shows that the analysis system can analyze each peptide sequence, predict whether it is an anti-coronavirus peptide, and provide the results at the bottom of the web page. The web server for anti-coronavirus peptide prediction is available at http://150.158.148.228:5000/. The server offers not only anti-coronavirus peptide predictions but also a download link for the datasets used in the study, for access and further research.

Discussion
With the COVID-19 pandemic, there is an urgent need for antiviral drugs that can fight against COVID-19. In recent years, there has been an increasing number of peptide-based therapies. Anti-coronavirus peptides are small molecules that can resist coronaviruses; therefore, they can be developed as alternative anti-coronavirus agents. Compared with the large amount of antimicrobial peptide data available, the amount of anti-coronavirus peptide data is relatively small, so related studies are fewer.
In the field of anti-coronavirus peptide research, traditional molecular dynamics simulation techniques can provide insight into the activity of these peptides, but they are time-consuming and impractical for large-scale peptide sequence screening. Therefore, computer-assisted prediction methods that are more accurate and efficient can accelerate the exploration and development of anti-coronavirus peptides while reducing costs. This study imported deep neural network technology, in conjunction with BiLSTM, to extract contextual semantics. Additionally, the study integrated nine mainstream classification models to comprehensively investigate the efficacy of the developed approach. To ensure the stability of the model, three distinct datasets were constructed. During training, the tenfold cross-validation method was used to construct the model. After validation, the model underwent independent dataset testing and demonstrated accurate prediction of anti-coronavirus peptides. In our future work, we intend to conduct biological wet experiments to further validate the predicted anti-coronavirus peptides.

Fig. 8 Anti-coronavirus peptides prediction analysis system. a Enter the peptide sequence data to be analyzed; b submit the peptide sequence and obtain the prediction results; c FASTA format data input; d submit the FASTA format peptide sequence and obtain the prediction results; e introduction of the study; f the model pre-training benchmark dataset and the independent validation dataset
During training of the anti-coronavirus peptide prediction model, the DNN is vulnerable to over-fitting because of the small size of the dataset. For this reason, we introduced the dropout technique to simplify the network. In future work, we would like to try additional methods to address this problem.
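As a sketch, the (inverted) dropout technique can be implemented in a few lines of NumPy; the dropout rate below is illustrative and is not the value used in ACP-Dnnel.

```python
import numpy as np

def dropout(activations: np.ndarray, rate: float, training: bool,
            rng: np.random.Generator) -> np.ndarray:
    """Randomly zero a fraction `rate` of units during training and
    rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return activations                     # identity at inference time
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)   # inverted-dropout scaling

rng = np.random.default_rng(0)
x = np.ones((4, 8))
y = dropout(x, rate=0.5, training=True, rng=rng)
# Surviving units are scaled to 2.0; dropped units are exactly 0.
```

Because each forward pass sees a different thinned sub-network, the model cannot rely on any single co-adapted feature, which curbs over-fitting on small datasets.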
In the encoding process, we used 5-bit binary encoding instead of the traditional 20-bit one-hot encoding, which produces large sparse matrices. We need to extract the contextual relationships between amino acids through BiLSTM, and with the sequence maxlen set to 1280 during padding, each vector under 20-bit one-hot encoding would be very long, occupying a large amount of memory and slowing training.
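A minimal sketch of such a compact encoding follows; the exact residue-to-code mapping used by ACP-Dnnel is not specified here, so the assignment below (alphabetical index written in base 2) is an assumption for illustration.

```python
# 20 amino acids plus a padding symbol fit in ceil(log2(21)) = 5 bits,
# versus 20 bits per residue for one-hot encoding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def encode_residue(aa: str) -> list[int]:
    """Map one amino acid to a 5-bit binary code (1..20 in base 2);
    code 0 (all zeros) is reserved for padding."""
    index = AMINO_ACIDS.index(aa) + 1
    return [(index >> bit) & 1 for bit in range(4, -1, -1)]

def encode_sequence(seq: str, maxlen: int = 1280) -> list[list[int]]:
    """Encode a peptide and zero-pad it to a fixed length."""
    codes = [encode_residue(aa) for aa in seq]
    return codes + [[0] * 5 for _ in range(maxlen - len(codes))]

vec = encode_sequence("ACDY", maxlen=8)  # 8 rows of 5 values each
```

At maxlen 1280 this stores 1280 × 5 values per sequence instead of 1280 × 20, a fourfold reduction in memory.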
To ensure the reliability of the results for the identified peptides and the proteins of interest, we screened three anti-coronavirus peptide sequences, not included in our benchmark dataset, from the antiviral peptides database at http://dravp.cpu-bioinfor.org/downloads/ (Liu et al. 2023) for prediction purposes. We successfully predicted all three sequences as anti-coronavirus peptides. These sequences were derived from the research of Outlaw V K, Bovier F T, Mears M C, et al., whose study "Inhibition of Coronavirus Entry In Vitro and Ex Vivo by a Lipid-Conjugated Peptide Derived from the SARS-CoV-2 Spike Glycoprotein C-terminal Heptad Repeat (HRC) Domain" was published in mBio (Outlaw et al. 2020). The HRC lipopeptide corresponding to SARS-CoV-2 exhibited inhibitory effects on S-mediated membrane fusion, demonstrating effectiveness against both SARS-CoV-2 and MERS-CoV live viruses in vitro. Furthermore, it hindered the spread of SARS-CoV-2 in human airway tissue and prevented its systemic exit from the airway surface. The HRC sequence data in FASTA format can be found in Table 10, and the verification results from the web server are presented in Fig. 9.
Fig. 9 Prediction of new anti-coronavirus peptides
In the future, we plan to implement a model interpretability algorithm to analyze the weights of the features that impact the final prediction results. Additionally, research will be conducted on the biological interpretability of the algorithm, which will serve as a basis for subsequent laboratory tests aimed at verifying the functional activity of anti-coronavirus peptides. Ultimately, this research will provide a theoretical foundation for the later development of anti-coronavirus peptide drugs.

Conclusions
In this project, we developed a model that combines DCNN-BiLSTM with nine classification algorithms to identify and analyze anti-coronavirus peptide data. The DCNN network extracts peptide sequence features, while the BiLSTM network extracts the context between amino acids. The nine classification algorithms are utilized for further learning and classification of the features. To address data imbalance during feature extraction, the weight of the anti-coronavirus peptide class is increased to balance the contribution of the samples. To improve the algorithm's stability, tenfold cross-validation was introduced during training. To verify the algorithm's superiority, three different datasets were constructed for the study. After independent dataset testing, the proposed model is comparable to current anti-coronavirus peptide prediction algorithms in terms of accuracy, sensitivity, MCC, and other indicators, and compared with the state-of-the-art algorithms it achieves improved results. In future research, we aim to improve the algorithm's interpretability in the biological sense and verify the functional activities of anti-coronavirus peptides in conjunction with wet experiments.
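The class-weighting idea used to counter data imbalance can be sketched with the common "balanced" rule; the sample counts below are illustrative and the actual weights used in ACP-Dnnel are not reproduced here.

```python
from collections import Counter

def balanced_class_weights(labels: list[int]) -> dict[int, float]:
    """weight_c = n_samples / (n_classes * n_c): rare classes get
    proportionally larger weights in the training loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

labels = [1] * 100 + [0] * 300   # e.g. 100 ACovPs vs 300 non-ACovPs
weights = balanced_class_weights(labels)
# The minority positive class receives three times the weight of the
# majority class, so both classes contribute equally to the loss.
```

These per-class weights are typically multiplied into the loss term of each training example, so misclassifying a rare anti-coronavirus peptide is penalized more heavily than misclassifying a common negative sample.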