Multi-objective optimization with majority voting ensemble of classifiers for prediction of HIV-1 protease cleavage site

Prediction of the HIV-1 protease cleavage site in an amino acid sequence of Human Immunodeficiency Virus type 1 (HIV-1) has been the subject of intense research for decades; however, many researchers have concentrated on increasing the AUC value of the prediction without paying much attention to the accuracy metric. Knowledge of the substrate specificity of HIV-1 protease has significant application in the development of HIV-1 protease inhibitors and in studying novel drug targets. Motivated by this, a multi-objective optimization (MOO)-based majority voting ensemble framework combining the outputs of multiple classifiers is proposed in the current paper to increase both the prediction accuracy and AUC values simultaneously. The optimal set of classifiers considered for voting when combining the outputs is determined automatically using the search capability of MOO. Comparatively better results have been attained on various benchmark data sets, with average accuracy and AUC (area under the ROC curve) values of 0.92 and 0.96, respectively.


Introduction
Human immunodeficiency virus (HIV), the causative agent of acquired immunodeficiency syndrome (AIDS), affects the human immune system by spreading from one cell to another. AIDS is one of the most threatening health challenges in the world. As per the statistics given by the World Health Organization (WHO) (https://www.who.int/news-room/fact-sheets/detail/hiv-aids), approximately 690,000 people died from HIV-related causes and 1.5 million people acquired HIV in the year 2020. For decades, various laboratory-based experiments have been performed to understand the HIV-1 replicative cycle mechanism. Their results reveal that HIV-1 protease is an essential enzyme of HIV-1 replication, producing mature and infectious virions (Sadiq et al.). HIV-1 protease plays an important role in cleaving the Gag and Gag-Pol polyproteins (Darke et al. 1988) and generating component proteins, which can further assemble into new virions that infect new cells. Hence, for the treatment of HIV-1 infection, inhibition of HIV-1 protease activity could be an efficient technique to prevent HIV-1 replication. A protease inhibitor prevents the activity of the protease by binding to the active enzyme. Based on this concept, many HIV-1 protease inhibitors have been developed. In the presence of these inhibitors, the peptide substrate avoids being cleaved by HIV-1 protease. The substrate specificity of HIV-1 protease therefore plays an important role in designing HIV-1 protease inhibitors, and effective identification of HIV-1 protease cleavage sites has remained an important study topic for decades. Nanni (2006) showed experimentally that a drastic error reduction in cleavage prediction could be achieved using combinations of different classifiers trained on different feature spaces. Further, Nanni and Lumini (2009) applied an ensemble of learning algorithms which showed better performance than stand-alone methods.
This reveals that a reliable system could be achieved by combining different classifiers with different feature extraction methods. Gök and Özcerit (2013) combined orthonormal encoding with Taylor's Venn diagram to introduce a new encoding technique named OETMAP, and found that a linear SVM performed better with this feature encoding. Rögnvaldsson et al. (2015) also used a linear support vector machine classifier with orthonormal encoding as a feature extractor. Later, deep learning approaches based on bidirectional Gated Recurrent Units and feed-forward networks were shown to generate promising results by Shayanfar et al. (2017). Further, Fathi and Sadeghi (2018) introduced a genetic algorithm-based feature selection technique and combined it with an SVM classifier, which yielded better results. To reveal diversified properties, Singh et al. (2020) used a variety of feature extraction techniques, namely structure-based, physicochemical-based, and sequence-based features. They used various combinations of data sets and adopted a multitask learning methodology to deal with data scarcity and enhance prediction performance. A recent algorithm, PU-HIV, was proposed by Li et al. (2021), who considered the unknown octapeptides as an unlabeled set instead of assigning them negative labels. They formed a comprehensive combination of three different feature sets extracted from substrate sequences, and applied a biased linear SVM for classification. Another study, by Hu et al. (2019), proposed a co-evolutionary pattern-based prediction model called EvoCleave for HIV-1 PR cleavage sites, which integrates substrate sequences with a linear SVM classifier to yield promising results in terms of ROC analysis and F-measure. Moreover, the study by Onah et al. (2022) aimed to predict the HIV-1 protease cleavage site using a hybrid of octapeptide sequence information, amino acid binary profile, and physicochemical properties as input variables for selected machine learning algorithms; it used a combined dataset, applied a 3-way data split, and evaluated the models using a stratified 10-fold cross-validation technique alongside a testing set. Further, the work of Hu et al. (2022) proposed an ensemble learning algorithm called EM-HIV for predicting HIV-1 PR cleavage sites by training biased support vector machine classifiers with an asymmetric bagging strategy to address data imbalance and noisy data. The algorithm uses features from three different coding schemes to construct relevant feature vectors of octamers and outperforms state-of-the-art prediction algorithms on three independent benchmark datasets. In another recent study, Palmal et al. (2022) used a stacked auto-encoder for latent feature extraction from a concatenated feature set constructed from three different types of extracted features, namely structural, physicochemical, and sequential features; a majority voting based classifier ensemble was then applied for classification.

Motivation for the proposed model
Protease cleavage site prediction methods have two steps: the encoding of peptide octamers (feature encoding) and the classification. Peptide encoding has a huge impact on the predictive performance of classifiers. In some previous works, combinations of different classifiers trained on different feature spaces (Nanni 2006) were used to determine which one performed better. Different feature encoding methods were used with different classifiers for solving the prediction task (Gök and Özcerit 2013; Rögnvaldsson et al. 2015). Deep learning models were also applied to enhance prediction performance in various works (Shayanfar et al. 2017; Palmal et al. 2022; Palmal et al. 2023). However, the problem of misclassification depending on the pattern of the data (linear or nonlinear) remains, and these models may not perform equally well when their performances are averaged across all possible data sets. According to the No Free Lunch Theorem, it is difficult to determine a single classifier that performs best across all data sets. In some previous works, either a single classifier was used with a single feature encoding (Nanni 2006), or classifiers were combined blindly in an ensemble (Nanni and Lumini 2009). But not all classifiers perform well for all data sets. Motivated by this, we target the selection of a subset of classifiers. The task of finding an appropriate classifier ensemble for HIV-1 cleavage site prediction is posed as a multi-objective optimization problem in the current work. To determine a suitable subset of classifiers for designing the ensemble, we use majority voting, where the selection or rejection of each classifier is auto-generated based on performance on the validation set.
In the work carried out by Singh et al. (2019, 2020), a classifier ensemble was applied, but a single objective value was optimized during the process. However, different metrics (accuracy, AUC, precision, recall, etc.) are equally important for measuring the quality of prediction in the biological domain, and all these metrics cannot be optimized simultaneously in a single-objective optimization setting. Due to the contradictory nature of these evaluation metrics, optimizing a single objective value may decrease the values of other metrics, whereas optimizing multiple objective values better reflects the quality of the model. Thus we use multi-objective optimization (MOO) with a majority voting ensemble of different classifiers, where the accuracy and AUC values are prioritized equally so as to increase both simultaneously.
In our proposed work we have used a multi-objective optimization-based majority voting ensemble of classifiers. Here the selection or rejection of votes of different classifiers is determined automatically using the search capability of a MOO technique. We have tested this model on four peptide data sets (see Table 1) and inspired by the work of Fathi and Sadeghi (2018), we applied structural and physicochemical-based feature extraction techniques where 15 physicochemical and 4 structural properties are considered corresponding to each amino acid in the individual octamer of the data set.
Some popular classifiers have been considered for the experiment, namely Logistic Regression (Tolles and Meurer 2016), Random Forest (Breiman 2001), and different kernels of SVM (Linear, Polynomial, Gaussian, Sigmoid) (Pradhan et al. 2004; Gönen and Alpaydın 2011). Then NSGA-II (Deb et al. 2002), a popular MOO technique, is applied with a voting ensemble of classifiers to perform multi-objective optimization on the data set. We measure performance using two metrics: accuracy and AUC. Comparisons have been made with the works of Rögnvaldsson et al. (2015), Shayanfar et al. (2017), and Li et al. (2021). A better AUC value is achieved than those of Rögnvaldsson et al. (2015) and Li et al. (2021) (these works have not reported accuracy values, so we could not compare with respect to this metric), whereas better accuracy and AUC values are attained by our work compared to Shayanfar et al. (2017) in a majority of the cases. All comparisons of results have been performed based on out-of-sample testing, where the model is trained using one dataset and tested using a separate dataset. The main contributions of the paper are enumerated below:

1. A novel technique for ensembling classifiers based on MOO is proposed for solving the HIV-1 cleavage site prediction task. The selection of classifiers is automatically determined based on the performance of the classifiers for different data sets.
2. Structural and physicochemical-based feature extraction techniques (Fathi and Sadeghi 2018) are applied to four data sets related to HIV-1 cleavage site prediction.
3. Initially, six classifiers are considered. The selection or rejection of different classifiers is determined automatically using the search capability of a MOO technique, and a majority voting ensemble is applied to the outputs of the selected classifiers.
4. NSGA-II is applied for MOO, where AUC and accuracy values are optimized simultaneously.
The organization of the rest of the paper is as follows: Sects. 4 and 5 describe the background of the present work and the information about the data sets, respectively. Section 6 describes the proposed methodology, divided into the following subsections: chromosome representation, fitness computation, objective functions, discussion of other operators, and selection of a solution from the final Pareto optimal front. The experimental results and the comparison with other state-of-the-art techniques are reported in Sect. 7. The conclusion is presented in Sect. 8.

Background
In this section, we have discussed the existing concepts which are used in the current study.

Majority voting ensemble of classifiers
A voting ensemble, or majority voting ensemble, is a machine learning model that combines the prediction results of multiple other models. It can be used for solving classification or regression problems. Here our problem is of the classification type, where the predictions for each label (positive or negative) by different classifiers are summed up and the label with the majority of the votes is taken as the resultant label. The ensemble technique is used to improve model performance, as it generally performs better than any single model participating in the ensemble.
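The hard-voting scheme described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the label strings are placeholders:

```python
# Minimal sketch of hard majority voting over binary class labels
# ('cleavage' / 'non-cleavage'); the classifier outputs are given as input.
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]

votes = ["cleavage", "non-cleavage", "cleavage"]
print(majority_vote(votes))  # -> cleavage
```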

Multi-objective optimization(MOO)
Most real-world problems involve the simultaneous optimization of multiple objective functions, which are generally competing and conflicting. Multi-objective optimization deals with such cases. It is capable of finding a set of optimal solutions instead of a single solution. This set of optimal solutions is non-dominated in nature, because no solution in the set can be considered better than any other when all objectives are taken into account.
The goal of multi-objective optimization (Coello Coello 1999) is to find the vectors of decision variables that simultaneously optimize all the objective functions.

Main steps of non-dominated sorting genetic algorithm-II (NSGA-II)
NSGA-II is basically a genetic algorithm, extended to solve MOO problems. The total population size is N; P_t and Q_t are the parent and offspring populations, respectively.
Step 1: Combine P_t and Q_t to get R_t = P_t ∪ Q_t. Non-dominated sorting is performed on R_t to identify the different fronts F_i, i = 1, 2, ....
Step 2: Set P_{t+1} = ∅ and i = 1. While |P_{t+1}| + |F_i| ≤ N, include front F_i in P_{t+1} and increment i.
Step 3: Execute the crowding distance-sorting procedure and include the (N − |P_{t+1}|) most widely spread solutions, using the crowding distance values in the sorted F_i, in P_{t+1}.
Step 4: The resulting P_{t+1}, now of size N, becomes the parent population of the next generation.
Step 5: Create the offspring population Q_{t+1} from P_{t+1} by using crowded tournament selection, crossover, and mutation operations.
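The environmental-selection steps above can be sketched as follows. This is an illustrative, stdlib-only implementation for a two-objective maximization problem (e.g. accuracy and AUC), not the code used in the paper:

```python
# Sketch of NSGA-II environmental selection: non-dominated sorting,
# crowding distance, and truncation to the population size N.

def dominates(a, b):
    """a dominates b: no worse in every objective, strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def nondominated_sort(points):
    """Return a list of fronts, each a list of indices into `points`."""
    fronts, remaining = [], set(range(len(points)))
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)
    return fronts

def crowding_distance(points, front):
    """Crowding distance of each solution in one front (boundary = inf)."""
    dist = {i: 0.0 for i in front}
    for m in range(len(points[front[0]])):
        order = sorted(front, key=lambda i: points[i][m])
        lo, hi = points[order[0]][m], points[order[-1]][m]
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi > lo:
            for k in range(1, len(order) - 1):
                dist[order[k]] += (points[order[k + 1]][m] - points[order[k - 1]][m]) / (hi - lo)
    return dist

def select(points, N):
    """Keep the N best solutions by front rank, then by crowding distance."""
    chosen = []
    for front in nondominated_sort(points):
        if len(chosen) + len(front) <= N:
            chosen += front
        else:
            d = crowding_distance(points, front)
            chosen += sorted(front, key=lambda i: -d[i])[: N - len(chosen)]
            break
    return chosen

objs = [(0.90, 0.95), (0.92, 0.93), (0.85, 0.85), (0.88, 0.96)]
print(sorted(select(objs, 3)))  # -> [0, 1, 3]
```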

Data set
In the proposed method we have considered four benchmark data sets available in the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets/HIV-1+protease+cleavage). The standard data sets contain octamers with cleaved and non-cleaved sites, where each octamer is a sequence of 8 amino acids. The description of each benchmark data set is given in Table 1. After collecting the data sets, we applied structure- and physicochemical-based feature extraction techniques in which the sequences of amino acids are mapped to feature sets (Fathi and Sadeghi 2018). Each amino acid in an individual octamer has its own feature set, with a total of 19 features extracted per amino acid. Among them, 15 are physicochemical, namely Amino Acid Code, Experimental Melting Point (°C), Experimental Optical Rotation, Aliphatic, Aromatic, Polarity, Acidity, Basic, Unique, Hydropathy Index, Molecular Mass, pKa, pKb, pI, and Solubility. The four structural features are the numbers of Carbon, Hydrogen, Nitrogen, and Oxygen atoms. Thus the total feature size is 19 × 8 = 152. Descriptions of the features are given below.
• Amino Acid Code: From Alanine to Valine, all 20 amino acids are assigned unique numbers from 1 to 20.
• Aliphatic, Aromatic, Polarity, Basic, Unique: 'Aliphatic' amino acids are non-polar and hydrophobic (Gök and Özcerit 2013). An 'aromatic' amino acid is one that includes an aromatic ring (Gök and Özcerit 2013). Amino acid molecules are considered 'polar' when they have polar functional groups (Gök and Özcerit 2013). Charge (Biro 2006) and uniqueness properties of amino acids are represented by the 'basic' and 'unique' features (Fathi and Sadeghi 2018). All of these are binary-type features.
• Acidity: Represents the acidic, basic, or neutral nature of the amino acid.
• Hydropathy Index: Expresses the hydrophobicity properties of amino acids (Biro 2006). It can take continuous values.
• Molecular Mass: Represented in grams per mole (Fathi and Sadeghi 2018).
• Number of Carbon, Hydrogen, Nitrogen, and Oxygen atoms: These properties reflect the structural features.
• pKa, pKb, pI: Ka is the acid dissociation constant and pKa is the −log of this constant. In the same way, Kb is the base dissociation constant and pKb is the −log of that constant. pI is the pH at which the molecule is electrically neutral (has no net electrical charge), called the isoelectric point.
• Solubility: The solubility of the amino acid in grams per 100 ml of water at 25 °C.
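Putting the encoding together, each octamer maps to a 19 × 8 = 152-element vector. The sketch below illustrates the mechanics only; the property table `PROPS` is a hypothetical stub with made-up values for three residues, not the actual property values used in the paper:

```python
# Sketch of octamer -> feature-vector construction.  PROPS is a
# HYPOTHETICAL stub: 19 placeholder property values per residue, for
# three residues only; the paper uses 19 real properties for all 20.

PROPS = {
    "A": [1] + [0.0] * 18,    # Alanine  (illustrative values)
    "R": [2] + [0.1] * 18,    # Arginine (illustrative values)
    "S": [16] + [0.2] * 18,   # Serine   (illustrative values)
}

def encode_octamer(octamer):
    """Concatenate the 19 properties of each of the 8 residues -> 152 values."""
    assert len(octamer) == 8
    vec = []
    for aa in octamer:
        vec.extend(PROPS[aa])
    return vec

print(len(encode_octamer("ARSARSAR")))  # -> 152
```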
After considering all the above feature values, individual features are normalized; here v_i denotes the old feature instance and n_i the new (normalized) feature instance.
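The exact normalization formula is not reproduced here; a common choice consistent with the v_i / n_i notation is min-max scaling to [0, 1], sketched below as an assumption rather than as the paper's exact formula:

```python
# Min-max normalization of one feature column (an ASSUMED formula):
# n_i = (v_i - min) / (max - min), mapping each feature to [0, 1].

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature: map to zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 15, 20]))  # -> [0.0, 0.5, 1.0]
```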

The proposed technique
In the current work, we have developed a multi-objective optimization-based classifier ensemble technique in which the selection of different classifiers for majority voting ensembling is automatically determined using the search capability of MOO. NSGA-II (non-dominated sorting genetic algorithm-II) (Deb et al. 2002), a popular MOO technique, is utilized to perform multi-objective optimization, where the two objectives, accuracy and AUC, are simultaneously optimized. The classifiers applied in the work are Logistic Regression (Tolles and Meurer 2016), Random Forest (Breiman 2001), and different kernels of SVM (Linear, Polynomial, Gaussian, Sigmoid) (Pradhan et al. 2004; Gönen and Alpaydın 2011). Here the cross-domain experiment, or out-of-sample testing, is followed, where the model is trained with one of the four datasets and tested with the remaining datasets. NSGA-II (Deb et al. 2002) is applied to determine the optimal selection of classifiers among the six classifiers for developing a majority voting ensemble. For this ensemble technique, each classifier is associated with the value 0 or 1: 1 means that the particular classifier is included in the ensemble model and 0 indicates that it is rejected. These values are not fixed and may change based on the performance of the classifier on a dataset. The optimal set of classifiers is determined by exploiting the search capability of NSGA-II, optimizing the two objective functions, accuracy and AUC, simultaneously (Fig. 1). The dataset and the code used in the proposed work are available at https://github.com/SusmitaPalmal/Moowith-Voting-Ensmble

Chromosome representation
If K classifiers are considered for the experiment, then the chromosome size is K, where each gene of the chromosome is initialized by randomly selecting a value of 0 or 1. The value 1 indicates that the corresponding classifier will take part in constructing the ensemble, and the value 0 indicates that it will not. The static classification method is followed here: the classification results of the different samples for the different classifiers are stored in separate files. Finally, the outputs of a subset of the K classifiers are combined, where the subset consists of all classifiers whose genes have the value 1. Thus the presence or absence of the classifiers is encoded in the chromosome (see Fig. 2). If the population size is P, then all P chromosomes of the population are initialized in this way.
Suppose there are K classifiers and the individual genes of a particular chromosome are g_1, g_2, g_3, ..., g_K. Then the output class of a sample is selected based on the following calculation. Assume that, for a certain chromosome, C_1, C_2, C_3, C_4, C_5, C_6 are the classifiers and the corresponding genes are g_i ∈ {0, 1}, i = 1, 2, ..., 6, where K = 6. A detailed description is given in Table 2. In this example, g_2, g_3, g_4, and g_6 have the value 1, so the outputs of C_2, C_3, C_4, and C_6 are considered. The combination C_3, C_4, C_6 outputs 'cleavage', whereas only C_2 outputs 'non-cleavage'. So the 'cleavage' class gets the majority of the votes, and thus the final predicted class for this sample is the 'cleavage' site.
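The worked example in Table 2 can be reproduced directly. This is a minimal sketch; the classifier outputs are hard-coded to match the example:

```python
# Chromosome genes select C2, C3, C4, C6; three of the four selected
# classifiers predict 'cleavage', so it wins the majority vote.
from collections import Counter

genes       = [0, 1, 1, 1, 0, 1]                        # g1..g6
predictions = ["cleavage", "non-cleavage", "cleavage",
               "cleavage", "non-cleavage", "cleavage"]  # C1..C6 outputs

selected = [p for g, p in zip(genes, predictions) if g == 1]
print(Counter(selected).most_common(1)[0][0])  # -> cleavage
```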

Fitness computation
For each of the four datasets, separate data are used for training and testing. From one dataset, 95% of the data is used for training and 5% for validation, and testing is done on the other three datasets. The fitness values of each chromosome are evaluated in the following way.
Step 1: Suppose there are K classifiers in total. We have K possible output classes (one from each classifier) for each instance in the validation data. For the ensemble classifier, the label of the output class for each instance in the validation data is determined using majority voting over the outputs of the M selected classifiers, where the M classifiers form a subset of the K classifiers. Here, I(k, i) is the entry of the chromosome corresponding to the kth classifier and the ith class. The combined score of a particular class for a particular instance w is obtained by summing I(k, i) over the classifiers k whose output op(w, k) equals that class, where op(w, k) denotes the output label provided by the kth classifier for the instance w. The class receiving the maximum combined score is selected as the joint decision.
Chromosome: g_1 = 0, g_2 = 1, g_3 = 1, g_4 = 1, g_5 = 0, g_6 = 1
Prediction: Cleavage, Non-cleavage, Cleavage, Cleavage, Non-cleavage, Cleavage

Step 2: The overall accuracy and AUC values of this ensemble are calculated on the validation data. The validation set is used to select the set of chromosomes with good fitness values.
Step 3: The accuracy and AUC values are used as the fitness values of the particular chromosome. The objective is to maximize these fitness values using the search capability of NSGA-II.
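Steps 1-3 can be sketched as a single fitness function. This is an illustrative implementation; the vote fraction is used here as the ensemble score for computing AUC, which is one reasonable choice and not necessarily the paper's exact procedure:

```python
# Sketch of evaluating one chromosome's fitness (accuracy, AUC) on a
# validation set of binary labels (1 = cleavage, 0 = non-cleavage).

def fitness(genes, classifier_outputs, y_true):
    """classifier_outputs[k][w] is classifier k's 0/1 label for instance w."""
    n = len(y_true)
    scores = []
    for w in range(n):
        votes = [out[w] for g, out in zip(genes, classifier_outputs) if g == 1]
        scores.append(sum(votes) / len(votes))   # fraction voting 'cleavage'
    y_pred = [1 if s > 0.5 else 0 for s in scores]
    acc = sum(p == t for p, t in zip(y_pred, y_true)) / n
    # Rank-based AUC (Mann-Whitney U statistic), ties counted as 0.5
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    auc = wins / (len(pos) * len(neg))
    return acc, auc

outputs = [[1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0]]  # 3 classifiers, 4 instances
print(fitness([1, 1, 1], outputs, y_true=[1, 1, 0, 0]))  # -> (1.0, 1.0)
```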
After identifying the near-optimal set of chromosomes, individual test sets are considered to estimate the accuracy value and AUC value with respect to the trained model.

Objective functions
As the evaluation measures, accuracy and AUC are considered. Here TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Accuracy is the number of correctly classified samples divided by the total number of samples, i.e., Accuracy = (TP + TN)/(TP + TN + FP + FN); it measures the correct classification rate. The area under the ROC curve (AUC) summarizes the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity) across classification thresholds.
In biological data, a low true positive rate has a high negative impact. Thus we consider the AUC metric with the goal of maximizing the true positive rate. We have observed that accuracy can be maximized by increasing the true negative rate while the true positive rate remains low, but this scenario drastically reduces the AUC value. The contradictory nature of these two metrics motivated us to choose a MOO-based approach to maximize accuracy and AUC simultaneously.

Other operators
After computing the objective functions, the steps of NSGA-II are executed to optimize the two objective functions mentioned above. We have used crowding distance-based non-dominated sorting in NSGA-II, after performing conventional crossover and mutation, for the MOO-based classifier voting ensemble. We have used a population size of 100 and 20 generations; the mutation probability is 0.5. The elitism operation of NSGA-II ensures that non-dominated solutions (Deb 2011) among the parent and child populations are propagated to the next generation. The near-Pareto-optimal strings of the last generation provide different solutions to the ensemble problem.
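The variation operators can be sketched as follows. This is an illustrative implementation of single-point crossover and two-point mutation on bit-string chromosomes, using the probability values reported in the experimental settings; the exact operator variants in the paper may differ:

```python
# Sketch of the variation operators on bit-string chromosomes:
# crossover with probability 0.9 and two-point mutation with
# probability 0.5, as in the reported NSGA-II parameter settings.
import random

def crossover(p1, p2, prob=0.9):
    """Single-point crossover of two parent chromosomes."""
    if random.random() < prob:
        cut = random.randint(1, len(p1) - 1)
        return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
    return p1[:], p2[:]

def mutate(chrom, prob=0.5, n_points=2):
    """Flip n_points randomly chosen genes with the given probability."""
    child = chrom[:]
    if random.random() < prob:
        for i in random.sample(range(len(child)), n_points):
            child[i] = 1 - child[i]
    return child

random.seed(0)
c1, c2 = crossover([0, 1, 1, 1, 0, 1], [1, 0, 0, 1, 1, 0])
print(mutate(c1), mutate(c2))
```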

Selection of a solution from the final Pareto optimal front
In MOO, the algorithms produce a large number of nondominated solutions (Deb 2011) on the final Pareto optimal front. Each of these solutions is further used for generating an ensemble of classifiers. In our work, we have selected the solution with the maximum AUC value. Note that AUC is preferred over accuracy for the selection of a single solution because existing techniques mostly report the AUC values. But depending on the need, a solution can also be selected based on accuracy value.
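Selecting the reported solution from the final front then reduces to taking the maximum over the AUC values. A minimal sketch with made-up front entries:

```python
# Each final-front solution carries its chromosome and (accuracy, AUC)
# pair; the single reported solution is the one with the highest AUC.
# The front entries below are illustrative placeholders.

front = [([0, 1, 1, 1, 0, 1], (0.91, 0.95)),
         ([1, 1, 0, 1, 0, 1], (0.93, 0.92)),
         ([0, 1, 1, 0, 1, 1], (0.90, 0.96))]

best_chromosome, (best_acc, best_auc) = max(front, key=lambda s: s[1][1])
print(best_chromosome, best_auc)  # -> [0, 1, 1, 0, 1, 1] 0.96
```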

Result and discussion
The effectiveness of the proposed method is evaluated using four benchmark data sets. Detailed information about the data sets is provided in Table 1. After performing the physicochemical and structure-based feature extraction, a total of 152 features have been considered.

Experimental settings
The proposed method is implemented in Python (version 3.8). We have used the following classifiers for experimentation: Logistic Regression (Tolles and Meurer 2016), Random Forest (Breiman 2001), and different kernels of SVM (Linear, Polynomial (degree = 2), Gaussian, Sigmoid) (Gönen and Alpaydın 2011; Pradhan et al. 2004). The parameters used for NSGA-II are population size = 100, number of generations = 20, mutation probability = 0.5, and crossover probability = 0.9. Multi-point mutation is adopted here by selecting two genes (points) randomly.

In-sample testing
Tenfold cross-validation is performed on the individual data sets, and the corresponding accuracy and AUC values are reported in Table 3. This is called in-sample testing because the same data set is used for training and testing. Four experiments are performed, one for each of the four data sets. Here the fitness evaluation, mutation, and crossover operators are executed in the same way as mentioned in Sect. 6.2, but as the same dataset is used for training and testing, in each fold of cross-validation 10% of the data is considered the test set, 5% the validation set, and the remaining data the training set. Further, Table 4 illustrates the performance of the individual classifiers in terms of accuracy and AUC under the same 10-fold experimental setup. It has been observed that, in the majority of cases, the individual classifiers underperformed compared to the proposed model in terms of either accuracy or AUC.

Out-of-sample testing
Here out-of-sample testing is performed: one data set is used for training and a different data set for testing. This process confirms the independence between the test and training sets, and it allows the proposed model's performance in a new environment to be analyzed. In total, 4 × 3 = 12 experiments are conducted, and the corresponding AUC and accuracy values are reported; each training set is tested against the three other data sets. All the results are depicted in different tables, where bold numbers indicate the best results in the comparison; if more than one method attains the best value, all of these best values are set in bold. The values are rounded off and reported to two decimal places.
Here a comparison between the multi-objective optimization (MOO)-based voting ensemble model and a simple voting ensemble model is performed (see Table 5) to illustrate that the MOO-based method performs well in the majority of the cases. The simple voting ensemble model has been constructed with the same set of six classifiers mentioned earlier, where the predictions for each class or label are summed up and the label with the majority vote of the contributing models is selected as the final label of the given test sample. In the simple voting ensemble model, soft voting criteria have been used, where the class prediction is carried out with the largest summed probability from the multiple classifiers. From Table 5, we can observe that in 7 cases the two methods attain similar results, whereas in 10 cases the MOO-based model attains better results than the simple voting ensemble model. There are another 7 cases where the simple voting ensemble model attains values 1 or 2% better than the MOO-based model. Based on all these results, the MOO-based voting ensemble has been chosen over the simple voting ensemble.

Comparison with state-of-the-art methods
Further, the comparison of the out-of-sample testing results of the proposed technique with different state-of-the-art techniques is depicted in Table 6 (accuracy comparison) and Table 7 (AUC comparison). In Table 6, it is visible that the proposed method attains the best results in 5 cases and the work of Shayanfar et al. (2017) attains the best results in 4 cases; in 3 more cases both methods perform similarly. We have compared the accuracy values attained by the proposed approach only with those obtained by Shayanfar et al. (2017), because the other methods mentioned in Table 7 have not reported accuracy values in their work. In terms of AUC, the proposed method achieves better performance in most of the cases (8 cases), while Li et al. (2021), Rögnvaldsson et al. (2015), and Shayanfar et al. (2017) give better results in 4, 3, and 6 cases, respectively.
To analyze the performance of each individual dataset as a training set, with the other three datasets as test sets, the average of the three computed results corresponding to the individual test sets is considered. In this way, we calculated average accuracy and AUC values for Data746, Data1625, Data Schilling, and Data Impens. Figures 3 and 4 compare these averages with those of the Li et al. (2021) model. From Fig. 5, it can be seen that for the Data746 and Data1625 data sets, the average AUC values attained by the proposed approach are better than those of the Rögnvaldsson et al. (2015) model, and for Data Schilling, the average performances of both models are equal. Figure 6 illustrates that for the Data746, Data1625, and Data Schilling data sets, the average AUC values attained by the proposed approach are better than those of the Shayanfar et al. (2017) model. Across these four figures, a total of 16 results are reported, and the proposed model's performance is better in 10 of them. So, with respect to average accuracy and AUC values, the MOO-based method has achieved efficient prediction performance.

Comparison with respect to OETMAP (Gök and Özcerit 2013) feature set
To compare the performance of the feature set used in the proposed model with other feature encoding techniques, we applied the OETMAP feature encoding method (Gök and Özcerit 2013). Several feature encoding processes exist for amino acid encoding; among them, OETMAP is a very popular encoding introduced by Gök and Özcerit (2013). In OETMAP, each amino acid is converted into a one-hot coded vector of 20 features plus 10 physicochemical properties. Thus the total feature length for each data set is 8 × (20 + 10) = 240. The results obtained with the OETMAP feature set, using the out-of-sample testing setup, are reported in Table 8. Here also the proposed model performed better in the majority of cases.

Conclusion
In this paper, a multi-objective optimization-based classifier ensemble technique using the search capability of NSGA-II is developed for the prediction of protease cleavage sites in amino acid sequences of HIV-1. Physicochemical and structure-based feature extraction methods are used to extract the initial feature vectors for the corresponding octamer sequences; each initial feature vector consists of 152 elements. The proposed method is evaluated on four benchmark data sets, and its average prediction accuracy and AUC values are 92% and 96%, respectively. In an out-of-sample testing environment, the proposed method reveals its efficacy in comparison to different existing models by producing better accuracy and AUC values in the majority of cases. Using multi-objective optimization with a majority voting ensemble of classifiers, it has been possible to utilize the power of the individual classifiers as required, because the selection of the classifiers is auto-generated by the search capability of the proposed method. Moreover, the complexity of the model is not very high due to the static classification ensemble technique: initially, the class of each individual test sample is predicted by each individual classifier and stored in a file, and then MOO only generates different combinations of the predicted results to construct the final classifier ensemble, without executing the individual classifiers repeatedly.
Here we have extracted the features based on physicochemical and structural properties. In future work, we will apply multi-category feature extraction including the sequence-based property with this present multi-objective model.
Author Contributions SP conceived the idea, and conducted the experiment(s), SP, SS, and ST analyzed the results. SP wrote the manuscript with valuable input from SS and ST. All authors reviewed the manuscript.
Funding The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Data availability
The dataset used in the proposed work is available at: https://archive.ics.uci.edu/ml/datasets/HIV-1+protease+cleavage