Application of nature-inspired soft computing techniques for gene selection: a novel framework for classification of cancer

A modified Artificial Bee Colony (ABC) metaheuristic optimization technique is applied to cancer classification; by selecting informative genes, it reduces the classifier’s prediction errors and allows faster convergence. The Cuckoo Search (CS) algorithm is used in the onlooker bee phase (the exploitation phase) of ABC to boost performance by maintaining the balance between exploration and exploitation. The modified ABC algorithm is tuned using a Naïve Bayes (NB) classifier to further improve the accuracy of the model. Independent Component Analysis (ICA) is used for dimensionality reduction. In the first step, the reduced dataset is optimized using the modified ABC; in the second step, the optimized dataset is used to train the NB classifier. Extensive experiments were performed to compare the proposed algorithm with well-known metaheuristic algorithms, namely the Genetic Algorithm (GA), used within the same framework for the classification of six high-dimensional cancer datasets. The comparison showed that the proposed model with the CS algorithm achieves the highest performance: maximum classification accuracy with fewer selected genes. This demonstrates the effectiveness of the proposed algorithm for cancer classification, which is validated using ANOVA.

data sets are difficult to interpret, and data interpretation is essential for the treatment of cancer patients. Initial studies of the dimensionality reduction problem found that the best test error can be attained with a limited number of features (genes) that directly affect the accuracy rates [8,9]. In a large feature space, it is common to have genes that are irrelevant or redundant with respect to the class labels.

Issues such as irrelevant and redundant features can degrade classification performance. Therefore, this study developed an approach for gene selection that counters the drawbacks mentioned above. Defining an optimal decision framework for gene selection from microarray data is an essential but difficult task in machine learning and medical science, because the characteristics of each data set are different. Recently, hybrid machine learning techniques have gained popularity; with suitable combinations, they effectively obtain a few relevant and informative genes (features) [8]. Various studies have applied different data mining techniques in different combinations to the problem of identifying significant genes. Previous research suggests that nature-inspired algorithms are more suitable for finding an optimal set of features from large and complex data of different domains. Techniques based on metaheuristic optimization range broadly from local search procedures to learning processes [10]. By guiding candidate solutions through the search space, nature-inspired algorithms bring out their best capabilities and are able to obtain the best solutions.

in high-dimensional data classification problems. The proposed technique selects an optimal minimum number of top-ranked genes that provide good classification. The experimental results on four well-known microarray datasets showed that the performance of the proposed algorithm was better than that of other published algorithms for the same problem [12].

A hybrid approach is used to reduce the computational time and to take advantage of the benefits of different dimension reduction methods [13,14]. Hybrid approaches combine different feature selection and extraction methods to reduce the dimension of the data. Different researchers have applied different combinations of algorithms according to the requirements of different data sets. Hameed et al. compared the performance of three well-known nature-inspired metaheuristic algorithms, namely binary particle swarm optimization (BPSO), genetic algorithm (GA) and cuckoo search algorithm (CS), on twelve cancer data sets for gene selection and classification. According to the study, BPSO outscored GA and CS in terms of accuracy, while CS picked fewer genes and was less computationally complex than GA and BPSO [15].

Some researchers created a classification framework and used it to categorise cancer gene expression patterns with various hybrid gene selection algorithms based on nature-inspired metaheuristic methodologies, with better outcomes than single approaches. That work not only selected very few features but also reduced the computational cost by combining new techniques that produced good classification performance. The comparison results indicate that the proposed hybrid approaches were applied successfully and outperform other existing methods in terms of accuracy [16][17][18]. Therefore, in this paper a hybrid approach based on nature-inspired metaheuristic techniques is proposed that can produce an optimal feature space with significant genes to improve the classification performance.

shown that the CS is an effective approach for solving numerous optimization problems in different domains. The CS algorithm has been widely used for different research problems, but at the same time for

follows. Section two describes the details of the feature selection algorithm, and the experimental setup is provided in Section three. Section four discusses the experimental results, and Section five presents the conclusion. Figure 1 shows the proposed framework.

Recently, the nature-inspired optimization algorithm ABC has become popular for the gene selection problem in microarray data. The steps of the ABC approach are based on the behavior of bees searching for the best food source (gene

II. The best nests with high-quality eggs (solutions) carry over to the next generations.

space is not good compared to the ABC algorithm [43,44]. That is why CS reaches local optima too quickly and suffers from premature convergence, which is the main issue with the CS algorithm [45]. Therefore, to maintain the balance between exploitation and exploration, the CSABC algorithm combines the ABC and CS algorithms. In the proposed algorithm, the CS algorithm is adopted in the onlooker bee phase as an exploitation process to find the best solution with less computational time by improving the information sharing between onlooker and employed bees.
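As a rough illustration of how a cuckoo-search move can be placed inside an onlooker bee phase, the sketch below is a minimal example under stated assumptions and not the authors' implementation. It assumes continuous food-source positions in [0, 1], a strictly positive fitness to maximise, Lévy steps drawn with Mantegna's algorithm, and greedy replacement. The names `levy_step` and `onlooker_phase` and the values `beta=1.5` and `step_scale=0.1` are hypothetical choices made only for this example.

```python
import math
import random


def levy_step(beta=1.5, rng=random):
    """Draw one heavy-tailed step length with Mantegna's algorithm."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma_u = (num / den) ** (1 / beta)
    u = rng.gauss(0, sigma_u)
    v = rng.gauss(0, 1)
    while v == 0:  # guard against a (practically impossible) zero divisor
        v = rng.gauss(0, 1)
    return u / abs(v) ** (1 / beta)


def onlooker_phase(sources, fitness, bounds=(0.0, 1.0), step_scale=0.1, rng=random):
    """One onlooker pass that uses a Levy-flight move (assumed variant).

    Onlookers choose a food source with probability proportional to its
    fitness, perturb it relative to the current best source, and keep
    the candidate only if it improves the fitness (greedy selection).
    """
    lo, hi = bounds
    sources = [list(s) for s in sources]  # work on a copy
    fits = [fitness(s) for s in sources]  # assumed strictly positive
    for _ in range(len(sources)):
        best = sources[max(range(len(sources)), key=lambda k: fits[k])]
        i = rng.choices(range(len(sources)), weights=fits, k=1)[0]
        candidate = [
            min(hi, max(lo, x + step_scale * levy_step(rng=rng) * (x - b)))
            for x, b in zip(sources[i], best)
        ]
        cand_fit = fitness(candidate)
        if cand_fit > fits[i]:
            sources[i], fits[i] = candidate, cand_fit
    return sources, fits
```

Because of the greedy replacement, no food source ever gets worse during a pass; the heavy-tailed step only decides how far a candidate is thrown from its parent. For gene selection, the continuous positions would still have to be mapped to gene subsets, which this sketch leaves out.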

Furthermore, the idea of using ABC together with a CS algorithm for gene selection is based on previous research [46].

Therefore, the proposed approach uses a combination of nature-inspired metaheuristic algorithms to reduce the disadvantages of the ABC approach, such as premature convergence and computational time, by maintaining a balance between exploration and exploitation. Figure 2 shows

Because the features of microarray data are continuous, for the calculation of class-conditional probability,

To evaluate the performance of the proposed approach, this research used six benchmark microarray gene expression cancer data sets, namely Colon cancer (Alon et al., 1999), Acute leukemia (Golub et al., 1999), Prostate cancer (Singh et al., 2002), Lung cancer-II (Gordon et al., 2002), High-grade Glioma data (Nutt et al., 2003)

Levy walk generates some new solutions around the best solutions obtained so far, which accelerates the local search. Here, has been set to 0.85 from experience. Next evaluate its quality with fitness

data, which shows that the proposed approach can discriminate between the two classes.

In the case of multiclass classification of the Leukemia 2 dataset, the proposed approach obtained slightly lower classification accuracy than ICA+GBC but better classification accuracy than the other three methods (Table 8). However, it has the advantage of obtaining the smallest number of informative and predictive genes: with the NB classifier it selected 15 genes for the highest classification accuracy, fewer than the 18 genes obtained by the ICA+GBC method.

Table 9 and Figure 9 summarize the classification accuracy and error rate of NB and SVM with the proposed approach (CSABC), using LOOCV iterations and the same parameter settings. We can easily see from the

were 7.67% with 12 genes and 6.33% with 15 genes, respectively, so in terms of classification accuracy the NB classifier is best, but the SVM classifier is best in terms of the number of selected genes for the Leukemia 2 data.

Classification. This method successfully reduced the misclassification errors during the classification process on six cancer microarray data sets. Experimental results show the superiority of the proposed approach in terms of classification accuracy with respect to two factors: the smallest selected gene set and the best AUC score for unbiased accuracy. Therefore, metaheuristic nature-inspired algorithms act as a strong tool for solving microarray cancer data classification problems.
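The leave-one-out cross-validation (LOOCV) of an NB classifier referred to in the results can be sketched as follows. This is a minimal illustration and not the code used in the experiments. It assumes a Gaussian class-conditional density for each continuous feature, which is a common choice for continuous data but is an assumption here, and a small variance floor for numerical stability. The function names `fit_gaussian_nb`, `predict` and `loocv_accuracy` and the toy data in the usage lines are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate the log prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        variances = [
            sum((v - m) ** 2 for v in c) / len(c) + var_floor  # floor is an assumption
            for c, m in zip(cols, means)
        ]
        model[label] = (math.log(len(rows) / len(X)), means, variances)
    return model


def predict(model, row):
    """Return the class with the largest log posterior for one sample."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
    return max(model, key=log_posterior)


def loocv_accuracy(X, y):
    """Hold out each sample once, train on the rest, and count the hits."""
    hits = 0
    for i in range(len(X)):
        model = fit_gaussian_nb(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)


# Toy usage with two well-separated classes (made-up values, not study data).
X_toy = [[0.1, 1.0], [0.2, 1.1], [0.15, 0.9], [2.0, 3.0], [2.1, 3.2], [1.9, 2.9]]
y_toy = [0, 0, 0, 1, 1, 1]
toy_accuracy = loocv_accuracy(X_toy, y_toy)
```

The error rate reported alongside the accuracy is simply one minus this value. LOOCV suits microarray data because the number of samples is small, so each model is trained on almost all available samples.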

In future work, more than one classifier will be incorporated with the proposed feature selection technique to enhance the classification accuracy of the proposed work and to examine the selective classifier mode.