Metaheuristic Model of Gene Selection for Deep Learning Early Prediction of Cancer Disease Using Gene Expression Data

doi:10.21203/rs.3.rs-2896430/v1


Research Article


https://doi.org/10.21203/rs.3.rs-2896430/v1

This work is licensed under a CC BY 4.0 License

Version 1


Early-stage cancer prediction is a topic of major interest in medicine because it enables accurate and efficient action toward successful treatment. Most cancer datasets contain the expression levels of many genes as features but comparatively few samples, so similar features must first be eliminated to let prediction algorithms converge faster. These features (genes) make it possible to identify the disease, choose the best prescription to prevent cancer, and discover differences between techniques. To address this problem, we propose CSSMO, a novel hybrid feature selection technique for cancer prediction. First, Spider Monkey Optimization (SMO) is combined with the Cuckoo Search (CS) algorithm, as CSSMO, so that the strengths of both metaheuristics are used to discover a subset of genes that helps predict cancer at an early stage. To further improve the accuracy of CSSMO, the minimum redundancy maximum relevance (mRMR) filter is applied as a cleaning step to reduce the number of genes in the cancer datasets. Next, the selected gene subsets are classified using Deep Learning (DL) to identify the groups or classes associated with a particular cancer. Six datasets are used to analyze the performance of the proposed approach in terms of cancer sample classification and prediction, using Recall, Precision, F1-Score, and the confusion matrix. The proposed gene selection method with DL achieves much better prediction accuracy than other existing DL and machine learning models on large gene expression datasets.

Deep Learning (DL)

Cuckoo Search (CS)

Spider Monkey Optimization (SMO)

minimum Redundancy Maximum Relevance (mRMR).

Successful cancer therapy has remained a significant challenge despite enormous improvements in healthcare over the past century, and cancer is the second leading cause of mortality globally, after cardiovascular disease [1]. According to data from the World Health Organization (WHO), cancer is a leading cause of death worldwide. Of the estimated 18.1 million cancer cases worldwide, 9.3 million involved men and 8.8 million involved women. The most common types of cancer are lung, liver, prostate, colon, breast, and rectal cancer. Figure 1 depicts the anticipated global number of new cases in 2022, broken down by age group and gender. Clinical research and the treatment of many diseases are significantly influenced by the gene expression levels in an organism. Gene expression profiling, also known as gene-chip technology, is an advanced scientific tool used by many researchers to study the expression levels of several genes in an abnormal sample. It reflects the possible spectrum of the genome and so helps to analyze and investigate the root cause of diseases. Problems related to gene expression profiling can be addressed using DNA microarray and RNA-sequencing (RNA-Seq) based platforms [2]. The use of gene expression profiles in genetic research is a potent strategy that presents the data scientist with several analytical difficulties [3]. Advanced machine learning approaches for biomarker discovery use gene expression data to locate the relevant expressed genes. The development of trustworthy cancer biomarkers is crucial for the field of clinical diagnostics [4]. Microarray gene expression technology, the most popular instrument used in disease prediction, makes it possible to categorize and analyze a variety of genetically linked disorders. More than two hundred forms of cancer and their subtypes are caused by abnormal mutations of genes in DNA that lead to uncontrolled cell growth.
To take preventive measures, improve early prognosis methods, and initiate clinical treatment, it is important to identify the changes in the complete set of DNA, that is, the genome. Gene expression profiling technologies such as microarrays and RNA-Seq based platforms, combined with machine learning and deep learning, are useful in managing and isolating the genes responsible for inherited diseases [5, 6]. They help to design suitable treatments that suppress the expression of genes linked with inherited diseases during the early development of the organism. Gene expression profiles also help to strengthen the genes responsible for developing healthy crops in bulk, to manage the effective use of pesticides and fertilizers, and to design nationwide agriculture policies. Gene expression profiles generate high-dimensional data, which is a major issue to deal with before creating the actual classifier, because both the accuracy and the computational cost of the classifier are affected. The two families of methods that decrease the dimensionality of gene expression data and overcome the related problems are feature selection and feature extraction [7]. Feature extraction produces a smaller set of new features that condenses the properties of the high-dimensional features as far as possible, whereas feature selection filters out irrelevant and redundant features and retains the critical informative ones. Standard microarray datasets consist of thousands of gene expression values but only a few hundred samples. Each gene expression value quantifies the level at which that gene is expressed in the given tissue sample. This makes it possible to compare abnormally expressed genes in the tissue with those in normal tissue, which provides good insight into diagnosis, treatment, further clinical management, and the prediction of future samples.
Feature selection and feature extraction are among the few scientific approaches to the curse of dimensionality [7]. Optimization techniques, linear algebra, and core statistics are the fundamental tools of most of the algorithms developed for gene expression data analysis [8].

Gene expression profiles have been analyzed in bulk through the systematic application of machine learning and deep learning techniques [9, 10]. Results derived from the information recorded in gene expression data have determined the prognosis of the majority of illnesses and of different forms of cancer. Expressed-gene data can be examined with a variety of machine learning approaches, but current findings obtained with deep learning algorithms are more precise and flexible, since these algorithms are good at identifying and categorizing useful genes. The isolated genes may be useful for further clinical trials and diagnosis of the disease.

Malfunctioning of the apoptosis process, together with the uncontrolled growth and division of the body's cells, is the primary cause of almost all types of cancer [11]. The spread of abnormal cells through the basic organ systems of the human body, scientifically termed metastasis, is the major cause of death from cancer. The scientific community is still looking for the specific reasons why a person develops cancer. The known unavoidable risk factors are family history, age, and climate change, and the avoidable factors include exposure to radioactive substances or certain harmful chemicals. Consumption of tobacco and alcohol, unhealthy diet, and physical inactivity are the prime risk factors worldwide and are also leading contributors to noncommunicable diseases [12]. Worldwide, sixteen percent of deaths are due to some type of cancer, and the number of cases will reach twenty-five million by the end of 2030. Hence, early screening is important before a cancer damages vital organs, because it is very difficult to treat once it has invaded them. Most cancers have a moderately high chance of being cured if diagnosed and treated at an early stage [13].

The main contributions of this work are as follows:

A hybrid metaheuristic approach, CSSMO, which combines SMO and CS, has been designed for gene selection with a DL classifier, so that cancer is classified accurately even when patients are at an early stage.
The CSSMO results are enhanced by adopting the mRMR filtering method to reduce the dimensionality of the gene expression data.
The different classes associated with a disease are identified using deep learning and assessed with various evaluation criteria.
The deep learning model combined with the proposed hybrid approach achieves much better accuracy than other existing DL models.

Figure 2 shows the complete framework of the proposed model.

1.1 Paper organization

This paper focuses on identifying compact gene groups using CSSMO for efficient deep-learning prediction of cancer classes. The remaining sections of the paper are arranged as follows. Section 2 reviews previously published work. Section 3 presents the background on the CS and SMO algorithms and Deep Learning (DL), together with the proposed CS- and SMO-based feature identification algorithm. Section 4 explains the complete experimental setup and the parameter settings of the proposed algorithm. Section 5 outlines the empirical evaluation and reports the outcomes. Finally, Section 6 summarizes the paper.

Deep learning methods can be used for the prediction of cancer data, although this poses several challenges. Joseph et al. explored various problems related to implementing a deep learning model for cancer classification from gene expression data. Transforming the 1D gene expression values into a 2D image takes the overall features of the genes into account, and a convolutional neural network (CNN) then evaluates the features required for classification. The proposed model achieved a steady result, with an accuracy of 95.65% on data covering 33 different cancer types [14].
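The 1D-to-2D transformation mentioned above can be illustrated with a short sketch. The exact layout used by Joseph et al. is not described here, so the row-major fill and zero padding below are assumptions, and `to_square_image` is a hypothetical helper name.

```python
import math

def to_square_image(expression, pad_value=0.0):
    """Arrange a 1D expression vector in the smallest square grid that holds it.

    Assumption: row-major fill, trailing cells padded with pad_value.
    """
    values = list(expression)
    if not values:
        return []
    side = math.isqrt(len(values) - 1) + 1  # smallest side with side*side >= len
    padded = values + [pad_value] * (side * side - len(values))
    return [padded[r * side:(r + 1) * side] for r in range(side)]

# Ten expression values fit in a 4 x 4 grid with six padded cells.
image = to_square_image(range(10))
for row in image:
    print(row)
```

A grid like this can then be fed to a 2D convolutional layer; how genes are ordered before reshaping affects which genes end up as spatial neighbours.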

Mostavi et al. successfully implemented several CNN models, namely 1D-CNN, 2D-Vanilla-CNN, and 2D-Hybrid-CNN, which are based on unstructured gene expression input and can classify a sample not only as tumor or non-tumor but also into its respective cancer type or normal. All the classifiers were run on 10,340 samples of 33 distinct cancer types from The Cancer Genome Atlas, which resulted in an average accuracy of 94.45% [15].

The backpropagation method is the universal technique for training artificial neural networks. Feature selection in gene expression data has been carried out with hybrid models that combine backpropagation neural networks with fast Genetic Algorithms (GA), which address the sensitivity of such algorithms to local minima in the search space. Most gene selection algorithms select the genes with the highest ranking; however, many low-ranked genes, if selected properly, can increase classification performance. The Hybrid Fast GA-BPN algorithm identifies a small informative gene subset capable of improving the overall classification accuracy. Using microarray technology for genetic research is very effective, but the curse of dimensionality creates many analytical challenges for the data scientist [16].

Zeebaree et al. tackled the main challenges of classifying cancer microarray data with deep learning algorithms based on a convolutional neural network (CNN), which showed improved accuracy and extraction of informative genes compared with mSVM-RFE-iRF and varSelRF [17]. Feature selection is a crucial task in any classification problem, because the overall accuracy of the classifier depends on the relevance between the properties of the selected genes and the properties of each class.

Mao et al. used the randomization test (RT) and partial least squares discriminant analysis (PLSDA) to measure the relevance of the selected genes to the target class. The suggested model provided satisfactory results when classifying the data into four classes, using principal component analysis (PCA) and multiple linear regression (MLR) [18].

Cancer microarray data is represented as a non-square matrix because the number of features is very large compared with the number of samples (the curse of dimensionality), which needs to be resolved by employing a suitable feature extraction method. Generally, distance-based feature selection techniques are based on Euclidean distance (PLSDA). Zhong et al. incorporated the Bhattacharyya distance to determine the class of a gene in a binary classification model. Genes were selected based on the lowest misclassification rates obtained with a support vector machine. The method improved the accuracy and processing time compared with SVM-RFE and SWKC/SVM [19].
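As an illustration of how a Bhattacharyya distance can score a single gene in a two-class setting, the sketch below uses the closed form for two univariate Gaussians. The Gaussian modelling and the toy expression values are assumptions made for this example and are not taken from Zhong et al.

```python
import math
import statistics

def bhattacharyya_distance(class_a, class_b):
    """Bhattacharyya distance between two 1D samples, each modelled as a Gaussian.

    Assumption: both classes have non-zero variance for this gene.
    """
    mean_a, mean_b = statistics.mean(class_a), statistics.mean(class_b)
    var_a, var_b = statistics.pvariance(class_a), statistics.pvariance(class_b)
    mean_term = 0.25 * (mean_a - mean_b) ** 2 / (var_a + var_b)
    spread_term = 0.5 * math.log((var_a + var_b) / (2.0 * math.sqrt(var_a * var_b)))
    return mean_term + spread_term

# Hypothetical expression values of two genes in tumor and normal samples.
separating_gene = ([5.1, 4.8, 5.4, 5.0], [2.0, 2.3, 1.9, 2.2])
overlapping_gene = ([3.0, 3.4, 2.8, 3.1], [3.1, 2.9, 3.3, 3.0])

d_separating = bhattacharyya_distance(*separating_gene)
d_overlapping = bhattacharyya_distance(*overlapping_gene)
print(d_separating > d_overlapping)  # the separating gene scores higher
```

Ranking genes by such a distance favours genes whose class-conditional distributions overlap least.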

International communities stand united in the fight against cancer. One out of six deaths worldwide is due to cancer. Thus, to help researchers across the globe, different datasets related to types of cancer have been made publicly available to expedite the search for suitable, affordable treatment. Comparative studies of the accuracy of available algorithms on known datasets may give some insight to the research community. Tabares et al. carried out a comparative study on the 11 tumor database and recorded accuracies of 90.6% and 94.43% for logistic regression and convolutional neural networks, respectively. The proposed algorithm based on deep learning methods shows more promising results for microarray data analysis [20].

Medical treatment of patients diagnosed with cancer depends not only on the gene that caused it but also on the type of cancer. Hence, classifying the type of cancer plays a vital role in the clinical management of the disease. Although many available microarray data analysis algorithms have already shown remarkable results, they have their limitations too. Hence, improved techniques based on machine learning algorithms are required for efficient implementation. Salem et al. implemented Information Gain (IG) and a Standard Genetic Algorithm (SGA) to classify human cancer based on gene expression profiles. The Information Gain algorithm is used for feature selection, after which feature reduction and cancer type classification are achieved through the Genetic Algorithm and Genetic Programming, respectively. The hybrid technique based on IG and SGA improved the accuracy of the classifier [21].

Deep learning methods are becoming more popular as researchers use them in a wide range of areas; however, their use in the classification of cancer microarray data is limited, and many related domains remain untouched because of the scarcity of training datasets. The hybrid approach with deep learning techniques is a newly emerging area that aims to overcome the limitations of the available algorithms. Tumors are of two types, malignant and benign. Early detection and classification of tumors are important in preventing their harmful effects. Liu et al. suggested two deep learning approaches for categorizing microarray data, Sample Expansion Based SAE and 1DCNN. The authors claimed an improvement in classifier accuracy after testing the data with the proposed algorithms [22].

Elastic net is very effective in variable selection and regularization. Its improved performance under weak regularization is a unique property that has enabled its use in multidisciplinary fields. Wang et al. classified microarray data of leukaemia and colon cancer using a hybrid technique, Adaptive Elastic Net with Conditional Mutual Information (AEN-CMI). The proposed hybrid algorithm outperforms traditional methods not only by improving the accuracy but also by using the minimum number of genes [23].

Microarray data analysis is one of the advanced and reliable tools in medical science, and it strives to link the cause of a disease such as cancer with an informative gene. Medjahed et al. developed a two-step algorithm. It is based on Support Vector Machine Recursive Feature Elimination to extract the genes and on the Binary Dragonfly Algorithm (BDA) to improve the performance of the former. For the first time, the authors applied the BDA metaheuristic to microarray data analysis, which enhanced the accuracy of the classifier with a minimum number of genes [24].

Microarray data analysis of tumour requires a sufficient number of training models to build a classifier with better accuracy. Liao et al. tackled this problem by incorporating the different datasets of the multiple types of cancer and applied the Multi-Task Deep Learning (MTDL) algorithm to analyse the microarray data. MTDL resolved the issues related to the scarcity of the data and significantly enhanced the accuracy of a classifier, towards identifying the type of cancer when tested on multiple cancer datasets [25].

Prostate cancer in males causes seven percent of total cancer deaths worldwide. The research community is still debating the reliability of the prostate-specific antigen (PSA). Hence, there is a need for a biomarker that is accepted by the community. Hou et al. created a diagnostic and prognostic forecast model for prostate cancer and discussed the effectiveness of the biomarker C1QTNF3. The authors also expect the proposed biomarker to be applied in other parallel domains to map oncogenes [26].

Mapping the type of disease with the best informative gene is the central idea behind using microarray data for the classification. Jansi et al. implemented two-stage algorithms based on Mutual Information Genetic Algorithm (MI-GA). Screening of potential genes with high mutual values is followed by creating an optimal set of genes through Genetic Algorithm and SVM. The proposed method shows improvement in accuracy when applied on datasets of different type of cancers [27].

Filtering only the useful features capable of affecting the overall classifier is the main idea behind selecting features from the available search space with appropriate algorithms. Chen et al. implemented binary particle swarm optimization and CCFS to design a cost-effective, confidence-based feature selection method. First, the authors filtered the most promising features by using the statistical parameters available within the features as well as within the different categories. Second, discriminant statistical parameters were built into the model to develop the best possible classifier [28].

Redundant and irrelevant features reduce the classifier's accuracy and raise the computational cost, which leads to mismatches between the training set and the desired class. Rouhi et al. proposed a hybrid approach that first reduces the dimension of the features and then applies the Advanced Binary Ant Colony (ABACOh) metaheuristic algorithm. The hybrid approach enhanced the accuracy of the classifier compared with available methods [29].

The study of genetic mutations inside the cell is one of the primary applications of microarray data, which provides the spectrum of expressed genes for further analysis. Jain et al. proposed a two-phase algorithm that combines Correlation-based Feature Selection (CFS) with improved Binary Particle Swarm Optimization (iBPSO). In the later stage, it applies a Naive Bayes classifier with stratified 10-fold cross-validation. It accelerates convergence and provides a highly accurate classifier on seven out of ten datasets covering different types of cancer [30].

Advances in computing facilities and the vectorization of the iterative steps involved in algorithms enable data scientists to extract more useful insights from data. They also help in experimenting with a large number of possible feature subsets of microarray data. Several parallel computing packages, such as Hadoop, Spark, and Mahout, provide a wide range of tools, for example for extracting or selecting features, to tune the classifier to the desired level of accuracy. Venkataramana et al. implemented a parallelized hybrid feature selection (HFS) method. It not only incorporates the statistics of the feature subsets but also ranks them to select the most effective, informative genes. The proposed method achieved an accuracy of 97% on datasets related to gastric cancer and improved the accuracy to some extent compared with available methods [31].

Alzaqebah et al. presented a study demonstrating the use of cuckoo search for feature selection. The study combined cuckoo search with a memory-based mechanism that saves optimal solutions (feature vectors) in order to find features that enhance classification accuracy. The proposed algorithm was then compared with the original algorithm on a microarray dataset. The study concluded that the proposed algorithm produced outcomes superior to those of the original and contemporary algorithms [32].

Swathypriyadharsini et al. proposed a methodology for identifying co-expressed genes that combines triclustering methods with a hybridized cuckoo search (CS) algorithm and clonal selection. To ascertain the biological importance of the genes in the generated clusters, the technique then makes use of gene ontology, functional annotation, and transcription factor binding site analysis. The experimental results of this approach were shown to be superior to both conventional cuckoo search techniques and other current triclustering algorithms [33].

Zhao et al. proposed a new search algorithm, the Elite Hybrid Binary Cuckoo Search (EHBCS) algorithm, which employs feature weighting and an elite strategy to improve on Cuckoo Search. The proposed algorithm outperformed the binary genetic algorithm and the binary particle swarm optimization algorithm in terms of standard deviation, sensitivity, specificity, precision, and F-measure [34].

Othman et al. developed a hybrid multi-objective cuckoo search (CS) algorithm that includes innovative operators for gene selection, namely single crossover and double mutation operators. The suggested method was assessed on seven freely available high-dimensional cancer microarray datasets. According to the experimental findings, it selected fewer relevant genes while outperforming the multi-objective cuckoo search and classic cuckoo search algorithms [35].

Pandey et al. presented a novel feature selection technique based on the binomial cuckoo search metaheuristic. To increase stability, a hybrid data transformation procedure was developed that combines Fast Independent Component Analysis (FICA) and Principal Component Analysis (PCA). The suggested method was evaluated on 14 benchmark datasets from the UCI repository and compared with other methods such as the binary cuckoo search, the binary bat algorithm, the binary gravitational search algorithm, the binary Whale Optimization with Simulated Annealing, and the binary Grey Wolf Optimization [36].

Swathypriyadharsini et al. conducted a comparative study exploring two commonly used bio-inspired optimization algorithms namely, Cuckoo Search (CS) and Particle Swarm Optimization (PSO) for triclustering the microarray gene expression data. Both the algorithms were applied to three real-life three-dimensional datasets with mean square residue as fitness function. The experimental results in this study showed CS having better computational efficiency compared to PSO [37].

Scaria et al. proposed a user-friendly rule-based classification model for processing microarray gene data. The cuckoo search optimization algorithm was used to form classification rules, which were then pruned by associative rule mining. The study concluded that the performance of the proposed approach was adequate in terms of accuracy, sensitivity, specificity, and time consumption [38].

Balamurugan et al. presented a study showcasing use of shuffled cuckoo search with Nelder-Mead (SCS-NM) for finding significant biclusters in large expression data. This study used shuffling and simplex NM to diversify and intensify the search space to gain an edge over compared algorithms. The proposed work was benchmarked on four benchmark datasets where it showed significant improvement in fitness value when compared with the swarm intelligence technique and various bi-clustering algorithms [39].

Boushaki et al. performed a study involving hybridization of parallel cuckoo search optimization (PCSO) algorithm and Naïve Bayes (NB). In order to locate feature subsets that optimize accuracy, this led to the development of the parallel cuckoo search with naive bayes (PCSNB) wrapper technique, which coupled the exploration capacity of PCSO with the speed of naive bayes. On seven distinct datasets, the suggested method was evaluated and compared against existing metaheuristic methods. The experimental findings demonstrated more accurate prediction than previous algorithms and were relatively efficient [40].

Pandey et al. presented a metaheuristic technique that combines k-means clustering with an enhanced cuckoo search to expand the potential of conventional clustering techniques. Three microarray datasets were used to assess the effectiveness of the suggested approach. The study found that the suggested algorithm outperformed the benchmarked state-of-the-art approaches [41].

Kulhari et al. explored a novel metaheuristic Gauss-based cuckoo search clustering method to broaden the capabilities of traditional algorithms for unsupervised data classification. The proposed algorithm was tested on three microarray datasets. The study concluded, through results-driven experimentation, that the proposed algorithm performed better than existing methods [42].

3.1 Deep Learning

Deep learning has lately received a lot of interest in the world of machine learning. Convolutional networks, deep autoencoders, and deep belief networks are just a few examples of the hierarchical learning structures used in this method. For tasks like pattern categorization and representation learning, these designs have several input processing layers. One popular approach for training neural network weights is the backpropagation algorithm [7–10]. This algorithm propagates the error from the output layer back through the network and adjusts the weights in each layer based on the error gradient with respect to the activation of the previous layer, with the aim of lowering the total error of the network. The backpropagation algorithm is widely used for supervised learning in deep neural networks, but it has limitations in terms of convergence speed and the possibility of getting trapped in local optima [15]. To address these issues, we apply the proposed algorithm before training the deep neural network architecture for prediction tasks.
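The backward propagation of the error can be made concrete with a deliberately tiny network. The sketch below is not the architecture used in this work: it assumes one tanh hidden unit, one linear output, a squared-error loss, and arbitrary illustrative values for the weights, input, and learning rate.

```python
import math

def forward(w1, w2, x):
    """One hidden tanh unit followed by a linear output."""
    h = math.tanh(w1 * x)
    return h, w2 * h

def loss(w1, w2, x, target):
    """Squared error L = 0.5 * (y - target)^2."""
    return 0.5 * (forward(w1, w2, x)[1] - target) ** 2

def backward(w1, w2, x, target):
    """Propagate the error from the output back to each weight (chain rule)."""
    h, y = forward(w1, w2, x)
    d_y = y - target                # dL/dy at the output
    grad_w2 = d_y * h               # dL/dw2
    d_h = d_y * w2                  # error passed back to the hidden unit
    grad_w1 = d_h * (1.0 - h * h) * x  # through tanh' = 1 - tanh^2
    return grad_w1, grad_w2

# Illustrative values (assumptions), followed by one gradient-descent step.
w1, w2, x, target, lr = 0.3, -0.2, 1.5, 0.5, 0.1
g1, g2 = backward(w1, w2, x, target)
new_w1, new_w2 = w1 - lr * g1, w2 - lr * g2
print(loss(w1, w2, x, target), loss(new_w1, new_w2, x, target))
```

Because each step only follows the local gradient, the procedure can stall in a local optimum, which is the limitation noted above.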

3.2 Cuckoo search (CS) Algorithm

Cuckoo Search (CS) is a population-based metaheuristic optimization approach that Yang and Deb introduced in 2009. CS was inspired by cuckoo birds, which lay their eggs in the nests of other bird species and rely on the host birds to hatch and nurture their young. The goal of CS is to iteratively explore the solution space for the best answer to an optimization problem using reproduction, selection, and replacement [43, 44]. In CS, each solution is represented by a cuckoo egg, and each cuckoo egg corresponds to a valid solution to the optimization problem. The process begins with a population of cuckoo eggs created at random. During the search, some cuckoo eggs are replaced with new eggs generated by a random walk, which mimics the reproduction behavior of cuckoo birds. Additionally, a Lévy flight strategy is used to generate new solutions that help the algorithm escape from local optima [45].
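The loop described above can be sketched in a few lines. This is a simplified continuous variant under stated assumptions, not the configuration used in this paper: Lévy steps are drawn with Mantegna's algorithm, abandoned nests are replaced by fresh random ones, and the parameter values (`pa`, `alpha`, `beta`, bounds) and the sphere test function are illustrative choices.

```python
import math
import random

def levy_step(rng, beta=1.5):
    """Draw one heavy-tailed step length (Mantegna's algorithm)."""
    sigma_u = (
        math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
        / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
    ) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0)
    return u / max(abs(v), 1e-12) ** (1 / beta)

def cuckoo_search(objective, dim, n_nests=15, n_iter=200, pa=0.25,
                  alpha=0.01, bounds=(-5.0, 5.0), seed=0):
    """Minimize objective; returns (best nest, best score, initial best score)."""
    rng = random.Random(seed)
    low, high = bounds
    nests = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(n_nests)]
    scores = [objective(nest) for nest in nests]
    initial_best = min(scores)
    for _ in range(n_iter):
        best = nests[scores.index(min(scores))]
        # Levy-flight move relative to the current best; keep only improvements.
        for i in range(n_nests):
            candidate = [
                min(high, max(low, x + alpha * levy_step(rng) * (x - b)))
                for x, b in zip(nests[i], best)
            ]
            candidate_score = objective(candidate)
            if candidate_score < scores[i]:
                nests[i], scores[i] = candidate, candidate_score
        # Abandon each non-best nest with probability pa and rebuild it at random.
        best_index = scores.index(min(scores))
        for i in range(n_nests):
            if i != best_index and rng.random() < pa:
                nests[i] = [rng.uniform(low, high) for _ in range(dim)]
                scores[i] = objective(nests[i])
    best_index = scores.index(min(scores))
    return nests[best_index], scores[best_index], initial_best

def sphere(position):
    """Toy objective: sum of squares, minimized at the origin."""
    return sum(x * x for x in position)

best_nest, best_score, initial_best = cuckoo_search(sphere, dim=3)
print(initial_best, best_score)
```

For gene selection the continuous positions would additionally have to be mapped to binary gene masks, for example by thresholding, which this sketch leaves out.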

3.4 Spider Monkey Optimization (SMO) Algorithm

The SMO algorithm is a metaheuristic optimization technique that belongs to the category of swarm intelligence techniques. It is inspired by the foraging strategy of spider monkeys and involves a population of solutions, or "spider monkeys," that search for an optimal solution by sharing information and continuously updating their positions [46]. The algorithm consists of six phases that are designed to improve the positions of the solutions and avoid stagnation or premature convergence. It starts with randomly generated initial positions for the solutions and updates these positions through iterations. The best solution in the population is called the global leader, and the algorithm divides the population into groups if the global leader does not improve for a certain number of iterations. The best solution in each group is called the local leader. The algorithm has phases for generating trial positions for the solutions, selecting global and local leaders, and handling stagnation and premature convergence in the population and groups [47, 48].
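As an illustration of how trial positions can be generated, the sketch below implements one pass of a local-leader-style update for a single group. The update rule follows the form commonly given in the SMO literature (attraction toward the local leader plus a random perturbation from another group member); it is an assumption here, since the equations are not stated in this section, and the perturbation rate `pr` and the sphere objective are illustrative.

```python
import random

def sphere(position):
    """Toy objective: sum of squares, minimized at the origin."""
    return sum(x * x for x in position)

def local_leader_phase(group, local_leader, objective, pr=0.3, rng=None):
    """One local-leader pass: each monkey tries a new position, keeps it if better."""
    rng = rng or random.Random(0)
    updated = []
    for i, monkey in enumerate(group):
        # Random partner from the same group, different from the current monkey.
        partner = rng.choice([m for k, m in enumerate(group) if k != i])
        trial = list(monkey)
        for j in range(len(monkey)):
            if rng.random() >= pr:  # perturb this dimension
                trial[j] = (
                    monkey[j]
                    + rng.uniform(0.0, 1.0) * (local_leader[j] - monkey[j])
                    + rng.uniform(-1.0, 1.0) * (partner[j] - monkey[j])
                )
        # Greedy selection between the old and the trial position.
        updated.append(trial if objective(trial) < objective(monkey) else list(monkey))
    return updated

rng = random.Random(1)
group = [[rng.uniform(-5.0, 5.0) for _ in range(3)] for _ in range(6)]
leader = min(group, key=sphere)
new_group = local_leader_phase(group, leader, sphere, rng=rng)
print(min(map(sphere, group)), min(map(sphere, new_group)))
```

The global leader phase has the same shape with the global leader in place of the local one; the decision phases that split or re-initialize groups are omitted here.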

3.5 Proposed Methodology CSSMO

We have developed an algorithm called CSSMO, which combines the features of the well-known metaheuristic algorithm CS with the optimization technique SMO to find the best solutions. Figure 3 shows the flow chart of the proposed method. The proposed algorithm involves three phases: first, a preprocessing phase; next, the Cuckoo Search and Spider Monkey based feature selection approach; and finally, the prediction of cancer from the selected genes with Deep Learning classifiers.

4.1 Preprocessing phase

Gene expression datasets are high-dimensional in nature and consist of thousands of genes. Because of this, applying the proposed CSSMO algorithm directly (i.e., without preprocessing) is not efficient. It also makes classification more difficult and reduces accuracy. Thus, mRMR is applied first to eliminate noise and minimize the number of redundant genes. In the mRMR method, genes with minimum redundancy and maximum relevance are selected for cancer prediction. mRMR relies on two mutual information measures: the mutual information between each gene and the cancer classes, which measures relevance, and the mutual information between every pair of genes, which measures redundancy.
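A minimal greedy version of this filter can be sketched as follows. It assumes that expression values have already been discretized and that relevance and redundancy are combined by subtraction (the difference form of mRMR); the combination rule is not stated above, and the gene names and toy values are hypothetical.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log((c * n) / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )

def mrmr(features, labels, k):
    """Greedily pick k genes maximizing relevance minus mean redundancy."""
    relevance = {g: mutual_information(v, labels) for g, v in features.items()}
    selected = [max(relevance, key=relevance.get)]  # most relevant gene first
    while len(selected) < min(k, len(features)):
        def score(gene):
            redundancy = sum(
                mutual_information(features[gene], features[s]) for s in selected
            ) / len(selected)
            return relevance[gene] - redundancy
        remaining = [g for g in features if g not in selected]
        selected.append(max(remaining, key=score))
    return selected

# Toy discretized data: 8 samples, 2 classes.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "gene_a":      [0, 0, 0, 1, 1, 1, 1, 1],  # relevant
    "gene_a_copy": [0, 0, 0, 1, 1, 1, 1, 1],  # relevant but fully redundant
    "gene_b":      [0, 0, 1, 0, 1, 1, 0, 1],  # less relevant, little redundancy
    "noise":       [0, 1, 0, 1, 0, 1, 0, 1],  # irrelevant
}
print(mrmr(features, labels, k=2))  # picks gene_a, then gene_b over the duplicate
```

The duplicate gene is skipped even though it is as relevant as the first pick, which is the behaviour the redundancy term is meant to produce.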

4.2 CSSMO Algorithm

The literature on nature-inspired metaheuristic approaches for optimization shows that these methods have been successful in solving a wide range of problems. However, each algorithm has its own strengths and weaknesses and may not be the best choice for every optimization problem. Feature selection in microarray data is a multivariate optimization problem with a large number of variables and combinatorial possibilities. Therefore, various soft computing techniques have been used in the literature to solve it, and it is important to evaluate the performance of different algorithms and select the one that provides the best results for a specific problem. In this paper, we suggest a hybrid metaheuristic approach that pools the strengths of the CS and SMO algorithms to discover the optimum solution. The CS algorithm has a powerful local search capacity and needs few control parameters and a small population size. The SMO algorithm, on the other hand, specializes in global search and is robust, although it can suffer from premature convergence and its convergence rate may be slower than that of other approaches. The suggested hybrid technique makes use of the capabilities of both: the global phase of the CS algorithm is replaced with the SMO algorithm so that each compensates for the other's limitations, and the merged method is named CSSMO. This is intended to increase the efficiency and efficacy of the optimization process [49].

Pseudo code of the hybrid (CSSMO) algorithm:

Step 1: Global Leader Phase (GLP) with the SMO algorithm

Step 2: Local Leader Decision Phase (LLDP) optimized with the CS algorithm

Step 3: Position update in the Global Leader Decision Phase (GLDP)
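Step 1 refers to the Global Leader Phase of SMO. As a hedged illustration, the sketch below implements the position update used in the standard SMO formulation, in which one randomly chosen dimension of a spider monkey is moved towards the global leader and perturbed relative to a randomly chosen peer. Whether CSSMO uses this rule unchanged is an assumption; the function name and the example vectors are hypothetical, and the fitness-based selection of which monkeys get updated is omitted.

```python
import random


def global_leader_phase_update(position, global_leader, peer, rng):
    """One standard SMO Global Leader Phase update.

    For a single random dimension j:
        new_j = x_j + U(0, 1) * (GL_j - x_j) + U(-1, 1) * (peer_j - x_j)
    All other dimensions are copied unchanged.
    """
    j = rng.randrange(len(position))
    new_position = list(position)
    new_position[j] = (
        position[j]
        + rng.uniform(0.0, 1.0) * (global_leader[j] - position[j])
        + rng.uniform(-1.0, 1.0) * (peer[j] - position[j])
    )
    return new_position, j


# Hypothetical three-dimensional example.
rng = random.Random(42)
current = [0.2, 0.8, 0.5]
leader = [1.0, 0.0, 0.5]
other = [0.4, 0.4, 0.9]
updated, changed_dim = global_leader_phase_update(current, leader, other, rng)
print(changed_dim, updated)
```

For gene selection, a continuous position such as this would still have to be mapped to a binary include/exclude mask over genes; that mapping is not specified here and is left out of the sketch.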

In this research, we employed CSSMO as a feature selection technique to select the best subset of genes, which served as input to the DL prediction model. The experiments were performed in MATLAB R2012a on a computer with an Intel Core i5 processor clocked at 4.50 GHz and 16 GB of RAM.

4.1 Dataset Used

Experiments were carried out to determine the efficiency of our technique. To assess the proposed algorithm's accuracy, we used six benchmark data sets: Leukemia, Colon, Prostate, Lung Cancer 2, Leukemia 2, and High-Grade Glioma. The characteristics of these datasets are described in Table 1.

Table 1

Details of the six cancer microarray datasets.
Data set	Number of classes	Number of genes	Class balance +/-	Number of samples	Brief description
Colon cancer [50]	2	2000	(22\40)	62	Colon cancer data were collected from patients with tumor biopsies: the tumor samples come from tumors, and the normal samples come from healthy portions of the same patients' colons.
Acute leukemia [51]	2	7129	(47\25)	72	Acute leukemia consists of two categories: Acute Lymphoblastic Leukemia (ALL) with 47 samples and Acute Myeloid Leukemia (AML) with 25 samples.
Prostate tumor [52]	2	12600	(50\52)	102	Prostate tumor data was acquired from two types of samples: non-tumor (normal) and tumor samples (cancer).
High-grade Glioma [53]	2	12625	(28\22)	50	High-grade Glioma contains glioblastomas and anaplastic oligodendrogliomas from brain tumor tissues.
Lung cancer II [54]	2	12533	(31\150)	181	Lung cancer II comprises Malignant Pleural Mesothelioma (MPM) and Adenocarcinoma (ADCA) tissue samples of the lung.
Leukemia 2 [55]	3	7129	(28\24\20)	72	The Leukemia 2 data set includes three types of samples: 28 AML samples, 24 ALL samples, and 20 MLL samples.

4.2 Deep Learning model configuration

Figure 4 depicts the deep learning model configuration, which consists of six convolutional layers. The first layer, "Convolution 8 2 × 2 × 1", applies 8 filters of size 2 × 2 to the input data with a stride of 1. The second layer, "Convolution 16 2 × 2 × 8", applies 16 filters of size 2 × 2 to the output of the first layer with a stride of 1, and has 8 input channels. The remaining layers follow the same pattern, each taking the previous layer's output as input and using 2 × 2 filters with a stride of 1: the third layer, "Convolution 32 2 × 2 × 16", applies 32 filters over 16 input channels; the fourth, "Convolution 64 2 × 2 × 32", applies 64 filters over 32 input channels; the fifth, "Convolution 128 2 × 2 × 64", applies 128 filters over 64 input channels; and the last, "Convolution 256 2 × 2 × 128", applies 256 filters over 128 input channels.

ReLU (Rectified Linear Unit) is a commonly used activation function in neural networks. It is applied element-wise: any element less than zero is set to zero, and any other element is passed through unchanged. Mathematically, $y=\max(0, x)$, where $x$ is the input and $y$ is the output. ReLU introduces non-linearity into the network, helps it converge faster, and reduces the chance of encountering the vanishing gradient problem.

Max pooling down-samples the spatial dimensions of its input and is typically applied after a convolutional layer in a CNN. The operation is applied to small rectangular regions of the input, called pooling windows; for each window, the maximum value is selected and propagated to the next layer. This helps to reduce the number of parameters in the network, reduces overfitting, and preserves the dominant features.

Batch normalization normalizes the inputs to a layer by adjusting and scaling the activations. The idea is to keep the inputs of each layer in the same distribution and thus accelerate the convergence of the network. It re-centres and re-scales the data so that the mean is zero and the standard deviation is one. During training it maintains a moving average of the mean and variance, and during testing it uses these values to normalize the test data. This makes the network less sensitive to the initial values of the parameters, reducing the need for careful initialization, and makes it possible to use much larger learning rates, which speeds up training.

Based on the above, in our model each convolutional layer is followed by a batch normalization operation and a ReLU activation function, which adds non-linearity to the output of the convolution. The output of each batch normalization and ReLU operation is then passed through a max pooling operation, except in the last layer, where no max pooling is applied.
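To make the layer labels concrete, the following sketch counts the trainable weights implied by the six convolutional layers described above. It assumes one bias term per filter and ignores the batch normalization parameters; neither detail is stated in the text, so the totals are indicative only.

```python
# (input channels, number of filters) for the six layers listed in the text.
LAYERS = [(1, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]
KERNEL = 2  # every layer uses 2 x 2 filters


def conv_params(in_channels, out_channels, kernel=KERNEL, bias=True):
    """Weights of one 2-D convolution: kernel * kernel * in * out (+ out biases)."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)


counts = [conv_params(i, o) for i, o in LAYERS]
total = sum(counts)

for (i, o), n in zip(LAYERS, counts):
    print(f"Convolution {o} 2 x 2 x {i}: {n} parameters")
print("total:", total)  # -> total: 175128
```

Because both the input and output channel counts double from one layer to the next, each layer has roughly four times as many weights as its predecessor, so the last layer accounts for about three quarters of the total.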

4.3 Parameter setting of proposed method

The fitness function given here is used to assess the accuracy of the proposed model, that is, how well the model's predicted outputs match the actual outcomes.

$$\mathrm{Accuracy} = \frac{CC}{N} \times 100 \qquad (2)$$

Equation 2 is the fitness function of the proposed approach, which evaluates the classifier's performance through its prediction accuracy, a measure of how successfully the classifier categorizes the data. In the equation, CC is the number of correctly classified observations (the numerator) and N is the total number of samples in the relevant class (the denominator). The resulting accuracy is expressed as a percentage and ranges from 0 to 100, with 100 indicating perfect accuracy and 0 indicating that no sample was classified correctly. Finally,

$$\mathrm{Fitness}(f) = \mathrm{Accuracy}(f_a) \qquad (3)$$

The LOOCV accuracy is used as the fitness function to evaluate the classifier's performance. Understanding the parameters and their values is essential for interpreting the performance of the proposed approach, and different problem domains may require different parameter settings. Table 2 shows the parameters used for the proposed algorithm.

Table 2

Parameter setting of the proposed CSSMO algorithm
PARAMETER	VALUE
Number of Nests (Population)	50
Total No. of eggs	10
Total No. of generations	200
Minimum probability of discovering an egg, $P_{a_{\min}}$	0.3
Maximum probability of discovering an egg, $P_{a_{\max}}$	0.5
Step size $\alpha$	1
The Swarm size N	50
MG	5
Global Leader Limit	50
Local Leader Limit	1500
The number of simulations/runs	100
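The LOOCV accuracy used as the fitness of a gene subset (Eqs. 2 and 3) can be sketched as follows. A 1-nearest-neighbour classifier stands in for the deep learning model purely to keep the example short and self-contained; that substitution, the function name `loocv_accuracy` and the toy data are assumptions made for illustration.

```python
def loocv_accuracy(samples, labels, selected):
    """Leave-one-out accuracy, 100 * CC / N, using only the selected gene indices.

    Each sample is held out in turn and classified by its nearest neighbour
    (squared Euclidean distance) among the remaining samples.
    """
    def distance(a, b):
        return sum((a[j] - b[j]) ** 2 for j in selected)

    correct = 0
    for i, held_out in enumerate(samples):
        others = [k for k in range(len(samples)) if k != i]
        nearest = min(others, key=lambda k: distance(held_out, samples[k]))
        correct += labels[nearest] == labels[i]
    return 100.0 * correct / len(samples)


# Hypothetical data: 6 samples x 3 genes. Gene 0 separates the classes,
# gene 1 varies independently of the class.
samples = [
    [0.1, 5.0, 0.9],
    [0.2, 1.0, 0.1],
    [0.0, 3.0, 0.5],
    [1.0, 4.9, 0.2],
    [0.9, 1.1, 0.8],
    [1.1, 3.1, 0.4],
]
labels = [0, 0, 0, 1, 1, 1]

print(loocv_accuracy(samples, labels, [0]))  # -> 100.0
print(loocv_accuracy(samples, labels, [1]))  # -> 0.0
```

A metaheuristic such as CSSMO would call a function of this kind once per candidate subset, which is why reducing the gene pool beforehand matters for run time.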

5.1 Deep learning prediction accuracy

The following inferences can be drawn from Tables 3, 4, 5, 6, 7 and 8, which compare the three algorithms on the minimum, mean and maximum prediction accuracy obtained with the deep learning model over the best 5 iterations. Both CS and SMO classify cancer with high prediction accuracy, but the CSSMO algorithm gives higher prediction accuracy than either on all six datasets, reaching a maximum accuracy of 100%. This maximum is obtained in all iterations for Colon cancer, in iteration 4 for Leukemia, in iteration 4 for Prostate cancer, in iterations 3 and 4 for High-grade Glioma, in the 1st, 2nd and 5th iterations for Lung cancer, and in all iterations for the Leukemia 2 dataset.

Table 3

Prediction accuracy comparison for Colon cancer
S. No.	SMO Algorithm			CSA Algorithm			CSSMO Algorithm
S. No.	Min(± STD)	Mean(± STD)	Max(± STD)	Min(± STD)	Mean(± STD)	Max(± STD)	Min(± STD)	Mean(± STD)	Max(± STD)
Experiment 1	74.9 ± 5.27	82.1 ± 4.70	88.4 ± 5.17	81.8 ± 2.54	90.5 ± 2.11	96.7 ± 1.95	99.5 ± 0.55	99.2 ± 0.13	100
Experiment 2	75.3 ± 5.28	84.4 ± 4.71	89.4 ± 5.18	88.2 ± 2.55	91.2 ± 2.12	95.2 ± 1.86	99.2 ± 0.36	99.3 ± 0.05	100
Experiment 3	76.9 ± 5.29	83.9 ± 4.72	89.3 ± 5.19	87.6 ± 2.56	93.7 ± 2.13	94.3 ± 1.77	99.4 ± 0.17	99.5 ± 0.09	100
Experiment 4	78.2 ± 5.30	80.1 ± 4.73	84.6 ± 5.20	83.2 ± 2.57	90.6 ± 2.14	96.4 ± 1.28	99.1 ± 0.28	99.6 ± 0.12	100
Experiment 5	72.3 ± 5.31	82.17 ± 4.74	81.7 ± 5.21	85.1 ± 2.58	93.4 ± 2.15	93.2 ± 1.89	99.6 ± 0.49	99.4 ± 0.09	100

Table 4

Prediction accuracy comparison for Leukemia.
S. No.	SMO Algorithm			CSA Algorithm			CSSMO Algorithm
S. No.	Min(± STD)	Mean(± STD)	Max(± STD)	Min(± STD)	Mean(± STD)	Max(± STD)	Min(± STD)	Mean(± STD)	Max(± STD)
Experiment 1	70.2 ± 8.71	74.2 ± 3.15	87.8 ± 5.20	76.4 ± 8.21	82.12 ± 8.07	88.3 ± 6.43	87.3 ± 3.27	91.6 ± 4.72	95.3 ± 2.91
Experiment 2	73.5 ± 7.75	79.2 ± 4.16	88.9 ± 5.21	78.4 ± 8.12	82.51 ± 4.18	87.3 ± 7.41	88.5 ± 3.18	89.9 ± 2.13	98.3 ± 2.12
Experiment 3	76.2 ± 5.73	80.2 ± 5.07	84.4 ± 5.22	79.4 ± 6.34	84.16 ± 2.19	89.3 ± 2.25	86.3 ± 4.29	92.7 ± 3.74	99.9 ± 0.01
Experiment 4	72.2 ± 5.74	78.2 ± 9.18	86.8 ± 5.23	80.4 ± 5.24	84.17 ± 4.20	88.3 ± 7.76	85.5 ± 2.30	93.1 ± 1.75	100
Experiment 5	77.2 ± 6.75	82.2 ± 6.29	88.3 ± 5.24	76.4 ± 6.35	83.5 ± 5.21	91.0 ± 6.17	86.9 ± 3.31	91.1 ± 3.76	96.3 ± 3.15

Table 5

Prediction accuracy comparison for Prostate cancer.
S. No.	SMO Algorithm			CSA Algorithm			CSSMO Algorithm
S. No.	Min(± STD)	Mean(± STD)	Max(± STD)	Min(± STD)	Mean(± STD)	Max(± STD)	Min(± STD)	Mean(± STD)	Max(± STD)
Experiment 1	70.2 ± 8.71	74.2 ± 3.15	87.8 ± 5.20	76.4 ± 8.21	82.12 ± 8.07	88.3 ± 6.43	87.3 ± 3.27	91.6 ± 4.72	95.3 ± 2.91
Experiment 2	73.5 ± 7.75	79.2 ± 4.16	88.9 ± 5.21	78.4 ± 8.12	82.51 ± 4.18	87.3 ± 7.41	88.5 ± 3.18	89.9 ± 2.13	98.3 ± 2.12
Experiment 3	76.2 ± 5.73	80.2 ± 5.07	84.4 ± 5.22	79.4 ± 6.34	84.16 ± 2.19	89.3 ± 2.25	86.3 ± 4.29	92.7 ± 3.74	99.9 ± 0.01
Experiment 4	72.2 ± 5.74	78.2 ± 9.18	86.8 ± 5.23	80.4 ± 5.24	84.17 ± 4.20	88.3 ± 7.76	85.5 ± 2.30	93.1 ± 1.75	100
Experiment 5	77.2 ± 6.75	82.2 ± 6.29	88.3 ± 5.24	76.4 ± 6.35	83.5 ± 5.21	91.0 ± 6.17	86.9 ± 3.31	91.1 ± 3.76	96.3 ± 3.15

Table 6

Prediction accuracy comparison for High-grade Glioma.
S. No.	SMO Algorithm			CSA Algorithm			CSSMO Algorithm
S. No.	Min(± STD)	Mean(± STD)	Max(± STD)	Min(± STD)	Mean(± STD)	Max(± STD)	Min(± STD)	Mean(± STD)	Max(± STD)
Experiment 1	71 ± 6.21	75.2 ± 5.22	89 ± 6.23	77.52 ± 3.45	83.31 ± 3.12	89.44 ± 3.11	87.33 ± .57	91.73 ± .33	95.54 ± .02
Experiment 2	74.3 ± 6.19	80.2 ± 5.67	90.1 ± 5.34	79.63 ± 3.77	83.71 ± 2.34	88.62 ± 2.34	88.53 ± .06	90.03 ± .05	98.54 ± .03
Experiment 3	77 ± 5.91	81.2 ± 5.31	85.6 ± 5.22	80.34 ± 3.76	85.31 ± 3.24	90.37 ± 3.34	86.33 ± .67	92.83 ± .32	100 ± 0
Experiment 4	73 ± 5.77	79.2 ± 5.23	88 ± 4.97	81.56 ± 4.11	85.31 ± 3.14	89.65 ± 3.86	85.53 ± .55	93.23 ± .43	100 ± 0
Experiment 5	78 ± 5.37	83.2 ± 4.77	89.5 ± 4.78	77.51 ± 3.56	84.71 ± 3.99	92.32 ± 3.44	86.93 ± .04	91.23 ± .02	96.54 ± .06

Table 7

Prediction accuracy comparison for Lung cancer 2.
S. No.	SMO Algorithm			CSA Algorithm			CSSMO Algorithm
S. No.	Min(± STD)	Mean(± STD)	Max(± STD)	Min(± STD)	Mean(± STD)	Max(± STD)	Min(± STD)	Mean(± STD)	Max(± STD)
Experiment 1	76.62 ± 7.77	82.12 ± 4.50	90.72 ± 4.56	90.45 ± 4.23	92.33 ± 2.33	95.72 ± 1.22	96.7 ± .89	97.7 ± .25	100 ± 0
Experiment 2	73.92 ± 7.67	80.32 ± 6.09	92.22 ± 5.44	88.95 ± 3.46	90.03 ± 1.78	93.42 ± 2.33	94.27 ± .34	96.76 ± .23	100 ± 0
Experiment 3	75.32 ± 8.12	83.42 ± 6.34	89.52 ± 5.43	91.35 ± 3.78	91.13 ± 2.56	95.82 ± 3.21	98.97 ± .67	99.23 ± .23	98.93 ± .09
Experiment 4	77.02 ± 8.09	84.72 ± 5.98	88.62 ± 5.47	92.05 ± 4.23	92.53 ± 2.11	96.52 ± 1.77	99.57 ± .91	99.2 ± .45	99.93 ± .11
Experiment 5	75.52 ± 7.23	81.52 ± 4.33	92.42 ± 6.22	88.15 ± 6.54	93.03 ± 3.12	97.92 ± 2.34	98.27 ± 1.1	98.7 ± .76	100 ± 0

Table 8

Prediction accuracy comparison for Leukemia 2 data.
S. No.	SMO Algorithm			CSA Algorithm			CSSMO Algorithm
S. No.	Min(± STD)	Mean(± STD)	Max(± STD)	Min(± STD)	Mean(± STD)	Max(± STD)	Min(± STD)	Mean(± STD)	Max(± STD)
Experiment 1	76.42 ± 4.34	81.85 ± 3.6	90.38 ± 5.32	89.56 ± 2.71	94.34 ± 2.11	96.62 ± 1.7	99.66 ± .87	99.71 ± .07	100 ± 0
Experiment 2	73.72 ± 5.11	80.05 ± 3.65	91.88 ± 6.11	88.06 ± 3.1	92.04 ± 2.56	94.32 ± 2.3	98.56 ± .86	98.97 ± .08	100 ± 0
Experiment 3	75.12 ± 3.98	83.15 ± 3.21	89.18 ± 5.47	90.46 ± 3.01	93.14 ± 2.61	96.72 ± 2.13	97.26 ± .96	99.27 ± .09	100 ± 0
Experiment 4	76.82 ± 4.56	84.45 ± 3.99	88.28 ± 3.56	91.16 ± 2.79	94.54 ± 2.17	97.42 ± 1.99	99.86 ± .15	99.89 ± 0.8	100 ± 0
Experiment 5	75.32 ± 4.89	83.72 ± 4.11	92.08 ± 4.03	87.26 ± 2.78	95.04 ± 1.96	98.82 ± 1.6	97.06 ± .56	99.39 ± 0.76	100 ± 0

5.2 Error Estimation

Figure 5 (a-f) provides insight into the prediction errors of the deep learning model with the three algorithms. Figure 5a shows the experimental findings for the Colon cancer dataset: CSSMO had the highest prediction accuracy, followed by the CS and SMO algorithms, with average errors of 0.6%, 8.5%, and 19.9% over 5 iterations, respectively. Figure 5b shows the experimental outcomes for the Acute leukemia dataset. Compared with the other two approaches, CSSMO performs best, with an average prediction error of 6.1% and a maximum prediction error of 8.32% attained in the fourth iteration. The Prostate dataset results are shown in Fig. 5c. The CSSMO algorithm has the best accuracy, averaging 91.86% and reaching a maximum of 93.1% in iteration four; the CSA method has a mean error of 11.1% over 5 iterations, while the SMO algorithm has the lowest accuracy, with a mean error of 19.37%. The 3D plot of the High-Grade Glioma outcome is shown in Fig. 5d. It shows that the CSSMO algorithm fared best, with a highest accuracy of 93.23% attained in iteration 4, compared with 85.31% in iteration 3 for the CS algorithm and 83.2% in iteration 5 for the SMO algorithm. The Lung cancer II results are shown in Fig. 5e. As the 3D plot shows, the CSSMO approach provided the best accuracy: the minimum prediction error of the deep learning model using the CSSMO algorithm is 0.77% in iteration 3, while the mean prediction accuracy over the five iterations is 98.31%. The experimental outcomes for the Leukemia II dataset are shown in Fig. 5f. CSSMO performs best, with an average prediction error of 0.56%, compared with average prediction errors of 6.18% and 17.36% for the CS and SMO algorithms, respectively.
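As a worked check, the average prediction error can be read as 100 minus the average of the per-experiment mean accuracies. This relation is an assumption (the text does not state it explicitly), but applied to the Mean columns of Table 8 it reproduces the Leukemia II errors quoted above for CS and SMO and gives about 0.55% for CSSMO, close to the reported 0.56%.

```python
def mean_error(mean_accuracies):
    """Average prediction error (%) = 100 - average of the mean accuracies (%)."""
    return 100.0 - sum(mean_accuracies) / len(mean_accuracies)


# Mean accuracy per experiment, copied from the Mean columns of Table 8.
leukemia2_means = {
    "CSSMO": [99.71, 98.97, 99.27, 99.89, 99.39],
    "CS": [94.34, 92.04, 93.14, 94.54, 95.04],
    "SMO": [81.85, 80.05, 83.15, 84.45, 83.72],
}

for name, values in leukemia2_means.items():
    print(name, round(mean_error(values), 2))
# CSSMO 0.55
# CS 6.18
# SMO 17.36
```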

5.3 Convergence rate of the Algorithms

Figure 6 (a-f) shows the convergence rate of the SMO, CS and CSSMO algorithms for each cancer dataset as line graphs of accuracy versus iteration. The comparison for each dataset is as follows:

Colon Cancer

The line graph shows that the accuracy of the SMO algorithm remains relatively constant at around 80.1% across all iterations. The CSA algorithm has a higher accuracy of around 91.5%, which also remains relatively constant throughout the iterations. The CSSMO algorithm has an accuracy of around 99.2–99.6% across all iterations, which is consistently higher than that of the other two algorithms and shows a slight increase over the iterations. Overall, the CSSMO algorithm performs best in terms of accuracy, followed by the CSA algorithm and then the SMO algorithm.

Leukemia

As the iteration number increases, the accuracy of the SMO algorithm fluctuates within the range of 74–82%. In comparison, the CSA algorithm shows a slightly higher and more consistent accuracy, with a range of 82–84%. The CSSMO algorithm stands out in the graph, starting at around 91% in the first iteration and reaching a high of 93.1% in the fourth iteration. This trend suggests that CSSMO outperforms the SMO and CSA algorithms in terms of accuracy.

Prostate Cancer

From the graph, we can see that the CSSMO algorithm consistently performs the best with an accuracy range between 90–93% and reaches a maximum of 93.1% in the fourth iteration. The CSA algorithm also performs well, with accuracy values of 86–91% in iterations 1 through 5. The SMO algorithm has the lowest overall accuracy in the range of 76–84% and seems to fluctuate more in performance compared to the other two algorithms.

Overall, the CSSMO algorithm appears to have the highest and most stable accuracy, followed by the CSA algorithm, and then the SMO algorithm.

High Grade Glioma

The line graph shows that the CSSMO algorithm is the most effective of the three algorithms, as it consistently achieves the highest accuracy in every iteration. The highest accuracy achieved by the CSSMO algorithm is 93.23% in the fourth iteration. The CSA algorithm is the second most effective, with its highest accuracy being 85.31% in the third iteration and the SMO algorithm is the least effective, with its highest accuracy being 83.2% in the fifth iteration.

Lung Cancer

In the given line graph, it can be seen that the accuracy of the SMO algorithm fluctuates between 80–84% throughout the iterations. The accuracy of the CSA algorithm also fluctuates but it is consistently higher than the SMO algorithm, with a range of 90–93%. The CSSMO algorithm has the highest accuracy, with a range of 96–99%. The accuracy of all three algorithms improves as the iteration number increases, with the CSSMO algorithm showing the most significant increase in accuracy.

Leukemia II

In the given line graph, it is evident that the CSSMO algorithm has the highest accuracy among the three algorithms. This is demonstrated in the line graph by its consistently high accuracy range of 98–99%. The CSA algorithm follows, with an accuracy range of 92–95%. The SMO algorithm exhibits the lowest accuracy, with a range of 80–84%.

Additionally, the accuracy of all three algorithms tends to improve as the iteration number increases. The CSSMO algorithm exhibits the most significant improvement, while the SMO algorithm shows the least. The CSA algorithm falls in the middle, with moderate improvement.

5.4 Model Performance

In Fig. 7 (a-f), the training accuracy and loss scores are used to assess the model's performance on the training data. The training accuracy is the proportion of correctly classified instances in the training set, whereas the training loss measures the model's error in predicting the right output for a given instance. The testing accuracy and loss scores assess the model's ability to generalize to new, previously unseen data: the testing accuracy is the proportion of correctly classified instances in the test set, and the testing loss is the model's error in predicting the right output for an instance in the test set.

Figure 7a-b plots accuracy and loss versus epochs for the CS algorithm, which shows a relatively large gap between training and testing accuracy and loss. Figure 7c-d plots accuracy and loss versus epochs for the SMO algorithm, which shows a narrower gap. Figure 7e-f plots accuracy and loss versus epochs for the CSSMO algorithm and clearly shows that it has the smallest difference between training and testing accuracy and loss of the three, indicating that the hybrid CSSMO algorithm can learn from the training data and generalize effectively to new, unseen data. Based on these results, CSSMO is the most effective of the three algorithms for reducing gene dimensionality.

5.5 Confusion Matrix

In Fig. 8 (a-c), we use confusion matrices to evaluate the predictions made with the CSA, SMO and proposed CSSMO algorithms. A confusion matrix summarizes the actual and predicted class labels for a given set of test data: the rows represent the actual class labels and the columns represent the predicted class labels, so the diagonal values give the number of correct predictions and the off-diagonal values give the number of incorrect predictions. Figure 8a shows the confusion matrix for the CSA algorithm, Fig. 8b for the SMO algorithm, and Fig. 8c for the CSSMO algorithm. The CSSMO algorithm had the highest number of correct predictions on the test data, as indicated by the highest diagonal values in its confusion matrix. This indicates that the CSSMO algorithm is the most effective of the three for classifying the six different types of cancer.

5.6 Comparison with other Machine Learning and Deep Learning models

For further comparison, the proposed algorithm is compared with the most popular machine learning classifiers (SVM and NB) and deep learning classifiers (VGG and LeNet), which are widely used for medical data classification and cancer prediction from gene expression profiles.

Table 9

The comparison result of SVM, NB, VGG and LeNet classifiers with proposed approach.
Datasets	NB	SVM	VGG	LeNet	Proposed Model
Datasets	Mean Prediction Accuracy	Mean Prediction Accuracy	Mean Prediction Accuracy	Mean Prediction Accuracy	Mean Prediction Accuracy
Colon cancer data	94.12	94.11	94.01	94.21	98.27
Acute leukemia data	91.35	90.45	88.67	86.37	93.15
Prostate tumor data	90.14	89.90	89.38	87.18	92.38
High-grade Glioma data	90.32	89.22	91.24	90.04	92.16
Lung cancer II data	93.71	92.34	91.34	94.22	96.23
Leukemia 2 data	94.67	93.33	94.84	95.44	96.75

Figure 9 shows the mean performance of the comparative models and the proposed model in terms of training accuracy, F1 score, Recall and Precision. These observations show that the proposed model with deep learning gives comparatively better results than the other popular deep learning and machine learning models for cancer prediction.

Figure 10 presents a radar graph that ranks the algorithms based on their error evaluation. Areas near the centre of the radar graph represent lower error values, so the algorithm with the smallest area performed the classification task best: the proposed approach ranks first, followed by the VGG algorithm. From the comparison in Table 9 and the radar plot in Fig. 10, it can be deduced that the proposed method is superior to the established deep learning and machine learning methods.

In this paper, a hybrid method named CSSMO is proposed for feature selection in deep learning prediction. In the proposed model, CSSMO performs feature selection to identify the best subset of genes, and this subset is then classified by deep learning to identify the distinct groups or classes associated with a specific disease. The accuracy of the proposed algorithm is evaluated on six benchmark datasets: Colon cancer [77], Acute leukemia [78], Prostate tumor [79], High-grade Glioma [80], Lung cancer II [81], and Leukemia 2 [82]. The prediction tests show that the proposed model is accurate and that the CSSMO model outperforms the conventional ML and DL models currently in use. We therefore conclude that the proposed methodology increases the efficiency of the prediction model.

This method can help researchers overcome the constraints of cancer prediction using gene expression data. In the future, the model could be used as a parallel framework in conjunction with other extraction strategies to obtain more precise findings. Future research will also look into ways to improve accuracy by adjusting various performance metrics. Furthermore, the proposed model may be evaluated on Next Generation Sequencing datasets; this technology can sequence genomes and investigate human biomes much more quickly and cost-effectively than earlier techniques.

Ethics approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Author Contributions: Material preparation, data collection, and data analysis were performed by Amol Avinesh Joshi. Manuscript writing and all other work were performed by Dr. Rabia Musheer Aziz.

Funding: The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Data Availability: All data used are benchmark datasets and are freely available in repositories.

Code Availability: All code used is freely available online.

Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. The authors declare no Conflict of Interest.

D. Haber, Health promotion and aging: Practical applications for health professionals. Springer Publishing Company, 2019.
N. Almugren and H. Alshamlan, "A survey on hybrid feature selection methods in microarray gene expression data for cancer classification," IEEE Access, vol. 7, pp. 78533-78548, 2019.
Y. F. Leung and D. Cavalieri, "Fundamentals of cDNA microarray data analysis," Trends in Genetics, vol. 19, no. 11, pp. 649-659, 2003.
H. F. Ong, N. Mustapha, H. Hamdan, R. Rosli, and A. Mustapha, "Informative top-k class associative rule for cancer biomarker discovery on microarray data," Expert Systems with Applications, vol. 146, p. 113169, 2020.
M. Daoud and M. Mayo, "A survey of neural network-based cancer prediction models from microarray data," Artificial Intelligence in Medicine, vol. 97, pp. 204-214, 2019.
R. M. Aziz, "Application of nature inspired soft computing techniques for gene selection: a novel frame work for classification of cancer," Soft Computing, vol. 26, no. 22, pp. 12179-12196, 2022.
R. M. Aziz, "Nature-inspired metaheuristics model for gene selection and classification of biomedical microarray data," Medical & Biological Engineering & Computing, vol. 60, no. 6, pp. 1627-1646, 2022.
F. Morais-Rodrigues et al., "Analysis of the microarray gene expression for breast cancer progression after the application modified logistic regression," vol. 726, p. 144168, 2020.
R. A. Musheer, C. K. Verma, and N. Srivastava, "Novel machine learning approach for classification of high-dimensional microarray data," Soft Computing, vol. 23, no. 24, pp. 13409-13421, 2019.
R. Aziz, C. K. Verma, and N. Srivastava, "Artificial neural network classification of high dimensional data with novel optimization approach of dimension reduction," Annals of Data Science, vol. 5, no. 4, pp. 615-635, 2018.
P. Rusin and K. Jabłońska, "Disturbances in the Mechanism of Apoptosis as One of the Causes of the Development of Cancer Diseases," Studia Ecologiae et Bioethicae, vol. 18, no. 4, pp. 63-73, 2020.
S. Jayasinghe, N. M. Byrne, K. A. Patterson, K. D. Ahuja, and A. P. Hills, "The current global state of movement and physical activity-the health and economic costs of the inactive phenotype," Progress in Cardiovascular Diseases, vol. 64, pp. 9-16, 2021.
G. Curigliano et al., "Management of cardiac disease in cancer patients throughout oncological treatment: ESMO consensus recommendations," vol. 31, no. 2, pp. 171-190, 2020.
M. Joseph, M. Devaraj, and C. K. Leung, "DeepGx: deep learning using gene expression for cancer classification," in 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2019, pp. 913-920: IEEE.
M. Mostavi, Y.-C. Chiu, Y. Huang, and Y. Chen, "Convolutional neural network models for cancer type prediction based on gene expression," BMC Medical Genomics, vol. 13, no. 5, pp. 1-13, 2020.
M. Vimaladevi and B. Kalaavathi, "A microarray gene expression data classification using hybrid back propagation neural network," Genetika, vol. 46, no. 3, pp. 1013-1026, 2014.
D. Q. Zeebaree, H. Haron, and A. M. Abdulazeez, "Gene selection and classification of microarray data using convolutional neural network," in 2018 International Conference on Advanced Science and Engineering (ICOASE), 2018, pp. 145-150: IEEE.
Z. Mao, W. Cai, and X. Shao, "Selecting significant genes by randomization test for cancer classification using gene expression data," Journal of Biomedical Informatics, vol. 46, no. 4, pp. 594-601, 2013.
W. Zhong, "Feature selection for cancer classification using microarray gene expression data," Graduate Studies, 2014.
R. Tabares-Soto, S. Orozco-Arias, V. Romero-Cano, V. S. Bucheli, J. L. Rodríguez-Sotelo, and C. F. Jiménez-Varón, "A comparative study of machine learning and deep learning algorithms to classify cancer types based on microarray gene expression data," PeerJ Computer Science, vol. 6, p. e270, 2020.
H. Salem, G. Attiya, and N. El-Fishawy, "Classification of human cancer diseases by gene expression profiles," Applied Soft Computing, vol. 50, pp. 124-134, 2017.
J. Liu, X. Wang, Y. Cheng, and L. Zhang, "Tumor gene expression data classification via sample expansion-based deep learning," Oncotarget, vol. 8, no. 65, p. 109646, 2017.
Y. Wang, X.-G. Yang, and Y. Lu, "Informative gene selection for microarray classification via adaptive elastic net with conditional mutual information," Applied Mathematical Modelling, vol. 71, pp. 286-297, 2019.
S. A. Medjahed, T. A. Saadi, A. Benyettou, and M. Ouali, "Kernel-based learning and feature selection analysis for cancer diagnosis," Applied Soft Computing, vol. 51, pp. 39-48, 2017.
Q. Liao, L. Jiang, X. Wang, C. Zhang, and Y. Ding, "Cancer classification with multi-task deep learning," in 2017 International Conference on Security, Pattern Analysis, and Cybernetics (SPAC), 2017, pp. 76-81: IEEE.
Q. Hou et al., "RankProd combined with genetic algorithm optimized artificial neural network establishes a diagnostic and prognostic prediction model that revealed C1QTNF3 as a biomarker for prostate cancer," vol. 32, pp. 234-244, 2018.
M. J. Rani and D. Devaraj, "Two-stage hybrid gene selection using mutual information and genetic algorithm for cancer data classification," Journal of Medical Systems, vol. 43, no. 8, pp. 1-11, 2019.
Y. Chen, Y. Wang, L. Cao, and Q. Jin, "An effective feature selection scheme for healthcare data classification using binary particle swarm optimization," in 2018 9th international conference on information technology in medicine and education (ITME), 2018, pp. 703-707: IEEE.
A. Rouhi and H. Nezamabadi-pour, "A hybrid method for dimensionality reduction in microarray data based on advanced binary ant colony algorithm," in 2016 1st Conference on Swarm Intelligence and Evolutionary Computation (CSIEC), 2016, pp. 70-75: IEEE.
I. Jain, V. K. Jain, and R. Jain, "Correlation feature selection based improved-binary particle swarm optimization for gene selection and cancer classification," Applied Soft Computing, vol. 62, pp. 203-215, 2018.
L. Venkataramana et al., "Improving classification accuracy of cancer types using parallel hybrid feature selection on microarray gene expression data," vol. 41, no. 11, pp. 1301-1313, 2019.
M. Alzaqebah et al., "Memory based cuckoo search algorithm for feature selection of gene expression dataset," vol. 24, p. 100572, 2021.
P. Swathypriyadharsini and K. Premalatha, "Hybrid Cuckoo Search with Clonal Selection for Triclustering Gene Expression Data of Breast Cancer," IETE Journal of Research, pp. 1-9, 2021.
M. Zhao and Y. Qin, "Feature Selection on Elite Hybrid Binary Cuckoo Search in Binary Label Classification," Computational and Mathematical Methods in Medicine, vol. 2021, 2021.
M. S. Othman, S. R. Kumaran, and L. M. Yusuf, "Gene Selection Using Hybrid Multi-Objective Cuckoo Search Algorithm with Evolutionary Operators for Cancer Microarray Data," IEEE Access, vol. 8, pp. 186348-186361, 2020.
A. C. Pandey, D. S. Rajpoot, and M. Saraswat, "Feature selection method based on hybrid data transformation and binary binomial cuckoo search," Journal of Ambient Intelligence and Humanized Computing, vol. 11, no. 2, pp. 719-738, 2020.
P. Swathypriyadharsini and K. Premalatha, "Comparison of cuckoo search and particle swarm optimisation in triclustering temporal gene expression data," International Journal of Swarm Intelligence, vol. 4, no. 1, pp. 55-72, 2019.
L. T. Scaria and T. Christopher, "A Bio-inspired Algorithm based Multi-class Classification Scheme for Microarray Gene Data," Journal of Medical Systems, vol. 43, no. 7, pp. 1-8, 2019.
R. Balamurugan, A. Natarajan, and K. Premalatha, "A new hybrid cuckoo search algorithm for biclustering of microarray gene-expression data," Applied Artificial Intelligence, vol. 32, no. 7-8, pp. 644-659, 2018.
S. I. Boushaki, N. Kamel, and O. Bendjeghaba, "A new quantum chaotic cuckoo search algorithm for data clustering," Expert Systems with Applications, vol. 96, pp. 358-372, 2018.
A. C. Pandey, D. S. Rajpoot, and M. Saraswat, "Data clustering using hybrid improved cuckoo search method," in 2016 Ninth International Conference on Contemporary Computing (IC3), 2016, pp. 1-6: IEEE.
A. Kulhari, A. Pandey, R. Pal, and H. Mittal, "Unsupervised data classification using modified cuckoo search method," in 2016 Ninth International Conference on Contemporary Computing (IC3), 2016, pp. 1-5: IEEE.
A. C. Pandey and D. S. Rajpoot, "Spam review detection using spiral cuckoo search clustering method," Evolutionary Intelligence, vol. 12, no. 2, pp. 147-164, 2019.
A. C. Pandey, D. S. Rajpoot, and M. Saraswat, "Feature selection method based on hybrid data transformation and binary binomial cuckoo search," Journal of Ambient Intelligence and Humanized Computing, vol. 11, no. 2, pp. 719-738, 2020.
M. Abdel-Basset, A.-N. Hessin, and L. Abdel-Fatah, "A comprehensive study of cuckoo-inspired algorithms," Neural Computing and Applications, vol. 29, no. 2, pp. 345-361, 2018.
Q. Wei, C. Wang, Y. J. J. o. I. Wen, and F. Systems, "Minimum attribute reduction algorithm based on quick extraction and multi-strategy social spider optimization," vol. 40, no. 6, pp. 12023-12038, 2021.
N. Khare et al., "Smo-dnn: Spider monkey optimization and deep neural network hybrid classifier model for intrusion detection," vol. 9, no. 4, p. 692, 2020.
G. Nirmalapriya, V. Agalya, R. Regunathan, M. B. J. J. B. S. P. Ananth, and Control, "Fractional Aquila spider monkey optimization based deep learning network for classification of brain tumor," vol. 79, p. 104017, 2023.
B. A. Garro, K. Rodríguez, and R. A. Vázquez, "Classification of DNA microarrays using artificial neural networks and ABC algorithm," Applied Soft Computing, vol. 38, pp. 548-560, 2016.
U. Alon et al., "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," Proceedings of the National Academy of Sciences, vol. 96, no. 12, pp. 6745-6750, 1999.
T. R. Golub et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," science, vol. 286, no. 5439, pp. 531-537, 1999.
D. Singh et al., "Gene expression correlates of clinical prostate cancer behavior," Cancer cell, vol. 1, no. 2, pp. 203-209, 2002.
C. L. Nutt et al., "Gene expression-based classification of malignant gliomas correlates better with survival than histological classification," Cancer research, vol. 63, no. 7, pp. 1602-1607, 2003.
G. J. Gordon et al., "Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma," Cancer research, vol. 62, no. 17, pp. 4963-4967, 2002.
S. A. Armstrong et al., "MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia," Nature genetics, vol. 30, no. 1, pp. 41-47, 2002.

No competing interests reported.
