Deep learning methods can be used for prediction on cancer data, although this poses several challenges. Joseph et al. explored various problems related to implementing a deep learning model for cancer classification from gene expression data. Transforming the 1D gene expression values into a 2D image takes the overall features of the genes into account, and a convolutional neural network (CNN) then evaluates the features required for classification. The proposed model achieved a steady accuracy of 95.65% on data covering 33 different cancer types [14].
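As a rough illustration of the 1D-to-2D transformation described above, the sketch below lays a flat expression vector out row by row on the smallest square grid that holds it. The row-major gene ordering, the zero padding, and the function name `to_square_image` are assumptions made for illustration; the source does not specify how Joseph et al. arranged the genes.

```python
import math


def to_square_image(expression, pad_value=0.0):
    """Reshape a 1D expression vector into a square 2D grid (row-major).

    Assumption: genes keep their input order and the tail is padded with
    pad_value; the published mapping may differ.
    """
    n = len(expression)
    side = math.isqrt(n)
    if side * side < n:
        side += 1  # smallest square that can hold all n values
    padded = list(expression) + [pad_value] * (side * side - n)
    return [padded[r * side:(r + 1) * side] for r in range(side)]


# Five expression values become a 3 x 3 grid with four padded cells.
image = to_square_image([0.5, 1.2, 3.3, 0.0, 2.1])
```

A grid of this shape can then be passed to any 2D convolutional layer as a single-channel image.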
Mostavi et al. successfully implemented several CNN models, namely 1D-CNN, 2D-Vanilla-CNN, and 2D-Hybrid-CNN, which operate on unstructured gene expression input and can classify a sample not only as tumor or non-tumor but also by its specific cancer type or as normal. All the classifiers were applied to 10,340 samples of 33 distinct cancer types drawn from The Cancer Genome Atlas, and achieved an average accuracy of 94.45% [15].
Backpropagation is the standard technique for training artificial neural networks. Feature selection in gene expression data has been carried out with hybrid models that combine backpropagation neural networks (BPN) with fast Genetic Algorithms (GA), addressing the sensitivity of such algorithms to local minima in the search space. Most gene selection algorithms select the highest-ranked genes; however, many low-ranked genes, if selected properly, can improve classification performance. The Hybrid Fast GA-BPN algorithm identifies a small, informative gene subset capable of improving the overall classification accuracy. Microarray technology is very effective for genetic research, but the curse of dimensionality creates many analytical challenges for data scientists [16].
Zeebaree et al. tackled the main challenges of classifying cancer microarray data with deep learning algorithms based on convolutional neural networks (CNN), which showed improved accuracy and extraction of informative genes compared with mSVM-RFE-iRF and varSelRF [17]. Feature selection is a crucial task in any classification problem, as the overall accuracy of the classifier depends on how relevant the properties of the selected genes are to the properties of each class.
Mao et al. used a randomization test (RT) and partial least squares discriminant analysis (PLSDA) to measure the significance of the selected genes with respect to the target class. The suggested model gave satisfactory results when classifying the data into four classes using principal component analysis (PCA) and multiple linear regression (MLR) [18].
Cancer microarray data form a non-square matrix because the number of features is very large compared with the number of samples (the curse of dimensionality), which needs to be resolved by employing a suitable feature extraction method. Generally, distance-based feature selection techniques rely on Euclidean distance (PLSDA). Zhong et al. incorporated the Bhattacharyya distance to determine the class of a gene in a binary classification model. Genes were selected based on the lowest misclassification rates obtained with a support vector machine. The method improved accuracy and processing time compared with SVM-RFE and SWKC/SVM [19].
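To make the distance-based ranking idea concrete, the following sketch scores a single gene by the Bhattacharyya distance between its expression values in two classes. It assumes each class is modelled as a univariate Gaussian, which is a simplification introduced here; the exact formulation used by Zhong et al. is not given in this review, and the function name is hypothetical.

```python
import math
from statistics import mean, pvariance


def bhattacharyya_distance(class_a, class_b):
    """Bhattacharyya distance between two univariate Gaussian fits.

    Assumption: each class's expression values for one gene are
    approximately normal with non-zero variance.
    """
    mu_a, mu_b = mean(class_a), mean(class_b)
    var_a, var_b = pvariance(class_a), pvariance(class_b)
    if var_a <= 0 or var_b <= 0:
        raise ValueError("each class needs non-zero variance")
    # Term driven by the separation of the class means.
    mean_term = 0.25 * (mu_a - mu_b) ** 2 / (var_a + var_b)
    # Term driven by the mismatch between the class variances.
    var_term = 0.5 * math.log((var_a + var_b) / (2.0 * math.sqrt(var_a * var_b)))
    return mean_term + var_term


# A gene whose classes are well separated scores higher than one that overlaps.
separated = bhattacharyya_distance([1.0, 2.0, 3.0], [11.0, 12.0, 13.0])
overlapping = bhattacharyya_distance([1.0, 2.0, 3.0], [1.5, 2.5, 3.5])
```

Genes can then be ranked by this score, with larger values indicating better class separation.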
The international community stands united in the fight against cancer. One in six deaths worldwide is caused by cancer. Thus, to support researchers across the globe, data sets on different types of cancer have been made publicly available to expedite the search for suitable, affordable treatments. Comparative studies of the accuracy of available algorithms on known data sets may give the research community useful insights. Tabares et al. conducted a comparative study on the 11-tumor database and recorded accuracies of 90.6% and 94.43% with logistic regression and convolutional neural networks, respectively. The proposed deep learning algorithm shows more promising results in microarray data analysis [20].
Medical treatment of a patient diagnosed with cancer depends not only on the gene that caused it but also on the type of cancer. Hence, classifying cancer types plays a vital role in the clinical management of the disease. Although many available microarray data analysis algorithms have already shown remarkable results, they have their limitations too. Hence, improved techniques based on machine learning algorithms are required for efficient implementation. Salem et al. implemented Information Gain (IG) and a Standard Genetic Algorithm (SGA) to classify human cancers from gene expression profiles. Information Gain is used for feature selection, after which feature reduction and cancer type classification are performed by a Genetic Algorithm and a Genetic Programming algorithm, respectively. The hybrid technique based on IG and SGA improved the accuracy of the classifier [21].
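The Information Gain criterion mentioned above can be sketched as the drop in class entropy after splitting the samples on a discretized gene. The discretization into labels such as "high" and "low" and the function names are assumptions for illustration, not details taken from Salem et al.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy after splitting on the feature.

    Assumption: feature_values are already discretized (e.g. "high"/"low").
    """
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


labels = ["tumor", "tumor", "normal", "normal"]
informative = information_gain(["high", "high", "low", "low"], labels)
uninformative = information_gain(["high", "low", "high", "low"], labels)
```

A gene that separates the classes perfectly recovers the full label entropy, while a gene unrelated to the class yields a gain of zero, so ranking genes by this value gives a simple filter for feature selection.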
Deep learning methods are growing in popularity as researchers apply them in a wide range of areas; however, their use in classifying cancer microarray data is limited, and many related domains remain unexplored because of the scarcity of training datasets. Hybrid approaches with deep learning techniques are an emerging area that aims to overcome the limitations of available algorithms. Tumors are of two types, malignant and benign. Early detection and classification of tumors are important in preventing their harmful effects. Liu et al. proposed two deep learning approaches for categorizing microarray data: sample-expansion-based SAE and 1DCNN. The authors reported improved classifier accuracy after testing the data with the proposed algorithms [22].
Elastic net is very effective for variable selection and regularization. Its improved performance under weak regularization is a distinctive property that has enabled its use in multidisciplinary fields. Wang et al. classified leukaemia and colon cancer microarray data using a hybrid technique, the Adaptive Elastic Net with Conditional Mutual Information (AEN-CMI). The proposed hybrid algorithm outperforms traditional methods not only in accuracy but also in using a minimal number of genes [23].
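For readers unfamiliar with the elastic net, the sketch below computes its penalty term, which blends an L1 (sparsity-inducing) and an L2 (shrinkage) component. The parameter names `lam` and `alpha` follow one common parameterization, and the optional per-coefficient `weights` only gesture at the adaptive variant; how AEN-CMI derives its weights is not described in this review, so that part is an assumption.

```python
def elastic_net_penalty(coefficients, lam, alpha, weights=None):
    """Elastic net penalty: lam * (alpha * L1 + 0.5 * (1 - alpha) * L2).

    alpha = 1 gives the lasso penalty, alpha = 0 the ridge penalty.
    Assumption: `weights` rescales the L1 term per coefficient, as in
    adaptive variants; by default every weight is 1.
    """
    if weights is None:
        weights = [1.0] * len(coefficients)
    l1 = sum(w * abs(b) for w, b in zip(weights, coefficients))
    l2 = sum(b * b for b in coefficients)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)


# Coefficients for three hypothetical genes; the third is already zero.
penalty = elastic_net_penalty([1.0, -2.0, 0.0], lam=0.1, alpha=0.5)
```

Because the L1 component drives some coefficients exactly to zero, minimizing a loss with this penalty performs gene selection and regularization at the same time.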
Microarray data analysis is one of the advanced and reliable tools in medical science, and it strives to link the cause of a disease such as cancer with an informative gene. Medjahed et al. developed a two-step algorithm. It uses Support Vector Machine Recursive Feature Elimination (SVM-RFE) to extract the genes and the recent Binary Dragonfly Algorithm (BDA) to improve the performance of the former. For the first time, the authors applied the BDA metaheuristic to microarray data analysis, which enhanced the accuracy of the classifier with a minimal number of genes [24].
Microarray data analysis of tumours requires a sufficient number of training samples to build a classifier with good accuracy. Liao et al. tackled this problem by combining datasets from multiple types of cancer and applying a Multi-Task Deep Learning (MTDL) algorithm to analyse the microarray data. MTDL resolved the issues related to data scarcity and significantly enhanced the accuracy of the classifier in identifying the type of cancer when tested on multiple cancer datasets [25].
Prostate cancer in males causes seven percent of total cancer deaths worldwide. The research community is still debating the reliability of the prostate-specific antigen (PSA). Hence, there is a need for a biomarker that gains acceptance from the community. Hou et al. created a diagnostic and prognostic prediction model for prostate cancer and discussed the effectiveness of the biomarker C1QTNF3. The authors also anticipate applying the proposed biomarker in related domains to map oncogenes [26].
Mapping the type of disease to the most informative genes is the central idea behind using microarray data for classification. Jansi et al. implemented a two-stage algorithm based on the Mutual Information Genetic Algorithm (MI-GA). Potential genes with high mutual information values are screened first, and an optimal gene set is then created through a Genetic Algorithm and SVM. The proposed method shows improved accuracy when applied to datasets of different types of cancer [27].
Retaining only the useful features that affect the overall classifier is the main idea behind selecting features from the available search space with appropriate algorithms. Chen et al. used binary particle swarm optimization and CCFS to design a cost-effective, confidence-based feature selection method. First, the authors filtered the most promising features using the statistical parameters available within the features as well as within the different categories. Second, discriminant statistical parameters were built into the model to develop the best possible classifier [28].
Redundant and irrelevant features reduce the classifier's accuracy and raise the computational cost, which leads to mismatches between the training set and the desired class. Rouhi et al. proposed a hybrid approach that first reduces the dimension of the features and then applies the Advanced Binary Ant Colony (ABACOh) metaheuristic algorithm. The hybrid approach enhanced the accuracy of the classifier compared with available methods [29].
The study of genetic mutations inside the cell is one of the primary applications of microarray data, which provide the spectrum of expressed genes for further analysis. Jain et al. proposed a two-phase algorithm that combines Correlation-based Feature Selection (CFS) with improved Binary Particle Swarm Optimization (iBPSO). At a later stage, it applies a Naive Bayes classifier with stratified 10-fold cross-validation. The method accelerates convergence and yields a highly accurate classifier on seven of ten data sets covering different types of cancer [30].
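Stratified 10-fold cross-validation, as used in the evaluation above, keeps the class proportions roughly equal across folds. The sketch below is a minimal illustration under stated assumptions: samples of each class are shuffled and dealt to the folds in round-robin order, and the function name and the seed parameter are hypothetical choices, not part of the method of Jain et al.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class proportions.

    Assumption: a simple round-robin deal per class; library
    implementations may balance the folds differently.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# 20 tumor and 10 normal samples: each fold receives 2 tumor and 1 normal.
labels = ["tumor"] * 20 + ["normal"] * 10
folds = stratified_folds(labels, k=10)
```

Each fold serves once as the test set while the remaining nine are used for training, and the reported accuracy is the average over the ten runs.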
Advances in computing facilities and the vectorization of iterative steps in algorithms enable data scientists to extract more useful insights from data. They also help in experimenting with a large number of possible feature subsets of microarray data. Several parallel computing frameworks, such as Hadoop, Spark, and Mahout, provide a wide range of tools, including feature extraction and selection, to tune a classifier to the desired level of accuracy. Venkataramana et al. implemented a parallelized hybrid feature selection (HFS) method. It not only incorporates statistics on feature subsets but also ranks them to select the most effective, informative genes. The proposed method achieved an accuracy of 97% on gastric cancer data sets and improved accuracy to some extent compared with available methods [31].
Alzaqebah et al. presented a study demonstrating the use of cuckoo search methods for feature selection. The study combined cuckoo search with a memory-based mechanism that saves optimal solutions (feature vectors) in order to find features that enhance classification accuracy. The proposed algorithm was also compared with the original algorithm on a microarray dataset. The study concluded that the proposed algorithm produced outcomes superior to those of the original and contemporary algorithms [32].
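The general shape of cuckoo search for feature selection can be sketched as follows. This is a generic, simplified binary variant and not the algorithm of Alzaqebah et al.: the Lévy step uses Mantegna's method, continuous nest positions are turned into feature masks through a sigmoid, and the "memory" is reduced to remembering the best mask seen so far. All function names, parameters, and default values are assumptions for illustration.

```python
import math
import random


def levy_step(rng, beta=1.5):
    """Draw one heavy-tailed Levy step using Mantegna's method."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma = (num / den) ** (1 / beta)
    u = rng.gauss(0.0, sigma)
    v = rng.gauss(0.0, 1.0) or 1e-12  # avoid division by zero
    return u / abs(v) ** (1 / beta)


def binarize(position, rng):
    """Map continuous coordinates to a 0/1 feature mask via a sigmoid."""
    return [1 if rng.random() < 1.0 / (1.0 + math.exp(-x)) else 0 for x in position]


def random_nest(rng, n_features):
    return [rng.uniform(-1.0, 1.0) for _ in range(n_features)]


def binary_cuckoo_search(fitness, n_features, n_nests=10, n_iter=50,
                         pa=0.25, step=0.1, seed=0):
    """Maximize fitness(mask) over binary feature masks (simplified sketch)."""
    rng = random.Random(seed)
    nests = [random_nest(rng, n_features) for _ in range(n_nests)]
    masks = [binarize(nest, rng) for nest in nests]
    scores = [fitness(mask) for mask in masks]
    best = max(range(n_nests), key=scores.__getitem__)
    best_mask, best_score = masks[best][:], scores[best]  # memory of the best solution
    for _ in range(n_iter):
        for i in range(n_nests):
            # Levy flight from nest i, clamped so the sigmoid stays well-behaved.
            candidate = [max(-10.0, min(10.0, x + step * levy_step(rng))) for x in nests[i]]
            mask = binarize(candidate, rng)
            score = fitness(mask)
            j = rng.randrange(n_nests)  # compare against a randomly chosen nest
            if score > scores[j]:
                nests[j], masks[j], scores[j] = candidate, mask, score
        # Abandon a fraction pa of the worst nests and rebuild them at random.
        worst_first = sorted(range(n_nests), key=scores.__getitem__)
        for i in worst_first[:int(pa * n_nests)]:
            nests[i] = random_nest(rng, n_features)
            masks[i] = binarize(nests[i], rng)
            scores[i] = fitness(masks[i])
        top = max(range(n_nests), key=scores.__getitem__)
        if scores[top] > best_score:
            best_mask, best_score = masks[top][:], scores[top]
    return best_mask, best_score
```

In a wrapper setting, `fitness` would typically train a classifier on the genes selected by the mask and return its cross-validated accuracy, possibly with a penalty on the number of selected genes.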
Swathypriyadharsini et al. proposed a methodology for identifying co-expressed genes that combines triclustering with a hybridized cuckoo search (CS) algorithm and clonal selection. To ascertain the biological significance of the genes in the generated clusters, the technique then uses gene ontology, functional annotation, and transcription factor binding site analysis. The experimental results were shown to be superior to those of both conventional cuckoo search techniques and other current triclustering algorithms [33].
Zhao et al. proposed a new search algorithm, the Elite Hybrid Binary Cuckoo Search (EHBCS), which employs feature weighting and an elite strategy to improve on Cuckoo Search. The proposed algorithm outperformed the binary genetic algorithm and the binary particle swarm optimization algorithm in terms of standard deviation, sensitivity, specificity, precision, and F-measure [34].
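The evaluation measures listed above follow directly from the counts of a binary confusion matrix. The sketch below shows the standard definitions; the function name and the convention of returning 0.0 for an undefined ratio are choices made here for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, precision and F-measure from confusion counts."""

    def ratio(numerator, denominator):
        # Assumption: an undefined ratio (empty denominator) is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)  # true positive rate (recall)
    specificity = ratio(tn, tn + fp)  # true negative rate
    precision = ratio(tp, tp + fp)
    f_measure = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Reporting these measures alongside accuracy matters for microarray data, where class sizes are often imbalanced and accuracy alone can be misleading.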
Othman et al. developed a hybrid multi-objective cuckoo search (CS) algorithm that incorporates innovative operators for gene selection, namely single crossover and double mutation operators. The suggested method was assessed on seven freely available, high-dimensional cancer microarray datasets. According to the experimental findings, the technique selected a smaller set of relevant genes while outperforming the multi-objective cuckoo search and classic cuckoo search algorithms [35].
Pandey et al. presented a novel feature selection technique based on the binomial cuckoo search metaheuristic. To increase stability, they developed a hybrid data transformation procedure that combines Fast Independent Component Analysis (FICA) and Principal Component Analysis (PCA). The suggested method was evaluated on 14 benchmark datasets from the UCI repository and compared with other methods such as the binary cuckoo search, the binary bat algorithm, the binary gravitational search algorithm, the binary Whale Optimization with Simulated Annealing, and the binary Grey Wolf Optimization [36].
Swathypriyadharsini et al. conducted a comparative study of two commonly used bio-inspired optimization algorithms, Cuckoo Search (CS) and Particle Swarm Optimization (PSO), for triclustering microarray gene expression data. Both algorithms were applied to three real-life three-dimensional datasets with mean square residue as the fitness function. The experimental results showed that CS has better computational efficiency than PSO [37].
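The mean square residue used as the fitness function above measures how far a candidate tricluster deviates from a perfectly additive pattern. The sketch below uses one common three-dimensional extension of the Cheng and Church residue; the exact definition used by Swathypriyadharsini et al. is not given in this review, and the axis interpretation (gene, sample, time point) is an assumption.

```python
def mean_square_residue(tensor):
    """Mean square residue of a 3D block given as nested lists.

    Assumption: tensor[i][j][k] is the expression of gene i in sample j
    at time point k, and the residue of a cell is
    a_ijk - mean_i - mean_j - mean_k + 2 * overall_mean.
    """
    n_i, n_j, n_k = len(tensor), len(tensor[0]), len(tensor[0][0])
    total = n_i * n_j * n_k
    mean_all = sum(v for plane in tensor for row in plane for v in row) / total
    mean_i = [sum(sum(row) for row in tensor[i]) / (n_j * n_k) for i in range(n_i)]
    mean_j = [sum(sum(tensor[i][j]) for i in range(n_i)) / (n_i * n_k) for j in range(n_j)]
    mean_k = [sum(tensor[i][j][k] for i in range(n_i) for j in range(n_j)) / (n_i * n_j)
              for k in range(n_k)]
    squared = 0.0
    for i in range(n_i):
        for j in range(n_j):
            for k in range(n_k):
                residue = tensor[i][j][k] - mean_i[i] - mean_j[j] - mean_k[k] + 2 * mean_all
                squared += residue * residue
    return squared / total


# A perfectly additive block (value = i + 2j + 3k) has a residue of about zero.
coherent = [[[i + 2 * j + 3 * k for k in range(4)] for j in range(3)] for i in range(2)]
score = mean_square_residue(coherent)
```

A lower value indicates a more coherent tricluster, so an optimizer such as CS or PSO would search for blocks that minimize this score, usually while also rewarding larger block volume.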
Scaria et al. proposed a user-friendly rule-based classification model for processing microarray gene data. Here, the cuckoo search optimization algorithm was used to form classification rules, which were then pruned by associative rule mining. The study concluded that the performance of the proposed approach was adequate in terms of accuracy, sensitivity, specificity, and time consumption [38].
Balamurugan et al. presented a study on the use of shuffled cuckoo search with Nelder-Mead (SCS-NM) for finding significant biclusters in large expression data. The study used shuffling and the Nelder-Mead simplex method to diversify and intensify the search and thereby gain an edge over the compared algorithms. The proposed work was evaluated on four benchmark datasets, where it showed a significant improvement in fitness value compared with the swarm intelligence technique and various biclustering algorithms [39].
Boushaki et al. performed a study hybridizing the parallel cuckoo search optimization (PCSO) algorithm with Naïve Bayes (NB). To locate feature subsets that optimize accuracy, they developed the parallel cuckoo search with Naïve Bayes (PCSNB) wrapper technique, which couples the exploration capacity of PCSO with the speed of Naïve Bayes. The suggested method was evaluated on seven distinct datasets and compared against existing metaheuristic methods. The experimental findings showed that it predicted more accurately than previous algorithms and was relatively efficient [40].
Pandey et al. presented a metaheuristic technique that combines k-means clustering with an enhanced cuckoo search to extend the capabilities of conventional clustering techniques. Three microarray datasets were used to assess the effectiveness of the suggested approach. The study found that the suggested algorithm outperformed the benchmarked state-of-the-art approaches [41].
Kulhari et al. explored a novel Gauss-based cuckoo search clustering metaheuristic to broaden the capabilities of traditional algorithms for unsupervised data classification. The proposed algorithm was tested on three microarray datasets. The study concluded through result-driven experimentation that the proposed algorithm performed better than existing methods [42].