STFT, LASSO and EHO based Feature Extraction with Integrated Machine Learning and Metaheuristic Classification Techniques for Colon Cancer detection from Microarray Gene Expressions

doi:10.21203/rs.3.rs-4357463/v1

Download PDF

Article

STFT, LASSO and EHO based Feature Extraction with Integrated Machine Learning and Metaheuristic Classification Techniques for Colon Cancer detection from Microarray Gene Expressions

https://doi.org/10.21203/rs.3.rs-4357463/v1

This work is licensed under a CC BY 4.0 License

Journal Publication

published 17 Jul, 2024

Read the published version in Scientific Reports →

You are reading this latest preprint version

The microarray gene expression data poses a tremendous challenge due to their curse of dimensionality problem. The sheer volume of features far surpasses available samples, leading to overfitting and reduced classification accuracy. Thus the dimensionality of microarray gene expression data must be reduced with efficient feature extraction methods to reduce the volume of data and extract meaningful information to enhance the classification accuracy and interpretability. In this research, we discover the uniqueness of applying STFT (Short Term Fourier Transform), LASSO (Least Absolute Shrinkage and Selection Operator), and EHO (Elephant Herding Optimisation) for extracting significant features from lung cancer and reducing the dimensionality of the microarray gene expression database. The classification of lung cancer is performed using the following classifiers: Gaussian Mixture Model (GMM), Particle Swarm Optimization (PSO) with GMM, Detrended Fluctuation Analysis (DFA), Naive Bayes classifier (NBC), Firefly with GMM, Support Vector Machine with Radial Basis Kernel (SVM-RBF) and Flower Pollination Optimization (FPO) with GMM. The EHO feature extraction with FPO-GMM classifier attained the highest accuracy in the range of 96.77, with an F1 score of 97.5, MCC of 0.92 and Kappa of 0.92.

STFT

LASSO

EHO

Lung Cancer

Microarray gene expression

GMM

PSO GMM

DFA

NBC

Firefly GMM

SVM

Flower Pollination Optimization with GMM

Colon cancer stands as a formidable public health adversary, claiming countless lives annually due to its late detection and devastating potential as mentioned in Jemal et al. [1]. Colon cancer, also known as colorectal cancer, thus poses a significant threat to public health due to its widespread prevalence and potential for lethality. Early identification maximizes treatment success and offers a beacon of hope, significantly improving treatment efficacy and long-term prognosis for patients as discussed in Veer et al. [2].

There are several medical methods to diagnose Colon cancer, which are mentioned in the research and literature. The Stool Occult Blood Test (FOBT) is a simple test that checks stool samples for hidden blood used to detect colon cancer. The hidden blood can be a sign of colon cancer or other conditions. But, the FOBT is a non-invasive and convenient test that recognises colon cancer in advanced stages as discussed in Merak et al. [3]. A biopsy is a method that removes a small tissue sample from a suspicious colon area to visually examine and diagnose cancer under a microscope by a pathologist. As indicated in Compton et al. [4], Biopsy turns out to be an invasive method involving bleeding risks, but can confirm cancer along with the type and grade of cancer stage. Miller et al. [5] discusses about Barium Enema (BaEn), a different method that captures X-ray images of the colon area. The colon area will be filled with barium, a contrast material for X-ray image visualization. BaEn involves radiation exposure and can reveal polyps, masses, or the narrowing of the colon part. It offers a broader view of the colon compared to FOBT, but may not detect small polyps as indicated in Ott et al.[6]. Ultrasonic scanning method is suggested in Ignatov et al.[7] that employs high-frequency sound waves to produce images of the colon, aiding in the detection of abnormalities such as tumours or masses suspected of colon cancer. It is a non-invasive method that does not require radiation exposure making it suitable for evaluating the colon area without any risk including sedation.

Colonoscopy is an advanced methodology that utilizes a flexible tube with a camera to examine the abnormal growths or polyps in the colon area. The colonoscopy offers direct visualization and aids in the early detection of colon cancer as mentioned in Hurlstone et al.[8]. However, as Robertson et al.[9] has pointed out, Colonoscopy is invasive and needs further biopsy procedures to make a conclusion and report. In this case as pointed out by Summers et al.[10], Virtual Colonoscopy (VC) or Computer Tomography (CT scans) are preferred to create detailed images of the colon, aiding in the detection of polyps and tumours. As indicated in Waye et.al. [11], VC stands as an alternative to traditional colonoscopy that doesn't require sedation procedures. The risk involved in VC is that it may miss small lesions that indicate the presence of colon cancer. All the above mentioned methods mostly provides insights to cancer and diagnosis based on anatomical factors. In this case as pronounced in Nannini et al.[12], microarray gene expression analysis can diagnose and reveal gene expression patterns to provide a deeper and molecular-level understanding of the colon cancer.

Microarray gene expression analysis examines the activity of thousands of genes simultaneously. It offers several advantages over traditional clinical methods for diagnosing colon cancer. Firstly, as pointed out in Poturnajova et.al.[13], it provides a comprehensive molecular profile of the tumor, allowing for the identification of specific gene expression patterns associated with cancer development and progression. As given in Zhang et al.[14], compared to ultrasonic scanning, molecular information provided by microarray gene expression analysis offers distinct advantages in diagnosing colon cancer. The identification of specific genetic alterations associated with the colon cancer development and progression is thus possible. The molecular information offers insights into personalized treatment strategies and prognosis, surpassing the capabilities of ultrasound in providing detailed molecular insights into the disease process as described in Vaidya et al.[15]. Also as performed in Badgwell et al.[16], microarray gene expression analysis detected subtle molecular changes for ovarian cancer that are not visible on ultrasound imaging, enhancing its sensitivity for early detection and diagnosis of cancer.

Moreover, microarray analysis can distinguish between different subtypes of colon cancer based on their gene expression profiles as indicated in Bertucci et al.[17], which may have prognostic implications and influence treatment decisions. Xavier et al.[18] details that Microarray gene expression also help to identify new biomarkers for colon cancer diagnosis, leading to the development of more specific and sensitive tests. Furthermore, microarray gene expression analysis can be performed using minimally invasive techniques, such as obtaining tumor tissue samples via biopsy or analysing circulating tumor cells or cell-free DNA in the blood, reducing patient discomfort and risk compared to invasive procedures like colonoscopy as discussed in Galamb et al.[19]. Overall, microarray gene expression analysis offers a more precise, sensitive, and minimally invasive approach to diagnosing colon cancer, potentially leading to earlier detection, more accurate prognosis, and tailored treatment strategies for patients. However, as marked in Wulfkuhle et al.[20], microarray technology is still under development for routine clinical use and requires complex analysis and further validation before widespread adoption. The vast amount of data generated by microarrays post a major challenge known as the "curse of dimensionality," as sermonized in Maniruzzaman et al.[21], necessitating dimensionality reduction techniques and feature extraction techniques for effective classification and analysis of microarray data as sermonized in Guyon et al.k [22].So, in this research we focus to leverage the dimensionality of microarray gene expression data, and select relevant features that aids in improving the classification accuracy. The various dimensionality reduction and feature extraction methods with be discussed in the upcoming section.

1.1 Review of Previous Work

There are various research works performed in literature based on dimensionality reduction and feature extraction methods on microarray gene data for various cancer databases. The selection of methodology to perform dimensionality reduction and feature extraction is significant in reducing the data volume and extract relevant features and patterns. This selection aids the upcoming classification phase to improve the classification accuracy and reducing the overfitting problems. The diverse feature extraction/dimensionality techniques with classification, pronounced in the literature is provided in Table 1.

Table 1

Review of Previous Work
Sl.No	Author and Year	Database	Feature Extraction/ Dimensionality Reduction Technique	Classifiers Used	Evaluation Metrics
1	Liu et al. (2013), [23]	Colorectal cancer Dataset (Prof.Lindy Durant’s research)	Continuous Wavelet Transform (CWT)	Genetic Algorithm based on Bayes classifier	Accuracy = 78.6%
2	Islam et al.(2023)[24]	Ischemic sensitivity dataset	genomap	genomap + genoNet	Accuracy = 93%
3	Xiao et al.(2018), [25]	TCGA Database	Stacked sparse auto-encoder (SSAE)	Support Vector Machine (SVM), Neural Network (NN) ,Random Forest (RF)	Accuracy = 99.89%
4	You et.al. (2013), [26]	(Bio) Dataset	F-test & Partial Least Squares (PLS), ReliefF and PLS, Recursive Feature Elimination and PLS	LDA,SVM, NBC,KNN (K-Nearest Neighbourhood), INN	96.43%
5	Bonev et al. (2008), [27]	Datasets from Broad Institute, Stanford Genomic Resources, and Princeton University	Mutual Information	KNN, SVM	Accuracy = 89%
6	Xu et al. (2018) [28]	Simulated Genome Dataset Available in Bioinformatics online	Extreme phenotype sampling (EPS)	LASSO	Accuracy = 90%
7	Torkey et al. (2021), [29]	METABRIC, Nature 2012, and Nat Commun 2016, COX-PASnet, and TCGA	Principal Component Analysis (PCA) with Sparse Autoencoders	Cox Regression, Random Survival Forest (RSF)	Accuracy = 98%
8	Abdulla et al. (2020), [30]	Leukemia and DLBCL dataset	Genetic Algorithm (GA) and Binary Gray Wolf Optimization (BGWO)	Random Forest (RF) and KNN	Accuracy = 95%
9	Li et al.(2018), [31]	Colon cancer and Leukemia dataset	Variants of LDA, LASSO Spectral regression discriminant analysis (SRDA), locality preserving projections (LPP), kernel discriminant analysis with spectral graph analysis (SRKDA)	LDA, RLDA, HLDA,NDA, SRDA, LPP, SRKDA, and Lasso SRDA	Accuracy = 84%
10	Zhang et al. (2015), [32]	Leukemia-ALLAML, SRBCT	Projection Matrix	kNN, Locally Linear Discriminant Embedding (LLDE), DNE (Discriminant Neighbourhood Embedding)	Accuracy = 90%
11	Subhajit et al. (2015), [33]	SRBCT, ALL_AML and MLL microarray datasets	Particle Swarm Optimization (PSO) –adaptive KNN	SVM (Linear, RBF, Polynomial, Quadratic)	Accuracy = 95%
12	Nursabillilah et al.(2022), [34]	Breast Cancer Gene Expression Database	Information Gain, Relief and Fisher Score (filter-based methods), LASSO (embedded-based method), GA, PSO, harmony search, ant colony, artificial bee colony, firefly algorithm, cuckoo search, gravitational search, grey wolf, whale optimization	RF, KNN, Naïve Bayes (NB), Logistic Regression (LR), Fuzzy Logic, Artificial NN (ANN)	Accuracy = 90–100%
13	Wang et al.(2003) [35]	Leukemia, Colon Cancer, Brain tumors and NCI60	Fuzzy c-means clustering, Weighted/Mean component plane	Fisher's linear discriminant	Accuracy = 95%
14	Aziz et al. (2016), [36]	colon cancer, acute leukemia, prostate cancer, lung cancer II, and high-grade glioma	independent component analysis (ICA) and fuzzy backward feature elimination (FBFE)	SVM and NB	Accuracy = 90%

Here are certain limitations and research directions observed in the diverse methodologies listed in Table 1. In Liu et al. [23], low accuracy is obtained to more number of redundant features. Islam et al. [24] and Xiao et al [25] reported high accuracies due to the applied deep learning methods. However, deep learning methods incurs high computational complexity. The proposed methods in You et.al. [26] is not suitable for datasets that are highly nonlinear in nature. Bonev et al. [27] and Xu et al. [28] showcased efficient feature extraction techniques but there is only limited exploration of diverse classifiers. Torkey et al. [29] performed survival tests instead of classification and obtained good accuracies in range of 98%. Abdulla et al. [30] has proposed a cost sensitive feature selection method with accuracy in the range of 95%. Li et al. [31] explored variety of regression techniques, and reported 84% of accuracy due to the minimum search space exploration due to LASSO and LDA regressions. Zhang et al. [32] achieved accuracy of 90% but the methodology is sensitive to outliers and missing of sensitive information while handling nonlinear data. Subhajit et al. [33] and Nursabillilah et al. [34] attained accuracy of 95%, but slight traces of overfitting is reported by the adopted metaheuristic methods. Wang et al. [35] and Aziz et al. [36] proposed methods that reported accuracies over 90%, but methodologies adopted in these research works are not suitable for highly nonlinear datasets. From these observations in literature in this research we adopt, STFT (Short Term Fourier Transform), LASSO (Least Absolute Shrinkage and Selection Operator), and EHO (Elephant Herding Optimisation) for the feature extraction/dimensionality reduction method. The machine learning classifiers namely DFA, NBC, GMM, and SVM (RBF) are also compared with integrated metaheuristic classifiers namely PSO-GMM, Firefly-GMM, and Flower Pollination Optimization with GMM. The materials and methodology is discussed in the upcoming section.

2.1 Dataset

In this research, we have used a publicly accessible dataset to classify cases of colon cancer, provided in Alon et al. [37]. The dataset contains 2000 genes. There are 62 samples total; Class 1 represents the tumour class with 40 samples, and Class 2 represents the healthy class with 22 samples. The Table 2 summarises the dataset's details in tabular form.

Table 2

Dataset details
Dataset	Number of genes	Class 1 (Cancer)	Class 2 (Healthy)	Total samples
Colon Cancer [37]	2000	40	22	62

2.2 Work flow of the Research

The microarray gene expression data generation involves numerous steps, starting with obtaining an RNA sample from the suspected colon cancer area cells. As explained in Sakyi et al. [38], isolating RNA uses techniques like phenol-chloroform extraction. After obtaining the RNA sample, a process called as reverse transcription is performed to convert RNA into complementary DNA (cDNA). The cDNA will be labelled with fluorescent dyes and other markers. The labelled cDNA is hybridized onto microarray slides containing thousands of known DNA sequences that correspond to genes of suspected colon cancer area. After hybridization, the microarrays are scanned to detect the fluorescence signals, which indicate the abundance of each gene's mRNA in the original sample. At last, the raw fluorescence intensity data are processed and analyzed to generate a microarray gene expression matrix. The rows of the microarray gene expression matrix represents a gene and each column represents a sample.

The microarray gene expression matrix provides valuable insights into the expression levels of thousands of genes simultaneously. It helps to identify significant patterns and differences in gene expression across diverse conditions and experimental groups. The major steps involved in generation of microarray gene expression data along with overall work flow of the research work is illustrated in Fig. 1. In the next section, the details about various feature extraction methods adopted in the research is discussed.

2.3 Feature Extraction Methods

As mentioned, STFT, LASSO and EHO feature extraction techniques are adopted in this research. The STFT is nothing but a windowed Fourier transform that is used to capture time-varying frequency components in gene expression data. It enables the analysis of dynamic gene expression patterns over time intervals. LASSO employs regularization to simultaneously perform feature selection and data shrinkage. This technique effectively identifies the subset of genes with significant expression levels and mitigating the overfitting problems during microarray data analysis. On the other hand, EHO is a metaheuristic approach that is inspired by the herding behaviour of elephants. This nature-inspired extracts and also optimizes the features in the gene expression data, providing unique insights into gene interactions.

2.3.1 Feature Extraction using STFT

A frequency domain study of the data over a brief time span is called the Short-Time Fourier Transform (STFT). By extracting pertinent and helpful features, the STFT can reduce the data dimension and extract relevant features when applied to microarray gene data. Gupta et al. [39] have used STFT to do QRS Complex Detection. According to the authors, STFT is helpful in giving researchers a time-frequency representation of the data, enabling them to examine how variations in gene expression levels occur over various frequency components and over time. With STFT, a localised representation of frequency content over time can be obtained by identifying particular time intervals where particular frequency components are prominent. This STFT feature is helpful in finding genes that show temporal patterns active during particular time intervals for microarray gene expression research. Additionally, by identifying the crucial genes or gene clusters connected to particular frequency components, STFT lowers dimensionality. The Blackman window [40] that reduces spectral leakage serves as the windowing function for STFT calculations. Therefore, by identifying the frequency patterns and underlying correlations in the dataset, STFT can offer biological insights. The computation of STFT is expressed as

$$X\left(m,w\right)= \sum _{n=-\infty }^{\infty }x\left[n\right]w[n-m]{e}^{-jwn}$$

Here, x [n] represents the input data having length N with n = 0, 1, 2, … N − 1. The STFT window is represented by w[n] having length M with m = 0, 1, 2, … M − 1, j = √ (− 1) denotes the complex number. The Blackman window is given by the expression,

$$w\left[n\right]=0.42-0.5\text{cos}\left(\frac{2\pi n}{M}\right)+0.08\text{cos}\left(\frac{4\pi n}{M}\right), 0\le n\le N-1$$

In essence, STFT would be able to extract features from the microarray gene expression data and captures significant time-varying patterns that allows the detection of dynamic gene expression changes crucial for understanding inherent biological processes. However, STFT often exhibits its inability to effectively handle high-dimensional data and multicollinearity. This can lead to inadequate performance or overfitting issues. Here we discuss the significance of LASSO that can overcome these issues by imposing a penalty on the absolute size of regression coefficients.

2.3.2 Feature Extraction using LASSO Regression

LASSO is an extension of ordinary least squares (OLS) regression introduced by Tibshirani et al. [41]. It works by minimizing the sum of squared residuals while imposing a constraint on the sum of the total values of the regression constants. The estimated regression coefficients in LASSO can be characterized as follows:

$$\beta =argmin\left\{{\sum }_{i=1}^{n}{({x}_{i}-{\sum }_{j=1}^{p}{\beta }_{j}{y}_{ij})}^{2}\right\}+\lambda {\sum }_{j=1}^{p}\left|{\beta }_{j}\right|$$

Here ‘n’ represents the number of observations in the dataset and ‘p’ represents the number of predictor variables or features in the dataset. The variable ${x}_{i}$ is the observed value in the i^th observation, ${y}_{ij}$ is the predicted value in the i^th observation and j^th prediction, ${\beta }_{j}$ represents the regression coefficient associated with the j^th predictor. The regularisation values are represented by, a dimensional vector that is also designated as the penalty term. The $\lambda$ variable contains the LASSO estimation values for the slope coefficients. It is reasonable to assume that the response variable has a mean of 0 without sacrificing generalizability. Also the covariates acts as predictors having a variance of 1 and a standardised mean of 0. When the L1 regularization is applied to the constants, a feature emerges: the coefficients tend to approach zero as λ grows, and for sufficiently large λ, some are precisely reduced to zero. Because of this special quality, the LASSO can choose models and produce predictable results. Thus, LASSO has effectively performed the feature selection by reducing the risk of overfitting in microarray gene expression data analysis. However, LASSO tends to select only a subset of features, potentially discarding relevant predictors that are correlated with the selected ones. Here, an Elephant Herding Optimization (EHO) can address this limitation by offering a nature-inspired optimization approach that explores a wider range of feature combinations, potentially capturing synergistic interactions among genes and providing a more comprehensive understanding of the underlying biological mechanisms in microarray gene expression data.

2.3.3 Feature Extraction using Elephant Heard Optimization

EHO is grounded on the food searching behaviour of elephants, is introduced in Wang et al.[42]. In the EHO metaheuristic algorithm, the location of each elephant i is iteratively updated as follows.

$${p}_{i}^{new}= {p}_{i}^{old}+ \varUpsilon \left({p}_{best}- {p}_{i}^{old}\right)*{rand}_{1}$$

Here ${p}_{best}$ is the global best and $\varUpsilon$ indicates the control variable, $\varUpsilon$ ∈ [0, 1], rand₁ refers to the random number, ${rand}_{1}$ ∈ [0, 1]. Then global best, ${p}_{best}$is updated iteratively in the following manner.

${p}_{best}^{new}= {\delta }* {p}_{center}$	(5)
${p}_{center}= \frac{1}{n}* \sum _{i=1}^{n}{p}_{i}$	(6)

Where δ ∈ [0, 1] is another control parameter. In addition, the worst position is changed according to the following equation.

$${p}_{worst}= {p}_{min}+ \left({p}_{max}- {p}_{min}+1\right)*rand$$

Here ${p}_{min}$and ${p}_{max}$ mentions the minimum and maximum position values available in the solution space. In this way using the EHO's exploration of the solution space beyond sparse representations helps in mitigating the risk of overlooking to the important features, but at the same time optimizing the model performance to avoid overfitting problems during classification. In the next section, a statistical analysis of the extracted features obtained from STFT, LASSO, and EHO is analysed in detail.

2.4 Statistical Analysis of the extracted features using STFT, LASSO, and EHO

To ensure that the feature extracted data reduces the dimensionality and retain the significant characteristics of the original microarray data, we delve into a comprehensive statistical analysis. For both the normal and cancerous classes, a comparison of the extracted features are performed through various statistical parameters: Mean, Variance, Skewness, Kurtosis, Pearson Correlation Coefficient (PCC), and Canonical Correlation Analysis (CCA). The analysis presented in Table 3, gives a clear picture of whether the adopted feature extraction methods can preserve the crucial properties of the microarray genes data within each class. The analysis serves as the validation for the subsequent classification performance and interpretations.

Table 3

Statistical Analysis of Brain Tumor Micro Array Gene after Bio Inspired Features
Statistical Parameter	STFT		LASSO Regression		EHO
Statistical Parameter	Normal	Cancer	Normal	Cancer	Normal	Cancer
Mean	6.43E + 03	7.29E + 03	0.0060	0.0545	0.0967	0.8288
Variance	5.02E + 07	6.41E + 07	1.72E-07	1.23E-07	9.8956E-05	0.0085
Skewness	1.85E + 00	1.74E + 00	-2.40E-05	0.0103	-3.0884	-0.5309
Kurtosis	1.14E + 01	1.11E + 01	-1.1999	-1.1904	9.9041	11.8288
PCC	3.33E-02	1.55E-02	-0.14286	-0.07692	0.0052	-0.0073
CCA	0.5356		0.5843		0.4230

The statistical parameter ‘Mean’ is the average value of the set of data values in the dataset. It represents the average expression levels of gene across different samples. Due to the detailed frequency information and mathematical characteristics (without scaling and normalization), STFT provides very high mean values of 6.43E + 03 and 7.29E + 03 compared to LASSO and EHO. The mean values of LASSO and EHO are low because these are normalized methods with sparsity and compression. Variance measures the spread or dispersion of the values in the dataset. In gene expression, variance indicates how much individual expression levels deviate from the mean expression level. Similar to the mean values, the variance of the STFT data is very high when compared to LASSO and STFT methods. High variance suggests that the expression levels vary widely across samples, while low variance indicates that the expression levels are similar across samples.

Skewness measures the asymmetry of the distribution of values compared to normal distribution in the dataset. A positive skewness indicates that the distribution is skewed to the right (i.e., the tail of the distribution extends more to the right), while a negative skewness indicates that the distribution is skewed to the left. From Table 3, the STFT is a right-skewed distribution, meaning there are more samples with higher expression levels for this particular feature compared to lower expression levels. LASSO exhibits nearly a symmetrical distribution with skewness values close to zero. EHO exhibits strong negative skewness in case of Normal data and alight negative skewness for Cancer data, suggesting a left-skewed distribution of expression levels. Kurtosis measures the peakedness or flatness of the distribution of values in a dataset compared to a normal distribution. High kurtosis indicates a sharp peak (leptokurtic), while low kurtosis indicates a flat peak (platykurtic). The extracted features using STFT and EHO are leptokurtic, meaning the distribution has heavier tails and more extreme values than a normal distribution. LASSO is platykurtic, revealing that the distribution has lighter tails and fewer extreme values than a normal distribution.

PCC measures the linear relationship between two variables. In gene expression analysis, PCC is used to quantify the degree of linear association between the expression levels of two genes across different samples. A value close to 1 indicates a strong positive correlation, a value close to -1 indicates a strong negative correlation, and a value close to 0 indicates no linear correlation. For both the normal and cancer classes, STFT feature extraction shows a weak positive linear relationship, LASSO exhibits a weak negative linear relationship, and EHO shows no linear relationship. The CCA is a multivariate statistical technique used to analyze the relationship between two sets of variables. In gene expression analysis, CCA is used to identify linear relationships between two cancer classes. CCA finds linear combinations of the features that maximize the correlation between the two sets of cancer classes. A positive correlation among the classes is observed in STFT, and LASSO methods, whereas a weak positive correlation is reported in the EHO feature extraction method. The CCA is thus a crucial statistical parameter for understanding the associations between different cancer class feature that can help to identify patterns and underlying biological processes in the data. All these observations can be clearly visualized through the Fig. 2 that shows the data distribution of STFT, LASSO and EHO feature extraction methods.

	Figure 2: Data Distribution Plots for Various Feature Extraction Methods

Up next, we compare the two class groups in our dataset using Viloin plot shown in Fig. 3 to compare the distributions of numeric data across the two groups. A Violin plot that combines aspects of a box plot (median, quartiles, and outliers) with a kernel density plot (which shows the distribution of the data). The width of the violin at any given point represents the probability density of the data at that value. Therefore, wider sections indicate a higher probability density, while narrower sections indicate a lower probability density.

The violin plots are the same for STFT and this indicates that the distribution of features extracted using STFT is similar across both classes (normal and cancer). This suggests that STFT may sometimes fail to capture the differences in gene expression patterns between normal and cancer samples. The spread of the violin plot with more width for LASSO suggests that there is more variability in the distribution of features extracted using LASSO across both normal and cancer classes. This implies that LASSO is capable of capturing a wider range of gene expression patterns. Also, the normal class has more violin length compared to the cancer class reports that there may be more variability in the normal samples compared to the cancer samples for the features extracted using LASSO method. In EHO, there is spread in width of violin in certain regions suggesting variability in the distribution of features extracted using EHO across both normal and cancer classes. The non-overlapping of violin lengths portrays the significant differences in the distribution of features extracted using EHO between the two classes. This reveals that EHO is effective in capturing distinct gene expression patterns associated with normal and cancer samples. In short, LASSO and EHO appear to capture more variability and distinct patterns compared to STFT, with EHO showing particularly promising results in separating normal and cancer classes based on the distribution of extracted features. Thus, from all these statistical parameters it is clear that the extracted features are extremely relevant and significant for upcoming cancer classification step.

Once the features are extracted, the colon cancer classification is performed using machine learning and metaheuristic classifiers. In our research, we utilize a diverse set of seven classifiers to cast a wide net for the most suitable approach. The machine learning classifiers utilised are GMM, DFA, NBC, and SVM (RBF). These are probabilistic modelling used to identify clusters of similar gene expressions and patterns within the data. Further to uncover the potentially hidden dynamics inside the data, we employ the metaheuristic integrated machine learning classification namely PSO-GMM, Firefly-GMM, and FPO-GMM. By utilizing a diverse range of algorithms, we aim to capture the multifaceted nature of the data and identify the most accurate and robust approach for colon cancer classification.

3.1 Gaussian Mixture Model (GMM)

The Gaussian Mixture Model is a popular unsupervised learning technique that groups together similar data points to perform tasks like image classification and pattern recognition. A set of Gaussian distributions are joined linearly in the PDF (Probability Density Function) of GMM, which makes the classification of the data easier. So the mixture classifier creates a probability distribution from microarray gene expression levels for both the classes with the help of a combination of Gaussian distributions. After probability distribution, the class prediction is based on the highest probability value (Bayes' theorem). For every class ‘c’, and feature ‘f’, GMM assumes that the data is drawn from a mixture of ‘G’ Gaussian coefficients. This can be expressed in the following way:

$$P\left(f\right|y=c)= \sum _{g-1}^{G}{Ϻ}_{cg}N (f |{ {\rm Y}}_{cg} , { Ͼ}_{cg})$$

Here${Ϻ}_{cg}$, ${ {\rm Y}}_{cg}, { Ͼ}_{cg}$indicates coefficient of mixture, mean vector, and covariance matrix respectively, under mixture component ‘g’ in class ‘c’. The ${Ϻ}_{cg}$ signify the ratio of each constituent in the class. From training data, ${Ϻ}_{cg}$, ${ {\rm Y}}_{cg}, { Ͼ}_{cg}$ parameters are gained for each constituent in the class with ‘N’ as the total number of samples, total classes ‘C’, and i = 0,1,2,..N.

$${{\rm Y}}_{c} = \frac{1}{{N}_{c}}\sum _{i=1}^{N}{{(y}_{i}=c)}_{1} . {f}_{i}$$

$$Ͼ=\frac{1}{N-G}\sum _{c=1}^{C}\sum _{i=1}^{N}{{(y}_{i}=c)}_{1} ({f}_{i} - {{\rm Y}}_{c}){({f}_{i} - {{\rm Y}}_{c})}^{T}$$

Also, $P (y=c)$ indicating prior probability is also obtained for every class ‘c’. $P (y=c)$ is given as:

$$p \left(y=c\right)= \frac{Number of samples with class c}{Total EquationNumber of samples}$$

A new feature, ${f}_{new}$ classification is performed by likelihood estimation under mixture model using Bayes' theorem in the following way.

$$p\left(y=c|{f}_{new}\right)\propto p \left(y=c\right)\bullet p\left({f}_{new}|y=c\right)$$

$$p\left({f}_{new}|y=c\right) = \frac{1}{{\left(2\pi \right)}^{D/2}{\left|Ͼ\right|}^{1/2}}\text{e}\text{x}\text{p}\left(-\frac{1}{2}{\left({f}_{new}- {{\rm Y}}_{k}\right)}^{T}{Ͼ}^{-1}({f}_{new}- {{\rm Y}}_{k})\right)$$

Here ‘D’ represents the dimensionality of the data, that is nothing but the number of features.

3.2 Particle swarm optimization-GMM (PSO-GMM)

PSO utilizes the collective intelligence of a "swarm" to optimize the classification performance. Consider a flock of birds searching for food, constantly refining their flight paths based on individual discoveries and interactions. PSO can effectively perform classification by searching for the most clustered and hidden subset of genes from the large pool of gene expression data. Moreover, PSO can handle the complex and non-linear relationships among feature extracted gene expressions. A combination of PSO with GMM (PSO-GMM) can intricate relationships that might be missed by static GMM. The technique potentially provides a more accurate cluster identification and classification. Here's the crux of PSO-GMM classification, Nair et al. [43].

$${v}_{i}(t+1) = w{v}_{i}\left(t\right) + {c}_{1}{r}_{1}({pbest}_{i} - {x}_{i}(t\left)\right) + {c}_{2}{r}_{2}(gbest - {x}_{i}(t\left)\right)$$

where ${v}_{i}\left(t\right)$ is the velocity of particle i at iteration t, w is the inertia weight, c₁ and c₂ are constriction coefficients, r₁ and r₂ are random numbers between 0 and 1, ${pbest}_{i}$ is the best position found by particle i so far, $gbest$ is the best position found by any particle in the swarm, ${x}_{i}\left(t\right)$ is the current position of particle i.

After application of PSO, the search space will be optimized and data distribution will be changed. Now, PSO-GMM follows the expressions from Eq. (8–13) to perform the final classification. PSO-GMM is a methodology that can unlock hidden patterns in gene expression data and improves the classification accuracy by avoiding overfitting issues.

3.3 Firefly GMM

The Firefly Algorithm (FA) is a nature-inspired metaheuristic that mimics the flashing behavior of fireflies to solve optimization problems. Consider a firefly flitting through the night, their brightness representing their "fitness" in finding mates or food. So, brighter fireflies attract others and guides them towards better locations. Using this behaviour, FA can segregate the most of the relevant features and escape from the local optima of the microarray data. This exploration and search mechanism of the solution space results in reliable and robust classification outcomes. Here's the simplified representation of the attraction between fireflies:

$${I}_{i} ={{I}_{i}}^{0}{e}^{(-\gamma {r}_{ij}²)}$$

Where ${I}_{i}$ is the attractiveness, ${{I}_{i}}^{0}$ is the initial attractiveness, γ is a light absorption coefficient, ${r}_{ij}$ is the distance between fireflies i and firefly j. Fireflies move towards brighter neighbours, gradually refining their positions towards better solutions. The movement of a firefly i towards a brighter firefly j is determined by the attractiveness and randomness is given by

$${{x}_{i}}^{t+1}= {{x}_{i}}^{t}+ \beta \left({{I}_{j}}^{t}- {{I}_{i}}^{t}\right)+ \alpha (rand\left( \right)-0.5)$$

Here, ${{x}_{i}}^{t+1}$ is the new position of firefly i at time step t + 1, ${{x}_{i}}^{t}$ is the current position of firefly i at time step t, β is the attraction coefficient, α is the randomization parameter, and $rand\left( \right)$ ∈ [0, 1], is a generated random number.

The collaboration of Firefly with GMM can potentially uncover intricate relationships in gene expression data that might be missed by standard GMM. Firefly combined with GMM can also overcome the limitations of dimensionality problem. The Firefly GMM thus gives a clustered and segregated data distribution to GMM for classification by varying the mean, variance and other statistical parameters of the extracted features.

3.4 Detrended Fluctuation Analysis (DFA)

Detrended Fluctuation Analysis (DFA) analyses is efficient in solving complex and non-stationary data. DFA captures both short and long-range correlations based on how the data fluctuates around the data distribution with the help of a scaling exponent to classify the data. The scaling nature of DFA algorithm is described by root mean square fluctuation of the integrated time-series and detrended input data. For DFA, the inputs are analysed in the following way.

$$y\left(k\right)={\sum }_{i=1}^{k}[B\left(i\right)-\stackrel{-}{B}]$$

Where $B\left(i\right)$ and $\stackrel{-}{B}$ are the i^th sample of the input data and mean value of the input data respectively. Thus $y\left(k\right)$ denotes the estimated value of the integrated time-series. Now, the fluctuation of integrated time-series and detrended data for a window with scale of ‘n’ is determined by

$$F\left(n\right)=\sqrt{\frac{1}{N}{\sum }_{k=1}^{N}[y\left(k\right)-{y}_{n}(k){]}^{2}}$$

Here, ${y}_{n}\left(k\right)$ is the k^th point on the trend computed by means of the predetermined window scale, ‘N’ is the normalization factor.

Therefore using Detrended Fluctuation Analysis (DFA) as a classifier for microarray gene expression data can effectively capture the long-range correlations present in microarray gene expression data. Unlike traditional methods that focus on short-range correlations, DFA evaluates the scaling behaviour of fluctuations across different time scales. This capability is particularly valuable in gene expression data analysis, where genes may exhibit complex patterns of co-regulation and interactions across various biological processes and time scales.

3.5 Naive Bayes classifier (NBC)

NBC is a probabilistic classification algorithm based on the Bayes theorem and feature independence assumption. This straightforward assumption allows Naive Bayes classifiers to efficiently handle large feature spaces, making them computationally efficient and scalable for microarray data analysis. NBC starts by calculates the posterior probability of a class using the prior probability and likelihood. For a given class C and extracted features x₁, x₂… x_n, posterior probability $p\left(C|{x}_{1},{x}_{2},\dots ., {x}_{n}\right)$ is expressed in Fan et al. [44] as:

$$p\left(C|{x}_{1},{x}_{2},\dots ., {x}_{n}\right)= \frac{p\left(C\right).p({x}_{1},{x}_{2},\dots ., {x}_{n}|C)}{p({x}_{1},{x}_{2},\dots ., {x}_{n})}$$

For the class C, $p\left(C\right)$represents the prior probability, $p({x}_{1},{x}_{2},\dots ., {x}_{n}|C)$ is the likelihood, and $p({x}_{1},{x}_{2},\dots ., {x}_{n})$ is the evidence probability. As mentioned, in the Naive Bayes approach, the features are conditionally independent for the class. This assumption that simplifies the calculation of the likelihood as follows:

$p({x}_{1},{x}_{2},\dots ., {x}_{n}|C)$ = $p\left({x}_{1}\right|C)$. $p\left({x}_{2}\right|C)$ …. $p\left({x}_{n}\right|C)$(20)

Where $p\left({x}_{i}\right|C)$ is the probability of feature ${x}_{i}$ of the class C. The $\left({x}_{i}\right|C)$ is estimated from the fraction of class C training examples with the feature value ${x}_{i}$. Then the prior probability (𝐶) can be estimated as the fraction of training examples belonging to class C. Finally, to predict the class label for the features${x}_{i}$, the algorithm calculates the posterior probability for each class and assigns the instance to the class with the highest probability. Thus using a Naive Bayes classifier for microarray gene expression data can efficiently handle large amount of data because they assume independence between features given the class label, which greatly reduces computational complexity.

3.6 Support Vector Machine (Radial Basis Function)

Support Vector Machine with a Radial Basis Function (SVM RBF) can handle complex, non-linear relationships between gene expression levels. SVM can effectively handle non-linear separability by mapping the input data into a high-dimensional feature space, where non-linear relationships become linearly separable. This is possible by construct decision boundaries between normal and cancerous samples by handling non-linear relationships effectively. RBF is the kernel used in SVM to perform the nonlinear mapping of the input features into a higher-dimensional. The RBF kernel ${K}_{RBF}\left(x,z\right)$ that is used to compute the similarity between feature vectors in the input space is given by:

$${K}_{RBF}\left(x,z\right)=\text{exp}\left(-\frac{{‖x-z‖}^{2}}{2{\sigma }^{2}}\right)$$

Where σ is the kernel width parameter that regulates the influence of each training sample, and |x − z| is the Euclidean distance between feature extracted vectors ‘x’ and ‘z’. The SVM RBF classification decision function is also described as a linear combination of the kernel evaluations between the input feature vector and the support vectors with the bias term, as performed in Zhang et al. [45].The decision function ${f}_{RBF}\left(x\right)$ is given by:

$${f}_{RBF}\left(x\right)=\sum _{i=1}^{N}\left({\alpha }_{i}{y}_{i}\right){K}_{RBF}\left({x}_{i, }x\right)+b$$

Here ${y}_{i},$ and ${\alpha }_{i}$ are Lagrange multipliers and class labels respectively associated with each support vector. The ${K}_{RBF}\left(x,z\right)$ computes the computes the similarity or distance between two feature extracted vectors ${x}_{i}$ and $x$. $b$ is the bias term that shifts the decision boundary away from the origin allowing the SVM to classify data points that may not be separable by a hyperplane in the original feature space.

3.7 Flower Pollination Optimization (FPO) with GMM

FPO is a metaheuristic optimization algorithm inspired by the pollination behavior of flowering plants. In nature, pollination happens through two major forms: abiotic and biotic. The biotic cross-pollination is observed in 90% of the cases where the insects are supporting the pollination process. The insects thus take long distance steps, in which the motion can be drawn from a levy distribution. The FPO starts with the discovery of solution space and subsequent movements.The insect motion and the levy distribution is expressed in the following steps as provided in Yang et al. [46].

$${{x}_{i}}^{t+1}= {{x}_{i}}^{t}+ \delta L \left(\lambda \right) ({g}_{best}- {{x}_{i}}^{t})$$

Where ${{x}_{i}}^{t}$ is a pollen representing the solution vector ${x}_{i}$ at the $t$ ^th iteration, and ${g}_{best}$ is the global best solution discovered so far. The step size is denoted by factor $\delta$, and $L \left(\lambda \right)$ is the Levy flight step size denoting the success of pollination.

The Levy distribution is given by:

$$L \sim \frac{\lambda Г\left(\lambda \right)\text{sin}\left(\pi \lambda /2\right)}{\pi } \frac{1}{{S}^{1+\lambda }}, ({s}_{ } \gg {s}_{0} >0)$$

Where $Г\left(\lambda \right)$ the step size for is large steps where $({s}_{ }>0)$, drawn from a gamma distribution. However, for smaller pseudo-random steps, that correctly follows Levy distribution, the $s$ value is drawn from two Gaussian distributions U and V (Mantegna algorithm [47]),

$$s= \frac{U}{{\left|V\right|}^{\left(1/\lambda \right)}}, U \sim N \sim N \left(0,{\sigma }^{2}\right), V \sim N \left(\text{0,1}\right)$$

Here, $U$ is drawn from a normal distribution with mean = 0 and variance =${\sigma }^{2}$, and $V$ is drawn from a normal distribution with mean = 0 and variance = 1. The variance is given by

$${\sigma }^{2}= {\left\{ \begin{array}{ccc}\frac{Г (1+\lambda )}{\lambda Г \left[(1+\lambda )/2\right] }& .& \frac{\text{sin}\left(\pi \lambda /2\right)}{{2}^{(\lambda -1)/2}}\end{array}\right\}}^{1/\lambda }$$

The remaining 10% of the abiotic pollination observed in the plant and flower community is regarded as a Random Walk, that sometimes bring out solutions from the unexplored search space ${x}_{i}$.

${{x}_{i}}^{t+1}= {{x}_{i}}^{t}+ Є ({{x}_{j}}^{t}- {{x}_{k}}^{t})$

(27)

Here,${{ x}_{i}}^{t}$, and ${{x}_{k}}^{t}$ are considered to be pollen transformation within the same plant species, from the same population. $Є$ is the step for the random walk drawn from the uniform distribution [0,1].

Thus, FPO allows for a global search of the solution space, which is crucial for effectively exploring the high-dimensional feature space of gene expression data. FPA identifies the most relevant genes that can contribute to the classification task and reduces the dimensionality of the feature extracted data. FPA also changes the means, covariance, and mixing coefficients of data distribution, so that GMM can perform better with the optimized data that contains complex and non-linear relationships among genes. In the next section,

3.8 Selection of Target

Selecting the target is a crucial step in defining the objective of the classification methodology. The clear definition of target improves the classifier model's predictive power by focusing on the most informative aspects of the data. The noise, outliers, and nonlinearity aspects of the data decide the selection of classifier targets. Moreover, the dataset used in this research is imbalanced and selecting the target must be strategized to deliver the maximum classifier performance. The binary classification performed in this research where the classification of feature extracted microarray data samples are classified into normal and colon cancer classes. Therefore two targets are selected namely ${\varvec{Т}}_{\varvec{N}}$ and ${\varvec{Т}}_{\varvec{C}}$. For a normal class feature set of N elements, the class target for Normal class is defined as:

$$\frac{1}{N} \sum _{i=1}^{N}{М}_{i} \le {Т}_{N}$$

Where ${М}_{i}$ is the average of the feature extracted vectors of Normal class, and ${Т}_{\varvec{N}}$ target follows the constraint ${Т}_{\varvec{N}}$∈ [0, 1]. For a Colon Cancer class feature set of M elements, the class target for Colon Cancer class is defined as:

$$\frac{1}{M} \sum _{j=1}^{N}{М}_{j} \le {Т}_{C}$$

Where ${М}_{j}$ is the average of the feature extracted vectors of Colon cancer class, and ${Т}_{C}$ target follows the constraint ${Т}_{C}$ ∈ [0, 1]. The Eucledian distance between the Targets of the binary class must also follow the constraint:

$‖{Т}_{N} - {Т}_{C} ‖\ge$ 0.5(30)

Based on the Eq. [28–30], the class targets ${Т}_{N}$ and ${Т}_{C}$ are chosen as 0.85 and 0.1, respectively. The classifier performance will be monitored with the help of MSE criteria. The next section discusses the training and testing of classifiers.

3.9 Training and Testing of classifiers

Before moving to the final classification step, training of classifiers is performed utilizing the labelled microarray dataset to adjust the classifier model's parameters. This step enables the learning of complex patterns and relationships in the gene expression data and optimizes the classifier's performance by minimizing the discrepancy between predicted and actual outcomes. After training, testing evaluates the trained classifier's performance on an independent dataset not used during training. So the testing phase assesses the model's generalization ability and provides insights into its effectiveness on other datasets. Mean Square Error (MSE) serves as a critical evaluation metric during both training and testing phases. In training, MSE quantifies the disparity between predicted and actual gene expression values, guiding the optimization process to minimize prediction errors. During testing, MSE provides insights into the model's predictive accuracy and its ability to generalize to unseen data. Minimizing MSE ensures that the classifier effectively captures the underlying relationships within the gene expression data, enhancing its predictive performance. The MSE is given by calculating the average squared difference between the predicted values and the actual values in a dataset.

$$MSE= \frac{1}{N}{\sum }_{j=1}^{N}({A}_{j}-{P}_{j}{)}^{2}$$

Here ‘N’ is the total number of extracted features, ${A}_{j}$ represents the actual gene expression value of j^th instance, and ${P}_{j}$ represents the predicted gene expression value of the j^th instance. A lower MSE indicates better performance, as it signifies smaller discrepancies between predicted and actual values. Conversely, a higher MSE suggests poorer performance and potentially larger prediction errors. Thus, MSE acts as a parameter that continuously checks the performance of the classifier.

We also perform K-Fold Cross Validation as performed in Fushiki et al. [48] to validate the classifier model. In this approach, the dataset is divided into K subsets (folds), with each fold serving as a validation set while the remaining data is used for training. This process is repeated K times, ensuring that each data point is used for validation exactly once. By averaging the performance metrics across multiple validation iterations, K-Fold Cross Validation provides a more reliable estimate of the classifier's performance and helps mitigate overfitting, thus enhancing the generalization ability of the model. In this research we have varied K value in the range 5–20, and is finally the value is fixed with 10, as higher values are providing similar results.

In this research for the binary classification problem of colon cancer data, the confusion matrix is described as shown in Table 4.It is used to describe the performance of a classification model on a during the training and testing phase. The confusion matrix has four possible combinations of predicted and actual class labels: True Positives (TP), True Negatives (TN), False Positives (FP), and False negatives (FN). These values help to evaluate the overall classifier performance metrics like Accuracy, F1 Score, Error Rate, MCC, and Kappa.

Table 4

Confusion Matrix
	Predicted Normal	Predicted Cancer
Actual Normal	TN	FP
Actual Cancer	FN	TP

TP: Samples that are correctly classified as colon cancer.

TN: Samples that are correctly classified as normal.

FP: Samples that are incorrectly classified as colon cancer when they are actually normal.

FN: Samples that are incorrectly classified as normal when they are actually colon cancer

Table 5 shows the obtained training and testing MSE of classifier for various feature extraction methods. The training MSE is reported between 10^− 01 and 10^− 9. The testing MSE is reported between 10^− 5 to 10^− 8.The training process is evaluated with 2000 iterations. The FPO-GMM with STFT feature extraction attained the lowest training and testing MSE of 7.29 × 10^− 09 and 6.44× 10^− 07, respectively. For LASSO feature extraction, SVM (RBF) attained the lowest training and testing MSE of 1.6 × 10^− 07 and 9 × 10^− 08, respectively. For EHO feature extraction SVM (RBF) classifier attained the lowest training MSE of 1.67 × 10^− 07 and FPO-GMM classifier attained the lowest testing MSE of 9 × 10^− 08.

Table 5

Training and testing MSE of classifiers for STFT, LASSO and EHO feature extraction techniques
Classifiers	STFT Feature Extraction		LASSO Feature Extraction		EHO Feature Extraction
Classifiers	Training MSE	Testing MSE	Training MSE	Testing MSE	Training MSE	Testing MSE
GMM	1.02 × 10^− 05	1.44 × 10^− 05	1.02 × 10^− 05	7.84 × 10^− 06	2.25 × 10^− 06	2.25 × 10^− 06
PSO-GMM	1.02 × 10^− 05	9.01 × 10^− 06	4.84 × 10^− 06	6.25 × 10^− 06	1.44 × 10^− 06	1 × 10^− 06
DFA	4.36 × 10^− 05	2.35 × 10^− 05	1.3 × 10^− 05	1.44 × 10^− 05	2.89 × 10^− 01	2.24 × 10^− 06
NBC	4.36 × 10^− 05	6.08 × 10^− 05	6.25 × 10^− 06	3.24 × 10^− 06	2.25 × 10^− 06	2.25 × 10^− 06
Firefly-GMM	6.2 × 10^− 05	2.21 × 10^− 05	8.41 × 10^− 06	7.29 × 10^− 06	2.89 × 10^− 06	6.4 × 10^− 07
SVM (RBF)	7.84 × 10^− 06	1.44 × 10^− 06	1.6 × 10^− 07	9 × 10^− 08	1.67 × 10^− 07	2.5 × 10^− 07
FPO-GMM	7.29 × 10^− 09	6.44× 10^− 07	1.44 × 10^− 06	3.6 × 10^− 07	2.25 × 10^− 07	9 × 10^− 08

Based on the observations from Table 5, the classifier parameters are selected for the various employed classifiers in this research as provided in Table 6.

Table 6

Parameter selection of the employed classifiers
Classifiers	Parameters
GMM	$\text{C}\text{o}\text{e}\text{f}\text{f}\text{i}\text{c}\text{i}\text{e}\text{n}\text{t} \text{o}\text{f} \text{m}\text{i}\text{x}\text{t}\text{u}\text{r}\text{e}-{Ϻ}_{cg}$, Mean vector-${ {\rm Y}}_{cg}, { Ͼ}_{cg}$- Covariance matrix are initialized to zero. Test point likelihood probability = 0.1, Cluster probability = 0.5, Convergence rate = 0.6, Convergence Criteria = MSE of 10^− 7
PSO-GMM	Population Size, N = 200, Inertia Weight ($w$) = 0.7, Constriction coefficients: c1 = 1.5 and c2 = 1.5, Random numbers: r1 and r2 ∈ [0, 1], Maximum Number of Iterations = 1000 or Convergence Criteria = MSE of 10^− 7
DFA	Window Scale (n) = 1.6, Polynomial Order = 1, Normalization Factor (N) = 1.6, Degree of window overlap = 50%, Convergence Criteria = MSE of 10^− 7
NBC	Smoothing factor, α = 0.06, Prior probabilities = 0.15, Distribution Assumption = Gaussian Naive Bayes, Convergence Criteria = MSE of 10^− 7
Firefly-GMM	Population Size, N = 200, Initial attractiveness ${I}_{0}$ = 1, Randomization Parameter (α) = 0.1, Attraction coefficient β = 0.6, Light absorption coefficient γ = 0.1, Distance between two fireflies r = Eucledian ,Maximum Number of Iterations = 1000 or Convergence Criteria = MSE of 10⁻⁷
SVM (RBF)	Kernel width parameter (σ) = 0.1, Regularization Parameter (C) = 1, Class Weights(w) = 0.86, Bias, b = 0.01, Convergence Criteria = MSE of 10^− 7
FPO-GMM	Population Size, N = 200, Step Size ($\delta$) = 0.15 ,Pollination Rate (λ) = 1.5, Random walk step $Є$ ∈ [0, 1] (Uniform distribution), Switch Probability (ρ) = 0.65, Maximum Number of Iterations = 1000 or Convergence Criteria = MSE of 10^− 7

From the gene expression feature extracted data, 85% are used for training, and the remaining 15% are used for testing of models. The confusion matrix is a useful tool for assessing a machine learning model's performance in binary classification situations. The performance metrics for the binary classifier is based on the confusion matrix provided in Table 4. Through a comprehensive comparison across various feature extraction and metaheuristic and machine learning integrated classification algorithms, we strive to pinpoint the method that delivers the most accurate and consistent for colon cancer classification based on gene expression data.

4.1 Classifier Performance

As previously indicated, the confusion matrix lists the model's predictions in comparison to the actual labels of the data. By examining the values in the confusion matrix, performance metrics including accuracy, precision, F1 score, MCC, error rate, and kappa are obtained. These metrics offer valuable insights into the sensitivity, specificity, and overall effectiveness of each classification technique. The classifier performance metrics and it’s expression from confusion matrix is listed in Table 7.

Table 7

Classifier Performance Metrics
Classifier Performance	Expression from Confusion Matrix
Accuracy	$\text{A}\text{c}\text{c}\text{u}\text{r}\text{a}\text{c}\text{y}=\frac{(\text{T}\text{N}+\text{T}\text{P})}{(\text{T}\text{N}+\text{F}\text{N}+\text{T}\text{P}+\text{F}\text{P})}$
F1 Score	$\text{P}\text{r}\text{e}\text{c}\text{i}\text{s}\text{i}\text{o}\text{n}=\frac{\text{T}\text{P}}{\left(\text{T}\text{P}+\text{F}\text{P}\right)}$ $\text{R}\text{e}\text{c}\text{a}\text{l}\text{l}=\frac{\text{T}\text{P}}{(\text{T}\text{P}+\text{F}\text{N})}$ $\text{F}1 \text{S}\text{c}\text{o}\text{r}\text{e}=\frac{2\times \text{T}\text{P}}{(2\times \text{T}\text{P}+\text{F}\text{P}+\text{F}\text{N})}$
Error Rate	$\text{E}\text{r}\text{r}\text{o}\text{r} \text{r}\text{a}\text{t}\text{e}=\frac{(\text{F}\text{P}+\text{F}\text{N})}{(\text{T}\text{P}+\text{T}\text{N}+\text{F}\text{P}+\text{F}\text{N})}$
MCC	$MCC= \frac{(\text{T}\text{P}* \text{T}\text{N}-\text{F}\text{P} * \text{F}\text{N})}{\sqrt{TP+FP\left) * \right(TP+FN\left) * \right(TN+FP\left) * \right(TN+FN)}}$
Kappa	$\text{P}\text{o}=\frac{(\text{T}\text{P}+\text{T}\text{N})}{(\text{T}\text{P}+\text{T}\text{N}+\text{F}\text{P}+\text{F}\text{N})}$ $\text{P}\text{e}=\frac{(\text{T}\text{P}+\text{F}\text{P}) \text{} (\text{T}\text{P}+\text{F}\text{N})+(\text{F}\text{P}+\text{T}\text{N}) \text{} (\text{F}\text{N}+\text{T}\text{N})}{{(\text{T}\text{P}+\text{T}\text{N}+\text{F}\text{P}+\text{F}\text{N})}^{2}}$ $\text{K}\text{a}\text{p}\text{p}\text{a}=\frac{(\text{P}\text{o}-\text{P}\text{e})}{(1-\text{P}\text{e})}$

*𝑃𝑜 is the observed proportion of agreement, 𝑃𝑒 is the proportion of agreement expected by chance.

A classifier's accuracy is determined by how well it recognises the class labels in a dataset. It is obtained by dividing the total number of examples in the dataset by the number of successfully identified instances. A classifier's accuracy can be further is gauged by its F1 score, which combines recall and precision into a single statistic. F1 score is computed as the harmonic mean of precision and recall. The value of F1 score ranges from 0 to 1, with 1 denoting perfect precision and recall. Thus, F1 score is more considered over accuracy, especially in the case of imbalanced datasets. The error rate of a classifier is the proportion of misclassified instances. MCC stands for “Matthews Correlation Coefficient”, which measures the quality of binary classification models. It considers true and false positives and false negatives which is particularly useful in situations where the classes are imbalanced.

The kappa statistic, also known as Cohen’s kappa, measures agreement between two raters or between a rater and a classifier. The agreement between the true and projected classes is measured using the kappa statistic, which takes into account the likelihood that the agreement is coincidental. The Kappa coefficient = 1 denotes perfect agreement, and any value below one can be interpreted as follows: There are five levels of agreement: poor (< 0.2), fair (0.2–0.4), moderate (0.4–0.6), good (0.6–0.8), and very good (0.8–1).The performance analysis of classifiers with STFT feature extraction is first discussed through Table 8.

Table.8 Performance Analysis of Classifiers in Colon Cancer Detection from STFT Feature Extraction

S.NO	Classifiers	Accuracy	F1Score	MCC	Error Rate	Kappa
1	GMM	80.64516129	84.21052632	0.599404003	19.35483871	0.593886463
2	PSO-GMM	82.25806452	85.71428571	0.627341804	17.74193548	0.624035281
3	DFA	70.96774194	76.31578947	0.394460723	29.03225806	0.390829694
4	NBC	62.90322581	67.6056338	0.269679945	37.09677419	0.258064516
5	Firefly-GMM	67.74193548	74.35897436	0.310317338	32.25806452	0.309576837
6	SVM (RBF)	87.09677419	90	0.718181818	12.90322581	0.718181818
7	FPO-GMM	90.32258065	92.5	0.788636364	9.677419355	0.788636364

From the reported results in Table 8, it is evident that FPO-GMM has performed the best among all the other classifiers taken to consideration. An accuracy of 90.32% indicates that the classifier is performing well in correctly classifying instances. The high F1 Score of 92.5% suggests a good balance between precision and recall, which is crucial for imbalanced datasets. MCC of 0.7886 is also quite high, indicating a strong correlation between predicted and true classifications. Kappa value of 0.7886 indicates that classifier's predictions are consistent with the true classifications, considering the potential for randomness in the classification process. PSO-GMM has reported better performed better when compared to Firefly-GMM due to their differences in their optimization capabilities and convergence behaviour. PSO is likely to explore the solution space more effectively, leading to higher performance metrics compared to Firefly Algorithm. PSO's ability to efficiently search for optimal solutions contributes to better classification accuracy and MCC of 0.6273 reflects a stronger agreement between predicted and true classifications. In contrast, Firefly Algorithm might struggle with convergence or effective exploration, resulting in lower performance metrics across the board.

SVM (RBF) achieved the second highest accuracy of 87.0967, F1 Score of 90, MCC of 0.7181, and Kappa of 0.7181 values. This is because SVM (RBF) is effective the handling the high-dimensional gene expression data and capture the complex nonlinear relationships. GMM yielded moderate performance metrics as the assumption of Gaussian distributions lead to a lower performance compared to SVMs. DFA showed lower performance compared to SVM and GMM, suggesting this classifier is not well-suited for capturing the intricate patterns in gene expression data. Naive Bayesian produced the lowest performance metrics due to the assumption of independence between features. This is because, Naive Bayes is simple and efficient and assumes independence between features, which is not true for gene expression data. The performance analysis of classifiers with LASSO feature extraction is next discussed through Table 9.

S.NO	Classifiers	Accuracy	F1Score	MCC	Error Rate	Kappa
1	GMM	83.87097	87.17949	0.656355	16.12903	0.654788
2	PSO-GMM	85.48387	88.31169	0.696061	14.51613	0.692393
3	DFA	80.64516	84.21053	0.599404	19.35484	0.593886
4	NBC	87.09677	89.74359	0.725562	12.90323	0.723831
5	Firefly-GMM	83.87097	87.17949	0.656355	16.12903	0.654788
6	SVM (RBF)	96.77419	97.5	0.929545	3.225806	0.929545
7	FPO-GMM	93.5483871	95	0.859090909	6.451612903	0.859090909

Table.9 Performance Analysis of Classifiers in Colon Cancer Detection from LASSO Regression Feature Extraction

From Table 9, it evident that SVM (RBF) showed the best performance foe LASSO based feature extraction method. A high accuracy of 96.77%, F1 Score of 97.5% indicate a strong performance of the classifier in correctly classifying instances even in the class imbalance conditions. The 0.9295 is considerably high, suggesting a strong correlation between predicted and true classifications. This indicates the robustness and reliability of the classifier's predictions. The high Kappa value of 0.9295 indicates substantial agreement beyond chance, reinforcing the reliability and consistency of the classifier's performance. The performance of LASSO with SVM (RBF) is due to the combined advantages of both methods; significant and distinct feature extraction method with for nonlinear pattern recognition and classification.

The second best results are reported from FPO-GMM Classifier with a high accuracy of 93.55% and F1 Score of 95%. This results also indicate a successful classification of microarray gene expression data with MCC and Kappa of 0.8591 suggesting a strong agreement between predicted and true classifications. If we compare with FPO-GMM and GMM, there is an improvement of 9.57% in accuracy, 7.8% in F1 Score, and 0.202 difference in MCC and Kappa. This is because FPO has truly contributed to optimizing the feature data of the Gaussian Mixture Model classifier. The FPO once again showed its ability to explore the solution space effectively. Overall, the LASSO with SVM (RBF) outperformed LASSO with FPO-GMM due to SVM's robustness in handling complex, nonlinear patterns inherent in microarray gene expression data. The performance analysis of classifiers with EHO feature extraction is next discussed through Table 10.

S.NO	Classifiers	Accuracy	F1Score	MCC	Error Rate	Kappa
1	GMM	90.32258065	92.30769231	0.794769584	9.677419355	0.792873051
2	PSO-GMM	91.93548387	93.67088608	0.82614923	8.064516129	0.825646794
3	DFA	88.70967742	91.13924051	0.756365507	11.29032258	0.755905512
4	NBC	90.32258065	92.30769231	0.794769584	9.677419355	0.792873051
5	Firefly-GMM	90.32258065	92.5	0.788636364	9.677419355	0.788636364
6	SVM (RBF)	95.16129032	96.20253165	0.895932952	4.838709677	0.895388076
7	FPO-GMM	96.77419355	97.5	0.929545455	3.225806452	0.929545455

Table.10 Performance Analysis of Classifiers in Colon Cancer Detection from EHO Feature Extraction

For the FPO-GMM classifier, a high accuracy of 96.77%, F1 Score of 97.5%, MCC of 0.9295, and Kappa of 0.9295 is reported for EHO feature extraction as provided in Table 10.values indicate successful classification of the data. The high F1 Score suggests effective handling of data imbalance, indicating balanced performance in capturing both positive and negative instances. It is evident that EHO and FPO has contributed for optimizing parameters in both feature extraction and classifier tuning, respectively to enhancing the model's performance. The second best performance is observed with SVM-RBF classifier. These is only a slight improvement of performance in in terms of accuracy of 1.61%, F1 Score of 1.3%, MCC of 0.03, and Kappa of 0.03. The similarity in MCC and Kappa values suggests that both classifiers achieved a similar balance between true positives, true negatives, false positives, and false negatives in the predictions. However, the difference in Accuracy and F1 Score is due to the variations in precision and recall vales. The slight improvement in performance of the FPO-GMM over the SVM-RBF is be attributed to the EHO feature extraction method that distinctly distributes the data in two clusters as provided in Fig. 2 and Fig. 3. In essence, STFT yields higher accuracy due to its ability to capture wide variability in expression levels across samples, potentially enabling better discrimination between classes. LASSO and EHO are normalized and sparse representations with more consistent expression levels across samples and hence they are able to distinguish even the subtle differences between classes. For a smooth comparison, Fig. 4 shows the performances of various employed classifiers with their classification performance metrics.

Now, the overall performance of the various feature extraction methodologies and classification methodologies are compared using MCC vs. Kappa as portrayed in Fig. 5. The graph demonstrates a strong positive linear relationship between MCC and Kappa values for the various classifiers used with different feature extraction techniques. The high R² value indicates that the relationship between MCC and Kappa values is highly linear and predictable. As one metric increases, the other tends to increase proportionally, suggesting that classifiers performing well in terms of Kappa also tend to perform well in terms of MCC. The linear relationship also suggests that the correlation between MCC and Kappa values are robust and not specific to any particular feature extraction method, especially for the imbalance dataset in this research. Overall Fig. 5 is a representation of the reliable performance of the classifiers across various feature extraction techniques taken to consideration.

So, overall in the classification task, the choice between classifiers depend on requirements such as computational complexity, interpretability, or preference for certain optimization techniques. Overall, FPO-GMM and SVM (RBF) approaches offer effective solutions for classification tasks with microarray gene expression data, with FPO-GMM approach providing a competitive alternative to traditional methods SVM-RBF classifiers. In this upcoming section the computational complexity comparison of the integrated feature selection and classification techniques are discussed.

4.1 Computational Complexity

The computational complexity of feature extraction and classification methods plays a crucial role in determining the feasibility and efficiency of the high volume of microarray gene expression data. In feature extraction, computational complexity refers to the time and memory resources required to transform raw data into a meaningful feature representation. The computational complexity in classification involves the resources needed to train and deploy classification models, as well as to make predictions on new data points. Based on these information, the computational complexity can be represented with Big O notation - O (n), with n indicating the size of the dataset. The computational complexity O (n) indicates that the computational effort grows linearly with the dataset size ‘n’. Table 11 provides the computational Complexity of various employed methodologies for Classification.

Under the various feature extraction techniques, the EHO feature extraction with FPO GMM classifier has the highest complexity of O (2n⁸ log2n). The lowest computational complexity of O (n⁴) is observed in DFA classification. From Table 8–10 shows that the methodology with comparatively with high computational complexity has provided better results during classification. When dealing with high-dimensional data, like microarray gene expressions, feature extraction helps to reduce the dimensionality and extract relevant information, making the subsequent classification task more manageable for complex classifiers. Also during classification, to bring out the underlying patterns and nonlinear relationship of the data, often the classifiers employed need to be more complex as the gene expression data is not easily separable in the original feature space.

Table 11

Computational Complexity of various employed methodologies for Classification
Classifiers	Without Feature Extraction	With STFT Feature Extraction	With LASSO Feature Extraction	With EHO Feature Extraction
GMM	O(2n log2n)	O(2n⁴ log2n)	O(2n³ log2n)	O(2n⁴ log2n)
PSO GMM	O(2n³ log2n)	O(2n⁷ log2n)	O(2n⁴ log2n)	O(2n⁵ log2n)
DFA	O(n)	O(2n³)	O(n³)	O(n⁴)
NBC	O(n log2n)	O(2n³ log2n)	O(n³ log2n)	O(n⁴ log2n)
Firefly GMM	O(2n² log2n)	O(2n⁴ log2n)	O(2n⁴ log2n)	O(2n⁴ log2n)
SVM (RBF)	O(n log n)	O(2n²log n)	O(n³ log n)	O(n⁴ log n)
FPO-GMM	O(2n⁵ log2n)	O(2n⁷ log2n)	O(2n⁷ log2n)	O(2n⁸ log2n)

Overall, feature extraction followed by classification methods with high computational complexity is preferred in situations where achieving optimal performance, handling complex data patterns, enhancing robustness to noise, and scalability are essential considerations. So when computational resources are available, these complex learning models can be used to classify microarray gene expression data. Likewise, feature extraction followed by classification methods with lower computational complexity are preferred, for real-time applications where rapid decision-making within the time and resource limit is critical. The conclusions, limitations of research, and future research directions are discussed in the next section.

This research contributes to the ongoing fight against colon cancer by exploring novel methods for early detection and improved diagnosis. The major focus of the research is to address the curse of dimensionality problem inherent in microarray data and to extract significant features and that retains relevant patterns and inherent nonlinearity in the data. An accuracy of 96.77% is reported for LASSO feature extraction with SVM-RBF Classification. Likewise, 96.77% is also reported for EHO feature extraction with FPO-GMM Classification. This indicates the balance between feature extraction and classification of both approaches employing different optimization techniques. Methodologies worked well to discover the unseen data, indicating that the selected features are relevant. So the choice on methodology can be made based on computational complexity, suitability, and efficiency.

However, the effectiveness of metaheuristic optimization techniques, such as PSO and FPO may vary across different datasets and problem domains. The metaheuristic optimization methods are computationally intensive and often require tuning of algorithm parameters using hyper parameter tuning to further optimize and reproduce the same research results. So the future directions of the research is to utilize techniques like grid and random search to systematically explore the hyperparameter space and identify optimal settings for metaheuristic and machine learning classifiers. Also t-SNE, ReliefF methods for dimensionality reduction can be experimented to discriminate the high-dimensional data and identify underlying patterns and clusters.

Author Contribution

Author Contributions: Conceptualization, A.R.N.,H.R.,K.M.S. and C.K.; Methodology, A.R.N.,H.R.,K.M.S. and C.K.; Software, A.R.N.,H.R.,K.M.S. and C.K.; Validation, A.R.N. and H.R.; Formal analysis, A.R.N.,H.R.,K.M.S. and C.K.; Investigation, A.R.N.,H.R.,K.M.S. and C.K.; Resources, A.R.N. and C.K; Data curation, A.R.N.; Writing—original draft, A.R.N.; Writing—review and editing, A.R.N.; Visualization, A.R.N. and H.R.; Supervision, A.R.N. and H.R.

Data Availability

The datasets used and analysed during the current study available from the corresponding author on reasonable request.

Jemal, A., Siegel, R., & Xu, J. (2010). Cancer statistics, 2010. CA: A Cancer Journal for Clinicians, 60(4), 276-300.
van't Veer, L. J., & Bernards, R. (2008). Gene expression profiling for systemic disease. New England Journal of Medicine, 359 (10), 1028-1039.
Kościelniak-Merak, Barbara, et al. "Faecal occult blood point-of-care tests." Journal of Gastrointestinal Cancer 49 (2018): 402-405.
Compton, Carolyn C. "Pathology report in colon cancer: what is prognostically important?." Digestive diseases 17.2 (1999): 67-79.
Miller, D. J., and Jovitas Skucas. The radiological examination of the colon: practical diagnosis. Vol. 3. Springer Science & Business Media, 2012.
Ott, David J. "Accuracy of double-contrast barium enema in diagnosing colorectal polyps and cancer." Seminars in Roentgenology. Vol. 35. No. 4. WB Saunders, 2000.
Ignatov, Valentin, et al. "Diagnostic modalities in colorectal cancer–endoscopy, Ct and pet scanning, magnetic resonance imaging (MRI), endoluminal ultrasound and intraoperative ultrasound." Colorectal Cancer-Surgery, Diagnostics and Treatment. InTech, 2014. 29-51.
Hurlstone, D. P., T. Fujii, and A. J. Lobo. "Early detection of colorectal cancer using high-magnification chromoscopic colonoscopy." Journal of British Surgery 89.3 (2002): 272-282.
Robertson, Douglas J., et al. "Colorectal cancers soon after colonoscopy: a pooled multicohort analysis." Gut 63.6 (2014): 949-956.
Summers, Ronald M., et al. "Computed tomographic virtual colonoscopy computer-aided polyp detection in a screening population." Gastroenterology 129.6 (2005): 1832-1844.
Waye, Jerome D., James Aisenberg, and Peter H. Rubin. Practical colonoscopy. John Wiley & Sons, 2013.
Nannini, Margherita, et al. "Gene expression profiling in colorectal cancer using microarray technologies: results and perspectives." Cancer treatment reviews 35.3 (2009): 201-209.
Poturnajova, Martina, et al. "Molecular features and gene expression signature of metastatic colorectal cancer." Oncology Reports 45.4 (2021): 1-1.
Zhang, Xue Wu, et al. "Molecular diagnosis of human cancer type by gene expression profiles and independent component analysis." European Journal of Human Genetics 13.12 (2005): 1303-1311.
Vaidya, Tanvi, et al. "The continuing evolution of molecular functional imaging in clinical oncology: the road to precision medicine and radiogenomics (Part I)." Molecular diagnosis & therapy 23 (2019): 1-26.
Badgwell, Donna, and Robert C. Bast Jr. "Early detection of ovarian cancer." Disease markers 23.5-6 (2007): 397-410.
Bertucci, Francois, et al. "Gene expression profiling of colon cancer by DNA microarrays and correlation with histoclinical parameters." Oncogene 23.7 (2004): 1377-1391.
Solé, Xavier, et al. "Discovery and validation of new potential biomarkers for early detection of colon cancer." PLoS One 9.9 (2014): e106748.
Galamb, Orsolya, et al. "Diagnostic mRNA expression patterns of inflamed, benign, and malignant colorectal biopsy specimen and their correlation with peripheral blood results." Cancer Epidemiology Biomarkers & Prevention 17.10 (2008): 2835-2845.
Wulfkuhle, Julia D., et al. "Technology insight: pharmacoproteomics for cancer—promises of patient-tailored medicine using protein microarrays." Nature Clinical Practice Oncology 3.5 (2006): 256-268.
Maniruzzaman, Md, et al. "Statistical characterization and classification of colon microarray gene expression data using multiple machine learning paradigms." Computer methods and programs in biomedicine 176 (2019): 173-193.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), 1157-1182.
Liu, Yihui, et al. "Wavelet feature extraction and genetic algorithm for biomarker detection in colorectal cancer data." Knowledge-Based Systems 37 (2013): 502-514.
Islam, Md Tauhidul, and Lei Xing. "Cartography of genomic interactions enables deep analysis of single-cell expression data." Nature Communications 14.1 (2023): 679.
Xiao, Yawen, et al. "A semi-supervised deep learning method based on stacked sparse auto-encoder for cancer prediction using RNA-seq data." Computer methods and programs in biomedicine 166 (2018): 99-105.
You, Wenjie, et al. "Totalpls: local dimension reduction for multicategory microarray data." IEEE Transactions on Human-Machine Systems 44.1 (2013): 125-138.
Bonev, Boyan, Francisco Escolano, and Miguel Cazorla. "Feature selection, mutual information, and the classification of high-dimensional patterns: Applications to image classification and microarray data analysis." Pattern Analysis and Applications 11 (2008): 309-319.
Xu, Chao, et al. "EPS-LASSO: test for high-dimensional regression under extreme phenotype sampling of continuous traits." Bioinformatics 34.12 (2018): 1996-2003.
Torkey, Hanaa, et al. "A novel deep autoencoder based survival analysis approach for microarray dataset." PeerJ Computer Science 7 (2021): e492.
Abdulla, Mai, and Mohammad T. Khasawneh. "G-Forest: An ensemble method for cost-sensitive feature selection in gene expression microarrays." Artificial Intelligence in Medicine 108 (2020): 101941.
Li, Peiyang, et al. "Improved graph embedding for robust recognition with outliers." Scientific reports 8.1 (2018): 4231.
Zhang, Li, et al. "Similarity-balanced discriminant neighbor embedding and its application to cancer classification based on gene expression data." Computers in biology and medicine 64 (2015): 236-245.
Kar, Subhajit, Kaushik Das Sharma, and Madhubanti Maitra. "Gene selection from microarray gene expression data for classification of cancer subgroups employing PSO and adaptive K-nearest neighborhood technique." Expert Systems with Applications 42.1 (2015): 612-627.
Mohd Ali, Nursabillilah, Rosli Besar, and Nor Azlina Ab. Aziz. "Hybrid feature selection of breast cancer gene expression microarray data based on metaheuristic methods: A comprehensive review." Symmetry 14.10 (2022): 1955.
Wang, Junbai, et al. "Tumor classification and marker gene prediction by feature selection and fuzzy c-means clustering using microarray data." BMC bioinformatics 4 (2003): 1-12.
Aziz, Rabia, CKa Verma, and Namita Srivastava. "A fuzzy based feature selection from independent component subspace for machine learning classification of microarray data." Genomics data 8 (2016): 4-15.
U. Alon, N. Barkai, D. A. Notterman et al., “Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays,” Pro ceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 12, pp. 6745–6750, 1999.
Sakyi, Samuel Asamoah, et al. "Comparison of modified manual acid-phenol chloroform method and commercial RNA extraction kits for resource limited laboratories." International Journal of Clinical Practice 2023 (2023).
Gupta, Varun, and Monika Mittal. "QRS complex detection using STFT, chaos analysis, and PCA in standard and real-time ECG databases." Journal of the Institution of Engineers (India): Series B 100.5 (2019): 489-497.
Özhan, Orhan. "Short-Time-Fourier Transform." Basic Transforms for Electrical Engineering. Cham: Springer International Publishing, 2022. 441-464.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1), 267-288.
Wang, Gai-Ge, et al. "A new metaheuristic optimisation algorithm motivated by elephant herding behaviour." International Journal of Bio-Inspired Computation 8.6 (2016): 394-409.
R Nair, A., & S, K. (2023). Analysis of energy harvesting in SWIPT using bio-inspired algorithms. International Journal of Electronics, 110(2), 291–311. https://doi.org/10.1080/00207217.2021.2025447
Fan, Liwei, Kim-Leng Poh, and Peng Zhou. "A sequential feature extraction approach for naïve bayes classification of microarray data." Expert Systems with Applications 36.6 (2009): 9919-9923.
Zhang, Rui, and Wenjian Wang. "Facilitating the applications of support vector machine by using a new kernel." Expert systems with applications 38.11 (2011): 14225-14230.
Yang, Xin-She. "Flower pollination algorithm for global optimization." International conference on unconventional computing and natural computation. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012.
Mantegna, Rosario Nunzio. "Fast, accurate algorithm for numerical simulation of Levy stable stochastic processes." Physical Review E 49.5 (1994): 4677.
Fushiki, Tadayoshi. "Estimation of prediction error by using K-fold cross-validation." Statistics and Computing 21 (2011): 137-146.

No competing interests reported.

Download PDF

Journal Publication

published 17 Jul, 2024

Read the published version in Scientific Reports →

Editorial decision: Revision requested
31 May, 2024
Reviews received at journal
28 May, 2024
Reviewers agreed at journal
28 May, 2024
Reviews received at journal
28 May, 2024
Reviewers agreed at journal
27 May, 2024
Reviewers agreed at journal
27 May, 2024
Reviewers invited by journal
27 May, 2024
Editor assigned by journal
27 May, 2024
Editor invited by journal
07 May, 2024
Submission checks completed at journal
06 May, 2024
First submitted to journal
02 May, 2024

You are reading this latest preprint version

STFT, LASSO and EHO based Feature Extraction with Integrated Machine Learning and Metaheuristic Classification Techniques for Colon Cancer detection from Microarray Gene Expressions

Status:

Journal Publication

Version 1

Abstract

Figures

1. Introduction

1.1 Review of Previous Work

2. Materials and Methodology

2.1 Dataset

2.2 Work flow of the Research

2.3 Feature Extraction Methods

2.3.1 Feature Extraction using STFT

2.3.2 Feature Extraction using LASSO Regression

2.3.3 Feature Extraction using Elephant Heard Optimization

2.4 Statistical Analysis of the extracted features using STFT, LASSO, and EHO

3. The Classification Approach

3.1 Gaussian Mixture Model (GMM)

3.2 Particle swarm optimization-GMM (PSO-GMM)

3.3 Firefly GMM

3.4 Detrended Fluctuation Analysis (DFA)

3.5 Naive Bayes classifier (NBC)

3.6 Support Vector Machine (Radial Basis Function)

3.7 Flower Pollination Optimization (FPO) with GMM

3.8 Selection of Target

3.9 Training and Testing of classifiers

4. Results and Discussion

4.1 Classifier Performance

4.1 Computational Complexity

5. Conclusion

Declarations

Author Contribution

Data Availability

References

Additional Declarations

Status:

Journal Publication

Version 1

Classifiers	Parameters
GMM	\(\text{C}\text{o}\text{e}\text{f}\text{f}\text{i}\text{c}\text{i}\text{e}\text{n}\text{t} \text{o}\text{f} \text{m}\text{i}\text{x}\text{t}\text{u}\text{r}\text{e}-{Ϻ}_{cg}\), Mean vector-\({ {\rm Y}}_{cg}, { Ͼ}_{cg}\)- Covariance matrix are initialized to zero. Test point likelihood probability = 0.1, Cluster probability = 0.5, Convergence rate = 0.6, Convergence Criteria = MSE of 10^− 7
PSO-GMM	Population Size, N = 200, Inertia Weight (\(w\)) = 0.7, Constriction coefficients: c1 = 1.5 and c2 = 1.5, Random numbers: r1 and r2 ∈ [0, 1], Maximum Number of Iterations = 1000 or Convergence Criteria = MSE of 10^− 7
DFA	Window Scale (n) = 1.6, Polynomial Order = 1, Normalization Factor (N) = 1.6, Degree of window overlap = 50%, Convergence Criteria = MSE of 10^− 7
NBC	Smoothing factor, α = 0.06, Prior probabilities = 0.15, Distribution Assumption = Gaussian Naive Bayes, Convergence Criteria = MSE of 10^− 7
Firefly-GMM	Population Size, N = 200, Initial attractiveness \({I}_{0}\) = 1, Randomization Parameter (α) = 0.1, Attraction coefficient β = 0.6, Light absorption coefficient γ = 0.1, Distance between two fireflies r = Eucledian ,Maximum Number of Iterations = 1000 or Convergence Criteria = MSE of 10⁻⁷
SVM (RBF)	Kernel width parameter (σ) = 0.1, Regularization Parameter (C) = 1, Class Weights(w) = 0.86, Bias, b = 0.01, Convergence Criteria = MSE of 10^− 7
FPO-GMM	Population Size, N = 200, Step Size (\(\delta\)) = 0.15 ,Pollination Rate (λ) = 1.5, Random walk step \(Є\) ∈ [0, 1] (Uniform distribution), Switch Probability (ρ) = 0.65, Maximum Number of Iterations = 1000 or Convergence Criteria = MSE of 10^− 7

Classifier Performance	Expression from Confusion Matrix
Accuracy	\(\text{A}\text{c}\text{c}\text{u}\text{r}\text{a}\text{c}\text{y}=\frac{(\text{T}\text{N}+\text{T}\text{P})}{(\text{T}\text{N}+\text{F}\text{N}+\text{T}\text{P}+\text{F}\text{P})}\)
F1 Score	\(\text{P}\text{r}\text{e}\text{c}\text{i}\text{s}\text{i}\text{o}\text{n}=\frac{\text{T}\text{P}}{\left(\text{T}\text{P}+\text{F}\text{P}\right)}\) \(\text{R}\text{e}\text{c}\text{a}\text{l}\text{l}=\frac{\text{T}\text{P}}{(\text{T}\text{P}+\text{F}\text{N})}\) \(\text{F}1 \text{S}\text{c}\text{o}\text{r}\text{e}=\frac{2\times \text{T}\text{P}}{(2\times \text{T}\text{P}+\text{F}\text{P}+\text{F}\text{N})}\)
Error Rate	\(\text{E}\text{r}\text{r}\text{o}\text{r} \text{r}\text{a}\text{t}\text{e}=\frac{(\text{F}\text{P}+\text{F}\text{N})}{(\text{T}\text{P}+\text{T}\text{N}+\text{F}\text{P}+\text{F}\text{N})}\)
MCC	\(MCC= \frac{(\text{T}\text{P}* \text{T}\text{N}-\text{F}\text{P} * \text{F}\text{N})}{\sqrt{TP+FP\left) * \right(TP+FN\left) * \right(TN+FP\left) * \right(TN+FN)}}\)
Kappa	\(\text{P}\text{o}=\frac{(\text{T}\text{P}+\text{T}\text{N})}{(\text{T}\text{P}+\text{T}\text{N}+\text{F}\text{P}+\text{F}\text{N})}\) \(\text{P}\text{e}=\frac{(\text{T}\text{P}+\text{F}\text{P}) \text{} (\text{T}\text{P}+\text{F}\text{N})+(\text{F}\text{P}+\text{T}\text{N}) \text{} (\text{F}\text{N}+\text{T}\text{N})}{{(\text{T}\text{P}+\text{T}\text{N}+\text{F}\text{P}+\text{F}\text{N})}^{2}}\) \(\text{K}\text{a}\text{p}\text{p}\text{a}=\frac{(\text{P}\text{o}-\text{P}\text{e})}{(1-\text{P}\text{e})}\)