In this study, a deep neural network (DNN)-based multi-omics fusion method is employed to investigate the classification efficacy of three omics data types. Initially, the standardized dataset undergoes preliminary screening: features with more than 20% missing values and features that show no variation are removed.[13] Only samples present in all three omics data types are retained. All male samples are then removed, and three sample sets are preserved: the combined African American and White set, the African American set, and the White set. The KNN Imputer algorithm[14] is applied to impute missing values in the samples, addressing the loss of classification accuracy caused by missing data. Because this study focuses on survival prediction, clinical features directly related to survival time are removed, since they would reveal the prediction target, along with clinical features whose missing values exceed 20%.
Given that the number of features in transcriptomics[15] and methylomics data substantially exceeds the sample size, the model is prone to overfitting, a manifestation of the so-called curse of dimensionality. Consequently, feature selection is necessary for the methylomics and transcriptomics data. First, variance selection[16] is employed to filter out features with minimal variation. Next, the optimal feature subset is selected according to the weights assigned by LinearSVC.[17] The SMOTE oversampling technique[18] is then applied to balance the training set. Finally, a distinct DNN model is constructed for each of the three groups and trained with three-fold cross-validation, with each fold preserving the proportion of positive and negative samples. The ensemble learning concept is implemented to fuse the results of the three models. The experimental workflow is illustrated in Fig. 1.
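As a hedged illustration of this stratified splitting (the shuffle flag and random seed are assumptions for illustration, not settings reported here), scikit-learn's StratifiedKFold keeps the class ratio constant in every fold:

```python
# Minimal sketch of stratified three-fold splitting, assuming NumPy arrays
# X (samples x features) and y (0 = short-term, 1 = long-term survivor).
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # placeholder feature matrix
y = rng.integers(0, 2, size=200)      # placeholder binary survival labels

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each fold keeps roughly the same positive/negative ratio as the full set.
    print(f"fold {fold}: positive rate in train = {y[train_idx].mean():.2f}, "
          f"in test = {y[test_idx].mean():.2f}")
```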
Data Source and Preprocessing
In this study, data were obtained from The Cancer Genome Atlas (TCGA; https://tcgadata.nci.nih.gov/tcga/), a comprehensive repository of cancer genomic data. The dataset encompasses three types of omics data: transcriptomics, DNA methylomics, and clinical information. The intersection of patient samples across the three omics data types was taken, yielding a final dataset of 567 White samples and 156 African American samples. Survival density maps were generated with the Hiplot online visualization tool.[19] Fig. 2 displays the survival-time distributions of both groups: the blue region represents the survival-time density of African American patients and the red region that of White patients, with the abscissa and ordinate representing survival days and density, respectively.
Based on the five-year survival criterion, patients were further classified as long- or short-term survivors.[20] To address missing values within the samples, features with more than 20% missing values were removed, and the KNN Imputer method was then employed to fill in the remaining missing values, preserving the differences between features as far as possible. Because the large scale of the data adversely affects the algorithms' time complexity, the data were standardized. The resulting feature data comprised 39,953 RNA-seq features, 395,007 methylation features, and 16 clinical features.
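A hedged sketch of this imputation and standardization step is shown below; the missing-value threshold matches the 20% rule above, while the neighbor count and the use of StandardScaler are illustrative assumptions:

```python
# Minimal preprocessing sketch: drop high-missingness features, impute with
# KNNImputer, then standardize. The neighbor count k is an illustrative choice.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame, missing_threshold: float = 0.2, k: int = 5) -> np.ndarray:
    # Remove features with more than 20% missing values.
    keep = df.columns[df.isna().mean() <= missing_threshold]
    df = df[keep]
    # Impute the remaining missing values from the k nearest samples.
    imputed = KNNImputer(n_neighbors=k).fit_transform(df)
    # Standardize each feature to zero mean and unit variance.
    return StandardScaler().fit_transform(imputed)

# Usage on a small placeholder frame: the 60%-missing column is dropped.
demo = pd.DataFrame({
    "gene_a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "gene_b": [0.5, 0.6, 0.7, 0.8, 0.9],
    "gene_c": [np.nan, np.nan, np.nan, 1.0, 2.0],
})
print(preprocess(demo, k=2).shape)  # -> (5, 2)
```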
Feature Selection
In this study, we employed two distinct feature selection algorithms, variance-based selection and SelectFromModel,[17] to conduct a two-step feature screening process for the omics data whose feature counts exceed the sample size. Variance measures the dispersion of a feature's distribution, namely the average squared distance of its values from their mean. The mathematical representation of variance is shown below:
$$\sigma^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} \quad \left(1\right)$$
Here, \({\sigma }^{2}\) represents the variance, n denotes the number of samples, \({x}_{i}\) corresponds to the feature value of the ith sample, and \(\bar{x}\) signifies the mean of the feature across all samples. By leveraging variance estimation within the dataset, the high-dimensional features are first reduced in number, thereby streamlining the subsequent experiments.
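As a hedged illustration of this first screening step (the variance cutoff is an assumption, not a value reported in the study), scikit-learn's VarianceThreshold removes near-constant features:

```python
# Minimal variance-selection sketch; the threshold value is illustrative.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))   # placeholder high-dimensional omics matrix
X[:, :200] *= 0.01                 # make the first 200 features nearly constant

selector = VarianceThreshold(threshold=0.1)  # drop features with tiny variance
X_reduced = selector.fit_transform(X)
print(X.shape, "->", X_reduced.shape)        # (100, 1000) -> (100, 800)
```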
SelectFromModel (SFM) is a feature selection method that operates on the basis of feature importance weights: features whose weights fall below a threshold are discarded, and an upper bound on the number of retained features can optionally be specified. In this study, we combined the Linear Support Vector Classification (LinearSVC) classifier, which supports multi-class problems and exhibits robust performance, with SFM to select features by evaluating their respective weights. LinearSVC is widely used for feature selection in medical big data applications, offering a reliable and effective approach to the analysis of high-dimensional data.
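A hedged sketch of this second screening step follows; the L1 penalty, regularization strength, and synthetic data are assumptions for illustration only:

```python
# Minimal SelectFromModel sketch driven by an L1-penalized LinearSVC;
# all hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for omics features that survived variance selection.
X, y = make_classification(n_samples=100, n_features=500, n_informative=20,
                           random_state=0)

lsvc = LinearSVC(C=0.1, penalty="l1", dual=False, max_iter=5000).fit(X, y)
sfm = SelectFromModel(lsvc, prefit=True)   # keep features with non-trivial weights
X_selected = sfm.transform(X)
print(X.shape, "->", X_selected.shape)
```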
Sample Balance
The Synthetic Minority Oversampling Technique (SMOTE) is employed to address imbalanced datasets.[21] In this study, we used the SMOTE implementation from the imbalanced-learn Python package.[22] SMOTE mitigates the overfitting commonly caused by random oversampling: instead of duplicating minority samples, it analyzes them and synthesizes new samples from them. The algorithm proceeds as follows:
- For each sample x in the minority class, compute the Euclidean distance between it and every other sample in the minority class, thereby obtaining its k nearest neighbor samples.
- Set a sampling ratio according to the class imbalance to determine the sampling magnification N. For each minority sample x, randomly select N samples from its k nearest neighbors; each chosen neighbor is denoted \({x}_{n}\).
- For each randomly selected neighbor \({x}_{n}\), construct a new sample from the original sample using the following equation:
$$x_{\text{new}}=x+\lambda \times \left({x}_{n}-x\right) \quad \left(2\right)$$
Here, \(x_{\text{new}}\) represents the newly constructed sample, λ is a random number between 0 and 1, and \({x}_{n}\) denotes the randomly selected neighbor.
By implementing the SMOTE algorithm, this study effectively balances the dataset and addresses the challenges posed by imbalanced data in the context of classification and prediction tasks.
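The following sketch applies SMOTE from imbalanced-learn to a placeholder training set; the neighbor count and random seed are illustrative assumptions, and oversampling is applied to the training split only:

```python
# Minimal SMOTE sketch with imbalanced-learn; only the training set is resampled.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 50))          # placeholder training features
y_train = np.array([0] * 100 + [1] * 20)      # imbalanced placeholder labels

smote = SMOTE(k_neighbors=5, random_state=42)
X_bal, y_bal = smote.fit_resample(X_train, y_train)
print("before:", Counter(y_train), "after:", Counter(y_bal))
```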
Deep Neural Network
Deep learning is built on the framework of neural network models and studies networks with multiple hidden layers, as opposed to simpler, shallow neural network models.[23] The basic framework of a deep neural network (DNN) is depicted in Fig. 3: the first layer is the input layer, which receives the original feature input. The layers between the first and last layers are the hidden layers, the primary units for data processing and feature learning. Each hidden layer first fits the input from the preceding layer with a linear model, then applies a nonlinear transformation to the result through an activation function, and passes the transformed result to the next layer. The final layer, the output layer, delivers the model's final prediction. Through this layer-by-layer abstraction, each layer of the DNN can extract more complex feature information, enabling deeper data characterization and pattern learning on large-scale training samples.
Common activation functions in DNNs include Sigmoid, Tanh, and others, whose primary role is to confer nonlinear mapping capability on the network.[24] Without activation functions, a feedforward neural network can only implement a linear mapping, and a multilayer network becomes equivalent to a single-layer one. Adding activation functions gives deep neural networks hierarchical nonlinear mapping abilities. The Tanh activation function addresses the issue that Sigmoid outputs are not zero-centered, which can slow convergence. With its exponential shape, Tanh resembles the response of biological neurons and maps input data to the interval (−1, 1). Its mathematical expression is given as follows, where x represents the input value:
$$\text{tanh}\left(x\right)=\frac{{e}^{x}-{e}^{-x}}{{e}^{x}+{e}^{-x}} \left(3\right)$$
The DNN training process employs a loss function to characterize the discrepancy between a patient's true label and the network's predicted output. By minimizing the loss function, the trainable parameters of the network are continuously updated and optimized to improve survival prediction performance.[25] This model employs categorical cross-entropy to classify breast cancer patients as long-term or short-term survivors. Its mathematical expression is as follows, where the predicted label of the ith sample is \({\widehat{y}}^{i}=({\widehat{y}}_{0}^{i},{\widehat{y}}_{1}^{i})\) and the true label of the ith sample is \({y}^{i}=({y}_{0}^{i},{y}_{1}^{i})\).
$$\text{categorical crossentropy}\left(Y,\widehat{Y}\right)=-\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n}{y}_{j}^{i}\log\left({\widehat{y}}_{j}^{i}\right) \quad \left(4\right)$$
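A hedged sketch of such a network, assuming a Keras/TensorFlow implementation with illustrative layer widths and optimizer settings (architecture details are not taken from the original experiments), is shown below:

```python
# Minimal DNN sketch with Tanh hidden activations and categorical cross-entropy;
# layer widths, epochs, and optimizer settings are illustrative assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(n_features: int) -> keras.Model:
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(256, activation="tanh"),
        layers.Dense(64, activation="tanh"),
        layers.Dense(2, activation="softmax"),  # long- vs. short-term survivor
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage: labels are one-hot encoded to match the two-unit softmax output.
X = np.random.normal(size=(100, 300)).astype("float32")
y = keras.utils.to_categorical(np.random.randint(0, 2, size=100), num_classes=2)
model = build_dnn(300)
model.fit(X, y, epochs=5, batch_size=16, verbose=0)
```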
Ensemble Learning
Ensemble learning is not an independent machine learning algorithm; rather, it constructs and combines multiple machine learning models to complete a learning task.[26] Integrating multiple classifiers generally yields better results than any single one. Voting is a combination strategy for classification problems in ensemble learning, whose fundamental idea is to fuse predictions from multiple sources to reduce error. In hard voting, the predicted result is the category that occurs most frequently among the models' predictions.[27] Soft voting, on the other hand, sums the predicted probabilities of each class across models and selects the class label with the largest total probability. In the multi-omics fusion used here, soft voting is employed to fuse the outputs of the omics-specific models, which keeps the computational complexity of the fused model low and effectively improves its accuracy.
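As a hedged sketch (the probability arrays and equal weighting are assumptions for illustration), soft voting over the three per-omics models can be written as:

```python
# Minimal soft-voting sketch: average the class probabilities of the three
# omics-specific models, then take the argmax. Equal weights are assumed.
import numpy as np

# Placeholder per-sample class probabilities from each omics model (n_samples x 2).
p_clinical = np.array([[0.7, 0.3], [0.4, 0.6]])
p_rna      = np.array([[0.6, 0.4], [0.2, 0.8]])
p_methyl   = np.array([[0.5, 0.5], [0.3, 0.7]])

avg_prob = (p_clinical + p_rna + p_methyl) / 3.0  # soft-voting fusion
fused_label = avg_prob.argmax(axis=1)             # 0 = short-term, 1 = long-term
print(avg_prob, fused_label)
```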
Performance Measurements
In machine learning and deep learning, evaluating a model's performance indicators is essential for fully reflecting its recognition capabilities. The model's prediction results are typically summarized in the confusion matrix,[28] which comprises True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). Further classification indicators can be derived from the confusion matrix,[29] as shown in the following formulas:
$$\begin{array}{c}Acc=\frac{TP+TN}{TP+FP+TN+FN}\\ SN=\frac{TP}{TP+FN}\\ SP=\frac{TN}{TN+FP}\\ Pre=\frac{TP}{TP+FP}\\ F1=\frac{2\times Pre\times SN}{Pre+SN}\end{array}$$
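A hedged sketch of computing these indicators from a confusion matrix, using scikit-learn on placeholder labels rather than study results, is given below:

```python
# Minimal metrics sketch: derive Acc, SN (recall), SP, Pre, and F1 from the
# confusion matrix; the labels are placeholders, not results from this study.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = (tp + tn) / (tp + fp + tn + fn)
sn  = tp / (tp + fn)          # sensitivity / recall
sp  = tn / (tn + fp)          # specificity
pre = tp / (tp + fp)          # precision
f1  = 2 * pre * sn / (pre + sn)
print(f"Acc={acc:.2f} SN={sn:.2f} SP={sp:.2f} Pre={pre:.2f} F1={f1:.2f}")
```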