2.1 Dataset Sources and Data Pre-processing
RNA-sequencing data (IlluminaHiSeq_RNASeqV2; Level 3), miRNA-seq data (IlluminaHiSeq_miRNASeq; Level 3), DNA methylation data (HumanMethylation450; Level 3), copy number variation (CNV) data (Affymetrix SNP 6.0 array; Level 3), and the corresponding clinical information for HNSC were obtained from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/); data acquisition and use complied with TCGA guidelines and policies. Tumor samples shared across the multi-omics data were then selected by filtering samples according to the nomenclature of TCGA sample IDs.
The downloaded datasets required pre-processing and dimensionality reduction. Differentially expressed genes (DEGs) and miRNAs were identified by comparing expression between tumor and normal tissues using the edgeR and DESeq2 packages in R. The chi-square test and the Kruskal-Wallis test were used to reduce the number of genes and obtain DEGs from the HNSC CNV data. For the HNSC methylation data, the limma package in R was used to filter DEGs.
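As a minimal sketch of this screening step (the simulated count matrix and the |log2FC| > 1, FDR < 0.05 cutoffs are illustrative assumptions, not the study's exact script), DEG identification with edgeR could look as follows:

library(edgeR)

# Toy stand-in for the TCGA count matrix (genes x samples).
set.seed(1)
counts <- matrix(rnbinom(2000 * 12, mu = 100, size = 1), nrow = 2000,
                 dimnames = list(paste0("gene", 1:2000), paste0("s", 1:12)))
group <- factor(rep(c("normal", "tumor"), each = 6))

# Standard edgeR workflow: filter, normalize, estimate dispersion, test.
dge <- DGEList(counts = counts, group = group)
dge <- dge[filterByExpr(dge), , keep.lib.sizes = FALSE]
dge <- calcNormFactors(dge)
design <- model.matrix(~ group)
dge <- estimateDisp(dge, design)
fit <- glmQLFit(dge, design)
qlf <- glmQLFTest(fit, coef = 2)

# Assumed cutoffs for calling DEGs: |log2FC| > 1 and FDR < 0.05.
tab <- topTags(qlf, n = Inf)$table
degs <- rownames(tab[abs(tab$logFC) > 1 & tab$FDR < 0.05, ])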
Unless otherwise specified, all analyses in this paper were performed in R (version 4.0.1).
2.2 Machine learning
Least Absolute Shrinkage and Selection Operator
The Least Absolute Shrinkage and Selection Operator (LASSO) is a regression-based algorithm that permits a large number of covariates in the model and penalizes the absolute values of the regression coefficients26. It is a linear regression method with L1 regularization, which yields sparse models and thereby performs feature selection. LASSO regression is widely applied to dimensionality reduction and feature selection owing to its strong feature extraction and robust performance in cancer prognosis27. Formula 1 gives the penalized residual sum of squares minimized by the LASSO algorithm.
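Formula 1 is not reproduced in the extracted text; in the standard notation (outcome y_i, covariates x_{ij}, penalty weight \lambda), the objective it refers to is:

\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^{2} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\} \qquad (1)

Larger values of \lambda shrink more coefficients exactly to zero, which is what makes LASSO a feature selection method.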
Bayesian Network
A Bayesian Network (BN) is a multi-layered network of connections between clinical factors in a multi-omics data set that provides a multivariate mapping of complex data28. A BN is a directed acyclic graph whose nodes represent random variables (Figure 1), some observable and some unobservable. A BN is a probabilistic graphical model with a clear and transparent representation of the causal relationships between variables. Importantly, because a BN uses the posterior information of the data set itself, it protects against over-interpretation of the data. Survival predictions based on BN models have been developed for a number of tumors to improve prognostic estimates and to guide clinical decision making for appropriate treatment29, 30. BNs share several advantages with deep learning models. BNs with proper external validation could be useful as clinical decision support tools, providing clinicians and patients with information germane to the treatment of HNSC.
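As an illustration of how a BN can be learned and used for outcome prediction in R (the bnlearn package and the variable names below are assumptions for this sketch, not the study's actual pipeline):

library(bnlearn)

# Illustrative discrete data: two hypothetical features and a survival
# outcome; in the study the variables would be the selected omics features.
set.seed(1)
d <- data.frame(
  feature1 = factor(sample(c("low", "high"), 200, replace = TRUE)),
  feature2 = factor(sample(c("low", "high"), 200, replace = TRUE))
)
d$outcome <- factor(ifelse(d$feature1 == "high" & runif(200) > 0.3,
                           "dead", "alive"))

# Learn the DAG by hill-climbing, fit the conditional probability tables,
# then predict the outcome node by likelihood weighting.
dag    <- hc(d)
fitted <- bn.fit(dag, d)
pred   <- predict(fitted, node = "outcome", data = d, method = "bayes-lw")
table(pred, d$outcome)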
Decision Tree
The Decision Tree (DT) is a basic classification and regression method composed of nodes and directed edges; it represents a mapping between features and labels. DT learning recursively selects the optimal feature and partitions the training data according to that feature, so that each resulting subset is classified as well as possible.
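A compact illustration with the rpart package (using R's built-in iris data as a stand-in for the omics features and survival labels):

library(rpart)

# Recursive partitioning: each split selects the feature that best
# separates the classes in the current subset.
tree <- rpart(Species ~ ., data = iris, method = "class")
printcp(tree)                      # complexity table used for pruning
pred <- predict(tree, iris, type = "class")
mean(pred == iris$Species)         # training accuracy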
Generalized linear model
The Generalized Linear Model (GLM) is based on the exponential distribution family, whose prototype is given in Formula 2.
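Formula 2 is likewise not reproduced in the extracted text; the standard exponential-family form it refers to is:

p(y; \eta) = b(y) \exp\big( \eta^{\top} T(y) - a(\eta) \big) \qquad (2)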
where η is the natural parameter (it may be a vector), T(y) is the sufficient statistic, a(η) is the log-partition function, and b(y) is the base measure.
Random forest
Random forest (RF) is an ensemble learning method based on decision trees, and also an improvement over the bagging algorithm. The RF process is shown in Figure 2.
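A brief sketch with the randomForest package (iris again stands in for the study's data; ntree = 500 is an assumed default):

library(randomForest)

# Bagged trees with random feature subsets at each split; importance()
# returns the per-variable contribution of the kind this study used to
# weight the LASSO-selected features.
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
importance(rf)       # MeanDecreaseAccuracy / MeanDecreaseGini per feature
rf$err.rate[500, ]   # out-of-bag error after all 500 trees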
Neural Networks
A Neural Network (NN) is a two-stage regression or classification model: a complex network system formed by a large number of simple, widely interconnected processing units called neurons. It is a highly complex nonlinear dynamic learning system. The network diagram is shown in Figure 3.
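A minimal single-hidden-layer example with the nnet package (hyperparameters are illustrative assumptions):

library(nnet)

# Stage 1 forms hidden units from the inputs; stage 2 maps the hidden
# units to class probabilities.
set.seed(1)
nn <- nnet(Species ~ ., data = iris, size = 5, decay = 0.01,
           maxit = 200, trace = FALSE)
pred <- predict(nn, iris, type = "class")
mean(pred == iris$Species)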
Support Vector Machine
The Support Vector Machine (SVM) is a generalized linear classifier that performs binary classification in a supervised learning manner; its decision boundary is the maximum-margin hyperplane of the training samples. Through the kernel method, SVM can also perform non-linear classification, and it is a classifier with both sparsity and robustness.
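A short example with the e1071 package (the radial kernel is assumed here to show the non-linear case):

library(e1071)

# The RBF kernel implicitly maps the inputs to a higher-dimensional
# space, where the maximum-margin hyperplane is found.
set.seed(1)
svm_fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
pred <- predict(svm_fit, iris)
table(pred, iris$Species)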
Evaluation index of performance: AUC
Since accuracy alone cannot fully evaluate the performance of the models, this study considered another evaluation indicator, the AUC. AUC (Area Under the ROC Curve) is a performance indicator for comparing machine learning models; its value is the area under the ROC curve. The definition of the AUC is given below:
The ROC curve is drawn from two variables: the abscissa is 1 − specificity, and the ordinate is sensitivity.
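The formulas for the two axes are not reproduced in the extracted text; in standard form,

\text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP},

and the AUC is the area under the curve traced by (1 − specificity, sensitivity) as the classification threshold varies, ranging from 0.5 (random guessing) to 1 (perfect discrimination).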
Here, TP denotes the number of actual positive samples predicted as positive, TN the number of actual negative samples predicted as negative, FP the number of actual negative samples predicted as positive, and FN the number of actual positive samples predicted as negative.
2.3 Survival prediction process
By preprocessing the downloaded TCGA clinical and omics data, 490 HNSC samples shared across the multi-omics data were obtained. Likewise, DEGs were obtained separately from each single-omics dataset through preprocessing.
After data pre-processing, the LASSO algorithm was used to select variables important for the survival outcome of HNSC from the mRNA, miRNA, DNA methylation, and CNV data. Random forest was used to calculate the importance ratio of each selected variable. The four single-omics datasets were then integrated, and six machine learning models (BN, RF, NN, GLM, DT, and SVM) were applied to predict the survival outcome of HNSC. Likewise, each single-omics dataset was used as model input to predict survival outcomes. The 490 HNSC samples were randomly split, with 2/3 used as the training set and 1/3 as the test set, and every model was run 10 times.
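A minimal sketch of the split-train-evaluate loop (the simulated data, the use of a LASSO logistic model as the example classifier, and the rank-based AUC computation are all assumptions for illustration, not the study's exact code):

library(glmnet)

# Stand-in for the 490 HNSC samples: x holds integrated omics features,
# y the binary survival outcome.
set.seed(1)
x <- matrix(rnorm(490 * 50), nrow = 490)
y <- factor(rbinom(490, 1, plogis(x[, 1] - x[, 2])))

aucs <- replicate(10, {
  train <- sample(490, size = round(2/3 * 490))   # 2/3 train, 1/3 test
  cvfit <- cv.glmnet(x[train, ], y[train], family = "binomial")
  prob  <- predict(cvfit, x[-train, ], s = "lambda.min", type = "response")
  # Rank-based (Mann-Whitney) AUC on the held-out third.
  pos <- prob[y[-train] == "1"]
  neg <- prob[y[-train] == "0"]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
})
mean(aucs)   # average AUC over the 10 runs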
The test results were measured and compared using the performance indicators to determine which machine learning algorithms were effective and which omics were the most accurate for predicting HNSC survival. The flowchart of the main study process is presented in Figure 4.
2.4 In vitro experiments
Cell lines and culture
A normal human immortalized keratinocyte (HaCaT) cell line and three HNSC cell lines (Cal-27, SCC-9, and FaDu) were used in the present study. All cell lines were obtained from the Cell Bank of the Chinese Academy of Sciences. HaCaT, Cal-27, and FaDu cells were cultured in Dulbecco's Modified Eagle Medium (DMEM), and SCC-9 cells were cultured in Dulbecco's Modified Eagle Medium/Nutrient Mixture F-12 (DMEM/F12), in 5% CO2 at 37°C. All media were supplemented with 10% fetal bovine serum (FBS) and 1% penicillin-streptomycin. All cell culture reagents were purchased from Gibco (Thermo Fisher Scientific).
Quantitative Real-time PCR (qPCR) assay
Cells were seeded at a density of 10^5 cells per well in a 6-well plate and cultured overnight. Total RNA was extracted from cultured cells using TRIzol reagent (Invitrogen). Complementary DNA (cDNA) was synthesized using the Transcriptor First Strand cDNA Synthesis Kit (Roche), in accordance with the manufacturer's instructions. Quantitative reverse-transcription PCR was performed with FastStart Essential DNA Green Master (Roche) and specific primer sequences (Table 1). Relative mRNA expression was quantified by the comparative Ct (ΔCt) method and normalized to the internal control gene, ACTB.
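For reference, the comparative Ct quantification used here takes the standard form

\Delta C_t = C_t^{\text{target}} - C_t^{\text{ACTB}}, \qquad \text{relative expression} = 2^{-\Delta C_t},

so lower ΔCt values correspond to higher relative expression of the target gene.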
Table 1
Primer sequences

Gene   | Primer forward (5'→3')   | Primer reverse (5'→3')
AQP5   | GCCACCTTGTCGGAATCTACT    | CCTTTGATGATGGCCACACG
ACTN3  | GCCCGATCGAGATGATGATGG    | GGCAGTGAAGGTTTTCCGCT
TAC1   | GGGACTGTCCGTCGCAAAAT     | ACAGGGCCACTTGTTTTTCA
ZFR2   | ATGGCTACCTACCAGGACAGT    | GTATCCCGAGGACAAGGTGC
MMP11  | GATCGACTTCGCCAGGTACT     | CAGTGGGTAGCGAAAGGTGT
ACTB   | TCACCATGGATGATGATATCGC   | ATAGGAATCCTTCTGACCCATGC