Squamous cell carcinoma of the head and neck (HNSCC) is one of the most common malignant tumours in the world. It accounts for more than 90% of all malignant tumours of the head and neck and thus poses a serious health risk to human beings[1]. Treatment options for HNSCC are mainly based on TNM staging and a combination of surgical-based therapies (radiotherapy, chemotherapy and biotherapy)[2]. Although, the majority of patients with HNSCC are presented with locally advanced disease with significant lymph node metastases their outcome has improved due to advances in multi-disciplinary treatment. However, the mortality rate is still above 55% and there is 40 to 60% recurrence and metastasis rate[2–4]. Therefore, accurate prediction of the prognosis of patients with HNSCC is an important clinical guide.
Clustering is based on the principle that genes with similar expression patterns have similar or related functions. It is one of the important methods for processing gene expression data[5]. Clustering is divided into one-way clustering and two-way clustering. One-way clustering is whereby only rows or columns are clustered and its results are more influenced by unrelated columns or rows. The commonly used one-way clustering algorithms include systematic clustering, self-organizing mapping clustering and principal component clustering. Two-way clustering is whereby the optimal set of sub-matrices are found in a matrix where the rows and columns are all significantly correlated. Furthermore, it allows overlap between classes, which is significant for gene chip data. Usually, a gene is not involved in only one biological process, but also each sample has multiple biological processes at the same time. Bidirectional clustering is thus more suitable for processing gene expression data. Nonnegative matrix factor-ization is a two-way clustering process[6]. Nonnegative matrix factorization has 3 main advantages over other standard decomposition methods, namely, no parameters, good interpretability and good numerical results [6]. Nonnegative matrix factorization has been widely used for cancer classification based on gene expression profile data[7].
The present study performed molecular clustering and prognostic modeling of HNSCC samples from TCGA database and validation group (collected by the Department of Oral and Maxillofacial Head and Neck Oncology, Shanghai Ninth People’s Hospital) based on NMF. This was to appropriately classify the patients with HNSCC for treatment selection and prognosis predictions.
1.1 Data acquisition
The present study obtained RNA sequencing (RNA-seq) data of 502 HNSCC patients, 44 normal human head and neck samples as well as their corresponding clinical features from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/ repository). Fresh HNSCC and normal tissues from 80 human patients were postoperatively collected from April 2010 to October 2016. In this procedure, frozen sections of tissue from the surgical margins were examined after completion of the extended resection. If the pathology is positive, continue to extend the excision until all margins are negative and remove normal tissue from the surrounding area.. A pathologist from the Department of Pathology of Shanghai Ninth People's Hospital diagnosed the patients in this study. Clinical data of the patients was as shown in the Tables 1 and 2. This study was approved by the Human Research Ethics Committee of the Ninth People’s Hospital, Shanghai JiaoTong University School of Medicine (Shanghai, China). The informed consent was not available because this study was retrospective in nature.
1.2 Consensus Cluster of HNSCC samples based on the NMF model
Nonnegative matrix factorization cluster was constructed using the Consensus Cluster Plus package[9]. Nonnegative matrix factorization hierarchical clustering is performed using the adjusted and unified dataset, the number of clusters k values are taken from 2 to 8. The value with better clustering stability is selected according to the clustering effect [10]. In the present study Kaplan-Meier survival analysis was performed based on the results of NMF classification. To determine whether there is a significant difference in the survival prognosis of different groups of patients the different immune infiltration levels of each immune cell between the two groups were analyzed using the vioplot package in R software.
1.3 Construction of the prognostic model
For differential gene (DEGs) expression analysis was conducted using edgeR, where the threshold values were set to the absolute value of logFC greater than 1 and FDR less than 0.05 to screen for DEGs. DEG ssignificantly associated with overall survival (OS) in patients with HNSCC were screened using univariate Cox regression analysis. Lasso regression analysis was then used to eliminate collinearity between the genes. They were then included in a multifactorial Cox regression analysis model analysis for further screening. The final genes obtained were identified as predictive model component genes.
The prognostic signature was used as the risk score = ![](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAIIAAAAgCAYAAADe16AYAAAFZUlEQVR4Ae3XC1HsTBAFYCygAQt4QMLVgAUc4AAHKEDBNYABHOAhf31bdaDpm8fssjz+qnRVKtlJP0+f6cleTLvsCEzTdLGjsCMAgZ0IOw8OCOxE2ImwE2HnwDsCh4nw/Pw8XVxcHH29u9mf/u8IvB0NNzc3ByJcXl5OLy8vs3W9vr5Ot7e3b4SpSnd3d4f16+vrw/LDw8PEl+vp6amq7s+/EIE3Imh+psKfP38WU0UGzaUbeXx8nFwabv3+/n5CBHJ1dTUh2S7fg4A+2qzpkbt+bMl7N6fpYBAy/P37d9E2U6EraD77kMB7RKC/y9cj4IjXeFM5Uz092SLDByLY7Rqnme5Lkp3f35sk1Y6/Toxus/8+HwIIAH+4V9GDran8gQiM0+RTGiiJuvsdF/xg6i5fi0Cwdu9yEhE4qR+OnV09SH7nG6MmkrMqOvv96xAwjTW89ysbu27QuSz+mQiU6t/JLQdxGkbW3Y9QLsmtfYDGx34/HYG5XQ93U9p3Q74ZliLMEoFy/g4KsPbhGMf0BawSciBDJUjVOeezGJlC8pZPnVBiASQTz5ladxBdNvkLTN9HFjD5y0ewOPHh3blqgyG/x4r+yC8fhGpSi9xGsV8kAmdAEaACc2yS36Uf0pk8mp3dIP8qASYjM82NPX0NId4F3IBKD9n4RwD63p0q/Cdenvm2lvUt3+zkgRBsPeudGkflI0rNKuBw+pslTe3HT3bt0lgEWIjhOaSYq1WzbYjenMTQgFMFzvxqKH/wVsuoT7ryj756Mxn5HpFVIgBGgFFnIwG/QkeT5FkbDhTra7sV6N4Dfq1GvvinG7BTR4jQjwdNHTlS4weZ+RfnmJ3Mnp2rC+KqrecMJ/lVvBaJoLAwswdY+33MSFvzM/quTwPFaWpA6A2qfjNS1yYB/cSY04OR5lWRg7U5/arnWZPsaISibyf7LbfewG7rd2Kx65L6OiHhI7+KzccKiqcAOZJMMTuQZy6pqjPyrEB+arJzdilWYS47I6Bu5Z6RujYNxEyMnkuaAKsq1tm4bwm9HDeeg11dX/OxRlK+YNKJ4Df/VWaJQGnOQTVcehZkBIAle80TPzutF9HtTs0VSCFCwO++8zvjP79zT+xKJM/W61r0t+4IIdYxkhzmcEJQfaz9oO/q+v8QgQLjsHQ0KYUr4thCun/2fIXpPeGuvwQEQvHVp4LfGg8kz+5z52uNAw9XFeAia7c13uki2XeIGntu4qaPvR8hfz+2PlQHGIUFpKVCgCB4bZI1th2YJR9b6ymkxpiziZ7mpulIJI++K63JWwMz5hE+tfDVNwA9712xoQej6qfmRrcDXd+f8zm59frlNpdf8EotyeUDEcKWrhTl3MP6OnK8w74+ZjPuvFu65uIlYfctSeMDijq6HZJ4jww1XgjsXQUzMVOrOiqR5nTZ8M1Xjx9/57wnlnpdGi+2u/x6f8ROPT2PNyJkFBu1ayJ4AnY9SfSdQB8oa1d2cvVH/7sArXH7M0DlMSpLQI/aH6Nn4slN70YFYfpxwfZQIeakuRyPXjV4Gqfx55D4c/9JMQUcA6OyBPSo/TF6Odbmdv6SH32e2+wHIkh+tPlVrwbLR1td8/wdR0OPea7fOU7gMyqIo2by1STOd8pobgijfyaI2uqmHZ95G9HqTqiMO/VoOGXsbaR49Oscl2nsiAM7ToPg0T9WR+yP0dHUuTG/5EMv2KRXX0KEjCkgfAaATJA6eRT7GZ9LwGytZ8ppbv/2WbIFLrsK8pLuZ9ZzdNZNN+IPudXSj5OzTQRJSG7uw28kwegAkJ9+9cSjv9/Pg8BZiXCelHYvP4HAToSfQP0XxtyJ8Aub8hMp/QdhKr7TZGZiYAAAAABJRU5ErkJggg==)
Where, n, expi and βi, represent the number of prognostic genes, the expression value and the coefficient of gene i, respectively. The risk score was calculated for each patient according to the formula, and the median of the scores was the cut-off value, whereas all the patients were divided into high-risk and low-risk groups using the cut-off value. Kaplan-Meier method was used to plot the overall survival curves for the different groups of patients and log-rank test was also performed.
The ROC curves and calibration plots were plotted to assess the predictive ability of the proposed model. The earlier described risk calculation formula was also used to calculate the risk score for each patient in validation group. The ROC and calibration plots were also used to validate the predictive ability of the model. The clinical specimens of 80 patients with HNSCC were collected during surgeries. The specimens were immersed into the RNA later solution (Invitrogen, USA) immediately and then used for RNA extraction or stored at -80 ◦C.
Total RNA was extracted from the fresh tissues using TRIzol reagent (Invitrogen) and cDNA was synthesized from 10 µg of total mRNA by using High-Capacity cDNA Reverse Transcription Kit (Applied Biosystems) according to the manufacturers’ instructions. FastStart Universal SYBR Green Master Mix (Roche) and QuantStudio™ 6 Flex (Applied Biosystems) were used to perform qRT-PCR. Primers used for the qRT-PCR were as showed in Table S2. RNA-seq analysis was performed using the NovelBrain Cloud Analysis Platform, China. In brief, after total RNA was extracted, the cDNA libraries were constructed for each pooled RNA sample using the VAHTSTM Total RNAseq (H/M/R). The gene expression level was determined through the FPKM method.
1.4 Correlation analysis of model-independent prognostic and clinical characteristics
Univariate and multifactor Cox regression analyses of risk scores were performed to determine whether the model had independent prognostic value. If the risk score was significantly different from OS in both univariate and multivariate Cox analyses, the risk score was considered as an independent risk factor. Finally, DCA was introduced to prove the clinical validity of the model in the present study.
1.5 Immune infiltration analysis
The infiltrating score of 16 immune cells and the activity of 13 immune-related pathways were calculated using the single-sample gene set enrichment analysis (ssGSEA) method in the Gene Set Variation Analysis (GSVA) package of R software. The Benjamini–Hochberg (BH) method was used to adjust the p values. Expression analysis was conducted to determine the relationship of risk score and immune-related genes, including m6a, ferroptosisiron death, cellular autophagy, tumor mutation burden (TMB) and major histocompatibility complex (MHC). Two hundred ninety eight patients treated with immunotherapy from the IMvigor210 were included to form an independent validation cohort and to verify the robustness of the classification and ability to predict the response to immunotherapy.