Nonnegative Matrix Factorization Model-Based Construction For Molecular Clustering and Prognostic Assessment of Squamous Carcinoma of Head and Neck


 Purpose: Exploring nonnegative matrix factorization (NMF) model-based clustering and prognostic modeling of head and neck squamous carcinoma (HNSCC). Methods: The transcriptome microarray data of HNSCC samples were downloaded from The Cancer Genome Atlas (TCGA) and Shanghai Ninth People’s Hospital, and NMF clustering was constructed using the R software package. Relevant prognostic models were developed based on clustering. Results: Based on NMF, all samples were divided into 2 subgroups. Predictive models were constructed by analysing the differential gene between the two subgroups. Results of survival analysis in the current study revealed that the high-risk group had a poor prognosis. Further, results of multi-factor Cox regression analysis revealed that the predictive model was an independent predictor of prognosis. Conclusion: It was evident that the NMF-based prognostic model is a useful guide to the prognostic assessment of HNSCC.


Introduction
Squamous cell carcinoma of the head and neck (HNSCC) is one of the most common malignant tumours in the world. It accounts for more than 90% of all malignant tumours of the head and neck and thus poses a serious health risk to human beings [1] . Treatment options for HNSCC are mainly based on TNM staging and a combination of surgical-based therapies (radiotherapy, chemotherapy and biotherapy) [2] .
Although, the majority of patients with HNSCC are presented with locally advanced disease with signi cant lymph node metastases their outcome has improved due to advances in multi-disciplinary treatment. However, the mortality rate is still above 55% and there is 40 to 60% recurrence and metastasis rate [2][3][4] . Therefore, accurate prediction of the prognosis of patients with HNSCC is an important clinical guide.
Clustering is based on the principle that genes with similar expression patterns have similar or related functions. It is one of the important methods for processing gene expression data [5] . Clustering is divided into one-way clustering and two-way clustering. One-way clustering is whereby only rows or columns are clustered and its results are more in uenced by unrelated columns or rows. The commonly used one-way clustering algorithms include systematic clustering, self-organizing mapping clustering and principal component clustering. Two-way clustering is whereby the optimal set of sub-matrices are found in a matrix where the rows and columns are all signi cantly correlated. Furthermore, it allows overlap between classes, which is signi cant for gene chip data. Usually, a gene is not involved in only one biological process, but also each sample has multiple biological processes at the same time. Bidirectional clustering is thus more suitable for processing gene expression data. Nonnegative matrix factor-ization is a two-way clustering process [6] . Nonnegative matrix factorization has 3 main advantages over other standard decomposition methods, namely, no parameters, good interpretability and good numerical results [6] . Nonnegative matrix factorization has been widely used for cancer classi cation based on gene expression pro le data [7] .

Consensus Cluster of HNSCC samples based on the NMF model
Nonnegative matrix factorization cluster was constructed using the Consensus Cluster Plus package [9] .
Nonnegative matrix factorization hierarchical clustering is performed using the adjusted and uni ed dataset, the number of clusters k values are taken from 2 to 8. The value with better clustering stability is selected according to the clustering effect [10] . In the present study Kaplan-Meier survival analysis was performed based on the results of NMF classi cation. To determine whether there is a signi cant difference in the survival prognosis of different groups of patients the different immune in ltration levels of each immune cell between the two groups were analyzed using the vioplot package in R software.

Construction of the prognostic model
For differential gene (DEGs) expression analysis was conducted using edgeR, where the threshold values were set to the absolute value of logFC greater than 1 and FDR less than 0.05 to screen for DEGs. DEG ssigni cantly associated with overall survival (OS) in patients with HNSCC were screened using univariate Cox regression analysis. Lasso regression analysis was then used to eliminate collinearity between the genes. They were then included in a multifactorial Cox regression analysis model analysis for further screening. The nal genes obtained were identi ed as predictive model component genes.
The prognostic signature was used as the risk score = Where, n, expi and βi, represent the number of prognostic genes, the expression value and the coe cient of gene i, respectively. The risk score was calculated for each patient according to the formula, and the median of the scores was the cut-off value, whereas all the patients were divided into high-risk and lowrisk groups using the cut-off value. Kaplan-Meier method was used to plot the overall survival curves for the different groups of patients and log-rank test was also performed.
The ROC curves and calibration plots were plotted to assess the predictive ability of the proposed model.
The earlier described risk calculation formula was also used to calculate the risk score for each patient in validation group. The ROC and calibration plots were also used to validate the predictive ability of the model. The clinical specimens of 80 patients with HNSCC were collected during surgeries. The specimens were immersed into the RNA later solution (Invitrogen, USA) immediately and then used for RNA extraction or stored at -80 • C.
Total RNA was extracted from the fresh tissues using TRIzol reagent (Invitrogen) and cDNA was synthesized from 10 µg of total mRNA by using High-Capacity cDNA Reverse Transcription Kit (Applied Biosystems) according to the manufacturers' instructions. FastStart Universal SYBR Green Master Mix (Roche) and QuantStudio™ 6 Flex (Applied Biosystems) were used to perform qRT-PCR. Primers used for the qRT-PCR were as showed in Table S2. RNA-seq analysis was performed using the NovelBrain Cloud Analysis Platform, China. In brief, after total RNA was extracted, the cDNA libraries were constructed for each pooled RNA sample using the VAHTSTM Total RNAseq (H/M/R). The gene expression level was determined through the FPKM method.

Correlation analysis of model-independent prognostic and clinical characteristics
Univariate and multifactor Cox regression analyses of risk scores were performed to determine whether the model had independent prognostic value. If the risk score was signi cantly different from OS in both univariate and multivariate Cox analyses, the risk score was considered as an independent risk factor. Finally, DCA was introduced to prove the clinical validity of the model in the present study.

Immune in ltration analysis
The in ltrating score of 16 immune cells and the activity of 13 immune-related pathways were calculated using the single-sample gene set enrichment analysis (ssGSEA) method in the Gene Set Variation Analysis (GSVA) package of R software. The Benjamini&ndash;Hochberg (BH) method was used to adjust the p values. Expression analysis was conducted to determine the relationship of risk score and immune-related genes, including m6a, ferroptosisiron death, cellular autophagy, tumor mutation burden (TMB) and major histocompatibility complex (MHC). Two hundred ninety eight patients treated with immunotherapy from the IMvigor210 were included to form an independent validation cohort and to verify the robustness of the classi cation and ability to predict the response to immunotherapy.

Clustering based on the NMF model divides the samples into 2 subgroups
To reduce the effects of sample multicenter sources and batch processing, ComBat-data was corrected using the R package [11] . Comprehensive judgment of clustering stability [12,13] , stability was found to be better when k=2, thus k=2 was used for the judgement (Fig. 1a). Based on the survival curves and the logrank test results (P<0.05), it was found that the prognosis of the 2 subgroups was signi cantly different ( Fig. 1b-c). Further, it was found that the immune ne in ltration between the two subgroups was statistically signi cant different from each other (Fig. S1).

Construction of prognostic models
The above classi cation con rms a difference in prognosis between the two clusters and hence it warranted for further study. Therefore, the DEGs between the two clusters were subjected to univariate Cox analysis to obtain DEGs associated with prognosis, and the LASSO was also used to further screen the 13 associated genes (Fig. 1d). These genes were subjected to multifactorial Cox analysis and 9 DEGs were obtained as well as the correlation coe cients ( Table 3). The prognostic model risk score was as follows: risk score= 0.78* expression level of HAUS6 -0.38* expression level of SCNN1D+ 0.76* expression level of S100A1 -1.64* expression level of TNFRSF4 -1.51* expression level of FBXO17+ 0.75* expression level of IRF9+ 0.65* expression level of IFI6 -0.41* expression level of PTGS2+ 0.32* expression level of MSC. The risk score of each patient was calculated based on the regression coe cients according to the prognostic model and the patients were divided into high-risk and low-risk groups by the median of the risk scores.
It was found that the survival of high-risk patients was signi cantly shorter compared with that of the patients in the low-risk group (Fig. 2a). The time-dependent ROC curves showed a 1-, 3-and 5-year time AUC of 0.852, 0.890 and 0.953, respectively which was in agreement with the calibration curves (Fig. 2bc). Applying the same prognostic score to the validation set, the Kaplan-Meier survival curves showed that patients with high risk scores had lower OS than those with low risk scores, and the OS was signi cantly different between the two groups of patients (Fig. 3a). The 1-and -5-year AUC values in the validation set ranged from 0.767 to 0.862, indicating that the model still had good predictive performance in the external validation set (Fig. 3b).
Results of PCR in the present study showed that 9 genes still differed in the validation group (Fig. 3c).
This was in line with the results of the TCGA cohort. Further, the results of Kaplan-Meier survival curve analysis for the 9 genes of OS were as shown in Fig. S2. The above evaluation results suggest that the risk score model has good sensitivity and speci city for HNSCC prognosis prediction. A multifactorial Cox regression analysis was performed combining clinical indicators of the patients (risk score, age, gender, stage and grade among others). The results of this study showed that risk score were associated with survival (Table 4). Second, decision curves were made to determine the clinical net bene t derived from the use of the model. Decision curve analysis demonstrated that the model was clinically useful (Fig. 3d).
In conclusion, the risk score can be used as a prognostic indicator independent of other clinical factors for the prognosis of HNSCC.

Immunogenesis and enrichment analysis
Analysis of the relationship between risk score and m6a, ferroptosis, cellular autophagy and other related genes revealed that the risk score was closely related to immune-related genes (Fig. S3). The TMB refers to the number of base mutations per million bases and is a marker for the e cacy of immune checkpoint inhibitors. The higher the TMB, the more neoantigens can be recognized by T cells at the end and the better the immunotherapy effect. Our analysis showed a negative correlation between risk score and TMB, which could also be the reason for the poor prognosis of high-risk patients (Fig. 4a). It was found that there was a higher probability of higher bene t for high-risk patients in immunotherapy treatment (complete response (CR), partial response (PR), no clinical bene t (progressive disease (PD) or Stable Disease (SD). This provides new options for the treatment of patients with subsequent tumors (Fig. 4b).
According to the results of GSEA, the high-risk group was enriched in dilated cardiomyopathy whereas the low-risk group was mainly associated with tumorigenesis development. The GSEA analysis partially explained the biological differences between the low-and high-risk groups at the genetic and pathway level (Fig. 4c). Enrichment and signaling pathway analysis were performed for DEGs and it was hence possible to understand the biological functions of DEGs. The DEGs were mainly enriched in cell cycle, DNA replication, catalytic activity, acting on DNA, chromosomal region, human papillomavirus infection, organelle ssion and condensed chromosome (Fig. 4d). We further compared the enrichment scores of 16 types of immune cells and the activity of 13 immune-related pathway between the low and high-risk groups by employing the ssGSEA. The high-risk subgroup generally had higher levels of in ltration of immune cells especially of T helper cells, Macrophages, regulatory T (Treg) cells and tumour-in ltrating lymphocytes (TILs) (Fig. 5a). Except for the MHC_class_I and type-1 IFN response pathway, the other 11 immune pathways showed lower activity in the high-risk group than in the low-risk group in the TCGA cohort (Fig. 5b).

Drug sensitivity analysis
The highest negative correlation score is for chrysin (-0.776). Chrysin is a drug which has a variety of biological activities, such as anti-tumor, anti-in ammatory, anti-bacterial, anti-anxiety and anti-oxidant pharmacological activities [14] , hence suggesting a possible therapeutic effect in HNSCC. The next highest scores was MS-275. Previous studies have shown that it has a selective killing effect on gastric adenocarcinoma cells) [15] , 1, 4-chrysenequinone (an Ahr-activator) and piperlongumine (inhibits tumor autophagy leading to reduced cell proliferation viability) (Table S1).

Discussion
Head and neck squamous carcinoma is one of the more common malignancies worldwide, with 550,000 new cases and about 380,000 deaths per year [16,17] . It is not only aggressive and lethal, but also causes serious facial deformities, speech, chewing and swallowing dysfunctions as well as psychosocial problems to patients. Although surgical radical techniques, repair and reconstruction techniques for HNSCC have become increasingly sophisticated, their 5-year survival rate has not improved signi cantly in the last 20 years. The rst research results on NMF algorithms was publication in 1999. In 2003 Kim rst used NMF for clustering of genes and identifying subsystems of functional cells [18] .
To better predict the prognosis of HNSCC, this study staged patient with HNSCC s into two subgroups based on the NMF model. It was found that there was signi cant differences in OS between the two subgroups, with patients in the subgroup with more abundant immune cell in ltration having better prognostic indicators. This suggests that associated immune cells can in uence the prognosis of patients with HNSCC. The enrichment analysis of differential genes between the two groups also revealed the existence of enrichment in immune-related functions. A prognostic risk model was constructed consisting of nine genes. The scores of patients were calculated based on the risk model and divided into two groups of high and low risk. Further, it was found that there was had a signi cant difference in the prognosis of patients in the two groups of high and low risk whereas the prognosis of high-risk patients was signi cantly lower than that of the low-risk patients. The ROC curve and calibration curve of the model also achieved remarkable results which revealed that the model has better discriminatory ability. The DCA also demonstrated that the reliability and accuracy of the prediction model was better than the other clinical indicators.
Tumour necrosis factor receptor superfamily, member4 (TNFRSF4) is one of the component genes of the model and is predominantly inducibly expressed in activated CD4+ and CD8+ cells [19] . Previous studies have shown that binding of TNFRSF4 to ligands promotes clonal proliferation of T cells, enhances T cell memory, proliferation, immune surveillance and killer cells, as well as prevents the development of immune tolerance [20] . In addition, TNFRSF4 expressing positive T cells can reduce suppressive factors in the tumor immune microenvironment and effectively inhibit tumor invasion and metastasis [21] . The distribution of TNFRSF4 expression in breast cancer, melanoma, and lymphoma has been discussed in the previous studies [22,23] , and targeted TNFRSF4 treatment can signi cantly play a role in anti-breast cancer and melanoma [24] . In the present study, TNFRSF4 was a protective factor for HNSCC which was in agreement with previous studies. Enrichment analysis of the differential genes revealed that the genes were mainly enriched in cell cycle, DNA replication, catalytic activity and acting on DNA. The GSEA analysis also revealed that the low-risk patients were predominantly enriched in immunode ciency and tumor-associated pathways. Further, the analysis of immune cell in ltration in both groups of patients revealed a positive correlation between the macrophages and risk scores. This phenomenon was associated with the fact that macrophages are essential in the HNSCCC microenvironment for the regulation of the in ammatory reactions [17] . It was hence suggested that the model may serve as a predictor of increased immune cell in ltration.
Several studies have reported that increase of macrophages is associated with poor cancer prognosis [25] . Further, the macrophage in ltration in the tumor microenvironment can promote tumor growth, angiogenesis, invasion and metastasis [26] . The present study also analyzed the correlation between risk scores and immune-related genes. It was revealed that the risk scores were strongly associated with m6arelated, iron death-related and autophagy-related genes. Recently, the development of tumor immunotherapy has rapidly progressed and become of a great research interest [27] . Immunesuppressants such as PD-1/PD-L1 suppressants have successfully emerged in the recent past [28] . The present study analyzed the responsiveness to PD-1/PD-L1 in both high and low risk groups of patients and it was evident that high risk patients responded to immunotherapy better than the low risk patients.

Conclusions
In conclusion, based NMF algorithm, this study screened for DEGs and constructed an associated prognostic model which could independently predict prognosis in patients with HNSCC. The predictive performance of the model was found to be stable and can assist in providing reference for individualized treatment of patients with HNSCC. Further, the genes in prognostic risk models also provided new therapeutic targets for the exploration of immunotherapy for HNSCC.

Declarations
Competing interests: no con icts of interest.