A 8-gene-methylation-based Signature for Head and Neck Squamous Cell Carcinoma Prognosis

Background Head and neck squamous cell carcinoma (HNSCC), accounting for 6% of systemic malignant tumors, has an increasing incidence rate year by year worldwide. A large number of studies have investigated the tumor markers to determine clinical stage, prognosis, treatment evaluation, predict relapses, and overall survival of HNSCC patients, with controversial results. Methods In this paper, we comprehensively analyzed gene expression and DNA methylation data sets of HNSCC and adjacent non-tumor samples from The Cancer Genome Atlas (TCGA). Univariate cox regression analysis followed by sure independence screening (SIS) were used for identifying differential methylation signatures to stratify patients with signicantly different prognosis. Results We identied methylation levels of HS3ST1 (cid:0) TOMM34 (cid:0) RPL26L1 (cid:0) MTHFD2 (cid:0) ORC1 (cid:0) MYOSLID (cid:0) UHRF1 and AL357033.3 as potential HNSCC prognosis signatures, and veried the correlation between their gene expression and the corresponding methylation. Their reliability for predicting the prognosis of HNSCC was conrmed in an independent dataset. Conclusions In conclusion, we built a 8-gene-methylation-based signature which can well assess the prognosis of HNSCC patients. we performed a single factor survival analysis of the differential genes. The signicantly differential genes that correlated with prognosis were screened using Kaplan-Meier method in survival R package. Then, the stable and reliable characteristic genes were identied by using the sure independence screening (SIS) with LASSO Cox penalty regression in SIS package. The prognostic prediction model was constructed based on these characteristic genes to assess overall survival of HNSCC samples, and the nal predictive model was consisted of the expression levels of the characteristic genes and their multivariate analysis weights.


Introduction
The neck squamous cell carcinoma (HNSCC) is the most common pathological type of head and neck tumors, which develops in the epithelial layer of the mouth, pharynx and larynx, ranking sixth in systemic malignant tumors (1,2). Each year, 600,000 HNSCC patients are newly added in the world (3). Due to the hidden anatomical location, di culty in early detection and strong invasiveness, the overall prognosis of HNSCC is poor (4). Many factors, including smoking, alcohol consumption, viral infection (HPV), are closely associated with increased risk of HNSCC developing (3,5,6). However, only a small part of the above population will eventually develop head and neck tumors, suggesting that the critical position of genetic factors (7).
Over the past three decades, despite the tremendous advances in treatments such as surgery, minimally invasive surgery, precision radiotherapy, chemotherapy, and monoclonal antibody therapy, 25-50% of patients will relapse within 3-5 years after the rst treatment (8, 9). Thus, early diagnosis is important for improving the survival rate of patients (10). The in-depth studies of biomarkers have provided possibilities for early diagnosis, grading, recurrence and prognosis assessment of HNSCC. Recently, several tumor biomarkers with clinical application have been reported, including matrix metalloproteinases (MMPs) (11,12), vascular endothelial growth factor (VEGF) (13), interleukin (IL-6 and IL-8) (14).
DNA methylation is the most common epigenetic regulator that affects gene expression and chromatin stability without altering genomic sequences. Carvalho's group found that DNA methylation speci city was as high as 90% in the serum and saliva samples from amounts of patients with HNSCC (15).
Undoubtedly, DNA methylation plays an important role in the occurrence and development of HNSCC. Therefore, we performed comprehensively analyses of gene expression and DNA methylation data to identify key genes affecting the prognosis of HNSCC, and constructed a prognostic prediction model based on these characteristic genes (Fig. 1). Our study should be useful for screening potential therapeutic targets and the development of effective treatment methods of HNSCC.

Data source
Datasets on DNA methylation and mRNA expression pro les of HNSCC were obtained from the Cancer Genome Atlas (TCGA, https://tcgadata. nci. nih. gov/tcga/). The datasets of TCGA-HNSC Project, which including the information of 527 HNSCC cases, were selected for further analysis. Among them, data of 493 cases with complete information of mRNA expression data (HTSeq-Counts format), DNA methylation data (Platform: Illumina Human Methylation 450 (HM450) arrays) and clinical data was downloaded.

Preprocessing of DNA methylation and mRNA expression data
The mRNA expression data and DNA methylation level data were obtained from the TCGA database, in which mRNA was raw counts data and DNA methylation levels were expressed as Beta Value. From the 493 patients, 246 cases were randomly selected as training sets to construct a prognostic prediction model based on gene expression levels, and the other 247 cases were used as test sets to verify the reliability of this model.

Gene differentially expressed analysis.
Forty-three cases with expression data of both tumor and paracancerous normal tissues were selected from the gene expression level matrix of preprocessed samples for screening the differentially expressed genes in HNSCC samples. The differentially expressed genes were screened by using the DESeq2 package in R software using |log2 Fold Change|>4 and (Benjamini & Hochberg corrected P value) BH_p < 0.01 as a threshold.

Construction of prognostic prediction model.
Combining the clinical data of 246 patients in the training set, we performed a single factor survival analysis of the differential genes. The signi cantly differential genes that correlated with prognosis were screened using Kaplan-Meier method in survival R package. Then, the stable and reliable characteristic genes were identi ed by using the sure independence screening (SIS) with LASSO Cox penalty regression in SIS package. The prognostic prediction model was constructed based on these characteristic genes to assess overall survival of HNSCC samples, and the nal predictive model was consisted of the expression levels of the characteristic genes and their multivariate analysis weights.
2.5 Veri cation of prognostic prediction model.
The scores of each sample in the training set and test set were calculated separately based on the model obtained from the training set. The samples in the training set were divided into two groups based on the scores. The survival curves were performed by the Kaplan-Meier method and comparison between each group were examined by log rank-test. The samples in training set were divided into two groups by selecting an appropriate grouping score threshold, while the samples in test set were divided into two groups with the same score threshold. The difference of overall survival between the two groups were examined by log-rank test. In addition, we performed correlation analysis and single factor survival analysis on the expression data and their corresponding methylation data of the characteristic genes in the model.

Clinical data characteristics of the HNSCC sample.
The clinical data of 493 HNSCC patients were shown in Table 1, including age, gender, ethnicity, and TNM staging. Then a chi-square test was performed for each indicator of the training set and test set, indicating that these two groups were independent of age, gender, race and cancer stage. Thus, the clinical data characteristics had no effect on the construction and veri cation of the prognostic prediction model.

Differentially expressed genes
The gene expression levels of HNSCC tumor samples and normal samples were analyzed by using the Rbased DESeq2 package analyzes. As shown in Fig. 2A, 3682 differentially expressed genes were obtained with a standard of |log2 Fold Change|>4 and BH_p < 0.01. Among these differentially expressed genes, 2928 genes were up-regulated, while 754 genes were down-regulated in HNSCC tumor samples.

SIS model
A total of 983 differentially expressed genes related to overall patient survival were obtained by using the single factor survival analysis based on Kaplan-Meier method, and 8 genes were identi ed as stable and

Veri cation of SIS model
The highest chi-square value of training set obtained from log-rank test for Kaplan-Meier survival was used to divide the samples in training set into high-risk and low-risk groups (HR = 3.34, P value = 7.14×10 − 11 ). As shown in Fig. 3, the higher scores obtained by the model were signi cantly correlated with the shorter survival rates in the training group. The samples in test set were divided into two groups with the same scoring threshold of training set. Also, the risk of death in the high-risk group was 1.76 times higher than in the low-risk group (P value = 4.01×10 − 3 ). This result indicated that the SIS model we constructed can be used for predicting the prognosis of HNSCC and had high reliability.
Furthermore, A total of 246 samples in the training set had gene expression data and DNA methylation level data, and 4 of the 8 characteristic genes in the SIS model had methylation data of CpG sites in the promoter region, which were HS3ST1, ORC1, TOMM34 and UHRF1. Thus, we analyzed the correlation between the expression levels of these 4 genes and the methylation levels of their corresponding CpG sites. As shown in Fig. 4, there was no signi cant correlation between the methylation and gene expression levels of HS3ST1, ORC1 and TOMM34, however, the methylation level of UHRF1 was signi cantly positively correlated with gene expression level. In addition, we further analyzed the relationship between the expression of these 4 genes and the overall survival of the patients, and the survival curves showed signi cant differences between high-risk and low-risk groups (HR = 1.67, P value = 5.58 × 10 − 3 ) according to the differently expression of TOMM34.
In this experiment, the correlation between the expression and corresponding methylation of characteristic genes was not consistent with the expected, but in general, the prognostic model we constructed had certain value for the prognosis of HNSCC.

Discussion
The development of cancer often involved complicated regulatory networks, and many studies have shown that DNA methylation in malignant tumor cells is abnormal. Chang et al found that the degree of methylation of p15 and p16 in body uids and tumor tissues of HNSCC patients were signi cantly higher than these of healthy group (16). The methylation level of p15 in HNSCC tissues of patients with longterm smoking or drinking was higher than non-smoking or non-drinking patients, indicating that the abnormal DNA methylation is signi cantly correlated with clinical high-risk factors of HNSCC (17).
Besides, Sanchez-Cespedes's group have detected the methylation level of p16,6-oxomethylguanine-DNA methyltransferase, a death-related protein kinase in HNSCC tissues, suggesting that DNA methylation existed in all pathological stages of tumor tissues (Sanchez-Cespedes, 2000). In this study, we developed a prognostic model for HNSCC that analyzed the transcriptomes and DNA methylation data of 493 HNSCC samples. The results showed that prognostic signature was signi cantly associated with overall survival in HNSCC patients, and patients with higher model scores tended to have poorer survival. 8 genes were identi ed as stable and reliable characteristic genes in SIS model, including HS3ST1, TOMM34, RPL26L1, MTHFD2, ORC1, MYOSLID, UHRF1, and AL357033.3.
It was found that the methylation level of UHRF1 showed a signi cant related with the gene expression level, and the expression level of the gene TOMM34 was signi cant different between high-risk and lowrisk groups. In the present study, it had been reported that Ubiquitin-like PHD and Ring Finger domain 1 (UHRF1) genes were newly discovered nuclear protein genes closely related to cell growth, and it played an important role in regulating biological processes such as DNA damage repair, cell proliferation, cell cycle, and apoptosis (18,19). Furthermore, UHRF1 is also an important epigenetic regulator maintaining DNA methylation and histone code in the cell, involving in the regulation of tumorigenesis and progression (20,21). Several studies have suggested UHRF1 as a potential universal biomarker for cancers (22)(23)(24)(25). Translocase of the outer mitochondrial membrane 34 (TOMM34) gene transcript as one of the tops differentially expressed gene, have been proven to be associated with features of aggressive behavior including higher tumor grade, advanced nodal stage, larger tumor size and lymphovascular invasion (26). The prognostic value of TOMM34 has been reported in colorectal cancer (27,28).
Moreover, folate metabolism was central to cell proliferation and a target of commonly used cancer chemotherapeutics. In particular, the mitochondrial folate-coupled metabolism was thought to be important for proliferating cancer cells (29). The mitochondrial enzymes bifunctional methylenetetrahydrofolate dehydrogenase / cyclohydrolase (MTHFD2) in this pathway was highly expressed in human tumors, and broadly required in survival of cancer cells. Philip M et al revealed that overexpression of MTHFD2 was associated with both high proliferation rates and c-MYC overexpression (30). It has been reported that the overexpression of MTHFD2 was associated with poor prognosis of breast cancer patients and with an increased rate of invasion and metastasis (31). As MTHFD2 is over expressed in rapidly replicating tumor cells but not in adult tissue, it is suitable as a therapeutic target for selective cancer treatment (32).
In summary, the above reports on the functions of our characteristic genes were consistent with our nding, however, some genes were shown to be related with the cardiovascular disease, such as HS3ST1 and MYOSLID, and other genes had been less studied. Thus, we would verify the relationship between these genes and HNSCC as well as the role of genes in the signaling pathway in the future study. In addition, we should expand the clinical data in further investigate to better assess the survival of HNSCC and obtain the prognostic model.

Conclusion
In conclusion, we obtained an 8-genes-based prognostic model of HNSCC by comprehensive bioinformatics analysis of DNA methylation and gene expression data, and veri ed its prognostic value of patients with HNSCC. This study should be helpful for clinical treatment and experimental research of HNSCC.

Declarations
Funding: No funding was received for the creation of this article.