Molecular Subtypes Based on DNA Methylation Proles can Predict the Prognosis of Patients with Laryngeal Cancer

Background: Laryngeal cancer is a common malignant tumor, with the highest mortality rate and lymph node metastasis rate among cancers. Although laryngeal cancer subtypes have been dened for clinical relevance, there are few studies on the molecular characteristics of laryngeal cancer subtypes. Methods: DNA methylation proles of laryngeal cancer were downloaded from The Cancer Genome Atlas database. The univariate Cox regression analysis and multivariate Cox regression analysis was employed to screen the prognosis-related DNA methylation site. Meanwhile, we performed unsupervised clustering using prognostic DNA methylation data to identify the molecular characteristics of laryngeal cancer. We also explored the correlation between Differences in DNA methylation levels and differences in T, N, and M category, age, stage, and prognosis. Functional enrichment analysis was conducted in each molecular subtype. A risk model was constructed via prognostic DNA methylation levels. Results: DNA methylation data in 123 laryngeal cancer samples were obtained. Through univariate Cox regression analysis, the 17,751 DNA methylation sites were dened as prognostic CpG sites with the threshold of P value less than 0.05. Then, the multivariate Cox regression analysis was further screen the prognosis-related DNA methylation sites. Five subgroups were identied based on consensus clustering using 455 prognostic CpG sites that were signicantly associated with patient’s survival. We built a prediction model and used it to classify the samples. Conclusion: The unsupervised clustering was performed using prognostic DNA methylation data and described the seven laryngeal cancer molecular subtypes. Our results provide novel insights into mechanisms and more effective personalized treatments of laryngeal cancer.


Introduction
Laryngeal carcinoma (LC) is one of the most common types in the head and neck region [1]. Laryngeal cancer is expected to account for 17,950 new cases and 3,640 deaths in the United States in 2020 [2]. About 0.8% of all new cancer cases and 0.6% of all cancer deaths occur in patients with laryngeal cancer. With the decrease in tobacco use, the incidence of laryngeal cancer has decreased by 2.4% every year in the past 10 years. Although surgery, radiotherapy, and chemotherapy have improved, the prognosis of most patients with laryngeal cancer has not improved signi cantly in the past thirty years, and the 5-year survival rate is still about 60% [3][4][5]. It is reported that the recurrence rate of advanced laryngeal cancer is between 25%-50% [6]. Laryngeal cancer is highly heterogeneous, and differences in sensitivity to chemotherapy between clinical subtypes can lead to multiple prognoses. Therefore, it is important to nd molecular subtypes for patients with laryngeal cancer, which will help improve the prognosis and treatment effect.
Epigenetics is considered to be heritable changes in gene expression, not related to changes in the DNA sequence. It plays a vital role in carcinogenesis [7][8][9][10]. At present, with the development of high-throughput gene detection technologies (especially gene chips and RNA sequencing, DNA methylation), DNA methylation of several promoter sequences, including EZH2, SSTR2, CDKN2A, MGMT, MLH1, and DAPK, has been associated with the onset, development, and malignant transformation of laryngeal cancer [11][12][13]. The identi cation of speci c epigenetic markers from samples of patients with laryngeal cancer may help to develop personalized treatment plans. Such biomarkers can play a key role in prognostic evaluation, staging, recurrence prediction, and timely initiation of appropriate therapeutic drugs and interventions.
The traditional morphology-based classi cation system of laryngeal cancer include the occurrence site (primary and secondary laryngeal cancer), histopathological classi cation (squamous cell carcinomas (SCC), non-SCC cancers), HPV infection status (human papillomavirus-initiated (HPV+) and HPV-negative (HPV − ) disease) [14][15][16]. Meanwhile, the TCGA analysis identi ed laryngeal cancer subclasses as atypical, classic, basal, and mesenchymal types based on mRNA pro ling. Additionally, other studies based on mutation data has been conducted to identify NSD1-and NSD2subtypes of laryngeal cancer [15]. Currently, there is a lack of reliable molecular prognostic biomarkers for laryngeal cancer strati cation based on DNA methylation pro les. Therefore, the prognostic prediction model based on highthroughput omics data integrates multiple DNA methylation biomarkers to improve clinical prognosis evaluation and personalized treatment in research. This classi cation system also can help identify molecular subtypes or new laryngeal cancer markers to more accurately classify patients with laryngeal cancer.

Materials And Methods
Data source and data pre-processing RNA-sequencing data from 129 laryngeal cancer samples were downloaded from the TCGA data portal (https://cancergenome.nih.gov/) [17]. The clinical information of these samples is shown in Supplementary Fig. 1, including follow-up data of 117 patients. The methylation data of the HumanMethylation450 BeadChip array on samples from 129 patients were downloaded from the UCSC Cancer Browser.
The methylation level of each probe is represented by a β value from 0 to 1, corresponding to unmethylated and fully methylated, respectively. Probes with missing data in more than 70% of the samples were removed. The remaining probes that were not available (NAs) were imputed using the k-nearest neighbors (KNN) imputation procedure. The ComBat algorithm in the "sva" of R package was used to eliminate batch effects by combining patient ID information and batches and integrating all DNA methylation array data [18]. Finally, we selected samples for which gene expression pro les were available. In total, 112 samples and 20,6635 methylation sites were included in subsequent analysis.
Identi cation of molecular subtype characteristics using Univariate COX regression analysis The univariate Cox regression analysis was used to screen the prognostic CpG sites with a cutoff point of P value less than 0.05. Then, multivariate COX proportional risk regression model was further employed to identify the prognostic related CpG sites, including tumor stage and age as covariates. Finally, the CpG sites that were still signi cant were used as classi cation features. The Coxph function in the survival software package R was used to t the Cox proportional hazard model to the methylation level of CpG, and clinical and demographic features (T category, N category, M category, age, and stage) were used as covariates in the multivariate analysis with the threshold of P value less than 0.001.

Molecular subtype classi cation using consensus clustering
Consensus clustering was performed with the ConsensusClusterPlus package in R to determine subgroups of laryngeal cancer based on prognostic CpG sites [19]. After executing unsupervised consensus clustering, the item-consensus results and cluster consensus were obtained. Cluster counts of 2, 3, 4, 5, 6, and 7 are evaluated. The cumulative distribution function (CDF), the proportion of ambiguously clustered pairs (PAC), principal component analysis (PCA), and consensus heatmaps were used to assess the optimal K. The category number was selected as the area under the CDF curve and showed no signi cant change. The heat map corresponding to the consensus clustering was generated by pheatmap R package.

Survival and Clinical Characteristic Analysis
The Kaplan-Meier plots were used to illustrate the overall survival rate among molecular subgroups de ned by the DNA methylation pro les. The log-rank test was used to test the signi cance of differences among the clusters. The chi-square test was used to analyze the association between clinical and biological characteristics and DNA methylation clustering. All tests are two-sided. A P value < 0.05 was considered statistically signi cant.

Functional enrichment analysis
To identify differentially expressed genes (DEGs) in subtypes, differential expression analysis of subtypes was performed. Gene Ontology (GO) biological process terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway were performed with the R package "clusterpro le" [20]. Biological process (BP), cellular component (CC), and molecular function (MF) are the three parts of GO analysis. P-value < 0.05 and false discovery rate (FDR) of < 0.05 were the threshold for GO and KEGG enrichment analysis.

Construction and testing of the prognostic prediction model
With the candidate methylation sites identi ed by Cox Proportional Hazard Model, we built a prognostic prediction model. The regression coe cients from the multi Cox regression analysis were used to create a classi cation index for each sample, and used the following formula to weight the expression value of the selected methylation sites: The Prognostic Index = (methylation level of cg1 * Coef1) + (methylation level of cg12* Coef2) + (methylation level of cg3* Coef3) +…+ (methylation level of cgn * Coefn). The KM survival curve, ROC curve, and risk curve were plotted.
Meanwhile, the univariate and multivariate COX prognostic analysis was used to determine whether the established prognostic prediction model was an independent prognostic factor for HNSC.

Identi cation of prognostic related methylation sites
After patient data were preprocessed as described above, 206,635 methylation sites were identi ed. To identify the speci c CpG sites that were signi cantly correlated with overall survival in laryngeal cancer, the univariate COX regression analysis was conducted. Among these methylation sites, 17,751 methylation sites were identi ed as potential DNA methylation biomarkers for overall survival in patients with laryngeal cancer. Subsequently, a multivariate COX regression analysis was performed on these CpG sites (variables also include TMN stage, age, stage), and 455 independent CpG sites related to the prognosis of patients with laryngeal cancer were identi ed (Supplementary Table 1).

Molecular subtype classi cation based on Consensus Clustering
Using the ConsensusClusterplus R software package, the methylation pro les of 455 CpG sites from 112 samples were used for consensus clustering of samples to obtain molecular subtypes of laryngeal cancer. Consensus unsupervised clustering of 122 samples from laryngeal cancer patients revealed 2-7 clusters. Compared with 3, 4, and 6 clusters, cluster 5 had a lower value for the proportion of ambiguously clustered pairs (PAC), which re ected a near-perfect stable partitioning of the samples at the correct K value (Fig. 1A-1C). The consensus matrix shown in Fig. 2A represents the consensus for k = 5 and displays a well-de ned 5-block structure. A heatmap corresponding to the dendrogram in Kaplan-Meier survival analysis showed a signi cant difference in the prognosis between the ve subgroups (P < 0.05).
As shown in Fig. 3A, cluster 3 has the best prognosis, while cluster 4 and 2 have the worst prognosis. Then, we analyzed the intra-cluster ratios of 5 clusters according to the T, N, M, stage, and age, as shown in Fig. 3B-3F. The correlation trends between features and speci c clusters are as follows: Cluster 3,4 with young patients; Cluster 3 with the male population; Clusters 2 with higher grade and advance stage; Cluster 3,5 with early-stage; Cluster 4 with higher M stage; Cluster 1,3 with higher T stage; Cluster 3 with lower N stage (Fig. 4F). These results indicate that each clinical parameter is related to a different intra-cluster ratio.
Annotate prognostic-related CpG sites and functional enrichment analysis Genomic annotation of the above-mentioned prognostic CpG sites was used to identify the promoter regions of 567 genes. Then we conducted a functional enrichment analysis of 567 genes and identi ed signi cantly 405 GO terms and 17 enriched pathways (P < 0.05), as shown in Fig. 3A, B. The top 5 enriched GO terms were enriched in signal transduction by p53 class mediator, autophagy, DNA repair complex and p53 binding, which included basic cancerrelated biological processes. The three most signi cantly enriched pathways were the TGF − beta signaling pathway, autophagy, and cell cycle. As shown in Fig. 3C, close relationships were identi ed among the 17 pathways and between those pathways and the cell cycle and cellular senescence.
Then, the expression of methylated genes identi ed in the subgroups was explored. The expression values of 567 genes out of 122 samples are available. The gene expression heat map is shown in Fig. 3D. The gene expression patterns differ between these subgroups, which indicates that the DNA methylation level usually re ects the expression of these genes.
Identifying subtype-speci c CpG sites The expression of methylated genes identi ed in the subgroup was explored. 455 CpG sites were analyzed for differences among 5 molecular subtypes, and 70 molecular subtype-speci c CpG sites were identi ed. Among them, subtype 3 has the most differentially expressed CpG sites, which are hypermethylated compared with other subtypes ( Table 1). The gene expression heatmap is shown in Fig. 4C. The differences in gene expression patterns between subgroups suggest that DNA methylation levels usually re ect the expression of these genes. Then, we further screen for cluster-speci c methylation sites by including the methylation sites as features of clusters. Finally, the 32 clusterspeci c methylation sites were identi ed and the heat map was shown in Fig. 5A. Analysis using clusterPro ler showed that these genes were enriched in 3 pathways, as shown in Fig. 5B. Cluster 3 had the largest number of speci c sites, all of which were hypermethylated, and the methylation level was the highest among all the clusters (Fig. 6). Establishment and evaluation of a prognostic prediction model for laryngeal cancer patients Cluster 3 was chosen as the candidate cluster because it contains a large number of samples, has a good prognosis, and has the most speci c methylation sites. In multivariate Cox regression method, nine methylation sites as optimal features were ultimately recognized, including cg00413617, cg01330280, cg05893785, cg07098752, cg10699171, cg16508480, cg18470295, cg24464397 and cg24911113 (Table 2). Then, we obtained the risk score based on the formula described above and calculated the risk score of each sample (Fig. 7A). As the risk score increased, the methylation level of nine speci c sites increased signi cantly. Hypermethylation was associated with low-risk patients, while hypomethylation was associated with high-risk patients. The area under the curve of the ROC curve for 1,3,5 year reached 0.80, 0.86, and 0.89, respectively, indicating that the model functioned well (Fig. 7B). Based on the median of risk score, the samples were divided into two groups: high-risk group and low-risk group. Patients in the high-risk group suffered a lower survival probability (Fig. 7C). Univariate and multivariate Cox regression analysis suggested that the established model could become an independent predictor, including age, gender, histological grade, stage, TNM stage, and risk score (Fig. 8A, B, Table 3).

Discussion
Laryngeal cancer is mainly composed of laryngeal squamous cell carcinoma, which is the second most common malignant type of head and neck tumors. Despite its advantages in diagnosis and treatment, the mortality rate of laryngeal squamous cell carcinoma has remained high in the past 20 years. Epigenetic regulation plays an important role in laryngeal squamous cell carcinoma [21][22][23]. Gene methylation can inhibit gene expression, and certain histones and DNA methylation are related to laryngeal cancer. Histone deacetylase also plays a role in laryngeal squamous cell carcinoma [24]. According to differences in epigenetic regulation, certain characteristics can be used as diagnostic markers and can even promote the treatment of laryngeal squamous cell carcinoma. More recently, abnormal DNA methylation is one of the hallmarks of cancer tissue. The latest development of sequencing technology makes it possible to analyze the DNA methylation pro les of the whole genome with high resolution. Recently, several promoter sequence-speci c methylation related to the occurrence and development of laryngeal cancer has been discovered. The enhancer of zeste homolog 2 (EZH2) is a highly conserved histone methyltransferase, which is overexpressed in different types of cancers such as breast cancer [25] and prostate cancer [26], laryngeal cancer [11]. Lili reported that the occurrence of aberrant methylation events in the LINC00886 TSS may accelerate the malignant progression of laryngeal carcinoma [27]. Another study showed that the regulation of gene expression by TREX2 DNA methylation may affect the incidence and survival of laryngeal cancer [21]. Elevated CMCC3 methylation level is a risk factor for male LSCC patients, especially in patients over 55 years of age who have smoking behavior [28]. Although the in uence of some genes with abnormal DNA methylation on the occurrence and development of laryngeal cancer has been widely reported, the comprehensive overview of the classi cation of laryngeal cancer is based on DNA methylation data still needs further clari cation.
Molecular mechanism research based on bioinformatics analysis is an important method in cancer research. Besides, The TCGA database is a publicly available resource that covers a wide variety of data types in a variety of cancers.
Therefore, an informatics method was employed to classify the molecular subtype based on DNA methylation pro ling.
The classi cation will help to improve the patient's prognosis and seek for a valid target of treatment. Although methylation may serve as an important biomarker in laryngeal carcinoma, the molecular subtypes based on DNA methylation pro les have yet not been reported. We tried to solve the problems in this study by developing a classi cation method that integrates several DNA methylation biomarkers, which can be used to evaluate the prognosis of treatment effects and help guide treatment choices. Besides, the clinical and statistical signi cance of these gene methylations concerning tumor classi cation, survival time, and prognosis needs to be con rmed in the larger cohort.
Therefore, speci c prognosis subtypes based on DNA methylation status using 112 laryngeal cancer from the TCGA database were explored. Consensus unsupervised clustering was used to classify the molecular subgroups.
Unsupervised hierarchical clustering analysis utilizing the Consensus unsupervised clustering method uncovered the ve speci c DNA methylation subtypes of laryngeal carcinoma. Five subgroups were distinguished by consensus clustering using 11,637 CpGs that signi cantly in uenced survival. Cluster III had a better prognosis compared with Clusters I and II. Moreover, each cluster possessed its own distinct functional enrichment terms. The difference in molecular characteristics in ve subtypes would affect the patient's prognosis. Based on the results of the multivariate Cox regression analysis, a prognosis prediction model was constructed.
Nevertheless, there are several limitations to the present study. First, the prognosis prediction model was based on the publicly available dataset, and further clinical studies are needed to con rm these results. Second, it is very di cult to judge the best k in unsupervised consensus analysis. Third, although our research aims to investigate the possibility of constructing a prognostic prediction model, it is still in its infancy and needs improvement.

Declarations
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.

Availability of data and material
Data availability could be obtained from the TCGA website.