Comprehensive analysis of DNA methylation and gene expression reveals specific predictive biomarkers for renal cell carcinoma

Background We aimed to identify methylation-driven genes (MDGs) and predict the prognosis of clear cell renal cell carcinoma (ccRCC) based on several cohorts with high-throughput data. Methods The transcriptome profiles, 450K methylated data and corresponding clinical information were extracted from the cancer genome atlas (TCGA) database. MethylMix package was used to screen aberrant methylation events. Next, the Cox regression models were applied via survival package. Functional analysis was mainly performed based on ConsensusPathDB database. Besides, “mimifi” R package were adopted to analyze methylated alterations from three Gene Expression Omnibus (GEO) datasets. The predictive efficiency of constructed MDGs signature was assessed and validated in TCGA-KIRC and ICGC-RCC cohort, respectively. Unsupervised clustering analysis was conducted via ConsensusClusterPlus package. Moreover, MDGs-nomogram for OS prediction was conducted via glm and survival packages. Results Totally, we collected We finally identified 761 samples from TCGA-KIRC, ICGC-RCC, GSE61441, GSE105260 and GSE105261. We combined the expression data and 450K methylated data to find 5 hub prognostic MDGs (TAGLN2, PDK2, HHLA2, HOXA2, XAF1). We used the TCGA-KIRC cohort as the training dataset to verify the superior predictive significance of the 5 MDGs signature (AUC = 0.713), and validated it in ICGC-RCC cohort with AUC = 0.769. Kaplan-Meier analysis suggested that patients with high MDGs levels suffered from worse suvival outcomes. Besides, we further conducted the unsupervised clustering analysis in the whole ccRCC patients and identified the sub-cluster with the worst prognosis, indicating the MDGs could be used as efficient molecular classifier for ccRCC. We also identified 32 prognostic risk loci associated with hub MDGs in KIRC. Superior predictive efficiency was found in the MDGs-nomogram [Area Under Curve (AUC) of 3-year: 0.842, AUC of 5-year: 0.862],


Background
Clear cell renal cell carcinoma (ccRCC) is the most common histologic subtype of renal cell carcinoma (RCC) defined by pathologic classification (1). Accounting for approximately 2-3% among all human malignant neoplasms, RCC is a serious challenge to urologists and oncologists globally (2). Unlike surgical intervention, the clinical efficacy of radiotherapy and chemotherapy are unsatisfactory (3).
Despite the progression in immunotherapy in recent years, the prognosis of advanced RCC is far from optimistic (4,5). Therefore, identification of biomarkers for the early diagnosis and treatment will be substantially beneficial for early surgical intervention and improvement of prognosis.
Studies have shown that the tumorigenesis and development of ccRCC is a complex process driven by multiple genes and follows multiple steps (6)(7)(8). However, the molecular regulation mechanism of ccRCC has not been elucidated. As one of the most important epigenetic factors, DNA methylation is extensively participated in the regulation progress of a variety of malignancies, including lung cancer, hepatic carcinoma, breast cancer and ccRCC (9)(10)(11)(12). Aberrant methylation genes in CpG islands, which located in or near promoter regions of the genome, are often manifest as hypermethylation and eventually lead to silencing of tumor suppressor genes (13). In ccRCC, Li et al. revealed that CLDN7 which appears as a major constituent of tight junctions in cell-to-cell adhesion, is frequently downregulated via hypermethylation of its promoter (14). Besides, Schelleckes et al. identified two separated methylation-sensitive CpG islands in independent regions in promoter of tumor suppressor KIBRA, which was inhibited caused by promoter methylation (15). However, due to the heterogeneity of the disease itself and the large number of genetic and molecular regulatory networks involved, the regulation of DNA methylation in ccRCC has not been clarified.
In recent years, the interaction between methylation and gene expression has attracted more and more attention. The wide DNA methylation arrays, advent of deep RNA-Seq approach and emergence of new statistical algorithms promoted the rapid development of bioinformatics in malignant tumor research. An original computational algorithm called Methylmix invented by Olivier Gevaert et al. was applied in correlation analysis between abnormality of DNA methylation and predict transcription (16).
In addition, public databases contain cancer genomic profiles are also facilitating the process of bioinformatics study. The Cancer Genome Atlas (TCGA) is a database with a view to catalogue and explore major cancer-causing genomic alterations (17). With a wide variety of online resources for biological information and data provided by the National Center for Biotechnology Information (NCBI), the utilization of Gene Expression Omnibus (GEO) database provides a reliable data foundation for bioinformatics analysis (18).
In this study, we analyzed and screened methylation-driven genes (MDGs) extracted from gene methylation profiling combined with gene expression data. The analyzed results from TCGA and GEO database were integrated and hub MDGs were filtrated by using bioinformatics tools. Finally, we explored the prognosis value of selected MDGs in ccRCC.

Data acquisition and processing
We obtained mRNA expression profiles and corresponding clinical information from TCGA (https://cancergenome.nih.gov/) database, including 541 tumors and 70 normal samples. Normalized fragments per kilobase of exon per million reads mapped (FPKM) values of the mRNA-seq data were selected for subsequent analysis. Methylation profiles were assessed based on the Illumina Infinium HumanMethylation450 platform, which could cover nearly 482,421 CpG sites throughout the human genome. We utilized the TCGA-Assembler to download the beta values, which represent the average methylation level for all CpG sites related with one gene. Besides, clinical characteristics including age, gender, radiation or pharmaceutical therapy, tumor grade, TNM stage, pathologic status and survival information of each ccRCC patient were available from the TCGA data portal.

Methylmix Package
Since we have prepared the files of methylation and normalized expression data for tumors and normal tissues, we conducted the integrative analysis with the bioconductor MethylMix package to investigate potential correlations between methylation events and gene expression. MethylMix has been shown to produce unique findings apart from results consistent with other methods of integrative analysis 16 . The concordance of these complementary "omics" may facilitate revealing relevant information (19). Firstly, we considered the methylation events that correlated with alterations of gene expression only when genes reached the correlation filter cutoff. Secondly, a methylation state based on the large number of patients was defined using a beta mixture model by which the need for arbitrary cutoff could be precluded. Thirdly, the DNA methylation states were compared in tumor versus normal samples using the Wilcoxon rank sum test.

Pathway Analysis Of The Methylation-driven Genes
ConsensusPathDB database consists of 17 public resources for interactions, including genetic, metabolic, signaling, gene regulatory and drug-target interactions, as well as biochemical pathways (20). In our study, we selected the following sub-databases for analysis: Kegg, Inoh, Wikipathways, Signalink, Biocarta, Netpath, Humancyc, Smpdb, Pharmgkb, Ehmn, and Smpdb. We used the default settings: minimum overlap and p value cutoff of 0.01. A list of 175 methylationdriven genes found to be statistically significant by MethylMix were conducted the pathway analysis and we set the cutoff with P value of 0.01.

Identification Of Hub Prognostic Mdgs
Integrating the corresponding clinical information of 319 ccRCC patients, we ultilized uivariate Cox regression model to screen survival associated MDGs from a list of 175 genes. Then, a total of 52 prognostic MDGs were chosen with P 0.01. Besides, we continued stepwise genes selection procedure and got several related models. We calculated the Akaike information criterions (AICs) of all models and an optimized model including 38 genes with the smallest AIC was selected finally. To further obtain and validate the hub MDGs, we download methylation data and transcriptome profiles from the GEO database. GSE61441 and GSE105260 were used for analysis of methylation profiles based on the GPL13534 platform (Illumina HumanMethylation450 BeadChip), while gene expression data of GSE105261 was performed using GPL10558 platform (Illumina HumanHT-12 V4.0 expression beadchip). There were 46 paired ccRCC tumor and adjacent normal tissues in GSE61441 and GSE105260 consisted of 35 ccRCC and 9 normal tissues. The GSE105161 series included 35 ccRCC and 9 normal tissues. For gene methylation profiling of GSE61441 and GSE105260, we ultilized the "mimifi" R package to identify genes differentially methylated between tumor and normal tissues with cutoff of q < 0.05 and |Fold Change|≥1. The differentiall genes analysis of GSE105261 were conducted using "limma" package and P 0.05 with |Fold Change|≥1 was set as the thresholds.
Afterwards, we obtained hypomethylation and upregulated genes via overlapping hypomethylated and high-expressed genes from GSE61441, GSE105260, GSE105261 and the 38 prognostic MDGs from TCGA cohort. Similarly, the hypermethylation with down expression genes were searched through the same process across four gene set. At last, we successfully identified 5 hub MDGs from TCGA and GEO databases that related with survival outcomes.

Construction Of Mdgs Signature And Nomogram For Os Prediction
Since we have already identified five prognostic MDGs from the above procedures, we conducted the multivariate Cox analysis for the five MDGs and obtained the respective coefficients (β i ). Then, we constructed the methylation-related risk score with the following formula: risk score=(β i * Exp i ).
According to the risk results, all 319 ccRCC patients were divided into high-and low-risk group with the median risk value as the threshold. We evaluated the predictive value of risk score via receiver operating characteristic (ROC) curve using "timeROC" package (21). and Kaplan-Meier analysis were drawn to compare the difference between high-and low-groups. Moreover, we integrated MDGs signature with clinical data to establish a comprehensive nomogram, in which only the variables that reaching statistical significance from univariate and multivariate Cox analysis can be selected into the final model. Meanwhile, the calibration graphs of 3-, 5-year OS prediction was drawn to assess the performance of nomogram. We used "rms" package to exhibited the plots. What is more, ROC curves for 3-, 5-year OS prediction of ccRCC patients were illustrated to compare the predictive accuracy between MDGs signature-based nomogram and traditional clinical factors of TNM stage.

Analysis of genes and methylated loci associated with OS, joint survival analysis
To further detect the specific methylated sites in the 5 hub MGDs, we extracted the all methylation data in 440 files from TCGA using the Perl scripts. We merged the value of methylated sites into one matrix and conducted univariate Cox analysis to select potential risk methylated loci. Survival analysis was then performed to assess the difference between hypo-and hypermethylation state of specific sites. Additionally, a joint survival analysis was conducted to further evaluate the vital MDGs signature associated with prognosis in KIRC patients, where we combined the the methylation levels and expression data of one gene. The conjoint analysis was also performed by survival package.

Meta-analysis Based On Oncomine Database
The expression levels of five MDGs signature were further validated in Oncomine database (https://www.oncomine.org/resource/main.html). We compared the median rank of a gene across 6 analyses with the cutoff of P value 0.05 (FigureS2).

Gene Set Enrichment Analysis
The gene set enrichment analysis (GSEA) (22)was conducted to further uncover the potential pathways that enriched statistically significant between low-and high-risk phenotypes. We chose the "c2.cp.kegg.v6.2.symbols.gmt gene sets" for enrichment analysis downloaded from the MSigDB.
Besides, we considered the significant gene sets with false discovery rate (FDR) 0.05. All the procedures were based on the JAVA8 platform (http://software.broadinstitute.org/gsea/index.jsp).
Moreover, Co-expression analysis for the 5 MDGs signature was conducted and we selected top 6 pathways to show by dotplot via "clusterProfiler" package.

Statistical Analysis
Identification of hub MDGs signature was achieved mainly by "MethylMix", "minifi" and "limma" packages (23). The correlation of gene methylation state with expression level was evaluated by Spearman's correlation. Cox regression analysis or Kaplan-Merier curves were performed based on "survival" package. Establishment of nomogram or calibration graphs were conducted via "rms" package. The heatmaps were generated by "pheatmap" package (24). and ROC curves were drawn by "timeROC" package. The meta-analysis results from Oncomine database were exhibited with P values and median ranks. All statistical analysis was conducted based on R studio (version 3.5.3).

Identification of methylation-driven genes (MDGs) in RCC
We collected 485 samples with 450K methylation data from TCGA cohort, including 319 ccRCC samples and 166 precancerous samples. Transcriptome files were obtained from 611 samples, including 541 tumors and 70 normal samples. We then extracted and normalized the methylation and gene expression information of ccRCC using the "limma" package. MethyMix was utilized to conduct correlation analysis, where a mixed model was constructed and Wilcoxon rank test was calculated for identifying differential methylation with |logFC| > 0, P 0.05, |Cor| > 0.3. A total of 175 driven genes were selected via the screening procedure (Additional Table S1). We illustrated the top hypermethylated and top hypomethylated genes in Fig. 1.

Functional Pathway Analysis Of 175 Mdgs
Firstly, we exhibited the clustering heatmap of 175 MDGs in tumors and normal tissues using "pheatmap" package in Fig. 2A. To further investigate the molecular mechanisms and underlying biological processes that the drivers may participate in, we conducted the functional pathway analysis based on ConsensusPathDB database. The results revealed that these 175 driven genes were mainly involved in oxygen metabolism process, one carbon metabolism, inflammatory response pathway, cell communication, as well as glucose metabolism pathway (Fig. 2B, Table 2).  downregulated genes in GSE105261. Therefore, we finally identified 5 hub prognostic MDGs (TAGLN2, PDK2, HHLA2, HOXA2, XAF1) by integrating across three GSE series and TCGA cohort (Fig. 4). Besides, we constructed a representative volcano plot for GSE105261 in Fig. 4C. We conducted multivariate  Fig. 4C. To assess the predictive value of the risk signature, ROC curve was drawn and the AUC was 0.713 for 3-year OS prediction (Fig. 4D). Kaplan-Meier survival curve analysis of patients revealed that the prognosis was poor in high-risk group, and the difference across the two groups showed statistically significant with log-rank test of P 0.001 (Fig. 4E). Furthermore, we ultilized another ICGC-RCC cohort as a validation dataset and we found the AUC was 0.769 (Fig. 4F).
Patients with high risk suffered from worse clinical outcomes compared with that in low-risk group (Fig. 4J). We thus summarized that the risk model based on 5 MDGs signature has certain accuracy in assessing the prognosis of ccRCC patients.
Given that the 5 MDGs signature was proved to be well associated with prognosis, we considered whether the 5 genes could be ultilized as the molecular markers for guiding classifications of patients.
We totally collected 618 patients from the TCGA-KIRC and ICGC-RCC cohort and normalized the sequencing data for following analysis. Besides, we used the ConsensusClusterPlus package and iterated 1000 times (K = 1: 10) to stabilize the classification categories, obtaining the optimized classification of the samples (Fig. 5A and 5B). We observed that the classifications reached the best stable status with k = 5 in Fig. 5C. Kaplan-Meier analysis suggested that MDGcluster classifications correlated significantly with survival outcomes when k = 5, among which the MDGcluster4 (N = 35) has the best prognosis, and MDGcluster5 (N = 105) has the worst prognosis (Fig. 5D). Sankey diagram also illustrated the underlying associations among MDGcluster, risk score levels and vital outcomes (Fig. 5E). Given the differential survival outcomes between MDGcluster4 and MDGcluster5, we intended to figure out the underlying mechanisms between two groups. We performed the differential analysis between MDGcluster4 and MDGcluster5 to identify totally 162 DEGs with FDR < 0.001 (Table   S2). Meanwhile, GO functional analysis also revealed that the top 400 genes were mainly enriched in mitochondrion organization, notch developmental processes and cell cycle biological items (Fig. 5F).
The GSEA analysis also indicated that the notch signaling pathway, wnt signaling pathway and EMT pathway were significantly up-regulated in MDGcluster5 than MDGcluster4, partially contributing to the poor prognosis of MDGcluster5 (Fig. 5J).

Establishment Of A Comprehensive Nomogram
To figure out that whether the MDGs risk score could be ultilized as an independent predictor, we combined with other clinical risk variables associated with prognosis based on the univariate Cox regression results, including age, tumor grade, pharmaceutical therapy, as well as pathological stages. Besides, we conducted the multivariate Cox regression method to integrate clinical features with the MDGs signature. Based on the multi-variate analysis consisting of the above variables, the risk score maintained a significant and independent factor for OS prediction (P = 0.001122) in the TCGA cohort (Table 3). Hence, a MDGs-nomogram was accordingly constructed to predict the 1-, 3-, 5- year OS and the calibration graphs for the 3-, 5-year OS rate revealed the well curve fitting between observed outcomes and predicted survival (Fig. 6A-B). In addition, our integrated MDGs-nomogram

Mapping of Kaplan-Meier curves of driver genes and joint survival analysis
We performed the prognostic survival analysis of 5 drivers based on the "survival" R package, where log-rank test of P 0.05 was used as a meaningful statistical value. We found methylation levels of 3 MDGs were associated with the prognosis of OS (TAGLN2, HHLA2, XAF1), while no significant difference was observed in PDK2 and HOXA2 (Fig. 7A-E). What is more, the joint survival analysis showed that combination of methylation and expression levels had a highly significant correlation with survival outcomes in ccRCC. The hypomethylation with high expression level of TAGLN2, HOXA2 and XAF1 correlated with lower survival rate, while the hypomethylation with high expression level of PDK2 and HHLA2 showed the higher survival rate (Fig. 7F-J).
Given that we have already identified the 5 hub MDGs associated highly with survival prognosis of KIRC, we intended to further detect the significant methylation loci harboring these hazard aberrant methylated drivers. We extracted the β value of methylation data according the 5 gene symbols from the 874 files and merged into a matrix. A list of 187 methylation sites were screened out and the univariate Cox analysis revealed that 32 key methylation loci were associated with survival outcomes in Table 4 (P 0.05).

Enriched pathways and co-expression analysis of the 5 MDGs signature
The GSEA was conducted to investigate potential pathways that associated with the 5 MDGs signature. The result revealed that there were six MDGs related pathways enriched significantly in high-risk group with P 0.05 and the top four were mainly involved in glycolysis, sucrose metabolism, P53 signaling pathways and cytochrome p450 crosstalk (Fig. 8A-D). Besides, we further extracted the five-MDGs signature related genes with filter cutoff value of correlation coefficient = 0.5, P = 0.001. A total of 70 genes were then selected and put into "DAVID 6.8" (https://david-d.ncifcrf.gov/) 20 to perform the functional enriched pathway analysis, including gluconeogenesis, fructose metabolism, HIF-1 signaling pathway, carbon metabolism, as well as central carbon metabolism in cancer (Fig. 8E).
In accordance with GSEA results, the functional pathway analysis also indicated that the 5 MDGs signature were mainly associated with oxygen metabolism, glycometabolism, even the HIF-1 signaling pathway.

Discussion
Recent studies have revealed that the oncogenesis and progression of malignant tumors are inseparable from the extensive participation of epigenetic regulation mechanisms (26,27). As one of the most common epigenetic regulatory mechanisms, DNA methylation may lead to genomic rearrangements and chromosomal instability, which in turn leads to malignant progression (28). In ccRCC, the methylation status of specific genes was associated with worse prognosis (29). Since microarrays and high-throughput sequencing can simultaneously provide expression levels or modifications of thousands of genes and mRNAs in the human genome, they have been widely used for screening differentially expressed genes or abnormal RNA modifications in various malignancies.
However, the risk relationship and specific regulatory mechanisms of DNA methylation and tumor progression have not yet been elucidated in ccRCC.
In this study, we revealed critical methylation events in the landscape of ccRCC by characterizing the genome-wide methylated-differentially expressed genes. Firstly, mRNA expression profiles and corresponding clinical information were extracted from the TCGA database. We conducted the integrative analysis with the bioconductor MethylMix package to investigate potential correlations between methylation events and gene expression. Next, the ConsensusPathDB database and the univariate Cox regression model were applied to analyze and screen out the prognosis-related MDGs.
To further validate the above results, we used the 'mimifi' R package to analyze chip data from ccRCC tumors and adjacent normal tissues from the GEO database and to search for differential genes. Our results indicated that the 5 MDGs signature were mainly associated with oxygen metabolism, glycometabolism and HIF-1 signaling pathway.
TAGLN2 is a subtype of the TAGLN (Transgelin) family, which is class of actin actin-binding protein.
The current literature reports that the expression level of TAGLN2 in tumors remains controversial.
Several studies have found that TAGLN2 is up-regulated in malignant diseases by proteomic analysis (30)(31)(32)(33). However, some studies demonstrated the opposite results in breast cancer and colon cancer through proteomic analysis (34). In our study, the hypomethylation with high expression level of TAGLN2 correlated with lower survival rate, consistent with the description of oncogene.
PDK2 is a member of the pyruvate dehydrogenase kinase (PDK) family. As a phosphatase inhibitor of pyruvate dehydrogenase complex (PDC), PDK enhances cell glycolysis and promotes tumor cell proliferation. Cui et al. revealed that overexpression of PDK2 reflects poor prognosis in acute myeloid leukemia (35). Besides, PDK2 may induces cisplatin-resistance in lung adenocarcinoma by transcriptional regulating cation transport mediator 3 (CNNM3) (36). However, we observed no significant difference between methylation levels of PDK2 and the prognosis of OS. This may be caused by heterogeneity of ccRCC.
Study has shown that HHLA2 may inhibit T cell receptors mediated proliferation and cytokine production of CD4 + and CD8 + T cells (38). High expression level of HHLA2 were observed in nonsmall-cell lung cancer (NSCLC) and osteosarcoma, and has association with advanced and metastatic diseases (39,40). In PD-L1 negative NSCLC patients, HHLA2 were identified as a high-expression level (40). On the contrary, our results revealed that hypomethylation with high expression level HHLA2 showed the higher survival rate in ccRCC. The role of HHLA2 in ccRCC requires further investigation to explore its potential value as a novel immunotherapeutic target (39).
HOXA2 is a part of HOXA, which is a cluster of the human class I homeobox genes (HOX) (41).
Recently, HOXA2 was firstly defined as an androgen-responsive gene and an oncogenic transcription factor in prostate cancer (PCa). In this study, Gao et.al discovered that the binding site of HOXA2 may altered by rs11672691, which was identified at 19q13 associated with aggressive PCa by Genomewide association studies (GWAS) (42). In ccRCC, we found that the hypomethylation with high expression level of HOXA2 correlated with lower survival rate, while no significant difference was observed in methylation levels of HOXA2 associated with the prognosis of OS. The regulatory mechanism of HOXA2 in ccRCC progression needs further study.
X-linked inhibitor of apoptosis protein (XIAP)-associated factor 1 (XAF1) is identified as a pro-apoptotic protein and an antagonist of XIAP, counteracting anti-caspase activity of XIAP (43). It was demonstrated that XAF1 may promote cell apoptosis through in a multi-mechanism environment (44)(45)(46)(47). Down-regulated or deleted of XAF1 was observed in many malignancies, which suggesting that low expression of XAF1 may promote malignant tumor progression by relieving tumor cell apoptosis inhibition (48)(49)(50). However, XAF1 was up-regulated in testicular germ cell tumors exceptionally, which may lead to high sensitivity to chemotherapy (51). Also, we found that hypomethylation with high expression level of XAF1 correlated with lower survival rate. It is suggested that XAF1 may participate in the ccRCC regulation process through some unconventional approaches.
Our functional pathway analysis indicated that such 5 MDGs signature were mainly associated with oxygen metabolism, glycometabolism, even the HIF-1 signaling pathway, which is unique in our study.
In head and neck cancer (HNC), it was revealed that mitochondrial oxidation may be activated by PDK2 inhibition, and reverse cisplatin resistance (52). Another research focused on HNC demonstrated that, mitochondrial mutations contribute to HIF1-α accumulation via increased reactive oxygen species and up-regulated PDK2, which contribute to HNSCC development (53). Combined with our results, it is suggested that in ccRCC, PDK2 may regulate mitochondrial oxidation through methylation, and thus leading to tumor progression.
In addition, we summarized that the risk model based on 5 MDGs signature has certain accuracy in assessing the prognosis of ccRCC patients. Nomogram was integrated and showed that the superior predictive efficiency, compared with traditional independent feature such as TNM stage. The results suggested that our nomogram might serve as an accurate prognosis prediction for ccRCC.
Apart from the discovery of 5 MDGs and prognostic value of our nomogram, some limitations were also observed in ours research. Firstly, screening and verification were performed through bioinformatics approaches, while no other external clinical samples were used to validate the data.
Secondly, the underlying biological signaling pathways that the 5 hub MGDs may participate in warranted further functional experiments to validate. Moreover, selection bias might occur, which could reduce the credibility of our results. Finally, as mentioned above, these 5 selected MDGs should be further explored to illuminate the specific regulatory mechanism in ccRCC.

Conclusion
In the current study, we explored and identified several hub-MDGs that could be an effective classifier and predictive factor in ccRCC using the multi-omics data across several datasets, providing new insights on epigenetic pathogenesis and therapeutic targets for ccRCC.