Molecular Characterization, Clinical Signicance, Tumor Microenvironment Association and Drug Susceptibility of DNA Methylation-Driven Genes in Renal Cell Carcinoma

Background: Accumulating evidence suggests that DNA methylation has essential roles in the development of renal cell carcinoma (RCC). Aberrant DNA methylation acts as a vital role in RCC progression through regulating gene expression, yet little is known about the role of methylation and its association with prognosis in RCC. The purpose of this study is to explore the DNA methylation-driven genes for establishing prognostic-related molecular clusters and providing a basis for survival prediction. Methods: Differentially expressed genes (DEGs) were calculated by the “limma” R package. DNA methylation-driven genes (MDGs) were identied using the “methylMix” R package. The differentially expressed DNA methylation-driven genes (DEMDGs) were obtained by intersecting MDGs and DEGs. RCC data sets were divided into two groups using “ConsensusClusterPlus” R packet. The prognostic model was constructed based on multivariate Cox regression analysis. Based on the optimal cut-off value of the risk score, the patients were divided into high-risk group and low-risk group. The potential molecular mechanisms of prognostic-related DEMDGs were detected by GO, KEGG and GSEA analysis. Results: In this study, 146 DEMDGs were selected and two clusters were distinguished by consensus clustering. We further evaluated the immune status of two clusters and selected 106 DEGs in cluster 1. Functional enrichment analysis of 106 DEGs and cluster-based immune status analysis provided new insights for the development of RCC. The prognostic model based on 17 DEMDGs was constructed to predict prognosis of RCC patients. The predictive nomogram and the web-based survival rate calculator (http://127.0.0.1:7634/) were built to validate predictive accuracy of prognostic model. Furthermore, the risk scores were strongly associated with clinical features, immune status and drug susceptibility. Conclusion: Novel prognosis-related clusters based on the 146 DEMDGs were constructed and might contribute to precision medicine development. The 17 DEMDGs were utilized to construct a prognostic model for RCC prediction, which was signicantly correlated with prognosis, immune inltration, clinical features and drug

challenges due to the lack of biomarkers for prediction of progression [7]. For this reason, more molecular biomarkers are urgently needed to screen for RCC diagnosis.
Modi cations of the epigenome, such as DNA methylation, play a crucial role in development of many diseases [8]. The correct methylation pattern is very important for normal biological functions, and aberrant methylation is one of the drivers for progression of several diseases, especially cancer [9].
Numerous prior studies suggested that hypermethylated and hypomethylated DNA always show different activities [10]. Abnormal methylation always occurs in cancer cells, leading to some genes aberrantly activated and some genes aberrantly silenced [11]. Hypomethylation of proto-oncogenes or tumor suppressor gene methylation is considered one of the leading mechanisms of tumorigenesis in many cancer types [12,13]. Therefore, the detection of the methylation patterns alteration of speci c gene can aid the cancer diagnosis. Silencing of tumor suppressor genes caused by promoter hypermethylation provides new ideas for inquiring the molecular mechanisms of RCC [14]. The aberrant methylation is involved in the progression of RCC. Some studies found that the DNA methylation in RCC silenced the von Hippel-Lindau (VHL) tumor-suppressor gene [15]. In addition, RCC can be genotyped based on DNA methylation mutations [16].
There have been many studies focused on DNA methylation or gene expression. However, RCC prognostic models based on DNA methylation driven genes have barely been explored. In this study, we established a prognostic model to accurately predict patient survival. In addition, we divided RCC samples into two clusters according to 146 DEMDGs and further explored the relationship between the tumor immune status and clusters of RCC. The results we distilled will ultimately contribute to improving diagnostic accuracy and e cacy in immunotherapy.

Date collection
A total of 28 RNA-seq transcriptional pro ling of normal samples were downloaded from GTEx dataset (https://gtexportal.org/). A total of 1021 RNA-seq transcriptome pro ling (128 normal samples and 893 RCC samples), 872 DNA methylation data (205 normal samples and 667 tumor samples) and corresponding clinical information of RCC were downloaded from TCGA dataset (https://gdc.cancer.gov/).

Differentially expressed genes (DEGs) screening in RCC and heatmaps plotting
We standardized RNA-seq transcriptional pro ling by using "limma" R package, and Wilcoxon rank-sum test was utilized to identify DEGs [17,18]. The false discovery rate (FDR) < 0.05 and |log 2 fold change (FC)| > 2 were taken advantage of as cutoff criteria. The "pheatmap" R package was used to plot the heatmaps [19].
Integrated analysis of gene methylation data and gene expression data Gene expression data and DNA methylation data were standardized by using "limma" R package [17].
DNA methylation-driven genes (MDGs) were identi ed using the "methylMix" R package [18,20]. DNA methylation data and paired gene expression data were integrated and analyzed jointly to identify DNA methylation status negatively correlated with gene expression of a particular gene, indicating that the gene is a DNA methylation-driven gene [21]. Inclusion criteria were set to the correlation of methylation data and corresponding gene expression data of DEGs less than -0.3, |log 2 FC| > 0 and adjust p <0.05.
The differentially expressed DNA methylation-driven genes (DEMDGs) were obtained by intersecting MDGs and DEGs for further analysis [18].

Construction of PPI network
The PPI network was established by the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database (https://string-db.org/) and visualized by Cytoscape software (v3.8.2) [22].

Evaluation of immune status and boxplots plotting
The immune cells in ltration levels in the samples were evaluated by "CIBERSORT" R package, and the inclusion criteria was p < 0.05 [23]. Stromal score and immune score also estimated by "estimate" R package [24]. Boxplots were plotted using the "ggpubr" R package [25].

Consensus clustering analysis
We capitalized on k-means clustering algorithm in the "ConsensusClusterPlus" R packet to perform clustering [26]. The similarity in samples distance was measured by computing Euclidean distance and kmeans algorithm to determine the credibility of our classi cation. The RCC datasets were grouped into distinct and nonoverlapping groups according to the consistent expression of 146 DEMDGs.

Construction of prognostic model
Utilizing the "glmnet" R package, "survival" R package and "survminer" R package, the prognostic model was constructed [27][28][29]. The risk scores were calculated according to a linear combination of the gene expression levels weighted by the regression coe cients from the multivariate Cox regression analysis [30]. The "survivalROC" R package was utilized to validate the stability of the prognostic model [31]. The Kaplan-Meier survival curves were carried out to assess the survival time between highand low-risk score RCC patients [32]. We validated the accuracy of optimal cut-off value by the principal component analysis [33].

Independence of the prognostic model from clinical features
We evaluated the independence of the prognostic model from clinical features via univariate and multivariate Cox regression analysis [34]. The signi cant levels were set to p < 0.05 and hazard ratios (HRs) with 95% CIs were also calculated.

Construction of the nomogram and the dynamic nomogram
The nomogram was constructed utilizing the "rms" R package [35]. The web-based survival rate calculator was established using the "shiny" and "DynNom" R packages to predict cancer-speci c survival rates dynamically [36,37].
Functional and pathway enrichment analysis Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis were performed using "enrichplot" R package, "org.Hs.eg.db" R package and "clusterPro ler" R package [39,40]. The inclusion criteria were set to p < 0.05 and q < 1. The results were visualized by "ggplot2" R package [41].

Identi cation of 146 DEMDGs in RCC
The research process of the study was showed in Fig. 1. Based on Wilcoxon rank test, a total of 5198 DEGs were selected from 28 RNA-seq transcriptome pro ling of normal samples in GTEx dataset and 1021 RNA-seq transcriptome pro ling (893 RCC samples and 128 normal samples) in TCGA dataset (FDR < 0.05 and |log 2 FC| > 2) ( Table S1). The heatmap shows the expression of DEGs between RCC samples and normal samples ( Fig. 2A). We screened the 270 methylation-driven genes (MDGs), whose methylation status negatively correlated with expression levels (Cor < −0.3, |log 2 FC| > 0 and adjust p < 0.05) ( Table S2). The heatmap shows the expression of MDGs between RCC samples and normal samples (Fig. 2B). Then, 270 MDGs and 5198 DEGs were intersected to obtain 146 DEMDGs for further analysis (Fig. 2C). We further visualized the methylation levels ( Fig. 2D) and gene expression levels ( Fig.   2E) of 146 DEMDGs in RCC samples and normal samples. The comprehensive landscape of DEMDGs interactions and DEMDGs connection for RCC patients was depicted with the DEMDGs network (Fig. 2F).
The above results found 146 signi cantly DEMDGs in RCC and normal samples. These abnormal DEMDGs were interconnected and may be involved in the occurrence and development of RCC.

Consensus clustering based on 146 DEMDGs and immune status analysis of clusters
To select optimized cluster number, we calculated k-means clustering algorithm with the ConsensusClusterPlus R packet. K = 2 was identi ed with optimal clustering stability ( Fig. 3A-D). Then, we analyzed the methylation levels and gene expression levels of 146 DEMDGs, as well as the clinical features of paired patients. There were signi cant differences in the methylation levels and gene expression levels between cluster 1 and cluster 2, and the clinical features were evenly distributed in two clusters ( Fig. 3E-F). The RCC patients in cluster 2 (n = 419) had a better overall survival (OS) than the patients in cluster 1 (n = 435, p < 0.001) (Fig. 3G).
Immune checkpoint inhibitors (ICIs) are administered for the treatment of RCC. We investigated whether the two clusters were related to ICI-related biomarkers. The results showed that cluster 1 was positively correlated with high expression of LAG3 (p < 0.001), CD160 (p < 0.001), HAVCR2 (p < 0.001), CTLA4 (p < 0.001) and TIGIT (p < 0.001), and the stromal score and immune score were signi cantly higher in cluster 1 compared with cluster 2 (p < 0.001) (Fig. 4A). The abundance of B cells naive (p < 0.001), T cells CD8 (p < 0.001), T cells CD4 memory activated (p < 0.001), T cells follicular helper (p < 0.001), T cells gamma delta (p < 0.001) and Macrophages M1 (p < 0.001) was signi cantly higher in cluster 1 compared with cluster 2 (Fig. 4B). The higher immune in ltration level corresponded to cluster 1, and lower immune in ltration level corresponded to cluster 2 (Fig. 4C). The RCC patients in high-immune score group had a worse OS than the patients in low-immune score group (p < 0.001) (Fig. 4D). Besides, we accessed the correlation of immune cells in cluster 1 and cluster 2. In cluster 1, the positive correlation between T cells CD8 and T cells follicular helper was the strongest, in which correlation coe cient was 0.55. The correlation coe cient between T cells CD8 and T cells CD4 memory resting was -0.66, the lowest negative correlation (Fig. 4E). However, in cluster 2, the cells with the strongest negative correlation were activated T cells CD8 and Macrophages M2, in which correlation coe cient was -0.46 (Fig. 4F). These results showed that the two clusters based on 146 DEMDGs were closely associated with prognosis and immune status in RCC patients.
To explore the possible reasons causing worse OS in cluster 1, we selected 106 DEGs from two clusters (Cor < −0.3 and |log 2 FC| > 1). We used heatmap visualizing the gene expression levels of 106 DEGs in RCC and normal samples (Fig. S1). We performed GO and KEGG analysis to analyze underlying functions and pathways of 106 DEGs (p < 0.05). Results of GO analysis were signi cantly enriched in regulation of T cell activation, T cell proliferation, positive regulation of leukocyte proliferation, positive regulation of T cell proliferation, positive regulation of cell-cell adhesion, etc. (Fig. 4G, S2A and S2B). Results of the KEGG pathways were signi cantly enriched in pathogenic Escherichia coli infection, natural killer cell mediated cytotoxicity, viral myocarditis, one carbon pool by folate, cytokine-cytokine receptor interaction, etc. (Fig.  4H). The above results indicated that the 106 DEGs were in close contact with immune microenvironment, which may be the causing for OS difference between two clusters.
RCC patients were split into high-and low-risk groups according to optimal cut-off value of the risk score (Fig. 5E). The AUC of the ROC curves was 0.788, 0.743 and 0.747 within 1-, 2-, 3-year, which demonstrated that risk score had a good prognostic value (Fig. 5F). The distributions of the risk score in high-and low-risk groups were showed in Fig. 5G. Patients' mortality risk increased with increasing risk score (Fig. 5 H). The survival curve was carried out to assess the survival time between high-and low-risk score groups. The survival time of high-risk group was signi cantly worse than the low-risk group (p < 0.001) (Fig. 5I). RCC samples were clearly structured in two different groups by the principal component analysis, which suggested our study could signi cantly re ect the prognosis differences of RCC patients (Fig. 5J). To investigate the prognostic value of the signature for OS in RCC patients strati ed by clinical features, RCC patients were strati ed according to gender, grade, clinical stage, T stage, N stage and M stage. For all different strati cations, the OS time of the high-risk group was shorter than that of the lowrisk group (p < 0.001) (Fig. 6). These results showed that the 17 DEMDGs for OS could accurately predict the prognosis of RCC patients and that the prognostic model could accurately re ect the survival of patients under different clinical features.
Clinical veri cation of prognostic model and enrichment analysis of prognostic model We investigated the relationship between the risk score of RCC and several variables. We found that T stage (p < 0.001), N stage (p < 0.05), clinical stage (p < 0.0001), tumor grade levels (p < 0.05), cluster (p < 0.0001) and immune score (p < 0.01) were signi cantly associated to the risk score. The risk score level of the T1, T2 and N0 stage were clearly lower than that of the T3, T4 and N1, N2 stage ( Fig. 7A-B). The higher level of clinical stage had higher level of risk score (Fig. 7C). The risk score increased with the grade of the tumor (Fig. 7D). The risk score of the cluster 1 was signi cantly higher than that of the cluster 2 (Fig. 7E). The low immune score group had lower risk score than high immune score group (Fig.  7F).
To further annotate functions enriched in the high-and low-risk groups, GSEA was queried to con rm the signaling pathways in which the genes are enriched. The results were represented in Fig. 7G-J (p < 0.05).
The following biological processes were enriched in the high-risk group: regulation of DNA damage response signal transduction by p53 class mediator, chondrocyte development, negative regulation of gene expression epigenetic, positive regulation of leukocyte proliferation and cardiac epithelial to mesenchymal transition. The following signaling pathways were enriched in the high-risk group: cytokinecytokine receptor interaction, vascular smooth muscle contraction, cell cycle, nod-like receptor signaling pathway, jak-stat signaling pathway. To clarify the possible molecular mechanism of 17 prognosisrelated DEMDGs, we also performed GO and KEGG pathway analysis. Results of GO analysis were signi cantly enriched in negative regulation of cytokine production, response to interferon-gamma, cortical actin cytoskeleton organization, multivesicular body, SCAR complex, Rac GTPase binding, actin binding, etc. (Fig. 7K, S5A and S5B). Results of KEGG analysis were signi cantly enriched in synthesis and degradation of ketone bodies, pathogenic Escherichia coli infection, regulation of actin cytoskeleton, butanoate metabolism (Fig. 7L). These results suggested that the risk score was correlated with clinical features, cluster and immune score and that the above biologic functions and signaling pathways might impact RCC patients' prognosis.
The predictive accuracy of prognostic model In order to determine whether the risk score could be presented as an independent prognostic factor for RCC patients, we employed univariate and multivariate Cox proportional hazards regression analysis. In the univariate analysis and multivariate analysis, the risk score, age and M stage showed pronounced effects on the RCC prognosis (p < 0.05) (Fig. 8A-B). Beyond this, we constructed a nomogram with the signi cant variables in the multivariate analysis (Fig. 8C). Results suggested that risk score had the signi cant in uence on survival prediction. The 1-, 2-, and 3-year predicted calibration curves also respectively suggested that the model had a good prediction accuracy (Fig. 8D-F). We also established a dynamic web-based survival rate calculator (http://127.0.0.1:7634/), which could individually predict the survival of patients according to their clinical features and risk score. For example, the 3-year cancerspeci c survival rate was approximately 55% (95% CI 42-72%) for patients with low risk, M0 stage and aged < 65 years (Fig. 8G-H).

Prognostic model correlated with tumor-in ltrating immune cells and drug susceptibility
To further analyze the relationship between prognostic model and tumor-in ltrating immune cells, we performed a detailed spearman correlation analysis and the result was presented with the lollipop shape ( Fig. 9A). High-risk group were more positively correlated with tumor-in ltrating immune cells, including CD8 + T cells, Macrophage M1, B cell , monocytes, Myeloid dendritic cell, T cell regulatory and myeloid dendritic cells, etc. The detailed results were showed in Table S3. We also attempted to identify associations between prognistic model and the e cacy of six common chemotherapeutic drugs for the treatment of RCC. The high-risk score was associated with the lower half maximal inhibitory concentration (IC50) of chemotherapeutics such as Temsirolimus (p = 1.1e-09), Sunitinib (p < 2.22e-16), Pazopanib (p = 0.0041), Doxorubicin (p = 0.0054), Bleomycin (p = 0.012), Axitinib (p = 7.1e−06) (Fig. 9B-G). The above results showed that this model closely correlated with tumor-in ltrating immune cells and drug susceptibility.

Discussion
Previous studies had described that RCC possesses a high number of genetic alterations and epigenetic alterations [42,43]. As the major epigenetic modi cation, DNA methylation studies have become a research hotspot in many cancers, speci cally in RCC [44]. In this paper, we identi ed 146 DEMDGs by integrating transcriptomic and DNA methylation pro les. Two clusters were identi ed by consensus clustering based on 146 DEMDGs, and we developed a reliable prognostic model consisting of 17 DEMDGs.
We  [47]. Additionally, cluster 1 had the higher immune score and immune cells in ltration. There were studies reporting that high immune score as well as high in ltration of immune cells were associated with poor prognosis, which was similar with our results [48]. We identi ed 106 DEGs from cluster 1 to perform further analysis. Many results of biological processes were signi cantly enriched in immunity, including positive regulation of I-kappaB kinase/NF-kappaB signaling, leukotriene D4 metabolic process, etc.. This provided further evidence that the different immune status of two clusters may be the possible cause for different survival satus.
Subsequently, 17 DEMDGs (including SH3BGR, BDH1, MOB3A, CX3CL1, NMI, NAPSA, PPP1R18, TCF19, FMNL1, C1orf54, FAXDC2, BST2, MYH14, ESRP2, NCKAP1L, TRIM4, EMP3) prognostic model was constructed to act as a reliable predictor by univariate and multivariate Cox analysis. Risk score also was calculated based on the gene expression and regression coe cients of each gene. Patients were divided into high-risk group and low-risk group based on their risk scores. The AUC of 1-, 2-and 3-year survival was 0.788, 0.743 and 0.747, respectively. Patients in the low-risk group had a longer overall survival compared with those in the high-risk group (p < 0.001). For all different clinical features, the high-risk group had worse survival than the low-risk group signi cantly with P < 0.001. We used GSEA to con rm the signaling pathways where the genes were enriched in the high-and low-risk groups. The high-risk group mainly enriched in immune-related processes. Then, we performed GO analysis and KEGG analysis on the 17 DEMDGs. We noted that BP group of GO analysis were mainly enriched in immune-related processes. These results suggest a closely relation between prognosis and immune status in RCC. The multivariate Cox regression analysis results indicated that our prognostic model was unaffected by clinical features. Besides, we constructed a nomogram and a dynamic web-based survival rate calculator to verify prediction accuracy of prognostic model. The above outcomes indicate that our prognostic model has a good predictive ability and excellent predictive accuracy.
We further explored the relationship between tumor-in ltrating immune cells and risk score with seven common acceptable methods of estimating the immunoin ltrating cell, including TIMER [49], CIBERSORT [49], XCELL [50], QUANTISEQ [51], MCPcounter [52], EPIC [53], and CIBERSORT-ABS [54]. The synthetical analysis showed that risk score was more positively related to tumor-in ltrating immune cells. Finally, we evaluated the relationship between risk score and the e cacy of six common anti-tumor drugs the treatment of RCC. The results showed IC50s of six drugs signi cantly inversely correlated to risk score. Here, this study demonstrated that a novel prognostic model constructed by 17 DEMDGs that could predict prognosis for patients with RCC and might help distinguish those who could bene t from anti-tumor immunotherapy. However, our ndings still need to be demonstrated by experimental methods.

Conclusions
To sum up, our study indenti ed 146 DEMDGs in RCC. The consensus clustering based on 146 DEMDGs could be used to predict prognosis of RCC, and the clusters were associated with the immune

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.