Prediction of Prognosis in Patients with Prostate Cancer by DNA Methylation Classication

Background: Prostate cancer is the second most frequently diagnosed cancer and the fth leading cause of cancer-related death. It is estimated that the incidence of prostate cancer is on the rise worldwide. Epigenetic changes in tumors play an important role in the occurrence and development of prostate cancer. DNA methylation is one of the mechanisms of tumor epigenetic regulation and may be a new biomarker that has great potential in early tumor screening, treatment guidance and prognosis prediction. The purpose of this study was to explore a classication method from the perspective of DNA methylation. Methods: The least absolute shrinkage and selection operator (LASSO) method was used to analyze DNA methylation and RNA-seq data from the Cancer Genome Atlas (TCGA). The methylation sites with small differences were eliminated, and the 21 methylation sites with the most signicant differences were retained for analysis. Using their corresponding gene expression levels, a recurrence prediction model for prostate cancer patients was constructed to distinguish high-risk, medium-risk, and low-risk cases. Immune cell abundance analysis, gene enrichment analysis, Tumor burden mutation analysis and gene copy number variation analysis were then used to analyze the differences among these three subtypes and their underlying mechanisms. Results: We observed the difference in disease-free survival (DFS) of the three methylated subtypes in the test set, which was veried in the validation set. We found three subtypes have different proportions of immune cells, especially in memory B cells, M2 macrophages, Treg cells. GSVA and GSEA analysis revealed that the relevant metastasis gene sets of prostate cancer were enriched in high-risk cases. In addition, the mutation frequencies of TP53, TTN and KMT2D were the highest, and gradually increased in the three genotypes according to Tumor burden mutation (TMB) analysis. Gene copy number variation (CNV) showed that AR, LAPTM4B, and MTDH were signicantly amplied, while ATP1B2 and FAM92B were signicantly deleted. Finally, univariate and multivariate analysis showed that there were statistical differences among the three methylation subtypes, which can be used as an index to predict prostate cancer recurrence. Conclusions: Our study suggests that classication based on DNA methylation is an independent factor for predicting tumor recurrence in patients with prostate cancer.

In recent years, epigenetic modi cations were found to play an important role in tumorigenesis. Epigenetic inheritance is de ned as cellular information, other than the DNA sequence itself, that is heritable during cell division. Epigenetics mainly include DNA methylation, genomic imprinting and histone modi cation, the most common of which is DNA methylation [4]. There are two types of DNA methylation: hypermethylation, which inhibits gene expression, and hypomethylation, which activates gene expression. Functionally, tumor suppressor genes are frequently hypermethylated, and oncogenes are hypomethylated. To date, dozens of common abnormally methylated genes have been identi ed in prostate cancer. These genes encompass many cellular functions, including cell cycle control, apoptosis, hormone response, DNA repair and damage prevention, signal transduction, tumor invasion and tumor suppression [5]. Frequent promoter methylation of genes such as glutathione S-transferase π (GSTP1) [6] metastasis suppressor 1 (MTSS1) [7] dedicator of cytokinesis 2 (DOCK2) [8] adenomatous polyposis coli (APC) [9] and retinoic acid receptor beta2(RARβ2) [10] has been found and is associated with the progression of prostate cancer. However, whether the methylation of these genes is related to important clinical factors, such as the degree of tumor malignancy and biochemical recurrence and prognosis, has not been determined from large datasets of patients with prostate cancer. Therefore, this study was designed to explore a classi cation method from the perspective of DNA methylation and evaluate its potential for clinical application to help clinicians evaluate e cacy and prognosis and guide treatment decisions.
Materials And Methods

Data acquisition and data analysis
The DNA methylation data of prostate cancer patients were downloaded from the TCGA database [11], and methylation 450K microarray data, RNA-seq data and related clinical follow-up data of prostate cancer patients were obtained. Only the data with clinical survival information were selected for the analysis of the correlation between methylation level and disease-free survival. Then, the CpG site with NA (not available) and the cross-reaction CpG site in the genome were deleted [12]. In addition, the k-nearest neighbor method in the impute R package for methylation pro les was utilized to impute the missing values, with the further removal of the unstable genomic methylation sites [13]. In other words, the CpG sites and single nucleotide sites from the sex chromosomes were discarded, and the difference between the cancer tissue sample group and the adjacent normal (AN) prostate tissue sample group was analyzed by using the CHAMP package of R software. Speci cally, we obtained measurement data for each CpG site, including methylated signal intensity (denoted by M) and unmethylated signal intensity (denoted by N). Methylation status was expressed as the β value ((β = M/(M + U), ranging from 0 to 1). In total, 215166 differential DNA methylation sites were selected by setting p < 0.01 and p.adjust < 0.01. Finally, 5406 differential DNA methylation sites were obtained by further screening under the conditions of average β value of CpG sites in AN samples < 0.2 and SD (standard deviation) of CpG sites in tumors > 0.2.
The whole exome sequencing (WES) data of 432 samples were downloaded from TCGA. WES data provide a spectrum of somatic mutations in patients with prostate cancer. Tumor mutational burden (TMB), a new biomarker of immunotherapy response, is de ned as the number of nonsynonymous mutations per trillion bases [14]. The difference in total mutational burden in different subtypes was obtained by calculating the number of nonsynonymous mutations in the whole genome and analyzing them by the Kruskal-Wallis method.
The level 4 copy number variation (CNV) pro les of prostate cancer patients were downloaded from TCGA and classi ed according to methylation subtypes. The ampli cation or deletion changes in the whole genome were identi ed by the GISTIC tool, the ampli cation or deletion gene segments were further screened by setting the Q value > 0.25, and gene annotations were carried out to observe the changes in CNV in methylation subtypes.
Complete clinical information of prostate cancer patients in the TCGA cohort, including gleason score, clinical M, pathologic N, pathologic T, biochemical recurrence, treatment outcome and age, was collected and used for follow-up analysis.
The GSE2612 dataset based on the GEO platform contains prostate cancer methylation data, RNA-seq data, clinical follow-up data and complete clinical information and was downloaded and used for subsequent analysis.
2. LASSO regularization, signature construction and methylation subtype classi cation LASSO (least absolute shrinkage and selection operator) is a method of linear model estimation using L1 regularization, [15] which constrains the coe cient estimation to the direction of 0 so that some independent variables are eliminated and sparse feature space is generated. We used the glmnet R software package for LASSO regression analysis. Twenty-one genes are selected to construct the signature, and each coe cient is obtained through the constraint process. The risk score formula is as follows: risk score= , where and are the coe cient of each gene and the expression value of each gene, respectively. The names of the genes and their corresponding coe cients are shown in Table 1. X-tile software was used to calculate the optimal cutoff value and to classify methylation subtypes. The overall survival rates (the percentage of people in a group who were alive after a length of time) and disease-free survival rates (the percentage of people in a group who were disease-free after a length of time) of these subtypes were compared using the Survival R software package based on the log-rank (Mantel-Cox) test.

Analysis of immune cell abundance
The tumor in ltration fractions of various immune cell subtypes were calculated by the CIBERSORT method. In this study, a deconvolution algorithm was utilized to distinguish 22 immune cell subtypes, including dendritic cells, NK cells, neutrophils, eosinophils, mast cells, monocytes, macrophages and several B cell and T cell types in adaptive immunity. ANOVA was performed to evaluate the differences of 22 human immunocyte phenotypes in tumor in ltration.

Functional enrichment analysis of the gene set
To evaluate the potential mechanism of methylation typing in prostate cancer, GSVA analysis and ANOVA were performed to identify the differences in pathways and gene sets in different subtypes of prostate ∑ i ω i χ i ω i χ i cancer (P < 0.01). For cluster 3, which has a poor prognosis, we further carried out GSEA analysis to reveal the mechanism related to poor prognosis.

Other statistical analyses
The ability of factors to predict 1-year and 3-year recurrence rates was tested using ROC curves. The relationship between clinical features and methylation subtypes was analyzed by Fisher's precise test.
Methylation typing and clinical features were analyzed by univariate and multivariate Cox analysis.

Identi cation of differential methylation sites
To identify signi cant DNA methylation sites in prostate cancer, we downloaded the data from methylation 450K chips of 502 prostate cancer (PC) and 50 adjacent normal (AN) prostate tissue samples from TCGA.
The ChAMP package in R software was utilized to analyze the differentially methylated CpG sites between PC and AN samples, and 215,166 methylation sites were screened out with the signi cance thresholds set as p < 0.01 and p.adjust < 0.01 (Table S1). The DNA hypermethylation sites in prostate cancer were further identi ed under the conditions of average β value of CpG sites in AN samples < 0.2 and SD of CpG sites in tumors > 0.2. As a result, we obtained 5406 differential DNA methylation sites for further analysis.

LASSO Cox regression identifying a 21-gene epigenetic signature
The large number of methylation sites is not conducive to clinical application. To lter the CpG sites most relevant for clinical recurrence while maintaining effectiveness, the LASSO Cox model was applied to the gene expression pro le corresponding to the 5406 CpG sites. The LASSO algorithm is a method of linear model estimation using L1 regularization. It constrains the coe cient estimate toward 0 so that some independent variables are eliminated, thereby achieving the purpose of simplifying the model and selecting characteristic factors. In this study, the glmnet R software package was used for LASSO regression analysis. The coe cient trajectory of each independent variable is shown in Fig. 1a. As the penalty term coe cient λ increases, the independent variable coe cients are gradually constrained. A 10-cross validation was used to analyze the con dence interval (Fig. 1b) at each lambda value. To minimize the error, we found the best penalty term coe cient, which contained 21 genes (Table 1).

Risk classi cation of recurrence based on epigenetic signature
The coe cients and gene expression values of 21 genes were used to calculate the risk score of each sample (formula in the method section) to construct a recurrence-related model. To obtain DNA methylation subtypes, X-tile software was used to calculate the best cutoff value and classify patients into three groups: low-risk group, medium-risk group and high-risk group (in the subsequent gures, respectively denoted by cluster 1, cluster 2 and cluster 3) (Fig. 2a). The methylation difference heat map of the three groups is shown in Fig. 2b. To determine the effect of this classi cation method, the survival R software package was used to draw the disease-free survival (DFS) curve, which showed the clinical recurrence probability of different groups. We found that the high-risk group was the most likely to relapse, and the low-risk group was the least likely to relapse (p < 0.0001) (Fig. 2c). We used the same method to draw the overall survival curves, and we found that the survival time of the high-risk group was signi cantly shorter than that of the low-risk group (p = 0.036) (Fig. 2d).
In addition, we found that there were signi cant differences in the expression of androgen receptors (AR) among different subtypes (p = 0.013 < 0.05) (Fig. 2e). Pairwise comparative analysis showed that there were signi cant differences in the expression of AR between cluster 1 and cluster 3(p = 0.04 < 0.05), cluster 2 and cluster 3(p = 0.036 < 0.05). It means methylation classi cation was highly associated with the expression of AR, and the highest expression of AR was found in cluster 3 (high-risk group) among the three groups.

Differences in immune cell abundance among methylation subtypes
To observe the immunological characteristics of methylation subtypes, the CIBERSORT method was used to analyze the degree of in ltration of immune cells in tumor tissue samples. We found that samples with the three different methylation subtypes had different proportions of immune cells, especially memory B cells, naïve B cells, M2 macrophages, Treg cells, γ/δ T cells, plasma cells and mast cells (Fig. 3).No signi cant difference was found in dendritic cells, NK cells, neutrophils, eosinophils, monocytes or other types of T and B cells (Fig S1). For the cluster 3 group which has a poor prognosis, the proportion of memory B cells, M2 macrophages and Treg cells was signi cantly higher than that of cluster 1 and cluster 2. This is consistent with previous studies [16][17][18]. M2 macrophages and Treg cells are immunosuppressive cells that inhibit immune activation in tumors and promote the development of tumors. This indicates that the cluster 3 subtype has a unique immunophenotype, which is characterized by more immunosuppressive cell in ltration and lower immune activation. Different immunophenotypes of methylation subtypes may be the potential mechanism of different prognoses.

Difference analysis of pathways
To determine the biological processes related to these three types, we analyzed the TCGA-PRAD data by GSVA and generated a heat map (Fig. 4a). The results showed that among the three subtypes, some gene sets gradually increased, such as "REACTOME_SUMOYLATION_OF_DNA_METHYLATION_PROTEINS", which affected the level of DNA methylation, while other gene sets gradually decreased, such as "CHANDRAN_METASTASIS_TOP50_DN", which was a group of genes downregulated in metastatic prostate cancer. This gene set was highly expressed in cluster 1 and decreased gradually in cluster 2 and cluster 3. In line with the GSVA results, GSEA analysis showed that the gene set of " RAMASWAMY_METASTASIS_UP" was enrich while the gene sets of "CHANDRAN_METASTASIS_DN" and "ELLWOOD_MYC_TARGETS_DN" were down-regulated in patients of cluster 3 group (Fig. 4b-d). These results indicated the association between DNA methylation subtypes and tumor metastasis. In general, the aforementioned biological process may play an important role in the role of DNA methylation in the progression of prostate cancer.

Difference analysis of tumor mutational burden
Tumor burden mutation (TMB) is de ned as the total number of nonsynonymous mutations per megabase.
To investigate the differences among the three types of prostate cancer. Data from 432 samples, those with both WES data and the DNA methylation data, were analyzed, and the results showed that TMB was signi cantly different among the three types (Fig. 5a). The TMB increased gradually with risk (cluster 1 < cluster 2 < cluster 3), suggesting that the risk score was positively correlated with TMB. We further analyzed the mutation burden of different genes in different subtypes, and we found that the mutation frequencies of tumor protein p53 (TP53), titin (TTN) and lysine methyltransferase 2D (KMT2D) were the highest and gradually increased in the three genotypes (Fig. 5b).

CNV analysis
We next investigated gene copy number variation (CNV) in prostate cancer. The ratio of tumor tissue to normal tissue in each gene segment was obtained by analyzing the CNV probe in the Affy SNP6.0 chip. The GISTIC tool was utilized to further calculate and screen out regions that were signi cantly ampli ed and signi cantly deleted (Fig S2). Gene segments with Q values > 0.25 were selected for gene annotation to observe the CNV changes of prostate cancer in different subtypes (Fig. 5c) (see Tables S2 and S3 for genes contained in each chromosome segment). The results showed that the CNV ampli cation or deletion of genes was positively correlated with the risk score. For cluster 3 patients with poor prognosis, AR, lysosomal protein transmembrane 4 beta (LAPTM4B), and metadherin (MTDH) were signi cantly ampli ed, while ATPase Na+/K + transporting subunit beta 2 (ATP1B2) and family with sequence similarity 92, member B (FAM92B) were signi cantly deleted.

Evaluation and validation of models
To evaluate the methylation model we devised, the 1-year and 3-year boundaries were set to observe the clinical recurrence of patients, and the ROC curve was drawn to evaluate its effectiveness in predicting recurrence (Fig. 6a). The accuracy of the model for predicting 1 year recurrence was slightly higher than that of for predicting 3-year recurrence. In general, the methylation model has good sensitivity and speci city for predicting whether patients will relapse within 1 or 3 years.
We used Fisher's exact test to analyze the association between clinical characteristics and methylation subtypes ( Table 2). Clinical characteristics in 3 subtypes, including "Gleason score", "Clinical M", "Pathologic N", "Pathologic T", and "Biochemical recurrence", were found to be signi cantly different, while "Treatment outcome" was not. In addition, we performed univariate and multivariate relapse analyses of methylation subtypes and clinical characteristics (Table 3). Univariate analysis showed that "DNA methylation subtypes", "Gleason score", "Biochemical recurrence", "Treatment outcome", "Pathologic N", and "Pathologic T" all had signi cant differences, while multivariate analysis showed that "DNA methylation subtypes" had signi cant differences among relapsed and nonrelapsed patients. These data indicate that methylation typing can be used as an index to predict the recurrence of prostate cancer.  Finally, the GSE26126 dataset in the GEO database was used to validate the model we built. We used 21 identi ed markers to calculate risk scores and perform recurrence analysis. The results showed that DNA methylation subtypes have signi cant differences in the probability of clinical recurrence (Fig. 6b). We found that the high-risk group (cluster 3) was the most likely to relapse, and the low-risk group (cluster 1) was the least likely to relapse (p < 0.0001), which is consistent with our previous results. Similarly, we performed univariate and multivariate relapse analyses using methylation subtypes and clinical characteristics ( Table 4). The results showed that there were signi cant differences in DNA methylation subtypes in both the univariate analysis and the multivariate analysis. It turned out methylation typing is a universal index for predicting the recurrence of prostate cancer.

Discussion
Epigenetics is a heritable phenotypic change that occurs without changes in the genomic DNA sequence. In small-cell lung cancer and retinoic acid receptor beta (RARB) methylation in prostate cancer [21]. RASSF1 has been characterized as a tumor suppressor gene because it has the potential to inhibit cell proliferation, control the cell cycle and promote apoptosis. The MGMT protein has shown DNA repair ability because it prevents the formation of DNA adducts of carcinogens. RARB has been shown to function as a tumor suppressor gene, and inactivation of its promoter via DNA methylation is associated with lung, breast and prostate cancer. In addition to the hypermethylation of the RARB gene, it was recently reported that Kruppellike factor 4 (KLF4) is also involved in the biological process of prostate cancer. As a new tumor suppressor gene, KLF4 has a hypermethylated promoter sequence, and its gene expression is downregulated in prostate cancer, which promotes tumor proliferation and drug resistance. Downregulation of KLF4 or hypermethylation of KLF4 gene promoters can be considered markers for predicting the early recurrence of cancer [22,23]. GSTP1 is functionally inactivated by promoter methylation, which increases the sensitivity to oxidative stress and increases the risk of prostate cancer progression [24]. Studies have shown that GSTP1 can be used as a biomarker of uid biopsy in prostate cancer, but the results show that it has high speci city and low sensitivity [25]. A post hoc analysis of data from a Phase 3 trial showed that the detection of serum free methylated GSTP1 (mGSTP1) DNA can predict the outcome of chemotherapy in metastatic prostate cancer and may guide treatment decisions in the clinic [26]. In addition, methylation of other genes, including olfactory 4 (OLFM4) [27], MTSS1 [7] and myeloid ecotropic viral insertion site 2 (MEIS2) [28], also plays an important role in the occurrence and development of prostate cancer. The methylation pattern of prostate cancer can be used as a potential reasonable choice for disease classi cation and prognosis evaluation. At present, DNA methylation sites related to the prognosis of prostate cancer have been found, including GPR7 (known as neuropeptides B and W receptor 1, NPBWR1), ABHD9 (known as epoxide hydrolase 3, EPHX3), an expressed sequence tag on chromosome 3 (Chr3-EST) [29], paired like homeodomain 2 (PITX2) [30] and zinc nger protein 154 (ZNF154) [31]. The clinical signi cance of these genes with different methylation patterns in tumor classi cation, survival and prognosis has yet to be examined in a large prostate cancer dataset. In this study, we tried to explore a classi cation method that integrates several DNA methylation markers to help clinicians evaluate the therapeutic effect and prognosis of prostate cancer patients and choose treatment strategies.
In general, we used a large dataset from the TCGA and GEO databases to screen methylation sites at the prognostic level by means of bioinformatics. We used the LASSO algorithm to eliminate the methylation sites with small differences; we retained the 21 methylation sites with the most signi cant differences, and used their corresponding gene expression data to construct a recurrence prediction model for patients with prostate cancer. We observed methylation models in terms of immunity, pathway, and genomic variation.
Flammiger et al [17] detected Treg cells in prostate cancer tissues by FOXP3 immunohistochemistry and found that prostate-speci c antigen (PSA) recurrence-free survival was reduced in patients with high levels of Treg cells, which was associated with more advanced tumor stages and a higher Ki67 labeling index.
Erlandsson et al [16] found with a good prognosis and an enhanced response to chemotherapy [32]. WooJR et al [18] found that the degree of B lymphocyte in ltration in prostate cancer tissues was higher than that in adjacent tissues, and the degree of B lymphocyte in ltration in patients with a high risk of recurrence was higher than that in patients with a low risk of recurrence. The role of B cell immunity in prostate cancer is unclear, which indicates opportunities for research and potential treatment.
According to the gene enrichment analysis, we determined that the related metastatic gene sets of prostate cancer are enriched in cluster 3 subtypes, and it can be inferred that the genes that distinguish subtypes play an important role in the metastasis of prostate cancer. In addition, our results suggest that the SUMOylation of DNA methyltransferase may play a vital role in tumor metastasis.
Our results show that patients in cluster 3 have a higher burden of somatic mutations in TP53, TTN and KMT2D. p53 is a tumor suppressor that responds to DNA damage [33] by regulating the cell cycle checkpoint pathway and tends to mutate in advanced, recurrent and metastatic prostate cancer [34].
Missense mutation of titin (TTN) is correlated with favorable prognosis in lung squamous cell carcinoma [35] In multiple tumor cohorts, the progression-free survival time or overall survival time of patients with TTN mutations is higher than that of patients with wild-type TTN, which is not consistent with our results.
The role of TTN in prostate cancer needs to be further determined. KMT2D is a histone H3 lysine 4 (H3K4) methyltransferase that inhibits the growth and metastasis of bladder cancer cells by regulating tumor suppressor genes such as TP53 [36]. Our study provides new ideas regarding the potential mechanism by which the high TP53, TTN and KMT2D mutation burden in cluster 3 subtypes may promote the development of prostate cancer through increased DNA damage and dysregulation of tumor suppressor genes.
Androgen receptors are members of the nuclear receptor superfamily that play a key role in prostate cancer.
The upregulation of AR may be the cause of the development of prostate cancer into castration-resistant prostate cancer and the related poor prognosis [37]. Studies show that metadherin (MTDH) plays a role in the progression of various tumors [38]. In prostate cancer, MTDH acts as an oncogene, regulating the growth and metastasis of prostate cancer cells [39].
Finally, the effectiveness and universality of the methylation model were veri ed and evaluated. Our study shows that classi cation based on DNA methylation is an independent factor for predicting tumor recurrence in patients with prostate cancer. Our model can provide clinicians with a judgment of prognosis in patients with prostate cancer and provide them with guidance for choosing treatment strategies. However, our research has some limitations. Availability of data and materials The dataset used and/or analyzed during the current study are available from the TCGA network, and from Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/, GEO accessions: GSE2612).

Figure 1
a. The changing trajectory of each independent variable. The horizontal axis represents the log value of the independent variable lambda, and the vertical axis represents the coe cient of the independent variable. b.
Con dence intervals for each lambda. The dotted line on the left represents the best penalty coe cient.