3.1 Identification prognostic methylation sites associated with OS
The prognostic molecular subsets were clustered based on a DNA methylation map of BCA samples in the TCGA database. Successfully, a total of 21,122 CpG sites were screened for methylation based on 203 experimental samples in the training group. The univariate Cox proportional hazard regression model was applied to each CpG site in the training group. Here, 1096 important CpG sites that might affect the survival of patients (P < 0.001) were identified. Accordingly, 1096 significant CpG sites were taken into the multivariate Cox proportional hazard regression model to study independent prognostic factors (P < 0.001). Finally, 402 meaningful CpG sites were obtained for further prognostic subgroup analysis (Supplementary Table 1).
3.2 Identification of different DNA methylation prognostic subgroups through consensus clustering and prognostic analyses between clusters
The consensus clustering of 402 potential prognostic methylation sites was performed to identify different DNA methylation molecular subsets and for further prognostic analyses. The optimal class number was obtained by calculating the average cluster consistency and intercluster coefficient of variation of each class number. As displayed by the CDF curve, when the clustering number is 7, the clustering result is relatively stable (Figure 1a). Further observation of the CDF delta area curve shows that the area under the CDF curve starts to remain stable after seven categories, suggesting that when the cluster is selected as 7, the cluster has stable clustering results, and the seven categories were also considered the ideal number of categories for further analysis (Figure 1b). The consensus matrix, as shown in Figure 2a, illustrates the consensus of K = 7 and shows the superlative seven-block structure. Moreover, the heatmaps in accordance with 402 CpGs were generated via the heatmap function based on DNA methylation classification with T category, N category, M category, grade, stage, sex and age (Figure 2b). The Kaplan–Meier diagram illustrated that the prognosis of BCA, defined by consensus clustering based on methylation, had significant differences among the seven clusters (P < 0.05; Figure 3a). According to our results, the prognosis of Clusters 2 and 6 were the best, whereas that of Cluster 7 was the worst. Figure 3b-h indicates the intracluster proportions for the seven clusters on the basis of T category, N category, M category, grade, stage, sex, and age. The association trends between features and specific clusters were as follows: Clusters 1 and 7 were in advanced stage; Clusters 2 had lower T grade; Clusters 7 had higher N grade, younger age, and more male; Cluster 5 was in higher M grade; Clusters 1, 5, and 7 had higher grade.
3.3 Differential feature recognition based on DNA methylation clustering and screening of cluster-specific methylation sites
A total of 557 promoter genes were identified according to the genome annotation of 402 meaningful CpG sites. The most significant GO terms for biological process, cellular component, and molecular function are shown in Figure 4a-c. Next, we analyzed the functional enrichment of these 557 genes and recognized 23 significantly enriched pathways (P < 0.05; Figure 4d). We observed that the three most significantly enriched pathways were peroxisome, base excision repair, and fatty acid biosynthesis pathways.
The expression of methylated genes found in the subgroup were also studied. Expression values were available for 465 of the 557 genes in the training group. We also generated a heatmap for specific annotated genes with methylation sites, as shown in Figure 5, with different gene expression patterns among different subgroups, indicating that the level of DNA methylation usually reflects the expression of these genes.
The cluster-specific methylation sites were screened by using methylation sites as the feature of the clusters. The differences among seven clusters were analyzed for each methylation site. Eighteen cluster-specific methylation sites were identified and are shown in the heatmap in Figure 6. Clusters 1, 2, and 6 had more specific sites, Cluster 1 had the highest level of methylation among all clusters, and Cluster 2 had the lowest methylation level among all clusters (Figure 7).
3.4 Construction and validation of the prognosis prediction model
Cluster 2 was found to contain a great number of samples that were associated with good prognosis and the highest number of specific methylation sites, which made it as seed cluster. Using the formula provided in the Methods section, we constructed a Cox proportional hazard model based on the methylation distribution of three specific sites combined with prognostic information. As shown in Figure 8a, there was a significant difference in prognosis between the two groups. The results of ROC analysis using the risk score calculated for each sample are shown in Figure 8b. The area under the curve (AUC) of the model was 0.788, which was higher than that of other predictive factors, indicates that the function of the model is effective. The samples were divided into high-risk group and low-risk group by median risk score which was used as cutoff value and then generate the risk curve (Figure 8c).
Next, the prognostic model was used to predict the outcomes of patients in the test dataset. The methylation level curves of three CpGs were obtained to test the dataset samples, and the prognostic model was used to calculate the risk score. The test dataset samples were then drawn into high-risk group and low-risk group. It is worth noting that there was a significant difference in prognosis between the two groups (P = 0.00666). It was consistent with the results gained from the training data set, indicating the accuracy of prediction and stability of the model (Figure 9a-c).
3.5 Clinical application of a nomogram
Through univariate and multivariate Cox regression analyses (Figure 10a-b), both our prognostic model and sex were identified as potentially independent predictors, based on which nomogram was constructed (Figure 11a). The C index was 0.77, and the correction chart showed that the actual survival rate of 5 years was in good agreement with the predicted survival rate (Figure 11b). This indicates that the nomogram has high potential for clinical application.