The clinical characteristics of selected patients from TCGA and GEO database
351 patients with methylation data, gene expression data, and complete clinical information were selected from the TCGA as training dataset and another 2 independent datasets of OSC from the GEO database were served as validation datasets (see ‘Methods’ section). All samples were pathologically and clinical diagnosed with OSC. The patients were divided into stage I, II, III, and IV according to FIGO staging system while they were also assigned to G2, G3, and G4 as the tumor histologic grade. Anatomic neoplasm subdivisions were based on different locations of the surgical lesion including left, right, and bilateral. According to the diameter of tumor residual lesions, the patients were divided into no macroscopic disease, 1–10 mm, 11–20 mm, and >20 mm groups. The clinical characteristics of samples are summarized in Table 1.
Screening of differentially methylation-driven genes
Through the comparison of normalized methylation data of 351 OSC samples and 10 normal samples extracted from TCGA based on MethylMix package, a β mixture model was constructed. Subsequently, Mann–Whitney U test for methylation gene and correlation test between methylation level and gene expression were calculated (FDR<0.05, Cor<-0.3, P<0.05). 171 methylation-driven genes were identified in result (see Additional file 2) and the heatmap of differential methylation-driven genes was plotted (Figure 1). Among the 171 differentially methylated genes, 113 genes had higher methylation levels in tumors than normal tissues while 58 genes had lower methylation levels. The first 10 differentially methylated genes with lowest FDR were selected to plot the distribution map based on β mixture model (Figure 2) and correlation scatter diagram (Figure 3).
Functional and pathway enrichment analysis of methylation-driven genes
To further explore the potential mechanism underlying oncogenesis of OSC, functional enrichment analysis and pathway analysis of the 171-differential methylation-driven genes were performed using DAVID and ConsensusPathDB (see ‘Methods’ section). As the result of GO enrichment analysis, these genes were concentrated in 40 biological processes, 14 molecular functions, and 11 cellular components (Figure 4A, Additional file 3). In biological process term, these genes mainly involve inflammatory response, positive regulation of transcription, DNA-templated, positive regulation of NF-κB transcription factor activity and others. In molecular function term, the genes were mainly enriched in transcriptional activator activity, RNA polymerase II core promoter proximal region sequence-specific binding. And in cellular component term, extracellular exosome was enriched the most significantly. Pathway analysis on ConsensusPathDB showed that the genes were mainly involved in Interleukin-18 signaling, Cytokine Signaling in Immune system, and Interleukin-1 family signaling (Figure 4B, Additional file 4).
Construction and validation of risk assessment model based on different regrouping methods
The univariate COX proportional hazard analysis was first carried out for the 171-differential methylation-driven genes in TCGA training dataset and β values of DNA methylation were served as variables to identify the survival related methylation-driven genes. As a result, 9 methylated genes were screened out (P<0.05; Additional file 5). Further, the 9 candidate genes were analyzed using the multivariate COX regression based on Akaike Information Criterion (AIC). Consequently, five methylation-driven genes were singled out as key variables and statistical test on the COX regression coefficient showed that the 5 genes were significantly related to the survival (P<0.05, Additional file 1: Table S1 and Additional file 5). Then DNA methylation level of the 5 genes were weighted by the regression coefficient and a risk formula was finally obtained: Risk Score=1.209 × (methylation level of UBB) - 1.3560 × (methylation level of PLAT) + 2.7718 × (methylation level of TMOD1) + 1.4498 × (methylation level of KCNJ11) - 0.7739 × (methylation level of CDSN). The risk score of 351 tumor samples were then calculated with this formula and the median value of risk score (Median=2.567) was set as a group cutoff condition. Consequently, 351 OSC samples were divided into a high-risk cohort (n=175) and a low-risk cohort (n=176). Meanwhile, we further analyzed the methylation levels as well as the expression of these five key epigenetic genes in the two cohorts (Figure 5A and 5B). Significantly higher methylation levels of PLAT and CDSN while significantly lower methylation levels of other three driven genes were observed in high-risk patients than low-risk patients, which were consistent with the results of COX regression. Based on this grouping method, the Kaplan-Meier survival analysis of high-risk cohort and low-risk cohort showed that the survival rate of the low-risk cohort was significantly higher than that of the high-risk cohort at all time points (Figure 5C). To evaluate the validity of this risk model, a completely independent dataset of 31 OSC patients from GEO database was included analysis and we then stratified the 31 OSC patients into high-risk (n = 14) and low-risk group (n = 17) with the same cutoff condition as previous. The Kaplan-Meier analysis demonstrated that the OSC patients in low-risk group had a significant better overall survival than in high-risk group (P<0.05 with log-rank test; Figure 6A). Further, to confirm that whether the risk model could indicate prognosis independently, the univariate and multivariate COX regression analysis of risk model, age at diagnosis, as well as histological grade were conducted, respectively. As the results showed in Figure 6C and 6D, our risk model as an independently prognostic factor exhibited marginally significance in both univariate and multivariate COX regression analysis. In addition, we also detected the capacity of this model to distinguish the progression-free survival (PFS) of patients in another OSC dataset of 20 patients from GEO database. We used the same formula and cutoff value to calculated risk score of 20 OSC patients and separated them into high-risk (n = 10) and low-risk group (n = 10). The Kaplan-Meier curve showed that there was no significant difference on PFS between high-risk and low-risk group (P=0.73 with log-rank test; Figure 6B), which may be affected by the sample content. It was known to all that patients with OSC in advanced stage (Stage III and IV) had shorter survival time than in early stage (Stage I and II) (11). However, irrespective of the effect of the degree of disease progression, the survival rate was still lower in high-risk cohort obviously in either patients of early stage or advanced stage (Additional file 1: Supplementary Figure 1). The result suggested that the novel risk model was independent of FIGO staging system. We further determined the clinical effect of the risk model in patients with different histologic grade, including G2 and G3, which had enough samples. The Kaplan-Meier analysis demonstrated that the survival rate of high-risk cohort was significantly lower than that of low-risk cohort in both G2 and G3 groups (Additional file 1: Supplementary Figure 2). As for the anatomic neoplasm subdivisions, left side only and right side only were merged as unilateral group, while the opposite is the bilateral group. Similarly, we also observed significant differences in survival between the high-risk and low-risk cohorts (Additional file 1: Supplementary Figure 3). In addition, after classification of patients according to the size of the residual tumor lesions, the result of the Kaplan-Meier analysis in each group was also significant (Additional file 1: Supplementary Figure 4).
The receiver operating characteristic (ROC) analysis of risk model compared with other known prognosis-predict markers
To further estimate the sensitivity and specificity of the risk assessment model consisting of 5 key epigenetic genes in predicting survival, the time-dependent ROC curve was plotted and the AUC of the risk model was 0.71 (95%CI: 0.64–0.78) at 3 years of OS. We also performed the ROC analysis in groups of different FIGO stage, histologic grade, residual tumor after surgery, and anatomic neoplasm subdivision (Additional file 1: Supplementary Figure 1-4). All the results indicated that the risk model based on five key epigenetic genes had satisfied performance in predicting clinical survival. In addition, we chose several genes that had been identified as prognosis predictors in previous researches to verify whether the novel risk model had ascendancy in the predictive performance. For instance, methylation of BRCA1 promoter region and overexpression of Capn4 predict poor ovarian cancer prognosis (19,20). LncRNA HOTAIR was determined as an independent prognostic biomarker of OS (21). The AUC of these genes and risk scoring model were figured out and shown in Figure 5D and Additional file 1: Table S2. And Z test was used to examine whether the risk model had higher predictive value than these genes. The final results showed that our model was superior to many other biomarkers in reliability and stability