Mining of Subtype Markers for the Prognosis of Ovarian Cancer based on Methylation Data

Aims: Ovarian cancer is one of three major malignancies involving the female reproductive system, and its morbidity and mortality are ranked number 3 and number 1 among gynecological tumors, respectively. DNA methylation (MET), as one of the main epigenetic modes, is closely related to the occurrence and development of ovarian cancer. To guide individualized treatment and improve the prognosis in ovarian cancer patients, it is of great significance to elucidate effective MET subtype markers. Methods: A total of 571 ovarian cancer MET samples were downloaded from the Cancer Genome Atlas (TCGA), and a COX proportional hazards model was established using the MET spectrum and clinically pathological parameters. Subsequently, the consensus clustering of CpG loci with a significant difference in both univariate and multivariate analyses was performed to screen the molecular subtypes, and these CpG loci were subjected to gene function annotation. Finally, CpG MET loci associated with poor prognosis in ovarian cancer patients were further screened by constructing a weighted gene co-expression network analysis (WGCNA). Results: A total of 250 prognosis-related MET loci were obtained by COX regression and 6 molecular subtypes were screened by clustering. There was a remarkable MET difference between most subtypes, of which Cluster 2 had the highest MET level and demonstrated the best prognosis in patients, while Cluster 4 and Cluster 5 had a MET level significantly lower than that of the other subtypes and demonstrated a very poor prognosis. All Cluster 5 samples were at a high grade, while the percentage of Stage IV samples in Cluster 4 was evidently greater than that in the other subtypes. Using the co-expression network, 5 CpG loci were eventually obtained: cg27625732, cg00431050, cg22197830, cg03152385, and cg22809047. The clustering analysis shows that the prognosis in patients with hypomethylation was significantly worse than that in patients with hypermethylation.

4 epithelial ovarian cancer mainly includes high-grade serous carcinoma, high-grade endometrioid carcinoma, and undifferentiated carcinoma, and most cases have rapid onset and poor prognosis. Type I epithelial ovarian cancer is significantly correlated with mutations in BRAF, KRAS, and PTEN, while type II epithelial ovarian cancer is associated with p53 mutations and also very frequently with BRCA1/2 mutations. The dualistic theory reflects different biological behaviors and clinical prognoses of tumors, and such differences are especially remarkable between low-grade and high-grade serous ovarian carcinomas. However, application of the dualistic theory in non-serous ovarian carcinoma is limited; for instance, clear cell carcinoma has many biological behaviors similar to type II epithelial ovarian cancer, despite belonging to type I epithelial ovarian cancer. It is of profound significance to realize precision molecular typing of ovarian cancer for clinical treatment and prognosis monitoring 7 .
As a result of the continuous improvement in human genome sequencing technology and the advancement of biomedical analysis technology, a new trend has been formed that involves molecular targeted therapy and prognosis evaluation based on the molecular typing of malignant tumors. Molecular-targeted therapy has been successfully applied in several tumors, including ER (+) or HER2(+) breast cancer and EGFR-mutated lung cancer 8− 11 , presenting great progress in precision medical treatment. By K-means clustering, Tothill et al. 12 detected the gene expression spectrum of 285 cases of endometrioid and serous tumors originating from the ovary, peritoneum, and tuba uterine, finally identified 6 molecular subtypes, of which 4 (high interstitium-reactive, high immunity, hypomethylation trix-reactive, and interstitial low immunity) are features of high-grade serous ovarian carcinoma and can be used for predicting prognosis in these patients.
Moreover, the clinical trial completed by Kommoss et al. showed that in high-grade serous ovarian carcinoma patients with proliferative and interstitial molecular subtypes, bevacizumab can improve the progression-free survival to different degrees; however, this study was limited to the molecular typing of the gene expression spectrum 14 .
Despite the impact of gene changes on cancer occurrence, epigenetic changes such as DNA MET also play an important role. Epigenetic inheritance refers to the hereditary changes that occur under the precondition of no changes in DNA sequence, such as histone modification, DNA MET, RNA editing, and gene silencing. The occurrence and growth of ovarian cancer involve several pathways including DNA repair, cell apoptosis, cell cycle regulation, and changes in protooncogenes tumor suppressor genes. Studies have suggested that epigenetic changes in these pathways play essential roles in the development of ovarian cancer, and the detection of MET signals is helpful for early diagnosis 15,16 . DNA MET mainly occurs in CpG islands; CpG expression can be inhibited by the hypermethylation of tumor suppressor gene promoters and enhanced by the decreased demethylation probability of protooncogenes. Different regulatory effects of protooncogenes and tumor suppressor genes contribute to the occurrence of cancer 17− 20 . According to relevant studies, the tumor suppressor gene involved in ovarian cancer exists in a hypermethylated state, and changes in its MET level is an important molecular foundation for cancer occurrence 21,22 . Hu WL et al., 23 built a DNA MET interaction network for ovarian cancer, breast cancer, and glioma, confirming that the number of DNA MET loci was associated with prognosis in cancer patients; however, no DNA MET molecular typing of ovarian cancer was performed.
In the present study, the univariate and multivariate COX proportional hazards models 6 were established by analyzing Illumina Infinium® Human Methylation 27K data in the TCGA database and combining the MET level and clinical data of the samples.
Subsequently, 6 molecular subtypes associated with the prognosis in ovarian cancer patients were screened by consensus clustering of MET spectra with a significant difference in both models, and the ovarian cancer patients were classified using these subtypes. Further, 5 CpG hypomethylation loci related to poor prognosis in ovarian cancer patients were obtained by constructing a WGCNA co-expression network. These loci are of great significance for the clarification of the pathogenesis of ovarian cancer, and can be used as effective tumor markers to provide a reference for clinical prognosis and  (Illumina 27K) microarray was acquired from UCSC Cancer Browser. Samples with complete clinical data and methylation spectrum data were selected. CpG loci with NA (Not Available) > 70% in all samples were deleted. The impute-KNN of R package was used to fill the missing value of methylation spectrum. The unstable genomic methylation loci were further removed, involving CpGs and single nucleotide loci on sex chromosomes as well as CpG loci that were not annotated to the gene promoter region 25 . We divided the datasets into two queues: a training set coupled with a test set. The standards for the subgroups included: 1) Samples were assigned to the training set and the test set randomly; and 2) the data of the two groups should be similar, including age distribution, clinical stage, follow-up time, 7 and mortality ratio.

Univariate survival analysis of MET loci in the training set
The research objective was to determine the molecular subtypes of ovarian carcinoma as prognostic determinants. Therefore, CpG loci, which had an important impact on survival, were utilized as a classification feature. First and foremost, a univariate COX proportional hazard model was established based on the methylation level of each CpG loci, age, tumor grade and stage, coupled with survival data by coxph function of the R-package survival.
Subsequently, we introduced the significant CpG loci obtained from the univariate model into the multivariate COX proportional hazards model, and took the significant age and clinical attributes in the univariate model as covariables. Ultimately, the CpG loci, which were still significant, were employed as classification features [25]. For each CpG island, the multivariate COX proportional hazards model formula was described below: h((t, x)i = h0(t)exp(β methy methyi + β age age + β stage stage) (1) In the formula, "methy i " is the carrier of the CpG locus methylation level in the sample.
"Age" and "stage" describe the age and clinical characteristics of the patients, respectively. "β m ethy ", "β age ", and "β stage " are regression coefficients. The P-values of the COX regression coefficient was adjusted by Benjamini − Hochberg error detection rate.
Various comparing processes were carried out.
3. Screening of molecular subtypes by the consensus clustering of methylation profile with a significant difference in both univariate and multivariate analyses Consensus ClusterPlus in the R package 26 was utilized for consensus clustering according to the method described by Zhang et al. 25 The subgroups of epithelial ovarian tumors were identified based on the most variable CpG loci. The algorithm is described as follows. 8 First, double sampling of some items and features from the data matrix was conducted, in which each sub-sample was divided into several groups (max.) using a user-specific clustering algorithm (k-means, hierarchical clustering, or custom algorithms). The paired consensus value (defined as the proportion of clustering running for the combination of two items) was calculated and stored in the k i consensus matrix. Second, the final coherent sheaf clustering for each k i was completed using the distance of 1-consensus value and pruned into k i group through cutting, which is known as consensus clustering.
The algorithm determined the "consensus" clustering by measuring the stability of the clustering results applied to random data subsets from given clustering methods. In each iteration, 80% of the tumors were sampled, and a k-means algorithm with the Euclidean squared distance measures were utilized: There was k = 2 − 10 groups, and these results were compiled for 100 times. The cluster consensus as well as item consensus results were obtained with Consensus ClusterPlus.R Where SD is the standard deviation, while MN is the average value of the samples. We 9 selected category number as the area under the CDF curve, and there was no significant change. The consensus clustering heatmap was generated using the R package pheatmap. 4. Clustering analysis of the methylation expression profile and analysis of the clinical characteristics of screened molecular subtypes The stable clustering results were selected and the methylation profile was analyzed by clustering analysis. The distance between the MET loci was calculated using the Euclidean distance. Furthermore, the distribution of various molecular subtype samples was analyzed with respect to prognosis, stage, grade, and age.

Gene annotation of MET loci
As for the genes corresponding to the gene promoter regions annotated by the selected CpG loci, the transcription factor enrichment analysis was performed by the online tool g:profiler 27 .

WGCNA co-expression analysis of CpG loci
Based on the modification beta value of selected CpG loci, the co-expressed CpG loci were mined by WGCNA co-expression algorithm. The distance between CpG loci was calculated using Pearson Correlation Coefficient. The R package WGCNA was used to construct weighted co-expression network, and select a soft threshold of 4 to filter the CpG coexpression modules. The results showed that the co-expression network conformed to the scale-free network. That is to say, the log(k) of node k presented in the connection is negatively correlated with the log(P(k)) of the probability of node k, with a correlation coefficient larger than 0.8. In order to ensure a scale-free network, we chose β = 4. The next step was to convert the expression matrix into an adjacency matrix, and then transform the adjacency matrix into a topological matrix. Based on TOM, we utilized average-linkage hierarchical clustering method to cluster genes. The minimum number of genes in each IncRNA network module was set at 30 according to the standard merged dynamic tree cutting. After determining the gene modules with the dynamic cutting method, we calculated eigengenes of each module in turn. The modules were clustered, and the adjacent modules were merged into new modules. 7. Construction of prognosis models and data validation of independent test set Unsupervised clustering analysis was conducted on the CpG methylation profile selected in the previous step. The similarity between samples was calculated by using the Euclidean distance. The samples were then divided into two groups according to the methylation level of CpG loci. The prognosis differences between the two groups were further analyzed. The methylation profile of 286 samples in the test set were used for validation.

Selection of 250 characteristic MET loci
The Illumina Infinium® Human Methylation 27 BeadChip microarray contained 613 samples in total, with 571 samples being screened using MET detection. The missing data imputation of the MET spectrum was performed using Impute in the R software package, and 25,154 MET loci were selected following exclusion of the unstable genomic MET loci.
All 571 samples were assigned to either a training set (n = 285) or a validation set (n = 286). The clinical pathological information in the training set and validation test is shown in Table 1. Table 1 The clinical pathological information in the training set and validation test. The MET loci and survival data were analyzed using a univariate COX proportional hazards regression model with p < 0.05 as the threshold. A total of 967 loci demonstrated a significant difference in prognosis (Table S1), of which the top 20 loci with the most significant difference are shown in Table 2. The prognostic significance of age had a logrank P value of 5.93e-06, while that of stage 12 was 0.0379. The significant MET loci were selected using a univariate COX model followed by multivariate COX proportional hazards regression model analysis, with stage and age as covariates. Finally 250 significant MET loci were obtained (Table S2).

Screening 6 molecular subtypes by consensus clustering of the MET loci
Consensus clustering of MET loci with a significant difference in both univariate and multivariate analyses was performed using Consensus ClusterPlus in the R package to screen the molecular subtypes. The similarity between samples was calculated using the Euclidean distance, the clustering was performed with k-means, and 80% sampling was conducted 100 times using a double-sampling method. The optimal cluster number was determined by CDF; different colors in the CDF curve represent different cluster numbers ( Fig. 1A), the AUC was larger at 6 and 7 clusters and the clustering effect was better.
Further observation of the CDF delta area curve (Fig. 1B) shows that at 6 clusters, the AUC demonstrates stable clustering results. Following consideration, k = 6 was selected and 6 molecular subtypes were obtained.
3. Clustering analysis of the MET expression spectra of 6 molecular subtypes The composition and number of samples in the 6 clusters were evaluated using the consensus matrix. The color gradient was from white to blue, indicating the consensus of progression. In the matrix permutation, the same clusters were made mutually adjacent; and eventually, a color-coded heat map was displayed and featured by the arrangement of dark blue blocks on a diagonal white background ( Fig. 2A), showing that 285 tumor samples were assigned to these 6 clusters. Furthermore, clustering analysis was performed on 250 MET spectra; the distance between MET loci was calculated using the Euclidean distance, and the heat map was generated by pheatmap using clinical 13 pathological stage and histological type as notes (Fig. 2B).
The pairwise comparison of various subtypes was performed using a t-test, and the results revealed that most MET loci had a low beta value. There was a significant difference in MET level among the majority of subtypes; the MET level in Cluster 2 was remarkably higher than that in the other 5 subtypes, while the MET levels in Cluster 4 and Cluster 5 were evidently lower than those in the other subtypes (Table S3).

Analysis of the clinical characteristics of the 6 molecular subtypes
We further analyzed the distribution of the 6 molecular subtypes with respect to prognosis, stage, grade, and age (Fig. 3). There was a significant difference in the prognosis among the 6 subtypes; the prognosis was best in Cluster 2, worst in Cluster 5, and poor in Cluster 4 (Fig. 3A), indicating that the prognosis of hypomethylation subtypes was inferior to that of hypermethylation subtypes. The samples in Cluster 5 were all stage III, and the percentage of stage IV samples in Cluster 4 was significantly higher than that in the other subtypes (Fig. 3B). All samples in Cluster 5 were G3 (Fig. 3C, Table S4), suggesting that the hypomethylation subtypes were mostly high-grade in clinical pathology. The age of the patients of samples in Cluster 5 was remarkably greater than that of the patients of the other subtypes, and the onset age was 70 − 80 years old (Table   S5), while the mean age of the patients of samples in Cluster 2 was the lowest (Fig. 3D), indicating that the age of patients with hypomethylation subtypes was generally higher than that of patients with hypermethylation subtypes. The above findings suggest, to a certain degree, that these DNA MET subtypes could be used to predict the prognosis, tumor stage, and pathological grade in ovarian cancer patients.

Gene annotation and function analysis of the 250 MET loci
There was a total of 285 genes corresponding to the gene promoter regions annotated by 14 250 CpG loci, and these genes were subjected to transcription factor enrichment analysis using the online tool g:profiler. It was found that 42 genes were significantly enriched to Transcription Factor EC (TFEC) (logrank P value = 0.0107) (Fig. 4A). At present, the role of TFEC in cancer progression has been studied to a limited extent; thus, to further explore the biological functions in which TFEC may be involved, TFEC-co-expressed molecules in the cBioPortal database were elucidated. Subsequently, the top 300 molecules with the most positive and negative correlations according to Spearman's correlation were selected. Functional enrichment analysis was performed using the DAVID 6.7 database and TFEC may promote the occurrence and progression of ovarian cancer by influencing these biological functions (Fig. 4B).

Screening 5 CpG loci by WGCNA co-expression analysis
Using the WGCNA co-expression algorithm, 250 significant CpG loci were mined.
Evaluation of the scale-free model was performed at different soft thresholds; a larger value and lower mean connectivity both indicated better compliance with the scale-free distribution. Finally, β = 4 (Fig. 5A, B) was selected, and the settings of height = 0.25, deepSplit = 3, minModuleSize = 10 were chosen. A total of 7 modules were obtained ( Fig. 5C), of which the grey module is the set of genes that could not be clustered to other modules. The statistics of genes in various modules are shown in Table 3. The 250 CpG loci were assigned to 7 modules. Pearson's correlation coefficient between the ME of each module and the characteristics of the samples was calculated; a higher correlation coefficient indicates that the module was more important. In Fig. 5D, the row represents the eigengenes of each module and the column represents the feature information of the samples. The greatest correlation can be seen between the yellow module and Cluster 2 (R = 0.68, logrank P = 7e-40), the brown module and Cluster 3 (R = 0.51, logrank P = 1e- 20), and the black module and Cluster 5 (R = 0.61, logrank P = 6e-30). Since Cluster 2 demonstrated the best prognosis among all the clusters, all CpG loci in the yellow module, mostly correlated with Cluster 2, were selected, and the interaction network was constructed according to their weighted relationship (Fig. 6A). In this network, the CpG loci with a network centrality > 10 were cg27625732, cg00431050, cg22197830, cg03152385, and cg22809047. Furthermore, the expression relationships among the 22 CpG loci were calculated, and a significantly higher correlation was found among 8 (cg27625732, cg00431050, cg22197830, cg03152385, cg22809047, cg00328227, cg06851207, and cg01777397) (Fig. 6B). Finally, a total of 5 CpG loci in the intersection were chosen, which had a strong correlation between each other and a centrality > 10 in the weighted network, as the characteristic MET loci of Cluster 2 samples (Table 4).

Clustering analysis of the 5 CpG loci
Unsupervised clustering analysis was further performed on the MET spectra of the 5 selected CpG loci, and the similarity between samples was calculated using the Euclidean distance. Figure 7A shows that the samples were divided into two groups according to the MET level of the 5 CpG loci: Cluster 1 (hypomethylation group) and Cluster 2 (hypermethylation group). The prognosis difference between these two groups was further analyzed (Fig. 7B), indicating that the prognosis in the hypomethylation group was significantly poorer than that in the hypermethylation group.

Model validation using the test dataset
The MET spectra of the 5 CpG loci from the 286 samples in the test dataset were extracted and analyzed by hierarchical clustering (Fig. 8B). The results show that the MET spectra of the 5 CpG loci were also obviously clustered into two groups: Cluster 1 and Cluster 2. The MET level of Cluster 1 samples was significantly higher than that of Cluster 2 samples. The prognosis difference between Cluster 1 and Cluster 2 (Fig. 8B) was further analyzed, and it was found that the prognosis in the hypermethylation group was remarkably better than that in the hypomethylation group, which is consistent with the results of the training set.

Flow chart of all the analysis
The flow chart of mining of subtype markers for the prognosis of ovarian cancer based on methylation data is shown in Fig. 9. All the R package covered in this article is listed in Table S6. In the present study, 571 ovarian cancer MET samples were downloaded from the TCGA database, 250 MET loci related to the prognosis of ovarian cancer patients were screened by COX regression analysis, and 6 molecular subtypes were selected by clustering with kmeans. There was a significant difference in MET loci among most subtypes; the highest MET level and the best prognosis were observed in Cluster 2, and the MET level in Cluster 4 and Cluster 5 was remarkably lower than that in the other subtypes, accompanied by a very poor prognosis. This suggests, to a certain degree, that the prognosis of patients with a hypomethylation subtype was worse than that of patients with a hypermethylation subtype. All samples in Cluster 5 were high-grade, and the mean age of patients in Cluster 5 was higher than that in the other subtypes. The percentage of stage IV samples in Cluster 4 was significantly greater than that in the other subtypes. The above findings suggest that these molecular subtypes can be used not only to evaluate the prognosis in ovarian cancer patients, but also to fully distinguish the tumor stage, histological grade, and age of these patients to guide subsequent treatment.

Discussion
DNA MET molecular typing also plays a very important role in the diagnosis, treatment, and prognosis of other tumors. Zhang et al. 25  In the present study, the significance of MET in the molecular typing of ovarian cancer was analyzed using MET data, and the markers of subtypes closely related to the prognosis 21 prediction of ovarian cancer were further screened. A MET data-based ovarian cancer prognosis prediction model was subsequently developed to provide a reference for clinical trials and researchers. In summary, the study by Zhang et al., and our study have different focal points, despite both involving molecular typing.
Subtyping of ovarian carcinomas based on methylation profiles has been reported in a TCGA seminal article 13 , in which 4 subtypes were identified to be significantly associated with differences in age, BRCA inactivation events, and survival based on consensus clustering of variable DNA methylation. The cluster associated with the worst prognosis is characterized by hypomethylation and is associated with old age, which is in accordance with the present findings; however, our approach is different from that in the aforementioned TCGA paper.
Firstly, the samples included in the TCGA paper were 489 cases of high-grade serous ovarian cancer, while the present paper included 571 cases of methylated ovarian cancer, including different clinical stages and grades. Our sample size is larger, and the results are more abundant. Secondly, a multivariate COX proportional hazards model was performed to elucidate that 250 CpG loci were significant predictors of prognosis, and 6 molecular subtypes were clustered based on the methylation level at these 250 CpG loci.
The cluster that was characterized by hypomethylation was associated with a worse prognosis, stage, and grade, and an older patient age. Thirdly, weighted gene coexpression network analysis was further applied to identify the 5 most significant CpG loci, and hypomethylation of these 5 loci was demonstrated to be associated with a worse outcome.

Conclusion
We identified 6 different molecular subtypes using ovarian cancer MET data in the TCGA

Consent for publication
Not applicable.

Availability of data and materials
All data during this study are included within this published article and additional files.
Any material described in the article can be requested directly from corresponding author on reasonable request.

Declaration of Interests
The authors declare no competing financial interests

Author Contributions
Qing Yang designed and conceptualized this study; Lili Yin and Ningning Zhang were major contributors in experiment; Lili Yin analyzed the data; Lili Yin, and Qing Yang were major contributors in writing the manuscript. All authors read and approved the final manuscript.   Red to green represents a high to low correlation coefficient. The digit in each grid indicates the correlation coefficient between gene modules and the corresponding features, and the digit in the bracket represents the P value.