Gene Co-Expression Network Analysis Conrms Four Independent Predictors of Survival In Ovarian Cancer Patients Under Optimal Debulking Surgery

Background: Serious ovarian cancer (OvCa) is the most common histological type of epithelial OvCa with poor prognosis. Despite received optimal cytoreduction and standard chemotherapy, a large proportion of patients are forced to recurrence or death within three years. To identify exact prognostic biomarkers associated with overall survival (OS) is urgent requirements of exploring rapid tumor progression mechanisms and developing novel strategies for immunotherapy. Methods: The gene expression proles of GSE49997, GSE9891 and TCGA were screened through rigorous criteria using R software and Bioconductor package. Weighted gene co-expression network analysis (WGCNA) was constructed to gure out gene clusters associated with OS. Protein-protein interaction (PPI) networks were built through STRING website. Prognostic values of potential biomarkers were validated using forest map and Kaplan-Meier analysis. Results: According to screening criteria, 788 samples and 10402 genes were reserved as the modeling dataset. We detected ve modules related to OS and intersected 108 genes through WGCNA after random sampling. PPI network analysis, Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis revealed potential mechanisms of above biomarkers. Conclusions: Four exact biomarkers (CANT1, P4HB, DUS1L and SIRT7) were conrmed as independent predictors of survival in OvCa patients with success of debulking surgery, which might provide promising biomarkers for prognostic judgement in ovarian cancer.


Background
Ovarian cancer is the most lethal gynecological malignancy, with 239,000 new cases diagnosed and 152,000 deaths worldwide annually [1] . It is also a high malignancy with insidious onset, consequently, patients with advanced stage (FIGO III/IV) account for up to three-quarters of all diagnosed cases [2] . Despite satisfactory tumor reduction surgery and adequate chemotherapy, 60% of patients face peritoneal recurrence and only 29% survive till 5-year [3,4] . Among all the pathological types, high grade serous ovarian cancer (HGSOC) is the one with the worst prognosis. Prediction of HGSOC prognostic correlation biomarkers has great therapeutic signi cance and urgent requirement for clinical transformation.
The identi cation of ovarian cancer molecular subtype by Tothill promotes the application of integrated genomic research [5]. Previous study by Verhaak analyzed HGSOC individually which provided a prognostic model using 100-gene signature [6] .To reduce economic costs and improve repeatability, we need more precise genetic biomarkers. Weighted gene co-expression network analysis (WGCNA) is proposed to reconstruct robust gene modules in terms of large-scale gene expression pro les and the distinction of hub genes that drive key cellular signaling pathways.
In our present study, we have identi ed prognosis-associated genes for esophageal cancer [7] and grade predictive genes for serous ovarian cancer[8] through WGCNA analysis. On the basis of previous research, we conducted gene co-expression networks and clustered overall survival related genes based on three public microarray datasets (GSE49997, GSE9891, and TCGA), which included 464 HGSOC samples and 10647 genes. This approach identi ed 108 co-expression genes and four validated hub genes signi cantly related to OS. Our study might be useful to develop clinical transformation of prognostic biomarkers and immunotherapy drugs.

Preparation of HGSOC datasets
The HGSOC datasets were screened out from the 'NormalizerVcuratedOvarianData' Bioconductor package [9], which included both gene microarray expression pro le data and curated clinical data of ovarian cancer cohorts. Strict screening strategy was developed to reduce tumor heterogeneity: all high grade serious ovarian cancer patients that undergo optimal debulking surgery and with complete follow-up information were included. Accordingly, three microarray platforms datasets (GSE4997, GSE9891 and TCGA) were merged into the modeling set and 10647 common genes among all 3 modeling datasets were remained [10]. As results, 464 samples (87 patients, 75 patients and 302 patients) were used to do data pre-processing and batch effect adjustment through surrogate variable analysis [11]( Figure 1A).
Random sampling and WGCNA construction 464 patients were regarded as the population. In order to increase accuracy and repeatability, we selected individual samples randomly to form different subsets on a pro-rata basis. Select 90%, 80%, 70% and 60% of the population and repeat 10 times, furthermore, select 90% samples and repeat 100 times. As mentioned above, ve subsets were used to construct WGCNA separately ( Figure 1B). R package 'WGCNA' [12] was used to perform gene co-expression network separately. First, a matrix of pairwise correlations between all pairs of genes across all samples was constructed. Second, we chose soft-thresholding power 3 to which co-expression similarity is raised to achieve consistent scale free topology in multiple datasets. Third, we performed automatic network construction and module detection with min Module Size of 50, max Block Size of 10000, deep Split of 2, and merge Cut Height of 0.25. This procedure comprised calculation of network adjacencies and topological overlap dissimilarities, followed by scaling of topological overlap matrices and calculation of consensus topological overlap. Then, we built hierarchical clustering dendrograms of gene expression data for each dataset, and similar expression genes were merged into potential modules ( Figure 2). Correlations between gene expressions and prognostic traits (vital time and vital status) were calculated among each dataset.

Calculation of eigengene expression and identi cation of prognostic related gene modules
The rst principal component which represented the highest percent of variance for expression values of all genes in a module was regarded as module eigengene (ME). We used the expression pro le of module eigengenes to discuss the correlation of module genes expression with overall survival (OS). Cox regression model was used for survival analysis using 'OptimalCutpoints' [13]and 'survival' Bioconductor packages [14].

Construction of protein-protein interaction network
We extracted a subnetwork with module genes from the STRING protein interaction database [15]. Since the STRING database weights and integrates information from numerous sources, including computational prediction methods, experimental repositories, and public text collections, we set minimum required interaction score as high con dence (0.700), max number of interactions to show no more than 5, hoping to get a convincing interaction subnetwork of our module genes.

Gene Ontology and Kyoto Encyclopedia of Genes and Genomes analysis
The Database for Annotation, Visualization and Integrated Discovery (DAVID, https://david.ncifcrf.gov/) [16] was applied to perform Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses of hub genes and their most relevant cooperators. The human genome (Homo sapiens) was selected as the background variables. Enrichment terms were considered statistically signi cant when the FDR were less than 0.05.

Statistical analysis and veri cation
All statistical analyses were performed using R 3.4.1 software. Kaplan-Meier survival plots were generated with survival curves through Kaplan-Meier Plotter website [17]. For all analyses, differences were considered statistically signi cant if the P values were less than 0.05.

Results
Clinical characteristics in the HGSOC training set Three high quality datasets, respectively, GSE49997, GSE9891, and TCGA, were adopted from the 'NormalizerVcuratedOvarianData' Bioconductor package and were aggregated as the modeling dataset. In total, 464 samples and 10647 genes were included after data preprocessing. Among them, 429 were late-stage patients (FIGO III-IV). Clinical information statistically analysis and univariate cox regression analysis were performed. As shown in Table 1, there was signi cantly statistical difference between FIGO stage with recurrence (p = 0.001). Moreover, age more than 60 years old (p = 0.001), late FIGO stage (p = 0.049) and recurrence (p = 0.005) were also associated with the death consequence. These Data trends were consistent with clinical and epidemiological distributions.   (Fig. 2E). For each module, we calculated correlations between gene expressions and clinical features such as tumor stage, age, recurrence time, vital time, recurrence status, and vital status. In the present study, we focused on the risk factors of death. Survival analysis indicated ve correlated modules, their names were greenyellow (138 genes), midnightblue (126 genes), cyan (126 genes), darkturquoise (70 genes) and white (56 genes) (Fig. 3A).
In order to test the reproducibility and stability of gene co-expression networks, we constructed gene coexpression network among ten random sampling datasets each containing 90% of 464 samples. In parallel, the sampling ratio gradually diminished to 80%,70% and 60%. According to survival analysis, 2 out of 20 modules and 1 out of 24 modules were correlated with OS in 90% and 80% sampling groups. No prognostic module was found in 70% or 60% sampling groups. Eventually, we built WGCNA among 100 random sampling datasets each containing 90% of 464 samples and screened unique overall survival correlated cyan* module (113 genes) (Fig.  1B). As expected, there were 108 intersect genes between greenyellow module and cyan* module. Hence, we identi ed cyan* module genes for subsequent analysis (Fig. 3).
Validation of ten independent predictors of overall survival As mentioned above, there were 108 intersect genes between greenyellow module and cyan* module. Aiming to test the prognostic association with OS, we built forest plots of the expression of each gene using another 7 datasets (E.MTAB.386, GSE17260, GSE26712, GSE30161, GSE32062.GPL6480, PMID17290060 and TCGA.RNASeqV2) and illustrated the top ten independent predictors (C17orf62, CANT1, DUS1L, FN3KRP, GRB2, NARF, NUP85, P4HB, SIRT7 and STRA13) in table 2. The p values for the overall HR were between 4.42e-08 to 6.48e-05 ( Figure 5). In addition, we demonstrated gene descriptions and other prognostic related cancers.
Validation of four candidate predictors in optimal debulking ovarian cancer patients We selected 801 optimal debulking ovarian cancer patients as the validation dataset through online tool. Four out of ten predictors were ltered with absolutely signi cant differences. As demonstrated in Figure 6, low mRNA expression levels of CANT1 (p=7.6e-05), P4HB (p=9.8e-07), DUS1L (p=6.4e-06) and SIRT7 (p=9.5e-08) were associated with worse OS in OvCa patients. Table2 Ten potential prognostic genes

Discussion
The advantage of bioinformatics analysis is the integration of large amounts of data. In this study, we integrated large-scale transcriptional pro ling to identify robust co-expression gene modules associated with ovarian cancer overall survival. The ltering condition of modeling dataset determines whether the results are logical and worthy of generalization. Through epidemiological investigation and clinical observation, we con rmed HGSOC patients with optimal debulking surgery in survival analysis not only because their high incidence, but also rarely effectiveness of aftertreatment. Our long-term goal was to provide insights into genetic targets of rapid tumor progression and improve clinical prognosis.
In general, 464 samples and 10647 genes were included to construct WGCNA and 801 samples were used in validation in silico. Four predictors, namely, CANT1, P4HB, DUS1L and SIRT7 were associated with OS in OvCa patients with optimal debulking surgery. Reports of associations between these genes and ovarian cancer are rare. However, their link to other cancers are revealed increasingly.
The androgen-regulated Calcium-Activated Nucleotidase 1 (CANT1) is widely expressed in various organs.
Previous study has shown a commonly overexpressed in prostate carcinomas and prostatic intraepithelial neoplasia [18]. Protein level of CANT1 is signi cantly higher in clear cell renal cell carcinoma tissues than in adjacent normal tissues. CANT1 silencing suppressed cell proliferation, migration, and invasion in renal cancer [19]. Mutation of CANT1 is identi ed as a common pathogenic change for Desbuquois dysplasia type 1 [20].
Sirtuin 7 (SIRT7) encodes a member of the sirtuin family of NAD + dependent protein deacetylases and acts as key mediator of multiple cellular activities. Protein expression is linked to oncogenic activity and regulation of ribosome biogenesis [26]. It also antagonizes transforming growth factor β signaling and inhibits lung metastases in breast cancer [27]. SIRT7 depletion inhibits cell proliferation, androgen-induced autophagy, and invasion in prostate cancer [28].The expression pattern of SIRT superfamily and their prognostic values in serous ovarian cancer patients has been analyzed using GSE10971 and GSE30587 datasets. Increased expression of SIRT3, SIRT5, and SIRT7 are associated with better overall survival by univariable analysis which is consistent with our ndings [29]. Dihydrouridine synthase 1 like (DUS1L) expresses ubiquitous in prostate, spleen and 25 other tissues with unclear gene function.
Multiple integrated bioinformatics analysis are attempting to identify potential biomarkers in ovarian cancer [30,31]. Despite the insightful ndings, limitations still exist in likeness study. Firstly, the allocation ratio between modeling and validation datasets may affect the prediction results. Secondly, multi-center follow up data requires signi cant time, economic and labor costs. The predicted biomarkers need to be veri ed through biology experiments and clinical evidences.

Conclusions
Overall, our results predicted and validated valuable prognostic biomarkers in ovarian cancer patients with optimal debulking surgery. We obtained four promising genes (CANT1, P4HB, DUS1L and SIRT7) through WGCNA analysis and survival analysis in silico. Further molecular mechanism studies on prognostic biomarkers may provide insights into the new targets for progressive tumor treatment. Flow chart of WGCNA construction. A. The gene expression pro les of HGSOC patients with optimal debulking were ltered using three step screening strategy through 'NormalizerVcuratedOvarianData' Bioconductor package. GSE49997, GSE9891 and TCGA were used as modeling datasets in data pre-processing. B. Weighted gene co-expression networks were constructed and 5 out of 27 modules were related with overall survival. Select 90%, 80%, 70% and 60% of the population and repeat 10 times, furthermore, select 90% samples and repeat 100 times to build another ve parallel modeling datasets. As results, 1 out of 23 modules was related with overall survival in 90% random sampling *100 times group.

Figure 2
Clustering dendrograms of genes were based on dissimilarity topological overlap and module colors. A. 90% samples repeat 10 times; B.80% samples repeat 10 times; C. 70% samples repeat 10 times; D. 60% samples repeat 10 times; E. 100% samples; F. 90% samples repeat 100 times.  The PPI network of hub cyan* genes constructed by STRING. 113 hub genes was constructed in high quality STRING website.
Page 15/16 Top ten genes with signi cant p-value less than 0.00005 were shown in the forest plot.