Integrated Weighted Gene Co-expression Network Analysis Reveals ALB Associated With Prognosis of High-grade Serous Ovarian Cancer

Background: Ovarian cancer is the gynecologic tumor with the highest fatality rate, and high-grade serous ovarian cancer (HGSOC) is the most common and malignant type of ovarian cancer. One important reason for the poor prognosis of HGSOC is the lack of effective diagnostic and prognostic biomarkers. New biomarkers are necessary for improvement of treatment strategies and to ensure appropriate healthcare decisions. Methods: To construct the co-expression network of HGSOC samples, we applied weighted gene coexpression network analysis (WGCNA) to assess the proteomic data downloaded from Clinical Proteomic Tumor Analysis Consortium (CPTAC), and module-trait relationship was then analyzed and plotted in a heat map to choose key module associated with HGSOC. Enrichment analysis was performed on the genes in the key modules to explore the functional information in which these genes participate. Hub genes with high connectivity in key module were identied by Cytoscape software. Furthermore, the true hub gene was selected through survival analysis, followed by expression verication with transcriptome dataset from TCGA and GTEx. Finally, the potential biological functions of hub gene were analyzed via single-gene Gene Set Enrichment Analysis (GSEA). Results: After merging similar modules, a total of 9 modules were identied. Module-trait analysis revealed that the brown module (cor = 0.7) was signicantly associated with HGSOC. The results of enrichment analysis of the genes in the brown module show that these genes were related to the functions of the extracellular matrix, the complement system and the blood system. Ten hub genes with the highest connectivity were selected by protein-protein interaction analysis. After survival analysis and expression verication of hub genes, only ALB was conrmed to be the true hub gene and positively correlated with HGSOC prognosis. Single gene GSEA revealed that ALB was associated with cell material degradation. Conclusion: We conducted the rst gene co-expression analysis based on proteomic data from HGSOC samples, and found that ALB had prognostic value, which could be applied in the treatment of HGSOC in the future.


Background
Ovarian cancer is the gynecologic cancer with highest mortality in the world, and ranks third in the incidence of gynecologic cancers, next to cervical cancer and endometrial cancer; 313,959 new cases and 207,252 deaths were estimated to have happened in 2020 (1). High-grade serous ovarian cancer (HGSOC), belonging to epithelial ovarian cancer (EOC), is the most common and fatal type of ovarian cancer, and the rst-leading cause of ovarian cancer-related deaths (2)(3)(4). Since the ovaries are located deeply in the pelvic cavity and there are no speci c symptoms and effective early screening methods of the early stages of HGSOC, most patients are often in an advanced clinical stage (stage III/IV) with metastasis at the time of diagnosis. The standard treatment of HGSOC is similar to other ovarian cancers, which is to perform primary debulking surgeries followed by chemotherapy combining platinum and paclitaxel (2). But 80% of patients with advanced HGSOC will experience recurrence and chemotherapy resistance, leading to a 5-year survival rate of 30% (5), and more and more evidences indicate that the metastasis-prone characteristics of HGSOC play an important role in its relatively poor prognosis (6, 7). In advanced HGSOC, tumor cells have spread out from the ovaries and pelvic organs to the peritoneum and abdominal organs, which could impede the normal function of vital organs in the abdomen by uncontrolled proliferation, as well as function of circulatory and respiratory system by generating large amounts of ascites and increasing intra-abdominal pressure (5,8), which eventually causing the HGSOC patients to die of various complications. However, patients with HGSOC present early stage tumor at diagnosis usually have a good prognosis after standard clinical interventions. Therefore, research on the biomarker of early diagnosis or prognosis of HGSOC is of great help to develop new effective therapies and improve the prognosis of patients with HGSOC.
Over the past two decades, with the breakthrough of high-throughput sequencing technology, a huge amount of sequencing data of various human tissues has been accumulated, which has greatly promoted the progress of biomedicine research, such as the screening of probable cancer prognostic biomarkers. Speci cally in the eld of ovarian cancer research, a variety of mRNAs or lncRNAs-based signatures have been identi ed for survival prediction in patients with ovarian cancer. For example, a study analyzed the differentially expressed genes in samples of ovarian cancer at different clinical stages and discovered speci c gene co-expression modules related to the clinical stage and nally identi ed COL3A1, COL1A1, COL1A2, KRAS and NRAS as potential prognostic genes for ovarian cancer (9). Another study showed that 5 lncRNAs (LINC00664, LINC00667, LINC01139, LINC01419 and LOC286437) could be used as independent risk factors for ovarian cancer recurrence (10). Meanwhile, other researchers have proved that changes in expression levels of genes were related to platinum resistance in ovarian cancer (11). But all of these integrated analyses are performed at the transcriptome level, not at the proteome level. It is well known that protein is the main executor of life activities, and all life activities depend on the correct function of protein. And with the advancement of proteomics research methods and the accumulation of public proteomic data, performing integrated proteomics analysis is completely feasible at present.
In the post-genomic era, the consensus that the occurrence of complex diseases such as cancer is not determined by a single gene has gradually gained the approval of most researchers. Gene co-expression network research can provide us with information about gene expression correlations and potential functional relationships, thereby helping us understand biological systems and explore the relationship between the relevant functional genes (12). And weighted gene co-expression network analysis (WGCNA) is a method for concretizing this idea, which has been comprehensively applied to multiple cancerassociated studies to identify hub genes related to various traits.
Recent proteomics studies using ovarian tissue samples from HGSOC patients by Clinical Proteomic Tumor Analysis Consortium (CPTAC) have helped us better understand the tumor mechanism from a novel insight, and have also identi ed some candidate therapeutic targets (2,13,14). In this study, we used WGCNA to re-analyze these published proteomic data in order to discover proteins and pathway related to occurrence and development of HGSOC, and identi ed a signi cant correlation between the brown module and HGSOC, clinical stage, histological grade and patient survival time. Ten hub genes were selected from this module and veri ed by survival analysis and other databases. Finally, it was identi ed that ALB might be a protein biomarker related to the prognosis of HGSOC. To our knowledge, this is the rst study of prognostic biomarkers for HGSOC based on protein pro le data, which provides us with some new insights into the occurrence and progress of HGSOC.

Methods
Proteomic data collection and pre-processing The quantitative proteomic data and clinical information of HGSOC were obtained from CPTAC Data Portal (https://cptac-data-portal.georgetown.edu/studies/ lters/primary_site:Ovary). And the unshared peptides expression matrices analyzed through the Common Data Analysis Pipeline were used for subsequent data analysis. Samples lacking information about clinical stage, tumor histological grade and survival time and proteins with missing value of relative abundance among all samples were excluded from our study. Other sample inclusion criteria and data processing procedures were described in previous studies (2,13,14), in short, that were 1) ve samples were removed causing without TP53 mutation; 2) the median values of relative protein abundance over all proteins in every sample were calculated and re-centered to value of 0; 3) the normalized relative protein abundances of overlapping samples were averaged and used as their protein abundances. Finally, 2892 proteins were identi ed to perform WGCNA analysis.

Co-expression network construction and module detection
We used the WGCNA package (Version 1.70, https://CRAN.R-project.org/package=WGCNA) to construct the co-expression network for the identi ed proteins (15). First, sample cluster analysis was carried out by the function hclust of WGCNA package to assess whether there were any signi cant outliers in the selected sample. Next, a suitable soft threshold power for scale-free network construction was calculated and chosen with the function pickSoftThreshold of WGCNA package. After that with the chosen soft threshold power value, an adjacency matrix was built to bring about weighted separation of coexpression (16). Co-expression similarity for paired proteins from adjacency matrix was calculated by the topological overlap dissimilarity measure, and we got a Topological Overlap Matrix (TOM) for next identi cation and similarity analysis of co-expression gene modules and combination of similar modules with the following major parameters: deepSplit of 2, minModuleSize of 15 and mergeCutHeight of 0.3.
The resulting protein co-expression network was visualized as the heat map based on dissimilarity of TOM with hierarchical clustering dendrogram, and the number of proteins in each module was counted and plotted with the barplot (ggplot2, Version 3.3.5, https://CRAN.R-project.org/package=ggplot2).

Identi cation of module-trait correlations and module preservation
The correlations between modules and clinical traits including sample type, clinical stage, pathological grade and survival time were assessed by Pearson Correlation Coe cients and a heat map was plotted to demonstrate the correlation value of interaction between modules and traits. The student t-test was used to get the P value of the correlation, and a P value of < 0.05 was considered as statistically signi cant. The brown module with highest value of Correlation Coe cients was mainly focused on and the correlation between Gene Signi cance (GS) for HGSOC and Module Membership (MM) in brown module was checked to identify module-trait associations.

Identi cation of hub genes
Genes closely connected to the intramodular nodes are regarded as hub genes which usually have more important biological function than other nodes (19). Protein-protein interaction (PPI) network analysis was performed via the online database Search Tool for Retrieval of Interacting Genes (STRING, Version 11, https://string-db.org/) (20), then the result of PPI analysis was imported to Cytoscape software (Version 3.8.0) to screen out hub genes with degrees ranking in the top 10 in the network of key modules using CytoHubba plug-in (Version 0.1) (21,22).

Kaplan-Meier survival analysis and validation of hub genes
Survival analysis was conducted and visualized for hub genes via the Survival package (Version 3.2-11, https://CRAN.R-project.org/package=survival) and Survminer package (Version 0.4.9, https://cran.rproject.org/package=survminer), respectively. The relative protein abundances and overall survival time from our data were used to plot the Kaplan-Meier curves. The cutoff values of the hub genes to separate the samples were determined by Survminer package. The hazard ratio (HR) was calculated with 95% con dence interval. Log-rank tests were performed to provide the statistical signi cance and P value of < 0.05 was considered as statistically signi cant. The protein level of hub genes between HGSOC and normal tissue were also validated using transcriptome data in The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) database and immunohistochemistry data in the Human Protein Atlas (HPA) database.

Results
Identi cation of co-expression modules using WGCNA It is believed that genes with comparable co-expression patterns are usually controlled by relative regulatory manner or have similar or parallel pathways of functional interaction (12). In this study, we downloaded the proteomic data from the CPTAC database according to Data Use Agreement. After the necessary quality control and manual check and screening, a matrix of relative abundance of 2892 proteins from 25 normal fallopian tube and 235 HGSOC samples with clinical information were selected to construct the co-expression networks. To ensure the reliability of the co-expression network, sample clustering analysis was performed to investigate the outliers among all samples, and no outliers were detected (Additional le 1). Finally, the relative abundances of these 2892 proteins and 260 samples were applied to identify the modules of co-expression genes (Additional le 2). To obtain scale free topology, a value of soft threshold power of 7was selected based on scale independence analysis (R^2 = 0.927), and the mean connectivity analysis was also relatively high under this soft threshold power (Fig. 1). Thirteen modules were generated and modules with high relevance of module eigengenes were combined for nal 9 modules ( Fig. 2a and b). The hierarchical clustering dendrogram of proteins also showed the same results of dynamic tree cut and merged dynamic (Fig. 2c). Numbers of proteins of each module were displayed in Fig. 2d and the detailed result was summarized in Additional le 3. Afterwards, the interactive relations among all modules and all proteins were visualized by a heat map plot based on TOM (Fig. 3).

Identi cation of modules related to HGSOC
To determine whether co-expression modules were associated with sample types and clinical traits, the module-trait relationship analyses were performed (Fig. 4a). We identi ed that the brown module was positively related to HGSOC with the top relevance and survival time and negatively to stage and grade. In addition to brown module, tan module also showed a lower correlation with above traits compared to brown module. Then the relationship between GS for HGSOC and MM in brown module was performed and the Correlation Coe cients value was of 0.82 (Fig. 4b).
Functional enrichment analysis of proteins in HGSOCrelated module To further explore the biological function of the proteins in the brown module related to HGSOC, GO and KEGG pathway enrichment analyses were performed for these proteins. The top 5 terms ranked by adjusted P value of enrichment analysis were plotted in chord diagram. The proteins in brown module were signi cantly enriched in term of extracellular matrix organization, extracellular structure organization, platelet degranulation, regulation of complement activation and complement activation of Biological Process category (Fig. 5a), and collagen-containing extracellular matrix, blood microparticle, vesicle lumen, secretory granule lumen and cytoplasmic vesicle lumen of Cellular Component category (Fig. 5b), and extracellular matrix structural constituent, enzyme inhibitor activity, peptidase regulator activity, endopeptidase regulator activity and actin binding of Molecular Function category (Fig. 5c). And for the enrichment of KEGG pathway, these proteins were mainly enriched in complement and coagulation cascades, ECM-receptor interaction, focal adhesion, amoebiasis and carbon metabolism (Fig. 5d).

Identi cation and validation of hub genes in HGSOC-related module
The 545 proteins in brown module were uploaded and analyzed by STRING to identify the interaction between them, and then the top 10 proteins (ALB, AKT1, APOB, C3, APOA1, FGA, FGG, SERPINA1, MAPK1 and AHSG) ranked by degree were selected as hub genes by cytoHubba plug-in of Cytoscape software. The network with neighbors generated by topological analysis method of Degree was shown in Fig. 6. Furthermore, the relationships between the 10 hub genes and the overall survival time of patients with HGSOC in our proteomic data were analyzed. Notably, low expression of SERPINA1, MAPK1, APOB and ALB signi cantly correlated with the poor survival time of HGSOC patients (P < 0.05; Fig. 7a and Additional le 4). Meanwhile, we veri ed the expression levels of these 4 proteins in a transcriptome pro le of HGSOC from the TCGA and GTEx database, which showed only the ALB was signi cantly expressed at lower level in HGSOC samples compared to normal samples ( Fig. 7b and Additional le 5ac), similar to the protein level of ALB from our proteomic data (Additional le 5d). Our results illustrated that ALB might be related to the risk of HGSOC. Finally, we also con rmed the protein level of ALB in HGSOC using the HPA database, and the result indicated that protein level of ALB in fallopian tube was signi cantly higher than that in HGSOC on the basis of immunohistochemistry (Fig. 8).

Gene set enrichment analysis
To further understand the biological function of ALB in HGSOC, we performed GSEA based on our proteomic data. As shown in Fig. 9a and b, proteins in high-expression group of ALB were involved in "negative regulation of proteolysis", "negative regulation of hydrolase activity", "enzyme inhibitor activity", "negative regulation of catalytic activity" and "regulation of proteolysis", whereas "mRNA splice site selection", "DNA biosynthetic process", "regulation of signal transduction by P53 class mediator", "signal transduction by P53 class mediator" and "post transcriptional regulation of gene expression" were enriched in low-expression group of ALB.

Discussion
HGSOC remains the most common type of ovarian cancer with the highest incidence and the strongest fatality rate all over the world, and there is no de nite research conclusion on its tumorigenesis mechanism. Meanwhile, due to the lack of effective early screening methods, most patients with highgrade serous ovarian cancer are at the advanced stage of tumor progression at the time of diagnosis, accompanied by extensive peritoneal metastasis, while most patients will experience tumor recurrence, both of the above two factors together lead to a very poor prognosis for patients (2,5). However, the 5year survival rate of patients with early stages of HGSOC is as high as 92%, which is 62% higher than that with later stages of HGSOC (25), which suggests the possibility that patients with HGSOC can bene t from e cient early screening methods. Many researchers have conducted extensive research in this eld, and have also discovered some novel diagnostic markers with clinical application value (9)(10)(11). But the objects of these studies are all at the transcriptome level, and protein, as a more direct manifestation of the life activities of cells, organs and even the body, may be a better research carrier for the screening of high-e ciency diagnostic or prognostic biomarkers. Furthermore, with the accumulation of proteomic of data in public databases, it also provides more feasibility for this research idea.
WGCNA is a powerful method based on "guilt-by-association", and algorithms based on WGCNA transform the complexity of gene expression data sets into advantages, which can clarify gene relationships above the noise level (16). At present, WGCNA has been widely used in the eld of tumor research, greatly promoting the biological interpretation of high-throughput sequencing data from clear cell renal cell carcinoma, pancreatic ductal adenocarcinoma (PDAC) as well as ovarian cancer, and assisting the discovery of tumor diagnosis and prognostic biomarkers (26-28). However, the number of studies applying WGCNA to analysis proteomic data is relatively small for the time being, but gratifying results have been achieved. Mantini et al. suggested that three proteins SPTBN1, KHSRP and PYGL might serve as potential prognostic biomarker for PDAC (29). While Johnson et al. identi ed proteins and biological processes in Alzheimer's disease brain that might serve as therapeutic targets and uid biomarkers for the disease (30). These results also illustrate the effectiveness of WGCNA applied to proteomic data.
In this study, we obtained proteomic data of HGSOC samples from the CPTAC database. Then these data were assessed by WGCNA, and the brown module signi cantly related to HGSOC was identi ed. Interestingly, the brown module is not only signi cantly related to HGSOC, but also signi cantly related to the patient's survival time and signi cantly negatively related to the clinical stage and histological grade of HGSOC. The results of enrichment analysis of proteins in brown module show that most of these proteins are related to the organization and function of the extracellular matrix (ECM) components including collagen (COL1A1, COL1A2, COL4A1, COL4A2, COL6A1, COL6A2 etc.), proteoglycan (LUM, DCN), laminin (LAMA4, LAMB1, LAMB2, LAMC1) and other proteins as linkers to connect the above proteins (NID1, PRELP, TNXB), covering almost all types of ECM components, and most of these proteins were down-regulated in our data and were consistent with the results of previous studies (2). The metastasis-prone characteristics of HGSOC play an important role in its relatively poor prognosis (6, 7). Tumor invasion and metastasis is a complicated pathological process which involving interactions between tumor cells and various biologically active molecules from tumor microenvironment including ECM (31), and for malignant tumor cells derived from epithelial cells like HGSOC, epithelial-mesenchymal transition (EMT) is the key rst biological step for these tumors cells to metastasize, which is accompanied by disorders of ECM composition and organization, and in turn enhancing tumor cell mobility and protecting tumor cells from immune attack via collagen remodeling, and nally promoting the invasion and metastasis of HGSOC (5, 32, 33). These facts not only prove the credibility of the correlation between the brown module and the clinical stage, histological grade and survival time of patients, but also indirectly prove the validity of our analysis results.
Moreover, the PPI analysis was conducted to screen hub genes, which were further performed with survival analysis using our data and veri ed by a transcriptome dataset of HGSOC from the TCGA and GTEx database due to the lack of other available proteomic data. Finally, only ALB showed a good consistence between expression pattern and correlation with the patient's prognosis, that was, the expression level of ALB in tumor tissues was signi cantly down-regulated and had a signi cantly positive correlation with the patient's prognosis. Therefore, we speculate that ALB may be involved in the prognosis of HGSOC patients. ALB encodes the secreted and main protein of human blood, lymph, cerebrospinal and interstitial uid, which plays important roles in a variety of physiological functions, such as maintaining colloid osmotic pressure and acting as the carrier protein for a wide range of endogenous molecules including hormones, fatty acids, as well as exogenous drugs (34). ALB has been reported to participate in the development and treatment of tumors with different mechanisms. First, previous studies have shown that the production of ALB is signi cantly inhibited during cancer-related systemic in ammation, which is regulated by a variety of cytokines and growth factors produced by tumor cells and immune cells (35). Secondly, the decrease in plasma ALB concentration re ects the poor nutritional condition of cancer patients, which may be related to the chemotherapy resistance of patients (36). Further, ALB might take part in antioxidant and anticancer effect (37, 38). In addition, the low serum ALB concentration is related to the poor survival time of patients with various cancers (35), as well as for ovarian cancer (39). However, these studies have not clari ed the de nite mechanism of how low plasma ALB levels lead to poor prognosis of patients. Meanwhile, there is no report about the relationship between ALB and the prognosis of HGSOC, and our research lls this gap. Moreover, the samples in our study are from tumor tissues of HGSOC patients, which may provide a certain research basis for the subsequent understanding of the role of ALB in the occurrence of HGSOC.
The limitation of this study is that the veri cation of our results was carried out in the transcriptome data due to the lack of other independent proteomic data. Secondly, due to the limitation of mass spectrometry technology, the relative abundance of most proteins is missing, and in order to ensure the reliability of the results, these proteins are not included into our analysis, which may cause bias to our conclusion.

Conclusions
In summary, HGSOC-associated module was revealed through WGCNA. ALB was identi ed as a prognostic biomarkers for HGSOC, and its protein level was positively correlated with the survival time of HGSOC patients. These ndings need to be con rmed by further clinical practices in the future.

Declarations
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
All data generated or analysed during this study are included in this published article and its supplementary information les.

Competing interests
The authors declare that they have no competing interests

Funding
No funding was obtained for this study.
Authors' contributions BW and BG conceived this study; BW and SC performed the analysis; BW and BG prepared and revised the manuscript. All authors have read and approved the submitted manuscript.     The interaction network diagram of the nodes in brown module. The large red node is the node with a high degree, and the small blue node is the node with a low degree.

Figure 7
Survival analysis and veri cation of ALB. The survival analysis uses our proteomic data and the veri cation uses data from TCGA and GTEx.