Co ‐ expression Network Analysis by WGCNA and Identify Potential Prognostic Markers Associated with Lung Metastasis in Breast Cancer

Backgroud: Breast cancer (BC) is an aggressive cancer with a high percentage recurrence and metastasis. As one of the most common distant metastasis organ in breast cancer, lung metastasis has a worse prognosis than that of liver and bone. Therefore, it’s important to explore some potential prognostic markers associated with the lung metastasis in breast cancer for preventive treatment. Methods: In our study, transcriptomic data and clinical information of breast cancer patients were downloaded from The Cancer Genome Atlas (TCGA) database. Co-expression modules was built by Weighted gene co-expression network analysis (WGCNA) to nd out the royalbule modules which is signicantly associated with lung metastasis in breast cancer. Then, co-expression genes were analyzed for functional enrichment. Furthermore, the prognostic value of these genes was assessed by GEPIA Database and Kaplan-Meier Plotter. Results: Results showed that the hub genes, LMNB and CDC20, were up-regulated in breast cancer and indicated worse survival. Therefore, we speculate that these two genes play crucial roles in the process of lung metastasis in breast cancer, and can be used as potential prognostic markers in lung metastasis of breast cancer. Conclusion: Collectively, our study identied two potential key genes in the lung metastasis of breast cancer, which might be applied as the prognostic markers of the precise treatment in breast cancer with lung metastasis.


Introduction
Breast cancer (BC) is the most common malignancy in the female population, accounting for about 30% of all female cancers [1]. Though the diagnosis and treatment methods have developed rapidly in recent decades,the mortality of BC still remains high due to the frequent distal metastasis [2][3][4]. The metastases at distant sites are the main cause of breast cancer death, therefore, the presence and location of breast cancer metastasis has been the critical diagnose to the clinical course and prognosis of patients [5,6]. Nevertheless, the prognosis of distant metastasis is signi cantly affected by the site of initial spread [7]. Among the priority sites of breast cancer transmission, metastasis to the lung has a worse prognosis than those to the liver or bone [8]. This is mainly due to little or no symptoms of early lung metastasis, and large metastases and severe symptoms have been caused when found the breast cancer lung metastases [9]. However, to date, there is no effective treatment for different metastatic sites, especially for the lung metastasis.
Different primary tumors have a proclivity to metastasize to distinct organs. The "seed" and "soil" theory has put forward that the distant metastasis of tumor was not accidental, but a certain biocompatibility existed between tumor cells and target organs, and tumor metastasis was affected by the effect of driver genes and target organ microenvironment [10]. However, identifying the speci c driver genes is still challenging.
WGCNA is a systematic biological method which can analyze multiple gene expression patterns in multiple samples [11]. Through constructing gene co-expression modules according to their expression patterns, WGCNA can analyze the relationship between modules and speci c phenotypes, for example the clinical symptoms of patients. Comparing with traditional molecular biology, the unique advantage of WGCNA is that it can convert gene expression data into co-expression modules, and provide insight into the signaling networks responsible for phenotypic traits of interest. It has been successfully used to study various biological processes, such as gastric cancer [12], colon adenocarcinoma [13], liver hepatocellular carcinoma [14] and glioma [15], proving its effective to identify the potential biomarkers and therapeutic targets. It not only helps to compare the expression of different genes, but also helps to calculate and analyze the interactions between genes in different co-expression modules. WGCNA deliver a more comprehensive exploration of the whole biological system in diseases and will be quite helpful to identify the candidate biomarkers or therapeutic targets.
Here we used WGCNA to explore the relationship between gene expression and metastatic site of breast cancer. We nd out the meaningful module correlated with lung metastasis of breast cancer. GO enrichment and KEGG pathway analysis were performed to gure out the main functions of genes in the main module. Cox regression analysis was performed to pick out the hub genes of the main module.
GEPIA and Kaplan-Meier Plotter were used to identify the expression value and prognostic value of these genes. We believe that our work will lay an important foundation for the future treatment of lung metastasis of breast cancer and improve the overall survival rate of breast cancer.

Patient samples data collection and processing
Public gene-expression data and clinical annotation were downloaded from the cBioportal online database (http://www.cbioportal.org/).The data we selected is the Metastatic Breast Cancer Project. The samples we selected were met the following criteria: 1) Include the sample with the character like bone metastasis, brain metastasis, live metastasis, lung metastasis and ovary metastasis. 2) Exclude the sample with missing data. And then we choose the top 30% most variable genes for our study. In the end, a total of 80 patient samples and 10083genes were included in our study.

Weighted gene co-expression network construction
In the present study, the soft-threshold β was set as 7. Subsequently, the adjacency matrix was transformed into a topological overlap matrix (TOM). Next, we performed hierarchical clustering to identify modules, each module included at least 20 genes (min Module Size = 20). Finally, we calculated the eigen gene, hierarchically clustered the modules, and merged similar modules.

Clinically signi cant modules identi cation
The co-expression module is de ned as a class of genes with high topological overlap similarity, and genes in the same module generally have a higher degree of co-expression. In this study, two methods were used to identify the important modules associated with clinical traits. First, the module eigengene (ME) represents the principal component of the module to describe the expression pattern of the module in each sample. Second, module membership (MM) refers to the correlation coe cient between genes and module eigengenes to describe the reliability of a gene belonging to a module. Finally, the correlation was calculated between the modules and the clinical data to identify signi cantly clinical modules.

Gene Ontology Enrichment and KEGG Pathway Analysis
WebGestalt (http://www.webgestalt.org/) is a functional enrichment analysis web tool for users to comprehend biological function information of genes and proteins, which supports three well-established and complementary methods for enrichment analysis. We used WebGestalt online tools to perform the GO enrichment and KEGG pathway analysis of the genes in royalbule module. "adjusted P < 0.05" was used as the threshold value to identify the signi cant terms.

Multivariate Cox regression
Multivariate Cox regression was performed using SPSS software. 39 genes were selected for screening the optimal prognostic signatures for breast cancer with lung metastasis. All data was evaluated by the Pearson's Chi-Square method with SSPS software.
GEPIA Database GEPIA (http://gepia.cancer-pku.cn/index.html) is an online database which facilitates the standardized analysis of RNA-seq data from 9,736 tumor samples and 8,587 normal control samples in the TCGA and GTEx data sets. In our study, we used this database to analyze the transcription levels of hub genes in breast cancer sample and normal sample. The P value was cut off at 0.01.

Kaplan-Meier Plotter Analysis
In our study, we used Kaplan-Meier plotter (http://kmplot.com/analysis/), which is an online database to explore the impact of genes on patient survival in different types of cancer, to verify the prognostic value of hub genes in breast cancer patients.
bc-GenExMiner v4.0 Breast Cancer Gene-Expression Miner v4.0 (http://bcgenex.centregauducheau.fr/BC-GEM/GEM-Accueil.php?js=1) is an online dataset containing published annotated genomic data, which includes 36 annotated genomic datasets and 5861 patients with breast cancer. Based on these, it can be used as a statistical mining tool to estimate the Pearson's correlation module.

Results
Construction of co-expression modules of breast cancer.
The data was processed and analyzed following the owchart (Fig. 1). Expression values of top 30% most variable genes (10083 genes) of breast cancer were used to construct the co-expression module by WGCNA package tool. The FlashClust tools package was used to perform the cluster analysis on these samples and the result was shown in (Fig. 2A). The power value, which mainly affected the independence and the average connectivity degree of co-expression modules, was screened out and equal to seven ( Fig. 2B) and the independence degree was up to 0.8. Therefore, the power value was used to build gene co-expression module and the results showed that there were 39 modules in breast cancer. These coexpression modules were displayed in different colours (Fig. 3). These modules were ranged from large to small by the number of genes they included. The number of genes in the 39 modules was shown in (Table 1). Interactions of the 39 co-expression modules were analyzed and shown in (Fig. 4). Gene Co-expression Modules Correspond to Clinic Traits.
The clinical information was provided by TCGA database. And we selected the metastasis organs as research targets and removed the patient sample information that was meaningless or lacking in our study. According to the correlation between module eigengene and clinic traits, the interaction of coexpression modules and particular traits were identi ed (Fig. 5). We found that the royalblue module (R = 0.34, p = 0.04) were signi cantly associated with breast cancer with lung metastasis. The eigengene dendrogram and heatmap were used to identify groups of correlated eigengenes (Fig. 6).
Functional enrichment analysis of genes in royalblue modules.
Go enrichment analysis and KEGG analysis were performed on the genes in the royal blue module. The biological process of most genes was metabolic process, biological regulation and cellular component organization. The cellular component of most genes was nucleus, and the molecular function of most genes was protein binding. According to KEGG analysis, genes in royal blue module were mainly enriched in systemic lupus erythematosus, reproduction, alcoholism, RHO GTPase effectors, viral carcinogenesis, signaling by Rho GTPase, cell cycle, developmental biology, generic transcription pathway.
Module visualize and hub genes.
The genes in royalblue module were calculated the intramodular connectivity. The intramodular connectivity was calculated for each gene by summing the connection strengths with other module genes and dividing this number by the maximum intramodular connectivity. The number of genes with connectivity greater than 0.1 was 39. These genes were selected as hub genes and then analyzed using the Cytoscape software (Fig. 9). Next, we integrated the expression pro les of 39 genes in the module with the occurrence of pulmonary metastasis in the corresponding 80 patients and conducted cox regression analysis. Finally, KRTAP4-1, LMNB1 and CDC20 were independent risk factors for breast cancer with lung metastasis. The relative risk between these three genes and lung metastasis of breast cancer were 1.146, 1.269 and 0.885, respectively ( Table 2). The omnibus test of model coe cients showed that the overall test of the model was statistically signi cant (Table 3). And then we found LMNB1 and CDC20 were up-regulated in breast cancer tissue through the GEPIA database. and we also found that the over-expression of this two genes predict worse survival of breast cancer using the Kaplan-Meier plotter website (Fig. 10). We further veri ed the correlation between LMNB1 and CDC20 using the bc-GenExMiner v4.0, the result showed that the correlation coe cientwas up to 0.9.

Discussion
Lung metastasis is a pernicious outcome of breast cancer. About 17% of patients with BC have a propensity developing to lung metastases [16]. Based on the selective evolution of organs, metastatic localization does not occur randomly, but is a preferred location controlled by numerous microenvironmental, cellular, and molecular factors [9]. Understanding the mechanism of lung metastasis of breast cancer is helpful for further prevention and even targeted therapy. There are many theories to explain the process of lung metastasis in breast cancer: for example, the interaction between CSCs forming breast cancer cells and pulmonary vessels [17][18][19],the histologic and intrinsic genomic pro les of breast cancer [20] barrier drive of host organs [21]. Individual genes are associated with organ-speci c metastasis. Multiple individual genes constitute a complex, dynamic and interactive network and govern the process of breast cancer cell metastasis to speci c organs [4,22]. To date, no studies have correlated gene expression with clinical traits and used it to predict lung metastasis of breast cancer. The advantage of our study is that we found out a predictive model that is bene cial to clinical decision making. Our result indicated that CDC20 and LMNB1 were co-expressed in breast cancer and played important roles in the process of lung metastasis in breast cancer, suggesting their important value for further study.
LMNB1 is a nuclear membrane protein that can build a framework for the nuclear envelope [23]. LMNB1 is also essential for cell senescence [24]. For example, down-regulation of LMNB1 induces cellular senescence through activating either the p53 or pRB tumor suppressor pathway [25]. Increasing evidence shows that up-regulation or down-regulation of LMNB1 affects the clinical behavior of cancer. Furthermore, previous study also showed LMNB1 is associated with lung metastasis in breast cancer in mice [26]. Therefore, we conclude that LMNB1 plays an important role in the process of lung metastasis in breast cancer, whereas further con rming evidence of this result in human breast cancer is still need.
CDC20 is a spindle assembly checkpoint molecule, which is critical in cell cycle progression [27]. Aberrant expression of CDC20 is associated with malignant progression and poor prognosis in various types of cancer [28][29][30]. In addition, there is a signi cant correlation between a high expression of CDC20 and advanced tumor stage in carcinoma [31]. To the best of our knowledge, our study is the rst time to verify that CDC20 is related with the progress of lung metastasis in breast cancer, indicating that CDC20 can be a prognostic marker or a potential therapy target for breast cancer with lung metastasis.
In conclusion, our study is the rst study to screen the characteristic hub genes and construct a prognostic model based on hub genes in breast cancer with lung metastasis using WGCNA. The potential prognostic predictive genes, LMNB1 and CDC20, have been proved to be able to investigate the occurrence of lung metastasis in breast cancer. Therefore, these two genes might be used as the potential biomarkers to identify the high-risk patients and assess the prognosis to facilitate the precise treatment in breast cancer with lung metastasis. Figure 1 The owchart of identifying procedure for the multi-gene signatures with breast cancer lung metastasis.      Construction of PPI network of genes in royalbule module.The PPI network of hub genes were analyzed by Cytoscape software.

Figure 10
A.The GEPIA database veri ed that LMNB1 and CDC20 mRNA expression was signi cantly up-regulated in breast cancer tissues (BRCA) (n = 1085) compared with normal breast tissues (n = 291), *P < .05. B.The relationship between LMNB1 and CDC20 in breast cancer analyzed using bc-GenExMiner v4.0. C.The prognostic value of KRTAP4-1, LMNB1 and CDC20 in breast cancer patients using Kaplan-Meier Plotter.