Identification of New Key Genes in Breast Cancer by Co-expression Network Analysis

Background: Breast cancer is one of the most common malignancies in women all over the world. This study aimed to identify the potential biomarkers associated with the occurrence and development of breast cancer. Results: Our research downloaded GSE54140 gene expression datasets and GPL10152 platform information from the Gene Expression Omnibus datasets, and used weighted gene co-expression network analysis (WGCNA) to construct a scale-free gene co-expression network to explore the associations between gene sets and clinical features. A total of 60 modules were analyzed, and found that the skyblue3 module was significantly related to HER2+ BC. The function of 93 genes in the skyblue3 module was annotated by DAVID bioinformatics tool, and it was demonstrated that the function of the module was mainly related to nuclear-transcribed mRNA catabolic process, cytosol, and oxidoreductase activity. Based on the WGCNA and Cytoscape software analysis, 9 hub genes (PGAP3, PPP1R1B, PNMT, ERBB2, CISD3, CRKRS, TCAP, STARD3, and NEUROD2) were identified. The Human Protein Atlas database detected that the protein level of PGAP3, PPP1R1B, PNMT, ERBB2, CISD3, CRKRS, TCAP, and STARD3 gene in tumor tissues was significantly higher than those in normal tissues. And survival analysis shows that PGAP3, PNMT, ERBB2, TCAP, and STARD3 were negatively associated with the overall survival (P < 0.05). Conclusion: A total of 9 candidate biomarkers were identified by comprehensive bioinformatics analysis, among which, the co-expansion of PGAP3 and CRKRS related to ERBB2 may be associated with the occurrence of breast cancer. In addition, PPP1R1B, CRKRS and TCAP are related to drug resistance and adverse reactions in the treatment of breast cancer.


Background
Breast cancer (BC) is the most common cancer in females, which is a serious threat to women's health [1]. Nearly 1.3 million women worldwide are diagnosed with breast cancer every year, and more than 400,000 people die from recurrence and metastasis of breast cancer [2,3]. People with breast cancer have a great psychological and financial burden. Despite of the continuously improvement of DFS and OS of breast cancer patients from the effective treatment [4], the high rate 3 of recurrence and metastasis is still the major cause of death in breast cancer patients [5]. In-depth study on the mechanism of invasion and metastasis of breast cancer is warranted. Exploring 'precision therapy' strategies for different breast cancer subtypes are still the aim and trend of breast cancer research development [4]. In recently years, many microarray profiling studies have been performed in breast cancer, and hundreds of differentially expressed genes have been obtained. However, the difference analysis ignores the interaction between genes so that there is no reliable biomarker for clinical use in breast cancer. Now, the weighted gene co-expression network analysis (WGCNA) is used to construct a scale-free gene co-expression network to explore the associations between gene sets and clinical features and to identify the modules (clusters) of highly related genes and the hub genes in each module [6]. The most salient characteristic of scale-free networks is the relative commonness of nodes with a degree that greatly exceeds the average. The highest-degree nodes are often called 'hubs', and are thought to serve specific purposes in their networks. Coexpression analysis is a powerful technique to construct free-scale gene co-expression networks.
Thus, WGCNA is ideal for the identification of gene modules and key genes that contribute to phenotypic traits. In the current study, we downloaded the gene expression microarray from the GEO datasets, and used the WGCNA algorithm to identify the highly correlated gene modules associated with breast cancer, and detected the hub genes that are potential diagnostic and treatment target biomarkers.

Materials And Methods
Microarray data and data pre-processing www.r-project.org) and relevant software packages were used to process the data. Before WGCNA, 4 this study preprocessed the Microarray Data: the probe name was converted into the gene name using platform information, and microarray data were standardized through the R function. Finally, the microarray data with row name as sample name and column name as gene name was obtained.
After standardization and probe summarization, the data set with 15,612 genes was further processed, and the top 25% most variant genes by analysis of variance (11,709 genes) were selected for the construction of co-expression network analysis.

Co-expression network construction
After pre-processing, the selected expression data profiles were constructed to a gene co-expression network using WGCNA package in R (www.cran.r-project. org/web/packages/WGCNA /index.html). The gradient method was used to calculate the connection strength between each pair of nodes and test the independence and the average degree of connectivity of the various modules with different power values (the power values ranged from 1 to 20). In the presented study, the appropriate power value (β) was selected when the degree of independence was 0.8 as the soft threshold parameter to ensure a scale-free network. The hierarchical cluster dendrograms was constructed by using the correlation coefficient between genes, and the genes with similar expression profiles were classified into the same gene module. Modules were identified as gene sets with high topological overlap. Different branches of the cluster dendrograms represented different gene modules, and different colors represented different modules. Subsequently, the soft threshold power was applied to transform the adjacency matrix into topological overlap measure (TOM), and 400 genes were randomly selected to make TOM heat map to prove the high degree of independence between modules and the relative independence of gene expression in each module.

Module and clinical trait association analysis
Module Eigengenes (MEs) were defined as the first principal component of each gene module and the expression of MEs was considered as a representative of all genes in a given module. The Pearson Correlation Coefficient and P value of MEs and clinical trait was calculated by WGCNA algorithm to evaluate the potential correlation between gene modules and clinical traits.
Subsequently, the module with the highest correlation coefficient was used for analysis. Gene Significance (GS) and Module Significance (MS) were used to calculate the expression patterns of modules related to sample types. GS was the correlation coefficients for different kinds of samples, and MS was the absolute value mean of GS of all genes in the module. Module membership (MM) was used to evaluate the degree of association between genes and modules and to filter hub genes in the target module.

Enrichment analysis of module
The enrichment analysis of module was conducted through Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis to explore the biological functions using the DAVID bioinformatics tool (version 6.8; www.david.ncifcrf.gov/). Go term enrichment analysis and KEGG pathway contain biological process (BP), cellular component (CC), molecular function (MF) and KEGG pathway. P < 0.05 as the threshold was considered statistically significant.

Hub genes identification and analysis
The corresponding heat map was obtained by analyzing the correlation between each module and the clinical traits of the sample, and the module with the highest correlation was imported into Cytoscape software (version 3.7.2; www.cytoscape.org/) for analysis and visualization. MCOD, a plug in Cytoscape, was used to screen the module. Module with MCODE score > 59.5 were presented, and then the hub genes that degree > 74 were identified by analysis. The Human Protein Atlas database (www.proteinatlas.org) was used to detect the protein level of hub genes in tissues. Subsequently, to validate the hub genes, we used the Breast Cancer (METABRIC, Nature 2012 & Nat Commun 2016) from cBioPortal database (www.cbioportal.org/), which composed of 2,509 breast cancers samples/patients. Hub gene names were submitted in cBioPortal, and survival analysis of determining the importance of genes in biological processes was conducted.

Statistical method
In this study, R statistical software and WGCNA package were used for statistical calculation. WGCNA was used to construct free-scale gene co-expression networks to determine the relationships between genes, thereby enabling the identification of modules (clusters) of highly correlated genes, and the 6 hub gene in each module. P < 0.05 was considered statistically significant. Hypergeometric test was used for enrichment analysis, and Kaplan-Meier statistical method was used for survival analysis.

Data processing
In this study, GSE54140 (15,612 genes) composed of gene expression data from 63 samples including HER2-positive BC, Luminal-A BC and Luminal-B BC was selected and conducted by the R software function, and then microarray data were standardized. Using platform file GPL10152, the probe name was converted into the gene name, finally, the microarray data with row name as sample name and column name as gene name and the top 25% most variant genes by analysis of variance (11,709 genes) were selected for the construction of co-expression network.

Co-expression network construction and key modules identification
The co-expression analysis was carried out to construct the co-expression network. Select the appropriate weighting coefficient β to ensure a scale-free network, in this study, the power of β value of 3 was selected to construct a co-expression network, as shown in Figure 1. We then calculated the TOM for each gene pair, 60 modules were displayed by hierarchical clustering according to degree of TOM's difference as shown in Figure 2. Each module contained a group of coordinately expressed genes with high TOM, and was potentially involved in shared biological processes. To distinguish the modules individually, each module was assigned a unique color. 400 genes were randomly selected to make TOM heat map, as shown in Figure 3, which shows the high degree of independence between modules and the relative independence of gene expression in each module. The association analysis between tumor characteristics and co-expression modules was carried out to identify the correlation between MEs and tumor characteristics. As shown in Figure 4, Module-trait relationships heat map, the eigengene of the skyblue3 module (93 genes) had significant correlation with HER2+ BC (cor=0.74, P=3e -12 ). By calculating the correlation coefficient of GS and MM in skyblue3 module, and cluster analysis of traits and modules, the significant correlation between skyblue3 module and HER2+ BC was determined, and the credibility of the results was verified, as shown in Figure 5 and 6.
Hence, genes in the skyblue3 module were selected for further analysis.

Functional enrichment analysis of genes in modules
All the name of genes in the skyblue3 module was submitted to DAVID bioinformatics tool. For the BP, the genes were mainly enriched in nuclear-transcribed mRNA catabolic process, response to organophosphorus, positive regulation of hepatocyte proliferation, regulation of microtubule-based process, response to drug, tetrahydrofolate metabolic process, regulation of phosphatidylinositol 3kinase signaling, cellular aldehyde metabolic process, lysine catabolic process, as shown in Table 1.
The genes in the CC group were mainly enriched in cytosol, cytoplasm, ribosome, cytosolic large ribosomal subunit, membrane, neuronal cell, as shown in Table 2. For the MF, the genes were mainly enriched in oxidoreductase activity, protein binding, actin filament binding, aldehyde dehydrogenase [NAD(P)+] activity, large ribosomal subunit rRNA binding, 3-chloroallyl aldehyde dehydrogenase activity, as shown in Table 3. The KEGG pathway analysis was performed and showed that genes are mainly involved in Regulation of actin cytoskeleton (P > 0.05).

Hub genes identification and analysis
We found that the skyblue3 module had significant correlation with HER2+ BC, which suggests that hub genes may exists in skyblue3 module. Then we used Cytoscape software to visualize the network of the skyblue3 module and the intra-modular connectivity, which was calculated by WGCNA. Using MCOD function and correlation degree analysis, the MCODE score >59.5 were presented, and then the hub genes that degree > 74 were regarded as the hub gene represented by a red dot, as shown in Figure 7. The hub genes in the skyblue3 module included PGAP3, PPP1R1B, PNMT, ERBB2, CISD3, CRKRS, TCAP, STARD3, and NEUROD2. Moreover based on The Human Protein Atlas database, the protein levels PGAP3, PPP1R1B, PNMT, ERBB2, CISD3, CRKRS, TCAP, and STARD3 gene in tumor tissues were significantly higher than those in normal tissues, as shown in Figure 8. Then, all the hub genes underwent survival analysis using cBioPortal datasets. As shown in Figure 9, PGAP3, PNMT, ERBB2, TCAP, and STARD3 were negatively associated with the overall survival.

Discussion
Breast cancer is a life-threatening disease for females and one of the most frequently occurred malignancies in the world. Despite of the great progress in treatment options in the past decades, 8 much more remains to be done, and more cancer-driving genes need to be identified. Therefore it is critical to find more of the potential genes involved in the development and progression of breast.
In this study, we used WGCNA analysis to dynamically study genes co-expression in HER2-positive BC, Luminal-A BC, and Luminal-B BC, and to explore the related modules and hub genes. We conclude that the skyblue3 module had the highest correlation with HER2-positive BC. Skyblue3 module related to HER2-positive BC has well defined functions including nuclear-transcribed mRNA catabolic process, cytosol, and oxidoreductase activity. Among the genes in the skyblue3 module, PGAP3, PPP1R1B, PNMT, ERBB2, CISD3, CRKRS, TCAP, STARD3, and NEUROD2 were regarded as the hub genes.
PGAP3 is a member of glycosylated phosphatidylinositol (GPI)-specific phospholipase, and involved in the lipid remodeling steps of GPI-anchor maturation and plays a role in protein sorting and trafficking [7]. Some studies have confirmed that using the ion proton plate system, PGAP3 was found to be amplified simultaneously in breast cancer samples, with specific amplification of the locus harboring ERBB2 and PGAP3 [8]. Considering the close connection between the development of breast cancer and ERBB2 gene, the variation of PGAP3 copy number that affects the function of ERBB2 may also be participated breast cancer oncogenesis. PPP1R1B is a protein coding multiple transcripts and two experimentally-documented proteins Darpp-32 and t-Darpp [9]. And the protein encoded by this gene has been confirmed to highly overexpressed in breast cancer, colon cancer, esophagus cancer, lung cancer, et al [10,11]. Resistance to the anti-Her2 antibody Trastuzumab may be related to the overexpression of t-darpp/darpp-32 protein that activates the AKT pathway and ultimately leads to cell survival and apoptosis blocking [12]. ERBB2, commonly referred to as HER2, encodes a member of the epidermal growth factor receptor family of receptor tyrosine kinases, it is amplified and/or overexpressed in 20-30% of invasive breast carcinomas [13]. Amplification and/or overexpression of this gene have also been reported in many other cancers including ovarian and gastric cancer [14][15][16]. ERBB2 binds tightly to other ligand-bound EGF receptor family members to form a heterodimer, stabilizing ligand binding and enhancing kinase-mediated activation of downstream signaling pathways, such as PI3K/AKT activation and RET signaling [17]. ErbB2 overexpression is tumor cells specific, and closely related to cancer cell proliferation, anti-apoptosis and motor ability. As a cell 9 surface-associated protein, it is amenable to targeted inhibition by small molecules [18]. CISD3 is a member of the CDGSH domain-containing family, CISD3 codes mitochondrial inner NEET protein which resides inside the mitochondrial matrix [19]. And it may be a key player in the mitochondrial pathways such as regulating autophagy, apoptosis, and mitochondrial iron and reactive oxygen species homeostasis [20]. CRKRS, also named CDK12, encodes cyclin-dependent kinase 12 that phosphorylates RNA polymerase II, thereby acting as a key regulator of transcription [21]. Cdk12 maintains genomic stability by suppressing intron polygadenylation to regulate DNA repair genes [22][23][24]. Related studies show that endocrine therapy resistance in breast cancer is related to the activation of MAPK signal pathway by CRKRS silencing, which leads to the loss of endoplasmic reticulum dependence [25]. In breast cancers, CDK12 is frequently co-amplified with the HER2 (ERBB2) oncogene. Recent studies found direct correlations between CDK12 levels, DNAJB6 isoform levels and the migration capacity and invasiveness of breast tumor cells, and the team finally confirmed regulation by CDK12 modulates alternative last exon splicing of the DNA damage response activator ATM and a DNAJB6 isoform that influences cell invasion and tumorigenesis in xenografts, and suggests that CDK12 gene amplification can contribute to the pathogenesis of the cancer [26].
The protein encoded by TCAP gene were found in striated and cardiac muscle, it binds to the titin Z1-Z2 domains, it is a muscle assembly regulating factor and a substrate of titin kinase [27]. Among it is related pathways are cardiac conduction and striated muscle contraction. Doxorubicin and anti-HER2 targeting therapy of trastuzumab are frequently used breast cancer treatment agent, their most common adverse reaction is cardiac toxicity. Recent study reported that TCAP genetic variation may be associated with the development or progression of cardiomyopathy [28]. STARD3, StAR Related Lipid Transfer Domain Containing 3, is a protein coding gene. The encoded protein localizes to the membranes of late endosomes and may be involved in cholesterol exporting [29,30]. It facilitates the transport/distribution of cholesterol and sphingolipid to the intracellular membrane compartments and the metabolism of steroid hormones [29,31]. STARD3 gene is located to the minimal amplicon in HER2-positive breast cancers, it's overexpression leads to increased cholesterol biosynthesis and Src kinase activity in breast cancer cells, suggesting StARD3 over-expression may play a role in breast cancer aggressiveness through increasing membrane cholesterol and enhancing oncogenic signaling [32]. NEUROD2, Neuronal Differentiation 2, encodes a member of the NEUROD family of neurogenic basic helix-loop-helix proteins, which has an impact on the regulation of glutamatergic and GAB-Aergic genes [33]. NEUROD2 gene expression can induce transcription from neuron-specific promoters which contain a specific DNA sequence known as an E-box mediates calcium-dependent transcription activation [34].
The protein levels PGAP3, PPP1R1B, PNMT, ERBB2, CISD3, CRKRS, TCAP, and STARD3 gene in tumor tissues were significantly higher than those in normal tissues. And survival analysis shows that PGAP3, PNMT, ERBB2, TCAP, and STARD3 were negatively associated with the overall survival. These results suggest that these genes may be tumorigenic genes in breast cancer.

Conclusion
In summary, our study used a systems biology-based WGCNA approach to determine the module highly related to HER2 positive BC, whose function was mainly confined in nuclear-transcribed mRNA catabolic process, cytosol, oxidoreductase activity and actin cytoskeleton regulation. Though the majority of hub genes highlighted in this study have been previously reported, there is no comprehensive analysis of these genes. ERBB2 plays an important role in the occurrence and development of breast cancer. In addition, we found that PGAP3 and CRKRS were related to the coamplification of ERBB2, PPP1R1B may mediate anti-ERBB2 drug resistance by activating AKT pathway, TCAP may be related to cardiomyopathy caused by doxorubicin or Trastuzumab, and STARD3 may contribute to ERBB2+ breast cancer aggressiveness through increasing membrane cholesterol and enhancing oncogenic signaling. As endocrine therapy is the most important treatment for hormone receptor positive breast cancer. It has been found that CRKRS can silence the activation of mitogen activated protein kinase (MAPK) signal pathway leading to endocrine therapy resistance. Hub genes such as PGAP3 and STARD3 are highly correlated with breast cancer development. However, PPP1R1B, CRKRS, and TCAP may be brought new insights in breast cancer study in treatment target.
Further investigation about these genes is warranted.

Clinical significance
Our study reports several genes of breast cancer by Co-expression network analysis, they are associated with cancer's development and patient's prognosis and may serve as the therapeutic targets for the disease, especially in resistance and adverse reactions in the treatment of breast cancer.

Funding
Not applicable.

Availability of data and materials
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Authors' contributions
LX and WX wrote and revised the manuscript; LX was the main contributor to this manuscript; WX critically revised and corrected the manuscript; WY, WZ, FJ, LG and YG Analysis and interpretation of relevant data; All authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication
All the authors have consented for the publication.     Hierarchical cluster dendrograms of all expressed genes based on topological overlap.
18 Figure 3 TOM heatmap between genes. The heat map describes the topological overlap matrix between 400 random genes. A darker red color indicates higher overlap.
19 Figure 4 Heatmap of the correlation between module eigengenes and clinical traits of breast cancer.
Each cell contains the correlation and p value of the corresponding module and character.
20 Figure 5 Scatterplot of gene significance of HER2+ BC versus intramodular module membership in the skyblue3 module.
21 Figure 6 Eigengene dendrogram representing the relationships between modules and HER2+ BC traits.
22 Figure 7 Visualization of weighted gene co-expression network analysis (WGCNA) network connections of the intramodular hub genes. The hub genes represented by red dots have the highest degree of association with other genes in the module.