Integrated bioinformatics analysis identifies core genes in breast cancer

Breast cancer (BRCA) is one of the most common malignancies in women. To reveal the molecular mechanism of the BRCA, the core genes associated with BRCA was studied by the integrated bioinformatics methods in the current study. A total of 1181 DEGs, 84 DEGs, and 190 DEGs were detected in GSE7904, GSE10797, and GSE103512, respectively. In addition, 7 hub genes overlapped from 3 datasets were selected, and ANNXA1, MYH11, IGFBP6, FOS, FOSB, KIT, and WLS might be the core genes of BRCA. The Go function analysis showed that the DEGs were enriched in chromosome segregation, gland development, leukocyte migration, extracellular structure organization and so on. The KEGG pathway analysis showed that the most DEGs were enriched in Pathways in cancer, Focal adhesion, ECM-receptor interaction, PPAR signaling pathway, etc. Additionally, ANXA1 was associated with worse overall survival, and the other 6 hub genes were associated with better overall survival. significantly biological process chromosome segregation, organelle fission, nuclear division, nuclear chromosome segregation, sister chromatid segregation, mitotic nuclear division, mitotic sister chromatid segregation, regulation of mitotic nuclear division, regulation of chromosome separation, regulation of sister chromatid segregation. While the most of pathways complement and coagulation cascades, ECM-receptor interaction, focal adhesion, Pathways in cancer, Cell cycle. In the data of GSE10797, the significantly enriched GO terms at biological process domain were gland development, leukocyte migration, response to radiation, response to a steroid hormone, cognition, learning or memory, digestive system development, digestive tract development, cell-substrate junction assembly, response to progesterone. The pathways were Pathways in cancer, Focal adhesion, Hematopoietic cell lineage, Protein digestion and absorption, ECM-receptor interaction. In the data set of GSE103512, the enriched GO terms were extracellular structure organization, extracellular matrix organization, reactive oxygen species metabolic process, response to reactive oxygen species, cellular response to reactive oxygen species, glomerulus development, regulation of nitric oxide biosynthetic process, renal system vasculature development, glomerulus vasculature development, metanephric glomerulus development. The pathways were PPAR signaling pathway, Tyrosine metabolism, Arachidonic acid metabolism,

Because of its poor prognosis and short average survival period, the early diagnosis and treatment of BRCA is very important for the survival. Previous reports showed that BRCA is highly heterogeneous [4] [4], and the traditional anatomical stage and histological classification could no meet the clinical needs. Thus, Molecular level study of pathogenesis, treatment, and prognosis has become the current research hotspots in the field of BRCA [5,6] [5,6].
In recent years, numerous biomarkers and therapeutic target genes for BRCA have been discovered, for example, estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor-2 (HER2); and most of them were useful in the diagnose and identify subtypes of BRCA [7] [7]. Using biomarkers to diagnose cancer early is a good choice for the diagnose of BRCA [8] [8].
Over the last decade, comprehensive sequencing efforts have revealed the genomic landscapes of common forms of human cancer [9]. With the help of high-through technology, a large number of data about BRCA, especially the data of cDNA microarray, can get and used in integrated informatics studies now. GEO (gene expression omnibus) is the largest and most comprehensive public gene expression database resource available today [9] [9]. In this study, three datasets (GSE7904, GSE103512, GSE10797) were downloaded from GEO for analysis using R [10] [10]. The screening of DEGs, the biological processes (BP) and KEGG pathway were analyzed with R packages. We used the GEPIA [11] [11] database to observe changes in the expression levels of these genes between cancer and normal population. In addition, the correlation between genes, survival, and disease was tested based on Kaplan Meier plotter online database (http://kmplot.com/analysis/) [12] [12] Methods 4 Raw data of BRCA microarray from the GEO database was subjected to quantity control (QC). the differentially expressed genes (DEGs) were studied using the "limma" package of R by the condition of both adj. p-value < 0.05, and log2 fold change of >1 or < -1. Then DEGs were analyzed by GO function and KEGG. In addition, the core genes found were identified by the performance of expression level and survival analysis in BRCA.

Search strategy and data collection
Public microarray repositories were curated to search through the Gene Expression Omnibus database GEO (http://www.ncbi.nlm.nih.gov/geo) database using the keywords "BRCA", or "mammary cancer". The datasets were reviewed manually and only datasets which fulfill the following criteria were included for further analysis: (1) Expression profiling by array, (2) Studies which comprised CEL raw files, and (3) Paired normal-tumor breast tissue samples. 3 microarray datasets of GSE7904, GSE10797 and GSE103512 were collected.

Quality control of microarray
Quality control is mainly focused on CEL file-level processing, like the simplest visual observation, the averaging method and more advanced data fitting methods. The three levels of quality control functions are implemented by the 'image' function, the 'simpleaffy' package, and the 'affyPLM' package. After quality control, unqualified sample data were removed.

Individual microarray data analysis
R tools were used in this study. All datasets were normalized individually on the base-2 logarithm by Robust Multi-Array Average (RMA) and Linear Models for Microarray (LIMMA) package and annotated by converting different probe IDs to gene IDs.
Genes significantly dysregulated in BRCA as compared to matched data at normal samples were defined by adj. p-value < 0.05, and log2 fold change of >1 (upregulated genes) or<-1 (downregulated genes).

DEGs enrichment analysis and its visualization
Gene enrichment analysis concluding GO enrichment analysis and KEGG enrichment analysis was done by the 'GOstats' and 'GeneAnswers' package. We used the 'Clusterfiler' and 'pheatmap' package to visualizing the statistical analysis.

Survival analysis of key genes
The KM-PLOT is capable to assess the effect of 54,675 genes on survival using 10,461 cases of cancer samples, including 5,143 cases of breast cancers, 1,816 cases of ovarian cancers, 2,437 cases of lung cancers and 1,065 cases of gastric cancers with a mean follow-up of 69, 40, 49 and 33 months, respectively [19]. The primary purpose of the tool is a meta-analysis based on biomarker assessment [19]. In this study, key genes ("MYH11", "IGFBP6", "WLS", "ANXA1", "FOSB", "KIT", "FOS") were assessed for their effect on survival with with 95% confidence intervals and log rank P value<0.01.

Correlation between hub genes and BRCA
Study on the correlation between hub genes and BRCA was performed by the online tools GEPIA (http://gepia.cancer-pku.cn/index.html).

Function protein association networks analysis
The function protein association networks of these three proteins through the "STRING" website (https://string-db.org/cgi/input.pl), and found that

Quality control of microarray
Quality control of the three data sets of microarrays was conducted. The results showed that the overall chip quality of GSE7904 was good and RNA degradation was small (Figure.S1). The overall chip quality of GSE10797 is general, and RNA degradation was small ( Figure.S2). and the GSE103512 chipset quality is slightly worse, RNA degradation is 6 obvious ( Figure.S3). based on these quality control analysis, 4 data of chips, 14 data of chips in the GSE10797 and GSE103512, respectively, were removed in the following studies (see supply information). the normalization of the three datasets showed that between the experimental group and the control group, the expression of most genes detected were consistent, and the consistency between parallel experiments is stronger. Relative logarithmic expression (RLE) box plots reflect these trends showed that the overall trend of the fitting curve is consistent, indicating that the chip pretreatment effect is good ( Figure 1). The normalized datasets were suitable for analysis of DEGs.

Significant Expression Changes of mRNAs in Breast Tumor
To determine the mRNA levels of genes in the breast tumor, we evaluated the mRNAs information in GSE7904, GSE10797, and GSE103512 by case-control design (Fig 2A, B, C).
the results showed that contrast with the normal group, 338 genes were upregulated and 843 genes were downregulated in the GSE7904, 4 genes were upregulated and 80 genes were downregulated in GSE10797, and 48 genes were upregulated and 142 genes were downregulated in GSE103512, respectively ( Figure 3A, B and C). We have listed the first ten DEGs sorted by the fold change (Table 1). Furthermore, levels of "MYH11", "IGFBP6", "WLS", "ANXA1", "FOSB", "KIT","FOS" were different between tumor and normal tissues in the three data sets, indicating their potential involvement in the progression of breast cancer ( Figure 2D). For further analysis, we chose the 7 genes evidently different mRNAs that were conserved among the three datasets: "MYH11", "IGFBP6", "WLS", "ANXA1","FOSB", "KIT","FOS". Cbioportal tools were used to check the 7 genes in the BRCA database (Fig.3D). The results showed the status of the 7 genes existed in BRCA and 7 MYH11 was the significant one.

DEGs enrichment analysis and its visualization
GO and KEGG pathway analyses were performed to elucidate the biological function of the DEGs from three datasets (Fig .4). the results showed that in the data set of GSE7904, the significantly enriched GO terms at biological process domain: chromosome segregation, organelle fission, nuclear division, nuclear chromosome segregation, sister chromatid segregation, mitotic nuclear division, mitotic sister chromatid segregation, regulation of mitotic nuclear division, regulation of chromosome separation, regulation of sister chromatid segregation. While the most obvious connection of pathways was, complement and coagulation cascades, ECM-receptor interaction, focal adhesion, Pathways in cancer, Cell cycle. In the data set of GSE10797, the significantly enriched GO terms at biological process domain were gland development, leukocyte migration, response to radiation, response to a steroid hormone, cognition, learning or memory, digestive system development, digestive tract development, cell-substrate junction assembly, response to progesterone. The pathways were Pathways in cancer, Focal adhesion, Hematopoietic cell lineage, Protein digestion and absorption, ECM-receptor interaction. In the data set of GSE103512, the enriched GO terms were extracellular structure organization, extracellular matrix organization, reactive oxygen species metabolic process, response to reactive oxygen species, cellular response to reactive oxygen species, glomerulus development, regulation of nitric oxide biosynthetic process, renal system vasculature development, glomerulus vasculature development, metanephric glomerulus development. The pathways were PPAR signaling pathway, Tyrosine metabolism, Arachidonic acid metabolism, 8 Complement and coagulation cascades, and Cytokine-cytokine receptor interaction.

DEGs alterations may play important role in BRCA patient's survival
Survival analysis is widely used in clinical and epidemiological research. In this study, survival analysis was performed to assess the 7 genes in patient's survival (Fig 5). The results showed that the high expression of FOS, FOSB, IGFBP6, KIT, MYH11 and WLS were associated with better overall survival (OS) for BRCA patients (HR>0.6, log rank P<0.0001), and the high expression of ANXA1 was associated with worse overall survival (HR=1.2(1.06-1.35) log rank P= 0.0044, Figure 5). To determine the correlation between DEGs and BRCA, the expression levels of the 7 genes were evaluated by GEPIA, the result showed that the expression of these 7 genes was down-regulated in BRCA (P<0.05, Figure   6).

Various genes and medicines specific to DEGs reveal the 7 genes as potential targets for BRCA treatment
To evaluate if this DEGs could be targeted to BRCA therapeutics, a website integrated PhosphoSite, KEGG Drugs, pid, HumanCyc, Reactome, PANTHER, and DrugBank databases was used to analyze the correlation of these genes and drugs. Results showed that 22 kinds of FDA approved drugs were target to ANNXA1, 11 kinds of FDA approved drugs and 10 kinds of non-FDA approved drugs were target to KIT, 2 kinds of FDA approved drugs were target to MYH11, and 1 kind of FDA approved, drugNadroparin, was target to FOS (Fig 7), however, there were no drug target to IGFBP6, FOSB, and WLS until now. 9 Discussion BRCA is the most common cancer and the leading death cause among women younger than 45 years [13] [13]. It is necessary to discover the molecular mechanism of the development of BRCA to decrease the mortality of BRCA,.
The bioinformatics analysis is one of the useful methods for the study of molecular mechanism of cancer. In the current study, threes GEO database, including GSE7904 (43 samples of BRCA and 19 normal samples), GSE10797 (65 samples of BRCA and 10 normal samples), and GSE103512(65 samples of BRCA and 10 normal samples) were used to study the differential expression genes between BRCA and normal tissues. The results showed that a total of 1181 DEGs in GSE7904, 84 DEGs in GSE10797, and 190 DEGs in GSE103512 were found significantly expressed between BRCA and control group.
The GO analysis showed that the DEGs found in the dataset of GSE7904 were mainly involved in chromosome segregation, organelle fission, nuclear division. The DEGs found in the dataset of GSE7904 were mainly involved in gland development, leukocyte migration, response to radiation, response to a steroid hormone, cognition, learning or memory, digestive system development, and cell-substrate junction assembly. The DEGs found in the dataset of GSE103512 were mainly involved in extracellular structure organization, reactive oxygen species metabolic process, glomerulus development, and regulation of nitric oxide biosynthetic process. There was no common GO term among three datasets, indicated that the BRCA was highly of heterogeneity, consistent with the previous study [15].
Additionally, KEGG analysis showed that the DEGs found in the dataset of GSE7904 most connection of pathways included Complement and coagulation cascades, ECM-receptor interaction, Focal adhesion, Pathways in cancer, and Cell cycle. The DEGs found in the dataset of GSE7904 most obvious connection of pathways included Pathways in cancer, Focal adhesion, Hematopoietic cell lineage, Protein digestion and absorption, and ECMreceptor interaction. The DEGs found in the dataset of GSE103512 hub pathways were PPAR signaling pathway, Tyrosine metabolism, Arachidonic acid metabolism, Complement and coagulation cascades, and Cytokine-cytokine receptor interaction. There was no same pathways among three datasets, consistent with the GO analysis and previous report [15].
Based on the Venn diagram analysis, 7 hub genes were overlapped among 3 datasets, and they were ANNXA1, MYH11, IGFBP6, FOS, FOSB, KIT, and WLS. In addition, the association of the seven hub genes with the overall survival of BRCA patients were studied, the results showed that ANXA1 was associated with worse overall survival, and the other 6 hub genes were associated with better overall survival. In addition, ANXA1, KIT, MYH11, and FOS have been targeted for drugs. There were no FDA drugs for IGFBP6, FOSB, and WLS. Further analysis showed that these three proteins interact with the well-known BRCA related proteins IGF1 [14][15][16]    The drug targets gene of the 7 DEGs

Supplementary Files
This is a list of supplementary files associated with the primary manuscript. Click to download.