Identication of Candidate Biomarkers Correlated Poorer Prognosis of Breast Cancer via Bioinformatics Method

Background: Breast cancer (BC) is a malignancy with a high incidence among women in the world, and it is very urgent to identify signicant biomarkers and molecular therapy methods. Methods: Total 58 normal tissues and 203 cancer tissues were collected from three Gene Expression Omnibus (GEO) gene expression proles, and the differential expressed genes (DEGs) were identied. Subsequently, the Gene Ontology (GO) function and Kyoto Encyclopedia of Genes and Genome (KEGG) pathway were analyzed. Additionally, hub genes were screened by constructing a protein-protein interaction (PPI) network. Then, we explored the prognostic values and molecular mechanism of these hub genes Kaplan-Meier (KM) curve and Gene Set Enrichment Analysis (GSEA). Results: 42 up-regulated and 82 down-regulated DEGs were screened out from GEO datasets. GO and KEGG pathway analysis revealed that DEGs were mainly related to cell cycles and cell proliferation. Furthermore, 12 hub genes (FN1, AURKA, CCNB1, BUB1B, PRC1, TPX2, NUSAP1, TOP2A, KIF20A, KIF2C, RRM2, ASPM) with a high degree of genes were selected, among which, 11 hub gene were signicantly correlated with the prognosis of patients with BC. From GSEA reviewed correlated with KEGG_CELL_CYCLE and HALLMARK_P53_PATHWAY. Conclusion: this study identied 11 key genes as BC potential prognosis biomarkers on the basis of integrated bioinformatics analysis. This nding will improve our knowledge of the BC progress and mechanisms.


Background
Breast cancer is the most common malignant tumor among women in the world [1], which also is the leading cause of woman death [2]. Although many methods for BC early diagnosis, actually, effectively approaching still limited. Currently, the normal treatment of BC mainly includes surgery, chemotherapy and radiotherapy or endocrine therapy et al, but patients with a 5-year survival rate are still very low [3], so it is very urgent to identify signi cant biomarkers and molecular therapy methods [4].
Previous studies screened the biomarkers predictors and function enrichment, mostly using online tools, such as GEO2R, DAVID or KOBAS et al. Actually, the majority of the online database are not precise enough, due to the slow update. Besides, the online DE analysis tool GEO2R seems not to normalize original data, so there might be a great deviation for each probe.
In this study, we identi ed several key genes that could be used as a sensitivity biomarkers for the diagnosis of BC using the GEO database. We downloaded three different region source gene expression pro les (GSE29044, GSE42568, GSE50428) from the GEO database, and a total of 261 breast tissue samples were included in this study, consisting of 58 normal breast tissue samples and 203 BC tissue samples. Then, the limma package of R software and Venn diagram online tool were applied to DEGs in the three datasets above. Furthermore, GO function and KEGG pathway analysis were conducted by the R software, and the newest annotation les were downloaded from the o cial website respectively. We constructed a protein and protein interaction (PPI) network with Cytoscape and the hub genes were screened by the cytoHubba plugin. For the validation of the DEGs, the BC section in the TCGA database consisting of 1102 BC tissues and 113 normal tissues was downloaded using TCGA-Assembler [5] package of R software, and those DEGs from GEO were veri ed by the TCGA database. Finally, only 11 genes were selected as BC potential prognosis biomarkers by KM-plot, and GSEA analyses were involved to study the potential molecular mechanisms of these hub genes. In conclusion, our study provides some potential sensitive biomarkers for BC patients and promotes an understanding of the molecular mechanisms of BC progression.

Microarray data information
The GEO (http://www.ncbi.nlm.nih.gov/geo/) database is a free public database of microarrays and is used for gene expression datasets and platform records [6]. The gene expression pro les of GSE29044, GSE42568, GSE50428 were chosen from the GEO database. GSE29044 was based on the GPL570 platform containing 36 normal breast tissues and 73 BC tissues. GSE42568 was based on the GPL570 platform containing 17 normal breast tissues and 104 BC tissues. GSE50428 was based on the GPL13648 platform containing 5 normal breast tissues and 26 BC tissues. The downloaded data was tidied using the Perl (Practical Extraction and Report Language, Version 5.30.2) software, then log2 transformation and Z score standardization were performed on all data of gene expression.

Screening for DEGs
DEGs between BC samples and normal breast samples were identi ed using the limma (version 3.30.0) package of R (version 3.5.1) software, which compared two groups of samples in most GEO series to obtain genes with different expressions under the same experimental conditions. The DEGs with FC≥ 1.5 or FC≤1/1.5 and adjust P<0.05 were considered as the cutoff criteria. Then, we used Venn software online (http://bioinformatics.psb.ugent.be/webtools/Venn/) to obtain the common DEGs in all three independent cohorts.

GO term and KEGG pathway enrichment analysis of DEGs
The GO annotation has given us a conspicuous meaningful and for the variety of biological functional from microarray and other big datasets [7]. KEGG is a systematic gene and genomic function information database, the gene functional information is stored in the PATHWAY segments [8]. GO and KEGG annotation was downloaded from the o cial website respectively (http://current.geneontology.org/products/pages/downloads.html, https://www.genome.jp/keggbin/get_htext?hsa00001+3101, 2021-01-05download), and enrichment of DEGs was performed using hypergeometric distribution formula of R software, we regarded P-value ≤ 0.05 with fold change more than 2 as a statistically signi cant difference and signi cant enrichment. PPI network construction, hub gene identi cation After enrichment of the DEGs, we construct the PPI network based on String online database (https://string-db.org/) for all DEGs [9] and Combine score greater than 0.9 as the cutoff criterion, and then the network was loaded to the Cytoscape (version 3.7.0) software, and cytoHubba plugin was curried to predict hub genes [10].
Validation of the hub genes in the TCGA database The TCGA_BRCA dataset was downloaded using the TCGA-Assembler (version 2.0.6) package of R software and then the up and down-regulated hub gene expression levels were veri ed by the DESeq package of R software. Then the potential function of the hub genes was analyzed by the GSEA (version 4.1.0).

Survival analysis
To further investigate the value of hub genes in breast cancer patients, The Kaplan-Meier plotter (http://kmplot.com/analysis/) analysis was conducted [11]. BC database was applied to estimate the prognosis values of hub genes we had identi ed, each gene was separated into two parts according to the median of expression levels, and if the p-value ≤ 0.05, it would be considered statistically signi cant.

Statistical analysis
All statistical analyses were performed using R software, most data were expressed as the mean ± standard deviation (SD) from the dataset. Statistical analyses were conducted two-tailed and none paired Student's t-test by the GraphPad Prism 5(GraphPad Inc., San Diego, CA, USA) software, and p≤0.05 were considered as statistical signi cance [12].

GO functional enrichment analysis of DEGs
For the up-regulated DEGs, BP terms are most signi cantly enriched in cell division, negative regulation of B cell differentiation, and anaphase-promoting complex-dependent catabolic process. CC terms are most signi cantly enriched in spindle, collagen-containing extracellular matrix, and kinetochores. As for the MF, extracellular matrix structural constituent, microtubule binding, and microtubule motor activity are mostly enriched (Figure2. A-C). Besides, for the down-regulated DEGs, BP terms are most signi cantly enriched in the cellular response to heparin, retinol metabolic process, positive regulation of fat cell differentiation. CC terms are most signi cantly enriched in the extracellular space, collagen-containing extracellular matrix, and extracellular region. As for the MF, heparin binding, retinal dehydrogenase activity, and DNAbinding transcription activator activity, RNA polymerase II-speci c are mostly enriched (Figure2. E-G).

KEGG pathway enrichment analysis of DEGs
The method of enrichment is as same as that of the GO enrichment analysis. The up-regulated DEGs were most signi cantly enriched in the p53 signaling pathway, progesterone-mediated oocyte maturation, protein digestion and absorption. As for the down-regulated DGEs, they are mainly enriched in the AMPK signaling pathway, Adipocytokine signaling pathway, and PPAR signaling pathway (Figure2. D, H).

Survival analysis of the identi ed hub genes
The KM plot shows that all of the hub genes have a signi cant difference between high and low expression levels, except FN1 (Figure5.). This result, which is reliable in bioinformatics analyses, indicates that the 12 hub genes have prognostic signi cance for BC patients.

Hub gene GSEA analysis
From the GSEA, we could know that these hub genes are highly related to the "KEGG_CELL_CYCLE", and "HALLMARK_P53_PATHWAY" gene set (Figure6.). The result of GSEA indicates a signi cant difference (FDR ≤ 0.05) in the enrichment of MSigDB Collection and is consistent with GO and KEGG enrichment.

Discussion
AURKA, also known as aurora kinases, belongs to serine/threonine kinases which share a highly conserved catalytic domain containing auto phosphorylating sites [16]. It positively regulates cell cycle progression and plays a role in the cell centrosome and the spindle microtubules during mitosis [17], which implied that AURKA maybe promotes cancer cell division. On the other hand, it has been widely reported that AURKA is an oncogene to promote tumorigenesis in multiple types of cancer including solid tumors and hematological malignancies [16], such as bladder cancer [18,19], prostate cancer [20,21], colon cancer [22]. CCNB1 is a regulatory protein involved in mitosis and a critical cell cycle regulator of the G2/M checkpoint [23]. Previous studies have reported that CCNB1 could participate in oncogene pathways among many kinds of cancers, BC, colorectal cancer [24][25][26][27]. And initial stage of cancer, it was more recognized by the T cells [11,28]. In this study, we found CCNB1 was up-regulated in BC and associated with poor survival.
BUB1B, also called mitotic checkpoint serine/threonine-protein kinase BUB1 beta, is an important element of the mitotic checkpoint [29]. It plays a potential role in spindle checkpoint function in many kinds of cancer [30]. It has reported that inhibited BUB1B leading to cell death, and grow slowly in BC cell [31], consistent with our study. PRC1(Polycomb Repressive Complexes 1), PRC2 could form a multicomponent complex with PRC2, and these complexes play many roles in the biological process, such as pluripotency, development, and carcinogenesis et al. The abnormal regulation of PRC1 had contributed to cancer progress [32,33], also some research has con rmed that in prostate cancer and breast cancer [34,35]. But the mechanism of PRC1 in cancer still unclear.
Targeting protein for Xenopus kinesin-like protein 2 (TPX2) is a microtubule-associated protein. The expression of TPX2 is strictly regulated by the cell cycle. In general, TPX2 exists in the G1 and S phases of the cell cycle and disappears at the end of cytokinesis [36]. A growing number of papers reported that the high expression of TPX2 was connected with bad and shorter overall survival of patients in many tumors [37][38][39], but the mechanism of TPX2 in BC is still unknown. NUSAP1 (encodes nucleolar and spindle-associated protein 1), a nucleolar spindle-associated protein, has been reported that play a complicated role in cell division and mitotic progression, spindle formation, and stability [40]. It inhibits GBM cell growth both in vitro and vivo [41]. NUSAP1 was highly expressed in kinds of malignancies and correlated with poor prognosis in aggressive triple-negative BC [42].
Kinesin family member 20A (KIF20A) is believed to modulate microtubule dynamics [48], which could promote the tumorigenesis and progression of prostate cancer and glioma [49], particularly its biochemical recurrence and metastasis [48,50]. Silencing KIF20A could induce prostate cancer cells to death [36]. The aberrantly activated Gli2-KIF20A axis is crucial for the growth of hepatocellular carcinoma and predicts poor prognosis in hepatocellular carcinoma [51].
MCAK(also known as KiF2C), a member of the motor protein-13 motor family, has been reported to undergo large conformational changes when it is converted from solution to microtubule-binding during its catalytic cycle [52]. The activity of MCAK is inhibited by Aurora B kinase through phosphorylation on multiple amino acids within its N-terminus [53,54]. Previous studies reported that the high expression of KIF2C could serve as an independent marker of poor prognosis in several tumors, including glioma, colorectal cancer, and gastric cancer [55][56][57], but the roles in the BC reported less.
Ribonucleotide reductase M2 subunit (RRM2), a rate-limiting enzyme involved in DNA synthesis and damage repair, plays important roles in many cellular processes such as cell growth, invasiveness, migration and senescence et al [58]. RRM2 as a tumor driver is frequently overexpressed in various malignancies [59][60][61]. Others found that the expression level of RRM2 was correlated with invasion, cell differentiation, and metastasis in colorectal carcinoma [62]. Silencing RRM2 attenuated melanoma growth, which was consistent with the maintenance of senescence-associated cell-cycle arrest [63]. Also, it was correlated with lung cancer grade level [64].
Abnormal spindle-like microcephaly associated gene (ASPM), encodes a protein of 3477 amino acids with an NH2-terminal microtubule-binding domain, two calponin homology domains [65]. The function of APSM evidence has pointed out that it is an oncogene and the prognosis has been investigated in various cancers, such as epithelial ovarian cancer, gliomas, pancreatic and prostate cancer, and liver cancer, as well as BC, which devote to tumor progression and involved in poor prognosis [66][67][68][69][70][71].

Conclusions
In conclusion, our study identi ed 13 hub genes that might be involved in the progression of BC with multiple gene expression data sets and a series of comprehensive analyses of bioinformatics. Among those hub genes, 12 targets might be regarded as potential prognostic biomarkers, including AURKA, CCNB1, BUB1B, PRC1, TPX2, NUSAP1, TOP2A, KIF20A, KIF2C, RRM2, ASPM, PPARG. We hope that this study would provide some evidence for the clinical gene-targeting therapy, although somewhere still need more experiment to prove.