Identification of grade-specific diagnostic and prognostic biomarkers of key candidate genes and pathways in breast cancer by integrated bioinformatics analysis

Purpose: Breast cancer (BC) is the most common malignant tumor in women. Due to the mechanism of BC has not yet been completely clear, we aim to identify the key pathway and genes in BC based on bioinformatics method. Methods: Samples were obtained from NCBI-GEO website. Then, GEO2R tools and Venn diagram software were used to identify the differentially expressed genes (DEGs). Next, analyze Kyoto Encyclopedia of Gene and Genome (KEGG) pathway and gene ontology (GO) were analyzed by Database for Annotation, Visualization and Integrated Discovery (DAVID). The protein-protein interaction (PPI) network was drew by Search Tool for the Retrieval of Interacting Genes (STRING). Afterwards, we selected the core genes from PPI network by Molecular Complex Detection (MCODE) plug-in. And we performed Kaplan-Meier analysis to assess the overall survival of the core genes. Last Gene Expression Profiling Interactive Analysis (GEPIA) was used to discover highly expressed genes in BC. Results: DEGs contained 23 up-regulated and 32 down-regulated genes. GO described molecular function (MF), cellular component (CC), and biological process (BP). KEGG pathway showed DEGs were mainly involved in ECM-receptor interaction, p53 signaling pathway, PPAR signaling pathway, signaling pathways regulating pluripotency of stem cells, cGMP-PKG signaling pathway and Tyrosine metabolism. Finally, we screened 15 core genes, 14 of which had adverse prognosis and high expression in BC. Conclusions: In the current study, 14 core genes of BC were identified based on bioinformatics method, which could useful to provide essential information for early diagnosis and treatment of BC. B2 (CCNB2), kinesin family member 2C (KIF2C), DNA topoisomerase II alpha (TOP2A), TPX2 microtubule nucleation factor (TPX2), hyaluronan mediated motility receptor (HMMR), DLG associated protein 5 (DLGAP5), abnormal spindle microtubule assembly (ASPM), NIMA related kinase 2 (NEK2), maternal embryonic leucine zipper kinase (MELK), ubiquitin conjugating enzyme E2 T (UBE2T) and cell division cycle associated 3 (CDCA3).


Introduction
At present, breast cancer (BC) is the most common cancer diagnosis among female tumors 1 . Although the treatment of BC has made great progress in recent years, it is still the main cause of female death. In 2016, the mortality of BC in the European Union was expected to decline by 8%. However, BC remains the most common cause of cancer death in less developed countries 2 . The molecular mechanism of tumorigenesis in BC remains unclear. Identifying new genes related to tumorigenesis and prognosis of patients and elucidating the molecular mechanisms of these carcinogenic processes 3 are essential for early diagnosis, prevention. Meanwhile, genomic analysis of BC will result in new therapeutic roadmap 4 that contribute to personalized therapy.
Over the past ten years, gene chip technology has been widely used. This technology can recognize different expressed genes and store related-data in public databases.
Summarize and reanalysis of these genomic data can advantageous to identify biomarker sand gene expression profiles associated with certain diseases 5, 6 . Furthermore, the analysis and sequencing of genomic data to identify cancer-related driving genes and signaling pathways is also one of the most urgent needs in basic cancer research. And our knowledge of the cancer genomes can guide us to exploit more effective ways to reduce cancer morbidity and mortality 7 .
In this study, three datasets were acquired from Gene Expression Omnibus (GEO) database. Then, we got the commonly differentially expressed genes (DEGs) by comparison between the gene expression data of BC patients and non-BC patients in the three datasets. Afterwards, these DEGs were underwent Gene Ontology (GO) and Kyoto Encyclopedia of Gene and Genomes (KEGG) pathway analysis. In addition, we constructed protein-protein interaction (PPI) network and applied Cytotype MCODE (Molecular Complex Detection) for analysis of the DEGs to identify some core genes. And the Kaplan-Meier plotter online database were utilized for survival analysis of these core genes.
Subsequently, we validated the DEGs expression of BC tissues through Gene Expression Profiling Interactive Analysis (GEPIA). Through the above bioinformatics research, this study provided some effective biomarkers and signal pathways for BC.

Materials And Methods
Microarray data acquisition Three gene expression profiles datasets (GSE61304, GSE29044 and GSE42568) were obtained from the GEO (http://www.ncbi.nlm.nih.gov/geo/) database which is a free public repository for storing high-throughput microarray and sequence functional genomic

Data processing of DEGs
We identified DEGs between BC and normal breast samples via GEO2R online tools according to the criteria |logFC| > 2 and P < 0.05. Then, Venn diagram was plotted by online software to ascertain common DEGs among the three datasets. The DEGs with logFC > 2 was regarded as up-regulated genes and those with logFC<-2 was downregulated genes.
GO and KEGG pathway enrichment analysis of DEGs GO is a comprehensive analytical approach which can be used to calculate the functions of genes and gene products. It describes three aspects: molecular function (MF), cellular component (CC), and biological process (BP) 9 . KEGG is a collection of databases that systematically analyze gene functions and link genome information with high-level functional information 10 . Here we mapped the GO enrichment analysis and KEGG pathway of significant DEGs via Database for Annotation, Visualization and Integrated Discovery (DAVID, https://david.ncifcrf.gov/version 6.8)database. P-value < 0.05 was considered statistically significant.

PPI network establishment and modular selection
Search Tool for the Retrieval of Interacting Genes (STRING, https://string-db.org/) online biological database provided us with critical evaluation and collection of PPI information 11 .Then, we utilized the STRING app in Cytoscape software to build the protein interaction network. The maximum number of interactors = 0 and confidence score ≥ 0.4 was as the threshold value. Furthermore, we analyzed modules of PPI network via MCODE plugin of Cytoscape with the included criteria as follow: degree cutoff = 2, max. Depth = 100, k-core = 2, and node score cutoff = 0.2.

Survival analysis and RNA sequencing data
We evaluated the influence of key candidate genes on survival and prognosis with 95% confidence intervals and log rank P value were computed by Kaplan-Meier plotter (http://kmplot.com/analysis). Afterwards, GEPIA (http://gepia.cancer-pku.cn/), an interactive web application for gene expression analysis 12 , was applied to analyzing RNA sequencing data for the sake of validated these DEGs.

Identification of DEGs in BC
In the current study, 235 BC samples and 57 non-BC samples were obtained from three microarray databases (GSE29044, GSE42568 and GSE61304). Then, we identified 504,1770 and 749 DEGs with the criterion as |logFC| > 2 and P < 0.05 via GEO2R online tools. Subsequently, 23 up-regulated DEGs (logFC > 2) and 32 down-regulated DEGs (logFC<-2) intersected in three datasets were detected via VENN diagram software (   binding and RNA polymerase II distal enhancer sequence-specific DNA binding (Table 2). Table 2 Gene ontology(GO)analysis of up-regulated and down-regulated genes in breast cancer.  (Table 3).

Construction of PPI network and analysis of module analysis
The PPI network of DEGs which contained 38 nodes and 130 edges was constructed by STRING online database and Cytoscape software. A total of 38 DEGs were identified, of which 21 were up regulated genes and 17 were down-regulated genes (Fig. 2a).
Afterwards, we utilized Cytotype MCODE to identify significant modules. The results indicated that 15 core DEGs, which all belong up-regulated genes were recognized among the 38 DEGs (Fig. 2b). These core

Analysis of core genes by the Kaplan Meier plotter and GEPIA
To estimate the prognostic value of DEGs, we used Kaplan-Meier plotter to assess the correlation between the expression of 15 core genes and survival rates of BC patients. It was showed that 14 genes had a significantly worse survival (P < 0.05, Fig. 3).
Subsequently, we used GEPIA to re-analyze the expression level of 14 genes between cancerous and normal tissues. The results reported that compared with normal BC patients, all 14 genes were highly expressed in BC samples (Table 4 & Fig. 4). Table 4 Prognostic information of core genes via GEPIA.  ANLN is an evolutionarily actin-binding protein, which is firstly found in Drosophila that play a pivotal role in cytokinesis 13 .Research indicated that knockdown of ANLN could cause G2/M phase cell cycle arrest and thus inhibit cell growth 14 .In BC patients, high expression of nuclear ANLN in cancer cell is associated with poor prognosis 15 .Meanwhile, ANLN participated in the cellular growth and migration in BC 16 .It has been also testified that patients with lower level of ANLN expression might have both better prognosis and response to anthracycline-based chemotherapy 17 .

Category
KIAA0101 is a 15-kDa protein containing a conserved proliferating cell nuclear antigen (PCNA)-binding motif, which was identified by a yeast two hybrid screen for proteins binding to PCNA as 15PAF (PCNA-associated factor) 18,19 . It was verified that KIAA0101 has been associated with various tumor. In hepatocellular carcinoma and gastric cancer, KIAA001 overexpression is interrelated with cancer recurrence 19,20  HMMR is a hyper-expressed oncogene, the overexpression of which is transforming and is also essential for H-ras transformation 29 . HMMR is a hyaluronan receptor, and hyaluronan is a major constituent of the microenvironment of numerous malignant tumors 30 . In BC, recent evidences suggest that high levels of HMMR is a helpful prognostic indicator for breast carcinoma progression 31 .In addition, HMMR locus is related to high risk of BC in human 32 .
CDCA3 functions as part of a SKP1-Culin-RING-F-box ubiquitin ligase, which is reported to modulate cell cycle progression 33 . Previous data analysis indicated that CDCA3 showed on average > 100 -fold overexpression in BC compared with normal tissue 34  TOP2A is a DNA encodes topoisomerase IIα, an enzyme that controls and alters the topologic states of DNA during transcription and a direct target of anthracyclines 49 . In view of the role of TOP2A in the cell duplication process, it was thought that TOP2A protein was up-regulated in proliferating cancer cells 50 .TOP2A is closely related to HER2.
However, several studies demonstrated that TOP2A amplification and deletion is correlated with, but not limited to, BC with HER2 amplification 51 .
TPX2 is microtubule-associated protein which involved in the complex process of mitosis, and has been identified as a prohibition strictly regulated by cell cycle 52  UBE2C is a crucial component of the ubiquitin-conjugating enzyme complex, which is involved in the ubiquitin-proteasome system 55

Ethics approval and consent to participate
Not applicable.

Consent for publication
There are no human subjects in this article and consent for publication is not applicable.

Availability of data and materials
The datasets used and analyzed in current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests.

Funding
This work was supported by grants from the National Natural Scientific Foundation, China

Figure 3
The prognostic of the key candidate genes. Prognostic information of the core genes was identified via Kaplan Meier plotter online tools and 14 genes had a significantly worse survival rate (P < 0.05).

Figure 4
Compared with healthy people, 14 genes were significantly expressed in breast cancer patients. In order to further determine the gene expression level between breast cancer patients and healthy people, 14 genes related to poor prognosis were analyzed by GEPIA website. The expression level of these genes in breast cancer specimens was significantly higher than that in normal specimens(*P<0.05 .Red color means tumor tissues and grey color means normal tissues.