Identification of High-risk Genes in Triple-negative Breast Cancer by Bioinformatics


Background: Current research has not yet found a target gene for triple-negative breast cancer (TNBC), so treatment for TNBC remains less effective than that for other types of breast cancer. Finding high-risk genes for TNBC by bioinformatics may help to identify target genes for TNBC.
Methods: Gene expression data from 4 chips (GSE7904, GSE31448, GSE45827, GSE65194), which contain normal breast tissue and TNBC tissue, were obtained from the Gene Expression Omnibus. The differentially expressed genes (DEGs) between normal breast tissue and TNBC tissue were identified. Gene Ontology (GO) functional annotation analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis of DEGs were performed by the DAVID website. Protein-protein interaction network analysis of DEGs was carried out by the STRING website, and the results were imported into Cytoscape. Then, module analysis was carried out by using the MCODE app. The online tool of the Kaplan-Meier Plotter website was used to analyse associations between relapse-free survival (RFS) and the expression of genes obtained by MCODE, and the metastasis-free survival (MFS) data from GSE58812 were used for survival verification. The difference in the expression of the identified genes was verified by the online tool of the UALCAN website.
Results: There were 127 upregulated and 293 downregulated genes in the DEGs. The GO and KEGG analysis showed that the DEGs were particularly enriched in mitotic nuclear division, extracellular space, heparin binding, and ECM-receptor interaction. MCODE obtained a total of 47 genes in 4 gene clusters, 29 of which were related to RFS. Survival verification indicated that 14 out of 29 genes were related to MFS, namely, CCNB1, AURKB, KIF20A, BUB1B, DLGAP5, CXCL11, CXCL9, CXCL10, CXCL12, IGF1, FN1, CFD, SGO2 and CDCA5.
Conclusions: We identified 14 genes as high-risk genes for TNBC. Further research on these genes may identify the target genes of TNBC.



Background
Triple-negative breast cancer (TNBC) refers to breast cancer that is negative for estrogen receptor (ER), progesterone receptor (PR) and human epidermal growth factor receptor-2 (HER-2) [1]; TNBC accounts for 15%-20% of breast cancers [2]. TNBC occurs more often in young women and is characterized by strong invasiveness, a tendency to metastasize and poor prognosis [3,4]. TNBC is a heterogeneous disease that can be classified into distinct molecular subtypes by gene expression profiling [5,6]. Until now, no target gene for TNBC has been found. Chemotherapy is still the main method of systemic treatment [7], but the curative effect is unsatisfactory, the side effects are unfavourable, and patient survival time is short. The study of target genes of TNBC may be important for improving the therapeutic effect and prolonging the survival of TNBC patients.
Using bioinformatics to study the molecular mechanism of cancer has become a trend in cancer research. Public databases of biological information, such as the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA), collect data on gene sequencing, methylation, mutation, miRNA, protein and other information; these databases have the advantages of including data from many kinds of organisms, many types of tumours and large sample sizes. Through the screening of public databases, the differentially expressed genes (DEGs) between tumour and normal tissues, as well as the relationship between genes and tumour prognosis, can be recognized, and this process can facilitate the identification of high-risk genes in tumours. It is a strategy for finding tumour target genes.
This paper used the GEO and TCGA databases to find the DEGs between TNBC and normal breast tissues and identified the genes that were related to relapse-free survival (RFS) and metastasis-free survival (MFS) as high-risk genes of TNBC, to provide direction for the search for target genes.

Data source
Gene expression data from the GSE7904, GSE31448, GSE45827 and GSE65194 datasets were obtained from the GEO (https://www.ncbi.nlm.nih.gov/geo/). Gene expression data from GSE58812 were used as the source for survival verification. All of these chips contain gene expression information on normal breast tissue and TNBC tissue. The platform for all of these chips was GPL570, which provides complete coverage of the Human Genome U133 Set plus 6,500 additional genes for analysis of over 47,000 transcripts.

Identification of DEGs
The GEO2R online analysis tool provided by the GEO was used to screen the DEGs between normal breast tissue and TNBC tissue. The screening conditions were logFC > 2 (upregulated genes in TNBC) or logFC < -2 (downregulated genes in TNBC), together with an adjusted P < 0.05.
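The screening conditions can be illustrated with a minimal Python sketch. It assumes that each dataset's GEO2R result has already been exported and parsed into rows, and that logFC is oriented as TNBC versus normal tissue; the row keys (`gene`, `logFC`, `adj_p`) and the toy values are hypothetical and are not taken from the study data.

```python
# Thresholds taken from the screening conditions stated in the text.
LOGFC_CUTOFF = 2.0
ADJ_P_CUTOFF = 0.05


def screen_degs(rows, logfc_cutoff=LOGFC_CUTOFF, adj_p_cutoff=ADJ_P_CUTOFF):
    """Split result rows into (upregulated, downregulated) gene lists.

    rows: iterable of dicts with hypothetical keys 'gene', 'logFC', 'adj_p'.
    A gene passes only if adj_p < adj_p_cutoff and |logFC| > logfc_cutoff.
    """
    up, down = [], []
    for row in rows:
        if row["adj_p"] >= adj_p_cutoff:
            continue  # not significant after multiple-testing adjustment
        if row["logFC"] > logfc_cutoff:
            up.append(row["gene"])
        elif row["logFC"] < -logfc_cutoff:
            down.append(row["gene"])
    return up, down


# Toy rows with made-up values, for illustration only.
example_rows = [
    {"gene": "GENE_A", "logFC": 3.1, "adj_p": 0.001},   # upregulated
    {"gene": "GENE_B", "logFC": -2.6, "adj_p": 0.020},  # downregulated
    {"gene": "GENE_C", "logFC": 2.4, "adj_p": 0.300},   # fails adjusted P
    {"gene": "GENE_D", "logFC": 1.2, "adj_p": 0.001},   # fails logFC
]
up_genes, down_genes = screen_degs(example_rows)
```

Note that the sign of logFC depends on the order in which the two groups are defined, so the orientation should be checked before the thresholds are applied.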
Using the Venn diagram online tool on the Van de Peer Lab website (http://bioinformatics.psb.ugent.be/beg/tools/venn-diagrams), Venn diagrams of the DEGs from the above 4 chips were drawn, and the intersecting DEGs were taken as the final DEGs.
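The intersection step performed with the Venn diagram tool is equivalent to a set intersection across the four DEG lists. The sketch below shows this under the assumption that each dataset's DEGs are available as a set of gene symbols; the gene names are placeholders, not results from the study.

```python
def intersect_degs(deg_sets):
    """Return the genes that appear in the DEG set of every dataset.

    deg_sets: dict mapping a dataset accession to a set of gene symbols.
    """
    if not deg_sets:
        return set()
    return set.intersection(*deg_sets.values())


# Placeholder gene symbols, for illustration only.
deg_by_dataset = {
    "GSE7904": {"G1", "G2", "G3"},
    "GSE31448": {"G2", "G3", "G4"},
    "GSE45827": {"G2", "G3", "G5"},
    "GSE65194": {"G2", "G3", "G6"},
}
final_degs = sorted(intersect_degs(deg_by_dataset))
```

In practice, upregulated and downregulated genes would be intersected separately so that a gene is retained only when its direction of change agrees in all datasets.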

Functional and pathway enrichment analysis
Gene Ontology (GO) functional annotation analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis of DEGs were performed by the Database for Annotation, Visualization and Integrated Discovery (DAVID) tools (https://david.ncifcrf.gov/, Version: 6.8). GO analysis includes information on the biological process (BP), cellular component (CC) and molecular function (MF) of genes. BP refers to an ordered biological process with multiple steps, such as cell growth, differentiation and maintenance, apoptosis, and signalling. CC is used to describe the location of gene products in cells, such as the endoplasmic reticulum, nucleus or proteasome. MF refers to the function of a single gene product, such as binding activity or catalytic activity. KEGG is a database that systematically describes the pathways in which gene products act in cells and is one of the most commonly used resources for pathway analysis. The results were arranged in ascending order of P-value, and the top 6 terms in each category were displayed.
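The idea behind enrichment analysis can be sketched with a plain hypergeometric over-representation test followed by ranking in ascending order of P-value. This is a common stand-in and is not DAVID's exact computation (DAVID reports its own modified statistic); the function names and toy numbers are assumptions made for illustration.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_term, n_degs, n_overlap):
    """Upper-tail hypergeometric P-value, P(X >= n_overlap).

    n_background: genes in the background set
    n_term:       background genes annotated to the term
    n_degs:       genes in the DEG list
    n_overlap:    DEGs annotated to the term
    """
    upper = min(n_term, n_degs)
    total = comb(n_background, n_degs)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def top_terms(term_pvalues, n=6):
    """Sort terms in ascending order of P-value and keep the first n."""
    return sorted(term_pvalues.items(), key=lambda item: item[1])[:n]


# Toy numbers: 10 background genes, 4 annotated, 3 DEGs, 2 of them annotated.
p_example = hypergeom_enrichment_p(10, 4, 3, 2)

# Made-up P-values for hypothetical terms.
ranked = top_terms({"term_a": 0.03, "term_b": 0.001, "term_c": 0.2}, n=2)
```

For the toy numbers, the tail probability is (6 x 6 + 4 x 1) / 120 = 1/3.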

Protein-protein interaction network and module analysis
The protein-protein interaction (PPI) network of DEGs was analysed by the online tool of the STRING website (http://string-db.org/, Version: 11.0), with the minimum required interaction score as highest confidence (0.900) and the combined score > 0.4. The results were imported into the Cytoscape software (Version: 3.7.1), and the sub-network was constructed by the Molecular Complex Detection (MCODE) app (node score cut-off = 0.01, k-core = 5, Version: 1.5.1).
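To illustrate what the score threshold and the k-core parameter mean, the following sketch filters edges by interaction score and then prunes the network to its k-core. It is a deliberate simplification and does not reproduce MCODE, which additionally weights nodes by local density and grows clusters from seed nodes; the edge list, node names and scores are hypothetical.

```python
from collections import defaultdict


def k_core(edges, k, min_score=0.0):
    """Return the node set of the k-core of a score-filtered network.

    edges: iterable of (node_a, node_b, score) tuples.
    Edges with score < min_score are dropped, then nodes with fewer than
    k neighbours are removed repeatedly until none remain.
    """
    adjacency = defaultdict(set)
    for a, b, score in edges:
        if score >= min_score and a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)
    adjacency = dict(adjacency)

    changed = True
    while changed:
        changed = False
        low_degree = [n for n, nbrs in adjacency.items() if len(nbrs) < k]
        for node in low_degree:
            # Detach the node from every remaining neighbour.
            for neighbour in adjacency.pop(node):
                adjacency[neighbour].discard(node)
            changed = True
    return set(adjacency)


# Hypothetical edges: a triangle A-B-C, a pendant node D and a weak edge D-E.
example_edges = [
    ("A", "B", 0.95),
    ("B", "C", 0.92),
    ("A", "C", 0.91),
    ("A", "D", 0.93),
    ("D", "E", 0.30),
]
core_nodes = k_core(example_edges, k=2, min_score=0.9)
```

In this toy network the weak D-E edge is discarded by the score filter, D is then left with a single neighbour and is pruned, and the triangle A-B-C remains as the 2-core.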

Survival analysis and verification
Survival analysis of genes obtained from MCODE analysis was performed using an online tool from the Kaplan-Meier Plotter website (https://kmplot.com/analysis/). The following settings were used: split patients by = auto select best cutoff, survival = RFS, probe set options = only JetSet best probe set, restrict analysis to subtypes = ER negative and PR negative and HER2 negative, use following dataset for the analysis = all [8]. Genes with a log rank P-value < 0.05 were verified by external data.
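The log rank P-value used as the selection criterion can be illustrated with a sketch of the standard two-group log rank test. The Kaplan-Meier Plotter website computes this internally, so the code below is only an assumed textbook formulation; the function name and the toy follow-up times are hypothetical.

```python
from math import erfc, sqrt


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log rank test; returns (chi-square statistic, P-value).

    times_*:  follow-up times
    events_*: 1 if the event (e.g. relapse) was observed, 0 if censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for u, _, g in data if g == 0 and u >= t)  # at risk, group A
        n_b = sum(1 for u, _, g in data if g == 1 and u >= t)  # at risk, group B
        d_a = sum(1 for u, e, g in data if g == 0 and u == t and e)
        d_b = sum(1 for u, e, g in data if g == 1 and u == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return chi2, erfc(sqrt(chi2 / 2))


# Toy data: group A relapses early, group B relapses late.
chi2_sep, p_sep = logrank_test([1, 2, 3, 4, 5], [1] * 5,
                               [10, 11, 12, 13, 14], [1] * 5)
# Identical groups should give no evidence of a difference.
chi2_same, p_same = logrank_test([1, 2, 3], [1] * 3, [1, 2, 3], [1] * 3)
```

The "auto select best cutoff" option of the website, which searches over candidate expression thresholds, is not reproduced here.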
The external data were from GSE58812. Verification was performed using SPSS software (Version: 22.0).
The receiver operating characteristic curve was used to evaluate the best cut-off value of the gene expression, and the patients were divided into a high expression group and a low expression group. The Kaplan-Meier method was used to compare the MFS of the two groups, and P < 0.05 was considered to indicate a significant difference.
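The two steps described above can be sketched as follows. The text does not state which criterion was used to read the best cut-off from the receiver operating characteristic curve, so maximising Youden's index (sensitivity + specificity - 1) is an assumption here; the analysis itself was run in SPSS, and the function names and toy values below are hypothetical.

```python
def youden_cutoff(expression, outcome):
    """Return (threshold, J) maximising Youden's index.

    expression: expression value per patient
    outcome:    1 if the patient had the event, 0 otherwise
    Expression >= threshold is treated as predicting the event.
    Both outcome classes must be present.
    """
    positives = sum(outcome)
    negatives = len(outcome) - positives
    best_j, best_threshold = -1.0, None
    for threshold in sorted(set(expression)):
        tp = sum(1 for x, y in zip(expression, outcome) if x >= threshold and y == 1)
        tn = sum(1 for x, y in zip(expression, outcome) if x < threshold and y == 0)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_j, best_threshold = j, threshold
    return best_threshold, best_j


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    curve, survival = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        occurred = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1 - occurred / at_risk
        curve.append((t, survival))
    return curve


# Toy values, for illustration only.
cutoff, j_value = youden_cutoff([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0, 0, 0, 1, 1, 1])
km_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Patients would then be split at the returned threshold into high and low expression groups, and a Kaplan-Meier curve estimated for each group before the two curves are compared.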

Verification of gene expression
To verify the expression of the genes obtained from the above survival analysis, the online tool of the UALCAN website (http://ualcan.path.uab.edu) [9] was used, and the data source was the TCGA. Genes with P < 0.05 were considered high-risk genes of TNBC.

Data analysis process
The data analysis process is shown in Fig. 1.

DEGs between normal breast tissue and TNBC tissue
There were a total of 33 normal breast tissue samples and 132 TNBC tissue samples from GSE7904, GSE31448, GSE45827, and GSE65194 (Additional file 1: Table S1). The intersection of DEGs from the 4 chips was found by the Venn diagram; of these DEGs, 127 genes were upregulated and 293 genes were downregulated (Fig. 2, Additional file 2: Table S2).

GO and KEGG analysis of DEGs
The GO analysis results of DEGs were as follows: 1) for BP, DEGs were particularly enriched in mitotic nuclear division, cell division, sister chromatid cohesion, positive regulation of cell proliferation, cell chemotaxis and cell proliferation; 2) for CC, DEGs were particularly enriched in extracellular space, extracellular region, proteinaceous extracellular matrix, midbody, chromosome, centromeric region and extracellular exosome; and 3) for MF, DEGs were particularly enriched in heparin binding, chemokine activity, integrin binding, protein homodimerization activity, protein binding and metalloendopeptidase activity. The KEGG analysis results of DEGs revealed enrichment in ECM-receptor interaction, oocyte meiosis, cell cycle, focal adhesion, pathways in cancer and cytokine-cytokine receptor interaction (Table 1).

PPI network and module analysis
A total of 227 nodes and 1207 edges were displayed in the PPI network of DEGs (Fig. 3). There were 4 gene clusters obtained from the module analysis, including 20 genes in cluster 1, 12 genes in cluster 2, 9 [...] Fig. S1, S2). Gene expression data from GSE58812 were used to verify the survival association of these 29 genes, and 14 of them were found to be related to MFS, namely, CCNB1, AURKB, KIF20A, BUB1B, DLGAP5, CXCL11, CXCL9, CXCL10, CXCL12, IGF1, FN1, CFD, SGO2 and CDCA5 (Fig. 5).

Expression verification
The expression of the above 14 genes in normal breast tissue and TNBC tissue was verified by the online tool of UALCAN. The data were from TCGA, including 114 normal breast tissue samples and 116 TNBC tissue samples. The results showed that the difference in the expression of all these genes between normal breast tissue and TNBC tissue was statistically significant. The upregulated genes were CCNB1, AURKB, KIF20A, BUB1B, DLGAP5, CXCL11, CXCL9, CXCL10, FN1, SGO2 and CDCA5. The downregulated genes were CXCL12, IGF1 and CFD. This result was consistent with the results in GSE7904, GSE31448, GSE45827, and GSE65194 (Fig. 6).

Discussion
In this paper, the DEGs between normal breast tissue and TNBC tissue were obtained from 4 chips. The GO analysis indicated that the identified DEGs were related to mitotic nuclear division, cell division, sister chromatid cohesion, positive regulation of cell proliferation, cell chemotaxis, cell proliferation, heparin binding, chemokine activity, integrin binding, protein homodimerization activity, protein binding and metalloendopeptidase activity. The KEGG analysis indicated that the DEGs were related to ECM-receptor interaction, oocyte meiosis, cell cycle, focal adhesion, pathways in cancer and cytokine-cytokine receptor interaction. This is consistent with the active proliferation of TNBC cells.
The PPI network shows the relationship between the proteins expressed by these genes. Due to the large number of proteins in the PPI network and the complex network relationship, it was necessary to further explore the more important gene modules through module analysis. PPI network analysis and module analysis of DEGs finally yielded 47 genes in 4 clusters. Survival analysis is an important method of studying tumour-associated genes. Due to the limitation of clinical follow-up data sources, RFS was used as the observation index of survival analysis, and MFS was used for verification in this paper. Through the survival analysis, 14 genes closely related to the prognosis of TNBC were obtained, namely, CCNB1, AURKB, KIF20A, BUB1B, DLGAP5, CXCL11, CXCL9, CXCL10, CXCL12, IGF1, FN1, CFD, SGO2 and CDCA5.
The cell cycle consists of five phases: G0 (gap 0), G1, S (synthesis), G2, and M (mitosis). Mitosis proceeds in five phases: prophase, prometaphase, metaphase, anaphase, and telophase [10]. To ensure that only healthy cells proliferate, checkpoints have evolved that induce cell-cycle arrest in response to the detection of defects that may have arisen during DNA replication or other steps leading to mitosis [11]. The abnormal distribution of cells throughout the cell cycle is a hallmark of human cancer due to accumulating alterations of genes in the cell cycle pathway, possibly resulting in impaired abilities of cell division, cell proliferation and DNA damage response [12].
Among the high-risk genes found in this paper, the genes related to cell cycle and mitosis included AURKB, BUB1B, SGO2, CDCA5, CCNB1, DLGAP5 and KIF20A. These genes were overexpressed in TNBC and were associated with poor prognosis. AURKB is a member of the serine/threonine kinase family and is a major member of the chromosome passenger complex [13], which plays a key role in chromatin condensation and segregation and in cytokinesis. Genetic instability caused by overexpression of AURKB is a direct cause of tumour formation [14]. Previous studies have found that AURKB expression is upregulated in leukemia, lymphoma, liver cancer and breast cancer and is associated with poor prognosis, which is consistent with the results of this study [15][16][17][18]. BUB1B encodes a kinase involved in the spindle assembly checkpoint and chromosome separation [19]. During the G2 phase, BUB1B inhibits anaphase-promoting complex/cyclosome activity, which allows cyclin B to accumulate before mitosis begins and slows the cell cycle [20]. Previous studies have found that BUB1B expression is upregulated in prostate cancer, breast cancer and lung cancer and is associated with poor prognosis [21][22][23]. SGO2 is a centromere-localized protein whose main function is to protect sister chromatid cohesin from degradation. SGO2 regulates the localization of chromosomal passenger proteins [24] and mediates spindle assembly and chromosome congression to prevent the generation of chromosomal instability associated with malignant cell transformation [25]. SGO2 is also a key substrate of AURKB, which plays a central role in ensuring faithful chromosome segregation [26]. CDCA5 is a substrate of the anaphase-promoting complex and participates in the regulation of sister chromatid cohesion [27]. Previous studies have found that CDCA5 expression is upregulated in lung cancer and hepatocellular carcinoma and is associated with poor prognosis [28,29].
CCNB1 belongs to a highly conserved family of cyclins that regulates the cell cycle and promotes cell proliferation. It is essential for control of the cell cycle at the G2/M transition. Previous studies have found that CCNB1 is overexpressed in ER-positive breast cancer and is associated with poor prognosis [30]. DLGAP5 belongs to the discs large-associated protein family, encodes the disc large homolog 7 (DLG7) protein, and controls spindle stability [31,32]. The overexpression of DLGAP5 in colorectal cancer and prostate cancer is associated with poor prognosis [33,34]. KIF20A is a member of the kinesin-6 family that participates in spindle assembly and interacts with mitotic regulators [35]. The overexpression of KIF20A in hepatocellular carcinoma and lung adenocarcinoma is associated with poor prognosis [36,37].
CXCL9, CXCL10, CXCL11, and CXCL12 belong to the chemokine family. They are small molecular cytokines that are produced by many kinds of cells. After binding to their receptors, they mediate cell migration, activate antigen-presenting cells and immune active cells, and regulate the immune process of the body. In this study, we found that CXCL9-11 were overexpressed in TNBC, whereas CXCL12 was not, and that patients with high expression of CXCL9-12 had a better prognosis. A possible reason is that a large amount of CXCL9-12 can promote the infiltration of cytotoxic T lymphocytes into the tumour cell area to kill tumour cells and induce T or NK cells to inhibit tumour angiogenesis [38][39][40][41][42][43].
IGF1 is a key growth factor for the mammary terminal end bud and for ductal formation during development, and it also plays an important role in breast cancer development, progression and metastasis [44,45]. Upregulating IGF1 may promote TNBC progression [46]. However, in this study, IGF1 overexpression was associated with a better prognosis, which is difficult to explain and is worthy of further study.
FN1 encodes two forms of fibronectin, soluble plasma fibronectin-1 and insoluble cellular fibronectin-1 [47]. It regulates cell adhesion and migration processes [48]. Previous studies have found that overexpression of FN1 in oesophageal squamous cell carcinoma was associated with poor prognosis, which is consistent with this study [49].
CFD is a member of the serine protease family, which stimulates the transport of triglycerides to fat cells and inhibits lipolysis. CFD is a key component in the regulation of the alternative complement pathway [50,51]. This study found that CFD gene expression was downregulated in TNBC and correlated with poor prognosis, which was consistent with the changes seen in gastric cancer and oral tongue squamous cell carcinoma [52,53].

Conclusions
In this paper, 14 genes related to TNBC survival were obtained by using bioinformatics and public databases. Some of the genes were related to cell proliferation and division, and some were related to chemokines. We found no articles indexed by NCBI that studied the function of SGO2, DLGAP5, KIF20A or CFD in TNBC, which suggests some novel research directions. It is necessary to study these genes and their biological pathways, as these studies may be a way to find the target genes of TNBC. There are also some shortcomings in this paper, such as the lack of experimental intervention data in the public databases. In the future, it is necessary to design relevant experiments to verify the research results.

Fig. 1 The data analysis process.