Identification of Significant Genes and Therapeutic Agents for Breast Cancer by Integrated Bioinformatics

Background: Breast cancer is the most commonly diagnosed malignancy in women; thus, more cancer prevention research is urgently needed. The aim of this study was to predict potential therapeutic agents for breast cancer and determine their molecular mechanisms using integrated bioinformatics. Methods: Summary data from a large genome-wide association study of breast cancer were derived from the UK Biobank. The gene expression profile of breast cancer was obtained from the Oncomine database. We performed a network-wide association study and gene set enrichment analysis to identify the significant genes in breast cancer. Then we performed Gene Ontology analysis using the STRING database and conducted Kyoto Encyclopedia of Genes and Genomes pathway analysis using Cytoscape software. We verified our results using the Gene Expression Profiling Interactive Analysis, PROGgeneV2, and Human Protein Atlas databases. Connectivity map analysis was used to identify small molecule compounds that are potential therapeutic agents for breast cancer. Results: We identified 10 significant genes in breast cancer based on the gene expression profile and genome-wide association study. A total of 65 small molecule compounds were found to be potential therapeutic agents for breast cancer. Conclusion: Combined analyses of network-wide association studies, gene expression profiles, and drug databases are helpful for identifying potential therapeutic agents for diseases. This method is a new paradigm that can guide future research directions.


Introduction
Breast cancer is a frequently diagnosed cancer in women with a family history [1]. Breast cancer is a heterogeneous disease with different molecular subtypes and biological behaviors. Gene microarray technology and immunohistochemical techniques have classified breast cancers into different types [2]. The estrogen receptor (ER) is the most important prognostic and predictive immunohistochemical marker in breast cancer. ER-negative tumors tend to be of higher histological grade, are more sensitive to chemotherapy, and are more likely to metastasize to visceral organs [3,4]. There is an urgent need to find available drugs and clarify their molecular mechanisms in breast cancer treatment.
Most previous studies have focused on identifying novel prognostic markers and drug targets for breast cancer [5][6][7]. Sulaiman et al. [8] reported that a synthetic azaspirane targets the Janus kinase/signal transducer and activator of transcription 3 pathway in breast cancer. Huang et al. [9] demonstrated that the Gαh-PLCδ1 signaling axis drives metastatic progression in breast cancer. However, because of toxicity, cost, and chemical effects, novel prognostic markers and drug targets for breast cancer need further research [10]. Not all previous findings contribute to breast cancer treatment, and breast cancer still lacks therapeutic targets and carries a poor prognosis. There is therefore still an urgent need to identify additional therapeutic and prognostic targets in breast cancer [11].
Genome-wide association studies (GWAS) are widely used to characterize the genetic mechanisms that underlie complex diseases. Integrative analyses of GWAS data are rapidly becoming a standard approach to explore the genetic basis of disease susceptibility [12]. Network-wide association studies (NetWAS) can identify relevant disease-gene associations by integrating tissue-specific networks and GWAS results [13,14]. Prior studies have shown that network-associated analysis of GWAS data is highly efficient when used to identify novel causal genes of complex diseases [15,16].
In this study, to better understand the molecular mechanisms and therapeutic agents for breast cancer, we identified significant genes in breast cancer by integrating GWAS data with breast cancer gene expression data. Then drug prediction analysis was performed to discover potential therapeutic agents for breast cancer. In total, 65 small molecule compounds were identified, including trichostatin A, LY-294002, econazole, Prestwick-1082, and vorinostat.

Summary of GWAS datasets in breast cancer
The UK Biobank is a large, population-based prospective UK study, which was established to identify genetic and nongenetic determinants of various diseases. It comprises approximately 500,000 individuals with extensively detailed phenotypes. Their genotypes were determined using an array that included 847,441 genetic polymorphisms, enabling the identification of novel genetic variants in a uniformly genotyped and phenotyped cohort of unprecedented size [17]. Samples from UK Biobank participants were genotyped on the UK Biobank Axiom array and the UK BiLEVE custom array. Genotype imputation was conducted with IMPUTE software against the UK10K haplotype panel and the 1000 Genomes Project phase 3 panel. GWAS analysis was performed by SNPTEST using a logistic regression model. A genome-wide gene-association study was performed using the MAGMA gene analysis tool, and multiple genes and genetic variants were identified. The Icelandic GWAS dataset from the deCODE Genetics genealogical database was based on whole-genome sequencing using Illumina technology. Finally, meta-analysis of single-nucleotide polymorphisms (SNPs) in the UK Biobank and deCODE samples was performed using the METAL analysis tool [18].
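To make the meta-analysis step concrete, the following is a minimal sketch of a fixed-effect, inverse-variance-weighted combination of per-cohort SNP effect estimates. This is one weighting scheme commonly offered by meta-analysis tools; the source does not state which scheme or settings were used, so the function name, inputs, and example values are illustrative assumptions rather than the actual pipeline.

```python
import math


def inverse_variance_meta(betas, ses):
    """Combine per-cohort effect sizes for one SNP (hypothetical helper).

    betas: per-cohort log odds ratios for the same SNP and allele.
    ses:   matching standard errors.
    Returns (pooled beta, pooled SE, z score, two-sided P value).
    """
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / (se ** 2) for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided P value under a standard normal null distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p_value


# Illustrative values only (not taken from the study).
beta, se, z, p = inverse_variance_meta([0.10, 0.20], [0.05, 0.05])
print(f"beta={beta:.3f} se={se:.4f} z={z:.2f} p={p:.2e}")
```

With equal standard errors the pooled estimate is simply the mean of the two effects, and the pooled standard error shrinks by a factor of the square root of two.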
The atlas of genetic associations in the UK Biobank (GeneATLAS, http://geneatlas.roslin.ed.ac.uk) helps researchers effectively analyze UK Biobank results without high computational costs. It also allows users to query genome-wide association results for 9,113,133 genetic variants and download GWAS summary statistics for more than 30 million imputed genetic variants (> 23 billion phenotype-genotype pairs) [19].
We downloaded large-scale GWAS breast cancer summary data from the atlas of genetic associations. Detailed descriptions of sample characteristics, experimental designs, statistical analyses, and quality control can be found in previous studies.

Gene expression datasets
Oncomine (https://www.oncomine.org) is a cancer microarray database and web-based data mining platform for facilitating discovery. In this study, differentially expressed genes (DEGs) in breast cancer were identified by comparing cancer samples to respective normal samples using the Oncomine database. The heatmap of significant DEGs in breast cancer was derived from Oncomine.
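As a rough illustration of what a tumor-versus-normal comparison involves for a single gene, the sketch below computes a log2 fold change and a Welch t statistic. The source does not describe the statistic applied by the database, so this is a generic stand-in; the function names and expression values are hypothetical.

```python
import math
from statistics import mean, variance


def log2_fold_change(tumor, normal):
    """log2 ratio of mean tumor expression to mean normal expression."""
    return math.log2(mean(tumor) / mean(normal))


def welch_t(tumor, normal):
    """Welch t statistic (unequal variances) for tumor vs. normal samples."""
    var_tumor = variance(tumor) / len(tumor)
    var_normal = variance(normal) / len(normal)
    return (mean(tumor) - mean(normal)) / math.sqrt(var_tumor + var_normal)


# Hypothetical expression values for one gene (linear scale).
tumor_expr = [8.0, 10.0, 12.0]
normal_expr = [2.0, 2.5, 3.0]
print(log2_fold_change(tumor_expr, normal_expr))
print(welch_t(tumor_expr, normal_expr))
```

A positive fold change with a large t statistic would mark the gene as overexpressed in tumors; in practice the test is repeated for every gene and the resulting P values are corrected for multiple testing.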

Identification of significant genes in breast cancer
NetWAS (https://hb.flatironinstitute.org/netwas/) integrates tissue-specific networks and significant GWAS association results, and identifies relevant disease-gene associations based on genomics. Briefly, SNP-level association statistics were converted into gene-level statistics (gene-based P values), which then were integrated with tissue-specific networks to predict the causal genes [16]. Greene et al. [13] demonstrated that NetWAS is more accurate than GWAS alone. In this study, we identified the most relevant genes in breast cancer using NetWAS.
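The conversion from SNP-level to gene-level statistics can be sketched as follows. This example uses a Šidák-corrected minimum P value per gene, which is a deliberately simplified stand-in and not the method used by MAGMA or NetWAS; the 0.01 cutoff for flagging nominally significant genes, the function names, and the input values are all assumptions made for illustration. The network-based reprioritization step is not shown.

```python
def gene_level_p(snp_pvalues):
    """Sidak-corrected minimum P value across the SNPs mapped to one gene.

    Assumes the SNPs are independent, which linkage disequilibrium
    violates in real data; dedicated tools account for this.
    """
    n_snps = len(snp_pvalues)
    return 1.0 - (1.0 - min(snp_pvalues)) ** n_snps


def label_genes(snp_p_by_gene, threshold=0.01):
    """Return gene-level P values and a flag for nominally significant genes.

    threshold is a hypothetical cutoff chosen for this sketch.
    """
    gene_p = {gene: gene_level_p(ps) for gene, ps in snp_p_by_gene.items()}
    flagged = {gene: p < threshold for gene, p in gene_p.items()}
    return gene_p, flagged


# Hypothetical SNP-level P values grouped by gene.
example = {"GENE_A": [0.001, 0.5], "GENE_B": [0.2, 0.3, 0.9]}
gene_p, flagged = label_genes(example)
print(gene_p, flagged)
```

In a network-based approach, flags of this kind would serve as labels for a classifier whose features come from the tissue-specific network, so that genes are re-ranked by how closely their network neighborhood resembles that of the flagged genes.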

Kyoto Encyclopedia of Genes and Genomes pathway and Gene Ontology analyses
Cytoscape is one of the most successful network biology analysis and visualization tools. It exposes more than 270 core functions and 34 applications as REST-callable functions with standardized JSON interfaces supported by Swagger documentation [20]. CluePedia, a plug-in in Cytoscape, can search certain Kyoto Encyclopedia of Genes and Genomes (KEGG) signaling pathways of certain genes by calculating linear and nonlinear statistical dependencies from experimental data [21]. KEGG signaling pathways were identified by CluePedia. Search Tool for the Retrieval of Interacting Genes (STRING) (https://string-db.org/cgi/input.pl) is an online tool for Gene Ontology (GO) analysis in gene sets [22,23]. GO is a commonly used bioinformatics tool that provides comprehensive information on the gene function of individual genomic products based on defined features consisting of three domains: biological process (BP), cellular component (CC), and molecular function (MF) [24]. We conducted GO analysis using the STRING database.
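Enrichment of a GO term or pathway in a gene set is often assessed with a one-sided hypergeometric test, sketched below. The source does not describe the statistic applied by STRING or CluePedia, so this is an assumption about the general technique; the function name and the counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, selected, overlap):
    """P(X >= overlap) when drawing `selected` genes without replacement.

    population: number of background genes.
    annotated:  background genes carrying the GO term or pathway label.
    selected:   size of the gene set being tested.
    overlap:    selected genes that carry the label.
    """
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(population, selected)


# Hypothetical counts: 20 background genes, 5 annotated, 4 selected, 4 overlap.
print(hypergeom_enrichment_p(20, 5, 4, 4))
```

Because many terms are tested at once, the resulting P values are normally adjusted for multiple testing before terms are reported as enriched.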

Analysis of the correlation between significant genes and breast cancer
Gene Expression Profiling Interactive Analysis (GEPIA, http://gepia.cancer-pku.cn) is a web server for analyzing RNA sequencing expression data of 9,736 tumors and 8,587 normal samples from The Cancer Genome Atlas and Genotype-Tissue Expression projects, using a standard processing pipeline [25]. The Human Protein Atlas (HPA, www.proteinatlas.org) is an immunohistochemistry-based map of protein expression profiles in normal tissues, cancer tissues, and cell lines, and provides a resource for pathology-based biomedical research including protein biomarker discovery [26][27][28]. Correlations between significant genes and breast cancers were analyzed with GEPIA and the HPA.

Analysis of the correlation between significant gene expression and overall survival
PROGgeneV2 (http://www.compbio.iupui.edu/proggene), a tool that can be used to predict the prognostic implication of genes in cancers, is written in PHP5 with a MySQL database backend, which stores gene expression data, covariates data, and metadata for catalogued studies in the form of relational database tables. Survival analysis in PROGgeneV2 is done using a backend R script; users can input multiple genes and use combined analysis to create survival plots for different genes of interest [29]. We used PROGgeneV2 to analyze the relationship between overall survival and genes that were overexpressed and underexpressed in breast cancer.
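Survival plots of this kind are typically built from Kaplan-Meier estimates computed separately for high-expression and low-expression groups. The sketch below shows the estimator for a single group; it is a generic illustration and not the tool's backend script, and the function name and follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate for one patient group.

    times:  follow-up time per patient.
    events: 1 if death was observed at that time, 0 if censored.
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set.
        at_risk -= leaving
    return curve


# Hypothetical follow-up (months) with one censored patient.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

Comparing the curves of the two expression groups, for example with a log-rank test or a Cox model, is what links a gene's expression level to overall survival.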

Drug prediction analysis
CMap (https://portals.broadinstitute.org/cmap/) is a collection of genome-wide transcriptional expression data from cultured human cells treated with bioactive small molecules and simple pattern-matching algorithms that together enable the discovery of functional connections among drugs, genes, and diseases through the transitory feature of common gene expression changes [30][31][32]. We used CMap to identify small molecule compounds as potential therapeutic agents to target the significant genes in breast cancer.
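The pattern-matching idea can be illustrated with a simplified connectivity score, modelled loosely on a Kolmogorov-Smirnov-style enrichment statistic. This is a sketch under stated assumptions and not the CMap implementation: the function names, the toy gene ranking, and the query signature are hypothetical. Genes are assumed to be ranked from most up-regulated to most down-regulated by a compound.

```python
def ks_score(ranked_genes, tag_genes):
    """Signed KS-style enrichment of tag_genes within ranked_genes."""
    n = len(ranked_genes)
    rank = {gene: i + 1 for i, gene in enumerate(ranked_genes)}
    positions = sorted(rank[g] for g in tag_genes if g in rank)
    t = len(positions)
    if t == 0:
        return 0.0
    # a: tags concentrated near the top; b: tags concentrated near the bottom.
    a = max((j + 1) / t - v / n for j, v in enumerate(positions))
    b = max(v / n - j / t for j, v in enumerate(positions))
    return a if a > b else -b


def connectivity_score(ranked_genes, up_genes, down_genes):
    """Positive: compound mimics the signature; negative: compound reverses it."""
    ks_up = ks_score(ranked_genes, up_genes)
    ks_down = ks_score(ranked_genes, down_genes)
    if (ks_up > 0) == (ks_down > 0):
        return 0.0  # Up and down tags move in the same direction: no call.
    return ks_up - ks_down


# Toy example: the compound pushes disease-up genes to the bottom of its ranking.
ranking = [f"g{i}" for i in range(1, 11)]
print(connectivity_score(ranking, up_genes=["g9", "g10"], down_genes=["g1", "g2"]))
```

Compounds with strongly negative scores are the ones of interest as candidate therapeutic agents, because their expression changes oppose the disease signature.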

GO and KEGG enrichment analyses of significant DEGs in breast cancer
BP analysis revealed that the significant genes in breast cancer were mainly enriched in the Wnt signaling pathway, calcium modulating pathway, protein repair, gene silencing by microRNA (miRNA), mRNA cleavage involved in gene silencing by miRNA, and positive regulation of epithelial cell proliferation involved in lung morphogenesis (Table 2). MF analysis showed that significant genes were enriched in functions related to oxidoreductase activity, acting on a sulfur group of donors and disulfide as acceptor, and phosphoinositide 3-kinase (PI3K) and PIK3CA activities (Table 2). CC analysis showed that significant genes were enriched in P-bodies. KEGG analysis revealed that significant genes in breast cancer were mainly involved in pathways in cancer, breast cancer, gastric cancer, melanoma, the PI3K/Akt signaling pathway, mitogen-activated protein kinase (MAPK) signaling pathway, Ras signaling pathway, tight junctions, and ubiquitin-mediated proteolysis (Fig. 3).

Correlation between significant DEGs and breast cancer
Consistent with the identification of significant genes, protein profiling in breast cancer samples from the HPA using immunohistochemistry showed that the expression of CLDN7, RBM33, SH3RF1, and UBE2Z was significantly enriched in breast cancer, whereas there was no significant enrichment of FGF7 and TNRC6B (Fig. 4).
Fig. 4: Immunohistochemistry of significant genes (CLDN7, RBM33, SH3RF1, UBE2Z, FGF7, and TNRC6B) in breast cancer.

Drug prediction analysis
To identify potential small molecule compounds with therapeutic effects on breast cancer, drug prediction analysis was performed by CMap. A total of 65 drugs were predicted, and the 10 most significant were trichostatin A, LY-294002, econazole, Prestwick-1082, vorinostat, lomefloxacin, clorsulon, amantadine, thiostrepton, and orciprenaline (Table 3).

Discussion
Breast cancer is the most commonly diagnosed malignancy in women worldwide and is the main cause of cancer-related death in women [33][34][35]. Despite significant advances in cancer research, breast cancer remains a major health problem and is a top biomedical research priority [36][37][38], as there is an urgent need for effective breast cancer treatments.
Protein profiling in breast cancer samples from the HPA using immunohistochemistry and analysis of significant DEGs in breast cancer samples compared to normal samples from GEPIA further verified the results. Significantly overexpressed genes (CLDN7, MLLT10, RBM33, SH3RF1, SSBP4, and UBE2Z) were correlated with shorter survival, whereas underexpressed genes (BMPER, FGF7, MSRB3, and TNRC6B) were correlated with longer survival in breast cancer.
Consistent with our findings, previous studies have shown that some of these genes play important roles in the development of breast cancer. For example, Bernardi et al. [39] showed that CLDN7 is associated with a shorter time to recurrence, suggesting its contribution to the aggressiveness of breast cancer. In a GWAS, Guo et al. [40] identified common genetic loci for breast cancer risk including SSBP4. Whole transcriptome analysis by Bauer et al. [41] demonstrated that BMPER plays a possible therapeutic role in breast cancer. Fu et al. [42] demonstrated that acetylation, expression and recruitment of FGF7 promoters induce cancer growth and progression. Zhu et al. [43] found that targeting FGF7 can exert oncogenic functions in breast cancer. A previous study showed that the ZEB1-MSRB3 axis is related to breast cancer genome stability [44]. Interestingly, other DEGs in breast cancer identified in this study, including MLLT10, RBM33, SH3RF1, UBE2Z, and TNRC6B, have not been reported in previous breast cancer studies. We believe that these are potentially novel key genes in breast cancer.
BP analysis in GO annotation indicated that the 10 significant genes are mainly enriched in the Wnt signaling pathway, which plays an important role in the occurrence and development of many cancers. Inhibiting this pathway can suppress breast cancer growth and metastasis [45][46][47]. MF analysis of GO suggested that the DEGs were most significantly enriched in functions related to oxidoreductase activity. Redox reactions accompany tumor development. CC analysis of GO annotation showed that the 10 DEGs were enriched in P-bodies. A previous study suggested that P-body disassembly correlates with breast cancer progression [48].
KEGG analysis of the 10 DEGs showed their enrichment in breast cancer, gastric cancer, melanoma, the PI3K/Akt signaling pathway, MAPK signaling pathway, Ras signaling pathway, tight junctions, and ubiquitin-mediated proteolysis. Some of these pathways contribute to the development of breast cancer. For example, the PI3K pathway is found in many types of cancer and plays an important role in breast cancer cell proliferation [49]. Ras signaling is a key determinant of poor survival in breast cancer patients [50]. MAPK regulators are widely used for triple-negative breast cancer-targeted therapy. Abnormal MAPK signaling plays a core role in the regulation of growth and survival and the development of drug resistance in triple-negative breast cancer [51].
The aim of this work was to identify significant genes and potential therapeutic agents for breast cancer based on genomics. We found 65 small molecule compounds that could potentially reverse the expression of significant genes in breast cancer. The 10 most significant drugs were trichostatin A, LY-294002, econazole, Prestwick-1082, vorinostat, lomefloxacin, clorsulon, amantadine, thiostrepton, and orciprenaline. Consistent with our study, trichostatin A, a histone deacetylase inhibitor, has been reported to have therapeutic potential in breast cancer [52]. Jiang et al. [53] showed that trichostatin A sensitizes ER-negative breast cancer cells to tamoxifen. LY-294002, a specific inhibitor of the PI3K pathway, can decrease the rate of cell growth and increase therapeutic sensitivity in MCF7 cells expressing wild-type p53, which may be useful for the treatment of breast cancer [54]. Econazole is a novel PI3K/AKT signaling pathway inhibitor, which can be used to overcome adriamycin resistance and improve chemotherapy sensitivity in breast cancer [55]. A preclinical study showed that vorinostat can prevent the formation of brain metastases in breast cancer [56]. Yang et al. [57] suggested that thiostrepton is a promising agent in triple-negative breast cancer. Kwok et al. [58] showed that thiostrepton selectively targets breast cancer cells through inhibition of Forkhead box M1 expression. However, some of the predicted drugs, such as Prestwick-1082, lomefloxacin, clorsulon, amantadine, and orciprenaline, have not been shown to directly play a role in breast cancer. Thus, future studies are needed to confirm our findings.
In conclusion, we conducted an analysis combining genomic data with drug database analysis to identify novel candidate therapeutic agents for breast cancer treatment. Our study demonstrates the usefulness of this approach for evaluating the relationship among genes, diseases, and drugs. These findings will pave the way for the discovery of potential therapeutic targets for breast cancer.

Declarations

Ethical statement
Our study did not require ethics board approval because it did not involve human or animal trials.

Consent for publication
Not applicable.