Comprehensive network analysis of integrin α11 expressed in human breast cancer

Background: Breast cancer is the most common malignancy in women worldwide, but few genes have been reported to be involved in breast cancer development. Methods: We constructed a co-expression network to expand upon the knowledge of the various molecular biomarkers in breast cancer development. Transcriptome data for all breast cancer and para-carcinoma tissue types were retrieved from The Cancer Genome Atlas (TCGA) database. We performed a co-expression analysis of the breast cancer transcriptome data using Pearson’s correlation coefficient (PCC) and mutual rank (MR)-based cutoffs with 11 guide genes, and employed Kaplan-Meier survival analyses to assess the prognostic value of hub genes. Results: A co-expression network centered on integrin α11 (ITGA11) of the extracellular matrix (including 11 guide genes and 72 edge genes) was extracted for breast cancer. The gene ontology (GO) analysis showed that various other genes may also be included in the breast cancer network. Among them, CEMIP, COL11A1, CTHRC1, ITGA11, LUM and P4HA3 were negatively associated with overall survival, and ADAMTS12 and LOXL1 were positively associated with overall survival at early stage. However, there was no significant association between the expression of the eight genes and disease-free survival. Conclusions: Our co-expression analysis of the breast cancer transcriptome is a useful resource for breast cancer researchers, as it enables them to elucidate important and complex biological events to prevent and predict cancer.

Background
Breast cancer is one of the most common malignancies in women. There are approximately 1.2 million new breast cancer cases and 500,000 deaths worldwide each year [1]. The occurrence and development of breast cancer is a complex biological process involving multi-gene participation, multi-factor action, and multi-stage progression, which depends on the interaction and regulation of multiple genes, transcripts, and proteins [2]. A promising strategy for the treatment of breast cancer and other cancers is molecular-targeted therapy, which requires a detailed understanding of the specific molecular events and pathological pathways involved in the development and proliferation of breast cancer. To better classify and characterize tumors, it is necessary to identify biomarkers associated with breast cancer, which may contribute to the design of multimodal treatment options for individual patients and may also improve clinical efficiency and clinical outcome assessments [3].
To study candidate biomarkers in cancer, it is important to systematically analyze the interactions and functions of genes that exhibit similar expression patterns. For breast cancer, one of the most effective methods is to examine a co-expression network. Gene co-expression network analysis calculates the correlation between gene expression levels from large-sample expression profiles and incorporates highly correlated genes into biologically meaningful regulatory networks, visually revealing the relationships between functional molecules in biological systems. It can be used to discover key target genes with important biological functions and significance [4,5]. To date, the TCGA program has collected and offered a large amount of transcriptome data on breast cancer, which is valuable for gaining a comprehensive understanding of transcriptional pathways and for mining biomarker genes.
In this study, we searched the TCGA database for transcriptome data from the tissues of all breast cancers and para-carcinomas involved in the development of breast cancer. We analyzed each of the guide genes and extracted a network of structural components of the extracellular matrix (ECM) (11 guide genes and 72 edge genes). All genes obtained from the co-expression network of this study may act as resources when exploring candidate biomarker genes in the further study of cancer prevention and control.

Microarray data
We searched the transcriptome data of breast cancer from the TCGA, including the gene profiles of all breast cancer and para-carcinoma tissues, for a total of 1,028 samples (see Table S1).

Gene ontology (GO) and expression coherence (EC)
The GO terminology was analyzed to obtain the genes associated with breast cancer and to classify their cellular components (CC), biological processes (BP), and molecular functions (MF). The GO terminology for the GO enrichment analysis and EC analysis was retrieved from DAVID (http://david.abcc.ncifcrf.gov/). For the GO enrichment analysis, the significance of GO enrichment in co-expressed genes was evaluated using the hypergeometric distribution against a background set of 15,438 genes derived from the human co-expression network dataset. Without multiple-testing correction, a BP p-value <0.01 and an MF p-value <0.1 were set as significance thresholds. EC analysis was performed as described by Aya K et al. [6]: genes with the same GO term (containing 20-100 genes) were used as a set of predefined functional genes. To calculate the EC of the paired genes for each GO classification, Pearson's correlation coefficient (PCC) was used to measure expression similarity. A PCC threshold of 0.722 was used, corresponding to the median of all PCC values, which was above the 99th percentile of the random PCC distribution derived from 1,000 random genes (yielding approximately 1,000 × 999 × 0.5 gene pairs). To calculate the random EC of a GO category, a random gene set of the same size as the category under study was sampled.
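The hypergeometric enrichment test described above can be illustrated with a minimal sketch. This is not the DAVID implementation; it only shows the upper-tail probability that such a test typically reports. The function name and all argument names are hypothetical, and only the background size of 15,438 genes comes from the text.

```python
from math import comb


def hypergeom_enrichment_p(background_size, term_size, sample_size, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    background_size: number of genes in the background set (e.g. 15,438)
    term_size:       background genes annotated with the GO term
    sample_size:     number of co-expressed genes being tested
    overlap:         co-expressed genes annotated with the GO term
    """
    max_overlap = min(term_size, sample_size)
    total = comb(background_size, sample_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, sample_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total
```

A term would then be called enriched when this value falls below the chosen threshold (for example 0.01 for BP), with no multiple-testing correction applied, as stated in the text.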

Co-expression analysis and hub genes validation
To construct a co-expression network, we calculated the PCC values for all combinations of the 60,488 unique genes present in the 1,028 samples downloaded from the TCGA database. We estimated a PCC threshold of 0.722, corresponding to the median of all PCC values, which was above the 99th percentile of the random PCC distribution of 1,000 random genes (approximately 1,000 × 999 × 0.5 gene pairs). Following Obayashi's report [7], we also calculated the mutual rank (MR) value between gene pairs as a second measure of co-expression to further reduce the number of false positives. Only gene pairs with an absolute MR value <10 were considered to be important links in the co-expression network. Calculations were performed with R/Bioconductor. To extract the breast cancer subnetwork dataset, we extracted the network vicinity by taking two steps out from each guide gene, as described by Mutwil et al. [8]. The network was visualized using the Cytoscape program.
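The PCC and MR filtering can be sketched as follows. This is a small pure-Python illustration, not the authors' R/Bioconductor code. It assumes that MR is the geometric mean of the two directional correlation ranks, which is the usual definition, and that no expression profile is constant. The function names and the dictionary layout are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mutual_ranks(expr):
    """Map each gene pair (a, b) to (PCC, MR).

    expr: dict of gene name -> list of expression values across samples.
    """
    genes = sorted(expr)
    pcc = {a: {b: pearson(expr[a], expr[b]) for b in genes if b != a}
           for a in genes}
    # rank[a][b] is the rank of b among the partners of a (1 = highest PCC).
    rank = {}
    for a in genes:
        ordered = sorted(pcc[a], key=lambda b: -pcc[a][b])
        rank[a] = {b: i + 1 for i, b in enumerate(ordered)}
    pairs = {}
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            pairs[(a, b)] = (pcc[a][b], sqrt(rank[a][b] * rank[b][a]))
    return pairs


def select_edges(pairs, pcc_cutoff=0.722, mr_cutoff=10):
    """Keep pairs that pass both the PCC and the MR cutoff."""
    return {pair: values for pair, values in pairs.items()
            if values[0] > pcc_cutoff and values[1] < mr_cutoff}
```

Because MR combines the rank of gene B in the list of gene A with the rank of gene A in the list of gene B, a pair is kept only when each gene is among the closest partners of the other, which is how the MR cutoff reduces false positives.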
Overall survival and disease-free survival for genes from the breast cancer co-expression network were analyzed using the Kaplan-Meier plotter (www.kmplot.com) [9]. The patients were stratified into a high-expression group and a low-expression group according to their expression levels.
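For readers unfamiliar with the Kaplan-Meier method, the estimator behind such survival curves can be sketched in a few lines. This is not the code of the Kaplan-Meier plotter web tool. The function names are hypothetical, and the expression cutoff is a free parameter because the text does not state how it was chosen.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each observed event time.

    times:  follow-up time for each patient
    events: 1 if the event (e.g. death) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def split_by_cutoff(expression, cutoff):
    """Return patient indices for the high and low expression groups."""
    high = [i for i, value in enumerate(expression) if value > cutoff]
    low = [i for i, value in enumerate(expression) if value <= cutoff]
    return high, low
```

One curve would be computed per group, and the two curves would then be compared, typically with a log-rank test.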

Biological significance of expression similarity
We first used the GO terminology for EC analysis to examine the biological significance of expression similarity in the transcriptome datasets of the 1,028 samples downloaded in this study (Table S1). EC analysis is a statistical method used to detect whether the transcriptional profiles of genes belonging to a predefined functional gene set are interrelated [10]. EC scores measure expression similarity within a predefined functional gene set [10]; therefore, when most genes with the same GO term are co-expressed with each other, a higher EC score is obtained. In our EC analysis, EC scores were higher in all three GO groups (biological processes [BP], molecular functions [MF], and cellular components [CC]) than in the random sampling (Fig 1). Among the three GO groups, CC showed the highest expression similarity: approximately 28.55% of its categories showed higher EC scores than random sampling at a threshold EC score of 0.15, while approximately 25.86% and 25.62% did so in BP and MF, respectively (Fig 1 and Table S2).
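The EC score and the fraction of categories plotted in Fig 1 can be illustrated with a short sketch. It assumes that the EC score of a category is the fraction of its gene pairs whose PCC exceeds the threshold, which matches the description above but is not taken from the authors' code. All function names are hypothetical, and constant expression profiles are not handled.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def expression_coherence(profiles, pcc_threshold=0.722):
    """Fraction of gene pairs in one GO category with PCC above the threshold.

    profiles: list of expression profiles, one per gene in the category.
    """
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0
    hits = sum(1 for x, y in pairs if pearson(x, y) > pcc_threshold)
    return hits / len(pairs)


def fraction_of_categories_above(ec_scores, ec_threshold):
    """Share of categories whose EC score exceeds ec_threshold (y-axis of Fig 1)."""
    return sum(1 for score in ec_scores if score > ec_threshold) / len(ec_scores)
```

Applying the same two functions to randomly sampled gene sets of matching size would give the random baseline against which the GO categories are compared.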

Construction of the breast cancer network
Using the PCC method, the PCC threshold was 0.722, corresponding to the median of all PCC values, which was above the 99th percentile of the random PCC distribution obtained for 1,000 random genes. We generated the complete human breast cancer co-expression dataset from the 1,028 samples that we downloaded. The final dataset of the network used in this study contained 40,750 genes (nodes) and 209,928 gene pairs (edges). From the entire dataset of the human breast cancer co-expression network, we were able to construct a network around a gene of interest, which served as a guide gene, and used an MR-based cutoff to obtain genes with very close expression profiles.
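Extracting the vicinity of a guide gene by taking two steps out, as described in the Methods, amounts to a depth-limited breadth-first search over the filtered edge list. The sketch below is an illustration only. The function name is hypothetical, and placeholder gene names such as "G1" are not genes from the network.

```python
from collections import deque


def neighborhood(edges, guide, max_steps=2):
    """Return {gene: distance} for genes within max_steps of the guide gene.

    edges: iterable of (gene_a, gene_b) pairs that passed the PCC/MR cutoffs.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    distance = {guide: 0}
    queue = deque([guide])
    while queue:
        node = queue.popleft()
        if distance[node] == max_steps:
            continue  # do not expand beyond the step limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance
```

Running this once per guide gene and taking the union of the resulting gene sets would give a subnetwork of the kind shown in Fig 2.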
The ECM, glycoproteins, signal transduction, and secretion play important roles in tissue development, cancer formation, and invasion. To test whether our co-expression network analysis could identify useful transcriptional networks related to human breast cancer development, we selected 11 genes involved in these pathways or functional processes to find which genes were co-expressed with them in all samples. After screening by the PCC cutoff and MR, a complex network of 72 genes was obtained (Fig 2), and a co-expression network centered on integrin α11 (ITGA11) (guide gene; green circle in Fig 2) was constructed. The blue circles represent the remaining 10 guide genes. As can be seen from Fig 2 and Table S3, the remaining 10 guide genes show a highly positive correlation with ITGA11. These genes were classified according to the UniProtKB keyword enrichment analysis (Table 1). As can be seen from the table, ITGA11 shares similar functions with the 10 guide genes and can also be found in the same pathways.

GO analysis and KEGG pathway analysis of the breast cancer network
The strong association among the 11 guide genes in the primary expression cluster indicates that the genes involved in breast cancer can be tightly co-regulated at the expression level; thus, breast cancer development may be an appropriate subject for co-expression analysis. The results of the GO analysis also support this (Fig 3). In the biological process analysis, terms related to collagen catabolism (e.g., "collagen catabolic process", "collagen fibril organization", and "skeletal system development") and terms related to the structural components of the ECM (e.g., "extracellular matrix organization", "cell adhesion", "extracellular matrix disassembly", "cell-matrix adhesion", and "integrin-mediated signaling pathway") were enriched among the genes of the breast cancer network (Fig 3a). The cellular components of the genes obtained in the breast cancer network were mainly associated with the ECM (e.g., "extracellular matrix", "proteinaceous extracellular matrix", "collagen trimer", "extracellular region", and "extracellular space"; Fig 3b). Additionally, the molecular functions of the obtained genes were primarily related to binding (e.g., "collagen binding", "integrin binding", "heparin binding", and "calcium ion binding"; Fig 3c).
Furthermore, the GO analysis (Fig 3) suggests that various other genes can be included in the breast cancer network (white circles, Fig 2). This suggests that some genes that have not been reported in breast cancer can be mined from the existing data, and that new interactions between these genes can also be identified from connections in the network (Fig 2). Cancer development is a very specific biological event; as such, co-expression network analysis may be particularly well suited to identifying genetic interactions in breast cancer.
In addition, we performed a KEGG pathway analysis of the genes obtained from the breast cancer co-expression network. Consistent with the GO analysis above, pathways related to ECM receptor interaction, focal adhesion, and protein digestion and absorption were enriched. We were surprised to find that pathways for infectious diseases, such as amoebiasis, were also enriched among the genes obtained by co-expression analysis in this study (Table S4). All of these results support our co-expression analysis of gene predictions for breast cancer, which may be useful for subsequent studies and in the design of various medical treatments.

Identification and validation of hub genes
Based on the UniProtKB keyword enrichment analysis, 62 genes with high correlation with ITGA11 were identified as hub genes (Table 1). Survival analysis of the hub genes was performed using the Kaplan-Meier plotter [9]. The patients were stratified into a high-expression group and a low-expression group according to their expression levels. Using customized high and low cutoffs for eight genes, CEMIP, COL11A1, CTHRC1, ITGA11, LUM and P4HA3 were negatively associated with overall survival, while ADAMTS12 and LOXL1 were positively associated with overall survival at early stage (Fig 4). However, there was no significant association between the expression of the eight genes and disease-free survival (Fig S1).

Discussion
The aim of this study was to identify the various gene interactions involved in the biological events associated with breast cancer using the 1,028 samples from TCGA, specifically exploring human breast cancers and normal tissues. The combined application of these valuable data allows us to build high-resolution networks (Fig 2) that contain useful information for detecting gene regulatory networks in biological processes [7]. This suggests that we can perform co-expression screening using reliable network datasets and an effective human cancer database to track the genes involved in specific biological events [11]. In this study, 62 genes with high correlation with ITGA11 were identified as hub genes based on the UniProtKB keyword enrichment analysis. We selected 11 important genes involved in breast cancer or in the development of other cancers (COL8A1, COL10A1, COL11A1, COL12A1, ITGA11, AEBP1, CORIN, HSD17B6, THBS2, ANTXR1, and FN1) to construct a gene co-expression network. Furthermore, among the genes in the network, CEMIP, COL11A1, CTHRC1, ITGA11, LUM and P4HA3 were negatively associated with overall survival, while ADAMTS12 and LOXL1 were positively associated with overall survival at early stage. The ECM receptor interaction pathway, the focal adhesion pathway, and the PI3K-AKT pathway have been shown to be involved in the progression of breast cancer [12,13].
ITGA11 is a member of the integrin family, which is involved in various processes that affect the biological behavior of cells, such as metastasis, embryogenesis, hemostasis, the immune response, tissue repair, cancer growth, tumor angiogenesis, and resistance to treatment [14,15]. Its expression has been shown to be upregulated under malignant conditions, such as in non-small cell lung cancer, where it has been suggested to be associated with cancer-cell growth [16,17]. In addition, high expression of ITGA11 is associated with the prognosis of breast cancer [18], glioblastoma [19], pancreatic cancer [20] and lung cancer [21]. Based on its gene ontology annotations, ITGA11 may participate in collagen binding and cancer-cell differentiation, suggesting that it is involved in the pathological process of breast cancer.
He et al. [22] found that COL1A1, COL3A1, COL4A1, and COL11A1 were significantly enriched in the focal adhesion and ECM receptor interaction pathways. In this study, the genes in the blue circles, including COL8A1, COL10A1, COL11A1, and COL12A1, together with ITGA11 in the green circle, were identified as key genes located upstream of these three signaling pathways. The first four genes are members of the collagen (COL) family and have been reported to be associated with breast cancer.
FN1 is a glycoprotein that exists as a dimer on the cell surface and in the ECM and is involved in cell adhesion, cell migration, wound healing, and metastasis [23]. FN1 is upregulated in breast cancer, and it has also been identified as a central protein in previously constructed protein-protein interaction (PPI) networks [24]. A prior gene regulatory network analysis of breast cancer showed that FN1 is upregulated in invasive breast cancer cell lines and is associated with the aggressive behavior of breast cancer cells [25]. FN1 is associated with five significantly enriched KEGG pathways, including "focal adhesion", "protein digestion and absorption", and "ECM receptor interaction" (Table S3).
ANTXR1 is a transmembrane protein that is highly expressed in breast cancer cells [26]. ANTXR1 interacts with lipoprotein receptor-associated protein 6 (LRP6) and vascular endothelial growth factor receptor 2 (VEGFR2), and regulates the Wnt and VEGF signaling pathways, respectively [27,28]. In addition,

Availability of data and materials
The GO terminology for the GO enrichment analysis and EC analysis was retrieved from DAVID (http://david.abcc.ncifcrf.gov/).

Figure 1
In the three GO groups (BP, blue; CC, red; and MF, gray), the EC score was calculated for each GO category (only gene numbers between 20 and 200) and compared with that of random sampling (orange) to estimate the statistically significant level. The fraction of categories (y-axis) indicates the ratio of the number of GO categories with a higher EC score than each threshold EC score (x-axis) to the total number of GO categories. The total number of categories in each GO group is indicated in parentheses.

Figure 2
The BC co-expression network in humans. The co-expression subnetwork was constructed using 11 reported human breast cancer or other cancer genes as guide genes (blue circles).
A link between two nodes indicates a direct interaction with PCC >0.722 and MR <10. The subnetwork vicinity is extracted by taking two steps out from each guide gene.
