Exploration of Tumor Microenvironment-Related Biomarkers for the Prognosis of Triple Negative Breast Cancer

Background: Triple negative breast cancer (TNBC) is one of the most disastrous breast cancer subtypes world widely. The tumor microenvironment (TME), especially the inltration of immune and stromal cells, are highly related to the occurrence, development and prognosis of breast cancer. Therefore, exploration of TME-related biomarkers is greatly important for improving overall survival rate of TNBC patients. Methods: The open-assess Cancer Genome Atlas (TCGA) database provides gene expression prole for a variety of malignant tumors allowing researchers to explore genes demonstrating TME in the prognosis prediction of TNBC. In our present study, ESTIMATE algorithm was used to calculate the immune and stromal scores in accordance with the characteristics of specic genes in immune and stromal cells, and divide them into high and low-score groups. Limma R package was then utilized to screen differentially expressed genes (DEGs). After that, functional enrichment analysis and protein-protein interaction (PPI) network were performed to explore the bio-information of the DEGs and their encoded proteins. Subsequently, the identied these genes were further veried in the Gene Expression Omnibus (GEO) datasets. Results: Eight genes, including ACAP1, DUSP1, LYZGZMA, SASH3, CCL5, CD74, and DPT, were explored to closely related to the TME of TNBC. A prognostic model containing these selected genes was established with a high eciency in the prediction of the poor prognosis of TNBC patients. Conclusion: An eight-gene prognostic model was a considerably reliable approach for predicting the overall survival of TNBC patients, and could help clinicians selecting personalized treatments for their TNBC patients. score groups in TNBC and identify potential genetic biomarkers using data in cBioPortal. By RNA-sequencing (RNA-seq) survival analysis, we explore eight gene eight-gene prognostic model including ACAP1, DUSP1, LYZ, GZMA, SASH3, CCL5, CD74, and DPT; the model was validated in the GEO data set. This prognostic model is promising to accurately predict the prognostic status of TNBC patients for the TME of different individuals.


Introduction
Accounting for about 15-20% of breast cancers, triple negative breast cancer (TNBC) is currently the most aggressive and lethal breast cancer subtype, posing a great threat to women's health [1]. Previous studies have showed that the TNBC had an increased likelihood of distant metastasis, the peak being within the rst 3 years after onset. The mortality of TNBC within 5 years is signi cantly higher than other types of breast cancer [2,3]. Multiple factors like age, menopause, tumor size, histological grade, and pathological stage play roles in the short and long term of clinical outcomes [4]. However, due to the extremely complex molecular mechanisms in tumor as well as the limited predictive strength of conventional clinical information [5], there is a greatly urgent need to explore more advanced techniques to accurately predict the prognosis of TNBC patients and develop personalized precise therapies.
Genomic sequencings and practicable datasets are emerging in this era of genomics [6]. Those tools have made great contributions to tumor diagnosis and prognosis prediction. The genes inherent in tumor cells, especially the main transcription factor, determine the initiation, progression and evolution of TNBC [7,8]. Being similar to the solid tumor portions manifested the progressive level of tumor, tumor microenvironment (TME) contributing to tumor occurrence, development, and treatment have been widely reported in recent year [9][10][11]. TME is a complex biosystem composed of extracellular matrix, stromal cells (such as broblasts, adipocytes and mesenchymal stromal cells) and immune cells (such as B and T lymphocytes, macrophages and natural killer cells) [12]. Immune cells and stromal cells are two main types of non-tumor components in TME, which are considered to be of great signi cance for the diagnosis and prognostic evaluation of tumors [13]. The range of stromal cells is a sensitive prognostic predictor for patients with solid cancer [16]. To be speci c in breast cancer, higher immune in ltration indicates better clinical outcomes for the patients, especially in ER-negative patients. Higher CD8 + T cell in ltration indicates better overall survival (OS) of breast cancer patients [14]. Patients with hyperimmune in ltration showed satisfactory response to the neoadjuvant chemotherapy and adjuvant chemotherapy [15]. Therefore, in term of the special role of TME, it is crucial to accurately predict the purity and prognosis of TNBC and develop individualized treatment plans for different individuals with different TME.
In order to assess the in ltration of immune cells and stromal cells, an online ESTIMATE (Estimation of Stromal and Immune cells in Malignant Tumor tissues using Expression data) algorithm was created to calculated the immune scores and stromal scores in tumor. By analyzing the expression characteristics of speci c genes in immune and stromal cells, ESTIMATE algorithm is able to predict the in ltration of non-tumor cell [17]. Until now, it has been applied to the analysis of multiple solid cancers, including prostate cancer [18], breast cancer [11,19], and colon cancer [20], demonstrating the potential e ciency of this data-based algorithm. In the most recent research, ESTIMATE was utilized in detection of TME of luminal breast cancer. A ceRNA network associated with the TME, consisting with miRNAs, lncRNAs and mRNAs was identi ed to have prognostic value for the luminal breast cancer.
In this study, we aimed to explore the expression pro les of high and low immune and stromal score groups in TNBC and identify potential genetic biomarkers using data in cBioPortal. By RNA-sequencing (RNA-seq) survival analysis, we explore eight gene eight-gene prognostic model including ACAP1, DUSP1, LYZ, GZMA, SASH3, CCL5, CD74, and DPT; the model was validated in the GEO data set. This prognostic model is promising to accurately predict the prognostic status of TNBC patients for the TME of different individuals.

Demographic characteristics of patients
A total of 299 samples with pathological diagnosis of TNBC were downloaded and screened from the cBioPortal website. Table 1 describes the patient's detailed demographic characteristics and baseline characteristics.  Fig. 1). In all the characteristics shown in Table 1, we found that the distribution of immune scores and stromal scores in histological subtypes and CLAUDIN subtypes are different (p < 0.05), and the comparison results of each subtype are shown in Table 2. In addition to this, the stromal in ltration score also varies in diagnostic cell count, grade and stage (p < 0.05), but the immune in ltration score is similar between them (p > 0.05); the immune in ltration score between different tumor sizes are also different (p < 0.05) ( Table 1). Kaplan-Meier chart was used to analyze the relationship between the immune / stromal score and the corresponding overall survival rate. The results showed that a lower immune score was signi cantly associated with a lower OS (Fig. 2, p < 0.05). However, no difference in survival data was observed in immune score and ESTIMATE score (p = 0.08 and 0.42, respectively) ( Fig. 2). Comparison of TNBC gene expression pro le with immune score and stromal score To reveal the correlation between the overall gene expression pro le and the immune score and / or stromal score, we compared the Affymetrix microarray data of 299 TNBC patients obtained from the cBioPortal website. The heat map in Fig. 3 showed the different gene expression pro les of cases in the high and low immune / stromal score groups. 164 DEGs (| logFC |> 1, p < 0.05) were found in the comparison of immune scores. Similarly, 124 DEGs were found in the comparison of stromal scores (| logFC |> 1, p < 0.05). However, the DEGs extracted from the comparison of the high and low immune / stromal score groups were signi cantly different, the two groups of DEGs data were combined for the subsequent analysis.

Function enrichment analysis and construction of the PPI network
In order to explore the biological signi cance of DEGs, we performed GO function enrichment analysis on DEGs in the immune scores and stromal scores. The results showed that DEGs in the immune scores were signi cantly enriched in biological processes such as the T cell activation, the regulation of leukocyte activation, the side of membrane, the antigen binding (p < 0.05) (Fig. 4); DEGs in the stromal scores were signi cantly enriched in biological processes such as the extracellular matrix organization, the extracellular matrix, the extracellular matrix structural constituent (p < 0.05) (Fig. 4). In addition, in order to explore the relevant pathways of the TNBC microenvironment, we conducted a KEGG pathway analysis of all DEGs (p < 0.05). The Fig. 5 showed the main pathways involved such as the phagosome, the cytokine-cytokine receptor interaction, and the chemokine signaling pathway.
In order to explore the protein interactions related to the TNBC microenvironment, we conducted a PPI network analysis (interaction score > 0.9) of DEGs on the STRING online website. We selected the hub genes by the number of node connections and displayed them with barplot charts (Fig. 6). We found that most of these hub genes were important immune-related factors, indicating that TNBC was very immunogenic and had strong immune characteristics.
Establishment of gene-related prognostic models By using single factor Cox survival analysis of DEGs related to TNBC microenvironment, 39 genes signi cantly related to survival were obtained (p < 0.05). Then, the top 20 were taken to enter into the multifactor regression analysis, and the stepwise regression method was used to establish the best model. The model nally included eight genes (Fig. 7). The predictive model was characterized by linear combination based on the expression levels of eight genes weighted by their relative coe cients, resulting in the following model: risk score (RS) = -0.5076 * ACAP1 + 0.2347 * DUSP1 + 0.2336 * LYZ − 0.2973 * GZMA + 0.5710 * SASH3 -0.2512 * CCL5 + 0.5367 * CD74 -0.4422 * DPT.
The risk value of each patient was calculated according to the risk scoring formula, and it was divided into a high-risk group (n = 149) and a low-risk group (n = 150) according to the median value of the patients. There were signi cant differences in the KM plot survival curves of the two groups (the median OS was 3.01 years and 4.54 years, p < 0.05; Fig. 8A). The prognostic survival of the high-risk group was signi cantly lower than that of the low-risk group. The ROC curve was used to evaluate the e cacy of RS in predicting 5-year survival of patients. The AUC value reached 0.737, indicating that the model was accurate and reliable (Fig. 8B).
Veri cation of the prognostic model The breast cancer dataset (No. GSE103091) was downloaded from the GEO, and 107 TNBC samples was selected. After that, the established model was validated depending on the newly enrolled cases. In brief, the risk value of each patient was calculated using the above risk score formula, and the accuracy was evaluated to predict the 5-year survival of the subjects. As shown in the results, the AUC of the established model reach as high as 0.636, indicating its broadly applied e ciency in TME estimation for TNBC patients (Fig. 9).

Independence analysis of prognostic model and other clinical characteristics
The clinical information of 277 patients in the cBioPortal TNBC cohort was downloaded, and the univariate and multivariate Cox regression analysis was used to evaluate the independent predictive value of the prognostic model of TNBC. The ndings of the univariate Cox regression analysis indicated that the prognostic model and histological subtype showed considerable prognostic value (p < 0.05). In contrast, the diagnosis age, left/right position, grade, stage, tumor size, and number of lymph nodes were not related to OS. Therefore, the prognostic model and the histological subtype were taken into a multifactor Cox regression analysis to clarify their speci c effectiveness with OS. In our further analysis, we concluded that prognostic model and histological subtype were the independent prognostic factors related to OS (Fig. 10).

Discussion
TNBC is still the serious subtype of breast cancer due to its complex molecular and cellular heterogeneity with a increasing incidence worldwider [24]. Previous studies have shown that TME is highly correlated with the occurrence, progression and prognosis of breast cancer [25]. TME is where the immune system interacts with the tumor, indicating that the plastic cells of the tumor and immune system are an important part of TME. Undoubtably, the TME play a vital role in the development of various kinds of cancer, expecially in breast cancer [13,26]. Therefore, in our present study, we used the TNBC data obtained from the open-access cBioPortal website to identify speci c TME-related genes. Those genes represent the bioactivities of immune and stromal components in TME and might pose a great impact on the prognosis of the TNBC.
In our present study, we calculated the immune and stromal scores in TME by the ESTIMATE algorithm to investigate the in ltrated level of immune and stromal cells in TME of TNBC. Our results illustrated that in the claudin subtype, the immune score of Claudin-low subtype was signi cantly higher than that of other subtypes (p < 0.05). Kaplan-Meier analysis also showed a higher immune score predicted OS in TNBC patients Good (p < 0.05). This outcomes was consistent with the results published by Kay Dias that Claudin-low cancer has different clinical pathology and prognostic characteristics from other types of breast cancer. In terms of disease free survival (DFS), Claudin-low cancer has the best 10-year prognosis (72.5%, p = 0.002) [27].
In order to explore the potential mechanism of TME changes, we performed GO function enrichment analysis on the screened 289 differentially expressed genes, and found that majorities were related to the TME. Then, a PPI network was constructed assess the interactions between the corresponding TMErelated proteins. Multiple critical genes was determined as they had higher number of node connections in PPI. Interesting, most of the genes are found to be immune-related, indicating that TNBC is highly immunogenic with strong immune characteristics. In the identi ed genes, COL1A1 promote the metastasis of breast cancer especially in TNBC cell lines as previous reported. In metastasis, extracellular matrix (ECM) secreted more COL1A1 than usual to regulate the bioprocess of cells and ECM, subsequently resulting in an invasive and metastatic phenotype [28]. Pre-immune markers CXCL9, CXCL13 were positively correlated with enhanced number of tumor in ltrating lymphocyte (TIL), and were signi cantly associated with the longer DFS and (pathologic complete response, pCR) [29,30].
In the study, a survival analysis was performed to explore the potential prognostic value of 289 DEGs and establish a risk model for predicting the prognosis of TNBC. We identi ed eight TME-related genes (ACAP1, DUSP1, LYZ, GZMA, SASH3, CCL5, CD74, DPT). The expression levels of these genes in TNBC patients had signi cant correlations with poor prognosis, indicating that our huge data analysis via ESTIMATE algorithm on the cBioPortal were greatly successful [31][32][33][34]. Among these eight genes, we are particularly interested in ACAP1, DUSP1, and GZMA. ACAP1 (ArfGAP With Coiled-Coil, Ankyrin Repeat And PH Domains 1) is a gene encoding protein. Hoffman et al. reported that the expression of the six genes identi ed by the genes was related to the risk of breast cancer, including the expression of three genes in breast tissue (RCCD1, DHODH and ANKLE1), and three in whole blood genes (RCCD1, ACAP1 and LRRC25). ACAP1 played a role in cell proliferation and activation of Arf6 protein [20]. In addition, ACAP1 might be correlated with EEC / PI, and regulated the recycling of integrin β1 during cell migration [7]. Dualspeci city phosphatase-1 (DUSP1 / MKP1) was a member of the triple-tyrosine dual-speci city phosphatase family. As a protein phosphatase, DUSP1 downregulates p38 MAPK and JNKs signals by directly dephosphorylating threonine and tyrosine. DUSP1 participated in multiple bioprocess such as cell proliferation, differentiation and apoptosis. It was showed that [33] TNBC had the highest frequency of DUSP1 methylation if comparing with other breast cancer subtypes. Therefore, DUSP1 methylation was considered as a unique subtype-speci c marker in TNBC patients. Granzyme A, one of the ve granzymes encoded in the human genome, was a human protein encoded by GZMA, which was closely related to immunity. In a meta-analysis, it was found that at least half of malignant tumors had low or missing GZMA protein expression. High levels of GZMA and PRF1 synergistically affected the survival rate of tumors [16].
By using the established prognostic model to predict the 5-year survival rate of TNBC patients, we concluded that the AUC of the ROC curve was 0.737. It meant that the prognostic model had a good survival prediction performance. In the coming future practices, TNBC patients will be divided into highrisk groups and low-risk groups as the mRNA-based risk score prognostic models suggested. Clinicians can determine the therapies based on the predicted outcomes of the model, so as to achieve personalized treatments of TNBC patients. Particularly in high-risk populations, positive strategies should be adopted to prevent TNBC recurrence. Meanwhile, high-risk populations should also be followed up more frequently, and breast MRI scans should be performed routinely to detect the TNBC recurrence earlier. We also demonstrated that the prognostic model was independent of other clinical factors in TNBC. In the GEO dataset (No. GSE103091) to predict the patients' 5-year survival rate, in order to verify its predictive ability, the AUC value of the ROC curve reaches 0.636. This shows that our model has signi cance in wide application.
As far as we concerned, these eight biomarkers have not yet been studied in TNBC before. Hence, our ndings can provide a solid basic foundation for the development of these new prognostic factors for TNBC in clinical practices, particularly in diagnostic kits exploration. The advantages of the predictive genes we identi ed is that no further requirement of needed somatic mutation assessment were needed in patients. In addition, our method greatly reduces the cost of sequencing, which makes targeted sequencing applications more cost-effective and routine. Accurate prognosis was essential for appropriate treatment selection. In routine clinical practice, pathological stage classi cation was common evaluations of prognosis in TNBC patients [35]. However, the clinical outcomes of patients at same stages were usually various with each other due to the known tumor heterogeneity, which indicated that the current staging classi cation was far from su cient to a comprehensive prognosis of TNBC [36].
Obviously, the current-used pathological stages in TNBC was entirely based on the anatomical scope of the diseases. This limited property indicated that they were unable to fully represent the biological heterogeneity of TNBC [36]. The tumor heterogeneity, as demonstrated in the previous report, is not only represent the numerous genetic mutations happening in tumor cells, but also dynamic changes of the TME. Those changes of TME mediated by large number of the recruited immune cells somehow determined the occurrence, development and prognosis of the TNBC [37]. As the conventional classi cations failed to estimate the tumor heterogeneity, the prognostic model we proposed was expected to improve the prognosis accuracy in TNBC patients.
Honestly, our research also suffered from some limitations. First, the population of the TNBC samples obtained cBioPortal website was mainly limited to white and black people, so it is necessary to expend our study to other nationalities. Secondly, the AUC of the prognostic model we evaluated by GEO dataset (No. GSE103091) was not high enough. Therefore, more veri cations in multicenter clinical trials and prospective studies were needed. In the future, we will also explore the possibility of more predictors to improve the predictive performance of our model. Other regression modeling methods will be used to determine whether the prediction accuracy can be improved or not.
In summary, it was clearly demonstrated the eight-gene prognostic model was a considerably reliable tool for predicting the OS of TNBC patients, and can help clinicians selecting personalized treatments for their TNBC patients.

Data resources
The metabric breast cancer datasets were download from the open-access cBioPortal data portal (http://www.cbioportal.org/), and TNBC samples was selected and categorized by ER / PR / HER2. The data extracted from the cBioPortal included mRNA expression pro le of the TNBC patients and the corresponding clinical information (age, left/right position, overall survival rate (OS) status, cellularity, claudin subtype, histological subtype, grade, tumor stage, tumor size, lymph node). Our research follows the cBioPortal data access policy and publishing guidelines, so no approval from the local ethics committee is required.

Different distributions of immune scores and stromal scores in clinical characteristics
We extracted the matrix data of the gene expression corresponding to the sample, standardized it with the Limma R package (Version 3.5.2) [21], and then calculated the immune scores and the stromal scores of each sample. The Wilcox rank sum test and the Kruskal-Wallis rank sum test were used to analyze the distribution of immune scores and stromal scores in their clinical characteristics. Then, we analyzed the immune scores, stromal scores, and ESTIMATE scores of those TNBC patients through the survival analysis software. p < 0.05 was considered statistically signi cant.

Screening of differentially expressed genes (DEGs)
The TNBC cases were divided into high/low groups according to the median values of immune and stromal scores, and the difference analysis of the data was performed using the software package limma [21] to obtain the corresponding DEGs (| logFC |> 1, p < 0.05).

Enrichment path analysis and construction of PPI network
In our study, we used cluster pro ler package to perform Gene Ontology (GO) function enrichment analysis on immune and stromal cell DEGs (p < 0.05). Then the two groups of DEGs were combined, and the cluster pro ler package was used to perform Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses, which showed the pathways involved. Then, a protein-protein interaction (PPI) network was retrieved from the STRING database [22], reconstructed by Cytoscape software, and a single network with 10 or more nodes was selected for further analysis [23]. We calculated the number of connections of each node in the network and ltered out hub genes.

Establishment of gene-related prognostic models
A single factor Cox regression analysis of the above DEGs was performed using the survival software package to analyze the correlation between TNBC patients and the expression level of each gene. Those genes signi cantly related to survival were obtained (p < 0.05). Next, we used multifactor Cox regression analysis and the stepwise regression to create a most match model to evaluate the contribution of genes as independent prognostic factors for patient survival, calculated the risk value of each patient according to the risk formula, and divided the patients into high-risk and low-risk groups. The Kaplan-Meier (KM) survival curves of high-risk and low-risk cases were drawn, and the properties of the prediction model were evaluated by ROC curve.
Veri cation of the prognostic model Availability of data and materials The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Competing interests
Xiaorui Han organizes the research, performs the data analysis and writes the manuscript. Zaiyi Liu revises the manuscript. Changhong Liang supervises the research. Figure 1 The  The overall survival curve obtained by the Kaplan-Meier method illustrated the prognosis of overall survival (OS) among high/low immune score (A) and stromal score (B), as well as the estimate score (C).    Univariate and multivariate association of the prognostic model and clinical characteristics with overall survival (OS). The con dence interval is shown as the length of the line.