Identication of an independent immune-genes prognostic index for breast cancer

Objective Increasing evidence has indicated an association between immune micro-environment in breast cancer and clinical outcomes. The aim of this research is to comprehensively investigate the effect of tumor immune genes on the prognosis of breast cancer patients. 2498 immune genes were downloaded from ImmPort database. Additionally, we identied and downloaded the transcriptome data of patients with breast cancer from the TCGA database through the R package, as well as relevant clinical information. Survival R package was applied in survival analyses for hub-genes. Cox regression analysis was used to analyze the effect of immune genes on the prognosis of breast cancer. Immune risk scoring model was constructed based on the statistical correlation between hub immune genes and survival. Meanwhile, multivariate cox regression analysis was utilized to investigate whether the immune genes risk score model was an independent factor for predicting the prognosis of breast cancer. Nomogram was constructed to comprehensively predict the survival rate of breast cancer. P < 0.05 was considered to be statistically signicant.


Introduction
Breast cancer, one of the leading causes of cancer-related death, was considered as the second most common cancer in the world. It accounted for a quarter of all malignant tumors (1). Triple negative breast cancer (TNBC) was supposed to be the most aggressive sub-type, which accounts for approximately one fth of breast cancer (2,3). It was characterized by higher tumor grade, larger tumor size, higher metastasis rate and lymph node involvement (4)(5)(6). TNBC was negative for estrogen receptor (ER), progesterone receptor (PR) and human epidermal growth factor receptor 2 (HER-2), thus resistant to endocrine therapy and trastuzumab (7). Regard to the lack of targeted treatment strategies, chemotherapy remained the unique treatment option (8). Therefore, it was crucial to identify several prognostic factors and targets for improving the treatment and prognosis of TNBC patients.
It has been extensively documented that the immune system plays an important role in cancer. The immune response is the process of interaction between many speci c cells that may affect the clinical outcome of breast cancer (9). It has been reported that the composition and function of tumor in ltrating immune cells (TIICs) varies with the host immune status and has potential prognostic value. Notably, they could be effectively targeted by drugs and affect the clinical outcomes of patients. In addition, immune genes have not been extensively studied in the context of breast cancer biology.
The purpose of this study is to describe the in ltration expression and lineage of immune genes in breast cancer and investigate the effects of immune genes on the prognosis of patients with breast cancer. Furthermore, an immune genes risk score model and a nomogram model is constructed to predict the survival of breast cancer.

Data acquisition
First of all, 2498 immune genes were downloaded from ImmPort database. Additionally, we identi ed and downloaded the transcriptome data of patients with breast cancer from the TCGA database through the R package, including 112 cases of paracancerous normal tissue and 857 cases of tumor tissue. Further, relevant clinical information of 857 breast cancer patients were obtained such as age, gender, stage, tumor&Lymph node&metastasis stage, survival status and survival duration (Table 1). Finally, "Limma" package in R software was utilized to correct the transcriptome data we have downloaded. Gene set enrichment analysis Gene enrichment analysis (GSEA) (version 3.0, the broad institute of MIT and Harvard, http://software.broadinstitute.org/gsea/downloads.jsp) was conducted between colon cancer and paracancerous normal tissues to study the biological characteristics of colon cancer. In detail, the "collapse data set to gene symbols" was set to false, the number of marks was set to 1000, the "permutation type" was set to phenotype, the "enrichment statistic" was set to weighted, and the Signal2Noise metric was used for ranking genes. High expression group was used as experimental group and low expression group was used as reference group. "c2.cp.kegg.v7.0.symbols.gmt" gene sets database was used for enrichment analysis. Gene set size > 500 and < 15, FDR < 0.25, and nominal Pvalue < 0.05 were regarded as the cut-off criteria.

Analysis of copy number variation data and gene mutation analysis
Based on TCGA breast cancer data, the copy number variation (CNV) was analyzed using R-Circos package and R-ggplot2 package. Furthermore, we used the online tool website-cbioportal to analyze the genetic variation of hub genes. The threshold used was P < 0.05.

Statistical analysis
All analyses were performed using R 3.6.1. All statistical tests were two-sided, and P value < 0.05 was considered statistically signi cant. Continuous variables that conformed to the normal distribution were compared with the use of independent t test for comparison between groups, while continuous variables with skewed distribution were compared with the Mann-Whitney U test. The correlation matrix was constructed by R software based on Pearson Correlation Coe cient. The relationship between immune cell in ltration and overall survival was analyzed through the Kaplan-Meier curve which was evaluated by log-rank test. Time-dependent ROC curves were used to analyze the sensitivity and speci city of the recurrence prediction model. The univariate regression model was used to analyze the effects of individual variables on survival. The least absolute shrinkage and selection operator (LASSO) cox regression model was used to con rm independent impact factors associated with survival. The nomogram was constructed with the regression coe cients based on the cox analysis.

Results
Differential expression screening of breast cancer 2498 immune genes were downloaded from ImmPort database. The transcriptome data of 857 breast cancer cases and 122 adjacent normal tissues cases was obtained from TCGA database for differential expression analysis. A total of 556 immune genes were identi ed as differentially expressed immune genes (DEIGs) between breast cancer and normal tissues, including 402 up-regulated and 154 down regulated (p < 0.05, Fig. 1A, Table 2). The heatmap of the top 10 up-regulated and top 10 down regulated DEIGs was shown in Fig. 1B.  RAC2 ITK ADAR RFXANK GDF11 IRF1 CMTM1 UCN2 MX2 TGFB1 KRAS IFNA17  PLXNC1 IFNA2 DHX58 ADRM1 PIK3R3 PRDX1 SOCS1 S100A11 HLA-DQA1 VEGFA  F2R OGFR PSMC4 SEMA4A IL1RN HLA-DQB1 IL32 CALR CD48 INHBC DEFB104B  TNFRSF12A CD3E CCR6 SERPINA3 HSPA2 IL31RA NR1I3 NFKBIB INHBE ISG20L2  IFIH1 LEAP2 CACYBP ZAP70 CXCR4 GIPR TNFRSF13C TGFB3 DEFB103A Functional annotation of these 556 DEIGs In order to fully understand the biological attributes of these 556 DEIGs, we conducted Kyoto Encyclopedia of Genes and Genomes (KEGG) and gene ontology (GO) analysis. Based on the results of DAVID, the top three enriched GO terms for up-regulated genes and down-regulated genes were: T cell activation, lymphocyte differentiation and response to virus; cell chemotaxis, positive regulation of response to external stimulus and leukocyte migration, respectively ( Fig. 2A). The top biological pathway enriched for up-regulated genes and down-regulated genes were: Cytokine − cytokine receptor interaction, JAK − STAT signaling pathway and Chemokine signaling pathway; Cytokine − cytokine receptor interaction, JAK − STAT signaling pathway and EGFR tyrosine kinase inhibitor resistance pathway, respectively (Fig. 2B).

Establishment of immune prognosis model
For the purpose of revealing the relationship between these 556 DEIGs and overall survival, 66 prognostic DEIGs were identi ed by utilizing univariate Cox regression analyses (Fig. 3A). Survival analysis showed that 30 of them were signi cantly correlated with OS. TCGA breast cancer data were randomly divided into two sets (training set : validation set, 7 : 3). Then, lasso regression analysis was applied to increase the robustness and select the independent indicators for the overall survival based on training set and nally we got 15 DEIGs for the construction of prognostic index (Fig. 3B, Fig. 3C, Table 3). After the construction of prognostic index, patients were separated into high risk and low risk (Fig. 3D, Fig. 3F). Heap map was utilized to visualize the difference of gene expression pro le in low-and high-risk patients in breast cancer training set (Fig. 2E). The results from K-M analysis indicated that high risk patients had lower overall survival than low risk patients in both training group and validation group (P < 0.001) (Fig. 4A, Fig. 4B, Fig. 4C). The ROC curve revealed that the risk model had a good sensitivity and speci city in predicting survival risk (AUC = 0.813, AUC = 0.704 for 5 years overall survival in training and validation group, respectively) (Fig. 4D, Fig. 4E, Fig. 4F). To explore whether the constructed immune risk scoring model was independent form age, gender, stage, and other clinical pathological parameters, we performed an univariate and multivariate cox regression analysis for age, gender, stage, grade, TNM and risk score. In univariate Cox model, age, pathological stage, pathological T, N, M stage and high risk score were associated with poor survival (Fig. 5A). In multivariate Cox model, only age and risk score worked as independent predicted factors (Fig. 5B). To better predict the prognosis of breast cancer patients at three and ve years post-surgery, we constructed a new nomogram from the variables associated with OS (age, histological grade, pathological stage and risk score) (Fig. 5C, 5D, 5E).

Recognition of gene sets for genome variation
Based on TCGA breast cancer data, we analyzed the copy number variation (CNV) of 15 model genes and showed the frequency of copy number variation using R-Circos package and R-ggplot2 package ( Figure  S1A, Figure S2B). The results showed that PAC2, ULBP and SERPINA3 had the most variation frequency ( Figure S2B). Furthermore, we analyzed the single nucleotide polymorphism composition (SNPs) of 15 model genes ( Figure S2C). The results showed that NR3C2 had the most SNPs, including missense mutation and silent. Finally, we used the online tool website-cbioportal to analyze the genetic variation of 15 immune genes ( Figure S2D).
Clinical and prognostic correlation of 15 model genes and immune genes risk score We further investigated the proportion of each model genes in different pathological stages. We demonstrated that IL17B, NFKBIE and SERPINA3 were most signi cantly associated with development of breast cancer (Fig. 6). In addition, survival analysis showed that all model genes were signi cantly associated with survival ( Figure S2). Meanwhile, we found that the expression of RAC2, CD79A and IFNG were associated with the in ltration of Macrophage M0 and Macrophage M2 (Fig. 7). Regard to the immune genes risk score, a strong correlation with age, gender, pathological stage and clinical T stage was identi ed (Fig. 8).

Gene set enrichment analysis of risk scores
To explore the biological relevance of risk scores involved in progression of breast cancer, we performed a gene set enrichment analysis of risk scores based on the TCGA breast cancer cohort. GSEA analysis indicated high risk scores was associated with E2F_TARGETS, G2M_CHECKPOINT, GLYCOLYSIS, MTORC1_SIGNALING and PROTEIN_SECRETION pathway (Fig. 9A). In addition, low risk scores was associated with APOPTOSIS, COMPLEMENT, IL2_STAT5_SIGNALING, INFLAMMATORY_RESPONSE and P53 pathway (Fig. 9B).

Discussion
As the most common cancer in women, although great efforts have been made to improve the diagnosis and treatment strategy, breast cancer remains a fatal threat to patients (10). Accumulation of evidence have shown that the immune system plays an important role in the development of cancer, which could recognize and kill cancer cells through adaptive immune response defence. Meanwhile, cancer cells could succeed in escaping the recognition of immune systems via a variety of ways. Tumor micro-environment (TME) was composed of cancer cells, stromal cells, chemokines and cytokines, where these cells could communicate and transmit by direct contact or secretion of related cytokines (11). Studies had shown that cancer prognosis was closely related to TME, especially cancer immune micro-environment (12,13). It is evident that different types of cancers had diverse immune genes sub-population. Therefore, it was crucial to investigate the immune genes subsets for the evaluation of risk and tumor prognosis.
In our study, we conducted a comprehensive and detailed assessment of immune genes in breast cancer, based on the data from a large set of samples. All gene expression data and patients clinical characteristics information were downloaded from TCGA dataset. We analyzed the 2498 immune genes between breast cancer and normal tissues from IMMPORT database, eventually, we veri ed 556 DEIGs.
Moreover, we identi ed and constructed a 15 hub genes risk score model for breast cancer via univariate and lasso cox regression analysis, including TSLP, IL17B, NR3C2, RAC2, SERPINA3, HSPA2, CD79A, UNC93B1, NFKBIE, SDC1, IFNG, IRF7, GALP, TNFRSF18 and ULBP1. Furthermore, to investigate the prognostic value of the model, we performed the ROC curve and investigate the association between the model and clinical features. As expected, the high-risk group was correlated with worse overall survival and was inclined to have advanced stage and higher histological grade which might manifest poor outcome.
Several genes in the immune genes model had been investigated in human cancers. Thymic stromal lymphopoietin (TSLP), a critical upstream cytokine inducing type 2 in ammation, represent a poor prognostic factor for oropharyngeal squamous cell carcinoma (OPSCC) (14). Interleukin-17 (IL-17), an in ammatory cytokine that functions in in ammation and cancer, participate in a positive feedback loop that enhances invasion/migration ability in lung cancer (15). Down-regulation of RAC2 by small interfering RNA restrains the progression of osteosarcoma by suppressing the Wnt signaling pathway (16). The up regulation of hnRNP-K transcriptional activity mediated by SERPINA3 promotes the survival and proliferation of HCC cells, which may be an indicator of poor prognosis in HCC patients (17). Heat shock protein family a member 2 (HSPA2), a putative oncoprotein, is often up-regulated in human cancers and promotes tumor growth and progression (18). NFKBIE aberrations are common genetic events in trans-b-cell malignancies, and NFKBIE deletion is a new marker of poor prognosis in primary mediastinal B-cell lymphoma (PMBL) (19). The remaining genes have also been con rmed to be associated with malignant origin, progression and metastasis of tumors. , and ADRB1 by using cox regression survival modeling for breast cancer, which may improve survival prediction for BC patients and serve as a prognostic biomarker for BC patients.. In another study (22), six mRNAs were screened and identi ed to construct a prognostic risk scoring system, including four upregulated mRNA (RDH16, SPC25, SPC24, and SCUBE3) and two down-regulated mRNA (DGAT2 and CCDC69). The risk scoring system constructed in this study is helpful to improve the screening of highrisk patients with HER-2-positive breast cancer and is bene cial for implementing early diagnosis and personalized treatment. Different from previous studies, we rst focused on differentially expressed immune related genes, and established and validated a new immune related risk prediction model.
However, there were some limitations in our research. Firstly, the sample size in our study was small and a larger cohort and more abundant sequencing results were needed. Secondly, we only focused on the gene expression and gene mutation level, but ignored other events such as the gene methylation, and copy number ampli cation, which were also important in tumor progression.

Conclusion
In summary, our study sheds light on the utility of immune genes in the prognosis of breast cancer. The constructed immune genes risk scoring model is reliable in predicting the prognosis of breast cancer, and this risk scoring model is an independent in uencing factor for the prognosis of breast cancer. With the rapid development of high-throughput technology, we have con dence to believe that our immune risk scoring model have great potential in clinical practice. And it may have critical value for exploring new anti-cancer immunodiagnosis and treatment strategies. Competing interests