Identication and Validation of Immune-Related Gene Prognostic Signature for breast cancer

prognostic signature and the IRGP prognostic signature was a strong independent risk factor. The functional enrichment analysis indicated that low IRGP value was correlated with biological processes related to immune. Immune cell inltration analysis indicated a signicant difference in percentage of M2 macrophages between high- and low-risk groups.


Abstract Background
Although the outcome of breast cancer patients has been improved by advances in early detection, diagnosis and treatment. Due to the heterogeneity of the disease, prognostic assessment still faces challenges. The accumulated data indicate that there is a clear correlation between the tumor immune microenvironment and clinical outcomes.

Objective
Construct immune-related gene pairs to evaluate the prognosis of breast cancer and patient survival rate.

Methods
From the Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) database, the Gene expression pro les and clinical data of breast cancer samples were downloaded. TCGA cohort were further divided into a training set (n = 764) and internal validation sets (n = 325). The GEO cohort was analyzed as an external validation cohort (n = 327). In the training set, differently expressed immunerelevant genes (IRGs) were screened rstly, and they were used to construct immune-relevant gene pairs (IRGPs). Then, the prognostic IRGPs were identi ed via univariate Cox regression analysis. Finally, least absolute shrinkage and selection operator (LASSO) Cox regression analysis was used to constituted the IRGP prognostic signature. Kaplan-Meier (KM) survival curves, receiver operating characteristic (ROC) curve analysis, univariate and multivariate Cox regression analysis were used to estimate the predictive value of the IRGP prognostic signature. And the IRGP prognostic signature was validated in the internal validation cohort and external validation cohort. We used gene set enrichment analysis (GSEA) to elucidate the biological functions of the IRGP prognostic signature.

Results
A total of 474 differently expressed IRGs and 2942 prognostic IRGPs were identi ed. Finally, we generated a IRGP prognostic signature consisting of 33 IRGPs. Subsequently, the 33 IRGPs grouped BRCA patients into high-and low-risk groups. Kaplan-Meier curves shown a signi cantly different overall survival in risk groups. Time-dependent ROC curves indicated that the IRGP prognostic signature possessed a high speci city and sensitivity in all the sets. Univariate and multivariate Cox regression analysis showed a statistical signi cance for the prognostic value of IRGP prognostic signature and the IRGP prognostic signature was a strong independent risk factor. The functional enrichment analysis indicated that low IRGP value was correlated with biological processes related to immune. Immune cell in ltration analysis indicated a signi cant difference in percentage of M2 macrophages between high-and low-risk groups.

Conclusion
The 33-IRGPs prognostic signature was developed to provide new insights for the identi cation of highrisk breast cancer and the evaluation of the possibility of immunotherapy in personalized breast cancer treatment.

Introduction
Breast cancer (BRCA) is the most frequently diagnosed malignant tumor in female, affects about 12% of women worldwide [1]. The incidence of BRCA was increased from 2004 to 2013 by about 2% per year among women [2]. TNM stage system, which takes both clinical and pathology information into consideration including tumor size (T), regional lymph nodes (N) and spread to distant metastatic sites (M), represents the most important prognostic factor for BRCA, and the higher the stage at diagnosis, the poorer the prognosis [3,4]. Although the 5-year survival rate was more than 90% for stage I and II BRCA patients, it drops down to about 27% for stage IV [5]. Thus, early detection, screening, and advanced tailored treatment are extremely important to reduce the mortality of BRCA. However, due to the high degree of heterogeneity of breast cancer, the prognosis of patients with similar clinical features may vary greatly. Therefore, in order to stratify patients more accurately, in addition to clinical factors, other prognostic factors need to be considered.
Immune oncology has attracted more and more attention because of its excellent clinical application value in many kinds of malignant tumors [6]. Immune-based strategies point out a new direction for the treatment and prevention of breast cancer. Previous researches have shown that the immune cells and immune-related genes are attractive bio-target for regulating cancer prognosis [7][8][9]. Among them, because the construction of immune gene pairs can directly compare the expression of two genes in each sample, and there is no need for comparison between samples, the correction between samples is saved and the accuracy is improved [10]. It provides convenience for various subsequent studies. However, previous studies on breast cancer have mostly focused on the relationship between immune cell in ltration and tumor prognosis, and there are relatively few studies on immune-related gene pairs. Our research has done our best to ll the gap in the study of some breast cancer-related immune gene pairs.
In this study, we established a 33-IRGPs signature to predict the individual prognostic characteristics of BRCA by univariate Cox regression analysis and LASSO model. For validation of the signature, we investigated its accuracy and e ciency in determining the prognosis of BRCA patients with KM curves, ROC curve analysis, univariate and multivariate Cox regression analysis. The ndings of this study showed and proved that the 33-IRGPs signature can be applied in the clinical prognosis of BRCA patients.

Transcriptome data acquisition and pre-processing
The BRCA transcriptome data and related clinical characteristics were downloaded from The Cancer Genome Atlas (TCGA) database [11]. In total 1089 BRCA samples and 113 adjacent normal samples were extracted from TCGA. Then, all selected patients were further separated into training and internal validation groups(7: 3) to apply the prognostic analyses based on cohorts. The models are identi ed and evaluated by "caret" package with its "createDataPartition" function [12]. GSE20685 Microarray data [13] from the Gene Expression Omnibus (GEO) database [14] was analyzed as external validation cohort which contained data of 327 BRCA patients. All selected clinical characteristics of BRCA patients were listed in Table 1

Acquisition of somatic mutation data
A total of 4 subtypes of data les from the TCGA database were selected, from which the "Masked Somatic Mutation" data was processed by VarScan software. Somatic mutation data was processed into Mutation Annotation Format (MAF) le and visualized by "maftools" [15] R package, which can provide multiple analyzing models. There was no ethical con ict to declare because all the data in this research were from public databases.

Prognostic immune-relevant gene pairs identi cation
Immune-relevant genes (IRGs) from immported database (https://www.immport.org/home) were obtained to perform differentially expressed genes screening. The genes with different expression levels between tumor and normal samples of BRCA samples in TCGA cohort were analyzed by limma R packages. At the same time, P-value were adjusted by false discovery rate (FDR) that proposed by the Benjamini-Hochberg procedure to limit the false positives results. The genes were considered as differential expression with |log2(fold change)|>1 and FDR < 1.0E-5.
Besides, the immune-relevant genes with different expression levels were used to construct immunerelevant gene pairs (IRGPs). Each IRGP was calculated by comparing the expression levels of genes in a particular sample or sequence, and if IRG 1 > IRG 2, then IRGP equals to 1, otherwise IRGP equals to 0. By taking advantages of analyzing genes in a pairwise approach, standardization is no more required in individualized analysis. Besides, some IRGPs with unique values of 1 or 0 were excluded to avoid the biases. The prognostic IRGPs were identi ed via univariate Cox regression analysis in the TCGA cohort, where IRGPs with P-value < 1.0E-5 were selected for further analysis.

The IRGP prognostic signature construction and validation
TCGA samples were randomly separated into training (764 patients) and internal validation cohorts (325 patients). The endpoints analyzed in this study were overall survival (OS), de ned as the interval between the date of diagnosis and death. Least absolute shrinkage and selection operator(LASSO) is a biased estimation tool for complex collinearity data, which can select variables and estimate parameters at the same time, and solve the problem of multicollinearity in regression analysis [16]. Thus, LASSO Cox regression analysis was applied to reduce dimensionality of IRGPs by R package glmnet. IRGPs represented by optimal values of the penalty parameter λ, which were determined by ten-fold crossvalidations, constituted the IRGP prognostic signature in this study. On the basis of IRGP prognostic signature, the risk score for each BRCA sample was calculated according to the following formula: risk score = expression gene 1 × β gene 1 + expression gene 2 × β gene 2 +⋯+expression gene x × β gene x , (x = the number of IRGPs; β = coe cient value for each IRGP).
All BRCA samples were classi ed into high-and low-risk groups based on the cut-off values of median value of risk score. Kaplan-Meier (KM) survival curves were applied to calculate the over survival (OS) differences between two groups, and the statistical signi cance was obtained by log-rank test. The values of area under the curve (AUC) in 1, 3, and 5 years from the results of receiver operating characteristic (ROC) curve analysis were calculated and taken to evaluate the accuracy and sensitivity of the survival prediction results.
Then, the multivariate Cox regression model was used to evaluate whether the prognostic value of IRGPs was independent of clinical characteristics. Furthermore, the comparison between IRGP prognostic signature and clinical features was performed by forest plots to determine the effectiveness of the prognostic value. Wald test was performed to compare the statistical difference of IRGPs in clinical characteristics.

Clinical utility of IRGP prognostic signature
A composite nomogram was developed based on the IRGP prognostic signature and available clinical factors to predict the OS of BRCA patients with the rms package. Calculate the consistency index (Cindex) to evaluate the discriminative ability of the nomogram. Besides, a calibration curve was drawn to compare the predicted probability and actual probability of OS. Each component of the nomogram is given points, and their sum represents the total points of a patient has obtained.

Estimation of immune in ltration
Immune in ltration assessment was performed by cibersort method to quantify the absolute abundance of 22 immune cell populations in heterogeneous tissues from transcriptome data [17]. The cibersort method was applied by CIBERSORT package to convert mRNA data into the levels of in ltrating immune cells. And standard annotation les was used to prepare gene expression pro les before immune in ltration analysis.

Functional enrichment analysis
Gene Set Variation Analysis (GSVA) is a method of gene set enrichment analysis, which estimates the changes of pathways and biological process activities in a sample population in an unsupervised manner [18]. The gene set le of "c5.all.v7.0.symbols.gmt", downloaded from the "Molecular Signatures Database" (https://www.gsea-msigdb.org/gsea/), were employed for GSVA using "GSVA" R packages. Each group of data was taken a normality test for test the data distribution by the means of shapiro, and homogeneity test for variance in multiplicate samples by the means of bartlett. The result is normal distribution and comparable when P > 0.05, and one-way ANOVA was used to compare differences between groups. Otherwise the data was analyzed by Kruskal-Wallis test. The gene sets in the "Molecular Signatures Database" of "c5.all.v7.0.symbols.gmt", "c2.all.v7.0.symbols.gmt", and "h.all.v7.0.symbols.gmt" were used to explored the potential mechanisms of the IRGP prognostic signature by gene set enrichment analysis (GSEA) with the fgsea package [19]. The number of random sample permutations was set at 10000, and the signi cance threshold was set at p < 0.05.

Prognostic immune-relevant gene pairs identi cation
We obtained 1382 common IRGs overlapped between TCGA cohort and GSE20685 cohort. A total of 135 up-regulated and 339 down-regulated IRGs were identi ed between 1089 BRCA tumor and 113 normal samples in TCGA cohort (Fig. 1A). Then we acquired 683795 common IRGPs based on these 1382 common IRGs between TCGA cohort and GSE20685 cohort by gene pairwise calculation. Univariate Cox regression analysis was performed for the 683795 common IRGPs, of which 2942 IRGPs showed signi cant prognostic potential (P-value < 1.0E-5), containing 1102 IRGs. Among these 2942 IRGPs, 1692 IRGPs were risky and 1250 IRGPs were protective (Fig. 1B). There were 356 overlapped IRGs between differently expressed IRGs and prognostic IRGs, and they constituted 302 IRGPs (Fig. 1C). The work ow chart for data collection and analysis was shown in Fig. S1.

Construction of molecular subgroups using prognostic immune-relevant gene pairs (IRGPs)
Firstly, unsupervised clustering methods was performed to group 1089 tumor samples into different subgroups based on 302 prognostic IRGPs. And optimal cluster number was calculated by ConsensusClusterPlus package. Three distant patient clusters, termed as the subtype1, subtype2 and subtype3, were nally identi ed ( Fig. 2A-B). Through the log-rank test, the Kaplan-Meier curve shown that the overall survival difference between the three clusters is signi cant (Fig. 2C). The patients in subtype2 had worst survival outcome, whereas the survival time in subtype3 group was the longest. Samples in subtype3 exhibited a more negative ER status or PR status and lower fraction of genome altered, whereas samples in subtype1 and subtype2 exhibited a more positive ER status or PR status and higher fraction of genome altered (Table S1). With regards to biological behavior (Fig. 2D), pathways involved in humoral immune response and macrophage proliferation were activated in subtype3, which might cause the longer survival time for subtype3 patients. Whereas patients in subtype2 showed a lower immune activity, which might lead to the worst survival outcome for subtype2 patients. Subsequent immune cell in ltration analysis shown that subtype3 patients was obviously in ltrated by naive B cell, memory resting CD4 T cells and CD8 T cells, while subtype2 patients showed an obvious increase trend in in ltration by M0 macrophages and M2 macrophages (Fig. 2E). Finally, the distribution of these three subtypes and other established breast cancer molecular subtypes was compared. The result demonstrated that subtype3 patients were mainly concentrated in the Luminal A subtypes with the best prognosis, while subtype2 patients were mainly concentrated in the C4 subtypes with the worst prognosis (Fig. 2F, Table S1).

Construction and validation of the IRGP prognostic signature
The IRGPs in the risk model were selected by applying LASSO Cox regression analysis. All samples in TCGA cohort were regrouped into training and internal validation groups randomly for prognostic analyses. No signi cant difference was observed when comparing patient characteristics between the two groups (Table S2). The IRGP prognostic signature consisting of 33 IRGPs was generated through the LASSO model ( Table 2). The 33 IRGPs grouped BRCA patients into high-and low-risk groups based on the median value of risk score. In training, internal validation and external validation cohorts, Kaplan-Meier curves indicated that patients in low-risk group have signi cantly longer survival time than that in high-risk group (Fig. 3A-C). Figure 3A   other clinical characteristics in all three cohorts, and indicated that IRGP prognostic signature was a strong independent risk factor ( Table 3). The predictive power of the IRGP prognostic signature was further tested in various subgroups strati ed by TNM stage, age, fraction genome altered, sample initial weight, ER status, HER2 status, PR status, tumor site, immune subtype, and BRCA subtype in the TCGA entire cohort. The forest plot shown that in almost all subgroups, the higher the risk score, the more obvious the patients with the worse prognosis (Fig. 4).

Landscape of mutation pro les in breast cancer
The somatic mutation pro les of 968 BRCA patients with 4 types of data based on different processing software were downloaded from TCGA. Besides, we visualized the results of 442 high-risk and 526 lowrisk BRCA patients using the " maftools" package, based on mutation data with VCF format. The result shown that the frequency of somatic mutation in high risk BRCA patients was slightly higher than low risk BRCA patients. The waterfall plot shows the mutation information for each gene in each sample, and the different types of mutations are annotated by different colors at the bottom (Fig. 5A, Fig. S2A). In summary, these mutations were further classi ed based on the different taxonomic categories, among which missense mutation accounted for the majority (Fig. 5B, Fig. S2B), only single nucleotide polymorphism occurred (Fig. 5C, Fig. S2C), and C > T was the most common of single nucleotide variants (SNV) in BRCA patients (Fig. 5D, Fig. S2D). Besides, the number of altered bases in each sample was counted and mutation type with different colors in box plot for BRCA patients were illustrated (Fig. 5E-F,   Fig. S2E-F). Last, we exhibited the top 10 mutated genes with ranked percentages, including PIK3CA

Establishment of a nomogram predicting OS in BRCA patients
In order to develop a clinically relevant quantitative method to predict the mortality rate of patients, we constructed a nomogram integrating IRGP prognostic signature derived score and clinical information to predict survival probability of BRCA patients in TCGA cohort and GSE20685 cohort ( Fig. 6A and Fig.S3A).
The concordance index (C-index) of the nomogram was 0.862 in TCGA cohort and 0.758 in GSE20685 cohort, which indicating a good discriminatory ability of the nomogram. The calibration plots showed that the derived nomograms performed well compared to the performance of an ideal model at 1-year, 3years and 5-years (Fig. 6B-D and Fig. S3B-D). Decision curve analysis shows that the clinical utility of nomograms greatly exceeds TNM staging system ( Fig. 6E and Fig. S3E).

Identi cation of IRGP prognostic signature related biological pathways and processes
The relationship between IRGP prognostic signature derived scores of clinical characteristics and molecular subtypes were further investigated in the entire TCGA cohorts (Fig. 7A). In terms of clinical features, IRGP prognostic signature was increased in patients with more advanced TNM stage and patients who had died due to the disease. Furthermore, while age in uenced the value of IRGP prognostic signature, that value of IRGP prognostic signature didn't varied between sample initial weight and tumor site. In terms of molecular characteristics, it is explored that more genomic changes relative to higher values of IRGP prognostic signature, and patients in molecular subtypes C4 and Luminal B exhibited signi cantly higher values of IRGP prognostic signature than others.
Next, we performed GSEA to elucidate the biological functions of the IRGP prognostic signature (Fig. 7B), which revealed that genes highly expressed in the low-IRGP risk score group were associated with immune-related gene set, including regulation of lymphocyte activation, phagocytosis and immune response regulating cell surface receptor signaling pathway. While the high-IRGP-related genes showed signi cant enrichment in multiple biological processes such as the DNA activity-related gene set, including helicase activity, catalytic activity acting on DNA, ATP dependent 5_3 DNA helicase activity.
Immune cell in ltration analysis indicated that patients with high IRGP prognostic signature scores seemed to in ltrated by M0 and M2 macrophages, while the patients in low-risk group were in ltrated by naive B cell, memory resting CD4 T cells and CD8 T cells (Fig. 7C).

Discussion
Immunity is closely linked to tumors. For example, Li B et al inferred the abundance of 6 immune cell types (B cells, CD4 T cells, CD8 T cells, neutrophils, macrophages, and dendritic cells), and discovered signi cant associations between cell abundance and prognosis in 23 cancers. For example, in addition to prolonging the survival of patients, CD8 T cells may also play an important role in preventing tumor recurrence, such as in melanoma, colorectal cancer, and cervical cancer [20]. In addition, increasing evidence points to the importance of biomarkers (especially genes) in determining cancer outcomes, which provides new opportunities for integrating this information into treatment algorithms [10]. Many previous studies have shown that immune-related genes can be used as prognostic indicators for breast cancer. However, researches were often limited by the singularity of genes and differences between samples [21,22]. Therefore, new methods are urgently needed to improve the accuracy of breast cancer prediction.
Immune-related gene pairs were widely used in tumor analysis, and great progress has been made in many cancers, such as melanoma, ovarian cancer and pancreatic cancer [23][24][25]. However, further researches were needed in breast cancer, and this study would ll the gap in this regard. Our study focused on predicting BRCA survival using the prognostic characteristics of IRGP composed of 33 IRGP combined with clinical information. We believed that our research would provide a new perspective for clinical decision-making and prognostic monitoring in breast cancer. At the same time, we also analyzed the somatic mutations in patients with breast cancer, and we found that most of the missense mutations occurred in the following genes: Last, we exhibited the top 10 mutated genes with ranked percentages, including PIK3CA ( stimulate tumor cell growth [27]. Through previous studies, we have learned that an increased prevalence of PIK3CA mutations in women with CRC, and an increased risk of recurrence and poorer prognosis associated with PIK3CA mutations [28][29][30]. The TP53 protein is a DNA-bound transcription factor that has the potential to bind to hundreds of different promoter elements in the genome to regulate gene expression [31]. Tumors with TP53 mutations are usually characterized by poor differentiation, increased invasiveness and high metastatic potential, which are associated with poor prognosis [32,33]. Combined with our conclusions about somatic mutation analysis, the mutations of PIK3CA and TP53 affect the prognosis of patients with breast cancer to a great extent. So we can determine the reliability of our functional enrichment results [31,34] and our IRGPs may play some role in the prognosis of breast cancer. At present, several immune-related therapies for tumors have achieved good results in clinical trials. For instance, by blocking macrophage chemokines (such as CXCL12) and preventing macrophages from entering tumors, the development and proliferation of cancer cells can be inhibited [35]. Although the focus of this research is not on the mechanism of immune cells, it still provides strong evidence that tumor-related immune genes may become potential targets for cancer treatment. Our research focuses on immune-related genes and uses strict standard-level screening to obtain genes that may be prognostic targets for BRCA [6].
The current study has some limitations. First, due to the retrospective nature of this study, the patient population was heterogeneous. Secondly, because gene expression data are required for Cox regression as categorical variables, the optimal cut-off value needs to be further veri ed in future studies. In conclusion, the IRGPS gene map is a powerful tool for predicting breast cancer survival and guiding treatment. In addition, prospective clinical trials are needed to validate our ndings.    Forest plots of the associations between IRGP prognostic signature and overall survival in various subgroups in TCGA cohort. Unadjusted HRs (boxes) and 95% con dence intervals (horizontal lines) are depicted. IRGP, immune-relevant gene pair.