3.1 Prognostic immune-relevant gene pairs identification
The workflow chart for data collection and analysis was shown in Fig. 1. We obtained 212 common high variable IRGs overlapped between TCGA cohort, GSE20685 cohort, GSE42568 cohort, and GSE20711 cohort (Fig. 2A). Then we acquired 3075 common IRGPs after removing IRGPs with scores of 0 or 1 in over 80% of the samples in all cohorts (Fig. 2B). Univariate Cox regression analysis was performed for the 3075 IRGPs, of which 35 IRGPs showed significant prognostic potential (P-value < 0.001). Based on the cutting off value (HR = 1), we identified 14 risky IRGPs and 21 protective IRGPs (Table S1).
3.2 Construction and validation of the IRGP prognostic signature
The IRGPs in the risk model were selected by applying LASSO Cox regression analysis. All samples in TCGA cohort were regrouped into training and internal validation groups randomly for prognostic analyses. No significant difference was observed when comparing patient characteristics between the two groups (Table S2). The IRGP prognostic signature constructed by 16 IRGPs consisting of 28 unique IRGs was generated through the LASSO model (Fig. S1A-B). Most of these 28 IRGs encoded molecules related to antimicrobials, cytokines, and cytokine receptors (Table 2).
Furthermore, we uploaded the 28 unique IRGs list to Metascape (https://metascape.org/gp/index.html) to annotate biological function of these genes. The GO enrichment results shown that these genes were correlated with cancer immune-related biological functions, including cell chemotaxis, leukocyte migration, myeloid leukocyte migration and neutrophil migration (Fig. S2A, Table S3). Similarly, the KEGG pathway enrichment results indicated that these genes were associated with cytokine-cytokine receptor interaction, chemokine signaling pathway and several classical signal pathways (Fig. S2B).
The 16 IRGPs grouped breast cancer patients into high- and low-risk groups based on the optimal cut-off point determined by “cutp” function from survMisc package. Kaplan-Meier curves indicated that patients in low-risk group have significantly longer survival time than that in high-risk group in training cohort (P-value = 6.12E-13, HR = 3.90, 95%CI = 2.65–5.74; Fig. 3A), internal validation cohort (P-value = 8.15E-6, HR = 3.70, 95%CI = 1.74–7.89; Fig. 3B), GSE20685 validation cohort (P-value = 1.83E-9, HR = 3.44, 95%CI = 2.02–5.86; Fig. 3C), GSE42568 validation cohort (P-value = 1.80E-3, HR = 4.51, 95%CI = 2.27–8.96; Fig. 3D), and GSE20711 validation cohort (P-value = 2.72E-2, HR = 2.41, 95%CI = 1.09–5.33; Fig. 3E). As shown in Fig. 3F, the IRGP prognostic signature possessed a high specificity and sensitivity in the training cohort, with the area under the curve (AUC) of 0.849, 0.827, and 0.781 for 1-, 3- and 5-year OS prediction. The AUC of the IRGP prognostic signature for 1-, 3- and 5-year OS prediction were 0.688, 0.796 and 0.714 in internal validation cohort (Fig. 3G), 0.584, 0.752 and 0.748 in GSE20685 validation cohort (Fig. 3H), 0.538, 0.620 and 0.689 in GSE42568 validation cohort (Fig. 3I), and 0.920, 0.542 and 0.657 in GSE20711 validation cohort (Fig. 3J). In addition, the patients in the high-risk group had a shorter disease-free survival (DFS) and distant metastasis-free survival (DMFS) time compared with low-risk group in training cohort, internal validation cohort, GSE20685, GSE42568 validation cohort, and GSE20711 validation cohort (Fig. S3A-E). And the prognostic value of IRGP signature for DFS and DMFS prediction was also evaluated in both training and validation cohorts (Fig. S3F-J). The above analyses presented that the IRGP prognostic signature could has a good predictive value in OS, DFS and DMFS prediction.
Univariate Cox regression analysis showed a statistical significance for IRGP prognostic signature in training cohort (P-value = 2.63E-11, HR = 4.04, 95%CI = 2.68–6.09), internal validation cohort (P-value = 3.28E-5, HR = 3.80, 95%CI = 2.03–7.15), GSE20685 validation cohort (P-value = 1.63E-8, HR = 3.50, 95%CI = 2.27–5.40), GSE42568 validation cohort (P-value = 4.45E-3, HR = 4.54, 95%CI = 1.60-12.88), and GSE20711 validation cohort (P-value = 3.23E-2, HR = 2.47, 95%CI = 1.08–5.66) (Table 3). In order to explore the independence of IRGP prognostic signature in survival prediction, a multivariate Cox regression analysis was performed in the TCGA training cohort and internal validation cohort, including IRGP prognostic signature, age, M stage, N stage, T stage, tumor stage, fraction genome altered, immune subtype, and breast cancer subtypes. Whereas only age, M stage, N stage and T stage were available for GSE20685 validation cohort and age as well as tumor grade were available for GSE42568 and GSE20711 validation cohort, we integrated IRGP prognostic signature and these clinical features in multivariate cox regression analysis. The prognostic values of IRGP prognostic signature was significant compared with other clinical characteristics in training cohort (P-value = 2.02E-6, HR = 3.53, 95%CI = 2.10–5.94), internal validation cohort (P-value = 7.21E-05, HR = 8.88, 95%CI = 3.02–26.10), GSE20685 validation cohort (P-value = 3.40E-6, HR = 2.98, 95%CI = 1.88–4.72), GSE42568 validation cohort (P-value = 1.22E-2, HR = 3.93, 95%CI = 1.35–11.48), and GSE20711 validation cohort (P-value = 3.68E-2, HR = 2.47, 95%CI = 1.06–5.78) (Table 3). It indicated that IRGP prognostic signature was a strong independent risk factor. The predictive power of the IRGP prognostic signature was further tested in various subgroups stratified by TNM stage, age, fraction genome altered, ER status, HER2 status, PR status, immune subtype, and breast cancer subtype in the TCGA entire cohort. The forest plot shown that higher risk score was correlated with worse prognosis in almost all subgroups (Fig. 4). Collectively, the results indicated that the predictive ability of the IRGP prognostic signature was independent of other clinical parameters and was predictive of OS of breast cancer patients.
3.3 Increasing IRGP risk score is associated with clinical characteristics of breast cancer patients
The relationship between IRGP prognostic signature and clinical characteristics of breast cancer patients was further investigated in the entire TCGA cohorts. In terms of clinical features, increasing IRGP risk score was correlated with more advanced tumor stage (P = 3.10E-03; Fig. 5A), M stage (P = 1.20E-03; Fig. 5B), N stage (P = 9.50E-02, without statistical significance; Fig. 5C), T stage (P = 1.40E-05; Fig. 5D), and patients who had died (P-value = 1.10E-11; Fig. 5E) or progression (P-value = 2.50E-3; Fig. 5F) due to the disease. Furthermore, age influenced the value of IRGP prognostic signature (P-value = 8.20E-6; Fig. 5G). In addition, it is explored that more genomic changes (P-value < 2.20E-16; Fig. 5H) and triple negative breast cancer (TNBC) patients (P-value = 1.20E-8; Fig. 5I) relative to higher values of IRGP prognostic signature. In terms of molecular characteristics, patients in molecular subtypes C4 (P-value < 2.20 E-16; Fig. 5J) and HER2-enriched as well as Basal-like (P-value < 2.20 E-16; Fig. 5K) exhibited significantly higher values of IRGP prognostic signature than others. Above all, increasing IRGP risk score is associated with worse clinical status or molecular subtype.
3.4 Establishment of a nomogram predicting OS in breast cancer patients
In order to develop a clinically relevant quantitative method to predict the mortality rate of patients, we constructed a nomogram integrating IRGP prognostic signature derived score and clinical information to predict survival probability of breast cancer patients in TCGA entire cohort (Fig. 6A). The concordance index (C-index) of the nomogram was 0.820 in TCGA cohort, which indicating a good discriminatory ability of the nomogram. The calibration plots showed that the derived nomograms performed well compared to the performance of an ideal model at 1-year, 3-years and 5-years (Fig. 6B-D). Decision curve analysis shows that the clinical utility of nomograms greatly exceeds TNM staging system (Fig. 6E). The nomogram analysis indicated that the IRGP prognostic signature is the most important factor to predict survival probability of breast cancer patients, which has the greatest contribution to survival with largest regression coefficient.
3.5 Potential of IRGP prognostic signature as an indicator of response to immunotherapy
It has been reported that the infiltration of immune cells is associated with the prognosis of breast cancer patients. The expression levels of genes in IRGP signature in differential immune cell types were investigated in scRNA-seq dataset GSE169246. The results showed that CMTM8, CXCL13, TNFRSF12A, ICAM2, PIK3R3, IL7R, PDGFD were higher expressed in T cells (Fig. S4). The immune cell infiltration analysis shown that high-risk group patients was obviously infiltrated by M0 macrophages and M2 macrophages, while low-risk group patients showed an obvious increase trend in infiltration by naive B cell, memory resting CD4 T cells and CD8 T cells (Fig. 7A). Then, we estimated the tumor microenvironment (TME) in the two risk groups. We found that the patients in high-risk group had a lower stromal score (Fig. 7B) and lower immune score (Fig. 7C). And increasing IRGP risk score was significantly positively correlated with tumor purity (Fig. 7D).
Then, we investigate the relationship between IRGP prognostic signature and ICI genes, including PDCD1, CD274, CTLA-4, TIGIT, VSIR and CD276. As shown in Fig. S5, the IRGP prognostic signature was markedly negatively related with PDCD1, CTLA-4, TIGIT, VSIR (Fig. S5A-D), and positively correlated with CD276 (Fig. S5E). The expression of CD274 was negatively related with IRGP prognostic signature but without statistical significance (Fig. S5F). Moreover, patients in low-risk group tend to express high expression level of PDCD1 (P-value = 2.30E-15; Fig. 8A), CTLA-4 (P-value = 2.50E-6; Fig. 8B), CD274 (P-value = 5.80E-4; Fig. 8C) compared with patients in high-risk group. In addition, there were statistical significance between the risk group and expression level of TIGIT, VSIR, CD276 (Fig. S5G-I). This result indicated that IRGP prognostic signature was associated with the expression levels of ICI genes (PDCD1, CTLA-4, TIGIT, VSIR and CD276).
Furthermore, we used the TIDE score and stemness indices to explore whether the IRGP prognostic signature could reflect the immunotherapeutic benefit in breast cancer patients. Interestingly, we found that the TIDE score was significantly lower in low-risk group and low-risk patients may be more likely to respond to immunotherapy compared with high-risk patients (Fig. 8D-E). As excepted, we observed that the breast cancer patients in low-risk group had lower stemness indices value than high-risk group (Fig. 8F-G). Collectively, these results indicated that low-risk patients were more likely to respond to immunotherapy.
3.6 Differential somatic mutation burden landscape between two risk score levels
We visualized the results of 390 high-risk and 565 low-risk breast cancer patients using the " maftools" package, based on mutation data with VCF format. The waterfall plot shows the mutation information for each gene in each sample, and the different types of mutations are annotated by different colors at the bottom (Fig. 9A-B). The result exhibited the top 10 mutated genes with ranked percentages, where TP53 shown the highest mutation rate in high risk breast cancer patients (39%) compared with low risk breast cancer patients (19%). Besides, the number of altered bases in each sample of high risk breast cancer patients was higher than low risk breast cancer patients (Fig. 9C-D).
3.7 Identification of IRGP prognostic signature related biological pathways and processes
GSEA was performed to elucidate the biological functions of the IRGP prognostic signature. According to the GO analysis results (Fig. 10A, Table S4), these genes highly expressed in the low-IRGP risk score group were associated with immune-related gene set, including T cell receptor complex, B cell receptor signaling pathway, phagocytosis recognition, and immunoglobulin receptor binding. While the high-IRGP risk score group related genes showed significant enrichment in DNA transcription-related gene set, such as the mRNA cis splicing via spliceosome, pre mRNA binding, mRNA splice site selection, and mRNA 5’ splice site recognition. The KEGG pathway results also shown that pathways involved in immune response were activated in low-IRGP risk score group, including T cell receptor signaling pathway, natural killer cell mediated cytotoxicity, and antigen processing and presentation (Fig. 10B).