Identification of immune-related genes with differential prognosis in patients with prostate cancer and construction of prognostic gene signatures for 10 genes
In the TCGA training set data, we used a univariate Cox proportional hazard regression model to establish the relationship between overall patient survival and immune-related gene expression, found that there were 78 genes with differences. In order to screen for robust immune-related prognostic signature genes, we used the R package glmnet to perform lasso cox regression on these 78 genes for dimensionality reduction analysis. The trajectory of each independent variable shows that as the lambda gradually increases, the number of independent variable coefficients tending to increase gradually. (Fig. 2A). The results of the confidence interval analysis show that the model is optimal when lambda = 0.039. We choose the model when lambda = 0.039 as the final model, which contains 19 genes (Fig. 2B). Furthermore, the 19 genes obtained in the previous step were reduced to 10 genes by multivariate cox survival analysis using AIC red pool information criterion (Table 2). The prognostic KM curve results of 10 genes show that 5 genes (ie, RXRB, KAL1, EED SORT1, and GPI) could significantly reduce the risk of TCGA training set samples in two groups (p < 0.05) (Figure S2). The resulting 10-gene signature formula is as follows:
Table 2
Basic information of this 10 Gene.
Gene Symbol | Coef | Pvalue | HR | Low.95.CI. | High.95.CI. |
KCNH2 | 0.2018 | 0.0001 | 1.2236 | 1.1072 | 1.3522 |
GREM1 | 0.1629 | 0.0001 | 1.1770 | 1.0833 | 1.2787 |
PSMD14 | 0.4970 | 0.0011 | 1.6438 | 1.2187 | 2.2171 |
RXRB | 0.2324 | 0.0055 | 1.2616 | 1.0707 | 1.4866 |
BCL10 | -0.3032 | 0.0336 | 0.7385 | 0.5584 | 0.9767 |
KAL1 | -0.3677 | 0.0641 | 0.6924 | 0.4691 | 1.0218 |
EED | 0.4985 | 0.0680 | 1.6462 | 0.9637 | 2.8120 |
SORT1 | -0.1302 | 0.0766 | 0.8779 | 0.7602 | 1.0140 |
LTB4R | 0.2973 | 0.1104 | 1.3462 | 0.9345 | 1.9391 |
GPI | 0.0282 | 0.1187 | 1.0286 | 0.9928 | 1.0656 |
RiskScore10 = KCNH2*0.2018 + GREM1*0.1629 + PSMD14*0.4970 + RXRB*0.232 + BCL10*-0.30312 + KAL1*-0.3676 + EED*0.4985 + SORT1*-0.1302 + LTB4R*0.2972 + GPI*0.0282.
Based on the Risk model, the Risk Score of each sample in the TCGA training set was calculated. The training set samples were divided into the high-risk group (risk-H) and the low-risk group (risk-L) as the median Risk Score was taken as the threshold. As patients' risk scores increased, the number of dead samples increased. Heat map analysis showed that high expression of BCL10, KAL1 and SORT1 were correlated with low risk, which were protective factor. High expression of KCNH2, GREM1, PSMD14, RXRB, EED, LTB4R and GPI were correlated with high risk, which were risk factors associated. (Fig. 2C). ROC curve analysis suggests that the Area Under Curve (AUC) at one, three, and five years are greater than 0.70 (Fig. 2D). A risk model based on 10 genes is used to predict the KM survival curve of the risk-H and the risk-L on the training set. There is a significant difference between the risk-H and the risk-L (Fig. 3E).
Robustness of the 10 gene model
In order to verify the stability and reliability of the model, we applied 10 gene signatures to the validation set and TCGA all data, calculated the risk score of each sample. The median risk score as the threshold, the validation set data and all TCGA data set samples were divided into the risk-H and the risk-L. As patients' risk scores increased, the number of deaths increased, and the high-risk group had more deaths than the low-risk group. Heat map analysis showed that high expression of BCL10, KAL1 and SORT1 are correlated with low risk, which are protective factor. High expression of KCNH2, GREM1, PSMD14, RXRB, EED, LTB4R and GPI are risk factors associated with high risk (Fig. 3A, D). The prediction effect of the model at 1,3 and 5 years was evaluated in the validation set data and all TCGA data sets, and the AUC was greater than 0.70 (Fig. 3B, E). The risk model constructed based on 10 genes was used to predict the k-m survival curve of the high-risk group and the low-risk group on the validation set data and all TCGA data sets, and there was a significant difference between the high-risk group and the low-risk group (Fig. 3C, F).
Analysis between risk models and clinical characteristics
The comprehensive analysis of Riskscore calculated by 10-gene signature and clinical information revealed that The 10-mRNA signature could distinguish T2, T1 + T3 + T4, N0, N1, M0, patients who have not received radiotherapy, young group, and old group significantly different from high-risk group (p < 0.05) (Figure S2). This results indicated that our model still has good predictive power in different clinical signs. To identify the independence of the 10-gene signature model in clinical applications, Univariate and multivariate COX regression analysis were used in the clinical information carried throughout the TCGA data. Univariate COX regression analysis found that only T-stage and Riskscore were significantly related to survival (Fig. 4A), However, the corresponding multivariate COX regression analysis only found that the T stage and Risk score (HR = 2.691, 95% CI = 1.200-6.039, p = 0.016) were significantly related to survival (Fig. 4B). The analysis results of the nomogram model show that the RiskScore has the greatest impact on the survival prediction, indicating that the 10 gene-based risk model can better predict the prognosis (Fig. 4C). In addition, we counted 1-, 3-, and 5-year data used to visualize nomogram performance (Fig. 4D).
Riskscore potentially relevant regulatory pathways
In order to observe the relationship between risk score and biological function of different samples, the R software package GSVA carried out ssGSEA analysis and functions with a correlation greater than 0.4 was selected. The results showed that most of the samples were negatively correlated with Riskscore, and a few were positively correlated with Riskscore (Fig. 5A). Furthermore, GSEA was used in the data set of the training set to analyze the significantly enriched pathways in the high-risk group and the low-risk group. The threshold value was p < 0.05, and the significantly enriched pathways were obtained (Fig. 5B), among them, PROPANOATE_METABOLISM, PPAR_SIGNALING_PATHWAY were significantly enriched to the high-risk group, which were closely related to the development of cancer.
Analysis of risk models and immune scores
Z score normalization was used for Risk score of each sample in the training set, the samples were divided into high risk group and low risk group. The TIMER tool was used to calculate the immune score of each sample in the training set, and it is found that other than the CD8 Tcell, the other immune scores show significant differences in the high and low risk groups (p < 0.05) (Fig. 6).
Comparison with other models
As described in the method, four prognostic risk models were selected: 4-gene signature (Wang), 4-gene signature (Komisarof), 24-gene signature (Long), and 22-gene signature (Erho). In order to make the models comparable, the risk score of each prostate adenocarcinoma (PRAD) sample in the TCGA using the same method based on the corresponding genes in the 4 models were calculated. The OS-KM curve shows that the OS prognosis of the Risk-H and Risk-L in the four models are significantly different except for the Wang model (logrank p < 0.05) (Fig. 6A-D). ROC curve reveals that the prediction effect of four models is worse than that of 10 gene signature (Fig. 6E-H). In order to compare the prediction performance of these models on PRAD samples, the RMS curve was drawn using the R language RMS package. The 10 gene signature is more accurate for long-term follow-up (Fig. 6C). Decision Curve Analysis (DCA) results showed that Riskscore had the highest degree of benefit, far higher than the other four subtypes (Fig. 6D).