1. Construction of energy metabolism related molecular subtypes
By using the NMF analysis based on the expression of the 565 EMRGs (Supplementary Figure 2A), we identified two distinct subtypes (Cluster1[n=74], Cluster2[n=97]) between the 171 patients in the TCGA PAAD dataset (Figure 1A-B). Clinically, patients in Cluster 1 showed significant higher tumor grade than that in Cluster 2 (Supplementary Figure 2C). Moreover, we assessed the potential difference on prognosis between the two subtypes, which demonstrated that patients in Cluster 1 had significant better OS than that in Cluster 2 (p = 0.017, HR = 0.597 95%CI 0.383-0.914, Figure 1B). Similarity, the expression profiles of these 565 EMRGs can also divided 257 patients into two molecular subtypes in the ICGC PAAD dataset (Figure 1C-D, Supplementary Figure 2B), and also patients in Cluster 1 showed significant better OS than that in Cluster 2 (p = 0.003 HR = 0.610 95%CI 0.441-0.844, Figure 1D). These data showed the consistent existing of the molecular types in PAAD.
Then, we calculate the immune scores of six cells (B cell, CD4 T cell, CD8 T cell, Neutrophil, Macrophage, and Dendritic) in each PAAD sample and analyzed the potential difference between Cluster 1 and Cluster 2. The results showed that except for the B cell immune score, Cluster 1 showed higher immune score than Cluster 2 (Figure 1E). We further observed that the scores of immunity, stroma and tumor purity in Cluster 1 were also significantly higher than that in Cluster 2 (Figure 1F). These results indicate that lower immune cells infiltration in tumor environments (TME) may confers worse prognosis in patients with PAAD.
2. Identification of differential co-expression genes between subtypes
We extracted the expression profile of protein coding genes from TCGA PAAD dataset, and clustered all samples through hierarchical clustering (Supplementary Figure 3A), from which we confirmed that there is no outlier sample. To ensure that the network constructed by WGCNA is scale-free, β was set as 10 (Supplementary Figure 3B). Then we run cluster analysis and obtained 14 modules, among which grey module represent gene sets that cannot be aggregated to other modules (Figure 2A). Moreover, by analyzing the correlation of the module and genes in the module with phenotypes (Supplementary Table 2), we found that the blue module (contains 1692 co-expression genes) is significant correlated with Cluster 1, and the yellow module (contains 645 co-expression genes) is significant correlated with Cluster 2, respectively (Figure 2B-D). In addition, by analyzing the differential expression genes (DEGs) between Cluster 1 and Cluster 2, we obtained 2411 DEGs, comprised of 1641 up-regulated DEGs and 770 down-regulated DEGs between the two subtypes (Figure 2E-F, Supplementary Table 3). We further analyzed these 2411 DEGs and those co-expression genes in blue and yellow modules, and obtained 743 overlapping genes (Supplementary Table 4). These 743 co-expression DEGs were analyzed by GO function and KEGG pathway enrichment (Supplementary Table 5), and 38 KEGG pathways, 52 GO cellular component (CC), 126 GO molecular function (MF) and 977 GO biological process (BP) were enriched. The top enriched pathways include cell adhesion molecules (CAMs), transcriptional mis-regulation in cancer, immunological synapse and T cell differentiation (Supplementary Figure 3C-F), suggesting that these co-expression DEGs may involve in PAAD molecular regulative network by exhibiting pivotal function through these pathways.
3. Development of prognostic risk modal based on co-expression DEGs
By analyzed the expression profiles of 743 co-expression DEGs and corresponding survival of training set for using univariate Cox proportional hazard regression model, we obtained sixty-seven prognostic co-expression DEGs (P < 0.01, Supplementary Table 6). After Lasso Cox regression analysis and 10-fold cross validation, we selected four genes (λ = 0.1042) as the candidate genes for construction of prognostic risk modal (Supplementary Figure 4A-B). And we then established a gene-based prognostic model by using univariate Cox regression analysis (Table 2). The high expression level of GJB5, MET and TMEM139 were identified as risk factors, while AFF3 as protective factors. The final 4-gene signature formula is as follows: RiskScore = - 0.1513* expAFF3 + 0.0156*expGJB5 + 0.0045*expMET + 0.0164*expTMEM139.
We calculated the risk score of each sample according to the established model, and plotted the risk score distribution, which showed that the survival time of the samples with high risk score is significantly shorter than that with low risk score (Figure 3A). In addition, the AUCs of 1-, 3-, and 5-year ROC curves for the 4-gene signature to predict PAAD survival were all above 0.70 (Figure 3B). Finally, we carry out Z-score normalized on RiskScore, which classified samples with risk score greater than zero into high-risk group, and samples with risk score less than zero into low-risk group. Kaplan-Meier survival analysis demonstrated there were significant differences between the high- and low- risk group (log rank P < 0.001, HR = 2.413, Figure 3C).
4. Internal and external validation of the prognostic risk model
In order to determine the robustness of the model, we submit patients in the entire TCGA dataset into the formula aforementioned. The risk score distribution of all samples (Figure 4A), corresponding ROC curve (Figure 4B), and Kaplan-Meier survival curves (Figure 4C) showed that the AUCs of the signature remained high, and the high-risk groups had consistently shorter OS than the low-risk groups.
We further verified the robustness of the 4-gene prognosis signature by external analyzed in the GSE57495 dataset (Figure 5A-C) and ICGC PAAD dataset (Figure 5D-F) using the same coefficients aforementioned. Excellent performance was overserved in the prognostic risk indication.
5. Independence of the 4-gene prognosis signature
In order to identify the independence of 4-gene signature in clinical application, we conducted univariate and multivariate Cox regression in TCGA PAAD dataset. We systematically analyzed the clinical data of patients, including age, gender, pathologic T stage, pathologic N stage, pathologic M stage, tumor stage, tumor grade, and the 4-gene signature. Univariate Cox regression analysis showed that gender, tumor grade, pathologic T stage, pathologic M stage, tumor stage and the 4-gene signature were significantly associated with survival (P < 0.05, Figure 6A). However, multivariate Cox regression analysis showed that age, pathologic N stage and the 4-gene signature (Figure 6B) were independent prognostic indicators in PAAD. The above conditions indicated that the 4-gene signature has good predictive performance in clinical application.
Furthermore, we combined clinical features and the 4-gene signature and constructed a nomogram using the entire TCGA PAAD dataset (Figure 6C). Nomogram suggested the 4-gene signature has the greatest impact on the survival rate prediction. We calibrated the performance of 1, 2, and 3-year nomography data for visualization of nomogram, which further verified the consistency between predicted and actual survival probability (Figure 6D).
6. Comparison with previous prognostic models.
Previous studies had identified several prognostic models for survival prediction of PAAD patients. The predictive performance of the present 4-gene signature was further compared with four previous models (a 15-gene signature proposed by Chen et al.(15), a 7-gene signature proposed by Cheng et al. (16), a 5-gene signature proposed by Raman et al. (17), and a 7-gene signature proposed by Magouliotis et al. (18). We calculated the risk score of each PAAD sample in TCGA PAAD dataset based on the corresponding coefficients provided by each model, evaluated the ROC of each model, and divided the samples into high-risk and low-risk groups based on the median risk score of each signature. All of the four models could divide the patients into high-risk group and low-risk group (Supplementary Figure 5). Kaplan-Meier curves showed that except for Li model (P = 0.076), there are significant differences between high-risk group and low-risk group in Chen model, Cheng model and Raman model (P < 0.05, Supplementary Figure 5A-D). Among the four models, the AUC of Chen model and Raman model are greater than 0.70, but generally the prediction effect of the four models is worse than that of our four gene models (Supplementary Figure 5E-H). Furthermore, RMST curve (Figure 7A) and DCA curve (Figure 7B) were used to evaluate the predictive effect of our 4-gene signature and the four published models on the prognosis of PAAD patients, both demonstrated that the performance of our four gene models was significantly better than those of the four models.
7. GSEA analysis of enriched pathway based on risk score
To investigate the relationship between the risk score and biological function of different samples, we conducted single sample GSEA (ssGSEA) analysis, and calculated ssGSEA score of each sample on different biological functions. The correlation between these functions and RiskScore with the coefficient cutoff of 0.4 showed that most of the functional pathways were negatively correlated with the RisScore of samples (Figure 8A). Moreover, we divided the training set into high-risk group and high-risk group referring to the RiskScore. GSEA was used to analyze the significantly enriched pathways in the two groups (Supplementary Table 7). The result showed that pathways including bladder cancer, pentose phosphate pathway, p53 signaling pathway and thyroid cancer were significant negatively correlated with the low-risk group, whilst neuroactive ligand receptor interaction pathway was negatively correlated with the high-risk group (P < 0.01, Figure 8B).