Clinical characteristics of CCA datasets
Two independent datasets of CCA were collected in this study. The clinical characteristic of all CCA patients was summarized in Table 1. The training set from TCGA contained 36 CCA patients with a mean follow-up time of 806 days, ranging from 10 to 1976 days. The mean age of individuals from TCGA was 64. There were 18 (50%) patients alive at the time of the last follow-up. The 30 CCA patients from GSE107943 selected as validation set had a mean follow-up time of 334 days (ranging from 14 to 1140), the average age of 66 at initial pathologic diagnosis, and more than half of patients (17) dead during follow-up times.
Differentially expressed mRNAs and lncRNAs in CCA
Based on the expression profiles of the TCGA_CHOL dataset, we compared mRNAs and lncRNAs expression level between 36 CCA tumor and 9 adjacent normal samples. A total of 4787 DEmRNAs (Figure 1A) and 1950 DElncRNAs (Figure 1B) were identified by DESeq2, whereas 4907 DEmRNAs (Figure 1C) and 2216 DElncRNAs (Figure 1D) were detected by edgeR. The 4628 DEmRNAs (Figure 1E) and 1810 DElncRNAs (Figure 1F) identified by both packages were utilized for further downstream analyses. The heatmaps showed that CCA samples were clearly distinguished from normal tissues based on the top 200 DEmRNAs and DElncRNAs (Figure S1).
Development of integrated mRNA-lncRNA signature in the training cohort
To detect prognostic biomarkers in CCA patients, we carried out a univariate Cox proportional hazards regression analysis for 4628 DEmRNAs and 1810 DElncRNAs in the discovery set. A total of six mRNAs (ACRV1, TMEM121B, PIWIL4, GOLGA8M, CFHR3, and FUT4) and three lncRNAs (AC134682.1, AC007285.1, and AC138430.1) with P-value < 0.01 were determined to be significantly associated with overall survival (Table S1). The six mRNAs and three lncRNAs were further subjected to a stepwise multivariate Cox regression analysis. The optimally integrated mRNA-lncRNA signature was determined with the lowest value of Akaike information criterion [29], which included two mRNAs (CFHR3 and PIWIL4) and two lncRNAs (AC007285.1 and AC134682.1). The chromosomal position, hazard ratio, P-value, and coefficient of these four prognostic RNAs in CCA are provided in Table 2. Among these four RNAs, only CFHR3 had a positive coefficient, which suggested that higher expression was related with shorter survival, whereas the remaining three RNAs were protective factors as their negative coefficients implied that higher expression level was associated with higher survival rate. We subsequentially performed a meta-analysis for CFHR3 and PIWIL4 using random effect model due to the P-value of heterogeneity test less than 0.05 as shown in Figure S2. The pooled SMD of CFHR3 and PIWIL4 were -2.80 (95% CI: -4.13 to -1.47) and 1.46 (95% CI: 0.51 to 2.41), respectively, which provided additional confidence of prognostic value as these mRNAs that were also differently expressed cross various independent CCA cohorts.
To build integrated mRNA-lncRNA signature for survival prediction in CCA patients, we calculated the risk scores for each individual using expression level of the two mRNAs and the two lncRNAs weighted by their regression coefficients from above multivariate Cox analysis as follows: risk score = (3.18 × expression value of CFHR3) + (-1.62 × expression value of PIWIL4) + (-2.97 × expression value of AC007285.1) + (-1.95 × expression value of AC134682.1). The patients then were classified into high-risk group (23 patients) and low-risk group (13 patients) based on the optimal cut-off point (-0.14) determined by “cutp” function from survMisc package (Figure 2A). The survival status and the expression pattern of the four prognostic RNAs for each CCA patient in the discovery cohort are presented in Figure 2A as well. The Kaplan–Meier curve with a log-rank test suggested that patients in the low-risk group have significantly longer survival time compared to the patients in a high-risk group (Figure 2B). Additionally, the univariate Cox regression model (Table 3) showed a 6.46-fold increase (P-value = 1.35E-02) of hazard ratio in the high-risk group compared with the low-risk group for OS. Time-dependent ROC curve for the risk score model in the training cohort is shown in Figure 2C, with an area under the curve (AUC) of 0.872 and 0.790 for 1- and 3-year OS prediction, respectively, which implied that the integrated mRNA-lncRNA signature possessed a high specificity and sensitivity.
Validation for the prognostic prediction value of integrated mRNA-lncRNA signature in the independent validation cohort and the combined dataset
To evaluate robustness of the integrated mRNA-lncRNA signature for prognosis in CCA patients, we validated its prognostic ability in an independent cohort (GSE107943) obtained from GEO database and yielded similar results as we obtained from the training dataset. Individuals in the validation dataset were divided into high-risk group (16 patients) and low-risk group (14 patients) according to the threshold determined by the same method as for the training dataset. The survival outcome of high-risk group patients was significantly worse than that of low-risk group (P-value = 1.71E-04) as shown in Figure 3B. Notably, there were 14 deaths in patients with high-risk scores and only three death events in low-risk group (Figure 3A). The hazard ratio of high-risk group was 8.04 folds compared with that of low-risk group (95% CI= 2.26 - 28.62, P-value= 1.29E-03) in the univariable analysis (Table 3). The AUC of time-dependent ROC curve was 0.750 and 0.729 for 1- and 3-year overall survival prediction (Figure 3C), representing the risk score model has a good performance in CCA patients’ OS prediction.
Furthermore, we assessed prognostic performance of the integrated mRNA-lncRNA signature in the combined dataset, which was consistent with the findings from the discovery or validation cohort. The principal components analysis (PCA) for the combined samples showed that there was clearly divergence between the training and validation individuals (Figure S3A), which suggests that the batch effect existed in the combined data. To eliminate bias caused by the batch effect, we fitted it as a fixed effect in the formula design generated by DESeq2 package. As a result, the batch effect was adjusted properly (Figure S3B-C). The patients were then divided into high-risk group (n =34) and low-risk group (n = 32) according to the same risk score model and criteria. Kaplan-Meier survival curves between two risk groups were significantly different in the combined dataset with a P-value of 5.51E-06 (Figure 4B). The survival rates at 3- and 5-years were 26.47% and 23.53% for patients in the high-risk group, compared to 87.50% and 75.00% survival rate for patients in the low-risk at 3- and 5-years respectively. Patients with high-risk scores exhibited a 5.27-fold increased risk than patients in low-risk group (Table 3). Results of time-dependent ROC curve analysis similar to those obtained from the training and validation datasets are presented in Figure 4C with AUC equal to 0.819 for 1-year and 0.781 for 3-years OS prediction.
Correlation between the integrated mRNA-lncRNA signature and other clinicopathologic characteristics
To investigate independence of the integrated mRNA-lncRNA signature in survival prediction, a multivariate Cox regression analysis was performed including risk scores, age, gender, tumor stage, residual tumor, and histologic grade. In the training cohort, the integrated mRNA-lncRNA signature was the most significant (P-value= 1.12E-02) compared with the other clinical characteristics. Furthermore, after adjusting the age, gender, and tumor stage, we found the hazard ratios of overall survival in high-risk versus low-risk group were 7.70 and 5.27 in the validation and the combined dataset, respectively (Table 3).
Besides, the tumor stage was relatively associated with OS in the validation (P-value= 5.65E-02) and combined (P-value= 5.88E-02) dataset (Table 3). The stratification analysis was carried out to estimate the relationship of the mRNA-lncRNA signature with tumor stage. All patients were classified into two subgroups: I/II stage with 49 samples and III/IV stage with 17 individuals. As shown in Figure 5A-B, KM curve observed patients with high-risk scores have significantly shorter survival time than that with low-risk scores in stage I/II (P-value = 4.71E-04) and stage III/IV (P-value = 4.97E-03) subgroup. Accordingly, the multivariate Cox and stratification analysis demonstrated that prognostic power of the integrated mRNA-lncRNA signature was independent from other clinical features.
We also compared prognostic performance of the mRNA-lncRNA signature with other clinical features by calculating the AUC of time-dependent ROC. In the combined set, the AUC of mRNA-lncRNA risk scores at 3 years was 0.781, which was higher than that of tumor stage (AUC = 0.673), gender (AUC = 0.541), and age (AUC = 0.505) (Figure 5C). These results demonstrated that the mRNA-lncRNA signature had a better prognostic power than other factors including tumor stage, age, and gender.
Functional roles of the integrated mRNA-lncRNA signature in the CCA biology
We performed WGCNA for 1067 co-expressed mRNAs to cluster genes that highly correlated with the risk scores. A total of 5 modules were identified including a turquoise module with 522 mRNAs, a blue module with 272 mRNAs, a brown module with 139 mRNAs, a yellow module with 131 mRNAs and a grey module with three mRNAs. The turquoise module showed a higher correlation with risk score model than other modules, which had a high correlation coefficient of 0.87 (P-value=8.00E-12) with risk scores (Figure 6A-B). We then carried out GO and KEGG enrichment analyses based on the 522 genes from the turquoise module. These 522 genes significantly enriched in 567 GO terms and 48 KEGG pathways. The top 10 GO biological processes and KEGG pathways are shown in Figures 6C-D. These genes were mostly enriched in catabolic or metabolic biological processes, such as small molecule catabolic process, organic acid catabolic process, carboxylic acid catabolic process, and lipid related metabolic processes. The enriched KEGG pathways included metabolic-related pathways involved in cholesterol metabolism, drug metabolism - cytochrome P450, glycine, serine and threonine metabolism, and primary bile acid biosynthesis. In addition, complement and coagulation cascades and PPAR signaling pathways were also enriched by these turquoise module genes.