Building survival model to identify clinically relevant eRNAs
We developed an eRNA based survival model to identify clinically relevant eRNA for CC and RC through the following steps (Figure 1).
- We first randomly selected 70% of the entire cohort from the TCGA as the training set and reserved the remaining 30% of the cohort as testing set. To avoid deviation affecting stability of the model we maintained the distribution of disease-free and relapsed patients from the entire cohort in both training and testing sets.
- Multivariate Cox regression analysis was carried out in the training set on 477 eRNAs for CC and on 460 eRNAs for RC, along with controlling the effects from other clinical risk factors including patient age, gender and TNM stage.
- eRNAs significantly associated (P < 0.05 & z-score > 1.96) for predicting patient DFS were retained and termed as prognostic eRNAs and were further selected by LASSO Cox regularization with 10-folds cross validation.
- Following LASSO Cox regularization and eRNA selection, a risk score formula was established. The risk score for each patient was calculated by a linear combination of expression and multivariate Cox coefficient of eRNAs.
- Patients were classified into low-risk or high-risk groups using the median risk score as the cut-off threshold from the training set. The coefficient for each eRNA and the cut-off value of the risk score from the training set was used to calculate the risk score and to stratify patients into individual risk groups in the testing set.
- Survival difference between the two groups were estimated using Kaplan-Meier curve and compared using log-rank test.
The above-mentioned steps 1-6 were repeated for 100 times to obtain 100 different eRNA sets. Only those sets that significantly stratified patients into two groups (high-risk groups exhibited poor survival than low-risk groups) in the testing were termed as successful survival model, else failed model. The results obtained from one such trial is shown in Additional file 1: Figure S1. For CC, we applied the model to the entire cohort (n = 379), however, the model was generalized, and we found very few prognostic eRNAs signature which successfully stratified patients into two different risk groups within the testing cohort. Due to CC heterogeneity, we performed the identification of prognostic eRNAs signature within CC subtypes based on Consensus Molecular Subtypes (CMS) classification. The entire CC cohort (n = 379) was classified into 5 subtypes with CMS1 comprising of 13%, CMS2 23%, CMS3 14%, CMS4 29%, and mixed subtype 21% (Figure 2A). The proportion of patients classified into CMS subtypes were approximately similar as described in Guinney et al [20] study. We further checked the five-year DFS probability of each CMS subtypes with CMS1: 89%; CMS2: 64%; CMS3: 81%; CMS4: 40%; and mixed subtype: 60% as shown in Figure 2B. As CMS4 displayed the worst survival compared to other CMS subtypes, we focused our study to identify a clinically relevant eRNA signature for this subtype. We found 22 prognostic eRNAs for CC-CMS4 subtype and 19 prognostic eRNAs for RC that significantly distinguished patients into two risk groups in testing set (Figure 2, C and D).
Prognostic association of eRNA signature risk score with disease-free survival in patients
A risk score was constructed using the median Cox-coefficients from the successful survival models. For the CC-CMS4 subtype, compared with patients in the low-risk score group, those with high-risk scores exhibited a poor DFS (Figure 3A). Only 17% of CC patients in the high-risk group were disease-free at 5 years, compared with 93% of the patients in the low-risk score group. We further randomly divided the entire cohort into 50% and using the same risk scoring formula and cut-off thresholds from the entire cohort, wherein the 22 eRNA signature still significantly stratified patients into two risk groups for DFS (Figure 3B). For RC, 54% of the patients in the high-risk score group were disease-free at 5 years, compared with 84% of the patients in the low- risk groups (Figure 3C). Similarly, upon random splitting the RC cohort into 50% and using the same cut-off threshold, the 19 eRNA signature significantly stratified the patients into two risk groups (Figure 3D)
Recurrence prediction by eRNA signature is independent of clinical risk factors
To evaluate the prognostic independence of eRNA signature against the known clinical risk factors, multivariate Cox regression analysis revealed that the 22 eRNA signature was an independent predictor of DFS in CC patients (HR= 11.89, P = 9.54e-04; Figure 4A). Additionally, we found that patients with relapsed tumors manifested with significantly higher risk scores than disease free patients (P= 5.72e-07; Figure 4B). Similarly, the 19 eRNA signature was independently associated with DFS in RC (HR=3.91, P = 3.52e-02; Figure 4C), and their expression was significantly higher in relapsed patient (P = 2.61e-06; Figure 4D).
The eRNA signature is a better predictor of recurrence with high sensitivity and specificity
To confirm the prediction accuracy for DFS, we performed ROC curve analysis for the eRNA signature. In both CC and RC, we observed eRNA signature exhibited excellent prediction power for relapse. For CC, the AUC and 95% confidence interval (CI) for the 22-eRNA signature was 0.83 and 0.739-0.928 (Figure 5A), and for RC the 19-eRNA signature achieved a remarkable prediction accuracy (AUC=0.834, 95% CI= 0.737-0.931; Figure 5B). Subsequently, when eRNA signature was combined with the TNM stage, it achieved an even higher prediction accuracy in CC (AUC=0.847) suggesting its potential clinical utility in this malignancy. However, no increment in prediction accuracy was observed in RC when combined with TNM stage.
Higher expression of eRNAs associated with tumor recurrence compared to its target genes
We performed the annotation of each prognostic eRNA (as suggested by Zhang et al [21]) and examined the expression profiles of eRNAs and their neighboring target genes, which were highly positively correlated. For example, AADAC-associated eRNA (AADACe, ENSR00000160285) was highly correlated with the AADAC gene in CC (SCC = 0.43, P= 1.03e-05) as shown in Figure 6A. Similarly, TRMT6-associated eRNA (TRMT6e, ENSR00000106418) was highly correlated with the expression of the TRMT6 gene in RC (SCC = 0.38, P= 1.53e-05) as shown in Figure 6B. Several studies have reported that eRNA expression has a better survival prediction potential compared to its target genes. In support of the previous studies, we also observed that high levels of AADACe was associated with worse survival (log-rank test P = 0.02058, Figure 6C). Interestingly, AADAC gene itself was not associated with CC patient survival, suggesting AADACe clinical utility regardless of AADAC expression in CC patients. Similar results were observed for TRMT6-associated eRNA (TRMT6e) in RC as shown in Figure 6D.
Putative biological functions of eRNA signature in colorectal cancer
The eRNAs have no protein-coding capacity; thus, we applied guilt-by-association strategy to explore the putative potential biological functions of the eRNA signature in CRC as shown in Figure 7A. We found biological functions related to metabolic process, DNA methylation, transcription initiation, cell cycle, myeloid leukocyte mediated immunity for CC. Similarly, biological functions related to regulation of DNA replication, ncRNA metabolic process, mRNA processing, oxidative phosphorylation were enriched for RC. The highest enriched biological functions for TATDN2-associated eRNA (TATDN2e, ENSR00000148476) in CC was histone H3K4 methylation and CDH17-associated eRNA (CDH17e, ENSR00000227152) in RC was regulation of mRNA splicing by spliceosomes (Figure 7, B and C)