1 Comprehensive analysis of multi-omics data to obtain lncRNAs related to COAD prognosis
1.1 Lncrnas That Are Closely Related To Coad Prognosis
According to the univariate Cox regression analyse, we identified a total of 483 candidate prognostic lncRNAs from the training cohort, and information on the top 20 lncRNAs is shown in Table 2.
Table 2
Information on the top 20 candidate prognostic lncRNAs
lncRNA ID | P-value | HR | Low 95% CI | High 95% CI |
ENSG00000274925 | 5.78E-06 | 1.156 | 1.086 | 1.231 |
ENSG00000272512 | 1.29E-05 | 1.219 | 1.115 | 1.333 |
ENSG00000275494 | 3.19E-05 | 1.051 | 1.027 | 1.077 |
ENSG00000258053 | 3.59E-05 | 1.114 | 1.059 | 1.173 |
ENSG00000260563 | 6.04E-05 | 1.086 | 1.043 | 1.131 |
ENSG00000245281 | 8.09E-05 | 0.362 | 0.219 | 0.600 |
ENSG00000247095 | 0.000138199 | 1.035 | 1.017 | 1.053 |
ENSG00000235245 | 0.000149251 | 1.104 | 1.049 | 1.162 |
ENSG00000272555 | 0.000187134 | 0.437 | 0.283 | 0.675 |
ENSG00000229380 | 0.000188209 | 1.309 | 1.136 | 1.507 |
ENSG00000273456 | 0.000197802 | 1.159 | 1.073 | 1.253 |
ENSG00000279148 | 0.000207384 | 0.422 | 0.268 | 0.666 |
ENSG00000228288 | 0.000246995 | 1.077 | 1.035 | 1.120 |
ENSG00000269680 | 0.000250415 | 1.253 | 1.110 | 1.413 |
ENSG00000227947 | 0.000294015 | 0.591 | 0.445 | 0.786 |
ENSG00000271781 | 0.000314732 | 1.068 | 1.031 | 1.108 |
ENSG00000238113 | 0.000381295 | 1.301 | 1.125 | 1.505 |
ENSG00000230641 | 0.000382509 | 0.476 | 0.316 | 0.717 |
ENSG00000226659 | 0.000402587 | 1.176 | 1.075 | 1.287 |
ENSG00000251637 | 0.000431394 | 0.495 | 0.335 | 0.732 |
1.2 lncRNAs that are closely related to gene copy number variation
We obtained lncRNAs that are closely related to gene copy number variation. A total of 137 lncRNAs that were significantly amplified on each fragment of the COAD genome (Fig. 2A), including LINC00392 on the 13q22.1 segment (q = 8.17E-12), LINC01598 on the 20q11.21 segment (q = 5.75E-09) and LOC730183 on the 16p11.2 segment (q = 0.0014485), were identified. On the other hand, a total of 261 lncRNAs that were significantly deleted on each fragment in the COAD genome (Fig. 2B), including LINC00681 in the 8p22 segment (q = 2.74E-45), LOC101928728 in the 1p36.11 segment (q = 2.02E-09), and LINC00491 in the 5q22.2 segment (q = 1.48E-05), were identified.
1.3 lncRNAs that are closely related to gene mutations
Through MutSig2, we identified a total of 41 genes with significant mutation frequencies. The distribution of synonymous mutations, missense mutations, framework insertions or deletions, framework movements, nonsense mutations, cleavage sites and other nonsynonymous mutations in these 41 genes in TCGA COAD patient samples are shown in Fig. 3. We identified 41 genes, some of which have been reported to be closely related to the development of cancer, such as KRAS, TP53, APC, PIK3CA, and FBXW7. Among these 41 genes, we identified lncRNAs associated with gene mutations using each of the genes to mutate into a tag, and a rank-sum test was used to detect the difference between the expression of each lncRNA in the mutant and nonmutant groups. lncRNAs with a P-value < 0.01 were considered to be associated with a gene mutation; thus, we obtained 2712 lncRNAs related to gene mutations.
2 Identification Of An 11-lncrna Signature For Coad Survival
The comprehensive analysis revealed 147 lncRNAs associated with amplifications, deletions, and mutations from a total of 483 candidate prognostic lncRNAs. We analysed the change trajectory of each independent variable (Fig. 4A). It can be seen that with the gradual increase in lambda, the number of independent coefficients becomes closer to zero. We used three-fold cross-validation to build the model. The confidence interval under each lambda is shown in Fig. 4B. As shown in the figure, the model was optimal when lambda = 0.04078231. For this reason, we selected the lncRNAs obtained when lambda = 0.04078231 as the target lncRNAs to construct the model.
After Lasso Cox regression narrowed the scope, we obtained 21 target lncRNAs that were used to construct the model. A multivariate Cox survival analysis was performed on 21 lncRNAs, and the 11 lncRNAs with the lowest AIC values (AIC = 767.27) were used to construct the final model. Details of the 11 lncRNAs are shown in Table 3. The 11-lncRNA signature was then tested for its ability to predict survival in COAD patients.
Table 3
Eleven lncRNAs identified as significantly associated with OS in the training cohort
Ensembl Gene ID | Symbol | Coef | HR | Z-score | P-value | Low 95% CI | High 95% CI |
ENSG00000269680 | AC008760.1 | 2.2664 | 9.645 | 3.516 | 0.000438 | 2.727 | 34.115 |
ENSG00000215039 | CD27-AS1 | 0.3273 | 1.387 | 2.03 | 0.042341 | 1.011 | 1.903 |
ENSG00000249550 | LINC01234 | 0.4492 | 1.567 | 2.418 | 0.015587 | 1.089 | 2.255 |
ENSG00000247095 | MIR210HG | 0.3796 | 1.462 | 3.788 | 0.000152 | 1.201 | 1.779 |
ENSG00000180139 | ACTA2-AS1 | 0.9405 | 2.561 | 3.111 | 0.001866 | 1.416 | 4.632 |
ENSG00000256546 | AC156455.1 | 0.6567 | 1.928 | 2.736 | 0.006223 | 1.205 | 3.087 |
ENSG00000260805 | AC092803.2 | 1.5048 | 4.503 | 2.02 | 0.043372 | 1.046 | 19.391 |
ENSG00000273576 | AC009283.1 | 0.1432 | 1.154 | 2.15 | 0.031533 | 1.013 | 1.315 |
ENSG00000246627 | CACNA1C-AS1 | 2.7294 | 15.323 | 4.146 | 3.39E-05 | 4.216 | 55.69 |
ENSG00000238113 | LINC01410 | 2.0232 | 7.563 | 3.026 | 0.00248 | 2.039 | 28.045 |
ENSG00000235560 | AC002310.1 | 1.3988 | 4.05 | 2.372 | 0.017675 | 1.275 | 12.863 |
3 Determination and analysis of the 11-lncRNA signature in the training cohort
The 11-lncRNA signature was then established using a multivariate Cox regression analysis with the following model:
RiskScore11 = 2.2664*expAC008760.1+0.3273*expCD27−AS1+0.4492*expLINC01234
+ 0.3796*expMIR210HG+0.9405*expACTA2−AS1+0.6567*expAC156455.1
+ 1.5048*expAC092803.2+0.1432*expAC009283.1+2.7294*expCACNA1C−AS1
+ 2.0232*expLINC01410+1.3988*expAC002310.1
The risk score was calculated as the sum of the above gene expression values * the ordinal, and then we selected 0.9892846 as the cutoff (median risk score) and divided the samples into high-risk and low-risk groups. Finally, 247 patients were classified as low risk, and 110 patients were classified as high risk; significantly different OS rates were observed between the two groups in the training cohort (log-rank P < 0.0001, HR = 2.014) Fig. 5C). We acquired a five-year AUC of 0.83 according to the ROC curve for predicting survival in COAD patients (Fig. 5B). As the patient’s risk score increased, the OS rate significantly decreased, and the number of deaths in the high-risk group increased significantly (Fig. 5A). According to the changes in the expression levels of the 11 different lncRNAs in the signature observed with increases in the risk score, the expression of ENSG00000246627 was correlated with a low risk of COAD, and the other 10 lncRNAs were identified as risk factors based on their high expression and correlation with a high risk of COAD.
4 Validation of the 11-lncRNA signature in the testing and GSE17536 cohorts
First, the 11-lncRNA signature was validated in the testing cohort; 88 patients were classified as low risk, and 31 patients were classified as high risk. There was a significant difference in OS between the two groups (log-rank P = 0.0019, HR = 3.374) (Fig. 6C). The five-year AUC was 0.66 according to the ROC curve (Fig. 6B). The results of the testing cohort were similar to those of the training cohort; as the patient’s risk score increased, the OS time decreased significantly, and the number of deaths in the high-risk group increased significantly (Fig. 6A). Moreover, 10 lncRNAs (all lncRNAs except ENSG00000246627) were identified as risk factors.
Similarly, we used the same model and the same cut-off from the training cohort and verified the model’s robustness using an external independent data cohort (GSE17536). Ultimately, 99 patients were classified as low risk, and 78 patients were classified as high risk. A significant difference in OS was observed between the two groups (log-rank P = 0.0076, HR = 1.864) (Fig. 7C). The five-year AUC was 0.71 according to the ROC curve (Fig. 7B). The GSE17536 cohort showed similar results to the TCGA training cohort. As the risk score increased, the survival time decreased significantly, and the number of deaths in the high-risk group increased. The expression of the 11 different signature lncRNAs also increased with the increase in the risk score, indicating that high expression of the 11 lncRNAs is associated with a high risk of COAD and could serve as a risk factor (Fig. 7A).
5 Independent predictive power of the 11-lncRNA signature according to the MSI status, tumour stage and clinicopathological characteristics
The patients were subdivided into a high-risk subgroup and a low-risk subgroup based on different MSI statues, and the 11-lncRNA signature was used to predict OS; the OS rate was significantly different between the high-risk and low-risk subgroups in the MSI-L and MSS groups (excluding MSH) (Fig. 8A-C). Based on their tumour stage, patients were subdivided into a high-risk group and a low-risk group in each stage, and the 11-lncRNA signature revealed no significant difference in OS at the II, III and IV stages (all stages except stage I) between the two groups (Fig. 8D-G). Furthermore, these results demonstrate that the 11-lncRNA signature model can better predict the OS of patients with different MSI statuses and tumour stages.
We systematically analysed the clinical information of the TCGA and GSE17536 patients, including age, sex, lymph node invasion status, pathology (T, N, and M classifications), tumour stage, and the 11-lncRNA signature grouping information (Table 4).
Table 4
Identification of the clinical factors and clinical independence associated with prognosis with univariate and multivariate Cox regression analyses
Variables | Univariate analysis | Multivariable analysis |
HR | 95% CI of HR | P-value | HR | 95% CI of HR | P-value |
TCGA training dataset |
11-lncRNA risk score |
Risk score (High/Low) | 4.96 | 3.14–7.82 | 5.14E-12 | 4.39 | 2.58–7.46 | 4.26E-08 |
Age | 1.02 | 0.99–1.04 | 0.07 | 1.03 | 1.01–1.05 | 0.002 |
Sex(Male/Female) | 1.14 | 0.72–1.82 | 0.57 | 0.97 | 0.60–1.58 | 0.921 |
T3/T4 vs T1/T2 | 5.77 | 1.81–18.37 | 0.002 | 3.59 | 0.84-15.219 | 0.082 |
N1/N2 vs N0 | 3.05 | 1.85–5.01 | 1.04E-05 | 0.31 | 0.091–1.07 | 0.065 |
M1 vs M0 | 2.58 | 1.61–4.16 | 8.72E-05 | 1.48 | 0.88–2.49 | 0.136 |
Stage Ⅲ/Ⅳ vs Stage Ⅰ/Ⅱ | 3.37 | 2.00-5.67 | 4.94E-06 | 7.56 | 1.94–29.47 | 0.003 |
TCGA validation dataset |
11-lncRNA risk score |
Risk score (High/Low) | 3.06 | 1.36–6.89 | 0.0066 | 3.02 | 1.13–8.08 | 0.027 |
Age | 1.02 | 0.98–1.05 | 0.28 | 1.04 | 0.99–1.08 | 0.083 |
Sex (Male/Female) | 0.93 | 0.42–2.05 | 0.86 | 0.55 | 0.21–1.44 | 0.226 |
T3/T4 vs T1/T2 | 0.99 | 0.29–3.39 | 0.99 | 0.76 | 0.13–4.34 | 0.756 |
N1/N2 vs N0 | 2.52 | 1.13–5.58 | 0.02 | 0.54 | 0.059–4.87 | 0.583 |
M1 vs M0 | 8.03 | 3.31–19.51 | 4.17E-06 | 5.85 | 1.88–18.15 | 0.002 |
Stage Ⅲ/Ⅳ vs Stage Ⅰ/Ⅱ | 2.86 | 1.24–6.56 | 0.01 | 5.05 | 0.41–61.95 | 0.205 |
GSE17536 validation dataset |
11-lncRNA risk score |
Risk score (High/Low) | 1.86 | 1.17–2.97 | 0.0085 | 1.65 | 1.02–2.67 | 0.039 |
Age | 1.006 | 0.98–1.02 | 0.49 | 1.02 | 1.002–1.04 | 0.029 |
Sex (Male/Female) | 1.104 | 0.69–1.76 | 0.67 | 1.17 | 0.71–1.91 | 0.521 |
Stage Ⅲ/Ⅳ vs Stage Ⅰ/Ⅱ | 4.22 | 2.39–7.46 | 7.28E-07 | 4.226 | 2.36–7.56 | 1.21E-06 |
In the TCGA training cohort, we found significant survival-related correlations in the clinicopathological characteristics, with the exception of age and sex, according to the univariate Cox regression analysis, but we found that only the risk score (HR = 4.39, 95% CI = 2.58–7.46, P = 4.26E-08), age, and stage III/IV vs sage I/II were significantly related to survival according to the multivariate Cox regression analysis. The 11-lncRNA signature was verified to be clinically independent.
In the TCGA testing cohort, the risk score, N1/N2 vs N0, M1 vs M0, and stage III/IV vs stage I/II were significantly associated with survival according to the univariate Cox regression analysis, but only the risk score (HR = 3.02, 95% CI = 1.13–8.08, P = 0.027) and M1 vs M0 were significantly related to survival according to the multivariate Cox regression analysis. The 11-lncRNA signature was also verified to be clinically independent.
In the GSE17536 cohort, the risk score and stage III/IV vs stage I/II were significantly associated with survival according to the univariate Cox regression analysis, but only the risk score (HR = 1.65, 95% CI = 1.02–2.67, P = 0.039), age and stage III/IV vs stage I/II were significantly associated with survival according to the multivariate Cox regression analysis.
In conclusion, the 11-lncRNA signature is a prognostic indicator independent of other clinical factors and shows independent predictive performance with clinical application value.
6 Identification of the 11-lncRNA signature-associated biological pathways in the training cohort
The signalling pathways associated with the 11-lncRNA signature significantly enriched in the TCGA training cohort were detected by GSEA (Table 5). The signalling pathways that were significantly enriched in the high-risk and low-risk groups, were the Notch signalling pathway, the VEGF signalling pathway, the P53 signalling pathway and the cell cycle; all were significantly associated with the development and metastasis of COAD (Fig. 9).
Table 5
KEGG pathways significantly enriched in the high-risk and low-risk groups detected by GSEA
NAME | SIZE | ES | NES | NOM P-val | FDR q-val | FWER P-val |
KEGG_NOTCH_SIGNALING_PATHWAY | 47 | -0.468 | -1.633 | 0.028 | 1.000 | 0.726 |
KEGG_VEGF_SIGNALING_PATHWAY | 75 | -0.354 | -1.486 | 0.048 | 1.000 | 0.916 |
KEGG_P53_SIGNALING_PATHWAY | 67 | 0.434 | 1.660 | 0.015 | 0.936 | 0.632 |
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM | 32 | 0.464 | 1.601 | 0.025 | 0.552 | 0.752 |
KEGG_RIBOSOME | 87 | 0.757 | 1.773 | 0.031 | 0.747 | 0.387 |
KEGG_UBIQUITIN_MEDIATED_PROTEOLYSIS | 133 | 0.418 | 1.658 | 0.035 | 0.632 | 0.638 |
KEGG_BASAL_TRANSCRIPTION_FACTORS | 35 | 0.490 | 1.575 | 0.043 | 0.466 | 0.801 |
KEGG_CELL_CYCLE | 124 | 0.459 | 1.609 | 0.050 | 0.657 | 0.745 |
7 Comparison of the 11-lncRNA signature with other COAD prognostic signatures
The ROC and OS KM curves of the five models are shown in Fig. 10. Significantly different OS rates were observed between the high-risk and low-risk groups using the six-lncRNA signature established by Zhao[17] (log-rank P = 0.0014,HR = 2.03) (Fig. 10B). We acquired a five-year AUC of 0.65 and a ten-year AUC of 0.67 according to the ROC (Fig. 10A). Significantly different OS rates were also observed between the two groups using the two-lncRNA signature established by Xue[18] (log-rank P = 0.018, HR = 1.73) (Fig. 10D), and we obtained a five-year AUC of 0.54 and a ten-year AUC of 0.47 (Fig. 10C). Significantly different OS rates were also observed with the 14-lncRNA signature reported by Xing[19](Fig. 10F), and we found a five-year AUC of 0.66 and a ten-year AUC of 0.53 (Fig. 10E). In addition, the six-lncRNA signature established by Fan[20](Fig. 10H) yielded a five-year AUC of 0.64 and a ten-year AUC of 0.41(Fig. 10G), and the 15-lncRNA signature obtained by Wang[21] (Fig. 10J) resulted in a five-year AUC of 0.78 and a ten-year AUC of 0.67 (Fig. 10I). The final comparison showed that our model was slightly better than the 15-lncRNA model and significantly better than the other four models.