Integrative analysis and identification of an excellent lncRNA signature to predict prognosis in patients with colon cancer

Backgroud: Tumour recurrence and metastasis lead to poor prognosis in colon cancer (COAD). Therefore We aimed to identify a lncRNA signature through an integrative analysis of copy number variation, mutation and transcriptome data to predict prognosis and explore its internal mechanism. Methods: The lncRNA expression profile were collected from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO). TCGA data was randomly divided 3:1 into training and testing cohort. In the training, we performed integrated analyses of three candidate lncRNA sets that correlated with prognosis, copy number variations and mutations to establish a signature through Cox regression analysis. The robustness was determined in the testing and GEO. Results: An 11-lncRNA validated in the testing ( P =0.0019, HR=3.374) and GSE17536 ( P =0.0076, HR=1.864). The signature is significantly related to MSI status and clinical prognostic factors. The prognostic-related risk scores were significantly excellent than the other five models have been reported. Furthermore, GSEA suggested that the signature was involved in COAD development and metastasis-related pathways. Conclusions: We identified an signature has strong robustness and can stably predict the prognosis of COAD in different platforms and may be implicated in COAD pathogenesis and metastasis and applied clinically as a prognostic marker.

the difference in the expression of each lncRNA between the mutant and nonmutant groups, and each gene mutation was used as a label. lncRNAs with significant P values (P<0.01) were considered to be associated with a gene mutation. Finally, the lncRNA dataset related to gene mutations was established.

Development and validation of the robustness of the lncRNA signature
By intersecting these three lncRNA sets (Prognosis-related lncRNAs, copy number variation-related lncRNAs and mutation-related lncRNAs), we obtained the targeted lncRNAs. The least absolute shrinkage and selection operator (Lasso) method, which is a compression estimate, was used to narrow the lncRNA range. We used the R package glmnet for the Lasso Cox regression analysis [14,15]. Furthermore, we performed a multivariate Cox regression analysis, and stepwise regression was used to reduce the number of lncRNAs again. The lowest Akaike information criterion (AIC) value as the final model. RiskScore=coefficient*exp lncRNA1 +oefficient*exp lncRNA2 +oefficient*exp lncRNA3 +....+coefficient*exp lncRNAn .
The risk score of each sample in the TCGA training cohort was then calculated.
Based on the median risk score, the COAD patients were divided into two groups: a high-risk group and a low-risk group. A receiver operating characteristic (ROC) curve was used to test the accuracy of lncRNA signature to predict prognosis. Finally, the lncRNA signature was verified in the testing cohort. The GSE17536 cohort used the same model and the same cutoff as the TCGA training cohort.

Independent predictive power of the lncRNA signature based on different MSI statuses, tumour stages and clinicopathological characteristics
We divided the TCGA cohort into MSI-high (MSI-H), MSI-low (MSI-L) and microsatellite stable (MSS) groups according to the MSI phenotype information described by the TCGA network study [16]. We further analyze the relationship between this lncRNA signature and MIS status.
To identify the lncRNA signature model for clinical applications, we analysed the relationship between clinical information (including age, sex, lymph node invasion status, pathology (T, N, and M classifications), tumour stage) and lncRNA signature through univariate and multivariate Cox regression analyses.

Identification of the lncRNA signature-associated biological pathways with gene set enrichment analysis (GSEA)
We used GSEA and the "cluster profile R" package to conduct Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. Significantly enriched pathways in the high-risk and low-risk groups in the TCGA training cohort were identified. The selected gene set was c2.cp.kegg.v6.0.symbols, which contained the KEGG pathways. In KEGG pathway analysis, P<0.05 was considered to indicate statistical significance.

Comparison of the lncRNA signature with other COAD prognostic signatures
By reviewing the literature, we identified five prognostic-related risk models, a six-lncRNA signature (PMID: 30396175) [17], a two-lncRNA signature (PMID: 29254165) [18], a 14-lncRNA prognostic signature (PMID: 29565464) [19], a six-lncRNA signature (PMID: 29227531) [20] and a 15-lncRNA signature (PMID: 30510449) [21], for comparison with our lncRNA signature. To make the models comparable, we performed a multivariate Cox regression analysis to calculate the risk scores of the training set samples based on the corresponding genes in the three models. We evaluated the ROC of the five models and then divided the samples into high-risk and low-risk groups according to the median risk score and analysed the difference in OS between the two groups.

Results
1 Comprehensive analysis of multi-omics data to obtain lncRNAs related to COAD prognosis

lncRNAs that are closely related to COAD prognosis
According to the univariate Cox regression analyse, we identified a total of 483 candidate prognostic lncRNAs from the training cohort, and information on the top 20 lncRNAs is shown in Table 2.

lncRNAs that are closely related to gene copy number variation
We obtained lncRNAs that are closely related to gene copy number variation. A total of 137 lncRNAs that were significantly amplified on each fragment of the COAD genome (Figure 2A), including LINC00392 on the 13q22.1 segment (q= 8.17E-12), LINC01598 on the 20q11.21 segment (q=5.75E-09) and LOC730183 on the 16p11.2 segment (q=0.0014485), were identified. On the other hand, a total of 261 lncRNAs that were significantly deleted on each fragment in the COAD genome (Figure2B), including LINC00681 in the 8p22 segment (q=2.74E-45), LOC101928728 in the 1p36.11 segment (q= 2.02E-09), and LINC00491 in the 5q22.2 segment (q=1.48E-05), were identified.

lncRNAs that are closely related to gene mutations
Through MutSig2, we identified a total of 41 genes with significant mutation frequencies. The distribution of synonymous mutations, missense mutations, framework insertions or deletions, framework movements, nonsense mutations, cleavage sites and other nonsynonymous mutations in these 41 genes in TCGA COAD patient samples are shown in Figure 3. We identified 41 genes, some of which have been reported to be closely related to the development of cancer, such as KRAS, TP53, APC, PIK3CA, and FBXW7. Among these 41 genes, we identified lncRNAs associated with gene mutations using each of the genes to mutate into a tag, and a rank-sum test was used to detect the difference between the expression of each lncRNA in the mutant and nonmutant groups. lncRNAs with a P-value<0.01 were considered to be associated with a gene mutation; thus, we obtained 2712 lncRNAs related to gene mutations.

Identification of an 11-lncRNA signature for COAD survival
The comprehensive analysis revealed 147 lncRNAs associated with amplifications, deletions, and mutations from a total of 483 candidate prognostic lncRNAs. We analysed the change trajectory of each independent variable ( Figure   4A). It can be seen that with the gradual increase in lambda, the number of independent coefficients becomes closer to zero. We used three-fold cross-validation to build the model. The confidence interval under each lambda is shown in Figure 4B.
As shown in the figure, the model was optimal when lambda=0.04078231. For this reason, we selected the lncRNAs obtained when lambda=0.04078231 as the target lncRNAs to construct the model.
After Lasso Cox regression narrowed the scope, we obtained 21 target lncRNAs that were used to construct the model. A multivariate Cox survival analysis was performed on 21 lncRNAs, and the 11 lncRNAs with the lowest AIC values (AIC=767.27) were used to construct the final model. Details of the 11 lncRNAs are shown in Table 3. The 11-lncRNA signature was then tested for its ability to predict survival in COAD patients.

Determination and analysis of the 11-lncRNA signature in the training cohort
The 11-lncRNA signature was then established using a multivariate Cox regression analysis with the following model: The risk score was calculated as the sum of the above gene expression values * the ordinal, and then we selected 0.9892846 as the cutoff (median risk score) and divided the samples into high-risk and low-risk groups. Finally, 247 patients were classified as low risk, and 110 patients were classified as high risk; significantly different OS rates were observed between the two groups in the training cohort (log-rank P<0.0001, HR=2.014) Figure 5C). We acquired a five-year AUC of 0.83 according to the ROC curve for predicting survival in COAD patients ( Figure 5B). As the patient's risk score increased, the OS rate significantly decreased, and the number of deaths in the high-risk group increased significantly ( Figure 5A). According to the changes in the expression levels of the 11 different lncRNAs in the signature observed with increases in the risk score, the expression of ENSG00000246627 was correlated with a low risk of COAD, and the other 10 lncRNAs were identified as risk factors based on their high expression and correlation with a high risk of COAD.
First, the 11-lncRNA signature was validated in the testing cohort; 88 patients were classified as low risk, and 31 patients were classified as high risk. There was a significant difference in OS between the two groups (log-rank P=0.0019, HR=3.374) ( Figure 6C). The five-year AUC was 0.66 according to the ROC curve ( Figure 6B).
The results of the testing cohort were similar to those of the training cohort; as the patient's risk score increased, the OS time decreased significantly, and the number of deaths in the high-risk group increased significantly ( Figure 6A). Moreover, 10 lncRNAs (all lncRNAs except ENSG00000246627) were identified as risk factors.
Similarly, we used the same model and the same cut-off from the training cohort and verified the model's robustness using an external independent data cohort (GSE17536). Ultimately, 99 patients were classified as low risk, and 78 patients were classified as high risk. A significant difference in OS was observed between the two groups (log-rank P=0.0076, HR=1.864) ( Figure 7C). The five-year AUC was 0.71 according to the ROC curve ( Figure 7B). The GSE17536 cohort showed similar results to the TCGA training cohort. As the risk score increased, the survival time decreased significantly, and the number of deaths in the high-risk group increased.
The expression of the 11 different signature lncRNAs also increased with the increase in the risk score, indicating that high expression of the 11 lncRNAs is associated with a high risk of COAD and could serve as a risk factor ( Figure 7A).

Independent predictive power of the 11-lncRNA signature according to the MSI status, tumour stage and clinicopathological characteristics
The patients were subdivided into a high-risk subgroup and a low-risk subgroup based on different MSI statues, and the 11-lncRNA signature was used to predict OS; the OS rate was significantly different between the high-risk and low-risk subgroups in the MSI-L and MSS groups (excluding MSH) ( Figure 8A-C). Based on their tumour stage, patients were subdivided into a high-risk group and a low-risk group in each stage, and the 11-lncRNA signature revealed no significant difference in OS at the II, III and IV stages (all stages except stage I) between the two groups ( Figure   8D-G). Furthermore, these results demonstrate that the 11-lncRNA signature model can better predict the OS of patients with different MSI statuses and tumour stages.
We systematically analysed the clinical information of the TCGA and GSE17536 patients, including age, sex, lymph node invasion status, pathology (T, N , and M classifications), tumour stage, and the 11-lncRNA signature grouping information (Table 4).
In the TCGA training cohort, we found significant survival-related correlations in the clinicopathological characteristics, with the exception of age and sex, according to the univariate Cox regression analysis, but we found that only the risk score (HR=4.39, 95% CI=2.58-7.46, P=4.26E-08), age, and stage III/IV vs sage I/II were significantly related to survival according to the multivariate Cox regression analysis.
The 11-lncRNA signature was verified to be clinically independent.
In the TCGA testing cohort, the risk score, N1/N2 vs N0, M1 vs M0, and stage III/IV vs stage I/II were significantly associated with survival according to the univariate Cox regression analysis, but only the risk score (HR=3.02, 95% CI=1.13-8.08, P=0.027) and M1 vs M0 were significantly related to survival according to the multivariate Cox regression analysis. The 11-lncRNA signature was also verified to be clinically independent.
In the GSE17536 cohort, the risk score and stage III/IV vs stage I/II were significantly associated with survival according to the univariate Cox regression analysis, but only the risk score (HR=1.65, 95% CI=1.02-2.67, P=0.039), age and stage III/IV vs stage I/II were significantly associated with survival according to the multivariate Cox regression analysis.
In conclusion, the 11-lncRNA signature is a prognostic indicator independent of other clinical factors and shows independent predictive performance with clinical application value. The signalling pathways associated with the 11-lncRNA signature significantly enriched in the TCGA training cohort were detected by GSEA ( Table 5). The signalling pathways that were significantly enriched in the high-risk and low-risk groups, were the Notch signalling pathway, the VEGF signalling pathway, the P53 signalling pathway and the cell cycle; all were significantly associated with the development and metastasis of COAD (Figure 9).

Comparison of the 11-lncRNA signature with other COAD prognostic signatures
The ROC and OS KM curves of the five models are shown in Figure 10.
Significantly different OS rates were observed between the high-risk and low-risk groups using the six-lncRNA signature established by Zhao [17] (log-rank P=0.0014， HR=2.03) ( Figure 10B). We acquired a five-year AUC of 0.65 and a ten-year AUC of 0.67 according to the ROC ( Figure 10A). Significantly different OS rates were also observed between the two groups using the two-lncRNA signature established by Xue [18] (log-rank P=0.018, HR=1.73) ( Figure 10D), and we obtained a five-year AUC of 0.54 and a ten-year AUC of 0.47 ( Figure 10C). Significantly different OS rates were also observed with the 14-lncRNA signature reported by Xing [19]( Figure   10F), and we found a five-year AUC of 0.66 and a ten-year AUC of 0.53 ( Figure 10E).
In addition, the six-lncRNA signature established by Fan [20]( Figure 10H) yielded a five-year AUC of 0.64 and a ten-year AUC of 0.41( Figure 10G), and the 15-lncRNA signature obtained by Wang [21] (Figure 10J) resulted in a five-year AUC of 0.78 and a ten-year AUC of 0.67 ( Figure 10I). The final comparison showed that our model was slightly better than the 15-lncRNA model and significantly better than the other four models.

Discussion
CRC is a common digestive tract tumour that is a serious threat to the health of patients. According to recent statistics, there are approximately 1.45 million new cases of CRC each year, which made it the third most prevalent cancer in 2018 [1] and the second most prevalent cancer in American males in 2019 [22]. Approximately 694,000 deaths have been reported every year, making the second most common cause of cancer-related death worldwide [2]. The recurrence and metastasis of CRC seriously affect the efficacy and prognosis of treatment. Identifying the risk of recurrence and metastasis can help us guide early intervention for the treatment of CRC, ultimately improving the prognosis. Therefore, it is very urgent to study and identify one or more efficient molecular models for predicting prognosis to guide treatment options and to improve the survival quality of CRC patients.
lncRNAs are a major class of ncRNAs with a length of more than 200 base pairs and are not translated into proteins [23,24]. Over the past decade, it has become clear that certain lncRNAs have strong potential, and an in-depth study is needed to elucidate their mechanism of action. lncRNAs not only control the nuclear structure [25] but also regulate the expression of adjacent genes and act as amplifiers, with remarkable tissue specificity through various mechanisms [26]. lncRNAs have been shown to directly regulate gene expression at the transcriptional, posttranscriptional and epigenetic levels. lncRNAs have long nucleotide chains and intricate secondary structures and can interact with genomic DNA, chromatin, transcription factors, chromatin regulators, spliceosomes and other nuclear proteins [27]. lncRNAs are known to serve as tumour regulators and participate in complex networks of biological regulation [28,29]. Therefore, the identification of lncRNAs closely related to tumour prognosis and the establishment of a signature that can predict the risk of tumour prognosis will be helpful for improving the prevention and treatment of tumours. Zhang GH [15] identified a novel four-lncRNA signature using Cox regression analysis to identify lncRNAs that correlated with the prognosis of 111 laryngeal cancer patients from the TCGA. The signature was shown to predict the prognosis of patients with laryngeal cancer and may influence the prognosis of laryngeal cancer through many pathways, such as regulating immunity and tumour apoptosis. Jie Li [30] also identified a five-lncRNA signature to predict the risk of tumour recurrence in breast cancer (BC) patients and found that it was independent of clinical prognostic factors, such as BC subtypes and adjuvant treatments.
Recently, an increasing number of researchers have focused on the role of lncRNAs in the development and progression of CRC and its significance to clinical prognosis [4,6]. Many lncRNAs, such as MALAT1 and HOTAIR, have also been used as biomarkers in CRC [31,32]. Several lncRNAs have not only been reported as markers in CRC diagnosis but also been shown to be correlated with patient prognosis (e.g., CCAL, PURPL and lnc-GNAT1-1) [33][34][35]. Therefore, this method was also used to identify a CRC-related lncRNA signature for predicting the prognosis of CRC [17][18][19][20][21] by analysing different datasets.
At the same time, as high-throughput multi-omics sequencing data have laid a solid foundation for identifying genes associated with cancer prognosis, multi-omics data analysis can reveal the mechanisms of cancer development from multiple perspectives [36]. We identified closely related genomics, gene mutations, epigenetics, and functions of lncRNAs that are inherently regulated by COAD [4,6]. We show that dysregulated lncRNAs may be new prognostic and diagnostic biomarkers or therapeutic targets for clinical applications. Therefore, we identified a robust lncRNA signature through an integrative analysis of prognosis-related lncRNA, copy number variation, mutation and transcriptome data with the help of multi-omics data analysis technology. In this study, we fully integrated and analysed lncRNAs that are related not only to prognosis but also to copy number variations, mutations and transcriptome regulation in 359 COAD samples from the TCGA training cohort. We developed an 11-lncRNA signature that was validated in testing and GSE17536 cohorts. We also found that the signature had good robustness and good predictive ability of the prognosis risk in other independent datasets. Moreover, we compared the 11-lncRNA signature with other COAD prognostic signatures [17][18][19][20][21] to validate the good prediction of the prognosis risk of the 11-lncRNA signature. We found that the AUC of the 11-lncRNA signature was better than that of the other five signatures. Because the researchers did not confirm that the lncRNAs directly regulate gene expression at the transcriptional, post-transcriptional and epigenetic levels, we focused on the predictive role of lncRNAs in tumour prognosis and established lncRNA signatures to predict prognosis merely by analysing and identifying lncRNAs associated with prognosis. However, the intrinsic interactive relationship between lncRNAs, genomics, gene mutations and epigenetics was not investigated. Ultimately, the studies failed to identify a good lncRNA signature to predict the risk of CRC prognosis. Significant differences in OS outcomes between the high-risk and low-risk groups were obtained using the described method [37].
Furthermore, we found that the 11-lncRNA signature had independent predictive power from the MSI status and tumour stage (e.g., MSI-L and MSS patients and patients in stages II, III and IV, excluding stage I), which may be due to a good survival rate in COAD patients with stage I [22]. This study also showed that the 11-lncRNA signature was a prognostic indicator independent of other clinical factors and had independent predictive performance with clinical application value.
Additionally, we identified the 11-lncRNA signature-associated biological pathways that were significantly enriched in COAD patients as detected by GSEA. In summary, we determined that the Notch signalling pathway, the VEGF signalling pathway, the P53 signalling pathway and the cell cycle are significantly associated with the development and metastasis of COAD [38,39].
In summary, we constructed a novel 11-lncRNA signature that can be used to predict the prognosis of patients with COAD and exploited the possible underlying mechanisms involved. The 11-lncRNA signature may indicate the potential roles of lncRNAs in COAD pathogenesis. The results will provide molecular diagnostic markers and therapeutic targets with clinical implications in COAD patients. Finally, we hope that all of the above results will be verified in basic experiments and clinical trials in further studies.

Disclosure statement
No potential conflict of interest was reported by the authors.