Differential Genomic Instability-Associated LncRNAs Predict Differences of Clinical Outcome and Immunity in Left- And Right- Sided Colon Adenocarcinoma

Background: The left-sided and right-sided colon adenocarcinoma (LCCs and RCCs, respectively) have unique characteristics in various aspects, particularly molecular features and clinical heterogeneity. The purpose of our study was to develop a prognostic risk model based on differential genomic instability-associated (DGIA) Long non-coding RNAs (lncRNAs) of LCCs and RCCs, therefore the prognostic key lncRNAs could be identied. Methods: We adopted two independent gene data-sets, corresponding somatic mutation and clinical information from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) databases. Identication of differential DGIA lncRNAs from LCCs and RCCs were conducted with appliance of "Limma" analysis. Then, we screened out key lncRNAs based on univariate and multivariate Cox proportional hazard regression analysis. Meanwhile, DGIA lncRNAs related prognostic model (DRPM) was established. We employed the DRPM in the model group and internal verication group from TCGA for the purpose of risk grouping and accuracy verication of PRSM. We also veried the accuracy of key lncRNAs with GEO data. Finally, the differences of immune inltration and functional pathways were analyzed within different risk groups. Results: A total of 123 DGIA lncRNAs were screened out by differential expression analysis. We obtained 6 DGIA lncRNAs by the construction of DRPM, including AC004009.1, AP003555.2, BOLA3-AS1, NKILA, LINC00543 and UCA1. After the risk grouping by these DGIA lncRNAs, we found the prognosis of high-risk group (HRG) was signicantly worse than that in low-risk group (LRG) (all p (cid:0) 0.05). In all TCGA samples and model group, the expression of CD8 + T cells in HRG was lower than that in LRG (all p (cid:0) 0.05). The functional analysis indicated that there was signicant up-regulation with regard of pathways related to both genetic instability and immunity in LRG, including cytosolic DNA ROC


Background
Colon cancer (CC) are one of the leading types of cancer occurred in humans and globally more than 1.8 million people acquired this disease each year [1,2]. Lately, immunotherapy has achieved breakthroughs and fetching much consideration as a leading therapy against tumors. However, some CC patients shows a low response and drug resistance [3]. The traditional treatments such as surgery, radiotherapy, and chemotherapy are used to rectify the cancers, but their durable effects are also di cult to predict. The differences of above phenomenon are more obvious in left-sided and right-sided CCs (LCCs and RCCs, respectively) [4]. It is well known that the molecular and clinical heterogeneity differences between LCCs and RCCs are very complex, as are their occurrence, development, and response to treatment and prognosis [5]. It is important to understand the potential molecular biological mechanisms, as these in uences treatment decisions.
Recently, researchers have revealed that LCCs and RCCs are different in clinical and genomic characteristics [6]. In addition to the microsatellite instability status, the identi ed differences include APC, TP53, RAS and BRAF mutations, etc [7][8][9]. The dissimilarities in gene expression patterns could be used to analyze LCCs and RCCs. Mainly, doctors can be bene ted by selecting the most effective individualized treatment via the degree and nature of these molecular mutations. Therefore, while we are looking for novel and precise prognostic biomarkers, their use is more vital for guiding targeted therapy.
Prior to cell division, the delity of humans genome replication, exhibits a high degree of consistency over time [10]. However, various genome mutations causes to alter this replica, which leads to the occurrence and development of tumors [11]. In CC, mutations in mismatch repair genes led to functional defects, which can cause microsatellite instability (MSI). The distinction in MSI status is also one of the factors helps to differentiate between LCCs and RCCs [12]. Also, a variety of biological processes are related to genome instability, including abnormal transcription and post-transcriptional regulation, DNA damage regulation, etc [13]. Latest ndings disclosed that variations in the instability of the genome produces new antigens, which affects the immune phenotype and immunotherapy response [14]. Long non-coding RNAs (lncRNAs) are considered to be incapable to encode proteins, and they play an indispensable regulatory role in tumors. Currently, lncRNAs have been shown to be related to genome stability [15],but in LCCs and RCCs, the in uence of differential genomic instability-associated (DGIA) lncRNAs on tumorassociated immune microenvironment has not been explored yet.
Therefore, in the current investigation we proposed to create a prognostic model and risk clustering, containing key lncRNAs based on the differentially expressed genes and genomic instability in the LCCs and RCCs in The Cancer Genome Atlas (TCGA) database; the leading goal of this study was to analyze the differences in immune in ltration between high-and low-risk groups (HRG and LRG, respectively) along with the veri cation in the Gene Expression Omnibus (GEO) database. Moreover, to screen out new prognostic biomarkers related to genetic instability in LCCs and RCCs and to provide a molecular basis for identifying immunotherapy.

Data collection
In this research, we ratify two independent gene data-sets from different high-throughput platforms, including 473 colon adenocarcinoma (COAD) samples from TCGA (https://portal.gdc.cancer.gov/) and 156 COAD samples from GEO (http://www.ncbi.nlm.nih.gov/geo/) (GSE103479). The downloaded data included paired lncRNA and mRNA expression pro les, somatic mutation information, and clinical information. The CRCs in the cecum, ascending colon and hepatic exure were de ned as LCCs and CRCs in splenic exure, descending colon, sigmoid colon, and rectosigmoid junction was de ned as RCCs. After screening based on CRCs location, there were a total of 411 samples with complete information available for analysis, of which 322 from TCGA and 89 from GEO. In this series of analysis, the TCGA samples were divided into two groups randomly, as the model group and the internal veri cation group. To ensure the undifferentiated clustering, we performed an analysis to determine the differences in strati cation of various clinical factors. The GEO sample was used as the external validation group to verify the accuracy of prognostic lncRNAs. The analysis excluded RNA that was undetectable in more than 10% of the samples. Concerning each data-set, the gene ID was converted to the corresponding gene symbol according to the corresponding annotation package. Identi cation of differential genomic instability-associated lncRNAs from LCCs and RCCs Initially, we examined differentially expressed genes (DEGs) in LCCs and RCCs from TCGA by the R package "Limma" (|log2foldchange| > 0.5, false discovery rate (FDR) < 0.05) [16]. These DEGs were distributed by the human genome annotation package into mRNAs and lncRNAs. In order to assess the genomic instability, we proposed a mutator hypothesis-derived calculation method: we determined the cumulative number of somatic mutations (CNSMs) on the basis of number of changed sites for each gene on each sample and categorized the patients in descending order. The top 25% of patients were titled with genomic unstable like (GU) group and the last 25% as genomic stable like (GS) group. The differentially expressed lncRNAs between the two groups was evaluated and called as DGIA lncRNAs from LCCs and RCCs (| log2foldchange | > 0.5, FDR < 0.05).
Cluster and analyze the TCGA samples according to DGIA Hierarchical cluster test (HCA) was performed to verify the grouping effect of DGIA IncRNAs and to batch all TCGA samples according to DGIA by the R package "sparcl" [17]. HCA is the approach commonly used to classify similar samples or variables using Euclidean distances and Ward's linkage method. The samples were bundled into GU group and GS group by clustering. Subsequently, we explored the two groups on the CNSMs by univariate analysis.

Functional enrichment analysis
To estimate the potential functions of DGIA lncRNAs, two methods were applied to identify mRNAs that were more likely to be co-expressed with them. The rst one was used to analyze the co-expression relationship between lncRNAs and mRNAs by Pearsons correlation tests. Here we designated that the top 10 mRNAs with the highest coe cients have a strong co-expression relationship with each DGIA lncRNAs.
The second one was used to analyzed the DEGs between LCCs and RCCs through weighted gene coexpression network assessment by the R package "WGCNA" [18]. At rst, construction of an adjacency matrix (AM) of genes was done using the power function and an appropriate power index was selected. Later AM was converted into a topological overlap matrix. Finally, the gene consensus modules were collected and performed the correlation analysis with CNSMs. The mRNAs in the modules with the highest absolute correlation coe cient with CNSMs were selected for further examination.
Overall, intersection of the mRNAs was screened by the two methodes and accomplished Gene Ontology (GO) functional annotations and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis by R package "clusterPro ler" [19].

Construction of DGIA lncRNAs related prognostic model (DRPM)
To check the effect of DGIA lncRNAs on the prognosis, DGIA lncRNAs was secluded by univariate Cox proportional hazard regression analysis (COX). LncRNAs with P < 0.05 in univariate COX were retained and multivariate COX was performed in the model group by the R package "glmnet" [20]. The risk scores (RS) of the model group and the internal validation group were estimated according to the coe cient of each lncRNAs within the model. The patients in TCGA were separated into HRG and LRG with poor prognosis.

Validation of the DGIA lncRNAs in DRPM
Log-rank test was used to expose the difference in survival of HRG and LRG in the model group and internal validation group by R packages "survcomp" [21]. Simultaneously, the predictive effect of DRPM were gured out through the receiver operating characteristics curve (ROC) and the area under the ROC curve (AUC) by the R package "survivalROC" [22]. Additionally, univariate and multivariate COX were utilized to verify the independent predictive effect of the RS obtained by the model.
In the external veri cation group, DGIA lncRNAs was employed in DRPM to construct a prognostic model again by multivariate COX. The Log-rank test was also used for survival analysis, and time-dependent ROC (timeROC) of 1, 3, and 5 years were plotted. The purpose was to verify the accuracy of prognostic DGIA lncRNAs.
The survival curves of DGIA lncRNAs in DRPM were plotted and the differences were analyzed by log-rank test. The R package "maxstat" [23] was performed to get the best cut-off value.
Immune in ltration and gene set enrichment analysis (GSEA) in HRG and LRG R package "CIBERSORT" [24] was employed in the TCGA samples to estimate the relative in ltration abundance of 22 immune cells and to assess the variations in the various immune cells' in ltration of HRG and LRG. The results with p < 0.05 were retained. "CIBERSORT" calculated the p-value of the deconvolution for each sample by Monte-Carlo simulation to provide the assessed con dence. The differences in the abundance of 22 immune cells in HRG and LRG were examined by the Wilcoxon ranksum test.
Besides, to study the differences in biological functions of genes between HRG and LRG, we downloaded the biological process (BP), molecular function (MF) data-sets related to GO, and KEGG data-set. GSEA was performed using the Bioconductor package "fgsea" [25] with 10,000 permutations between LRG and HRG. The threshold values were p < 0.05.

DEGs and DGIA lncRNAs in LCCs and RCCs
We rstly selected, separated and bundled TCGA samples in furtherance to segregate DGIA IncRNAs from DEGs of LCCs and RCCs. Soon after, the corresponding gene expression data were standardized and analyzed. We obtained 1724 DEGs (Fig. 1a), including 1325 mRNAs and 399 lncRNAs. According to the CNSMs, the top 25% (n = 75) and last 25% (n = 62) patients were labeled as GU group and GS group, respectively. By analyzing in contrast in lncRNAs between these two groups, 123 DGIA lncRNAs were attained. Among them, 63 lncRNAs showed upregulation whereas 60 exhibited downregulation in the GU group (Fig. 1b, 1c). Based on these DGIA lncRNAs, we carried out unsupervised HCA on TCGA specimens and distributed them into GU group and GS group (Fig. 1d). The CNSMs in both groups were signi cantly different with its median higher in the GU group comparing with GS group (Fig. 1e). These ndings conclusively depicted that the selected DGIA lncRNAs had a marvelous classi cation effect.

Functional enrichment analysis for DGIA lncRNAs
To explore the functionalities and pathways concerned with 123 DGIA lncRNAs, we operated functional enrichment analysis on protein-coding genes (PCGs) co-expressed with DGIA lncRNAs. The rst method's procedure included a correlation analysis between the selected DGIA lncRNAs and 1,325 differential mRNAs from LCCs and RCCs. The PCGs of DGIA lncRNAs, that is, the top 10 mRNAs with the strongest correlation with each lncRNAs, were achieved.
The second method disclosed, after constructing the co-expression network (Additional le 1: Figure S1), the cognizance of blue module with highest positive correlation, and the turquoise module with highest negative correlation (Fig. 2a). We intersected the selective mRNAs from the blue and turquoise modules with PCGs chosen by the rst method. Thereupon, these genes were used to construct a lncRNAs-mRNA co-expression network (Fig. 1f).
The results of functional enrichment analysis with the intersection PCGs comprised DNA-binding transcription activator activity, RNA polymerase II-speci c and various phospholipase-related enzyme activities. These molecular functions are closely associated with the formation and development of genomic instability. More importantly, the enrichments of biological processes are mainly related to immune processes, such as T cell activation, lymphocyte differentiation, regulation of T cell activation, etc. (Fig. 2b). KEGG enrichment analysis displayed that both regulating pluripotency of stem cells signaling pathways, as well as immune-related pathways, including Th17 cell differentiation, Th1, and Th2 cell differentiation, PD − L1 expression, PD − 1 checkpoint pathway in cancer, etc., were signi cantly enriched (Fig. 2c). These results indicated that 123 DGIA lncRNAs not only cause genomic instability but also in uence the regulation of immune system. The variation in the expression of these 123 DGIA IncRNAs potentially disturbs the balance of co-expressed PCGs regulatory network, consequently causing instability in the cell genome, and also affecting the killing of tumors by immune cells, mostly by the proliferation, differentiation, activation, and receptor recognition of T cells. Thus, these DGIA lncRNAs possess astounding potential in immune regulation while affecting gene instability.

Construction of DRPM using DGIA lncRNAs
The samples from TCGA were randomly and uniformly arranged into model group (n = 162) and validation group (n = 160). The clinical factors were not statistically signi cantly different in both groups (all p > 0.05) (Additional le 2: Table S1). In the model group, we accomplished univariate and multivariate COX to assort and construct DRPM with 123 DGIA lncRNAs, and 6 prognostic-related DGIA lncRNAs with corresponding risk coe cients were determined ( Table 1). All patients in TCGA were divided into HRG and LRG on the basis of median of RS (0.851) measured by DRPM in the model group (Additional le 2: Table S2). Validation of the DRPM To con rm the anticipated effects of DRPM, we conducted Kaplan-Meier test to plot a survival curve. The results demonstrated that the survival outcomes of HRG were worse than those in LRG (all p 0.05) (Fig. 3a). The ROC curves plotted for patients in different groups con rmed the consistency and satisfying categorization effects of DRPM, the AUC were shown in the gures respectively (Fig. 3b). Using RS, we organized the patients in different groups and detected changes in expression level of the prognostic DGIA lncRNAs. The heat map presented the increment in the expression levels of 6 lncRNAs in HRG (Fig. 3c).
To verify the independent predictive effects of RS, we combined RS with clinical factors for univariate and multivariate COX analysis. These clinical factors were age, gender, and TNM stage. The results indicated that RS was an independent prognostic factor (Table 2). Besides, to assess the risk clustering ability of DRPM in different strata, we separately strati ed age (< 65 years and ≥ 65 years), gender (male and female), and clinical stage (Stage I-II and Stage III-IV). The survival curves of HRG and LRG were plotted through strati cation of different clinical factors. HRG and LRG exhibited signi cant difference in overall survival up to the strati cation of age, gender, and Stage I-II (all p < 0.05) (Additional le 3: Figure S2). In strati cation of Stage III-IV, the difference was very close to reaching statistical signi cance (p = 0.077) (Additional le 3: Figure S2). In summary, the DRPM revealed a consistent and promising prognostic evaluation ability in different strati cation.

Validation of the prognostic DGIA lncRNAs
To verify the accuracy of prognostic DGIA lncRNAs, we plotted survival curves for these lncRNAs in TCGA samples. In AC004009.1, AP003555.2, BOLA3-AS1, NKILA, LINC00543 and UCA1, the prognosis of the high expression group was worse as compared to the low expression group (all p 0.05) (Fig. 4a). Also, we investigated the correlation between these lncRNAs and stages, and the results indicated that the expression levels of AP003555.2, BOLA3-AS1, NKILA, LINC00543, and UCA1 were signi cantly different between at least two stages (Fig. 4b).
Meanwhile, in the external validation group from GEO, we constructed a model and grouped patients with the four prognostic DGIA lncRNAs, comprising BOLA3-AS1, NKILA, LINC00543, and UCA1. The prognosis of HRG was also worse than that of LRG (p 0.001) (Fig. 5a). The timeROC of 1, 3, and 5 years proved that the model had a promising classi cation effect, and that of 3 years displayed optimum effects (AUC = 0.83) (Fig. 5b).
Immune in ltration and GSEA within different risk groups The above-mentioned enrichment investigation demonstrated that DGIA lncRNAs also in uences immune regulation. Hence, we evaluated the differences in the in ltration of 22 immune cells in HRG and LRG according to the results of CIBERSORT. The expression of CD8 + T cells, out of all TCGA samples and model group, was lower in HRG than in LRG (all p 0.05) (Fig. 6). CD8 + T cells are the cytotoxic immune cells that are capable of directly killing tumor cells, and their abundance differences indicated the immune-related reasons for the prognostic difference in HRG and LRG.
To explore the signi cantly altered MF, BP, and pathways in HRG and LRG, we performed GO-and KEGGrelated GSEA. Mainly immune and genomic instability related pathways in LRG were signi cantly enriched. In GO enrichment terms, immune-related pathways encompassed response to type I interferon (IFN-), natural killer cell activation, T cell activation involved in immune response, etc. Simultaneously, some genomic instability-related pathways were also signi cantly enriched, including structural constituent of ribosome, transcription elongation from RNA polymerase II promoter, response to dsRNA, and some energy-related pathways in glucose metabolism (Fig. 7a, 7b). In KEGG enrichment terms, besides the regulation of autophagy and cytosolic DNA sensing pathway, which are related to genomic instability, there were also immune-related pathways, including antigen processing and presentation, and cytokine receptor interaction enriched (Fig. 7c). Finally, we also noticed that the CNSMs of LRG was signi cantly higher comparing with HRG (all p < 0.05) (Additional le 4: Figure S3).

Discussion
Lately, it has been proclaimed that genomic instability is one of the key prognostic factors for most cancers [26]. Various assays are used to assess the genomic instability by the expression of certain characteristic proteins and mutations of genes [27,28]. Moreover, with the development of gene sequencing technology, the detection of genomic instability has achieved increased resolution [29]. In recent times, researchers have put a great effort to identify PCGs and microRNAs and to nd biomarkers related to genomic instability and prognosis [30,31]. Simultaneously, the intensive study on lncRNAs also makes researchers aware of their role in genomic stability. Although many works have been done by scientists, identi cation of DGIA lncRNAs in LCCs and RCCs and their relationship with immunity are still rare. Therefore, we explored the in uence of genomic instability in LCCs and RCCs, as well as the key prognostic DGIA lncRNAs.
We identi ed 123 DGIA lncRNAs from the DEGs of LCCs and RCCs, and the functional analysis on their co-expressed PCGs surprisingly a rmed that these DGIA lncRNAs potentially in uenced the genomic instability and immune functions through PCGs. In molecular functions, the accumulation of errors during the transcription of DNA under the action of RNA polymerase is the source of genomic instability in all organisms [32]. Additionally, phospholipase C participates in numerous physiological processes within the cell, especially signal transduction pathways that regulates cell functions and proliferation [33,34]. These processes also involve genomic mutations and even leads to cancers [35]. Other enriched pathways are mainly related to the positive activation and differentiation of T cells. Thus, we suspected that genomic instability could potentially cause differences in prognosis and immunity of LCCs and RCCs.
Moreover, we investigated whether DGIA lncRNAs can identify differences in immunity while predicting clinical outcomes. Upon identi cation of DRPM containing 6 key lncRNAs, we successfully divided the patients into HRG and LRG with poor prognosis. From the study of differences in immune in ltration and the GSEA between HRG and LRG, it has been concluded that some pathways related to genetic instability in the LRG are signi cantly enriched, including regulation of autophagy and glucose metabolism-related pathways (Fig. 7c). Genomic instability could cause the production of a large number of misfolded proteins, and autophagy participate in the degradation of ubiquitinated and misfolded proteins [36]. Autophagy has also been reported to be involved in the regulation of number of centrosomes during cell division to maintain genomic instability [37]. Besides, some pathways are associated with both genetic instability and immunity, such as the cytosolic DNA sensing pathway,response to dsRNA RIG-like receptor signaling pathway and Toll-like receptor signaling pathway (Fig. 7a, 7c). In somatic cells, cytoplasmic DNA sensors, after identifying the double-strand DNA (dsDNA), activate the cytosolic DNA sensing pathway and innate immune responses [38]. These dsDNA may be endogenous and are the yields of inaccurate replication of mitochondrial DNA (mtDNA) or micronuclear DNA [39,40]. DsDNA can be transformed into double-strand RNA (dsRNA) by the action of RNA polymerase III for recognition by the RNA sensor RIG-I [41,42]. DsRNA can also be recognized by Toll-like receptors to induce in ammatory cytokines and IFN- [39,43]. Some pieces of research disclosed that dsRNA accumulates in the mitochondria as a result of gene deletion during transcription of mtDNA, which promote the production of IFN-(eliciting innate immune response) after being recognized [40]. These shreds of evidence and our enrichment results explain the potential mechanism of immune activation by genomic instability.
The immune-related pathways are also signi cantly enriched and CD8 + T cell in ltration is higher in LRG.
Antigen processing and presentation, response to IFN-, and cytokine related pathways also provide the basis for the immune system activation processes (Fig. 7b). CNSM in LRG is signi cantly higher than that in HRG (Additional le 4: Figure S3), which indicates that the degree of genetic instability is higher.
Mardis suggested that genomic instability could predict immunotherapy response more accurately.
Various forms of genomic instability in cancers produce new antigens and immune-responsive phenotypes eventually [14]. In the analysis of LCCs and RCCs, we have con rmed that genomic instability does affect the immune response and the prognosis through a series of potential mechanisms.
Main limitations are as follows: rstly, while using GEO data for the veri cation of key lncRNAs predictive effects, two key lncRNAs were missed. Although the classi cation effect was propitious, the veri cation was not su cient enough. In this regard, we examined the prognosis and clinical characteristics of all key lncRNAs to support the evidence. Secondly, we de ned a mutator hypothesis-derived calculation method to evaluate genomic instability. In depth study is required to corroborate the functionality and signi cance of this method.

Conclusion
We constructed DRPM based on the DGIA lncRNAs of LCCs and RCCs. While using DRPM to predict the prognosis, we also deeply investigated the in uence and mechanism of genetic instability on immunity.
Meanwhile, 6 key DGIA lncRNAs were identi ed. They can not only predict the prognostic risk of patients but also serve as biomarkers for evaluating the differences of genetic instability and immune in ltration.
The ndings our research can provide some basis for identifying the bene ts of immunotherapy through genetic instability.