Identification and Validation of lncRNA-Related Prognostic Signatures in Cholangiocarcinoma


 Background and Aims: Cholangiocarcinoma (CCA), the second most common hepatobiliary cancer, is associated with poor prognosis. Therefore, there is a need to elucidate on the pathogenic mechanisms of CCA. In this study, we aimed at identifying lncRNA-related prognostic signatures for CCA through bioinformatics analysis and further validated their functions in CCA tumorigenesis and progression.Methods: The RNA-seq data of CCA were downloaded from public databases. Differentially expressed lncRNAs (DElncRNAs) were screened using R packages. Then, candidate OS- and DFS-related DElncRNAs were selected through Kaplan–Meier survival analysis. Furthermore, LASSO regression analyses were performed to establish two prognostic signatures, termed the OS and DFS signatures, respectively. Multivariate COX models and nomograms for overall survival (OS) and disease-free survival (DFS) were established based on OS/DFS signature and clinical data. Hub lncRNAs were identified and enrichment analyses performed to explore their potential functions. Finally, in vitro and in vivo models were used to validate the effects of the hub lncRNAs in CCA tumorigenesis and progression.Results: A total of 925 DElncRNAs were selected, from which six candidate OS-related lncRNAs and 15 candidate DFS-related lncRNAs were identified. The OS and DFS signatures were then established using four lncRNAs, respectively. We found that the OS signature and vascular invasion were significant independent risk factors for OS outcomes of CCA, while the DFS signature, vascular invasion and CA19-9 were significant independent risk factors for DFS outcomes of CCA. MiR4435-2HG and GAPLINC were selected as hub lncRNAs because they were included in both OS and DFS signatures. GO and KEGG enrichment analyses revealed that the two hub lncRNAs were involved in CCA tumorigenesis and progression. Finally, we constructed in vitro and in vivo models and revealed that the lncRNAs, MiR4435-2HG and GAPLINC can prompt CCA proliferation and migration in vitro and in vivo.Conclusions: The established OS and DFS signatures, which were based on DElncRNAs, are independent risk factors for OS and DFS of CCA patients, respectively. MIR4435-2HG and GAPLINC were identified as hub lncRNAs. In vitro and in vivo models revealed that MiR4435-2HG and GAPLINC can prompt CCA progression, which might be novel prognostic biomarkers and therapeutic targets for CCA.

Long non-coding RNAs (lncRNAs) are special RNA molecules without protein-coding potential and are longer than 200 nucleotides [4][5][6] . It has been reported that they are important in tumorigenesis and progression [4,[7][8][9][10][11][12][13] . For example, lncRNA MiR4435-2HG overexpression has been associated with poor prognosis in colorectal cancer (CRC) [14] . Furthermore, MiR4435-2HG is overexpressed and markedly associated with ovarian cancer and clear cell renal cell carcinoma progression [15,16] . Besides, the expression of lncRNA GAPLINC has been shown to promote gastric cancer proliferation and is closely correlated with poor prognosis for gastric cancer patients [17,18] . Studies have determined that MiR4435-2HG and GAPLINC are involved in the initiation and progression of various tumors. However, their biological roles in cholangiocarcinoma have not been clearly established.
In this study, different lncRNA expression patterns were examined among appropriately selected HCC cases to identify candidate lncRNA biomarkers based on The Cancer Genome Atlas data (TCGA) and the Gene Expression Omnibus (GEO). The least absolute shrinkage and selection operator (LASSO) algorithm was used in determining key lncRNAs; thereafter, an CCA risk score system was also constructed and the lncRNA signature was validated. Finally, the roles of the target gene were validated in vitro and in vivo. Our work yielded a signature based on lncRNA expression that can accurately predict CCA prognosis through integrated analysis of genomic data.

Data collection and preprocessing
The collected CCA RNA-seq data and corresponding clinical data included the TCGA-CHOL dataset which was downloaded from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/) and the GSE107943 dataset which was downloaded from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/). The TCGA-CHOL dataset included 36 tumor and 9 non-tumor tissue samples while the GSE107943 dataset included 30 tumor and 27 non-tumor tissue samples. Batch effects from the two databases were corrected by adjusting expression values using the ComBat function (sva R package [19] ). Furthermore, principal component analysis (PCA) was performed to visualize the RNA-seq data before and after correcting the batch effect. Relative clinical data were downloaded and combined as well, and missing values were imputed through mice R package (https://CRAN.Rproject.org/package=mice). Tumor samples with OS < 1 month or without the corresponding clinical data were removed, 61 tumor and 36 non-tumor samples were included in the bioinformatics analysis.
Differentially expressed lncRNAs DESeq2 [20] and edgeR [21] R packages were used to determine differential expression levels of lncRNAs (DElncRNA) between tumor and non-tumor tissues in TCGA-CHOL or GSE107943 datasets. The DElncRNAs criteria was set as FDR < 0.05 and |Log 2 fold change| > 1.3. Then, four DElncRNA lists, including GSE107943_DESeq2 list, GSE107943_edgeR list, TCGA-CHOL_DESeq2 list, and TCGA-CHOL_edgeR list were obtained after our analyses. To ensure plausibility, DElncRNA within at least three DElncRNA lists were selected to be "real" DElncRNA.
Construction of lncRNA related OS/DFS signature Based on the median TPM value, the obtained DElncRNAs were divided into two groups. The CCA DFSrelated or OS-related lncRNAs were selected using univariate Cox regression analysis (survival R package: https://CRAN.R-project.org/package=survival). Kaplan-Meier analysis was used to calculate survival differences in DFS or OS between high-and low-expression groups. LASSO regression analysis was used to further compress the number of CCA DFS-related or OS-related lncRNAs, and the most signi cant prognostic lncRNAs were left to comprise OS or DFS signatures, which was calculated based on the nal LASSO model as follows: OS/DFS signature = Whereby; n represents the number of lncRNAs included in the OS or DFS signature, Expi represents the TPM value of each lncRNA and Coei represents the coe cient generated by LASSO regression. Furthermore, the Receiver Operating Characteristic (ROC) curve and the area under the ROC curve (AUC) were calculated to evaluate the predictive ability of the OS/DFS signature to CCA patients' death or recurrence. Moreover, patients were allocated into high-or low-risk groups based on their OS/DFS signatures, and the most appropriate cut-off was determined using the survminer R package (https://CRAN.R-project.org/package=survminer). Kaplan-Meier analysis was used to calculate survival differences in DFS or OS between high-risk and low-risk groups. Finally, lncRNAs that were included in both OS or DFS signatures were selected as hub lncRNAs.

Construction of multivariate COX models and OS / DFS nomogram
To establish the prognostic values of the OS/DFS signatures, multivariate survival analysis was performed using the COX proportional hazards regression model based on OS/DFS signatures and other clinical characteristics. Then, the nomogram was established based on the result of multivariate survival analysis. Concordance index was used to validate model discrimination of the established nomogram.

GO and KEGG pathway enrichment analyses
To establish the biological functions of our hub lncRNAs, relevant mRNAs were selected through Pearson correlation (|r| > 0.5, p < 0.05 ). Furthermore, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed to investigate their functional and pathway annotation. p≤ 0.05 was considered statistically signi cant.
Cell culture and transfection HCCC-9810 and RBE cells were purchased from Cell Bank of Chinese Academy of Sciences. The cells were cultured in 1640 medium supplemented with 10% fetal bovine serum (FBS) and incubated at 37°C in a 5% CO 2 atmosphere. When cells grew to a density of 80%, they were seeded in 6-well plates. Plasmids containing Sh-RNA molecules directed against target lncRNA or scrambled Sh-RNA were transfected into cells using the Lipofectamine 3000 reagent (Invitrogen, Carlsbad, CA, USA), according to the manufacturer's instructions. For gene knockdown, the psi-LVRU6GP vectors containing shRNAs targeting MIR4435-2HG from GeneCopoeia™ (China, Shanghai) were used. Then, 50-nM vector with shRNA or negative control was transfected into cells. The two cell lines were characterized by Genetic Testing Biotechnology Corporation (Suzhou, China) using short tandem repeat (STR) markers.

Cell clone formation experiment
Logarithmically growing HCCC-9810 and RBE cells were seeded in 6-well plates, incubated, and the medium changed after every 24 h. Incubation was stopped after cell clones could be seen using the naked eye. Paraformaldehyde was added to x the clones, after which they were dried in air after taking images.

Cell proliferation assay
The CCA cells (5000 cells per well) were seeded in 96 well plates overnight. At 0, 24, 48, 72, and 96 h posttransfection, 10 μL Cell Counting Kit-8 solution (Dojindo, Shanghai) was added into each well, followed by incubation for 2 h at 37 °C. Then, absorbance was determined at a wavelength of 450 nm to measure cell proliferation.

Cell invasion and migration assay
The transwell chamber (Millipore) with 50 μL of Matrigel (0.2 μg/μL) was used to perform the invasion assay. Cells were added to the upper chamber at a concentration of 3×10 4 /mL with serum-free medium, while about 500 μL of the medium containing 10% FBS was added to the lower chamber. Then, the transwell chamber was incubated at 37°C for 48 h after which cells were stained using crystal violet for 15 min and counted. The same procedures were repeated to evaluate cell migration ability, except that the trans-chamber was not coated with Matrigel.

Xenograft model in athymic mice
Five-week-old male BALB/c nu/nu mice were obtained from Shanghai Experimental Animal Center, Chinese Academy of Science. Mice were maintained under a pathogen-free condition and treated in accordance with the institutional animal welfare guidelines of the Chinese Academy of Science. For the assay to assess tumorigenicity, suspended HCCC-9810 cells (5 × 10 6 ) were subcutaneously injected into each nude mice following the AAALAC criterion. Tumor volumes (cm 3 ) were recorded after every week from the 7 th day. Finally, at the end of 4 weeks,the mice were sacri ced with 120 μl 10% chloral hydrate (Sinopharm Chemical Reagent Co., Ltd., Shanghai, China) at the experimental endpoint, and the tumours were harvested and weighed.

Statistical analysis
Normally distributed continuous data were presented as the mean ± standard deviation (SD) and compared using the student's t test. Non-normally distributed continuous data were presented as medians and interquartile range (IQR) and compared using the Mann-Whitney U test. Categorical variables were presented as percentages and compared using the χ2 test. Statistical tests were performed using R (Version 3.6.3). p ≤ 0.05 was considered statistically signi cant.
Clinic-pathologic characteristics are presented in Tables 1 and S4. The abundance of male, Child-Plugh B or vascular invasion positive CCA patients in the GSE107943 dataset were signi cantly high than those in the TCGA-CHOL dataset (p < 0.05). There were no signi cant differences between the two datasets in other characteristics, including age, AJCC Stage, CA19-9, recurrence, and death (p > 0.05). Volcano plots and heatmaps of DElncRNAs are presented in Fig. 1A-D. Then, a venn diagram was used to reveal the overlapped DElncRNAs in the two datasets. To ensure reliability, DElncRNAs that appeared at least three times were considered real DElncRNAs. Finally, a total of 925 DElncRNAs were identi ed and used in the downstream analysis ( Fig. 1E-F).

Construction of a DFS signature with four lncRNAs
The RNA-seq data from GSE107943 and TCGA-CHOL datasets were combined and the batch effect corrected using the sva package in R. PCA was performed to visualize sample distribution before and after correcting the batch effect. As shown in Figure S1 A, there was a signi cant batch effect before correction, which was reduced after our correction (Figure S1 B).
To determine whether the 925 DElncRNAs are associated with DFS of CCA patients, we performed univariate COX analysis and ltered out 15 DFS-related DElncRNAs. Then, Kaplan-Meier analysis was performed to validate the in uence of each lncRNA on CCA patients' DFS. Figure S2 shows that all 15 DFS-related DElncRNAs were associated with DFS of CCA patients (p < 0.05). The most signi cant DFSrelated DElncRNAs were selected through LASSO regression analysis ( Figure S1 C-D), and 3-fold crossvalidation performed. Four DElncRNAs were identi ed, including HRAT92, PTGES2-AS1, GAPLINC and MIR4435-2HG. Then, the DFS signature was calculated as follows: The DFS signatures for all patients were calculated and arranged in an ascending order ( Fig. 2A). Then, the CCA patients were allocated into low-risk and high-risk groups using a cut-off value of 0.21579145.
Besides, a high mortality rate and worse DFS were found in the high-risk group (Fig. 2B). Expression levels of the four lncRNAs are shown in Fig. 2C. As the DFS signature increased, expression levels of HRAT92 decreased, while those of GAPLINC, MIR4435-2HG, and PTGES2-AS1 increased. Furthermore, the ROC curve was used to evaluate the predictive value of the DFS signature for CCA recurrence (Fig. 2D).
The DFS signature was found to have a strong predictive value for CCA recurrence ( 1-, 2-, 3-year recurrence AUC 0.805, 0.854, 0.822 ). Finally, Kaplan-Meier analysis revealed that the high-risk group had worse DFS outcomes than the low-risk group (p < 0.0001, Fig. 2E). These ndings imply that the DFS signature, based on the four lncRNAs is a signi cant prognostic factor for DFS and has a strong predictive value for CCA recurrence in patients.

Construction of an OS signature with four lncRNAs
Univariate COX regression ( Figure S3 All CCA patients were allocated into high-and low-risk groups based on the OS signature, using a cutoff value of 0.106651484. The OS signature and survival time for all CCA patients are in Fig. 3A-B. The four lncRNAs included in the OS signature were highly expressed in the high-risk group, implying that they were associated with worse OS outcomes for CCA patients (Fig. 3C). Similarly, the predictive value of the OS signature for CCA patients' death was also strong (1-, 2-, 3-year death AUC 0.827, 0.773, 0.838).
Finally, Kaplan-Meier analysis revealed that the high-risk group had worse OS outcomes than the low-risk group (p = 0.0003, Fig. 3E)

Establishment of nomogram for predicting CCA patients' DFS and OS outcomes
To elucidate on the prognostic value of the DFS/OS signature, univariate and multivariate COX proportional hazards regression analyses were sequentially performed (Fig. 4A, Fig. 5A). Multivariate COX regression analysis revealed that the DFS/OS signature and vascular invasion were independent risk factors for CCA patients' DFS and OS outcomes. In addition, CA19-9 was also found to be an independent risk factor for DFS. Then, based on the results of multivariate COX regressions analyses, we established nomograms to predict CCA patients' 1-, 2-, and 3-year recurrence-free or survival (Fig. 4B, Fig. 5B). The Cindices of our nomograms for DFS and OS prediction were 0.87 and 0.77, respectively, implying good predictive values (Fig. 4C, Fig. 5C). Therefore, the DFS and OS nomograms exhibited a high prognostic value.

Identi cation of prognostic hub lncRNAs and enrichment analysis
We established the DFS/OS signature using DElncRNAs. Since MIR4435-2HG and GAPLINC were both included in the DFS/OS signature, they were selected as our prognostic hub lncRNAs. Then, the relevant mRNAs were identi ed (Table S5). Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed to evaluate their potential biological functions. It was found that MIR4435-2HG was associated with focal adhesion, Hippo signaling pathway, PI3K-Akt signaling pathway, and TGF − beta signaling pathway among others, implying that MIR4435-2HG is highly involved in tumorigenesis and progression ( Figure S4A-B, Table S5, S6). Moreover, GAPLINC was associated with histone acetylation and methylation, protein methylation, and chromatin binding among others, implying that GAPLINC may be associated with epigenetic modi cations of chromosomes ( Figure S4C-D, Table S5, S6). Our enrichment results revealed that MIR4435-2HG and GAPLINC are involved in various biological processes, which are highly correlated with tumorigenesis and progression.

The hub lncRNAs enhanced CCA proliferation, invasion and migration abilities in vitro
To elucidate on the effects of the two hub lncRNAs in CCA tumorigenesis and progression, we performed loss-of-function assays in vitro. Transwell assays were performed to identify their effects on CCA cell invasion and migration while the CCK8 assays and colony formation assays were used to verify their effects on CCA cell proliferation. We found that MIR4435-2HG or GAPLINC knockdown signi cantly inhibited cell proliferation, invasion and migration ( Fig. 6A-D, Fig. 7A-D). These ndings imply that MIR4435-2HG and GAPLINC are crucial genes in CCA progression.

Discussion
CCA, the second most common primary hepatobiliary cancer, originates from the bile duct epithelium [22][23][24][25] . Due to a lack of a speci c tumor marker, diagnosis of CCA at an early stage is di cult [26][27][28] . In addition to its insidious onset, CCA is also characterized by high malignancy and early recurrence [23][24][25]28] . Studies have reported that lncRNAs are crucial regulators of diverse biological processes of mRNA and play essential roles in tumorigenesis [29][30][31][32] . Therefore, in this study, we aimed at identifying prognostic-related DElncRNAs in CCA and validated their effects on CCA tumorigenesis and progression.
It has been documented that lncRNAs can be used as prognostic biomarkers for various malignancies [33][34][35] . The occurrence and development of CCA is closely associated with some tumor-related lncRNAs, including PVT1 [36] , H19 [37] , NEAT1 [38] and HATOIR [39] . These tumor-related lncRNAs have been shown to affect CCA proliferation, apoptosis, invasion, metastasis and EMT. However, identi cation and validation of hub lncRNAs associated with CCA prognosis, based on high-throughput transcriptome sequencing has not been performed.
In this study, we combined TCGA-CHOL and GSE107943 datasets and screened a total of 925 DElncRNAs. Then, the DElncRNA related DFS/OS signature was established. The established DFS/OS signature was highly correlated with CCA patients' DFS/OS. Combining clinical characteristics and the DFS/OS signature, multivariate Cox regression analysis was performed, and nomograms were established based on the results of multivariate Cox regression models. We found that the DFS/OS signature was an independent risk factor for CCA patients' DFS/OS. MIR4435-2HG and GAPLINC were selected as our hub lncRNAs, as they were included in both DFS and OS signatures.
MIR4435-2HG has been shown to play important roles in the progression of diverse malignancies [14,[40][41][42][43][44][45] . It has been reported that MIR4435-2HG is overexpressed in non-small cell lung carcinoma (NSCLC), and its knockdown inhibited NSCLC cell proliferation and migration abilities [42] . In lung cancer, MIR4435-2HG was found to be upregulated, and that it enhanced lung cancer cell proliferation and invasion [41] . In colorectal cancer (CRC), MIR4435-2HG was associated with poor prognosis of CRC patients, and it promoted CRC growth and metastasis through the miR-206/YAP1 axis [44] . Furthermore, Xuanlin Shen et al. reported that MIR4435-2HG is overexpressed in hepatocellular carcinoma (HCC) tissues and is associated with the poor prognosis of HCC patients. Further studies determined that MIR4435-2HG can enhance HCC cell proliferation and invasion in vitro [45] . We found that overexpression of MIR4435-2HG was associated with worse prognostic outcomes for CCA patients and that it may be involved in multiple processes of tumorigenesis. Furthermore, MIR4435-2HG was found to act as an oncogene in CCA, which enhances CCA proliferation, invasion and migration.
GAPLINC was initially reported in gastric cancer, and is therefore referred to as, gastric adenocarcinoma predictive long intergenic non-coding RNA (GAPLINC) [18] . It has been reported that GAPLINC is a negative regulator of in ammation in mice [46] . A study on human umbilical vein endothelial cells (HUVECs) revealed that GAPLINC can increase HUVECs angiogenesis under hypoxia conditions, implying that GAPLINC may play a critical role in tumor angiogenesis [47] . It has also been reported that GAPLINC regulates rheumatoid arthritis broblast-like synoviocytes (RA-FLSs) proliferation, invasion and migration, and promotes RA-FLSs tumor-like behaviors [48] . In gastric cancer, GAPLINC was shown to promote cell proliferation and migration by regulating CD44 expression [18] . Pro-oncogenic effects of GAPLINC in other cancers, including CRC [49] , NSCLC [50] and bladder cancer [51] have also been reported. We identi ed GAPLINC as our hub lncRNA, and it was determined that GAPLINC is a risk factor for CCA and a predictor for poor prognosis. Furthermore, enrichment analysis revealed that GAPLINC may be involved in epigenetic modi cations of chromosomes in CCA. Loss-and gain-function assays determined that GAPLINC promoted CCA proliferation and migration in vitro and in vivo.
Compared to existing studies on lncRNA-based prognostic signatures in other tumors [4,52,53] , our study has some strengths that should be noted. First, to ensure a su cient sample size for our bioinformatics analysis, we combined two independent datasets. Second, necessary preprocessing was performed to remove batch effects to ensure reliability of our data as the systematic bias caused by the batch effect was devastating [54,55] . Third, essential experimental validations were indispensable. Loss-and gainfunction assays in vitro and in vivo were performed to validate the potential functions of our hub lncRNAs. However, this study also has some limitations. First, although we identi ed hub lncRNAs and validated their effects in tumorigenesis and progression, the speci c signaling pathways regulated by the hub lncRNAs were not elucidated, therefore, more experiments are needed for further exploration. In addition, our research was at the stage of experimental exploration, there is a long way to go before true clinical translation for cancer treatment.

Conclusions
The established OS and DFS signatures, which were based on DElncRNAs, are independent risk factors for OS and DFS of CCA patients, respectively. MIR4435-2HG and GAPLINC were identi ed as hub lncRNAs. In vitro and in vivo models revealed that MiR4435-2HG and GAPLINC can prompt CCA progression, which might be novel prognostic biomarkers and therapeutic targets for CCA.  Table 1 and Table 2).