A multi-lncRNA signature for suvival prediction in esophageal squamous cell carcinoma (ESCC) patients


 Background: The aim of this study was to identify prognostic long non-coding RNAs (lncRNAs) and develop a multi-lncRNA signature for suvival prediction in esophageal squamous cell carcinoma (ESCC) patients.Methods: The clinical and gene expression data from 301 ESCC patients were downloaded, including a corhort used as training set from Gene Expression Omnibus database (GSE53624, n=119), another cohort of 98 paired ESCC tumor and normal tissues as test set and an independent validation cohort including 84 ESCC tissues. Survival analyses, Cox regression and Kaplan–Meier analysis were used.Results: we screened a prognostic marker of ESCC from the GSE53624 dataset and named it as the five-lncRNA signature including AC007179.1, MORF4L2-AS1, RP11-488I20.9, RP13-30A9.2, RP4-735C1.6, which could classify patients into high- and low-risk groups with significantly different survival(median survival: 1.75 years vs. 4.01 years, log rank P<0.05). Then test dataset and validation dataset confirmed that the five-lncRNA signature can determine the prognosis of ESCC patients. Predictive independence of the prognostic marker was proved by multivariable Cox regression analyses in the three datasets (P<0.05). In addition, the signature was found to be better than TNM stage in terms of prognosis.Conclusion: The five-lncRNA signature could be a good prognostic biomarker for ESCC patients and has important clinical value.


Background
Esophageal carcinoma is a malignant tumor of esophageal mucosa epithelium or gland. Globally, esophageal cancer ranks seventh in morbidity and sixth mortality among all cancers [1] . China is one of the countries with the highest incidence of esophageal cancer, accounting for approximately half of all esophageal cancer cases [2] . There were 477,900 new cases diagnosed in China in 2015 [3] . Among these Chinese esophageal cancer patients, esophageal squamous cell carcinoma (ESCC) patients account for nearly 90% [2] . In terms of mortality, esophageal cancer deaths in China account for half of global esophageal cancer deaths [4] . Therefore, ESCC is a major public health challenge in the world, especially in China. Using Chinese ESCC patients as research samples is of great significance in revealing the pathogenesis and prognostic factors of ESCC.
ESCC is an extremely aggressive malignancy. Along with the clinical symptoms of early esophageal squamous cell carcinoma are not obvious and can be easily ignored by patients, most ESCC patients are diagnosed in advanced stages. As a result, ESCC patients have a extremely poor survival. The 5-year survival rate of ESCC patients in some less developed areas is less than 10% [5] . In the United States, where medical conditions are more advanced, the 5-year survival rate is 18% [6] . In China, the 5-year relative survival of postoperative ESCC patients is 8%-41% [7,8] . Despite recent advances in diagnostic and therapeutic approaches, the life expectancy of ESCC patients have not improved significantly [9][10][11] . Therefore, identification of a sensitive and effective prognostic marker of ESCC is in urgent need to evaluate disease progression and patient's overall outcome.
Next generation sequencing (NGS) and bioinformatics analysis tools provides great help for humans to understand tumors [12][13][14][15] . In the process of exploring the pathogenesis and prognostic factors of tumors by NGS and bioinformatics analysis, long non-coding RNAs (lncRNAs), defined as transcripts longer than 200 bp, have been discovered to play an important role. For example, a HOX transcript antisense RNA, lncRNA HOTAIR, was found to correlate with poor prognosis of lung cancer and promote tumor progression [16] . LncRNA REG1CP was shown to promote the development and progression of tumor [17] . For ESCC, several lncRNAs have been detected and associated with shorter survival, such as lncRNA H19 [18] , lncRNA CASC9 [19] and lncRNA LUCAT1 [20] . Subsequently, prognostic multi-lncRNA signatures with high potential clinical application significance were constructed based on lncRNA expression profile data from public database such as Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) database. For instance, a three-lncRNA signature (ENST00000435885.1, XLOC_013014, ENST00000547963.1) was found that it can classify the ESCC patients into two groups with significantly different overall survival. [21] A seven-lncRNA (RP5-1172N10.2, RP11-579D7.4, RP11-89N17.4, LA16c-325D7.2, RP1-251M9.2, RP11-259O2.2, LINC00173) signature can predict overall survival of ESCC patients [22] . Although prognostic lncRNA signatures have been identified for ESCC, there are few studies that can verify the effectiveness of the predictive model in independent experimental data set and confirm the presence of the prognostic lncRNA in ESCC tissues.
Here, we used our lncRNA expression data tested by RNA sequencing and followed up the clinical 5-year-survival information of 98 ESCC patients, then, combined it with 119 ESCC public expression profiles from GEO and another independent 84 ESCC tissues for qPCR validation to construct a clinically valuable lncRNA signature that can accurately predict survival in ESCC patients.

Sample collection and preparation
LncRNA expression profile and corresponding clinical data of 119 ESCC cases (GSE53624) were obtained from the publicly available GEO database (https://www.ncbi.nlm.nih.gov/geo/). As Anyang is a high-risk area of ESCC in China, we collected 98 postoperative ESCC tissues and paired non-tumor tissues from Anyang Tumor Hospital during 2014-2019, and organized relevant patient clinical information. Then we examined the PCG and lncRNA expression profile of the 98 pair ESCC tissues by next-generation sequencing (NGS, Hereinafter referred to as RNA-seq dataset). In addition, we collected an independent validation cohort including 84 postoperative ESCC patients from the same hospital and detected the lncRNA expression level using the qRT-PCR. The patients were coded to protect anonymity and all pathological information of the ESCC samples in this study was shown in Table 1. Tumor-node-metast asis (TNM) classification of the International Union against Cancer, 7th edition was used to categorize. Documentation of informed consent was obtained through the institutional review board. The study was approved by the Anyang Tumor Hospital Ethical Committee.

RNA isolation and next generation RNA sequencing analysis
After TRIZOL lysis and purification, total RNA was isolated by the miRNeasy Mini Kit (QIAGEN) with DNase digestion step. A total amount of 5 ug RNA per sample was used as input material for the RNA sample preparations. Firstly, ribosomal RNA was removed by Epicentre Ribozero TM rRNA Removal Kit (Epicentre, USA), and rRNA free residue was cleaned up by ethanol precipitation. The sequencing libraries were generated by NEBNext Ultra TM Directional RNA Library Prep Kit for Illumina (NEB, USA) following manufacturer's recommendations and sequenced on an Illumina Hiseq platform. The 150 bp paired-end reads were generated. Normalized fragments per kilobase per million mapped reads (FPKM) by cufflinks was used to estimate the gene expression values.

Construction of multi-lncRNA prognostic signature
GSE53624 and RNA-seq dataset were analyzed respectively by Cox analysis and Kaplan-Meier analysis to identify lncRNA that associated with OS of ESCC patients. We selected the prognostic lncRNA using Cox p & log rank P <0.05 in two datasets. Model was estimated as follows [23,24,26] : Risk Score (RS) =∑ N i=1 (ExpLncRNA i * CoefCox i ), where N is prognostic lncRNA number, ExpLncRNA i represents the lncRNA expression value, and CoefCox i is the Cox regression coefficient of lncRNA. We plotted ROC curves [14] and calculated their area under the curve (AUC) values to screen out the prognostic signature with largest AUC value in the GSE53624 set [26] . R program (Version 3.5.1) was used to perform the above analyses, including packages called pROC, TimeROC and survival from Bio-conductor (http://www.bioconductor.org/).Gene set enrichment analysis (GSEA) were used to find the dysregulated pathways in the low-and high-risk groups. Normalized enrichment score (NES) and P value were calculated to verify the statistical difference for GSEA analysis [27] .

Screening prognostic lncRNAs in ESCC
We used 119 ESCC data from public GSE53624 as training group, 98 RNA-seq data as test group and 84 qPCR data as validation group. Detailed clinicopathological characteristics of all these ESCC patients were showed in Table 1. There were a total of 217 samples with lncRNA expression data including GSE53624 dataset and test dataset. From the two datasets, we found a total of 6253 lncRNAs were expressed in ESCC tissue. We then performed univariate cox analysis and Kaplan-Meier analysis to analyze the relationship between the lncRNA expression and ESCC OS. Based on the two ESCC profiling datasets and the corresponding clinical follow-up information, univariate cox analyses identifed 368 lncRNAs in the GSE53624 and 290 lncRNAs in the RNA-seq dataset which were significantly associated with ESCC OS (Cox P < 0.05) and Kaplan-Meier analyses identifed 386 survival associated lncRNAs in the GSE53624 and 312 survival associated lncRNAs in the RNA-seq dataset Kaplan-Meier (log rank P < 0.05). Through comparing the above four groups of survival related lncRNA data, we found 19 identical lncRNAs significantly correlated with OS in two datasets (COX P<0.05, log rank P <0.05) ( Figure 1A).Subsequently, we found that 10 of the 19 prognostic-related lncRNAs showed a consistent risk trend in the four groups data,displaying a risk or protective role in ESCC(Supplementary table S1).

Predictive power of the lncRNA signature for patients with ESCC
After calculated risk scores of ESCC patients in the GSE53624 dataset, the median risk classified ESCC patients into high-and low-risk groups (n = 59/60). Kaplan-Meier analysis found the survival of ESCC patients in the lowrisk group were significantly longer than those in the high-risk group (median survival: 1.75 years vs. 4.01 years, log-rank test P<0.001; Figure 2A). The 3-year survival rate for patients in the high-risk group was only 23.73%, while that of patients in the low-risk group reached as high as 66.67%.
Then we evaluated the predictive ability of the five-lncRNA signature in ESCC RNA-seq dataset (n=98). The median risk score of patients in the test dataset was calculated and used to divide patients into high-and lowrisk groups. Figure 2B showed the Kaplan-Meier analysis result. The survival rate of ESCC patients in the low-risk group was significantly higher than patients in the high-risk group. (Median survival: 2.44 years vs. 3.97 years, log-rank test P=0.014; Figure 2B). The 3-year survival rate for the patients of high-risk group was only 40.82%, while that of the low-risk group was 61.22%.
In addition, we performed qPCR experiment with the primers (Supplementary table S3) in 84 ESCC tissues and obtained the five lncRNAs expression level. Through calculated risk scores and used the median risk score as the cutoff value, we obtained the high risk score group and low risk score group. Kaplan-Meier analysis discovered the survival difference between two risk groups. The 3-year survival rate of ESCC patients with low-risk scores was still significantly higher than patients with high-risk scores. (Median survival: 2.82 years vs. 4.08 years, logrank test P=0.011; Figure 2C). Figure 3 displayed the survival time, risk score and lncRNA expression level of each patients in training, test and qPCR validation datasets. As shown, the higher expression of AC007179.1, MORF4L2-AS1 and RP4-735C1.6, the higher the risk and the more deaths. On the contrary, patients with lower-risk scores tended to have higher expression level of protective lncRNAs (RP13-30A9.2, RP11-488I20.9) and have a lower mortality rate .

The five-lncRNA signature can predict the survival of ESCC patients independently
The Chi-square test found lncRNA signature was not associated with clinicopathological factors, including age, sex and pTNM stage in the training, test and qPCR validation groups (Table 3). Univariate cox analysis of the training, test and independent validation datasets showed TNM staging was associated with prognosis. The results of multivariable Cox regression analyses showed only the lncRNA signature was an independent prognostic factor for ESCC patients (High-risk group vs. Low-risk group, HR training = 3.50, 95% CI 2.13-5.76, P < 0.001, n=119; HR test = 1.03, 95% CI 1.00-1.06, P = 0.014, n=98; HR independent = 2.95, 95% CI 1.21-7.17, P = 0.017, n=84, Table 4).

The prediction performance of the five-lncRNA signature and TNM stage
TNM stage is currently used as the main indicator for prognosis evaluation of ESCC. Our study confirmed that TNM staging and the five-lncRNA signature were related to the prognosis of patients. Therefore, we performed stratification analysis to explore the predictive power of the five-lncRNA signature in patients with known TNM stage. We first divided all the ESCC patients involved in this study (n=301) into two groups, one group was patients with TNM low (I+II) stage and the other group was patients with high TNM stage (III+IV). Then we used the lncRNA signature to conduct stratification analyses of above two groups of patients. Kaplan-Meier results showed that the lncRNA signature could further seperated the patients with TNM low (I+II) stage ( Figure 4A) or patients with high TNM stage (III+IV)( Figure 4B) into two subgroups with significantly different survival(log-rank test P<0.001).
To analyze the five-lncRNA signature's predictive performance advantages, TimeROC analyses were performed in the three datasets (n=301). The AUCs of the signature were 0.690/0.648/0.639 at 5/3/1 years ( Figure 4C), while the AUCs of TNM were 0.643/0.654/0.637 at 5/3/1 years ( Figure 4D), suggesting the five-lncRNA signature was superior to TNM stage in terms of ESCC prognosis evaluation, especially at 5 years survival. Moreover, we found AUC values were maximized when signature and TNM staging were combined to assess patient prognosis (0.742/0.723/0.702 at 5/3/1 years, Figure 4E), indicating the lncRNA signature improve the accuracy of TNM in evaluating prognosis.

Functional prediction of five prognostic lncRNAs
In order to predict function of the five prognostic lncRNAs, we performed GSEA analyses based on high-risk and low-risk groups in the GSE53624, and RNAseq datasets, respectively. The four-lncRNA risk signature was significantly enriched in 83 different GO categories and KEGG pathways in the GSE53624, and enriched in 385 GO categories and KEGG pathways in RNA seq dataset, simultaneously (P < 0.05, Supplementary Table S4), especially involved in RNA metabolism and immune response processes such as cell division, mRNA processing, ncRNA metabolic process, RNA splicing, activation of immune response, immune response regulating cell surface receptor signaling pathway, innate immune response, immune effector process ( Figure 5).

Discussion
Esophageal squamous cell carcinoma is a rapidly progressive disease faced with many difficulties, such as difficulty in early diagnosis, low five-year survival rate, and lack of effective prognostic markers. Recently, lncRNA has been reported to be involved in the occurrence and progression of tumors [28,29] . However, the expression characteristics and roles of lncRNAs in ESCC are still fairly elusive. Thus, this paper revealed the survival related lncRNAs expression level in ESCC and constructed a prognostic five-lncRNA signature by collected public ESCC lncRNA expression data. Then, we measured the lncRNA expression of 98 ESCC patients by RNA seq and test the lncRNA expression of 84 ESCC tissues for validating the predictive performance of the five-lncRNA signature. Meanwhile, the 98 ESCC lncRNA expression profiles provide scientists with reference data to study the association of ESCC and lncRNAs.
Despite numerous research have reported some ESCC lncRNA models for prognostic markers [22] , PCR validation was rarely performed in the independent validation group and some lncRNAs in the predicting model might not yet exist in ESCC tissuse. In this study, the predictive model was constructed through the analyzing two sets of high-throughput sequencing data, and lncRNAs of the five-lncRNA signature were shown to be present in collected ESCC samples by qPCR.
In the five-lncRNA suvival predicting model, the high expression level of lncRNA MORF4L2-AS1, AC007179.1 and RP4-735C1.6 were associated with short survival (Cox coefficient >0), indicating these three lncRNAs were risk lncRNAs for ESCC patients. The high expression of RP13-30A9.2 and RP11-488I20.9 were associated with long survival (Cox coefficient < 0), suggesting the two lncRNAs were protective lncRNAs for ESCC. The significant correlation of the five lncRNAs with ESCC prognosis was found in two data sets, emphasizing the important clinical research value of the five prognostic lncRNAs in ESCC. However, the function of prognostic lncRNAs in ESCC have not been reported so far and biological roles of the five lncRNAs in ESCC progression should be investigated in further experimental studies. The functional annotation found the signature may play a role by participating in RNA metabolism and immune regulation.
TNM staging is a commonly used tumor classification standard in clinical practice and a recognized prognostic marker [30,31] . However, TNM staging remains flawed in prognostic assessment. We found that the prognostic predictive power of signature is better than TNM staging, suggesting that the strong prognostic ability of the five-lncRNA signature. Consistent with some scholars' results that combined TNM classification with molecular marker can predict outcome of ESCC patients more accurately [32] , we found the prognostic prediction of the combination of signature with TNM staging was the best, indicating the signature combined with TNM staging is useful for prognosis evaluation.

Conclusion
We identified a prognostic five-lncRNA signature from a large cohort of ESCC patients. The signature could predict the survival of patients with ESCC based on lncRNA expression profile and have strong clinical value.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
The data generated during this study are available from the corresponding author upon reasonable request.      The lncRNA signature classification power for ESCC prognosis. Kaplan-Meier curves found ESCC patients were classified into two different risk groups based on the risk score of the signature in the GSE53624 (A), RNA-seq (B) and qPCR validation datasets(C).   GSEA functional analysis of the six-lncRNA signature.