Six-long Non-coding RNA Signature to Predict the Prognosis of Lung Adenocarcinoma

Background: Long non-coding RNAs (lncRNAs) have been reported to play essential roles in tumorigenesis and cancers prognosis, and they can be a potential cancer prognostic markers. However, in lung adenocarcinoma(LUAD), how lncRNA signatures predict the survival of patients is poorly understood. Our study aims to explore lncRNA signatures and prognostic function in LUAD. Methods: The expression and prognosis data of lncRNAs in LUAD patients was collected from the Cancer Genome Atlas (TCGA) data. All analyses were performed using the R package (version 3.6.2). Metascape, STRING and Cytoscape were used for enrichment analysis and function prediction of the lncRNA co-expressed protein-coding genes. Results: We have collected lncRNA expression data in 466 LUAD tumors, and a six-lncRNA signature(RP11-79H23.3, RP11-309M7.1, CTD-2357A8.3, RP11-108P20.4, U47924.29, LHFPL3-AS2) has been shown to be signicantly related to LUAD patients’ overall survival. According to the lncRNA signatures, the high-risk and low-risk groups were divided in LUAD patients with different survival rates. Further multivariable cox regression analysis showed that the prognostic value of this signature was independent of clinical factors. The potential functional roles and hub co-expressed protein-coding genes in the six prognostic lncRNAs are shown in the functional enrichment analysis. Conclusions: These results showed that these six lncRNAs could be independent predicted prognostic biomarkers in LUAD patients.


Introduction
The incidence and mortality rate of lung cancer worldwide is still in the forefront, and the 5-year survival rate is less than 20% [1]. Lung cancer is classi ed as lung squamous cell carcinoma, small cell lung cancer and lung adenocarcinoma (LUAD) which is the most common type of lung cancer. The standard treatments for LUAD are chemotherapy, target therapy and surgery, but many patients still got into deterioration. Nowadays, no signi cant molecular biomarkers for LUAD have been accepted. It is necessary to prevent the development of early LUAD actively. Long non-coding RNAs (lncRNAs) are RNA transcripts longer than 200 bp with little or no protein-coding capacity [2][3]. More and more studies showed that lncRNAs were involved in chromosome regulation, transcription and post-transcriptional regulation [4][5]. It has also been observed to be involved in the development and progression of other cancers, such as gastric cancer and liver cancer [6][7]. Several prognostic biomarkers for LUAD have been reported such as lncRNA LOC100132354 [8]and lncRNA CTB-193M12.5 [9], etc. In this study, we used a cohort of 466 LUAD patients from The Cancer Genome Atlas (TCGA) to predict whether potential lncRNA was related to the survival of LUAD patients. We demonstrated that six-lncRNA signature was independent of clinical factors and suggested the potential roles in LUAD pathogenesis.

Ethics Statement
The current study received approval from the Xiangya Hospital of Central South University according to the Declaration of Helsinki. Every data was collected from the open web, con rming that all the written informed consents were attained.

The LUAD patient information
The lncRNAs expression and clinical data of LUAD patients were collected from the TCGA database. A total of 466 LUAD patients were enrolled in this study. We also downloaded the related clinical data of LUAD patients, including age, gender, tumor stage, smoked pack-year, kras state. We divided samples from TCGA data set into training (n = 233) and testing sets (n = 233). lncRNA expression pro le LUAD RNA-seq data were collected from TCGA data portal (https://tcga-data.nci.nih.gov/tcga/).
According to the human genome (Ensembl database v72 assembly), the expression level of ln-cRNAs and mRNAs were counted by the value of Reads Per Kilobase of exon model per Million mapped reads (RPKM). We evaluated lncRNAs from TCGA dataset according to the foll-owing criteria: 1) transcripts do not locate in any protein-coding region; 2) transcript sequences connect with GENCODE project [10]; 3) transcripts express in more than half of the LUAD samples. The lncRNA expression pro les were de ned as those with an average RPKM ≥ 0.1 across 466 LUAD samples. At last, a total of 6102 lncRNAs in the dataset were enrolled. We used edgeR [11] software to identify the expression difference.

Statistical analysis
Based on the training set, we used univariate cox regression to calculate the association between the expression level of every lncRNA and overall survival in patients. Those lncRNAs were signi cant if Pvalues were less than 0.05. Then, we used multivariate analysis by the Random Survival Forests method to calculate the selected lncRNAs. Risk scores were considered to be involving in these selected lncRNAs, which were weighted by their estimated regression coe cients in the multivariable cox regression model.
The risk score can be estimated for each LUAD patient according to prognostic six lncRNAs. We divided LUAD patients into high-risk and low-risk groups according to the risk score. The Kaplan-Meier survival analyses can estimate the survival differences in those two groups. We used multivariate Cox regression and strati ed analyses to demonstrate whether the lncRNA was independent of other clinical factors. The receiver operating characteristic (ROC) curve of 5 years were used to estimate the sensitivity and speci city of the survival prediction based on the risk score. All analyses were performed using the R package (version 3.6.2).

Functional Enrichment Analysis
Metascape(http://metascape.org) [12][13][14][15][16][17][18][19] is a website for gene annotation and analysis. In this study, metascape was used to identify pathway and process enrichment of lncRNA correlated genes. On the Metascape online tool, enrichment is shown by the Gene Ontology (GO) terms for biological process, cellular component, and molecular function categories, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. The terms with the characteristics of P-value < 0.01,minimum count of 3 and enrichment factor of > 1.5 were considered signi cant. The network was plotted with the similarity of > 0.3 to connect edges. Search Tool for the Retrieval of Interacting Genes (STRING; http://string-db.org) (version 10.0) [20] constructed the PPI network of six lncRNA co-expressed protein-coding genes. Cytoscape (version 3.4.0) is an open web for visualizing networks of molecular interaction [21]. We used The plug-in Molecular Complex Detection (MCODE) (version 1.4.2) of Cytoscape to cluster PPI network so that we can get the most signi cant interaction network.

Results
Identifying the prognostic lncRNAs from the training set The 466 TCGA LUAD patients were randomly divided into a training (n = 233) set or a testing set (n = 233), respectively. Based on the training set, the lncRNAs were subjected to the Cox regression model, and a total of six lncRNAs were signi cantly correlated with the patients' overall survival (P-value < 0.05; Table 1). Three of them (RP11-79H23.3, RP11-309M7.1, CTD-2357A8.3) had positive coe cients, suggesting that the higher expression level was related to shorter survival. The negative coe cients for the other three lncRNAs ( RP11-108P20.4, U47924.29, LHFPL3-AS2) with higher levels of expression were related to longer survival. The six-lncRNA signature and survival analysis in the training set Acording to the expression level of six lncRNAs, there is a risk-score formula for LUAD patients' survival prediction. The risk score formula is as following: Risk score= (0.37 × expression level of RP11-79H23.3) + (0.21 × expression level of RP11-309M7.1) + (0.15 × expression le-vel of CTD-2357A8.3)+ (-0.22 × expression level of RP11-108P20.4)+ (-0.16 × expression level of U47924.29)+ (-0.12 × expression level of LHFPL3-AS2). Then, we calculated the lncRNA-based risk score for each LUAD patient in the training set and divided LUAD patients into high-risk (n = 117) and low-risk groups (n = 116) by a threshold as the median risk score value. The Kaplan-Meier curves showed that patients in the high-risk group had a worse prognosis th-an those in the low-risk group (28.5 months vs 59.3months, P-value < 0.0001; Fig. 1a). Time dependent ROC curve analysis was measured to estimate the competitive performance of the six lncRNA features, and the AUC score of the six-lncRNA features was 0.746 ( Fig. 1b), suggesting the better survival prediction in the training dataset. Univariate cox regression analysi-s showed that the six-lncRNA risk score were signi cantly associated with patients' survival (P-value < 0.05, HR = 1.41, 95% CI = 1.26-1.58; Table 2). Then, the aggregation effect of samples was shown (red represents tumor, black represents normal sample) in Fig. 1c. Figure 1d showed the gene distribution between expression logFC and CPM (counts of per Million mapped reads). Figure 1e signi ed the lncRNA gene distribution between expression logFC and FDR (FDR is adjusted P value). Figure 1f showed CPM heat map of differentially expressed lncRNAgenes. Figure 1g showed heatmap of the six lncRNA expression pro les, which were ranked according to the risk score value. Patients with high-risk scores had higher mortality than patients with low-risk scores. For patients with high risk scores, the expression pro les of lncRNAs (CTD-2357A8.3, LHFPL3-AS2) were signi cantly unregulated, whereas the expressionsof remaining lncRNAs (RP11-79H23.3, RP11-309M7.1, RP11-108P20.4, U47924.29) were down-regulated. Figure 1h signi ed the risk score scatter plot. Figure 1I signi ed the follow-up scatter plot. The survival prediction of the six-lncRNA signature in the testing set and the entire TCGA data set In the testing set, patients in the high-risk group had signi cantly shorter survival than those in the lowrisk group (33.3 months vs. 57.5 months, P-value = 0.0207; Fig. 2a). ROC curves for the six-lncRNAsbased model got an AUC score of 0.737 in the testing set (Fig. 2c). In the entire TCGA data set, patients in the high-risk group had signi cantly shorter survival than those in the low-risk group (31.0 months vs. 57.5 months, P-value < 0.0001; Fig. 2b). ROC curves analysis for the six-lncRNAs based model got an AUC score of 0.746 in the entire set (Fig. 2d).
Independence of the six-lncRNA signature and the other clinical factors We estimated if the survival prediction based on six-lncRNA signature was independent of clinical variables. We used multivariate cox regression analysis to estimate lncRNA-based risk score and other clinical information, such as age, gender, tumor stage, smoked pack-year and kras state ( Table 2). The result showed that six-lncRNA risk score is tightly related to survival after regulating the clinical factors.
Then, we found that the tumor stages were also signi cantly related to overall survival. We carried out the strati ed analysis. The entire TCGA data set were divided into younger stratum (age ≤ 66, n = 236) and older stratum (age > 66, n = 230), male stratum(MALE, n = 213) and female stratum(FEMALE, n = 253), low smoked pack-year stratum (low smoked-pack year < 25, n = 247) and high smoked pack-year stratum (high smoked-pack year > 25, n = 219), low tumor stages (stage I/II, n = 364) and high tumor stages (stage III/IV, n = 102), no kras state stratum (kras_no, n = 443) and kras state stratum (kras_yes, n = 23). The result showed that the six-lncRNA risk score could divide LUAD patients into high-risk and low-risk subgroup within each stratum. These result indicated that the prognostic value of six-lncRNA signature is independent of age (Fig. 3). The different stratums of age showed statistical signi cance in risk score, and we can see the strati cation analysis of gender (Fig. 4), tumor stage I/II (Fig. 5a), entire tumor stage (Fig. 5c), smoked pack-year (Fig. 6), kras_no state (Fig. 7a) and entire kras state (Fig. 7c) show similar results in the entire set. However, the stratum of tumor stage III/IV and kras_yes state did not show signi cance in risk score, and the reason may be the number of patients in the two stratums were small. These ndings demonstrated that six-lncRNA risk score displayed a competitive survival prediction of LUAD patients.

Functional characteristics of six prognostic lncRNAs
To explore the functional implication of six prognostic lncRNAs in LUAD tumorigenesis, we performed functional category enrichment analysis by analyzing GO and KEGG in Metascape. The biological functions of lncRNAs are still largely unknown. Here, we predicted their potential functions according to the co-expression network. We calculated Spearman correlation coe cients between lncRNAs and protein-coding genes according to their expression values. We selected protein-coding genes as coexpressed partners of six prognostic lncRNAs, whose spearman coe cient > 0.42. At last, a total of 1003 protein-coding genes were signi cantly correlated with at least one prognostic lncRNAs. Functional enrichment analysis showed that lncRNA correlated protein-coding genes mainly enriched in mitotic nuclear division, T cell activation, addative immune system, PID IL12 2 pathway and several pathways (Fig. 8a, Table 3). Figure 8b showed dot plot of the functional enriched GO terms where terms containing more genes tend to have big circles; Fig. 8c showed a network of KEGG enriched terms colored by p-value, where terms containing more genes tend to have a more signi cant p-value. Figure 8d showed the network of enriched terms: colored by cluster-ID, where nodes that share the same cluster-ID are typically close to each other; Fig. 8e showed the network of enriched terms: colored by p-value, where terms containing more genes tend to have a more signi cant p-value. These results suggested that the six prognostic lncRNAs may be related to immune reaction through regulating protein-coding genes to in uence lung tumorigenesis. The PPI network of six lncRNA co-expressed protein-coding genes was constructed by STRING, and the most signi cant module was obtained using Cytoscape (Fig. 8f). "Log10(P)" is the p-value in log base 10. "Log10(q)" is the multi-test adjusted p-value in log base 10.
Discussion lncRNAs play vital roles in tumorigenesis and progression of tumor, and many researchers have studied whether some speci c lncRNAs in cancers can be involved in diagnosis and prognosis. Although lncRNA in LUAD tumorigenesis has been reported many times, the competitive prognostic values of lncRNA in LUAD are not fully understood. So, it is necessary to nd reliable prognostic biomarkers of LUAD. In our work, a six-lncRNA prognostic feature was identi ed according to the lncRNA expression of LUAD patients. This study determined the potential six-lncRNA feature to predict the prognosis of LUAD. The performance of six-lncRNA feature was estimated using ROC analysis, displaying that the prognostic of performance the six-lncRNA feature is suitable for survival prediction. ROC analysis has got an AUC of 0.746 in the training set, which indicated high sensitivity and speci city of the six lncRNA model. Further functional annotation of these prognostic lncRNAs correlated protein-coding genes indicated that they might be involved in immune reaction to regulate LUAD. Also, the result showed that the prognostic value of six-lncRNA signature was independent of other clinical factors in LUAD. Although these six prognostic lncRNAs have not been investigated previously in lung cancers, we assume that these lncRNAs may be related to LUAD, and many studies are needed to validate the deep mechanism in the future.
We should explain some limitations of our study. First, we only analyzed the prognostic power of the six-lncRNA feature in the TCGA data, and the deep mechanisms of lncRNAs need to be further explored.

Conclusion
We have explored the prognostic values of lncRNAs in LUAD patients in our study. Our result has suggested that the six-lncRNA signature is useful in predicting the clinical outcome, and may be effective prognostic biomarkers in LUAD patients survival prediction. All data included in this study are available upon request by contact with the corresponding author.

Code availability
All code included in this study are available upon request by contact with the corresponding author.
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.   The number below the curve is the number of the patients in the high-and low-risk group. The P-value represents the differences between the curves from the results of two-sided log-rank tests.   The P-value represents the differences between the curves from the results of two-sided log-rank tests.
where terms containing more genes tend to have a more signi cant p-value. f Gene-gene interaction network among the most signi cant module of six lncRNA co-expressed protein-coding genes.(STRING and Cytoscape).