A comprehensive analysis of the lncRNAs and genes for the gastric carcinoma

Abstract

Background
Gastric carcinoma (GC) is one of the most common malignant tumors worldwide. Despite numerous studies, the molecular mechanism of GC is still unclear and its prognosis remains poor.

Methods
The present study obtained expression data from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) databases. The gene risk signature and lncRNA risk signature were constructed using univariate Cox regression analysis and least absolute shrinkage and selection operator (LASSO) analysis. Receiver operating characteristic (ROC) curve analysis was applied to evaluate the specificity and sensitivity of the risk signatures. Potential pathways were explored using Gene Set Enrichment Analysis (GSEA).

Results
In total, 1641 differentially expressed genes (DEGs) and 985 differentially expressed lncRNAs (DElncRNAs) were obtained from GC samples. A classifier of 6 prognostic DEGs (DIRC1, IQCM, MATN3, SOX14, C5orf46 and CYP19A1) and a classifier of 9 prognostic DElncRNAs (AC007126.1, AC011352.1, AL356417.2, AP000695.1, LINC01210, LINC01614, VCAN-AS1, AC005165.1 and AC011586.2) were identified using LASSO-penalized multivariate survival modelling with 10-fold cross-validation. Patients were divided into a high-risk group and a low-risk group according to the median risk score. Overall survival differed significantly between the high-risk and low-risk groups (p < 0.001). The 6-DEG and 9-DElncRNA signature risk models were further verified as independent prognostic biomarkers for GC among the clinical traits (P < 0.001). Moreover, two independent GEO datasets (GSE13911 and GSE70800) were employed to evaluate the specificity and sensitivity of the prognostic genes and prognostic lncRNAs. GSEA of the gene risk model revealed involvement of the ECM receptor interaction pathway, the ERBB signaling pathway and ubiquitin-mediated proteolysis, while GSEA of the lncRNA risk model revealed involvement of cell adhesion molecules (CAMs), ECM receptor interaction, focal adhesion, pathways in cancer and the TGF beta signaling pathway. These results suggest that the gene and lncRNA risk models play a crucial role and have important prognostic value.

Conclusion
These findings may be important for understanding the molecular mechanism of GC and for developing potential therapeutic methods for GC patients.

Gastric carcinoma (GC) is one of the most common malignant tumours and the third leading cause of cancer-associated mortality [1,2]. About 80% of patients with GC are diagnosed at an advanced stage, which results in a poor prognosis [3]. Surgery is by far the most suitable curative treatment for gastric cancer. However, for patients with recurrent or unresectable GC there is no satisfactory treatment, and for advanced gastric cancer the 5-year mortality remains 30% to 50% [4][5][6]. Therefore, it is necessary to explore methods for the early diagnosis of GC, which can help improve the prognosis of patients.
As a type of functional noncoding RNA, long non-coding RNAs (lncRNAs) have recently attracted attention for their role as competing endogenous RNAs (ceRNAs) [7,8]. Recent studies have also indicated that ceRNAs are involved in the regulation of carcinogenesis [9][10][11]. In GC in particular, many lncRNAs and mRNAs have been reported to be related to patient prognosis [12][13][14]. Although many non-coding RNAs have been found to be associated with the prognosis of gastric cancer and can be used as potential prognostic markers, analysing genes or lncRNAs in isolation is often not stable because of the potential association between lncRNAs and genes. Thus, it is necessary to establish a model combining multiple RNA molecular markers with prognostic effects.
Here, we comprehensively analysed two types of RNA (lncRNAs and mRNAs) in GC patients using RNA expression profiling data. Based on the differential expression of these RNAs, we established a six-gene signature risk model and a nine-lncRNA signature risk model using LASSO-penalized multivariate survival modelling with 10-fold cross-validation. The risk models were further evaluated by ROC analysis and validated in two independent datasets. Through these efforts, we hope to find a variety of more effective clinical prognostic markers for gastric cancer patients.

Methods
RNA sequencing data and clinical information
RNA expression profiling data (including lncRNA and mRNA) and the corresponding clinical information of GC patients were downloaded from The Cancer Genome Atlas (TCGA) dataset (https://cancergenome.nih.gov/). To reduce bias, patients were excluded if (1) prognosis information was absent or (2) histologic diagnosis ruled out GC. Differentially expressed (DE) lncRNAs and mRNAs between GC and adjacent normal tissues were identified using the "edgeR" package in R. |Fold change| > 2 and adjusted P value < 0.05 were set as the thresholds for selecting each kind of RNA. Genes with expression greater than 1 were retained for further analysis. The DE lncRNAs and DE genes were then each integrated with the patients' survival information (survival time and survival status).
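The thresholding step described above can be sketched in a few lines. This is an illustrative Python sketch, not the edgeR workflow used in the study: the `DEResult` layout, the function name `filter_de` and the example values are hypothetical, and the sketch assumes the fold change is expressed on a log2 scale so that an absolute value is meaningful.

```python
from typing import NamedTuple


class DEResult(NamedTuple):
    """One row of a differential-expression table (hypothetical layout)."""
    name: str
    log2_fc: float   # assumed log2 fold change, tumour vs. normal
    adj_p: float     # multiple-testing adjusted P value


def filter_de(rows, fc_cutoff=2.0, p_cutoff=0.05):
    """Return (up, down) name lists for rows passing both thresholds."""
    up, down = [], []
    for row in rows:
        if abs(row.log2_fc) > fc_cutoff and row.adj_p < p_cutoff:
            (up if row.log2_fc > 0 else down).append(row.name)
    return up, down


# Made-up values for illustration only.
table = [
    DEResult("geneA", 3.1, 0.001),
    DEResult("geneB", -2.6, 0.010),
    DEResult("geneC", 1.2, 0.001),   # fold change too small
    DEResult("geneD", 4.0, 0.200),   # not significant
]
up, down = filter_de(table)
print("up-regulated:", up)
print("down-regulated:", down)
```

Splitting the result by sign mirrors the way up-regulated and down-regulated RNAs are counted separately in the Results.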

Statistical analysis and validation
The DEG and DElncRNA expression profile signatures for prognosis prediction were drawn from features whose variance in expression level was greater than 5 and whose P value in univariate Cox regression analysis was lower than 0.05. These DEGs and DElncRNAs were then subjected to penalized multivariate Cox regression survival modelling using a LASSO estimation algorithm [15,16]. With this model, the prognostic genes were selected by 10-fold cross-validation in the dataset.
Based on the Cox regression coefficients, a risk score formula was established and the risk score was then calculated for each patient sample. According to the median risk score, patients were categorized into a high-risk group and a low-risk group. The prognostic lncRNAs and mRNAs were subsequently evaluated using ROC analysis and validated in two independent datasets (GSE13911 and GSE70800). Kaplan-Meier and log-rank methods were used to test the difference between the two groups, and a P value < 0.05 was considered statistically significant. All statistical analyses and graphics were produced with R software (version 3.5.1).
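The risk score and median split described above can be illustrated as follows. This is a minimal Python sketch rather than the R code used in the study. It assumes the score is the sum of expression values weighted by their Cox coefficients, and that a score exactly equal to the median falls in the low-risk group (the text does not say how ties are handled). The coefficients and expression values are hypothetical.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Sum of expression values weighted by their Cox regression coefficients."""
    return sum(coefficients[g] * expression[g] for g in coefficients)


def split_by_median(scores):
    """Label a patient 'high' if the score exceeds the cohort median, else 'low'."""
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Hypothetical coefficients and expression values, not those fitted in the study.
coefs = {"gene1": 0.5, "gene2": -0.2}
cohort = {
    "P1": {"gene1": 2.0, "gene2": 1.0},
    "P2": {"gene1": 4.0, "gene2": 0.0},
    "P3": {"gene1": 1.0, "gene2": 5.0},
    "P4": {"gene1": 6.0, "gene2": 5.0},
}
scores = {pid: risk_score(expr, coefs) for pid, expr in cohort.items()}
groups = split_by_median(scores)
for pid in cohort:
    print(pid, round(scores[pid], 3), groups[pid])
```

A positive coefficient raises the score as expression increases, while a negative coefficient lowers it, which is why the sign of each coefficient matters for interpretation.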

Gene set enrichment analysis (GSEA)
According to the median expression level of each prognostic gene, patients were categorized into a high-expression group and a low-expression group. Gene set enrichment analysis was then performed for the prognostic genes between the high-expression and low-expression groups.
A pathway was considered statistically significant if its p value was less than 0.05.

Results
Differential expression analysis and clinical factor screening
After applying the exclusion criteria, 314 patients with mRNA and lncRNA expression data were enrolled in this study. A total of 1641 differentially expressed genes (DEGs; 872 up-regulated and 769 down-regulated) and 985 differentially expressed lncRNAs (DElncRNAs; 775 up-regulated and 210 down-regulated) were identified between tumour tissue and normal tissue using the criteria adjusted p value < 0.05 and |Fold change| > 2. The expression of lncRNAs and genes is presented as heatmaps (Fig. 1A and 1C), and volcano plots of the distribution of DEGs and DElncRNAs in each data set are shown in Fig. 1B and 1D.

Survival analysis for the DElncRNAs and DEGs
We excluded the DEGs and DElncRNAs whose variance in expression level was lower than 5, and the remaining DEGs and DElncRNAs were each integrated with the corresponding clinical information. By performing univariate Cox regression analysis on the DElncRNA and DEG datasets, we selected the DEGs and DElncRNAs with a p value smaller than 0.05. LASSO-penalized multivariate Cox proportional hazards modelling was then performed on the filtered DElncRNA (N = 75) and DEG (N = 128) datasets separately. After 100 iterations, a 6-DEG expression signature and a 9-DElncRNA expression signature with optimal survival prediction in the GC cohort were obtained, each selected more than 50 times. Based on the regression coefficients of the DEGs and DElncRNAs, a risk formula was constructed for the 6 DEGs and for the 9 DElncRNAs, and the risk score of each patient was then calculated from these coefficients and the corresponding expression levels.
According to the median risk score, patients were classified into a high-risk group and a low-risk group. Kaplan-Meier curve analysis showed a significant difference between the low-risk and high-risk groups in both the lncRNA risk model and the gene risk model (p value < 0.001) (Fig. 2A and Fig. 2B). Moreover, patients with a high risk score tended to have shorter survival times and more deaths, whereas patients with prolonged survival times tended to fall in the low-risk group (Fig. 3A and 3B). To further evaluate the accuracy of the gene and lncRNA risk models, ROC analysis was conducted, and the five-year AUC values were 0.851 and 0.884, respectively, indicating good accuracy (Fig. 4A and 4B). We also compared the accuracy of our risk models with that of other clinical traits, including age, gender, grade, stage and TNM. As shown in Fig. 5A and 5B, our risk models were superior to the other clinical traits in accuracy.
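For readers unfamiliar with the Kaplan-Meier comparison used above, the estimator behind each survival curve can be sketched as follows. This is an illustrative Python sketch with made-up follow-up data, not the R analysis performed in the study. The function name `kaplan_meier` and the input layout are hypothetical, and the log-rank test is not included.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time.

    times  -- follow-up time per patient
    events -- 1 if death was observed at that time, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Made-up follow-up data (months) for a single risk group.
times = [5, 8, 8, 12, 20]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.2f}")
```

Running the estimator once per risk group yields the two curves that are then compared, for example with a log-rank test.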
In addition, we performed univariate and multivariate Cox regression analyses of the risk models and clinical information. As shown in Fig. 6A to 6D, the P values of the lncRNA risk model and gene risk model were significant among the clinical traits, suggesting that these risk models can serve as independent predictors. We subsequently investigated the correlation between prognostic genes and lncRNAs. As Fig. 7 shows, C5orf46 was highly correlated with AC011352.1, and SOX14 was highly correlated with LINC01210.

Validation of the risk model
To further investigate the sensitivity and specificity of our risk models, we downloaded two GEO datasets, GSE13911 and GSE70800, for validation. The GSE13911 dataset included 31 normal samples and 38 tumor samples and contained five of the prognostic genes (C5orf46, CYP19A1, DIRC1, MATN3 and SOX14). ROC analysis showed that CYP19A1 had the highest AUC value (0.735) and DIRC1 the lowest (0.466) (Fig. 8A). The lncRNA expression dataset GSE70800 was re-annotated based on the latest annotation revision. In this dataset, six prognostic lncRNAs were identified, and ROC analysis showed AUC values ranging from 0.558 to 0.841 (Fig. 8B).
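The AUC values reported for individual genes and lncRNAs can be understood through the rank-based definition of the area under the ROC curve. The following is a minimal Python sketch with made-up expression values, not the R implementation used in the study. It assumes that higher expression indicates tumour tissue, and the function name `auc` is hypothetical.

```python
def auc(scores_pos, scores_neg):
    """Probability that a random positive scores above a random negative.

    Ties contribute 0.5, which makes this equal to the area under the ROC curve.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Made-up expression values for one marker.
tumour = [5.1, 6.3, 4.8, 7.0]
normal = [4.0, 5.0, 4.8]
value = auc(tumour, normal)
print(f"AUC = {value:.3f}")
```

Under this definition a value near 0.5, such as the 0.466 reported for DIRC1, means the marker separates the two sample types little better than chance.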

Gene Enrichment analysis for the prognostic genes
To explore the potential function of the prognostic gene and lncRNA risk models, we performed GSEA on the risk groups defined by the median risk score. In the gene risk model, as shown in Fig. 9A, the ECM receptor interaction pathway was significantly enriched in the high-risk group, while the ERBB signaling pathway and ubiquitin-mediated proteolysis were significantly enriched in the low-risk group. In the lncRNA risk model, cell adhesion molecules (CAMs), ECM receptor interaction, focal adhesion, pathways in cancer and the TGF beta signaling pathway were enriched in the high-risk group (Fig. 9B). These results suggest that the gene and lncRNA risk models play an important role in the pathogenesis of GC.

Discussion
A growing body of research suggests that many complex diseases, especially cancer, can rarely be attributed to a single genetic mutation [17][18]. Many studies have reported that lncRNAs and genes are related to the prognosis of GC. Cao et al. identified a set of lncRNAs differentially expressed in gastric cancer, providing useful information for the discovery of new biomarkers and therapeutic targets in gastric cancer [19]. Lin et al. found 10 differentially expressed lncRNAs potentially regulating the p53 signalling pathway from large-scale expression profiling of lncRNA and mRNA [20].
GC is a painful experience for patients because of its poor prognosis [21]. Therefore, identification of effective prognostic biomarkers and exploration of potential regulatory networks are indispensable steps in the development of effective treatments.
In the present study, to gain insight into the molecular events relevant to GC prognosis, we first took advantage of the molecular resolution provided by the TCGA database and downloaded the expression profiles of lncRNAs and genes. By performing Cox regression analysis and LASSO analysis, we then identified six prognostic genes (C5orf46, CYP19A1, DIRC1, MATN3, SOX14 and IQCM) and nine prognostic lncRNAs (AC007126.1, AC011352.1, AL356417.2, AP000695.1, LINC01210, LINC01614, VCAN-AS1, AC005165.1 and AC011586.2) associated with the survival of GC patients. Using the coefficients derived from the LASSO analysis, we constructed a six-gene signature risk score model and a nine-lncRNA signature risk score model. We further categorized patients into a high-risk group and a low-risk group based on the median risk score. Patients with a high risk score tended to have a significantly shorter survival time, corresponding to more deaths from GC. ROC analysis was used to evaluate the accuracy of the gene and lncRNA risk models separately. The high 5-year AUC values (gene = 0.851, lncRNA = 0.884) indicated that the risk models are reliable for predicting the prognosis of GC.
Further Cox regression analysis of the risk model scores and clinical traits suggested that the gene risk model and lncRNA risk model can each act as an independent predictor of GC survival. In addition, we compared the accuracy of the risk models with that of the clinical traits, and the AUC values of the risk models were higher, indicating that our risk models were more accurate than the clinical traits. To validate the prognostic genes and lncRNAs, we downloaded a gene expression dataset (GSE13911) and an lncRNA expression dataset (GSE70800) from the GEO database. High AUC values indicate that the genes and lncRNAs have important prognostic value. GSEA revealed that the gene risk model and lncRNA risk model have important functions in the molecular pathogenesis and progression of GC.
However, our work has several limitations. First, although the comparison between our results and published articles supports the validity of our results, we did not provide an external validation for them. Second, several novel lncRNAs with clinical significance in GC need to be explored further to determine the underlying molecular mechanism. Finally, despite the limited power for detecting individual events, the model that we propose has promising implications for clinical practice.

Conclusion
In summary, our study delineates two prognostic models based on lncRNAs and mRNAs that may help improve the poor prognosis of GC. These findings provide some new potential prognostic markers and identify novel candidate therapeutic targets for GC.

Declarations

Data Availability
The datasets analysed in this study are available from the corresponding author on reasonable request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.