Development and Validation of a Six-lncRNA Prognostic Signature in Gastric Cancer

Background: Gastric cancer (GC) has been a leading cause of cancer-related mortality for many years. Long noncoding RNAs (lncRNAs) are thought to play a significant role in GC. This study aimed to construct a six-lncRNA signature as a prognostic biomarker for GC patients.
Methods: The expression profiles of lncRNAs and the corresponding clinical data of GC patients were obtained from The Cancer Genome Atlas (TCGA). Cox regression and the least absolute shrinkage and selection operator (LASSO) regression model were used to identify the prognostic lncRNA signature. A total of 337 patients were included in the combined dataset, which was divided into a training dataset (N = 169) and a test dataset (N = 168). The reliability of the lncRNA prognostic signature was validated in all three datasets.
Results: A six-lncRNA prognostic signature was constructed to predict the overall survival (OS) of GC patients. The signature had better discriminability than clinical characteristics. The prognostic risk score was as follows: (expression level of RP11-284F21.7 × (-0.243981)) + (expression level of RP11-432J22.2 × (-0.502378)) + (expression level of RP4-584D14.5 × (-0.447878)) + (expression level of AC093850.2 × 0.261822) + (expression level of AP000695.6 × 0.654318) + (expression level of AC098973.2 × 0.406603). In addition, the signature was confirmed to be a significant predictor of OS. The nomogram model precisely predicted the OS of GC patients. Enrichment analysis indicated that the signature was mainly enriched for extracellular matrix-related functions and tumor signaling pathways. The target genes IGFBP7, VCAN, and COL1A1 had prognostic value in GC. The high expression of AC098973.2 and RP11-284F21.7 was verified in GC tissues and cell lines.
Conclusions: We established a prognostic signature that had good predictive ability and was an independent risk factor in GC. Compared to the traditional TNM staging system, the nomogram can predict OS and be precisely applied in clinical practice.
GC: Gastric cancer; TCGA: The Cancer Genome Atlas; lncRNAs: Long noncoding RNAs; LASSO: Least absolute shrinkage and selection operator; OS: Overall survival; AUC: Area under the curve; ROC: Receiver operating characteristic; GEPIA: Gene Expression Profiling Interactive Analysis; K-M: Kaplan-Meier; CI: Confidence interval; HR: Hazard ratio; ECM: Extracellular matrix; FC: Fold change; FDR: False discovery rate; GO: Gene Ontology; KEGG: Kyoto Encyclopedia of Genes and Genomes; PPI: Protein-protein interaction


Background
Gastric cancer (GC) currently ranks fifth in morbidity and fourth in mortality among cancers, and accounted for an estimated 1 million new cancer cases and 769,000 cancer-related deaths in 2020 [1]. The mortality rate and disease burden of GC have increased with the aging of the population in China [2]. Despite advancements in the diagnosis and treatment of GC, most patients are diagnosed at an advanced stage, and the 5-year overall survival rate is unsatisfactory. Therefore, it is particularly important to explore new diagnostic markers and therapeutic targets for GC patients.
Fortunately, with the continuous development and improvement of bioinformatics technology, most of the noncoding regions have been found to be widely transcribed, and these transcripts include long and short noncoding RNAs [3].
Long noncoding RNAs (lncRNAs) are longer than 200 nucleotides and cannot be translated into protein [4]. lncRNAs are involved in a variety of physiological processes and are closely related to tumors [5]. In general, lncRNAs can regulate genes through various mechanisms, such as translation, transcriptional regulation and protein modification [6]. lncRNAs such as HOTAIR [7], GClnc1 [8], GMAN [9] and MEG3 [10] can act as oncogenes or tumor suppressor genes, which indirectly influence the occurrence and development of GC. Therefore, lncRNAs might be potential targets for new therapeutic strategies for GC [11]. Alternatively, given the heterogeneity and susceptibility of GC, a group of lncRNA biomarkers could be used to evaluate prognosis more stably and accurately in clinical application.
We studied the corresponding data of 337 patients in the TCGA database. We identified differentially expressed lncRNAs for prognosis prediction. Furthermore, we constructed a six-lncRNA prognostic signature and validated its accuracy in the three datasets. We established an individualized nomogram model, and its value in prognosis prediction was higher than that of clinicopathologic characteristics. Finally, the target genes related to GC were identified by functional enrichment analysis and network construction.

Data collection and preprocessing
Data for 407 patients with GC (providing 375 cancer tissues and 32 normal tissues) and a total of 14834 lncRNAs were obtained from the TCGA database (https://portal.gdc.cancer.gov). The expression profiles of lncRNAs and the corresponding clinical information were also downloaded. The R package "edgeR" was used to identify the differentially expressed lncRNAs, using log2 fold change > 0.5 and FDR < 0.05 as the selection threshold. The clinical information included age, gender, histologic grade, tumor (T) stage, lymph node (N) stage, metastasis (M) stage, clinical stage, and survival information. The exclusion criteria were: (1) the pathologic diagnosis was not gastric adenocarcinoma; (2) the follow-up time was less than 30 days.

Construction and validation of the lncRNA prognostic signature
Univariate Cox regression analysis was conducted in the training dataset with the R package "survival" to identify lncRNAs associated with prognosis. LASSO regression analysis was performed with the R package "glmnet" to narrow down the candidate lncRNAs. According to the calculated minimum lambda value, prognosis-related lncRNAs were obtained to represent the best signature. We established a prognostic risk model using multivariate Cox regression analysis. The following formula was applied: risk score = Σ (β_lncRNAi × Exp_lncRNAi), where β_lncRNAi is the multivariable Cox regression coefficient of lncRNA i and Exp_lncRNAi is its expression level.
According to the formula, we calculated the risk score of each GC patient in the training dataset. GC patients were further classified into the low-risk group (N = 169) and the high-risk group (N = 168) using the average risk score as the cutoff value. Similarly, the risk score formula was applied in the test dataset and the combined dataset. Kaplan-Meier (K-M) survival analysis was used to explore correlations between the signature and overall survival (OS). We used the R package "survivalROC" to verify the accuracy and sensitivity of the prognostic model. Moreover, univariate and multivariate Cox analyses were used to investigate whether the signature was a significant predictor of OS.
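The risk score and the cutoff-based grouping described above can be illustrated with a minimal Python sketch. The coefficients are those reported in the abstract; the function names, the patient identifiers and the expression values are hypothetical, and the handling of a score exactly equal to the cutoff (assigned to the low-risk group here) is an assumption, since the text does not specify it.

```python
# Coefficients of the six-lncRNA signature as reported in the abstract.
COEFFICIENTS = {
    "RP11-284F21.7": -0.243981,
    "RP11-432J22.2": -0.502378,
    "RP4-584D14.5": -0.447878,
    "AC093850.2": 0.261822,
    "AP000695.6": 0.654318,
    "AC098973.2": 0.406603,
}


def risk_score(expression):
    """Sum of (coefficient x expression level) over the six lncRNAs."""
    return sum(beta * expression[name] for name, beta in COEFFICIENTS.items())


def assign_risk_groups(scores):
    """Split patients into 'high' and 'low' groups at the average risk score."""
    cutoff = sum(scores.values()) / len(scores)
    # Assumption: a score equal to the cutoff is placed in the low-risk group.
    groups = {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}
    return groups, cutoff


# Hypothetical expression levels for three made-up patients (illustration only).
baseline = dict.fromkeys(COEFFICIENTS, 1.0)
example_patients = {
    "P1": dict(baseline),
    "P2": {**baseline, "AP000695.6": 3.0},     # more of a risk-increasing lncRNA
    "P3": {**baseline, "RP11-432J22.2": 3.0},  # more of a protective lncRNA
}

scores = {pid: risk_score(expr) for pid, expr in example_patients.items()}
groups, cutoff = assign_risk_groups(scores)
for pid in scores:
    print(pid, round(scores[pid], 6), groups[pid])
```

In this toy example, raising the expression of a lncRNA with a positive coefficient moves the patient toward the high-risk group, while raising one with a negative coefficient lowers the score.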

Establishment and evaluation of the nomogram
We generated individualized predictions for the OS using the nomogram. The nomogram model for predicting 1-, 3-, and 5-year OS was based on the outcomes of the multivariate analysis. We used the R package "rms" to establish the nomogram and calibration plots. A calibration chart was utilized to evaluate the predictive performance of the nomogram. Receiver operating characteristic (ROC) analysis was applied to assess the accuracy of the predicted nomogram.
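To make the AUC comparison concrete, the following Python sketch computes the area under the ROC curve for a binary outcome as the fraction of (event, non-event) pairs in which the event case has the higher score. This is a simplification: it ignores censoring and is not the time-dependent ROC method used in the study, and the function name and example values are hypothetical.

```python
def binary_auc(scores, labels):
    """AUC for a binary outcome via pairwise comparison (ties count as 0.5).

    Simplified illustration only: censoring and follow-up time are ignored.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted risks and observed outcomes (1 = death by the time point).
example_scores = [0.9, 0.8, 0.3, 0.1]
example_labels = [1, 0, 1, 0]
print(binary_auc(example_scores, example_labels))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two outcome classes.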

Functional enrichment analysis
Functional enrichment analysis was used to elucidate the potential biological mechanisms and pathways of the six prognostic lncRNAs. The correlations between the lncRNAs and the co-expressed mRNAs were calculated by Pearson correlation analysis (Pearson coefficient > 0.4). The differentially co-expressed mRNAs with log2 fold change ≥ 0.5 and FDR < 0.05 were included. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses were used to explore the potentially related functional enrichments and pathways.
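The co-expression filter can be sketched in a few lines of Python. The threshold of 0.4 is taken from the text; the function names, gene labels and expression vectors are hypothetical, and the sketch assumes the threshold applies to the signed coefficient, since the text does not say whether absolute values were used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def coexpressed_mrnas(lncrna_expr, mrna_exprs, threshold=0.4):
    """Keep mRNAs whose correlation with the lncRNA exceeds the threshold."""
    selected = {}
    for gene, values in mrna_exprs.items():
        r = pearson(lncrna_expr, values)
        if r > threshold:  # assumption: signed coefficient, not absolute value
            selected[gene] = r
    return selected


# Hypothetical expression vectors across five samples (illustration only).
lncrna = [1.0, 2.0, 3.0, 4.0, 5.0]
mrnas = {
    "GENE_A": [2.1, 3.9, 6.2, 8.0, 9.8],  # rises with the lncRNA
    "GENE_B": [5.0, 4.0, 3.0, 2.0, 1.0],  # falls as the lncRNA rises
    "GENE_C": [3.0, 1.0, 2.0, 1.0, 3.0],  # no linear relationship
}
print(coexpressed_mrnas(lncrna, mrnas))
```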

Protein-protein interaction (PPI) network construction and validation of the target genes
We used the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) online database to assess PPIs. The PPI network was visualized with Cytoscape software. The Gene Expression Profiling Interactive Analysis (GEPIA) database (http://gepia.cancer-pku.cn/detail.php) and the Kaplan-Meier (K-M) Plotter database (https://kmplot.com/analysis/) were used to explore the relationship between the target genes and the prognosis of patients.

Expression of the six lncRNAs in GC tissues
The expression levels of lncRNAs in 337 GC tissues and 32 nontumor tissues from the TCGA database were analyzed. We also analyzed the differential expression of lncRNAs in 27 pairs of cancer tissues and paracancerous tissues. Moreover, K-M analysis was performed to display the OS of patients according to the expression of the six lncRNAs.

RNA extraction and qRT-PCR analysis
We used qRT-PCR to validate the expression of two hub lncRNAs in cancer tissues and cell lines. Total RNA from tissues and cultured cells was isolated using TRIzol (TAKARA, Beijing, China), and cDNA was generated using Hifair® III 1st Strand cDNA Synthesis SuperMix for qPCR (YEASEN, Shanghai, China). Quantitative PCR reactions were performed using SYBR Green Master Mix (VAZYME, Nanjing, China) to detect the expression levels of the lncRNAs (AC098973.2 F: 5′CATGCCCTGGATTCCTGCTA3′, R: 5′TGGCTGGGTAGCTCTGATTC3′; RP11-284F21.7 F: 5′CAGGGACAGGGCAGTATTCC3′, R: 5′GCCCCAACGACTCAGGTATC3′). GAPDH was selected as an endogenous control. Each sample was independently analyzed in triplicate and each experiment was repeated at least three times.
We used the relative quantification 2^(−ΔΔCT) method to calculate fold changes.
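As a worked illustration of the 2^(−ΔΔCT) calculation, the Python sketch below normalizes the target CT to the endogenous control (GAPDH in this study) and then to a reference sample. The function name and the CT values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def fold_change(ct_target_sample, ct_control_sample,
                ct_target_reference, ct_control_reference):
    """Relative expression by the 2^(-ddCT) method.

    dCT  = CT(target) - CT(endogenous control), computed per sample
    ddCT = dCT(sample of interest) - dCT(reference sample)
    """
    delta_sample = ct_target_sample - ct_control_sample
    delta_reference = ct_target_reference - ct_control_reference
    return 2 ** -(delta_sample - delta_reference)


# Hypothetical CT values: tumor tissue versus adjacent tissue.
# Tumor:    target CT 24, GAPDH CT 18  -> dCT = 6
# Adjacent: target CT 26, GAPDH CT 18  -> dCT = 8
# ddCT = 6 - 8 = -2, so fold change = 2^2 = 4
print(fold_change(24.0, 18.0, 26.0, 18.0))
```

A fold change above 1 indicates higher relative expression in the sample of interest than in the reference sample.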

Statistical analysis
We used R software (version 4.0.3), SPSS 22.0 (SPSS Inc., Chicago, IL, USA) or GraphPad Prism 9 (GraphPad Software Inc., La Jolla, CA, USA) for statistical analyses. K-M analysis and log-rank tests were used to compare survival between groups. Univariate and multivariate Cox analyses were performed to identify the significant predictors. ROC curve analysis was used to assess sensitivity and specificity. The prognostic accuracy was evaluated by AUC analysis. Unpaired and paired Student's t-tests were employed to analyze lncRNA expression in cancer tissues and nontumor tissues. Differences with P < 0.05 were considered statistically significant.
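For readers unfamiliar with K-M analysis, the product-limit estimator behind the survival curves can be sketched in plain Python. The analyses in this study were run in R; the function name and the follow-up data below are hypothetical and serve only to show how censored observations leave the risk set without lowering the survival estimate.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of survival at each distinct event time.

    times  : follow-up time of each patient
    events : 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival probability) pairs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after time t.
        at_risk -= leaving
    return curve


# Hypothetical follow-up (months) for five patients; 0 marks censoring.
example_times = [5, 8, 12, 12, 20]
example_events = [1, 0, 1, 1, 0]
for t, s in kaplan_meier(example_times, example_events):
    print(t, round(s, 4))
```

In this example one of five patients dies at month 5, giving a survival estimate of 0.8; after one censoring, two of the three remaining patients die at month 12, giving 0.8 × (1 − 2/3).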

Identification of prognosis-related lncRNAs
The flowchart of this study is shown in Fig. 1. A total of 337 samples were included for further analysis.
The clinical characteristics of GC patients are described in Table 1. A total of 410 differentially expressed lncRNAs were obtained. Among these lncRNAs, 15 lncRNAs associated with prognosis were identified by univariate Cox analysis (Fig. 2A). The LASSO regression model was then fitted with 10-fold cross-validation, and six lncRNAs were ultimately selected (Fig. 2B-C).

Construction and validation of the six-lncRNA signature
The six-lncRNA signature was constructed by multivariate Cox regression analysis. The descriptions of the six lncRNAs are listed in  (Fig. 3). We analyzed the risk score, survival status, and the distribution of lncRNA expression profiles in the three datasets (Fig. 4A-C). The results showed that GC patients in the low-risk group tended to have longer survival times. According to the K-M analysis of the three datasets, the survival rate of the high-risk group was lower than that of the low-risk group (P < 0.05) (Fig. 4D). The AUC of the model for 5-year survival was 0.748 in the training dataset, 0.723 in the test dataset and 0.745 in the combined dataset (Fig. 4E). These observations demonstrated that the signature can distinguish risk groups and predict the OS of GC patients.

Development and evaluation of the prognostic nomogram
We constructed a nomogram model to predict the survival rates of GC patients at 1, 3 and 5 years. The covariates included age, gender, histologic grade, clinical stage and risk score. The score for each variable was incorporated into the total score by means of the nomogram model. A higher score indicated a poorer prognosis (Fig. 6A). The calibration curve was used to represent the actual probability (Fig. 6B). The AUC of the nomogram was 0.751, which was higher than that of age (AUC = 0.617), gender (AUC = 0.571), histologic grade (AUC = 0.515), TNM stage (AUC = 0.585), T stage (AUC = 0.546), N stage (AUC = 0.523), and M stage (AUC = 0.498). Overall, we established an individualized predictive nomogram model that can accurately predict OS (Fig. 6C).

Functional enrichment analysis of prognosis-related lncRNAs in the signature
We identified 184 differentially co-expressed genes associated with the lncRNAs. The GO enrichment analysis covered biological process, cellular component and molecular function. The related biological processes mainly consisted of extracellular matrix organization and extracellular structure organization (Fig. 7A). Cellular components included collagen-containing extracellular matrix (ECM) and endoplasmic reticulum lumen (Fig. 7B). The enriched molecular functions consisted of extracellular matrix structural constituents and glycosaminoglycan binding (Fig. 7C). GO enrichment analysis demonstrated that the biological function may be related to the tumor interaction with the ECM. KEGG pathway analysis indicated that these genes were enriched in tumor-related signaling pathways, including the Wnt signaling pathway, focal adhesion and ECM-receptor interaction (Fig. 7D). These results indicated that the function of these genes was highly associated with the occurrence of tumors.

Establishment of the PPI network and validation of target genes
The PPI network was constructed using the STRING online database to reflect the interaction of 24 target genes (Fig. 8A). We selected three genes (IGFBP7, VCAN, COL1A1) for further research and verified their expression and survival correlations. In the GEPIA database, the expression levels of IGFBP7, VCAN, and COL1A1 in cancer tissues were higher than those in normal tissues (Fig. 8B, F, J). In addition, high mRNA expression of these target genes was associated with a poor prognosis (Fig. 8C, G, K) (P < 0.05). According to the K-M Plotter database, the survival rate of patients with high expression of these genes was lower than that of patients with low expression (P < 0.01) (Fig. 8D, H, L). The expression levels of IGFBP7, VCAN and COL1A1 were related to tumor stage (Fig. 8E, I, M) (P < 0.05). To sum up, we selected three lncRNA target genes that were closely related to prognosis and had an effect on the development of GC.

Expression of the six lncRNAs in GC
We analyzed the six lncRNAs from the TCGA database. The expression levels of the lncRNAs in cancer tissues were significantly higher than those in nontumor tissues (P < 0.01) (Fig. 9A). Then, we compared their expression levels in 27 pairs of cancer tissues and paracancerous tissues. These results showed that the six lncRNAs were differentially expressed in GC tissues (Fig. 9B). High expression of RP4-584D14.5, RP11-284F21.7 and RP11-432J22.2 generally indicated better overall survival than low expression. In contrast, the survival rates of patients with high AP000695.6 expression were lower than those of patients in the low expression group (P < 0.05). Although there was no significant difference in the survival of patients according to AC093850.2 and AC098973.2 expression (P > 0.05), we observed a trend toward poor prognosis in the high expression groups (Fig. 9C). These results ultimately showed that these lncRNAs can be universally detected in GC tissues.

Validation of two hub lncRNAs in cancer tissues and cell lines
To verify the results of the bioinformatics analysis, we selected two hub lncRNAs (AC098973.2 and RP11-284F21.7) for further study. The expression of the two lncRNAs was determined in tumor tissues and cell lines with qRT-PCR. Compared with adjacent tissues, the expression of the two hub lncRNAs in GC tissues was markedly increased (P < 0.05, Fig. 10A-B). Moreover, the expression of the two hub lncRNAs was significantly higher in three GC cell lines than in GES-1 (P < 0.05, Fig. 10C-D).

Discussion
GC is a relatively common digestive malignancy, especially in Asian countries. Many patients lose the opportunity for surgery because of delayed diagnosis. The treatments for advanced GC mainly include chemotherapy and targeted therapy, but the efficacy remains poor. The molecular genetic landscape of GC includes gene mutations and transcriptional changes involving mRNAs and lncRNAs [12]. Many lncRNAs have been discovered in GC tissues, cells, gastric juice and plasma, and lncRNAs may act as potential biomarkers for early diagnosis and the evaluation of prognosis, therapeutic response, and chemotherapy resistance [13]. Furthermore, public bioinformatics databases can be used to explore the application of prognostic lncRNAs as biomarkers in GC.
It is well known that GC is a multistep oncogenic process characterized by pathological and molecular heterogeneity. Prognosis-related lncRNA signatures are more reliable than single lncRNAs in prognosis prediction. As shown in Table 3, many lncRNA signatures have been reported in recent years. Cai et al. and Nie et al. identified many differentially expressed lncRNAs, but their models were difficult to implement in clinical practice [14,15]. Although some studies reported signatures composed of fewer lncRNAs, the accuracy of the prognostic evaluation was not high [16,17]. Moreover, some scholars have established models that can accurately predict prognosis, but they have not further explored the corresponding target genes [18,19]. In this study, we established a prognostic signature that had good predictive ability and was an independent risk factor in GC. Compared to the traditional TNM staging system, the nomogram can accurately predict OS and be precisely applied in clinical practice. Among the six lncRNAs, AC093850.2 and AC098973.2 are known as LINC01614 and LINC01980, respectively. AC093850.2 has been identified as a potential predictor of prognosis in many cancers, such as breast cancer, esophageal squamous cell carcinoma (ESCC), glioma, and osteosarcoma [20][21][22][23]. In the latest study, AC093850.2 was found to promote GC cell growth and migration [24], but its mechanism in GC remains to be explored. AC098973.2 may act as an oncogene in ESCC and hepatocellular carcinoma (HCC) [25,26]. In addition, the LINC01980/miR-190a-5p/MYO5A pathway promotes the development of ESCC [25]. Previous research established a three-lncRNA prognostic signature including AP000695.6, but the value of AP000695.6 has not been further verified in clinical samples [27]. RP11-284F21.7 was evaluated in lung cancer cells [28]. RP4-584D14.7 was found in renal cell carcinoma cells [29]. In this study, these lncRNAs were identified from GC patients in the TCGA database.
The expression levels of the two lncRNAs in tissue samples and cell lines were detected by qRT-PCR. Due to the limited number of tissue samples, we only verified the expression of two hub lncRNAs in GC tissues. The high expression of the two lncRNAs (AC098973.2 and RP11-284F21.7) was demonstrated in GC tissues and cell lines for the first time. However, a large number of clinical samples from multiple centers need to be assessed, and the specific regulatory mechanisms need to be explored in depth in GC.
Our study indicated that the biological function of the signature components was related to the ECM. Current research has shown that the ECM and ECM-related components play an essential role in the occurrence and development of GC [30]. We found that the 24 target genes were mainly composed of ECM-related genes, such as SPARC, BGN, FBN1, SPP1, and FN1. Similarly, the pathways related to the signature included ECM-receptor interactions. Recent research suggests that ECM targeting holds great clinical potential as an innovative and effective treatment for GC [31]. KEGG pathway analysis showed that these genes were enriched in the Wnt signaling pathway and focal adhesion. Yang et al. revealed that LINC01133 inhibited progression and metastasis via the Wnt/β-catenin pathway in GC [32]. As dysregulation of the Wnt pathway has been observed in approximately 50% of GC tumors [33], the Wnt pathway might offer a new therapeutic target. For example, the CBP/β-catenin antagonist PRI-724 has been developed as a Wnt pathway inhibitor [34]. In addition, focal adhesions consist of integrins and growth factor receptors linked through cytoplasmic signaling networks. Previous research has revealed that targeting focal adhesion proteins could be an effective strategy in treatment regimens that include chemotherapy, radiotherapy and novel molecular therapeutics [35]. Thus, the six-lncRNA signature has clinical potential as a pharmacological target.
[39]. Thus, further research on the pathogenesis and immune mechanism of IGFBP7 needs to be performed in GC. COL1A1 and VCAN are widely related to ECM-receptor interaction. COL1A1 is a member of the type I collagen family and is involved in breast cancer [40], HCC [41], ovarian cancer [42] and GC [43]. COL1A1 is closely correlated with cell invasion and metastasis through activation of the TGF-β signaling pathway [44]. VCAN is a chondroitin sulfate proteoglycan that can provide hydration and a loose matrix during disease progression. Cheng et al. revealed that VCAN was upregulated and had an impact on the progression of GC [45]. Mohamed Salem found that high expression levels of VCAN mediated the effect of miR-590-3p and promoted the development of ovarian cancer [46]. Consequently, these target genes have important clinical application value.
Nevertheless, our study of the six-lncRNA signature has several limitations. First, we only obtained data from the TCGA, and more databases are required for further validation. Second, large-scale cohorts are needed to verify the prognostic signature in a multicenter prospective clinical study. Third, the six lncRNAs need to be further explored to determine their functions and underlying mechanisms. Despite these limitations, the six-lncRNA signature might provide new insight for individualized and precise treatment in GC patients.

Conclusions
The results of the current study indicated that the six-lncRNA signature might be a potential biomarker for prognosis in GC. We verified a prognostic six-lncRNA signature using integrated bioinformatics approaches and experiments. Future investigations will focus on the functional mechanisms of these lncRNAs.

Figure 4 Evaluation of the six-lncRNA signature in the three datasets. A Distribution of the six-lncRNA risk score in the three datasets. B Distribution of survival status for patients in the three datasets. C Heat map of the expression profiles of the six lncRNAs in the three datasets. D K-M analysis of survival in the three datasets. E ROC curve of the six-lncRNA signature in the three datasets

Figure 5 Cox analysis of the six-lncRNA signature. A, B Univariate analysis and multivariate analysis in the training dataset. C, D Univariate analysis and multivariate analysis in the test dataset. E, F Univariate analysis and multivariate analysis in the combined dataset