Identification of LncRNA Prognostic Biomarkers Associated with Copy Number Variants in Gastric Cancer

Background: Gastric cancer is one of the most common gastrointestinal carcinomas worldwide and has a poor prognosis. Prognosis prediction is very important in the treatment of gastric cancer. This study aimed to explore the prognostic value of lncRNAs deregulated by copy number variants (CNVs) in gastric cancer. Methods: Multi-omics cluster analysis was performed to identify subtypes based on prognosis-associated coding genes, CNVs and methylation sites. We conducted survival analyses based on the median expression level of all identified lncRNAs. Finally, we constructed a prognostic model based on lncRNAs and validated it with public data and data from our center. Results: We identified six subtypes of gastric cancer with different prognoses (P = 0.00446) and a total of 83 disease-associated lncRNAs. We finally obtained five prognostic lncRNA biomarkers. Survival analyses showed that high expression of each identified lncRNA was associated with worse prognosis. The prognostic model based on the five identified lncRNAs predicted survival with a 5-year AUC of 0.69, and the difference in survival between the high-risk and low-risk groups was significant in both the public database and our cohort. In the multivariable analyses, the prognostic model was an independent prognostic factor (P = 0.0104). Conclusions: The prognostic model based on the five identified lncRNAs was closely related to the overall survival of patients with gastric cancer, and these lncRNAs may serve as promising prognostic biomarkers of gastric cancer.


Introduction
Gastric cancer is one of the most common gastrointestinal malignant tumors, with an estimated 951,600 cases and 713,100 deaths attributed to this cancer in 2012 [1]. Regional variation in stomach cancer is substantial. Eastern Asia had the highest incidence of gastric cancer, at 35.4 per 100,000 [1,2]. In China, gastric cancer ranks second in cancer incidence, following lung cancer, and third in cancer deaths [3,4]. Approximately 410,400 gastric cancer cases and 293,800 gastric cancer deaths were estimated to occur in China in 2014 [4,5]. Although the prognosis of gastric cancer has improved substantially, mortality is still high, which is partly attributed to the advanced stage at diagnosis and the high rate of recurrence, including lymph node metastasis, distant metastasis and peritoneal metastasis [6,7]. Therefore, early identification of patients with aggressive tumor biology, and of recurrence in patients who have undergone radical surgery, is important for improving the survival of patients with gastric cancer.
Long noncoding RNAs (lncRNAs) are noncoding RNAs longer than 200 nucleotides. Increasing evidence has pointed towards lncRNAs as regulators of several human diseases, including cancer [8,9]. Many cancer-related lncRNAs have been reported to play important roles in multiple steps of carcinogenesis, including cell proliferation, cellular signaling, angiogenesis and metastasis [10,11]. HOX transcript antisense RNA (HOTAIR) is reported to be overexpressed in gastric cancer and associated with lymph node metastasis and vessel invasion in gastric cancer by promoting epithelial to mesenchymal transition (EMT) [12,13]. As the first imprinted lncRNA to be reported, H19 is upregulated in gastric cancer tissues compared with normal tissues and associated with proliferation, migration and invasion of gastric cancer cells [14,15]. The upregulation of H19 in plasma makes it a potential diagnostic biomarker for gastric cancer [15]. Growth arrest-specific transcript 5 (GAS5) arrests tumor growth by regulating apoptosis and the cell cycle [16]. GAS5 has been reported to be correlated with tumor stage and lymph node metastasis [17]. Maternally expressed gene 3 (MEG3) is a tumor suppressor gene downregulated in gastric cancer. MEG3 has been demonstrated to be associated with deep tumor invasion, metastasis and poor prognosis, which makes it a potential prognostic biomarker for gastric cancer [18,19]. Many other lncRNAs, such as long intergenic noncoding RNA 00152 and urothelial carcinoma-associated 1, have also been reported to be related to biological processes and to the diagnosis and prognosis of gastric cancer.
Copy number variations (CNVs) refer to genomic structural variations involving gene amplification, gain, loss and deletion, which are regarded as playing a significant role in the carcinogenesis of gastric cancer [20,21]. Researchers have found numerous CNVs in many cancer-associated genes, such as CTNNB1, MYC and CDKN2A, located on different chromosomes, including 3p22, 4q25, 11p13, 1p36 and 9p21 [22,23].
The PIK3CA gene is frequently amplified in gastric cancer and is involved in multiple steps of tumorigenesis, including cancer cell proliferation and apoptosis [24]. The CNVs of this gene could predict the prognosis of gastric cancer regardless of tumor stage [25,26]. As a well-known suppressor gene, adenomatous polyposis coli (APC) exhibits frequent deletions in gastric cancer [27,28]. APC was regarded as a potential prognostic biomarker due to its close relationship with lymph node invasion and metastasis [29]. Epidermal growth factor receptor (EGFR) CNVs were also reported to be associated with an increased risk of invasion and metastasis, thereby resulting in worse prognosis [30]. The CNVs of other well-known genes, such as MET, HER-2 and TP53, were also correlated with the prognosis of gastric cancer [31,32]. As the examples above show, most studies have focused on the effects of CNVs on protein-coding genes; however, an increasing number of researchers have found that the expression of lncRNAs and miRNAs is also regulated by CNVs [33,34]. It was reported that an estimated one-third of aberrant lncRNA expression could be attributed to CNVs [33]. Considering the role of lncRNAs in carcinogenesis, the lncRNAs deregulated by CNVs warrant further study to determine their role in the prognosis of gastric cancer.
Therefore, we mined the CNVs and lncRNAs of gastric cancer from The Cancer Genome Atlas (TCGA) and identified prognostic lncRNA biomarkers regulated by CNVs.

Sequenced data collection
We downloaded methylation, RNA sequencing, CNV and mutation data for gastric cancer, together with the corresponding follow-up information, from the TCGA Genomic Data Commons (GDC). First, we obtained all fragments per kilobase of transcript per million fragments mapped (FPKM) data and count data from the TCGA GDC. Then, we converted the FPKM data to transcripts per million (TPM) data. Transcripts annotated as lncRNA, sense_intronic, sense_overlapping, antisense, processed_transcript or 3prime_overlapping_ncRNA were classified as lncRNAs. We then obtained the FPKM expression profiles of lncRNAs and protein-coding genes. We downloaded the data of all samples profiled for methylation with the HumanMethylation450 BeadChip from the TCGA GDC. CpG sites listed as cross-reactive in the report on cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray were excluded [35]. CpGs at single nucleotide polymorphism sites were also removed.
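The FPKM-to-TPM step can be illustrated with a minimal Python sketch. It assumes the commonly used rescaling, in which each sample's FPKM values are divided by their sum and multiplied by one million; the text does not state the exact formula used, and the function name `fpkm_to_tpm` and the example values are hypothetical.

```python
def fpkm_to_tpm(fpkm_values):
    """Rescale one sample's FPKM values so that they sum to one million.

    Assumption: TPM_i = FPKM_i / sum(FPKM) * 1e6, applied per sample.
    """
    total = sum(fpkm_values)
    if total == 0:
        # A sample with no detected expression stays at zero.
        return [0.0 for _ in fpkm_values]
    return [value / total * 1e6 for value in fpkm_values]


# Hypothetical FPKM values for three transcripts in a single sample.
sample_fpkm = [10.0, 30.0, 60.0]
sample_tpm = fpkm_to_tpm(sample_fpkm)
```

Because the rescaling is done per sample, TPM values of a transcript can be compared across samples as proportions of each sample's total.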

Subtype identification and differential analyses
First, we conducted Cox regression analyses to relate protein-coding gene expression, CNVs and methylation to the survival of patients with gastric cancer. Second, we analyzed a total of 337 samples with CNV, methylation and RNA sequencing data. Multi-omics cluster analysis was performed to identify subtypes using the iClusterPlus R package on the prognosis-associated coding genes, CNVs and methylation sites. iClusterPlus is based on unsupervised cluster analysis and can generate a tumor classification by capturing patterns from multiple types of genomic data. Before the iClusterPlus analysis, we first selected and optimized the necessary parameters. We repeatedly divided the samples into different training and verification sets to determine the optimal number of clusters k. To visualize the results, we plotted the percentage of explained variation against the number of clusters. The optimal k value is the point where the curve starts to flatten out. We then used the Bayesian information criterion to select the optimal combination of the sparse model and the penalty parameter, lambda (λ). Finally, we combined the optimal clustering number (k) and penalty parameter (λ) to run the iClusterPlus analysis. Furthermore, we analyzed the differential lncRNAs and protein-coding genes between tumor and normal samples in the different subtypes. A fold change greater than 2 and FDR < 0.05 were used as cutoff values. The identified lncRNAs were compared with 232 lncRNAs closely related to disease from the databases LncRNADisease and Lnc2Cancer [36]. To evaluate the differential expression of lncRNAs in each subtype of cancer, we conducted gene set enrichment analysis (GSEA) according to the absolute value of the lncRNA fold change.
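The differential-expression cutoffs can be sketched in Python as follows. Two points are assumptions rather than statements from the text: the FDR is computed here with the Benjamini-Hochberg procedure, and "fold change greater than 2" is read symmetrically (|log2 fold change| > 1) so that downregulated genes are kept as well. All names and example values are hypothetical.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down and enforce monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def select_differential(names, fold_changes, p_values,
                        fc_cutoff=2.0, fdr_cutoff=0.05):
    """Keep genes passing both the fold-change and the FDR cutoff."""
    fdr = benjamini_hochberg(p_values)
    log_cutoff = math.log2(fc_cutoff)
    return [
        name
        for name, fc, q in zip(names, fold_changes, fdr)
        if abs(math.log2(fc)) > log_cutoff and q < fdr_cutoff
    ]


# Hypothetical tumor-vs-normal fold changes and raw p-values.
names = ["lnc_a", "lnc_b", "lnc_c", "lnc_d"]
fold_changes = [4.0, 0.2, 1.5, 3.0]
p_values = [0.001, 0.004, 0.0005, 0.6]
selected = select_differential(names, fold_changes, p_values)
```

In this toy input, `lnc_c` fails the fold-change cutoff and `lnc_d` fails the FDR cutoff, so only the first two entries are retained.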

Weighted gene coexpression network analysis
We constructed coexpression modules of differential protein-coding genes and lncRNAs using WGCNA.
We first converted the FPKM data to TPM data and extracted the expression profiles of lncRNAs and protein-coding genes. Hierarchical clustering analysis was performed to identify outlier samples. After excluding outlier samples with distances greater than 8000, 405 samples remained. The Pearson correlation coefficient was used to calculate the distance between each gene and lncRNA. We screened coexpression modules by constructing a weighted coexpression network with a soft threshold of 3 using the WGCNA R package. We converted the expression matrix to an adjacency matrix and then to a topological overlap matrix (TOM), on which we conducted average-linkage hierarchical clustering analysis according to the hybrid dynamic tree-cut criterion. Each coexpression module was required to include at least 30 genes. Once the modules were determined, we calculated the eigengene of each module. We then conducted clustering analysis on the modules, and closely related modules were merged into a new one. The parameters were height = 0.25, deepSplit = 2 and minModuleSize = 30. We then counted the protein-coding genes and lncRNAs in each module. We further performed gene ontology (GO) enrichment analysis on significantly enriched modules and analyzed the crosstalk of GO terms.
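The first step of the network construction, turning Pearson correlations into a weighted adjacency with a soft threshold of 3, can be sketched in plain Python. The sketch assumes an unsigned network, that is a_ij = |cor(i, j)|^3; the text does not say whether a signed or unsigned network was used. The TOM and tree-cut steps are not reproduced, and all names and expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def soft_threshold_adjacency(profiles, power=3):
    """Unsigned weighted adjacency: |correlation| raised to `power`."""
    names = list(profiles)
    return {
        (g, h): 1.0 if g == h else abs(pearson(profiles[g], profiles[h])) ** power
        for g in names
        for h in names
    }


# Hypothetical expression profiles across four samples.
profiles = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "lnc_b": [2.0, 4.0, 6.0, 8.0],
    "lnc_c": [1.0, -1.0, 1.0, -1.0],
}
adjacency = soft_threshold_adjacency(profiles, power=3)
```

Raising the correlation to a power shrinks weak correlations much more than strong ones, which is what pushes the network towards a scale-free topology.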

CNV-related lncRNA biomarker selection
We analyzed the CNVs of 442 gastric cancer samples from TCGA using GISTIC 2.0 software. The CNV profile of lncRNAs was extracted for further analysis. We defined a copy number more than one as copy number amplification and a copy number less than one as copy number deletion [37,38]. Then, we calculated the proportions of copy number amplifications and deletions for each lncRNA. To explore the relationship between lncRNA expression and CNV, we identified lncRNAs with more than 10 percent CNVs from each sample for further analysis. To systematically identify lncRNA prognostic markers, we analyzed the CNVs of differential lncRNAs in each subtype of gastric cancer. We selected lncRNAs with CNVs greater than 0.1 percent and differential expression within the identified subtypes. For each lncRNA, we kept the samples with expression levels greater than 0 and divided them according to the median expression level. Survival analyses were then performed, with a P value less than 0.05 considered significant. We assessed prognostic discrimination by constructing a receiver operating characteristic (ROC) curve for each identified lncRNA prognostic biomarker, and lncRNAs with an area under the curve (AUC) greater than 0.6 were chosen for further analysis. Then, Pearson correlation coefficient analyses between the expression level and the CNV of the identified lncRNAs were conducted. We retained lncRNAs whose expression was positively associated with CNV, with a Pearson correlation coefficient greater than 0.1. We classified samples into high and low expression groups based on the median expression level of each included lncRNA, and survival analyses were conducted between the two groups. We conducted multivariable Cox analyses and constructed a prognostic model based on the identified lncRNAs. The independence of the prognostic model was also analyzed using multivariable analyses. Validation was performed using GSE62254 data from the GPL570 platform and data from our center.
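The grouping step that precedes each survival analysis can be sketched as follows: samples with zero expression are dropped, and the rest are split at the median. The sketch assigns samples exactly at the median to the low group, which is an assumption because the text does not say how ties were handled. The function name and sample values are hypothetical, and the survival test itself is not shown.

```python
from statistics import median


def median_split(expression_by_sample):
    """Split expressed samples into high and low groups at the median."""
    # Keep only samples in which the lncRNA is detected (expression > 0).
    expressed = {s: v for s, v in expression_by_sample.items() if v > 0}
    cutoff = median(expressed.values())
    high = sorted(s for s, v in expressed.items() if v > cutoff)
    # Assumption: ties at the median go to the low group.
    low = sorted(s for s, v in expressed.items() if v <= cutoff)
    return cutoff, high, low


# Hypothetical expression of one lncRNA in five samples.
expression = {"s1": 0.0, "s2": 1.2, "s3": 3.4, "s4": 0.7, "s5": 5.1}
cutoff, high_group, low_group = median_split(expression)
```

The two groups returned here would then be compared with a survival test, using P < 0.05 as the significance threshold described above.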

Results
Six genomic subtypes of gastric cancer were identified
Univariate Cox regression analyses relating protein-coding genes, CNVs and methylation to the survival of patients with gastric cancer yielded 1886 protein-coding genes, 3176 CNVs and 9256 CpG sites (Fig. 1, Figure S1A). Before the iClusterPlus analysis, we determined that the optimal value of k was 5, and the number of clusters was k + 1, that is, 6 clusters (Figure S1B). To build the final model, we used the 95th percentile as the threshold to select the most discriminative features, and only features above this threshold were finally represented in the 6 clusters. Based on the multi-omics cluster analysis of the prognosis-associated coding genes, CNVs and methylation sites, we acquired 6 subtypes (Supplementary Table 1). We found a statistically significant difference in prognosis among the 6 subtypes (P = 0.00446) (Fig. 2A). Meanwhile, we extracted the top 10 mutated genes from each subtype, and a total of 28 genes were obtained, which indicated substantial overlap of frequently mutated genes among the subtypes (Fig. 2B).

Differential analyses of lncRNAs and protein-coding genes in different subtypes
A total of 2507 differential lncRNAs and 3453 differential protein-coding genes were obtained from all subtypes (Supplementary Table 2). The smallest and the largest numbers of differential lncRNAs and protein-coding genes were found in the C3 and C2 subtypes, respectively (Supplementary Table 2). The differential lncRNAs are presented for all subtypes (Fig. 3A-G). More lncRNAs were downregulated than upregulated in the C2, C3 and C4 subtypes (Fig. 3B-D). In contrast, the opposite was observed in the C6 subtype (Fig. 3F). In the C1 and C5 subtypes, similar numbers of upregulated and downregulated lncRNAs were found (Fig. 3A, 3E). Differential protein-coding genes outnumbered differential lncRNAs (Fig. 3H). A total of 83 disease-associated lncRNAs were obtained by comparing the differential lncRNAs identified from all subtypes with disease-associated lncRNAs from the databases (P < 0.0001) (Fig. 3I).
To evaluate the differential expression of lncRNAs in each subtype of cancer, we conducted gene set enrichment analysis (GSEA). We found that the differential lncRNAs were enriched in the gene sets with substantial fold changes (Figure S2A-G). We analyzed the overlap of differential lncRNAs among the six subtypes of gastric cancer, and a substantial number of common differential lncRNAs were identified (Figure S2H).

Weighted gene coexpression network analysis (WGCNA) of subtype-associated differential protein-coding genes and lncRNAs
Hierarchical clustering analysis was performed to identify outlier samples. We excluded outlier samples with a distance greater than 8000 (Figure S3A). With a soft threshold of 3, the coexpression network conformed to a scale-free network (Fig. 3B-C). Finally, we acquired 24 gene modules (Fig. 3D), in which the gray module represented genes not clustered in other modules. We then counted the protein-coding genes and lncRNAs in each module (Supplementary Table 3). The black, magenta and purple modules were enriched for lncRNAs (Figure S3E). We further performed gene ontology enrichment analysis on these three modules and analyzed the crosstalk of GO terms. A total of 843 GO terms were enriched, and few instances of crosstalk were found (Fig. 4A). The results indicated that the three modules may have different functions. The top 20 GO terms from the black module were associated with many metabolic processes, including fatty acid metabolism and xenobiotic metabolic process (Fig. 4B). The magenta module was associated with extracellular structure organization and extracellular matrix organization (Fig. 4C). The purple module was mainly related to pattern specification process and embryonic limb morphogenesis (Fig. 4D). Together, these results suggest an important role of lncRNAs in the carcinogenesis of gastric cancer.

Conjoint analyses of CNVs and lncRNAs in the TCGA database
To explore the function of lncRNAs associated with CNVs in carcinogenesis, we conjointly analyzed the CNVs and lncRNAs of 442 gastric cancer samples from TCGA using GISTIC 2.0 software. We found that the proportion of copy number amplifications was greater than that of deletions (Figure S4A). The most frequent deletions were identified on chromosome 8, and the largest copy number amplifications were distributed on chromosomes 5, 12 and 19 (Figure S4A). We further analyzed the distribution of correlations between lncRNA expression profiles and CNVs. We found a positive correlation between them rather than a random distribution (Figure S4B). In focal CNV peaks of the genome, we identified more copy number deletions than amplifications of lncRNA genes, which suggested a close relationship between lncRNA gene copy number deletions and gastric cancer (Figure S4C-D).
To study the relationship between lncRNA expression and CNV, we identified lncRNAs with more than 10 percent CNVs from each sample for further analysis. A total of 13 lncRNAs were selected. We further analyzed the differential expression between samples with lncRNA CNVs and normal samples. We found that the expression levels of 10 lncRNAs were higher in samples with copy number amplifications than in normal samples (Figure S5). However, LINC00861 showed higher expression in normal samples than in samples with copy number amplifications (Figure S5). These results showed that lncRNA expression could be regulated by CNVs.

Identification and validation of CNV-related lncRNA biomarkers
A total of 187 subtype-specific differential CNV-related lncRNAs were identified. Survival analyses were conducted, and we selected 19 prognostic lncRNA biomarkers with statistical significance (Supplementary Table 4). To evaluate their prognostic discrimination, we constructed a ROC curve for each of the 19 lncRNA prognostic biomarkers. Twelve lncRNAs with an area under the curve (AUC) greater than 0.6 were included for further analyses (Fig. 5). We finally obtained five prognostic lncRNA biomarkers after Pearson correlation coefficient analyses between the expression level and the CNV of the 12 lncRNAs (Table 1). Survival analyses were performed on the high and low expression groups of the five lncRNAs.
The results supported that all five lncRNAs could effectively predict the prognosis of patients with gastric cancer (Figure S6A-E). To validate the effects of the five CNV-related lncRNAs on prognosis, we conducted analyses using GSE62254 data from the GPL570 platform. However, only three lncRNAs could be annotated. Survival analyses supported the prognostic value of ENSG00000246859, ENSG00000237187 and ENSG00000245105 (Figure S7A-C). We conducted multivariable Cox analyses and constructed a prognostic model based on the five identified lncRNAs:
$$\mathrm{RiskScore}_5 = 0.11041842\,\mathrm{exp}_{\text{NR2F1-AS1}} + 0.17242272\,\mathrm{exp}_{\text{STARD4-AS1}} - 0.03026703\,\mathrm{exp}_{\text{EVX1-AS}} + 0.02301935\,\mathrm{exp}_{\text{LOC102724623}} - 0.05347689\,\mathrm{exp}_{\text{A2M-AS1}}$$
where $\mathrm{exp}_{X}$ denotes the expression level of lncRNA $X$. We found that the RiskScore model predicted survival with a 5-year AUC of 0.69 (Fig. 6A). Meanwhile, the difference in survival between the high-risk and low-risk groups was significant (P = 0.01012) (Fig. 6B). We also validated the prognostic model in our center. The predictive value for survival was similar (Fig. 6C), and the difference in survival was also significant (P = 0.033) (Fig. 6D).
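As an illustration of how the RiskScore formula is applied, the Python sketch below evaluates it for a single patient. The coefficients are those reported for the model. The expression values and the names `RISK_COEFFICIENTS`, `risk_score` and `patient` are hypothetical, and the expression unit is assumed to match the one used when the model was fitted.

```python
# Coefficients of the five-lncRNA model as reported in the text.
RISK_COEFFICIENTS = {
    "NR2F1-AS1": 0.11041842,
    "STARD4-AS1": 0.17242272,
    "EVX1-AS": -0.03026703,
    "LOC102724623": 0.02301935,
    "A2M-AS1": -0.05347689,
}


def risk_score(expression):
    """Weighted sum of the five lncRNA expression levels."""
    return sum(
        coefficient * expression[name]
        for name, coefficient in RISK_COEFFICIENTS.items()
    )


# Hypothetical expression levels for one patient.
patient = {
    "NR2F1-AS1": 2.0,
    "STARD4-AS1": 1.0,
    "EVX1-AS": 0.5,
    "LOC102724623": 3.0,
    "A2M-AS1": 4.0,
}
score = risk_score(patient)
```

Patients would then be assigned to high-risk and low-risk groups by comparing their scores with a cutoff; the text does not state which cutoff was used, so none is applied here.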
To further assess the independence of the RiskScore model in a clinical setting, we systematically analyzed the clinical characteristics together with the RiskScore model. Multivariable analyses showed that the risk score was an independent prognostic factor (HR = 1.583, 95% CI = 1.114-2.249, P = 0.0104) (Table 2).
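The reported hazard ratio, confidence interval and P value can be checked against each other. The sketch below assumes a Wald test on the log hazard ratio with a normal approximation, which the text does not state explicitly; under that assumption the standard error is recovered from the width of the confidence interval on the log scale. The function name is hypothetical.

```python
import math


def wald_from_hazard_ratio(hr, ci_low, ci_high, z_crit=1.959964):
    """Recover beta, SE, z and a two-sided p-value from HR and its 95% CI.

    Assumption: the CI is exp(beta +/- z_crit * SE) and the test is a
    two-sided Wald test using the standard normal distribution.
    """
    beta = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return beta, se, z, p_value


# Values reported for the risk score in the multivariable analysis.
beta, se, z, p_value = wald_from_hazard_ratio(1.583, 1.114, 2.249)
```

With these inputs the recovered P value is close to the reported 0.0104, so the three reported quantities are mutually consistent under the stated assumption.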

Discussion
With the development of next-generation sequencing and targeted drugs for cancer, molecular classification has become considerably more important for the detection and treatment of cancer. In 2014, TCGA divided gastric cancer into four molecular types: tumors positive for Epstein-Barr virus (EBV), microsatellite unstable tumors, genomically stable tumors and tumors with chromosomal instability [39]. The value of this molecular classification for gastric cancer therapy was partly supported by the high response rate to programmed cell death 1 (PD1)-targeted therapy in EBV-positive tumors and microsatellite instability (MSI)-high tumors [40]. However, the classification is not sufficiently comprehensive because it lacks transcription data, and it could not effectively predict the survival of patients with gastric cancer. In our study, we proposed a new molecular classification method based on DNA methylation, CNVs and coding genes. We also described the 6 clusters in combination with the clinical characteristics of patients and found that C2 and C5 showed good prognosis, with a higher proportion of patients in early stages, while C1 and C6 showed poor prognosis, with a higher proportion of patients in advanced stages. In comparison with the TCGA molecular typing, we found that the EBV-positive and MSI rates were higher in C2 and C4. Interestingly, the gene mutation rate of patients was also higher in these two clusters. It is well known that EBV-positive and MSI-high patients respond well to current anti-PD-L1 treatment of gastric cancer, and high tumor mutation burden is also considered a potential biomarker. Therefore, patients in these two clusters may be potential beneficiaries of immunotherapy. Considering the complexity of our molecular classification, we identified a few prognostic lncRNAs according to the different subtypes, which could be used to distinguish gastric cancers with different risks.
LncRNAs have been used to predict survival in many types of cancer, including gastric cancer, HCC and prostate cancer [41-44]. Genomic CNVs play important roles in the carcinogenesis and development of cancer. Approximately one-third of deregulated lncRNAs are associated with their CNVs. However, the function and effects of CNV-related lncRNAs in cancer have not been thoroughly elucidated to date. Therefore, we conducted conjoint analyses of lncRNAs and their CNVs.
In our research, we identified five CNV-related lncRNA prognostic biomarkers. We found that gastric cancer patients with high expression of lncRNA NR2F1 antisense RNA 1 (NR2F1-AS1, ENSG00000237187) tend to have a worse prognosis. Previous research showed that NR2F1-AS1 knockdown could reduce hepatocellular carcinoma (HCC) cell invasion, migration and drug resistance [45]. As with NR2F1-AS1, higher expression of the other four lncRNAs was also associated with poorer prognosis in patients with gastric cancer. STARD4 antisense RNA 1 (STARD4-AS1, ENSG00000246859) is rarely studied, and its role in carcinogenesis is unknown [46]. EVX1 antisense RNA (EVX1-AS, ENSG00000253405) is reported to be expressed during embryoid body differentiation [47]; however, its association with cancer has not been determined. Additionally, the roles of long intergenic non-protein-coding RNA 1414 (LINC01414, ENSG00000253554) and A2M antisense RNA 1 (A2M-AS1, ENSG00000245105) in carcinogenesis have not been determined. Thus, apart from NR2F1-AS1, most of the identified CNV-related lncRNAs have not been studied for their roles in tumorigenesis and cancer development. Our survival analyses showed that high expression of these lncRNAs was associated with poor prognosis, supporting the importance of these lncRNAs in gastric cancer. Therefore, the mechanisms of lncRNA regulation in gastric cancer merit further study.
Our research first conducted analyses of lncRNAs and their CNVs and identified five novel lncRNA prognostic biomarkers. The in-depth bioinformatics analyses of multidimensional genomic data lend support to our results. However, the available large-scale multi-omics data are limited, which may reduce the reliability of our results. Furthermore, although the identified lncRNAs were validated in another database, we could not perform validation in patients at our center.

Conclusions
In summary, we identified five novel CNV-related lncRNAs: NR2F1-AS1, STARD4-AS1, EVX1-AS, LINC01414 and A2M-AS1. The prognostic model based on the five identified lncRNAs was closely related to the overall survival of patients with gastric cancer, and these lncRNAs may serve as promising prognostic biomarkers of gastric cancer.

List Of Abbreviations
GC: gastric cancer, CNV: copy number variant, lncRNA: long non-coding RNA, TCGA: The Cancer Genome Atlas.

Declarations
Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Differential lncRNAs and coding genes in different subtypes of gastric cancer. A-F: volcano plots of the six subtypes of cancer. Red and blue dots indicate upregulated and downregulated differential lncRNAs, respectively. G: differential lncRNAs in all samples. H: the differential lncRNAs and protein-coding genes (PCGs) in different subtypes of cancer. Blue and red columns represent differential lncRNAs and PCGs, respectively. I: Venn diagram showing the relationship between differential lncRNAs and disease-associated lncRNAs.

Survival prediction and survival analyses of RiskScore models. A: ROC curves of relapse prediction at 1 year and 5 years for the lncRNA-based RiskScore model. B: Kaplan-Meier curves of the high and low RiskScore groups. C: ROC curves of relapse prediction at 1 year and 5 years for the lncRNA-based RiskScore model in our cohort. D: Kaplan-Meier curves of the high and low RiskScore groups in our cohort.

Supplementary Files
This is a list of supplementary files associated with this preprint: supplementarymaterials.docx