The TCGA and GEO databases identify COL5A2 as a poor-prognosis gene in patients with advanced gastric cancer

Gastric cancer (GC) metastasis determines patient prognosis, and exploring its molecular mechanism is expected to provide a theoretical basis for clinical treatment. Recent studies have shown that extracellular matrix proteins are closely related to GC metastasis. This study aimed to explore the expression profile and role of COL5A2 (collagen type V α2 chain), an extracellular matrix protein, in GC. The expression, overall survival, and progression-free survival data of COL5 family members were extracted from The Cancer Genome Atlas (TCGA) database. Immunohistochemistry of paraffin-embedded tissues and RT-qPCR in GC and matched normal tissues were used to analyze the expression of the target gene. Weighted gene co-expression network analysis of the GSE62229 dataset was performed to identify modules and associated genes, and their functions were predicted by Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses.


Conclusion
Among the COL5 family members, COL5A2 was most closely related to advanced GC. High COL5A2 expression is associated with a poor prognosis in GC, and COL5A2 may be a novel therapeutic target for GC.

Background
Gastric cancer (GC) is a common malignant tumor of the digestive tract, and the global incidence and mortality of GC are ranked fifth and second, respectively [1,2]. Presently, the survival of GC patients has been significantly prolonged by the combination of radical surgery with radiotherapy and chemotherapy; however, the prognosis of patients with advanced or metastatic disease remains unsatisfactory [3,4]. Because the symptoms of early GC lack specificity, most patients are diagnosed in the middle and late stages. Thus, identifying abnormally expressed genes in GC and targeting them are important strategies to prolong the survival time of GC patients.
Collagen, the main component of the extracellular matrix (ECM), can be divided into types I-V [5].
Type V collagen (COL5), an important component of the ECM, can regulate the diameter of fibers by interacting with type I collagen during fiber development [6]. The COL5 family comprises three main isomers, with three different polypeptide α chains: A1, A2, and A3. The abnormal expression of the COL5 family in tumors affects malignancy and progression, but the clinical role and molecular mechanism of the COL5 family in GC remain unclear [7][8][9].
High-throughput approaches, such as gene chips and gene sequencing, have been widely used to identify cancer biomarkers [10]. Some high-throughput data repositories are publicly available [11,12], and investigators can reuse these databases for data mining according to their study design. Weighted gene co-expression network analysis (WGCNA) is a powerful method to analyze the correlation patterns among genes in RNA-seq or microarray samples [13,14]. The method clusters highly correlated genes into the same module and relates the modules to clinical traits, which may be more conducive to the identification of clinical biomarkers for diagnosis and treatment. This method has been generally recognized in cancer research and has successfully identified target modules and hub genes [15,16].
In the current study, we analyzed expression, overall survival (OS), and progression-free survival (PFS) data from The Cancer Genome Atlas (TCGA) to identify the COL5 family gene that is significantly associated with GC metastasis. Moreover, we explored the related genes and predicted the pathways through WGCNA of the GSE62229 dataset.

Data Sources and Data Preprocessing
The TCGA Stomach Adenocarcinoma (STAD) data set contains 408 cancer cases and 211 matched paracancerous tissues. We used GEPIA (http://gepia.cancer-pku.cn/) to compare the gene expression differences, OS, and PFS of COL5A1, COL5A2, and COL5A3 in TCGA, so as to select the most significantly differentially expressed genes (DEGs) of the COL5 family.
The preprocessed expression profiles of the GSE62229 and GSE15459 datasets, which contain a large number of high-quality GC cases, were downloaded from the GEO database. OS and PFS in the two patient cohorts were analyzed using the Kaplan-Meier (K-M) plotter [17]. GSE62229 is a microarray dataset containing 300 cancer tissue samples and 100 paracancerous tissue samples, and its clinical characteristics are very complete. This dataset was selected as the training data for further study.

Samples and patients
We used 48 pairs of fresh GC specimens and adjacent non-cancerous tissues obtained at the First Affiliated Hospital of China Medical University in 2018. We also used 126 paraffin-embedded GC tissues and 60 adjacent normal tissues from patients treated between 2011 and 2012. All patients were pathologically confirmed to have gastric adenocarcinoma, no tumor was found in other regions, and no radiotherapy or chemotherapy was performed before the operation. The patients or their families signed informed consent. This study was approved by the research ethics committee of our institute.

Screening of DEGs
The "limma" R package was used to screen the DEGs between GC tissue and adjacent normal tissue in GSE62229. A false discovery rate (FDR) < 0.05 and |log2(FC)| ≥ 0.263 were regarded as the cut-off thresholds.
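The two cut-offs can be illustrated with a short sketch. It is not the limma pipeline itself: it assumes that per-gene raw P values and log2 fold changes are already available, applies the Benjamini-Hochberg procedure to obtain FDR-adjusted values, and keeps the genes that meet both thresholds. The function names, gene labels other than COL5A2, and all numeric values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def screen_degs(genes, log2fc, pvalues, fdr_cut=0.05, lfc_cut=0.263):
    """Keep genes with FDR < fdr_cut and |log2(FC)| >= lfc_cut."""
    fdr = bh_adjust(pvalues)
    return [g for g, fc, q in zip(genes, log2fc, fdr)
            if q < fdr_cut and abs(fc) >= lfc_cut]


# Hypothetical toy input, for illustration only.
genes = ["COL5A2", "GENE_B", "GENE_C", "GENE_D"]
log2fc = [1.2, 0.10, -0.50, 0.30]
pvalues = [0.0001, 0.001, 0.02, 0.60]
selected = screen_degs(genes, log2fc, pvalues)
```

A threshold of |log2(FC)| ≥ 0.263 corresponds to a fold change of roughly 1.2 in either direction, so the filter is permissive on effect size and relies mainly on the FDR cut-off.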

Construction of the Co-expression network
After obtaining the expression data of the DEGs from the GSE62229 dataset, a co-expression network was constructed for downstream analysis using the "WGCNA" R package. WGCNA can effectively combine gene expression information with clinicopathological features to identify potential modules. Next, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed in R to assess the functional role of the module genes [18,19].

Gene Set Enrichment Analysis (GSEA)
To determine the possible pathway through which COL5A2 functions in the development of GC, the expression data from GSE62229 and TCGA were also used to perform Gene Set Enrichment Analysis (GSEA) [20,21]. According to the differences in expression, the database cases were uniformly divided into low-expression and high-expression groups.
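The text does not state the cut-off used to form the two groups. The following is a minimal sketch, assuming a median split on COL5A2 expression (a common convention, not confirmed by the source). The sample identifiers and expression values are hypothetical.

```python
from statistics import median


def split_by_median(expression):
    """Split samples into (low, high) groups around the median expression.

    expression: dict mapping sample ID to an expression value.
    Samples equal to the median are placed in the low group (an assumption).
    """
    cut = median(expression.values())
    low = [s for s, v in expression.items() if v <= cut]
    high = [s for s, v in expression.items() if v > cut]
    return low, high


# Hypothetical COL5A2 expression values for four samples.
col5a2 = {"S1": 2.0, "S2": 5.5, "S3": 3.1, "S4": 7.2}
low, high = split_by_median(col5a2)
```

The resulting group labels would then serve as the phenotype input for the enrichment analysis.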

Real time quantitative PCR (RT-qPCR) analysis
The tissues were cut and homogenized. After total RNA was extracted according to the manufacturer's instructions, cDNA templates were generated by reverse transcription using the PrimeScript™ RT kit (TaKaRa, Japan). Real-time polymerase chain reaction was performed to calculate the relative expression of mRNA according to the reaction system. The number of cycles was set to 40. GAPDH was chosen as the reference gene. The primer sequences of COL5A2 were 5′-CAGGCTCCATAGGAATCAGAGG-3′ (sense) and 5′-CCAGCATTTCCTGCTTCTCCAG-3′ (antisense).

Immunohistochemistry
Immunohistochemistry (IHC) staining was performed according to standard protocols. IHC staining was scored as the percentage-of-positive-cells score (0: < 5%; 1: 5-25%; 2: 25-50%; 3: 50-75%; 4: > 75%) multiplied by the staining-intensity score (0: colorless; 1: light yellow; 2: brown; 3: dark brown), with 6-12 considered high expression and 0-4 considered low expression. The primary antibody against COL5A2 used in IHC testing was purchased from LifeSpan BioSciences, Inc. (Seattle, WA, USA).
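The IHC scoring rule can be written out as a short sketch. The percentage ranges in the text share their boundary values (for example, 25% appears in both 5-25% and 25-50%); assigning a boundary value to the lower class is an assumption made here, and the function names are hypothetical.

```python
# Staining-intensity scores as listed in the text.
INTENSITY_SCORE = {"colorless": 0, "light yellow": 1, "brown": 2, "dark brown": 3}


def percent_score(pct_positive):
    """Map the percentage of positive cells (0-100) to a 0-4 score.

    Boundary values (25, 50, 75) go to the lower class; this is an assumption.
    """
    if pct_positive < 5:
        return 0
    if pct_positive <= 25:
        return 1
    if pct_positive <= 50:
        return 2
    if pct_positive <= 75:
        return 3
    return 4


def ihc_score(pct_positive, intensity):
    """Product of the percentage score and the intensity score (0-12)."""
    return percent_score(pct_positive) * INTENSITY_SCORE[intensity]


def expression_group(score):
    """6-12 is high expression, 0-4 is low expression."""
    return "high" if score >= 6 else "low"


# Example: 60% positive cells with brown staining gives 3 * 2 = 6.
example_score = ihc_score(60, "brown")
example_group = expression_group(example_score)
```

Because the score is a product of an integer in 0-4 and an integer in 0-3, a value of 5 can never occur, so the two categories (0-4 and 6-12) cover every possible score.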

Statistical analysis
Statistical analysis was performed using SPSS 22.0, and figures were prepared with GraphPad Prism 7.0. Student's t-test was used to compare two groups. The Kaplan-Meier method was used to estimate OS. P < 0.05 was considered statistically significant.
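For readers unfamiliar with the Kaplan-Meier method, the product-limit estimate can be sketched in a few lines. This is an illustration only, not the SPSS or GraphPad procedure used in the study, and it does not include the log-rank test. The function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times:  follow-up durations (e.g. months)
    events: 1 if death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Hypothetical follow-up data for five patients.
curve = kaplan_meier([5, 8, 12, 12, 20], [1, 0, 1, 1, 0])
```

In this toy example the estimate drops to 0.8 after the death at month 5 and to about 0.27 after the two deaths at month 12; the censored patients at months 8 and 20 reduce the number at risk without lowering the curve.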

Results
COL5A2 is upregulated in GC tissues and correlates with poor survival in the TCGA and GEO databases.
First, TCGA-STAD was used to examine the mRNA expression levels of the three major isomers of the COL5 family in GC and adjacent normal tissues. COL5A1 and COL5A2 were upregulated in GC, in contrast to COL5A3 (P<0.05) (Figure 1A). To evaluate the prognostic value of COL5 family mRNA expression in GC, Kaplan-Meier analysis and the log-rank test were used to assess the relationship between mRNA expression and OS or PFS in GC patients. In patients with high COL5A2 expression, OS and PFS were significantly reduced (P<0.05), whereas COL5A1 showed only a non-significant trend for OS (P=0.12) and PFS (P=0.14) (Figure 1B and C). Analysis of T stage showed that COL5A2 expression in advanced GC was significantly higher than that in early GC (Figure 1E). The above analysis showed that high COL5A2 expression indicated a poor prognosis of GC. Therefore, we chose COL5A2 for further exploration (Figure 1D).
To verify the findings in the TCGA database, the GSE62229 and GSE15459 datasets were selected to evaluate the expression and prognostic value of COL5A2. COL5A2 expression in cancer tissue was significantly higher than that in adjacent normal tissues (P<0.001) (Figure 2B). Additionally, in the two GEO datasets, patients with low COL5A2 expression showed longer OS and PFS (Figure 2A and C).

High COL5A2 expression indicates a poor prognosis in GC tissues
To validate the possible role of COL5A2 in GC progression, the expression pattern of COL5A2 was explored in our own clinical tissue samples. Thus, 126 paraffin-embedded GC tissues and 60 adjacent normal tissues with complete clinicopathological variables and follow-up information were collected. The COL5A2 protein level was significantly higher in GC tissues than in normal tissues (P<0.001; Figure 3A, 3B). Next, we used RT-qPCR to assess the expression pattern of COL5A2 in 48 pairs of fresh specimens and adjacent non-cancerous tissues (Figure 3C); the findings were consistent with the IHC results. Taken together, these results confirmed that COL5A2 is highly expressed in GC tissues.
Next, the prognostic role of COL5A2 was confirmed in our samples. Based on the COL5A2 expression levels, patients with complete follow-up information were divided into the COL5A2 low-expression group (negative or weakly positive expression, n=64) and the COL5A2 high-expression group (moderately or strongly positive expression, n=64). Kaplan-Meier curves confirmed that patients with high COL5A2 expression had a significantly shorter OS than those with low COL5A2 expression (P=0.0085, Figure 3D).
Additionally, we verified the significance of COL5A2 in the survival of patients with advanced GC (P=0.018; Figure 3E).
The association between COL5A2 expression and clinicopathological parameters in patients with GC was further evaluated. As shown in Table 1, COL5A2 expression in GC was correlated with Borrmann type (P=0.036), histological type (P=0.013), and T stage (P<0.011). No significant correlation was found between COL5A2 and age, sex, tumor location, tumor size, or N stage. These results confirmed that COL5A2 expression is associated with the malignant phenotype of GC.

Weighted co-expression network construction and module identification
After quality evaluation and data preprocessing, an expression matrix was formed from the 298 GC samples of the GSE62229 dataset. The clinical traits are shown in the heatmap of the clustering dendrogram (Figure 4A). The 5407 genes whose variance ranked in the top 25% were screened out and used for subsequent co-expression analysis. When choosing the soft threshold, we calculated the network topology for power values from 1 to 20. As shown in Figure 4B, a power value of 3 was selected because it was the lowest power at which the scale-free topology fit index reached 0.9. Additionally, the mean connectivity met the scale-free network distribution at the power value of 3. After merging similar clusters, thirteen different modules were identified that contained groups of genes with similar connection strengths (Figure 4C).
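The gene filter and the soft-threshold rule described above can be sketched as follows. This is a plain-Python illustration, not the WGCNA R package: the fit-index values are hypothetical, the function names are invented for the example, and the unsigned adjacency |cor|^power is assumed because the text does not state the network type.

```python
from statistics import variance


def top_variance_genes(expr, fraction=0.25):
    """Return the genes whose variance ranks in the top `fraction`.

    expr: dict mapping gene name to a list of expression values.
    """
    ranked = sorted(expr, key=lambda g: variance(expr[g]), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def pick_soft_threshold(fit_by_power, cutoff=0.9):
    """Lowest power whose scale-free topology fit index reaches the cutoff."""
    for power in sorted(fit_by_power):
        if fit_by_power[power] >= cutoff:
            return power
    return None


def adjacency(correlation, power):
    """Unsigned soft-threshold adjacency (assumed network type)."""
    return abs(correlation) ** power


# Hypothetical expression values and fit indices, for illustration only.
expr = {"A": [1, 1, 1], "B": [1, 5, 9], "C": [2, 3, 4], "D": [0, 0, 1]}
kept = top_variance_genes(expr)
fit = {1: 0.45, 2: 0.82, 3: 0.91, 4: 0.93, 5: 0.95}
power = pick_soft_threshold(fit)
```

Raising correlations to a power suppresses weak correlations much more than strong ones: with a power of 3, a correlation of 0.5 becomes an adjacency of 0.125, whereas a correlation of 0.9 remains about 0.73.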
Finally, we found that COL5A2 was assigned to the salmon module (Figure 5A), which was significantly correlated with T stage and Lauren classification (Figure 5B; r = 0.32, P=3e-8 and r = 0.31, P=4e-8). Interestingly, the salmon module was also found to be related to pStage (r = 0.23, P=8e-5) and survival status (r = 0.23, P=9e-5). Additionally, we selected the top 100 genes related to COL5A2 and constructed a visualized network using Cytoscape software (Figure 5C).

Functional Annotation and GSEA in the GSE62229 dataset and TCGA database
To understand the biological relevance of COL5A2, GO enrichment and KEGG pathway analyses were carried out. The top GO terms are shown in Figure 6A. The most enriched GO terms were as follows: in BP (biological process), extracellular matrix and structure organization, epithelial cell proliferation, and cell-substrate adhesion; in CC (cellular component), extracellular matrix, endoplasmic reticulum lumen, collagen trimer, and basement membrane; and in MF (molecular function), cell adhesion molecule binding, glycosaminoglycan binding, and growth factor binding. Additionally, these genes were mainly enriched in the PI3K-Akt signaling pathway and focal adhesion, suggesting that the tumor microenvironment plays an important role in metastasis development (Figure 6B).
We performed GSEA of the GSE62229 dataset and the TCGA database, which revealed that COL5A2 was enriched in focal adhesion, ECM-receptor interaction, and regulation of the actin cytoskeleton (Supplementary Figure S1). The GSEA results also showed that metastasis samples were significantly enriched in several well-known cancer-related pathways, such as the TGF-β, MAPK, and JAK2 signaling pathways (Figure 7A, B). These results provide clues to the mechanism of metastasis development.

Discussion
GC is a biologically and pathologically heterogeneous disease [22]. The prognosis of advanced GC has shown little improvement, and it is necessary to identify efficient prognostic biomarkers and therapeutic targets. In the current study, we first focused on the COL5 family and chose COL5A2 as our target according to the expression, OS, and PFS data in the TCGA database. The analyses showed that COL5A2 was associated with T stage and Lauren classification and is involved in cancer-related pathways.
The expression level of COL5A2 is increased in various types of cancer, such as pancreatic cancer and colon cancer [9,23]. The upregulation of COL5A2 is correlated with a poor prognosis in tongue cancer [8], a finding that is consistent with ours. Moreover, higher COL5A2 expression was associated with Borrmann type, histological type, and T stage in the GC samples of our department, suggesting that COL5A2 might be a potential biomarker for GC tumorigenesis and progression.
WGCNA is a method that can highlight functional co-expression gene modules, and it plays an important role in determining the potential mechanisms of malignancies, including breast cancer and colon cancer [16,24]. One main advantage of our study was that a WGCNA model of GSE62229 was constructed to identify the module containing COL5A2 and to further explore the role of COL5A2 in GC. Eventually, we found that COL5A2 was assigned to the salmon module, which was associated with T stage and Lauren classification, findings that are consistent with our IHC data. However, our study is limited by a small sample size, and more datasets need to be incorporated into future research.
Dysregulation of cellular functions and cancer-related pathways is common in cancers [25,26]. In the GO and KEGG enrichment analyses, COL5A2 was involved in the extracellular matrix, focal adhesion, and the PI3K-Akt signaling pathway. Paluch et al. [27] proposed that adhesion to the matrix through a specific site is an essential step during cancer cell migration. Additionally, the PI3K-Akt signaling pathway plays an important role in cell migration, angiogenesis, and survival in GC [28,29]. In the GSEA, cancer-related pathways, such as the TGF-β, MAPK, and JAK2 signaling pathways, were significantly enriched. Notably, our previous study showed that TGF-β was an independent factor for the peritoneal metastasis of GC [30]. These results provide clues to the mechanism by which COL5A2 contributes to the development of GC metastasis.

Conclusions
In conclusion, we aimed to select a COL5 family member with expression and survival significance and to identify its potential molecular mechanism in advanced GC using bioinformatics analyses and clinical samples. We used the TCGA database to select COL5A2 as our research target. WGCNA showed that COL5A2 was assigned to the salmon module, which was associated with T stage and Lauren classification. Functional annotation suggested that COL5A2 might be involved in the formation of the extracellular matrix, focal adhesion, and some cancer-related pathways. However, because this study is mainly based on the analysis of openly available datasets and clinical samples, further detailed experimental studies are needed to confirm the results.

Authors' contributions: Tan and Xing wrote the paper, and Chen and Pan performed the data analysis. Zhang and An were responsible for data download. Xu provided the research ideas.

Consent for publication: Not applicable.
Competing interests: The authors declare that they have no competing interests.

Ethics approval and consent to participate: The experiment was approved by the Medical Ethics Research Association of the First Affiliated Hospital of China Medical University, and each GC patient signed a written informed consent form.
Availability of data and materials: All data generated or analyzed during this study are included in this article.

Figure 1 Expression and survival analysis of the COL5 family in the TCGA-STAD cohort. (A) Box plot of the expression levels of the three genes in GC from the TCGA database. The x-axis shows the number of GC samples and normal samples, and the y-axis shows the gene expression levels. The P value was determined using Student's t-test, and error bars represent means ± s.d. (B) and (C) Kaplan-Meier survival curves of OS and PFS for the three genes. The P value was determined using the log-rank test. (D) COL5A2 was found to be statistically significant in both expression and survival in patient samples, compared with COL5A1 and COL5A3. (E) Relationship between COL5A2 and T stage based on the TCGA-STAD data. P<0.05 represents statistical significance.