Bioinformatics Analysis and Identi cation of Potential Biomarkers in Gastric Cancer

Background: Although the morbidity and mortality of gastric cancer are declining, gastric cancer is still one of the most common causes of death. Early detection of gastric cancer is of great help to improve the survival rate, but the existing biomarkers are not sensitive to diagnose early gastric cancer. The aim of this study is to identify the novel biomarkers for gastric cancer. Methods: Three gene expression proles (GSE27342, GSE63089, GSE33335) were downloaded from Gene Expression Omnibus database to select differentially expressed genes. Then, Gene Ontology and Kyoto Encyclopedia of Genes and Genomes analysis were performed to explore the biological functions of differentially expressed genes. Cytoscape was utilized to construct protein-protein interaction network and hub genes were analyzed by plugin cytoHubba of Cytoscape. Furthermore, Gene Expression Proling Interactive Analysis and Kaplan-Meier plotter were used to verify the identied hub genes. Results: 35 overlapping differentially expressed genes were screened from gene expression datasets, which consisted of 11 up-regulated genes and 24 down-regulated genes. Gene Ontology functional enrichment analysis revealed that differentially expressed genes were signicantly enriched in digestion, regulation of biological quality, response to hormone and steroid hormone, and homeostatic process. Kyoto Encyclopedia of Genes and Genomes pathway enrichment analysis showed differentially expressed genes were enriched in the secretion of gastric acid and collecting duct acid, leukocyte transendothelial migration and ECM-receptor interaction. According to protein-protein interaction network, 10 hub genes were identied by Maximal Clique Centrality method. Conclusion: By using bioinformatics analysis, COL1A1, BGN, THY1, TFF2 and SST were identied as the potential biomarkers for early detection of gastric cancer.

have been identi ed as risk factors (8). Surgery like curative gastrectomy has been suggested as the only radical cure for early gastric cancer but it also involves the risk of recurrence (9). In contrast to advanced gastric cancer with 10% ve-year survival rate, the ve-year survival rate of early gastric cancer can be above 95% (10). However, the existing biomarkers can't be effectively diagnosed early gastric cancer and most patients with early gastric cancer present with nonspeci c clinical symptoms until the advanced stage (11). It means that patients missed the best period of surgical radical treatment and recurrent GC also reduces the survival rate. Therefore, improving the detection methods of early GC and identifying the effective biomarkers for GC are crucial to enhance the survival rate of patients with gastric cancer.
Biomarker is de ned as a biochemical indicator, and has been used widely in the eld of clinical diagnosis. Although the existing biomarkers for GC could be used to evaluate the prognosis of GC, recurrence and its distant metastasis, they are not suitable for the detection of early GC because of low speci city and sensitivity, such as carcinoembryonic antigen (CEA) and carbohydrate antigen (CA-199) (12). Recently, there are some studies found that stanniocalcin-1 (STC1) and matrix metalloproteinase-9 (MMP9) have abnormal expression in GC (13,14). The expression of STC1 was higher in GC, but decreased after surgical treatment. In addition, the sensitivity of STC is better than CEA and CA-199, and it's worth further exploring its clinical value (14). MMP9 expression was increased in GC and it had association with lymph node metastasis (13). Therefore, it suggested that MMP9 could be a potential biomarker for GC detection and could act as a possible molecular target for therapy. Despite these ndings, the underlying molecular mechanisms of GC is not clear. The identi cation of new diagnostic biomarkers should still be carried out actively.
Microarray analysis is a high-throughput technology that has been used widely in the screening of abnormally expressed genes. It provides the possibility for the search for biomarkers and further study on the mechanism of diseases. In this study, three gene expression microarray datasets were analyzed to obtain differentially expressed genes (DEGs) in GC. Subsequently, integrated bioinformatics analysis was performed to help us identify gastric cancer related crucial genes and understand the likely underlying molecular mechanisms of carcinogenesis.

Microarray data
The Gene Expression Omnibus (GEO, www.ncbi.nlm.nih.gov/geo) database is an international public repository and it stores the submitted high-throughput microarray datasets (15). Three gene expression pro les (GSE27342, GSE63089, GSE33335) were retrieved from GEO database (Affymetrix GPL5175 platform, Affymetrix Human Exon 1.0 ST Array). Based on the annotation information of the GEO platform, the probes were converted to the corresponding gene symbols. GSE27342 dataset contained 80 GC tissue samples and 80 paired adjacent noncancerous tissue samples (16). GSE63089 dataset contained 45 GC tissue samples and 45 paired adjacent normal tissue samples (17). GSE33335 dataset contained 25 GC tissue samples and 25 paired adjacent noncancerous tissue samples (18).

Screening for DEGs
The limma package (version 3.44.3, http://bioconductor.org/packages/limma/) in R programming language was applied to screen the DEGs between GC tissues and matched noncancerous tissues (19).
The adjusted P-value (adj. P-value) which based on the Benjamini and Hochberg method was used to reduce false positive results from analysis. The cut-off criteria was set as the adj. P-value <0.05 and |log 2 FC (fold-change) | ≥1.5, and genes that meet the above conditions can be regarded as DEGs (17).
The R package venn (version 1.9, https://CRAN.R-project.org/package=venn) was utilized to visualize the overlap of DEGs in different datasets.

Visualization of DEGs
The free R package ggplot2 is a visualization package to construct statistical graphics and it could display the information of the datasets in an e cient and intuitive way. The volcano plot was constructed to show DEGs in different datasets by using the ggplot2 package (version 3.3.2, https://CRAN.Rproject.org/package=ggplot2) (20).

Gene Ontology (GO) and pathway enrichment analysis of DEGs
The GO (http://geneontology.org/) knowledgebase is a public database that annotates genes and collects the information of genes functions (21). In order to explore the possible biological functions of DEGs, the GO knowledgebase was utilized for analysis. KOBAS (version 3.0, http://kobas.cbi.pku.edu.cn/kobas3) is an online bioinformatics tool to indicate the correlation among genes and putative pathways and diseases (22). The information of the Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis was performed by KOBAS. P-value <0.05 was considered to reach statistical signi cance.

Construction of the protein-protein interaction (PPI) network of DEGs
Search Tool for the Retrieval of Interacting Genes (STRING, version 11.0, https://string-db.org/) is publicly available database to explore the information of protein-protein interaction (PPI) and further investigate the cellular mechanism (23). In order to ensure that the relationship among DEGs was statistically signi cant, the cut-off criteria was set as con dence score >0.400. Cytoscape (version 3.8.1) is an open software platform for biological research, which could integrate molecular interaction networks with gene annotations and other states data (24). Then, the PPI network was constructed by Cytoscape which visualizes DEGs subsets as an interactive network.

Identi cation and analysis of hub genes
The cytoHubba (version 0.1, http://apps.cytoscape.org/apps/cytohubba) is a plugin in Cytoscape and it was utilized to rank the PPI nodes according to the features of the network (25). Maximal Clique Centrality (MCC) method is one of the topological analysis methods in cytoHubba plugin, which has better sensitivity and speci city, and it can help us to screen important nodes in PPI network as hub genes (25). Subsequently, KOBAS was used to analyze the enriched pathway of the identi ed hub genes.

Comparison of hub genes expression level
Gene Expression Pro ling Interactive Analysis (GEPIA, http://gepia.cancer-pku.cn/) is a website based on TCGA and GTEx databases which analyzes the relationship of gene expression between cancer and noncancerous control (26). It was utilized to validate the expression level of hub genes between GC tissues and noncancerous tissues, and then the result of comparison was visualized in the form of box plot.

Kaplan-Meier survival analysis
The screened hub genes were processed for survival analysis. Kaplan-Meier plotter (https://kmplot.com) is an online web server which contains 10,188 cancer samples and integrates the relative gene expression and clinical trial data (27). The website calculates the hazard ratio (HR) with 95% con dence intervals and logrank P-value in its own R language environment, and makes the corresponding Kaplan-Meier survival curves. A logrank P-value <0.05 was considered to be statistically signi cant.

Screening and visualization of DEGs in GC
In this study, we downloaded and processed three gene expression pro les (GSE27342, GSE63089, GSE33335). According to the above cut-off criteria (the adj. P-value <0.05 and |log 2 FC (fold-change) | ≥1.5), 540 DEGs in GSE27342 were selected, 183 DEGs in GSE63089 and 109 DEGs in GSE33335. Figure   1 visualizes DEGs from 3 datasets. The overlap of DEGs among three datasets included 35 genes, which consisted of 11 up-regulated genes and 24 down-regulated genes through comparing GC tissues to matched noncancerous tissues. The result is shown in the Venn diagram ( Figure 2). In order to further understand the biological functions of 35 overlapping DEGs, GO and KEGG pathway enrichment analysis were performed. GO functional enrichment analysis result showed the identi ed DEGs were signi cantly enriched in digestion, regulation of biological quality, response to hormone and steroid hormone, and homeostatic process. Among them, the most signi cantly enriched GO term was digestion. In addition, KEGG pathway analysis revealed that DEGs were mainly enriched in the secretion of gastric acid and collecting duct acid, leukocyte transendothelial migration, ECM-receptor interaction and pancreatic secretion. Gastric acid secretion (P-value =1.02E-08) was the most relevant KEGG enriched analysis result. Based on P-value, the most prominent GO and KEGG terms are listed in Table 1. Figure 3 also displays the details of GO and KEGG terms.  Figure 4 shows proteins with con dence score of interactions that was higher than 0.400. Moreover, the PPI network was further analyzed by cytoHubba plugin in Cytoscape. The top 10 genes of PPI network ranked by MCC method were regarded as hub genes ( Figure 5). The hub genes were listed as the following order: COL1A1, BGN, ATP4A, SPP1, TFF2, GIF, SST, GHRL, ATP4B and THY1 (Table 2). Then, to have a better investigation of the hub genes, KOBAS was used to analyze the functional enriched pathways of these genes. The hub genes in the PPI network were signi cantly enriched for the secretion of gastric acid and collecting duct acid, ECM-receptor interaction and focal adhesion, oxidative phosphorylation and cAMP signaling pathway. Table 3 reveals the enriched pathways in detail. In order to validate the expression level of the hub genes between GC tissues and normal tissues, GEPIA website was performed to compare the expression level of these genes. As the box plot in Figure 6 shown, all of these hub genes expression level were statistically signi cant because of P

Discussion
Gastric cancer is one of the most common malignant tumors around the world and it has high incidence and mortality. It is estimated that the mortality of GC reaches 70% and is obviously higher than other epidemic diseases (28). With the improvement of standards of hygiene and the eradication of H. pylori, the incidence rate of GC has shown a decline trend and changed from the fourth most common cancer to the fth (29,30). However, the mortality rate of GC has not been signi cantly decreased. Due to the application of radical surgery and chemotherapy, the ve-year survival rate of early GC can increase up to 90%, but the prognosis of advanced GC is very poor (31). It's because that advanced GC missed the best therapeutic time window and has the potential for distant metastasis (32). Meanwhile, most patients with early GC do not display any symptom is also an important cause and it means that GC patients are di cult to detect at an early stage (33). Although surgical techniques are improving, novel chemotherapy and targeted therapy are developing, there is still no effective treatment for advanced gastric cancer (10). Therefore, the identi cation of the effective biomarkers for GC and further exploration of their correlative biological functions could help us to detect GC at early stage and understand the pathogenesis of GC, which in turns lead to the emergence of novel targeted therapy.
In this study, three gene expression pro les of GC (GSE27342, GSE63089, GSE33335) were utilized to select DEGs between GC tissues and noncancerous tissues. A total of 35 genes were identi ed, including 11 up-regulated genes and 24 down-regulated genes. GO and KEGG pathway enrichment analysis were executed to investigate the biological functions of DEGs. The results of GO functional enrichment analysis indicated that 35 genes were mainly enriched in the process of digestion, regulation of biological quality, response to hormone and steroid hormone, and homeostatic process. GC itself is a malignant tumor in the digestive system and most of risk factors are associated with digestion and diet. There are many studies have revealed that obesity acts as a risk factor of GC and could affect digestion (8,34).
While fat has metabolic capacity and could secret leptin and insulin-like growth factor (IGF), the change of leptin and IGF has been reported to be associated with gastrointestinal cancers (35). Moreover, Zhou et al. has proven that the abnormal expression of gastrointestinal hormones like hypergastrinemia have correlation with GC (36). Some previous studies have suggested that the production of steroid hormone could affect the development of GC (37,38). Estrogen receptor (ER) acts as steroid hormone receptor and the dysregulation of ERs is associated with GC. ERα36 is a kind of ER, and it could increase the expression of cyclin D1 to promote the proliferation of GC cells (39). In addition, the role of estrogen in GC is related to its concentration. Low concentration of estrogen can increase the expression of ERα36 and promote GC cells growth, but high concentration of estrogen is on the contrary (40). Homeostatic process has also been demonstrated that is associated with GC (41). Autophagy is a vital homeostatic process and it plays both tumor-suppressor role and tumor-promoter functions in GC (42). One cause of GC development is autophagy could suppress GC cells at early stages and promote the growth of existing tumor (43).
The ndings from KEGG pathway enrichment analysis showed DEGs were signi cantly enriched in the secretion of gastric acid and collecting duct acid, leukocyte transendothelial migration, ECM-receptor interaction and pancreatic secretion. It has been reported that gastric acid secretion is associated with GC and chronic atrophic gastritis which is regarded as a risk factor of GC (44,45). The migration characteristics of leukocytes are essential to promote the systemic immune response (46). Enarsson et al. found that the transendothelial migration of T cells in GC patients was reduced and speculated that reduced transendothelial migration of T cells might help GC cells to evade immune responses and was bene cial for carcinogenesis (47). ECM, its full name is extracellular matrix, and the abnormal changes of ECM contents are associated with the formation of cancer (48). The component of ECM like hyaluronan could play roles in promoting GC development (49). The relevant studies of the above GO and KEGG analysis have been con rmed to be associated with GC, which is also consistent with our results.
According to the PPI network and a total of 10 hub genes are obtained from it. Among these genes, COL1A1, BGN, SPP1 and THY1 were up-regulated and ATP4A, TFF2, GIF, SST, GHRL and ATP4B were down-regulated. We further explore these genes and analyze their possible biological functions. The result showed that ATP4A and ATP4B had correlation with most of enriched pathway. ECM-receptor interaction and focal adhesion were associated with COL1A1 and BGN, which two genes have the highest score in MCC method. Subsequently, we validated the expression level of these hub genes by using GEPIA website. The result as shown in Figure 6, COL1A1, BGN, SPP1, THY1 in GC tissues had an increased expression. While ATP4A, TFF2, GIF, SST, GHRL and ATP4B had a decreased expression level. The veri cation of expression level in the present study supports the screened hub gene and makes this study more credible. Finally, the Kaplan-Meier survival analysis indicated that up-regulated COL1A1, BGN and THY1, down-regulated TFF2 and SST were associated with a poor OS. It provides a possible research trend of GC, which is worth further exploring.
COL1A1, the full name is collagen type I α 1, and its protein is collagen type I which is a major component of ECM (50). There are many studies have been reported that the overexpression of COL1A1 is related to a variety of cancers, such as colorectal cancer, breast cancer and GC (51)(52)(53). Liu et al. found that the abnormal expression of COL1A1 in breast cancer cells promoted cancer metastasis and knockout of COL1A1 could inhibit this process (54). In colorectal cancer patients, overexpressed COL1A1 affected metastasis by regulating the WNT/PCP signaling pathway (51). As a component of ECM, high expression of COL1A1 could increase the stiffness of ECM and contribute to cancer development (55). Furthermore, miR-129-5p has been proved to inhibit the proliferation of GC and distant metastasis by suppressing the expression of COL1A1 (52). COL1A1 has also been recognized as a potential biomarker of GC for early detection at many times (56,57).
BGN is also the component of ECM, and the upregulation of BGN has been reported as a biomarker for multiple cancers (58). A recent study demonstrated that BGN has a high expression in GC patients and BGN could interact with Toll like receptor 2 and 4 secreted by GC cells to activate the VEGF pathway and promote angiogenesis, which is bene cial to the progression of cancer (59). Therefore, BGN could be regarded as a biomarker of GC that is correlated with poor prognosis. THY-1 acts as a cell surface protein with immunoglobulin domain and it exhibits a high level in GC patients (60,61). There are some studies has revealed that the formation of GC was associated with THY1 (62). The activation of Notch signaling pathway can promote cancer growth and THY1 stimulates Notch signaling pathway through the regulation of RhoA (63,64). Furthermore, THY1 has been proven to be a target gene of miR-140-5p (65). Wu et al. has indicated that miR-140-5p can inhibit the development of gastric cancer by inhibiting the expression of THY1 and inactivating Notch signaling pathway (65). According to the above studies, THY1 can be used as a target for the treatment of GC, and it is worth further exploring its value in early detection.
In addition, TFF2 and SST have been reported as tumor suppressors, and low level of TFF2 and SST were found in GC patients (66)(67)(68). TFF2 is a short peptide that secreted by gastric mucous neck cells and is helpful for cellular protection of gastrointestinal tract (69). However, the exact mechanism between this protective effect and GC has not been elucidated. Recent studies have indicated that TFF2 could inhibit the development of GC and tumor invasion by interacting with Sp3, therefore down-regulated TFF2 might loss its anti-tumor effect (70). SST, an inhibitory gut peptide, it plays role in regulating cell proliferation and its analog octreotide has been used to inhibit the GC metastasis (71,72). In GC patients, SST has a decreased expression and its promoter DNA methylation level is higher than normal patients (73). Low expression of SST promotes GC growth, while the inhibition of GC invasion and metastasis by overexpression of SST has been con rmed in vivo experiments (67). Accordingly, we speculate that promoter DNA methylation might regulate the expression of SST and further promotes GC growth through mechanisms that are not yet clear. Although the mechanism of TFF2 and STT in GC has not been elucidated, they are still worthy of being used as biomarkers for early diagnosis of GC.

Conclusions
In the present study, we used bioinformatics analysis to screen out potential biomarkers for early detection of GC. Although there is a lack of in vivo or in vitro experiments on gastric tissue samples, we still need to con rm these selected novel biomarkers. The ndings of our study have been veri ed by multiple methods such as GEPIA and Kaplan-Meier plotter, and has been supported by previous researches. A total of 5 gastric cancer related crucial genes (COL1A1, BGN, THY1, TFF2 and SST) were identi ed. However, the biological functions of these genes and their exact mechanisms in GC are still worthwhile researching thoroughly.

Availability of data and materials
All data generated or analyzed during this study are included in the published article.    The most signi cant module of PPI. The most signi cant module was selected as the hub gene by MCC method. The depth of the color represents the size of the score.