Background: Gastric cancer (GC) is a digestive system tumor with high morbidity and mortality. It is urgently required to identify genes that are associated with GC prognosis and to elucidate the underlying molecular mechanisms. The aim of this study aims to identify the key genes which may affect the prognosis of GC patients and be a therapeutic strategy for GC patients by bioinformatic analysis.
Methods: The significant prognostic differentially expressed genes (DEGs) were screened out from TCGA and the GEO datasets. The protein-protein interaction (PPI) network was established by STRING and screen key gene by MCODE and CytoNCA plug-ins in Cytoscape. Function enrichment analysis, construction of prognostic risk model and nomogram verify key genes as potential therapeutic target.
Results: In total, 997 genes and 805 genes related to the prognosis of GC in the GSE84427 and TCGA datasets, respectively. We define the 128 genes shared by the two datasets as PRG. Then, the first 4 genes (MYLK、MYL9、LUM and CAV1) with great node importance in the PPI network of PRG were identified as key genes. Independent prognostic risk analysis found that patients with high PRG expression had poor prognosis, excluding their own age, gender and TNM stage. GO and KEGG enrichment analyses showed that key genes may exert influence through the PI3K-Akt pathway, in which, extracellular matrix organization and focal adhesion may play important roles in key genes influencing the prognosis of GC patients.
Conclusion: We identified that MYLK、MYL9、LUM and CAV1 are potential reliable biomarkers prognosis key genes and invasion and migration target in GC.
Gastric cancer (GC) is the fifth most common cancer and the third most common causation of cancer-related death in the world. The statistical results showed that there were more than 1 million new cases of GC in the world every year, and the number of GC-related death cases continuously increased, statistics for 2018 showed that the death toll has risen to 784000 . Many interfering factors can cause the low survival rates of GC patients, among which, the diagnosis of GC patients usually occurs in the middle and late stages, easy recurrence and metastasis after an operation are the most common reasons for the poor prognosis of GC patients . In the past ten years, a large number of studies have revealed that there were quite sensitive and effective biomarkers that can affect the occurrence and progression of GC. For example, Graziano F et al. found that methylation of CpG island in the promoter region of the CDH1 gene will lead to a change of CDH1 expression , which may play an important role in the occurrence and progression of diffuse GC, and CDH1 is likely to be one of the therapeutic targets of GC. Several previous studies  also found that HER2 expression is not only an independent risk factor affecting the prognosis of GC patients but also an effective target for the treatment of GC patients. These experiences provide the basis for the research on the occurrence, progression and treatment of GC. However, previous studies about biomarkers on the occurrence and progression of GC were based on the single-gene pattern, and cancer is usually a disease involving multiple genes and mechanisms. Therefore, it is very important to comprehensively explain the specific mechanism of GC progression and identify significant biomarkers to improve the prognosis of GC patients.
Bioinformatics is a broad multidisciplinary field. Computational tools have been developed to analyze and manage the increasing amount of biological data . Bioinformatics can be used to identify the key drivers of each specific cancer patient. Therefore, they have the potential to realize more personalized cancer treatment programs, paving the way for new targeted drugs targeting specific protein . With the development of The Cancer Genome Atlas (TCGA), Gene Expression Omnibus (GEO), and the accumulation of cancer genetics knowledge have developed rapidly in recent ten years, tumor analysis based on database not only reveals the whole panorama of tumor-related genome changes, but also lays a foundation for the study of related tumor types .
In this study, bioinformatics methods and techniques were used to screening out prognostic differentially expressed genes (P-DEGs) of GC from GEO and TCGA. Futher, we established a PPI network to identify the key genes in DEGs through module analysis and centrality analysis, constructed prognostic risk model and verified a unfavorable indicator. This study provides a reliable basis for exploring the molecular mechanisms of GC pathogenesis and identifying molecular targets for clinical diagnosis or treatment.
The gene expression matrix data of GC patients were obtained from the data set (GSE84437) in the GEO database of the national bioinformatics center of the United States. The data set was composed of the gene chip expression profile data and the survival information of 433 GC patients, which were collected through the GPL6947 chip platform. Besides, 380 cases of GC tissue expression profile data and clinical information were downloaded from the TCGA database.
Screening of prognosis related genes of GC patients
The gene expression matrix of GC tissues were obtained from the GEO database (n = 433) and TCGA (n = 380) respectively, and then the data were mined through R software. To obtain the standardized gene expression matrix of GC patients, the "impute" and "limma" packages in R were used to process the missing value estimation and logarithmic transformation of data. According to the K-M method, each gene in the gene expression matrix was divided into high or low expression group based on the median value of the gene expression. Subsequently, the survival difference between these two groups was evaluated and verified. The proportional hazards model was used for multivariate analyses and survival estimation to analyze, verify, filter and screen out these genes which were significantly correlated with the prognosis of GC patients (P < 0.05). Finally, these genes filtered by the above survival analyses were mutually verified in the two datasets obtained by GEO and TCGA. Then the common significant prognostic differentially expressed genes were identified as P-DEGs.
STRING （ http://string-db.org ）is an online tool, which is often used to predict protein-protein interactions . Through STRING, the gene interaction analysis can be conducted, including physical and functional interactions. In this study, we used it to establish a PPI network of P-DEGs while the confidence score of connections in this network is required to be > 0.15 and the disconnected nodes in the network were excluded.
Module analysis and centrality analysis of PPI network
The PPI network of P-DEGs is visualized by Cytoscape software. In this network, the functional modules and the interactions between genes were identified and measured through the MCODE plug-in . The selection parameters in MCODE plug-in were as follows: find clusters = in whole network, degree cut-off = 2, clustering finding=haircut，Node score cut-off= 0.2，K-core= 2, max depth= 100. In all sub-modules, the higher the score was, the stronger the protein correlation in the sub-module was, and the sub-module with the highest score was considered as the result of MCODE analysis. The plug-in CytoNCA is used for centrality analysis, including three parameters: degree, betweenness, and eigenvector . Degree is a measurement of the importance of a single node, which describes the number of sides of a connected node . Betweenness is the shortest path to analyze a specific node . While for the eigenvector, the importance of the node itself and its neighbors is considered . The top 5% of the nodes under each parameter are considered as the important nodes of CytoNCA analysis, the genes which represented were considered as the result of centrality analysis. Finally, by combining the results of MCODE and CytoNCA plug-ins, the common genes were considered as the most important genes in the PPI network of P-DEGs and identified as key genes.
Prognostic analysis and validation of key genes
Based on the gene expression matrix data of GC patients in GEO and TCGA databases, the median of key genes expression value was set as the cut-off value, and the key genes expression matrix of GC patients was divided into key genes high and low expression groups. By using the "survival" package in R, according to the K-M analysis and multivariate Cox regression test, the difference of overall survival events between the high and low expression groups of key genes was compared. Then, the survival rate and survival curve were analyzed and drawn. By using the "survival" package, according to univariate and multivariate Cox regression analysis, the hazard ratio (HR) and forest maps of independent prognostic analysis of single-gene and multiple-genes combination of key genes were analyzed and drawn. Finally, to precisely predict the survival rates of GC patients, the risk score of key genes and some clinic-pathological factors, such as age, gender and pathological stage, etc, were linked together. According to the risk ratio weighted key genes expression data, the key genes  was constructed as follows:
Where N is the number of selected genes of key genes, expi is the expression value of each single-gene of key genes, and HRi is the HR value of each single of multivariate Cox regression model. According to the median value of the risk score of key genes in the expression matrix of GC patients, GC patients were divided into the low-risk group and high-risk group, and the prognostic risk rates were measured by K-M analysis. Subsequently, based on the multivariate Cox regression analysis, the nomogram is established and drawn through the "RMS" package in R language.
GO and KEGG analysis of key genes
According to the median value of each key gene, GC patients in GEO and TCGA databases were divided into high and low expression groups of each gene respectively. The differentially expressed genes sets (DEGs) between high and low expression groups of each key genes were identified and the corresponding GO and KEGG functional enrichment analysis of each DEGS conducted through “limma”, “cluster profile”, “org.Hs.eg.db”, “enrichplot”, ”ggplot2” and other R software packages. |log2FC|>0.5, P<0.05 and adjusted P<0.05 were considered as the cut-off criteria.
R language (version 4.0.1) was used for data statistical analysis: K-M analysis, univariate Cox regression analysis and multivariate Cox regression analysis were used to identify the key genes; Survival curve and forest maps of survival analysis and independent prognostic analysis of single-gene or multiple-genes combination of key genes were drawn with R language through the "survival" package. P<0.05 and adjusted P<0.05 were considered as the cut-off criteria.
To explore the key genes affecting the prognosis of GC patients and the roles of these genes play in the mechanism of GC progression. The gene expression matrix data obtained from GEO and TCGA databases were used to conducted multivariate analyses and survival estimation to screen out the genes that were significantly correlated with the prognosis of GC patients (P < 0.05). Subsequently, we obtained 997 and 805 genes related to the prognosis of GC in the GSE84427 and TCGA gene expression matrix data sets, respectively. Therefore 128 common P-DEGs were obtained by mutual validation between the two datasets, which means 128 of 997 genes in GSE84427 and 128 of 805 genes in TCGA databases (Fig. 1A).
Module analysis and centrality analysis of P-DEGs related PPI network
In order to study the molecular mechanism which can affect the prognosis of GC patients from a systematic perspective, we established a PPI network of P-DEGs to explore the molecular mechanism. The results showed that there were 124 nodes and 819 edges in the PPI network. Futhermore, we used the MCODE plug-in in Cytoscape software to analyze the modules existing for exploring more closely related genes in the PPI network. The results showed that there were four modules and 1 non-module in the PPI network, and scores of the four modules were as follows: 8.667 (module 1), 7.455 (module 2), 4.111 (module 3), and 2.667 (module 4), respectively. We found that the first module (module 1) was the most interactive area in the PPI network, which is located in the center of the whole network, including 16 nodes and 65 edges (Fig. 1B). Therefore, the protein interactions in module 1, which ranks the first, maybe the strongest, and most important part of the whole network. Module 1 was considered as the final result of the MCODE analysis. At the same time, to obtain GC prognosis related key genes in this complex PPI network, we use the centrality analysis method to analyze the PPI network. First, we used the CytoNCA plug-in to analyze the score of three parameters of each gene in the PPI network, which were degree, betweenness, and eigenvector. Then, we selected the genes with its score ranked top 5% in three parameters. Finally, we selected these genes which ranked top 5% in three parameters and showed in module 1 as key genes, which were MYLK, MYL9, LUM and CAV1, and they were all in module 1with high centrality (Fig. 1C).
To analysis the role of key genes in the progression of GC, the survival analyses of four genes of key genes were further analyzed through the K-M method. According to the median expression of the genes matrix, GC patients were divided into the high expression group and low expression group. The survival curve showed that the expression of MYLK, MYL9, LUM and CAV1 were significantly correlated with the 5-year survival rate and overall survival time of GC patients in GEO and TCGA databases (P < 0.05). According to the survival analyses, the median survival time of GC patients with lower expression of MYLK, MYL9, LUM and CAV1 were 1.37, 1.41, 1.35 and 1.42 years, with higher expression of MYLK, MYL9, LUM and CAV1 were 1.06, 1.08, 1.15 and 1.06 years in TCGA database, respectively. Compared with GC patients with lower expression of MYLK, MYL9, LUM and CAV1 (GEO, n = 217; TCGA, n = 190), these patients with high expression of key genes (n = 216, GEO; n = 190, TCGA) had significantly poorer prognosis (P < 0.05, Fig. 2). The results were verified through GEO gene matrix once again. According to the univariate and multivariate Cox regression analyses, the results of independent prognosis of key genes in the GEO and TCGA databases showed that the HR of MYLK, MYL9, LUM and CAV1 all presented as HR > 1, which were 1.15, 1.18, 1.19 and 1.31, respectively (P < 0.05). These results indicated that key genes can independently affect the prognosis of GC patients (Fig. 3). The influence of key genes is of great significance and has potential value as prognostic biomarkers and therapeutic targets for GC patients.
To better elucidate the mechanisms of key genes affecting GC prognosis, we performed GO and KEGG enrichment analyses. Results of GO analyses showed that most GO-terms were significantly enriched in the extracellular matrix organization, extracellular structure organization, cell-substrate adhesion, tissue migration, muscle contraction, muscle tissue development, mesenchymal development, etc. (Fig. 4). Besides, the results of KEGG analyses showed that the related pathways were significantly enriched in focal adhesion, PI3K-Akt signaling pathway, ECM receptor interaction, cell adhesion molecules, proteoglycans in cancer, protein digestion and absorption, Cell cycle, calcium signaling pathway, etc (Fig. 5). These results indicated that key genes affect the prognosis of GC patients mainly through influencing the invasion, migration and cell cycle functions of GC cells.
Construction and validation of prognostic risk model of key genes
Based on multivariate Cox regression analysis, key genes (MYLK, MYL9, LUM and CAV1) were integrated, and prognostic risk model of key genes was established according to GEO and TCGA data respectively. The risk score of key genes was calculated using the formula mentioned in the method, processes were as follows: risk score = (HR (MYLK) × MYLK expression level) + (HR (MYL9) × MYL9 expression level) + (HR (LUM) × LUM expressionival rate, risk score and clinical features of GC patients can be estimated based on the total points. The results showed that the risk score is very high in both GEO and TCGA databases (risk score is about 10 in GEO and 9 in TCGA, Fig. 4), which confirmed that key genes may be potential therapeutic targets for prognosis.
To verify the reliability of key genes, GC patients were divided into the low-risk group and high-risk group according to the median risk score in TCGA and GEO databases, respectively. The survival curves showed that the prognosis of the high-risk group was worse than that of the low-risk group (Fig. 5, P < 0.05). With the risk score increasing, the number of patients’ death increases (Fig. 5). Univariate and multivariate Cox regression analyses were performed based on the gene matrix data, the results of which showed that the risk score of key genes was independently correlated with the overall survival rate of GC patients (Table 1, P < 0.05). These results indicated that the key genes can be a significant reference to the prognosis of GC patients. The key genes can be used to guide the next step of treatment after surgery or/and chemoradiotherapy treatment. MYLK, MYL9, LUM and CAV1 can be potential targets to improve the prognosis of GC patients.
GC is one of the most common and malignant tumors. Although the main treatment methods of GC such as surgery, radiotherapy, and chemotherapy have made progress, the incidence rate and mortality rate of GC patients remains stubbornly high [15, 16]. More than 90% of the GC patients were in the late stage when diagnosed, which was related to the unclear symptoms in early-stage of GC patients and unclear influential factors of GC prognosis to a large extent [17, 18]. The occurrence and progression of GC is a multi-stage, slow-moving pathological process, in which genetic mutations, epigenetic changes, and abnormal molecular signal transduction pathways all can participate in the occurrence, diffusion, and metastasis of GC . Therefore, it is very important to find specific prognostic biomarkers of GC to develop therapeutic strategies for malignant behaviors of tumors. These problems highlight the necessity of finding prognostic markers of GC. Nowadays, high-throughput platforms for detecting gene expression have been developed rapidly in the processes of disease progression, which lays the foundation for the discovery of new targets that can be used to predict, diagnose and treat cancer. In this study, we integrated GEO and TCGA databases, using bioinformatics analysis methods, to mine and analyze high-throughput data to conducted module and centrality analysis of the PPI network. Which helped us screen out key genes that has an important impact on the prognosis of GC patients and can be considered as a biomarker and potential therapeutic target for GC prognosis. Then the establishment a prognostic risk model of key genes further explained the kernel roles the key genes may play in the development of GC.
Module analysis (MCODE) and centrality analysis (CytoNCA) in PPI network play important roles in screening molecular markers, these genes appear in the modules with the highest scores, and also rank higher in centrality analysis results are the key genes that can affect the occurrence of diseases . Studies have shown that module analysis can help to screen key genes in cancers more accurately such as cervical cancer , glioblastoma , and head and neck squamous cell carcinoma . While CytoNCA can analyze the centrality degree of each node in the whole PPI network and can exhibit the nodes with important connections, to help select key genes . Combined with these two methods, key genes (MYLK, MYL9, LUM and CAV1) with important value in the whole PPI network was obtained. Some studies also elucidated the impact of key genes on various tumors.
Liang X et al., and Luo Z et al., indicated that Caveolin 1 (CAV1) played an important role in the occurrence and progression of varieties of malignant tumors, especially in the malignant progression of GC by promoting epithelial-mesenchymal transition (EMT) function. Under the conditions of extracellular matrix integrin interaction and Tyr-14 phosphorylation, CAV1 enhanced melanoma cells will migrate, invade, and migrate to the lung [24, 25]. The results of Jin g-h et al., showed that compared with normal gastric mucosa, Myosin Light Chain 9 (MYL9) was abnormally up-regulated in GC patients' tumor tissues, and it could affect the prognosis of GC patients through adhesion plaque and leukocyte cross endothelial migration . As an important part of extracellular matrix, luminan (LUM) can be expressed in many organs and tissues of human body. Its abnormal expression can affect the formation of extracellular matrix. The process of tumor proliferation, invasion and migration is often accompanied by the synthesis and degradation of extracellular matrix. Therefore, LUM can play an important role in tumor metastasis and invasion through extracellular matrix . Myosin Light Chain Kinase(MYLK ) can catalyze the phosphorylation of myosin light chain and regulates the invasion and metastasis of some malignant tumors [28, 29].
In the past few years, there has been more and more evidence of the key role of the extracellular matrix in mediating different cell processes (including cell adhesion, polarity, migration, differentiation, proliferation, and survival), and tumor cells are closely related to it . Focal adhesion is a strong adhesion of the sub-cellular structure to the extracellular matrix. It also acts as a scaffold for many signal transduction pathways involving integral protein or mechanical force exerted on cells . Focal adhesion dysfunction is considered to be an essential pathway in tumor invasion and migration [32, 33]. Many cellular processes in cancer are attributed to kinase signaling networks. Akt, as a serine/threonine kinase, also known as protein kinase B, is a carcinogenic protein that can regulate cell survival, proliferation, growth, apoptosis, and glycogen metabolism. Over-expression of Akt is a common molecular feature of human malignant tumors. Many tumor tissues and tumor cells are accompanied by Activation of PI3K/Akt signaling pathway . In this study, we explored the relationship between key genes and classical carcinogenic signaling pathways by GO and KEGG enrichment analysis. Results showed that key genes can promote the development of GC by regulating various signaling pathways, many of which have been proved to play important roles in the occurrence and progression of cancer. In particular, focal adhesion and PI3K/Akt signaling pathways may be the main signaling pathways involved in the effect of key genes on GC prognosis, and their influences could not be divorced from the extracellular matrix. Therefore, in these GC patients whose key genes expression significantly increased, we can try to further study the specific treatment involving key genes.
Over this study, we not only used the integrated multi-step analysis (including network data mining and enrichment analysis) to find the key genes that have an important impact on the prognosis of GC patients, namely key genes (MYLK, MYL9, LUM, CAV1), but also established a risk model with predictive value based on key genes, which provides a basis for exploring the molecular mechanism affecting the prognosis of GC patients.
The integrative analyses of gene expression matrix identified 128 common P-DEGs. The 4 key genes (MYLK, MYL9, LUM, CAV1) of P-DEGs may be predictive biomarkers or therapeutic targets for GC prognosis. These predictions should be verified through experiments validation, though this study provided new insights into the development of individualized therapeutic targets for GC.
Prognostic differentially expressed genes
The Cancer Genome Atlas
Gene Expression Omnibus
Availability of data and materials
GC: Gastric cancer
P-DEGs : Prognostic differentially expressed genes
PPI: Protein-protein interaction
TCGA: The Cancer Genome Atlas
GEO: Gene Expression Omnibus
HR: Hazard ratio
ZYL designed this paper and performed critical revision of the manuscript.
Ethics approval and consent to participate
This study was conducted in accordance with the Declaration of Helsinki. This study was conducted with approval from the Ethics Committee of Jiangxi Provincial People's Hospital. Written informed consent was obtained from all participants.
Consent for publication
The authors declare that they have no competing interests.
Due to technical limitations, table 1 is only available as a download in the Supplemental Files section.