Background: Gastric cancer is one of the most common clinical malignant tumors worldwide, with high morbidity and mortality. The commonly used TNM staging and some common biomarkers have a certain value in predicting the prognosis of GC patients, but they gradually failed to meet the clinical demands. Therefore, we aim to construct a prognostic prediction model for GC patients.
Methods: A total of 350 cases were included as TCGA-STAD entire cohort, including TCGA-STAD training cohort (n=176) and TCGA-STAD testing cohort (n=174). GSE15459 (n=191), GSE62254 (n=300) and some cases in our center (n=12) were for external validation.
Results: Through differential expression analysis and univariate Cox regression analysis in TCGA-STAD training cohort, we screened out 5 genes among 600 genes related to lactate metabolism for the construction of our prognostic prediction model. The internal and external validations showed the same result, that is, patients with higher risk score were associated with poor prognosis (all P<0.05), and our model works well without regard of patients' age, gender, tumor grade, clinical stage or TNM stage, which supports the availability, validity and stability of our model. Gene function analysis, tumor-infiltrating immune cells analysis, tumor microenvironment analysis and clinical treatment exploration were performed to improve the practicability of the model, and hope to provide a new basis for more in-depth study of the molecular mechanism for GC and for clinicians to formulate more reasonable and individualized treatment plans.
Conclusions: We screened out and used 5 genes related to lactate metabolism to develop a prognostic prediction model for GC patients based on them. The prediction performance of the model is confirmed by a series of bioinformatics and statistical analysis.
As a serious global health problem, the global morbidity of gastric cancer (GC) ranks fifth and it's the fourth leading cause of cancer-related death, with more than one million people newly diagnosed with it worldwide. The majority of cases of gastric cancer are usually not diagnosed until an advanced stage, usually leading to a poor prognosis[3, 4]. Although overall survival (OS) for GC patients have improved greatly due to combined treatment[4, 5], it is still difficult to evaluate the prognosis of GC patients due to the great differences in individual prognosis.
At present, tumor-node-metastasis (TNM) staging system has been generally applied to predicting the prognosis of GC and more and more prognostic prediction models have also been constructed based on the general data and clinicopathological data of GC patients. However, there is a certain lag in time of these data, resulting in its less accuracy.
In addition, a series of traditional biomarkers (such as CEA, CA 12 − 5, CA 19 − 9 and CA 72 − 4) and novel biomarkers (such as microRNAs, long non-coding RNAs and circular RNAs) have been exploited to assist in the diagnosis and prognostic prediction of GC and achieve good clinical effects. Unfortunately, most of novel biomarkers show shortcomings due to limitation by high-throughput technology and some of them are still at the organ level. Some researchers have found several sorts of genes related to GC and explored their prognostic value, but the prognostic value of an individual gene is still not ideal and perfect[10, 11]. With the rapid development of precision medicine and gene sequencing technology, multiple genes signature will be more and more applied to the construction of prognostic prediction models to achieve more prospective, accurate and individualized prediction performance.
Reprogramming of energy metabolism and evading immune are two emerging hallmarks of cancers, and the former is a common feature of most tumor cells. Tumor cells were found to upmake a large amount of glucose and metabolize into lactate under oxygen-rich conditions, known as "Warburg effect". The same phenomenon was also found in GC cells. Lactate, as a signaling molecule, is discovered to play important roles in the regulation of the metabolic pathways and the immune response within the tumor microenvironment, which is closely related to the occurrence, invasion, metastasis, drug resistance and poor prognosis of GC. Sonveaux found that lactate promotes endothelial cell division, thus promoting angiogenesis. Some researchers have successfully constructed corresponding metabolism-related gene signatures for prognostic prediction of gliomas[18, 19], GC and lung adenocarcinomas (LUAD). All the above indicates the availability and feasibility of applying lactate metabolic genes to predicting the prognosis of GC, which is the theoretical basis for us to construct a prognostic prediction model for GC patients.
As lactate metabolism is a complex biological process involving multiple genes, the predictive value of those models based on multiple metabolism related genes will be better than those based on a single gene. Therefore, we aimed and managed to construct a prognostic prediction model by screening a series of genes related to lactate metabolism to better predict the prognosis of GC patients.
Genes related to lactate metabolism, known as LM genes, were collected in GeneCards database (https://www.genecards.org/), which provides a wide range of information about human genes. The term "lactate metabolism" was used as a keyword for the search, and genes with relevance scores > 12 were considered to be LM genes.
The RNA-seq data and clinical characteristics of TCGA cohorts (ACC, BLCA, BRCA, CESC, CHOL, COAD, DLBC, ESCA, GBM, HNSC, KICH, KIRC, KIRP, LAML, LGG, LIHC, LUAD, LUSC, MESO, OV, PAAD, PCPG, PRAD, READ, SARC, SKCM, STAD, TGCT, THCA, THYM, UCEC, UCS and UVM) were downloaded from Xena website (https://xenabrowser.net/datapages/). At the same time, we also downloaded the TCGA Pan-Cancer Datasets of Immune Subtype, Immune Signature Scores (Denise Wolf et al), Stemness Score (DNA methylation based) and Stemness Score (RNA based) from Xena website. The gene-expression profile matrixs of the STAD cohort GEO: GSE15459 and GSE62254 were collected from GEO database (https://www.ncbi.nlm.nih.gov/geo/) for external validation. The "limma" package was used to do the normalization for the expression profiles. The average expression level was retained for repeated genes. Moreover, Sequencing data was obtained from surgical tissue specimens of patients in our center. The specimens included gastric cancer tissue and gastric cancer peritoneal metastasis tissue, and the tissue specimens of these patients were diagnosed as GC and GC peritoneal metastasis by pathologist in our center.
By using the "limma" package, those LM genes mentioned above were investigated with differential expression analysis in TCGA-STAD cohort and visualized as heatmap plots. |Fold change (FC)| > 0 and p value (adj. p) < 0.05 were considered to be statistically apparent for distinguishing differential expression genes (DEGs). We selected the intersection of DEGs and LM genes, known as LM-DEGs, for further analysis.
The analysis of gene function is an important procedure in translating molecular findings from high-throughput data into biological significance. By using the "clusterProfiler" package in R, we perform statistical analysis and visualize functional profiles of LM genes, including Genetic Ontology (GO) annotation analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment pathway analysis. Adj. P value < 0.05 was considered to be statistically significant.
By performing the univariate Cox proportional hazards regression analysis on each LM-DEG, we selected genes evidently associated with OS in TCGA-STAD training cohort. Then we constructed a multivariable model with LM genes. We calculate the risk score by using those genes with nonzero coefficients. A prognostic risk score was calculated for each patient by using the following formula:
Risk score = expression level of gene 1 * j1 + expression level of gene 2 * j2 + … + expression level of gene x * jx, where j represents the coefficient.
We used the "survminer" package to select the optimal cut-off value in risk score to divide TCGA-STAD patients into high and low risk groups. The identical formula and identical cut-off value were selected to two GEO datasets and our own sequencing data for validation.
Performing univariate and multivariate Cox proportional hazards regression analyses to investigate whether the LM-DEGs-based prognostic model was an independent prognostic factor. To assess the survival differences between two groups, we constructed the KM survival curve and used the log-rank test. At the same time, to examine the specificity and sensitivity of the prognostic performance, we explored the ROC curve analysis. The AUC values indicated discrimination.
To find the possible pathways and molecular mechanisms in the high-risk group and low-risk group in TCGA-STAD cohorts, we use the analysis of GSEA (https://www.gsea-msigdb.org/gsea/index.jsp). Gene sets were considered significantly enriched with a p value of < 0.05 after 1000 permutation.
To explore the relationship between the immune-cell characteristics and the model and genes, we considered the CIBERSORT method to calculate the immune infiltration statues in the samples from the TCGA-STAD dataset. By using the Wilcoxon signed-rank test, we explored the differences in immune infiltrating cell content between high and low risk groups of the constructed model.
To explore the immune microenvironment of GC patients, we applied the ESTIMATE algorithm in “estimate” and “limma” package of R Software to calculate the Immune Score and Stromal Score. The Estimate Score was used to describe tumor purity. This analysis was based on the gene expression data retrieved from TCGA-STAD datasets.
To explore the model for gastric cancer treatment in the clinic, we calculated the IC50 of common approved chemotherapeutic drugs in the TCGA-STAD datasets by using "pRRophetic" package of R Software. The difference between the high and low risk groups in the IC50 was compared by using Wilcoxon signed-rank test.
We downloaded the drug-related data from the NCI-60 database, which contains data on 60 different cancer cell lines in nine different types of cancer. F5, MTTP, SERPINE1, CYP19A1 and SLC52A3 mRNA expression levels and z scores for cell sensitivity data (GI50) were collected from 59 cell lines and were used to explore the relationship between gene expression and drug sensitivity by using Pearson correlation. In the correlation analysis, we analyzed the drug response of 262 drugs approved by FDA or on clinical trials.
A total of 350 cases of TCGA-STAD were downloaded from Xena as the entire cohort, which were further randomly divided into TCGA-STAD training cohort (n = 176) and TCGA-STAD testing cohort (n = 174). The former was for the construction of our prognostic prediction model and the latter was for internal validation (Fig. 1).
After search and screening on GeneCards database, we collected 600 LM genes. Among them, 376 differential expression genes (known as LM-DEGs) were screened out after differential expression analysis between cancer and normal tissues in TCGA-STAD training cohort. The gene-expression profiles of the top 60 of them are displayed in Fig. 2A. After univariate Cox regression analysis, finally, we obtained 5 genes (F5, MTTP, SERPINE1, CYP19A1 and SLC52A3, known as model genes) with nonzero coefficients. The expressions of the 5 genes were all significantly different between GC and normal tissues in TCGA-STAD training cohort (all P < 0.01, Fig. 2B-F), verifying the rationality and feasibility of using these 5 genes to develop our prediction model:
Risk score = F5 * 0.1496 + MTTP * 0.1838 + SERPINE1 * 0.1581 + CYP19A1 * 0.4849 + SLC52A3 * (-0.3032).
The risk scores of each patient were calculated separately using the above formula, and all these patients were divided into high and low risk group according to the optimal cut-off value we calculated.
In training cohort, Kaplan-Meier (KM) survival analysis revealed that patients of high risk group obviously had worse prognosis compared to those of low risk group, with significant difference (P < 0.001, Fig. 3A). And the corresponding receiver operating characteristic (ROC) curve of 5-year survival was shown in Fig. 3D, with the area under curve (AUC) of 0.763. Univariate and multivariate Cox regression analyses in TCGA-STAD training cohort showed that the risk score was confirmed to be an independent prognostic factor for patients' survival (both P < 0.05, Fig. 4A, 4D).
For internal validation, the patients in TCGA-STAD testing cohort were also divided into high and low risk group according to the same cut-off value, showing significant difference in KM survival analysis (P < 0.001, Fig. 3B) and with 5-year AUC of 0.781 (Fig. 3E). In entire cohort, our prognostic prediction model also had a good performance (Fig. 3C, 3F).
All the patients in TCGA-STAD entire cohort were further divided into two subgroups according to their age (> 65 or ≤ 65 years old), gender (male or female), tumor grade (G1-2 or G3), clinical stage (stage I-II or stage III-IV), T stage (T1-2 or T3-4), N stage (N0 or N1-3) and M stage (M0 or M1). All the corresponding KM curves were shown in Fig. 5. It should be specially explained that between high and low risk patients with M1, the result of log-rank test did not reach the level of statistical significance (P = 0.056), but with obviously separated survival curve. We think it might be caused by lack of numbers of cases. While in other subgroups, there were all significant differences between high and low risk groups (all P < 0.05), revealing that our prognostic prediction model works well without regard of patients' age, gender, tumor grade, clinical stage or TNM stage, that is, the patients in high risk group were usually associated with a poor prognosis.
According to the same cut-off value in TCGA-STAD training cohort, we did KM survival analysis on three independent datasets: GSE15459 (n = 191), GSE62254 (n = 300) and some cases in our center (n = 12). The results showed that the patients in high risk group were significantly associated with lower OS (all P < 0.05, Fig. 6), which was consistent with findings in TCGA-STAD training cohort. As for the 12 cases in our center, similar results were obtained regardless of whether the tissues were taken from GC or peritoneal metastasis.
To evaluate the sensitivity and specificity of the prognostic prediction model, external validation in the independent dataset of 32 types of TCGA tumors were performed. According to the results of KM survival analysis, it was found that our signature could significantly distinguish high and low risk patients with one of the following 20 types of TCGA tumors respectively: adrenocortical carcinoma (ACC), bladder urothelial carcinoma (BLCA), colon adenocarcinoma (COAD), glioblastoma multiforme (GBM), kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP), acute myeloid leukemia (LAML), lower grade glioma (LGG), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), mesothelioma (MESO), pancreatic adenocarcinoma (PAAD), pheochromocytoma and paraganglioma (PCPG), sarcoma (SARC), skin cutaneous melanoma (SKCM), stomach adenocarcinoma (STAD), thyroid carcinoma (THCA), thymoma (THYM), uterine corpus endometrial carcinoma (UCEC) and uveal melanoma (UVM) (all P < 0.05; Fig. 7A-T).
To further explore the potential biological functions of LM-DEGs, we performed GO annotation analysis and KEGG enrichment pathway analysis to visualize their functional profiles. The GO analysis showed that the top 5 enriched terms in BP group were "energy derivation by oxidation of organic compounds", "small molecule catabolic process", "ATP metabolic process", "cellular respiration" and "electron transport chain"; the top 3 enriched terms in CC group were "mitochondrial matrix", "mitochondrial inner membrane" and "mitochondrial protein complex"; the top 2 enriched terms in MF group were "electron transfer activity" and "tetrapyrrole binding" (Fig. 8A, 8B). Meanwhile, KEGG analysis showed that LM-DEGs were mainly enriched in "thermogenesis", "carbon metabolism" and "oxidative phosphorylation" (Fig. 8D, 8E). The above 13 enriched terms are all associated with cellular energy metabolism.
Furthermore, the transcript message of patients stratified by risk score into high and low risk groups were analyzed by GSEA (Fig. 8C, 8F). "Collagen containing extracellular matrix", "extracellular matrix structural constituent", "calcium signaling pathway", "focal adhesion" and "neuroactive ligand receptor interaction" were more enriched in high risk group, while "mitochondrial membrane organization", "mitochondrial protein complex", "mitochondrial respiratory chain complex assembly" and "mitochondrial transport" were more enriched in low risk group, suggesting that mitochondrial function of GC cells in high risk group was inhibited to a certain extent.
To evaluate the impact of the 5 model genes on patients' OS individually, the optimal cut-off values of the expression levels of the 5 model genes were calculated according to their expression levels separately, and patients in TCGA-STAD entire cohort were divided into high and low expression groups to explore the impact of the 5 model genes on patients' survival. The results of KM survival analysis showed that all these 5 model genes exerted significant effects on patients' survival individually (all P < 0.05, Fig. 9). High expressions of F5, MTTP, SERPINE1 and CYP19A1 may be risk factors for GC, while high expression of SLC52A3 may be a protective factor, which were consistent with our risk score formula.
We also analyzed immune infiltration levels of several specific types of immune cells. Macrophages M1, NK cells activated, T cells CD8 and T cells follicular helper were more enriched in low risk group, while Mast cells activated and NK cells resting were more enriched in high risk group (all P < 0.05, Fig. 10A-F). The KM survival analysis showed that high level of Mast cells activated was associated with a poor prognosis, and the high levels of NK cells activated, T cells CD8 and T cells follicular helper were associated with a better prognosis (both P < 0.05, Fig. 10G-J).
To further explore the association of the 5 model genes' expressions with immune microenvironment, we further compared their expressions among 5 immune subtypes in TCGA-STAD entire cohort. As shown in Fig. 11A, high expressions of F5, SERPINE1 and CYP19A1 were associated with Immune C6 (TGF-β dominant), and high expression of MTTP was associated with Immune C3 (Inflammatory). However, high expression of SLC52A3 was usually associated with Immune C2 (IFN-γ dominant).
The Estimate Score and Stromal Score calculated by "estimate" package showed the patients in high risk group were associated with significantly higher Estimate Score and Stromal Score, indicating their poor prognosis (both P < 0.001, Fig. 11B). The results of the correlation between the expressions of the 5 model genes and RNAss, DNAss, Stromal Score, Immune Score or Estimate Score in TCGA-STAD cohort showed that higher expression of SLC52A3 was associated with lower Estimate Score (P < 0.001), indicating SLC52A3 correlated with reduced risk of GC (Fig. 11C).
We also analyzed the correlation between the expressions of the 5 model genes and immune checkpoint molecules among TCGA pan-cancer, as shown in the scatter plots of Fig. 11D. The results showed that the higher expression of SERPINE1 was associated with higher levels of immune checkpoint molecules (PD-1, PD-L1 and CTLA4), indicating that SERPINE1 might promote tumor development by down-regulating immune response (P < 0.01).
We analyzed the estimated IC50 of the 9 chemotherapeutic drugs (Camptothecin, Cisplatin, Mitomycin C, Vinblastine, Docetaxel, Doxorubicin, Erlotinib, Paclitaxel and Sorafenib), the results showed that the patients in high risk group had significantly higher estimated IC50 levels of the 4 chemotherapeutic drugs (Camptothecin, Cisplatin, Mitomycin C and Vinblastine, all P < 0.05, Fig. 12A-D), revealing their lower drug sensitivity for high risk patients. While there were no significant declines of drug sensitivity among Docetaxel, Doxorubicin, Erlotinib, Paclitaxel and Sorafenib (Fig. S12).
Scatter plots were also drawn to study the correlation between the expressions of the 5 model genes and drug sensitivity. We found that the expression of the 5 model genes were related to increased or decreased drug sensitivity of a number of chemotherapeutic drugs, see Fig. 12E for more details.
Gastric cancer (GC) is one of the most common clinical malignant tumors worldwide, with high morbidity and mortality. The commonly used TNM staging and some common biomarkers, such as CEA, CA 12 − 5, CA 19 − 9 and CA 72 − 4, have a certain value in predicting the prognosis of GC patients, but they gradually failed to meet the clinical demands at present because of their lag in time and there're great differences in individual prognosis. Therefore, the construction of a prospective, accurate and individualized prognostic prediction model for GC is one of the most urgent clinical tasks at the present stage.
After differential expression analysis and univariate/multivariate Cox regression analysis, we screen out 5 genes as the model genes (F5, MTTP, SERPINE1, CYP19A1 and SLC52A3). In previous studies, Liu, Y have found that F5 gene is significantly upregulated in GC tissues, and may be a potential prognostic biomarker for GC. SERPINE1, as a tumor-promoting gene of gastric cancer, its high expression is significantly related to the poor prognosis of GC patients, which can be used as a biomarker for the diagnosis and prognosis of GC alone. CYP19A1 was found to be correlated with the poor prognosis of lung cancer and colon/rectal cancer, unfortunately there is no research on GC.
After calculating the risk score and successfully constructing the prognostic prediction model, to ensure its effectiveness and stability, the study first used TCGA-STAD testing cohort for internal verification, and used three independent datasets for external verification. These results all suggest that patients in high risk group were negatively correlated with their OS. We also evaluated the impact of the 5 model genes on patients' OS individually, all the results were consistent with our risk score formula.
To study whether the differences in clinical information of patients have an impact on the predictive performance of the model, we further divided TCGA-STAD entire cohort into several subgroups according to their age, gender, tumor grade, clinical stage and TNM stage. The KM analysis results were all in line with the expected results, showing that our model is not easily disturbed by the above clinical information of patients, suggesting its wide range of clinical predictive value.
For further analysis of the specificity and sensitivity of the model, 32 types of TCGA tumors were selected for external verification of the model. The results of the KM analysis showed that a higher risk score was associated with a lower OS among 22 TCGA tumors. Our explanation for this result is that these tumors may have similar metabolic characteristics and share some same nodes in their metabolic pathways. The identification of these nodes may contribute to a more in-depth study of the mechanism of these tumors and update the existing prognostic prediction models.
In addition, GO analysis and KEGG analysis were performed to further explore the potential biological functions of LM-DEGs, and it was found that all these enriched terms are associated with cellular energy metabolism. For deeper study of the biological functions, we performed GSEA to explore the different enrichment pathways between high and low risk groups in TCGA-STAD entire cohort. Interestingly, it was found that all the "mitochondrial-related" pathways were more enriched in low risk group, suggesting that mitochondrial function of GC cells in high risk group was inhibited to a certain extent, which is consistent with the fact that even under the condition of sufficient oxygen, GC cells still absorb a large amount of glucose and metabolize it into lactate.
Importantly, tumor-infiltrating immune cells seems to have more and more significant prognostic value of tumors[29, 30], including GC. In our study, Mast cells activated were more enriched in high risk group and led to poor prognosis. It is found in previous studies that Mast cells enriched in tumor tissue can produce a large amount of pro-angiogenic factors, which promote angiogenesis and lead to poor prognosis[32, 33]. Sammarco found the same phenomenon in GC. Macrophages M1, NK cells activated, T cells CD8 and T cells follicular helper were found to be more abundant in patients of low risk group, and the latter three significantly affected patients' OS. Husain found tumor-derived lactate inhibits NK cell function. Fischer found tumor cells can increase surrounding lactate concentration and further inhibit the function of cytotoxic T cells. Interestingly, M1 and M2 macrophages were found to have almost opposite effects on tumor patients' prognosis, and the former has been proved to improve patients' OS.
As for tumor microenvironment analysis, Baumann found lactate induces TGF-β expression, and some other researchers' results showed TGF-β promotes GC malignancy and metastasis, which can be suppressed by attenuating TGF-β. It is worth noting in our study that SERPINE1 and CYP19A1, may act as oncogenic genes, were found to associated with Immune C6 (TGF-β dominant), which is consistent with the above researches. SLC52A3 was found to be associated with Immune C2 (IFN-γ dominant) in our study. Intriguingly, IFN-γ was once recognized as a tumor-killing cytokine, but there's controversial in recent years and its mechanism is not clear. Based on the ESTIMATE algorithm, patients in high risk group were associated with significantly higher Estimate Score and Stromal Score, indicating a correlation with poor prognosis. And higher expression of SLC52A3 was associated with lower Estimate Score, indicating better correlation. Part of the above results, as well as the results of immune checkpoint molecules, are not in line with expectations, which we think may be related to the complexity and polyvalence of metabolic pathways. On the whole, however, these results accord with our expectations.
According to our drug sensitivity analysis, Camptothecin, Cisplatin, Mitomycin C and Vinblastine showed lower drug sensitivity among GC patients in high risk group compared with those of low risk, while there're no significant reduced drug sensitivity among Docetaxel, Doxorubicin, Erlotinib, Paclitaxel and Sorafenib. Meanwhile, the expression of the 5 model genes were found to be related to increased or decreased drug sensitivity of a number of chemotherapeutic drugs. These results can provide a new and fresh basis for clinicians to formulate individualized chemotherapy regimens for GC patients in high risk group based on our prognostic prediction model.
However, some limitations exist in our study. Most of our data was downloaded from the public database, with the clinical information of some patients incomplete, and there's only a small number of cases from our center. At the same time, as a retrospective study, information bias is inevitable, so it is necessary to prove the clinical predictive value of the model through more convincing prospective studies in future clinical practice.
In this study, we excavated and screened a series of LM-DEGs and developed a prognostic prediction model for GC patients based on these genes. Through a series of bioinformatics and statistical analysis, our model has been validated to have a good performance to predict the prognosis of GC patients, and can guide clinicians to provide more effective and individualized therapeutic schemes for GC patients.
Ethics approval and consent to participate: Not applicable
Consent for publication: Yes
Availability of data and materials: Yes
Competing interests: The authors declare no competing interests.
Acknowledgements: Not applicable
Funding: This research was not solicited and did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Siyu Wang: Investigation, Methodology, Visualization, Software, Roles/Writing - original draft. Yuxin Wang: Data curation, Roles/Writing - original draft, Writing - review & editing. Xin Wang: Data curation. Shuqiang Yuan: Conceptualization, Resources, Supervision.