A Prognostic Nomogram Combining Gene Expression Proles and TNM Stage Predicts Survival in Gastric Cancer

Gastric carcinoma (GC) is a highly aggressive malignancy and is associated with high morbidity and mortality rates around the world, the current tumor-node-metastasis (TNM) staging system is inadequate to predict overall survival (OS) in GC patients. therefore, potential forecasting methods for prognosis are important to investigate. Differentially expressed genes (DEGs) were screened using gene expression data from The Cancer Genome Atlas (TCGA). We then construct a risk score signature model by univariate Cox proportional hazards regression (CPHR) analysis, the Kaplan-Meier method(cid:0)KM(cid:0)and multivariate CPHR analysis. Using TNM stage, we developed a signature-based nomogram. Finally, we utilize an independent Gene Expression Omnibus dataset (GSE62254) validate the prognostic value of risk score signature model and nomogram. We identied ve OS-related mRNAs among 1113 mRNAs that were differentially expressed between GC and normal samples in the TCGA dataset. We then constructed a ve-mRNA signature model, which eciently distinguished high-risk from low-risk patient in both cohort, and even viable in the TNM stage-III, gender(male, female) and age((cid:0)65-year-old, ≥ 65-year-old) subgroups (P<0.05). Utilizing TNM stage, we developed a signature-based nomogram, which performed better than use the TNM stage or ve-mRNA signature alone for prognostic prediction in the TCGA and GSE62254 dataset. These results suggest that both risk signature and nomogram were effective prognostic indicators for patients with GCs, and could potentially be used for individualized management of such patients.


Introduction
GC is the fth leading cause of cancer and the third leading cause of cancer related deaths worldwide, accounting for up to 7% of cancer occurrences and 9% of deaths [1][2][3] . Patients with GC are rarely diagnosed at an early stage, and the prognosis of patients with late-stage GC remains extremely poor regarding recurrence and metastasis; >70% of patients eventually die from this disease 4,5 . Pretreatment evaluation may help identify patients with GC at high risk for recurrence, which may guide the future development of targeted treatment strategies and reduce mortality and recurrence rates. Therefore, it is essential to seek prognostic indicators and more effectively to screen the prognostic factors of patients with GC.
In recent years, the analysis of biological information, also known as bioinformatics, has attracted a great deal of attention and sustained breakthroughs in the search for oncogenic genes [6][7][8][9][10] . Various functions for molecular typing, prognostic prediction, new targeted drug development applications and biomarkers of prognosis have been con rmed 3,[11][12][13][14] . Thus, we used bioinformatics to identify genes predictive of GC prognosis. In this study, we utilized a rigorous computational framework to mine mRNA expression pro les and clinical data from The Cancer Genome Atlas stomach adenocarcinoma (TCGA-STAD Project'). Then, we constructed a risk signature of mRNAs and a nomogram integrating the signature with clinical features, and the predictive performances of the signature and the nomogram were validated in GSE62254 dataset.

Methods
Data source and pre-processing The raw counts of the RNA expression pro les and the clinical data for 375 GC patients and 32 normal control patients from the publicly available TCGA-STAD Project were downloaded directly from the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/, updated until March 31, 2021). All expression pro les were obtained as HT-seq raw read counts and were annotated with the Ensemble reference database(ftp://ftp.ensemble.org/pub/release-93/gtf/homo_sapiens). The mRNA expression pro les were normalized and variance stabilizing transformation was performed with the "DESeq2" package in R software. The present study was conducted in accordance with the publication guidelines and data access policies of TCGA (http://cancergenome.nih.gov/publications/publicationguidelines).
The GSE62254 microarray expression data were downloaded from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/), which contained data on 300 patients with GC and their associated clinical information.

Screening of differentially expressed mRNAs
Differentially expressed mRNAs (DEMs) between GC samples and normal control samples were detected with the "limma" package use R software in the TCGA dataset. We de ned mRNAs with adjusted P values <0.01 and log2|fold change| values >2. Volcano plots and heatmaps were visualized with the "ggplot2" and "pheatmap" packages of R software, respectively.

Identi cation of OS-related mRNAs in GC patients
To identify prognostic mRNAs, we removed patients without accurate survival data, such as survival for less than 0 days. The association between DEM expression and OS was evaluated by univariate Cox proportional hazards regression (CPHR) analysis and the Kaplan-Meier method. Only DEM with logical consistency between their expression and prognostic effects were considered as candidate OS-related mRNAs. After excluding patients without de ned clinical characteristics, the important clinicopathological characteristics of the patients in the TCGA and GSE62254 dataset are shown in Table 1. In the TCGA dataset, the candidate OS-related mRNAs were selected for multivariate CPHR analysis by R software. To optimize the tting accuracy comprehensively with a moderate number of parameters, we computed the Akaike information criterion (AIC) and used it to estimate the relative quality of the statistical models for the given set of data. The best-t predictive model with the lowest AIC was chosen.
Identi cation and assessment of the ve-mRNA signature After choosing the best-t OS-related mRNAs through the above steps, we performed a multivariate CPHR analysis to calculate the coe cient of each mRNA in the TCGA dataset. We thereby constructed a risk score formula, weighted by the linear combination of the expression values of the best-t OS-related mRNAs and their corresponding estimated regression coe cients. The formula is as follows: risk score = X 1 α 1 + X 2 α 2 + X 3 α 3 +…+ X n α n . Where X is mRNA expression level and α is the corresponding estimated regression coe cient from the multivariate CPHR analysis, the same formula was used to calculate the risk score of GSE62254 dataset. Using the median risk score of the TCGA dataset as the cut-off value, we divided patients into high-risk and low-risk groups. The Kaplan-Meier method and log-rank test were performed to assess the survival differences between the high-risk and low-risk groups in each dataset. Additionally, a strati ed analysis was conducted to assess whether the association of the ve-mRNA signature with OS was independent of the TNM stage and other clinical risk factors. To further evaluate the prognostic performance of the mRNA-based classi er, we plotted time-dependent receiver operating characteristic (ROC) curves and calculated the area under the time-dependent ROC curve (AUC) values in both datasets, with three and ve years as the de ning points.

Development of the mRNA signature-based prognostic nomogram
To identify independent predictors of OS, we tested conventional clinical risk factors and the mRNAbased signature through univariate and multivariate CPHR analyses of the TCGA and GSE62254 dataset. A prognostic nomogram was then established with the package "rms" in R software. Calibration curves were conducted to assess whether the predicted survival in the nomogram was agreement with the actual survival. The predictive performances of the prognostic model were also evaluated using AUC in the ROC analysis and concordance index (C-index) calculation.
Statistical analysis R (https://www.r-project.org/) was used as the main tool for data analysis and mapping. Univariate CPHR analysis and the Kaplan-Meier method were used to obtain candidate OS-related mRNAs. Multivariate CPHR analysis was then performed to screen variables and determine the risk score formula. OS was analyzed using Kaplan-Meier survival curve analysis and the log-rank test. A time-dependent ROC curve was used to assess the speci city and sensitivity of the prognostic prediction at each time point. The nomogram incorporating both the mRNA signature and independent clinical risk factors was developed through a multivariate CPHR analysis and was validated with the C-index and calibration curves.

Candidate OS-related mRNAs from GC patients
The overall design and owchart of this study is presented in Figure 1. In total, 375 GC patients from the TCGA database were included, we compared mRNA expression pro les of the 375 GC samples with those of 32 normal samples. We identi ed 1113 differentially expressed mRNAs (DEMs) with a log2|fold change| >2 and an adjusted P value <0.01. Of the 1113 DEMs, 822 mRNAs were found to be upregulated and 291 were found to be downregulated in the GC patients. The volcano plots and heatmaps of DEMs were visualized with the "ggplot2"and "pheatmap" packages of R software, and are shown in Figure 2.
After the exclusion of 43 patients with insu cient survival data or availability of clinical data, 332 GC patients remained in our study. All 1113 DEMs were subjected to univariate CPHR analysis (P<0.01) and Kaplan-Meier analysis (P<0.05), with OS as the dependent variable and the mRNA level as the explanatory variable. As shown in Supplementary Table 1, 11 mRNAs were signi cantly associated with the OS of GC patients. Six of these 11 mRNAs (SERPINE1 MATN3 COL10A1 IGFBP1 CST2 VCAN) had hazard ratios (HRs) greater than 1, suggesting that their overexpression was associated with shorter OS. On the other hand, the HR of ve mRNAs (PRKACG MTBP ARHGEF38 RAD54L HSD17B3) was less than 1, with the opposite implications. The Kaplan-Meier analysis curves were consistent with the univariate CPHR analysis results (Supplementary Figure 1). Thus, we considered these dysregulated mRNAs as candidate OS-related mRNAs.
Identi cation and validation of a ve-mRNA signature for survival prediction To identify the best-t OS-related mRNAs, we ltered these candidate mRNAs through a multivariate CPHR analysis (stepwise model). We used the AIC to avoid over-tting. The ve OS-related mRNAs with the largest likelihood ratios and lowest AIC values (MATN3, HSD17B3, PRKACG, SERPINE1 and IGFBP1) were selected from the stepwise model (Table 2) and integrated into a predictive signature based on their risk coe cients. The formula was as follows: Risk Score = (0.338×Expression MATN3 ) + (-1.077 × Expression HSD7B3 ) + (-6.063 × Expression PRKACG ) + (0.207× Expression SERPINE1 ) + (0.271× Expression IGFBP1 ). Then, we calculated the ve-mRNA-based risk score for each GC patient in the TCGA dataset. Using the median risk score as the cut-off value, we divided patients into high-risk and lowrisk groups. The distributions of the OS statuses, mRNA-based risk scores and ve mRNA expression pro les in the TCGA dataset are shown in Figure 3A. The pheatmap showed that 3 DEM (MATN3, SERPINE1 and IGFBP1) were expressed at higher levels in the high-risk group than in the low-risk group, and 2 DEM (HSD17B3 and PRKACG) was overexpressed in the low-risk group. Kaplan-Meier curve analysis clearly demonstrated that the high-risk group had a poorer prognosis than the low-risk group (P=1.867e-04) ( Figure 3B). Subsequently, we constructed a time-dependent ROC curve with the TCGA dataset. As shown in Figure 3C, the AUC of the ve-mRNA signature reached 0.700 at three years and 0.788 at ve years ( Figure 3C).
The performance of the ve-mRNA signature for predicting survival was then validated with the GSE62254 dataset (n=300). When we used the ve-mRNA signature and cut-off value derived from the TCGA dataset, the distributions of the OS statuses, ve-mRNA-based risk scores and ve-mRNA expression pro les in GSE62254 dataset were consistent with the ndings described above ( Figure 4A). Similar to the results in the TCGA dataset, a Kaplan-Meier curve analysis indicated that the survival time of GC patients was signi cantly shorter in the high-risk group (n=124) than in the low-risk group (n=176) (P=8.014e-03) ( Figure 4B). The AUC of the ve-mRNA signature was 0.600 at three years and 0.583 at ve years ( Figure 4C). Thus, the ve-mRNA signature was great predictive value of GC patient.
The prognostic value of the ve-mRNA signature was independent from those of conventional clinical risk factors Next, we tested whether the prognostic performance of the ve-mRNA signature was independent from those of conventional clinical risk factors. A multivariate CPHR analysis demonstrated that the HR of a high vs. low-risk score was 1.761 (P <0.001) in the TCGA dataset and 1.325 (P<0.001) in the GSE62254 dataset (Table 3 and Supplementary Table 2), indicating that the ve-mRNA signature could independently predict the prognoses of GC patients.
Considering the number of GC patients, we combined the GC patients of TCGA and GSE62254 dataset for performed a risk-strati ed analysis. The 632 GC patients were strati ed into a stage-I subgroup (n=75), stage-II subgroup (n=206), stage-III subgroup (n=240) and stage-IV subgroup (n=111) based on their TNM stage. each subgroup was divided into a high-risk group and a low-risk group based on the risk scores proposed above. We found that the classi cation e ciency of the ve-mRNA signature was limited when it was applied to certain subgroups. As shown in the Kaplan-Meier curves, for the stage-III, patients in the high-risk group had signi cantly poorer survival than those in the low-risk group (P=4.937e-03, log-rank test) ( Figure 5E). However, the ve-mRNA signature did not reach the threshold of signi cance in the stage-I, stage-II and stage-IV subgroup ( Figure 5C, 5D, 5F). When a strati ed analysis was carried out based on age (≥65-year-old or <65-year-old) and gender (Male or Female), all subgroup did the ve-mRNA signature subdivide patients into a high-risk group and a low-risk group with signi cantly different survival (female subgroup, P=1.611e-04; male subgroup, P=2.775e-03; ≥65-year-old subgroup, P=4.19e-06; <65-year-old subgroup, P=1.69e-02) ( Figure 5A, 5B, 5G and 5H), and there is no difference in the survival time of male and female in High risk and Low risk subgroup ( Figure 5I and 5J). Thus, although the ve-mRNA signature could be viewed as an independent prognostic predictor for GC patients, its performance was limited to speci c subgroups.

Development of a nomogram combining the ve-mRNA signature with TNM stage
Clinical risk factors such as the TNM stage are still vital predictors of OS in GC patients 9 . Therefore, we integrated these traditional risk factors with our ve-mRNA signature to develop an e cient quantitative method of predicting OS. we evaluated the prognostic value of several clinical risk factors in univariate and multivariate CPHR analyses of the TCGA and GSE62254 dataset. We found that, in addition to the ve-mRNA signature, age (≥65 vs. <65) and TNM stage (III-IV vs. I-II) were signi cantly associated with OS (P<0.05) ( Table 3 and Supplementary Table 2). Ultimately, we selected 5-mRNA signature and TNM stage for multivariate CPHR analysis, and integrated them into nomogram. We then used this nomogram to predict the three-year and ve-year survival of GC patients ( Figure 6A). As shown in the nomogram, the ve-mRNA signature contributed the most to the three-and ve-year OS, followed closely by the TNM stage. This user-friendly graphical tool allowed us to determine the three-and ve-year OS probability for each GC patient easily.
We then evaluated the discrimination and calibration abilities of the prognostic nomogram by using a Cindex and calibration plots. A validation using a bootstrap with 1000 resamplings revealed that the nomogram performed well for discrimination: The C-index was 0.686 for the TCGA dataset, and 0.711 for GSE62254 dataset. The three-year and ve-year OS probabilities generated by the nomogram were plotted against the observed outcomes, as shown in Figure 6B-6E.
We further assessed the prognostic performance of the nomogram in a time-dependent ROC curve analysis. The AUC of the nomogram was 0.704 at three years and 0.708 at ve years in the TCGA dataset ( Figure 7A). In the GSE62254 dataset, the AUC was 0.760 at three years and 0.752 at ve years ( Figure  7B).

Survival prediction power: comparison of the ve-mRNA signature-based nomogram and other risk factors
To compare the predictive sensitivities and speci cities of different prognostic factors, we used timedependent ROC curves. As shown in Figure 7C, the predictive performance of the ve-mRNA-based nomogram (AUC=0.704) was superior to the performance of the ve-mRNA signature (AUC=0.700) and the TNM stage (AUC=0.595) in the TCGA dataset. In the GSE62254 dataset, the predictive value of the ve-mRNA-based nomogram (AUC=0.760) was better than the ve-mRNA signature (AUC=0.600) and the TNM stage (AUC=0.744) also ( Figure 7D). Thus, the newly developed prognostic nomogram concentrated the advantages of the ve-mRNA signature and TNM stage, improving their prognostic predictive e ciency for GC patients.

Discussion
Gastric cancer is a malignant cancer with a high mortality rate. Most patients with GC are diagnosed at the terminal stage of the disease due to its nonspeci c clinical symptoms in the early stages, which also creates a challenge for treatment 15 . Prognostic prediction for GC patients largely relies on American Joint Committee on Cancer/Union for International Cancer Control (AJCC/UICC) on Cancer TNM staging system at daily clinical practice [16][17][18] . However, patients with similar TNM stage can exhibit variable responses to therapy. A series of genomic landscape discoveries have demonstrated that this phenomenon may be due to tumor heterogeneity and genomic heterogeneity 19,20 . Thus, a reliable prognostic model for GC is urgently required in the era of precision medicine.
Several studies have been found that mRNA involved in cycle regulation, cell adhesion, angiogenesis, and tumor carcinogenesis have been reported to play a crucial role in forecasting survival outcome of GC patients [21][22][23][24] . Other studies also successfully identi ed several lncRNA-, miRNA-, and mRNAexpression based risk signatures to prevent the development of various tumors [25][26][27][28] . In the present study, based on public high-throughput mRNA expression pro les and clinical data from TCGA-STAD Project, we discovered a novel ve-mRNA signature that could effectively identify high-risk GC patients, and validation that patients with GC in the high-risk group had a shorter survival than those in the low-risk group. At the same time, the signature also showed a high prediction accuracy and robustness on calculating AUC in ROC analysis.
As interest in personalized medicine has grown, a few prognostic risk classi ers have been identi ed and found to enhance survival predictions in a variety of cancers [29][30][31][32][33][34][35] . However, most of these studies have focused only on statistical power in the screening of molecular markers, without regard for their clinical signi cance, in our study, we combined the TNM stage with molecular pro ling. Ultimately, we constructed a ve-mRNA signature-based nomogram to quantify an individual's probability of OS. The predictive performance of our proposed prognostic nomogram was superior to those of the ve-mRNA signature or the traditional TNM stage alone. This objective probability scale should be simple for patients and clinicians to understand and use in clinical practice 36 .
The most important convenience of our nomogram is its simplicity. Prognostic models are designed to identify the associations between risk factors and outcomes based on essential features, and should be accurate and parsimonious 14,37 . Our ve-mRNA signature-based nomogram relies on routinely available variables, including genetic differences (the ve-mRNA signature) and a histopathological characteristic (TNM stage). Therefore, clinicians can easily estimate outcomes and make clinical decisions for individual GC patients.
The most attractive biomarkers for clinical applications are those that provide accurate prognoses for patients, stratify patients into different risk groups and thus help clinicians choose the most effective treatment. In our study, the predictive capacity of our ve-mRNA signature was independent from those of conventional clinical factors including age, TNM stage, lymph node metastasis and distant metastasis. In our strati ed analysis, the ve-lncRNA signature performed well for risk strati cation in the male, female, stage-III, <65-year-old and ≥65-year-old subgroups. Notably, however, its application was limited in the stage-I, II and IV.
Although our newly proposed prognostic nomogram performed well in predicting survival for GC patients, the limitations of this study should be noted. One hand, the database of the TCGA and GEO databases lacks certain important pre-and postoperative parameters (e.g., chemotherapy, radiotherapy, immunotherapy), so we could not carry out a comprehensive survival analysis with these potential factors. On the other hand, we used data from an open-access published database and therefore our study design was retrospective. We are actively gathering samples and corresponding clinical data from a large number of GC patients to further validate our ndings and to determine whether our nomogram improves the prediction of accuracy.
Conclusion we successfully constructed a ve-mRNA risk signature correlated with GC prognosis in the TCGA and GSE62254 cohort. The results indicated that the signature is a potent predictive indicator for patients with GC. Furthermore, we identi ed and validated a novel and robust nomogram incorporating the signature and clinical factors to predict the 3-and 5-year OS rates of patients with GC, which could aid in the individualized management of GC patients in the future.  Table 2 Five-mRNAs signi cantly associated with overall survival in the TCGA dataset. Abbreviations: HR, hazard ratio; CI, con dence interval.       Comparison of male and female risks in high-risk subgroup (I), Comparison of male and female risks in low-risk subgroup (J). The tick-marks on the curve represent the censored subjects. The differences between the two risk groups were assessed with two-sided log-rank tests. Nomogram for predicting OS. Instructions: Locate each characteristic on the corresponding variable axis, and draw a vertical line upwards to the points axis to determine the speci c point value. Repeat this process. Tally up the total points value and locate it on the total points axis. Draw a vertical line down to the three-or ve-year OS to obtain the survival probability for a speci c gastric cancer patient. (B-E) Calibration plots of the nomogram for predicting OS at three years (B) and ve years (C) in the TCGA dataset, and at three years (D) and ve years (E) in the GSE62254 dataset. The 45-degree dotted line represents a perfect prediction, and the red lines represent the predictive performance of the nomogram.

Figure 7
The prognostic value of the composite nomogram in comparison with TNM stage in the TCGA and GSE62254 dataset. Time-dependent ROC curves of the nomogram for predicting OS in the TCGA(A) and GSE62254(B) dataset. The prognostic accuracy of the three-mRNA-based prognostic nomogram compared with those of the three-mRNA signature and TNM stage in the TCGA dataset(C) and GSE62254(D) dataset.