Selection of DEIGs
Altogether 3841 genes in the GSE62254 dataset and 4772 genes in the GSE84437 were found to be differentially expressed between lymph node positive (LNP) patients and lymph node negative (LNN) patients (Figure S1A and B, Table S1 and 2). By intersecting above DEGs with the immune-related genes on the basis of InnateDB and ImmPort, 54 hub genes with the same direction of change were obtained (Figure 1A). 54 hub genes were significantly associated with 338 GO terms and 40 KEGG pathways (Table S3 and 4) using functional enrichment analysis. We further showed the top six GO terms and 20 KEGG pathways in Figure S1C and S1D.
Construction of a lymph node metastasis gene signature for gastric cancer
To narrow down 54 differentially expressed immune-related hub genes affecting LNS of GC patients, a series of 14 genes were identified by executing LASSO logistic regression in the training dataset (Figure 1B). Further, the final model, a lymph node metastasis gene signature for gastric cancer (LGSGC), that consisted of the 14 immune-related genes by multivariate logistic regression was constructed (Figure 1C). Based on final model the formula of risk score: (1.191×expression of PMAIP1) -(0.777× expression of TRIM65) - (0.616× expression of RFXAP)+(0.647× expression of RB1) - (0.693× expression of CASP1)+(0.747× expression of SECTM1) - (1.022×expression of UBE2V2)+(0.756× expression of UBE2W) +(0.779×expression of GNAI3) - (0.528×expression of DDX3X) - (0.821×expression of CMTM6)+(0.751×expression of FSTL1) - (0.635×expression of SOCS2) - (0.734×expression of NLRP13). Compared to LNN patients, LNP patients had more risk score in the training dataset (p=3.2e−11), the GSE62254 dataset (p=1.3e−09), the GSE84437 dataset (p=2.22e−16), the TCGA dataset (p=2.9e−16) and the clinical dataset (p=2.3e−05) (Figure 2A1-E1). ROC analysis showed that the risk scores could distinguish LNP patients from LNN patients in the training dataset (AUC =0.85, 95% CI 0.79–0.91; Figure 2F).
To further explore the functional and molecular profiles of 14 immune-related genes identified, enrichment analysis was conducted by the Metascape software, showing that a total of 30 functional enrichment was involved in five related-apoptotic processes, four related-cell growth regulation and two related- autophagy, which were significantly correlated with the development and progression of cancer  (Figure 3A, Table S6, P < 0.05). Strikingly, only two pathways enriched by 14 immune-related genes included viral carcinogenesis and PID P53 downstream pathway, which were well known to play a key role in inducing tumor cell proliferation and migration (Figure 3B, Table S7, P < 0.05). Enrichment network showed that biological processes including apoptotic signaling pathway, protein polyubiquitination, homeostasis of number of cells, regulation of autophagy and positive regulation of proteolysis was generally shared among 14 genes (Figure 3C). For example, PMAIP1 played a centripetal role linking four enriched biological processes, and RB1, CASP1, UBE2V2 and DDX3X played a centripetal role linking three enriched biological processes, respectively.
Validation of the LGSGC
All patients in the several datasets (the GSE62254 dataset, the GSE84437 dataset, the TCGA dataset and the clinical dataset) obtained the risk scores that were calculated on the basis of the above formula. Expectedly, the LGSGC could distinguish LNP patients from LNN patients in the GSE62254 dataset (AUC =0.80, 95% CI: 0.73–0.88; Figure 2F), the GSE84437 dataset (AUC =0.82, 95% CI: 0.76–0.87), the TCGA dataset (AUC =0.77, 95% CI: 0.72–0.82) and the clinical dataset (AUC =0.76, 95% CI: 0.66–0.85). Next, in order to explore the efficacy performance of the LGSGC detecting LNS in early GC patients, ROC analysis was implemented in participants with pathological T1 and T2, showing that the LGSGC could distinguish LNP patients from LNN patients in the training dataset (AUC=0.83, 95% CI: 0.72–0.94; Figure 2G), the GSE62254 dataset (AUC=0.79, 95% CI: 0.70–0.88), the GSE84437 dataset (AUC=0.83, 95% CI: 0.71–0.95), the TCGA dataset (AUC=0.78, 95% CI: 0.67–0.88) and the clinical dataset (AUC=0.79, 95% CI: 0.63–0.95).
Next, after adjusting for other clinicopathologic features such as subtypes, Union for international cancer control (UICC) stage, Lauren classification, gender and age, the LGSGC was an independent diagnostic biomarker of detecting LNM using multivariate logistic regression analysis in the training dataset (odds ratio [OR] =3.14, 95% CI: 1.93–5.90; Figure 1D). The LGSGC as an independent diagnostic factor of detecting LNM was validated in the GSE62254 dataset (OR=7.59, 95% CI: 3.07–23.76; Figure 2SA), the GSE84437 dataset (OR=6.37, 95% CI: 3.76–11.33; Figure 2SB), the TCGA dataset (OR=7.90, 95% CI: 3.95–17.75; Figure 2SC) and the clinical dataset (OR=2.13, 95% CI: 1.46–3.32; Figure 2SD). Furthermore, UICC stage, Neoplasm histology, T Stage, mesenchymal phenotype (MP) and epithelial phenotype (EP) were significantly associated with LNM in GC (Figure 1D and Figure 2S). Intriguingly, the combination model of the LGSGC, clinicopathologic factors such as subtypes, UICC stage, Neoplasm histology, T Stage, mesenchymal phenotype (MP) and epithelial phenotype (EP) had significantly superior diagnostic accuracy for LNM compared to the LGSGC in the GSE15459 dataset (AUC=0.92, 95% CI: 0.88–0.97; Figure 2H), the GSE62254 dataset (AUC=0.96, 95% CI: 0.92–1.00), the GSE84437 dataset (AUC=0.83, 95% CI: 0.78–0.88), the TCGA dataset (AUC=0.95, 95% CI: 0.93–0.98) and the clinical dataset (AUC=0.81, 95% CI: 0.72–0.89).
Association between the LGSGC and the survival of GC patients
LNM proved to have a significant association with poor survival of GC patients, we further estimated prognostic potential of the LGSGC for GC patients. Each patient was categorized into low- or high-risk group based on the cutoff thresholds, derived by the Youden index from the LGSGC model. Reassuringly, high-risk group exhibited worse prognosis than low-risk group in the training dataset (HR=2.42, 95% CI: 1.55–3.78; Figure 4A), consistent with the result in the GSE62254 dataset (HR=1.68, 95% CI: 1.17–2.40; Figure 4B) and the GSE84437 dataset (HR =1.51, 95% CI: 1.12–2.05; Figure 4C). The Landmark analysis showed that only those patients with survival more than 29 months from high-risk group exhibited worse prognosis (HR=4.31, 95% CI: 1.16–15.93; Figure 4D) in the TCGA dataset. Furthermore, high-risk group had shorter disease-free survival (DFS) compared with low-risk group (HR=1.78, 95% CI: 1.20–2.65; Figure 4SA) in the GSE62254 dataset and high-risk group with survival more than 44 months exhibited significantly shorter DFS compared with low-risk group (HR=6.65, 95% CI: 1.16-38.03; Figure 4SB) in the TCGA dataset. Because of limited dead samples (n=7), we did not conduct survival analysis in different risk groups from the clinical dataset.
Molecular characteristics of different risk groups
GSEA showed that the gene sets were significantly enriched in 18 pathways, which mainly included cancer and immune response-related pathways (Figure 5A and Table S5, p < 0.05) in high-risk group and 20 pathways which mainly included cancer and tumor metabolism-related pathways (Figure 5B and Table S5, p < 0.05) in low-risk group.
To understand biological immunological nature of different risk groups, we analyzed somatic mutations of GC patients from the TCGA dataset. It was found that missense mutation was the most frequent kind of mutations, followed by nonsense mutation and frameshift deletion in both risk groups (Figure 5C). The mutations of DNAH5, OBSCN and HMCN1 were very much in evidence in low-risk group while the mutations of CSMD3 and FAT4 were very much in evidence in high-risk group. Of note, the mutation rates of SYNE1, LRP1B, MUC16, TP53 and TTN were higher than 20% in both risk groups.
Immune characteristics of different risk groups
It was found that low-risk group had more abundant T cells CD4 memory resting, Plasma cells, NK cells resting, Mast cells resting and B cells naïve in the training dataset while high-risk group had more abundant T cells follicular helper, T cells gamma delta, M0 Macrophages, M1 Macrophages, M2 Macrophages and Neutrophils (Figure 6A). Next, we analyzed the relation between risk score and some clinicopathological factors such as subtypes, gender, Lauren-classification, UICC stage, alive time, alive status, age, T stage and LNS. It was found that the distribution of GC subtypes (proliferative, metabolic, invasive and unstable), UICC stages, survival time, lymph node status and alive status had significant differences in both risk groups (Figure 6B). The ssGSEA scores showed that low-risk group had more plentiful Mast cells, APC co inhibition, NK cells and B cells while high-risk group had more plentiful Type I IFN response, check point, APC co stimulation, Macrophages, Inflammation promoting, Para inflammation, MHC class I and Treg (Figure 6S1).
Immune cells can affect the survival in multiple types of tumors. So, we assessed the prognostic efficacy of certain immune cells in the training dataset. Figure 6S2 indicated those patients with higher scores of APC co inhibition, T cells CD4 memory activated, B cells naïve, B cells, MHC class I, iDCs, Tfh, NK cells, NK cells resting, Mast cells activated, TIL and Plasma cells had longer (p < 0.05) survival, while patients with higher scores of NK cells activated, M2 Macrophages, Dendritic cells resting, M1 Macrophages, T cells follicular helper, APC co stimulation, Para inflammation, T helper cells, Treg and Type I IFN response had shorter (p < 0.05) survival.
Association between LGSGC-GC subtypes and other molecular and immune-GC subtypes
Previous a study suggested that four subtypes of GC including proliferative, metabolic, invasive and unstable had significant differences of molecular and genetic characteristics and response to therapy . Figure 7A showed a significant difference (p=0.014, chi-square test) in the distribution of four subtypes of GC between two risk groups. Low-risk group had more metabolic-subtypes samples and less unstable-subtypes samples while high-risk had less metabolic-subtypes samples and more unstable-subtypes samples. Next, a novel molecular classification of GC according to genomic alterations, recurrence pattern and prognosis was reported.
Figure 7B showed a significant difference (p=0.003, chi-square test) in the distribution of four subtypes of GC in two risk groups. Low-risk group had more EMT and MSI subtypes and less MSS/TP53+ subtypes while high-risk group had less EMT and MSI subtypes and more MSS/TP53+ subtypes in the TCGA dataset. However, significantly different distribution of four subtypes of GC including MSS/TP53-, MSS/TP53+, EMT and MSI subtypes was not found in the GSE15459 dataset and the GSE62254 dataset (Figure 7S, p=0.071 and 0.752, chi-square test, respectively). Although Oh SC et al identified two molecular subtypes of GC, epithelial phenotype (EP) and mesenchymal phenotype (MP) which were corelated with LNM in GC (Figure 2SA), the distribution of two subtypes did not have a significant difference in two risk groups. (Figure 7C, p=0.366, chi-square test) .
Association between the LGSGC and cancer ICI treatment response
To assess the efficacy of the LGSGC identifying patients with response to ICI treatment, 43 patients with advanced GC administered pembrolizumab or nivolumab from a dataset study were selected [21, 22]. It was found that only 32.4% (11/43) of subjects achieved partial response and complete response to pembrolizumab or nivolumab. Next, according to the RNA-seq data provided by Noh MG, et al, each participant gained the risk score calculated by the 14 immune-related gene signature model . Those patients who achieved response to PD-1 inhibition had higher risk scores than those who did not achieve response to PD-1 inhibition (Figure 8A, p=0.039). Strikingly, the LGSGC had significantly diagnostic accuracy for responders from non-responders to PD-1 inhibition with an AUC value of 0.71 (95% CI: 0.53-0.89; Figure 8B).
Other three datasets were also used to assess the efficacy of the LNMGC predicting response of cancer patients to PD-1 inhibition [23-25]. The standardized risk scores according to the LNMGC in both risk groups were significantly different in the IMvigor210 dataset (p=0.028, Figure 8C), the bladder cancer dataset (p=0.00014, Figure 8F) and the melanomas dataset (p=0.034, Figure 8H). Additionally, patients with high-risk scores exhibited worse prognosis compared with that with low-risk scores in the IMvigor210 dataset (n=326) (p = 0.013, Figure 8D) and in the bladder cancer dataset (n=25) (p = 0.029, Figure 8G). However, Figure 8I did not show significantly different survival of both risk groups in the melanoma dataset (p = 0.14, Figure 8I). The cause of this result might be statistical limitation in the melanomas dataset, which included only 27 patients, among whom eight dead patients from high-risk group and four dead patients from low-risk group. Of note, the LNM in GC had significantly diagnostic accuracy for selecting patients with an AUC value of 0.64 (95% CI: 0.56-0.72; Figure 8E), who had response to PD-1 inhibition, from the IMvigor210 dataset.
The cancer immunotherapy response for advanced GC patients was further assessed by TIDE score from the web application for response prediction using gene expression profiles . The results showed that patients with high-risk scores achieved lower TIDE scores compared with patients with low-risk scores in the training dataset, suggesting that patients with low-risk scores were more likely to obtain immune evasion compared with patients high-risk scores (Figure 8J). Besides, MSI score and T cell exclusion score were significantly different in both groups, while T cell dysfunction score were not significantly different.
Association between the LGSGC and cancer ICI treatment-related gene expression
The ICB therapy that embraced CTLA-4, PD-L1, PD-L2, PD-1 and TMB blocking agents provided significant clinical benefits for GC patients. Next, we explored the relationship between the LGSGC and CTLA-4, PD-L1, PD-L2 and PD-1 expression in the GSE63354 dataset (Figure 8SA-D).
The relationship between the LGSGC and TMB was assessed in the TCGA dataset (Figure 8SE). It was found that risk score of the LGSGC yielded significant positive association with expression of CTLA-4, PD-L1 and PD-L2 (r = 0.18, p=0.0022; r = 0.25, p=1.5e-05; r=0.15, p=0.0077; Figure 8SA, B and D), but did not yield significant associations with the expression of PD-1 (r = 0.06, p= 0.30; Figure 8SC) and TMB (r = 0.049, p= 0.37; Figure 8SE). Meanwhile, the efficacy of the LGSGC detecting LNM of GC patients comparation with CTLA-4, PD-L1, PD-L2 and PD-1 was explored by ROC analysis (Figure 8SF). The results showed that the LGSGC had bigger AUC value of 0.80 (95% CI: 0.73-0.88) than known immunological biomarkers, CTLA-4 (AUC=0.56, 95% CI: 0.45-0.65), PD-1 (AUC=0.57, 95% CI: 0.48-0.67), PD-L2 (AUC=0.60, 95% CI: 0.51-0.70) and PD-L1 (AUC=0.57, 95% CI: 0.47-0.68). Moreover, the combination of the LGSGC and PD-L1 with AUC value of 0.84 (95% CI: 0.76-0.92) did not have significantly superior diagnostic accuracy for LNM compared with the LGSGC (Z = -1.8617, p-value = 0.06264, DeLong test; Figure 8SF).