Immuno-profiling identifies three subtypes in the TCGA dataset
To cluster hepatocellular carcinoma samples from the TCGA dataset, we first calculated the enrichment levels of the samples using ssGSEA scores from 29 immune gene signatures. Next, samples were clustered hierarchically based on ssGSEA scores, stratifying the samples into three clusters (Fig. 1A). The three clusters were defined as Immunity High (Immunity_H), Immunity Medium (Immunity_M), and Immunity Low (Immunity_L). In addition, our correlation analysis showed that tumor purity was significantly lower in the Immunity_H cluster and significantly higher in the Immunity_L cluster (Kruskal–Wallis test, Fig. 1B). To measure the level of immune cell infiltration, the purity of the tumor, and the stromal content scores in the three clusters, we used the R package “ESTIMATE”. As shown in Fig. 1C, the three subtypes were clustered and divided by lymphocyte_infiltration, tumor_purity, stromal_score, immune_score, and percent of lymphocyte infiltration.
Construction of weighted gene co-expression network
WGCNA analysis is widely applied to examine the Pearson correlation coefficient between gene expression levels (Wu, et al., 2019). Here, we used WCGNA analysis to construct a gene co-expression network in the immunity_H subgroup. To identify the immunity-related correlation gene modules, the mRNA modules in each subtype were analyzed. As presented in Fig. 2A, immunity_H was significantly correlated with the green module, MEpurple, MEsaddlebrown, MEorange, METan, and MEdarkorange. The genes in the green module were used for subsequent analysis. As presented in Fig. 2B, immunity was significantly correlated with the module genes. In addition, we detected the overall gene expression level in the green module in the three immunity subtypes. We found that the genes in the green module were significantly increased in the high immunity group (Fig. 2C).
Gene signature model construction and performance
To exclude genes with a high correlation in the signature model, hub genes obtained from the above selection were further analyzed using the LASSO‐penalized Cox analysis (Fig. 3A). Next, we analyzed these nine genes using multivariate Cox regression analysis in the TCGA cohort and identified four genes (IL18RAP, CSF3R, KIAA1429, PIK3R6) for the construction of our signature model (Fig. 3B & table.1). Next, we constructed a prognostic model using the risk-score formula: risk score = (coefficients × gene expression level), for each patient. Patients were classified into high- and low‐risk groups with median risk scores in the training dataset (Fig. 3C). The high-risk group demonstrated a higher mortality rate than the low-risk group in the training dataset (Fig. 3C) with a P = 3.919e-03 in the log-rank test (table.2). In addition, we found that the area under the curve (AUC) for overall survival (OS) was 0.749 in the time-dependent receiver operating characteristic (ROC) curve (Fig. 3C). Furthermore, the ranked risk scores of patients in the testing dataset showed that the high-risk group demonstrated higher mortality than the low-risk group (AUC = 0.740) (Fig. 3D).
Gene signature model performance evaluation
To evaluate the gene signature performance, we first analyzed the risk status of these two groups. As shown in Fig. 4A, the cut-off value could distinguish between survival statuses and the expression profiles of these four genes in the two groups from the training dataset were visualized in a heatmap. Furthermore, a nomogram was built by including the expression level of the signature genes and the overall survival rate of the patients from the training dataset (Fig. 4B). Calibration plots were used to assess the performance of the nomogram in predicting the 1-, 3-, and 5‐year OS rates in the training dataset (Figure. 4C). The c-index of the prediction model was 0.803. Furthermore, we validated the performance of the gene signature in the testing dataset. We found that the risk score formula could also be used to classify the survival status (Fig. 4D & Table 3). Furthermore, nomogram analysis revealed that the gene signature and these four genes would also serve as risk factors to predict the overall survival rate of patients from the testing dataset (Fig. 4E). The c-index of the prediction model was 0.708 (Fig. 4F).
Immunity status in different groups
To further evaluate the immune status difference between the high- and low-risk groups, we compared the immunity-related scores in these two groups in both the training and validation datasets. We found that all four scores were significantly different between these two groups in the training dataset (Fig. 5A). Consistently, the scores were found to be significantly different in the validation dataset (Fig. 5B).
Immunity is positively associated with TMB
Since the TMB level is closely associated with the immunity status of patients, we further analyzed the TMB level of different immunity level groups. Correspondingly, we found that the TMB level was higher in the high-risk group than in the low-risk group in the training dataset (Fig. 6A). However, there was no significant difference between the high- and low-risk groups in the testing dataset (Fig. 6B).
Functional analysis of the four genes
Functional analyses were performed for the genes in the immunity_H module to clarify their biological significance. GO analysis was applied based on biological process (BP), cellular component (CC), and molecular function (MF). As displayed in Fig. 7, these results demonstrated that the immunity_H module was predominantly enriched in immune-related terms such as “cytokine-cytokine receptor interaction” and “natural killer-mediated cytotoxicity”.
Individual gene expression analysis
To further examine the expression profile of these four genes in liver cancer, we used the TCGA dataset. We first examined the different expression statuses in normal and tumor tissues. As shown in Fig. 8A, we found that CSF3R and IL18RAP increased in the tumor tissue, whereas KIAA2429 and PIK3R6 decreased in the tumor tissue. Furthermore, we found that the expression levels of these four genes were associated with the OS of the patients (Fig. 8B). In addition, we also analyzed the spatial expression of these four genes in the human protein altas database. We found that all four genes could be detected in normal tissue and cancer samples (Fig. 8C).