1. Gene screening by GSEA
The gene expression data for 29,417 mRNAs and the corresponding clinical data of 370 patients with HCC were collected from the TCGA database and randomly divided into training and test groups. GSEA was used to identify any statistically significant differences between the tumor and adjacent normal tissue in the training group. We found that 17 of 50 gene sets were enriched in tumor tissue by the GSEA automatically, and the four most significant gene sets at FDR < 0.25 and P < 0.05 were chosen for further analysis, i.e., the E2F targets (FDR = 0.012; P = 0.002), the G2/M checkpoint (FDR = 0.019; P = 0.008), DNA repair (FDR = 0.100; P = 0.021), and mitotic spindle (FDR = 0.101; P = 0.029; Table 2; Figure 1). Among these, the E2F targets gene set is the most statistically significant and ranked first with NES = 2.01. The 199 genes that were significantly enriched in the E2F target set were then subsequently investigated.
2. Determination of the E2F target‐related prognostic mRNAs
A univariate Cox proportional hazards model identified correlations with OS among the 199 genes enriched in the E2F target set. For the training group, six mRNAs were identified (P < 0.05). Afterwards, we identified five optimal independent prognostic mRNAs by stepwise multivariate Cox regression analysis. These were divided into risky and protective types based on the HR. NBN, PHF5A, AK2, and CDCA8 were classified as risky genes with HR > 1 and short survival, whereas EXOSC8 was rated as a protective gene (0 < HR < 1) with longer survival (Table 3).
The selected five-mRNA signature was queried in the cBioPortal web source for bioinformatic analyses. Alterations in the queried genes were detected in 186 (51%) of the 366 samples. The five genes differed in terms of mutation, amplification, deep deletion, mRNA high, mRNA low, and multiple alterations. NBN included one example of mutation, one example of deep deletion, 21 examples of amplification, 53 examples of mRNA high, five examples of mRNA low, and 28 examples of multiple alterations. PHF5A, AK2, CDCA8, and EXOSC8 had 7.65%, 14.21%, 5.46%, and 7.1% alteration, respectively (Figures 2A and 2B). A network analysis consisting of 55 nodes, five mRNAs, and the 50 most frequently altered neighbor genes was performed in cBioPortal (Figure 2C). For the TCGA LIHC cohort, we compared the expression levels of the five genes in the tumor and adjacent noncancerous tissues. NBN, PHF5A, CDCA8, and EXOSC8 were significantly upregulated, whereas AK2 was significantly downregulated (P < 0.05; Figure 2D). The best performing threshold was regarded as beast cutoff to partition the HCC samples collected from the TCGA database into high-expression and low-expression groups, which were then used to perform the survival analysis. The overexpression of NBN, PHF5A, AK2, and CDCA8 and low expression of EXOSC8 were associated with poor prognosis of HCC patients (Figure 2E). The downregulation of AK2 and the upregulation of EXOSC8 were found in HCC tumor tissues, in contrast with the OS prognosis of HCC patients, which was consistent with the prognostic signature formula constructed next.
3. Construction of the E2F target-related five-mRNA-based prognostic signature
For the training group, a prognostic signature was constructed based on the linear combination of the expression levels of the five mRNAs weighted with coefficients derived from the multivariate Cox regression analysis as follows:
We calculated the risk scores using Equation 2 and ranked the patients in the training group in ascending order (one risk score per patient). We then divided them into the high-risk group (n = 94) and the low-risk group (n = 93) using the median risk score as the cutoff value. The Kaplan-Meier OS curve showed that the prognosis of the high-risk group was significantly poorer than that of the low-risk group (P = 0.0011, log-rank test; Figure 3A). The risk scores are shown in Figure 3B. The survival status of the HCC patients in the training group was plotted (Figure 3C), along with that of patients in the high-risk group with higher mortality rates. The expression levels of the five mRNAs from the training group are shown in a heatmap (Figure 3D). The differential expression of the five genes in low-risk group compared with high-risk group HCC patients were also investigated. We found that NBN, PHF5A, AK2, and CDCA8 were upregulated but EXOSC8 was downregulated in high-risk HCC patients in the training group (Figure 3E). EXOSC8 is a protective gene with a negative coefficient. In contrast, NBN, PHF5A, AK2, and CDCA8 are the four risky mRNAs with positive coefficients. Thus, they could be used to predict high risk.
4. Validation of the five-mRNA prognostic signature in the test and entire groups
To verify the accuracy and stability of the prognostic signature constructed from the training group data set, we calculated the risk scores for the HCC patients in the test group and divided them into high-risk (n = 92) and low-risk (n = 91) groups using the median risk score as the cutoff value. The Kaplan-Meier OS curve showed that the prognosis of the high-risk group was significantly poorer than that of the low-risk group (P = 0.0114, log-rank test; Figure 4A). An even more remarkable result was obtained from the Kaplan-Meier analysis of the entire set (P < 0.0001, log-rank test; Figure 4F). The risk scores and survival status were plotted for the test group (Figures 4B and 4C) and the entire group (Figures 4G and 4H). The expression levels of the five mRNAs in the test group (Figure 4D) and the entire group (Figure 4I) were sorted by risk score and displayed in the heatmap. The differential expression of the five genes in low-risk group compared with that of the high-risk group were also investigated in both the test group and entire group (Figure 4E and 4J).
5. External validation of the five-gene prognostic model with GEO microarray data
To investigate whether our predictive model was appropriate for other datasets, we acquired four independent microarray datasets (GSE54236, GSE76427, GSE55092, and GSE107170), which incorporated the gene expression information and survival time of HCC patients from the GEO database and applied them to prognostic model. The raw data and platform information were downloaded from the GEO database. HCC tissue samples obtained from GSE55092 (n=49) and GSE107170 (n=94) were confined with the median risk scores. The expressions of NBN, PHF5A, AK2, CDCA8, and EXOSC8 in the low-risk and high-risk groups were compared and displayed in Figure 5A-D. EXOSC8 was downregulated in the high-risk group as a protective gene whereas NBN, PHF5A, AK2, and CDCA8 were upregulated in the high-risk group as four risky mRNAs. Thus, they could be used to predict high risk. GSE54236 and GSE76427 were obtained and merged after the batch normalization of 140 HCC tissue samples with survival time and status were collected and analyzed for prognosis. The Kaplan-Meier OS curve showed that the prognosis of the high-risk group was significantly poorer than that of the low-risk group (P = 0.0057, log-rank test; Figure 5E), thereby validating the prognostic signature. In addition, we calculated and compared the risk factors of different liver cancer tissues based on the formula for patients. The risk score increased with the severity of the liver disease as displayed in Figure 5F.
6. Independence from clinicopathological parameters
We collected samples with complete clinical information. Univariate and multivariate Cox regression analyses were performed on the five-mRNA-based signature and the clinicopathological parameters (Table 4). The median age of the 370 HCC patients was 60. Of 367 patients, 274 (74.66%) were classified as T1-2, and 93 (25.34%) were classified as T3-4. Of 343 HCC patients, 109 (31.78%) presented with tumors during follow-up. Of 269 HCC patients, 95 (35.32%) presented with new primary tumors. Of 319 HCC patients, 112 (35.11%) had family histories of HCC. Univariate Cox regression analysis of this dataset revealed that, similar to the risk score, the tumor status, family history, pathological T, and new tumor event after initial treatment were significant independent prognostic factors (P < 0.05). Of these, the risk score had the most remarkable prognostic value (P < 0.001 in the univariate Cox regression analysis; P = 0.003 in the multivariate Cox regression analysis; HR = 2.901; 95% CI = 1.425–5.905).
7. Validation of the survival prediction of the five-mRNA signature using Kaplan-Meier curves
The Kaplan-Meier curve showed that the prognosis of the high-risk group was significantly poorer than that of the low-risk group for the entire group (P < 0.0001, log-rank test; Figure 4E). Clinicopathological parameters, including age, T stage classification, family history, neoplasm cancer status, and new tumor event after initial treatment, effectively predicted OS according to the univariate Cox regression analysis (Figures 5A-E). HCC patients with the poorest prognoses were > 60 y (P = 0.0394), were pathologically rated as T3-T4 (P = 0.0090), and had HCC family history (P = 0.0065), tumors during follow-up (P = 0.0007), and new tumor events after the initial treatment (P = 0.0092). The accuracies of our prognostic indicators were confirmed by this observation. A stepwise analysis was applied to investigate these clinicopathological parameters for subsequent data mining.
The Kaplan-Meier curve disclosed that the five‐mRNA-based risk score is independent of gender (female/male), age (≤ 60/> 60 y), family history (NO/YES), tumor grade (G1-G2/G3-G4), pathological T (T1-T2/T3-T4), tumor status (tumor free/with tumor), and new tumor event after initial treatment (NO/YES). It is an accurate and stable prognostic indicator for patients with HCC, and the high‐risk group had significantly poorer prognosis than the low-risk group (Figures 6A-G).
8. Comparison of the five-mRNA prognostic signature with two other existing signatures
Lin et al. (2018) have reported a 12-gene signature (hereafter, the Lin signature), and Long et al. (2018) have reported a four-gene signature (hereafter, the Long signature) to predict the outcomes of patients with HCC [21, 22]. We calculated the risk scores of each patient in the entire cohort using formulae derived from these signatures and compared the prognostic value from the proposed five-mRNA signature with that of the Lin and Long signatures. The Kaplan-Meier curves showed that both the Lin signature (P = 0.0002) and the Long signature (P = 0.0092) successfully predicted the outcomes (Figures 7A and 7B) for the entire group. However, the risk score of the five-mRNA signature prediction had P < 0.0001 (Figure 4E); therefore, its prognostic value was more effective than either that of the Lin or Long signatures.