Identification of the patient OS-related gene set
Univariate regression analysis was applied to construct the relationship between gene expression and patient OS based on the TCGA training set samples. Table 1 shows the basic information of all samples in each group. A total of 1,265 genes with a univariate Cox regression log-rank p-value of <0.01 were identified as candidate prognostic genes. Table 2 presents the information of the top 20 genes with the closest correlation.
Identification of genomic variant genes with CNV and mutation
Genes with significant deletion or amplification were identified using GISTIC 2.0 based on the TCGA CNV data, and the fragment with the deletion or amplification length of over 0.1 (p<0.05) was seen as the parameter threshold. The significantly amplified fragments in the HCC genome are shown in Figure 1A, and the significantly amplified genes in each fragment are recorded in Table 3. For instance, VEGFA on the 6p21.1 region was significantly amplified (q value = 1.53E-15), MYC on the 8q24.21 region was evidently amplified (q value = 7.39E-37), CCND1 on the 11q13.3 region was also markedly amplified (q value = 1.51E-28), and a total of 200 genes were therefore amplified. The significantly deleted fragments in the HCC genome are displayed in Figure 1B, and the markedly deleted genes in each fragment are recorded in Table 4. For example, RB1 on the 13q14.2 region was markedly deleted (q value = 6.95E-22), CDKN2A on the 9p21.3 region was evidently deleted (q value = 4.65E-11), and CD3D on the 11q25 region was significantly deleted (q value = 0.010127), resulting in a total of 1,485 deleted genes.
Significantly mutant genes were identified using Mutsig2 based on the TCGA mutation annotation data, and the threshold was set at P<0.05, thereby obtaining a total of 344 genes with significant mutation frequency. Figure 2 presents the distribution of framework deletion or insertion, framework displacement, splice sites, nonsense mutation, missense mutation, synonymous mutation, and other non-synonymous mutation of the top 50 genes with the highest significance in the TCGA HCC patient samples. The number of samples with mutation in the 50 genes is indicated in the right histogram, while the upper histogram represents the total number of non-synonymous and synonymous mutations in the 50 genes. Therefore, some of the 344 identified genes were found in previous studies to be closely correlated with cancer genesis and development, such as RB1, TP53, PTEN, CDKN2A, and CDKN1A [23-26].
Functional analysis of genomic variant genes with CNV and mutation
To explore the genomic variant genes (n = 2,029 in total), CNV identified amplified and deleted genes were integrated with significantly mutant genes for GO and KEGG enrichment analyses. Figure 3A shows that a total of 2,029 genes are markedly enriched in the cancer genesis and development-related pathways, such as Hepatocellular carcinoma, PI3K-AKT signaling pathway, FoxO signaling pathway, and Human T-cell leukemia virus 1 infection, etc. Figure 3B displays that the 2,029 genes are enriched in the biological processes of cancer genesis and development, including metabolic process, cellular process, cell differentiation, immune system process, etc.
Construction of the HCC prognosis-related gene signature
The gene set was first integrated with genomic CNV and mutation as well as the prognosis-related gene set, and the intersection set of the three gene sets was chosen as the candidate gene set, and thus 78 genes were obtained. The random forest was further applied for feature selection. Figure 4A displays the relationship between the number of classification trees and the error rate, and Table 4 exhibits 5 genes with the relative importance of >0.65. Figure 4B presents the 5 genes in the out-of-bag importance ranking. Then the multivariate Cox regression analysis method was used to construct the 5-gene signature, and the model is shown below:
Risk = -0.29823*CISH-0.04233264*LHPP-0.2659875*MGMT+0.2069671*PDRG1-0.2678402*LCAT
Through calculating the risk score of each training set sample, the samples were divided into groups with high or low risk given the median risk score (cutoff = -0.05475002). Figure 5 presents the classification effect of the 5-gene signature in the TCGA training set. As shown in Figure 5A, there is an evident difference between the high-risk group of 92 patients and the low-risk group of 92 patients (P = 9.101663e-11). Figure 5B displays the ROC curve, with the 1-, 3- and 5-year AUC of 0.77, 0.82, and 0.86, respectively. According to Figure 5C, the survival time of death samples declined significantly with the increase in patient risk score, and more death samples were found in the high-risk group. Besides, the high expression of PDRG1 was identified to be the risk factor based on the variation in the expression of 5 different signature genes with the increase in the risk score. By contrast, the high expression of CISH, LHPP, MGMT, and LCAT was related to the low risk and thereby determined as protective factors.
Robustness verification of the 5-gene signature model
Through verification in the TCGA test set, the same model and cutoff were adopted as the TCGA training set to decide the 5-gene signature robustness. Figure 6 exhibits the classification effect of the TCGA test set. As shown in Figure 6A, there was a significant statistical difference between the high-risk group of 97 patients and the low-risk group of 86 patients (P = 0.003221398). Figure 6B shows the ROC curve, with the 1-, 3- and 5-year AUC of 0.74, 0.69, and 0.59, respectively. Figure 6C shows similar results to those in the TCGA training set. To be specific, the survival time of death samples declined evidently with the increase in the risk score, and more death samples were found in the high-risk group. As expected, PDRG1 was identified as the risk factor; CISH, LHPP, MGMT, and LCAT were the protective factors.
To validate the classification performance of the 5-gene signature model in data from different data platforms, GEO platform data (GSE40873 and GSE15654) were treated as the external dataset to calculate the risk score of each sample using the model. Besides, samples from low-risk and high-risk groups were classified based on the training set cutoff. Figures 7A and 8A indicate that the low-risk group had a markedly better prognosis than that in the high-risk group. ROC analysis revealed similar 1-, 3-, 5-year AUC to those in the training set and test set, as presented in Figures 7B and 8B. Additionally, the relationship between the risk score and the expression of 5 genes was also in line with that between the test set and the training set, according to Figures 7C and 8C. To sum up, the 5-gene signature model displays favorable prognosis prediction capacity in both external and internal data.
Clinical independence of the 5-gene signature
To determine the independence of the 5-gene signature model in clinical application, univariate and multivariate Cox regression analyses of clinical information were carried out in the TCGA training and test sets, GSE40873 and GSE15654 data sets for an analysis of the related HR, 95% CI of HR and p-value. In addition, the clinical information recorded in the TCGA, GSE40873, and GSE15654 samples was systemically analyzed, which included age, sex, HBV, HCV, platelet count, tumor stage, and pathological M stage, N stage, T stage. Table 5 shows the grouping information of the 5-gene signature. The univariate Cox regression analysis of the TCGA training set suggested that age, tumor stage III/IV, pathologic M1/T3, and the high-risk group were markedly associated with survival. However, clinical independence was only observed in the high-risk group (HR = 9.51, 95%CI = 1.78-50.63, P = 0.00828), according to the corresponding multivariate Cox regression analysis. The univariate Cox regression analysis of the TCGA test set suggested that the high risk group and pathologic M1/T3/T4 were significantly correlated with survival, but clinical independence was only found in the gender male (HR = 4.71, 95%CI = 1.08-20.55, P = 0.0389) and the high-risk group (HR = 3.38, 95%CI = 1.08-10.57, P = 0.0358), according to the corresponding multivariate Cox regression analysis. In GSE40873, the high risk group was found to be significantly related with survival by the univariate Cox regression analysis, and clinical independence was also observed in the high risk group (HR = 3.64, 95%CI = 1.27-10.42, P = 0.0158), according to the corresponding multivariate Cox regression analysis. In GSE15654, the analysis suggested that the high risk group, varices presence, and platelet count >100000/mm3 were markedly correlated with survival, and clinical independence was observed in the platelet count >100000/mm3 (HR = 3.03, 95%CI = 1.69-5.39, P = 0.000178) and the high risk group (HR = 1.69, 95%CI = 1.02-2.81, P = 0.040943), according to the corresponding multivariate Cox regression analysis (Table 4). To sum up, the 5-gene signature model is a prognostic index and shows independent predictive performance.
Pathway difference in GSEA enrichment analysis between high and low risk groups
In the high and low risk groups of the TCGA training set, pathways with significant enrichment were analyzed using GSEA. Table 5 shows that a total of 39 pathways were obtained, including those closely related to tumor genesis, development, and metastasis, such as adherens junction, cell cycle, endocytosis, and pathways in cancer. The significant enrichment results of these pathways in high-risk samples are shown in Figure 9.
Expression Levels of 5 Genes
The bioinformatics analysis confirmed that the expression of 5 genes was verified in twenty-five normal tissues and HCC tissues. The results in Figure 10 showed that the mRNA expression of PDRG1 was decreased in HCC tissues and the mRNA expression of CISH, LHPP, MGMT and LCAT were increased in HCC tissues (p<0.05). It was consistent with that analyzed using bioinformatic analysis.