3.1 Acquisition of Gene Expression and Clinical Data
We downloaded the TCGA-LIHC dataset from The Cancer Genome Atlas (http://portal.gdc.cancer.gov/). The TCGA-LIHC dataset included 334 samples, and all samples had data on the recurrence-free survival time and censoring status. The GSE76427 dataset was downloaded from the Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/gov/). The GSE76427 dataset included 115 samples from HCC patients, but 7 patients had missing information on recurrence-free survival time and censoring status. Thus, 108 samples were included in this study. The median RFS times of the TCGA and GSE76427 series were 390 and 252 days, respectively, and the two datasets contained clinical information, such as sex, age, and TNM stage.
3.2 Genes associated with recurrence-free survival
We used the “survfit” function from the “survival” package and found 1331 genes associated with RFS. To explore the genetic biological implications, we next analyzed the 1331 genes through Gene Ontology (GO) functional and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses. As shown in Figure 1, through KEGG analysis, we found that these genes are enriched in signaling pathways such as the cell cycle, homologous recombination, DNA replication, the Fanconi anemia pathway, complement and coagulation cascades, and the T cell receptor signaling pathway.
3.3 Construction of the prognostic model in TCGA-LIHC
Then, “rbsurv” identified seven genes to construct the risk score system. The seven genes included in the system were TTK protein kinase (TTK), chromosome 16 open reading frame 54 (C16orf54), phosphoribosyl pyrophosphate amidotransferase (PPAT), CD3e molecule associated protein (CD3EAP), Solute carrier organic anion transporter family member 2A1 (SLCO2A1), acetyl-CoA acetyltransferase 1 (ACAT1), and Growth-arrest specific 2 like 3 (GAS2L3) (Table Ⅰ).
The risk score was calculated with the following formula: risk score= (-0.038)*expression of TTK +(-0.357)*expression of C16orf54 + 0.634* expression of PPAT +0.221 * expression of CD3EAP +(-0.076) * expression of SLCO2A1 + (-0.184) * expression of ACAT1 + 0.277 * expression of GAS2L3.
A total of 334 patients were divided into two groups (134 high-risk patients and 200 low-risk patients) with a cut-off of 4.97976 for the risk score. Furthermore, the survival curve revealed that the recurrence-free survival was significantly poorer in the high-risk group than in the low-risk group (p<0.0001; Figure 2).
3.4 Validation of the prognostic model in GSE76427
We validated the risk score system in the GSE76427 cohort. A total of 108 patients were divided into two groups (45 high-risk patients and 63 low-risk patients) with a cut-off of 3.4144 for the risk score. Furthermore, the survival curve revealed that the recurrence-free survival was significantly poorer in the high-risk group than in the low-risk group (p=0.011; Figure 3).
In summary, these results indicate that the prognostic model has moderate sensitivity and specificity.
3.5 Association between the prognostic model and the clinical characteristics of patients
In clarifying the correlation between the seven-gene signature and the clinical characteristics of HCC patients, we found that a high risk score was significantly correlated with TNM stage (p<0.001), grade (p=0.001), and AFP (p=0.014) but was not significantly associated with the sex, age, BMI, or Child-Pugh score of patients with HCC (Table Ⅱ).
In GSE76427, the results showed that the 7-gene signature was not significantly associated with sex, age, BCLC (Barcelona Clinic Liver Cancer) or TNM stage (Table Ⅲ).
3.6 Independent prognostic role of the prognostic gene signature
Moreover, the results of univariate Cox regression showed that the TNM stage (HR=1.680, p<0.001) and our prognostic model (HR=3.607, p<0.001) were both independent factors for recurrence-free survival among the 334 TCGA-LIHC patients. However, among the 108 patients in the GSE76427 cohort, TNM stage was not an independent prognostic factor for recurrence-free survival(18). The prognostic model (HR=2.407, p=0.014) was also an independent factor for recurrence-free survival. (Figure 4).
3.7 Comparison with the TNM stage model and BCLC model
To compare the accuracy of the prognostic model and the TNM model, we calculated the AUC for 12-month, 15-month, and 18-month recurrence-free survival. In the TCGA-LIHC dataset, the prognostic model’s AUCs for 12-month, 15-month, and 18-month RFS were 0.7768, 0.7934, and 0.7529; the TNM model’s AUCs for 12-month, 15-month, and 18-month RFS were 0.6884, 0.7026, and 0.6721, respectively (Figure 5). In the GSE76427 dataset, the prognostic model’s AUCs for 12-month, 15-month, and 18-month RFS were 0.6159, 0.6118, and 0.6217; the TNM model’s AUCs for 12-month, 15-month, and 18-month RFS were 0.6122, 0.6009, and 0.5762, respectively. In addition, the BCLC model’s AUCs for 12-month, 15-month, and 18-month RFS were 0.5669, 0.5627, and 0.5684, respectively (Table Ⅳ). Overall, our prognostic model showed benefit for predicting recurrence-free survival, which might help doctors with targeted treatment (Figure 6).
3.8 Development of the calibration curve
We calculated the C-index and drew the calibration curves for the 12-, 15- and 18-month survival predictions to evaluate the calibration in the TCGA-LIHC dataset and the GSE76427 dataset. The C-index in the TCGA-LIHC dataset is 0.717 and in the GSE76427 dataset is 0.647, as shown in Figures 7 and 8.
3.9 External validation in an online database
The representative protein expression of SLCO2A1, PPAT, GAS2L3, CD3EAP, ACAT1 were explored in the Human Protein Profiles. Then we explored the TTK, C16orf54, PPAT, CD3EAP, SLCO2A1, ACAT1, and GAS2L3 genes in the CBioPortal for cancer genomics. TTK exhibited the most frequent genetic alterations (3%), of which deep deletion was the most frequent alteration. The second most altered gene was CD3EAP (1.3%), and the most frequent alterations were amplification mutations (Figure 9). The expression of the seven genes in different cancers were shown in Figure 10. In summary, the aberrant expression of these seven genes may explain some of the abnormal expression of these genes.