Expression analysis of DEGs in hepatocellular carcinoma
In the present study, a multistep analysis was performed to investigate DEGs and their significant biological functions on by integrated bioinformatics method in HCC (Figure 1). First, ten microarray profiles (GSE14520, GSE22058, GSE25097, GSE36376, GSE57957, GSE45267, GSE76427, GSE76297, GSE84005 and GSE121248) from the GEO datasets were selected and downloaded. There were a total of 2220 HCC samples, including 1016 adjacent nontumor 1204 tumor samples. After gene expression data processing, averaging and normalizing, DEGs among each GEO datasets were filtered and screened using the “limma” package with corrected P-value < 0.05 and |logFC| > 1. Based on the filter criteria, 779, 1723, 1675, 444, 373, 1062, 436, 451, 1025 and 580 DEGs from the above GEO datasets were obtained, respectively. The volcano plots of DEGs among each dataset are shown in Figure 2a. To explore robust DEGs in multiple GEO dataset, the RRA method was applied for integrating and carried out the meta-analysis of listed ranked genes. This method is a probabilistic and rank-based method assigned P-value as a significant score to each gene in the aggregated list, which suggests how superior it is ranked when compared to an expecting random ordering model. Finally, 518 significantly robust DEGs, which corresponded to 390 down-regulated and 128 up-regulated genes. The gene expression heatmap of the top 20 DEGs is shown in Figure 2b.
In the TCGA HCC cohort, 9077 DEGs were identified and filtered by the “edgeR” package, which comprised 7518 up-regulated and 1559 down-regulated genes. Among these DEGs, 451 DEGs held in common between GEO and TCGA HCC cohort were filtered out through Venn diagram analysis according to the thresholds set (Figure 2c), which corresponded to 126 up-regulated genes (logFC > 1.0) and 325 down-regulated genes (logFC < -1.0).
PPI network construction, hub gene selection and analysis
To explore the interactions and central genes associated with HCC, a PPI network was constructed using the STRING database and visualized by Cytoscape software. A total of 154 downregulated and 95 upregulated DEGs were filtered into the PPI network, which contained 249 nodes and 1274 edges. Then, Maximal Clique Centrality (MCC), Density of Maximum Neighborhood Component (DMNC), Maximum Neighborhood Component (MNC) and Degree in the cytoHubba plug-in were employed for screening significant hub genes. After Venn diagram analysis of the top 50 genes screened by the four ranking algorithms, 41 genes were overlapped and identified as hub genes (Additional file 2: Figure S1).
External validation and survival analysis of hub genes
Since there are fewer normal liver tissues in the TCGA database, all 41 genes were subsequently investigated and analyzed in the GEPIA database, including 369 cancerous and 160 normal tissues in Genotype-Tissue Expression (GTEx) and TCGA projects. Based on the cutoffs (|log2FC| > 1 and p< 0.01), 36 of the 41 hub genes were significantly upregulated in HCC tissues compared to normal liver tissues (Figure 3). Subsequently, overall survival (OS) and disease-free survival (RFS) analysis of 36 hub genes were further performed using Kaplan-Meier analysis in the GEPIA database (364 HCC). The results illustrated that HCC patients with high expression (> median expression value) of these genes displayed worse OS and RFS (p< 0.05; Additional file 3: Table S2).
Function and pathway analysis of hub genes
To elucidate the functional characteristics of the verified 36 hub genes, the enrichment analysis of GO and KEGG pathway was performed on FunRich software. The GO enrichment analysis results were divided into three functional categories (BP, CC and MF) (Additional file 4: Table S3). For BP, these 36 hub genes were mainly enriched in spindle assembly, cell cycle, chromosome segregation, cell communication and signal transduction (Figure 4a). For CC, they were particularly enriched in kinetochore, nucleus, microtubule, chromosome and condensed chromosome kinetochore (Figure 4b). For MF, they were notably enriched in protein serine/threonine kinase activity, kinase binding, protein binding, motor activity and ATP binding (Figure 4c). According to KEGG pathway enrichment analysis, they were significantly enriched in the cell cycle, PLK1 signaling events, Polo-like kinase signaling events in the cell cycle, Aurora B signaling and M phase (Figure 4d, Additional file 4: Table S3).
Construction of the gene-related associated risk score system
After the exclusion of the patients with incomplete survival data, 364 HCC patients remained in this study. Univariate Cox analysis was performed to explore the relationship between the overall survival and the expression level of each hub gene. As a result, the high expression of 36 hub genes was significantly correlated with worse overall survival (P< 0.05). Then, LASSO-penalized Cox analysis was used to remove confounding factors and cut the number of genes by 10-fold cross-validation producing the best lambda to minimize the biases and errors. Based on the LASSO-penalized Cox regression, 18 genes (ASPM, AURKB, CCNA2, CDCA3, CDCA8, CENPA, CENPF, CENPK, ECT2, KIF20A, KIF4A, NEK2, PBK, PRC1, RACGAP1, RRM2, SPC25, TOP2A and TPX2) closely correlated with survival were selected for the development of a risk prediction model (Figure 5a, 5b). Finally, a stepwise multivariate Cox regression analysis was performed, and 11 genes were finally chosen to establish an evaluation model. The model was characterized as: prognosis index = (-0.2348 * expression level of AURKB) + (-0.4974 * expression level of CDCA3) + (0.6346 * expression level of CDCA8) + (0.3109 * expression level of CENPA) + (0.2525 * expression level of ECT2) + (0.3963 * expression level of KIF20A) + (-0.2562 * expression level of NEK2) + (-0.8553 * expression level of PRC1) + (0.3235 * expression level of SPC25) + (-0.3274 * expression level of TOP2A) + (0.6281 * expression level of TPX2). All of these genes, including CDCA8, CENPA, ECT2, KIF20A, SPC25 and TPX2, displayed positive coefficients in the formula, indicating high-risk signatures for them. The risk score of each HCC patient was calculated, and the patients were divided into the high-risk (n=182) or low-risk (n=182) group based on the median risk score as the cutoff value. The distributions of the eleven-gene-based risk scores, OS statuses and the expression profiles of 11 genes in the TCGA HCC cohort were displayed in Figure 5c. Intuitively, the number of deaths was notably taller in the high-risk group, and the heatmap suggested that all 11 genes were expressed at higher levels in the high-risk group compared to the low-risk group. Kaplan-Meier analysis of the entire dataset (n=364) clearly showed that the HCC patients in the high-risk group had a worse prognosis than those in the low-risk group (median OS: 3.11 vs 6.73 years, P = 3.21E-04) (Figure 5d). Subsequently, the prognostic capacity of the model was evaluated by calculating the receiver operating characteristic (ROC) area under a curve. The areas under the curves (AUCs) of the ROC curve of the whole cohort based on this predictive model were 0.755, 0.708 and 0.729 for the 1-, 3- and 5-year survival time, respectively (Figure 5e), suggesting that the predictive model had a high specificity and sensitivity.
Next, the model was further assessed in subgroups. The 374 patients with HCC were divided into the stage-Ⅰ subgroup (n=168), stage-Ⅱ subgroup (n=84), stage-Ⅲ subgroup (n=83) based on their pathologic stage. Except for the stage-Ⅳ subgroup, which did not have a plenitudinous sample, patients in each subgroup were divided into the high-risk or low-risk group based on the above cutoff value. As shown in Figure 6a-c, patients in the high-risk group had significantly shorter survival time than those in the low-risk group. When a stratified analysis was executed based on age, the test in the <65-year-old or ≥65-year-old subgroup illustrated the same results (Figure 6d, 6e). Thus, the model was certainly reliable and practicable in the prognosis evaluation.
Independence of the prognostic signature from other clinical characteristics
Next, univariate and multivariate Cox regression analyses were utilized to explore whether the prognostic performance of the signature was independent of those of conventional clinical risk factors in 283 HCC patients with full clinical information. Univariate Cox regression analysis suggested that the signature had independent prognostic value (P< 0.05), while age, gender, pathologic stage, histologic grade and vascular tumor invasion did not closely correlate with the OS and could be not extremely effective of independent prognostic signature (Table 1). Considering that age nearly reached the statistical significance, it and the prognostic model were incorporated into the multivariate Cox analysis. The results indicated that the eleven-gene signature could be an independent prognostic indicator (Table 1).
Table 1 Univariate and multivariate Cox regression analysis of 11-mRNA signature and clinical risk factors in TCGA HCC dataset
Characteristic
|
Univariate analysis
|
Multivariate analysis
|
HR (95% CI)
|
P-Value
|
HR (95% CI)
|
P-Value
|
Age (≥65 vs. <65)
|
1.422 (0.961-2.105)
|
0.078
|
1.459 (0.985-2.161)
|
0.059
|
Gender (Male vs. Female)
|
0.787(0.527-1.175)
|
0.242
|
|
|
Pathologic_stage (III-IV vs. I-II)
|
1.010 (0.632-1.615)
|
0.966
|
|
|
Histologic_grade (G3-G4 vs. G1-G2)
|
1.041 (0.696-1.555)
|
0.846
|
|
|
Vascular_tumor_invasion (macro vs. micro vs. none)
|
0.825 (0.591-1.153)
|
0.260
|
|
|
Risk_level (high vs. low)
|
2.316 (1.531-3.503)
|
6.99E-05
|
2.342 (1.548-3.544)
|
5.61E-05
|
Abbreviations: HR, hazard ratio; CI, confidence interval.
Development and validation of a predictive nomogram
To build an efficient quantitative method for predicting the survival probability of HCC patients, a user-friendly nomogram included two factors (age and prognostic model) that were generated (Figure 7a). The nomogram allowed the clinicians to calculate the 1-, 3- and 5-year OS probability of each HCC patient easily. Subsequently, the discrimination and calibration abilities of the nomogram were evaluated by using a concordance index (C-index) and calibration plots. The C-index was 0.707 (95% confidence interval: 0.660 - 0.754) using bootstrap with 1000 resamplings, suggesting that this nomogram has an excellent discrimination ability. The 1-, 3- and 5-year OS probabilities were visualized by calibration plots (Figure 7b), suggesting that the probabilities generated by this nomogram were closely approximated the actual survival situation.
Expression validation and KEGG analysis of the prognostic signature
The eleven genes in the prognostic model were further analyzed individually in the Oncomine database. As illustrated in Additional file 6: Figure S2, the mRNA expression levels of AURKB, CDCA3, CDCA8, CENPA, ECT2, KIF20A, NEK2, PRC1, SPC25, TOP2A and TPX2 were notably upregulated in HCC tissues compared with those in nontumor liver tissues (p < 0.05), which was consistent with our findings. To further elucidate and understand the biological functions of the eleven genes, KEGG pathway enrichment analysis was performed again. Results showed that these genes were markedly enriched in the biological process related to the regulation of cell proliferation, such as “cell cycle”, “PLK1 signaling pathway” and “Aurora B signaling pathway” (Additional file 5: Table S4).