Identication and Validation of a Reliable Prognostic Eleven-Genes Signature for Hepatocellular Carcinoma

Background: Hepatocellular carcinoma (HCC) is a common malignant tumor with high mortality and mortality. Although advances in early diagnosis, disease management and treatment of HCC, the outcomes remain unsatisfactory. This study aimed to identify the reliable prognostic biomarkers based integrated bioinformatics analysis to predict and improve the survival of HCC patients. Methods: The gene expression or transcriptome proles and survival of HCC were acquired from the Gene Expression Omnibus database (GEO) and the Cancer Genome Atlas (TCGA) database. Differentially expressed genes (DEGs) were screened out by the limma or edgeR package in the R software. Univariate, LASSO and multivariate Cox regression analyses were conducted to explore survival-related signature. Subsequently, a prognostic model and nomogram composed of prognostic signature were constructed for assessing overall survival (OS). Kaplan-Meier analysis, receiver operating characteristic (ROC) curve and stratied analysis were performed to conrm the prognostic performance of the prognostic model. Results: Compared with nontumor samples, 451 reliable DEGs were identied using the robust rank aggregation and overlap validation. Eleven survival-related DEGs were selected for the construction of a risk evaluation model, which could eciently distinguish high-risk patients from low-risk patients and even be feasible in the subgroups of stages and age. Further analyses suggested the positive and independent prognostic performance of the model compared to other clinical characteristics (P< 0.05, ROC > 0.7). Finally, a prognostic nomogram composed of the model was constructed for assessing the overall survival, and Harrell’s concordance index and calibration curves demonstrated its ecient predictive performance. Conclusion: model and nomogram will directly further the individualized more accurate HCC: Hepatocellular carcinoma; GEO: Gene Expression Omnibus database; TCGA: Cancer Genome Atlas; DEGs: Differentially expressed genes; LASSO: least absolute shrinkage and selection operator; RRA: Robust rank aggregation; KEGG: Kyoto Encyclopedia of Genes and Genomes; GO: Gene Ontology; GEPIA: Gene Expression Proling Interactive Analysis; OS: overall survival; ROC: receiver operating characteristic; C-index: Harrell’s concordance index; AUC: Areas under the curves.


Background
Hepatocellular carcinoma (HCC) is one of the lethal cancers and the second-leading cause of cancerrelated mortality [1]. In the last decades, advances in the treatment of unresectable liver cancer and the earlier diagnoses have made it possible for patients with HCC to live longer than in the past. Still, HCC is a malignant tumor with a worse prognosis due to its di culty in early diagnosis, metastasis and dissemination. The 1-year survival rate of these patients is 44% [2,3], and the 5-year survival rate is only 18% [4]. Poor outcomes may be caused by the rapid progression, early metastasis, and lack of typical clinical symptoms or suitable biomarkers for HCC. Thus, the early evaluation of individual outcomes is extremely urgent for improving HCC treatment and prognosis. Frequently, clinicopathological characteristics such as the pathologic stage, histologic grade, vascular invasion and tumor-nodemetastasis status have been utilized to assess HCC outcomes in practice. Patients with higher-stage, vascular invasion or metastasis have a worse overall survival (OS). However, the assessment is often based on pathological biopsy; It is di cult to predict the progression due to invasive surgery. Therefore, reliable prognostic biomarkers are needed to predict the outcome of treatment and guide personalized treatment of HCC.
Advances in microarrays, high-throughput sequencing techniques and bioinformatics have demonstrated that prognostic gene signatures be capable of predicting HCC patients' OS. Unfortunately, bias, errors as well as noises could be obtained due to the limited sample, single technological platforms or inappropriate analysis method. Integrated analysis of different microarrays and/or high-throughput sequencing data using effective analysis methods could obtain reliably differentially expressed genes (DEGs) and further identify potential molecular biomarkers. Recently, the robust rank aggregation (RRA) has been speci cally designed to compare several ranked gene lists and identi ed commonly overlapping genes to reduce or get rid of these mistakes [5]. Numerous studies adopted the RRA method to select and lter differentially expressed microRNA or mRNA pro les on multiple datasets [6][7][8]. However, there is a little study on identifying DEGs using the RRA method in HCC, especially on screening prognostic gene signatures.
In this study, 451 robust DEGs were identi ed by an integrated analysis of the data from the GEO and TCGA database using the RRA and overlap validation. Among them, eleven survival-related genes (AURKB, CDCA3, CDCA8, CENPA, ECT2, KIF20A, NEK2, PRC1, SPC25, TOP2A and TPX2) were selected by univariate, least absolute shrinkage and selection operator (LASSO) and multivariate Cox regression analysis. Based on their expression level and survival data from the TCGA HCC cohort, a prognostic risk model was proposed. The model was positively validated on the subgroup (age and pathologic stage) and the entire cohort (P < 0.05, ROC > 0.7). Subsequently, a user-friendly nomogram incorporating the signature and age was proposed for further potential application in clinical work. In general, the signature and nomogram may accurately evaluate the overall survival of HCC and will contribute directly to the individualized treatment and management for patients with HCC.

Materials And Methods
Data set collection and data processing The gene expression microarray pro les of HCC were ltered and downloaded from the NCBI-GEO database (https://www.ncbi.nlm.nih.gov/geo/). To reduce the in uence of individual heterogeneity and obtain more reliable results, those datasets with adequate sample size as well as paired tissues were screened and selected. The detailed screening criteria were as follows: (1) these data were derived from liver tumor tissues (TT) and adjacent nontumor tissues (NT) of HCC patients; (2) datasets were raw or standardized; (3) the sample sizes of NT and TT both were more than 35; (4) datasets were published later than 2010. A total of ten datasets were nally selected and their detailed information was shown in Additional le 1: Table S1. Meanwhile, transcriptome pro les with count-per-million calculated on an Illumina HiSeq RNA-seq platform and survival of HCC from the TCGA database (https://portal.gdc.cancer.gov/) were downloaded, containing 50 adjacent nontumor liver tissues and 374 HCC tissues. When multiple probes corresponded to one de ned gene, the average expression level was regarded as its nal expression. The quantile-normalization method was applied for normalizing gene expression intensities.

Identi cation of reliable DEGs
The open statistical R software (version 3.6.3, https://r-project.org/) and Bioconductor packages (http://www.bioconductor.org/) were utilized to excavate and screen DEGs between NT and TT. The original series matrix les of datasets and the annotation documents of platforms were downloaded, and the background correction and normalization were carried out by Perl software (version 5.26.3, https://www.perl.org/). Afterward, the identi cation DEGs between NT and TT in each GEO dataset was performed using the "limma" package with the cut-off criteria of the adjusted P-value < 0.05 and |log 2 fold-change (logFC)| > 1. To obtain robust DEGs of these GEO datasets, the RRA method was employed.
This method employs a probabilistic and nonparametric-based model for aggregation and is capable of identifying genes that were ranked consistently better than expected by chance and robust to outliers, noise and errors of results with the "RobustRankAggreg" package [5]. First, genes were ordered by their fold-change value. Then, the aggregation based on the ranks of genes in different GEO datasets was carried out. Finally, to reduce and avoid false-positive results, the P-value was subjected to Bonferroni's correction. Using the core algorithm of the "RobustRankAggreg" package, the screening criteria of |logFC| > 1.0 and corrected P-value < 0.05 were regards as statistically signi cant for the robust DEGs.
The transcriptome pro le of HCC was analyzed by the "edgeR" package, and the data analysis was consistent with that of GEO datasets except for the ltering approach in low expression genes. To obtain reliable results, DEGs held in common between GEO and TCGA were ltered out through Venn diagram analysis.

PPI network construction, hub gene selection and analysis
First of all, a functional protein association network was constructed in the STRING database (version 11.0, https://string-db.org/). The parameter of interactions among DEGs was set the highest con dence > 0.9. Then, Cytoscape software (version 3.7.2) was used to visualize and analyze the interaction network.
The cytoHubba plug-in of Cytoscape was utilized to explore vital nodes and sub-networks in a given network using different built-in topological algorithms. Any overlap in the top list of genes from different ranking algorithms will be de ned as hub genes. Subsequently, the expression and survival analysis of these hub genes were separately examined in Gene Expression Pro ling Interactive Analysis (GEPIA, http://gepia.cancer-pku.cn/). At the cut-off criteria of P-value < 0.05 and |log 2 fold-change (logFC)| > 1, hub genes were included in subsequent analyses.

GO and KEGG pathway enrichment analysis
To investigate and understand the biological functional roles of these hub genes, a user-friendly bioinformatics analysis tool, FunRich (version 3.1.3, http://www.funrich.org/) was applied for the biological process (BP), cellular component (CC), and molecular function (MF) of Gene Ontology (GO) and KEGG pathway enrichment analysis. The screening criteria of P-value < 0.05 was considered as statistical signi cance.
Construction of the risk score system Univariate, LASSO and multivariate Cox proportional regression analyses were utilized to explore the correlation between the expression level of each hub gene and HCC patient survival. The genes with P < 0.05 were considered statistically signi cant in the univariate Cox regression analysis and included in subsequent analyses. Next, LASSO-penalized Cox regression analysis was applied to narrow the range of target genes in the selected panel using 10-fold cross-validation by the "glmnet" package in R software. Then, a stepwise multivariate Cox regression analysis was performed to evaluate the contribution of a single gene as an independent prognostic factor and screen the genes tightly associated with survival for the best-t model. Finally, a prognostic gene signature was established based on a linear combination of the expression levels of the best-t OS-related genes multiplied with their corresponding regression coe cients (β) derived from the stepwise multivariate Cox regression. The prognostic gene signature was constructed as follows:

Prognosis risk index =
Where n is the number of genes, β i is the corresponding regression coe cient from the stepwise multivariate Cox regression analysis and Exp i is the expression value of gene.
Using the median risk index as the cutoff value, the 364 HCC patients with survival data were divided into high-and low-risk groups. Kaplan-Meier analysis and time-dependent receiver operating characteristic (ROC) curve analysis were utilized to assess the distinguishing and predictive performance of the signature. Additionally, a strati ed analysis was performed to evaluate whether the signature was e cient and practicable in the subgroups of the pathologic stage and age.

Identi cation of independent prognostic parameters
To investigate their independent prognostic value, univariate and multivariate Cox regression analyses were performed on the prognostic gene signature and the clinicopathological characteristics, including age, gender, pathologic stage, histologic grade, and vascular tumor invasion. Parameters with P < 0.05 based on univariate Cox regression analysis were considered statistically signi cant and further included in subsequent multivariate Cox regression analysis.

Predictive nomogram construction and validation
To predict the probability of 1-, 3-and 5-year OS for each patient with HCC, all independent prognostic parameters were included in the construction of a nomogram with the "rms" package in R software.
Subsequently, Harrell's concordance index (C-index) and a calibration plot comparing observed and predicted OS were utilized to assess the performance of the nomogram. The C-index was calculated to evaluate nomogram discrimination using a bootstrap method with 1000 resamplings. The calibration curve was plotted to compare observed and predicted OS.

Oncomine database analysis for prognostic signature
The expression levels of the prognostic signature in the prognosis risk model were further analyzed using the Oncomine database (https://www.oncomine.org), a web-based cancer microarray data mining platform. The mRNA expression folds in HCC tissue compared to respective normal tissue were obtained and compared. P-value < 0.05 and top 10% gene rank as the threshold.

Expression analysis of DEGs in hepatocellular carcinoma
In the present study, a multistep analysis was performed to investigate DEGs and their signi cant biological functions on by integrated bioinformatics method in HCC ( Figure 1). First, ten microarray pro les (GSE14520, GSE22058, GSE25097, GSE36376, GSE57957, GSE45267, GSE76427, GSE76297, GSE84005 and GSE121248) from the GEO datasets were selected and downloaded. There were a total of 2220 HCC samples, including 1016 adjacent nontumor 1204 tumor samples. After gene expression data processing, averaging and normalizing, DEGs among each GEO datasets were ltered and screened using the "limma" package with corrected P-value < 0.05 and |logFC| > 1. Based on the lter criteria, 779, 1723, 1675, 444, 373, 1062, 436, 451, 1025 and 580 DEGs from the above GEO datasets were obtained, respectively. The volcano plots of DEGs among each dataset are shown in Figure 2a. To explore robust DEGs in multiple GEO dataset, the RRA method was applied for integrating and carried out the metaanalysis of listed ranked genes. This method is a probabilistic and rank-based method assigned P-value as a signi cant score to each gene in the aggregated list, which suggests how superior it is ranked when compared to an expecting random ordering model. Finally, 518 signi cantly robust DEGs, which corresponded to 390 down-regulated and 128 up-regulated genes. The gene expression heatmap of the top 20 DEGs is shown in Figure 2b.
In the TCGA HCC cohort, 9077 DEGs were identi ed and ltered by the "edgeR" package, which comprised 7518 up-regulated and 1559 down-regulated genes. Among these DEGs, 451 DEGs held in common between GEO and TCGA HCC cohort were ltered out through Venn diagram analysis according to the thresholds set (Figure 2c), which corresponded to 126 up-regulated genes (logFC > 1.0) and 325 downregulated genes (logFC < -1.0).

PPI network construction, hub gene selection and analysis
To explore the interactions and central genes associated with HCC, a PPI network was constructed using the STRING database and visualized by Cytoscape software. A total of 154 downregulated and 95 upregulated DEGs were ltered into the PPI network, which contained 249 nodes and 1274 edges. Then, Maximal Clique Centrality (MCC), Density of Maximum Neighborhood Component (DMNC), Maximum Neighborhood Component (MNC) and Degree in the cytoHubba plug-in were employed for screening signi cant hub genes. After Venn diagram analysis of the top 50 genes screened by the four ranking algorithms, 41 genes were overlapped and identi ed as hub genes (Additional le 2: Figure S1).

External validation and survival analysis of hub genes
Since there are fewer normal liver tissues in the TCGA database, all 41 genes were subsequently investigated and analyzed in the GEPIA database, including 369 cancerous and 160 normal tissues in Genotype-Tissue Expression (GTEx) and TCGA projects. Based on the cutoffs (|log2FC| > 1 and p< 0.01), 36 of the 41 hub genes were signi cantly upregulated in HCC tissues compared to normal liver tissues ( Figure 3). Subsequently, overall survival (OS) and disease-free survival (RFS) analysis of 36 hub genes were further performed using Kaplan-Meier analysis in the GEPIA database (364 HCC). The results illustrated that HCC patients with high expression (> median expression value) of these genes displayed worse OS and RFS (p< 0.05; Additional le 3: Table S2).

Function and pathway analysis of hub genes
To elucidate the functional characteristics of the veri ed 36 hub genes, the enrichment analysis of GO and KEGG pathway was performed on FunRich software. The GO enrichment analysis results were divided into three functional categories (BP, CC and MF) (Additional le 4: Table S3). For BP, these 36 hub genes were mainly enriched in spindle assembly, cell cycle, chromosome segregation, cell communication and signal transduction (Figure 4a). For CC, they were particularly enriched in kinetochore, nucleus, microtubule, chromosome and condensed chromosome kinetochore (Figure 4b). For MF, they were notably enriched in protein serine/threonine kinase activity, kinase binding, protein binding, motor activity and ATP binding (Figure 4c). According to KEGG pathway enrichment analysis, they were signi cantly enriched in the cell cycle, PLK1 signaling events, Polo-like kinase signaling events in the cell cycle, Aurora B signaling and M phase (Figure 4d, Additional le 4: Table S3).
Construction of the gene-related associated risk score system After the exclusion of the patients with incomplete survival data, 364 HCC patients remained in this study. Univariate Cox analysis was performed to explore the relationship between the overall survival and the expression level of each hub gene. As a result, the high expression of 36 hub genes was signi cantly correlated with worse overall survival (P< 0.05). Then, LASSO-penalized Cox analysis was used to remove confounding factors and cut the number of genes by 10-fold cross-validation producing the best lambda to minimize the biases and errors. Based on the LASSO-penalized Cox regression, 18 genes (ASPM, AURKB, CCNA2, CDCA3, CDCA8, CENPA, CENPF, CENPK, ECT2, KIF20A, KIF4A, NEK2, PBK, PRC1, RACGAP1, RRM2, SPC25, TOP2A and TPX2) closely correlated with survival were selected for the development of a risk prediction model (Figure 5a, 5b). Finally, a stepwise multivariate Cox regression analysis was performed, and 11 genes were nally chosen to establish an evaluation model. + (-0.8553 * expression level of PRC1) + (0.3235 * expression level of SPC25) + (-0.3274 * expression level of TOP2A) + (0.6281 * expression level of TPX2). All of these genes, including CDCA8, CENPA, ECT2, KIF20A, SPC25 and TPX2, displayed positive coe cients in the formula, indicating high-risk signatures for them. The risk score of each HCC patient was calculated, and the patients were divided into the highrisk (n=182) or low-risk (n=182) group based on the median risk score as the cutoff value. The distributions of the eleven-gene-based risk scores, OS statuses and the expression pro les of 11 genes in the TCGA HCC cohort were displayed in Figure 5c. Intuitively, the number of deaths was notably taller in the high-risk group, and the heatmap suggested that all 11 genes were expressed at higher levels in the high-risk group compared to the low-risk group. Kaplan-Meier analysis of the entire dataset (n=364) clearly showed that the HCC patients in the high-risk group had a worse prognosis than those in the lowrisk group (median OS: 3.11 vs 6.73 years, P = 3.21E-04) (Figure 5d). Subsequently, the prognostic capacity of the model was evaluated by calculating the receiver operating characteristic (ROC) area under a curve. The areas under the curves (AUCs) of the ROC curve of the whole cohort based on this predictive model were 0.755, 0.708 and 0.729 for the 1-, 3-and 5-year survival time, respectively (Figure 5e), suggesting that the predictive model had a high speci city and sensitivity. Next, the model was further assessed in subgroups. The 374 patients with HCC were divided into the stage-subgroup (n=168), stage-subgroup (n=84), stage-subgroup (n=83) based on their pathologic stage. Except for the stage-subgroup, which did not have a plenitudinous sample, patients in each subgroup were divided into the high-risk or low-risk group based on the above cutoff value. As shown in Figure 6a-c, patients in the high-risk group had signi cantly shorter survival time than those in the low-risk group. When a strati ed analysis was executed based on age, the test in the <65-year-old or ≥65-year-old subgroup illustrated the same results (Figure 6d, 6e). Thus, the model was certainly reliable and practicable in the prognosis evaluation.
Independence of the prognostic signature from other clinical characteristics Next, univariate and multivariate Cox regression analyses were utilized to explore whether the prognostic performance of the signature was independent of those of conventional clinical risk factors in 283 HCC patients with full clinical information. Univariate Cox regression analysis suggested that the signature had independent prognostic value (P< 0.05), while age, gender, pathologic stage, histologic grade and vascular tumor invasion did not closely correlate with the OS and could be not extremely effective of independent prognostic signature ( Table 1). Considering that age nearly reached the statistical signi cance, it and the prognostic model were incorporated into the multivariate Cox analysis. The results indicated that the eleven-gene signature could be an independent prognostic indicator ( Table 1).

Development and validation of a predictive nomogram
To build an e cient quantitative method for predicting the survival probability of HCC patients, a userfriendly nomogram included two factors (age and prognostic model) that were generated (Figure 7a). The nomogram allowed the clinicians to calculate the 1-, 3-and 5-year OS probability of each HCC patient easily. Subsequently, the discrimination and calibration abilities of the nomogram were evaluated by using a concordance index (C-index) and calibration plots. The C-index was 0.707 (95% con dence interval: 0.660 -0.754) using bootstrap with 1000 resamplings, suggesting that this nomogram has an excellent discrimination ability. The 1-, 3-and 5-year OS probabilities were visualized by calibration plots (Figure 7b), suggesting that the probabilities generated by this nomogram were closely approximated the actual survival situation.

Expression validation and KEGG analysis of the prognostic signature
The eleven genes in the prognostic model were further analyzed individually in the Oncomine database. As illustrated in Additional le 6: Figure S2, the mRNA expression levels of AURKB, CDCA3, CDCA8, CENPA, ECT2, KIF20A, NEK2, PRC1, SPC25, TOP2A and TPX2 were notably upregulated in HCC tissues compared with those in nontumor liver tissues (p < 0.05), which was consistent with our ndings. To further elucidate and understand the biological functions of the eleven genes, KEGG pathway enrichment analysis was performed again. Results showed that these genes were markedly enriched in the biological process related to the regulation of cell proliferation, such as "cell cycle", "PLK1 signaling pathway" and "Aurora B signaling pathway" (Additional le 5: Table S4).

Discussion
Hepatocellular carcinoma is one of the deadly malignant cancers worldwide with > 700,000 new cases per year [3], which is associated with various risk factors, such as hepatitis B virus, hepatitis C virus, alcoholic cirrhosis, nonalcoholic fatty liver diseases and certain genes. It is estimated that Asia will possess the largest number of HCC patients in the world by 2030 [9]. Although advances in the diagnosis and treatment, 1-, 3-and 5-year overall survivals of HCC are still low and unsatisfactory. Accurate prediction of overall survival is vital for individualized therapy approach selection and prognosis improvement. Conventional indicators such as pathologic stage, histologic grade, vascular invasion and tumor-node-metastasis status currently assist in predicting prognosis to a certain extent. However, the clinical outcomes of patients with the same stage or status often differ, suggesting that the current assessment is not su cient for the heterogeneity of HCC and identi cation of novel prognostic biomarkers and construction of more accurate prognostic models are urgently required. The integration of prognostic signature and signi cant clinical parameters may contribute to a better prognosis prediction than a single biomarker or clinical factor. Recently, prognostic signatures based on DEGs have caused wide public concern and showed great potential in prognosis prediction of cancer and novel therapeutic targets.
In the present study, 451 robust DEGs were identi ed through an integrated analysis of multiple microarray and transcriptome pro les from different analysis platforms. Among them, eleven genes (AURKB, CDCA3, CDCA8, CENPA, ECT2, KIF20A, NEK2, PRC1, SPC25, TOP2A and TPX2) closely correlated with survival were selected by univariate, LASSO and multivariate Cox analysis. Based on their expression level and corresponding regression coe cients, a prognostic risk model was proposed. The veri cation of this model in the entire cohort and subgroup (age and pathologic stage) showed that patients in high-risk groups had signi cantly worse overall survival. The AUCs of the ROC curve of the whole dataset based on this model were 0.755, 0.708 and 0.729 at 1, 3 and 5 years of OS, respectively. For further application of this model in clinical, a user-friendly nomogram integrated the eleven-gene signature with age was generated and can be used to assess the survival probability of individual patients. The C-index and calibration plots indicated that the predicted survival is very close to the actual survival situation, suggesting that this model had an excellent performance in survival prediction. Based on the nomogram, the clinician can set an appropriate treatment plan to achieve the individualized treatment of HCC patients.
With the growing interest in personalized medicine, many prognostic risk assessments have been identi ed and found to boost survival predictions in a wide variety of cancers. Unfortunately, DEGs with no biological functions could be obtained due to the limited sample sizes and impertinent methods such as plain intersecting the DEGs lists from different technological platforms. In terms of these di culties, ten microarray pro les from the GEO database and a high-throughput sequencing pro le from the TCGA database were integrated to identify DEGs in this study. Undoubtedly, the candidate genes ltered in this study are more reliable than those identi ed in previous studies. Among reliable DEGs, the eleven genes included in the prognostic model were closely related to the proliferation and invasion of the tumor cell and further analyzed individually.
Aurora kinase B (AURKB) is the core enzyme of the chromosomal passenger complex (CPC), which is responsible for the accurate regulation of chromosomal segregation, cytokinesis, mitotic checkpoint and protein localization to the centromere and kinetochore, and plays a vital role in the correction of microtubule-kinetochore attachments [10][11][12]. Its uncontrolled expression can promote aneuploidy and tumor development. Increasing evidence revealed that AURKB was highly expressed in a variety of malignant cancers. Tanaka et al. reported that high expression level AURKB in HCC and may be an effective predictor of aggressive HCC recurrence after curative hepatectomy [10]. Benten et al. reported that inhibition of AURKB by a novel aurora kinase inhibitor (PHA-739358) signi cantly suppressed the growth of hepatocellular carcinoma in vitro and a xenograft mouse model [13], implying AURKB could be a potential therapeutic target for HCC. Another study showed a selective inhibitor of AURKB (AZD1152) suppressed histone H3 phosphorylation and induced cellular apoptosis in twelve human HCC cell lines; AZD1152 can decelerate tumor growth and increase survival in an orthotopic hepatoma model [14].
Cell division cycle associated 3 (CDCA3), a component of Skp1-cullin-F-box, acts as a regulator of mitosis. Timo et al. demonstrated CDCA3, as a driver gene in carcinogenesis, exerted crucial functions in HCC by inducing cell cycle progression [15]. Cell division cycle associated 8 (CDCA8) is one key regulatory component of CPC, which plays a crucial regulatory role in mitosis and cell division [16]. The high expression level of CDCA8 was positively correlated to tumor cell proliferation, invasion, and poor prognosis in a variety of cancers, such as gastric [17] and lung cancer [18].
Centromere protein A (CENPA), a histone H3 variant of centromeric nucleosomes, plays a vital role in cell cycle regulation and genetic stability. Previous research has shown that it can promote HCC cell proliferation both in vitro and in vivo [19]. RNAi-mediated its depletion inhibited HCC cell growth, blocked cell cycle progression at the G1 phase, and assisted apoptosis of HCC cell [20,21].
Epithelial cell transforming sequence 2 (ECT2) belongs to the Ras GTPases superfamily, [22] is a classical oncogene originally identi ed in 1991 and highly expressed in various cancers [23]. It performs crucial regulatory roles in cell proliferation, oncogenesis, tumorigenesis, and metastasis of HCC cells [24]. Chen et al. demonstrated that ECT2 was signi cantly upregulated in HCC and strongly responsible for promoting early recurrence of HCC. Knockdown of ECT2 will notably suppress Rho GTPases activities, enhance apoptosis, attenuate oncogenicity, and restrain the metastatic ability of HCC cells [24]. Another study reported that the down-regulation of ECT2 by miR-190-5p can signi cantly inhibit the metastasis of HCC [25]. These results indicated that ECT2 may be a potential therapeutic target for HCC.
Kinesin family member 20A (KIF20A), a microtubule-associated motor protein, is an essential mitotic kinesin required for cytokinesis, which can directly interact with Rab6 small GTPase and participate in the dynamics of the Golgi apparatus [26]. Increasing evidence revealed that KIF20A may play an important role in the development and progression of various cancers [27]. Shi et al. reported that knockdown of KIF20A could markedly inhibit the growth of HCC cells in vitro and the growth of HCC xenografts in vivo [28].
NIMA-related kinase 2 (NEK2) located in the centrosome, a member of the NIMA family of serine/threonine kinases, is involved in the cell cycle and mitosis as a vital oncogene. Elevation of NEK2 will contribute to premature centrosome splitting, concomitant with centrosomal abnormalities, monopolar spindles and aneuploidy [29]. Previous studies have reported that it was overexpressed in numerous cancer cell lines and associated closely with tumor size, portal vein invasion and poor tumor differentiation [30]. Zhang et al. reported that over-expression of NEK2 could promote the proliferation of HepG2 cells by activating the mitogen-activated protein kinase (MAPK) signaling pathway [31] and interfering or silencing NEK2 expression can suppress HepG2 cells' cell growth and invasion, promote apoptosis and cycle arrest [32]. Lai et al. demonstrated that aberrant NEK2 could accelerate the progression of the cell cycle and promote cell proliferation via the activation of the Wnt/β-catenin signaling pathway [33]. Another study found that overexpression of NEK2 in hepatoma cells could enhance drug resistance, promote metastasis and angiogenesis via the PI3K/AKT-NF-κB signaling pathway [34]. These studies suggested that the aberrant expression of NEK2 was signi cantly relative to the occurrence and progression of HCC and imperfect survival.
Protein regulator of cytokinesis-1 (PRC1), a member of the microtubule-associated protein family, is involved in cytokinesis and carcinogenesis. Increasing evidence revealed that PRC1 as an oncogenic factor in tumorigenesis of HCC [35] and lung adenocarcinoma [36]  Recently, several studies con rmed that intracellular topoisomerase levels could determine the chemosensitivity of tumor cells, [39][40][41] suggesting that TOP2A may be a potential target for enhancing drug response in cancer therapy.
Targeting protein for Xenopus kinesin-like protein 2 (TPX2) is a nuclear proliferation microtubuleassociated protein and impacts spindle assembly in mammalian cells [42]. Aberrant expression of TPX2 induces the ampli cation of centrosome and causes DNA polyploidy. Recently, several studies have reported that TPX2 promotes tumorigenesis and metastasis and is signi cantly upregulated in numerous types of malignant tumors. And increasing evidence demonstrated that TPX2 expression is associated with tumor migration, invasion and poor prognosis in HCC [43,44]. Reiko et al. con rmed that the siRNAmediated knockdown of TPX2 can inhibit the proliferation of HCC cells, and the growth of HCC xenografts transplanted into immunode cient mice [45]. Liu et al. found that siRNA-mediated knockdown of TPX2 suppressed HCC cell invasion via inactivating AKT signaling and down-regulation of MMP2 and MMP9 expression in SMMC-7721 and HepG2 cells [46]. Another study reported that the siRNA-mediated knockdown of TPX2 may suppress the growth of HCC by regulating the PI3K/AKT signaling pathway [47].
These studies indicated that TPX2 may be a promising target for the treatment of HCC.
The predictive model is based on the expression levels of the above-mentioned genes. The eleven-gene signature was integrated with patients' age to build a user-friendly nomogram, which enables clinicians to assess patients' prognosis and facilitate the customized treatment for high-risk patients. To our knowledge, the eleven-gene signature described herein and the nomogram has not been studied previously. These ndings will contribute directly to the understanding of the molecular mechanism of HCC and the prediction of prognosis. Moreover, the gene signature may be a potential therapeutic molecular target. However, this study also had certain limitations as follows: 1) the model was mainly validated in the TCGA-HCC project. To reduce the risk of over tting, other independent cohorts with plenty of clinical data were needed. 2) the clinical information was required from the TCGA database and most of the patients are white and Asian. Caution must be taken when extrapolating our ndings to other ethnicities.

Conclusion
In summary, signi cant and reliable survival-related DEGs were identi ed by using multiple gene expression pro les from the GEO and TCGA datasets, and a nomogram composed of an eleven-gene signature was established. Based on the prognosis risk index and patients' age, the clinician can change the patient's treatment plan to realize a more individualized approach for high-risk patients and effective measures can be taken to prevent or slow down the recurrence and deterioration of HCC in high-risk populations. The ndings contribute directly to the survival prediction and treatment management for patients with HCC.