A Nomogram For Predicting HCC Patients’ Overall Survival Based On Double Hub Genes and Other Clinical Risk Factors.

Background: Hepatocellular carcinoma (HCC) is one of the series malignant aggressive disease which shows elusive biological behavior and terrible prognosis. It is inadequate for the single evaluation system such as tumor-node-metastasis (TNM) staging system to predict the overall survival (OS) in HCC patients. Here, we conducted this study to identify prognosis-related genes, so as to form a reliable prognostic assessment model. Results: From three datasets in the GEO database, we identied two hub genes cytoskeleton associated protein 2 (CKAP2) and forkhead box M1 (FOXM1) among 348 differentially expressed genes (DGEs) that were differentially expressed between HCC and normal samples. The function analysis of those DEGs are enriched in cell division process (e.g., mitotic nuclear division, nuclear chromosome segregation) and metabolic process (e.g., organic acid catabolic process). Then, we established a two-gene model, that extremely distinguished the population at risk of liver cancer to high-risk and low-risk and was even viable in the TNM stage i-ii and iii-iv, vascular invasion and non-invasion subgroups (all P<0.05). Next, a nomogram was set out combined with the two hub genes and clinical risk factors, and the predictive power of the nomogram performed more outstanding than the gene expression or clinical parameters alone. Conclusions: Our two-gene-based evaluation system effectively ltered out the high-risk HCC patients, and could potentially be used for clinical decision-making and individualized management of particular HCC patients.

In our study, we downloaded 3 microarray datasets [GSE46408, GSE62232, GSE74656] from GEO database, calculated the differential expression multiples between HCC and non-cancerous liver tissues of each gene with GEO2R online tools, then obtained the DEGs in those datasets. Next, GO and KEGG enrichment analysis were conducted to display an overview of the function of those screened DEGs. With the STRING online website and the Cytoscape software, protein-protein interaction (PPI) network was established for further analysis and ltering of the crucial genes. Last, combining some clinicopathological factors in the TCGA database, we established a two-genes clinical-related signature for assessing patients' risk. Based on the two-genes clinical-related signature and other clinicopathological factors (i.e, TNM stage and vascular invasion), a two-gene-based nomogram to evaluate the overall survival of patients with HCC was created.

Materials And Methods
The overall ideal and design route of this article was presented in the removing those factors whose missing values percentages were more than 15% (Fig S1). Next, we selected 1 as the random number used multiple imputation in mice package, based on 5 replications and a chained equation approach method in the RMI procedure, to account for missing data so as to obtain a new complete dataset (entire dataset) (table 1). The clinical characteristics did not differ signi cantly between the two datasets (all P>0.05) (Table S1).

Selection of differentially expressed genes (DEGs)
GEO2R(http://www.ncbi.nlm.nih.gov/geo/geo2r), an online analysis tools in GEO website, was used to identify the DGEs between the tumor tissue and the nontumourous tissue in the three GEO datasets. The adjusted P-values (adj. P) and Benjamini and Hochberg false discovery rate were applied to provide a balance between discovery of statistically signi cant genes and limitations of false-positives. Probe sets without corresponding gene symbols or genes with more than one probe set were removed or chose the maximum, respectively. We set adj. P≤0.05 and | logFC (fold change) |≥1 as the criterion of statistically signi cant, then selected the common DEGs conforming the criterion in the three datasets.

KEGG and GO enrichment analyses of DEGs.
Kyoto Encyclopedia of Genes and Genomes(KEGG) is a database resource for understanding high-level functions and biological systems from large-scale molecular datasets generated by high-throughput experimental technologies. Gene ontology (GO) is a major bioinformatics tool to annotate genes and analyze biological process of these genes.
The selected DEGs can be divided into two groups, up-regulated and down-regulated. To analyze the function of DEGs, biological analyses were performed using Cytoscape(4) plug-in ClueGO(version 2.5.7) (5) and Gluepedia(version 1.5.7)(6). P<0.05 was considered statistically signi cant.

PPI network construction and module analysis.
Search Tool for Recurring Instances of Neighbouring Genes (STRING; http://string-db.org) (version 11.0) (7) is an online database implementing experimental and predicted PPI information. All the selected DEGs including up-regulated and down-regulated were input the online tool to establish a complicated protein to protein interaction (PPI) network. Basic setting and advanced setting of STRING database were default. After ltering out the disconnected nodes, the new PPI network was import into the Cytoscape, with Up-regulated DEGs and down-regulated DEGs respectively marked in red and blue.
We use the plug-in MCODE(8) in the Cytoscape to identify the most signi cant module from this PPI network. The criteria for selection were as follows: degree cut-off=2, node score cut-off=0.2, Max depth=100 and k-score=2. Module with the highest score was considered as the candidate module for the next analysis.
Biological Networks Gene Ontology (BiNGO, version 3.0.4)(9) plug-in can calculate overrepresented GO terms in the selected module and display them as a network of a signi cant GO terms. The setting of biological process analysis in the candidate module is as follows: Hypergeometric test as the statistical test, Benjamini & Hochberg False Discovery Rate (FDR) correction as the multiple testing correction, overrepresented categories after correction to be visualized, the whole annotation as reference set, P<0.01 as the signi cance level.

Identi cation and analysis of hub genes
The candidate genes of the most signi cant module from the PPI network were evaluated by the plugin cytoHubba (version 0.1)(10). The plugin cytoHubba in the Cytoscape can predict and explore important genes and subnetworks in a given PPI network with different topological algorithms. In this study, we calculated the scores of all genes in the module by the algorithms of ClusteringCoe cient. Consequently, the top 10 genes with the highest scores of the algorithm was selected as the candidate genes. For the preliminary exploration of the prognostic potential, overall survival (OS) of the 10 candidate genes was performed with the Kaplan-Meier test in 363 HCC patients.
Then, 182 patients in the whole population were randomly sampled and established as the 'primary dataset' (Table S2), while the 'entire dataset' included the complete 363 patients. Importantly, there were no signi cant differences in clinical characteristics between the two datasets. These two datasets were used for the subsequent model establishment and veri cation. During the process of grouping, we set the sample size of primary dataset as 50% in the entire dataset, while 1 was selected as the random number.
To further explore the best-t clinical-related hub genes, we ltered the 10 candidate genes through the univariate (screening criteria: P < 0.2) and multivariate (screening criteria: P <=0.05) Cox proportional hazards regression (CPHR) analysis in the primary dataset. The hub genes with P <=0.05 in the primary dataset was chosen for assessment of the clinical-related gene signature and the development of OSrelated gene signature-based prognostic nomogram.

Establish of clinical-related gene signature and clinicalrelated gene signature-based prognostic nomogram
During performing the multivariate CPHR in the above, we also calculated the coe cient of each gene. Therefore, a risk score formula, with the linear combination weighting of the OS-related genes' expression values and the corresponding regression coe cients, can be established. The score formula is shown as follow: As shown above, E i represents the expression value of each OS-related genes, while the regression coe cient of the above genes is de ned as K i . Next, the corresponding risk score of every patient in the two datasets can be calculated, and cut-off value of the high/low-risk group in 2 datasets was set as the median score calculated from the primary dataset. All datasets were performed Kaplan-Meier method to explore the prognosis differences in the high-risk subgroup and low-risk subgroup. Additionally, to further evaluate the predictive ability of the OS-related genes-based classi er, time-dependent ROC curves of different datasets and correspondent AUC values were conducted with one-year, three-year and ve-year.
Moreover, in order to identify independent predictors of OS, we conducted 14 typical clinical characteristics and the OS-related genes signature with univariate (screening criteria: P < 0.2) and multivariate (screening criteria: P <= 0.05) CPHR analyses in the primary dataset. Furthermore, we performed a strati ed analysis so as to verify whether the association of the OS-related genes signature with OS was independent of the TNM stage and vascular invasion. Using the selected clinical characteristics, we developed a prognostic nomogram with the 'rms' package. The predictive abilities of the established model were assessed with a C-index (non-events vs. events) and calibration curves (nomogram-predicted OS vs. actual probabilities OS) in all datasets. A bootstrap validation with 1000 resamplings was used for these activities. As for evaluating the predictive performance of our nomogram, the time-dependent ROC curves and AUC values of the nomogram based on the OS-related genes and other risk factors were the indicators considered in each dataset.

Veri cation of expression level and survival analysis in the public databases
First of all, two databases including GEPIA2 database (http://gepia.cancer-pku.cn/index.html) (11) and UALCAN database(http://ualcan.path.uab.edu/index.html) (12) were conducted for the 2 OS-related genes' expression difference in carcinomatous and normal samples. Meanwhile, in order to verify the expression level of CKAP2 and FOXM1 between tumor and normal tissues, we downloaded 424 sample data from HCC patient in TCGA database and analyze those data with R (version 4.0.2). After removing 5 unquali ed samples(formalin-xed para n-embedded tissues and recurring tumor tissues) and TMM normalization of edgeR package(13), a total of 419 samples containing 369 tumor tissues and 50 adjacent tumor tissues were taken to analyses. Then, with the aim of exploring the role of the 2 selected genes during HCC formation and progression, the mRNA expression of hub genes in different stages was compared between HCC and normal tissues by using UALCAN database.
Kaplan-Meier plotter (http://kmplot.com/analysis) (14) is an online survive analysis tool whose sources mainly coming from TCGA database. Overall survival (OS) was performed in our study to evaluate the prognostic potential of CKAP2 and FOXM1. OS analysis was also conducted in GEPIA2 database.

Selection of DEGs in HCC
We comprised 92 Hepatocellular Carcinoma tissues and 21 normal liver tissues in the 3 databases.
According to the cut-off criteria of |log2 FC| ≥ 1.0 and adj P-value ≤ 0.05, we identi ed 2160 DEGs in GSE46408 database, 1444 DEGs in GSE62232 database and 1018 DEGs in GSE74656 database with the GEO2R online analysis tool. 348 overlapping DEGs in the 3 databases containing 194 up-regulated genes and 154 down-regulated genes was shown by a venn diagram (Fig. 2). The results of the expression level analysis of 3 the databases are presented in a volcano plot (Fig. 2).

GO and KEGG enrichment analysis of DEGs
To explore the biological classi cation of DEGs, functional and pathway enrichment analyses were performed using Cytoscape plug-in ClueGO. The GO function annotation can be divided into three functional groups, cell component (CC), molecular function (MF), and biological process (BP). 194 upregulated genes were main enriched in BP (mitotic nuclear division, nuclear chromosome segregation, nuclear chromosome segregation, microtubule cytoskeleton organization involved in mitosis and mitotic sister chromatid segregation) and CC (centromeric region) ( Fig. 3A and B). Changes in KEGG pathway analysis of these gene signi cantly enriched cell cycle, focal adhesion, ECM-receptor interaction and cellular senescence ( Fig. 3C and D).
The GO function analysis of 154 down-regulated genes revealed that those biological processes including organic acid catabolic process, alpha-amino acid metabolic process, cellular amino acid catabolic process, cellular response to xenobiotic stimulus, lipid oxidation may play a critical role in the progress of HCC ( Fig. 4A and B). While, the changes in KEGG pathway analysis were main enriched in fatty acid degradation, complement and coagulation cascades, chemical carcinogenesis and retinol metabolism ( Fig. 4C and D).

PPI network construction and module analysis.
348 DEGs were input the STRING online website to establish a PPI network with 297 nodes and 2453 edges after removing the disconnected nodes. This network was imported into the Cytosacpe software for the visualization (Fig S2) and the next module analysis. The MCODE plug-in in Cytoscape software was used to grab the most signi cant module in the network. 12 clusters obtained from the MCODE plugin were shown in the Table 2. The rst cluster with the scores 50.423 was identi ed as the most signi cant module (Fig. 5A). The biological processes of the candidate module were carried out using the BiNGO plug-in (Fig. 5B).

Identi cation and analysis of hub genes
We implemented the ClusteringCoe cient method of cytoHubba plug-in to evaluate the signi cance of the genes in the module. Thereafter, 10 top genes (CDC7, CKAP2, MND1, FANCD2, EZH2, DEPDC1, ECT2, FOXM1, UBE2T and CKS2) with the highest scores were considered as the candidate genes. With the Kaplan-Meier analysis in 363 HCC patients, high expression group of every gene performed signi cantly worse survival event than the low-expression group (P < 0.05) (Fig S3).
We ltered these candidate genes through a univariate and multivariate CPHR analysis. As shown in the table (Table 3)

Establishment and assessment of OS-related genes signature
Based on the above multivariate CPHR analysis, we got CKAP2 and FOXM1's risk-coe cients, 1.21 and 0.98 respectively. Next, we established a risk score formula: Then, we calculated the two-genes-based risk score for each HCC patient in the primary dataset and entire database. Because of the non-normal distribution of risk score, we set the median risk score as the cut-off value to classify all patients to 'High-score' group and 'Low-score' group. Primary database contains 91 High-score patients and 91 Low-score patients, while 181 High-score patients and 182 Low-score patients in entire database.
In primary dataset, Kaplan-Meier curve analysis clearly demonstrated that the high-risk group (n=91) had a poorer prognosis than the low-risk group(n=91) (P= 0.0062, log-rank test) (Fig. 6A). Subsequently, we constructed a time-dependent receiver operating characteristic (ROC) curve with the primary dataset. As shown in the gure, the area under the time-dependent ROC curve (AUC) of the OS-related genes signature reached 0.68 (95% con dence interval [CI]=59.89−75.97) at one years, 0.64 (95% CI=55.2−73.61) at three years and 0.62 (95% CI=50.09−73.91). (Fig. 6B) The performance of the OS-related genes signature for predicting survival was then validated in the entire dataset. Similar to the results of the primary dataset, a Kaplan-Meier curve analysis indicated that the survival time of HCC patients was signi cantly shorter in the high-risk group (n=181) than in the low-risk group (n=182) (P=0.00073, log-rank test) (Fig. 6C). The AUC of the clinical-related gene signature was 0.7 (95% CI=64.3−76.27) at one years, 0.62 (95% CI=54.97−68.99) at three years and 0.57(95% CI=48.12−65.78) in the entire dataset (Fig. 6D). Thus, the predictive performance of the clinical-related gene signature for HCC patients was great in both the primary dataset and the entire dataset.
3.6. The prognostic value of the clinical-related gene signature was independent from those of conventional clinical risk factors Clinical risk factors such as the TNM stage and age are still vital predictors of OS in HCC patients. Therefore, we integrated these traditional risk factors with our clinical-related gene signature to develop an e cient quantitative method of predicting OS. We rst evaluated the prognostic value of several clinical risk factors in univariate and multivariate CPHR analyses of the primary dataset. We found that, in addition to the clinical-related gene signature, TMN stage (iii-vi vs. i-ii) and vascular invasion (yes vs. no) were signi cantly associated with OS (all P<0.05) ( Table 4).
Considering the number of HCC patients, we performed a risk-strati ed analysis with the entire dataset. The 363 HCC patients were strati ed into a stage i-ii subgroup (n=270), stage iii-iv subgroup (n=93) based on their TNM stage. Each subgroup was divided into a high-risk group and a low-risk group based on the risk scores proposed above. We found that the classi cation e ciency of the clinical-related gene signature was limited when it was applied to certain subgroups. As shown in the Kaplan-Meier curves, for the two subgroups, patients in the high-risk group had signi cantly poorer survival than those in the lowrisk group (stage i-ii subgroup, P=0.049; stage iii-iv subgroup, P=0.03, log-rank test) (Fig. 7A and B). Similarly, when a strati ed analysis was carried out based on vascular invasion, prognosis was worse in the high-risk group both in invasion subgroup(n=133) and non-invasion subgroup(n=230) (Fig. 7C and D). 3.7. Establishment of a nomogram containing the clinicalrelated gene signature with clinical risk factors we integrated these traditional risk factors (TNM stage and vascular invasion) with our clinical-related gene signature to develop an e cient quantitative method of predicting OS. Ultimately, on the basis of clinical judgment and statistical signi cance, we developed a clinical-related gene signature-based nomogram, which integrated the clinical-related gene signature and two clinical risk factors (vascular invasion and TNM stage). We then used this nomogram to predict the one-year, three-year and ve-year survival of HCC patients (Fig. 8A).
In the nomogram, the TNM stage contributed the most to the one-, three-and ve-year OS, closely followed by the OS-related genes signature and vascular invasion. This user-friendly graphical tool allowed us to determine the one-, three-and ve-year OS probability for each HCC patient easily.
We then evaluated the discrimination and calibration abilities of the prognostic nomogram by using a concordance index (C-index) and calibration plots. An internal validation using a bootstrap with 1000 resamplings revealed that the nomogram performed well for discrimination: the C-index was 0.658 (95% CI=0.633-0.683) for the entire dataset and 0.659 (95% CI=0.628-0.69) for the primary dataset. The oneyear, three-year and ve-year OS probabilities generated by the nomogram were plotted against the observed outcomes, as shown in Fig. 8B-G. The probabilities determined by the nomogram closely approximated the actual probabilities, especially the predicted ability of the three-year OS.

Veri cation of expression level and survival analysis in the public databases
We next explore those genes' expression difference in the public database. Pan-cancer research of the GEPIA database has prompted us the up-regulated of CKAP2 and FOXM1 in the HCC tissue (Fig. 10E). In GEPIA database and UALCAN database, as shown in Fig. 10A and D, the bar chart demonstrated that the expression levels of CKAP2 and FOXM1 in tumor samples are higher than those in the normal tissues. With 419 sample from TCGA database, CKAP2 and FOXM1 were overexpressed in the HCC tissues (Fig.  10C). Furthermore, according to the nomogram, we inferred that the expression level might be associated with disease progression, which was consist with our inference in the UALCAN database (Fig. 10B). All the results have signi cant statistical differences.
After examining the mRNA expression levels of the 2 genes in HCC, we veri ed the prognostic potential in the public databases. The survival signi cance map of Pan-cancer analysis in the GEPIA database preliminary revealed that the overexpression of those 2 genes were signi cant in the HCC patients (Fig.  11A). Consequently, as shown in Fig. 11B and C, high expression of CKAP2 and FOXM1 both performed poorer prognosis in the OS analysis of GEPIA database and ULACAN database.

Discussion
Hepatocellular Carcinomas is one of the most commonly diagnosed cancers and the fourth most common cause of cancer-related death in the world. Although the treatment of hepatocellular carcinomas has improved, the prognosis of patients is generally poor due to the lack of precise molecular targets. Therefore, better biomarkers for speci c prognosis and progression of HCC are demanded. In the present experiments, bioinformatics analysis was performed to identify the potential key genes correlated with HCC and develop a reliable prognostic model for HCC patients for helping the clinical decision-making.
In our study, we compared three gene microarray expression datasets from GEO database containing tumor tissues and normal tissues. 194 up-regulated and 154 down-regulated DEGs were successfully identi ed, respectively enriched in cell division process (e.g., mitotic nuclear division, nuclear chromosome segregation) and metabolic process (e.g., organic acid catabolic process). Based on the degree of connectivity in PPIN, 10 selected genes from ClusteringCoe cient algorithms were identi ed, including CDC7, CKAP2, MND1, FANCD2, EZH2, DEPDC1, ECT2, FOXM1, UBE2T and CKS2. Through Kaplan-Meier analysis and CPHR analysis, CKAP2 and FOXM1 were considered as the crucial genes in the progress of HCC.
CKAP2 is a cytoskeleton-associated protein which was reported to colocalize with microtubules and centrosomes during interphase and with mitotic spindles during mitosis (15). The function of CKAP2 protein can stabalize microtubules and plays a role in the regulation of cell division. However, Tsuchihara et al disclosed that the overexpression of CKAP2 in the absence of p53 can induce tetraploidy and aberrant centrosome numbers(16). The CKAP2-induced cytokinesis failure may play a crucial role during the process of tumorigenesis and development. Furthermore, the overexpression of CKAP2 has been reported in cervical carcinoma (17), osteosarcoma(18), ovarian cancer(19) and prostate cancer(20). With the hepatocellular carcinomas, Hayashi et al also proved the potential predictive ability of CKAP2 in early and extensive recurrence after operative resection and the expression of CKAP2 may be associated with multiple tumors or microscopic vascular invasion (15). In our nomogram based on CKAP2 and FOXM1, high-expression of genes and vascular invasion were related with the poorer prognosis, whose results supported the present ndings. FOXM1 is one of the members of the Fork head family proteins that contains a 100 amino acid long DNA binding domain, functioning in the regulation of G1/S and G2/M transitions of cell cycle, maintenance of mitotic spindle integrity, angiogenesis, DNA damage repair, and tissue regeneration(21). On the other hand, overexpression of FOXM1 has been detected in a broad range of cancer types such as AML(22), prostate cancer(23), breast cancer(24) and lung carcinoma(25), suggesting that FOXM1 is also essential for tumorigenesis. Similar to the other cancers, up-regulated FOXM1 in hepatocellular carcinomas performed an indicator of poor survival and metastasis(26-28). Wei et al demonstrated that the elevation of P53 and down-regulation of FoxM1 in the sorafenib-treatment tissues(29), so the mechanism of P53 negative regulation might explain the overexpression of FOXM1 in HCC(30, 31).
Based on the 2 hub genes, we established a risk score system which can divided the HCC patients to high-risk subgroup and low-risk subgroup. These high-risk patients exhibited signi cantly shorter survival than those in the low-risk group. The present studies have indicated that TNM stage and vascular invasion are both simple but useful predicators of survival in HCC. Consequently, we combined these traditional clinical factors with molecular pro ling. A two-genes clinical-related nomogram was established to quantify the individual's probability of OS. The predictive performance of our nomogram was more accurate than those of the traditional TNM stage or vascular invasion alone. And the nomogram's predictive ability was high both in the primary and entire validation, which suggests its high application potential in clinical practice.
Meanwhile, our predicted model in the present study still has some potential limitations: First of all, we excluded participants with incomplete records for complete-case analysis to build the model, which may introduce selection bias and small sample cohorts. However, we used multiple imputations to replace missing values, and the results proved that our study participants after multiple imputations could well represent the overall population. Therefore, in the future, we can consider designing our studies or cooperating with other researchers to collect as many variables as possible to reduce missing values. Secondly, the database of TCGA lacks enough tumor recurrence information, so the accuracy of predicting the prognosis and survival possible in the HCC recurrence patients need further veri cation. thirdly, our used data came from an open-access published database, so our study design was retrospective. Therefore, prospective clinical studies are needed to validate our ndings and to determine whether our nomogram improves patients' satisfaction and outcomes.

Conclusion
In summary, the genes of CKAP2 and FOXM1 were identi ed as the clinical-related genes in the progression of HCC by bioinformatics analysis. We determined the altered RNA expression patterns of HCC patients and identi ed a clinical-related genes signature that could e ciently divide patients into different risk groups. Importantly, by combining this signature with conventional clinical risk factors (TNM stage and vascular invasion), we developed a clinical-related genes signature-based nomogram that could accurately predict the one-year, three-year and ve-year OS of HCC patients. Furthermore, the prognostic performance of the nomogram was superior to those of the conventional TNM stage or vascular invasion. Therefore, we have provided a reliable, user-friendly prognostic nomogram to aid the individualized management of HCC patients.

Declarations
Ethics approval and consent to participate: Not applicable Consent for publication: Not applicable Availability of data and material The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request. https://portal.gdc.cancer.gov/; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi? acc= [GSE46408, GSE62232 and GSE74656].

Declaration of interest statement
19. Gao Y, Liu X, Li T, et al: Cross-validation of genes potentially associated with overall survival and drug resistance in ovarian cancer.   Figure 1 Overview of article design ideas and operating procedures          Pan-cancer survival analysis and Kaplan-Meier plot of CKAP2 and FOXM1 were validated in the public database. (A) the pan-cancer survival heatmap of CKAP2 and FOXM1 in the GEPIA database. Risk factor (HR>1) was colored as red, while protective factor (HR<1) as blue. Red frame represents the statistically difference. Kaplan-Meier plot of CKAP2 and FOXM1 between the high/low expression subgroup in the GEPIA database (B) and UALCAN database (C