Revelation of critical genes and pathways involved in hepatocellular carcinoma (HCC) and phylogenetic tree construction of CDK1 protein family An integrated multi-omics bioinformatics analysis

Hepatocellular carcinoma (HCC) is one of the leading and fastest increasing causes of death in the world. It primarily happens in the people who have been exposed to high-risk factors for a long time, such as the hepatitis B virus, hepatitis C virus, etc. Some similar analysis on HCC were published years ago, and results need updating. In this study. We analyzed the gene expression pattern obtained from the GEO database in healthy tissue and tumor tissue and identied differentially expressed genes (DEGs), then GO and KEGG pathway analyses were performed subsequently on DEGs. We also used the protein-protein interaction (PPI) network drawn from STRING to nd hub genes and analyzed their effects on prognosis and expression level in different stages of HCC from GEPIA. Furthermore, we analyzed the CDK1 in different species and established the phylogenetic tree of the CDK1 family by the MEGA-X (version 5) according to the amino acid sequence of CDK1 in different species.


Results
Among all DEGs identi ed in this study, the CDK1 is the most statically signi cant hub gene in the perspective of prognosis and differential ex pression. The expression level of CDK1 varies in different stages of HCC. The expression level reaches the highest in stage two and three and reaches the lowest in Stage four. And the human CDK1 gene and ape-like CDK1 gene subtype 5 may be homologous genes.
However, human CDK1 gene and the gene encoding human CDK1 protein A chain may be heterozygous genes.

Conclusion
CDK1 is a gene with great signi cance in the diagnosis and prognosis of HCC, more studies that investigate deeply about the CDK1 is needed in the future.

Background
Hepatocellular carcinoma(HCC) is one of the prominent and growing causes in the United States and worldwide, especially in people previously infected with hepatitis B virus, hepatitis C virus and hepatitis D virus, etc. (1,2) There are some novel diagnostic and treatment targets have been identi ed and broadly used, such as histone deacetylase inhibitors (HDACi), chimeric antigen receptor T cells (CAR-T cells) and multi-targeted tyrosine kinases (TKIs) (3,4). In this study, we aimed to discover new targets or ideas for the diagnosis and prognosis for HCC based on the gene expression level data in healthy samples and tumor samples and provide a unique aspect for understanding the underlying mechanism of carcinogenesis and prognosis of HCC.
They are GSE60502 (14) and GSE36376 (15,16). The DEGs that mentioned in all four datasets or any three datasets were considered as co-DEGs, which are shown in Table S1. The validation sets showed that the co-DEGs identi ed had high con dence and credibility to be authentic.
The co-DEGs identi ed were updated to STRING (17) for the construction of the PPI network. We used the MCODE and cytohubba (version 0.0.1) applets of Cytoscape software (version 3.7.2) (18) to modify the whole PPI network and divided the entire network into statically and medically signi cant modules for further analysis. And nd hub genes respectively within the whole PPI network and the module 1.
We selected all co-DEGs, all genes involved in three modules of PPI network, and the top ten hub genes of the whole network for further GO (19) and KEGG (20) pathway analysis on DAVID (21). The results revealed that all co-DEGs were mainly enriched in the epoxygenase P450 pathway, organelle membrane, iron ion binding, and metabolic pathways. For the upregulated co-DEGs, they were primarily enriched in epoxygenase P450 pathway, organelle membrane, oxidoreductase activity, and act on paired donors with incorporation or reduction of molecular oxygen and metabolic pathways. In contrast, the downregulated co-DEGs, the functional enrichment analysis results were cell division, spindle, protein binding, and cell cycle. For module 1, the genes involved in this were mainly enriched in cell division, nucleus, protein binding, and cell cycle. For module 2, the genes involved in this were primarily enriched in epoxygenase P450 pathway, organelle membrane, arachidonic acid epoxygenase activity, and drug metabolismcytochrome P450. For module 3, the genes involved in this were primarily enriched in xenobiotic metabolic process, organelle membrane, oxygen binding, and metabolic pathways. For the top 10 hub genes of the whole network, they were primarily enriched in cell division, nucleoplasm, cyclin-dependent protein serine/threonine kinase activity, and p53 signaling pathway.
After the functional enrichment analysis, we focused on the clinical signi cance, which is the effect on the prognosis of hub genes. We used GEPIA (22,23) to draw the percentage of overall survival of patients with a speci c hub gene differentially expressed, and set the Log-rank P-value as a criteria for determining the static correlation between the expression of this hub gene and the percentage of overall survival of patients. We found that among the ten hub genes we identi ed, CCNB2 is the only gene with a Log-rank P-value more than 0.05, which indicates that this gene has weak or no correlation with the prognosis of HCC. CDK1 and CCNB1 have the smallest Log-rank P-value, which were 0.00017 and 0.00015, respectively, indicating that both of these two genes have a strong correlation with the prognosis of HCC.
We validated our results with another two pro le datasets searched form the GEO database, and we rstly used R (version 3.6.0) (24) to generate 25 random numbers for choosing the genes for validation.
To determine which one has more considerable medical signi cance for further investigation, we collected the expression data of each gene from the matrix of the four pro le datasets used for data extraction and analysis. We plotted the forest plot (25,26) (Fig. 1) for the expression level of each gene with a xed-effect model. Based on the heterogenicity and the overall mean difference, we chose CDK1 for further analysis.
We rstly systemically searched the sequence of CDK1 in humans from the uniport (27)(28)(29) and conducted a systemic position-speci c iterated (PSI) search for three times on BLAST (30)(31)(32). Then we excluded the redundant sequence and exported the nal results of protein sequence and corresponding species into MEGA-X (version 5) (33,34) for the construction of the phylogenetic tree of the CDK1 family.
In the end, we searched and analyzed the CDK1 expression pattern in different stages of patients with HCC (35,36), which may offer a new understanding of the role of CDK1 in the carcinogenesis and proliferation of HCC.

Microarray platform
The four pro le datasets for data extraction and analysis were based on different platforms.

Data assignment
To increase the credibility of our ndings, we assigned the six datasets into two groups, one is a data extraction and analysis group, which is used for the identi cation of DEGs, and following relevant researches, the other one is used for validation, which is used for validating the results. GSE14520, GSE101685, and GSE 64041 were assigned into the data extraction and analysis group; GSE60502 and GSE36376 were assigned into the validation group.
Expression analysis of DEGs.
We used the GEO2R function on the website of the GEO database to sort the data into two groups, one is consist of the liver tissue extracted from healthy people; the other one contains the tissue obtained from patients diagnosed as the HCC, and individually analyzed each pro le dataset with all the relevant sets default and exported to Excel. For all four pro le datasets, a gene with |log2-fold change|>1.5 and adj. P < 0.05 were considered as a DEG.
GO and KEGG pathway enrichment analysis.
The Database for Annotation Visualization and Integrated Discovery (DAVID) is a website that provides a wide range of functional annotation tools to investigate the effect of genes in the term of biological process, molecular function, cell components, etc. Identi ed DEGs were investigated further using DAVID (version 6.8), GO, and KEGG pathway enrichment analyses. While analyzing all genes in the whole network, all upregulated co-DEGs, all downregulated co-DEGs and genes involved in the three modules identi ed from the PPI network, P < 0.05 and gene counts of > 10 indicates a statistically signi cant difference in the functional enrichment analysis; For the top 10 hub genes, P < 0.05 and gene counts of > 5 indicates a statistically signi cant difference in the functional enrichment analysis.
Integration of the PPI network.
The protein-protein interaction network (PPI) of all proteins expressed by the DEGs was mapped by the STRING (https://string-db.org/; version 11.0) to evaluate the interactions between the DEGs. The Cytoscape plugin Molecular Complex Detection (MCODE; version 1.6) was used to detect the medically and statistically signi cant regions within the whole PPI network, which is called speci c modules.
Interactions with medium con dence (a combined score > 0.400) were de ned as statistically signi cant. Cytoscape software (version 3.7.2) was used to visualize and adjust PPI networks. And based on the degree levels in the Cytoscape plugin cytoHubba (version 0.1), the top 10 ranked genes were de ned as hub genes.
Analysis of hub genes on the GEPIA Identi ed ten hub genes performed survival analysis and analysis about expression level in different stages of HCC. And in the overall survival plot, if Log-rank p < 0.05, that means the difference in the gene expression level is strictly relevant to the prognostic survival of HCC. The dotted line in the overall survival plot indicated the 95% con dence interval (95% CI). The violin plot exhibited the expression level of all hub genes in different stages of HCC.

Construction of phylogenetic tree of CDK1 family
We searched the sequence of human CDK1 protein in FASTA format (37,38) and performed a systemic position-speci c iterated (PSI) search for three times on BLAST (43,44). We imported the sequences after the de-redundancy process into MEGA-X for the construction of maximum-likelihood phylogenetic tree (39,40) based on the JJT matrix model and 100 bootstrap replications, all other settings remained the default. This phylogenetic tree indicates the relationship and similarity of different species from the perspective of the CDK1 protein family.

Results
The identi cation of DEGs In this study, 1093 DEGs were identi ed, of which 721 were upregulated, and 372 were downregulated, including 196 co-expressed upregulated genes and 66 co-expressed downregulated genes, (Table S1) which are roughly shown in the Venn diagram (Fig. 2a,b). Volcano plots (41,42) (Fig. 3, a-d) and expression heatmap (Fig. 4, a-d) for all pro le datasets used for data extraction and analysis were constructed. The red lines in the volcano plots represent the threshold values of the identi cation of DEGs.

Functional Enrichment analysis
To identify the pathways which had the most biologically and statistically signi cant involvement with the genes identi ed, upregulated and downregulated DEGs were submitted into DAVID for GO and KEGG pathway analysis. For all co-DEGs, they are mainly concentrated in cyclooxygenase P450 pathway, cell membrane, iron-binding, and metabolic pathway (Fig. 5, a-d). The upregulated co-DEGs are enriched primarily on cyclooxygenase P450 pathway, cell membrane, and oxidoreductase activity, which binds or reduces molecular oxygen and metabolic pathways by acting on paired donors (Fig. 6, a-d). As for the downregulated co-DEGs, the functional enrichment analysis reveals the results that cell division, spindle, protein binding, and cell cycle. (Fig. 7, a-d) For module 1, the genes involved in this were mainly enriched in cell division, nucleus, protein binding, and cell cycle. (Fig. 8, a-d) For module 2, the genes involved in this were primarily enriched in epoxygenase P450 pathway, organelle membrane, arachidonic acid epoxygenase activity, and drug metabolism-cytochrome P450 (Fig. 9, a-d). For module 3, the genes involved in this were primarily enriched in xenobiotic metabolic process, organelle membrane, oxygen binding, and metabolic pathways (Fig. 10, a-d). For the top 10 hub genes of the whole network, they were primarily enriched in cell division, nucleoplasm, cyclin-dependent protein serine/threonine kinase activity, and p53 signaling pathway (Fig. 11, a-d).

PPI network construction and module analysis
Interactions between the identi ed DEGs were elucidated by the PPI network. (Fig. 12, a-c) In total, there were 226 nodes in the network, the average number of neighbors is 13.416, the clustering coe cient is 0.457, and the network heterogenicity is 1.044. The top ten hub genes were TPX2, TTK, RRM2, TOP2A, NCAPG, CCNB2, ZWINT, CCNB1, CDK1, and RFC4. And after analyzing with The Cytoscape plugin Molecular Complex Detection (MCODE; version 1.6) of Cytoscape (3.7.2), we got three statically and medically signi cant modules. We analyzed all three modules and selected the top ve hub genes in module 1. They are TOP2A, CCNB2, TTK, PRC1, and NUSAP1 (Fig. 13a,b).

Prognostic survival analysis
We analyzed the effect of the top ten hub genes on the prognosis of HCC patients (Fig. 14, a-j). CCNB2 is the only gene with a Log-rank P value more the 0.05, which indicates that CCNB2 has almost no correlation with overall survival outcome of HCC. CDK1 and CCNB1 have the smallest Log-rank P-value among these (0.00017 and 0.00015, respectively), which indicates that both of these two genes have a strong correlation with the prognosis of HCC.

Expression level of CDK1 in different stages of HCC
The expression of CDK1 was different in different stages of HCC, with the highest expression level in stage 3 and the lowest in stage 4. The discrete degree of expression level reached the highest in the second stage. (Fig. 15) Phylogenic tree of CDK1 family analysis The phylogenic of the CDK1 protein family was based on the maximum-likelihood among species. From this phylogenic tree, we can easily conclude that human CDK1 gene and ape-like CDK1 gene subtype 5 may be homologous genes. However, human CDK1 gene and the gene encoding human CDK1 protein A chain may be heterozygous genes (Fig. 16).

Discussion
From the result of functional enrichment analysis, we can nd that the epoxygenase P450 pathway is a pathway that is mentioned many times with a high count of genes and correlation in functional enrichment analysis of different genes. Arthur A Spector et al. reported in 2014 (45) that epoxygenase P450 pathway involved a lot in the angiogenesis and vasodilation, which are essential factors for the HCC proliferation, in other words, the proliferation of HCC may be initiated and maintained by this pathway.
In terms of molecular function, protein binding and oxygen-binding are the most prominent ones among all results. HCC is always vascularized by the upregulation of angiogenic factors, such as VEGF (46). And VEGF is a primary target of HIF-1 (47), which means HIF-1 has an angiogenic effect on HCC. In 2017, Chen et, al reported that the HIF-1 also has a positive impact on the proliferation of HCC, such as promotion of glycolysis, enhancement of critical oncogenic factors, etc. (48) In terms of the organelle, the organelle membrane was mentioned many times with a high count of genes and correlation in functional enrichment analysis of different genes. The genes involved in the organelle membrane are all existed on the endoplasmic membrane. The relationship between the endoplasmic membrane and carcinogenesis of HCC may be involved in the ER stress (49), and more studies with high credibility are in need to validate this relationship.
In the hub gene analysis, TOP2A protein is one type of the topoisomerase in our bodies that controls the topological pattern of DNA (50), while the deletion or down-regulation of TOP2A may result in the resistance of topoi inhibitor-chemotherapy (51).
In terms of enhancing the proliferation of HCC, NCAPG shares similar mechanisms with TTK. NCAPG, which is also called non-SMC condensin I complex subunit G, and the overexpression of NCAPG leads to an enhanced proliferation and reduced apoptosis of HCC through PI3K/AKT signaling pathway (52). Liu et al. reported in 2015 that TTK activates the Akt and promotes the proliferation and migration of HCC (53). Ying et al. .in 2018 (54) reported that ZWINT promotes the proliferation of HCC through the regulation of cell-cycle-related proteins such as CDK1, PCNA, etc. The regulation is also mentioned in the top ten hub gene network, which is corresponding to our ndings.
However, there are some inevitable limitations in this study. First, the number of available pro le datasets in the database was not su cient to analyze more DEGs, which may cause the imprecision of the results. So, more microarray data that compare the gene expression level between the tumor sample and the normal sample of HCC is urgently needed. Secondly, The microarray data available now is mostly about the white people, and only a few datasets cover the yellow or black people, which means, on the one hand, the expression level difference caused by the difference of race is unable to be identi ed; on the other hand, the heterogenicity between studies was dramatically huge, which indirectly decreases the credibility of studies included and results.
Due to the limitation of the condition and the outbreak of the COVID-19 virus, we are unable to collect enough biopsy samples and conduct qRT-PCR for validation with our data. We have to use the data from others for validation. And the expression heatmap of the genes chosen for validation in each dataset in also generated. (Fig. 17) The reasons why we chose the chose GSE60502 and GSE36376 for validation are as follows. Firstly, these two datasets contain 469 samples in total, which will signi cantly increase the credibility of the results of validation. Secondly, these two studies were performed in Taiwan and Korea, thus the in uence of unknown factors such as racial, country, economic development can be maximumly deduced. Thirdly, this dataset was collected in recent years. Last but not least, all genes involved in the cross-validation were randomly chosen according to the random number generated by software-all these reasons listed above contributed to the high credibility of the following crossvalidation. Therefore, we can guarantee that this cross-validation will have the same reliability as a real experiment, and we have enough con dence in the results of validation. Conclusion CDK1 is a gene with great signi cance in the diagnosis and prognosis of HCC, more studies that investigate deeply about the CDK1 is needed in the future. The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests.

Funding
This study was supported by The Government of the Xuzhou City. (2018TD011).

Authors' contributions
JZ and WZ designed the study, JZ collected the data, carried out the data analysis, JZ produced the initial draft of the manuscript, WCa contributed to drafting the manuscript, ZC and WCh polished all the gures in the manuscript. The forest plot of CDK1 and CCNB1 expression level