Identication of novel potential biomarkers in hepatocarcinoma cancer; a transcriptome analysis

The goal of this study was to understand possible core genes associated with hepatocellular carcinoma (HCC) pathogenesis and prognosis. Gene Expression Omnibus (GEO) contains datasets of gene expression, miRNA and methylation patterns of diseased and healthy/control patients. GSE62232 Dataset was selected by employing the server GEO. A total of 91 samples were collected, including 81 HCC samples and 10 healthy samples as control. GSE62232 was analyzed through GEO2R, and functional enrichment analysis was performed to extract rational information from a set of DEGs. The protein-protein relationship networking search method was used for extracting interacting genes. MCC method was used to calculate the top 10 genes according to their importance. Hub genes in the network were analyzed using GEPIA to estimate the effect of their differential expression on cancer progression. We identied the top 10 hub genes through Cytohubba plugin. These genes include cell cycle regulatory cyclins and cyclin-dependent proteins CCNA2, CCNB1 and CDK1. The pathogenesis and prognosis of HCC may be directly linked with the aforementioned genes. In this analysis, we found critical genes for HCC that showed recommendations for more diagnostic and predictive biomarker studies that could promote selective molecular therapy for HCC.


Introduction
Cancer, also known as malignancy, is irregular cell growth. More than 100 different types of cancer exist, in which most common are breast, skin, prostate, lungs, colon, and lymphoma [1]. Cancer is present in human as the most considerable public health concern worldwide, and liver cancer adds greatly to the morbidity and mortality in cancer [2]. Liver cancer (hepatocellular carcinoma) is the fourth leading cause of cancer-related death globally, ranking sixth in prevalence [3,4]. Hepatocellular carcinoma (HCC) constitutes about 85-90% of all primary malignant liver tumors. Chronic hepatitis B virus (HBV) infection, hepatitis C virus (HCV), smoking, a atoxin, obesity, chronic liver disease, and type 2 diabetes are the main risk factors [3,5,6]. Of these variables, recurrent liver disease is the primary cause of liver cancer [7].
The prevalence of viral infection in HCC cases varies from developed to developing nations, where HBV re ects 60% in developing nations and 23% in developed nations, while HCV infection is responsible for 23% in emerging nations, and 20% of patients in developed nations [8]. Moreover, the highest incidence of HBV is in sub-Saharan Africa, South-eastern Asia, and East Asia, while HCV is high in the USA, Europe, and Japan [7]. The prevalence of non-alcoholic fatty liver disease (NAFLD) also adversely affects individual health causing an increase in obesity and other metabolic disorders [9]. Around 25-30% of patients having a western lifestyle possess more fats in their liver, 2-5% of which have NAFLD, and 1-2% suffer from non-alcoholic steatohepatitis cirrhosis [10].
The World Health Organization (WHO) estimates that in 2030 over one million people are going to die of liver cancer [11]. The key factor that affects HCC mortality is a poor diagnosis, which results in just 18% survival [12] less than the cancers of the breast (77.1%), renal pelvic (74.8%), and myeloma (52.2%). The high risk of recurrence and metastasis of the HCCs also contribute to a shorter life span and poor survival after hepatectomy [13]. Different variables participate in HCC diagnoses, such as cell proliferation, apoptosis, and genes linked to the mTOR pathway [14]. HCC is on a global increase, but early detection and therapy of HCC remain a concern [7]. In developing countries, the HCC prevalence is growing as a consequence of low levels of health and treatment, with a global rate of liver cancer per 100000 people approximately at 9.3 in 2018 [15], as well as poor prognosis [16].
The diverse factors implicated in liver cancer are cellular tumor antigen p53 (TP53), axin-1 (AXIN1), catenin β-1 (CTNNB1), and telomerase reverse transcriptase (TERT) promoters as well as other primary genes for mutation generation, p53 cell cycle system, WNT/β-catenin, oxidative stress, RAS/RAF/MAPK, and PI3K/AKT/MTOR pathways along with other main primary signaling pathways.  used highly e cient microarray technology to screen molecular indicators across all human cancerous tumors, especially for liver cancer, by using Gene Expression Omnibus (GEO) datasets, The Cancer Genome Atlas (TCGA) RNA-sequence and analyzed with the help of bioinformatics methods [17]. Gene chip technology can also reliably represent the molecular expression pro le and detect genetic variants correlated with HCC in liver cancer studies [18,19]. The data, information, knowledge, and wisdom (DIKW) model is widely used in life, including medicine [20][21][22]. Recently, genome-wide screening has signi cantly improved the knowledge of the genetic context and pathways that lead to HCC [23][24][25][26].
Four core genes and two essential pathways of developing HCC from cirrhosis have been established by GEO dataset using a bioinformatics methodology, including DEG screening and networking of proteinprotein interactions (PPIs) [27]. Zhang et al. screened the genes and pathways associated with HCC development and prevalence through a series of bioinformatics observations, such as DEG recognition, functional enrichment analysis, PPI network and module analysis, and weighted network correlation analysis [19]. Zhou et al. identi ed HCC critical genes and microRNAs through raw data processing by using Gene Ontology (GO), GEO2R, and Kyoto Encyclopaedia of Genes and Genomes (KEGG) pathway enrichment processing and PPI network creation [28]. Established research, nevertheless, have been of small scale, and the lack of molecular prognostic indexes might have restricted the ability to recognize possible important molecular biomarkers and emerging clinical goals.
Applied bioinformatics research with the current genomic evidence offers an in-depth insight into therapeutic resistance and disease progression processes. This study focuses on the expression pro ling of HCC patients compared to healthy ones. GSE62232 dataset (GEO: https://www.ncbi.nlm.nih.gov/geo/) has been chosen. GSE62232 was analysed through GEO2R (https://www.ncbi.nlm.nih.gov/geo/geo2R) [29] to evaluate and recalculate the genes that are differentially expressed in healthy and diseased samples. However, new diagnostic and prognostic biomarkers are needed to improving HCC diagnosis and treatment.

Data Collection and Expression Pro ling of DEGs
GEO contains datasets of gene expression, miRNA and methylation patterns of diseased and healthy/control patients [30]. This study was focused on the expression pro ling of HCC patients in comparison to healthy ones. A total of 91 samples were collected, including 81 HCC samples and 10 healthy samples as control [31]. GSE62232 was analysed through GEO2R [29] to evaluate and recalculate genes expressed uniquely in healthy and unhealthy samples. Analysis provided a lot of genes whose expression differed in both samples. To minimize the background noise, statistical lters of p-value and fold-change value were used.

Functional Enrichment Annotation of DEGs
Functional enrichment analysis was performed to extract rational information from a set of DEGs. This provided us with the most prominent pathways being affected by the change of gene expression in diseased samples. The database named Database for Annotation, Visualization, and Integrated Discovery (DAVID; http://david.ncifcrf.gov) has been used to perform functional enrichment analysis. DAVID annotates the provided set of genes into the most affected processes such as Biological Processes (BP), Molecular Functions (MF), and Cellular Components (CC) [32]. We used GO and KEGG datasets in DAVID to annotate the most affected processes. KEGG is a database of molecular pathways containing detailed information about functional and biological systems [33]. The p-value ltering ≤ 0.05 was used to exclude the results with a low con dence level.

Protein-Protein Interaction Network Backbone Analysis
Search Tool for the Retrieval of Interacting Genes (STRING; http://string-db. org) has been used to establish PPIs Network [34]. A con dence score of ≥ 0.07 was used to scrutinize statistically signi cant results. The PPIs network was visualized using Cytoscape (open-source platform), in which the most considerable module with a Molecular Complex Detection (MCODE) score of > 5 and node score-cut-off = 0.02 was scanned with the Cytoscape plug-in MCODE.

Hub Gene Identi cation, Expression and Survival Analysis
Cytohubba is a Cytoscape module that has been used in the PPI network for identifying primary hub genes. MCC method was used to calculate the top 10 genes according to their importance in the system because it is comparatively recent and is highly recommended [35]. These top 10 genes having the highest MCC score were considered as hub genes. However, the Gene Expression Pro ling Interactive Analysis (GEPIA) server provided a correlation of gene expression and their effect on the survival chances in speci c cancer types [36]. Hub genes were analysed using GEPIA to calculate the impact of their differential expression on cancer progression.

Functional Annotation of MCODE Cluster Genes
ClueGO plugin was utilized to functionally annotate the top 3 MCODE Cluster genes into the biological process. It classi ed gene products into crucial biological processes using GO datasets as the reference [37].

DEGs Identi cation
Expression pro le of genes for GSE62232, titled in NCBI "large-scale gene expression pro ling of 81 hepatocellular carcinomas" was obtained from GEO dataset, which was generated on the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array). After getting data of 91 liver samples (81 HCC patients and 10 control) of GSE62232 study, the data was analysed through GEO2R to identify the DEGs of both upregulated and downregulated genes using a p-value of < 0.05 and log FC > 1.5 as selection criteria. Overall, 598 genes out of a total 19982 were differentially expressed in HCC samples with 233 upregulated and 365 downregulated genes (Supplementary le 1). The DEGs are represented by a volcano plot constructed using Prism (www.graphpad.com/scienti c-software/prism) for a better graphical representation of overall genes (Fig. 1).

Enrichment Analysis of DEGs
In order to explore GO terms and cellular mechanisms affected by these DEGs, both over and downregulated genes were imported into an online DAVID server to conduct the annotation process. The annotated GO terms were divided into MF, BP, CC ontologies (p-value < 0.05, FDR < 0.05). The GO BP analysis revealed that the majority of DEGs were enriched into oxidation-reduction (GO:0055114), metabolic (steroids, drug, xenobiotic) and catabolic (exogenous drug, tryptophan), cell division  Table 1).The KEGG signaling pathways examination of DEGs by applying the lter of p value < 0.05 and FDR < 0.05 sorted mainly metabolic pathways such as retinol metabolism (hsa00830), biosynthesis and metabolism of amino acids (hsa01230), antibiotics (hsa01130), and drug metabolism-cytochrome P450 (hsa00982) ( Table 2). Some additional signaling pathways i.e., chemical carcinogenesis (hsa05204), cell cycle (hsa04110), and p53 signaling pathway (hsa04115) were also markedly enriched by DEGs. The complete list of GO terms and KEGG pathways are enlisted in supplementary le 2.

STRING PPI Network analysis and Interrelation between pathways
For better understanding the role of DEGs in HCC development, we constructed co-expression protein networks. The insertion of DEGs list into STRING and application of con dence score of > 0.70 established a PPI network contained 1715 edges and 322 nodes, with each node connected with 11.6 other proteins on average. Different topological parameters for PPI network included network density of 0.040, network heterogeneity 1.335, network centralization 0.180, clustering coe cient 0.523, and characteristic path length 4.689. A complete interaction network is shown in Fig. 2A, in which degree and betweenness topological features were calculated to distribute genes into different size circles and colors. The higher the value of these quantitative terms, the greater the importance in the network [38]. Through this PPI network, top clusters were sorted using Cytoscape plugin MCODE with score > 4, nodes > 4. Cluster 1 was the densest interaction network showing MCODE score of 40.53, followed by cluster 2 and cluster 3 with scores 10.72 and 8.0, respectively (Fig. 2B). Subsequently, several crucial hub genes including cyclin-dependent kinase (CDK1), cyclins (CCNA2, CCNB1, CCNB2), serine/threonine-protein kinase (BUB1), NDC80, BUB1B, NCAPG, MAD2L1, and CDC20 were determined by Cytohubba plugin (Fig. 2B). These selected hub genes were either involved in the cell cycle and its regulation or the main components of kinetochore-microtubule interaction.Protein clusters were processed through ClueGO plugin of Cytoscape which suggested that cluster genes were mostly associated with nuclear and cellular division during meiosis and mitosis, chromosome reorganization, deoxyribonucleotide biosynthetic processes, and regulation of ubiquitin-protein ligase activity (Fig. 3).

Survival Analysis through GEPIA and Expression Level of Hub Genes
GEPIA servers provide a platform for integrated information of gene expression from TCGA and GTEx databases regarding multiple cancer types [36]. To evaluate the overall prognostic importance of hub genes in this study, Kaplan-Meier survival analysis was performed to examine the association between different expression levels of genes and survival time of patients with HCC. The logrank p value was estimated, where p value smaller than 0.05 indicates a statistically signi cant difference in mortality between the high-level and low-level groups. These high and low-level groups of patients were separated based on the median level of expression. The high expression level of CCNB2 (logrank p = 0.052) and NDC80 (logrank p = 0.013) demonstrated poor prognosis, while HCC patients showing high level of BUB1 (logrank p = 0.001), CDK1 (logrank p = 0.00017), NCAPG (logrank p = 0.00097), BUB1B (logrank p = 0.0028), CCNB1 (logrank p = 0.00015), CDC20 (logrank p = 3.8e-06) and MAD2L1 (logrank p = 0.0047) had a higher risk of mortality (Fig. 4). Furthermore, GEPIA boxplot representation of hub genes expression in 369 HCC patients and 160 normal/healthy persons described considerable increase in the level for all hub genes (Fig. 5). These hub genes with high expression and low survival rate in patients suggested their association with the pathophysiology of liver cancer to varying extents and could be potential biomarkers for HCC prognosis to monitor the severity of liver cancer or a therapeutic target.

Discussion
Over recent years, tumor initiation and progression in HCC have been extensively researched; still, early diagnosis of HCC is truly a big challenge, because the exact mechanism of HCC inducement needs a full understanding. Also, incidence and cancer-speci c mortality worldwide is increasing due to limited therapeutic strategies; therefore, there is an urgent need to identify the potential key genes and mechanisms to precisely predict the HCC onset, progression and to develop novel therapeutic agents.
Bioinformatics has become increasingly popular to evaluate changes in the gene expression pro le during the initiation and progression of diseases by the integrated microarray analysis, investigating the screened hub genes for cancer diagnosis and therapy.
In this study, we identi ed a total of 598 DGEs comprising of 233 upregulated and 365 downregulated genes between HCC patients in comparison to healthy ones chosen from the GSE62232 expression pro le datasets. Using the DAVID software, GO functional, and KEGG pathway enrichment analyses of the DEGs were performed. The results revealed that the identi ed DEGs were closely related to various biological processes and pathways such as metabolic, heme binding, drug detoxi cation, cell cycle, meiosis, and mitosis, etc. We had also screened out ten hub genes, including cell cycle regulatory cyclins and cyclin-dependent proteins CCNA2, CCNB1and CDK1. Besides, a PPI network was constructed to analyse the interactional relationships between the DGEs, and survival analyses of hub genes were performed using the GEPIA. These hub genes have been extensively researched in recent years.
Recent genetic evidence has revealed interphase cyclin-dependent kinases (CDKs) are essential for the proliferation of tumor cells. CDK1 belongs to the CDK family, a member of the Ser/Thr protein kinases necessary for cell-cycle progression, triggering cell cycle transitions, namely G1/S and G2/M [39]. Clinical implications of deregulated CDK1 are closely related to HCC tumorigenesis. Research has found that overexpression of CDK1 in HCC is signi cantly negatively correlated with HCC patients' survival. Zhang et al. [40] suggested that miR-582-5p regulated the progression of HCC by directly targeting the CDK1. Elevated levels of CDK1 was also shown to be directly associated with advanced stage portal vein invasion, increased AFP levels, and poor patient survival in HCC [41]. Wu et al. [39] revealed higher levels of CDK1 in HCC patients than healthy individuals, which was in agreement with the present study's ndings.
CCNA2, CCNB1, and CCNB2 genes are all members of the highly conserved cyclin family. In this study, high expression of CCNA2 was closely associated with poor prognosis in HCC patients. CCNA2 protein functions as regulators of the cell cycle by activating cyclin-dependent kinases, and CCNA2 expression in the cell cycle was driven mostly by E2Fs [42]. CCNA2 overexpression has been observed in several types of cancers; also, a study has demonstrated that inhibition of CCNA2 led to the suppression of HCC, cell proliferation, and tumorigenesis [43]. The aberrant expression of CCNA2 is related to reduced survival in patients with HCC and breast cancer [42,44]. Chai et al. [45] revealed that CCNB1 is highly expressed and associated with the unfavorable prognosis for patients with HCC, consistent with our ndings. This suggested the plausibility of CCNB1 as a potential therapeutic target for the treatment of HCC. This conclusion is further proved by a study that CCNB1 knockdown by miR-144 inhibited HCC cell migration, invasion, and proliferation [46]. Previous research [47] has shown that CCNB1 inhibits the growth of cells by inducing cell cycle arrest at the G2/M phase suggesting CCNB1 may be an effective anti-cancer agent in future therapy. Furthermore, CCNB2 was found to be overexpressed in several malignant tumors, and high expression of CCNB2 is associated with poor prognosis in HCC and invasive breast carcinoma [48,49].
In the case of BUB1, several studies have found high expression of the BUB1 in a variety of human tumors. Serine/threonine-protein kinase BUB1 binds centromeres during mitosis [50], and has been involved in apoptosis and cell cycle [51], as well as in reducing the overall survival (OS) rate of HCC patients. BUB1B was reported to be involved in tumor cell cycle regulation, and overexpression of BUB1B is related to the progression and recurrence of HCC [52]. NDC80 is highly conserved; a core component of the kinetochore-microtubule interaction machinery which is identi ed as a requirement for proper chromosome segregation. NDC80 has also been associated with the HCC progression [53], and NDC80knockdown in pancreatic cancer inhibited cell cycle and cell proliferation [54]. NCAPG organizes the coiling topology of individual chromatids, Liu et al. [55] demonstrated NCAPG functioning as an oncogene in the development of HCC.
CDC20 is an essential cell-cycle regulator, which plays an important role in promoting the onset of anaphase and mitotic exit. The increased expression levels of CDC20 have been linked with the development and progression of HCC [56]. Additionally, research has shown that silencing CDC20 and HPSE expression activated cell apoptosis; thus, targeting inhibition of both CDC20 and HPSE expression is an ideal therapeutic option of HCC [57]. As for MAD2L1, Yun et al. [58] demonstrated that MiR-200c-5p inhibits HCC cell proliferation, migration, and invasion by targeted suppression of MAD2L1, suggesting that the high expression levels of MAD2L1 are associated with poor prognosis of patients with HCC. Moreover, MAD2L1 may potentially be used as a prognostic and therapeutic target in HCC patients.

Conclusion
In summary, the purpose of the present study purpose was to screen and verify hub genes that may provide new insights into the development, prognosis, and treatment of HCC. In total, 598 DEGs were identi ed via integrated bioinformatics analysis, of which ten were identi ed as hub genes that may be used as biomarkers for the diagnosis and prognostic evaluation of HCC. However, because the results of this study were based on data analysis only, further experimental veri cation via animal experiments and clinical trials are required to con rm these results.

Declarations Funding/acknowledgement
We are thankful for the support of the Innovation Foundation of Radiation Application, China, as they have funded this project under the project number (KFZC2018040208).  A) Using the online STRING tool, a PPI network was developed which was further visualized by Cytoscape software. The size and color map nodes are determined by the degree value, which renders a gradual setting in small size with low degree in blue, large size with a high degree in yellow. B) Top clusters determined by MCODE with score > 4 and node > 4. The top 10 genes derived from the MMC method were chosen using the CytoHubba plugin  Kaplan-Meier survival analysis curve of the hub genes expressed in the HCC. The survival curves were plotted using the GEPIA1 web server. The gene candidates with high expression in the cohorts are shown in red, and the blue line indicates the low-expression cohort, the survival curve is represented in a dotted line, whereas the solid line is the 95% con dence interval. The logrank p value represents the overall signi cance of analysis, and HR stands for hazard ration; patient number (n) = 182.