A novel five-gene signature for predicting prognosis in liver cancer

Purpose Liver cancer is one of the most common malignant tumors in China and ranks fifth among common malignant tumors worldwide; it remains difficult to diagnose early and to treat effectively. Identifying indicators that predict prognosis is therefore imperative for the treatment of liver cancer. Methods Liver cancer data were obtained from The Cancer Genome Atlas (TCGA). Differentially expressed genes (DEGs) were identified from the TCGA data using R software. Risk scores, computed from gene-expression levels weighted by Cox regression coefficients, were used to predict the prognosis of patients with liver cancer. Pathway enrichment analysis of the DEGs was performed using the KEGG and GO databases. Receiver-operating characteristic (ROC) curves and the area under the curve (AUC) were used to assess the validity and prognostic value of the model in liver cancer. Results In total, 1897 DEGs of transcriptome genes in liver cancer and 1197 DEGs of clinical data were extracted from the TCGA database. We identified a novel five-gene signature associated with liver cancer, comprising CDCA8, NR0B1, GAGE2A, AC018641.1, and SPANXC. Among them, CDCA8 and NR0B1 were negatively related to 5-year overall survival (OS), indicating a worse prognosis (P < 0.05). From the clinical data analysis, we also found that GAGE2A is related to lymphatic metastasis in liver cancer. The ROC curve was used to assess the accuracy and sensitivity of the gene signature. A heat map presented the expression of each of the five genes across patients together with the distribution of the risk score. Conclusions We identified a novel five-gene signature for the prognosis of patients with liver cancer, which may serve as an effective predictor of patients' prognosis in the future.


Introduction
Liver cancer is one of the most common malignant tumors in China and ranks fifth among common malignant tumors worldwide. More than 620,000 new liver cancer cases occur worldwide, over half of them in China, and 80% of these cases are hepatocellular carcinoma (HCC) [1]. Viral hepatitis, cirrhosis, chemical carcinogens such as the aflatoxins produced by Aspergillus flavus, and water and environmental pollution are the main causes of liver cancer [2]. The Barcelona Clinic Liver Cancer (BCLC) staging system, a commonly used staging system for HCC, classifies patients according to tumor stage, liver function, and cancer-related symptoms [3]. Under the BCLC staging system, specific treatment options are available for the different stages of HCC [4]. There are multiple therapeutic methods for HCC, including surgical options (resection or liver transplantation (LT)), ablative electrochemical therapies (e.g. radiofrequency ablation or ethanol injection), and stereotactic body radiotherapy (SBRT) [5]. Despite the development of surveillance systems, the overall 5-year survival rate is less than 20%. Risk factors including late initial diagnosis, a decline in body function induced by ascites secondary to liver cirrhosis, recurrence, and lymphovascular involvement limit the availability of clinical treatment and lead to a poor prognosis [6].
Many prognostic systems have been proposed, but none is widely suitable for predicting prognosis. Some studies do not take major prognostic predictors into account. Identifying novel targets or predictors through molecular-level analysis may point to new therapeutic methods and improve outcome prediction.

Methods
Data extraction, quality assessment
Liver cancer data (transcriptome and clinical) were downloaded from the TCGA database. Perl scripts were used to reorganize the data. Patients with missing data, including unclear clinical stage or pathological grade, were excluded. We compared 407 liver cancer cases with 58 normal controls and calculated the fold change of each gene between the two groups using the edgeR package in the R language.

Data screened for differentially expressed genes (DEGs) and heat map
Subsequently, using the mRNA data from liver cancer in comparison with matched normal tissues, we identified DEGs (fold change = 3, P < 0.05) using R (version 3.5.2). The heat map was constructed from the reduced data (407 liver cancer cases vs. 58 normal controls). A second heat map showed the expression of the five signature genes across patients arranged by risk score. High and low gene expression levels were represented by dark and light colors, respectively. The gplots package was used to draw the heat maps.
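The screening criteria described above can be illustrated with a minimal Python sketch. It is not the edgeR workflow used in this study: edgeR fits a negative binomial model to count data, whereas the sketch only applies the two stated cut-offs to per-gene summaries that are assumed to have been computed already. The gene names, mean expression values, and P values are hypothetical, and reading "fold change = 3" as "at least 3-fold in either direction" is an assumption.

```python
import math

# Hypothetical per-gene summaries: (gene, mean expression in tumor,
# mean expression in normal tissue, P value from some differential test).
GENES = [
    ("geneA", 120.0, 30.0, 0.001),  # 4-fold higher in tumor, significant
    ("geneB", 10.0, 45.0, 0.020),   # 4.5-fold lower in tumor, significant
    ("geneC", 50.0, 40.0, 0.300),   # small change, not significant
    ("geneD", 90.0, 20.0, 0.200),   # large change, but not significant
]

FOLD_CHANGE_CUTOFF = 3.0  # assumed reading of "fold change = 3"
P_CUTOFF = 0.05


def screen_degs(genes, fc_cutoff=FOLD_CHANGE_CUTOFF, p_cutoff=P_CUTOFF):
    """Return (gene, log2 fold change, direction) for genes passing both cut-offs."""
    log_cutoff = math.log2(fc_cutoff)
    selected = []
    for name, tumor, normal, p_value in genes:
        log_fc = math.log2(tumor / normal)
        # Keep a gene only if the change is large enough AND significant.
        if abs(log_fc) >= log_cutoff and p_value < p_cutoff:
            direction = "up" if log_fc > 0 else "down"
            selected.append((name, log_fc, direction))
    return selected


degs = screen_degs(GENES)
for name, log_fc, direction in degs:
    print(f"{name}\tlog2FC={log_fc:.2f}\t{direction}")
```

Working on the log2 scale makes up- and down-regulation symmetric, so a single absolute-value threshold covers both directions.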

Gene enrichment
For Gene Ontology (GO) enrichment, we used the Database for Annotation, Visualization, and Integrated Discovery (DAVID; http://david.abcc.ncifcrf.gov/) to perform the related functional analysis.
Three GO categories [biological process (BP), cellular component (CC) and molecular function (MF)] were used to characterize the enrichment of the selected genes. In addition, Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis was used to identify the pathways in which the DEGs are involved (P < 0.05, counts = 1544). Cytoscape (version 3.7.0; http://www.cytoscape.org/cy3.html) and R (version 3.5.2) were used to display the pathways more intuitively, with up- and down-regulated genes indicated.
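The idea behind this kind of enrichment analysis can be sketched as a plain hypergeometric over-representation test: given a background of annotated genes, how likely is it to see at least the observed number of pathway genes in the DEG list by chance? This is only an illustration of the general principle. DAVID reports its own modified Fisher exact statistic (the EASE score), so its P values would differ from the ones computed here. The function name and all counts in the example, apart from the 1897 DEGs, are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(background, in_pathway, selected, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    background: number of annotated genes in the background set
    in_pathway: number of background genes that belong to the pathway
    selected:   number of genes in the list being tested (e.g. the DEGs)
    overlap:    number of selected genes that belong to the pathway
    """
    upper = min(in_pathway, selected)
    total = comb(background, selected)
    # Sum the probabilities of every outcome at least as extreme as observed.
    tail = sum(
        comb(in_pathway, k) * comb(background - in_pathway, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20,000 background genes, a pathway of 40 genes,
# 1897 DEGs, 12 of which fall in the pathway.
p_value = hypergeom_enrichment_p(20000, 40, 1897, 12)
print(f"enrichment P = {p_value:.3g}")
```

Exact integer arithmetic with `math.comb` avoids the rounding problems that factorial-based floating-point formulas run into for gene lists of this size.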

Gene functional annotation evaluation
Receiver-operating characteristic (ROC) curve analysis was performed with the pROC package to assess the accuracy and sensitivity of the gene signature. Univariate Cox analysis was used to select the variables related to survival. Based on the multivariate Cox regression analysis, we calculated each patient's risk score as the sum of the expression levels of the signature genes weighted by their regression coefficients: Risk score = $\sum_{i=1}^{5} \beta_{\mathrm{gene}_i} \times \mathrm{Exp}_{\mathrm{gene}_i}$, where $\beta_{\mathrm{gene}_i}$ is the Cox regression coefficient of the $i$-th of the five genes and $\mathrm{Exp}_{\mathrm{gene}_i}$ is the expression level of that gene in a given liver cancer patient. A higher score indicates a higher risk of mortality.
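The risk score formula, and the median split into high- and low-risk groups reported in the results, can be sketched in a few lines of Python. The coefficients and expression values below are hypothetical placeholders chosen for illustration; they are not the fitted values from this study. Assigning patients whose score equals the median to the low-risk group is also an assumption.

```python
from statistics import median

# Hypothetical Cox coefficients (beta) for the five signature genes.
# The fitted values from the study are NOT reproduced here.
COEFFICIENTS = {
    "CDCA8": 0.30,
    "NR0B1": 0.20,
    "GAGE2A": 0.10,
    "AC018641.1": 0.15,
    "SPANXC": 0.05,
}

# Hypothetical expression levels for four patients.
PATIENTS = {
    "P1": {"CDCA8": 2.0, "NR0B1": 1.0, "GAGE2A": 0.0, "AC018641.1": 1.0, "SPANXC": 0.0},
    "P2": {"CDCA8": 5.0, "NR0B1": 3.0, "GAGE2A": 1.0, "AC018641.1": 2.0, "SPANXC": 1.0},
    "P3": {"CDCA8": 1.0, "NR0B1": 0.0, "GAGE2A": 0.0, "AC018641.1": 0.0, "SPANXC": 0.0},
    "P4": {"CDCA8": 3.0, "NR0B1": 2.0, "GAGE2A": 2.0, "AC018641.1": 1.0, "SPANXC": 4.0},
}


def risk_score(expression, coefficients=COEFFICIENTS):
    """Sum of beta_i * Exp_i over the signature genes."""
    return sum(beta * expression[gene] for gene, beta in coefficients.items())


def split_by_median(scores):
    """Label each patient 'high' if above the median score, otherwise 'low'."""
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


scores = {pid: risk_score(expr) for pid, expr in PATIENTS.items()}
groups = split_by_median(scores)
for pid in sorted(scores):
    print(f"{pid}\tscore={scores[pid]:.2f}\t{groups[pid]}")
```

Because the score is a linear combination, a gene with a positive coefficient raises the score as its expression increases, which is why a higher score corresponds to a higher estimated risk.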

Statistical analysis
The edgeR package, which reports the log fold change (logFC), log counts per million (logCPM), P value, and false discovery rate (FDR) for each gene, was used to assess statistical significance in the differential expression analysis (defined as fold change = 3, P < 0.05). Based on the univariate and multivariate Cox regression analyses, the five-gene signature was considered an indicator of prognosis (P < 0.05).

Results
1. Characteristics of the datasets
The transcriptome and clinical data were extracted from the TCGA database and comprised 465 samples from patients with liver cancer. Of these, 407 were tumor tissues and 58 were normal tissues.
The clinical data covered 418 patients in total, 146 women and 272 men, aged 16 to 90 years. Of these, 222 were White, 164 were Asian, 20 were Black or African American, and 12 were from other backgrounds. Among the 418 patients, 147 died, with post-diagnosis survival ranging from 0 to 3675 days (Table 1).

2. Enrichment and visualization of signaling pathways
We identified 1897 DEGs by integrating the differential expression of 465 samples in the TCGA data. Of them, 77 were downregulated genes and 1820 were upregulated genes (Fig. 1). KEGG analysis was used to identify the pathways in which the DEGs are involved. The 1897 DEGs were significantly enriched in 6 signaling pathways (P < 0.05, counts = 149) (Table 2), including hsa00650 (Butanoate metabolism), hsa00380 (Tryptophan metabolism), and hsa00410 (beta-Alanine metabolism) (Fig. 2). Using Cytoscape (version 3.7.0), the pathways were displayed more intuitively with up- and down-regulated genes indicated (Fig. 3). We then performed enrichment of the DEGs according to the GO terms (Table 3). As shown in Fig. 4A and 4B, these genes were strikingly enriched in GO terms from the BP, CC and MF categories. We also plotted the GO signaling pathways using R packages (clusterProfiler, stringi, pathview, and GOplot) (Fig. 5A, B).
We identified a five-gene signature comprising CDCA8, NR0B1, GAGE2A, AC018641.1, and SPANXC. Of these, CDCA8 and NR0B1 were related to a worse prognosis (P < 0.05) (Fig. 6). We also found that GAGE2A is positively related to lymphatic metastasis in liver cancer. We calculated each patient's risk score as the sum of the weighted expression levels of these five genes (Table 4). Patients were divided into high- and low-risk groups by the median risk score. A higher risk score represents a poorer prognosis. These results may help clinicians predict prognosis and develop adaptive treatment strategies for liver cancer patients.

Accuracy of TCGA datasets
The sensitivity and accuracy of the model were assessed with the ROC curve, which gave an AUC of 0.714 (P < 0.05) (Fig. 7A). The calculated risk score was significantly associated with 5-year overall survival (OS) in univariate and multivariate Cox regression analyses (P < 0.05) (Fig. 7B). The heat map based on the risk scores of patients with liver cancer revealed clear differences in the expression levels of these five genes between liver cancer and corresponding normal tissues (Fig. 8).
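To make the AUC figure concrete, the sketch below computes an AUC directly from risk scores and a binary outcome, using the fact that the AUC equals the proportion of (event, non-event) pairs in which the patient with the event has the higher score, with ties counted as one half. The study used the pROC package in R; this Python version is only an illustration, the scores and labels are hypothetical, and treating 5-year survival status as a simple binary label (ignoring censoring) is an assumption.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: risk scores, one per patient
    labels: 1 if the event (e.g. death within 5 years) occurred, else 0
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical risk scores and outcomes for six patients.
example_scores = [0.30, 0.95, 1.85, 2.55, 1.20, 0.50]
example_labels = [0, 0, 1, 1, 0, 1]
print(f"AUC = {auc(example_scores, example_labels):.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported value of 0.714 lies between the two.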

Discussion
To date, liver cancer remains difficult to diagnose early and to treat effectively. Hepatic resection, orthotopic liver transplantation, ablative therapies, chemoembolization and systemic therapies with cytotoxic drugs remain the main treatment strategies for liver cancer [7]. In the past few years, a growing number of studies have shown that the molecular phenotype plays an increasingly important role in the diagnosis and therapeutic response of liver cancer patients [8].
Several datasets of patients' genomes have been built, which allow researchers to identify genomic changes between tumor and matched normal tissues. This is a crucial step toward identifying novel and effective prognostic biomarkers for liver cancer outcomes. Meanwhile, accumulating studies have shown that a variety of biomarkers have emerged as indicators of prognosis through dataset analysis [9][10][11]. In this study, DEGs between tumor and normal tissues were identified through a series of analyses based on the TCGA database. In particular, we obtained a five-gene signature by means of univariate and multivariate Cox regression analyses.
In addition, a risk score was calculated from this signature, which could be a possible marker to predict the prognosis of patients with liver cancer.
Cell division cycle associated 8 (CDCA8) is a member of the chromosomal passenger complex (CPC), which is crucial for transmission of the genome during cell division [12]. Further studies have shown that high expression of CDCA8 is required for the formation and development of tumors [13]. Loss of CDCA8 can lead to defective cell proliferation and early embryonic lethality [14]. This suggests that CDCA8 could be a risk factor in liver cancer because of its negative regulatory role [16]. In addition, NR0B1, which is associated with ERα, PR and AR expression [17], is a positive prognostic factor in node-negative breast cancer, where it is correlated with smaller tumor size, earlier disease stage and increased survival [18]. Given its negative correlation with survival in this study, NR0B1 may be an unfavorable prognostic factor in liver cancer. Sperm Protein Associated With The Nucleus On The X Chromosome C (SPANX-C) is a member of the SPANX family, which is located in a gene cluster on chromosome X. Its overexpression has little influence on primary tumor growth in cancer cells, but may be sufficient to determine an invasive phenotype, including the morphology of the nucleus and the diameter of vessels [19]. GAGE gene products could play an important part in the early stages of embryonic development [22]. AC018641.1, an alias of the gene ENSG00000226468, has been poorly investigated, but with further investigation it could become a possible prognostic marker for cancer detection and therapy.
Future research could investigate the biological characteristics of these five genes to establish their prognostic value in liver cancer. This study has some limitations. First, the five-gene signature derived from dataset analysis in liver cancer is a useful starting point, but it is not exhaustive. Second, more comprehensive data sets from other databases are needed for supplementary validation. In particular, the five-gene signature model needs to be verified in clinical practice, and experiments are needed to confirm our results. Nevertheless, this study indicates that these five genes are related to the prognosis of patients with liver cancer. Although our knowledge of these five genes is limited, these indicators may assist in monitoring cancer progression and in selecting treatment.

Conclusions
Overall, several analyses support the suggestion that the five-gene signature has a meaningful relationship with the prognosis of patients with liver cancer. We hope that this signature will prove useful in the long term as an effective predictor of prognosis for patients with liver cancer.

Ethics approval and consent to participate
This study was approved by the institutional review board (IRB) of the Affiliated Hospital of Zunyi Medical University. The data were obtained from TCGA (https://portal.gdc.cancer.gov/). Informed consent was obtained from all individual participants included in the study.

Consent for publication
Not applicable.

Availability of data and materials
The datasets supporting the results of this article are publicly available at the TCGA (https://portal.gdc.cancer.gov/).

Competing interests
The authors declare that they have no competing interests.

Figure 7
(B) Risk score was significantly correlated with overall survival (OS). Risk score was an independent prognostic factor for OS in multivariate Cox regression analysis (P=0). The median score divided patients into high- and low-risk groups.

Figure 8
Heatmap of the five-gene signature in TCGA data sets. Each column represents a sample and each row represents one of the five genes. The expression levels of the five genes are shown in different colors, from orange to blue with increasing risk