A Novel Five-gene Signature Predicts Overall Survival in Hepatocellular Carcinoma

Background: Hepatocellular carcinoma (HCC) is one of the most common challenges for public health worldwide. Due to its complex molecular and great heterogeneity, the effectiveness of existing HCC risk prediction models is unsatisfactory. Hence, more accurate prognostic models are pressingly needed. Materials and methods: Differentially expressed mRNAs (DEMs) between HCC and normal tissues were identified after downloading GSE1450 from gene omnibus (GEO) database. We randomly divided all patients into training and testing sets. Univariate Cox regression, lasso Cox regression and multivariable Cox regression analysis were used to constructed the prognostic gene signature in training set. Our study utilized Kaplan-Meier plot, time-dependent receiver operating characteristic (ROC), multivariable Cox regression analysis with clinical information, nomogram and decision curve analysis (DCA) to evaluate the predictive ability for overall survival of the novel gene signature in training, testing and whole sets. We also validated the prognostic capacity of the five-gene signature in an external validation set. The information of mutation of each gene was explored on cBioPortal online website. We performed gene set enrichment analysis (GSEA) to explore underlying mechanisms in the high and low risk group. Finally, quantitative real-time PCR was conducted to validate the expression tendency between 12 paired HCC and adjacent normal tissues. Results: Our study constructed a novel five-gene signature ( CNIH4 , SOX4 , SPP1 , SORBS2 and CCL19 ) for predicting overall survival of HCC. Time-dependent ROC curve indicated admirable ability in survival prediction in two datasets. Multivariable Cox regression analysis indicated that both this five-gene signature and TNM stage were two independent prognostic factors for overall survival of HCC patients. Combined with TNM stage clinical pathological parameters, the predictive capacity of nomogram had a decent improvement. The mutation of the five genes had no obvious variation. Plenty pathways were enriched by GSEA, including cell cycle and various metabolism. Furthermore, the mRNA levels of these five genes had signiﬁcantly diﬀerent expressions between HCC tissues and adjacent normal tissues by quantitative real-time PCR. Conclusions: A five-gene prognostic model and nomogram were constructed and validated for predicting prognostic of HCC patients. And the five-gene risk score with TNM stage models might help various HCC patients to customize individual therapies.


Background
Hepatocellular carcinoma (HCC) is the third leading cause of cancer-related death, which parallels with gastric cancer and only behind colorectal cancer and lung cancer worldwide [1][2][3]. The major risk factors for HCC are hepatitis B virus (HBV) infection, hepatitis C virus (HCV) infection, cirrhosis, aflatoxin contamination and so on [4].
Since HCC patients are usually asymptomatic at an early stage and most HCC patients likely lose the best treatment period with an advanced stage. Although therapeutic methods for HCC have massively improved in recent decades, the ten year overall survival of HCC patients remains unsatisfactory [5,6]. Conventional clinical parameter, such as TNM stage, histologic grade and portal vein tumor thrombus (PVTT), could help predict the prognosis of HCC patients [7]. Because of the enormous heterogeneity of HCC, the predictive effect of the traditional model is not satisfactory yet. It is vital and urgent to establish a sensitive and reliable prognostic signature for optimizing the clinical treatment decision for individuals.
With the development of genome-sequencing technology, plenty of studies showed that gene signature had huge potential in predicting HCC prognosis, including mRNA, lncRNA and microRNAs [8][9][10][11]. By using public free genomic data, people could identify efficient gene signature to predict prognosis of patients. However, it is limited that individual gene signature discards clinical parameters for predicting overall survival. So, it is necessary for us to identify novel gene signatures combined with clinical information.
In this study, plenty bioinformatics methods, such as differential expression mRNAs (DEMs), univariate Cox regression, least absolute shrinkage and selection operator (LASSO) Cox regression, multivariable Cox regression analysis, Kaplan-Meier plot, time-dependent ROC, decision curve analysis (DCA) and gene set enrichment analysis (GSEA), were used to build and validate a five-gene signature model which could combine with TNM stage to foresee the prognosis of HCC patients. Meanwhile, GSEA was conducted to explore underlying mechanisms in the high and low risk group.
Quantitative real-time PCR was also operated to validate the expression tendency between 12 paired HCC and adjacent normal tissues.

Data source
The GSE14520 dataset containing genes expression and clinical information was downloaded from GEO database in NCBI (https://www.ncbi.nlm.nih.gov/pmc/) [12].
We used the GPL3921 to re-annotate these probes. We visited UCSC Xena online website(http://xena.ucsc.edu/public/) for acquiring The Cancer Genome Atlas-Liver Hepatocellular Carcinoma (TCGA-LIHC) validation set which contained gene expression profile, clinical and survival information.

Differential expression mRNAs analysis
To obtain differential expression mRNAs (DEMs) between HCC and non-tumor samples, the "limma" software package in R was used to normalize and analyze the GSE14520 dataset [13]. The |log(FC)| > 1 and p value < 0.05 were considered for next study.

Construction of prognostic gene signature
We eliminated the samples whose overall survival time was less than one month for subsequent operation. With p value < 0.05, we conducted univariate Cox regression analysis to explore DEMs which had significantly correlated with overall survival. By using "caret" package in R, the patients were randomly divided into two groups, training and testing sets. Next, LASSO regression analysis was used for vital prognostic DEMs. We used "glmnet" package to conduct this step [14]. Then, the vital prognostic mRNAs with expression and survival information were devoted into multivariable Cox regression analysis to obtain a prognostic risk formula. The risk score = (CoefficientmRNA1 * expression value of mRNA1) + (CoefficientmRNA2 * expression value of mRNA2) + (CoefficientmRNA3 * expression value of mRNA3) + ⋯ + (CoefficientmRNAn * expression value of mRNAn). We utilized the "surv_cutpoint" function of "survminer" package to explore the optimum cut-off risk score to separate patients into high and low risk groups. The Kaplan-Meier survival curve and log-rank test were used to examine the survival condition of two different groups. To check the predictive ability of the prognostic gene signature, we utilized "timeROC" package to calculate the area under curve (AUC) values of 1-year, 3-year and 5-year [15]. Finally, the coefficients of all prognostic DEMs were used to testing and whole sets.

Independent validation of the prognostic gene signature and expression
In order to examine the prognostic gene signature, we used TCGA-LIHC dataset to validate. The risk score of each patient was acquired by the same formula like before.
Next, we also chose the optional cut-off value to separate patients into high and low risk groups using the "surv_cutpoint" function. The Kaplan-Meier curve and timedependent ROC curve were utilized to examine the ability for predicting overall survival of the risk gene signature in validation dataset. The expression profile of mRNAs in the risk gene signature between tumor and non-tumor tissues in 50 paired tissues was calculated by paired t-test. The p value < 0.05 was considered statistically significant.

Independent prognostic role of the gene signature by multivariate analysis
To identify the capacity of the prognostic gene signature, we put all clinical information [gender, age, alanine transaminase (ALT), tumor size, multinodular, cirrhosis, serum alpha fetoprotein (AFP) and tumor node metastasis (TNM) stage] in GSE14520 into univariate and multivariate analysis. Meanwhile, we also had the same operation in TCGA-LIHC clinical information [gender, age, body mass index (BMI), tumor grade, cancer status and TNM stage]. The p value <0.05 was considered significant.

A predictive nomogram building and validating
Various cancer prognosis has been extensively predicted by Nomogram [16,17]. The nomogram was built to research the probability of 1-year, 3-year and 5-year overall survival in GSE14520 using all independent prognostic factors calculated by multivariate Cox analysis. Meanwhile, the predictive capacity of nomogram was evaluated by calibration and discrimination. We obtained the concordance index (Cindex) to evaluate the discrimination of the nomogram by a bootstrap method with 1000 resamples. Next, the calibration curve was plotted to watch the influence of observation rates in different years on the prediction probability of nomogram. Next, we compared different models and combined models by C-index, time-dependent ROC curve and DCA in R [18]. Subsequently, we conducted the same operation in the independent validation set.

Gene set enrichment analysis
After downloaded The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway in C2 genes sets from gene set enrichment analysis (GSEA) online website (https://www.gsea-msigdb.org/gsea/index.jsp) [19], we performed GSEA to probe the promising function of the established prognostic gene signature in GSE14520 or TCGA-LIHC set. The p value < 0.05 and false discovery rate (FDR) q value < 0.25 were remained.

Tissues collection and quantitative real-time PCR
We collected 12 pairs of HCC and corresponding adjacent tissues from 12 different HCC patients after surgery. The pathological type of these tissues was confirmed by the pathology department in The First People`s Hospital of Jingmen. The tissues of HCC patients used to quantitative real-time PCR were based on the following criteria: (1) patients treated in The First People`s Hospital of Jingmen; (2) none of the patients received any neoadjuvant therapy before surgery. Informed consent was obtained from each patient. The research was performed in accordance with relevant guidelines/regulations. Total RNA was extracted using TRIzol reagent (Invitrogen, Carlsbad, CA, United States). The ratio of absorbance at A260/A280 was nearly 2.0 for reverse transcription using the NanoDrop spectrophotometer (Thermo Scientific Inc.). We used the PrimeScript RT reagent Kit with gDNAEraser (Takara, Tokyo, Japan) to proceed the reverse transcription. After setting up the proper protocol, quantitative real-time PCR was conducted by the CFX Connect Real-Time PCR Detection System (Bio-Rad, United States) with the SYBR Green PCR kit (Toyobo, Osaka, Japan). We used GAPDH as an internal control. The 2 -△△ Ct method was used to analyze the outcome. All the primers of genes were designed by the website PrimerBank (https://pga.mgh.harvard.edu/primerbank/). The sequences and Tm values of all primers were listed in Supplementary Table 1.

Statistical analysis
All of the statistical analysis was performed by R (version 3.6.3). Categorical variables were analyzed using the Pearson χ2 test or Fisher`s exact test; paired tissues were used paired t-test. Wilcoxon test was used to unpaired samples when non-normal distribution. P < 0.05 was considered statistically significant.

DEMs analysis
The overview of research design is showed in Figure 1 Figures2A-B, Figure3A). These results indicated that the five-gene signature had good manifestation for overall survival prediction.

External validation of the prognostic gene signature
To certify the capacity of the five-gene signature for predicting overall survival, we used the same formula to calculate risk score in TCGA-LIHC dataset. After chose appropriate cut-off value 2.88, all patients were split into high risk group (n=97) and low risk group (n=266). The Kaplan-Meier plot showed the same result with three sets.
The AUCs for 1-year, 3-year and 5-year overall survival were respectively 0.728, 0.694 and 0.643 (Figure3B). In summary, the five-gene signature had certain ability to predict overall survival in HCC.

Independent prognostic role of the prognostic gene signature
After we constructed the five-gene signature with high and low risk group, we formed all clinical information of GSE14520, including gender, age, ALT, tumor size, multinodular, cirrhosis, serum AFP and TNM stage into a table. The missing information used "Not Available" instead. Results from clinical studies indicated that higher risk score had significantly relation to advanced age, larger tumor size, cirrhosis, higher AFP and advanced TNM stage. Meanwhile, we conducted the same analysis to TCGA-LIHC clinical information. It showed that higher risk score was extremely correlated with advanced TNM stage and histologic grade (Table1). To confirm the significance of the five-gene signature, we used univariate and multivariate Cox regression analysis to analyze the GSE14520 and TCGA-LIHC datasets. Results showed that both TNM stage and risk prognostic model were the independent prognostic factors for overall survival (Figure 4). Two datasets showed that patients in high risk group had poorer overall survival compared with those in low risk group both in low TNM stage (I+II) and high TNM stage (III+IV) (Figures5A-D).

A predictive nomogram building and validating in GSE14520 and TCGA-LIHC sets
In GSE14520 set, we built the nomogram including prognostic model and TNM stage

Mutation information and gene set enrichment analysis
To find deeper value of the five-gene signature, we searched the cBioPortal online website to explore the mutation information of each gene. The result showed CNIH4 had 6% amplification and SORBS2 had 4% deep deletion and the other genes hardly changed ( Figure 8A). By setting up cut-off p and q value, 37 significant KEGG pathways were enriched in GSE14520 by GSEA. Spliceosome, ribosome, cell cycle and basal transcription factors were enriched in high risk group. Plenty pathways of metabolism were enriched in low risk group, such as histidine metabolism, tyrosine metabolism, butanoate metabolism, fatty acid metabolism and so on ( Figure 8B, Supplementary file3).

External validation in expression
We selected 50 paired samples to validate the expression tendency in TCGA-LIHC dataset. The results showed all of the five genes had significant differential expression between tumor and non-tumor tissues expect SORBS2 (Supplementary Figure 2A). We found risk score had significant differential expression between low and high TNM stage both in GSE14520 and TCGA-LIHC sets. Furthermore, risk score in low and high histologic grade also had significant differential expression (Supplementary Figure 2B).
The mRNA levels of these five genes had significantly different expressions between HCC tissues and adjacent normal tissues by quantitative real-time PCR in 12 paired HCC tissues (Supplementary Figure 2C). In summary, aberrant expression of the five genes was validated in HCC.

Discussion
Hepatocellular carcinoma is still one of the malignant tumors of high mortality worldwide [20]. Due to complicated pathogenesis, it is extremely hard to have a satisfying prediction model for overall survival in HCC. Traditional clinical indicators such as TNM stage, histologic grade and portal vein tumor thrombus (PVTT) could partly predict prognosis of HCC patients. Nevertheless, because of the enormous heterogeneity of HCC, it is necessary for people to find novel prognostic biomarkers and build more precise prognostic models. Compared with single biomarker, a system of multiple prognostic models could enhance predictive efficacy.
In this study, through mining the GSE14520 and validating in TCGA-LIHC dataset, we constructed a novel five-gene signature (CNIH4, SOX4, SPP1, SORBS2 and CCL19) for prognosis prediction of HCC patients. The predictive capacity for overall survival of the five-gene signature performed well both in GSE14520 training, testing, whole set and TCGA-LIHC dataset. By univariate and multivariate analysis with clinical information in two datasets, the TNM stage and five-gene signature were two independent prognostic factors for overall survival in HCC. Meanwhile, patients in high risk group had significant poorer overall survival both in TNM stage I+II and III+IV in two datasets compared with those in low risk group. By constructing nomogram including five-gene signature model and TNM stage model, the prognosis prediction had a better improvement, which may subdivide HCC patients more accurately for individual treatment. In summary, all these results indicated that the five-gene signature model plays an important role for predicting overall survival of HCC patients. GSEA showed several significant enriched KEGG pathways for the five-gene signature, such as cell cycle and various kinds of metabolism, which might comprehend the underlying molecular mechanisms. The 50 paired HCC tissues from TCGA-LIHC dataset showed these five genes had significant differential expression between tumor and non-tumor tissues expect SORBS2. And we used 12 paired HCC tissues of our hospital to validate expression tendency between tumor and non-tumor tissues.
Cornichon Family AMPA Receptor Auxiliary Protein 4 (CNIH4) plays an important role in regulating G protein-coupled receptors (GPCRs) transporting from the endoplasmic reticulum (ER) to the functional site (cell surface). And overexpression and down-regulation of CNIH4 resulted in the retention of GPCRs [21]. In colon cancer, CNIH4 which encodes a member of the CORNICHON family of evolutionarily conserved TGFα exporters, is required for metastasis and is regulated by TMED9 activity [22]. The role of CNIH4 in HCC has not been reported yet. SRY-Box Transcription Factor 4 (SOX4), a member of a highly conserved transcription factor SOX (SRY-Box) family known to have a typical DNA-binding HMG domain [23]. It has been reported overexpressed SOX4 could promote metastasis in HCC. Through technology of immunoprecipitation and gene ablation, two SOX4 target genes which had influence on HCC metastasis were identified and validation [24]. A study demonstrated that the HMG box domain of SOX4 interacted with p53, leading to the inhibition of p53-mediated transcription by the Bax promoter. More importantly, overexpressed SOX4 led to a remarkably inhibition of p53-induced Bax expression and subsequent repression of p53-mediated apoptosis induced by gamma-irradiation in HCC [25]. Secreted Phosphoprotein 1 (SPP1) is a secreted arginine-glycine-aspartate (RGD)containing phosphoprotein. It might be an important molecule of tumor metastasis mediated by macrophage invasion and direct stimulation of macrophage migration [26]. It has been reported that genetic polymorphisms of the SPP1 gene are linked with HBV clearance and onset age of HCC in a large Korean HBV study, which might provide a way to elucidate the molecular mechanisms underlying HBV clearance and HCC progression [27]. Sorbin And SH3 Domain Containing 2 (SORBS2) is essential for regulating cell adhesion and actin/cytoskeletal organization. A recent reported SORBS2 could suppress metastatic colonization in ovarian cancer by some mechanisms [28]. A study showed that expression of SORBS2 in HCC was significantly decreased, which was related to HCC metastasis, TNM stage and prognosis.
Mechanistically, SORBS2 contributed to the suppression of HCC tumourigenesis and metastasis via post-transcriptional regulation of RORA expression as an RNA-binding protein [29]. Another study indicated that SORBS2 was down-regulated by MEF2D and inhibited HCC metastasis through the c-Abl /ERK signaling pathway, which might become a new prognostic marker or therapeutic target for HCC [30]. C-C Motif Chemokine Ligand 19 (CCL19) might play a role not only in inflammatory and immunological responses but also in normal lymphocyte recirculation and homing. A study showed knock-down levels of CC chemokine receptor like 1 (CCRL1) could inhibit expression of CCL19 and CCL21. By isolating CCL19 and CCL21, CCRL1 reduced their binding to CCR7 and consequently reducing the harmful effects of CCR7, including Akt-GSK3 pathway activation in tumor cells [31]. The specific mechanism of CCL19 in HCC remains to explore.
So far, our team identified a novel five-gene signature prognostic model and nomogram for predicting overall survival of HCC. Combined with TNM stage clinical pathological parameters, the capacity of prediction had a decent improvement, especially in 5-year overall survival. And by quantitative real-time PCR validation, these five genes had significant differential expression between tumor and non-tumor tissues. Although the five-gene signature was constructed and seemed to be a potential prognostic biomarker in clinical, there are some limitations. Firstly, the number of external validation dataset is no abundant. The second limitation is our study didn`t explore the expression and prognostic effects of the five genes at the protein level.
Thirdly, the risk score model needs further validation from clinical trials. To verify the results of this study, further clinical studies are necessary. Finally, our study has not explored the underlying mechanical of the five genes at cellular and molecular levels.

Conclusion
In brief, our study constructed and validated a five-gene prognostic model and