An apoptosis-related gene signature for the prognosis of hepatocellular carcinoma

Hepatocellular carcinoma (HCC) is a major public health burden worldwide owning to high incidence and poor prognosis. Although a mushrooming number of apoptosis-related genes had been disclosed in HCC, the prognostic value and clinical utility of them remain to be illustrated. Here, we defined the data from Gene Expression Omnibus (GEO) as a training cohort and data from The Cancer Genome Atlas-Liver Hepatocellular Carcinoma data set (TCGA-LIHC) as a validation cohort. The apoptosis-related differentially expressed genes (AR-DEGs) were identified with the two cohorts and the Gene Set Enrichment Analysis. Then, we constructed a Lasso-penalized Cox regression model using AR-DEGs and conducted a signature including 14 apoptotic genes to calculate the risk score. Patients with a high risk score indicated worse overall survival than those with low risk. Besides, the 3-year and 5-year area under curve (AUC) values of the signature were above 0.7 in both training and validation cohorts (0.762, 0.818, 0.717, 0.745, respectively). Moreover, a nomogram containing the signature and clinical characteristics presented reliable net benefits for the survival prediction. And the nomogram was tested by probability calibration curves and Decision Curve Analysis (DCA). Furthermore, protein-protein interaction (PPI) and Gene Ontology (GO) enrichment analysis disclosed several noticeable pathways that might clarify the hidden mechanism. Collectively, the present study formed a novel signature based on the 14 apoptotic genes and this possibly predicted prognosis and strengthened the communication with HCC patients about the likely treatment.


Abstract
Hepatocellular carcinoma (HCC) is a major public health burden worldwide owning to high incidence and poor prognosis. Although a mushrooming number of apoptosis-related genes had been disclosed in HCC, the prognostic value and clinical utility of them remain to be illustrated. Here, we defined the data from Gene Expression Omnibus (GEO) as a training cohort and data from The Cancer Genome Atlas-Liver Hepatocellular Carcinoma data set (TCGA-LIHC) as a validation cohort. The apoptosisrelated differentially expressed genes (AR-DEGs) were identified with the two cohorts and the Gene Set Enrichment Analysis. Then, we constructed a Lasso-penalized Cox regression model using AR-DEGs and conducted a signature including 14 apoptotic genes to calculate the risk score. Patients with a high risk score indicated worse overall survival than those with low risk. Besides, the 3-year Hepatocellular carcinoma (HCC) accounts for about 90% of cases of primary liver cancer; the remaining pathologic types are cholangiocarcinoma and mixed hepatocellular-cholangiocarcinoma (2). Surgical resection of the tumor completely and safely is the most effective treatment for HCC, however, most patients are diagnosed at an advanced stage and become unrespectable (3). And thus, the 5-year survival rate for HCC patients was found to be less than 12% (4). Although existing drugs such as sorafenib and immune checkpoint inhibitors could help improve patients' prognoses, the effect is still unsatisfactory (5). Therefore, more sensitive biomarkers are earnestly required for the early detection of HCC.
Apoptosis, a programmed cell death, is precisely regulated by numerous checkpoints to maintain equilibrium (6). Suppression of apoptotic cell death was reported to cause cancers, autoimmune diseases, propagation of intracellular pathogens, and neurodegenerative diseases (7). Remarkably, an elevated level of apoptosis could be observed in acute liver injury, but apoptosis generally occurs to a lesser extent in non-alcoholic fatty liver disease and liver cancer (8)(9)(10).
In our present research, the apoptosis-related differentially expressed genes (AR-DEGs) were defined as the intersection of the high-throughput data from TCGA, GEO, and GSEA. We then selected a 14-AR-DEGs signature to calculate the risk score with the Lasso-COX regression model. Patients were divided into the high-risk group and low-risk group based on the median risk score. Moreover, univariate and multivariate COX regression models were conducted using risk score and clinical characteristics in both TCGA and GEO data. ROC curve, Calibration curve, and DCA certified the prognostic significance of models. Finally, protein-protein interaction (PPI) and GO enrichment analysis were executed to uncover the underlying mechanism. Briefly, we identified a novel apoptotic gene signature and it might serve as a tool for survival prediction, treatment adjustment, and even clinical classification.

Data collection and process
The GSE14520 dataset was obtained from the Gene Expression Omnibus (GEO) database and taken as a training cohort, which consisted of 218 HCC and 221 control samples. All the microarray raw data in the training cohort were transformed with log2(x+1) and normalized to ensure harmonized criteria.
Then, the messenger RNA (mRNA) expression profiles of 364 patients with HCC and 50 normal samples were accessed from the UCSC Xena (https://tcga.xenahubs.net) and selected as a validation cohort. Meanwhile, a list of 1982 apoptosis-related genes, named "GO_APOPTOTIC_PROCESS", was downloaded from GSEA.

Identification of AR-DEGs
Based on the Limma R package, differentially expressed genes (DEGs) were obtained from the training cohort and validation cohort (11). And the cutoff values were set as follows: the adjusted Pvalue < 0.05 and |log2 fold change (FC)| > 1.5. Then, genes were defined as apoptosis-related differentially expressed genes (AR-DEGs) and enrolled in the subsequent analysis, if they were the intersection of DEGs and apoptosis-related genes.

Construction of the apoptotic gene signature and risk score calculation
A Lasso-penalized Cox regression model was applied in Rstudio software with AR-DEGs to shrink and choose variables further (12). The "glmnet" R package (13) helped determine the optimal penalty parameter λ and partial likelihood deviance. And finally, a fourteen-apoptotic-gene signature was conducted and the risk score could be calculated. The calculation formula was as follows: The Expression i was the mRNA expression of each gene and Coef i represented the corresponding coefficient.

Evaluation of the prognostic value of the apoptotic gene signature
Initially, the training cohort was equally divided into the high-risk group (n=109) and low-risk group (n=109) based on the median risk score. And the Kaplan-Meier method was applied to estimate the survival differences between the high and low-risk groups. Besides, we plotted the receiver operating characteristic (ROC) curve to obtain the area under the curve (AUC) for 3-year and 5-year overall survival. Also, the validation cohort was split into the high-risk group (n=182) and low-risk group (n=182), and then survival and ROC analyses were repeated for accuracy.

Assessment of the apoptotic gene signature in subgroup analysis
To identify the prognostic value of the apoptotic gene signature in various conditions, we conducted the subgroup analysis in the training cohort. The clinical characteristics were stratified into the following groups: male versus female, age younger than 50 years versus 50 years and older, AFP less than 300 ng/ml (low) versus higher than 300 ng/ml (high), ALT less than 50U/L (low) versus higher than 50U/L (high), early BCLC_stage (0+A) versus advanced BCLC_stage (B+C), main tumor size less than 5 cm versus main tumor size bigger than 5 cm, multinodular negative versus multinodular positive, cirrhosis negative versus cirrhosis positive, and early TNM_stage (Ⅰ+Ⅱ) versus advanced TNM_stage (Ⅲ+Ⅳ). Then, survival curves were drawn to compare the median survival time between the high-risk and low-risk subgroups.

Building and validating the nomogram
To determine appropriate variables for nomogram building, univariate and multivariate Cox regression analyses were conducted in the training and validation cohorts separately. Therefore, the independent prognostic variables were included in the nomogram by the "rms" and "ggplot2" R packages (14). Then, calibration of the nomogram was assessed visually using 3-year and 5-year calibration plots. As an innovative but robust assessment method, the 3-year and 5-year decision curve analyses (DCA) were plotted to identify the net benefits of risk score, whether alone or cooperating with other clinical characteristics (15). And the x-axis was threshold probability, while the y-axis was net benefits. The black and green lines represented the two extreme cases in DCA.

Protein-protein interaction, GO and KEGG analysis
To uncover underlying mechanisms in the apoptotic gene signature, we implemented protein-protein interaction (PPI), Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis with STRING database and "clusterprofiler" R package (16,17). In the PPI analysis, the potential interactions with 14 apoptotic genes were predicted under the conditions as follows: high confidence (interaction score>0.400) and no more than 20 interactors per shell.
Furthermore, the results of PPI analysis were applied in GO and KEGG enrichment analyses and the outcome was visualized by Cytoscape (Version 3.72) and Rstudio (Version 1.2.5033).

Extra validation of the apoptotic gene signature with online tools
To include more normal samples in the validation cohort, we associated TCGA_LIHC with The Genotype-Tissue Expression (GTEx) in the GEPIA database. The expression of the 14 genes in the apoptotic gene signature was further validated at the mRNA level using GEPIA database, and the protein level with The Human Protein Atlas (18)(19)(20). Also, the genetic alterations of the 14 apoptotic genes were investigated in The cBioPortal for Cancer Genomics and Human Cancer Metastasis Database (HCMDB) to explore the relationship with HCC metastasis (21,22).

Statistical analysis
All of the statistical analyses were processed with R software or SPSS (IBM, SPSS, version 22), and Pvalue < 0.05 was considered statistically significant unless otherwise specified. One-way ANOVA was called in the "limma" R package for differential expression analysis in cohorts. Then, the Kaplan-Meier survival curve analysis and the log-rank test were applied to analyze overall survival. The AUC value of greater than 0.70 was considered acceptable. Univariate and multivariate Cox proportional hazard models were conducted to obtain the hazard ratios and to select independent prognostic variables.

Identification of the prognostic value of the apoptotic gene signature
Patients were subdivided into the high-risk group and low-risk group with the median risk score. In the training cohort, overall survival was significantly better in the low-risk group than the high-risk group and patients with low-risk scores tended to survive (P-value<0.0001, Figure

Construction and validation of the nomogram
In the training cohort, univariate Cox regression analysis revealed that tumor size, cirrhosis, pathological TNM_stage, BCLC_stage, AFP, and risk score of the apoptotic gene signature were the prognostic factor for overall survival (Figure 4a, Table 1). Then, the significant variables were included in multivariate Cox regression analysis, and only BCLC_stage, cirrhosis, and risk score were still independent prognostic factors ( Figure 4b). The forest map and nomogram were plotted built on the three variables. When patients met specific conditions, they received corresponding points and the higher total points got, the worse outcome was (Figure 4c). Moreover, three internal validation groups (70 subjects per group) demonstrated that the nomogram approached an ideal model (Figure 4f,g) and had better predictive performance than cirrhosis or BCLC_stage at 3 years and 5 years with the decision curve analysis (Figure 4d,e).
As for the validation cohort, univariate and multivariate Cox regression analyses indicated that pathological N_stage, M_stage, and risk score were selected to build nomogram (Figure 5a, b, Table   2). Similarly, the nomogram and risk score proved the accuracy and clinical utility for 3-year and 5- year survival prediction (Figure 5c, d, e). And we also generated three internal validation groups (120 subjects per group) randomly to testify the deviations from the ideal model (Figure 5f, g).
In short, the risk score based on the apoptotic signature brought obvious net benefit for prognosis, alone or in combination with classical clinical features.

Potential mechanisms of the apoptotic gene signature
Subsequent exploration of the underlying mechanisms was conducted using PPI, GO, and KEGG analyses. A total of 69 genes encoding proteins were enriched and visualized in Additional file 6.
Several genes were previously reported to participate in apoptosis and cancer tumorigenesis, including caspase family, FAS, VEGFA and so on. The cut-off of the combined score was set at 0.4 (high) and the detailed information was shown in Additional file 7. Then, the 69 genes were analyzed from the biological process (BP), cellular component (CC), molecular function (MF), and KEGG pathway perspectives. And genes of the signature mainly mapped to the chromosome and took part in growth factor, death receptor, and tumor necrosis factor receptor binding (Figure 6a). Additionally, some of the signaling pathways were related to multiparous tumors and enriched in analysis, including the PI3K-Akt signaling pathway, p53 signaling pathway, MAPK signaling pathway, and TNF signaling pathway. Interestingly, the Hepatitis B pathway was also associated with the genes of the apoptotic signature ( Figure 6b).
Notably, FALSG would be up-regulated in the primary tumor with metastasis than those without metastasis in HCC patients. And this suggested that FALSG might play a vital role in the neoplasm metastasis according to the HCMDB database (Additional file 8o).

Discussion
According to the annual projections of the World Health Organization, over 1 million people will die from liver cancer in 2030 (23). With a 5-year survival lower than 12%, HCC is the second most lethal tumor after pancreatic cancer (24). Although the molecular mechanisms of HCC remain unclear, the apoptosis played a key role in tumorigenesis and cellular immortality (8,25). With the development of high-throughput sequencing, more and more single molecular biomarkers were reported (26), but only a few studies highlighted the predictive value of multiple gene signature, especially the apoptotic gene signature in HCC.
In our research, AR-DEGs were singled out and put into a Lasso-penalized Cox regression model, based on TCGA_LIHC, GSE14520, and GSEA data set. The Lasso-COX model avoided the possible collinearity impact and selected fourteen apoptotic genes to identify the signature and calculate the risk score. Whether in training or validation cohorts, the risk score showed a robust prognostic value.
As were shown in Figure 2c and Figure  In the subsequent mechanism research, the fourteen apoptotic genes were enriched in growth factor binding, death receptor binding, tumor necrosis factor receptor binding, PI3K-Akt pathway, p53 signaling pathway, and Hepatitis B pathway. It is no doubt that the above molecular functions and signaling pathways were involved in liver cancer (27)(28)(29)(30). Indeed, most of the fourteen genes had been reported to participate in HBV infection and cancer. Over-expression of SOX4 might promote HCC intrahepatic metastasis by targeting semaphorin 3C (31). The previous study demonstrated that FALSG, also known as FasL, led to the escape from immune surveillance in HCC (32). In our study, the FASLG had the highest mutation frequency and higher mRNA expression in tumors with metastasis, which were shown in Figure 7b and Additional file 8o. Besides, TNFRSF25 was up-regulated in HCC and the silencing gene significantly inhibited liver cancer cell proliferation (33). FGF8, a member of the fibroblast growth factor family, enhanced the invasion of HCC by increasing the expression of YAP1 and EGFR (34). The upregulation of Neuropilin-1 (NRP-1) was related to the boost of arterial blood supply in primary carcinoma and converted the hepatic vasculature to the arterial phenotype (35). In breast cancer, abnormal expression of ATOX1 promoted migration and metastasis through raising the copper (Cu) transportation and Memo protein expression (36,37). Moreover, the overexpression of UNG was proved to facilitate the HBV mutations and increase HCC risk (38), which was identified with immunohistochemistry staining in Figure 7a. Thus, the high risk score of the signature implied poor prognosis and this might be combined effects of the multiple genes.
However, there are still some weaknesses that deserve comment. First, only differentially expressed genes were defined as candidates and applied for model building. Given the complicated processes of HCC, some crucial apoptotic genes might be omitted and thus reduces the reliability of the signature.
Then, more independent validation cohorts with data integrity were in urgent need to pinpoint the utilization of the signature and address the lack of samples in the subgroup (female, non-cirrhosis).
Moreover, the basic experimental studies were warranted to investigate the elusive mechanisms of HCC.

Conclusions
In brief, a novel signature was composed of fourteen apoptotic genes in our study and identified for

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
The datasets generated and analyzed during the current study are available from the corresponding author on a reasonable request.
revised the manuscript; Kunlun Chen and Wenlong Zhai designed the study and revised the manuscript.

Additional file 2
The apoptosis-related differentially expressed genes.

Additional file 3
Detailed information of the 14 genes in the apoptotic signature.

Additional file 4
The curve of risk score and heatmap of the fourteen apoptotic genes in high-and low-risk group.

Additional file 5
Kaplan

Additional file 6
Protein-protein interaction of the fourteen apoptotic genes.

Additional file 7
The combined score of protein-protein interaction from the STRING database.

Additional file 8
The mRNA level expression of the fourteen apoptotic genes in association with TCGA_LIHC, The