Prognostic value of a metastasis-related gene signature in hepatocellular carcinoma

Background: Surgery is a potential curative treatment for hepatocellular carcinoma (HCC), but postoperative recurrence occurs in majority of patients. Currently, there are no robust biomarkers to predict the disease progression or recurrence. Methods: Weighted gene correlation network analysis (WGCNA) was used to construct co-expression network. The least absolute shrinkage and selection operator (LASSO) method was applied to develop prognostic model. Survival analysis, gene set variation analysis (GSVA) and gene set enrichment analysis (GSEA) were performed to assess the performance. Results: By using GSE14520 dataset, 16 hub genes associated with HCC metastasis were screened. A signature of 5 metastasis-related genes was subsequently constructed by incorporating TCGA LIHC dataset. HCC patients with high risk score have low survival rate, low disease free survival rate and high recurrence rate. The prognostic value of this 5-mRNA signature was further verified in pan-cancer datasets. A nomogram was built with excellent performance and potential clinical application for prognosis. GSVA and GSVA revealed that metabolic pathways are distinct between high- and low-risk groups. Conclusions: We have established a 5-mRNA signature and a nomogram that can efficiently predict metastasis, recurrence and patient survival in HCC. was used to plan the nomogram. Calibration plots were generated to evaluate whether the predicted result (by nomogram) was close to the actual outcome. The x-axis represents nomogram predicted probability of survival and the y-axis represents the actual survival. The 45-degree line represents the perfect nomogram performance. Decision curve analysis (DCA) was


Data collection
We searched on Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/) with keywords "HCC", chose study type as "expression profiling by array", top organisms as "Homo sapiens" and sorted by "number of samples (high to low)". Meanwhile, we downloaded RNA-sequencing data of HCC from The Cancer Genome Atlas (TCGA) database (https://genome-cancer.ucsc.edu/).

WGCNA analysis
Firstly, we performed data preprocessing to remove outlier samples. The "WGCNA" R package was then used to construct the co-expression network. We identifified a set of genes based on the module connectivity, measured by absolute value of the Pearson's correlation (cor.geneModuleMembership > 0.8) and clinical trait relationship, measured by absolute value of the Pearson's correlation (cor.geneTraitSignificance > 0.2). The common genes were selected to perform further study.

Construction of a prognostic mRNA classifier
Genes from the module most related to tumor metastasis risk were used. Based on "glmnet" R package, Lasso cox regression analysis was applied to build an optimal prognostic signature of HCC samples by using TCGA LIHC. Time-dependent receiver operating characteristic curve (ROC) analysis by using "survivalROC" R package was used to calculate the area under curve (AUC) and to evaluate the prediction accuracy of the model.

Analysis of the relation of the prognostic model with clinical characteristics
To determine whether the prognostic model was significant in combination with other clinical characteristics, all the clinical variables in TCGA, including stage, grade, N stage, M stage, age and gender, were divided into two groups. Patients were stratified into stage I/II and stage III/IV subgroups, grade I/II and grade III/IV subgroup, N0 and N1/Nx subgroup, M0 and M1/Mx subgroup, age < 65y and age ≥ 65y subgroup, male and female subgroup. Overall survival analyses were performed in each subgroup.

Building and validating a predictive nomogram
The "regplot" R package was used to plan the nomogram. Calibration plots were generated to evaluate whether the predicted result (by nomogram) was close to the actual outcome. The x-axis represents nomogram predicted probability of survival and the y-axis represents the actual survival.
The 45-degree line represents the perfect nomogram performance. Decision curve analysis (DCA) was used to determine whether the predictive model is clinically useful.

Pan-cancer outcome validation
To further investigate the clinical value of the signature, pan-cancer validation of overall survival was performed based on KM plotter database (http://kmplot.com). 19 types of tumors were included (n = 7008). Since the signature has high correlation with tumor metastasis, gene expression profile of human breast cancer and prostate cancer with metastasis information was downloaded and analyzed.

Functional enrichment analysis
The Gene Set Variation Analysis (GSVA) method from the "GSVA" R package was applied to investigate the significantly altered pathways between high-and low-risk groups in the TCGA dataset and GSE14520. Furthermore, Gene set enrichment analysis (GSEA) was applied to evaluate the value of the signature by using diagnosis, metastasis risk and prognosis related gene sets (http://software.broadinstitute.org/gsea/msigdb/index.jsp).

Construction of co-expression network and identification of metastasis-related module
Among the 378 screened datasets, GSE14520 was identified as the most idea dataset for constructing co-expression networks. It was performed on Affymetrix Human Genome U133A 2.0 Array and Affymetrix HT human Genome U133A Array, including 247 HCC samples with comprehensive clinical information [24,25]. The TCGA dataset was used to establish LASSO model, as it contains 365 HCC samples with complete gene expression profile and patient outcome information.
To satisfy normal distribution, the data cleaning process excluded 26 samples from the GSE14520 dataset for subsequent analysis (Additional file 1: Figure S1). To ensure a scare-free network, the power of β = 5 (scare-free R 2 = 0.9) was selected (Additional file 2: Figure S2). Based on 12 categories of clinical information, including metastasis risk (Fig. 1b), 14 modules were identified by using the "WGCNA" package in R (Fig. 1c). The blue module shows the highest correlation with the metastasis risk, and thus was subjected to further analysis (Fig. 1d, e). 16 hub genes that were closely connected to the blue module were identified for the following analysis (Fig. 1f).
Validation of the 5-mRNA signature for predicting overall survival, recurrence free survival and disease free survival Overall survival and recurrence free survival data from GSE14520 and disease free survival data from TCGA LIHC were used to perform the validation. Based on GSE14520, patients with low risk score had a higher survival rate (p = 0.0062) (Fig. 2e, f, g). The prognostic accuracy of the 5-mRNA signatures is 0.650 at 1 year, 0.561 at 3 years, and 0.668 at 5 years (Fig. 2h). In TCGA LIHC, patients with low risk score had a elevated disease free survival rate (Fig. 2i, j, k) and the AUC was 0.689 at 1 year, 0.637 at 3 years and 0.597 at 5 years (Fig. 2l). Recurrence analysis was performed in GSE14520 and patients with high risk score had a higher recurrence rate (p = 0.026) (Fig. 2m, n, o). The AUC was 0.598 at 1 year, 0.546 at 3 years, and 0.609 at 5 years (Fig. 2p), respectively. Prognostic value of the 5-mRNA signature in combination with essential clinical characteristics To explore whether the prognostic value of the 5-mRNA signature is applicable to other clinical characteristics, stratification analysis was performed according to stage, grade, N stage, M stage, age and gender. We found that the 5-mRNA signature is still a clinically and statistically significant prognostic model (except in the female subgroup) (Fig. 3).

Clinical application of nomogram
To make the predictive model visual and quantized, we constructed a nomogram combining the 5-mRNA signature with other clinical characteristics, including age and stage, to predict the probability of 3-and 5-year overall survival (Fig. 4a). Calibration plots showed high performance of the nomogram compared with an ideal model (Fig. 4b, c). The decision curve analysis further confirmed the clinical usefulness of this nomogram (Fig. 4d, e).
Validation of pan-cancer overall survival and assessment of metastatic risk To further broaden the clinical implications of our signature, pan-cancer validation of overall survival was analyzed. Consistent with the performence in HCC, the signature exhibited its prognostic value in breast cancer (p = 0.028), kidney renal clear cell carcinoma (p = 0.021), kidney renal papillary cell carcinoma (p = 0.0011), lung adenocarcinoma (p = 0.0001) and pancreatic ductal adenocarcinoma (p = 0.017) (Additional file 10: Figure S9), but not in cancers, such as bladder carcinoma, cervival squamous cell carcinoma and esophageal carcinoma (Additional file 11: Figure S10).
In respect to metastatic risk, the signature significantly distinguishes high/low metastasis survival rate. As for breast cancer, in GSE5327 cohort, patients with high risk score had a lower metastasis free survival rate (p = 0.017). Noticeably, lung metastasis survival rate had the same trendency (p = 0.038) (Fig. 5a, b). The results of metastasis free survival analysis were further confirmed in GSE20685, GSE24450 and GSE58644 datasets (p = 0.043, 0.00042 and 0.0091, respectively) (Fig. 5c, d, e). The distant metastasis free survival of breast cancer in KM plotter database and metastasis free survival of GSE116918 (prostate cancer) were also significant (p = 0.0014 and 0.0038, respectively) ( Fig. 5f, g).

Alteration of biological pathways between high-and low-risk groups classified by the 5-mRNA signature
To identify the significant changes of biological pathways between high-and low-risk groups, we performed the gene set variation analysis (GSVA). Interestingly, metabolic changes, including bile acid metabolism, fatty acid metabolism, xenobiotic metabolism, heme metabolism, glycolysis and adipogenesis, were particularly prominent between these two groups, both in TCGA dataset and GSE14520 (Fig. 6a, b). GSEA analysis demonstrated that the high-risk group was enriched in gene sets which were positively related to metastasis/invasiveness and poor survival and low-risk group was enriched in gene sets negatively related to metastasis and poor survival (Fig. 6c).

Discussion
It is well-recognized that more than 90% of cancer mortality is caused by metastasis [26,27]. High potential of metastasis is the main risk of death for postoperative HCC patients [28]. In the present study, we have established a signature of 5 metastasis-related mRNAs for prediction of tumour recurrence and patient survival in HCC.
These five genes, to our knowledge, are all related to metastasis in general but their role in HCC remains largely under investigated. ABAT encodes a key enzyme responsible for catabolism of principal inhibitory neurotransmitter gamma-aminobutyric acid (GABA). Loss of ABAT-mediated GABAergic system is associated with the aggressive behavior of basal like breast cancer (BLBC), providing potential prognostic indicators and therapeutic targets for BLBC [29]. GYS2 is a glycosyltrasferase that catalyses the reaction of glucose to glycogen. It has been reported to be down-regulated in HCC and correlated with decreased glycogen content and unfaborable patient outcomes [30]. HAO1 catalyzes the oxidation of glycolate to glyoxylate within the peroxisomes of hepatocytes [31]. HAGH belongs to the main scavenging system of methylglyoxal, a potent prohe main scavenging system of methylglyoxalcogen [32,33]. It has been reported to involve in urological malignancies [34]. SEC14L2 is a cytosolic lipid-binding protein that has been reported to facilitate hepatitis C virus (HCV) replication [35,36]. Collectively, these five metastasis-related genes play vital role in metabolism.
Liver is the central hub of metabolism, and metabolic reprogramming commonly occurs in cancer including in HCC [37,38]. Metabolomic profiling has been widely explored for identification of potential biomarkers for early detection or prognosis in cancer [39,40,41,42,43]. In our study, interestingly, GSVA analysis revealed significant differences in metabolism between high-and low-risk groups, including bile acid metabolism, fatty acid metabolism and xenobiotics metabolism. Bile acids serve as signaling molecules in the liver that regulate lipid and glucose homeostasis [44]. They play an important role in the sustentation of inflammation, and thus the development and promotion of HCC [45,46]. In cancer cells, 95% of fatty acids come from de novo synthesis [47]. In HCC, aberrant expression of lipogenic enzymes including fatty acid synthase is associated with tumor development and progression [48]. Our analysis revealing the dramatic metabolic changes between the different risk groups provides potential molecular targets for diagnosis, prognosis and therapeutic development.
In the field of biomarker development for HCC, it is well-recognized of the necessity to combine multiple molecules to enhance the sensitivity and specificity. For example, the most classical serological marker, AFP, was verified to achieve a higher sensitivity and specificity for diagnosing HCC when combined with AFP-L3 and des-gamma-carboxy prothrombin [49]. In this respect, our study has several advantages. Firstly, the 5-mRNA signature was established based on large dataset and innovative bioinformatics approaches. These are all metastasis-related genes identified by WGCNA, which indicates that the signature is functionally relevant to HCC progression and recurrence.
Furthmore, the prognostic value of this signature was further validated in pan-cancer overall and metastatic survival. Secondary, with the combination of risk score and clinical factors, a nomogram was built to accurately predict the probability of overall survival in patients with HCC. Of note, there are still a few limitations of our study. Due to the lack of data, we were not able to validate the prediction of metastasis in HCC. More importantly, prospective studies are required for a sound validation [50].

Conclusions
In summary, we have established a 5-mRNA signature and a nomogram for predicting the outcome and potential risk of metastasis and recurrence for HCC patients. This can be potentially developed as a promising tool for prognostic evaluation of HCC patients after surgical operation, but further validation is essentially required.

Consent for publication
Not applicable.

Availability of data and materials
All data generated or analyzed during this study are included in this published article and supplementary files.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
ZZ and JN made substantial contributions to conception and design of the research. ZZ carried out data collection and analysis. ZZ and JN wrote the paper. ZZ and JN edited the manuscript and provided critical comments. All authors read and approved the final manuscript