Construction of co-expression network and identification of metastasis-related module
Among the 378 screened datasets, GSE14520 was identified as the most idea dataset for constructing co-expression networks. It was performed on Affymetrix Human Genome U133A 2.0 Array and Affymetrix HT human Genome U133A Array, including 247 HCC samples with comprehensive clinical information [24, 25]. The TCGA dataset was used to establish LASSO model, as it contains 365 HCC samples with complete gene expression profile and patient outcome information.
To satisfy normal distribution, the data cleaning process excluded 26 samples from the GSE14520 dataset for subsequent analysis (Additional file 1: Figure S1). To ensure a scare-free network, the power of β = 5 (scare-free R2 = 0.9) was selected (Additional file 2: Figure S2). Based on 12 categories of clinical information, including metastasis risk (Fig. 1b), 14 modules were identified by using the “WGCNA” package in R (Fig. 1c). The blue module shows the highest correlation with the metastasis risk, and thus was subjected to further analysis (Fig. 1d, e). 16 hub genes that were closely connected to the blue module were identified for the following analysis (Fig. 1f).
Establishment of a 5-mRNA signature to predict overall survival
By using LASSO regression, a 5-parameter model was established after ten-time cross-validation to predict overall survival, including 4-aminobutyrate aminotransferase (ABAT), glycogen synthase 2 (GYS2), Hydroxyacid oxidase 1 (HAO1), hydroxyacylglutathione hydrolase (HAGH) and SEC14-like protein 2 (SEC14L2). The coefficiencies were − 0.00504, -0.0063, -0.00165, -0.00426 and − 0.00147, respectively (Additional file 3: Figure S3). They were all down-regulated in HCC tumor samples (Additional file 4: Table S1; Additional file 5: Figure S4; Additional file 6: Figure S5; Additional file 7: Figure S6; Additional file 8: Figure S7; Additional file 9: Figure S8). Using this 5-mRNA formula, the risk score for each patient was calculated (Fig. 2a). Then, the patients were divided into two groups based on the median of risk score (Fig. 2b). We found that patients with low risk score had a significantly higher survival rate (p = 0.00013) (Fig. 2c). Time-dependent ROC analysis has indicated that the prognostic accuracy of 5-mRNA signatures is 0.650 at 1 year, 0.627 at 3 years and 0.601 at 5 years (Fig. 2d).
Validation of the 5-mRNA signature for predicting overall survival, recurrence free survival and disease free survival
Overall survival and recurrence free survival data from GSE14520 and disease free survival data from TCGA LIHC were used to perform the validation. Based on GSE14520, patients with low risk score had a higher survival rate (p = 0.0062) (Fig. 2e, f, g). The prognostic accuracy of the 5-mRNA signatures is 0.650 at 1 year, 0.561 at 3 years, and 0.668 at 5 years (Fig. 2h). In TCGA LIHC, patients with low risk score had a elevated disease free survival rate (Fig. 2i, j, k) and the AUC was 0.689 at 1 year, 0.637 at 3 years and 0.597 at 5 years (Fig. 2l). Recurrence analysis was performed in GSE14520 and patients with high risk score had a higher recurrence rate (p = 0.026) (Fig. 2m, n, o). The AUC was 0.598 at 1 year, 0.546 at 3 years, and 0.609 at 5 years (Fig. 2p), respectively.
Prognostic value of the 5-mRNA signature in combination with essential clinical characteristics
To explore whether the prognostic value of the 5-mRNA signature is applicable to other clinical characteristics, stratification analysis was performed according to stage, grade, N stage, M stage, age and gender. We found that the 5-mRNA signature is still a clinically and statistically significant prognostic model (except in the female subgroup) (Fig. 3).
Clinical application of nomogram
To make the predictive model visual and quantized, we constructed a nomogram combining the 5-mRNA signature with other clinical characteristics, including age and stage, to predict the probability of 3- and 5-year overall survival (Fig. 4a). Calibration plots showed high performance of the nomogram compared with an ideal model (Fig. 4b, c). The decision curve analysis further confirmed the clinical usefulness of this nomogram (Fig. 4d, e).
Validation of pan-cancer overall survival and assessment of metastatic risk
To further broaden the clinical implications of our signature, pan-cancer validation of overall survival was analyzed. Consistent with the performence in HCC, the signature exhibited its prognostic value in breast cancer (p = 0.028), kidney renal clear cell carcinoma (p = 0.021), kidney renal papillary cell carcinoma (p = 0.0011), lung adenocarcinoma (p = 0.0001) and pancreatic ductal adenocarcinoma (p = 0.017) (Additional file 10: Figure S9), but not in cancers, such as bladder carcinoma, cervival squamous cell carcinoma and esophageal carcinoma (Additional file 11: Figure S10).
In respect to metastatic risk, the signature significantly distinguishes high/low metastasis survival rate. As for breast cancer, in GSE5327 cohort, patients with high risk score had a lower metastasis free survival rate (p = 0.017). Noticeably, lung metastasis survival rate had the same trendency (p = 0.038) (Fig. 5a, b). The results of metastasis free survival analysis were further confirmed in GSE20685, GSE24450 and GSE58644 datasets (p = 0.043, 0.00042 and 0.0091, respectively) (Fig. 5c, d, e). The distant metastasis free survival of breast cancer in KM plotter database and metastasis free survival of GSE116918 (prostate cancer) were also significant (p = 0.0014 and 0.0038, respectively) (Fig. 5f, g).
Alteration of biological pathways between high- and low- risk groups classified by the 5-mRNA signature
To identify the significant changes of biological pathways between high- and low- risk groups, we performed the gene set variation analysis (GSVA). Interestingly, metabolic changes, including bile acid metabolism, fatty acid metabolism, xenobiotic metabolism, heme metabolism, glycolysis and adipogenesis, were particularly prominent between these two groups, both in TCGA dataset and GSE14520 (Fig. 6a, b). GSEA analysis demonstrated that the high-risk group was enriched in gene sets which were positively related to metastasis/invasiveness and poor survival and low-risk group was enriched in gene sets negatively related to metastasis and poor survival (Fig. 6c).