Differential methylation analysis based on WGBS data
After excluding one abnormal sample, QC results showed that the average sequencing depth of WGBS data for the 24 samples was 21.42 [minimum 15.85, maximum 26.2]. The average base quality of WGBS data was 31.68, and the average mapped quality was 34.40. These results reflect the high quality of WGBS data from 24 fresh tissue samples. We combined the PRJNA762641 and GSE70090 datasets with the 24 samples of this study as an integrated dataset. The MDS plot showed that the tumor and normal samples of the three datasets clustered clearly in two separate groups, indicating good consistency among the datasets (Fig. 1A). Overall, CpGs in tumor samples showed lower methylation levels than normal samples, as seen in the density plot (Fig. 1B).
We use the circular binary segmentation (CBS) algorithm to identify DMRs between HCC and NATs, as described in the methods. We obtained 132773 DMRs, including 12317 hypermethylated and 120456 hypomethylated regions in HCC (Fig. 1C). We found that the hyper-DMRs displayed a significantly smaller length than hypo-DMRs and non-DMRs (Fig. 1D), suggesting that CpG in hyper-DMR tends to accumulate in shorter regions (e.g., CpG islands). The top 100 DMRs with the most significant differences are shown in Supplementary table 5. Some of these DMRs, like the hypermethylation of the MGMT gene promoter region (Fig. 1E) and DAPK1 (Fig. 1F), have been linked to HCC. In addition, we investigated the methylation status of 27 genes reported hypermethylated in HCC by previous studies [8, 17] and found that ten were covered by these hyper-DMRs (Supplementary table 6).
Integration of DNA methylation and gene expression data
In addition to WGBS data, the PRJNA762641 project also included transcriptomic data from the same patient, allowing us to investigate the potential impact of aberrant DNA methylation on gene expression. We identified 4,322 DEGs, including 4662 upregulated and 264 downregulated genes in HCC (Fig. 2A). Interestingly, we found a high proportion of upregulated DEGs (48.58% vs. 2.75%) (Fig. 2B), suggesting extensive transcriptional activation events during HCC development, consistent with the frequent hypomethylation events observed in tumor tissues. The functional enrichment results showed that the upregulated DEGs in HCC were mainly enriched in biological processes related to RNA metabolism (including RNA splicing, ribonucleoprotein complex biogenesis, etc.) (Fig. 2C). In contrast, down-regulated DEGs were significantly enriched in entirely different biological processes, mainly in GO terms like small molecule and organic acid metabolism (Fig. 2D).
By integrating the results of DMRs with gene expression, we attempted to identify the genes whose expression was tightly correlated with methylation levels. DEGs were first linked with the nearest DMRs (within 2kb) and then grouped into four categories: hyper-upregulated, hyper-downregulated, hypo-upregulated, and hypo-downregulated. We focused on hyper-downregulated and hypo-upregulated genes based on the hypothesis that methylation is negatively regulated with expression, as mentioned in the methods. Consequently, we obtained 987 methylation-driven candidates, including 970 upregulated and 17 downregulated genes (Fig. 2E). The methylation-driven DEGs accounted for 20.1% of the total DEGs, suggesting a significant impact of DNA methylation on gene expression.
Further investigations revealed that hypo-upregulated genes are enriched in the biological processes of RNA biosynthesis and transport (Fig. 2F). In contrast, hyper-downregulated genes are mainly associated with the metabolism of retinol, cellular hormones, and olefinic acid (Fig. 2G). Of note, four of the 17 hyper-downregulated genes are retinol metabolic pathway genes, namely ADH1A, CYP2A6, CYP2C8, and CYP2C19. Correlation analysis further confirmed that methylation levels of CpGs within the DMRs of these four genes significantly correlate with gene expression negatively, while no significant correlation was observed for CpGs outside of the DMRs (Supplementary Fig. 1). In addition, we evaluated the abilities of four genes’ DMRs to discriminate HCC from NATs. AUC values for the four DMRs were 0.93 [0.88 − 0.97], 0.72 [0.59 − 0.84], 0.91 [0.86 − 0.97] and 0. 90 [0.81 − 0.95] (Supplementary Fig. 2), indicating their potential for HCC diagnosis.
HCC can be stratified by the genes of retinol metabolism
To validate these findings, we investigated the methylation and expression characteristics of the four genes in the TCGA-HCC dataset, which consists of 365 HCCs and 50 matched NATs. We obtained 12 probes associated with the four genes (Fig. 3A) which all showed high concordance with the results of WGBS, except for cg19645066 (CYP2A6). For example, two probes on ADH1A showed hypomethylated in HCC, which was in line with the results of WGBS (Supplementary Fig. 3A). Similar results were observed for the probes of the other three genes (Supplementary Fig. 3B-D). However, significantly different methylation levels were only found for six probes between HCC and NATs (FDR < 0.05, Supplementary table 7), indicating the shortcomings of 450k microarray technology in determining the methylation patterns of these genes.
We then investigated the expression profiles of these four genes in the TCGA HCC dataset. Three of them had expression values, showing significant downregulation on tumor samples (Fig. 3B), consistent with our findings based on 33 tumor samples and matched NATs. Interestingly, we found that the expression levels of the three genes were not uniform across the HCCs, with some HCCs exhibiting higher expression levels and others significantly lower expression levels (Fig. 3B), suggesting a high heterogeneity of the three genes expression in HCC. Unsupervised clustering of the HCC samples using K-means method indicated that they could be clustered into three groups (Fig. 3C). Subgroup 1 (C1, n = 214) displayed the highest expression of the three genes, while subgroup 2 (C2, n = 118) and subgroup 3 (C3, n = 33) showed lower expression levels (Supplementary Fig. 4A). UMAP analysis also showed similar results: HCC could be stratified into distinct subtypes by the three genes (Fig. 3D).
Additionally, survival analysis demonstrated that patients in C1 had better overall survival rates than those in C2 and C3 (p < 0.05, Fig. 3E). To exclude bias in a single dataset, we verified these results in an independent liver cancer cohort from the ICGC-JP project. Using the same method, we obtained a remarkable agreement between the two cohorts for the K-means results (Fig. 3F) and the UMAP analysis (Fig. 3G). Moreover, the prognosis for C1, C2, and C3 subgroups also showed significant differences in the ICGC’s LIRC cohort, and the trend was consistent with the TCGA HCC cohort (Fig. 3H). These findings suggest that the genes related to retinol metabolism can be used to stratify hepatocellular carcinoma into different subtypes, with potential prognostic value for patient outcomes assessment.
Clinical features and genomic characteristics of the three subgroups
To better understand the differences between the three subgroups, we calculated the average expression values of the three genes as the retinol scores of HCC patients. We observed significant differences in the distribution of several clinical features between the three subgroups in TCGA-LIHC cohort (Supplementary Fig. 4B-H). Overall, retinol scores were lower in patients with advanced, high-grade HCC. For instance, patients in T1 stage showed higher scores than those in T2 and T3 (Supplementary Fig. 4B). However, the degree of lymph node metastasis showed an opposite trend with patients’ retinol scores (Supplementary Fig. 4C). While no significant difference in retinol scores was found between M0 and M1 stage patients, which may be related to the small number of cases in M1 stage (n = 3, Supplementary Fig. 4D). Interestingly, retinol scores showed a decreasing trend over the AJCC stage (from I to II and III) and grade (from G1 to G2, G3, and G4) (Supplementary Fig. 4E-F). Patients with HBV or HCV infections did not display significantly different retinol scores (Supplementary Fig. 4G). It should be noted that younger patients (age < 50) showed significantly lower retinol scores than older patients (age > = 50). This may be attributed to the higher malignancy of younger people in the TCGA HCC cohort because we found that patients under 50 were more likely to be AJCC stage III/IV, T3/T4, and G3/G4 (data not supplied).
We then investigate the mutation profiles of the three subgroups (Fig. 4A). The most commonly mutated genes in HCC patients, such as TP53, CTNNB1, TTN, MUC16, and PCL, were shared across the three groups (Supplementary Fig. 5A-C). Tumor mutation load also did not show significant differences between the three groups (Supplementary Fig. 5D). Nevertheless, mutational enrichment analysis showed that MDN1, NLGN4X, and COL11A1 were significantly enriched on C1, while CTNNB1, PREX2, and AXIN1 were significantly enriched on C2, and OBSCN, BAP1, and TSC2 were significantly enriched on C3 (Supplementary Fig. 5E). In addition, we found that the three groups shared the mutation signature of SBS22 (exposure to aristolochic acid) (Supplementary Fig. 6). These results suggest that three subgroups shared similar mutation patterns.
Next, we explored the tumor microenvironment (TME) profiles of the three subgroups. Cell proliferation and wound healing scores gradually increased from C1 to C3, suggesting that the subgroups have distinct tumor cell aggressiveness and blood vessel formation (Fig. 4B-C). We found that IFN-γ and TGF-β response scores exhibited opposite trends among the three groups (Fig. 4D-E), with C3 having the lowest IFN-γ response scores but the highest TGF-β response scores, indicating a lower immune response in C3 than C1. In addition, we investigated the 24 immune cell scores between the three groups (Supplementary Fig. 7). The top 4 significantly altered immune cells are macrophage M0, memory B cells, resting mast cells, and Th17 cells. We observed the highest scores of macrophage M0, memory B cells, and resting mast cells in C3 (Fig. 4F–H) and the highest score of Th17 cells in C1 (Fig. 4I). These results highlight a suppressed immune response in C3 compared to C1 and C2.
Integration of single-cell RNA-seq data revealed potential functions of retinol metabolism-related genes
In order to elucidate the biological processes in which these genes may be involved, we integrated single-cell RNA-seq data from HCC patients and investigated the expression patterns of the three retinol metabolism-related genes across different cell types. The scRNA-seq dataset, GSE151530, was used in this study, which contained 49432 cells from 23 tumor tissues after quality control. These cells were classified into six categories, namely T cells, B cells, cancer-associated fibroblasts (CAFs), tumor-associated macrophages (TAMs), tumor-associated endothelial cells (TECs), and epithelial cells, according to the annotation of Gao et al. (Fig. 5A). We calculated the retinol scores (average expression) for all cells using the expression values of the three genes (CYP2C19 was excluded due to lack of expression). Malignant cells have the highest scores and exhibit huge variation, while retinol scores are pretty low for other cells (almost equal 0) (Fig. 5B). Therefore, we focused on malignant cells and re-clustered them. Malignant cells clustered into 15 subpopulations (Fig. 5C), with the highest scores in subpopulation 3 and the lowest scores in subpopulation 5, 14, 9, and 12 (Fig. 5D). Cell types were annotated using Human Primary Cell Atlas Data, and most of the cells in subpopulation 5, 14, 9, and 12 were annotated as hepatocyte cells (Supplementary Fig. 8). These findings suggest that a subpopulation of cells, differentiated from malignant cells, exhibits significantly varied retinol scores. We then identified marker genes for the 15 subpopulations and conducted a GO-enrichment analysis. The results showed that marker genes of subpopulation 14, 9, and 12 were predominantly involved in immune system (Fig. 5E). Moreover, we found marker genes of subpopulation 5 significantly enriched in wound healing, endothelium/vasculature development, and cell migration (Fig. 5E). In addition, we compared the DEGs between TCGA HCC-C1 (highest retinol scores) and HCC-C3 (lowest retinol scores) with marker genes of subpopulation 5 (Supplementary Fig. 9A). There were 91 genes shared by both, which significantly enriched in similar biological processes such as cell-substrate adhesion, regulation of actin cytoskeleton, and wound healing (Supplementary Fig. 9B).