Single-cell transcriptome analysis identifies novel biomarkers involved in major liver cancer subtypes

Hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (ICC) are the two aggressive subtypes of liver cancer (LC). Immense cellular heterogeneity and cross-talk between cancer and healthy cells make it challenging to treat these cancer subtypes. To address these challenges, the study aims to systematically characterize the tumor heterogeneity of LC subtypes using single-cell RNA sequencing (scRNA-seq) datasets. The study combined 51,927 single cells from HCC, ICC, and healthy scRNA-seq datasets. After integrating the datasets, cell groups with similar gene expression patterns are clustered and cluster annotation has been performed based on gene markers. Cell-cell communication analysis (CCA) was implemented to understand the cross-talk between various cell types. Further, differential gene expression analysis and enrichment analysis were carried out to identify unique molecular drivers associated with HCC and ICC. Our analysis identified T cells, hepatocytes, epithelial cells, and monocyte as the major cell types present in the tumor microenvironment. Among them, abundance of natural killer (NK) cells in HCC, epithelial cells, and hepatocytes in ICC was detected. CCA revealed key interaction between T cells to NK cells in HCC and smooth muscle cells to epithelial cells in the ICC. Additionally, SOX4 and DTHD1 are the top differentially expressed genes (DEGs) in HCC, while keratin and CCL4 are in ICC. Enrichment analysis of DEGs reveals major upregulated genes in HCC affect protein folding mechanism and in ICC alter pathways involved in cell adhesion. The findings suggest potential targets for the development of novel therapeutic strategies for the treatment of these two aggressive subtypes of LC.


Introduction
Liver cancer (LC) is the sixth most commonly diagnosed cancer, with the third-highest number of cancer deaths attributed to it (Rumgay et al. 2022). Hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (ICC) are the two major subtypes of LC (Massarweh and El-Serag 2017). HCC is ranked the fifth leading cause of death occurring due to various cancers (Seeff and Hoofnagle 2006). Moreover, 90% of all the cases of primary LC belong to the HCC (Kudo et al. 2020). HCC is caused by the accumulation of genetic mutation patterns and exhibits vast tumor heterogeneity. HCC develops drug resistance and lots of clinical trials have failed in the past owing to the heterogeneity in cell types (Llovet et al. 2015). ICC is the second most common liver cancer after HCC. The prognosis of ICC has been very poor. Moreover, the underlying biological abnormalities in ICC are unknown so far. In the past, several studies have been done in order to understand the physiology, development, and pathology of human LC. However, designing a therapeutic treatment for LC has been challenging due to diverse microenvironment, genomic variation, immense cellular heterogeneity, and variability in disease progression (Liang et al. 2019). Thus, it is crucial to understand the tumor micro-environment at the cellular level and major 1 3 cellular cross-talks for designing efficient cell-type targeted therapies.
Single-cell transcriptomics technologies have vast potential in advancing our understanding of liver cancer at the single-cell resolution. In the past few years, successful efforts have been made in collecting single-cell RNA sequencing (scRNA-seq) data for the public research. For instance, the Human Cell Atlas project was initiated with an objective to classify all cell-types present in different organs of the human body based on the gene expression profiles (Regev et al. 2017). The advent of scRNA-seq has enhanced our ability to build detailed cellular maps of tissues. This has also enabled us to identify rare cell types by analyzing the difference in gene expression patterns between individual cells. These developments help us to gain better insight into the tumor microenvironment (TME). Several integrative studies have been performed in the past to investigate the molecular mechanism and biomarkers associated with HCC and ICC (Chaisaingmongkol et al. 2017;Xue et al. 2015). However, these studies relied on microarray and bulk RNA sequencing datasets which have limitation in addressing the tumor heterogeneity and cellular cross-talk present in the TME. In a recent study on HCC, using scRNA-seq data has successfully identified dynamic properties of diverse CD45+ cell types in the TME (Chen et al. 2020). In another study, the inter-tumor diversity of cells in ICC has been studied using scRNA-seq techniques to get insights into the intercellular cross-talk (Zhang et al. 2020a). Despite several individual studies in the past, a systematic characterization of tumor heterogeneity in major LC subtypes (i.e., HCC and ICC) remained unexplored yet.
LC subtypes have distinct molecular characteristics, and a complex network of immune cells, stromal cells, and extracellular matrix components underlies the TME (Tian and Li 2022). To develop effective treatment strategies for LC, it is crucial to characterize its subtypes at the cellular and molecular level. The scRNA-seq data analysis can reveal the cellular diversity and heterogeneity by analyzing the individual cells (Chen et al. 2023). In this work, we performed an integrative analysis of HCC and ICC subtypes to get an insight of the cellular heterogeneity and major cross-talk between cell types involved in the TME. We used scRNA-seq data from HCC and ICC patients along with their adjacent healthy tissues collected from public repositories. Using previously known cell type-specific gene markers, we identify rare cell populations involved in HCC and ICC. Further, we perform cell-cell communication analysis for HCC and ICC to gain insight into how different cell types interact with each other subtype-wise. Moreover, we carry out differential gene expression analysis in order to find up-and downregulated genes separately for HCC and ICC. The gene enrichment analysis is performed to understand the biological functions and molecular pathways affected by the differentially expressed genes (DEGs). Further, we perform survival analysis to understand the association between the expression level of DEGs and patient survival outcome for each LC subtype.

scRNA-seq datasets
We used two different scRNA-seq datasets collected from the Gene Expression Omnibus (GEO) database. Our first scRNA-seq dataset (GSE140228) comprises 17,936 cells. Of these, 6837 cells were picked from HCC tumor core sites and 11,099 cells were retrieved from adjacent paired nontumor sites. HCC tumor non-core cells were avoided to get only quality tumor cells for the downstream analysis. This scRNA-seq dataset was obtained from 15 HCC patients, and out of these 12 patients have hepatitis B virus. The age of patients in this dataset was in the range of 26-84 years, with median age of 55 years. This dataset was generated using droplet-based scRNA technique (Zhang et al. 2019). Our second scRNA-seq dataset (GSE138709) comprises a total of 33,991 cells. Of these, 17,090 cells were taken from intrahepatic cholangiocarcinoma site (i.e., ICC) and the remaining 16,901 cells were taken from adjacent paired non-tumor sites (i.e., healthy). This dataset was collected from 5 patients who were diagnosed with ICC (Zhang et al. 2020b). Table 1 provides an overview of the two scRNA-seq datasets used in this study.

Data preprocessing
The preprocessing of all scRNA-seq datasets was performed with the Seurat package available in the R software (version 4.0.4) (Satija et al. 2015). At first, the dead cells and empty barcodes were removed. The low-quality cells having less than 500 genes (total reads) were removed and genes that were expressed in less than 20 cells were removed. The threshold was set higher to pick quality cells and genes in the follow-up analysis. To further filter low-quality cells, we removed the high (i.e., 50%) mitochondrial transcript containing cells. The threshold was chosen a bit high as hepatocyte cells have a higher proportion of mitochondrial genes. The doublet (i.e., two cells captured in a single droplet) filtering of cells was not considered since the potential doublet may be a single large cell with two nuclei (MacParland et al. 2018). Table 1 shows the summary of cells and genes of all datasets before and after quality control steps. Further, log-normalization was performed for each dataset to remove the impact of potential outlier and to reduce the batch effect before integration of all datasets.

scRNA-seq data integration
After preprocessing of all three datasets (HCC, ICC, and healthy), 2000 high variable genes (HVGs) were extracted using variance stabilizing transform method from each dataset. For data integration, FindIntegrationAnchors function in the Seurat package was used to find the intersample anchor between healthy and cancer datasets (Stuart et al. 2019). Next, the data integration step was performed using the IntegrateData function available in the Seurat package. This function outputs the concatenated centered expression matrix. After data integration, various features like the number of distinct genes per cell, the total count of genes per cell, the percentage of mitochondrial genes per cell, and the percentage of ribosomal protein genes per cell were analyzed subtype-wise from the integrated dataset (IntD).

Clustering and annotation of cell types
As the IntD has the large number of features (anchor genes), 2000 HVGs having high cell-to-cell variation in expression level were extracted. Next, linear dimension reduction was performed on the selected top 2000 HVGs by using principal component analysis (PCA) (Roweis 1997). Top 50 principal components (PCs) were used to generate k-nearest neighbor (KNN) and shared nearest neighbor (SNN) graphs. The KNN and SNN clustering was performed using the Louvain algorithm with a resolution parameter of 0.8 (Xu and Su 2015). After dimension reduction of the IntD, non-linear reductions techniques like uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE) were performed using the top 50 PCs. Top two components of UMAP and t-SNE were used to visualize the gene expression variations across cells in the IntD (McInnes et al. 2020). After clustering of IntD, top marker genes were identified for each cluster using the Wilcoxon's rank sum test. Based on the marker genes, cell type annotation of clusters was performed by SingleR package, where HumanPrimaryCel-lAtlasData was used as a reference to annotate the cell clusters (Aran et al. 2019).

Cell cycle and cell-cell communication analysis
To understand the cell-cycle stage of each cell in the IntD, Seurat's CellCycleScoring function was used. Cell-cycle phase-specific scores were calculated based on the expression level of previously known 97 marker genes that regulate the S (43 marker genes) and G2/M (54 marker genes) phases of cell-cycle (Tirosh et al. 2016). A high cell-cycle score indicates proliferating cells, whereas a low score indicates non-dividing cells.
To analyze the complex inter-cellular communications between various cell types in the LC subtypes, CellChat version 1.4.0 was used. CellChat predicts intra-and inter-cellular interactions based on the known ligand-receptor interaction. In this workflow, over-expressed ligand-receptor interactions were fetched from each cell type clusters from the IntD. Then, CellChat's computeCommunProbPathway function was used to infer the probability of intra-and inter-cellular interactions among the cell types in each LC subtype (Jin et al. 2021).

Differential gene expression analysis
Top DEGs for each subtype (HCC and ICC) were identified by using Wilcoxon's rank sum test. Genes with absolute average log-fold change > ± 0.5 and p-value < 0.05 were considered DEGs. The DEGs common to both subtypes were identified. Next, we identified DEGs which were specific to only HCC and ICC subtypes. To get insight into the biological functions of DEGs for each subtype, gene enrichment analysis has been performed using clusterprofiler package (Yu et al. 2012).

Survival analysis of DEGs
In order to determine the relationship between the expression levels of DEGs and patient survival outcomes, survival analysis was performed for each cancer subtype using the GEPIA webserver (Tang et al. 2017). For the HCC DEGs, the TCGA-LIHC patient dataset was utilized, which contains survival data for 377 HCC patients. Similarly, for the ICC DEGs, the TCGA-CHOL patient dataset was used, which includes survival data for 51 cholangiocarcinoma (ICC) patients .

Comparative analysis of healthy and liver cancer subtypes
In order to perform the comparative analysis between healthy and LC subtypes, 51,927 single cells derived from HCC, ICC, and healthy sites have been integrated using Seurat version.4.0.4. After preprocessing and integration, 11,727 genes and 51,010 cells remained for further analysis in the IntD. Total transcript count, unique transcript count, mitochondrial gene percentage, and ribosomal encoding gene percentage across samples are shown in Fig. 1B. Average total transcripts and unique transcripts counts were higher in ICC samples due to the higher number of unique molecular identifiers sequenced during sequencing. The percentage of mitochondrial genes in ICC samples was a bit higher than others, as hepatocyte cells have a higher proportion of mitochondria and after cell type annotation it was observed that more than 25% of ICC cells are hepatocytes (Fig. 2B). Genes encoding cellular respiration, electron transport chain, and oxidative phosphorylation pathways were having the highest variance across cells in the IntD which is shown in Supp. Fig. 1.

Cell population heterogeneity across liver cancer subtypes
The PCA was applied to the highly variable genes (n=2000) in the integrated dataset. Supp. Fig 2A shows the plots of the first two PCs. The genes associated with the first 6 PCs were visualized in Supp. Fig 2C, where each PC is successfully  Fig 3 and 4 where 3, 4, 9, and 11 clusters have an abundance of ICC cells and clusters 12 and 16 have an abundance of HCC cells. Based on the known gene markers, these 27 clusters were annotated into 10 cell groups. T cell, monocyte, hepatocyte, and epithelial cells were the top four cell types (>80%) found in the IntD. Distribution of cell groups across subtype wise (HCC, healthy, and ICC) is shown in Fig. 2B. Among all cell types, high abundance of epithelial cells and hepatocytes was observed in ICC samples compared to HCC and healthy samples. NK cells trended towards increase in HCC samples and T cell proportion decrease in ICC samples. The role of epithelial cell markers (S100A2, KRT7, and KRT19) in tumor progression like cell migration, drug resistance, and cell adhesion was well studied for gastric cancer and ovarian cancer (Fang et al. 2017), but the role of these markers in ICC is poorly studied. We found a significant association of epithelial cell markers in ICC samples. Supp. Fig. 5 shows the expression level of epithelial cell markers (S100A2, KRT7, and KRT19) in the IntD. We noticed a higher expression of these epithelial cell markers in ICC subtype.

Cell cycle and cell-cell communication analysis
On the basis of previously known marker genes that regulate cell-cycle, each cell was assigned a cell score that predicts whether the cell was in G1, in S, or in G2/M phase. The distribution of cells based on phase-specific score for each subtype is shown in Fig. 3A. We found a higher percentage of cells in healthy sample (>60%) were in G1 stage, which indicates cells in HCC and ICC samples were actively dividing. Figure 3B shows the proportion of phasespecific score for 10 cell groups. We noticed that a higher proportion of epithelial cells, hepatocytes, T cell, and NK cells were in G2M and S phase suggesting their enhanced proliferation rate. As endothelial cells and smooth muscle cells have less proportion of G2M/S cells, this indicates these cells are in dormant stage. Based on ligand-receptor interaction, the CCA was performed on the 10 cell types in the IntD. Circle network diagrams of significant cell-cell interactions are shown in Fig. 3C and D. In HCC, the cross-talk between T cells to NK cells and endothelial autocrine signals was found to be enriched. Conversely, the interaction between smooth muscle cells to epithelial cells and epithelial cells to endothelial cells was found to be enriched in ICC. The interaction between immune cells and tumor cells within the TME is complex and dynamic. As NK cells trended to increase in our HCC samples (Fig. 2B), NK cells via their cytotoxicity and cytokine secretion can suppress the tumor. In IntD, most of the HCC patient samples were in tumor stage I/II; there was a high chance that the cross-talk between T cells and NK cells enhance the anti-tumor immunity in the TME (Shimasaki et al. 2020). Recent studies have shown that T cells can interact with NK cells to amplify the anti-tumor immune responses (Sun et al. 2022). T cells can produce cytokines such as interferon-gamma (IFN-γ) and tumor necrosis factor-alpha (TNF-α), which activate NK cells and enhance their cytotoxic activity against tumor cells (Sun et al. 2019). In ICC, during tumor progression, epithelial cells can undergo epithelial-mesenchymal transition (EMT), which involves the acquisition of mesenchymallike characteristics and is crucial for tumor invasion and metastasis (Thiery 2002). Moreover, a recent study has shown smooth muscle cells can induce EMT by secreting transforming growth factor-beta (TGF-β) on epithelial cells (Xue et al. 2020).

Differential expression and gene enrichment analysis
Differential gene expression analysis was performed on both HCC and ICC subtypes. The four genes KLRB, CD8A/B, HSPA6, and UCP2 were found to express differentially in both subtypes. For HCC subtype, SOX4 and TNFRSF4 genes were upregulated while DTHD1 and MYBL1 genes were downregulated. For ICC subtype, KRT (keratin) and SPP1 gene were upregulated while CCL4 and RGS1 genes were downregulated (Fig. 4A). DEGs with average fold change > ± 0.5 and p-value < 0.05 in both subtypes are summarized in Supp. Table. 1.
The enrichment analysis was performed on the up-and downregulated DEGs for both subtypes as shown in Fig. 4B.
In HCC, the majority of upregulated genes are involved in altering the ficolin family proteins. According to a recent study, FCN1 (ficolin-1) is associated with monocytes and neutrophils and plays an important role in innate immunity and disruption in expression can contribute significantly to tumor progression (Sun et al. 2022). Conversely, in ICC, majority of upregulated genes alter biological processes like focal adhesion and cell-substrate junction which causes loss of cell adhesion and increase cellular mobility which is a favorable event for EMT transition (Gao et al. 2022;Ishiwata 2016;Janiszewska et al. 2020;Knights et al. 2012). Downregulated genes in both HCC and ICC are involved in clathrin-coated vesicle membrane-related biological processes. The alternation of these biological processes will disrupt the trans-Golgi network which results in aberrant protein secretion and cellular dysfunction (Di Martino et al. 2019).

Survival analysis of DEGs
Survival analysis was performed on DEGs identified for each subtype of LC. Kaplan-Meier survival plots (Fig. 5) revealed that a significant number of DEGs were associated with patient survival outcomes in both HCC and ICC. High expression levels of TNFRSF4, SOX4, PKM, and RTKN2 genes and low expression levels of IL7R and DTHD1 were associated with poor overall survival in HCC patients. In ICC patients, low expression levels of HCST, FCGR3A, and GPR65 genes were associated with poor survival. Furthermore, the downregulation of the KLRB1 gene was found to be associated with poor survival in both subtypes.

Discussion
Effective development of therapies against liver cancer warrants deeper understanding of the micro-environment and the gene expression pattern of the cancerous liver (Hoyer et al. 2021). The scRNA-seq has become a powerful highthroughput technique in the last decades with high-resolution for sequencing of the transcriptome. These advancements have facilitated the study of cell-to-cell variation to better understand the complex cellular microenvironment of cancerous tissues (Li and Wang 2016). Despite several individual studies, a comprehensive analysis of tumor heterogeneity at a single-cell resolution in two major LC subtypes has yet to be explored. In the present study, we performed a comparative analysis using single-cell transcriptomic data from healthy and cancerous liver tissue subtypes. In total, 51,927 cells were derived from HCC, ICC, and adjacent healthy liver sites. The integrated dataset was reduced using PCA and cells with similar characteristics were clustered using SNN clustering. The clusters were annotated into 10 cell types based on their cell markers. Moreover, we studied the major cellular cross-talk for each LC subtype and their potential role in TME.
Our results show that T cell, monocyte, hepatocyte, and epithelial cells were the major cells in LC. High abundance of epithelial cells was detected in the ICC subtype. It was found that epithelial cell markers (S100A2, KRT7, and KRT19) have a high expression in ICC samples. KRT19 is a member of the keratin family, and its upregulation in liver cancer is unknown yet. Thus, understanding the mechanism of action of the KRT19 gene can be helpful in patient-specific treatment of LC (Govaere et al. 2014). Previous studies have shown the upregulation of the KRT7 gene in ovarian cancer. However, its expression is highly variable in different Kaplan-Meier survival analysis for the differentially expressed genes (DEGs). Five genes, namely, SOX4, TNFRSF4, PKM, IL7R, and DTHD1 exhibit a significant association with HCC. Two genes (HCST and GPR65) demonstrate a significant association with ICC. The KLRB1 gene shows a significant association with both HCC and ICC subtypes types of cancer ). In the present study, KRT7 was found to be upregulated in LC-specific to ICC. S100 class genes are associated with intra-cellular as well as extracellular functions and involved in cell signaling by binding with the receptor for advanced glycation end products (Arumugam and Logsdon 2011). The role of S100 class in other cancers such as non-small-lung cancer and gastric cancer has been studied well albeit their role in ICC has not been explored yet (Hountis et al. 2014).
Cell-cell communication analysis was performed on both subtypes to identify the major interaction among cell types. The interaction between T cells and NK cells in HCC and the interaction between smooth muscle cells and epithelial cells and epithelial autocrine signals in ICC were found to be enriched. It is known previously that an interaction between T cells and NK cells could potentially activate NK cells that amplify the anti-tumor immune responses (Shimasaki et al. 2020). This might be the reason for high abundance of NK cells detected in HCC samples. Similarly in ICC, some preliminary studies have demonstrated that smooth muscle cells can induce epithelial-mesenchymal transition (EMT) in epithelial cells during the tumor progression which requires further investigation and research (Thiery 2002). Further, our differential expression analysis identified the genes which are upregulated and downregulated in the cancerous liver cells. Interestingly, 4 genes (KLRB, CD8A/B, HSPA6, and UCP2) were found to be expressed differentially in both subtypes. For HCC subtype, SOX4 and TNFRSF4 genes were upregulated while DTHD1 and MYBL1 genes were downregulated. For ICC subtype, KRT (keratin) and SPP1 were upregulated while CCL4 and RGS1 genes were downregulated. KLRB gene that encodes for transmembrane glycoprotein is expressed on the surface of NK cells and T cells. Some studies show KLRB gene expression is suppressed in patients with non-small-cell lung cancer and esophageal squamous-cell carcinoma but its role in HCC needs further study (Pleshkan et al. 2007). Similarly in HCC, the role of SOX4 and MYBL1 was studied in several clinical studies but their precise molecular mechanism is still unknown (Huang et al. 2021). Our study shows that TNFRSF4 and PKM genes were associated with overall poor survival in HCC patients. These genes have a major role in cell apoptosis and glycolysis, but their molecular role in HCC needs to be studied further (Safran et al. 2010). A recent study showed that SPP1 gene suppresses the proliferation rate of T cells, which may enhance tumor growth (Cheng et al. 2022).
In this study, we followed a scRNA-seq analysis to provide a better insight towards TME in LC subtypes. However, there are a few limitations of this approach which should be considered for any follow-up research work. One limitation is that the gene expression levels measured using scRNAseq technologies do not necessarily reflect the actual protein expression levels. Another limitation of this study is that it does not establish a causal relationship between the genes and LC subtypes. Moreover, the potential markers identified in our study warrant further validation using the immunohistochemistry. In order to confirm the relevance of our findings, further clinical studies need to be conducted including diverse populations and environments.
In conclusion, our work identified novel differentially expressed genes and cell groups that can act as potential targets for liver cancer subtypes. These findings will help in getting more insights into LC etiology. Further functional studies can be conducted to understand how these genes interact with each other in the tumor microenvironment. Our work suggests that developing therapies against liver cancer should consider the potential role of these identified markers while their mode of action in the liver cancer microenvironment is required to be established through experimental validation.