Data Acquisition
The single-cell sequencing data of the two HCC patients(HCC1,HCC2) included in this paper were downloaded from the GEO database (GSE146115). The bulk RNA sequencing information and clinical information of 365 HCC samples in the test set were downloaded from TCGA database (https://portal.gdc.cancer.gov/), and the bulk RNA sequencing information and clinical information of 240 HCC patients in the validation set were downloaded from ICGC database (https://dcc.icgc.org/). Immunotherapy datas of 195 bladder cancer patients (IMvigor210) were downloaded from the IMvigor210CoreBiologies R package
Process flow of scRNA-seq Data
We first used the Seurat package of R software to create Seurat objects for the two HCC samples. Next, we filtered out any cells where the average expression level of genes was below 500, above 8000, or above 5% for mitochondrial genes. Expression data normalization was performed with the NormalizeData function, and data normalization was performed with the ScaleData function. We used the FindVariableFeatures function to select the top 2000 highly variable genes for principal component analysis (PCA). The top 20 PCAs were utilized to downscale the cells into clusters using a t-distributed random neighborhood embedding (t-SNE) method. The FindAllmarker function was used to find the highly expressed genes in each cell cluster. Pseudo-temporal analysis of the cells was carried out by me using the Monocle package of R software in order to search for genes that change during cell development.
Weighted Gene Correlation Network Analysis Identifies the most relevant module for Tumor Grade
We used the WGCNA package of R software to analyze the cell development-related genes and identify the gene modules most strongly associated with Tumor Grade(Grade). WGCNA has two parts: expression clustering analysis and phenotypic association, which has four steps: gene correlation coefficients, gene module identification, co-expression network, and module-trait linkage. Soft thresholding power is the lowest power with which scale-free topology fit index reaches 0.90. After clustering, the modules were shown via dendrogram. Examining their correlation coefficients and p values, we then determined which modules were associated with Tumor Grade. Finally, we selected the genes corresponding to the pertinent modules for the subsequent round of analysis.
Differentially expressed genes associated with cell developmental trajectories and Functional enrichment analysis
To obtain differentially expressed cell developmental trajectory-related genes, we extracted the mRNA expression matrix of cell developmental trajectory-related genes and analyzed it between tumor and normal tissues in 424 patients (365 tumor tissues, 64 normal tissues) using the Limma package of R software. We screened genes with adjust.pvalue < 0.05 and log2FoldChange > 1 or log2FoldChange < -1 for downstream analysis. To explore the molecular mechanisms of genes associated with differential cell developmental trajectories, we performed Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes(KEGG) functional enrichment analysis using the clusterProfiler package of R software.
Construction and validation of a prognostic signature based on genes associated with cell developmental trajectories
We first performed Cox univariate analysis of the differentially expressed cell developmental trajectory-associated genes obtained to screen for genes that affect the prognosis of HCC patients. Subsequently, we used the glmnet package of R software to screen for genes that could be used as independent factors to influence patient prognosis. We calculated the risk score for each HCC patient in the test set using the formula: Risk score=∑(Expi*Coefi), where Expi represents the gene expression value and Coefi represents the risk coefficient. Based on the median risk score, we divided the test set of 365 HCC patients into high-risk and low-risk groups and compared the overall survival differences between the two groups using Kaplan-Meier survival curve analysis. ROC curves were used to determine the performance of risk scores for predicting 1-, 3-, and 5-year survival in HCC patients. The validation group of 240 HCC patients all had their risk scores determined using the same methodology. Patients were classified as either high-risk or low-risk based on their median risk score. The outcomes of the two groups were compared using Kaplan-Meier survival curve analysis. The validity cohort's ROC curves were used to evaluate the risk scores' ability to predict outcomes.
Identification of prognostic signature as an independent prognostic factor
To further explore the underlying mechanisms behind the risk scores, we further compared the relationships between risk scores and clinical characteristics such as age, gender, T-stage, N-stage, grade, and survival status. We first compared, the differences in the distribution of risk scores between the different clinical subgroups. To identify prognostic factors, we first ran a Cox univariate analysis of clinical parameters and risk scores. On the basis of the acquired data, we then conducted a Cox multivariate analysis to identify additional independent risk factors influencing the prognosis of HCC patients. To further enhance the clinical feasibility of the prognostic signature, we used the rms package of R software to incorporate risk scores and clinical characteristics into the analysis and construct a nomogram to more accurately predict the prognosis of HCC patients.
Gene Set Enrichment Analysis(GSEA)
Gene Set Enrichment Analysis determines if a priori chosen gene sets reveal statistically significant differences between two biological states. GSEA is not limited to differential genes and can include minor yet coordinated alterations in biological processes. We used GSEA software to compare the differences between the high-risk and low-risk groups in the enrichment pathway. The p<0.05 was used to determine whether the pathway was significantly enriched..
Mutation analysis and Microsatellite Instability(MSI)
We downloaded mutation information from TCGA for 365 HCC patients and calculated the somatic mutation frequency and tumor mutation burden for each sample using the maftools package of R software. We then compared the relationship between prognostic signature and tumor mutation burden. Microsatellite Instability (MSI) is a phenomenon of microsatellite (MS) sequence length change due to insertion or deletion mutations during DNA replication, often caused by mismatch repair (MMR) defects.
To fully explore, the relationship between prognostic signature and immunotherapy response, we MSI was introduced to the analysis. The MSI data of patients in TCGA-LIHC were obtained from the study of Bonneville[12] et al.
Immune Cells Infiltration and Gene Sets Variation Analysis (GSVA)
CIBERSORT is based on the principle of linear support vector regression, which deconvolutes the expression matrix of human immune cell subtypes, and the content of each immune cell subpopulation is calculated by deconvolution[13]. To fully elucidate the underlying mechanisms behind the prognostic signature, we used the CIBERSORT function of R software to deconvolute the bulk RNA sequencing data from 365 HCC patients to obtain the proportion of immune cells infiltrating the tumor microenvironment. By translating the expression matrix of genes across samples into the expression matrix of gene sets across samples, GSVA evaluates the enrichment of various pathways across samples[14]. Therefore, with the special features of GSVA, we evaluated the proportion of 28 immune cells in each sample using a gene set containing information on the expression of 28 immune cells.
Prediction of response to immunotherapy
The degree of expression of immune checkpoint-associated proteins is the cornerstone of HCC immunotherapy. The link between prognostic signature and mRNA expression levels of immune checkpoint-related genes routinely used in immunotherapy for hepatocellular cancer was investigated initially. The TIDE score was subsequently developed to analyze the association between prognostic signature and immunotherapeutic response. TIDE employs the induction of T-cell dysfunction in tumors with high infiltrating cytotoxic T lymphocytes and the prevention of T-cell infiltration in tumors with low infiltrating cytotoxic T lymphocytes to model the mechanism of tumor immune escape and determine the efficacy of immunotherapeutic response. To validate the relationship between prognostic signature and immunotherapy, we validated our results in the IMvigor210 immunotherapy data.
Statistical analysis
Wilcoxon t-test was used to compare categorical variables between subgroups. cox univariate analysis and cox multivariate regression analysis were used to screen for independent risk factors affecting prognosis. Kaplan-Meier analysis was used to compare survival differences between subgroups and to plot survival curves. p<0.05 was determined to be statistically significant. All analyses in this study were performed in R software (version 4.1.2).