Datasets
For the training set, the RNA-seq transcriptome information and matching clinicopathological data of 368 LIHC patients were acquired from TCGA (LIHC) (https://portal.gdc.cancer.gov/). For the validation set, 228 patients were retrieved from ICGC (LIRI-JP) (https://dcc.icgc.org/).
Somatic Mutation Analysis
Somatic mutation data of the LIHC samples were obtained from TCGA GDC Data Portal in “maf” format. Waterfall plots were then performed using the Maftools package (version: 2.8.0) in R software(15), which facilitated the visualization and summarization of the mutated genes.
Single-sample Gene Set Enrichment Analysis (Ssgsea)
The ssGSEA algorithm was based on ICD genes. We employed the ssGSEA algorithm via GSVA packages (version: 1.40.1) to comprehensively assess the ICD characteristics of every sample included in the study.
Wgcna Analysis
The analysis was performed based on the package instructions(16). The connection strength between each pair of nodes was calculated using the adjacency matrix aij(17). The formula for the adjacency matrix was as follows:
$${Z}_{ij}=\left[cor\left({x}_{i},{x}_{j}\right)\right]{a}_{ij}={S}_{ij\beta }$$
In this formula, i and j represent two different genes, and xi and xj are their respective expression values. Zij represents the Pearson’s correlation coefficient, and aij represents the strength of the correlation between two genes(18). The soft-thresholding power of β = 14 was used to ensure scale-free topology. The hierarchical clustering of the weighting coefficient matrix was used to define the modules(17). The functional modules in the co-expression network with defined genes, the topological measure (TOM) indicating the concurrence in shared adjacent genes, was calculated as
$${TOM}_{i,j}=\frac{{\sum }_{K=1}^{N}{A}_{i,j}. {A}_{{\kappa },\text{j}}+{A}_{i,j}}{{min}\left({K}_{i},{K}_{j}\right)+1-{A}_{i,j}}$$
where A is the weighted adjacency matrix described in the above formula. TOM-based dissimilarity measures with a minimum size of 30 for the gene dendrogram and average linkage hierarchical clustering were performed, and similar expression profiles were divided into the same gene modules using the dynamic tree cut package(17).
Identification Of Icd Significant Modules
The co-expression module is a collection of genes with high topological overlap similarity. Genes in the same module often have a higher degree of co-expression(18). In this study, we used two methods to identify the important modules relevant to clinical traits. The first principal component of each gene module and the expression of the module eigengene (ME) were defined as representative of the whole gene set and were described in the first eigengene module(17). Module membership (MM) refers to the correlation coefficient between genes and module eigengenes, and is used to describe the reliability of a gene belonging to a module. The correlation between module eigengenes (MEs) and clinical traits (stage I to IV) was calculated to identify clinically significant modules(19). Finally, we calculated the correlation between the modules and the ICD ssGSEA score to identify ICD significant modules(18).
Functional Enrichment Analysis
Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses were carried out among the ICD-associated genes. The clusterProfiler package (version: 4.0.2) in R software(20) was employed to evaluate GO and KEGG pathways. GO and KEGG enrichment analyses were premised on the q-value and p-value thresholds of < 0.05.(21)
Construction Of The Icd-associated Risk Signature
The ICD-associated genes that were found to be statistically significant in the univariate Cox regression analysis were subsequently exposed to a LASSO cox regression analysis to compute the exact coefficient values of each identified association.
Survival Analysis
Kaplan-Meier (KM) analysis was conducted for comparison of the overall survival (OS) between the low and high ICD risk cohort utilizing the survminer (version: 0.4.9) and survival (version: 3.2–11) packages in R. The prospective prognostic indicators were identified utilizing the Univariate Cox analysis while the determination of whether the risk score is an independent risk factor for OS in LIHC was done utilizing the multivariate Cox analysis.
Rna Extraction And Qrt-pcr
Tumor tissues and adjacent tissues were taken from patients with hepatocellular carcinoma, and the total RNA was extracted from these tissues according to the instructions of the universal RNA Purification kit (EZBioscience, EZB-RN4). 400 ng of total RNA was reverse transcribed into cDNA using the PrimeScript RT Master Mix Kit (TaKaRa, RR036A). Then, 100 ng of cDNA was taken for quantitative PCR detection, qRT-PCR was performed on PerfectStart Green qPCR Supermix (TransGen, AQ601-04), the gene expression data was normalized to TBP.
The qRT-PCR primer sequence information was as follows: FGF9-F-CAGGCGGAGGCAGCTATAC, FGF9-R-CCTGGTTCCCTGGATAGTACC; PNLIPRP2-F-ACTTCCAACTAATCACTGGCAC, PNLIPRP2-R-CATGGATGATGAAGCGTGTCT; FKBP6-F-TTCTGTTCAAACCGAACTACGC, FKBP6-R-GAGAGCACAAAACTTGTCTGACT; RAMP3-F-GGGAAGGCTTTCGCAGACAT, RAMP3-R-TAGCAGCCCACGACATTGG; TBP-F-CCACTCACAGACTCTCACAAC, TBP-R-CTGCGGTACAATCCCAGAACT.
Independent Prognostic Evaluation Of The Risk Score And Construction Of The Nomogram
Patients information was extracted from patients in the TCGA-LIHC dataset. These variables and risk scores were jointly analyzed in both univariate and multivariate Cox regression analyses. The rms package (version: 6.2-0) was used to create a nomogram based on independent variables for visualization and potential clinical use in predicting the prognosis of training cohort. The DCA curve and calibration curve were used to evaluate the performance of the nomogram.
Immune Infiltration Analysis
XCELL, CIBERSORT, and EPIC immune infiltrating analysis was performed to calculate the abundance of immune infiltrating cells(22). We calculated the stromal score and immune score through the Estimation of Stromal and Immune cells in Malignant Tumor tissues using Expression data (ESTIMATE) algorithm(23).
Statistical analysis
All statistical analyses were performed using R software (version 4.1.0) and R studio (version 1.4.1717). The Wilcoxon test was used to compare the gene expression levels between HCC patients and healthy individuals, as well as the composition of immune cells and ICD risk scores between low- and high- risk groups. LASSO-Cox regression was used for candidate gene selection(24). The Kaplan-Meier method and log-rank test were used to compare the survival rates between the low- and high-risk groups(25). Univariate and multivariate cox regression analyses were used to assess the independent prognostic variables. A two-tailed P value < 0.05 was considered statistically significant except for a certain P value was set(26).