Raw Data Acquisition
Using 10x scRNA-seq, we examined two COAD samples (T1 and T2) and two corresponding normal tissue samples (N1 and N2) to analyze colon adenocarcinoma. These data, sourced from the GSE161277 series, comprised 3190, 4438, 2882, and 4097 cells for each respective sample[5]. Additionally, we acquired bulk RNA sequencing data, mutation profiles, and clinicopathological details for COAD from the TCGA (The Cancer Genome Atlas) database.
scRNA-Seq Data Processing and Analysis
To begin with, the 10x scRNA-seq data was converted into a Seurat object using the 'Seurat' package in R software during the processing of the data. This was followed by a quality control step, where raw counts were assessed. To eliminate cells of poor quality, the calculation of the ratio between mitochondrial or ribosomal genes was performed. Lastly, to identify the top 2000 genes with the most variability, we employed the "FindVariableFeatures" function after quality control. In the processing of 10× scRNA-seq data, the fourth step involved conducting principal component analysis (PCA) on the selected 2000 genes[6]. This was followed by using t-SNE for reducing dimensions and identifying clusters. To identify important marker genes in various clusters, we used the 'Find All Markers' feature as the fifth step. This involved setting a log2 [Foldchange (FC)] threshold of 0.35 and a minimum percentage (min. pct) of 0.15. Finally, the sixth step involved the use of R software's "SingleR" package for the annotation of clusters to distinguish various cell types. Then, let us conduct a differential analysis after annotation using the same parameters. We conducted functional enrichment analysis on the key cell types identified using the "ReactomeGSA" package in R software. As part of the enrichment process, the "analyze_sc_clusters" function was used, followed by the "pathways" function to extract the results. The 'monocle' package in R was utilized for cell trajectory and pseudo-time analysis, employing the 'DDRTree' technique to reduce dimensions. We then utilized the "BEAM" statistical method to assess gene contributions during cell development, focusing on the top 200 genes for visualization purposes. To analyze cell-cell communication and visualize networks, we utilized the R software packages 'CellChat' and 'patchwork'.
Differentially Expressed Genes Identification and Functional Enrichment
Analysis
We performed differential expression analysis using the R software package 'limma' to identify genes that were expressed differently in the TCGA cohort. The criteria set for filtering included an absolute log2 fold change (log2FC) greater than 1.2 and a false discovery rate (FDR) below 0.05. A volcano plot was then created to display the DEGs' distribution. Afterward, we employed the R software's 'clusterProfiler' package to perform analyses on the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO). The purpose of these analyses was to investigate the pathways and biological processes that are most significantly enriched and linked to the DEGs.
Analysis of gene correlation networks using weights
The "WGCNA" package in R was used to identify key hub genes among differentially expressed genes (DEGs) using the Weighted Gene Correlation Network Analysis (WGCNA) method. WGCNA involves a two-pronged approach: clustering gene expressions and correlating them with phenotypic traits. This analysis is primarily comprised of four stages: calculating gene correlation coefficients, identifying gene modules, constructing a co-expression network, and correlating these modules with specific traits. During the network construction, we chose a soft thresholding power of 12. The modules were then visualized in a dendrogram following clustering. To identify the most important DEGs in the development of colorectal adenocarcinoma (COAD), a heatmap of module traits was created using correlation coefficients and p-values. Finally, we focused on the overlapping genes between the marker genes and the DEGs identified in the WGCNA for in-depth analysis.
Construction of the prognostic risk model
The risk score for each patient was determined using the following equation:
In this context, 'i' denotes the gene expression level, and beta represents the coefficient corresponding to the receptor-ligand pair derived from multivariate Cox regression analysis[7]. Patients were divided into high-risk and low-risk categories based on a predetermined threshold, specifically the middle value. Survival curves were generated using the Kaplan-Meier method for prognostic analysis. Furthermore, the log-rank test was utilized to assess the statistical significance of the noted disparities.
Comparison of clinical significance and mutation patterns in high-risk and low-risk populations
We then investigated the correlation between the risk score and the clinicopathological characteristics of patients within the TCGA group. Cox regression analysis was conducted using the "survcomp" package in R software to assess if the risk score serves as an independent prognostic indicator for colorectal adenocarcinoma (COAD) patients. At the same time, the forest plot package in R was utilized to generate forest plots that depict the findings from both univariate and multivariate Cox regression analyses. Furthermore, to examine the distinct genetic mutation patterns in the groups at high and low risk, we utilized the 'oncoplot' feature in the 'maftools' package of the R software to create waterfall plots.
Comparison of immune cell infiltration and immune function status between high and Low-Risk Groups
Afterward, the cibersort technique was employed to assess the levels of immune cell infiltration and the activity in pathways related to the immune system. The Wilcoxon rank-sum test was used to evaluate the statistical disparities between the high-risk and low-risk groups. Furthermore, the R 'ESTIMATE' package was employed to evaluate the level of immune infiltration in colorectal cancer patients categorized into high and low-risk cohorts. For visualization, the "ggplot2" package in R was employed.