2.1 Sketch of study design and GEO dataset merging
As can be seen in the research protocol (Figure 1), we obtained MS by phenotypic scoring of MRGs on the dataset TCGA-COADREAD, and then categorized the samples into groups with high/low scores in accordance with median phenotypic scores. Then differential expression analysis was implemented in two groups to obtain DEGs, which were intersected with MRGs to obtain MRDEGs, and MRDEGs were intersected with weighted gene co-expression module related genes to obtain key genes by LASSO model screening, and consistency clustering analysis, cox analysis, immuno-infiltration analysis, mutation analysis, clinical relevance analysis, etc. were carried out. Finally, the datasets GSE14333, GSE74602, GSE87211 were used for external validation of expression analysis, and real-time quantitative PCR were used for validation of bioinformatics results.
Then, we removed the batch effect for CRC datasets GSE14333, GSE74602 and GSE87211 to obtain the merged dataset GEO dataset, and compared the datasets before and after batch effect removal by distribution box line plots and PCA plots (Figure S1). The results of the distribution box line plots and PCA plots showed that the batch effect of the samples in GEO dataset is largely eliminated after the batch removal process.
2.2 Analysis of DEGs associated with CRC
To analyze the DEGs between groups with high/low MS scores of COADREAD patients in TCGA-COADREAD dataset, differential analysis was fulfilled on TCGA-COADREAD dataset utilizing limma package to obtain DEGs of the data. The results are as follows: with |logFC| > 0 and P < 0.05 as the thresholds, there were 11316 genes identified in TCGA-COADREAD dataset, including 4074 up-regulated genes with logFC > 0 and 7242 down-regulated genes with logFC < 0. According to differential analysis results of this dataset, a volcano plot was plotted (Figure 2A).
To identify Macrophage-related differentially expressed genes (MRDEGs), we initially conducted a univariate Cox regression analysis on a set of 637 MRGs (Macrophage-related genes). Among these genes, we selected those with a p-value < 0.05, resulting in a final set of 45 MRGs that exhibit prognostic significance. Detailed information about these MRGs can be found in Table S2. Subsequently, we compared these 45 MRGs with all the differentially expressed genes (DEGs) derived from the TCGA-COADREAD dataset, specifically focusing on genes with |logFC| > 0 and a P-value < 0.05. The overlapping genes from this analysis yielded a total of 37 MRDEGs. To visually represent the intersection results, we created a Venn diagram (Figure 2B).
The expression differences between various sample groups in TCGA-COADREAD dataset were analyzed, and R package pheatmap was employed to plot heat maps to show the analysis results. We selected the differential analysis results of 37 MRDEGs for heat map display, with these results displaying in table S1.
2.3 FEA (GO) and PEA (KEGG) of MRDEGs
For the purpose of analyzing BP, CC, MF, biological pathways, and their association with colon cancer of 37 MRDEGs, we first performed GO (Table 2) and KEGG (Table 3) enrichment analyses on MRDEGs. P.value < 0.05 served as screening criteria of enrichment entries, and FDR value (q.value) < 0.05 was deemed to statistically significant. We showed the results of GO FEA and KEGG PEA in bubble charts (Figure 3A-B), circular network diagrams (Figure 3C-D), and chord diagrams (Figure 3E-F).
2.4 GSEA of CRC dataset
We studied how gene expression levels relate to colon cancer by looking at the differences in gene expression, BP, CC, and MF between two groups of CRC patients (high/low scores) in TCGA-COADREAD dataset using GSEA. All genes in TCGA-COADREAD showed significant enrichment in pathways (Figure 4) like NFKB pathway, Macrophage pathway, JAK_STAT pathway, TGFBETA pathway, etc. (Table 4).
2.5 WGCNA to screen co-expression modules in the dataset TCGA-COADREAD
We performed WGCNA on the DEGs in colon cancer patients with high/low scores in TCGA-COADREAD dataset to screen for co-expression modules. In the WGCNA process, we first clustered colon cancer patients with high/low scores in TCGA-COADREAD dataset using a clustering tree and labeled grouping information (without setting cut height). We set a screening criterion of 50 to identify the best number of modules. The DEGs of CRC patients with high/low scores in TCGA-COADREAD dataset were aggregated into nine modules (MEturquoise, MEred, MEyellow, MEbrown, MEgreen, MEpink, MEdarkgrey, MEblack, MEblueMEgrey) (Figure 5A). The DEGs in colon cancer patients with high/low scores in TCGA-COADREAD dataset were clustered again and the relationship between genes and corresponding new modules was visualized. Finally, depending on expression patterns of module genes and grouping information of two groups in TCGA-COADREAD dataset, we obtained nine modules (MEturquoise, MEred, MEyellow, MEbrown, MEgreen, MEpink, MEdarkgrey, MEblack, MEblueMEgrey) and their correlation with CRC patients with high/low scores in TCGA-COADREAD dataset (Figure 5B). Then we merged modules with a cut height set to 0.2 and clipped and merged modules with a cut height below 0.2 (Figure 5C).
Firstly, we analyzed four modules (MEred, MEyellow, MEbrown, MEgreen) containing DEGs that show significant statistical differences. (P < 0.05, correlation absolute value ≥ 0.3) and correlations with CRC patients with high/low scores in TCGA-COADREAD dataset among nine modules (excluding useless gray module: MEgrey). Firstly, we took intersections between MRDEGs in colon cancer patients with high/low scores in TCGA-COADREAD dataset with DEGs contained in four modules respectively and drew Venn diagrams (Figures 5D-G) to obtain module MRDEGs. As shown in Figure 5, we obtained a total of 15 MRDEGs (SLC11A1, SPP1, CXCL9, MMP3, CXCL8, CIITA, C5AR1, WNT5A, PDGFRA, FABP4, TIMP1, CCL22, CTSD, ADAM8, MS4A1).
In this study, the expression levels of 15 MRDEGs (SLC11A1, SPP1, CXCL9, MMP3, CXCL8, CIITA, C5AR1, WNT5A, PDGFRA, FABP4, TIMP1, CCL22, CTSD, ADAM8, MS4A1) were analyzed in two groups of colon cancer patients with high/low scores in both TCGA-COADREAD (Figure 6A) and GEO datasets (Figure 6B). The results showed that the expression levels of all 15 MRDEGs were statistically significantly different (P<0.001) in TCGA-COADREAD dataset, whereas in GEO dataset, 12 MRDEGs (SLC11A1, SPP1, CXCL9, MMP3, CXCL8, CIITA, C5AR1, WNT5A, PDGFRA, TIMP1, CCL22, CTSD, ADAM8) exhibited significant differences (P<0.001) between the two groups.
We then annotated the positions of 15 MRDEGs on human chromosomes and visualized them using circle diagrams (Figure 6C). As shown in the figure: gene WNT5A is located on chromosome 3 and SLC11A1 is located on chromosome 2. We then performed friends analysis on 15 MRDEGs and visualized them using a plot (Figure 6D). Then we generated ROC curves for 15 MRDEGs (SLC11A1, SPP1, CXCL9, MMP3, CXCL8, CIITA, C5AR1, WNT5A, PDGFRA, FABP4, TIMP1, CCL22, CTSD, ADAM8, MS4A1) in both TCGA-COADREAD and GEO datasets, demonstrating the association between high/low scores of these genes and CRC patients. (Figure S2 and S3).
2.6 Correlation analysis between hub genes and MS
To explore the relationship between 15 MRDEGs (SLC11A1, SPP1, CXCL9, MMP3, CXCL8, CIITA, C5AR1, WNT5A, PDGFRA, FABP4, TIMP1, CCL22, CTSD, ADAM8, MS4A1) and the macrophage score, we created a scatter plot ((Figure S4).) to visualize their correlation. The results indicated that a subset of MRDEGs (C5AR1, CXCL8, CIITA, CXCL9, ADAM8, CCL22, SLC11A1, MMP3) exhibited a moderate level of correlation with the macrophage score (0.5 < r < 0.8). Conversely, the remaining MRDEGs (SPP1, CTSD, TIMP1, MS4A1, PDGFRA, WNT5A, FABP4) displayed a weak correlation with the macrophage score (0.3 < r < 0.5).
2.7 Construction of the diagnostic model for MRDEGs
To determine the diagnostic value of 15 MRDEGs in TCGA-COADREAD dataset, a MRDEGs diagnostic model was constructed utilizing LASSO regression analysis (Figure 7A). Then we visualized the expression of MRDEGs in different groups through a forest plot (Figure 7B). According to Figure 6B, there are a total of 13 MRDEGs (ADAM8, C5AR1, CCL22, CIITA, CTSD, CXCL8, CXCL9, FABP4, MMP3, MS4A1, SPP1, TIMP1, WNT5A) in the MRDEGs diagnostic model we constructed. LASSO regression is a type of linear regression that includes a penalty term to mitigate overfitting and enhance the model's ability to generalize. We visualized the LASSO variable trajectory based on LASSO regression results (Figure 7C), which showed that gene expression changes with lambda coefficient (log) of LASSO penalty term. As lambda decreases, the number of genes with a coefficient of zero gradually increases. Differential expression analysis of MRDEGs diagnostic model of CRC patients with high/low scores in TCGA-COADREAD dataset was conducted (Figure 7D), and the two groups exhibited marked differences in expression levels of MRDEGs diagnostic model (P<0.001).
A ROC curve was drawn for MRDEGs diagnostic model of CRC patients with high/low scores in TCGA-COADREAD dataset. As shown in Figure 7E, MRDEGs diagnostic model (AUC = 0.936) has high diagnostic value for colon cancer patients in TCGA-COADREAD dataset. The correlation between MRDEGs diagnostic model and MS was illustrated by creating a scatter plot (Figure 7F). The plot indicates a statistically significant difference between LASSO and MS (P<0.001).
2.8 Prognostic performance of MRDEGs
To probe the correlation of expression of 13 MRDEGs (ADAM8, C5AR1, CCL22, CIITA, CTSD, CXCL8, CXCL9, FABP4, MMP3, MS4A1, SPP1, TIMP1, WNT5A) with the incidence of CRC, univariate/multivariate Cox regression analysis was implemented on expression levels of MRDEGs and clinical variables M stage, N stage, and T stage with prognostic clinical relationship in TCGA-COADREAD dataset. The analysis result illustrated a correlation between expression levels of MRDEGs and clinical variables M stage, N stage, and T stage with prognostic clinical relationship. In this study, the clinical data of COADREAD patients acquired from TCGA-COADREAD dataset was also statistically analyzed (Table 5).
A forest plot (Figure 8A) was utilized to present univariate/multivariate Cox regression analysis results (Table 5). Subsequently, the prognostic ability of Cox regression model was assessed through nomogram analysis, and a nomogram chart was generated (Figure 8B). Additionally, in Cox regression model, a risk factor chart was employed to visualize grouping of risk factors (Figure 8C).
In our research, calibration analysis was implemented on the variables in univariate/multivariate Cox regression models for 1-, 3-, and 5-year periods, and results were presented in calibration curve charts (Figure 8D-F). Furthermore, DCA was implemented to appraise the clinical utility of Cox regression prognostic model constructed for 1-, 3-, and 5-year periods and presented the results (Figure 8G-I).
We drew prognostic survival KM curves for 13 MRDEGs (ADAM8, C5AR1, CCL22, CIITA, CTSD, CXCL8, CXCL9, FABP4, MMP3, MS4A1, SPP1, TIMP1, WNT5A) in TCGA-COADREAD dataset. It showed that only 9 MRDEGs (Figure 9) met the requirements when each of the 13 MRDEGs was drawn one by one with a prognostic survival KM curve using P < 0.05 as the standard for statistically significant correlation molecules.
2.9 Construction of COADREAD-related disease subtypes
To explore the expression differences of MRDEGs in COADREAD patient samples in TCGA-COADREAD dataset, R package "ConsensusClusterPlus" was employed to identify different subtypes of COADREAD disease related to COADREAD in TCGA-COADREAD dataset on the basis of expression levels of 13 MRDEGs (ADAM8, C5AR1, CCL22, CIITA, CTSD, CXCL8, CXCL9, FABP4, MMP3, MS4A1, SPP1, TIMP1, WNT5A) using the consistency clustering method. Finally, two COADREAD disease subtypes (cluster1 and cluster2) were identified (Figure 10A). COADREAD disease subtype 1 (cluster1) contained 360 samples and COADREAD disease subtype 2 (cluster2) contained 284 samples. PCA was implemented on the expression data matrix of two subtypes of COADREAD disease samples in TCGA-COADREAD dataset. It demonstrated notable dissimilarities between the two COADREAD disease subtypes based on their expression matrices (Figure 10B). We also showed the Delta plot (Figure 10C) and cumulative distribution function (CDF) plot (Figure 10D) of different numbers of clusters in the consistency clustering results and the consistency clustering CDF plot. The figure shows that the unsupervised clustering of the TCGA-COADREAD dataset is most consistent when using k=2 as the number of clusters.
In addition, the variation in expression of 13 MRDEGs between two COADREAD disease subtypes (cluster1 and cluster2) in TCGA-COADREAD dataset was examined utilizing Mann-Whitney U test, and a group comparison graph was employed to present the results (Figure 10E). The group comparison graph reveals significant variations in expression of 13 MRDEGs between cluster1 and cluster2 in TCGA-COADREAD dataset (P < 0.001).
Then we plotted the ROC curves of 13 MRDEGs (ADAM8, C5AR1, CCL22, CIITA, CTSD, CXCL8, CXCL9, FABP4, MMP3, MS4A1, SPP1, TIMP1, WNT5A) in the two COADREAD disease subtypes of TCGA-COADREAD dataset (Figure S5).
2.10 Mutation analysis of MRDEGs in CCRC patients
To analyze the mutation status of 13 MRDEGs (ADAM8, C5AR1, CCL22, CIITA, CTSD, CXCL8, CXCL9, FABP4, MMP3, MS4A1, SPP1, TIMP1, WNT5A) in COADREAD patients in TCGA-COADREAD dataset, mutation of 13 MRDEGs from COADREAD patient samples in TCGA-COADREAD dataset were analyzed and visualized utilizing R package maftools. The analysis revealed the presence of five main types of somatic mutations in the body cells: Missense Mutation, Frame Shift Deletion, Nonsense Mutation, Frame Shift Insertion, and Splice Site mutation. Missense mutations accounted for most of them (Figure 11A). Most of the mutations observed in the 13 MRDEGs in COADREAD patients were SNPs, with a small number of insertions (INS) and deletions (DEL) also detected. Furthermore, the most frequent single nucleotide variant (SNV) observed in COADREAD patients was the C > T transition, followed by C > A (Figure 11A). Then we showed all the somatic mutations of 13 MRDEGs in COADREAD patients (Figure 11B).
We conducted an analysis on the CNV of 13 MRDEGs in TCGA-COADREAD dataset of COADREAD patients. We downloaded and merged the CNV data of COADREAD patients and analyzed it using GISTIC 2.0 and visualized the results (Figure 11C-E). The results indicated a high frequency of amplifications and deletions of 13 MRDEGs in COADREAD patient samples, among which FABP4, CCL22, CIITA and other genes had higher amplification frequencies while CXCL9, SPP1 and ADAM8 had higher deletion frequencies (Figure 11C).
We analyzed MSI and TMB data, as well as TIDE algorithm evaluation TIDE score data for COADREAD patients in TCGA-COADREAD dataset. Then we created grouping comparison graphs (Figure 11F-H) and correlation scatter plots (Figure 11I-K) to compare the patients’ risk scores. The results showed that MSI, TMB, and TIDE scores had statistically marked differences between patients with high/low risks (P < 0.05). Higher TIDE scores denote higher possibility of tumor immune escape in patients with high risk in contrast to those with low risk. The correlation scatter plot results showed a weak linear correlation between MSI data, TMB data, TIDE scores evaluated by TIDE algorithm, and risk scores.
2.11 Immune infiltration analysis of CRC (CIBERSORT)
The correlation between the expression profiles of 22 immune cells in different groups (cluster1 and cluster2) in colon cancer patients were analyzed utilizing CIBERSORT algorithm. On the basis of immune infiltration analysis results, a bar chart (Figure 12A) was generated to display the infiltration status of these 22 immune cells in each sample of colon cancer patients.
Differential expression of 22 immune cells in two groups (cluster1 and cluster2) in CRC patients was analyzed (Figure 12B). The analysis revealed extremely significant differences of 11 immune cells, including dendritic cells resting, eosinophils, macrophages M0, M1, M2, mast cells activated, monocytes, neutrophils, plasma cells, T cells CD4 memory resting, and T cells regulatory (Tregs) in expression levels between the two groups (P<0.001). Three immune cells (T cells gamma delta, NK cells activated, mast cells resting) had significant differences (P<0.01), and four immune cells (B cells naive, NK cells resting, dendritic cells activated, T cells follicular helper) showed certain differences (P<0.05).
We showed correlation heat map (Figure 12C) of 13 MRDEGs (ADAM8, C5AR1, CCL22, CIITA, CTSD, CXCL8, CXCL9, FABP4, MMP3, MS4A1, SPP1, TIMP1, WNT5A) with statistically significant immune cell infiltration abundance (P < 0.05). There was a strong correlation between infiltration abundance of Neutrophils and MMP3 among MRDEGs in different groups (cluster1 and cluster2) of colon cancer patients.
2.12 Clinical correlation analysis of prognostic MRDEGs
We investigated whether the expression levels of 13 prognostic MRDEGs were related to clinical features in COADREAD patients. The correlation of high and low expressions of these MRDEGs with different clinical pathological characteristics was examined (Figure S6).
2.13 In Vitro and Vivo Analyses
Real-time quantitative reverse transcription PCR was employed to detect the mRNA expression levels of hub genes in the HCT116 colorectal cancer cell line, normal colon epithelial cells, eight colorectal cancer (CRC) patients, and eight control subjects in adjacent tissues. This validation aimed to assess the reliability of the hub genes. The results demonstrated a significant upregulation of SPP1, C5AR1, MMP3, TIMP1, and ADAM8 expression in HCT116 cells compared to normal colon epithelial cells (Figure 13A). Consistently, in the clinical samples, the expression levels of SPP1, C5AR1, MMP3, TIMP1, and ADAM8 were significantly higher in CRC patients compared to the control tissues (Figure 13B), corroborating the aforementioned findings. The protein expression of SPP1, C5AR1, MMP3, TIMP1, and ADAM8 was examined using the Human Protein Atlas database, revealing a similar trend for C5AR1, MMP3, TIMP1, and ADAM8 (Figure 13C). Additionally, we observed a significant elevation in MMP3 expression levels in HCT116 cells through immunoblotting experiments (Figure 13D).