Data processing and identification of DEIGs, DEGs, and DETFs
The transcriptome RNA-sequencing data and clinical materials of 514 COAD patients were obtained from the TCGA database. The screening criterion for DEGs between tumors and normal samples was set up as the FDR < 0.05 and |log FC (fold change)| >1 (Figure 1A, B). A list of 2,660 immune genes was obtained from the Immunome database, downloaded from InnateDB and IMMPORT database, and a total of 649 DEIGs were obtained from the further screening (|log FC (fold change)| >1, FDR < 0.05). (Figure 1C, D). We obtained 318 TFs from the Cistrome program and 67 DETFs after screening ( Figure 1E, F)
Functional analysis of DEIRGs
GO, KEGG enrichment analyses were conducted on the DEIGs. GO-BP analysis indicated that the DEIGs were mainly involved in B cell, immunoglobulin mediated immune response, complement activation, and regulation of immune system processes (Figure 2A,B). The KEGG pathway was mainly enriched in cytokine−cytokine receptor interaction, viral protein interaction with cytokine and cytokine receptor, chemokine signaling pathway, IL-17 signaling pathway, neuroactive ligand−receptor interaction, and the NF−kappa B signaling pathway (Figure 2 C, D).
GO
Building DEIG and DETF interaction network
The correlation between DEIGs and DETFs was obtained using the screening criteria cor > 0.5, P < 0.001, and the correlation network diagram was drawn using Cytoscape (Figure 3). The specific correlation results are shown in Table 1.
Weighted Gene Co-Expression Network analysis of DEIGs
WGCNA[13, 14]can identify gene modules with similar expression patterns by calculating the expression relationship between genes, analyzing the relationship between gene modules and phenotypes, and mapping the regulatory network between genes in the gene module and central genes (hubs). The WGCNA package in R was used to divide DEIGs into 5 modules (“MEgreen,” “MEblue,” “MEblue,” “MEyellow,” and “MEgrey”) (Figure 4). The optimal POWER value was 4. Prognostic models were built based on the minimal p-value. Therefore, genes in the green module were selected for subsequent analysis.
Obtaining immune genes related to the prognosis of COAD patients
After obtaining the “MEgreen” gene module, univariate regression analysis was performed on the clinical data of COAD patients in TCGA and GEO databases. Nine genes were screened as prognosis-related genes (CD1A, CD1B, FGF9, GRP, OXTR、SPHK1, BGN, SERPINE1, and F2RL2) using p < 0.05 as the criteria. The results of univariate analysis are shown in Figure 5A. The K-M curves of 9 genes in COAD patients are plotted using “survival” packages in R software (Figure 5B-J)
Building of survival prognosis model and performing a survival analysis
The 9 prognosis-related genes were included in multivariate Cox regression analysis. The inclusion criterion was p < 0.05, HR≠1. Six genes (CD1A, CD1B, FGF9, GRP, SERPINE1, and F2RL2) were incorporated into the risk core model. The relationship between genes and risk score is shown in Table 2. The composite risk scores of patients enrolled in the TCGA datasets were calculated. The COAD patients were separated into low- and high-risk groups using the mean risk score as the cut-off. The prognostic model was built using TCGA data and verified in the GEO datasets GSE40967. Further, we used the R software “Survival” and “timeROC” packages to draw two groups of K-M and ROC curves, respectively (Figure 6A-D), The “ComplexHeatmap” of R software was used to conduct a chi-square test between the low- and high-risk groups in the clinical traits, and draw the clinical correlation heat map (p < 0.05, Figure 6E). There were significant differences between the two groups in the T, N, M, and tumor stages. Univariate and multivariate Cox regression analyses revealed that age, T stage, and risk score were independent prognostic factors for COAD patients (Figure 7 A-B).
ssGSEA analysis for low-and high-risk groups
GO and KEGG enrichment analysis files (“c5.go.v7.4.smbols,” and “c2.cp.kegg.v7.4.symbols”) were downloaded from the GSEA database (http://www.gsea- msigdb.org/gsea/index.jsp).Subsequently, GO, and KEGG enrichment analyses were performed using the clusterProfiler and the org.Hs.eg.db packages in high- and low-risk groups. GO enrichment analysis of the high-risk group’s main enrichment was at the keratinocyte differentiation, skin development, collagen-containing extracellular matrix external encapsulating structure, and structural molecule activity. The most significantly enriched KEGG pathways were axon guidance, ECM receptor_interaction, focal adhesion, PPAR signaling pathway, and systemic lupus erythematosus. GO enrichment analysis of the low-risk group’s main enrichment was of activation of immune response, adaptive immune response, immune response regulating signaling pathway, immunoglobulin production, immunoglobulin complex. The KEGG pathway was enriched in allograft rejection, asthma, autoimmune thyroid disease, intestinal immune network for IgA production, and primary immunodeficiency (Figure 7C-F).
Comparison somatic mutation in high- and low-risk groups
Somatic mutation profiles of COAD patients downloaded from the TCGA database were analyzed and visualized using the “maftools” package [15]. A total of 388 patients exhibited mutation, of which 202 were high-risk, and a low-risk of 184 patients. The top 5 genes with the highest somatic mutation rate in the high-risk group were APC, TP53, TTN, KRAS, and SYNE1. Missense mutations were the most common. The high‐risk group achieved a higher mutation frequency than the low‐risk group (Figure 8A-B).
Analyzing the immune infiltration of high- and low-risk groups
Next, we used CIBERSORT to estimate the proportions of 22 distinct immune cell types (p < 0.05) (Figure 8 E), The Wilcoxon signed-rank test was used to determine the immune infiltrating cells with differences between the high-risk groups. Additionally, the resting dendritic cells, follicular helper T cells (Tfh), were present in significantly higher fractions in low-risk patients than in high-risk patients (p < 0.05; Figure 8C). The immune cell function of the high-risk group was lower than that of the low-risk group in APC co-inhibition, APC co-expression, T cell function, and macrophage function (Figure 8D).
Comparison with the other models
We compared the four corresponding prognostic models: a 7-gene signature (Sun), 6-gene signature (Liang), 12-gene signature (Mia), and 7-gene signature (Chen) [16-19]. Similarly, taking the mean risk value of all samples as the division standard, the high- and low-risk groups were divided into high and low groups, and four models were used to calculate the risk value of the TCGA samples. The ROC and K-M curves for the four models are attached as Figure 9A-J. The AUCs of the four models at 5 years were 0.581, 0.521, 0.616, and 0.555, respectively, all significantly lower than our model (0.639). C-indexes were used to evaluate the prediction capability of the mixed-effect Cox model [20]. We then calculated the C‐indexes of all models. The results showed that our model exhibited the highest C-index value (0.704; Figure 9K).
Verification of six IRGs in external databases
The TIMER online database was used to analyze the differential expression of six genes in this model in 17 types of tumors and adjacent tissues. It was seen that CD1A, CD1B, GRP, SERPINE1, and F2RL2 were highly expressed in tumor tissue. In contrast, FGF9 was highly expressed in normal colon tissue (Figure 11). An HPA database search was performed to verify the protein expression levels of CD1A, CD1B, SERPINE1, and FGF9 (Figure 10).