1 Differentially expressed ARGs between CRC and adjacent normal colorectum tissue samples
A total of 568 CRC and 44 normal colorectum tissue samples with RNA-seq data in the TCGA cohort were involved in the current study. We selected 222 identical ARGs from 232 ARGs and RNA-seq data of TCGA cohort. Among the 222 identical ARGs, 32 differentially expressed ARGs were identified between CRC and adjacent normal colorectum tissue samples with thresholds of |log2 fold change|> 1 and FDR < 0.05. The result was shown in a heat map (Fig. 1A) Then, according to differentially expressed ARGs, gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed. Autophagy, mitochondrial outer membrane, and ubiquitin protein ligase binding were the top enrichment GO terms for biological processes, cellular components and molecular functions, respectively. As for KEGG pathway enrichment analysis, apoptosis, p53 signaling pathway and human cytomegalovirus infection were the top enrichment pathways and autophagy related pathways were also statistically significant (Fig. 1B-C).
2 Development and validation of a prognostic risk score model
In the TCGA cohort, we merged the expression matrixes of 222 ARGs with the corresponding overall survival time and overall survival status for 541 CRC patients on the basis of patients’ ID. The univariate Cox regression analysis was utilized to select out a total of 8 ARGs as candidate genes that were significantly associated with OS (Fig. 1D). And the differential expression levels of these candidate genes between CRC and adjacent normal colorectum tissue samples were shown in Fig. 1E. Subsequently, we excluded 26 CRC patients whose OS time was 0, and a total of 7 optimal genes were identified for the construction of the prognostic risk score model in the LASSO Cox regression analysis (Fig. 1F, G). The optimal genes are MAPK9, CDKN2A, SERPINA1, DAPK1, ULK3, EDEM1 and ULK1. The prognostic risk score = (-0.211578695097183* expression value of MAPK9) + (0.0426772420693879* expression value of CDKN2A) + (-0.14956314324999* expression value of SERPINA1) + (0.348009461063805* expression value of DAPK1) + (0.250327783852896* expression value of ULK3) + (-0.432074832788178* expression value of EDEM1) + (0.226888644181894* expression value of ULK1).
According to the median value of risk scores, we divided CRC patients into low- and high- risk score groups. Kaplan-Meier survival curves showed that high-risk group had a lower survival rate than low- risk group with P-value = 4.393e-06 (Fig. 2A). The ROC of the prognostic risk score model at 5-year was exhibited in Fig. 2B, with AUC of 0.678. To better demonstrate the association between risk scores and prognosis, the risk score of each CRC patient was ranked from low to high (Fig. 2C). The survival status and corresponding risk scores were plotted in Fig. 2D. Red dots represented dead patients and blue dots represented surviving patients. As shown in the figure, the red dot density of the low risk score group was significantly lower than that of the high risk score group. The differentially expressed heat map of optimal genes, between the low- and high- risk score groups was reported in Fig. 2E. For validation of the prognostic risk score model, we further compared the survival differences between low- and high- risk score groups in the GEO cohort. High-risk group also had a poorer prognostic than low-risk group with P-value of 4.707e − 02 (Supplement Fig. 1A) and the AUC of ROC at 5-year was shown in Supplement Fig. 1B. The distribution of risk score of each patient, survival status and corresponding risk scores, as well as the differentially expressed heat map of optimal genes were plotted in Supplement Fig. 1C-E. As expected, the prognostic risk score models also exhibited reliable ability to predict prognosis in the pooled cohort. Survival curve revealed that patients in the high risk score group had worse OS compared to those in the low risk score group with AUC of 0.638 (p = 7.7e − 05; Supplement Fig. 2A, B). Supplement Fig. 2C-E respectively, demonstrated the distribution of the risk score of each patient, survival status and corresponding risk scores, as well as the differentially expressed heat maps of optimal genes in the pooled cohort. In addition to this, we explored the relationship of risk scores and clinical characteristics and found that patients with advanced cancer had higher risk scores than early stage patients (Fig. 3A-D). The expression levels of CEA were positively correlated with the risk scores of patients (Fig. 3E). Figure 3F-H demonstrated the differences of PD1, PD-L1, CTLA-4 between low- and high- risk score groups, respectively. However, there were no significant differences of risk scores in terms of patients’ age and gender (Supplement Fig. 3A, B).
3 Chemotherapy in the GEO cohort
According to the above criteria, 139 patients were defined to non-chemotherapy resistance group and 111 patients were defined to chemotherapy resistance group. Then, we compared the difference of risk scores between non-chemotherapy resistance and chemotherapy resistance groups, and we found that the median value of risk scores in the chemotherapy resistance group was higher than that in non-chemotherapy resistance group with P-value of 0.03 (Fig. 3I). This result meant that patients with low risk scores could get more benefits from chemotherapy than those with high risk scores. Therefore, it is particularly important for patients with high risk scores to find potentially effective treatments.
4 PCA in the TCGA and GEO cohorts
In both TCGA and GEO cohorts, the expression matrixes of all ARGs were analyzed using PCA respectively. The results of PCA were shown in Fig. 4A, B. After that, we performed PCA on the expression matrixes of the optimal genes that identified by LASSO Cox regression analysis (Fig. 4C, D). As shown in the figures, compared with all ARGs, patients with different risk scores could be distinguished better using the optimal genes.
5 The results of enrichment analysis
The enrichment analyses of differentially expressed genes between low- and high- risk score groups were conducted with GO and KEGG pathway enrichment analyses. The results of GO and KEGG pathway enrichment analyses were demonstrated in Fig. 5A, B. Many immune-related biological processes and cancer- related pathways were the top enrichment terms and pathways, which indicated that there might be cancer-related immune differences between low- and high- risk score groups. Meanwhile, the results of GSEA showed that some cancer-related pathways, such as NOTCH signaling pathway and HEDGEHOG signaling pathway were enriched in high- risk score group (Fig. 5C, D). In addition, the p53 signaling pathway was upregulated in the low risk score group (Fig. 5E).
6 The difference of immune cell infiltration between low- and high- risk score groups
CIBERSORT algorithm was applied to quantify the relative abundance of 22 immune cells in each patient according to the RNA-seq profiles. Figure 5F presents the differences of immune cell infiltration between low- and high- risk score groups.
As shown in figure, the abundance of plasma cells, resting memory CD4 T cells, activated dendritic cells, Eosinophils and neutrophils were higher in the low risk score group compared to the high risk score group. However, regulatory T cells (Tregs) and Macrophages M0 exhibited higher infiltration in patients from the high risk score group. Based on the above results, patients with low risk scores were more likely to benefit from immunotherapy than those with high risk scores.
7 PPI network of differentially expressed ARGs between low- and high- risk score groups
We utilized the STRING online database to construct the PPI network of differentially expressed ARGs between low- and high- risk score groups, the result was shown in Fig. 6A and Supplement Table 1. Then, Cytoscape software was performed to process the data of PPI network further. Figure 6B presents the interaction of differentially expressed ARGs, red presents ARGs that are highly expressed in high- risk score group and blue presents ARGs that are highly expressed in low risk score group. Finally, we found the top 10 hub ARGs in the PPI network using cytoHubba. As shown in the Fig. 6C, the top 10 hub AGRs, degree ranked from high to low, are BECN1, ATG5, ATG16L1, PIK3R4, MAP1LC3A, MAP1LC3B, ULK1, ATG13, MAPK3 and ATG3.
8 Development of a nomogram for predicting overall survival
To predict the OS of CRC patients, we constructed a nomogram that integrated age, gender, pathological stage, neoadjuvant chemotherapy, the level of CEA and a prognostic risk score model (Fig. 7A). And the calibration curves at 1-year, 2-year, 3-year and 5-year demonstrated that the nomogram had a good ability to predict the OS of CRC patients (Fig. 7B-E). The results of univariate and multivariate Cox regression analyses proved that the prognostic risk score model, age and pathological stage could be regarded as independent prognostic indicators (Fig. 7F, G). The ROC curves demonstrated that the prognostic risk score model (AUC = 0.702) presented a good prognostic value than that of age (AUC = 0.623; Fig. 7H). Meanwhile, univariate and multivariate Cox regression analyses of GEO and pooled cohorts confirmed that the prognostic risk score model could act as an independent prognostic indicator (Supplement Fig. 4A-D).
9 Identification of potential drug targets for CRC patients
Based on the gene expression profiles and drug sensitivity profiles of CCLs in the CTRP and PRISM database, “pRRophetic” R package was applied to estimate the response of CRC patients to drugs/compounds via the built-in ridge regression model. The estimated AUC value represented the sensitivity of patients to drugs and compounds, lower AUC value indicated higher sensitivity to drugs. To further identify potential drugs for patients with different risk scores, Spearman correlation analysis was performed to quantify the correlation coefficient between AUC values and risk scores. The AUC value of CTRP-derived drugs, including DNMDP, brefeldin A, azacitidine, gossypol, PF-543 and L-685458, was positively related with risk scores; and the AUC value of CTRP-derived compounds, including semagacestat, epigallocatechin-3-monogallate, SMER-3, NSC48300, CHIR-99021, CHIR-99021, was negatively related with risk scores (Fig. 8A). In addition, the analysis in PRISM database yielded the top 6 compounds with positive correlation coefficient (bexarotene, zaleplon, propranolol, clofoctol, 3-amino-benzamide, selumetinib) and the top 6 compounds with negative correlation coefficient (pentostatin, afobazole, regorafenib, tofogliflozin, mepivacaine, tedizolid-phosphate) (Fig. 8B). Compounds with negative correlation coefficient had lower estimated AUC values in the high risk score group, which indicated that these compounds could act as potential drugs for CRC patients with high risk scores. In addition to immunotherapy, patients with low risk scores also benefited from the compounds with positive correlation coefficient, because these compounds had lower estimated AUC values in the low risk score group.
10 The results of HPA
We compared the expression level of the optimal genes between CRC tissue and normal colorectum tissue samples by immunohistochemistry in the HPA database. The results were shown in Supplement Fig. 5, and ULK1 gene was not included because no information was available in the HPA database.