Identification of gene expression patterns in colon tumor
To investigate the unique gene expression patterns in colon tumor tissues, we analyzed two independent data sets of gene expression in colon tumors (GSE44076 and GSE39582). Analysis of GSE44076 and GSE39582 identified 404 (log2FC > 2, P < 0.05) and 274 (log2FC > 2, P < 0.05) differentially expressed genes (DEGs) in normal colon tissue and colon tumor tissue, respectively (Fig. 1A, B). In order to obtain more reliable results, we intersected these DEGs and obtained 170 common genes (Fig. 1C). Furthermore, we performed GO enrichment analysis and the result showed that DEGs were significantly enriched in extracellular matrix organization and extracellular structure organization in the biological process group; apical part of cell and apical plasma membrane in the cellular component group; receptor ligand activity and signaling receptor activator activity in the molecular function group (Fig. 1D).
CTHRC1 is up-regulated in colon cancer tissues
CTHRC1 is a cancer-associated protein and is involved in tissue damage repair, which has attracted our attention. Analysis of CTHRC1 gene expression data in TCGA-COAD and GSE115313 cohorts confirmed that CTHRC1 was overexpressed in colon cancer tissues (Fig. 2A, B). We used the Timer 2.0 tool to screen for CTHRC1 expression in multiple cancer types, and found that CTHRC1 was significantly upregulated in 23 cancer types, especially in colon adenocarcinoma (Fig.2C). These results suggest that CTHRC1 may play an important role in the occurrence and development of colon cancer.
CTHRC1 overexpression predicts poor prognosis in COAD patients
In addition, we conducted Kaplan-Meier survival analysis in three cohorts (GSE17536 cohort, GSE39582 cohort, TCGA-COAD cohort) to examine the association between CTHRC1 expression and the survival. The optimal threshold was determined by the surv_cutpoint function in the survminer R package, and the patients were thus divided into high- and low- CTHRC1 groups. The results indicated that high CTHRC1 expression significantly correlated to shorter OS and RFS in colon cancer patients (Fig. 3A-D). Interestingly, the multivariate Cox analysis result suggested that increased CTHRC1 expression [hazard ratio (HR) = 1.724, 95% confidence interval (CI) 1.0797–2.75; p < 0.05] have the potential to act as an independent predictor of poor prognosis (Fig. 3E).
GO and KEGG enrichment analysis of CTHRC1 and co-expression genes and identification of hub genes.
Subsequently, we analyzed gene expression data of GSE39582 cohort and acquired 516 genes co-expressed with CTHRC1 (absolute Pearson correlation coefficient > 0.5, p < 0.001). GO and KEGG enrichment analysis of the co-expressed genes showed the top 30 items. GO functional enrichment analysis indicated that these genes were enriched in extracellular matrix organization and extracellular structure organization in the biological process group; collagen-containing extracellular matrix and endoplasmic reticulum lumen in the cellular component group; extracellular matrix structural constituent and glycosaminoglycan binding in the molecular function group (Fig. 4A). KEGG enrichment analysis showing the enrichment of various pathways including PI3K-Akt signaling pathway, focal adhesion, proteoglycans in cancer and human papillomavirus infection, etc (Fig. 4B). STRING database was employed to establish a PPI network and by opening it with Cytoscape software, we constructed a co-expression network. Hub genes in the network were identified via the cytoHubba application in Cytoscape software. The 13 hub genes are COL1A2, COL3A1, COL12A1, FBN1, SDC2, POSTN, THBS2, ADAMTS12, COL1A1, MMP9, COL5A2, COL6A3 and CDH11(Fig. 4C).
Establishment of the risk model on colon cancer
The risk signature based on CTHRC1 and 13 hub genes was established. For each patient in the TCGA-COAD cohort, the risk score was calculated, and patients were divided into high-risk and low-risk groups based on the median risk score (Fig. 5A). More deaths occurred in patients with high-risk group than those with low-risk group and the gene expression of risk signature showed significant difference between high-risk group and low-risk group (Fig. 5B-C). Multivariate Cox analyses revealed that the risk score of a 14-gene signature was an independent prognostic factor for colon cancer patients in TCGA cohort (Fig. 5D). In addition, risk score showed significant changes between early stage (I and II) and normal tissue though non-significant difference between early and advanced (III and IV) stages in COAD patients was observed (Fig. 5E). Kaplan-Meier survival curves showed that the high-risk group had a worse prognosis (p < 0.01, Fig. 5F). The area under curve (AUC) values of the 1-, 3-, and 5-year Receiver Operating Characteristic (ROC) curves were 0.656, 0.654, and 0.627, respectively (Fig. 5G).
Validation of the risk model
To validate the risk model, we selected a test cohort based on GSE39582 and performed univariate and multivariate Cox regression analyses. In accord with the TCGA cohort, the risk score of the 14-gene signature could be an independent prognostic factor for colon cancer patients. (p < 0.01, Fig. 6 A, B). Patients in the high risk score group showed poor OS. (p < 0.001, Fig. 6 C). Receiver Operating Characteristic (ROC) curve was employed for the assessment of predictive ability of the risk model, and the area under curve indicated that the risk model was effective (Fig. 6 D). Taken together, the model we provided was proved to be an effective and well-predicting model.
Gene Set Enrichment Analysis of the high-risk group
GSEA was used to determine the functions of gene sets in the high-risk group. We set the cutoff value of FDR q-value at < 0.05 and NES at > 2.0. We found that epithelial mesenchymal transition (FDR q-value, 0.023; normalized enrichment score, 2.07) and angiogenesis (FDR q-value, 0.023; normalized enrichment score, 2.03) were significantly enriched in high-risk group in the TCGA dataset (Fig. 7 A, B). Taken together, we deduced that patients in high-risk group could promote the development of colon cancer through the activation of EMT and angiogenesis pathways and subsequently influence their outcomes.