Construction of a Prognostic Model for Predicting Overall Survival of Patients with Colorectal Cancer

Background: Colorectal cancer (CRC) is one of the most common malignancies. The purpose of this study is to construct a prognostic model for predicting the overall survival (OS) in patients with CRC. Methods: The mRNA-seq and miRNA-seq data of colon adenocarcinoma (COAD) and rectal adenocarcinoma (READ) were downloaded from The Cancer Genome Atlas (TCGA) database. The differentially expressed RNAs (DE-RNAs) between tumor and normal tissues were screened. The Kaplan-Meier and univariate Cox regression analysis were used to screen the survival-related genes. Functional enrichment analysis of survival-related genes was conducted, followed by protein-protein interaction (PPI) analysis. Subsequently, the potential drugs targeting differentially expressed mRNAs (DE-mRNAs) were investigated. Multivariate Cox regression analysis was then conducted to screen the independent prognostic factors, and these genes were used to establish a prognostic model. A receiver operator characteristic (ROC) curve was constructed, and the area under the curve (AUC) value of ROC was calculated to evaluate the specicity and sensitivity of the model. Results: A total of 855 survival-related genes were screened. These genes were mainly enriched in Gene Ontology (GO) terms, such as methylation, synapse organization, and methyltransferase activity; and pathway analysis showed that these genes were signicantly involved in N-Glycan biosynthesis and the calcium signaling pathway. PPI analysis showed that aminolevulinate dehydratase (ALAD) and cholinergic receptor muscarinic 2 (CHRM2) served vital roles in the development of CRC. Aminolevulinic acid, levulinic acid, and loxapine might be potential drugs for CRC treatment. The prognostic models were built and the patients were divided into high-risk and low-risk groups based on the median of risk score (RS) as screening threshold. The OS for patients in the high-risk group was markedly shorter than that for patients in the low-risk group. Meanwhile, kazal type serine peptidase inhibitor domain 1 (KAZALD1), hippocalcin like 4 (HPCAL4),


Introduction:
Colorectal cancer (CRC) is one of the most common malignancies and leading causes of cancer death in the word [1]. The prognosis for CRC is widely variable, and approximately 20% of cases are metastatic at the time of emergence [2]. Meanwhile, there are signi cant differences in prognosis in CRC among different sites [3]. CRC includes colon cancer and rectal cancer, and adenocarcinoma is the most common pathological type of CRC. In addition, this adenocarcinoma comprised of colon adenocarcinoma (COAD) and rectal adenocarcinoma (READ) [4]. A previous study reported that the 5-year survival rate is approximately 10% in CRC patients with distant metastases [5]. In addition, COAD patients have a high rate of mortality due to late diagnosis, and lack of reliable biomarkers and therapeutic targets [6]. Therefore, it is necessary to clarify the survival events of CRC and screen novel prognostic factors and therapeutic targets.
Accumulating evidence has reported abnormal expressions of many genes that are involved in the development and carcinogenesis of CRC. Schmid et al. revealed that spondin 2 could induce CRC metastasis in a mice model, and served as a prognostic factor for CRC survival [7]. Kwak et al. observed that the mutation of KRAS genes seemed to be associated with the overall survival (OS) of CRC patients [8]. Meanwhile, BRCA1 interacting protein C-terminal helicase 1, ATM serine, homeobox B13, and partner and localizer of BRCA2 may be considered as CRC risk genes [9]. Despite extensive studies that have been carried out to screen prognostic markers, the mortality rate of CRC patients remains high.
Additionally, due to the limits of speci city of individual molecules, the survival time cannot be accurately predicted. Thus, there is a strong need to establish a prognostic model to search multi-molecular biomarkers for predicting CRC prognosis.
With the breakthrough advancement of sequencing techniques and bioinformatics analysis, gene expression pro ling has been con rmed to be a promising tool for classifying tumors and predicting cancer prognosis [10]. By integrating multidimensional genomic data from The Cancer Genome Atlas (TCGA), a prognostic model can be constructed and may provide personalized and precise treatment for cancer patients. Ying et al. established a prognostic model that generated a RS based on nine-gene expression for e ciently predicting endometrial carcinoma as the patient's prognosis [11]. In addition, Chen et al. developed a seven-gene signature that could be considered as an independent prognostic marker for OS prediction in CRC patients, providing new insights into identi cation of CRC patients with high-risk of mortality [12]. Since CRC has various common types, including COAD and READ, the developed prognostic signals are di cult to apply widely. Therefore, a prognostic model with high discriminant ability to effectively assist prognosis prediction for each type is needed in clinical practice.
Recently, CRC patients were staged according to the American Joint Committee on Cancer (AJCC) tumor, node, metastasis (TNM) system. A previous study reported that tumor staging is an important independent factor affecting the prognosis of CRC patients [13]. Additionally, OS is regarded as the ultimate measure of treatment bene t [14]. To comprehensively study the prognostic markers of CRC patients, a survival model for predicting prognosis in patients with CRC was constructed. The RNA-seq and miRNA-seq datasets of COAD and READ based on the OS and tumor staging were respectively downloaded from TCGA database. The differentially expressed RNAs (DE-RNAs) between normal and tumor samples were screened, then Kaplan-Meier (K-M) analysis, univariate Cox regression analysis, and multivariate Cox regression analysis were performed to obtain the prognostic independent factors for CRC. Next, the receiver operator characteristic (ROC) curve was built, and the under the curve (AUC) values of the ROC curve were calculated to assess the speci city and sensitivity of all genes. Furthermore, potential drugs targeting DE-mRNAs were also predicted. This model represents a potentially useful tool for OS predicting, but it still needs to be further veri ed in clinical practice. The ow chart of the analysis procedure is displayed in Fig. 1.

Materials And Methods:
Raw data processing The RNA-seq, miRNA-seq, and corresponding clinical information of TCGA-COAD and TCGA-READ patients up to July 19, 2019 were downloaded from the UCSC-Xena platform [15]. The RNA-seq and miRNA-seq were preprocessed by log2 transformation. Patients were divided into three groups based on the OS: an OS less than ve years was regarded as "S"; an OS more than ve years and less than ten years was considered as "M"; and an OS more than ten years was de ned as "L". In addition, the tumor staging information of all patients was derived from TCGA database, including stage I, stage II, stage III, and stage IV. The number of each group is listed in Table 1. Screening of DE-RNAs The differential expression of genes in COAD and READ datasets were respectively calculated according to survival time (S, M, and L) and tumor stage (I, II, III, and IV). DE-RNAs, including DEmRNAs, DEmiRNAs, DElncRNAs, and DEsnoRNAs, were screened using kruskal.test from the R package stats [17], and the p value was corrected by Bonferroni-Holm method. DE-RNAs with p value < 0.05 were considered signi cant. Subsequently, hierarchical cluster analysis was performed to display the heat map using pheatmap package (version: 1.0.12) in R software [18].

Functional and pathways enrichment analysis
The univariate Cox regression analysis was used to screen the survival-related genes. To understanding the biological function of these genes, the Gene Ontology (GO) [19] and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway [20] enrichment analysis were conducted using clusterPro ler (version 3.12.0) [21]. The results of GO analysis contained biological process (BP), cellular component (CC), and molecular function (MF). A p value < 0.05 and count > 2 were de ned as signi cant enrichment.

Construction of the protein-protein interaction (PPI) network
To explore the interactions between proteins, PPI analysis was performed using STRING (Version: 11.0, http://www.string-db.org/) database [22]. PPIs with con dence score > 0.4 were selected to constructed a PPI network, and this network was visualized using Cytoscape software [23].

Prediction of drug targeting survival-related DE-mRNAs
The drug gene interaction database (DGIdb) showed the drug-gene interaction and gene draggability information. In the present study, DGIdb 3.0 [24] was used to predict the potential drugs targeting the survival-related DE-mRNAs, and the drug-gene relationships were obtained. Subsequently, the drugs which were approved by the Food and Drug Administration agency and supported by the literature were screened. Finally, the drug-gene interaction network was structured by Cytoscape.

Survival analysis and prognostic survival models construction
Clinical and survival information were collected from COAD and READ datasets. Log-rank tests were conducted to evaluate the relationship between DE-RNAs and survival time, and p value < 0.05 was considered as statistically signi cant, then genes related to prognosis were screened and K-M survival curves were plotted. The coxph method [25] from the survival package was used for univariate Cox regression analysis to analyze the effects of differential RNA and clinical factors (tumor staging) on survival prognosis. The log-rank test was utilized to calculate the p value and the prognostic correlation coe cient β value, and genes with p value < 0.05 were selected for further study. Subsequently, a multivariate Cox regression analysis was employed to assess independent prognostic factors connected with patient survival, and the candidate DE-RNAs signi cantly related to the prognosis of patients were screened. Furthermore, the RS of each sample was calculated using the formula as follows: RS = βgene1 × expgene1 + βgene2 × expgene2+… +βgeneN × expgeneN (where β represents β coe cient value and exp represents the gene expression level). The COAD and READ patients were divided into high-risk and lowrisk groups based on the median RS value, and K-M survival curve was used to assess the correlation between the risk model and the survival prognosis of the samples. A computational model for predicting clinical endpoints of samples from the expression of key prognostic genes was constructed by the support vector machine (SVM) method of e1071 package in R language [26]. To assess the speci city and sensitivity of genes, ROC curves were plotted and the AUC value of ROC was calculated. According to the DE-RNAs obtained in the prognosis model, the nomogram method in R package 'rms' [27] was applied to establish a nomogram model for predicting the three-year, ve-year, and ten-year survival probability. Results:

Screening of DE-RNAs in CRC patients
The

Results of functional and pathway enrichment analysis
A total of 143, 503, 61, and 148 survival-related genes were respectively obtained in COAD OS, COAD stage, READ OS, and READ stage groups. To explore the functional characteristics of the screened OSrelated genes, functional enrichment analysis was performed (Fig. 2). The GO enrichment analysis showed that DE-RNAs in the COAD OS group were mainly associated with BP terms, such as methylation, positive regulation of synaptic plasticity, and response to cobalt ion; the genes in the COAD stage group were connected with BP terms including synapse organization, modulation of chemical synaptic transmission, and regulation of trans-synaptic signaling. Additionally, the genes in the READ OS group were primarily enriched in mRNA 5'-splice site recognition, regulation of the force of heart contraction, and muscle lament sliding; whereas the genes in the READ stage group were primarily correlated with muscle system process, muscle contraction, and cartilage development. Meanwhile, KEGG pathway enrichment analysis indicated that survival-related genes in the COAD groups were involved in N-Glycan biosynthesis, drug metabolism-other enzymes, and calcium signaling pathway; in addition, genes in the READ groups mostly participated in glycolysis/gluconeogenesis, carbon metabolism, PPAR signaling pathway, and calcium signaling pathway.

PPI network of survival-related RNAs
To clarify the relationship between proteins encoded by genes, STRING was utilized to establish PPI networks (Fig. 3). In the COAD OS group, there were 28 nodes and 19 edges. The degrees of surfeit 6 (SURF6), dolichyl-phosphate mannosyltransferase subunit 2 (DPM2), and aminolevulinate dehydratase (ALAD) were higher than that of others. In the PPI network of the COAD stage group, gastrin (GAST), dopamine receptor D4 (DRD4), and corticotropin releasing hormone receptor 1 (CRHR1) with higher degrees and might be recognized as hub genes. The PPI network of the READ OS group was composed of phosphoglycerate kinase 2 (PGK2) and fructose-bisphosphatase 2 (FBP2). Moreover, angiotensin II receptor type 1 (AGTR1), heart and neural crest derivatives expressed 2 (HAND2), and cholinergic receptor muscarinic 2 (CHRM2) had higher degrees than that of others in the PPI network of the READ stage group.
Drug prediction with regard to survival-related DE-mRNAs The DE-mRNAs obtained from the univariate Cox regression analysis were further analyzed by DGIdb database. A total of 13 genes might be considered as druggable genes, such as ALAD, peptidylprolyl isomerase F (PPIF), melan-A (MLANA), N-Acetyltransferase 1 (NAT1), and CHRM2 ( Fig. 4). Furthermore, aminolevulinic acid was an inducer or inhibitor of ALAD, levulinic and chlorzoxazone could respectively regulate the expression level of ALAD and MLANA, and cyclosporine was a binder of PPIF.

Survival-related gene screening and prognostic model construction Construction of a prognostic model based on genes in COAD OS
In the COAD OS group, the prognostic value of the 3207 DE-RNAs was assessed, and the results of logtank test indicated that 143 genes were associated with survival of COAD patients. In addition, ve genes were selected as independent predictors in COAD patients including hippocalcin like 4 (HPCAL4), RP11_452L6.1, AC006042.6, RP11_465I4.3, and kazal type serine peptidase inhibitor domain 1 (KAZALD1). These genes were used to build a prognostic model, and the RS of this model was calculated by using the formula: RS = (0.58 * ExpHPCAL4) + (2.35 * ExpRP11_452L6.1) + (2.63 * ExpAC006042.6) + (0.53 * ExpRP11_465I4.3) + (0.49 * ExpKAZALD1). A total of 417 patients were divided into a high-risk group and low-risk group by applying the median of RS as the cut-off criteria, and subsequently assessed by K-M survival analysis. The results indicated a signi cant difference between the high-risk group and low-risk group (p value = 1.854e-2), and the high-risk group was connected with poor prognosis (Fig. 5A).
The AUC value of this prognostic model was 0.677 (Fig. 5B). The RS distribution, patient survival status, and heat map of ve genes in the two groups are displayed in Fig. 5C-E. The results revealed that highrisk patients had high mortality and high expression levels of prognosis genes. Subsequently, a nomogram was plotted to predict the probability of the 3-, 5-, and 10-year OS (Fig. 5F). Five independent prognostic factors were obtained, including KAZALD1, HPCAL4, AC006042.6, RP11_465I4.3, and RP11_452L6.1.

Establishment of a prognostic model based on genes in the READ-OS group
In the READ-OS group, 61 survival-related genes were screened by univariate Cox analysis; among these, seven genes, namely RNVU1_15, PGAM1P4, RP11_291L19.1, RN7SL535P, RP11_96D1.3, CTD_3220F14.2, and hsa_mir_4444_1 (all P < 0.01), were used to conduct a risk score model for READ patients. The results revealed that low expression level of these genes was associated with poor prognosis. The formula of RS was: RS = 1.69 * ExpRNVU1_15 + 2.02 * ExpPGAM1P4 + 3.53 * ExpRP11_291L19.1 + 1.91 * ExpRN7SL535P + 5.62 * ExpRP11_96D1.3 + 0.79 * ExpCTD_3220F14.2 + 0.63 * Exphsa_mir_4444_1. The 150 READ patients were divided into a high-risk group and low-risk group based on the median value of RS. K-M survival analysis showed that patients with low RS had signi cantly longer OS than patients with high RS (p value = 0) (Fig. 6A). The AUC value of ROC curve was 0.917, indicating that the predictive model had high speci city (Fig. 6B). Meanwhile, the RS distribution, survival status of patients, and heat map of gene expression level are exhibited in Fig. 6C-E. Furthermore, seven independent prognostic factors including RNVU1_15, PGAM1P4, RP11_291L19.1, RN7SL535P, RP11_96D1.3, CTD_3220F14.2, and hsa_mir_4444_1, were revealed in the nomogram (Fig. 6). These genes may be potential therapeutic targets for READ patients.

Construction of the COAD stage prognostic model
In the COAD stage group, 267 prognosis-related genes were obtained using K-M survival analysis. A total of six genes signi cantly related to OS were detected using multi-Cox regression analysis. As a result, six genes, namely TUB like protein 1 (TULP1), cadherin 8 (CDH8), AC002076.10, CTD_2561B21.10, RP11_266L9.8, and hsa_mir_26b, were used to build the prognostic model. The RS value was calculated using the follow formula: RS = (4.22 * ExpTULP1) + (3.95 * ExpCDH8) + (0.27 * ExpAC002076.10) + (0.42 * ExpCTD_2561B21.10) + (2.67 * ExpRP11_266L9.8) + (0.39 * Exphsa_mir_26b). A total of 406 patients were divided into high-risk and low-risk groups according to the median RS. Survival analysis showed that patients in the high-risk group had a poorer prognosis than those in the low-risk group (p value = 2.842e-02) (Fig. 7A). The AUC of prognostic model was 0.814 ( Fig. 7B), suggesting that our model had a favorable e ciency in predicting prognosis of different stages of COAD patients. Notably, high-risk patients had more death and the expression level of survival-related genes was up-regulated in the highrisk group (Fig. 7C-E). In addition, six potential independent prognostic factors were predicted by nomograms, which were consistent with the genes obtained above (Fig. 7F).

Establishment of the READ stage prognostic model
In the READ stage group, Cox regression analysis, K-M survival analysis, and ROC curve analysis were performed to assess the prognostic value of the 1460 DE-RNAs. A total of four survival-related genes were selected by a series of analyses: adducin 2 (ADD2), cyclin D3 (CCND3), synaptopodin 2 (SYNPO2), and homeobox A4 (HOXA4). These four genes were used to construct a predictive model. RSs were calculated using the formula: RS = (1.6e + 03 * ExpADD2) + (8.8e + 00 * ExpCCND3) + (4.1e − 04 * ExpSYNPO2) + (1.5e − 02 * ExpHOXA4). A total of 140 patients were divided into high-and low-risk groups based on the median of the RS. There was a signi cant difference between the high-risk and lowrisk groups (p value = 2.8021e-01) (Fig. 8A), and the patients in the low-risk group had markedly longer OS than those in the high-risk group. Moreover, the ROC curve analysis was conducted and the AUC of the prognostic model was 0.859 (Fig. 8B). The RS, survival status of patients, and heat map of gene expression levels are shown in Fig. 8C-E. Subsequently, a nomogram was plotted to predict the READ patients' outcome (Fig. 8F). ADD2, CCND3, SYNPO2, and HOXA4 were considered as independent prognostic factors. Discussion: CRC is one of the most common malignancies, with multiple common types. Despite surgical resection, chemotherapy, and radiotherapy used in CRC treatment, the prognosis of CRC remains poor and CRC has high morbidity and mortality rates [28]. In the present study, we established a prognostic model of COAD and READ according to the OS and tumor staging. The clinical implementation of this developed risk score model is likely to provide more relevant information for clinical diagnoses and may improve the prognosis of CRC patients. ALAD and CHRM2 were considered as hub genes in the PPT network. Furthermore, aminolevulinic acid, levulinic, and loxapine may be potential drugs for CRC treatment.
Meanwhile, KAZALD1, HPCAL4, CDH8, SYNPO2, CCND3, and hsa_mir_26b were signi cantly associated with the OS of CRC patients, and served as potential independent prognostic factors.
PPI analysis showed that ALAD and CHRM2 were connected with the development of CRC, and were also related to the prognosis of CRC patients. ALAD catalyzed the second step in heme biosynthesis and was an inhibitor of the 26S proteasome [29]. Proteasome has been found to be associated with cancer and is a therapeutic target for cancer [30]. Meanwhile, the genetic variants in ALAD are connected with the risk of genitourinary cancers [31]. Neslund et al. indicated the potential interactions between ALAD genetic variants and occupational Pb exposure, suggesting that ALAD may be a current target of prostate cancer [32]. Additionally, a polymorphism in ALAD genes may affect lead toxicokinetics; Van Bemmel et al. revealed that ALAD may alter the risk of renal cell carcinoma related to lead exposure [33]. ALAD has also been connected with prognosis of cancers, such as breast cancer [34]. Drug prediction analysis showed that ALAD was targeted by aminolevulinic acid (ALA) and levulinic acid. ALA is an endogenous metabolite formed by succinyl-CoA and glycine in mitochondria. ALA, as an already approved therapeutic strategy, is considered as a successful prodrug used in cancer treatment. . However, the effect of hsa-mir-26b on the prognosis of tumor staging had not been reported. Thus, the relationship between hsa-mir-26b and cancer staging needs further study. Notably, the AUC values of the ROC curve for the COAD OS, COAD stage, READ OS, and READ stage prognostic models were 0.677, 0.917, 0.814, and 0.859, respectively, indicating that these genes had a good performance in survival prediction. Taken together, KAZALD1, HPCAL4, CDH8, SYNPO2, CCND3, and hsa_mir_26b were considered as potential prognostic markers for CRC. However, the speci c action mechanism of these genes needs to be elucidated in further studies.
Overall, we systematically investigated the molecular mechanism underlying the development and prognosis of CRC. First, the DE-RNAs related to the prognosis of CRC were screened. Even though the action mechanisms of these genes are still not fully understood, our study has provided a theoretical basis for further research on CRC prognostic. Second, a number of drugs were predicted by drug-target analysis. Although these drugs have not been clinically proven, our work has provided guidance for clinical drug use. Some anticancer drugs may improve disease-free survival or relapse-free survival of CRC [65], but there is currently no effective drugs for improving OS. In the present study, OS was regarded as the endpoint for developing molecular markers; thus, these drugs may be bene cial for improving the OS of CRC. Third, the prognostic models may assist clinicians in providing individualized treatment in CRC patients.

Conclusion:
In conclusion, the results in this study suggest that the prognostic models developed are reliable tools for predicting the OS of CRC patients, and these ndings could assist clinicians in providing personalized and precise treatment for CRC patients. KAZALD1, HPCAL4, CDH8, SYNPO2, CCND3, and hsa_mir_26b might be therapeutic targets for CRC. Additionally, aminolevulinic acid, levulinic acid, and loxapine may be potential drugs for CRC treatment.   represents the gene count, and the color represents the adj-P. The functional enrichment analysis of survival-related genes was conducted using clusterPro ler. GO enrichment analysis showed that survivalrelated genes were mainly associated with BP terms, such as methylation, synapse organization, mRNA 5'-splice site recognition, and cartilage development. KEGG pathway enrichment analysis indicated that survival-related genes in the COAD groups were involved in N-Glycan biosynthesis, drug metabolismother enzymes, and calcium signaling pathway; genes in the READ groups mostly participated in glycolysis/gluconeogenesis, carbon metabolism, PPAR signaling pathway, and calcium signaling pathway.

Figure 3
The PPI network of survival-related genes. A: COAD OS group. B: COAD stage group. C: READ OS group.
D: READ stage group. PPI analysis was performed using the STRING database. In the COAD OS group, the degrees of SURF6, DPM2, and ALAD were higher than others. In the COAD stage group, GAST, DRD4, and CRHR1 were considered as hub genes. The PPI network of the READ OS group was composed of PGK2 and FBP2. In the READ stage group, AGTR1, HAND2, and CHRM2 had higher degrees than others.   READ patients were divided into high-risk and low-risk groups, and patients with low risk score had signi cantly longer OS than patients with high risk score. The AUC of the prognostic model was 0.814.
Additionally, these seven survival-related genes were considered as independent prognostic factors.

Figure 7
Prognostic model analysis of seven survival-related genes in the COAD-stage group. A: K-M curve for the low-risk and high-risk groups. B: ROC analysis of six prognosis genes. C: Risk score distribution. The Xaxis represents the number of patients and the Y-axis represents the risk score. D: Survival status of patients. The abscissa axis represents the number of patients and the Y-axis represents the follow-up time. E: Heat map of expression of the six genes. F: Nomogram for predicting overall survival (OS) at 3, 5, and 10 years in COAD-stage group. TULP1, CDH8, AC002076.10, CTD_2561B21.10, RP11_266L9.8, and hsa_mir_26b, were used to build the prognostic model. A total of 406 patients were divided into high-risk and low-risk groups, and patients in the high-risk group had a poorer prognosis than those in the low-risk group. The AUC of thee prognostic model was 0.814. These genes were up-regulated in the high-risk group. Patients in the high-risk group had a higher mortality rate and the expression level of survivalrelated genes was up-regulated in the high-risk group. These six potential independent prognostic factors were predicted using nomograms.