Development and Validation of a Novel 13-Genes Prognostic Model for Colon Cancer

Colon cancer is one of the most common malignant tumors in the world. The purpose of this study is to explore the prognostic value of genes in colon cancer. After analyzing gene expression proles, differential expressed genes between 39 normal tissues and 398 tumor tissues were identied from The Cancer Genome Atlas database. We use Cox and lasso regression to nd genes related to prognosis. Through analysis, 13 genes were found to predict the overall survival of colon cancer patients. In addition, the external comparing of gene expression and the single prognostic gene survival analysis were made. Finally, pathway enrichment and mutation status of each gene were also analyzed. After a series of bioinformatics analysis, we select 13 survival-related signature and established a prognostic risk model based on these genes. The prognostic risk model was developed to comprehensively predict the overall survival of colon cancer patients. The prognostic value of the 13-genes related risk score for each colon cancer patent was calculated to predict the survival. Furthermore, ve genes (SIX2 MIR6835 LINC02474 CLDN23 HOXC11) were signicantly associated with overall survival (OS). The KEGG pathway enrichment results suggested that most of the pathways are related to the occurrence, metabolism, proliferation and invasion of the tumor cells. It was found that the expression of 13-genes signature can be used as prognostic indicator for colon cancer patients. The 13-genes signature predictive model may help clinicians provide a prognosis and personalized treatment for colon cancer patients.


Introduction
Colon cancer is the main malignant tumor of the digestive tract and ranks the third among malignant tumors in terms of global incidence [1][2][3].Although the morbidity and mortality associated with colon cancer have gradually decreased in recent years, colon cancer is still the most common cause of cancer related death, and long-term survival rates have hardly improved [4][5][6]. The accurate prognosis of colon cancer patients is essential to provide effective personalized treatment. According to some reports, the prognosis of colon cancer patients is mainly determined by clinicopathological characteristics and tumor stage. However, due to the heterogeneity of the disease, it is di cult to determine the prognosis of patients only based on these traditional risk factors [7,8].Therefore, in order to improve the prognosis of colon cancer patients, it is very important to better understand the pathogenesis of the disease and discover some prognosis-related biomarkers. However, nowadays most studies on the survival rate of colorectal cancer patients do not separately analyze the patients with colon and rectal cancer [9,10].The prognosis of these two types of malignant tumors, metastatic potential and treatment methods are different [11,12]. Thus, in this study, we built a prognostic model containing 13 genes applying Cox and the LASSO model (Least Absolute Shrinkage and Selection Operator), on which only focused colon cancer patients in TCGA-COAD dataset. We aimed to construct a highly reliable risk prediction prognostic model for colon cancer patients.

Collection and processing of expression pro le data
The gene expression data and clinical phenotype information of all colon cancer samples were downloaded from The Cancer Genome Atlas (TCGA) on September 3rd, 2020. The TCGA colon gene data from 437 samples include 398 tumor tissues and 39 normal tissues. In addition, the clinical data of 437 patients were also downloaded, including age, gender, tumor histological grade, survival time and survival status. Data sorting and ID transferring were done through Perl project.

Differential gene analysis
The Wilcox test method was used to analyze the difference between COAD samples and normal samples in the gene expression pro le by limma package with R project (version 3.40.0). |Log2 fold change (FC)|≥1 and false discovery rate (FDR) < 0.05 were used to identify the differentially expressed genes (DEGs).

Establishment of genes prognostic risk model
All samples are randomly divided into training set and testing set, for the training set, the Cox regression analysis in R survival package (version 2.44-1.1) was used to analyze the regression coe cient and FDR value of each gene in relation to survival time. The genes with FDR value < 0.05 were initially considered as the genes related to prognosis. Subsequently, lasso regression analysis with glmnet and survival package was used to further reduce selected genes. In order to build the cox prognosis model, the calculation of risk score for each patient was de ned as follows: Risk score = β gene1*expression (gene 1) + β gene2 * expression (gene 2) + ... + β gene n * expression (gene n), Where, β is the prognostic correlation coe cient beta estimated by Cox analysis which equals to log (Hazard Ratio), and expression represents the expression value of corresponding gene.
According to the risk score, the training set and the testing set patients were divided into high risk and low risk subgroups according to the median value of risk core. Then Kaplan-Meier survival curve analysis was performed to calculate the difference in survival prognosis time between samples of high risk and low risk group using the R survival package. Receiver operating characteristic (ROC) curves were constructed by using the R survival ROC package(version 1.0.3). The Area Under the Curve (AUC) was calculated to assess the predictive value of the models.

Pathway enrichment analysis
The Gene set enrichment analysis (GSEA) (http://software.broadinstitute.org/gsea/) was carried out to reveal the signaling pathways between high and low expression of each single gene based on TCGA-COAD dataset. Each single gene set"c2.cp.kegg.v7.0.symbols.gmt (Curated)" was utilized for GSEA. Of note, 1,000 permutations were used in the analysis of each parameter to calculate the enrichment scores.
The P value < 0.05 and normalized enrichment score (NES) were implemented to determine vital enrichment pathways among the subgroups. The multiple GESA pathway graph of each gene was constructed using the ggplot2, plyr, grid and gridExtra packages in R project.

Statistical analysis
R version 3.5.3 (http://www.R-project.org), was applied for statistical analysis and graph plotting. A 2sided P-value < 0.05 was determined to be statistically signi cant for Kaplan-Meier survival analysis. The Area Under the ROC Curve (AUC) was calculated to evaluate the diagnosis performance of the identi ed signature. Univariate and multi-variate Cox regression analyses were conducted to determine the prognostic performance of risk factors.

Identi cation of differential genes between normal tissues and colon cancer
In the TCGA-COAD dataset, there are 398 colon tumor samples and 39 normal samples. The gene differential expression pro le of normal tissues and colon cancer of patients were calculated using the R project. A total of 1593 genes were identi ed as DEGs with FDR value < 0.05 and |log 2 FC| > 2.

Construction of a 13-genes signature prognostic model in the training set
Among the 1593 genes, the DEGs in each patient was displayed. The patients were randomly divided into training set and testing set. According to clinical information, we combined the survival time of the training set with DEGs expression to analyze the genes that may affect the patients' clinical prognosis. Through Cox regression analysis of the training set, the single factor signi cance was de ned as P value < 0.01, and prognosis-related genes were selected. The hazard ratio (HR) and the 95% con dence interval (CI) of each variable were shown in Table 1.There were a total of 48 genes related to prognosis. The candidate genes were further reduced to 15 genes through the lasso regression method to reduce the over tting of the model (Fig. 1). Finally, through the following stepwise variable selection procedure, we established an optimal Cox proportional hazard model containing 13 key genes: MIR6835(microRNA 6835),TYRO3(TYRO3 protein tyrosine kinase),SRCIN1(SRC kinase signaling inhibitor 1),HOXC11(homeobox C11),CLDN23(claudin 23),UPK2(uroplakin 2),TNNI3(troponin I3),KLHL35(kelch like family member 35),KRT6B(keratin 6B),HAND1(heart and neural crest derivatives expressed 1),IL23A(interleukin 23 subunit alpha),LINC02474(long intergenic non-protein coding RNA 2474),SIX2(SIX homeobox 2)( Table 2). Thepredicted risk score can be calculated with the formula: 0.325257866*MIR6835 + 0.367503693*TYRO3 + 0.418341346*SRCIN1 + 0.134589791*HOXC11 + 0.123203974*CLDN23 + 0.205044971*UPK2 + 0.236533736*TNNI3 + 0.15943496*KLHL35 + 0.009257475*KRT6B + 0.083391046*HAND1 + 0.09208856*IL23A + 0.255146739*LINC02474 + 0.079143843*SIX2. The forest graph of risk genes was shown ( Fig. 2A).The samples of the training set were separated into high-risk and low-risk subgroups according to the median value of risk scores. In order to evaluate the diagnosis prediction value of the model, the time-dependent survival ROC curves were also established. At the training set, the predictive capability of the genes remains well, with an AUC of 0.823 (Fig. 2B).The Kaplan-Meier survival curve also showed that the survival rate of the high-risk group was signi cantly reduced (P < 0.01, Fig. 2C).The risk distribution plot and heat map of gene expression in the training set are displayed (Fig. 2D-2F).

External validation in testing set
The signature's predictive performance was further validated in the testing set. Using the same formula established in the training set, the risk scores of each patient in the testing set was calculated based on the relative expression levels of 13 genes. In the testing set, the AUC of the 13-gene signature used to predict patient survival was 0.702 (Fig. 3A). The Kaplan-Meier curve generated from the testing set showed that the survival outcome of low risk group patients was signi cantly better than that of high risk group patients (Fig. 3B), which indicated that it has moderate diagnostic performance. The risk distribution plot and heat map of gene expression in the testing set were displayed (Fig. 3C-E). These results suggested that the prognostic model has better sensitivity and speci city.
3.5 Survival analysis of the prognostic 13 candidate genes signature in COAD In order to further investigate the speci c relationship between the 13 individual genes with OS in colon cancer patients, a comprehensive survival analysis was performed using the Kaplan-Meier method. The results showed that ve genes (SIX2 MIR6835 LINC02474 CLDN23 HOXC11) were signi cantly associated with OS, while the other candidate genes were no signi cantly correlated with OS in COAD (Fig. 5).

Pathway enrichment analysis
To further investigate the correlated pathway of each candidate gene in COAD, KEGG pathway enrichment analysis of these genes was obtained with GSEA. The upregulated and downregulated pathways of each single gene were shown in GESA multiple pathway graph (Fig. 6).The GESA pathway results suggested that most pathways are related to the occurrence, metabolism, proliferation and invasion of the tumor cells.

Identifying the mutation status of candidate genes in colon cancer
The genetic alterations of 13 candidate genes were examined to determine their role in colon cancer genes in GEPIA database. Among the 110 COAD patients, a total of 30 (27.27%) patients had changes (cbioportalCPTAC-2 Prospective, Cell 2019 cohort) (Fig. 7). The alteration rates of CLDN23, TYRO3 and SRCIN1 in this dataset were 9%, 9% and 10% respectively. Frequent genetic alterations indicated that these genes play a vital role in the development of colon cancer.

Discussion
In this study, we identi ed a total of 13 genes as signature genes that is associated with the patients' survival from the TCGA colon cancer datasets through the training set. We further veri ed that these ndings were consistent in the testing set. The prognostic risk scores of colon cancer patients for precise predictions was important. In the prognostic model, our results showed that patients' survival with high risk scores was decreased, compared to that with low risk scores.
Among the 13 genes that built the model, much of them are associated with cancer. For example, CLDN23 is down-regulated in colorectal cancer tumors, and its level down-regulation is related to the prognosis of colorectal cancer patients [13].Changes in UPKs levels in urothelial carcinoma are considered to be useful markers for diagnosis, detection and prediction of urothelial carcinoma [14,15].TYRO3 plays a role in promoting the survival and growth of cancer cells[16] and the knockdown of TYRO3 by siRNA prevents melanoma cell migration and invasion [17].Knockdown of KHLH35 in HEK293 cells promoted anchorage-independent growth, indicating that it may play a role in tumorigenesis [18,19]. HAND1 gene has been found to be silent and hypermethylated in human gastric cancer [20],pancreatic cancer [21] and ovarian carcinomas [22].A study shows that SIX2 can promote breast cancer metastasis [23].IL-23 may be related to the progression of bladder cancer [24,25].The KRT6B is associated with the increased risk of lung cancer[26, 27],breast carcinoma [28], and urothelial cancer [29]. When expressed outside the context, HOXC11 can abnormally promote proliferation, thereby promoting tumorigenesis [30].SRCIN1 inhibited the invasion of highly metastatic breast cancer cells by inhibiting corticosteroid-dependent cell movement [20,25,31].For the other three genes (MIR6835, LINC02474,TNNI3) among the predicted scoring model, their roles in human cancers have not yet been fully investigated. Strength of this study is that we analyzed KEGG pathway enrichment with GSEA and identi ed mutation status for each candidate gene in GEPIA database. We also made the external comparing of gene expression in Oncomine datasets and the OS of each gene based on TCGA-COAD dataset. However, there also are some limitations in this research. First, the differential expression of 13 genes was only identi ed from the TCGA colon dataset, and experimental veri cation was lacking. Although TCGA's DNA-seq data is of high quality, it still needs further experimental veri cation. In addition, further studies on the functions of these 13 genes in colon cancer are needed in vitro and in vivo.

Conclusion
In summary, our study revealed a 13 genes signature related to the prognosis of colon cancer patients. Besides, the 13genes predictive model may be useful for survival prediction and diagnosis in colon cancer patients, which may increase the effectiveness of personal treatment for colon cancer patients. Moreover, some gene signature displayed vital biological function, which can potentially be used for further biological research. Pre-clinical studies followed by clinical trials are needed to validate our ndings in the future.

Declarations Acknowledgement
We are grateful to the contributors of data to public databases including TCGA, Oncomine and cBioPortal database.

Authors' contributions
Zhirong Yang designed and revised the current study. Duo Yun performed the analyses, calculations and wrote the rst manuscript. All authors read and approved the nal manuscript.

Funding
No funding.

Availability of data and materials
The datasets generated and/or analyzed during the current study were TCGA(https://portal.gdc.cancer.gov/).    Veri cation of 11 candidate genes expression in COAD and normal colon tissue using the Oncomine database.

Figure 5
The comprehensive survival with clinical characteristics regarding the OS of the 13 individual genes of the colon cancer patients were analyzed by the Kaplan-Meier method.