Colon Cancer Classification and Prognosis Prediction Based on Genomics Multi-features


 Background: To classify colon cancer and predict the prognosis of patients with multiple characteristics of the genome.Methods: We used the mRNA expression profile data and mutation maf files of colon cancer patients in the TCGA database to calculate the TMB value of patients. Combined with CNV, MSI, and corresponding clinical information, the patients were clustered by the "K-means" method to identify different molecular subtypes of colon cancer. Comparing the differences of prognosis, and immune cell infiltration, and other indicators among patients in each subgroup, we used COX and lasso regression analysis to screen out the prognosis difference genes among subgroups and construct the prognosis prediction model. We used the external data set to verify the model, and carried out the hierarchical analysis of the model to compare the immune infiltration of patients in the high and low-risk groups. And detected the expression differences of core genes in tumor tissues of patients with different clinical stages by qPCR and immunohistochemistry.Results: We successfully calculated the TMB value and divided the patients into three subgroups. The prognosis of the second subgroup was significantly different from the other two groups. The mmunoinfiltration analysis showed that the expression of NK.cells.resting increased in cluster1 and cluster 3, and the expression of T.cells.CD4.memory.resting increased in cluster3. By analyzing the differences among subgroups, we screened out eight core genes related to prognoses, such as HYAL1, SPINK4, EREG, and successfully constructed a patient prognosis evaluation model. The test results of the external data set shows that the model can accurately predict the prognosis of patients; Compared with risk factors such as TNM stage and age, the risk score of the model has higher evaluation efficiency. The experimental results confirmed that the differential expression of eight core genes was basically consistent with the model evaluation results.Conclusion: Colon cancer patients were further divided into three subtypes by using genomic multi-features, and eight-core genes related to prognosis were screened out and the prognosis evaluation model was successfully constructed. With external data and experiments, it verified that the model had good evaluation efficiency.


Background
Colon cancer is the third most common cancer in the world, which seriously threatens human health [1][2]. With the progress of surgical techniques and the emergence of new therapeutic drugs, the 5-year survival rate of colon cancer has gradually increased, but colon cancer is still one of the important causes of cancer-related deaths. In recent years, the mortality of colon cancer has dropped signi cantly which mainly due to the progress of adjuvant therapy technology. The choice of adjuvant therapy is mainly based on TNM staging of the tumor, which is the most valuable staging method for judging prognosis at present, and the colorectal staging is mainly based on the pathology and anatomy of the tumor [1]. However, TNM staging can not effectively distinguish the differences in tumor biological behavior caused by tumor heterogeneity, which leads to the lack of accuracy of adjuvant therapy based on TNM staging. Colon cancer is highly heterogeneous, and its biological behavior has wide genetic and epigenetic differences among different individuals or different lesions within the same individual, which can further lead to different prognosis and response to treatment. Because of the differences in biological behaviors of tumors, the prognosis and response to adjuvant therapy of colon cancer patients in the same TNM stage are quite different. Some studies have pointed out that there are differences in anatomical features, histological features, prognosis, and tumor biological behavior between left colon cancer and right colon cancer. According to the National Comprehensive cancer network Guidelines, patients with metastatic unresectable colon cancer have different therapeutic effects on cetuximab due to the mutation status of the RAS gene. Because of the clinical heterogeneity of colon cancer, it is always being a di cult problem to provide accurate and effective treatment for individual differences of patients. Classifying colon cancer at the molecular level, and understanding the molecular change characteristics of each subtype and its corresponding tumor biological behavior characteristics have important guiding signi cance for the treatment of colon cancer. In the era of precise treatment, there is an urgent need for a method to classify colon cancer according to tumor heterogeneity and classify tumors at the molecular biological level.
The occurrence of tumors is closely related to the accumulation of gene mutation. In 1914, CALKINS et al. [3] rst discovered that abnormal chromosome distribution may be related to the malignant tumor during cancer cell division, and then people began to explore the relationship between abnormal genetic material and tumor occurrence. Tumor genomes carry a variety of mutations, including single nucleotide variations, structural rearrangement, and copy number variations, CNVs). According to the number of gene mutations per million bases, we can get the tumor mutation load value. Studies have con rmed that patients can bene t from immunotherapy when TMB (Tumor Mutation Burden) is greater than 10. In addition, the change of gene expression and the mutation of microsatellite instability genes in colon cancer are closely related to the choice of treatment methods and prognosis of patients. Therefore, the comprehensive summary analysis of these features in the genome of colon cancer patients is conducive to the classi cation of colon cancer, so as to further stratify and accurately treat colon cancer patients at the molecular level and improve the prognosis of patients.
With the accumulation of biological data and the maturity of data mining technology, the application of advanced calculation methods is helpful to build a multi-factor comprehensive scoring system and solve the above problems. Therefore, this study integrates the mRNA expression pro le data, mutation maf le, CNV information, MSI information, and corresponding clinical information of colon cancer patients, and calculates the TMB of patients. The patients were clustered by multi-feature indexes of the genome, and different molecular subtypes of colon cancer were identi ed, and the patients were divided into different subgroups. Comparing the differences of prognosis, immune cell in ltration, and other indicators among patients in each subgroup, further screening the differential genes among subgroups with signi cant differences in prognosis by regression analysis and constructing a prognosis prediction model. The external data set was used to verify the model, and the hierarchical analysis of the model was carried out. The difference in immune in ltration and tumor purity of patients in high and low-risk groups were calculated, and established the Nomogram model to evaluate the prognosis of patients. We use the "maftools" function package in R language to calculate the tumor mutation load (TMB) of each sample according to the maf le. We use the "factoextra" function package of the R language, based on the "K-mean" method, and integrate CNV, TMB, and MSI data to cluster 379 samples. At the same time, the principal component analysis is carried out on the samples, and the results of clustering analysis are compared with those of subcomponent analysis to see if the results are consistent. For the obtained subgroups, do survival analysis between each cluster, and judge whether the differences among subgroups are signi cant. Compare CNV, TMB, and MSI among each subgroup, and visualize the results.

Immune differences between subtypes
Cibersort R package was used to analyze the immune cell in ltration of each subgroup, and the difference of immune cell in ltration between each group was compared. The difference of expression content of each immune cell in each group was compared and analyzed, and the results were displayed visually. The expressions of CTLA4, PDL1, LAG3, TIGIT, IDO1, TDO2, and other classical immune checkpoints in each subgroup were compared and analyzed.

Screening and enrichment analysis of different genes among subtypes
According to the difference of survival analysis between groups, we divided the groups into two groups: good prognosis and poor prognosis. Then, the limma[PMID: 25605792] function package of r language (version3.5.2, the same below) is used to analyze the difference of gene expression between the two subgroups, and the absolute value of Log2FC is greater than 1 and FDR ≤ 0.05 is used as the standard to screen the differentially expressed genes.
Cluster pro ler R package was used for enrichment analysis, and online analysis tools of string (HTTPS: / / www.string-db. org /) and metascape (http: / / metascape. Org /) were used for protein interaction analysis.

Regression analysis to construct prognosis model
Univariate Cox regression analysis was performed on colon cancer samples based on the expression value of differential genes, and the differential expression genes signi cantly related to the prognosis of colon cancer were screened with P < 0.05 as the threshold. Then LASSO Cox regression analysis was carried out with glmnet package of r language [PMID: 20808728] to further select differentially expressed genes related to the prognosis of colon cancer. And use the screened differentially expressed genes to calculate the Risk Score of each sample according to the following formula: 2.6 Coe is the risk coe cient of each factor calculated by the LASSO-Cox model, Xi is the expression value of each factor, and in this study, it refers to the expression value of mRNA. Then, r package survival, survminer, and bilateral log-rank test were used to determine the optimal cutoff value of the Risk score. according to the cutoff value, patients were divided into the Low-Risk group and the High-Risk group.

Veri cation of model by external data set
The time-dependent receiver operating characteristics (ROC) curve of the model was drawn by the r language survivalROC package [PMID: 10877287]. TCGA and GSE17536 data sets are used as veri cation sets to verify the model. the model is used to calculate the Risk scores of patients in the data sets, and the survminer package is used to nd the best cutoff value so that patients can be divided into high and low-risk groups. the r language survival package is used to estimate the overall survival rate of different groups based on the Kaplan-Meier method, and the log-rank test is used to test the signi cance of survival rate difference between different groups. A Multivariate Cox regression model was used to analyze whether the Risk Score can predict the survival of colon cancer patients independently of other factors.

Hierarchical analysis of the model
The model was analyzed hierarchically to explore the relationship between age, gender, clinical stage, and the prognosis of patients. According to clinical stage, gender, age over 65, and other criteria, patients were divided into two groups by using this model, and the survival rate of patients with different clinical characteristics was analyzed.
2.9 Difference of immune cell in ltration between high and low-risk groups and calculation of tumor purity We use the software CIBERSORT [PMID: 25822800] to calculate the relative proportion of 22 immune cells in each cancer sample. CICERSORT software will use the deconvolution algorithm to characterize the composition of immune in ltrating cells with preset 547 barcode genes according to the gene expression matrix. The sum of all estimated immune cell types in each sample is equal to 1. The estimate function package of r language [PMID: 24113773] was used to calculate the tumor purity of each cancer sample.

Establishing Nomogram prognosis prediction model
Nomogram is widely used to predict the prognosis of cancer. To predict the survival probability of patients in 1 year, 3 years, and 5 years, based on all independent prognostic factors determined by multivariate Cox regression analysis, we use R language rms package to establish nomogram, draw the calibration curve of the nomogram, and observe the relationship between nomogram prediction probability and actual incidence rate.
2.11 Clinical sample veri cation 100 cases of patients with stage I and stage III colon cancer(50 cases at each stage) were screened out from the First A liated Hospital of Nanchang University from June 2017 to June 2019. Inclusion criteria: (1) Pathologically con rmed (2) New cases diagnosed by this hospital for the rst time.
Exclusion criteria: (1) with other malignant tumors or diabetes, hypertension Systemic diseases such as cardiovascular and cerebrovascular diseases; (2) patients who have received radiotherapy, chemotherapy, or other anti-neoplastic drugs before surgery Tumor tissue and normal specimens(Tissues more than 5 cm from the outermost periphery of tumor tissues are regarded as normal tissues)from patients surgically were collected. GAPDH was used as the internal reference, the expression of selected RNA was detected by qPCR. Primers were shown in Table 1. Using the expression level of the GAPDH as the standard value "1", the relative expression levels of 8 RNAs in cancer tissues of a patient with stage I and III were calculated and draw statistical charts. The expression of prognostic-related genes in tumor tissues of patients with I and III stages of colon cancer was detected by immunohistochemistry.  According to the maf le, the TMB (Fig. 1A, Supplemented Table 1) of each sample is calculated rst, and then three kinds of data are integrated to cluster 379 samples using the "factoextra" function package. samples can be clustered into three categories (Supplemented Table 2). PCA analysis shows that the samples can be well divided into three categories (Fig. 1B), and PCA results are visualized in three dimensions (Fig. 1C). Comparing PCA results with cluster analysis results, it is found that the classi cation results are consistent, which shows that our clustering is more accurate. The survival analysis among the three clusters shows that there are signi cant differences among the subgroups (Fig. 1D). A comparative analysis of TMB and MSI among the three types of samples shows that the TMB of Cluster3 is signi cantly higher than that of the other two types of samples, and the proportion of MSI-H samples in Cluster3 is also signi cantly higher than that of the other two types of samples ( Fig. 1E-F). 3.2 Analysis of characteristic differences and immune in ltration among subgroups The mutations among the three subgroups were analyzed, and the results were shown in Fig. 2A, Fig. 2B and

Screening and enrichment analysis results of differential genes
According to the analysis of the difference of prognosis among the three subgroups, we found that the prognosis of patients with cluster1 and 3 was worse than that of patients with cluster2, and the difference was statistically signi cant; There is no signi cant difference in prognosis between cluster1 and cluster3 patients. Therefore, we combined the patients of cluster1 and cluster3 and analyzed the differential genes with those of cluster2. Comparing the gene expression differences between Cluster1 + Cluster3 samples and Cluster2 samples, we got 128 differentially expressed genes, including 51 upregulated genes and 77 down-regulated genes (Fig. 5G). GO enrichment analysis is carried out on the differential genes, The results showed that the differential genes were mainly related to anion transport, Negative Regulation of Cell Proliferation, Multicellular Organic Homeostasis, Humoral Immunization Response, Organic anion transport, positive regulation of defense response, epigenetic cell differentiation and tissue homeostasis are related to biological activities ( Fig. 5H-I). KEGG enrichment analysis showed that the differential genes were mainly enriched in alpha-Linolenic acid metabolism, linolenic acid metabolism, Fat digestion and absorption, Ether lipid metabolism, Arachidonic acid metabolism, Wnt signaling pathway, and etherlipid metabolism (Fig. 5J). The protein-protein interaction analysis results are shown in Fig. 5K-M. The results show that these differential genes can be divided into four interaction modules. CCK GPR143 TAC1 NPFFR1 GNG4 MUC5A MUC5B MUC6 DRD2 KY6G6D and other genes are the core genes and interact most closely with other targets.
Comparing the difference in the expression of differential genes between the two groups, it was found that the expression of differential genes between Cluster1 + Cluster3 samples and Cluster2 samples was signi cantly different (Fig. 6A).

Construction and veri cation of prognosis model
Univariate TCGA regression analysis was carried out with 128 differential gene expression values as continuous variables, and the Hazard ratio(HR) of each gene was calculated. Eight genes were obtained by screening with P value < 0.05 as the threshold, and the survival analysis of these eight genes was carried out (Fig. 6B-Fig. 6I). The best cutoff value is shown in Table 2. COX regression analysis showed that the HR value of 7 of these 8 genes was less than 1, which was bene cial to the prognosis. There is one gene whose HR value is greater than 1, which is a dangerous gene and unfavorable to prognosis (Fig. 7A). Then, the selected eight genes were analyzed by LASSO regression. according to the lambda values corresponding to different gene numbers in LASSO analysis, we determined that the optimal number of genes was eight (Fig. 7B,  (-0.0000407*B3GNT6) We calculated the risk score of each patient and divided the samples of TCGA data set and GEO veri cation set into high-risk group and low-risk group according to the median. Survival analysis found that in TCGA and GEO data sets, the overall survival of high-risk colon cancer samples was worse than that of low-risk patients (Fig. 7C and Fig. 7D). It shows that the risk model can effectively predict the prognosis of colon cancer patients in both data sets. Generally speaking, the results show that the Risk Score calculated by the evaluation model constructed by HYAL1, SPINK4, EREG, VWDE, ITLN1, ZBTB7C, KLK12, and B3GNT6 can better predict the prognosis of colon cancer patients.

Risk Score is an independent prognostic marker of colon cancer
To further explore the prognostic value of Risk Score in colon cancer samples with different clinicopathological factors (including age, TNM Stage and gender), we regrouped colon cancer patients according to these factors and performed Kaplan-Meier survival analysis. It was found that among patients with Stage I + StageII and StageIII + StageIV, patients with low risk score had better prognosis than those with high risk score (Fig. 7E-F). Among patients aged > 67 and ≤ 67, patients with low risk score had better prognosis than those with high risk score (Fig. 7G-H). In female and male patients, the overall survival rate of was signi cantly lower in the high-risk group than in the low-risk group (Fig. 7I-J). The difference was statistically signi cant. These results indicate that the Risk Score can be used as an independent index to predict the prognosis of colon cancer patients.
Then, we included age, MSI classi cation, TNM Stage, gender and Risk Score for multivariate Cox regression analysis to determine whether the Risk Score is an independent prognostic indicator. The result is shown in Fig. 8A. We found that the Risk Score, TNM Stage and age were still signi cantly correlated with overall survival, and the samples with high risk score had a higher death risk and were adverse prognostic factors (HR = 2.63, 95%CI:1.83-3.8, P < 0.001).

3.6.Nomogram model can better predict the prognosis and survival of patients
The Nomogram model (Fig. 8B) was successfully constructed by using three independent prognostic factors: age, TNM Stage and Risk Score. For each patient, draw up three lines to determine the Points obtained from each factor in Nomogram. The sum of these Points is located on the "Total Points" axis, and then a line is drawn down from the "Total Points" axis to determine the survival probability of colon cancer patients for 1, 3 and 5 years. The corrected curve in the calibration chart is close to the ideal curve (a 45-degree line passing through the origin of the coordinate axis with slope 1). The result shows that the predicted result of the model is basically consistent with the actual result ( Fig. 8C-E).

Immune status of colon cancer patients in high and low risk groups
CIBERSORT method and LM22 feature matrix were used to estimate the difference of immune in ltration among 22 kinds of immune cells in colon cancer patients with high and low risk groups, and the result was visualized (Fig. 9A-B). The result shows that there was a signi cant difference in the in ltration ratio of T cell CD4 memory resting immune cells between high and low risk groups, and its expression level was signi cantly higher in low risk groups which suggesting that the expression of T cell CD4 memory resting is related to the improvement of prognosis.

3.8.qPCR and immunohistochemical veri cation
There was no signi cant difference in baseline data between patients with Stage I and III (  (Fig. 10). The experimental veri cation results were basically consistent with the previous data analysis results.

Discussion
As a common tumor, the incidence of colon cancer is high and increasing year by year. Actively exploring the prevention, treatment and prognosis evaluation of colon cancer is of great signi cance to reduce its morbidity and mortality. In recent years, with the development of tumor molecular biology, researchers have gradually realized that tumor is also a genetic disease, and its occurrence and development are the result of long-term interaction between genes and environment [4]. The evaluation of tumor risk factors and genetic predisposing factors helps to further clarify the mechanism of cancer and contribute to clinical treatment. Because of the high heterogeneity among tumor patients, the traditional pathological and clinical staging can no longer better predict the prognosis and guide the treatment for patients. In order to predict the prognosis of individual patients accurately, it is necessary to stratify patients in more detail and precision at the genomic level, and divide them into different tumor subtypes, so as to carry out more targeted antitumor treatment. It is of great signi cance to nd out the types of cancer subtypes accurately, whether it is the understanding of cancer, the treatment of cancer or the practical clinical application.
Researchers found that the prognosis of patients with the same cancer type can be quite different, and even patients of the same pathological type have signi cant differences in prognosis [5][6][7]. Through NGS sequencing and related research on a large number of tumor tissue samples from colon cancer patients, We found that the prognosis is related to patients' TMB, MSI, CNV and other genomic characteristics [  [12] found the interaction between BRAF mutation and microsatellite instability (MSI) status were very important in determining survival outcomes after adjuvant 5FU based chemotherapy in Stage III colon cancer.Researchers had con rmed the correction between TMB and prognosis in patients with colorectal cancer treated with adjuvant Fluoropyrimidine and Oxaliplatin. Studies also found that TMB was associated with the treatment outcomes in patients with colorectal cancer [13][14]. Wang et al [15] studied the genome-wide expression pro ling-based copy number variations and colorectal cancer risk in Chinese.Their study suggested that the ampli ed copy number of HLA-DQB1 is associated with lower risk of colorectal cancer and able to induce the apoptosis of colon cancer cells.
Generally speaking, patients with TMB greater than 10 can bene t from immunotherapy, thus improving the prognosis. MSI mutation also plays an important role in guiding the selection of clinical chemotherapy drugs. A number of evidences also showed that the differences of patients' prognosis can be analyzed more accurately from the perspective of genomic differences, and colon cancer patients can be further divided into different hypotypes in detail at the genetic level, so as to explore the potential mechanisms and predict the prognosis of patients.
Therefore, we decided to analyze the genome sequencing data of colon cancer patients and calculate the TMB value of each patient; Furthermore, combined with CNV, MSI mutation and other genomic characteristic data indicators for comprehensive analysis, so as to use these genomic characteristics to subgroup colon cancer patients at the gene level.We analyzed the differences of prognosis and immune cell in ltration between the three groups wuth different genomic characteristics. We found that compared with the patients in the second subgroup, the prognosis of rst and third one was worse, and the difference was statistically signi cant. However, the prognosis curves of subgroup 1 and subgroup 3 were close, and there was no signi cant difference in prognosis between the two groups. Therefore, when using COX and LASSO regression analysis, we combined the patients in subgroup 1 and subgroup 3 rst, and then compared them with those in subgroup 2 to screen the differentially expressed genes related to prognosis. We eventualky identi ed eight core genes related to prognosis: HYAL1, SPINK4, EREG, VWDE, ITLN1, ZBTB7C, KLK12, B3GNT6, and used them to construct a prognostic evaluation model for colon cancer patients.
Jin et al [16] found that HYAL1 and HYAL2 played a suppressive role in the metastasis of colorectal cancer. The expression of hyaluronan synthase2 or hyaluronidase1 differentially could affect the growth rate of transplantable colon carcinoma cell tumors [17]. Wang et al [18] con rmed that downregulated SPINK4 was associated with poor survival in colorectal cancer. Another study demonstrated [19] that the serum SPINK4 level increased in CRC and was associated with the location and distant metastasis of CRC. They thought SPINK4 had a high diagnostic value in CRC but was not associated with the survival of CRC patients. Previous studies found [20] miR-215-5p-EREG/TYMS axis in colon cancer cells to produce resistance to 5-FU. Study showed that high EREG expression was predictive of better outcomes in rectal cancer patients receiving neoadjuvant concurrent chemoradiotherapy [21].
Qu et al revealed the activation of EGFR was depended on demethylation of the EREG promoter by integrated genomic analysis of colorectal cancer progression [22].
Human Intelectin-1 (ITLN-1) is a novel identi ed galactose-binding lectin that is expressed in the colonic goblet cells. Study has con rmed that the aberrant ITLN-1 expression in gastric cancer is correlated with clinicopathological features and may be a useful prognostic factor for predicting the outcomes of gastric cancer patients [23]. Yousef et al reported that the expression of KLK12 was down-regulated at the mRNA level in breast cancer tissues and was upregulated by steroid hormones in breast and prostate cancer cell lines. This gene may be involved in the pathogenesis and/or progression of certain cancer types and may nd applicability as a novel cancer biomarker [24].
We independently veri ed the model by using GEO and TCGA datasets to con rm its accuracy and credibility. Then we analyzed the relationship between age, sex, clinical stage, model risk score and the prognosis. The results showed that besides TNM staging and age, the risk score of the model was also signi cantly correlated with the overall survival time of colon cancer patients, which further a rmed the effectiveness of our model in evaluating the prognosis of patients. By analyzing and comparing the immune cell in ltration of patients in high and low risk groups divided by the model, it was found that the expression of T cell CD4 memory resting decreased in the high risk group. This suggests that this cell may be related to the prognosis of patients.
construction. Of course, our research also has a few shortcomings. For example, we did not use our own MSI, TMB and CNV data of colon cancer patients to increase the sample size or verify the grouping. Since not all patients complete the NGS sequencing in our department, we can only obtain more abundant data through cooperation in future research. To verify the expression of model genes, we used patients with different clinical stages to represent patients with different prognosis, and detected the expression level of each gene in tumor tissues of patients with different stages. However, there are a small number of stage colon cancer patients with poor prognosis. According to the experience of our center, this part of stage I colon cancer patients with poor prognosis may account for 1-3% of all stage I colon cancer patients. Since the sample size we used to verify the conclusion is not very large, a small amount of result deviation may occur, but we think this is statistically allowable.

Conclusion
Our study successfully calculated the TMB value of colon cancer patients by using mRNA expression pro le data and mutation information, and identi ed three subtypes by combining MSI and CNV genomic characteristic data. Through the analysis of each subgroup, we successfully screened out 8 core genes related to prognosis and constructed a prognosis prediction model. Ethical approval and consent to participate: No animals or humans were involved in this study. This study was carried out in accordance with the Declaration of Helsinki.
Authors' contributions: JSX: research design and drafting the manuscript; KLL CFW and QJY: literature search; HPW and XZW: review and revision of the manuscript and writing guidance. All authors approved this manuscript. Differentially expressed genes between Cluster 1+Cluster 3 samples and Cluster 2 samples; H: The network diagram of GO analysis results of differential genes between groups; I: Bubble chart of GO analysis results of differential genes between groups; J: Bubble chart of KEGG analysis results of differential genes between groups; K: Protein interaction analysis results of differential genes between groups; L: Topological network of protein interaction analysis results of differential genes; M: The core module obtained from the interaction analysis of differential gene protein.