A Mutational Signature that Predicts Prognosis and Benefit of Immune Checkpoint Blockade in Colorectal Cancer


 Background: Colorectal cancer (CRC) is characterized by broad genomic and transcriptional heterogeneity. However, the genomic basis of this variability remains poorly understood. Our pilot study identified mutated genes were associated with immune infiltration. This study aims to explore a novel mutational signature (MS) in tumor microenvironment (TME) of CRC.Methods: We integrated single nucleotide variation and transcriptome data and collected corresponding clinicopathologic information from 1,133 and 588 CRC patients of Memorial Sloan Kettering Cancer Center and The Cancer Genome Atlas databases, respectively. Single sample gene set enrichment analysis (ssGSEA) was used to identify the subtypes of CRC based on the immune genomes of 29 immune signatures. CIBERSORT was used to analyze the infiltration of 22 immune cell types in the TME and immune-related gene expression CRC tissues. Results: In the training cohort, we identified a novel MS consisting of 27 genes and generated a prognostic model that classifies patients into high- and low-risk groups. The low-risk group was associated with better survival and more tumor mutational burden, microsatellite instability, and mismatch repair deficiency. The data were all verified in the validation set. Further analysis revealed that the MS was associated with tumor immunogenicity and immunocyte infiltration, and the determined risk score (RS) could be an index for the immunity level.Conclusion: We identified a MS that could assist clinicians to select immunotherapy responsive patients and the combination of RS and TNM stage could provide comprehensive prognostic information for CRC.


Background
Colorectal cancer (CRC) is considered as a genetic disease, which arises from the stepwise accumulation of genetic and epigenetic alterations [1,2]. It is known that these alterations promote the dysplasia and tumorigenesis of CRC [3,4], but the genomic basis of this variability remains poorly understood [5]. TNM staging system is currently regarded as the standard for the staging of patients with CRC [6,7], but it is limited by variations among patients with the same tumor stage. Previous studies have shown that treatment response and survival rate of CRC patients depend not only on tumor staging but also on heterogeneous and epigenetic molecular features [8][9][10]. Thus, it is signi cant to identify an effective system that better predicts overall survival (OS) and treatment selection of patients with CRC.
Although pro ling studies have been carried out to identify patterns of gene and which might predict CRC survival and recurrence [11][12][13][14], expression pro le is greatly in uenced by physiological and pathological conditions that lead to poor reproducibility in the process of library construction. Moreover, exaction to RNA samples tends to degrade, leading to deviations and inaccuracies in the results of subsequent analyses. Consequently, RNA expression pro les are rarely used to predict patient survival and devise medication plans. In contrast, DNA is stable during extraction and detection. Furthermore, structural changes of amino acids and proteins caused by mutations play a signi cant role in tumor progression [15]. Thus, further analysis and validation in larger, independent cohorts in combination with mutated genes to predict prognosis are essential prior to application in a clinical setting.
The somatic mutations in a tumor are caused by multiple mutational processes [16,17]. Different mutational processes often generate distinct combinations of mutation types, termed a 'signature' [18]. The effect of such mutation processes can be modeled by a mutational signature, of which two different conceptualizations exist, as in the models introduced by Alexandrov et al. [19] and Shiraishi et al. [20].
Glowing evidence have revealed that tumor mutational burden (TMB) is correlated with response to programmed cell death 1 (PD-1) inhibitior and programmed cell death ligand 1 (PD-L1) in tumor microenvironment (TME), which is so called immune checkpoint blockade [21]. Therefore, an understanding of cancer characteristics based on TMB and mutational signature could provide new insights into mutation-driven tumorigenesis and progression (22).
Reliable prognostic biomarkers are needed to select patients at high-risk for a poor prognosis. It is particularly important to identify mutated genes that are associated with the immune microenvironment to achieve more precise application of immunotherapy. In our study, we constructed a predictive model characterized by mutated genes alone to estimate patient survival outcomes and immunity levels. Furthermore, we characterized 27 mutated genes at the molecular level and identi ed the mutational signature associated with the TMB, microsatellite instability (MSI), and mismatch repair de ciency (dMMR) in CRC. Gene set enrichment analysis (GSEA) revealed how these genes are involved in the immune response, and further analysis indicated that the risk scores represents an index of the immunity level. Thus, this model could offer opportunities to stratify CRC patients for optimal treatment plans based on genomic subtyping.

Patients and datasets
This study used data from two separate cohorts: the MSKCC cohort [23] as training cohort and the Cancer Genome Atlas (TCGA) cohort as validation cohort. The training cohort, MSKCC, contains data of 1133 colorectal adenocarcinomas samples. The simple nucleotide variation (SNV) and corresponding clinical information were collected from the cBioPortal (https://www.cbioportal.org/study? id=crc_msk_2017) [24]. The validation cohort, TCGA, contains 596 CRC tissues. The transcriptome expression pro les, simple nucleotide variation and clinical data were downloaded from the Genomic Data Commons (GDC) Data Portal (https://portal.gdc.cancer.gov/). The expression data was HTSeq-FPKM and count type. The detailed clinical information and molecular features are listed in supplementary material, Table S1.

Construction and validation of the prognostic mutational signature
The SNV data of the MSKCC cohort was analyzed using the MSK-IMPACT, a capture-based nextgeneration sequencing platform that can detect somatic mutations. As result, about 521 mutated genes were selected as potential candidate genes to construct a prognostic signature. To minimize the overtting risk, a penalized regression analysis was applied to construct a prognostic model. The LASSO algorithm test within the R package "glmnet" was used for variable selection and shrinkage, while the Cox model was used for the coe cient determination of the prognostic model. The generated prognostic model was applied to calculate the CRC risk-score of patients in both training and validation set. In order to nd out the optimal cutoff value between low or high-risk subgroups, an optimal ROC cutoff was used in the training dataset. The point representing optimal ROC cutoff was chosen. The Kaplan-Meier survival curve with the log-rank test and ROC curve were applied to evaluate the predicting power of the model in training and validation cohorts using R packages "survival" and "survivalROC". To examine the prognostic value of our risk-score model as an independent factor, the predicted risk-score and clinical characteristics selected with univariate model were analyzed using multivariate Cox model. The signi cant variables remained in the nal Cox model were used to establish a prognostic nomogram using the "rms" package in R. Calibration curves for three-year and ve-year were plotted to conform the predicted accuracy of the nomogram model.

Functional annotation and enrichment analyses
GO and KEGG pathway analyses were conducted for selected genes within the mutational signature using the R package "clusterPro ler" [25], and results were visualized by "GOplot" R package. To compare the potential immune mechanisms between the subgroups, single-sample Gene Set Enrichment Analysis (ssGSEA) was performed for both low and high-risk groups using the "gsva" R package [26].

Identi cation of mutational landscape
To determine whether the mutational landscape is different between the low and high-risk subgroups. We downloaded the SNV data of two cohorts from datasets. The r package "maftools" was used to visually delineate the mutational landscape for low and high-risk groups [27].

Evaluation of immune cell in ltration
The relationship between signature and immune microenvironment was examined as follow: Based on the ssGSEA result, the dataset of MSKCC cohort and TCGA were clustered into three robust clusters representing low, medium, and high immunity using R package "sparcl". The cluster distributions of low and high-risk were compared with each other. Next, the immune scores of different immune microenvironment states were calculated using the ESTIMATE algorithm test within the R package "estimate" [28]. The scores of each subgroups were compared with each other. Finally, the relative proportions of 22 different in ltrating immune cell types were estimated using normalized transcriptome data and CIBERSORT algorithm as described before [29]. CIBERSORT is a biology tool that uses the deconvolution method to analyze bulky gene expression data of 22 immune cell types. Different in ltrating immune cells between low and high-risk groups were identi ed and considered further research.

Statistical analysis
Statistical analysis was conducted using the statistical packages for R software (version 3.4.0). Means with standard deviations (SD) was calculated for continuous values, while frequencies were determined for categorical values. The signi cance between two different groups composed of continuous values was examined using Student-t tests or Wilcoxon rank-sum test, while the signi cance between the rates of different groups was determined using Chi-squared test. Kaplan-Meier analyses with the log-rank test were performed to access the difference between the survival rates of each group. Multivariate analysis was conducted for the variables, which were signi cantly associated with disease free survival in the univariate Cox regression model univariate analyses, using the Cox proportional hazards regression model. Hazard ratios and 95% con dence intervals were provided for the multivariate analysis. For functional annotation and enrichment analysis, FDR (false discovery rate) < 0.05 was selected as the threshold to identify signi cant terms or pathways. For the CIBERSORT algorithm test, only cases with CIBERSORT p-value < 0.05 were included in the corresponding analysis. Unless specially mentioned, all statistical tests were two sides and p-value < 0.05 were considered as statistically signi cant.

Flowchart of mutational signatures stabilization and clinical characteristics.
A owchart demonstrated the procedure how to analyze the mutational signatures and their correlation with clinical characteristics and immunity, as well as prognostic models based on the mutated genes is shown in Fig. 1. This study obtained data from two cohort studies, Memorial Sloan Kettering Cancer Center (MSKCC) database (n = 1,133) and The Cancer Genome Atlas (TCGA) database (n = 588), respectively. The training cohort, MSKCC cohorts, comprised 729 male patients (64.34%) and 404 female patients (35.66%), while the validation cohort, TCGA cohort, comprised 309 male patients (52.55%) and 279 female patients (47.45%). The clinical characteristics of all CRC patients were listed in Supplementary Table S1.
Construction and veri cation of the mutational signature prognostic classi er.
A mutational signature including 521 genes was identi ed to be related to survival in CRC by MSKCC cohort, and 27 mutated genes which were included in the classi er were identi ed by least absolute shrinkage and selection operator (LASSO) analysis ( Fig. 2A and B). The coe cients of the 27 mutated genes were shown in Table S2. Using Kaplan-Meier analysis, the mutational signature effectively strati ed CRCs into high-and low-risk groups, and the classi ed high-risk group showed a poorer OS compared with the low-risk group, which veri ed in both MSKCC and TCGA cohorts ( Fig. 2E and F). The receiver operating characteristic (ROC) curves indicated that the classi er had a strong predictive ability, as the area under the curve (AUC) values for 1-, 3-, and 5-year OS were 0.712, 0.670, and 0.682, respectively (Fig. 2G). To assess the predictive power of the classi er, we compared the area under ROC curves between the classi er and risk score, tumor location, M stage, and TNM stage. The result showed that the classi er had a better predictive power and accuracy than other clinical features (Fig. 2H). In multivariate Cox analysis of both the MSKCC and TCGA cohorts, the classi er was identi ed as an unfavorable prognostic factor ( Fig. 2I and J). The results in the univariate and multivariate analyses of prognostic factors shown in Table 1 revealed mutational signature is an independent, unfavorable prognostic indicator for CRC in both the MSKCC and TCGA cohorts.
Construction of a predictive nomogram in CRC.
Using the data of the training cohort, a nomogram was generated to predict the OS (Fig. 3A). The predictors included tumor location, M stage, TNM stage, and risk score, among which the risk score had the highest C-index. The calibration plots for the 3-and 5-year OS were well predicted in the training cohort (C-index = 0.666) and the validation cohort (C-index = 0.689) ( Fig. 3B and C). The predictive power of the nomogram comprising the mutational signature was compared to clinicopathological risk factors using ROC analysis. The result indicated that the OS was more accurately predicted by the nomogram than by the risk factors in both cohorts (Fig. 3D).
Decision curve analysis was used to quantify clinical application by net bene ts at different threshold probability in our nomogram model. Here we found that threshold probabilities of 0 ~ 0.43 and 0 ~ 0.65 were the most bene cial for predicting 3-and 5-year OS, respectively ( Supplementary Fig. S1A). Gene ontology (GO) analysis based on the 27 mutated genes demonstrated that mutated genes were mainly enriched in protein binding, beta-catenin binding, nucleus, nucleoplasm, and beta-catenin destruction complex assembly (Supplementary Fig. S1b). Kyoto Encyclopedia of Genes and Genomes (KEGG) functional enrichment analyses revealed that these 27 mutated genes were associated with endometrial cancer, acute myeloid leukemia, pathways in cancer, signaling pathways regulating pluripotency of stem cells, CRC, ErbB signaling pathway, prostate cancer, and thyroid cancer (Supplementary Fig. S1C).
Mutational landscape of signi cantly mutated genes in de ned high-and low-risk subgroups.
To explore the differences of genomic alterations between the de ned high-and low-risk groups, we analyzed the data containing somatic mutations from TCGA (https://portal.gdc.cancer.gov/) database. First, comparison according to mutation frequency revealed a signi cant enrichment of different mutations between high-and low-risk groups (Fig. 4A and E). The most frequently found mutation types were missense mutations, nonsense mutations and frameshift deletions. Analyzing the mutation frequency of both subgroups, a larger number of mutations were found in the low-risk group as compared with the high-risk group (Fig. 4A). More than 95% genes had a higher mutation rate in the low-risk group as compared with the high-risk group (Supplementary Fig. S2A and B). In addition, the high-and low-risk groups had a signi cant different distribution of the top 10 mutated genes ( Fig. 4C and G). These results suggested that there were signi cant differences in the mutated genes between the high-and low-risk groups.
A signi cant enrichment of oncogenic alterations in such genes as BRAF, ZFHX3, SOX9 and MTOR were found in right-sided primary tumors, while oncogenic alteration of APC was primary found in the leftsided primary tumors. Analyzing mutated genes in MSI and MSS CRC patients, it revealed a higher altered frequency of genes including BRAF, TCF7L2, ZFHX3, MTOR and DNMT1 in MSI patient group as compared with MSS patient group, though a signi cant enrichment of oncogenic alteration in APC gene was found in MSS patients ( Fig. 4D and H).
We observed signi cantly higher TMB in the low-risk group as compared with the high-risk group. Since mutational signatures are signi cantly correlated with TMB, which is positively correlated with tumor immune signatures and immunotherapy response, it can be speculated that mutational signature may be related to tumor immune activity and further affect immunotherapy response.
The mutational signature was associated with the genomic features of MSI and dMMR in CRC.
The MSI status is critical when considering immunotherapy and chemotherapeutic drugs as options for CRC patients [30]. MMR is the process by which potentially mutagenic misincorporation errors that occur during normal DNA replication are corrected and the absence of MMR results in increased accumulation of mutations [31,32]. To further characterize the classi ed risk groups, we examined the association between the de ned risk groups and other clinical characteristics using data of patients from both training and validation cohorts. The result showed that the outcome of the risk score was highly correlated with tumor location, hyper-mutation, MSI status, and TNM stage ( Table 2). The risk score was observed to be signi cantly associated with the status of MSI/dMMR in both training and validation cohorts ( Fig. 5A and B). In line with previous observation, the status of MSI was more common in low-risk group as compared with high-risk group (Fig. 5C). In addition, left-side tumors and TNM stage III-IV were more common in the high-risk group compared with the low-risk group ( Supplementary Fig. 1A and B).
In TNM stage III-IV patients, the risk score was much higher than in that of patients with TNM stage I-II ( Supplementary Fig. S1C). In addition, the de ned low-risk group was signi cantly associated with hypermutation (Fig. 5D). Furthermore, we observed that high-risk group exhibited signi cantly higher rate of low mutation than low-risk group (Fig. 5E), and the number of mutations was higher in the low-risk group than that in the high-risk group (Fig. 5F). These results showed the mutational signature was associated with TMB, MSI status, and dMMR in CRC. It demonstrated a potential value of this mutation signature model for characterization of immune environment and prediction of immunotherapy outcome.
Mutation signature was associated with immune activity by immunogenic pro ling identi cation.
To further clarify the relationship between mutational signature and immune-phenotyping, we analyzed single nucleotide variation (SNV) and transcriptome data in TCGA database. Immune activity differences between the high-and low-risk groups were determined by analyzing 29 immune-associated gene sets, which represented diverse immune cell types, functions, and pathways [33]. These gene sets were analyzed using the single sample gene set enrichment analysis (ssGSEA), an extension of GSEA, which could calculate separate enrichment scores for each pairing of a sample and gene set to quantify the activity or enrichment levels of immune cells, functions, or pathways in cancer samples [34].
On the basis of ssGSEA scores, we hierarchically clustered all CRC samples in TCGA dataset, and de ned the three clusters as Immunity-High, Immunity-Medium, and Immunity-Low. Tumor purity, stromal score, and immune score were analyzed for each CRC sample based on the Estimation of Stromal and Immune cells in Malignant Tumor tissues using Expression data (ESTIMATE) algorithm. A heatmap of the in ltration levels and scores of each sample of immune cells in the three subtypes was shown in Fig. 6A. The results showed that the stromal score was signi cantly higher in the Immunity-High cluster and signi cantly lower in the Immunity-Low cluster. Immunity scores and ESTIMATE scores were gradually reduced from Immunity-H cluster to Immunity-L cluster. However, the opposite trend was observed for tumor purity in comparisons of the three CRC subtypes (Fig. 6A). Principal component analysis (PCA) revealed marked differences between the three clusters (Fig. 6B), indicating that Immunity-H cluster contained the largest number of immune cells and stromal cells, while Immunity-L cluster contained the largest number of tumor cells (Fig. 6D). Furthermore, we found signi cant higher expression of most human leukocyte antigen (HLA) genes in Immunity-H cluster as compared with Immunity-L cluster (Fig. 6C). Moreover, a signi cantly higher expression level of PD-L1 gene was found in Immunity-H cluster while Immunity-H cluster was correlated with better survival outcome as compared with Immunity-L cluster ( Fig. 6E and 6F). According to distribution of these three clusters, we found that Immunity-H cluster was signi cantly enriched in the low-risk group (Fig. 6G), indicating that patients in the low-risk group might bene t more from PD-L1 inhibitor treatment.
To assess whether risk score was highly correlated with the immunity, we performed consensus molecular subgroups (CMS) classi cation [35], which give a more profound biological insight into metastatic CRC carcinogenesis, immunity typing, and has a strong prognostic effect. CMS1 is de ned by an upregulation of immune genes and is highly associated with microsatellite instability (MSI-h) [36], while CMS4 is de ned by an activated tissue growth factor (TGF)-β pathway and by epithelialmesenchymal transition (EMT) making it in general more chemo-resistant. As expected, patients were divided into four clusters (Fig. 6H), and the distribution analysis revealed that the CMS4 was signi cantly more representable for the high-risk group as compared to CMS1, while the low-risk group was more represented by CMS1 subgroup (Fig. 6J). In addition, a signi cantly higher expression level of PD-L1 gene was found in the CMS1 subgroup as compared with other CMS subgroups (Fig. 6I). Taken together, the data suggested that the low-risk group was mainly represented by CMS1 and had a higher PD-L1 expression, which might bene t more from PD-L1 inhibitor treatment.
Composition of immune cell pro les in the high-and lowrisk groups.
To further investigate the potential predictive value of the mutational signature for the immune status, we examined possible associations between the risk score and immune status. The risk score was negatively correlated with the TMB score (Fig. 7A) while TMB was positively correlated with immune score (Fig. 7B). Therefore, we postulated that the risk score was negatively associated with immune score. Since a high immune score was related with a better survival outcome (Fig. 7C), the low-risk group may also be associated with a better survival.
To Fig. out the in ltrated immune cell composition in the de ned risk groups, we analyzed the expression signature matrix of 22 in ltrated immune cell types in tumor samples from the TCGA cohort using the CIBERSORT test. Among the total samples, 63 low-risk and 63 high-risk samples were found to be eligible for further analysis. The different immune cell fractions were weakly correlated with each other in tumor tissues in the TCGA cohort ( Fig. 7D and 7E). Regarding to tumor-in ltrating immune cells in CRC microenvironment, reduced number of activated CD4 + memory T cells, but increased number of macrophages M0 were found in the high-risk group (Fig. 7F) as compared with the low-risk group. Finally, we analyzed the pathways that were signi cantly enriched in the high-and low-risk groups and found an enriched immunologic pathway in the high-risk samples (Fig. 7G). Taken together, these data suggested that mutational signature consisted of genes that are important regulatory components of the immune cell activation mechanisms. The predictive power of the mutational signature for OS might be dependent on the immune status of TME.

Discussion
CRC is considered as a genetic disease, which arises from the stepwise accumulation of genetic and epigenetic alterations that might promote the dysplasia and tumorigenesis of CRC. Advances in molecular biology stimulate the generation of large amounts of data that were used to construct multigene pro les, which can be used for risk strati cation and guidance for chemotherapy treatment in various types of cancers [37][38][39][40]. Therefore, exploring the dysregulated genes involved in carcinogenesis and disease development might help to improve prognostic and therapeutic strategies for CRC patients.
In this study, we generated a novel prognostic model based on a mutational signature classi er to predict the CRC overall survival and the e cacy of immunotherapy. The mutational signature classi er consisted of 27 mutated genes including APC and TCF7L2 that are relevant to the WNT signaling pathway and in uence the cancer cell metastatic ability [41][42][43], and BRAF and NRAS that are involved in the EGFR signaling pathway and associated with drug-resistance [44][45][46][47]. Using the generated prognostic model, the CRC patients from the cohorts were categorized into high-and low-risk groups. Comparing the global heterogeneity between high-and low-risk groups, the risk score was shown to be correlated with known predictive factor for the carcinogenesis and the therapy outcome such as the TNM stage, MSI status, hyper-mutation, and TMB. Furthermore, we demonstrated that the nomogram comprising the identi ed mutational signature classi er could better predict the OS as compared to clinic-pathological risk factors. This may due to the property of the mutational signature, which re ects the biological heterogeneity of these tumors. This new nomogram including the mutational signature might provide a simple and accurate method for predicting prognoses in CRC.
The immune status within TME plays a pivotal role during the tumorigenesis, and the immune response is a complex process in which various immune cells interact and play different roles. Studies have showed that the immune status could be a better prognostic predictor than the TNM stage, since the tumor progression was signi cantly dependent on the density of host cytotoxic-and memory T cell, higher density of these T cells was correlated with better survival outcome [44][45][46][47]. Tumor in ltrating lymphocytes (TILs) are immune cells leave the blood stream and migrate towards tumor; this population of immune cells contains T cells, B cells, and NK cells, which could exhibit anti-tumor functions. Some studies revealed that TILs are highly heterogeneous in intra-tumor and para-tumor areas [51,52], and could be associated with prolonged survival [53]. Therefore, better understanding of the immune status of the TME and exploring the distribution and function of immune cells are critical to improve the e cacy of immunotherapy in cancer.
In our study, signi cant higher amount of activated CD4 + memory T cells among TILs was found in the low-risk group. Further classi cation, previous data revealed that activated CD4 + memory T cells were mainly detected in early stage of tumor progression [54]. Since the low-risk group was correlated with a better survival outcome, it is reasonable to suggest this activated CD4 + memory T cell population may exhibit an anti-tumor effect in early stage of CRC.
Immunotherapy has raised as a novel effective treatment against CRC, however, the current standard therapeutic guidelines based on the TNM stage cannot re ect the information of host immune system response. In clinic, tests are required before taking immunotherapy, and only when certain conditions and levels are met might immunotherapy drugs be effective in patients. At present, the most commonly used clinical detection method is MSI status; the higher the degree of microsatellite instability, the more genetic errors the patient shows, and the greater the mutation load, which in turn triggers attack of the immune system on tumor cells. However, only 40% of CRC patients with MSI-H can bene t from immunotherapy. Our prediction model can not only predict the prognosis of CRC, but also further evaluate immune in ltration, accurately screen the population of immunotherapy subjects, and improve the e cacy of immunotherapy.

Conclusions
This study presents a novel valuable mutational signature that could be used to predict OS and support immunotherapy selection, but several limitations still need to be noted. First, our data were obtained from the MSKCC and TCGA datasets, which was not multicenter cohorts, our mutational signature and nomogram require further validation in prospective studies and multicenter clinical trials. Second, the TCGA database mainly contains data from people of European descent, which means that the result from these data cannot be directly extrapolated to other racial groups. Third, the biological contribution of candidate genes of the mutational signature, such as ZFHX3 and DNTM1, to OS remain unclear. Further investigations to elucidate the biological mechanisms of these genes might provide novel targets and treatment strategies.
Despite these limitations, we have identi ed a novel mutational signature, which can generate a prognostic tool to effectively classify CRC patients into groups with different OS risks. Moreover, the mutational signature classi er can be used to predict the effective patients to immunotherapy and nomogram comprising the mutational signature could help clinicians in directing personalized therapeutic regimen selection for patients with advanced CRC.

Declarations
Authors' contributions