Tumor Microenvironment-Aware, Single-Transcriptome Prediction of Microsatellite Instability in Colorectal Cancer


 Background: Detecting microsatellite instability (MSI) in colorectal cancers (CRCs) is essential since it is therapeutic strategy determinant feature, including immunotherapy and chemotherapy. Yet, no attempt has been made to exploit transcriptomic profile and tumor microenvironment (TME) of it to unveil MSI status in CRC. Methods: Hence, we developed a novel TME-aware, single-transcriptome predictor of MSI for CRC, called MAP (Microsatellite instability Absolute single sample Predictor). MAP was developed utilizing recursive feature elimination-random forest with 466 CRC samples from The Cancer Genome Atlas, and its performance was validated in independent cohorts, including 1118 samples. Results: MAP showed robustness and predictive power in predicting MSI status in CRC. Additional advantages for MAP were demonstrated through comparative analysis with existing MSI classifier and other cancer types. Conclusions: Our novel approach will provide access to untouched vast amounts of publicly available transcriptomic data and widen the door for MSI CRC research and be useful for gaining insights to help with translational medicine.


Background
Microsatellite instability (MSI) is characterized by genetic hypermutability due to impaired DNA mismatch repair (MMR) system [1]. MSI is observed in many solid tumors, including gastric, and endometrium cancers, as well as in colorectal cancer (CRC, approximately 15%) [2,3]. Exhibiting prognostic and predictive features of a high tumor mutational burden, a high neoantigen load, and an immune-active tumor microenvironment (TME) characterized by high levels of tumor-in ltrating lymphocytes and overexpression of immune checkpoint molecules, cancers with MSI are known to be great candidates for immune checkpoint inhibitors (ICIs) treatment, such as pembrolizumab and atezolizumab (anti-PD-1 and anti-PD-L1 monoclonal antibody, respectively) [4,5]. In addition, MSI primary tumors face a better prognosis than patients with microsatellite stability (MSS) tumors [2]. MSI status is meaningful as a predictive indicator for cancer treatment and as a prognostic determinant, identifying a patient's MSI status is essential in clinical setting and research areas.
With recent escalation of its importance in CRC, it has been explored from publicly obtained samples, such as The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) database, resulting in numerous studies which broaden our understanding in MSI and expand therapeutic options for MSI CRC patients [6]. However, prior to utilization, MSI status information must be provided beforehand by quantifying the extent of genomic events in microsatellite loci, at genomic level, using the Bethesda Panel, a PCR-based ve marker panel, or by examining the loss of mismatch repair proteins using immunohistochemistry (IHC) at the protein level [1]. Additionally, with recent advances in next-generation sequencing (NGS) technology, MSI predictors, such as MANTIS [7,8], MSIsensor [9], and MOSAIC [10], have been developed to extract MSI status from whole exome and whole genome sequencing data. However, assigning the MSI status from expression data had not been possible until a k-Nearest Neighbors (k-NN, k=5) classi er called preMSIm using 15 gene-set for pan-cancer was recently proposed [11] and several other attempts which had been made, although the software has not been made readily available [12,13].
The preMSIm has constructed as a pan-cancer MSI predictor by utilizing three MSI dominant carcinomas as training data [11], but it did not re ect the distinct expression pro les of its cancer of origin.
Furthermore, individual MSI tumors have unique tumor microenvironment (TME) and molecular pathway characteristics. For example, immune in amed MSI microenvironment could be characterized by higher in ltration of anti-tumorigenic immune cells, such as adaptive immune cells (T and B lymphocytes) and innate immune cells (dendritic cells, macrophages, and natural killer cells) than immune desert MSS tumors, and, in CRC, when mutations or activation of MYC and RAS pathways occur, chemokine CCL9 is expressed and an immunosuppressive environment is established, which prevents enrichment of cytotoxic NK cells and T cells around the tumor [14]. Therefore, transcriptome based MSI predictor which integrates both TME and molecular pathway characteristics in CRC is needed.
Here, in this study, we have developed an enhanced single-sample MSI classi er called MAP (Microsatellite instability Absolute single sample Predictor) that integrates transcriptomic characteristics of TME and molecular pathways to predict MSI in CRC. Our TME and molecular pathways aware predictor will open a way to utilize CRC expression data to elucidate MSI CRC. Hence, massive amounts of publicly available expression data without MSI status will be utilized to drive valuable MSI CRC research through our novel approach, and, furthermore, to give patient bene t in clinical setting.

Dataset acquisition
For the discovery cohort, 581 RNA-seq data (rsem.norm.expression) from TCGA-COADREAD were downloaded from the TCGA data portal (https://portal.gdc.cancer.gov/). Matching data on MSI status (82 MSI and 499 MSS) was downloaded from The Cancer Imaging Archive (TCIA) (https://tcia.at/). For the validation cohort, 106 RNA-seq data from 24 MSI and 82 MSS samples (rsem.norm.expression) from an independent study [15] were downloaded. MSI-low tumors were grouped with MSS tumors as in previous studies [16,17]. The gene expression values of RNA-seq were log-transformed (with base 2) for analysis.
Information on consensus molecular subtype (CMS) classi cation was obtained from the Colorectal Cancer Subtyping Consortium for all array datasets [2]. For cases with missing CMS information, CMS labels were inferred by using the random forest (RF) method provided by the CMSclassi er R package [2].
Genes covered in both of the discovery and validation datasets were used for further analysis.

Development of the MAP predictor
Development of a gene-based predictor (MAPgene) A schematic drawing of the MAP development process is provided in Additional le 1: Figure S1. To select informative genes for MSI prediction, we rst identi ed differentially expressed genes (DEGs) between MSI and MSS tumors using the Wilcoxon rank-sum test in the discovery cohort. To construct and train a prediction model, RNA-seq data were divided into training and internal validation datasets at a ratio of 4 to 1. To extract the most discriminative genes from the DEGs, the recursive feature eliminationrandom forest (RFE-RF) algorithm was used on the 466 training dataset. Brie y, feature selection was conducted by the backward selection method, wherein the RFE-RF repeatedly constructed an RF model by eliminating features with the least importance. The selection process was repeated 100 times, applying an upsampling approach to the MSI group (due to the small group size) using caret [23] and randomForest R package. The nal model (MAPgene) was then selected based on that with the best area under a receiver operating characteristic curve (AUC) for 31 genes.

Development of an absolute, gene-pair-based predictor (MAPpairs)
To make the MAPgene model absolute (i.e., to predict MSI status from a single patient without comparison to a reference cohort or sample-wise normalization), a new model (MAPpairs) was developed using pairwise gene expression values instead of single gene expression values. A total of 465 ( 31 C 2 ) gene-expression pairs were generated for the selected 31 genes. These gene expression pairs were then converted to rules that indicated the relative over-or under-expression between two genes. For example, if the expression of gene A was higher than that of gene B, the rule (gene A > gene B) was generated.
Another RFE-RF model was then constructed using the 465 rules and trained with a ve-fold crossvalidation. Similar to the feature selection procedure, RFE-RF was applied with a ve-fold cross-validation and repeated 100 times. The nal absolute model was selected according to its AUC.

Development of a tumor microenvironment-integrated model (MAPsig)
To construct a more sophisticated model, we exploited the molecular differences in cancer-, immune-, and TME-related signatures between MSI and MSS tumors. We collected 101 signatures, including immune and stromal cells (TCIA and MCP-counter) [24,25], cancer hallmarks from MSigDB[26], immune-related signatures, such as epithelial and mesenchymal signatures [27]; stromal and immune signatures[28]; immunoinhibitory signatures and immunostimulatory signatures) [24]; T-cell-in amed gene expression pro le (GEP) [29] and IFN-γ expanded signatures [29]; cell cycle signature [30]; cell cycle regulator [31]; mismatch repair (KEGG), C-ECM signature [32]; angiogenesis, HLA class I and II family signature [33]; proin ammatory cytokines and chemokines [33]; CD8 T cells (Teff) [34]; and the MAP signature. To obtain signature scores for each individual sample, single sample gene-set enrichment analysis (ssGSEA), with ssgsea.norm = F, was applied for the signatures above. Additionally, for cross-platform comparability, the acquired score was adjusted to a value between 1 to 10. We used the same modeling method as that for the MAPgene and MAPpairs models, although with different input values. Finally, the MAPsig model and features were selected for inclusion in the nal according to those that provided the best AUC.

Model re nement
When applying the MAPpairs model, we noted that true MSI samples tended to be classi ed with MSI at a probability much higher than 70%. Thus, only samples with a probability of having MSI that was more than 70% were assigned MSI status. Samples with a predicted probability of MSI that was lower than 70% were further examined by applying the MAPsig model to determine nal MSI status, as it showed high overall AUC, accuracy, and speci city, but low sensitivity, making it of use in only verifying a true MSS sample. The software is available at https://sourceforge.net/p/mapmsi/wiki/MAP/.

Validation dataset
To evaluate the predictive performance of the MAP model, we employed RNA-seq data for CRCs (N=106) with log 2 -transformed rsem.norm data. Also, to assess platform independency and the applicability of MAP on different array datasets, we collected data for ve cohorts. In the microarray datasets, the probes per gene were selected using Jetset (http://www.cbs.dtu.dk/biotools/jetset/) [35]. The array datasets were processed using fRMA R package per sample [36]. A total of ve datasets were evaluated for the following: accuracy, sensitivity, speci city, F1 score, and balanced accuracy. All information on the datasets is provided in Table S1. For RNA-seq of stomach adenocarcinoma (STAD), and uterine corpus endometrial carcinoma (UCEC) were downloaded from the TCGA data portal (https://portal.gdc.cancer.gov/).

Consistency of genes in a microsatellite instability classi er model based on gene expression
To verify the consistency of feature genes with discriminative value in an MSI classi er model using gene expression, the Wilcoxon rank-sum test was used to analyze the external RNA-seq validation dataset. In addition, to assess the utility of MAP for MSI prediction, we calculated MAP signature scores (31-gene-set signature) using ssGSEA and compared them between MSI and MSS groups, as well as among MSI CMSs, using the Wilcoxon rank-sum test and Kruskal-Wallis test.

MSI signature construction at UCEC and STAD
To investigate the MSI signature that can distinguish MSS and MSI in each cancer types, the same method was applied when constructing the MAP signature, except that the P < 0.02 and |log 2 fold change| > 1 criteria was applied to identify su cient number of DEG from two types of cancer. TCGA-UCEC and STAD expression dataset were download TCGA-STAD and UCEC RNA-seq data were downloaded from EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.txt le at https://gdc.cancer.gov/about-data/publications/panimmune. In this le, only 12 of 15 signatures of preMSIm existed. The missing genes were HENMT1, NOL4L, and RTF2.

Statistical analysis
Comparisons of two groups were conducted using the Wilcoxon rank-sum test, while comparisons of multiple groups were performed using the Kruskal-Wallis test. All statistical analyses were conducted using R language software (https://www.r-project.org/).

Overview of MAP development
As an MSI single sample predictor (SSP), the MAP model was developed with the following four components (Additional le 1: Figure S1): 1) identi cation of the MAP signature (MAPgene model); 2) modeling based on pairwise gene expression of the MAP signature genes (MAPpairs model); 3) modeling based on ssGSEA scores of cancer-, molecular-, TME-, and immune-related signatures (MAPsig model); and 4) post-re nement of the nal model and prediction of MSI status. To develop an SSP of MSI status without relying on a relative approach (e.g., comparing a patient's data with other samples) and with limited platform bias, we constructed a recursive feature elimination-random forest (RFE-RF) model (MAPpairs model) with pairwise gene comparisons, leveraging an informative gene-set (MAP signature from the MAPgene model), rather than gene expression pro les, on a training dataset. In brief, RFE trains the model, ranks the features, and selects features through the process of repeatedly removing lowerranked features [37]. The method of building a model by selecting features with the RFE method based on the RF algorithm is called RFE-RF [37]. We built another RFE-RF model (MAPsig model) based on ssGSEA scores for 101 signatures to re ect the degree of activity of cancer-, immune-and TME-related signatures of the samples. To select the best RFE-RF model from the parts mentioned above, we evaluated the area under the receiver operating characteristic curve (AUC) and con rmed the model performance in validation datasets (Table S1). Finally, at the post-re nement stage, an integrated MAPpairs and MAPsig model was used to precisely predict MSI status. We named this nal model MAP and evaluated its accuracy, kappa, sensitivity, speci city, F1, and balanced accuracy in the validation datasets (Table S2).

MAP signature
To minimize the size of the informative gene-set utilized in the MAPgene model, we, rst, identi ed differentially expressed genes (DEGs) between MSI and MSS samples using the Wilcoxon rank-sum test. We assessed 718 DEGs with criteria of P < 0.001 and |log 2 fold change| > 1, and selected a gene-set of 31 genes by performing RFE-RF modeling with an AUC of 99.2%. We called this gene-set the MAP signature (11 up-and 20 down-regulated DEGs, Fig. 1a and Additional le 1: Table S3). Among genes comprising the MAP signature, the MLH1 gene, which is commonly downregulated and/or hypermethylated in sporadic MSI samples, ranked as the top feature gene, based on both accuracy (the importance of the features that improves classi cation accuracy) and Gini index values (the importance of the features that reduces the impurity of classi cation) (Fig. 1b and Additional le 1: Table S3). We also noted that LY6G6D[38] and EPM2AIP1 genes [39], which share a promoter with MLH1, were included in the gene-set. Additionally, we found that a known predictive marker for chemotherapy, thymidylate synthase (TYMS) [40], was included in the gene-set, and its expression was higher in the MSI samples than the MSS samples ( Fig. 1a and 1b). Other genes belonging to the following pathways were also included in the MAP signature: the WNT signaling pathway (RNF43, TCF7, and NKD1), Hippo signaling (TCF7, NKD1, and TGFBR2), and MAPK signaling (DUSP4 and TGFBR2). Three well-known frameshift mutated genes (DDX27, TGFBR2, and RNF43) in microsatellite loci in MSI CRC were also included. In terms of MMR, 718 DEGs were initially used when constructing the MAPgene model, although three MMR genes (MSH2, MSH6, and PMS2) were not included because their statistical signi cance or fold change did not meet the inclusion criteria (Additional le 1: Figure S2).
To assess the representativeness of the MAP signature (31-genes) in re ection of MSI status, the expression patterns of the genes in the gene-set were investigated in a validation dataset, and expression patterns similar to those observed in the discovery dataset were noted. To further investigate whether the MAP signature could serve as a surrogate marker of MSI status and to evaluate its generalizability, we obtained and compared ssGSEA scores for the MAP signature in MSI and MSS tumors, as well as in MSI tumors of each of the four CMSs. The general comparison between MSI and MSS samples revealed signi cant differences in MAP signature scores (P = 7.6 x 10 −36 ), but not among the CMSs (Fig. 1c). This suggests that the composite genes of the MAP signature can capture MSI's behavior-related features and discriminate between MSS and MSI status independent of CMS context.

MAP model
Although the MAPgene model built based on gene expression showed high performance, to develop a true SSP with unnormalized data that does not rely on a relative comparison among multiple samples, we employed a pairwise gene comparison approach for model building: for example, if the expression of gene A was greater than that of gene B, the sample would be assigned MSI status. An RF model with 1000 trees of such rules was constructed utilizing the RFE-RF algorithm with ve-fold cross-validation repeated 100 times. Finally, the MAPpairs model, comprised of 187 rules from 465 ( 31 C 2 ) rules at a starting point, was selected (AUC of 99.7%). To assess its performance and reproducibility, we applied the model to internal and external RNA-seq validation datasets and obtained accuracies of 99.1% and 95.4%, respectively, indicating it to be robust and highly accurate. In the MAPpairs model, MLH1-related features (MLH1/HPSE, MLH1/FECH, and MLH1/GNLY) were the highest ranked features (Fig. 2a), and the expression levels of these features separated the two groups well (Fig. 2b). To investigate the features of MAPpairs further, we calculated the number of MSI and MSS samples that satis ed each rule in MAPpairs (Additional le 1: Figure S3). Most rules were able to classify MSI and MSS samples, and as such, they were considered to be re ective of the overall characteristics of MSI, although not all samples may show similar pro les: For example, MSI samples are known to have a loss of MLH1 and an immunity-activated characteristic [2], as well as high expression of thymidylate synthase (TYMS) (chemotherapy response-associated gene) [40]. Features of the MAPpairs model, MLH1/GNLY and TYMS/MLH1, respectively, described these MSI characteristics well (Fig. 2b), but not in all tumors. This may suggest that the MAPpairs model, a random forest classi er, captures and re ects the more complexity of MSI CRC, not just a simple single rule.
In order to complement the MAPpairs model with the characteristics of immune and TME pro les, as well as the transcriptomic pro le and tumor's characteristics of MSI, we built the MAPsig model based on molecular, cancer, immune, TME, and MAP signature scores inferred by single-sample gene set enrichment analysis (ssGSEA). The top signatures used in the nal MAPsig model (44 signatures) included the MAP signature, antitumorigenic immune lymphocytes (effector memory CD8 T cell, Teff (CD8 T effector), Th2 cells, activated CD4 T cell), complement, INF-γ signatures, Wnt-β/catenin signaling, glycolysis, and cell cycle signaling (Fig. 2c). To nd out the degree of activation of 44 signatures, we investigated the heatmap based on the inferred ssGSEA score. Compared to MSS, antitumorigenic immune lymphocytes, complement, glycolysis, cell cycle, and INF-γ related signatures were up-regulated in MSI, whereas MAP signature, Notch, angiogenesis, epithelial signature, and Wnt-β/catenin signaling were down-regulated (Additional le 1: Figure S4). The nal MAP model was established after integrating the MAPpairs and the MAPsig models, and post-re nement processing was done by utilizing probability. Next, we applied the nal MAP model to validation datasets to evaluate any potential over tting and its applicability across multiple platforms. A total of 1118 samples (240 MSI and 878 MSS tumors) were tested, and MAP exhibited an average accuracy of 96.1% (95% con dence interval (CI) 94.3-98.9), a sensitivity of 93.1%, a speci city of 97.5%, and an F1 score of 92.0% (Fig. 2d and 2e), indicating outstanding performance and feasibility as an MSI predictor.

MSI signatures in other cancer types
Using TCGA-STAD and TCGA-UCEC RNA-seq datasets, we evaluated whether MAP, which was developed for CRC, could be applied to other cancers. In the stomach adenocarcinoma (STAD) and uterine corpus endometrial carcinoma (UCEC) data, accuracies of 80.2% and 75.4% were observed, respectively. To investigate why the MSI classi er of CRC is not suitable for other cancers, the same method used to construct the MSI signature (MAP signature) in CRC was applied to examine MSI signatures in gastric cancer and uterine cancer, and then the differences in expression patterns were investigated. Uterine cancer showed an accuracy of 90.9%, with only nine genes (CXCL13, EPM2AIP1, H2AFJ, HOXA9, MLH1, RNLS, SDR42E1, TNFSF9, and ZNF300), whereas gastric cancer reached an accuracy of 83.4% using 78 genes (Table S4). We further probed how cancer-speci c MSI signatures are expressed in each cancer and observed that individual MSI signatures tend to correspond to DEGs not statistically signi cant in other cancers (Fig. 3a and 3b). MLH1 and EPM2AIP1 were differentially expressed in all three cancers, RPL22L1 was included in the MAP signature of CRC and STAD, and H2AFJ was observed in both CRC and UCEC. In addition, comparing the MAP signature and MSI signature from the recently developed preMSIm, ve genes (MLH1, RPL22L1, EPM2AIP1, DDX27, and SHROOM4) were observed in both signatures in CRC (Fig. 3c). It is also worth mentioning that all of the genes used in preMSIm are downregulated in MSI, except RPL22L1, whereas the MAP signature additionally includes both up-and downregulated genes in CRC. In addition, the expression pattern of the signature of preMSIm did not appear to suitably re ect genes important in gastric cancer, such as DDX27, SMAP1, and ZSWIM3, and in uterine cancer, such as DDX27, SHROMM4, SMAP1, and ZSWIM3, thereby making it unable to e ciently differentiate MSI and MSS tumors (|log 2 fold change| < 0.5) in these cancer types (Fig. 3d and 3e).

Discussion
Not all MSI status information is available in publicly available colorectal cancer expression data, such as NCBI Gene Expression Omnibus (NCBI GEO), thus such data cannot be utilized in MSI CRC research. For example, it hampers studies determining why most MSS samples belong to the immune desert type or the mechanism by which immune evasion occurred in a subset of MSI tumors by utilizing molecular or immunological characterization of MSI and MSS. Although at the research level, if these studies are conducted, this may give clues to convert the immune-inactivated tumors into immune-activated types or to discover drugs targeting abnormally activated oncogenic pathways or suppressed TMEs which can be combined with ICIs. Additionally, since MSI samples are rare, with the di culty of producing expression data due to RNA degradation, meta-analysis using multiple cohorts is required, but the use is hindered due to the absence of MSI status information in them. Here, we present MAP, a tumor microenvironmentaware, single-transcriptome predictor of MSI in CRC, with robust accuracy validated using large samples from multiple cohorts of primary tumors. (N = 1118). We expect that the MAP will open the door to make such datasets of use in future MSI studies. MAP has the advantage of not requiring a matched normal sample as a control and su ciently predicts MSI status with a single-sample transcriptome pro le.
Attempts to create an absolute predictor for subtype classi cation of cancer and strati cation of patients by applying relationships or ratios between two genes, not the expression value of the gene itself, are ongoing [41,42]. MAP is an absolute classi er, not relative, and was developed to re ect tumor molecular characteristics, immune-related signatures, and tumor-in ltrating immune cells in TME of CRC. Also, since MAP is an RF classi er, one feature does not represent all MSI in common, but the MSI status is determined through the complex re ection of various features. Therefore, it may be di cult to interpret clinical and biological signi cance of features, and it might be considered to be included technical as well as biological rules to improve the accuracy of classi cation.
During the development, it showed an accuracy of 99.1% (1/115) in the correct identi cation of MSI in the internal validation TCGA dataset. Only one sample (TCGA-DC-6154) with MSI status was incorrectly predicted as being MSS by the MAP model, and it was also marked as MSS with the MOSAIC program, a tool which predicts MSI status at genomic level [10]. We speculated such discrepancy may stem from the different tissue sampling locations (MSI typing vs. DNA and RNA sequencing) or MSI intratumor heterogeneity, rather than MAP misinforming. We also encountered misclassi cation of a 11CO070 (MSS) hypermutated sample from an external RNA-seq validation dataset and ve MSS samples from the GSE39582 dataset as MSI by MAP. Using the clinical information available, we further investigated the ve MSS samples from the GSE39582 dataset and they all carried BRAF mutation and high CpG island methylator phenotype (CIMP). In sporadic MSI CRC, the accompanying characteristics of BRAF mutation and high CIMP are known to be strongly correlated with MSI [43], but it was not possible to determine misassignment or tumor heterogeneity characteristics in detail due to the absence of lynch syndrome status or mutation information of other MMR genes of the samples. Additionally, in research on CMS reported by the Colorectal Cancer Subtyping Consortium, the distribution of CMS2 (known as immune-desert type) samples with MSI status was exceedingly rare (10 of 270, 2.7%) [2], whereas eight out of the 10 CMS2-MSI samples belonged to one cohort (GSE13294 dataset). This particular cohort carried a slightly dissimilar CMS2-MSI population distribution from the other datasets, and out of these eight samples, ve were classi ed as MSS by MAP.
MAP showed accuracies of 98.6% (95% CI 97.6-99.6) in RNA-seq and 95.1% (95% CI 91.6-98.7) in microarray data, all primary tumor and MSI detected based on PCR panel, showing a slight difference depending on the platform. Although MAP is an absolute SSP with a speci city of approximately 97% and a high accuracy of 96.1%, it may be due to the inherent characteristics derived from development based on RNA-seq, or a rare MSI subgroup (eg. immune-desert CMS2-MSI) that exists in a speci c cohort (GSE13294). Due to the paucity of clinical information, we were unable to thoroughly characterize the samples that were not accurately predicted.
The recently developed preMSIm, a pan-cancer MSI predictor, is a k-NN classi er using 15 genes identi ed by using only three frequent cancer types (COAD, STAD, and UCEC) as training data. However, due to the limitations mentioned by the author of preMSIm [11] and based on our ndings, these 15 genes are not enough to predict MSI in pan-cancer. This is because tumor biology and tumor microenvironment are distinct for individual cancer origins, suggesting diverse tumor-intrinsic gene expression patterns. In this context, MAP is superior when predicting the MSI status in CRC as it was designed to re ect both the molecular characteristics of CRC and the complexity of its surroundings.
As the MAP model includes the MLH1 gene, sporadic CRC, characterized by MLH1 promoter hypermethylation or MLH1 loss, can be classi ed well, whereas Lynch syndrome, a familiar syndrome, due to germline mutations of MMR or EPCAM gene [1], may not be re ected. In addition, due to the lack of IHC and clinical information (e.g., KRAS, BRAF mutations, and Lynch syndrome status) in the validation datasets, the characteristics of samples with incorrectly predicted MSI status (e.g., MSH2/MSH6-negative CRC) could not be thoroughly assessed. Although MAP re ects the characteristics of sporadic MSI CRC well, MSH2/MSH6-negative MSI CRC re ection is somewhat limited because the expression patterns of MSH2 and MSH6 among MMR genes are not distinctly distinguished from MSS and MSI in TCGA and external validation dataset.

Conclusions
In conclusion, we provided MAP, an MSI predictor for CRC that is robust and accurate. Although MSI prediction based on IHC and PCR is well established and available at a low cost for clinical application, MAP will nd use in MSI-related research seeking to employ the large amounts of publicly available CRC expression data and will be useful for gaining insights to help with translational medicine.

Consent for publication
Not applicable.

Con icts of Interest
The authors declare no con ict of interest.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. MAPsupplementary.pdf