Not all MSI status information is available in publicly available colorectal cancer expression data, such as NCBI Gene Expression Omnibus (NCBI GEO), thus such data cannot be utilized in MSI CRC research. For example, it hampers studies determining why most MSS samples belong to the immune desert type or the mechanism by which immune evasion occurred in a subset of MSI tumors by utilizing molecular or immunological characterization of MSI and MSS. Although at the research level, if these studies are conducted, this may give clues to convert the immune-inactivated tumors into immune-activated types or to discover drugs targeting abnormally activated oncogenic pathways or suppressed TMEs which can be combined with ICIs. Additionally, since MSI samples are rare, with the difficulty of producing expression data due to RNA degradation, meta-analysis using multiple cohorts is required, but the use is hindered due to the absence of MSI status information in them. Here, we present MAP, a tumor microenvironment-aware, single-transcriptome predictor of MSI in CRC, with robust accuracy validated using large samples from multiple cohorts of primary tumors. (N = 1118). We expect that the MAP will open the door to make such datasets of use in future MSI studies. MAP has the advantage of not requiring a matched normal sample as a control and sufficiently predicts MSI status with a single-sample transcriptome profile.
Attempts to create an absolute predictor for subtype classification of cancer and stratification of patients by applying relationships or ratios between two genes, not the expression value of the gene itself, are ongoing[41, 42]. MAP is an absolute classifier, not relative, and was developed to reflect tumor molecular characteristics, immune-related signatures, and tumor-infiltrating immune cells in TME of CRC. Also, since MAP is an RF classifier, one feature does not represent all MSI in common, but the MSI status is determined through the complex reflection of various features. Therefore, it may be difficult to interpret clinical and biological significance of features, and it might be considered to be included technical as well as biological rules to improve the accuracy of classification.
During the development, it showed an accuracy of 99.1% (1/115) in the correct identification of MSI in the internal validation TCGA dataset. Only one sample (TCGA-DC-6154) with MSI status was incorrectly predicted as being MSS by the MAP model, and it was also marked as MSS with the MOSAIC program, a tool which predicts MSI status at genomic level[10]. We speculated such discrepancy may stem from the different tissue sampling locations (MSI typing vs. DNA and RNA sequencing) or MSI intratumor heterogeneity, rather than MAP misinforming. We also encountered misclassification of a 11CO070 (MSS) hypermutated sample from an external RNA-seq validation dataset and five MSS samples from the GSE39582 dataset as MSI by MAP. Using the clinical information available, we further investigated the five MSS samples from the GSE39582 dataset and they all carried BRAF mutation and high CpG island methylator phenotype (CIMP). In sporadic MSI CRC, the accompanying characteristics of BRAF mutation and high CIMP are known to be strongly correlated with MSI[43], but it was not possible to determine misassignment or tumor heterogeneity characteristics in detail due to the absence of lynch syndrome status or mutation information of other MMR genes of the samples. Additionally, in research on CMS reported by the Colorectal Cancer Subtyping Consortium, the distribution of CMS2 (known as immune-desert type) samples with MSI status was exceedingly rare (10 of 270, 2.7%)[2], whereas eight out of the 10 CMS2-MSI samples belonged to one cohort (GSE13294 dataset). This particular cohort carried a slightly dissimilar CMS2-MSI population distribution from the other datasets, and out of these eight samples, five were classified as MSS by MAP.
MAP showed accuracies of 98.6% (95% CI 97.6-99.6) in RNA-seq and 95.1% (95% CI 91.6-98.7) in microarray data, all primary tumor and MSI detected based on PCR panel, showing a slight difference depending on the platform. Although MAP is an absolute SSP with a specificity of approximately 97% and a high accuracy of 96.1%, it may be due to the inherent characteristics derived from development based on RNA-seq, or a rare MSI subgroup (eg. immune-desert CMS2-MSI) that exists in a specific cohort (GSE13294). Due to the paucity of clinical information, we were unable to thoroughly characterize the samples that were not accurately predicted.
The recently developed preMSIm, a pan-cancer MSI predictor, is a k-NN classifier using 15 genes identified by using only three frequent cancer types (COAD, STAD, and UCEC) as training data. However, due to the limitations mentioned by the author of preMSIm[11] and based on our findings, these 15 genes are not enough to predict MSI in pan-cancer. This is because tumor biology and tumor microenvironment are distinct for individual cancer origins, suggesting diverse tumor-intrinsic gene expression patterns. In this context, MAP is superior when predicting the MSI status in CRC as it was designed to reflect both the molecular characteristics of CRC and the complexity of its surroundings.
As the MAP model includes the MLH1 gene, sporadic CRC, characterized by MLH1 promoter hypermethylation or MLH1 loss, can be classified well, whereas Lynch syndrome, a familiar syndrome, due to germline mutations of MMR or EPCAM gene[1], may not be reflected. In addition, due to the lack of IHC and clinical information (e.g., KRAS, BRAF mutations, and Lynch syndrome status) in the validation datasets, the characteristics of samples with incorrectly predicted MSI status (e.g., MSH2/MSH6-negative CRC) could not be thoroughly assessed. Although MAP reflects the characteristics of sporadic MSI CRC well, MSH2/MSH6-negative MSI CRC reflection is somewhat limited because the expression patterns of MSH2 and MSH6 among MMR genes are not distinctly distinguished from MSS and MSI in TCGA and external validation dataset.