Study specimens
All specimens were obtained from cancer patients as part of standard clinical care. The complete NGS dataset of N=1072 consecutive solid tumor samples (covering a large variety of tissues of origin) consists of all unselected, consecutive patient samples processed for diagnostic NGS at AZ Delta General Hospital from Jan 1, 2019 to Dec 31, 2020, and included N=215 patients with additional IHC testing results.
Wet lab pre-processing
Immunohistochemistry for MSH2, MSH6, PMS2 and MLH1 was performed on 5-µm thick sections of formalin-fixed, paraffin-embedded (FFPE) tumor tissue on a Ventana BenchMark Ultra device (Ventana Medical Systems, Arizona, USA) as previously described [11]. Tumors were classified as dMMR if, for one or several markers, nuclear staining was absent or present in less than 10% of invasive tumor cells in the presence of a positive internal control (inflammatory and stromal cells). Tumors with nuclear staining for all four markers in at least 10% of invasive tumor cells were considered pMMR. DNA was extracted from 10-µm thick sections of the same FFPE tumor tissue blocks as used for immunohistochemistry, using the Cobas DNA Sample Preparation Kit (Roche, Basel, Switzerland) with macrodissection guided by haematoxylin-eosin staining where needed, and elution in 10 mM Tris–HCl pH 8.0. NGS was performed on an Illumina MiSeq (PE150) using a 138 kb customized hybridization capture-based gene panel (NimbleGen SeqCap EZ HyperPlus, Roche) that included 36 microsatellite loci [11] (Supplementary Table 1).
Data pre-processing
Reads were aligned to the human reference genome (hg19) with BWA-mem2 (2.2.1) [15]; sorting and duplicate marking were performed using elPrep (5.0.2) [16] and SAMtools (1.13) [17]. The DeltaMSI tool was developed starting from this output. In parallel, mSINGS v4.0 was run on the same baseline sample sets and regions for comparison. The NGS panel design included 36 published microsatellite regions: 15 proposed by Salipante [8], 7 from the Idylla assay (Biocartis, Belgium) [18, 19], 10 proposed by Hause [20] and 4 regions from the updated Bethesda guideline [21]. BAM files were read by DeltaMSI and additional read filtering was performed: besides the removal of duplicates, reads were filtered on a mapping quality of at least 20, and only reads overlapping the complete microsatellite region plus a flanking region of 5 bases on each side were used. Regions within samples were filtered at a minimum depth of 30x (similar to mSINGS). The length distribution graphs, obtained by counting the observed fragment lengths, were cleaned by removing lengths supported by only 1 read (to decrease the complexity of the regions). As raw data for modeling we used the normalized read depth per position for each marker locus, obtained by dividing the absolute read count per position by the maximum depth observed across all lengths within that locus, yielding a value between 0 and 1 for each possible length within that region.
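To make this pre-processing concrete, the following Python sketch builds the normalized length distribution for a single microsatellite locus under these filters. It is an illustration only, not the DeltaMSI source code: the function name, the use of pysam, and the simplification of taking the aligned reference span as the observed fragment length are our own assumptions.

```python
import pysam


def normalized_length_distribution(bam_path, chrom, start, end,
                                   flank=5, min_mapq=20, min_depth=30):
    """Illustrative per-locus pre-processing: filter reads, count observed
    fragment lengths, drop singleton lengths, scale by the maximum depth."""
    counts = {}
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom, start, end):
            # duplicate removal and mapping quality filter (MAPQ >= 20)
            if read.is_unmapped or read.is_duplicate or read.mapping_quality < min_mapq:
                continue
            # keep only reads spanning the complete region plus 5-base flanks
            if read.reference_start > start - flank or read.reference_end < end + flank:
                continue
            # simplification: use the aligned reference span as fragment length
            length = read.reference_end - read.reference_start
            counts[length] = counts.get(length, 0) + 1

    if sum(counts.values()) < min_depth:
        return None  # locus fails the 30x minimum depth filter in this sample
    counts = {l: c for l, c in counts.items() if c > 1}  # remove depth-1 lengths
    if not counts:
        return None
    max_depth = max(counts.values())
    return {l: c / max_depth for l, c in counts.items()}  # values in (0, 1]
```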
Machine learning modelling using a train-validation-test split approach
The model set (N=179: N=133 pMMR and N=46 dMMR samples) was randomly split into a training set, validation set and test set of equal size (graphical study design in Fig. 1).
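A minimal sketch of this three-way split, assuming `X` holds the per-sample features and `y` the binary IHC labels (the variable names and fixed random seed are illustrative):

```python
from sklearn.model_selection import train_test_split

# Split off one third as the test set, then halve the remainder into
# training and validation sets, giving three sets of (roughly) equal size.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)
```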
The training set was first used to select the loci with acceptable raw data quality, based on acceptable coverage within a sample and in at least 75% of all training set samples: 29 of 36 loci were retained (Fig. 2). We then trained 7 machine learning models on the training set (scikit-learn) [22], using the normalized read depth per position as features and binary dMMR status measured by IHC as target. The three outlier methods (isolation forest, local outlier factor and one-class support vector machine) were trained with pMMR samples only; the other methods (logistic regression, random forest, naive Bayes and support vector classifier (SVC, also known as support vector machine)) were trained with an unbalanced set (45 pMMR, 16 dMMR). For each method, hyperparameter tuning was performed using a grid search with 3-fold cross-validation. Each model used the same k-fold split of the samples, regardless of the number of filtered samples for that region within that set.
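The sketch below shows how such a per-locus model zoo could be set up in scikit-learn. The hyperparameter grids are illustrative placeholders (the actual DeltaMSI grids may differ), and `X_train`/`y_train` stand for the per-position feature matrix (NumPy array) and IHC labels (0 = pMMR, 1 = dMMR) of one locus.

```python
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import SVC, OneClassSVM

cv = KFold(n_splits=3, shuffle=True, random_state=0)  # one shared 3-fold split

# Two-class methods, trained on the unbalanced pMMR/dMMR set of one locus.
classifier_grids = {
    "logistic_regression": (LogisticRegression(max_iter=1000),
                            {"C": [0.01, 0.1, 1, 10]}),
    "random_forest": (RandomForestClassifier(),
                      {"n_estimators": [50, 100], "max_depth": [3, 5, None]}),
    "naive_bayes": (GaussianNB(), {}),
    "svc": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
}
fitted = {
    name: GridSearchCV(estimator, grid, cv=cv, scoring="roc_auc").fit(X_train, y_train)
    for name, (estimator, grid) in classifier_grids.items()
}

# Outlier methods, fitted on pMMR (label 0) samples only; unstable loci in
# dMMR samples should then surface as outliers at prediction time.
X_pmmr = X_train[y_train == 0]
outlier_models = [
    IsolationForest(random_state=0).fit(X_pmmr),
    OneClassSVM(nu=0.1).fit(X_pmmr),
    LocalOutlierFactor(novelty=True).fit(X_pmmr),
]
```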
Next, the validation set was used to select the best performing models based on the AUC score, to additionally curate the markers for diagnostic power, and to set thresholds. The highest AUCs were obtained with logistic regression, random forest and SVC (Fig. 2). Random forest was removed because it strongly overfitted each region. 28 of 29 marker loci achieved an AUC of 0.60 or higher with both logistic regression and the support vector classifier, and were retained in the final modeling. We additionally combined both models into a voting model that scores a region as unstable only if both models predict the region as unstable. For each model, the percentage of loci scored as unstable was calculated (sample score) for the final dMMR/pMMR classification at sample level.
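A sketch of the voting logic and sample score; the names are ours, and `locus_models` is assumed to pair the fitted logistic regression and SVC for each retained locus:

```python
import numpy as np


def sample_score(locus_models, locus_features):
    """Percentage of retained loci scored unstable by the voting model.

    locus_models:   list of (logistic_regression, svc) pairs, one per locus.
    locus_features: matching list of per-locus feature vectors for one sample.
    A locus is called unstable only if BOTH models predict class 1 (unstable);
    the resulting percentage drives the final dMMR/pMMR call at sample level.
    """
    unstable = [
        lr.predict(x.reshape(1, -1))[0] == 1 and svc.predict(x.reshape(1, -1))[0] == 1
        for (lr, svc), x in zip(locus_models, locus_features)
    ]
    return 100.0 * np.mean(unstable)
```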
ROC analysis was performed with MedCalc (version 19.2.1, MedCalc Software Ltd, Mariakerke, Belgium) and scikit-learn [22].
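For the scikit-learn part, the ROC computation reduces to standard calls of the following form, with `y_true` the binary IHC status and `scores` the per-sample instability scores (variable names are placeholders):

```python
from sklearn.metrics import roc_auc_score, roc_curve

auc = roc_auc_score(y_true, scores)               # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, scores)  # operating points for threshold setting
```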