We used machine learning to train a model that detects acute infarct (Figure 1a; Supplementary Figure S1). The model calculates the probability of infarct in each voxel within an MRI study. The presence of any voxel with a probability above a given operating point causes the entire study to be classified as positive. The positive voxels are amalgamated into a segmented region, providing infarct visualization and volume quantification.
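The decision rule described above, thresholding a per-voxel probability map, flagging the study if any voxel exceeds the operating point, and summing the suprathreshold voxels for volume, can be sketched as follows (a minimal NumPy illustration; the function name, array shape and voxel volume are hypothetical stand-ins, not the actual implementation):

```python
import numpy as np

def classify_and_segment(prob_map, voxel_volume_ml, operating_point=0.5):
    """Classify a study and segment the infarct from a voxel probability map.

    prob_map: 3-D array of per-voxel infarct probabilities (model output).
    voxel_volume_ml: volume of a single voxel in mL (from MRI geometry).
    """
    # Binary mask of voxels above the operating point
    mask = prob_map > operating_point
    # A single suprathreshold voxel makes the whole study positive
    study_positive = bool(mask.any())
    # Amalgamated positive voxels give the infarct volume
    volume_ml = mask.sum() * voxel_volume_ml
    return study_positive, mask, volume_ml

# Hypothetical probability map with a single high-probability voxel
probs = np.zeros((4, 4, 4))
probs[1, 1, 1] = 0.9
positive, mask, vol = classify_and_segment(probs, voxel_volume_ml=0.008)
```

Because a single suprathreshold voxel suffices for a positive call, the choice of operating point directly governs the sensitivity/specificity trade-off.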
The primary dataset consisted of MRI studies from an academic medical center and its affiliated satellites. The data were allocated to a training set, a validation set and a primary test set (Table 1; see Supplementary Table S1 for scanner manufacturers and models). The validation set allowed optimization of model hyperparameters, including selection of an appropriate operating point. We selected an operating point of 0.5 as providing an appropriate balance between sensitivity and specificity (96.5% and 97.5%, respectively, on the validation set; Supplementary Table S2) and used this operating point for all subsequent experiments.
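The sensitivity and specificity reported at the chosen operating point follow the standard definitions (true positive rate among positive studies, true negative rate among negative studies); a small sketch with a hypothetical helper and toy data, not the study's evaluation code:

```python
import numpy as np

def sens_spec_at_threshold(probs, labels, threshold=0.5):
    """Study-level sensitivity and specificity at a given operating point."""
    preds = np.asarray(probs) > threshold
    labels = np.asarray(labels, dtype=bool)
    sensitivity = (preds & labels).sum() / labels.sum()
    specificity = (~preds & ~labels).sum() / (~labels).sum()
    return sensitivity, specificity

# Toy example: four studies, two truly positive
sens, spec = sens_spec_at_threshold([0.9, 0.2, 0.6, 0.1], [1, 1, 0, 0])
```

Sweeping the threshold over the validation set and inspecting the resulting sensitivity/specificity pairs is one straightforward way an operating point such as 0.5 could be chosen.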
Our machine learning architecture used two strategies to improve performance. First, it included two annotation types: slice-level segmentation of the infarcted region and study-level classification of infarct presence. The segmentations provided the model with more information about individual cases but were time intensive to create, while the classification annotations exposed the model to a greater number of cases. This strategy improved AUROC for the validation set from 0.982 (95% CI, 0.972-0.990) when trained on only the segmentation studies to 0.995 (95% CI, 0.992-0.998) when trained on both the segmentation and classification studies (Table 2). The median Dice coefficient for overlap of ground truth and model output segmentations improved from 0.776 (interquartile range [IQR], 0.584-0.857) to 0.797 (IQR, 0.642-0.861). Second, the model incorporated both ADC and DWI series, as both are required clinically to determine restricted diffusion. The model performed with AUROC 0.954 (95% CI, 0.939-0.968) on the validation set when using only ADC series, 0.991 (95% CI, 0.985-0.996) when using only DWI series and 0.995 (95% CI, 0.992-0.998) when using both series. The median Dice coefficient was 0.598 (IQR, 0.444-0.736) with only ADC series, 0.787 (IQR, 0.650-0.863) with only DWI series and 0.797 (IQR, 0.642-0.861) with both series.
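The Dice coefficient used throughout to compare segmentations is twice the intersection of the two masks divided by the sum of their sizes; a minimal sketch (toy masks and a hypothetical helper name):

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice overlap between a predicted and a ground-truth binary mask."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    # Convention: two empty masks count as perfect agreement
    return 2.0 * intersection / denom if denom else 1.0

# Two overlapping toy masks: intersection 2, sizes 3 and 3 -> Dice 2*2/6
a = np.array([1, 1, 1, 0, 0])
b = np.array([0, 1, 1, 1, 0])
```

Dice ranges from 0 (no overlap) to 1 (identical masks), which is why medians near 0.8 indicate substantial agreement between model and ground-truth segmentations.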
The finalized model was evaluated on the primary test set. It performed with AUROC 0.998 (95% CI, 0.995-0.999; Figure 1b), sensitivity 98.4% (95% CI, 97.1-99.5%) and specificity 98.0% (95% CI, 96.6-99.3%) for infarct detection. The median Dice coefficient was 0.813 (IQR, 0.727-0.863) and the Pearson correlation coefficient for the segmentation volumes was 0.987 (Figure 1c).
Stroke Code Test Set Performance
As balanced datasets can differ from real-world clinical scenarios, the model was next evaluated on MRI studies performed after ‘stroke code’ activations.12 These activations are group pager messages that mobilize team members, including neurology, radiology and pharmacy, after a patient presents with stroke symptoms. Approximately half of these patients ultimately have an infarct. We obtained the activations over a six-month period from two hospitals: the hospital from which training data were obtained (‘training hospital’) and a hospital from which no training data were obtained (‘non-training hospital’).
The training hospital had 598 stroke codes, for which 396 MRI studies occurred and 381 met model inclusion criteria (Supplementary Figure S2). There were 168 positive studies (44.1%). The model performed with AUROC 0.964 (95% CI, 0.943-0.982), sensitivity 89.3% (95% CI, 84.5%-93.9%) and specificity 94.8% (95% CI, 91.7%-97.6%) for classification (Figure 2a). The model also outputted segmented infarct regions (Supplementary Figure S3). The model volume quantification had Pearson correlation 0.968 compared with the averaged reader volume (Figure 2b and Supplementary Figure S4). The Bland-Altman analysis between the averaged reader and model volumes gave a difference of -0.4mL (95% CI, -6.9 to +6.1mL) for infarcts less than 70mL and -1.5mL (95% CI, -27.0 to +24.0mL) for all infarcts (Supplementary Figure S5). The overlap of segmented regions between the model and each reader was similar to that between readers: the median Dice coefficient was 0.726 (IQR, 0.568-0.803) for model versus reader 1, 0.709 (IQR, 0.551-0.793) for model versus reader 2 and 0.729 (IQR, 0.600-0.813) for reader 1 versus reader 2. Two patients were excluded from the final analysis because of age <18 years; the model nevertheless correctly predicted both of their studies as negative.
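The Bland-Altman comparison of model and reader volumes reports the mean difference (bias) together with an interval of bias ± 1.96 standard deviations of the paired differences; a minimal sketch (hypothetical helper and toy volumes, assuming the reported interval corresponds to the 95% limits of agreement):

```python
import numpy as np

def bland_altman(model_vols, reader_vols):
    """Bias and 95% limits of agreement between two volume measurements."""
    model_vols = np.asarray(model_vols, dtype=float)
    reader_vols = np.asarray(reader_vols, dtype=float)
    diffs = model_vols - reader_vols
    bias = diffs.mean()
    sd = diffs.std(ddof=1)
    # 95% limits of agreement: bias +/- 1.96 standard deviations
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Toy volumes in mL for three studies
bias, (lo, hi) = bland_altman([10.0, 20.0, 30.0], [11.0, 19.0, 31.0])
```

Reporting the analysis separately for infarcts under 70mL, as above, keeps a few large infarcts from dominating the limits of agreement.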
The non-training hospital had 494 stroke codes, for which 255 MRI studies occurred and 247 met model inclusion criteria (Supplementary Figure S6). There were 128 positive studies (51.8%). The model performed with AUROC 0.981 (95% CI, 0.966-0.993), sensitivity 96.1% (95% CI, 92.3%-99.2%) and specificity 86.6% (95% CI, 80.2%-92.3%) for classification (Figure 2c). The model volume quantification had Pearson correlation 0.986 compared with the averaged reader volume (Figure 2d; Supplementary Figure S7). The Bland-Altman analysis between the averaged reader and model volumes gave a difference of -3.1mL (95% CI, -14.4 to +8.2mL) for infarcts less than 70mL and -6.1mL (95% CI, -31.2 to +19.0mL) for all infarcts (Supplementary Figure S8). The overlap of segmented regions between the model and each reader was similar to that between readers: the median Dice coefficient was 0.658 (IQR, 0.480-0.750) for model versus reader 1, 0.652 (IQR, 0.473-0.770) for model versus reader 2 and 0.682 (IQR, 0.592-0.770) for reader 1 versus reader 2.
We reviewed the false negative and false positive studies from the training hospital and non-training hospital. The majority of false negative studies were for infarcts that were less than 1 mL (14 out of 18 studies at the training hospital and 2 out of 5 studies at the non-training hospital; Figure 3a and Supplementary Figure S9a). The remaining false negative studies were felt to be secondary to subtle ADC hypointensity (4 studies) and atypical infarcts (1 study for each of air embolism etiology, venous etiology and atypical hippocampal location; all studies displayed in Supplementary Figure S9b-e). Overall, false negative studies had smaller infarcts than true positive studies (mean averaged reader volume 12.3mL for false negatives and 26.4mL for true positives; Spearman correlation between classification probability and volume of 0.764, p < 0.001; Figure 3a). The false positive studies mostly reflected “mimics” of acute infarct, including hemorrhage and tumor (Supplementary Figure S10). We found one “false positive” punctate infarct that the readers labelled negative but that, on review, was more evident on an MRI performed three days later and should have been labelled positive (Supplementary Figure S10d); its ground truth was not updated because the ground truth interpretations were locked prior to comparison with model outputs.
We also obtained the National Institutes of Health Stroke Scale (NIHSS), last seen well time (when a patient was last without symptoms) and symptom onset time (when a patient first had symptoms) for patients with an infarct, in order to stratify model performance by these clinical variables (Figure 3b-d). Overall, false negative studies were more likely to have a lower NIHSS (average NIHSS 5.1 for false negative studies and 8.7 for true positive studies; Spearman correlation between classification probability and NIHSS of 0.442, p < 0.001), a shorter duration between the MRI and last seen well time (average interval 8.2 hours for false negative studies and 17.5 hours for true positive studies; Spearman correlation 0.291, p < 0.001) and a shorter duration between the MRI and symptom onset time (average interval 6.8 hours for false negative studies and 14.4 hours for true positive studies; Spearman correlation 0.271, p < 0.001).
International Test Set Performance
To further demonstrate the generalizability of our model, we tested it on 171 MRI studies, including 70 positive studies (40.9%), obtained from Brazil. The initial dataset contained an additional 6 studies that were excluded (2 with no DWI/ADC series; 4 non-diagnostic because of significant motion or metal artifact). The model performed with AUROC 0.998 (95% CI, 0.993-1.000), sensitivity 100% (95% CI, 100-100%) and specificity 98.0% (95% CI, 94.9-100%) for classification (Figure 4a). The model volume quantification had Pearson correlation 0.980 compared with the averaged reader volume (Figure 4b; Supplementary Figure S11). The Bland-Altman analysis between the averaged reader and model volumes gave a difference of -1.6mL (95% CI, -8.1 to +4.9mL) for infarcts less than 70mL and -3.9mL (95% CI, -23.1 to +15.4mL) for all infarcts (Supplementary Figure S12). The overlap of segmented regions between the model and each reader was similar to that between readers: the median Dice coefficient was 0.686 (IQR, 0.503-0.776) for model versus reader 1, 0.683 (IQR, 0.519-0.762) for model versus reader 2 and 0.714 (IQR, 0.604-0.813) for reader 1 versus reader 2.