Head CT Deep Learning Model for Early Stroke Identification Outperforms Human Experts

Abstract
Non-contrast head CT (NCCT) is extremely insensitive for early (<3-6hrs) acute infarct identification. We developed a deep learning model that detects and delineates early acute infarcts on NCCT, using diffusion MRI as ground truth (3,566 NCCT/MRI training pairs). The model substantially outperformed 3 expert neuroradiologists on a test set of 150 CT scans (sensitivity 96% model versus 61-66% experts); infarct volume estimates strongly correlated with those of diffusion MRI (r² > 0.98).

Main Body
Stroke is a significant public health issue, affecting approximately 13.7 million people annually, and is the second leading cause of death and disability worldwide [1]. The management of acute ischemic stroke was revolutionized in 2018 with publication of the DAWN trial [2]. This study showed that the time window for safe and effective stroke treatment could be expanded from 6 to 24 hours post symptom onset, with appropriate patient selection using "advanced" CT or MR imaging to detect and estimate the volume of irreversibly ischemic infarct "core". Specifically, stroke patients with intracranial vascular occlusions and "small" (<50mL) infarct cores, treated with endovascular thrombectomy (EVT), achieved a 49% rate of functional independence at 90 days, compared to only 13% with best medical therapy. A 50mL infarct volume threshold was chosen as an enrollment criterion to minimize the risk of intracranial hemorrhage (ICH) as a treatment complication. The resulting effect size of 36% (49% minus 13%) remains among the highest of any stroke trial to date, especially considering the treatment window of up to one full day after symptom onset, with a "number-needed-to-treat" of only 2.8. In DAWN and related late-window (6-24hr) treatment studies, infarct core volume was either determined using maximally efficient, ground truth, reference standard MR diffusion-weighted imaging (DWI) or approximated using CT perfusion imaging (CTP) [3][4][5]. Regardless of the imaging modality used for infarct core evaluation, however, all stroke clinical treatment trials have underscored the critical need for rapid, safe, highly sensitive and specific assessment, ideally minimizing cost, complexity, and technical variability [6,7].
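The effect size and number-needed-to-treat quoted above follow from simple arithmetic on the two reported 90-day functional-independence rates. A minimal sketch of that calculation (the function names are illustrative and are not taken from the trial):

```python
def absolute_risk_reduction(rate_treated: float, rate_control: float) -> float:
    """Difference in favorable-outcome rates between the two arms."""
    return rate_treated - rate_control


def number_needed_to_treat(arr: float) -> float:
    """NNT is the reciprocal of the absolute risk reduction."""
    if arr <= 0:
        raise ValueError("absolute risk reduction must be positive")
    return 1.0 / arr


# Rates quoted in the text: 49% with EVT versus 13% with best medical therapy.
arr = absolute_risk_reduction(0.49, 0.13)
nnt = number_needed_to_treat(arr)
print(f"effect size = {arr:.0%}, NNT = {nnt:.1f}")
```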
Only one major EVT clinical trial, MR CLEAN, which assessed treatment safety and efficacy in early stroke (<6hr), used non-contrast CT (NCCT) exclusively to both rule out ICH prior to enrollment and to estimate core infarct volume for subgroup analyses [7]. Unfortunately, both detection and volume estimation of early ischemic findings on NCCT, even by expert, subspecialty-certified neuroradiologists with decades of experience interpreting complex stroke scans, are significantly limited by the typically subtle decreased X-ray attenuation and low contrast-to-noise ratios of acute infarcts. This poor conspicuity, attributable to the mildly reduced blood pool and early vasogenic edema of these developing lesions, is especially difficult to perceive in the first 3-6 hours after stroke onset, before blood brain barrier breakdown becomes well established [8,9]. Even with interpretation by highly trained readers using optimal image review display parameters, the sensitivity of NCCT for early (3-6hr) stroke detection has been reported to range as low as 43-71%, compared to 97% for DWI [8][9][10]. In a 2002 study comparing NCCT and DWI stroke detection within 3 hours of symptom onset, sensitivity for expert readers was 61% by CT and 91% by DWI; for novice readers, sensitivity was 46% by CT and 81% by DWI, with CT described as "little better than flipping a coin" [11]. As such, DWI is considered the operational reference standard for rapid, accurate, emergency department assessment of brain tissue viability; it identifies regions of reduced water diffusivity attributable to cytotoxic edema, likely to be irreversibly infarcted even in the setting of early, robust restoration of critically ischemic cerebral blood flow [12].
In this study, we developed a deep learning model that detects, delineates, and estimates the volume of early acute infarction on NCCT, using diffusion MRI as ground truth (Figures 1,2). Our model, adapted from the U-Net architecture, takes an NCCT series as input and generates a segmentation mask of the early infarct changes, which is used to estimate infarct volume [21,Methods]. Model training relies on a large dataset of paired admission NCCT followed by ground truth DWI scans, acquired within a short time interval of one another. Infarct cores were segmented semi-automatically and segmentation masks for each pair were registered to the corresponding NCCT images (Figure 2a). Expert review of the test set, which included scans from two different vendors and 8 different scanner models (Table, Methods), was randomized with a different order of presentation for each radiologist. The experts recorded the presence or absence of acute infarct and categorized infarct volumes as >0-20mL, >20-50mL, or >50mL.
These results suggest that our model has the potential to obviate the need for more complex, costly, and time-consuming "advanced" CT and MR imaging (e.g., CTP, DWI) for safe, rapid assessment of infarct core, which is essential to patient selection for both early- and late-time-window stroke treatments such as EVT.
The performance of our model compared favorably to that of other published AI models for NCCT acute stroke detection and delineation. This result is likely attributable to our large, accurately labeled training set consisting of 3,566 NCCT / ground truth diffusion MRI pairs of early strokes (most <6hrs post-onset), for which DWI was obtained within 3 hours of admission CT for stroke-positive patients (median <50min) and within 5 days for stroke-negative patients (median <19hrs) (Figure 2a) [13][14][15][16][17][18]. It is noteworthy that both our training/validation and test sets contained predominantly small volume strokes (median DWI infarct volume <10mL and <30mL, respectively; see Table, Methods). Much of the existing work on automated detection and analysis of acute stroke focuses on three approaches: imaging feature engineering, ischemic region segmentation, or biomarker computation [13]. Although some of this literature reports high performance, few of these studies focus on early ischemic findings, and limitations include small and/or poorly annotated training datasets, as well as weaker "reference standard" ground truth (e.g., ground truth based on reader consensus or on less accurate, more highly variable modalities than MR-DWI, such as CTP) [3,4][13][14][15][16][17][18].
In one published model tested on 100 CT scans, for example (median 48 minutes after symptom onset, IQR 27-93 minutes), there was moderate correlation between algorithm-predicted NCCT and expert-contoured DWI infarct volumes (r=0.76, r²=0.58), with the Bland-Altman plot 95% confidence interval for DWI-NCCT core volume measurement ranging from -59 to 80mL, versus -18 to 16mL for our model (Figure 1c) [14]. Recently, a model trained on NCCT/DWI pairs showed 0.76 accuracy for <9hr infarct detection [15]. For a different recently published model tested on 479 early and late window acute stroke CTs, there was modest correlation between NCCT predicted volumes and both CTP-derived (r=0.44, r²=0.19) and final-infarct (r=0.52, r²=0.27) volumes [16]. Another recent model showed moderate performance in correlating automated NCCT Alberta Stroke Program Early CT Scores (i.e., "ASPECTS", a 10-point scoring system for infarct size) with measured CTP (r²=0.58) and DWI (r²=0.46) core volumes [17]. Moreover, our algorithm's accuracy is notably superior to that of the CTP-derived infarct volume accuracies reported in the literature (e.g., Bland-Altman plot 95% confidence interval for mean CTP-DWI core volume measurement ranging from -59 to 55mL [18]).
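The Bland-Altman ranges compared above are conventionally derived from the mean and standard deviation of the paired differences. The following is a minimal sketch of that conventional calculation, assuming limits of the form bias ± 1.96 SD; the paired volumes are hypothetical and are not study data. The helper `r_squared` simply squares a correlation coefficient, which reproduces the r and r² pairs quoted in the text.

```python
import statistics


def bland_altman_limits(first, second, z=1.96):
    """Return (bias, lower, upper) for the paired differences first - second.

    bias is the mean difference; the limits are bias -/+ z * sample SD.
    """
    diffs = [a - b for a, b in zip(first, second)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return bias, bias - z * sd, bias + z * sd


def r_squared(r):
    """Coefficient of determination for a simple linear correlation r."""
    return r * r


# Hypothetical paired infarct volumes in mL (illustration only).
dwi_ml = [10.0, 20.0, 30.0, 40.0, 50.0]
ncct_ml = [11.0, 19.0, 32.0, 38.0, 50.0]
bias, lower, upper = bland_altman_limits(dwi_ml, ncct_ml)
print(f"bias = {bias:.2f} mL, limits = {lower:.2f} to {upper:.2f} mL")
```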
Few medical artificial intelligence (AI) models to date have significantly outperformed human experts, and better-than-human detection and delineation of clinically important findings on CT or MRI cross sectional imaging has not previously been emphasized in the literature [19,20]. In one study of a convolutional neural network (CNN) for malignant melanoma detection, compared to a group of 58 dermatologists with a broad range of experience including 30 experts, the "CNN missed fewer melanomas and misdiagnosed benign moles less often as malignant" [19]. In another AI imaging study, McKinney et al. described a system for breast cancer screening mammography that outperformed US board certified radiologists "compliant with the requirements of the Mammography Quality Standards Act" [20]. There was a 5.7% reduction in false positives and a 9.4% reduction in false negatives with this system, which outperformed all human readers with an area under the receiver operating characteristic curve (AUC-ROC) of 0.740, reflecting an 11.5% improvement over the 0.625 AUC radiologist average. The authors concluded that AI has the potential to alleviate pressures on limited radiology staffing resources, as well as to discern "patterns and associations that are often imperceptible to humans". Indeed, Figure 2b shows two head CTs that were interpreted as negative for stroke by all three of our neuroradiology experts, but correctly classified by our model as positive for early infarction (one of which had a large, >125mL infarct core).
In summary, we have developed a deep learning model that leverages the high sensitivity of DWI as ground truth to automate the detection, segmentation, and volume estimation of early ischemic changes on NCCT. Although DWI remains the reference standard for maximally sensitive, early infarct detection, MRI is a limited resource, not rapidly and routinely accessible in most acute care settings, such as community hospitals and rural urgent-care facilities, where only CT is likely to be available. Indeed, our deep learning platform might be especially beneficial to stroke patients in underserved areas, without 24/7 advanced imaging capability or off-hour radiologist staffing.
In conclusion, the accuracy of our AI model for non-contrast head CT early stroke detection and volume estimation (greater-than versus less-than 50mL) exceeds that of human experts, and approaches that of ground truth MR-DWI. If prospectively validated and confirmed to be generalizable across a variety of different CT-scanner platforms, manufacturers, and acquisition protocols at different institutions, this model has the potential to considerably reduce the need for more complex, costly, time-consuming and limited-availability advanced CT and MR imaging techniques, for the safe, rapid selection of patients for both early- and late-time-window highly effective stroke treatments such as endovascular thrombectomy.

Methods
This was a HIPAA-compliant retrospective study, with institutional review board approval and waived patient consent. The dataset was identified by searching the radiology exam archive of two large US academic medical centers (BB, SP) for non-contrast head CT (NCCT) scans for which patients also had MR diffusion-weighted imaging (DWI) scans acquired within the following 5 days. Brain MRI reports were screened using natural language processing to identify studies positive and negative for acute stroke.
The time difference between the NCCT and MR-DWI imaging was limited to under 3 hours for stroke-positive scans, in order to capture infarct-related physiological changes on NCCT as close as possible to the MRI ground truth, and to under 5 days for stroke-negative scans, as restricted diffusion persists for several weeks following acute stroke. Parsing methods included keyword and sentence matching; all reports were manually reviewed by a trained radiologist (JP, JKC, BB). Scans were de-identified during image transfer using the Radiological Society of North America Clinical Trial Processor, with customized scripts to maintain relevant Digital Imaging and Communications in Medicine (DICOM) tags for series identification.
Brain MR-DWI and Apparent Diffusion Coefficient (ADC) sequences were considered ground truth for the presence or absence of acute infarction; axial DWI "b=1000" and ADC series with slice thickness ≥5mm were selected using a brain MRI series selection algorithm [22]. All images were reviewed by a trained radiologist (JKC, DC, BB, JP, AP, IS, JC) to ensure correct classification. Infarct segmentation was performed using established methodology, including a previously developed algorithm for mask generation [23]. The automated masks were reviewed by a trained radiologist along with the corresponding MR-DWI/ADC series and radiology reports. Masks of positive scans were assigned a grade on a 5-point scale reflecting segmentation quality, where 4 reflected perfect overlap (i.e., 1:1 correspondence between DWI/ADC infarct and mask) and 0 reflected absent overlap. Only scans with quality grades 3-4 were used for model development; others were discarded or manually segmented by a trained radiologist (JKC, DC, AP, FN; Osirix MD v11.0.3). Segmentations were converted into Neuroimaging Informatics Technology Initiative (NIfTI) masks for machine learning model use.
The segmented DWI stroke-positive and negative scans were paired with the corresponding non-contrast head CT scans, obtained post-symptom onset but prior to DWI acquisition. Axial CT images with slice thickness <5mm and standard or soft kernel reconstructions, computed using routine iterative reconstruction or filtered back projection algorithms, were manually selected for model input (JKC, DC, JP, AP, JC, ES); in some cases, this resulted in several CT scans per patient (Table, 2nd row). Scans were excluded if they were non-diagnostic (e.g., severe metal or motion artifact). NCCT/DWI image pairs were spatially registered using the SimpleITK Python package (v1.2) with a multiscale affine transformation and mutual information loss. Registration results were assessed visually using a checkerboard display of the NCCT and co-registered DWI; failed or imprecisely registered images were excluded. The registration transformation was subsequently applied to the DWI acute infarct masks, to obtain a registered mask on the NCCT images.
The resulting dataset was randomly sampled to create the training, validation, and testing sets (Table); for the purposes of this study the test set was restricted to 150 patients. For the training and validation sets, all selected CT scans were retained for model building, even if there were multiple scans per patient, to maximize algorithm robustness at training and enhance algorithm evaluation at validation; all scans from the same patient were used exclusively for either training or validation, but not both. For the test set, only a single CT scan per patient was used; if more than one was available, the earliest (i.e., closest to admission) within the defined post-symptom onset timeframe was used, with 5mm-thick standard kernel reconstructed slices prioritized. The final test set included 60 stroke-negative control patients and 90 stroke-positive patients, distributed evenly with 30 NCCT/DWI pairs in each of the >0-20mL, >20-50mL, and >50mL infarct volume categories. Only patients with strokes in the treatment-relevant middle cerebral artery vascular territory of the brain were selected for inclusion. The demographic characteristics of the training, validation, and test sets are shown in the Table.
We developed a neural network that can take as input 3D CT axial image stacks of varying numbers of slices, to generate segmentation masks as outputs. Pre-processing steps include resampling of the 3D input NCCT-image dataset and window/level pixel-value scaling. The resulting data becomes input to the neural network, which outputs both a classification result and a segmentation mask.
Specifically: (1) each axial scan slice from the NCCT input is resized to a standardized 5mm thickness, then resampled to a 256x256 matrix size, for a maximum of 35 (inferior to superior) axial slices; (2) pixel intensities are clipped to window-width and center-level display range settings of 90 and 40 Hounsfield units, respectively, corresponding to the display parameters typically used clinically by neuroradiologists for workstation stroke CT image interpretation [8]; and (3) the resampled image pixel values are mapped between 0 and 1. The binary masks, superimposed in the preprocessing step onto the original NCCT input slices, allow infarct volume estimation into <20mL, >20-50mL, or >50mL categories.
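Steps (2) and (3), and the conversion of a binary mask into a volume category, can be illustrated with a short sketch. It assumes that a window width of 90 HU centered at 40 HU corresponds to the symmetric range [-5, 85] HU, that voxel spacing is supplied by the caller, and that category boundaries are inclusive at the upper end; the function names are hypothetical and plain Python lists stand in for image arrays.

```python
def window_and_scale(hu_values, level=40.0, width=90.0):
    """Clip HU values to [level - width/2, level + width/2] and map to [0, 1]."""
    lo = level - width / 2.0
    hi = level + width / 2.0
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in hu_values]


def mask_volume_ml(n_mask_voxels, spacing_mm):
    """Volume of a binary mask: voxel count times voxel volume (mm^3 -> mL)."""
    sx, sy, sz = spacing_mm
    return n_mask_voxels * sx * sy * sz / 1000.0


def volume_category(volume_ml):
    """Assign the volume bins used in the text (boundary handling assumed)."""
    if volume_ml <= 0:
        return "no infarct"
    if volume_ml <= 20:
        return ">0-20mL"
    if volume_ml <= 50:
        return ">20-50mL"
    return ">50mL"


scaled = window_and_scale([-100.0, -5.0, 40.0, 85.0, 200.0])
volume = mask_volume_ml(1000, (1.0, 1.0, 5.0))  # hypothetical spacing in mm
print(scaled, volume, volume_category(volume))
```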
The network design extends the U-Net approach for biomedical image segmentation [21]. The 3D architecture is slightly modified, with an additional classification output (computed with a global max pooling from the segmentation output) that adds a classification component to the loss function, while maintaining segmentation and classification output consistency. Adding a classification component improved performance compared to using the Dice loss alone, as several very small stroke masks in our dataset could contribute disproportionately to lower the Dice score. The model was developed using Python 3.6 and TensorFlow 1.13.1. Moreover, although input image size is fixed in the axial in-plane dimensions, the framework can process 3D image volumes with varying numbers of slices. The architecture otherwise follows a classical U-Net design, with 6 down-sampling blocks (composed of 3x3 convolutions, batch normalization, and maximum pooling layers, followed by ReLU activation) and 6 up-sampling blocks; the main difference from a classical architecture is that the pooling operations are done at the slice level only, with shape (2, 2, 1), rather than between slices, which avoids unintended interpolation effects when the slice thickness is large. The neural network is optimized using a loss function that combines a differentiable Dice loss (for segmentation) and a cross-entropy loss (for study-level classification) as follows: L = L_dice + (1 - α) L_CE, where L_dice is the Dice loss, L_CE is the cross-entropy loss, and α is a constant (0 < α < 1) reflecting the balance between segmentation and classification during training.
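To make the combined objective concrete, the sketch below evaluates it for a single study represented as flat lists of voxel probabilities and binary labels. It assumes a standard soft Dice formulation with a small smoothing constant, takes the study-level probability as the maximum voxel probability (mirroring the global max pooling described above), and uses the weighting exactly as printed in the text. The value α = 0.5 is a placeholder, since the text specifies only 0 < α < 1, and the function names are hypothetical.

```python
import math


def soft_dice_loss(pred, target, eps=1e-6):
    """1 - soft Dice overlap between predicted probabilities and a binary mask."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def study_probability(pred):
    """Global max pooling: the study score is the largest voxel probability."""
    return max(pred)


def binary_cross_entropy(prob, label, eps=1e-7):
    """Cross-entropy for one study-level prediction, clamped for stability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def combined_loss(pred, target, alpha=0.5):
    """L = L_dice + (1 - alpha) * L_CE, following the formula in the text."""
    label = 1.0 if any(t > 0 for t in target) else 0.0
    l_dice = soft_dice_loss(pred, target)
    l_ce = binary_cross_entropy(study_probability(pred), label)
    return l_dice + (1.0 - alpha) * l_ce


perfect = combined_loss([1.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0])
missed = combined_loss([0.1, 0.0, 0.0, 0.1], [1.0, 0.0, 0.0, 1.0])
print(perfect, missed)
```

Because the study-level score is derived from the segmentation output itself, a scan cannot be classified as positive without at least one voxel carrying a high infarct probability, which is one way to read the "output consistency" mentioned above.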
We applied geometrical and pixel intensity-based data augmentation techniques at the 3D volume level, which included a combination of in-slice rotations and translations, scaling, right-left flipping, and both Gaussian and Poisson random noise. At each epoch, each transformation was drawn with a probability of 0.5, and if applicable, the transformation parameters were randomly modified with a probability of 0.95. All transformations were applied in image space, prior to down-sampling, using linear interpolation.
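The per-transformation draw can be sketched as follows for two of the listed augmentations, right-left flipping and additive Gaussian noise. This is an illustration only: it operates on a single 2D slice stored as nested lists, the noise standard deviation is a hypothetical value, and the remaining transformations (rotation, translation, scaling, Poisson noise) and the 0.95 parameter-modification step are omitted.

```python
import random


def augment_slice(slice_2d, rng, p=0.5, noise_sigma=0.01):
    """Apply each augmentation independently with probability p."""
    out = [row[:] for row in slice_2d]  # copy so the input is not modified
    if rng.random() < p:
        # Right-left flip: reverse the column order within every row.
        out = [row[::-1] for row in out]
    if rng.random() < p:
        # Additive Gaussian noise on every pixel.
        out = [[v + rng.gauss(0.0, noise_sigma) for v in row] for row in out]
    return out


rng = random.Random(0)
example = [[0.0, 0.5, 1.0], [0.2, 0.4, 0.6]]
augmented = augment_slice(example, rng)
print(augmented)
```

For a 3D volume, the same drawn transformations would be applied to every slice so that anatomy stays aligned across the stack.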
During training, CT series volume mini-batches were randomly selected for each epoch, without replacement. Each mini-batch could contain a variable number of slices, depending on the scan acquisition parameters. To limit unnecessary memory and computational resource allocation, and to process standardized, equally sized 3D volumes, the number of axial CT slices was therefore fixed at 35, with additional padded slices generated along the z-axis, as needed, if fewer than 35 were selected.
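A minimal sketch of the slice-count standardization follows, assuming blank slices are appended at the end of the stack with a constant fill value and that stacks longer than 35 slices are truncated. The text states only that padding is added when fewer than 35 slices are present, so the padding position, the fill value, and the truncation rule are assumptions.

```python
def pad_or_truncate_slices(volume, target=35, pad_value=0.0):
    """Return a stack with exactly `target` slices along the z-axis.

    `volume` is a list of 2D slices (lists of rows) ordered along z.
    """
    if not volume:
        raise ValueError("volume must contain at least one slice")
    if len(volume) >= target:
        return volume[:target]  # assumed rule for overly long stacks
    rows, cols = len(volume[0]), len(volume[0][0])
    padding = [
        [[pad_value] * cols for _ in range(rows)]
        for _ in range(target - len(volume))
    ]
    return volume + padding


short_stack = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(3)]
fixed_stack = pad_or_truncate_slices(short_stack)
print(len(fixed_stack))
```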
To control for data imbalance in our training set, we developed a standardized batch sampling strategy. This included, for each batch: (1) selecting 8 stroke-positive and 4 stroke-negative scans, to ensure a fixed proportion of positive versus negative exams; and (2) selecting 7 scans acquired from General Electric (GE) CT platforms and 1 from Siemens platforms for stroke-positive patients, and 2 from GE and 1 from Siemens for stroke-negative patients, to reflect the manufacturer distribution of scanner platforms typically available for emergency department "stroke code" use at both institutions. Our datasets also included a broad range of small (<20mL), intermediate (<50mL), and large (>50mL) infarcts (Table).
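A stratified sampler of this kind can be sketched as drawing a fixed quota from each (label, manufacturer) pool. In the sketch below the quotas are a parameter, with example values copied from the per-manufacturer counts as printed in the text; the pools of scan identifiers are hypothetical.

```python
import random


def sample_batch(pools, quotas, rng):
    """Draw `quotas[stratum]` scans without replacement from each stratum."""
    batch = []
    for stratum, count in quotas.items():
        batch.extend(rng.sample(pools[stratum], count))
    rng.shuffle(batch)  # avoid a fixed stratum order within the batch
    return batch


# Hypothetical scan identifiers grouped by (stroke label, manufacturer).
pools = {
    ("positive", "GE"): [f"pos_ge_{i}" for i in range(100)],
    ("positive", "Siemens"): [f"pos_si_{i}" for i in range(20)],
    ("negative", "GE"): [f"neg_ge_{i}" for i in range(80)],
    ("negative", "Siemens"): [f"neg_si_{i}" for i in range(15)],
}
quotas = {
    ("positive", "GE"): 7,
    ("positive", "Siemens"): 1,
    ("negative", "GE"): 2,
    ("negative", "Siemens"): 1,
}
batch = sample_batch(pools, quotas, random.Random(0))
print(batch)
```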
Moreover, among stroke-positive scans, there was a large percentage of very small infarcts (<1mL) in the training set (455/1896=24%). The signal-to-noise ratio, and hence CT conspicuity, of these tiny infarcts is likely to be poor, which could contribute to both decreased accuracy for stroke detection and an increased error rate for small structure segmentation, impacting Dice loss. We therefore studied the effects on model performance of excluding infarcts smaller than 1 or 5mL in our analyses (Figure 1a). Those results suggest that, for future clinical implementation, exclusion of infarcts smaller than 1mL might provide an appropriate operating point on the ROC curve as a trade-off between sensitivity and specificity for stroke detection.
Our neural network was trained using the Adam optimizer; network parameters were initialized with the uniform approach proposed by Glorot and Bengio [24]. The learning rate was reduced by a factor of 0.75 when the validation loss did not improve after 20 epochs. Our network trained for a maximum of 200 epochs, processed using 4 NVIDIA Tesla V100 GPUs with 32GB RAM, allowing batch sizes of twelve 3D volumes; training a single model took approximately 2.5 days. Such computationally demanding training was prohibitive for extensive hyperparameter search; approximately 400 different models were trained during the roughly 2-year development cycle. Hyperparameter search was performed manually with a grid search approach; the following parameters were tuned: learning rate, loss weights, batch sampling strategy (random uniform, positive/negative sampling, manufacturer sampling), exclusion/inclusion of infarcts (<1mL, <5mL), and size of the first convolutional layer. After curation and data cleaning, several models were refined, and some hyperparameters were adjusted. Hyperparameter tuning was performed on the validation and training sets exclusively. Next, a small set of models was selected according to predefined performance metrics, including but not limited to Dice scores for the segmentation masks, ROC-AUC, and sensitivity/specificity for stroke detection and volume estimation at the >0-20mL, >20-50mL, and >50mL segmented thresholds. These models were presented to a panel of several experienced radiologists (DC, BB, JKC), blinded to the specific model parameters, but with the performance metrics and a random, representative sample of results available for review for each model. The experts ranked these models and provided justification for their ratings; majority voting was used to select the final model to use for test set comparison to three independent expert neuroradiologists (ML, GG, SP) (Figure 1a).
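The learning-rate rule described above (multiply by 0.75 when the validation loss has not improved for 20 epochs) can be sketched as a small stateful helper. The initial learning rate is a hypothetical value, since none is given in the text, and the class is an illustration rather than the training framework's own scheduler.

```python
class PlateauSchedule:
    """Reduce the learning rate when the validation loss stops improving."""

    def __init__(self, lr, factor=0.75, patience=20):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
            if self.epochs_without_improvement >= self.patience:
                self.lr *= self.factor
                self.epochs_without_improvement = 0
        return self.lr


schedule = PlateauSchedule(lr=1e-3)  # hypothetical initial learning rate
schedule.step(0.50)                  # first epoch sets the best loss
for _ in range(20):                  # 20 epochs with no improvement
    current_lr = schedule.step(0.50)
print(current_lr)
```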
For model metrics, 95% confidence intervals were computed using either the simple asymptotic method (for classification metrics) or a bootstrapping technique (for continuous values, bootstrap size 500). Bland-Altman plot analysis was performed with MedCalc software (MedCalc for Windows, v19.8 / 2021, Ostend, Belgium). Python (v3.7) with the NumPy package (v1.2) was used for all other statistical calculations, including but not limited to ROC curve analyses and linear regression. A p-value <0.05 was considered statistically significant.
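Both interval types can be illustrated with a short sketch. It assumes that the "simple asymptotic method" refers to the normal-approximation (Wald) interval for a proportion, and that the bootstrap interval is a percentile interval over 500 resamples; the counts and values used are hypothetical.

```python
import math
import random
import statistics


def wald_interval(successes, total, z=1.96):
    """Normal-approximation 95% interval for a proportion, clipped to [0, 1]."""
    p = successes / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def bootstrap_interval(values, statistic=statistics.mean,
                       n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap interval for `statistic` over `values`."""
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower_index = int(tail * n_boot)
    upper_index = min(n_boot - 1, int((1.0 - tail) * n_boot))
    return estimates[lower_index], estimates[upper_index]


# Hypothetical inputs for illustration only.
low, high = wald_interval(50, 100)
boot_low, boot_high = bootstrap_interval([1.0, 2.0, 3.0, 4.0, 5.0])
print((low, high), (boot_low, boot_high))
```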

Data availability
The training, validation, and test datasets generated for this study are protected patient information. Some data may be available for research purposes from the corresponding author upon reasonable request.

Code availability
The code base for the deep-learning framework makes use of proprietary components and we are unable to publicly release the full code base. However, all experiments and implementation details are described in sufficient detail in the Methods to enable independent replication with non-proprietary libraries.

Figure 1
Model performance for infarct detection (a, ROC curve) and delineation (b, scatterplot; c, Bland-Altman plot; d, confusion matrices), based on DWI ground truth, compared to three human experts. (a) Model AUC was 0.95; sensitivity/specificity for infarct detection were 0.96/0.72 at a 0mL-threshold operating point, 0.82/0.92 at a 1mL-threshold, and 0.78/0.98 at a 5mL-threshold, compared to mean reader sensitivity/specificity of 0.64/0.91. (b) Model infarct volume estimates strongly correlated with those of DWI ground truth (r² > 0.98). As per the Bland-Altman plot (c), the model had excellent performance for distinguishing infarcts smaller versus larger than 50mL (95%CI<+17mL), the volume threshold used for patient selection in major late window stroke treatment trials. Expert interrater Cohen's kappa values ranged from 0.42-0.48, suggesting significant variability compared to the model, confirmed by the confusion matrices for volume segmentation (d, mean study-counts-per-category and ranges shown for the 3 experts; calculated at the model's 0mL-threshold for infarct detection).
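The detection metrics and inter-rater agreement reported in this caption follow standard definitions, sketched below for binary (stroke present or absent) calls. The example readings are hypothetical and are not study data.

```python
def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity) for paired binary labels."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters making binary calls on the same scans."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    a_pos = sum(rater_a) / n
    b_pos = sum(rater_b) / n
    expected = a_pos * b_pos + (1.0 - a_pos) * (1.0 - b_pos)
    return (observed - expected) / (1.0 - expected)


# Hypothetical readings on four scans (1 = infarct present, 0 = absent).
dwi_truth = [1, 1, 0, 0]
reader_1 = [1, 0, 0, 0]
sens, spec = sensitivity_specificity(dwi_truth, reader_1)
kappa = cohens_kappa(dwi_truth, reader_1)
print(sens, spec, kappa)
```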