Distinct brain morphometry patterns revealed by deep learning improve prediction of aphasia severity

Emerging evidence suggests that post-stroke aphasia severity depends on the integrity of the brain beyond the stroke lesion. While measures of lesion anatomy and brain integrity combine synergistically to explain aphasic symptoms, significant interindividual variability remains unaccounted for. A possible explanatory factor may be the spatial distribution of brain atrophy beyond the lesion. This includes not just the specific brain areas showing atrophy, but also distinct three-dimensional patterns of atrophy. Here, we tested whether deep learning with Convolutional Neural Networks (CNN) on whole brain morphometry (i.e., segmented tissue volumes) and lesion anatomy can better predict which individuals with chronic stroke (N=231) have severe aphasia, and whether encoding spatial dependencies in the data might be capable of improving predictions by identifying unique individualized spatial patterns. We observed that CNN achieves significantly higher accuracy and F1 scores than Support Vector Machine (SVM), even when the SVM is nonlinear or integrates linear and nonlinear dimensionality reduction techniques. Performance parity was only achieved when the SVM was directly trained on the latent features learned by the CNN. Saliency maps demonstrated that the CNN leveraged widely distributed patterns of brain atrophy predictive of aphasia severity, whereas the SVM focused almost exclusively on the area around the lesion. Ensemble clustering of CNN saliency maps revealed distinct morphometry patterns that were unrelated to lesion size, highly consistent across individuals, and implicated unique brain networks associated with different cognitive processes as measured by the wider neuroimaging literature. Individualized predictions of severity depended on both ipsilateral and contralateral features outside of the location of stroke. 
Our findings illustrate that three-dimensional network distributions of atrophy in individuals with aphasia are directly associated with aphasia severity, underscoring the potential for deep learning to improve prognostication of behavioral outcomes from neuroimaging data, and highlighting the prospective benefits of interrogating spatial dependence at different scales in multivariate feature space.


Aphasia is a language processing disorder that often results from strokes affecting the brain hemisphere dominant for language. It is estimated that aphasia impacts around 30% of all individuals who have survived a stroke 1 and although many individuals experience spontaneous recovery, chronic language impairments are common 2,3,4, with deficits persisting beyond 6 months in up to 60% of patients 5. Chronic aphasia is strongly associated with reduced quality of life, more so than stroke alone 6, Alzheimer's, or cancer 7.

Worse aphasic symptoms in the chronic stages have been associated with worse aphasia in the acute period 8 as well as larger lesion volumes and older age at time of stroke 9,10,11. Nonetheless, comprehensive models that include lesion characteristics, acute

Healed T1 scans were segmented (binary) into volumes containing white matter, grey matter, and cerebrospinal fluid using FAST 48. The same healed T1 scans were normalized to the MNI152 (2mm) template distributed with the FMRIB Software Library (FSL) 49 using the fsl_anat pipeline (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/fsl_anat), which combines linear and nonlinear registration methods. Segmented tissue and lesion maps were transformed to template space using k-nearest neighbor interpolation and merged to generate ordinal morphometric maps, with voxels inside the lesion being assigned a 4th "tissue" value. This procedure yielded a tissue map with brain structures outside of the lesion and the lesion map itself. Note that the enantiomorphic healed tissue was not included, but it was replaced by the lesion. The maps were then downsampled to 8 mm voxel size and the field of view was cropped to remove any empty space present across all study participants. Finally, each map was scaled to range between -1 and 1. A visual overview of image preprocessing can be found in Figure 3A.

predictions made by the model (i.e., true positives and false positives). The F1 score, which represents the harmonic mean between precision and recall, was prioritized above other measures for model assessment. This score is more appropriate when classes are imbalanced, and when it is crucial for the model to optimize for both false positives and false negatives. By seeking a balance between these error types, the F1 score ensures that the model does not favor one class over the other. Preference for the F1 score is motivated by the utility of prognosticating patients with severe aphasia, which has the potential to guide more efficient distribution of clinical resources. In this context, false negative predictions are arguably worse than false positives because patients that could benefit from treatment may be missed.
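The precision/recall trade-off that the F1 score encodes can be made concrete with a short calculation. A minimal sketch in plain Python (the function and the example counts are ours for illustration, not from the study):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts: the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # of predicted positives, how many were right
    recall = tp / (tp + fn)     # of actual positives, how many were caught
    return 2 * precision * recall / (precision + recall)

# A model that catches only 5 of 15 severe cases can still post high
# overall accuracy if the nonsevere majority is easy; F1 exposes the misses.
print(round(f1_score(tp=5, fp=3, fn=10), 3))  # 0.435
```

Because both false positives (via precision) and false negatives (via recall) pull the harmonic mean down, a classifier cannot score well on F1 by favoring the majority class.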
the network learning rate (0.1e-4, 0.8e-4, 1e-4; see supplemental material for decision on magnitude). To speed convergence and prevent suboptimal solutions, a cosine annealing scheduler with warm restarts was used to adjust the learning rate 56. First, the learning rate was reduced using the cosine annealing schedule, decreasing the rate to 1e-10 over 50 epochs. The process was then restarted. The number of epochs over which cosine annealing was scheduled increased by a factor of 2 with each restart. The CNN was trained and tuned over 800 epochs using mini-batches of 128 samples to encourage gradient stabilization. See supplemental material for information on measures and criteria used to initiate early stopping during training.
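The schedule described above can be written out directly. A sketch assuming the standard warm-restart formulation (PyTorch ships an equivalent as `CosineAnnealingWarmRestarts`); the defaults mirror the values quoted in the text:

```python
import math

def lr_at_epoch(epoch: int, lr_max: float, lr_min: float = 1e-10,
                t0: int = 50, t_mult: int = 2) -> float:
    """Cosine annealing with warm restarts: the rate decays from lr_max
    to lr_min over a cycle, then resets to lr_max; each new cycle is
    t_mult times longer than the last (50, 100, 200, ... epochs here)."""
    t_i, t_cur = t0, epoch
    while t_cur >= t_i:  # walk forward to the cycle containing `epoch`
        t_cur -= t_i
        t_i *= t_mult
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

At epoch 0 the rate equals lr_max, halfway through a cycle it has fallen to roughly half, and at epoch 50 the first restart snaps it back up to lr_max.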

Classical machine learning

Support Vector Machines (SVMs) were one of the first machine learning methods introduced to neuroimaging 51 and remain the most popular machine learning method in the field (see Figure 2). We trained SVMs to predict aphasia severity by minimizing hinge loss weighted by inverse class frequencies using the Sequential Minimal Optimization (SMO) algorithm. In SVMs, the kernel trick efficiently transforms data into a higher-dimensional space through which a hyperplane maximizing class separability can be more successfully optimized using different kernel functions 57. Because optimizing more hyperparameters may lead to model overfitting, we assessed more optimistic estimates of SVM generalizability (i.e., performance) by training independent models using the linear and radial basis kernel functions. Indeed, when we tested SVMs that tuned the kernel function alongside other hyperparameters, performance plummeted (see supplemental section). Other hyperparameters were tuned using random search with 300 logarithmically spaced bins for each parameter. These included kernel scale, which controls the smoothness of the kernel function (ranging from 1e-3 to 1e3), and cost, which controls the width of the margin and balances the trade-off between maximizing the margin and minimizing hinge
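The random-search grid can be sketched as follows (pure Python; the helper names are ours). Each hyperparameter draw picks one of 300 logarithmically spaced candidates, so small and large scales are sampled with equal probability:

```python
import math
import random

def log_spaced(lo: float, hi: float, n: int) -> list[float]:
    """n candidate values spaced evenly in log10 between lo and hi."""
    lo_exp, hi_exp = math.log10(lo), math.log10(hi)
    return [10 ** (lo_exp + i * (hi_exp - lo_exp) / (n - 1)) for i in range(n)]

def draw_svm_params(n_draws: int, seed: int = 0) -> list[tuple[float, float]]:
    """Random (kernel scale, cost) pairs from the 1e-3..1e3 log grid."""
    rng = random.Random(seed)
    grid = log_spaced(1e-3, 1e3, 300)
    return [(rng.choice(grid), rng.choice(grid)) for _ in range(n_draws)]
```

Sampling in log space matters here: a uniform draw between 1e-3 and 1e3 would land above 1 almost every time, leaving the small-scale end of both hyperparameters effectively unexplored.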

Subtyping distinct brain patterns through deep learning

Unsupervised learning of Grad-CAM++ maps was used to explore the heterogeneity of brain patterns learned by the CNN. Feature saliency maps were used as inputs instead of CNN layers to understand which brain structures were implicated. Reliable clustering solutions were identified and generated through the framework of consensus clustering, wherein a resampling technique is used to compute clustering consensus 68. Consensus is defined as the proportion of times that a pair of samples from the data are placed into the same cluster when they are subsampled together. This measure is computed for each pair of samples in the data to form a symmetric similarity matrix that embeds a consensus solution, which can be retrieved using a
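The consensus computation can be sketched in a few lines. This sketch assumes the feature-subsampling variant described in the methods (each run clusters all maps on a random voxel subset); `cluster_fn` stands in for the k-means step and is an assumption of the sketch, not the study's code:

```python
import random

def consensus_matrix(maps, cluster_fn, frac=0.6, n_iter=1000, seed=0):
    """consensus[i][j] = fraction of runs in which maps i and j were
    assigned to the same cluster, where each run clusters every map
    using a random `frac` subsample of its voxels."""
    rng = random.Random(seed)
    n, n_vox = len(maps), len(maps[0])
    consensus = [[0.0] * n for _ in range(n)]
    for _ in range(n_iter):
        vox = rng.sample(range(n_vox), int(frac * n_vox))
        labels = cluster_fn([[m[v] for v in vox] for m in maps])
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    consensus[i][j] += 1 / n_iter
    return consensus
```

Pairs that co-cluster regardless of which voxels are sampled approach 1 in the matrix, while unstable pairings are pushed toward 0, which is what makes the matrix a useful input for a final similarity-based clustering step.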

Here, we subsampled 60% of the voxels in feature saliency maps 1000 times, each time performing k-means clustering using the k-means++ algorithm with 250 replicates to improve solution stability 74. The eta² distance measure was used for k-means clustering (i.e., 1 minus the eta² coefficient). This coefficient aims to capture voxelwise similarity between whole-brain maps, with 1 reflecting identical maps, and 0 reflecting maps that share no variance 75. It shares some resemblance to the Pearson correlation coefficient but quantifies similarity on a point-by-point basis, therefore accounting for scaling and offset when comparing two whole-brain maps 75. For each subsample, clustering was repeated to generate solutions ranging from 3 to 30 clusters.
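A sketch of the eta² coefficient as we read the cited spatial-similarity definition (an assumption of this sketch, not code from the study): it compares maps point by point against their pointwise means, so a shifted or rescaled copy of a map scores below 1, unlike Pearson's r:

```python
def eta_squared(a: list[float], b: list[float]) -> float:
    """Eta² similarity between two maps: 1 = identical, 0 = no shared
    variance. Pointwise squared deviations around the per-voxel mean of
    the two maps, relative to deviations around the grand mean."""
    m = [(x + y) / 2 for x, y in zip(a, b)]  # per-voxel mean of both maps
    grand = sum(m) / len(m)                  # grand mean across both maps
    within = sum((x - mi) ** 2 + (y - mi) ** 2 for x, y, mi in zip(a, b, m))
    total = sum((x - grand) ** 2 + (y - grand) ** 2 for x, y in zip(a, b))
    return 1.0 - within / total

def eta_distance(a: list[float], b: list[float]) -> float:
    """The distance fed to k-means: 1 minus the coefficient."""
    return 1.0 - eta_squared(a, b)
```

Identical maps give a within-pair deviation of zero and hence a coefficient of exactly 1; maps whose values disagree at every voxel drive the coefficient toward 0.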

The most complex solution (i.e., highest number of clusters) that exhibited high reliability was

The robustness of the CNN classifier was tested more formally. Class labels were randomly permuted 500 times and the model building and testing process was repeated. Our results showed that CNNs trained on permuted data produced a distribution of F1 scores with no overlap with CNNs trained on unpermuted data (Figure 3B). Performance significantly better than chance (p < 0.000001) indicated the network learned meaningful features that distinguish aphasia severity. As further evidence of model robustness, we showed that learned features exhibited a similar structure across cross-validation repeats. T-distributed stochastic neighbor embedding 114 was used to group participants based on features extracted from all models and revealed distinct clusters for predicted classes (Figure 3C; also see supplemental material). The influence of lesion size (i.e., small, medium, and large based on 3 quantiles) and more granular aphasia severity categories on the model decision boundary were also investigated (Figure 3C).
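The permutation logic is straightforward to sketch (helper names are ours; the study's pipeline repeats the full model building and testing per permutation, which we abstract into a scoring function). We include the conventional +1 correction so the estimated p-value is never exactly zero:

```python
import random

def permutation_p_value(score_fn, labels, n_perm=500, seed=0):
    """One-sided permutation p-value: how often does a model scored on
    shuffled labels do at least as well as on the true labels?"""
    rng = random.Random(seed)
    observed = score_fn(labels)
    exceed = sum(score_fn(rng.sample(labels, len(labels))) >= observed
                 for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)
```

With 500 permutations the smallest reportable p-value under this correction is 1/501; a null distribution with no overlap with the true-label scores, as observed here, pins the estimate at that floor.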

CNN models often correctly classified participants with medium-sized lesions that belonged to both severity classes, suggesting lesion size was unlikely to be the sole driver of predictions.

We next determined whether SVMs outperformed CNNs (Figure 4). This could indicate that CNNs, which generally demand more computational resources and involve adjusting more parameters, may have difficulties with overfitting to our sample size and type of data. We first noticed that linear SVMs outperformed nonlinear SVMs across repeats and focus on these models for brevity (Figure S1). Paired t-tests comparing model performance across repeats

Evaluating the impact of CNN properties on feature importance

Multiple feature saliency mapping methods were used to understand whether better CNN performance was the result of the spatial properties specific to CNNs. For simplicity, we focused on the mean-performing models across repeats of cross-validation but note that learned features exhibited a similar structure across repeats (Figure 3C). Grad-CAM++ and deep SHAP saliency maps generated for CNNs were compared to SHAP saliency maps generated for SVMs. As only Grad-CAM++ maps are explicitly sensitive to CNN spatial properties, we expected these maps to look different from SHAP and deep SHAP saliency maps, and for SHAP maps to be more similar to each other, provided that the CNN models exploited spatial dependencies in the data.

To facilitate model comparisons, we focused on predictions that were accurate for CNN and SVM models. First, map similarities were compared using the eta² coefficient (see methods).

Group-averaged feature maps confirmed these trends (Figure 6). Grad-CAM++ maps alone highlighted contralateral regions. These were anterior to the concentration of lesions in the group and predicted severe aphasia. Whether predicting severe or non-severe aphasia, SVMs emphasized the region where lesions were concentrated in the group. This was also true of deep SHAP maps generated for CNNs (see Figure S3 for individual maps).

Having established that the qualitative patterns in Figure 6 are meaningful, we

Although group-averaged saliency maps showed strong class differences in lateralization (i.e., left hemisphere for nonsevere aphasia and right hemisphere for severe aphasia), subgroups of patients exhibited both lateralization patterns, and most subgroups involved more bilateral patterns than suggested by group effects. For example, subgroups 1, 3, and 5 for individuals with severe aphasia showed moderate-to-high saliency in the right hemisphere despite an overall stronger emphasis on the left hemisphere (Figure 9B). In individuals with nonsevere aphasia, one subgroup showed right-lateralized saliency (subgroup 6) and multiple subgroups showed modest-to-moderate saliency in the non-lateralized hemisphere (subgroups 6, 4, and 2). The right-lateralized subgroup (6) showed remarkable similarity to a subgroup of severe aphasia individuals (2) but was distinguished by more bilateral saliency. Indeed, many subgroups of different classes exhibited overall similar saliency patterns, differing mainly by which hemisphere was more strongly emphasized. As an example, one severe aphasia subgroup (4) displayed highest feature saliency in right temporo-occipital cortex, while another nonsevere subgroup (4) displayed peak saliency in left temporo-occipital cortex.

For a thorough list of similarities see supplemental material. We note that despite having lower importance, features contralateral to the saliency peak contributed meaningfully to model performance. "Ablated" SVMs trained on different hemispheres of saliency maps performed slightly worse than SVMs trained on bilateral maps (Figure S4).

Decoding saliency maps using the wider neuroimaging literature provided evidence that some of the variability in aphasia severity is grounded in damage to different language subsystems, but also to non-language systems outside of the area of stroke, as well as aging

Second, decoding showed different portions of superior parietal cortex used for prediction were associated with response inhibition, object and spatial processing, including visuospatial memory and working memory (nonsevere: 3, 5). Third, patterns targeting anterior frontal cortex around the frontal pole and the temporal pole together were associated with higher-level functions such as decision making, but also age, symptom severity, and a range of brain disorders that included epilepsy and Alzheimer's (severe: 1, 3, 6; nonsevere: 1, 2). For more comprehensive analysis, including details about each meta-analytic topic, see supplemental material.

Finally, we found that patient subgroups were not associated with lesion size, as hinted

The integrity of tissue outside of the lesion offers a rich source of information that stands to improve models of stroke outcome. The current study leveraged deep and classical machine learning to understand how well brain morphology and lesion anatomy can predict aphasia severity in chronic stroke. In a repeated, nested cross-validation scheme, we found that CNNs achieved a balanced accuracy of 77% and a median F1 score of 0.7. This performance was both significantly better than chance and better than Support Vector Machines (SVMs). Using a variety of techniques for fusing the information learned by CNNs and SVMs, we found that SVMs were not sensitive to uniquely predictive information. A thorough investigation of feature saliency confirmed that improved CNN predictions were rooted in the identification of more diverse patterns of brain morphology capitalizing on information outside the area of injury.

Evidence of improved prediction through consideration of spatial dependence

The current study used CNNs based on the understanding that they can provide a novel window into brain morphometry outside the lesion that is predictive of aphasia severity. The first line of evidence indicating that our CNNs exploited spatial properties was that they outperformed SVMs, which were not able to provide any unique predictive information. Feature saliency maps produced more compelling confirmation. Group-averaged saliency maps that utilized a method explicitly sensitive to CNN spatial properties (Grad-CAM++) highlighted markedly different features than methods that were insensitive to these properties (SHAP). Compared to SHAP, Grad-CAM++ emphasized an entirely different hemisphere in patients with severe aphasia and with nonsevere aphasia. Critically, SHAP saliency maps were exceedingly similar between CNNs and SVMs, presenting strong evidence that CNN performance was related to exploitation of spatial information. These patterns were corroborated by more quantitative regional analyses, suggesting that often-overlooked spatial dependencies in neuroimaging data may be critical for building more predictive multivariate models.

Limitations

An important caveat to these results is that our data was downsampled to a lower resolution

Without testing every implementation of classical learning, we also cannot definitively say that classical machine learning cannot compete with CNNs for identification of severe aphasia. Our work simply demonstrates that CNNs can be highly effective for this task, more so than other

Future work is likely to find better performance through utilization of more granular brain and behavioral data, as well as the integration of multiple imaging modalities 22.

Here, we show that tuning the kernel function alongside other hyperparameters when building SVM models results in substantially worse performance than simply tuning the same parameters but holding the kernel function fixed to linear (Figure S1). This was the case whether the SVM was trained on all features in the data, or whether dimensionality reduction was applied during training. When linear and nonlinear SVMs were trained independently, linear SVMs tended to outperform nonlinear SVMs. Note, performance for SVMs that were tuned for kernel function and did not involve dimensionality reduction as a preprocessing step was measured over 10 instead of 20 repeats of our cross-validation scheme (these models took the

Example feature saliency maps for individuals

We expected deep SHAP maps generated for the CNN and SHAP maps generated for the SVM to share similarities, and to highlight very different trends compared to Grad-CAM++ maps generated for the CNN. This finding would be consistent with the fact that Grad-CAM++ alone can recognize spatial dependencies exploited by the CNN. Figure S3 qualitatively illustrates this to be the case in a random sample of 6 participants that were correctly predicted by the CNN to have severe aphasia, and another 6 participants that were correctly predicted by the CNN to have nonsevere aphasia.

Affinity propagation was used to extract the final clustering from the selected consensus matrices. We verified that affinity propagation could readily identify the structure of consensus matrices by repeating the clustering process 1000 times. We note that clusters were identical every single time affinity propagation was repeated. We also report that the exemplar samples

To further understand the extent to which each subgroup reflected atrophy patterns affecting different brain systems, we decoded subgroup saliency maps as typified by the exemplar (see methods for more details). Decoding (Figure 9C) confirmed that most feature saliency patterns reflected language function (i.e., subgroups 2, 4, 5, and 7) but also underlined the extent to which they tapped into different language subsystems. For example, one subgroup showing more focal saliency around the right TP and left superior parietal cortex (SPC) was associated with semantics (i.e., top 3 terms mapping onto topic: semantic, word, knowledge; subgroup 5), whereas another subgroup (i.e., 2) showing saliency along large portions of frontal cortex and inferior temporal cortex (ITC) was associated with lexical-semantics, reading, comprehension, and pitch processing (i.e., topics: reading, phonological, readers; speech, auditory, temporal; words, word, lexical; verbs, verb, noun; language, sentences, comprehension; music, musical, pitch). Further, the subgroup with peak saliency in right OC (subgroup 4) displayed the strongest associations to language, including reading and lexical processing (topics: hearing, deaf, sign; words, word, lexical; reading, phonological, readers; semantic, word, knowledge), but also tool knowledge (topic: tool, tools, knowledge) and sensory cortex (i.e., topic: visual, auditory, sensory). At the same time, this subgroup displayed saliency associated with selective adaptation (topic: adaptation, selective, stimulus), as well as motion and object processing (topics: motion, mt, moving; object, objects, visual).
A similar pattern of topics was associated with the subgroup exhibiting peak saliency around right MFG (i.e., subgroup 7), but also included associations to comprehension (i.e., topic: language, sentences, comprehension) and overt spontaneous speech (topic: verbal, fluency, overt). This subgroup was also associated with the dorsal and ventral streams despite being relatively focal (i.e., topic: dorsal, ventral, stream), as well as task switching, integration, item pairs, and action observation (topics: integration, process, task and task, switching, set; item, pairs, item; action, actions, observation). Curiously, several subgroups exhibited saliency related to other studies on morphometry (topic: matter, gray, volume; subgroup 6) as well as age (i.e., topic: adolescents, adolescent, age; subgroups 1, 3, 6). These particular subgroups were also associated with a

of patients with nonsevere aphasia also highlighted diverse patterns of brain integrity learned by the CNN. Combined with the results of Figure 9, these patterns demonstrated that irrespective

(Table S2).

Decoding of patient subtypes (Figure 10C) demonstrated that most patterns of morphometry associated with nonsevere aphasia were unrelated to language. Although many patients with nonsevere aphasia were grouped into subtypes that displayed peak saliency in regions commonly associated with language, only one subgroup (i.e., 6) showed a decoding pattern that mapped strongly onto language function (topics: semantic, word, knowledge; word, words, lexical). Coincidentally, this was the only subgroup exhibiting a right-lateralized saliency pattern. The saliency patterns of two subgroups (i.e., 1 and 2) were associated with morphometry studies (topics: matter, gray, volume) and one of these subgroups also displayed an association with age (topic: adolescents, adolescent, age; subgroup 2). Consistent with patients with severe aphasia that had similar topic associations, decoding also showed a strong relationship to demographics (e.g., sex), higher-level cognitive functions (e.g., emotional regulation, emotional processing, social interactions) and a number of disorders (e.g., PTSD, AD, OCD, depression, epilepsy, ADHD, substance abuse, schizophrenia, anxiety, BPD, generic symptom severity, etc.).

Studies using CNNs and neuroimaging in stroke patients

The main text cites and characterizes 19 PubMed studies that use CNNs and neuroimaging in stroke patients. A more detailed categorization of these studies is presented in a Sankey diagram (Figure S7) and the full list of studies can be found in Table S3. Note, of the 19 studies, one was a review of stroke segmentation studies, one included "stroke" in its abstract but