Latent disconnectome prediction of long-term cognitive symptoms in stroke

Stroke significantly impacts quality of life. However, the long-term cognitive evolution in stroke is poorly predictable at the individual level. There is an urgent need for a better prediction of long-term symptoms based on acute clinical neuroimaging data. Previous works have demonstrated a strong relationship between the location of white matter disconnections and clinical symptoms. However, rendering the entire space of possible disconnection-deficit associations optimally surveyable will allow for a systematic association between brain disconnections and cognitive-behavioural measures at the individual level. Here we present the most comprehensive framework, a composite morphospace to predict neuropsychological scores one year after stroke. Linking the latent disconnectome morphospace to neuropsychological outcomes yields biological insights available as the first comprehensive atlas of disconnectome-deficit relations across 86 neuropsychological scores. Out-of-sample prediction derived from this atlas achieved average accuracy over 80%, which is higher than any other framework. Our novel predictive framework is available as an interactive web application, the disconnectome symptoms discoverer (http://disconnectomestudio.bcblab.com), to provide the foundations for a new and practical approach to modelling cognition in stroke. Our atlas and web application will reduce the burden of cognitive deficits on patients, their families, and wider society while also helping to tailor personalized treatment programs and discover new targets for treatments. We expect the range of assessments and the predictive power of our framework to increase even further through future crowdsourcing.


Introduction
The fidelity of lesion-deficit models depends not only on the quality of the data but also on the underlying theoretical framework. Together they produced evidence of a relationship between the location of brain lesions and clinical symptoms such as visuospatial neglect 1-3, aphasias 4-6, apraxias 7,8 or motor anosognosia 9,10 amongst others. Recently, the associations between anatomical white matter networks and clinical presentations revealed that there is no one-to-one relationship between structures and clinical presentation, as different lesions can cause the same functional impairments 11,12. One example would be that a stroke in the middle or posterior cerebral artery may lead to visuospatial neglect 13, just like different perisylvian white matter disconnections can lead to aphasia 11. Hence, the current methodologies do not capture the potential overlap between brain signatures and clinical manifestations nor the distributed nature of their neural substrate, now familiar from network analyses of functional imaging data 14. Therefore, a comprehensive framework that would systematically associate brain disconnections with cognitive-behavioural assessments is needed for accurate precision medicine 15-19.
Modelling distributed relations is computationally expensive and requires large-scale data. With advances in data modelling and the availability of databases, tackling the high complexity of clinical-anatomical relationships is now conceivable. Beneath the surface complexity there may lie a simpler order that can be described within a compacted representational space. As such, dimensionality reduction algorithms allow defining low-dimensional spaces that can embed multivariate data. In embedding spaces, also known as morphospaces 20,21, patients with similar features cluster together while diverging features are placed apart 12,22. Morphospaces render lesion-deficit relations more easily surveyable. Hence, specific brain features can define territories in a morphospace and help predict symptoms and brain pathologies, similar to typical machine learning approaches 23,24. Artificial intelligence (AI) has recently progressed in modelling the association of symptom severity with medical imaging modalities, e.g., reaching high accuracy and sensitivity in the characterisation of tumour tissues 25. However, AI models need to be refined with a broader spectrum of clinically practical endpoints, including neuropsychological measures.
The next challenge will be making AI patient-centric for a more effective deployment into the clinical routine and to efficiently benefit patients' quality of life 26.
To drive the realisation of this challenge forward, we propose a modelling approach that employs a morphospace to predict neuropsychological assessments of one of the most common neurological disorders: stroke 27. We first mapped the distribution of 1333 brain disconnection patterns in stroke (the disconnectome morphospace). A second dataset (training set) with rich neuropsychological measures 1 year after stroke was imported into this disconnectome morphospace. This second dataset enriched the morphospace with clinical symptoms obtained from 86 neuropsychological assessments. An out-of-sample "validation set" with the same neuropsychological data served to assess prediction accuracy. This procedure, hereafter referred to as disconnectome symptoms discoverer (DSD), reliably predicted the performance of patients with an average accuracy > 80%. To make the DSD tool readily available to the clinical-academic community and facilitate its incorporation into the clinic, we provide an open-access web application (http://disconnectomestudio.bcblab.com), in which individual disconnection patterns can be uploaded to predict the expected 1-year neuropsychological scores. The web application will be interactively updated, thanks to future crowdsourcing, informing the DSD model with any newly available datasets.

Results
The disconnectome morphospace
The first dataset (N=1333 stroke lesions 28; see Supplementary Table 1) was processed to obtain disconnectome maps. Disconnectome maps quantify the pattern of connections interrupted by each lesion based on the high-resolution tractography of a healthy population 12,29,30. Subsequently, the Uniform Manifold Approximation and Projection (UMAP) 31 method was used to embed disconnection complexity. A latent two-dimensional configuration of the disconnectome maps was obtained. Figure 1 indicates that patients' disconnectome profiles distribute based on lesion location and commonly disconnected tracts. For instance, patients with major left or right hemisphere disconnections were embedded in the right and left half of the morphospace, respectively. Similarly, patients with posterior or anterior disconnections were localised at the top or the bottom of the embedded space. Patients with a prominent disconnection of the inferior fronto-occipital fasciculus (IFOF) were located at the bottom left and right extremities of the morphospace, while corticospinal (CST) and arcuate (AF) disconnections were relatively more central. Hence the morphospace appropriately segregated the different profiles of disconnection of the classic tracts 32,33.

The composite morphospace
The extent to which the disconnectome morphospace can predict different neuropsychological performances is currently unknown. To answer this question, we took advantage of the second independent dataset of stroke patients (N=119 stroke lesions 14 ; see Supplementary Table 1) that was extensively explored with standard neuropsychological assessments (N=86, see Supplementary Table 2).
For each patient of the second dataset, disconnectome maps were calculated and imported into the disconnectome morphospace using the UMAP-defined transformation. To tackle uncertainty, patient coordinates in the morphospace were spatially smoothed (see methods). In so doing, each patient's coordinates in the disconnectome morphospace were converted into probabilities of localisation. A Pearson correlation approach was then used to estimate the association between each morphospace coordinate and a neuropsychological performance (see Supplementary Figure 1 for more details). Figure 2 indicates that a medium to large effect size association (all |r| > 0.2) existed between territories in the disconnectome morphospace and neuropsychological scores (Figure 2a-c). Importantly, for some scores, multiple clusters in the disconnectome morphospace, corresponding to different disconnection profiles, apparently led to the same neuropsychological impairment. This confirmed that no one-to-one relationship exists between lesioned structures and clinical disorders, and likewise, different locations of brain damage can lead to the same functional impairment. To avoid a simple linear association between the magnitude of the morphospace coordinates and neuropsychological scores, patients' probabilities of localisation in clusters of significance were modelled by a principal component analysis (later referred to as spatial PCA). For each patient, the first three components of the spatial PCA were entered into a multiple regression analysis to predict single-patient neuropsychological scores 1 year after symptom onset. The multiple regressions created equations modelling the relationship between each patient's potential localisation in the disconnectome morphospace (i.e., as defined by the first three components of the spatial PCA) and their neuropsychological scores. In so doing, we obtained a composite morphospace that takes advantage of the joint strengths of the two datasets.
The composite morphospace accurately and reliably predicted 83 out of 86 neuropsychological scores with a small to large effect size (see Supplementary Table 3).
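The chain of steps described above (smoothing coordinates into localisation maps, pixel-wise correlation, spatial PCA, multiple regression) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the Gaussian width `sigma`, and the use of a plain SVD and least squares in place of FSL smoothing and scikit-learn are assumptions made for illustration. The grid size, pixel size, |r| > 0.2 threshold and the three retained components are taken from the text.

```python
import numpy as np

GRID, PIXEL = 260, 0.05  # image matrix and pixel size reported in the Methods


def localisation_map(xy, sigma=1.0):
    """Gaussian 'probability of localisation' centred on one patient's coordinates.

    sigma (in morphospace units) is an assumed stand-in for the 1 mm kernel.
    """
    axis = (np.arange(GRID) + 0.5) * PIXEL
    gx = np.exp(-0.5 * ((axis - xy[0]) / sigma) ** 2)
    gy = np.exp(-0.5 * ((axis - xy[1]) / sigma) ** 2)
    img = np.outer(gy, gx)
    return img / img.sum()  # normalise so the map sums to 1


def pixelwise_pearson(maps, scores):
    """Pearson r between every pixel (column of `maps`) and the scores."""
    x = maps - maps.mean(axis=0)
    y = scores - scores.mean()
    denom = np.sqrt((x ** 2).sum(axis=0) * (y ** 2).sum())
    with np.errstate(invalid="ignore", divide="ignore"):
        r = (x * y[:, None]).sum(axis=0) / denom
    return np.nan_to_num(r)  # constant pixels get r = 0


def fit_score_model(coords, scores, r_threshold=0.2, n_components=3):
    """Spatial PCA on supra-threshold pixels, then multiple regression."""
    maps = np.stack([localisation_map(c).ravel() for c in coords])
    mask = np.abs(pixelwise_pearson(maps, scores)) > r_threshold
    x = maps[:, mask] - maps[:, mask].mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    pcs = u[:, :n_components] * s[:n_components]  # patient scores on the PCs
    design = np.column_stack([np.ones(len(scores)), pcs])
    beta, *_ = np.linalg.lstsq(design, scores, rcond=None)
    return mask, pcs, beta, design @ beta
```

Here `beta` holds the intercept followed by one weight per principal component, so each neuropsychological score gets its own regression equation of the form score = b0 + b1*PC1 + b2*PC2 + b3*PC3.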

Disconnectome morphospace component mapping
In the next step, we brought the score prediction results back to the neuroimaging space to explore the neuroanatomical patterns leading to symptoms. The first dataset was split in half (2 x 666 disconnectome maps) to assess reproducibility. Latent patterns of predicted neuropsychological performances were statistically associated with the brain disconnection maps of the two halves of the first dataset using voxelwise linear regressions. In doing so, we obtained two sets of maps of brain disconnection for each neuropsychological score (see example in Figure 2d-f and all maps together with their full discussion in Supplementary Material, Section C). We were able to produce a comprehensive atlas of the brain disconnections associated with neuropsychological test scores, and the statistical comparison of the two sets of maps indicated a good level of reproducibility (Pearson R = 0.82). Figure 3 summarises the highest statistical associations, spanning from a medium (0.25 < f² < 0.42) to a high effect size (f² > 0.42). The highest effect sizes were in the left hemisphere, particularly in the frontal lobe connections, indicating the strongest associations between these disconnections and neuropsychological scores (Figure 4a). Some areas can also be associated with multiple different neuropsychological scores.
To summarise this information, we calculated a versatility map that indicates how many neuropsychological scores can be predicted with a large effect size per volume unit of white matter (Figure 4b). The versatility map revealed a clear asymmetry between the left and the right hemispheres.
This lower effect size and higher versatility in the right hemisphere suggests that more work is required to finely measure and dissociate right hemisphere functions in neuropsychology.
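As an illustration of the versatility map, the count per voxel can be sketched as follows. This is a minimal example, assuming the effect size maps are stacked as one row per neuropsychological score and that "large" means f² above the 0.42 bound quoted above; the function name is hypothetical.

```python
import numpy as np

LARGE_F2 = 0.42  # lower bound of a large effect size, as quoted in the text


def versatility_map(effect_maps, threshold=LARGE_F2):
    """Count, for every voxel, how many scores exceed the effect size threshold.

    effect_maps: array of shape (n_scores, ...) holding one f2 map per score.
    """
    return (np.asarray(effect_maps) > threshold).sum(axis=0)


# Toy example: three scores, three voxels
toy = np.array([[0.50, 0.10, 0.43],
                [0.60, 0.42, 0.00],
                [0.20, 0.90, 0.41]])
counts = versatility_map(toy)
```

A voxel with a high count is associated with many scores, which is the sense in which the right hemisphere is described as more versatile.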
Accuracy in predicting neuropsychological scores at 1 year after stroke
To assess the accuracy of the predictions, data derived from a third independent dataset (20 stroke patients withheld from the original dataset 14; see Supplementary Table 1) were projected into the morphospace. From there, equations derived from the composite morphospace were applied to predict individual neuropsychological scores. Prediction accuracy was assessed as the difference between the observed and predicted scores, normalized by the maximum score (i.e., normalized prediction error; Figure 5). The profile of neuropsychological scores for single patients was predicted with an average accuracy of 84.3 ± 5.6 %, while each test was predicted individually with an average accuracy of 83.9 ± 7 %. Overall, the prediction of two-thirds (N=65) of the tests was replicated in this third independent dataset with an accuracy >80% (Supplementary Table 4).
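The accuracy measure can be made concrete with a short sketch. It assumes that accuracy is the complement of the normalized prediction error, i.e. 100 × (1 − |observed − predicted| / maximum score); the exact formula is not spelled out in the text, and the values below are toy numbers, not study data.

```python
import numpy as np


def prediction_accuracy(observed, predicted, max_score):
    """Accuracy (%) as the complement of the normalized prediction error."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    max_score = np.asarray(max_score, dtype=float)
    error = np.abs(observed - predicted) / max_score
    return 100.0 * (1.0 - np.clip(error, 0.0, 1.0))


# Rows are patients, columns are tests (toy values)
observed = np.array([[8.0, 40.0], [5.0, 55.0]])
predicted = np.array([[7.0, 46.0], [6.0, 52.0]])
max_score = np.array([10.0, 60.0])

acc = prediction_accuracy(observed, predicted, max_score)
per_patient = acc.mean(axis=1)  # accuracy of each patient's profile
per_test = acc.mean(axis=0)     # accuracy of each test across patients
```

Averaging over columns gives the per-patient profile accuracy, and averaging over rows gives the per-test accuracy, matching the two summary figures reported above.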

Disconnection Symptoms Discovery web application
To make this resource and method available for the clinical-research community, we deployed an interactive web application platform called Disconnectome Symptoms Discovery (DSD; http://disconnectomestudio.bcblab.com). The DSD requires the input of brain lesions converted to disconnection maps and returns the expected 1-year neuropsychological scores for individual disconnectome maps (see the DSD user guide in the Supplementary Material, Section E). The DSD prediction model relies on the databases presented in this study, which can be updated on demand with new neuropsychological assessments and patients' disconnectomes.

Discussion
Applying state-of-the-art data embedding methods, we succeeded in combining complementary databases of stroke patients and produced an atlas of neuropsychological scores associated with brain disconnections. Applied to an out-of-sample dataset, this atlas predicted 65 neuropsychological scores with an accuracy of over 80%. An openly available web application, the disconnectome symptoms discoverer (DSD), capitalises on our methods and provides new anatomical insights into cognitive symptoms for researchers and clinicians.
Similar patterns of stroke-induced white matter disconnections were distributed close by in the embedding space, comparably to other research fields using UMAP methods 31, e.g., single-cell genetic transcriptomes 34,35. Therefore, the disconnectome morphospace acted as a reference to quickly import and summarize new stroke disconnections. Such embedded information allowed us to associate single-patient neuropsychological profiles at 1 year after a stroke with territories in the morphospace and profiles of disconnection. By exploring white matter correlates systematically, we created a comprehensive atlas of the neuropsychological scores associated with brain disconnections. Classical functional associations were confirmed, e.g., the lateralisation of motor functions, the left perisylvian language network, the fronto-parietal attentional networks, or the right insula for sickness sensations. In addition, new insights on functioning and disconnection were reported, e.g., the callosum connectivity related to visual neglect, the cerebellum hub for visuospatial memory, and the lingual gyrus for verbal memory (for individual results and discussion see Supplementary Material, Section C).
The atlas allowed for the evaluation of acute MRI scans to predict long-term stroke symptom severity.
These results indicate the suitability of the disconnectome model in predicting a wide range of functional performances and addressing a complete, personalised, individual patient profile. This information will be a valuable resource in clinical settings, for example for the planning of personalized therapeutic and rehabilitation strategies. This is a step forward in comparison to many stroke AI methods that have a purely diagnostic purpose 36. The DSD model has a prognostic vocation based on cross-modal data (neuroimaging input, neuropsychological outcome prediction).
However, predictions were not equally accurate across functions (see Supplementary Table 4). Three factors might explain these differences. First, some neuropsychological scores are more reliable than others in assessing performances 37 . Second, plasticity and interindividual variability might interact with recovery 11,38-40 . Third, the disconnectome model may not capture all the variance of brain injuries.
Indeed, hypoperfusion 41 and hypometabolism 42 factors as well as acute imaging changes such as pseudonormalization 43 are not included.
Besides these limitations, the disconnectome symptoms discoverer (DSD) web application is a free and user-friendly web browser tool that only requires an internet connection. Instant software access and automatic updates make the World Wide Web the ideal medium for clinical translation. Applying the DSD results can help personalise prognosis. Further, while our predictions were validated in an out-of-sample dataset, the DSD web application allows for a wider validation with crowdsourcing usage, through new dataset implementation. Hence, the DSD aims to benefit researchers' understanding of brain functioning and patients' treatment alike.

Methods
Neuropsychological scores. Neuropsychological scores were available for datasets 2 and 3. The details of each neuropsychological evaluation (grading, test battery, administration) are reported in the Supplementary Materials, Section C. In brief, motor abilities (Section C.1) were assessed for upper limb hand grasping, gripping, pinching, grip strength, peg replacement, motion shoulder flexion, wrist extension, and lower limb walking. Language abilities (Section C.2) were assessed using picture naming, non-word repetition, commands, sentence reading, sentence comprehension, and semantic fluency.
Visuospatial abilities (Section C.3) were tested using discrimination accuracy, reaction time, subbing, behavioural inattention, and unstructured symbol cancellation. Visuospatial memory (Section C.4) was evaluated using abstract figures retrieval scores and verbal memory (Section C.5) using listed word recognition scores. A pain scale during the MRI scanning was recorded (Section C.6), and a stroke sickness questionnaire was administered, investigating physical and psychosocial daily sickness (Section C.7).
Disconnectome. The probability of white matter disconnections caused by the stroke event was quantified accounting for controls' connectivity where the lesion occurred. Stroke lesions were manually delineated in T1-weighted MRI scans and subsequently normalized to the MNI152 space (2 mm resolution). The BCBtoolkit "normalization tool" was used with the enantiomorphic normalization option (http://toolkit.bcblab.com) 44. From the Human Connectome Project (HCP), 7T MRI diffusion-weighted scans were processed for N=163 healthy participants, 45% males. For the healthy participants, whole-brain tractography was reconstructed using the same procedure reported in 12. Then, disconnectome profiles were processed with the BCBtoolkit 45. HCP tractography was filtered considering only streamlines passing through each stroke lesion. The filtered tractography was binarized and averaged across the HCP participants. As a result, for each stroke patient a map of probability, ranging from 0 to 1, was obtained to quantify lesion disconnections.
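The filter, binarise and average logic of the disconnectome computation can be sketched on toy data. This is not the BCBtoolkit code: the function name and the representation of streamlines as integer voxel indices are assumptions made to keep the example self-contained.

```python
import numpy as np


def disconnectome_map(lesion_mask, tractograms):
    """Probability (0 to 1) that each voxel is disconnected by the lesion.

    lesion_mask: boolean volume marking lesioned voxels.
    tractograms: one list of streamlines per healthy participant; each
        streamline is an (n_points, 3) integer array of voxel indices.
    """
    lesion_mask = np.asarray(lesion_mask, dtype=bool)
    total = np.zeros(lesion_mask.shape, dtype=float)
    for streamlines in tractograms:
        visited = np.zeros(lesion_mask.shape, dtype=bool)
        for streamline in streamlines:
            idx = tuple(np.asarray(streamline).T)
            if lesion_mask[idx].any():  # keep streamlines crossing the lesion
                visited[idx] = True     # binarise the filtered tractography
        total += visited
    return total / len(tractograms)    # average across participants


# Toy volume of 5 x 1 x 1 voxels with a lesion in the middle voxel
lesion = np.zeros((5, 1, 1), dtype=bool)
lesion[2, 0, 0] = True
line = lambda xs: np.array([[x, 0, 0] for x in xs])
participant_a = [line([0, 1, 2, 3, 4])]      # crosses the lesion
participant_b = [line([1, 2]), line([3, 4])]  # only the first crosses it
prob = disconnectome_map(lesion, [participant_a, participant_b])
```

In the toy example, voxels reached by lesion-crossing streamlines in both participants get probability 1, and those reached in only one participant get 0.5.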
Spatial embedding. Dimensionality reduction of patients' disconnectomes was obtained using the UMAP method 31, a non-linear embedding method that distributes data variability along major axes. The 3-dimensional disconnectome maps of dataset 1 were vectorised and entered as features of the embedding method. As UMAP parameters, an approximation of 15 neighbours and a minimum Euclidean distance of 0.1 were set to obtain a two-dimensional embedding of dataset 1, a space locally connected as a Riemannian manifold that we refer to in this paper as the disconnectome morphospace. The UMAP embedding transformation was stored as a Python object, using the pickle library, so that the same low-dimensional transformation can be applied when new patients are imported into the model. Subsequently, to have positive coordinates with a zero origin, the maximum negative dataset 1 UMAP values across dimensions were added to shift the coordinate scales (Umap 1 and Umap 2).
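A minimal sketch of the embedding, storage and coordinate shift is given below. It assumes the third-party umap-learn package for the UMAP step (imported lazily, so the helper functions can be read and tested without it), and it interprets the shift as a per-dimension offset equal to the magnitude of the most negative coordinate. The function names and the dictionary layout of the pickled file are hypothetical, not the authors' code.

```python
import pickle

import numpy as np


def fit_morphospace(features, n_neighbors=15, min_dist=0.1):
    """Fit a 2D UMAP on vectorised disconnectome maps (rows = patients)."""
    import umap  # umap-learn; imported here so the rest runs without it

    reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                        n_components=2, metric="euclidean")
    embedding = reducer.fit_transform(features)
    return reducer, embedding


def shift_to_positive(embedding, offsets=None):
    """Shift coordinates so that no dimension has negative values."""
    embedding = np.asarray(embedding, dtype=float)
    if offsets is None:  # derive offsets from the reference dataset
        offsets = -np.minimum(embedding.min(axis=0), 0.0)
    return embedding + offsets, offsets


def save_model(path, reducer, offsets):
    """Store the fitted transformation and the offsets for later reuse."""
    with open(path, "wb") as handle:
        pickle.dump({"reducer": reducer, "offsets": offsets}, handle)


def project_new_patients(path, features):
    """Apply the stored transformation and shift to new disconnectomes."""
    with open(path, "rb") as handle:
        model = pickle.load(handle)
    return model["reducer"].transform(features) + model["offsets"]
```

Reusing the stored offsets for new patients keeps them in the same coordinate frame as dataset 1, which is what allows later datasets to be placed in the existing morphospace.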
Relationship to neuropsychological scores. Statistical correlations between patient localisation in the disconnectome morphospace and neuropsychological scores were conducted. Before the multiple regression, UMAP coordinates were converted into a 2D NIfTI image (260x260 matrix, 0.05 mm pixel size), and a Gaussian kernel spatial smoothing of 1 mm was applied (using FSL libraries https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/). This step was conducted to account for the uncertainty of UMAP coordinates and to obtain a spatial distribution of patient localisation in the disconnectome morphospace. Pixel-wise Pearson correlations between the patient probability of localisation and neuropsychological scores were computed with iterative loops in Python (numpy.corrcoef). Only correlations of at least medium effect size were considered informative (). Subsequently, since multiple clusters of pixels survived the threshold, a principal component analysis (PCA) was run to compress the variability of the patient coordinate distribution. The first three principal components were retained (Python sklearn.decomposition.PCA). Subsequently, patients' principal components were entered, as independent variables, in the multiple regression model (Python sklearn.linear_model.LinearRegression) to predict neuropsychological scores.
White matter atlas of neuropsychological components. In order to create a white matter atlas of the evaluated neuropsychological assessments, white matter disconnectomes (dataset 1) were correlated with patients' PCA scores, obtained by running the prediction model on dataset 1. The former (disconnectome data) were used in defining the UMAP space, whereas the latter (model weights) served as variables of the multiple regression model to predict long-term neuropsychological symptoms. Using randomise (FSL libraries), a generalized voxel-based linear regression model was run, with disconnectome maps as independent variables and PCA scores as dependent variables. To assess the replicability of the results, this procedure was repeated twice, splitting dataset 1 into two halves of n=666 subjects each.
The randomise T-maps obtained were used to calculate the corresponding effect size maps (f²; Python code reported in http://www.bcblab.com/BCB/Coding/Coding.html). For each neuropsychological score, three principal component scores were evaluated and the maximum effect size across the components was considered. Subsequently, the highest effect size across neuropsychological assessments was reported in the white matter atlas summary map (FSL libraries find_the_biggest function). The replicability of the neuropsychology white matter atlas was quantified by means of Pearson correlations between the two summary maps.
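The effect size, summary map and replicability steps can be sketched as follows. The authors' own conversion code is at the URL cited above and is not reproduced here; this sketch assumes the standard relation for a single regressor, partial R² = t² / (t² + df) and f² = R² / (1 − R²) = t² / df, and uses a NumPy argmax as a stand-in for FSL's find_the_biggest. All function names are hypothetical.

```python
import numpy as np


def cohen_f2_from_t(t_map, dof):
    """Cohen's f2 from a t-statistic map, assuming f2 = t^2 / dof."""
    return np.asarray(t_map, dtype=float) ** 2 / dof


def summary_atlas(t_maps, dof):
    """Build the summary map from t-maps of shape (n_scores, n_components, n_voxels).

    Returns the highest f2 per voxel and the index of the winning score.
    """
    f2 = cohen_f2_from_t(t_maps, dof).max(axis=1)  # maximum across components
    return f2.max(axis=0), f2.argmax(axis=0)       # maximum across scores


def replicability(map_a, map_b):
    """Pearson correlation between two summary maps (e.g. the two halves)."""
    return float(np.corrcoef(np.ravel(map_a), np.ravel(map_b))[0, 1])


# Toy example: 2 scores, 2 components, 2 voxels, 4 degrees of freedom
toy_t = np.array([[[2.0, 0.0], [1.0, 4.0]],
                  [[3.0, 1.0], [0.0, 0.0]]])
best_f2, best_score = summary_atlas(toy_t, dof=4)
```

Running `summary_atlas` separately on each half of dataset 1 and passing the two `best_f2` maps to `replicability` mirrors the split-half comparison described above.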
DSD web application development. The DSD web application was built using the Django framework (https://www.djangoproject.com). This web framework allows database manipulation and is Python-based. The DSD frontend was created with standard JavaScript and CSS templates, whereas the backend is hosted on a DigitalOcean web server (https://www.digitalocean.com). Gunicorn and Nginx are used to serve the web application in production.

Visualization. A visualisation of the results was performed using Trackvis (http://trackvis.org), FSLeyes for imaging data, and Python matplotlib and seaborn libraries for scatter plots and matrices.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability. All the neuropsychological score maps used for defining the white matter atlas of neuropsychological components are freely available at https://neurovault.org/collections/11260/. The raw dataset imported in the BCBtoolkit software to calculate individual patient disconnectomes is available at https://www.humanconnectome.org (7 T diffusion data). In addition, processed data are available on request to the corresponding author.
Code availability. The code used in the analyses is available as part of the BCBtoolkit package http://toolkit.bcblab.com and the DSD web application http://disconnectomestudio.bcblab.com. Any additional information is available on request to L.T.