Federated disentangled representation learning for unsupervised brain anomaly detection

With the advent of deep learning and the increasing use of brain MRIs, a great amount of interest has arisen in automated anomaly segmentation to improve clinical workflows; however, curating medical imaging data is time-consuming and expensive. Moreover, data are often scattered across many institutions, with privacy regulations hampering their use. Here we present FedDis to collaboratively train an unsupervised deep convolutional autoencoder on 1,532 healthy magnetic resonance scans from four different institutions, and evaluate its performance in identifying pathologies such as multiple sclerosis, vascular lesions, and low- and high-grade tumours/glioblastoma on a total of 538 volumes from six different institutions. To mitigate the statistical heterogeneity among different institutions, we disentangle the parameter space into global (shape) and local (appearance) parameters. Four institutes jointly train shape parameters to model healthy brain anatomical structures. Every institute trains appearance parameters locally to allow for client-specific personalization of the global domain-invariant features. We show that our collaborative approach, FedDis, improves anomaly segmentation results by 99.74% for multiple sclerosis, 83.33% for vascular lesions and 40.45% for tumours over locally trained models, without the need for annotations or sharing of private local data. We found that FedDis is especially beneficial for institutes that share both healthy and anomaly data, improving their local model performance by up to 227% for multiple sclerosis lesions and 77% for brain tumours. Federated learning and unsupervised anomaly detection are common techniques in machine learning. The authors combine them, using multicentred datasets and various diseases, to automate the segmentation of brain abnormalities without the need for annotations or sharing private local data.

Brain magnetic resonance imaging (MRI) is one of the most commonly used tests in neurology and neurosurgery. Due to the high contrast of the soft tissues, different MRI sequence images (for example, fluid-attenuated inversion recovery, FLAIR) are very sensitive to pathology and are useful for detecting several abnormalities such as tumours, inflammation, multiple sclerosis (MS) or acute infarctions.
Multiple sclerosis is an immune-mediated inflammatory disease that destroys myelin and axons, resulting in marked physical disability. Age-related white matter hyperintensities (WMH) in the brain are the consequence of cerebral small vessel disease and are related to vascular risk factors and cognitive impairment. Magnetic resonance imaging is the most commonly used modality for the diagnosis of white matter lesions and assessment of disease progression 1 . The most widely used method for detecting lesions in clinical practice is to threshold the hyperintense regions in an MRI scan (for example, by using FLAIR); however, different residual and intensity artefacts make it challenging to compute the lesion volume. The development of robust and automated lesion detection methods can thus reduce the burden on radiologists and improve diagnostic performance.
Glioma is an aggressive type of brain cancer classified by its histopathological appearance into low-grade (LGG, grades I and II) and high-grade (HGG, grades III and IV) glioma. Despite improvements in diagnosis and treatment, brain tumours are still associated with substantial morbidity and a poor overall prognosis. Automatic assessment of the tumour boundary and volume is beneficial for treatment planning and monitoring treatment response 2 .
Recent technological advances have resulted in much faster and higher-resolution MRIs, which has led to overburdening radiologists with the interpretation and triage of acute findings. One solution to address these challenges is the use of artificial intelligence for the automated segmentation of abnormalities. These methods could provide rapid interpretation of brain MRIs, improving clinical workflow and thus reducing such burdens.
Although machine learning methods have been successfully applied to various medical applications, their performance is limited by the amount and diversity of training data and the availability of annotated images 3 . In the context of medical imaging, datasets are often siloed across many institutions, highly unbalanced due to the low incidence of pathology and often inaccessible due to privacy regulations 4,5 .
Unlike traditional centralized learning, federated learning 6 enables multiple parties to collaboratively train a machine learning model without exchanging the underlying datasets. Despite its promising results in medical imaging [7][8][9][10][11][12][13][14][15] , the performance on unseen datasets acquired from distinct clinical centres is negatively affected 16 . This drop is caused mainly by the statistical heterogeneity in non-IID scenarios 6 , that is, domain shifts due to, for example, acquisition parameters or uniquely manufactured medical devices 17 . Recent methods were proposed to tackle data heterogeneity and domain shifts, for example, by not averaging local statistics (SiloBN) 18 . Higgins and colleagues 19 proposed that disentangled representation learning could improve the generalization of neural networks.
Inspired by recent works on disentangled representations 20-23 and the promising findings of anomaly detection in brain MRI [24][25][26][27] , we propose a federated, unsupervised, domain-agnostic method to segment multiple abnormal brain MRI findings. Compared with simple FedAvg 6 , which does not take the domain shifts of the clients into consideration, we improve the segmentation performance by 35% for MS lesions, 19% for vascular lesions and 12% for brain tumours. Figure 1 shows an overview of our proposed approach. We learn the healthy normative of a brain by learning how to efficiently compress and encode healthy scans x ∈ R H×W from multiple institutions C j , with local datasets D j and N j samples, and then learn how to reconstruct the data x Rec ∈ R H×W to as close to the original input as possible, where H is the height and W is the width of an image. We disentangle the clients' model parameters into shape θ C j S and appearance θ C j A , and only train the shape parameters jointly. After every federated round, the global model is thus updated as follows: θ G ← ∑ M j=1 w j θ C j S , where w j = N j / ∑ M j=1 N j and M is the number of participating clients in the federated learning. At inference (sites 1-6), the abnormal regions are given by binarization of the residual of the abnormal input images and their reconstructed healthy version: r = max(0, x − x Rec ) > τ, where the choice of τ is discussed in the 'Thresholding' section.
The contributions of this work lie in applying federated learning concepts to automate the segmentation of abnormalities in brain MRIs. Specifically, we developed federated disentangled representation learning (FedDis) for unsupervised brain anomaly detection, which is able to leverage MRI scans from four different sites featuring multiple scanner manufacturers (Siemens, Philips) without sharing local data or the need for annotated samples.
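As a sketch of this aggregation step, the shape-only weighted averaging θ G ← ∑ j w j θ S can be written in a few lines of NumPy. The dictionary-of-arrays representation of the parameters and the function name are illustrative stand-ins for a real federated training framework, not the authors' code.

```python
import numpy as np

def aggregate_shape_params(client_shape_params, client_sizes):
    """FedAvg-style weighted average of the clients' shape parameters,
    with weights w_j = N_j / sum_j N_j. Appearance parameters stay local
    and never enter this aggregation."""
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()
    global_params = {}
    for name in client_shape_params[0]:
        global_params[name] = sum(
            w * client[name] for w, client in zip(weights, client_shape_params)
        )
    return global_params

# Two toy clients, each contributing one 'conv1' shape tensor
clients = [{"conv1": np.ones((2, 2))}, {"conv1": 3 * np.ones((2, 2))}]
theta_g = aggregate_shape_params(clients, client_sizes=[100, 300])
# weights are 0.25 and 0.75, so every entry is 0.25*1 + 0.75*3 = 2.5
```

The server only ever sees these shape tensors; each client later pairs θ G with its private appearance parameters.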
To mitigate the domain shifts of the different clients, we propose disentangling the neural network to learn global, shared parameters as well as local, personalized parameters. This allows the network to leverage mutual anatomical structure information across different institutions and to style the output with local, personalized features. We also show how this property can be enforced with the use of a latent contrastive loss and use self-supervision to enforce the healthy reconstruction of abnormal input. We extensively evaluated our method on anomalies such as MS, vascular lesions (WMH), glioblastoma and LGG, using MRI scans from multiple sites with multiple scanners, and demonstrated its superior performance.

Results
Anomaly detection performance. We evaluate the performance of our method on six different institutions and datasets (four public, two internal) that include MS lesions (MSLUB, MISBI, MSKRI), vascular lesions (WMH) and brain tumours (GBKRI, BRATS). We report the structural similarity index (SSIM) on the unseen healthy test data of the participating institutions to evaluate the ability of our method to reconstruct healthy samples, as well as the area under precision-recall curves (AUPRC) to assess the pathology detection performance. Tables 1 and 2 report the aforementioned evaluation metrics of different methods, namely, Baur and co-workers 24 (spatial autoencoders trained on KRI), Local (a model trained on a single local dataset), FedAvg 6 (the federated baseline), SiloBN 18 (a federated method that tackles data heterogeneity), our proposed method, FedDis (which mitigates the domain shifts by learning disentangled representations, keeping the same amount of shared parameters to capture global information), FedDis § (a variant of our method with reduced shared parameters to keep the same network complexity), Data Centralized (a single global model trained on all datasets combined) and U-Net (a supervised model). We use our self-supervision contribution to boost the performance of all unsupervised baselines and ensure a fair evaluation of the federated disentangled contribution, as shown in Fig. 2.
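For readers unfamiliar with the AUPRC metric used throughout the tables, a minimal NumPy implementation of the average-precision-style summation is sketched below. In practice a library routine (for example, scikit-learn's average_precision_score) would be used; this toy version assumes binary pixel labels and untied anomaly scores.

```python
import numpy as np

def auprc(labels, scores):
    """Area under the precision-recall curve via step-wise
    (average-precision) summation over the ranked predictions."""
    order = np.argsort(-scores)              # rank by descending anomaly score
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                   # true positives at each cutoff
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / labels.sum()
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):      # sum precision at recall steps
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# perfect ranking: all anomalous pixels scored above healthy ones
labels = [1, 1, 0, 0]
scores = np.array([0.9, 0.8, 0.2, 0.1])
ap = auprc(labels, scores)                   # perfect ranking gives AP = 1.0
```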
As expected, having access to all data samples in a data-lake trained model achieves the best reconstruction fidelity (SSIM), considerably improving the site results of the local clients by 5.04% on average. Note that the federated methods improve the reconstruction fidelity of local clients even without sharing local data. FedDis achieves the best results among the federated baselines and improves the local/site reconstruction fidelity on average by up to 2.58%. Note that although clients such as OASIS, with a large amount of training data, benefit marginally from the federated training on the local reconstruction task, clients with fewer training data such as ADNI-P or KRI benefit from the learned global shape model of FedDis and improve their reconstruction fidelity by 3.37% and 5.06%, respectively.
Table 1 presents the model's capacity to generalize to unseen sites and segment pathology. Both the Data Centralized and the federated methods improve anomaly detection results over the local clients. SiloBN handles data heterogeneity by averaging only the batch-normalization parameters while keeping batch-normalization statistics (that is, the running mean and variance) private. This improves the results for the site with the largest number of data points (OASIS), which outperforms simple federated averaging. However, the rest of the participants in the federation are negatively affected, resulting in poor average scores. We hypothesize that SiloBN adapts the batch-normalization parameters to the dominating client, which hinders the rest of the clients from generalizing as well. FedDis mitigates the statistical heterogeneity by leveraging the common anatomical information of the human brain and improves the detection of MS on average by up to 110% and 35% over the local and federated averaging, respectively. Similarly, FedDis is able to better detect tumours by up to 41% and 12% on average than local and federated averaging, respectively. The federated paradigm is especially beneficial for clients performing anomaly detection that also share healthy data from the same institute. Specifically, client KRI improved its local performance from 0.130 up to 0.425 (up to 226%) for MS lesions, and from 0.172 up to 0.305 (up to 77%) for glioblastoma. Our proposed method outperforms the upper-bound, Data Centralized model in most cases, indicating that our disentanglement property of leveraging only global information across different sites improves anomaly detection results, even without sharing local data.
Table 2 provides a comparison with supervised methods. We trained U-Net on each of the databases containing pathology and evaluated its performance on the other datasets. Our method achieves competitive results and closes the gap to supervised models; outperforms all U-Net variants by up to 267% on our internal dataset MSKRI; and achieves the second-best results for MSLUB, GBKRI and BRATS. In comparison to supervised methods, FedDis has the following advantages: (1) it does not require any annotated labels during training; (2) it generalizes better to unseen domains, that is, by leveraging multiple different non-IID datasets to learn the normative of the human brain, while preserving data privacy; and (3) it can detect unseen anomalies. By contrast, as previously also shown by Baur et al. 27 , U-Net does a poor job on pathologies not seen during training; for example, its average performance on MS and vascular pathology is 0.08 when trained on brain tumours (BRATS).
Figure 3 gives more insight into the network's predictions and shows generated segmentation masks of different methods on both MS lesions and glioma coming from different testing sites. FedDis reduces the number of false positives and negatives and has a more robust segmentation output for both MS and glioblastoma pathologies.
[Fig. 1 caption: We collaboratively train a neural network on multiple clients (C j | j ∈ {1, 2, 3, 4}), each with its local dataset D C j . For every training round, each client sends only its shape model parameters θ C j S to a server without sharing private local data. The server then aggregates the parameters of all clients and sends back the updated global model θ G . We train the neural network in an unsupervised manner without requiring expensive expert annotation. In doing so, we model the healthy anatomy of the human brain by learning to compress and then reconstruct healthy samples. At inference (sites 1-6), the model sees abnormal samples as input, reconstructs the healthy version of these scans and then segments the abnormalities given by the residual image. Our main contributions are to disentangle the parameters and leverage global anatomical structure while mitigating domain shifts, and to use self-supervision techniques to enforce healthy reconstructions.]
[Table 1 caption: We show the SSIM to assess the reconstruction of healthy unseen test images and the AUPRC to assess the anomaly detection performance on different datasets with pathology. Single and double asterisks mark statistically significant improvements over the local method and the best federated method (FedAvg), respectively (Kolmogorov-Smirnov test; P ≤ 0.05). The § symbol refers to FedDis w/o LCL, but with the same network complexity as our baselines. Bold and underlined values indicate the best and second-best results, respectively. LOL, latent orthogonality loss; LCL, latent contrastive loss.]
Effect of self-supervision. We train our models with healthy magnetic resonance scans from different datasets and institutes. As we want to capture the whole spectrum of healthy brains at different ages, we also include older patients with a high probability of microangiopathy (see Fig. 1). The presence of hyperintense regions in the training set is problematic, as the model will learn these as part of the healthy anatomy and might not be able to accurately detect pathology (for example, MS lesions). Based on this observation, we propose to enforce the healthy reconstruction of samples with two self-supervision techniques, as visualized in Fig. 1. First, to ensure that our training set is healthy, we clean the dataset by painting over values larger than the 98th percentile (see examples in Extended Data Fig. 1); second, we follow context encoders 28,29 and use a strong augmentation technique by drawing rectangles of various sizes and bright intensities over the input sample and forcing the network to paint over these regions. Figure 2 shows AUPRC results for all baselines with and without self-supervision. Our proposed self-supervision technique is applicable to all methods and improves anomaly detection performance. Our method achieves the best anomaly detection scores in both the with- and without-self-supervision set-ups. Note that in the absence of self-supervision, our method best avoids the reconstruction of anomalies, achieving considerably better results than the baselines.

Effect of demographics.
Sources of data heterogeneity beyond acquisition parameters and different vendors (for example, age, sex or ethnicity) might negatively influence the performance of neural networks. The presence of such biases 30 could hinder the use of artificial intelligence systems in clinical practice. To assess the generalizability and clinical applicability of our method, we measured the performance on different sex and age groups for both the MS (MSLUB) and brain tumours (BRATS) pathology ( Fig. 2d-f).
The patient-wise DICE for male (N = 7) and female (N = 23) patients is illustrated in Fig. 2d. We found no statistically significant difference in the mean DICE performance (0.152 ± 0.142 for men and 0.153 ± 0.139 for women). This observation holds for all baselines. Figure 2e shows the patient-wise DICE for different age groups: patients younger than 40 years (N = 20); aged 40 to 60 years (N = 9); and above 60 years (N = 1). Our method achieved the best DICE score (0.179 ± 0.152) for the age group younger than 40 years while having a DICE score of just 0.100 ± 0.091 on patients aged 40 to 60. One possible reason for the performance gap could be the different lesion sizes present in the split: 32.67 mm 2 for younger patients compared with 23.67 mm 2 for the age groups aged 40 to 60 and older than 60. The single patient over 60 had a very small lesion (6.3 mm 2 ), and lesion detection and segmentation failed entirely (DICE = 0) for the age group older than 60 in Fig. 2e. By contrast, we observed no significant difference in detecting HGG across the different age groups in Fig. 2f. FedDis achieved a DICE of 0.444 ± 0.194 for the age group younger than 40 years (N = 9); 0.423 ± 0.207 for patients aged 40 to 60 years (N = 64); and 0.459 ± 0.172 for patients above the age of 60 (N = 90).
Effect of disease severity. Indicators or classification of disease severity were not provided through MS radiology reports; however, we performed an analysis of the effect of the lesion/tumour size on the DICE performance of our method, as shown in Fig. 4. For brain tumours, we show the performance on the BRATS dataset in detecting LGG and HGG. Despite the bigger mean lesion size of 452 mm 2 in LGG compared with 345 mm 2 in HGG, our network performs slightly worse in detecting LGG, with a mean DICE of 0.379 ± 0.169 compared with 0.425 ± 0.193 per patient in detecting HGG. Note that our algorithm generally performs better with increasing lesion size. Specifically, 28.85% and 29.60% of the evaluated magnetic resonance slices had a DICE score above the average scores of 0.23 and 0.39 for MS and brain tumours, respectively. Interestingly, 51.67% and 44.44% of the magnetic resonance slices contain an insignificant lesion (<4 mm 2 ) or no pathology at all, as can be seen in the top rows of the lower panels of Fig. 4. By removing these slices from the computation, we increase the DICE from 0.23 to 0.28 and from 0.39 to 0.41 for MS and brain tumours, respectively. The slices with a mean lesion size above 300 mm 2 achieve a lower DICE score for MS. A possible explanation of this behaviour is the capability of the networks to reconstruct larger lesions, as visualized in the bottom-left plot of Fig. 4, thus leading to poorer anomaly detection.

Effect of disentanglement.
Medical data are inherently heterogeneous due to acquisition parameters, uniquely manufactured medical devices, local demographics, rare pathology occurrence and so on. The different sources of heterogeneity can harm the convergence of neural networks and thus limit their clinical applicability, particularly in the case of multi-institutional training.
To mitigate the domain shift, we propose to learn disentangled representations of shared anatomical structure and local client-specific information.
We found that FedDis with LCL works best for detecting tumours, whereas FedDis without LCL achieves better results for detecting MS. This is in contrast to our preliminary work 20 , in which we showed that LCL is crucial for improving the anomaly detection scores for both MS lesions and tumours. It is worth mentioning that the only differences between the two works are: (1) using a shallower but wider network architecture to improve the reconstruction fidelity; (2) using self-supervision to enforce healthy reconstructions; and (3) adding an additional participating client, namely, KRI. This finding suggests that having powerful autoencoders that are able to better explore and reconstruct the data yields similar anomaly detection scores regardless of the additional constraints.
To showcase the effectiveness of the disentanglement, in Fig. 5a we illustrate a two-dimensional visualization of the global (shape) and local (appearance) latent embeddings of healthy, unseen samples of the clients that participated in the federated training. Note that the shape and appearance embeddings are far from each other, with shape encodings following a similar distribution, whereas the appearance representations are well separated. To further analyse the latent space, we show four samples from the four datasets that are in close proximity in the shape manifold (0.5 radius). The samples show structural similarity (for example, between the shape of the skull or ventricle size) but vary in appearance (for example, by intensity/contrast). This observation correlates with the appearance embeddings of the same samples belonging to different appearance clusters. Figure 5b shows sample reconstructions of our proposed method with and without using the appearance representation. Although both reconstructions seem to capture the global shape of the input, adding the local parameters styles the reconstruction to resemble the original input in terms of contrast, indicating that the local parameters encode indeed appearance information.

Discussion
Our results showed that the proposed automated, unsupervised neural network-based method can identify anomalies and segment critical findings including MS lesions, vascular lesions and tumours on brain MRI scans from multivendor imaging systems from different sites. Although deep-learning methods have been used to detect multiple abnormal findings on brain MRIs, available solutions for detection and assessment of disease progression have been limited to a single dataset and/or require expensive labelling for supervised training. In this work we proposed a privacy-aware, unsupervised method to detect abnormal regions in pathological brain MRIs. We leveraged healthy magnetic resonance scans from multiple institutes and datasets to train a single global model without sharing local sensitive data. We do so by deploying a neural network directly to the clients, training locally and aggregating only the updated models of the local institutes. We also proposed to mitigate the domain shifts among distributed clients by disentangling shared global features (such as anatomical information) from personalized local features (such as appearance). We evaluated our proposed method on multiple datasets containing real pathology data and showed a superior performance over a locally trained model by 99.74% for MS, 83.33% for vascular lesions and 40.45% for tumours.
Clinical relevance. The commonly used approach in clinical practice to find and assess brain magnetic resonance findings is a naïve thresholding-based classifier 17 . This approach requires no training data and selects hyperintense regions in the input images based on a given threshold. […] drop in their diagnostic performance by 33% and 36%, respectively. Clinical institutes usually acquire different magnetic resonance sequences of the same patient in one scan, for example, T1w and T2w images. Although these sequences vary in tissue representation and contrast, they both capture the same underlying anatomical structure. Based on this observation, we can formulate the shape consistency loss to enforce similar shape embeddings for different co-registered sequences. To test this hypothesis, we adapt our internal client, KRI, to enforce the shape consistency based on T1w/FLAIR sequence pairs. The other three clients in the federation compute the shape consistency loss using intensity-augmented versions of the original FLAIR images, as described in the Methods. Interestingly, early results show a slight improvement in detection performance when using pairs of sequences for the shape consistency loss. However, more work is required to analyse the slight shape difference between T1w and FLAIR sequences and their impact on performance.

Thresholding.
To analyse the diagnostic performance of the different models, we reported the area under the precision-recall curve (AUPRC) in Tables 1 and 2. The question of how to balance precision and recall and choose an operating point for a model remains open. Similar to ref. 32 , we followed the approach proposed in ref. 33 , where we opt for an institute-specific threshold τ that results in very few mistakes (<1% false positive rate) on the corresponding healthy test data. We compared the unsupervised choice of the operating point with two different methods that require annotated samples of the test client: choosing a threshold on a random subset containing 15% of annotated samples, and optimizing the best operating point for the whole dataset, that is, the upper-bound DICE. Our approach for FedDis achieved comparable performance to the one using 15% of the annotated samples and showed only a slight drop in performance compared with the upper-bound DICE, highlighting the effectiveness of the unsupervised approach 33 .
[Figure caption: We show the anomaly detection performance of different variants of our method (FedDis w/o LCL) with different complexities for the shared/local parameters. We show the AUPRC to assess the anomaly detection performance on different datasets with multiple sclerosis and brain tumours.]
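The institute-specific operating point described above amounts to taking a quantile of the residual values on healthy test data, so that at most 1% of healthy pixels exceed τ. A minimal sketch follows; the function name and the uniform toy residuals are illustrative, not from the original code.

```python
import numpy as np

def pick_threshold(healthy_residuals, max_fpr=0.01):
    """Choose tau so that at most `max_fpr` of healthy residual values
    exceed it, i.e. the (1 - max_fpr) quantile of healthy residuals."""
    return np.quantile(healthy_residuals, 1.0 - max_fpr)

rng = np.random.default_rng(0)
healthy = rng.random(100_000)          # stand-in for healthy-pixel residuals
tau = pick_threshold(healthy, max_fpr=0.01)
fpr = (healthy > tau).mean()           # roughly 1% of healthy pixels flagged
```

Because τ is computed on healthy data only, no annotated pathological samples are needed to fix the operating point.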

Complexity.
To implement FedDis, we opted to retain the same number of shared parameters, to be able to capture the global information, and add a few parameters for capturing the appearance/local information. However, if we instead choose to keep the same overall network complexity and reduce the number of parameters that are shared globally (see Extended Data Fig. 1 and Table 3), our network still achieves a relative improvement of 94.33%/24.68% for MS, 46%/6.46% for vascular lesions and 42.21%/12.18% for brain tumours over the local and federated averaging, respectively. In our experiments, we opt for the standard autoencoder architecture 24 .
For future work, we plan to investigate more complex architectures for the reconstruction of healthy samples, such as Gaussian mixture variational autoencoders 34,35 .
Privacy concerns. Although the federated paradigm reduces the privacy risks by not explicitly sharing local data, recent works 36,37 demonstrated that sharing model updates makes federated learning vulnerable to inference attacks, that is, data representation leakage from gradients being the essential cause of privacy leakage. To mitigate this issue, recent works 38 have been proposed to, for example, encrypt gradient updates from clients to the server or withhold individual information from global statistics using differential privacy. These works are complementary to our approach and can be integrated into our pipeline to mitigate privacy risks.

Methods
The main concept behind our federated unsupervised anomaly segmentation framework (depicted in Fig. 1) is to model the distribution of healthy anatomy by learning how to efficiently compress and encode healthy brain scans from multiple institutions and then learn how to reconstruct the data to as close to the original input as possible. This enables the detection of pathology from faulty reconstructions of anomalous samples. We first formally introduce the problem and define the federated unsupervised anomaly segmentation set-up. We then present FedDis and elaborate a loss to enforce the disentanglement.
Problem formulation. Given M clients C j with local datasets D j ∈ R H×W×N j consisting of N j healthy brain magnetic resonance scans x ∈ R H×W , our objective is to train a global model f θ G (·), leveraging the healthy brain scans from multiple institutes without sharing local data, to detect and segment the pathology r q ∈ R H×W for a given query brain scan x q as r q = max(0, x q − f θ G (x q )) > τ, where the choice of τ is discussed in the 'Thresholding' section.
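The residual-based segmentation in this formulation can be sketched directly; the function name and toy arrays below are illustrative, not the authors' code.

```python
import numpy as np

def segment_anomalies(x_query, x_reconstruction, tau):
    """Binarise the positive residual r = max(0, x_q - x_rec) > tau.
    Only regions the model failed to reconstruct are flagged."""
    residual = np.maximum(0.0, x_query - x_reconstruction)
    return residual > tau

x_q   = np.array([[0.2, 0.9], [0.3, 0.1]])   # query with one hyperintense pixel
x_rec = np.array([[0.2, 0.3], [0.3, 0.2]])   # pseudo-healthy reconstruction
mask  = segment_anomalies(x_q, x_rec, tau=0.4)
# only the pixel with residual 0.6 is flagged as anomalous
```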
FedDis. Federated learning aims to collaboratively learn a global model f θ G (·) without centralizing or sharing training data. At each communication round, the local clients are initialized with the global weights and trained locally on their own datasets for a fixed number of epochs to minimize the local objective min θ C j L(θ C j ; D j ), after which the learned local parameters θ C j are aggregated into a new global model θ G ← ∑ M j=1 w j θ C j , where w j = N j / ∑ M j=1 N j is the respective weight coefficient. A popular architecture to learn efficient data representations in an unsupervised manner is the convolutional autoencoder 39 , where an encoder E is trained to compress x to a latent representation z ∈ R d , from which a decoder D attempts to reconstruct the original by minimizing the objective L Rec (x, D(E(x))), where a common choice for the reconstruction loss is the mean absolute error L Rec (x, x Rec ) = ∥x − x Rec ∥ 1 .
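As a small illustration of the reconstruction objective, the mean absolute error between an input and its reconstruction can be computed as follows (toy values; the function name is hypothetical):

```python
import numpy as np

def l1_reconstruction_loss(x, x_rec):
    """Mean absolute error between the input and its reconstruction,
    the L_Rec used to train the autoencoder."""
    return np.mean(np.abs(x - x_rec))

x     = np.array([[1.0, 0.0], [0.5, 0.5]])
x_rec = np.array([[0.8, 0.1], [0.5, 0.1]])
loss = l1_reconstruction_loss(x, x_rec)  # (0.2 + 0.1 + 0.0 + 0.4) / 4 = 0.175
```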

Disentanglement.
To mitigate the statistical heterogeneity, but leverage the shared structural anatomical information among the distributed clients, we propose to disentangle θ into θ S and θ A , and only share the former in the federation. After every communication round, the global model parameter is thus updated as follows: θ G ← ∑ M j=1 w j θ C j S , where θ C j S is the shape parameter at client C j . At inference, each client can style the global model parameters with its locally trained parameters, resulting in M personalized models given by the combination of θ G and θ C j A .
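The per-client personalization step, combining the shared shape parameters with the private appearance parameters, can be sketched as a simple merge of two parameter dictionaries; this is an illustrative simplification of how a framework would assemble the personalized model, with hypothetical key names.

```python
import numpy as np

def personalize(global_shape_params, local_appearance_params):
    """Combine the shared global shape parameters (theta_G) with one
    client's locally trained appearance parameters (theta_A) into a
    single personalized parameter set."""
    model = dict(global_shape_params)       # shared across all clients
    model.update(local_appearance_params)   # private, never federated
    return model

theta_g = {"shape.conv1": np.zeros(4)}
theta_a = {"appearance.conv1": np.ones(4)}
client_model = personalize(theta_g, theta_a)
```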
To further enforce the disentanglement, we introduce the following losses. First, the shape consistency loss (SCL): shape embeddings of a given brain scan should be similar under different intensity augmentations, for example, changes in brightness/contrast, random gamma shifts or corresponding pairs of different magnetic resonance sequences; we chose the last two. Second, the latent orthogonality loss (LOL): latent representations for shape and appearance should be orthogonal to each other. We thus define the LCL as L LCL (z A , z S , z γS ) = β L |L CS (z A , z S )| − β S L CS (z S , z γS ), where z A , z S and z γS are the latent representations for the appearance, the shape and the shape of the gamma-shifted image, respectively; β S and β L are used to weigh the two contrasting loss terms; and L CS (x 1 , x 2 ) = cos(x 1 , x 2 ) is the cosine similarity. The overall objective is given by L Rec (x, x Rec ) + α L LCL (z A , z S , z γS ), with regularization weight α. Finally, the anomaly segmentation is given by the binarization of the residual r = max(0, x q − x Rec ) > τ.
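A minimal sketch of the latent contrastive loss follows. Note that the exact sign and weighting of the two terms is our plausible reading of the description above (penalize shape/appearance similarity, reward agreement between the shape codes of an image and its gamma-shifted version), not a verbatim reproduction of the authors' implementation.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two latent vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def latent_contrastive_loss(z_a, z_s, z_gs, beta_s=1.0, beta_l=1.0):
    """One plausible reading of the LCL: the LOL term drives the
    appearance and shape codes towards orthogonality, while the SCL
    term drives the shape codes of the original and gamma-shifted
    image towards agreement (minimized when both are satisfied)."""
    return beta_l * abs(cos_sim(z_a, z_s)) - beta_s * cos_sim(z_s, z_gs)

z_s  = np.array([1.0, 0.0])
z_gs = np.array([1.0, 0.0])   # identical shape code after gamma shift
z_a  = np.array([0.0, 1.0])   # appearance code orthogonal to shape
loss = latent_contrastive_loss(z_a, z_s, z_gs)
# perfectly disentangled case: |cos| = 0, cos = 1, so the loss is -1
```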
Implementation details and hyper-parameters. For U-Net, we followed the original description by Ronneberger and colleagues 40 , but adjusted the network to handle input at a 128 × 128 resolution and used a cutoff detection probability of 0.5 to binarize the predictions. The resulting network has three blocks for encoding, one block in the spatial bottleneck z ∈ R 16×16×128 and three blocks for decoding. The composition of a block is as follows: 2 × [3 × 3 convolutions with filters ∈ {32, 64, 128}, batch normalization and ReLU activation]. We trained the supervised networks using pathological data (70%:30% training:validation) for 250 epochs with a batch size of 8. We used the Adam optimizer with a learning rate of 1 × 10 −2 and an exponential decay of 0.97. For FedDis, we followed ref. 24 by using an autoencoder consisting of three layers of 5 × 5 convolutions with filters ∈ {64, 128, 128}, with the spatial bottleneck z ∈ R 16×16×128 for encoding the global shape parameters, and three layers of 5 × 5 convolutions with filters ∈ {16, 32, 32}, with the spatial bottleneck z ∈ R 16×16×32 to capture the local appearance. Each convolution layer is followed by batch normalization and leaky ReLU activation layers. We add a dropout of 0.2 to the last convolutional layer for all methods to avoid overfitting. We trained the models for 50 rounds, each with five local epochs and a batch size of 8. We used the Adam optimizer with a learning rate of 0.0001 and an exponential decay of 0.97. We set the regularization weight α to 0.2 after a grid search with α ∈ {0.2, 0.5, 0.75, 1}. The β S and β L pairs correspond to different variants of our method: FedDis, (1, 1); FedDis w/o LOL, (1, 0); FedDis w/o SCL, (0, 1); and FedDis w/o LCL, (0, 0). We used the MONAI implementation for the gamma shift, with random values in the interval [0.5, 2].
The objective loss imposes two contrasting tasks: (1) learn to eliminate masking and hyperintensities from the reconstruction; and (2) produce similar shape embeddings for two intensity-augmented sequences (SCL). We therefore consider it useful to inject the regularization loss (LCL) at a later time-point in the optimization, after the network has learned to reconstruct healthy samples. We performed a hyper-parameter search over the time-point t ∈ {0, 5, 25, 45} at which to inject the contrastive loss, and found that injecting the latent contrastive loss at round 25 of the federated training yields the best results. For the self-supervision augmentation, we first clean the images by painting over high-intensity values (>98th percentile) with the mean intensity value of the brain slice. Second, we augment the original images with up to three rectangles of random size s ∈ [6, 500], with width w ∈ [3, 20] and height h ∈ [w − w/3, w + w/3]. The position of the rectangles is random, with the coordinates of the top-left corner given by x ∈ [20, 90] and y ∈ [30, 90]. The intensity of the painted rectangles is uniform, with a random brightness value drawn from the interval between the 99th percentile of the brain slice and 1. We used this augmented image as the training input of our networks and the cleaned image as the ground truth.
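The cleaning and rectangle-painting steps above can be sketched in NumPy as follows. This is a sketch under our reading of the stated ranges: `clean_and_augment` is a hypothetical helper name, and sampling one to three rectangles directly from the width/height ranges (rather than from the size range s) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def clean_and_augment(img):
    # img: one 2D axial slice (128 x 128), intensities normalized to [0, 1].
    # 1) Clean: paint over hyperintensities (>98th percentile) with the slice mean.
    cleaned = img.copy()
    p98 = np.percentile(img, 98)
    cleaned[cleaned > p98] = img.mean()
    # 2) Augment: paint one to three uniform rectangles of random size and position.
    augmented = cleaned.copy()
    p99 = np.percentile(cleaned, 99)
    for _ in range(rng.integers(1, 4)):
        w = rng.integers(3, 21)                               # width in [3, 20]
        h = rng.integers(max(1, w - w // 3), w + w // 3 + 1)  # height in [w - w/3, w + w/3]
        x, y = rng.integers(20, 91), rng.integers(30, 91)     # top-left corner
        augmented[y:y + h, x:x + w] = rng.uniform(p99, 1.0)   # uniform bright intensity
    # The augmented image is the network input; the cleaned image is the target.
    return augmented, cleaned
```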
Datasets. An overview of the datasets used for training and evaluation is shown in Extended Data Table 1. We used two publicly available brain magnetic resonance datasets (OASIS-3 and ADNI-3) and one internal database (KRI) for training.
OASIS-3. A dataset 41 containing 1,089 patients collected over 30 years. Participants (aged 42-95) include 609 cognitively normal adults and 489 individuals at various stages of cognitive decline. We used all 633 FLAIR sequences of 450 healthy patients for training.
ADNI. The Alzheimer's Disease Neuroimaging Initiative (ADNI) 42 is a longitudinal multicenter study designed for the early detection of Alzheimer's disease. We used the healthy control group and selected all 507 FLAIR MR scans of 118 patients acquired with multiple Siemens scanners for our training client, ADNI-S. Furthermore, we selected all 228 FLAIR magnetic resonance scans of 55 healthy patients acquired with multiple Philips scanners for our training client, ADNI-P.
KRI. Our internal dataset with 163 FLAIR, T2w and T1w co-registered scans acquired with a 3 T Philips Achieva scanner.
To assess the diagnostic performance of our network, we used two publicly available MS lesion datasets (MSLUB, MSISBI); an in-house database with MS and glioblastoma (MSKRI, GBKRI), and a publicly available dataset containing brain tumours (BRATS).
MSLUB. A publicly available MS lesion dataset 43 .
MSISBI. A publicly available MS lesion dataset from the 2015 MS lesion segmentation challenge.
MSKRI and GBKRI. Our internal datasets containing 48 (MS) and 94 (glioblastoma) FLAIR, T2w and T1w scans acquired with a 3 T Philips Achieva scanner. MS and glioblastoma (whole tumour) segmentation masks were provided by expert neuroradiologists.
WMH. The public dataset 45 of the 2017 MICCAI white matter hyperintensity (WMH) segmentation challenge contains FLAIR and T1w scans of 60 patients acquired with 3 T scanners from Siemens, Philips and GE. Experts in WMH scoring provide manual annotations for WMHs of presumed vascular origin.
BRATS. The 2018 dataset [46][47][48] contains FLAIR, T2w, T1w and T1Gd (post-contrast) scans of 285 patients with low-grade (N = 122) and high-grade (N = 163) tumours. Segmentation masks were annotated manually by one to four raters and approved by experienced neuroradiologists. For our evaluations, we consider the whole tumour, comprising the Gd-enhancing tumour, the peritumoral oedema, and the necrotic and non-enhancing tumour core.
Pre- and post-processing. All scans were registered to the SRI24 atlas template space 49 to ensure that all data share the same volume size and orientation. The scans were subsequently skull-stripped with ROBEX 50 and normalized to the [0, 1] range. We used the axial mid-line slice, with a size of 128 × 128 px, for training, and evaluated our methods patient-wise on whole volumes containing slices with visible tissue information. For post-processing, we use prior knowledge and keep only positive residuals, as these lesions are known to be fully hyperintense in FLAIR images. Furthermore, we apply median filtering of size 3 to remove small outliers and obtain a more continuous signal. We use the resulting heat maps to compute the AUPRC for the test sets containing pathology. Finally, we binarize the results using τ and compute the DICE score per patient. We choose the operating point τ in an unsupervised manner 33 by selecting the lowest threshold that delivers a false positive rate lower than 1% on the healthy test set of each client.
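The unsupervised choice of τ and the residual-based segmentation can be sketched as follows (hypothetical helper names; the median filtering step is noted in a comment but omitted for brevity). On healthy data every pixel above τ is a false positive, so the lowest threshold with a false positive rate below 1% is simply the 99th percentile of the healthy residual distribution.

```python
import numpy as np

def pick_threshold(healthy_residuals, max_fpr=0.01):
    # Lowest tau whose pixel-wise false positive rate on healthy scans is
    # below max_fpr: the (1 - max_fpr) quantile of the healthy residuals.
    return float(np.quantile(healthy_residuals, 1.0 - max_fpr))

def segment_anomalies(x, x_rec, tau):
    # Keep only positive residuals: FLAIR lesions are hyperintense, so the
    # reconstruction is expected to be darker than the input at lesion pixels.
    r = np.maximum(0.0, x - x_rec)
    # (In the paper, a median filter of size 3 is additionally applied to r here.)
    return r > tau
```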

Evaluation metrics.
To measure the anomaly segmentation performance and compare different models, we report the area under the precision (TP/(TP + FP)) versus recall (TP/(TP + FN)) curve (AUPRC), where TP, FP and FN are true positives, false positives and false negatives, respectively. We also report DICE scores per patient, given by 2TP/(2TP + FP + FN). We report the relative improvement of a over b as (a − b)/b and use SSIM 51 to measure the reconstruction fidelity. Finally, we used the two-sided Kolmogorov-Smirnov test KS(F(x), G(x)) to measure statistically significant differences between models F and G, with x being patient-wise DICE values. The null hypothesis is that F(x) ≤ G(x) for all x; the alternative is that F(x) > G(x) for at least one x, with a P value lower than 0.05 suggesting stronger evidence in favour of the alternative hypothesis.
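The per-patient DICE and relative-improvement computations can be sketched as follows (hypothetical helper names; the convention of returning 1 for two empty masks is our assumption):

```python
import numpy as np

def dice_score(pred, gt):
    # DICE = 2TP / (2TP + FP + FN), computed per patient on binary masks.
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # two empty masks: perfect agreement

def relative_improvement(a, b):
    # Relative improvement of a over b: (a - b) / b.
    return (a - b) / b
```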
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
Most of the datasets used in this study are publicly available and can be downloaded after signing a Data Usage Agreement. The OASIS dataset is available at https://www.oasis-brains.org; the ADNI-S and ADNI-P datasets are available at http://adni.loni.usc.edu/data-samples/access-data/; the MSLUB dataset is available at http://lit.fe.uni-lj.si/tools.php?lang=eng; the MSISBI dataset is available at https://smart-stats-tools.org/lesion-challenge-2015; the WMH dataset is available at https://wmh.isi.uu.nl; and the BRATS 2018 dataset is available at https://www.med.upenn.edu/sbia/brats2018/data.html. For KRI, MSKRI and GBKRI, all patients were part of in-house observational cohorts, some of which were prospective (MSKRI; with patient consent), whereas the others were retrospective (without patient consent). For all patients, our local IRB approved the use of imaging data for research purposes after anonymization. As several patients were part of retrospective cohorts without explicit patient consent, these data cannot be shared, as mandated by our IRB. For the prospective cohort, data can be shared by Benedikt Wiestler (b.wiestler@tum.de) upon reasonable request and signing of data transfer agreements, pending approval by our IRB and data protection officer.

Code availability
The code is publicly available at ref. 52.