Federated Synthetic Learning from Multi-institutional and Heterogeneous Medical Data

Statistically and information-wise adequate data plays a critical role in training a robust deep learning model. However, collecting sufficient medical data to train a centralized model is still challenging due to constraints such as privacy regulations and security. In this work, we develop a novel privacy-preserving federated-discriminator GAN, named FedD-GAN, that can learn and synthesize high-quality and diverse medical images, regardless of their type, from heterogeneous datasets residing in multiple data centers whose data cannot be transferred or shared. We trained and evaluated FedD-GAN on three essential classes of medical data, each involving a different type of medical image: cardiac CTA, brain MRI, and histopathology. We show that the images synthesized by our method have better quality than those produced by a standard federated learning method and are realistic and accurate enough to train accurate segmentation models in downstream tasks. A segmentation model trained on the synthetic data alone is comparable to one trained on an all-in-one real-image dataset pooled from multiple data centers, where such pooling is possible. FedD-GAN can learn to generate a scalable and diverse synthetic database without compromising data privacy. This synthetic database could help to boost machine learning techniques in medical data analytics.


Summary of datasets and experimental settings  We collected three categories of heterogeneous datasets: cardiac computed tomography angiography (CTA), brain magnetic resonance imaging (MRI), and histopathological images. The characteristics of these datasets are summarized in Table 1, which shows large heterogeneity, e.g., varied subject numbers, voxel spacings, scanners, organs, and hospitals/centers. In the first category, we collected three public cardiac CTA datasets acquired from globally dif- […] images from the breast, kidney, liver, and prostate (4 images per organ). The testing set has 14 images from 7 organs (2 images per organ); three organs are not in the training set: bladder, colon, and stomach. In the training of FedD-GAN, we used the nuclear boundary annotations as the input to the generator. We partitioned the training data into 4 sites based on organ type.

We conducted three different learning tasks for the three categories of data. After the learning of FedD-GAN, the generator can generate images and form a synthetic database. We assessed the synthetic image quality quantitatively by computing the DistFID score (see definition in Methods) and selected the best generator model as the one with the lowest DistFID score. Then, we used the synthetic images in a downstream task, i.e., training an image segmentation model, to implicitly evaluate the image quality. We assumed that if the quality of the synthetic data is good enough, the segmentation model will perform similarly to a model learned from the real data. For the downstream segmentation tasks, we withheld 20% of the training subjects as a validation subset to select the best segmentation models.
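The checkpoint-selection step described above can be sketched in a few lines; the function and variable names below are illustrative, not the authors' implementation, and the scores are made-up placeholders.

```python
# Hypothetical sketch: pick the generator checkpoint with the lowest DistFID.
# `checkpoints` maps an epoch number to its DistFID score; all names are illustrative.

def select_best_checkpoint(checkpoints):
    """Return (epoch, score) of the checkpoint with the smallest DistFID."""
    if not checkpoints:
        raise ValueError("no checkpoints to select from")
    best_epoch = min(checkpoints, key=checkpoints.get)
    return best_epoch, checkpoints[best_epoch]

scores = {10: 212.4, 20: 187.9, 30: 159.6, 40: 163.2}  # placeholder values
best_epoch, best_score = select_best_checkpoint(scores)
print(best_epoch, best_score)  # -> 30 159.6
```

In practice the scores would come from evaluating each saved generator on the distributed feature statistics, but the selection rule itself is just an argmin over epochs.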
The evaluation metrics for the cardiac CTA and brain tumor MRI tasks were subject-level volumetric DICE, 95% Hausdorff Distance (HD95), and average Surface Distance (SD).

Learning from multi-center cardiac CTA data  Firstly, we learned FedD-GAN on multi-center cardiac CTA images. Taking a multi-structural heart label image as input, the goal of FedD-GAN is to learn the joint distribution of data samples at isolated private data centers via an adversarial learning strategy, so as to obtain a central generator that can generate realistic CTA images.
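The volumetric metrics above can be sketched with a brute-force toy implementation; this is a simplified illustration (not the authors' evaluation code), and real pipelines typically use a dedicated medical-imaging metrics library.

```python
import numpy as np

def dice(pred, gt):
    """Volumetric Dice between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0

def hd95(points_a, points_b):
    """95% Hausdorff distance between two surface point sets, brute force.
    Each input is an (n, d) array of coordinates; fine for small toy inputs."""
    d = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
    a_to_b = d.min(axis=1)  # for each point in A, distance to nearest point in B
    b_to_a = d.min(axis=0)
    return max(np.percentile(a_to_b, 95), np.percentile(b_to_a, 95))

p = np.zeros((4, 4), dtype=int); p[1:3, 1:3] = 1
g = np.zeros((4, 4), dtype=int); g[1:3, 1:4] = 1
print(round(dice(p, g), 3))  # -> 0.8
```

The 95th-percentile truncation is what makes HD95 robust to a few outlier surface points, compared with the plain (maximum) Hausdorff distance.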

In this task, we simulated three data centers: WHS, CAT08, and ASOCA. […] Table 3 summarizes the quantitative results of the downstream segmentation task, Figure 7 shows the comparison in detail, and Figure 10 visualizes some examples of segmentation results. We found that by only using images synthesized by FedD-GAN, the segmentation model re- […]

Furthermore, the synthetic database of FedD-GAN could be used as data augmentation for any private data center to boost its segmentation performance. Specifically, by comparing the fourth and the second sections of Table 3, we noticed that after introducing synthetic images into any data center to train segmentation, the DICE scores were significantly improved (p<0.05), and the distance metrics were also improved (p<0.05) for center CBICA.

Next, we evaluated the robustness of the methods in a more heterogeneous setting where the MRI modalities are misaligned across data centers. To simplify the problem, we removed one modality from each data center: the CBICA center missed all Flair images, the TCIA center missed the T1c modality, and the Other center missed the T2 images. This is a challenging scenario, as the data features are not the same across data centers, which may require complex federated transfer learning methods to resolve. We adjusted the FedSeg and FedD-GAN methods for learning from this missing-modality task and reported the segmentation results in Table 4 as Hetero-FedSeg and Hetero-FedD-GAN. In Hetero-FedSeg, we set the pixel values of missing modalities to 0 and trained the FedSeg. In Hetero-FedD-GAN, we simply adjusted the discriminators in the architecture (see Methods for details).
As shown in Table 4 and Figure 4, our method can handle this challenging problem with very small performance loss by "completing" the missing modalities with synthesized images, while the trivially adjusted FedSeg failed to achieve satisfactory segmentation from missing-modality data.
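The Hetero-FedSeg baseline's zero-filling of absent modalities can be illustrated with a small sketch; the modality names match the paper, but the function name and toy arrays are mine.

```python
import numpy as np

def zero_fill_missing(volumes, all_modalities):
    """Hetero-FedSeg-style baseline: stack modalities in a fixed order,
    filling any modality missing at this site with zeros (illustrative sketch)."""
    ref = next(iter(volumes.values()))
    filled = [volumes.get(m, np.zeros_like(ref)) for m in all_modalities]
    return np.stack(filled, axis=0)

# Toy site that (like CBICA in the experiment) lacks the Flair modality.
site = {"T1": np.ones((2, 2)), "T1c": np.full((2, 2), 2.0), "T2": np.full((2, 2), 3.0)}
x = zero_fill_missing(site, ["Flair", "T1", "T1c", "T2"])
print(x.shape, x[0].sum())  # -> (4, 2, 2) 0.0
```

Feeding such all-zero channels is exactly what degrades the baseline: the segmentation network receives a fixed uninformative input where real contrast should be, whereas Hetero-FedD-GAN replaces the gap with a synthesized modality.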

Learning from multi-organ histopathology data  At last, we evaluated FedD-GAN on multi-organ histopathology images. In this nuclei task, we simulated four data centers, each containing data of a single organ: breast, liver, kidney, and prostate. During training, the original images were cropped into smaller tiles (see Methods for details). Figure 5 shows examples of real vs. synthetic images. The proposed method generated much better synthetic images, with a DistFID score of 159.60, than FLGAN, with a DistFID score of 234.16. Table 5 compares the average performance of the downstream segmentation task by different methods, and Figure 8 plots the distribution of the results. Figure 11 shows some segmentation samples. Our FedD-GAN was on par with the Real-All model and FedSeg. Compared with the models using only local data (Real-breast, Real-liver, Real-kidney, and Real-prostate), our method achieved significantly better results in at least one of the metrics.
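The tile-cropping preprocessing can be sketched as below; tile size, array shapes, and the non-overlapping grid are illustrative assumptions, not the exact scheme in Methods.

```python
import numpy as np

def crop_tiles(image, tile):
    """Crop a 2-D image into non-overlapping tile x tile patches,
    discarding any partial border tiles (a simplified sketch)."""
    h, w = image.shape[:2]
    return [image[r:r + tile, c:c + tile]
            for r in range(0, h - tile + 1, tile)
            for c in range(0, w - tile + 1, tile)]

img = np.arange(100).reshape(10, 10)  # toy 10x10 "histopathology" image
tiles = crop_tiles(img, 5)
print(len(tiles), tiles[0].shape)  # -> 4 (5, 5)
```

Real pipelines often use overlapping strides or padding for the border region instead of discarding it; the grid above is only the simplest variant.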

The FLGAN method is also reported in this table and is statistically worse than FedD-GAN in terms of both DICE and AJI (p<0.01). The results imply that our method can generate more realistic and diverse synthetic image samples than FLGAN, which benefits the downstream ML task.

Learning from multi-institutional and heterogeneous data is a challenging but practical problem in large-scale medical image analysis. In this study, we developed a GAN-based federated-learning architecture, FedD-GAN. It can learn to synthesize medical images from heterogeneous, non-independent-and-identically-distributed (non-IID) datasets. It consists of a central generator and multiple distributed discriminators. We demonstrated that the proposed framework can learn from varied data across data centers, with different data sizes and images acquired from different scanners, subjects, and organs. FedD-GAN can be used to build a centralized and scalable synthetic database for downstream machine learning tasks without access to any private information.

The proposed learning framework is general and could be used in various clinically useful ap- […] quantitative DistFID score, we can see that the synthetic images generated by the learned generator have higher image quality than those of FLGAN 28. In addition, we validated the synthetic images in segmentation tasks by using them as training data and testing on real cases, to compare with models trained on real data. The results of the downstream segmentation tasks showed that segmentation models learned from only synthetic images can achieve performance close to models trained on the all-in-one real dataset collected by copying from multiple institutions. […]

The missing-modality experiment for the brain MRI task in Results can be considered a feder- […] observation is similar in pix2pix 51. Therefore, by default we generated the same number of synthetic images as the total number of real data samples in all participating centers. Since in many medical applications the size of the annotated dataset is often a concern, we adopted two ways to make FedD-GAN generate a scalable synthetic database with diverse images. One is applying random transformations to the input, such as scaling, shifts, flips, and rotations, to generate more varied images (see examples in Supplementary Fig. 1, Fig. 2, and Fig. 3). The other is utilizing multiple generator models saved at the training epochs with the smallest DistFID scores. We found that in the histopathology task, when the synthetic database doubled in size, the performance of the downstream task improved (Dice: from 0.789 to 0.805; AJI: from 0.528 to 0.552). This implicitly indicates the good quality and diversity of the synthetic images. However, further increasing the synthetic database of histopathology images beyond twice the size did not benefit the downstream models.
Increasing the synthetic database for the cardiac CTA and brain MRI tasks did not improve the downstream task either. A possible reason is that, as the synthetic database grows larger, repeating patterns and artifacts in the synthetic images may accumulate and cause overfitting in the training of downstream models. When the overall size of the real data is small (as in the histopathology task), scaling up the synthetic database can bring more benefit than harm until the useful information in the synthetic database 'saturates'. We also conducted a manual evaluation by asking a radiologist to distinguish between 200 randomly picked pairs of real and synthetic cardiac CTA images. Each pair was shuffled randomly. It turns out that the radiologist could effectively tell fake from real images (over 90% accuracy) by […]

In terms of privacy and security, a recent work showed that gradients can be used to recover […] network. Therefore, for a large-scale dataset, our approach may need heavier communication than FedAVG. However, the proposed method should be more favorable in terms of scalability, image quality, and downstream task performance, especially for non-IID data. This is because federated learning in the medical domain is closer to a cross-silo setting 19, with fewer clients, more powerful computing capacities, and more reliable connections than in a cross-device setting 19.
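The random input transformations used above to diversify the synthetic database can be sketched as follows; this toy version covers only flips and 90-degree rotations (the paper also mentions scaling and shifts), and all names are illustrative.

```python
import random
import numpy as np

def random_transform(mask, rng):
    """Apply a random horizontal flip and a random 90-degree rotation to a
    label mask, to diversify the generator's conditional inputs (a sketch;
    scaling and shift augmentations from the paper are omitted here)."""
    if rng.random() < 0.5:
        mask = np.fliplr(mask)
    k = rng.randrange(4)  # number of counter-clockwise 90-degree rotations
    return np.rot90(mask, k)

rng = random.Random(0)
mask = np.array([[1, 0], [0, 0]])
out = random_transform(mask, rng)
print(out.shape)  # -> (2, 2)
```

Because the generator is conditioned on the mask, transforming the input mask yields a correspondingly transformed synthetic image, so each augmented mask contributes a new image-label pair for downstream training.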

In future work, we would like to investigate more complex practical problems with more heterogeneous situations, for example, learning from more pathological conditions, learning from both labeled and unlabeled data, etc. It is also of interest to combine imaging data with other forms of electronic health records, e.g., lab results and radiology reports, in one learning framework.
The goal of $D_j$ is to maximize Eq. (1), while $G$ minimizes it. In this way, the learned $G(x)$ with maximized $D(G(x))$ can approximate the real data distribution $p_{\text{data}}(y|x)$, and $D$ cannot tell 'fake' data from real. Here $x$ follows a distribution $s(x)$. In this paper, we assume the joint distribution is the mixture $s(x) = \sum_{j} \pi_j s_j(x)$, where $s_j(x)$ is the marginal distribution of the $j$-th dataset and $\pi_j$ is the prior distribution. In the experiment, we set $s_j(x)$ to be a uniform distribution and $\pi_j \propto |S_j|$, resulting in a uniform distribution $s(x)$. For each sub-distribution, there is a corresponding discriminator $D_j$ which only receives data generated from the prior $s_j(x)$. Similar to previous works 51, 87, we incorporate noise by using Dropout 88 at several layers of the generator $G$ in both training and inference, instead of providing a Gaussian noise input to the generator.
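Sampling from the mixture with weights proportional to site sizes can be sketched as below; the site names come from the cardiac task, but the sizes and function names are illustrative placeholders.

```python
import random

def sample_site(sizes, rng):
    """Draw a data-center index j with probability pi_j proportional to |S_j|,
    i.e. sample which marginal s_j(x) of the mixture s(x) = sum_j pi_j s_j(x)
    to draw the next conditional input from (a sketch; sizes illustrative)."""
    total = sum(sizes.values())
    r = rng.random() * total
    acc = 0.0
    for j, n in sizes.items():
        acc += n
        if r < acc:
            return j
    return j  # guard against floating-point rounding at the upper edge

rng = random.Random(0)
sizes = {"WHS": 60, "CAT08": 8, "ASOCA": 40}  # placeholder site sizes
counts = {j: 0 for j in sizes}
for _ in range(10000):
    counts[sample_site(sizes, rng)] += 1
print(max(counts, key=counts.get))  # largest site is drawn most often
```

With $\pi_j \propto |S_j|$ and each $s_j(x)$ uniform over its own site, every individual training sample across all sites is equally likely, which is exactly the "resulting uniform $s(x)$" described above.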

The losses of $D_j$ and $G$ are defined in Eqs. (2) and (3).

where $m$ is the minibatch size. The loss $L_G$ contains a perceptual loss ($L_P$) 89 and an $L_1$ loss besides the adversarial loss. In this study, $G$ and the $D_j$ are not on the same server, and thus equation (3) needs to be split into two parts (Eq. (4) and Eq. (5)) in order to back-propagate the loss to $G$.
where $\nabla_{\hat{y}_j} = $ […]

[…] When a single discriminator receives all modalities together, it can observe more patterns, and more information can be used to differentiate the real and the 'fake' data. However, the task of $G$ then becomes more challenging to learn. On one hand, $G$ needs to learn a more complex data distribution to generate multiple modalities with different contrasts; on the other hand, the easily-learned $D$ may learn trivial discriminative features and thus cannot provide helpful feedback to guide the learning of $G$.
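The two-part back-propagation from Eq. (4) and Eq. (5) can be illustrated with a toy scalar "generator": the discriminator's server computes only the gradient of its loss with respect to the synthetic output and sends that gradient back, and the generator's server continues the chain rule locally. Everything here (the linear generator, the squared loss, the values) is an illustrative stand-in, not the paper's networks.

```python
# Toy sketch of split back-propagation across two servers.

def generator_forward(w, x):
    return w * x  # y_hat = G(x), a toy linear "generator"

def discriminator_grad(y_hat, target):
    # On the data center: gradient of a toy squared loss L = (y_hat - target)^2
    # with respect to the received synthetic sample y_hat.
    return 2.0 * (y_hat - target)

def generator_backward(grad_y, x):
    # On the generator server: chain rule dL/dw = (dL/dy_hat) * (dy_hat/dw).
    return grad_y * x

w, x, target, lr = 0.5, 2.0, 3.0, 0.1
for _ in range(50):
    y_hat = generator_forward(w, x)        # synthetic sample sent to the center
    g = discriminator_grad(y_hat, target)  # only this gradient travels back
    w -= lr * generator_backward(g, x)     # local generator update
print(round(w * x, 3))  # -> 3.0 (the toy generator's output reaches the target)
```

The key point is the communication pattern: raw data never leaves the center, and the generator server never sees the discriminator's internals, only $\nabla_{\hat{y}}$ of its loss.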

To balance the task difficulty of $G$ and the $D$'s, we extend our framework by deploying multiple discriminators at each entity. Every single modality has its own discriminator in each data center, and $G$ receives losses from the multiple $D$s for a multi-modality data sample. In this way, each $D$ can focus on learning discriminative features for one specific modality and provide more meaningful feedback:
$$\sum_{k=1}^{c}\Big[\mathbb{E}_{y_k\sim p_{\text{data}}(y_k|x)}\log D_{j,k}(y_k|x)+\mathbb{E}\log\big(1-D_{j,k}(\hat{y}_k|x)\big)\Big],$$
where $D_{j,k}$ represents the discriminator for the $k$-th modality at center $j$.
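How the generator aggregates per-modality adversarial feedback can be sketched as a simple sum over discriminators; the modality names match the brain MRI task, but the loss values and the restriction argument are illustrative placeholders.

```python
def generator_adv_loss(per_modality_losses, available):
    """Sum the adversarial feedback from the per-modality discriminators
    D_{j,k}, restricted to the modalities available at center j (a sketch;
    values and names are illustrative)."""
    return sum(per_modality_losses[k] for k in available)

# Placeholder per-modality adversarial losses reported by the D_{j,k}.
losses = {"Flair": 0.75, "T1": 0.5, "T1c": 1.0, "T2": 0.25}
full = generator_adv_loss(losses, ["Flair", "T1", "T1c", "T2"])
missing_flair = generator_adv_loss(losses, ["T1", "T1c", "T2"])
print(full, missing_flair)  # -> 2.5 1.75
```

Passing a reduced `available` list is the same mechanism that later handles missing-modality centers: the generator simply collects feedback from whichever per-modality discriminators exist at that site.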
Besides, another advantage of the proposed multi-modality framework is that it enables learning from missing-modality data. Let $C_j$ denote the set of indices of the modalities available at center $j$; if data center $j$ misses the $s$-th modality, for example, then $C_j = \{1, \ldots, s-1, s+1, \ldots, c\}$. In this case, center $j$ only needs to deploy $c-1$ discriminators during the learning. The learning process is unchanged, except that it only collects the losses of the available discriminators in $C_j$ to update $G$ and only uses the subset of synthetic images $\{\hat{y}^j_k \mid k \in C_j\}$ to update the corresponding $\{D_{j,k} \mid k \in C_j\}$ in center $j$. Because the discriminators for different modalities in different entities are all independent, $G$ can still learn to generate all modalities, assuming that a modality missing in one center is available in some other data center. The loss function of $D$ is the same, while the loss function of $G$ can be adjusted accordingly. […]

After training, the learned $G$ can act as a synthetic image provider to generate multi-modality images from the conditional variable, a mask image. As a result, it can also be used for missing-modality completion. […]

The FID is computed as
$$\mathrm{FID} = \lVert \mu_1 - \mu_2 \rVert_2^2 + \mathrm{Tr}\big(\sigma_1 + \sigma_2 - 2(\sigma_1\sigma_2)^{1/2}\big),$$
where $\mu_1$ and $\mu_2$ refer to the feature-wise means of the real and generated images, $\sigma_1$ and $\sigma_2$ are the covariance matrices of the real and generated feature vectors, and $\mathrm{Tr}$ refers to the trace operation in linear algebra.
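A minimal numeric sketch of the FID formula, and of the size-weighted per-site averaging described for DistFID, is given below. Two simplifications are assumed and are not part of the paper: the covariances are taken to be diagonal (so the matrix square root reduces to elementwise square roots), and all statistics are made-up toy numbers.

```python
import math

def fid_diag(mu1, var1, mu2, var2):
    """FID between two Gaussians with *diagonal* covariances (a simplified
    sketch; the full metric needs a matrix square root of sigma1*sigma2)."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term

def dist_fid(site_stats, syn_stats):
    """Weighted average of per-site FIDs, with weights proportional to |S_i|."""
    total = sum(n for n, *_ in site_stats)
    return sum(n * fid_diag(mu, var, *syn_stats) for n, mu, var in site_stats) / total

syn = ([0.0, 0.0], [1.0, 1.0])                 # (mu_2, diag sigma_2), toy values
sites = [(100, [1.0, 0.0], [1.0, 1.0]),        # (|S_i|, mu_1^i, diag sigma_1^i)
         (300, [0.0, 0.0], [4.0, 1.0])]
print(dist_fid(sites, syn))  # -> 1.0
```

In practice only the per-site statistics $(\mu_1^i, \sigma_1^i)$ need to leave each center, which is what makes the metric computable without pooling the raw images.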

Though FID is an ideal metric for finding the best model when training a GAN 19, 53, we are unable to compute a single FID score in federated learning because a joint set of the isolated real data does not exist. Therefore, we propose a new metric named distributed FID (DistFID), which calculates the weighted average distance between each real dataset and the synthetic database:
$$\mathrm{DistFID} = \sum_{i=1}^{N}\frac{|S_i|}{\sum_{j=1}^{N}|S_j|}\,\mathrm{FID}\big((\mu_1^i,\sigma_1^i),(\mu_2,\sigma_2)\big),$$
in which each of the $N$ entities hosts a dataset $S_i$ of size $|S_i|$ with feature statistics $(\mu_1^i, \sigma_1^i)$. […]

The 95% Hausdorff Distance is
$$\mathrm{HD95}(A,B)=\max\Big\{\sup_{95,\,a\in A}\inf_{b\in B} d(a,b),\ \sup_{95,\,b\in B}\inf_{a\in A} d(a,b)\Big\},$$
where $\sup_{95}$ is the 95% maximum value. In addition, we report the average Surface Distance (SD) as follows: […] For nuclei segmentation, we utilize the object-level Dice 46 and the Aggregated Jaccard Index (AJI) 45: […] where $S(G_i)$ is the segmented object that has maximum overlap with $G_i$ with regard to the Jaccard index, and $K$ is the set of segmented objects that have not been assigned to any ground-truth object.

Figure 1: Overview of the FedD-GAN architecture and workflow. The architecture contains one central generator and multiple distributed discriminators, each located in a medical entity. The generator takes a conditional input (segmentation masks in our experiments) and outputs synthetic data. Each discriminator learns to differentiate between its own real images and synthetic images received from the generator, and then sends back the gradients. The generator is updated by adversarial learning. Finally, the well-trained generator can be used as an image provider to build a synthetic database for downstream machine learning tasks, e.g., segmentation in this study.

Figure 7: Detailed box plots for whole-tumor segmentation in multi-modal brain MRI. (a) Comparison of different methods using real data, synthetic data, or combined data. (b) Comparison between FedSeg using real data and the centralized model using the synthetic database of FedD-GAN.
By using all 4 modalities, our FedD-GAN had comparable results to Real-FedSeg. Using missing-modality data, Hetero-FedD-GAN achieved much more robust results than Hetero-FedSeg.

Figure 8: Detailed box plots for nuclear segmentation in terms of accuracy, DICE, and AJI metrics. The testing set contains the same four organs as the training set plus three unseen organs. Learning from the synthetic images of FedD-GAN can achieve segmentation performance comparable to learning directly from all-in-one data collected from all centers.