2.1 Data description
We used the data provided for the 2020 Computing in Cardiology / PhysioNet challenge [7], which consisted of six publicly available 12-lead ECG datasets (CPSC, CPSC-Extra, INCART, PTB, PTB-XL, Georgia). Datasets collected by the same institutional source (CPSC and CPSC-Extra; PTB and PTB-XL) were merged to simulate a realistic distributed learning setting, resulting in four nodes: 1) CPSC, 2) Georgia, 3) INCART and 4) PTB (Table 1):
Table 1. Data description of the four ECG datasets used for our analyses
Node ID | Name | Number of samples | Duration [s] | Sampling rate [Hz] | Mean age [y] | Female [%] | Number of classes | Reference
1 | CPSC | 10,330 | 6-144 | 500 | 61.36 | 46.35 | 73 | [43]
2 | Georgia | 10,344 | 5-10 | 500 | 60.52 | 46.34 | 67 | [7]
3 | INCART | 74 | 1,800 | 257 | 55.99 | 45.95 | 37 | [44]
4 | PTB | 22,353 | 10-120 | 500, 1000 | 59.76 | 47.41 | 60 | [44], [45]
The ECG recordings in these databases were heterogeneous in terms of signal length, sampling rate, demographic properties and the number of classes. In total, 111 different classes, represented by SNOMED codes, were present. Each ECG could be labelled with one or multiple classes. For this publication, medically related classes were merged into parent categories, resulting in the 13 classes described in Table 2.
Table 2. Considered ECG classes and frequency of occurrence in each data source
Class | CPSC | Georgia | INCART | PTB | Total
Sinus rhythm | 922 | 1,752 | 0 | 18,172 | 20,846
ST interval abnormal | 2,985 | 3,053 | 10 | 2,188 | 8,236
Myocardial infarction | 1,544 | 7 | 9 | 5,629 | 7,189
T wave abnormal | 27 | 3,118 | 1 | 2,639 | 5,785
Myocardial ischemia | 545 | 1,635 | 0 | 2,580 | 4,760
Right bundle branch block | 2,057 | 977 | 2 | 1,660 | 4,696
Left ventricular hypertrophy | 158 | 1,232 | 10 | 2,359 | 3,759
Atrial fibrillation | 1,374 | 570 | 2 | 1,529 | 3,475
Bradycardia | 316 | 1,683 | 11 | 637 | 2,647
Ventricular ectopics | 896 | 398 | 49 | 1,154 | 2,497
Tachycardia | 303 | 1,261 | 11 | 827 | 2,402
1st degree AV block | 828 | 769 | 0 | 797 | 2,394
Atrial ectopics | 742 | 640 | 7 | 555 | 1,944
2.2 Pre-processing
All recordings were resampled to 250 Hz. Subsequently, a 10-second sequence was extracted from each signal to generate uniform data samples for the machine learning model. If a recording was longer than required, the first 5 seconds were skipped before extracting the sequence. The ECG data was then filtered with a 2nd-order Butterworth bandpass filter (3-30 Hz).
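A minimal Python sketch of this pre-processing pipeline is given below. The use of SciPy's `resample_poly`, `butter` and `sosfiltfilt` routines is our assumption (the section does not name an implementation), as is the zero-padding of recordings shorter than 10 seconds, which the text leaves unspecified:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

TARGET_FS = 250   # target sampling rate [Hz]
WINDOW_S = 10     # extracted sequence length [s]
SKIP_S = 5        # seconds skipped at the start if excess data is available

def preprocess(ecg, fs):
    """ecg: array of shape (n_leads, n_samples) recorded at sampling rate fs."""
    # Resample each lead to 250 Hz (polyphase resampling).
    ecg = resample_poly(ecg, TARGET_FS, int(fs), axis=-1)

    # Skip the first 5 s when the recording is long enough, then cut a 10 s window.
    win = WINDOW_S * TARGET_FS
    start = SKIP_S * TARGET_FS if ecg.shape[-1] >= win + SKIP_S * TARGET_FS else 0
    segment = ecg[..., start:start + win]
    if segment.shape[-1] < win:  # assumption: zero-pad recordings shorter than 10 s
        pad = win - segment.shape[-1]
        segment = np.pad(segment, [(0, 0)] * (segment.ndim - 1) + [(0, pad)])

    # 2nd-order Butterworth bandpass, 3-30 Hz.
    sos = butter(2, [3, 30], btype="bandpass", fs=TARGET_FS, output="sos")
    return sosfiltfilt(sos, segment, axis=-1)
```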
2.3 Model architecture description
For this multi-class, multi-label classification task, a deep convolutional neural network with five one-dimensional convolutional blocks and a global average pooling layer preceding the classification layer was applied [46]. Figure 1 summarizes the model architecture.
The model was trained with the binary cross-entropy loss function and the Adam optimizer [47]. The number of training epochs and the learning rate decay, as suggested by Kingma and Ba [47], are described in the individual methods' descriptions. All implementations were executed in Python 3.7.4 and modelling was realized with TensorFlow 2.4 [48].
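The following Keras sketch illustrates this architecture. The filter counts, kernel size and the BatchNorm/ReLU/MaxPool layout inside each block are illustrative assumptions; the exact configuration is defined by [46] and Figure 1:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(n_samples=2500, n_leads=12, n_classes=13):
    """Five 1-D conv blocks, global average pooling, sigmoid multi-label head.

    2,500 samples correspond to 10 s at 250 Hz after pre-processing.
    """
    inputs = tf.keras.Input(shape=(n_samples, n_leads))
    x = inputs
    for filters in (64, 128, 128, 256, 256):   # five convolutional blocks (assumed sizes)
        x = layers.Conv1D(filters, kernel_size=7, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.GlobalAveragePooling1D()(x)     # pooling before the classification layer
    outputs = layers.Dense(n_classes, activation="sigmoid")(x)  # multi-label output

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="binary_crossentropy")
    return model
```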
2.4 Learning schemes
Eleven different learning schemes were applied (1 centralized baseline, 4 node-individual, 6 decentral), which are summarized in Table 3. The following sections describe each method in detail.
Table 3. Learning method descriptions: 1 baseline model with centralized data (B), 4 individual models, one per node (I1-I4), and 6 decentral learning schemes (M1a-M3b)
Notation | Name | Description | Reference
B | Baseline Model | 1 model trained with centralized data of all nodes | standard
I1 | Individual Model: CPSC | 1 individual model trained with CPSC data only | standard
I2 | Individual Model: Georgia | 1 individual model trained with Georgia data only | standard
I3 | Individual Model: INCART | 1 individual model trained with INCART data only | standard
I4 | Individual Model: PTB | 1 individual model trained with PTB data only | standard
M1a | Regression Ensemble | Classification obtained by averaging the results of I1-I4 | [49]
M1b | Weighted Regression Ensemble | Classification obtained by averaging the results of I1-I4 with weights | [49]
M2a | Node-wise Sequential Learning | 1 model trained on the full data of the nodes in one sequence | [42]
M2b | Batch-wise Sequential Learning | 1 model trained on mini-batches of data in multiple sequences | new
M3a | Federated Learning | 1 model trained with standard federated learning (all nodes contribute equally) | [39]
M3b | Weighted Federated Learning | 1 model trained with weighted federated learning (weighted according to performance) | new
B: Baseline Centralised Model
The baseline model B was trained in the conventional centralized manner: a single model was trained on the pooled training data of all four nodes.
I1-I4: Individual Models
One individual model was trained for each of the four data nodes, resulting in four additional models (I1: CPSC, I2: Georgia, I3: INCART and I4: PTB), which were trained in the same way as the baseline model, but only with training data from the respective node.
M1a: Regression Ensemble
The first method to aggregate knowledge from federated data sources was to calculate the average of all classification results from the individual models I1-I4. All models trained on individual nodes were queried to classify the common test set. Subsequently, the result was determined by calculating the mean of each class-specific regression result. Finally, a threshold of 0.5 (= 50% probability) was applied to derive the classification result of M1a from the regression values.
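A short sketch of this ensembling step, assuming the four fitted Keras models from I1-I4, could look as follows:

```python
import numpy as np

def ensemble_predict(models, x_test, threshold=0.5):
    """M1a: average the class-wise sigmoid outputs of I1-I4, then threshold."""
    probs = np.mean([m.predict(x_test) for m in models], axis=0)
    return (probs >= threshold).astype(int)  # multi-label decision at 50% probability
```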
M1b: Weighted Regression Ensemble
In M1a, all four individual models contributed equally to the final classification. In M1b, by contrast, the individual regression results were weighted according to two factors (see Equation 2): a) the proportion of a node's training set size in relation to the total dataset size (sample size divided by the sum of the sample sizes of all nodes) and b) the node-internal AUROC performance.
AUROC scores were rescaled to the range [0, 1] and the final weights were normalized so that the sum of all four weights equalled 1.
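The sketch below illustrates one possible weight computation. The min-max rescaling of the AUROC scores and the multiplicative combination of the two factors are assumptions; the exact combination is given by Equation 2:

```python
import numpy as np

def ensemble_weights(n_samples, aurocs):
    """M1b weights from dataset size share and node-internal AUROC.

    n_samples, aurocs: one value per node (I1-I4). The product of the two
    factors is an illustrative assumption standing in for Equation 2.
    """
    n_samples = np.asarray(n_samples, dtype=float)
    aurocs = np.asarray(aurocs, dtype=float)

    size_share = n_samples / n_samples.sum()                    # factor (a)
    auroc_scaled = (aurocs - aurocs.min()) / (aurocs.max() - aurocs.min())  # factor (b), rescaled to [0, 1]

    w = size_share * auroc_scaled
    return w / w.sum()                                          # weights sum to 1
```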
M2a: Node-wise Sequential Learning
A combined model was trained by progressively exposing the initially untrained model to the data of one node after the other, so that knowledge was gathered sequentially. This method is comparable to Institutional Incremental Learning as proposed by Sheller et al. [42]. For method M2a, a single model was sent to all nodes in the following order: 1) CPSC, 2) Georgia, 3) INCART and 4) PTB, as depicted in Figure 2.
First, a model was initialised and sent to the first node, where it was trained with the data of this specific node. After training, the model was sent to the next node in order, where its already partly optimized weights served as the initial condition for continuing the training with the next pool of data. At each node, the model was trained for 50 epochs and the learning rate was decayed after each epoch as described in Equation 1. After training at a node, the learning rate was reset to its initial value of 0.001 and the model was sent to the next node in order. This was repeated until the model had been trained at all nodes once.
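A minimal sketch of this loop is shown below. The exponential decay factor stands in for Equation 1 (which is not reproduced in this section) and is an assumption:

```python
import tensorflow as tf

def nodewise_sequential(model, node_datasets, epochs=50, lr0=1e-3, decay=0.95):
    """M2a sketch: train one model at each node in turn.

    node_datasets: ordered list of (x, y) pairs (CPSC, Georgia, INCART, PTB).
    """
    for x, y in node_datasets:
        # Reset the learning rate to its initial value before training at a node.
        tf.keras.backend.set_value(model.optimizer.learning_rate, lr0)
        for epoch in range(epochs):
            model.fit(x, y, epochs=1, verbose=0)
            # Decay the learning rate after each epoch (assumed form of Equation 1).
            lr = lr0 * decay ** (epoch + 1)
            tf.keras.backend.set_value(model.optimizer.learning_rate, lr)
    return model
```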
M2b: Batch-wise Sequential Learning
To take the idea of sequential learning further, we applied a novel method called Batch-wise Sequential Learning (M2b). Instead of fully completing training at a node as in M2a, the model was trained on only a single, randomly selected mini-batch of one node's training data before being sent to the next node. The batch size of these mini-batches was set to 2% of a node's training set size. As a result, each sample contributed equally to the model in M2b (which is equivalent to larger nodes, consisting of more samples, contributing more, as achieved with the weighted approaches). One epoch was considered complete when the model had been exposed to each training sample exactly once. The model was trained for 50 epochs in total and the learning rate was decayed after each epoch according to Equation 1. Method M2b is illustrated in Figure 3.
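The following sketch illustrates the round-robin batch schedule; the learning rate decay (Equation 1) is omitted for brevity, and the exact order in which nodes are visited within a cycle is an assumption:

```python
import numpy as np

def batchwise_sequential(model, node_datasets, epochs=50, batch_frac=0.02):
    """M2b sketch: cycle through the nodes one mini-batch at a time.

    Each node contributes batches of 2% of its own training set, so every
    sample is seen exactly once per epoch regardless of node size.
    """
    for _ in range(epochs):
        # Shuffle each node's data and split it into ~50 mini-batches.
        queues = []
        for x, y in node_datasets:
            idx = np.random.permutation(len(x))
            size = max(1, int(batch_frac * len(x)))
            queues.append([idx[i:i + size] for i in range(0, len(idx), size)])
        # Visit the nodes round-robin until every batch has been used once.
        while any(queues):
            for (x, y), q in zip(node_datasets, queues):
                if q:
                    batch = q.pop(0)
                    model.train_on_batch(x[batch], y[batch])
    return model
```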
M3a: Federated Learning
In M3a, a model was trained in update cycles as depicted in Figure 4. Each of these cycles repeated the steps described in section 1: 1) distribute the central model, 2) train locally at the nodes, 3) average the weights of the trained models and 4) update the central model with the new parameters. The newly updated model from step 4 was then re-distributed as the central model in step 1 of the next update cycle. This method follows the original proposal for federated learning [39].
In total, 50 update cycles were completed, with the number of epochs per cycle set to 1. The learning rate was decayed according to Equation 1, with the current update-cycle iteration taking the place of the epoch index.
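The sketch below shows one possible implementation of these update cycles with equal node weights (FedAvg-style); the per-cycle learning rate decay is again omitted, and the model-cloning helper is our construction, not released code:

```python
import numpy as np
import tensorflow as tf

def clone_compiled(model):
    """Clone a compiled Keras model together with its current weights."""
    clone = tf.keras.models.clone_model(model)
    clone.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="binary_crossentropy")
    clone.set_weights(model.get_weights())
    return clone

def federated_learning(global_model, node_datasets, cycles=50):
    """M3a sketch: federated update cycles, all nodes contributing equally."""
    for _ in range(cycles):
        local_weights = []
        for x, y in node_datasets:
            local = clone_compiled(global_model)   # 1) distribute the central model
            local.fit(x, y, epochs=1, verbose=0)   # 2) train locally for 1 epoch
            local_weights.append(local.get_weights())
        # 3) average the trained weights layer by layer ...
        avg = [np.mean(layer_set, axis=0) for layer_set in zip(*local_weights)]
        global_model.set_weights(avg)              # 4) ... and update the central model
    return global_model
```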
M3b: Weighted Federated Learning
As an extension of Federated Learning (M3a), Weighted Federated Learning (M3b) was implemented. Here, a weighted average was used to calculate the new parameters in step 3, with weights based on node-internal performance and dataset size as described in Equation 2 for model M1b.
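In code, only the aggregation step of M3a changes; a sketch, assuming the Equation-2 weights computed as in the M1b example above, could be:

```python
import numpy as np

def weighted_average(local_weights, node_weights):
    """M3b sketch: weighted layer-wise average replacing the plain mean in
    step 3 of M3a. node_weights are the Equation-2 weights (see M1b)."""
    node_weights = np.asarray(node_weights, dtype=float)
    node_weights = node_weights / node_weights.sum()
    return [sum(w * lw[i] for w, lw in zip(node_weights, local_weights))
            for i in range(len(local_weights[0]))]
```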
2.5 Cross validation and evaluation metrics
We trained models with a central dataset, with local datasets and in decentral schemes; in each case, only the data available for training in the respective scheme were provided to the models during training. To assess how well all these models perform, each model was applied to a “global” dataset containing data from all nodes. To this end, a 10-fold cross-validation scheme was applied.
During training in fold N, 90 percent of each dataset were used in the respective learning scheme. Depending on the learning scheme, training was based on data from a single node or from all nodes, as described in the Learning schemes section.
While the training in fold N used different datasets depending on the learning scheme, all resulting models of fold N were evaluated on one and the same test set N. To this end, the respective 10-percent shares of data from each dataset were aggregated to form one common test dataset per fold. All models and decentral schemes described above were tested on this test set N within fold N.
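A sketch of this fold construction is given below; the use of scikit-learn's `KFold`, the shuffling and the fixed seed are assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold

def make_folds(node_datasets, n_splits=10, seed=42):
    """Each node is split 90/10 per fold; the 10% shares of all nodes are
    pooled into one common test set per fold."""
    splitters = [KFold(n_splits, shuffle=True, random_state=seed).split(x)
                 for x, _ in node_datasets]
    for fold in zip(*splitters):  # one (train_idx, test_idx) pair per node
        train_per_node = [(x[tr], y[tr])
                          for (x, y), (tr, _) in zip(node_datasets, fold)]
        x_test = np.concatenate([x[te] for (x, _), (_, te) in zip(node_datasets, fold)])
        y_test = np.concatenate([y[te] for (_, y), (_, te) in zip(node_datasets, fold)])
        yield train_per_node, (x_test, y_test)
```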
Predicted classes were compared to the known reference classes for each ECG, and each model was evaluated with four standard metrics for a complete assessment of classification performance: accuracy, area under the receiver operating characteristic curve (AUROC), Jaccard score and F1 score. To correctly address the multi-label classification problem, the metrics (except accuracy) were derived as a weighted average according to the frequency of class occurrence in the test set [50].
To combine the results of these four evaluation metrics in a representative way, we ranked the models by each metric and calculated the mean rank across all metrics for each model, i.e., the best model had the lowest mean rank.
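For illustration, the mean-rank aggregation can be computed as follows (the tie-handling of `rankdata` is an implementation detail not specified above):

```python
import numpy as np
from scipy.stats import rankdata

def mean_ranks(scores):
    """scores: array of shape (n_models, n_metrics), higher = better.
    Rank the models per metric (rank 1 = best) and average across metrics."""
    ranks = np.column_stack([rankdata(-scores[:, m]) for m in range(scores.shape[1])])
    return ranks.mean(axis=1)  # lowest mean rank = best model
```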