In this study, we investigated the use of ECG signals to develop a predictive model for new-onset AF. This is a critical medical task, given the high prevalence of AF, particularly in the elderly population, and the importance of an early diagnosis of AF for the prompt prescription of effective treatments to prevent stroke and systemic thromboembolism.
Two approaches were considered: first, an ML model based on a set of ECG features extracted from the ECG and accessible to clinicians; second, the end-to-end analysis of the digital ECG traces using deep learning techniques. In addition, a logistic regression model based on the ECG features was estimated to provide a benchmark for the comparison of results.
As for the analysis of ECG features, for large sample sizes the XGB algorithm produced a model that outperformed the benchmark in terms of discrimination ability. In particular, the XGB and LR models appeared almost equivalent when the number of observations in the analysis was lower than 10⁴, but for larger sample sizes XGB demonstrated a clear increase in discrimination, which however remained constant as the dataset was further enlarged. In contrast, the CNN model showed a discriminative performance highly dependent on the sample size: to reach a satisfactory result, the DL model required at least 10⁴ observations, but every further increase in size yielded a corresponding improvement in discrimination. In terms of calibration, no major differences were detected across models when the original fraction of cases was used. In general, we observed better calibrated predictions for increasing sample sizes. Our results suggest that the choice of approach in the analysis of ECG should take into account the amount of data available for training, preferring more standard models for small datasets, and they confirm the well-known ability of DL methods to leverage massive datasets.
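The discrimination and calibration comparison described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the study's pipeline: it uses scikit-learn's GradientBoostingClassifier as a stand-in for XGB, and the AUC (discrimination) and Brier score (overall calibration/accuracy) as the two evaluation axes.

```python
# Sketch: comparing discrimination (AUC) and calibration-related accuracy
# (Brier score) of a boosted-tree model vs. logistic regression.
# Synthetic data; GradientBoostingClassifier is a stand-in for XGBoost.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss

# imbalanced toy cohort: ~10% "cases"
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, model in [("LR", LogisticRegression(max_iter=1000)),
                    ("GBT", GradientBoostingClassifier(random_state=0))]:
    p = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name}: AUC={roc_auc_score(y_te, p):.3f} "
          f"Brier={brier_score_loss(y_te, p):.3f}")
```

On small samples such as this, the two models typically score similarly, mirroring the near-equivalence of LR and XGB reported above for datasets below 10⁴ observations.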
The second part of our analysis focused on the effect of undersampling on the models' calibration. This aspect of the study was stimulated by a recently published work by van den Goorbergh et al. [13], in which the authors examined the effect of imbalance correction on the performance of standard and penalized (ridge) LR models in terms of discrimination, calibration, and classification. When developing prediction models for a binary outcome with high class imbalance, undersampling is a standard technique for mitigating the difference in class frequencies in the training phase, with the aim of improving the model's performance. We analyzed the results of models trained with different balancing ratios and failed to detect any improvement in discrimination; for the CNN, results even worsened. Moreover, imbalance correction produced miscalibrated predictions. Our results are in line with the findings of van den Goorbergh et al. and extend their note of caution about class imbalance correction to the case of XGB and CNN models. We observe that in our study the CNN proved more robust than XGB and LR to the calibration deterioration caused by imbalance correction, a counter-intuitive finding with respect to what was observed by Gou et al. [26].
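The miscalibration mechanism is easy to reproduce. In the hedged sketch below (synthetic data, plain logistic regression rather than the study's models), random undersampling (RUS) of the majority class to a 1:1 ratio makes the mean predicted probability drift far above the true event prevalence, i.e. calibration-in-the-large is lost:

```python
# Sketch: random undersampling (RUS) of the majority class and its effect
# on calibration-in-the-large. A model trained on a 1:1 balanced sample
# systematically overestimates risk on the original, imbalanced population.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=20000, weights=[0.95, 0.05],
                           random_state=0)  # ~5% prevalence

# RUS: keep all minority cases, subsample the majority to the same count
pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
keep = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])

full = LogisticRegression(max_iter=1000).fit(X, y)          # no correction
rus = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])  # 1:1 RUS

print("prevalence:   ", y.mean())
print("mean p (full):", full.predict_proba(X)[:, 1].mean())  # close to prevalence
print("mean p (RUS): ", rus.predict_proba(X)[:, 1].mean())   # strongly inflated
```

The inflated average risk under RUS is exactly the miscalibration pattern reported by van den Goorbergh et al. and observed for our models.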
Concerning the relative performance of our CNN approach with respect to the recent literature on new-onset AF, Attia et al. [11] considered a set of 649,931 12-lead ECGs of patients ≥ 18 years and applied a CNN to identify the electrocardiographic signature of future AF developing within one month of the ECG examination (8.4% of the cohort). They obtained a very accurate model (AUC 0.90 [0.90–0.91]), but both the sample size and the prediction time frame are clearly very different from ours. Another relevant study was carried out by Ragunath et al. [10], in which the authors analyzed 1.6 million 12-lead ECGs from patients aged 18 years or older in order to identify individuals at risk of developing AF within 1 year. Training a CNN using only ECG traces as input, they were able to predict the new onset of AF with an AUC of 0.83 (95% CI, 0.83–0.84). Although the sample size and observational period also differ from ours in this case, the performance is comparable with our findings (Table 2). No measures of calibration were reported in those works.
Our study has some limitations. First, we could not validate our findings in an external validation cohort, which represents one of the most critical steps in the development of machine learning models in medicine, a context where internal validation is not considered sufficiently conservative [27]. Second, for AF subjects we only considered ECG exams performed no more than 5 years before the date of AF diagnosis. We set this constraint because, based on clinical knowledge, AF individuals are unlikely to show predictive signs of the condition earlier than 5 years before onset. This methodological choice is also in line with previous clinical scores and predictive models, which are usually evaluated at a time horizon of 5 years of follow-up [3]. Third, in order to simplify the prediction task, we did not take into account the time-to-event in disease onset. A very recent study carried out by Khurshid et al. [28] highlighted the potential of CNNs for the prediction of time-to-incident AF and obtained very accurate predictions (5-year AUC 0.823 [95% CI, 0.790–0.856]). One of the advantages of time-to-event data is the possibility to evaluate the accuracy of the model at any time frame from baseline.
Another possible limitation was the choice of the method to correct the class imbalance, as RUS is a very naïve approach. The main obstacle here was the need to deal with entire signals. For example, a commonly used method that has shown good results in various applications is the synthetic minority oversampling technique (SMOTE) [28]. SMOTE is an oversampling approach that creates new, synthetic samples by interpolating between the original minority class samples. This method and its variations were developed for tabular data, and their extension to the case of signals is not straightforward. Some methods to generate synthetic ECG signals were recently proposed [29]–[31], but their use was beyond the scope of this work. Finally, we selected only a few features automatically detected by the Mortara® instruments; it is possible that extending the number of extracted features [4], [32] could improve the performance of XGB, but we limited our search to the features that are usually automatically extracted from the ECG.
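The SMOTE interpolation step mentioned above can be sketched directly. This is a simplified toy implementation (not the reference one from imbalanced-learn): each synthetic point is drawn on the segment between a minority sample and one of its k nearest minority neighbours. The construction is well defined for feature vectors, which is why its extension to whole ECG waveforms is not straightforward.

```python
# Toy sketch of the SMOTE idea for tabular minority-class data:
# synthetic = x + lambda * (neighbour - x), with lambda ~ U(0, 1).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, n_new=100, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    base = rng.integers(len(X_min), size=n_new)          # random minority points
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]  # one of their k neighbours
    lam = rng.random((n_new, 1))                          # interpolation weights
    return X_min[base] + lam * (X_min[neigh] - X_min[base])

X_min = np.random.default_rng(1).normal(size=(50, 4))  # toy minority class
synthetic = smote_sample(X_min)
print(synthetic.shape)  # (100, 4)
```

For raw ECG traces, a convex combination of two signals has no guarantee of being a physiologically plausible waveform, which motivates the dedicated generative approaches cited above.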
Future developments of the present study will include the integration of standard tabular information (sex, age, clinical information) as predictors in addition to ECG traces. According to the findings of recent studies [32], [33], new tools are emerging to combine deep representations of data obtained from convolutional neural networks (as a substitute for human feature engineering) with tabular information from electronic health records. In our opinion, such methodologies for integrating heterogeneous data sources have great potential, in particular if extended to time-to-event data analysis, since deep learning models represent the most promising and feasible approach for operating in ultrahigh-dimensional settings such as ECG waveforms.
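One simple form of such an integration is late fusion: the ECG trace is summarized by a learned embedding (e.g. the penultimate-layer output of a trained CNN) and concatenated with the tabular predictors before a classifier head. The sketch below is purely illustrative: the embedding, the tabular columns, and the outcome labels are all placeholders, and a logistic head stands in for whatever model would be used in practice.

```python
# Illustrative late-fusion sketch: concatenate a (placeholder) CNN embedding
# of the ECG trace with tabular predictors and fit a simple classifier head.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
ecg_embedding = rng.normal(size=(n, 64))            # placeholder CNN features
tabular = np.column_stack([rng.integers(40, 90, n),  # hypothetical age column
                           rng.integers(0, 2, n)])   # hypothetical sex column
y = rng.integers(0, 2, n)                            # toy outcome labels

fused = np.hstack([ecg_embedding, tabular])          # (n, 66) combined design matrix
head = LogisticRegression(max_iter=1000).fit(fused, y)
print(fused.shape)  # (1000, 66)
```

In a time-to-event extension, the same fused representation could feed a survival model instead of a binary classifier.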