Data selection
Study data were obtained from the ALLSTAR database of 24-h ambulatory ECGs [20, 21] and the PhysioNet database [22]. All Holter ECG data in the ALLSTAR database were recorded with Holter ECG recorders (Cardy pico series, Suzuken Co., Ltd., Nagoya, Japan) at 125 Hz with 10-bit resolution (0.02 mV/digit) and analyzed by Holter ECG scanners (Cardy Analyzer 05, Suzuken Co., Ltd., Nagoya, Japan). The reliability of the analyzer has been certified by a product conformance test (IEC 60601-2-47, International Electrotechnical Commission, Geneva, Switzerland), including assessment of QRS detection accuracy with the American Heart Association (AHA) and Massachusetts Institute of Technology (MIT) ECG databases (the test results have not been published to avoid business risks). The ALLSTAR database was used for extracting both the training and test data. The use of this database for this study was approved by the Institutional Review Board of Nagoya City University Graduate School of Medical Sciences and Nagoya City University Hospital (No. 709). The PhysioNet database was used only for the test data for comparisons with earlier studies [9, 11, 13].
Training data
As the sources of the training data, we extracted 24-h Holter ECG data from 58 subjects with persistent SR and 52 subjects with CAF from the ALLSTAR database. The inclusion criterion for the SR data was a 24-h ECG consisting of SR beats >99% of total beats, without PAF episodes or frequent premature beats, in subjects >40 years old. The inclusion criterion for the CAF data was a 24-h ECG consisting of persistent AF beats >99% of total beats in patients >40 years old. Data from patients with pacemaker implants were excluded from both the SR and CAF data. The cardiac rhythms were diagnosed and confirmed independently by multiple laboratory technicians and cardiologists. The training data were used for machine learning to develop the CNN discriminant models.
Test data
As the sources of the test data, we extracted 24-h Holter ECG data from another 52 subjects with SR without PAF and from 53 subjects with PAF from the ALLSTAR database, independently of the training data. In this study, PAF was defined as an AF episode that started and/or ended within a given 24-h data set. The inclusion criterion for the SR data was a 24-h ECG consisting of persistent SR beats without PAF episodes in subjects >40 years old. The inclusion criterion for the PAF data was a 24-h ECG including at least one PAF episode in patients >40 years old. For both the SR and PAF test data, subjects were selected regardless of the number of premature beats, but patients with pacemaker implants were excluded from both. The cardiac rhythms were diagnosed and confirmed independently by multiple ECG technicians and cardiologists. The onset and offset points of each PAF episode were determined by agreement among the ECG technicians and confirmed by clinical laboratory technicians. The authors of this study were not involved in these ECG assessments. The test data were used for evaluating the performance of the discriminant models obtained from the training data.
Additionally, we used the MIT-BIH Atrial Fibrillation Database [6], the MIT-BIH Arrhythmia Database [25], and the MIT-BIH Normal Sinus Rhythm (NSR) Database [22] from PhysioNet. These served as additional test data for comparing the performance of the discrimination models generated by the CNN with the performance of other methods reported in earlier studies [9, 11, 13].
Lorenz plot image
We used 24-h time series of R-R intervals and QRS wave annotations. For the initial coarse analysis, the 24-h R-R interval data were split into consecutive non-overlapping segment windows with lengths of 10, 20, 50, 100, 200, and 500 beats for generating LP images. For the secondary detailed analysis, finer steps (from 10 beats down to one beat) were used for segment window lengths between 50 and 130 beats.
An LP was generated for each segment window length by plotting each R-R interval in the segment, except the first, as the y value against the preceding R-R interval as the x value [15, 17]. The obtained LPs were converted into monochrome images of 32 × 32-pixel resolution with a 3-bit brightness scale, resulting in a temporal resolution of 80 ms and a dynamic range of 0 to 2,560 ms for both the x and y values. When (x, y) values fell outside this range, the data were plotted at the edge of the image. The number of plots in each pixel was counted and used as the brightness level of the pixel; when the number of plots was >7, the level was set at 7. Figure 6 shows examples of LP images.
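The image construction described above can be sketched as follows (a minimal illustration with numpy; the function name `lorenz_plot_image` and the synthetic segment are ours and not part of the study pipeline):

```python
import numpy as np

def lorenz_plot_image(rr_ms, size=32, bin_ms=80, max_level=7):
    """Build a Lorenz-plot image from one segment of R-R intervals (ms).

    Each interval after the first (y) is plotted against its predecessor (x).
    Pixels are bin_ms wide (32 px x 80 ms = 0 to 2,560 ms dynamic range);
    out-of-range values are clipped to the image edge, and the per-pixel
    plot count is capped at max_level (3-bit brightness).
    """
    rr = np.asarray(rr_ms, dtype=float)
    x = np.clip((rr[:-1] // bin_ms).astype(int), 0, size - 1)
    y = np.clip((rr[1:] // bin_ms).astype(int), 0, size - 1)
    img = np.zeros((size, size), dtype=int)
    np.add.at(img, (y, x), 1)                       # count plots per pixel
    return np.minimum(img, max_level).astype(np.uint8)  # saturate at level 7

# Example: a 100-beat segment of roughly 800-ms sinus intervals
rng = np.random.default_rng(0)
segment = 800 + rng.normal(0, 30, 100)
image = lorenz_plot_image(segment)
```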
Machine learning with training data
Because the training data were generated from either pure (>99%) SR or pure (>99%) CAF cases, all LP images produced for any segment window length consisted of either purely non-AF beats or purely AF beats. Consequently, the LP images were simply annotated as non-AF or AF images accordingly.
The machine learning was performed with a CNN implemented in Keras (version 2.2.4), an open-source Python machine learning library, under the Microsoft Windows 10 operating system on a computer equipped with an 8-core Intel Xeon E3-1275 V3 processor and 32 GB of memory. The computer was also equipped with an NVIDIA RTX 2080 Ti graphics board with 27 GB of memory.
The CNN consisted of an input layer, a 1st convolution layer, a 2nd convolution layer, a 1st max pooling layer, a 3rd convolution layer, a 2nd max pooling layer, a 1st dropout layer, a flatten layer, a 1st dense layer, a 2nd dropout layer, and a 2nd dense layer. In all convolution layers, the kernel size was set at 3 × 3, and the numbers of output filters were set at 32, 64, and 128 in the first, second, and third convolution layers, respectively. Each max pooling layer halved the height and width. The dropout rate was set at 0.25. The numbers of units in the 1st and 2nd dense layers were set at 128 and 2, respectively. The rectified linear unit was used as the activation function of all convolution layers and the 1st dense layer, and softmax as the activation function of the 2nd dense layer in order to classify LP images. Binary cross-entropy was used as the loss function and stochastic gradient descent (SGD) as the optimizer. The hyperparameters of SGD were as follows: learning rate = 0.001, learning rate decay = 0.000001, and momentum = 0.9.
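Under these specifications, the network can be sketched in Keras as follows. This is an illustrative reconstruction, not the authors' code: valid padding is assumed for the convolution layers, the rate of the 2nd dropout layer is assumed to also be 0.25, and the learning-rate decay of the original Keras 2.x optimizer is noted only in a comment for compatibility with newer versions.

```python
from tensorflow.keras import layers, models, optimizers

model = models.Sequential([
    layers.Input(shape=(32, 32, 1)),                # 32 x 32 monochrome LP image
    layers.Conv2D(32, (3, 3), activation="relu"),   # 1st convolution
    layers.Conv2D(64, (3, 3), activation="relu"),   # 2nd convolution
    layers.MaxPooling2D((2, 2)),                    # 1st max pooling (halves H and W)
    layers.Conv2D(128, (3, 3), activation="relu"),  # 3rd convolution
    layers.MaxPooling2D((2, 2)),                    # 2nd max pooling
    layers.Dropout(0.25),                           # 1st dropout
    layers.Flatten(),
    layers.Dense(128, activation="relu"),           # 1st dense
    layers.Dropout(0.25),                           # 2nd dropout (rate assumed)
    layers.Dense(2, activation="softmax"),          # 2nd dense: AF vs. non-AF
])

# SGD with momentum 0.9; the learning-rate decay of 1e-6 was a Keras 2.x
# optimizer argument and is omitted here for compatibility with newer versions.
model.compile(loss="binary_crossentropy",
              optimizer=optimizers.SGD(learning_rate=0.001, momentum=0.9),
              metrics=["accuracy"])
```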
We employed 5-fold cross-validation for the CNN. Briefly, for each segment window length, the LP images of the training data were divided randomly into five subsets. For each cross-validation run, one subset of the five was selected in turn and used as the validation subset, and the remaining four subsets were used as training subsets. Consequently, we obtained five datasets with different training and validation subsets. We set the batch size to 32. Accuracy and loss were updated for each batch. The maximum number of epochs was set at 50 per training run; if the validation loss did not improve for 10 epochs, training was stopped early. As a result, we obtained five models for each segment window length. The validation accuracy and validation loss of each dataset were updated for each epoch, and their final values were obtained at the last epoch. Classification accuracy was evaluated by a cross-validation score and confusion matrices. For the cross-validation score, probabilities were calculated by applying each of the five discriminant models to each LP image in the corresponding data subset from which the model was derived. LP images were classified as AF when the probability was ≥0.5 and as non-AF when it was <0.5. The cross-validation score was obtained as the average accuracy of the five models.
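The fold-splitting and scoring logic can be illustrated schematically as follows (a numpy sketch; the function names and random seed are ours, and model training itself is omitted, with each image's predicted AF probability taken as given):

```python
import numpy as np

def five_fold_splits(n_images, n_folds=5, seed=0):
    """Randomly partition image indices into n_folds validation subsets;
    for each fold, the remaining indices form the training subset."""
    idx = np.random.default_rng(seed).permutation(n_images)
    folds = np.array_split(idx, n_folds)
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train, val

def cross_validation_score(probabilities, labels, splits):
    """Average validation accuracy over the folds, classifying an LP image
    as AF (label 1) when its predicted AF probability is >= 0.5."""
    accs = []
    for _, val in splits:
        pred = (probabilities[val] >= 0.5).astype(int)
        accs.append(np.mean(pred == labels[val]))
    return float(np.mean(accs))

# Example with synthetic, perfectly separating probabilities
labels = np.array([0, 1] * 50)
probs = labels.astype(float)
score = cross_validation_score(probs, labels, five_fold_splits(len(labels)))
```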
Classification of test data
In contrast to the training data, the test data were generated from either non-AF or PAF cases regardless of the number of premature beats. Thus, the produced LP images could contain AF beats at any ratio between 0 and 100%. We annotated them with strict and non-strict criteria. Under the strict criteria, LP images were annotated as AF images if they contained even a single AF beat and as non-AF images if they contained no AF beats at all. Under the non-strict criteria, LP images were annotated as AF images if AF beats accounted for more than half of the total beats in the segment, and as non-AF images otherwise. The strict criteria were used for the test data from the ALLSTAR database to examine the optimal segment window length for PAF detection. The non-strict criteria were used for the test data from both the ALLSTAR and PhysioNet databases to compare the classification performance between the models developed in the present study and those reported in earlier studies [9, 11, 13].
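The two annotation rules can be expressed compactly as follows (a minimal sketch; the function name and per-beat AF flags are illustrative, not the study code):

```python
def annotate_segment(af_flags, strict=True):
    """Label one segment's LP image from per-beat AF flags (truthy = AF beat).

    Strict criteria: AF if the segment contains even one AF beat.
    Non-strict criteria: AF only if AF beats exceed half of the segment's beats.
    """
    n_af = sum(bool(f) for f in af_flags)
    if strict:
        return "AF" if n_af > 0 else "non-AF"
    return "AF" if n_af > len(af_flags) / 2 else "non-AF"
```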
The confusion matrices for the test data were obtained as follows. For the test data of each segment window length, the five discriminant models were applied to each LP image in the test data, and five probabilities were calculated. LP images were classified as AF when the average of the five probabilities was ≥0.5 and as non-AF when it was <0.5. The classification results were summarized as a confusion matrix for each segment window length.
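The model-averaging rule can be sketched as follows (illustrative; the function name and example probabilities are ours):

```python
import numpy as np

def classify_by_model_average(model_probs, threshold=0.5):
    """Classify LP images from an ensemble of discriminant models.

    model_probs: array of shape (5, n_images) -- the AF probability assigned
    to each LP image by each of the five cross-validation models. An image is
    classified as AF when the mean of the five probabilities is >= threshold.
    """
    mean_p = np.asarray(model_probs, dtype=float).mean(axis=0)
    return np.where(mean_p >= threshold, "AF", "non-AF")

# Two images scored by five models: mean probabilities 0.70 and 0.24
labels = classify_by_model_average([[0.9, 0.2],
                                    [0.8, 0.4],
                                    [0.7, 0.1],
                                    [0.6, 0.3],
                                    [0.5, 0.2]])
```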
To examine whether AF burden can be estimated by LP-based AF detection, AF burden was calculated as the ratio of AF beats to the total recorded beats in each 24-h data set and compared with the ratio of AF LP images to the total LP images generated from that data set for each segment window length.
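The two compared ratios can be computed as, e.g. (illustrative helper functions, not the study code):

```python
import numpy as np

def af_burden(beat_is_af):
    """AF burden: ratio of AF beats to total recorded beats in a 24-h data set."""
    return float(np.mean(np.asarray(beat_is_af, dtype=bool)))

def af_image_ratio(image_labels):
    """Ratio of AF-classified LP images to total LP images of the data set."""
    return float(np.mean(np.asarray(image_labels) == "AF"))
```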
Statistical analysis
The receiver operating characteristic curve and the area under the curve were calculated for the classification performance in the test data. The sensitivity, specificity, positive predictive value, negative predictive value, accuracy, and positive and negative likelihood ratios of the classifications were calculated for each segment window length. The relationships between these performance metrics and segment window length were analyzed by regression curve fitting; logarithmic transformation of the axes was applied as necessary to improve the fit of the regression curves. Among these indices of classification performance, the optimal segment window length was determined based on the likelihood ratios, which are known to provide a fair evaluation of classification performance independent of prior probability [24]. The agreement between the AF burden (the ratio of AF beats to total recorded beats in each 24-h data set) and the ratio of AF LP images was evaluated with the upper and lower limits of agreement of the Bland-Altman method [26]. These analyses were performed with numpy (version 1.15.4) and scikit-learn (version 0.20.2), open-source Python libraries.
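The likelihood ratios and the Bland-Altman limits of agreement can be computed as follows (a schematic numpy sketch under the standard definitions; the function names are ours):

```python
import numpy as np

def likelihood_ratios(tp, fn, tn, fp):
    """Positive and negative likelihood ratios from confusion-matrix counts:
    LR+ = sensitivity / (1 - specificity); LR- = (1 - sensitivity) / specificity."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens / (1.0 - spec), (1.0 - sens) / spec

def bland_altman_limits(burden, image_ratio):
    """Bland-Altman bias and 95% limits of agreement between per-recording
    AF burden and the ratio of AF LP images (both expressed as fractions)."""
    diff = np.asarray(burden, dtype=float) - np.asarray(image_ratio, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```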