This study was approved by the institutional review boards (IRBs) of Sejong General Hospital (2018-0689) and Mediplex Sejong Hospital (2018-054). Clinical data, including digitally stored ECGs, age, sex, and endpoints of admitted patients, were extracted from both hospitals. Both IRBs waived the need for informed consent because the study was retrospective, used fully anonymized ECG and health data, and posed minimal risk of harm.
ECG data
The predictor variables were ECG, age, and sex. Digitally stored 12-lead ECG data were recorded at 500 data points per second (500 Hz) at each lead for 10 seconds. We removed 1 s at both the beginning and the end of each ECG because these segments contain more artifacts than the rest of the recording. The length of each ECG was thereby reduced to 8 s (4000 data points per lead). We made a dataset using the entire 12-lead ECG data. We also used partial datasets from the 12-lead ECG data, such as limb 6-lead (aVL, I, -aVR, II, aVF, and III) and single lead (I or II). We selected these leads as they can easily be recorded by wearable and pad devices in contact with the patient’s limbs.[25] Consequently, when we developed and validated an algorithm using 12-lead ECGs, each input was a 2D array of 12 × 4000 numbers. Similarly, for 6-lead and single-lead ECGs, the inputs comprised 6 × 4000 and 1 × 4000 numbers, respectively. We rearranged the input 2D ECG data in the order V1, V2, V3, V4, V5, V6, aVL, I, -aVR, II, aVF, and III. A convolutional neural network (CNN) is a well-known deep learning architecture for learning 2D image data.[20]
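The trimming and lead selection described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' code: the function names, the synthetic signal, and the assumption that the rows of the array already follow the stated lead order (with all six precordial leads V1 to V6) are hypothetical.

```python
import numpy as np

FS = 500      # sampling rate in Hz (from the text)
TRIM_S = 1    # seconds removed at each end (from the text)

# Row order of the 2D input. Assumption for this sketch: the rows of the
# array are already arranged in this order.
LEADS_12 = ["V1", "V2", "V3", "V4", "V5", "V6",
            "aVL", "I", "-aVR", "II", "aVF", "III"]
LIMB_6 = ["aVL", "I", "-aVR", "II", "aVF", "III"]


def trim_ecg(ecg, fs=FS, trim_s=TRIM_S):
    """Drop trim_s seconds from the start and end of a (leads, samples) array."""
    n = fs * trim_s
    return ecg[:, n:-n]


def select_leads(ecg, names):
    """Pick rows by lead name, assuming the rows follow LEADS_12."""
    rows = [LEADS_12.index(name) for name in names]
    return ecg[rows, :]


# Synthetic stand-in for one 10 s, 12-lead recording.
raw = np.random.default_rng(0).normal(size=(12, 10 * FS))

full = trim_ecg(raw)                  # shape (12, 4000)
limb = select_leads(full, LIMB_6)     # shape (6, 4000)
single = select_leads(full, ["II"])   # shape (1, 4000)
```

Keeping the single-lead case as a 1 × 4000 array rather than a flat vector means all three input variants share the same (leads, samples) layout.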
Development of the deep learning-based artificial intelligence algorithm
The DLA was built from many hidden layers of neurons to learn complex hierarchical nonlinear representations from the data.[20] It contained a block with six stages: two convolutional layers, two batch normalization layers, one max pooling layer, and one dropout layer. This block was fully connected to a one-dimensional (1D) layer composed of 128 nodes (Figure 1). The input layer of epidemiology (age, sex) was concatenated with the 1D layer. There were two fully connected 1D layers after the flattened layer, and the second layer was connected to the output layer, which was composed of one node. The value of the output node represents the probability of developing cardiac arrest; the node uses a sigmoid activation function because the output of the sigmoid function lies between 0 and 1. We used the open-source TensorFlow software library (Google LLC, Mountain View, CA, USA) as the backend, and conducted our experiment with Python (version 3.5.2; Python Software Foundation, Beaverton, OR, USA). We conducted additional experiments for the DLA using limb 6-lead and each single-lead (lead I, lead II, lead III, aVR, aVL, and V1–6) ECGs. To develop and validate the DLA for these ECGs, we changed the sizes of the filters and convolutional layers to fit the shape of the input datasets. The numbers of filters, max pooling layers, and fully connected layers were the same as those of the 12-lead ECG architecture.
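The way age and sex join the ECG features before the sigmoid output can be illustrated with a small NumPy forward pass. This is a hedged sketch, not the published architecture: the hidden-layer widths (64 and 32), the ReLU activations, the random weights, and the scaling of age are all assumptions, and the 128-dimensional vector stands in for the output of the convolutional part.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    return np.maximum(z, 0.0)


def head_forward(ecg_features, age, sex, params):
    """Concatenate the 128-d ECG features with (age, sex), apply two fully
    connected layers, and return one sigmoid output in (0, 1)."""
    demo = np.array([age, sex], dtype=float)
    x = np.concatenate([ecg_features, demo])          # shape (130,)
    h1 = relu(params["W1"] @ x + params["b1"])
    h2 = relu(params["W2"] @ h1 + params["b2"])
    return float(sigmoid(params["w_out"] @ h2 + params["b_out"]))


# Hypothetical layer widths and random weights, for illustration only.
rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(scale=0.01, size=(64, 130)), "b1": np.zeros(64),
    "W2": rng.normal(scale=0.01, size=(32, 64)), "b2": np.zeros(32),
    "w_out": rng.normal(scale=0.01, size=32), "b_out": 0.0,
}

# Age is assumed to be rescaled (e.g. years / 100) and sex coded as 0 or 1.
p = head_forward(rng.normal(size=128), age=0.63, sex=1.0, params=params)
```

Because the last step is a sigmoid, `p` always lies strictly between 0 and 1 and can be compared against a screening cut-off.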
Development and validation datasets
Data from hospital A were used for development and internal validation. We identified patients who were admitted to hospital A in the study period (October 2016–September 2019) and who had at least one standard digital, 10 s, 12-lead ECG acquired in the supine position during the admission period. We excluded subjects with missing demographic or electrocardiographic information. As shown in Figure 2, patients treated at hospital A were randomly and exclusively split into algorithm development (70%) and internal validation (30%) datasets. Data from hospital B were used only for external validation, to confirm that the developed DLA was robust across diverse datasets. The characteristics of the two hospitals differ: hospital A is a cardiovascular teaching hospital, and hospital B is a community general hospital. We also identified patients who were admitted to hospital B in the study period (March 2017–September 2019) and had at least one ECG during admission. We likewise excluded subjects in hospital B with missing values. Because the purpose of the validation data was to assess the accuracy of the algorithm, we used only one ECG from each patient in the internal and external validation datasets: the ECG closest in time to the endpoint (cardiac arrest or survival and subsequent discharge).
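A patient-level split and the one-ECG-per-patient rule for validation might look like the sketch below. The function names, the record layout, and the seed are hypothetical; the sketch also assumes every record was acquired before that patient's endpoint, so the latest ECG is the one closest to it.

```python
import random


def split_patients(patient_ids, dev_fraction=0.7, seed=0):
    """Randomly assign each patient to exactly one of two disjoint sets."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_dev = round(len(ids) * dev_fraction)
    return set(ids[:n_dev]), set(ids[n_dev:])


def last_ecg_per_patient(records):
    """records: iterable of (patient_id, ecg_time, ecg_id).

    Keeps the latest ECG per patient (assumed to precede the endpoint).
    """
    latest = {}
    for pid, t, ecg_id in records:
        if pid not in latest or t > latest[pid][0]:
            latest[pid] = (t, ecg_id)
    return {pid: ecg_id for pid, (_, ecg_id) in latest.items()}


dev_ids, val_ids = split_patients(range(10))
picked = last_ecg_per_patient([("a", 1, "e1"), ("a", 3, "e2"), ("b", 2, "e3")])
```

Splitting by patient rather than by ECG keeps all recordings from one person on the same side of the split, which is what "exclusively" requires.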
Endpoint
The endpoint of this research was cardiac arrest, defined as a lack of palpable pulse, with or without attempted resuscitation. We reviewed electronic health records to identify the exact time of each endpoint. The objective of the DLA was to predict whether an ECG was within the prediction time window of cardiac arrest, which is the 24-hour interval before cardiac arrest. For a patient with cardiac arrest, the ECGs belonging to the prediction window were labeled as cardiac arrest and all other ECGs were labeled as nonevent. For a patient without cardiac arrest, all ECGs were labeled as nonevent. In other words, the aim of the developed DLA was to accurately classify an ECG as cardiac arrest or nonevent.
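The labeling rule can be written as a short function. This is a minimal sketch under stated assumptions: the function name is hypothetical, the window boundaries are treated as inclusive, and ECGs recorded after the arrest are labeled nonevent, none of which the text specifies.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)   # prediction window from the text


def label_ecg(ecg_time, arrest_time=None, window=WINDOW):
    """Return 1 (cardiac arrest) if the ECG lies in the window before the
    arrest, otherwise 0 (nonevent). Patients without arrest pass None."""
    if arrest_time is None:
        return 0
    return int(arrest_time - window <= ecg_time <= arrest_time)


arrest = datetime(2019, 1, 2, 12, 0)
in_window = label_ecg(datetime(2019, 1, 2, 0, 0), arrest)    # 12 h before
too_early = label_ecg(datetime(2019, 1, 1, 11, 0), arrest)   # 25 h before
no_arrest = label_ecg(datetime(2019, 1, 2, 0, 0))            # no endpoint
```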
Statistical analysis
For each input (ECG, age, and sex) of the validation data, the DLA calculated the probability of cardiac arrest in the range from 0 (nonevent) to 1 (cardiac arrest). To confirm the performance of the DLA, we compared the probability calculated by the DLA with the occurrence of cardiac arrest within 24 hours after the time of the ECG in the validation data. For this, we used the area under the receiver operating characteristic curve (AUROC) to measure the performance of the model. As the purpose of the DLA was screening, we evaluated the specificity, the positive predictive value, and the negative predictive value at a cut-off point selected for high (90%) sensitivity in the development data. Exact 95% confidence intervals (CIs) were used for all measures of diagnostic performance except for AUROC. The CI for AUROC was determined based on the Sun and Xu optimization of the DeLong method, using the pROC package in R (The R Foundation, Vienna, Austria; www.r-project.org). Statistical significance for the differences in patient characteristics was defined as a 2-sided P value of less than 0.001. Measures of the diagnostic performance were summarized using 2-sided 95% CIs. Analyses were computed using R software, version 3.4.2.
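The point estimates described above can be illustrated in plain Python; the authors used R and the pROC package, so this is only a sketch of the underlying definitions, with hypothetical function names and toy data. Confidence intervals are not computed here. In the study the cut-off was chosen on development data and then applied to validation data; the toy example reuses one small dataset for brevity.

```python
import math


def auroc(scores, labels):
    """Probability that a random positive scores higher than a random
    negative, counting ties as 0.5 (quadratic time, fine for a sketch)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cutoff_for_sensitivity(scores, labels, target=0.9):
    """Highest cut-off at which 'score >= cut-off' reaches the target sensitivity."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    k = math.ceil(target * len(pos) - 1e-9)   # positives that must be flagged
    return pos[k - 1]


def screening_metrics(scores, labels, cutoff):
    """Sensitivity, specificity, PPV, and NPV at a given cut-off."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


scores = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]   # toy DLA outputs
labels = [0, 0, 1, 0, 1, 1]               # 1 = cardiac arrest within 24 h
auc = auroc(scores, labels)
cut = cutoff_for_sensitivity(scores, labels, target=0.9)
metrics = screening_metrics(scores, labels, cut)
```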
Subgroup analysis
We hypothesized that early in the course of any deterioration, ECG signals would show subtle abnormal patterns due to metabolic and structural changes. Although cardiac arrest did not happen within 24 hours of the nonevent ECGs, delayed cardiac arrest and deterioration events could still have occurred after them. In other words, we hypothesized that our DLA would classify ECGs with characteristics of deterioration as cardiac arrest, giving the initial appearance of a false positive test (that is, an ECG classified as cardiac arrest, but not leading to cardiac arrest within 24 hours). To test this hypothesis, we designed two subgroup analyses with nonevent ECGs in the external validation dataset. We divided the nonevent ECGs into low- and high-risk groups as defined by the DLA. In the first subgroup analysis, we confirmed the occurrence of cardiac arrest over the 2 weeks following each ECG. We also confirmed the performance of the DLA in predicting deterioration events. The deterioration events were defined as unexpected intensive care unit transfer over the 2 weeks following each ECG. For the second subgroup analysis, we included the nonevent ECGs of the external validation data that were acquired in general wards. Kaplan-Meier analysis was used to depict the occurrence of delayed cardiac arrest and deterioration events for the true negative (low-risk) versus the false positive (high-risk) groups over time. Subsequently, Cox proportional hazards regression was used to estimate the hazard for delayed cardiac arrest and deterioration events.
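For readers unfamiliar with the Kaplan-Meier estimator, the product-limit calculation can be sketched as follows. The function name and the toy follow-up data are hypothetical, and the Cox regression step is not shown. In the setting above, the function would be applied separately to the low-risk and high-risk groups.

```python
def kaplan_meier(durations, events):
    """Return [(time, event-free probability)] at each distinct event time.

    durations: follow-up time per ECG (e.g. days, up to 14)
    events:    1 if the event occurred at that time, 0 if censored
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = n_removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events:
            surv *= 1 - n_events / at_risk
            curve.append((t, surv))
        at_risk -= n_removed
    return curve


# Toy group: events at days 1, 2, and 3; censoring at days 2 and 4.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Censored ECGs leave the risk set without lowering the curve, which is why the denominator shrinks faster than the number of events alone would suggest.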
Visualization using a sensitivity map
To understand the developed DLA and compare it with existing medical knowledge, it was important to identify which regions of the ECG had significant effects on the decision of the DLA. We employed a sensitivity map based on the saliency method to visualize the ECG regions used by the DLA to predict cardiac arrest. The map was computed using the first-order gradients of the classifier probabilities with respect to the input signals. If the classifier probability was sensitive to a specific region of the signal, that region was considered significant to the model. We used a gradient-weighted class activation map (Grad-CAM) for visualization.[26] Grad-CAM uses the gradient information of the algorithm and can be used with any activation function and any CNN architecture.
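The idea of a first-order gradient saliency map can be shown on a deliberately simple stand-in model. The sketch below is not Grad-CAM and not the authors' CNN: it assumes a hypothetical logistic classifier p = sigmoid(sum(w * x) + b) over the whole signal, for which the gradient of the probability with respect to each input sample has a closed form.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def saliency_map(x, w, b):
    """Absolute gradient |dp/dx| of p = sigmoid(sum(w * x) + b).

    x and w share the shape (leads, samples); larger values mark samples
    to which the predicted probability is more sensitive.
    """
    p = sigmoid(np.sum(w * x) + b)
    grad = p * (1.0 - p) * w       # chain rule: dp/dz times dz/dx
    return np.abs(grad)


# Toy signal and weights (hypothetical, for illustration only).
rng = np.random.default_rng(1)
x = rng.normal(size=(2, 5))
w = rng.normal(size=(2, 5))
sal = saliency_map(x, w, b=0.1)
```

For a deep network the same quantity is obtained by automatic differentiation rather than a closed form, but the interpretation of the resulting map is the same.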