A Spectrogram Based Novel Approach for Arrhythmia Detection with Convolutional Neural Networks

ECG is one of the most important medical tests used for diagnosing various heart-related conditions and diseases. One of the most common of these is arrhythmia, which is characterized by irregular heartbeats. Artificial intelligence has had a major impact on the field of vital-sign monitoring and automated medical diagnosis, and a large body of work has demonstrated its effectiveness in arrhythmia detection. In this paper, we propose a method that aims to improve upon the accuracy of such models with a lightweight deep learning architecture that applies 2D separable CNNs to a group of graphical representations of the ECG signal: the STFT, CWT and MFCC. Our model achieves an accuracy of 97.41% and an F1 score of 88.20 on a processed version of the MIT-BIH dataset and requires on average 7.93 times fewer calculations than a simple 2D convolution model.


Background
Arrhythmia is one of the most common heart conditions that can be diagnosed with an electrocardiogram (ECG) [1]. Arrhythmias are caused by abnormally fast, slow or irregular heartbeats. A variant of arrhythmia called atrial fibrillation was mentioned on 175,362 death certificates and was the possible cause of 25,845 deaths [2]. Arrhythmias are identified by characteristics such as a missing discrete P wave [3], an irregularly high ventricular heart rate and the absence of an isoelectric baseline [4]. Most or all of these observations can be seen in lead II of the electrocardiogram and, to some extent, in lead V1 [5]. Currently, this diagnosis is done manually by doctors and technicians who inspect the ECG graph. This makes the diagnosis slow and delays the treatment the patient requires. Therefore, there is a need for methods that can assist them by automating this process.
This has led to the introduction of machine learning and deep learning methods to help classify the type of arrhythmia from a given beat. Various methods, ranging from statistical machine learning to deep learning approaches such as LSTM-based autoencoders [6] and convolutional neural networks [7], have been used to classify heartbeats measured with an ECG device. However, recent advancements in these fields have made it possible to improve the accuracy of such models with minor changes to the existing methodology.

Motivation
Most of the work in this field keeps the data as a one-dimensional vector. This involves methods that encode a 1D vector into useful features while capturing the sequential trend of the signal. However, a 1D vector carries relatively little information, and this limitation is compounded by the low efficacy of 1D convolution models. A better representation of such signals is the spectrogram, which shows the variation of frequency content over time in a graphical format. To process such images, we need a 2D CNN model. However, different types of representation capture different features. Therefore, the best way to utilize them is a method that can infer collectively from multiple such spectrograms at the same time. This is the primary motivation and goal of this work. We also explore methods and techniques that can mitigate the increased number of calculations that comes with the move from a single 1D network to an ensemble 2D convolution network.
The rest of the paper is organized as follows: Section 2 details the Literature Review, Section 3 describes Research Methodology, Section 4 discusses the results obtained from the experiments and Section 5 concludes the work while indicating a future direction.

Overview
Some of the earliest works on arrhythmia detection with machine learning, such as the one presented in [8], used algorithms including Random Forest, SVM and Gradient Boosting to detect arrhythmia episodes. Since then, advancements in deep learning have improved both the accuracy and the feature extraction methods in this area.
In [9], the authors built a 5-layer CNN model with Exponential Linear Unit and Batch Normalization layers, achieving an accuracy of 93.6% and a loss of 0.2 on the MIT-BIH database. The authors of [10] built a 34-layer convolutional neural network consisting of 16 residual blocks with 2 convolutional layers per block, and were able to exceed the average cardiologist performance in both recall and precision. The authors of [11] proposed an 11-layer CNN model consisting of 5 residual blocks with 2 convolutional layers per block. They evaluated the network on PhysioNet's MIT-BIH and PTB diagnostics datasets and achieved an accuracy of 93.4%.
These methods focused explicitly on using 1D models to process the signal directly. Some works bring this to 2D by classifying a graphical representation of the signal. In [12], the authors built an 11-layer 2D CNN model for arrhythmia classification and achieved an average accuracy of 99.05% on the 7 heartbeat classes present in the MIT-BIH arrhythmia database by converting the signals into line graph images. A better alternative is the use of spectrograms, which can highlight features much more effectively. The work in [13] and [14] shows the effectiveness of spectrograms such as the Mel spectrogram, STFT and CWT for detecting and classifying arrhythmia episodes.
Based on the literature survey, most of the work on arrhythmia detection and classification from ECG uses 1D models. This shortcoming can be tackled with 2D models, and several works have tried to do so with a variety of inputs ranging from basic graphs to spectrograms. However, the efficacy of these models can be improved with deep-learning-centric ensemble techniques, which have not been explored thoroughly in most of the existing research. In addition, most papers working with 2D CNN models fail to address the increased amount of computation that arises with the transition. Therefore, there is a need for work that addresses both issues, preferably with a single model.

Objective
The goal of this work is to develop a model that uses an ensemble of spectrograms [15] to infer the arrhythmia episode from a variety of features, thereby improving its prediction capability. We also aim to reduce some of the computational overhead that comes with using an ensemble model with 2D CNNs. This includes a method to reduce the number of trainable parameters that must be updated during backpropagation, along with a reduction in the number of calculations required to generate the output, which helps reduce both the training and the inference time of the model.

Data Preprocessing
The signal used here to detect and classify arrhythmia is taken from lead II of an ECG device at a sampling rate of 125 Hz. The first step in processing the signal is to remove extraneous noise that might arise from electrical interference, breathing motion and other such sources. For this, we have used a Butterworth filter [16] to keep only the frequencies in the range of 1 Hz to 25 Hz. The difference in the output is shown in Figure 1. The filtered signal was then converted into the graphical formats required for our model. For this work, we have selected 3 representations. They are as follows:

1) Short-Time Fourier Transform: The STFT [17] applies the Fourier transform [18] to small time segments of the signal and uses the resulting power vs. frequency curves to create a time vs. frequency spectrogram, where the color intensity at each frequency on the y axis depends on the corresponding power in the power vs. frequency curve of that segment. Here, we have used a length of 50 for the windowed signal.

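2) Continuous Wavelet Transform: A CWT [19] is a convolutional transform calculated using the fast Fourier transform [20] with a set of functions generated by the mother wavelet. For a mother wavelet $\psi(t)$, the wavelet transform of a signal $x(t)$ is given by:
$$W(a, b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt$$
where a is the dilation parameter, b is the location parameter and $\psi^{*}$ is the complex conjugate of $\psi$. We have used the Morlet wavelet [21] to create the wavelet transform.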

3) Mel Frequency Cepstral Coefficients: MFCCs [22] are a set of features that concisely represent the shape of the short-term power spectrum of a signal, based on a cosine transform of the log power spectrum on a Mel frequency scale. As with the STFT, we have used a length of 50 for the windowed signal.
All three representations are individually resized to a shape of 128x128x3. They are fed to the network in parallel by concatenating the three images along the third axis, which makes the final input shape 128x128x9. Figure 2 shows the spectrograms of one of the beats extracted from the dataset.
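The preprocessing chain described above can be sketched in Python with NumPy and SciPy. This is only an illustration under stated assumptions, not the authors' implementation: the filter order (4), the zero-phase filtering, the nearest-neighbour resize and the replication of a grayscale map into 3 channels are all choices made here, since the text specifies only the 1 Hz to 25 Hz band, the window length of 50 and the final shapes. The CWT and MFCC images are replaced by placeholder arrays to keep the sketch short.

```python
import numpy as np
from scipy import signal

FS = 125  # sampling rate in Hz, as stated in the text


def bandpass(x, low=1.0, high=25.0, fs=FS, order=4):
    """Butterworth band-pass filter. The order and the zero-phase
    (forward-backward) filtering are assumptions of this sketch."""
    sos = signal.butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return signal.sosfiltfilt(sos, x)


def stft_magnitude(x, fs=FS, window_len=50):
    """Magnitude of the STFT with a window length of 50."""
    _, _, zxx = signal.stft(x, fs=fs, nperseg=window_len)
    return np.abs(zxx)  # shape: (frequency bins, time frames)


def to_image(spec, size=128):
    """Resize a 2D map to size x size (nearest neighbour), scale it to [0, 1]
    and replicate it into 3 channels. All three steps are assumptions."""
    rows = np.linspace(0, spec.shape[0] - 1, size).round().astype(int)
    cols = np.linspace(0, spec.shape[1] - 1, size).round().astype(int)
    resized = spec[np.ix_(rows, cols)]
    span = resized.max() - resized.min()
    scaled = (resized - resized.min()) / span if span > 0 else np.zeros_like(resized)
    return np.repeat(scaled[:, :, None], 3, axis=2)


def stack_representations(images):
    """Concatenate 128x128x3 images along the channel axis."""
    return np.concatenate(images, axis=2)


# Synthetic test signal: a 5 Hz component plus 50 Hz interference.
t = np.arange(10 * FS) / FS
clean = np.sin(2 * np.pi * 5 * t)
noisy = clean + 0.5 * np.sin(2 * np.pi * 50 * t)
filtered = bandpass(noisy)

# One 187-sample segment converted to an STFT image.
stft_img = to_image(stft_magnitude(filtered[:187]))

# Placeholders standing in for the CWT and MFCC images (not computed here).
cwt_img = np.zeros((128, 128, 3))
mfcc_img = np.zeros((128, 128, 3))

model_input = stack_representations([stft_img, cwt_img, mfcc_img])
print(model_input.shape)  # (128, 128, 9)
```

In a full pipeline the two placeholders would be replaced by images computed from the same filtered beat, so that all nine channels describe one heartbeat.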

Model
The basis of the model is a general 2D convolution network. This breaks the input image into multiple lower-resolution feature maps that have more focused features than the input image. In total, our model has 7 convolution layers in the backbone. However, the transition from the conventional 1D CNN to a 2D CNN, with an input image that has 9 channels instead of the conventional 3, increases the amount of computation required. Therefore, as a first step to reduce the number of calculations, we have replaced the conventional 2D CNN block with separable 2D convolution [23]. In a separable block, the convolution is done in 2 steps. In the depthwise operation, the block convolves each channel individually (rather than grouping them together as a conventional block does) with the given kernel and feature map specifications. After this, the pointwise convolution operates on the output of the depthwise operation with a 1x1 kernel applied across all the channels in order to combine the results. This reduces the number of multiplications required.
For a normal 2D convolution operation, the number of multiplications required is given by:
$$M_{\text{conv}} = B^2 \cdot k^2 \cdot C \cdot n$$
For the corresponding separable convolution operation, the number of multiplications required is:
$$M_{\text{sep}} = B^2 \cdot C \cdot (k^2 + n)$$
where BxB is the resolution of the output image, n is the required number of feature maps, k is the size of the kernel and C is the number of channels in the input image. This makes the ratio:
$$\frac{M_{\text{conv}}}{M_{\text{sep}}} = \frac{k^2 \, n}{k^2 + n}$$
To make the convergence faster, we have augmented the convolution blocks with a batch normalization operation [24] followed by an ELU activation, which avoids a possible dying ReLU problem. Figure 3 (a) shows the architecture of the base model.
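The saving from separable convolution can be checked numerically. The sketch below assumes the usual multiplication counts, B²·k²·C·n for a standard convolution and B²·C·(k² + n) for a depthwise step followed by a pointwise step, using the symbols defined in the text. The output resolution and the layer widths in the loop are hypothetical values chosen for illustration; they are not the widths of the model in Figure 3.

```python
def standard_conv_mults(B, k, C, n):
    """Multiplications for a standard 2D convolution that produces
    n feature maps of size B x B from C input channels with a k x k kernel."""
    return B * B * k * k * C * n


def separable_conv_mults(B, k, C, n):
    """Multiplications for a depthwise step (one k x k kernel per input
    channel) followed by a pointwise 1x1 step that mixes C channels into n."""
    depthwise = B * B * k * k * C
    pointwise = B * B * C * n
    return depthwise + pointwise


def reduction_ratio(k, n):
    """Standard / separable ratio; it depends only on k and n."""
    return (k * k * n) / (k * k + n)


B, k, C = 64, 3, 9  # B = 64 is a hypothetical output resolution
for n in (32, 64, 128, 256):  # hypothetical numbers of feature maps
    std = standard_conv_mults(B, k, C, n)
    sep = separable_conv_mults(B, k, C, n)
    print(n, std, sep, round(std / sep, 2))
```

For a 3x3 kernel the ratio approaches 9 as n grows, which is in line with the reduction of roughly 7 to 8 times reported for this model.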
Table I shows the computational advantage that separable convolution holds over normal convolution in the model shown in Figure 3. All the kernels have a size of 3x3.
Second, to handle the calculations over the 9 input channels, we have used group convolution [25] of order 3 instead of the conventional variant. Group convolution splits the feature maps into groups and assigns a separate convolution block to process each group. This removes redundant connections from the convolution operation, reducing the number of connections between the layers along with the number of trainable parameters of the model. It also helps with regularization when each input group represents different features, or an altogether different image, by not convolving across the groups at the same time. In our case, we have grouped the first 6 blocks and combined their feature maps using a normal (ungrouped) separable 2D convolution layer with 1024 feature maps. Figure 3 (b) shows the main model, which uses the base model in a group convolution format and combines the outputs with a 1024-feature-map separable convolution block. The feature maps are reduced to a vector by global average pooling, which is followed by a head with 2 dense layers and a dropout layer [27] before the 5-node softmax layer [28] and the argmax operation that produces the output.
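The effect of grouping on the number of trainable weights can be illustrated with a small count. The sketch below considers a plain (non-separable) grouped convolution without bias terms, which is a simplification of the blocks used in the model, and the output width of 96 is a hypothetical value. With g groups, each group sees only C/g input channels and produces n/g feature maps, so the weight count drops by a factor of g.

```python
def conv_weights(k, c_in, c_out, groups=1):
    """Number of weights (bias excluded) in a k x k 2D convolution whose
    input and output channels are split evenly into `groups` groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    per_group = k * k * (c_in // groups) * (c_out // groups)
    return groups * per_group


# 9 input channels (three 3-channel images); 96 output maps is hypothetical.
ungrouped = conv_weights(3, 9, 96, groups=1)
grouped = conv_weights(3, 9, 96, groups=3)
print(ungrouped, grouped, ungrouped / grouped)
```

With 3 groups, each group's kernels cover exactly the 3 channels of one input image, so no kernel mixes STFT, CWT and MFCC channels until the ungrouped combining layer.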

Dataset
To evaluate our work, we have used a modified version of the MIT-BIH dataset whose preprocessing steps are described in [11]. The original MIT-BIH dataset consists of 14 classes, with the signal recorded at 360 Hz [26]. The steps in [11] group the 14 classes into 5 broad ones and reduce the sampling rate to 125 Hz. The data from lead II is used in this case. Here, every class label is associated with one heartbeat. A beat is extracted by:
1) Finding the R-peaks in the signal.
2) Calculating the mean R-R distance over 10 seconds.
3) Taking a sample from the R-peak to R-peak + mean R-R distance, zero-padded up to a length of 187.
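The beat extraction steps listed above can be sketched as follows. The sketch assumes that the R-peak positions have already been found by some detector and are passed in as sample indices; the function name and arguments are hypothetical, and the exact procedure in [11] may differ in its details.

```python
def extract_beats(window, r_peaks, beat_len=187):
    """Cut one fixed-length beat per R-peak out of a signal window.

    window  : list of samples (for example a 10 second window at 125 Hz)
    r_peaks : sorted sample indices of the R-peaks (assumed to be given)
    """
    if len(r_peaks) < 2:
        return []  # an R-R distance needs at least two peaks

    # Mean R-R distance over the window, in samples.
    rr = [b - a for a, b in zip(r_peaks, r_peaks[1:])]
    mean_rr = round(sum(rr) / len(rr))

    beats = []
    for peak in r_peaks:
        # Take the samples from the R-peak to R-peak + mean R-R distance,
        # truncated to beat_len if the segment happens to be longer.
        segment = list(window[peak:peak + mean_rr])[:beat_len]
        # Zero-pad at the end up to the fixed length.
        segment += [0.0] * (beat_len - len(segment))
        beats.append(segment)
    return beats


# Toy example: a ramp signal with peaks placed every 100 samples.
window = [float(i) for i in range(1250)]
beats = extract_beats(window, [100, 200, 300])
print(len(beats), len(beats[0]))
```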
To balance the classes and reduce the training time, we have reduced the number of samples belonging to class 0 in the training set to 10,000 from the original 87,554.
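A minimal sketch of this class reduction is given below. The text does not state how the retained class 0 samples were chosen, so uniform random selection with a fixed seed is an assumption here, and the function name is hypothetical.

```python
import random


def undersample_class(samples, labels, target_class=0, keep=10_000, seed=0):
    """Keep at most `keep` randomly chosen samples of `target_class`
    and all samples of every other class, preserving the original order."""
    rng = random.Random(seed)
    target_idx = [i for i, y in enumerate(labels) if y == target_class]
    other_idx = [i for i, y in enumerate(labels) if y != target_class]
    if len(target_idx) > keep:
        target_idx = rng.sample(target_idx, keep)
    kept = sorted(target_idx + other_idx)
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy example: 50 samples of class 0 reduced to 10, other classes untouched.
toy_labels = [0] * 50 + [1] * 5 + [2] * 3
toy_samples = list(range(len(toy_labels)))
new_samples, new_labels = undersample_class(toy_samples, toy_labels, keep=10)
print(new_labels.count(0), new_labels.count(1), new_labels.count(2))
```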

Performance Comparison
To compare our model (referred to as Spectrogram Group-Conv 2D) with existing architectures, we have used several metrics. The main metric is accuracy [29], which is the ratio of the number of correct predictions to the total number of entries in the dataset. Accuracy is given by:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP and TN are true positives and true negatives, and FP and FN are false positives and false negatives, respectively. Precision [30] is the fraction of the extracted instances that are relevant. Similarly, recall [30] is the fraction of the relevant instances in the dataset that were extracted. They are given by:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
To get a combined score from precision and recall, the F1 score [31] is calculated as the harmonic mean of the two. This is given by:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Table II compares the results of our model with existing architectures: K-nearest neighbors, 5-layer CNN [9], 34-layer CNN [10], 11-layer 1D convolution with residuals [11] and 11-layer 2D CNN [12]. Table III shows the class-wise precision and recall for the group convolution model, and Figure 4 shows its confusion matrix. The proposed model shows a good balance between the precision and recall scores, whereas some of the other models lean heavily toward one of the two.
Although the 2D CNN model trained on plain line graphs outperformed every other model in terms of F1 score, its low accuracy limits its effectiveness and makes it impractical as a reliable source for diagnosis.
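The metrics used in this comparison can be computed from a confusion matrix as sketched below. Each class is treated one-vs-rest to obtain its TP, FP and FN counts, and the overall F1 score is taken as the unweighted mean over classes; this macro averaging is an assumption of the sketch, since the text does not state which averaging was used. The confusion matrix in the example is made up.

```python
def classification_metrics(cm):
    """Accuracy plus per-class precision, recall and F1 from a confusion
    matrix where cm[i][j] counts samples of true class i predicted as j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total

    precision, recall, f1 = [], [], []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true class differs
        fn = sum(cm[c]) - tp                       # true c, predicted otherwise
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
    return accuracy, precision, recall, f1


def macro_average(values):
    """Unweighted mean over classes (assumed averaging scheme)."""
    return sum(values) / len(values)


# Made-up 2-class confusion matrix for illustration.
cm = [[8, 2],
      [1, 9]]
accuracy, precision, recall, f1 = classification_metrics(cm)
print(accuracy, precision, recall, macro_average(f1))
```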

Discussion
As the results indicate, the method in this paper achieves both of its goals: improving the accuracy and the speed of the model. We achieved an accuracy of 97.41% on the processed version of the MIT-BIH dataset. The results were better than or comparable to those of the existing architectures that we tested in the same environment. This can be attributed to the use of multiple spectrograms in the input.
In addition, the number of calculations required to classify these spectrograms has been reduced significantly by separable convolution, with group convolution also helping by reducing the number of trainable parameters in the network. Compared to the conventional CNN, the separable CNN reduces the number of multiplications required in each layer by around 7 to 8 times. At the same time, the group convolution reduces the input depth fed to each CNN layer by a factor of 3. The presence of batch normalization and dropout, along with the group convolution architecture, also helps regularize the model by preventing overfitting, and boosts the convergence rate during training.

Conclusion and Future Scope
In this paper, we have introduced a novel approach to analyzing ECG signals with machine learning to detect anomalies. The efficacy of this model was shown on the MIT-BIH dataset for classifying the beats into 5 types of arrhythmias.
The main goal of such models is not to replace doctors or technicians, but rather to assist them in diagnosis from ECG readings so that diagnosis and decision making can be accelerated. Owing to its high accuracy, light weight and low inference time, this model is suitable for deployment in an environment that receives a large number of requests. It can be improved further with a deeper architecture that still keeps the required computation within a reasonable limit. Replacing or modifying the input spectrograms can also help improve the performance. This method can also be applied to other conditions that can be found in an ECG recording, such as infarction and ischemia.

Conflict of Interest
The authors declare no conflict of interest.

Acknowledgement
This project received no financial or equipment support or funding.

Ethics and consent to participate
Not applicable.

Availability of data and materials
The raw MIT-BIH dataset used in this work is open source and available at https://physionet.org/content/mitdb/1.0.0/. The processed version of the same dataset, which has been used in this study, is also open source and available at https://www.kaggle.com/shayanfazeli/heartbeat.