Hybrid Deep Learning based approach for ECG Heartbeat Arrhythmia Classification

Background: Myocardial infarction, or heart attack, is caused by a blockage of a coronary artery, which prevents blood and oxygen from reaching the heart properly. Arrhythmias are a form of CVD characterized by irregular variations in the normal heart rhythm, such as the heart beating too quickly or too slowly; Atrial Fibrillation (AF), Premature Ventricular Contraction (PVC), Ventricular Fibrillation (VF), and Tachycardia are just a few examples. The condition worsens if not detected and treated on time, i.e., timely and proper diagnosis of arrhythmias may minimize the risk of death. Visual evaluation of ECG signals is very labor-intensive due to their small amplitude, and the analysis is subjective and can differ between experts. As a consequence, a computer-aided diagnostic tool that is more objective and reliable is needed. Methods: In the recent era, Machine Learning based approaches to detect arrhythmias have been established proficiently. In this view, we propose a hybrid Deep Learning-based model to detect three types of arrhythmias on MIT-BIH arrhythmia databases. In particular, this paper makes two-fold contributions. First, we translate 1D ECG signals into 2D scalogram images. When one-dimensional ECG signals are turned into two-dimensional ECG images, noise filtering and feature extraction are no longer necessary; this is notable since certain ECG beats are discarded by noise filtering and feature extraction. Then, based on experimental evidence, we propose combining two models, 2D-CNN-LSTM, to detect three classes: Cardiac Arrhythmia (ARR), Congestive Heart Failure (CHF), and Normal Sinus Rhythm (NSR). Results: The experimental findings indicate that the model attained 99% accuracy for "normal sinus rhythm," 100% accuracy for "cardiac arrhythmias," and 99% accuracy for "congestive heart failures," with an overall classification accuracy of 98.6%. The sensitivity and specificity were 98.33% and 98.35%, respectively. The proposed model, in particular, will aid doctors in correctly detecting arrhythmia during routine ECG screening. Conclusion: The proposed model outperformed other state-of-the-art methods and will greatly minimize the amount of intervention required by doctors.


Introduction
Cardiovascular diseases (CVDs) are the leading cause of death globally, and they manifest themselves in the form of myocardial infarction, or heart attack, which is caused by a blockage in a coronary artery, preventing blood and oxygen from reaching the heart properly. According to the WHO [1], CVD is responsible for 17.7 million deaths, or about 31% of all deaths, with 75% of these deaths occurring in low- and middle-income countries. Arrhythmias are the type of CVD that refers to uneven changes in the normal heart rhythm, i.e., the heart beating too fast or too slow. Atrial Fibrillation (AF), Premature Ventricular Contraction (PVC), Ventricular Fibrillation (VF), and Tachycardia are just a few examples of arrhythmias. Although a single-heartbeat arrhythmia may not have a significant effect on one's life, a continuous one may result in fatal complications; e.g., prolonged PVC occasionally turns into Ventricular Tachycardia (VT), and Ventricular Fibrillation can immediately lead to heart failure. If treatment is given within an hour, the risk of death is reduced. As a result, it is critical to monitor the heart rhythm regularly to manage and avoid CVDs. Practitioners use the electrocardiograph as a diagnostic tool to detect arrhythmias; it captures and interprets the electrical activity of the heart during the diagnosis, represented in the form of ECG signals [2].

ECG Signals:
The electrical activity of the heart is represented in the form of waves when an ECG machine is attached to the human body; to get an exact picture of the heart, 10 electrodes are needed for capturing 12 leads (signals). According to Zubair et al. [3], 12 ECG leads are required for a proper diagnosis; they are divided into limb leads (I, II, III, aVL, aVR, aVF) and precordial leads (V1-V6). The P, Q, R, S, T, and U waves, which arise from the structure of the heart, are shown in Figure 1 (Representation of different ECG waveforms [42]) as positive and negative deflections from the baseline that signify a particular electrical event.
• P waves: the first positive deflections, which occur when electrical activity flows from the sinus node (SA node), located in the upper portion of the right atrium and serving as the heart's natural pacemaker, to the atrioventricular node (AV node), located on the border of the right atrium and right ventricle and serving as the heart's gatekeeper. P waves represent atrial depolarization, during which the left and right atria contract while the ventricles are at rest. If the P waves are inverted or absent, the person is suffering from Junctional Arrhythmias (JA).
• QRS Complex: The QRS complex, which consists of distinct waves (Q wave, R wave, and S wave), is referred to as ventricular depolarization. Although most of the QRS waveform is derived from the larger left ventricle, the QRS reflects the simultaneous activation of the right and left ventricles. Because the ventricles contract while the atria relax during this time, atrial repolarization also occurs here but is masked by the QRS complex. Electrical activity flows from the AV node through the bundle of His to the left and right bundle branches. The QRS complex has a normal duration of 0.06-0.125 seconds, and if this value is exceeded, the beat is referred to as a PVC arrhythmia.
• T waves: Ventricular repolarization occurs after electrical activity has flowed from the left and right bundle branches to the apex of the heart, where it enters the Purkinje fibers. The ventricles relax during this phase.
• Intervals: Intervals are the time between two particular points of an ECG event, such as the PR interval, QRS duration, and QT interval.
• Segments: Segments are the stretches of baseline amplitude between two particular points on the ECG, and include the PR segment, ST segment, and TP segment.

Motivation:
Through studying ECG signals, cardiologists can understand a lot about the heart's rhythm and function. As a result, ECG analysis is a useful tool for identifying and treating a variety of cardiac conditions [4] [5]. In [6], the authors propose using a de-noising auto-encoder (DAE) to learn a suitable feature representation from raw ECG data in an unsupervised manner. They then apply a Softmax regression layer on top of the resulting hidden representation layer to construct a deep neural network (DNN). They advise the specialist to mark the most critical ECG beats in the test record and use them to update the network's weights during the interaction phase. For ECG classification, the authors suggested a 1-D convolutional neural network in [4]. The de-noising auto-encoder (DAE) was used by the authors in [7] to improve the quality of ECG signals by deleting various kinds of residual noise. Although these experiments have yielded promising results, they still have certain limitations; for example, some of the ECG beats are lost due to noise filtering and feature extraction.

Our Contribution:
In this paper, we first suggest an ECG arrhythmia classification approach based on colored scalogram ECG images, with a deep two-dimensional CNN model for automatic feature extraction and an LSTM for the classification of arrhythmias. As one-dimensional ECG signals are converted into two-dimensional ECG images, noise filtering and feature extraction are no longer needed. This is notable since certain ECG beats are discarded by noise filtering and feature extraction. The fact that ECG images can be used as input data for ECG arrhythmia recognition is also a benefit. The proposed strategy is resilient to noise because every value of the one-dimensional ECG signal is treated with the same importance during classification: after transforming the ECG signal into a two-dimensional image, the proposed CNN model discards the noise while extracting the corresponding feature map in the convolutional and pooling layers. As a result, the proposed CNN model can be used to interpret ECG signals from a wide range of ECG devices with different sampling frequencies and amplitudes, while previous studies suggested a different model for each ECG device. Furthermore, detecting ECG arrhythmia using ECG images is similar to how medical practitioners detect arrhythmia when they monitor an ECG graph from the patient, which shows a sequence of ECG images. To put it another way, the proposed scheme can be applied in a surgical robot that detects ECG signals and helps specialists classify ECG arrhythmias more precisely. The remainder of the paper is structured as follows: the relevant literature is discussed in Section 2; Section 3 describes the methodology, covering the MIT-BIH arrhythmia dataset, the preprocessing steps for filtering the data, the conversion of ECG signals into scalogram images, and the model architecture and details. Experimental results and performance evaluation are presented in Section 4. Finally, Section 5 concludes the paper.

Related Work:
Coronary heart disease and diabetes account for 1.3 to 4.6 million deaths annually in India, with an annual prevalence of 491,600 to 1.8 million [8]. Ventricular arrhythmias, one of the most common cardiac arrhythmia conditions causing irregular heartbeats, are responsible for nearly 80% of sudden cardiac deaths [9] [10]. Early detection of arrhythmia conditions from ECG signal analysis may improve the identification of risk factors for cardiac arrest. Traditionally, the study of arrhythmia diagnosis focused on the filtering of noise from electrocardiogram (ECG) signals [11] [12] [13], waveform segmentation [13] [14], and manual feature extraction [15] [16] [17] [18]. Various researchers have tried to classify arrhythmias using methodologies such as machine learning and data mining, with some using deep learning and fuzzy logic. This section reviews some of the previous research on the classification of arrhythmias. First, we summarize the literature that used machine learning algorithms such as SVM, KNN, and RF as classifiers. Sahoo et al. [19] detected the QRS complex with the help of the Discrete Wavelet Transform and Empirical Mode Decomposition (EMD) for noise removal, and a Support Vector Machine for classifying 5 different types of arrhythmias, with an accuracy of 98.39% and 99.87% sensitivity at an error rate of 0.42. Osowski et al. [20] applied higher-order statistics and Hermite coefficients for detection of the QRS complex, compared results using different algorithms (e.g., spectral power density and genetic optimization), and used a Support Vector Machine for the classification of 5 different types of arrhythmia, achieving average accuracies of 98.7%, 98.85%, and 93%. Although these models are quite accurate, manual feature extraction adds computational cost. Plawiak et al. [21] used higher-order spectra for feature extraction, PCA for dimensionality reduction, and a Support Vector Machine to identify 5 different forms of arrhythmia with 99.28% accuracy. Yang et al. [22] used higher-order statistics (HOS) and Hermite functions to manually extract features, together with a support vector machine (SVM) for disease classification. Polat and Gunes [23] suggested the least-squares Support Vector Machine (LS-SVM) for the classification of arrhythmia using the UCI (University of California, Department of Information and Computer Science) machine learning repository, and used Principal Component Analysis (PCA) to reduce the dimensionality of the features from 256 to 15. Melgani and Bazi [24] used SVM in an experimental analysis to identify five types of irregular waveforms and normal beats using a database from the Massachusetts Institute of Technology-Beth Israel Hospital (MIT-BIH). To increase the efficiency of the SVM, an optimization algorithm known as Particle Swarm Optimization (PSO) was used; it assists in fine-tuning the discriminant function, which selects the best features for SVM classifier training. They also ran a comparative analysis with other classifiers, including K-Nearest Neighbours (KNN) and the Radial Basis Function Neural Network, and found that SVM outperformed them with an accuracy of 89.72%. Dutta et al. [25] proposed an LS-SVM classifier for the classification of heartbeats such as normal beats, PVC beats, and other beats using the MIT-BIH arrhythmia database, achieving an accuracy of 96.12%. Desai et al. [26], using 48 records of the MIT-BIH arrhythmia database, suggested a classifier using SVM for classification and the Discrete Wavelet Transform (DWT) for feature representation to categorize five types of arrhythmia beats: Non-ectopic (N), Supraventricular ectopic (S), Ventricular ectopic (V), Fusion (F), and Unknown (U). Furthermore, ANOVA was used to select important characteristics for the classification of heartbeats, which achieved an accuracy of 98.49% using 10-fold cross-validation. Nasiri et al. [27] proposed an SVM classifier for the classification of arrhythmias, further optimized by a genetic algorithm that applies the best parameters for tuning the discriminant function. Besides the above literature, [28] [29] [30] also applied SVM for arrhythmia classification. Kumaraswamy et al. [31] used the MIT-BIH arrhythmia database to propose a classifier for the classification of heartbeats useful for the detection of arrhythmias, using a Random Forest classifier and the Discrete Cosine Transform (DCT) for discovering R-R intervals as features. Park et al. [32] proposed a classifier for detecting 17 different types of heartbeats that can be used to detect arrhythmias; a two-step experimental study was carried out in which the Pan-Tompkins algorithm identifies P waves and the QRS complex, and a KNN classifier performs the classification. This model was validated on the MIT-BIH arrhythmia database and achieved a sensitivity of 97.1% and a specificity of 96.9%. Jun et al. [33] used a GPU-based cloud system for high-performance arrhythmia detection: QRS detection is done with the Pan-Tompkins algorithm, and beat classification with the K-Nearest Neighbor algorithm (K-NN). They parallelized the beat classification algorithm with CUDA to run it on virtualized GPU devices on the cloud system to support high-performance beat classification.
Machine learning paradigms based on feature extraction and feature filtering are heavily influenced by feature engineering. Another idea is to include all of the data in the signals and let the learning algorithm learn and choose the features. This idea underpins deep learning, especially convolutional neural networks (CNN) and their one-dimensional equivalents [4], which were recently adopted for the classification of patient-specific heartbeats and the automated classification of arrhythmias [34]. Acharya et al. [35] proposed a 9-layer deep CNN architecture that could accurately distinguish 5 individual heartbeat types with 94.3% accuracy; a large number of imbalanced ECG signals were used to train the network. Acharya et al. [36] built a convolutional neural network architecture that can predict myocardial infarction with 95.22% accuracy. Kachuee et al. [37] utilized a deep residual CNN for classification, with t-SNE for visualization; centered on deep convolutional neural networks for heartbeat classification, it can correctly identify five distinct arrhythmias under the AAMI EC57 standard and offers better results than ALTAN et al. [38] on the same MIT-BIH arrhythmia database, with an accuracy of 95.9%. Xia et al. [39] proposed a CNN classifier for the detection of atrial fibrillation arrhythmias; to obtain a two-dimensional (2-D) matrix input suitable for deep convolutional neural networks, the short-term Fourier transform (STFT) and stationary wavelet transform (SWT) were used to evaluate ECG segments. Then, two separate deep convolutional neural network models were established for the STFT and SWT outputs, and all of these models were compared to current state-of-the-art machine learning algorithms. Because the CNN is in charge of automated feature extraction, there is no need to extract R peaks, QRS complexes, and so on; as a result, the CNN achieved 98.29% accuracy for STFT and 98.63% for SWT. Acharya et al. [40] suggested an automated CNN model for the classification of shockable and non-shockable ventricular arrhythmias, which, using 10-fold cross-validation, achieved an accuracy of 93.18%, a sensitivity of 95.32%, and a specificity of 94.04%. Savalia et al. [41] proposed a model based on a 5-layer CNN that is able to classify 5 types of arrhythmias. Although these models are quite accurate, when minimizing the loss function in backpropagation they suffer from the vanishing gradient or exploding gradient problem.
Although 1D CNNs perform well with automatic feature extraction [42] and can be more accurate than clinicians [42] [43], RNNs are effective deep learning methods where there are time dependencies and variable-length segmented data [44]. However, RNNs face certain difficulties, such as the vanishing gradient problem [45], in which the gradient (the value used to update the weights) shrinks during backpropagation until it contributes little to learning; layers that receive too small a gradient to update their weights stop learning, and when these layers do not remember, they miss what occurred earlier in longer sequences. RNNs thus have a short-term memory, which has negative consequences for prediction problems. These constraints can be overcome by using an LSTM or GRU with ReLU, which allows capturing the impact of the earliest available data; by tuning the weight values, the vanishing gradient problem can be avoided.
Shadmand et al. [15] used a block-based neural network classifier, with Hermite coefficients and temporal features extracted from the ECG signal, for effective classification of arrhythmia at an accuracy of 97%. Raj et al. [46] accurately classified 4 types of arrhythmia with an accuracy of 94.6% by using the RR interval as a threshold and a fuzzy system as the classifier. Oh et al. [11] proposed an automated computer-aided system. Based on the above-mentioned issues, our model is more convenient because, instead of using 1D ECG signals as input, 2D colored scalogram images of size 227x227x3 are used as input. Some ECG signal information may be lost during preprocessing steps such as noise filtering, but this can be prevented by converting the one-dimensional ECG signal into a two-dimensional ECG image [20]. A 2D Convolutional Neural Network is used for automatic feature extraction; additionally, dropout regularization is used for minimizing over-fitting, and an LSTM is used for classification. Table 1 presents a comparative study of the proposed model with other models in terms of feature extraction methods, methodology, accuracy, and other statistical measures. The difference between the proposed work and the state-of-the-art methods mentioned is quite promising in terms of accuracy and computational cost.

Methodology
This section includes a summary of the dataset used for training and testing the proposed model, the data cleaning and pre-processing measures, and a detailed description of the proposed model.

Dataset:
We evaluated the accuracy of our CNN-LSTM model using 162 ECG recordings from three PhysioNet databases [1].
• 96 recordings come from the MIT-BIH Cardiac Arrhythmia database [54] [55]. This archive provides beat annotation files for 29 long-term ECG recordings of congestive heart failure patients aged 34 to 79 (NYHA classes I, II, and III). Eight men and two women were among the subjects, while the gender of the remaining 21 was unknown; the original ECG recordings were digitized at 128 samples per second.
• 36 recordings come from the MIT-BIH Normal Sinus Rhythm database [54]. This archive holds 18 long-term ECG recordings of patients referred to the Arrhythmia Laboratory at Boston's Beth Israel Hospital (now the Beth Israel Deaconess Medical Center). The subjects in this database, which included 5 men aged 26 to 45 and 13 women aged 20 to 50, had no significant arrhythmias.
• 30 recordings come from the BIDMC Congestive Heart Failure database [56] [54]. This archive contains long-term ECG recordings from 15 patients with severe congestive heart failure (NYHA class 3-4): 11 men aged 22 to 71 and 4 women aged 54 to 63. Each recording lasts about 20 hours and contains two ECG signals sampled at 250 samples per second with 12-bit resolution over a 10 mV range. The original analogue recordings were made at Boston's Beth Israel Hospital (now the Beth Israel Deaconess Medical Center) using ambulatory ECG recorders with a recording bandwidth of approximately 0.1 Hz to 40 Hz.
Following [35], the database is structured as an array of two fields, Data and Labels, and every recording consists of 65536 samples. As a result, Data is a 162x65536 matrix, i.e., it comprises a total of 162 ECG signals of 65536 samples each, re-sampled at a common rate of 128 Hz, whereas Labels holds the ECG signal information: rows 1 to 96 are ARR signals, rows 97 to 126 are CHF signals, and rows 127 to 162 are NSR signals, as seen in Figure 2.
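For concreteness, the sketch below shows one way to load such a structure in Python; the file name ECGData.mat and the exact packaging are assumptions based on the description above, not a confirmed artifact of the published data.

```python
# A minimal loading sketch, assuming a MATLAB-style ECGData.mat file with
# fields "Data" (162x65536) and "Labels", as described above.
import numpy as np
from scipy.io import loadmat

mat = loadmat("ECGData.mat", simplify_cells=True)   # file name assumed
data = np.asarray(mat["ECGData"]["Data"])           # shape (162, 65536), 128 Hz
labels = np.asarray(mat["ECGData"]["Labels"])       # 'ARR', 'CHF', or 'NSR'

print(data.shape)                                   # (162, 65536)
print(np.unique(labels, return_counts=True))        # 96 ARR, 30 CHF, 36 NSR
```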

Data Preprocessing:
In this section, we apply preprocessing steps to prepare the data for training and testing, such as segmentation using a transformed datastore and the resizeData helper function, and the conversion of 1D ECG signals into 2D colored scalogram images using the Continuous Wavelet Transform (CWT).

Data Segmentation:
Deep Learning models, such as CNNs, are data-hungry models used for feature extraction and require a large amount of data for training and testing. In our analysis, we used a dataset from the PhysioNet databases that included ECG recordings from 162 patients (96 recordings from ARR, 30 from CHF, and 36 from NSR); each 65536-sample recording was segmented into chunks of 500 samples, of which 10 chunks were kept and the rest discarded. To make the classes equally proportional, we used 30 recordings from the NSR database, 30 recordings from the CHF database, and 30 recordings from the ARR database. As a result, there are 900 segments in all, with 750 segments chosen for training and 250 for testing. Table 2 describes the information of all arrhythmia databases from PhysioNet. Remark 1: floor(65536/500) yields 131 chunks of 500 samples, out of which 10 chunks are used from each recording.
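A small sketch of this segmentation step is given below; which 10 of the available 131 chunks were retained is not specified in the text, so this sketch simply keeps the first ten.

```python
import numpy as np

def segment_recording(signal, chunk_len=500, n_chunks=10):
    """Split one recording into non-overlapping 500-sample chunks and keep
    n_chunks of them, discarding the rest (floor(65536/500) = 131 chunks)."""
    n_total = len(signal) // chunk_len           # 131 for a 65536-sample record
    chunks = signal[: n_total * chunk_len].reshape(n_total, chunk_len)
    return chunks[:n_chunks]                     # shape (10, 500); choice assumed

rec = np.random.randn(65536)                     # stand-in for one ECG recording
segments = segment_recording(rec)                # (10, 500)
```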

Image Conversion:
The majority of previous work has used 1D ECG signals to train models, which contain substantial noise [57], baseline wander [58], and artifacts. Such approaches need a large number of preprocessing steps to filter the signal and extract features, which can compromise data integrity and model accuracy. Thus, in this study, 1D ECG signals are converted into 2D colored scalogram images with a resolution of 227x227x3 using the Continuous Wavelet Transform, and these images are used as the model input. As seen in Figure 3, 900 scalogram images are used, with 750 images for training and 250 for testing; after data acquisition and conversion, these inputs are passed through the CNN-LSTM architecture for the classification of arrhythmias. The complete procedure for image conversion using the CWT is explained in Section 3.2.3.

Continuous Wavelet Transformation:
The ECG signals are raw vectors of dimension d, while the pretrained CNN accepts as inputs RGB representations of dimension (x, y, 3). As a consequence, a transformation must first be applied to translate the signals into images. Since it is effective for analyzing ECG signals [59] [60], this study considers the CWT as a candidate solution. The CWT essentially allows the signals to be mapped into a time-scale domain, and it makes the frequency components in the studied signals more visible. Remark 2: in (x, y, 3), x represents the number of rows, y represents the number of columns, and 3 represents the number of (RGB) color channels, where 1 would represent a greyscale image.
A wavelet is a waveform of limited duration that has an average value of zero, a property defined through equation (1):

∫ ϕ(t) dt = 0    (1)

The Continuous Wavelet Transform (CWT) [61] uses a wavelet, represented by ϕ, to measure the similarity between a signal and an analyzing function. The CWT is the sum of the signal multiplied by scaled and shifted versions of the wavelet function ϕ. The function ϕ(t) is called the mother wavelet, and the functions ϕ_{p,q}(t) = (1/√p) ϕ((t − q)/p) are called daughter wavelets; they are obtained by stretching and compressing the mother wavelet at various scales and positions. Comparing the signal against them yields CWT coefficients C(p, q) over a region R, where p is used for scaling, q is used for positioning on the x axis, * denotes the complex conjugate, and t represents the time variable. The Morlet wavelet is the most commonly used wavelet and is the product of a complex sine wave and a Gaussian window, represented by equation (2) and seen in Figure 4:

ϕ(t) = e^{i2πft} e^{−t²/(2σ²)}    (2)

where i represents the imaginary operator (i = √−1), f represents frequency in Hz, and t represents time in seconds; t should be centered at t = 0 to prevent adding a phase shift, e.g., by specifying t from −2 to +2 seconds. Sigma is the width of the Gaussian, determined by equation (3):

σ = n / (2πf)    (3)

The time-frequency precision trade-off is defined by the parameter n, also known as the "number of cycles." Typical values of n for neurophysiology data such as EEG, MEG, and LFP vary from 2 to 15 for frequencies upward of 2 Hz. We can now find the CWT coefficients of our signal f(t) by equation (4):

C(p, q) = ⟨f, ϕ_{p,q}⟩ = ∫_R f(t) ϕ*_{p,q}(t) dt    (4)

Here ⟨f, ϕ_{p,q}⟩ is the inner product, as shown in Figure 5. We can find the different coefficients, whose magnitudes are placed on the y axis of the scalogram, by stretching or compressing the wave with the help of p and placing it at different positions on the time axis with the help of q. At lower scales, frequencies are high because the wave is less stretched; as we move to higher scales, the frequency decreases as the wave is stretched further. By analyzing the coefficients, more variation can be seen at lower scales, where the frequency is very high, so one is able to capture the rapidly varying details of the signal f(t); at higher scales, where the wave is stretched, less variation of the coefficients can be seen, capturing the slowly varying details of the signal.
In our study, we used 2D scalogram images as input for training and validation of our model, and the CWT is used to convert the 1D ECG signals into 2D scalograms. We used the analytic Morlet wavelet (i.e., 'amor'), which has a one-sided spectrum and is complex-valued in the time domain; 12 bandpass filters per octave are used for the CWT, as shown in Figure 6.
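The snippet below sketches this conversion in Python using PyWavelets. Note that the 'amor' (analytic Morlet) filter bank with 12 voices per octave comes from MATLAB, while PyWavelets provides the real-valued Morlet 'morl', so the wavelet and scale grid here are stand-ins rather than an exact reproduction.

```python
import numpy as np
import pywt
import matplotlib.pyplot as plt

def ecg_to_scalogram(segment, fs=128, out_path="scalogram.png"):
    """Convert a 1D ECG segment into a colored scalogram image via the CWT."""
    scales = 2.0 ** (np.arange(4, 8 * 12 + 4) / 12.0)     # ~12 voices/octave (assumed)
    coeffs, _ = pywt.cwt(segment, scales, "morl", sampling_period=1.0 / fs)
    fig = plt.figure(frameon=False)
    fig.set_size_inches(2.27, 2.27)                       # 227x227 px at dpi=100
    ax = fig.add_axes([0.0, 0.0, 1.0, 1.0])
    ax.set_axis_off()
    ax.imshow(np.abs(coeffs), aspect="auto", cmap="jet")  # |C(p, q)| magnitudes
    fig.savefig(out_path, dpi=100)
    plt.close(fig)
```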

Model Architecture and Details:
There are several methods to automatically analyze ECG, including machine learning [71]. Deep learning algorithms are more feasible because feature engineering (e.g., extraction of the QRS complex) is done automatically; here, the mapping of ECG input signals to types of arrhythmia is done in an end-to-end manner. In this study, we explain the complete architectural details that help in identifying 3 types of arrhythmias (i.e., ARR, CHF, NSR). Overall, the structure is a combination of a 2D-CNN and an LSTM. A description of the complete layers of the proposed model is shown in Table 3, where layers 1 to 17 (2D-CNN) are used for feature engineering and layers 18 to 27 (LSTM) are used for the classification of the 3 types of arrhythmias; the complete architecture of the proposed model is presented in Figure 7.
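As a concrete illustration, a minimal Keras sketch of a 2D-CNN feeding an LSTM is shown below; the text fixes the 227x227x3 input, the 3x3/stride-2 max pooling, the 50% dropout, and the 3-class softmax, while the filter counts, the number of blocks, and the LSTM width are assumptions rather than a reproduction of Table 3.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(num_classes=3):
    """A rough CNN-LSTM sketch: convolutional blocks for feature engineering,
    then an LSTM head for classification (ARR / CHF / NSR)."""
    return models.Sequential([
        layers.Input(shape=(227, 227, 3)),          # scalogram images
        layers.Conv2D(32, 3, activation="relu"),
        layers.BatchNormalization(),                # applied after activation
        layers.MaxPooling2D(pool_size=3, strides=2),
        layers.Conv2D(64, 3, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=3, strides=2),
        layers.Conv2D(192, 3, activation="relu"),   # 192 channels as in the text
        layers.MaxPooling2D(pool_size=3, strides=2),
        layers.Reshape((-1, 192)),                  # rows of the map as timesteps
        layers.LSTM(128),                           # width assumed
        layers.Dropout(0.5),                        # 50% dropout
        layers.Dense(num_classes, activation="softmax"),
    ])
```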

Convolutional Neural Network
A Convolutional Neural Network (CNN) [73] is a kind of neural network used for enhancing an image or fetching useful information out of it (e.g., image classification) by taking the 2D grid features of an image; it can also find temporal features of time-series data by taking 1D grid samples at different time intervals. The basic structure of a CNN is illustrated in [20].
The primary operations involved in a CNN are convolution, non-linearity, max pooling, and classification. In this analysis, the CNN is in charge of extracting temporal features, while the LSTM excels at capturing the characteristics of time-series data and performing classification. The CNN consists of the following layers (a small numerical sketch follows this list):
• Convolutional Layer: The convolutional layer is in charge of generating feature maps from 2D filters strung together. The number of samples that the filter matrix slides over the input matrix is referred to as the stride. Remark 3: when the stride is 1, the filter shifts over one sample; when it is 2, the filter shifts over two samples.
• ReLU Activation Function: Activation is a critically important function of a neural network; activation functions determine whether the information a neuron receives is sufficient or can be overlooked. We use ReLU as the activation mechanism in this study. ReLU is a non-linear activation function that does not activate all neurons at the same time: neurons are deactivated only when their values are less than zero, as explained through equation (5):

y = max(0, WX + b)    (5)

where X are the inputs, W represents the weights, b represents the bias, and y represents the output value.
• Batch Normalization: a critical layer, since it normalizes the output of the previous layer and serves as a regularizer to prevent the model from overfitting. As layers are made deeper in deep learning, a minor parameter shift in an earlier layer may have a significant effect on the input distribution of subsequent layers; this phenomenon is called internal covariate shift. To minimize this internal covariate shift, batch normalization has been proposed, in which the mean and variance of input batches are estimated, normalized, and then scaled and shifted. Batch normalization is typically applied between the convolution layer and the activation function; however, in some situations it is best to put the batch normalization layer after the activation function, as we have done in our case.
• Max Pooling: used for dimensionality reduction or downsampling of input matrices. Max pooling is achieved by applying a maximum filter to non-overlapping sub-regions and selecting the maximum value from each patch; for example, the maximum value in the first patch of Figure 8 is 6. We used 3x3 max filters with stride 2 in this work.
• Flattening: this layer is responsible for transforming a two-dimensional matrix into a one-dimensional vector that can be fed into a fully connected layer, as seen in Figure 9 (Conversion of a 2D matrix into a 1D vector).
• Softmax Layer: used to measure the probability distribution over n separate classes in multi-class classification. This function computes a probability, ranging from 0 to 1, for each target class among the total target classes; the probabilities are then used to decide which target class is the most likely for the given inputs, as shown in Figure 10 (Softmax function for multi-class classification), where the x axis represents inputs and the y axis represents probabilities.
• Classification Layer: used for classifying the different categories.
• Dropout Regularization: generally, a network suffers from over-fitting during model training [69]. Dropout regularization addresses this problem by discarding some of the nodes and reducing the dependencies between them. We used 50% dropout before the last fully connected layer in our model.
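The toy sketch below, referenced at the start of this list, illustrates equation (5), the 3x3/stride-2 max pooling, and flattening on a small random feature map; it is purely illustrative and not part of the model code.

```python
import numpy as np

def relu(x):
    # Equation (5): the output is max(0, x), applied elementwise.
    return np.maximum(0, x)

def max_pool2d(x, k=3, s=2):
    # Max pooling: slide a k x k window with stride s, keep each patch's maximum.
    h = (x.shape[0] - k) // s + 1
    w = (x.shape[1] - k) // s + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i * s:i * s + k, j * s:j * s + k].max()
    return out

fmap = relu(np.random.randn(7, 7))   # toy feature map after the non-linearity
pooled = max_pool2d(fmap)            # (3, 3) downsampled map
flat = pooled.reshape(-1)            # flattened 1D vector for the dense layer
```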

Long Short Term Memory
In this part, we clarify the function of the recurrent neural network and address its shortcoming, which can be overcome by the LSTM, whose full architecture we then discuss. A traditional Artificial Neural Network, whose output does not depend on previous inputs, is used only for classification and regression purposes [75] [76]. A conventional recurrent neural network, by contrast, has outputs that depend on previous values: the inputs p = (p_1, p_2, p_3, ..., p_T) are fed to the network, the hidden layers are h = (h_1, h_2, h_3, ..., h_T), and q = (q_1, q_2, q_3, ..., q_T) are the output sequences, where t is the timestamp ranging over t = 1, 2, ..., T; they are computed through equations (6) and (7), as shown in Figure 11:

h_t = H(W_ph p_t + W_hh h_{t−1} + b_h)    (6)
q_t = W_hq h_t + b_q    (7)

Here the W are weight matrices (e.g., W_ph is the weight matrix between the input and hidden vectors, and W_hq the matrix between the hidden and output vectors); the b are bias vectors (e.g., b_h is the bias vector for the hidden state and b_q for the output state); and H is the hidden-layer activation function, such as the sigmoid or hyperbolic tangent function. Hence, the next hidden state h_t is calculated by passing the sum of the weighted input W_ph p_t, the weighted previous hidden state W_hh h_{t−1}, and the hidden-state bias b_h through an activation function whose values range from 0 to 1; the output values of the RNN are then calculated from the weighted hidden state plus the output bias. The main shortcoming of the conventional Recurrent Neural Network (RNN) is that, in the backpropagation steps used to minimize the loss function, the gradient (the value used to update the weights) shrinks until it contributes little to learning; layers that receive such a small gradient to update their weights stop learning. This is called the vanishing gradient problem [53]: because these layers do not remember, they can miss what occurred earlier in longer sequences and therefore have a short-term memory, which has detrimental effects on prediction problems and restricts the network from learning long-term dependencies. The LSTM is a special kind of RNN that uses cell-state memory rather than simple neurons and is one of the neural networks designed to deal with this issue, as shown in Fig. 13. The LSTM architecture [77] consists of the following components: a forget gate, an input gate, and an output gate. These gates can retain or erase information in the cell-state memory over arbitrary time intervals, as explained through the following functions [78]:
• Forget Gate: either retains or discards information, as explained through equation (8):

f_t = σ(W_xf x_t + W_hf h_{t−1} + W_cf c_{t−1} + b_f)    (8)

where f_t is the forget gate vector at timestamp t; W_xf is the weight vector between the input and the forget gate; x_t is the current input at timestamp t; W_hf is the weight vector between the hidden state and the forget gate; h_{t−1} is the previous hidden state at time t − 1; W_cf is the weight vector between the cell state and the forget gate; c_{t−1} is the previous cell state; and b_f is the bias vector of the forget gate.
When the sum of these terms is passed through the sigmoid function, the forget gate produces a value in the range 0 to 1 that determines how much information is passed on and how much is discarded.
• Input Gate: transfers the current values and previous values to the sigmoid activation function, which allows the cell-state memory to be refreshed only in proportion to a value between 0 and 1, as explained through equation (9):

i_t = σ(W_xi x_t + W_hi h_{t−1} + W_ci c_{t−1} + b_i)    (9)

where i_t is the input gate vector at timestamp t; W_xi is the weight vector of the input values; x_t is the input at timestamp t; W_hi is the weight vector between the input gate and the hidden state; h_{t−1} is the previous hidden state at time t − 1; W_ci is the weight vector between the input gate and the cell-state memory; c_{t−1} is the cell state at t − 1; and b_i is the bias vector of the input gate. When the sum of these terms is passed through the sigmoid function σ(.), producing a value in the range 0 to 1, the input gate refreshes the cell-state memory accordingly.
• Cell State: sets the current cell state; it multiplies the forget vector with the previous cell state, dropping values when multiplied by almost 0, and applies the output of the input gate to alter the value of the cell state, as explained through equation (10):

c_t = f_t · c_{t−1} + i_t · tanh(W_xc x_t + W_hc h_{t−1} + b_c)    (10)

where c_t is the cell-state vector at timestamp t; f_t is the forget gate vector; c_{t−1} is the cell state at timestamp t − 1; i_t is the input gate vector; W_xc is the weight vector between the input vector and the cell state; x_t is the input vector at timestamp t; W_hc is the weight vector between the hidden state and the cell state; h_{t−1} is the previous hidden state at timestamp t − 1; and b_c is the bias vector of the current cell state.
• Output Gate: for the sake of prediction, it determines the next hidden state, as explained through equation (11):

o_t = σ(W_xo x_t + W_ho h_{t−1} + W_co c_{t−1} + b_o)    (11)

where W_xo is the weight vector between the input vector and the output gate; x_t is the input vector at timestamp t; W_ho is the weight vector between the hidden state and the output gate; h_{t−1} is the previous hidden state at timestamp t − 1; W_co is the weight vector between the cell state and the output gate; c_{t−1} is the cell state at timestamp t − 1; and b_o is the bias vector of the output gate. Finally, the next hidden state is calculated by applying the hyperbolic tangent activation function to the current cell-state memory and taking the elementwise product with the output gate vector, as explained through equation (12):

h_t = o_t · tanh(c_t)    (12)

Remark 4: the sigmoid activation function is a logistic function whose value ranges from 0 to 1; it is mainly used for binary classification, with formula σ(z) = 1/(1 + e^{−z}). However, tanh is a hyperbolic activation function whose value ranges from −1 to 1, which provides a probability-like distribution over the input vector for multi-class settings. The complete layer details are listed in Table 3 and visualized in Figure 7.
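To make equations (8)-(12) concrete, the sketch below implements a single LSTM step in NumPy; for brevity it drops the peephole terms (W_cf, W_ci, W_co) that appear above, so it is the common simplified variant rather than the exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step over equations (8)-(12), without peephole connections.
    W maps the concatenated [x_t, h_prev] to each gate's pre-activation."""
    z = np.concatenate([x_t, h_prev])
    f_t = sigmoid(W["f"] @ z + b["f"])                        # forget gate, eq. (8)
    i_t = sigmoid(W["i"] @ z + b["i"])                        # input gate, eq. (9)
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ z + b["c"])   # cell state, eq. (10)
    o_t = sigmoid(W["o"] @ z + b["o"])                        # output gate, eq. (11)
    h_t = o_t * np.tanh(c_t)                                  # hidden state, eq. (12)
    return h_t, c_t

# Toy dimensions: input size 4, hidden size 3.
rng = np.random.default_rng(0)
W = {g: rng.standard_normal((3, 7)) for g in "fico"}
b = {g: np.zeros(3) for g in "fico"}
h, c = lstm_step(rng.standard_normal(4), np.zeros(3), np.zeros(3), W, b)
```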
Here we feed each image I_i as input to the CNN and generate a CNN feature representation Z_i ∈ R^D, seen in equation (13):

Z_i = CNN(I_i)    (13)

These input features are passed through a convolutional layer, which is responsible for creating a feature map by striding filters of size f × f over the input; the hidden layer takes the input z_i and maps it to another representation Z_i, as explained in equation (14):

Z_i = W * z_i + b    (14)

where W specifies the weights and b represents the bias. We introduce non-linearity into the feature Z_i by using the ReLU (Rectified Linear Unit) function, whose output range is [0, ∞): it does not activate all neurons at the same time, as it deactivates neurons whose values are less than zero, and all values whose scores are less than zero are discarded, as shown in equation (15):

Z_i = max(0, Z_i)    (15)

Then Z_i is passed through cross-channel normalization with 5 channels per element, which regularizes the features and helps prevent over-fitting. The next hidden state performs max pooling, used for downsampling of the feature map, with a matrix of size [3 * 3], stride [2 * 2], and padding [0 0 0 0], selecting max(Z_i). The downsampled feature maps are passed through several further convolutional and pooling layers, and the matrix is transformed into [3 * 3 * 192]. The features are then flattened and encoded as [192 * 192] 1D vectors, which are fed into the LSTM at layer 17. The LSTM is a special type of RNN that uses cell-state memory instead of basic neurons and manages the sequence classification. Next, we apply 50% dropout; dropout is a regularization approach that reduces the problem of overfitting in neural networks by removing some random nodes during the training process, improving the generalization error. Eventually, these values are transferred into a classification layer that functions in the same manner as an ANN. Finally, the output passes into the softmax activation function, which is responsible for the multi-class classification and measures the probability distribution over n different classes. This function computes a probability, between 0 and 1, for each target class among the total target classes; these probabilities then determine which target class is most likely, classifying the three different types of arrhythmias.
Optimizer Function: The cost function describes the difference between the target for a given training sample and the predicted output, and is a measure of how well the neural network is trained. The optimizer function is used to reduce the cost function. Cost functions come in a variety of shapes and sizes, but deep learning usually employs the cross-entropy function, mathematically explained by equation (16):

L = −(1/n) Σ_i p_i log(q_i)    (16)

where n represents the batch size, p is the expected value, and q is the predicted value.
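A direct NumPy rendering of equation (16) is sketched below, assuming one-hot targets p and softmax outputs q.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Equation (16): mean cross-entropy over a batch of n samples;
    p holds one-hot expected labels, q the predicted softmax outputs."""
    q = np.clip(q, eps, 1.0)                     # guard against log(0)
    return -np.mean(np.sum(p * np.log(q), axis=1))

p = np.array([[1.0, 0.0, 0.0]])                  # true class: ARR (toy example)
q = np.array([[0.8, 0.1, 0.1]])                  # model output
print(cross_entropy(p, q))                       # ~0.223
```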
A gradient-descent-based optimizer function with a learning rate is used to minimize the cost function. Adam [79], Adagrad [80], and Adadelta [81] are some of the most well-known optimizer functions. Although the final output difference between these functions was not significant, we discovered that the optimum point was reached earliest when Adam was used. As a result, we used the Adam optimizer in our CNN-LSTM model, starting with a learning rate of 1e−4 and exponentially decaying the learning rate every 1,000 decay steps with a decay rate of 0.95.
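In Keras terms, the stated schedule could be expressed as below; whether the decay was applied in a staircase fashion is not stated, so that flag is an assumption.

```python
import tensorflow as tf

# Adam with the stated schedule: start at 1e-4, decay by 0.95 every 1,000 steps.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,
    decay_steps=1000,
    decay_rate=0.95,
    staircase=True)                                  # staircase decay assumed
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

model = build_cnn_lstm()                             # from the earlier sketch
model.compile(optimizer=optimizer,
              loss="categorical_crossentropy",       # the cost of eq. (16)
              metrics=["accuracy"])
```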

Experimental results and performance evaluation
In this section, the results of training and validation of the proposed model are introduced and examined. We trained our algorithm on a workstation with an Intel Xeon 2.40 GHz processor and 24 GB RAM, with training options MiniBatchSize of 50, MaxEpochs of 50, InitialLearnRate of 1e−4, LearnRateDropPeriod of 3, and GradientThreshold of 1, with the Adam optimizer used for the optimization. To measure the aggregate performance of our proposed model [82], we utilized three metrics, Accuracy, Sensitivity, and Specificity, shown in Table 4. Sensitivity measures the capacity of the model to correctly detect the positive cases studied, whereas Specificity is the ability of the model to correctly identify the negative samples, and Precision defines the proportion of patients accurately identified by the model. The confusion matrix derived from the testing dataset indicates 99.5% accuracy for "normal sinus rhythm," 100% for "cardiac arrhythmia," and 99% for "congestive heart failure"; the average classification accuracy of our proposed model is 98.7%. The sensitivity and specificity were 99.33% and 98.35%, respectively. Figures 13 and 14 represent the progress of the training and validation accuracy and the progress of the training and validation loss, respectively. As the graphs show, after 50 epochs the training and validation losses stabilized at a value very close to zero, while the training and validation accuracy stabilized at 98.76%. Such results are very encouraging, as they show a good percentage of accuracy in the classification of the three types of arrhythmia described in Table 5 (overall accuracy, TPR, and TNR), with the accuracy also represented through the confusion matrix shown in Figure 15.
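For reference, the sketch below derives the three reported metrics from a multi-class confusion matrix in the usual one-vs-rest fashion; the matrix values here are placeholders, not the paper's results.

```python
import numpy as np

def per_class_metrics(cm):
    """Accuracy, sensitivity (TPR), and specificity (TNR) per class from a
    confusion matrix with rows = true class and columns = predicted class."""
    total = cm.sum()
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp
    fp = cm.sum(axis=0) - tp
    tn = total - tp - fn - fp
    accuracy = (tp + tn) / total                 # per-class accuracy
    sensitivity = tp / (tp + fn)                 # ability to detect positives
    specificity = tn / (tn + fp)                 # ability to reject negatives
    return accuracy, sensitivity, specificity

cm = np.array([[82, 1, 0],                       # placeholder counts (ARR row)
               [1, 83, 0],                       # CHF row
               [0, 1, 82]])                      # NSR row
print(per_class_metrics(cm))
```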

Conclusion
Arrhythmia classification is a most important subject in healthcare; an arrhythmia is a rhythm or heart-rate irregularity. In this article's proposed approach for the automated study of cardiac arrhythmias, a deep 2D-CNN-LSTM model is used. This approach has the following attractive properties as compared to conventional methods: (1) it employs the CWT to transform 1D ECG signals into 2D colored scalogram images, making them ideal inputs for this network; (2) it passes the data through a 2D-CNN-LSTM model, which uses the CNN for feature engineering and the LSTM for classification, and has been trained on a large set of labeled images. Experiments on three ECG cross-databases (obtained under a variety of acquisition conditions) demonstrated its effectiveness and potential to achieve better classification results than other methods. The testing dataset's confusion matrix demonstrated 99% accuracy for "normal sinus rhythm," 100% accuracy for "cardiac arrhythmias," and 98% accuracy for "congestive heart failures." Additionally, the average classification accuracy is 99%, with sensitivity and specificity of 98.33% and 98.35%, respectively. The proposed model will aid doctors in accurately diagnosing arrhythmia during routine ECG screening, and according to preliminary results on the MIT-BIH database, our methodology's overall efficiency is better than that of other methods. However, the heavy computing burden caused by the use of the CWT is a drawback. We have also not yet achieved a fully inter-subject evaluation, despite the fact that doing so would greatly minimize the amount of intervention required by doctors; we will keep working on it, and a robust arrhythmia classification algorithm is needed to address these issues. In the future, (1) we expect to test the proposed methodology on a range of ECG arrhythmia databases, and (2) to combine several pre-trained CNN models to generate more robust feature representations.
ReLU: Activation is a critically important function of a neural network; activation functions determine whether the information a neuron receives is sufficient or can be overlooked. We use ReLU as the activation mechanism in this study. ReLU is a non-linear activation function that does not activate all neurons at the same time: neurons are deactivated only when their values are less than zero.
Batch Normalization: a critical layer used to normalize the output of the previous layer; it is often used as a regularizer to prevent overfitting of the model.
Softmax Function: used for multi-class classification; it measures the probability distribution over n different classes. This function computes a probability, between 0 and 1, for each target class among the total target classes; the probabilities then help decide which target class is most likely.
Sigmoid Activation Function: a logistic function whose value ranges from 0 to 1; it is mainly used for binary classification.
Tanh Activation Function: similar to the sigmoid activation function, but its values range from −1 to 1.
Dropout: a regularization technique used to control the overfitting problem.
Learning Rate: its value ranges from 0 to 1; it defines how fast the model adapts to the problem and how much the weights are modified along the loss gradient.
Hidden Units: the number of perceptrons used in the neural network; the appropriate value depends largely on the activation function.
Bias: bias modifies the activation function by adding a constant (i.e., a given bias) to the input; it is analogous to the constant term in a linear function.