Cardiovascular disease is one of the leading causes of mortality worldwide. In 2016, an estimated 17.9 million people died prematurely from cardiovascular disease, accounting for 31% of all global deaths; heart attacks and strokes were responsible for 85% of these deaths [1]. Given this high mortality rate, cardiovascular problems should be diagnosed early to avoid long-term complications and premature cardiac death.
Electrocardiograms (ECGs) and heart sound recordings, or phonocardiograms (PCGs), are commonly used to diagnose cardiovascular disease. A PCG, which is a graphical representation of the heart sound signal, captures heart valve opening times more accurately than an ECG signal [2]. Heart sound signals therefore contain important physiological information about cardiac conditions that can be used to detect deformation of the cardiac organ as well as valve damage [3]. However, the quality of cardiac auscultation depends on the physician's skill and subjective experience [4]. Therefore, objective, automatic, computer-assisted analysis of heart sound signals is very important for the early diagnosis of cardiovascular disease [5] and can potentially prevent premature death.
Automatic heart sound classification is currently a promising research field based on signal processing and artificial intelligence approaches. It provides a reliable way to screen for and monitor cardiac diseases in a wide range of clinical settings [6], reducing the need for costly and laborious manual examinations. Several studies have proposed algorithms to detect cardiovascular disorders based on heart sound signals.
Previously, Rubin et al. reported 84% test accuracy using mel-frequency cepstral coefficients (MFCCs) and a two-dimensional convolutional neural network (2D CNN) [7]. In the preprocessing step, 3 s segments of the heart sound signal were selected. They then extracted 13 MFCC features and converted them into 2D heat-map images as input for the 2D CNN. Nogueira et al. used similar 2D heat-map images as input for a support vector machine (SVM) classifier, achieving an accuracy of 82.33% [8].
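For illustration, the sketch below computes 13 MFCCs from a heart sound segment and rescales them into an 8-bit 2D heat-map array in the spirit of this input representation; the use of librosa and the min-max scaling are our assumptions, not details reported in [7].

```python
import numpy as np
import librosa

def mfcc_heatmap(segment, sr, n_mfcc=13):
    """13 MFCCs over time, rescaled to an 8-bit 2D heat-map array.

    A sketch of the heat-map input described by Rubin et al. [7];
    the library (librosa) and the min-max scaling are assumptions.
    """
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=n_mfcc)
    scaled = (mfcc - mfcc.min()) / (mfcc.max() - mfcc.min() + 1e-10)
    return (255 * scaled).astype(np.uint8)  # shape: (n_mfcc, n_frames)
```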
Meanwhile, Xiao et al. reported a validation accuracy of 93%, verified via 10-fold cross validation, using a 1D convolutional neural network (1D CNN) [9]. The preprocessing step consisted of resampling to 2000 Hz, removing noise with a band-pass filter, and segmenting the signal with a sliding window of 3 s patches and a 1 s stride. The authors claimed that the proposed model provided extremely low parameter consumption. However, the proposed method was not evaluated on test datasets separate from the training and validation datasets.
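A minimal sketch of this preprocessing pipeline (resampling to 2000 Hz, band-pass filtering, and 3 s / 1 s sliding-window segmentation) is given below; the filter order and the 25-400 Hz pass band are illustrative assumptions, since [9] specifies only that a band-pass filter was used.

```python
import numpy as np
from scipy.signal import butter, filtfilt, resample_poly

def preprocess(signal, fs, target_fs=2000, band=(25, 400),
               patch_s=3, stride_s=1):
    """Resample, band-pass filter, and cut into overlapping patches.

    The 4th-order Butterworth filter and 25-400 Hz pass band are
    illustrative assumptions, not settings reported in [9].
    """
    # Resample to 2000 Hz (fs assumed to be an integer sampling rate)
    signal = resample_poly(signal, target_fs, fs)
    # Band-pass filtering suppresses out-of-band noise
    b, a = butter(4, band, btype="bandpass", fs=target_fs)
    signal = filtfilt(b, a, signal)
    # Sliding window: 3 s patches advanced with a 1 s stride
    patch, stride = patch_s * target_fs, stride_s * target_fs
    starts = range(0, len(signal) - patch + 1, stride)
    return np.stack([signal[s:s + patch] for s in starts])
```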
Li et al. proposed a CNN as the classification method [10]. However, the features were not extracted directly by the CNN; instead, a separate multi-feature extraction process was required. The proposed model was evaluated on 831 test recordings and achieved an accuracy of 86.8% with 5-fold cross validation. In a study by Krishnan et al., 6 s heart sound recordings were used as input to a feed-forward neural network (FNN) [11]. They verified the model with 5-fold cross validation and reported an accuracy of 85.65% on 100 test recordings.
Al-Naami et al. reported a best accuracy of 89% using high-order spectral analysis and an adaptive neuro-fuzzy inference system (ANFIS) [12]. However, they used only 1837 heart sound recordings, from folders 'a', 'b', and 'e' of the PhysioNet Challenge 2016 dataset. Khan et al. reported 91.39% accuracy using MFCCs for feature extraction and long short-term memory (LSTM) as the classifier [13]. He et al. extracted 512 features using several envelope-based feature extraction methods, such as the Hilbert envelope, homomorphic envelope, wavelet envelope, and power spectral density envelope, as input for a 1D CNN, and reported an accuracy of 87.3% [14]. Jeong et al. used 5 s heart sound recordings and removed recordings shorter than 5 s [15]. They applied the short-time Fourier transform (STFT) to move to the time-frequency domain and generated spectrogram images of the heart sound signals as input for a 2D CNN. The proposed method obtained 91% accuracy on 208 test recordings.
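As a concrete example of the STFT-based time-frequency representation used in [15], the sketch below computes a log-magnitude spectrogram from a heart sound segment; the window length and overlap are illustrative assumptions rather than reported settings.

```python
import numpy as np
from scipy.signal import stft

def log_spectrogram(segment, fs=2000, nperseg=256, noverlap=128):
    """Log-magnitude STFT spectrogram as a 2D array.

    The window length and overlap here are illustrative assumptions;
    [15] does not report these settings in this summary.
    """
    f, t, Z = stft(segment, fs=fs, nperseg=nperseg, noverlap=noverlap)
    # Log scaling compresses the dynamic range, as is common
    # when rendering spectrograms as images for a 2D CNN
    return 20 * np.log10(np.abs(Z) + 1e-10)  # (freq bins, time frames)
```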
There are several challenges in developing automated cardiac disorder detection using the PhysioNet Challenge datasets, such as the class imbalance between normal and abnormal recordings and the variation in recording length caused by different clinical recording procedures around the world. The aforementioned studies reported promising results for automated heart sound detection. However, accuracy still needs to be improved, because early diagnosis from heart sounds is critical to saving patients' lives. In addition to accuracy, researchers must also consider low parameter consumption: although deep networks provide good accuracy and can extract information directly from raw signals, their complex architectures entail high parameter consumption and long computation times.
Considering the limitations of the previous studies, we propose an algorithm that detects cardiac abnormalities from heart sound signals and that not only improves accuracy but also keeps parameter consumption low. To achieve this, we use MFCCs for feature extraction, which offer many advantages: they capture the important information contained in an audio signal and represent the data as compactly as possible without losing information. Furthermore, the proposed classifier is a simple artificial neural network (ANN) with one hidden layer that classifies heart sound signals as normal or abnormal.
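To make the proposed pipeline concrete, the sketch below pairs MFCC feature extraction with a one-hidden-layer ANN for binary classification. The time-averaging of the 13 coefficients, the hidden-layer width, and the training settings are illustrative assumptions, not the exact configuration of our method.

```python
import librosa
from tensorflow import keras

def extract_features(signal, sr, n_mfcc=13):
    # Averaging MFCCs over time yields a compact fixed-length
    # feature vector (an illustrative choice, not our exact setup)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

# Simple ANN with a single hidden layer for normal/abnormal
# classification; the hidden width (32), optimizer, and loss
# are illustrative choices.
model = keras.Sequential([
    keras.layers.Input(shape=(13,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```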