This section details the experiments conducted to measure the performance of the proposed system in detecting cardiovascular disease. We evaluated the system using accuracy, precision, recall, and F1-score. The experiments on each dataset are described below.
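For reference, the four evaluation metrics can be computed from confusion-matrix counts as sketched below. The function name `classification_metrics`, the choice of "heart patient" as the positive class, and the example counts are assumptions made for illustration; they are not taken from the experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1-score as percentages.

    tp/fp/fn/tn are confusion-matrix counts, with "positive" assumed to
    mean a heart patient.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision and recall are undefined when their denominator is zero;
    # this sketch reports 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": round(100 * accuracy, 2),
        "precision": round(100 * precision, 2),
        "recall": round(100 * recall, 2),
        "f1": round(100 * f1, 2),
    }


# Hypothetical counts, not taken from the experiments.
example = classification_metrics(tp=40, fp=5, fn=10, tn=45)
print(example)
```

A classifier that never predicts the positive class gets zero precision, recall, and F1-score under this convention, even if its accuracy is non-zero.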
3.1. Datasets
In this work, we used five popular cardiovascular disease datasets, i.e., Statlog [39], Cleveland [40], Public Health [41], Z-Alizadeh sani [42], and Framingham [43], which are publicly available in the UCI machine learning repository. The datasets differ in their attributes. Statlog includes age, gender, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, maximum heart rate, exercise-induced angina, oldpeak, the slope of the peak, number of major vessels, and thal, among others, and the Cleveland and Public Health datasets have the same attributes as Statlog. The Z-Alizadeh sani dataset has a total of 54 features arranged in four groups: demographic, symptom and examination, ECG, and laboratory and echo features. The Framingham dataset has 4238 records whose attributes belong to three groups: demographic, behavioral, and medical risk factors. This dataset provides the potential risk of coronary artery disease over a 10-year horizon. Each dataset contains data for both male and female patients. The datasets are diverse in terms of attributes, and each has features that distinguish it from the others. The details of all five datasets are given in Table I.
Table I. Details of the datasets.
Dataset | No of Observations | No of attributes | No of healthy persons | No of heart patients |
Statlog | 270 | 75 | 150 | 120 |
Cleveland | 303 | 76 | 164 | 139 |
Z-Alizadeh sani | 300 | 54 | 87 | 216 |
Framingham | 4238 | 16 | 3596 | 644 |
Public Health | 1025 | 14 | 499 | 526 |
3.2. Results on Z-Alizadeh sani Dataset
The objective of this experiment is to evaluate the effectiveness of the proposed system in detecting cardiovascular disease on the Z-Alizadeh sani dataset [42] using three feature subsets of size 6, 8, and 15. We employed the RFE feature selection technique to select, for each size, an optimal subset of features that carries the most information about cardiovascular disease, and used it to train the 11 machine learning algorithms and the proposed CNN-cardioAssistant. We conducted experiments on all three optimal subsets.
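The idea behind RFE (recursive feature elimination) can be illustrated with a minimal sketch: fit a model, drop the feature with the smallest absolute weight, and repeat until the desired number of features remains. The sketch below assumes a plain logistic regression without intercept, fitted by batch gradient descent, as the ranking model; the text does not state which estimator RFE wrapped in the experiments, and the helper names and the toy data are hypothetical.

```python
import math


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit logistic-regression weights (no intercept) by batch gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-score))
            for j in range(d):
                grad[j] += (p - yi) * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def rfe(X, y, n_select):
    """Repeatedly refit and drop the weakest feature until n_select remain."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_select:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = fit_logistic(X_sub, y)
        weakest = min(range(len(remaining)), key=lambda k: abs(w[k]))
        remaining.pop(weakest)
    return remaining


# Toy data (hypothetical): feature 0 determines the label, feature 1 agrees
# with it in 6 of 8 rows, and feature 2 is unrelated to the label.
y = [1, 1, 1, 1, 0, 0, 0, 0]
X = [
    [1, 1, 1], [1, 1, -1], [1, 1, 1], [1, -1, -1],
    [-1, -1, 1], [-1, -1, -1], [-1, -1, 1], [-1, 1, -1],
]
print(rfe(X, y, n_select=2))
print(rfe(X, y, n_select=1))
```

On this toy data the unrelated feature is eliminated first and the partially informative one second. In practice, a library implementation such as scikit-learn's `sklearn.feature_selection.RFE` would normally be used instead of a hand-written loop.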
In the first phase, we employed RFE and selected an optimal subset of six features, i.e., age, body mass index (BMI), typical chest pain (ca), triglyceride (TG), platelet (PLT), and ejection fraction (EF-TTE), to train multiple classifiers for cardiovascular disease prediction. We achieved an accuracy of 80.32%, 81.96%, 85.24%, 85.24%, 83.60%, 67.21%, 83.60%, 78.68%, 81.96%, 85.24%, 80.12%, and 88.52% on DT, LR, KNN, XGboost, MLP, GPC, AB, NB, QDA, RF, SVM, and CNN-cardioAssistant, respectively. From Table II, we can observe that the proposed CNN-cardioAssistant attained the highest accuracy among all twelve methods on this subset: 88.52%, with a precision of 100%, recall of 72.13%, and F1-score of 83.81%. KNN, XGboost, and RF tied for second best with an accuracy of 85.24%; KNN and XGboost both reached a precision of 90.90%, recall of 88.88%, and F1-score of 89.89%, while the remaining metrics of RF are listed in Table II. GPC performed the worst, with an accuracy of 67.21%, precision of 72.72%, recall of 80%, and F1-score of 76.19%. From the results on the six-feature subset, we conclude that this combination of features does not hold enough information and cannot reliably be used by cardiologists for the accurate detection of cardiovascular disease. We therefore need to include more prominent features to further enhance the performance of our system.
In the second phase, we selected another optimal subset that adds two features, fasting blood sugar (FBS) and erythrocyte sedimentation rate (ESR), to the six previously selected features. This subset of eight features comprises age, BMI, typical chest pain, FBS, TG, ESR, PLT, and EF-TTE. We achieved an accuracy of 81.96%, 81.96%, 81.96%, 93.44%, 78.68%, 60.65%, 85.24%, 80.32%, 80.32%, 86.88%, 81.30%, and 88.52% on DT, LR, KNN, XGboost, MLP, GPC, AB, NB, QDA, RF, SVM, and CNN-cardioAssistant, respectively. The best result on these eight features was obtained with RFE_XGboost: an accuracy of 93.44%, precision of 95.45%, recall of 89.36%, and F1-score of 92.30%. The proposed system performed second best, with an accuracy of 88.52%, precision of 100%, recall of 72.13%, and F1-score of 83.80%. RFE_GPC performed worst, with an accuracy of 60.65%, precision of 75%, recall of 75%, and F1-score of 75%. Compared with the six-feature results in Table II, adding the two features increased the accuracy of DT, XGboost, AB, NB, RF, and SVM, left LR and CNN-cardioAssistant unchanged, and decreased the accuracy of KNN, MLP, GPC, and QDA. The most significant improvement was for XGboost, whose accuracy rose by 8.20 percentage points.
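The effect of moving from six to eight features can be checked directly from the accuracies in Table II. The short sketch below computes the per-classifier change in percentage points; the variable names are illustrative.

```python
# Accuracies (%) on the Z-Alizadeh sani dataset, copied from Table II.
acc_six = {
    "DT": 80.32, "LR": 81.96, "KNN": 85.24, "XGboost": 85.24,
    "MLP": 83.60, "GPC": 67.21, "AB": 83.60, "NB": 78.68,
    "QDA": 81.96, "RF": 85.24, "SVM": 80.12, "CNN-cardioAssistant": 88.52,
}
acc_eight = {
    "DT": 81.96, "LR": 81.96, "KNN": 81.96, "XGboost": 93.44,
    "MLP": 78.68, "GPC": 60.65, "AB": 85.24, "NB": 80.32,
    "QDA": 80.32, "RF": 86.88, "SVM": 81.30, "CNN-cardioAssistant": 88.52,
}

# Change in percentage points, rounded to hide floating-point noise.
delta = {name: round(acc_eight[name] - acc_six[name], 2) for name in acc_six}

improved = sorted(name for name, d in delta.items() if d > 0)
unchanged = sorted(name for name, d in delta.items() if d == 0)
declined = sorted(name for name, d in delta.items() if d < 0)

print("improved:", improved)
print("unchanged:", unchanged)
print("declined:", declined)
print("largest gain:", max(delta, key=delta.get), delta[max(delta, key=delta.get)])
```

The largest gain is the 8.20-point improvement of XGboost, while four classifiers lose accuracy with the two additional features.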
In the third phase, we increased the number of input features from eight to fifteen, i.e., age, weight, BMI, blood pressure (BP), ca, FBS, TG, low-density lipoprotein (LDL), ESR, hemoglobin (HB), Na, white blood cells (WBC), Lymph, PLT, and EF-TTE. We achieved an accuracy of 85.24%, 78.68%, 67.21%, 80.32%, 72.13%, 24.59%, 81.96%, 86.88%, 81.96%, 83.60%, 78.60%, and 78.00% on DT, LR, KNN, XGboost, MLP, GPC, AB, NB, QDA, RF, SVM, and CNN-cardioAssistant, respectively. The detailed results for all three feature subsets in terms of accuracy, precision, recall, and F1-score are reported in Table II. The best result on fifteen features was obtained with RFE_NB: an accuracy of 86.88%, precision of 90.91%, recall of 90.91%, and F1-score of 90.90%. DT performed second best, with an accuracy of 85.24%, precision of 95.45%, recall of 85.71%, and F1-score of 90.30%, while GPC performed the worst, with an accuracy of 24.59% and a precision, recall, and F1-score of 0.00%. Across all methods and feature subsets, RFE_XGboost on the optimal subset of eight features performed best, so eight features is the ideal combination for this dataset. The addition of FBS and ESR to the six-feature subset played a significant role in the correct classification of healthy people and cardiovascular patients. We conclude that RFE_XGboost on eight features can reliably be used in clinics by cardiologists for the detection of cardiovascular disease.
Table II. Evaluation on Z-Alizadeh sani dataset using machine learning techniques on 6, 8 and 15 features subsets.
Algo | No of Attr | Accuracy% | Precision% | Recall% | F1-Score% | No of Attr | Accuracy% | Precision% | Recall% | F1-Score% | No of Attr | Accuracy% | Precision% | Recall% | F1-Score% |
DT | 6 | 80.32 | 90.90 | 83.33 | 86.96 | 8 | 81.96 | 95.45 | 83.33 | 89.00 | 15 | 85.24 | 95.45 | 85.71 | 90.30 |
LR | 6 | 81.96 | 90.90 | 85.10 | 87.91 | 8 | 81.96 | 93.18 | 85.41 | 89.10 | 15 | 78.68 | 93.18 | 80.39 | 86.30 |
KNN | 6 | 85.24 | 90.90 | 88.88 | 89.89 | 8 | 81.96 | 93.18 | 78.84 | 85.40 | 15 | 67.21 | 84.09 | 74.00 | 78.70 |
XGboost | 6 | 85.24 | 90.90 | 88.88 | 89.89 | 8 | 93.44 | 95.45 | 89.36 | 92.30 | 15 | 80.32 | 93.18 | 82.00 | 87.20 |
MLP | 6 | 83.60 | 97.72 | 82.69 | 89.58 | 8 | 78.68 | 93.18 | 83.67 | 88.20 | 15 | 72.13 | 100 | 72.13 | 83.80 |
GPC | 6 | 67.21 | 72.72 | 80.00 | 76.19 | 8 | 60.65 | 75.00 | 75.00 | 75.00 | 15 | 24.59 | 0.00 | 0.00 | 0.00 |
AB | 6 | 83.60 | 93.18 | 85.41 | 89.13 | 8 | 85.24 | 97.72 | 86.00 | 91.50 | 15 | 81.96 | 93.18 | 78.84 | 85.40 |
NB | 6 | 78.68 | 85.71 | 83.72 | 84.71 | 8 | 80.32 | 86.36 | 86.36 | 86.40 | 15 | 86.88 | 90.91 | 90.91 | 90.90 |
QDA | 6 | 81.96 | 86.60 | 88.37 | 87.36 | 8 | 80.32 | 86.36 | 86.36 | 86.40 | 15 | 81.96 | 90.91 | 85.11 | 87.90 |
RF | 6 | 85.24 | 93.18 | 87.23 | 93.18 | 8 | 86.88 | 97.72 | 87.75 | 92.50 | 15 | 83.60 | 95.45 | 84.00 | 89.40 |
SVM | 6 | 80.12 | 90.10 | 83.12 | 86.70 | 8 | 81.30 | 93.12 | 82.22 | 88.70 | 15 | 78.60 | 93.10 | 79.80 | 89.90 |
Proposed | 6 | 88.52 | 100 | 72.13 | 83.81 | 8 | 88.52 | 100 | 72.13 | 83.80 | 15 | 78.00 | 66.67 | 100.00 | 78.57 |
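Because the F1-score is the harmonic mean of precision and recall, the entries of Table II can be sanity-checked against one another. The sketch below recomputes the F1-score for a few rows and flags those that deviate from the reported value by more than a tolerance; the helper name `f1_from`, the selection of rows, and the tolerance of 0.1 are assumptions made for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (label, precision %, recall %, reported F1 %) copied from Table II.
rows = [
    ("Proposed, 6 features", 100.0, 72.13, 83.81),
    ("KNN, 6 features", 90.90, 88.88, 89.89),
    ("RF, 6 features", 93.18, 87.23, 93.18),
    ("XGboost, 8 features", 95.45, 89.36, 92.30),
]

TOLERANCE = 0.1  # assumed allowance for rounding in the reported values

flagged = [
    label
    for label, precision, recall, reported in rows
    if abs(f1_from(precision, recall) - reported) > TOLERANCE
]
print(flagged)
```

Among these rows only the six-feature RF entry is flagged: its precision and recall imply an F1-score of about 90.1%, so the reported 93.18% may be worth rechecking against the original results.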
3.3. Results on Framingham Dataset
This experiment is designed to evaluate how effectively distinct feature combinations allow the proposed system to detect cardiovascular disease on the Framingham dataset [43]. As before, we conducted experiments on all features of the dataset as well as on two optimal subsets of six and eight features.
In the first stage, we employed a subset of six features, i.e., age, total cholesterol level (totChol), systolic blood pressure (sysBP), diastolic blood pressure (diaBP), BMI, and blood glucose level (glucose), to train multiple classifiers independently for classification. We achieved an accuracy of 83.96%, 85.25%, 83.37%, 85.02%, 84.78%, 75.82%, 85.02%, 82.19%, 82.66%, 84.66%, 82.60%, and 77.30% on DT, LR, KNN, XGboost, MLP, GPC, AB, NB, QDA, RF, SVM, and CNN-cardioAssistant, respectively. From the results reported in Table III, we can observe that RFE_LR achieved the highest accuracy among the twelve methods on the six-feature subset: 85.25%, with a precision of 2.38%, recall of 60.00%, and F1-score of 4.60%. RFE_AB achieved an accuracy of 85.02%, precision of 7.14%, recall of 47.36%, and F1-score of 12.40%, while RFE_GPC gave the lowest performance, with an accuracy of 75.82%, precision of 16.66%, recall of 17.35%, and F1-score of 17.00%. All twelve methods performed poorly in terms of precision, recall, and F1-score. These six features do not contain enough information, so more features are needed to improve the performance of the system.
In the second stage of this experiment, we added two more features, heartrate and cigsPerDay, to the subset of six features. The second optimal subset thus comprises eight features, i.e., age, cigarettes smoked per day (cigsPerDay), totChol, sysBP, diaBP, BMI, heartrate, and glucose. We achieved an accuracy of 83.25%, 85.25%, 84.19%, 84.90%, 85.02%, 76.88%, 84.78%, 82.31%, 81.95%, 84.90%, 81.80%, and 83.24% for DT, LR, KNN, XGboost, MLP, GPC, AB, NB, QDA, RF, SVM, and CNN-cardioAssistant, respectively. The best accuracy, 85.25%, was again obtained with RFE_LR, with a precision of 2.38%, recall of 60.00%, and F1-score of 4.58%. RFE_MLP performed second best, with an accuracy of 85.02%, precision of 12.69%, recall of 48.48%, and F1-score of 20.13%, while RFE_GPC performed worst, with an accuracy of 76.88%, precision of 20.63%, recall of 21.31%, and F1-score of 20.97%. On this subset of eight features, all methods again performed poorly in terms of precision, recall, and F1-score, so this combination of features is not reliable for the prediction of cardiovascular disease. Adding heartrate and cigsPerDay raised the accuracy of CNN-cardioAssistant from 77.30% to 83.24%, while the accuracy of the other methods changed only slightly, by about one percentage point or less.
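One way to read the combination of high accuracy and low precision, recall, and F1-score is to compare against a trivial baseline. The Framingham dataset is imbalanced (3596 healthy persons versus 644 heart patients in Table I), so a classifier that always predicts "healthy" already scores a high accuracy. The sketch below computes this baseline on the full dataset; the class ratio in the actual test split may differ, and the function name is an assumption.

```python
healthy, patients = 3596, 644  # Framingham class counts from Table I


def majority_baseline_accuracy(n_majority, n_minority):
    """Accuracy (in percent) of always predicting the majority class."""
    return 100 * n_majority / (n_majority + n_minority)


baseline = majority_baseline_accuracy(healthy, patients)
print(f"{baseline:.2f}%")  # prints 84.81%
```

Accuracies in the range of 82% to 85% on the six- and eight-feature subsets are therefore close to what this baseline achieves, which is consistent with the low precision and F1-scores reported in Table III.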
Table III. Evaluation on Framingham dataset using machine learning techniques on 6, 8, and 15 features subsets.
Algo | No of Attr | Accuracy% | Precision% | Recall% | F1-Score% | No of Attr | Accuracy% | Precision% | Recall% | F1-Score% | No of Attr | Accuracy% | Precision% | Recall% | F1-Score% |
DT | 6 | 83.96 | 5.55 | 29.16 | 9.30 | 8 | 83.25 | 3.96 | 19.23 | 6.57 | 15 | 85.24 | 2.38 | 60.00 | 4.58 |
LR | 6 | 85.25 | 2.38 | 60.00 | 4.60 | 8 | 85.25 | 2.38 | 60.00 | 4.58 | 15 | 88.52 | 5.55 | 80.00 | 10.65 |
KNN | 6 | 83.37 | 7.14 | 6.76 | 6.90 | 8 | 84.19 | 9.52 | 37.50 | 15.19 | 15 | 70.49 | 9.52 | 37.50 | 15.19 |
XGboost | 6 | 85.02 | 3.17 | 44.44 | 5.90 | 8 | 84.90 | 5.55 | 43.75 | 9.85 | 15 | 86.88 | 5.55 | 60.00 | 9.60 |
MLP | 6 | 84.78 | 4.76 | 40.00 | 8.50 | 8 | 85.02 | 12.69 | 48.48 | 20.13 | 15 | 78.68 | 23.60 | 21.31 | 22.41 |
GPC | 6 | 75.82 | 16.66 | 17.35 | 17.00 | 8 | 76.88 | 20.63 | 21.31 | 20.97 | 15 | 37.70 | 20.63 | 21.31 | 20.97 |
AB | 6 | 85.02 | 7.14 | 47.36 | 12.40 | 8 | 84.78 | 8.73 | 44.00 | 14.57 | 15 | 90.16 | 21.38 | 60.00 | 15.58 |
NB | 6 | 82.19 | 23.01 | 34.94 | 27.80 | 8 | 82.31 | 24.60 | 36.04 | 29.25 | 15 | 86.88 | 2.38 | 58.00 | 6.58 |
QDA | 6 | 82.66 | 18.25 | 34.32 | 23.80 | 8 | 81.95 | 20.63 | 32.91 | 25.37 | 15 | 81.96 | 21.64 | 32.91 | 25.37 |
RF | 6 | 84.66 | 5.55 | 38.88 | 9.70 | 8 | 84.90 | 5.55 | 43.75 | 9.85 | 15 | 84.90 | 5.55 | 43.75 | 9.85 |
SVM | 6 | 82.60 | 18.10 | 34.80 | 27.50 | 8 | 81.80 | 20.50 | 32.80 | 25.20 | 15 | 85.88 | 2.38 | 57.00 | 6.60 |
Proposed | 6 | 77.30 | 18.18 | 14.29 | 16.00 | 8 | 83.24 | 28.57 | 7.14 | 11.43 | 15 | 99.86 | 100.00 | 99.11 | 99.55 |
In the third stage of this experiment, we increased the number of input features from eight to fifteen. This subset consists of gender, age, education, currentSmoker, cigsPerDay, BPMeds, prevalentStroke, prevalentHyp, diabetes, totChol, sysBP, diaBP, BMI, heartrate, and glucose. We achieved an accuracy of 85.24%, 88.52%, 70.49%, 86.88%, 78.68%, 37.70%, 90.16%, 86.88%, 81.96%, 84.90%, 85.88%, and 99.86% for DT, LR, KNN, XGboost, MLP, GPC, AB, NB, QDA, RF, SVM, and CNN-cardioAssistant, respectively. The detailed results of all methods on all three feature subsets in terms of accuracy, precision, recall, and F1-score are given in Table III. Our method, CNN-cardioAssistant, performed best among all methods, with an accuracy of 99.86%, precision of 100.00%, recall of 99.11%, and F1-score of 99.55%. RFE_AB performed second best on this subset, with an accuracy of 90.16%, precision of 21.38%, recall of 60.00%, and F1-score of 15.58%, while GPC performed the worst, with an accuracy of 37.70%, precision of 20.63%, recall of 21.31%, and F1-score of 20.97%. From the results, we conclude that increasing the number of features enhances the accuracy of the proposed system. Comparing the results for eight and fifteen features in Table III, accuracy rose for DT, LR, XGboost, AB, NB, and SVM, stayed essentially unchanged for QDA and RF, and fell for KNN, MLP, and GPC; the largest change was for CNN-cardioAssistant, whose accuracy improved significantly from 83.24% to 99.86%. From the detailed results, we conclude that CNN-cardioAssistant outperformed all the other methods. The proposed method using the combination of fifteen features can reliably be used by clinical physicians and medical specialists in hospitals, health care, and medical centers for the early and accurate prediction of cardiovascular disease.
3.4. Results on Public Health Dataset
The objective of this experiment is to evaluate the effectiveness of the proposed system in classifying healthy persons and cardiovascular disease patients on the Public Health dataset [41]. For this purpose, we designed a three-phase experiment on subsets of six, eight, and thirteen features.
In the first phase, we employed RFE to select an optimal subset of six features, i.e., age, cp, chol, thalach, oldpeak, and ca. We achieved an accuracy of 90.73%, 83.41%, 77.56%, 92.68%, 83.41%, 87.31%, 83.90%, 84.87%, 83.31%, and 98.04% on DT, LR, KNN, XGboost, MLP, AB, NB, QDA, SVM, and CNN-cardioAssistant, respectively. From Table IV, we observe that on this subset of six features, CNN-cardioAssistant performed best among all methods, with an accuracy of 98.04%, precision of 100%, recall of 96.39%, and F1-score of 98.17%. RFE_XGboost performed second best, with an accuracy of 92.68%, precision of 97.19%, recall of 89.65%, and F1-score of 93.28%, while RFE_KNN performed worst, with an accuracy of 77.56%, precision of 74.76%, recall of 80.80%, and F1-score of 77.67%. We conclude from these results that CNN-cardioAssistant performs well even on the reduced set of six features and can still be used for the correct prediction of heart disease.
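The ranking described above (best, second best, worst) follows from sorting the six-feature accuracies in Table IV, as in this small sketch. The variable names are illustrative.

```python
# Six-feature accuracies (%) on the Public Health dataset, from Table IV.
accuracy = {
    "DT": 90.73, "LR": 83.41, "KNN": 77.56, "XGboost": 92.68, "MLP": 83.41,
    "AB": 87.31, "NB": 83.90, "QDA": 84.87, "SVM": 83.31,
    "CNN-cardioAssistant": 98.04,
}

# Sort classifiers from highest to lowest accuracy.
ranking = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)
best, second, worst = ranking[0], ranking[1], ranking[-1]

print("best:", best)
print("second best:", second)
print("worst:", worst)
```

The same pattern applies to the eight- and thirteen-feature columns of Table IV.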
In the second phase, we again employed RFE, this time to select eight key features. This optimal subset keeps the six previously selected features and adds two other prominent features, trestbps and thal. We achieved an accuracy of 91.70%, 84.87%, 74.63%, 95.12%, 82.43%, 91.70%, 86.34%, 84.39%, 82.33%, and 98.04% on DT, LR, KNN, XGboost, MLP, AB, NB, QDA, SVM, and CNN-cardioAssistant, respectively. From Table IV, we can observe that CNN-cardioAssistant again outperformed all the other methods, with an accuracy of 98.04%, precision of 100%, recall of 96.40%, and F1-score of 98.17%. RFE_XGboost also improved, reaching an accuracy of 95.12%, precision of 98.13%, recall of 92.92%, and F1-score of 95.45%, while the performance of RFE_KNN degraded to an accuracy of 74.63%, precision of 73.83%, recall of 76.60%, and F1-score of 75.24%. On the subset of eight features, adding trestbps and thal slightly improved the accuracy of DT, LR, XGboost, AB, and NB and slightly reduced that of KNN, MLP, QDA, and SVM, while the accuracy of CNN-cardioAssistant remained the same. Since the accuracy of the proposed method did not increase, we conclude that both combinations of features, i.e., six and eight, are reliable for the accurate prediction of cardiovascular disease patients.
In the third phase, we increased the number of features from eight to a combination of thirteen features, i.e., age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, and thal. We achieved an accuracy of 78.68%, 85.24%, 63.93%, 85.25%, 78.68%, 90.16%, 85.24%, 83.60%, 84.25%, and 98.68% on DT, LR, KNN, XGboost, MLP, AB, NB, QDA, SVM, and CNN-cardioAssistant, respectively. The results on each subset in terms of accuracy, precision, recall, and F1-score are given in Table IV. Our proposed method, CNN-cardioAssistant, again performed the best, with an accuracy of 98.68%, precision of 100%, recall of 97.18%, and F1-score of 98.57%. RFE_AB also performed well, with an accuracy of 90.16%, precision of 88.23%, recall of 93.75%, and F1-score of 90.91%, while the performance of RFE_KNN degraded to an accuracy of 63.93%, precision of 64.70%, recall of 68.75%, and F1-score of 66.67%. The experimental results on thirteen features show a minor increase of 0.64 percentage points in the accuracy of CNN-cardioAssistant, whereas the accuracy of most other methods declined; only LR and SVM improved slightly. We conclude from the results on all three combinations of features that CNN-cardioAssistant performed the best and can reliably be used for the timely and accurate prediction of cardiovascular disease. These selected features contain the information required to make an accurate prediction of cardiovascular disease.
Table IV. Evaluation on Public Health dataset using machine learning techniques on 6, 8, and 13 features subsets.
Algo | No of Attr | Accuracy% | Precision% | Recall% | F1-Score% | No of Attr | Accuracy% | Precision% | Recall% | F1-Score% | No of Attr | Accuracy% | Precision% | Recall% | F1-Score% |
DT | 6 | 90.73 | 92.52 | 90.00 | 91.24 | 8 | 91.70 | 97.20 | 88.14 | 92.45 | 13 | 78.68 | 79.41 | 81.81 | 77.39 |
LR | 6 | 83.41 | 85.98 | 82.88 | 84.11 | 8 | 84.87 | 87.85 | 83.93 | 85.85 | 13 | 85.24 | 87.87 | 85.29 | 86.57 |
KNN | 6 | 77.56 | 74.76 | 80.80 | 77.67 | 8 | 74.63 | 73.83 | 76.60 | 75.24 | 13 | 63.93 | 64.70 | 68.75 | 66.67 |
XGboost | 6 | 92.68 | 97.19 | 89.65 | 93.28 | 8 | 95.12 | 98.13 | 92.92 | 95.45 | 13 | 85.25 | 87.87 | 85.29 | 86.57 |
MLP | 6 | 83.41 | 85.98 | 82.88 | 84.40 | 8 | 82.43 | 94.39 | 80.80 | 97.07 | 13 | 78.68 | 91.17 | 75.61 | 82.67 |
AB | 6 | 87.31 | 88.78 | 87.15 | 87.97 | 8 | 91.70 | 94.39 | 90.18 | 92.24 | 13 | 90.16 | 88.23 | 93.75 | 90.91 |
NB | 6 | 83.90 | 88.78 | 81.89 | 85.20 | 8 | 86.34 | 88.79 | 85.59 | 87.16 | 13 | 85.24 | 91.17 | 83.78 | 87.32 |
QDA | 6 | 84.87 | 91.58 | 81.66 | 86.35 | 8 | 84.39 | 88.79 | 82.61 | 85.59 | 13 | 83.60 | 85.29 | 85.29 | 85.29 |
SVM | 6 | 83.31 | 85.80 | 82.70 | 83.90 | 8 | 82.33 | 94.29 | 80.69 | 96.80 | 13 | 84.25 | 86.91 | 84.10 | 85.10 |
Proposed | 6 | 98.04 | 100 | 96.39 | 98.17 | 8 | 98.04 | 100 | 96.40 | 98.17 | 13 | 98.68 | 100 | 97.18 | 98.57 |
3.5. Performance comparison on cross datasets
This experiment is designed to evaluate the proposed system in cross-dataset settings to check its robustness and generalizability. For this purpose, we analyzed three datasets: Cleveland, Statlog, and Public Health. Because these datasets have similar attributes, we selected the thirteen attributes they share, i.e., age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, and thal, for the cross-dataset experiments. In the first scenario, we used the Cleveland dataset for training the model and the Public Health dataset for testing, and achieved the highest accuracy of 98.44% with CNN-cardioAssistant. The detailed results in terms of accuracy, precision, recall, and F1-score are reported in Table V. In the second experiment, we used the Public Health dataset for training and Cleveland for evaluation, and achieved a remarkable accuracy of 99.34%. In the third experiment, we used the Public Health dataset for training and Statlog for testing, and achieved an accuracy of 70.74%. In the fourth experiment, we reversed the roles, using Statlog for training and the Public Health dataset for evaluation, and achieved an accuracy of 64.88%. From the results reported in Table V, we observed that the performance of the proposed system degrades when training on Statlog and testing on the Public Health dataset. We analyzed both datasets and found a few issues with the data. Some values in the Statlog and Public Health datasets are either missing or occur far less often than in the other dataset; instances that occur only a few times are called rare cases. The amount of training data is small, and we cannot use augmentation for clinical data.
We investigated both datasets in depth and discovered that the training dataset (Statlog) has missing data for chest pain (cp), major vessels (ca), and thal, while the testing dataset (Public Health) has data for these features. There are four types of chest pain, i.e., typical angina, atypical angina, non-anginal pain, and asymptomatic, represented by 1, 2, 3, and 4, respectively. The ca attribute gives the number of major vessels (0–3) colored by fluoroscopy. There are three types of thal, i.e., normal, fixed defect, and reversible defect, denoted by 3, 6, and 7, respectively. An over-constrained model underfits when there is a small amount of training data, whereas an under-constrained model overfits the training data. Both cases result in poor predictions. We therefore conclude that the small amount of training data in the Statlog dataset is the main reason behind the poor prediction.
Table V. Performance results on cross datasets.
Training Dataset | Testing Dataset | Accuracy% | Precision% | Recall% | F1-Score% |
Cleveland | Public Health Dataset | 98.44 | 97.05 | 100 | 98.50 |
Public Health Dataset | Cleveland | 99.34 | 98.80 | 100 | 99.40 |
Public Health Dataset | Statlog | 70.74 | 86.60 | 56.00 | 68.02 |
Statlog | Public Health Dataset | 64.88 | 59.93 | 92.25 | 73.57 |
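The kind of data audit described in this subsection can be sketched as follows, using the encodings stated in the text (cp coded 1 to 4, ca from 0 to 3, thal coded 3, 6, or 7). The function names, the use of `None` for a missing value, and the example rows are hypothetical; they illustrate the check rather than reproduce the actual analysis.

```python
VALID_CODES = {
    "cp": {1, 2, 3, 4},   # typical, atypical, non-anginal, asymptomatic
    "ca": {0, 1, 2, 3},   # number of major vessels colored by fluoroscopy
    "thal": {3, 6, 7},    # normal, fixed defect, reversible defect
}


def audit_codes(records):
    """Count missing (None) and out-of-range values per coded feature."""
    report = {name: {"missing": 0, "invalid": 0} for name in VALID_CODES}
    for row in records:
        for name, valid in VALID_CODES.items():
            value = row.get(name)
            if value is None:
                report[name]["missing"] += 1
            elif value not in valid:
                report[name]["invalid"] += 1
    return report


def unseen_in_training(train, test):
    """Codes present in the test records but absent from the training records."""
    unseen = {}
    for name in VALID_CODES:
        train_codes = {row.get(name) for row in train} - {None}
        test_codes = {row.get(name) for row in test} - {None}
        unseen[name] = sorted(test_codes - train_codes)
    return unseen


# Hypothetical rows for illustration only; not taken from either dataset.
train_rows = [
    {"cp": 4, "ca": 0, "thal": 3},
    {"cp": None, "ca": 1, "thal": 7},
    {"cp": 2, "ca": None, "thal": None},
]
test_rows = [
    {"cp": 1, "ca": 0, "thal": 6},
    {"cp": 4, "ca": 3, "thal": 7},
]

print(audit_codes(train_rows))
print(unseen_in_training(train_rows, test_rows))
```

Codes that appear only in the test data correspond to the rare or missing cases discussed above: the model never sees them during training, which can contribute to degraded cross-dataset performance.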
3.6. Performance comparison with other methods
To show the efficiency of our method for cardiovascular disease prediction, we performed a comparative analysis of the proposed and existing state-of-the-art cardiovascular disease prediction methods; the results are shown in Table VI for the Z-Alizadeh sani [42], Public Health [41], and Framingham [43] datasets. Our method yielded the best accuracy of 93.44%, 98.68%, and 99.86% on the Z-Alizadeh sani, Public Health, and Framingham datasets, respectively. For the existing methods, we report the results as given in the compared papers. First, we performed an experiment on the Z-Alizadeh sani dataset [42], using 80% of the data for training and the remaining 20% for testing. The results of the existing state-of-the-art methods [44–48] and our method are listed in Table VI. From Table VI, we can observe that [47] performed the worst with an accuracy of 88.49%, Abdar et al. [44] performed second best with an accuracy of 93.08%, and our method performed the best with an accuracy of 93.44% on the Z-Alizadeh sani dataset. The results reported in Table VI reveal that RFE_XGboost on eight features has better accuracy and can be used reliably for the detection of cardiovascular disease.
Table VI. Performance comparison with other state-of-the-art methods on Z-Alizadeh Sani, Public health, and Framingham datasets.
Dataset | Authors | Accuracy% | Dataset | Authors | Accuracy% | Dataset | Authors | Accuracy% |
Z-Alizadeh sani Dataset | Abdar, Moloud, et al. [44] | 93.08 | Public Health Dataset | Khan, Mohammad Ayoub et al. [49] | 97.6 | Framingham Health Dataset | Khan, Mohammad Ayoub et al. [49] | 92.02 |
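Z-Alizadeh sani Dataset | Joloudari, Javad Hassannataj, et al. [45] | 91.47 | Public Health Dataset | Khan, Mohammad Ayoub et al. [49] | 84.6 | Framingham Health Dataset | Khan, Mohammad Ayoub et al. [49] | 88.3 |
Z-Alizadeh sani Dataset | Ghiasi, et al. [46] | 92.41 | Public Health Dataset | Wu CS, et al. [50] | 86 | Framingham Health Dataset | Sivaji, U., et al. [54] | 88.7 |
Z-Alizadeh sani Dataset | Khan et al. [47] | 88.49 | Public Health Dataset | Ismail, A. et al. [51] | 90.6 | Framingham Health Dataset | Al-Makhadmeh, et al. [55] | 99.03 |
Z-Alizadeh sani Dataset | Nasarian et al. [48] | 92.35 | Public Health Dataset | Nurtas, Marat, et al. [52] | 82 | Framingham Health Dataset | Nourmohammadi-Khiarak, Jalil, et al. [56] | 94.03 |
Z-Alizadeh sani Dataset | Nasarian et al. [48] | 92.85 | Public Health Dataset | Raza, K, et al. [53] | 88.88 | Framingham Health Dataset | Ali, Liaqat, et al. [57] | 89 |
Z-Alizadeh sani Dataset | In this study | 93.44 | Public Health Dataset | In this study | 98.68 | Framingham Health Dataset | In this study | 99.86 |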
Next, we performed an experiment on the Public Health dataset [41] using an 80-20 training-testing split and compared our method with existing state-of-the-art methods [49–53] in terms of accuracy, as reported in Table VI. From these results, we observed that the system proposed in [52] for the prediction of heart disease achieved an accuracy of 82%, which is 16.68 percentage points lower than our method; [49] yielded the second-best accuracy of 97.6%, while our method achieved the best accuracy of 98.68%. The results reported in Table VI reveal that our method is remarkably effective for the prediction of cardiovascular disease.
Finally, we used the Framingham dataset [43] for the prediction of cardiovascular disease, with 80% of the data for training the model and 20% for testing. We compared the results of the proposed method against contemporary methods [49, 54, 55, 56, 57] based on accuracy, as shown in Table VI. Experimental results on the Framingham dataset [43] revealed that [49] achieved the worst performance with an accuracy of 88.3%, [55] achieved an accuracy of 99.03%, and our method performed the best with an accuracy of 99.86%. These results show that the proposed system can effectively and reliably be used for the prediction of cardiovascular disease on multiple and diverse datasets.
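The 80-20 protocol used in these comparisons can be sketched as follows. The text does not state whether the split was shuffled, seeded, or stratified by class, so the shuffling, the seed value, and the helper name `split_indices` are assumptions made for illustration.

```python
import random


def split_indices(n_records, test_percent=20, seed=0):
    """Shuffle record indices and split them into train and test lists."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Integer ceiling so that the test share is never rounded down to zero.
    n_test = (n_records * test_percent + 99) // 100
    return indices[n_test:], indices[:n_test]


# 1025 is the number of Public Health records listed in Table I.
train_idx, test_idx = split_indices(1025)
print(len(train_idx), len(test_idx))  # prints 820 205
```

Fixing the seed makes the split reproducible, which matters when accuracies that differ by less than one percentage point are being compared across methods.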