Machine Learning based Prediction and Diagnosis of Heart Disease using Multiple Models

DOI: https://doi.org/10.21203/rs.3.rs-2642516/v1

Abstract

Nowadays, heart disease is considered one of the main causes of sickness and death. Because most people are unaware of the type and severity of their own heart condition, heart disease has become a significant problem affecting people of all ages. Manual prediction, on the other hand, is challenging and often requires the expertise to choose the relevant approach. To resolve these issues, various machine-learning models are playing a vital role in automatic disease prediction in the medical field. In this study, we calculate and compare the accuracy of several machine learning models, namely SVM, KNN, Logistic Regression, Decision Tree, Random Forest, Gaussian Naive Bayes, AdaBoost, Extra Tree Classifier and Gradient Boosting, for the prediction of heart disease, using a UCI repository dataset for training and testing. Among all the models used, the highest accuracy of 95.08% is obtained by the Gradient Boosting model. The major aim of the paper is to identify a reliable, computationally efficient machine learning algorithm for heart disease prediction.

I. Introduction

Due to their hectic schedules, people in the contemporary world often neglect their health, and changing lifestyles make Indians more susceptible to cardiovascular issues. According to WHO data, cardiovascular disease accounts for 17.5 million deaths annually [1]. Although largely associated with older age, CVDs are also silent killers: a recent study revealed that people under the age of 40 account for 25% of heart attacks. The problem is made worse by stress, a sedentary lifestyle, conditions such as diabetes, and so on. As a result, it is important to routinely check one's health and, if required, visit a doctor [2][3].

A patient may not be under a doctor's observation for the whole 24 hours of the day. Although many devices are available, they are not always accurate in detecting cardiac problems, some are highly costly, and using them requires specialized knowledge. Machine learning, a popular technique that falls under the umbrella of artificial intelligence, enables machines to improve their performance at tasks over time: a system can recognize patterns on its own and make predictions.

To detect the disease, many classification algorithms are employed to categorize patient data [4]. During the training period, the classification model [5] is trained using information from a standard dataset, while during the testing period actual patient records are used to diagnose the occurrence of disease [6]. A medical expert is given access to the patients' health records and the outcomes of the processing, and offers emergency assistance as needed [7][8]. The Heart Failure Clinical Records dataset of the UCI Repository is utilized to determine the occurrence of heart disease in the general population [9]. Such systems usually monitor parameters [10][11] like blood pressure, blood sugar, body temperature, heart rate, and so on [12].

In this paper, we use various ML algorithms: SVM, KNN, Logistic Regression, Decision Tree, Random Forest, Gaussian Naive Bayes, AdaBoost, Extra Tree Classifier and Gradient Boosting. Section II describes the models used for prediction, Section III reviews related work, and Section IV presents the proposed methodology. Section V describes the performance analysis, Section VI evaluates the results, and Section VII discusses the conclusion and future scope.

II. Models Used

(A) Logistic regression

Logistic regression is a statistical method for developing machine learning models with a binary dependent variable. It is used to describe data and the relationship between one dependent variable and one or more independent variables, which may be interval, nominal, or ordinal. The term "logistic regression" is derived from the logistic function the method uses, also called the sigmoid function, whose output lies between zero and one.
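To make the mapping concrete, here is a minimal numerical sketch of the sigmoid decision rule; the weights, bias, and input values are purely illustrative and are not fitted parameters from this study.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.5])      # hypothetical coefficients
b = 0.1                        # hypothetical intercept
x = np.array([1.2, 0.7])       # one (scaled) patient record, illustrative values
p = sigmoid(w @ x + b)         # predicted probability of heart disease
label = int(p >= 0.5)          # thresholding at 0.5 gives the binary class
```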

(B) k-nearest neighbors (KNN) 

The k-nearest neighbors algorithm is a data categorization technique that estimates the probability that a data point belongs to one group or another depending on the groups of the data points closest to it. KNN is a supervised ML model used for resolving both classification and regression problems.

KNN is a lazy learning algorithm: it does not perform any training when the training data is provided. It carries out no computation during the training period and merely stores the data, building no model until a query is executed on the dataset. KNN is hence well suited to data mining.
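The following toy sketch (2-D points invented for illustration, not the heart-disease data) shows the "majority vote among the k closest stored points" idea described above.

```python
import numpy as np

train_X = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])   # stored training points
train_y = np.array([0, 0, 1, 1])                       # their class labels
query = np.array([1.5, 1.5])                           # new point to classify
k = 3

dists = np.linalg.norm(train_X - query, axis=1)        # distance to every stored point
nearest = np.argsort(dists)[:k]                        # indices of the k closest points
pred = np.bincount(train_y[nearest]).argmax()          # majority class among the neighbours
```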

(C) Support Vector Machine (SVM)

SVM is one of the most widely used supervised learning techniques for handling both classification and regression problems. The SVM method aims to define the best decision boundary that splits the n-dimensional space into classes, so that new data points can be categorized with ease in the future. This decision boundary is called a hyperplane. SVM selects the extreme vector points to build the hyperplane; these extreme cases are represented by the support vectors on which the method is based.
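A hedged sketch of this idea using scikit-learn's SVC on synthetic data (the paper does not state which implementation was used); with a linear kernel the learned hyperplane is determined by the support vectors exposed as support_vectors_.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=4, random_state=0)  # synthetic stand-in data
svm = SVC(kernel="linear")         # a linear kernel gives a flat separating hyperplane
svm.fit(X, y)
print(svm.support_vectors_.shape)  # the extreme points that define the decision boundary
```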

(D) Naive Bayes Classifiers 

Naive Bayes classifiers are supervised learning algorithms based on Bayes' theorem with a strong (naive) independence assumption between the features: they assume that the value of one feature is unrelated to the value of any other feature.

Gaussian Naive Bayes is a variant of Naive Bayes that works with continuous data and assumes a Gaussian (normal) distribution. When dealing with continuous data, it is common to assume that the continuous values of each class follow a Gaussian distribution.
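The per-class Gaussian likelihood can be written out directly; the class prior, means, and variances below are illustrative numbers, not statistics estimated from the dataset used in this study.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    # Likelihood of a continuous feature value under a normal distribution.
    return np.exp(-((x - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

prior = 0.5                                      # assumed P(class)
# Features are treated as independent, so per-feature likelihoods are multiplied.
likelihood = gaussian_pdf(130.0, mean=125.0, var=90.0) * gaussian_pdf(240.0, mean=250.0, var=400.0)
unnormalised_posterior = prior * likelihood      # compared across classes to pick the label
```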

(E) Decision Tree 

Decision Trees are supervised learning methods used to solve classification as well as regression problems, although they are most often chosen for classification. A DT is a tree-structured classifier in which internal nodes represent the features of the data and leaf nodes represent the classification results. A decision tree therefore has two kinds of nodes: i) decision nodes, which make the decisions and may have several branches, and ii) leaf nodes, which record the outcomes of those decisions and have no further branches.
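A small sketch on synthetic data (assuming scikit-learn, which the paper does not explicitly name) that prints the fitted tree so the decision nodes (feature tests) and leaf nodes (class outcomes) are visible.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=5, random_state=1)  # synthetic stand-in data
tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(X, y)
print(export_text(tree))  # internal nodes show feature thresholds, leaves show the predicted class
```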

(F) Random Forest 

Random Forest is a machine learning technique that uses several decision trees to make a decision. Each tree in the forest gives out a class prediction, and the class that obtains the most votes is the prediction produced by the model. The best results come when a large number of relatively uncorrelated trees collaborate as a committee.
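The committee/majority-vote behaviour can be inspected through the fitted trees; this is a sketch on synthetic data, assuming scikit-learn's RandomForestClassifier rather than the authors' exact setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=2)  # synthetic stand-in data
forest = RandomForestClassifier(n_estimators=100, random_state=2)
forest.fit(X, y)

votes = np.array([t.predict(X[:1]) for t in forest.estimators_])  # each tree casts one vote
print(votes.mean())           # fraction of trees voting for class 1
print(forest.predict(X[:1]))  # the forest returns the majority decision
```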

(G) Extra Trees Classifier 

Extra Trees Classifier is an ensemble learning method that integrates the results of multiple de-correlated decision trees gathered in a "forest" to obtain the classification result. Conceptually, it differs from Random Forest only in how the decision trees in the forest are constructed.

Every decision tree in the Extra Trees forest is built from the original training sample. At each test node, every tree is supplied with k features drawn at random from the feature set, and it must choose the feature that best splits the data according to a specified mathematical criterion. This random sampling of features produces a collection of de-correlated decision trees.
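The construction difference from Random Forest shows up directly in the constructor arguments: by default each Extra Tree is grown on the full training sample (bootstrap=False) and only a random subset of features is considered at each node. This sketch assumes scikit-learn's ExtraTreesClassifier and synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=3)  # synthetic stand-in data
extra = ExtraTreesClassifier(
    n_estimators=100,
    max_features="sqrt",   # k randomly chosen features are examined at each test node
    bootstrap=False,       # every tree sees the original training sample, not a bootstrap copy
    random_state=3,
)
extra.fit(X, y)
```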

(H) Gradient Boosting 

Boosting is a technique for turning weak learners into strong ones. In boosting, each new tree is fit on a modified version of the original data set, with the expectation that, when combined with the earlier models, the new model will produce predictions with a lower error rate. The fundamental goal of each forthcoming model is therefore to target the errors that remain after the previous models.

Gradient boosting is a method for training multiple models gradually, additively, and sequentially. The phrase "gradient boosting" arises because the target outcomes for each case are determined by the gradient of the error with respect to the current predictions. Each new model moves the ensemble in the right direction by lowering the prediction error.
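The gradual, additive nature of the ensemble can be observed with staged predictions: the training error generally falls as each new tree corrects the residual mistakes of the trees before it. This sketch assumes scikit-learn's GradientBoostingClassifier and synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=4)  # synthetic stand-in data
gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, random_state=4)
gb.fit(X, y)

# staged_predict yields the ensemble's prediction after each added tree.
for i, stage_pred in enumerate(gb.staged_predict(X), start=1):
    if i % 10 == 0:
        print(i, (stage_pred != y).mean())  # training error tends to shrink with more stages
```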

(I) AdaBoost 

AdaBoost, short for Adaptive Boosting, is a machine learning technique used in an ensemble setting. The most common base learner used with AdaBoost is a decision tree with a single level, i.e. a single split; such trees are known as decision stumps. AdaBoost first builds a model that assigns equal weights to all data points and then gives larger weights to the points that were misclassified. In the following model, the points with higher weights receive more importance, and the procedure keeps training models until the error is sufficiently reduced.
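A minimal sketch of AdaBoost with single-split decision stumps as the weak learners, assuming scikit-learn and synthetic data (recent scikit-learn releases take the estimator argument; older ones call it base_estimator).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=5)  # synthetic stand-in data
stump = DecisionTreeClassifier(max_depth=1)   # a single split = a decision stump
ada = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=5)
ada.fit(X, y)                                 # misclassified points are re-weighted at each round
```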

III. Literature Review

In this research work, the authors [13] tested a number of machine learning (ML) models with and without the Synthetic Minority Oversampling Technique (SMOTE), assessing and comparing their accuracy, precision, and recall. The stacking ensemble model using SMOTE with 10-fold cross-validation outperformed the other models, with an accuracy of 90.9%.

Using EHR data from three academic medical institutions, researchers [14] conducted a multi-site retrospective analysis of long-term changes in ejection fraction (EF) measures in patients with heart failure. Based on data from an HF registry, they report significant variation in longitudinal EF change behavior and in the baseline characteristics of the HF datasets. Several machine learning models were then built on this longitudinal data to anticipate changes in EF measures in heart failure patients. The models showed moderate to good performance, with low confidence, for predicting an EF drop. With AUCs of 83%, 87%, and 90% across the three sites, XGBoost outperformed the other machine learning models in predicting EF changes.

The hybrid framework of [15] employs a variety of ML and DL techniques to remove bias from the model and votes for the best outcome using a novel voting system. The framework has two layers: in the first layer, several machine learning models are applied to a given dataset; in the second layer, their results are combined and further categorized using the voting methods. The suggested framework achieved an accuracy of 91.8% on the CHD dataset.

The authors of [16] proposed a method based on machine learning techniques that uses support vector regression (SVR), neural networks, the M5Tree model, and multivariate adaptive regression splines to classify, predict, and enhance the accuracy of CVD diagnosis. In addition, seventeen CVD risk variables are predicted using an Adaptive Neuro-Fuzzy Inference System (ANFIS), nearest-neighbor and naive Bayes classifiers, and statistical methods. Categorical and continuous factors influencing CVD risk are predicted using mixed-data transformation and classification techniques. ANFIS achieved a training-phase prediction accuracy of 96.56%, and SVR a prediction accuracy of 91.95%.

The study in [17] classifies patient risk and predicts the possibility of heart disease using techniques such as Naive Bayes, Logistic Regression, Random Forest, and Decision Tree, comparing the performance of the models. Random Forest is the most reliable, with an accuracy of 90.16%, when compared with the other ML algorithms used.

In [18], a new approach for monitoring heart disease using the Internet of Things (IoT) and deep learning models is described. A deep learning model is employed in conjunction with a feature selection technique to improve classification performance. The suggested method uses IoT device inputs to determine the severity of the ailment, classifies patient data into categories based on the degree and kind of cardiac disease, and finally, given the supplied inputs, generates a warning or message for the patient depending on the type of heart condition. The experimental findings demonstrated a prediction accuracy of 93.23%.

Using the heart disease dataset stored in the UCI Machine Learning Repository, the study in [19] aims to improve the diagnosis of heart disease. With an accuracy of 93.3%, BO-SVM delivered the best results, followed by SSA-NN and Naive Bayes with an identical accuracy of 86.7%, and KNN and NN with an accuracy of 80%. The results show that the proposed optimization algorithm can provide a trustworthy health-monitoring system for the early diagnosis of cardiovascular disease.

The study in [20] combines a healthcare dataset with the logistic regression ML algorithm to classify patients as having or not having cardiac issues based on the information kept in their medical records. This technique yields an accuracy of 87.02%.

COMPARISON AND DISCUSSION

In this section, we compare various machine learning approaches to heart disease prediction based on year of publication, purpose, model used, and accuracy, considering recent works by well-known authors.

Table 1. COMPARISON OF WELL-KNOWN RESEARCH WORK

Ref. | Year | Purpose | Model Used | Accuracy
[13] | 2023 | Machine Learning Models for Coronary Artery Disease Long-Term Risk Prediction. | 10-fold cross-validation using the SMOTE technique | 90.9%
[14] | 2023 | A multi-site study of electronic health records with machine learning for prediction of heart failure with variations in left ventricular ejection fraction. | KNN, Logistic Regression, SVM, Random Forest, XGBoost, Decision Tree | XGBoost: 83%, 87%, and 90% across the three sites
[17] | 2022 | Prediction of Heart Disease with Machine Learning models. | Decision Tree, Naive Bayes, Logistic Regression, Random Forest | 81.97%, 85.25%, 85.25%, 90.16% (respectively)
[18] | 2022 | IoT-Enabled Monitoring of Heart Disease using DBN with Grey Wolf Optimization. | DBN | 93.23%
[19] | 2021 | Prediction of heart disease using a novel optimization approach. | KNN, NN, SVM, SSA-NN, Naive Bayes, BO-SVM | 80%, 80%, 80%, 86.7%, 86.7%, 93.3% (respectively)
[20] | 2020 | Estimating the probability of developing heart disease with a logistic regression machine learning model. | Logistic Regression | 87.02%
Proposed Model | — | Machine Learning based Diagnosis and Prediction of Heart Disease using multiple models. | KNN, Naïve Bayes, Decision Tree, Logistic Regression, AdaBoost, Extra Tree Classifier, Random Forest, SVM, Gradient Boosting | 86.88%, 88.52%, 88.52%, 91.80%, 91.80%, 91.80%, 91.80%, 91.80%, 95.08% (respectively)

IV. Proposed Methodology

The proposed study explores the aforementioned classification methods and performs a performance analysis to estimate their accuracy for heart disease. The goal of this study is to accurately determine whether a patient has heart disease. The medical expert inputs the values from the patient's health report, and this information is fed to a model that predicts the likelihood of heart disease. Fig. 7 depicts the full procedure.

1. Data Collection-

First, we collect a dataset for the prediction of heart disease and split it into training and testing data. The training dataset is used to learn the prediction model, while the testing dataset is used to evaluate it; in this project, training uses 70% of the data and testing uses the remaining 30%. The Heart Disease UCI repository dataset is used. The dataset has 76 attributes, of which the system uses 14.
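A hedged sketch of the 70/30 split described above; the file name heart.csv and the label column target are assumptions about how a local copy of the UCI data might be stored, not details given in the paper.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")       # hypothetical local copy of the UCI heart-disease data
X = df.drop(columns=["target"])     # predictor attributes (assumed column layout)
y = df["target"]                    # 1 = heart disease present, 0 = absent (assumed encoding)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)   # 70% training, 30% testing
```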

2. Feature Selection-

Feature selection is the selection of the characteristics most relevant to the prediction system, performed to improve the system's efficiency. A variety of patient factors are taken into account for the prediction, including gender, the kind of chest pain the patient has, fasting blood sugar, exercise-induced angina (exang), and serum cholesterol. A correlation matrix is used in this model for attribute selection.
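Correlation-based screening of attributes can be sketched as below; the column names follow the common UCI heart-disease encoding and are assumptions here rather than a description of the authors' exact code.

```python
import pandas as pd

df = pd.read_csv("heart.csv")   # hypothetical local copy of the dataset
corr = df.corr()                # pairwise correlation matrix of the attributes
# Rank features by the absolute strength of their correlation with the label column.
print(corr["target"].abs().sort_values(ascending=False))
```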

3. Data pre-processing-

Pre-processing of data is an important step in creating a machine learning model: results may be inaccurate if the data is not clean or not in the right format. Pre-processing converts the data to the desired format and is used to manage missing values, duplicates, and noise in the dataset. It includes tasks such as importing the dataset, splitting it, attribute scaling, and so on. To improve the model's accuracy, the data must be pre-processed.
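A minimal sketch of the pre-processing steps named above (duplicates, missing values, attribute scaling); the specific columns scaled here are illustrative assumptions based on the usual UCI attribute names.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("heart.csv")                 # hypothetical local copy of the dataset
df = df.drop_duplicates()                     # remove duplicate patient records
df = df.fillna(df.median(numeric_only=True))  # simple median imputation of missing values

continuous = ["age", "trestbps", "chol", "thalach"]              # assumed continuous attributes
df[continuous] = StandardScaler().fit_transform(df[continuous])  # zero mean, unit variance
```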

4. Data Balancing-

There are two techniques to balance an unbalanced dataset, sketched in the example after this list.

1. Under-sampling: the size of the larger class is reduced in order to balance the dataset. This approach is used when there is sufficient data.

2. Over-sampling: the size of the under-represented class is increased in order to balance the dataset. This approach is used when there is insufficient data.
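One common way to realise both options is the imbalanced-learn package; the paper does not name a balancing library, so the sketch below (on deliberately imbalanced synthetic data) is an assumption.

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic, deliberately imbalanced stand-in data (roughly 80% / 20% class split).
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)     # grow the minority class
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)  # shrink the majority class
```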

5. Disease Prediction-

Several machine learning methods, including SVM, Logistic Regression, Gaussian Naive Bayes, Decision Tree, KNN, Random Forest, Extra Tree Classifier, AdaBoost, and Gradient Boosting, are utilised for classification. After a comparative examination of the different algorithms, the one with the greatest accuracy is chosen for heart disease prediction.
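The comparative run can be sketched as a single loop that fits every candidate model on the same split and records its accuracy; synthetic data stands in for the prepared UCI records, and the hyperparameters shown are scikit-learn defaults rather than the paper's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)

# Synthetic stand-in for the prepared patient records (13 predictor attributes assumed).
X, y = make_classification(n_samples=300, n_features=13, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=7)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=7),
    "Random Forest": RandomForestClassifier(random_state=7),
    "Extra Tree Classifier": ExtraTreesClassifier(random_state=7),
    "AdaBoost": AdaBoostClassifier(random_state=7),
    "Gradient Boosting": GradientBoostingClassifier(random_state=7),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)                                   # train on the 70% split
    print(name, accuracy_score(y_te, model.predict(X_te)))  # evaluate on the 30% split
```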

V. Performance Analysis

Various machine learning algorithms, including SVM, KNN, Logistic Regression, Decision Tree, Random Forest, Gaussian Naive Bayes, AdaBoost, Extra Tree Classifier and Gradient Boosting, are utilised for classification. The Heart Disease UCI dataset has 76 different characteristics; only 14 of these are taken into account for the diagnosis of heart disease. The experiment is evaluated using the accuracy of the models.

Accuracy- The ratio of accurate predictions to total inputs in the dataset is known as accuracy.

Accuracy =\(\frac{(TN+TP)}{(TP+TN+FP+FN)}\)
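A small worked example of the formula with illustrative confusion-matrix counts (not the counts obtained in this study):

```python
TP, TN, FP, FN = 50, 60, 5, 7                 # hypothetical confusion-matrix counts
accuracy = (TP + TN) / (TP + TN + FP + FN)    # (50 + 60) / 122 ≈ 0.9016, i.e. about 90.16%
```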

The training and validation accuracy graphs of the various models are shown in the figures below.

VI. Results

According to our training and testing results, Gradient Boosting has higher accuracy than the other algorithms. Accuracy is computed from the confusion matrix of each method, which provides the counts of TP, FP, TN, and FN; from these values the accuracy is obtained. We conclude that Gradient Boosting obtained the highest accuracy of 95.08%. The comparison of the various models is presented below.

Table 2. Accuracy Comparison Table

MODEL | ACCURACY (%)
Gradient Boosting | 95.0819
Random Forest | 91.8032
Support Vector Machine | 91.8032
Extra Tree Classifier | 91.8032
Logistic Regression | 91.8032
AdaBoost | 91.8032
Decision Tree | 88.5245
Gaussian Naïve Bayes | 88.5245
K-Nearest Neighbour | 86.8852

VII. Conclusion and Future Scope

As the heart is one of the most important organs in the body and heart disease prediction is a serious concern, accuracy is one of the criteria taken into account when assessing an algorithm's performance. The accuracy of a machine learning algorithm depends on the datasets used for training and testing. High accuracy is obtained by removing unnecessary features or highly correlated variables, or by applying feature selection/optimization algorithms.

SVM, KNN, Logistic Regression, Decision Tree, Random Forest, Gaussian Naive Bayes, AdaBoost, Extra Tree Classifier and Gradient Boosting are used for classification in this paper, and the highest accuracy of 95.08% is obtained by Gradient Boosting.

Additional heart disease datasets from diverse sources, with many more features, should be considered to achieve more general predictive accuracy. The primary goal of our future research will be to develop a powerful predictive framework that fixes most of the problems mentioned in this paper.

Additionally, real-time information about the working learning model should be analyzed to standardize it, and the model should be validated through clinical correlation to ensure its consistency.

Declarations

Author Contribution Jyoti Maurya wrote and reviewed the manuscript. Shiva Prakash led the work and wrote and reviewed the manuscript.

Funding This research did not receive any financial support.

Data Availability The authors declare that they have used the publicly available ‘Heart Disease Dataset’ for the present work, which is available online at:

https://archive.ics.uci.edu/ml/datasets/Heart+Disease

Ethics approval Not applicable. 

Consent to participate Not applicable. 

Consent for publication Not applicable. 

Conflict of interest The authors declare no competing interests.

References

  1. G. Verma and S. Prakash, “Internet of Things for Healthcare: Research Challenges and Future Prospects,” in Advances in Communication and Computational Technology, 2021, pp. 1055–1067.
  2. K. Bhagchandani and D. Peter Augustine, “IoT based heart monitoring and alerting system with cloud computing and managing the traffic for an ambulance in India,” Int. J. Electr. Comput. Eng., vol. 9, no. 6, pp. 5068–5074, 2019, doi: 10.11591/ijece.v9i6.pp5068-5074.
  3. Divya B Netalkar, Gowrika G N, and Hamsa N, “Review on IoT Based Heart Rate Monitoring System,” Int. J. Adv. Res. Sci. Commun. Technol., vol. 3, no. 3, pp. 354–356, 2022, doi: 10.48175/ijarsct-3129.
  4. Sensors, vol. 22, no. 23, 2022, doi: 10.3390/s22239074.
  5. G. Verma and S. Prakash, “Pneumonia Classification using Deep Learning in Healthcare,” Int. J. Innov. Technol. Explor. Eng., vol. 9, no. 4, pp. 1715–1723, 2020, doi: 10.35940/ijitee.d1599.029420.
  6. M. Ganesan and N. Sivakumar, “IoT based heart disease prediction and diagnosis model for healthcare using machine learning models,” 2019 IEEE Int. Conf. Syst. Comput. Autom. Networking, ICSCAN 2019, pp. 1–5, 2019, doi: 10.1109/ICSCAN.2019.8878850.
  7. A. Raj, S. Prakash, J. Srivastva, and R. Gaur, “Blockchain-Based Intelligent Agreement for Healthcare System: A Review,” in International Conference on Innovative Computing and Communications, 2023, pp. 633–642.
  8. Mamta and S. Prakash, “An overview of healthcare perspective based security issues in Wireless Sensor Networks,” in 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), 2016, pp. 870–875.
  9. M. Umer, S. Sadiq, H. Karamti, W. Karamti, R. Majeed, and M. Nappi, “IoT Based Smart Monitoring of Patients’ with Acute Heart Failure,” Sensors, vol. 22, no. 7, pp. 1–18, 2022, doi: 10.3390/s22072431.
  10. G. Verma, A. P. Shahi, and S. Prakash, “A Study Towards Recent Trends, Issues and Research Challenges of Intelligent IoT Healthcare Techniques: IoMT and CIoMT,” in Proceedings of Trends in Electronics and Health Informatics, 2022, pp. 177–190.
  11. A. Raj and S. Prakash, “Smart Contract-Based Secure Decentralized Smart Healthcare System,” Int. J. Softw. Innov., vol. 11, no. 1, pp. 1–20, 2022, doi: 10.4018/ijsi.315742.
  12. R. Gaur and S. Prakash, “Detection and diagnosis of virus infection based on sensor in internet of medical things,” Int. J. Adv. Sci. Technol., vol. 29, no. 5, pp. 3726–3736, 2020. [Online]. Available: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85085048543&partnerID=40&md5=1f8cd4ee01e25a421411b9f0048010b6
  13. M. Trigka and E. Dritsas, “Long-Term Coronary Artery Disease Risk Prediction with Machine Learning Models,” Sensors, vol. 23, no. 3, p. 1193, 2023, doi: 10.3390/s23031193.
  14. P. Adekkanattu et al., “Prediction of left ventricular ejection fraction changes in heart failure patients using machine learning and electronic health records: a multi-site study,” Sci. Rep., vol. 13, no. 1, pp. 1–16, 2023, doi: 10.1038/s41598-023-27493-8.
  15. A. Menshawi, M. M. Hassan, N. Allheeib, and G. Fortino, “A Hybrid Generic Framework for Heart Problem Diagnosis Based on a Machine Learning Paradigm,” Sensors, vol. 23, no. 3, p. 1392, 2023, doi: 10.3390/s23031392.
  16. O. Taylan, A. S. Alkabaa, H. S. Alqabbaa, E. Pamukçu, and V. Leiva, “Early Prediction in Classification of Cardiovascular Diseases with Machine Learning, Neuro-Fuzzy and Statistical Methods,” Biology (Basel)., vol. 12, no. 1, p. 117, 2023, doi: 10.3390/biology12010117.
  17. A. Srivastava and A. K. Singh, “Heart Disease Prediction using Machine Learning,” 2022 2nd Int. Conf. Adv. Comput. Innov. Technol. Eng. ICACITE 2022, vol. 9, no. 04, pp. 2633–2635, 2022, doi: 10.1109/ICACITE53722.2022.9823584.
  18. U. Palani, “An IoT Enabled Heart Disease Monitoring System Using Grey Wolf Optimization and Deep Belief Network,” 2022. [Online]. Available: https://doi.org/10.21203/rs.3.rs-1058279/v1
  19. S. P. Patro, G. S. Nayak, and N. Padhy, “Heart disease prediction by using novel optimization algorithm: A supervised learning prospective,” Informatics Med. Unlocked, vol. 26, 2021, doi: 10.1016/j.imu.2021.100696.
  20. M. Saw, T. Saxena, S. Kaithwas, R. Yadav, and N. Lal, “Estimation of prediction for getting heart disease using logistic regression model of machine learning,” 2020 Int. Conf. Comput. Commun. Informatics, ICCCI 2020, pp. 20–25, 2020, doi: 10.1109/ICCCI48352.2020.9104210.