An experiment was conducted using the following techniques: K-Nearest Neighbor (KNN), Naive Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), and Decision Tree (DT).
The aim of the study is to discover the most effective data mining methods for predicting heart disease on different datasets. An experiment was conducted on the heart disease datasets to find the best prediction system, applying different classification techniques to see which provide the most accurate results for heart disease prediction. For the five classification algorithms, accuracy and several other metrics were used to evaluate performance.
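The experimental setup described above can be sketched with scikit-learn. This is a minimal illustration, not the authors' actual pipeline: the synthetic data below stands in for the real Framingham/Cleveland datasets, and the default hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a heart disease dataset (the study itself
# uses the Framingham and Cleveland datasets, not loaded here).
X, y = make_classification(n_samples=500, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The five classifiers compared in the study.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Nearest Neighbor": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # held-out accuracy
    print(f"{name}: {scores[name]:.2%}")
```

Each model is trained on the same train/test split so the resulting accuracies are directly comparable, mirroring the per-classifier comparison in Tables 3 and 4.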
Accuracy: A classifier's accuracy is the number of correct predictions divided by the total number of samples. If the accuracy of the classifier is satisfactory, it can be used on future item sets for which the class label is unknown.
ROC-AUC: The ROC (Receiver Operating Characteristic) curve is a probability curve, and AUC (Area Under the Curve) measures the degree of separability. Together they show how well the model can distinguish between classes.
Recall: The ability of a model to find all the relevant samples in a dataset. Mathematically, recall is the number of correctly predicted positive records divided by the sum of the correctly predicted positive records and the positive records that were missed (true positives divided by true positives plus false negatives).
Precision: The fraction of positive-class predictions that truly belong to the positive class. Mathematically, precision is the number of correctly predicted positive records divided by the sum of the correctly predicted positive records and the false positives (true positives divided by true positives plus false positives).
F1 score: The F1-score combines a classifier's precision and recall into a single measure using their harmonic mean (HM).
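The five metrics defined above are all available in scikit-learn. The small sketch below, using made-up labels and probabilities purely for illustration, shows how each would be computed; note that ROC-AUC ranks the predicted probabilities rather than the hard class labels.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]   # ground-truth labels (illustrative)
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]   # hard class predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # predicted probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))   # correct / total
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))         # harmonic mean of P and R
print("ROC-AUC  :", roc_auc_score(y_true, y_score))   # uses probabilities, not labels
```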
Methods | Accuracy (%) | ROC-AUC | Recall | Precision | F1
Logistic Regression | 89.18 | 0.73 | 0.06 | 0.64 | 0.11
SVM | 86.14 | 0.62 | 0.02 | 0.50 | 0.04
Nearest Neighbor | 84.41 | 0.60 | 0.08 | 0.28 | 0.12
Naïve Bayes | 83.96 | 0.71 | 0.18 | 0.35 | 0.23
Decision Tree | 77.30 | 0.56 | 0.26 | 0.27 | 0.27
Table 3. Performance evaluation on the Framingham dataset.
From the above table, the LR method achieves the best accuracy (89.18%), while the SVM, KNN, NB, and DT classifiers achieve 86.14%, 84.41%, 83.96%, and 77.30%, respectively. SVM gives the next-best result after Logistic Regression at 86.14%, a difference of 3.04 percentage points, and NB also performs well after LR and SVM. The Decision Tree gives the lowest accuracy among the techniques, at 77.30%.
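The F1 values in the tables can be recovered from the reported precision and recall via the harmonic-mean formula; the check below uses the Logistic Regression row of Table 3 as an example.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Logistic Regression row of Table 3: precision 0.64, recall 0.06.
print(round(f1(0.64, 0.06), 2))  # 0.11, matching the table
```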
Methods | Accuracy (%) | ROC-AUC | Recall | Precision | F1
Logistic Regression | 90.00 | 0.94 | 0.79 | 0.94 | 0.86
Naïve Bayes | 82.00 | 0.85 | 0.82 | 0.82 | 0.82
SVM | 87.00 | 0.73 | 0.87 | 0.87 | 0.87
Nearest Neighbour | 89.00 | 0.90 | 0.88 | 0.89 | 0.88
Decision Tree | 77.00 | 0.76 | 0.77 | 0.77 | 0.77
Table 4. Performance evaluation on the Cleveland dataset.
From the above table, the LR method again achieves the best accuracy (90%), while the NB, DT, SVM, and KNN classifiers achieve 82%, 77%, 87%, and 89%, respectively. The Nearest Neighbor classifier gives the next-best result after Logistic Regression, at 89%, and the Decision Tree again gives the lowest accuracy among the techniques, at 77%.
Based on the experimental results, the two datasets, with their different attributes, perform differently under the same classification techniques. On both datasets, LR gives the best accuracy.