This study's objective was to develop a risk level prediction system for cardiovascular diseases using data mining techniques. For this study, data were collected from a total of 4004 patient records from Jimma University, Black Lion and Alert Specialized Hospitals. The data were then coded in Excel and converted into a format the Weka software understands: a comma-delimited (.CSV) data file. The data were then loaded into Weka for preprocessing, which included removing redundancies, filling missing values and correcting values that did not match their attributes. The first-scenario experiments were then run on all data and all attributes using classifier algorithms. After this scenario, attribute selection was performed on the preprocessed data, and 11 of the 31 attributes were selected. Using these 11 attributes, the second scenario was performed with different classification algorithms, namely Naïve Bayes, PART and J48. Finally, the unpruned J48 decision tree algorithm was used to construct the model.
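The loading and preprocessing steps described above can also be reproduced programmatically. The following is a minimal sketch using the Weka Java API (the study itself used the Weka tool); the file name cvd_records.csv and the assumption that the class attribute (the risk level) is the last column are illustrative, not details taken from the study.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class LoadAndPreprocess {
    public static void main(String[] args) throws Exception {
        // Load the comma-delimited (.csv) file exported from Excel.
        Instances data = DataSource.read("cvd_records.csv");   // assumed file name

        // The class attribute (risk level: Low/Medium/High) is assumed to be the last column.
        data.setClassIndex(data.numAttributes() - 1);

        // Replace missing values with each attribute's mean/mode,
        // mirroring the "fill missing values" preprocessing step.
        ReplaceMissingValues fillMissing = new ReplaceMissingValues();
        fillMissing.setInputFormat(data);
        Instances cleaned = Filter.useFilter(data, fillMissing);

        System.out.println("Instances: " + cleaned.numInstances()
                + ", attributes: " + cleaned.numAttributes());
    }
}
```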
Experiment setup
Many experiments were done on these data for preprocessing, attribute selection and model development. Information gain, the CFS subset evaluator and information gain ratio were used for attribute selection, while Naïve Bayes, PART and J48 were used for classification. Following common practice, 10-fold cross-validation was employed: 10 folds is widely recommended as giving the best estimate of error, there is some theoretical evidence backing this up, and 10-fold cross-validation has become the standard method in practice. In addition, to compare the algorithms, the researchers used the confusion matrix to calculate precision, recall and F-Measure. The F-measure (also known as F1 or F-score) is a measure of a test's accuracy that considers both precision and recall; it is the harmonic mean of the two, computed as 2 × (precision × recall) / (precision + recall), where 1 is its best value and 0 is its worst. The F-Measure is high only when precision and recall are both high, which makes it a useful single summary metric. Moreover, the researchers used a 70% percentage split test. The details of the experiments are discussed as follows.
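As an illustration of this evaluation protocol, the sketch below runs 10-fold cross-validation and a 70%/30% percentage split with the Weka Java API and reports accuracy, the weighted precision/recall/F-Measure and the confusion matrix. The data file name, the choice of J48 as the example classifier and the random seed are assumptions for illustration only.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateClassifier {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("cvd_records.csv");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        J48 classifier = new J48();                             // any Weka classifier could be used here

        // 10-fold cross-validation.
        Evaluation cv = new Evaluation(data);
        cv.crossValidateModel(classifier, data, 10, new Random(1));
        System.out.printf("10-fold CV: %.4f%% correct, weighted F-Measure %.3f%n",
                cv.pctCorrect(), cv.weightedFMeasure());
        System.out.println(cv.toMatrixString("Confusion matrix (10-fold CV)"));

        // 70% / 30% percentage split.
        Instances shuffled = new Instances(data);
        shuffled.randomize(new Random(1));
        int trainSize = (int) Math.round(shuffled.numInstances() * 0.7);
        Instances train = new Instances(shuffled, 0, trainSize);
        Instances test  = new Instances(shuffled, trainSize, shuffled.numInstances() - trainSize);

        classifier.buildClassifier(train);
        Evaluation split = new Evaluation(train);
        split.evaluateModel(classifier, test);
        System.out.printf("70%% split: %.4f%% correct, precision %.3f, recall %.3f%n",
                split.pctCorrect(), split.weightedPrecision(), split.weightedRecall());
    }
}
```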
Experiment
This sub-section is the backbone of the study; the results are presented and discussed here. To achieve the study's objective, different experiments were done using data mining algorithms. The PART rule induction, J48 and Naïve Bayes classifiers showed good performance compared to the other data mining classifiers, so their experiments are presented and discussed below, grouped into two broad scenarios: with all training data and attributes, and with the selected attributes.
Model Building Using Naïve Bayes classifier with all the training data
The Naïve Bayes classifier uses estimator classes; the precision levels of its numeric estimators are determined by analyzing the training data. With this configuration, 2831 instances (70.7043%) were classified correctly and 1173 (29.2957%) incorrectly. This is confirmed by the confusion matrix in Table 2, where the diagonal cells (correct predictions) sum to 2831, the correctly classified instances, and the off-diagonal cells sum to 1173, the incorrectly classified instances. In addition, Naïve Bayes produced an F-Measure of 0.706, which was relatively poor compared to the other two algorithms.
Table 2
Naïve Bayes Confusion Matrix
Predicted Low | Predicted High | Predicted Medium | Actual class |
2099 | 55 | 315 | Low |
75 | 239 | 371 | High |
287 | 70 | 493 | Medium |
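A minimal sketch of this Naïve Bayes experiment using the Weka Java API is shown below; the data file name is an assumption, and the numeric-estimator behaviour described above is NaiveBayes' default (a kernel estimator can optionally be enabled).

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesAllAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("cvd_records.csv");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        // nb.setUseKernelEstimator(true);   // optional: kernel estimator for numeric attributes

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));

        System.out.printf("Correctly classified:   %.0f (%.4f%%)%n", eval.correct(), eval.pctCorrect());
        System.out.printf("Incorrectly classified: %.0f (%.4f%%)%n", eval.incorrect(), eval.pctIncorrect());
        System.out.printf("Weighted F-Measure:     %.3f%n", eval.weightedFMeasure());
        System.out.println(eval.toMatrixString("Naive Bayes confusion matrix"));
    }
}
```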
Model Building Using PART classifier
The experiment was conducted using the PART classifier. With this classifier, two scenarios were applied: pruned PART rule induction and unpruned PART rule induction.
In the first scenario, pruned PART rule induction was run on the 4004 instances with all 31 attributes. It took 2.48 seconds to build the model and generated 154 rules, as presented in Table 3 below. The model built with pruned PART rule induction and all attributes correctly classified (predicted the correct outcome for) 3441 instances (85.9391%), while 563 instances (14.0609%) were classified incorrectly, and it produced an F-Measure of 0.857. Therefore, pruned PART rule induction showed good performance.
In the second scenario, unpruned PART rule induction was run with all data and attributes. It took 17.11 seconds to build the model and generated 763 rules. It correctly classified 3577 instances (89.3357%) and incorrectly classified 427 instances (10.6643%), with an F-Measure of 0.893. As a result, unpruned PART rule induction showed the best performance compared to Naïve Bayes and pruned PART rule induction.
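The two PART scenarios can be sketched with the Weka Java API as follows; setUnpruned(true) switches from the pruned to the unpruned decision list, and measureNumRules reports the number of rules generated. The data file name is an assumption.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.PART;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PartExperiments {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("cvd_records.csv");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        for (boolean unpruned : new boolean[] { false, true }) {
            PART part = new PART();
            part.setUnpruned(unpruned);                          // scenario 1: pruned, scenario 2: unpruned

            long start = System.currentTimeMillis();
            part.buildClassifier(data);                          // model-building time on the full training set
            double seconds = (System.currentTimeMillis() - start) / 1000.0;

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(part, data, 10, new Random(1));   // evaluates fresh copies per fold

            System.out.printf("%s PART: %.0f rules, %.2f s to build, %.4f%% correct, F-Measure %.3f%n",
                    unpruned ? "Unpruned" : "Pruned",
                    part.measureNumRules(), seconds,
                    eval.pctCorrect(), eval.weightedFMeasure());
        }
    }
}
```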
Model Building Using J48 classifier
This classifier was also used in two experiments: a J48 pruned decision tree with all attributes and a J48 unpruned decision tree with all attributes. J48 is another popular data mining algorithm. The J48 pruned decision tree was used in the first experiment. This algorithm produced a tree with 251 leaves and a size of 394, and took 0.41 seconds to build the model. It correctly classified 3432 instances (85.7143%) and incorrectly classified 572 instances (14.2857%), and the F-Measure it generated was 0.856.
In the second experiment, the J48 unpruned decision tree generated a tree with 952 leaves and a size of 1416, and took 0.35 seconds to build the model. It correctly classified about 3580 instances (89.4106%) and incorrectly classified 424 instances (10.5894%), and produced an F-Measure of 0.893. In conclusion, based on the experiments with all data and attributes, the J48 unpruned decision tree showed the best performance. The summary is depicted in Table 3 below.
Table 3
Summary of the performance of all algorithms used to build models
Type of classifier | TP Rate | FP Rate | Precision | Recall | F-Measure |
Naïve Bayes with all attributes | 0.707 | 0.198 | 0.727 | 0.706 | 0.706 |
Pruned PART rule induction with all attributes | 0.859 | 0.125 | 0.857 | 0.859 | 0.857 |
Unpruned PART rule induction with all attributes | 0.887 | 0.098 | 0.887 | 0.887 | 0.886 |
J48 pruned decision tree with all attributes | 0.857 | 0.117 | 0.857 | 0.857 | 0.856 |
J48 unpruned decision tree with all attributes | 0.894 | 0.096 | 0.893 | 0.894 | 0.893 |
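The two J48 experiments summarized above can be sketched with the Weka Java API as follows; measureNumLeaves and measureTreeSize expose the leaf count and tree size quoted above, and printing the built classifier yields the decision tree whose rules are excerpted later. The data file name is an assumption.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Experiments {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("cvd_records.csv");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        for (boolean unpruned : new boolean[] { false, true }) {
            J48 tree = new J48();
            tree.setUnpruned(unpruned);                          // pruned vs unpruned decision tree

            tree.buildClassifier(data);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));

            System.out.printf("%s J48: %.0f leaves, tree size %.0f, %.4f%% correct, F-Measure %.3f%n",
                    unpruned ? "Unpruned" : "Pruned",
                    tree.measureNumLeaves(), tree.measureTreeSize(),
                    eval.pctCorrect(), eval.weightedFMeasure());

            // System.out.println(tree);   // prints the full decision tree / rules
        }
    }
}
```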
With selected attributes
Attribute selection
Attribute selection is essential for selecting the necessary attributes and removing insignificant ones in order to develop a clear and good model. To select significant attributes, WEKA, the data mining tool, provides different techniques such as the correlation-based feature selection (CFS) subset evaluator, classifier attribute eval, classifier subset evaluator, correlation attribute eval, gain ratio attribute eval and information gain attribute eval. Experiments were done with the algorithms that the WEKA tool supports, but the correlation-based feature selection (CFS) subset evaluator, gain ratio attribute eval and information gain attribute eval produced excellent performance compared to the rest. The experiments with these algorithms are briefly described below.
CfsSubsetEval: Correlation-based Feature Selection (CFS) Subset Evaluator
CFS Subset Evaluator evaluates the worth of a subset of attributes by considering each feature's individual predictive ability and the degree of redundancy between them. As a result of this technique, only seven attributes were selected from 31 attributes, i.e. age, smoking, family history of coronary heart disease < 60 years, hypertension, systolic blood pressure, diabetes and total cholesterol attributes. The experiment is shown in Fig. 1 below.
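A sketch of this experiment with the Weka Java API, assuming the same training file, is given below; BestFirst is WEKA's default search method for the CFS subset evaluator.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CfsSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("cvd_records.csv");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());             // merit = predictive ability vs. redundancy
        selector.setSearch(new BestFirst());
        selector.SelectAttributes(data);

        // Print the names of the attributes kept (the class attribute is always included).
        for (int index : selector.selectedAttributes()) {
            System.out.println(data.attribute(index).name());
        }
    }
}
```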
Information gain
Information gain evaluates the worth of an attribute by measuring the information gained with respect to the class. It ranks all attributes by their information gain: the attribute with the highest information gain is at the top of the list and the one with the lowest is at the bottom; the best possible value is one (1) and the worst is zero (0). Attributes were then selected by average information gain: only those with greater-than-average information gain were identified as good attributes for model development. Accordingly, age, hypertension, systolic blood pressure, blood pressure treatment, diastolic blood pressure, smoking and gender were selected, as each produced an information gain greater than the average of 0.04426, as depicted in Fig. 2 below.
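The ranking-and-threshold procedure described above can be sketched as follows; the average information gain (0.04426 in this study) is computed from the ranking, and only attributes above it are kept. The data file name is an assumption.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InfoGainSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("cvd_records.csv");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(new Ranker());                       // ranks every attribute by information gain
        selector.SelectAttributes(data);

        double[][] ranked = selector.rankedAttributes();        // each row: [attribute index, information gain]
        double average = 0;
        for (double[] entry : ranked) average += entry[1];
        average /= ranked.length;

        System.out.printf("Average information gain: %.5f%n", average);
        for (double[] entry : ranked) {
            if (entry[1] > average) {                           // keep only above-average attributes
                System.out.printf("%-35s %.5f%n", data.attribute((int) entry[0]).name(), entry[1]);
            }
        }
    }
}
```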
Information gain ratio
Information gain ratio evaluates the worth of an attribute by measuring the gain ratio with respect to the class. Using the information gain ratio algorithm for attribute selection, smoking, hypertension, blood pressure treatment, excessive drinking of alcohol, age, diabetes, systolic blood pressure, diastolic blood pressure, exercise, chronic obstructive pulmonary disease and gender were selected. This indicates that, of the 31 risk factors of cardiovascular diseases, only 11 attributes were selected by the information gain ratio technique (Fig. 3).
In conclusion, after the three experiments, several attributes were identified by all three algorithms, and the information gain ratio covered almost all of the attributes selected by both information gain and the CFS subset evaluator while including more attributes than either of them. Based on these results, the information gain ratio was selected as the best algorithm for attribute selection. Finally, the model was developed using smoking, hypertension, blood pressure treatment, excessive drinking of alcohol, age, diabetes, systolic blood pressure, diastolic blood pressure, exercise, chronic obstructive pulmonary disease and gender.
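The final selection with the information gain ratio, and the reduction of the data set to the 11 chosen attributes, can be sketched as below; reduceDimensionality applies the selection directly, and the reduced data can then be fed to the classifiers reported in the next sub-section. The file name is an assumption.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GainRatioSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("cvd_records.csv");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new GainRatioAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(11);                              // keep the 11 best-ranked attributes
        selector.setSearch(ranker);
        selector.SelectAttributes(data);

        // Drop all attributes except the 11 selected ones (plus the class attribute).
        Instances reduced = selector.reduceDimensionality(data);
        System.out.println("Remaining predictor attributes: " + (reduced.numAttributes() - 1));
    }
}
```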
Experiments with selected attributes
The algorithms were run on the complete training set containing 4004 instances with the 11 selected attributes, using the Naïve Bayes, pruned PART rule induction, unpruned PART rule induction, pruned J48 decision tree and unpruned J48 decision tree classifiers.
Model Building Using Naïve Bayes Classifier with selected attributes
In the second scenario, the algorithm was run on the complete training set containing 4004 instances with the 11 selected attributes. It took 0.02 seconds to build the model, which correctly classified 2897 instances (72.3526%) and incorrectly classified 1107 instances (27.6474%), with an F-Measure of 0.708.
Model Building Using PART rule induction
Using pruned PART rule induction, 182 rules were generated and it took 0.63 seconds to build the model, which correctly classified 3478 instances (86.8631%) and incorrectly classified 526 instances (13.1369%), with an F-Measure of 0.867.
In addition, the data were tested with unpruned PART rule induction, which generated 349 rules and took 1.2 seconds to build the model; it correctly classified 3457 instances (86.3387%) and incorrectly classified 547 instances (13.6613%), with an F-Measure of 0.861.
Therefore, of the two PART rule induction variants, pruned PART rule induction yielded the better performance; it outperformed not only unpruned PART rule induction but also Naïve Bayes.
Model Building Using J48 decision tree Classifier
The J48 decision tree classifier is another data mining algorithm for developing a model, and it was likewise used to determine the most appropriate model for these data. It has two variants: the pruned J48 decision tree classifier and the unpruned J48 decision tree classifier. With the first variant, the pruned J48 decision tree classifier, the experiment produced a tree with 131 leaves and a size of 261 in 0.11 seconds, correctly classified 3385 instances (84.5405%) and incorrectly classified 619 instances (15.4595%), with an F-Measure of 0.843. With the second variant, the unpruned J48 decision tree classifier, a tree with 344 leaves and a size of 687 was built in 0.1 seconds; it correctly classified 3517 instances (87.8372%) and incorrectly classified 487 instances (12.1628%), with an F-Measure of 0.877. In conclusion, the unpruned J48 decision tree classifier was the best algorithm for developing the expected model. Its confusion matrix is presented in Table 4 below, and a comparison of the Naïve Bayes, PART rule induction and J48 decision tree models with selected attributes is shown in Table 5.
Table 4
Confusion matrix of the J48 unpruned decision tree with selected attributes
Predicted Low | Predicted High | Predicted Medium | Actual class |
2348 | 32 | 89 | Low |
59 | 540 | 86 | High |
174 | 47 | 629 | Medium |
Table 5
Comparison of Naïve Bayes, PART rule induction and J48 decision tree models with selected attributes
Type of classifier | TP Rate | FP Rate | Precision | Recall | F-Measure |
Naïve Bayes with selected attributes | 0.724 | 0.228 | 0.701 | 0.724 | 0.708 |
Pruned PART rule induction with selected attributes | 0.869 | 0.121 | 0.866 | 0.869 | 0.867 |
Unpruned PART rule induction with selected attributes | 0.863 | 0.119 | 0.860 | 0.863 | 0.861 |
J48 pruned decision tree with selected attributes | 0.845 | 0.130 | 0.842 | 0.845 | 0.843 |
J48 unpruned decision tree with selected attributes | 0.878 | 0.109 | 0.876 | 0.878 | 0.877 |
Sample rules generated by J48 unpruned decision tree
The following rules were retrieved for prototype system development from a total of 344 rules generated by the J48 unpruned decision tree method.
Age <= 47
| Hypertension = N
| | Smoking = N
| | | Chronic obstructive pulmonary disease = N
| | | | Diabetes = N
| | | | | Systolic Blood Pressure <= 130
| | | | | | Age <= 25: Low (1055.48/42.56)
| | | | | | Age > 25
| | | | | | | Age <= 26
| | | | | | | | Diastolic Blood Pressure <= 75: Low (14.99)
| | | | | | | Age > 20
| | | | | | | | Systolic Blood Pressure <= 165
| | | | | | | | | Systolic Blood Pressure <= 135: Medium (2.01/0.01)
| | | Age > 31
| | | | Exercise = N
| | | | | Chronic obstructive pulmonary disease (COPD) = N
| | | | | | Gender = F: Low (2.02)
| Hypertension = Y
| | Systolic Blood Pressure <= 135
| | | Diastolic Blood Pressure <= 85
| | | | Smoking = N
| | | | | Exercise = N
| | | | | | Age <= 33: Low (22.23/0.06)
| | | | | | Blood pressure treatment = Y
| | | | | | | Gender = F
| | | | | | | | Systolic Blood Pressure <= 175
| | | | | | | | | Systolic Blood Pressure <= 150
| | | | | | | | | | Systolic Blood Pressure <= 140.11
| | | | | | | | | | | Diastolic Blood Pressure <= 90
| | | | | | | | | | | | Age <= 31: Medium (2.0)
| | | | | | | | | Age > 29
| | | | | | | | | | Systolic Blood Pressure <= 165
| | | | | | | | | | | Systolic Blood Pressure <= 140.11: High (4.0/2.0)
| | | Smoking = Y
| | | | Age <= 34
| | | | | Diastolic Blood Pressure > 90: High (2.0/1.0)
Age > 47
| Systolic Blood Pressure <= 140.11
| | Smoking = N
| | | Age <= 58
| | | | Diabetes = N
| | | | | Diastolic Blood Pressure <= 90
| | | | | | Chronic obstructive pulmonary disease (COPD) = N
| | | | | | | Systolic Blood Pressure <= 135
| | | | | | | | Age <= 50: Low (157.4/17.0)
| | | | Diabetes = Y
| | | | | Gender = F
| | | | | | Diastolic Blood Pressure <= 75: High (3.0/1.0)
| | | Age > 58
| | | | Gender = F
| | | | | Diastolic Blood Pressure <= 50: Low (10.03)
| | | | | Diastolic Blood Pressure > 50
| | | | | | Hypertension = N
| | | | | | | Diabetes = N
| | | | | | | | Systolic Blood Pressure <= 115
| | | | | | | | | Exercise = N
| | | | | | | | | | Diastolic Blood Pressure <= 85
| | | | | | | | | | | Diastolic Blood Pressure <= 65: Low (38.05/12.0)
Decision support system for risk level prediction for cardiovascular diseases
Using the rules/patterns extracted by the data mining classifier that produced the best results, namely the unpruned J48 decision tree, a user interface was developed (the screenshot is depicted in Fig. 4 below) that can be used as a decision support system for risk level prediction for cardiovascular diseases.
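A minimal sketch of how such a decision support prototype can use the trained model is given below, assuming the unpruned J48 model has been serialized (for example with SerializationHelper.write) to a file named cvd_j48.model and that the attribute names match those in the training data; the file names, attribute names (Age, Hypertension, Smoking, etc.) and example values are hypothetical.

```java
import weka.classifiers.Classifier;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class RiskLevelPredictor {
    public static void main(String[] args) throws Exception {
        // The structure of the training data is needed to build compatible instances.
        Instances header = DataSource.read("cvd_records.csv");          // assumed file name
        header.setClassIndex(header.numAttributes() - 1);

        // Load the serialized unpruned J48 model built earlier,
        // e.g. saved with SerializationHelper.write("cvd_j48.model", tree).
        Classifier model = (Classifier) SerializationHelper.read("cvd_j48.model");

        // Fill in one patient's risk factors; unspecified attributes remain missing,
        // which J48 handles. Attribute names and values are hypothetical.
        Instance patient = new DenseInstance(header.numAttributes());
        patient.setDataset(header);
        patient.setValue(header.attribute("Age"), 52);
        patient.setValue(header.attribute("Systolic Blood Pressure"), 138);
        patient.setValue(header.attribute("Diastolic Blood Pressure"), 88);
        patient.setValue(header.attribute("Hypertension"), "Y");
        patient.setValue(header.attribute("Smoking"), "N");

        double prediction = model.classifyInstance(patient);
        System.out.println("Predicted risk level: " + header.classAttribute().value((int) prediction));
    }
}
```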