A comparison of seven distinct machine-learning algorithms was conducted in this study: Decision Tree Classifier, Random Forest Classifier, Naive Bayes Classifier, Gradient Boosting Classifier, Logistic Regression Classifier, K-Nearest Neighbor, and Support Vector Machine were applied to thyroid disease prediction. First, we collected and preprocessed the data, and then fed the data to the models for training. Various performance criteria, including accuracy, precision, recall, and F1 score, were compared to establish whether any algorithm is superior to the others. We divided our dataset into three variants: the first considering all attributes, the second containing the 14 attributes chosen by the feature importance method, and the third containing the 14 attributes chosen by univariate feature selection. Attributes were narrowed down based on their correlation with the target, calculated with the feature importance and univariate feature selection methods. The results of the individual experiments are explained in the following parts of this analysis.
3.1 Descriptive Statistics of the Dataset
Exploratory data analysis (EDA) is a form of data analysis that employs data visualization to evaluate and investigate data sets and describe their key properties [34, 35]. EDA is mostly used to examine what the data might reveal outside of formal modeling or hypothesis-testing tasks, and to better understand a data set's variables and their interactions. It can also help us figure out whether the statistical methods we are contemplating for the analysis are appropriate. Our dataset has 28 attributes, only six of which are numeric, so we give short descriptive statistics for these in Table 2. All of the attributes have 3221 values in this table; there are 3221 patients' records, and some attributes have missing values, so before training the models we used various techniques to fill them in. The average age of the patients is 52.4 years, implying that the majority of the patients were elderly. The youngest patient was 1 year old and the oldest was 94 years old. The age distribution is skewed, indicating that people of low age are under-represented. The standard deviation of age is 19.1 years, indicating a wide spread of ages. The mean TSH was 6.322 mIU/L, indicating that many patients' TSH levels were abnormal; a normal TSH level lies between 0.5 and 5.0 mIU/L. TSH had a minimum value of 0.005 mIU/L and a maximum value of 478.0 mIU/L. The mean T3 value was 1.95 nmol/L, with a minimum of 0.05 nmol/L and a maximum of 10.6 nmol/L. The mean value of TT4 is 107.55, with a maximum of 430 and a minimum of 2. In the case of T4U, the mean value is 0.988 mIU/mL.
Table 2
Descriptive Statistics of Numeric Value of Our Dataset
Characteristics | age | TSH | T3 | TT4 | T4U | FTI
count | 3221 | 3221 | 3221 | 3221 | 3221 | 3221
unique | 94 | 264 | 65 | 218 | 139 | 210
unit | years | mIU/L | nmol/L | - | mIU/mL | -
freq | 91 | 247 | 589 | 142 | 276 | 274
mean | 52.4 | 6.322 | 1.95 | 107.55 | 0.988 | 110.26
std | 19.1 | 26.54 | 0.8399 | 38.09 | 0.186 | 35.967
min | 1.0 | 0.005 | 0.05 | 2.0 | 0.31 | 2.0
25% | 37.0 | 0.58 | 1.6 | 86.0 | 0.88 | 93.0
50% | 55.0 | 1.5 | 1.9 | 102.0 | 0.97 | 106.0
75% | 68.0 | 3.0 | 2.2 | 123.0 | 1.07 | 123.0
Max | 94.0 | 478.0 | 10.6 | 430.0 | 2.12 | 395.0
The maximum value of T4U is 2.12 mIU/mL and the minimum value of T4U is 0.31 mIU/mL. Finally, the mean value of FTI is 110.26. The correlations between all the numeric attributes are depicted in Fig. 3, which shows that TT4 and FTI have a strong relationship. We can get a better understanding of this correlation table by looking at the heat map; Fig. 4 depicts a heatmap of all attribute correlations.
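The descriptive statistics and correlation table above can be reproduced with pandas. The snippet below is a minimal sketch using a small illustrative frame: the column names match the dataset, but the values are stand-ins, not the real 3221 records.

```python
import pandas as pd

# Illustrative stand-in for the thyroid records; the real dataset has
# 3221 rows and 28 attributes, six of them numeric.
df = pd.DataFrame({
    "age": [37, 55, 68, 94, 1],
    "TSH": [0.58, 1.5, 3.0, 478.0, 0.005],
    "TT4": [86.0, 102.0, 123.0, 430.0, 2.0],
})

# Table-2-style descriptive statistics: count, mean, std, min, quartiles, max.
summary = df.describe()

# Pairwise Pearson correlations between the numeric attributes;
# in the paper these are visualized as a heatmap (e.g. seaborn.heatmap(corr)).
corr = df.corr()

print(summary)
print(corr)
```

`describe()` produces exactly the count/mean/std/min/quartile/max rows of Table 2; the `unique`, `unit`, and `freq` rows were compiled separately.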
3.2 Category Class Balancing
The target class has an uneven distribution of observations, which makes our dataset unbalanced. The observations per class are as follows:
Category | Numeric representation of category | Number of events
negative | 2 | 2753
hypothyroid | 1 | 220
sick | 3 | 171
hyperthyroid | 0 | 77
There are 2753 observations under the negative class label, 220 under hypothyroid, 171 under sick, and 77 under hyperthyroid, so our dataset is highly unbalanced. As a result, machine learning classifiers face difficulties making accurate predictions on it, because classic classifier methods such as Decision Tree and Logistic Regression favour classes with many occurrences: they typically forecast only the majority classes, while the features of the minority class are frequently rejected and treated as noise. The graphical representation of our classes is shown in Fig. 5.
We can see that our dataset is heavily skewed, so we balance the classes in the training data before delivering the data as input to the classification models. The main purpose of class balancing is to either increase the frequency of the minority classes or lower the frequency of the majority class, so that the number of instances in all classes is about equal. We employed resampling, a common strategy for dealing with highly imbalanced datasets: under-sampling deletes samples from the majority class, while over-sampling introduces additional examples from the minority classes. After resampling, all our classes had an equal number of 2753 observations. The balanced plot is shown in Fig. 6.
After resampling, we obtained our final balanced dataset with a total of 11012 instances, which we use to build models that give more accurate results.
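The balancing step described above can be sketched with scikit-learn's `resample` utility. The frame below is a toy stand-in for the thyroid data (the real class counts are 2753/220/171/77); every class is over-sampled with replacement up to the majority size.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame standing in for the thyroid records.
df = pd.DataFrame({
    "TSH": range(12),
    "category": ["negative"] * 8 + ["hypothyroid"] * 2 + ["sick"] + ["hyperthyroid"],
})

# Target size: the count of the majority class.
majority_size = df["category"].value_counts().max()

# Over-sample every class up to the majority count, then recombine.
balanced = pd.concat(
    [
        resample(group, replace=True, n_samples=majority_size, random_state=42)
        for _, group in df.groupby("category")
    ],
    ignore_index=True,
)

print(balanced["category"].value_counts())
```

With the real counts, this procedure yields 4 × 2753 = 11012 instances, matching the balanced dataset size reported above.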
3.3 Performance Analysis of Different Algorithms
Our original dataset, which included all features, was first used to evaluate several machine learning metrics. After that, we tested multiple machine learning models on our balanced dataset. In this study, the dataset's important features were selected using the feature importance method and the univariate feature selection technique; those important features were then used to identify each model's precision, accuracy, recall, and F1 score in our experiments.
The data we use is typically divided into two categories: training data and test data. In this study, 70% of the data was used for training and 30% for testing, so of our 11012 instances, 7708 went to the training set and 3304 to the test set. Using the test set, we can determine the accuracy of each model and how well it can predict thyroid disease. We used the scikit-learn (sklearn) library to split the data: the train_test_split function from sklearn.model_selection splits the dataset randomly in the specified proportions, giving us random train and test parts of the full dataset. After training models with all the algorithms, the test set was used to evaluate them by F1 score, recall, precision, and accuracy.
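The split described above is a one-liner with scikit-learn; this is a minimal sketch using small stand-in arrays rather than the actual 11012 instances.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in feature matrix (50 samples, 2 features) and binary labels.
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

# 70/30 random split, mirroring the study's protocol; stratify keeps
# the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
```

With 11012 instances, the same call produces the 7708/3304 split reported above.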
The entire study's goal was to see which algorithm could best classify the disease. This section highlights the outcomes of the study and identifies the top performer according to a number of performance criteria. First, performance was measured on our raw dataset. Second, performance was measured on a dataset containing the 14 attributes derived from the feature importance method. Third, performance was measured on the 14 attributes from univariate feature selection. Finally, we compare the performance metrics across the algorithms and feature categories.
3.3.1 Results Using All Features
We apply the selected algorithms to our dataset. Of its 28 attributes, the category is the target. The algorithms are then compared using various performance metrics. We can see from Fig. 7 that the Logistic Regression algorithm has the highest accuracy of all. After Logistic Regression, Support Vector Machine, Gradient Boosting Classifier, and Decision Tree Classifier have the next-highest accuracies.
Predictor accuracy refers to how well a predictor can forecast the value of a predicted characteristic for fresh data, while classifier accuracy refers to a classifier's ability to correctly predict the class label. However, accuracy alone does not always provide a good basis for comparing algorithms, so we also assess the models using other metrics such as recall, precision, and F1 score. The performance results of all seven algorithms are listed in Table 3.
Table 3
Evaluation of algorithms with all features
Algorithm Name | Accuracy | Precision | Recall | F1 Score
Decision Tree Classifier | 82.9 | 29 | 26 | 25
Random Forest Classifier | 74.4 | 22 | 23 | 23
Gradient Boosting Classifier | 83.97 | 21 | 25 | 23
Naïve Bayes Classifier | 16.44 | 32 | 52 | 19
K-Nearest Neighbor | 72.18 | 25 | 24 | 25
Logistic Regression | 84.48 | 25 | 24 | 25
Support Vector Machine | 84.38 | 21 | 25 | 23
Logistic Regression, as shown in the table above, achieves the best accuracy: 84.48 percent, with a precision of 25 percent, recall of 24 percent, and F1 score of 25 percent. So in terms of accuracy, Logistic Regression outperforms the other six classification algorithms on our dataset, followed by the Support Vector Machine, Gradient Boosting Classifier, and Decision Tree Classifier. However, precision, recall, and F1 score are extremely low in every case, so we can really only compare the models by accuracy here, and accuracy alone does not always give a reliable measure of performance. Random Forest has 74.4 percent accuracy, again with low precision, recall, and F1 score, and K-Nearest Neighbor has 72.18 percent accuracy. Naive Bayes, on the other hand, gives a very low score in this experiment: only 16.44 percent accuracy, which is extremely unsatisfactory. Overall evaluation results are depicted in Fig. 8.
From the result, we can also say that Logistic Regression gives us the best prediction for our dataset. Naïve Bayes gives us the poorest prediction in this case. As a result, we can conclude that for our dataset, Logistic Regression is the best classification algorithm, while Naive Bayes is the worst.
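The evaluation loop behind comparisons like Table 3 can be sketched as follows. The data here is synthetic and only three of the seven classifiers are shown for brevity; macro-averaged metrics stand in for the study's multi-class precision, recall, and F1 scores.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Synthetic 4-class data standing in for the thyroid records.
X, y = make_classification(n_samples=400, n_classes=4, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    # Macro averaging treats the four classes equally.
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro", zero_division=0
    )
    scores[name] = {"accuracy": accuracy_score(y_test, y_pred),
                    "precision": p, "recall": r, "f1": f1}

for name, s in scores.items():
    print(name, round(s["accuracy"], 3))
```

Adding the remaining four classifiers to the `models` dict reproduces the full seven-way comparison.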
3.3.2 Results for Our Dataset Using Feature Importance Method
Using the feature importance technique, we determine the 14 best-correlated features of our dataset and apply the seven algorithms to them. The algorithms are then compared using various performance metrics. The selected 14 features are depicted in Fig. 9 with their importance values.
We apply the Random Forest Classifier, Decision Tree Classifier, Gradient Boosting Classifier, Naive Bayes Classifier, Logistic Regression Classifier, K-Nearest Neighbor, and Support Vector Machine algorithms to our 14-feature data; the accuracy plot is shown in Fig. 10.
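A feature importance selection of this kind can be sketched with a Random Forest's impurity-based importances; the data below is synthetic, so the chosen columns are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 28 features, mirroring the dataset's attribute count.
X, y = make_classification(n_samples=300, n_features=28, n_informative=8,
                           random_state=0)

# Fit a forest and rank features by impurity-based importance,
# keeping the top k (the study keeps 14 of its 28 attributes).
k = 14
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top_idx = np.argsort(forest.feature_importances_)[::-1][:k]
X_selected = X[:, top_idx]

print(X_selected.shape)
```

`feature_importances_` sums to 1 across all features, so the ranked values can also be plotted directly as an importance bar chart like Fig. 9.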
We can see from the bar chart that the Random Forest algorithm outperforms all others in terms of accuracy, followed by the Decision Tree Classifier and Gradient Boosting Classifier. As previously stated, accuracy is not always an appropriate metric on its own, so we also consider precision, recall, and F1 score. The performance metrics of all seven algorithms are listed in Table 4.
Table 4
Evaluation of algorithms with the features of feature importance
Algorithm Name | Accuracy | Precision | Recall | F1-score
Decision Tree Classifier | 90.43 | 91 | 90 | 90
Random Forest Classifier | 91.42 | 92 | 92 | 92
Gradient Boosting Classifier | 90.5 | 91 | 90 | 90
Naïve Bayes Classifier | 67.86 | 68 | 67 | 64
K-Nearest Neighbor | 86.22 | 86 | 86 | 86
Logistic Regression | 73.15 | 86 | 86 | 86
Support Vector Machine | 73.7 | 74 | 74 | 74
As the table above shows, Random Forest beats all other algorithms on every performance criterion: it has the highest accuracy of 91.42 percent, the highest precision of 92 percent, the highest recall of 92 percent, and the highest F1 score of 92 percent. So, for our dataset with the 14 feature-importance attributes, Random Forest outperforms the other six classification algorithms. Following it, the Gradient Boosting Classifier and Decision Tree Classifier perform admirably; both have the same precision, recall, and F1 score, but Gradient Boosting has slightly better accuracy, so in terms of accuracy Gradient Boosting outperforms the Decision Tree Classifier. K-Nearest Neighbor has an accuracy of 86.22 percent and an F1 score of 86 percent. SVM provides 73.7 percent accuracy with a 74 percent F1 score. Logistic Regression has 73.15 percent accuracy with an F1 score of 86 percent. Finally, Naive Bayes gives an F1 score of 64 percent and 67.86 percent accuracy. Overall results are shown in Fig. 11.
The confusion matrix tells us how accurate a classifier's predictions are. The confusion matrices of all seven classification algorithms are shown in Fig. 12.
From the confusion matrix, we can also say that Random Forest gives us the best prediction and Naïve Bayes gives us the poorest prediction in this case. As a result, we can conclude that for our chosen dataset, Random Forest is the best classification algorithm.
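A confusion matrix for the four thyroid categories can be computed with scikit-learn as sketched below; the labels here are toy values, not the study's actual predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy true and predicted labels using the study's numeric codes:
# 0 = hyperthyroid, 1 = hypothyroid, 2 = negative, 3 = sick.
y_true = [2, 2, 2, 1, 1, 3, 0, 0]
y_pred = [2, 2, 1, 1, 1, 3, 0, 3]

# Rows are true classes, columns are predicted classes;
# the diagonal counts the correct predictions.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
print(cm)
```

Off-diagonal entries show which classes are confused with which, which is what makes the heatmaps in Fig. 12 more informative than accuracy alone.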
3.3.3 Results for Our Dataset Using Univariate Feature Selection Method
Now we use the univariate feature selection method to select our important features. The top 14 features, with their correlation scores against the target, are given in Fig. 14.
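A univariate selection step of this kind can be sketched with scikit-learn's SelectKBest. The ANOVA F-test is used here as the per-feature score (chi-squared is another common choice for non-negative features), and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data with 28 features, mirroring the dataset's attribute count.
X, y = make_classification(n_samples=300, n_features=28, n_informative=8,
                           random_state=0)

# Score each feature independently against the target and keep the top 14.
selector = SelectKBest(score_func=f_classif, k=14).fit(X, y)
X_selected = selector.transform(X)

print(X_selected.shape)
```

Unlike the forest-based importances, these scores are computed per feature in isolation, which is why the two methods can select different attribute sets.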
We apply Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier, Naive Bayes Classifier, Logistic Regression Classifier, K-Nearest Neighbor, and Support Vector Machine algorithms on our selected data and the accuracy plot is given below:
We can see from the bar chart that these results differ slightly from the previous ones. The Random Forest algorithm again has the highest accuracy, at 90.4 percent. After Random Forest, the Decision Tree Classifier and Gradient Boosting Classifier have the next-highest accuracies, at 89.55 percent and 89.35 percent, respectively. K-Nearest Neighbor has an accuracy of 86.07 percent. The accuracy of SVM increased to 74.15 percent, while the accuracy of Logistic Regression decreased to 71.82 percent, and Naive Bayes's accuracy also decreased on this dataset. As a result, we conclude that this method is less effective than the feature importance technique. We now assess the models using precision, recall, and F1 score; the performance metrics of all seven algorithms are listed in Table 5.
Table 5
Evaluation of algorithms with the features of univariate feature selection
Algorithm Name | Accuracy | Precision | Recall | F1-Score
Decision Tree Classifier | 89.55 | 90 | 89 | 89
Random Forest Classifier | 90.4 | 91 | 90 | 90
Gradient Boosting Classifier | 89.35 | 90 | 89 | 89
Naïve Bayes Classifier | 56.3 | 63 | 55 | 50
K-Nearest Neighbor | 86.07 | 86 | 86 | 86
Logistic Regression | 71.82 | 86 | 86 | 86
Support Vector Machine | 74.15 | 74 | 74 | 74
The table above shows that the performance metrics differ noticeably from the previous test. Logistic Regression, K-Nearest Neighbor, and Support Vector Machine all retain the same precision, while the precision of the Decision Tree, Random Forest, Gradient Boosting, and Naive Bayes classifiers decreases. The same pattern holds for recall: it is unchanged for K-Nearest Neighbor, SVM, and Logistic Regression, but falls for the Decision Tree, Random Forest, Gradient Boosting, and Naive Bayes classifiers. The F1 score, which summarizes precision and recall together, is likewise unchanged for Logistic Regression, K-Nearest Neighbor, and SVM, and decreases for Naive Bayes. Based on the table, we conclude that Random Forest is again the best performer, with the Decision Tree Classifier close behind. Gradient Boosting and Decision Tree are nearly equal in this race, but the Decision Tree Classifier outperforms Gradient Boosting by a small margin, while Naive Bayes's performance is reduced across the board. Overall results using the univariate feature selection method are shown in Fig. 15.
The confusion matrices of all seven classification algorithms are shown in Fig. 16.
We can also conclude from the confusion matrices that Random Forest provides the best prediction, while Naïve Bayes gives us the worst prediction in this case. As a result, Random Forest is the best classification algorithm for our dataset and Naive Bayes is the poorest. Overall results for all classifiers and feature sets in this investigation are depicted in Fig. 17.