A comparison of seven distinct machine-learning algorithms was conducted in this study: Decision Tree Classifier, Random Forest Classifier, Naive Bayes Classifier, Gradient Boosting Classifier, Logistic Regression Classifier, K-Nearest Neighbor, and Support Vector Machine were applied to thyroid disease prediction. First, we collected and preprocessed the data, and then fed the data to the models for training. Various performance criteria, including accuracy, precision, recall, and F1 score, were compared to establish whether any algorithm is superior to the others. We divided our dataset into three variants: the first considering all attributes, the second containing the 14 attributes chosen by the feature importance method, and the third containing the 14 attributes chosen by univariate feature selection. Attributes were narrowed down based on their correlation with the target, calculated with the feature importance and univariate feature selection methods. The results of the individual experiments are explained in the following parts of this analysis.
3.1 Descriptive Statistics of the Dataset
Exploratory data analysis (EDA) is a form of data analysis that employs data visualization to evaluate and investigate data sets and describe their key properties [34, 35]. EDA is mostly used to examine what the data might reveal outside of formal modeling or hypothesis-testing tasks, and to better understand a data set's variables and their interactions. It can also help us figure out whether the statistical methods we are contemplating for the analysis are appropriate. Our dataset has 28 attributes, only six of which are numeric, so we give short descriptive statistics for these in Table 2. All of the attributes have 3221 values in this table; there are 3221 patients' records, and some attributes have missing values, so before training the models we used various techniques to fill them in. The average age of the patients is 52.4 years, implying that the majority of the patients were elderly. The youngest patient was 1 year old and the oldest was 94 years old. The age distribution is skewed, indicating that people of low age are under-represented. The standard deviation of age is 19.1 years, indicating a wide spread of ages. The mean TSH was 6.322 mIU/L, indicating that many patients' TSH levels were abnormal; a normal TSH level lies between 0.5 and 5.0 mIU/L. TSH had a minimum value of 0.005 mIU/L and a maximum value of 478.0 mIU/L. The mean T3 value was 1.95 nmol/L, with a minimum of 0.05 nmol/L and a maximum of 10.6 nmol/L. The mean value of TT4 is 107.55, with a maximum of 430 and a minimum of 2. In the case of T4U, the mean value is 0.988 mIU/mL.
Table 2
Descriptive Statistics of Numeric Value of Our Dataset
Characteristics | age | TSH | T3 | TT4 | T4U | FTI
count | 3221 | 3221 | 3221 | 3221 | 3221 | 3221
unique | 94 | 264 | 65 | 218 | 139 | 210
unit | years | mIU/L | nmol/L | - | mIU/mL | -
freq | 91 | 247 | 589 | 142 | 276 | 274
mean | 52.4 | 6.322 | 1.95 | 107.55 | 0.988 | 110.26
std | 19.1 | 26.54 | 0.8399 | 38.09 | 0.186 | 35.967
min | 1.0 | 0.005 | 0.05 | 2.0 | 0.31 | 2.0
25% | 37.0 | 0.58 | 1.6 | 86.0 | 0.88 | 93.0
50% | 55.0 | 1.5 | 1.9 | 102.0 | 0.97 | 106.0
75% | 68.0 | 3.0 | 2.2 | 123.0 | 1.07 | 123.0
Max | 94.0 | 478.0 | 10.6 | 430.0 | 2.12 | 395.0
The maximum value of T4U is 2.12 mIU/mL and the minimum value of T4U is 0.31 mIU/mL. Finally, the mean value of FTI is 110.26. The correlations between all the numeric attributes are depicted in Fig. 3, which shows that TT4 and FTI have a strong relationship. We can get a better understanding of this correlation table by looking at the heat map; Fig. 4 depicts a heatmap of all attribute correlations.
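The descriptive statistics and correlation table above can be reproduced with pandas. The snippet below is a minimal sketch using a small illustrative frame: the column names match the dataset, but the values are stand-ins, not the real 3221 records.

```python
import pandas as pd

# Illustrative stand-in for the thyroid records; the real dataset has
# 3221 rows and 28 attributes, six of them numeric.
df = pd.DataFrame({
    "age": [37, 55, 68, 94, 1],
    "TSH": [0.58, 1.5, 3.0, 478.0, 0.005],
    "TT4": [86.0, 102.0, 123.0, 430.0, 2.0],
})

# Table-2-style descriptive statistics: count, mean, std, min, quartiles, max.
summary = df.describe()

# Pairwise Pearson correlations between the numeric attributes;
# in the paper these are visualized as a heatmap (e.g. seaborn.heatmap(corr)).
corr = df.corr()

print(summary)
print(corr)
```

`describe()` produces exactly the count/mean/std/min/quartile/max rows of Table 2; the `unique`, `unit`, and `freq` rows were compiled separately.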
3.2 Category Class Balancing
The target class has an uneven distribution of observations, which makes our dataset unbalanced. The observations per class are as follows:
Category | Numeric representation of category | Number of events
negative | 2 | 2753
hypothyroid | 1 | 220
sick | 3 | 171
hyperthyroid | 0 | 77
There are 2753 observations under the negative class label, 220 under hypothyroid, 171 under sick, and 77 under hyperthyroid, so our dataset is highly unbalanced. As a result, machine learning classifiers face difficulties making accurate predictions on it, because classic classifier methods such as Decision Tree and Logistic Regression favour classes with many occurrences: they typically forecast only the majority classes, while the features of the minority class are frequently rejected and treated as noise. The graphical representation of our classes is shown in Fig. 5.
We can see that our dataset is heavily skewed, so we balance the classes in the training data before delivering the data as input to the classification models. The main purpose of class balancing is to either increase the frequency of the minority classes or lower the frequency of the majority class, so that the number of instances in all classes is about equal. We employed resampling, a common strategy for dealing with highly imbalanced datasets: under-sampling deletes samples from the majority class, while over-sampling introduces additional examples from the minority classes. After resampling, all our classes had an equal number of 2753 observations. The balanced plot is shown in Fig. 6.
After resampling, we obtained our final balanced dataset with a total of 11012 instances, which we use to build models that give more accurate results.
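The balancing step described above can be sketched with scikit-learn's `resample` utility. The frame below is a toy stand-in for the thyroid data (the real class counts are 2753/220/171/77); every class is over-sampled with replacement up to the majority size.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame standing in for the thyroid records.
df = pd.DataFrame({
    "TSH": range(12),
    "category": ["negative"] * 8 + ["hypothyroid"] * 2 + ["sick"] + ["hyperthyroid"],
})

# Target size: the count of the majority class.
majority_size = df["category"].value_counts().max()

# Over-sample every class up to the majority count, then recombine.
balanced = pd.concat(
    [
        resample(group, replace=True, n_samples=majority_size, random_state=42)
        for _, group in df.groupby("category")
    ],
    ignore_index=True,
)

print(balanced["category"].value_counts())
```

With the real counts, this procedure yields 4 × 2753 = 11012 instances, matching the balanced dataset size reported above.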
3.3 Performance Analysis of Different Algorithms
Our original dataset, which included all features, was first used to evaluate several machine learning metrics. After that, we tested multiple machine learning models on our balanced dataset. In this study, the dataset's important features were selected using the feature importance method and the univariate feature selection technique; those important features were then used to identify each model's precision, accuracy, recall, and F1 score in our experiments.
The data we use is typically divided into two categories: training data and test data. In this study, 70% of the data was used for training and 30% for testing, so of our 11012 instances, 7708 went to the training set and 3304 to the test set. Using the test set, we can determine the accuracy of each model and how well it can predict thyroid disease. We used the scikit-learn (sklearn) library to split the data: the train_test_split function from sklearn.model_selection splits the dataset randomly in the specified proportions, giving us random train and test parts of the full dataset. After training models with all the algorithms, the test set was used to evaluate them by F1 score, recall, precision, and accuracy.
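The split described above is a one-liner with scikit-learn; this is a minimal sketch using small stand-in arrays rather than the actual 11012 instances.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in feature matrix (50 samples, 2 features) and binary labels.
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

# 70/30 random split, mirroring the study's protocol; stratify keeps
# the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
```

With 11012 instances, the same call produces the 7708/3304 split reported above.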
The entire study's goal was to see which algorithm could best classify the disease. This section highlights the outcomes of the study and identifies the top performer according to a number of performance criteria. First, performance was measured on our raw dataset. Second, performance was measured on a dataset containing the 14 attributes derived from the feature importance method. Third, performance was measured on the 14 attributes from univariate feature selection. Finally, we compare the performance metrics across the algorithms and feature categories.
3.3.1 Results Using All Features
We apply the selected algorithms to our dataset. Of its 28 attributes, the category is the target. The algorithms are then compared using various performance metrics. We can see from Fig. 7 that the Logistic Regression algorithm has the highest accuracy of all. After Logistic Regression, Support Vector Machine, Gradient Boosting Classifier, and Decision Tree Classifier have the next-highest accuracies.
Predictor accuracy refers to how well a predictor can forecast the value of a predicted characteristic for fresh data, while classifier accuracy refers to a classifier's ability to correctly predict the class label. However, accuracy alone does not always provide a good basis for comparing algorithms, so we also assess the models using other metrics such as recall, precision, and F1 score. The performance results of all seven algorithms are listed in Table 3.
Table 3
Evaluation of algorithms with all features
Algorithm Name | Accuracy | Precision | Recall | F1 Score
Decision Tree Classifier | 82.9 | 29 | 26 | 25
Random Forest Classifier | 74.4 | 22 | 23 | 23
Gradient Boosting Classifier | 83.97 | 21 | 25 | 23
Naïve Bayes Classifier | 16.44 | 32 | 52 | 19
K-Nearest Neighbor | 72.18 | 25 | 24 | 25
Logistic Regression | 84.48 | 25 | 24 | 25
Support Vector Machine | 84.38 | 21 | 25 | 23
Logistic Regression, as shown in the table above, achieves the best accuracy: 84.48 percent, with a precision of 25 percent, recall of 24 percent, and F1 score of 25 percent. So in terms of accuracy, Logistic Regression outperforms the other six classification algorithms on our dataset, followed by the Support Vector Machine, Gradient Boosting Classifier, and Decision Tree Classifier. However, precision, recall, and F1 score are extremely low in every case, so we can really only compare the models by accuracy here, and accuracy alone does not always give a reliable measure of performance. Random Forest has 74.4 percent accuracy, again with low precision, recall, and F1 score, and K-Nearest Neighbor has 72.18 percent accuracy. Naive Bayes, on the other hand, gives a very low score in this experiment: only 16.44 percent accuracy, which is extremely unsatisfactory. Overall evaluation results are depicted in Fig. 8.
From the result, we can also say that Logistic Regression gives us the best prediction for our dataset. Naïve Bayes gives us the poorest prediction in this case. As a result, we can conclude that for our dataset, Logistic Regression is the best classification algorithm, while Naive Bayes is the worst.
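The evaluation loop behind comparisons like Table 3 can be sketched as follows. The data here is synthetic and only three of the seven classifiers are shown for brevity; macro-averaged metrics stand in for the study's multi-class precision, recall, and F1 scores.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Synthetic 4-class data standing in for the thyroid records.
X, y = make_classification(n_samples=400, n_classes=4, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    # Macro averaging treats the four classes equally.
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro", zero_division=0
    )
    scores[name] = {"accuracy": accuracy_score(y_test, y_pred),
                    "precision": p, "recall": r, "f1": f1}

for name, s in scores.items():
    print(name, round(s["accuracy"], 3))
```

Adding the remaining four classifiers to the `models` dict reproduces the full seven-way comparison.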
3.3.2 Results for Our Dataset Using Feature Importance Method
Using the feature importance technique, we determine the 14 best-correlated features of our dataset and apply the seven algorithms to them. The algorithms are then compared using various performance metrics. The selected 14 features are depicted in Fig. 9 with their importance values.
We apply the Random Forest Classifier, Decision Tree Classifier, Gradient Boosting Classifier, Naive Bayes Classifier, Logistic Regression Classifier, K-Nearest Neighbor, and Support Vector Machine algorithms to our 14-feature data; the accuracy plot is shown in Fig. 10.
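A feature importance selection of this kind can be sketched with a Random Forest's impurity-based importances; the data below is synthetic, so the chosen columns are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 28 features, mirroring the dataset's attribute count.
X, y = make_classification(n_samples=300, n_features=28, n_informative=8,
                           random_state=0)

# Fit a forest and rank features by impurity-based importance,
# keeping the top k (the study keeps 14 of its 28 attributes).
k = 14
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top_idx = np.argsort(forest.feature_importances_)[::-1][:k]
X_selected = X[:, top_idx]

print(X_selected.shape)
```

`feature_importances_` sums to 1 across all features, so the ranked values can also be plotted directly as an importance bar chart like Fig. 9.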
We can see from the bar chart that the Random Forest algorithm outperforms all others in terms of accuracy, followed by the Decision Tree Classifier and Gradient Boosting Classifier. As previously stated, accuracy is not always an appropriate metric on its own, so we also consider precision, recall, and F1 score. The performance metrics of all seven algorithms are listed in Table 4.
Table 4
Evaluation of algorithms with the features of feature importance
Algorithm Name | Accuracy | Precision | Recall | F1-score
Decision Tree Classifier | 90.43 | 91 | 90 | 90
Random Forest Classifier | 91.42 | 92 | 92 | 92
Gradient Boosting Classifier | 90.5 | 91 | 90 | 90
Naïve Bayes Classifier | 67.86 | 68 | 67 | 64
K-Nearest Neighbor | 86.22 | 86 | 86 | 86
Logistic Regression | 73.15 | 86 | 86 | 86
Support Vector Machine | 73.7 | 74 | 74 | 74
As the table above shows, Random Forest beats all other algorithms on every performance criterion: it has the highest accuracy of 91.42 percent, the highest precision of 92 percent, the highest recall of 92 percent, and the highest F1 score of 92 percent. So, for our dataset with the 14 feature-importance attributes, Random Forest outperforms the other six classification algorithms. Following it, the Gradient Boosting Classifier and Decision Tree Classifier perform admirably; both have the same precision, recall, and F1 score, but Gradient Boosting has slightly better accuracy, so in terms of accuracy Gradient Boosting outperforms the Decision Tree Classifier. K-Nearest Neighbor has an accuracy of 86.22 percent and an F1 score of 86 percent. SVM provides 73.7 percent accuracy with a 74 percent F1 score. Logistic Regression has 73.15 percent accuracy with an F1 score of 86 percent. Finally, Naive Bayes gives an F1 score of 64 percent and 67.86 percent accuracy. Overall results are shown in Fig. 11.
The confusion matrix tells us how accurate a classifier's predictions are. The confusion matrices of all seven classification algorithms are shown in Fig. 12.
From the confusion matrix, we can also say that Random Forest gives us the best prediction and Naïve Bayes gives us the poorest prediction in this case. As a result, we can conclude that for our chosen dataset, Random Forest is the best classification algorithm.
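A confusion matrix for the four thyroid categories can be computed with scikit-learn as sketched below; the labels here are toy values, not the study's actual predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy true and predicted labels using the study's numeric codes:
# 0 = hyperthyroid, 1 = hypothyroid, 2 = negative, 3 = sick.
y_true = [2, 2, 2, 1, 1, 3, 0, 0]
y_pred = [2, 2, 1, 1, 1, 3, 0, 3]

# Rows are true classes, columns are predicted classes;
# the diagonal counts the correct predictions.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
print(cm)
```

Off-diagonal entries show which classes are confused with which, which is what makes the heatmaps in Fig. 12 more informative than accuracy alone.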
3.3.3 Results for Our Dataset Using Univariate Feature Selection Method
Now we use the univariate feature selection method to select our important features. The top 14 features, with their correlation scores against the target, are given in Fig. 14.
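A univariate selection step of this kind can be sketched with scikit-learn's SelectKBest. The ANOVA F-test is used here as the per-feature score (chi-squared is another common choice for non-negative features), and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data with 28 features, mirroring the dataset's attribute count.
X, y = make_classification(n_samples=300, n_features=28, n_informative=8,
                           random_state=0)

# Score each feature independently against the target and keep the top 14.
selector = SelectKBest(score_func=f_classif, k=14).fit(X, y)
X_selected = selector.transform(X)

print(X_selected.shape)
```

Unlike the forest-based importances, these scores are computed per feature in isolation, which is why the two methods can select different attribute sets.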
We apply Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier, Naive Bayes Classifier, Logistic Regression Classifier, K-Nearest Neighbor, and Support Vector Machine algorithms on our selected data and the accuracy plot is given below:
We can see from the bar chart that these results differ slightly from the previous ones. The Random Forest algorithm again has the highest accuracy, at 90.4 percent. After Random Forest, the Decision Tree Classifier and Gradient Boosting Classifier have the next-highest accuracies, at 89.55 percent and 89.35 percent, respectively. K-Nearest Neighbor has an accuracy of 86.07 percent. The accuracy of SVM increased to 74.15 percent, while the accuracy of Logistic Regression decreased to 71.82 percent, and Naive Bayes's accuracy also decreased on this dataset. As a result, we conclude that this method is less effective than the feature importance technique. We now assess the models using precision, recall, and F1 score; the performance metrics of all seven algorithms are listed in Table 5.
Table 5
Evaluation of algorithms with the features of univariate feature selection
Algorithm Name | Accuracy | Precision | Recall | F1-Score
Decision Tree Classifier | 89.55 | 90 | 89 | 89
Random Forest Classifier | 90.4 | 91 | 90 | 90
Gradient Boosting Classifier | 89.35 | 90 | 89 | 89
Naïve Bayes Classifier | 56.3 | 63 | 55 | 50
K-Nearest Neighbor | 86.07 | 86 | 86 | 86
Logistic Regression | 71.82 | 86 | 86 | 86
Support Vector Machine | 74.15 | 74 | 74 | 74
The table above shows that the performance metrics differ noticeably from the previous test. Logistic Regression, K-Nearest Neighbor, and Support Vector Machine all retain the same precision, while the precision of the Decision Tree, Random Forest, Gradient Boosting, and Naive Bayes classifiers decreases. The same pattern holds for recall: it is unchanged for K-Nearest Neighbor, SVM, and Logistic Regression, but falls for the Decision Tree, Random Forest, Gradient Boosting, and Naive Bayes classifiers. The F1 score, which summarizes precision and recall together, is likewise unchanged for Logistic Regression, K-Nearest Neighbor, and SVM, and decreases for Naive Bayes. Based on the table, we conclude that Random Forest is again the best performer, with the Decision Tree Classifier close behind. Gradient Boosting and Decision Tree are nearly equal in this race, but the Decision Tree Classifier outperforms Gradient Boosting by a small margin, while Naive Bayes's performance is reduced across the board. Overall results using the univariate feature selection method are shown in Fig. 15.
The confusion matrices of all seven classification algorithms are shown in Fig. 16.
We can also conclude from the confusion matrices that Random Forest provides the best prediction, while Naïve Bayes gives us the worst prediction in this case. As a result, Random Forest is the best classification algorithm for our dataset and Naive Bayes is the poorest. Overall results for all classifiers and feature sets in this investigation are depicted in Fig. 17.