Availability of data and materials
The dataset supporting the conclusions of this article is available in the NHANES repository, [https://wwwn.cdc.gov/nchs/nhanes/default.aspx].
To enhance the prediction and classification of MetS using ML techniques, the dataset was obtained from the National Health and Nutrition Examination Survey (NHANES). The dataset holds the 2011–2012 cohort, recording 27 attributes for a total of 1931 patients, both male and female. The class attribute indicating whether a patient has Metabolic Syndrome was binary, denoted by 0 (not a MetS patient) and 1 (MetS patient). Ages in the dataset ranged from 20 to 80, with patient information such as marital status, annual income, race, first name, and last name. Body Mass Index (BMI) and waist circumference were recorded according to the standards, i.e., g15.7 kg/m2 and g63.1 cm respectively. Data from certain blood tests were also included, namely Uric Acid, Gamma-Glutamyl Transferase (GGT), Alanine aminotransferase (ALT), Aspartate aminotransferase (AST), and Creatine Phosphokinase (CPK), with value ranges of g1.8 mg/dL, g4 U/L, g6 U/L, g9 U/L, and g15 U/L respectively [26]. Blood glucose, HDL, and triglycerides averaged around g39 mg/dL, g14 mg/dL, and g26 mg/dL respectively. Factors such as smoking history, hypertension, obesity, dyslipidemia with HDL, dyslipidemia, and hyperglycemia were also part of the dataset. Tab. 2 lists the 27 attributes in the dataset according to their respective types.
Table 2
Attributes in the dataset according to their types

| Type | Attributes |
| --- | --- |
| Patients' demographic information | seqn, age, sex, marital status, annual_income, race, fname, lname |
| Diagnostic test results | WaistCirc, BMI |
| Blood test results | albuminuria, UrAlbCr, UricAcid, GGT, ALT, AST, CPK |
| MetS factors | BloodGlucose, HDL, Triglycerides, smoking, Dyslipidemia_HDL, Dyslipidemia, Hyperglycemia, Obesity, Hypertension |
3.1 Pre-processing Step
The dataset contains personal information on patients that is insignificant for assessing and counteracting the likelihood of MetS. Therefore, non-relevant fields were removed, leaving 20 attributes to be analyzed. The dataset was in CSV format and was loaded into Weka for subsequent processing. It was also observed that the dataset was unbalanced and that, as a result, some pre-processing was necessary to increase the minority class or reduce the majority class. In this case, the minority class represents patients with MetS, whereas the majority class represents patients without MetS.
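The authors performed this step in Weka; as an illustration, the same column removal can be sketched in Python with the standard csv module. The column names below come from Tab. 2, but the exact set of removed fields is an assumption for illustration:

```python
import csv
import io

# Personal-information fields with no predictive value for MetS
# (illustrative subset of the demographic attributes in Tab. 2).
NON_RELEVANT = {"seqn", "fname", "lname", "marital status", "annual_income"}

def drop_columns(csv_text, drop):
    """Return the CSV text with the given columns removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [c for c in reader.fieldnames if c not in drop]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept)
    writer.writeheader()
    for row in reader:
        writer.writerow({c: row[c] for c in kept})
    return out.getvalue()
```

In practice the same effect is achieved in Weka with the Remove attribute filter.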
The Synthetic Minority Oversampling Technique (SMOTE) was used to balance the dataset. It generates new synthetic samples for the minority class by interpolating between each minority instance and its nearest neighbors [42]. Fig. 2 illustrates the dataset before applying SMOTE.
The initial dataset was unbalanced, with 1251 instances of class 0 (shown in blue) and 680 instances of class 1 (shown in red) for the MetS class. Balancing an unbalanced dataset is important to ensure good performance of ML algorithms in correctly classifying the minority class [26]. Fig. 2 also illustrates the dataset after applying SMOTE, in which class-1 instances increased from 680 to 1380 (shown in red). As a result, the number of instances in the dataset rose from 1931 to 2637. Fig. 3 shows the steps executed sequentially using various ML algorithms on the selected dataset for classification.
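This work used Weka's SMOTE filter; the technique itself can be sketched in a few lines of stdlib Python. The neighbour count k, the seed, and the 2-D sample data are illustrative assumptions:

```python
import random
import math

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by interpolating each minority
    sample with one of its k nearest minority-class neighbours (SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic
```

For the split reported above, 700 synthetic class-1 samples would be generated to raise the minority class from 680 to 1380.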
3.2 Feature Selection and Classification
Irrelevant attributes in a dataset can degrade the performance of ML techniques by increasing computational time and reducing accuracy [34]. Therefore, selecting the most relevant features is a critical first step in learning the correct data patterns: training on irrelevant attributes may produce poor results and lead to overfitting.
Another advantage of feature selection is that it reduces the dimensionality, i.e., the number of attributes in the training data, so ML algorithms train faster and work only with significant attributes to achieve accurate results [43]. In this research, the Wrapper method is used for feature selection. In the Wrapper method, each classifier is wrapped in a cross-validation loop, and each attribute or feature subset is then evaluated [44]. Classifier subset evaluation with a 5-fold cross-validation loop is used for the Wrapper method. Best-first search is used because it performs better [45]. The combined selected features are used for classification.
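The study used Weka's classifier-subset evaluation with best-first search; the sketch below illustrates the wrapper idea in plain Python, simplified to a greedy forward search with a nearest-centroid stand-in classifier (both are assumptions for illustration, not the actual Weka setup):

```python
import random
import statistics

def cv_accuracy(data, labels, features, folds=5, seed=0):
    """Mean k-fold cross-validated accuracy of a nearest-centroid
    classifier restricted to the given feature indices."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    chunks = [idx[i::folds] for i in range(folds)]
    accs = []
    for f in range(folds):
        test = chunks[f]
        train = [i for i in idx if i not in chunks[f]]
        # class centroids over the selected features only
        cents = {}
        for c in set(labels):
            rows = [data[i] for i in train if labels[i] == c]
            cents[c] = [statistics.mean(r[j] for r in rows) for j in features]
        correct = 0
        for i in test:
            pred = min(cents, key=lambda c: sum(
                (data[i][j] - m) ** 2 for j, m in zip(features, cents[c])))
            correct += pred == labels[i]
        accs.append(correct / len(test))
    return statistics.mean(accs)

def wrapper_select(data, labels, n_features):
    """Greedy forward wrapper: repeatedly add the feature whose inclusion
    gives the best cross-validated accuracy, until none improves it."""
    selected, best = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        score, j = max((cv_accuracy(data, labels, selected + [j]), j)
                       for j in remaining)
        if score <= best:
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected
```

The key property of the wrapper approach is visible here: each candidate feature subset is scored by actually running the classifier inside a cross-validation loop, rather than by a filter statistic.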
Feature selection with Naïve Bayes resulted in 9 attributes, whereas C4.5, Cart, and SVM resulted in 5, 7, and 7 attributes respectively. Both RF and LR selected 3 attributes. The most frequently occurring attribute was Waist Circumference. All selected attributes were combined, and out of the initial 20 attributes, 14 were retained; including the class variable, 15 attributes in total were used for classification, as described in Tab. 3. The selected algorithms were applied to the dataset, which was trained using ten-fold cross-validation.
Table 3
Selected attributes using the wrapper method

| Classifier | Selected attributes | No. of attributes |
| --- | --- | --- |
| NB | WaistCirc, UrAlbCr, CPK, BloodGlucose, Triglycerides, Hypertension, Dyslipidemia_HDL, Dyslipidemia, Obesity | 9 |
| RF | WaistCirc, BMI, UrAlbCr | 3 |
| C4.5 | WaistCirc, Hypertension, Dyslipidemia, Hyperglycemia, Obesity | 5 |
| Cart | BMI, UrAlbCr, UricAcid, GGT, ALT, CPK, Hyperglycemia | 7 |
| SVM | WaistCirc, BMI, UrAlbCr, BloodGlucose, Triglycerides, Hypertension, Dyslipidemia_HDL | 7 |
| LR | WaistCirc, BMI, UrAlbCr | 3 |
3.3 Evaluation
A confusion matrix was generated to evaluate the performance of the classifiers, and several performance parameters were calculated from it [24, 46]: Predictive Accuracy, Error Rate, True Positive Rate (TPR), False Positive Rate (FPR), True Negative Rate (TNR), and Prevalence. Results were also evaluated using precision and recall, and the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve was calculated for each classifier.
The ROC curve was plotted with FPR on the x-axis and TPR on the y-axis [46]. In ROC space, the point (0,0) corresponds to a classifier that labels every instance as negative, while (1,1) corresponds to one that labels every instance as positive; a classifier's curve should pass near the point (0,1) to indicate good performance [47]. AUC is a summary measure of the ROC curve used to evaluate classifier performance [48]. AUC values normally lie between 0.5 and 1: a value at or near 1 indicates a good classification, whereas a value at or near 0.5 indicates random or unacceptable classification results [49]. Performance results for the selected classifiers were compared for better MetS prediction, and the results indicated Naïve Bayes as the outperforming technique for the stated purpose.
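The listed parameters can be computed directly from a confusion matrix, and AUC can be computed from ranking scores without plotting the curve; a minimal stdlib-Python sketch (function names are illustrative):

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, error rate, TPR, FPR, TNR, prevalence and precision
    from binary labels (1 = MetS patient, 0 = not a MetS patient)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    n = tp + tn + fp + fn
    return {
        "accuracy":   (tp + tn) / n,
        "error_rate": (fp + fn) / n,
        "tpr":        tp / (tp + fn),  # recall / sensitivity
        "fpr":        fp / (fp + tn),
        "tnr":        tn / (fp + tn),  # specificity
        "prevalence": (tp + fn) / n,
        "precision":  tp / (tp + fp),
    }

def auc(y_true, scores):
    """AUC as the probability that a random positive is scored above a
    random negative (ties count half) - equal to the area under ROC."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

The rank-statistic form of AUC avoids choosing a score threshold, which is why it summarizes the whole ROC curve in a single value.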
According to the literature, many different methods have been used to identify prognostic factors; the most common is the use of trees formed by the decision tree or by its derivative techniques, such as Cart [50]. The proposed approach used the decision tree formed by classifier C4.5 to identify and analyze the prognostic factors for MetS. The factor at the root position is considered the most important; in the current research, 'Hyperglycemia' was identified as the root factor. Together with the tree combination formed from Dyslipidemia_HDL, Hypertension, and Obesity, the result of the proposed approach corresponds with the WHO definition.
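How the root factor is chosen can be illustrated with a small sketch: C4.5 places at the root the attribute with the highest gain ratio, approximated below by plain information gain over categorical attributes (a simplification of the full C4.5 criterion):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting on the categorical attribute attr."""
    base = entropy(labels)
    n = len(rows)
    by_value = {}
    for r, y in zip(rows, labels):
        by_value.setdefault(r[attr], []).append(y)
    return base - sum(len(ys) / n * entropy(ys) for ys in by_value.values())

def root_attribute(rows, labels, attrs):
    """The attribute the tree would place at the root: the one whose
    split yields the largest reduction in class entropy."""
    return max(attrs, key=lambda a: info_gain(rows, labels, a))
```

An attribute that cleanly separates MetS from non-MetS cases maximizes this reduction, which is why the root of the induced tree is read as the most important prognostic factor.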
A comparison of the proposed approach with existing approaches was also performed; the leading technique of this research was Naïve Bayes, with an AUC of 98%, as compared with the other approaches studied in the literature.