Availability of data and materials
The dataset supporting the conclusions of this article is available in the NHANES repository, [https://wwwn.cdc.gov/nchs/nhanes/default.aspx].
To enhance the prediction and classification of MetS using ML techniques, the dataset was obtained from the National Health and Nutrition Examination Survey (NHANES). The dataset holds the 2011–2012 cohort, recording 27 attributes for a total of 1931 patients, both male and female. The class attribute indicating whether a patient has Metabolic Syndrome was binary, denoted by 0 (not a MetS patient) and 1 (MetS patient). Ages in the dataset ranged from 20 to 80, with patient information such as marital status, annual income, race, first name, and last name. Body Mass Index (BMI) and waist circumference were recorded according to the standards, i.e., g15.7 kg/m2 and g63.1 cm respectively. Data from certain blood tests were also included, namely Uric Acid, Gamma-Glutamyl Transferase (GGT), Alanine aminotransferase (ALT), Aspartate aminotransferase (AST), and Creatine Phosphokinase (CPK), with value ranges of g1.8 mg/dL, g4 U/L, g6 U/L, g9 U/L, and g15 U/L respectively [26]. Blood glucose, HDL, and triglycerides averaged around g39 mg/dL, g14 mg/dL, and g26 mg/dL respectively. Factors such as smoking history, hypertension, obesity, dyslipidemia with HDL, dyslipidemia, and hyperglycemia were also part of the dataset. Tab. 2 lists the 27 attributes in the dataset according to their respective types.
Table 2
Attributes in the dataset according to their types

| Type | Attributes |
| --- | --- |
| Patients' demographic information | seqn, age, sex, marital status, annual_income, race, fname, lname |
| Diagnostic test results | WaistCirc, BMI |
| Blood test results | albuminuria, UrAlbCr, UricAcid, GGT, ALT, AST, CPK |
| MetS factors | BloodGlucose, HDL, Triglycerides, smoking, Dyslipidemia_HDL, Dyslipidemia, Hyperglycemia, Obesity, Hypertension |
3.1 Pre-processing Step
The dataset contains personal information on patients that is insignificant for assessing and counteracting the likelihood of MetS. Therefore, non-relevant fields were removed, leaving 20 attributes to be analyzed. The dataset was in CSV format and was loaded into Weka for subsequent processing. It was also observed that the dataset was unbalanced and that, as a result, some pre-processing was necessary to increase the minority class or reduce the majority class. In this case, the minority class represents patients with MetS, whereas the majority class represents patients without MetS.
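The authors performed this step in Weka; as an illustration, the same column removal can be sketched in Python with the standard csv module. The column names below come from Tab. 2, but the exact set of removed fields is an assumption for illustration:

```python
import csv
import io

# Personal-information fields with no predictive value for MetS
# (illustrative subset of the demographic attributes in Tab. 2).
NON_RELEVANT = {"seqn", "fname", "lname", "marital status", "annual_income"}

def drop_columns(csv_text, drop):
    """Return the CSV text with the given columns removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [c for c in reader.fieldnames if c not in drop]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept)
    writer.writeheader()
    for row in reader:
        writer.writerow({c: row[c] for c in kept})
    return out.getvalue()
```

In practice the same effect is achieved in Weka with the Remove attribute filter.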
The Synthetic Minority Oversampling Technique (SMOTE) was used to balance the dataset. It generates new synthetic samples for the minority class by interpolating between each minority instance and its nearest neighbors [42]. Fig. 2 illustrates the dataset before applying SMOTE.
The initial dataset was unbalanced, with 1251 instances of class 0 (shown in blue) and 680 instances of class 1 (shown in red) for the MetS class. Balancing an unbalanced dataset is important to ensure good performance of ML algorithms in correctly classifying the minority class [26]. Fig. 2 also illustrates the dataset after applying SMOTE, in which class-1 instances increased from 680 to 1380 (shown in red). As a result, the number of instances in the dataset rose from 1931 to 2637. Fig. 3 shows the steps executed sequentially using various ML algorithms on the selected dataset for classification.
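This work used Weka's SMOTE filter; the technique itself can be sketched in a few lines of stdlib Python. The neighbour count k, the seed, and the 2-D sample data are illustrative assumptions:

```python
import random
import math

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by interpolating each minority
    sample with one of its k nearest minority-class neighbours (SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic
```

For the split reported above, 700 synthetic class-1 samples would be generated to raise the minority class from 680 to 1380.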
3.2 Feature Selection and Classification
Irrelevant attributes in a dataset can degrade the performance of ML techniques by increasing computational time and reducing accuracy [34]. Therefore, selecting the most relevant features is a critical first step in learning the correct data patterns: training on irrelevant attributes may produce poor results and lead to overfitting.
Another advantage of feature selection is that it reduces the dimensionality, i.e., the number of attributes in the training data, so ML algorithms train faster and work only with significant attributes to achieve accurate results [43]. In this research, the Wrapper method is used for feature selection. In the Wrapper method, each classifier is wrapped in a cross-validation loop, and each attribute or feature subset is then evaluated [44]. Classifier subset evaluation with a 5-fold cross-validation loop is used for the Wrapper method. Best-first search is used because it performs better [45]. The combined selected features are used for classification.
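The study used Weka's classifier-subset evaluation with best-first search; the sketch below illustrates the wrapper idea in plain Python, simplified to a greedy forward search with a nearest-centroid stand-in classifier (both are assumptions for illustration, not the actual Weka setup):

```python
import random
import statistics

def cv_accuracy(data, labels, features, folds=5, seed=0):
    """Mean k-fold cross-validated accuracy of a nearest-centroid
    classifier restricted to the given feature indices."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    chunks = [idx[i::folds] for i in range(folds)]
    accs = []
    for f in range(folds):
        test = chunks[f]
        train = [i for i in idx if i not in chunks[f]]
        # class centroids over the selected features only
        cents = {}
        for c in set(labels):
            rows = [data[i] for i in train if labels[i] == c]
            cents[c] = [statistics.mean(r[j] for r in rows) for j in features]
        correct = 0
        for i in test:
            pred = min(cents, key=lambda c: sum(
                (data[i][j] - m) ** 2 for j, m in zip(features, cents[c])))
            correct += pred == labels[i]
        accs.append(correct / len(test))
    return statistics.mean(accs)

def wrapper_select(data, labels, n_features):
    """Greedy forward wrapper: repeatedly add the feature whose inclusion
    gives the best cross-validated accuracy, until none improves it."""
    selected, best = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        score, j = max((cv_accuracy(data, labels, selected + [j]), j)
                       for j in remaining)
        if score <= best:
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected
```

The key property of the wrapper approach is visible here: each candidate feature subset is scored by actually running the classifier inside a cross-validation loop, rather than by a filter statistic.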
Feature selection with Naïve Bayes resulted in 9 attributes, whereas C4.5, Cart, and SVM resulted in 5, 7, and 7 attributes respectively. Both RF and LR selected 3 attributes. The most frequently occurring attribute was Waist Circumference. All selected attributes were combined, and out of the initial 20 attributes, 14 were retained; including the class variable, 15 attributes in total were used for classification, as described in Tab. 3. The selected algorithms were applied to the dataset, which was trained using ten-fold cross-validation.
Table 3
Selected attributes using the wrapper method

| Classifier | Selected attributes | No. of attributes |
| --- | --- | --- |
| NB | WaistCirc, UrAlbCr, CPK, BloodGlucose, Triglycerides, Hypertension, Dyslipidemia_HDL, Dyslipidemia, Obesity | 9 |
| RF | WaistCirc, BMI, UrAlbCr | 3 |
| C4.5 | WaistCirc, Hypertension, Dyslipidemia, Hyperglycemia, Obesity | 5 |
| Cart | BMI, UrAlbCr, UricAcid, GGT, ALT, CPK, Hyperglycemia | 7 |
| SVM | WaistCirc, BMI, UrAlbCr, BloodGlucose, Triglycerides, Hypertension, Dyslipidemia_HDL | 7 |
| LR | WaistCirc, BMI, UrAlbCr | 3 |
3.3 Evaluation
A confusion matrix was generated to evaluate the performance of the classifiers, and several performance parameters were calculated from it [24, 46]: Predictive Accuracy, Error Rate, True Positive Rate (TPR), False Positive Rate (FPR), True Negative Rate (TNR), and Prevalence. Results were also evaluated using precision and recall, and the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve was calculated for each classifier.
The ROC curve was plotted with FPR on the x-axis and TPR on the y-axis [46]. In ROC space, the point (0,0) corresponds to a classifier that labels every instance as negative, while (1,1) corresponds to one that labels every instance as positive; a classifier's curve should pass near the point (0,1) to indicate good performance [47]. AUC is a summary measure of the ROC curve used to evaluate classifier performance [48]. AUC values normally lie between 0.5 and 1: a value at or near 1 indicates a good classification, whereas a value at or near 0.5 indicates random or unacceptable classification results [49]. Performance results for the selected classifiers were compared for better MetS prediction, and the results indicated Naïve Bayes as the outperforming technique for the stated purpose.
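The listed parameters can be computed directly from a confusion matrix, and AUC can be computed from ranking scores without plotting the curve; a minimal stdlib-Python sketch (function names are illustrative):

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, error rate, TPR, FPR, TNR, prevalence and precision
    from binary labels (1 = MetS patient, 0 = not a MetS patient)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    n = tp + tn + fp + fn
    return {
        "accuracy":   (tp + tn) / n,
        "error_rate": (fp + fn) / n,
        "tpr":        tp / (tp + fn),  # recall / sensitivity
        "fpr":        fp / (fp + tn),
        "tnr":        tn / (fp + tn),  # specificity
        "prevalence": (tp + fn) / n,
        "precision":  tp / (tp + fp),
    }

def auc(y_true, scores):
    """AUC as the probability that a random positive is scored above a
    random negative (ties count half) - equal to the area under ROC."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

The rank-statistic form of AUC avoids choosing a score threshold, which is why it summarizes the whole ROC curve in a single value.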
According to the literature, many different methods have been used to identify prognostic factors; the most common is the use of trees formed by the decision tree or by its derivative techniques, such as Cart [50]. The proposed approach used the decision tree formed by classifier C4.5 to identify and analyze the prognostic factors for MetS. The factor at the root position is considered the most important; in the current research, 'Hyperglycemia' was identified as the root factor. Together with the tree combination formed from Dyslipidemia_HDL, Hypertension, and Obesity, the result of the proposed approach corresponds with the WHO definition.
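How the root factor is chosen can be illustrated with a small sketch: C4.5 places at the root the attribute with the highest gain ratio, approximated below by plain information gain over categorical attributes (a simplification of the full C4.5 criterion):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting on the categorical attribute attr."""
    base = entropy(labels)
    n = len(rows)
    by_value = {}
    for r, y in zip(rows, labels):
        by_value.setdefault(r[attr], []).append(y)
    return base - sum(len(ys) / n * entropy(ys) for ys in by_value.values())

def root_attribute(rows, labels, attrs):
    """The attribute the tree would place at the root: the one whose
    split yields the largest reduction in class entropy."""
    return max(attrs, key=lambda a: info_gain(rows, labels, a))
```

An attribute that cleanly separates MetS from non-MetS cases maximizes this reduction, which is why the root of the induced tree is read as the most important prognostic factor.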
A comparison of the proposed approach with existing approaches was also performed; the leading technique of this research was Naïve Bayes, with an AUC of 98%, as compared with the other approaches studied in the literature.