Diabetes mellitus (DM) is a group of disorders that share hyperglycemia as a common feature and is classified by pathogenesis. Type 1 DM is characterized by insulin deficiency, whereas type 2 DM is a heterogeneous collection of disorders characterized by variable degrees of insulin resistance, impaired insulin secretion, and excessive hepatic glucose production. Type 2 DM is an ongoing pandemic and among the most critical public health problems. (13) Many risk factors are associated with diabetes, including age, obesity, lack of physical activity, family history of diabetes, a fat-rich diet, and high blood pressure. Screening every three years is recommended for individuals over 45 years of age and for younger people who have risk factors or a body mass index of 25 kg/m² or more. (14) The role of insulin in glucose homeostasis is well established. (15) Several studies used plasma glucose and insulin concentrations, the same measures used to define diabetes, as predictors. (16) The present study classifies subjects into diabetic categories based on glycated hemoglobin and uses other demographic, clinical, and laboratory parameters as predictors, excluding plasma glucose and insulin concentrations. In contrast to the binary classification used in most studies, the present study divided subjects into three response categories: diabetic, prediabetic, and non-diabetic. The subspace discriminant algorithm classified diabetics best, with a specificity of 94% and an AUC of 0.70.
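For context on these metrics: with three response categories, the specificity and AUC reported for the diabetic class are typically computed one-vs-rest, treating diabetic as the positive class and the other two classes as negative. A minimal pure-Python sketch with hypothetical labels and scores (not the study's data):

```python
def specificity(y_true, y_pred, positive):
    """One-vs-rest specificity for one class: TN / (TN + FP)."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    return tn / (tn + fp)

def auc_one_vs_rest(y_true, scores, positive):
    """One-vs-rest AUC via the Mann-Whitney rank formulation."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels: D = diabetic, P = prediabetic, N = non-diabetic
y_true = ["D", "P", "N", "D", "N", "P", "N", "D"]
y_pred = ["D", "D", "N", "D", "N", "P", "P", "P"]
scores = [0.9, 0.7, 0.1, 0.8, 0.2, 0.3, 0.5, 0.6]  # model's estimated P(diabetic)

print(specificity(y_true, y_pred, "D"))                 # 0.8
print(round(auc_one_vs_rest(y_true, scores, "D"), 3))   # 0.933
```

The AUC here is the fraction of (positive, negative) score pairs ranked correctly, which is equivalent to the area under the ROC curve.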
Zheng et al. extracted features of 300 patients from an electronic health record repository spanning 2012 to 2014. They applied machine learning models, including k-nearest neighbors, naïve Bayes, decision tree, random forest, support vector machine, and logistic regression, to predict type 2 diabetes mellitus. The learning methods achieved high identification performance (∼0.98 average AUC) compared to a state-of-the-art algorithm (0.71 AUC). (17) Maniruzzaman et al. conducted a study on 768 subjects (268 diabetic and 500 controls) to classify them into diabetic and non-diabetic categories. Because medical data exhibit non-linearity, non-normality, and an inherent correlation structure, the researchers used a Gaussian process (GP)-based classification technique with linear, polynomial, and radial basis function kernels. The model's accuracy, sensitivity, and specificity were 81.97%, 91.79%, and 63.33%, respectively. Compared to naïve Bayes, linear discriminant analysis, and quadratic discriminant analysis models, the GP-based model performed better. (18) Although the accuracy and sensitivity of the GP-based model were higher, its specificity was low. In a study using the Pima Indian dataset from the UCI repository, involving 768 females at least 21 years of age, Mercaldo et al. trained classifiers on eight feature vectors: number of times pregnant, plasma glucose concentration at 2 hours in an oral glucose tolerance test, diastolic blood pressure (mm Hg), triceps skinfold thickness (mm), 2-hour serum insulin (mu U/ml), body mass index (kg/m²), diabetes pedigree function, and age (in years). Six machine learning classification algorithms were used: J48, multilayer perceptron (a deep learning algorithm), Hoeffding tree, JRip, BayesNet, and random forest. Performance was evaluated using precision, recall, F-measure, and ROC area.
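All of the metrics reported in these studies derive from the confusion matrix: accuracy is the fraction of correct predictions, sensitivity (recall) is the true-positive rate, and specificity is the true-negative rate. A brief sketch using illustrative counts (not taken from any of the cited studies):

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity (recall), and specificity from 2x2 confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)   # true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    return accuracy, sensitivity, specificity

# Illustrative counts only, chosen for the example
acc, sens, spec = binary_metrics(tp=46, fn=4, tn=19, fp=11)
print(acc, sens, spec)   # 0.8125, 0.92, ~0.63
```

A pattern like this one, with high sensitivity but low specificity, mirrors the trade-off noted for the GP-based model above: most diabetics are caught, but many controls are misclassified.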
They found the best precision (0.770) and recall (0.775) using the Hoeffding tree algorithm to predict the onset of type 2 diabetes mellitus within five years in Pima Indian women. (19) Zhang et al. tested the ability of machine learning algorithms to predict the risk of type 2 diabetes mellitus (T2DM) in a rural Chinese population, focusing on 36,652 eligible participants from the Henan Rural Cohort Study. Six machine learning classifiers were used: logistic regression, classification and regression tree, artificial neural networks, support vector machine, random forest, and gradient boosting machine (GBM). Among the top ten variables across all methods were sweet flavor, urine glucose, age, heart rate, creatinine, waist circumference, uric acid, pulse pressure, insulin, and hypertension; the study thus identified new important risk factors such as urinary indicators and sweet flavor. The GBM model performed best, with an AUC of 0.872 with laboratory data and 0.817 without. (20) Using the National Health and Nutrition Examination Survey (NHANES) dataset, machine learning models including logistic regression, support vector machines, random forest, and gradient boosting were evaluated on their classification performance. The models were then combined into a weighted ensemble model that leverages the performance of the disparate models to improve detection accuracy. The information gained from tree-based models was used to identify the key variables within the patient data that contributed to the detection of at-risk patients in each disease class. In diabetes classification (based on 123 variables), the eXtreme Gradient Boosting (XGBoost) model achieved an area under the ROC curve (AU-ROC) of 86.2% without laboratory data and 95.7% with laboratory data.
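A weighted ensemble of this kind can be sketched as a weighted average of each model's predicted probabilities. The probabilities and weights below are assumed values for illustration, not the NHANES study's fitted models:

```python
def weighted_ensemble(prob_lists, weights):
    """Weighted average of per-model positive-class probabilities."""
    total = sum(weights)
    return [sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
            for i in range(len(prob_lists[0]))]

# Hypothetical per-patient probabilities from three base models (assumed values)
p_logreg = [0.20, 0.70, 0.55, 0.90]
p_rf     = [0.10, 0.80, 0.45, 0.95]
p_gbm    = [0.30, 0.60, 0.65, 0.85]

# Weights could, for example, be set proportional to each model's validation AUC
ensemble = weighted_ensemble([p_logreg, p_rf, p_gbm], weights=[0.80, 0.85, 0.90])
labels = [int(p >= 0.5) for p in ensemble]
print(labels)   # [0, 1, 1, 1] -- at-risk flags at a 0.5 threshold
```

The design choice is that stronger base models pull the combined probability harder, so the ensemble can outperform any single constituent when the models make uncorrelated errors.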
The ensemble model had the top AU-ROC score of 73.7% for prediabetic patients without laboratory data, and XGBoost performed best at 84.4% with laboratory data. The top five predictors in diabetes patients were waist size, age, self-reported weight, leg length, and sodium intake. (21) Kandhasamy et al. compared the performance of several machine learning algorithms, including the J48 decision tree, k-nearest neighbors, random forest, and support vector machines, and concluded that the J48 decision tree classifier achieved a higher accuracy (73.82%) than the other classifiers. (22) Abbas et al. used the San Antonio Heart Study data to develop a type 2 diabetes prediction model using support vector machines with 10-fold cross-validation. The results showed 84.1% accuracy with a recall of 81.1%, averaged over 100 iterations. (23)
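The k-fold cross-validation used in the last study partitions the data into k disjoint folds, training on k−1 folds and testing on the held-out fold in turn, so every sample is used for testing exactly once. A minimal sketch of the index splitting (illustrative, not the study's implementation):

```python
import random

def cv_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)           # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]      # k disjoint folds
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

splits = list(cv_splits(50, k=10))
print(len(splits))         # 10 folds
print(len(splits[0][1]))   # 5 held-out samples per fold
print(len(splits[0][0]))   # 45 training samples per fold
```

Averaging a metric such as recall across the k held-out folds (and, as Abbas et al. did, across repeated iterations) yields a more stable performance estimate than a single train/test split.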