Data were obtained from a longitudinal observational study of diabetes and its related conditions conducted by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), Phoenix Branch, in an American Indian (AI) community in the southwestern United States (US) over a 43-year period from 1965 to 2007. Children 5 years of age or older and adults were invited to participate in comprehensive research examinations biennially during the study. Data collected included anthropometric measurements, clinical data, and biochemical tests. The current analysis included children and adolescents who had their first non-diabetic exam with a complete set of relevant clinical and biochemical metabolic risk measures between 5 and < 21 years of age, and at least one follow-up exam before their 55th birthday. There were 3,415 children and adolescents who had data on all baseline parameters. Of these, 75 were diagnosed with diabetes prior to or at their baseline research exam and were excluded. Another 1,291 did not have a follow-up research examination and were not eligible for inclusion because their future diabetes status could not be determined. This yielded a final dataset of 2,049 unique pediatric participants. Among these, data on maternal diabetes status were available for 1,965 participants and fasting insulin levels for 1,978.
The primary outcome variable was a diagnosis of diabetes. In this AI population, diabetes is overwhelmingly type 2 diabetes (T2D), and mutations in the Maturity-Onset Diabetes of the Young (MODY) genes are not a significant cause of diabetes in youth. 15, 16 Diabetes was defined according to the American Diabetes Association (ADA) standard criteria: fasting plasma glucose (FPG) ≥ 126 mg/dL (7.0 mmol/L), 2-hour plasma glucose (2hPG) ≥ 200 mg/dL (11.1 mmol/L), HbA1c ≥ 6.5% (48 mmol/mol), or a previous clinical diagnosis. 16 Follow-up time for diabetes depended on whether a child developed diabetes before 55 years of age. If the child did not develop diabetes before age 55, the follow-up time in years was calculated from the baseline measurement to the last research examination before age 55 years. All laboratory testing was performed at the NIDDK Phoenix Clinical Laboratory Improvement Amendments (CLIA)-certified laboratory throughout the study period.
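The diagnostic rule above can be sketched as a small function. This is only an illustration of the thresholds quoted in the text, not the study's code; the function name, argument names, and the choice to skip missing measurements are assumptions.

```python
from typing import Optional

# Thresholds quoted in the text (ADA criteria, conventional units).
FPG_THRESHOLD_MG_DL = 126.0        # fasting plasma glucose
TWO_HR_PG_THRESHOLD_MG_DL = 200.0  # 2-hour plasma glucose
HBA1C_THRESHOLD_PCT = 6.5          # glycated hemoglobin


def has_diabetes(
    fpg_mg_dl: Optional[float] = None,
    two_hr_pg_mg_dl: Optional[float] = None,
    hba1c_pct: Optional[float] = None,
    prior_clinical_diagnosis: bool = False,
) -> bool:
    """Return True if any one criterion is met.

    Assumption: a measurement that is missing (None) is simply skipped.
    """
    if prior_clinical_diagnosis:
        return True
    checks = [
        (fpg_mg_dl, FPG_THRESHOLD_MG_DL),
        (two_hr_pg_mg_dl, TWO_HR_PG_THRESHOLD_MG_DL),
        (hba1c_pct, HBA1C_THRESHOLD_PCT),
    ]
    return any(value is not None and value >= cutoff for value, cutoff in checks)
```

For example, `has_diabetes(fpg_mg_dl=110.0, hba1c_pct=6.7)` is true because the HbA1c criterion alone is sufficient.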
The Institutional Review Board of the National Institutes of Health (NIH) approved the study (Protocol ID: OH76DK0256). Written informed consent was obtained from parents at study initiation and assent was obtained from the children. The study was performed in accordance with relevant guidelines and regulations put forth by the NIH.
Predictors and preprocessing of data:
The metabolic risk predictors were selected based on the IDF’s consensus on MetS diagnosis in children and adolescents published in 2007. 10 These included age-sex-adjusted waist circumference percentile, blood pressure (BP), fasting plasma glucose, serum triglycerides, and high-density lipoprotein (HDL) cholesterol. Additional diabetes risk predictor variables were explored in this large pediatric cohort for developing the ML analytical models. These included demographic information (age, sex, and history of maternal diabetes), anthropometric measurements (height, weight, and body mass index [BMI]), and biochemical tests (2-hour plasma glucose following an oral glucose tolerance test with a 75-g oral glucose load [2-hr OGTT], glycated hemoglobin [HbA1c], and serum total cholesterol). Of the 17 predictor variables, only 2 had missing values: fasting insulin in 3.4% and history of maternal diabetes in 4.0% of cases. For fasting insulin, missing values were imputed as the mean of the non-missing values. Age-sex-adjusted BMI z-scores were computed using the CDC computer program and the 2000 CDC growth charts for children between 24 and 239 months of age (cdc.gov/growthcharts/computer_programs.htm, accessed October 4, 2016). The “modified z-score” is similar to the usual z-score (distance from the mean in standard deviation units); however, unlike the commonly used unmodified z-score provided by the CDC program, it does not compress the frequency distribution of high z-scores such that very few have values > 3. 17 Diabetes incidence per thousand person-years was calculated using the number of incident cases of diabetes and person-years of follow-up through age 55 years.
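Two of the preprocessing steps described above, mean imputation and the incidence rate, reduce to short calculations. The sketch below uses hypothetical function names and illustrative numbers only; it does not reproduce the study's data or software.

```python
from statistics import mean
from typing import List, Optional, Sequence


def impute_mean(values: Sequence[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def incidence_per_1000_person_years(cases: int, person_years: float) -> float:
    """Incident cases divided by person-years of follow-up, scaled to 1,000."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return 1000.0 * cases / person_years


# Illustrative values, not study data.
insulin = impute_mean([10.0, None, 20.0])         # the gap is filled with 15.0
rate = incidence_per_1000_person_years(5, 1000.0)  # 5 cases per 1,000 person-years
```

Mean imputation keeps the variable's mean unchanged but shrinks its variance slightly, which is usually tolerable when only a small share of values is missing.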
Classification schemes are commonly used in data mining with the objective of creating a model that predicts the value of a target (or dependent variable) based on the values of several inputs or independent variables. 18, 19 The 17 known metabolic risk variables obtained from the baseline non-diabetic exam of the 2,049 children in the study cohort were used to build the ML models. The dataset was randomly split into training (70%) and testing (30%) sets. Based on a review of relevant published information, several classification algorithms were initially applied to the dataset to determine the best classifier for building a predictive model of future diabetes. The following classification schemes were initially evaluated: neural network (NN), k-nearest neighbor (KNN), random forest (RF), extreme gradient boosting (XGBoost), support vector machine (SVM), linear support vector machine (LSVM), and chi-square automated interaction detection (CHAID).
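A random 70/30 partition of the kind described above can be sketched as follows. The function name, the fixed seed, and the rounding rule are assumptions made for illustration; the source does not report how the modeling software performed the split or the exact set sizes.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    rows: Sequence[T], train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# With 2,049 participant IDs this rounding gives 1,434 training and 615 testing rows.
train_ids, test_ids = split_train_test(list(range(2049)))
```

Every participant lands in exactly one of the two sets, so no baseline exam is used for both fitting and evaluation.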
In predictive data mining, combining the predictions from multiple independent algorithms has been shown to reduce the generalization error and often yields more accurate predictions than any single method. 20 This is particularly useful when the algorithms differ from one another, as they do in this project. After the individual classification algorithms were built and evaluated as described above, their results were combined in an ensemble by stacking the predictions of the 5 best algorithms, selected on accuracy with a minimum area under the curve (AUC) of > 0.8, using the confidence-weighted voting method. To avoid overfitting and decrease variance error, bootstrapping was used in the training dataset. To compare the predictive accuracy of traditional regression models with that of ML-based models, a forward stepwise logistic regression (LR) model was built to predict incident diabetes using the same 17 metabolic risk variables.
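The voting step of such an ensemble can be illustrated with a short sketch. It assumes the common reading of confidence-weighted voting, in which each model's confidence is added to the tally of the class it predicts and the class with the largest tally wins; the modeling software's exact implementation may differ, and the labels and confidence values below are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def confidence_weighted_vote(
    predictions: Iterable[Tuple[Hashable, float]]
) -> Tuple[Hashable, float]:
    """Combine (label, confidence) pairs from several models.

    Returns the winning label and its share of the summed confidence.
    """
    totals: Dict[Hashable, float] = defaultdict(float)
    for label, confidence in predictions:
        totals[label] += confidence
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("need at least one prediction with positive confidence")
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / grand_total


# Five hypothetical base models scoring one participant.
label, share = confidence_weighted_vote(
    [("diabetes", 0.90), ("no_diabetes", 0.60), ("no_diabetes", 0.55),
     ("diabetes", 0.80), ("diabetes", 0.51)]
)
```

Unlike a simple majority vote, one highly confident model can outweigh two lukewarm ones: `[("a", 0.95), ("b", 0.5), ("b", 0.4)]` resolves to `"a"`.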
Predictor importance for the RF and LR classification models was computed by calculating the reduction in variance of the target variable attributable to each predictor, via a sensitivity analysis. The software automatically calculates this sensitivity index and generates a graph of the top 10 predictors in order of decreasing sensitivity index (importance). 21 Predictor importance based on the Wald statistics from the logistic regression analysis was also computed.
Sample size analysis:
Given the large number of observations (> 2,000) available for the training and testing datasets, and considering the number of input variables, the sample size was adequate for representative model building. It has been shown that predictive accuracy for binary outcome variables is higher with modern ML-based modelling techniques, and that approximately 20–50 observations per variable are required to obtain a stable AUC with LR and other ML-based models. 22 Only 8.7% of patients in this cohort developed diabetes during the follow-up period, and this imbalance between the classes can affect the performance of some classification algorithms. To optimize model performance, the training dataset was balanced by generating new samples in the under-represented minority class using the built-in random minority oversampling feature of the modeling software in a 10:1 ratio. 23
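Random minority oversampling, as described above, duplicates randomly chosen minority-class rows in the training set. The sketch below is a generic version, not the software's built-in feature. The `factor` parameter is hypothetical, since the source does not spell out how the 10:1 ratio maps onto software settings. Under one possible reading, a tenfold boost of a class that makes up 8.7% of rows would bring it close to the size of the 91.3% majority.

```python
import random
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], int]  # (features, class label)


def oversample_minority(
    rows: Sequence[Row], minority_label: int, factor: float, seed: int = 0
) -> List[Row]:
    """Grow the minority class to about `factor` times its original size.

    Extra rows are drawn with replacement from the existing minority rows,
    so no new feature values are invented.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    rng = random.Random(seed)
    minority = [row for row in rows if row[1] == minority_label]
    n_extra = round((factor - 1) * len(minority))
    extra = rng.choices(minority, k=n_extra) if minority else []
    balanced = list(rows) + extra
    rng.shuffle(balanced)  # avoid leaving all duplicates at the end
    return balanced


# Toy training set: 10 majority rows (label 0) and 2 minority rows (label 1).
toy = [([0.0], 0)] * 10 + [([1.0], 1)] * 2
balanced_toy = oversample_minority(toy, minority_label=1, factor=5.0)
```

Oversampling is applied to the training data only; the testing set keeps its original class proportions so that reported performance reflects the real outcome frequency.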
Demographic and clinical characteristics were summarized as means and standard deviations (SD), or as medians and interquartile ranges (IQR: Q3-Q1) for non-Gaussian variables. Categorical variables were compared using chi-squared tests (or Fisher’s exact tests for cell counts < 5). Continuous variables were compared using the unpaired Student’s t test, and medians using the Mann–Whitney U test. Statistical significance was assessed at p < 0.05.
Data mining and machine learning models were built and evaluated using IBM SPSS Modeler, version 18.2 (IBM Corp., Armonk, N.Y., USA).