Sample
Data were obtained from electronic medical records (EMRs) in Epic from Sanford Health, a not-for-profit rural healthcare system that primarily serves South Dakota, North Dakota, northern and southwestern Minnesota, northwestern Iowa, and parts of Nebraska. Sanford Health includes roughly 44 hospitals, 1,382 physicians, and 9,703 nurses delivering care in more than 80 specialty areas. All data were de-identified according to the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor de-identification method (§ 164.514(b)(2)). The dataset included records from all patients who visited a Sanford healthcare facility between January 1, 2014 and December 30, 2016 (N=1,143,028). Only adult patients (age≥18; N=875,168) with a diagnosis of diabetes (ICD-10 codes E10.xx and E11.xx; N=67,575) were included in the current study. Further, only patients who reported a residential zip code in Minnesota (MN), North Dakota (ND), or South Dakota (SD) were included (N=63,781), owing to low sample sizes in other states. Finally, patients with missing data on the outcome variable of unplanned medical visits or on any of the predictor variables were excluded, yielding a final sample of N=43,831.
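The inclusion criteria above amount to a sequence of filters on the patient table. A minimal sketch in Python/pandas (the study's analyses were run in R, and the column names here are hypothetical, not Sanford's actual schema):

```python
import pandas as pd

# Hypothetical EMR extract; column names are illustrative assumptions.
patients = pd.DataFrame({
    "age": [45, 17, 62, 33],
    "icd10": ["E11.9", "E10.1", "I10.0", "E10.5"],
    "state": ["SD", "ND", "MN", "IA"],
})

# Inclusion criteria from the text: adults (age >= 18), diabetes
# diagnosis (ICD-10 E10.xx or E11.xx), residence in MN/ND/SD.
cohort = patients[
    (patients["age"] >= 18)
    & patients["icd10"].str.match(r"E1[01]\.")
    & patients["state"].isin(["MN", "ND", "SD"])
]
```

In this toy table only the first row survives all three filters; in the actual dataset these criteria reduced N=1,143,028 visits to the final N=43,831 after missing-data exclusions.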
Measures
The outcome was any vs. no unplanned medical visits during the 3-year period over which EMR data were collected. This was derived from four separate variables: emergency department visits, hospitalizations, hospital observations, and urgent care visits. All four types of visits were summed and dichotomized at ≥1 vs. 0 unplanned medical visits.
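The outcome derivation described above can be sketched as follows (Python/pandas sketch; column names are illustrative assumptions, and the actual analyses were run in R):

```python
import pandas as pd

# Illustrative per-patient counts of the four unplanned visit types.
visits = pd.DataFrame({
    "ed_visits": [0, 2, 0],
    "hospitalizations": [0, 0, 1],
    "observations": [0, 0, 0],
    "urgent_care": [0, 1, 0],
})

# Sum the four visit types and dichotomize at >=1 vs. 0 unplanned visits.
total = visits.sum(axis=1)
visits["any_unplanned"] = (total >= 1).astype(int)
```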
Predictor variables included all numeric variables that were common and readily available in Sanford’s EMRs. Ten variables were selected and are described in detail below.
Age was measured in years at the time of the initial analyses (12/1/2016).
Body mass index (BMI) was obtained from the EMR as kg/m2. Extreme values (<15 or >60) were assumed to be errors and were set to missing. Values from the most recent visit in the 3-year period were used.
Blood pressure (BP) was obtained in mm Hg. Values from the most recent visit in the 3-year period were used. Systolic BP and diastolic BP were included as two separate variables.
Serum cholesterol was obtained as both low-density lipoprotein (LDL) and high-density lipoprotein (HDL) in mg/dL. Extreme values in HDL (<10 or >100) or LDL (<20 or >200) were assumed to be errors and were set to missing. Values from the most recent laboratory result were used. LDL and HDL were analyzed as two separate variables.
Glycohemoglobin (A1C) was measured from the most recent laboratory result. A1C values below 4 or above 15 were assumed to be errors and were set to missing.
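The plausibility-range cleaning described for BMI, HDL, LDL, and A1C can be applied in a single pass. A sketch (Python/pandas; the range table restates the cutoffs above, but variable names and data values are assumptions):

```python
import numpy as np
import pandas as pd

# Plausibility ranges from the text; values outside a range are
# assumed to be data-entry errors and set to missing.
RANGES = {"bmi": (15, 60), "hdl": (10, 100), "ldl": (20, 200), "a1c": (4, 15)}

labs = pd.DataFrame({
    "bmi": [27.0, 72.0],   # 72 kg/m2 is implausible -> missing
    "hdl": [55.0, 5.0],
    "ldl": [110.0, 250.0],
    "a1c": [6.8, 3.0],
})

for col, (lo, hi) in RANGES.items():
    # Keep values inside [lo, hi]; replace the rest with NaN.
    labs[col] = labs[col].where(labs[col].between(lo, hi), np.nan)
```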
Ranked smoking status was obtained by patient self-report as a vital sign on their most recent visit. A ranked variable was created as follows from the several possible response categories, with higher values indicating more smoke exposure: never smoker (0), passive smoker (1), former smoker (2), current some day smoker (3), current every day smoker, light tobacco smoker, or heavy tobacco smoker (4).
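The ranked smoking variable amounts to a lookup from response category to ordinal rank. A sketch (Python; the category strings follow the text, but the exact labels stored in the EMR are assumptions):

```python
# Mapping from self-reported smoking status to the ranked variable;
# higher values indicate more smoke exposure.
SMOKE_RANK = {
    "never smoker": 0,
    "passive smoker": 1,
    "former smoker": 2,
    "current some day smoker": 3,
    "current every day smoker": 4,
    "light tobacco smoker": 4,
    "heavy tobacco smoker": 4,
}

def rank_smoking(status: str) -> int:
    """Return the ordinal smoke-exposure rank for a status string."""
    return SMOKE_RANK[status.lower()]
```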
Number of diagnoses on “problem list” was derived from the most recently available list over the 3-year period.
Number of prescriptions was aggregated over the 3-year period and used as a numeric variable.
Analyses
Machine learning.
All analyses predicted the unplanned medical visit status of each patient (i.e., which patients had at least one vs. no unplanned medical visits in the 3-year period), and this classification task was based on the 10 EMR variables above (age, BMI, systolic and diastolic BP, HDL and LDL cholesterol, A1C, ranked smoking status, number of diagnoses on the patient’s “problem list,” and number of prescriptions in the 3-year period). Three types of machine learning were utilized: discriminant analysis (linear and quadratic), support vector machines (SVMs; linear and radial basis kernels), and artificial neural nets (NNs; single and double hidden layer). R software (17) was used for all analyses, including the packages MASS for discriminant analysis,(18) e1071 for SVMs,(19) nnet for single-layer NNs,(18) and deepnet for double-layer NNs.(20) A logistic regression was run for purposes of comparing machine learning results with conventional prediction approaches.
Cross-validation testing.
Because classifiers are susceptible to overtraining (i.e., the classifier predicts the training dataset with high accuracy but fits noise and thus has not learned patterns that generalize to other datasets), cross-validation testing is important for identifying models that have learned patterns truly relevant to the prediction task. Cross-validation testing is performed by partitioning all available observations into a training set and a testing set; the classifier is trained on the data from the training set, and the generalization of the prediction task learned by the classifier is tested using the data from the testing set. In this study, repeated random-subsampling cross-validation was used: a randomly selected 10% of observations was withheld as the testing set, and this procedure was iterated (with different random selections of the testing set) 1,000 times.
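The repeated random-subsampling procedure can be sketched as follows (Python/NumPy; the data are synthetic, and a trivial threshold rule stands in for the actual classifiers, which were fit in R):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the 10 EMR features and dichotomous outcome.
X = rng.normal(size=(200, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

def holdout_accuracy(X, y, test_frac=0.10, n_iter=100):
    """Repeated random-subsampling cross-validation: withhold a random
    10% as the testing set, 'train' on the remainder, record testing
    accuracy, and repeat with a fresh random split each iteration.
    A threshold rule on the first feature stands in for the classifiers."""
    n = len(y)
    n_test = int(n * test_frac)
    accs = []
    for _ in range(n_iter):
        idx = rng.permutation(n)
        test, train = idx[:n_test], idx[n_test:]
        thr = X[train, 0].mean()          # fit using the training split only
        pred = (X[test, 0] > thr).astype(int)
        accs.append((pred == y[test]).mean())
    return float(np.mean(accs))
```

Because the threshold is chosen on the training split and accuracy is scored on the withheld split, the returned value estimates generalization rather than fit to noise.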
Both training accuracy (prediction accuracy on the training set) and generalization accuracy (prediction accuracy on the testing set) were assessed using confusion matrices. SVMs and NNs were optimized by running several iterations over different parameter values: for SVMs, the cost parameter was varied from 0.1-10, and for radial SVMs the gamma parameter was varied from 0.001-0.5; for single-layer NNs, the size of the hidden layer was varied from 1-20, the number of training iterations from 100-200, and the decay parameter from 0-0.9; and for double-layer NNs, the size of each hidden layer was varied from 0-20, the learning rate from 0-1, the momentum of the learning rate from 0-1, and the number of training iterations from 10-20. For each classifier, the model with the highest generalization accuracy is reported. Since there are two possible categories, chance performance is 50%. Accuracy vs. chance was measured using a binomial test of the success rate out of the 1,000 cross-validation iterations.
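The accuracy-vs-chance comparison is an exact one-sided binomial test. A sketch of the computation (Python; the function name is illustrative, and any statistical package's binomial test would serve):

```python
from math import comb

def binomial_p_value(successes: int, n: int, p: float = 0.5) -> float:
    """One-sided exact binomial test: probability of observing at least
    `successes` correct outcomes in `n` trials under chance rate `p`
    (p = 0.5 for a two-category classification task)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(successes, n + 1))
```

For example, 10 successes in 10 trials under p = 0.5 yields a p-value of 0.5**10, i.e., about 0.001.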
Sensitivity testing.
In order to derive clinical implications from the predictive model, it is valuable to know which variables are most strongly predictive of unplanned medical visits. Although importance for prediction does not necessarily indicate causality, many of the modifiable predictors (A1C, BMI, BP, cholesterol, smoking) do have plausible causal effects on diabetes and its complications. If risky values of these modifiable predictors are important for prediction (through a causal mechanism or an association), then removing risky values should disrupt prediction accuracy. Thus, to determine which modifiable variables are most strongly indicative of unplanned medical visits, a variant of sensitivity testing was performed: for one variable at a time, the dataset was restricted to observations within the normal or healthy range, and the disruption in the model’s generalization accuracy when predicting on this restricted dataset was assessed. Restrictions to the normal/healthy range were based on current guidelines, namely BMI < 30,(21) BP < 120/80,(22) ranked smoking status < 2 (indicating never smoker or passive smoker), LDL < 130, HDL > 50,(23) and A1C < 6.5.(24) Larger disruptions to the generalization accuracy as a result of restricting a variable to its healthy range indicate a greater importance of that variable to the prediction task, and potentially as a clinical target for intervention.
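The sensitivity-testing mechanics can be sketched as follows (Python/NumPy; the data, the "fitted model," and the single A1C cutoff are synthetic stand-ins for the trained classifiers and the full set of guideline restrictions above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: an A1C-like feature, an outcome it predicts,
# and a fitted "model" reduced to a simple threshold rule.
a1c = rng.normal(7.5, 1.5, 1000)
y = (a1c + rng.normal(scale=1.0, size=1000) > 7.5).astype(int)

def predict(x):
    """Stand-in for the trained classifier."""
    return (x > 7.5).astype(int)

# Generalization accuracy on the full dataset.
full_acc = (predict(a1c) == y).mean()

# Sensitivity test: restrict to the healthy range (here A1C < 6.5)
# and measure the disruption in accuracy on the restricted data.
healthy = a1c < 6.5
restricted_acc = (predict(a1c[healthy]) == y[healthy]).mean()
disruption = full_acc - restricted_acc
```

A larger `disruption` for a given variable would mark it as more important to the prediction task, and thus a candidate clinical target.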