In this study we trained a diverse set of Machine Learning (ML) models to classify painful versus painless diabetic peripheral neuropathy (DPN) in three datasets: deeply phenotyped clinical cohorts developed at the University of Oxford (5), the Technion – Israel Institute of Technology, and Imperial College London. We then externally validated these models in a questionnaire-phenotyped population cohort developed at the University of Dundee (15). We followed the TRIPOD reporting guidelines (36) (Supplementary Figure 1).
Datasets
All data used in this study were generated using the DOLORisk study protocol, which has been described elsewhere (14). The total sample size used for training and validating these models was 1,230 people with diabetes mellitus, predominantly type 2.
Three large, deeply phenotyped, cross-sectional cohorts (DOLORisk Imperial College London, PINS – University of Oxford (5) and DOLORisk Technion – Israel Institute of Technology; N = 935) were used to train the models and estimate their performance using 5-times repeated 10-fold cross-validation. The training datasets had deep clinical phenotyping: participants were first screened for clinical neuropathy based on symptomatology, and DPN was confirmed by abnormalities of nerve conduction studies or intraepidermal nerve fibre density (IENFD) (37). Neuropathic pain (NeuP; "pain caused by a lesion or disease of the somatosensory system") was determined at the time of the clinical assessment according to the grading system of the NeuP Special Interest Group (NeuPSIG) of the International Association for the Study of Pain (IASP) (38).
According to the NeuPSIG grading system, neuropathic pain is pain with:
1. a distinct neuroanatomically plausible distribution, i.e., pain symmetrically distributed in the extremities;
2. a history suggestive of a relevant lesion or disease affecting the peripheral or central somatosensory system: a diagnosis of diabetes mellitus and a history of neuropathic symptoms, including decreased sensation and positive sensory symptoms, e.g., burning, aching pain mainly in the toes, feet, or legs;
3. demonstration of the distinct neuroanatomically plausible distribution by at least one confirmatory test: presence of clinical signs of peripheral neuropathy, i.e., decreased distal sensation or decreased/absent ankle reflexes;
4. demonstration of the relevant lesion or disease by at least one confirmatory test: abnormality on either the nerve conduction tests or IENFD.
Possible neuropathic pain fulfils criteria 1 and 2. Probable neuropathic pain fulfils criteria 1, 2 and 3. Definite neuropathic pain fulfils all 4 criteria.
Participants with chronic (> 3 months) probable or definite NeuP were assigned to the NeuP group, and participants with possible neuropathic pain were excluded. Participants with no pain or with non-neuropathic pain in the extremities were included in the no-NeuP group (5,39).
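As an illustration only, the grading rules and the group assignment above can be written as a short decision rule. The sketch below is in R; the input flags (criterion1 to criterion4, chronic_pain) are hypothetical names standing for the four NeuPSIG criteria and the > 3 months chronicity check, not variables from the DOLORisk datasets.

```r
# Hedged sketch of the NeuPSIG grading and group assignment; all variable
# names are hypothetical placeholders for the criteria described above.
grade_neup <- function(criterion1, criterion2, criterion3, criterion4) {
  if (criterion1 && criterion2 && criterion3 && criterion4) "definite"
  else if (criterion1 && criterion2 && criterion3)          "probable"
  else if (criterion1 && criterion2)                        "possible"
  else                                                      "unlikely"
}

assign_group <- function(grade, chronic_pain) {
  # Chronic (> 3 months) probable/definite NeuP forms the NeuP group;
  # possible NeuP is excluded; everyone else is in the no-NeuP group.
  if (grade %in% c("probable", "definite") && chronic_pain) "NeuP"
  else if (grade == "possible") NA_character_  # excluded from analysis
  else "no NeuP"
}
```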
Models were externally validated in the independent GoDARTS–DOLORisk Dundee (40) dataset, a clinical cohort phenotyped for pain and neuropathy using the DOLORisk protocol. GoDARTS participants with type 2 diabetes from Tayside, Scotland were re-phenotyped for neuropathic pain and related traits by questionnaire, using the DOLORisk core protocol, so that they could be classified according to the presence and extent of neuropathic pain. A subset of the 1,915 GoDARTS–DOLORisk Dundee participants could be classified as painful/painless DPN using validated questionnaires and screening questions for the presence and anatomical distribution of pain (N = 295); see Outcome Definition for more details. Data from these cohorts were collected between 2012 and mid-2019.
Only variables that were missing in less than 50% of the training and validation datasets were considered as potential predictors. These include clinical, biochemical, demographic and self-reported quality-of-life data. A complete overview of the training and validation datasets is presented in Table 1. There was no significant difference in the outcome distribution between the training and validation sets. However, the age and Body Mass Index (BMI) of participants were lower in the training dataset, and some self-reported quality-of-life and psychological variables differed significantly between the training and validation datasets. This reflects the fact that the training and validation cohorts are independent and comprise different populations. PROMIS sleep disturbance t-score, diabetes duration, cholesterol, low-density lipoprotein (LDL), high-density lipoprotein (HDL), creatinine and triglycerides were removed due to a high missingness ratio (Supplementary Tables 1 and 2). The chronic kidney disease (CKD) indicator variable was removed due to very low incidence (only 9 positives), and 5 instances were removed because of HbA1c values below 5%, which are not consistent with diabetes mellitus.
Independent variables included the 5-level EQ-5D-5L instrument (41), which comprises assessments of mobility, self-care, usual activities, pain/discomfort and anxiety/depression; the Patient-Reported Outcomes Measurement Information System (PROMIS) depression and anxiety measurement instruments (42); a question assessing the experience of traumatic events before the age of 18 (Trauma); a question investigating whether someone had stayed in hospital for a long period because of a life-threatening disease or situation before the age of 18; the extraversion, agreeableness, conscientiousness, emotional stability and openness personality constructs from the Ten-Item Personality Inventory (TIPI) (43); self-reported ever-smoked status; self-reported alcohol consumption on an ordered Likert scale; age; gender; BMI and glycaemic control (glycated haemoglobin, HbA1c). Variables were filtered for zero or near-zero-variance numerical features, highly correlated features (Pearson's correlation coefficient > 0.8) and factors with very low complexity.
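This filtering step can be reproduced with standard caret helpers. The sketch below is a minimal illustration, assuming a hypothetical data frame `predictors` holding the numeric candidate predictors.

```r
library(caret)

# Drop zero / near-zero variance features.
nzv <- nearZeroVar(predictors)
if (length(nzv) > 0) predictors <- predictors[, -nzv]

# Drop one member of each highly correlated pair (|r| > 0.8).
cor_mat  <- cor(predictors, method = "pearson", use = "pairwise.complete.obs")
high_cor <- findCorrelation(cor_mat, cutoff = 0.8)
if (length(high_cor) > 0) predictors <- predictors[, -high_cor]
```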
Outcome definition
The outcome of this study was painful versus painless DPN. For the training datasets, one or more physicians defined phenotypes after detailed clinical examination and grading of neuropathic pain, as discussed above and in line with the IASP and NeuPSIG definitions (37,44). An overview of the training datasets, including all independent variables, is presented in Table 2.
For the validation dataset we used an array of structured, validated questionnaires and screening questions to define phenotypes. The questionnaire section of the Michigan Neuropathy Screening Instrument (MNSI) (45) alone, with a cut-off value of 3, was used to define diabetic neuropathy. Various cut-offs have been suggested for the MNSI clinical examination and questionnaire instrument (45–47). These studies consistently report that the cut-off of 7 for the questionnaire part, derived for use in combination with the clinical examination, is too insensitive for stand-alone questionnaire use, whereas a cut-off score of 3 and above for the questionnaire alone has been shown to perform very well (AUC = 0.75, optimal cut-off > 2.0318) (45).
A screening question, "Are you currently troubled by pain or discomfort, either all the time or on and off?", was used to define the presence of pain. Chronicity was screened using the question "How long have you been suffering with this pain or discomfort?", with a cut-off of > 3 months to define the temporal aspect of chronic pain. These questions have been validated and are identical to those used in previous population-based epidemiological studies of pain, including UK Biobank (48,49). Location of pain was assessed using the question "In the past three months: a) which of these pains have you had, b) which one of these pains bothered you the most?", followed by a comprehensive choice of body locations including "Pain in your feet". The participant was then asked to complete the self-complete version of the "Douleur Neuropathique en 4 Questions" (DN4) questionnaire (50) for the most bothersome pain, where a score of 3 and above indicated the presence of NeuP. A definition of possible DPN required the presence of neuropathy, as screened by the MNSI, and chronic pain in the feet, regardless of whether it was the most bothersome pain present.
The questionnaire-only version of the DN4 was used solely to remove instances with conflicting evidence, i.e., painful neuropathy with a DN4 score under the cut-off value or painless neuropathy with a DN4 score over the cut-off, using a cut-off value of 3 to indicate the presence of NeuP. This cut-off has been validated to provide the optimal area under the ROC curve for the questionnaire-only section, i.e., excluding the clinical examination (50).
The presence of diabetes, neuropathy and chronic pain in the feet was used to define painful DPN. The presence of diabetes, neuropathy and no neuropathic pain in the feet defined the painless DPN group (Figure 1).
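Putting these questionnaire rules together, the validation outcome can be expressed as the decision rule sketched below. The input names (mnsi_score, pain_feet, pain_duration_months, dn4_score) are hypothetical, and the handling of missing DN4 scores is a simplifying assumption rather than the exact study logic.

```r
# Hedged sketch of the questionnaire-based painful/painless DPN definition.
define_dpn <- function(mnsi_score, pain_feet, pain_duration_months, dn4_score) {
  neuropathy   <- mnsi_score >= 3                        # MNSI questionnaire cut-off
  chronic_pain <- pain_feet && pain_duration_months > 3  # chronic pain in the feet
  if (!neuropathy) return(NA_character_)                 # no neuropathy: not classified
  if (chronic_pain && !is.na(dn4_score) && dn4_score >= 3) return("painful DPN")
  if (!chronic_pain && (is.na(dn4_score) || dn4_score < 3)) return("painless DPN")
  NA_character_  # conflicting evidence (see the DN4 paragraph above): excluded
}
```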
An overview of the validation dataset is presented in Table 3. In total, 798 people had painful diabetic neuropathy (617 (66%) in the training and 181 (61.4%) in the validation sets) and 432 had painless diabetic neuropathy (318 (34%) in the training and 114 (38.6%) in the validation sets). A sensitivity analysis assessing the change in the pooled coefficient estimates of a logistic regression model fitted on the imputed validation dataset showed no or very small sensitivity to various outcome definitions (Supplementary Figure 2). We further assessed the internal consistency of the validation cohort by calculating Cohen's Kappa inter-rater agreement between the MNSI question "Do you ever have any burning pain in your legs and/or feet?" and the response to the pain localisation question asking about "Pain in your feet". We observed a significant (p < 0.01), fair agreement between these responses (Kappa = 0.295).
Missing values and feature construction
Datasets were largely harmonised and followed the DOLORisk core protocol (39). However, in the Oxford cohort, the Depression, Anxiety and Positive Outlook Scale (DAPOS) (51) anxiety and depression scores initially used were replaced with the PROMIS (42) anxiety and depression short forms. Under the assumption that these constructs measure the same quantity on different scales, we linked DAPOS to PROMIS scores by scaling them together and then using the derived means and standard deviations to bring them onto the same scale as the PROMIS t-scores. Questions related to smoking were transformed into an "ever smoked" feature by taking into account responses to questions about smoking at the time the questionnaire was completed or the clinical examination took place, as well as in the past. The EQ-5D-5L (41) questionnaire was used alongside the UK normative data to obtain the EQ-5D index, which was used as an independent variable. Alcohol consumption was transformed to a Likert-type scale (0-5) using the following ordered levels: "Never", "Less than 1 day per month", "1 to 3 days per month", "1 or 2 days per week", "3 or 4 days per week", "Daily or almost daily". HbA1c was transformed from mmol/mol to percentages (%) when it was reported that way.
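Two of these transformations can be made concrete. The HbA1c conversion follows the standard NGSP/IFCC master equation; the DAPOS-to-PROMIS linking is sketched as simple z-score rescaling, which is one reading of "scaling them together" and should be treated as an assumption. The vectors `dapos` and `promis_t` are hypothetical.

```r
# HbA1c: mmol/mol (IFCC) to % (NGSP), standard master equation.
hba1c_to_percent <- function(mmol_per_mol) 0.09148 * mmol_per_mol + 2.152

# Assumed linking: z-score the DAPOS values, then rescale them using the
# mean and standard deviation of the PROMIS t-scores.
link_to_promis <- function(dapos, promis_t) {
  z <- (dapos - mean(dapos, na.rm = TRUE)) / sd(dapos, na.rm = TRUE)
  z * sd(promis_t, na.rm = TRUE) + mean(promis_t, na.rm = TRUE)
}
```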
After removing all variables with > 50% missingness, the anxiety and depression t-scores demonstrated the highest rate of missingness (about 45%). We then performed multiple imputation by chained equations using the predictive mean matching algorithm (52). The number of imputations was set equal to the maximum percentage of missingness observed for any variable. Multiple imputation was done separately in the training and validation datasets to prevent information leakage between them. In order to model the uncertainty introduced by missing values accurately and without over-optimism, we performed outcome-agnostic imputation of the validation dataset: a dataset was created by stacking the training and validation datasets and removing the outcome. Density plots of the observed and imputed values are shown in Supplementary Figures 3 and 4 for the training and validation datasets, respectively.
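A minimal sketch of this imputation step with the mice package is shown below, assuming hypothetical data frames `train_predictors` and `valid_predictors` that already exclude the outcome.

```r
library(mice)

# Outcome-agnostic imputation of the validation set: stack the two datasets
# (without the outcome) and impute with predictive mean matching, m = 45.
stacked   <- rbind(train_predictors, valid_predictors)
imp       <- mice(stacked, method = "pmm", m = 45, seed = 2021)
completed <- lapply(seq_len(imp$m), function(i) complete(imp, i))
```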
Statistical Analysis
In an exploratory analysis before model fitting, dependencies between all independent variables and the outcome were assessed using the chi-square test for categorical data and the Kruskal–Wallis test for comparisons of numerical data between two groups. Model performance was assessed by calculating: overall accuracy, the proportion of the total number of correctly classified instances, with a binomial test (p < 0.05) used to assess whether accuracy was higher than the prevalence of the majority class, i.e., the no-information rate (NIR); balanced accuracy, the mean of sensitivity and specificity; and the area under the precision/recall curve (AUPRC). However, all these metrics are sensitive to class imbalance and can lead to the selection of models that severely misclassify the minority class. Therefore, we used the Matthews Correlation Coefficient (MCC) (53), which, like Pearson's correlation coefficient, ranges from -1 to 1 and measures how correlated the prediction is with the true outcome. Moreover, the MCC is symmetrical (it does not change when the reference class is substituted) and provides a more robust performance metric than accuracy, balanced accuracy and the F1 score (53–55). Model calibration was assessed on the validation set by visualising the observed event percentage against the 10 prediction probability deciles. A linear model fitted to the calibration curves, i.e., event rate versus the midpoint of each decile bin, was used to estimate the calibration slope and intercept.
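For reference, with the binary confusion-matrix counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), the MCC is defined as

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where 1 indicates perfect prediction, 0 a prediction no better than chance, and -1 total disagreement between prediction and observation.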
Models were not updated or recalibrated after training, in order to realistically assess their performance in predicting new data.
Model training and validation
We developed a workflow that uses the predict-then-aggregate strategy after multiple imputation, during both model training and validation. In this way we modelled the uncertainty due to imputation in both datasets, allowed no information leakage between the training and validation datasets, and encapsulated all model-building decisions within resampling and external validation (23). Predictions were aggregated using majority voting, and point estimates were aggregated using the mean and Rubin's rules to calculate the pooled (total) standard deviation from the within- and between-imputation variance (56). The workflow is visualised in Figure 2. Models were trained on the training dataset using a grid search of up to 60 tuning parameter combinations, optimised for the highest MCC. Numeric variables were centred and scaled within each cross-validation fold to ensure no information leakage between in-fold and out-of-fold samples. The number of multiple imputations was equal to the highest rate of missingness observed across all variables. During training, we imputed missing values using all independent variables and the outcome. We thus trained an ensemble of m = 45 models, optimised for the highest MCC during the 5-times repeated, 10-fold cross-validation. The final model out-of-fold prediction probabilities, MCC, accuracy, balanced accuracy and AUPRC were calculated using Rubin's rules. Partial dependence was also aggregated, by calculating the mean marginal probability across imputations.
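A minimal sketch of training one ensemble member with caret is shown below; `train_imp` (one imputed copy of the training data, with a hypothetical outcome column `dpn`) and the choice of "rf" are illustrative assumptions, and the custom summary function wraps mltools::mcc so the grid search can optimise the MCC.

```r
library(caret)
library(mltools)

# Custom summary so caret can maximise the MCC during cross-validation.
mccSummary <- function(data, lev = NULL, model = NULL) {
  obs  <- as.integer(data$obs  == lev[1])
  pred <- as.integer(data$pred == lev[1])
  c(MCC = mltools::mcc(preds = pred, actuals = obs))
}

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     summaryFunction = mccSummary)

fit <- train(dpn ~ ., data = train_imp,
             method     = "rf",                   # e.g. the Random Forest
             metric     = "MCC", maximize = TRUE,
             preProcess = c("center", "scale"),   # applied within each fold
             tuneLength = 60,                     # up to 60 candidate settings
             trControl  = ctrl)
```

Repeating this fit once per imputed dataset yields the ensemble of m = 45 models described above.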
In validation, we first removed the outcome and pooled the training and validation data together. We then performed m = 45 imputations, as during training. The true outcome was kept unknown during the imputation of the validation set, simulating a real-world scenario of predicting new data in the presence of missing values. We then used the ensemble of trained models to predict the outcome on each of the 45 imputed validation datasets, producing 45² = 2,025 predictions per instance. Finally, these predictions were aggregated by majority voting.
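The predict-then-aggregate step can be sketched as follows, assuming hypothetical lists `models` (the 45 trained models) and `imputed_valid` (the 45 imputed validation datasets).

```r
# Collect one prediction vector per (model, imputed dataset) pair:
# 45 x 45 = 2,025 votes per participant, reduced by majority voting.
pred_list <- list()
for (m in models) {
  for (d in imputed_valid) {
    pred_list[[length(pred_list) + 1]] <- as.character(predict(m, newdata = d))
  }
}
pred_mat <- do.call(cbind, pred_list)  # rows: participants; columns: votes
majority <- apply(pred_mat, 1, function(v) names(which.max(table(v))))
```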
We trained a series of models using a set of diverse, well-known algorithms representing the most significant subdomains of ML. The Random Forest (57) is an algorithm that produces an ensemble of decision trees using bagging (bootstrap aggregation of the training samples) combined with the random selection of a subset of potential predictors to split each node of each growing tree. This random selection of predictors at each node split also makes the Random Forest robust to multicollinearity among the independent variables.
Multivariate Adaptive Regression Splines (MARS) (58) is a multivariable extension of regression that can model complex non-linear problems using an ensemble of simple piecewise linear functions that, in aggregate, optimise predictive ability. The algorithm has built-in backward-elimination variable selection.
Finally, the Naive Bayes classifier, as implemented in the e1071 package (59), is a probabilistic classifier that uses Bayes' rule to calculate the posterior probability of each class given a configuration of the predictor variables, under the assumption that predictors are conditionally independent given the class.
For the Random Forest and MARS we trained an unweighted version and a weighted version, with class weights inversely proportional to the class prevalences. Weighted models should match the probability distribution of the outcome more closely than unweighted models by having better calibration, but could also run a higher risk of over-fitting the training data and be less generalisable.
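As a minimal sketch, class weights inversely proportional to the class prevalences can be derived from the training outcome (here a hypothetical factor `y`) and supplied to model fitting as case weights.

```r
# One weight per training instance: the minority class gets the larger weight.
prev     <- table(y) / length(y)     # class prevalences
case_wts <- as.numeric(1 / prev[y])  # inverse-prevalence weight for each instance
# e.g. passed via caret::train(..., weights = case_wts) for the weighted models
```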
Models were trained in R (60) using the caret package (61) and mltools (62). Multiple imputation was done using mice (52). Marginal feature effects were calculated using the iml package (63); plots were rendered using ggplot2 (64) and tables using finalfit (65).
Interpretability
In order to understand how the values of the independent variables influenced model outcomes, and to provide some interpretability for the ML algorithms, we calculated variable importance in a model-specific way and then scaled the metric values to a maximum of 100 to make them directly comparable. For the Random Forest we used the Gini importance index, which calculates the mean decrease in node impurity produced by splits that use a given variable. For the Multivariate Adaptive Regression Splines we used the model's built-in backward variable selection to calculate the reduction in performance, estimated by cross-validation, when each variable is removed; for the Naive Bayes model, a ROC curve analysis was conducted for each variable.
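In caret, these model-specific importances are available through a single interface; the sketch below assumes a trained `fit` object from the workflow above, with `scale = TRUE` mapping the scores onto the 0-100 range used for comparison.

```r
library(caret)

# Model-specific importance: Gini decrease for the Random Forest, the backward
# pass for MARS, and a per-variable ROC analysis where no specific measure exists.
imp <- varImp(fit, scale = TRUE)  # scores scaled to a 0-100 range
plot(imp)
```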
Moreover, we calculated, aggregated and visualised the marginal effects on each model's outcome prediction of all independent variables with a scaled importance > 10. These Partial Dependence Plots (PDPs) (66,67) represent how the model's predicted outcome is influenced by changes in an independent variable's values. Partial dependence (PD) was calculated by marginalising the classifier's predicted probability over the distribution of all other features, for each value of the feature of interest.
PDPs not only show the marginal probability of the outcome given certain feature values, but also provide an assessment of how robust and accurate the information an ML model can learn is across the distribution of a feature's values. A one-dimensional plot below each PDP shows the density of the feature values across their whole range. A dense distribution indicates that the PD can be accurately calculated for that range of values, while a sparse distribution shows that we cannot reliably calculate the PD, and also that there was probably not enough data in our training datasets for a model to learn a meaningful relationship over that feature range. We should also note that, while most ML models use different mechanisms to robustly handle some collinearity of features, PDPs are sensitive to multicollinearity.
Partial dependence was aggregated for the whole ensemble of trained models using a customised function that calculated the mean across the imputed datasets. In addition, trends in the PDPs were estimated and visualised using a LOESS (68) smoothed curve. This analysis highlighted how the different values of each independent variable influenced the models' predicted outcome, all other things being equal.
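A minimal sketch of the per-model partial dependence calculation with the iml package is shown below; `fit`, `train_imp` and the choice of the feature "age" are hypothetical stand-ins, and the ensemble-level curve is then the mean of these per-imputation curves.

```r
library(iml)

# Wrap one trained model and one imputed training dataset for iml.
predictor <- Predictor$new(fit,
                           data = train_imp[, setdiff(names(train_imp), "dpn")],
                           y    = train_imp$dpn,
                           type = "prob")

# Partial dependence of the predicted probability on a single feature.
pd_age <- FeatureEffect$new(predictor, feature = "age", method = "pdp")
plot(pd_age)  # ggplot2-based PDP; trends smoothed with LOESS in the figures
```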