This study included hospital-visiting patients with an initial T2DM diagnosis and without renal diseases and employed SuperLearner and SHAP to reveal the potentially complex relationship between predictors and DKD outcomes. Compared with existing studies that developed prognostic models to predict DKD in patients with T2DM[10, 12], we discovered new predictive markers (e.g., WBC_Cnt and urine pH) by applying relatively new analytical methods to a relatively new and more representative population. The study design ensured that the model pertained to a real-world setting[29] and enhanced its clinical applicability, and the discovered associations between the new markers and DKD provide new directions for future research.
This study design enhanced the practicality and validity of the final model. First, we considered hospital-visiting patients with T2DM; this population was neither too general (e.g., the general healthy population) nor too specific (e.g., clinical research cohorts)[13]. Moreover, T2DM patients have a higher risk of developing DKD, increasing their need for DKD screening[30, 31]. Patients initially diagnosed with T2DM typically undergo routine clinical tests, providing readily available measurements that make it easier to estimate their risk[16, 19, 20]. Second, the EMR-based automatic data extraction, transformation, and loading (ETL) provides a set of characteristics from various sources, making data-driven discovery possible, as evidenced by the high ranking of many markers that were rarely reported previously, such as WBC_Cnt. The high ranking of routine blood characteristics allows for exploring the association between DKD and blood disorders, such as anaemia.
Compared to previous studies that only established a risk model using a single classifier, this study included analyses using linear models, non-linear models (including ensemble learning algorithms), and SuperLearner. Based on the model performance comparison in the validation set, SuperLearner was superior to the non-linear models, which were generally superior to the linear models (Fig. 2A). This ordering suggests that multiple characteristics interact in contributing to the progression and risk of DKD, in ways more complex than a single linear or non-linear model can capture[32, 33]. Although this is consistent with the intuition gained from clinical practice, this study demonstrated these findings using data for the first time. Notably, to reduce the potential for selection bias, we did not exclude patients based on their eGFR or other key indices (e.g., UACR); instead, we conducted data-driven filtering. This increased the difficulty of improving the prediction model’s discriminative capability, measured using the AUC. This is because patients who are required to be tested for creatinine usually have a higher likelihood of renal disease and are already suspected of renal disease by physicians, which makes those with positive outcomes easier for a prediction model to identify. Nevertheless, a comparable AUC (0.7 in our study vs. 0.6–0.8 in previous studies) was obtained, thus improving the practicality and validity of the final model[12].
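As a reminder of what the AUC measures, the following minimal sketch computes it as the fraction of (positive, negative) patient pairs in which the patient who developed DKD received the higher predicted risk. The function name `auc` and the six labels and scores are hypothetical and chosen only for illustration; they are not data or code from this study.

```python
def auc(labels, scores):
    """Return the AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical predicted risks for six patients (1 = developed DKD, 0 = did not).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1]
example_auc = auc(labels, scores)  # 7 of the 9 pairs are ranked correctly
print(round(example_auc, 3))
```

Under this reading, an AUC of 0.7 means that the model ranks a randomly chosen patient who develops DKD above a randomly chosen patient who does not in about 70% of such pairs.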
For a more complex model, greater interpretability is required. One of the strengths of SHAP is its individualised explanation, which means that it explains predictions at a finer level of granularity[34]. In addition, the dependence plot is an accumulation of individualised explanations, which depicts the general association between a variable and the risk of DKD and reveals desirable values for that variable. For example, the risk of DKD increased sharply at 60–62 years of age (Fig. 3B), consistent with the results of previous studies[13].
SuperLearner and SHAP confirmed the prognostic value of several common risk factors. The present study showed that age, HbA1c levels, and serum creatinine levels were positively correlated with DKD risk (Fig. 3A, B). Previous studies have identified these markers as important variables for predicting long-term microvascular complications of diabetes[10, 13, 35, 36]. Specifically, older T2DM patients with high HbA1c and creatinine levels had a higher risk of developing DKD. The SHAP dependence plots also provided tipping points where the DKD risk contribution of these variables switched from negative to positive; for example, 8 for HbA1c and 60 for age (Fig. 3B).
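One way to read a tipping point off a dependence curve is sketched below. It assumes that the dependence plot has already been summarised as a curve (for example, the mean SHAP value per bin of the variable) and locates, by linear interpolation, the value at which the contribution changes from negative to positive. The function name `tipping_point` and the numbers are hypothetical; they are not the study's data, and this is not necessarily how the tipping points reported here were obtained.

```python
def tipping_point(values, shap_values):
    """Return the variable value where the SHAP contribution crosses zero upwards.

    `values` and `shap_values` describe a summarised dependence curve.
    Returns None if the contribution never switches from negative to non-negative.
    """
    pairs = sorted(zip(values, shap_values))
    for (x0, s0), (x1, s1) in zip(pairs, pairs[1:]):
        if s0 < 0 <= s1:
            # Linear interpolation between the two points that bracket zero.
            return x0 + (x1 - x0) * (0 - s0) / (s1 - s0)
    return None


# Hypothetical summarised dependence curve for age (years); made-up numbers.
ages = [50, 55, 60, 65, 70]
mean_shap = [-0.10, -0.04, -0.01, 0.03, 0.08]
crossing = tipping_point(ages, mean_shap)
print(crossing)
```

On a noisy curve the sign may change more than once; this sketch returns only the first upward crossing, so smoothing or binning choices would affect the estimate.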
Infection, inflammation, and immunity are critical risk factors that affect DKD outcomes[37]. However, their role in early-stage DKD development has seldom been quantitatively investigated. In this study, WBC_Cnt and Neut_Cnt, reliable markers of systemic inflammation, infections, and immunity[38, 39], were ranked the highest among all potential predictors (Fig. 3A). We found that elevated WBC_Cnt and reduced Neut_Cnt were risk factors for DKD in patients newly diagnosed with T2DM and that these two characteristics at baseline were associated with the risk of DKD in a graded manner even within the clinically normal range. Together with previous studies, the identified association indicates that control of inflammatory activity and attention to immune status are critical to T2DM outcomes[40–42]. Patients with elevated WBC_Cnt and reduced Neut_Cnt at baseline should be closely monitored (e.g., through more frequent routine screening for DKD).
In the present study, low Hb levels were associated with an increased risk of developing DKD. Although previous studies have demonstrated the predictive value of Hb for the risk of renal function decline in the early stages of DKD[43], the negative association between baseline Hb concentrations and the risk of DKD at follow-up has often been ignored[13, 44]. A possible explanation is that patients with diabetes often have anaemia, which can lead to renal hypoxia and accelerate the progression of diabetic nephropathy[45, 46]. However, other studies have shown that high Hb levels are also associated with an increased risk of diabetic nephropathy[47]. Maintaining an appropriate haemoglobin level is therefore important for managing and preventing diabetic nephropathy.
We also found that higher baseline LDL levels were associated with a lower risk of progression to DKD. A previous study identified a similar association with marginal significance (P = 0.09)[13]. From a clinical perspective, elevated LDL levels are normally associated with an increased risk of DKD[48]. However, the current study showed the opposite association. We suspect that this was due to the interventions provided. Because lipid control is clinically important and lipids are among the most readily modifiable indices[49, 50], patients with higher LDL levels may have received more aggressive lipid-specific interventions. Consequently, their LDL levels may have been lowered, which could explain the lower incidence of DKD in the follow-up period.
Notably, new predictive markers, such as urine pH, were identified in this study. Urine pH, which reflects the kidney's ability to regulate the body's acid-base balance and is related to diet, medication, and other factors[51], was positively associated with the risk of developing DKD. To our knowledge, this finding has not been reported previously. Since urine pH testing is inexpensive and practical, it can serve as an auxiliary biomarker for DKD risk estimation. Additionally, this discovery provides valuable information for future research.
The findings from the present study have several clinical implications. First, based on our real-world study design and stringent inclusion and exclusion criteria, the model yielded a moderately good and realistic AUC, indicating acceptable discriminatory ability. At the optimal threshold probability, the sensitivity and specificity were 0.7337 and 0.5910, respectively. Clinicians can adjust the threshold on a case-by-case basis to obtain better sensitivity (i.e., a lower risk threshold) or better specificity (i.e., a higher risk threshold). Second, the dependence plots generated by SHAP provide variable-specific cutoff points at which the DKD risk contribution changes, which, when combined with mechanism validation, can be used to facilitate clinical decision-making, such as targeted interventions.
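The trade-off between sensitivity and specificity as the risk threshold moves can be sketched as follows. The text above does not state how the optimal threshold was chosen; this sketch assumes Youden's J statistic (sensitivity + specificity - 1), which is one common choice. The function names and the eight labels and scores are hypothetical and are not taken from this study.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when risk >= threshold is flagged as high risk."""
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        flagged = s >= threshold
        if y == 1:
            if flagged:
                tp += 1
            else:
                fn += 1
        else:
            if flagged:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def youden_threshold(labels, scores):
    """Return the observed score that maximises Youden's J (an assumed criterion)."""
    def j_statistic(threshold):
        sens, spec = sensitivity_specificity(labels, scores, threshold)
        return sens + spec - 1
    return max(sorted(set(scores)), key=j_statistic)


# Hypothetical predicted risks (1 = developed DKD, 0 = did not).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.6, 0.5, 0.8, 0.4, 0.2, 0.1]

best = youden_threshold(labels, scores)
at_best = sensitivity_specificity(labels, scores, best)
at_lower = sensitivity_specificity(labels, scores, 0.2)   # favours sensitivity
at_higher = sensitivity_specificity(labels, scores, 0.9)  # favours specificity
print(best, at_best, at_lower, at_higher)
```

In this toy example, lowering the threshold keeps every true case flagged at the cost of more false alarms, whereas raising it removes false alarms but misses most true cases.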
This study has several limitations. First, the analysis was based on patient data from a single centre, which may limit the generalisability of these findings to other populations. Multicentre studies and external validations are required to confirm these results. Second, a prospective study is required to confirm the usefulness of the developed model. Nonetheless, the study design, the large dataset generated by our Medical Data Intelligence Platform, the data-driven analytical approach, and the model's interpretability improved the validity and robustness of the final model.