We developed prognostic models for the annual risk of acute severe complications in patients living with T2D, including acute CV complications, other acute severe complications and all-cause mortality. The originality of this study lies in the fact that we used exclusively data from a national claims database and compared different modelling approaches, including logistic regression and machine learning approaches (random forest and neural network). We demonstrated the feasibility of applying these approaches to medical claims data to predict the annual risk of acute complications in this population.
The main finding of this study was that the RF model performed well in predicting severe acute complications and all-cause mortality with a small number of simple risk factors derived from a medical claims database, such as age, LTD, antidiabetic medications, aDSCI, history of CVD and presence of other comorbidities. This framework makes possible a routine evaluation of individual risks by the national health insurance, which could be communicated to patients and primary care providers to identify high-risk patients and personalize prevention.
In a recent meta-analysis on CVD prediction in patients with T2D, the authors reported that C-statistics ranged from 0.64 to 0.80 for prognostic models developed in patients living with diabetes (9). In their meta-analysis, the authors estimated a pooled C-statistic of 0.67 based on validation studies of the following prognostic models: the UKPDS calculator (21), the ADVANCE model (22), the DCS model (23), the Fremantle model (24) and the NDR model (25). In our study, results from the LR, RF and NN models were comparable with existing findings for CV prognosis (26).
The crucial difference between the models developed in this study and the models included in this meta-analysis lies in the risk factors, which were derived from biological results and medical records in the latter. In contrast, we used only information that is available from a claims database. Our study demonstrated the feasibility of prognostic models based on claims information and the possibility of deriving relevant risk factors such as diabetes duration (available through LTD duration), history of diabetes-related complications (aDSCI) and antidiabetic treatments, which are not systematically included in all the clinical models described above. However, some risk factors identified in the UKPDS, NDR, or ADVANCE models remain difficult to identify in claims data and thus limit predictive accuracy. For example, information on smoking and obesity exists but is typically underreported in medical claims.
The prognostic models took into account diabetes therapeutic strategies during the study period. Adjustment was made for patients without treatment and for patients receiving insulin and/or antidiabetic medication. This analysis, conducted with LR, showed that patients receiving antidiabetic treatments were at lower risk of each outcome (acute CV or other complications and all-cause mortality) than patients receiving no treatment.
Concerning acute CV outcomes, the comparison with the literature may be limited, as authors mostly predict a composite of hard endpoints (MI, stroke, CVD-related death, and CHD). In our study, we extended CV outcomes to other severe acute DM-related complications (UA, TIA and peripheral arterial disease) because they are also associated with major morbidity in DM patients. Hospitalization for CHF represented an important part of acute CV complications (~25%) in our study. Excluding CHF from acute CV complications was associated with a deterioration in discrimination, with C-statistics decreasing by 2 to 4 points.
To our knowledge, this study is also the first to include a wide range of acute complications requiring hospitalization, such as metabolic disorder with ketoacidosis coma, ketoacidosis, acidosis, hypoglycaemia, sepsis, acute renal insufficiency and amputations. Model performance for predicting these other acute complications was overall similar to that for acute CV complications, with better performance for the RF model.
Machine learning models performed better than logistic regression in predicting mortality and performed better than models reported in the existing literature. In a recent study predicting the 5-year mortality risk in older adults with T2D (27), the authors reported a balanced accuracy of 0.77 and a C-statistic of 0.74 with a model including key risk factors such as biological markers, BMI, and smoking. Our study showed that without the latter risk factors, but with other key risk factors (disease duration, aDSCI and diabetes medications), similar results can be achieved.
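For readers less familiar with the two performance measures quoted above, the following minimal sketch shows how a C-statistic and a balanced accuracy can be computed for a binary outcome. It is an illustration only and not the code used in this study: the function names, the toy data and the 0.5 classification threshold are assumptions made for the example.

```python
from typing import Sequence


def c_statistic(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Probability that a randomly chosen patient with the event has a higher
    predicted risk than a randomly chosen patient without it (ties count 0.5)."""
    events = [s for y, s in zip(y_true, y_score) if y == 1]
    non_events = [s for y, s in zip(y_true, y_score) if y == 0]
    if not events or not non_events:
        raise ValueError("both outcome classes are required")
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity for binary predictions."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Toy data (hypothetical): observed outcomes and predicted annual risks.
outcomes = [1, 1, 0, 0, 0, 1]
risks = [0.9, 0.4, 0.35, 0.1, 0.5, 0.8]
# Assumed threshold of 0.5 to turn risks into binary predictions.
predictions = [1 if r >= 0.5 else 0 for r in risks]

c_stat = c_statistic(outcomes, risks)
bal_acc = balanced_accuracy(outcomes, predictions)
```

The C-statistic depends only on how patients are ranked by predicted risk, whereas balanced accuracy also depends on the chosen threshold, which is one reason the two measures can differ for the same model.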
A second finding was that RF had the best accuracy and discrimination among the models. Compared with a similar claims database study that predicted the 3-year risk of adverse outcomes with machine learning, the discrimination of the RF model in our study (C-statistics 0.77–0.86) was relatively close to that of complex machine learning models (C-statistic = 0.77), such as gradient boosting decision trees, recurrent neural networks, multilayer perceptrons and transformers (28). More broadly, machine learning models have been shown to be replicable and transferable to local healthcare systems when built on a national database, whereas a model constructed on a local-level cohort was difficult to transfer to the national level (29). In addition, factors that usually contribute to risk of bias, including small study size, poor handling of missing data, and failure to deal with overfitting, were not present in this study (30).
Survival models based on hazard ratio estimation were not included in our framework for two reasons. First, they rely on instantaneous risk estimation, which differs from binary outcome prediction. Second, dates were only available on a monthly basis, for anonymization purposes, which would certainly have led to underestimating the performance of these models.
The main predictors identified with LR and RF were age, history of diabetes-related complications described with the aDSCI, diabetes medications, and diabetes duration. Additionally, chronic CV disease, psychiatric disorder and chronic end-stage renal disease were consistently associated with a significant risk of event across outcomes. Finding age and comorbidities to be the most important risk factors for the three outcomes was consistent with the literature (8, 31). This study also confirmed that the aDSCI, initially developed for the prediction of mortality and risk of hospitalization (32), was an important predictor of acute complications and all-cause mortality in this claims database.
The first limitation was the identification of study outcomes, which was based on probabilistic algorithms using health insurance claims data that have not been fully validated; this could lead to the misclassification of outcomes and comorbidities. Second, patients may have experienced complications or major events, such as CVD, prior to the start of data availability, and comorbidities may have been underreported or misclassified. Third, a diagnostic code recorded on a medical claim may be inaccurate. Taking this into account, we applied additional inclusion criteria to differentiate type 1 and type 2 diabetes by using a minimum onset age of 45 years, applied jointly with insulin delivery or LTD; patients with onset before this age were assumed to have type 1 diabetes.
Finally, censoring patients at the first occurrence of an event prevented the prediction of multiple and recurrent complications. However, our sensitivity analyses showed robustness in predicting 2 or more events within one year, with slightly better performance than for predicting a single event.
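To make the distinction between the single-event outcome and the sensitivity analysis on 2 or more events concrete, the sketch below derives a binary annual label from event dates known only at monthly resolution. It is a hypothetical illustration and not the study's actual algorithm: the function names, the "YYYY-MM" date format and the definition of the one-year window (the index month plus the following 11 months) are all assumptions.

```python
from typing import Sequence


def months_between(start: str, end: str) -> int:
    """Number of calendar months from start to end, both given as 'YYYY-MM'."""
    start_year, start_month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    return (end_year - start_year) * 12 + (end_month - start_month)


def annual_outcome(index_month: str, event_months: Sequence[str],
                   min_events: int = 1) -> int:
    """Return 1 if at least min_events events fall within the assumed one-year
    window (index month plus the next 11 months), otherwise 0."""
    events_in_window = sum(
        1 for month in event_months
        if 0 <= months_between(index_month, month) < 12
    )
    return int(events_in_window >= min_events)


# Hypothetical patient with three hospitalizations; the last one falls
# outside the one-year window that starts in January 2015.
event_dates = ["2015-03", "2015-11", "2016-01"]

single_event_label = annual_outcome("2015-01", event_dates)               # at least 1 event
recurrent_label = annual_outcome("2015-01", event_dates, min_events=2)    # at least 2 events
three_events_label = annual_outcome("2015-01", event_dates, min_events=3)
```

Under this framing, the main analysis corresponds to `min_events=1` and the sensitivity analysis to `min_events=2`; only the label definition changes, while the predictors and models can stay the same.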
Risk evaluation is essential to individualize therapy and is encouraged by clinical practice guidelines for the management of risk factors. However, in practice, biological risk factors are not all systematically available, and health professionals other than diabetologists may not be familiar with clinical models, which can limit their use in some cases. Routine prediction models built on real-world claims databases aim to provide a transparent platform to communicate this risk to the patient and to help all health professionals make a quick and accurate assessment of their patients' risk and optimize their care management.
Medical claims databases are a valuable resource for developing prognostic models that have strong potential to identify patients at high risk of acute complications within a given time window (33, 34). Risks could be routinely assessed by the national health insurance, owner of the data, and communicated thereafter to patients and primary care providers to personalize prevention.