Development and validation of a nomogram to estimate future risk of type 2 diabetes mellitus in adults with metabolic syndrome: prospective cohort study

To develop and validate a model for predicting the 4-year risk of type 2 diabetes mellitus among adults with metabolic syndrome. Retrospective cohort study of a large multicenter cohort with external validation. The derivation cohort came from 32 sites in China, and the geographic validation cohort came from a population-based cohort study in Henan. During the 4-year follow-up, 568 (17.63%) and 53 (18.67%) participants were diagnosed with diabetes in the derivation and validation cohorts, respectively. Age, gender, body mass index, diastolic blood pressure, fasting plasma glucose and alanine aminotransferase were included in the final model. The area under the curve for the training and external validation cohorts was 0.824 (95% CI, 0.759–0.889) and 0.732 (95% CI, 0.594–0.871), respectively. The calibration plots showed good calibration in both internal and external validation. A nomogram was constructed to predict the probability of diabetes during 4-year follow-up, and an online calculator is also available for more convenient use (https://lucky0708.shinyapps.io/dynnomapp/). We developed a simple diagnostic model to predict the 4-year risk of type 2 diabetes mellitus among adults with metabolic syndrome, which is also available as a web-based tool (https://lucky0708.shinyapps.io/dynnomapp/).


Introduction
The global prevalence of type 2 diabetes mellitus (T2DM) has increased rapidly over the past 30 years, and diabetes mellitus is now the ninth leading cause of death [1]. Asia, particularly China and India, is the center of the global T2DM epidemic [2]. Based on American Diabetes Association criteria [3], two large-scale epidemiological studies in 2010 and 2013 showed that only 30.1% and 36.5% of patients with diabetes, respectively, were aware of their condition [4,5]. Previous studies have confirmed that individuals with prediabetes who adhere to a lifestyle intervention, including exercise training and diet modification, can reduce their risk of developing T2DM [6][7][8][9]. In contrast, people with undiagnosed T2DM are more vulnerable to complications because they are unaware of the need for prevention and do not receive appropriate treatment. Early detection of T2DM is therefore critical for its control and effective treatment, especially in high-risk subjects.
With changes in diet and lifestyle, the prevalence of metabolic syndrome (MetS) is rising quickly in China [10]. MetS affects more than 20% of adults in Asia and more than 24.5% of people in China [11,12]. Obesity, hyperglycemia, dyslipidemia, and hypertension are the major components of MetS, a cluster of disorders that increase the risk of developing non-alcoholic fatty liver disease, insulin resistance, and diabetes [13]. Prediabetes and MetS are closely related conditions [14].
However, simple and reliable tools for estimating the likelihood that patients with MetS will progress to T2DM are still lacking.
Therefore, we report on the development and external validation of a diagnostic model for forecasting the 4-year risk of diabetes in the Chinese population with metabolic syndrome, which might help clinicians to better identify populations at high risk of diabetes. To make the model more convenient and efficient for clinical application, we present it as a nomogram and an online calculator.

Derivation cohort
The data for the training cohort came from a public, nonprofit online database, the 'DATADRYAD' database (https://datadryad.org/). We downloaded the original data shared by Chen et al. [15]. Their retrospective cohort study included 211,833 participants who were free of diabetes at baseline, enrolled from 2010 to 2016 across 11 cities and 32 sites in China. The population cohort investigation was approved by the Rich Healthcare Group Review Board; thus, no further ethics approval was required for this longitudinal analysis. All individuals in that study were at least 20 years old and had attended at least 2 medical examinations. Variables included age, gender, body mass index (BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP), fasting plasma glucose (FPG), total cholesterol (TC), triglycerides (TG), high-density lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol (LDL-C), alanine aminotransferase (ALT), concentration of creatinine (CCR), years of follow-up, eventual diagnosis of diabetes and family history of diabetes. Participants with more than 30% missing records were excluded. From the raw data, we extracted 3222 individuals with metabolic syndrome for secondary analysis.

External validation cohort
The data for the validation cohort came from the Henan province part of the China Cardiometabolic Disease and Cancer Cohort (4C) Study. The 4C study protocol and informed consent were approved by the Committee on Human Research at Ruijin Hospital Affiliated to the Jiao-Tong University School of Medicine (RUIJIN-2011-14). All participants signed written informed consent. Follow-up began at enrollment (2011–2012) and finished in 2015–2016; in total, 3246 individuals completed the follow-up. We extracted the same variables as in the training cohort and selected 284 participants who were free of diabetes mellitus but had metabolic syndrome. The details of the 4C Study design have been described previously [16,17].

Derivation and validation of the models
All available data in the database were used to maximize the performance and generalizability of the model, and the sample size provided far more than 20 events per variable (EPV) for the derivation model. To assure the validity of the study, we excluded participants for whom more than 30% of the information was missing. Additionally, any predictor recorded for less than 50% of participants in the development data was disregarded, resulting in the exclusion of aspartate aminotransferase (AST), blood urea nitrogen (BUN), drinking status and smoking status. We used single imputation to replace missing values for TC, TG, HDL-C, LDL-C, ALT, and CCR. After imputation, all remaining variables were considered as candidate predictors. For further predictor selection, we used bidirectional stepwise selection with the Akaike information criterion (AIC) and the least absolute shrinkage and selection operator (LASSO) method to identify the final variables for the multivariable Cox proportional hazards regression model. We used decision curve analysis (DCA), change in the area under the curve (ΔAUC), integrated discrimination improvement (IDI) and net reclassification improvement (NRI) to compare the variable sets selected by the AIC and LASSO methods. We then checked the selected variables for linearity, influential cases, multicollinearity, and the proportional hazards assumption of the Cox model. Bootstrapping with 1000 resamples was applied for internal validation. Model performance was evaluated by Harrell's concordance index (C-index) and by calibration plots, in both internal and external validation. In general, a C-index > 0.7 indicates good discrimination, and predictions that fall on the 45-degree diagonal line indicate good calibration. The Brier score was used to evaluate the overall performance of the model, with a focus on calibration.
It is generally acknowledged that a lower Brier score indicates a better model, especially when it is less than 0.25. Continuous variables with normal distribution were expressed as mean ± SD, and non-normal data were expressed as median (interquartile range, IQR). The predicted risk was presented as a nomogram. All statistical analyses were carried out with R software 4.1.0 (http://www.R-project.org). Two-sided P < 0.05 was considered statistically significant. This research was conducted and reported in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement [21].
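As a rough illustration of the two performance measures named above, the following Python sketch computes Harrell's C-index and a Brier score on toy data. It is not the analysis code of this study, which was run in R. The function names `harrell_c_index` and `brier_score` and all numbers are hypothetical, and the Brier score shown is the simple binary version at a fixed time horizon, which ignores censoring.

```python
def harrell_c_index(times, events, risks):
    """Share of comparable pairs in which the subject with the earlier
    observed event also has the higher predicted risk (risk ties count 0.5)."""
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if events[i] != 1:
            continue  # a censored subject cannot be the "earlier event" of a pair
        for j in range(len(times)):
            if times[i] < times[j]:  # subject i had the event before time j
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted probability and the
    observed 0/1 outcome (simplified: no handling of censoring)."""
    if len(predicted_probs) != len(outcomes):
        raise ValueError("inputs must have the same length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predicted_probs, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Toy data (hypothetical): follow-up years, event indicator, predicted risk.
toy_times = [1.0, 2.0, 3.0, 4.0]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.7, 0.4, 0.2]

print(harrell_c_index(toy_times, toy_events, toy_risks))
print(brier_score(toy_risks, toy_events))
```

On this scale, a constant prediction of 0.5 for every participant scores exactly 0.25, which is why values below 0.25 are read as carrying information.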

Characteristics of the study population
Baseline characteristics of the development cohort are shown in Table 1. A total of 3222 participants (80% men and 20% women) with MetS were enrolled. The median age was 52 years. During a median follow-up of 3.8 years, 568 participants (17.63%) developed diabetes (Fig. 1).
Baseline characteristics of the validation cohort are shown in Table 2. We recruited 284 participants (33% men and 67% women) with MetS. The median age was 59.88 years, and the median follow-up was 3.4 years. Among them, 53 participants developed diabetes. Compared with participants who did not develop diabetes, they differed significantly in BMI, FPG, triglycerides and LDL-C (P < 0.05).

Predictor selection and clinical usage
We developed two models. The predictors of Model A were selected by AIC: age, gender, BMI, DBP, FPG and ALT. Model B included one additional variable, family history of diabetes, which was selected by LASSO (Appendix Figure 1). To compare Model A and Model B, we calculated ΔAUC (0.00037, P = 0.324), NRI (−0.022, P = 1.070), IDI (0.000, P = 0.478) and the likelihood ratio test (P = 0.458), and plotted the DCA (Fig. 2). All these evaluations indicated that Model A can be regarded as equivalent to Model B. Because Model A had fewer predictors, we selected it as our final model. The DCA showed that when the threshold probability for type 2 diabetes-free survival (T2DFS) in MetS ranged between 2 and 52% at 4 years, using this model to predict the T2DFS probability yielded more net benefit than the treat-all or treat-none schemes, which showed the model to be clinically useful.
To further test the selection, we first examined the linearity of these indicators; the plot indicated that the relationships were essentially linear (Fig. 3A). Second, we tested for influential cases, and the results were also satisfactory (Fig. 3B). Third, we checked multicollinearity: the variance inflation factor values reported for age, gender, BMI, DBP, FPG and ALT were 1.183032, 1.039056, 1.085907, 1.030464 and 1.181380, all below 5, indicating a low risk of multicollinearity. Finally, the proportional hazards assumption of the Cox regression model was examined; because the center line was comparatively smooth and all the variables had P values greater than 0.05, we considered the assumption to be met (Fig. 3C). Therefore, the variables in our model met the requirements for building a Cox proportional hazards regression model. As shown in Table 3, age (HR 95% CI, 1.01-1.02), gender (95% CI, 1.05-1.55), BMI (95% CI, 1.01-1.06), DBP (95% CI, 1.00-1.01), FPG (95% CI, 4.59-6.40) and ALT (95% CI, 1.00-1.01) were used to build the final model.
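For readers less familiar with decision curve analysis, the quantity plotted on its vertical axis is the net benefit at a given threshold probability. The Python sketch below shows the standard net benefit formula on toy data; it is an illustration under simplifying assumptions (binary outcomes at a fixed horizon, no censoring), not the code used in this study, and the function names and numbers are hypothetical.

```python
def net_benefit(predicted_risks, outcomes, threshold):
    """Net benefit of intervening on everyone whose predicted risk is at or
    above the threshold: TP/n - FP/n * threshold / (1 - threshold)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(outcomes)
    pairs = list(zip(predicted_risks, outcomes))
    true_pos = sum(1 for p, y in pairs if p >= threshold and y == 1)
    false_pos = sum(1 for p, y in pairs if p >= threshold and y == 0)
    harm_weight = threshold / (1.0 - threshold)  # odds at the threshold
    return true_pos / n - false_pos / n * harm_weight


def net_benefit_treat_all(outcomes, threshold):
    """Reference strategy: intervene on every participant."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy data (hypothetical): predicted 4-year risks and observed outcomes.
toy_risks = [0.9, 0.6, 0.3, 0.1]
toy_outcomes = [1, 1, 0, 0]

for t in (0.2, 0.5):
    print(t, net_benefit(toy_risks, toy_outcomes, t),
          net_benefit_treat_all(toy_outcomes, t))
```

The treat-none strategy has a net benefit of zero by definition, so a model is considered useful over the range of thresholds where its curve lies above both zero and the treat-all curve.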

Discussion
As one of the four major noncommunicable diseases, T2DM leads to microvascular and macrovascular complications and is a major cause of death and disability worldwide [22,23]. It should be highlighted that T2DM has a protracted asymptomatic phase before diagnosis, during which insulin resistance, obesity, hyperlipidemia, and hypertension are present [24,25]. All of these are components of MetS [26], and MetS is a well-known risk factor for T2DM. The risk of developing T2DM can be reduced significantly through early lifestyle intervention [27]. Nevertheless, the International Diabetes Federation (IDF) estimates that approximately 578 million people (10.2%) will have diabetes globally by 2030 and that about 50.1% of people living with diabetes are unaware of it; therefore, early identification of the high-risk population with prediabetes and/or MetS is necessary to reduce the incidence of T2DM [28]. Consequently, we sought to develop and validate a diagnostic model for T2DM in the MetS population.
Because of regional and ethnic disparities, only a few models for predicting the risk of T2DM in China have been developed in previous studies. Xu et al. [29] developed a nomogram to predict the 14-year risk of T2DM in eastern China. A sex-specific multivariate nomogram has also been created, and another study [33] focused on the risk of T2DM in overweight and obese adults. None of them, however, concentrated on the MetS population. The shortcomings of these studies include the lack of external validation and a narrow focus on healthy residents in just one region of China. Additionally, their models were not made available as a user-friendly online calculator. Therefore, this work, based on a large-scale, multicenter retrospective cohort in China with systematic evaluation and internal and external validation, may be the first to produce a nomogram and web-based tools for the prediction of T2DM risk in the MetS population. T2DM can develop relatively easily in people with MetS over time [34], so we focused on this population. For early recognition and prevention of T2DM in the MetS population, we developed a diagnostic model including age, gender, BMI, DBP, FPG and ALT, all of which are readily available demographic data and biochemical indexes. Both internal and external validation showed good discrimination and calibration. We anticipate increased use of the T2DM prediction model, either by MetS patients themselves or by healthcare systems for patient populations. To implement our approach in MetS patients, we developed a nomogram plot and a web-based interface.
The results of this model suggest that T2DM risk increases with increasing values of the continuous variables, namely age, BMI, DBP, FPG and ALT. Firstly, previous studies have shown that FPG is a significant risk factor for the development of T2DM. For example, Cheung et al. [35] showed a high frequency of progression from MetS to T2DM and highlighted baseline hyperglycemia as the strongest predictor. Elevated FPG was shown to be the strongest predictor of T2DM by Ding et al. [36], with an 8.9-fold greater risk, and Ferrara et al. [37] confirmed this even after accounting for insulin levels. In our Cox proportional hazards regression model, FPG was also the strongest predictor (HR 5.42). Differences in race, diet, and ethnicity may explain the discrepancies between these values. FPG can also be seen as the most practical, reliable, and useful indicator, particularly in low-resource settings. Secondly, gender was another main risk factor in our model (HR 1.28). Most earlier research found that women had higher rates of overweight and obesity than men, and that females with diabetes were more obese than males, even though the proportion of T2DM in males and females varied greatly by area, year, and age [38,39]. Of note, the incidence of coronary heart disease (CHD) and stroke was higher in Chinese women with MetS than in those without [40]. Thus, our results indicate that more attention should be paid to females with MetS. Thirdly, age was also a predictor, because T2DM is an age-related disease and ageing β-cells are less sensitive to high glucose levels [41,42].
In addition, it is widely accepted that DBP is the strongest predictor in people under 50 years old, whereas pulse pressure becomes a better predictor in people above 60 years old [43,44]. ALT has also been reported as a significant indicator of T2DM and MetS, which is consistent with our findings [45][46][47]. This may be because non-alcoholic fatty liver disease (NAFLD) is regarded as one of the pathogenic factors of T2DM, and ALT is one of the indicators most closely related to NAFLD [48,49]. All the model's parameters are easily accessible during routine physical examinations and are in line with earlier research, making them valuable for identifying high-risk MetS individuals who are predisposed to T2DM.
Our study has several strengths. First, this model is the first to focus on the risk of progression to T2DM in the MetS population based on a large-scale, multicenter retrospective cohort. Second, our model is presented in 2 forms, a nomogram plot and an online calculator, which gives clinicians more than one option. Third, all participants in the validation cohort underwent an oral glucose tolerance test (OGTT) for a more accurate diagnosis. Fourth, both internal and external validation showed good discrimination and calibration. Nonetheless, this study still has a few limitations. First, as both the derivation and validation cohorts are based on the Chinese population, the usefulness of this model is probably limited to China. Second, additional factors that can influence the result, such as lifestyle, the use of cardiovascular drugs and some other biochemical indexes, are not taken into consideration by the model. Third, the model was established and validated in a retrospective cohort, which might result in information bias, selection bias and confounding bias. Fourth, the model still needs more external validation before it is applied in clinical practice. Fifth, even though we used statistical techniques to address the issue, there were some missing data in the training cohort. As a result, additional prospective studies using global multi-site data and more predictors are still required.
Nomogram figure caption: The points of each feature were added to obtain the total points, and a vertical line was drawn on the total points to obtain the corresponding risk of T2DFS. BMI, body mass index; DBP, diastolic blood pressure; FPG, fasting plasma glucose; ALT, alanine aminotransferase; T2DFS, type 2 diabetes-free survival.
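A nomogram for a Cox model is a graphical way of summing each predictor's contribution and mapping the total to a survival probability. The Python sketch below illustrates that calculation in its usual form, 4-year risk = 1 − S0(4)^exp(linear predictor). It is only a sketch under loudly stated assumptions: the hazard ratios for FPG (5.42) and gender (1.28) are taken from the text, whereas the other four hazard ratios are illustrative values chosen inside the reported confidence intervals, and the baseline survival, reference values, units and gender coding are all hypothetical. It must not be used to estimate risk for real patients.

```python
import math

# Hazard ratios per unit increase. FPG and gender are quoted in the text;
# the other four are ILLUSTRATIVE values inside the reported 95% CIs,
# not the fitted estimates.
HAZARD_RATIOS = {
    "age": 1.015, "gender": 1.28, "bmi": 1.035,
    "dbp": 1.005, "fpg": 5.42, "alt": 1.005,
}

# HYPOTHETICAL baseline: 4-year diabetes-free survival for a participant
# whose predictors all equal the reference values below.
BASELINE_SURVIVAL_4Y = 0.90
REFERENCE = {  # hypothetical reference profile (gender coded 0/1, assumed)
    "age": 52.0, "gender": 0.0, "bmi": 26.0,
    "dbp": 85.0, "fpg": 5.5, "alt": 25.0,
}


def linear_predictor(patient):
    """Sum of log hazard ratio times distance from the reference value."""
    return sum(
        math.log(HAZARD_RATIOS[name]) * (patient[name] - REFERENCE[name])
        for name in HAZARD_RATIOS
    )


def four_year_risk(patient):
    """Predicted probability of developing T2DM within 4 years."""
    survival = BASELINE_SURVIVAL_4Y ** math.exp(linear_predictor(patient))
    return 1.0 - survival


# Example: same as the reference profile except FPG is 1 unit higher.
example = dict(REFERENCE, fpg=6.5)
print(round(four_year_risk(REFERENCE), 3))
print(round(four_year_risk(example), 3))
```

Because FPG has by far the largest hazard ratio, a one-unit change in FPG moves the predicted risk much more than a one-unit change in any other predictor, which matches its long axis on a nomogram.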

Conclusions
In conclusion, a risk prediction model based on age, gender, BMI, DBP, FPG, and ALT was developed and validated. The model is provided as a nomogram and an online calculator for assessing the 4-year risk of T2DM in the MetS population, which may make it more convenient for physicians to use. We believe that, by raising awareness of the T2DM risk in the MetS population, this model will help to lower the incidence of T2DM.

Data availability
The datasets generated and/or analyzed during the current study are available in the DRYAD database repository. The derivation cohort can be downloaded from https://doi.org/10.5061/dryad.ft8750v, and the validation cohort can be downloaded from https://datadryad.org/stash/share/HNXtM6qk-4uSRGSP90XkBKzIi818a4R40i5EWblEog0.