A Comparative Study on Risk Prediction Model of type 2 Diabetes based on Machine Learning Theory

In this study, risk prediction models for type 2 diabetes were established using logistic regression, decision tree, BP neural network, support vector machine and deep neural network methods, based on survey data on type 2 diabetes and its risk factors among residents of Dongguan City, Guangdong Province, from 2016 to 2018. The predictive performance of each model was evaluated using accuracy, recall, the area under the ROC curve (AUC) and other indicators. The DeLong test was used to test the differences in AUC between models, and the prediction results of the models were compared and analyzed. The results showed that, on the selected data set, the back-propagation (BP) neural network model performed best, with an accuracy of 93.7%, a recall of 92.8% and an AUC of 0.977. This study could provide a methodological reference for predicting the risk of type 2 diabetes.


Introduction
Diabetes mellitus (DM) is a metabolic disease characterized by disordered blood glucose metabolism and is one of the major public health problems of the 21st century [1]. In 2017, the number of people with diabetes worldwide was 451 million, and this number is expected to increase to 693 million by 2045, which will place a great burden on health care systems [2]. Type 2 diabetes mellitus (T2DM) is the most common form of diabetes [3]. Early lifestyle changes or pharmacological interventions have been shown to be effective in delaying or preventing type 2 diabetes and its complications [4]. Therefore, it is very important to accurately predict the risk of T2DM in advance. However, the onset of T2DM is slow, the clinical incubation period is long, and the related detection and diagnosis methods are still being improved, resulting in a possible delay of more than 10 years from onset to diagnosis [5]. Therefore, timely screening and management of groups at high risk of diabetes is of great significance for reducing the incidence of diabetes [6].
In recent years, machine learning methods and mathematical models based on basic personal characteristics, routine physical examination results and other indicators have been widely used to predict the risk of diabetes [7]. Machine learning represents a powerful set of methods for characterizing, adjusting, learning, predicting, and analyzing data. These methods use large amounts of input and output data to recognize patterns and learn efficiently, training machines to make autonomous recommendations or decisions. After sufficient repetition and modification, the machine can receive an input and predict the output [8,9]. The outputs are compared with known results to judge their accuracy, and the model is then iteratively adjusted to improve its ability to predict disease [10]. Machine learning can be divided into three types: supervised learning, unsupervised learning and reinforcement learning. Some of the most common supervised learning methods include logistic regression, K-Nearest Neighbor (KNN), Naive Bayes (NB), decision tree, Artificial Neural Network (ANN) and Support Vector Machine (SVM). At present, supervised learning methods are widely used in T2DM-related research.
In research on risk prediction models of T2DM, the commonly used machine learning methods include logistic regression, CART decision tree, C4.5 decision tree, support vector machine, back-propagation (BP) neural network and deep neural network. To date, there has been no report systematically comparing the predictive performance of these six models. Against this background, this study uses survey data on T2DM among Dongguan residents from 2016 to 2018 to compare the predictive performance of six machine-learning-based T2DM risk prediction models, namely logistic regression, CART, C4.5, BP neural network, support vector machine and deep neural network, in order to provide a methodological reference for risk prediction of T2DM.

Study design
This was a cross-sectional study.

Study subjects
The respondents were residents aged 18 or older who had lived in the monitoring area for six months or more. T2DM was diagnosed according to the 1999 World Health Organization criteria. The required sample size, calculated using a dedicated formula, was 1340. At each monitoring point, subjects were selected by multi-stage cluster random sampling. The sampling method at each stage was as follows. First stage: at each monitoring site, 3 communities were randomly selected by sampling proportional to population size. Second stage: within each selected community, 2 administrative villages were randomly selected by sampling proportional to population size. Third stage: in each selected administrative village, more than 75 households were selected by simple random sampling. Fourth stage: in each selected household, 1 qualified permanent resident was randomly selected using the Kish table method.
In this study, a total of 4157 subjects were selected from the survey data on T2DM in Dongguan residents over the 3 years from 2016 to 2018. After screening and exclusion, 4106 subjects were included, comprising 149 patients with T2DM in the case group and 3957 subjects in the control group. This study was approved by the Medical Ethics Committee of the Affiliated Hospital of Guangdong Medical University and was carried out in accordance with relevant guidelines and regulations. Informed consent was obtained from all subjects.

Processing of unbalanced data
The ratio of type 2 diabetes patients to the normal population in the original data was about 1:27, so the data were unbalanced. The Synthetic Minority Over-sampling Technique (SMOTE) was used to process the unbalanced data in R Studio. Of the data processed by SMOTE, 70% were randomly selected as the training set and 30% as the test set.
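The core step of SMOTE can be sketched as follows. This is a minimal illustration in Python, not the R implementation used in the study: a synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours. The function name `smote_sample`, the toy points and the choice k=2 are assumptions made for illustration only.

```python
import math
import random


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a randomly chosen
    minority sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Rank the other minority samples by Euclidean distance to the base point.
    others = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # uniform in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


# Hypothetical minority samples with two standardized features.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote_sample(minority_points, k=2, rng=random.Random(0))
print(synthetic)
```

Because every synthetic point lies between two existing minority samples, the new samples never leave the region already spanned by the minority class.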

Variable selection
The balanced data were tested for normality, and univariate analysis was then performed using the Mann-Whitney U rank sum test and the chi-square test to screen the variables for inclusion in the models.

10-fold cross validation
In 10-fold cross-validation, a data set is randomly divided into 10 equally sized subsets. Each subset is used in turn as the test set, while the other nine are used as the training set. In this study, 10-fold cross-validation was applied to the training set (70% of the data), and the included variables were those that were statistically significant in the univariate analysis. By repeatedly adjusting the model parameters and comparing the corresponding 10-fold cross-validation results, the optimal parameters were selected to construct the risk prediction model of type 2 diabetes.
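The splitting step described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the procedure run in R Studio; the function name `k_fold_indices`, the seed and the sample size of 105 are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, valid_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, valid_idx


# Hypothetical training-set size, for illustration only.
splits = list(k_fold_indices(n_samples=105, k=10))
for train_idx, valid_idx in splits[:2]:
    print(len(train_idx), len(valid_idx))
```

For each candidate parameter setting, a model would be fitted on `train_idx` and scored on `valid_idx` in every fold, and the setting with the best average score would be kept.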

Model construction
70% of the processed data were randomly selected as the training set and the remaining 30% as the test set. Logistic regression, CART, C4.5, BP neural network, support vector machine and deep neural network were applied to the training set by R Studio software. The risk prediction model of type 2 diabetes was established, and the prediction effects of each model were evaluated and compared according to the accuracy, recall rate, AUC and other indicators.

Statistical description of baseline data
Subjects with many missing values or obvious data errors were deleted, and the final sample size was 4106, including 149 patients with type 2 diabetes and 3957 people in the normal population. Because the difference in sample size between the two groups was very large, the SMOTE method was adopted to process the unbalanced data. The parameters were perc.over=2600 and perc.under=103 (sampling ratios of 2600% and 103%, respectively). After processing, there were 4023 samples in the type 2 diabetes group and 3990 in the control group. 70% of the data were randomly selected as the model training set, and the remaining 30% were used as the test set. The specific results are shown in Table 1.
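The group sizes after SMOTE follow from the two parameters by simple arithmetic. The sketch below assumes the convention used by the `SMOTE` function in the R package DMwR: perc.over/100 synthetic cases are generated per original minority case, and perc.under/100 majority cases are kept per synthetic case. The text names only the parameters, so the package and the function name `smote_class_sizes` are assumptions.

```python
def smote_class_sizes(n_minority, perc_over, perc_under):
    """Return (minority_total, majority_kept) after SMOTE-style resampling.

    Assumed convention: perc_over/100 synthetic cases are generated per
    original minority case, and perc_under/100 majority cases are kept per
    synthetic case.
    """
    n_synthetic = n_minority * perc_over // 100
    minority_total = n_minority + n_synthetic
    majority_kept = n_synthetic * perc_under // 100
    return minority_total, majority_kept


cases, controls = smote_class_sizes(n_minority=149, perc_over=2600, perc_under=103)
total = cases + controls
test_size = round(total * 0.3)
print(cases, controls, total, test_size)  # 4023 3990 8013 2404
```

Under this assumption the calculation reproduces the reported 4023 cases and 3990 controls, and the 30% test set of 2404 samples equals the total count in each confusion matrix of Table 5.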

Screening results of variables in univariate analysis
Univariate analysis was performed on the balanced samples at a significance level of α=0.1. Normality testing showed that most characteristic attributes were skewed in both groups, so the Mann-Whitney U rank sum test and the chi-square test in SPSS were used to analyze quantitative and qualitative data, respectively. The results of the univariate analysis are shown in Table 2. Except for educational level, the distributions of the other 23 characteristic variables differed significantly between the case group and the control group.

Parameter tuning results
In this study, 10-fold cross-validation was applied to the training set (70% of the data), and the included characteristic variables were those that were statistically significant in the univariate analysis. The model parameters were adjusted repeatedly and the corresponding 10-fold cross-validation results were compared. For SVM, the linear, radial basis function, polynomial and sigmoid kernels were each evaluated by 10-fold cross-validation; the linear kernel gave the best prediction. For the BP neural network, the maximum number of iterations was set to 3000, and each number of hidden-layer neurons in the range 5-20 was evaluated by 10-fold cross-validation; the prediction effect was best with 20 hidden-layer neurons. For the deep neural network (DNN), the number of hidden layers ranged from 8 to 12 and the number of neurons per hidden layer from 25 to 35, with the same number of neurons in every hidden layer. The prediction effect was best with 9 hidden layers and 33 neurons in each hidden layer.

Logistic regression model
A logistic regression model was first fitted with all variables except education level, and stepwise regression based on the Akaike information criterion (AIC) was then used to screen variables. A total of 16 variables were finally retained, as shown in Table 3: age, alcohol consumption, consumption frequency of cereals, potatoes, beans, fruits, eggs, dairy, poultry and fish, DBP, FPG, TC, TG, HDL-C, and LDL-C. The variables retained by stepwise regression were applied to the training set to build the logistic model, as shown in Table 4. In this model, the factors with a greater influence on T2DM included potato consumption frequency, fish consumption frequency, TC, FPG and HDL-C. In addition, the frequency of cereal consumption and TC were negatively correlated with the incidence of T2DM, while the other variables were positively correlated with it.
The logistic model equation is:
The logistic model was applied to the test set for verification, and its confusion matrix and ROC curve were obtained, as shown in Table 5, Table 6 and Figure 3(A).
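For reference, a binary logistic regression model with p predictors has the general form below. This is the standard textbook form, not the fitted equation of this study: here p = 16, the x_i stand for the variables retained by stepwise regression, and the coefficients would be the estimates reported for the model in Table 4. The symbols P, β and x are notation introduced here for illustration.

```latex
\operatorname{logit}(P) = \ln\frac{P}{1-P} = \beta_0 + \sum_{i=1}^{p} \beta_i x_i,
\qquad
P = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{i=1}^{p} \beta_i x_i\right)}
```

P is the predicted probability of T2DM. A positive coefficient raises the predicted odds of T2DM as the corresponding variable increases, and a negative coefficient lowers them.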

Support Vector Machine Model
Using the linear kernel function, the 23 characteristic variables that were significant in the univariate analysis were entered into the SVM model using the training set. The constructed SVM model was applied to the test set for verification, and its confusion matrix and ROC curve were obtained. As shown in Table 5, Table 6 and Figure 3(B), the accuracy of this model is 91.2%, the recall is 89.0%, the specificity is 93.3%, and the area under the ROC curve (AUC) is 0.911.

BP neural network
A three-layer neural network structure was adopted, with 20 neurons in the hidden layer and a maximum of 3000 iterations. The 23 characteristic variables that were significant in the univariate analysis were entered into the BP neural network model using the training set. The final model was applied to the test set for verification, and the corresponding confusion matrix and ROC curve were obtained. As shown in Table 5, Table 6 and Figure 3(C), the accuracy of this model is 93.7%, the recall is 92.8%, the specificity is 94.6%, and the area under the ROC curve (AUC) is 0.977.

Decision tree model
(1) CART decision tree
The 23 characteristic variables that were significant in the univariate analysis were entered into the CART decision tree model using the training set, and the resulting tree is shown in Figure 1. When FPG ≥ 5.6 mmol/L, the subject was classified directly as having type 2 diabetes. When FPG < 5.6 mmol/L and Potatoes = 0 or 1, the subject was classified as non-type 2 diabetes. When FPG < 5.6 mmol/L, Potatoes was neither 0 nor 1 and Age < 34, the subject was classified as non-type 2 diabetes. When FPG < 5.6 mmol/L, Potatoes was neither 0 nor 1 and Age ≥ 34, the subject was classified as having type 2 diabetes. The constructed CART decision tree model was applied to the test set for verification, and the corresponding confusion matrix and ROC curve were obtained. As shown in Table 5, Table 6 and Figure 3(D), the accuracy of this model is 88.7%, the recall is 84.8%, the specificity is 93.3%, and the area under the ROC curve (AUC) is 0.906.
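The CART rules stated above are simple enough to write out as a function. The sketch below is only a transcription of those rules into Python; the function name `cart_predict`, the integer coding of the potato consumption category and the 1/0 output coding are assumptions for illustration.

```python
def cart_predict(fpg, potatoes, age):
    """Classify one subject with the CART rules described in the text.

    fpg: fasting plasma glucose in mmol/L; potatoes: coded consumption
    frequency category; age: years. Returns 1 for T2DM and 0 for non-T2DM.
    """
    if fpg >= 5.6:
        return 1
    if potatoes in (0, 1):
        return 0
    return 1 if age >= 34 else 0


# Hypothetical subjects, one per leaf of the tree.
cart_examples = [
    cart_predict(fpg=6.1, potatoes=0, age=25),
    cart_predict(fpg=5.0, potatoes=1, age=60),
    cart_predict(fpg=5.0, potatoes=3, age=40),
    cart_predict(fpg=5.0, potatoes=2, age=30),
]
print(cart_examples)  # [1, 0, 1, 0]
```

The tree uses only three variables and four leaves, which is what makes it easy to read and apply by hand.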
(2) C4.5 decision tree
The 23 characteristic variables that were significant in the univariate analysis were entered into the C4.5 decision tree model using the training set. As shown in Figure 2, the decision tree output by the C4.5 algorithm includes 6 internal nodes and 9 leaf nodes. According to the model, a subject was classified as having type 2 diabetes when FPG > 5.61 mmol/L. When FPG ≤ 5.61 mmol/L: Potatoes = 0 was classified as non-type 2 diabetes; Potatoes = 1 and Age ≤ 54 was classified as non-type 2 diabetes; Potatoes = 1, Age > 54 and TC ≤ 5.11 mmol/L was classified as type 2 diabetes; Potatoes = 1, Age > 54 and TC > 5.11 mmol/L was classified as non-type 2 diabetes; Potatoes = 2 and DBP ≤ 81 mmHg was classified as non-type 2 diabetes; Potatoes = 2 and DBP > 81 mmHg was classified as type 2 diabetes; Potatoes = 3 and Age ≤ 34 was classified as non-type 2 diabetes; Potatoes = 3 and Age > 34 was classified as type 2 diabetes. The C4.5 model was applied to the test set for verification, and the corresponding confusion matrix and ROC curve were obtained. As shown in Table 5, Table 6 and Figure 3(E), the accuracy of this model is 88.6%, the recall is 84.9%, the specificity is 92.7%, and the area under the ROC curve (AUC) is 0.888.
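The C4.5 rules can be transcribed in the same way. In this sketch the function name `c45_predict`, the integer coding of the potato category (0 to 3) and the 1/0 output coding are assumptions; the TC branch above 5.11 mmol/L is taken as the complement of the TC ≤ 5.11 mmol/L branch.

```python
def c45_predict(fpg, potatoes, age, tc, dbp):
    """Classify one subject with the C4.5 rules described in the text.

    fpg and tc are in mmol/L, dbp in mmHg, age in years; potatoes is the
    coded consumption frequency category (0-3). Returns 1 for T2DM, else 0.
    """
    if fpg > 5.61:
        return 1
    if potatoes == 0:
        return 0
    if potatoes == 1:
        if age <= 54:
            return 0
        return 1 if tc <= 5.11 else 0
    if potatoes == 2:
        return 1 if dbp > 81 else 0
    if potatoes == 3:
        return 1 if age > 34 else 0
    raise ValueError("potatoes category outside the range covered by the tree")


# Hypothetical subjects exercising several branches.
c45_examples = [
    c45_predict(fpg=6.0, potatoes=0, age=30, tc=4.0, dbp=70),
    c45_predict(fpg=5.0, potatoes=1, age=60, tc=5.0, dbp=70),
    c45_predict(fpg=5.0, potatoes=1, age=60, tc=5.5, dbp=70),
    c45_predict(fpg=5.0, potatoes=2, age=60, tc=5.0, dbp=85),
    c45_predict(fpg=5.0, potatoes=3, age=30, tc=5.0, dbp=70),
]
print(c45_examples)  # [1, 1, 0, 1, 0]
```

Counting the return statements that yield a class gives the 9 leaves mentioned in the text, and the six conditions tested (FPG, Potatoes, Age, TC, DBP, Age) correspond to the 6 decision nodes.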

Deep neural network model construction
The 23 characteristic variables that were significant in the univariate analysis were entered into the DNN model using the training set. The network had 9 hidden layers with 33 neurons in each layer. The model was applied to the test set, and the corresponding confusion matrix and ROC curve were obtained. As shown in Table 5, Table 6 and Figure 3(F), the accuracy of this model was 84.5%, the recall was 82.9%, the specificity was 86.1%, and the area under the ROC curve (AUC) was 0.845.

Comparison of model performance
The DeLong test in R Studio was used to compare the AUC values of the models, as shown in Table 7 and Figure 4. On this data set, taking into account both the robustness of the model and its prediction effect for type 2 diabetes, the BP neural network model performed best, with an accuracy of 93.7%, a recall of 92.8%, a specificity of 94.6% and an AUC of 0.977, followed by the logistic regression model, the SVM model, the CART decision tree model, the C4.5 decision tree model and the deep neural network model. The prediction effects of SVM and CART were similar, and the difference was not statistically significant.
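The threshold-based indicators can be recomputed from the test-set confusion matrices in Table 5. The sketch below is illustrative: it assumes class 1 is the positive (T2DM) class, and the function name `metrics` is hypothetical. Under that assumption, the third percentage reported for each model corresponds to TN/(TN+FP), that is, specificity.

```python
# Test-set counts from Table 5 as (TP, FN, FP, TN), treating class 1 as positive.
confusion = {
    "Logistic": (1068, 174, 81, 1081),
    "SVM": (1105, 137, 78, 1084),
    "BP": (1153, 89, 63, 1099),
    "CART": (1048, 194, 78, 1084),
    "C4.5": (1054, 188, 85, 1077),
    "DNN": (1030, 212, 161, 1001),
}


def metrics(tp, fn, fp, tn):
    """Return accuracy, recall (sensitivity) and specificity as fractions."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


results = {name: metrics(*counts) for name, counts in confusion.items()}
for name, m in results.items():
    print(name, {k: round(100 * v, 1) for k, v in m.items()})
```

For the BP neural network this gives 93.7%, 92.8% and 94.6%, matching the values quoted above. The AUC and the DeLong comparison need the predicted probabilities for each subject and cannot be recovered from the confusion matrices alone.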

Discussion
With rapid economic growth and changes in lifestyle, the prevalence of diabetes in China has increased significantly. According to the 2017 survey results of the International Diabetes Federation, the number of people with diabetes in China has reached 114 million, making China the country with the largest number of diabetes patients in the world. T2DM is a chronic disease characterized by the body's inability to metabolize glucose effectively, which raises blood sugar levels and leads to hyperglycemia. Chronically high blood glucose levels can affect the kidneys, nervous system, heart and vascular system, leading to a range of serious complications, and have a significant impact on population health and medical costs. It is estimated that approximately 5 million people aged 20 to 99 died of diabetes in 2017, accounting for 9.9% of all-cause mortality in this age group globally, and more than one-third of all diabetes deaths occurred in people under 60 years of age [2]. More and more researchers are using machine learning methods to explore the risk factors related to T2DM and to construct prediction models for T2DM. Early diagnosis or prediction of diabetes through such models is of great significance for preventing T2DM, improving the quality of life of T2DM patients and preventing related complications. In this study, risk prediction models of T2DM were established using logistic regression, decision tree, BP neural network, support vector machine and deep neural network methods, based on survey data on T2DM and its risk factors among residents of Dongguan from 2016 to 2018. The prediction results of the models were compared and analyzed to provide a methodological reference for the risk prediction of T2DM.
In recent years, machine learning techniques have been widely used to predict the risk of developing T2DM. Ye Hong [11] built a diabetes prediction model based on BP neural network, SVM and ensemble learning, and the results showed that the prediction effect of the BP neural network was better than that of SVM. Jing Gao [12] used a BP neural network and logistic regression to construct a prediction model of T2DM complications and found that the prediction effect of the BP neural network was better than that of logistic regression. Liu et al. [13] constructed a logistic model, a BP neural network model and a decision tree model to analyze the risk factors of T2DM, and the results showed that the prediction effects of the three models, from high to low, were BP neural network, logistic regression and decision tree. Dwivedi et al. [14] used six algorithms (classification trees, SVM, ANN, NB, logistic regression and KNN) to predict diabetes, and the results showed that the prediction effect of the logistic regression model was better than that of the support vector machine. These findings are consistent with the results of this study. Six models were constructed on the training set containing 70% of the samples: logistic regression, SVM, CART decision tree, C4.5 decision tree, BP neural network and deep neural network. Considering both the robustness of the model and its prediction effect for T2DM, the BP neural network model performed best, with an accuracy of 93.7%, a recall of 92.8% and an AUC of 0.977, followed by the logistic model, the SVM model, the CART decision tree model, the C4.5 decision tree model and the deep neural network model. The prediction effects of SVM and CART were similar, and the difference was not statistically significant. The AUC of the logistic model was 0.962, and its prediction effect was better than that of SVM. Although the logistic model is weaker than the BP neural network, its results are highly interpretable and can reflect the relationship between individual factors and T2DM.
Faruque [15] and Kandhasamy [16] found that the prediction effect of the C4.5 decision tree model was significantly better than that of the SVM model. The results of this study showed that the prediction effects of the SVM and CART models were similar, with no statistically significant difference in AUC, and that both were slightly better than the C4.5 decision tree model. In practical application, compared with the SVM model, the CART and C4.5 decision tree models present the variables included in the model more intuitively. Cheruku et al. [17] built a prediction model for T2DM on the PIDD data set using a variety of machine learning methods. The results showed that the prediction effect of the C4.5 decision tree model was better than that of the CART decision tree model, with accuracies of 74.2% and 70.7%, respectively. Althunayan et al. [18] compared the performance of nine algorithms, including Naive Bayes, C4.5, CART and random forest, in the prediction of T2DM; the accuracy of random forest was the highest, up to 100%, and the accuracy of the C4.5 algorithm was much higher than that of the CART algorithm. Meng et al. [19] used ANN, logistic regression and C4.5 to predict diabetes and concluded that C4.5 was more effective and accurate than the other methods. The results of this study show that the prediction effect of the CART decision tree model is slightly better than that of the C4.5 decision tree model. The CART decision tree is a binary tree; compared with the C4.5 decision tree model, it runs faster and yields a more concise model. Therefore, CART is more suitable for processing large-sample data.
Ayon et al. [20] used a deep neural network to build a prediction model for T2DM on the PIDD data set, and the accuracy of the model reached 98.35%. Mohapatra [21] applied a deep neural network to the prediction of T2DM, and the accuracy of the algorithm was as high as 97.11%. The results of this study showed that, compared with the other 5 T2DM prediction models, the DNN had the worst prediction effect, with an accuracy of only 84.5% and an AUC of 0.845. These results differ from those of previous related studies, which may be due to differences in sample size, data quality, included characteristic variables, definitions of related variables and model construction techniques across data sets.
In this paper, the SMOTE algorithm is used to process unbalanced data, but this algorithm cannot overcome the data distribution problem of unbalanced data sets and is prone to distribution marginalization. If a minority-class sample lies at the edge of the distribution of the minority-class sample set, the samples generated from it and its neighbouring samples will also lie at this edge and will become more and more marginalized, thus blurring the boundary between the two classes and increasing the difficulty of classification. In addition, the models constructed in this paper have only been validated internally and lack external validation, so their extrapolation is limited. In future work, they will be verified in larger population samples from different regions.
In conclusion, the BP neural network model has the best prediction effect in this study, while the deep neural network, in contrast to previous studies, has the worst. Although the BP neural network model has the best prediction effect, its results are not easily interpretable: the model can efficiently identify patients with T2DM, but because it relies on an internal mechanism that is not transparent, the influence of individual factors on T2DM cannot be understood, so the relevant risk factors cannot be targeted more precisely for intervention. Although the decision tree can visually present the classification process, it does not show the influence of the factors included in the model on T2DM. The logistic regression model is second only to the BP neural network in prediction effect, but its results are highly interpretable, and its principle is simple and easy to understand. Therefore, a suitable classifier can be selected according to the research purpose when these models are applied in clinical decision-making.

Conclusion
Machine learning technology offers high accuracy, a low error rate and low cost in the early prediction of various diseases. Early diagnosis of T2DM is of great significance for improving the quality of life of patients with T2DM and preventing complications. Based on a variety of indicators, including accuracy, recall and AUC, this study compared the predictive performance of six risk prediction models of type 2 diabetes: logistic regression, CART, C4.5, back-propagation neural network, support vector machine and deep neural network. The results showed that, on the selected data set, the back-propagation neural network model had the best predictive performance.

Table 1. NC, normal population; * indicates that the distribution of each feature attribute in the two groups of samples is mostly skewed, so the median and interquartile range are used to describe central tendency and variation.

Table 5. Confusion matrix of each model (counts by true and predicted classification)
Model      True 1, predicted 1   True 1, predicted 0   True 0, predicted 1   True 0, predicted 0
Logistic   1068                  174                   81                    1081
SVM        1105                  137                   78                    1084
BP         1153                  89                    63                    1099
CART       1048                  194                   78                    1084
C4.5       1054                  188                   85                    1077
DNN        1030                  212                   161                   1001

Table 6. Performance indexes of each model

Comprehensive ROC curve of the six models