Machine Learning Approaches for Early Prediction of Gestational Diabetes Mellitus Based on Prospective Cohort Study

Purpose: To develop and verify an early prediction model of gestational diabetes mellitus (GDM) using machine learning algorithms. Methods: The dataset was collected from a cohort study of pregnant women in eastern China from 2017 to 2019. It was randomly divided into a training dataset (75%) and a test dataset (25%) using the train_test_split function. Using Python, four classic machine learning algorithms and a New-Stacking algorithm were first trained on the training dataset and then verified on the test dataset. The four classic models were Logistic Regression (LR), Random Forest (RF), Artificial Neural Network (ANN) and Support Vector Machine (SVM). Sensitivity, specificity, accuracy, and the area under the Receiver Operating Characteristic curve (AUC) were used to assess model performance. Results: Valid information was obtained from a total of 2811 pregnant women. Across the models, accuracy ranged from 80.09% to 86.91% (RF), sensitivity from 63.30% to 81.65% (SVM), specificity from 79.38% to 97.53% (RF), and AUC from 0.80 to 0.82 (New-Stacking). Conclusion: This paper constructed a New-Stacking model that is theoretically preferable for its better performance in specificity, accuracy and AUC. However, because the SVM model achieved the highest sensitivity, the SVM model is recommended as the prediction model for clinical use.


Introduction
Gestational diabetes mellitus (GDM) is a growing public health concern [1][2][3]. It refers to abnormal glucose tolerance and persistently high blood glucose levels during pregnancy, and it is a serious threat to maternal and fetal health. GDM not only causes adverse perinatal outcomes, such as postpartum hemorrhage, infection, preterm delivery, macrosomia, and neonatal respiratory distress syndrome, but also threatens the long-term health of mothers and infants [4][5][6]. Compared with women who have normal pregnancies, women with GDM have a 6- to 12.6-fold higher risk of developing type 2 diabetes mellitus (T2DM) after delivery [7][8][9]. It is reported that 1 in 4 women diagnosed with GDM go on to develop T2DM, after an average of about 8 years [10]. Moreover, the risk of metabolism-related diseases such as obesity and type 2 diabetes in the offspring of women with GDM also increases significantly [11]. Recent research reports that mothers with GDM have a significantly increased risk of congenital heart defects (CHDs) in their offspring (OR = 1.98, 95% CI 1.66-2.36) [12].
With updated GDM diagnostic criteria and changing lifestyles, the global prevalence of GDM has risen to 14.8% and continues to increase year by year [13][14][15]. It is urgent to identify GDM in a timely manner and to provide intervention strategies that prevent or at least delay the onset of T2DM.
At present, the diagnosis of GDM must be confirmed by an Oral Glucose Tolerance Test (OGTT) at the 24th to 28th week of gestation. However, previous studies have found that persistent hyperglycemia during pregnancy can adversely affect the outcome for the pregnant woman or fetus even before a clear diagnosis of gestational diabetes is made [16].
Recently, researchers have begun to use different machine learning algorithms to predict GDM [17][18][19][20], such as Logistic Regression (LR), Random Forest (RF), Extreme Gradient Boosting (XGBoost), Support Vector Machine (SVM), and Artificial Neural Network (ANN) analyses. These machine learning algorithms can classify risk factors, infer the correlation between attributes, establish risk prediction models, and predict the occurrence of disease. However, the predictive performance of existing machine learning models has not been good enough for wide clinical application. It is necessary to explore and develop more accurate and easier approaches for predicting GDM risk using machine learning algorithms, and more research and supporting data are needed in this field.
In this study, 2811 pregnant women in Qingdao, eastern China, were enrolled. Data including clinical and biochemical variables were collected from a prospective follow-up cohort. The GDM prediction model was established using four classic machine learning algorithms and a New-Stacking algorithm. The four classic models are Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM) and Artificial Neural Network (ANN). The New-Stacking algorithm builds on these four classic algorithms using an ensemble method. The stacking approach is expected to avoid the increase in false positive rate that can occur when a single machine learning model improves its true positive rate. It is hypothesized that the New-Stacking model outperforms the four classic machine learning models in terms of discrimination and calibration. This study aims to develop and verify an early prediction model of GDM through a machine learning approach and to provide theoretical support for early diagnosis and intervention of GDM.

Participants
The dataset used in this study was derived from a cohort of pregnant women established in Qingdao between November 2017 and December 2019. The study was conducted at three women and child health care centers and a university-affiliated hospital. The university-affiliated hospital is a treatment center for critical and difficult cases on the Jiaozhou Peninsula in eastern China, with 4500-5000 deliveries annually.
Information about participants' socio-demographic characteristics and medical history, including age (identified from the identity card), height, pre-pregnancy body weight, and family history of diabetes, was collected through face-to-face interviews and self-completed questionnaires. Notably, the study also collected each participant's own birth weight and her mother's weight at the time of that delivery. Information about reproductive characteristics (gravidity, parity, multiple birth (yes/no), and pregnancy complications), as well as laboratory test results, including hemoglobin (Hb), urine ketones (U-Ket), fasting plasma glucose (FPG), triglycerides (TG), total cholesterol (TC), and high-density lipoprotein (HDL), was extracted from the participants' medical records.
The Medical Ethics Committee of the first author's university approved the study (Ethical number: QYFYKYLL411311920). All participants were informed of the aims and plan of the study, and written consent was obtained. Throughout the research process, the names of the participants were anonymized, and a unified numbering system was used to identify them.
The inclusion criteria were women 1) aged 18 years or older, 2) who planned to give birth in the study hospital, and 3) with a singleton pregnancy. Women were not eligible to participate if they: 1) had previously been diagnosed with type 1 or type 2 diabetes mellitus, 2) had their first pregnancy visit later than the 28th week of gestation, so that antenatal examination data from early pregnancy could not be obtained, or 3) had cognitive or communication impairments, such as being unable to hear or speak. The diagnosis of GDM was based on the results of the OGTT administered at the 24th to 28th week of gestation: participants whose blood glucose level at fasting, 1 h, or 2 h after the glucose load reached or exceeded 5.1, 10.0, or 8.5 mmol/L, respectively [21], were diagnosed with GDM.
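The diagnostic rule above can be written as a short function. This is a minimal sketch, assuming that exceeding any one of the three thresholds is sufficient for a diagnosis (as the wording "fasting, 1 h, or 2 h" suggests); the names `OGTT_THRESHOLDS` and `gdm_label` are illustrative and do not come from the study.

```python
# Thresholds in mmol/L as quoted in the text (fasting, 1 h, 2 h).
OGTT_THRESHOLDS = {"fasting": 5.1, "one_hour": 10.0, "two_hour": 8.5}


def gdm_label(fasting, one_hour, two_hour):
    """Return 1 if any OGTT value reaches or exceeds its threshold, else 0."""
    values = {"fasting": fasting, "one_hour": one_hour, "two_hour": two_hour}
    return int(any(values[key] >= limit for key, limit in OGTT_THRESHOLDS.items()))


# Hypothetical readings, for illustration only.
print(gdm_label(4.8, 9.0, 7.9))  # all below threshold
print(gdm_label(4.8, 9.0, 8.6))  # 2 h value above threshold
```

The 1/0 return value mirrors the outcome coding described under Data analysis.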

Prediction methods
To train the models and evaluate their accuracy systematically, the train_test_split function was used to randomly divide the dataset into a training set (75%) and a test set (25%). Each model was first trained on the training data and then verified on the test data. Using Python-based tools, four machine learning algorithms (LR, RF, SVM, and ANN) were developed to model the original data, and the prediction abilities of the different models were compared.
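The split and the four base classifiers can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the data are synthetic (generated with `make_classification` to match the cohort size and approximate GDM prevalence), and the scaling step, hyperparameters and random seeds are placeholders, since the study does not report them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the cohort: 2811 rows, roughly 31% positives.
X, y = make_classification(
    n_samples=2811, n_features=15, weights=[0.69, 0.31], random_state=0
)

# Random 75% / 25% split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Assumed settings: the paper does not list hyperparameters or preprocessing.
models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=0)),
    "ANN": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    ),
}

# Fit each model on the training set and score it on the held-out test set.
test_accuracy = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
print(len(X_train), len(X_test))
print(test_accuracy)
```

With 2811 samples and `test_size=0.25`, the split yields 2108 training and 703 test samples, which matches the 218 GDM and 485 normal test cases quoted in the Discussion.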
To avoid the increase in false positive rate that can occur when a single machine learning model improves its true positive rate, a New-Stacking algorithm based on the ensemble method was used. The approach integrated the GDM prediction models established by LR, RF, SVM, and ANN as the primary decision makers. A Decision Tree (DT) algorithm, a basic classification algorithm, was then used to make the secondary decision and improve the prediction results. This process consisted of two stages. In the first stage, the entire dataset was randomly divided into a training set and a test set, and then N different models were fit to the training set. For each model, K-fold cross-validation was used. For a given model, the prediction set for the whole training set was obtained by fitting the model K times in turn, each time predicting the held-out fold. Similarly, each sample in the test set received K predicted values, and the prediction set for the test set was obtained by averaging them. Fitting the four different models (LR, RF, SVM, and ANN) in this way generated two output matrices, with dimensions (nrow(Train), N) and (nrow(Test), N). These results entered the second stage of the stacking method. In the second stage, a DT model was fit to the first-stage results for the training set, and this model was then used to make predictions on the test set.
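The two-stage procedure can be sketched as below. This is a hedged illustration, not the authors' code: it assumes the first stage passes predicted probabilities (rather than class labels) to the second stage, uses synthetic data, and picks K = 5 and all hyperparameters arbitrarily. The helper name `stack_features` is hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def stack_features(models, X_train, y_train, X_test, k=5, seed=0):
    """Stage 1: build the (n_train, N) and (n_test, N) prediction matrices."""
    folds = KFold(n_splits=k, shuffle=True, random_state=seed)
    train_meta = np.zeros((X_train.shape[0], len(models)))
    test_meta = np.zeros((X_test.shape[0], len(models)))
    for j, model in enumerate(models):
        for fit_idx, hold_idx in folds.split(X_train):
            fitted = clone(model).fit(X_train[fit_idx], y_train[fit_idx])
            # Out-of-fold prediction for the held-out part of the training set.
            train_meta[hold_idx, j] = fitted.predict_proba(X_train[hold_idx])[:, 1]
            # Each test sample gets K predictions, which are averaged.
            test_meta[:, j] += fitted.predict_proba(X_test)[:, 1] / k
    return train_meta, test_meta


# Synthetic data standing in for the cohort (assumption).
X, y = make_classification(
    n_samples=600, n_features=15, weights=[0.7, 0.3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

base_models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
    SVC(probability=True, random_state=0),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
]
train_meta, test_meta = stack_features(base_models, X_train, y_train, X_test)

# Stage 2: a decision tree makes the secondary decision.
meta_model = DecisionTreeClassifier(max_depth=3, random_state=0)
meta_model.fit(train_meta, y_train)
stacked_pred = meta_model.predict(test_meta)
print(train_meta.shape, test_meta.shape)
```

Using out-of-fold predictions for the training matrix keeps the second-stage tree from being trained on predictions that the base models made for samples they had already seen.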

Model evaluation
The performance of each model was evaluated using the area under the Receiver Operating Characteristic curve (AUC), diagnostic accuracy, sensitivity, and specificity. When women with normal gestation in the test set were predicted by the model to have normal gestation, the result was marked as a True Negative (TN). When women with normal gestation were predicted to be GDM patients, it was marked as a False Positive (FP). Similarly, when GDM patients in the test set were predicted to be normal by the model, it was marked as a False Negative (FN). Conversely, when GDM patients were correctly predicted to be GDM patients, the result was marked as a True Positive (TP). Thus, the diagnostic accuracy was defined as the proportion of all participants whose GDM status was correctly predicted (Eq. 1).
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
Sensitivity was defined as the percentage of GDM patients whose GDM status was successfully detected (Eq. 2).
$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{2}$$
Specificity was defined as the proportion of normal gestations that were successfully detected (Eq. 3).
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{3}$$
The Receiver Operating Characteristic (ROC) curve is a quantitative tool for assessing how well a model separates two classes. The horizontal axis represents the false positive rate (1 - Specificity) and the vertical axis represents the sensitivity. The closer the curve lies to the upper left corner, the higher the sensitivity and the lower the false positive rate of the model. The AUC summarizes this in a single number: the larger the AUC, the better the predictive performance of the model.
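As a worked check of these definitions, the sketch below computes the three metrics from confusion-matrix counts, using the LR counts quoted in the Discussion (145 of 218 GDM patients and 457 of 485 normal gestations correctly identified). It also includes a small rank-based AUC helper; the function names and the four-sample toy scores are illustrative and not taken from the study.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auc_from_scores(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# LR test-set counts as reported in the Discussion.
lr = confusion_metrics(tp=145, fn=73, tn=457, fp=28)
print({name: round(100 * value, 2) for name, value in lr.items()})
# {'accuracy': 85.63, 'sensitivity': 66.51, 'specificity': 94.23}

# Toy scores (hypothetical) to show the AUC helper.
print(auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The printed percentages agree with the values reported for the LR model (85.63%, 66.51%, 94.23%).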

Data analysis
The collected data were entered into Excel 2016, and all categorical variables were coded as 0/1 variables. The output variable was whether GDM was diagnosed by the OGTT at the 24th to 28th week of gestation: a GDM diagnosis was coded as 1, and a normal OGTT result was coded as 0.
There were inherent correlations between some indexes in the original dataset, such as between BMI and body weight, or between weight gain and body weight. To remove this redundancy between original and derived indexes, principal component analysis (PCA) was used to reduce the dimension of the data. PCA extracts a series of principal components from the original data and projects the high-dimensional data onto a low-dimensional space. These principal components are linear combinations of the original variables, which approximately reflect the characteristics of the original data while reducing the impact of noise. In this study, PCA was used to extract the global features of the original data.
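A minimal sketch of this step with scikit-learn is shown below. It makes several assumptions that the text does not confirm: the variables are standardized before PCA, the number of components is chosen with a 99% cumulative explained-variance cutoff, and the data are synthetic columns built to be correlated in the way described (BMI from weight and height, current weight from pre-pregnancy weight plus gain).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, deliberately correlated variables (illustrative values only).
rng = np.random.default_rng(0)
n = 500
height_m = rng.normal(1.62, 0.06, n)
weight_kg = rng.normal(58, 8, n)
bmi = weight_kg / height_m**2
weight_gain_kg = rng.normal(4, 2, n)
current_weight_kg = weight_kg + weight_gain_kg
fpg = rng.normal(4.6, 0.5, n)
X = np.column_stack(
    [height_m, weight_kg, bmi, weight_gain_kg, current_weight_kg, fpg]
)

# Standardize so that no variable dominates because of its units (assumption).
X_scaled = StandardScaler().fit_transform(X)

# Cumulative contribution rate of the principal components.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Keep the smallest number of components reaching the assumed 99% cutoff.
n_keep = int(np.searchsorted(cumulative, 0.99) + 1)
X_reduced = PCA(n_components=n_keep).fit_transform(X_scaled)
print(n_keep, X_reduced.shape)
```

Because current weight is an exact linear combination of two other columns, at least one component carries essentially no variance and is dropped, which is the kind of redundancy the PCA step is meant to remove.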

Results
Baseline characteristics and principal component analysis
A total of 2811 pregnant women were included in the final data analysis. The prevalence of GDM was 30.99% (871/2811). Feature importance analysis showed that the most important factors predicting GDM were fasting glucose level, pre-pregnancy BMI, uterine height, abdominal circumference, mother's weight, weight gain, body weight at birth, age, family history of diabetes, systolic pressure, diastolic pressure, gravidity, and polycystic ovary syndrome (PCOS). The top fifteen principal components were selected by PCA based on their cumulative contribution rate, which was 99.73% (Table 1). With these components, the accuracy of the prediction model (taking the SVM model as an example) was also close to its highest point (Fig. 1).

Discussion
In this study, the incidence of GDM (30.99%) was higher than that reported in previous studies [13][14][15]. This difference may be related to the fact that the sample collection sites included a university-affiliated hospital, which is a treatment center for pregnant women requiring critical care and therefore has a relatively high concentration of high-risk pregnancies. Feature importance analysis showed that fasting glucose level carried the highest weight in the SVM model, followed by pre-pregnancy BMI, uterine height, and abdominal circumference. This is similar to a previous study [17]. In addition, the participants' own birth weight and their mothers' weight at that delivery were also captured by the SVM model. This finding is consistent with clinical experience, but further research is needed to provide additional supporting data.
The sensitivity of a model represents the proportion of GDM patients successfully identified. The higher the sensitivity, the lower the missed diagnosis rate of GDM patients. The FP rate (1 - Specificity) refers to the proportion of normal individuals misdiagnosed as GDM. In general, an ideal model combines high sensitivity with a low FP rate. As a classical algorithm, LR was used as the control model in this study. The sensitivity of the LR model was 66.51%, that is, the LR model successfully identified 145 of 218 GDM patients, with 73 missed diagnoses. The specificity of the LR model was 94.23%, that is, 457 of the 485 normal gestation individuals were identified correctly, with 28 pregnancies misdiagnosed as GDM. The LR model achieved good prediction performance, with a diagnostic accuracy of 85.63% and an AUC of 0.80. The sensitivities of the RF and ANN models were similar to that of the LR model, but their specificities were higher, that is, they obtained a lower FP rate. The AUCs for the RF and ANN models were 0.80 and 0.81, respectively. Overall, the performances of the RF and ANN models were similar to that of the LR model. A previous study [20] has shown similar results.
In this study, the sensitivity of the SVM model was the highest (81.65%) among the five models. This value was slightly lower than that reported by Xiong et al [18], which may be related to the smaller sample (490) in that report. In the SVM model, 178 of 218 GDM patients were successfully identified and only 40 were missed. Although the specificity (79.38%) and the diagnostic accuracy (80.09%) of the SVM model decreased as sensitivity improved, its AUC was 0.81, which was higher than that of the LR model. This suggests that the SVM model not only improved the sensitivity but also had better stability. The sensitivity (75.69%) of the New-Stacking model was higher than that of the LR, RF, and ANN models, but lower than that of the SVM model. However, the specificity (89.48%) and accuracy (85.21%) of the New-Stacking model were higher than those of the SVM model. The AUC of the New-Stacking model was 0.82, which was the highest of all models tested. Theoretically, the New-Stacking model had the best performance among the five machine learning models constructed.
Clinically, pregnant women who are misdiagnosed as having GDM should undergo additional testing to confirm whether they have the condition. However, if GDM is not detected in a timely manner, these women will still face greater risks in future pregnancies. Overall, the cost of confirmation testing is smaller than that of the future risk to the patient. Thus, the objective of this study was to develop a predictive model that identifies as many GDM pregnancies as possible. Although the prediction accuracy and specificity of the New-Stacking model were higher than those of the SVM model, its missed diagnosis rate (1 - sensitivity) was 24.31%, which was higher than that of the SVM model (18.35%). The specificity, accuracy, and AUC of the SVM model were good enough to provide a basis for predicting the risk of GDM in early pregnancy. Based on these findings, we concluded that it is more appropriate to apply the SVM model, rather than the New-Stacking model, in clinical practice to predict GDM risk, given its higher sensitivity and thus lower missed diagnosis rate.
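To make this trade-off concrete, the short calculation below converts the reported sensitivities and specificities into approximate numbers of missed GDM cases and false alarms on the test set (218 GDM patients and 485 normal gestations). The SVM count of 40 missed cases is stated in the text; the other counts are back-calculated here by rounding the reported percentages and are not reported in the study. The function name `error_counts` is illustrative.

```python
N_GDM, N_NORMAL = 218, 485  # test-set sizes given in the Discussion


def error_counts(sensitivity_pct, specificity_pct):
    """Convert reported percentages into missed GDM cases and false alarms."""
    true_pos = round(sensitivity_pct / 100 * N_GDM)
    true_neg = round(specificity_pct / 100 * N_NORMAL)
    return {"missed": N_GDM - true_pos, "false_alarms": N_NORMAL - true_neg}


svm = error_counts(81.65, 79.38)
stacking = error_counts(75.69, 89.48)
print("SVM:", svm)            # SVM: {'missed': 40, 'false_alarms': 100}
print("Stacking:", stacking)  # Stacking: {'missed': 53, 'false_alarms': 51}
```

Under these back-calculated counts, choosing the SVM model over the New-Stacking model would mean about 13 fewer missed GDM cases in exchange for about 49 more women sent for confirmation testing.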

Limitations of the study
The general applicability of the prediction model reported in the present study is limited because the data were derived from a single institution characterized by a large proportion of high-risk pregnant women. In addition, the birth weight of participants and the delivery weight of their mothers were self-reported, so reporting bias is possible. In future studies, to improve the generalizability of the model, the authors plan to expand the cohort to include additional sampling sites and a larger number of pregnant women, and to use additional data for external verification.

Conclusion
This study compared the predictive performance of four classic machine learning algorithms (LR, RF, SVM, and ANN) and a New-Stacking algorithm in order to develop and verify a multivariable prediction model of GDM diagnosis and to provide information relevant to clinical decision making in pregnancy.
This paper constructed a New-Stacking model that is theoretically preferable, given its better specificity and accuracy than the SVM model and the highest AUC of all models. However, the SVM model achieved the best sensitivity. As the harm from a missed diagnosis is more serious than that from a misdiagnosis, this study recommends the more sensitive SVM model as the prediction model of GDM.

Declarations
Funding
This work was supported by the Qingdao Municipal Science and Technology Bureau (Grant number: 19-6-1-55-nsh) for LW and JW.

Conflict of interest
The authors declare they have no financial interests.

Ethics approval
This study was approved by the Medical Ethics Committee of the Affiliated Hospital of Qingdao University (Ethical number: QYFYKYLL411311920).

Consent to participate
Informed consent was obtained from all individual participants included in the study prior to data collection.

Consent for publication
All authors confirm that this work is original and has not been published elsewhere, nor is it currently under consideration for publication elsewhere. All authors approve the content of the manuscript and have contributed significantly to the research involved and the writing of the manuscript.

Figure 1. Prediction model accuracy curve under different principal components (taking SVM as an example).