General patient characteristics and establishment of the models
Data were collected for patients with moyamoya disease who were admitted to the Second Affiliated Hospital of Nanchang University from January 1, 2012 to December 31, 2019 and who met the inclusion criteria and none of the exclusion criteria. In total, 994 patients were enrolled (496 in the cerebral infarction group and 498 in the cerebral hemorrhage group). Using the parameters described in the statistical methods section, LR, XGBoost and MLP were used to establish prediction models for hemorrhage/ischemia in moyamoya disease. The patient screening process is shown in Fig. 1, and the general information of the two groups is shown in Table 1.
Table 1 General information of patients
| Variable | CI (n=496) | CH (n=498) | t / z / χ2, P |
|---|---|---|---|
| General information | | | |
| *Gender [n (%)] | | | |
| Male | 223 (44.96%) | 281 (56.43%) | χ2=13.071, P<0.01 |
| Female | 273 (55.04%) | 217 (43.57%) | |
| *Age [n (%)] | | | |
| Juvenile (≤17 years) | 7 (1.41%) | 0 (0.00%) | χ2=26.501, P<0.01 |
| Young adult (18-44 years) | 114 (22.99%) | 180 (36.14%) | |
| The elderly (≥45 years) | 375 (75.60%) | 318 (63.86%) | |
| *Residence [n (%)] | | | |
| Urban | 127 (25.60%) | 31 (6.22%) | χ2=71.64, P<0.01 |
| Rural | 369 (74.40%) | 467 (93.78%) | |
| *Hospitalization times (x̄ ± s) | 1.853±1.4662 | 1.402±0.978 | t=5.709, P<0.05 |
| *Health insurance [n (%)] | | | |
| Employees | 145 (29.23%) | 97 (19.48%) | χ2=81.665, P<0.01 |
| Urban residents | 107 (21.57%) | 89 (17.87%) | |
| Rural cooperative | 120 (24.20%) | 228 (45.78%) | |
| Retirement | 2 (0.40%) | 12 (2.41%) | |
| Own expense | 92 (18.55%) | 71 (14.26%) | |
| Other types | 30 (6.05%) | 1 (0.20%) | |
| PLT | | | |
| Higher than normal | 44 (8.87%) | 40 (8.03%) | χ2=0.863, P=0.649 |
| Normal range | 436 (87.90%) | 437 (87.75%) | |
| Below normal | 16 (3.23%) | 21 (4.22%) | |
| Risk factors of vessels | | | |
| *Hypertension history [n (%)] | 185 (37.30%) | 86 (17.27%) | χ2=50.271, P<0.01 |
| *DM history [n (%)] | 45 (9.07%) | 15 (3.01%) | χ2=16.092, P<0.01 |
| Smoking [n (%)] | 86 (17.34%) | 96 (19.28%) | χ2=0.624, P=0.429 |
| Alcohol using [n (%)] | 82 (16.53%) | 92 (18.47%) | χ2=0.649, P=0.421 |
| *FBG | | | |
| Higher than normal | 135 (27.22%) | 87 (17.47%) | χ2=23.473, P<0.01 |
| Normal range | 346 (69.76%) | 408 (81.93%) | |
| Below normal | 15 (3.02%) | 3 (0.60%) | |
| Blood fat [median (IQR)] | | | |
| *HDL | 1.080 (0.213) | 1.106 (0.064) | z=-4.678, P<0.01 |
| LDL | 2.684 (0.720) | 2.654 (0.158) | z=-1.892, P=0.058 |
| *TG | 1.400 (0.585) | 1.354 (0.140) | z=-4.678, P<0.01 |
| TC | 4.364 (0.752) | 4.357 (0.082) | z=-0.712, P=0.476 |
| Apo A | 1.050 (0.207) | 1.055 (0.067) | z=-1.164, P=0.143 |
| Apo B | 0.836 (0.200) | 0.837 (0.027) | z=-0.241, P=0.810 |
| Vessels condition [n (%)] | | | |
| *Involved vessels | | | |
| MCA | 264 (53.23%) | 187 (37.55%) | χ2=24.64, P<0.01 |
| CA | 232 (46.77%) | 311 (62.45%) | |
| *Stenosis or occlusion | | | |
| Stenosis | 431 (86.90%) | 353 (70.88%) | χ2=38.23, P<0.01 |
| Occlusion | 65 (13.10%) | 145 (29.12%) | |
| *Aneurysm [n (%)] | 12 (2.42%) | 118 (23.69%) | χ2=100.17, P<0.01 |
| *Suzuki stage (x̄ ± s) | 2.748±0.7245 | 3.952±0.808 | t=-24.72, P<0.01 |
Variables marked with * in the table are risk factors that differ significantly between the two groups (P < 0.05).
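As a worked check of how the χ2 statistics in Table 1 can be obtained from the reported counts, the following minimal sketch computes a Pearson chi-square statistic for a contingency table. It assumes no continuity correction, which the text does not state; the function names are illustrative, and the "without history" counts for hypertension are derived here as group size minus the reported count.

```python
import math

def pearson_chi2(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof

def p_value_dof1(chi2):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Counts from Table 1; columns are (CI, CH).
gender = [[223, 281], [273, 217]]        # rows: male, female
hypertension = [[185, 86], [311, 412]]   # rows: with history, without (derived)

gender_chi2, gender_dof = pearson_chi2(gender)
hyper_chi2, hyper_dof = pearson_chi2(hypertension)
print(round(gender_chi2, 3), p_value_dof1(gender_chi2))
print(round(hyper_chi2, 3), p_value_dof1(hyper_chi2))
```

Under this assumption the statistics for gender and hypertension history agree with the values reported in Table 1 to three decimal places.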
Verification and comparison of models
Verification of models
Based on the collected data, the LR, XGBoost and MLP models were verified on the test set, and the evaluation indices of each model were compared between the test set and the training set. The AUC values on the test set are above 0.9 for all three models (> 0.75), which indicates that all three models have very good predictive ability. Details of the evaluation indices obtained from the training and test sets are shown in Table 2.
Comparison of evaluation indices between training and test sets
The AUC value, sensitivity, accuracy, specificity, positive predictive value, negative predictive value, positive likelihood ratio, negative likelihood ratio, Youden index and threshold of the three models were compared between the training set and the corresponding test set. Details of the evaluation indices of the models are shown in Table 2.
Table 2 Evaluation indices of the models on the training and test sets
| Index | LR (training set) | LR (test set) | XGBoost (training set) | XGBoost (test set) | MLP (training set) | MLP (test set) |
|---|---|---|---|---|---|---|
| AUC | 0.9227 | 0.9112 | 0.9677 | 0.9260 | 0.9672 | 0.9149 |
| Sensitivity | 0.8291 | 0.8284 | 0.9085 | 0.8449 | 0.9185 | 0.8317 |
| Accuracy | 0.8453 | 0.8402 | 0.9096 | 0.8621 | 0.91428 | 0.8482 |
| Specificity | 0.8642 | 0.8627 | 0.9131 | 0.8907 | 0.9124 | 0.8780 |
| PPV | 0.8602 | 0.8598 | 0.9138 | 0.8861 | 0.9143 | 0.8713 |
| NPV | 0.8331 | 0.8291 | 0.9079 | 0.8439 | 0.9153 | 0.8311 |
| PLR | 6.2492 | 7.3518 | 11.3990 | 12.3768 | 12.0077 | 8.0112 |
| NLR | 0.1971 | 0.1964 | 0.0997 | 0.1730 | 0.0896 | 0.1903 |
| Youden index | 0.6933 | 0.6911 | 0.8216 | 0.7356 | 0.8309 | 0.7097 |
| Threshold | 0.5154 | 0.5065 | 0.4769 | 0.5189 | 0.4744 | 0.5262 |
| 95% lower limit | 0.9215 | 0.9065 | 0.9657 | 0.9225 | 0.9643 | 0.9115 |
| 95% upper limit | 0.9239 | 0.9160 | 0.9696 | 0.9295 | 0.9701 | 0.9182 |
| AUC standard error | 0.0006 | 0.0024 | 0.0010 | 0.0018 | 0.0015 | 0.0017 |
PPV = positive predictive value; NPV = negative predictive value; PLR = positive likelihood ratio; NLR = negative likelihood ratio.
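For readers less familiar with the indices in Table 2, the sketch below shows their standard definitions in terms of confusion-matrix counts, together with a normal-approximation confidence interval for the AUC. The confusion-matrix counts are hypothetical and not taken from the study, the function names are illustrative, and the interval formula (AUC ± 1.96 × standard error) is an assumption that appears consistent with the reported limits rather than a method stated in the text.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Evaluation indices of a binary classifier from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "PLR": sensitivity / (1 - specificity),   # undefined when specificity is 1
        "NLR": (1 - sensitivity) / specificity,
        "Youden": sensitivity + specificity - 1,
    }

def auc_confidence_interval(auc, se, z=1.96):
    """Normal-approximation confidence interval for an AUC (assumed method)."""
    return auc - z * se, auc + z * se

# Hypothetical counts, for illustration only (not taken from the study).
m = diagnostic_metrics(tp=80, fp=10, tn=90, fn=20)

# Values reported in Table 2 for the LR model on the training set.
youden_lr_train = 0.8291 + 0.8642 - 1
ci_lr_train = auc_confidence_interval(0.9227, 0.0006)
print(m)
print(round(youden_lr_train, 4), [round(x, 4) for x in ci_lr_train])
```

With the LR training-set values, the Youden index and the interval limits computed this way match the corresponding entries in Table 2 to four decimal places.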
ROC curve comparison between training and test sets
With random_state = 0 used to split the data into training and test sets, the best parameters of the XGBoost model are max_depth = 4, n_estimators = 100, subsample = 0.9 and learning_rate = 0.05, and the best parameters of the MLP model are activation = 'logistic', alpha = 0.001 and hidden_layer_sizes = 15. The ROC curves of the models on the training and test sets are shown in Figs. 2 and 3.
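The reported hyperparameters can be collected in a small configuration fragment such as the one below. This is only a sketch: the underscore spellings of the parameter names, and the assumption that they are passed to `XGBClassifier` (xgboost) and `MLPClassifier` (scikit-learn), are not confirmed by the text, and the dictionary names are illustrative.

```python
# Hyperparameters as reported in the text (parameter spellings are assumed).
SPLIT_PARAMS = {"random_state": 0}

XGB_PARAMS = {
    "max_depth": 4,
    "n_estimators": 100,
    "subsample": 0.9,
    "learning_rate": 0.05,
}

MLP_PARAMS = {
    "activation": "logistic",
    "alpha": 0.001,
    "hidden_layer_sizes": (15,),  # one hidden layer with 15 units
}

# Assumed usage (requires the third-party xgboost and scikit-learn packages):
#   from xgboost import XGBClassifier
#   from sklearn.neural_network import MLPClassifier
#   xgb_model = XGBClassifier(**XGB_PARAMS)
#   mlp_model = MLPClassifier(**MLP_PARAMS)
```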
Table 2 shows that, for each model, the AUC values of the training set and the corresponding test set do not differ significantly (P > 0.05), and all AUC values are above 0.9 (> 0.75). This indicates that the three models have good discrimination ability in predicting hemorrhage/ischemia in moyamoya disease. The AUC of the XGBoost model is the closest to 1, which suggests that the XGBoost model may have better discrimination ability than the other models. Details of the comparison of discrimination ability between models are given in the section “Comparison of discrimination ability”.
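To make the notions of AUC and threshold concrete, the sketch below computes the AUC as the proportion of positive/negative pairs that the model ranks correctly, and picks the threshold that maximizes the Youden index. The labels and scores are hypothetical, the coding of the positive class as 1 is an assumption, and the text does not state that the thresholds in Table 2 were chosen by this rule.

```python
def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def youden_threshold(y_true, scores):
    """Threshold t (predict positive when score >= t) that maximizes the Youden index."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # keep the first (lowest) threshold on ties
            best_t, best_j = t, j
    return best_t, best_j

# Hypothetical labels (1 = positive class) and predicted probabilities.
y = [1, 1, 1, 1, 0, 0, 0]
p = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2]

auc = auc_from_scores(y, p)
threshold, youden = youden_threshold(y, p)
print(auc, threshold, youden)
```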
Comparison between the XGBoost model and other models
Comparison of discrimination ability
Comparison between training sets: In the training set, the AUC of the XGBoost model is 0.9677, which is larger than that of the LR model (AUC = 0.9227) and the MLP model (AUC = 0.9672), so the XGBoost model has numerically the best discrimination ability in predicting hemorrhage/ischemia in moyamoya disease. The AUC values of each pair of models were compared with a Z test. For the XGBoost and LR models, z = 0.3430, P > 0.05, so there is no significant difference in discrimination ability between the two models; for the XGBoost and MLP models, z = -0.2774, P > 0.05, so there is again no significant difference.
Comparison between test sets: In the test set, the AUC of the XGBoost model is 0.9260, which is larger than that of the LR model (AUC = 0.9112) and the MLP model (AUC = 0.9149). For the XGBoost and LR models, z = -0.2000, P > 0.05, so there is no significant difference in discrimination ability between the two models; for the XGBoost and MLP models, z = 0.0404, P > 0.05, so there is again no significant difference.
In summary, the XGBoost model has a numerically higher AUC than the LR and MLP models, but the differences are not significant in either the training set or the test set.
Comparison of accuracy
The ROC curve and AUC are often used to compare the discrimination ability of two models, while the net reclassification index (NRI) is often used to compare the accuracy of their predictions [11].
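As an illustration of what the NRI measures, the following sketch computes a two-category NRI for a new model relative to an old one, together with a commonly used asymptotic z statistic. The labels and predictions are hypothetical, the function name is illustrative, and the text does not specify which NRI variant or test was used in the study, so this should not be read as the authors' procedure.

```python
import math

def categorical_nri(y_true, old_pred, new_pred):
    """Two-category NRI of new_pred over old_pred, with an asymptotic z statistic.

    y_true, old_pred and new_pred are sequences of 0/1 values (1 = event).
    """
    up_ev = down_ev = up_ne = down_ne = 0
    n_ev = n_ne = 0
    for y, old, new in zip(y_true, old_pred, new_pred):
        if y == 1:
            n_ev += 1
            up_ev += new > old      # event moved to the higher-risk class (correct)
            down_ev += new < old    # event moved to the lower-risk class (wrong)
        else:
            n_ne += 1
            up_ne += new > old      # non-event moved up (wrong)
            down_ne += new < old    # non-event moved down (correct)
    nri = (up_ev - down_ev) / n_ev + (down_ne - up_ne) / n_ne
    variance = (up_ev + down_ev) / n_ev ** 2 + (up_ne + down_ne) / n_ne ** 2
    z = nri / math.sqrt(variance) if variance > 0 else float("nan")
    return nri, z

# Hypothetical data: 5 events followed by 5 non-events.
y_true   = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
old_pred = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
new_pred = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]

nri, z = categorical_nri(y_true, old_pred, new_pred)
print(nri, z)
```

Here two events are correctly reclassified upward and one non-event is correctly reclassified downward, so the NRI is 2/5 + 1/5 = 0.6. A positive NRI means that the new model classifies more cases correctly, on balance, than the old one.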
Comparison between training sets: In the training set, the NRI of the XGBoost model relative to the LR model is 0.1811, z = 7.9471, P < 0.01, which indicates that the prediction ability of the XGBoost model is improved by 18.11%, and the difference is significant. Relative to the MLP model, the NRI of the XGBoost model is 0.0377, z = 2.1850, P = 0.0290 < 0.05, which indicates that the prediction ability of the XGBoost model is improved by 3.77%, and the difference is significant at the P < 0.05 level. Relative to the LR model, the NRI of the MLP model is 0.1434, z = 6.5760, P < 0.01, which indicates that the prediction ability of the MLP model is improved by 14.34%, and the difference is significant.
Comparison between test sets: In the test set, the NRI of the XGBoost model relative to the LR model is 0.1107, z = 2.5247, P = 0.0120 < 0.05, which indicates that the prediction ability of the XGBoost model is improved by 11.07%, and the difference is significant at the P < 0.05 level. Relative to the MLP model, the NRI of the XGBoost model is 0.0306, z = 0.7838, P = 0.4330, which indicates that the prediction ability of the XGBoost model is improved by 3.06%, but the difference is not significant. Relative to the LR model, the NRI of the MLP model is 0.0801, z = 2.1390, P = 0.0320 < 0.05, which indicates that the prediction ability of the MLP model is improved by 8.01% in the test set, and the difference is significant at the P < 0.05 level.
In terms of accuracy, the XGBoost model is better than the LR and MLP models in the training set, and both differences are significant. In the test set, the XGBoost model is also better than the LR and MLP models, but only the difference from the LR model is significant. The MLP model is better than the LR model in both the training and test sets, and both differences are significant.
Feature importance of models
Data for 994 patients with moyamoya disease (496 with cerebral infarction and 498 with cerebral hemorrhage) who met the inclusion criteria and none of the exclusion criteria were collected from the database of Jiang Xi Province Medical Big Data Engineering. LR, XGBoost and MLP were used to establish prediction models for hemorrhage/ischemia in moyamoya disease. The top ten variables of each model, ranked by feature importance, are shown in Figs. 4-6.
SOO = stenosis or occlusion; FBG = fasting blood glucose; HT = hospitalization times; TC = total cholesterol; TG = triglyceride; TOHI2 = own expense for hospitalization; TOHI3 = other type of health insurance; TOHI6 = rural cooperative health insurance.
The Suzuki grades of the two groups are compared in a boxplot (Fig. 7). The Suzuki grades of the cerebral hemorrhage group are mainly concentrated in grades 4 and 5, while those of the cerebral infarction group are mainly concentrated in grades 2 and 3.
The proportions of patients with and without aneurysm in the two groups are compared in a histogram (Fig. 8). Only 2.42% of patients in the cerebral infarction group have an aneurysm, compared with 23.69% in the cerebral hemorrhage group.