Machine Learning Models of Hemorrhage/Ischemia in Moyamoya Disease and Analysis of Its Risk Factors

Objective: To identify the risk factors for hemorrhage/ischemia in patients with moyamoya disease, to establish prediction models using logistic regression (LR), XGboost and multilayer perceptron (MLP), and to evaluate and compare the performance of these models, providing a theoretical basis for preventing stroke recurrence in patients with moyamoya disease. Methods: This retrospective study used data from the database of the Jiang Xi Province Medical Big Data Engineering & Technology Research Center; data were collected for patients with moyamoya disease admitted to the Second Affiliated Hospital of Nanchang University from January 1, 2012 to December 31, 2019. A total of 994 patients with moyamoya disease were screened, including 496 patients with cerebral infarction and 498 patients with cerebral hemorrhage. LR, XGboost and MLP were used to establish models of hemorrhage/ischemia in moyamoya disease, and the performance of the models was verified and compared. Results: The LR, XGboost and MLP models all had good discrimination (AUC > 0.75), with AUC values of 0.9227 (95% CI: 0.9215-0.9239), 0.9677 (95% CI: 0.9657-0.9696) and 0.9672 (95% CI: 0.9643-0.9701), respectively. Compared with the LR model, the prediction ability of the XGboost and MLP models improved in both the training and test sets; in the training set the improvement was 18.11% and 14.34%, respectively, and the difference was significant. Conclusion: Compared with the traditional LR model, the machine learning models are more effective in predicting hemorrhage/ischemia in moyamoya disease.


Background
Moyamoya disease (MMD) is a rare cerebrovascular disease characterized by chronic progressive stenosis or occlusion at the onset of the bilateral internal carotid arteries, anterior cerebral arteries and middle cerebral arteries, accompanied by abnormal vascular proliferation. The incidence is high in Asian countries such as Japan, Korea and China. The etiology remains unclear and no definite pathogen has been detected; multiple factors are thought to contribute to the occurrence of moyamoya disease. In people with susceptibility genes, external environmental factors such as infection and inflammation are thought to trigger abnormal immunologic mechanisms that cause the disease [1][2][3]. Family-based studies report that the prevalence of familial MMD in China is 1.5% [4]. Vascular injury has been demonstrated in autopsy reports of patients with moyamoya disease [5]. Studies have shown that hyperhomocysteinemia, hypertension and smoking are risk factors for vascular injury in patients with moyamoya disease [6].
The main clinical manifestations of moyamoya disease are cerebral ischemia or cerebral hemorrhage [7].
Most pediatric patients present with ischemic symptoms, whereas adult patients may present with cerebral ischemia, hemorrhage or both. Intracranial hemorrhage is the initial symptom in half of adult patients with moyamoya disease [8], which is broadly consistent with this study. A study of the initial symptoms of moyamoya disease found, in univariate analysis, that a family history of moyamoya disease and cerebral hemorrhage were risk factors for hemorrhagic stroke and that a family history of cerebral infarction was a risk factor for ischemic stroke, but multivariate analysis showed no statistical significance [4].
The study of Won-sang Cho indicated that cerebral hemorrhage is more common in familial moyamoya disease [9]. The risk and protective factors of moyamoya disease with cerebral hemorrhage as the initial symptom have not been fully identified, and hemorrhagic stroke is the main cause of death in patients with moyamoya disease. There is no ideal medical treatment for moyamoya disease; the key to treatment is reducing the incidence of hemorrhagic stroke and preventing stroke recurrence. The prevention of stroke recurrence in patients with moyamoya disease remains controversial, because it lacks an exact theoretical basis and there is no means of predicting whether a recurrent stroke will be ischemic or hemorrhagic. Few studies have examined the risk and protective factors that predict the initial symptoms of moyamoya disease. In this study, we used three different methods to establish models of hemorrhage/ischemia in moyamoya disease in order to predict the type of stroke in patients with moyamoya disease. With the models constructed in this study, a patient's risk and protective factors can be entered into the models: if the predicted probability of ischemic stroke is high, aspirin can be recommended to prevent the recurrence of ischemic stroke; conversely, if the predicted risk of cerebral hemorrhage is high, aspirin is not recommended, in order to reduce the risk of cerebral hemorrhage. The study provides a theoretical basis for the prevention of stroke recurrence in patients with moyamoya disease.

Data Source
In this retrospective study, the data of patients with moyamoya disease admitted to the Second Affiliated Hospital of Nanchang University from January 1, 2012 to December 31, 2019 were collected from the database of the Jiang Xi Province Medical Big Data Engineering & Technology Research Center. All patients were required to meet the following criteria: 1. Moyamoya disease diagnosed by CTA and/or MRA and/or DSA [10]. 2. First onset of the disease. 3. Clinical neurological deficit symptoms. The following cases were excluded: 1. Cases in which the initial symptom was not cerebral infarction or cerebral hemorrhage. 2. Cases with incomplete information. 3. Cases with connective tissue disease, autoimmune disease or hyperthyroidism. 4. Cases with multiple neurofibromas. Patients with ischemic stroke as the initial symptom were assigned to the cerebral infarction group, and patients with hemorrhagic stroke as the initial symptom were assigned to the cerebral hemorrhage group. The experimental protocol was approved by the ethics committee of the Second Affiliated Hospital of Nanchang University, and all methods were carried out according to the relevant guidelines and regulations. Consent was obtained from all subjects or, for subjects under 18 years old, from their parents.

Variables
According to the inclusion and exclusion criteria, the following patient information was collected: gender, age of onset, ethnicity, marital status, long-term residence (urban or rural), type of medical insurance, fasting blood glucose, platelet count, high-density lipoprotein, low-density lipoprotein, triglyceride, total cholesterol, apolipoprotein, history of hypertension and diabetes, smoking and drinking history, involved vessels, stenosis or occlusion, presence of aneurysm, other disease history and other data. The Suzuki rating was assigned according to the vascular condition of each patient [4].

Statistical Analysis
This study uses Python version 3.6 as the development tool to establish the models. The data set is randomly divided into a training set and a test set at a ratio of 4:1, and the grouping is repeated 20 times with different random seeds. Within the training set, the optimal parameters of XGboost and MLP are selected by grid search with 3-fold cross validation, keeping the parameters of the model with the largest area under the curve (AUC).
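The splitting and tuning protocol described above can be sketched as follows. This is a minimal illustration, not the authors' code: the data are synthetic, the estimator is a plain logistic regression with a hypothetical grid over `C`, and the stratified split is an assumption (the text only says the split is random).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the patient table (994 rows, binary outcome).
X, y = make_classification(n_samples=994, n_features=20, random_state=0)


def repeated_holdout_auc(X, y, n_repeats=20, test_size=0.2):
    """Repeat a 4:1 split with different seeds; tune on the training part only."""
    train_aucs, test_aucs = [], []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed
        )
        # 3-fold cross-validated grid search, selecting by AUC.
        search = GridSearchCV(
            LogisticRegression(max_iter=1000),
            param_grid={"C": [0.1, 1.0, 10.0]},  # hypothetical grid
            scoring="roc_auc",
            cv=3,
        )
        search.fit(X_tr, y_tr)
        best = search.best_estimator_  # refitted on the full training split
        train_aucs.append(roc_auc_score(y_tr, best.predict_proba(X_tr)[:, 1]))
        test_aucs.append(roc_auc_score(y_te, best.predict_proba(X_te)[:, 1]))
    return np.array(train_aucs), np.array(test_aucs)


train_aucs, test_aucs = repeated_holdout_auc(X, y)
print(f"mean train AUC {train_aucs.mean():.4f}, mean test AUC {test_aucs.mean():.4f}")
```

Keeping the test split out of the grid search is what allows the test-set AUC to serve as an estimate of performance on unseen patients.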
Parameter ranges of the XGboost model: max_depth in (2, 3, 4, 5); n_estimators in (40, 60, 80, 100); subsample in (0.8, 0.9, 1); learning_rate in (0.01, 0.05, 0.1). For MLP, max_iter = 1000 and random_state = 0 are fixed; parameter ranges of the MLP model: activation in (identity, logistic, tanh, relu); alpha in (0.01, 0.001, 0.0001); hidden_layer_sizes in (15, 20, 25). MLP and LR call the corresponding functions in the sklearn package; XGboost calls the XGboost package; feature importance is screened by calling the PermutationImportance function of the eli5.sklearn package. SPSS Statistics 18 is used for univariate analysis of general data, with the t test or Mann-Whitney U test for quantitative data and the χ² test for qualitative data. The Z test is used to compare AUC values between two models. In this paper, P < 0.05 is taken as the threshold for statistical significance.
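The search spaces listed above can be written as scikit-learn parameter grids, as in the sketch below. Two points are assumptions rather than statements from the text: each hidden layer size is read as the width of a single hidden layer, and the XGboost grid is shown only as a dictionary so that the example runs without the XGboost package installed.

```python
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.neural_network import MLPClassifier

# XGboost search space as reported in the text.
xgb_grid = {
    "max_depth": [2, 3, 4, 5],
    "n_estimators": [40, 60, 80, 100],
    "subsample": [0.8, 0.9, 1],
    "learning_rate": [0.01, 0.05, 0.1],
}

# MLP search space; single hidden layer of 15, 20 or 25 units (assumption).
mlp_grid = {
    "activation": ["identity", "logistic", "tanh", "relu"],
    "alpha": [0.01, 0.001, 0.0001],
    "hidden_layer_sizes": [(15,), (20,), (25,)],
}

# Number of candidate settings evaluated per grid search.
n_xgb = len(ParameterGrid(xgb_grid))
n_mlp = len(ParameterGrid(mlp_grid))

# 3-fold grid search selecting by AUC, with the fixed MLP settings from the text.
mlp_search = GridSearchCV(
    MLPClassifier(max_iter=1000, random_state=0),
    param_grid=mlp_grid,
    scoring="roc_auc",
    cv=3,
)
print(n_xgb, n_mlp)
```

Calling `mlp_search.fit(X_train, y_train)` on a training split would run the search. The XGboost grid could be passed in the same way with `xgboost.XGBClassifier` as the estimator, since that class accepts these parameter names.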

General Patient Characteristics and establishment of the model
Data were collected for patients with moyamoya disease who were admitted to the Second Affiliated Hospital of Nanchang University from January 1, 2012 to December 31, 2019, met the inclusion criteria and did not meet the exclusion criteria. In total, 994 patients were enrolled (496 in the cerebral infarction group and 498 in the cerebral hemorrhage group). Using the parameters described in the Statistical Analysis section, LR, XGboost and MLP were used to establish prediction models of hemorrhage/ischemia in moyamoya disease. The patient screening process is shown in Fig. 1, and the general information of the two groups is shown in Table 1. The LR, XGboost and MLP models are verified on the test set, and the evaluation indices of the models are compared between the test set and the training set.
The AUC values of the three models in the test set are all above 0.9 (> 0.75), which indicates that the predictive ability of the three models is very good. The evaluation indices of the models obtained from the training and test sets are detailed in Table 2.
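Evaluation indices of this kind can be derived from predicted probabilities as sketched below. The text does not state how the threshold was chosen; this sketch assumes it is the cut-off that maximizes the Youden index, and the function name `evaluation_indices` is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve


def _ratio(num, den):
    """Divide, returning NaN when the denominator is zero."""
    return num / den if den else float("nan")


def evaluation_indices(y_true, y_score):
    """AUC plus threshold-based indices at the Youden-optimal cut-off (assumption)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    best = int(np.argmax(tpr - fpr))  # index of the maximal Youden index
    threshold = float(thresholds[best])
    y_pred = y_score >= threshold
    tp = int(np.sum(y_pred & (y_true == 1)))
    fp = int(np.sum(y_pred & (y_true == 0)))
    fn = int(np.sum(~y_pred & (y_true == 1)))
    tn = int(np.sum(~y_pred & (y_true == 0)))
    sens = _ratio(tp, tp + fn)
    spec = _ratio(tn, tn + fp)
    return {
        "auc": roc_auc_score(y_true, y_score),
        "threshold": threshold,
        "sensitivity": sens,
        "specificity": spec,
        "accuracy": (tp + tn) / len(y_true),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "lr_plus": _ratio(sens, 1 - spec),
        "lr_minus": _ratio(1 - sens, spec),
        "youden": sens + spec - 1,
    }


# Toy labels and scores, for illustration only.
example = evaluation_indices(
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0.1, 0.2, 0.3, 0.7, 0.4, 0.6, 0.8, 0.9],
)
print(example)
```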

Comparison of evaluation indices between training and test sets
The AUC value, sensitivity, accuracy, specificity, positive predictive value, negative predictive value, positive likelihood ratio, negative likelihood ratio, Youden index and threshold of the three models are compared between the training set and the corresponding test set. The evaluation indices of the models are detailed in Table 2.
Comparison of accuracy: The ROC curve and AUC are often used to compare the discrimination ability of two models, while the Net Reclassification Index (NRI) is often used to compare the accuracy of their predictions [11].
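For two classifiers that each assign a patient to one of two classes, the NRI and an accompanying z statistic can be computed as sketched below. The text does not specify which NRI variant or test was used; this sketch assumes the two-category NRI with the simple asymptotic z test commonly paired with it, and the function name `binary_nri` is hypothetical.

```python
import math

import numpy as np


def binary_nri(y_true, pred_old, pred_new):
    """Two-category NRI of a new model over an old one, with z and two-sided P."""
    y_true, pred_old, pred_new = (np.asarray(a) for a in (y_true, pred_old, pred_new))
    events, nonevents = y_true == 1, y_true == 0
    up = (pred_new == 1) & (pred_old == 0)    # reclassified upward
    down = (pred_new == 0) & (pred_old == 1)  # reclassified downward
    n_e, n_ne = int(events.sum()), int(nonevents.sum())
    p_up_e = (up & events).sum() / n_e
    p_down_e = (down & events).sum() / n_e
    p_up_ne = (up & nonevents).sum() / n_ne
    p_down_ne = (down & nonevents).sum() / n_ne
    # Net gain in correct classification among events plus among non-events.
    nri = (p_up_e - p_down_e) + (p_down_ne - p_up_ne)
    se = math.sqrt((p_up_e + p_down_e) / n_e + (p_up_ne + p_down_ne) / n_ne)
    z = nri / se if se > 0 else float("nan")
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal P value
    return float(nri), float(z), float(p_value)


# Toy predictions, for illustration only: 10 events followed by 10 non-events.
y = [1] * 10 + [0] * 10
old = [1] * 6 + [0] * 4 + [0] * 7 + [1] * 3
new = [1] * 8 + [0] * 2 + [0] * 9 + [1] * 1
nri, z, p_value = binary_nri(y, old, new)
print(nri, z, p_value)
```

Under this definition the NRI equals the gain in sensitivity plus the gain in specificity, which is why a value such as 0.1811 is read as an 18.11% improvement.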
Comparison between training sets: In the training set, compared with the LR model, the NRI of the XGboost model is 0.1811, z = 7.9471, P < 0.01, which indicates that the prediction ability of the XGboost model is improved by 18.11%, and the difference is significant. Compared with the MLP model, the NRI of the XGboost model is 0.0377, z = 2.1850, P = 0.0290 < 0.05, which indicates that the prediction ability of the XGboost model is improved by 3.77%, and the difference is significant at the P < 0.05 level. Compared with the LR model, the NRI of the MLP model is 0.1434, z = 6.5760, P < 0.01, which indicates that the prediction ability of the MLP model is improved by 14.34%, and the difference is significant.
The Suzuki grades of the two groups are compared by boxplot (Fig. 7). As Fig. 7 shows, the Suzuki grades of the cerebral hemorrhage group are mainly concentrated in grades 4 and 5, while those of the cerebral infarction group are mainly concentrated in grades 2 and 3.
The proportion of patients with aneurysm in the two groups is compared by histogram (Fig. 8): it is only 2.42% in the cerebral infarction group, but 23.69% in the cerebral hemorrhage group.

Discussion
Effectiveness of the models
Several studies have addressed prediction models of hemorrhage/ischemia in moyamoya disease; they mainly focus on the risk factors of hemorrhagic moyamoya disease, and most of them use LR to establish the model. Although XGboost and MLP have been widely used in artificial intelligence and other fields, they are rarely used in the medical field. In this study, LR, XGboost and MLP were used to establish prediction models of hemorrhage/ischemia in moyamoya disease, and important evaluation indices such as discrimination ability, accuracy and specificity were compared among the three methods. All three methods have good discrimination (AUC > 0.75). In the training set of this study, the AUC of the XGboost model is 0.9677, which is larger than that of the LR model (AUC = 0.9227) and the MLP model (AUC = 0.9672), indicating that the XGboost model may have the best discrimination in predicting hemorrhage/ischemia of moyamoya disease, although the difference from the LR and MLP models is not significant. In terms of NRI, an important index of prediction accuracy, the XGboost model is superior to the LR and MLP models in the training set with significant differences, its prediction ability being improved by 18.11% and 3.77%, respectively. The XGboost model is also superior to the LR and MLP models in the test set, with improvements of 11.07% and 3.06%, respectively; the difference is significant compared with the LR model but not compared with the MLP model. The prediction ability of the MLP model is significantly better than that of the LR model in both the training set and the test set, with improvements of 14.34% and 8.01%, respectively.
In summary, the discrimination ability of the XGboost model may be better than that of the LR and MLP models, but the difference is not significant; the prediction accuracy of the XGboost model is significantly better than that of the LR and MLP models; and the prediction accuracy of the MLP model is significantly better than that of the LR model. The prediction accuracy of the models established by machine learning methods is better than that of LR [12][13], which is consistent with the results of Gingal and Churpek [14][15][16]. The accuracy, specificity, positive predictive value and negative predictive value of the XGboost model are better than those of the LR and MLP models. The reasons may be as follows. Firstly, XGboost is a nonlinear ensemble learning algorithm, and its tree models can be split infinitely, which allows a better fit to the data than the LR model [12]. Secondly, LR treats each factor as independent and assumes a linear relationship, whereas XGboost has strong plasticity and flexibility [13] and can automatically find and use the interaction effects and nonlinear relationships between related factors, making the prediction more accurate [17]. Finally, as a feedforward artificial neural network, MLP can handle problems that are not linearly separable, and it has good fault tolerance and strong adaptive and self-learning capabilities [18]. However, its learning speed is slow, and it is very difficult to select the number of hidden nodes in the network, which may lead to insufficient learning. The MLP model in this paper is better than the LR model and worse than the XGboost model, which may be due to insufficient learning, but this needs further verification.

Feature importance of models
In the univariate statistical analysis of general data, each risk factor is analyzed on its own, which is not rigorous or accurate enough to predict the initial symptoms of moyamoya disease. The traditional LR method assumes that the contribution of every factor to the model is linear, which is not the case in clinical practice. Machine learning, by contrast, can automatically discover and use the interaction effects and nonlinear relationships between related factors when establishing models, which is the main difference between traditional regression analysis and machine learning methods in modeling [12]. Among the top ten factors by feature importance in the three models established in this study, the LR model indicated that hemorrhagic stroke as the first symptom may be closely related to Suzuki rating, presence of aneurysm, residence, involved vessels, vascular stenosis or occlusion and hospitalization times; the MLP model indicated that it may be closely related to Suzuki rating, presence of aneurysm, other types of medical insurance, apolipoprotein A and hospitalization times; and the XGboost model indicated that it may be closely related to Suzuki rating, presence of aneurysm, fasting blood glucose, hospitalization times and residence. The top ten factors of all three models indicated that hemorrhagic stroke as the initial symptom of patients with moyamoya disease may be closely related to Suzuki rating, presence of aneurysm, hospitalization times, residence and age of onset. In conclusion, multiple factors determine the initial symptoms of patients with moyamoya disease; Suzuki rating, presence of aneurysm, hospitalization times, residence, age of onset and other factors may be important risk factors for hemorrhagic stroke, which is similar to some reports [19].
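A ranking of the top ten factors by permutation importance can be sketched as follows. This is an illustration on synthetic data with hypothetical feature names, and it swaps in scikit-learn's `permutation_importance` in place of the eli5 PermutationImportance function that the study reports using; the two follow the same idea of shuffling one feature at a time and measuring the drop in score.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the patient table; feature names are hypothetical.
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=4, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and record the mean drop in AUC.
result = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=0
)

# Rank features by mean importance and keep the ten largest.
ranking = sorted(
    zip(feature_names, result.importances_mean), key=lambda item: item[1], reverse=True
)
top_ten = ranking[:10]
for name, importance in top_ten:
    print(f"{name}: {importance:.4f}")
```

Because the importance is measured as a drop in the model's own score, the same procedure applies unchanged to LR, MLP and XGboost, which makes rankings comparable across the three models.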

Limitations
In this study, patient data are collected only from the database of the Jiang Xi Province Medical Big Data Engineering & Technology Research Center. The earlier a case was entered in the database, the higher the proportion of incomplete data, so a higher proportion of early cases are excluded from this study because of incomplete information. In addition, owing to the limitations of medical practice at the time, the earlier the case, the lower the proportion diagnosed as moyamoya disease. These factors increase the sampling error of data collection. Although the sample size of this study is 994 cases, it is still not very large for the MLP method, which has a certain impact on the performance of the MLP model. Due to the geographical limitations of the patients, some risk factors in the collected data are eventually eliminated, such as ethnicity, because 100% of the patients belong to the Han majority.
In conclusion, the XGboost model can better predict the initial symptoms of patients with moyamoya disease and provides an important theoretical basis for preventing stroke recurrence in these patients. The impact of regional limitations on the results can be reduced by multicenter research and a larger sample size, and the sampling error can be reduced by collecting case data from recent years, so as to improve the performance of the model.

Conclusion
In this study, data of 994 patients with moyamoya disease were collected from the database, and LR, XGboost and MLP were used to establish prediction models for hemorrhage/ischemia in moyamoya disease; the three models were tested and compared. The conclusions are as follows. 1) All three models have good predictive ability in predicting hemorrhage/ischemia in patients with moyamoya disease, and the XGboost model is better than the other two models in accuracy, specificity, positive predictive value, negative predictive value and other important evaluation indices. 2) The XGboost model may be better than the LR and MLP models in terms of discrimination ability, but the difference is not significant in either the training set or the test set. 3) In terms of the accuracy of the model, the XGboost model is significantly better than the LR and MLP models in the training set; in the test set, the XGboost model is also better than the LR and MLP models, and the difference is significant compared with the LR model. The MLP model is significantly better than the LR model in both the training set and the test set. 4) According to the order of feature importance in the three models, Suzuki rating, presence of aneurysm, involved vessels, vascular stenosis or occlusion, residence in villages and towns, and hospitalization times all contributed substantially to the three models, which indicates that they may be important risk factors for hemorrhagic stroke. These factors also differ significantly in univariate analysis, which further indicates that they may be important risk factors for hemorrhagic stroke.

Declarations
Author information
Affiliations