Prediction of post-stroke epilepsy using machine learning method

Background: Stroke is one of the most important causes of epilepsy. We aimed to determine whether patients at high risk of developing post-stroke epilepsy (PSE) can be identified at the time of discharge using machine learning methods. Methods: Patients with stroke were enrolled and followed for at least one year. Machine learning methods including support vector machine (SVM), random forest (RF) and logistic regression (LR) were used to learn the data. Results: A total of 2730 patients with cerebral infarction and 844 patients with cerebral hemorrhage were enrolled, and the one-year risk of PSE was 2.8% after cerebral infarction and 7.8% after cerebral hemorrhage. Machine learning methods showed good performance in predicting PSE. The area under the receiver operating characteristic curve (AUC) for SVM and RF in predicting PSE after cerebral infarction was close to 1, and it was 0.92 for LR. In predicting PSE after cerebral hemorrhage, SVM performed best with an AUC close to 1, followed by RF (AUC = 0.99) and LR (AUC = 0.85). Conclusion: Machine learning methods could be used to identify patients at high risk of developing PSE, which would help to stratify these patients and start treatment earlier. Nevertheless, more work is needed before such intelligent predictive models can be applied in clinical practice.


Background
Stroke has been recognized as one of the most important causes of epilepsy, especially in the elderly [1,2]. Previous studies showed that stroke accounted for about half of the cases of acquired epilepsy in patients over 60 years old [3,4]. As the population ages, the number of patients with post-stroke epilepsy (PSE) will continue to increase. PSE significantly reduces the quality of life of stroke patients and increases the economic and psychological burden on patients and their families [5]. Therefore, identifying patients at high risk of PSE is very important [2,6].
Seizures can occur in the acute phase of stroke or months to years afterwards. Seizures that occur within 7 days after stroke are known as early onset seizures, and those that occur more than 7 days after stroke are called late onset seizures or unprovoked seizures [7]. According to the definition of the International League Against Epilepsy (ILAE), the late onset seizure is also known as PSE [8]. The risk of PSE is 2.6% to 9.5% in 5 years after cerebral infarction, which is 10 to 30 times that of the general population [9][10][11][12][13], and it could be up to 11.8% after cerebral hemorrhage [14].
Although it is recognized that early onset seizure, cortical involvement, and the severity of symptoms are closely related to PSE [14], other clinical features and laboratory findings have not been fully explored. A few studies have demonstrated the feasibility of developing prediction models of PSE using traditional analyses, but their clinical application has been limited by low sensitivity and specificity [2,6,14]. Unlike traditional analysis methods, machine learning is well suited to data mining and has become a hotspot in medical research [15]. Its powerful classification and predictive capabilities have been confirmed by a growing number of studies [15].
In this study, we aimed to determine whether it is feasible to predict which patients will develop PSE based on their clinical features and laboratory findings. Three machine learning algorithms, support vector machine (SVM), random forest (RF) and logistic regression (LR), were used to learn the data and build the predictive models.

Data source and study population
This study was approved by the biomedical ethics committee of West China Hospital of Sichuan University. All subjects agreed to participate in the project and signed the informed consent. Patients were included if they were treated in West China Hospital of Sichuan University from 2010 to 2017, were diagnosed with "cerebral infarction" or "cerebral hemorrhage", and were older than 16 years. Patients were excluded if they had (1) a previous history of epilepsy; (2) a previous history of cerebral hemorrhage or cerebral infarction; (3) intracranial tumor, infection, trauma or operation; (4) cerebral infarction caused by venous sinus thrombosis; (5) too much missing information; (6) a further cerebral infarction or cerebral hemorrhage during follow-up; (7) death in hospital or within one year after discharge; (8) a follow-up period of less than 12 months.

Data extraction
Information including demographic characteristics (gender, age at stroke, etc.), clinical features (early onset seizure, antiepileptic drugs, risk factors of stroke, severity of stroke, etc.) and examination results (biochemical and imaging findings) was systematically extracted. Seizures and antiepileptic drug use after discharge were recorded through clinic visits and/or telephone follow-up using a structured form. Seizures occurring more than 7 days after stroke were diagnosed as PSE.

Missing data handling
Missing data are inevitable, and removing every subject or variable with missing data would lose a great deal of information and reduce the sample size. Missing data were handled as follows before the machine learning algorithms were applied: a) variables with too much missing data that were also judged likely to be unimportant on the basis of clinical experience and previous research were removed; b) subjects with too many missing records were also removed; c) the remaining missing values were then imputed with the median (quantitative data) or the mode (categorical data).
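Step c) can be illustrated with a minimal sketch using only the Python standard library. The function name `impute_column`, the `kind` argument and the toy columns are hypothetical and are not taken from the study's code or data.

```python
from statistics import median, mode

def impute_column(values, kind):
    """Fill None entries with the median (quantitative) or the mode (categorical)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed) if kind == "quantitative" else mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical toy columns; None marks a missing record.
nihss = [4, None, 12, 7, None]                    # quantitative variable
early_seizure = ["no", "no", None, "yes", "no"]   # categorical variable

nihss_filled = impute_column(nihss, "quantitative")
early_seizure_filled = impute_column(early_seizure, "categorical")
```

In this toy example the two missing scores are replaced by the median of the observed scores (7), and the missing categorical record is replaced by the most frequent category ("no").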

Feature selection
The quality of a model depends largely on the correct selection of features. Including too many features adds noise and leads to overfitting, whereas including too few degrades the performance of the model [16]. In this study, a univariate feature selection method was used, which computes a score and a P value for each feature using a scoring function. Features with a P value less than 0.05 were then entered into the machine learning algorithms.
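The scoring function is not specified in the text, so the following is only a sketch under one stated assumption: each binary feature is scored against the binary outcome with a Pearson chi-square test (1 degree of freedom, no continuity correction), and features with P < 0.05 are kept. The names `chi2_p_value` and `select_features` and the toy data are hypothetical.

```python
import math

def chi2_p_value(feature, outcome):
    """Pearson chi-square statistic and P value for two binary (0/1) variables."""
    pairs = list(zip(feature, outcome))
    a = sum(1 for f, y in pairs if f == 1 and y == 1)
    b = sum(1 for f, y in pairs if f == 1 and y == 0)
    c = sum(1 for f, y in pairs if f == 0 and y == 1)
    d = sum(1 for f, y in pairs if f == 0 and y == 0)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # a constant column carries no information
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))

def select_features(columns, outcome, alpha=0.05):
    """Keep the names of features whose univariate P value is below alpha."""
    return [name for name, values in columns.items()
            if chi2_p_value(values, outcome)[1] < alpha]

# Hypothetical toy data: 10 patients with PSE (1) and 10 without (0).
outcome = [1] * 10 + [0] * 10
columns = {
    "early_seizure": [1] * 9 + [0] + [1] + [0] * 9,  # strongly associated
    "hypothetical_noise": [1, 0] * 10,               # unrelated to outcome
}
selected = select_features(columns, outcome)
```

Continuous variables would need a different scoring function (for example an F-test), and small cell counts would call for an exact test; neither is covered by this sketch.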

Class imbalance
In our data, the number of patients who developed PSE was far smaller than the number who did not, so the class variable had an imbalanced distribution. This class imbalance cannot be ignored, because machine learning methods would tend to classify all subjects into the majority class. In this study, we used a common strategy in machine learning, the Synthetic Minority Oversampling Technique (SMOTE), to handle the class imbalance [17]. Simply put, new samples were synthesized from the characteristics of existing minority-class subjects and added to the minority class of the data set.
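The core idea of SMOTE can be sketched in a few lines: each synthetic sample is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is a simplified illustration, not the implementation used in the study; the function name `smote`, the parameters `k` and `seed`, and the toy feature vectors are assumptions.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours.
    Assumes `minority` is a list of at least two numeric feature lists."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class feature vectors (two features per patient).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]]
new_points = smote(minority, n_new=6)
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the range already covered by the minority class.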

Model building
The samples were randomly divided into a training set (70%) and a testing set (30%). RF, SVM and LR were used to learn the data and build the prediction models. In RF, decision trees are built on randomly selected subsets of features to best separate the data into the expected outputs; the resulting forest of decision trees is then combined to form the final output [18]. SVM maps the data into a high-dimensional space through a kernel function in order to maximally separate clusters of data that are not separable in the low-dimensional space [19].
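The random 70/30 division can be sketched as follows. This is a minimal illustration with the standard library only; the function name `train_test_split`, its parameters and the toy patient identifiers are hypothetical, and the sketch does not cover model fitting.

```python
import random

def train_test_split(samples, labels, test_fraction=0.3, seed=0):
    """Randomly split paired samples and labels into training and testing sets."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx, seq):
        return [seq[i] for i in idx]

    return (pick(train_idx, samples), pick(test_idx, samples),
            pick(train_idx, labels), pick(test_idx, labels))

# Hypothetical data: 10 patient identifiers, 2 of whom developed PSE (label 1).
X = list(range(10))
y = [0] * 8 + [1] * 2
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

Fixing the seed makes the split reproducible, so the same patients fall into the testing set each time the analysis is rerun.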

Assessment of the performance of predictive models
The final step was to assess the performance of the predictive models. The accuracy of the different models was evaluated with the receiver operating characteristic (ROC) curve. The closer the area under the ROC curve (AUC) is to 1, the more accurate the model. Other indicators including sensitivity, specificity, positive predictive value and negative predictive value were also measured for the models. Clinical features and examination results are summarized in Table 1.
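The performance indicators named above can be computed from true labels and model outputs as in the sketch below. The AUC is calculated through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function names, the 0.5 decision threshold and the toy scores are assumptions for illustration only.

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and NPV for binary labels (1 = PSE).
    Assumes both classes occur among the true and the predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model scores for 3 patients with PSE and 5 without.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = confusion_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

In this toy example one positive case scores below one negative case, so 14 of the 15 positive/negative pairs are ranked correctly and the AUC is 14/15.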

Features related to PSE
A total of 35 variables were found to be related to PSE after cerebral infarction (Fig. 2A). The top five were creatine kinase, hospitalization days, lactate dehydrogenase, early onset seizures and antiepileptic drugs used in the acute phase of stroke. A total of 19 variables were found to be related to PSE after cerebral hemorrhage (Fig. 2B), and the top five were hospitalization days, urine leukocytes, frontal cerebral hemorrhage, alanine aminotransferase and early onset seizures.

Performance of different algorithms in predicting PSE
To assess the performance of the different algorithms in predicting PSE, sensitivity, specificity, positive predictive value, negative predictive value and AUC were calculated (Table 2). In the training set, SVM and RF performed better in predicting PSE after cerebral infarction, with AUC close to 1. Their sensitivity, specificity, positive predictive value and negative predictive value were also high. The performance of LR was somewhat poorer, with an AUC of 0.92.
In the testing set, SVM and RF also performed well, with AUC close to 1, and the AUC of LR was 0.92 (Fig. 3A).
As in cerebral infarction, SVM and RF also performed well in predicting PSE after cerebral hemorrhage. In the training set, SVM achieved the highest AUC, which was close to 1, followed by RF with an AUC of 0.99. The AUC of LR (0.85) was lower. In the testing set, RF (AUC = 0.98) and SVM (AUC = 0.97) again performed better than LR (AUC = 0.86) (Fig. 3B). The sensitivity and specificity of RF and SVM were also higher than those of LR (Table 2).

Discussion
The results of this study showed that the risk of PSE within one year was close to 3% after cerebral infarction and 8% after cerebral hemorrhage, which was similar to previous studies [6,14,20]. The first year is the peak time for seizure relapse after stroke, and we therefore adopted this temporal threshold in this study [2]. Our model is designed to be used at the time of discharge, and it showed that it is feasible to stratify patients at high risk of developing PSE using machine learning methods based on clinical information and examination findings. Considering the differences in disease characteristics between cerebral infarction and cerebral hemorrhage, we analyzed them separately.
Hitherto, a few PSE prediction tools have been developed using traditional analysis, but their clinical application has been limited by complexity of operation, low sensitivity and poor specificity [2,6,14,21]. In this study, we used intelligent analyses to predict PSE for the first time, and SVM and RF performed best, with high sensitivity and specificity and AUC close to 1. SVM is now widely used in classification, since the kernel used in the SVM model is a shortcut that accelerates the learning process and greatly improves the accuracy of the model [19]. RF is also widely used because it is easy to calculate, has high accuracy, can process large data sets and does not need to reduce the dimension of high-dimensional data sets [18]. However, SVM and RF can be heavily influenced by class imbalance [22]. Class imbalance was handled using SMOTE in this study, which may reduce the differences among the synthesized samples and result in better performance of SVM and RF. Another shortcoming of these two algorithms is that the calculation process is a "black box", which cannot be easily explained to or understood by clinicians. Nevertheless, the value of SVM and RF in predicting PSE is evident from their superior performance in terms of sensitivity, specificity, positive predictive value, negative predictive value and AUC. Larger longitudinal studies are needed to test the application of these models in clinical practice.
Importantly, the results of this study showed that neither endovascular thrombectomy nor thrombolysis with recombinant tissue plasminogen activator increased the risk of PSE; whether reperfusion treatment increases the risk of PSE in cerebral infarction has long been a focus of clinicians' attention [2,23,24]. Similar to previous studies, we found that early onset seizure, symptom severity (assessed by NIHSS score at admission, length of stay in hospital, massive cerebral infarction and haemorrhagic transformation) and cortical involvement were related to PSE [2,6,13,14]. Moreover, many laboratory findings such as urine leukocytes, uric acid, alanine aminotransferase, creatine kinase and lactate dehydrogenase were also found to be associated with PSE. Uric acid is now believed to have inflammatory effects, and previous clinical and basic research has confirmed that both increased and decreased uric acid levels can lead to an increased incidence of PSE [25,26]. The association between the other laboratory findings and PSE needs further research.
There are some limitations in this study. First, only a relatively small minority of patients developed PSE, which resulted in significant class imbalance. We used SMOTE, a method widely used for this issue, to handle the imbalance, which may make the predictive models appear to perform better than they actually do. Second, due to the limited amount of data, we only constructed models to predict PSE within one year after stroke, which may limit their clinical use. Finally, although the predictive models all showed good performance, as in previous studies we can only discuss the feasibility and accuracy of intelligent analyses in predicting PSE. The use of such models in clinical practice still has a long way to go.

Conclusion
This study demonstrated that the risk of PSE within one year is about 2.8% after cerebral infarction and 7.8% after cerebral hemorrhage. Many new risk factors were found to be related to PSE, and based on these variables we successfully constructed predictive models. RF and SVM showed better performance than LR in predicting PSE, both in patients with cerebral infarction and in those with cerebral hemorrhage.