Framework to select the top ML classifier for robust seizure detection and prediction: A comparison-based study using multiple preictal time and feature sets



Introduction
Epilepsy is one of the most common neurological disorders, affecting an estimated 50 million people worldwide. 1 Uncertainty about when a seizure will strike is the most debilitating aspect of living with epilepsy. 2 Seizure frequency, 3 seizure severity, stigma, fear, and comorbidities such as cognitive or psychiatric disorders adversely affect the quality of life of persons with epilepsy (PWE) 4 and their families. 5 From a clinician's perspective, proper detection and prediction of seizures would enhance the efficacy of treatment, help optimize drug dosing and minimize drug side effects. 6 The need to enhance the quality of life of PWE and to reduce avoidable deaths 7 has led to growing interest in predictive, ambulatory, wearable and machine learning (ML)/artificial intelligence (AI)-based epilepsy management technologies.
Various ML and data mining algorithms that perform regression, clustering, visualization and classification have been described in the context of seizures. Detecting seizures is mostly a classification problem, as the goal is either to distinguish the 'ictal' class (group) from the preictal and non-seizure classes, or to identify the time at which a segment showed ictal properties.
• The time preceding a seizure is referred to as the preictal state. For some seizures, the epileptogenic transient changes develop very close to the onset, while for others they appear several minutes to hours prior to onset. 35 In ML-based seizure studies, an arbitrary, fixed period of time before seizure onset is denoted as the preictal period; reported values range from 2 minutes to 240 minutes. 36 37 38 39 40 41 42 43 44 45 The preictal class annotation on seizure datasets should ideally be guided by statistics rather than chosen arbitrarily, as this class holds important evolutionary clues about an incoming seizure.
• The reporting of classifier performance is often confusing. Several seizure models have reported their results in terms of accuracy, which is considered an unreliable measure, especially on unbalanced datasets. In most epilepsy and healthcare-related datasets, the adverse event belongs to a minority class, while the non-event data forms the majority class. In such scenarios, accuracy can remain high even when the model misclassifies most minority-class events. 46 Only a handful of seizure-related ML studies have reported their findings using stronger evaluation metrics suited to imbalanced datasets.
Our work brings together classifier and preictal time selection with phase-wise performance metrics to show how the proper interaction of these elements can lead to consistently reliable classifier performance.

Contribution
The main contributions of our paper are:
• Providing a step-by-step process to identify the most appropriate classifiers, preictal time and feature sets for the study of seizures, using a methodology that could be applied across domains and devices.
• Showing extensive use of statistical methods to select and eliminate classifiers in a phase-wise manner across the research timeline.
• Showing that it is possible to achieve not only high subject-specific scores but also consistently high overall seizure detection scores using an optimal combination of classifiers, preictal time and feature sets.
• Providing on-field evidence of results from multiple classifiers and ensembles, with the ability to predict seizures up to 176 minutes prior to EEG onset.
To the best of our knowledge, we are the first to report F1 performance of this level on seizure data and to predict seizures up to 176 minutes in advance.

Related work
A recent review of seizure detection with machine learning using EEG data 47 shows classifier performance ranging from 56.23% to 86% in terms of the F measure, and up to 100% in accuracy, sensitivity and specificity. 48 49 50 51 52 53 54 Seizure prediction windows have been reported from 1 second 55 up to 2 hours. 56 Wearable devices using Accelerometry (ACM), 60 61 62 Electrocardiography (EKG), 63 64 Surface Electromyography (sEMG), 65 66 Electrodermal Activity (EDA), 67 68 skin temperature 69 and respiration analysis 70 have been used to detect seizure activity. Publications on seizure monitoring with wearable devices have reported high accuracy but limited precision and F1 scores in seizure detection. 71 72 73 74 Kurada et al. showed that some models had a sensitivity of 94%-100% in detecting seizures, but with precision of 0.02%-35% and F1 scores of 0.03%-51%. 75 The low performance of ML/AI algorithms in seizure prediction and detection using wearables depends on various factors, including data acquisition techniques, usability of the equipment, and the quality of data used for experiments. 76

Framework for seizure detection and prediction
The larger context of our exploratory analysis was to identify factors that impact the performance of ML algorithms in the context of wearables in seizure research. The specific objective of the current study was to determine which combination of classifier, preictal time and feature set provided the best seizure detection performance, at par with the EEG-identified onset time. Additionally, we investigated the impact of predetermined preictal windows on classifier performance, to understand which preictal window gave the best results.
We chose ML-based algorithms over deep learning (DL) for our work, as DL models are often difficult to interpret. Interpretability, consistency and scalability were the main criteria for our classifier selection.

Subject enrolment
We enrolled subjects admitted to the video-EEG monitoring unit at the National Institute of Mental Health and Neurosciences (NIMHANS), Bangalore, India, after receiving approval from the Institutional Ethics Committee. The study was conducted between 2017 and 2020 on subjects aged 15-65 years. All subjects signed an informed consent form prior to being enrolled. We recorded data from 106 subjects with 202 seizures in total. Of the 202, 53 were pseudo seizures, 34 had artefacts and distortions, and 60 were in close proximity to another seizure (within 2 hours of each other). We considered the remaining 55 seizures for our work: 43 were used for model building and 12 for unseen, unbiased testing of the trained models.

Data acquisition
We focussed on the activities of the Autonomic Nervous System (ANS), as it provides reliable clues about seizures. 77 78 79 80 81 82 We used a multimodal-sensing methodology focusing on Blood Volume Pulse (BVP) and Electrodermal Activity (EDA) to track the activities of the ANS. We extracted data using devices from BioSignal Plux and Shimmer Sensing, along with a wearable prototype designed to our proprietary specifications. Data was sampled at 100 Hz for both EDA and BVP, which were extracted from the distal and proximal phalanges of the non-dominant hand. The subjects were also monitored simultaneously with video EEG (vEEG).

Data pre-processing
We analyzed continuous, non-overlapping 10-millisecond segments of data. During preprocessing, data was converted from raw form to its respective units; filtering and smoothing were applied, along with timestamp insertion and artefact detection and removal.
Both EDA and BVP data were filtered using a low-pass, 2nd-order Butterworth filter.
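A minimal sketch of this filtering step with scipy, interpreting the filter as a low-pass design; the cutoff frequency is a hypothetical placeholder, as the paper does not report it:

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 100  # sampling rate in Hz, as stated under Data acquisition

def lowpass_butterworth(signal, cutoff_hz=1.0, fs=FS, order=2):
    """Zero-phase 2nd-order Butterworth low-pass filter.

    cutoff_hz is illustrative; the paper does not state the exact cutoff.
    """
    nyquist = fs / 2.0
    b, a = butter(order, cutoff_hz / nyquist, btype="low")
    # filtfilt runs the filter forward and backward to avoid phase shift
    return filtfilt(b, a, signal)

# Example: smooth a noisy trace standing in for a real EDA segment
eda_raw = np.random.randn(1000)
eda_filtered = lowpass_butterworth(eda_raw)
```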

Feature engineering
We extracted various features for our experiments, but used only the relevant ones based on Recursive Feature Elimination (RFE) and Principal Component Analysis (PCA). In the final form, we extracted 3 sets of features from the BVP and EDA signals for our study: Set A consisted of 39 basic features described in the literature, 83 84 Set B of 66 features, and Set C of 116 features.
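As an illustration of the RFE step, a scikit-learn sketch that ranks features with a tree-based estimator and keeps 66 of them, mirroring the size of Set B (the estimator, step size and placeholder data are assumptions, not the study's exact pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Placeholder data standing in for the extracted BVP/EDA feature matrix
X, y = make_classification(n_samples=500, n_features=116, n_informative=20,
                           n_classes=3, random_state=42)

# Recursively drop the weakest features until 66 remain
selector = RFE(estimator=RandomForestClassifier(n_estimators=100,
                                                random_state=42),
               n_features_to_select=66, step=5)
selector.fit(X, y)

X_reduced = selector.transform(X)  # reduced feature matrix
kept_mask = selector.support_      # boolean mask of retained features
```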

Dataset creation with preictal periods
We used 18 datasets, each made up of a combination of one of the 3 feature sets and one of the 6 preictal times.
During label annotation, the period prior to the recorded electrographic onset of a seizure was labelled 'Preictal'. The data prior to the designated preictal period was labelled 'Non Seizure', while the seizure onset time and the 2 minutes after it were labelled 'Ictal'. Each dataset was trained and tested using the 11 classifiers and the 13 variations, providing unique performance scores across the 3 phases of the study.
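A minimal sketch of this annotation step with pandas, assuming a per-segment feature table with a 'timestamp' column (column and function names are hypothetical):

```python
import pandas as pd

def label_segments(df, onset, preictal_minutes):
    """Label each segment relative to the EEG-marked seizure onset.

    'onset' is the neurologist-validated electrographic onset time;
    column names are illustrative.
    """
    preictal_start = onset - pd.Timedelta(minutes=preictal_minutes)
    ictal_end = onset + pd.Timedelta(minutes=2)  # onset + 2 min = Ictal

    labels = pd.Series("Non Seizure", index=df.index)
    labels[(df["timestamp"] >= preictal_start) & (df["timestamp"] < onset)] = "Preictal"
    labels[(df["timestamp"] >= onset) & (df["timestamp"] <= ictal_end)] = "Ictal"
    return df.assign(label=labels)

# Example: build the 30-minute-preictal variant of one recording
# labelled = label_segments(features_df,
#                           onset=pd.Timestamp("2019-03-01 14:10:00"),
#                           preictal_minutes=30)
```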

Experimental phases
To arrive at the best combination of classifiers, preictal time and feature set, we performed our work in 3 phases:
• In phase 1, we experimented with 11 different classifiers, using a randomly selected feature set, to pick the top 3 classifiers. We compared their performance across all 6 preictal time periods. The goal was to rank classifiers based on the consistency of their performance across preictal windows and to reduce the computational load of further work by dismissing poor performers at the outset.
• In phase 2, we used only the top 3 classifiers and their 13 variants to evaluate performance over 3 different feature sets and 6 preictal time periods. This phase enabled us to confirm the best classifier variants and to find the most suitable feature set, preictal time and combinations thereof.
• In phase 3, we used the findings from phase 2 to create multiple ensemble models, and tested their performance on the best-suited dataset as well as on the data of 12 individual subjects, unseen by the models.

Phase 1 -Identifying top classifiers for further exploration
The best known ML classification algorithms use probabilistic methods, rule-based learners, linear models, neural networks, support vector machines, decision trees and instance-based learners. 86 As a first step, we selected 11 such popular classifiers and ran them on 6 datasets, each made up by combining Set B with a specific preictal time, resulting in 66 model executions.

Performance evaluation of phase 1
For all 3 phases, we measured the performance of the models using 2 rules:
• a classifier's seizure detection performance was measured by its ability to detect the ictal class at par with the neurologist-validated EEG onset time of the seizure, and
• predictive performance was measured through a rule requiring that all predicted ictal segments be preceded by the preictal class for a minimum of 5 minutes.
Though the preictal period can be as short as a minute or even a few seconds in real scenarios, we formulated the rule to make the prediction mechanism stricter and thereby reduce false alarms. A sketch of this rule appears below.
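A simplified sketch of the prediction rule, assuming one predicted label per 10 ms segment, in line with the pre-processing description (function and label names are illustrative):

```python
def prediction_rule_met(pred_labels, segment_ms=10, min_preictal_min=5):
    """Check the prediction rule for the first predicted 'Ictal' segment:
    it must be preceded by an unbroken run of 'Preictal' predictions
    lasting at least 5 minutes."""
    needed = min_preictal_min * 60 * 1000 // segment_ms  # segments in 5 min
    for i, label in enumerate(pred_labels):
        if label == "Ictal":
            run = pred_labels[max(0, i - needed):i]
            return len(run) == needed and all(p == "Preictal" for p in run)
    return False  # no ictal prediction at all
```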
For performance evaluation, we split the datasets into training and validation sets in a ratio of 80:20. We used 5-fold cross validation, where the training data was randomly divided into 5 equal subsets. In every iteration, a single subset was used as the validation set while the remaining 4 were used for training, with the process repeating 5 times until each subset had served as the validation set. Compared with Leave-One-Out Cross-Validation (LOOCV), where the model is fitted once per observation on all but that observation, 5-fold cross validation is far less computationally expensive and its validation estimates have lower variance, since each fold holds out a substantial portion of the data rather than a single observation.
We used optimization techniques like Grid Search to find the best combination of hyperparameters for each classifier. Post tuning, we repeated the 5-fold cross validation with the recommended hyperparameters and obtained multiple performance metrics for the training and validation sets.
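A minimal sketch of this tuning loop with scikit-learn and LightGBM; the parameter grid and placeholder data are illustrative, not the settings tuned in the study:

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Placeholder 3-class data mimicking the Non Seizure/Preictal/Ictal imbalance
X, y = make_classification(n_samples=2000, n_features=66, n_informative=20,
                           n_classes=3, weights=[0.70, 0.27, 0.03],
                           random_state=0)

# 80:20 split, then grid search with 5-fold CV inside the training portion
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

param_grid = {  # illustrative grid, not the one used in the study
    "num_leaves": [31, 63],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 300],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(LGBMClassifier(random_state=0), param_grid,
                      scoring="f1_macro", cv=cv, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```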

Statistical evaluation of phase 1
We performed statistical evaluation of the results in each phase, using the F1 score as our main performance metric, though we also calculated other statistics. We used the Matthews Correlation Coefficient (MCC) to evaluate better-than-chance performance, as MCC is considered a more reliable metric than even the F measure. 87
• Youden's Index: Youden's index γ evaluates an algorithm's ability to avoid failure. It equally weights the algorithm's performance on positive and negative samples: γ = sensitivity − (1 − specificity).
• Likelihood: The relation between the likelihoods of two algorithms A and B determines which algorithm is preferable and in which situation. The formulas used are ρ+ = sensitivity / (1 − specificity) and ρ− = (1 − sensitivity) / specificity.
• Discriminant Power (DP): DP evaluates how well an algorithm distinguishes between positive and negative examples: DP = (√3/π)(log X + log Y), where X = sensitivity/(1 − sensitivity) and Y = specificity/(1 − specificity).
• Area Under Curve (AUC): The AUC here corresponds to the arithmetic mean of the sensitivity and specificity of each class: AUC = (TNR + TPR)/2.
• False Positive Rate (FPR): The conditional probability of a positive test result given that the event was not present: FPR = 1 − TNR.
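For reference, the formulas above translate directly into code; a minimal sketch:

```python
import math

def youden_index(sens, spec):
    # gamma = sensitivity - (1 - specificity)
    return sens - (1 - spec)

def likelihood_ratios(sens, spec):
    # rho+ = sens / (1 - spec); rho- = (1 - sens) / spec
    pos = sens / (1 - spec) if spec < 1 else float("inf")
    neg = (1 - sens) / spec
    return pos, neg

def discriminant_power(sens, spec):
    # DP = sqrt(3)/pi * (log X + log Y),
    # X = sens/(1 - sens), Y = spec/(1 - spec)
    x = sens / (1 - sens)
    y = spec / (1 - spec)
    return math.sqrt(3) / math.pi * (math.log(x) + math.log(y))

def balanced_auc(sens, spec):
    # arithmetic mean of TPR and TNR
    return (sens + spec) / 2

# Example: a classifier with 0.95 sensitivity and 0.90 specificity
print(youden_index(0.95, 0.90))        # 0.85
print(likelihood_ratios(0.95, 0.90))   # (9.5, ~0.056)
print(discriminant_power(0.95, 0.90))  # ~2.83
```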

Performance scores in phase 1
The following table shows the individual F1 scores of each of the 11 classifiers on the 6 unique datasets. The highest F1 score, 97.6%, came from LGB on the 30-minute dataset. The table below shows the range and mean scores of each of the 11 classifiers, averaged across the datasets. As can be seen, LGB, RF and XGB scored consistently better than the rest on the 6 datasets. We performed statistical analysis on the F1 scores reported above using the benchmarks for multiple machine learning comparisons by Demšar. 88 Based on the normality and homogeneity tests, we selected Friedman's test for determining the confidence intervals of the central tendency. After rejecting the null hypothesis of Friedman's test (p = 0.05), we inferred that differences between populations are significant if the difference of the mean rank is greater than the critical distance (CD) = 4.904 of the Nemenyi test. The table below shows the mean rank, the median as central tendency, the median absolute deviation (MAD) from the median as a measure of variance, the confidence interval (CI) of the median, the effect size in comparison to the highest-ranking approach, and the magnitude.
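The following sketch shows how such an analysis can be run with scipy; the scores are random placeholders, and the q value for the Nemenyi critical distance must be taken from published tables (the value below is an illustrative entry for k = 11 classifiers at alpha = 0.05):

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# F1 scores: rows = the 6 preictal-time datasets, cols = the 11 classifiers
# (random placeholders here, not the study's actual scores)
rng = np.random.default_rng(0)
scores = rng.random((6, 11))

stat, p = friedmanchisquare(*[scores[:, j] for j in range(scores.shape[1])])
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")

# Mean rank per classifier; rank 1 = best F1 on a given dataset
ranks = rankdata(-scores, axis=1)
mean_ranks = ranks.mean(axis=0)

def nemenyi_cd(k, n, q_alpha):
    """Nemenyi critical distance; q_alpha comes from published tables
    and depends on the number of classifiers k and the chosen alpha."""
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))

# Two classifiers differ significantly if their mean-rank gap exceeds CD
cd = nemenyi_cd(k=11, n=6, q_alpha=3.219)  # q value illustrative
```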

Phase 2 -Top classifier, feature set, preictal time and combination selection
In the next phase, we further evaluated the performance of RF, XGB and LGB using 18 datasets made up of a combination of 3 sets of features and 6 preictal time periods. We used a total of 13 variations of the selected classifiers.
The decision to use different variations of the classifiers was made to address class imbalance in our datasets. Depending on the feature set and the preictal window chosen, the class distribution varied across datasets, with the Ictal class ranging from 1%-4%, the Preictal class from 9%-44% and the Non-Seizure class from 53%-86%. When there is class imbalance, algorithms become biased towards predicting the majority class, as they do not have enough data to learn the patterns of the minority class. Hence, it is essential to examine the performance of the models on both balanced and unbalanced versions of the datasets.
The variations we used are as follows (a code sketch of these balancing strategies appears after this list):
• Synthetic Minority Oversampling Technique (SMOTE) 89 blends under-sampling of the majority class with a special form of over-sampling of the minority class. The minority class is over-sampled by creating 'synthetic' examples, operating in the 'feature space' rather than the 'data space'.
• 'balanced': During hyperparameter tuning we specify class_weight = 'balanced'. This mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data, as n_samples / (n_classes * np.bincount(y)).
• 'balanced subsample': During hyperparameter tuning we specify class_weight = 'balanced_subsample'. This mode is the same as 'balanced' except that weights are computed based on the bootstrap sample for every tree grown.
• Class weights: By assigning higher weights to the minority class, we can influence the training outcome by heavily penalizing misclassifications of the minority class, forcing the algorithms to be less biased towards the majority class. We experimented with weights between 4-30% for the Ictal class and determined optimal weights using a Python script.
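A minimal sketch of these variants using imbalanced-learn, scikit-learn and LightGBM; the placeholder data and the explicit weight values are assumptions, not the tuned settings from the study:

```python
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalance roughly mirroring the reported class distribution
X, y = make_classification(n_samples=5000, n_features=66, n_informative=20,
                           n_classes=3, weights=[0.75, 0.22, 0.03],
                           random_state=42)

# Variant 1: SMOTE -- synthesize minority-class samples in feature space
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Variant 2: 'balanced' / 'balanced_subsample' class weights
rf_balanced = RandomForestClassifier(class_weight="balanced_subsample",
                                     random_state=42).fit(X, y)

# Variant 3: explicit weights penalizing Ictal (class 2) errors more heavily
# (weight values are illustrative)
lgb_weighted = LGBMClassifier(class_weight={0: 1, 1: 2, 2: 10},
                              random_state=42).fit(X, y)
```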

Performance evaluation of phase 2
For performance evaluation, the datasets were again split into an 80:20 ratio of training and validation sets. We performed 5-fold cross validation, tuned hyperparameters and repeated the 5-fold cross validation with the tuned values to find the best scores. The table below shows the F1 scores of each of the different variants of the classifiers. It can be seen that although RF (weighted), with 98.1% F1, had a higher individual score, it came second when the ranges and mean scores of the various performance metrics are considered. Further analysis needs to be carried out to determine the top-ranked classifier, feature set and preictal time based on the data in Table 6.

Top classifiers for seizure data
Evaluation on the entire range of performance metrics revealed 3 variations of LGB (balanced, weighted and SMOTE) to be the top 3 mean-ranked classifiers. After rejecting the null hypothesis of Friedman's test, we found that differences between populations are significant if the difference of the mean rank is greater than the critical distance (CD) = 0.5524 of the Nemenyi test. From Table 8, it can be seen that LGB (balanced) has the best mean rank, and hence can be considered better than the rest of the classifier variants when the 5 performance metrics (F1, Precision, Recall, Accuracy and MCC) are considered. It must be noted that in phase 2 we determined RF_W to be the best individual scorer and XGB_W the best scorer in terms of range of performance, while LGB_B was the best mean-ranked performer. This would not have been known had the additional step-wise calculations not been performed.

Best feature set selection
We analyzed the performance metrics of phase 2 to derive the best performing feature set. The statistical analysis on the sets was done as follows: after rejecting the null hypothesis of Friedman's test, we found that differences between populations are significant if the difference of the mean rank is greater than the critical distance (CD) = 0.5524 of the Nemenyi test. Of the 3 feature sets, Set B (with 66 features) has the best mean rank, and hence can be considered better than the rest. Set C with 116 features and Set A with 39 features came second and third, respectively. This shows that it is essential to experiment with different feature sets and understand the contribution of each feature to the results before deciding on the ultimate set of features to use.

Most suitable preictal time
In the statistical evaluation of the phase 2 performance scores, the 30-minute preictal window emerged as a better performer than the other 5 periods. When the 30-minute preictal period was chosen, all the classifiers in both phase 1 and phase 2 gave their best scores. The scores of all the classifiers increased steadily from the 5th to the 30th minute, but dropped drastically at the 45th and 60th minutes. Our data shows that preictal windows in the range of 10 to 30 minutes are a better choice for researchers than windows above 30 minutes. Our findings are also in line with Teixeira et al., who compared 4 different preictal times (10, 20, 30 and 40 minutes) and reported that 30.47 min was the most suitable for a patient-specific best predictor. Bandarabadi et al. tested the same 4 preictal times, reporting the preferred average as 33.7 min. 90

Best combination of classifier, set and preictal time
In the three sections above, we derived the top-ranked classifier variants, feature set and preictal period for seizure experiments. Our next goal was to determine the top combination of classifier variant, feature set and preictal time using statistical methods. The statistical analysis was conducted for 13 populations with 5 paired samples, with a family-wise significance level of alpha = 0.050. We failed to reject the null hypothesis that the populations are normal. We applied Bartlett's test for homogeneity and failed to reject the null hypothesis that the data is homoscedastic. As we have more than two populations, all normal and homoscedastic, we used repeated measures ANOVA as the omnibus test to determine whether there are significant differences between the population means, and the post-hoc Tukey HSD test to infer which differences are significant. Since our analysis showed 30 minutes as the optimal preictal period and Set B as the recommended feature set, we had to identify which classifier variant gave the best performance with that set and time. The results show that when grouped specifically with the top-ranked feature set and time, RF (weighted) is actually the best performing classifier variant, beating the LGB variants that emerged as the top 3 individual classifiers in Table 8. This analysis shows that we are very likely to get better results if we use the ranked classifiers or build ensembles out of them. A sketch of this test sequence follows.
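A sketch of this test sequence with statsmodels, under the stated setup of 13 variants and 5 paired samples; the scores are random placeholders:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Long-format scores: 13 classifier variants x 5 paired samples
# (values are placeholders, not the study's scores)
rng = np.random.default_rng(1)
variants = [f"clf_{i}" for i in range(13)]
df = pd.DataFrame({
    "variant": np.repeat(variants, 5),
    "sample": np.tile(range(5), 13),
    "f1": rng.random(65),
})

# Omnibus test: repeated measures ANOVA
# (appropriate here because the populations are normal and homoscedastic)
res = AnovaRM(df, depvar="f1", subject="sample", within=["variant"]).fit()
print(res.anova_table)

# Post-hoc: Tukey HSD to see which pairwise differences are significant
print(pairwise_tukeyhsd(df["f1"], df["variant"], alpha=0.05))
```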

Phase 3 -Ensembling
As a next step, we created ensemble models from the top-ranked combinations shown in Table 14. The goal was to evaluate whether ensembling would give even higher scores than the top individual classifiers. Creating ensembles is not mandatory; one is free to work directly with individual high-performing classifiers.
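As an illustration of this step, a soft-voting ensemble can be assembled with scikit-learn; the member variants, their weights and the placeholder data below are assumptions for the sketch, not the exact ensembles built in the study:

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier

# Placeholder 3-class data mimicking the seizure class imbalance
X, y = make_classification(n_samples=2000, n_features=66, n_informative=20,
                           n_classes=3, weights=[0.75, 0.22, 0.03],
                           random_state=42)

# Soft voting averages the members' predicted class probabilities
ensemble = VotingClassifier(
    estimators=[
        ("rf_w", RandomForestClassifier(class_weight={0: 1, 1: 2, 2: 10},
                                        random_state=42)),
        ("lgb_b", LGBMClassifier(class_weight="balanced", random_state=42)),
        ("xgb", XGBClassifier(random_state=42)),
    ],
    voting="soft",
)
ensemble.fit(X, y)
```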

Statistical evaluation of phase 3
Using a Python script, we generated multiple combinations of ensembles from the top 4 classifier variants and combined their results with those of the top 4 classifier variants themselves. As an additional check of classifier performance, 91 we calculated Youden's index, the positive and negative likelihood ratios and the discriminant power to get better insight into the nature of the various classifiers. As per Youden's index, except for RF_B (balanced), all the ensembles and independent classifier variants have a good ability to avoid failures on the Ictal class. RF_B has some chance of failure in identifying the Ictal class, but higher values of γ in the Preictal and Non Seizure classes indicate comparatively better abilities in handling those classes.
A higher positive likelihood ratio (ρ+) indicates a better probability of identifying a particular condition. As per ρ+, all classifiers and ensembles have 'infinitely' superior ability to locate the positives in the Ictal class. RF_B has the best ability to locate positives in the Non Seizure condition, while E1 is better at identifying the Preictal condition. In terms of ρ−, all ensembles and classifier variants except RF_B have a good ability to identify the negatives in the Ictal class. RF_B is also the best performer when it comes to locating negatives in the Non Seizure and Preictal classes. RF_B likewise shows a better ability to discriminate among the 3 classes based on its higher Discriminant Power values, which is actually in disagreement with the findings of Youden's index regarding the Ictal class. Thus, we expected good field performance from all the models except RF_B, which might throw some surprises.

Testing on unseen data
The statistical analysis in the previous section gave us a hint of how the different classifiers might behave in real-world testing. To qualify for use as a clinical decision-making tool, all results must be tested on real-world data, or data previously unseen by the models. Hence, to test the soundness of the findings, we tested the performance of the classifier variants and ensembles on the data of 12 actual patients. Table 18 shows the findings. E1, with 98.3% F1, was the top performer as per the phase 3 analysis; in real-world testing, however, it correctly detected only 9 out of 12 seizures and predicted only 7, with an average prediction time of 35.4 minutes. E2, with 98.1% F1, detected 11 out of 12 seizures. RF_W had an identical performance score to E2, but detected only 9 out of 12 seizures and predicted 8, though it had the best average prediction time, 50.7 minutes, among all the models. The best performer on real-world data, however, was LGB_W, which detected all the seizures and predicted 10 of them. Interestingly, LGB_W was a comparatively mediocre scorer in phase 3, similar to XGB_W. The worst performer was RF_B, which failed to identify any of the seizures. The hint about RF_B's performance could be had from Youden's index and the negative likelihood ratio, which indicated that RF_B had a lesser ability to identify the Ictal class than the others. In terms of predictive ability, the highest individual prediction time, 176 minutes, was given by LGB_W, while the best overall prediction time across models was given by RF_W. The mean prediction time of all 6 models combined was 31.56 minutes.
This final step of testing is critical to understanding the real abilities of the different models in the field. Depending on whether the requirement is to detect a seizure or to predict it, a single model or a combination of several could be used for further study, validation and real-life implementation. Had the last two steps (phase 3 and real-world testing) not been carried out, we would not have known the true abilities of our models. Our methodology thus enables decision makers to choose a model that meets their clinical and research requirements, balancing detection and prediction needs.

Post experiment analysis
Through our work, we are able to show how realistic, predictive and high-performing seizure models can be created by following a phase-wise approach. We were guided by statistical insights at every stage of our work, leading us to understand the inner workings of the various classifiers, algorithms and models we used.
We derived the following insights by analysing the results of our experiments.
• Phase-wise selection of classifiers based on robust statistical methods yields more reliable results on unseen data during testing. We were able to determine which classifiers were more likely to give better results depending on the nature of the data. We recommend that researchers experiment with different classifiers and their variations, eliminating redundant ones methodically and choosing the ones best qualified to represent their data consistently.
• Seizure models built using the 30 minutes prior to seizure onset as the preictal time provided the best results across the phase 1, 2 and 3 classifiers. As the fixed preictal time increased from 5 to 30 minutes, classifier performance increased, but dropped drastically at the 45th and 60th minutes. In fact, the classifiers made more errors in distinguishing the Ictal and Non Seizure classes in the 45th and 60th minute datasets, and the Preictal and Non Seizure classes between 5 and 15 minutes. At the 30th minute, the top classifiers identified the Ictal class perfectly and made comparatively fewer errors in distinguishing the Preictal and Non Seizure classes.

Figure 1: Classifier performance against preictal time
Bar plot showing classifier performance (F1 scores) against assigned preictal periods, with consistently low performance shown by classifiers at the 45th and 60th minutes.
• Simple features provided the best results. Features that tap into the frequency attributes of the EDA and BVP signals gave better results than complex features requiring more computational power. Though we calculated a large number of features (over 700 from Tsfresh alone) for each signal, most were shown to be redundant during the analysis of feature importance. While 66 features gave the best results for us, researchers can arrive at their own number based on feature performance.
• Tree-based classification models were preferable for our seizure data. Our work shows that, owing to the non-linear and complex relationships between dependent and independent variables in seizure data, tree-based models like RF, XGB and LGB outperformed other methods. Unlike others, these algorithms are non-parametric and can efficiently deal with large datasets without imposing a complicated parametric structure. 92 As shown through our multiple statistical analyses of classifiers, these 3 classifiers consistently performed better than the rest, irrespective of the preictal window used.

Future work
Our work shows promise in laying the foundation for a machine learning framework for seizure research with wearables. The validity of our findings needs to be tested on more heterogeneous seizure data and replicated rigorously in both clinical and field settings. Our future work will focus on refining our understanding of seizures using traditional equipment like EEG and ECG alongside wearables, creating dynamic methods to identify the preictal period using hybrid ML/AI methodologies, predicting seizures in real time, verifying the quality of data sampled at varying sampling rates, studying the quality of data collected from multiple sites, analysing the impact of various anti-epilepsy drugs, and extending our ML/AI methodologies to other aspects of seizure research, including Sudden Unexpected Death in Epilepsy (SUDEP).

Conclusion
Our work is an important step in bringing structure to ML-based seizure research. We show that a step-wise classifier selection mechanism based on the project's or researchers' end goals leads to realistic performance scores with better-than-chance results. The limitation of our study is that it was done on data collected in a controlled hospital environment with a limited number of subjects. More studies need to be conducted to prove the effectiveness of the methodologies described here. Automated seizure detection has the potential to change the lives of people suffering from seizures by enabling faster access to treatment and care. We hope our work will help others arrive at the right strategies for exploring seizures, in the interest of making the lives of people with epilepsy better.