The methodology followed in this work is shown in Fig. 3.
Table 1. Seizure types and descriptions

Seizure code | Name | Description | Pre-ictal samples | Pre-ictal NE* | Ictal samples | Ictal NE* | Inter-ictal samples | Inter-ictal NE* |
aABSZ | Absence Seizure | Absence discharges are seen on the EEG. Symptom: loss of consciousness. | 271989 | 18 | 96568 | 56 | 745239 | 46 |
aCPSZ | Complex Partial Seizure | Symptoms: blank staring and daydreaming, with loss of awareness of the surroundings. | 317195 | 28 | 920447 | 52 | 753664 | 46 |
bFNSZ | Focal Seizure | Starts in one part of the brain but can spread to other parts, i.e. become generalized. | 426127 | 28 | 324887 | 55 | 786432 | 48 |
bGNSZ | Generalized Non-Specific Seizure | Generalized seizures that cannot be further classified. Symptoms: unconsciousness with spasms, stiffening, etc. | 323400 | 22 | 447487 | 54 | 712084 | 44 |
aSPSZ | Simple Partial Seizure | Awareness is not affected; body muscles are affected. | 298618 | 18 | 541405 | 52 | 786432 | 48 |
aTNSZ | Tonic Seizure | Stiffening of muscles. | 353294 | 28 | 1089507 | 52 | 786432 | 48 |
aTCSZ | Tonic-Clonic Seizure | Stiffening followed by body jerking (grand mal). | 411635 | 28 | 240279 | 45 | 700249 | 44 |
*NE = number of events
Seizure Types And Descriptions
Epileptic seizures have four major types, namely febrile, generalized, partial, and unclassified, which are subdivided into 20 different classes. In this work, seven types of seizures have been chosen, each with different symptoms at the onset of a seizure attack. Two manifestations are used, electroclinical and electrographic: the prefix 'a' denotes electroclinical and 'b' denotes electrographic. The seizure types and their descriptions are given in Table 1.
Dataset
The dataset used in this work is taken from Temple University Hospital's EEG repository [42], specifically the TUH EEG Seizure Corpus (TUSZ), as shown in Fig. 4. The repository contains 16986 sessions from 10874 unique subjects. The number of EEG channels varies from 20 to 31, and the signals are of varying lengths (up to 300,000 columns). The 10–20 electrode configuration was used.
Preprocessing
The obtained EEG recordings were preprocessed before being fed into the machine learning models. Since each file had a varying number of channels (from 21 to 30), the first 18 channels of every file were taken and labelled with the type of seizure. The duration of the signal for the ictal region was taken directly from the corpus information shown in Fig. 1. For inter-ictal regions, the interval between the stop time of one event and the start time of the next chronological event of the same file was taken as the start and stop times. The duration of the signal before the first event of each file was taken as the pre-ictal duration.
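As a hedged illustration of this segmentation step, the sketch below assumes one recording is available as an 18-channel NumPy array together with its event annotations as (start, stop, seizure type) tuples; the function name, annotation format, and sampling rate are illustrative assumptions, not the exact tooling used in this work.

```python
import numpy as np

def split_recording(signal, events, fs=250.0):
    """Split one 18-channel recording into pre-ictal, ictal, and inter-ictal segments.

    signal : array of shape (18, n_samples), first 18 channels of the file
    events : list of (start_sec, stop_sec, seizure_type) tuples from the annotations
    fs     : sampling frequency in Hz (assumed value)
    """
    events = sorted(events)
    segments = []
    # pre-ictal: signal before the first event of the file
    first_start = int(events[0][0] * fs)
    segments.append(("pre-ictal", events[0][2], signal[:, :first_start]))
    for k, (start, stop, stype) in enumerate(events):
        # ictal: duration taken directly from the corpus annotation
        segments.append(("ictal", stype, signal[:, int(start * fs):int(stop * fs)]))
        # inter-ictal: from the stop of this event to the start of the next one
        if k + 1 < len(events):
            nxt_start = int(events[k + 1][0] * fs)
            segments.append(("inter-ictal", stype, signal[:, int(stop * fs):nxt_start]))
    return segments
```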
Feature extraction using Fast Fourier Transform (FFT)
For feature extraction, the FFT in Eq. (1) was applied to each signal and the power spectrum was analyzed. The frequencies with maximum power in each signal were extracted as features, and appropriate labels (the seizure type, for each of the ictal, pre-ictal, and inter-ictal segments) were added.
$$X\left(K\right)= \sum _{n=0}^{N-1}x\left(n\right){e}^{-j\frac{2\pi Kn}{N}}$$ (1)
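As an illustration of this step, the following is a minimal sketch assuming the labelled 18-channel segments are available as NumPy arrays; the sampling rate, window length, number of peaks kept, and the function name peak_frequency_features are illustrative assumptions rather than the exact pipeline used in this work.

```python
import numpy as np

def peak_frequency_features(segment, fs=250.0, n_peaks=3):
    """Return the frequencies with maximum spectral power for each channel.

    segment : array of shape (n_channels, n_samples), one labelled EEG window
    fs      : sampling frequency in Hz (assumed value)
    n_peaks : number of dominant frequencies kept per channel
    """
    n_samples = segment.shape[1]
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)   # frequency axis in Hz
    spectrum = np.fft.rfft(segment, axis=1)          # FFT of Eq. (1), real-valued input
    power = np.abs(spectrum) ** 2                    # power spectrum per channel
    top = np.argsort(power, axis=1)[:, -n_peaks:]    # strongest frequency bins per channel
    return freqs[top].ravel()                        # feature vector for this window

# Example: one 10-second, 18-channel window sampled at 250 Hz
features = peak_frequency_features(np.random.randn(18, 2500))
```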
The features extracted above were fed into multiple classification algorithms to find the models giving better results. Of these, the better-performing models can then be tuned to improve performance further. These models are explained below.
Machine Learning Algorithms
A. Bagged Trees
Bagging, or bootstrap aggregation, is a general procedure for reducing the variance of a statistical learning method. In the bagged tree (BT) method, B regression trees are constructed using B bootstrapped training sets and the resulting predictions are averaged. Because the trees are grown deep and are not pruned, each individual tree has high variance and low bias; averaging over the trees reduces this variance. The predicted value for an observation is the mode (classification) or mean (regression) of the trees' predictions. The difference between BT and Decision Trees (DT) is that BT may contain multiple trees, each having different features, terminal node counts, data, etc., whereas DT contains a single tree. Bagged Trees consist of the following steps; a short code sketch is given after the steps.
- Sample m data sets D1, D2, …, Dm from the training set D with replacement.
- For each Dj, train a classifier hj(x), where x represents the features.
- The final classifier is given by Eq. (2):

$$h\left(x\right) = \frac{1}{m} \sum _{j=1}^{m}{h}_{j}\left(x\right)$$ (2)
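As a hedged illustration of these steps, the sketch below uses scikit-learn's BaggingClassifier, whose default base learner is a decision tree; the ensemble size and other parameter values are illustrative, not those used in this work.

```python
from sklearn.ensemble import BaggingClassifier

# Bagged Trees: draw m bootstrap samples D1..Dm, fit one decision tree h_j(x) on each,
# and combine them by majority vote (the classification analogue of Eq. (2)).
bagged_trees = BaggingClassifier(
    n_estimators=100,   # m in Eq. (2)
    bootstrap=True,     # sample each Dj from D with replacement
    random_state=0,
)
# bagged_trees.fit(X_train, y_train)
# y_pred = bagged_trees.predict(X_test)
```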
The advantages of Bagged Trees are:

- It can reduce the variance within the algorithm (helpful with high-dimensional data).
- It is easy to implement.
- It forms a strong learner yielding better performance.

The disadvantages of Bagged Trees are:

- The algorithm is computationally intensive.
- It is difficult to interpret (low explainability).
B. K-Nearest Neighbors
The KNN algorithm works on the assumption that similar data points tend to be situated close to each other, as seen in Fig. 5. The algorithm classifies a data point based on its 'K' nearest neighbors. For example, if the value of K is taken to be 1, the data point is assigned the class of its closest neighbor. In Fig. 5, the blue hexagon is the test sample, and the two concentric circles represent the K = 3 and K = 5 neighborhoods. For K = 3, two red balls and only one green ball lie close to the test sample, so the test sample is classified as red. For K = 5, three green balls are present compared to two red balls, so the test sample is classified as green. Because of this ambiguity, researchers often prefer K = 1, which in most cases shows better results than other values of K. The nearest neighbors are found based on the Euclidean distance as in Eq. (4).
An input x is assigned to the class j with the largest probability, given by Eq. (3), where A is the set of the K nearest neighbors of x and I(·) is the indicator function:

$$P\left(y=j \mid X=x\right)=\frac{1}{K}{\sum }_{i\in A}I\left({y}^{\left(i\right)}=j\right)$$ (3)

The distance metric is the Euclidean distance (the Minkowski distance with p = 2), as in Eq. (4):

$$d\left(x,{x}^{{\prime }}\right)=\sqrt{{\left({x}_{1}-{x}_{1}^{{\prime }}\right)}^{2}+{\left({x}_{2}-{x}_{2}^{{\prime }}\right)}^{2}+\dots +{\left({x}_{N}-{x}_{N}^{{\prime }}\right)}^{2}}$$ (4)
Commonly used distance metrics are the Euclidean, Manhattan, and Minkowski metrics, given in Eq. (5), Eq. (6), and Eq. (7) respectively, where d(p, q) is the distance between points p and q, n is the number of dimensions of the training set, and pi and qi are the values of the ith dimension of points p and q. A short usage sketch is given after the list.
- Euclidean distance metric:

$$d\left(p,q\right)= \sqrt{\sum _{i=1}^{n}{\left({p}_{i}- {q}_{i}\right)}^{2}}$$ (5)

- Manhattan distance metric:

$$d\left(p,q\right)= \sum _{i=1}^{n}\left|{p}_{i}- {q}_{i}\right|$$ (6)

- Minkowski distance metric:

$$d\left(p,q\right)= {\left({\sum }_{i=1}^{n}{\left|{p}_{i}-{q}_{i}\right|}^{p}\right)}^{1/p}$$ (7)
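A minimal usage sketch with scikit-learn's KNeighborsClassifier follows, where the Minkowski parameter p selects the metric; the hyperparameter values are illustrative, not those tuned in this work.

```python
from sklearn.neighbors import KNeighborsClassifier

# The Minkowski metric of Eq. (7) covers both special cases:
# p = 2 gives the Euclidean distance of Eq. (5), p = 1 the Manhattan distance of Eq. (6).
knn_euclidean = KNeighborsClassifier(n_neighbors=1, metric="minkowski", p=2)
knn_manhattan = KNeighborsClassifier(n_neighbors=1, metric="minkowski", p=1)
# knn_euclidean.fit(X_train, y_train)
# y_pred = knn_euclidean.predict(X_test)
```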
The merits of KNN are:

- Implementation is simple.
- It is robust to noisy training data.
- It can be more effective for large training data.

The demerits of the KNN algorithm are:

- Determining the value of K is sometimes complex.
- Computation cost is high, because the distance from the query point to all training samples must be calculated.
- Standardization (feature scaling) of the training set is required.
C. Weighted KNN
The weighted K-Nearest Neighbors (WKNN) algorithm differs from the conventional KNN algorithm (as in Fig. 6) in that weights are assigned, and the decision is reached depending on these weights. Data points located close to the query point are assigned higher weights. The results obtained from this model are easy to interpret. The weighted Euclidean distance d between points p and q with weights wi is calculated as in Eq. (8).
$$d\left(p,q\right)= \sqrt{{\sum }_{i}{w}_{i}{\left({p}_{i}-{q}_{i}\right)}^{2}}$$ (8)
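A minimal sketch of distance-weighted voting with scikit-learn is given below; it assumes inverse-distance weighting, which is one common choice of the weights wi, and is not necessarily the exact weighting scheme used in this work.

```python
from sklearn.neighbors import KNeighborsClassifier

# Weighted KNN: neighbours closer to the query point receive larger weights.
# weights="distance" uses the inverse of the distance as each neighbour's weight.
wknn = KNeighborsClassifier(n_neighbors=10, weights="distance")
# wknn.fit(X_train, y_train)
# y_pred = wknn.predict(X_test)
```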
The advantages of the weighted KNN algorithm are:
Disadvantages of weighted KNN are:

- Sensitive to noise, missing data, and outliers.
- Does not work well for high-dimensional data.
- Needs feature scaling.
D. Subspace KNN
In the random subspace method, the features ('attributes', 'independent variables', 'predictors') are sampled randomly for each learner, unlike in BT, where the training points are resampled. Informally, this prevents individual learners from over-focusing on features that appear highly predictive in the training set but fail to be as predictive for points outside that set. For this reason, random subspaces are an attractive choice for high-dimensional problems where the number of features is much larger than the number of training points. The selection of a subspace is shown mathematically in Eq. (9) and Eq. (10), and the process is pictorially represented by Fig. 7; a short code sketch follows Eq. (10).
- Given a set of N points in an n-dimensional feature space: {(\({x}_{1},{x}_{2},\dots ,{x}_{n}\)) | \({x}_{i}\) is real for all \(1\le i\le n\)} (9)
- We consider the m-dimensional subspaces: {(\({x}_{1},{x}_{2},\dots ,{x}_{n}\)) | \({x}_{i}=1\) for \(i\in I\), \({x}_{i}=0\) for \(i\notin I\)} (10)

where I is an m-element subset of {1, 2, …, n} and m < n.
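A minimal sketch of a random-subspace ensemble of KNN learners using scikit-learn's BaggingClassifier is shown below; the subspace fraction, ensemble size, and K are illustrative, and the `estimator` argument is named `base_estimator` in older scikit-learn versions.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Random subspace method: each KNN learner sees only a random m-dimensional
# subset of the n features (Eqs. (9)-(10)); the training points are not resampled.
subspace_knn = BaggingClassifier(
    estimator=KNeighborsClassifier(n_neighbors=1),
    n_estimators=30,
    max_features=0.5,          # m/n: fraction of features drawn for each learner
    bootstrap=False,           # keep all training points for every learner
    bootstrap_features=False,  # draw features without replacement
    random_state=0,
)
# subspace_knn.fit(X_train, y_train)
# y_pred = subspace_knn.predict(X_test)
```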
The advantages of Subspace KNN (SKNN) are:
Disadvantages of Subspace KNN are:
E. Random Forest
A Random Forest (RF) algorithm can be used for classification as well as regression, as in the case of Support Vector Machines. The algorithm works by creating decision trees and assigning new data points to the category that wins the majority of votes, as described in Fig. 8. This model finds applications in many sectors and is often used for high-dimensional data.
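A minimal sketch with scikit-learn's RandomForestClassifier is given below; the number of trees is an illustrative choice, not the value used in this work.

```python
from sklearn.ensemble import RandomForestClassifier

# Random Forest: an ensemble of decision trees; a new data point is assigned
# the class that receives the majority of the trees' votes.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
# rf.fit(X_train, y_train)
# y_pred = rf.predict(X_test)
```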