**1-Subjects and IEEG Database**

Approval was obtained from the University of Wisconsin Institutional Review Board, and informed consent was obtained from all human subjects prior to participation. All methods were conducted in accordance with the Declaration of Helsinki.

After approval by the respective IRBs, the raw EEGs from 25 pre-surgical epilepsy patients were visually inspected. The seizure onset time and the channels involved at seizure onset were marked and confirmed by a board-certified neurophysiologist (Boly M). Only channels involved at seizure onset were used for further analysis.

The selected dataset consisted of 710 channels taken from 127 different seizures across 25 individuals: 13 seizures from five patients were from the University of Wisconsin, 82 seizures from 13 patients were from the Epilepsiae Database15, and 32 seizures from seven individuals were provided by the Mayo Clinic.14 Data from the Epilepsiae database were prefiltered with a 50 Hz notch filter, and data from the Mayo Clinic and the University of Wisconsin were prefiltered with a 60 Hz notch filter. Each lead containing seizures was resampled to 400 Hz and cropped to two minutes in length, with seizure onset one minute into the segment. Preictal labels were assigned to every point before seizure onset, and ictal labels to every point at and after seizure onset, creating equal numbers of preictal and ictal data points. Overall, there were 15,240 seconds of data, consisting of 50% preictal and 50% ictal/post-ictal points, where each point is a one-second clip.

**2 - Computing Engineered Features**

**A – Computing Digit Distributions and Cho Gaines Distance:**

The entire dataset was broken into one-second epochs, and the power spectrum of each epoch was computed from the magnitude of the FFT. The leading nonzero digits were counted and normalized to generate a probability distribution over the extracted digits.

The leading nonzero digits were extracted from each time point using the following element-wise formula:

$$D(e) = \left\lfloor \frac{|e|}{10^{\lfloor \log_{10}|e| \rfloor}} \right\rfloor$$

Where $e$ represents an arbitrarily sized tensor.

The results were set to zero wherever the function was undefined, i.e., where $e = 0$. The digits between one and nine were counted and normalized to generate a probability distribution over the extracted digits within the window. The Cho-Gaines distance (a.k.a. Euclidean distance9) was used to compare the observed probability distributions (**Equation 3**) with the expected probability distributions, calculated using the formula from Benford's 1937 work3, as shown in **Equation 2**:

$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right)$$

Where $d$ is a digit from {1, 2, 3, 4, 5, 6, 7, 8, 9}.

$$d^{*} = \sqrt{N \sum_{i=1}^{9} \left(o_i - e_i\right)^2}$$

Where $N$ is the EEG sample rate, $e_i$ is the expected frequency that $i$ is the first digit, and $o_i$ is the observed frequency that $i$ is the first digit.
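A minimal sketch of this digit-distribution pipeline, assuming NumPy; function names are illustrative, not from the original implementation:

```python
import numpy as np

def leading_digits(e):
    """Element-wise leading nonzero digit; set to 0 where undefined (e == 0)."""
    e = np.abs(np.asarray(e, dtype=float))
    out = np.zeros(e.shape, dtype=int)
    nz = e > 0
    out[nz] = np.floor(e[nz] / 10.0 ** np.floor(np.log10(e[nz]))).astype(int)
    return out

def digit_distribution(e):
    """Observed probability of each leading digit 1-9 within a window."""
    counts = np.bincount(leading_digits(e).ravel(), minlength=10)[1:10]
    return counts / counts.sum()

# Benford's expected distribution (Equation 2): P(d) = log10(1 + 1/d)
BENFORD = np.log10(1.0 + 1.0 / np.arange(1, 10))

def cho_gaines(observed, expected, n):
    """Cho-Gaines (Euclidean) distance between digit distributions (Equation 3)."""
    return np.sqrt(n * np.sum((np.asarray(observed) - np.asarray(expected)) ** 2))
```

In the pipeline above, `observed` would come from the FFT magnitudes of a one-second epoch and `n` from the 400 Hz sample rate.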

**B - Band Powers:**

The delta (2 to 4 Hz), theta (4 to 8 Hz), alpha (8 to 12 Hz), beta (12 to 30 Hz), gamma (30 to 80 Hz), and high gamma (80 to 150 Hz) band powers were computed using Welch's method.2,6,18 Band powers were then normalized by the total signal power to create the relative band powers, and the log of the absolute band powers was also used as a feature. The DC component (f = 0 Hz) was dropped before computing relative band powers.
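A sketch of this computation with SciPy; the `nperseg` choice and the small floor inside the log are assumptions, not stated in the text:

```python
import numpy as np
from scipy.signal import welch

FS = 400  # Hz, the resampled rate used in this study
BANDS = {"delta": (2, 4), "theta": (4, 8), "alpha": (8, 12),
         "beta": (12, 30), "gamma": (30, 80), "high_gamma": (80, 150)}

def band_powers(x, fs=FS):
    """Absolute, relative, and log band powers via Welch's method."""
    f, psd = welch(x, fs=fs, nperseg=min(len(x), fs))
    f, psd = f[f > 0], psd[f > 0]  # drop the DC component first
    absolute = {b: psd[(f >= lo) & (f < hi)].sum() for b, (lo, hi) in BANDS.items()}
    total = psd.sum()
    relative = {b: p / total for b, p in absolute.items()}
    log_abs = {b: np.log(p + 1e-12) for b, p in absolute.items()}  # floor avoids log(0)
    return absolute, relative, log_abs
```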

**C - Epileptogenicity Index (EI):**

The formula for computing the EI is shown in **Equation 4**:2

$$EI = \frac{P_\beta + P_\gamma + P_{\gamma'}}{P_\theta + P_\alpha}$$

Where $P_\alpha$, $P_\beta$, $P_\gamma$, $P_{\gamma'}$, and $P_\theta$ are the relative band powers of the alpha, beta, gamma, high gamma, and theta bands, respectively.
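Given relative band powers, the EI reduces to a ratio of fast over slow activity; a one-line sketch (the exact grouping of bands is assumed from the band list above):

```python
def epileptogenicity_index(rel):
    """EI as the ratio of fast (beta/gamma/high gamma) to slow (theta/alpha)
    relative band powers; the grouping is assumed from Equation 4's band list."""
    return (rel["beta"] + rel["gamma"] + rel["high_gamma"]) / (rel["theta"] + rel["alpha"])
```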

**D – Phase Locked High Gamma (PLHG):**

The formula for computing PLHG is shown in **Equation 5**:1

$$PLHG = \left| \left\langle A_{HFO}(t)\, e^{i\,\Delta\phi(t)} \right\rangle \right|$$

Where $A_{HFO}$ is the instantaneous signal envelope of the high frequency oscillations, and $\Delta\phi$ is the difference between the low frequency (LF) and high frequency instantaneous phases. LF here refers to the theta, alpha, and beta bands, and HFO refers to the gamma and high gamma bands.
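A minimal sketch using Butterworth band-pass filters and Hilbert transforms; the band edges and filter order are illustrative, and this follows the phase-difference description above rather than any particular reference implementation:

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

FS = 400

def bandpass(x, lo, hi, fs=FS, order=4):
    """Zero-phase Butterworth band-pass filter."""
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

def plhg(x, fs=FS, lf=(4, 30), hfo=(30, 150)):
    """Phase-locked high gamma (Equation 5): magnitude of the mean
    envelope-weighted phasor of the LF-HFO phase difference."""
    phi_lf = np.angle(hilbert(bandpass(x, *lf, fs)))   # LF instantaneous phase
    analytic_hfo = hilbert(bandpass(x, *hfo, fs))
    a_hfo = np.abs(analytic_hfo)                       # instantaneous envelope A_HFO
    dphi = phi_lf - np.angle(analytic_hfo)             # Δφ between LF and HFO phases
    return np.abs(np.mean(a_hfo * np.exp(1j * dphi)))
```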

**3 - Neural Network Design:**

Three neural networks were designed to learn from one-second segments of iEEG data: a fully connected network for learning features from the relative Welch power spectrum (FDBB), a convolutional neural network for extracting time domain features (TDBB), and a fully connected network for classifying engineered metrics (EMC). Each DNN had distinct encoder and classifier subnetworks so the networks could also be used for dimensionality reduction. The classifier subnetwork was identical in all three models to allow comparison between the different feature spaces, and the output of the encoder subnetworks was used for feature analysis. The dropout in the classifier subnetwork was 75% for FDBB, 90% for TDBB, and 40% for EMC. All networks were designed to work with an arbitrary number of electrodes by changing the input channels parameter, allowing the same architecture to be applied to additional datasets, such as the 2015 Kaggle competition "UPenn and Mayo Clinic's Seizure Detection Challenge". The architectures for all three models are included in **Figure 1ab**.
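The encoder/classifier split could look like the following PyTorch sketch. The layer widths, the 200-sample input length, and the feature dimension are all illustrative placeholders; only the shared-classifier design and per-model dropout mirror the text:

```python
import torch
import torch.nn as nn

class EncoderClassifier(nn.Module):
    """Encoder/classifier split shared by the three DNNs (sizes are illustrative)."""

    def __init__(self, in_channels, feat_dim=32, dropout=0.75):
        super().__init__()
        # Encoder: maps a one-second, multi-channel input to a feature vector
        # used later for feature analysis.
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * 200, 128), nn.ReLU(),
            nn.Linear(128, feat_dim), nn.ReLU(),
        )
        # Classifier: identical head across FDBB/TDBB/EMC so the feature
        # spaces can be compared fairly; dropout varies per model.
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(feat_dim, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.classifier(self.encoder(x))
```

Changing `in_channels` is what lets one architecture serve datasets with different electrode counts.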

**4 - Neural Network Pretraining:**

All three neural networks were trained using leave-one-out (LOO) cross validation across the 25 subjects, where all data points from one patient were withheld from training and reserved for testing. For each fold, models were trained by stochastic gradient descent using the Adam optimizer and binary cross-entropy loss until the validation loss had not improved for ten epochs. Training and validation sets were also separated by subject, with eight of the training subjects assigned to the validation set; the assignment was reshuffled for each fold to ensure different validation and training sets across folds. Within each set, the data was broken into separate single-channel, one-second segments. The AUC, Brier score, PPV, NPV, recall, and accuracy were recorded for each test set, and the model with the best overall scores was selected to attempt the Kaggle challenge. Cross validation was conducted with the same random seed, keeping the data points used in each fold identifiable and consistent across models.
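The subject-wise splitting could be sketched with scikit-learn; the function name and the use of `LeaveOneGroupOut` are illustrative choices, not taken from the original code:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loo_folds(subject_ids, n_val_subjects=8, seed=0):
    """Yield (train, val, test) index arrays for subject-wise LOO CV.

    One subject is held out per fold; eight of the remaining training
    subjects form the validation set, reshuffled for each fold.
    """
    subject_ids = np.asarray(subject_ids)
    rng = np.random.default_rng(seed)
    for train_idx, test_idx in LeaveOneGroupOut().split(subject_ids, groups=subject_ids):
        train_subjects = np.unique(subject_ids[train_idx])
        val_subjects = rng.choice(train_subjects, size=n_val_subjects, replace=False)
        is_val = np.isin(subject_ids[train_idx], val_subjects)
        yield train_idx[~is_val], train_idx[is_val], test_idx
```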

**5 - Feature Analysis:**

For each fold, the test subject and associated pretrained model were selected to extract the FDBB and TDBB feature spaces, preventing data leakage during RFC training. The feature spaces were then analyzed using Gini importance and a correlation analysis. Gini importance scores each feature by the total reduction in Gini impurity it provides across the splits of a fitted RFC; because it is a byproduct of fitting, it is a computationally efficient way of evaluating features when using RFCs.19 The correlation matrix between all feature combinations was computed to illustrate potential relationships between the DNN features and the engineered metrics, enabling us to determine whether the DNN feature spaces were related to the engineered metrics.
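With stand-in data, the analysis amounts to the following; the feature matrix and labels here are synthetic placeholders, while the RFC settings match those given in the next section:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Stand-in feature matrix: columns play the role of DNN features plus engineered metrics.
X = rng.standard_normal((200, 6))
y = (X[:, 0] + 0.1 * rng.standard_normal(200) > 0).astype(int)  # column 0 is informative

rfc = RandomForestClassifier(n_estimators=30, criterion="entropy",
                             max_depth=5, random_state=0).fit(X, y)
gini_importance = rfc.feature_importances_  # impurity-based (Gini) importance
corr = np.corrcoef(X, rowvar=False)         # feature-feature correlation matrix
```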

**6 - Feature Ensembles:**

As summarized in Steps 2 and 3 in **Figure 2a**, leave-one-out (LOO) cross validation of a Random Forest model across 25 subjects was used to generate interpretable models using all features for both the seizure identification and latency determination tasks. The latency task is defined by the UPenn-Mayo Kaggle challenge and consists of identifying ictal segments from the first 15 seconds of the seizure as the positive group (12.5% of samples).24 The dataset was expanded in the pretraining stage by splitting all channels for each subject and using a one-second sliding window with 50% overlap to generate additional frames, as demonstrated in the winning Kaggle submission. For the remaining 24 subjects within each LOO fold, an RFC (30 estimators, split by entropy, max depth of five) was trained using 100-fold cross validation on single-channel FDBB, TDBB, and EMC features. The Gini importance of each feature was aggregated from the pretrained models of each internal cross-validation fold and external LOO fold, yielding a score distribution of 2,400 importance calculations overall. Importance scores are represented in **Figure 2c**.
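The latency labeling can be sketched in a few lines; the function name is illustrative. Note that for a two-minute clip with onset at 60 s, a 15-second positive window is exactly 12.5% of samples:

```python
import numpy as np

def latency_labels(seconds_since_onset, cutoff=15.0):
    """Positive class = ictal clips within the first 15 s of the seizure."""
    t = np.asarray(seconds_since_onset, dtype=float)
    return ((t >= 0) & (t < cutoff)).astype(int)
```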

**7 – Kaggle Challenge:**

The deep neural network with the best AUC score on our seizure identification task was selected to attempt the Kaggle competition. The competition consists of seizure and latency identification tasks, where contestants design one model to identify seizures and another to determine whether the seizure is within 15 seconds of onset (latency). Each subject had their own instance of the neural network module to prevent issues from mismatched channel counts. Each subject model was trained by stochastic gradient descent using the Adam optimizer and binary cross-entropy loss, with five-fold cross validation by sample, where 25% of the validation set was held out for testing after each epoch. Additionally, another TDBB model was trained by projecting each sample to 16 leads, allowing the same model to be applied to all subject data by controlling for the number of channels. These approaches were included because they resemble how neural networks are typically applied to real-world problems and provide a baseline for comparison.

Dataset augmentation consisted of creating time points starting at each half second by combining the last half of the previous sample with the first half of the next, effectively using a sliding window with 50% overlap. The final response was graded by the Kaggle auto-grader, which generated an ROC score representing the average of the latency task and identification task scores.
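The half-second-overlap augmentation described above can be sketched as follows (function name illustrative); a two-minute, 400 Hz lead yields 239 frames instead of 120:

```python
import numpy as np

def sliding_windows(x, fs=400, win_s=1.0, overlap=0.5):
    """Split one channel into one-second frames with 50% overlap."""
    win = int(win_s * fs)
    step = int(win * (1 - overlap))
    n = (len(x) - win) // step + 1
    return np.stack([x[i * step: i * step + win] for i in range(n)])
```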

In addition to the TDBB classifiers, the TDBB features were extracted from the Kaggle dataset by averaging the feature spaces from each of the 25 pretrained LOO folds to create a single average feature space. This is possible because each pretraining LOO fold has its own fully trained model, and there are 25 folds because each patient is held out once. From the averaged features, the three most important and the single most important features for both the identification and latency tasks were extracted from each channel, creating two separate feature sets tailored to each task. The same was done with the least important features to serve as controls. All feature sets were used to train Random Forest classifiers for each subject (30 estimators, entropy, max depth of five) to evaluate the transferability of our DNNs and determine whether they can be used for feature engineering. The trained estimator from each fold of this cross-validation step was used to create a Kaggle entry, yielding a distribution of 30 potential scores for the various feature ensembles. It is important to emphasize that the neural networks used to extract the black-box features were pretrained on data unrelated to the Kaggle set.
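Averaging the per-fold feature spaces and then keeping the most important columns might look like the following sketch (array shapes and names are illustrative):

```python
import numpy as np

def top_k_features(fold_features, importances, k=3):
    """Average per-fold feature spaces, then keep the k most important columns.

    fold_features: (n_folds, n_samples, n_features) array of extracted features;
    importances:   (n_features,) aggregated importance scores.
    """
    avg = np.mean(fold_features, axis=0)        # average across the LOO folds
    order = np.argsort(importances)[::-1]       # most important first
    return avg[:, order[:k]], order[:k]
```

Selecting the *least* important columns instead (reverse the sort) gives the control feature sets.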