Participants
To ensure that our study was adequately powered to detect the effect of gamified calibration on BCI performance in children, we conducted an a priori power analysis using G*Power software (version 3.1.9.4). Based on pilot study results (13), we estimated an effect size of f ≥ 0.4 for the interaction between calibration condition (gamified vs. non-gamified) and paradigm (P300 and SMR) in a 2x2 factorial within-subjects design. We set the desired statistical power (1-β) at 0.80 and the significance level (α) at 0.05 for a two-tailed test. The power analysis indicated that a total sample size of N = 31 participants would be required.
Participants were recruited from the community via a database of typically developing children whose families have volunteered to participate in medical research (Healthy Infants and Children’s Clinical Research Program) (14). Inclusion criteria were typical neurodevelopment with no neurological conditions, age 6 to 18 years, and informed consent/assent. Methods were approved by the Conjoint Health Research Ethics Board at the University of Calgary (REB21-1883). Participants received a $25 gift card for volunteering.
Protocol
We conducted a prospective, randomized, cross-over study to investigate the effects of gamified calibration in comparison to non-gamified calibration on two different paradigms: P300 and SMR. We employed a 2x2 factorial within-subjects design, with the factors being the paradigm (P300 vs. SMR) and the calibration condition (gamified vs. non-gamified). To ensure a balanced allocation of participants across the different conditions and to control for potential order effects, we utilized a Latin square design. Participants were randomly assigned to one of four condition sequences using an online randomization tool (www.randomlists.com). This resulted in four groups of eight participants each (total N = 32), with each group experiencing a unique sequence of the four combinations of paradigms and conditions. Participants attended two sessions lasting approximately 1.5 hours each. The primary outcome was classification accuracy, with secondary outcomes including precision and recall scores of the classification model from calibration, and the online accuracy achieved in subsequent BCI tasks.
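As an illustration of how a Latin square balances order across four condition combinations, the following minimal sketch builds a cyclic 4x4 square and deals 32 participants evenly across its rows. The actual sequences and the allocation procedure used in the study are not reported here, so the cyclic construction, the function names, and the fixed seed are all assumptions made for the example.

```python
import random

# The four paradigm-by-calibration combinations named in the Protocol.
CONDITIONS = [
    "P300/gamified",
    "P300/non-gamified",
    "SMR/gamified",
    "SMR/non-gamified",
]


def cyclic_latin_square(items):
    """Return len(items) sequences in which each item appears exactly once
    in every row (sequence) and every column (serial position)."""
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]


def allocate(participant_ids, n_sequences, seed=0):
    """Shuffle participants, then deal them evenly across the sequences."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    groups = {index: [] for index in range(n_sequences)}
    for position, pid in enumerate(shuffled):
        groups[position % n_sequences].append(pid)
    return groups


sequences = cyclic_latin_square(CONDITIONS)
groups = allocate(range(1, 33), len(sequences))  # hypothetical IDs 1..32

for index, members in groups.items():
    print(sequences[index], "n =", len(members))
```

With 32 participants and four sequences, each group receives eight participants, and every combination occupies each serial position exactly once across the four sequences.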
Each visit followed a similar protocol, beginning with a 10-minute setup that included a preliminary assessment questionnaire to record potential factors affecting performance (concussion, vision problems/corrected vision, mood, previous night’s sleep, tiredness, exercise, food, alcohol/drug consumption including caffeine, and hobbies/sports) and the adapted Edinburgh handedness inventory (15). At the start of the first visit, participants also completed three tests from the CNS Vital Signs computerized neurocognitive test battery to evaluate attention and working memory: the Stroop Test, the Shifting Attention Test, and the 4-part Continuous Performance Test (16). At the start of the second session, participants completed two N-back tasks (1-back and 2-back) to evaluate working memory and visuospatial memory. The results from these neurocognitive tests will be reported in a separate paper.
Before beginning the BCI paradigms, the participants were instructed to sit still, with feet flat on the floor and hands/arms either resting on their lap or on the table in front of them. Participants were then evaluated on two classifier training environments for each paradigm: one gamified and one non-gamified. Afterward, they were assessed on their BCI performance on a “utility driven” BCI task: spelling for the P300 paradigm and cursor control for the SMR paradigm. After each paradigm, motivation, tolerability, and workload were evaluated with a series of questions.
The performance of the classification model was evaluated primarily by accuracy, the proportion of correct predictions among all predictions made. To provide a more comprehensive and balanced assessment, we also reported precision and recall. Precision is the proportion of positive predictions that were correct, and reflects the model's ability to identify positive instances without generating false positives. Recall is the proportion of actual positive instances in the test data that the model identified, and reflects the model's sensitivity in detecting all positive cases.
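The three metrics can be computed directly from the counts in a binary confusion matrix. The sketch below uses the standard definitions; the function name, the label encoding (1 = positive), and the example labels are illustrative assumptions, not study data.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    return {
        # Correct predictions over all predictions.
        "accuracy": (tp + tn) / total,
        # Correct positive predictions over all positive predictions.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Correct positive predictions over all actual positives.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels: 3 actual positives, 5 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this made-up example the model makes 6 of 8 correct predictions (accuracy 0.75), and 2 of its 3 positive predictions are correct while it finds 2 of the 3 actual positives (precision and recall both 2/3).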
Experimental Setup and BCI System
A research-grade EEG-based BCI system was utilized for signal acquisition and processing (g.tec medical engineering GmbH, Schiedlberg, Austria). Gamified and non-gamified classifier training paradigms were generated in-house.
EEG signals were acquired using the gSCARABEO active wet electrodes and amplified by the gUSB amplifier (g.tec medical engineering GmbH, Austria). Montages for SMR and P300 paradigms are specified in Table 1. All signals were sampled at 256 Hz. Before each BCI session, electrode impedance was confirmed to be below 5 kΩ to optimize signal quality and minimize artifacts.
Table 1. Channel Configurations for P300 and SMR BCI Paradigms
Paradigm | # Channels | 10-20 Channel Locations
P300 | 8 | Fz, Cz, P3, Pz, P4, PO7, Oz, PO8
SMR | 16 | FC3, FCz, FC4, C5, C3, C1, Cz, C2, C4, C6, CP3, CP1, CPz, CP2, CP4, Pz
BCI Paradigms
One standard calibration scene, one gamified calibration scene, and one task scene were developed for each P300 and SMR (Fig. 1). For the gamified calibration scenes, we incorporated game elements such as a storyline and quest, scoring system, and audio-visual features like characters, background music, and sound effects associated with points.
P300 Paradigm
Design
A pseudo-random single character flashing design was chosen for the P300 paradigm due to its numerous reported advantages over the traditional row-column design, including mitigating adjacency-distraction effects and double-flashing errors (17). A 3x3 matrix (9 potential targets) was used for both calibration conditions and the spelling task, chosen to minimize total eye movement and user fatigue (17).
The familiar face paradigm is known to elicit stronger P300 responses in adults, leading to faster target selection and higher accuracy, in addition to increased potential for enhanced user engagement (18). These factors are crucial for optimizing BCIs for children with limited attention spans. Rezeika et al. (2018) suggested that using culturally well-known faces can lead to high and consistent effects across individuals (17); thus, we incorporated popular children's cartoon characters. Cells flashed one at a time: during each flash, a single cell changed to display a cartoon character’s face. Flashes lasted 100 ms, with an inter-stimulus interval of 75 ms.
Calibration
In the non-gamified calibration task, a target square was identified with a “+”. Users were instructed to pay attention to that square and silently count the number of times a character flashed, until the word "done" appeared on the screen, signaling the end of the training.
For the gamified P300-based BCI training, titled "Mole Patrol" (similar to whack-a-mole), participants were instructed to use their brain power to "whack" moles appearing from different holes. To achieve this, they had to silently count the number of times the mole appeared in the target hole, accompanied by a distinct sound, while ignoring other cartoon characters who were also attempting to catch the mole. Participants were guided through the game interface, with the number of moles "whacked" and the number of remaining moles displayed in the top left corner, and the participant's score shown in the bottom right corner.
After completing a run, participants were required to enter the number of times the mole appeared in its hole. Correctly counting the mole's appearances earned them 100 points. The farther they were from the correct count, the fewer points they received for each mole. The highest possible score was 900 (9 runs x 100 possible points per run).
In both calibration conditions, each run consisted of 9 trials with a random number of flashes ranging from 10 to 15 (in the gamified condition, the number of times the mole emerged from its hole). The total calibration time was approximately 5 minutes. Once the classification model was trained, a confusion matrix was provided alongside percentages for the performance of the model, including the accuracy, precision, and recall.
Spelling Task
Using the classifier trained in the calibration phase, participants then completed a free spelling task using a two-stage T9 speller. Participants were required to write a five-letter word beginning at 10 flashes per key and decreasing by increments of two flashes for each subsequent word until two flashes per key, and a final word at 1 flash per key. Target words were different for each flash rate to mitigate potential learning effects. Thus, participants spelled a total of seven different five-letter words. They were instructed to employ the same method as in the calibration task, focusing on the key containing their target letter and silently counting the number of times the cartoon character flashed on the key. Each five-letter word required ten actions to spell: two actions (selecting a key and confirming the choice) for each of the five letters.
The accuracy of the task was calculated at each flash rate as the ratio of the number of correct selections (correctly chosen and confirmed keys) to the total number of selections (ten actions per five-letter word). Participants were tested at 10 and 8 flashes. If their online accuracy dropped below 40% on two consecutive attempts, the task was discontinued to avoid causing frustration and disappointment.
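The per-word accuracy and the discontinuation rule can be sketched as follows. This is a minimal illustration that reads the stopping criterion as two consecutive words below 40%; the function names and the example counts are hypothetical.

```python
SELECTIONS_PER_WORD = 10  # five letters x two actions (select, confirm)
STOP_THRESHOLD = 0.40     # accuracy below this counts as a low attempt


def online_accuracy(correct_selections):
    """Ratio of correct selections to total selections for one word."""
    return correct_selections / SELECTIONS_PER_WORD


def run_spelling_task(correct_by_word):
    """Return accuracies for the words attempted, in order.

    The task stops after two consecutive words with accuracy below
    STOP_THRESHOLD (an assumed reading of the discontinuation rule).
    """
    accuracies = []
    consecutive_low = 0
    for correct in correct_by_word:
        accuracy = online_accuracy(correct)
        accuracies.append(accuracy)
        consecutive_low = consecutive_low + 1 if accuracy < STOP_THRESHOLD else 0
        if consecutive_low == 2:
            break  # discontinue to avoid frustration
    return accuracies


# Hypothetical counts of correct selections for successive words.
attempted = run_spelling_task([9, 8, 3, 2, 7])
print(attempted)
```

In this hypothetical run the third and fourth words both fall below 40%, so the fifth word is never attempted.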
SMR Paradigm
Design
For the SMR task, a binary-class paradigm of left- and right-hand motor imagery was chosen. The training protocol was based on the Graz training paradigm, the standard training approach for motor imagery-based BCI (19). It involves imagining specific movements, such as moving the left or right hand, while EEG signals are recorded and processed. After the first pair of imagined actions, real-time feedback was provided to users based on a classifier trained on the completed sets of imagined actions. The duration of the training protocol was 5.67 minutes with the following parameters: window length = 1.5 s; number of training windows = 6; pause before training = 2 s; number of training selections = 20; pause after training = 2 s; train break = 4 s.
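The stated duration can be reproduced from the listed parameters if each training selection is assumed to consist of the pause before training, the training windows, the pause after training, and the break, run back to back. That sequential structure is an assumption made for this check; the variable names are illustrative.

```python
# Protocol parameters as listed in the text (seconds unless noted).
window_length_s = 1.5
n_training_windows = 6
pause_before_s = 2.0
pause_after_s = 2.0
train_break_s = 4.0
n_training_selections = 20

# One imagined action spans all training windows: 6 x 1.5 s = 9 s.
action_s = window_length_s * n_training_windows

# Assumed structure of one selection: pause, action, pause, break = 17 s.
selection_s = pause_before_s + action_s + pause_after_s + train_break_s

total_s = selection_s * n_training_selections  # 340 s
total_min = total_s / 60

print(f"{action_s:.0f} s per action, {total_s:.0f} s total, {total_min:.2f} min")
```

Under this assumption the total is 340 s, or 5.67 minutes, which agrees with the reported protocol duration and with the 9-second imagined actions described under SMR Processing.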
Calibration
During the non-gamified calibration task, participants were instructed to concentrate on two gray squares presented on the screen. They were asked to imagine opening and closing the hand corresponding to the side with the larger square. The participants received feedback after one trial per side. The feedback was provided by making the square corresponding to the classified side more opaque. This real-time feedback encouraged participants to think more intently about the imagined hand movement and to continue imagining the movement until instructed to relax.
The gamified task, titled "Banana Dash," aimed to engage participants in teaching baby Moe, a virtual monkey character, how to catch falling bananas using their imagined hand movements. Participants were instructed to transfer their brain power to baby Moe by imagining grabbing the bananas with their left or right hand, depending on which side the bananas were falling. The task was designed to resemble a learning process, with Moe initially observing the participant's imagined actions before attempting to catch the bananas himself.
During the first two trials, Moe did not run towards the bananas, as he was simply observing the participant's thought patterns. After these initial trials, Moe would attempt to catch the bananas based on the participant's imagined hand movements. If Moe started moving in the wrong direction, participants were instructed to concentrate more intensely on imagining the correct hand movement to guide Moe towards the bananas.
Participants were encouraged to remain focused on consistently imagining the hand movements throughout the task and to keep imagining until Moe signaled them to relax. The objective of the game was to help Moe catch as many bananas as possible to achieve the maximum score. The gamified task aimed to maintain participant engagement and motivation while collecting motor imagery-based BCI data.
Cursor Control Task
The online task for the SMR paradigm was cursor control on a one-dimensional plane (horizontal) to answer yes/no questions by imagining left-hand and right-hand movements. The cursor control task entailed responding to ten yes/no questions with pre-confirmed answers. Participants were instructed to imagine left-hand or right-hand movements, which moved the cursor in the respective direction to indicate a 'yes' or 'no' response. Accuracy was calculated by dividing the number of correct responses by the total number of questions.
Signal Processing and Classification
The classification steps for both P300 and SMR were identical in the standard and gamified calibration processes.
P300 Processing
After each visual stimulus, the following 600 ms of EEG data were recorded. Each 600 ms epoch was filtered with a 5th-order Butterworth bandpass filter with a passband of 0.1–15 Hz. Once the stimulus presentation for a single round was finished, the epochs for each object were ensemble-averaged, yielding one 600 ms averaged epoch per object. These epochs were then saved with binary labels indicating whether the respective object was the target or not. This was repeated for each of the nine rounds of flashing. These epochs were then used to estimate the XDawn-filtered ERP covariance matrices, which were mapped to their respective tangent space representations. This resulted in nine feature sets with the target label and 72 feature sets with the non-target label. To overcome this class imbalance, feature sets of the target class were oversampled to match the number in the non-target class. Feature sets were used to train a shrinkage linear discriminant analysis (sLDA) classifier using 5-fold cross-validation. Reported classification accuracy did not include the oversampled feature sets used for training. XDawn and sLDA were selected based on recommendations from Lotte et al. (20) for state-of-the-art BCI pipelines with small datasets. For the online spelling task, the trained classifier was used to evaluate each object’s posterior probability of belonging to the target class. The object with the greatest posterior probability was selected as the user’s chosen target.
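The class-balancing step can be illustrated with a small sketch. The text does not specify how the target class was oversampled, so random resampling with replacement is assumed here; the function name and the placeholder features (integers standing in for tangent space vectors) are hypothetical.

```python
import random


def oversample_minority(features, labels, seed=0):
    """Resample smaller classes with replacement until every class
    matches the size of the largest class (assumed strategy)."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    by_label = {}
    for x, y in zip(features, labels):
        by_label.setdefault(y, []).append(x)

    target_size = max(len(items) for items in by_label.values())
    out_features, out_labels = [], []
    for y, items in by_label.items():
        extra = rng.choices(items, k=target_size - len(items))
        for x in items + extra:
            out_features.append(x)
            out_labels.append(y)
    return out_features, out_labels


# Nine rounds x nine objects; one object per round is the target (label 1).
labels = [1 if obj == 0 else 0 for _ in range(9) for obj in range(9)]
features = list(range(len(labels)))  # placeholders for feature vectors

balanced_x, balanced_y = oversample_minority(features, labels)
print(labels.count(1), labels.count(0))          # before balancing
print(balanced_y.count(1), balanced_y.count(0))  # after balancing
```

Starting from 9 target and 72 non-target feature sets, the balanced training set holds 72 of each. The sketch covers only the balancing step; it does not reproduce the cross-validation, in which the oversampled copies were excluded from the reported accuracy.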
SMR Processing
Each 9-second imagined action was segmented into 6 epochs of 1.5 seconds each. These epochs were filtered using a 5th-order Butterworth bandpass filter with a passband of 5–30 Hz. The covariance matrix was then calculated for each epoch. The covariance matrices were subsequently mapped to their corresponding tangent space representations, producing a feature set with a length equal to the number of channels. Logistic regression was employed for classification of the feature sets, following an approach similar to Barachant et al., but substituting logistic regression in place of LDA (21).
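The segmentation step can be made concrete using the 256 Hz sampling rate given in the Experimental Setup. The sketch below assumes non-overlapping epochs, which is what six 1.5-second epochs in a 9-second action implies; the function name and the placeholder sample indices are illustrative.

```python
SAMPLING_RATE_HZ = 256  # from the Experimental Setup section
EPOCH_S = 1.5
ACTION_S = 9.0


def segment(samples, epoch_len):
    """Split a sequence into consecutive, non-overlapping epochs."""
    last_start = len(samples) - epoch_len
    return [samples[i:i + epoch_len] for i in range(0, last_start + 1, epoch_len)]


epoch_len = int(EPOCH_S * SAMPLING_RATE_HZ)             # samples per epoch
action = list(range(int(ACTION_S * SAMPLING_RATE_HZ)))  # placeholder indices

epochs = segment(action, epoch_len)
print(len(action), "samples ->", len(epochs), "epochs of", len(epochs[0]), "samples")
```

At 256 Hz, one imagined action spans 2304 samples per channel and yields six epochs of 384 samples each.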
An iterative classification approach was chosen over a static one to provide real-time feedback. This feedback allowed users to adjust their mental strategies and enhanced the learning experience by offering more opportunities for improvement. A classifier was trained after the first two imagined actions, and updated after every subsequent pair of imagined actions, utilizing all available training data. Finally, a classifier was trained using all calibration data once it became available.
Questionnaires
Motivation was assessed using a shortened version of the Pediatric Motivation Scale (PMOT), which measures motivation from a child’s perspective as an event-based state following activities, rather than as a trait (22). The PMOT contains 21 items divided into six subscales that evaluate subjective feelings of effort/importance, relatedness, autonomy, interest/enjoyment, competence, and value/usefulness, along with open-ended items. Children responded to five items from the interest/enjoyment, effort/importance, and competence subscales using a 6-point ordinal face scale, where 1 was “not true at all” and 6 was “definitely true”. Workload was assessed using the child-adapted NASA Task Load Index (NASA-TLX), which measures subjective mental workload (23). Tolerability was assessed at the end of each session with a questionnaire using the same 6-point ordinal face scale. To promote valid responses throughout the scale, some items were framed negatively and were reverse scored. Children reported their perceived fatigue on a scale of 1 to 10 at the start of each session and at the end of each paradigm. To quantify the level of fatigue attributed to each paradigm, change in fatigue was calculated by subtracting the reported fatigue recorded after each paradigm from the reported fatigue recorded immediately before each paradigm.
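Reverse scoring on a 1-to-6 scale maps a response r to 7 - r, so that 1 becomes 6 and 6 becomes 1. The sketch below applies that mapping to negatively framed items; which items were negatively framed is not stated here, so the item indices and responses are hypothetical.

```python
SCALE_MIN, SCALE_MAX = 1, 6  # 6-point ordinal face scale


def reverse_score(response):
    """Mirror a response around the scale midpoint (1 <-> 6, 2 <-> 5, 3 <-> 4)."""
    if not SCALE_MIN <= response <= SCALE_MAX:
        raise ValueError(f"response {response} is outside the 1-6 scale")
    return SCALE_MAX + SCALE_MIN - response


def score_items(responses, negatively_framed):
    """Reverse-score the items whose indices are in `negatively_framed`."""
    return [
        reverse_score(r) if index in negatively_framed else r
        for index, r in enumerate(responses)
    ]


# Hypothetical responses; items at index 1 and 3 are assumed negatively framed.
scored = score_items([6, 2, 5, 1, 4], negatively_framed={1, 3})
print(scored)
```

After reverse scoring, higher values consistently indicate a more positive response, so items can be averaged within a subscale.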
Data analysis
Statistical analyses were performed using SPSS Statistics version 28.0 (IBM, USA), and GraphPad Prism version 9.0.0 (GraphPad Software, USA) was utilized for further statistical evaluations and the generation of graphs. Accuracy scores were averaged to obtain a mean score for each paradigm. Responses from questionnaires were transformed into corresponding numerical values. Outcomes were tested for normality using the Shapiro-Wilk test and parametric or nonparametric tests were used as appropriate.
To analyze the differences in accuracy across various flash rates for the P300 paradigm, we conducted a Friedman two-way analysis of variance (ANOVA) by ranks followed by pairwise comparisons using Dunn’s multiple comparisons test. To report the overall online BCI accuracy for the P300 spelling paradigm, we computed the average accuracy score for runs within the stable range of flash rates where accuracy was not significantly impacted. Performance, encompassing classification model metrics (accuracy, precision, and recall) and online accuracy for utility-driven tasks, was compared between gamified and non-gamified calibration using Wilcoxon signed-rank tests. Differences in factors affecting performance, including motivation, tolerability, workload, and fatigue, were analyzed using Wilcoxon signed-rank tests. Participants were stratified into four distinct age groups for the purpose of evaluating differences between ages. These groups were defined as follows: 'young' (6–8 years), 'middle' (9–11 years), 'young teens' (12–14 years), and 'teens' (15–17 years). A Kruskal-Wallis test was subsequently performed to evaluate potential differences in accuracy across these age groups. Bonferroni-Dunn corrections were applied to control for Type I errors in multiple comparisons.
A proficiency threshold for the SMR paradigm was set at 74%, which is the upper confidence limit of chance results for a 2-class classification model with 10 trials/class at α = 5% (24, 25). This means that a classification accuracy higher than 74% would be considered statistically significant, indicating that the BCI performance is better than random.