Clinical trial
This study was performed as part of the CortiCom clinical trial (ClinicalTrials.gov Identifier: NCT03567213), a phase I early feasibility study of the safety and preliminary efficacy of an implantable ECoG BCI. Due to the exploratory nature of this study and the limited number of participants, the primary outcomes of the trial were stated in general terms (Supplementary Note 1) and were designed to gather preliminary data on: 1) the safety of the implanted device, 2) the recording viability of the implanted device, and 3) BCI functionality enabled by the implanted device using a variety of strategies. Accordingly, no methods or statistical analysis plans were predefined for assessing these outcomes. Results related to the first two primary outcome variables, though necessarily provisional as they are drawn from only one participant, are reported in Supplementary Notes 2 and 3, respectively. Results related to BCI functionality, also necessarily provisional and exploratory (Supplementary Note 4), are addressed within the subsequent methodology and results, which nevertheless employed rigorous analyses and statistics.
The study protocol can be found as an additional supplemental file. The study protocol was reviewed and approved by Johns Hopkins University Institutional Review Board and by the US Food and Drug Administration (FDA) under an investigational device exemption (IDE).
Participant
All results reported here were based on data from the first and only participant to date in the CortiCom trial. The participant gave written informed consent after being informed of the nature of the research and the risks related to the implant. To date, this participant has had no serious or device-related adverse events, and thus the primary (safety) outcome of the CortiCom trial has been met. The secondary outcomes of the CortiCom trial are reported, in part, here; specifically, success rate and latency are reported in terms of click detection accuracy and time from attempted movement onset to click.
The participant was a right-handed man who was 61 years old at the time of implant in July 2022 and had been diagnosed with ALS roughly 8 years prior. Due to bulbar dysfunction, the participant had severe dysphagia and progressive dysarthria, accompanied by progressive dyspnea. The participant could still produce overt speech, but slowly and with limited intelligibility. He had experienced progressive weakness in his upper limbs such that he was incapable of performing activities of daily living without assistance; his lower limbs were less affected.
Neural implant
The CortiCom study device was composed of two 8x8 subdural ECoG grids manufactured by PMT Corporation (Chanhassen, MN), which were connected to a percutaneous 128-channel Neuroport pedestal manufactured by Blackrock Neurotech Corporation (Salt Lake City, UT). Final assembly and sterilization of the study device were performed by Blackrock Neurotech. Both subdural grids consisted of soft silastic sheets embedded with platinum-iridium disc electrodes (0.76 mm thickness, 2-mm diameter exposed surface) with 4 mm center-to-center spacing and a total surface area of 12.11 cm² (36.6 mm x 33.1 mm). The device included two reference wires, which were exposed to match the recording surface area of the ECoG electrodes. During all recordings with the study device, the Neuroport pedestal was coupled to a small (24.9 mm x 17.7 mm x 17.9 mm) external device (Neuroplex-E; Blackrock Neurotech Corp.) for signal amplification, digitization, and digital transmission via a mini-HDMI cable to the Neuroport Biopotential System (Blackrock Neurotech Corp.) (Fig. 1a).
The two electrode grids of the study device were surgically implanted subdurally, over sensorimotor cortex representations for speech and upper extremity movements in the left hemisphere. Implantation was performed via craniotomy under monitored anesthesia care with local anesthesia and sedation tailored to intraoperative task participation. There were no surgical complications or surgically related adverse events. The locations of targeted cortical representations were estimated prior to implantation using anatomical landmarks from a pre-operative structural MRI, functional MRI, and somatosensory evoked potentials. The locations of the subdural grids with respect to surface gyral anatomy were confirmed after implantation by co-registering a post-operative high-resolution CT with a pre-operative high-resolution MRI using Freesurfer19 (Fig. 1b).
Testing and calibration
At the beginning of each session, a 60-second calibration period was recorded, during which the participant was instructed to remain still and quiet with his eyes open and his gaze fixated on a computer monitor. For each channel, we then computed the mean and standard deviation of the spectral-temporal log-powers in each frequency bin. These estimates of resting baseline cortical activity were subsequently used to normalize power estimates during model training and BCI operation.
Training task
Training data was collected across four sessions (six training blocks in total) spanning 15 days (Fig. 2a). For each block, the participant was instructed to attempt a brief grasp with his right hand (i.e., contralateral to the implanted arrays) in response to visual cues (Supplementary Fig. 1). Due to the participant’s severe upper extremity impairments, his attempted movements primarily involved flexion of the middle and ring fingers. After each attempt, the participant released his grasp and passively allowed his hand to return to its resting position hanging from the wrist at the end of his chair’s armrest.
Each trial of the training task consisted of a single 100 ms “Go” stimulus prompting the participant to attempt a grasp, followed by an interstimulus interval (ISI), during which the participant remained still and fixated his gaze on a crosshair in the center of the monitor. Previous experiments using longer cues had resulted in more variable response latencies and durations. The length of each ISI was randomly chosen to vary uniformly between a lower and upper bound to reduce anticipatory behavior. The experimental parameters across all training sessions are shown in Supplementary Table 1. In total, almost 44 min of data (260 trials) was collected for model training.
Data collection
Neural signals were recorded by the Neuroport system at a sampling rate of 1 kHz. BCI2000 was used to present stimuli during training blocks and to store the data for offline analysis20. Video of the participant’s right hand (which was overtly attempting the grasp movements) and of the monitor displaying the spelling application was recorded at 30 frames per second (FPS) during all spelling sessions except the last two, which were recorded at 60 FPS. A 150 ms synchronization audio cue was played at the beginning of each spelling block (see Real-time switch-scanning) so that the audio recorded by the Neuroport biopotential system’s analog input could be used offline to synchronize the video frames with the neural data. A pose estimation algorithm21 was applied offline to the hand video to infer the horizontal and vertical positions of 21 hand and finger landmarks within each video frame. The horizontal coordinates of the metacarpal-phalangeal (MCP) joint landmarks for the first and fifth digits were used to normalize the horizontal positions of all landmarks, while the MCP and fingertip coordinates of the same digits were used to normalize the vertical positions.
Feature extraction and label assignment
For each of the 128 recording channels, we used a Fast Fourier Transform (FFT) filter to compute the spectral power of 256 ms windows shifted by 100 ms increments. The spectral power in each frequency bin was log-transformed and normalized to the corresponding calibration statistics. We summed the spectral power in the frequency band between 110 and 170 Hz to compute our high-gamma (HG) power. We chose this lower bound of the frequency band because post-movement low frequency activity sometimes extended to 100 Hz in several channels (Supplementary Fig. 2). This resulted in a 128-channel feature vector that was used in subsequent model training.
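The windowing, normalization, and band-summing steps above can be sketched as follows (a simplified NumPy illustration; function and variable names are ours, not the study software’s):

```python
import numpy as np

FS = 1000            # sampling rate (Hz)
WIN = 256            # window length in samples (256 ms at 1 kHz)
STEP = 100           # window shift in samples (100 ms)
HG_BAND = (110, 170) # high-gamma band (Hz)

def hg_features(ecog, cal_mean, cal_std):
    """Sketch of the described feature extraction.

    ecog     : (n_samples, 128) raw signal
    cal_mean : (n_bins, 128) mean log-power from the calibration period
    cal_std  : (n_bins, 128) std of log-power from the calibration period
    Returns a (n_windows, 128) array of summed, normalized HG log-power.
    """
    freqs = np.fft.rfftfreq(WIN, d=1.0 / FS)
    band = (freqs >= HG_BAND[0]) & (freqs <= HG_BAND[1])
    feats = []
    for start in range(0, ecog.shape[0] - WIN + 1, STEP):
        seg = ecog[start:start + WIN]                   # (WIN, 128)
        power = np.abs(np.fft.rfft(seg, axis=0)) ** 2   # (n_bins, 128)
        logp = np.log(power + 1e-12)                    # log-transform
        z = (logp - cal_mean) / cal_std                 # normalize per bin
        feats.append(z[band].sum(axis=0))               # sum 110-170 Hz
    return np.asarray(feats)
```

Each output row is one 128-channel feature vector, produced every 100 ms.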
After computing each channel’s trial-aligned HG power (-1 s to 2.5 s post-cue), we accounted for the inter-trial variability due to reaction delay by re-aligning each trial’s HG power using a subset of highly activated channels22. This resulted in generally increased HG power correlations between trials (Supplementary Figs. 3–5). We visually determined the onset and offset of the re-aligned trial-averaged HG power from the channels used for re-alignment (Supplementary Fig. 6). The average neural activity onset and offset were manually estimated from the aligned neural data to be roughly 0.2 s and 1.2 s post-cue, respectively, with neural activity more clearly differentiating from rest activity starting at 0.3 s post-cue and ending at 1.1 s post-cue. We consequently assigned grasp labels to ECoG feature vectors falling between 0.3 s and 1.1 s post-cue for each trial, and rest labels to all other feature vectors. Since this overall strategy relies only on the visual inspection of neural signals, we believe it to be compatible with reduced availability of ground truth signals, like movement, as might be the case in locked-in participants.
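One plausible way to implement such re-alignment is by cross-correlating each trial against the trial-averaged template from a highly activated channel (a sketch under our own assumptions; the cited method22 may differ in detail):

```python
import numpy as np

def realign_trials(trials, max_shift=5):
    """Cross-correlation re-alignment sketch.

    trials : (n_trials, n_time) HG power from one highly activated channel
    Returns shifted copies of each trial aligned to the trial average.
    """
    template = trials.mean(axis=0)
    aligned = np.empty_like(trials)
    for i, tr in enumerate(trials):
        # pick the lag in [-max_shift, max_shift] maximizing correlation
        lags = range(-max_shift, max_shift + 1)
        best = max(lags, key=lambda k: np.dot(np.roll(tr, k), template))
        aligned[i] = np.roll(tr, best)
    return aligned
```

The per-trial lags found here would then be applied to all channels of that trial before averaging.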
Model architecture and training
We designed a recurrent neural network in a many-to-one configuration to learn changes in HG power over sequences of 1 s (Supplementary Fig. 7). Each 128-channel HG power vector was input into a long short-term memory (LSTM) layer with 25 hidden units for modelling sequential dependencies. Two consecutive fully connected (FC) layers with 10 and 2 hidden units, respectively, then determined the probabilities of the rest and grasp classes. The former utilized an ELU activation function while the latter employed a softmax to output normalized probability values. In total, the architecture consisted of 17,932 trainable parameters and was trained on a balanced dataset of rest and attempted grasping sequences, obtained by randomly downsampling the overrepresented rest class.
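A minimal Keras sketch of the described architecture follows (layer sizes are taken from the text; other settings, such as exactly where dropout is applied, are our assumptions, so this sketch’s trainable-parameter count may differ from the reported total):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(seq_len=10, n_channels=128):
    """Many-to-one LSTM classifier sketch: 1 s of HG vectors -> rest/grasp."""
    model = models.Sequential([
        layers.Input(shape=(seq_len, n_channels)),  # 10 x 100 ms of features
        layers.LSTM(25, dropout=0.3),               # sequential dependencies
        layers.Dense(10, activation="elu",
                     kernel_initializer="he_normal"),
        layers.Dropout(0.3),
        layers.Dense(2, activation="softmax",       # rest vs. grasp
                     kernel_initializer="he_normal"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training would then call `model.fit` on balanced batches of 45 labeled sequences, as described below.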
We determined the model’s hyperparameters by evaluating the model’s offline accuracy using 10-fold cross-validation on the data collected for training (see Cross-validation). For each cross-validated model, we limited training to 75 epochs, by which point classification accuracy on the validation fold had plateaued. We used categorical cross-entropy to compute the error between true and predicted labels of each 45-sample batch and updated the weights using adaptive moment estimation (Adam optimizer)23. To prevent overfitting on the training data, we used 30% dropout of weights in the LSTM and FC layers. All weights were initialized according to a He normal distribution24. The model was implemented in Python 3.8 using Keras with a TensorFlow backend (v2.8.0).
Real-time pipeline
Pipeline structure
We used ezmsg, a Python-based messaging architecture (https://github.com/iscoe/ezmsg)25, to create a directed acyclic graph of processing units, in which all pre-processing, classification, and post-processing steps were partitioned.
Real-time pre-processing
Neural data was streamed in intervals of 100 ms via a ZeroMQ connection from BCI200020 to our real-time pipeline, which was hosted on a separate machine dedicated to real-time inference. Incoming data updated a running 256 ms buffer, from which a 128-channel feature vector of HG power was then computed as described above (Figs. 1c and 1d). This feature vector was stored in a running buffer of 10 feature vectors (Fig. 1e), which represented 1 s of feature history for our LSTM input (Fig. 1f).
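The running 1 s feature history can be sketched as a simple class (names are ours; a simplified stand-in for the ezmsg units used in the actual pipeline):

```python
from collections import deque
import numpy as np

class FeatureHistory:
    """Running history of 128-channel HG feature vectors.

    A new feature vector arrives every 100 ms; once `depth` vectors have
    accumulated, window() yields the (1, depth, n_channels) LSTM input.
    """
    def __init__(self, depth=10, n_channels=128):
        self.buf = deque(maxlen=depth)
        self.n_channels = n_channels

    def push(self, feat):
        assert feat.shape == (self.n_channels,)
        self.buf.append(feat)              # oldest vector drops out

    def window(self):
        if len(self.buf) < self.buf.maxlen:
            return None                    # not enough history yet
        return np.stack(self.buf)[np.newaxis]  # (1, 10, 128)
```

Each 100 ms update pushes one new feature vector and evicts the oldest, so consecutive LSTM inputs overlap by 900 ms.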
Classification and post-processing
A rest or grasp classification was generated every 100 ms by the final FC layer and entered a running buffer of classifications. This buffer was our voting window, which contained a pre-determined number of classifications (10 and 7 for the medical communication board and the spelling interface, respectively); a given number of those classifications (the voting threshold) were required to be grasp in order to initiate a click (Fig. 1g). The voting window and threshold were applied to prevent sporadic grasp classifications from being interpreted as an intention to execute a click. A click triggered selection of the participant’s desired row or column in the switch-scanning application (Fig. 1h).
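The voting window, threshold, and subsequent lock-out period can be sketched as follows (a simplified illustration; whether the vote buffer is cleared after a click is our assumption):

```python
from collections import deque

class ClickDetector:
    """Voting-window click logic sketch.

    Every 100 ms a classification ('grasp' or 'rest') is pushed. A click
    fires when at least `threshold` of the last `window` classifications
    are 'grasp'; clicks are then suppressed for `lockout_s` seconds.
    Defaults match the final spelling settings (7-vote window, 4-vote
    threshold); the medical communication board used window=10, threshold=10.
    """
    def __init__(self, window=7, threshold=4, lockout_s=1.0, step_s=0.1):
        self.votes = deque(maxlen=window)
        self.threshold = threshold
        self.lockout_steps = int(lockout_s / step_s)
        self.cooldown = 0

    def push(self, label):
        self.votes.append(label)
        if self.cooldown > 0:
            self.cooldown -= 1          # lock-out: no clicks allowed
            return False
        if self.votes.count("grasp") >= self.threshold:
            self.cooldown = self.lockout_steps
            self.votes.clear()          # assumption: reset votes after click
            return True
        return False
```

With a 4-vote threshold and 100 ms steps, at least four consecutive grasp classifications (≥400 ms) must accumulate before a click can fire.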
Switch-scanning applications
A switch-scanning application is an augmentative and alternative communication (AAC) technology that allows users with severe motor or cognitive impairments to navigate to and select icons or letters by timing their clicks to the desired row or column during periods in which rows or columns are sequentially highlighted26–32. The participant generated a click by attempting a brief grasping movement as described in Training task.
Medical communication board
As a preliminary assessment of our model’s sensitivity and false positive detections, we first cued our participant to navigate to and select keys with graphical symbols from a medical communication board (Supplementary Fig. 8). Graphical symbols were obtained from https://communicationboard.io/. We used a 10-vote voting window with a 10-vote threshold (all 10 classifications within the running voting window needed to be grasp to initiate a click) and set our row and column scan rates to 1.5 per s. Finally, we enforced a lock-out period of 1 s, during which no other clicks could be produced, after clicking on a row or a button within a row (Fig. 1g). This prevented multiple clicks from being produced by the same attempted grasp.
Spelling application
We then developed a switch-scanning spelling application in which the participant was prompted to spell sentences (Supplementary Fig. 9). The buttons within the spelling interface were arranged in a grid design that included a center keyboard as well as autocomplete options for both letters and words. Letter and word autocompletion options were generated by a distilBERT language model33 hosted on a separate server, providing inference through an API. The distilBERT model was chosen over larger language models for its faster inference speed. We added three pre-selection rows at the beginning of each switch-scanning cycle as well as one pre-selection column at the beginning of each column-scanning cycle. These allowed the participant a brief preparation time if he desired to select the first row, or the first column within a selected row. We initially used a 7-vote voting window with a 7-vote threshold, which decreased the latency from attempted grasp onset to click (see Click latencies) compared to the settings used with the medical communication board. However, after several sessions of spelling and feedback from the participant, we reduced the voting threshold to 4 votes (any 4 of the 7 classifications within the running voting window needed to be grasp to initiate a click). We again enforced a lock-out period of 1 s.
Real-time switch-scanning
Using the communication board, the participant was instructed to navigate to and select one of the keys verbally cued by the experimenter. If the participant selected the incorrect row, the cued key was changed to be in that row. Once a key was selected, the switch-scanning cycle would start anew (Supplementary Video 1, Supplementary Fig. 8).
To test real-time spelling performance using our click detector, the participant was required to type out sentences by using the switch-scanning spelling application. The sentences were sampled from the Harvard sentence corpus34 and were presented at the top of the speller in faded gray text. If the participant accidentally clicked a wrong key, resulting in an incorrect letter or autocompleted word, the corresponding output text would be highlighted in red. The participant was then required to delete it using the DEL or A-DEL (auto-delete) keys respectively. Once the participant completed a sentence, he advanced to the next one by clicking the ENTER key (Supplementary Video 2, Supplementary Fig. 9). A spelling block consisted of 3–4 sentences to complete, and in each session the participant completed 1–6 spelling blocks (Fig. 2b).
Performance evaluation
Sensitivity and click rates
Sensitivity was measured as the percentage of correctly detected clicks:
$$\mathrm{Sensitivity}=\frac{N_{\mathrm{true\ clicks}}}{N_{\mathrm{attempted\ grasps}}}\times 100\%$$
where, for one session, \(N_{\mathrm{true\ clicks}}\) was the total number of correct clicks and \(N_{\mathrm{attempted\ grasps}}\) was the total number of attempted grasps, with \(N_{\mathrm{true\ clicks}}\le N_{\mathrm{attempted\ grasps}}\). For a detected click to be correct (i.e., a true positive), it had to have occurred on the user interface (as visual feedback to the participant) within 1.5 s after the onset of an attempted grasp. Attempted grasps with no clicks occurring within this time period were considered false negatives. Clicks that occurred outside this time period were assumed to be unrelated to any attempted grasp and were thus considered false positives. True positive and false positive frequencies (TPF and FPF, respectively) were measured per unit time and for each session were defined as the following:
$$TPF=\frac{N_{TP}}{T}=\frac{N_{\mathrm{true\ clicks}}}{T},\qquad FPF=\frac{N_{FP}}{T}$$
where \({N}_{TP}\) and \({N}_{FP}\) are the number of true and false positives in a session respectively, and \(T\) is the total spelling time for that session. Whether the participant clicked the correct or incorrect key had no bearing on sensitivity, TPF, or FPF as these metrics depended only on whether a click truly occurred following an attempted grasp.
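The scoring logic above can be sketched as follows (a simplified illustration; how each click is matched to at most one preceding grasp onset is our assumption):

```python
def click_metrics(grasp_onsets, click_times, total_time_s, max_lag=1.5):
    """Score clicks against attempted-grasp onset times (all in seconds).

    A click is a true positive if it falls within max_lag s after an
    as-yet-unmatched grasp onset; otherwise it is a false positive.
    Returns (sensitivity_pct, TPF_per_s, FPF_per_s).
    """
    matched = set()
    tp = fp = 0
    for c in sorted(click_times):
        hits = [g for g in grasp_onsets
                if 0 <= c - g <= max_lag and g not in matched]
        if hits:
            matched.add(max(hits))   # match the latest qualifying onset
            tp += 1
        else:
            fp += 1
    sensitivity = 100.0 * tp / len(grasp_onsets)
    return sensitivity, tp / total_time_s, fp / total_time_s
```

Grasp onsets left unmatched are the false negatives implied by the sensitivity value.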
Click latencies
Movement onsets and offsets were determined from the normalized pose-estimated landmark trajectories of the hand. Specifically, only the landmarks of the fingers with significant movement during the attempted grasp were considered. Then, for each attempted grasp, movement onset and offset times were visually estimated.
For each correctly detected attempted grasp, we computed both: a) the time elapsed between movement onset and algorithm detection, and b) the time elapsed between movement onset and the click appearing on the spelling application’s user interface. The latency to algorithm detection was primarily composed of the time necessary to reach the voting threshold (e.g., a 4-vote threshold produced at least 400 ms of latency when four consecutive classifications were grasp, since classifications were generated every 100 ms). The latency to the on-screen click depended on the algorithm detection latency along with additional network and computational overhead necessary for displaying the click.
Spelling rates
Spelling rates were measured by correct characters per minute (CCPM) and correct words per minute (CWPM). Spelled characters and words were correct if they exactly matched their positions in the prompted sentence. For example, if the participant spelled a sentence with 30 characters (5 words) with 1 character typo, only 29 characters (4 words) contributed to the CCPM (CWPM). Note that all spelling was performed with assistance of autocompletion options from the language model.
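For illustration, these rates can be computed as follows (a simplified sketch that counts position-wise character matches and exact position-wise word matches; the study’s handling of insertions and deletions may differ):

```python
def correct_rates(prompt, typed, minutes):
    """Return (CCPM, CWPM) for one completed sentence.

    prompt  : the cued sentence
    typed   : the participant's final output text
    minutes : time spent spelling the sentence, in minutes
    """
    # characters correct only if they match the same position in the prompt
    chars = sum(p == t for p, t in zip(prompt, typed))
    # words correct only if they exactly match the word at that position
    words = sum(p == t for p, t in zip(prompt.split(), typed.split()))
    return chars / minutes, words / minutes
```

Per the example above, one character typo in a 30-character, 5-word sentence leaves 29 characters and 4 words counting toward CCPM and CWPM.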
Cross-validation
We partitioned our training data into 10 folds such that each fold contained an equal number of rest and grasp samples of HG power feature vectors (rest samples were randomly downsampled to match the number of grasp samples). To minimize data leakage of time dependent data into the validation fold, all samples within a fold were contiguous and each sample belonged to only one fold. Each fold was used once for validation and a corresponding cross-validated model was trained on the remaining 9 folds.
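The fold construction can be sketched as follows (a simplified illustration showing the rest-class downsampling and the contiguous, disjoint split; the additional bookkeeping needed to guarantee an exactly equal class balance within every fold is omitted):

```python
import numpy as np

def contiguous_folds(labels, n_folds=10, seed=0):
    """Split balanced samples into contiguous, disjoint folds.

    labels : 1-D array of 0 (rest) / 1 (grasp), in temporal order.
    Returns a list of n_folds index arrays covering the kept samples.
    """
    rng = np.random.default_rng(seed)
    grasp_idx = np.flatnonzero(labels == 1)
    # randomly downsample rest to match the grasp count
    rest_idx = np.sort(rng.choice(np.flatnonzero(labels == 0),
                                  size=grasp_idx.size, replace=False))
    keep = np.sort(np.concatenate([grasp_idx, rest_idx]))
    # contiguous blocks of kept samples; each sample in exactly one fold
    return np.array_split(keep, n_folds)
```

Keeping each fold temporally contiguous limits leakage of time-dependent structure between training and validation folds.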
Channel contributions and offline classification comparisons
Using the subset of samples in the training data labeled as grasp, we computed each channel’s importance to generating a grasp classification given our model architecture. Specifically, we computed the integrated gradients from 10 cross-validated models (see Cross-validation) with respect to the input features from each sample labeled as grasp in the corresponding validation folds. This generated an attribution map for each sample35, from which we calculated the L2-norm across all 10 historical time feature vectors2, resulting in a 1x128 saliency vector. Due to the random initialization of weights in the RNN-FC network, models trained on features from the same set of folds were not guaranteed to converge to one set of final weights. We therefore retrained the set of 10 cross-validated models 20 times and similarly recomputed the saliency vectors for each sample. The final saliency map was computed by averaging the attribution maps across all repeated samples and normalizing the resulting mean values between 0 and 1. We repeated this process using HG features from all channels except one (channel 112), and again using features from a subset of 12 electrodes over the cortical hand knob (anatomically determined as channels 92, 93, 94, 100, 101, 102, 108, 109, 110, 116, 117, 118; Fig. 4e, Supplementary Figs. 2, 3). Neither of these two model configurations was deployed for real-time BCI use.
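The aggregation of per-sample attribution maps into one channel saliency vector can be sketched as follows (a simplified illustration; computing the integrated gradients themselves is omitted):

```python
import numpy as np

def saliency_map(attributions):
    """Collapse attribution maps into a normalized channel saliency vector.

    attributions : (n_samples, 10, 128) integrated-gradient maps, one per
                   grasp-labeled sample (pooled across folds/repetitions)
    Returns a (128,) saliency vector scaled to [0, 1].
    """
    # L2-norm across the 10 historical time steps -> (n_samples, 128)
    per_sample = np.linalg.norm(attributions, axis=1)
    mean = per_sample.mean(axis=0)        # average over all samples
    return (mean - mean.min()) / (mean.max() - mean.min())
```

The same reduction applies whether the pooled samples come from one training run or from the 20 repeated retrainings.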
To assess whether models trained with HG features from these smaller subsets of channels could retain robust click performance, we computed offline classification accuracies using 10-fold cross-validation (see Cross-validation). We repeated cross-validation such that for each of the 10 validation folds a set of 20 accuracy values was produced. We then averaged these 20 values to obtain a final accuracy for each fold. For each subset of channels, a confusion matrix was generated using the true and predicted labels across all validation folds and all repetitions.
Statistics and Reproducibility
Statistical analysis
Spelling blocks with a specific voting threshold were collected in no more than nine sessions. Given this small sample size, we could not assume normality in the distribution of the sample mean of any of the performance metrics (sensitivity, TPF, FPF, latencies, CCPM, CWPM). We therefore used the non-parametric Wilcoxon rank-sum test to determine whether there were significant differences between performance metrics from spelling blocks in which different voting thresholds were applied. A P-value less than 0.05 was considered significant. Similarly, we used the Wilcoxon rank-sum test to determine whether there were significant differences in offline classification accuracies when different configurations of channels were used for model training and validation. We additionally used a Holm-Bonferroni correction to adjust for multiple comparisons.
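The statistical procedure can be sketched as follows (SciPy’s rank-sum test plus a hand-rolled Holm-Bonferroni step-down adjustment; the data values shown are illustrative only):

```python
import numpy as np
from scipy.stats import ranksums

def holm_adjust(pvals):
    """Holm-Bonferroni step-down adjustment of a family of p-values."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    m = p.size
    adj = np.empty(m)
    running_max = 0.0
    for rank, i in enumerate(order):
        # multiply the k-th smallest p-value by (m - k), keep monotone
        running_max = max(running_max, (m - rank) * p[i])
        adj[i] = min(1.0, running_max)
    return adj

# illustrative comparison of one metric between two voting thresholds
blocks_7vote = [1.2, 0.9, 1.1, 1.4]
blocks_4vote = [0.4, 0.5, 0.6, 0.3]
stat, p = ranksums(blocks_7vote, blocks_4vote)
```

The raw rank-sum p-values for the family of metric comparisons would then be passed through `holm_adjust` before testing against 0.05.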
Reproducibility of experiments
Neural data collection and processing as well as decoder performance were reproducible across sessions as the participant was able to repeatedly demonstrate click control using neural signals from attempted hand movements to spell sentences. However, as this study reports on the first and only participant in this trial so far, further work will be necessary to test the reproducibility of these results in other participants.