Characterization of reviewer peak selection performance across all PSDs
For this study, we evaluated the performance of seven independent expert reviewers in their subjective selection of the first and second (if identified) beta peak frequencies from fifteen PSD plots each across five participants (both left and right hemispheres). As a group, the distribution of consensus peaks derived from all 250 PSDs did not differ significantly from the peak selection distribution of any individual reviewer, as determined using a 2-sample Kolmogorov-Smirnov test (R1: p = 0.97, R2: p = 1, R3: p = 1, R4: p = 0.78, R5: p = 0.49, R6: p = 1, R7: p = 0.87; Fig. 3A). At a minimum, two factors could affect the quality of the derived consensus peak: 1) a subset of reviewers performed substantially worse at accurately classifying the consensus peak, or 2) a subset of PSDs was more challenging to classify across reviewers. These two factors are not mutually exclusive, and both could contribute to consensus peak accuracy. To assess them, we used an UpSet plot (Fig. 3B)49, a method for visualizing and quantifying the number of intersections between sets. Summarizing, for each reviewer, the number of PSDs in which that reviewer correctly identified the consensus beta peak highlighted two reviewers who made accurate selections on fewer than 70% of the PSDs (R7: 61%; R4: 69%), whereas the remaining reviewers made accurate selections on more than 80% of the PSDs. In the main summary across reviewer and PSD set intersections, all reviewers correctly identified the consensus beta peak on 35% of the PSDs. To determine whether a subset of PSDs was difficult to assess accurately across reviewers, we identified the participant of origin for PSDs in which fewer than four reviewers identified the consensus beta peak (Fig. 3C). Two participants each contributed approximately 25% of their PSDs to this low-consensus group (Participant 3 and Participant 5). Fewer problematic PSDs were observed for Participant 1 (2%), Participant 2 (14%), and Participant 4 (0%).
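A minimal sketch of this distribution comparison is shown below, assuming pooled arrays of first-peak frequencies for the consensus and for a single reviewer; the array names and NaN handling are illustrative assumptions, not the analysis code used in this study.

```python
import numpy as np
from scipy import stats

def compare_reviewer_to_consensus(consensus_peaks_hz: np.ndarray,
                                  reviewer_peaks_hz: np.ndarray) -> float:
    """Two-sample Kolmogorov-Smirnov test between the consensus peak
    distribution and one reviewer's peak selections; returns the p-value."""
    # Drop PSDs where the reviewer did not mark a peak (encoded here as NaN).
    picked = reviewer_peaks_hz[~np.isnan(reviewer_peaks_hz)]
    _, p_value = stats.ks_2samp(consensus_peaks_hz, picked)
    # A large p-value (e.g., R1: p = 0.97) indicates no detectable difference
    # between the reviewer's and consensus peak distributions.
    return p_value
```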
Characterization of algorithm peak selection performance across all PSDs
Following derivation of the expert-defined consensus beta peaks for all 250 PSDs, we sought to evaluate the accuracy of ten algorithms used in the literature for detecting PSD peaks. We did not impose any a priori restrictions on the maximum number of peaks that algorithms could detect. Furthermore, the algorithms were not configured to guarantee detection of at least one peak, meaning it was possible for an algorithm to fail to detect any peaks. Comparing the consensus-derived peak distribution (for the first peak detected) with that of each individual algorithm revealed that four algorithms (I, II, VI, X) produced peaks with distributions significantly different from the consensus (2-sample Kolmogorov-Smirnov test, p < 0.01). In contrast, six algorithms (III, IV, V, VII, VIII, IX) yielded peak distributions similar to the consensus (Fig. 4A). To evaluate algorithm accuracy in detecting the consensus peak, we adopted an approach similar to our assessment of reviewer performance: we visualized the performance of individual algorithms across PSDs and analyzed the overlap among algorithms that successfully identified the peak (Fig. 4B). Algorithms exhibited substantial variability in accuracy, shown in the lower horizontal bar chart in Fig. 4B, which summarizes individual algorithm accuracy across all PSDs. Three algorithms performed below 50% accuracy (I = 30%, VI = 47%, VIII = 48%), three performed between 50% and 75% accuracy (X = 64%, VII = 64%, V = 66%), and four performed at or above 75% accuracy (II = 75%, III = 75%, IV = 76%, and IX = 75%). The vertical bar chart of the UpSet visualization in Fig. 4B depicts the main summary across algorithm and PSD set intersections; the greater number of possible intersection sets resulted in greater diversity of set relationships in the plot. All ten algorithms correctly identified the consensus beta peak on only 11% of the PSDs. Following the approach used in the reviewer performance assessment, we evaluated whether a subset of PSDs was difficult to assess accurately across algorithms. For this analysis, we identified the participant of origin for PSDs in which fewer than five algorithms accurately identified the consensus beta peak (Fig. 4C). All participants contributed a subset of PSDs that were challenging for half or more of the algorithms. The following participant PSD contributions were observed: participant 5 was the most challenging (62% of PSDs, for which fewer than five algorithms matched the consensus peak), followed by participant 3 (38%); participants 1 and 2 exhibited moderate difficulty (20% and 24%, respectively), and participant 4 again exhibited the least difficulty across algorithms (4%). All PSDs in which fewer than five algorithms successfully detected the consensus peak are represented within the center of the donut chart (Fig. 4C). Algorithms also exhibited substantial variability in the number of peaks detected across PSDs (Fig. 4D); most algorithms predominantly detected two peaks per PSD.
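A minimal sketch of the per-algorithm accuracy and intersection summary that an UpSet-style analysis relies on is given below, assuming a boolean hit matrix of shape (n_algorithms, n_psds); the variable names and layout are assumptions rather than the authors' implementation.

```python
import numpy as np

def summarize_algorithm_hits(hits: np.ndarray) -> dict:
    """hits[a, p] is True when algorithm a matched the consensus peak on PSD p."""
    per_algorithm_accuracy = hits.mean(axis=1)            # horizontal bars (Fig. 4B)
    n_correct_per_psd = hits.sum(axis=0)                  # algorithms agreeing per PSD
    all_correct_fraction = float(np.mean(n_correct_per_psd == hits.shape[0]))  # ~11% here
    hard_psd_indices = np.where(n_correct_per_psd < 5)[0]  # matched by fewer than half
    return {
        "per_algorithm_accuracy": per_algorithm_accuracy,
        "all_correct_fraction": all_correct_fraction,
        "hard_psd_indices": hard_psd_indices,
    }
```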
Overall algorithm performance for detecting reviewer consensus beta power peaks
The main goal of the study was to assess algorithm performance in detecting expert-defined beta peaks in PSD representations of LFP data. We evaluated the performance of individual algorithms using two approaches. First, we used a Bland-Altman test to assess the agreement between the expert consensus peaks and the first peak identified by each algorithm. A t-test on the mean difference between the two measurement methods was used to determine whether it differed significantly from 0, which would indicate a systematic bias between the measures. In this case, a p-value below alpha (0.05) would suggest a possible systematic bias, indicating that those algorithms may be less reliable. Second, we quantified algorithm accuracy as the mean squared error (MSE) between the consensus peaks and the first peak derived by each algorithm. Representative comparisons of consensus and algorithm-detected peaks are shown for all five participants in Fig. 5 (the top row depicts PSDs from segments on the left hemisphere that exhibited the highest beta peak, and the bottom row depicts PSDs from levels on the right hemisphere that exhibited the highest beta peak; all PSDs were derived from monopolar-referenced recordings). The first peak detected was of greatest interest, as all reviewers selected up to two peaks, with the first designated as the main peak. For the first peak (Fig. 6A), five of the ten algorithms passed the Bland-Altman test (I, V, IX, III, and IV), while five failed (VII, VIII, II, X, VI). In general, passing the Bland-Altman test was associated with a lower MSE: four of the algorithms that passed exhibited the lowest MSEs. However, MSE could be influenced by the number of PSDs in which any peak was detected. For example, algorithm I, which failed to identify any peak for 67% of the PSDs, had the lowest MSE, making it the most conservative algorithm in peak detection. All other algorithms identified at least one peak in over 73% of the PSDs. We observed a similar phenomenon when assessing accuracy for the second peak (Fig. 6B). Only seven of the ten algorithms detected a second peak, and these detections occurred at a lower rate across PSDs (average of 26% of PSDs per algorithm). The increased stringency in second-peak detection resulted in lower MSEs, similar to the accuracy of algorithm I for first-peak classification.
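The two agreement measures described above can be summarized in a short sketch, assuming paired arrays of consensus and algorithm first-peak frequencies restricted to PSDs where the algorithm detected at least one peak; names are illustrative only, not the analysis code used in this study.

```python
import numpy as np
from scipy import stats

def bland_altman_bias(consensus_hz: np.ndarray, algorithm_hz: np.ndarray):
    """One-sample t-test on the paired differences; p >= 0.05 indicates no
    detectable systematic bias (i.e., 'passing' the Bland-Altman test)."""
    differences = algorithm_hz - consensus_hz
    _, p_value = stats.ttest_1samp(differences, popmean=0.0)
    return differences.mean(), p_value

def mean_squared_error(consensus_hz: np.ndarray, algorithm_hz: np.ndarray) -> float:
    """MSE between consensus and algorithm-derived peak frequencies."""
    return float(np.mean((algorithm_hz - consensus_hz) ** 2))
```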
The cohort of PD participants (n = 5; 10 hemispheres) used for this study was representative of the heterogeneity in PSD morphology (e.g., peak number, slope, peak magnitude). Despite the inherent variability in overall PSD structure, reviewers and algorithms were generally able to agree on the identified beta peak(s) in many cases. We directly evaluated participant-based variability in algorithm performance to determine whether specific participants exhibited PSD morphologies that were more challenging for beta peak detection. In Fig. 6C, participants are arranged from highest to lowest accuracy based on the average MSE across algorithms, revealing notable differences that reinforce the earlier results; for example, participant 4 exhibited less variability in MSE than participant 2. An ANOVA assessing participant MSE variability demonstrated a significant difference in accuracy between participants (F(4, 45) = 6.04, p = 0.0006). Post-hoc tests revealed that participant 2 differed significantly from participants 3 (p = 0.04), 4 (p = 0.0002), and 5 (p = 0.01).
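A minimal sketch of this between-participant comparison is shown below, assuming a mapping from participant ID to the ten per-algorithm MSE values; the specific post-hoc procedure used in the study is not restated here, so Tukey's HSD is shown only as one common choice.

```python
import numpy as np
from scipy import stats

def participant_mse_anova(mse_by_participant: dict):
    """One-way ANOVA across participants on per-algorithm MSE values,
    followed by pairwise post-hoc comparisons (Tukey's HSD as an example)."""
    groups = [np.asarray(v) for v in mse_by_participant.values()]
    f_stat, p_value = stats.f_oneway(*groups)   # e.g., F(4, 45) = 6.04, p = 0.0006
    posthoc = stats.tukey_hsd(*groups)          # pairwise participant comparisons
    return f_stat, p_value, posthoc
```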
In this study, PSD plots were generated using two referencing strategies: bipolar, which references between neighboring contacts on the same lead, and monopolar, which references between contacts on different leads. We use the term monopolar to reflect that this strategy allows isolation of activity from a single contact or set of contacts (i.e., combined segments) on an implanted DBS lead. We sought to determine whether the monopolar approach resulted in greater isolation of peak activity and, thereby, greater accuracy in peak identification. In Fig. 6D, we directly compared peak detection accuracy for PSDs derived from monopolar and bipolar contact configurations using a paired t-test. PSDs generated from monopolar-referenced contact configurations yielded greater accuracy than bipolar-referenced PSDs (monopolar mean MSE = 3.2; bipolar mean MSE = 7.8; p = 0.01).
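A minimal sketch of this referencing comparison, assuming paired arrays of MSE values for the same PSD sources under monopolar and bipolar referencing (the exact pairing structure is an assumption), is shown below.

```python
import numpy as np
from scipy import stats

def compare_referencing(monopolar_mse: np.ndarray, bipolar_mse: np.ndarray):
    """Paired t-test comparing peak-detection MSE under the two referencing
    strategies; returns the two means and the p-value."""
    _, p_value = stats.ttest_rel(monopolar_mse, bipolar_mse)
    return monopolar_mse.mean(), bipolar_mse.mean(), p_value  # e.g., 3.2 vs 7.8, p = 0.01
```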
Accuracy of top-performing algorithms in predicting the contact level of stimulation at 3-month follow-up
Finally, we evaluated whether the top-performing algorithms, ranked by accuracy in detecting the expert reviewer-derived consensus beta peak(s), would have clinical utility. To this end, we assessed how well the top three performing algorithms predicted the contact level of stimulation clinically determined and used at 3-month follow-up. We used three criteria to define the top-performing algorithms: 1) passing the Bland-Altman test (i.e., retaining the null hypothesis of no systematic bias), 2) detecting at least one peak in 75% of the PSDs, and 3) an MSE below 5, the median MSE across all ten algorithms. Per these criteria, three methods (III, V, and IX) were deemed most satisfactory. Of note, methods III and V were both algebraic methods in which a dynamic threshold derived from properties of the individual PSD was used to identify relevant peaks. Method IX employed the same algebraic thresholding approach as method III, albeit applied to a PSD from which the 1/f component had been subtracted; it was therefore classified as an elemental decomposition method, yet drew heavily from algebraic algorithms. As such, all three satisfactory methods overlapped in their use of an individualized, dynamic peak-amplitude thresholding procedure, pointing to a promising, reliable, and standardizable peak detection strategy. At 3-month follow-up, seven of the ten hemispheres were programmed either at a single level or, if programmed across levels, had greater than 50% of stimulation fractionated to one level; these seven hemispheres were used for the comparison. Comparing the level selected by each of the three top-performing algorithms based on maximum beta peak identification, we observed the following accuracies in predicting the contact level for each hemisphere at 3-month follow-up: V = 71%, IX = 86%, and III = 100%.
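The three selection criteria can be expressed as a simple filter over per-algorithm summary statistics, as in the sketch below; the dictionary keys and threshold encoding are assumptions, not the authors' data structures.

```python
def select_top_algorithms(algo_stats: dict) -> list:
    """algo_stats maps an algorithm label to its summary statistics."""
    selected = []
    for name, s in algo_stats.items():
        passes_bias_test = s["bland_altman_p"] >= 0.05  # criterion 1: no systematic bias
        detects_enough = s["detection_rate"] >= 0.75    # criterion 2: peak found in >= 75% of PSDs
        accurate_enough = s["mse"] < 5.0                # criterion 3: MSE below the median of 5
        if passes_bias_test and detects_enough and accurate_enough:
            selected.append(name)
    return selected  # in this study: methods III, V, and IX
```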