This study was based on the BEAT-PD DREAM challenge dataset, under the auspices of the Michael J. Fox Foundation for Parkinson’s Research and Sage Bionetworks 35. The dataset was sourced from two different studies, the CIS-PD and the REAL-PD (see Methods), and consisted of 28 patients (16 males, 12 females), aged 62.1 ± 9.5 years (mean ± STD). Both studies collected mobile sensor data from patients with PD as they went about their daily lives. Each session consisted of the kinematic signals (CIS: accelerometer; REAL: accelerometer and gyroscope) and the patients' subjective scores of their PD symptoms on a severity scale ranging from 0 to 4. The symptoms included the on-off medication state (where a score of 0 refers to the fully ON state, and 4 to fully OFF) and the severity of dyskinesia and tremor. Not all symptom scores were available for all subjects, such that each symptom category contained a different subset of the subjects (on-off: 22 subjects; dyskinesia: 16 subjects; tremor: 19 subjects). The goal of the challenge was to generate a model capable of predicting the patients' subjective assessments of their symptoms from the kinematic data. The challenge was separated into three sub-challenges, corresponding to the three symptom categories included in this study, which involved predicting the severity of on-off, dyskinesia, and tremor, respectively.
Predictions in the BEAT-PD challenge were evaluated by the weighted mean square error (wMSE) score. Due to the wide range in the number of sessions scored by each subject (Fig. 1A), the MSE was first calculated for each subject separately:
$$\left(1\right) {MSE}_{k}=\frac{1}{{n}_{k}}{\sum }_{i=1}^{{n}_{k}}{({y}_{i,k}-{\widehat{y}}_{i,k})}^{2}$$
where \({n}_{k}\) corresponded to the number of test sessions for subject \(k\), \({y}_{i,k}\) was the ith rated score for subject \(k\), and \({\widehat{y}}_{i,k}\) was the corresponding predicted score. The subject-specific MSEs were then combined by weighting by the square root of the number of each subject's sessions to generate the wMSE score:
$$\left(2\right) wMSE=\frac{{\sum }_{k=1}^{N}\sqrt{{n}_{k}}{MSE}_{k}}{{\sum }_{k=1}^{N}\sqrt{{n}_{k}}}$$
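For concreteness, the scoring metric of Eqs. (1)–(2) can be sketched in a few lines of Python; the function and variable names below are illustrative, not the challenge's official scoring code:

```python
import numpy as np

def weighted_mse(y_true_per_subject, y_pred_per_subject):
    """wMSE over per-subject arrays of rated and predicted scores (Eqs. 1-2)."""
    mses, weights = [], []
    for y_true, y_pred in zip(y_true_per_subject, y_pred_per_subject):
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        mses.append(np.mean((y_true - y_pred) ** 2))  # Eq. 1: per-subject MSE
        weights.append(np.sqrt(len(y_true)))          # sqrt(n_k) weighting
    return np.dot(weights, mses) / np.sum(weights)    # Eq. 2: weighted average

# Example: two subjects with different numbers of rated test sessions
score = weighted_mse([[0, 1, 2, 1], [3, 4]], [[0.5, 1.0, 1.5, 1.0], [2.5, 3.5]])
```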
The approach presented in this manuscript (team “HaProzdor”) won second place in the on-off sub-challenge and fourth place in the dyskinesia and tremor sub-challenges.
In this section, we describe our solution to the challenge. First, we present the dataset, with its limitations and difficulties. Then, we describe our modeling approach, with the rationale for each of the stages in the process. Finally, we discuss the objective performance of our model, and its performance compared to the other winning models.
Data exploration
The distribution of scores for each symptom was skewed towards lower scores, and the scores of all the sub-datasets in both the CIS-PD and the REAL-PD were highly unbalanced (Fig. 1B). Moreover, different patients displayed very different distributions of subjective symptoms (Fig. 1C). These symptoms are interrelated: tremor is expected to occur during the off period, which should lead to a positive correlation between these two scores, whereas tremor (which is characteristic of reduced dopamine levels) and dyskinesia (which is characteristic of excess dopamine levels) do not typically occur together. However, the correlations between the scores did not follow these expectations, since both the tremor-on-off and the tremor-dyskinesia scores displayed mostly positive correlations (Fig. 1D). On the single-subject level, the correlations between the symptom scores varied greatly. For example, subject 1004 exhibited highly positive correlations between all symptom scores (Fig. 1D, red circle); this subject gave the exact same score to all categories in numerous sessions. By contrast, subject 1023 rated different symptoms in an uncorrelated manner (Fig. 1D, blue circle).
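As a minimal illustration, the per-subject correlations of Fig. 1D could be computed as follows, assuming a hypothetical DataFrame `scores` with one row per session; the column names are assumptions:

```python
import pandas as pd

def per_subject_correlations(scores: pd.DataFrame) -> dict:
    """Pearson correlation matrix between symptom scores, one per subject."""
    symptom_cols = ["on_off", "dyskinesia", "tremor"]  # assumed column names
    return {subject: group[symptom_cols].corr()        # 3x3 correlation matrix
            for subject, group in scores.groupby("subject_id")}
```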
The kinematic data were collected passively throughout the course of the subjects' daily lives. Hence, a single session could contain intermingled periods of different dynamic and static activities (Fig. 2A). These periods were not pre-categorized into different behaviors; however, some actions were easy to identify. These included periods of static activity, which appear as an almost flat accelerometer signal, and walking, which has a typical pattern of 1–2 Hz oscillations. Although the kinematic signals during different behaviors could sometimes be differentiated, signals recorded during sessions with significant differences in their symptom ranking could exhibit similar behavior patterns, thus blurring the distinction between the sessions (Fig. 2B-C). To test the use of “canonical” knowledge about different symptoms, we focused on the tremor category, whose kinematic properties are known. Tremor is typically associated with 4–6 Hz oscillations36, which typically appear as a peak in this frequency band during rest (i.e., when the lower-frequency components typical of movement are small). The tri-axial raw accelerometer signals were merged using the Euclidean norm of the three single-axis signals minus 1 (ENMO) 37.
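A minimal sketch of the ENMO computation is given below; truncating negative values to zero follows the common ENMO convention and is an assumption here:

```python
import numpy as np

def enmo(acc_xyz: np.ndarray) -> np.ndarray:
    """Euclidean norm minus one of a tri-axial accelerometer signal.

    acc_xyz: array of shape (n_samples, 3), in units of g.
    """
    norm = np.linalg.norm(acc_xyz, axis=1)  # Euclidean norm of the three axes
    return np.clip(norm - 1.0, 0.0, None)   # subtract gravity; truncate negatives
```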
The power spectral density (PSD) of the ENMO in sessions with different tremor scores was compared. Some high-tremor sessions exhibited a peak at 4–6 Hz (Fig. 3A, right), whereas other sessions with a high tremor score had no corresponding peak in their PSD (Fig. 3A, middle). Moreover, some sessions ranked by this subject as having no tremor displayed an unexpected peak in the 4–6 Hz band (Fig. 3A, left). Consequently, we measured the tremor oscillatory fraction of the session, which we defined as the fraction of time in the session in which the ratio between the power in the 4–6 Hz (tremor) band and the surrounding 1–9 Hz band exceeded a threshold of 0.7 (Fig. 3B). The oscillatory fraction was then compared to the tremor score in the session (Fig. 3C), and for each subject, we measured the overall correlation between the tremor scores of all their sessions and the corresponding oscillatory fractions (Fig. 3D). Unexpectedly, most of the subjects had low correlations, with some subjects (8 of 19, 42%) displaying negative correlations, suggesting that at least in some cases the reported scores represented the overall state as perceived by the subject rather than the severity of the specific symptom.
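The oscillatory-fraction computation could be sketched as follows; the 4-second window length and the use of a standard spectrogram are assumptions rather than the exact implementation:

```python
import numpy as np
from scipy.signal import spectrogram

def tremor_oscillatory_fraction(enmo_signal, fs, threshold=0.7):
    """Fraction of windows whose 4-6 Hz power exceeds `threshold` of 1-9 Hz power."""
    f, t, Sxx = spectrogram(enmo_signal, fs=fs, nperseg=int(4 * fs))
    tremor_power = Sxx[(f >= 4) & (f <= 6)].sum(axis=0)    # 4-6 Hz power per window
    broad_power = Sxx[(f >= 1) & (f <= 9)].sum(axis=0)     # 1-9 Hz power per window
    ratio = tremor_power / np.maximum(broad_power, 1e-12)  # guard against zero power
    return float(np.mean(ratio > threshold))               # fraction of tremor-dominated time
```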
The preliminary stage of exploration of the raw data highlighted important points that guided the way we developed the predictive models. First, the diversity of subjective scoring between patients emphasized the need to formulate personalized models for each subject, rather than a single model for the whole cohort. Second, given the lack of stable correlations or consistency between the scores for different symptoms, we generated independent models for each symptom. Third, the scores of the sessions could not be interpreted simply based on tailor-made features derived from the raw kinematic signals; thus, the extraction of complex logical features, such as the tremor oscillatory fraction, was not sufficient for the current dataset. Since simpler features (e.g., range, standard deviation, and mean values of the signal) are meaningless when extracted from a whole session, we divided each session into shorter segments prior to the extraction of a large collection of simple features as the input to the model. In addition, because the data were obtained during free behavior, there were major changes in movement patterns as the subjects performed various tasks throughout the sessions; thus, extracting features from semi-stable short segments better reflected individual activities. During the long 20-minute sessions, symptoms could wax and wane, such that short stretches of a symptom could affect the final rating ascribed by the subject. This resulted in feature values that differed throughout the session, which were crucial to the overall prediction. To account for this, our initial preprocessing step involved splitting each session into 10-second segments with an overlap of 5 seconds (see Methods; different segment sizes and overlaps were tested). Fourth, the kinematic data were segmented such that the symptom report fell in the middle of the session (at approximately 10 minutes). After the report time, the session continued for another 10 minutes, during which the severity of the symptoms could change. Given that the second half of the session might provide misleading information about the session score, most of our models only used the first half of the session.
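The segmentation step itself is straightforward; a minimal sketch with illustrative names, using the 10-second segments and 5-second overlap described above, is:

```python
import numpy as np

def segment_session(signal: np.ndarray, fs: float,
                    seg_sec: float = 10.0, overlap_sec: float = 5.0):
    """Yield overlapping segments from a (n_samples, n_channels) session signal."""
    seg_len = int(seg_sec * fs)
    step = int((seg_sec - overlap_sec) * fs)  # hop of 5 s gives a 5 s overlap
    for start in range(0, len(signal) - seg_len + 1, step):
        yield signal[start:start + seg_len]
```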
Overview of the experimental design for predicting symptom severity
The general framework of our analysis is shown in Fig. 4, which illustrates the sequential process, including the nested verification. Sessions were first segmented into short overlapping segments. A large set of features was calculated for each of the segments using temporal, spectral, and dimensionality-reduced information, and the top-ranked features were selected for the model. An ensemble of different variants of random forest (RF) classifiers and regressors was applied to the segment features using multiple training/test splits. The best model in each split was used to predict the scores of the test data, and a final prediction score was calculated as the average of all the split predictions. Finally, each subject's predictability was assessed, and the test scores of subjects with low predictability were replaced by the subject's naïve mean score.
Feature selection
A large set of features was constructed from each segment (CIS-PD: 368 features; REAL-PD: 469 features; see Methods). This large number of features could have led to an overfitting problem, which would have degraded the performance of the model 21,38. Thus, we performed a feature selection process to identify the subset of features utilized by the model (see Methods). The top selected features for individual subjects varied across symptoms (Fig. 5A). This is because a feature may be predictive for one category, in terms of distinguishing between different scores (Fig. 5B, left), whereas for a different category, this same feature will not necessarily differentiate scores well (Fig. 5B, right). Thus, to obtain predictive features, we carried out the selection process separately for each symptom. We next examined commonly selected features across subjects for the same symptom category. We ran the feature selection process for each subject over multiple training/test splits. The ranking of all features for a given symptom was similar between all splits of the same subject, such that the same features tended to be ranked highly over the different sub-datasets (Fig. 5C), and there was a substantial overlap between the top selected features in each (Fig. 5D). By contrast, the ranking of the features varied between subjects (Fig. 5E), and the overlap between the top selected features was significantly smaller (Fig. 5F). For example, the mean percentage of common tremor features between the ten splits of each CIS-PD subject (71 ± 15%, mean ± STD) was significantly higher than the mean percentage of common tremor features between all pairs of CIS-PD subjects (59 ± 10%, mean ± STD, p < 0.001, Mann-Whitney U test). Therefore, as the base method, we conducted the selection process for each subject separately.
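As an illustration of a per-subject, per-symptom feature ranking, the sketch below uses random forest impurity importances as the ranking criterion; this particular criterion is an assumption, and the selection procedure actually used is described in the Methods:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rank_features(X, y, feature_names, n_top=50, seed=0):
    """Return the n_top features ranked by RF importance.

    X: (n_segments, n_features) segment features;
    y: the session score assigned to each segment.
    """
    rf = RandomForestRegressor(n_estimators=200, random_state=seed)
    rf.fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]  # descending importance
    return [feature_names[i] for i in order[:n_top]]
```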
Ensemble-based approach to symptom severity prediction
The symptom score prediction approach consisted of integrating the predictions of multiple RF models that differed in their hyperparameters. The optimal model was chosen based on the assessment of different variants of the hyperparameters. Each subject/symptom data combination was split into 32 different training/test splits, and on each of the training splits, a set of RF models differing in their hyperparameters was trained (see Methods). The performance criterion was the fraction of splits in which a given variant was the best performing model, pooled over all subjects (Fig. 6A). The results showed that the "best-combination" model was the most frequently best performing model in all categories, but constituted only 30% of the cases. Analyzing the best performing models in individual cases revealed different distributions of selected models for different subjects (Fig. 6B), suggesting that there was no single optimal model that could be used for prediction; rather, the optimal model depended on the specific subject/symptom combination. However, selecting a single model based on the available training data may lead to overfitting and does not guarantee optimal generalization on the test data. Thus, we combined the predictions of the 32 best performing models, one from each train/test split, into an ensemble. The final prediction was calculated by averaging the ensemble predictions for each of the sessions, followed by a mean adjustment (Fig. 6C). We assumed that using multiple model types would outperform single-type classifiers. However, a post-challenge analysis revealed that using the same single-type model in all the ensembles was not necessarily worse than using multiple-type models (Fig. 6D). Although the differences in performance were minor, in some cases a single specific model could outperform multiple-type models, probably due to reduced overfitting.
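The ensemble logic can be sketched as follows; the hyperparameter grid and the validation split proportion are assumptions, while the 32 splits and the averaging of each split's best model follow the text (the mean adjustment step is omitted):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit

# Assumed hyperparameter variants; the actual grid is described in the Methods.
VARIANTS = [dict(n_estimators=n, max_depth=d) for n in (100, 300) for d in (5, None)]

def ensemble_predict(X, y, X_test, n_splits=32, seed=0):
    """Average the test predictions of the best RF variant from each split."""
    preds = []
    splitter = ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=seed)
    for fit_idx, val_idx in splitter.split(X):
        best_model, best_mse = None, np.inf
        for params in VARIANTS:  # train every variant on this split
            rf = RandomForestRegressor(random_state=seed, **params).fit(X[fit_idx], y[fit_idx])
            mse = mean_squared_error(y[val_idx], rf.predict(X[val_idx]))
            if mse < best_mse:
                best_model, best_mse = rf, mse
        preds.append(best_model.predict(X_test))  # keep this split's best model
    return np.mean(preds, axis=0)  # average over the 32 best models
```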
Cherry-picking: assessing the predictability of individual subjects
Validation of our models using nested 5-fold cross-validation (see Methods) indicated that the predictability of some subjects was low (Fig. 7A-C). Subjects whose scores could not be predicted reliably were thus defined as naïve subjects, and their final score prediction was replaced by the naïve mean. The proportion of naïve subjects varied across symptoms (on-off: 16/22, dyskinesia: 8/16, tremor: 8/19), and different subsets of naïve subjects were found in each of the categories (on-off: 1006, 1044, hbv013, hbv051, hbv077, hbv043; dyskinesia: 1007, 1023, 1034, 1043, 1044, 1048, hbv054, hbv018; tremor: 1007, 1023, 1034, 1046, 1048, hbv013, hbv054, hbv012). However, despite the division into naïve and non-naïve subjects, the mean improvement of the model over the naïve MSE for non-naïve subjects was small (on-off: 0.08 ± 0.07, dyskinesia: 0.07 ± 0.06, tremor: 0.07 ± 0.06, mean ± STD).
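A simplified sketch of this replacement rule is shown below; comparing the cross-validated model MSE directly against the subject's naïve MSE is a simplification of the nested 5-fold procedure described in the Methods:

```python
import numpy as np

def finalize_predictions(cv_model_mse, y_train, model_test_preds):
    """Replace a poorly predictable subject's predictions by the naïve mean."""
    y_train = np.asarray(y_train, float)
    naive_mean = y_train.mean()                       # subject's mean training score
    naive_mse = np.mean((y_train - naive_mean) ** 2)  # MSE of the naïve predictor
    if cv_model_mse >= naive_mse:                     # subject deemed "naïve"
        return np.full(len(model_test_preds), naive_mean)
    return np.asarray(model_test_preds, float)
```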
BEAT-PD DREAM challenge results
Forty-three different teams submitted their predictions to this challenge. The teams were ranked according to their wMSE scores (Fig. 8A) and compared to the null model (on-off: 0.967, dyskinesia: 0.4373, tremor: 0.4399). Our submission was awarded second place in the on-off sub-challenge (wMSE = 0.8793), and fourth place in the dyskinesia (wMSE = 0.4205) and tremor (wMSE = 0.404) sub-challenges. Although the final wMSE scores of the top-performing teams (HaProzdor, dbmi, ROC BEAT-PD, yuanfang.guan, hecky and Problem Solver) were very close, the correlations between the test predictions of these groups were low (Supplementary Table 1). Although our models were among the top performers in this challenge in terms of low wMSE scores, the variance explained by the models for each subject in each of the sub-challenges was low (Fig. 8B).