A Comparative Study of Objective Outcome Measures Used in Clinical Trials of Freezing of Gait

Background: Freezing of gait (FOG) is notoriously difficult to quantify, which has led to multiple metrics being used as outcomes in clinical trials. The instrumented timed up and go (TUG) and the many parameters that can be derived from it are commonly used as objective markers of gait severity in FOG trials; however, it is unknown whether they represent FOG severity. Objective: To determine the specificity and responsiveness of objective surrogate markers of FOG severity commonly utilized in FOG studies. Methods: Markers compared included velocity, step/stride length, step/stride length variability, TUG, and turn duration. Data were collected in four conditions (ON and OFF dopaminergic drugs, with and without a dual task). The Unified Parkinson's Disease Rating Scale (UPDRS) was administered in the ON and OFF states. Results: 33 subjects were recruited: 17 subjects with Parkinson's disease (PD) without FOG (PD-control) and 16 subjects with PD and dopa-responsive FOG (PD-FOG). The UPDRS motor scores were: 24.9 for the PD-control group in the ON state, 24.8 for the FOG group in the ON state, and 42.4 for the FOG group in the OFF state. Significant mean differences between the ON and OFF conditions were observed with all surrogate markers (p<0.01). However, only dual task turn duration and step variability showed trends toward significance when comparing PD-control and ON-FOG (p=0.08). Test-retest reliability was high (ICC >0.90) for all markers except standard deviations. Step length variability was the only marker with an area under the ROC curve >0.70 when comparing ON-FOG vs. PD-control. Conclusions: Multiple candidate surrogate markers for FOG severity showed responsiveness to levodopa challenge; however, most were not specific for FOG severity.


Background
Freezing of gait (FOG) is a debilitating condition occurring in the majority of patients with Parkinson's disease (PD) [1][2][3], for which there is no effective therapy. It is defined as the episodic inability to walk, often triggered by environmental factors [4]. A major barrier to therapeutic development in FOG is the lack of validated, objective outcome measures of FOG severity [5]. Measures like the Freezing of Gait Questionnaire [6] (FOG-Q) are limited by their subjective nature and cannot be repeated in one session, since they are meant to be retrospective over a period of one month. The new FOG-Q has recently been found to be unreliable and not responsive to small effect sizes [7]. Measures that rely on capturing a FOG episode in the laboratory (direct measures) [8][9][10][11][12] are limited by the inherent variability of each episode; a captured episode may therefore not be representative of overall FOG severity. Furthermore, approaches to reliably trigger an episode have not been established. Long-term continuous monitoring approaches [12,13] are ideal since they capture FOG severity over a period of days or weeks, accounting for the variability of individual episodes; however, they cannot be repeated in one session (since they must be administered over a long term) and require the analysis of large amounts of data. Surrogate markers of FOG severity present an option for therapeutic trials, since they are objective assessments, are easy to administer, can be administered multiple times in one session, and do not depend on triggering an episode of FOG; however, their specificity for FOG severity has not been determined. This type of marker is particularly useful for dose-finding studies and for determining the immediate effects of therapeutic interventions, e.g. neuromodulation therapies, which require testing of multiple variables to optimize.
For these reasons, clinical trials of therapies for FOG have utilized multiple candidate surrogate markers, including instrumented timed up and go (TUG), turn duration, velocity, dual task interference, and step length variability [14][15][16][17]. However, it is not clear which (if any) of these markers best represent FOG severity, or whether they are responsive to the interventions being tested. This impairs our ability to interpret these studies and provides little guidance for future study design.
We selected markers that have been commonly utilized as surrogate markers of FOG severity in previous studies (velocity, step length, step length variability, dual task interference [13,14,18,17], and turn duration [19][20][21]). To determine which markers (if any) are most appropriate as outcomes in future clinical trials aiming to improve FOG severity, we were interested primarily in whether each marker was specific for FOG and responsive to intervention. To determine the specificity of each marker for FOG, we selected a group of patients with FOG and a control group of PD patients without FOG (with otherwise similar motor severity to the PD-FOG group) and tested the ability of each marker to differentiate between the groups. To determine responsiveness, we ensured each of the FOG patients selected had a clear dopa-response and compared the ability of each marker to differentiate between the OFF and ON medication states.

Subjects:
Subjects (ages 18-80) who met UK Brain Bank criteria for idiopathic PD (Hoehn and Yahr stage 2-4) were enrolled in the study. Subjects with a score of zero on question one of the new freezing of gait questionnaire [22] (nFOGQ) and on item 14 of the UPDRS part 2 were enrolled into the PD-control group.
Subjects with a score of 1 on question one of the nFOGQ were enrolled in the FOG group. To ensure subjects in the FOG group had dopa-responsive FOG, an improvement of at least one point on item 14 of the Unified Parkinson's Disease Rating Scale (UPDRS) from the OFF to the ON state was required. In addition, each subject was observed to have FOG at screening, confirmed through multiple comprehensive clinical evaluations by a movement disorder neurologist (GJR) in the ON and OFF states. Subjects who exhibited FOG on any common trigger (initiation, turning, upon reaching destination, or on straightaway walking) or any phenomenological subtype of FOG (akinetic, knee trembling) were included in the FOG group. Subjects with a mini-mental status examination score of < 26, who were unable to walk 30 feet unassisted in the OFF state, or who had any other significant gait impairment (festination, or major orthopedic disturbance affecting gait) were excluded from the study. The Institutional Review Board of the Medical University of South Carolina approved the study. All participants provided written informed consent to take part in this study. The datasets generated during the current study are available from the corresponding author upon request. between serial 7's and every other letter of the alphabet. Spatiotemporal data were collected and averaged from four trials over the GaitRite walkway. Specifically, participants were asked to stand up, walk over the GaitRite mat, step off the GaitRite onto the M2 walkway, turn around a cone set at the center of the M2 (54 inches to the center of the cone from the leading edge of the M2/GaitRite interface), and walk back to the chair (see Fig. 1). The instructions for the walking task were identical to those commonly used during the TUG [23]. Participants were instructed to rise from the chair, walk the length of the GAITRite® mat and around the cone in the center of the M2 mat, and walk back down the GAITRite® to the chair and sit down.
The turn was 180 degrees, and the diameter of the turn was limited only by the 48" width (the lateral boundaries) of the M2 mat. Each subject performed the turn in their preferred direction. Participants were not required to pre-select their direction of turn, and were not mandated to turn in either or both directions.
This protocol yielded two walking periods on the GaitRite per trial, and one turn duration trial. Two trials were completed in each condition (ON levodopa: single and dual task, OFF levodopa: single and dual task). The average and standard deviation (SD) were estimated for each side (left and right) from a total of four walking passes in each condition (each trial producing two data sets on the GaitRite, one departing from and another returning to the chair).
Step and stride length coefficients of variability (CV) were calculated from the standard deviation of each parameter (again, a total of four passes on the GaitRite were used to calculate CV) as previously described [24]. The turn task (mean time to turn) was calculated as the time from the moment the individual stepped off the end of the GaitRite and onto the M2 walkway to the end of the final footfall leaving the M2 and returning to the GaitRite. The distance from the end of the GaitRite to the cone was kept constant for all participants. The difference for each spatiotemporal parameter with and without a concurrent cognitive task was calculated and labelled dual task interference (e.g. the measured step length without a dual task was subtracted from the measured step length with a dual task to generate step length dual task interference). If subjects experienced a freezing episode during a walking trial, accurate spatiotemporal data could not always be obtained. For those trials, manual step identification was attempted to include as many steps as possible in each trial. Timed data (TUG and turn duration) included the duration of FOG episodes when they occurred. This protocol is not designed to precipitate FOG episodes, or to directly measure the duration or severity of an individual episode, but rather to describe each marker's properties as an indirect surrogate of FOG severity.
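The two derived quantities described above can be sketched in a few lines. This is only an illustration under stated assumptions: CV is taken here as the conventional 100 × SD / mean (the exact procedure in [24] is not reproduced in this text), the function names are hypothetical, and the step lengths are invented example values, not study data.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Conventional CV in percent: 100 * sample SD / mean (assumed definition)."""
    return 100.0 * stdev(values) / mean(values)


def dual_task_interference(single_task_value, dual_task_value):
    """Dual-task value minus single-task value, as described in the text."""
    return dual_task_value - single_task_value


# Hypothetical step lengths (cm), NOT data from the study.
single_task_steps = [60.0, 62.0, 58.0, 60.0]
dual_task_steps = [50.0, 54.0, 46.0, 50.0]

cv_single = coefficient_of_variation(single_task_steps)
cv_dual = coefficient_of_variation(dual_task_steps)
step_length_dti = dual_task_interference(mean(single_task_steps),
                                         mean(dual_task_steps))

print(f"CV single task: {cv_single:.2f}%")
print(f"CV dual task:   {cv_dual:.2f}%")
print(f"Step length dual task interference: {step_length_dti:.1f} cm")
```

With this sign convention, a negative interference value for step length means steps shortened under the dual task.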

Statistical Analysis:
Turn duration under the dual task condition was pre-specified as the primary parameter of interest, as it had been utilized effectively in a previous clinical trial for FOG [14]. Sample size was estimated based on the ability of turn duration to distinguish between severity groups. A prior study [14] found that the mean duration of the turn task was 31 s for PD freezers versus 2.7 s for PD non-freezers (SD = 25). Assuming a similar group difference and standard deviation when assessed under dual task, a two-sample t-test has 85% power with n = 15 patients in each group and two-sided alpha = 0.05. Test-retest reliability was calculated for each spatiotemporal parameter for the 3-4 trials on a single visit using the intraclass correlation coefficient (ICC) reliability for the mean of k ratings (SAS %INTRACC macro). For each spatiotemporal parameter, Wilcoxon rank-sum tests were used to compare FOG patients with PD-control subjects. Similarly, Wilcoxon signed-rank tests were used to determine whether there were differences in levodopa response within FOG patients (tested under the ON and OFF conditions, respectively). The statistical significance level was set at alpha = 0.05 for all comparisons. These analyses are intended purely to demonstrate the measurement properties of the spatiotemporal parameters by examining the extent to which the means differ in the expected fashion between groups that are known to be different (ON-FOG, OFF-FOG, and PD-control).
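As a rough check of the sample size reasoning, the power of a two-sample comparison can be approximated with the normal distribution. This sketch is an assumption-laden stand-in, not the authors' calculation: it uses a normal approximation rather than the noncentral t distribution, so it slightly overstates power relative to the t-test figure quoted above, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power_normal_approx(mean_diff, sd, n_per_group, alpha=0.05):
    """Approximate two-sided power for comparing two equal-size groups.

    Normal approximation (assumption): treats the test statistic as
    normally distributed instead of noncentral t.
    """
    z = NormalDist()
    effect_size = mean_diff / sd                   # standardized difference
    shift = effect_size * sqrt(n_per_group / 2.0)  # expected test statistic
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)          # two-sided critical value
    # Probability of rejecting in either tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Inputs taken from the text: 31 s vs 2.7 s, SD = 25, n = 15 per group.
power = two_sample_power_normal_approx(31.0 - 2.7, 25.0, 15)
print(f"Approximate power: {power:.2f}")
```

The approximation lands near 0.87, in the same range as the 85% reported for the t-test, which is the expected direction of the discrepancy for a sample this small.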
Area under the receiver operating characteristic curve (AUC) analysis was performed as a measure of responsiveness (or the ability to distinguish one group from another) for each spatiotemporal parameter. This was done by fitting a series of logistic models with PD-control versus PD-FOG as the response, with a separate model for each levodopa condition (ON/OFF). Similarly, a logistic model with a random effect for subject was fit with the ON/OFF condition as the response (PROC GLIMMIX). AUC values of 0.70 or higher are generally considered adequate to demonstrate that a measure is able to distinguish one group from another [25].
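For a single marker, the AUC has a simple interpretation that can be computed directly: it is the probability that a randomly chosen subject from one group has a higher marker value than a randomly chosen subject from the other group. The sketch below computes that quantity by pairwise comparison. It is an illustration only: it does not reproduce the SAS logistic or random-effects models used in the study, the function name is hypothetical, and the values are invented. For a logistic model with one predictor, the AUC of the fitted probabilities matches this value when the marker is higher in the case group, because the logistic transform preserves rank order.

```python
def auc_from_groups(case_values, control_values):
    """Probability that a random case exceeds a random control.

    Ties count as half a win. This is the Mann-Whitney formulation of AUC.
    """
    wins = 0.0
    for case in case_values:
        for control in control_values:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


# Hypothetical step length CV values (%), NOT data from the study.
fog_group = [5.1, 6.3, 4.8, 7.0]
control_group = [4.0, 5.1, 3.9, 4.5]

auc = auc_from_groups(fog_group, control_group)
print(f"AUC: {auc:.3f}")
print("Meets the 0.70 threshold" if auc >= 0.70 else "Below the 0.70 threshold")
```

An AUC of 0.5 corresponds to no discrimination between groups, and values below 0.5 indicate the marker is lower, not higher, in the case group.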
The mean UPDRS part III (motor) scores were: 24.8 ± 10.4 for the PD-control group in the ON condition, 24.2 ± 9.1 for the FOG group in the ON condition, and 42.4 ± 8.6 for the FOG group in the OFF condition. The UPDRS part II, item 14 FOG scores (a subjective measure of FOG severity) were: 0 ± 0 for the PD-control group, 0.8 ± 0.7 for the FOG group in the ON condition, and 2.6 ± 0.6 for the FOG group in the OFF condition (severe FOG severity level). The mean nFOGQ score was 17.8 ± 5.5 in the FOG group and 0 in the PD-control group.

Test-retest reliability
Test-retest reliability of the spatiotemporal parameters under a single type of condition (i.e. single or dual task) was high (ICC > 0.90) for all measures except the standard deviation (SD) measures (e.g. step length standard deviation, left). ICC was poor (< 0.50) for the SD measures under the single task condition and fair under the dual task condition for the freezers in the ON state, the freezers in the OFF state, and the PD-control subjects. See Table 1.

Comparison of surrogate markers
The group means (or medians) were different for all spatiotemporal measures, with and without a dual task, between the PD-control and OFF-FOG groups and between the ON and OFF conditions within the FOG group. However, no differences in the means/medians were detected between the ON-FOG and PD-control groups, with only trends for dual task step CV and dual task turn duration. See Table 2. The dual task interference values for average step length and average stride length were significantly different between the PD-control and OFF-FOG groups, but no other group differences in the dual task interference metrics were detected.

Discussion
We report our findings on direct comparisons of commonly used outcome measures in FOG clinical trials. The study was designed to determine: 1) the specificity of each marker for FOG and 2) the responsiveness of each marker to an intervention. In addition, we investigated whether adding a dual task or calculating dual task interference changed the biometric properties of each marker or should be considered as a separate marker. The goal of our study was to provide objective data regarding the utility of each of these markers for clinical trials or behavioral association studies, in order to assist investigators in choosing the appropriate marker for the scientific question being asked. The findings of our study can inform future clinical trials investigating the effectiveness of novel interventions for FOG and can help interpret previous trials that have reported changes in these surrogate markers.
All of the surrogate markers studied were able to differentiate between ON and OFF, indicating responsiveness to levodopa challenge with and without a dual task. However, none of the markers studied were able to distinguish between the PD-control group and the FOG group when ON medications. These were two very similar groups (with very similar UPDRS scores) that differed only in that the FOG group had the underlying propensity for FOG behavior when in the OFF state. These findings imply that these markers are not specific for FOG; however, the rigorous design of this study, comparing very similar groups, should be taken into account when interpreting this finding. These markers may be used in clinical trials to study the magnitude of response to an intervention, but they may not represent a change in FOG severity itself. Turn duration and step CV in the dual task condition showed a trend toward significance when comparing the ON-FOG group and the PD-control group. Therefore, dual task turn duration and step CV should not be ruled out as proxies for FOG severity in cross-sectional studies or imaging-behavioral associations investigating the relationship of a specific finding to FOG, or as outcomes in clinical trials of a therapeutic intervention. Similar markers like stride time variability have been shown to correlate with overall disease severity [24] and have also been shown to be greater in patients with PD and FOG as compared to PD alone [26,27].
Turn duration is a very simple metric to obtain, and has been utilized effectively in clinical trials for FOG in the past [28]. Our finding that adding the dual task to multiple surrogate markers improves the biometric properties of the marker informs this and future studies when selecting markers of this condition. Curtze et al. found that turning measurements were the strongest correlates of disease severity as measured by the UPDRS in a large PD cohort with similar disease duration, although that study did not look at FOG [29]. It is important to note that although some patients may experience a FOG episode during turning (particularly at the severe FOG level), this setup (using a large turning space and a cone) is designed to minimize, not precipitate, a FOG episode, and each parameter's value is an average of at least two trials in each condition. Therefore, these results are independent of whether or not a FOG episode is triggered, and differ from studies of the turn condition designed to trigger a freezing episode and then quantify each episode individually. By understanding the biometric properties of markers of FOG severity that do not depend on eliciting a FOG episode, we can remove the inherent variability of the episode, presumably allowing a more consistent and representative assessment of FOG severity.
Furthermore, such a marker is inherently simple to capture and can be repeated in one session, making it ideal for same-day dose-finding studies or early-stage neuromodulation clinical trials. However, for most of the parameters derived from this approach, this comes at the cost of specificity for FOG.
Study limitations include our inability to determine which condition (ON or OFF) best indicates severity, since we were comparing each marker in the ON and OFF states. However, other studies have assessed turn measurements and have found the OFF condition to be superior [29]. We were powered to detect a difference between PD-controls and freezers, but not between ON and OFF freezers, or between ON freezers and PD-controls. Small sample size is also a limitation, and should be taken into consideration when interpreting p-values, especially trends. Therefore, non-significant differences or strong trends should not be discarded. Also, due to the design, we could not compare each marker's ability to differentiate between severity levels with that of the nFOGQ. This is because retrospective subjective questionnaires provide an overall assessment of severity over a period of time (usually weeks) and cannot be administered reliably to assess severity in the ON and OFF states. There was a small difference in age between the control and FOG groups (67.2 years for the control group and 64.3 years for the FOG group) and a significant difference in disease duration (5.2 years for the control group, 10.2 years for the FOG group). The disease duration difference is to be expected, as FOG occurs later in the disease course. Finally, this is not a validation study of any one surrogate marker, but our findings help identify the most appropriate markers to answer future scientific questions or to be used in clinical trials, and should lead to future validation studies of such markers.
Based on the findings of this comparative study of surrogate markers of FOG severity, we conclude that: 1) objective gait assessment can be a useful outcome measure in clinical trials and behavioral association studies, 2) dual task turn duration and dual task step CV are the most specific for FOG of the markers compared, and 3) velocity, step/stride length, and dual task turn duration are responsive to levodopa challenge. Further validation studies of these surrogate markers are warranted for their use as outcome measures in clinical trials.

Declarations
Ethics Approval and Consent to Participate: The institutional review board of the Medical University of South Carolina approved the study.

Consent for Publication:
Written patient consent was obtained and documented for all patients. This manuscript does not report individual data.
Availability of Data and Materials: The authors believe the data necessary for analysis and interpretation are provided in the manuscript; however, any further data found to be necessary can be made available upon request.