This is the first study presenting a comprehensive comparative assessment of a broad range of algorithms applied to a single WD, for estimating key DMOs pertaining to gait (i.e., gait sequences (GSD), individual steps (ICD), cadence (CAD) and stride length (SL)) in heterogeneous diseases and using data from the real world. Here, we have presented algorithm performances, selected the best algorithm for each DMO and cohort, analysed the influence of walking speed and walking bout duration on their performance, and provided recommendations for their selection and implementation for real-world gait analysis.
Gait Sequence Detection
In line with previous work (16, 38, 55), the overall concurrent validity of the GSD algorithms was good to excellent, as reflected by high ICC(2,1) values and performance measures that were all above 0.7. Accuracy, specificity, and positive predictive values were very high for all GSD algorithms. Our results were comparable to previous work on a different population (stroke survivors), which reported similar sensitivity (0.92) and positive predictive value (0.84) for GSD algorithms implemented on data obtained from bilateral WDs on the feet (16). The excellent results for specificity are similar to or even higher than those reported previously in the literature (0.96 in PD (55) and 0.93 in stroke survivors (16)). This is encouraging, as gait analysis relies on high specificity, which corresponds to the correct rejection of non-gait periods (high number of true negative events) while avoiding the misidentification of gait sequences (low number of false positive events). Avoiding incorrect identification of gait sequences (as also reflected by positive predictive values) is preferable, so that DMOs are not extracted from activities which are not directly representative of gait, such as shuffling or transitions (55).
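As a reminder of how the performance measures discussed above relate to one another, the following minimal sketch computes them from confusion counts. The counts are hypothetical and the unit being classified (e.g., a fixed-length window labelled gait or non-gait) is an assumption made for illustration, not a description of the evaluation pipeline used in this study.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return standard binary-detection metrics from confusion counts."""
    return {
        # Share of true gait units that were detected
        "sensitivity": tp / (tp + fn),
        # Share of non-gait units correctly rejected
        "specificity": tn / (tn + fp),
        # Share of detected units that really were gait
        "ppv": tp / (tp + fp),
        # Share of all units classified correctly
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts, chosen only to illustrate the formulas
metrics = detection_metrics(tp=80, fp=10, tn=900, fn=10)
```

Because non-gait periods usually dominate a real-world recording, specificity and accuracy can be very high even when sensitivity and positive predictive value are more modest, which is why the four measures are reported together.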
GSDA and GSDB tended to overestimate the total walking time (total gait sequence duration). This could potentially relate to different signal characteristics between the WD and the reference system (low-back signals recorded with the WD may be different from feet signals (51), as recorded with the INDIP). Slow gait, curved paths and short walking bouts with insufficient steady-pace phases for the spectral analysis could have also influenced the results, as the characteristics of the signals are more variable and the periods are less uniform than in steady-pace gait undertaken at faster speeds along straight paths (38).
Taking our findings together, we recommend using GSDB on cohorts with slower gait speeds and substantial gait impairments (e.g., PFF). This may be because this algorithm is based on the acceleration norm (the overall accelerometry signal rather than a specific axis/direction, e.g., vertical), hence it is more robust to the sensor misalignments that are common in unsupervised real-life settings (38). Moreover, the use of adaptive thresholds, which are derived from the features of a subject’s data and applied to the amplitude of the acceleration norm and to step duration for the detection of steps belonging to gait sequences, increases the robustness of the algorithm to irregular and unstable gait patterns. The GSDA algorithm may be more suitable for cohorts with a faster gait speed and a regular gait pattern (e.g., HA). This algorithm is based on a convolutional transformation (based on a gait cycle) of a single-axis signal (37), potentially explaining its suitability for conditions characterised by more stable and regular gait patterns.
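The robustness of a norm-based approach to sensor misalignment can be illustrated with a short sketch: rotating a tri-axial sample, as a tilted device would, changes the individual axis readings but not the norm. The sample values, the tilt angle and the choice of rotation axis are hypothetical; the sketch does not reproduce GSDB itself.

```python
import math


def acceleration_norm(sample):
    """Euclidean norm of one tri-axial acceleration sample (x, y, z)."""
    x, y, z = sample
    return math.sqrt(x * x + y * y + z * z)


def rotate_about_x(sample, angle_rad):
    """Rotate a sample about the x axis, mimicking a tilted sensor."""
    x, y, z = sample
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (x, c * y - s * z, s * y + c * z)


# Hypothetical reading in m/s^2, with z taken as the nominal vertical axis
aligned = (0.3, -0.2, 9.6)
tilted = rotate_about_x(aligned, math.radians(25))

norm_aligned = acceleration_norm(aligned)
norm_tilted = acceleration_norm(tilted)
```

Here the nominal vertical component of `tilted` differs clearly from that of `aligned`, whereas `norm_aligned` and `norm_tilted` agree up to floating-point rounding.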
Initial Contact Detection
Overall, all algorithms investigated for ICD presented excellent sensitivity and positive predictive values (all above 0.81) and relative errors below 21% in diverse cohorts of patients. These errors are in line with previous work, although slightly higher than those assessed in laboratory or controlled and supervised environments, which ranged between 4 and 13% (25, 36, 51). Positive predictive values were larger than sensitivity (although sensitivity values were > 0.75). This could be due to a lower number of false positive events (wrongly identified initial contact events) with respect to true positive events; slightly lower sensitivity measures reflect a higher number of missed initial contact events. Similar to GSD, higher positive predictive values (a higher proportion of correctly identified initial contacts among those detected) are preferable, as gait assessment based on incorrectly identified events could lead to invalid DMO extraction and misleading clinical interpretation. The low relative errors (< 11%) found for ICDA and ICDC for step duration across all cohorts are very encouraging and concur with previous work based on similar approaches, which reported errors between 4% and 13% for data collected in laboratory conditions (36, 56).
Accurate detection of steps is critical for the estimation of a plethora of DMOs, such as cadence, step symmetry and gait variability, which might have relevant clinical value (e.g., for the differentiation of stages of neurodegenerative diseases (56)). In addition, step detection can be used to refine the identification of gait sequences (38), and thus the definition of a WB, which highlights the importance of using a robust algorithm with high sensitivity and positive predictive value.
For all cohorts, we recommend the use of ICDA for the identification of initial contact events, given its lowest absolute and relative errors (both in mean and SD of step duration and in initial contact event time) and best performance indexes. ICDA is an optimised implementation of the algorithm based on continuous wavelet transform and peak detection originally presented in (39), and is frequently used and reported in the literature for heel-strike or initial contact event detection (36, 57). This algorithm has been previously validated under different conditions, producing similar results in algorithm performance (40), even though it was tested under less challenging conditions (such as supervised lab/clinical settings). To increase robustness to the variety of impaired gait patterns, ICDA applies additional detrending and filtering before the continuous wavelet transform, and then detects the step-related peaks as maxima between zero-crossings (instead of using a predefined threshold for peak amplitude).
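The final stage described above, picking one maximum per interval between zero-crossings rather than applying an amplitude threshold, can be sketched as follows. This is an illustrative simplification under stated assumptions: the function name and the input values are hypothetical, the input is assumed to be an already detrended, filtered and wavelet-transformed signal, and none of those preceding stages of ICDA is reproduced here.

```python
def peaks_between_zero_crossings(signal):
    """Return the index of the maximum of each positive lobe in `signal`.

    `signal` is a list of floats. A lobe is a run of consecutive strictly
    positive samples; it is closed by the first non-positive sample that
    follows it. A lobe still open at the end of the signal is discarded,
    because its true maximum may lie beyond the recording.
    """
    peaks = []
    start = None  # index where the current positive lobe began
    for i, value in enumerate(signal):
        if value > 0 and start is None:
            start = i
        elif value <= 0 and start is not None:
            lobe = signal[start:i]
            peaks.append(start + lobe.index(max(lobe)))
            start = None
    return peaks


# Hypothetical processed signal with a large lobe, a small lobe and an
# unfinished lobe at the end
processed = [-1.0, 2.0, 5.0, 3.0, -2.0, -1.0, 0.5, 0.2, -3.0, 4.0]
peak_indices = peaks_between_zero_crossings(processed)
```

Note that the small second lobe yields a peak just as the large first one does, which is the property that makes this approach independent of a predefined amplitude threshold.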
Cadence Estimation
The excellent performances of CAD algorithms, reflected by low relative errors of <12%, were in line with (17, 38, 41) or lower than previous results reported in the literature (13-14%) (16). Moderate to excellent ICC(2,1) (> 0.70) were found in all cohorts except PFF, especially for algorithms CADB and CADC. These results confirm the robustness of cadence estimation in all cohorts. PFF data showed the lowest ICC(2,1) values but good performances for the other metrics. This may be partially explained by the high asymmetry and the slow speed that characterise the PFF cohort (all PFF patients walked at a speed of <1.29 m/s) (58). This and the use of walking aids may have impacted the WD signal quality (amplitude and shape) and hence challenged the processing techniques on which the algorithms are based (i.e., wavelet transformations for CADA and CADB (38, 39), and zero-crossings for CADC (41)).
Overall, CADC performances were excellent across all cohorts, especially for groups with higher gait speeds. CADB was more robust in the PFF cohort as reflected by the performance index. Therefore, we suggest the implementation of this algorithm in cohorts with compromised gait speed and symmetry (e.g., severe or advanced neurological diseases) for which a zero-crossing approach may not be so suitable.
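To make the zero-crossing principle concrete, the sketch below estimates cadence from the positive-going zero-crossings of a signal that is assumed to have been filtered so that each step produces one oscillation about zero. It is not the CADC implementation: the function name, the sampling frequency and the synthetic sinusoidal signal are assumptions chosen for illustration.

```python
import math


def cadence_from_zero_crossings(signal, fs):
    """Estimate cadence (steps/min) from positive-going zero-crossings.

    Assumes one oscillation of `signal` per step. Returns None when fewer
    than two crossings are found.
    """
    crossing_times = []
    for i in range(1, len(signal)):
        if signal[i - 1] < 0 <= signal[i]:
            # Linear interpolation gives a sub-sample crossing time
            frac = -signal[i - 1] / (signal[i] - signal[i - 1])
            crossing_times.append((i - 1 + frac) / fs)
    if len(crossing_times) < 2:
        return None
    # Mean step duration = elapsed time / number of intervals
    n_intervals = len(crossing_times) - 1
    mean_step = (crossing_times[-1] - crossing_times[0]) / n_intervals
    return 60.0 / mean_step


fs = 100.0        # Hz, hypothetical sampling frequency
step_freq = 1.8   # Hz, hypothetical step frequency (108 steps/min)
synthetic = [math.sin(2 * math.pi * step_freq * n / fs - 0.5)
             for n in range(1000)]
cadence = cadence_from_zero_crossings(synthetic, fs)
```

On such a regular signal the estimate recovers the imposed step frequency closely. With asymmetric or very slow gait the one-oscillation-per-step assumption is more likely to break down, which is consistent with the lower suitability of a zero-crossing approach noted above for such cohorts.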
It is worth mentioning that the methodology for initial contact event/step detection used by the ICD and CAD algorithms includes two main stages. The first relates to the processing of the WD acceleration signal in order to remove noise and artefacts and to enhance the step-related features (e.g., zero-phase low-pass filtering, detrending). Then, on the processed acceleration signal, the initial contacts/steps are detected using peak detection or zero-crossing approaches. The combination of the various techniques for these two stages allowed us to implement optimized versions of state-of-the-art algorithms.
Although the ICD and CAD algorithms are based on similar approaches, our results are in line with previous findings showing that a peak detection approach may be more suitable for the identification of events (ICD), whereas zero-crossing techniques are more accurate in the identification of cyclic events and in step segmentation, as required for cadence estimation. All in all, as observed by Panebianco et al. 2018 (57), this underlines that each principle is better tailored to a specific DMO; i.e., a wavelet transformation with peak detection is better suited for the ICD metric, whereas the zero-crossing approach seems better suited for the CAD metric.
Stride Length Estimation
The SL algorithms showed the lowest performance among the DMOs assessed, as reflected by relatively high absolute and relative errors and low ICC(2,1). This could be due to the nature of the lower-back accelerometry signals recorded in real-world conditions, from which the stride length is calculated. In particular, the estimation of the position of the centre of mass (by double integration of the acceleration) and the inverted pendulum models on which stride length algorithms are based assume straight walking trajectories. Moreover, these methodological principles do not consider turns, non-straight walking trajectories (i.e., veering) and other deviations from a purely symmetrical and straight pattern, a pattern which is almost absent in real-world recordings (33).
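The geometry behind the inverted pendulum principle can be sketched as follows. In its classic form, the centre of mass is assumed to travel on a circular arc of radius equal to the leg length, so that the step length follows from the vertical excursion of the centre of mass within a step. The vertical excursion, leg length and unit correction factor below are hypothetical values; the exact formulation and correction factors used by SLA and SLB are not reproduced here, and the double integration that would yield the vertical excursion is omitted.

```python
import math


def step_length_inverted_pendulum(h, leg_length, correction=1.0):
    """Step length (m) from the vertical centre-of-mass excursion.

    h          -- vertical excursion of the centre of mass in one step (m)
    leg_length -- pendulum length (m)
    correction -- multiplicative correction factor (assumed 1.0 here)
    """
    return correction * 2.0 * math.sqrt(2.0 * leg_length * h - h * h)


# Hypothetical values chosen for illustration only
h = 0.03          # m
leg_length = 0.9  # m
step_length = step_length_inverted_pendulum(h, leg_length)

# Stride length as two consecutive steps, assuming symmetric gait
stride_length = 2.0 * step_length
```

The sketch makes the limitation discussed above explicit: the model has no term for heading change or left-right asymmetry, and because the step length depends on the square root of a small vertical excursion, modest errors in that excursion translate into sizeable errors in stride length.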
Among the four algorithms, we recommend SLA for all cohorts, given its lowest absolute and relative errors and highest ICC(2,1), as summarized by the performance indexes. It must be noted that SLB, which is based on the same algorithmic principle as SLA but uses a different correction factor to estimate stride length (43), was the best performer for the MS cohort. Nevertheless, SLA also performed well in the MS cohort, with results similar to those of SLB.
In general, SL algorithms tended to overestimate stride length by between 0.07 and 0.16 m; this could be due to the correction factors that are implemented in both SLA and SLB (17). Overall, the results highlight the better suitability of biomechanically-based algorithms over those based solely upon machine-learning approaches. This is in line with the results observed in a previous study which implemented the same algorithms, trained on the same pre-available datasets (17). This could be because the biomechanically-based algorithms are less dependent on the intensity and morphology of the acceleration signals, which are highly influenced by gait speed and by the irregularity of gait patterns (17). This highlights a potential limitation in the generalization of machine-learning based models when applied to external datasets. Future machine-learning/deep-learning models based on bigger datasets might produce better results.
Our results showed higher errors than those reported in previous studies: almost double with respect to (25), where results were evaluated from sensors on the shanks, and similar to (17), which showed RMSE between 0.04 and 0.18 m for data collected in the laboratory. This could potentially be due to the additional challenges involved in the real-world and uncontrolled gait assessments presented in the current study, and to the use of different data, i.e., based on a single sensor and on a different reference system for comparison. Moreover, to ensure a fair comparison of the algorithms, the WB (input) on which the algorithms were applied was defined and “imposed” by the reference system (INDIP). This could have led to higher errors stemming from applying the algorithms to a WD signal with reduced amplitude and noisier characteristics with respect to the signal identified by the INDIP (sensors on the feet), especially for short and slow WBs. All in all, our results highlight that future studies should focus on the development and optimisation of SL algorithms to increase the robustness of SL estimation, so that SL can be a useful (e.g., sensitive to change) DMO for clinical interventional studies.
Effect of walking speed and walking bout duration on algorithm performance
Generally, the performance of all algorithms significantly worsened for walking speeds below 0.5 m/s, which is considered a threshold between slow and medium speed walkers (2), confirming what is well established in the literature (17, 59, 60). This may be explained by the fact that the signals recorded with a WD in slow walkers are characterised by compromised amplitude, non-uniform gait cycles (60, 61), and variable and irregular gait patterns (17). Likewise, the lowest performance, observed within PFF, may be explained by the lower speed and irregular gait patterns of this cohort (58). Accordingly, the choice of algorithms for DMO extraction should consider their sensitivity to gait speed, given its proven confounding effect on gait analysis (62), and the population of interest.
Walking bout duration also significantly affected the performance of the cadence algorithms, with an overall significant reduction of the relative error observed for longer WBs when estimating both step duration and cadence. This trend was also likely magnified by the fact that the shortest bouts were also the slowest ones, and it confirms similar previous results (34). It could also be because the impact of breaks (starts and stops) and/or misdetected strides is much larger in short WBs than in longer WBs when quantifying algorithm performance.
Individual relative errors for stride length were higher for short WBs (e.g., < 10 s), although the median error did not seem to be significantly affected by bout length. DMOs estimated from short WBs, which have been reported to form the majority (about 50%) of WBs in real-world conditions (21, 63), deserve special consideration because, in agreement with previous work, these WBs were observed to be the slowest (63) and are therefore more prone to higher estimation errors.
General Discussion
The signals recorded at the lower back are less robust than those at other locations, such as the foot or shank, for the identification of initial contact events (57), although still more accurate than wrist data (64). However, the lower back is among the most clinically favourable locations for a single device, given its cost (one device), its position near the centre of mass (which represents the overall human motion pattern), its ergonomics when worn attached to a belt or affixed to the skin, and its clinical value for fall risk, trunk stability and balance control, among others (56, 65, 66).
An advantage of real-world gait monitoring is the possibility of capturing a large number of diverse walking bouts and truly unsupervised gait performance in an ecologically valid environment (20). However, the presence of contextual factors in a real-world context, which were not accounted for in this study, may have significantly influenced the performance of the algorithms. In particular, the presence of turns, deviations from a straight path or other gait tasks (e.g., slopes, presence of stairs and/or obstacles, crowdedness of space, visibility of trajectory), and the use of various walking aids may have altered the gait pattern of the participants (20) and may partially explain the larger errors observed for SL.
The results indicate that the temporal characteristics (initial contact events, step duration, cadence) of gait analysed with the proposed algorithms were more robust and valid than the spatial ones, which may be due to the fact that low-back signals are better tailored to estimate particular events in the signal (i.e., initial contact events) and to assess its periodicity (i.e., cadence estimation) than to estimate displacements. These aspects should be considered when using the proposed algorithms, especially when interpreting findings for clinical applications.
Limitations
It must be noted that a complete ranking methodology should not only consider the overall findings for each cohort (as in this study) but should also consider the performance of algorithms on stratified subgroups (e.g., based on gait speed: slow, medium and fast walkers). This can be done by assigning a higher weight to the slow walkers’ results, given that their corresponding signals are more challenging and yield higher errors, as observed in this study. In addition, the percentage of WBs, and of participants, for which the algorithm successfully provided DMO estimates should be considered to scale the overall performance of algorithms (23). Thus, the simplified, although comprehensive, implementation of the ranking methodology could be seen as a limitation of this study. Nonetheless, its purpose was to provide an overall recommendation on the algorithm that performed best for each DMO assessed in challenging real-world environments (20). We also suggest that the inclusion of laboratory assessments in the implementation of the ranking methodology could be relevant. Indeed, even if collected under controlled or semi-structured conditions, data from short and slow WBs, which are typical in lab-based settings, may add variability and challenge algorithm performance (19). In addition, the effect of walking aid use on the results has not been assessed in this study. Future work assessing this aspect could be clinically relevant, given the potential impact that walking aids (and the variety of types of walking aids) have on the quality of the WD signals and reference data (17) and, as a consequence, on the assessment of the algorithms’ performance.
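The stratified weighting suggested at the start of this section could take the form of a weighted mean of per-stratum errors, as in the minimal sketch below. The strata, weights and error values are entirely hypothetical and were chosen only to show that a ranking can change once slow walkers are weighted more heavily; they are not results of this study, and this is not the ranking methodology that was applied.

```python
def weighted_error(errors_by_stratum, weights):
    """Weighted mean of per-stratum relative errors."""
    total_weight = sum(weights[s] for s in errors_by_stratum)
    weighted_sum = sum(weights[s] * e for s, e in errors_by_stratum.items())
    return weighted_sum / total_weight


# Hypothetical weights: slow walkers count three times as much as fast ones
speed_weights = {"slow": 3.0, "medium": 2.0, "fast": 1.0}
equal_weights = {"slow": 1.0, "medium": 1.0, "fast": 1.0}

# Hypothetical relative errors for two algorithms
algo_a = {"slow": 0.20, "medium": 0.10, "fast": 0.05}
algo_b = {"slow": 0.12, "medium": 0.12, "fast": 0.12}

unweighted = {
    "A": weighted_error(algo_a, equal_weights),
    "B": weighted_error(algo_b, equal_weights),
}
weighted = {
    "A": weighted_error(algo_a, speed_weights),
    "B": weighted_error(algo_b, speed_weights),
}
```

With equal weights the hypothetical algorithm A has the lower mean error, whereas with the speed-based weights algorithm B does, illustrating why the choice of weighting should be reported alongside any ranking.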