Our ML framework, outlined in Fig. 1, is organized into distinct phases: data collection, data preprocessing focusing on segmentation, sampling techniques, and model evaluation. In the data collection phase, we employed the MONI insoles, depicted in Fig. 2(a), to capture toe-tapping data. This dataset comprises recordings from 11 PD patients (6 early-stage and 5 mid-late-stage) and 10 HC. Detailed demographics and characteristics of these participants are provided in Table I. Each MONI insole is equipped with two accelerometers, positioned at the first metatarsal ('T') and the heel area ('H'). These accelerometers operate at a sampling frequency of 150 Hz, as shown in Fig. 2(b). Participants were guided to perform two 10-second toe-tapping exercises. The first, termed "In-phase", required synchronized toe-tapping with the left and right legs. The second, "Out-phase", involved an alternating toe-tapping pattern in which the two legs were out of sync.
Table I. Subject Characteristics
| | Healthy control | Early stage | Mid-late stage |
|---|---|---|---|
| Age | 4 (61–70), 5 (71–80), 1 (over 80) | 1 (51–60), 3 (61–70), 1 (71–80), 1 (over 80) | 1 (51–60), 3 (71–80), 1 (over 80) |
| Sex* | 2 (M), 8 (F) | 4 (M), 1 (F) | 5 (M), 0 (F) |

*Male (M), Female (F)
A. Preprocessing
As illustrated in Fig. 1, our ML framework begins with data preprocessing, in which the accelerometer data is filtered using a 4th-order low-pass filter with a cutoff frequency of 12.5 Hz [19]. This cutoff is chosen to remove high-frequency disturbances while retaining the frequencies most indicative of PD symptoms. The data is then segmented into individual motion patterns by detecting lift-toe and drop-toe events, which mark the start and end of each toe-tapping movement. Each segment is represented as \(\left({x}_{T}, {y}_{T}, {z}_{T},{std}_{T},{x}_{H}, {y}_{H},{z}_{H},{std}_{H}\right)\), where \(x\), \(y\), and \(z\) denote the 3-axis acceleration data (orientation as shown in Fig. 2(c)), and \(std\) denotes the standard deviation across these three axes. The subscript \(T\) indicates the first metatarsal position, while \(H\) indicates the heel position.
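The filtering step described above can be sketched as follows. This is a minimal illustration only: the text specifies the order (4), the cutoff (12.5 Hz), and the sampling rate (150 Hz), but not the filter family, so a Butterworth design realized as two cascaded second-order sections is an assumption here, as is the single causal pass (no zero-phase filtering). The function name `butter4_lowpass` is hypothetical.

```python
import math

def butter4_lowpass(signal, cutoff_hz=12.5, fs_hz=150.0):
    """Causal 4th-order low-pass filter (assumed Butterworth) built from
    two cascaded second-order (biquad) sections."""
    w0 = 2.0 * math.pi * cutoff_hz / fs_hz
    cos_w0, sin_w0 = math.cos(w0), math.sin(w0)
    # Quality factors of the two sections of a 4th-order Butterworth filter.
    q_factors = (
        1.0 / (2.0 * math.cos(math.pi / 8.0)),
        1.0 / (2.0 * math.cos(3.0 * math.pi / 8.0)),
    )
    out = list(signal)
    for q in q_factors:
        alpha = sin_w0 / (2.0 * q)
        a0 = 1.0 + alpha
        # Normalized low-pass biquad coefficients (bilinear-transform design).
        b0 = (1.0 - cos_w0) / 2.0 / a0
        b1 = (1.0 - cos_w0) / a0
        b2 = b0
        a1 = -2.0 * cos_w0 / a0
        a2 = (1.0 - alpha) / a0
        x1 = x2 = y1 = y2 = 0.0
        filtered = []
        for x in out:
            # Direct-form I difference equation for one section.
            y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
            filtered.append(y)
            x2, x1 = x1, x
            y2, y1 = y1, y
        out = filtered
    return out

if __name__ == "__main__":
    fs = 150.0
    tapping = [math.sin(2.0 * math.pi * 2.0 * i / fs) for i in range(600)]
    smoothed = butter4_lowpass(tapping)
    print(len(smoothed))
```

Under these assumptions, a slow 2 Hz component passes almost unchanged, while content near the 75 Hz Nyquist limit is strongly attenuated.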
From this preprocessed data, we extract a total of 120 features: 15 features from each of the 8 axes \(\left({x}_{T}, {y}_{T}, {z}_{T},{std}_{T},{x}_{H}, {y}_{H},{z}_{H},{std}_{H}\right)\). According to [7, 20], these features have the potential to capture the nuanced movement characteristics affected by PD. They comprise (i) 10 highly ranked statistical features per axis in the time domain: Mean, Standard deviation (σ), Minimum (Min), Maximum (Max), Range, Root Mean Square (RMS), Skewness, Kurtosis, Zero Crossing Rate (ZCR), and Mean Crossing Rate (MCR); and (ii) 5 highly ranked features per axis in the frequency domain: Peak frequency, Cepstral coefficients (mean, max, and min coefficients), and Spectral entropy. We use the Fast Fourier Transform to transform the signal from the time domain to the frequency domain. Each feature is denoted by its position and axis. For example, the mean of the standard deviation axis at the first metatarsal position is \({Mean}_{{T}_{std}}\).
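As an illustration of the ten time-domain features listed above, the following sketch computes them for one axis of one segment. The exact definitions are assumptions, since the text does not give them: population (not sample) variance, non-excess kurtosis, and crossing rates normalized by the number of consecutive sample pairs. The helper names `time_domain_features` and `name_features` are hypothetical, and the frequency-domain features are not covered.

```python
import math

def time_domain_features(x):
    """Ten time-domain features for one axis of one segment (len(x) >= 2)."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)  # population std (assumption)
    rms = math.sqrt(sum(v * v for v in x) / n)
    # Standardized third and fourth moments; set to 0 for a constant signal.
    skew = sum((v - mean) ** 3 for v in x) / (n * std ** 3) if std > 0 else 0.0
    kurt = sum((v - mean) ** 4 for v in x) / (n * std ** 4) if std > 0 else 0.0

    def crossing_rate(level):
        # Fraction of consecutive sample pairs that straddle `level`.
        crossings = sum(1 for a, b in zip(x, x[1:]) if (a - level) * (b - level) < 0)
        return crossings / (n - 1)

    return {
        "Mean": mean, "Std": std, "Min": min(x), "Max": max(x),
        "Range": max(x) - min(x), "RMS": rms,
        "Skewness": skew, "Kurtosis": kurt,
        "ZCR": crossing_rate(0.0), "MCR": crossing_rate(mean),
    }

def name_features(features, position, axis):
    """Attach position and axis to each feature name, e.g. 'Mean_T_std'."""
    return {f"{name}_{position}_{axis}": value for name, value in features.items()}

if __name__ == "__main__":
    segment_axis = [0.2, -0.1, 0.4, -0.3, 0.1]
    print(name_features(time_domain_features(segment_axis), "T", "std"))
```

With 10 time-domain and 5 frequency-domain features on each of the 8 axes, the total is (10 + 5) × 8 = 120, which matches the feature count stated above.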
After establishing the features, we created four distinct datasets: the original imbalanced dataset and three others designed to address the imbalance. As illustrated in Fig. 1, we applied three sampling methods: Tomek-Links, SMOTE, and the combined SMOTE-Tomek. These methods either reduce the majority classes or synthesize data resembling the minority classes. We then trained and evaluated a random forest (RF) model on each dataset. For validation, we applied 5-fold cross-validation (CV), but only on the original data. Finally, we assessed the validation performance using precision, recall, F1-score, and AUC/ROC metrics.
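For reference, the per-class precision, recall, and F1-score named above can be computed from true and predicted labels as in the sketch below. It is a minimal illustration in a one-vs-rest setting; how the scores are averaged across the three classes is not stated in the text, so no averaging is performed. The function name `per_class_scores` and the label strings are hypothetical.

```python
def per_class_scores(y_true, y_pred):
    """Precision, recall, and F1-score for each class, one-vs-rest."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Guard against division by zero when a class is never predicted or never present.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

if __name__ == "__main__":
    y_true = ["HC", "HC", "early", "early", "mid_late", "mid_late"]
    y_pred = ["HC", "early", "early", "early", "mid_late", "HC"]
    for label, s in per_class_scores(y_true, y_pred).items():
        print(label, s)
```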
B. Sampling Methods
To address the imbalance problem, we employ three data sampling methods. In addition to the original dataset, we created three sampled datasets, as depicted in Fig. 3. The original dataset consists of 714 samples from HC, 424 samples from early-stage PD patients, and 288 samples from mid-late-stage PD patients. Each sample is a feature vector extracted from one raw data file, and each file represents a specific combination of subject, task, segment, and side. Segmentation was based on distinct toe-tapping cycles, with each segment running from the commencement (lift-toe) to the culmination (drop-toe) of a toe-tapping movement. Segments with general noise or without clear lift-toe and drop-toe events were excluded from the analysis. Because each file holds one segment, the 714 HC samples correspond to 714 files and 714 segments.
A significant observation from this dataset is its inherent imbalance: the HC data is disproportionately represented compared to the two PD stages. With over-sampling, the 424 samples from early-stage PD and the 288 samples from mid-late-stage PD are increased to match the majority count of 714 from HC. Conversely, with under-sampling, the 714 HC samples and the 424 early-stage PD samples are reduced to match the lowest count of 288 from mid-late-stage PD.
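The size of each adjustment follows directly from the class counts above. The short sketch below works through the arithmetic; the dictionary keys are hypothetical labels chosen for illustration.

```python
# Class counts reported for the original dataset.
counts = {"HC": 714, "early": 424, "mid_late": 288}

majority = max(counts.values())  # over-sampling target (714)
minority = min(counts.values())  # under-sampling target (288)

# Samples each class must gain to reach the majority count.
to_add = {name: majority - n for name, n in counts.items()}
# Samples each class must lose to reach the minority count.
to_remove = {name: n - minority for name, n in counts.items()}

print(to_add)     # {'HC': 0, 'early': 290, 'mid_late': 426}
print(to_remove)  # {'HC': 426, 'early': 136, 'mid_late': 0}
```

Over-sampling therefore grows the dataset from 1426 to 3 × 714 = 2142 samples, while under-sampling shrinks it to 3 × 288 = 864 samples.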
The Tomek-Links dataset uses under-sampling to balance the data, that is, it reduces the number of samples in over-represented classes. The Tomek-Links method identifies, through Condensed Nearest Neighbor, pairs of instances from different classes for which no other instance lies closer to either member of the pair. After this method is applied, each class contains 288 samples, which matches the size of the minority class (mid-late stage) in the original dataset. Under-sampling was applied to the other two, larger classes to achieve this balance.
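To make the notion of a Tomek link concrete, the sketch below finds pairs of mutually nearest neighbors with different labels and drops the member that belongs to a chosen majority class. It is a brute-force illustration on toy 2-D points, assuming Euclidean distance; it shows link detection only and does not reproduce the 288-per-class counts reported above. The function names and toy data are hypothetical.

```python
import math

def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbors and carry different class labels."""
    def nearest(i):
        others = (j for j in range(len(points)) if j != i)
        return min(others, key=lambda j: math.dist(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

def drop_majority_members(points, labels, majority_label):
    """Remove the majority-class member of every Tomek link."""
    drop = {i for pair in tomek_links(points, labels)
            for i in pair if labels[i] == majority_label}
    keep = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep]

if __name__ == "__main__":
    # Toy feature vectors; only the first two points form a Tomek link.
    pts = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0),
           (3.0, 0.0), (5.0, 5.0), (5.2, 5.0)]
    lbl = ["HC", "PD", "HC", "HC", "HC", "PD", "PD"]
    print(tomek_links(pts, lbl))
    print(drop_majority_members(pts, lbl, "HC")[1])
```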
The SMOTE dataset uses over-sampling, which balances the data by increasing the number of samples in under-represented classes. Rather than merely duplicating the samples of the minority classes (early and mid-late stages), the SMOTE method synthesizes new samples that lie close to the existing ones in the feature space, which can improve the robustness of the classifier. After the application of SMOTE, every class contains 714 samples.
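The interpolation idea behind SMOTE can be sketched as follows: pick a minority sample, pick one of its k nearest minority neighbors, and place a synthetic point at a random position on the line segment between them. This is a simplified illustration, not the configuration used in the study; the neighbor count `k`, the seed, the Euclidean distance, and the function name `smote` are all assumptions.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbors of the base sample.
        neighbors = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbor = minority[rng.choice(neighbors)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, neighbor)))
    return synthetic

if __name__ == "__main__":
    # Toy minority-class feature vectors.
    minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for point in smote(minority, n_new=3, k=2):
        print(point)
```

Applied to the counts above, the mid-late-stage class would need 714 − 288 = 426 synthetic samples and the early-stage class 714 − 424 = 290.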
The SMOTE-Tomek dataset combines the strengths of both SMOTE and Tomek-Links. SMOTE creates synthetic samples for the minority classes, while Tomek-Links removes majority-class samples that lie close to minority-class samples. This helps create a clearer boundary between the classes. With this combined method, the dataset is balanced, with each class having 714 samples.