## Participants

Participants with stroke were recruited from the inpatient acute stroke or stroke rehabilitation units at the Foothills Medical Centre and the Dr. Vernon Fanning Care Centre in Calgary, Alberta, Canada, and Providence Care, St Mary’s of the Lake Hospital, Kingston, Ontario, Canada. Inclusion criteria for participants with stroke were: recent onset (< 35 days) of a first clinical stroke and age ≥ 18 years. Exclusion criteria were: other underlying neurological conditions (e.g. Parkinson’s disease, multiple sclerosis), upper limb orthopedic impairments, inability to understand task instructions, or evidence of apraxia [55]. Neurologically intact control participants, who met the same inclusion and exclusion criteria but had no history of stroke, were recruited from the communities of Calgary, Alberta, and Kingston, Ontario, Canada. This study was reviewed and approved by the University of Calgary Conjoint Health Research Ethics Board and the Queen’s University Research Ethics Board. All participants gave written informed consent before performing the assessment.

## Robotic Assessment

**Robotic Device.** The robotic assessment of position sense was performed using a Kinarm Exoskeleton robotic device (Fig. 1A; Kinarm, Kingston, Ontario, Canada), which permits movements of the arm in the horizontal plane involving horizontal abduction/adduction of the shoulder and flexion/extension of the elbow. Participants were seated in a height-adjustable wheelchair base with their arms supported against gravity. The device was fit to each participant’s arm by research staff trained to conduct the robotic assessment. The robot was then wheeled to a 2D virtual/augmented reality display capable of projecting virtual targets into the plane of the participant’s arm during calibration and task performance. Given the focus on proprioceptive function, no visual stimuli were displayed during the experiment, and direct vision of the upper extremities was occluded by a shutter and a bib. Set-up and calibration took 6–8 minutes per participant.

**Arm Position Matching Task**. The Arm Position Matching (APM) task was used to assess position sense of the arm and has been described previously [31–34][56]. Participants were instructed to relax one arm (passive hand) and let the robot passively move that hand to one of four or nine spatial locations, depending on protocol (Fig. 1B shows the 9-target task). In the 4-target protocol, targets are arranged on a 2x2 grid at 20 cm intervals in the X- and Y-directions; the 9-target protocol uses a 3x3 grid at 10 cm intervals. Target locations were pseudo-randomized within a block. Each block contained one trial at each target location, and participants completed six blocks. The robot moved the passive hand using a bell-shaped speed profile (maximum speed < 1 m/s). After the robot completed the passive movement, participants were asked to move their other arm (active hand; Fig. 1B) to mirror-match the spatial position of the passive hand. Participants were granted as much time as necessary to match the position of the passive hand. Participants notified the examiner when they had matched their hand position, and the examiner triggered the next trial. Each participant completed either the 4-target or 9-target protocol [105]. For stroke participants, the affected arm was always the passive hand. Healthy control participants completed the task twice, with each arm serving as the passive hand once; data from each arm were treated as a separate participant in the analysis [117].

## Robotic Task Parameters

The following parameters were used to quantify task performance after completing all trials: (a) trial-to-trial *Variability* (Var) of the active hand, (b) *Spatial Contraction/Expansion* (Cont/Exp) of the area matched by the active hand, (c) *Systematic Spatial Shifts* (Shift) between the passive and active hands, and (d) *Absolute Error* (AE).

**Variability**. *Variability* in Arm Position Matching (APM) describes the trial-to-trial consistency of the active hand in matching the same target position (Fig. 1C). It was calculated as the standard deviation of the active hand’s position at each target location. These standard deviations were then averaged across all target positions in the x-coordinate (*Var*x) and y-coordinate (*Var*y), and combined into the resultant linear variability of both coordinates (*Var*xy):

$${Var}_{xy}=\sqrt{{{Var}_{x}}^{2}+{{Var}_{y}}^{2}}$$

1

**Spatial Contraction/Expansion Ratio**. Spatial *Contraction/Expansion Ratio* describes whether a participant displayed contraction or expansion of their perceived workspace (Fig. 1D). It was calculated as the area/range of the workspace matched by the active hand relative to that of the passive hand. In the x-coordinate (*Cont/Exp**x*), the range was computed as the difference between the mean x-coordinates of the three leftmost and three rightmost targets, and the active-hand range was divided by the corresponding passive-hand range:

$${Cont/Exp}_{x}=\frac{{range}_{{x}_{active}}}{{range}_{{x}_{passive}}}$$

2

A similar procedure was used to calculate contraction/expansion in the y-coordinate (*Cont/Exp**y*) using the range of the top and bottom three targets. Spatial contraction/expansion in both the x- and y-coordinates (*Cont/Exp**xy*) was calculated as the area spanned by the active hand for the eight border targets, normalized by the area spanned by the passive hand for those same targets.

**Systematic Spatial Shifts**. Systematic Spatial *Shifts* describe constant errors between the active and passive hands (Fig. 1E). The mean error between the passive and active hands was calculated for each target position, and these target-specific means were then averaged across all target locations. Systematic shifts were calculated in the x-coordinate (*Shift**x*), the y-coordinate (*Shift**y*), and combined across both coordinates to provide a measure of the resultant shift in matched positions (*Shift**xy*):

$${Shift}_{xy}=\sqrt{{{Shift}_{x}}^{2}+{{Shift}_{y}}^{2}}$$

3

**Absolute Error**. *Absolute Error* describes the mean absolute distance between the positions of the active and passive hands. Mean absolute errors were calculated across all trials in the x-coordinate (*AE**x*) and y-coordinate (*AE**y*), and combined across both coordinates (*AE**xy*):

$${AE}_{xy}=\sqrt{{{AE}_{x}}^{2}+{{AE}_{y}}^{2}}$$

4

A total of 12 parameters were used to measure performance in the arm position matching task.
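As an illustration of Equations 1–4, the four parameter families can be computed from matched hand positions in a few lines of NumPy. This is a minimal sketch with our own function name and conventions, not the Dexterit-E implementation; in particular, the contraction/expansion area ratio is approximated by the product of the x- and y-range ratios rather than the eight-border-target area.

```python
import numpy as np

def apm_parameters(active, passive, targets):
    """Sketch of the APM parameters (Eqs. 1-4).

    active, passive : (n_trials, 2) arrays of matched x, y positions (m)
    targets         : (n_trials,) array of target labels
    """
    labels = np.unique(targets)

    # Variability (Eq. 1): per-target SD of the active hand, averaged over targets
    sds = np.array([active[targets == t].std(axis=0) for t in labels])
    var_x, var_y = sds.mean(axis=0)
    var_xy = np.hypot(var_x, var_y)

    # Systematic shift (Eq. 3): mean active-passive error per target, then averaged
    errs = np.array([(active[targets == t] - passive[targets == t]).mean(axis=0)
                     for t in labels])
    shift_x, shift_y = errs.mean(axis=0)
    shift_xy = np.hypot(shift_x, shift_y)

    # Absolute error (Eq. 4): mean absolute per-coordinate error over all trials
    ae_x, ae_y = np.abs(active - passive).mean(axis=0)
    ae_xy = np.hypot(ae_x, ae_y)

    # Contraction/expansion (Eq. 2): active workspace range over passive range;
    # the area version is approximated here by the product of the axis ratios
    ce_x = np.ptp(active[:, 0]) / np.ptp(passive[:, 0])
    ce_y = np.ptp(active[:, 1]) / np.ptp(passive[:, 1])
    ce_xy = ce_x * ce_y
    return {"var_xy": var_xy, "shift_xy": shift_xy, "ae_xy": ae_xy, "ce_xy": ce_xy}
```

A perfect matcher (active positions identical to passive positions) yields zero variability, shift, and absolute error, and a contraction/expansion ratio of 1.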

**Z-score.** For each of the parameters above, we relied on the Dexterit-E software [106] associated with the Kinarm to calculate a Z-score. The Z-score, or standardized score, is the distance, measured in standard deviations, that a data point falls from the mean of the healthy cohort. Kinarm (Kinarm, Kingston, ON) [71] uses a consistent methodology for developing the normal models that underlie the Z-scores of each parameter. Parameter scores from the normative data set (developed from neurologically intact controls) are transformed using a Box-Cox power transformation [72]. The transformed data are fitted using Multiple Linear Regression (MLR), accounting for age, sex, handedness, and robotic platform (exoskeleton, endpoint robot). The standard deviation of the residuals of this first regression is then modeled using a second MLR with the same factors. Z-scores for each parameter are calculated from the residuals of the first regression and the standard deviation modeled by the second. A Z-score of 1 therefore means a value was 1 standard deviation above the mean of the healthy control data, and a Z-score of -1 means it was 1 standard deviation below.

To ensure the distribution is “close-to-normal”, the skew and kurtosis of the final distribution were calculated and compared to the following criteria (Equations 5 and 6). These criteria were selected from Pearson and Please [73] so that it is statistically valid to use parametric tests with the Z-scores.

$$skew: abs\left(\sqrt{{\beta }_{1}}\right)\le 0.8, \sqrt{{\beta }_{1}}=\frac{{\mu }_{3}}{{\sigma }^{3}}$$

5

$$kurtosis: 2.4\le {\beta }_{2}\le 3.6, {\beta }_{2}=\frac{{\mu }_{4}}{{\sigma }^{4}}$$

6

where σ is the standard deviation, and \({\mu }_{3}\) and \({\mu }_{4}\) are the third and fourth central moments (moments about the mean).
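Under these definitions, the criteria in Equations 5 and 6 can be checked directly from a sample's central moments. A minimal sketch (the function name and sampling below are ours):

```python
import numpy as np

def close_to_normal(x):
    """Check the skew (Eq. 5) and kurtosis (Eq. 6) criteria on a sample."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    skew = ((x - mu) ** 3).mean() / sigma ** 3  # sqrt(beta_1)
    kurt = ((x - mu) ** 4).mean() / sigma ** 4  # beta_2
    return abs(skew) <= 0.8 and 2.4 <= kurt <= 3.6
```

A large Gaussian sample passes both criteria (skew near 0, kurtosis near 3), while a strongly skewed distribution such as the exponential (skew 2, kurtosis 9) fails.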

**Task Score.** A task score gives a global measure of a participant’s performance for a given task: it quantifies how far the participant’s performance falls from the best performance. In the first stage of calculating the task score, the task parameter scores are converted into standardized Z-scores (described above). The second stage identifies whether best performance for a given parameter corresponds to large negative Z-scores, large positive Z-scores, or near-zero Z-scores. For parameters in which best performance is one-sided (i.e., large negative or large positive Z-scores), the Z-scores are transformed into Zeta scores using Eq. 7.

$$\varsigma =\sqrt{2}\cdot {erfc}^{-1}\left(\frac{1}{2}\cdot erfc\left(\frac{\pm z}{\sqrt{2}}\right)\right)$$

7

where ‘+’ is used when poor performance corresponds to positive Z-scores and ‘−’ when it corresponds to negative Z-scores.

In the final stage, task scores are calibrated on the performance of healthy controls. The root-sum-square (RSS) distance of the Z-scores and Zeta scores is calculated using Eq. 8 for healthy controls. The RSS distance (also known as the Euclidean distance) is transformed into a Z-score using a Box-Cox transform, and this Z-score is then converted to a one-sided statistic using Eq. 7.

$$rssDistance=\sqrt{\sum _{i}{z}_{i}^{2}+\sum _{j}{\varsigma }_{j}^{2}}$$

8

where \(\sum _{i}{z}_{i}^{2}\) includes all two-sided parameters and \(\sum _{j}{\varsigma }_{j}^{2}\) includes all one-sided parameters.

Task scores are always positive: a score of 0 corresponds to the best performance, and increasing values represent poorer performance. If a control participant’s task score was > 3.29 (corresponding to roughly 1 in 1000 under a normal distribution), that participant was classified as an outlier for the task and removed. Outliers were removed to improve the robustness of the normative-model fitting.
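Equations 7 and 8 can be sketched with SciPy's complementary error function. The function names below are ours, and the ± sign of Eq. 7 is exposed as a flag:

```python
import numpy as np
from scipy.special import erfc, erfcinv

def zeta(z, poor_is_positive=True):
    """One-sided Zeta transform (Eq. 7): folds a Z-score so that larger
    values always correspond to poorer performance."""
    s = 1.0 if poor_is_positive else -1.0
    return np.sqrt(2.0) * erfcinv(0.5 * erfc(s * z / np.sqrt(2.0)))

def rss_distance(two_sided_z, one_sided_zeta):
    """Root-sum-square distance over all parameters (Eq. 8)."""
    return np.sqrt(np.sum(np.square(two_sided_z)) +
                   np.sum(np.square(one_sided_zeta)))
```

Note that a Z-score of 0 maps to a Zeta of about 0.6745 (the median of the one-sided distribution), so Zeta scores are never negative.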

## Clinical Assessments

A broad range of clinical assessments was performed to characterize the impairment of stroke participants in this study. The assessments served to quantify sensation, movement, cognition, and functional abilities. They were performed by a physician or physiotherapist with expertise in stroke rehabilitation who was blinded to the results of the robotic assessment.

Position sense was clinically assessed using the Thumb Localization Test (TLT) [11]. It was chosen because it has been used to quantify whole-limb position sense in many studies involving stroke [57–65]. In this test, the examiner moves the participant’s stroke-affected arm to a position in front of the participant at or above eye level, lateral to the midline, with the participant’s eyes closed. The participant is then asked to pinch the thumb of that limb with the opposite thumb and forefinger (reaching limb). Performance was scored from 0 (normal; the task was completed perfectly) to 3 (markedly abnormal; the participant was unable to find his or her thumb and did not climb up the affected arm to locate it).

Motor impairment was assessed using the Purdue Peg Board test (PPB) (Lafayette Instrument Co., Lafayette, IN, USA) [66] and the Chedoke-McMaster Stroke Assessment (CMSA) [67]. In the PPB assessment, participants placed as many small pegs as possible into holes in a board over 30 seconds using one hand; the participant must use the proximal upper extremity to keep the hand in the appropriate position to retrieve and insert each peg, making this a test of fine motor skill. The CMSA relies on the concept of stages of motor recovery, first introduced by Twitchell [68], and classifies participants into subgroups based on their stage of motor recovery. The 7-point scale corresponds to seven stages of motor recovery, where a score of 1 is considered the most abnormal and a score of 7 is normal.

Functional abilities were assessed using the Functional Independence Measure (FIM) [69], a metric for independence in activities of daily living. Within the 18-item scale, 13 items assess motor tasks and 5 assess cognitive tasks. In the current manuscript, we present the total FIM score (measured out of 126) and the FIM score for the motor component (measured out of 91).

## Data Analysis

Data analysis was performed using machine learning and deep learning techniques in the Python programming language (version 3.7.4) [70]. In the first step of our analysis, we determined when stroke participants were impaired on robotic parameters using the Z-scores described above. We used the 95% cut-off of control performance on each robotic parameter (scores beyond a cut-off of 1.96 were considered impaired; scores within it, unimpaired) to determine whether an individual participant failed a given parameter. When a participant’s score fell outside the control range, they were classified as impaired on that parameter. Our primary analysis compared the impairment rate found with individual parameters and the overall task score (the so-called cut-off score technique) against the ability of machine learning and deep learning techniques to determine impairment.
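The cut-off classification can be sketched as follows. Whether a given score is compared two-sided (|Z|) or one-sided depends on the parameter; this hypothetical helper uses the absolute value:

```python
import numpy as np

def impaired_on_parameters(z_scores, cutoff=1.96):
    """Flag failure on each robotic parameter at the 95% cut-off of control
    performance; |Z| > cutoff fails that parameter. Returns the per-parameter
    flags and whether any parameter was failed."""
    z = np.asarray(z_scores, dtype=float)
    failed = np.abs(z) > cutoff
    return failed, bool(failed.any())
```

For example, a participant with Z-scores of 0.5, -2.5, and 1.0 fails only the second parameter and would be flagged as impaired on it.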

## Machine Learning and Deep Learning

**Flowchart of the Classification Models**. The workflow of the data classification models is shown in Fig. 2. The K-fold cross-validation (K = 10, CV) training and testing data comprise the outcome measures (features) derived from the Arm Position Matching (APM) task (12 parameters) for each control and stroke participant. The K-fold CV training and testing data were labeled into two categories (“control” and “stroke”), passed through feature extraction and scaling, and then fitted to the supervised machine learning and deep learning models. After evaluation, we computed the mean and standard deviation across the K folds for all model performance metrics. Finally, we plotted receiver operating characteristic (ROC) curves for the mean cross-validated result of each model.

**K-fold Cross-Validation (CV)**. The K-fold cross-validation procedure randomly divides the dataset into K disjoint folds. One fold is used for testing and the remaining K−1 folds are used for training the model; this process is repeated K times until every fold has served as the test set. All folds contained an equal number of data points unless otherwise specified. We applied K-fold cross-validation (K = 10) to estimate the performance and reliability of each classification algorithm and to enable meaningful comparison between classification models. The performance of the classification models was evaluated by the mean and standard deviation across the K folds.
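A K-fold evaluation loop of this kind might look like the following sketch; the toy data and the choice of scikit-learn's `KFold` are our assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

def kfold_evaluate(X, y, make_model, k=10, seed=0):
    """Mean and SD of test accuracy across K folds."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True,
                                     random_state=seed).split(X):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores)), float(np.std(scores))

# Toy demonstration on synthetic, well-separated "control"/"stroke" data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 12)), rng.normal(3, 1, (50, 12))])
y = np.array([0] * 50 + [1] * 50)
mean_acc, sd_acc = kfold_evaluate(X, y, lambda: LogisticRegression(max_iter=200))
```

Reporting the mean and standard deviation over folds, rather than a single train/test split, gives a more stable estimate of each model's performance.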

**Features**. A feature is a measurable piece of data that can be used for analysis; features are also known as attributes or variables. In our case, the features were the Z-scores of the 12 task parameters (Variability X, Variability Y, Variability XY, etc.), and all 12 were used in our analysis. The features were normalized using min-max normalization (the minimum value of a feature is mapped to 0, the maximum to 1, and every other value to a point in between) so that all features spanned the same range. The machine learning and deep learning models were then trained on these features, and the resulting classifiers were used for prediction and evaluation in the testing phase.
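Min-max normalization as described can be sketched as:

```python
import numpy as np

def min_max_normalize(X):
    """Map each feature column onto [0, 1] (min-max normalization)."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)
```

In a cross-validated setting, the minimum and maximum would be estimated on the training folds only and applied to the test fold, to avoid information leaking from test to training data.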

## Classification Methods

We applied five Machine Learning (ML) techniques: Logistic Regression (LR) [108], Decision Tree (DT) [109], Random Forest (RF) [110], Random Forest with Hyperparameters Tuning (RFT) [111], and Support Vector Machine (SVM) [112]. We also applied one Deep Learning technique: Deep Neural Network (DNN) [113] for the classification (or supervised learning) of stroke and control data.

**Logistic Regression (LR)**. We used a logistic regression model to classify each participant as stroke or control based on their performance in the arm position matching task. For that purpose, we implemented a regularized binary logistic regression classifier; the regularization adds a penalty as model complexity increases, encouraging the model to generalize and preventing overfitting as the number of parameters grows. The LR model assumes a linear relationship between the input features and the log-odds of the output, and the binary logistic model has a dependent variable with two possible outcomes: healthy control and stroke. We used a tolerance of 0.0001 and a maximum of 100 iterations as criteria to stop training.
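A scikit-learn sketch of such a classifier, using the stated tolerance and iteration limit; the L2 penalty, lbfgs solver, and toy data are our assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Regularized binary logistic regression with the stopping criteria from the
# text (tolerance 1e-4, at most 100 iterations)
lr = LogisticRegression(penalty="l2", tol=1e-4, max_iter=100, solver="lbfgs")

# Toy, well-separated stand-ins for the control (0) and stroke (1) feature sets
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 12)), rng.normal(3, 1, (40, 12))])
y = np.array([0] * 40 + [1] * 40)
lr.fit(X, y)
```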

**Decision Tree (DT)**. We implemented a decision tree classifier as a second predictive model. It uses a tree-like structure in which each internal (non-leaf) node is labeled with an input feature. The branches leaving a node correspond to the possible values of that feature, or lead to subordinate decision nodes on other input features. Each leaf node is labeled with a class: healthy control or stroke. The model considers splits on all available features/parameters and selects the split that produces the most homogeneous sub-nodes.

Our decision tree classifier implementation used the following parameters: Gini impurity as the criterion for measuring the quality of a split, the “best” splitter strategy for choosing splits, a maximum tree depth of 4, and a minimum of 1 sample per leaf node.
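In scikit-learn, these settings correspond to the following sketch (the toy data are ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Decision tree with the stated parameters: Gini criterion, "best" splitter,
# maximum depth 4, and at least 1 sample per leaf
dt = DecisionTreeClassifier(criterion="gini", splitter="best",
                            max_depth=4, min_samples_leaf=1)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (40, 12)), rng.normal(3, 1, (40, 12))])
y = np.array([0] * 40 + [1] * 40)
dt.fit(X, y)
```

Capping the depth at 4 limits model complexity, which helps the tree generalize from a modest number of participants.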

**Random Forest (RF)**. We implemented an ensemble learning model (i.e., a Random Forest classifier). It is a classification algorithm consisting of many decision trees, which uses bagging and feature randomness when building each individual tree. It tries to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. The output of the random forest model was the class selected by most trees.

The parameters in our implementation were: 100 estimators (the number of trees in the forest), Gini impurity as the splitting criterion, a minimum of 2 samples required to split an internal node, and a minimum of 1 sample per leaf node.
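A corresponding scikit-learn sketch (the fixed seed and toy data are ours):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Random forest with the stated parameters: 100 trees, Gini criterion,
# min_samples_split=2, min_samples_leaf=1
rf = RandomForestClassifier(n_estimators=100, criterion="gini",
                            min_samples_split=2, min_samples_leaf=1,
                            random_state=0)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (40, 12)), rng.normal(3, 1, (40, 12))])
y = np.array([0] * 40 + [1] * 40)
rf.fit(X, y)  # each tree votes; the majority class is the prediction
```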

**Random Forest with Hyperparameters Tuning (RFT)**. We tuned the hyperparameters of the random forest model (a hyperparameter is a parameter whose value controls the learning process) to find the best-performing configuration. Hyperparameter tuning relies more on experimental results than on theory, so the optimal settings were determined by trying many different combinations and evaluating each model’s performance.

The tuned hyperparameters of the random forest model were: the number of trees in the forest, the maximum number of levels in each decision tree, the maximum number of features considered when splitting a node, the minimum number of data points required in a node before it is split, and the minimum number of data points allowed in a leaf node.
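One common way to search these five hyperparameters is randomized search with cross-validation. The candidate values, `n_iter`, `cv` settings, and the use of `RandomizedSearchCV` below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Candidate values for the five tuned hyperparameters listed in the text
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 4, 8, None],
    "max_features": ["sqrt", "log2", None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions, n_iter=10, cv=3,
                            random_state=0)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (30, 12)), rng.normal(3, 1, (30, 12))])
y = np.array([0] * 30 + [1] * 30)
search.fit(X, y)  # best_params_ holds the winning combination
```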

**Support Vector Machine (SVM)**. We implemented a Support Vector Machine (SVM) classifier. It constructs a set of hyperplanes (decision boundaries that separate the data points) in a high-dimensional space to perform the classification task. The model transforms the data to find an optimal boundary between the two outputs (control or stroke). A good separation is achieved by the hyperplane with the largest distance, or functional margin, to the nearest training data point of any class.

Our implementation used the following parameters: a strictly positive regularization parameter, a Radial Basis Function (RBF) kernel, a kernel cache size of 200 MB, a pseudo-random number generator for shuffling the data for probability estimates, and a tolerance of 0.001 as the stopping criterion.
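A scikit-learn sketch with the stated settings; the regularization value C=1.0 and the toy data are our assumptions (C only needs to be strictly positive).

```python
import numpy as np
from sklearn.svm import SVC

# SVM with RBF kernel, 200 MB kernel cache, tolerance 1e-3, and a seeded RNG
# for the data shuffling used by the probability estimates
svm = SVC(C=1.0, kernel="rbf", cache_size=200, tol=1e-3,
          probability=True, random_state=0)

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (40, 12)), rng.normal(3, 1, (40, 12))])
y = np.array([0] * 40 + [1] * 40)
svm.fit(X, y)
```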

**Deep Neural Network (DNN)**. We also implemented a deep learning technique, namely a Deep Neural Network (DNN). It is part of a broader family of machine learning techniques based on artificial neural networks.

Our DNN classifier consisted of three hidden layers between the input and output layers. The first hidden layer had 12 units with the Rectified Linear Unit (ReLU) activation function, the second had 8 units with ReLU activation, and the third had 1 unit with the sigmoid activation function. We used binary cross-entropy as the loss function, the Root Mean Square propagation (RMSprop) optimizer, a batch size of 10, and 100 epochs, where an epoch is one complete pass of the entire training dataset through the learning algorithm. The input layer had 12 units for the 12 features, and the output had 1 unit predicting a 0 or 1 that maps back to the “healthy control” or “stroke” class. Each layer of nodes learns a distinct set of features based on the output of the previous layer; this hierarchical feature processing allows DNNs to handle very large, high-dimensional datasets with many parameters passed through nonlinear functions.
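A Keras sketch of this architecture (the toy data are ours; note that the described 1-unit sigmoid “third hidden layer” effectively acts as the output layer):

```python
import numpy as np
from tensorflow import keras

# Architecture as described: 12 input features, hidden layers of 12 and 8
# ReLU units, a 1-unit sigmoid layer, binary cross-entropy loss, the RMSprop
# optimizer, batch size 10, and 100 epochs
model = keras.Sequential([
    keras.layers.Input(shape=(12,)),
    keras.layers.Dense(12, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="rmsprop",
              metrics=["accuracy"])

# Toy, well-separated stand-ins for the control (0) and stroke (1) classes
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (40, 12)), rng.normal(3, 1, (40, 12))])
y = np.array([0] * 40 + [1] * 40)
model.fit(X, y, batch_size=10, epochs=100, verbose=0)
```

The sigmoid output is a probability; thresholding it at 0.5 yields the predicted “healthy control” (0) or “stroke” (1) class.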

## Feature Importance

Feature importance [114] in machine learning refers to techniques that assign each feature a score, expressed as a percentage, based on its usefulness in the classification task. We applied different feature-importance calculations for the different techniques. For the LR and SVM models, feature importance was based on an information-theoretic criterion, measuring the change in prediction entropy under perturbation of a given feature [115]. For the DT, RF, and RFT models, feature importance was computed as the mean and standard deviation of the impurity decrease within each tree (the total decrease in node impurity, weighted by the probability of reaching the node, averaged over all trees of the ensemble) [116]. In general, a higher feature-importance score means that feature has a larger effect on the model used to classify participants as “stroke” or “control”, and a lower score means it has less impact on the classification model.
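The two families of importance calculations can be illustrated as follows. Impurity-based importance comes directly from tree ensembles, while permutation importance is shown here as one common perturbation-based criterion; the toy data, in which only the first feature is informative, are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

# Toy data in which only the first of four features carries class information
rng = np.random.default_rng(7)
X = rng.normal(0, 1, (200, 4))
y = (X[:, 0] > 0).astype(int)

# Impurity-based importance, as used for the tree models (DT/RF/RFT)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
impurity_importance = rf.feature_importances_  # sums to 1; report as %

# Perturbation-based importance for LR (and similarly SVM); permutation
# importance is one common perturbation criterion
lr = LogisticRegression(max_iter=200).fit(X, y)
perm = permutation_importance(lr, X, y, n_repeats=10, random_state=0)
perturbation_importance = perm.importances_mean
```

Both methods should rank the informative first feature highest, illustrating how a high score reflects a large effect on the classification.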