**Data Description and Visualization.** The dataset used in this study was extracted from the PPMI database28, a comprehensive, prospective, observational study comprising imaging data, clinical tests, biospecimens, and genetic biomarkers to track the progression of PD. Table 1 shows the PPMI files used in the current study. PPMI data have been collected at centers in the United States, Europe, Israel, and Australia. This diversity keeps the data collection from being limited to a single hospital or clinic setting, or to the medications available in a specific location.

Table 1

Informative files from the PPMI database contributed to this study.

| # | Folder Name in PPMI database | Description |
|---|------------------------------|-------------|
| 1 | Subject_Characteristics | Age at visits, gender |
| 2 | Motor MDS-UPDRS | MDS-UPDRS (I, II, III, IV), Modified Schwab England ADL |
| 3 | Medical_History | LEDD_Concomitant_Medication_Log |
| 4 | Hoehn & Yahr Stage | Hoehn and Yahr scale |

After merging severity scores and medication records by visit date, visits with missing values were removed from the dataset (0.6% of all visits). Furthermore, records of patients with medication dosages higher than 2500 mg were considered outliers and removed from the study.

Types of PD-related medication were categorized into five main groups: LD, DA, MAOB-i, COMT-i, and Amantadine. All reported medications were reviewed for consistency across the log. Spelling mistakes were corrected, and different representations of the same medication name were unified; for example, “LEVODOP,” “LEVODAP,” and “L-DOPA” were all replaced by levodopa. When the same type of medication was used simultaneously with variable doses, the mean value was taken29. Medication usage in the dataset is displayed as box plots in Fig. 1.
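As a sketch, this cleaning rule could look like the following, assuming a long-format medication log with hypothetical `patient_id`, `visit`, `medication`, and `dose_mg` columns (the PPMI log's actual column names differ):

```python
import pandas as pd

# Hypothetical name map; "LEVODOP", "LEVODAP", and "L-DOPA" all mean levodopa.
NAME_MAP = {"LEVODOP": "levodopa", "LEVODAP": "levodopa", "L-DOPA": "levodopa"}

def clean_medication_log(log: pd.DataFrame) -> pd.DataFrame:
    """Normalize medication names and average duplicate doses per visit."""
    log = log.copy()
    log["medication"] = (log["medication"].str.strip().str.upper()
                         .replace(NAME_MAP).str.lower())
    # When the same medication appears with several doses at one visit,
    # keep the mean dose, as described for the PPMI log above.
    return (log.groupby(["patient_id", "visit", "medication"], as_index=False)
               ["dose_mg"].mean())
```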

The severity scores of the symptoms were considered only in the “ON state.” This study analyzed a dataset comprising 4,143 patient visits for PD, including 1,614 visits from female patients. All sub-items of the Movement Disorder Society-sponsored revision of the UPDRS (MDS‐UPDRS)30 in the ON-medication state, Activities of Daily Living, age, and the Hoehn and Yahr scale were included in this study. Table 2 summarizes the patients’ rating scores, including the range, mean, and standard deviation across all visits.

Table 2

Summary of the rating scores of the patients.

| Patient’s measurement | Scores range | Mean (std) |
|---|---|---|
| UPDRS-Part1 | [0, 24] | 6.82 (4.09) |
| UPDRS-Part2 | [0, 45] | 9.61 (6.62) |
| UPDRS-Part3 | [0, 89] | 22.50 (12.33) |
| UPDRS-Part4 | [0, 17] | 1.95 (2.85) |
| Total UPDRS | [2, 151] | 39.4 (18.80) |
| Activities of Daily Living (%) | [10, 100] | 86.38 (11.38) |
| Age at visit | [29, 91] | 64.59 (9.74) |
| Hoehn and Yahr | [1, 5] | 1.9 (0.56) |

Figure 2. Representation of the number of patients and their corresponding minimum number of visits. Among the 673 patients, 632 had a minimum of two consecutive visits. Patients without follow-up visits (i.e., with only one visit) were dropped from the study and not shown on the plot.

**Proposed framework.** The proposed model uses RNNs to provide a sequence-by-sequence decision support framework based on patients' subsequent visits. A multi-layer perceptron (MLP) network was also applied to compare the results with a non-sequential method. This approach treats every patient with a visitation history as one input sample. Modeling past visits as a sequential time series lets the network exploit crucial attributes of previous visits, making the model more reliable and individualized than predicting the dosage from a single point in time31.

Figure 3 displays the multiple visits of each patient; only the last five visits were considered for every patient. Figure 4 provides an overall framework for the proposed DSS. The steps in this framework are as follows: 1. data preparation to manage outliers and missing values, 2. data normalization, 3. data reshaping, 4. sequence padding, 5. fitting the recurrent models on the training set, and 6. evaluating the models on the test set.
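Steps 2-4 (normalization, reshaping, and sequence padding) can be sketched in NumPy as follows, assuming each patient's visits arrive as an `(n_visits, n_features)` array; the paper's exact pipeline may differ:

```python
import numpy as np

def prepare_sequences(visits_per_patient, max_len=5):
    """Min-max normalize, keep the last `max_len` visits, zero-pad, reshape to 3-D."""
    all_visits = np.concatenate(visits_per_patient, axis=0)
    lo, hi = all_visits.min(axis=0), all_visits.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)       # per-feature min-max range
    out = np.zeros((len(visits_per_patient), max_len, all_visits.shape[1]))
    for i, v in enumerate(visits_per_patient):
        v = (v[-max_len:] - lo) / scale           # keep only the last 5 visits
        out[i, -len(v):] = v                      # zero-pad at the front
    return out                                    # shape: (patients, 5, features)
```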

**Recurrent neural network models to predict medication dosages.** Deep learning regression maps the non-linear relation between input characteristics and outputs. This study aimed to show the ability of sequence-by-sequence deep learning regression models, including the simple recurrent neural network (Simple-RNN), long short-term memory (LSTM), and gated recurrent unit (GRU), to solve the proposed regression problem and suggest the next equivalent medicine dosage based on each patient's consecutive visits. The RNNs, acting like a chain, allow the model to predict the medication dosage based on computations from previous time steps. For comparison, non-sequential data were fed into an MLP (with no recurrent connections) of the same structure as the recurrent networks. This section gives a brief explanation of these networks.

## Simple-RNN:

The RNNs are the class of artificial neural networks with dynamic behavior that enables the output of specific nodes to affect the subsequent input to the same nodes32. Figure 5 shows the structure of a Simple-RNN model.

As indicated in the unfolded diagram of RNN, each RNN is composed of specific memory units with hidden states of

$$h_t=\tanh\left(W_{hh}h_{t-1}+W_{hx}x_t+b_h\right)\tag{1}$$

where \(W_{hh}\) and \(W_{hx}\) are the weight matrices, \(x_t\) and \(h_t\) are the input and hidden state at time step \(t\), \(h_{t-1}\) is the hidden state at time step \(t-1\), and \(b_h\) is the bias term.
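Equation (1) amounts to a few lines of NumPy; the chained update over a visit sequence reuses the same weights at every step (`rnn_step` and `rnn_forward` are illustrative names, not the paper's implementation):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_hx, b_h):
    """One Simple-RNN update, Eq. (1): h_t = tanh(W_hh h_{t-1} + W_hx x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_hx @ x_t + b_h)

def rnn_forward(xs, h0, W_hh, W_hx, b_h):
    """Unroll the chain over a sequence of inputs, reusing the same weights."""
    h = h0
    for x_t in xs:
        h = rnn_step(x_t, h, W_hh, W_hx, b_h)
    return h
```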

**Long Short-Term Memory (LSTM)**:

The LSTM is a recurrent neural network that is beneficial for predicting sequences because it can recognize long-term dependencies. Unlike other RNNs, the LSTM mitigates the vanishing-gradient problem by discarding unnecessary past information33. Each memory unit of this network contains a memory cell state (\(c_t\)). The network can read from, write to, and erase this cell, and it learns over time, by adjusting its weights, which information to keep and which to discard for subsequent steps. Figure 6 shows an LSTM memory unit, including a forget gate (\(f_t\)), input gate (\(i_t\)), and output gate (\(o_t\)). In Eq. (2), the weights for the current input \(x_t\) are \(W_{ix}, W_{fx}, W_{ox}, W_{cx}\), and those for the previous hidden state \(h_{t-1}\) are \(W_{ih}, W_{fh}, W_{oh}, W_{ch}\). Furthermore, the biases are denoted by \(b_i, b_f, b_o, b_c\) for the input gate, forget gate, output gate, and cell state, respectively. Sigmoid (\(\sigma\)) is the activation function for the input, forget, and output gates, while the hyperbolic tangent (tanh) is used for the cell state.

\(i_t=\sigma\left(W_{ix}x_t+W_{ih}h_{t-1}+b_i\right)\) | (2a) |

\(f_t=\sigma\left(W_{fx}x_t+W_{fh}h_{t-1}+b_f\right)\) | (2b) |

\(o_t=\sigma\left(W_{ox}x_t+W_{oh}h_{t-1}+b_o\right)\) | (2c) |

\(c_t^{*}=\tanh\left(W_{cx}x_t+W_{ch}h_{t-1}+b_c\right)\) | (2d) |

Update equations of the current memory cell state and hidden state are as follows.

\(c_t=c_{t-1}\odot f_t+c_t^{*}\odot i_t\) | (3a) |

\(h_t=o_t\odot \tanh\left(c_t\right)\) | (3b) |

In Eqs. (3a) and (3b), \(\odot\) denotes the element-wise multiplication operator.
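Eqs. (2a)-(3b) can be sketched as a single-step NumPy update (an illustrative sketch, not the paper's Keras implementation; `W[k]` holds the pair `(W_kx, W_kh)` for each gate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update following Eqs. (2a)-(3b).

    W and b are dicts keyed by 'i', 'f', 'o', 'c'; each W[k] = (W_kx, W_kh).
    """
    i_t = sigmoid(W["i"][0] @ x_t + W["i"][1] @ h_prev + b["i"])     # input gate, Eq. (2a)
    f_t = sigmoid(W["f"][0] @ x_t + W["f"][1] @ h_prev + b["f"])     # forget gate, Eq. (2b)
    o_t = sigmoid(W["o"][0] @ x_t + W["o"][1] @ h_prev + b["o"])     # output gate, Eq. (2c)
    c_star = np.tanh(W["c"][0] @ x_t + W["c"][1] @ h_prev + b["c"])  # candidate, Eq. (2d)
    c_t = c_prev * f_t + c_star * i_t   # Eq. (3a), elementwise products
    h_t = o_t * np.tanh(c_t)            # Eq. (3b)
    return h_t, c_t
```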

**Gated Recurrent Unit (GRU)**:

The gated recurrent unit is another type of recurrent neural network that uses an update gate and a reset gate instead of the LSTM's cell state (Fig. 7). The update gate controls how much information from the past is carried forward, while the reset gate determines how much past information to discard34. An important property of this model is that its two gates learn to retain information over extended periods without it vanishing, while requiring fewer computations than the LSTM.

Equations (4a)-(4c) represent the calculations of a memory unit in a GRU, where \(r_t\) is the reset gate, \(z_t\) is the update gate, and \(\tilde{h}_t\) is the candidate hidden state. \(W_{rx}\), \(W_{zx}\), and \(W_{\tilde{h}x}\) are the weights for the current input \(x_t\); \(W_{rh}\), \(W_{zh}\), and \(W_{\tilde{h}h}\) are the weights for the previous hidden state \(h_{t-1}\); and \(b_r\), \(b_z\), and \(b_h\) are the bias terms for the reset gate, update gate, and candidate hidden state, respectively. \(\sigma\) and tanh are the activation functions.

\(r_t=\sigma\left(W_{rx}x_t+W_{rh}h_{t-1}+b_r\right)\) | (4a) |

\(z_t=\sigma\left(W_{zx}x_t+W_{zh}h_{t-1}+b_z\right)\) | (4b) |

\(\tilde{h}_t=\tanh\left(W_{\tilde{h}x}x_t+W_{\tilde{h}h}\left(r_t\odot h_{t-1}\right)+b_h\right)\) | (4c) |

Finally, the current hidden state of the cell (\({\text{h}}_{\text{t}}\)) can be calculated as

$$h_t=\left(1-z_t\right)\odot h_{t-1}+z_t\odot \tilde{h}_t\tag{5}$$
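A one-step NumPy sketch of Eqs. (4a)-(5) follows; argument names and shapes are illustrative, not the paper's implementation:

```python
import numpy as np

def gru_step(x_t, h_prev, Wrx, Wrh, br, Wzx, Wzh, bz, Whx, Whh, bh):
    """One GRU update following Eqs. (4a)-(5)."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    r_t = sig(Wrx @ x_t + Wrh @ h_prev + br)                  # reset gate, Eq. (4a)
    z_t = sig(Wzx @ x_t + Wzh @ h_prev + bz)                  # update gate, Eq. (4b)
    h_cand = np.tanh(Whx @ x_t + Whh @ (r_t * h_prev) + bh)   # candidate, Eq. (4c)
    return (1.0 - z_t) * h_prev + z_t * h_cand                # Eq. (5)
```

Note that, unlike the LSTM, the GRU interpolates between the old state and the candidate with a single update gate, which is why it needs fewer computations.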

The input data for the models were customized based on the visits of each individual, preserving the order from the reference visit to the last recorded visit of each patient. Patients with only one recorded visit were removed from the experiment. Because the number of visits varies across patients (Fig. 2), only the last five visits were considered for each patient. For patients with fewer than five visits, the input sequences were zero-padded to a length of five.

The models were built with open-source Python software using the TensorFlow and Keras libraries. For all models, 90% of the data were selected for training and 10% for testing, and L2 regularization was used to avoid overfitting. For the time-series recurrent neural networks, 3D input data of 632 patients (568 for training and 64 for testing) with five-time-step visitation histories were fed into the network. In the MLP network, by contrast, each visit was treated as an independent input sample without considering the patients' visitation histories, and 4,143 input samples (3,728 for training and 415 for testing) were fed into the network.
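The 568/64 partition above can be reproduced with a patient-level split like the following sketch (the paper does not specify its exact selection procedure, so the shuffling here is an assumption); splitting by patient rather than by visit keeps one patient's visits from leaking between train and test:

```python
import numpy as np

def split_patients(X, y, test_frac=0.1, seed=0):
    """Hold out ~10% of patients for testing; X is (patients, steps, features)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    n_test = int(np.ceil(X.shape[0] * test_frac))
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]
```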

**Performance Evaluation.** The proposed framework was evaluated using four statistical performance measures, including R-squared (R2), mean squared error (MSE), mean absolute error (MAE), and root mean squared error (RMSE).

In all the following evaluation metrics, n is the total number of test samples, \({y}_{i}\) is the observed output of ith sample, \(\widehat{{y}_{i}}\) is the corresponding predicted value, and \(\stackrel{-}{y}\) represents the mean of the observed values.

The R2 score, also known as the coefficient of determination, can be expressed as

$$R^{2}=1-\frac{\sum_{i=1}^{n}\left(y_i-\widehat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^{2}}\tag{6}$$

The R2 score represents the proportion of the variability in the observed output that the model explains. A perfect model achieves an R2 score of 1, whereas a model that always predicts the mean of the observed values scores close to 0.

The mean squared error can be expressed as

$$MSE=\frac{\sum_{i=1}^{n}\left(y_i-\widehat{y}_i\right)^{2}}{n}\tag{7}$$

which represents how close the predictions are to the observed values in a regression model.

The mean absolute error (MAE) is the average of differences between observed and predicted values and is given by

$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\widehat{y}_i\right|\tag{8}$$

Finally, the root mean squared error shows the deviation between predictions and observed values using Euclidean distance and can be calculated by

$$RMSE=\sqrt{\frac{\sum_{i=1}^{n}\left(y_i-\widehat{y}_i\right)^{2}}{n}}\tag{9}$$
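Eqs. (6)-(9) map directly to NumPy; the helper below is illustrative, not the paper's evaluation code:

```python
import numpy as np

def regression_metrics(y, y_hat):
    """Compute R^2, MSE, MAE, and RMSE as defined in Eqs. (6)-(9)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    sse = np.sum((y - y_hat) ** 2)        # shared sum of squared errors
    mse = sse / len(y)
    return {
        "R2": 1.0 - sse / np.sum((y - y.mean()) ** 2),  # Eq. (6)
        "MSE": mse,                                      # Eq. (7)
        "MAE": np.mean(np.abs(y - y_hat)),               # Eq. (8)
        "RMSE": np.sqrt(mse),                            # Eq. (9)
    }
```

Note that RMSE is simply the square root of MSE, so both are computed from the same sum of squared errors, and R2 compares that sum against the variance around the mean.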