We will now discuss the methods used to develop and validate the AID-ME model (see also 6–8). Ethical approval was provided by the research ethics board of the Douglas Research Center.
Source of Data and Study Selection
The data used to develop the AID-ME model was derived from clinical trials of antidepressant medications. Clinical trial data was selected primarily to reduce confounds due to uncontrolled clinician behavior, which can change over time23, and to biases in treatment selection or access to care24,25; to reduce data missingness; and, most importantly, because clinical trials contain clearly defined clinical outcomes. All of the studies we used either randomized patients to treatment or assigned all patients to the same treatment (e.g. the first level of STAR*D3), eliminating treatment selection bias operating at the level of the individual clinician. In addition, clinical trials have inclusion and exclusion criteria, which allow some assessment of potential selection biases that may be more covert in naturalistic datasets26,27, though clinical trials are not free of selection bias28. Studies came from three primary sources: the National Institute of Mental Health (NIMH) Data Archive; data provided by clinical researchers in academic settings; and data provided by GlaxoSmithKline and Eli Lilly by way of de-identified data requests administered through the Clinical Study Data Request (CSDR) platform. Results from a model trained on the pharmaceutical dataset are available8.
Studies were selected to be representative of the intended use population: an adult population experiencing an acute episode of major depressive disorder, with demographics reflective of North America and Western Europe. These geographic regions were selected based on the availability of data and the sites likely to participate in the AID-ME study. These would be patients, much like those in the STAR*D dataset, who 1) could be seen in either primary or psychiatric care, 2) could have new-onset or recurrent depression, and 3) would likely have other psychiatric comorbidities in addition to depression. However, the population would not be hospitalized and would have primary MDD rather than depressive symptoms secondary to another medical condition, such as a stroke.
In order to achieve a set of studies consistent with this intended use population, we examined study protocols, publications, and clinical study reports. We excluded: 1) populations under age 18; 2) patients with bipolar depression/bipolar disorder; 3) studies in which the treatment of a major depressive episode was not the main objective (for example, studies of patients with dysthymia, or of patients with depressive symptoms in the context of fibromyalgia who did not meet criteria for an MDE); 4) studies where the MDE was caused by another medical condition; and 5) studies of patients with only mild depression (though studies including patients with both mild and more severe depression were included, with mildly depressed patients later excluded). Given that the AID-ME study was planned to follow patients for 12 weeks, and that guidelines such as CANMAT suggest assessing remission after 6–8 weeks13, we included studies with lengths between 6 and 14 weeks.
Studies were conducted between 1991 and 2016. After working to secure as many studies as possible, we began with 57 studies for consideration. After reducing the number of studies based on the criteria noted above, as well as excluding some studies to ensure that specific drugs were not excessively over-represented, we were left with 22 studies. See the PRISMA diagram (Fig. 1). Important guidance in the selection of treatments for depression is provided by the Canadian Network for Mood and Anxiety Treatments (CANMAT) treatment guidelines for MDD in adults13. These CANMAT guidelines were used to identify treatments to be included in the model and were referred to during interpretation of model results.
Participants
Participants were all aged 18 or older and were of both sexes. They were treated in primary care or psychiatric settings. Participants were excluded if, at study baseline, they had depression of less than moderate severity based on the study's depression rating tool. They were also excluded if they remained in their respective study for less than two weeks, given that it is unlikely that an adequate determination of remission or non-remission could be made in this timeframe. Patients who remained in their studies for at least two weeks were included even if they dropped out of the study early.
Treatments
While a number of treatments were used across the included studies, we could only include treatments in the final model if there were sufficient observations of the treatment available for the model to learn to predict its outcome. In line with our previous work, we did not generate predictions for treatments provided to fewer than 100 patients in the dataset7. In addition, participants taking less than the dose range recommended in the CANMAT guidelines13 were also excluded from the dataset, as the CANMAT guidelines formed an integral part of the AID-ME study. As our modeling technique does not depend on a placebo comparator and no placebo treatment was to be recommended by the model in clinical practice, patients who received placebo were not included. The included medications are noted in Table 1. These include a diverse array of commonly used first-line treatments as well as two combinations of first-line medications which are commonly used as augmentation strategies29. After these exclusions, 9042 patients were included in the final analysis across both datasets.
Outcome
The predicted outcome was remission, cast as a binary variable so that the model could predict remission or non-remission as a label and provide a probability of remission. Remission was chosen as it is the gold-standard outcome in treatment guidelines; achieving remission is important because the presence of significant residual symptoms increases relapse risk13. Remission was derived from scores on standardized questionnaires and binarized based on cutoffs derived from the literature. These included the Montgomery–Åsberg Depression Rating Scale (MADRS; remission defined as a study exit score < 11)30–32, the Quick Inventory of Depressive Symptomatology – Self-Report (QIDS-SR-16; remission defined as a score < 6), and the Hamilton Depression Rating Scale (HAMD; remission defined as a score < 8)30,33. Remission was measured at the latest available measurement for each patient in order to preserve the most information for each patient (such that patients who remit but then become more symptomatic are not incorrectly classified as being in remission)7. Model developers and assessors were not blinded to the selected outcome given the limited size of the team and the need to carefully specify remission variables and cutoffs; however, final testing on the test set was carried out by a team member who was not involved in the development of the AID-ME model.
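For illustration, the binarization step can be sketched as follows; the DataFrame columns and toy values are hypothetical stand-ins, not the study's actual data schema:

```python
import pandas as pd

# Remission cutoffs described above: an exit score below the threshold counts as remission.
REMISSION_CUTOFFS = {"MADRS": 11, "QIDS_SR_16": 6, "HAMD": 8}

def binarize_remission(df: pd.DataFrame, scale_col: str, score_col: str) -> pd.Series:
    """Return 1 if the patient's latest score on their study's scale is below the cutoff, else 0."""
    cutoffs = df[scale_col].map(REMISSION_CUTOFFS)
    return (df[score_col] < cutoffs).astype(int)

# Toy frame of latest available scores per patient (illustrative only).
patients = pd.DataFrame({
    "scale": ["MADRS", "QIDS_SR_16", "HAMD"],
    "latest_score": [9, 7, 5],
})
patients["remission"] = binarize_remission(patients, "scale", "latest_score")
```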
Data preprocessing and creation of stratified data splits
A full method description for the transformation pipeline and quality control measures is available in 8. Briefly, we created a "transformation pipeline" to combine different questions across datasets and prepare them as input variables for feature selection. First, we developed a custom taxonomy inspired by the HiTOP taxonomy system34 to categorize the study questionnaire data across clinical and sociodemographic dimensions. Standard versions of each question were tagged with at least one taxonomic category, which allowed items with the same semantic meaning to be grouped together and combined into a "transformed question" representative of that category. Questions with categorical response values were grouped together and their values were scaled with either linear equating or equipercentile equating, depending on the response value distributions35. No transformation was required if all response values were binary; if there was a mix of categorical and binary questions, the categorical values were "binarized" to either 0 or 1 based on how their response value text compared to the response value text of the natively binary questions.
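A minimal sketch of the equating and binarization operations described above is given below; it uses the generic linear-equating formula and an illustrative response-text lookup, and is not the study's pipeline code:

```python
import numpy as np

def linear_equate(scores_x: np.ndarray, scores_y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Map scores from question X onto the scale of question Y by matching means and
    standard deviations (the generic linear-equating formula, shown for illustration)."""
    mx, sx = scores_x.mean(), scores_x.std()
    my, sy = scores_y.mean(), scores_y.std()
    return my + (sy / sx) * (x - mx)

def binarize_response(text: str, positive_labels: set) -> int:
    """Collapse a categorical response to 0/1 by comparing its text to the natively binary
    items; `positive_labels` is a hypothetical lookup of response texts treated as endorsements."""
    return int(text.strip().lower() in positive_labels)

# Example: equate one item's scores onto another item's scale, and binarize a response.
form_x = np.array([0, 1, 2, 3, 2, 1])
form_y = np.array([0, 2, 4, 6, 4, 2])
print(linear_equate(form_x, form_y, np.array([1, 3])))
print(binarize_response("Yes, most days", {"yes, most days", "yes"}))
```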
To generate our data splits we stratified each division using the binary remission and treatment variables. A sequential splitting method was used to generate the three data splits: utilizing the train_test_split function from scikit-learn36, we first separated our aggregate dataset into a test set with a stratified 90-10 split. The remaining 90% was further divided to form the train and validation sets, comprising 80% and 10%, respectively, of the original data. This process yielded an 80-10-10 split for our datasets. The test set was held out and not used in any way during model training; it was only used for testing after the final model was selected.
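The sequential split can be sketched as follows, using a toy stand-in for the aggregate dataset and stratifying on a combined treatment-by-remission label (the encoding of the joint stratification is an assumption):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the aggregate dataset; the real dataset contains ~9042 patients.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "treatment": rng.choice(["sertraline", "escitalopram", "bupropion"], size=900),
    "remission": rng.integers(0, 2, size=900),
})

# Stratify jointly on treatment and remission by combining them into one label.
strata = df["treatment"] + "_" + df["remission"].astype(str)

# First split: hold out 10% of the data as the test set.
train_val, test = train_test_split(df, test_size=0.10, stratify=strata, random_state=42)

# Second split: validation = 10% of the original data, i.e. 1/9 of the remaining 90%.
strata_tv = train_val["treatment"] + "_" + train_val["remission"].astype(str)
train, val = train_test_split(train_val, test_size=1/9, stratify=strata_tv, random_state=42)
```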
Missing data
We handled missing data as follows. First, features with excessive missingness (over 50%; see 8) were removed from the dataset. After this, data was imputed using multiple imputation by chained equations (MICE) as provided by the Autoimpute package37, which was modeled on the R mice function. To reduce the number of variables available to feed into each variable's imputation, we used multicollinearity analysis38,39. The variance inflation factor, which quantifies how strongly each variable can be predicted from the remaining variables, was computed for each feature; the threshold for exclusion was set at 5, the default. This gave us a reduced set of variables to feed into the MICE algorithm without the extreme multicollinearity that would otherwise be expected given the number of available variables. Using the MICE algorithm we generated 5 imputed datasets and pooled them by averaging continuous variables (imputed with the least-squares or predictive mean matching (PMM) strategies) and taking the mode for binary or ordinal variables40. Race/ethnicity, sex, treatment, and outcome (remission) were not imputed; all included patients had data for these variables.
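The multicollinearity screening can be sketched as follows; this uses statsmodels' variance_inflation_factor for illustration and iteratively drops the worst feature until all VIFs fall at or below 5 (the iterative dropping scheme is an assumption, not the study's exact procedure):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(df: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Iteratively drop the feature with the highest variance inflation factor
    until all remaining VIFs are at or below the threshold."""
    X = df.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() <= threshold:
            return X
        X = X.drop(columns=[vifs.idxmax()])

# Toy example with two nearly collinear columns; one of them is removed.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": a + rng.normal(scale=0.01, size=200), "c": rng.normal(size=200)})
reduced = drop_high_vif(toy)  # the reduced feature set would then be passed to the MICE imputer
```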
Feature selection
As a neural network was used for the classification model, we chose to also use neural networks for feature selection41. To accomplish this we implemented a layer called CancelOut42. CancelOut is a fully connected layer, which allowed us to build a classification model trained on the same remission target. The CancelOut layer has a custom loss function that acts as a scoring method, so that by the end of training we can view and select features based on their scores. The retained features were input into our Bayesian optimization framework, which was used for selection of the optimal hyperparameters.
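A minimal CancelOut-style layer in PyTorch is sketched below; the gate initialization and regularization weight are illustrative assumptions rather than the implementation used in the study:

```python
import torch
import torch.nn as nn

class CancelOut(nn.Module):
    """Sketch of a CancelOut-style feature-selection layer: each input feature is gated by
    a learnable weight, and a penalty added to the classification loss pushes uninformative
    gates toward zero. Feature scores are read from the gates after training."""
    def __init__(self, n_features: int, lambda_reg: float = 0.01):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(n_features) + 4.0)  # start with gates mostly open
        self.lambda_reg = lambda_reg

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.weights)

    def regularization(self) -> torch.Tensor:
        # Added to the training loss; encourages sparsity in the feature gates.
        return self.lambda_reg * torch.sigmoid(self.weights).sum()

# After training, per-feature scores can be inspected as:
# scores = torch.sigmoid(cancelout_layer.weights).detach()
```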
Final Predictors
Predictors, also known as features, were clinical and demographic variables which were selected by the feature selection process. These features are listed in Table 2. The team developing and validating the model was not blinded to features used for outcome prediction.
Model Development
A deep learning model was prepared by arranging a set of hyperparameters (for example, the number of layers, number of nodes per layer, activation function, learning rate, and number of training epochs). During the training process, model hyperparameters were refined using a Bayesian optimization process (BoTorch package)43. The number of layers and the Bayesian optimization targets (AUC and F1 score) were provided to the Bayesian optimization in a fixed manner. Early stopping, with epoch counts constrained to a standard range of 150 to 300, was used to prevent overfitting44. Further details are available in the supplementary material.
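The early-stopping logic within the 150-300 epoch budget can be sketched as follows; train_step, eval_step, and the patience value are hypothetical, and the Bayesian hyperparameter search itself is omitted:

```python
import copy
import torch

def train_with_early_stopping(model, train_step, eval_step,
                              max_epochs=300, min_epochs=150, patience=10):
    """Illustrative early-stopping loop: stop once validation loss has not improved for
    `patience` epochs, but only after the minimum epoch budget has been reached."""
    best_loss, best_state, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_step(model)                 # one pass over the training data
        val_loss = eval_step(model)       # loss on the validation set
        if val_loss < best_loss:
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epoch + 1 >= min_epochs and epochs_without_improvement >= patience:
            break
    model.load_state_dict(best_state)     # restore the best-performing weights
    return model
```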
Models were trained and their performance was optimized using the validation dataset. The best performing architectures were selected to produce a final model. This final model was then run on the test set. This test set was not used in any way during model training. See supplementary material for further details.
Model Usage After Freezing
When the model is used for a new patient after training and validation are completed and the model has been made static, the patient's data for the selected features is input into a forward pass of the model, which is run for each of the possible medications, generating a remission probability for each treatment. We additionally use the saliency map algorithm to generate an interpretability report that provides the top 5 most salient variables for each inference performed7. Finally, we calculate the average remission probability across all predicted treatments. The results are then packaged and returned to the physician to make their determination (see below and Fig. 6). During clinical use, data is provided by the patient via self-report questionnaires and by clinicians using a clinician questionnaire with built-in instructions, meaning that users require only minimal training to provide the data.
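A sketch of this frozen-model inference step is shown below; the treatment-encoding scheme, tensor layout, and single-logit output are assumptions about the interface rather than the deployed system's code:

```python
import torch

@torch.no_grad()
def predict_all_treatments(model, patient_features: torch.Tensor,
                           treatment_encodings: dict) -> dict:
    """Run one forward pass per candidate medication by combining the patient's feature
    vector with a treatment encoding, returning a remission probability per treatment."""
    model.eval()
    probabilities = {}
    for drug, encoding in treatment_encodings.items():
        x = torch.cat([patient_features, encoding]).unsqueeze(0)  # batch of one
        probabilities[drug] = torch.sigmoid(model(x)).item()      # remission probability for this drug
    # Average remission probability across all candidate treatments, also reported to the clinician.
    probabilities["average"] = sum(probabilities.values()) / len(treatment_encodings)
    return probabilities
```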
Primary metrics
Model evaluation was primarily driven by optimization of the area under the receiver operating characteristic curve (AUC) and the F1 score (where remission was predicted if the model estimated a remission probability of 50% or greater). As the AUC is scale- and classification-threshold-invariant, it provides a holistic and well-rounded view of model performance (Huang, 2005). Since the F1 score is the harmonic mean of precision and recall, it helps us assess our model's handling of imbalanced data (Raschka, 2014). Accuracy, positive predictive value (PPV), negative predictive value (NPV), sensitivity, and specificity were also assessed, as these are commonly utilized in clinical studies and are meaningful to both machine learning engineers and clinicians. PPV and NPV are especially important in helping clinicians understand the value of positive and negative predictions.
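These metrics can be computed from test-set labels and predicted probabilities as sketched below (a generic scikit-learn illustration, not the study's evaluation code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score, confusion_matrix

def evaluate(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> dict:
    """Compute the primary metrics from true remission labels and predicted probabilities.
    Assumes both classes and both prediction types occur in the evaluated sample."""
    y_pred = (y_prob >= threshold).astype(int)          # remission predicted at >= 50% by default
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "AUC": roc_auc_score(y_true, y_prob),
        "F1": f1_score(y_true, y_pred),
        "Accuracy": accuracy_score(y_true, y_pred),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "Sensitivity": tp / (tp + fn),
        "Specificity": tn / (tn + fp),
    }
```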
Another important metric was model calibration45. This metric reflects the relationship between the outcome predictions of the candidate model and the observed outcomes of the grouped sample. A model is said to be well calibrated if, for a group of patients with a mean predicted remission rate of x%, close to x% of those individuals actually reached remission.
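Calibration can be assessed by binning predictions and comparing each bin's mean predicted probability to its observed remission rate, for example with scikit-learn's calibration_curve (the synthetic data below is purely illustrative):

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Synthetic stand-in for test-set remission labels and predicted probabilities.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, size=1000)
y_true = rng.binomial(1, y_prob)  # generated to be well calibrated by construction

prob_observed, prob_predicted = calibration_curve(y_true, y_prob, n_bins=10)
for pred, obs in zip(prob_predicted, prob_observed):
    print(f"mean predicted {pred:.2f} -> observed remission rate {obs:.2f}")
```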
There was no difference in metrics used to assess the model during training and testing. A pre-specified model performance of an AUC of at least 0.65 was set as a target. Further discussion of pre-specified metrics is available in the supplementary material.
Secondary Metrics
To estimate clinical utility we used two analysis techniques: the "naive" and "conservative" analyses, which are described in detail in 7 and in the supplementary section. The naive analysis estimates the improvement in the population remission rate by taking, for each patient in the test set, the highest predicted remission rate among the 10 drugs and then averaging these values across patients. The conservative analysis only examines predicted and actual remission rates for patients who actually received the treatment the model identified as optimal, utilizing a bootstrap procedure performed on the combined training and validation sets46. Both metrics allow estimation of predicted improvements in the population remission rate, and we set a 5% minimum absolute increase in population remission rate as the target, consistent with the benefit provided by other decision support systems such as pharmacogenetics47.
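The naive analysis reduces to the following calculation, sketched here under the assumption of a patients-by-treatments array of predicted probabilities:

```python
import numpy as np

def naive_improvement(predicted_probs: np.ndarray, observed_remission: np.ndarray) -> float:
    """`predicted_probs` is an (n_patients, n_treatments) array of predicted remission
    probabilities for the test set; `observed_remission` holds the observed 0/1 outcomes.
    Returns the estimated absolute improvement in the population remission rate."""
    best_per_patient = predicted_probs.max(axis=1)   # highest predicted remission rate per patient
    guided_rate = best_per_patient.mean()            # estimated rate if every patient got their top drug
    return guided_rate - observed_remission.mean()   # improvement over the observed remission rate
```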
Interpretability Report Generation
We generated interpretability reports using the saliency method48 to assess the importance of the input data for each prediction produced. Using GuidedBackprop, we generated a numerical output that designates the importance of each feature in determining the output49, providing a list of prediction-specific important features analogous to the lists of 'risk' and 'protective' factors with which clinicians are familiar. The top 5 features in this list are provided to clinicians7 (see Fig. 6 for an example).
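The manuscript does not name the software used for this step; as one illustration, Captum's GuidedBackprop implementation can produce per-feature attributions as sketched below (function and variable names are hypothetical):

```python
import torch
from captum.attr import GuidedBackprop

def top_salient_features(model, x: torch.Tensor, feature_names: list, k: int = 5):
    """Return the k features with the largest absolute GuidedBackprop attribution for a
    single prediction; assumes a model with a single scalar output per example."""
    gbp = GuidedBackprop(model)
    attributions = gbp.attribute(x.unsqueeze(0)).squeeze(0)       # one attribution per input feature
    order = attributions.abs().argsort(descending=True)[:k]       # indices of the most salient features
    return [(feature_names[i], attributions[i].item()) for i in order.tolist()]
```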
Bias Testing
We utilized a bias testing procedure we have previously described in 8 to determine whether the model had learned to amplify harmful biases. For race, sex, and age groupings, we created a bar graph depicting the true remission rate for each group as well as the mean predicted remission probability for that group. This allows us to assess whether the model has learned to propagate any harmful biases, and to examine disparities in outcomes based on sociodemographics in the raw data (e.g. worse outcomes for non-Caucasian groups). We also verified the model's performance with respect to the observed remission rates for each treatment.
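The bar graph comparison can be sketched as follows; the group labels and values passed in are placeholders for the observed and predicted rates computed from the data:

```python
import numpy as np
import matplotlib.pyplot as plt

def bias_bar_chart(groups: dict):
    """For each demographic group, plot the observed remission rate next to the mean
    predicted remission probability; `groups` maps a group label to a tuple
    (observed_rate, mean_predicted_probability)."""
    labels = list(groups)
    observed = [groups[g][0] for g in labels]
    predicted = [groups[g][1] for g in labels]
    x = np.arange(len(labels))
    plt.bar(x - 0.2, observed, width=0.4, label="Observed remission rate")
    plt.bar(x + 0.2, predicted, width=0.4, label="Mean predicted probability")
    plt.xticks(x, labels)
    plt.ylabel("Remission")
    plt.legend()
    plt.show()

# Illustrative values only.
bias_bar_chart({"Group A": (0.42, 0.40), "Group B": (0.38, 0.39)})
```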
Sensitivity Analyses
Sensitivity analysis was used to probe model behavior in response to manipulation of specific variables to determine if the model responds in a manner that is consistent with previous evidence. For example, we can plot how the remission probability changes in response to artificially increasing suicidality; the model should produce a clear trend towards lower remission probabilities as suicidality increases10.
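This type of sensitivity sweep can be sketched as follows; the feature index, value range, and single-logit output are assumptions made for illustration:

```python
import torch

@torch.no_grad()
def sweep_feature(model, x: torch.Tensor, feature_index: int, values) -> list:
    """Hold a patient's other features fixed, vary one input (e.g. a suicidality item)
    across a range of values, and record the predicted remission probability at each value."""
    model.eval()
    probs = []
    for v in values:
        x_mod = x.clone()
        x_mod[feature_index] = v
        probs.append(torch.sigmoid(model(x_mod.unsqueeze(0))).item())
    return probs  # a downward trend is expected as suicidality increases
```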