Design
We carried out a retrospective cross-sectional study to evaluate the validity of the EHR for identifying dementia-related NPS. Data were obtained from a population registry of dementia cases built using the Basque Health Service's institutional database, Oracle Business Intelligence (OBI), which contains administrative and clinical records for primary, inpatient, emergency and outpatient care in anonymized form and is updated daily [4, 12]. The study protocol was approved by the Clinical Research Ethics Committee (CEIC) of the Basque Country (registration number PI2018143 EPA-OD).
The registry included all patients diagnosed with dementia in OBI, but our target population consisted only of individuals alive on 31 December 2018 (n = 31,000). The validation of the dementia diagnosis in this registry showed adequate predictive values (positive and negative predictive values of 95.1% and 99.4%, respectively) [12]. The criteria used for the diagnosis of dementia are described in the Supplementary Material. As previously noted, NPS are poorly coded but are recorded as text in the EHR [4]. Since electronic prescribing was fully deployed in 2008, medication prescriptions are recorded in OBI with high accuracy. Our hypothesis was that drug prescriptions, together with other clinical variables, could be used to build a predictive model to identify dementia-related NPS in our institutional database. Therefore, we carried out a validation study classifying NPS into two patterns characterized, on the one hand, by depressive or mood disorders and, on the other, by behavioral or psychotic disorders [4]. The dataset used and analyzed during the current study is available from the corresponding author on reasonable request.
Validation study
In a random sample of patients with dementia, the EHRs were individually reviewed by a trained clinical coding technician, supervised by a psychiatrist, looking within physicians' notes for evidence of the presence of the two types of symptoms. The technician was blinded to the OBI diagnostic codes. Within NPS, we differentiated between mood disorders (depression, anxiety and apathy) and psychotic or behavioral disorders (aggressiveness, irritability, restlessness, shouting, visual and auditory hallucinations, and delusions). The terms mood disorders, apathy, bradypsychia, psychomotor slowness, sadness, depression, anxiety and negativism were sought as markers of depressive symptoms or mood disorders. The terms psychotic symptoms, behavioral symptoms, agitation, irritability, aggressiveness, restlessness, screams, visual or auditory hallucinations, delusions, alterations of behavior, erratic wandering, escape attempts, disinhibition and rejection of care were sought as markers of psychotic or behavioral symptoms.
Variables
The EHR review supplied data on the two response variables used in the predictive models, namely, the presence of mood and/or behavioral symptoms. Regarding the explanatory variables, the following were considered: age, sex, institutionalization status, concomitant diagnoses (diabetes mellitus, hypertension, dyslipidemia, thyroid disease, Parkinson's disease, stroke, cardiovascular disease, head trauma, depressive disorder and psychotic disorder) and pharmacological treatment. We collected data on all prescriptions of medications in the following Anatomical Therapeutic Chemical Classification System subgroups: N06D (donepezil, rivastigmine, galantamine and memantine), N06A (antidepressants), N05A and N06C (antipsychotics), N05B (anxiolytics) and N05C (hypnotics). As the hypothesis that the prescribing of antidepressants and antipsychotics can be used to detect NPS in population databases was the rationale for the current study, all prescriptions and changes in prescriptions involving the aforementioned subgroups were recorded. This data collection process resulted in a longitudinal dataset with n data instances per participant, n being the number of different drug prescriptions issued to them over time. This longitudinal information was then used to create new summarizing variables yielding a single data instance per participant. These summarizing variables included baseline features, concomitant diagnoses over time, sedative effects (the highest level of sedation ever prescribed to the patient), drug prescriptions and changes therein (number of antidepressants prescribed, number of antipsychotics prescribed, number of changes from antidepressants to antipsychotics and number of changes from antipsychotics to antidepressants) and the two response variables (NPS documented in the EHR notes) (Table S1 in the Supplementary Material).
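The collapse of the longitudinal prescription history into one data instance per participant can be sketched as follows. The actual pipeline was implemented in R; this is a minimal Python/pandas illustration with hypothetical column names (`patient_id`, `date`, `atc_group`), counting prescriptions per ATC subgroup and switches between antidepressants (N06A) and antipsychotics (N05A) over a patient's chronologically ordered history.

```python
import pandas as pd

# Hypothetical long-format prescription data: one row per prescription event.
rx = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "date": ["2016-01-05", "2016-06-10", "2017-02-01",
             "2016-03-15", "2016-09-20"],
    # Patient 1: antidepressant -> antipsychotic -> antidepressant
    "atc_group": ["N06A", "N05A", "N06A", "N06A", "N06A"],
})
rx["date"] = pd.to_datetime(rx["date"])
rx = rx.sort_values(["patient_id", "date"])

def summarize(g: pd.DataFrame) -> pd.Series:
    """Collapse one patient's prescription history into a single data instance."""
    atc = g["atc_group"].tolist()
    pairs = list(zip(atc, atc[1:]))  # consecutive prescription pairs
    return pd.Series({
        "n_antidepressants": atc.count("N06A"),
        "n_antipsychotics":  atc.count("N05A"),
        "n_ad_to_ap": sum(a == "N06A" and b == "N05A" for a, b in pairs),
        "n_ap_to_ad": sum(a == "N05A" and b == "N06A" for a, b in pairs),
    })

summary = rx.groupby("patient_id").apply(summarize)
```

In the real dataset these counts would sit alongside the baseline features, concomitant diagnoses and sedation variable described above.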
The level of sedation produced by each drug was categorized (0: none; 1: minimum; 2: mild; 3: moderate; 4: deep) as set out in Table S2 in the Supplementary Material.
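As a small illustration of the sedation variable, assuming a drug-to-level lookup of the kind given in Table S2 (the scores below are placeholders, not the published mapping), the summarizing variable is simply the maximum level over a patient's prescription history:

```python
# Placeholder sedation scores (0: none ... 4: deep); the real mapping is in Table S2.
SEDATION = {"donepezil": 0, "sertraline": 1, "haloperidol": 3, "quetiapine": 4}

def highest_sedation(drugs):
    """Highest level of sedation ever prescribed to the patient (0-4)."""
    return max((SEDATION.get(d, 0) for d in drugs), default=0)
```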
All preprocessing of the dataset and all predictive modelling were done in R.
Machine learning
The sample (N) was randomly divided into a training set (N1 = 0.75 * N) and a validation set (N2 = 0.25 * N), and we checked that patient characteristics did not differ between the two sets. The random forest ML approach, fully described in the Supplementary Material, was applied to build the predictive models [20]. The random forest algorithm [21] is a stochastic ensemble method that uses bagging, a combination of bootstrapping and aggregation of weak learners (specifically, decision trees), to detect patterns in the data and use them to predict outcomes, in our case NPS [19].
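The split-and-fit step can be sketched as follows, in Python with scikit-learn rather than the R implementation actually used, on synthetic data standing in for the patient-level dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real patient-level dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)

# 75/25 split, mirroring N1 = 0.75*N and N2 = 0.25*N; stratifying keeps
# the outcome prevalence comparable between the two sets.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Random forest: a bagging ensemble of decision trees.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=42)
rf.fit(X_tr, y_tr)
```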
In the training set, we followed a stepwise process, beginning with baseline models whose performance was then improved by iteratively adding other explanatory variables to test their contribution. Mean decrease in accuracy was used to assess the relative importance of the variables in the models [22]. This technique computes the accuracy of each tree in the model on that tree's out-of-bag sample. Then, for each variable in turn, it permutes the variable's values and measures how much the accuracy changes. The decrease in accuracy resulting from this permutation, averaged over all trees, is used as the measure of each variable's importance in the random forest model.
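An analogous permutation-based importance can be sketched with scikit-learn; note this is an assumption-laden stand-in, since `permutation_importance` scores permutations on a supplied dataset rather than on each tree's out-of-bag sample as R's randomForest mean decrease in accuracy does:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Permute each feature in turn and measure the resulting drop in accuracy;
# the drop, averaged over repeats, is the feature's importance.
result = permutation_importance(rf, X, y, scoring="accuracy",
                                n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]  # most important first
```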
All the predictive models were evaluated using a k-fold cross-validation approach, with k = 10 and 10 repetitions. The main advantage of this evaluation technique is that it maximizes the availability of data for training the models, as it allows every data instance to be used for both training and validation in different iterations. In addition, it gives accurate estimates of the performance of the prediction models on unseen data. The same process was carried out separately for the psychotic and depressive symptom models, for which discriminatory power was assessed.
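The repeated 10-fold scheme can be illustrated with scikit-learn (again a Python sketch on synthetic data, not the authors' R code); 10 folds repeated 10 times yields 100 fitted models and 100 performance estimates:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=8, random_state=1)
rf = RandomForestClassifier(n_estimators=50, random_state=1)

# k = 10 folds, 10 repetitions: every instance serves for both training
# and validation across iterations, for 100 fits in total.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
scores = cross_val_score(rf, X, y, cv=cv, scoring="roc_auc")
```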
Discriminatory power refers to the ability of a prediction model to distinguish between two outcome classes. To evaluate the classification ability of the models, the following statistics were calculated for each model: the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, accuracy, no-information rate and Kappa index. The AUC gives an overview of a model's ability to discriminate between positive and negative classes, independently of their prevalence, and is therefore suitable for imbalanced datasets. Sensitivity, or the true positive rate, is the proportion of positive-class cases that the model predicts correctly, while specificity, or the true negative rate, is the proportion of negative-class cases actually predicted as negative. The no-information rate is the accuracy that can be achieved without a model, i.e., by always predicting the majority class, and a model's accuracy is the percentage of correct classifications it provides. The Kappa index measures the agreement between two approaches to classifying mutually exclusive categories, agreement being characterized as slight (for values of 0–0.20), fair (0.21–0.40), moderate (0.41–0.60) or substantial (0.61–0.80) [23].
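These statistics can be computed from a confusion matrix and predicted probabilities; a minimal Python/scikit-learn sketch on a toy prediction vector (the study's statistics were produced in R):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, roc_auc_score)

# Toy labels and predicted probabilities for illustration only.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.2])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)        # true negative rate
accuracy = accuracy_score(y_true, y_pred)
# No-information rate: accuracy of always predicting the majority class.
nir = max(np.bincount(y_true)) / len(y_true)
kappa = cohen_kappa_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)  # prevalence-independent discrimination
```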
Evaluation of model performance in the validation dataset
Model performance was measured by assessing both calibration and discrimination in the validation set [24–26]. Calibration is related to goodness-of-fit, which reflects the agreement between observed outcomes and predictions. To assess this, a calibration curve was drawn by plotting the predicted probabilities for groups on the x-axis and the mean observed values on the y-axis. Finally, discriminatory power was assessed with the same statistics as in the training stage.
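A calibration curve of this kind can be produced, for instance, with scikit-learn's `calibration_curve` (a Python sketch on simulated predictions, not the authors' R code): predicted probabilities are grouped into bins, and the mean prediction in each bin is plotted against the observed event fraction, with points near the diagonal indicating good calibration.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 1000)
# Simulate outcomes whose true event probability equals the prediction,
# i.e. a perfectly calibrated model.
y_true = (rng.uniform(0, 1, 1000) < y_prob).astype(int)

# Observed event fraction (y-axis) vs mean predicted probability (x-axis)
# in each of 10 probability bins.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
```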