Validation of machine learning models to predict dementia-related neuropsychiatric symptoms in real-world data

Background Neuropsychiatric symptoms (NPS) are the leading cause of the social burden of dementia, but their role is underestimated. The objective of the study was to validate predictive models to separately identify psychotic and depressive symptoms in patients diagnosed with dementia using clinical databases representing the whole population (real-world data). Methods First, we searched the electronic health records of 4,003 patients with dementia to identify NPS. Second, machine learning (random forest) algorithms were applied to build, in the training sample (N=3,003), separate predictive models for psychotic and depressive symptoms. In order to evaluate the classification ability of the models, the following statistics were calculated for each model: the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, accuracy, no-information rate and Kappa index. Third, calibration and discrimination were assessed in the validation sample (N=1,000) to assess the performance of the models. A calibration curve was drawn by plotting the predicted probabilities for groups on the x-axis and the mean observed values on the y-axis. Results Neuropsychiatric symptoms were noted in the electronic health record of 58% of patients. The AUC reached 0.80 for the psychotic symptoms model and 0.74 for the depressive symptoms model. The Kappa index and accuracy also showed better discrimination in the psychotic model. Calibration plots indicated that both models had lower predictive accuracy when the probability of neuropsychiatric symptoms was < 25%. The most important variables in the psychotic symptom model were use of risperidone, level of sedation, use of quetiapine and haloperidol, and the number of antipsychotics prescribed. In the depressive symptom model, the most important variables were the number of antidepressants prescribed, use of escitalopram, level of sedation and age.
Conclusions More than half of the sample had NPS as identified by the presence of key terms in the electronic health record. Although NPS are not coded, they are treated with antipsychotics and antidepressants, which makes it possible to develop valid predictive models by combining machine learning tools with real-world data. Given their good performance, the predictive models can be used to estimate the prevalence of NPS in population databases.


Introduction
Neuropsychiatric symptoms (NPS) are the leading cause of the social burden of dementia, as they constitute the key factor in families' giving up on keeping patients at home [1,2]. One might assume that they are well recognized, given that 8% of new drugs currently under evaluation for Alzheimer's disease are intended to treat NPS [3]. On the other hand, their population impact is underestimated, as they are not well coded in health records [4][5][6], which implies inadequate clinical management [7].
In order to break the vicious circle established by the underestimation of the impact and the lack of proper recording of these symptoms, there is a need for tools providing information to monitor intervention plans from a population perspective [1,8,9].
The prevalence of NPS has been measured in clinical samples using questionnaires such as the Neuropsychiatric Inventory (NPI), which are rarely applied in routine clinical practice [4,10], and the prevalence found varies depending on whether cases are identified in the community or in nursing homes.
Specifically, NPS are less common (56-98%) and less severe in individuals with dementia living in the community than in those in hospitals or long-term care facilities (91-96%) [1,11]. The problem with these figures is that they cannot be extrapolated to populations as a whole due to the heterogeneous distribution of dementia stages [9]. This distribution could be measured by analyzing a random sample of the general population using a door-to-door survey [4]. Notably, however, another study design is now feasible, based on anonymized databases built from electronic health records (EHRs). This approach based on real-world data (RWD) has already been used to validate the diagnosis of dementia [12,13] and the presence of agitation [6], but not to explore rates of NPS. On the other hand, validation studies are required in order to systematically use RWD as a source of epidemiological information [7,14]. RWD have been described in an Organisation for Economic Co-operation and Development report as "broad data" because they cover large populations but include limited amounts of outcome and exposure data [15]. In line with this, machine learning (ML) tools have been proposed as having greater capacity to predict complex clinical conditions such as NPS [15] and as being able to convert RWD into "smart data" [15,16]. An example of this would be the calculation of the prevalence of NPS in population samples. While this design has previously been applied in cardiovascular research [17] and Alzheimer's disease neuroimaging [18], no examples have been reported of its use to measure features of dementia-related NPS at population level [19]. Therefore, the objective of this study was to construct and validate predictive models based on ML tools to identify the presence of psychotic and/or depressive symptoms in dementia-diagnosed patients from administrative and clinical databases that cover entire populations.

Design
We carried out a retrospective cross-sectional study to evaluate the validity of the EHR to identify dementia-related NPS. Data were obtained from a population registry of dementia cases built using the Basque Health Service's institutional database, Oracle Business Intelligence (OBI), containing administrative and clinical records for primary, inpatient, emergency and outpatient care in an anonymized form which are updated daily [4,12]. The study protocol was approved by the Clinical Research Ethics Committee (CEIC) of the Basque Country (registration number PI2018143 EPA-OD).
The registry included all patients diagnosed with dementia in OBI, but our target population consisted only of individuals alive on 31 December 2018 (n = 31,000). The validation of the diagnosis of dementia in this registry showed adequate predictive values (positive and negative predictive values of 95.1% and 99.4%, respectively) [12]. Criteria used for the diagnosis of dementia are described in the supplementary material. As previously noted, NPS are poorly coded, but they are recorded as text in the EHR [4]. Given the full deployment of electronic prescriptions from 2008, medication prescribing is highly accurately recorded in OBI. Our hypothesis was that drug prescriptions, together with other clinical variables, could be used to build a predictive model to identify dementia-related NPS in our institutional database. Therefore, we carried out a validation study classifying NPS into two patterns characterized, on the one hand, by depressive or mood disorders, and on the other, by behavioral or psychotic disorders [4]. The dataset used and analyzed during the current study is available from the corresponding author on reasonable request.

Validation study
In a random sample of patients with dementia, the EHRs were individually reviewed by a trained clinical coding technician, supervised by a psychiatrist, looking within physicians' notes for evidence of the presence of the two types of symptoms. The technician was blinded to the OBI diagnostic codes. Within NPS, we differentiated between mood disorders (depression, anxiety and apathy) and psychotic disorders (aggressiveness, irritability, restlessness, screaming, visual or auditory hallucinations, and delusions). The terms mood disorders, apathy, bradypsychia, psychomotor slowness, sadness, depression, anxiety and negativism were sought as markers of depressive symptoms or mood disorders; and psychotic symptoms, behavioral symptoms, agitation, irritability, aggressiveness, restlessness, screams, visual or auditory hallucinations, delusions, alterations of behavior, erratic wandering, escape attempts, disinhibition and rejection of care as markers of psychotic or behavioral symptoms.

Variables
The EHR review supplied data on the two response variables used in the predictive models, namely, the presence of mood and/or behavioral symptoms. Regarding the explanatory variables, the following were considered: age, sex, institutionalization status, concomitant diagnoses (diabetes mellitus, hypertension, dyslipidemia, thyroid disease, Parkinson's disease, stroke, cardiovascular disease, head trauma, depressive disorder and psychotic disorder) and pharmacological treatment.
We collected data on all prescriptions of medications in the following specific Anatomical Therapeutic Chemical Classification System subgroups: N06D (donepezil, rivastigmine, galantamine and memantine), N06A (antidepressants), N05A and N06C (antipsychotics), N05B (anxiolytics) and N05C (hypnotics). As the hypothesis that the prescribing of antidepressants and antipsychotics can be used to detect NPS in population databases was the rationale for the current study, all the prescriptions and changes in prescriptions involving the aforementioned subgroups were recorded. This data collection process resulted in a longitudinal dataset with n data instances per participant, n being the number of different drug prescriptions issued to them over time. This longitudinal information was then used to create new summarizing variables to obtain a single data instance per participant. These summarizing variables included baseline features, concomitant diagnoses over time, sedative effects (highest level of sedation ever prescribed to the patient), drug prescriptions and changes therein (number of antidepressants prescribed, number of antipsychotics prescribed, number of changes from antidepressants to antipsychotics and number of changes from antipsychotics to antidepressants) and the two response variables (NPS documented in the EHR notes) (Table S1 in the Supplementary Material). The level of sedation produced by each drug was categorized (0: none; 1: minimum; 2: mild; 3: moderate; 4: deep) as set out in Table S2 in the Supplementary Material.
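The authors performed this preprocessing in R; as an illustration only, the collapse of the longitudinal prescription data into one summarizing row per participant could be sketched in pandas as follows. All column names (`patient_id`, `atc_group`, `sedation`) and the toy data are hypothetical, not the study's actual variables.

```python
import pandas as pd

# Hypothetical long-format data: one row per prescription event per patient.
rx = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "atc_group":  ["N06A", "N05A", "N05A", "N06A", "N06A"],  # N06A antidepressant, N05A antipsychotic
    "sedation":   [1, 3, 4, 0, 2],                           # 0: none .. 4: deep
})

# Collapse to a single data instance per participant with summarizing variables.
summary = rx.groupby("patient_id").agg(
    n_antidepressants=("atc_group", lambda g: int((g == "N06A").sum())),
    n_antipsychotics=("atc_group", lambda g: int((g == "N05A").sum())),
    max_sedation=("sedation", "max"),  # highest level of sedation ever prescribed
).reset_index()
```

Counts of switches between drug classes (antidepressant to antipsychotic and vice versa) could be derived analogously by comparing each prescription with the previous one within each patient's date-ordered history.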
All the preprocessing of the dataset and predictive modelling was done in R.

Machine learning
The sample (N) was randomly divided into a training set (N1 = 0.75 × N) and a validation set (N2 = 0.25 × N). It was checked that patient characteristics did not differ between the training and validation sets. The random forest ML approach, fully described in the Supplementary Material, was applied to build predictive models [20]. The random forest algorithm [21] is a stochastic ensemble method that uses bagging, a combination of bootstrapping and aggregation of weak learners (specifically, decision trees), to detect patterns in data and use these to predict outcomes, in our case, NPS [19].
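The study's models were built in R; a minimal scikit-learn sketch of the same design, a 75/25 split followed by a random forest fit, is shown below. The synthetic data stand in for the real cohort features (prescriptions, sedation level, diagnoses, age, sex), and all parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dementia cohort (4,000 patients, 20 features).
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)

# 75% training / 25% validation split, as in the study design.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Bagged ensemble of decision trees (random forest) fit on the training set.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_tr, y_tr)
```

The validation set is held out entirely until the final calibration and discrimination assessment.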
In the training set, we followed a stepwise process beginning with baseline models whose performance was improved by adding other explanatory variables iteratively to test their contribution. Mean decrease in accuracy was used to assess the relative importance of the variables in the models [22]. This technique computes the accuracy of the trees that make up the model on the out-of-bag sample of each tree. Then, for each variable in turn, it permutes the values of that variable and measures how much the accuracy changes. The decrease in accuracy resulting from this permutation is averaged over all trees and used as a measure of the importance of each variable in the random forest model.
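The permutation idea behind mean decrease in accuracy can be sketched with scikit-learn's `permutation_importance`. Note one assumed difference from the study's R workflow: R's `randomForest` permutes within each tree's out-of-bag sample, whereas this sketch permutes on a held-out set; the ranking logic is analogous.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8,
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in turn and measure the resulting drop in accuracy,
# averaged over repeats; a larger drop means a more important variable.
imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
ranking = imp.importances_mean.argsort()[::-1]  # most important first
```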
All the predictive models have been evaluated using a k-fold cross validation approach, with k = 10 and 10 repetitions. The main advantage of this evaluation technique is that it maximizes the availability of data for training the models, as it allows all the data instances to be used both for training and validation purposes in different iterations. In addition, it gives accurate estimates of the performance of the prediction models for unseen data. The same process was carried out separately for the psychotic and depressive symptom models for which discriminatory power was assessed.
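The repeated k-fold scheme described above (k = 10, 10 repetitions, so 100 train/validate iterations per model) could be sketched as follows; the AUC scoring choice mirrors the study's main discrimination statistic, while the data and model settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# 10-fold cross-validation repeated 10 times: every instance is used for
# both training and validation across iterations.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=cv, scoring="roc_auc",
)
# scores holds one AUC per fold per repetition (100 values); their mean and
# spread estimate performance on unseen data.
```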
Discriminatory power refers to the ability of a prediction model to distinguish between two outcome classes. In order to evaluate the classification ability of the models, the following statistics were calculated for each model: the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, accuracy, no-information rate and Kappa index. The AUC gives an overview of a model's ability to discriminate between positive and negative classes, independently of their prevalence, and is therefore suitable for imbalanced datasets. Sensitivity, or the true positive rate, is defined as the proportion of cases from the positive class that were predicted correctly by the model, while specificity, or the true negative rate, refers to the proportion of cases from the negative class that were actually predicted as negative. The no-information rate is the accuracy that can be achieved without a model, i.e., by always predicting the majority class.

Evaluation of model performance in the validation dataset
Model performance was measured by assessing both calibration and discrimination in the validation set [24][25][26]. Calibration is related to goodness-of-fit, which reflects the agreement between observed outcomes and predictions. To assess this, a calibration curve was drawn by plotting the predicted probabilities for groups on the x-axis and the mean observed values on the y-axis. Finally, discriminatory power was assessed with the same statistics as in the training stage.
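The discrimination statistics and the calibration curve described above can be computed as in the following sketch. The labels and predicted probabilities are simulated stand-ins for a model's validation-set output.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, roc_auc_score)

# Hypothetical validation-set labels and predicted probabilities.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
y_prob = y_true * 0.3 + rng.random(1000) * 0.7  # probabilities in [0, 1]
y_pred = (y_prob >= 0.5).astype(int)

# Discrimination statistics used in the study.
auc = roc_auc_score(y_true, y_prob)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)        # true negative rate
accuracy = accuracy_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
nir = np.bincount(y_true).max() / len(y_true)  # no-information rate

# Calibration curve points: mean predicted probability per bin (x-axis)
# vs observed event rate in that bin (y-axis).
obs, pred = calibration_curve(y_true, y_prob, n_bins=10)
```

A well-calibrated model yields `obs ≈ pred` across bins, i.e., points lying on the 45° line of the calibration plot.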

Results
The resulting dataset contained 62 variables and 4,003 cases, the main features of which are described in Table 1. Psychotic symptoms were documented in the EHR of 58% of the population and depressive or mood disorders in 59%. The dataset was randomly divided into a training set (N1 = 3,003) and a validation set (N2 = 1,000). Both types of symptoms were more common in men than in women. The pattern with age differed: the group with psychotic symptoms was older and the group with depressive symptoms younger. Living in a nursing home was strongly associated with both types of symptoms.

Table 2 shows the performance of the models tested for psychotic and depressive symptoms in both training and validation sets. The iterations and modelling variables tested in each model are also summarized in Table 2. The analysis of the raw data without other variables indicates that antipsychotic prescribing is more specific and antidepressant prescribing more sensitive for identifying NPS. Notably, the models seeking to predict psychotic symptoms performed better, reaching an AUC of 0.80, than the depressive symptom models (maximum AUC of 0.74). Other statistics, such as the kappa index and accuracy, also showed that the psychotic symptom models had better discriminatory power. Regarding calibration (Table 2 and Fig. 3), the curves demonstrate graphically the better predictive ability of the psychotic symptom model.

Discussion
To our knowledge, this is the first time that ML techniques have been applied to build and validate models based on real-world population data to estimate the prevalence of NPS in dementia. This study shows that ML-based models are good at predicting dementia-related NPS and opens the prospect of applying these techniques in population databases. The raw results showed that more than half of the sample had NPS as identified by the presence of key terms in the EHR. This fits well with the findings of Halpern et al. who found evidence of agitation in 44.6% of all patients when analyzing the EHR notes of dementia patients [6].
Considering AUCs of 0.70-0.79 to indicate acceptable and ≥ 0.80 excellent discrimination, the psychotic symptom model can be classified as excellent and the depressive symptom model as acceptable [27]. Given these classifications, it would be valid to apply the results to the whole population database. These results are also consistent with the expected higher specificity of models for psychotic symptoms.
When data collected from EHRs are applied to research, avoiding false-positive diagnoses may be more important than avoiding false negatives. In longitudinal studies, for example, false positives can dilute observed effects and reduce statistical power [28]. Therefore, our approach is valid for epidemiological research on dementia-related NPS. Consistent with the AUC values, other statistics used, namely accuracy and the kappa index, also indicated that the model predicting psychotic symptoms performed better.
As previously mentioned, however, the calibration was poor when the probability of NPS estimated by the model was low. This implies that both models systematically underestimated the disease rates observed in the EHR. The explanation may be that when symptoms are of recent onset, their recording in the EHR is not yet accompanied by pharmacological treatment.
Nonetheless, in more advanced stages of the disease, clinicians treat such symptoms, and hence, the calibration line overlaps the 45° line of the plot. The models' calibration is excellent for these late stages, observed and predicted cases fitting well. Various authors have underlined the need to integrate evidence from heterogeneous sources including clinical trials, cohort data and RWD to evaluate disease progression and build health economic models for dementia treatment [15,31,32]. Nonetheless, RWD lack consistency in the collection of outcomes and monitoring of disease severity. In this scenario, validation of variables available from EHRs appears as a key first step towards an AD/dementia integrated curated data environment fed from multiple sources [15]. Brayne et al. pointed out the crucial importance of approaches to dementia research being anchored in the true population as selective participation in observational studies may systematically bias findings [33].
Our work is not without limitations, the main one being that no validated scale such as the NPI was used to identify the presence of NPS in patients with dementia [10]. Finding specific terms in an EHR review only reveals that a physician recorded symptoms linked to the presence of behavioral and depressive disorders. The identification of dementia cases could also be deemed problematic, as some authors have questioned the use of Medicare claims to identify dementia [34]. However, the primary purpose of such claims is physician reimbursement. In contrast, our database is obtained directly from a unified EHR used by all healthcare professionals (physicians and nurses) to document all patients' contacts with the health service in all care settings (primary, emergency, inpatient, home and outpatient care). The system includes an automatic coding system (ICD-10) managed by physicians when they provide care to patients. A neurologist or general practitioner is not able to move forward in the EHR if the episode is not assigned a diagnosis that is automatically coded.
Canadian researchers have applied a similar approach for identifying Parkinson's disease and dementia with good results [13,35].
We have applied a binary classification that simplifies the heterogeneous way in which doctors describe NPS in the EHR. Moreover, we have consciously avoided including sleep disturbances within the scope of the research due to the bidirectional relationship between sleep disturbances and dementia [5]. Given that it is unclear whether dementia is a cause or consequence of sleep disturbance, we believe that the interpretation of predictive models based on the use of hypnotics would be very difficult, and hence, for the time being, have focused on psychotic and mood disorders.

Conclusions
More than half of the sample of dementia patients had NPS as identified by the presence of key terms in the electronic health record. Although NPS are not coded in the diagnosis registry, they are treated with antipsychotics and antidepressants, which makes it possible to develop valid predictive models by combining machine learning tools with real-world data. Given their good performance, the predictive models can be used to estimate the prevalence of NPS in population databases.

Ethical Approval and Consent to participate
The authors assert that all procedures contributing to this work comply with the ethical standards of

Consent for publication
Each of the authors has substantially contributed to conducting the underlying research and agrees with the contents of the manuscript.

Availability of data and materials
The dataset used and analyzed during the current study is available from the corresponding author on reasonable request.