Predictive Models and Features of Patient Mortality across Dementia Types

Dementia care is challenging due to the divergent trajectories in disease progression and outcomes. Predictive models are needed to identify patients at risk of near-term mortality. Here, we developed machine learning models predicting survival using a dataset of 45,275 unique participants and 163,782 visit records from the U.S. National Alzheimer’s Coordinating Center (NACC). Our models achieved an AUC-ROC of over 0.82 utilizing nine parsimonious features for all one-, three-, five-, and ten-year thresholds. The trained models mainly consisted of dementia-related predictors such as specific neuropsychological tests and were minimally affected by other age-related causes of death, e.g., stroke and cardiovascular conditions. Notably, stratified analyses revealed shared and distinct predictors of mortality across eight dementia types. Unsupervised clustering of mortality predictors grouped vascular dementia with depression and Lewy body dementia with frontotemporal lobar dementia. This study demonstrates the feasibility of flagging dementia patients at risk of mortality for personalized clinical management.


Introduction
Dementia has become a growing public health concern, classified as the seventh leading cause of death 1 and the fourth most burdensome disease or injury in the United States in 2016 based on years of life lost 2 . As of 2022, an estimated $1 trillion of global annual costs 3 can be attributed to Alzheimer's disease and other dementias, which affect an estimated 6.5 million Americans 4 and 57.4 million people worldwide, and those numbers are expected to triple by 2050 5 . Unfortunately, the true mortality burden associated with dementia may still be underestimated, as dementia itself tends to be underreported on death certificates as the underlying cause of death 6 . This immense healthcare burden of dementia can be attributed to the lack of curative drugs 7,8 , the challenge in predicting patient trajectory, and the intrinsic difficulty in diagnosing dementia, which often
The whole cohort of 45,275 unique NACC individuals in this 2005-2021 time period was analyzed to compare patient survival across different dementia types. We estimated survival time using the Kaplan-Meier method (Fig. 1b,c). Survival probability differed across primary etiologic diagnoses of dementia types (Fig. 1b). Patients with prion disease showed a shorter overall median survival time than patients with other dementia types (p < 0.0001). This is consistent with the rapid onset and progression of prion disease 28 . The overall median survival time for Alzheimer's disease was not reached, with 5- and 7-year survival rates of 76.05% and 66.63%, respectively. The overall median survival time in Lewy body disease was 98.3 months (95% CI 84.2-119.5), with 5- and 7-year survival rates of 60.0% and 52.0%, respectively.
To illustrate the relationship between dementia severity and survival, we performed a survival analysis based on global clinical dementia rating (CDR) scores (Fig. 1c). For a global CDR score of 0, the overall median survival time was not reached, with 5- and 7-year survival rates of 93.7% and 90.1%, respectively. For global CDR scores of 1, 2, and 3, the overall median survival time was 141.8 months (95% CI 128.8-NA), 75.0 months (95% CI 69.0-81.0), and 27.3 months (95% CI 25.3-29.3), respectively. With increasing global CDR score, which represents more severe cognitive impairment, patients generally showed worse outcomes. Moreover, to determine whether these trends were reflected in patients with comorbidities (i.e., other causes of death such as cancer and cardiovascular disease), we performed additional survival analysis on global CDR scores stratified by disease type. The results showed that regardless of whether or not dementia patients suffered from cancer or heart conditions, global CDR score remained significantly associated with mortality (Supplementary Fig. 1 Fig. 2). We trained our models accordingly and employed Bayesian optimization 29 to select the optimal hyperparameters for each model.
The two-feature XGBoost models achieved an AUC-ROC of over 0.76 at all survival thresholds, though the higher thresholds achieved much higher AUC-PR scores, likely due to the large class imbalances at the lower thresholds. The full table of model performance for the two-feature XGBoost models in the test and validation sets is shown in Supplementary Table 2, and the AUC-ROC curves are shown in Supplementary Fig. 3. Overall, these basic models confirmed that age and clinical dementia rating alone provide a reasonable prediction of dementia patient mortality, and their contributions would be further elucidated with the inclusion of more clinical features in our subsequent analyses.
Multi-factorial machine learning models for predicting mortality in dementia patients
We proceeded to build multi-factorial models that introduced a wider array of clinical features into the machine learning models. Initially, we built XGBoost models encompassing all 189 features of our preprocessed datasets. These initial results identified numerous features that recurred among the top features, ranked by SHapley Additive exPlanations (SHAP) 30 values, at each of the four survival time thresholds, and much of the explainability of the predictions could be attributed to these top few features alone (Supplementary Fig. 4). Therefore, we derived a parsimonious and informative feature subset across all four survival time thresholds by taking the union of the top five features from each model, thus enhancing clinical interpretability without a drastic tradeoff in predictive performance. Notably, model AUC-PR was worse for one- and three-year survival but increased dramatically at higher survival time thresholds, due to a higher proportion of mortality in patients and, thus, smaller class imbalances at these higher thresholds. Moreover, these performance trends were reflected in the external validation sets as well. The full table of model performance for the multi-factorial models in the test and validation sets is shown in Table 2, and the AUC-ROC curves are shown in Fig. 2a. We generated the bootstrapped SHAP plots (Fig. 2b)
were also predictive of mortality risk, though interestingly, the direction of the effects began to reverse at the longer survival thresholds. The full SHAP beeswarm plots for the multi-factorial models are shown in Fig. 2b.
Dementia type-specific models
The multi-factorial machine learning models provided a cogent framework for predicting mortality in an unspecified population of dementia patients. However, key similarities and distinctions across dementia types may not have been captured in the pan-dementia analysis. Therefore, we stratified the NACC cohort into smaller cohorts based on dementia types to conduct sub-dementia analyses. For these analyses, we aimed to predict dementia patient mortality solely at the five-year survival threshold, which provides the dataset with the smallest class imbalance while providing an extended time window for possible clinical actionability. We conducted our sub-dementia analysis on these eight dementia types with sample sizes greater than 100: no dementia (n = 42,135 visit records), Alzheimer's disease (AD, n = 37,990), unknown (n = 6,317), frontotemporal lobar degeneration (FTLD, n = 4,290), Lewy body dementia (LBD, n = 3,182), vascular brain injury or vascular dementia (VaD, n = 2,288), cognitive impairment due to other reasons (n = 1,362), and depression (n = 1,354). We stratified the training, test, and validation sets of the five-year dataset by dementia type and then trained, optimized, tested, and validated an XGBoost model for each of the eight dementia types.
Performance-wise, the models built on the commonly defined dementia types (e.g., AD, FTLD, LBD, and VaD) tended to perform better in the positive class (mortality) and, thus, generally had higher AUC-PR, whereas the models built on the non-dementia patients and the more ambiguous dementia types (e.g., depression, cognitive impairment for other specified reasons, and missing/unknown) were more robust at predicting the larger negative class (survival) and, thus, had higher AUC-ROC overall. All eight models achieved an AUC-ROC of over 0.79, with the no-dementia model attaining the highest AUC-ROC at 0.873 (95% CI: 0.859-0.879). The most consistent, all-around performer was the AD model, which is reasonable given that AD was by far the most prevalent dementia type aside from the no dementia group. The full table of model performance for the sub-dementia models in their respective test and validation sets is shown in Table 3. The clustered feature importance heatmap is shown in Fig
Meanwhile, several key differences distinguished individual dementia types and their clusters from one another. For instance, in both VaD and depression, alongside general cognitive features, body measurements and vital signs, such as 'HEIGHT' (Subject's height (inches)), 'WEIGHT' (Subject's weight (lbs)), 'NACCBMI' (Body mass index (BMI)), 'HRATE' (Subject's resting heart rate (pulse)), and 'BPDIAS' (Subject's blood pressure (sitting), diastolic), were more important for predicting mortality than for any other dementia type. In the vascular dementia subgroup, 'CVCHF' (Congestive heart failure) was also a pivotal feature, second in importance after age and accounting for over 5% of the mortality prediction among VaD patients.
In the FTLD and LBD cluster, feature importance was distributed across a substantially wider array of cognitive features, with less importance attributed to age and smoking years as compared to the other dementia types. In FTLD, for instance, a number of new cognitive features emerged: CDR® Plus NACC FTLD features (e.g., 'CDRLANG' (Language) and 'COMMUN' (Community Affairs)), clinician judgment features regarding motor function (e.g., 'NACCMOTF' (Indicate the predominant symptom that was first recognized as a decline in the subject's motor function) and 'MOMODE' (Mode of onset of motor symptoms)), and neuropsychological battery summary scores (e.g., 'NACCMMSE' (Total MMSE score (using D-L-R-O-W))). Accordingly, difficulty in performing functional and social activities, in addition to the loss of motor function, was a crucial predictor of mortality in FTLD patients, more so than in any other dementia type. As for LBD, 'CDRSUM' (Standard CDR sum of boxes) superseded age as the most important feature, accounting for nearly 10% of the mortality prediction among LBD patients. However
Finally, in the no dementia and unknown subgroups, many of the features typically associated with cognition, such as age and performance on neuropsychological exams, remained important predictors of mortality, though others were superseded by more general comorbidities and risk factors. For instance, the relative importance of 'SMOKYRS' (Total years smoked cigarettes) was higher in the no dementia group than in any of the dementia groups, accounting for 7.5% of the mortality prediction among no dementia patients. Other non-cognitive risk factors such as 'HYPERTEN' (hypertension) and 'ENERGY' (Do you feel full of energy?) were also revealed to be relevant for predicting mortality in non-dementia patients, despite having little to no contribution to the predictions in the dementia groups.
These results reaffirmed that mortality predictors differ between non-demented and dementia patients, who show multiple survival factors related to their neuropsychological ability.

Discussion
In this study, we developed machine learning models for predicting mortality through training, testing, and validation using 163,782 visit records of 45,275 unique NACC individuals in the United States from 2005 to 2021. We have demonstrated that machine learning models, which have thus far primarily been explored as screening or diagnosis tools in the context of dementia, have substantial utility in the prediction of mortality among dementia patients. First, we conducted multiple survival analyses, which confirmed that increasing global CDR scores coincided with decreased survival and showed that there was considerable variability in survival across dementia subtypes. Subsequently, we developed two-feature models (using only age and standard global CDR) and multi-factorial models (using nine features determined through feature selection) to predict dementia patient mortality at four distinct survival-time thresholds, all of which achieved high predictive performance. We additionally built machine learning models for eight different dementia subtypes and revealed key feature differences among them, though age and cognitive features derived from neuropsychological tests remained important predictors of mortality across all dementia types. These mortality predictors reveal similarities and differences in the etiology and clinical representation among individuals affected by different types of dementia.
The results of our global CDR survival analysis were consistent with those of past survival analyses in dementia patients 31 whereas prior studies identified comorbidities such as cardiovascular disease 34 to be associated with reduced survival probability, we found that regardless of heart conditions, the survival curves separated decisively across patients with varied global CDR scores within the NACC cohort.
Subsequently, we built machine learning models tasked with predicting dementia patient mortality at one-, three-, five-, and ten-year survival thresholds. Our two-feature models, which utilized age and global CDR scores, achieved an AUC-ROC of over 0.76 at all four survival thresholds in the test set. Thus, age and global CDR provided a solid basis for predicting dementia patient mortality and, in the absence of additional clinical features, may alone be used to guide clinical judgment. Our multi-factorial models, for which we utilized SHAP to select a subset of nine features, achieved an AUC-ROC of over 0.82 at all four survival thresholds in the test set and comparable performance in the validation set. The crucial features used by the multi-factorial models confirm the known clinical indicators of dementia from a machine learning standpoint. The multi-factorial models revealed that a higher risk of mortality was predicted by older age 35
To our knowledge, our study is one of just a few studies to apply a machine learning-based approach to predicting mortality in dementia patients 23-25 (as opposed to statistical approaches), and the first study to do so within population subsets stratified by dementia type. In predicting dementia patient mortality at the five-year survival threshold, our dementia type-specific models all achieved an AUC-ROC of over 0.79 in the test set and similar performance in the validation set. Hierarchical clustering of survival predictors grouped the following dementia types together: (1) vascular dementia (VaD) with depression, (2) Lewy body dementia (LBD) with frontotemporal lobar dementia (FTLD), (3) Alzheimer's disease (AD) with other dementia, and (4) no dementia with unknown. Since many dementia types present similar symptoms and disease progressions 8 , differentiating and targeting dementia type-specific symptoms and mortality predictors can be beneficial for patient populations 43 .
Across all four clusters (even in the no dementia and unknown cluster), many features from the multi-factorial models remained key predictors of mortality, such as age, level of independence, smoking, and performance on neuropsychological exams like the Trail Making Test.
First, within the VaD and depression cluster, body measurements and vital signs (e.g., height, weight, BMI, heart rate, and diastolic blood pressure) contributed to the mortality prediction more than for any other dementia type. For VaD, congestive heart failure was the second most important feature after age, consistent with VaD's common risk factors 8 . Moreover, the grouping of VaD with depression confirms previous literature that has highlighted the synergistic effects of VaD and depression on patient mortality 26 , as VaD patients tend to exhibit a higher baseline risk for psychiatric symptoms like depression 43,44 . Second, within the FTLD and LBD cluster, features corresponding to MMSE score, standard CDR sum of boxes, and involvement in community affairs contributed more heavily to the mortality prediction. For FTLD in particular, features measuring difficulty in performing social and functional activities were the pivotal predictors of mortality, consistent with the pathological effects of FTLD 8 . Our findings regarding FTLD and LBD align with prior studies that have similarly grouped the two subtypes together and determined that executive dysfunction and activity disturbances are the key indicators of cognitive impairment for both 43,45 . Third, within the AD and other dementia cluster, general cognitive features, namely those from the multi-factorial models, remained the most important predictors of mortality. Standard CDR sum of boxes was also an important predictor of mortality in AD patients, as were body measurements and vital signs for other dementia patients. The grouping of AD with other dementia may be attributed to the difficulty in differentiating AD from certain other types of dementia 46 , and given that AD was by far the most prevalent dementia type in the NACC cohort, it is likely that the other dementia patients were generally similar to AD patients.
Finally, within the no dementia and unknown cluster, general cognitive features such as performance on the Trail Making Test, surprisingly, remained important predictors of mortality. However, general comorbidities and mortality risk factors, such as smoking, hypertension, and lack of energy, demonstrated high relative importance as well, more so than for any of the dementias. Notably, as in the survival analysis, cardiovascular diseases did not appear in the top features in either the multi-factorial models or the dementia type-specific models, with the exception of congestive heart failure for VaD. The absence of these comorbidities from the top features in our machine learning models may suggest that cognitive decline is a stronger predictor of mortality in dementia patients than stroke or other comorbid cardiovascular conditions, though further studies could better interrogate this hypothesis.
Our study had several key strengths. First, the NACC database is the largest resource of its kind in the United States, covering a large, diverse patient population that was current through September 2021.
Moreover, we highlight a conscious design choice in stratifying our data into train, test, and validation sets. By introducing a prospective validation set based on date, we were able to ascertain the ability of our models to predict mortality within a prospective cohort based on past data. In our pan-dementia analysis, the use of two-feature and nine-feature (multi-factorial) models provided a parsimonious, clinically feasible framework for predicting dementia patient mortality, while in our sub-dementia analysis, the comparison of important predictors of mortality across various dementia types may help to guide precision management and treatment of dementia.
However, our study also had limitations. Due to the high prevalence of missing values, largely attributed to the difficulty in acquiring certain data (e.g., neuropathological data) and differences in clinical procedures across ADCs, many features were preliminarily eliminated. Moreover, many features within the NACC data measure similar phenomena, certain variables have changed over time as updates were made to the UDS form, and many variables were derived from clinician diagnosis, precluding the use of a more granular feature selection method. By first eliminating variables with over 40% missing values and subsequently using MICE to impute the remaining features, we aimed to reduce some bias in the feature selection process 47

Survival analysis
To gain preliminary insights into the relationship between dementia and patient survival/mortality, we first conducted a survival analysis using global clinical dementia rating (CDR) and dementia type as stratification variables, excluding dementia types with fewer than 100 patients. For our survival analysis, we built Kaplan-Meier estimator curves with the "survfit" function from the survival 49 R package. We used each unique patient's first visit as the starting point for tracking patient survival, and we calculated days of survival since the first visit based on (1) the time of death if the patient's death was recorded within the timespan of the dataset or (2) the expiration date of the dataset if the patient was still alive.
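The Kaplan-Meier estimate described above can be illustrated with a small sketch. The study itself used the "survfit" function in R; the Python code below is only an independent, minimal re-implementation of the product-limit estimator for illustration, and the function names and toy data are assumptions rather than anything taken from the study.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct death time.

    durations: follow-up time per patient (e.g., days since first visit)
    events:    1 if death was observed, 0 if censored at the end of the dataset
    """
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    exits = Counter(durations)  # patients leaving the risk set at each time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def median_survival(curve):
    """First time at which survival is 0.5 or lower; None if not reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Toy data (made up): five patients, two of them censored.
durations = [5, 8, 8, 12, 20]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(durations, events)
print(curve, median_survival(curve))
```

A median that is "not reached", as reported for some groups, corresponds to `median_survival` returning `None` because the curve never falls to 0.5 within the follow-up period.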

Data cleaning
All data cleaning was conducted in R v4.1.2 (R Foundation for Statistical Computing, Vienna, Austria).
First, we preserved NACCID (subject ID number) and NACCADC (ADC at which the subject was seen) but removed all other form header information and text field variables. We then re-encoded the remaining features, which consisted primarily of categorical variables that were originally encoded as type numeric.

Missing data imputation
Generally, readily-available machine learning models are not compatible with missing data. Moreover, having large amounts of missing data can often affect model performance and generalizability across populations. Within the NACC dataset, which contains a large feature space and missing values scattered across variables, the removal of a row due to a single missing value can be especially detrimental and drastically reduce sample sizes. To avert potential bias introduced by manually selecting features, we opted to impute variables with missing values rather than only including patients with complete data.
The NACC Uniform Data Set has undergone several revisions since its inception in 2005, and the most recent version (version 3) was implemented in 2015. Consequently, certain variables that were collected in older versions of the UDS were no longer collected in UDS v3, and vice versa. Therefore, to minimize the number of features that did not contain sufficient non-missing values, we first omitted all variables with over 40% missing values. For the 189 remaining features, we imputed missing values using MICE (Multivariate Imputation by Chained Equations) 50 . Multiple imputation is an imputation strategy that accounts for variability in missingness by generating multiple imputed datasets, which can then be aggregated into a single complete dataset 51 . Thus, multiple imputation generally outperforms traditional machine learning methods used for imputation 52,53 . MICE implements a form of multiple imputation that relies on predictive mean matching to predict the value of a given missing variable based on data points that most closely resemble the missing data point.
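The missingness filter that precedes imputation can be sketched as follows. This is a minimal illustration, assuming missing values are represented as `None`; the function name and the toy values are hypothetical, and the MICE step itself (which the study performed separately) is not reproduced here.

```python
def drop_sparse_columns(rows, max_missing=0.40):
    """Keep columns whose fraction of missing values (None) is at most max_missing."""
    columns = list(rows[0].keys())
    n = len(rows)
    kept = []
    for col in columns:
        missing = sum(1 for r in rows if r.get(col) is None)
        if missing / n <= max_missing:
            kept.append(col)
    filtered = [{c: r.get(c) for c in kept} for r in rows]
    return filtered, kept


# Toy visit records (values made up; variable names borrowed from the text).
rows = [
    {"CDRSUM": 0.5, "NACCMMSE": 29, "HRATE": None},
    {"CDRSUM": 1.0, "NACCMMSE": None, "HRATE": None},
    {"CDRSUM": 4.5, "NACCMMSE": 22, "HRATE": 64},
    {"CDRSUM": 2.0, "NACCMMSE": None, "HRATE": None},
    {"CDRSUM": 0.0, "NACCMMSE": 30, "HRATE": 71},
]
filtered, kept = drop_sparse_columns(rows)
print(kept)  # HRATE is 60% missing here and is dropped
```

Columns that survive the filter may still contain missing values; those are the values that would be passed on to the imputation step.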

Data splitting
To evaluate patient survival status, we employed one year, three years, five years, and ten years as survival time thresholds. Accordingly, we determined each patient's survival status based on survival threshold year length, clinic visit date, and patient's time of death that was derived from variables NACCMOD (Month of Death) and NACCYOD (Year of Death), labeling each patient's one-year, three-year, five-year, and ten-year survival as either 0 (survival) or 1 (deceased).
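The labeling rule can be sketched as below. The exact rule used in the study is not spelled out, so this sketch makes two labeled assumptions: the day of death is taken as the first of the recorded month (only month and year are available), and a patient counts as deceased at a threshold if that date falls within the threshold window after the visit. How records with follow-up shorter than the threshold were handled is not stated in the text and is not addressed here. All function names are hypothetical.

```python
from datetime import date

THRESHOLD_YEARS = (1, 3, 5, 10)


def add_years(d, years):
    """Shift a date forward by whole years (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def mortality_labels(visit_date, death_year=None, death_month=None):
    """Return {threshold_years: 0 or 1}, where 1 means deceased within the window."""
    if death_year is None:
        return {y: 0 for y in THRESHOLD_YEARS}
    # Assumption: only month and year are known, so use the first of the month.
    death_date = date(death_year, death_month, 1)
    return {y: int(death_date <= add_years(visit_date, y)) for y in THRESHOLD_YEARS}


# Toy example (made up): visit in June 2010, death recorded in March 2014.
labels = mortality_labels(date(2010, 6, 15), death_year=2014, death_month=3)
print(labels)
```

Because each visit record receives its own labels, the same patient can be labeled 0 at an early visit and 1 at a later one for the same threshold.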
To assess the accuracy of our prediction models, we divided the whole cohort into a training/testing
Subsequently, for each survival time threshold, we stratified each survival dataset by date into a pan-dementia dataset that we used for training and testing and a separate validation cohort that we used to externally evaluate model performance. For our pan-dementia analysis, we included all dementia patients (i.e., all patients who received an impaired not MCI, MCI, or dementia diagnosis). However, for our sub-dementia analysis, we stratified our datasets by dementia type, including non-dementia patients as a baseline for comparison.
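A date-based split of this kind might look like the sketch below. The cutoff date, the test fraction, and the record field name `visit_date` are all hypothetical placeholders; the values actually used in the study are not given in this text.

```python
import random
from datetime import date


def split_by_date(records, cutoff):
    """Visits before the cutoff form the development set; later visits form validation."""
    development = [r for r in records if r["visit_date"] < cutoff]
    validation = [r for r in records if r["visit_date"] >= cutoff]
    return development, validation


def train_test_split_records(records, test_fraction=0.2, seed=0):
    """Randomly split the development set into train and test portions."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy records (made up): one visit per year from 2008 to 2019.
records = [{"id": i, "visit_date": date(2008 + i, 5, 1)} for i in range(12)]
HYPOTHETICAL_CUTOFF = date(2017, 1, 1)  # assumption, not the study's cutoff

development, validation = split_by_date(records, HYPOTHETICAL_CUTOFF)
train, test = train_test_split_records(development)
print(len(train), len(test), len(validation))
```

Holding out the most recent visits means the validation cohort is evaluated by models that only ever saw earlier data, which is what makes the validation prospective in character.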

Machine learning models
After experimenting with several machine learning algorithms (i.e., random forest, logistic regression, and eXtreme Gradient Boosting), our machine learning algorithm of choice was eXtreme Gradient Boosting (XGBoost), a high-performance, tree-based ensemble learning method that uses gradient tree boosting to sequentially add new trees to reduce the errors from previous trees 54 . We built XGBoost models for each of the one-year, three-year, five-year, and ten-year datasets, with the goal of predicting dementia patient survival/mortality under varying survival thresholds. We built all of our machine learning models in Python v3.7.12 using the xgboost and scikit-learn 55 libraries.

Feature selection
For our pan-dementia analyses, we aimed to build XGBoost models to predict one-year, three-year, five-year, and ten-year survival among all dementia patients. Our first set of machine learning models utilized only two features: age and standard global CDR. These preliminary models served as a baseline of comparison for the more complex models and provided insight into how much of the mortality prediction could be explained by age and standard global CDR alone.
Subsequently, we built a more complex set of machine learning models that employed the larger feature space. However, in order to make our machine learning models more clinically feasible, we conducted feature selection using SHapley Additive exPlanations (SHAP) 56 , a unified, model-agnostic framework for interpreting the predictions of machine learning models. The SHAP algorithm is rooted in game theory, relying on the calculation of Shapley values to evaluate the relative contribution of each feature to a given prediction. Though SHAP is most often used as a feature importance metric, it has demonstrated considerable utility as a feature selection method as well, even outperforming many conventional feature selection methods 30 .
We used a variant of SHAP known as TreeSHAP 57 , an enhancement to SHAP designed for tree ensemble methods, such as XGBoost. In our study, we trained default XGBoost classifiers with five-fold cross-validation on each of the four training sets and then took the union of the top five features from each model, ranked in order of decreasing mean absolute SHAP value.
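The ranking-and-union step can be illustrated with a short sketch. It assumes the per-sample SHAP values have already been computed elsewhere (for example by a TreeSHAP implementation) and are available as plain lists; the function names, feature names, and numbers are made up for illustration, and the toy example uses the top two features per model instead of five to stay small.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Average the absolute SHAP value of each feature over all samples."""
    n = len(shap_rows)
    return {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }


def top_k_features(importance, k=5):
    """Feature names ranked by decreasing mean absolute SHAP value, top k only."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:k]


def union_of_top_features(per_model_importance, k=5):
    """Union of each model's top-k features, kept in first-seen order."""
    selected = []
    for importance in per_model_importance.values():
        for name in top_k_features(importance, k):
            if name not in selected:
                selected.append(name)
    return selected


# Toy importances (made up) for three hypothetical threshold models.
per_model = {
    "1y": {"feat_a": 0.9, "feat_b": 0.5, "feat_c": 0.1},
    "3y": {"feat_a": 0.8, "feat_c": 0.6, "feat_b": 0.2},
    "5y": {"feat_d": 0.7, "feat_a": 0.6, "feat_b": 0.1},
}
selected = union_of_top_features(per_model, k=2)
print(selected)
```

Because the top features overlap heavily across thresholds, the union is much smaller than the number of models times k, which is consistent with four models and k = 5 yielding a nine-feature subset.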

Model training, testing, and validation
We trained our four XGBoost models on their respective training sets and tested their performance on their respective test sets, corresponding to their survival threshold. To account for any variability that may have been introduced by the random state of the train-test split, we conducted bootstrap resampling by generating fifty bootstrap samples, re-fitting the models on each bootstrap train set, and evaluating their performance on each bootstrap test set. All confidence intervals generated represent the 95% confidence intervals derived from bootstrap resampling. We also validated each model's performance on its respective external validation set, which we set aside during data splitting.
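A bootstrap confidence interval for AUC-ROC can be sketched as follows. This is a simplification: the study re-fit the model on each bootstrap train set, whereas the sketch below only resamples a fixed set of test predictions, and the percentile indexing is approximate. The function names and toy scores are assumptions.

```python
import random


def auc_roc(y_true, y_score):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes to compute AUC-ROC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(y_true, y_score, n_boot=50, alpha=0.05, seed=0):
    """Approximate percentile confidence interval for AUC-ROC from n_boot resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    while len(scores) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacks one class; draw again
        scores.append(auc_roc(ys, [y_score[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * (n_boot - 1))]
    upper = scores[int(round((1 - alpha / 2) * (n_boot - 1)))]
    return lower, upper


# Toy predictions (made up): 1 = deceased, 0 = survived.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.6, 0.4, 0.35, 0.8, 0.9]
point = auc_roc(y_true, y_score)
lower, upper = bootstrap_ci(y_true, y_score)
print(point, lower, upper)
```

With only fifty resamples the interval endpoints are coarse, so they are best read as a rough indication of variability.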

Sub-dementia analysis
In addition to predicting survival in all dementia patients, we conducted a sub-dementia analysis, analyzing discrepancies among dementia types. Since the majority of dementia-related studies are geared toward Alzheimer's disease, highlighting the distinctions between dementia types may provide insight into the mechanisms of the various forms of neurodegeneration, thus guiding clinical practice.
For our sub-dementia analysis, we only used a five-year survival threshold, as the pan-dementia analysis demonstrated that five years provides a reasonable timeframe for capturing patient mortality without a drastic trade-off in predictive performance. To ensure that each of our dementia-type models received sufficient training data, we limited our analysis to eight dementia types, which each contained at least 1000 patients from the five-year dataset between training and testing (excluding validation).
Accordingly, we built XGBoost models for each sub-dementia dataset and applied the same Bayesian optimization methodology and train-test-validation framework as with our pan-dementia analysis. However, in order to conclusively note differences between dementia types, we included all 189 original features and allowed each model to designate the most important features corresponding to its respective dementia type.

Feature importance
For both our pan-dementia analysis and sub-dementia analysis, we used the aforementioned SHapley Additive exPlanations (SHAP) 56 to determine feature importance within our XGBoost models. To distinguish the most important features in each model, we created 50 bootstrap samples with randomized train-test configurations, fit the model on each training split, and then calculated SHAP values within each test split. We then aggregated the SHAP values across all bootstrap samples before ranking the features in order of decreasing mean absolute SHAP value, based on their relative contribution to the models.

Clustered heatmap of the top features across dementia types. Only features with a normalized SHapley Additive exPlanations (SHAP) value greater than 2.5 (explaining at least 2.5% of the prediction) in any given dementia type model were included. The maximum normalized SHAP value for the clustered heatmap was set to 10 so that color contrasts were more discernible. The "No Dementia" category corresponds to patients receiving a primary etiologic diagnosis of "Not applicable, not cognitively impaired." The "Unknown" category corresponds to patients receiving a primary etiologic diagnosis of "Missing/unknown." The "Other" category corresponds to patients receiving a primary etiologic

Supplementary Files