Improving genetic risk modeling of dementia from real-world data in underrepresented populations

BACKGROUND: Genetic risk modeling for dementia offers significant benefits, but studies based on real-world data, particularly for underrepresented populations, are limited. METHODS: We employed an Elastic Net model for dementia risk prediction using single-nucleotide polymorphisms prioritized by functional genomic data from multiple neurodegenerative disease genome-wide association studies. We compared this model with APOE and polygenic risk score models across genetic ancestry groups, using electronic health records from UCLA Health for discovery and All of Us cohort for validation. RESULTS: Our model significantly outperforms other models across multiple ancestries, improving the area-under-precision-recall curve by 21-61% and the area-under-the-receiver-operating characteristic by 10-21% compared to the APOEand the polygenic risk score models. We identified shared and ancestry-specific risk genes and biological pathways, reinforcing and adding to existing knowledge. CONCLUSIONS: Our study highlights benefits of integrating functional mapping, multiple neurodegenerative diseases, and machine learning for genetic risk models in diverse populations. Our findings hold potential for refining precision medicine strategies in dementia diagnosis.


Background
Dementia, a complex and multifaceted syndrome, is characterized by a progressive decline in cognitive function beyond what might be expected from normal aging.Etiologies include Alzheimer's disease (AD), vascular dementia, Lewy body dementia (LBD), Frontotemporal dementia (FTD), and Parkinson's disease dementia (PDD), among others. 1The prognosis of dementia is generally a gradual and continuous decline in cognitive function, which can signi cantly impact an individual's ability to perform daily activities. 2 Dementia represents a signi cant public health concern, with a global prevalence estimated at around 36 million in 2020.Owing to an aging population, this number is projected to triple by 2050. 3 The economic burden of dementia is also substantial, with global costs estimated to be around $594 billion annually. 4mentia has a strong genetic predisposition, with numerous signi cant genetic variants associated with the disease identi ed through Genome-Wide Association Studies (GWASs).For example, the Apolipoprotein E (APOE) gene, which encodes a protein responsible for binding and transporting low-density lipids, signi cantly in uences the risk of late-onset AD, the most prevalent form of dementia. 5,6Similarly, the Microtubule-associated protein tau (MAPT) is a recognized genetic mutation in FTD, 7 and Synuclein Alpha (SNCA) is associated with PDD. 8 While these studies have deepened our understanding of the genetic architecture of dementia, additional research is necessary to successfully model personal dementia genetic risk and understand the potential limitations.
Polygenic risk scores (PRSs), which aggregate the effects of many genetic variants associated with a disease, have recently been used to quantify an individual's genetic predisposition for complex diseases like dementia. 9A growing number of studies have underscored the robust links between AD PRS and AD phenotype, [10][11][12][13] declines in memory and executive function, [14][15][16][17] clinical progression, 15 and amyloid load 18 in the non-Hispanic white population.However, the performance of PRSs in non-European ancestries has been suboptimal.0][21][22] Using PRSs for 245 curated traits from the UK Biobank data, Privé et al. 23 revealed notable disparities in the phenotypic variance explained by PRSs across different populations.Speci cally, compared to individuals of Northwestern European ancestry, the PRS-driven phenotypic variance is only 64.7% in South Asians, 48.6% in East Asians, and 18% in West Africans.Similarly, using a population from the Health and Retirement Study, Marden et al.  demonstrated that the estimated effect of the AD PRS was notably smaller for non-Hispanic black compared to non-Hispanic white in both dementia probability score and memory score. 24other limitation of current genetic risk modeling is differentiating between causal and uninformative variants.Causal variants, such as APOE in AD, have been suggested to be included as separate variables in genetic risk modeling due to their independent risk contribution. 25On the other hand, including uninformative, non-causal variants in prediction models may introduce "noise" that obscures the effects of important variants.In a study by Dickson et al.,26 a model incorporating allelic APOE terms and just 20 additional Single-Nucleotide Polymorphisms (SNPs) outperformed the model that included thousands of SNPs in AD risk prediction (area under the receiver operating characteristic (AUROC): 0.75 vs. 0.63).
Moreover, most current studies used longitudinal cohorts, which perform extensive testing and consensus criteria 27 applied by clinicians with expertise in dementias to determine dementia diagnosis.While this approach ensures precision within research cohorts, it does not necessarily mirror the practicalities of real-world community settings.In real-world clinical care, the expertise in dementia may vary, and the criteria used for diagnosis may not always align with the stringent standards of research cohorts.Diagnoses documented in the Electronic Health Records (EHRs) capture these real-world data and, by routinely capturing patient data over extended periods, form an expansive longitudinal cohort ideal for realworld research.Compared to traditional cohorts, EHR cohorts offer additional bene ts, such as vast sample sizes, diverse phenotypes, and a more inclusive representation of often underrepresented groups, like minorities and older adults. 28owever, only a few genetic studies on dementia have been conducted within the context of EHR, and have predominantly focus on AD 11,29 Finally, prior studies have primarily focused on the genetic risk prediction of AD.However, while AD accounts for a signi cant portion of dementia cases, concentrating solely on it risks overlooking the broader scope of cognitive disorders.
In real-world scenarios, many dementia cases display mixed pathologies, 30,31 with mixed dementia being a common occurrence 32 .Addressing dementia as a whole, rather than exclusively focusing on AD, could better re ect the clinical landscape and lead to interventions and therapies that bene t a larger cohort of affected individuals. 33fortunately, dementia remains signi cantly underdiagnosed in real-world community settings.5][36][37][38][39] Early detection of all-cause dementia with genetic modeling can empower healthcare providers to pinpoint the appropriate diagnostic processes, streamline care coordination, manage symptoms effectively, and begin suitable treatments.The above-mentioned limitations underscore the need for more re ned methodologies to develop genetic risk models across diverse populations accurately.
In the present study, we hypothesized that the risk SNPs associated with dementia, and their corresponding weights, may vary across diverse populations, namely Amerindian, African, and East Asian genetic ancestry.We further proposed that the prediction performance of dementia phenotypes in non-European populations could be enhanced by identifying biologicalmeaningful SNPs followed by sparse machine learning models within each genetic ancestry group.Thus, we present a novel approach for assessing individual dementia genetic risks across diverse populations.
Our approach addresses the previously noted limitations through several innovative measures.Firstly, we utilized functional and biological information to prioritize SNPs based on GWAS results, thereby targeting causal SNPs with the highest likelihood of contributing to dementia risk.Secondly, we employed machine learning algorithms to select important genetic variants.Our method allows for the ne-tuning of models across different ancestry groups, offering a signi cant advantage for non-European populations that are often underrepresented in GWAS studies.Finally, we developed and validated our models within real-world EHR settings, focusing on predicting dementia as an encompassing condition.This innovative approach holds promise for enhancing our understanding of individual dementia genetic risks and promoting health equity in genetic research.

UCLA ATLAS Community Health Initiative
Our discovery cohort for model development was derived from the biobank-linked EHR of the UCLA Health System. 40The UCLA ATLAS Community Health Initiative collects biosamples from participants of a diverse population.Upon obtaining patient consent, these biological samples undergo genotyping using a customized Illumina Global Screening Array. 41etailed information regarding the biobanking and consenting procedures can be referenced in our previous publications. 42,43After the genotype quality control described below, there were 54,935 individuals with genotype and UCLA EHR data.As all genetic data and EHRs utilized in this study were de-identi ed, the study was deemed exempt from human subject research regulations (UCLA IRB# 21-000435).

All of Us Research Hub
We validated our models and ndings using All of Us Research Hub data.As one of the most diverse biomedical data resources in the United States, the All of Us Research Program serves as a centralized data repository, offering secure access to de-identi ed data from program participants. 44For our validation, we utilized data release version 7, encompassing 409,420 individuals, of which 245,400 have undergone whole genome sequencing.

Patient genetic data preprocessing 2.2.1 Quality control
The quality control process was conducted using PLINK v1.9, 45 adhering to established guidelines. 40We removed samples with a missingness rate exceeding 5%.Low-quality SNPs with > 5% missingness and monomorphic and strand-ambiguous SNPs were excluded.Post-quality control, we performed genotype imputation via the Michigan Imputation Server. 46This step was crucial to augment the coverage of genetic variants and enable the comparison of results across diverse genotyping platforms.SNPs with imputation r 2 < 0.90 or MAF < 1% were pruned from the data.After quality control measures and imputation, there were 21,220,668 genotyped SNPs across a sample of 54,935 individuals.Finally, we restricted our analyses to SNPs that overlapped between UCLA ATLAS and All of Us, amounting to a total of 8,705,988 SNPs.This approach ensured consistency in the genetic variables under consideration across both datasets.

Inferring genetic ancestry
Genetic ancestry refers to the geographic origins of an individual's genome, tracing back to their most recent biological ancestors while largely excluding cultural aspects of their identity. 47Genetic Inferred Ancestry (GIA) employs genetic data, a reference population, and inferential methodologies to categorize individuals within a group likely to share common geographical ancestors. 48In our UCLA ATLAS sample, we used the reference panel from the 1000 Genomes Project 49 and principal component analysis 50 to infer a patient's genetic ancestry.GIA groups included European American (EA), African American (AA), Hispanic Latino American (HLA), East Asian American (EAA), and South Asian American (SAA).For instance, we designated individuals within the United States whose recent biological ancestors were inferred to be of Amerindian ancestry as "HLA GIA". 51In addition, we calculated ancestry-speci c principal components within each GIA group using principal component analysis.

GWAS selection
Our study's initial step is identifying potential risk SNPs as candidate predictors for dementia GWASs.A summary of the GWASs used and steps to select candidate SNPs in our study can be found in Supplementary Table 1 and Supplementary Fig. 1.
We selected GWASs for AD, 5,52,53 PDD, 54 PSP, 55 LBD, 56 and stroke 57 phenotypes.For AD GWASs, we included three different GWASs conducted on diverse populations, including European, 5 African American, 52 and multi-ancestries. 53The summary statistics from all these GWAS are publicly available.Detailed information regarding the recruitment procedures and diagnostic criteria can be found in the original publications.

Candidate SNPs identi cation and annotation
A signi cant proportion of GWAS hits are found in non-coding or intergenic regions, 58 and given the correlated nature of genetic variants in Linkage disequilibrium (LD), distinguishing causal from non-causal variants often proves challenging based solely on association P-values from GWASs. 59Pinpointing the most likely relevant causal variants typically involves understanding the regional LD patterns and assessing the functional consequences of correlated SNPs, such as protein coding, regulatory, and structural sequences. 60Several functionally validated variants have been proved to be clinically relevant to the pathogenesis of diseases, as con rmed through in vitro or in vivo experimental validation. 61To address this, we utilized the Functional Mapping and Annotation of Genome-Wide Association Studies (FUMA), a tool that leverages information from biological data repositories and other resources to annotate and prioritize SNPs. 59r each GWAS summary statistic, we rst identi ed genomic risk loci using a P-value threshold (< 5e-8) and a precalculated LD structure (r 2 < 0.2) based on the relevant reference population from the 1000 Genomes. 49Subsequently, we identi ed two distinct sets of SNPs: 1. Independent genome-wide-signi cant SNPs: We selected the SNP with the most signi cant GWAS P-value within each genomic risk locus.This process was iterated until all SNPs were assigned to a risk locus cluster or considered independent.2. Independent gene-annotated SNPs: We prioritized SNPs based on their functional consequences on genes.In FUMA, the mapping from SNPs to genes was achieved by performing ANNOVAR 62 using Ensembl genes (build 85).SNPs were mapped to genes through positional mapping, eQTL associations, and 3D chromatin interactions.The Combined Annotation-Dependent Depletion (CADD) score 63 was used to select potential causal SNPs, with the SNP possessing the highest CADD score within each genomic risk locus being chosen, indicating a higher probability of the variant being deleterious.
The identi ed independent genome-wide-signi cant SNPs and independent gene-annotated SNPs were subsequently used in constructing the disease PRSs and as candidate features in dementia prediction models.To ensure the robustness of our ndings, we also adopted a stringent r 2 cut-off (< 0.1) to de ne independent genome-wide-signi cant SNPs, ensuring the selected SNPs were independent.

Polygenic risk scores and APOE-ε4
We computed the disease-speci c PRS as the sum of an individual's risk allele dosages, each weighted by its corresponding risk allele effect size from the GWAS summary statistics, as shown in the PRS equation .All PRSs were then standardized to a mean of 0 and a standard deviation of 1.The standardization process used the 1000 Genome European genetic ancestry as the reference population, ensuring that the scores' range and values are comparable across different GWASs.For each phenotype, we employed two distinct sets of SNPs identi ed by FUMA, namely the independent genome-wide-signi cant SNPs and independent gene-annotated SNPs, to calculate two respective PRSs: PRS.psig and PRS.map.
The APOE gene has two variants, rs7412 and rs429358, which determine the three common isoforms of the apoE protein: E2, E3, and E4, encoded by the ε2, ε3, and ε4 alleles. 64Previous research has demonstrated that out of the three polymorphic forms of APOE, carriers of APOE-e4 are at a higher risk of developing AD, and this association exhibits a dosedependent effect. 65Therefore, to quantify the APOE genotype in our study, we created a numerical variable, "APOE-e4count", with the two variants mentioned above, representing the number of ε4 alleles (0, 1, or 2) carried by each individual.

Dementia de nition and demographic features
The primary outcome of interest was dementia, which we de ned using the ICD-10 codes (Supplementary Table 2).The demographic variables considered in our study were self-reported sex and age.The age of each participant, measured in years, was calculated based on their self-reported birth date and the dates of their encounters.For individuals diagnosed with dementia, we determined the age at dementia onset.

Analytical sample selection
To focus on patients with longitudinal records, our analyses included patients with complete demographic data (age and sex) who had at least two medical encounters after age 55.We also applied a restriction of age at the last recorded encounter to be less than 90 as patients in the UCLA EHR dataset are censored when older than 90.
We identi ed eligible dementia cases as patients with at least one encounter with a recorded dementia diagnosis, provided that the initial onset of the condition occurred after age 55.To qualify as an eligible control, subjects were required to meet the following criteria: 1) not have any recorded dementia or related diagnoses, as determined by a set of prede ned exclusion phenotypes; 66 2) age at the last recorded visit > = 70, to exclude younger patients who may not have manifested signs of dementia; and 3) a minimum of ve years' length of records with an average of at least one encounter per year, thereby minimizing the potential for bias associated with misdiagnosis.
Upon the application of these selection criteria, the resultant sample served as the pool for permutation resampling and subsequent modeling in our study.

Prediction of dementia risk with machine learning models
In our discovery study, we developed a series of logistic regression models to predict the binary dementia phenotype in the UCLA ATLAS sample, strati ed by GIA groups.

Permutation resampling
In order to fortify the reliability of our ndings, we employed the permutation resampling methodology to assess model performance, ascertain feature importance, and evaluate statistical signi cance.Speci cally, we conducted random sampling from the pool of eligible controls, maintaining a case-to-control ratio of 1:3, and utilized the amalgamated case and control samples for the following modeling process.This iterative procedure was repeated 1000 times.

Regress out demographic variable effects
To distinctly assess genetic in uences, our analysis commenced by mitigating the impact of demographic factors, encompassing age, sex, and ancestry-speci c principal components (PCs), from the predictive model.We rst employed a logistic regression model that exclusively utilized these variables to predict dementia status.Subsequently, we derived the predicted values for each patient through this model.Applying an appropriate inverse link function (e.g., logit), we then subtracted these predicted values from the ultimate outcome (dementia status), generating an "offset" value.These offset values encapsulated the dementia status, after regressing out the effects of demographic variables and genetic population structure.

Genetic prediction models
Next, we trained genetic risk models to predict the outcome (dementia status) with the offset corrections applied in the linearized space, i.e., , where represents the predicted dementia status, and is the inverse of the link function. 67We compared four different sets of predictors: 1) APOE status, 2) AD PRS, 3) multiple PRSs, and 4) smaller SNP sets with Elastic Net regularization.The latter involved the application of a regularization technique known as Elastic Net to smaller sets of SNPs. 68For multiple PRS models, we crafted models utilizing diverse AD PRSs of varying ancestries or PRSs derived from other GWASs focused on neurodegenerative diseases.Across all models, we employed a 5-fold cross-validation methodology to authenticate their predictive e cacy, with the nal results reported on the combined hold-out testing set.
The primary assessment criterion was the Area Under the Precision-Recall Curve (AUPRC), speci cally chosen for its appropriateness in scenarios involving imbalanced datasets where the number of cases is signi cantly outnumbered by controls. 69Additionally, the AUROC was reported as a comprehensive metric for model evaluation.To determine the optimal threshold, we selected the point that maximized the Matthews Correlation Coe cient (MCC). 28Subsequent performance metrics, such as the F1 score, accuracy, precision, recall, and speci city, were computed based on this threshold.The 95% con dence intervals (CIs) and p-values ( ) were derived through 1000 permutations as described previously.

Validations in the All of Us sample
We conducted a validation study using the All of Us cohort to assess the generalizability of our ndings derived from the UCLA ATLAS sample.We selected a comparable sample from the All of Us Research Hub, adhering to the same criteria and sampling scheme for the GIA groups in the UCLA ATLAS sample.The same methodologies were employed to de ne dementia cases and controls.We extracted the same genetic risk loci from the All of Us Whole Genome Sequencing data for PRS construction or those identi ed through Elastic Net models in the UCLA ATLAS sample.We employed a consistent methodology to regress out demographic variables and genetic population structure (i.e., PCs) as a preliminary step.This approach was undertaken to derive offset corrections, mirroring the procedures employed in our prior research.By regressing out these factors, we aimed to ensure that the statistical models accurately re ect the intrinsic genetic associations, unconfounded by extraneous demographic or population structure in uences.
We compared three models in the All of Us sample: 1) the APOE-e4 model; 2) the best-performing PRS model; and 3) the best-performing Elastic Net SNP model.The same evaluation metrics were utilized for model comparisons.

Gene mapping and gene set analysis
To facilitate biological interpretations, we employed FUMA's positional, eQTL, and chromatin interaction mapping to associate dementia risk SNPs, identi ed from the top-performing Elastic Net SNP models, with speci c genes. 59We then tested these mapped genes against gene sets procured from MsigDB, such as positional gene sets and Gene Ontology (GO) gene sets, to assess the enrichment of biological functions through hypergeometric tests.To correct for multiple testing, we implemented the Benjamin-Hochberg adjustment. 70Using heatmaps, we reported and visualized gene sets with an adjusted P-value ≤ 0.05 and more than one overlapping gene.

Sample description
The study's primary dataset for model development was derived from EHR linked to the biobank of the UCLA Health System. 40A detailed depiction of the sample selection steps and resampling scheme is provided in Fig. 1A.
Figure 1B illustrates the nalized UCLA ATLAS samples, strati ed by GIA groups.Notably, the HLA sample comprised 610 patients, while the AA sample consisted of 440 patients, with 126 and 84 dementia cases, respectively, within each group.The distribution of International Classi cation of Diseases, 10th Revision (ICD-10) diagnosis codes remained relatively consistent across the two GIA samples, with Alzheimer's disease (G30) and unspeci ed dementia (F03) being the most prevalent diagnoses.However, it is important to highlight that the AA group exhibited a higher proportion of patients diagnosed with vascular dementia (F01) compared to the HLA group.The EAA group, with a limited case count (N = 75), was excluded from primary analyses but included in sensitivity analyses.
Within each GIA group, we found that eligible controls, due to the more stringent inclusion criteria, displayed a longer span of records and more encounters.There were no signi cant differences in other EHR features between dementia cases and controls (Table 1).

Performance comparison for dementia phenotype prediction task
We developed and evaluated a series of logistic regression models to predict the binary dementia phenotype within the UCLA ATLAS sample, strati ed by GIA groups.After regressing out the effects of age, sex, and ancestry-speci c genetic variations as represented by PCs, we constructed genetic risk models for dementia, incorporating offset corrections within a linearized framework.The predictive capabilities of these models were assessed using four distinct sets of genetic markers: 1) APOE-e4 counts, 2) AD PRS, 3) a composite of multiple PRSs, and 4) select SNPs re ned through Elastic Net regularization. 68For the selection of SNP sets, we utilized the FUMA tool 59 to prioritize independent genome-widesigni cant SNPs or independent gene-annotated SNPs.We employed the permutation resampling methodology (1000 times) to assess model performance, ascertain feature importance, and evaluate statistical signi cance (details see Methods).
The overall performances of models for predicting dementia phenotypes are visually represented in Fig. 2. No discernible differences were observed among APOE-e4 and all PRS models, irrespective of the SNP set employed for PRS construction -whether derived from ancestry-speci c GWASs, genome-wide-signi cant SNPs, or gene-annotated SNPs.Notably, the predictive performance of APOE-e4 and all PRS models within the AA GIA sample exhibited inferior outcomes compared to the HLA GIA sample, particularly evident in the AUPRC.
Elastic Net SNP models demonstrated an overall improvement in dementia prediction across both GIA groups.We also noted a substantial enhancement in the other performance metrics (based on the threshold that maximized the MCC) of the Elastic Net SNPs models compared to other models across both GIA samples (Supplementary Table 3).This was evidenced by marked improvements in accuracy, precision, and the F1 score.In our sensitivity analysis, applying a more stringent r 2 cut-off (< 0.1) for de ning independent genome-wide-signi cant SNPs yielded results consistent with our initial ndings, as detailed in Supplementary Table 4.
In summary, models leveraging SNPs as features identi ed through machine learning methods possess the potential to surpass those relying solely on summary scores such as PRSs.Furthermore, selecting SNPs mapped to genes using functional genomic data holds promise for further re ning predictive performance.

Featured risk variants and mapped genes
In our analysis of the best-performing Elastic Net SNPs models, we further examined the features selected by each model.The HLA and AA models identi ed 15 and 10 risk SNPs, respectively.A detailed list of SNPs, including related information, is provided in Table 2.By assessing the feature importance of the SNPs chosen by the models, we discovered that rs429358 (chr19:44908684, nearest gene: APOE), rs2075650 (chr19:44892362, nearest gene: TOMM40), and rs483082 (chr19: 44912921, nearest gene: APOC1) were selected as the top three important predictor for the HLA GIA group, together accounting for ~ 25% of the total predictive importance.Conversely, for the AA GIA group, the most in uential predictors were identi ed as rs2627641 (chr19:45205500, nearest gene: BLOC1S3), rs8073976 (chr17:44955857, nearest gene: C1QL1), and rs429358 (chr19:44908684, nearest gene: APOE).
Two AD-associated risk SNPs, rs429358 and rs2075650, were pinpointed by both GIA Elastic Net SNPs models, albeit with slight variations in their relative importance.Moreover, both models identi ed several risk SNPs of PDD and progressive supranuclear palsy (PSP) as crucial predictors of dementia.However, there were notable differences between the models.
For instance, the AA GIA model ascribed signi cant importance to a PSP-associated risk SNP, rs8073976, located on chromosome 17.Interestingly, stroke-risk SNPs were only identi ed as important predictors by the HLA GIA model, underscoring the distinct genetic underpinnings in uencing these different ancestry groups.
To better understand the biological functions and pathways associated with the identi ed risk variants, we then mapped those featured risk SNPs to genes.This was also achieved using FUMA, which incorporates positional, eQTL, and 3D chromatin mapping. 59tably, four genes were identi ed by both non-European GIA models (Fig. 3 & Supplementary Table 5).All shared genes were located near chr19q13, which includes the well-established AD risk gene cluster, APOE-TOMM40-APOC1. 71According to the enrichment analysis results, these shared genes are predominantly involved in biological pathways associated with lipid metabolism.These pathways encompass processes such as the assembly and organization of protein-lipid complexes, as delineated by the GO terms.Additionally, these genes play an essential role in regulating cholesterol, triglyceride, amyloid proteins, and lipoprotein particles, further underscoring the signi cance of lipid metabolic processes in dementia.In addition, we investigated ancestry-speci c genes.For instance, genes near the chr17q21 (e.g., CCDC43, GFAP, and C1QL1), and the chr11q25 region (e.g., GSF9B and JAM3) were uniquely pinpointed by the AA GIA model.
In the sensitivity analyses, we performed dementia risk modeling in the EAA GIA sample (N = 673).Similar to other GIA groups, the model incorporating gene-annotated SNPs from AD and other dementia-related disease GWASs performed the best compared to all other models, enhancing the AUPRC by 11% (0.511 vs. 0.459), and the AUC by 7% (0.754 vs. 0.703) compared to the best PRS model.Despite these improvements, the differences in performance between the leading Elastic Net SNP model and other models did not reach statistical signi cance (AUPRC: p-value = 0.438; AUROC: p-value = 0.376).Among the featured 12 risk SNPs, rs429358 (chr19:44908684, nearest gene: APOE), rs35106910 (chr19:44781009, nearest gene: CBLC), and rs66626994 (chr19:44924977, nearest gene: APOC1P1) were the most signi cant predictors for the EAA GIA group, collectively accounting for ~ 32% of the overall predictive importance.After mapping featured SNPs to gene, we also identi ed the AD-risk gene cluster, APOE-TOMM40-APOC1, as well as the gene region near chr17q21 (e.g., FMNL1 and SPPL2C) (Supplementary Table 6A-D).

Validations in the All of Us sample
We conducted a validation study using the All of Us cohort to evaluate the broad applicability of our ndings obtained from the UCLA ATLAS sample.A comparable sample was selected from the All of Us Research Hub, employing the same selection scheme to their corresponding GIA groups in the UCLA ATLAS sample.However, due to the limited number of eligible dementia cases (N case = 8) in the All of Us EAA GIA sample, we could only validate our models and ndings in the HLA (N_case = 81, N_control = 445) and AA (N_case = 181, N_control = 2,463) samples.In contrast to the UCLA ATLAS samples, the All of Us cohort samples exhibited a younger demographic pro le, with participants having comparatively shorter durations of EHR documentation and fewer recorded healthcare visits.Within each GIA sample, we found similar distributions of demographics and EHR features between dementia cases and eligible controls (Supplementary Table 7-8).
We applied the model weights trained from the UCLA ATLAS sample to the All of Us sample, strati ed by GIA groups.In the comparison of three representative models, namely 1) the APOE-e4 model; 2) the best-performing PRS model; and 3) the best-performing Elastic Net SNP model, our results mirrored those from the UCLA ATLAS sample, with the Elastic Net SNP model, which included gene-annotated SNPs from GWASs of AD and other dementia-related diseases, outperforming all other models in terms of the AUPRC and AUC in both the HLA and AA GIA samples (Table 3).

Discussion
Traditional genetic risk models have faced limitations in effectively capturing causal disease risk variants and accurately assessing genetic risks across diverse populations.To address these challenges, our present study introduces a novel approach to predicting dementia risks by leveraging functional mapping of genetic data in conjunction with machine learning methods in the real-world EHR setting.Our proposed method shows remarkable improvements in prediction performance compared to well-known approaches like APOE gene and PRS models.We successfully identi ed shared and ancestry-speci c risk genes and biological pathways contributing to dementia risks for each non-European GIA group.
Finally, we bolstered the reliability and generalizability of our ndings by validating our models using a comparable EHR sample from the All of Us cohort.
Our study highlights the signi cance of prioritizing biologically meaningful SNPs in genetic prediction.GWASs often identify genomic regions with multiple correlated SNPs, which may encompass several closely located genes.However, not all of these genes are relevant to the disease. 72Functional annotation of genetic variants enabled us to target potential causal SNPs by considering various factors, such as regional LD patterns, functional consequences of variants, their impact on gene expression, and their involvement in chromatin interaction sites. 59In our models developed on UCLA ATLAS samples, we achieved signi cant improvements in model performance by prioritizing biologically meaningful SNPs, ranging from 21-61% in AUPRC and 10-21% in AUROC across different GIA groups, compared to the APOE-e4 count and the best-performing PRS models.These results underscore the critical role of considering functional and biological information in enhancing the performance of genetic prediction models, especially in diverse populations.
It is worth highlighting that no discernible performance differences were observed between PRSs constructed using genome-wide-signi cant and gene-annotated SNPs.This can be attributed to the strong LD between genome-widesigni cant and gene-annotated SNPs within the same genomic region.As a result, these SNPs tend to have similar effect estimates in the GWASs.Thus, it is expected that the PRSs built with these two sets of SNPs would exhibit a high correlation (Supplementary Table 9), which further supports the notion that the choice of genome-wide-signi cant or geneannotated SNPs does not signi cantly impact the predictive performance of the PRSs in our study.
Moreover, our study emphasizes the signi cance of incorporating risk factors from multiple dementia-related diseases when developing predictive models for complex conditions like dementia.Both ancestry-speci c Elastic Net SNP models highlighted several PD and PSP risk variants as signi cant predictors of dementia.This nding aligns with the well-known complexity of dementia as a multifactorial disorder that shares common features with these related conditions. 73However, it is worth noting that including PRSs of those diseases did not signi cantly improve the overall performance (Fig. 2).This result is consistent with research conducted by Clark et al., 74 in which they demonstrated that a combined genetic score, which incorporated risk variants for AD and 24 other traits, had an equivalent predictive power as the AD PRS on its own.One possible explanation is that many traits were not dementia etiologies and diluted the effects of the true causal SNPs in the models.
Our proposed Elastic Net SNPs models identi ed several shared risk factors across different ancestries.Notably, a substantial proportion of the identi ed shared genes were found near the chr19q13 region, which is well-known for the AD risk gene cluster comprising APOE-TOMM40-APOC1.These ndings align with previous research, 6,52,64 further supporting the signi cance of this genomic region in contributing to the genetic risks associated with dementia.
At the same time, we have discovered compelling evidence supporting our hypothesis that risk SNPs associated with dementia, along with their corresponding weights, exhibit signi cant variations across diverse populations.Notably, our analysis of PRS models revealed that the performance of PRS built with the European population GWAS was worse when predicting a non-European GIA group.On the other hand, we also observed that the APOE-e4 count model performed better than most PRS models in HLA and AA GIA samples.[77][78] In addition, we observed notable differences in the feature importance of various SNPs within the best-performing Elastic Net models across distinct GIA groups.Consequently, this led us to identify ancestry-speci c genes and distinct biological pathways implicated in the genetic predisposition to dementia in diverse ancestral samples.These ndings highlight the uniqueness of genetic risk factors and functional pathways in diverse population groups.
Finally, we validated our models using samples from separate EHR linked with genetic data (All of Us).Our proposed Elastic Net SNP model consistently outperformed the APOE-e4 and the best PRS models.While the Elastic Net SNP model demonstrated effective performance in both HLA and AA populations, we observed a decrease in the general performance and signi cance (AUPRC and AUROC) in the All of Us sample compared to the UCLA ATLAS sample, particularly in the AA samples.One potential explanation for this discrepancy is the distinct population structure within each sample, as revealed by comparing patient characteristics (Supplementary Table 7).These ndings underscore the in uence of populationspeci c factors on the generalizability of genetic risk models, highlighting the critical need to account for population diversity in predictive models for complex diseases.
Our study boasts several notable strengths that contribute to its signi cance and impact.Firstly, machine learning techniques applied in our study allowed us to infer crucial dementia risk factors for underrepresented populations, such as HLA and AA, with GWAS summary statistics from extensively studied populations like Europeans.This approach enabled a deeper understanding of the genetic landscape of dementia in underrepresented populations, particularly valuable given the current limitations in large-sample-size GWASs speci c to these groups.Secondly, we forti ed the robustness and generalizability of our ndings through the validation of our model on an independent dataset from the All of Us cohort.
Furthermore, our innovative approach, which incorporated biologically relevant genetic markers and functional annotations, signi cantly enhanced the accuracy of disease prediction.This approach can be readily adapted to predict other complex diseases, extending the scope of its applications and enriching our understanding of diverse human populations' genetic traits.
However, we acknowledge certain limitations.Firstly, we observed variations in the composition of dementia subtypes among different GIA groups' case samples.Consequently, the distinct genes and biological pathways identi ed by different ancestry models should be interpreted with this consideration.Secondly, although our study identi ed potential risk SNPs and genes associated with dementia, additional experimentation is necessary to understand the precise mechanisms underlying the association of these factors with dementia.Thirdly, due to the limited number of dementia cases in the All of Us EAA GIA sample after applying our inclusion criteria, we could only validate our models and ndings in the HLA and AA samples.As a result, the generalizability of our ndings to the EAA ancestry is constrained.
In light of these limitations, further research with more extensive and diverse datasets, encompassing a broader range of dementia subtypes and GIA groups is imperative to strengthen the validity and applicability of our study's outcomes.Such efforts will contribute to a more comprehensive understanding of the genetic complexities underlying dementia across diverse populations.

Conclusions
Our study introduces a novel and robust approach to assessing individual genetic risks for dementia across diverse populations in a real-world setting.Our study demonstrates the importance of considering functional and biological information and population diversity when developing predictive models for complex diseases like dementia.The ndings from our research provide valuable insights into the intricate genetic factors underlying dementia.Moreover, this work opens up promising avenues for developing more accurate and e cient predictive models for complex genetic traits in diverse human populations.Such advancements can potentially be paired with the development of targeted treatments tailored to the speci c genetic pro les of individuals affected by dementia and related conditions.

Abbreviations
Abbr

Figures Figure 1
Figures

Figure 2 Overall
Figure 2

Table 1
Descriptive statistics of demographic and electronic health record features by case/control groups, UCLA ATLAS sample, strati ed by genetic inferred ancestry group model performance of APOE-e4 count, polygenic risk score, and Elastic Net SNP in dementia genetic prediction in validation of All of Us sample, stratified by genetic inferred ry