Polygenic Risk Score Effectively Predicts Risk of Depression Onset in Alzheimer’s Disease

Introduction: Depression is a common, though heterogeneous, comorbidity in late-onset Alzheimer’s Disease (LOAD) patients. In addition, individuals with depression are at greater risk of developing LOAD. In previous work, we demonstrated shared genetic etiology between depression and LOAD. Collectively, this evidence suggests interactions between depression and LOAD. However, the genetic heterogeneity underpinning the co-occurrence of depression with LOAD is largely unknown. Methods: Major Depressive Disorder (MDD) genome-wide association study (GWAS) summary statistics were used to create polygenic risk scores (PRS). The Religious Orders Study and Rush Memory and Aging Project (ROSMAP) and National Alzheimer’s Coordinating Center (NACC) datasets were used to assess the performance of the PRS in predicting depression onset in LOAD patients. Results: The developed PRS showed marginal performance in standalone models for predicting depression onset in both ROSMAP (AUC=0.540) and NACC (AUC=0.534). Full models, with baseline age, sex, education, and APOEε4 allele count, showed improved prediction of depression onset (ROSMAP AUC: 0.606, NACC AUC: 0.583). In time-to-event analysis, standalone PRS models showed significant effects in ROSMAP (P=0.0051) but not in the NACC cohort. Full models showed significant performance in predicting depression in LOAD for both datasets (P<0.001 for all). Discussion: This study provides new insights into the genetic factors contributing to depression onset in LOAD and advances our knowledge of the genetics underlying the heterogeneity of depression in LOAD. The developed PRS accurately identified LOAD patients with depressive symptoms and thus has clinical implications, including the identification of LOAD patients at high risk of developing depression for early antidepressant treatment.


Introduction
Neuropsychiatric symptoms (NPS) are common in late-onset Alzheimer's Disease (LOAD) and are characterized by heterogeneity, with highly variable onset, duration, and severity. Amongst LOAD patients with comorbid NPS, depression and anxiety are the most prevalent [1][2][3][4][5]. Furthermore, individuals with depression are at greater risk of developing LOAD, suggesting that treating depression may delay LOAD 2, 6. In addition, distinct trajectories of increasing risk of depression were associated with LOAD pathology, such as lower cerebrospinal fluid (CSF) Aβ42 and higher CSF total and phosphorylated tau, highlighting the heterogeneity of depression within LOAD 5. Interestingly, we previously identified shared genetic etiology between LOAD and major depressive disorder (MDD) 6. Collectively, this evidence lends support for interrelationships between LOAD and depressive disorders 6.
Polygenic risk scores (PRS) offer a method to explore the relationships that may exist between LOAD and depression. The current LOAD PRS landscape focuses on predicting LOAD diagnosis [7][8][9][10][11][12], with a few studies applying pathway and functional analysis to the selection of SNPs for PRS calculation 13,14. LOAD PRS have been tested for predicting progression from mild cognitive impairment (MCI) to LOAD 7,12,15. Additionally, studies have tested the association of PRS with LOAD phenotypes in CSF biomarkers 7,[11][12][13][14][15][16] and motor-function impairment 17. Other than associations with biomarker data, the effectiveness of PRS in predicting heterogeneous LOAD endophenotypes, especially comorbid NPS including depression, has yet to be thoroughly examined.
In this study, we generated PRS and tested their effectiveness in predicting depression risk and onset time course in LOAD patients. We created a novel PRS based on MDD genome-wide association study (GWAS) summary statistics and examined its utility in predicting the risk of developing depression symptoms in LOAD patients using two well-characterized LOAD cohorts from the Religious Orders Study and Rush Memory and Aging Project (ROSMAP) and the National Alzheimer's Coordinating Center (NACC).

Study cohorts
Two cohorts were used to evaluate the performance of the PRS in predicting risk of depression onset: ROSMAP and NACC. We used only the samples that had available genetic data and information on depression phenotypes. All samples were LOAD patients. Cases were defined as LOAD patients with depression symptoms, and controls were LOAD patients who did not experience depression (Fig. 1). To further control for APOE as a confounding factor, we repeated the analyses in sub-cohorts restricted to APOEε3 homozygotes (Fig. 1). Of note, the ROSMAP sample is also included in the NACC data. Table 1 summarizes the descriptive statistics for the ROSMAP and NACC samples used in this study.
Edition, Revised (DSM-III-R) 24. A diagnosis of highly probable, probable, or possible depression (r_depres = 1, 2, or 3) at any study visit was deemed a depression case. Thus, a single instance of depression during the study was sufficient to classify a subject as a depression case. The classification of the cohort and the number of subjects in each category (LOAD with depression and LOAD only) are described in a flowchart (Fig. 1).
Other variables included were age at baseline (age_bl), sex (msex), years of education (educ), financial_need, and apoe_genotype. The educ variable represents years of education 25, and the financial_need variable estimates the total adverse events in childhood 26. The financial_need variable is only available in the MAP data. The apoe_genotype variable specifies the subject's APOE genotype 27; this variable was converted to a new variable that counts the number of APOEε4 alleles (0, 1, or 2).
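The recoding of the APOE genotype into an ε4 allele count can be illustrated with a minimal sketch. It assumes the genotype is stored as a two-digit code with one digit per allele (for example 33 or 34); that coding and both function names are assumptions made for illustration, not details taken from the study.

```python
def apoe_e4_count(apoe_genotype):
    """Count APOE e4 alleles (0, 1, or 2) in a two-digit genotype code.

    Assumption: one digit per allele, e.g. 23, 33, 34, 44.
    Returns None for missing or malformed codes.
    """
    if apoe_genotype is None:
        return None
    code = str(apoe_genotype).strip()
    if len(code) != 2 or any(ch not in "234" for ch in code):
        return None
    return code.count("4")


def is_e3_homozygote(apoe_genotype):
    """True when both alleles are e3, as in the APOE e3 homozygote sub-cohort."""
    return str(apoe_genotype).strip() == "33"
```

Under this coding, a subject with genotype 34 contributes an ε4 count of 1 and is excluded from the ε3 homozygote subgroup.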
Overall, 517 LOAD cases from the ROSMAP cohort (total n = 1708) were used in our study, comprising 187 depression cases and 330 controls (i.e., LOAD only) (Fig. 1). The sample was 68% female, with an average age at baseline of 81.5 (SD = 6.7) and an average of 16.2 years of education (SD = 3.7) (Table 1). Of the LOAD cases, 284 were APOEε3 homozygotes, with 112 cases (LOAD with comorbid depression) and 172 controls (LOAD only) (Fig. 1); 70.4% of this subgroup were female, and the average baseline age was 81.8 (SD = 6.7) (Table 1).

NACC
The second sample was obtained from the National Alzheimer's Coordinating Center (NACC) 20. The NACC is composed of 29 Alzheimer's Disease Research Centers (ADRC) located throughout North America. Data collection and management vary between centers, with each center enrolling subjects based on specific research interests. Some ADRCs require subjects to agree to autopsies. Written informed consent was acquired from each subject. The primary diagnosis variable (dx) was used to select LOAD cases, with dx = 050 corresponding to Alzheimer's Disease. The variable DEP was used to select depression cases. These values result from a clinical diagnosis of depression using the Geriatric Depression Scale (GDS) 28. Cases were defined as participants with a diagnosis of depression within the last two years (DEP = 1), while controls were those who did not have a depression diagnosis (DEP = 0). As with ROSMAP, any instance of depression throughout the study course was marked as a depression case. The classification of the cohort and the number of subjects in each category (LOAD with depression and LOAD only) are described in a flowchart (Fig. 1). Other variables included were age at baseline (NACCAGEB), sex, years of education (EDUC), and APOE genotype.
Overall, 2,968 LOAD cases from the entire NACC data (7,627) were used in our study, comprising 1,083 depression cases and 1,885 controls (i.e., LOAD only) (Fig. 1). The sample was 52.1% female, with an average baseline age of 76.3 (SD = 9.1) and an average of 15.6 years of education (SD = 6.9) (Table 1). Of the NACC LOAD subjects, 1,092 were APOEε3 homozygotes, with 409 cases and 683 controls (Fig. 1); 50.4% of this subgroup were female, the average baseline age was 78.0 (SD = 9.9), and the average education was 15.7 years (SD = 7.4) (Table 1).

PRS Calculation
Two formulas were used to calculate PRS. Formula 1 calculates the score by multiplying each SNP's beta value (β) by the number of effect alleles (X) and summing these products over all SNPs; this score is hereafter referred to as PRS. Formula 2 uses the risk allele (G), that is, the allele with the positive beta value; this score is hereafter referred to as risk-increasing PRS 30. The number of risk alleles at each SNP is multiplied by its respective beta value, and the products are summed. This sum is then multiplied by the total number of SNPs (T) divided by the sum of all the beta values. This scaling allows the risk-increasing PRS to represent the average of risk alleles, providing a result that is interpretable in terms of the number of risk alleles 30.
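The two scores described above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's implementation: the function names are hypothetical, and the handling of SNPs with negative betas in the second function (flipping to the other allele so that every weight is positive) is an assumption about how "the allele with the positive beta value" is obtained.

```python
def prs(betas, dosages):
    """Formula 1: sum over SNPs of beta * number of effect alleles."""
    if len(betas) != len(dosages):
        raise ValueError("betas and dosages must have the same length")
    return sum(b * x for b, x in zip(betas, dosages))


def risk_increasing_prs(betas, dosages):
    """Formula 2: beta-weighted risk-allele count scaled by T / sum(beta).

    Assumption: a SNP with a negative beta is flipped so that the other
    allele is the risk allele (beta -> |beta|, dosage -> 2 - dosage).
    """
    if len(betas) != len(dosages):
        raise ValueError("betas and dosages must have the same length")
    risk_betas = [abs(b) for b in betas]
    risk_counts = [x if b >= 0 else 2 - x for b, x in zip(betas, dosages)]
    total_beta = sum(risk_betas)
    if total_beta == 0:
        raise ValueError("sum of betas is zero; score is undefined")
    weighted = sum(b * g for b, g in zip(risk_betas, risk_counts))
    # T is the number of SNPs in the score.
    return weighted * len(betas) / total_beta
```

When all betas are equal, the second score reduces to the plain count of risk alleles, which is what makes it readable on the risk-allele scale.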

Statistical Analysis
Logistic regression and Receiver Operating Characteristic (ROC) curves were used to assess the performance of the PRS in predicting depression within the LOAD-only samples. These analyses were completed for all p-value thresholds to determine the optimal threshold for prediction, which was then used in subsequent analyses. For ROSMAP, we assessed a prediction model that included the covariates APOEε4 allele count, financial need, and sex together with the PRS built at the 0.005 p-value SNP selection threshold. For NACC, we assessed a prediction model that included the covariates education, APOEε4 allele count, and sex together with the PRS built at the 0.001 p-value SNP selection threshold. Prediction models excluding PRS were constructed in both ROSMAP and NACC and compared with the respective models including PRS using the DeLong test 32. Additionally, time-to-event analysis was conducted with left-truncated (age at entry) and right-censored (age at depression onset or age at last visit) data. Statistical analysis was completed in JMP Pro 15 33, and the DeLong tests were run in the MedCalc application 34.
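To make the AUC metric concrete, the following sketch computes it directly from its rank-based definition: the probability that a randomly chosen depression case receives a higher predicted score than a randomly chosen control, with ties counted as one half. The analyses in this study were run in JMP Pro and MedCalc; this standalone function (a hypothetical name) only illustrates the quantity being reported and does not implement the DeLong comparison.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise case/control comparisons.

    scores: predicted risk (e.g. a PRS or a fitted model probability)
    labels: 1 for depression case, 0 for control
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0      # case ranked above control
            elif c == k:
                wins += 0.5      # tie counts as half
    return wins / (len(cases) * len(controls))
```

An AUC of 0.5 corresponds to chance-level ranking, so values such as 0.540 indicate only a small separation between cases and controls.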

Prediction of onset of depression in LOAD
We created 13 PRS for each dataset using multiple p-value thresholds (hereafter P Threshold). For each cohort, SNPs were selected according to each p-value threshold (SNP counts by P Threshold are summarized in Supplementary Table 1). Logistic regression plots of PRS and depression phenotype were then used to select the optimal P Threshold; that is, the P Threshold with the greatest classification ability was selected for inclusion in the prediction models.
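The thresholding step can be sketched as follows: for each p-value threshold, keep the SNPs whose GWAS p-value passes the threshold and score the subject on those SNPs only. The data layout, the function names, the SNP identifiers, and the use of an inclusive comparison (p ≤ threshold) are all assumptions for illustration; the 13 thresholds used in the study are not reproduced here.

```python
def select_snps(summary_stats, p_threshold):
    """Return SNP IDs whose GWAS p-value is at or below the threshold.

    summary_stats: dict mapping SNP ID -> (beta, p_value)
    """
    return sorted(snp for snp, (_, p) in summary_stats.items() if p <= p_threshold)


def prs_by_threshold(summary_stats, dosages, thresholds):
    """Compute one PRS per threshold for a single subject.

    dosages: dict mapping SNP ID -> effect-allele count (0, 1, or 2);
    SNPs that were not genotyped in the cohort are skipped.
    """
    scores = {}
    for t in thresholds:
        snps = [s for s in select_snps(summary_stats, t) if s in dosages]
        scores[t] = sum(summary_stats[s][0] * dosages[s] for s in snps)
    return scores


# Toy example with made-up SNP IDs and values.
stats = {"rsA": (0.1, 0.0005), "rsB": (-0.2, 0.004), "rsC": (0.3, 0.2)}
subject = {"rsA": 2, "rsB": 1, "rsC": 1}
scores = prs_by_threshold(stats, subject, [0.001, 0.005, 1.0])
```

Skipping SNPs that are absent from a cohort's genotype data reflects why the same threshold can yield different SNP counts in different datasets.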

ROSMAP
The logistic regression analysis with a P Threshold of 0.005 resulted in the greatest effect (beta = 0.153, P = 0.089; Supplementary Table 2), with an AUC of 0.540 (Table 2). Next, we further evaluated this best-performing PRS, constructed from the SNPs selected at P Threshold = 0.005. We applied the full model, which, in addition to the PRS (P Threshold = 0.005), included baseline age, sex, years of education, and APOEε4 allele count (Supplementary Table 5). The model resulted in an AUC of 0.606, which improved to 0.680 with the inclusion of childhood financial need as an additional variable (Fig. 2a,

Predicting Time to Depression Onset
Time-to-event analysis was conducted to assess the ability of the PRS to identify those at risk of developing depression early in their LOAD trajectory by examining the time interval between age at study entry and age at depression onset (or age at last visit if no depression occurred). We tested PRS calculated by the two formulas (see Methods section): (1) PRS and (2) risk-increasing PRS. PRS uses the standard calculation approach, while risk-increasing PRS uses alleles with positive betas, or risk alleles, providing a score that is interpretable in terms of the number of risk alleles.
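The structure of left-truncated, right-censored data on the age scale can be sketched as below. Each subject contributes an interval from age at entry to either age at depression onset (event) or age at last visit (censored), and is counted as at risk at a given age only after entering the study. The record layout and function names are hypothetical; the study's analysis was run in JMP Pro, and this sketch does not fit a survival model.

```python
from collections import namedtuple

# entry_age: left-truncation time; exit_age: onset or last visit; event: 1 or 0
Record = namedtuple("Record", ["entry_age", "exit_age", "event"])


def make_record(age_at_entry, age_at_onset, age_at_last_visit):
    """Build one subject's record; age_at_onset is None if no depression occurred."""
    if age_at_onset is not None:
        return Record(age_at_entry, age_at_onset, 1)
    return Record(age_at_entry, age_at_last_visit, 0)


def at_risk(records, age):
    """Count subjects under observation at `age`.

    A subject is at risk if they entered before `age` (left truncation)
    and had not yet had the event or been censored before `age`.
    """
    return sum(1 for r in records if r.entry_age < age <= r.exit_age)


# Toy cohort with made-up ages.
cohort = [
    make_record(70, 75, 80),    # depression onset at 75
    make_record(78, None, 85),  # no depression, censored at 85
    make_record(72, None, 74),  # no depression, censored at 74
]
```

Ignoring the entry age would treat every subject as observed from birth, which overstates the risk set at younger ages; conditioning on entry avoids that bias.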

NACC
We used the PRS (P Threshold = 0.001) and the risk-increasing PRS (P Threshold = 0.001) in the time-to-event analyses. The models employing the PRS and the risk-increasing PRS, each alone, did not produce significant results; however, the risk-increasing PRS had improved performance, as supported by a smaller p-value (Table 3). The full models, using each of the two PRS formulas with other covariates, showed significant results (Table 3, Supplementary Tables 17 and 21). The full model using the PRS had significant contributions from baseline age, sex, education, and PRS (beta = 0.041, P = 0.028), while the full model using the risk-increasing PRS had significant contributions from baseline age, education, and risk-increasing PRS (beta = 0.001, P = 0.034) (Supplementary Tables 17 and 21, respectively).
Repeating the analyses for the APOEε3 homozygote subgroup did not show significant results for the models using the PRS or the risk-increasing PRS alone. As in the entire LOAD NACC cohort, the risk-increasing PRS demonstrated marginally better performance, with a smaller p-value (Table 3). The full models for both the PRS and the risk-increasing PRS were significant (Supplementary Tables 18 and 22), with major contributions from the covariates baseline age, sex, and education. However, neither the PRS nor the risk-increasing PRS had significant effects in their respective full models.

Discussion
LOAD is a heterogeneous disease with various genetic etiologies 35,36 and diverse phenotypes, including heterogeneity of biomarkers 37, coexisting pathologies 38, and clinical symptoms 38-41. Clinical heterogeneity is also manifested by comorbid neuropsychiatric symptoms (NPS), amongst which depression is very common. However, why some LOAD patients develop depression while others do not remains elusive. Previously, we found genetic pleiotropy between MDD and LOAD 6, suggesting that genetics may contribute to the risk of depressive symptoms in LOAD. To test this hypothesis, we performed the first genetic comparison analysis between LOAD patients with and without depression to explore the genetic heterogeneity of the risk and onset time of depression in individuals with LOAD. We derived a PRS that showed moderate effects in predicting depression onset in LOAD patients. The predictive ability of the PRS was improved by including the covariates age, sex, education, and APOEε4 allele count, and the addition of childhood financial need further enhanced the predictive performance of the model.
PRS are a well-established approach for studying the genetics of complex diseases, including LOAD, and the utility of PRS for predicting LOAD risk has been investigated by different groups [7][8][9][10][11][12][13][14][15][16][17]. However, to our knowledge, this is the first study to extend the use of PRS to the prediction of clinical endophenotypes in LOAD, in particular depression. Our study is innovative in several ways: (1) The study was uniquely designed such that all subjects are LOAD patients, and the manifestation of depression defined the case-control status. (2) Most prior LOAD PRS studies focused on LOAD prediction employing LOAD GWAS summary statistics. Here we tested the utility of PRS based on GWAS data from one disorder (MDD) to predict risk for a shared phenotype (depression) in individuals with another disorder (LOAD). (3) While previous work identified unique trajectories of depression and apathy in LOAD subjects and biomarkers associated with LOAD-specific depression progression 5, the current work focused on a genetics-based prediction model of depression in LOAD. Collectively, our approach generated PRS to identify LOAD subjects with greater genetic risk of developing depression and those at risk of developing depression earlier in the time course of LOAD.
The PRSs generated for the two cohorts, ROSMAP and NACC, were different because the cohorts differ in their genotyped SNPs, which led to a distinct number of SNPs being used in each. Nevertheless, applying the PRS in both ROSMAP and NACC demonstrated the effectiveness of our approach in different datasets. The results obtained for the two cohorts were generally consistent, although there were some differences. In the NACC cohort, the PRS alone was more effective in classifying depression cases, as evidenced by the logistic regression analysis, and it made more significant contributions to the full prediction model than in ROSMAP. However, the overall model performance was greater in ROSMAP. A possible explanation is that ROSMAP is more homogeneous than NACC, as ROSMAP covers a narrower range of baseline ages and is disproportionately female 18-20.

Figure 1. Sample selection flowchart. The total samples from both ROSMAP and NACC datasets were divided into LOAD and non-LOAD groups, where the non-LOAD group was not studied. Depression cases and controls were identified in the LOAD sample of both datasets. APOEε3 homozygotes were then selected from the LOAD sample to account for potential confounding by the APOEε4 allele.