Briefly, in this work we first studied whether the individual and combined endophenotype-PRSs can capture the risk of dementia (expressed as both odds ratio and hazard ratio) as well as the age of dementia onset. Next, we tested the potential of the endophenotype-PRSs to capture genetic risk beyond APOE by studying differences in dementia risk and median survival among ε3/ε3 participants. Finally, we fitted linear mixed models to determine the relationship between the PRSs and the longitudinal trajectories of known AD endophenotypes.
2.1 Study population
Data used in the preparation of this work were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu) (15). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early AD. To develop the endophenotype-PRSs, we focused on 11 biomarkers from ADNI1/GO/2, each representing amyloid, tau, neuronal or vascular pathology (Fig. 1). Of the 1,550 individuals available, only 585 had complete baseline information on all 11 biomarkers of interest. Of these 585 participants, 80% were used for training and 20% for validation, while the remaining 965 participants were used for testing (Table 1). Diagnosis was based on clinical criteria and consisted of five categories: cognitively normal (CN), significant memory concerns (SMC), early mild cognitive impairment (EMCI), late mild cognitive impairment (LMCI) and demented (Dem), where Dem denotes participants whose diagnosis was based on clinical rather than pathological evidence (16).
Table 1
Characteristic | Full | Training | Validation | Testing |
Count | 1550 | 468 | 117 | 965 |
Age (years), mean | 73.4 | 72.2 | 71.6 | 74.4 |
Age (years), range | 47–91 | 47–91 | 55–88 | 54–90 |
Gender (%), male | 59.3 | 53.5 | 50.0 | 58.8 |
Gender (%), female | 40.7 | 46.5 | 50.0 | 41.2 |
Diagnosis (%), baseline / final | | | | |
CN | 23.7 / 22.1 | 21.8 / 22.1 | 19.5 / 22.0 | 24.2 / 21.5 |
SMC | 6.0 / 5.7 | 12.0 / 11.6 | 11.9 / 11.9 | 2.4 / 2.2 |
EMCI | 18.0 / 17.5 | 29.8 / 26.6 | 32.3 / 26.3 | 10.4 / 11.9 |
LMCI | 33.1 / 20.0 | 18.6 / 10.9 | 18.6 / 11.0 | 41.5 / 25.8 |
Dem | 19.2 / 34.7 | 17.8 / 28.9 | 17.8 / 28.8 | 21.5 / 38.7 |
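The partition described in Section 2.1 can be sketched as follows. This is a minimal illustration that assumes a simple random 80/20 split of the 585 complete-case participants; the integer IDs, the seed and the function name `split_participants` are hypothetical and not taken from the original analysis.

```python
import random

def split_participants(all_ids, complete_ids, train_frac=0.8, seed=0):
    """Split complete-case IDs into training/validation; all others form the test set."""
    complete = sorted(complete_ids)
    rng = random.Random(seed)      # fixed seed, used only to make the sketch reproducible
    rng.shuffle(complete)
    n_train = round(train_frac * len(complete))
    train = set(complete[:n_train])
    validation = set(complete[n_train:])
    # Participants without a complete baseline biomarker set go to testing.
    test = set(all_ids) - set(complete_ids)
    return train, validation, test

# Hypothetical integer IDs standing in for ADNI participant identifiers.
all_ids = range(1550)
complete_ids = range(585)
train, validation, test = split_participants(all_ids, complete_ids)
print(len(train), len(validation), len(test))  # 468 117 965, matching the counts in Table 1
```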
2.2 Biomarker PCA
We focused on integrating information from 11 biomarkers that fall under the A/T/N/V framework (16). This is an expansion of the A/T/N framework (17), which was developed to reflect the pathophysiological progression of the disease and thus provide a better understanding of its clinical stages. The set of biomarkers that we selected for PRS development is presented in Fig. 1. These include CSF and PET amyloid (A), CSF tau (T), MRI and FDG-PET (N) from selected regions of interest (ROI), as well as white matter hyperintensity (V). To summarize the information from these biomarkers, we performed principal components analysis (PCA) simultaneously on the residuals of all 11 biomarkers, which were first pre-adjusted for age, sex, years of education, and the first two genetic PCs to control for population stratification. PCA was applied to the 585 individuals with full baseline biomarker data (Fig. 1). For each participant, the baseline was defined as the first time point with available measurements for all 11 biomarkers. The analysis returned four components, each representing one biomarker group (A, T, N, and V).
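The pre-adjustment step can be written out explicitly. The following is one possible formalization, assuming an ordinary linear regression fitted separately for each biomarker; the symbols \(y_{jk}\), \(\mathbf{x}_{j}\) and \(r_{jk}\) are introduced here for illustration only.

$$y_{jk}=\alpha_{k}+\boldsymbol{\beta}_{k}^{\top}\mathbf{x}_{j}+\varepsilon_{jk},\qquad r_{jk}=y_{jk}-\hat{\alpha}_{k}-\hat{\boldsymbol{\beta}}_{k}^{\top}\mathbf{x}_{j},\qquad k=1,\dots ,11$$

Here \(y_{jk}\) is the baseline value of biomarker \(k\) for participant \(j\), and \(\mathbf{x}_{j}\) collects age, sex, years of education and the first two genetic PCs. Under this reading, PCA is applied to the \(585\times 11\) matrix of residuals \(r_{jk}\).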
2.3 Single Nucleotide Polymorphism (SNP) filtering
For each of the four endophenotype components, we ran a GWAS on the same 585 participants used in the PCA step (Fig. 1). The genotype data were HRC-imputed and comprised 5,406,481 SNPs. The GWAS results were filtered using a range of p-value cut-offs (5e-05, 8e-05, 1e-04, 8e-04, 1e-03, 5e-03, 1e-02). To account for linkage disequilibrium (LD), we performed clumping in PLINK on SNPs with MAF \(\ge\) 5%, using \(r^{2}=0.1\) and a window of 1,000 kb. From the APOE region (defined as 1 Mb upstream and downstream of the gene position, 44,409,039 to 46,412,650), only rs429358 and rs7412 were retained.
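The thresholding and APOE-region rules above can be sketched in a few lines. This is an illustrative sketch only: the row format, the function names, the toy SNP IDs and positions, and the choice to keep the two APOE SNPs regardless of their GWAS p-value are all assumptions, and LD clumping (done with PLINK in the study) is not reproduced here.

```python
P_THRESHOLDS = [5e-05, 8e-05, 1e-04, 8e-04, 1e-03, 5e-03, 1e-02]
APOE_CHR, APOE_START, APOE_END = 19, 44_409_039, 46_412_650
APOE_KEEP = {"rs429358", "rs7412"}

def in_apoe_region(snp):
    """True if the SNP lies inside the APOE window given in the text."""
    return snp["chr"] == APOE_CHR and APOE_START <= snp["pos"] <= APOE_END

def filter_snps(gwas_rows, p_threshold, min_maf=0.05):
    """Return the IDs of SNPs that pass the p-value and MAF filters.

    Inside the APOE region only rs429358 and rs7412 are kept; keeping them
    irrespective of p-value is an assumption of this sketch.
    """
    kept = []
    for snp in gwas_rows:
        if in_apoe_region(snp):
            if snp["id"] in APOE_KEEP:
                kept.append(snp["id"])
            continue
        if snp["p"] <= p_threshold and snp["maf"] >= min_maf:
            kept.append(snp["id"])
    return kept

# Toy GWAS rows; IDs other than the two APOE SNPs, and all positions, are made up.
rows = [
    {"id": "rs_a", "chr": 1, "pos": 1_000, "p": 3e-05, "maf": 0.20},
    {"id": "rs_b", "chr": 2, "pos": 2_000, "p": 9e-04, "maf": 0.30},
    {"id": "rs_c", "chr": 3, "pos": 3_000, "p": 1e-05, "maf": 0.01},   # fails MAF
    {"id": "rs429358", "chr": 19, "pos": 45_400_000, "p": 0.5, "maf": 0.15},
    {"id": "rs7412", "chr": 19, "pos": 45_400_100, "p": 0.5, "maf": 0.08},
    {"id": "rs_d", "chr": 19, "pos": 45_000_000, "p": 1e-08, "maf": 0.25},  # APOE region, dropped
]
selected = {t: filter_snps(rows, t) for t in P_THRESHOLDS}
print(selected[5e-05])
```

Each entry of `selected` would then feed the clumping and Lasso steps for the corresponding p-threshold.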
2.4 Further SNP filtering and SNP weight calculation
In addition to p-thresholding, we further filtered the SNPs by applying Lasso (18), a type of penalized regression. At each p-threshold and for each biomarker component, Lasso returned a list of SNPs and their corresponding weights. The Lasso penalty was determined by tuning the lambda parameter using 10-fold cross-validation. The criterion for optimal lambda selection was minimization of the mean square error (MSE). While shrinkage was applied to SNPs, the baseline age, sex, years of education, as well as the two APOE SNPs, rs429358 and rs7412, were not subject to penalization. To increase the stability of the result, Lasso was bootstrapped 100 times on the training set, returning at each iteration a list of SNPs and SNP weights. The final SNP list was obtained by retaining the most frequently selected SNPs (selection frequency \(\ge\) 80%, Fig. 1).
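The stability filter at the end of this step (selection frequency \(\ge\) 80%) can be illustrated as below. The sketch assumes that each bootstrap iteration yields a list of Lasso-selected SNP IDs; the Lasso fit itself is not shown, and the function name `stable_snps` and the toy data are hypothetical.

```python
from collections import Counter

def stable_snps(bootstrap_selections, min_frequency=0.8):
    """Keep SNPs selected in at least `min_frequency` of the bootstrap runs."""
    n_runs = len(bootstrap_selections)
    # set(run) guards against a SNP being counted twice within one run.
    counts = Counter(snp for run in bootstrap_selections for snp in set(run))
    return sorted(snp for snp, c in counts.items() if c / n_runs >= min_frequency)

# Toy example with 5 bootstrap runs instead of the 100 used in the study.
runs = [
    ["rs1", "rs2", "rs3"],
    ["rs1", "rs2"],
    ["rs1", "rs3"],
    ["rs1", "rs2"],
    ["rs1", "rs2", "rs4"],
]
print(stable_snps(runs))  # ['rs1', 'rs2']
```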
According to the literature, re-weighted SNP coefficients may achieve improved PRS performance (5, 19) compared to the traditional case/control GWAS SNP effects. Because Lasso estimates tend to be biased (20), we followed a two-step procedure by refitting a linear regression model on the Lasso-selected SNPs. The regression model was adjusted for the covariates age, sex, and years of education and was fitted separately for each of the four endophenotype components. The process was bootstrapped 100 times on the training set, and the final PRS SNP weight was calculated by averaging the corresponding regression coefficient over the 100 bootstrap iterations, as described in Eq. (1). Here, \({w}_{s}\) is the new weight for SNP \(s\), \(i\) is the bootstrap iteration index, \(N\) is the total number of bootstrap iterations (in this case \(N=100\)), and \({coef}_{si}\) is the regression coefficient for SNP \(s\) at the \(i\)th iteration.
$${w}_{s}=\frac{1}{N}\sum _{i=1}^{N}{coef}_{si} \tag{1}$$
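Eq. (1) amounts to a per-SNP average of the refitted coefficients. A minimal sketch, assuming each bootstrap iteration is stored as a dictionary mapping SNP ID to coefficient and that every SNP appears in every iteration (the function name and toy values are hypothetical):

```python
def average_weights(bootstrap_coefs):
    """Eq. (1): w_s = (1/N) * sum over iterations i of coef_si."""
    n = len(bootstrap_coefs)
    snps = bootstrap_coefs[0].keys()
    return {s: sum(run[s] for run in bootstrap_coefs) / n for s in snps}

# Toy example with N = 4 iterations instead of 100.
coefs = [
    {"rs1": 0.25, "rs2": -0.125},
    {"rs1": 0.50, "rs2": -0.125},
    {"rs1": 0.75, "rs2": -0.125},
    {"rs1": 0.50, "rs2": -0.125},
]
weights = average_weights(coefs)
print(weights)  # {'rs1': 0.5, 'rs2': -0.125}
```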
2.5 Individual and Combined Biomarker-PRS
At each p-threshold and for every participant \(j\), we calculated four individual endophenotype-PRSs (PRSA, PRST, PRSN, PRSV) based on Eq. (2). Each PRS was expressed as the weighted sum of minor allele counts across SNPs. Specifically, for the \({j}^{th}\) individual, the PRS for endophenotype \(b\), \({PRS}_{bj}\), was obtained by multiplying the minor allele count \({d}_{sj}\) of SNP \(s\) by the SNP weight \({w}_{s}\) (described in Eq. 1) and summing over the \({S}_{b}\) SNPs retained for endophenotype \(b\).
$${PRS}_{bj}=\sum _{s=1}^{{S}_{b}}{w}_{s}{d}_{sj} \tag{2}$$
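As a worked instance of Eq. (2), the sketch below computes one participant's PRS from SNP weights and minor allele counts. The weights, SNP IDs and allele counts are made-up values, and the function name `prs` is hypothetical.

```python
def prs(weights, dosages):
    """Eq. (2): sum over SNPs of weight times minor allele count (0, 1 or 2)."""
    return sum(w * dosages[snp] for snp, w in weights.items())

# Toy weights (from Eq. 1) and one participant's minor allele counts.
weights_b = {"rs1": 0.5, "rs2": -0.25, "rs3": 0.125}
dosages_j = {"rs1": 2, "rs2": 1, "rs3": 0}
print(prs(weights_b, dosages_j))  # 0.5*2 - 0.25*1 + 0.125*0 = 0.75
```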
Finally, we generated two combined endophenotype-PRSs (PRSATNV, PRSAT) for each individual. The PRSATNV was expressed as the weighted sum of the individual biomarker-PRSs, as shown in Eq. (3), where \({PRS}_{b}\) is the individual endophenotype-PRS described in Eq. (2). To obtain the weights \({w}_{b}\), we used the training set to regress each of the four endophenotype components on the corresponding \({PRS}_{b}\) while controlling for age, sex and years of education. The final weight \({w}_{b}\) for each \({PRS}_{b}\) was the average coefficient over 100 bootstrap iterations. A similar approach was followed to generate the PRSAT.
$${PRS}_{ATNV}={\sum }_{b\in \{A, T, N ,V\}}{w}_{b}{PRS}_{b} \tag{3}$$
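Eq. (3) can be illustrated in the same way. In this sketch the component PRSs and weights are toy values, the function name `combined_prs` is hypothetical, and reusing the same \({w}_{b}\) for PRSAT is an assumption, since the text only states that a similar approach was followed.

```python
def combined_prs(component_prs, component_weights, components=("A", "T", "N", "V")):
    """Eq. (3): weighted sum of the individual endophenotype-PRSs."""
    return sum(component_weights[b] * component_prs[b] for b in components)

# Toy values for one participant.
prs_b = {"A": 0.5, "T": 0.25, "N": 0.75, "V": 0.5}
w_b = {"A": 0.5, "T": 0.5, "N": 0.25, "V": 0.25}

prs_atnv = combined_prs(prs_b, w_b)                       # all four components
prs_at = combined_prs(prs_b, w_b, components=("A", "T"))  # amyloid and tau only
print(prs_atnv, prs_at)  # 0.6875 0.375
```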
2.6 Best PRS threshold selection
To select the optimal p-threshold and thus the final PRS for the remainder of the analysis, we assessed the prediction performance of the biomarker-PRSs on the validation set for each of the seven p-thresholds. Specifically, we obtained the adjusted variance explained (Adj.R2) by regressing each biomarker component on the corresponding biomarker-PRS while controlling for baseline age, sex and years of education. The p-threshold with the highest average Adj.R2 over 100 bootstrap iterations was selected as the best overall threshold.
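The selection rule can be sketched as follows, using the standard adjusted \(R^{2}\) formula and a per-threshold average. The Adj.R2 values below are invented, the function names are hypothetical, and how values were pooled across the four components to obtain an "overall" threshold is not specified in the text, so the sketch simply averages whatever values are supplied for each threshold.

```python
from statistics import mean

def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

def best_threshold(adj_r2_by_threshold):
    """Return the p-threshold with the highest mean Adj.R2, plus all the means."""
    averages = {t: mean(values) for t, values in adj_r2_by_threshold.items()}
    return max(averages, key=averages.get), averages

# Example: R^2 = 0.20 with 117 validation participants and 4 predictors
# (PRS, age, sex, years of education).
example = adjusted_r2(0.20, n_obs=117, n_predictors=4)

# Invented bootstrap Adj.R2 values for three of the seven thresholds.
toy = {
    5e-05: [0.10, 0.12, 0.11],
    1e-03: [0.15, 0.17, 0.16],
    1e-02: [0.09, 0.08, 0.10],
}
chosen, averages = best_threshold(toy)
print(chosen)  # 0.001
```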
2.7 Odds of AD in relation to PRS
To study the association between the six PRSs and the risk of dementia, we fitted a logistic regression model for each PRS, treating the PRS as the predictor while adjusting for the centered covariates of age, sex, and years of education. In this model, age was defined as either the age at clinical diagnosis of dementia or, for non-demented participants, the age at the last clinical visit. To simplify interpretation, the PRSs, originally ranging from 0 to 1 with values closer to 1 indicating higher risk, were multiplied by 10, so that each odds ratio corresponds to a 0.1-unit increase in the original PRS. Among the 585 participants of the training set, 367 individuals who were either CN, SMC or Dem were used for model training. Having estimated the odds of AD for each PRS, we replicated the results on 712 individuals from the testing set, after excluding MCI patients. As an additional step, to assess the predictive ability of SNPs beyond APOE, we obtained the risk of developing dementia among \(\epsilon 3/\epsilon 3\) participants.
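For concreteness, the model described above can be written as follows. This is one plausible formalization rather than the authors' exact specification; the coefficient symbols and the superscript \(c\) (denoting a centered covariate) are introduced here for illustration.

$$\operatorname{logit}P({Dem}_{j}=1)={\beta }_{0}+{\beta }_{1}\left(10\cdot {PRS}_{j}\right)+{\beta }_{2}\,{age}_{j}^{c}+{\beta }_{3}\,{sex}_{j}+{\beta }_{4}\,{edu}_{j}^{c},\qquad OR={e}^{{\beta }_{1}}$$

Because the PRS enters the model multiplied by 10, \({e}^{{\beta }_{1}}\) is the odds ratio associated with a 0.1-unit increase in the original 0 to 1 PRS.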
2.8 AD hazard and time to AD onset in relation to PRS
Other statistical measures of interest in AD research include the hazard of dementia and the age of dementia onset. To assess the strength of the relationship between these measures and the biomarker-PRSs, we fitted a Cox proportional hazards (PH) model, which was trained using 367 individuals from the training set. As the “event” we considered the onset of dementia (clinical manifestation), and we treated the age at dementia diagnosis as the survival time in the model. The PRS was used as the predictor, while adjusting for years of education and sex. To simplify interpretation, the PRSs were multiplied by 10 and education was centered. The PH assumption was tested using the cox.zph() function in R. To predict the age at dementia onset among the Dem cases, we predicted the survival curves using the Cox model previously fitted on the training data. The actual and the predicted ages at dementia onset were divided into deciles. The relationship between the predicted and actual age of dementia onset was assessed using the Pearson correlation (r). The analysis was replicated on 712 non-MCI individuals from the test set as well as on the \(\epsilon 3/\epsilon 3\) participants.
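The comparison of predicted and actual onset ages can be illustrated with a short sketch. It covers only the decile binning and the Pearson correlation; the Cox model itself is not fitted here, the ages are invented, the function names are hypothetical, and how the deciles entered the correlation is not specified in the text.

```python
from math import sqrt
from statistics import mean, quantiles

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def decile_index(values):
    """Assign each value to a decile (1..10) using the sample's own cut points."""
    cuts = quantiles(values, n=10)  # nine cut points
    return [1 + sum(v > c for c in cuts) for v in values]

# Invented actual and predicted ages at dementia onset for ten cases.
actual = [68.0, 70.5, 72.0, 73.5, 75.0, 76.0, 78.5, 80.0, 82.0, 85.0]
predicted = [70.0, 71.0, 71.5, 74.0, 74.5, 77.0, 77.5, 79.0, 83.0, 84.0]

r = pearson_r(actual, predicted)
deciles_actual = decile_index(actual)
print(round(r, 3), deciles_actual)
```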
2.9 PRS for baseline levels and longitudinal trajectories of responses of interest
In addition to dementia risk prediction, which may be useful at the prevention stage, information about disease progression and key outcomes is also important. Here, we assessed the baseline and longitudinal effects of the PRSs on 14 responses of interest. These included three cognitive measures (ADNI-MEM, ADNI-EF and FAQ), as well as the 11 biomarkers described earlier (Fig. 1). For each of the 14 responses, we fitted a random intercept linear mixed model to account for the correlation between repeated measurements. The data were aligned for all participants, with time 0 representing the first visit at which a measurement was available for each biomarker. All biomarkers were transformed to range between 0 and 1, and MRI biomarkers were pre-adjusted for intracranial volume (ICV). Whenever necessary, the biomarkers were log10 transformed. The model adjusted for sex and for the centered covariates years of education and baseline age. Fixed effects included years since baseline, the PRS, and their interaction. To simplify interpretation, the PRSs were multiplied by 10. The random intercept term allowed for varying intercepts among the participants. Performance was assessed by Nakagawa’s marginal pseudo-R2 on the testing set. The significance of the increase in the pseudo-R2 was assessed by ANOVA, which compared the full model, containing the PRS and its interaction with time, to the base model, which contained the covariates only. The p-values of the main PRS effect (baseline effect) and the interaction effect (longitudinal change) were corrected for multiple comparisons. Specifically, for each endophenotype, a Bonferroni correction was applied to account for testing against six PRSs (Bonferroni-adjusted threshold = 0.05/6 ≈ 8.3e-03).
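The mixed model described above can be written out as follows. This is one plausible formalization under the stated description, not necessarily the authors' exact specification; the symbols \({y}_{jt}\), \({t}_{jt}\), \({\mathbf{z}}_{j}\), \({u}_{j}\) and the coefficient names are introduced here for illustration.

$${y}_{jt}={\beta }_{0}+{\beta }_{1}{t}_{jt}+{\beta }_{2}\left(10\cdot {PRS}_{j}\right)+{\beta }_{3}\left(10\cdot {PRS}_{j}\right){t}_{jt}+{\boldsymbol{\gamma }}^{\top }{\mathbf{z}}_{j}+{u}_{j}+{\varepsilon }_{jt},\qquad {u}_{j}\sim N(0,{\sigma }_{u}^{2}),\quad {\varepsilon }_{jt}\sim N(0,{\sigma }^{2})$$

Here \({y}_{jt}\) is the response of participant \(j\) at visit \(t\), \({t}_{jt}\) is the time in years since baseline, \({\mathbf{z}}_{j}\) collects sex, centered years of education and centered baseline age, and \({u}_{j}\) is the participant-specific random intercept. In this notation, \({\beta }_{2}\) is the baseline PRS effect and \({\beta }_{3}\) is the longitudinal (PRS by time) effect, and the base model used in the ANOVA comparison omits the \({\beta }_{2}\) and \({\beta }_{3}\) terms.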