Analysis of rare Parkinson’s disease variants in millions of people

Objective Although many rare variants have been reportedly associated with Parkinson’s disease (PD), many have not been replicated or have failed to replicate. Here, we conduct a large-scale replication of rare PD variants. Methods We assessed a total of 27,590 PD cases, 6,701 PD proxies, and 3,106,080 controls from three data sets: 23andMe, Inc., UK Biobank, and AMP-PD. Based on well-known PD genes, 834 variants of interest were selected from the ClinVar annotated 23andMe dataset. We performed a meta-analysis using summary statistics of all three studies. Results The meta-analysis resulted in 11 significant variants after Bonferroni correction, including variants in GBA1 and LRRK2. At least 9 previously reported pathogenic or risk variants for PD did not pass Bonferroni correction in this analysis. Conclusions Here, we provide the largest rare variant meta-analysis to date, providing thorough information of variants confirmed, newly identified, or rebutted for their association with PD.


Introduction
Over the past several decades, common and rare variants in multiple genes have been associated with Parkinson's disease (PD). PD is a neurodegenerative disorder primarily affecting dopaminergic neurons in the substantia nigra, and is caused by a combination of aging, environmental factors, and genetics. Genetics account for a large proportion of PD risk (Blauwendraat, Nalls, and Singleton 2020; Bandres-Ciga et al. 2020). Several large-scale case-control genome-wide association studies (GWAS) have been performed and have identi ed 90 risk variants (Nalls et al. 2019), typically with moderate odds ratios (OR, mean = 1.24, range: 1.05-11.35). Many rare variants that have been associated with PD are considered either a strong risk factor (OR > 2) or a monogenic cause of disease. Pathogenic variants in GBA1 and LRRK2 are the most common high-risk genetic factors for PD, typically present in 1-10% of the PD population depending on genetic ancestry. Most other known pathogenic variants are very rare (allele frequency < 0.1%, e.g. damaging variants in SNCA and PRKN).
Typically, these rare variants are also associated with a slightly different PD phenotype, highlighting the importance of their research even more. Overall, more severe motor and non-motor symptoms are seen in GBA1 variant carriers ( (Khan et al. 2003). It is important to note that the status of some reported PD genes is debated, since most associations are reported without a replication cohort or in small, often biased single-case or family studies (Blauwendraat et al. 2018; Blauwendraat, Nalls, and Singleton 2020). Previously, we investigated the SNCA p.H50Q mutation, which was identi ed in a pathologically-proven PD case without family segregation (Proukakis et al. 2013). However, when assessing this variant in a large-scale study including over 10,000 cases and 10,000 controls, no association was identi ed (Blauwendraat et al. 2018). These data do not support a pathogenic role for SNCA p.H50Q despite functional evidence supporting a potential role in disease (Boyer et al. 2019). This potentially also applies to variants in other genes including DNAJC13, EIF4G1, GIGYF2, HTRA2, LRP10, TMEM230 and UCLH1, all of which have been categorized previously as low-con dence PD genes (Blauwendraat, Nalls, and Singleton 2020). A robust assessment of high-risk and causal PD variants will be incredibly valuable for the global PD community from laboratory researchers to genetic counselors.
Here we tested PD mutations from the ClinVar database for association with PD in three large case-control cohorts (23andMe, Inc., UK Biobank, and AMP-PD) totalling over 3 million individuals (27,590 cases, 6,701 proxies, and 3,106,080 controls). The large number of participants in this study is an opportunity to assess rare variants in PD more reliably, which has not previously been possible to this extent. Our goal was to create lists of highcon dence and low-con dence PD variants. This work improves our understanding of the clinical relevance of these variants, which can then be used to improve genetic testing in patients.

Gene and variant selection
For variant selection, we annotated PD GWAS summary statistics from 23andMe using ANNOVAR (K. Wang, Li, and Hakonarson 2010) including genebased and ClinVar (version clinvar_20220320) annotation. We investigated variants with a minor allele frequency (MAF) of less than 5% to restrict our focus to rare variants while retaining known risk variants such as GBA1 p.E365K. Only biallelic exonic or splicing variants related to "Parkinson's disease" and/or "Lewy body dementia" were kept. We selected all coding variants from genes previously published in literature: ATP13A2, DNAJC13, DNAJC6,  EIF4G1, FBXO7, GBA1, GIGYF2, HTRA2, LRP10, LRRK2, PARK7, PINK1, PLA2G6, POLG, PRKN, SNCA, SYNJ1, TMEM230, UCHL1, VPS13C, VPS35 (Blauwendraat, Nalls, and Singleton 2020) and also removed "benign" and "likely benign" variants in the GBA1 gene but also in genes identi ed in the keyword search "Parkinson's disease" and "Lewy body dementia". Synonymous, PINK1-antisense (AS), and UCHL-AS variants were also excluded from the dataset. The process of variant selection based on 23andMe data is summarized in Fig. 1. 23andMe Data   A rare variant association analysis was conducted using 3,090,507 unrelated people (3,065,473 controls, 25,034 PD cases). Data was processed as previously published in (Heilbron et al. 2021). In brief: related individuals were removed, de ned as > 700cM that are identical-by-descent (~ 20% of the genome or approximately rst cousins in an outbred population) (Henn et al. 2012). Ancestry composition was performed as previously reported (Durand et al., n.d.), and to minimize confounding by ancestry, only individuals with predominantly European ancestry were used.
Phased participant data were imputed using Minimac3. Throughout, structural variants and small indels were treated the same as SNPs. Association test results were computed by logistic regression assuming additive allelic effects. Covariates for age, sex, and the top ve genetic principal components (PCs) were included to account for residual population structure, and indicators for genotype platforms to account for genotype batch effects. The association test p-value reported was computed using a likelihood ratio test. Genotyped SNPs were excluded that: 1) had a genotyping rate < 90%, 2) were only genotyped on the "v1" or "v2" 23andMe genotyping array, 3) were found on the mitochondrial chromosome or the Y-chromosome, 4) failed a test for parent-offspring transmission (p < 10 − 20 ), 5) had an association with genotype date (p < 10 -50 by ANOVA of SNP genotypes against a factor dividing genotyping date into 20 roughly equal-sized buckets), 6) had a large sex effect (ANOVA of SNP genotypes, r2 > 0.1), or 7) had probes matching multiple genomic positions in the reference genome. For tests using imputed data, we used the imputed dosages rather than best-guess genotypes. Imputed SNPs were excluded that: 1) had imputation r2 < 0.5 for individuals genotyped on the "v4" and "v5" arrays, or 2) had a signi cant batch effect between the "v4" and "v5" genotyping arrays (p < 10 − 50 by ANOVA of SNP dosage against genotyping array). Both genotyped and imputed SNPs were removed if: 1) available sample size was less than 20% of the total GWAS sample size, or 2) the logistic regression failed to converge (absolute value of the estimated log odds ratio or standard error > 10).
Participants provided informed consent and volunteered to participate in the research online, under a protocol approved by the external AAHRPPaccredited IRB, Ethical & Independent (E&I) Review Services. As of 2022, E&I Review Services is part of Salus IRB (https://www.versiticlinicaltrials.org/salusirb). The full GWAS summary statistics for the 23andMe discovery data set will be made available through 23andMe to quali ed researchers under an agreement with 23andMe that protects the privacy of the 23andMe participants. Datasets will be made available at no cost for academic use. Please visit https://research.23andme.com/collaborate/#dataset-access/ for more information and to apply to access the data.

AMP-PD and UK Biobank Data
Association results for the variants selected from the 23andMe dataset were generated from whole-genome sequencing data made available by the Accelerating Medicines Partnership -Parkinson's disease Initiative (AMP-PD, https://amp-pd.org/) and the whole-exome sequencing data made available by the UK Biobank (https://www.ukbiobank.ac.uk/). Processing of these cohorts have been previously described elsewhere (Makarious et al. 2022). In brief, the AMP-PD datasets consist of unrelated individuals and publicly available cohorts only. After ltering and removing those recruited in a genetic study, the data set comprises a total of 4,007 people (2,197 males, 1,810 females) and 2,556 of those are controls and 1,451 are PD cases. The UK Biobank is a large-scale biomedical database containing genetic and health information from half a million participants (Bycroft et al. 2018). Similar ltering parameters were used for UK Biobank resulting in a total of 45,857 people (22,040 males, 23,817 females), of which 38,051 are controls, 1,105 cases, 6,033 individuals with a parent that is diagnosed with PD, and 668 individuals with a sibling that is diagnosed with PD (6,701 proxy cases). As previously described, Controls were ltered to exclude any individuals with an age of recruitment < 59 years, any reported nervous system disorders

Statistical analyses
We used three different data sets, including summary statistics from 23andMe and sequencing data from AMP-PD and UK Biobank. All data used genome build GRCh38. PLINK (v1.9, (Purcell et al. 2007)) was used to extract the variants identi ed in 23andMe data from AMP-PD and UK Biobank. To generate the association les for AMP-PD and UK Biobank, we then used RVTests (Zhan et al. 2016) for single variant association testing, using sex, and principal components (PC) 1 to 5 as covariates for AMP-PD, excluding age since this is an age-matched cohort. Genetic sex, age at recruitment, Townsend score, and PC 1 to 5 were used as covariates for UK Biobank. We conducted a xed-effect inverse variance-weighted meta-analysis with the summary statistics, using METAL (version 2020-05-05 (Willer, Li, and Abecasis 2010)). Results were annotated using ANNOVAR, refGene, avsnp150, and clinvar_20220320 (K. Wang, Li, and Hakonarson 2010). Forest plots were generated using the rmeta (version 3.0) and metafor (version 3.8-1) packages in R. Power calculations were conducted using the R (v. 3.6) package genpwr (version 1.0.4), a power and sample size calculator for genetic association studies which allows for misspeci cation of the model of genetic susceptibility (Moore, Jacobson, and Fingerlin 2019). This package allows the assessment of allele frequencies as low as 1E-9 at OR = 1.5, and as low as 1E-04 at OR = 3. We used an additive model with an alpha value of 0.05. Since we used an additive model, it is important to note that we had less power to detect recessive associations in our analysis.

Variant selection and dataset details
We attempted to replicate 834 published associations between PD and protein-altering genetic variants (Fig. 1, see Methods). 609 (73%) of the ClinVar annotated variants had either no reported clinical signi cance, or had con icting or uncertain interpretations. Of the remaining 225 variants, 133 variants (15.9%) were classi ed as pathogenic and/or likely pathogenic, followed by 84 variants (10.1%) classi ed as benign or likely benign variants. 6 variants (0.7%) were reported to be risk factors for PD, and 2 (0.2%) variants were reported to be either pathogenic/risk factors (Supplementary Table 1A). The 834 variants were located in 32 genes including GBA1 (n = 103), VPS13C (n = 101), and LRRK2 (n = 97) (Supplementary Table 2).
The list of 834 variants includes variants that did not pass 23andMe-internal QC and therefore had no summary statistics, but were still useful for variant selection in other studies. 282 (33.8%) variants were available in the AMP-PD dataset, 608 (72.9%) in UK Biobank, and 656 (78.7%) variants in 23andMe, resulting in a total of 679 unique variants. We focused on signi cant variants passing Bonferroni correction in each analysis, then generated a list of signi cant hits, calculating OR with a 95% con dence interval.

Analysis
We conducted a meta-analysis using summary statistics generated by the single-variant association testing data from 23andMe, UK Biobank, and AMP-PD (Table 1). 23andMe data comprised 25,034 PD cases and 3,065,473 controls. Cases were 59.9% male with a mean age of 72.3 (± 10.9) years, controls were 43.5% male with a mean age of 50.1 (± 17.5) years. The UK Biobank dataset comprised 1,105 cases and 6,701 proxies. Cases and proxies were 45.6% male with an average age of 59.1 (± 7.1) years, 38,051 were controls (48.6% male) with a mean age of 64.1 (± 2.8) years. AMP-PD had 1,451 cases (63.7% male) with a mean age of 61.3 (± 10.2) years, and 2,556 controls (49.8% male) with a mean age of 70.7 (± 13.2) years. A summary of cohort demographics can be found in Table 1. 679 unique variants were analyzed in the meta-analysis, using 656 variants from 23andMe, 408 from UK Biobank, and 256 from AMP-PD. 249 (36.7%) variants were solely contributed by 23andMe, whereas 18 (2.7%) variants were only found in UK Biobank, and 1 (0.1%) variant was only found in AMP-PD ( Supplementary Fig. 1). 68 variants passed the signi cance threshold of p < 0.05, and 11 variants passed Bonferroni correction of p < 7.4E-05, based on 0.05/679 (forest plots are provided in Supplementary Fig. 3 Based on the ClinVar database, GBA1 p.D179H and GBA1 p.S146L were not previously reported to be associated with PD and it is worth noting that (as is expected for very rare variants with MAF < 1%) especially GBA1 p.S146L had wide con dence intervals. GBA1 p.S146L was identi ed in 0 cases, 3 proxy cases (individuals with a parent with PD), and 0 controls in UK Biobank; and did not pass 23andMe QC (Fig. 2, Table 2). Interestingly, LRRK2 p.L1795F has been identi ed previously in a family with PD (Nichols et al. 2007), however no segregation was shown and further reports are lacking in the literature. This variant was identi ed in 2 cases and 0 controls in AMP-PD, shows a signi cant association in the 23andMe cohort (OR: 2.3, 95% CI: 1.57-3.61, P = 0.0003), but was not present in UKB.

Power calculations of variants of interest
In addition to con rming previously identi ed variants and nding new associations, another aim of this work was to assess if we could successfully replicate PD-associated variants found in the literature. However, when working with such low allele frequencies, one of the main challenges is su cient statistical power. This requires very large sample sizes to capture su cient numbers of rare variant carriers for robust statistical estimation.
We conducted power calculations for all variants in the meta-analysis, accounting for the tool's MAF limit of > 1E-09 (n = 669, Supplementary table 8) and had a special focus on variants with a p-value > 0.05 (n = 601) to assess whether we had > 80% statistical power to detect an association assuming OR = 1.5, alpha = 0.05, and an additive model. The parameters set for this calculation are generous, since an OR of 1.5 is insu cient to cause monogenic disease in affected individuals and we are not correcting for multiple testing. Therefore, under these parameters, any association that had > 80% power but was not statistically signi cant in our analysis is very likely to be a spurious association.

Discussion
Here, we assessed the role of rare variants and their association with PD using several large case-control datasets of European ancestry. We conducted a single rare variant association analysis of PD vs controls, including 23andMe, UK Biobank and AMP-PD; totalling over 3 million individuals comprising 27,590 cases, 6,701 PD proxy cases, and 3,106,080 controls. We provide very robust evidence of 11 high risk and causal variants for PD disease development. Additionally, we provide evidence that a large number of variants that have previously been associated with PD are likely not associated with PD and should be treated with caution.
Our ndings clearly con rm the prominent role of variants in LRRK2 and GBA1 on increased PD risk, in particular: LRRK2 p.R1441H, p.L1795F and p.G2019S; and GBA1 p.S146L, p.D179H, p.R296Q, p.E365K, p.T408M, p.N409S and p.R502C. All of these variants have odds ratios close to or over the threshold of 1.5, which is considered clinically more relevant (Szumilas 2010;Norton, Dowd, and Maciejewski 2018). Additionally, here we update the risk estimates for these variants with a more reliable con dence interval than previously reported. LRRK2 p.L1795F is a lesser-known and studied variant compared to p.R1441H and p.G2019S, which are both well-known damaging variants. LRRK2 p.L1795F is located in the C-terminal of ROC B region and has been reported in a family with PD, however segregation was not shown (Ghani et al. 2015). Here, we provide evidence for LRRK2 p.L1795F as a genetic risk factor for PD with an estimated OR of 2.5. Interestingly, this variant was recently shown to have a functional effect providing more evidence for its pathogenicity (Kalogeropulou et al. 2022). GBA1, p.E365K, p.T408M and p.N409S all have been previously identi ed via GWAS or similar approaches (Nalls et al. 2019), with p.E365K and p.T408M being risk factors for PD, and p.N409S being a risk factor for PD with and additional association with Gaucher disease. The other GBA1 variants (p.S146L, p.D179H, p.R296Q and p.R502C) are all associated with Gaucher disease in a biallelic state and were robustly associated with PD in this study. In the case of GBA1 p.D179H, the association with PD was statistically evident with higher ORs and narrow 95% CIs: 23andMe reporting OR = 3.4 (95% CI: 2.2-5.2) and OR = 3.3 (95% CI:2.1-5.0) in the meta-analysis. However it is worth noting that this variant is often on the same haplotype as p.E365K (Pawliczek et al. 2018;den Heijer et al. 2020) and indeed when exploring AMP-PD data we also identi ed that all instances of the p.D179H allele are in cis (D'=1) with p.E365K. Many variants, common or rare, are reported to be associated with PD, in fact the ClinVar database shows 4,320 results for the condition when searching for "Parkinson" as of January 1, 2023. To generate a list of variants that were not replicated previously in large case-control analyses, we assessed whether we had > 80% power at OR = 1.5 and 3, using a p-value threshold of p > 0.05. Most well-powered variants with meta-analysis p-value > 0.05 were of uncertain, unknown or unclear clinical signi cance, but quite a few were also previously associated with Parkinson's (Supplementary Table 3). For example, 6 variants that were previously reported to be "pathogenic" or "likely pathogenic" did not show evidence of association with PD and therefore these variants should be treated with caution. Those variants were: POLG p.467T (rs113994095), p.G737R (rs121918054), and p.G848S (rs113994098); GLUD2 p.S498A (rs9697983), PINK1 p.T313M (rs74315359), and SNCB p.P123H (rs104893937). Interestingly, some of these variants always had very weak evidence for an association in previous publications but were still considered PD variants in the ClinVar database. The PINK1 p.T313M variant was only found to be associated with PD in a single Saudi Arabian family (Chishti et al. 2006;Plaitakis et al. 2010). However, due to our study design, results regarding autosomal recessive genes should be interpreted carefully.
We acknowledge that our analysis comes with some limitations. First, very large sample sizes are required to study rare variants. Investigating variants with low frequencies often results in unreliable high odds ratios with wide con dence intervals, suggesting that the population parameter is not con dently predicted. For example the GBA1 p.S146L variant (rs758447515, OR = 1624.7, 95% CI = 58-45172), whose generated statistics suggest that it plays a role in PD risk but will need to be further studied to determine more accurate risk estimates. Another example is SNCA p.A53T which is known to be causal for PD (Polymeropoulos et al. 1997), but given the extreme low frequency of this allele it was not found in AMP-PD or UK Biobank data, and had a p-value of 0.03 with a very wide risk estimate OR = 38.0, 95% CI = 2.82-510.12 in the 23andMe dataset. This clearly shows association but not enough carriers to pass multiple test correction. Additionally, some pathogenic variants passing signi cance at p < 0.05 but failing the multiple test correction were identi ed. GBA1 p.V433LF (rs80356769, meta-analysis: P = 1.6E-04, OR = 4.2, 95% CI = 1.98-8.8) was present in ClinVar as a pathogenic PD variant, but also the previously linked LRRK2 variant p.I2020T (rs35870237, P = 0.0004, OR = 21.4, 95% CI = 3.97-115.5) is considered pathogenic based on biological evidence (Ray et al. 2014). LRRK2 p.I2020T is associated with autosomal dominant PD (Paisán- Ruíz et al. 2004;Zimprich et al. 2004). These variants showcase the complexities of rare variant analysis and although we used the largest sample size for PD yet, it still shows the limited power we have to identify robust risk estimates for very rare variants.
Second, an important aspect of rare and causal variants is the allele-dosage effect, since recessive genes are only causal in a homozygous or compound heterozygous state, which is especially important for PD since several genes are known to only be causal in a recessive state. Most of our statistical models used here are based on additive effects and since only summary level data was available from 23andMe, we cannot accurately report results on autosomal recessive genes such as PRKN and PINK1. To highlight this further the PRKN p.R275W (rs34424986) variant is known to increase the risk for PD in a homozygous or compound heterozygous state (Lubbe et al. 2021). In our analysis this particular variant also made the list of robust associations with an overall OR of 1.3 in the meta-analysis. A preliminary analysis by 23andMe using a recessive model for PRKN p.R275W, showed that in a heterozygous state OR is at 1.4, whereas in a homozygous alternative state the OR is 6. This variant is under heavy debate for its association with PD risk in a heterozygous state and is likely only disease causal if another damaging PRKN is on the other allele (Zhu et al. 2022).
Third, our analysis started with selection of variants based on array genotype data of 23andMe and therefore we missed variants that cannot be reliably imputed or are hard to genotype (such as GBA1 p.L444P). This is a limitation that will be resolved with the availability of more sequencing data in the coming years.
Finally, the lack of diversity is a critical challenge in genetic research and limits our insights and understanding of the disease, for example current sample sizes in 23andMe for African and Asian ancestry is under 400 cases (Kim et al. 2022). However, initiatives such as the Global Parkinson's Genetics Program (GP2) are making an active effort to increase the number of non-European underrepresented populations in genetic datasets (Global Parkinson's Genetics Program 2021), so that in the near future, based on genetics, we can hopefully create a more representative picture of the disease.
In summary, we provide a robust list of 11 variants associated with PD and more reliable risk estimates. Additionally we provide a list of variants that were previously considered pathogenic and risk variants for PD which were not replicated in our analysis and should be treated with caution moving forward. Assessing rare variants is crucial to improve our understanding of disease development and heritability, but several complications arise when working with low frequency levels, such as lack of power. Larger data sets and more suitable tools are a critical requirement to further our understanding of rare variants, and we hope this work can be a useful tool for the wider PD community.   Figure 1 Analysis owchart Data preparation: Variant selection was based on an annotated 23andMe rare variant association le, including variants that passed or failed internal quality control steps, resulting in some variants not having summary statistics available but could still be used for variant selection in other data sets. Data were ltered to remove synonymous variants and to keep biallelic exonic, and splicing variants. Data analysis: 744 variants of interest were used to calculate statistical power for each variant at odds ratio 1.5 and 3, and alpha 0.05. 656 variants had summary statistics available in 23andMe data and were analyzed individually but also as part of a meta-analysis, using 23andMe variants found in UK Biobank and AMP-PD data. Results: Results were assessed for different signi cance thresholds, main results focused on variants passing Bonferroni correction but also variants that reached su cient statistical power but were not associated with PD in this analysis. Figure created with BioRender.com.

Figure 2
Variants passing Bonferroni correction Forest plot showing 11 variants passing Bonferroni correction in the meta-analysis. Data is based on OR (dot) and 95% CI (error bar) and for visualization purposes the x-axis is on a log scale. Colors indicate clinical signi cance of the variant: Con icting/uncertain/unknown (red), pathogenic (green), and pathogenic/risk factor (blue) based on ClinVar annotations. Note: GBA1 p.S146L (con icting/uncertain/unknown clinical signi cance) also remained signi cant after multiple test correction, however with OR=1,624, 95% CI: 58-45,172, the statistics were too large to visualize in this plot.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.