Gene and variant selection
For variant selection, we annotated PD GWAS summary statistics from 23andMe using ANNOVAR (K. Wang, Li, and Hakonarson 2010), including gene-based and ClinVar (version clinvar_20220320) annotation. We investigated variants with a minor allele frequency (MAF) below 5% to restrict our focus to rare variants while retaining known risk variants such as GBA1 p.E365K. Only biallelic exonic or splicing variants related to “Parkinson’s disease” and/or “Lewy body dementia” were kept. We selected all coding variants from genes previously reported in the literature: ATP13A2, DNAJC13, DNAJC6, EIF4G1, FBXO7, GBA1, GIGYF2, HTRA2, LRP10, LRRK2, PARK7, PINK1, PLA2G6, POLG, PRKN, SNCA, SYNJ1, TMEM230, UCHL1, VPS13C, VPS35 (Blauwendraat, Nalls, and Singleton 2020). We removed “benign” and “likely benign” variants in GBA1 and in the genes identified through the keyword search for “Parkinson’s disease” and “Lewy body dementia”. Synonymous, PINK1-antisense (AS), and UCHL-AS variants were also excluded from the dataset. The process of variant selection based on 23andMe data is summarized in Fig. 1.
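The selection rules above can be sketched as a single filter function. This is a minimal illustration under one possible reading of the criteria, not the pipeline used in the study: the dictionary keys (maf, n_alt_alleles, func, exonic_func, gene, clinvar_sig, clinvar_disease) are hypothetical stand-ins for ANNOVAR output columns, the keyword match on the ClinVar disease name is simplified, and "genes identified in the keyword search" is assumed to mean genes outside the published list.

```python
# Genes listed in the text (Blauwendraat, Nalls, and Singleton 2020).
PD_GENES = {
    "ATP13A2", "DNAJC13", "DNAJC6", "EIF4G1", "FBXO7", "GBA1", "GIGYF2",
    "HTRA2", "LRP10", "LRRK2", "PARK7", "PINK1", "PLA2G6", "POLG", "PRKN",
    "SNCA", "SYNJ1", "TMEM230", "UCHL1", "VPS13C", "VPS35",
}
# Simplified keyword match (assumption): substring search on the disease name.
DISEASE_KEYWORDS = ("parkinson", "lewy body dementia")
BENIGN_LABELS = {"benign", "likely benign", "benign/likely benign"}
MAF_THRESHOLD = 0.05


def _norm(text):
    """Lowercase a label and turn underscores into spaces."""
    return (text or "").replace("_", " ").strip().lower()


def keep_variant(v):
    """Return True if a variant record passes the selection rules (one reading)."""
    if v["maf"] >= MAF_THRESHOLD:            # rare variants only (MAF < 5%)
        return False
    if v["n_alt_alleles"] != 1:              # biallelic sites only
        return False
    regions = _norm(v["func"]).split(";")
    if "exonic" not in regions and "splicing" not in regions:
        return False
    if _norm(v.get("exonic_func")).startswith("synonymous"):
        return False
    gene = v["gene"]
    if "-AS" in gene:                        # antisense transcripts (assumed naming)
        return False

    in_gene_list = gene in PD_GENES
    disease = _norm(v.get("clinvar_disease"))
    from_keyword = any(k in disease for k in DISEASE_KEYWORDS)
    if not (in_gene_list or from_keyword):
        return False

    # Benign / likely benign variants are dropped for GBA1 and for genes
    # that entered only through the keyword search (assumed reading).
    is_benign = _norm(v.get("clinvar_sig")) in BENIGN_LABELS
    if is_benign and (gene == "GBA1" or not in_gene_list):
        return False
    return True
```

Applied to a list of annotated records, `[v for v in records if keep_variant(v)]` would give the retained set under these assumptions.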
23andMe Data
A rare variant association analysis was conducted using 3,090,507 unrelated individuals (3,065,473 controls, 25,034 PD cases). Data were processed as previously described (Heilbron et al. 2021). In brief, related individuals, defined as those sharing more than 700 cM identical-by-descent (~20% of the genome, or approximately first cousins in an outbred population), were removed (Henn et al. 2012). Ancestry composition analysis was performed as previously reported (Durand et al., n.d.) and, to minimize confounding by ancestry, only individuals with predominantly European ancestry were included.
Phased participant data were imputed using Minimac3. Throughout, structural variants and small indels were treated the same as SNPs. Association test results were computed by logistic regression assuming additive allelic effects. Age, sex, the top five genetic principal components (PCs; to account for residual population structure), and indicators for genotyping platform (to account for genotype batch effects) were included as covariates. The reported association test p-value was computed using a likelihood ratio test. Genotyped SNPs were excluded if they: 1) had a genotyping rate < 90%, 2) were only genotyped on the “v1” or “v2” 23andMe genotyping array, 3) were found on the mitochondrial chromosome or the Y chromosome, 4) failed a test for parent-offspring transmission (p < 10⁻²⁰), 5) had an association with genotype date (p < 10⁻⁵⁰ by ANOVA of SNP genotypes against a factor dividing genotyping date into 20 roughly equal-sized buckets), 6) had a large sex effect (ANOVA of SNP genotypes, r² > 0.1), or 7) had probes matching multiple genomic positions in the reference genome. For tests using imputed data, we used the imputed dosages rather than best-guess genotypes. Imputed SNPs were excluded if they: 1) had imputation r² < 0.5 for individuals genotyped on the “v4” and “v5” arrays, or 2) had a significant batch effect between the “v4” and “v5” genotyping arrays (p < 10⁻⁵⁰ by ANOVA of SNP dosage against genotyping array). Both genotyped and imputed SNPs were removed if: 1) the available sample size was less than 20% of the total GWAS sample size, or 2) the logistic regression failed to converge (absolute value of the estimated log odds ratio or standard error > 10).
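The exclusion criteria above can be expressed as a set of threshold checks. The following is only a sketch of those rules, not 23andMe's QC code: the per-SNP summary fields (call_rate, arrays, chrom, p_transmission, p_date, sex_r2, n_probe_matches, imputation_r2, p_batch, n, log_or, se) are hypothetical names, and each function returns the list of criteria a SNP fails so that an empty list means the SNP is retained.

```python
def genotyped_snp_failures(snp):
    """List the genotyped-SNP exclusion criteria that this SNP triggers."""
    failures = []
    if snp["call_rate"] < 0.90:                  # 1) genotyping rate < 90%
        failures.append("call_rate")
    if set(snp["arrays"]) <= {"v1", "v2"}:       # 2) only on v1 or v2 arrays
        failures.append("v1_v2_only")
    if snp["chrom"] in {"MT", "Y"}:              # 3) mitochondrial or Y chromosome
        failures.append("chrom")
    if snp["p_transmission"] < 1e-20:            # 4) parent-offspring transmission
        failures.append("transmission")
    if snp["p_date"] < 1e-50:                    # 5) association with genotype date
        failures.append("genotype_date")
    if snp["sex_r2"] > 0.1:                      # 6) large sex effect
        failures.append("sex_effect")
    if snp["n_probe_matches"] > 1:               # 7) probe maps to multiple positions
        failures.append("multi_mapping")
    return failures


def imputed_snp_failures(snp):
    """List the imputed-SNP exclusion criteria that this SNP triggers."""
    failures = []
    if snp["imputation_r2"] < 0.5:               # 1) low imputation quality
        failures.append("imputation_r2")
    if snp["p_batch"] < 1e-50:                   # 2) v4 versus v5 batch effect
        failures.append("batch_effect")
    return failures


def shared_failures(snp, total_n):
    """Criteria applied to both genotyped and imputed SNPs."""
    failures = []
    if snp["n"] < 0.2 * total_n:                 # 1) < 20% of total sample size
        failures.append("sample_size")
    if abs(snp["log_or"]) > 10 or snp["se"] > 10:  # 2) regression did not converge
        failures.append("non_convergence")
    return failures
```

Returning the failed criteria rather than a single boolean makes it easy to tabulate how many SNPs each filter removes.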
Participants provided informed consent and volunteered to participate in the research online, under a protocol approved by the external AAHRPP-accredited IRB, Ethical & Independent (E&I) Review Services. As of 2022, E&I Review Services is part of Salus IRB (https://www.versiticlinicaltrials.org/salusirb). The full GWAS summary statistics for the 23andMe discovery data set will be made available through 23andMe to qualified researchers under an agreement with 23andMe that protects the privacy of the 23andMe participants. Datasets will be made available at no cost for academic use. Please visit https://research.23andme.com/collaborate/#dataset-access/ for more information and to apply to access the data.
AMP-PD and UK Biobank Data
Association results for the variants selected from the 23andMe dataset were generated from whole-genome sequencing data made available by the Accelerating Medicines Partnership - Parkinson’s disease Initiative (AMP-PD, https://amp-pd.org/) and whole-exome sequencing data made available by the UK Biobank (https://www.ukbiobank.ac.uk/). Processing of these cohorts has been described previously (Makarious et al. 2022). In brief, the AMP-PD dataset consists of unrelated individuals from publicly available cohorts only. After filtering and removing individuals recruited in a genetic study, the dataset comprises a total of 4,007 people (2,197 males, 1,810 females), of whom 2,556 are controls and 1,451 are PD cases. The UK Biobank is a large-scale biomedical database containing genetic and health information from half a million participants (Bycroft et al. 2018). Similar filtering parameters were used for the UK Biobank, resulting in a total of 45,857 people (22,040 males, 23,817 females), of whom 38,051 are controls, 1,105 are PD cases, 6,033 have a parent diagnosed with PD, and 668 have a sibling diagnosed with PD (6,701 proxy cases in total). As previously described, controls were filtered to exclude any individuals with an age at recruitment < 59 years, any reported nervous system disorder (Category 2406), a parent with PD or dementia (field codes: 20107 and 20110), or any reported neurological disorder (field codes: Dementia/42018, Vascular dementia/42022, FTD/42024, ALS/42028, Parkinsonism/42030, PD/42032, PSP/42034, MSA/42036) (Makarious et al. 2022).
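The UK Biobank control filter described above can be sketched as a simple eligibility check. This is an illustrative sketch rather than the published processing code: the participant record layout (age_at_recruitment, nervous_system_disorder, parent_pd_or_dementia, outcome_dates) is hypothetical, and the two boolean flags are assumed to have been derived beforehand from Category 2406 and from fields 20107 and 20110.

```python
# Neurological outcome fields named in the text, keyed by UK Biobank field code.
NEURO_OUTCOME_FIELDS = {
    "42018": "Dementia",
    "42022": "Vascular dementia",
    "42024": "FTD",
    "42028": "ALS",
    "42030": "Parkinsonism",
    "42032": "PD",
    "42034": "PSP",
    "42036": "MSA",
}
MIN_CONTROL_AGE = 59  # individuals recruited below this age are excluded


def is_eligible_control(participant):
    """Return True if the participant passes all control exclusion rules."""
    if participant["age_at_recruitment"] < MIN_CONTROL_AGE:
        return False
    if participant["nervous_system_disorder"]:     # Category 2406 (pre-derived flag)
        return False
    if participant["parent_pd_or_dementia"]:       # fields 20107 and 20110 (pre-derived flag)
        return False
    # Any recorded value in one of the outcome fields counts as a report.
    outcomes = participant["outcome_dates"]
    if any(outcomes.get(code) for code in NEURO_OUTCOME_FIELDS):
        return False
    return True
```

Note that a participant recruited at exactly 59 years is retained, since the stated rule excludes only ages below 59.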
Statistical analyses
We used three datasets: summary statistics from 23andMe and sequencing data from AMP-PD and UK Biobank. All data used genome build GRCh38. PLINK v1.9 (Purcell et al. 2007) was used to extract the variants identified in the 23andMe data from AMP-PD and UK Biobank. To generate the association files for AMP-PD and UK Biobank, we then used RVTests (Zhan et al. 2016) for single-variant association testing. For AMP-PD, sex and principal components (PCs) 1 to 5 were used as covariates; age was excluded since this is an age-matched cohort. For UK Biobank, genetic sex, age at recruitment, Townsend score, and PCs 1 to 5 were used as covariates. We conducted a fixed-effect inverse variance-weighted meta-analysis of the summary statistics using METAL (version 2020-05-05; Willer, Li, and Abecasis 2010). Results were annotated using ANNOVAR with the refGene, avsnp150, and clinvar_20220320 databases (K. Wang, Li, and Hakonarson 2010). Forest plots were generated using the rmeta (version 3.0) and metafor (version 3.8-1) packages in R. Power calculations were conducted using the R (v. 3.6) package genpwr (version 1.0.4), a power and sample size calculator for genetic association studies that allows for misspecification of the model of genetic susceptibility (Moore, Jacobson, and Fingerlin 2019). This package allows the assessment of allele frequencies as low as 1E-9 at OR = 1.5, and as low as 1E-04 at OR = 3. We used an additive model with an alpha value of 0.05. Because we used an additive model, we had less power to detect recessive associations in our analysis.
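For readers unfamiliar with the approach, a fixed-effect inverse variance-weighted meta-analysis combines per-cohort effect estimates by weighting each one by the inverse of its squared standard error. The sketch below implements the textbook estimator in plain Python; it is not METAL's code, the function name is hypothetical, and the input values in the usage lines are invented for illustration and are not results from this study.

```python
import math


def ivw_fixed_effect(betas, standard_errors):
    """Combine per-study log odds ratios with inverse-variance weights.

    Returns the pooled effect, its standard error, the z-score,
    and the two-sided p-value under a standard normal distribution.
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per effect estimate")

    # Each study is weighted by 1 / SE^2, so precise studies count more.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)

    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    z = pooled_beta / pooled_se
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p_value


# Illustrative (made-up) log odds ratios and standard errors for three cohorts.
beta, se, z, p = ivw_fixed_effect([0.40, 0.25, 0.55], [0.05, 0.20, 0.30])
print(f"pooled OR = {math.exp(beta):.3f}, SE = {se:.4f}, p = {p:.3g}")
```

Because the weights depend only on the standard errors, a very large cohort such as the 23andMe dataset would dominate the pooled estimate whenever its standard error is much smaller than those of the other cohorts.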