Proteogenomic analysis of human cerebrospinal fluid identifies neurologically relevant regulation and informs causal proteins for Alzheimer’s disease

The integration of quantitative trait loci (QTL) with disease genome-wide association studies (GWAS) has proven successful at prioritizing candidate genes at disease-associated loci. QTL mapping has mainly been focused on multi-tissue expression QTL or plasma protein QTL (pQTL). Here we generated the largest-to-date cerebrospinal fluid (CSF) pQTL atlas by analyzing 7,028 proteins in 3,107 samples. We identified 3,373 independent study-wide associations for 1,961 proteins, including 2,448 novel pQTLs of which 1,585 are unique to CSF, demonstrating unique genetic regulation of the CSF proteome. In addition to the established chr6p22.2-21.32 HLA region, we identified pleiotropic regions on chr3q28 near OSTN and chr19q13.32 near APOE that were enriched for neuron-specificity and neurological development. We also integrated this pQTL atlas with the latest Alzheimer’s disease (AD) GWAS through PWAS, colocalization and Mendelian Randomization and identified 42 putative causal proteins for AD, 15 of which have drugs available. Finally, we developed a proteomics-based risk score for AD that outperforms genetics-based polygenic risk scores. These findings will be instrumental to further understand the biology and identify causal and druggable proteins for brain and neurological traits.

In the meta-analysis, we identi ed 2,316 index pQTL associations for 1,961 protein aptamers (1,786 unique proteins; Fig. 2B, Supp. Table S5). Of those, 1,247 (53.8%) were cis (+/-1MB from TSS) and 1,069 (46.2%) were trans-pQTL. Correlation of effect sizes for the study-wide signi cant variants was extremely high (R = 0.98, P < 2.2×10 − 16 , Supp. Figure 5, Supp. Table S5), con rming that both cohorts are contributing to the associations observed and highlighting the robustness of these pQTL associations. We further replicated our ndings using a third independent cohort with SOMAscan proteomics measurements and genomic data (N = 183 from the Stanford Iqbal Farrukh and Asad Jamal Alzheimer's Disease Research Center). We observed a strong index pQTL variant effect size correlation between the meta-analysis and external replication (1,676 variants tested) when comparing the SOMAscan results (N rep = 183, R 2 = 0.92, P < 2.2×10 − 16 , Supp. Figure 6, Supp. Table S5). For pathway enrichment analysis results for all proteins with a pQTL, see Supp. Figure 9 and Supp. Table S6.
Because a large portion of our dataset (2,031, 65.4%, Table 1) is derived from individuals with Alzheimer's disease, we sought to determine if our associations were consistent across healthy and affected individuals. We classi ed individuals into AD-relevant biomarker groups using the amyloid/tau/neurodegeneration (A/T/N) framework 25 Table S5), meta-analysis vs A + T + (R = 0.98, P < 2.2×10 − 16 , Supp. Figure 7A), and meta-analysis vs A − T − (R = 0.99, P < 2.2×10 − 16 , Supp. Figure 7B), indicating that the pQTL identi ed in our meta-analyses are consistent across disease states. However, we did observe some associations that were biased toward either diseased or healthy individuals (see Extended Results).
GWAS loci can often have independent signi cant variants in the same locus 26 . To identify independent signals in the same locus, we performed conditional analyses using GCTA-COJO 27 (Supp. Figure 5, Supp. Table S7). In total, we identi ed 3,373 conditionally independent associations (see Extended Results, Supp. Table S7). Most aptamers (1,137,58.0%) had a single independent association, but 824 aptamers had at least two, 337 had at least three, 144 had at least four, and one protein (GSTM1) had 11 independent associations at its cis locus. This suggests that many proteins have complex mechanisms regulating their CSF levels. These analyses were performed at a study-wide level, so it is possible that we excluded many additional independent associations below our study-wide signi cance threshold. We also used VEP to annotate each conditionally independent association.
We identi ed all variants in linkage disequilibrium (LD, R 2 > 0.8) with each conditionally independent pQTL variant and performed variant annotation on the associations using the Ensembl Variant Effect Predictor (VEP) 28 , determining the most severe consequence corresponding to each variant set (R 2 > 0.8, Supp. Table  S8&S9). The largest proportion of the variants were intronic (N = 1,156, 34.3%), followed by upstream/downstream of nearby genes (691, 20.5%), missense (799, 23.7%), or intergenic (324, 9.6%, Supp. Figure 8, Supp. Table S7). In order to determine if the pQTL were enriched in any speci c annotation category, we compared our proportions to those from VEP annotation of 3,373 random variants and 1,000 permutations. We observed a signi cant enrichment in multiple categories of protein-altering variants, including missense, splice donor, and protein-truncating variants (see Extended Results, Supp. Figure 8D, Supp. Table S10), but not intronic, intergenic, or non-coding. While intronic variants were the most prevalent pQTL annotation overall, the percentage annotated as intronic (34.3%) was much lower than the randomly selected variants (51.7% intronic on average). The independent variants were also enriched for protein-altering variants overall (P < 0.001), indicating that pQTL are enriched for coding variants in comparison with other variant types.
We and others have reported limited overlap of pQTL associations across tissues and with other molecular QTL 12,16 . We analyzed the overlap of the CSF pQTL with plasma and brain using two previously-published studies: rst, summary statistics from a plasma pQTL analysis of over 35,000 individuals and approximately 5,000 proteins (SOMAscan5k; Supp. Table S11) 11 . In total, 4,735 aptamers overlapped between our meta-analysis and the large plasma pQTL study, covering 1,682 (72.6%, out of 2,316) of our CSF pQTL associations. Of the 1,682 possible shared signals with the plasma study 11 (N = 35,559), 1,131 (67.2%, PP.H4 < 0.8) did not colocalize, representing novel and CSF-speci c pQTLs (Fig. 2F). We identi ed a signi cantly higher ratio of CSF-speci c trans associations compared to cis (548/734 trans vs 582/948 cis, P = 4.3×10 − 9 ), consistent with previous work 16 . We also identi ed 634 additional novel index pQTL (27.4% of the 2,316 total CSF associations, 299 cis and 335 trans) that correspond to proteins that were not analyzed in the plasma study. A total of 1,765 CSF pQTL were not observed in the large plasma study.
Second, we compared to our previous brain (N = 380) and plasma (N = 529) pQTL (SOMAscan1.3k platform; Supp. Tables S12 & S13) 16 . For the brain dataset, 1,002 aptamers overlapped, corresponding to 465 associations. For the in-house plasma dataset, 862 aptamers overlapped, corresponding to 389 associations. We performed Bayesian colocalization for each pQTL association that had the corresponding aptamer present in one of the three datasets. We observed shared associations (posterior probability of one functional variant, PP.H4 ≥ 0.8) for just 32.8% (551/1,682) of the CSF pQTL with the larger plasma dataset 11 , 16.5% (64/389) with our previous plasma study, and 7.1% (33/465) with our previous brain study (Figs. 2F&G). The low overlap with brain may be partially driven by lower power to detect associations in the brain study (N = 380). In total we identi ed 1,720 novel index pQTL (74.3% of all CSF pQTL, 848 cis and 872 trans) that were either only signi cant in CSF (CSF-speci c, 1,153) or were associated with proteins only measured in our analysis (true novel, 567). When analyzing all conditionally independent associations 2,448 were novel to CSF, of which 1,652 were considered CSF-speci c and 796 were truly novel.
Consistent with previous research 16 , we observed a high degree of both tissue and molecule speci city for our CSF pQTL, with over 50% of the associations (cis and trans) not overlapping with any other dataset or tissue. When analyzing cis associations speci cally, we observed the highest overlap with plasma pQTLs (366/1,247, 29.4% shared, Fig. 2G, Supp. Table S23), suggesting that regulation of proteins across tissues is more likely to be shared than that of eQTL vs pQTL even in the same tissues. This supports the notion that proteins have additional regulatory mechanisms beyond gene expression. When comparing our results to eQTL, we observed the highest overlap with the cortex and cerebellum (364 and 279 shared associations, 29.2% and 22.4%, Fig. 2G, Supp. Table S23), indicating CSF is a better proxy to brain than blood. These results support the fact that proteins are regulated at different levels compared to RNA that could include post-translational mechanisms (phosphorylation, glycosylation, cleavage, binding to receptors among others). These processes can contribute to localization of proteins, level of excretion from cells, and changes in protein conformation, all factors that can affect levels in the CSF.

Pleiotropic regions regulate neurologically relevant pathways
Pleiotropic regions of the genome may be vital to regulation of multiple factors of biologically relevant pathways, but are often removed or not investigated in pQTL studies. To identify genomic regions that regulated multiple proteins, we grouped the index pQTL variants from each pQTL association through linkage disequilibrium, using a threshold of R 2 ≥ 0.1. We identi ed 194 regions that regulated levels of at least two proteins (271 regulating at least two aptamers, Fig. 2B, top; Supp. Figure 5; Supp. Table S24). Of these, 27 regions were associated with at least ve proteins (37 with at least ve aptamers), seven regions regulated at least ten, and three regions regulated more than 50.
Here, we prioritized the three most pleiotropic regions (chr3q28, chr6p22.2-21.32, and chr19q13.32; other regions are discussed in the extended results). Due to the complexity of the HLA region on chr6p22.2-21.32, we grouped all variants located in this locus together regardless of linkage disequilibrium information (see Methods). For these three regions, we rst determined the genomic localization of the genes encoding each regulated protein. We then performed pathway and cell-type enrichment analysis to identify the cellular context of the regulated proteins. Finally, we performed a pheWAS using the GWAS Catalog 30 to determine other traits regulated by the index pQTL variants in each region.
The chr3q28 pleiotropic region is intergenic (located between GMNC and OSTN) and consists of eleven index variants corresponding to one LD block (R 2 > 0.5, Supp. Figure 20F; two blocks when using R 2 = 0.8, Supp. Figure 10G) associated with 130 unique proteins (136 aptamers; Fig. 3A, Supp. Table S24), all of them trans. No pQTLs was observed in this region in plasma 11 , suggesting this is a CSF-speci c pQTL hotspot. Proteins regulated by this region included ve members of the syntaxin family (STX2, 3, 7, 8, and 12) involved in synapse function 31 and ve ephrin family members (EPHA7, EFNB1-3, EPHB2) involved in neural development and memory 32,33 . For each of the proteins regulated in the region, we analyzed brainrelevant cell-type expression data 34 (Supp. . A PheWAS of the index variants in this region identi ed twelve traits associated with index pQTL variants, eleven of which were measures of brain morphology, volume, or surface area (Fig. 3D, Supp. Table S28). These include highly signi cant associations for vertex-wise sulcal depth 35 (P = 1×10 − 128 ), white matter mean diffusivity 36 (P = 1×10 − 19 ), and cortical surface area 37 (P = 1×10 − 17 ). Importantly, we also observed an association between rs9877502 and CSF levels of phosphorylated tau (P = 1×10 − 36 ), a pathological hallmark for AD 38 and a key biomarker. Additionally, OSTN, anking the 3' end of this region, is highly expressed in human neurons 39 and regulates dendritic growth in the human brain 40 , suggesting regulation of OSTN may be driving the associations in the region. The highly neuronal-speci c nature of the associated proteins, the link between OSTN and dendritic growth, and the numerous associations with brain morphology in this region highlight its importance for neurological development.  Table S24). The same region in plasma contained 1,757 associations 11 , signi cantly more than CSF (9.72% of all plasma associations vs 6.74% of all CSF associations; P = 4.35×10 − 6 ). In CSF, we identi ed 32 cis associations for 28 aptamers and 23 unique proteins and 124 trans associations for 60 aptamers and 56 unique proteins. These included multiple complement components such as complement C2 (top index pQTL SNP P = 8.32×10 − 32 ), complement factor B (top index pQTL SNP P = 1.86×10 − 31 ), and C4a anaphylatoxin (top index pQTL SNP P = 1.86×10 − 21 ). The proteins with pQTL in this region were enriched in microglia and macrophages (P = 0.03, FC = 1.574, Fig. 3F, Supp. Figure 11, Supp. Table S26). Pathway analysis identi ed 64 total enriched pathways among proteins regulated by variants in this region (Fig. 3G, Supp. Figure 12, Supp. Table S29). These were highly immune-speci c, including regulation of the immune response (GO: 0050776, 23/64 analyzed proteins, P = 2.98×10 − 10 ), leukocyte mediated immunity (GO: 0002443, 14/64, P = 1.38×10 − 7 ), and antigen processing and presentation (KEGG: hsa04612, 6/39, P = 7.03×10 − 6 ). Through a pheWAS of the index pQTL variants in this region, we identi ed 160 associations with traits and diseases (Fig. 3H, Supp. Table S30). These included autoimmune disorders like asthma 42 (P = 8×10 − 14 ) and myositis 43 (P = 3×10 − 49 ) and neurodegenerative disorders including Parkinson's disease 44 (P = 4×10 − 17 ) and AD 45 (P = 3×10 − 14 ). The shared associations in this region with both immunological and neurodegenerative traits adds evidence to the known relationship between these two processes 46 .
In CSF, the region associated with the largest number of proteins corresponds to eleven index pQTL variants in three LD blocks (R 2 > 0.5, Supp. Figure 13D; seven LD blocks with R 2 = 0.8; Supp. Figure 13E) on chr19q13.32 in the APOE gene region. Variants in this region are associated with the levels of 331 proteins (337 aptamers; Fig. 3I, Supp. Table S24), of which three (two isoforms of APOE and APOC2) are in cis.
This locus is known to regulate the levels of many proteins in plasma and CSF 11,12,16 . We observed signi cantly more associations located in this region in CSF than in plasma 11 (14.6% of all CSF pQTL vs 0.67% of all plasma pQTL, P < 2.2×10 − 16 ). APOE is the strongest genetic risk factor for AD 47 and has been associated with at least eighteen other diseases, including heart disease and high cholesterol 48 . We observe associations in this region for known AD biomarkers, including four members of the 14-3-3 protein family (YWHAB, YWHAE, YWHAG, and YWHAZ), and neuro lament light and heavy chains (NEFL and NEFH) 49,50 . Calcineurin (PPP3R1), associated with both phosphorylated tau levels and rate of decline in AD 51 , is also genetically regulated by this region. Interestingly, although we observed an enrichment in neurons of proteins regulated by variants in this region (P = 7.93×10 − 6 , FC = 1.53, Fig. 3J, Supp. Figure 11, Supp. Table S26), APOE itself was speci c to astrocytes (72.4% of expression in astrocytes, Supp. Table   S25). This suggests a potential cell-to-cell communication between astrocytes and neurons. Pathway analysis results support this hypothesis (Fig. 3K, Supp. Figure 13), as we observe enrichment for seventeen pathways including glutamatergic synapse (KEGG: hsa04724, P = 5.1×10 − 4 ), assembly and cell surface presentation of NMDA receptors (Reactome: R-HSA-9609736, P = 1.66×10 − 4 ), and axon development (GO: 0061564, P = 6.58×10 − 5 ). We also observe pathways enriched for apoptotic function, suggesting a potential role of APOE 52-54 in cell death. Alternatively, this nding could be tagging neuronal death and neuronal protein release, which would be expected driven to individuals with neurological disease, as and be associated with APOE4. In order to determine if this is the case, we analyzed these variants in individuals that are cognitively normal and biomarker negative (AT − ) and compared them to those who are biomarker positive (AT + ). We found that for these 337 aptamer-SNPs pairs, 299 showed no signi cant difference in effect size between the AT − and AT + analyses, suggesting that this nding is not just an artifact of disease status. APOE variants regulate the levels of these proteins in both healthy and diseases individuals. Pathways related with the apoptotic function include activation of BH3-only proteins (Reactome: R-HSA-114452, P = 5.67×10 − 5 ) and apoptosis itself (Reactome: R-HSA-109581, P = 2.4×10 − 4 , Supp. Table S31). Our PheWAS of the three independent signals in this locus identi ed 1,232 associated traits, highlighting the highly pleiotropic nature of this region and its involvement in several diseases and risk factors. These traits were largely involved in aging, cardiovascular, metabolic, and neurological traits, including lifespan, coronary artery disease, familial combined hyperlipidemia, and AD (Fig. 3L, Supp. Table  S32).
Overall, we observed three highly pleiotropic regions that regulated the levels of many CSF proteins located on chromosomes 3q28, 6p22.2-21.32, and 19q13.32. The chromosome 6 and 19 regions, centered around the HLA and APOE loci, respectively, were also observed in plasma, although our analyses indicate that the APOE region show a higher degree of pleiotropy in CSF than plasma 11 (P < 2.2×10 − 16 ) and the opposite for chr 6 (P = 4.35×10 − 6 ). In CSF both the chr3q28 and 19q13.32 regions were enriched for neuronal proteins and traits and were more pleiotropic in CSF than plasma, suggesting even pQTL hotspots show differences across tissues. While these regions are often overlooked in QTL analyses, we highlight that they represent master regulatory regions that are vital to the understanding of disease and important biological processes, and that to fully understand protein-protein interactions and proteins that are part of the same pathway it is instrumental to further explore these pleiotropic regions.
Novel proteins associated with Alzheimer's disease Approaches integrating QTL with disease GWAS have been successful at resolving GWAS loci to prioritize causal genes. We sought to build upon this in the context of AD by utilizing three independent and complementary approaches: estimation and correlation of protein levels in AD using shared genetic etiology through proteome-wide association study (PWAS) 55,56 , prioritization of putative causal proteins for AD through Mendelian randomization (MR) 57 , and identi cation of shared genetic variants associated with both protein level and AD through colocalization (COLOC) 58,59 .
We performed a PWAS using the FUSION framework 55 . We rst calculated variant weights for each protein and for each association separately for those aptamers with at least one study-wide pQTL. We then obtained the genetically estimated protein levels based on these weights and correlated them with disease status using AD GWAS summary statistics (Supp. Figure 14). Pathway enrichment analysis results and directionality for all PWAS-signi cant aptamers are available in Supp. Figures 14&15 and Supp. Table   S33&S34. Because all proteins with associations in the APOE region were associated with AD risk (Supp. Figure 16), we removed APOE-associated proteins and recalculated pathway enrichment (Supp. Figure 17, Supp. Table S35). Finally, all proteins associated with AD through a cis locus are detailed in Supp. Figure 18 and through a trans locus (besides the APOE and HLA regions) in Supp. Figure 19. After excluding pleiotropic regions (those associated with ve or more aptamers), 115 associations for 107 aptamers and 97 proteins were signi cant in the PWAS analyses, including 81 that were driven by cis-pQTL and 34 by trans (Supp . Table S33).
We next performed MR 57 to identify proteins causal for AD, using the pQTL summary statistics and the Bellenguez et al. AD GWAS summary statistics 22 . For each aptamer, we selected signi cant cis and trans instrument variables through LD pruning on all variants with P < 5×10 − 8 , after removing pleiotropic regions. Our analyses identi ed 37 proteins (40 aptamers) as putative causal for AD (Fig. 4A, Supp. Table   S36&S37) based on MR analyses (FDR-corrected P < 0.05) with con rmed directionality based on the Steiger test 60 (Supp . Table S38). In addition, 33 of these 40 MR-signi cant aptamers were also FDRsigni cant in the PWAS analysis (Fig. 4A). Cis-speci c MR for those with applicable variants con rmed all but one association (APOBEC2, Supp. Figure 21).
Finally, we performed COLOC 58,59 between each of our 2,316 signi cant pQTL associations (cis and trans) and the AD GWAS. After exclusion of pleiotropic regions, we identi ed 28 proteins (33 aptamers) that colocalized with AD risk (Fig. 4A Table S41). Through celltype enrichment analysis, we found that these 42 proteins were overrepresented in microglia/macrophages (N = 10, P = 0.008, FC = 2.01) and mature astrocytes (N = 8, P = 0.03, FC = 1.76, Fig. 4E, Supp. Figure 11, Table S26). We also performed pathway and gene set enrichment analyses and observed overrepresentation of the proteins in pathways relating to neurodegenerative traits, including late onset AD These results offer more evidence for established candidate genes like TREM2 and CR1, while also rea rming the genes prioritized at newlyidenti ed loci (SHARPIN, EGFR, CTSH, among others).
We identi ed alternative candidate proteins at four AD risk loci. First, the association at 1q32.2 has been associated with CR1 for years. While we observed an association between CR1 and AD through PWAS and COLOC, we also observed the same positive association with CR2 protein (index pQTL variant rs679515) in both PWAS and MR (PWAS Z = 11.85, PWAS P = 1.94×10 − 32 , MR P = 3.39×10 − 13 , COLOC PP.H4 = 0.996, Supp. Figure 22). Most studies have nominated CR1 as the functional AD gene in this region 61-63 , but our analyses also nominate CR2 as a potential additional factor affecting this locus. The similarities in structure and function between CR1 and CR2 64 make it plausible that both may play a role in neuroin ammation known to be implicated in AD pathogenesis 65 . At the 7q22.1 AD locus, the proposed genes were ZCWPW1 and NYAP1. However, the causal gene has been unclear, with PILRA and PILRB also  Figure 22), highlighting the complexity involved in GWAS locus resolution. This protein was also identi ed as being associated with AD in plasma 12 , albeit with an opposite direction of effect. This directionality difference between tissues has been reported for other proteins for AD in previous analyses 10 . Finally, at 19q13.32, we identify NECTIN2 (index pQTL variant rs3810143) as an additional candidate gene near APOE. CSF NECTIN2 levels were associated with lower risk of AD (PWAS Z = -12, MR β = -0.685). The protein serves as a receptor for multiple Herpes Simplex Virus (HSV) forms 69 , which have repeatedly been linked to AD 70 . Recent work from our group has also identi ed an association in the NECTIN2 region for CSF TREM2 levels (in press), supporting its relevance in AD biology. Through integration of AD GWAS with CSF pQTL, we identify proteins that differ from those prioritized previously and may constitute new targets.
Low FCGR3B copy number has been associated with autoimmune disease 71 , supporting an important role in immune function also associated with AD. We observed a negative association between cis-regulated In addition, we also identi ed 17 proteins contributing to AD through at least two of PWAS, MR, and/or COLOC whose associations are driven by trans-pQTL including TMEM132C, LRP6, CRADD, and STX17 among others (Table S41, Fig. 4C). These ndings suggest novel protein-protein interactions that have not been identi ed before, similar to the case of TREM2 levels regulated by trans-pQTL in the MS4A locus 7 that we have replicated here. First, the BIN1 locus is one of the regions with the strongest association with AD 22 . The causal link between BIN1 and AD is unclear, but it has been connected to increased risk through interaction with tau 77 . We found that levels of TMEM132C were associated with the BIN1 locus that colocalized with the AD risk signal (PP.H4 = 0.999, Supp. Table S39). CSF levels of TMEM132C regulated by this locus were associated with increased risk of AD (PWAS Z = 20.1, PP.H4 = 0.999, Supp. Fig. S21).
While little is known about TMEM132C speci cally, its family members have been linked to panic disorders 78 and insomnia 79 , suggesting an important brain function potentially through interaction with the actin cytoskeleton 80 . This protein also has a pQTL at the APOE locus discussed above. We also observed trans-pQTL for both STX17 and CRADD at 11q24.3 near ADAMTS8 (index pQTL variant rs3740888). Mutations in CRADD cause intellectual disability 81 and the protein contributes to apoptosis of neurons 82 , potentially mediating a protective effect on AD through increased apoptosis of diseased neurons. Like TMEM132C, CRADD is also associated with the APOE locus discussed above. STX17 is a member of the SNARE complex and is vital for the fusion of autophagosomes with lysosomes 83 and interruption of this function is associated with axonal dystrophy, a feature of AD 84 . Both proteins were associated with decreased risk of AD at this locus (STX17 PWAS Z=-3.84 MR β=-0.062; CRADD PWAS Z=-3.53, MR β=-0.102). These analyses suggest that ADAMTS8, STX17 and CRADD are part of the same pathway, potentially contributing to disease through autophagosomal-lysosomal pathways. Additionally, we observed an association between CSF levels of LRP6 and intergenic variants in ADAM10 at 15q21.3 (index pQTL variant rs1427281). ADAM10 contributes to cleavage of amyloid precursor protein into neuroprotective components 85 and overexpression of it in a mouse model of AD was found to reduce amyloid plaque burden 86 and lessen memory de cits 87 . Levels of LRP6 regulated by this locus were associated with increased risk of AD (PWAS Z = 5.81, MR β = 0.189). LRP6 has been found to function in Wnt signaling vital to synapse function and variants in LRP6 have been associated with synapse degeneration in AD 88 . Variants at this locus may contribute to AD by increasing secretion of LRP6 to CSF and thereby decreasing the cellular levels of the protein. The integration of trans-pQTL with AD performed here identi es novel protein-protein interactions that help to explain the disrupted pathways observed in AD. All other proteins prioritized through trans associations are discussed in the Extended Results.
Consistent with our cell-type results that showed enrichment of our 42 proteins in microglia/macrophages (N = 10, P = 0.008, FC = 2.01) and mature astrocytes (N = 8, P = 0.03, FC = 1.76, Fig. 4D, Supp. Figure 11, Table S26), these AD-associated proteins were enriched in immune function (GO:Regulation of immune response, P = 1.58×10 − 5 , Fig. 4E, Supp. Table S42). Immune-relevant proteins include TREM2 (cis association at 6p21.1 and trans at 11q12.2), which is well-known for its role in microglia's response to neurodegeneration 89 . It is targeted for processing to the soluble form by ADAM10 90 , whose genetic locus was found to also regulate LRP6 levels above. TREM2 complexes with DAP12, which regulates activation of microglia through its immunoreceptor tyrosine-based activation motif (ITAM) 91 . IL34, also identi ed here, interacts with CSF1R on the surface of microglia leading to microglial activation 92 . Several of our identi ed proteins, including PILRA 93 , CD33 94 , and SIGLEC9 95 , also contain ITAM or immunoreceptor tyrosine inhibitory motif (ITIM) domains. ITAM-containing proteins in coordination with ITIM-containing proteins regulate processes including microglial phagocytosis and apoptosis 96 . We also identi ed other proteins involved in immune response. Variants in SHARPIN were associated with AD through reduction of the NF-κB-mediated immune response 97 . C1S and C4BPA are both involved in the complement system 98,99 .
Another relevant pathway for these proteins involves lysosomal dysfunction which is known to be implicated in AD pathogenesis 100 . Our analyses not only support this process but also extend its ties to AD by adding novel causal proteins linked to the lysosome (GO:Lysosomal lumen acidi cation including CLN5, GRN, TMEM106B, P = 1.23×10 − 5 , Fig. 4E, Supp. Table S42). As stated above, mutations in CLN5 contribute to a form of neuronal ceroid lipofuscinosis 72 . Neurons de cient in CLN5 protein showed impaired lysosomal activity and movement, suggesting the vitality of CLN5 to lysosome function 72 . In AD, autophagosomes showed an inability to fuse with lysosomes, causing accumulation in neurites 101 . In addition, STX17, associated in trans near ADAMTS8, is vital to the normal function of this process and depletion of STX17 leads to autophagosome accumulation 83,102 . We also identi ed an increased risk for AD associated with CTSH (Fig. 4B, Supp. Table S33). This is consistent with a previous work, wherein increased mRNA levels of CTSH were observed in AD patients and microglia with CTSH knockouts showed increased Aβ phagocytosis 103 . Cathepsins are known to function in proteolysis of lysosomal proteins 104 .
Interestingly, CST8, a protein we associated with decreased risk of AD, is part of a protein family that inhibits the activity of cathepsins, offering a potential mechanism for its protective effect 105 . We also identi ed two proteins, GRN and TMEM106B, that are typically associated with frontotemporal dementia (FTD) 106,107 but were also identi ed in the latest AD risk GWAS 22 . Both proteins are implicated in lysosomal function 108, 109 . We prioritized two pathways through which AD pathology may develop. First, we identi ed numerous proteins involved in immune function, prioritizing those involved in ITAM/ITIM signaling. Second, we characterized potential functions of lysosomal proteins through interruption of autophagosome-lysosome fusion.
Finally, we searched DrugBank 110 to identify potential therapeutic agents targeting the identi ed ADassociated proteins that may be repurposed. Of the 42 prioritized proteins, 15 had at least one reported molecule in DrugBank (Fig. 4C). FCGR3B is targeted by cetuximab (DrugBank accession number DB00002), a chemotherapy agent. Cetuximab has been associated with decreased risk of AD 111 , making it a potential drug-repurposing target (Supp . Table S5). Glutamic acid (DB00142) targets SLC25A18 and is heavily involved in excitatory signaling in neurons 112 . SLC25A18's function as a mitochondrial glutamate transporter may be abnormal in AD 113 . CASQ1 is a ligand for calcium (DB11093), and studies have suggested that targeting calcium receptors may be bene cial in AD 114 . Finally, CA12 is targeted by carbonic anhydrase inhibitors like acetazolamide (DB00819), which was shown to reduce Aβ-induced mitochondrial toxicity 115 . All associated drugs can be seen in Supp. Table S5. Through our analyses, we identify numerous causal and druggable proteins for AD. These drugs, or others developed to target these proteins, may offer promising targets for AD treatment in the future.

Identi ed proteins successfully predict AD biomarker status
The identi cation of accurate tools to prioritize individuals at risk of AD is a vital focus for the eld. We sought to compare the predictive power of the prioritized proteins to classify individuals who were positive for two established AD biomarkers: CSF amyloid beta-42 and CSF phosphorylated tau-181 protein. We developed a proteomic risk score based on the transcriptional risk score (TRS) approach 116 (Fig. 5A). Brie y, we started with the normalized aptamer levels (z-score) for all proteins associated with AD through PWAS (N = 456) in each individual. Using the TRS framework, we harmonized the aptamer levels so that higher proteomic risk score would always increase AD risk. The average per-individual PWAS protein zscore was signi cantly associated with amyloid/tau status (P for A-T-vs A + T + = 9.04×10 − 116 , Fig. 5B). We then split our dataset into training and testing datasets (training: ADNI, FACE; testing: MAP, Barcelona-1, consistent with our discovery and replication for the pQTL mapping). We used LASSO regression to select signi cant predictors (N = 162, aptamers with non-zero weights are denoted in Supp. Table S33)  (288/456 measured, 106/162 signi cant predictors measured). Even with potentially losing some effectiveness due to missing measurements, we still observed high predictive ability in the independent dataset (AUC = 0.958, CI = 0.920-0.996, Fig. 5D, Supp. Table S44).
Given the prioritized proteins were identi ed through genetics, we also compared the ProtRS to a model based on a polygenic risk score calculated using the AD risk GWAS 22   PRSs for AD are heavily dependent on the inclusion of APOE ε2 and ε4 genotype, with knowledge of those genotypes alone being more predictive than a pruned model containing the other genome-wide variants 117 .
We strati ed our samples by APOE genotype to analyze the effectiveness of the ProtRS to predict AT status compared to PRS excluding the APOE region. The PRS model also included age and sex as covariates. Across ages and APOE genotypes, the ProtRS developed based on AD-associated proteins is highly accurate and consistently performs better than PRS at predicting amyloid/tau positivity. This is consistent with previous results, where risk scores based on molecules predict disease more effectively than genetic variants alone 116,118 . Our results were consistent across training, testing, and external replication datasets, supporting our model as being highly replicable. We also showed high accuracy in all age brackets and APOE genotypes, while the PRS was less accurate in younger individuals and those without the APOE ε4 allele.

Discussion
Neurological disorders and diseases constituted the cause of nine million deaths in 2019 alone 119 , representing the second leading cause of death after heart disease 120 . However, the prevalence of these disorders has not translated effectively to treatments. Dementia is one of the ten most common causes of death worldwide 121 . AD causes 60-80% of these cases 122  Here, we have analyzed the genetic regulators of CSF protein levels and integrated them with AD genetics to nominate new proteins involved in AD. One method to prioritize genes potentially useful for treatment is through integration of disease GWAS with molecular QTL, but the majority of these datasets target molecules or factors that do not directly affect disease. Proteins, due to their active participation in biological processes, often are the direct contributing factors in traits. However, there is limited knowledge of the genetic regulators of protein levels. The two largest datasets both analyzed over 10,000 individuals 11,12,15 , but both included fewer proteins than our dataset and analyzed plasma proteomics only. Transfer of molecules including proteins from blood into the brain and vice-versa is highly regulated by the blood-brain barrier 124 , potentially limiting the overlap between protein regulation in the brain and blood. In addition, even for proteins that originate in the central nervous system and are present in plasma, their levels may be far lower and regulated differently than in CSF 8,9 .
CSF is highly neurologically relevant as the uid that surrounds the brain and spinal cord, and a large portion of CSF proteins are derived from the central nervous system 125 . The ability to extract CSF from living individuals and its shared protein biology with the brain makes it highly relevant for the study of neurological disorders and diseases. Here, we present the largest-to-date proteogenomic analysis of CSF.
This dataset presents an expansion of previous CSF pQTL work 16-20 by tripling previous sample sizes (N = 3,107) and increasing the number of analyzed proteins by over sixfold (N = 7,028). Overall, we identi ed 3,373 independent study-wide signi cant pQTL associations (1,247 cis, 1,069 trans) for 1,786 proteins, corresponding to 28.9% of all analyzed proteins. We further con rmed that these are mainly unique to CSF, supporting the analysis of tissue-relevant proteomic datasets to better understand complex traits. Through these analyses, we show strong differences in protein regulation between tissues, emphasizing the need for tissue-relevant approaches for understanding brain-related traits.
While integration of GWAS with QTL has elucidated functional genes at many disease-associated loci, many genetic variants have not been connected to genes. In AD, novel loci have been identi ed without obvious causal genes 22,68 . To attempt to clarify the functional genes at some of these loci, we integrated our pQTL with summary statistics for AD through a proteome-wide association study, Mendelian randomization, and Bayesian colocalization. We identi ed 42 proteins associated with AD through at least two of these methods. These were enriched for microglia and astrocyte-speci c proteins, supporting previous evidence of a strong immunological component in AD [126][127][128] . The proteins were also enriched in immune-relevant pathways (through proteins like TREM2 7,129 , CD33, IL34, and others) and lysosomal function (GRN, TMEM106B, CLN5 among others). These candidate proteins provide support for candidate causal genes and identify potential new pathways relevant to AD. Two of these proteins (ACE, CTSH) were also identi ed in an AD PWAS using brain tissue, highlighting the relevance of CSF to brain biology 56 . We also identify potential drug repurposing candidates for AD, including cetuximab, acetazolamide, and calcium receptor-targeting drugs.
Using the transcriptional risk score framework 116 , we developed a risk score based on AD-associated proteins and applied it for predicting CSF amyloid-tau positivity, a proxy for AD diagnosis 25 . As the proteins were identi ed through integration with AD GWAS 22 , we compared the proteomic risk score to a PRS calculated from the same GWAS. We showed high accuracy in both internal and external replication datasets and improved upon PRS in all facets, including in age-speci c and APOE-speci c contexts, highlighting the proximity of proteins to disease compared to genetics.
We note several limitations in our study warranting further investigation. We included only non-Hispanic white individuals in our pQTL analysis, which may limit the number of associations identi ed and applicability of our results to other ancestries. Additionally, due to the nature of aptamers, mutations occurring in binding sites of the protein-targeting aptamers may lead to extreme differences in measured protein abundance due to binding that may not be true abundance differences. While we performed ADspeci c analyses using the largest AD GWAS to date 22 , this involved proxy-AD cases from UK Biobank that were not clinically diagnosed, potentially clouding the results with other forms of dementia or with individuals that do not develop AD. In addition, the gene prioritization was limited to the approximately 6,100 proteins that were present in the SOMAscan7k panel 24 , leaving many AD-associated regions unexplored. In many instances, the protein encoded by the candidate gene proposed in the AD GWAS was not included in our analysis, preventing us query those genes. Our cell-type analyses were also derived from healthy individuals and were based on RNA levels, which are known to correlate poorly with proteins. Investigation of the levels of these proteins in a cell-type speci c and disease-strati ed manner is an important direction.
While we have successfully identi ed novel candidate proteins in AD, we foresee this dataset being a useful resource for the investigation of all neurological traits. Previous evidence suggests that numerous candidate proteins can be identi ed for many neurological traits, even with smaller sample sizes 13,16 , and this dataset has already been applied to identify associations between proteins regulated by the LRRK2 genetic locus and Parkinson's disease 130 . We intend these results to add to the growing knowledge of genetic regulation of protein levels, and as such, we have made all summary statistics publicly available for use at https://www.niagads.org/knight-adrc-collection.

Con ict of interest
CC has received research support from GSK and EISAI. The funders of the study had no role in the collection, analysis, or interpretation of data; in the writing of the report; or in the decision to submit the paper for publication. CC is a member of the advisory board of Vivid Genomics and Circular Genomics and owns stocks in these companies. DP is an employee of GlaxoSmithKline (GSK) and holds stock in GSK.

Data availability
Proteomic data from the Knight ADRC participants are available at the NIAGADS and can be accessed at https://www.niagads.org/Knight ADRC-collection; The summary results using these data are also available to the scienti c community through a public web browser: www.omics.wustl.edu/proteomics .  Cerebrospinal uid (CSF) samples were collected through lumbar puncture from participants after an overnight fast. Samples were processed and stored at -80 C until they were sent for protein measurement.
In total, 4,223 CSF samples were collected from unique participants. Details separated by cohort are found The SOMAscan7k data underwent initial normalization by SomaLogic. At the sample level, they performed hybridization normalization. The aptamers were then divided into S1, S2, and S3 normalization groups based on signal to noise ratio. SomaLogic then performed median normalization to remove biases due to protein concentration, pipetting variation, reagent concentration variation, and assay timing among others 138 . Each sample was normalized to a reference to account for technical and biological variance.
This step was performed with iterative Adaptive Normalization by Maximum Likelihood (ANML) until convergence was reached. This method is a modi cation of median normalization. Additional normalization procedure details are documented in Somalogic's technical note 24 .
Further quality control was performed on the normalized SOMAscan7k data provided by SomaLogic according to an in-house protocol. Aptamers were removed if they failed either of two criteria: rst, if the maximum absolute difference between aptamer scale factor and median scale factor of any plate is > 0.5; second, if the median cross-plate coe cient of variation (CV) was > 0.15. Interquartile range (IQR) was then calculated for every aptamer based on log-10 transformed aptamer levels. Aptamer values outside of 1.5-fold of the IQR were replaced with NA values. Aptamers with call rate < 65% (aptamer measurement in less than 65% of samples) were excluded, and the same criteria was used to remove samples. Call rate for aptamers was then recalculated and a more stringent call rate threshold of 85% was applied. Sample call rate was recalculated after aptamer removal and a call rate threshold of 85% was applied at the sample level. A summary of the QC steps and aptamer/samples removed at each step is found in supplementary Fig. 1A. A summary of the samples removed by cohort is found in Supplementary Table S4.
Quality control was also performed on the SOMAscan5k data. Because information relating to scale factor and CV were not available for the PPMI cohort, those ltering steps were not performed. IQR-based and call-rate based quality control were performed using the same methods as with the SOMAscan7k dataset. A summary of the QC steps for the SOMAscan5k dataset is found in supplementary Fig. 1B.
To determine the number of unique proteins in each dataset (discovery and replication), we used information provided by Somalogic to map the aptamer ID to a Uniprot ID. If the aptamer was present in both the SOMAscan 5k and 7k datasets, we prioritized the mapping in 7k. We then determined the number of unique Uniprot IDs included in each dataset after QC. We used a three-stage analysis approach (discovery, replication, and meta-analysis) to identify protein quantitative trait loci (Fig. 1) pQTL analyses in both the discovery and replication cohorts were performed using PLINK2 139 . Aptamer levels were z-score normalized by rst transforming to log10-scale and then normalizing using the scale() function in R with options scale and center set to true, to center the values around zero with a variance of 1. For both analyses, age, self-reported sex, the rst ten genetic principal components, and cohortArray (for example ADNI_OmniEx, coded as a dummy variable) were used as covariates for an additive linear model in PLINK2 139 . The overall model was as follows: n in the equation above differs between discovery and replication due to differences in the number of cohortArray dummy variables included as covariates.

Cis-pQTL identi cation
We de ned cis-pQTL as all variant-aptamer associations with raw P < 5x10 − 8 that were within 1MB in either direction of the transcription start site of the corresponding protein-coding gene based on hg38 coordinates (Fig. 1).

Trans-pQTL identi cation
To de ne trans-pQTL, we calculated the multiple testing correction as follows. We calculated the number of proteomics principal components necessary to account for 95% of the variance in protein level between individuals, based only on the samples measured on the SOMAscan7k platform (I.e., excluding all samples from PPMI) to ensure as many proteins were present as possible. In total 1,450 proteomic PCs were necessary to account for that variance, so a study-wide signi cant p-value of 5x10 − 8 /1450 (approximately P < 3.45x10 − 11 ) was used to de ne trans-pQTL. We considered all study-wide signi cant variant-aptamer associations farther than 1MB in either direction from the transcription start site to be trans-pQTL (Fig. 1).

Meta-analysis
We utilized METAL 140 to perform a xed-effect meta-analysis using inverse-variance weighting for each aptamer. To correct for any remaining population strati cation and relatedness, we included the option GENOMICCONTROL. To identify signi cant pQTL (either cis or trans) in the meta-analysis, we required a variant-aptamer association to reach P < 0.005 in discovery, P < 0.05 in replication, to have the same direction of effect in discovery and replication, and to have reached the study-wide thresholds (P < 5x10 − 8 for cis and P < 3.45x10 − 11 for trans) in the meta-analysis (Fig. 1).

Identi cation of associations
We identi ed index variants used to de ne a pQTL through a distance-based method. For each aptamer, we scanned the genome to identify the single most signi cant variant-aptamer association that passed the study-wide multiple testing correction criteria and de ned it as the index variant for that association.
We then grouped all signi cant variant-aptamer associations within 1MB of the index variant and considered them the same signal. After excluding the variants in that region, we performed the same procedure for the next most signi cant variant-aptamer association (if applicable) until no associations reached the threshold for study-wide signi cance.

Disease-speci c analyses
Using PLINK2 139 , we also performed a joint association analysis utilizing all samples from discovery and replication, focusing only on the aptamers with a signi cant association in the meta-analysis and the top variant-aptamer pairs that made up those associations. We considered age, sex, genetic principal components 1-10, and cohortArray as covariates in the model. We used Alzheimer's disease-speci c biomarkers amyloid beta 42 (Aβ42) and phosphorylated tau-181 (pTau, both measured in CSF) to determine amyloid/tau classi cation 25  A z-score cutoff of 0.468 was identi ed for Aβ42, corresponding to a raw Aβ42 value of 856 pg/mL. Samples below 856 were considered Aβ42-positive. A z-score cutoff of -0.018 was identi ed for pTau, corresponding to a raw value of 67. Samples with a value greater than 67 were considered pTau-positive.
For DIAN, LumiPulse (Fujirebio) was used for both Aβ42 and pTau. Dichotomization for Aβ42 was performed on 478 samples and 474 for pTau. Five Aβ42 values and ten pTau values were replaced with CSF_xMAP (MilliporeSigma, Burlington, MA) platform measurements due to missingness (r = 0.77 and 0.89 between platforms for Aβ42 and pTau respectively). A z-score cutoff of -0.198 was identi ed for Aβ42, corresponding to a raw Aβ42 value of 517 pg/mL. Samples below 517 were considered to be Aβ42positive. A z-score cutoff of 0.31 was identi ed for pTau, corresponding to a raw value of 51.8. Samples with a value greater than 51.8 were considered pTau-positive.
Samples with proteomic data were then matched to samples that had genomic data that passed quality control. Amyloid and tau-positive (A + T + , 775) samples were treated as cases and amyloid and tau-negative samples (A − T − , 889) were treated as controls. We strati ed our samples by biomarker status and used PLINK2 with age, sex, genetic PCs 1-10, and cohortArray to test A + T + samples and A − T − samples separately for association of the index SNPs identi ed in the meta-analysis with their corresponding aptamer levels. We then correlated effect size between the meta-analysis and each classi cation as well as comparing them to each other (supplementary Fig. 7 and supplementary table X).

Conditional Analysis
We utilized GCTA-COJO 26,27 to perform approximate conditional analysis on our identi ed pQTL. For each variant-aptamer association, we converted the summary statistics from METAL to the format readable by COJO and performed conditional analysis on cis and trans associations using the appropriate p-value threshold.

Variant Annotation
We utilized the Variant Effect Predictor (VEP) 28 to annotate the most severe consequence for each of our pQTL associations. For each association, we identi ed all variants in high LD (R 2 > 0.8) with the index pQTL variant in that association. We then used those as input to VEP and obtained all annotation information for each variant. We then grouped each set of correlated variants and determined the most severe annotation for the association based on VEP's order of severity.

Enrichment of annotation categories
To test for enrichment of the variant annotation categories as identi ed by VEP, we rst determined the most severe annotation based on the methodology above for all independent variants. We then used a permutation-based approach where we randomly selected a group of 3,373 variants from all those assayed. Using the annotation approach above, we determined the most severe annotation for each of the randomly selected variants. We repeated this 1,000 times and obtained a distribution of the number of times each annotation was found in each group of 3,373 randomly-selected variants (I.e. for one permutation, 500 out of 3,373 were annotated as missense variants). To get a p-value, we then compared our actual independent pQTL variants to the randomly generated distribution and determined the number of permutations where more variants were mapped to each variant type than the actual data. If the pQTL variants had more of a variant type than any of the random sets, the p-value was set at P < 0.0001 (1/1000).

Identi cation of Drug Targets
Using the Uniprot 143 IDs corresponding to each aptamer as supplied by Somalogic (see Supp. Table S2 & S3), we queried the Uniprot 143 database "DrugBank" entry using their advanced search features. We then matched the DrugBank 110 molecule IDs to each of the corresponding aptamers (Supp . Table S5).

External Replication
We utilized an external dataset supplied by the Stanford Alzheimer's Disease Research Center (SADRC) and Aging Memory Study (SAMS) that included whole genome sequencing data as well as CSF proteomic data measured using either the SOMAscan5k assay 24 .

Stanford Iqbal Farrukh and Asad Jamal ADRC
Samples were acquired through the National Institute on Aging (NIA)-funded Stanford Alzheimer's Disease Research Center (SADRC). The SADRC cohort is a longitudinal observational study of clinical dementia subjects and age-sex-matched non-demented subjects. The collection of plasma was approved by the Institutional Review Board of Stanford University and written consent was obtained from all subjects.
Blood collection and processing were done according to a rigorous standardized protocol to minimize variation associated with blood draw and blood processing. Brie y, about 10 cc whole blood was collected in a vacutainer EDTA tube (BD Vacutainer EDTA tube) and spun at 3000RPM for 10 mins to separate out plasma, leaving 1 cm of plasma above the buffy coat and taking care not to disturb the buffy coat to circumvent cell contamination. Plasma processing times averaged approximately one hour from the time of the blood draw to the time of freezing and storage. All blood draws were done in the morning to minimize the impact of circadian rhythm on protein concentrations.
All healthy control participants were deemed cognitively unimpaired during a clinical consensus conference that included board-certi ed neurologists and neuropsychologists. Cognitively impaired subjects underwent Clinical Dementia Rating and standardized neurological and neuropsychological assessments to determine cognitive and diagnostic status, including procedures of the National Alzheimer's Coordinating Center (https://naccdata.org/). All participants included in this study were deemed cognitively impaired during a clinical consensus conference that included neurologists and neuropsychologists. All participants were free from acute infectious diseases and in good physical condition.

Stanford Aging Memory Study (SAMS)
SAMS is an ongoing longitudinal study of healthy aging. Blood collection and processing were done by the same team and using the same protocol as in SADRC. Neurological and neuropsychological assessment were done by the same team and using the same protocol as in SADRC. 192 participants were included in the present study, and 11 were participants in both the SADRC and SAMS study.

Genomic/proteomic data QC
We performed genomic principal component analysis and identity by descent ltering based on the genomic data, using 1000 Genomes samples as an anchor, based on the same general steps as for the main analysis. Plink1.9 139 was used for both ltering steps. The whole genome sequencing data was ltered using a genotyping rate threshold of 95% and a HWE threshold of 5×10

Tissue-speci city of pQTL Associations
To determine the overlap of signi cant CSF pQTL associations across tissues, we compared the pQTL associations from our meta-analysis to those identi ed in the largest-to-date aptamer-based plasma pQTL study 11 and to plasma and brain pQTL 16 . We performed Bayesian colocalization using coloc.abf 58 . We used a region of 1MB surrounding the index CSF pQTL variants to select variants for inclusion in the analysis. We included only variants that were present in both analyses. We tested for a shared genetic association between our CSF pQTL signals and both plasma datasets and the brain dataset, ensuring that the aptamer ID was consistent between each dataset analyzed. We considered an association to be shared between tissues according to colocalization if the posterior probability of H4 (shared association) was greater than or equal to 0.80. The plasma and brain analyses published by Yang et al. were based on the GRCh37 genome build, so we lifted over the summary statistics to GRCh38 coordinates using the UCSC Genome Browser's LiftOver tool 144 . Due to limited aptamer overlap, we were able to perform colocalization for 1,682 associations with Ferkingstad et al. 11 , 389 for in-house plasma 10 , and 465 for in-house brain 10 (out of 2,316). We determined if trans associations were more likely to be CSF-speci c than cis using the prop.test() function in R 145 . The remaining 634 associations could not be analyzed because the corresponding aptamer was not measured in the plasma study.

Molecule-speci city of pQTL associations
To determine the overlap between protein QTL associations and RNA transcript QTL (eQTL) associations, we analyzed whole blood eQTL summary statistics from eQTLGen 5 and GTEx 4 , cortex eQTL from GTEx, microglia-speci c eQTL 29 and cortex, hippocampus, basal ganglia, cerebellum, and spinal cord eQTL from MetaBrain 6 . The summary statistics from eQTLGen were supplied using GRCh37, so we lifted them to GRCh38 coordinates using LiftOver 144 . For the eQTL datasets, if Ensembl transcript ID was supplied, we matched using only the Ensembl gene ID and ignored the transcript info. Ensembl ID was matched to the Entrez Gene Symbol using the biomaRt R package 146 , then the Entrez Gene Symbols for each protein aptamer were used to match summary statistics between pQTL and eQTL datasets. Because most of these datasets were limited to cis-acting variants only, we performed colocalization using signi cant cis pQTL associations only. For each pQTL association, we kept all SNPs present in both pQTL and eQTL datasets that were within 1MB of the index pQTL variant's location. We utilized the coloc.abf function 58 to perform Bayesian colocalization. We determined shared associations between pQTL and eQTL to be those with PP.H4 ≥ 0.80.

Identi cation and Analysis of Pleiotropic Regions
We utilized an LD-based method to de ne pleiotropy. For each index pQTL variant identi ed, we used to determine if the pathways were especially relevant to a speci c cell type. Finally, we utilized the GWAS Catalog 30 to perform a phenome-wide association study (pheWAS). For each of the three regions, we used the index pQTL variants identi ed in each and queried the GWAS Catalog to identify all genome-wide signi cant associations for those variants. This was done to elucidate diseases and traits that may be regulated by the region and may include regulated proteins in their etiology. We performed LD pruning using PLINK2 139 using a window size of 500KB, step size of 50, and R 2 cutoff of 0.5 to de ne independent LD blocks. We utilized Haploview 148 to visualize the linkage disequilibrium structure of these pleiotropic regions.
We compared the number of aptamers associated with each region in CSF and plasma 11 . Using the index associations identi ed in plasma (N = 18,084), we determined the number of plasma associations in each of the three main CSF pleiotropic regions. Using the total number of index associations identi ed (2,316 in CSF and 18,084 in plasma), we performed a two-sided two-sample z-test for proportions using prop.test() in R to analyze the difference in the proportion of the total associations between tissues in each region.
Because the APOE chr19 region is so intertwined with disease, we sought to determine if the associations seen were driven by disease status, To do this, we tested for a difference in effect size between AT − and AT + individuals for each of the associations in this region using the following equation: We then used a Z-statistic cutoff of ±1.960 (corresponding to P = 0.05) to determine associations whose strength differed between ATand AT + samples.

Alzheimer's Disease Proteome-Wide Association Study
To identify proteins potentially involved in the etiology of AD, we utilized a modi ed version of the FUSION framework 55 . In brief, FUSION calculates the genetic contribution of each SNP in a region of interest to the level of an aptamer. Then, by matching the variants to those in a binary trait GWAS, the framework uses the calculated contributions (weights) to impute protein level based on the summary statistics from the GWAS. It then correlates the genetically regulated protein levels with disease status to identify both association and directionality between protein levels and the binary trait.
For our analyses, we used the Stage 1 summary statistics consisting of 39,106 clinically diagnosed AD cases, 46,828 proxy AD cases, and 401,577 controls from the largest-to-date AD GWAS study 22 . Typically, the FUSION framework focuses only on cis-regulated abundance, utilizing the transcription start site of the gene of interest to determine the region to use for weight calculation. However, because of the substantial presence of trans-pQTL associations identi ed in our analysis, we instead used the index variant for each pQTL association as the reference to choose variants. We then included all variants in a 1MB region surrounding each index variant. We included the methods "top1", "lasso", and "enet" for weight calculations. Because we only calculated weights for regions/aptamers that had a signi cant pQTL, we set the heritability p-value threshold to 1 to ensure that all association/aptamer pairs were analyzed for association. We included age, self-reported sex, genetic principal components 1-10, and cohortArray as covariates in the weight calculation model, matching those used for the initial pQTL identi cation. We then prepared a le containing information about the calculated weight les for each pQTL for input to the association analysis step.
To calculate association between imputed protein levels and AD status, we ran FUSION.assoc_test.R on each autosomal chromosome, using the same binary le used to calculate the weights further subsetted by chromosome. This calculates imputed protein levels in the AD GWAS summary statistics based on the weight les calculated in the previous step. It then associates the imputed protein levels with disease status. We used Benjamini-Hochberg 149 false discovery rate correction with an adjusted threshold of P FDR < 0.05 for association between each variant-aptamer pQTL region (because we are including both cis and trans associations, we cannot specify just aptamer) and disease status to determine signi cant aptamerdisease associations.

Alzheimer's Disease Colocalization Analysis
Colocalization under a single causal variant assumption To determine shared genetic etiology between aptamers and AD, we performed Bayesian colocalization analysis using coloc.abf from the coloc R package 58 . We used the default prior probabilities of P 1 = 1x10 − 4 , P 2 = 1x10 − 4 , and P 12 = 1x10 − 5 . For a genetic signal to be considered shared between aptamer level and AD status, we required that the posterior probability of H4 (PP.H4) be > 0.8. This threshold suggests high con dence that the genetic signals for the two traits are caused by the same variant. We used the same AD GWAS as for FUSION 22 . We included all variants within 1MB of the pQTL index variant that were present in both our pQTL analysis and the AD GWAS summary statistics. For each of these, we harmonized the effect allele to ensure consistent direction of effect between the two analyses. Using PLINK1.9 139 , we calculated an LD matrix based on the 3,107 samples analyzed. While the matrix is not necessary for this method, for consistency with COLOC-SuSiE we used it to lter all variants missing LD info before running colocalization. This ensured the same variants were used in both analyses. Because both the pQTL analyses and AD GWAS report the standard error of the effect size for each variant, we squared the standard error for each variant in each analysis to obtain the variance as required for coloc. Using only the variants that were present in both analyses, colocalization was then performed. For the AD GWAS, the "type" was set to "cc". Because z-score protein levels were used, "sdY" was set to 1 for the pQTL dataset.
Colocalization with no single causal variant assumption To relax the single causal variant assumption inherent in coloc 58 , we also performed colocalization using coloc-SuSiE, which accounted for LD and was able to compare multiple independently signi cant variants at each locus 59 . This approach rst utilizes SuSiE 150 , which uses an iterative Bayesian stepwise selection approach to identify independent credible sets of variants, each of which contains a variant predicted to be causal. This is performed on both the pQTL data and the AD GWAS data 22 , to get two separate lists of credible sets for each trait. The same variants as above were used for colocalization, which is then run on each pairing of credible sets. The LD matrix and the effect size and variance for each variant was then supplied to the runsusie function in the coloc R package, which calculated the credible sets for each trait.
This information was then passed to the coloc.susie function in the coloc R package to calculate colocalization. A threshold of PP.H4 > 0.8 was used to identify shared causal variants.

Alzheimer's Disease Mendelian Randomization
In order to estimate causality of the proteins measured by the aptamers in this analysis, we implemented MR using the R package TwoSampleMR 57 . TwoSampleMR rst performs LD-based clumping to identify independent instrument variables (variants) for each exposure (aptamer). We performed clumping on each aptamer with a study-wide signi cant pQTL. We limited the variants included for each aptamer to only those that reached genome-wide signi cance (P < 5x10 − 8 ) for that aptamer, using the prede ned default clumping thresholds. We extracted the instrument variants identi ed by clumping then harmonized the two datasets to ensure matching effect alleles. To account for pleiotropy in the analysis, we rst removed all trans-associated variants for each aptamer including any variant in a pleiotropic region (as identi ed above) that regulated ve or more aptamers. We then further removed all trans-associated variants within 500KB of rs429358 (APOE ε4) due to the highly signi cant association of this region with AD. Because cisregulated proteins in general are higher con dence to be causal than trans, we did not remove any cis variants based on pleiotropy. Cis instrument variables that are located in pleiotropic regions (regions regulating ve or more aptamers) are marked in Supp. We rst determined the difference in proteomic risk score value distribution between unrelated A − T − , A + T − , and A + T + samples (N = 785, 496, and 733, respectively) using the Wilcoxon rank-sum test 151 as implemented in the compare_means function in the ggpubr R library. A − T + samples were included because of unclear relevance to AD. Samples from DIAN were excluded due to the high prevalence of early-onset forms of disease while the proteins were chosen based on late-onset AD.

AT status prediction using aptamers chosen by LASSO regression
We utilized the R package "glmnet" to perform LASSO regression 152  We also performed age and APOE genotype-strati ed analyses using the LASSO-de ned model. As with the training and testing datasets, we used the predict() command to obtain values for each individual that were then used to predict AT status. The strati ed datasets were obtained by selecting all samples tting the strati cation criteria (age bracket or APOE genotype) from both the testing and training datasets.

Risk prediction using polygenic risk scores
Because the proteins prioritized for use in the proteomic risk scores were identi ed based on genetic associations, we wished to compare the predictive ability of the ProtRS vs a polygenic risk score (PRS) calculated using variants associated with AD status 22  supplied the list of selected Entrez gene symbols as input (IDs instead of symbols for enrichDGN) and set the universe to all unique genes covered by the SOMAscan7k panel (n = 6,112). For gseGO, the default universe was used, as there is not a command to change the background. For all enrichment analyses, an FDR-corrected p-value threshold of 0.05 was used.

Cell-type Speci city Analysis
We downloaded gene expression data from human astrocytes, neurons, oligodendrocytes, microglia/macrophages, and endothelial cells 34 to determine the degree of speci city to relevant cell types for the aptamers included in the SOMAscan7k panel. The data downloaded contained multiple subtypes of astrocytes; we focused only on human mature astrocytes for our analysis. For each cell type, we determined the average expression level across individuals for each gene. We then added the averages from each cell type to get a total expression level for that gene across the ve cell types. We then calculated the percentage of the total expression that each cell type contributed. A gene was reported to be cell-type speci c if the percentage of its full expression contributed by the top cell type was 1.5x higher than the second top cell type. For example, 72.5% of the brain expression of APOE was attributed to mature astrocytes, while the next highest cell type was neurons at 10.6%. Because the ratio of astrocyte contribution (72.5%) to neuron contribution (10.6%) is greater than 1.5, APOE was considered astrocytespeci c.
To determine enrichment, we rst matched each of the proteins in the SOMAscan7k platform to their EntrezGeneSymbols using the Somalogic-provided documentation. Using the above ratio-based strategy, we determined the cell-type speci city for all EntrezGeneSymbols and counted the number of genes that were speci c to each cell type. In total, 5,750 proteins had corresponding genes that were included in the cell-type expression data. For each protein subset, we determined the number of corresponding genes that were speci c to each cell type. We then tested for enrichment using the hypergeometric test in Excel.  Cerebrospinal uid pQTLs are consistent across disease and largely tissue and molecule-speci c. A.
Miami plot of combined results for each aptamer from pQTL analysis in the discovery (top, 1,912 samples and 7,559 aptamers) and replication (bottom, 1,195 samples and 7,028 aptamers) datasets. The x-axis represents genome position of an associated variant and the y-axis represent the -log10(p-value) for association of each SNP with an aptamer. B. 2D Manhattan plot of 2,316 index pQTLs for 1,961 aptamers that were signi cant after meta-analysis. The x-axis position represents the genome position of the pQTL signal. The y-axis position represents the location of the protein-coding gene corresponding to the aptamer with a pQTL. Color represents cis or trans status of the index SNP (cis, red; trans, blue). The top panel maps the pleiotropic regions of the genome (limited to 100 aptamers with an association in the same region). C. Scatter plot of index pQTL SNP effect size (y-axis) vs index pQTL SNP minor allele frequency (xaxis). Color represents cis or trans status of the index SNP (cis, blue; trans, black). Correlation was calculated using the Pearson method. D. Scatter plot of cis index pQTL effect sizes (y-axis) vs distance from the transcription start site of the gene encoding the associated protein (x-axis). Color represents the minor allele frequency (darker = less common). E. Scatter plot of effect size of index pQTL variants in dichotomized amyloid/tau positive samples (x-axis) vs dichotomized amyloid/tau negative samples (yaxis). Color represents minor allele frequency (darker = less common). Correlation was calculated using the Pearson method. F. Colocalization of CSF pQTL associations with plasma pQTL associations from Ferkingstad et al 11 . Plasma + CSF represents pQTL associations that colocalized across tissues. CSFspeci c represents pQTL associations that did not colocalize with plasma pQTLs for the same protein.
New Proteins are pQTL associations for proteins measured in CSF but not plasma. Color represents cis or trans status of the index CSF pQTL (cis, blue; trans, green). G. Colocalization of cis CSF pQTL associations with various QTL types. Green bars represent brain-relevant tissues, while blue bars represent other tissues. Bar labels represent the number of colocalizing QTLs.

Figure 3
Three pleiotropic regions make up hotspots of protein regulation involved in neurological processes. A. Circos plot showing the genomic locations of all protein-coding genes whose proteins are regulated by the chr3q28 pleiotropic region. Colors represent the predominant brain-relevant cell type corresponding to that protein. B. Enrichment of proteins associated with the chr3 region in brain-relevant cell types (based on classi cation shown in A). Fold change was calculated based on the number of cell type-speci c proteins in the region compared to the number in the entire SOMAscan7k panel. Enrichment p-value was calculated using a one-sided hypergeometric test. C. Selected pathways enriched for proteins associated with the chr3 region. Gene Ratio represents the proportion of all proteins associated with the region that are part of each pathway. D. PheWAS of index pQTL SNPs located in the chr3 region, as found in the GWAS catalog 30 . E. Circos plot showing the genomic locations of all protein-coding genes whose proteins are associated with the chr6p22.2-21.32 pleiotropic region. F. Enrichment of proteins associated with the chr6 region in brain-relevant cell types. G. Selected pathways enriched for proteins associated with the chr6 region. H. Selected traits and diseases associated with index pQTL SNPs located in the chr6 region, as determined by the GWAS catalog. I. Circos plot showing the genomic locations of all protein-coding genes whose proteins are regulated by the chr19q13.32 pleiotropic region. J. Enrichment of proteins associated with the chr19 pleiotropic region in brain-relevant cell types. K. Selected pathways enriched for proteins associated with the chr19 region. L. Selected traits and diseases associated with index pQTL SNPs located in the chr19 region, as determined by the GWAS catalog.   Protein-based prediction of amyloid/tau positivity is more accurate than a polygenic risk score. A. General schematic of proteomic risk score work ow. B. Violin plot of proteomic risk score values across amyloid/tau negative samples (gray, N = 889), amyloid positive/tau negative samples (blue, N = 496), and amyloid/tau positive samples (green, N = 775). Signi cance was calculated using the Wilcoxon rank-sum