Multi-cohort cerebrospinal fluid proteomics identifies robust molecular signatures for asymptomatic and symptomatic Alzheimer’s disease.

Changes in amyloid-β (A) and hyperphosphorylated tau (T) in brain and cerebrospinal fluid (CSF) precede AD symptoms, making the CSF proteome a potential avenue to understand the pathophysiology and facilitate reliable diagnostics and therapies. Using the AT framework and a three-stage study design (discovery, replication, and meta-analysis), we identified 2,173 proteins dysregulated in AD, which were further validated in a third, fully independent cohort. Machine learning was used to create and validate highly accurate and replicable (AUC > 0.90) models that predict AD biomarker positivity and clinical status. These models can also identify people who will convert to AD and AD cases with faster progression. The associated proteins cluster into four pseudo-trajectory groups spanning the AD continuum and were enriched in specific pathways, including neuronal death, apoptosis, and tau phosphorylation (early stages), microglia dysregulation and endolysosomal dysfunction (mid-stages), brain plasticity and longevity (mid-stages), and microglia-neuron crosstalk (late stages).

framework [1]. Despite their diagnostic utility, these markers capture only a fraction of the intricate pathophysiology of AD. Genetics has also substantially advanced our understanding of heritable AD risk, revealing the complex polygenic nature of this disorder, with an estimated heritability between 58% and 79% [2]. However, the interplay between AT(N) changes and the precise influence of genetic risk factors on the biological pathways underlying AD pathophysiology is not always clear [3]. To gain comprehensive insights into the biology of AD, further analysis using complementary -omics methodologies is often necessary. To this end, transcriptomic profiling has emerged as a widely used approach to quantify mRNA transcripts in post-mortem AD brains [4]. The resulting transcriptomic data have also been integrated with AD genetic risk information to understand disease pathophysiology [5]. Nevertheless, proteins and the metabolic pathways they regulate are frequently cited as the ultimate biological effectors of both genetic and environmental risk factors in AD. Therefore, high-throughput omics-based investigations in biological fluids such as cerebrospinal fluid (CSF) and plasma are needed to gain further mechanistic insights into the molecular processes involved in AD pathogenesis and to prioritize connections to relevant clinical and neuropathological traits.
CSF is a valuable source for understanding the biochemical changes occurring in the brain during neurodegenerative disorders, offering insights into their underlying pathobiology [6]. The classical AD CSF biomarkers include Aβ42 or its ratio (Aβ42/40), hyperphosphorylated tau (pTau) or total tau (tTau), and neurofilament light chain (NFL), which indicate senile plaque pathology, formation of neurofibrillary tangles, and axonal degeneration in the brain, respectively [7]. Alterations in the levels of these biomarkers, among others, can be detected years before the symptoms of AD appear [8,9]. While these established pathological markers are widely employed for early AD diagnosis in research and clinical settings [10], they have limited utility in capturing the biological diversity of AD [11][12][13]. Therefore, a systematic exploration of the CSF proteome holds the potential to identify novel markers that reflect the multifaceted pathophysiology of AD. Besides refining the biological definition of AD, it can also provide crucial insights for developing robust AD prediction models that are independent of Aβ and tau pathology.
A growing body of literature [14][15][16][17][18], including our own [19], has leveraged proteomics datasets from CSF and plasma to identify several pathways in AD, including innate immune response and inflammation, oxidative stress, energy metabolism, and mitochondrial function. While existing proteomics approaches have contributed significantly, they have been relatively limited in their coverage, ranging from 453 to 4,001 protein analytes profiled using mass spectrometry [17,14]. This limited detection power is primarily due to the extraordinary complexity and broad dynamic range of protein concentrations in CSF and plasma [20]. Furthermore, the limited size of these studies, with fewer than 1,000 samples [16], is also a significant hurdle to deriving statistically significant findings. Altogether, the use of low-throughput protein profiling techniques and the limited sample sizes of existing studies have significantly hampered their potential for identifying additional biomarker signatures and providing novel candidates that can serve as effective disease-modifying targets.
To identify significant alterations in the AD CSF proteome, create robust prediction models, and identify functional pathways compromised in AD, we generated and analyzed high-throughput proteomics data from 2,286 participants in a three-stage study (Fig. 1). The proteins associated with AD were then used to create prediction models and for pathway and cell-type enrichment analyses to determine the pathways implicated in AD pathogenesis.

Study design
In this study, we used the SomaLogic SomaScan assay to measure the levels of 7,029 analytes in the CSF of 2,286 participants from the Knight ADRC [21], FACE, ADNI, and Barcelona-1 cohorts. We employed a three-stage analytical approach (stage 1, stage 2, and meta-analysis) to identify robust alterations in the AD CSF proteome. Based on the AT(N) paradigm, the discovery analysis (stage 1) was performed in the Knight ADRC and FACE cohorts (n=1,170; A-T- = 680, corresponding to biomarker-negative individuals, and A+T+ = 490, or biomarker-positive individuals). The proteins that were significant after false discovery rate correction (FDR < 0.05) were then tested for replication in stage 2 using the ADNI and Barcelona-1 cohorts (n=593; A-T- = 235 and A+T+ = 358). Finally, a meta-analysis of stages 1 and 2 was performed to identify robust protein associations passing a more stringent Bonferroni correction (Bonferroni-corrected p < 0.05). We further validated these proteins in a completely independent CSF proteomics cohort (Stanford ADRC; A-T- = 80 and A+T+ = 27) profiled using a different protein quantification platform (SomaScan 5K).
A comprehensive examination of protein abundance across the AT groups (A-T-, A+T-, and A+T+) revealed distinct protein pseudo-trajectories (estimates of longitudinal protein trajectories based on cross-sectional data) that span the entire AD continuum. Based on these disease-stage patterns, we obtained four different groups of proteins, each with a unique pseudo-trajectory. Group-specific pathway enrichment was performed to understand the biological processes compromised during different stages of the AD continuum. Each group displayed enrichment for several biological systems (nervous system, immune response, biosynthesis, and signal transduction) and specific brain cell types (neurons, astrocytes, and microglia). Overall, the disease and pathway enrichment analyses highlighted several neurological disorders (e.g., AD, tauopathy, and synucleinopathy) and neuronal functions (neuron projection morphogenesis, synapse assembly, and axonogenesis) as significantly enriched (FDR < 0.05) in the altered AD CSF proteome (Table 1, Fig. 1).

Identification of AD-specific CSF proteomic alterations
We performed a three-stage study to identify significant alterations in the AD CSF proteome (Fig. 2B). In the first stage, a discovery analysis was performed on 1,170 individuals (A+T+ = 490, A-T- = 680) from the Knight ADRC and FACE studies (Fig. 2A). We identified 3,565 proteins with significantly different levels (FDR < 0.05) between A-T- (biomarker negative, a proxy for controls) and A+T+ (biomarker positive, a proxy for AD cases) individuals (Fig. 2C and Supplementary Table 1) [22][23][24]. In the second stage, the proteins that showed significant associations in stage 1 were tested in stage 2, which comprised 593 individuals (A-T- = 235, A+T+ = 358) from ADNI and Barcelona-1 (Fig. 2A). Of the 3,565 proteins identified in stage 1, 2,608 replicated in stage 2 after FDR correction and with a consistent direction of effect (Fig. 2D and Supplementary Table 2). Of these, 1,693 were upregulated in A+T+ (cases) compared to A-T- (controls), and 915 were downregulated.
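The discovery-then-replication filter described above can be sketched in a few lines. This is a minimal illustration, assuming Benjamini-Hochberg as the FDR procedure (the text only states "FDR < 0.05"); the function names, the dictionary layout, and the toy values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def replicated_proteins(stage1, stage2, alpha=0.05):
    """Keep proteins significant in both stages with the same effect direction.

    stage1 and stage2 map a protein name to (estimate, p_value).
    """
    names = list(stage1)
    q1 = benjamini_hochberg([stage1[n][1] for n in names])
    hits = [n for n, q in zip(names, q1) if q < alpha and n in stage2]
    # Stage 2 correction is applied only to the proteins carried forward.
    q2 = benjamini_hochberg([stage2[n][1] for n in hits])
    return [
        n for n, q in zip(hits, q2)
        if q < alpha and stage1[n][0] * stage2[n][0] > 0
    ]


# Hypothetical toy data: (estimate, p-value) per protein.
stage1 = {"P1": (0.5, 1e-6), "P2": (-0.3, 1e-4), "P3": (0.1, 0.6), "P4": (0.4, 1e-3)}
stage2 = {"P1": (0.4, 1e-3), "P2": (0.2, 1e-3), "P3": (0.1, 0.5), "P4": (0.3, 0.4)}
print(replicated_proteins(stage1, stage2))
```

In this toy input, P2 is significant in both stages but changes sign, and P4 fails the stage 2 threshold, so only P1 is carried forward.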
In the third stage, we performed a meta-analysis combining the p-values from stages 1 and 2 for the proteins that replicated in stage 2, and applied a stringent Bonferroni correction to minimize the chance of false-positive results (Fig. 2A). The meta-analysis yielded 2,173 proteins associated with AT status after Bonferroni correction (Fig. 2E and Supplementary Table 3). Finally, we validated these findings using CSF proteomics data from an independent study (Stanford ADRC) that employed a different proteomic panel (SomaScan 5K). As this validation cohort had a limited size (n=132; Table 1), we assessed the consistency of effect sizes and significance (p-values) across these studies. We observed a strong correlation between the effect sizes (corr = 0.90, p = 3.3×10^-187) and the p-values (corr = 0.82, p = 1.5×10^-138) of the meta-analysis and the Stanford ADRC study (Figure S2). This unbiased validation confirms the platform-independent robustness of our meta-analysis results. We carried the 2,173 proteins that passed Bonferroni correction in stage 3 forward to downstream analyses: disease prediction models and pathway enrichment (Fig. 1).
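The text does not state which method was used to combine the stage 1 and stage 2 p-values. As one possible illustration, the sketch below assumes Fisher's method followed by a Bonferroni threshold; `fisher_combine`, `bonferroni_hits`, and the toy p-values are hypothetical names and numbers introduced here.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom, whose survival function has the closed form
    exp(-X/2) * sum_{i<k} (X/2)^i / i! for even degrees of freedom.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def bonferroni_hits(combined, alpha=0.05):
    """Return names whose p-value times the number of tests is below alpha."""
    m = len(combined)
    return sorted(name for name, p in combined.items() if p * m < alpha)


# Hypothetical (stage 1, stage 2) p-values for three replicated proteins.
stage_p = {"P1": (1e-6, 1e-3), "P2": (1e-3, 2e-2), "P3": (0.2, 0.3)}
combined = {name: fisher_combine(ps) for name, ps in stage_p.items()}
print(bonferroni_hits(combined))
```

Bonferroni multiplies each combined p-value by the number of tests, which is stricter than the FDR control used in stages 1 and 2.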

Identification of a robust and AD-specific prediction model
Since the entire set of differentially abundant analytes (DAA; n=2,173) identified using the multi-stage meta-analysis is too large for developing a clinically meaningful proteomic panel for AD diagnosis and prognosis, we used machine learning approaches to identify the minimum number of proteins with high predictive power (Fig. 3A). We trained a least absolute shrinkage and selection operator (Lasso) regression model [25] on 70% of the stage 1 data (stage 1 training; n=819). The Lasso regression model with five-fold cross-validation identified 56 proteins. Proteins with highly correlated abundance levels in the stage 1 data (Pearson correlation > 0.8) were removed to further reduce the size of the proteomic signature. Since the performance of the signature was also assessed in an independent study (Stanford ADRC) that used a different protein quantification platform (SomaScan 5K), only proteins present in both the signature and the Stanford ADRC data (n=25) were kept. Finally, a set of 11 proteins that contributed significantly to the prediction (P < 0.05 in the multivariate model; Supplementary Table 4) was retained. The identified proteomic signature included some well-known AD-associated proteins such as YWHAG [18,22], PIN1 [26], and EZR [27]. This model (11 proteins and specific weights) was assessed in the stage 1 testing (30% of stage 1 data; n=351), stage 2 (replication; n=593), and external validation (Stanford ADRC; n=107) datasets. It showed strong predictive power for classifying A+T+ vs. A-T- individuals, with an area under the curve (AUC) of 0.98 and 0.97 for the stage 1 testing and stage 2 datasets, respectively, and 0.99 in the independent Stanford ADRC cohort (Fig. 3B). Positive predictive value (PPV) and negative predictive value (NPV) were >0.86 in all cases (Supplementary Table 5). The baseline model, which used only age and sex to predict AT status, performed considerably worse in the stage 1 testing, stage 2, and Stanford validation cohorts, with AUCs of 0.72, 0.59, and 0.57, respectively (Fig. 3B).
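Two of the steps above, correlation pruning and AUC evaluation, are simple enough to illustrate directly. The sketch below is a minimal, dependency-free illustration and not the authors' pipeline: the greedy keep-first pruning order, the function names, and the toy data are assumptions, and the Lasso fit itself is omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def prune_correlated(features, threshold=0.8):
    """Greedily drop any feature correlated above threshold with a kept one.

    features maps a name to its abundance values; earlier entries win.
    """
    kept = []
    for name, values in features.items():
        if all(pearson(values, features[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical abundance values for three candidate proteins.
features = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8.5], "C": [4, 1, 3, 2]}
print(prune_correlated(features))
print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

An AUC of 0.5 corresponds to chance-level ranking, which is the scale on which the age-and-sex baseline (0.57 to 0.72) and the 11-protein model (0.97 to 0.99) are compared.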
We also tested whether the same model can predict clinical diagnosis (controls = 724, AD = 882), and obtained an AUC of 0.89 for stages 1+2 and 0.97 for the Stanford ADRC (Fig. 3C). These high AUCs suggest that our prediction model is robust in separating individuals with clinical AD from controls, as well as in predicting AT biomarker status.
To further assess the specificity of this prediction model for AD, we also applied it (same proteins, weights, and cut-off as identified in stage 1 training) to other neurodegenerative disorders, including dementia with Lewy bodies (DLB; n=25), frontotemporal dementia (FTD; n=42), and Parkinson's disease (PD; n=507), as well as other non-AD individuals (n=335) and healthy controls (n=1,157). The model did not have strong predictive power for these non-AD dementias and PD, with AUCs ranging from a maximum of 0.70 for DLB to a minimum of 0.44 for PD (Fig. 3D). Overall, these results suggest that we have identified a unique signature of 11 proteins with consistently high power for predicting AD clinical or biomarker status. This signature is specific to AD, as it showed very low power for other disorders such as FTD, DLB, and PD.

Assessing progression to dementia and rate of memory decline
Next, we asked whether the identified 11-protein CSF signature can reliably distinguish between slow and fast progressors. For this analysis, we focused on individuals with an AD diagnosis at lumbar puncture, and the rate of memory decline was modeled as the change in Clinical Dementia Rating sum of boxes (CDR-SB) per year. We observed a significant separation between the regression slopes for individuals predicted as proteomic signature-positive and -negative (Fig. 3E; red and green slopes, respectively). Individuals positive for the proteomic signature showed a faster rate of progression (β = 0.35, p = 2.1×10^-4). No difference between the slopes was observed for A-T- vs. A+T+ individuals (Fig. 3E; blue and orange slopes).
We also performed a time-to-event analysis to assess whether our proteomic signature can determine if individuals who are cognitively normal at lumbar puncture are more likely to develop AD. Individuals positive for the 11-protein panel had a significantly higher probability of developing AD (p = 2.2×10^-58) than individuals negative for the signature (Fig. 3F). In particular, almost 100% of the individuals positive for the 11-protein panel developed AD within 10 years of the first clinical assessment, whereas individuals negative for the panel had a 35% probability of developing AD in the same time span. In summary, these results indicate that the identified AD CSF proteomic signature is a better predictor of dementia progression than AT status. Furthermore, individuals predicted to be positive for this signature have a significantly lower probability of remaining free of AD than their signature-negative counterparts.
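The "probability of not developing AD" over follow-up is a survival-type quantity. The exact time-to-event model used in the study is not described in this section, so the sketch below assumes a plain Kaplan-Meier estimator as a stand-in; the function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the probability of remaining event-free.

    times  : follow-up time per individual (e.g. years since lumbar puncture)
    events : 1 if the individual converted at that time, 0 if censored
    Returns a list of (time, survival) pairs at each time with a conversion.
    """
    curve = []
    survival = 1.0
    at_risk = len(times)
    for t in sorted(set(times)):
        converted = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)
        if converted:
            survival *= 1 - converted / at_risk
            curve.append((t, survival))
        # Converters and censored individuals both leave the risk set.
        at_risk -= leaving
    return curve


# Hypothetical follow-up: five individuals, two of them censored.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

One minus the final survival value gives the estimated cumulative conversion rate, the quantity reported above as almost 100% versus 35% at 10 years.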

The CSF proteome exhibits distinct protein expression patterns throughout the AD continuum
Following the AT classification system, we categorized individuals into three groups that cover the entire AD continuum: biomarker-negative individuals (A-T-), individuals in early AD stages with amyloid positivity but tau negativity (A+T-), and fully biomarker-positive individuals (A+T+). The goal of this analysis is to determine how protein levels change across the AD continuum (pseudo-trajectories), whether those changes follow specific patterns, and which pathways are associated with them.
Based on the differences in the estimates and their significance across three independent differential abundance analyses (A-T- vs. A+T-, A+T- vs. A+T+, and A-T- vs. A+T+), we identified four distinct groups of proteins (Fig. 4A and Supplementary Table 6). Specifically, group one (G1) included 471 proteins that showed a consistent "linear increase" in abundance from healthy controls (A-T-) to the asymptomatic (A+T-) to the AD (A+T+) stage. The second group (G2) included 482 proteins that followed an "up-down" trend, i.e., they showed an increase in abundance from biomarker negative to early stages and then a decrease to full biomarker positive. Group three (G3) included 184 protein analytes that showed a consistent "linear decrease" from biomarker negative to positive. Finally, group four (G4) showed the exact opposite behavior of G2, a "down-up" trajectory, in which an initial decrease was followed by an increase.
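The four shapes can be read as combinations of the sign of the early change (A-T- to A+T-) and the late change (A+T- to A+T+). The sketch below is a deliberately simplified illustration of that idea: it uses only the signs of two estimates and ignores the significance criteria the study also applied, and the function name and toy estimates are hypothetical.

```python
def trajectory_group(early, late):
    """Assign a pseudo-trajectory group from two effect estimates.

    early : estimate for A-T- vs. A+T- (positive means higher in A+T-)
    late  : estimate for A+T- vs. A+T+ (positive means higher in A+T+)
    Returns None when either estimate is exactly zero.
    """
    if early > 0 and late > 0:
        return "G1"  # linear increase
    if early > 0 and late < 0:
        return "G2"  # up-down
    if early < 0 and late < 0:
        return "G3"  # linear decrease
    if early < 0 and late > 0:
        return "G4"  # down-up
    return None


# Hypothetical (early, late) estimates for four placeholder proteins.
estimates = {
    "P1": (0.20, 0.35),
    "P2": (0.15, -0.10),
    "P3": (-0.05, -0.12),
    "P4": (-0.08, 0.22),
}
groups = {name: trajectory_group(e, l) for name, (e, l) in estimates.items()}
print(groups)
```

A fuller version would presumably also require at least one of the contrasts to be significant before assigning a group, in line with the description above.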
G1 (linear increase) includes key AD-associated proteins such as SPARC-related modular calcium-binding protein 1 (SMOC1 [28]; Extended Data Fig. 1), Neurofilament Light Chain (NEFL) [29], Glial Fibrillary Acidic Protein (GFAP) [30], Granulin Precursor (GRN) [31], Protein Phosphatase 3 Regulatory Subunit B, Alpha (PPP3R1) [32], and Alpha-Synuclein (SNCA) [33]. Besides these established AD biomarkers, this group also included NCK Adaptor Protein 2 (NCK2) and SHANK Associated RH Domain Interactor (SHARPIN), which are located at two known AD risk loci [34]. Recent studies have revealed that SMOC1 protein in the brain colocalizes with Aβ plaques [35] and that its CSF levels increase almost 30 years before AD symptom onset [36]. G2 (up-down) also includes proteins located at multiple known AD risk loci, such as SPI1 and Protein Tyrosine Kinase 2 Beta (PTK2B), as well as other proteins implicated in AD or neurodegeneration, such as Brain-Derived Neurotrophic Factor (BDNF) [37,38], Cathepsin D (CTSD) [39,40], and Nuclear Factor Kappa B Subunit 1 (NFKB1) [41,42]. Key proteins in G3 (linear decrease) include Carboxylesterase 1 (CES1) [43], Interleukin 6 (IL6) [44,45], and Forkhead Box O1 (FOXO1) [46,47], which have been implicated in various metabolic, age-related, and immune system-related mechanisms that underlie AD pathogenesis. Finally, and consistent with a previous study [48], we found Triggering Receptor Expressed On Myeloid Cells 2 (TREM2) in G4 (down-up): its levels decreased from controls to the asymptomatic stage but were significantly elevated in AD individuals. Besides TREM2, G4 contains various other proteins that have been implicated in AD, including Apolipoprotein E (APOE) [49], Neurogranin (NRGN) [50,51], ADAM Metallopeptidase Domain 17 (ADAM17) [52,53], and Nectin Cell Adhesion Molecule 2 (NECTIN2) [54]. Overall, these results identified four groups of proteins based on their estimated trajectories along the AD continuum, with each group including proteins known to be implicated in AD or neurodegeneration.

Network and pathway analyses of the CSF proteome reveal novel proteins related to AD pathophysiology
To identify the specific biological processes represented by each of these groups with unique trajectories, we conducted a functional pathway enrichment analysis using a set of selected topologically important proteins (Fig. 4F-I). To gain a systems-level understanding of the proteins in specific pathways, we used the STRING database [55] and extracted protein-protein interaction (PPI) information for the constituent proteins of the top 10 pathways. G1 captures neuronal death, apoptosis, and defects in phosphorylation/dephosphorylation. Specifically, proteins in G1 were enriched in nervous system-related pathways (Fig. 4F and Supplementary Table 7), including pathways of neurodegeneration - multiple diseases (FDR = 1.6×10^-5), glutamatergic synapse (FDR = 1.8×10^-4), dopaminergic synapse (FDR = 3.40×10^-4), and Parkinson's disease (FDR = 9.2×10^-5), among others. The dopaminergic synapse pathway includes known kinases (GSK3A) and phosphatases (PPP3CA and PPP2R5D). GSK3A and calcineurin (PPP3CA) are known to be involved in the regulation of tau phosphorylation [56], and PPP2R5D is known to cause an autosomal dominant neurodevelopmental disorder, Jordan's syndrome [57], although this is the first time this protein has been implicated in AD. The glutamatergic pathway includes proteins known to be part of causal AD pathways, such as another calcineurin subunit (PPP3R1), reported to be associated with phospho-tau levels and rate of memory decline [56], and HOMER1 [58]. This pathway also includes DLG4 and GLUL, both neuron-specific proteins involved in signal transduction. G1 also includes several proteins implicated in Parkinson's disease, such as PRKN, SNCA, and PARK7 [59,60]. The presence of these proteins could explain why around 30% of AD cases have Lewy body pathology, which is normally found in PD. The G1 network also contained NCK2 and SHARPIN, located at two previously known AD risk loci [34] and associated with the ErbB signaling pathway (FDR = 4.65×10^-5) and necroptosis (FDR = 7.52×10^-5), respectively. The necroptosis pathway also includes other proteins, such as SHARPIN, that we recently found to be genetically dysregulated in AD cases and to be part of the causal pathways, based on pQTL mapping coupled with colocalization and Mendelian randomization analyses. Together, these results suggest that some of these proteins may not be mere biomarkers but also part of the causal pathways of AD. On the other hand, known biomarkers in this group include both NEFL and NEFH [61,62]. We also found multiple 14-3-3 proteins (e.g., YWHAB, YWHAG, and YWHAH) in this group, extending our previous results [63]; these are predicted to be neuron-specific and are part of the cell division pathway (FDR = 5.29×10^-8). Multiple recent studies suggest that mosaic mutations resulting from mitotic defects [64] could also be involved in AD pathogenesis.
In contrast to G1 (linear increase), which seems to capture early neuronal death, G2 (up-down) captures glia-specific immune response and endolysosomal pathways, including platelet activation (FDR = 0.006), chemokine signaling (FDR = 0.008), and acute myeloid leukemia (FDR = 0.001; Fig. 4G and Supplementary Table 8), likely as a response to early neuronal death. SPI1, a microglial marker gene located at a well-known AD risk locus [34], was part of the transcriptional misregulation in cancer pathway, most likely regulating the microglial inflammatory response in AD [65]. We observed consistently low abundance of SPI1 in the CSF of A+T+ individuals compared to A-T- individuals in both stage 1 (estimate = -0.001, FDR = 2.4×10^-7) and stage 2 (estimate = -0.01, FDR = 1.5×10^-4). In line with our findings, decreased levels of SPI1 in both primary human microglia and the BV-2 mouse microglia cell line have been associated with reduced phagocytic capacity [65-67]. Other proteins in this pathway that interact with SPI1 include FLT3, which is important for normal development and the immune system and is a drug target for acute myeloid leukemia (AML) [68]. PML, another protein identified in our analyses and part of this group, interacts with SPI1 and is a tumor suppressor associated with acute promyelocytic leukemia [69]. CEBPA, another protein involved in leukemia [70], is also part of this network [72][73]. In summary, G2 captures many novel proteins that are part of the inflammatory and immune response pathways, which may become dysregulated due to early neuronal death and apoptosis.
G3, which displayed a linear decrease throughout the AD continuum, seems to capture proteins and pathways related to brain plasticity or mechanisms attempting to compensate for AD-related pathology, including biosynthesis-related biological processes (Fig. 4H and Supplementary Table 9) such as cholesterol biosynthetic process (P = 4.4×10^-5), sterol biosynthetic process (P = 7.8×10^-5), and stem cell proliferation (P = 2.1×10^-4; Fig. 4D). Proteins within this group include AXIN2 and CTNNB1 [74]. AXIN2 is a suppressor of Wnt/β-catenin signaling known to affect mitochondrial biogenesis, which is linked to several neurodegenerative disorders, including AD [75]. Consistent with our findings, a significant reduction (~70%, P < 0.001) in soluble β-catenin (CTNNB1) levels has already been shown in AD brains compared to controls [76], and accumulation of this protein is a marker of ubiquitination and rapid proteasomal degradation [77]. SIRT6 is part of the same pathway as CTNNB1; several studies have shown that higher SIRT6 levels are associated with longer lifespan [78,79], which is in line with our findings, as we found these proteins decreased in AD cases.
Lastly, proteins within G4 (down-up) capture microglial activity different from that of G2 (up-down; Fig. 4E), as G4 has the opposite pseudo-trajectory pattern, and also capture cell-to-cell crosstalk. Proteins in this group were enriched in MAPK signaling (FDR = 3.7×10^-7), Ras signaling (FDR = 5.4×10^-6), Rap1 signaling (FDR = 1.1×10^-4), and different cancer-related pathways, e.g., pathways in cancer (FDR = 2.4×10^-4) and prostate cancer (FDR = 2.8×10^-4; Fig. 4I and Supplementary Table 10). Highlights of this network include CSF1 and CSF1R, which are involved in several signaling pathways [81][82]. Other microglia- and inflammation-related proteins include FAS, IGF1, and IGF1R, which have been implicated in the pathogenesis of AD and other amyloidosis disorders [83]. Other key proteins in this group that were not part of the top 10 pathways included TREM2, APOE, PLD3, and NRGN, which have already been implicated in AD and are known to be involved in microglial or lysosomal activity [49,50,84,85]. At the same time, multiple proteins in this group are encoded by marker genes for neurons (NRG1, a co-receptor of RELN, a gene recently identified in AD resilience [86]; NTRK2; L1CAM; and EPHA7) and astrocytes (CTNNB1, FGFR3, and PDGFC), with most of them enriched in signaling-related processes, which suggests microglia-neuron communication.
In summary, by first grouping the proteins based on their trajectories and then performing pathway analyses, we were able to identify specific mechanisms affecting AD pathogenesis at different stages of the disease that general pathway analyses would have missed (Tables ST11-ST15, extended results).

Discussion
Cerebrospinal fluid (CSF) serves as a protective barrier for the central nervous system (CNS), and analyzing the CSF proteome can contribute to the diagnosis of various CNS-related diseases, but our understanding of robust AD-specific CSF proteome alterations is currently limited [87,88]. While numerous CSF proteomics investigations have focused on AD [14][15][16][17][18][19][22][23], none have examined a cohort of this magnitude (7,029 proteins in 2,286 individuals), which has hindered their ability to identify consistent proteomic changes and construct a reliable predictive model. Although previous studies have identified several novel protein markers for AD, most of which were also replicated in our analyses, their major limitations were limited coverage of the proteome and relatively small sample sizes. Moreover, relatively few studies have covered the entire AD continuum [14][15][16][17][18], as many of them omitted the asymptomatic (A+T-) or mild cognitive impairment (MCI) stage, which is crucial for identifying early biomarkers for AD. In-depth CSF proteomic profiling of AD patients and controls has the potential to uncover disease-specific proteomic alterations, provide insights into the underlying biological processes, and translate these findings into practical disease prediction models for better and earlier diagnosis. In this study, we generated and analyzed one of the largest AD CSF proteomic datasets, from four independent cohorts comprising 2,286 individuals and measuring 7,029 protein analytes.
Proteins displaying significant changes in the AD CSF proteome can be leveraged to create prognostic biomarkers that identify individuals at high risk for disease. An existing study [16] used CSF proteomic data from 425 individuals to propose an 8-protein diagnostic panel with an AUC of 0.96 for distinguishing AD from controls in discovery and 0.94 in replication (n=62); however, this model also showed a high AUC (0.80) for non-AD dementia. Similarly, another study using CSF proteomics data measured with the 1.3K SomaScan [19] introduced a 12-protein panel that distinguished sporadic AD from healthy controls with AUCs of 0.88 and 1.00 in the discovery (n=717) and replication (n=110) datasets, respectively. Although both approaches displayed reasonable model performance, they shared two important limitations. First, they had a considerably smaller replication sample size (almost 6 times smaller than discovery), resulting in less reliable AUC estimation [89]. Second, they lacked a systematic assessment of sensitivity and specificity by testing the identified panels on other dementias (e.g., FTD, DLB, and PD). Our study addresses these limitations by using well-balanced sample sizes for the discovery (stage 1) and replication (stage 2) cohorts and by using a completely independent validation cohort (Stanford ADRC), profiled on a different proteomic quantification platform, for an unbiased assessment of biomarker performance across cohorts. Among the large number of associated proteins, we identified and externally validated an 11-protein CSF AD proteomic signature capable of distinguishing patients with AD from cognitively healthy controls (AUCs of 0.98-0.99; Fig. 3A and B), as well as from asymptomatic individuals (AUCs of 0.88-0.96; Extended Data Fig. 3). The identified biomarker is specific to AD, as its predictive power is considerably lower (AUCs of 0.44-0.70) when tested on non-AD dementia datasets (e.g., FTD, DLB, and PD; Fig. 3D), further strengthening our hypothesis that the identified proteomic biomarker panel is AD-specific and a promising candidate for the development of clinical assays.
To further assess the diagnostic practicality of the developed AD biomarker panel, we performed a rate-of-disease-progression analysis and observed a significant positive association (β = 0.35, p = 2.1×10^-4) between proteomic signature positivity (predicted A+T+) and faster cognitive decline (Fig. 3E and Extended Data Fig. 4). We did not observe any significant effect of covariates such as age and sex on this association (Extended Data Fig. 5). In contrast, when using the actual AT status, the regression model did not capture any difference (β = 0.10, p = 0.36) between slow and fast progressors (Fig. 3E). These results suggest that the identified CSF proteomic biomarker can reliably distinguish between slow and fast progressors, underscoring its potential in clinical diagnostic settings. Although previous studies have applied this concept to explain the genetic architecture of AD and its differential effect on the sexes [90,91], where women displayed two-fold faster cognitive decline than men, this avenue had not yet been explored for assessing the predictive power of a disease-specific biomarker panel. To complement these findings, we employed a time-to-event regression model [92] and evaluated differences in the probability of not developing AD between proteomic signature-positive and -negative individuals. Individuals positive for the 11-protein signature exhibited a significantly lower probability of not developing AD (p = 2.2×10^-58) than individuals negative for the signature (Fig. 3F). In particular, individuals positive for the identified proteomic panel showed a disease conversion (incidence) rate of almost 100% in the 10-year follow-up from the first clinical assessment. In contrast, individuals negative for the panel displayed a significantly lower (~35%) disease conversion rate in the same time span. In summary, these results show substantial differences in cognitive decline rate and survival time between proteomic signature-positive and -negative individuals, highlighting the potential of the identified AD proteomic panel for early disease detection.
The significantly altered CSF proteins span diverse mechanisms linked to AD pathogenesis, including several neuronal and immune system-related functions as well as different neurological disorders, offering a new in vivo perspective on the complex nature of the disease (Fig. 4). Instead of performing pathway analyses on all associated proteins, or on protein correlations using WGCNA-like approaches, we applied a novel approach that groups the proteins by their pseudo-trajectories along the AD continuum. This approach allowed us to disentangle novel biological pathways that could otherwise be eclipsed in regular pathway analysis by the large number of proteins associated with AD. The first group (G1) of proteins was significantly enriched in pathways of neurodegeneration (FDR = 1.6×10⁻⁵), tau phosphorylation, and apoptosis, likely capturing neuronal death (Supplementary Tables 7-10 and Extended Data Figs. 6-9). The pathways of neurodegeneration include proteins related to neurofilaments (NEFH and NEFL), among others. NEFL is a putative biomarker of neurodegeneration 93 and its protein level in plasma has been used for assessing cognitive decline and mild cognitive impairment in AD. 29 NEFH is primarily associated with neurons, and elevated CSF levels of this protein have been detected across multiple neurodegenerative disorders 94,95 as well as AD. 96 The second group captures a unique set of microglia and immune-related proteins (SPI1 and RUNX3) that are involved in regulating the neuroinflammatory response and display high transcriptomic expression in late-onset AD (LOAD); 97 we also observed an elevation in their CSF levels in AD. In addition, SPI1 has already been characterized as a known AD risk locus (odds ratio = 1.06, P = 5.3×10⁻¹⁴). 34 This may be helpful for fully understanding how changes in brain microglia contribute to the dysregulation of the immune response in AD. The Rap1 signaling pathway regulates several cellular processes, including synaptic efficacy, cytosolic calcium influx, and neuronal repolarization. 98 Dysregulation of these processes is among the earliest pathological events in both familial and sporadic AD, 99,100 implying that interventions directed at this pathway could be notably effective when AD pathology becomes evident. The third group of proteins is enriched in proteins related to healthy aging/longevity and brain plasticity, likely capturing brain processes that attempt to compensate for neuronal death and the overall ongoing pathology of the disease. Some of the proteins in this group could be targeted to delay or stop AD progression, although additional analyses will be needed. Finally, the last group also includes many known microglia proteins (CSF1, CSF1R, TREM2), but it showed pseudo-trajectories opposite to those of the microglia proteins in G2, suggesting that different microglia subpopulations and/or pathways play different roles in AD pathogenesis.
While this study analyzed 7,029 protein analytes across 2,286 samples, it is not without limitations. First, 14% of the individuals clinically diagnosed as AD were biomarker-negative (A-T-) and 11% of those diagnosed as controls were biomarker-positive (A+T+), implying a potential influence of misdiagnosis. Second, we employed multiple external datasets to validate our findings. Some of these datasets used different proteomic profiling platforms (e.g., Stanford ADRC used the 5K Somascan panel), so certain proteins identified during the discovery phase could not be assessed in the validation cohort. Although genomics and transcriptomics have contributed significantly to the development of clinical diagnostic assays, 4,101 proteomics approaches have been relatively limited in their coverage of target analytes, primarily due to the extraordinary complexity and broad dynamic range of protein concentrations in CSF and plasma. 20 Lastly, since our study exclusively involves non-Hispanic white individuals, we cannot extend the assessment of the identified AD CSF proteomic biomarker panel to other racial groups, as demonstrated previously by Modeste et al. 22
In summary, we analyzed a large, well-characterized set of AD CSF proteomics cohorts and identified novel proteins and pathways dysregulated in AD. Our study shows the potential of these proteomic alterations for developing a robust, AD-specific biomarker panel with promising diagnostic applications in clinical assays. While further validation of this biomarker panel is warranted across different racial groups, we observed consistent and replicable results when it was tested in a completely independent cohort profiled on a different platform, underscoring the efficiency of the employed workflow for biomarker development. Overall, our findings demonstrate the potential of proteomic studies for advancing our understanding of AD biology and pathophysiology.

Study design
The aim of this study was to investigate AD CSF proteome alterations in order to identify AD-specific proteomic signatures and to examine the interactions between the identified proteins to better understand the underlying AD biology. CSF samples used in this study were obtained from the Charles F. and Joanne Knight Alzheimer Disease Research Center (Knight ADRC, n=836), 21 Alzheimer's Disease Neuroimaging Initiative (ADNI, n=700), Fundació ACE Alzheimer Center Barcelona (FACE, n=618), and Barcelona-1 (n=132) cohorts (Table 1). Altogether, this constitutes one of the largest AD proteomic datasets, including 7,029 protein analytes measured in the CSF of a total of 2,286 individuals, which were analyzed in a three-stage study. In stage 1, discovery was performed in 1,170 samples from the Knight ADRC and FACE cohorts using the ATN framework (A-T- = 680 and A+T+ = 490). In stage 2, the proteins that passed multiple-testing correction (FDR < 0.05) in stage 1 were further tested in 593 individuals (A-T- = 235 and A+T+ = 358) from the ADNI and Barcelona-1 replication cohorts. In stage 3, we performed a meta-analysis encompassing both stages (1 and 2), and proteins demonstrating consistent effect sizes and surviving Bonferroni correction (Bonf < 0.05) were identified as significantly altered proteins in AD CSF. The identified proteomic alterations were further validated in a completely independent CSF proteomic study (Stanford ADRC, n = 132) profiled with a different quantification platform. The identified CSF proteomic changes were used to develop robust AD-specific prediction models and to categorize proteins into four groups based on their varying trajectories across the AD continuum (A-T-, A+T-, A+T+). Besides assessing the performance of the AD prediction model in three independent cohorts (stage 1, stage 2, Stanford ADRC), its specificity and sensitivity for AD were evaluated using datasets from other neurodegenerative disorders (DLB, FTD, PD, and non-AD). Furthermore, we investigated the association of the identified proteomic signature with progression to dementia and the rate of memory decline. Finally, pathway and network enrichment analyses were performed separately for each protein group to gain mechanistic insights into AD pathophysiology (Fig. 1).

ATN Classification
Amyloid-β (Aβ42) and hyperphosphorylated Tau 181 (pTau) biomarker levels obtained from CSF samples were used to categorize participants into cases and controls using the AT(N) classification framework. 1 This framework was applied separately for each cohort and independently for the Aβ42 and pTau biomarkers, as described previously. 102,103 Briefly, we used Gaussian mixture models to dichotomize quantitative Aβ42 and pTau measures into high and low levels and thereby assign biomarker-positive or -negative status. Individuals with low CSF Aβ42 and high pTau levels were classified as amyloid/tau positive (A+T+), indicating high plaque and tangle burden in the brain. Conversely, individuals with high Aβ42 and low pTau levels were defined as controls (A-T-), indicating low plaque and tangle burden in the brain. Individuals with low CSF Aβ42 and low pTau levels were classified as amyloid positive and tau negative (A+T-), indicating asymptomatic stages of AD characterized by high plaque and low tangle burden in the brain.
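The dichotomization described above can be illustrated with a minimal sketch. The Python code below fits a two-component one-dimensional Gaussian mixture by expectation-maximization and labels each value as high or low; it then combines the two markers into an AT label. It is an illustration under simplifying assumptions (equal starting weights, a fixed number of iterations, at least two distinct values), not the pipeline used in the study, and the function names `dichotomize` and `at_status` are hypothetical.

```python
import math


def _normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def dichotomize(values, iterations=200):
    """Fit a two-component 1D Gaussian mixture by EM.

    Returns a list of booleans: True where a value is more likely to belong
    to the component with the higher mean (the "high" level).
    """
    xs = sorted(values)
    half = len(xs) // 2
    # Assumed initialisation: component means from the lower and upper halves.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    grand_mean = sum(xs) / len(xs)
    spread = math.sqrt(sum((x - grand_mean) ** 2 for x in xs) / len(xs)) or 1.0
    sds = [spread, spread]
    weights = [0.5, 0.5]
    resp = [0.5] * len(values)

    for _ in range(iterations):
        # E-step: posterior probability of component 1 for every value.
        resp = []
        for x in values:
            d0 = weights[0] * _normal_pdf(x, means[0], sds[0])
            d1 = weights[1] * _normal_pdf(x, means[1], sds[1])
            resp.append(d1 / (d0 + d1) if d0 + d1 > 0 else 0.5)
        # M-step: update weight, mean and standard deviation of each component.
        for k in (0, 1):
            r = resp if k == 1 else [1 - p for p in resp]
            total = max(sum(r), 1e-12)
            means[k] = sum(ri * x for ri, x in zip(r, values)) / total
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, values)) / total
            sds[k] = max(math.sqrt(var), 1e-6)
            weights[k] = total / len(values)

    high_is_component_1 = means[1] >= means[0]
    return [(p > 0.5) == high_is_component_1 for p in resp]


def at_status(abeta42_high, ptau_high):
    """Combine the two dichotomized markers: low Abeta42 means A+, high pTau means T+."""
    return ("A-" if abeta42_high else "A+") + ("T+" if ptau_high else "T-")
```

In this sketch each marker would be dichotomized separately within each cohort, mirroring the per-cohort, per-biomarker application described in the text.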

Proteomics data collection, processing and quality control (QC)
CSF samples in each cohort were collected through a lumbar puncture in the morning following an overnight fast. All samples underwent identical preparation and processing protocols and were stored at -80 °C. To mitigate batch effects, the samples were sent together to SomaLogic and randomly allocated across different plates. Protein abundance levels were quantified using the SomaLogic SOMAscan platform, a multiplexed assay based on single-stranded DNA aptamers. The resulting data contain quantitative levels of 7,293 aptamers measured in relative fluorescence units (RFU). Initial data normalization was conducted by SomaLogic, using hybridization controls to account for intra-plate variability and median signals to account for inter-plate variability. 104 SomaLogic also performed an additional normalization step in which the data were normalized against an external reference to control for biological variation. 105 Aptamer- and individual-level QC were subsequently carried out to detect and exclude outlier analytes and samples, using an in-house pipeline. 103,105 Briefly, we removed all aptamers for which the maximum absolute difference between the calibration and median scale factors, calculated individually for each plate, exceeded 0.5. Additionally, we removed aptamers with a median coefficient of variation (CV) exceeding 0.15 or those that deviated beyond 1.5-fold of the interquartile range (IQR) on either end in over 85% of samples. The IQR was calculated based on log10-transformed protein levels. At the end of aptamer-level QC, we also excluded analytes targeting non-human proteins. In the individual-level QC, a sample was removed if its log10-transformed RFU levels deviated beyond 1.5-fold of the IQR in over 85% of the aptamers. In total, 2,286 samples and 7,029 aptamers targeting 6,163 unique proteins passed the final QC and were used for subsequent analyses.
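As a rough illustration of the individual-level rule above, the following Python sketch flags samples whose log10 RFU values fall outside the 1.5×IQR fences for more than 85% of aptamers. The function name `flag_outlier_samples` and the input layout (one list of RFU values per sample) are assumptions made for the example, and the quartile definition used by Python's `statistics.quantiles` may differ from the one in the in-house pipeline.

```python
import math
import statistics


def flag_outlier_samples(rfu, iqr_factor=1.5, max_fraction=0.85):
    """Return indices of samples whose log10 RFU lies outside the IQR fences
    for more than `max_fraction` of aptamers.

    `rfu` is a list of samples, each a list of positive RFU values
    (one value per aptamer, same aptamer order in every sample).
    """
    log_rfu = [[math.log10(v) for v in sample] for sample in rfu]
    n_aptamers = len(log_rfu[0])
    outside = [0] * len(log_rfu)
    for j in range(n_aptamers):
        column = [sample[j] for sample in log_rfu]
        # Quartiles of this aptamer across all samples (log10 scale).
        q1, _, q3 = statistics.quantiles(column, n=4)
        lower = q1 - iqr_factor * (q3 - q1)
        upper = q3 + iqr_factor * (q3 - q1)
        for i, value in enumerate(column):
            if value < lower or value > upper:
                outside[i] += 1
    return [i for i, count in enumerate(outside) if count / n_aptamers > max_fraction]
```

The aptamer-level rule is the transpose of this check: count, for each aptamer, the fraction of samples outside the fences.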

Differential abundance analysis
Differential abundance of protein analytes between AT groups (A-T- vs. A+T+, A-T- vs. A+T-, and A+T- vs. A+T+) was assessed using a linear regression model of the form log10(protein level) ~ Status + age + sex + plate + SV1 + SV2, in which age at CSF draw, sex, plate ID, and the first two surrogate variables (SV) were used as covariates. Protein analytes that passed FDR correction (FDR < 0.05) in stage 1 were further tested for differential abundance in stage 2 using the same linear regression model. Next, the analytes that also passed FDR correction (FDR < 0.05) in stage 2 and showed a consistent direction of effect (i.e., up- or down-regulated in both the discovery and replication stages) were considered for meta-analysis. We performed the meta-analysis with Stouffer's Z method 109 using the "combinePValues" function from the scran R package version 1.28.1. 108 Stouffer's Z method was chosen because it does not depend on the underlying test statistics, tends to prioritize symmetric rejection, and is less affected by a single low p-value, thereby requiring more consistently low p-values to yield a low combined p-value. 110 A more stringent Bonferroni correction was applied to the meta-analysis p-values using the "p.adjust" function in R to identify a final set of significantly altered (Bonf < 0.05) protein analytes.
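To make the meta-analysis step concrete, here is a minimal Python sketch of an unweighted Stouffer combination followed by a Bonferroni adjustment, together with the direction-consistency check. It is not the scran implementation; the function names are hypothetical, and the sketch assumes one-sided p-values strictly between 0 and 1.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def stouffer_combine(p_values):
    """Combine one-sided p-values with the unweighted Stouffer Z method."""
    # -inv_cdf(p) equals the upper-tail z-score and avoids computing 1 - p,
    # which would lose precision for very small p-values.
    z_scores = [-_STD_NORMAL.inv_cdf(p) for p in p_values]
    combined_z = sum(z_scores) / math.sqrt(len(z_scores))
    return _STD_NORMAL.cdf(-combined_z)


def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests and cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def consistent_direction(estimate_stage1, estimate_stage2):
    """True when the discovery and replication estimates share the same sign."""
    return estimate_stage1 * estimate_stage2 > 0
```

Because the z-scores are averaged, one very small p-value paired with a large one yields only a moderate combined p-value, which is the behaviour the text cites as the reason for choosing this method.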

Prediction models
Protein analytes that showed significant alterations in A+T+ individuals compared with A-T- individuals across both the discovery (stage 1) and replication (stage 2) cohorts as well as in the meta-analysis (stage 3) were considered for building an AD prediction model. As the number of differentially abundant analytes was relatively high (n=2,173), we used a least absolute shrinkage and selection operator (Lasso) regression model 25 with five-fold cross-validation to identify a minimal set of the most informative proteins for the AD prediction model. We used the "train" function in the caret R package version 6.0-94 111 to fit the Lasso regression model in the stage 1 training dataset (n=819). For highly correlated analytes (Pearson correlation > 0.8), one representative analyte was kept from each pair. Starting from the initial set of 2,173 differentially abundant analytes, we identified a subset of 38 analytes that comprised our initial CSF AD proteomic signature. Because we also aimed to test the performance of this prediction model in an external dataset profiled on a different platform (Stanford ADRC), we retained the set of proteins present in both datasets (n=25) for subsequent analysis. After examining the association of this proteomic panel with AT status in the stage 1 training data, we identified a group of 11 proteins that displayed significant associations (P < 0.05), constituting our AD-specific CSF proteomic signature.
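The correlation filter mentioned above can be sketched as follows. The text does not state which analyte of a correlated pair was retained, so this Python example assumes a simple greedy rule (keep the first analyte encountered and drop later ones whose Pearson correlation with any kept analyte exceeds 0.8). The names `pearson` and `prune_correlated` are hypothetical, and every analyte is assumed to have non-zero variance.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def prune_correlated(analytes, threshold=0.8):
    """Greedy filter over a dict {analyte name: list of values}.

    An analyte is kept only if its correlation with every analyte
    already kept is at or below `threshold`.
    """
    kept = []
    for name, values in analytes.items():
        if all(pearson(values, analytes[k]) <= threshold for k in kept):
            kept.append(name)
    return kept
```

With a greedy rule the result depends on the order of the analytes, so a real analysis would need an explicit ordering criterion, for example effect size.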
To assess the predictive power of the proposed AD proteomic signature, we used a three-stage (discovery, replication, and validation) approach. The 11-protein AD prediction model was trained using 70% of the stage 1 data (discovery) and tested on the remaining 30% of the stage 1 data as well as the complete stage 2 data (replication), using the model weights (cutoffs) derived from stage 1 training. Finally, we tested the model performance in a completely independent validation dataset from the Stanford ADRC cohort, which, unlike our stage 1 and stage 2 cohorts, used the 5K Somascan panel for proteome profiling. Although this prediction model was derived using the AT framework, its performance was also tested on data in which individuals were stratified by clinical case-control diagnosis based on the clinical dementia rating (CDR©) and cognitive assessment. Furthermore, the specificity of this AD-specific prediction model was assessed in datasets from other dementias and neurodegenerative disorders, such as dementia with Lewy bodies (DLB), frontotemporal dementia (FTD), and Parkinson's disease (PD), as well as other non-AD individuals. For PD, we used the CSF proteomics dataset from the Parkinson's Progression Markers Initiative (PPMI) study, 112 which included 507 PD and 168 control individuals profiled using the Somascan 5K panel. The sensitivity (true-positive rate) and specificity (true-negative rate) of the AD prediction model were assessed by plotting receiver operating characteristic (ROC) curves using the pROC R package version 1.18.2. 113 To further evaluate the performance of these proteins, we computed area under the curve (AUC) statistics and estimated the positive predictive value (PPV) and negative predictive value (NPV) at the optimal cut-off based on Youden's J statistic 114 using the "coords" function in the pROC R package.
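The performance metrics named above can be illustrated with a short Python sketch. It computes the AUC as the proportion of correctly ranked positive/negative pairs and scans observed scores for the cut-off that maximizes Youden's J (sensitivity + specificity - 1). This is a didactic stand-in, not the pROC implementation; the function names and the convention that a score at or above the cut-off counts as positive are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_cutoff(scores, labels):
    """Return the cut-off maximizing Youden's J with the metrics at that cut-off."""
    best = None
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = sensitivity + specificity - 1
        if best is None or j > best["youden_j"]:
            best = {
                "threshold": threshold,
                "youden_j": j,
                "sensitivity": sensitivity,
                "specificity": specificity,
                # At least one score is >= threshold, so tp + fp is never zero.
                "ppv": tp / (tp + fp),
                "npv": tn / (tn + fn) if (tn + fn) > 0 else float("nan"),
            }
    return best
```

Note that PPV and NPV depend on the proportion of cases in the evaluated sample, so values obtained in a case-enriched cohort do not transfer directly to settings with a different prevalence.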

AD CSF proteome clustering
A total of 2,173 protein analytes that showed significant alterations in A+T+ compared with A-T- individuals were clustered into four distinct groups based on their estimates (direction of effect) and significance (p-value) across three stages of the AD continuum (A-T-, A+T-, A+T+). A pair-wise differential abundance analysis (DAA) was performed between all AT groups (A-T- vs. A+T-, A+T- vs. A+T+, and A-T- vs. A+T+) to track the trajectory of protein abundance from control (A-T-) to asymptomatic
Functional interaction networks were built to understand the cross-talk between key proteins in each of the four identified groups. Specifically, proteins belonging to the top 10 pathways from the KEGG and GO enrichment analyses were used to build a protein-protein interaction (PPI) network using the STRING database version 12.0. 55 To obtain an appropriate set of functional PPIs among the identified proteins, our analysis was restricted to Homo sapiens, with active interaction sources limited to "Experiments", "Databases", and "Co-expression". The only exception was group 3, where all available sources, including "Text-mining", "Neighborhood", "Gene Fusion", and "Co-occurrence", were also considered because of the limited number of proteins in that group. The resulting functional PPIs were visualized as networks using Cytoscape version 3.10.0. 122

Cell type enrichment
For cell type enrichment analysis, we used an in-house, manually curated marker list prepared using the CellMarker database and existing literature. 123,124 Because many marker genes in the CellMarker database are associated with multiple cell types, the database does not by itself provide cell-type-specific marker information; we therefore used existing literature to manually curate a list of marker genes that are expressed exclusively in one cell type. A hypergeometric test, 125 which is equivalent to a one-tailed Fisher's exact test, was used for the cell type enrichment analysis via the "phyper" function in the base R package stats.
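For readers unfamiliar with the test, the upper-tail hypergeometric probability can be written out directly. The Python sketch below computes the probability of observing at least a given overlap between a protein group and a cell-type marker list; the function and argument names are hypothetical, and the choice of background universe is left to the analyst.

```python
from math import comb


def hypergeom_enrichment_p(overlap, markers, drawn, universe):
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe: number of proteins in the background set
    markers:  number of background proteins that are markers of the cell type
    drawn:    number of proteins in the group being tested
    overlap:  number of proteins in the group that are markers
    """
    upper = min(markers, drawn)
    favourable = sum(
        comb(markers, i) * comb(universe - markers, drawn - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(universe, drawn)
```

In R, the same upper-tail quantity corresponds to `phyper(overlap - 1, markers, universe - markers, drawn, lower.tail = FALSE)`.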

Progression to dementia and time-to-event analysis
To assess whether AD progression differs significantly between individuals predicted to be proteomic signature-positive and -negative, the rate of dementia progression was analyzed using the change in CDR sum of boxes (CDR-SB) per year, as described previously. 90 As longitudinal data to assess change in CDR-SB were available only for the Knight ADRC and ADNI cohorts, we restricted this analysis to those cohorts and compared the rate of dementia progression between individuals predicted to be proteomic signature-positive and -negative according to the 11-analyte AD prediction model. This analysis was performed on longitudinal data from 333 individuals in the Knight ADRC (n=117; A-T- = 23, A+T+ = 94) and ADNI (n=216; A-T- = 81, A+T+ = 135) cohorts. A linear regression model was fit, regressing CDR-SB on time in years, with age, sex, predicted biomarker status, known AT status, and initial CDR as covariates, as previously explained. 90 The model also included interaction terms between time and predicted status as well as between age and predicted status.
We also conducted a time-to-event analysis for individuals predicted to be proteomic signature-positive and -negative using a Cox proportional hazards regression model 92 implemented in the "survfit" function of the survival R package (version 3.5.5, RRID:SCR_021137). First, we created a survival object with the "Surv" function from follow-up years and censoring status; this response variable was then regressed on predicted biomarker status to estimate survival curves for censored data using the Kaplan-Meier method. 126 We used the "ggsurvplot" function from the survminer R package (version 0.4.9, RRID:SCR_021094) to visualize the Kaplan-Meier plots for the probability of not developing AD over a 15-year period.
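The Kaplan-Meier estimate underlying these curves is simple enough to sketch by hand. The Python function below returns the estimated probability of remaining free of the event (here, not developing AD) after each observed event time; it is a didactic sketch rather than the survival package's implementation, and the name `kaplan_meier` and the 1/0 event coding are assumptions.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate for right-censored data.

    times:  follow-up time for each individual (e.g. years)
    events: 1 if the individual converted at that time, 0 if censored
    Returns a list of (time, survival probability) pairs at each event time.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Individuals still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        converted = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if converted:
            survival *= 1 - converted / at_risk
            curve.append((t, survival))
    return curve
```

Computing one curve for the predicted signature-positive group and one for the signature-negative group gives the two step functions that a Kaplan-Meier plot displays.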


log10(protein level) ~ Status + age + sex + plate + SV1 + SV2
We used the "lm" function from the base stats package in R version 4.3.0 106 to construct the linear regression model and applied it to the log10-normalized protein analyte abundance data, which follow a normal distribution. Status corresponds to the binarized AT status (e.g., A+T+ = 1 and A-T- = 0) of the individual. Surrogate variables were computed using the "num.sv" function from the R sva package version 3.48.0 107 with the random seed fixed to 2022. P values for the alteration of analytes in each comparison were corrected for false discovery rate (FDR) using the "p.adjust" function from the base stats R package. The results of the differential abundance analysis, i.e., significantly up- and down-regulated protein analytes, were visualized as volcano plots using the EnhancedVolcano R package (version 1.18; RRID:SCR_018931).
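The FDR correction applied to the regression p-values can be illustrated with a short Python sketch of the Benjamini-Hochberg step-up adjustment. Whether the study used this specific FDR method is an assumption, since the text only names the "p.adjust" function; the function name `bh_adjust` is hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Visit p-values from largest to smallest so a running minimum
    # enforces monotonicity of the adjusted values.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An analyte would then be carried forward when its adjusted value is below 0.05, matching the FDR < 0.05 threshold used in stages 1 and 2.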

Figure 3 Performance

Table 1: Demographic information of participants at the time of the CSF draw.
This table summarizes basic demographic information of the CSF proteomics study participants. For each cohort, we report sample size, percentage of females and males, mean age and its standard deviation (SD), percentage of A+T+, A+T-, and A-T- participants, and percentage of APOE4+ and APOE4- individuals. Abbreviations: Knight-ADRC, Knight Alzheimer's Disease Research Center; ADNI, Alzheimer's Disease Neuroimaging Initiative; SD, standard deviation.