Structural Variation Detection and Association Analysis of Whole-Genome-Sequence Data from 16,905 Alzheimer’s Disease Sequencing Project Subjects

Structural variations (SVs) are important contributors to the genetics of human diseases. However, their role in Alzheimer’s disease (AD) remains largely unstudied due to challenges in accurately detecting SVs. We analyzed whole-genome sequencing data from the Alzheimer’s Disease Sequencing Project (N = 16,905) and identified 400,234 (168,223 high-quality) SVs. Laboratory validation yielded a sensitivity of 82% (85% for high-quality). We found a significant burden of deletions and duplications in AD cases, particularly for singletons and homozygous events. In AD genes, we observed ultra-rare SVs associated with the disease, including protein-altering SVs in ABCA7, APP, PLCG2, and SORL1. Twenty-one SVs are in linkage disequilibrium (LD) with known AD-risk variants, exemplified by a 5 Kb deletion in complete LD with rs143080277 in NCK2. We also identified 16 SVs associated with AD and 13 SVs linked to AD-related pathological/cognitive endophenotypes. This study highlights the pivotal role of SVs in shaping our understanding of AD genetics.


Introduction
Alzheimer's disease (AD) is a neurodegenerative disease characterized by abnormal deposits of extracellular Aβ plaques and intracellular neurofibrillary tangles 1. Typically, the accumulation of these neuropathological changes is accompanied by neuronal death, leading to various symptoms such as memory loss, apathy, and difficulty swallowing and walking 2. Among individuals aged 65 and older, AD has an incidence rate of 10.7% and is the fifth-leading cause of death 2.
Genetic factors play a significant role in the etiology of AD, with heritability estimates ranging from 58% to 79% 3. However, genetic risk factors identified in previous studies explain only a limited portion of heritability in AD.
Mutations in APP, PSEN1, and PSEN2 cause an early-onset form of AD that is inherited as an autosomal dominant trait with high penetrance, but these mutations account for only about 11% of early-onset AD, which is approximately 0.6% of all AD 4. APOE genotype is the most prominent genetic risk factor for AD, and it is estimated that approximately 40-50% of individuals diagnosed with AD carry at least one copy of the APOE ε4 risk allele 5. Overall, variations in APP, PSEN1, PSEN2, and APOE explain 20-50% of the total genetic variance (heritability) of AD, with APOE ε4 accounting for most of this fraction due to its high frequency 6.
In the past decade, genome-wide association studies (GWASs) identified > 75 additional AD risk loci [7][8][9][10]. However, compared to APOE alleles, variants at those loci have a small effect size or are rare in the population, contributing little to the overall heritability. APOE alleles alone can achieve an AUC (Area Under the Receiver Operating Characteristic Curve) of 0.70 in predicting AD, whereas the best AUC is only 0.61 when all other common single nucleotide variants (SNVs) are combined 11. Even if all common SNVs, including APOE alleles, are considered, they account for only 24-33% of phenotypic variance 12,13, which is much lower than the estimated heritability of AD and thus suggests a role for other genetic mechanisms.
Structural variants (SVs) are genomic alterations larger than 50 bp that include deletions, duplications, inversions, insertions, translocations, and complex combinations of these events. SVs contribute more than SNVs to individual genetic variation in terms of total nucleotide content, and thus the difference in genomic sequences between two humans can increase from 0.1% with SNVs alone to 1.5% when SVs are considered 14. Moreover, SVs can have profound effects on diseases and other traits by disrupting gene function and regulation or modifying gene dosage through copy number changes, deleting exons, and creating new splicing acceptors or donor sites. Therefore, analyzing SVs has the potential to identify new genetic associations and account for the missing heritability in AD.
SVs have been identified in several genes implicated in AD. For instance, duplications in APP have been found to be the causal factor for autosomal dominant early-onset AD in a few families [15][16][17][18][19]. In addition, a deletion in exon 9 of PSEN1 was identified in families with a form of early-onset AD characterized by spastic paraparesis and atypical plaques 20,21. A low-copy repeat of 18 Kb in length within CR1, which creates an additional C3b/C4b-binding site, may account for some GWAS signals in the CR1 region 22,23. The 1 Mb region on 17q21.31 containing MAPT has two major haplotypes, H1 and H2, which are characterized by a ~900 Kb inversion flanked by a few duplication blocks and tagged by a 238 bp deletion between exons 9 and 10 of MAPT 24. The H1 and H2 haplotypes are associated with a range of neurodegenerative diseases including AD [24][25][26]. Additionally, copy number variants (CNVs) in AMY1, which are correlated with salivary amylase protein level and digestion of starchy food, are associated with AD. Individuals with high copy numbers (≥ 10) of AMY1 have a significantly lower risk of developing AD 27. These examples show that identification and analysis of SVs in AD genetics hold great potential for uncovering new genetic associations and providing a more comprehensive understanding of the genetic underpinnings of this complex disease.
To discover SVs possibly contributing to AD risk, we evaluated SVs detected in whole-genome sequence (WGS) data from 16,905 subjects from the Alzheimer's Disease Sequencing Project (ADSP). We detected 400,234 SVs and found rare SVs in known AD genes, including SORL1, ABCA7 and APP, as well as SVs in linkage disequilibrium (LD) with AD GWAS signals. Moreover, we found an increased burden of deletions and duplications (particularly, singleton and homozygous events) in AD and identified possible risk SVs in ADD3, ITPR2, and NTM through association analysis.
On average, each individual had 14,607 (3,875 high-quality) deletions, 764 (288 high-quality) duplications, 6,980 (2,504 high-quality) insertions, and 19 (3 high-quality) inversions. Individuals of African ancestry had more SV calls compared to individuals of other ancestries (Fig. 1A), possibly because the human reference genome is biased towards European ancestry or because of the higher level of genetic diversity in Africans [32][33][34]. Similar to SNVs, the first two principal components of common SVs distinguished samples from different ancestral backgrounds (Fig. 1B). However, the third principal component of SVs is associated with read length and sequencing platforms (Fig. S2), indicating that batch effect is an important confounding factor to consider when performing SV analysis.
Comparable to the allele frequency (AF) distribution of SNVs, most SVs are extremely rare. Among 400,234 SVs, 94,923 (24%) are singletons, and 232,295 (58%) are rare with AF < 1% (Fig. S3). When considering the 168,223 high-quality SVs, 67,595 (40%) are singletons, and 140,164 (83%) are rare with AF < 1%. Figure 1C shows that the AF distribution of deletions is more similar to the AF distribution of SNVs compared to other SV types. Analysis of the size of the SVs revealed two peaks centered around 300 bp and 6,000 bp (Fig. 1D), suggesting the possibility that many SVs are introduced by transposons, particularly Alus (~300 bp) and L1s (~6,000 bp).
Functional annotation analysis performed using AnnotSV 35 showed that rare SVs are more likely to be deleterious than common SVs (Wilcoxon Rank Sum P < 0.0001) (Fig. 2A). This finding was confirmed using annotation from VEP 36, which shows that protein-altering SVs tend to be rare (odds ratio [OR] = 4.71, Chi-Square P < 0.0001, Fig. 2B). Additionally, we observed a higher proportion of singleton SVs in coding or regulatory regions (Fig. 2C), suggesting negative selection against deleterious SVs in functionally important regions of the genome. Overall, our results highlight the importance of evaluating rare SVs when studying genetic variation in human disease.

a APOE ε4 status is based on rs429358 observed from whole genome sequencing data.
b Ancestry is inferred using GRAF-pop 37.

SV quality evaluation and laboratory validation
Evaluation of the sensitivity of the SV calling pipeline using synthetic mutations 38 (Methods) revealed a sensitivity of 99.4% for 4,000 deletions and 94.4% for 1,500 inversions (Table S2). We did not perform an evaluation for insertions since the inserted sequences and positions are ambiguous in the simulation of synthetic mutations.
Then, we evaluated our SV call set against external SV databases. Approximately 50% of the high-quality SVs were detected in the Genome Aggregation Database (gnomAD, 292,307 SV sites), but there was less overlap with SVs in the 1000 Genomes Project (1KG, 66,505 SV sites) (Fig. S4). The difference was due to fewer samples in 1KG compared to gnomAD. The SV callset before high-quality filtering had a higher recall (a higher percentage of SVs from gnomAD and 1KG) at the cost of lower precision (a lower percentage of SVs confirmed by gnomAD and 1KG) (Fig. S4, Fig. S5).
Of 95 SVs selected for experimental validation (Table S3; Methods), 78 were confirmed, resulting in a sensitivity rate of 82%. When considering only high-quality SVs, the sensitivity increased to 85%, with 61 out of 72 SVs being experimentally validated. At the individual genotype level, an accuracy of 89% was achieved for 276 called genotypes for 95 SVs undergoing PCR validation (Table S4), and this value increased to 92% for 207 called genotypes for 72 high-quality SVs (Table S4).
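The validation rates quoted above follow directly from the reported counts. The sketch below recomputes them and, as an addition not made in the text, attaches a 95% Wilson score interval to indicate how much uncertainty a sample of 95 (or 72) SVs carries. The function names are illustrative.

```python
from math import sqrt

def rate(confirmed: int, tested: int) -> float:
    """Fraction of tested SV calls confirmed by the experiment."""
    return confirmed / tested

def wilson_interval(k: int, n: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Counts reported in the text: (confirmed, tested).
counts = {"all SVs": (78, 95), "high-quality SVs": (61, 72)}

for label, (k, n) in counts.items():
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {k}/{n} = {rate(k, n):.0%} (Wilson 95% CI {lo:.0%} to {hi:.0%})")
```

The intervals are an assumption-laden illustration (they treat each validated SV as an independent trial) and are not part of the reported results.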

SVs in linkage disequilibrium with known AD risk loci
SVs are larger genomic perturbations and may have more severe functional impact compared to SNVs; therefore, SVs in LD with AD GWAS risk SNVs are more likely to account for the statistical association in the regions, especially if the SNVs are not predicted to have an impact on protein structure or gene expression. We identified 21 SVs (12 deletions, two duplications, and seven insertions) that are in LD with established AD GWAS loci 8-10 (Table 2). Three deletions, in particular, showed high LD (R² > 0.9) with GWAS signals near or in NCK2, NBEAL1, and TMEM106B. A 5.5 Kb deletion (chr2:105731359-105736864) located 8 Kb upstream of NCK2 is in near-perfect LD (R² = 0.99) with rs143080277 (chr2:105749599), which is a rare variant (AF = 0.005) in the intron of NCK2 10. A 5.2 Kb deletion (chr2:203034369-203039560) in an NBEAL1 intron and overlapping an H3K27ac peak from ENCODE 39 is in high LD (R² = 0.94) with rs139643391 (chr2:202878716) 10, which is a 3′ UTR variant of WDR12. A 323 bp (chr7:12242077-12242399) Alu deletion located on exon 8 of TMEM106B is in LD with TMEM106B intronic variants, rs5011436 (chr7:12229132, R² = 0.92) 9 and rs13237518 (chr7:12229967, R² = 0.90) 10, which are not only associated with the risk of AD but also protect carriers of C9ORF72 repeat expansion from the risk of frontotemporal dementia 40.
The MAPT H1/H2 haplotype, defined by a 900 kb inversion and tagged by numerous SNVs, has been associated with several neurodegenerative diseases, including progressive supranuclear palsy, frontotemporal disorders, Parkinson's disease, and AD 24,25,41. We identified five deletions and two duplications in moderate LD (R² = 0.35-0.67, Table 2) with an H1/H2 tagging SNV (rs199515, chr17:46779275), which is associated with AD 10. These SVs further confirm the complex genomic structure in the region and highlight the difficulty in identifying the causal variants within the H1/H2 haplotype. Table 2 also describes seven high-quality insertions, excluding those in problematic regions (Methods), that are in LD with AD GWAS signals.

SVs on AD risk/protective genes
We first focused on SVs that were reported to be associated with AD in previous studies [43][44][45][46][47]. Ten rare SVs (Table S5) were replicated in our SV callset. A 417 Kb duplication (Fig. S6) covering APP was identified in one individual with early onset of AD at age 52. Subsequently, we noticed two other carriers of the duplication who were dropped from the initial analysis due to failed quality control. One carrier was his sibling, who developed AD at age 49, and the other was his sibling's offspring, who developed AD at age 53. This finding provides compelling evidence that the duplication of APP is a rare cause of autosomal dominant early-onset AD [15][16][17]19. A 768 Mb inversion covering the entire 21q21.2 was identified in one individual with early onset of AD at age 60. The inversion was experimentally validated, and the alignments showed clear breakpoints of the inversion (Fig. S7). In addition, the 5.6 Kb deletion covering exons 2-5 of HLA-DRA, found in nine AD cases by Swaminathan et al. 43, is present in eight samples in our analysis, including five AD cases (three showed early onset of AD with age < 65) and three individuals with unclear AD status (two diagnosed with progressive supranuclear palsy and one with Braak stage 2). A few other SVs, encompassing GBE1, EPHA5, and EVC, are replicated in our dataset.
SVs on AD risk/protective genes could interfere with protein function and lead to disease. Therefore, we identified 77 high-confidence SVs (Methods), including 44 deletions, 15 duplications, and 18 insertions, on AD risk/protective genes determined by the ADSP gene verification committee (see Table 2 on https://adsp.niagads.org/gvc-top-hits-list/). Nine deletions and five duplications have an allele count ≥ 5 (AF ranging from 0.0002 to 0.4690), but none of them were significantly associated with AD (Table S6), and none of these SVs were tagged by known AD-associated SNVs.
The remaining 35 deletions and 10 duplications are ultra-rare (MAC < 5), of which 34 (25 deletions and 9 duplications) are singletons (Table 3). We performed an aggregated analysis of 45 ultra-rare CNVs (35 deletions and 10 duplications), using the SKAT-O test 48 instead of calculating individual p-values given the limited statistical power due to low allele count, and observed a significant association with AD status (P = 0.0050), highlighting the contribution of ultra-rare CNVs to the etiology of AD.
Notably, 14 of the 35 ultra-rare deletions and 8 of the 10 ultra-rare duplications are protein-altering variants. For instance, we identified in SORL1 a 192 Kb duplication spanning exons 1-5 and an 8 Kb deletion affecting exon 6 (Fig. 4). Previous studies indicated that SORL1 deficiency can lead to AD through defects in the endolysosome-autophagy network 49,50, and nearly all individuals with damaging SNVs in SORL1 developed AD 51. Eight out of nine individuals with ABCA7 exonic deletions or duplications in our data (Fig. S8) developed AD, supporting previous studies that observed loss-of-function ABCA7 variants among AD cases 52. We also found protein-altering ultra-rare deletions and duplications in APP, PLCG2, PILRA, CASP7, MS4A6A, RIN3, APOE, and PSEN1 (Table 3). In particular, 17 of 21 individuals with ultra-rare deletions in PLCG2 were AD cases (SKAT-O P = 0.029). We also identified 18 high-quality insertions located in AD genes (Table S7). However, the aggregated effect of these insertions on AD risk was not significant (SKAT-O P = 0.21).

SV burden in AD
We performed burden tests of SVs, including CNVs (deletions and duplications), insertions, and inversions separately and collectively, and found a moderate burden of CNVs in AD cases (OR = 1.05, P = 0.0321), but no significant burden of insertions or inversions was detected (Table S8). The increased CNV burden in AD cases was driven by the presence of singletons (OR = 1.12, P = 0.0002) and homozygous CNVs (OR = 1.10, P = 0.0004). This is consistent with the burden of ultra-rare CNVs in AD genes, in which 34 out of 45 ultra-rare CNVs are singletons. The result suggests that singletons and homozygous CNVs, which were not considered in previous association analyses, may be important contributors to the genetic basis of AD.

SVs associated with AD and AD endophenotypes
From our association analysis using 12,908 subjects (6,328 AD cases and 6,580 controls, excluding subjects with unknown AD diagnosis and SV quality outliers, Methods), six common and nine rare SVs were found associated with AD at a false discovery rate (FDR) < 0.2 (Table 4, Fig. 5A). Notably, a 12.7 Kb (chr10:110025269-110037941, AF = 0.000426) deletion in the intron of ADD3 was exclusively found in 11 AD cases and not in any control. In gnomAD, this deletion has a lower AF of 0.000277, which may be attributed to fewer AD cases in gnomAD. Moreover, there is a rare SNV (rs773892439) in complete LD (R² = 1) with this deletion. Since the SNV is extremely rare (gnomAD AF of 0.00022, TOPMed 53 AF of 0.00033, and our AF of 0.00065), it was not included in previous GWASs.
Another rare deletion (chr12:26731939-26732033, AF = 0.00155) in ITPR2 was found in 33 AD cases and 7 controls. The deletion is in intron 2 of ITPR2, which may be a regulatory region as indicated by the H3K4me1 and H3K27ac signals as well as transcription factor ChIP-seq clusters in this region (Fig. S11A). ITPR2 was found to be widely expressed across different brain regions (Fig. S11B), with a higher expression in AD (Fig. S11C). SVs in LMNTD1, LHFPL6, RNA5SP293, RABGAP1, ADD3, ITPR2, and CLIC4 were confirmed by PCR validation.
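The allele frequencies reported for the ADD3 and ITPR2 deletions can be reproduced from the carrier counts and the 12,908 subjects in the association analysis. The sketch below assumes that every carrier is heterozygous and that no genotypes are missing; both are simplifying assumptions made here, not statements from the text.

```python
N_SUBJECTS = 12_908  # subjects in the association analysis (from the text)

def allele_frequency(het_carriers: int, hom_carriers: int = 0, n: int = N_SUBJECTS) -> float:
    """Alternate-allele frequency among the 2*n chromosomes of n diploid subjects."""
    return (het_carriers + 2 * hom_carriers) / (2 * n)

add3 = allele_frequency(11)       # 11 AD cases, 0 controls
itpr2 = allele_frequency(33 + 7)  # 33 AD cases plus 7 controls

print(f"ADD3 deletion AF:  {add3:.6f}")
print(f"ITPR2 deletion AF: {itpr2:.5f}")
```

Under these assumptions the computed values round to the AFs given in the text (0.000426 and 0.00155).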
Under a nominal P < 0.05, there are 2,411 high-quality SVs not in the problematic regions (Methods). Enrichment analysis of the 2,411 SVs revealed an over-representation of neuronal function-related terms, such as axon development and synaptic membrane (Fig. 5B). Among the 2,411 SVs, 37 are protein-altering variants (Table S9), including protein-altering variants in genes that have been found to be related to AD, e.g., NTN3 and CIB2 54,55. Since a significant homozygous CNV burden was detected, we performed association using a recessive model, which assumes that two copies of the alternative allele are required to alter the risk. As a result, a 1 Kb deletion (chr11:131726334-131727274) in the intron of NTM is the only SV with FDR < 0.2 using the recessive model. Interestingly, variants inside NTM have been associated with tau pathology in previous studies 56,57.
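The recessive model described above amounts to a recoding of each genotype before testing. A minimal sketch, assuming genotypes are stored as alternate-allele counts (0, 1, 2) with `None` for missing calls; the function name is illustrative and this is not the study's actual implementation:

```python
def recessive_code(alt_copies):
    """Map an alternate-allele count to a recessive indicator.

    Only homozygous carriers (two copies) are coded 1; heterozygotes are
    grouped with non-carriers, and missing genotypes stay missing.
    """
    if alt_copies is None:
        return None
    return 1 if alt_copies == 2 else 0

genotypes = [0, 1, 2, None, 2, 1]
coded = [recessive_code(g) for g in genotypes]
print(coded)  # [0, 0, 1, None, 1, 0]
```

The recoded vector can then be passed to the same association test that is used for the additive coding.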
In addition, we extended our association analysis to endophenotypes. Table 5 shows six common and six rare SVs with an FDR < 0.2 for cognitive functions, CSF biomarkers, and neuropathologic measurements. No significant genomic inflation was observed for any endophenotype (Fig. S12), indicating that confounding factors are well adjusted. The most significant signal is a rare deletion (chr4:188173309-188183202, AF = 0.0028, P = 1.72 × 10⁻⁸) located in an intergenic region that is a transcription factor binding site. A rare SNV (rs1418703978), which shows an even lower AF (gnomAD AF of 0.00019, TOPMed AF of 0.00026, and our AF of 0.00047), is in complete LD with the deletion. A 100 Kb deletion (chr6:31391686-31488609) encompassing the entire MICA gene is associated with amyloid presence (P = 1.09 × 10⁻⁷). Previous studies showed that the MICA deletion is accompanied by a MICB null allele (MICB0107N) 58, indicating loss of function of both MICA and MICB. These genes are located in the MHC locus, which has been found associated with AD risk 59.
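Genomic inflation is commonly summarized by the factor λ, the median observed test statistic divided by its expected median under the null. The text does not state how inflation was computed, so the sketch below is only one common convention: it converts two-sided P values to 1-degree-of-freedom chi-square statistics using the standard library.

```python
from statistics import NormalDist, median

def genomic_inflation(p_values):
    """Lambda: median 1-df chi-square statistic over its expected null median.

    Each two-sided p-value p maps to z = Phi^-1(p / 2) and chi2 = z**2.
    P values must lie in (0, 1].
    """
    nd = NormalDist()
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in p_values]
    expected_median = nd.inv_cdf(0.25) ** 2  # about 0.455 for 1 df
    return median(chi2) / expected_median

# Evenly spaced p-values mimic a well-calibrated (uniform) null distribution.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
print(round(genomic_inflation(null_p), 3))
```

Values of λ near 1 indicate little inflation; values well above 1 suggest residual confounding such as population structure or batch effects.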

Discussion
Generating high-quality SV calls from WGS data for association analysis is challenging, and a major concern is to ensure that the analysis is not based on false positive SVs. To achieve this, we developed a pipeline to filter SVs and employed stringent criteria during the burden analysis to include only high-quality SVs. For each significant SV, we examined read coverages and other alignment signals using Samplot and performed experimental validations if samples were available in the lab (Methods). Despite our efforts, false positive/negative calls on individual samples can still occur, which may undermine the result of the analysis. Therefore, we suggest a broader validation of significant SVs using long reads as the cost and accuracy of long reads improve rapidly.
We reported SVs in LD with known AD risk loci (such as SNVs in NCK2, WDR12, and TMEM106B) and on AD risk/protective genes (such as APP, SORL1, and ABCA7). Beyond these, researchers can use our SV call set to explore SVs on a particular gene of interest. For example, there are SVs on genes that might be related to the risk of disease by interacting with well-known AD genes (such as PSEN2 and APOE). A deletion (chr1:226827423-226834076, near PSEN2) spanning the entire lnc-PSEN2-7 and overlapping with a possible enhancer supported by H3K27ac signals (Fig. S9) was identified in an individual (Latin American ancestry, inferred by GRAF-pop 37) who had onset of AD symptoms at age 71. We also observed in one AD case an exonic deletion in MPO (Fig. S10), a gene that has been reported to affect AD risk through interaction with APOE 60.
Our association analysis yielded some interesting findings. One notable discovery is a 12 Kb deletion in ADD3, a gene that encodes γ-adducin, a subunit of the adducin protein, and has been reported to be associated with neural function.
The α-adducin encoded by ADD1 can dimerize with either β-adducin (ADD2) or γ-adducin (ADD3) to form the adducin protein 61. Heterodimers of α-adducin and β-adducin are found mainly in red blood cells and neurons because expression of β-adducin is tissue-specific, whereas α-adducin and γ-adducin are present in most tissue types 61. Adducin plays an essential role in the membrane cytoskeleton of red blood cells 62 and is highly expressed in dendritic spines 63 and growth cones of neurons 64. Moreover, overexpression of γ-adducin promotes neurite-like processes in COS7 cells 65, suggesting important roles of adducin in brain function. Variants in ADD3 were found to be associated with hypertension, cerebral palsy, renal disease, vascular disease and cognitive dysfunction 66,67. Along with tau and a few other CDK5 substrates, γ-adducin is also hyperphosphorylated (possibly by CDK5) in APP/PS1 mice 68.
Interestingly, ADD3 displayed a significantly lower expression in 6-month-old APP/PS1 mice but a significantly higher expression in 14-month-old APP/PS1 mice 69. In addition, γ-adducin is involved in the trans-Golgi network through re-organization of the actin network around the Golgi complex 65 and therefore may be able to regulate intracellular trafficking of APP and relevant secretases.
Our study provides a valuable resource for the analysis of SVs in AD. We identified SVs from WGS data across a large cohort of AD participants with diverse ancestry. We reported SVs tagging AD risk SNVs, providing new mechanisms of action for GWAS signals. We discovered deleterious rare SVs on well-known AD genes. We found a higher burden of ultra-rare SVs on AD genes and, overall, a higher burden of homozygous and singleton CNVs in AD patients. Finally, our association analysis revealed a few potential candidate SVs and genes that are worthy of further study.

Study subjects
The Alzheimer's Disease Sequencing Project (ADSP) 31 is a collaborative project aiming at identifying new variants, genes, and therapeutic targets in AD. In the R3 release of ADSP, 16,905 subjects were collected across 24 cohorts, and whole genome sequencing was performed on Illumina HiSeqX, HiSeq2000, HiSeq2500, and NovaSeq platforms.
The ancestry of each individual was inferred using GRAF-pop 37. The samples came from diverse ancestries, with 10,466 Europeans, 3,619 African Americans, 2,677 Latin Americans, 59 East Asians, and 84 of other ancestries. There are 6,646 AD cases, 6,938 controls and 3,321 subjects with unknown status in this study. Sample characteristics are displayed in Table 1.
After removing duplicates and subjects without AD diagnosis, 13,371 samples were kept for analysis. Then, 463 outlier subjects, with too many (> median + 4*MAD) SV calls or too few (< median − 4*MAD) high-quality SV calls, were removed (Fig. S13). There were 12,908 samples (6,328 cases and 6,580 controls) remaining for association analysis (Fig. S14). Compared to the samples that were kept for further analysis, outliers are more likely to have smaller insert size and lower coverage (Fig. S15).
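The outlier rule above can be sketched as follows. This is an illustration only: it assumes the unscaled median absolute deviation (the text does not say whether a scaling constant was applied), and the per-sample counts are made-up toy values.

```python
from statistics import median

def mad(values):
    """Median absolute deviation (unscaled)."""
    m = median(values)
    return median(abs(v - m) for v in values)

def flag_outliers(total_calls, hq_calls, k=4):
    """Indices of samples with too many SV calls or too few high-quality SV calls."""
    upper = median(total_calls) + k * mad(total_calls)
    lower = median(hq_calls) - k * mad(hq_calls)
    return [
        i
        for i, (total, hq) in enumerate(zip(total_calls, hq_calls))
        if total > upper or hq < lower
    ]

# Toy per-sample counts (hypothetical): sample 4 has too many calls overall,
# sample 5 has too few high-quality calls.
total_calls = [22000, 22500, 21800, 22300, 40000, 22100]
hq_calls = [6700, 6600, 6650, 6720, 6680, 3000]
print(flag_outliers(total_calls, hq_calls))  # [4, 5]
```

Using the median and MAD rather than the mean and standard deviation keeps the thresholds from being pulled toward the very outliers they are meant to detect.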

SV calling
Figure S16 illustrates the SV calling pipeline. For each sample, SVs were called by Manta 28 (v1.6.0) and Smoove

SV selection by algorithmic models
Graphtyper2 annotates each SV call by algorithmic models, i.e., breakpoint, coverage, and aggregated models 30 .
Note that an SV call can be annotated by multiple models, so there will be duplicated records in the VCF if an SV call has more than one algorithmic model. The aggregated model has higher recall than the other two models 30. Therefore, SVs were selected based on the order of aggregated, breakpoint, and then coverage models (Table S1).
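The priority rule above (aggregated, then breakpoint, then coverage) can be expressed as a simple de-duplication step. In the sketch below the record layout, the `sv_id` and `model` keys, and the model labels are hypothetical stand-ins; they are not taken from Graphtyper2's actual VCF annotation.

```python
# Lower rank means higher priority. Labels are illustrative placeholders.
MODEL_PRIORITY = {"AGGREGATED": 0, "BREAKPOINT": 1, "COVERAGE": 2}

def select_records(records):
    """Keep one record per SV, preferring aggregated, then breakpoint, then coverage."""
    best = {}
    for rec in records:
        current = best.get(rec["sv_id"])
        if current is None or MODEL_PRIORITY[rec["model"]] < MODEL_PRIORITY[current["model"]]:
            best[rec["sv_id"]] = rec
    return list(best.values())

records = [
    {"sv_id": "DEL_1", "model": "COVERAGE"},
    {"sv_id": "DEL_1", "model": "AGGREGATED"},
    {"sv_id": "DUP_7", "model": "COVERAGE"},
    {"sv_id": "DUP_7", "model": "BREAKPOINT"},
]
selected = select_records(records)
print([(r["sv_id"], r["model"]) for r in selected])
```

Each SV ends up with exactly one record, so downstream counts and association tests do not see the same call twice.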

High-quality SVs
A subset of SV calls was defined as high-quality calls. The criteria for high-quality SVs can be found in Graphtyper2

Problematic regions
There are regions in the human genome that tend to have anomalous, or high signal in WGS experiments 70. SVs that reside in those regions can be unreliable and should be reported. Specifically, we compiled problematic regions in the genome from the following sources:

and 0.64 µM sequencing primer (IDT) in a total volume of 5 µl. The sequencing reaction was performed in a SimpliAmp Thermal Cycler (Applied Biosystems) using the following program: 96°C for 1 min followed by 25 cycles of 96°C for 10 sec, 50°C for 5 sec, and 60°C for 1 min 15 sec. The products were cleaned using XTerminator and SAM Solution (Applied Biosystems) with 30 min of shaking at 1800 rpm followed by centrifugation at 1000 rpm for 2 min. The sequencing products were analyzed on a SeqStudio Genetic Analyzer (Applied Biosystems), and the sequencing traces were analyzed using Sequencher 5.4 (Gene Codes).

SVs on AD risk loci and AD genes
We first searched for SVs that are in linkage disequilibrium (LD) with AD associated loci from three GWASs 8-10.
There are 123 unique variants that reached genome-wide significance in the three studies. After excluding nine variants that were not found in the WGS data, we searched for SVs that are in LD (R² > 0.2) with the remaining 114 variants. For SVs, P values from fastGWA 42 adjusting for PCs 1-5, age, sex, sequencing centers, sequencing platforms, and PCR status are also provided.
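A minimal sketch of the LD screen follows. It assumes that R² is computed as the squared Pearson correlation between genotype dosages (0/1/2) of an SV and a GWAS variant; the text does not state which tool or estimator was used, and the genotype vectors are toy data.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors.

    Samples with a missing genotype (None) in either vector are skipped.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic variant: LD is undefined, treated here as no LD
    return sxy * sxy / (sxx * syy)

LD_THRESHOLD = 0.2  # cutoff from the text

sv = [0, 0, 1, 0, 2, 0, 1, 0]
snv_same_carriers = [0, 0, 1, 0, 2, 0, 1, 0]
snv_partial = [0, 1, 1, 0, 2, 0, 0, 0]

for name, snv in [("same carriers", snv_same_carriers), ("partial overlap", snv_partial)]:
    r2 = r_squared(sv, snv)
    print(f"{name}: R2 = {r2:.4f}, in LD = {r2 > LD_THRESHOLD}")
```

When the SV and the SNV have identical carriers the estimate is 1, which corresponds to the complete LD described for the NCK2 and ADD3 deletions.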
Then, we investigated SVs on known AD genes. A list of 20 expert-curated AD risk/protective causal genes was downloaded from https://adsp.niagads.org/index.php/gvc-top-hits-list. These genes were identified by a review of literature, pathway analysis, and integration of genetic studies with myeloid genomics. All deletions, duplications, and inversions with a missing rate less than 0.5 that overlap with these genes were inspected. Association of ultra-rare SVs on the 20 AD genes was evaluated using the SKAT-O test from the R package SKAT 48.
Overall SV burden in AD
Overall SV burden between AD cases and controls was compared. SV burden was measured by the difference in the number of high-quality SVs in cases and controls. A logistic regression model adjusted for covariates (PCs 1-5, age, sex, sequencing center, sequencing platform, PCR status) was used. One-sided empirical p values (assuming increased SV burden in cases) were calculated based on 10,000 permutations. In particular, we evaluated the burden of singletons and homozygous SVs in AD compared to controls.
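The one-sided empirical p value can be illustrated with a simplified permutation test. Unlike the covariate-adjusted logistic regression described above, this sketch uses the unadjusted difference in mean SV count between cases and controls as the test statistic, purely to stay self-contained; the add-one correction and the toy data are also assumptions made here.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def burden_stat(counts, is_case):
    """Mean SV count in cases minus mean SV count in controls."""
    cases = [c for c, y in zip(counts, is_case) if y]
    controls = [c for c, y in zip(counts, is_case) if not y]
    return mean(cases) - mean(controls)

def one_sided_permutation_p(counts, is_case, n_perm=10_000, seed=0):
    """Empirical P(permuted statistic >= observed), testing for higher burden in cases."""
    rng = random.Random(seed)
    observed = burden_stat(counts, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between burden and case status
        if burden_stat(counts, labels) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids reporting p = 0

# Toy data: per-sample singleton counts, cases first.
counts = [12, 15, 14, 16, 13, 9, 8, 10, 7, 9]
is_case = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p = one_sided_permutation_p(counts, is_case, n_perm=2_000)
print(f"empirical one-sided p = {p:.4f}")
```

In the adjusted setting the same permutation scheme would be wrapped around the regression coefficient for SV burden instead of the raw mean difference.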

Association and functional analyses
In total, 136,092 SVs with a missing rate < 0.5 and minor allele count > 5 were evaluated using the mixed-linear-model-based tool fastGWA implemented in GCTA 42. Age, sex, sample PCR status, sequencing platforms, sequencing centers, and PCs 1-5 calculated from common SNVs were included as covariates. The age of cases was determined by the age at disease onset. The age of controls was determined by the age at the last exam. A sparse genetic relationship matrix was also generated from SNVs with a cutoff of 0.05. High-confidence deletions, duplications, and inversions were selected by Samplot and experimentally validated by PCR. For insertions, only high-confidence ones that are high-quality and not in the problematic regions were reported. Enrichment analysis for nominally significant signals (2,411 high-quality SVs with P < 0.05) was performed using clusterProfiler 79.
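The inclusion criteria for the association analysis (missing rate < 0.5 and minor allele count > 5) can be sketched as a per-variant filter. The genotype representation below (alternate-allele counts with `None` for missing calls) is an assumption for illustration, not the format used by the actual pipeline.

```python
def passes_filters(genotypes, max_missing=0.5, min_mac=5):
    """Return True if an SV meets the missing-rate and minor-allele-count criteria.

    genotypes: alternate-allele counts per sample (0/1/2), None when missing.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if missing_rate >= max_missing or not called:
        return False
    alt = sum(called)
    mac = min(alt, 2 * len(called) - alt)  # count of the less frequent allele
    return mac > min_mac

print(passes_filters([1] * 6 + [0] * 14))                # True: MAC = 6
print(passes_filters([1] * 5 + [0] * 15))                # False: MAC = 5
print(passes_filters([None] * 10 + [1] * 6 + [0] * 4))   # False: missing rate = 0.5
```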
Other than binary AD diagnosis, we also assessed SV association with cognitive scores, fluid biomarkers, and neuropathological measurements that were harmonized by the ADSP Phenotype Harmonization Consortium
(1) the ENCODE blacklist: a comprehensive set of regions that could result in erroneous signal 71; (2) the 1000 Genomes masks: regions of the genome that are more or less accessible to next generation sequencing methods using short reads; (3) the set of assembly gaps defined by UCSC; (4) the set of segmental duplications defined by UCSC; (5) the low-complexity regions, satellite sequences and simple repeats defined by RepeatMasker (Tarailo-Graovac and Chen 2009).

Figures

Figure 1

Figure 2

Table 2
High-confidence SVs in linkage disequilibrium with AD GWAS signals
a P values from association analysis by fastGWA 42. b Wightman et al., 2021. c Bellenguez et al., 2022. d Kunkle et al., 2019. * SVs that have been experimentally validated.

Table 3
a SV overlaps with gene exons. b SVs that are experimentally validated. c SVs with read depth and split reads support but no PCR product for flanking primers. d Represents homozygous event. e Two additional family members had the duplication but failed quality control due to less confident genotypes. Nevertheless, alignment evidence strongly supports the presence of the duplication. Their onset ages are 49 and 53. F, female; M, male; E, European; A, Asian; AA, African American; L, Latin American; O, other; "-" represents missing age. AD risk/protective genes are selected by the ADSP Gene Verification Committee (https://adsp.niagads.org/gvc-top-hits-list/).
a SVs that are experimentally validated. b Homozygous allele frequency. AF

30 (v0.2.5) with default parameters. Calls from Manta and Smoove were merged by Svimmer 30 to generate a union of two call sets for a sample. Unference 'breakends' (BNDs) and SVs > 10 Mb were filtered. Then, all individual sample VCF files were merged together by Svimmer as input to Graphtyper2 (v2.7.3) 30 for joint genotyping. SV calls after joint genotyping are comparable across the samples and can therefore be used directly in genome-wide association analysis 30. The pipeline is available at https://github.com/whtop/SV-ADSP-Pipeline.
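The filtering of breakends and SVs larger than 10 Mb can be sketched on VCF-style records. The example below is not taken from the published pipeline; it assumes that each record carries the standard VCF structural-variant INFO keys `SVTYPE`, `SVLEN`, and `END`, and the helper names are illustrative.

```python
MAX_SV_LEN = 10_000_000  # 10 Mb cutoff from the text

def parse_info(info):
    """Turn a VCF INFO string such as 'SVTYPE=DEL;END=2000' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value  # flag-only entries get an empty string
    return fields

def keep_record(pos, info):
    """Drop breakends (BND) and SVs longer than MAX_SV_LEN."""
    fields = parse_info(info)
    if fields.get("SVTYPE") == "BND":
        return False
    if "SVLEN" in fields:
        length = abs(int(fields["SVLEN"].split(",")[0]))
    elif "END" in fields:
        length = int(fields["END"]) - pos
    else:
        return True  # no size information: keep and leave the decision to later QC
    return length <= MAX_SV_LEN

print(keep_record(1000, "SVTYPE=DEL;END=6500;SVLEN=-5500"))  # True
print(keep_record(1000, "SVTYPE=BND"))                       # False
print(keep_record(1000, "SVTYPE=INV;END=20001000"))          # False
```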