Prioritizing disease-related rare variants by integrating gene expression data

Rare variants, comprising a vast majority of human genetic variations, are likely to have more deleterious impact on human diseases compared to common variants. Here we present carrier statistic, a statistical framework to prioritize disease-related rare variants by integrating gene expression data. By quantifying the impact of rare variants on gene expression, carrier statistic can prioritize those rare variants that have large functional consequence in the diseased patients. Through simulation studies and analyzing real multi-omics dataset, we demonstrated that carrier statistic is applicable in studies with limited sample size (a few hundreds) and achieves substantially higher sensitivity than existing rare variants association methods. Application to Alzheimer’s disease reveals 16 rare variants within 15 genes with extreme carrier statistics. We also found strong excess of rare variants among the top prioritized genes in diseased patients compared to that in healthy individuals. The carrier statistic method can be applied to various rare variant types and is adaptable to other omics data modalities, offering a powerful tool for investigating the molecular mechanisms underlying complex diseases.


Introduction
Rare variants (minor allele frequency (MAF) < 1%) constitute a vast majority of human genetic variations 1,2 .They are on average more deleterious compared with common variants, and thus undergoes stronger selection and remains at low frequency in the general population.By analyzing large cohorts of whole genome sequencing/whole exome sequencing (WGS/WES) data, researchers have identi ed a handful of rare variants-trait association [3][4][5][6] and shown that rare variants contribute to a large proportion of missing heritability that cannot be explained by common variants 7 .
Rare variants on average confer larger effects on gene expression and complex diseases and are easier to map to causal genes than common variants 8 .However, statistical power to identify diseaseassociated rare variants, especially for ultra-rare variants or even singletons, is limited, given that the sample size is not too high or the effect size is not too large.Variants collapsing methods (burden test, variance component test, omnibus test) are proposed to circumvent this obstacle 6, [9][10][11] , which evaluate association for multiple variants in a biologically relevant region, such as a gene, instead of testing the effect of single variant.These methods work well under the assumption that multiple variants in a gene cumulatively contribute to the disease risk with each individual allele explaining only a small fraction of the cases, but resolution to pinpoint the risk variants may be diluted when the assumption is violated.Recent studies based on large scale WES have revealed rare protein truncating variants associated with a wide range of phenotypes 4,5,12,13 .These variants exert extreme effects on the function of genes and their encoded proteins, underscoring the importance of considering how rare variants are related to gene expression.
Complementary to genome sequencing assay, RNA-seq can quantitatively measure gene expression level and provide molecular cause of complex diseases, especially rare diseases.Previous studies have shown that rare variants are enriched near genes with aberrant gene expression [14][15][16] .We posit that those rare variants will be more prone to disease pathology.In this work, we propose carrier statistic, a statistical framework for prioritizing those rare variants with large functional consequence in diseased patients by integrating gene expression data.We demonstrate superior performance of our method through extensive simulations and application study to the Alzheimer's disease, where given a limited sample size, existing rare variants association methods without functional gene expression data cannot provide positive ndings.

Method Overview
Our method stems from the expectation that diseased population shows enrichment in rare variants that have large impact on expression for disease-related genes.Suppose we have genotypes (e.g.variants call from WGS data) and gene expression measurements (e.g.reads count from RNA-seq data) on a disease relevant cell type for both diseased patients and healthy controls.For each rare variant-gene pair, we calculate the expression association z-score for rare variant carriers by using the gene expression from individuals without the variant as the null distribution (Fig. 1a).The expression association z-score, which we term as carrier statistic, is calculated separately within the case and the control group.We only consider rare variant-gene pair wherein the variant is located within the exon of the gene throughout this study.The carrier statistic quanti es the degree to which the rare variant impacts gene expression level.We assume that most rare variants do not have a large impact on the gene function, so the distribution of carrier statistic will be centered around 0. Now, if a gene is relevant for the disease, then conditioning on having the disease will bias the sampling towards people carrying rare variants with large functional impact on the gene expression.Thus, the carrier statistics for disease-related rare variant-gene pairs will tend to be more extreme compared to those for non-related pairs in the case group (Fig. 1b).We prioritize those rare variants and genes with outlier carrier statistic in the case group.False discovery rate (FDR) can be computed as the ratio of tail probability for carrier statistic between two groups (Methods).

Simulation Results
We rst carried out simulations to assess whether the carrier statistic-based method would produce false positive ndings.We simulated genotypes based on whole exome sequencing (WES) data from Genome Aggregation Database (gnomAD) 2 and simulated gene expression pro les based on RNA-seq data in whole blood tissue from the Genotype-Tissue Expression (GTEx) project (Methods).We perturbed the expression level of the causal genes for causal variants carriers by assuming that causal rare variants have large functional impact on disease-related genes.FDR for carrier statistic was wellcalibrated in all simulation settings with varying penetrance of causal variant, prevalence in causal variant noncarriers, and number of causal variants per causal gene (Supplementary Fig. 1).We also checked if using gene expression from noncarriers in two groups together rather than separately as null distribution will produce false positive ndings (Methods).In this case, we observed substantial in ation in FDR, particularly when diseased patients have systematic change in gene expression pro le from healthy controls.In contrast, using gene expression from noncarriers in the same group as null distribution consistently gave well-calibrated error rates (Supplementary Fig. 2).
Next, based on the simulated data, we benchmarked the performance of carrier statistic with three existing rare variants association methods: burden test 9 , Sequence Kernel Association Test (SKAT) 10 , and SKAT-O 11 .Burden test counts the number of rare variants within a gene followed by the association test with the disease.SKAT computes a gene-level variance component score statistic which allows bidirectional effect of different variants.The uni ed test SKAT-O implements a linear combination of burden test statistic and variance component test statistic, which is preferred when the underlying genetic architecture of the disease is not known.While all four methods successfully controlled FDR at the nominal level (Supplementary Fig. 1), carrier statistic achieved higher sensitivity than the three variants collapsing methods under all simulation settings (Fig. 2).We believe the low sensitivity of burden-like statistics is due to the small number of case samples that can be attributed to the causal variants in any gene region, which makes it di cult to attain statistical signi cance of enrichment of rare variants burden for any causal gene region (Supplementary Tables 1-3).
We also performed empirical power analysis, which provides guidance for designing new disease association study with genome sequencing and RNA-seq data.Simulations were repeated 50 times to determine the sample size required for achieving 80% sensitivity, given that the penetrance of causal variant is 70%, the prevalence of causal variant noncarrier is 1%, and there are 5 causal variants per causal gene.When the effect size of causal variant on gene expression is large (e.g.Z = 5), 80% causal genes can be identi ed based on a cohort of 500 cases and 500 controls (Fig. 3).The necessary sample size gradually increases as causal variants become less deleterious and have smaller functional consequence on gene expression, e.g.2,000 samples will be needed to achieve 80% sensitivity for Z = 4.5.In contrast, given the same sample sizes standard GWAS will not have su cient power to detect rare causal variants regardless of their effect sizes.

Application to Alzheimer's disease
Alzheimer's disease is highly heritable, with heritability estimated to be as high as 60%-80% based on twin studies 17 .Large scale GWASs have identi ed multiple loci contributing to Alzheimer's disease, but the genetic variance explained by these loci is far below the level suggested by the disease heritability 18 .
Additionally, there is limited understanding regarding the molecular mechanism through which these GWAS variants affect the disease, with the exception of the well-known APOE locus.To investigate whether rare variants (single nucleotide variants [SNVs] and short indels) confer functional consequences in Alzheimer's disease, we applied carrier statistic to a harmonized multi-omics dataset ( , ) consisting of WGS and RNA-seq from prefrontal cortex in four aging cohort studies: the Religious Orders Study (ROS) and Memory and Aging Project (MAP), the Mount Sinai Brain Bank (MSBB), and the Mayo Clinic (Methods).We found signi cant excess of large carrier statistic in the diseased patients (Fig. 4).Controlling FDR with a cutoff of 0.2, we prioritized 16 rare variants within 15 genes with large carrier statistic in the case group (Table 1), implicating them as candidate variants that may contribute to Alzheimer's disease through up-regulating gene expression in the brain.
Table 1 16 rare variants within 15 genes with large carrier statistic in the Alzheimer's disease patients.
To see if existing methods can also detect these variants, we applied burden test, SKAT, SKAT-O to the same Alzheimer's disease dataset.These three variants collapsing methods did not identify any signi cant genes (FDR < 0.2), possibly due to insu cient sample size.To further evaluate the performance of carrier statistic, we assessed the enrichment of rare variants burden within the top prioritized genes in case group compared to that in controls.Among the top 100 genes with largest carrier statistic, 67 genes have fold enrichment larger than 1, 32 genes have fold enrichment larger than 4/3, while only 6 genes have fold enrichment smaller than 3/4 (Supplementary Fig. 3).Consistent with results from the simulations, enrichment of rare variants burden within each of those genes was moderate and did not pass signi cance threshold by the variants collapsing methods.
The signi cant genes prioritized by the carrier statistics may shed light on the genetic etiology of Alzheimer's disease (Fig. 5).COCH has the largest carrier statistic of 5.78.Missense mutations within this gene were found to cause the late-onset DFNA9 deafness disorder 19,20 .Furthermore, deposits of Cochlin encoded by COCH is associated with age-related glaucomatous trabecular meshwork but absent in healthy controls.Additionally, SNPs inside COCH are associated with cortical thickness 21 , changes in which through neuroimaging techniques is commonly used in early detection and monitoring of Alzheimer's disease progression 22,23 .Gene ARHGAP11A has a carrier statistic of 4.62.Transcribed mRNAs of the gene subcellularly localize and are locally translated in radial glia cells of human cerebral cortex and further regulate cortical development 24 .More importantly, ARHGAP11A may contribute to Alzheimer's disease pathology by mediating Amyloid-β generation and Amyloid-β oligomer neurotoxicity 25 .PPP1R17 (carrier statistic = 4.61) functions as a suppressor of phosphatase complexes 1 (PP1) and 2A (PP2A).A recent study suggests that a subpopulation of neurons in the dorsomedial hypothalamus regulate aging and lifespan in mice through hypothalamic-adipose inter-tissue communication and this regulation depends on Ppp1r17 expression 26 .Interestingly, PPP1R17 is also involved in human-speci c cortical neurodevelopment regulated by enhancers in human accelerated regions 27 .ZIC4 (carrier statistic = 5.02) plays an important role in the embryonal development of the cerebellum.Heterozygous deletions encompassing the ZIC4 locus are associated with a rare congenital cerebellar malformation known as the Dandy-Walker malformation 28 .Notably, mutations in proximity to the ZIC4 loci are implicated in multiple system atrophy, a rare neurodegenerative disease 29 .Large scale GWAS study of brain morphology has also identi ed associations with ZIC4, underscoring its signi cance in diverse neurological processes 30 .MDGA1 (carrier statistic = 4.71) encodes a glycosylphosphatidylinositol (GPI)-anchored cell surface glycoprotein.It has been reported that MDGA1 can contribute to cognitive de cits through altering inhibitory synapse development and transmission in the hippocampus 31 .Of note, MDGA1 is one of the 96 genes from the Olink neurology panel with established links to neurobiological processes and neurological diseases.

Discussion
We presented carrier statistic, a statistical framework to perform multi-omics data analysis, for prioritization of disease-related rare variants and their regulated genes.Through simulations and analyses of real multi-omics dataset, we demonstrated that carrier statistic overcomes sample size limitation and achieves substantial gain in statistical power compared to existing variants collapsing methods.The superior performance of carrier statistic can be attributed to incorporation of functional gene expression data, which allows quantitatively measuring the impact of rare variants that cannot be determined by looking at the variants alone.We applied carrier statistic to Alzheimer's disease and highlighted several novel risk genes, providing insights into the molecular etiology of the complex disease.
Carrier statistic serves as a general approach to study how rare variants affect complex disease through mediating gene expression.There exist several methods such as transcriptome-wide association study (TWAS) 32,33 or colocalisation 34,35 that can also perform integrative analysis across multiple data modalities (genotype, gene expression, and phenotype).However, those methods focus exclusively on effects of common SNPs and will have limited power for rare variants.In addition to SNVs and short indels that we included in this study, the statistical framework can be also applied to other types of rare variants (simple structural variants [SVs], complex SVs, mobile element insertions, tandem repeat expansions), which in general have larger effect size than SNVs 36 .Finally, carrier statistic can be adapted to other omics data, such as epigenomics and proteomics.
Over the last fteen years, abundant disease-associated loci have been identi ed based on genome sequences of biobank-scale sample size (e.g.hundreds of thousand) 37,38 .However, even with such large sample sizes it is still di cult for GWAS analysis to detect rare causal variants.Another important objective is to understand the biological roles of the detected loci, which remains challenging.We showed here that by integrating RNA-seq data, the carrier statistic approach offers a study design that may overcome the sample size limitation and may help to associate the functional rare variants and their target genes.This approach will be especially useful for the study of diseases for which biospecimen of disease relevant tissue are easy to obtain and RNA-seq can be performed, such as autoimmune disease (relevant to blood) or skin-related disease.Furthermore, when the tissue sample is available, adding RNA-seq to a WGS-based GWAS will not increase the cost of the study signi cantly.As multi-omics data accumulates alongside genome sequencing data, we anticipate that carrier statistics will become an effective approach to dissect the molecular mechanism of complex diseases.

Carrier statistic
For each rare variant-gene pair (the variant is located within the exon of the gene), we used the expression of that gene in the rare variant noncarriers as the null distribution and computed a z-score for each rare variant carrier, then average over carriers of that variant.The carrier statistic was computed separately within the case group and the control group.Rare variants were de ned as SNVs and short indels whose allele count was no larger than 5 within the case group or within the control group.Therefore, the rare variants and thus the number of carrier statistics are not the same between two groups.The carrier statistic quanti es the degree to which the rare variant impacts gene expression level.We assume that diseased population shows enrichment in rare variants that have large impact on expression for disease-related genes, thus there will also be enrichment of extreme carrier statistic for disease-related rare variant-gene pairs in the case group.We prioritize those rare variants and genes with outlier carrier statistic in the case group.For rare variant-gene pairs with positive carrier statistic, false discovery rate at a given threshold of carrier statistic, denoted by , can be computed as , where and denote the carrier statistic in the case group and control group, respectively.Duplicative carrier statistics were removed (i.e.multiple rare variants occurring in the same individuals are counted as the same rare variant).Similarly, for rare variant-gene pairs with negative carrier statistic, false discovery rate at threshold of can be computed as .

Simulations
We simulated genotypes for a large population consisting of 125,748 individuals based on the alternative allele count from 125,748 exomes in the gnomAD v2.1.1 dataset 2 .Only exonic variants that passed all variant lters in the gnomAD dataset were retained.Then we simulated gene expression data for the large population as follows.We rst simulated background gene expression pro le for these 125,748 individuals while matching the mean and standard deviation of normalized gene expression in the reference expression dataset.In this study, we used as normalized gene expression and RNA-seq in the whole blood tissue from GTEx project v8 as the reference expression dataset 39 .Genes whose median number of reads count in the large population < 10 were removed.We randomly selected causal genes and causal variants for each causal gene, where was set as 50 and was set to vary from 1 to 10.For each causal gene, we perturbed the gene expression for causal variant carriers by fold, where was set as 4 and was the standard deviation of normalized gene expression in the reference dataset.Next, we simulated the disease status for the large population by assuming penetrance of causal variant as and prevalence in causal variant noncarrier as .Here varied from 0.5 to 0.9 and varied from 0.005 to 0.02.Finally, we randomly sampled 500 cases and 500 controls from the affected and nonaffected population respectively to mimic the sample recruitment procedure in the disease study.Each simulation setting was repeated for 100 times.
We evaluated the performance of different methods using two metrics: FDR and sensitivity.FDR was de ned as the proportion of falsely identi ed genes among all identi ed ones.If no gene was identi ed then FDR was set as 0. Sensitivity was de ned as the proportion of truly identi ed genes among all underlying causal genes.
Note that carrier statistic was computed by using gene expression from rare variant noncarriers in the same group (i.e.case or control) as the carriers as null distribution.We also checked if using gene expression from noncarriers in both case group and control group as the null distribution will produce false positive ndings.We perturbed expression level for all genes in the case group.In this case, FDR showed substantial in ation for using gene expression from all individuals in two groups as null distribution, especially when there is large systematic difference in the transcriptome between two groups (Supplementary Fig. 2b).On the contrary, using gene expression from rare variant noncarriers in the same group of carriers as null distribution consistently controlled FDR at the nominal level (Supplementary Fig. 2a).

Implementation of different methods
Variants collapsing methods were performed using the R package SKAT v.2.2.5.All the parameters were set as default value.Both common and rare variants were included in the analysis.
Multi-omics data analysis for Alzheimer's disease WGS data for the four aging cohorts (ROS/MAP, MSBB, and the Mayo Clinic) were obtained from the Whole Genome Sequence Harmonization Study (Synapse ID: syn22264775).RNA-seq data for the same four cohorts were downloaded from the RNAseq Harmonization Study (syn21241740).Only white people with both WGS and RNA-seq data were included in the analysis.We determined disease status following description in previous publications 40,41  Next, we performed quality control on the WGS and RNA-seq data.For WGS data, only exonic variants with missing genotypes 10% were retained.For RNA-seq data, we selected prefrontal cortex as the target brain tissue.If a donor does not have RNA-seq in the prefrontal cortex, then RNA-seq in other tissues will be used based on the following order: dorsolateral prefrontal cortex > posterior cingulate cortex > head of caudate nucleus in the ROS/MAP cohort, prefrontal cortex > frontal pole > superior temporal gyrus > inferior frontal gyrus > parahippocampal gyrus in the MSBB cohort, and temporal cortex > cerebellum in the Mayo Clinic cohort.Genes with zero reads count in more than 10% of samples or with median reads count 10 across samples were excluded.was used as normalized gene expression.Then we applied carrier statistic to perform downstream analysis. Figures

Figure 3 Required
Figure 3

Figure 4 Alzheimer
Figure 4 . For the ROS/MAP cohorts, individuals with a Braak neuro brillary tangle score 4, a CERAD neuritic and cortical plaque score 2, and a cognitive diagnosis of probable Alzheimer's disease with no other causes (cogdx = 4) were classi ed as cases, while individuals with a Braak score 3, a CERAD score 3, and a cognitive diagnosis of no cognitive impairment (cogdx = 1) were classi ed as controls.For MSBB, individuals with a Braak score 4, a CERAD score 2, and a Clinical Dementia Rating (CDR) score 1 were classi ed as cases, while individuals with a Braak score 3, a CERAD score 1, and a CDR score 0.5 were classi ed as controls.For the Mayo Clinic cohort, individuals with a Braak score 4 and a CERAD score 2 were classi ed as cases, while individuals with a Braak score 3 and a CERAD score 1 were classi ed as controls.Of note, de nition of CERAD score in the ROS/MAP cohort is different from that in the MSBB and the Mayo Clinic cohorts.After harmonization across cohorts, 444 cases and 234 controls in total were identi ed and used for downstream analysis.