Method Overview
Our method stems from the expectation that diseased population shows enrichment in rare variants that have large impact on expression for disease-related genes. Suppose we have genotypes (e.g. variants call from WGS data) and gene expression measurements (e.g. reads count from RNA-seq data) on a disease relevant cell type for both diseased patients and healthy controls. For each rare variant-gene pair, we calculate the expression association z-score for rare variant carriers by using the gene expression from individuals without the variant as the null distribution (Fig. 1a). The expression association z-score, which we term as carrier statistic, is calculated separately within the case and the control group. We only consider rare variant-gene pair wherein the variant is located within the exon of the gene throughout this study. The carrier statistic quantifies the degree to which the rare variant impacts gene expression level. We assume that most rare variants do not have a large impact on the gene function, so the distribution of carrier statistic will be centered around 0. Now, if a gene is relevant for the disease, then conditioning on having the disease will bias the sampling towards people carrying rare variants with large functional impact on the gene expression. Thus, the carrier statistics for disease-related rare variant-gene pairs will tend to be more extreme compared to those for non-related pairs in the case group (Fig. 1b). We prioritize those rare variants and genes with outlier carrier statistic in the case group. False discovery rate (FDR) can be computed as the ratio of tail probability for carrier statistic between two groups (Methods).
Simulation Results
We first carried out simulations to assess whether the carrier statistic-based method would produce false positive findings. We simulated genotypes based on whole exome sequencing (WES) data from Genome Aggregation Database (gnomAD)2 and simulated gene expression profiles based on RNA-seq data in whole blood tissue from the Genotype-Tissue Expression (GTEx) project (Methods). We perturbed the expression level of the causal genes for causal variants carriers by assuming that causal rare variants have large functional impact on disease-related genes. FDR for carrier statistic was well-calibrated in all simulation settings with varying penetrance of causal variant, prevalence in causal variant noncarriers, and number of causal variants per causal gene (Supplementary Fig. 1). We also checked if using gene expression from noncarriers in two groups together rather than separately as null distribution will produce false positive findings (Methods). In this case, we observed substantial inflation in FDR, particularly when diseased patients have systematic change in gene expression profile from healthy controls. In contrast, using gene expression from noncarriers in the same group as null distribution consistently gave well-calibrated error rates (Supplementary Fig. 2).
Next, based on the simulated data, we benchmarked the performance of carrier statistic with three existing rare variants association methods: burden test9, Sequence Kernel Association Test (SKAT)10, and SKAT-O11. Burden test counts the number of rare variants within a gene followed by the association test with the disease. SKAT computes a gene-level variance component score statistic which allows bidirectional effect of different variants. The unified test SKAT-O implements a linear combination of burden test statistic and variance component test statistic, which is preferred when the underlying genetic architecture of the disease is not known. While all four methods successfully controlled FDR at the nominal level (Supplementary Fig. 1), carrier statistic achieved higher sensitivity than the three variants collapsing methods under all simulation settings (Fig. 2). We believe the low sensitivity of burden-like statistics is due to the small number of case samples that can be attributed to the causal variants in any gene region, which makes it difficult to attain statistical significance of enrichment of rare variants burden for any causal gene region (Supplementary Tables 1–3).
We also performed empirical power analysis, which provides guidance for designing new disease association study with genome sequencing and RNA-seq data. Simulations were repeated 50 times to determine the sample size required for achieving 80% sensitivity, given that the penetrance of causal variant is 70%, the prevalence of causal variant noncarrier is 1%, and there are 5 causal variants per causal gene. When the effect size of causal variant on gene expression is large (e.g. Z = 5), 80% causal genes can be identified based on a cohort of 500 cases and 500 controls (Fig. 3). The necessary sample size gradually increases as causal variants become less deleterious and have smaller functional consequence on gene expression, e.g. 2,000 samples will be needed to achieve 80% sensitivity for Z = 4.5. In contrast, given the same sample sizes standard GWAS will not have sufficient power to detect rare causal variants regardless of their effect sizes.
Application to Alzheimer’s disease
Alzheimer’s disease is highly heritable, with heritability estimated to be as high as 60%-80% based on twin studies17. Large scale GWASs have identified multiple loci contributing to Alzheimer’s disease, but the genetic variance explained by these loci is far below the level suggested by the disease heritability18. Additionally, there is limited understanding regarding the molecular mechanism through which these GWAS variants affect the disease, with the exception of the well-known APOE locus. To investigate whether rare variants (single nucleotide variants [SNVs] and short indels) confer functional consequences in Alzheimer’s disease, we applied carrier statistic to a harmonized multi-omics dataset (\({n}_{case}=444\), \({n}_{control}=234\)) consisting of WGS and RNA-seq from prefrontal cortex in four aging cohort studies: the Religious Orders Study (ROS) and Memory and Aging Project (MAP), the Mount Sinai Brain Bank (MSBB), and the Mayo Clinic (Methods). We found significant excess of large carrier statistic in the diseased patients (Fig. 4). Controlling FDR with a cutoff of 0.2, we prioritized 16 rare variants within 15 genes with large carrier statistic in the case group (Table 1), implicating them as candidate variants that may contribute to Alzheimer’s disease through up-regulating gene expression in the brain.
Table 1 16 rare variants within 15 genes with large carrier statistic in the Alzheimer’s disease patients.
To see if existing methods can also detect these variants, we applied burden test, SKAT, SKAT-O to the same Alzheimer’s disease dataset. These three variants collapsing methods did not identify any significant genes (FDR < 0.2), possibly due to insufficient sample size. To further evaluate the performance of carrier statistic, we assessed the enrichment of rare variants burden within the top prioritized genes in case group compared to that in controls. Among the top 100 genes with largest carrier statistic, 67 genes have fold enrichment larger than 1, 32 genes have fold enrichment larger than 4/3, while only 6 genes have fold enrichment smaller than 3/4 (Supplementary Fig. 3). Consistent with results from the simulations, enrichment of rare variants burden within each of those genes was moderate and did not pass significance threshold by the variants collapsing methods.
The significant genes prioritized by the carrier statistics may shed light on the genetic etiology of Alzheimer’s disease (Fig. 5). COCH has the largest carrier statistic of 5.78. Missense mutations within this gene were found to cause the late-onset DFNA9 deafness disorder19,20. Furthermore, deposits of Cochlin encoded by COCH is associated with age-related glaucomatous trabecular meshwork but absent in healthy controls. Additionally, SNPs inside COCH are associated with cortical thickness21, changes in which through neuroimaging techniques is commonly used in early detection and monitoring of Alzheimer's disease progression22,23. Gene ARHGAP11A has a carrier statistic of 4.62. Transcribed mRNAs of the gene subcellularly localize and are locally translated in radial glia cells of human cerebral cortex and further regulate cortical development24. More importantly, ARHGAP11A may contribute to Alzheimer’s disease pathology by mediating Amyloid-β generation and Amyloid-β oligomer neurotoxicity25. PPP1R17 (carrier statistic = 4.61) functions as a suppressor of phosphatase complexes 1 (PP1) and 2A (PP2A). A recent study suggests that a subpopulation of neurons in the dorsomedial hypothalamus regulate aging and lifespan in mice through hypothalamic-adipose inter-tissue communication and this regulation depends on Ppp1r17 expression26. Interestingly, PPP1R17 is also involved in human-specific cortical neurodevelopment regulated by enhancers in human accelerated regions27. ZIC4 (carrier statistic = 5.02) plays an important role in the embryonal development of the cerebellum. Heterozygous deletions encompassing the ZIC4 locus are associated with a rare congenital cerebellar malformation known as the Dandy–Walker malformation28. Notably, mutations in proximity to the ZIC4 loci are implicated in multiple system atrophy, a rare neurodegenerative disease29. Large scale GWAS study of brain morphology has also identified associations with ZIC4, underscoring its significance in diverse neurological processes30. MDGA1 (carrier statistic = 4.71) encodes a glycosylphosphatidylinositol (GPI)-anchored cell surface glycoprotein. It has been reported that MDGA1 can contribute to cognitive deficits through altering inhibitory synapse development and transmission in the hippocampus31. Of note, MDGA1 is one of the 96 genes from the Olink neurology panel with established links to neurobiological processes and neurological diseases.