Examination of a Novel Expression-Based Gene-SNP Annotation Strategy to Identify Tissue-Specific Contributions to Heritability in Multiple Traits

Complex traits show clear patterns of tissue-specific expression influenced by single-nucleotide polymorphisms (SNPs), yet current strategies aggregate SNP effects to genes by employing simple physical proximity-based windows. Here, we examined whether incorporating only those SNPs with effects on tissue-specific cis-expression would improve our ability to detect trait-relevant tissues across 31 complex traits using stratified linkage disequilibrium score regression (S-LDSC). We found that a physical proximity annotation produced more significant tissue enrichments and larger S-LDSC regression coefficients, as compared to an expression-based annotation. Furthermore, we showed that our expression-based annotation did not outperform an annotation strategy in which an equal number of randomly chosen SNPs were annotated to genes within the same genomic window, suggesting extensive redundancy among SNP effect estimates due to linkage disequilibrium. That said, current sample sizes limit estimation of cis-genetic SNP effects; therefore, we recommend reexamination of the expression-based annotation when larger tissue-specific expression datasets become available. Finally, we report new and updated tissue enrichment estimates across 31 complex traits, such as significant heritability enrichment of the frontal cortex for cognitive performance, educational attainment, and intelligence, providing further evidence of this structure’s importance in higher cognitive function.


Introduction
Regulation of gene expression is one mechanism of heritable variation of complex traits [1][2][3][4][5], yet integration of expression-related processes in identifying genetic associations and influences on traits remains incomplete. Single-nucleotide polymorphisms (SNPs) located in or near a gene (i.e., in cis) may influence that gene's function through alterations of transcription regulatory elements 6, mRNA splicing 7, translation 8,9, and many other factors 10,11. However, the functional effects of most SNPs on complex traits, particularly those without clear protein-coding changes, remain unknown 12. Despite this, there is clear evidence that genetic influences on tissue- and cell-type-specific expression affect complex traits [13][14][15][16]. This has led to discussions of pleiotropic effects across tissues 17 and how additional transcriptomic complexity may influence our ability to estimate expression-mediated heritability 14.
Recently, Finucane et al. (2018) 2 used Genotype-Tissue Expression (GTEx) 18 RNA-seq data to identify trait-relevant tissues and cell types across 48 traits and diseases through the application of stratified linkage disequilibrium score regression (S-LDSC) 19. They identified sets of genes specifically expressed in tissues or brain regions, within which SNPs contribute significantly to heritable variation in complex traits.
As is standard practice, they utilized a 100 kb physical window within which all SNPs were annotated to genes. Previous studies suggest that while cis-acting factors influence tissue-specific gene expression 18,20, causal genes are frequently further than 100 kb away from associated SNPs 21, and the addition of functional information (e.g., expression quantitative trait loci) can improve causal inference in genomic annotation 12. Together, these studies imply that incorporating variants estimated to influence tissue-specific expression, in addition to specifying a larger genomic window, may capture additional transcriptomic information, resulting in heightened model specificity and improved estimates of the contribution of tissue-specific expression to the heritability of complex traits.
There are several approaches to estimate SNP effects on tissue-specific gene expression 4,14,15,18, and it is therefore possible to annotate variants to genes based on their predicted functional impact on expression, forgoing the assumption that the physically nearest genes are causal. Here, we sought to examine whether the annotation of SNPs to genes based on their estimated effects on expression 1 would identify novel expression-trait estimates of tissue-specific enrichment while leveraging the largest publicly available genome-wide association study (GWAS) summary statistics to date. We hypothesized that the expression-based annotation would improve the ability to identify trait-relevant tissues by increasing gene-SNP annotation specificity while simultaneously incorporating a larger genomic window. We also updated estimates of partitioned cis-expression SNP heritability ($h^2_{SNP}$) of specifically expressed genes using the most recent available GWAS across 31 complex traits.

Overview of Cell-Type-Specific Stratified LDSC and Specifically Expressed Genes
Cell-type-specific S-LDSC 2,19 models the combined heritable contribution of a SNP and those with which it is in LD, such that $E[\chi^2_i] = N \sum_k \tau_k \, l(i,k) + Na + 1$, where $E[\chi^2_i]$ is the expectation of the association test statistic for SNP i, N is the sample size of the GWAS, a is a measure of confounding bias (e.g., population stratification), and l(i,k) is SNP i's LD score for a functional category k. The regression coefficient ($\tau_k$) for the kth functional annotation estimates the contribution of that category's specific expression to enrichment conditional on all other annotations. More specifically, when $\tau_k = 0$ there is no enrichment, when $\tau_k < 0$ there is a decrease in the per-SNP heritability while accounting for other annotations, and when $\tau_k > 0$ there is an increase in the per-SNP heritability while accounting for other annotations. The 53 baseline functional annotations incorporate genetic information not specific to cell type from gene structure (e.g., promoter, super enhancer, intronic, exonic) and methylation patterns (e.g., histone marks, chromatin structure) to increase the accuracy of estimated enrichment 19.
Sets of genes specifically expressed in individual tissues and brain regions using GTEx v6p expression data were identified by Finucane et al. 2, which we downloaded from https://alkesgroup.broadinstitute.org/LDSCORE/ and used in all subsequent analyses. The baseline annotation strategy is based on physical proximity of SNPs to genes of interest, annotating all SNPs within ±100 kb of each gene of interest (Fig. 1).
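As a worked illustration of the S-LDSC model described above, the following minimal sketch evaluates the expected test statistic for a single SNP from its per-category LD scores. The function name and all numeric values are hypothetical and chosen only for illustration; this is not the LDSC software itself, which estimates the coefficients by regression rather than taking them as inputs.

```python
def expected_chi2(ld_scores, tau, n, a):
    """Expected chi-square for one SNP: N * sum_k tau_k * l(i, k) + N * a + 1.

    ld_scores: LD scores l(i, k) of the SNP, one per functional category k
    tau:       per-category regression coefficients tau_k
    n:         GWAS sample size N
    a:         confounding term (e.g., population stratification)
    """
    if len(ld_scores) != len(tau):
        raise ValueError("ld_scores and tau must have one entry per category")
    return n * sum(t * l for t, l in zip(tau, ld_scores)) + n * a + 1.0


# Hypothetical SNP with two categories (e.g., a baseline category and a
# tissue-specific gene-set category). Values are illustrative assumptions.
ld_scores = [120.0, 15.0]
tau = [1e-8, 5e-8]
print(expected_chi2(ld_scores, tau, n=100_000, a=0.0))  # roughly 1.195
```

With all coefficients set to zero and no confounding, the expectation reduces to 1, the null expectation of a one-degree-of-freedom chi-square statistic.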
We applied, following Finucane et al. 2 , two series of gene sets: 1) genes with unique expression within a category of similar tissues, as compared to the expression of other tissues (e.g., cortex vs. all non-brain tissues), and 2) genes with unique expression within a given brain region, as compared to the expression of all other available brain regions. For simplicity, we refer to gene sets from 1) as multi-tissue and 2) as within-brain. Refer to Finucane et al. 2 for a complete description of the methods employed to identify these sets of genes.
We accessed available GWAS summary statistics for 31 phenotypes 22-41 (Supp. Table 1) and performed S-LDSC using both the physical and expression-based annotation approaches (described below). Due to the limited samples of publicly released GWAS summary statistics, only data from individuals of European descent were included in this study.

Expression-Based Annotation
We identified SNPs with evidence of expression effects in GTEx v7 data in each of 48 tissues based on pre-computed gene expression weights estimated in Functional Summary-based Imputation (FUSION) 1 and available at http://gusevlab.org/projects/fusion/. FUSION applies four statistical models (best linear unbiased prediction (BLUP), top1, lasso regression, and elastic net regression), each with distinct assumptions of polygenicity, to estimate tissue-specific gene expression weights ±500 kb from the transcription start site of all genes for which there was an available expression weight. For each gene, FUSION then selects the highest performing model based on the model's $R^2$ and its corresponding p-value. Briefly, BLUP includes all non-zero effect SNPs, top1 incorporates the single largest effect SNP, lasso regression generates a large effect SNP sparse model, and elastic net regression tests across a spectrum of SNP inclusion ranging from BLUP to top1. Thus, for each gene, a set of SNPs was identified that had non-zero cis-expression effects within 500 kb upstream and downstream of the transcription start site, and within specific tissues or brain regions.
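The per-gene selection step described above can be sketched as follows. This is a simplified illustration, not FUSION's own code: it assumes the weights have already been loaded into plain dictionaries, it ranks models by $R^2$ only (the p-value is ignored here), and the model names, SNP identifiers, and values are hypothetical.

```python
def best_model(model_r2):
    """Return the name of the model with the highest R^2 (simplified rule)."""
    return max(model_r2, key=model_r2.get)


def nonzero_snps(weights):
    """Return the sorted SNP ids whose expression weight is non-zero."""
    return sorted(snp for snp, w in weights.items() if w != 0.0)


# Hypothetical values for a single gene in a single tissue.
model_r2 = {"blup": 0.08, "top1": 0.05, "lasso": 0.11, "enet": 0.10}
model_weights = {
    "blup": {"rs1": 0.010, "rs2": -0.020, "rs3": 0.005},
    "top1": {"rs1": 0.0, "rs2": -0.30, "rs3": 0.0},
    "lasso": {"rs1": 0.0, "rs2": -0.25, "rs3": 0.04},
    "enet": {"rs1": 0.002, "rs2": -0.22, "rs3": 0.03},
}

chosen = best_model(model_r2)
print(chosen, nonzero_snps(model_weights[chosen]))
```

The number of SNPs carried forward therefore depends on which model is selected: a top1 model contributes a single SNP, whereas a BLUP model contributes every SNP in the window.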
We annotated these putatively expression-influencing SNPs to the specifically expressed sets of genes identified by Finucane et al. 2 (Fig. 1) and implemented gene-set enrichment analyses in S-LDSC to test for trait-specific tissue relevance (see below). As LDSC performs poorly when annotations have too few SNPs 16, all gene expression weights for a given tissue, regardless of their estimated cis-genetic expression $h^2_{SNP}$, were included to ensure that a sufficient number of SNPs were incorporated to control for type 1 error using S-LDSC, as well as to ensure the highest degree of overlap with the Finucane et al. 2 sets of genes (Supp. Table 2). We then directly compared our expression-based annotation approach to the standard physical proximity gene-SNP annotation using estimates published by Finucane et al. 2 (see below).
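The contrast between the two annotation strategies can be made concrete with a small sketch. It assumes simplified inputs (gene coordinates, SNP positions, and a precomputed mapping from gene name to SNPs with non-zero expression weights); the class, function names, and coordinates are hypothetical, and real analyses would write these sets out as S-LDSC annotation files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int
    end: int


def proximity_annotation(snp_positions, genes, window=100_000):
    """SNP ids lying within `window` bp of any gene in the gene set."""
    annotated = set()
    for snp, (chrom, pos) in snp_positions.items():
        for g in genes:
            if chrom == g.chrom and g.start - window <= pos <= g.end + window:
                annotated.add(snp)
                break
    return annotated


def expression_annotation(nonzero_weight_snps, genes):
    """Union of SNPs with non-zero expression weights for genes in the set."""
    annotated = set()
    for g in genes:
        annotated |= set(nonzero_weight_snps.get(g.name, ()))
    return annotated


# Hypothetical gene set with one gene and three SNPs.
gene_set = [Gene("G1", "chr1", 1_000_000, 1_050_000)]
snp_positions = {
    "rs1": ("chr1", 950_000),    # inside the 100 kb window
    "rs2": ("chr1", 1_400_000),  # outside 100 kb, inside 500 kb of the gene
    "rs3": ("chr2", 1_000_000),  # different chromosome
}
nonzero_weight_snps = {"G1": ["rs2"]}

print(proximity_annotation(snp_positions, gene_set))
print(expression_annotation(nonzero_weight_snps, gene_set))
```

In this toy case the two strategies annotate disjoint SNPs to the same gene, which is the situation the larger expression-based window was intended to capture.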
For each tissue, specifically expressed sets of genes were incorporated as a 54th functional annotation category alongside the 53 baseline functional categories 19 plus an additional annotation category that consisted of either all genes examined in the GTEx v6p gene expression dataset from which the specifically expressed gene-sets were derived, as specified in Finucane et al. 2 (when using physical proximity annotation), or all genes for which there were available expression weights for the expression-based annotation. To correct for multiple tests, a false discovery rate (FDR) < 5% was applied across multi-tissue S-LDSC analyses, as in Finucane et al. 2.
For traits with multiple implicated brain regions based on the cross-tissue analyses, a within-brain analysis was conducted to control for overlap between brain expression gene sets implemented in the multi-tissue analyses, as described above and similar to Finucane et al. 2. If no brain regions were identified as significant for a given trait in the multi-tissue analyses, that trait was excluded from consideration in the within-brain analyses. All background LD model functional annotations remained the same between the multi-tissue and within-brain analyses. Once again, to correct for multiple tests, an FDR < 5% was applied across within-brain S-LDSC analyses.
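For readers unfamiliar with FDR control, the thresholding step can be sketched as below. The text does not state which FDR procedure was used, so the Benjamini-Hochberg step-up procedure here is an assumption, and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Return one boolean per test: True if it passes BH at level `fdr`."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    # Find the largest rank whose p-value is at or below rank * fdr / m.
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * fdr / m:
            max_rank = rank
    significant = [False] * m
    # Every test ranked at or below that point is declared significant.
    for idx in order[:max_rank]:
        significant[idx] = True
    return significant


# Hypothetical enrichment p-values for four tissues in one trait.
print(benjamini_hochberg([0.001, 0.04, 0.03, 0.20]))
```

Because the threshold scales with the number of tests, the same nominal p-value can pass in a small within-brain analysis and fail in a larger multi-tissue analysis.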
Testing specificity of LDSC coefficients between expression- and proximity-based SNP annotations
To assess whether the expression-based annotation procedure led to increased specificity of heritable contribution relative to physical proximity-based annotation, we performed permutation tests. We used data from schizophrenia 40 (Fig. 2) as a baseline phenotype to compare estimated enrichment when applying expression-based annotation against a set of randomly chosen SNPs annotated within an equivalently sized genomic window (1 Mb). For each gene and each permutation, the number of randomly chosen SNPs annotated was equal to the number of SNPs with non-zero expression weights in the best performing expression prediction model. For each permutation, linkage disequilibrium scores were re-calculated and LDSC regression coefficients estimated. We performed 1000 permutations to generate a normal distribution of S-LDSC coefficients for a single tissue, representing a null distribution to test whether the specific choice of SNPs used, namely, those with evidence of expression effects, provides more information than a random set of SNPs of equal number. We compared our expression-based LDSC regression coefficients to this null distribution using a one-tailed significance test (α = 0.05). We performed this permutation procedure for three separate scenarios, applied to the 2014 schizophrenia GWAS summary statistics: 1) nonsignificant enrichment when using the expression-based annotation for the within-brain frontal cortex gene set, 2) significant enrichment when using the expression-based annotation for the within-brain cerebellum gene set, and 3) significant enrichment when using both the expression-based and physical proximity annotation for the multi-tissue cerebellum gene set. We chose these scenarios as they represent a range of results and possible outcomes.
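The logic of the permutation procedure above can be sketched in a few lines. This illustration covers only the random SNP draw and the comparison against the null; recomputing LD scores and rerunning S-LDSC for each permutation is not shown. The empirical p-value with a +1 correction is an assumption made for this sketch, and all names and values are hypothetical.

```python
import random


def sample_random_snps(window_snps, n_expression_snps, rng):
    """Draw as many SNPs from the gene's window as the expression model used."""
    return rng.sample(window_snps, n_expression_snps)


def one_tailed_p(observed, null_values):
    """Share of null coefficients at least as large as the observed one."""
    exceed = sum(1 for v in null_values if v >= observed)
    return (exceed + 1) / (len(null_values) + 1)


rng = random.Random(0)

# Hypothetical gene with 200 SNPs in its 1 Mb window and 12 non-zero weights.
window_snps = [f"rs{i}" for i in range(200)]
random_set = sample_random_snps(window_snps, 12, rng)

# Stand-in null: in practice each value would be an S-LDSC coefficient
# estimated from one permuted annotation.
null_coefficients = [rng.gauss(0.0, 1.0) for _ in range(1000)]
p_value = one_tailed_p(0.4, null_coefficients)
print(len(random_set), p_value, p_value < 0.05)
```

Matching the number of random SNPs to the number of expression-weighted SNPs per gene keeps annotation size constant, so any difference from the null reflects which SNPs were chosen rather than how many.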

Results
To ensure consistency between our proximity-based annotation results and prior work, we first replicated the Finucane et al. multi-tissue schizophrenia S-LDSC analyses (correlation of the total number of SNPs included for each tissue, r = 1, Supp. Table 3). This small difference is most likely due to minor variability in reported GWAS summary statistics over time.
Next, we sought to examine differences in S-LDSC coefficients across prior and newly reported GWAS for the same phenotypes. While some traits were highly consistent, such as height (LDSC regression coefficient estimates based on the new vs. prior GWAS summary statistics r = 0.9968, Supp. Figure 1), comparisons of new vs. prior GWAS for other traits, such as Alzheimer's disease (r = 0.3265, Supp. Figure 2), strongly differed (Supp. Table 4).
For tissue enrichment analyses, at least 100,000 SNPs were mapped to 46 of the 48 tissues when using an expression-based annotation (Supp. Table 2), with pancreas and whole blood falling below this threshold (83,239 and 87,720 SNPs, respectively), suggesting that for these 46 tissues, S-LDSC regression coefficients are likely well controlled for type 1 error. We found the expression-based annotation resulted in fewer identified tissues or brain regions that contribute significantly to $h^2_{SNP}$ in complex traits when compared to the physical proximity-based annotation. Across the multi-tissue analyses, of the 31 phenotypes examined, 18 had at least one significant tissue when employing a physical proximity annotation, whereas only seven phenotypes had at least one significant tissue when using an expression-based annotation (FDR < 5%, Supp. Figures 3-20, Supp. Table 5). All tissue and trait combinations with significant expression-based annotation enrichments were also identified using the physical proximity annotation, with the single exception of ovary tissue in Tourette syndrome (Supp. Figure 19). Of the phenotypes examined, schizophrenia (both sets of published GWAS summary statistics), educational attainment, and intelligence identified all 13 brain regions as significant, representing the maximum number of individual tissues identified for any trait.
Within-brain analyses were conducted for the 16 phenotypes with significant contribution of at least one brain region identified in the multi-tissue analyses. We identified significant contributions of specific brain regions in nine and four phenotypes when using a physical proximity and expression-based annotation, respectively (FDR < 5%, Supp. Figures 21-30, Supp. Table 6). All significant expression-based annotations overlapped with a significant physical proximity annotation, with three exceptions: cortex in major depressive disorder (Supp. Figure 27) and cerebellum in the two schizophrenia datasets 40,42 (Supp. Figures 28 and 29).
Permutation tests, to assess whether the expression-based annotation procedure led to increased specificity of heritable contribution relative to physical proximity-based annotation, suggested that SNP-gene annotations based on SNP expression effects do not differ from randomly chosen SNPs within the same regions. In all three instances examined, the strength of the regression coefficient using expression-based annotation was no different than when annotating SNPs to genes at random (all p > 0.32, Supp. Figures 31-33, Supp. Table 7).
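The comparison of coefficients between prior and new GWAS amounts to a correlation across tissues. The sketch below assumes a Pearson correlation, which the text does not specify, and uses hypothetical coefficient values rather than any reported results.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of coefficients."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-tissue S-LDSC coefficients from a prior and a new GWAS
# of the same trait (illustrative values only).
prior = [1.2e-8, 0.4e-8, -0.1e-8, 2.0e-8]
new = [1.1e-8, 0.5e-8, 0.0, 2.2e-8]
print(pearson_r(prior, new))
```

A value near 1 indicates that tissues rank similarly across the two GWAS, while a low value, as reported above for Alzheimer's disease, indicates that tissue-level estimates shifted substantially between releases.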

Discussion
We tested whether expression-based annotation of SNPs to genes with tissue-specific gene expression influences the specificity of tissue enrichment estimates utilizing S-LDSC and updated estimates of partitioned $h^2_{SNP}$ across 31 different complex traits. We found little evidence that annotating SNPs to genes based on evidence of expression positively impacts S-LDSC estimates. In both the multi-tissue and within-brain analyses, the physical proximity annotation resulted in more instances of significant tissue enrichment and larger S-LDSC regression coefficients, as compared to the expression-based annotation. There were only four occurrences in which the expression-based annotation identified significant tissue enrichments not found when employing the physical proximity annotation: 1) ovary tissue for Tourette syndrome in multi-tissue analyses, 2) brain cortex for major depressive disorder in within-brain analyses, 3) cerebellum for schizophrenia (Daner) 40 in within-brain analyses, and 4) cerebellum for schizophrenia (Clozuk) 42 in within-brain analyses.
Permutations suggest that the expression-based annotation did not outperform an annotation strategy in which an equal number of randomly chosen SNPs were annotated to genes within the same genomic window (Supp. [...] to also increase, as well as the accuracy of their expression prediction models 1. As such, we predict that as these resources continue to grow, expression-based annotation may result in higher specificity of identified relevant tissues and better estimates of partitioned $h^2_{SNP}$. Therefore, we suggest reexamination of the expression-based annotation in estimating tissue-specific enrichment when larger expression reference panel sample sizes become available.
In addition to our comparison of annotation strategies, here we also report new and updated S-LDSC tissue enrichments across a wide variety of traits (Supp. Tables 5 and 6). For example, we report significant heritability enrichment of the frontal cortex for cognitive performance, educational attainment, and intelligence while controlling for expression in other brain regions, corroborating evidence of this structure's importance in higher cognitive function 43.
Identification of heritable contributions to complex traits remains pertinent as we transition from macro-level estimates of heritability to a finer scale of tissue- and cell-type relevancy. Here, we have attempted to build upon current gene-SNP annotation strategies through the incorporation of estimated effects on gene expression, while simultaneously providing updated and new tissue-specific enrichment estimates across 31 complex traits. While our expression-based annotation did not improve our ability to identify trait-relevant tissues, we suggest further examination of this approach as tissue- and cell-type-specific transcriptome reference panels continue to grow.

Declaration of Interests
The authors declare no conflict of interest.

Figure 1
Overview of S-LDSC physical proximity and expression-based annotation methods used to compare estimates of LDSC enrichment coefficients across tissues.

Figure 2
Overview of permutation procedure to test whether a SNP annotation based on putative expression effects provides stronger evidence of cell type or tissue enrichment than random sets of nearby SNPs.