A flow diagram that describes the data and process flow of the Effector Index is provided in Fig. S9.
Selecting positive control genes
Positive control genes were defined via two approaches: (1) clinician scientists manually inspected the Human Disease Ontology database(Schriml et al. 2019) for relevant ontological terms (Table S1), and the associated OMIM linkage information was used to obtain a list of genes associated with these disease (Table S2); and (2) clinician scientists identified guideline-approved medications from UpToDate (Table S3) and this information was linked to Drugbank(Wishart et al. 2018) to obtain a list of drug targets for drugs with a known mechanism of action (Table S4). The above procedure was performed for all traits, except for eBMD, for which we extracted the positive control gene list from Supplementary Table 12 of Morris et al. 2018 (Morris et al. 2018). To assess the performance of the prediction models, we also used data from the National Institutes of Health T2D Accelerated Medicines Program, a collaboration between industry and academia to identify causal genes for disease (Mahajan and McCarthy 2019). We included the original T2D genes described identified as monogenic causes or drug targets for T2D as above and included additional genes labeled to be “causal”. Such additional genes met any of the following criteria: 1) Exome array evidence with a cumulative posterior probability of association (PPA) of ≥80%;(Mahajan et al. 2018b) 2) Burden test evidence from WES, using the best gene-level p-value from the extreme p-value aggregation test, or weighted aggregation test performed in an exome sequencing analysis of over 49,000 individuals (Flannick et al. 2019) and had a P-value 5x10-6; 3) Strongly associated coding variants reported in the literature (Mahajan et al. 2018a). We removed all genes that were labeled as “causal” due to evidence only from GWASs, as including such genes would bias our algorithm’s performance away from the null.
Obtaining GWAS summary statistics and defining associated loci
GWAS summary statistics were obtained from a combination of publicly available resources and though the GWAS of UK Biobank traits in this study (Table 2). These diseases and traits were T2D, LDL levels, red blood cell count, diastolic blood pressure, systolic blood pressure, triglyceride levels, estimated bone mineral density, glucose levels, calcium levels, direct bilirubin levels, height and hypothyroidism. These traits were selected because they represent a broad spectrum of allelic architectures, ranging from oligogenic to highly polygenic as evidenced by the variable number of loci identified using the method described below (Table 2).
For the GWAS of traits generated in this study we used the White-British subset of individuals in UK Biobank (N=440,346); an analysis that was performed in a previous study that consisted of the projection of UK Biobank individuals to 1000 Genomes, followed by a cluster analysis to identify a subset of individuals of relatively homogenous ancestry (Morris et al. 2018).
For the GWAS of bilirubin we natural log transformed the direct bilirubin measurement (UK Biobank data-field 30660) and retained measurements within 3 standard deviations of the mean. For the GWAS of calcium (UK Biobank data-field 30680) we retained unadjusted measurements within 3 standard deviations of the mean. For the GWAS of glucose (UK Biobank data-field 30740) we retained unadjusted measurements within 2.5 standard deviations of the mean. For the GWAS of low-density lipoprotein (UK Biobank data-field 30780) and triglycerides (UK Biobank data-field 30870) we adjusted the measurements for participants taking medications as follows. Participants that self-report taking relevant medications were obtained from UK Biobank data-field 20003 as well as data-fields 6153 and 6177. In reference to previous randomized control trials (Jones et al. 1998; 2005) and meta-analyses (LaRosa et al. 1999; Law et al. 2003; Pandor et al. 2009; Boekholdt et al. 2012) on the percentage effect of each commonly used medication, we have generated adjustment factors listed in Table S13. To estimate the LDL or TG level without medication’s influence, an individual’s measured level was divided by (1 - adjustment factor). Those who did not specify the type of cholesterol-lowering medication were assumed to be on statin, and their adjustment factor was defined as the weighted mean percentage effect of the five statins by their prevalence in UK Biobank population. Further details of the adjustments are provided in Table S13.
The GWAS for the 5 traits was performed using fastGWA (Jiang et al. 2019) on SNVs with minor allele frequency greater than 0.0001 and information score (INFO) greater than 0.8. Age, sex, assessment center, genotype array, and first 10 principal components of ancestry (PCs) of ancestry (derived from the principal component analysis in the white British cohort in UK Biobank) were used as additional covariates.
GWAS summary statistics across the 12 traits were harmonized by retaining only SNVs with minor allele frequency below 0.005 that were also present in UK Biobank (matched by SNV alleles and genomic position to GRCh37). The harmonized SNV count for each trait is listed in Table 2.
Defining GWAS loci for Bayesian fine-mapping
GWAS loci were defined using a two-step procedure of LD clumping followed by merging of adjacent signals. First, we determined a set of lead SNVs by LD clumping genome-wide significant SNVs using PLINK 1.9 (p < 5 x 10-8, r2 < 0.01, distance 250 kilobases) to a reference panel of 50,000 randomly selected white British individuals (N=409,703) as determined by Bycroft et al. (2018) (Bycroft et al. 2018). Second, lead SNVs that were within 50 Kbp of each other were merged using bedtools ‘merge’. The resulting genomic regions consisting of one or more lead SNVs were padded with an additional 250 Kbp, resulting in a set of loci that are at least 500 Kbp in size, but some loci may be larger due to multiple adjacent lead SNVs. Loci that overlap the major-histocompatibility complex locus were excluded (chr6:28477797-33448354 on GRCh37).
Finding causal SNVs using Bayesian fine-mapping
We used Bayesian fine-mapping as implemented by FINEMAP version 1.3.1 to find a set of putatively causal SNVs at each locus. This program uses a shotgun stochastic search algorithm to efficiently search for possible causal configurations of up to k SNVs at a locus. We used this program to find causal SNVs at a locus (up to a maximum of k=20) by providing as input the GWAS summary statistics at the locus, as well as the same reference panel of 50,000 individuals from UK Biobank that was used for defining the GWAS loci in the previous section. As recommended by Benner et al. (2017)(Benner et al. 2017), the population of this reference panel matches what we used for the UK Biobank GWAS or is similar to the publicly available GWAS included in this study.
To do this, we created a simple shell script that extracted individual level genotype data from UK Biobank using bgenix and from this calculate the genotype correlation matrix using LDStore (Benner et al. 2017). GWAS summary statistics were formatted for input into FINEMAP and the prior standard deviation of effect size was determined for case-control studies as defined by Benner et al. (2016) (Table S14) (Benner et al. 2016). Loci that report a sum of posterior probabilities of < 0.95 for the causal configurations 19 SNVs or less were deemed to have failed convergence and were discarded.
The GWAS loci defined for Bayesian fine-mapping may overlap due to lead SNVs being beyond the maximum distance allowed for locus merging (50 Kbp), but the within the padded distance added to each locus of 250 Kbp. As a result, this generated multiple FINEMAP summary statistics for a SNV in a region overlapped by multiple GWAS loci. For these SNVs, we conservatively assigned the FINEMAP summary statistics that report the lowest log10(BF). As a consequence of this harmonization, we were then able to merge overlapping loci used for Bayesian fine-mapping into a single larger locus. SNVs achieving a log10(BF) > 2 were retained for further analyses as this threshold is generally considered to be strong evidence for causality (Johnson 2013; Benjamin et al. 2018), and we have previously shown that SNVs at or above this threshold are enriched for missense SNVs and of SNVs at accessible chromatin sites (Morris et al. 2018).
The number of loci, GWAS lead SNVs, and SNVs with log10(BF) > 2 that were retained per trait for subsequent analyses are listed in Table 2 and Table S15.
Source or generation of DNase-seq data
Saos-2 and U2OS cells were maintained in adherent cultures in McCoy’s 5A medium supplemented with 1x Penicillin/Streptomycin and Fetal Bovine Serum (FBS) (15% and 10%, respectively). Saos-2 and U2OS cells were sub-cultured at a ratio of 1:3 and 1:6, respectively, once they reached 80% confluency. DNase digestion was performed as described previously(John et al. 2013) adapted to 200 uL thermocycler tubes. Briefly, nuclei were extracted from cells and incubated with limiting concentrations of the DNA endonuclease DNase I (Sigma) supplemented with Ca2+ and Mg2+ for 3 min at 37 °C. Digestion was stopped by the addition of EDTA, and the samples were treated with proteinase K and RNase A. Short double hit fragments from DNase digestion using magnetic bead polyethylene glycol (PEG) fractionation. Illumina libraries were generated and sequenced on an Illumina NextSeq 500.
DNase-seq datasets from the ENCODE and Roadmap projects were downloaded from http://www.encodeproject.org (Thurman et al. 2012b; Aguet et al. 2017). ATAC-seq data of pancreatic islets was downloaded from SRA SRR8729334 (Greenwald et al. 2019). See also Table S5.
All DNase-seq and ATAC-seq data were processed using a uniform mapping and peak calling pipeline (https://github.com/mauranolab/mapping/tree/master/dnase). Illumina sequencing adapters were trimmed with Trimmomatic (Bolger et al. 2014). Reads were aligned to the human reference genome (GRCh38/hg38) using BWA (Li and Durbin 2009). Hotspots were called using hotspot2 (https://github.com/Altius/hotspot2) with a cutoff of 5% false discovery rate. Hotspots were converted to hg19 reference coordinates using UCSC liftOver.
Saos-2 and U2OS DNase-seq data are available from GEO at accession GSE142160.
Tissue-selective Expression Positive Control Gene Sets
Tissue-selective expression was established by differential expression analysis in a selection of 32 tissues from the GTEx project v7 obtained from the GTEx Portal on 09/27/2017 (Aguet et al. 2017). The top five RNA-seq samples in terms of data quality were selected per tissue, and all pairwise differential expression analyses were performed using DESeq2 (Love et al. 2014). A gene was considered differentially expressed between two tissues if it passed cutoffs for both log2(fold change) > 3 and Bonferroni-adjusted P-value < 0.01. Bonferroni-adjusted P-values were calculated to account for all pairwise comparisons. Gene sets for each tissue were defined as genes differentially expressed in at least 26 (50%) comparisons. See also Table S16.
Processed transcript quantification(Li et al. 2011) of RNA-seq data was downloaded for purified T-cells, B-cells and monocytes (CD4 or CD8, CD14, and CD20) from ENCODE(Djebali et al. 2012; Thurman et al. 2012b; Aguet et al. 2017); http://encodeproject.org, accessions ENCFF269QBU, ENCFF880QDD, ENCFF557VGS, ENCFF495CNV, ENCFF081MXC, ENCFF669GZO, ENCFF049GRB, and ENCFF422SXS). Gene sets were generated as above, except that a differential expression cutoff of log2(fold change) > 2 was used. Gene sets for each tissue were defined as genes differentially expressed in at least 2 (50%) comparisons.
For T2D, additional gene sets for were obtained from RNA-seq of pancreatic islets(Parker et al. 2013) or scRNA-seq of pancreatic endocrine cells (Lawlor et al. 2017).
All gene IDs were mapped to Gencode v24. Gene lists of similar tissues were merged together (Table S5).
Transcript Annotation
For Fig. 2, SNVs affecting missense coding sequence or transcript structure were identified using the ENSEMBL Variant Effect Predictor v92(McLaren et al. 2016) and Gencode v24, using the following tags: splice_donor_variant, splice_acceptor_variant, stop_gained, stop_lost, start_lost, missense_variant, splice_region_variant, incomplete_terminal_codon_variant, stop_retained_variant, coding_sequence_variant, mature_miRNA_variant, 5_prime_UTR_variant, 3_prime_UTR_variant, NMD_transcript_variant.
Hi-C and eQTL data
Published pcHi-C data obtained for a subset of traits (Table S16) (Pan et al. 2018; Jung et al. 2019; Miguel-Escalada et al. 2019). Interactions were removed if either contact region was >30 Kbp in length, or if the distance between the midpoints of interacting contact regions was > 1 Mbp. eQTL data v7 was downloaded from the GTEx Portal on 09/27/2017 (Aguet et al. 2017). For each trait, a set of relevant tissue types was defined and all eQTL-eGene pairs passing a 5% FDR cutoff were used (Table S16).
Variant-to-Gene Enrichment
We assessed a variety of criteria to link associated variants to genes through enrichment of significant SNV sets for positive control genes. The background was defined as protein coding and lincRNA genes in the Gencode v24 basic annotation. Enrichments were defined as the proportion of positive control genes within the targeted set divided by the background proportion of positive control genes. All SNVs within coding regions were removed before gene targeting.
SNV-gene annotation
Putatively causal SNVs from the Bayesian fine-mapping analysis were annotated for potential functional effect using the following methods:
- Predicted functional impact was extracted using SNPEff version 4.3T for build GRCh37.75 with default parameters and databases (snpEffectPredictor, nextProt, pwms, and protein interactions). The predicted impact annotation was then numerically ranked as follows: 1:'HIGH', 2:'MODERATE', 3:'LOW', 4:'MODIFIER', 5:'NONE' (Table S6). Note that the MODIFIER impact prediction includes genomic regions outside the gene body, such as regions 5 Kbp upstream and downstream of the gene (see http://snpeff.sourceforge.net/SnpEff_manual.html).
- We extracted the predicted functional effect of each SNV from version 150 of dbSNP obtained from the UCSC Genome Browser (build GRCh37).
- We identified any overlap of putatively causal SNVs with one or more DHSs from a set of trait-matched cell/tissue types and to the entire set of 160 cell/tissue types collected from ENCODE and other sources (Table S5). See also “Source or generation of DHS data”.
Putatively causal SNVs and their annotations were assigned to protein coding genes using GENCODE v29 to identify genes overlapping each locus by ≥ 50% of their length as follows:
- For each gene at a locus, assign all overlapping putatively causal SNVs.
- For each gene at a locus, assign the putatively causal SNV nearest to its transcription start site.
- Assign a putatively causal SNV to gene if the affected gene reported by SNPEff is a gene at the locus (i.e. protein or transcript-altering SNVs).
- For each gene at a locus, assign the closest putatively causal SNV overlapping a DHS.
Each SNV-gene assignment was additionally annotated with distance from the SNV to the gene TSS, to the gene body, and to the transcription end site (Table S6).
SNV-gene annotation was summarized per gene using average, minima, and maxima to create 140 candidate gene-level features (Table S7). We also generated a set of locus-level features which do not vary across all genes at each locus (Table S7). Uninformative features were removed, such as those with a standard deviation of 0, and correlated features were consolidated.
Building and testing predictive models
Model training was performed at loci containing at least one positive control gene. For each trait or disease (k), we coded gene (j) as a positive control (Yjk) as 1 if the gene was labeled as a positive control gene, or 0 if not. To avoid overfitting, we pruned the matrix so that each gene contributed at most once to model building using the following approach:
- If a gene is a positive control gene for multiple traits, then one of these traits was randomly chosen and retained for analysis (Yjk = 1) while the other entries were dropped.
- If a gene is a positive control gene for some traits, but a negative gene for other traits, then that gene was retained only for the true positive trait (Yjk = 1). One trait was randomly selected from among the positives, if there was more than one.
- If a gene is negative (Yjk = 0) for multiple traits, then one trait was randomly chosen and other traits were dropped.
Training and performance assessment was performed by combining genes across all but one of the traits, and then testing on the trait left out. As each gene can appear a maximum of once, no genes can overlap between the training and testing sets.
We predicted the true causal status of a gene (Y ~ X) by using the gradient boosted trees algorithm, as implemented in the XGBoost package in R (https://cran.r-project.org/web/packages/xgboost/). The input variables (X) were standardized prior to statistical modeling and the response variable is binary (Y = 1 or 0). Each observation corresponding to a negative gene was down-weighted according to the ratio of the positive genes to negative genes in the training dataset for trait k. There were few positive control genes relative to negative genes, and therefore this weighting ensures that their features contribute substantially to the model. The hyperparameters in XGBoost, such as tree depth, lambda and gamma, were tuned to optimize the cross-validated performance on training data. Performance was measured using the area under both receiver operator curves (AUC-ROC) and precision-recall curves (AUC-PRC) in the test datasets.
eCAVIAR
We used eCAVIAR version 2.2 to prioritize genes at each locus by determining the SNVs that are responsible for both the GWAS and eQTL signals. To ensure a comparable prediction performance to Ei we prepared input data as follows: For each trait we selected eQTL GWAS summary statistics for tissues from GTEx v7 as defined previously (Table S5). We then obtained trait GWAS summary statistics and genes from loci as defined previously for this study (see “Defining GWAS loci for Bayesian fine-mapping”). SNV summary statistics were then harmonized between the trait and eQTL GWAS to include only SNVs matching by NCBI dbSNP identifier and alleles, as well as retaining only SNVs with MAF > 0.01 from GTEx v7. Linkage disequilibrium was computed using the same reference panel described previously for this study (see “Defining GWAS loci for Bayesian fine-mapping”). We then used eCAVIAR with one maximum causal SNV per locus (-c 1) to obtain a listing of SNV-gene pairings and the colocalization posterior probability (CLPP) of the SNV being responsible for the eQTL tissue and trait GWAS signals. We then collated this information across trait-tissue pairings and report the maximal CLPP for a each gene across all tissues for that trait.
Specific ethics approval was not required for this study.