Genetic architecture of smoking: Evaluating rare variant contribution from deep whole-genome sequencing of up to 26,000 individuals

Seon-Kyeong Jang, Luke Evans, Allison Fialkowski, Donna K. Arnett, Diane M. Becker, Joshua C. Bis, John Blangero, Eugene R. Bleecker, Jennifer A Brody, L. Adrienne Cupples, Scott M. Damrauer, Sean P. David, Mariza de Andrade, Tasha E. Fingerlin, Sina A. Gharib, David C Glahn, Jeffrey Haessler, Susan R. Heckbert, John E. Hokanson, Shih-Jen Hwang, Matthew C. Hyman, Renae Judy, Anne E. Justice, Robert C Kaplan, Wonji Kim, Charles Kooperberg, Dan Levy, Ruth J.F. Loos, Ani W. Manichaikul, Mark T. Gladwin, Lisa Warsinger Martin, Mehdi Nouraie, Olle Melander, Deborah A. Meyers, Kari E. North, Elizabeth C. Oelsner, Anna L. Peljto, Michael Preuss, Bruce M Psaty, Dandi Qiao, Daniel J. Rader, Robert M. Reed, Alexander P. Reiner, Stephen S. Rich, Jerome I. Rotter, David A. Schwartz, Aladdin H. Shadyab, Edwin K. Silverman, Nicholas L. Smith, J. Gustav Smith, Albert V. Smith, Weihong Tang, Kent D. Taylor, Ramachandran S. Vasan, Victor R. Gordeuk, Zhe Wang, Kerri L. Wiggins, Lisa R. Yanek, Ivana V. Yang, Kendra A. Young, Kristin L. Young, Yingze Zhang, NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, Dajiang J. Liu, Matthew Keller, Scott Vrieze


Introduction
Characterizing the genetic architecture of complex phenotypes has long been an important goal of genetic epidemiology, with implications for diverse fields including biology, medicine, and psychology.
One aspect of this work involves characterizing the joint distribution of effect sizes and minor allele frequency (MAF), which is shaped by natural selection and population history 1,2 . Genome-wide association studies (GWAS) have discovered tens of thousands of genomic loci associated with complex phenotypes, providing new and basic insights into the genetic architecture of complex phenotypes, including rampant polygenicity 3 . Aggregating across loci typically explains only small fractions of phenotypic variance, fractions much lower than have been obtained in family-based studies (e.g., twins, or siblings, or larger pedigrees) 4 . This difference has been coined the "missing heritability".
Tobacco use is a complex behavioral trait of high public health concern 5 , with demonstrated genetic and environmental (e.g., policy, cultural) influences. Not only does tobacco use influence risk of many diseases and represents a leading causes of global morbidity and mortality, but measures of tobacco use are strong indicators of addiction to nicotine (e.g., number of cigarettes smoked per day is genetically highly correlated (r=.95) with nicotine dependence 6 ) and other commonly used substances. Heritability of smoking behaviors has been estimated at approximately 50% (SE 5%) 7 in twin studies, comparable to many other complex behavioral traits. On the other hand, estimates of tobacco use heritability from GWAS of single nucleotide polymorphisms (SNPs) have routinely found much lower SNP-based heritability (h 2 SNP ) estimates 8,9 . Such analyses to date have been based on common variants (e.g., MAF > 1%) from GWAS of imputed microarrays. In a recent GWAS of tobacco use in up to 1 Similar to results for other complex traits 7,11 , some but far from all of the twin-based heritability of smoking can be attributed to common variants obtained through imputation of microarray genotypes.
There are many possible contributors to missing heritability, including inflated family-based heritability estimates 12 , epistasis 13 , structural variation 11 , and rare variants 14 . Rare variants are one compelling explanation, as one expects negative selection to force strongly deleterious alleles to low frequencies 15 . Current SNP-based heritability estimates are based on a few million common variants.
With imputed variants, the quality of imputation depends on the reference panel used 16,17 , and even the best imputation strategies perform poorly for rare variants (e.g., MAF < 1%) in population samples 18,19 .
With the advent of relatively affordable deep whole genome sequencing (WGS), it is now possible to directly genotype variants of any frequency. While genetic association studies may be underpowered to detect an association between a given single rare variant and a complex trait 20 , we can test the contribution of rare variants in aggregate to phenotypic heritability over and above common variation 18,19 20,21 . A small number of recent WGS studies reported evidence that low-frequency and rare variants contribute to heritability of anthropometric, transcriptomic, and medical phenotypes [22][23][24][25] (but see also 26,27  Genetic principal components and kinship-based mixed models tend to reduce confounding due to stratification for tests of common variants 28,29 . However, it remains unclear the extent to which these techniques will satisfactorily work for rare variant analyses [30][31][32] . For example, rare variants may be more likely to be shared among individuals living in close proximity, possibly confounding heritability when rare genotype sharing coincides with geographically clustered environmental risk factors 33 . Thus, novel approaches are needed to control for population stratification in rare variant analyses [34][35][36] . Here, we used deep WGS from the Trans-Omics for Precision Medicine (TOPMed) initiative to evaluate the genetic architecture of four smoking phenotypes down to minor allele frequencies of 1 in 10,000 (MAF≥0.0001). We also evaluated issues of rare variant-based population structure using a new permutation method developed for rare variant associations 37,38 .

Sample
We considered individuals of European ancestry in TOPMed (freeze 8, mean depth >30) 21 measured for at least one of four smoking phenotypes for inclusion. We determined European ancestry in two steps. First, we identified an initial ancestry-inclusive set by projecting TOPMed genotypes (N=137,977) onto genetic principal (PC) axes from the 1000 Genomes project 8 Figure S5). We then further restricted samples to those whose summed Euclidean distance of PCs 1-4 fell within the 1 interquartile range (IQR) of the European sample (N=38,915) identified in the first step. We additionally created samples using 1.5, 2, 3 IQR and reserved them for sensitivity analysis, described further in that section ( Figure S1-2). After IQR filtering, we only retained unrelated individuals, resulting in following final sample size per phenotype in Smoking cessation (SmkCes) and initiation (SmkInit) are binary variables indicating former versus current smokers and never versus ever smoker, with case defined as current and ever smoker, respectively. A linear mixed-effects regression analysis indicated that these four variables are correlated but not redundant, each measuring distinct aspects of smoking behavior (Supplementary Note, Table S2).
Descriptive statistics for each phenotype across cohorts are presented in Table S3.

Genotypes, LD scores, GRM, and GREML-LDMS-I
Genome-based restricted maximum likelihood (GREML) estimates heritability by comparing phenotypic similarity to observed genetic similarity among distantly related individuals using a linear mixed model 40  For each phenotype, we performed GREML-LDMS-I with the GRMs for above-mentioned six bins and cohort indicator matrix as random effects. The cohort matrix was a N x N matrix indicating whether a given pair of individuals belongs to the same study (1, otherwise 0). For AgeSmk and CigDay, we inverse-rank normalized residuals of these phenotypes after regressing out age, age 2 , sex and their twoway interaction terms [42][43][44] , and entered 11 PCs and sequencing center as fixed effects ( Figure S3). We used PCs released by TOPMed consortium, which were calculated by pcair function in GENESIS package in R using 638,486 SNPs with |LD| < 0.32 and MAF > 0.01. SmkCes and SmkInit are binary phenotypes, thus no transformation was applied. Age, age 2 , sex and their interaction terms, 11 PCs and sequencing center were entered as fixed effects for binary phenotypes. We used GCTAv1.92 for both construction of GRM and GREML-LDMS-I analysis. We allowed the estimates to be negative to obtain unbiased estimates of heritability and standard error. Heritabilities for binary phenotypes were analyzed under a liability threshold model 45 46 .

Partitioning rare variants heritability
To further interrogate sources of rare variant heritability, we divided variants in the lowfrequency and rare variants bins into protein altering versus non-protein altering variants bins 22 .
Functional impact of variants was assessed by snpEff 4.3 annotation with "HIGH" and "MODERATE" categorized as protein altering while "LOW" and "MODIFIER" categorized as non-protein altering 47 .
Variance components were then estimated for a total of eight bins including four low-frequency/rare variants and the four common variant bins.

Sensitivity analyses
To evaluate the robustness of our results, we tested the effect of different decisions with respect to filtering on ancestry, relatedness, and phenotype transformation. First, we created three additional samples with different levels of ancestral variation by gradually relaxing the sample inclusion thresholds based on ancestry PCs. Specifically, we created three samples whose PC-based Euclidean distance lay within 1.5 IQR, 2 IQR, and 3 IQR of the European sample (N=38,915). The greater the IQR threshold, the more ancestral variation is present in the resulting sample, the larger the sample, and the greater the chance of observing effects of population stratification. Finally, we evaluated yet another alternative ancestry-based filter for comparison with our main result. We selected European samples whose PCs were within 6 standard deviations (SD) of the mean of PC1 and PC2 of 1000G CEU sample (Utah residents with Northern and Western European ancestry), an approach used previously in a past study of height 22 . This resulted in similar sample size with that from main analysis (Table S1). More details for the sensitivity analysis procedure is in Supplementary Note.
We also evaluated two relatedness thresholds: π < .05 and < .025, which correspond approximately to being related less than first and second cousin, respectively. In addition, we evaluated sensitivity to phenotype transformation methods for quantitative phenotypes (AgeSmk and CigDay). For AgeSmk, we compared rank-based inverse normal transformation (IVRT), log transformation, and log transformation after removing outliers defined as observations lying beyond 3SD from the mean. For CigDay, we compared IVRT and log transformation only, given that CigDay is binned to five groups and thus had no outliers.

Permutation to further control effects of population stratification
Past research has indicated that rare variants can show different patterns of confounding than common variants 34 . To deal with this possibility, we applied a recently developed permutation method designed to control for type I error in genetic association tests of rare variants 23,37,38 . Specifically, we created an N × N distance matrix populated by scaled Euclidian distance of PC1-11 between each individual. Then, we randomly exchanged genotypes of a given individual with one of their 100 nearest neighbors 23,38,48 . We created a total of 200 replicates of permuted genotypes and applied GREML with the same set of fixed and random covariates used in the main analysis. Mean (ĥ 2 null ) and SD of heritability estimates from 200 permutation replicates were calculated for each bin and were tested against zero using a one-sided Z-test. Mean heritability greater than zero across replicates would indicate significant bias induced by population stratification. We also tested the estimates from main analysis against the permuted null distribution using two-sided Z-test (Supplementary Note).

Pedigree-based heritability
Missing heritability is often quantified as the difference between GREML results and biometric variance decompositions based on families (e.g., twins). Indeed, GREML as described thus far was applied only to distantly related individuals, so as to avoid confounding with non-additive and shared environmental effects. Here, we take advantage of thousands of closely related pairs of participants in TOPMed to estimate the heritability of our four smoking phenotypes in close pedigrees (ĥ 2 ped ). ĥ 2 ped is same as K IBS>t in Zaitlen et al 2013 12 , which is estimated from a single GRM including non-zero values for closely related individuals and zero values for distantly related individuals. This provides an upper bound on the narrow-sense heritability, to which we can compare our GREML estimates to quantify any of the remaining missing heritability. This quantity is analogous to twin heritability, but unlike traditional pedigree or twin estimates, ĥ 2 ped uses measured genotypes to estimate relatedness rather than expected relatedness of relatives based solely on their pedigree.
For this analysis, we created a GRM with all available samples after excluding pairs related greater than .80 to exclude identical twins and duplicates. To aid in model identification, we included cohorts that had at least 10 first-degree relatives (π > .375). A list of cohorts and details of the procedure are presented in Table S5 and Supplementary Note. This GRM was fitted together with a cohort matrix and the same set of fixed effect covariates used in the main analysis. To test whether resulting ĥ 2 ped is underestimated due to relatively low level of relatedness structure in the sample, we conducted the same analysis again using only Framingham Heart Study (FHS) which consists of family samples (i.e., high level of relatedness structure) without a random effect of cohort.

Heritability estimates
SNP-based heritability (ĥ 2 WGS ) for the four smoking phenotypes was initially estimated as  Figure 1 and Table S6.
We further partitioned heritability of two rare variant bins into protein-altering and non-altering  Table S7 for full results). Overall, there was very limited evidence for a prominent role of rare protein coding variants in the genetic etiology of these smoking phenotypes.
Permuted mean heritabilities were mostly close to zero across different bins and phenotypes ( Figure 2, Table S8). Only the rarest bin of AgeSmk had permuted mean heritability significantly different from zero with weaker evidence for SmkCes (AgeSmk: Mean=.102, SE=.050, SmkCes: Mean=.043, SE=.041). We also tested heritability estimates from main analysis against the permuted null distribution ( Figure 1, Table S6). The rarest bins of AgeSmk and SmkCes were unlikely to be drawn from the null distribution (p=.002 and p=.012, respectively), indicating that the initial estimates are not entirely accounted for by residual population structure. We adjusted partial impact of residual stratification by subtracting the permuted mean from ĥ 2 WGS . Adjusted ĥ 2 WGS for AgeSmk and SmkCes was 0.212 (SE 0.075) and 0.209 (SE 0.087). Note that the adjusted ĥ 2 WGS is conservative, as permuted heritability may partly capture true rare variant effects among individuals sharing recent ancestors.
Estimates from pedigree analysis (ĥ 2 ped ) are presented in Figure 3 and Table S9. All ĥ 2 ped were greater than our (both adjusted and unadjusted) ĥ 2 WGS , except for SmkCes. When we compare ĥ 2 WGS and ĥ 2 ped , it would appear that the inclusion of rare variants from WGS accounts for much of the missing heritability for smoking phenotypes (60%-100%). Using the Framingham Heart Study (FHS) cohort, which has a large proportion of related individuals, we obtained generally comparable ĥ 2 ped , albeit with lower estimate for AgeSmk and overall greater SEs.

Sensitivity analyses
We explored how sensitive the results were to ancestral filtering thresholds, relatedness cut-off thresholds, and phenotype transformation methods (Figure 4; Figures S1-4; Table S10). The point estimates tended to slightly increase as we applied stricter ancestry-based filtering for AgeSmk and SmkInit. Notably, ĥ 2 WGS was estimated higher when we included extreme AgeSmk observations compared to when excluding outliers or applying rank-based inverse normalization, both of which effectively controlled the influence of outliers. Across all phenotypes, ĥ 2 WGS tended to be higher when using a relatedness cut-off of .05 compared to using .025 with maximum difference in ĥ 2 WGS being .055 for SmkInit in 2IQR sample. Finally, heritability estimates using alternative sample selection strategy based on the 1000 Genomes CEU group were generally comparable to the main ĥ 2 WGS estimates across six bins (Table S10).

Discussion
Using the largest sample of whole genome sequences (N=14,747 -26,340) to date for complex behavioral traits, we found that rare variants with MAF 0.01% to 0.1% accounted for approximately 15% and 10% of phenotypic variation of AgeSmk and SmkCes after correcting for potential influence of population structure at rare variants. These estimates can be seen conservative as our permutated mean might have partially included true rare variant effects. Compared to AgeSmk and SmkCes, contribution of rare variants to CigDay and SmkInit was lower (6% and 7%, respectively), with common variants explaining a greater proportion of the total phenotypic variation (9% and 17%, respectively). After adjustment, the total heritability estimate (ĥ 2 WGS ) was 0.21, 0.15, 0.21, and 0.24 for AgeSmk, CigDay, SmkCes, and SmkInit, respectively which is 1.8 to 4.5 times higher than past heritability estimates based on common variants alone 8 .
A handful of studies have now reported evidence for rare variants contributing to phenotypic variance of anthropometric 22 , medical 17 , and transcriptomic phenotypes 16 . For example, rare variation may explain all of the missing heritability for height and BMI 22 . Our ĥ 2 WGS generally falls below past twin estimates for tobacco use phenotypes (ĥ 2 twin = .48; SE .05) 7 . However, after possibly overcorrecting for population stratification, current ĥ 2 WGS estimates accounted for 61% to 100% of our pedigree-based heritability estimate (ĥ 2 ped ) across the four phenotypes, closing the gap on the missing heritability for these phenotypes. In the present study, we consider our pedigree estimate of heritability as the most relevant benchmark by which to judge the GREMLresults. While twin studiesespecially same-sex twinscontrol perfectly for standard covariates (e.g., age, sex, birth year), family studies, especially of multiple generations, typically incorporate only linear combinations of such predictors. This results in a statistically adjusted phenotype that contains more noise, and proportionally less heritable variance than a twin study. Future studies will benefit from considering how different analytic approaches, such as using twin-only samples versus diverse classes of relatives (e.g., grandparents-grandchild, siblings, half-sibling etc) and using SNP-based versus expected relatedness, influence heritability estimation and potentially capture different aspects of heritable variation 12 .
Published studies suggested that population structure induced by rare variants may not be sufficiently accounted for by PC correction 32,34,49,50 . Similarly, we found that mean permuted heritability substantially departed from zero (M=.10, SE=.05) for the MAF [0.01-0.1%] bin of AgeSmk and to lesser degree, for that of SmkCes (M=.04, SE=.04), indicating the possibility of partial confounding due to residual population structure. We did not find evidence for this confounding for other phenotypes or MAFs. Existence of geographically localized non-genetic risk or systematic measurement bias in different cohorts could lead to rare variant stratification 34 . Without substantial knowledge of causal risk factors that align with rare variant sharing and the availability of such data, it is difficult to directly identify the source of the confounding. Given increasing interest in the role of rare variation in complex disease, research into the nature and method to detect and adjust for rare variants will be crucial. For example, future studies should consider extending the current permutation approach and improve computational efficiency.
The majority of the rare variant heritability (90%-100%) was attributable to non-protein altering regions for AgeSmk and SmkCes. This suggests that most genetic variation is likely to be located outside protein-coding regions, which themselves comprise only ~1% of base pairs in the human genome. For common variants, a majority of the heritability appeared attributable to low-LD variants for all smoking traits. This seems consistent with the action of negative selection, indicating that these variants are relatively young in the genealogical history, and are still being pushed to lower frequencies. Consistent with this idea, Gazal et al 2017 reported that common variants with low LD in low-recombination rate regions had larger per-SNP heritability than those with high LD as they are more likely to be recent and thus have less time to be removed by negative selection 51 .
In sensitivity analyses, most heritability estimates from varying combinations of ancestral variation, relatedness cut-offs, and phenotype transformation typically showed differences of 1%-5% in each sensitivity condition. AgeSmk tended to show higher rare variant heritability when extreme observations were included than they were not. Extreme observations may induce model misspecification or alternatively, are enriched in causal rare variants, possibly leading to higher heritability estimates 23,52,53 .
The influence phenotypic scaling was reported in recent study of Evans et al. 2021 where CigDay showed higher heritability when it was dichotomized versus the ordinal version analyzed here 52 . Estimation of heritability for binned variable (e.g., CigDay in this study) may be improved by different modeling choice including liability threshold model 54 . Finally, relaxing relatedness thresholds tended to be associated with higher estimates, an effect one would expect if shared environment in distantly related individuals were influencing smoking traits (e.g., state-level policy or regional culture) 55 .
Our findings should be interpreted in light of several limitations. First, while our sample size is the largest to date for heritability analysis using WGS, even larger datasets or more precise phenotypic measures are required. Larger studies would provide greater precision in estimation and a more comprehensive assessment of genetic architecture of these complex traits by finer partitioning by MAF and functional annotations. Second, even with the use of deep sequences, we did not fully "recover" trait heritability, either as estimated using available pedigrees, or twin heritability reported in the literature.
There remain many explanations, including ultra-rare variants (MAF <= .0001) and other types of genetic variations (e.g., copy number variations) that may constitute additional sources of heritability. Next, our pedigree-based heritability estimates may be inflated by shared environment, as we were unable to model genetic similarity and environmental similarity separately. Therefore, the pedigree estimates should be interpreted as likely upward biased upper-bounds for SNP-based narrow-sense heritability. Finally, smoking phenotypes were measured by one or two questions and were limited to those commonly collected in biomedical studies like those in TOPMed. This allows accumulations of large sample sizes across multiple independently-collected samples, but also may reflect individual study characteristics, whose variance ultimately can be captured by our cohort random effect.
In conclusion, our results indicate that rare variants in regulatory region contribute substantially to the heritability of smoking phenotypes. Note. Bars are standard errors. The "Rare" bin is the sum of the MAF .1=1% and MAF .01-.1%. "Common" is the sum of the other MAF bins. Total is the sum of Rare and Common. No estimates shown here were adjusted by results from permutation procedure. For adjusted results, please see Figure 2.
Asterisks were added to the components that are significantly different from permuted mean under null distribution (see Table S6). *p < .05, **p < .01, ***p < .001.   Note. X-axis indicates different ancestry filtering thresholds. The shape of the points indicates phenotype transformation methods with "Log", "Log w/o outliers", "Inverse normal", "None" indicating log transformation including outliers, log transformation without outliers, rank-based inverse normal transformation, and no transformation, respectively. Red and blue color each indicate relatedness thresholds .025 and .05. Dots and whiskers each represent heritability estimates and their SEs. Note that here we presented ĥ 2 WGS for AgeSmk (1IQR and π < .025) estimated excluding variants with MAC 3 as these variants fall below the MAF threshold for all other comparison ĥ 2 WGS for AgeSmk. ,108,704 a 1IQR is used for main analysis and the rest of samples was used for sensitivity analysis. 1IQR = 1 * the interquartile range of PCs 1-4, and is the most restrictive choice, 1.5IQR is 1.5 * the interquartile range of PCs 1-4, and so on. CEU 6SD is an alternative way to select samples based on ancestry of the CEU group of 1000 Genomes. b This shows the number of variants per bin in 1IQR unrelated samples (π < .025).